Wednesday 10 July 2019

Reverse Proxying to pgAdmin

Reverse proxying requests to a pgAdmin server is becoming more and more popular, if posts to the mailing lists are any indication, and more often than not pgAdmin is being run in a container (of which there have now been over 10 million pulls)! Typically, users deploy a reverse proxy for one of two reasons: to host multiple applications in different subdirectories under the same domain, or to add SSL/TLS support independently of the application.

Because of the number of questions asked, I spent a little time over the last couple of days doing some testing and updating the documentation with some examples. Here's a blog-ified version of that work.


Nginx

Nginx is winning the battle of the web servers these days, beating out Lighttpd (which is still a fine piece of software) and the ageing and arguably bloated Apache HTTPD. All of these servers support reverse proxying, and whilst I've looked at Nginx, the examples shown below can easily be translated to the other servers if you prefer to run them instead.

In the following examples, we have pgAdmin running in a Docker container (in which it's hosted under Gunicorn). For simplicity, the examples have Nginx running on the host machine, but it can also be easily run in another container, sharing a Docker network with pgAdmin. In such a configuration there is no need to map the pgAdmin container port to the host.
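If you do want to run Nginx in a container alongside pgAdmin, a shared network can be set up something like this (a sketch; the network and container names, and the config file path, are just examples):

docker network create pgadmin_net
docker run --name pgadmin --network pgadmin_net \
    -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
    -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
    -d dpage/pgadmin4
docker run --name nginx --network pgadmin_net -p 80:80 \
    -v /path/to/nginx.conf:/etc/nginx/nginx.conf:ro \
    -d nginx

Nginx could then proxy to http://pgadmin:80/, using the container name as the hostname. The examples that follow use the simpler port-mapped setup.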

The container is launched as shown below. See the documentation for information on other useful environment variables you can set and paths you can map.

The commands below will pull the latest version of the container from the repository, and run it with port 5050 on the host mapped to port 80 on the container. It will set the default username and password to user@domain.com and SuperSecret respectively.

docker pull dpage/pgadmin4
docker run -p 5050:80 \
    -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
    -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
    -d dpage/pgadmin4

A simple configuration to reverse proxy with Nginx to pgAdmin at the root directory looks like this:

server {
    listen 80;
    server_name _;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://localhost:5050/;
        proxy_redirect off;
    }
}

Here we tell Nginx to listen on port 80, and respond to any server name (sent by the client in the Host header). We then specify that all requests under the root directory are proxied back to port 5050 on the local host, and that the Host header is passed along as well. The proxy_redirect option tells the server not to rewrite the Location header.
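As a quick sanity check (a hypothetical smoke test; adjust the hostname to suit your setup), pgAdmin should answer rather than Nginx returning an error:

curl -I http://localhost/
# Expect a response from pgAdmin (typically a redirect to its login page)
# rather than a 502 Bad Gateway from Nginx.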

But what if we want to host pgAdmin under a subdirectory, say /pgadmin4? In this case we need to change the path at the top of the location block and add the X-Script-Name header to the requests made to the pgAdmin container to tell it what subdirectory it's hosted under. This is shown below:

server {
    listen 80;
    server_name _;

    location /pgadmin4/ {
        proxy_set_header X-Script-Name /pgadmin4;
        proxy_set_header Host $host;
        proxy_pass http://localhost:5050/;
        proxy_redirect off;
    }
}

OK, so that's cool but we're talking about super top secret database stuffs here. It needs to be encrypted! Adding SSL/TLS support to the configuration is largely unrelated to pgAdmin itself, except that as with the subdirectory, we need to tell it the URL scheme (http or https) to use. We do this by setting the X-Scheme header. The other changes are to add a redirect from http to https, and to configure SSL/TLS:

server {
    listen 80;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name _;

    ssl_certificate /etc/nginx/server.crt;
    ssl_certificate_key /etc/nginx/server.key;

    ssl_session_cache builtin:1000 shared:SSL:10m;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
    ssl_prefer_server_ciphers on;

    location /pgadmin4/ {
        proxy_set_header X-Script-Name /pgadmin4;
        proxy_set_header X-Scheme $scheme;
        proxy_set_header Host $host;
        proxy_pass http://localhost:5050/;
        proxy_redirect off;
    }
}
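The configuration above assumes a certificate and key are present at /etc/nginx/server.crt and /etc/nginx/server.key. If you just want to test, a self-signed pair can be generated with OpenSSL (the subject CN here is an example; production systems should use a certificate from a real issuer):

openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -subj "/CN=pgadmin.example.com" \
    -keyout /etc/nginx/server.key \
    -out /etc/nginx/server.crt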


Traefik

Traefik is becoming increasingly popular in containerised environments because it's designed to integrate with the orchestration platform in use and to auto-configure itself as much as possible. The examples below show it working with plain Docker, but it will work with Swarm, Compose, Kubernetes and other orchestrators as well. Note that Traefik is designed purely for reverse proxying, routing and load balancing; it's not a general purpose web server.

In the examples below, the following Traefik configuration is used. Because it auto-configures itself, the changes we make to get the desired configuration are to the way the pgAdmin container is deployed, not to the Traefik configuration:

defaultEntryPoints = ["http", "https"]

[entryPoints]
  [entryPoints.http]
    address = ":80"
      [entryPoints.http.redirect]
        entryPoint = "https"
  [entryPoints.https]
    address = ":443"
      [entryPoints.https.tls]

[docker]
domain = "domain_name"
watch = true

With this configuration, Traefik will automatically detect containers as they are launched, and reverse proxy to them using the virtual hostname generated from the container name and the domain in its config file, e.g. <container_name>.<domain>. SSL/TLS is enabled in this setup, with a redirect from plain http to https. The certificate used will be the default one built into Traefik; see the documentation for details on how Let's Encrypt or certificates from other issuers can be used.
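For completeness, Traefik itself (the 1.x series, which matches the configuration syntax shown above) might be launched along these lines, assuming the configuration has been saved as traefik.toml in the current directory:

docker run -p 80:80 -p 443:443 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v $PWD/traefik.toml:/traefik.toml \
    -d traefik:1.7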

To host pgAdmin at the root directory, we simply launch a container with the correct name, and no host to container port mapping:

docker pull dpage/pgadmin4
docker run --name "pgadmin4" \
    -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
    -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
    -d dpage/pgadmin4

With the configuration and commands above, Traefik will host pgAdmin at https://pgadmin4.domain_name/. Of course, the domain name should be changed to a real one, and a suitable CNAME, A or AAAA record should be added to the DNS zone file.

In order to host pgAdmin under a subdirectory, as in the Nginx example, we need to tell both the proxy server and pgAdmin about the subdirectory. We tell pgAdmin by setting the SCRIPT_NAME environment variable, and we tell Traefik by adding a label to the container instance. For example:

docker pull dpage/pgadmin4
docker run --name "pgadmin4" \
    -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
    -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
    -e "SCRIPT_NAME=/pgadmin4" \
    -l "traefik.frontend.rule=PathPrefix:/pgadmin4" \
    -d dpage/pgadmin4


Conclusion

Reverse proxy servers are commonly used to provide an interface between clients and the pgAdmin server. These can be traditional web servers such as Nginx, or purpose-designed reverse proxies such as Traefik. In either case, it's simple to configure SSL/TLS support or to host pgAdmin under a subdirectory.

Sunday 30 June 2019

Indexing Documents for Full Text Search

During a conversation with $BOSS a couple of days ago, the topic of the new Full Text Search in MongoDB came up. He (knowing of, but presumably never having worked with, FTS in PostgreSQL) suggested I might create a tool like the one the MongoDB guys were proudly showing off in their keynote video from their recent conference, to make it easy to index and search HTML documents on disk. I looked at him with a slight tilt of the head and remarked that it's already easy to do that, and it just takes a few lines of Python code or similar.

So, here I am on my flight back to the UK having written 36 lines of code whilst munching on my breakfast and enjoying my free seat upgrade. I should probably note that those 36 lines of code include import statements, comments, blank lines for readability, and a nicely formatted 7 line SQL query that could easily be just one line. Plus, the vast majority of the code is really just boilerplate to read a directory of files and extract their contents.

Anyway, before I could get into that, it was necessary to first create a schema to hold the data we'd need to query. This is basically just a table to hold whatever information you like about each document, with a tsvector column for searching. For my test I decided to make the path to the file the primary key, and to store the title of the HTML document and the text in the body. The body text isn't strictly needed, but it does make it a touch easier to use a trigger to automatically maintain the tsvector data - plus, of course, I could then use that column and get rid of the source documents.

Here's what the table looked like:

CREATE TABLE docs
(
    path text NOT NULL,
    title text NOT NULL,
    body text NOT NULL,
    tsv tsvector NOT NULL,
    PRIMARY KEY (path)
);

CREATE INDEX docs_tsv_idx
    ON docs USING gin (tsv);

I also added the aforementioned trigger to maintain the tsvector. I got slightly fancy with this, indexing both the title and the body fields, and giving them different weights for ranking purposes:

CREATE FUNCTION update_tsv() RETURNS trigger
    LANGUAGE plpgsql VOLATILE NOT LEAKPROOF
AS $BODY$
begin
  new.tsv :=
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.title, '')), 'A') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.body, '')), 'D');
  return new;
end
$BODY$;

CREATE TRIGGER update_tsv
       BEFORE INSERT OR UPDATE ON docs
       FOR EACH ROW EXECUTE PROCEDURE update_tsv();
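A quick way to confirm the trigger is doing its job is to insert a throwaway row and inspect the generated tsvector (purely a sanity check; note the 'A' weighted lexemes coming from the title):

INSERT INTO docs (path, title, body)
    VALUES ('test.html', 'A trigger test', 'Some text about triggers.');

SELECT tsv FROM docs WHERE path = 'test.html';

DELETE FROM docs WHERE path = 'test.html';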

Now that we have somewhere to store our data, we need the code to load it with. This should be run in a Python Virtual Environment that has the psycopg2 and BeautifulSoup4 packages installed.
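If you need to set such an environment up, something like this will do the job (the environment name is just an example, and psycopg2-binary can be substituted if you don't have the build dependencies for psycopg2):

python3 -m venv fts
source fts/bin/activate
pip install psycopg2 beautifulsoup4

With the environment in place, here's the indexer itself: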

from bs4 import BeautifulSoup
import psycopg2
import glob

# Load from the docs/ directory
path = 'docs/*.html'

# Connect to the database and begin a transaction
conn = psycopg2.connect("dbname=fts_blog user=fts_blog password=fts_blog")
cur = conn.cursor()

for file in glob.iglob(path):
    print("Loading: {} ".format(file))

    with open(file) as f:
        html = f.read()

        # Read the HTML and extract the title and body
        soup = BeautifulSoup(html, features="html.parser")
        title = soup.find('title').decode_contents(formatter="html").replace('&nbsp;', ' ')
        body = soup.find('body').decode_contents(formatter="html")

        # Create the SQL statement
        sql = """INSERT INTO docs
            (path, title, body)
        VALUES
            (%s, %s, %s)
        ON CONFLICT ON CONSTRAINT docs_pkey DO UPDATE SET
            title = %s,
            body = %s;"""

        cur.execute(sql, [file, title, body, title, body])

conn.commit()
cur.close()
conn.close()

As you can see, the code is pretty simple. It just loops over all the HTML files found in the docs/ subdirectory, uses BeautifulSoup4 to extract the document title and body, and then runs an INSERT against the database table. For convenience, the query also handles conflicts, to allow easy re-indexing.

To load some data, I created a directory called docs/ alongside the script, and copied the PostgreSQL HTML documentation into the directory. I could then run the indexer:

$ python indexer.py
Loading: docs/sql-alterroutine.html 
Loading: docs/backup-dump.html 
Loading: docs/gist-examples.html 
...
...

Process finished with exit code 0

Finally, I needed to test the search capabilities. That's done through a SQL query of course, for example:

SELECT
    ts_rank("tsv", plainto_tsquery('trigger')) AS "rank",
    path,
    title,
    ts_headline(body,
                 plainto_tsquery('trigger'),
                 'StartSel=*,StopSel=*,MaxFragments=2,' ||
                 'FragmentDelimiter=...,MaxWords=30,MinWords=1') AS "headline"
FROM
    docs
WHERE
    tsv @@ plainto_tsquery('trigger')
ORDER BY rank DESC LIMIT 20;

So here we are selecting the document rank, the path and title of the document, as well as a headline - a small snippet of the document that highlights where the matches were found within it, as you'd probably see on your favourite search engine. That's all coming from the docs table of course, and is restricted by our search query and then sorted by the rank and limited to 20 results.

NOTE: The search term in the query above is 'trigger'. This word is actually included three times in the query text, so make sure you change them all if using the query above as a starting point for your own.
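One way to avoid having to keep those three copies of the search term in sync is to parameterise the query from the calling application. Here's a minimal sketch using psycopg2, reusing the connection details from the indexer above (the output formatting is just an example):

import psycopg2

term = 'trigger'

conn = psycopg2.connect("dbname=fts_blog user=fts_blog password=fts_blog")
cur = conn.cursor()

# The term is passed once, and psycopg2 substitutes it safely in all
# three places it's used in the query.
cur.execute("""SELECT
        ts_rank(tsv, plainto_tsquery(%(term)s)) AS rank,
        path,
        title,
        ts_headline(body,
                    plainto_tsquery(%(term)s),
                    'StartSel=*,StopSel=*,MaxFragments=2,' ||
                    'FragmentDelimiter=...,MaxWords=30,MinWords=1') AS headline
    FROM docs
    WHERE tsv @@ plainto_tsquery(%(term)s)
    ORDER BY rank DESC LIMIT 20;""", {'term': term})

for rank, path, title, headline in cur.fetchall():
    print("{:.4f}  {} ({})\n    {}".format(rank, title, path, headline))

cur.close()
conn.close()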

When we run the query, we get back a ranked list of matching documents, complete with headline snippets, which can of course be used however we like in an application.


So there we have it; a table with an index and a trigger, plus a trivial Python script, and we have a working indexer and search engine. Oh, and in case you were wondering: no, this is not new functionality in PostgreSQL. It was added to the core server in version 8.3, which was released in 2008. Prior to that, there was a contrib module called tsearch2, offering largely equivalent functionality from version 7.4, which was released in 2003. Do try to keep up Mongo 😀

Wednesday 29 May 2019

Avoiding Gmail's confidential mode

So this is one of the very few (maybe the first?) blog entries I've written that aren't directly related to PostgreSQL; however, it does affect how I (and others) may work on the project.

Last night I received an email from Google about my personal G Suite account, which I use for all my day-to-day email, most of which is related to work on pgAdmin and PostgreSQL. Google were proudly announcing the rollout of their new Gmail Confidential Mode update. If you've not come across this yet, essentially what it does is allow users to send emails that can be deleted or expire after a certain amount of time, optionally require SMS verification to open, and prevent printing (but not screen-shots, of course), forwarding, downloading and so on.

When using the Gmail web interface, this all works fairly seamlessly, and I can see why some people would want it if that's all they use. However, like many people, I also use other clients, for example via IMAP. In that case, instead of the original email, Gmail sends a placeholder message containing a link that allows you to log in to Google and view the message online (assuming the SMS verification passes and the message hasn't been deleted or expired, of course). That's going to be quite inconvenient for me; besides which, I really don't want anyone to be able to control access to emails they've sent me after I've received them.

There's another problem affecting PostgreSQL's mailing lists, however. How long will it be until someone sends such a message to one of the PostgreSQL lists, where it will do nothing but clutter up the archives and annoy other users (who won't be able to read the message anyway, as they won't be able to log in to Google as pgsql-hackers@postgresql.org or whatever the list address happens to be)?


Fixing the PostgreSQL mail servers

After some discussion with some of the PostgreSQL sysadmin team, we discovered that Gmail adds a header to the messages that have confidential mode enabled (X-Gm-Locker: <token>). This is easy for us to deal with - Stefan was very quickly able to add a rule to the Exim configuration on our inbound SMTP servers that will immediately reject any messages with this header, sending a message back to the user explaining why their message was rejected, and to resend it with confidential mode disabled.
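I don't have the exact rule to hand, but an ACL entry along these lines in Exim's DATA ACL would do the job (a sketch, not the actual configuration on the PostgreSQL servers):

# In the ACL named by acl_smtp_data; reject anything carrying the
# Gmail Confidential Mode header.
deny message   = This server does not accept messages sent with Gmail \
                 Confidential Mode enabled. Please disable it and resend.
     condition = ${if def:h_X-Gm-Locker: {yes}{no}}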


Fixing my Gmail outbound email

Disabling confidential mode for my own outbound email was pretty easy:
  1. Select the option to "Manage this domain"
  2. Navigate the G Suite admin interface to Apps -> G Suite -> Gmail -> User Settings.
  3. Under Confidential Mode, set the Disable option, and save the settings.
Now, confidential mode will no longer be available in the compose message interface in Gmail. Well, once the settings have propagated around and the UI has refreshed anyway.


Rejecting confidential messages from Gmail

This part is a little more complex, and requires a suitable G Suite account. It will not work with plain Gmail.
  • Select the option to "Manage this domain"
  • Navigate the G Suite admin interface to Apps -> G Suite -> Gmail -> Advanced Settings -> Content compliance
  • Add a policy:
    • Give it a name
    • Select "Inbound" for the email messages to be affected in section 1.
    • Set the option for "If ANY of the following match the message" in section 2, then add a new expression:
      • Set the type to "Metadata match".
      • Select "Gmail confidential mode" under Attribute.
      • Select "Message is in Gmail confidential mode" under Match type.
      • Save the expression.
    • In section 3, select Reject the message as the action and enter a suitable custom message, e.g. "This domain does not accept messages sent with Gmail Confidential Mode enabled. Please disable and resend."
  • Save the settings.
You should now find that email is rejected by Google if confidential mode is enabled. I had my friend Jonathan send me a test message, which was duly rejected with the message:

550 5.7.1 This domain does not accept messages sent with Gmail Confidential Mode enabled. Please disable and resend. - gcdp z20sor188854uan.2 - gsmtp

There. Normality has been restored.