Tuesday, 25 August 2020

Testing pgAdmin

Like any software, pgAdmin has bugs. At the time of writing there are 119 new or in-progress issues in the bug tracker, which is pretty good for a piece of software with so many moving parts, and regular-as-clockwork monthly releases. Over 10% of these issues are currently assigned to the reporter to gather further information, plus there will undoubtedly be some duplicates and support issues/feature requests in that number that haven’t yet been picked up in our weekly backlog refinement meeting and either closed or moved to the feature or support trackers.

Manual Testing

We have one person, Fahar, who is permanently assigned as QA to the pgAdmin team at EDB, though he is also responsible for a number of other product releases such as the PostgreSQL installers for Windows and macOS (pgAdmin is the vast majority of his work however).


Fahar primarily does four things:


  • Works to verify that issues reported to users are reproducible, when they require complex setup or the developers have failed to reproduce them.

  • Verifies that bugs that have been fixed by the developers actually do fix the reported problem. 

  • Smoke tests release builds; that is, ensures the packages can be installed and basic functionality works prior to public release.

  • Tests new features once development is believed to be complete.

Automation

All pgAdmin tests and builds are automated using Jenkins. The system will notify the development team of any failures via email and via a dedicated Slack channel (internal to EDB). There are also large screens mounted on the wall in development offices to show the status of the test and build runs in real time.

Automated Testing

We have three primary automated test suites for pgAdmin, which are essential in allowing us to release quickly and regularly, with minimal issues.

Python Tests

The Python tests (sometimes referred to as the API tests) are intended to test the Python application that implements the pgAdmin server. The test suite will instantiate the server and then make hundreds of calls to the server with different inputs to ensure itr responds as expected. A small number of these tests operate directly on the Python code, and do not run through the API (e.g. for testing server import/export).


One of the Python tests is the RE-SQL (Reverse Engineered SQL) test. This is a special case as it’s a single test which is actually comprised of around 460 sub-tests, designed to test the SQL generation in pgAdmin; each sub-test will create or modify an object, and then validate that the SQL generated by pgAdmin is correct.


The Python tests are run on every check in to the pgAdmin GIT repository. There are nearly 1200 tests (plus the 460 RE-SQL tests), which are run on Windows, macOS and Linux against all supported database servers (PostgreSQL 9.5 through 12, and EDB Postgres Advanced Server 9.5 through 12). This equates to nearly 100,000 Python tests being run on every check in. The exact number isn’t quite that because some tests only run on certain database server versions.

Jasmine Tests

The Jasmine tests are unit tests for the Javascript code used in the frontend user interface of pgAdmin, named after the framework that is used to run them. At the time of writing, there are 866 Javascript unit tests, each of which is run on Windows, macOS and Linux. That means that 2,598 Javascript unit tests are run on every check in.

Feature Tests

Feature tests are “full system” tests in which an instance of pgAdmin is started and loaded into a browser, and then inputs are automatically given to the browser to simulate end-to-end operation of the application, with the displayed output then being verified. The feature tests are run on a nightly basis as they take a long time to execute. They are run using Selenoid using Google Chrome as the browser (we plan to add Firefox in the future), and run against all supported database server versions.


There are only around 15 feature tests in the suite (which actually runs from the Python test framework), however each test may actually verify anything from 1 to 20 or more individual pieces of functionality.

Linting

Various tasks run as part of the automated tests to check the code formatting, including Javascript linting and Python PEP-8 validation.

Automated Builds

Test Builds

Test builds of all packages (Docker, tarballs, docs, RPMs, DEBs, installers etc) are made with every check in to the GIT repository. These are discarded after the build completes as their primary purpose is to verify that all the code builds without error on every platform, and that packages can be successfully generated.

Snapshot Builds

Snapshot builds are created on a nightly basis and made available for testing and use by anyone that is interested in contributing or needs a fix in a hurry.

QA Builds

QA builds are produced on the Monday before one of the monthly releases. They are uploaded to a server from where Fahar will download and smoke test them to ensure there are no obvious show-stopping issues. These builds will also be tested to ensure they are carrying the latest version of certain dependencies, such as the OpenSSL libraries.

Static Code Analysis

EDB maintains a SonarQube server for performing static code analysis of all the product code within the company, including the Open Source pgAdmin code. This allows us to monitor the state of a number of key metrics that can be detected through analysis:


  • Bugs

  • Security vulnerabilities

  • Security hotspots (code that should be reviewed as it’s a possible source of a vulnerability)

  • Code smells (code that looks badly structured or implemented)

  • Code duplications


The current state of the pgAdmin code can be seen below:



Note that the sharp rise on July 19th was the result of a SonarQube upgrade that added/changed a number of the analysis rules.


We’re in the process of working through the different issues, with focus currently on code smells (mostly overly complex functions that need to be refactored) and security hotspots (mostly regexps in the code). Each of the issues reported may or may not involve a code change - at the very least they will be reviewed and marked as “Won’t fix”.

Developer Testing

All the developers are required to test patches before they are submitted for inclusion in pgAdmin. This includes running all relevant existing tests (there’s no point in running Jasmine tests for a change that only touches Python code for example), and linting tests (e.g. the PEP-8 checks). They are also expected to include suitable tests (or updates to tests) to validate changes they’ve made.


Many of the developers also use static code analysis in their development environments, either the tools built directly into products like PyCharm, or personal SonarQube scans.

Statistics

There are various statistics we can look at as some form of code quality metric.

Team

The team size is interesting because it gives us an idea of how much each member of the team is responsible for on average.


There are currently 6 people working on pgAdmin full time. In addition, we have a number of people at EDB who are primarily focussed on Postgres Enterprise Manager, which as it is based on pgAdmin, occasionally means that that team will contribute to pgAdmin as well. We also have a number of community contributors who submit the occasional patch.


Counting only the full time developers, that’s 19.8 not-yet-fixed bugs per developer.

Users

The number of users is interesting because the bigger the user base, the more likely we are to get issues reported.


Unfortunately, we don’t know how many users we have because pgAdmin is freely available. We do know that Docker is currently reporting over 50 million pulls of the official pgAdmin container, and that’s almost certainly not where most people get their copy of pgAdmin from.


Counting only the Docker users, and assuming each one of them has pulled every one of the 34 releases since the first one uploaded there, that's 1.47 million users (yeah, OK, we definitely don’t have that many users, but still…). That works out at 0.000081 not-yet-fixed bugs per user, though arguably this number is also pretty meaningless. A better one might be the number of bugs reported over the lifetime of the codebase (which is 1856, including issues in new features that ended up being fixed prior to release), which gives us 0.001263 reported and verified bugs per user.

Code

Looking at the code, we see the following stats:


dpage@hal:~/git/pgadmin4$ perl ~/bin/cloc.pl .

    3650 text files.

    3148 unique files.                                          

    1115 files ignored.


https://github.com/AlDanial/cloc v 1.66  T=12.18 s (217.1 files/s, 27242.3 lines/s)

--------------------------------------------------------------------

Language          files          blank        comment           code

--------------------------------------------------------------------

Python              714          21204          30011          92424

JavaScript          285          10216           9984          75844

SQL                1384           5383           2889          38516

JSON                108             31              0          24752

SASS                 34            995             16           5210

make                  2            111             37           3717

CSS                  39            376             66           2012

HTML                 28             42             20           1548

XML                  13              0              0           1208

C++                   9            342            247           1177

Bourne Shell         12            298            204           1096

Qt                    3              0              0            945

DOS Batch             1             85              8            284

C/C++ Header         10             94            107            244

Qt Project            1             25              7             99

Objective C++         1              2             11             14

Windows Resource File 1              0              0              1

--------------------------------------------------------------------

SUM:               2645          39204          43607         249091

--------------------------------------------------------------------


So, there are a shade under a quarter of a million lines of code in pgAdmin, which equates to one not-yet-fixed bug per 2,093 lines of code. By way of comparison, this Stackoverflow answer cites a book that in turn cites a number of studies into industry average bug rates, which are believed to be in the order of 1-25 bugs per 1000 lines of code.

Conclusion

pgAdmin has a relatively small number of developers, but a large amount of code and number of users. From the data we have, we can assume that with such a large user base, bugs are more likely to be reported than if there was a user base of tens or hundreds of users. Despite this, the number of issues per 1000 lines of code seems to be very low in comparison to industry studies.


Despite the huge amount of effort that goes into automated testing and builds, there will always be bugs. We rely on users to give us feedback when they find issues, and aim to fix everything as soon as possible, prioritising based on impact and likelihood of any given user running into each issue. We also try to balance time spent on bug fixes with new features and housekeeping tasks (aimed at keeping the code maintainable).


None of this would be possible without the automation in use for building and testing the code, or without the hard working developers (including Fahar) and users that report and help them fix issues, to whom I owe a huge debt of gratitude.


Thursday, 25 June 2020

system_stats extension for PostgreSQL

One of my colleagues at EDB spent some time working on a new extension for PostgreSQL and EDB Postgres Advanced Server as part of a proof of concept that I asked him to look into. I'm pleased to say that we've decided to make that work Open Source, releasing it under the PostgreSQL licence.

The system_stats extension offers a number of stored procedures that are useful when monitoring Postgres. They expose various system metrics to the database server, allowing you to extend your database monitoring to include information about CPU, memory, disk and network usage without having to add a separate mechanism to gain access to that information.

A number of functions are included:

pg_sys_os_info()
This interface allows the user to get operating system statistics.

postgres=# SELECT * FROM pg_sys_os_info();
-[ RECORD 1 ]-------+--------------------------------------------------------------------------------------------------
name                | Darwin
version             | Darwin Kernel Version 19.4.0: Wed Mar  4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64
host_name           | hal.local
domain_name         | 
handle_count        | 
process_count       | 531
thread_count        | 
architecture        | x86_64
last_bootup_time    | 
os_up_since_seconds | 281466

pg_sys_cpu_info()
This interface allows the user to get CPU information.

postgres=# SELECT * FROM pg_sys_cpu_info();
-[ RECORD 1 ]------+---------------
vendor             | 
description        | 
model_name         | MacBookPro15,1
processor_type     | 
logical_processor  | 12
physical_processor | 6
no_of_cores        | 12
architecture       | x86_64
clock_speed_hz     | 2900000000
cpu_type           | 7
cpu_family         | 260141638
byte_order         | 1234
l1dcache_size      | 32
l1icache_size      | 32
l2cache_size       | 256
l3cache_size       | 12288

pg_sys_cpu_usage_info()
This interface allows the user to get CPU usage information. Values are a percentage of time spent by CPUs for all operations.

postgres=# SELECT * FROM pg_sys_cpu_usage_info();
-[ RECORD 1 ]-------------------+----------
usermode_normal_process_percent | 1.6806724
usermode_niced_process_percent  | 0
kernelmode_process_percent      | 0
idle_mode_percent               | 98.31933
io_completion_percent           | 
servicing_irq_percent           | 
servicing_softirq_percent       | 
user_time_percent               | 
processor_time_percent          | 
privileged_time_percent         | 
interrupt_time_percent          | 

pg_sys_memory_info()
This interface allows the user to get memory usage information. All the values are in bytes.

postgres=# SELECT * FROM pg_sys_memory_info();
-[ RECORD 1 ]----+------------
total_memory     | 34359738368
used_memory      | 31668727808
free_memory      | 2691010560
swap_total       | 3221225472
swap_used        | 2672033792
swap_free        | 549191680
cache_total      | 
kernel_total     | 
kernel_paged     | 
kernel_non_paged | 
total_page_file  | 
avail_page_file  |

pg_sys_io_analysis_info()
This interface allows the user to get an I/O analysis of block devices.

postgres=# SELECT * FROM pg_sys_io_analysis_info();
-[ RECORD 1 ]-+-------------
device_name   | disk0
total_reads   | 9044867
total_writes  | 6688535
read_bytes    | 127989366784
write_bytes   | 127197491200
read_time_ms  | 3270007
write_time_ms | 3959392
-[ RECORD 2 ]-+-------------
device_name   | disk2
total_reads   | 2222011
total_writes  | 3213728
read_bytes    | 25915008512
write_bytes   | 99796177408
read_time_ms  | 22302918
write_time_ms | 65947889

pg_sys_disk_info()
This interface allows the user to get the disk information.

postgres=# SELECT * FROM pg_sys_disk_info();
-[ RECORD 1 ]----+---------------------
mount_point      | /
file_system      | apfs
drive_letter     | 
drive_type       | 
file_system_type | /dev/disk1s5
total_space      | 2000796545024
used_space       | 11124047872
free_space       | 1328550260736
total_inodes     | 19539028760
used_inodes      | 487630
free_inodes      | 19538541130
-[ RECORD 2 ]----+---------------------
mount_point      | /System/Volumes/Data
file_system      | apfs
drive_letter     | 
drive_type       | 
file_system_type | /dev/disk1s1
total_space      | 2000796545024
used_space       | 11124047872
free_space       | 1328550260736
total_inodes     | 19539028760
used_inodes      | 487630
free_inodes      | 19538541130
-[ RECORD 3 ]----+---------------------
mount_point      | /private/var/vm
file_system      | apfs
drive_letter     | 
drive_type       | 
file_system_type | /dev/disk1s4
total_space      | 2000796545024
used_space       | 11124047872
free_space       | 1328550260736
total_inodes     | 19539028760
used_inodes      | 487630
free_inodes      | 19538541130
-[ RECORD 4 ]----+---------------------
mount_point      | /Volumes/Backup
file_system      | hfs
drive_letter     | 
drive_type       | 
file_system_type | /dev/disk3
total_space      | 2000796545024
used_space       | 11124047872
free_space       | 1328550260736
total_inodes     | 19539028760
used_inodes      | 487630
free_inodes      | 19538541130

pg_sys_load_avg_info()
This interface allows the user to get the average load of the system over 1, 5, 10 and 15 minute intervals.

postgres=# SELECT * FROM pg_sys_load_avg_info();
-[ RECORD 1 ]------------+----------
load_avg_one_minute      | 1.4116211
load_avg_five_minutes    | 1.5273438
load_avg_ten_minutes     | 
load_avg_fifteen_minutes | 1.5668945

pg_sys_process_info()
This interface allows the user to get process information.

postgres=# SELECT * FROM pg_sys_process_info();
-[ RECORD 1 ]------+----
total_processes    | 531
running_processes  | 528
sleeping_processes | 0
stopped_processes  | 0
zombie_processes   | 2

pg_sys_network_info()
This interface allows the user to get network interface information.

postgres=# SELECT * FROM pg_sys_network_info();
-[ RECORD 1 ]---+-------------
interface_name  | lo0
ip_address      | 127.0.0.1
tx_bytes        | 281804800
tx_packets      | 882164
tx_errors       | 0
tx_dropped      | 0
rx_bytes        | 281804800
rx_packets      | 882164
rx_errors       | 0
rx_dropped      | 0
link_speed_mbps | 0
-[ RECORD 2 ]---+-------------
interface_name  | en0
ip_address      | 192.168.0.2
tx_bytes        | 3770546176
tx_packets      | 13311181
tx_errors       | 1164
tx_dropped      | 0
rx_bytes        | 7829563392
rx_packets      | 9070534
rx_errors       | 0
rx_dropped      | 0
link_speed_mbps | 0
-[ RECORD 3 ]---+-------------
interface_name  | utun4
ip_address      | 172.24.64.89
tx_bytes        | 5830656
tx_packets      | 27350
tx_errors       | 0
tx_dropped      | 0
rx_bytes        | 75328512
rx_packets      | 124340
rx_errors       | 0
rx_dropped      | 0
link_speed_mbps | 0

pg_sys_cpu_memory_by_process()
This interface allows the user to get the CPU and memory information for each process ID.

postgres=# SELECT * FROM pg_sys_cpu_memory_by_process();
-[ RECORD 1 ]---------+-----------------
pid                   | 62633
name                  | postgres
running_since_seconds | 
cpu_usage             | 4.48
memory_usage          | 0.02
memory_bytes          | 6848512
-[ RECORD 2 ]---------+-----------------
pid                   | 62632
name                  | psql
running_since_seconds | 
cpu_usage             | 0
memory_usage          | 0.01
memory_bytes          | 2453504
-[ RECORD 3 ]---------+-----------------
pid                   | 62594
name                  | Google Chrome He
running_since_seconds | 
cpu_usage             | 0
memory_usage          | 0.09
memory_bytes          | 31694848
...
...

All functions are restricted for use by superusers and members of the monitor_system_stats role due to the potentially sensitive nature of the information they make available. To grant a user permission to use the extension, grant them that role. For example:

GRANT monitor_system_stats TO nagios;

The extension is currently supported on Linux, macOS and Windows, though not all statistics are available or relevant on all platforms, so you may see some NULL values being returned.

Pull Requests for additional platform support, additional functions, or bug fixes are welcome, as are bug reports.

You can find the code and a source release on Github; please see the README for more information. Devrim Gündüz has also made RPMs available on yum.postgresql.org.

Wednesday, 10 July 2019

Reverse Proxying to pgAdmin

Reverse proxying requests to a pgAdmin server is becoming more and more popular if posts to the mailing lists are to be taken as an indicative measure; more often than not when using pgAdmin in a container (of which there have now been over 10 million pulls)! Typically users will deploy a reverse proxy for a couple of reasons; to host multiple applications in different subdirectories under the same domain, or to add SSL/TLS support independently of the application.

Because of the number of questions asked, I spent a little time over the last couple of days doing some testing and updating the documentation with some examples. Here's a blog-ified version of that work.


Nginx

Nginx is winning the battle of the web servers these days, beating out Lighttpd (which is still a fine piece of software) and the ageing and arguably bloated Apache HTTPD. All of these servers support reverse proxying, and whilst I've looked at Nginx, the examples shown below can easily be translated to the other servers if you prefer to run them instead.

In the following examples, we have pgAdmin running in a Docker container (in which it's hosted under Gunicorn). For simplicity, the examples have Nginx running on the host machine, but it can also be easily run in another container, sharing a Docker network with pgAdmin. In such a configuration there is no need to map the pgAdmin container port to the host.

The container is launched as shown below. See the documentation for information on other useful environment variables you can set and paths you can map.

The commands below will pull the latest version of the container from the repository, and run it with port 5050 on the host mapped to port 80 on the container. It will set the default username and password to user@domain.com and SuperSecret respectively.

docker pull dpage/pgadmin4
docker run -p 5050:80 \
    -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
    -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
    -d dpage/pgadmin4

A simple configuration to reverse proxy with Nginx to pgAdmin at the root directory looks like this:

server {
    listen 80;
    server_name _;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://localhost:5050/;
        proxy_redirect off;
    }
}

Here we tell Nginx to listen on port 80, and respond to any server name (sent by the client in the Host header). We then specify that all requests under the root directory are proxied back to port 5050 on the local host, and that the Host header is passed along as well. The proxy_redirect option tells the server not to rewrite the Location header.

But what if we want to host pgAdmin under a subdirectory, say /pgadmin4? In this case we need to change the path at the top of the location block and add the X-Script-Name header to the requests made to the pgAdmin container to tell it what subdirectory it's hosted under. This is shown below:

server {
    listen 80;
    server_name _;

    location /pgadmin4/ {
        proxy_set_header X-Script-Name /pgadmin4;
        proxy_set_header Host $host;
        proxy_pass http://localhost:5050/;
        proxy_redirect off;
    }
}

OK, so that's cool but we're talking about super top secret database stuffs here. It needs to be encrypted! Adding SSL/TLS support to the configuration is largely unrelated to pgAdmin itself, except that as with the subdirectory, we need to tell it the URL scheme (http or https) to use. We do this by setting the X-Scheme header. The other changes are to add a redirect from http to https, and to configure SSL/TLS:

server {
    listen 80;
    return 301 https://$host$request_uri;
}

server {
    listen 443;
    server_name _;

    ssl_certificate /etc/nginx/server.crt;
    ssl_certificate_key /etc/nginx/server.key;

    ssl on;
    ssl_session_cache builtin:1000 shared:SSL:10m;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
    ssl_prefer_server_ciphers on;

    location /pgadmin4/ {
        proxy_set_header X-Script-Name /pgadmin4;
        proxy_set_header X-Scheme $scheme;
        proxy_set_header Host $host;
        proxy_pass http://localhost:5050/;
        proxy_redirect off;
    }
}


Traefik

Traefik is becoming increasingly popular in containerised environments because it's designed to integrate with the orchestration platform being used and auto-configure itself as much as possible. The examples below show it working with plain Docker, but it will work with Swarm, Compose, Kubernetes and other orchestrators as well. Note that Traefik is designed purely for reverse proxying, routing and load balancing; it's not a general purpose webserver.

In the examples below, the following Traefik configuration is used. Because it auto-configures itself, the changes we make to get the desired configuration are to the way the pgAdmin container is deployed, not to the Traefik configuration:

defaultEntryPoints = ["http", "https"]

[entryPoints]
  [entryPoints.http]
    address = ":80"
      [entryPoints.http.redirect]
        entryPoint = "https"
  [entryPoints.https]
    address = ":443"
      [entryPoints.https.tls]

[docker]
domain = "domain_name"
watch = true

With this configuration, Traefik will automatically detect containers as they are launched, and reverse proxy to them using the virtual hostname generated from the container name and the domain in its config file, e.g. <container_name>.<domain>. SSL/TLS is enabled in this setup, with a redirect from plain http to https. The certificate used will be the default one built into Traefik; see the documentation for details on how Let's Encrypt or certificates from other issuers can be used.

To host pgAdmin at the root directory, we simply launch a container with the correct name, and no host to container port mapping:

docker pull dpage/pgadmin4
docker run --name "pgadmin4" \
    -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
    -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
    -d dpage/pgadmin4

With the configuration and commands above, Traefik will host pgAdmin at https://pgadmin4.domain_name/. Of course, the domain name should be changed to a real one, and a suitable CNAME, A or AAAA record should be added to the DNS zone file.

In order to host pgAdmin under a subdirectory, as in the Nginx example we need to tell both the proxy server and pgAdmin about the subdirectory. We tell pgAdmin by setting the SCRIPT_NAME environment variable, and we tell Traefik by adding a label to the container instance. For example:

docker pull dpage/pgadmin4
docker run --name "pgadmin4" \
    -e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
    -e "PGADMIN_DEFAULT_PASSWORD=SuperSecret" \
    -e "SCRIPT_NAME=/pgadmin4" \
    -l "traefik.frontend.rule=PathPrefix:/pgadmin4" \
    -d dpage/pgadmin4


Conclusion

Users are using reverse proxy servers to provide an interface between their clients and the pgAdmin server. These can be more traditional servers such as Nginx, or purpose designed reverse proxy servers such as Traefik. In either case, it's simple to configure SSL/TLS support or to host pgAdmin in a subdirectory.

Sunday, 30 June 2019

Indexing Documents for Full Text Search

During a conversation with $BOSS a couple of days ago the topic of the new Full Text Search in MongoDB came up. He (knowing of, but presumably having never worked with FTS in PostgreSQL) suggested I might create a tool like the one the MongoDB guys were proudly showing off in their keynote video from their recent conference to make it easy to index and search HTML documents on disk. I looked at him with a slight tilt of the head and remarked that it's already easy to do that, and just takes a few lines of Python code or similar.

So, here I am on my flight back to the UK having written 36 lines of code whilst munching on my breakfast and enjoying my free seat upgrade. I should probably note that those 36 lines of code include import statements, comments, blank lines for readability, and a nicely formatted 7 line SQL query that could easily be just one line. Plus, the vast majority of the code is really just boilerplate to read a directory of files and extract their contents.

Anyway, before I can go into that, it was necessary to first create a schema to hold the data we'd need to query. This is basically just a table to hold whatever information you like about each document, with a tsvector column for searching. For my test I decided to make the path to the file the primary key, and to store the title of the HTML document and the text in the body. The body text isn't really needed, but it does make it a touch easier to use a trigger to automatically maintain the tsvector data - plus of course, I could also then use that column and get rid of the source documents.

Here's what the table looked like:

  CREATE TABLE docs
  (
      path text NOT NULL,
      title text NOT NULL,
      body text NOT NULL,
      tsv tsvector NOT NULL,
      PRIMARY KEY (path)
  );

  CREATE INDEX docs_tsv_idx
      ON docs USING gin (tsv);

I also added the aforementioned trigger to maintain the tsvector. I got slightly fancy with this, indexing both the title and the body fields, and giving them different weights for ranking purposes:

CREATE FUNCTION update_tsv() RETURNS trigger
    LANGUAGE 'plpgsql' VOLATILE NOT LEAKPROOF
AS $BODY$
begin
  new.tsv :=
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.title,'')), 'A') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.body,'')), 'D');
 return new;
end
$BODY$;

CREATE TRIGGER update_tsv
       BEFORE INSERT OR UPDATE ON docs
       FOR EACH ROW EXECUTE PROCEDURE update_tsv();

Now that we have somewhere to store our data, here's the code to load it with data. This should be run in a Python Virtual Environment that has the psycopg2 and BeautifulSoup4 packages installed:

from bs4 import BeautifulSoup
import psycopg2
import glob

# Load from the docs/ directory
path = 'docs/*.html'

# Connect to the database and begin a transaction
conn = psycopg2.connect("dbname=fts_blog user=fts_blog password=fts_blog")
cur = conn.cursor()

for file in glob.iglob(path):
    print("Loading: {} ".format(file))

    with open(file) as f:
        html = f.read()

        # Read the HTML and extract the title and body
        soup = BeautifulSoup(html, features="html.parser")
        title = soup.find('title').decode_contents(formatter="html").replace('&nbsp;', ' ')
        body = soup.find('body').decode_contents(formatter="html")

        # Create the SQL statement
        sql = """INSERT INTO docs
            (path, title, body)
        VALUES
            (%s, %s, %s)
        ON CONFLICT ON CONSTRAINT docs_pkey DO UPDATE SET
            title = %s,
            body = %s;"""

        cur.execute(sql, [file, title, body, title, body])

conn.commit()
cur.close()
conn.close()

As you can see, the code is pretty simple. It basically just loops around all the HTML files found in the docs/ subdirectory, uses BeautifulSoup4 to extract the document title and body, and then runs an INSERT on the database table. For convenience, the query also handles conflicts to allow easy re-indexing.

To load some data, I created a directory calls docs/ alongside the script, and copied the PostgreSQL HTML documentation into the directory. I could then run the indexer:

$ python indexer.py
Loading: docs/sql-alterroutine.html 
Loading: docs/backup-dump.html 
Loading: docs/gist-examples.html 
...
...

Process finished with exit code 0

Finally, I needed to test the search capabilities. That's done through a SQL query of course, for example;

SELECT
    ts_rank("tsv", plainto_tsquery('trigger')) AS "rank",
    path,
    title,
    ts_headline(body,
                 plainto_tsquery('trigger'),
                 'StartSel=*,StopSel=*,MaxFragments=2,' ||
                 'FragmentDelimiter=...,MaxWords=30,MinWords=1') AS "headline"
FROM
    docs
WHERE
    tsv @@ plainto_tsquery('trigger')
ORDER BY rank DESC LIMIT 20

So here we are selecting the document rank, the path and title of the document, as well as a headline - a small snippet of the document that highlights where the matches were found within it, as you'd probably see on your favourite search engine. That's all coming from the docs table of course, and is restricted by our search query and then sorted by the rank and limited to 20 results.

NOTE: The search term in the query above is 'trigger'. This word is actually included three times in the query text, so make sure you change them all if using the query above as a starting point for your own.

When we run the query we see the output below which can of course, be used however we like in an application.


So there we have it; a table with an index and trigger, and a trivial Python script and we have a working indexer and search engine. Oh, and in case you were wondering, no, this is not new functionality in PostgreSQL. It was added to the core server with version 8.3 which was released in 2008. Prior to that there was a contrib module called tsearch2, offering largely equivalent functionality from version 7.4 which was released in 2003. Do try to keep up Mongo 😀