Sunday 30 June 2019

Indexing Documents for Full Text Search

During a conversation with $BOSS a couple of days ago the topic of the new Full Text Search in MongoDB came up. He (knowing of, but presumably having never worked with FTS in PostgreSQL) suggested I might create a tool like the one the MongoDB guys were proudly showing off in their keynote video from their recent conference to make it easy to index and search HTML documents on disk. I looked at him with a slight tilt of the head and remarked that it's already easy to do that, and just takes a few lines of Python code or similar.

So, here I am on my flight back to the UK having written 36 lines of code whilst munching on my breakfast and enjoying my free seat upgrade. I should probably note that those 36 lines of code include import statements, comments, blank lines for readability, and a nicely formatted 7 line SQL query that could easily be just one line. Plus, the vast majority of the code is really just boilerplate to read a directory of files and extract their contents.

Anyway, before I can go into that, it was necessary to first create a schema to hold the data we'd need to query. This is basically just a table to hold whatever information you like about each document, with a tsvector column for searching. For my test I decided to make the path to the file the primary key, and to store the title of the HTML document and the text in the body. The body text isn't really needed, but it does make it a touch easier to use a trigger to automatically maintain the tsvector data - plus of course, I could also then use that column and get rid of the source documents.

Here's what the table looked like:

  CREATE TABLE docs
  (
      path text NOT NULL,
      title text NOT NULL,
      body text NOT NULL,
      tsv tsvector NOT NULL,
      PRIMARY KEY (path)
  );

  CREATE INDEX docs_tsv_idx
      ON docs USING gin (tsv);

I also added the aforementioned trigger to maintain the tsvector. I got slightly fancy with this, indexing both the title and the body fields, and giving them different weights for ranking purposes:

CREATE FUNCTION update_tsv() RETURNS trigger
    LANGUAGE 'plpgsql' VOLATILE NOT LEAKPROOF
AS $BODY$
begin
  new.tsv :=
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.title,'')), 'A') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.body,'')), 'D');
 return new;
end
$BODY$;

CREATE TRIGGER update_tsv
       BEFORE INSERT OR UPDATE ON docs
       FOR EACH ROW EXECUTE PROCEDURE update_tsv();

Now that we have somewhere to store our data, here's the code to load it with data. This should be run in a Python Virtual Environment that has the psycopg2 and BeautifulSoup4 packages installed:

from bs4 import BeautifulSoup
import psycopg2
import glob

# Load from the docs/ directory
path = 'docs/*.html'

# Connect to the database and begin a transaction
conn = psycopg2.connect("dbname=fts_blog user=fts_blog password=fts_blog")
cur = conn.cursor()

for file in glob.iglob(path):
    print("Loading: {} ".format(file))

    with open(file) as f:
        html = f.read()

        # Read the HTML and extract the title and body
        soup = BeautifulSoup(html, features="html.parser")
        title = soup.find('title').decode_contents(formatter="html").replace(' ', ' ')
        body = soup.find('body').decode_contents(formatter="html")

        # Create the SQL statement
        sql = """INSERT INTO docs
            (path, title, body)
        VALUES
            (%s, %s, %s)
        ON CONFLICT ON CONSTRAINT docs_pkey DO UPDATE SET
            title = %s,
            body = %s;"""

        cur.execute(sql, [file, title, body, title, body])

conn.commit()
cur.close()
conn.close()

As you can see, the code is pretty simple. It basically just loops around all the HTML files found in the docs/ subdirectory, uses BeautifulSoup4 to extract the document title and body, and then runs an INSERT on the database table. For convenience, the query also handles conflicts to allow easy re-indexing.

To load some data, I created a directory calls docs/ alongside the script, and copied the PostgreSQL HTML documentation into the directory. I could then run the indexer:

$ python indexer.py
Loading: docs/sql-alterroutine.html 
Loading: docs/backup-dump.html 
Loading: docs/gist-examples.html 
...
...

Process finished with exit code 0

Finally, I needed to test the search capabilities. That's done through a SQL query of course, for example;

SELECT
    ts_rank("tsv", plainto_tsquery('trigger')) AS "rank",
    path,
    title,
    ts_headline(body,
                 plainto_tsquery('trigger'),
                 'StartSel=*,StopSel=*,MaxFragments=2,' ||
                 'FragmentDelimiter=...,MaxWords=30,MinWords=1') AS "headline"
FROM
    docs
WHERE
    tsv @@ plainto_tsquery('trigger')
ORDER BY rank DESC LIMIT 20

So here we are selecting the document rank, the path and title of the document, as well as a headline - a small snippet of the document that highlights where the matches were found within it, as you'd probably see on your favourite search engine. That's all coming from the docs table of course, and is restricted by our search query and then sorted by the rank and limited to 20 results.

NOTE: The search term in the query above is 'trigger'. This word is actually included three times in the query text, so make sure you change them all if using the query above as a starting point for your own.

When we run the query we see the output below which can of course, be used however we like in an application.


So there we have it; a table with an index and trigger, and a trivial Python script and we have a working indexer and search engine. Oh, and in case you were wondering, no, this is not new functionality in PostgreSQL. It was added to the core server with version 8.3 which was released in 2008. Prior to that there was a contrib module called tsearch2, offering largely equivalent functionality from version 7.4 which was released in 2003. Do try to keep up Mongo 😀