Doc2Vec Searching of Lang Database#

In this recipe, I will demonstrate the use of the gensim package and a simple doc2vec model trained on stories from the Lang coloured fairy books to support semantic retrieval of fairy stories.

The approach can be summarised as follows:

  • generate a vocabulary of terms representative of the search corpus;

  • generate a vector space, informed by that vocabulary, in which documents and search phrases can be embedded;

  • generate a vector for each document or search phrase;

  • retrieve documents based on similarity between document vector and search phrase vector.

The following recipe is inspired by How to make a search engine on Movies Description.

See also https://www.kaggle.com/hgilles06/a-doc2vec-search-engine-cord19-new-version for ideas on a possible graphical user interface.

Connecting to the Database#

We’re going to work with our Lang fairy story database, so let’s set up a connection to it:

from sqlite_utils import Database

db_name = "demo.db"

db = Database(db_name)

Let’s remind ourselves of the database structure:

print(db.schema)
CREATE TABLE [books] (
   [book] TEXT,
   [title] TEXT,
   [text] TEXT,
   [last_para] TEXT,
   [first_line] TEXT,
   [provenance] TEXT,
   [chapter_order] INTEGER,
   PRIMARY KEY ([book], [title])
);
CREATE TABLE [books_metadata] (
   [title] TEXT,
   [year] INTEGER
);
CREATE VIRTUAL TABLE [books_fts] USING FTS5 (
    [title], [text],
    content=[books]
);
CREATE TABLE 'books_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'books_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
CREATE TABLE 'books_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'books_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
CREATE TRIGGER [books_ai] AFTER INSERT ON [books] BEGIN
  INSERT INTO [books_fts] (rowid, [title], [text]) VALUES (new.rowid, new.[title], new.[text]);
END;
CREATE TRIGGER [books_ad] AFTER DELETE ON [books] BEGIN
  INSERT INTO [books_fts] ([books_fts], rowid, [title], [text]) VALUES('delete', old.rowid, old.[title], old.[text]);
END;
CREATE TRIGGER [books_au] AFTER UPDATE ON [books] BEGIN
  INSERT INTO [books_fts] ([books_fts], rowid, [title], [text]) VALUES('delete', old.rowid, old.[title], old.[text]);
  INSERT INTO [books_fts] (rowid, [title], [text]) VALUES (new.rowid, new.[title], new.[text]);
END;

Recall that we can perform a full text search:

#q = 'king "three sons" gold'
q = 'hansel witch'
_q = f'SELECT title FROM books_fts WHERE books_fts MATCH {db.quote(q)} ;'

for row in db.query(_q):
    print(row["title"])
Hansel And Grettel

We can randomly sample a selection of rows with a query of the following form:

# Via https://gist.github.com/alecco/9976dab8fda8256ed403054ed0a65d7b

_q_random_sample = """
SELECT * FROM books
WHERE rowid IN (SELECT rowid FROM books
                WHERE title NOT LIKE "Preface"
                ORDER BY random() LIMIT {});
"""

for row in db.query(_q_random_sample.format(5)):
    print(row["title"])
King Lindorm
Long, Broad, And Quickeye
The Princess Bella-Flor
The Hoodie-Crow.
Prince Featherhead And The Princess Celandine

Simple Model#

We could use an off-the-shelf model to process documents, or we could train our own model on our own documents so that the word vectors are aligned to our dataset. For a large corpus, we can train on a sample of documents, as long as the sample is representative of the whole.

If we train against the whole dataset, we can search the dataset directly from the model. If we train the model on only a partial collection, then we can only compare search phrases against documents for which we have generated vectors.
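
To make that distinction concrete, here is a minimal sketch; it assumes the trained model and clean_text() helper built below, and a tag that exists in the training set:

# Sketch only: assumes the Doc2Vec model and clean_text() defined below,
# with each training document tagged by a "book::title" string

# A document that was in the training set can be looked up by its tag...
tag = "The Blue Fairy Book::Hansel And Grettel"
stored_vector = model.dv[tag]            # vector learned during training
neighbours = model.dv.most_similar(tag)  # similar training documents

# ...whereas a text outside the training set (including any search phrase)
# needs a vector inferring for it before it can be compared
unseen_tokens = clean_text("two children meet a witch in the forest")
model.dv.most_similar([model.infer_vector(unseen_tokens)])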

To create the model, it helps if we clean the documents, for example by decasing the text and removing punctuation:

from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_tags, strip_punctuation, strip_numeric, remove_stopwords

def clean_text(text):
    """Generate a cleaned, tokenised version of a text."""
    CUSTOM_FILTERS = [lambda x: x.lower(),
                      strip_tags, strip_punctuation,
                      strip_numeric, remove_stopwords]
    
    return preprocess_string(text, CUSTOM_FILTERS)
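
As a quick sanity check, we can run the cleaner over a made-up snippet; the exact tokens returned depend on gensim's built-in stopword list:

# Quick check of the cleaning function on a made-up snippet
clean_text("The Princess walked into the Forest, and met a Witch!")
# expect something like: ['princess', 'walked', 'forest', 'met', 'witch']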

Apply the cleaning function to the text of each story as we create the training corpus:

sample_corpus = db.query(_q_random_sample.format(9999))

sample_docs = [(clean_text(r['text']),
               f"{r['book']}::{r['title']}", #create a unique tag
               r['title'])
               for r in sample_corpus]

# For the first doc, preview the first 5 cleaned words and title
sample_docs[0][0][:5], sample_docs[0][1], sample_docs[0][2]
(['time', 'certain', 'country', 'lived', 'king'],
 'The Blue Fairy Book::The Bronze Ring',
 'The Bronze Ring')
# The gensim model expects documents as named tuples
# with at least a words field and a tags field;
# here we use the unique "book::title" string as the tag
from collections import namedtuple

StoryDoc = namedtuple('StoryDoc',
                      'words tags title')

sample_docs_training = []

for i, sample_doc in enumerate(sample_docs):
    sample_docs_training.append(StoryDoc(sample_doc[0],
                                         [sample_doc[1]], # This must be a list
                                         sample_doc[2]))
from gensim.models import Doc2Vec

# Define the parameters for building the model.
# We can also pass a list of documents
# via the first "documents" parameter
# and the model will be trained against those.
# Alternatively, create an empty model and train it later.
model = Doc2Vec(
                # dm: training algorithm;
                # 1: distributed memory/PV-DM;
                # 0: distributed bag of words (PV-DBOW)
                dm=1,
                # vector_size: size of feature vectors
                vector_size=300,
                # window: max dist between current & predicted word
                window=10,
                # hs: 1: hierarchical softmax;
                # 0: negative sampling (if negative is non-zero)
                hs=0,
                # min_count: ignore words with a lower frequency.
                # There is a risk in setting this too high,
                # particularly if a search term is likely to be unique,
                # as it might be with a name; on the other hand,
                # for such situations a simple full-text search might be better?
                min_count=1,
                # sample: randomly downsample hi-frequency words
                # useful range: (0, 1e-5)
                sample=1e-5,
                )

The model is built around a vocabulary extracted from the training document corpus.

# Build the model vocabulary
model.build_vocab(sample_docs_training)

We can now train the model (this may take some time for a large corpus):

# It would be useful if we could display a progress bar for this
model.train(sample_docs_training,
            total_examples=model.corpus_count,
            epochs=100, start_alpha=0.01, end_alpha=0.01)
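
On the progress bar point: gensim reports its training progress (epoch counts, processing rate) through Python's standard logging module, so one partial workaround is simply to switch on INFO-level logging before calling train():

# Not a progress bar as such, but gensim logs its training progress
# (epoch counts, words/s) at INFO level via the standard logging module
import logging

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)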

Rather than creating a model each time we want to use it, we can save the model and then load it as required:

# Save a model
model.save("lang_model.gensim")

# Load in a model
model = Doc2Vec.load("lang_model.gensim")

To retrieve a document matching a search phrase, we need to encode the search phrase and then try to find a matching document:

search_phrase = """
hansel and his sister were cast out by their wicked stepmother
and went into forest and met an evil witch
"""

# Preprocess the search phrase
tokens = clean_text(search_phrase)

tokens
['hansel',
 'sister',
 'cast',
 'wicked',
 'stepmother',
 'went',
 'forest',
 'met',
 'evil',
 'witch']

Generate a vector for the tokens:

# Generate the vector representation from the model
search_vector = model.infer_vector(tokens, alpha=0.001, epochs = 50)

We can now search for related documents from the original training set based on how well their vectors match the vector generated for the search phrase:

# Find the top 10 matches
matches = model.dv.most_similar([search_vector], topn=10)
# To rank every document from the training corpus,
# set: topn=len(model.dv)

# The response gives the original training document ids and match scores
matches
[('The Blue Fairy Book::Hansel And Grettel', 0.728550136089325),
 ('The Yellow Fairy Book::The Witch', 0.5643637180328369),
 ('The Yellow Fairy Book::The White Duck', 0.5299764275550842),
 ('The Red Fairy Book::The Three Dwarfs', 0.5016872882843018),
 ('The Red Fairy Book::The Twelve Brothers', 0.4773011803627014),
 ('The Yellow Fairy Book::The Crow', 0.46543797850608826),
 ('The Yellow Fairy Book::The Dead Wife', 0.4470345079898834),
 ('The Red Fairy Book::Mother Holle', 0.4352571666240692),
 ('The Red Fairy Book::Brother And Sister', 0.43186280131340027),
 ('The Yellow Fairy Book::The Nixy', 0.4286130964756012)]
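
The tags in the match list are the book::title keys we constructed earlier, so we can split the best match back apart and pull the corresponding record from the books table, for example to preview its opening line:

# Split the best matching tag back into its book and title parts
# and look the story up in the original books table
best_tag, best_score = matches[0]
book, title = best_tag.split("::")

_q = 'SELECT first_line FROM books WHERE book = ? AND title = ?;'
for row in db.query(_q, (book, title)):
    print(f'{title} ({best_score:.2f}): {row["first_line"]}')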

Let’s try another one:

search_phrase = """
a poor orphan girl lives with her wicked stepmother and sisters
but then her fairy godmother appears and she goes to a ball 
and leaves at midnight
but loses her slipper then finally marries the prince
"""

# Preprocess the search phrase
tokens = clean_text(search_phrase)
search_vector = model.infer_vector(tokens, alpha=0.01, epochs = 50)
model.dv.most_similar([search_vector], topn=10)
[('The Blue Fairy Book::Cinderella, Or The Little Glass Slipper',
  0.48652735352516174),
 ('The Brown Fairy Book::Which Was The Foolishest?', 0.4111166298389435),
 ('The Green Fairy Book::Spindle, Shuttle, And Needle', 0.3964698314666748),
 ('The Lilac Fairy Book::The Believing Husbands', 0.39020511507987976),
 ('The Lilac Fairy Book::The Brownie Of The Lake', 0.38550546765327454),
 ('The Brown Fairy Book::Habogi', 0.3834902048110962),
 ('The Blue Fairy Book::Little Red Riding Hood', 0.37533891201019287),
 ('The Lilac Fairy Book::The Hoodie-Crow.', 0.37399569153785706),
 ('The Violet Fairy Book::The Child Who Came From An Egg', 0.3730131685733795),
 ('The Violet Fairy Book::The Maiden With The Wooden Helmet',
  0.3719061613082886)]

Note that the result is stochastic (has a random element) in the way that the search vector is inferred: if you rerun the query, you will likely generate a different search vector. As a consequence, the search results are likely to differ in their order and match scores each time the query is run.
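
If that variability is a nuisance, one way of damping it down is to increase the epochs parameter on infer_vector(), or to infer the vector several times and average the results; averaged_vector() below is our own small helper, not part of gensim:

import numpy as np

def averaged_vector(model, tokens, n=10, epochs=50):
    """Infer a vector for the tokens n times and return the mean vector."""
    vectors = [model.infer_vector(tokens, epochs=epochs) for _ in range(n)]
    return np.mean(vectors, axis=0)

# A more stable search vector for the previous query
search_vector = averaged_vector(model, tokens)
model.dv.most_similar([search_vector], topn=10)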

Creating a Search Tool#

The next step is to register a custom SQLite function that will generate a vector for a search term and return matching records on that basis.

from vtfunc import TableFunction

class SQLite_doc2Vec_Model_Search(TableFunction):
    params = ['query', 'threshold']
    columns = ['book', 'title', 'score']
    name = 'doc2vec_model_search'
    
    model = Doc2Vec.load("lang_model.gensim")
    
    def initialize(self, query=None, threshold=None):
        tokens = clean_text(query)
        search_vector = self.model.infer_vector(tokens, alpha=0.01, epochs = 50)
        scores = self.model.dv.most_similar([search_vector],
                                            topn=len(self.model.dv))
        if threshold:
            scores = [(t, s) for (t, s) in scores if s >= threshold]

        self._iter = iter(scores)
 
    def iterate(self, idx):
        (tag, score) = next(self._iter)
        items = tag.split("::")
        return (items[0], items[1], score,)

# And register the function
SQLite_doc2Vec_Model_Search.register(db.conn)

The registered function searches over the model and can be called in either of the following forms:

  • doc2vec_model_search("search phrase")

  • doc2vec_model_search("search phrase", MIN_SCORE)

For example:

model_query = f"""
SELECT *
FROM doc2vec_model_search('''{search_phrase}''', 0.45 );
"""

for i in db.execute(model_query):
    print(i)
 
('The Blue Fairy Book', 'Cinderella, Or The Little Glass Slipper', 0.48834240436553955)
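
The threshold argument is optional; if we omit it, every document in the model is ranked and we can do any filtering or limiting in the surrounding SQL instead (the scores will again vary slightly from run to run):

model_query_all = f"""
SELECT *
FROM doc2vec_model_search('''{search_phrase}''')
ORDER BY score DESC LIMIT 3;
"""

for i in db.execute(model_query_all):
    print(i)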

Saving Model Vectors into the Database#

If we look at the object type of one of the model vectors, we see that it is a numpy.ndarray, which can be easily represented as a list:

type(model.dv[0])
numpy.ndarray

We can store this data in the SQLite database as a BLOB. To simplify converting the array between its gensim vector form and the format required for storage in the database, we can register a custom handler for the numpy.ndarray object:

# Via https://stackoverflow.com/a/18622264/454773
# See also: https://github.com/joosephook/sqlite3-numpy
import sqlite3
import numpy as np
import io

def adapt_array(arr):
    """
    http://stackoverflow.com/a/31312102/190597 (SoulNibbler)
    """
    out = io.BytesIO()
    np.save(out, arr)
    out.seek(0)
    return sqlite3.Binary(out.read())

def convert_array(text):
    """Deserialise a stored BLOB back into a numpy array."""
    out = io.BytesIO(text)
    out.seek(0)
    return np.load(out)


# Convert np.ndarray to a BLOB when inserting
sqlite3.register_adapter(np.ndarray, adapt_array)

# Convert columns declared as "array" back to np.ndarray when selecting
sqlite3.register_converter("array", convert_array)
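
As a quick, throwaway check that the adapter and converter behave as expected, we can round-trip an array through an in-memory database with a column declared using the array type:

# Throwaway round-trip test of the adapter/converter pair
test_conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
test_conn.execute("CREATE TABLE test (arr array);")

original = np.arange(5.0)
test_conn.execute("INSERT INTO test (arr) VALUES (?);", (original,))

(roundtripped,) = test_conn.execute("SELECT arr FROM test;").fetchone()
assert np.array_equal(original, roundtripped)
test_conn.close()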

Now we need to reset the database object's connection to one that supports the custom handlers we have just registered:

# Reset the database connection to use the parsed datatypes
db.conn = sqlite3.connect(db_name, detect_types=sqlite3.PARSE_DECLTYPES)

We can now create a table with a custom “array” datatype:

# Give ourselves a clean slate
db["story_vectors"].drop(ignore=True)


# sqlite_utils doesn't appear to support custom types (yet?!)
# The following errors on the "array" datatype
"""
db["story_vectors"].create({
    "book": str,
    "title": str,
    "tag": str, # a unique key derived from book and title
    "vector": "array",
}, pk=("book", "title"),
    # The following is not currently supported by sqlite_utils
   #foreign_keys=[ (("book", "title"), "books", ("book", "title"))] # local-table-id, foreign-table, foreign-table-id]
)
"""

# so we can create a table the old fashioned way...
vector_table_create = """
CREATE TABLE story_vectors 
    (tag TEXT PRIMARY KEY, vector array, book TEXT, title TEXT );
"""

cur = db.conn.cursor()
cur.execute(vector_table_create)
<sqlite3.Cursor at 0x12d1fbf10>

We can generate a list of dictionaries, one per record used to train the model, that can then be added directly to the story_vectors database table:

xrecords = []

for tag in model.dv.index_to_key:
    xrecords.append({'book': tag.split('::')[0],
                     'title': tag.split('::')[1],
                     'tag': tag, 'vector': model.dv[tag]})

And add the records directly to the database:

db["story_vectors"].insert_all(xrecords)
<Table story_vectors (tag, vector, book, title)>

Let’s pull an example record back showing just the first few elements of the vector associated with the record:

_q = f'SELECT * FROM story_vectors LIMIT 1;'

for row in db.query(_q):
    print(row['tag'], row['book'], row['title'], row['vector'][:10])
The Blue Fairy Book::The Bronze Ring The Blue Fairy Book The Bronze Ring [-1.1545537  -1.5133394  -0.9199181   3.0243099   0.9392002   2.3881683
 -1.5769789  -0.30107376 -0.3657083   1.5464693 ]
We can also load the whole of the story_vectors table into a pandas dataframe:

import pandas as pd

_q = f'SELECT * FROM story_vectors;'
df = pd.read_sql(_q, db.conn)

df.head(3)
tag vector book title
0 The Blue Fairy Book::The Bronze Ring [-1.1545537, -1.5133394, -0.9199181, 3.0243099... The Blue Fairy Book The Bronze Ring
1 The Blue Fairy Book::Prince Hyacinth And The D... [-2.2750967, -0.05328803, -0.42831513, -0.4234... The Blue Fairy Book Prince Hyacinth And The Dear Little Princess
2 The Blue Fairy Book::East Of The Sun And West ... [-0.11260367, 1.1501226, 0.4695935, 0.6856275,... The Blue Fairy Book East Of The Sun And West Of The Moon

We can get the cosine similarity of each row's vector relative to a search vector by applying sklearn.metrics.pairwise.cosine_similarity to a dataframe of the vectors we want to match against.

The cosine_similarity() function will happily accept two pandas dataframes, such as an N x M matrix of vectors we want to score against and a 1 x M search vector matrix:

from sklearn.metrics.pairwise import cosine_similarity

# Grab the vectors and expand each one across columns
match_vectors_df = df['vector'].apply(pd.Series)
search_vector_df = pd.DataFrame(search_vector).T

df['score'] = cosine_similarity(match_vectors_df,
                                search_vector_df)

df[df['score']>0.45].sort_values("score", ascending=False).head()
tag vector book title score
6 The Blue Fairy Book::Cinderella, Or The Little... [-0.7598129, 1.4367288, 0.19395736, 0.573754, ... The Blue Fairy Book Cinderella, Or The Little Glass Slipper 0.486527

So it’s easy enough to create a custom function to search over the vectors table rather than the model:

class SQLite_doc2Vec_Table_Search(TableFunction):
    params = ['query', 'threshold']
    columns = ['book', 'title', 'score']
    name = 'doc2vec_search'
    
    # If we moved this query into initialize(), the search would pick up
    # any new rows added to the story_vectors table between queries
    _q = 'SELECT book, title, vector FROM story_vectors;'
    df = pd.read_sql(_q, db.conn)
    match_vectors_df = df['vector'].apply(pd.Series)
    
    def initialize(self, query=None, threshold=None):
        df = self.df
        tokens = clean_text(query)
        search_vector = model.infer_vector(tokens, alpha=0.01, epochs = 50)

        search_vector_df = pd.DataFrame(search_vector).T

        # Find cosine similarity
        df['score'] = cosine_similarity(self.match_vectors_df,
                                        search_vector_df)
        
        # Apply minimum threshold if set
        _iterator = df[df['score']>=threshold] if threshold else df

        self._iter = _iterator[self.columns].itertuples(index=False,
                                                        name=None)
 
    def iterate(self, idx):
        row = next(self._iter)
        return (row[0], row[1], row[2],)

# And register the function
SQLite_doc2Vec_Table_Search.register(db.conn)

Let’s try it out:

vector_query = f"""
SELECT *
FROM doc2vec_search("{search_phrase}")
WHERE score>0.4 ORDER BY score DESC LIMIT 3;
"""

pd.read_sql(vector_query, db.conn)
book title score
0 The Blue Fairy Book Cinderella, Or The Little Glass Slipper 0.522628
1 The Red Fairy Book The Six Sillies 0.410686
2 The Brown Fairy Book Habogi 0.403348

Let’s try another one:

search_phrase2 = """
A girl lives with her grandmother and a wicked lord comes along and tells her
she can't get married unless she spins a funeral shroud and a wedding gown made from nettles.
As she spins, he gets ill, but can't die until she finishes it.
"""

vector_query = f"""
SELECT *
FROM doc2vec_search("{search_phrase2}")
WHERE score>0.3 ORDER BY score DESC LIMIT 3;
"""

pd.read_sql(vector_query, db.conn)
book title score
0 The Red Fairy Book The Nettle Spinner 0.639598
1 The Pink Fairy Book The Merry Wives 0.449067
2 The Blue Fairy Book Rumpelstiltzkin 0.444151
search_phrase3 = """
A girl in kitchen makes soup for king gets married.
"""

vector_query = f"""
SELECT *
FROM doc2vec_search("{search_phrase3}")
WHERE score>0.3 ORDER BY score DESC LIMIT 10;
"""

pd.read_sql(vector_query, db.conn)
book title score
0 The Yellow Fairy Book The Yellow Fairy Book 0.624495
1 The Blue Fairy Book Little Red Riding Hood 0.590124
2 The Red Fairy Book The Six Sillies 0.572264
3 The Blue Fairy Book Why The Sea Is Salt 0.547218
4 The Pink Fairy Book I Know What I Have Learned 0.535688
5 The Pink Fairy Book The Merry Wives 0.531312
6 The Red Fairy Book The True History Of Little Golden Hood 0.529552
7 The Pink Fairy Book Peter Bull 0.528966
8 The Pink Fairy Book The House In The Wood 0.517915
9 The Lilac Fairy Book A Lost Paradise 0.516366

Add TF-IDF#

To improve the doc2vec performance, it might be worth exploring a model that has a stricter minimum frequency for words in the corpus, but that also mixes a TF-IDF (term frequency, inverse document frequency) component into the ranking score?
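
Purely as a sketch of one way that might work, gensim's own TfidfModel could be used to build a TF-IDF index over the same cleaned documents, with its similarity scores blended with the doc2vec scores using a weighting that would need tuning:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity

# Build a TF-IDF index over the same cleaned documents used to train doc2vec
dictionary = Dictionary(doc.words for doc in sample_docs_training)
bow_corpus = [dictionary.doc2bow(doc.words) for doc in sample_docs_training]
tfidf = TfidfModel(bow_corpus)
index = SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

# TF-IDF similarity of each training document to a search phrase
query_bow = dictionary.doc2bow(clean_text(search_phrase))
tfidf_scores = index[tfidf[query_bow]]

# One naive blend; the 0.5 weighting is arbitrary and would need tuning,
# and the doc2vec scores would need aligning to the same document order
# blended = 0.5 * tfidf_scores + 0.5 * doc2vec_scores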