Creating a 19th c. Notes & Queries Index Database#

Although putting textual content into a database allows us to create full-text search tools over that content, a lot of work and effort went into creating the original indexes. So can we scrape the text data from the indexes and add the index data to a database table to create a comprehensive searchable index?

To start with, what columns might such a table need? Let’s review an example index issue of Notes & Queries. We can get the ID for such a page by querying the metadata database table:

from pathlib import Path

from pandas import read_sql
from sqlite_utils import Database

db_name = "nq_demo.db"
db = Database(db_name)

# The default download directory file path
dirname = 'ia-downloads'
p = Path(dirname)

q = "SELECT id FROM metadata WHERE is_index=1 AND date LIKE '1849%' LIMIT 1"

sample_index_id = read_sql(q, db.conn)["id"].iloc[0]
sample_index_id
'sim_notes-and-queries_1849-1850_1_index'

Really we should ensure we have downloaded a copy of that index document, although, as we assume here, it should already be in the download cache from when we created the monolithic index PDF.
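As a quick check, we can look for the cached copy directly (a minimal sketch, assuming the download cache layout used elsewhere in these notebooks):

# Check whether the index PDF is already in the local download cache
(p / sample_index_id / f"{sample_index_id}.pdf").exists()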

Cleaning the Text#

If we preview the PDF of an index issue, we see it has a regular two-column structure. We can also see structure in the way that the index terms, and the subsidiary index terms, are organised:

from IPython.display import IFrame

IFrame( (p / sample_index_id / f'{sample_index_id}.pdf').as_posix(), width=600, height=500)

Let’s also have a look at some of the raw search text for an index issue:

from ia_utils.download_and_extract_text import download_and_extract_text

sample_index_text = download_and_extract_text(sample_index_id)

print(sample_index_text[:1000])
 
 
 
 
A.
Apsey of St. Wandrille, 382. 486.
Abdication of James I1., 39. 489.
Aberdeen, Burnet prize at, 91.
Aboriginal chambers near Tilbury, 462.
A. (B.) on emancipation of the Jews, 475.
Accuracy of references, 170.
Addison's books, 212.
Adolphus on a recent novel, 231.
Advent bells, 121.
Adversaria, 73. 86.
Elfric’s colloquy, 168. 197, 232. 248. 278.
Elian, translation of, 267. 284.
A. (F. RB.) on Sterne’s Koran, 418.
—— on a passage in Goldsmith, 83.
— Queen of Hearts, 320.
Agricola(C.), Propugnaculum anti-Pistori- anum, 203.
A. (J. D.) on swords worn in public, 415.
Alban’s (St.) Day, 399.
—._, law courts at, 306.
Albert (Le Petit), 474.
Alchemy, metrical writings on, 60.
Ale »xandria a (Ptolemy of), 142. 170.
Alfred’s (King) Geography of Europe, 257. 313.
— works, 93.
Alicui on Bec ket’s grace-cup, 143.
—— on Bishop Barnaby, 132.
All Angels and St. Michael’s, feast of, 235.
* All to-broke,” 490.
Allusion in Friar Brackley’s sermon, 35).
Almanack (Poor Robia’s), 470.
Alms-basins
sample_index_text = download_and_extract_text(sample_index_id, typ="djvutxt")

print(sample_index_text[:1000])
A. 


Apsey of St. Wandrille, 382. 486. 

Abdication of James I1., 39. 489. 

Aberdeen, Burnet prize at, 91. 

Aboriginal chambers near Tilbury, 462. 

A. (B.) on emancipation of the Jews, 475. 

Accuracy of references, 170. 

Addison's books, 212. 

Adolphus on a recent novel, 231. 

Advent bells, 121. 

Adversaria, 73. 86. 

Elfric’s colloquy, 168. 197, 232. 248. 278. 

Elian, translation of, 267. 284. 

A. (F. RB.) on Sterne’s Koran, 418. 

—— on a passage in Goldsmith, 83. 

— Queen of Hearts, 320. 

Agricola(C.), Propugnaculum anti-Pistori- 
anum, 203. 

A. (J. D.) on swords worn in public, 415. 

Alban’s (St.) Day, 399. 

—._, law courts at, 306. 

Albert (Le Petit), 474. 

Alchemy, metrical writings on, 60. 

Ale »xandria a (Ptolemy of), 142. 170. 

Alfred’s (King) Geography of Europe, 257. 
313. 


— works, 93. 

Alicui on Bec ket’s grace-cup, 143. 

—— on Bishop Barnaby, 132. 

All Angels and St. Michael’s, feast of, 235. 

* All to-broke,” 490. 

Allusion in Friar

Inspecting some of the documents shows that there is no guarantee that the search text places each index item on a new line, although in certain documents it appears as if line breaks are provided after each entry (as in the original scanned image).

There are also “sub-elements” on separate lines that relate to a major heading and that we really need to “fill down” on, although there may be no indication in the text (e.g. no series of dashes or tab characters) of the subsidiary nature of a reference. (Note that there may be further clues in the original XML, for example, from the location of the text.) However, subsidiary entries do often appear to start with a lower case letter, so let’s use that as a heuristic: if the line starts with a lower case letter, it’s a subsidiary entry. More detailed inspection of the index search text also suggests that in some cases dash separator characters may appear in the search text.
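As a minimal sketch of that heuristic (a hypothetical helper, not part of the processing pipeline developed below):

def starts_subsidiary(line):
    """Heuristic: a line starting with a lower case letter is a subsidiary entry."""
    # Ignore any leading dashes and spaces before testing the first character
    stripped = line.lstrip("—–- ")
    return bool(stripped) and stripped[0].islower()

starts_subsidiary("on a passage in Goldsmith, 83.")  # True
starts_subsidiary("Advent bells, 121.")  # False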

To create a complete index, one possible approach is to:

  • normalise a single entry and all its subsidiary entries onto a single line;

  • parse a single entry and all its subsidiary entries into appropriate database records.

Rather than consider the XML and all the additional processing that it incurs, let’s try to “repair” the document as best we can. Another thing we could try to exploit is the alphabetical ordering of entries, but let’s leave that as an open question and only return to it if we find issues that alphabetisation might help us address.

So let’s start by repairing the text and normalising the lines before considering how to parse the entries.

Reinserting Line Breaks#

If we can identify where line breaks are likely to be missing, we should be able to reinsert them.

By inspection of the raw search text, it seems that each entry ends with a page number (digits) followed by a space character, and that the next entry typically starts with a capital letter (subsidiary lines seem to start with a lower case character). We can perform a regular expression substitution to match this pattern and replace the space after the final page number with an end-of-line character.

Some lines also start with opening quotes of various flavours (‘ or “, for example), or incorrectly recognised quotes rendered as a * character. We can also insert line breaks before these:

import re

def repair_index_missing_line_breaks(text):
    """Attempt to repair missing line breaks."""
    # Add line break after page number
    # allowing a single optional grace character at end for incorrect OCR
    repaired_text  = re.sub(r"([0-9].?\s*\.?)[\s]+([\(‘“\"'\*A-Z])", r'\1\n\2', text)
    
    return repaired_text

Let’s see how that looks:

repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )

print(repaired_sample_index)
A. 


Apsey of St. Wandrille, 382. 486.
Abdication of James I1., 39. 489.
Aberdeen, Burnet prize at, 91.
Aboriginal chambers near Tilbury, 462.
A. (B.) on emancipation of the Jews, 475.
Accuracy of references, 170.
Addison's books, 212.
Adolphus on a recent novel, 231.
Advent bells, 121.
Adversaria, 73. 86.
Elfric’s colloquy, 168. 197, 232. 248. 278.
Elian, translation of, 267. 284.
A. (F. RB.) on Sterne’s Koran, 418. 

—— on a passage in Goldsmith, 83. 

— Queen of Hearts, 320.
Agricola(C.), Propugnaculum anti-Pistori- 
anum, 203.
A. (J. D.) on swords worn in public, 415.
Alban’s (St.) Day, 399. 

—._, law courts at, 306.
Albert (Le Petit), 474.
Alchemy, metrical writings on, 60.
Ale »xandria a (Ptolemy of), 142. 170.
Alfred’s (King) Geography of Europe, 257. 
313. 


— works, 93.
Alicui on Bec ket’s grace-cup, 143. 

—— on Bishop Barnaby, 132.
All Angels and St. Michael’s, feast of, 235.
* All to-broke,” 490.
Allusion in Friar

Removing Unwanted Line Breaks#

If what appear to be page numbers appear on their own line, they should presumably be treated as page numbers for the previous reference.

In other cases, a subsidiary reference might incorrectly be placed on its own line, or a line might end on a comma. In such cases, we might assume the associated line breaks to be unwanted.

So let’s replace the line breaks in those locations with spaces, and then also replace any double spaces we might have introduced (or that were present within the original scanned text) with a single space:

def repair_index_unwanted_line_breaks(text):
    """Attempt to repair extraneous line breaks."""
    # Fix unwanted line end before page number
    repaired_text  = re.sub(r"\n([0-9].*)", r' \1', text)
    # Fix unwanted line end before subsidiary entry (initial lower case character)
    # Identify subsidiary split with a ::: separator
    repaired_text  = re.sub(r"\n([a-z].*)", r' ::: \1', repaired_text)
    # Fix unwanted line break after comma
    #repaired_text  = re.sub(r",\s*\n", r', ', repaired_text)
    
    # Collapse runs of spaces down to a single space
    repaired_text  = re.sub(r" {2,}", r' ', repaired_text)
    
    return repaired_text

How do things look now?

repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )
repaired_sample_index = repair_index_unwanted_line_breaks( repaired_sample_index )

print(repaired_sample_index)
A. 


Apsey of St. Wandrille, 382. 486.
Abdication of James I1., 39. 489.
Aberdeen, Burnet prize at, 91.
Aboriginal chambers near Tilbury, 462.
A. (B.) on emancipation of the Jews, 475.
Accuracy of references, 170.
Addison's books, 212.
Adolphus on a recent novel, 231.
Advent bells, 121.
Adversaria, 73. 86.
Elfric’s colloquy, 168. 197, 232. 248. 278.
Elian, translation of, 267. 284.
A. (F. RB.) on Sterne’s Koran, 418. 

—— on a passage in Goldsmith, 83. 

— Queen of Hearts, 320.
Agricola(C.), Propugnaculum anti-Pistori- ::: anum, 203.
A. (J. D.) on swords worn in public, 415.
Alban’s (St.) Day, 399. 

—._, law courts at, 306.
Albert (Le Petit), 474.
Alchemy, metrical writings on, 60.
Ale »xandria a (Ptolemy of), 142. 170.
Alfred’s (King) Geography of Europe, 257. 313. 


— works, 93.
Alicui on Bec ket’s grace-cup, 143. 

—— on Bishop Barnaby, 132.
All Angels and St. Michael’s, feast of, 235.
* All to-broke,” 490.
Allusion in Friar

Inspecting the above, we see there are still “issues” we might be able to address, such as line entries that should be separated, perhaps based on a closer inspection of the XML returned from the scan, which includes the position of each item of text on the page.

But at least we have something to work with.

Parsing Entries and Adding Them to a Database#

Let’s now consider how we might structure our database entries.

First, we have simple “primary” entries, such as Agincourt, Sir Hilary charged at, 158. 190.

We might put this into a record of the form:

[{"source_id": id_val, "index_term": "Agincourt, Sir Hilary charged at", "page": 158}
{"source_id": id_val, "index_term": "Agincourt, Sir Hilary charged at", "page": 190}]

The page numbers are relative to a particular volume, so we also need to capture information that identifies what the page numbers are with reference to. The index document titles take the form Notes and Queries 1875: Vol 3 Index, so we can parse out the year and volume and add these to the record too.

from parse import parse

def get_index_metadata_from_title(title):
    """Get year and volume from title."""
    metadata = parse("Notes and Queries {year}: Vol {vol} Index", title)
    if metadata:
        metadata = {"year": metadata["year"], "vol": metadata["vol"]}
    else:
        metadata = {"year": None, "vol": None}
    return metadata

Here’s how it works:

sample_index_page_title = "Notes and Queries 1875: Vol 3 Index"

get_index_metadata_from_title(sample_index_page_title)
{'year': '1875', 'vol': '3'}

In the table, we might also provide a typ column to distinguish between primary (P) and subsidiary (S) entries, along with a subsidiary column, which should be empty in simple cases.
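For example, a simple primary entry such as Advent bells, 121. might then give a partial record of the form:

{"index_term": "Advent bells", "typ": "P", "subsidiary": "", "page": 121}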

For a line entry such as A. (E. H.) on baptismal superstition, 197. on curfew, at Morpeth, 312. on Duresme and Dunelm, 206., we note that the first entry is actually a subsidiary entry, the on keyword identifying its subsidiarity to the main term A. (E. H.).

We might then desire to have partial records of the form:

[{"index_term": "A. (E. H.)", "typ": "S", "page": 197, "subsidiary": "on baptismal superstition"},
{"index_term": "A. (E. H.)", "typ": "S", "page": 312, "subsidiary": "on curfew, at Morpeth"},
{"index_term": "A. (E. H.)", "typ": "S", "page": 206, "subsidiary": "on Duresme and Dunelm,"}
]

Inspection of other records with subsidiary terms suggests that a comma may also be used to denote initial subsidiarity, as illustrated here, for example:

Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217.

In this case, the multiple items are based on the original term before the initial comma (this might be a hasty assumption if the key term itself includes a comma, but then we might hope for an “on” separator to clarify the position).
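As a simple sketch of that comma-based split (assuming the key term itself contains no comma):

entry = "Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217."
term, _, rest = entry.partition(", ")
term, rest
('Berkeley (Bishop)', 'adventures of Gau- dentio di Lucca, 247.successful experiments, 217.')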

We also note in that example a possible repair we could make to the original text: removing the word-split hyphenation.

def clean_text_remove_word_split_hyphenation(text):
    """Remove word split hyphenation."""
    cleaned_text = re.sub(r"([a-z])[-—–][\n]([a-z])", r'\1\2', text)
    
    return cleaned_text

Let’s do a quick test of that:

test_eol_hyphenation = "Berkeley (Bishop), adventures of Gau-\ndentio di Lucca, 247.successful experiments, 217."

clean_text_remove_word_split_hyphenation(test_eol_hyphenation)
'Berkeley (Bishop), adventures of Gaudentio di Lucca, 247.successful experiments, 217.'

So let’s start by suggesting the following database record structure as something to work towards:

%%writefile ia_utils/create_db_table_index_entries.py
def create_db_table_index_entries(db, drop=True):
    """Create an index_entries database table and an associated full-text search table."""
    table_name = "index_entries"
    # If we want to remove any previous table completely, we can drop it
    if drop:
        db[table_name].drop(ignore=True)
        db[f"{table_name}_fts"].drop(ignore=True)

    db[table_name].create({
        "source_id": str, 
        "year": str,
        "vol": str,
        "index_term": str, 
        "typ": str,
        "subsidiary": str,
        "page_num": int
    })

    # Enable full text search
    # This creates an extra virtual table ({table_name}_fts) to support the full text search
    db[table_name].enable_fts(["source_id", "index_term", "subsidiary", "year", "vol", "page_num"],
                                 create_triggers=True, tokenize="porter")
Overwriting ia_utils/create_db_table_index_entries.py

Load in the package and create the index entries database table:

from ia_utils.create_db_table_index_entries import create_db_table_index_entries

create_db_table_index_entries(db)

We now need to consider various ways of parsing line items, including:

  • extracting multiple page numbers for a single entry;

  • identifying entries that mask subsidiary terms.

We have already adopted a convention of using ::: to separate subsidiary items, so let’s apply that a bit further to separate out “on” terms and comma-separated terms. We might also have a catch-all in case there are elements appearing after a page number that are perhaps rightly new entries but that we shall treat as subsidiaries.

We could also try to “fudge” page numbers that look nearly like numbers, for example, where a set of numbers ends with an s or a z, and we might guess (possibly incorrectly) at a 5 or 2.

def _repair_index_subsidiary_separator_line(text):
    """Repair entries at line level."""
    
    # Very risky number substitutions
    # We want to access \1 so we need the alternative syntax
    repaired_text  = re.sub(r"([0-9])[sS]\.?", r'\g<1>5', text)
    repaired_text  = re.sub(r"([0-9])[zZ]\.?", r'\g<1>2', repaired_text)
    
    # Subsidiary terms based on "on" - this may be overly aggressive to be starting with
    repaired_text  = re.sub(r"([^(on)]*)( on .*)", r'\1 ::: \2', repaired_text)
    # Subsidiary terms based on dashes at start of line
    repaired_text  = re.sub(r'^[-—–]+', r' ::: ', repaired_text)
    # Subsidiary terms based on multiple dashes within line (unlikely to be hyphen)
    repaired_text  = re.sub(r'[-—–]{2,}', r' ::: ', repaired_text)
    # Subsidiary terms based on dash after a number
    repaired_text  = re.sub(r'([0-9\.,]+\s*)[-—–]+', r'\1 :::', repaired_text)
    
    # Subsidiary terms based on page numbers
    repaired_text  = re.sub(r"([0-9]\.) *([‘“\"'\*A-Za-z])", r'\1 ::: \2', repaired_text)
    # Subsidiary terms based on "on" - this may be overly aggressive
    #repaired_text  = re.sub(r"^([^:]*)( on .*)", r'\1 ::: \2', repaired_text)
    # Or only apply after a number
    #repaired_text  = re.sub(r"([0-9]\.)\s*(on)", r'\1 ::: \2', repaired_text)
    if "::: on" in repaired_text:
        # Also split at start
        repaired_text = re.sub(r"^([^(on)]*) (on)", r"\1 ::: \2", repaired_text)
    # Subsidiary terms based on ","
    #elif ":::" in repaired_text:
    # If we have numbers separated by commas, replace the commas with a .
    repaired_text = re.sub(r'(\s+[0-9]+)\s*,\s*([0-9]+)',r'\1. \2', repaired_text)
    # If we have a comma followed by a number, separate before the number
    # Allow a single trailing grace character after the numbers
    repaired_text  = re.sub(r"^([^:,]*),\s*([0-9][0-9\.\s]+[A-Za-z]?)", r'\1 :::\2', repaired_text)
    # If we have a comma appear before a separator, separate on it
    repaired_text  = re.sub(r"^([^:,]*),\s*([^0-9]+)", r'\1 :::\2', repaired_text)

    # Provide a catch all to add separators after what look like page numbers
    repaired_text  = re.sub(r"([0-9]\s*[^:].?)\s*([A-Za-z].*)$", r'\1 ::: \2', repaired_text)
    
    # Remove uncaught dashes at start and end of phrase
    repaired_text = ":::".join([p.strip("-—– ") for p in repaired_text.split(":::")])
    
    return repaired_text


def repair_index_subsidiary_separator(text):
    """Attempt to identify where subsidiary splits occur."""
    # These are applied at the line level
    repaired_lines = [_repair_index_subsidiary_separator_line(line.strip()) for line in text.split("\n") if line]
    
    # Patch back any overly aggressively split lines
    return "\n".join(repaired_lines).replace("\n:::", ":::")

Let’s see how that works:

repaired_sample_index2 = repaired_sample_index
repaired_sample_index2 = repair_index_subsidiary_separator(repaired_sample_index2)

print(repaired_sample_index2)
A.
Apsey of St. Wandrille:::382. 486.
Abdication of James I1.:::39. 489.
Aberdeen:::Burnet prize at, 91.
Aboriginal chambers near Tilbury:::462.
A. (B.):::on emancipation of the Jews, 475.
Accuracy of references:::170.
Addison's books:::212.
Adolphus:::on a recent novel, 231.
Advent bells:::121.
Adversaria:::73. 86.
Elfric’s colloquy:::168. 197. 232. 248. 278.
Elian:::translation of, 267. 284.
A. (F. RB.):::on Sterne’s Koran, 418.::::::on a passage in Goldsmith, 83.:::Queen of Hearts, 320.
Agricola(C.):::Propugnaculum anti-Pistori:::anum, 203.
A. (J. D.):::on swords worn in public, 415.
Alban’s (St.) Day:::399.:::._, law courts at, 306.
Albert (Le Petit):::474.
Alchemy:::metrical writings on, 60.
Ale »xandria a (Ptolemy of):::142. 170.
Alfred’s (King) Geography of Europe:::257. 313.:::works, 93.
Alicui:::on Bec ket’s grace-cup, 143.::::::on Bishop Barnaby, 132.
All Angels and St. Michael’s:::feast of, 235.
* All to-broke:::” 490.
Allusion in Friar

And for the comma separator:

text_comma_subsidiary = "Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217."

repair_index_subsidiary_separator(text_comma_subsidiary)
'Berkeley (Bishop):::adventures of Gau- dentio di Lucca, 247.:::successful experiments, 217.'

Having made an attempt at some subsidiary separators, we can now try to parse out the various components. At the start of the line we have the primary entry; then we may have one or more page numbers, or one or more subsidiary phrases.

Let’s look at how to parse out page numbers. There may be one or more page numbers separated by spaces or by . characters.

# This is a rather crude approach that just grabs all the numbers we can find
def extract_page_numbers_from_line(text):
    """Extract one or more page numbers from text."""
    # Try to nudge things towards finding numbers at the end of the phrase
    end_of_text = re.sub(r'^[^0-9]*([0-9\.,\s]*$)', r'\1', text)
    start_of_text = text.replace(end_of_text, '')
    # Then just bludgeon out all the possible page numbers
    page_numbers = re.findall(r'\d+', end_of_text)
    return start_of_text, page_numbers

Let’s see how that works:

# Use a test example of subsidiary elements; there is no page number in the first part
[extract_page_numbers_from_line(t) for t in repair_index_subsidiary_separator(text_comma_subsidiary).split(":::")[1:]]
[('adventures of Gau- dentio di Lucca, ', ['247']),
 ('successful experiments, ', ['217'])]

And if there are no numbers?

extract_page_numbers_from_line("No numbers here")
('No numbers here', [])
def parse_index_line(text):
    """Parse out elements of the index entry."""
    
    # Split the entry in subsidiary parts and clean white space
    parts = [p.strip() for p in text.split(":::")]

    # Do we have one entry or many?
    if len(parts) == 1:
        # There are no subsidiary parts
        # The first part is the main index entry
        # from which we need to separate one or more page references
        entry_text, page_numbers = extract_page_numbers_from_line(parts[0])
        index_entries = [{"index_term": entry_text, "typ": "P",
                         "page_numbers": page_numbers}]
    else:
        # There are subsidiary parts
        # In this case, we get each subsidiary part and its page references
        # Get the subsidiary parts
        index_entries = []
        for p in parts[1:]:
            entry_text, page_numbers = extract_page_numbers_from_line(p)
            subsidiary_entry = {"index_term": parts[0],
                                "subsidiary": entry_text, "typ": "S",
                                "page_numbers": page_numbers}
            index_entries.append(subsidiary_entry)

    return index_entries
parse_index_line('“ Noise" derivations of 81. 106. 138. 218. 35')
[{'index_term': '“ Noise" derivations of ',
  'typ': 'P',
  'page_numbers': ['81', '106', '138', '218', '35']}]

So does that work?!

parse_index_line(repaired_sample_index2.split("\n")[4])
[{'index_term': 'Aboriginal chambers near Tilbury',
  'subsidiary': '',
  'typ': 'S',
  'page_numbers': ['462']}]

In the above case, we have an error: the entry has been parsed as a subsidiary entry (with an empty subsidiary term) because the separator was inserted between the index term and its page number, leaving the first part without a page number of its own.

In this case, alphabetic sorting checks across several index entries (and subsidiaries) might help us detect this error; for example, if a subsidiary term sorts between the index term and the next index term, we might guess that the subsidiary is actually a main index term.
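A rough sketch of such a check (a hypothetical helper, assuming main entries appear in alphabetical order):

def sorts_as_main_entry(prev_term, candidate, next_term):
    """Guess whether a supposed subsidiary is really a main index entry,
    based on whether it sorts between the surrounding main terms."""
    def key(s):
        # Normalise case and strip stray quote and asterisk characters
        return s.strip("‘“\"'* ").lower()
    return key(prev_term) <= key(candidate) <= key(next_term)

sorts_as_main_entry("Aberdeen", "Aboriginal chambers near Tilbury",
                    "Accuracy of references")  # True: likely a main entry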

Note that if we construct a full text search across the index_term and subsidiary columns, we are likely to get false positives but we shouldn’t miss anything…

We can now try to create a complete set of records that we could upload to our database.

To start with, we need the metadata, which means we need the title.

def get_title_from_id(db, id_val):
    """Get the title of an issue from the database."""
    q = f'SELECT title FROM metadata WHERE id="{id_val}"'

    return read_sql(q, db.conn)["title"][0]

For example:

index_base_data = get_index_metadata_from_title(get_title_from_id(db, sample_index_id))
index_base_data
{'year': ' 1849 - 1850', 'vol': '1'}

Now we need to separate each line item into multiple items. The pandas dataframe can come to our aid here, with its ability to easily split listed items in one cell onto multiple rows:

import pandas as pd

example_subsidiary_df = pd.DataFrame(parse_index_line(repaired_sample_index2.split("\n")[4]))
example_subsidiary_df
index_term subsidiary typ page_numbers
0 Aboriginal chambers near Tilbury S [462]

We can now “explode” that dataframe against the lists of page numbers to get one row per item:

example_subsidiary_df.explode('page_numbers').rename(columns={"page_numbers": "page_num"})
index_term subsidiary typ page_num
0 Aboriginal chambers near Tilbury S 462

Let’s see if we can now put all those pieces together. Essentially, for each index line, we need to generate the complete set of records we want to add to the database.

def construct_index_records(id_val=None, text=None, metadata=None, retval="explode"):
    """Generate a complete set of index records from original search text document."""
    if id_val is None and text is None:
        return []

    text = download_and_extract_text(id_val, typ="djvutxt") if text is None else text

    records = []
    # Repair the text
    repaired_text = repair_index_missing_line_breaks( text )
    repaired_text = repair_index_unwanted_line_breaks( repaired_text )
    repaired_text = repair_index_subsidiary_separator( repaired_text )

    for line in repaired_text.split("\n"):
        if line:
            new_line = parse_index_line(line)
            records.extend(new_line)
    
    if retval not in ["df", "explode"] or id_val is None:
        # Return the list of dicts, without the metadata
        return records

    # WARNING: if text is provided directly, the id_val and the text may actually be inconsistent
    index_base_data = get_index_metadata_from_title(get_title_from_id(db, id_val))
    # Generate a dataframe
    records_df = pd.DataFrame(records)
    
    records_df["source_id"] = id_val
    records_df["year"] = index_base_data["year"]
    records_df["vol"] = index_base_data["vol"]
    
    if retval=="explode":
        return records_df.explode('page_numbers').rename(columns={"page_numbers": "page_num"})
    elif retval=="df":
        return records_df

And when we run it:

construct_index_records(sample_index_id)[1000: 1200]
index_term typ page_num subsidiary source_id year vol
844 Complutensian Polygiot S 402 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
844 Complutensian Polygiot S 431 sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
845 Compton Street S 228 Soho, sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
846 Conrad of Salisbury’ P 8 NaN sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
847 Descriptio utriusque P NaN NaN sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
... ... ... ... ... ... ... ...
1015 D. P NaN NaN sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
1016 D. S NaN on Lord Chatham's speech on American sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
1017 D. S 12 stamp act, sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
1018 D. S 214 iden frog, sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1
1019 D. S NaN sim_notes-and-queries_1849-1850_1_index 1849 - 1850 1

200 rows × 7 columns

It’s far from ideal, but at least gives us something to work with. So let’s add it to the database, and see how a search feels.

db["index_entries"].insert_all(construct_index_records(sample_index_id).to_dict(orient="records"))
<Table index_entries (source_id, year, vol, index_term, typ, subsidiary, page_num)>

Let’s try a search:

search_term = "cure"

q = f"""
SELECT * FROM index_entries_fts
WHERE index_entries_fts MATCH {db.quote(search_term)};
"""

read_sql(q, db.conn)
source_id index_term subsidiary year vol page_num
0 sim_notes-and-queries_1849-1850_1_index Cure for the hooping-cough 1849 - 1850 1 397
1 sim_notes-and-queries_1849-1850_1_index G. (J.) charm to cure the murrain in cows, 1849 - 1850 1 349
2 sim_notes-and-queries_1849-1850_1_index Warts charms for cure of, 1849 - 1850 1 349
3 sim_notes-and-queries_1849-1850_1_index Warts charms for cure of, 1849 - 1850 1 482

Let’s create a search index over all the index issues up to 1900 excluding the cumulative indexes.

First, grab all the indexes from the database and then filter to just the years we are interested in.

We really should support convenient year-based searching, either by adding a year column to the table or by creating a custom query function to handle years.
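For the custom query function route, a minimal sketch (assuming ISO-format datetime strings) might register a helper function with SQLite:

# Register a custom SQL function that extracts the year from a datetime string
db.conn.create_function("year_of", 1, lambda dt: int(dt[:4]) if dt else None)

read_sql('SELECT id, year_of(datetime) AS year FROM metadata LIMIT 3', db.conn)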

q = """
SELECT * FROM (SELECT CAST(strftime("%Y", datetime) AS INT) AS year, *
    FROM metadata
    WHERE is_index=1 AND id NOT LIKE "%cumulative%") WHERE year < 1900;
"""

indexes = read_sql(q, db.conn)

indexes = indexes.to_dict(orient="records")
indexes[:3]
[{'year': 1850,
  'id': 'sim_notes-and-queries_1849-1850_1_index',
  'date': '1849 - 1850',
  'datetime': '1850-03-20T00:00:00',
  'series': None,
  'vol': '1',
  'iss': 'Index',
  'title': 'Notes and Queries  1849 - 1850: Vol 1 Index',
  'next_id': 'sim_notes-and-queries_1849-11-03_1_1',
  'prev_id': '',
  'is_index': 1,
  'restricted': ''},
 {'year': 1850,
  'id': 'sim_notes-and-queries_1850_2_index',
  'date': '1850',
  'datetime': '1850-03-20T00:00:00',
  'series': None,
  'vol': '2',
  'iss': 'Index',
  'title': 'Notes and Queries  1850: Vol 2 Index',
  'next_id': 'sim_notes-and-queries_1850-06-01_2_31',
  'prev_id': 'sim_notes-and-queries_1850-05-25_1_30',
  'is_index': 1,
  'restricted': ''},
 {'year': 1851,
  'id': 'sim_notes-and-queries_1851_3_index',
  'date': '1851',
  'datetime': '1851-03-20T00:00:00',
  'series': None,
  'vol': '3',
  'iss': 'Index',
  'title': 'Notes and Queries  1851: Vol 3 Index',
  'next_id': 'sim_notes-and-queries_1851-01-04_3_62',
  'prev_id': 'sim_notes-and-queries_1850-12-28_2_61',
  'is_index': 1,
  'restricted': ''}]
# Import the tqdm progress bar tools
from tqdm.notebook import tqdm

db["index_entries"].delete_where()
db["index_entries_fts"].delete_where()

# List of indexes already loaded: indexes

for index_record in tqdm(indexes):
    index_records = construct_index_records(index_record["id"])
    db["index_entries"].insert_all( index_records.to_dict(orient="records"))

And how about a search…

search_term = "sin eater"

q = f"""
SELECT * FROM index_entries_fts
WHERE index_entries_fts MATCH {db.quote(search_term)};
"""

read_sql(q, db.conn)[:50]
source_id index_term subsidiary year vol page_num
0 sim_notes-and-queries_1851_4_index B. (E. H.) on mazer-wood and sin-eaters, 1851 4 211
1 sim_notes-and-queries_1851_4_index Sin-eaters notices respecting, 1851 4 211
2 sim_notes-and-queries_1852_6_index E .) on the sin-eater, 1852 6 541
3 sim_notes-and-queries_1852_6_index I eeper ( (Alex.) on the sin-eater, 1852 6 541
4 sim_notes-and-queries_1852_6_index Symons Jelinger C.) origin of sin-eater, 1852 6 300
5 sim_notes-and-queries_1870_6_index Sin-eater origin of the, 1870 6 450
6 sim_notes-and-queries_1876_6_index Sin-eater 1876 6 505
7 sim_notes-and-queries_1876_6_index Sin-eater 1876 6 505
8 sim_notes-and-queries_1877_7_index B. (A. C.) on the Sin-eater, 1877 7 14
9 sim_notes-and-queries_1877_7_index Sin-eater 1877 7 14
10 sim_notes-and-queries_1883_7_index Sin-eater 1883 7 25
11 sim_notes-and-queries_1883_7_index Sin-eater 1883 7 334
12 sim_notes-and-queries_1883_8_index Sin-eater 1883 8 255
13 sim_notes-and-queries_1895_8_index Sin-eaters 1895 8 288
14 sim_notes-and-queries_1895_8_index Sin-eaters 1895 8 288
15 sim_notes-and-queries_1895_8_index Sin-eaters 1895 8 332
16 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 169
17 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 296
18 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 110
19 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 111
20 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 109
21 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 169
22 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 236
23 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 296
24 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 111
25 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 110
26 sim_notes-and-queries_1896_9_index Owen (J. P.) on sin-eaters, 1896 9 109
27 sim_notes-and-queries_1896_9_index Owen (J. P.) on sin-eaters, 1896 9 236
28 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 110
29 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 109
30 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 169
31 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 236
32 sim_notes-and-queries_1896_9_index Sin-eaters 1896 9 296
33 sim_notes-and-queries_1896_9_index Thomas (N. W.) on sin-eaters, 1896 9 169
search_term = "boggart"

q = f"""
SELECT * FROM index_entries_fts
WHERE index_entries_fts MATCH {db.quote(search_term)};
"""

read_sql(q, db.conn)[:50]
source_id index_term subsidiary year vol page_num
0 sim_notes-and-queries_1869_4_index Boggarts and Feorin 1869 4 508
1 sim_notes-and-queries_1869_4_index Boggarts and Feorin 1869 4 508
2 sim_notes-and-queries_1869_4_index Higson (John) on Boggarts and Feorin, 1869 4 5
3 sim_notes-and-queries_1870_5_index Boggarts and Feorin 1870 5 23
4 sim_notes-and-queries_1870_5_index Boggarts and Feorin 1870 5 156
5 sim_notes-and-queries_1870_5_index Boggarts and Feorin 1870 5 216
6 sim_notes-and-queries_1870_5_index Boggarts and Feorin 1870 5 287
7 sim_notes-and-queries_1870_5_index Boggarts and Feorin 1870 5 365
8 sim_notes-and-queries_1870_5_index Boggarts and Feorin 1870 5 517
9 sim_notes-and-queries_1870_5_index Bowker (James) on Boggarts, Feorin, &c. 1870 5 365
10 sim_notes-and-queries_1870_5_index Boggarts Feorin, 1870 5 287
11 sim_notes-and-queries_1870_5_index Davies (Wm.) on Boggarts, Feorin, &c., 1870 5 216
12 sim_notes-and-queries_1870_5_index Hermentrude on Boggarts, 1870 5 23
13 sim_notes-and-queries_1870_5_index Boggarts Feorin, &c., 1870 5 156
14 sim_notes-and-queries_1870_5_index Riley (H. T.) on Boggarts, Feorin, &c., 1870 5 216
15 sim_notes-and-queries_1870_5_index Smith (W. J. B.) on Boggarts, Feorin, &e., 1870 5 317
16 sim_notes-and-queries_1895_8_index : Boggart, 85 ah by a sheep, 1895 8 170
17 sim_notes-and-queries_1895_8_index Boggart=ghost 1895 8 85
18 sim_notes-and-queries_1895_8_index Boggart=ghost 1895 8 255

Add the Index Tables to the Full Database#

Let’s also add the index tables to our full database.

db_name_full = "full_nq.db"
db_full = Database(db_name_full)

create_db_table_index_entries(db_full)

db_full["index_entries"].delete_where()
db_full["index_entries_fts"].delete_where()

# List of indexes already loaded: indexes

for index_record in tqdm(indexes):
    index_records = construct_index_records(index_record["id"])
    db_full["index_entries"].insert_all( index_records.to_dict(orient="records"))