Creating a 19th c. Notes & Queries Index Database#
Although putting the textual content into a database allows us to create full text search tools over that content, a lot of work and effort went into creating the original indexes. So can we scrape the text data from the index issues and add the index data to a database table of our own, to create a comprehensive searchable index?
To start with, what columns might such a table need? Let’s review an example index issue of Notes & Queries. We can get the ID for such a page by querying the metadata database table:
from sqlite_utils import Database
db_name = "nq_demo.db"
db = Database(db_name)
from pathlib import Path
# And the default download dir file path
dirname = 'ia-downloads'
p = Path(dirname)
from pandas import read_sql
q = "SELECT id FROM metadata WHERE is_index=1 AND date LIKE '1849%' LIMIT 1"
sample_index_id = read_sql(q, db.conn)["id"].iloc[0]
sample_index_id
'sim_notes-and-queries_1849-1850_1_index'
Really, we should ensure we have downloaded a copy of that index document, although we assume here that it is already in the download cache from when we created the monolithic index PDF.
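As a sanity check, here is a minimal sketch of how such a cache check might look, assuming the internetarchive package and the ia-downloads directory layout used earlier (the ensure_index_download helper name is our own, not part of ia_utils):
from pathlib import Path
from internetarchive import download

def ensure_index_download(id_val, dirname="ia-downloads"):
    """Download the PDF for an index issue if it is not already in the local cache."""
    pdf_path = Path(dirname) / id_val / f"{id_val}.pdf"
    if not pdf_path.exists():
        # Fetch just the PDF file(s) for this item into the download cache
        download(id_val, glob_pattern="*.pdf", destdir=dirname)
    return pdf_path

#ensure_index_download(sample_index_id)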
Cleaning the Text#
If we preview the PDF of an index issue, we see it has a regular two column structure. We can also see structure in the way that the index terms, and the subsidiary index terms, are organised:
from IPython.display import IFrame
IFrame( (p / sample_index_id / f'{sample_index_id}.pdf').as_posix(), width=600, height=500)
Let’s also have a look at some of the raw search text for an index issue:
from ia_utils.download_and_extract_text import download_and_extract_text
simple_index_text = download_and_extract_text(sample_index_id)
print(simple_index_text[:1000])
A.
Apsey of St. Wandrille, 382. 486.
Abdication of James I1., 39. 489.
Aberdeen, Burnet prize at, 91.
Aboriginal chambers near Tilbury, 462.
A. (B.) on emancipation of the Jews, 475.
Accuracy of references, 170.
Addison's books, 212.
Adolphus on a recent novel, 231.
Advent bells, 121.
Adversaria, 73. 86.
Elfric’s colloquy, 168. 197, 232. 248. 278.
Elian, translation of, 267. 284.
A. (F. RB.) on Sterne’s Koran, 418.
—— on a passage in Goldsmith, 83.
— Queen of Hearts, 320.
Agricola(C.), Propugnaculum anti-Pistori- anum, 203.
A. (J. D.) on swords worn in public, 415.
Alban’s (St.) Day, 399.
—._, law courts at, 306.
Albert (Le Petit), 474.
Alchemy, metrical writings on, 60.
Ale »xandria a (Ptolemy of), 142. 170.
Alfred’s (King) Geography of Europe, 257. 313.
— works, 93.
Alicui on Bec ket’s grace-cup, 143.
—— on Bishop Barnaby, 132.
All Angels and St. Michael’s, feast of, 235.
* All to-broke,” 490.
Allusion in Friar Brackley’s sermon, 35).
Almanack (Poor Robia’s), 470.
Alms-basins
sample_index_text = download_and_extract_text(sample_index_id, typ="djvutxt")
print(sample_index_text[:1000])
A.
Apsey of St. Wandrille, 382. 486.
Abdication of James I1., 39. 489.
Aberdeen, Burnet prize at, 91.
Aboriginal chambers near Tilbury, 462.
A. (B.) on emancipation of the Jews, 475.
Accuracy of references, 170.
Addison's books, 212.
Adolphus on a recent novel, 231.
Advent bells, 121.
Adversaria, 73. 86.
Elfric’s colloquy, 168. 197, 232. 248. 278.
Elian, translation of, 267. 284.
A. (F. RB.) on Sterne’s Koran, 418.
—— on a passage in Goldsmith, 83.
— Queen of Hearts, 320.
Agricola(C.), Propugnaculum anti-Pistori-
anum, 203.
A. (J. D.) on swords worn in public, 415.
Alban’s (St.) Day, 399.
—._, law courts at, 306.
Albert (Le Petit), 474.
Alchemy, metrical writings on, 60.
Ale »xandria a (Ptolemy of), 142. 170.
Alfred’s (King) Geography of Europe, 257.
313.
— works, 93.
Alicui on Bec ket’s grace-cup, 143.
—— on Bishop Barnaby, 132.
All Angels and St. Michael’s, feast of, 235.
* All to-broke,” 490.
Allusion in Friar
sample_index_id
'sim_notes-and-queries_1849-1850_1_index'
Inspecting some of the documents shows that there is no guarantee that the search text places each index item on its own line, although in certain documents line breaks do appear after each entry (as in the original scanned image).
There are also “sub-elements” on separate lines that relate to a major heading that we really need to “fill down” on, although there may be no indication in the text (e.g. no series of dashes or tab characters) to signal the subsidiary nature of a reference. (Note that there may be further clues in the original XML, for example, from the location of the text on the page.) However, subsidiary entries do often appear to start with a lower case letter, so let’s use that as a heuristic: if a line starts with a lower case letter, treat it as a subsidiary entry. More detailed inspection of the index search text also suggests that in some cases dash separator characters may appear in the search text.
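As a quick illustration of that heuristic (a throwaway sketch rather than the repair logic we develop below):
def looks_like_subsidiary(line):
    """Heuristic: treat a line starting with a lower case letter as a subsidiary entry."""
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].islower()

looks_like_subsidiary("on a passage in Goldsmith, 83."), looks_like_subsidiary("Advent bells, 121.")
(True, False)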
To create a complete index, one possible approach is to:
normalise a single entry and all its subsidiary entries onto a single line;
parse a single entry and all its subsidiary entries into appropriate database records.
Rather than consider the XML and all the additional processing that incurs, let’s try to “repair” the document as best we can. Another thing we could try to exploit is the alphabetical order of entries, but let’s leave that as an open question and only return to it if we find issues occurring that alphabetisation might help us address.
So let’s start by repairing the text and normalising the lines before considering how to parse the entries.
Reinserting Line Breaks#
If we can identify where line breaks are likely to be missing, we should be able to reinsert them.
By inspection of the raw search text, it seems that we have a page number (digits), a space character, and then, typically, the next entry starting with a capital letter (subsidiary lines seem to start with a lower case character). We can perform a regular expression substitution to match this pattern and replace the space after the final page number with an end-of-line character.
Some lines also start with opening quotes of various flavours (‘ or “, for example), or incorrectly recognised quotes rendered as a * character. We can also insert line breaks in advance of these:
import re
def repair_index_missing_line_breaks(text):
    """Attempt to repair missing line breaks."""
    # Add line break after page number,
    # allowing a single optional grace character at the end for incorrect OCR
    repaired_text = re.sub(r"([0-9].?\s*\.?)[\s]+([\(‘“\"'\*A-Z])", r'\1\n\2', text)
    return repaired_text
Let’s see how that looks:
repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )
print(repaired_sample_index)
A.
Apsey of St. Wandrille, 382. 486.
Abdication of James I1., 39. 489.
Aberdeen, Burnet prize at, 91.
Aboriginal chambers near Tilbury, 462.
A. (B.) on emancipation of the Jews, 475.
Accuracy of references, 170.
Addison's books, 212.
Adolphus on a recent novel, 231.
Advent bells, 121.
Adversaria, 73. 86.
Elfric’s colloquy, 168. 197, 232. 248. 278.
Elian, translation of, 267. 284.
A. (F. RB.) on Sterne’s Koran, 418.
—— on a passage in Goldsmith, 83.
— Queen of Hearts, 320.
Agricola(C.), Propugnaculum anti-Pistori-
anum, 203.
A. (J. D.) on swords worn in public, 415.
Alban’s (St.) Day, 399.
—._, law courts at, 306.
Albert (Le Petit), 474.
Alchemy, metrical writings on, 60.
Ale »xandria a (Ptolemy of), 142. 170.
Alfred’s (King) Geography of Europe, 257.
313.
— works, 93.
Alicui on Bec ket’s grace-cup, 143.
—— on Bishop Barnaby, 132.
All Angels and St. Michael’s, feast of, 235.
* All to-broke,” 490.
Allusion in Friar
Removing Unwanted Line Breaks#
If what appears to be a page number sits on its own line, it should presumably be treated as a page number for the previous reference.
In other cases, a subsidiary reference might incorrectly be placed on a new line, or a line might end on a comma. In such cases, we might assume the associated line breaks are unwanted.
So let’s replace the line breaks in those locations with spaces, and then also replace any double spaces we might have introduced (or that were present within the original scanned text) with a single space:
def repair_index_unwanted_line_breaks(text):
    """Attempt to repair extraneous line breaks."""
    # Fix unwanted line end before page number
    repaired_text = re.sub(r"\n([0-9].*)", r' \1', text)

    # Fix unwanted line end before subsidiary entry (initial lower case character)
    # Identify subsidiary split with a ::: separator
    repaired_text = re.sub(r"\n([a-z].*)", r' ::: \1', repaired_text)

    # Fix unwanted line break after comma
    #repaired_text = re.sub(r",\s*\n", r', ZZ', repaired_text)

    # Remove duplicate spaces
    repaired_text = re.sub(r" {2,}", ' ', repaired_text)

    return repaired_text
How do things look now?
repaired_sample_index = repair_index_missing_line_breaks( sample_index_text[:1000] )
repaired_sample_index = repair_index_unwanted_line_breaks( repaired_sample_index )
print(repaired_sample_index)
A.
Apsey of St. Wandrille, 382. 486.
Abdication of James I1., 39. 489.
Aberdeen, Burnet prize at, 91.
Aboriginal chambers near Tilbury, 462.
A. (B.) on emancipation of the Jews, 475.
Accuracy of references, 170.
Addison's books, 212.
Adolphus on a recent novel, 231.
Advent bells, 121.
Adversaria, 73. 86.
Elfric’s colloquy, 168. 197, 232. 248. 278.
Elian, translation of, 267. 284.
A. (F. RB.) on Sterne’s Koran, 418.
—— on a passage in Goldsmith, 83.
— Queen of Hearts, 320.
Agricola(C.), Propugnaculum anti-Pistori- ::: anum, 203.
A. (J. D.) on swords worn in public, 415.
Alban’s (St.) Day, 399.
—._, law courts at, 306.
Albert (Le Petit), 474.
Alchemy, metrical writings on, 60.
Ale »xandria a (Ptolemy of), 142. 170.
Alfred’s (King) Geography of Europe, 257. 313.
— works, 93.
Alicui on Bec ket’s grace-cup, 143.
—— on Bishop Barnaby, 132.
All Angels and St. Michael’s, feast of, 235.
* All to-broke,” 490.
Allusion in Friar
Inspecting the above, we see there are still “issues” we might be able to address, such as line entries that should be separated, perhaps with the help of a closer inspection of the XML returned from the scan, which includes the position of each piece of text on the page.
But at least we have something to work with.
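We won’t pursue the XML route here, but for reference, here is a minimal sketch of how the per-word position data in an item’s *_djvu.xml file might be inspected. The WORD element and coords attribute follow the DjVu XML convention; the exact structure should be checked against a real file, which would also need downloading separately:
import xml.etree.ElementTree as ET

def preview_djvu_word_positions(xml_path, max_words=10):
    """Print the bounding box coordinates of the first few OCRed words."""
    tree = ET.parse(xml_path)
    # DjVu XML nests words inside page / region / line elements,
    # with each WORD carrying a coords attribute for its bounding box
    for count, word in enumerate(tree.iter("WORD")):
        if count >= max_words:
            break
        print(word.get("coords"), word.text)

#preview_djvu_word_positions(p / sample_index_id / f"{sample_index_id}_djvu.xml")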
Parsing Entries and Adding Them to a Database#
Let’s now consider how we might structure our database entries.
First, we have simple “primary” entries, such as Agincourt, Sir Hilary charged at, 158. 190.
We might put this into a record of the form:
[{"source_id": id_val, "index_term": "Agincourt, Sir Hilary charged at", "page": 158}
{"source_id": id_val, "index_term": "Agincourt, Sir Hilary charged at", "page": 190}]
The page numbers are relative to a particular volume, so we also need to capture information identifying which volume the page numbers refer to. The index document titles take the form Notes and Queries 1875: Vol 3 Index, so we can parse out the year and volume and add these to the record too.
from parse import parse
def get_index_metadata_from_title(title):
    """Get year and volume from title."""
    metadata = parse("Notes and Queries {year}: Vol {vol} Index", title)
    if metadata:
        metadata = {"year": metadata["year"], "vol": metadata["vol"]}
    else:
        metadata = {"year": None, "vol": None}

    return metadata
Here’s how it works:
sample_index_page_title = "Notes and Queries 1875: Vol 3 Index"
get_index_metadata_from_title(sample_index_page_title)
{'year': '1875', 'vol': '3'}
In the table, we might also provide a typ column to distinguish between primary (P) and subsidiary (S) entries, along with a subsidiary column, which should be empty in simple cases.
For a line entry such as A. (E. H.) on baptismal superstition, 197. on curfew, at Morpeth, 312. on Duresme and Dunelm, 206., we note that the first entry is actually a subsidiary entry, the on keyword identifying its subsidiarity to the main term A. (E. H.).
We might then desire to have partial records of the form:
[{"index_term": "A. (E. H.)", "typ": "S", "page": 197, "subsidiary": "on baptismal superstition"},
{"index_term": "A. (E. H.)", "typ": "S", "page": 312, "subsidiary": "on curfew, at Morpeth"},
{"index_term": "A. (E. H.)", "typ": "S", "page": 206, "subsidiary": "on Duresme and Dunelm,"}
]
Inspection of other records with subsidiary terms suggests that a comma may also be used to denote initial subsidiarity, as illustrated here, for example:
Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217.
In this case, the multiple items are based on the original term before the initial comma (this might be a hasty assumption if the key term itself includes a comma, but then we might hope for an “on” separator to clarify the position).
We also note in that example a possible repair we could make to the original text: removing the word-split hyphenation.
def clean_text_remove_word_split_hyphenation(text):
    """Remove word split hyphenation."""
    cleaned_text = re.sub(r"([a-z])[-—–][\n]([a-z])", r'\1\2', text)
    return cleaned_text
Let’s do a quick test of that:
test_eol_hyphenation = "Berkeley (Bishop), adventures of Gau-\ndentio di Lucca, 247.successful experiments, 217."
clean_text_remove_word_split_hyphenation(test_eol_hyphenation)
'Berkeley (Bishop), adventures of Gaudentio di Lucca, 247.successful experiments, 217.'
So let’s start by suggesting the following database record structure as something to work towards:
%%writefile ia_utils/create_db_table_index_entries.py
def create_db_table_index_entries(db, drop=True):
    """Create an index_entries database table and an associated full-text search table."""
    # If we want to remove the table completely, we can drop it
    table_name = "index_entries"
    db[table_name].drop(ignore=True)
    db[f"{table_name}_fts"].drop(ignore=True)

    db[table_name].create({
        "source_id": str,
        "year": str,
        "vol": str,
        "index_term": str,
        "typ": str,
        "subsidiary": str,
        "page_num": int
    })

    # Enable full text search
    # This creates an extra virtual table ({table_name}_fts) to support the full text search
    db[table_name].enable_fts(["source_id", "index_term", "subsidiary", "year", "vol", "page_num"],
                              create_triggers=True, tokenize="porter")
Overwriting ia_utils/create_db_table_index_entries.py
Load in the package and create the index entries database table:
from ia_utils.create_db_table_index_entries import create_db_table_index_entries
create_db_table_index_entries(db)
We now need to consider various ways of parsing line items, including:
extracting multiple page numbers for a single entry;
identifying entries that mask subsidiary terms.
We have already adopted a convention of using ::: to separate subsidiary items, so let’s apply that a bit further to separate out “on” terms and comma separated terms. We might also have a catch-all in case there are elements appearing after a page number that are perhaps rightly new entries but that we shall treat as subsidiaries.
We could possibly also try to “fudge” page numbers that only look number-ish, for example, where a set of numbers ends with an s or a z and we might guess (possibly incorrectly) at a 5 or a 2.
def _repair_index_subsidiary_separator_line(text):
    """Repair entries at line level."""
    # Very risky number substitutions
    # We want to access \1 so we need the alternative syntax
    repaired_text = re.sub(r"([0-9])[sS]\.?", r'\g<1>5', text)
    repaired_text = re.sub(r"([0-9])[zZ]\.?", r'\g<1>2', repaired_text)

    # Subsidiary terms based on "on" - this may be overly aggressive to be starting with
    repaired_text = re.sub(r"([^(on)]*)( on .*)", r'\1 ::: \2', repaired_text)

    # Subsidiary terms based on dashes at start of line
    repaired_text = re.sub(r'^[-—–]+', r' ::: ', repaired_text)
    # Subsidiary terms based on multiple dashes within line (unlikely to be hyphen)
    repaired_text = re.sub(r'[-—–]{2,}', r' ::: ', repaired_text)
    # Subsidiary terms based on dash after a number
    repaired_text = re.sub(r'([0-9\.,]+\s*)[-—–]+', r'\1 :::', repaired_text)

    # Subsidiary terms based on page numbers
    repaired_text = re.sub(r"([0-9]\.) *([‘“\"'\*A-Za-z])", r'\1 ::: \2', repaired_text)

    # Subsidiary terms based on "on" - this may be overly aggressive
    #repaired_text = re.sub(r"^([^:]*)( on .*)", r'\1 ::: \2', repaired_text)
    # Or only apply after a number
    #repaired_text = re.sub(r"([0-9]\.)\s*(on)", r'\1 ::: \2', repaired_text)

    if "::: on" in repaired_text:
        # Also split at start
        repaired_text = re.sub(r"^([^(on)]*) (on)", r"\1 ::: \2", repaired_text)

    # Subsidiary terms based on ","
    #elif ":::" in repaired_text:

    # If we have numbers separated by commas, replace the commas with a .
    repaired_text = re.sub(r'(\s+[0-9]+)\s*,\s*([0-9]+)', r'\1. \2', repaired_text)

    # If we have a comma before a number, separate after the number
    # Allow a grace character
    re.sub(r"^([^:,]*),\s*([0-9][0-9\.\s]+[A-Za-z]?)[^\n]", r'\1 \2:::', repaired_text)

    # If we have a comma appear before a separator, separate on it
    repaired_text = re.sub(r"^([^:,]*),\s*([^0-9]+)", r'\1 :::\2', repaired_text)

    # Provide a catch all to add separators after what look like page numbers
    repaired_text = re.sub(r"([0-9]\s*[^:].?)\s*([A-Za-z].*)$", r'\1 ::: \2', repaired_text)

    # Remove uncaught dashes at start and end of phrase
    repaired_text = ":::".join([p.strip("-—– ") for p in repaired_text.split(":::")])

    return repaired_text
def repair_index_subsidiary_separator(text):
    """Attempt to identify where subsidiary splits occur."""
    # These are applied at the line level
    repaired_lines = [_repair_index_subsidiary_separator_line(line.strip()) for line in text.split("\n") if line]

    # Patch back any overly aggressively split lines
    return "\n".join(repaired_lines).replace("\n:::", ":::")
Let’s see how that works:
repaired_sample_index2 = repaired_sample_index
repaired_sample_index2 = repair_index_subsidiary_separator(repaired_sample_index2)
print(repaired_sample_index2)
A.
Apsey of St. Wandrille:::382. 486.
Abdication of James I1.:::39. 489.
Aberdeen:::Burnet prize at, 91.
Aboriginal chambers near Tilbury:::462.
A. (B.):::on emancipation of the Jews, 475.
Accuracy of references:::170.
Addison's books:::212.
Adolphus:::on a recent novel, 231.
Advent bells:::121.
Adversaria:::73. 86.
Elfric’s colloquy:::168. 197. 232. 248. 278.
Elian:::translation of, 267. 284.
A. (F. RB.):::on Sterne’s Koran, 418.::::::on a passage in Goldsmith, 83.:::Queen of Hearts, 320.
Agricola(C.):::Propugnaculum anti-Pistori:::anum, 203.
A. (J. D.):::on swords worn in public, 415.
Alban’s (St.) Day:::399.:::._, law courts at, 306.
Albert (Le Petit):::474.
Alchemy:::metrical writings on, 60.
Ale »xandria a (Ptolemy of):::142. 170.
Alfred’s (King) Geography of Europe:::257. 313.:::works, 93.
Alicui:::on Bec ket’s grace-cup, 143.::::::on Bishop Barnaby, 132.
All Angels and St. Michael’s:::feast of, 235.
* All to-broke:::” 490.
Allusion in Friar
And for the comma separator:
text_comma_subsidiary = "Berkeley (Bishop), adventures of Gau- dentio di Lucca, 247.successful experiments, 217."
repair_index_subsidiary_separator(text_comma_subsidiary)
'Berkeley (Bishop):::adventures of Gau- dentio di Lucca, 247.:::successful experiments, 217.'
Having made an attempt at marking up the subsidiary separators, we can now try to parse out the various components. At the start of the line we have the primary entry, then we may have one or more page numbers or one or more subsidiary phrases.
Let’s look at how to parse out page numbers. There may be one or more page numbers, separated by spaces or by . characters.
# This is a rather crude approach that just grabs all the numbers we can find
def extract_page_numbers_from_line(text):
    """Extract one or more page numbers from text."""
    # Try to nudge things towards finding numbers at the end of the phrase
    end_of_text = re.sub(r'^[^0-9]*([0-9\.,\s]*$)', r'\1', text)
    start_of_text = text.replace(end_of_text, '')
    # Then just bludgeon out all the possible page numbers
    page_numbers = re.findall(r'\d+', end_of_text)
    return start_of_text, page_numbers
Let’s see how that works:
# Use a test example of subsidiary elements; there is no page number in the first part
[extract_page_numbers_from_line(t) for t in repair_index_subsidiary_separator(text_comma_subsidiary).split(":::")[1:]]
[('adventures of Gau- dentio di Lucca, ', ['247']),
('successful experiments, ', ['217'])]
And if there are no numbers?
extract_page_numbers_from_line("No numbers here")
('No numbers here', [])
def parse_index_line(text):
    """Parse out elements of the index entry."""
    # Split the entry into its subsidiary parts and clean white space
    parts = [p.strip() for p in text.split(":::")]

    # Do we have one entry or many?
    if len(parts) == 1:
        # There are no subsidiary parts
        # The first part is the main index entry
        # from which we need to separate one or more page references
        entry_text, page_numbers = extract_page_numbers_from_line(parts[0])
        index_entries = [{"index_term": entry_text, "typ": "P",
                          "page_numbers": page_numbers}]
    else:
        # There are subsidiary parts
        # In this case, we get each subsidiary part and its page references
        index_entries = []
        for p in parts[1:]:
            entry_text, page_numbers = extract_page_numbers_from_line(p)
            subsidiary_entry = {"index_term": parts[0],
                                "subsidiary": entry_text, "typ": "S",
                                "page_numbers": page_numbers}
            index_entries.append(subsidiary_entry)

    return index_entries
parse_index_line('“ Noise" derivations of 81. 106. 138. 218. 35')
[{'index_term': '“ Noise" derivations of ',
'typ': 'P',
'page_numbers': ['81', '106', '138', '218', '35']}]
So does that work?!
parse_index_line(repaired_sample_index2.split("\n")[4])
[{'index_term': 'Aboriginal chambers near Tilbury',
'subsidiary': '',
'typ': 'S',
'page_numbers': ['462']}]
In the above case, we have an error in that we have rolled one index entry as a subsidiary to an initial index entry because of a missing page number for the first entry.
In this case, alphabetic sorting checks across several index entries (and subsidiaries) might help us detect this error; for example, if a subsidiary term sorts between the index term and the next index term, we might guess that the subsidiary is actually a main index term.
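As a rough sketch of that idea (a heuristic we don’t actually apply below), we might scan the parsed records and flag any subsidiary whose text would sort between its parent term and the next main index term; flagged records could then be reviewed, or promoted to primary entries:
def flag_suspect_subsidiaries(entries):
    """Flag subsidiaries whose text sorts between the parent term and the next main term.

    `entries` is assumed to be the ordered list of dicts produced by applying
    parse_index_line() to consecutive index lines.
    """
    main_terms = sorted({e["index_term"].lower() for e in entries if e["typ"] == "P"})
    suspects = []
    for entry in entries:
        if entry["typ"] != "S" or not entry.get("subsidiary"):
            continue
        parent = entry["index_term"].lower()
        # The first main term that sorts after this entry's parent term
        following = [term for term in main_terms if term > parent]
        next_term = following[0] if following else None
        if next_term and parent < entry["subsidiary"].lower() < next_term:
            suspects.append(entry)
    return suspects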
Note that if we construct a full text search across the index_term and subsidiary columns, we are likely to get false positives, but we shouldn’t miss anything…
We can now try to create a complete set of records that we could upload to our database.
To start with, we need the metadata, which means we need the title.
def get_title_from_id(db, id_val):
    """Get the title of the issue from the database."""
    q = f'SELECT title FROM metadata WHERE id="{id_val}"'
    return read_sql(q, db.conn)["title"][0]
For example:
index_base_data = get_index_metadata_from_title(get_title_from_id(db, sample_index_id))
index_base_data
{'year': ' 1849 - 1850', 'vol': '1'}
Now we need to separate each line item into multiple items. The pandas dataframe can come to our aid here, with its ability to easily split the listed items in one cell out onto multiple rows:
import pandas as pd
example_subsidiary_df = pd.DataFrame(parse_index_line(repaired_sample_index2.split("\n")[4]))
example_subsidiary_df
|   | index_term | subsidiary | typ | page_numbers |
|---|---|---|---|---|
| 0 | Aboriginal chambers near Tilbury |  | S | [462] |
We can now “explode” that dataframe against the lists of page numbers to get one row per item:
example_subsidiary_df.explode('page_numbers').rename(columns={"page_numbers": "page_num"})
|   | index_term | subsidiary | typ | page_num |
|---|---|---|---|---|
| 0 | Aboriginal chambers near Tilbury |  | S | 462 |
Let’s see if we can now put all those pieces together. Essentially, for each index line, we need to generate the complete set of records we want to add to the database.
def construct_index_records(id_val=None, text=None, metadata=None, retval="explode"):
    """Generate a complete set of index records from an original search text document."""
    if id_val is None and text is None:
        return []

    text = download_and_extract_text(id_val, typ="djvutxt") if text is None else text

    records = []

    # Repair the text
    repaired_text = repair_index_missing_line_breaks( text )
    repaired_text = repair_index_unwanted_line_breaks( repaired_text )
    repaired_text = repair_index_subsidiary_separator( repaired_text )

    for line in repaired_text.split("\n"):
        if line:
            new_line = parse_index_line(line)
            records.extend(new_line)

    if retval not in ["df", "explode"] or id_val is None:
        # Return the list of dicts, without the metadata
        return records

    # WARNING - if we use provided text, the id_val and the text may actually be inconsistent
    index_base_data = get_index_metadata_from_title(get_title_from_id(db, id_val))

    # Generate a dataframe
    records_df = pd.DataFrame(records)
    records_df["source_id"] = id_val
    records_df["year"] = index_base_data["year"]
    records_df["vol"] = index_base_data["vol"]

    if retval == "explode":
        return records_df.explode('page_numbers').rename(columns={"page_numbers": "page_num"})
    elif retval == "df":
        return records_df
And when we run it:
construct_index_records(sample_index_id)[1000: 1200]
|      | index_term | typ | page_num | subsidiary | source_id | year | vol |
|---|---|---|---|---|---|---|---|
| 844  | Complutensian Polygiot | S | 402 |  | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 844  | Complutensian Polygiot | S | 431 |  | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 845  | Compton Street | S | 228 | Soho, | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 846  | Conrad of Salisbury’ | P | 8 | NaN | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 847  | Descriptio utriusque | P | NaN | NaN | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| ...  | ... | ... | ... | ... | ... | ... | ... |
| 1015 | D. | P | NaN | NaN | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 1016 | D. | S | NaN | on Lord Chatham's speech on American | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 1017 | D. | S | 12 | stamp act, | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 1018 | D. | S | 214 | iden frog, | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
| 1019 | D. | S | NaN |  | sim_notes-and-queries_1849-1850_1_index | 1849 - 1850 | 1 |
200 rows × 7 columns
It’s far from ideal, but at least gives us something to work with. So let’s add it to the database, and see how a search feels.
db["index_entries"].insert_all(construct_index_records(sample_index_id).to_dict(orient="records"))
<Table index_entries (source_id, year, vol, index_term, typ, subsidiary, page_num)>
Let’s try a search:
search_term = "cure"
q = f"""
SELECT * FROM index_entries_fts
WHERE index_entries_fts MATCH {db.quote(search_term)};
"""
read_sql(q, db.conn)
|   | source_id | index_term | subsidiary | year | vol | page_num |
|---|---|---|---|---|---|---|
| 0 | sim_notes-and-queries_1849-1850_1_index | Cure for the hooping-cough |  | 1849 - 1850 | 1 | 397 |
| 1 | sim_notes-and-queries_1849-1850_1_index | G. (J.) | charm to cure the murrain in cows, | 1849 - 1850 | 1 | 349 |
| 2 | sim_notes-and-queries_1849-1850_1_index | Warts | charms for cure of, | 1849 - 1850 | 1 | 349 |
| 3 | sim_notes-and-queries_1849-1850_1_index | Warts | charms for cure of, | 1849 - 1850 | 1 | 482 |
Let’s create a search index over all the index issues up to 1900 excluding the cumulative indexes.
First, grab all the indexes from the database and then filter to just the years we are interested in.
We really should support more convenient year-based searching, either by adding a year column to the table or by creating a custom query function to handle years.
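As a minimal sketch of the custom query function approach (the year_of name is our own invention), we could register a simple SQL function on the underlying connection so that queries don’t have to repeat the strftime/CAST incantation:
# Register a custom SQL function on the sqlite3 connection used by sqlite_utils
db.conn.create_function("year_of", 1, lambda dt: int(dt[:4]) if dt else None)

read_sql('SELECT id, year_of(datetime) AS year FROM metadata WHERE is_index=1 LIMIT 3', db.conn)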
q = """
SELECT * FROM (SELECT CAST(strftime("%Y", datetime) AS INT) AS year, *
FROM metadata
WHERE is_index=1 AND id NOT LIKE "%cumulative%") WHERE year < 1900;
"""
indexes = read_sql(q, db.conn)
indexes = indexes.to_dict(orient="records")
indexes[:3]
[{'year': 1850,
'id': 'sim_notes-and-queries_1849-1850_1_index',
'date': '1849 - 1850',
'datetime': '1850-03-20T00:00:00',
'series': None,
'vol': '1',
'iss': 'Index',
'title': 'Notes and Queries 1849 - 1850: Vol 1 Index',
'next_id': 'sim_notes-and-queries_1849-11-03_1_1',
'prev_id': '',
'is_index': 1,
'restricted': ''},
{'year': 1850,
'id': 'sim_notes-and-queries_1850_2_index',
'date': '1850',
'datetime': '1850-03-20T00:00:00',
'series': None,
'vol': '2',
'iss': 'Index',
'title': 'Notes and Queries 1850: Vol 2 Index',
'next_id': 'sim_notes-and-queries_1850-06-01_2_31',
'prev_id': 'sim_notes-and-queries_1850-05-25_1_30',
'is_index': 1,
'restricted': ''},
{'year': 1851,
'id': 'sim_notes-and-queries_1851_3_index',
'date': '1851',
'datetime': '1851-03-20T00:00:00',
'series': None,
'vol': '3',
'iss': 'Index',
'title': 'Notes and Queries 1851: Vol 3 Index',
'next_id': 'sim_notes-and-queries_1851-01-04_3_62',
'prev_id': 'sim_notes-and-queries_1850-12-28_2_61',
'is_index': 1,
'restricted': ''}]
# Import the tqdm progress bar tools
from tqdm.notebook import tqdm
db["index_entries"].delete_where()
db["index_entries_fts"].delete_where()
# List of indexes already loaded: indexes
for index_record in tqdm(indexes):
    index_records = construct_index_records(index_record["id"])
    db["index_entries"].insert_all( index_records.to_dict(orient="records"))
And how about a search…
search_term = "sin eater"
q = f"""
SELECT * FROM index_entries_fts
WHERE index_entries_fts MATCH {db.quote(search_term)};
"""
read_sql(q, db.conn)[:50]
|    | source_id | index_term | subsidiary | year | vol | page_num |
|---|---|---|---|---|---|---|
| 0  | sim_notes-and-queries_1851_4_index | B. (E. H.) | on mazer-wood and sin-eaters, | 1851 | 4 | 211 |
| 1  | sim_notes-and-queries_1851_4_index | Sin-eaters | notices respecting, | 1851 | 4 | 211 |
| 2  | sim_notes-and-queries_1852_6_index | E .) | on the sin-eater, | 1852 | 6 | 541 |
| 3  | sim_notes-and-queries_1852_6_index | I eeper ( (Alex.) | on the sin-eater, | 1852 | 6 | 541 |
| 4  | sim_notes-and-queries_1852_6_index | Symons Jelinger C.) | origin of sin-eater, | 1852 | 6 | 300 |
| 5  | sim_notes-and-queries_1870_6_index | Sin-eater | origin of the, | 1870 | 6 | 450 |
| 6  | sim_notes-and-queries_1876_6_index | Sin-eater |  | 1876 | 6 | 505 |
| 7  | sim_notes-and-queries_1876_6_index | Sin-eater |  | 1876 | 6 | 505 |
| 8  | sim_notes-and-queries_1877_7_index | B. (A. C.) | on the Sin-eater, | 1877 | 7 | 14 |
| 9  | sim_notes-and-queries_1877_7_index | Sin-eater |  | 1877 | 7 | 14 |
| 10 | sim_notes-and-queries_1883_7_index | Sin-eater |  | 1883 | 7 | 25 |
| 11 | sim_notes-and-queries_1883_7_index | Sin-eater |  | 1883 | 7 | 334 |
| 12 | sim_notes-and-queries_1883_8_index | Sin-eater |  | 1883 | 8 | 255 |
| 13 | sim_notes-and-queries_1895_8_index | Sin-eaters |  | 1895 | 8 | 288 |
| 14 | sim_notes-and-queries_1895_8_index | Sin-eaters |  | 1895 | 8 | 288 |
| 15 | sim_notes-and-queries_1895_8_index | Sin-eaters |  | 1895 | 8 | 332 |
| 16 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 169 |
| 17 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 296 |
| 18 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 110 |
| 19 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 111 |
| 20 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 109 |
| 21 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 169 |
| 22 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 236 |
| 23 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 296 |
| 24 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 111 |
| 25 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 110 |
| 26 | sim_notes-and-queries_1896_9_index | Owen (J. P.) | on sin-eaters, | 1896 | 9 | 109 |
| 27 | sim_notes-and-queries_1896_9_index | Owen (J. P.) | on sin-eaters, | 1896 | 9 | 236 |
| 28 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 110 |
| 29 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 109 |
| 30 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 169 |
| 31 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 236 |
| 32 | sim_notes-and-queries_1896_9_index | Sin-eaters |  | 1896 | 9 | 296 |
| 33 | sim_notes-and-queries_1896_9_index | Thomas (N. W.) | on sin-eaters, | 1896 | 9 | 169 |
search_term = "boggart"
q = f"""
SELECT * FROM index_entries_fts
WHERE index_entries_fts MATCH {db.quote(search_term)};
"""
read_sql(q, db.conn)[:50]
|    | source_id | index_term | subsidiary | year | vol | page_num |
|---|---|---|---|---|---|---|
| 0  | sim_notes-and-queries_1869_4_index | Boggarts and Feorin |  | 1869 | 4 | 508 |
| 1  | sim_notes-and-queries_1869_4_index | Boggarts and Feorin |  | 1869 | 4 | 508 |
| 2  | sim_notes-and-queries_1869_4_index | Higson (John) | on Boggarts and Feorin, | 1869 | 4 | 5 |
| 3  | sim_notes-and-queries_1870_5_index | Boggarts and Feorin |  | 1870 | 5 | 23 |
| 4  | sim_notes-and-queries_1870_5_index | Boggarts and Feorin |  | 1870 | 5 | 156 |
| 5  | sim_notes-and-queries_1870_5_index | Boggarts and Feorin |  | 1870 | 5 | 216 |
| 6  | sim_notes-and-queries_1870_5_index | Boggarts and Feorin |  | 1870 | 5 | 287 |
| 7  | sim_notes-and-queries_1870_5_index | Boggarts and Feorin |  | 1870 | 5 | 365 |
| 8  | sim_notes-and-queries_1870_5_index | Boggarts and Feorin |  | 1870 | 5 | 517 |
| 9  | sim_notes-and-queries_1870_5_index | Bowker (James) | on Boggarts, Feorin, &c. | 1870 | 5 | 365 |
| 10 | sim_notes-and-queries_1870_5_index | Boggarts | Feorin, | 1870 | 5 | 287 |
| 11 | sim_notes-and-queries_1870_5_index | Davies (Wm.) | on Boggarts, Feorin, &c., | 1870 | 5 | 216 |
| 12 | sim_notes-and-queries_1870_5_index | Hermentrude | on Boggarts, | 1870 | 5 | 23 |
| 13 | sim_notes-and-queries_1870_5_index | Boggarts | Feorin, &c., | 1870 | 5 | 156 |
| 14 | sim_notes-and-queries_1870_5_index | Riley (H. T.) | on Boggarts, Feorin, &c., | 1870 | 5 | 216 |
| 15 | sim_notes-and-queries_1870_5_index | Smith (W. J. B.) | on Boggarts, Feorin, &e., | 1870 | 5 | 317 |
| 16 | sim_notes-and-queries_1895_8_index | : Boggart, 85 | ah by a sheep, | 1895 | 8 | 170 |
| 17 | sim_notes-and-queries_1895_8_index | Boggart=ghost |  | 1895 | 8 | 85 |
| 18 | sim_notes-and-queries_1895_8_index | Boggart=ghost |  | 1895 | 8 | 255 |
Add the Index Tables to the Full Database#
Let’s also add the index tables to our full database.
db_name_full = "full_nq.db"
db_full = Database(db_name_full)
create_db_table_index_entries(db_full)
db_full["index_entries"].delete_where()
db_full["index_entries_fts"].delete_where()
# List of indexes already loaded: indexes
for index_record in tqdm(indexes):
    index_records = construct_index_records(index_record["id"])
    db_full["index_entries"].insert_all( index_records.to_dict(orient="records"))