# Generating a Full Text Searchable Database for Notes & Queries

Whilst the PDF documents corresponding to each issue of Notes and Queries are quite large files, the searchable, OCR-retrieved text documents are much smaller and can easily be added to a full-text searchable database.

We can create a simple, file based SQLite database that will provide a full-text search facility over each issue of Notes & Queries.

Recall that we previously downloaded the metadata for issues of Notes & Queries held by the Internet Archive to a CSV file.

We can load that metadata in from the CSV file using the function we created and put into a simple Python package directory previously:

from ia_utils.open_metadata_records import open_metadata_records

# Load the previously saved metadata records from the CSV file
# (assuming the function's default file location)
data_records = open_metadata_records()

data_records[:3]

[{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'date': '1849-11-03',
'title': 'Notes and Queries  1849-11-03: Vol 1 Iss 1',
'vol': '1',
'iss': '1',
'prev_id': 'sim_notes-and-queries_1849-1850_1_index',
'next_id': 'sim_notes-and-queries_1849-11-10_1_2',
'restricted': ''},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'date': '1849-11-10',
'title': 'Notes and Queries  1849-11-10: Vol 1 Iss 2',
'vol': '1',
'iss': '2',
'prev_id': 'sim_notes-and-queries_1849-11-03_1_1',
'next_id': 'sim_notes-and-queries_1849-11-17_1_3',
'restricted': ''},
{'id': 'sim_notes-and-queries_1849-11-17_1_3',
'date': '1849-11-17',
'title': 'Notes and Queries  1849-11-17: Vol 1 Iss 3',
'vol': '1',
'iss': '3',
'prev_id': 'sim_notes-and-queries_1849-11-10_1_2',
'next_id': 'sim_notes-and-queries_1849-11-24_1_4',
'restricted': ''}]


We also saved the data to a simple local database, so we could alternatively retrieve the data from there.

First open up a connection to the database:

from sqlite_utils import Database

db_name = "nq_demo.db"
db = Database(db_name)


And then make a simple query onto it:

from pandas import read_sql

q = "SELECT * FROM metadata;"

read_sql(q, db.conn)

id date datetime series vol iss title next_id prev_id is_index restricted
0 sim_notes-and-queries_1849-11-03_1_1 1849-11-03 1849-11-03T00:00:00 None 1 1 Notes and Queries 1849-11-03: Vol 1 Iss 1 sim_notes-and-queries_1849-11-10_1_2 sim_notes-and-queries_1849-1850_1_index 0
1 sim_notes-and-queries_1849-11-10_1_2 1849-11-10 1849-11-10T00:00:00 None 1 2 Notes and Queries 1849-11-10: Vol 1 Iss 2 sim_notes-and-queries_1849-11-17_1_3 sim_notes-and-queries_1849-11-03_1_1 0
2 sim_notes-and-queries_1849-11-17_1_3 1849-11-17 1849-11-17T00:00:00 None 1 3 Notes and Queries 1849-11-17: Vol 1 Iss 3 sim_notes-and-queries_1849-11-24_1_4 sim_notes-and-queries_1849-11-10_1_2 0

## Adding an issues Table to the Database

We already have a metadata table in the database, but we can also add more tables to it.

For at least the 19th century issues of Notes & Queries, a file is available for each issue that contains searchable text extracted from that issue. If we download those text files and add them to our own database, then we can create our own full text searchable database over the content of those issues.

Let’s create a simple table structure for the searchable text extracted from each issue of Notes & Queries containing the content and a unique identifier for each record.

We can relate this table to the metadata table through a foreign key. What this means is that for each entry in the issues table, we also expect to find an entry in the metadata table under the same identifier value.

We will also create a full text search table associated with the table:

%%writefile ia_utils/create_db_table_issues.py
def create_db_table_issues(db, drop=True):
    """Create an issues database table and an associated full-text search table."""
    table_name = "issues"
    # If required, drop any previously defined tables of the same name
    if drop:
        db[table_name].drop(ignore=True)
        db[f"{table_name}_fts"].drop(ignore=True)
    elif db[table_name].exists():
        print(f"Table {table_name} exists...")
        return

    # Create the table structure for the simple issues table
    db[table_name].create({
        "id": str,
        "content": str
    }, pk=("id"),
       foreign_keys=[("id", "metadata", "id")  # (local-table-id, foreign-table, foreign-table-id)
    ])

    # Enable full text search
    # This creates an extra virtual table (issues_fts) to support the full text search
    # A stemmer is applied to improve the efficacy of the full-text searching
    db[table_name].enable_fts(["id", "content"],
                              create_triggers=True, tokenize="porter")


Load that function in from the local package and call it:

from ia_utils.create_db_table_issues import create_db_table_issues

create_db_table_issues(db)
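The tokenize="porter" setting means that both the indexed text and the search terms are stemmed before matching. To see the effect, here is a minimal, self-contained sketch using Python's built-in sqlite3 module (the demo_fts table is purely illustrative, and the sketch assumes the underlying SQLite library was compiled with FTS5 support):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A throwaway FTS5 table using the same porter tokenizer
conn.execute("CREATE VIRTUAL TABLE demo_fts USING fts5(content, tokenize='porter')")
conn.execute("INSERT INTO demo_fts (content) VALUES ('local customs and manners')")

# The singular search term matches the plural text because
# "custom" and "customs" both stem to the same token
rows = conn.execute("SELECT content FROM demo_fts WHERE demo_fts MATCH 'custom'").fetchall()
print(rows)  # [('local customs and manners',)]
```

This is why a search on "customs" can also surface occurrences of "custom", as we will see later.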


To add the content data to the database, we need to download the searchable text associated with each record from the Internet Archive.

Before we add the data in bulk, let’s do a dummy run of the steps we need to follow.

First, we need to download the full text file from the Internet Archive, given a record identifier. We’ll use the first data record to provide us with the identifier:

data_records[0]

{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'date': '1849-11-03',
'title': 'Notes and Queries  1849-11-03: Vol 1 Iss 1',
'vol': '1',
'iss': '1',
'prev_id': 'sim_notes-and-queries_1849-1850_1_index',
'next_id': 'sim_notes-and-queries_1849-11-10_1_2',
'restricted': ''}


The download step takes the identifier and requests the OCR Search Text file.

We will download the Internet Archive files to the directory we specified previously.

from pathlib import Path

p = Path(dirname)


# Import the necessary packages
from internetarchive import download

# Download the OCR search text file for the first record
download(data_records[0]['id'], destdir=p, silent=True,
         formats=["OCR Search Text"])

[]


Recall that the files are downloaded into a directory with a name that corresponds to the record identifier.

import os

os.listdir( p / data_records[0]['id'])

['sim_notes-and-queries_1849-11-03_1_1_hocr_searchtext.txt.gz',
'sim_notes-and-queries_1849-11-03_1_1_hocr_pageindex.json.gz',
'sim_notes-and-queries_1849-11-03_1_1_hocr_searchtext.txt',
'sim_notes-and-queries_1849-11-03_1_1_page_numbers.json',
'sim_notes-and-queries_1849-11-03_1_1_hocr_pageindex.json']


We now need to uncompress the .txt.gz file to access the fully formed text file.

The gzip package provides us with the utility we need to access the contents of the archive file.

In fact, we don’t need to uncompress the file into the directory at all: we can open it and extract its contents “in memory”.

%%writefile ia_utils/get_txt_from_file.py
from pathlib import Path
import gzip

# Create a simple function to make it even easier to extract the full text content
def get_txt_from_file(id_val, typ="searchtext", dirname="ia-downloads"):
    """Extract the text content from a downloaded Internet Archive file."""
    # The dirname default is assumed to be the download directory specified previously
    if typ=="searchtext":
        p_ = Path(dirname) / id_val / f'{id_val}_hocr_searchtext.txt.gz'
        with gzip.open(p_, 'rb') as f:
            content = f.read().decode('utf-8')
    elif typ=="djvutxt":
        p_ = Path(dirname) / id_val / f'{id_val}_djvu.txt'
        with open(p_, 'r') as f:
            content = f.read()
    else:
        content = ""
    return content


Let’s see how it works, previewing the first 200 characters of the unarchived text file:

from ia_utils.get_txt_from_file import get_txt_from_file

get_txt_from_file(data_records[0]['id'])[:200]

' \n \n \n \nNOTES anp QUERIES:\nA Medium of Enter-Communication\nFOR\nLITERARY MEN, ARTISTS, ANTIQUARIES, GENEALOGISTS, ETC.\n‘* When found, make a note of.’—Carrain Corrie.\nVOLUME FIRST.\nNoveMBER, 1849—May, '


If we inspect the text in more detail, we see there are various things in it that we might want to simplify. For example, quotation marks appear in various guises, such as opening and closing quotes of different flavours. We could normalise these to a simpler form (for example, “straight” quotes ' and "). However, if opening and closing quotes are reliably recognised, they provide us with a simple basis for matching text contained within the quotes. So for now, let’s leave the originally detected quotes in place.
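For reference, the normalisation we are choosing not to apply might be sketched as follows (a minimal example; normalise_quotes is just an illustrative name):

```python
# Map the various curly quote characters onto "straight" equivalents
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # single opening/closing quotes
    "\u201c": '"', "\u201d": '"',    # double opening/closing quotes
})

def normalise_quotes(text):
    """Replace curly quotation marks with straight quotes."""
    return text.translate(QUOTE_MAP)

normalise_quotes("‘When found, make a note of.’")
# "'When found, make a note of.'"
```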

Having got a method in place, let’s now download the contents of the non-index issues for 1849.

q = """
SELECT id, title
FROM metadata
WHERE is_index = 0
    AND strftime('%Y', datetime) = '1849'
"""

results = read_sql(q, db.conn)
results

id title
0 sim_notes-and-queries_1849-11-03_1_1 Notes and Queries 1849-11-03: Vol 1 Iss 1
1 sim_notes-and-queries_1849-11-10_1_2 Notes and Queries 1849-11-10: Vol 1 Iss 2
2 sim_notes-and-queries_1849-11-17_1_3 Notes and Queries 1849-11-17: Vol 1 Iss 3
3 sim_notes-and-queries_1849-11-24_1_4 Notes and Queries 1849-11-24: Vol 1 Iss 4
4 sim_notes-and-queries_1849-12-01_1_5 Notes and Queries 1849-12-01: Vol 1 Iss 5
5 sim_notes-and-queries_1849-12-08_1_6 Notes and Queries 1849-12-08: Vol 1 Iss 6
6 sim_notes-and-queries_1849-12-15_1_7 Notes and Queries 1849-12-15: Vol 1 Iss 7
7 sim_notes-and-queries_1849-12-22_1_8 Notes and Queries 1849-12-22: Vol 1 Iss 8
8 sim_notes-and-queries_1849-12-29_1_9 Notes and Queries 1849-12-29: Vol 1 Iss 9

The data is returned from the read_sql() function as a pandas dataframe.

The pandas package provides a very powerful set of tools for working with tabular data, including being able to iterate over the rows of a table and apply a function to each one.

If we define a function to download the corresponding search text file from the Internet Archive and extract the text from the downloaded archive file, we can apply that function with a particular column value taken from each row of the dataframe and add the returned content to a new column in the same dataframe.

Here’s an example function:

%%writefile ia_utils/download_and_extract_text.py
from internetarchive import download
from ia_utils.get_txt_from_file import get_txt_from_file

def download_and_extract_text(id_val, typ="searchtext", dirname="ia-downloads", verbose=False):
    """Download search text from Internet Archive, extract the text and return it."""
    # The dirname default is assumed to be the download directory specified previously
    if verbose:
        print(f"Downloading {id_val}")
    if typ=="searchtext":
        download(id_val, destdir=dirname, silent=True,
                 formats=["OCR Search Text"])
    elif typ=="djvutxt":
        download(id_val, destdir=dirname, silent=True,
                 formats=["DjVuTXT"])
    else:
        return ''

    text = get_txt_from_file(id_val, typ=typ)
    return text


The Python pandas package natively provides an apply() function. However, the tqdm progress bar package also provides an “apply with progress bar” function, .progress_apply() if we enable the appropriate extensions:

# Download the tqdm progress bar tools
from tqdm.notebook import tqdm

# And enable the pandas extensions
tqdm.pandas()


Let’s apply our download_and_extract_text() function to each row of our records table for 1849, keeping track of progress with a progress bar:

from ia_utils.download_and_extract_text import download_and_extract_text

results["content"] = results["id"].progress_apply(download_and_extract_text)
results

id title content
0 sim_notes-and-queries_1849-11-03_1_1 Notes and Queries 1849-11-03: Vol 1 Iss 1 \n \n \n \nNOTES anp QUERIES:\nA Medium of En...
1 sim_notes-and-queries_1849-11-10_1_2 Notes and Queries 1849-11-10: Vol 1 Iss 2 |\n \nA MEDIUM OF INTER-COMMUNICATION\nFOR\nLI...
2 sim_notes-and-queries_1849-11-17_1_3 Notes and Queries 1849-11-17: Vol 1 Iss 3 \n \n \nceeeeeeeeee eee\nA MEDIUM OF\nLITERAR...
3 sim_notes-and-queries_1849-11-24_1_4 Notes and Queries 1849-11-24: Vol 1 Iss 4 \n \n \n \n~ NOTES anp QUERIES:\nA MEDIUM OF\...
4 sim_notes-and-queries_1849-12-01_1_5 Notes and Queries 1849-12-01: Vol 1 Iss 5 \n \n \nNOTES anp\nQUERIES:\nA MEDIUM OF INTE...
5 sim_notes-and-queries_1849-12-08_1_6 Notes and Queries 1849-12-08: Vol 1 Iss 6 \n \nOTES\nAND QUERIES\nA MEDIUM OF INTER-COM...
6 sim_notes-and-queries_1849-12-15_1_7 Notes and Queries 1849-12-15: Vol 1 Iss 7 \n \n \nNOTES\nA MEDIUM OF\nAND QUERIES\nINTE...
7 sim_notes-and-queries_1849-12-22_1_8 Notes and Queries 1849-12-22: Vol 1 Iss 8 \n \nNOTES ann QUERIES\nA MEDIUM OF\nINTER-CO...
8 sim_notes-and-queries_1849-12-29_1_9 Notes and Queries 1849-12-29: Vol 1 Iss 9 \nNOTE\nA MEDIUM OF\nAND QUERIES\nINTER-COMMU...

We can now add that data table directly to our database using the pandas .to_sql() method:

# Add the issue database table
table_name = "issues"
results[["id", "content"]].to_sql(table_name, db.conn, index=False, if_exists="append")


Note that this recipe does not represent a very efficient way of handling things: the pandas dataframe is held in memory, so as we add more rows, the memory requirements to store the data increase. A more efficient approach might be to create a function that retrieves each file, adds its contents to the database, and then perhaps even deletes the downloaded file, rather than adding the content to the in-memory dataframe.
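That lower-memory alternative might be sketched as follows, using a generator so that only one issue's content is ever in flight at a time (iter_issue_rows and the injected fetch function are illustrative names; sqlite_utils' insert_all() will happily consume a generator):

```python
def iter_issue_rows(records, fetch):
    """Yield one issue row at a time, so content is never accumulated in memory."""
    for record in records:
        # fetch is any callable that maps an identifier to its text content,
        # e.g. download_and_extract_text
        yield {"id": record["id"], "content": fetch(record["id"])}

# Usage sketch: stream the rows straight into the database table
# db["issues"].insert_all(iter_issue_rows(data_records, download_and_extract_text))
```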

Let’s see if we can query it, first at the basic table level:

q = """
SELECT id, content
FROM issues
WHERE LOWER(content) LIKE "%customs%"
"""

read_sql(q, db.conn)

id content
0 sim_notes-and-queries_1849-11-17_1_3 \n \n \nceeeeeeeeee eee\nA MEDIUM OF\nLITERAR...
1 sim_notes-and-queries_1849-11-24_1_4 \n \n \n \n~ NOTES anp QUERIES:\nA MEDIUM OF\...
2 sim_notes-and-queries_1849-12-15_1_7 \n \n \nNOTES\nA MEDIUM OF\nAND QUERIES\nINTE...
3 sim_notes-and-queries_1849-12-29_1_9 \nNOTE\nA MEDIUM OF\nAND QUERIES\nINTER-COMMU...

This is not overly helpful, perhaps. We can do better with the full text search, which will also allow us to return a snippet around the first, or highest ranked, location of any matched search terms:

search_term = "customs"

q = f"""
SELECT id, snippet(issues_fts, -1, "__", "__", "...", 10) AS clip
FROM issues_fts WHERE issues_fts MATCH {db.quote(search_term)} ;
"""

read_sql(q, db.conn)


id clip
0 sim_notes-and-queries_1849-11-10_1_2 ...At length the __custom__ became general in ...
1 sim_notes-and-queries_1849-11-17_1_3 ...royal domains, leases of __customs__, &c., ...
2 sim_notes-and-queries_1849-11-24_1_4 ...the Manners and __Customs__ of Ancient Gree...
3 sim_notes-and-queries_1849-12-01_1_5 ...Morning, as was his __Custom__, attended by...
4 sim_notes-and-queries_1849-12-15_1_7 ...So far as English usages and __customs__ ar...
5 sim_notes-and-queries_1849-12-22_1_8 ...Sessions House and the __Custom__ House of ...
6 sim_notes-and-queries_1849-12-29_1_9 ...elucidation of old world __customs__ and ob...

This is okay as far as it goes: we can identify issues of Notes and Queries that contain a particular search term, retrieve the whole document, and even display a concordance for the first (or highest ranking) occurrence of the search term(s) to provide context for the response. But it’s not ideal. For example, to display a concordance of each term in the full text document that matches our search term, we need to generate our own concordance, which may be difficult where matches are inexact (for example, if the match relies on stemming). There are also many pages in each issue of Notes and Queries, and it would be useful if we could get results at a better level of granularity.
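By way of example, a simple concordance generator for exact (non-stemmed) matches might look like this (a sketch; the concordance function is an illustrative name):

```python
import re

def concordance(text, term, width=30):
    """Return each occurrence of term with width characters of context either side."""
    clips = []
    for m in re.finditer(re.escape(term), text, re.IGNORECASE):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        # Flatten newlines so each clip sits on a single line
        clips.append(text[start:end].replace("\n", " "))
    return clips

concordance("old customs and new Customs", "customs", width=4)
# ['old customs and', 'new Customs']
```

Handling inexact, stemmed matches would require something more involved, which is exactly the difficulty noted above.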

The ouseful_sqlite_search_utils package includes various functions that allow us to tunnel into a text document and retrieve component parts of it. The tools aren’t necessarily the fastest utilities to run, particularly on large databases, but they get there eventually.

One particular utility will split a document into sentences and return each sentence on a separate row of a newly created virtual table. We can then search within these values for our search term, although we are limited to running exact match queries, rather than the more forgiving full text search queries:

from ouseful_sqlite_search_utils import snippets

snippets.register_snippets(db.conn)

q = """
SELECT * FROM
    (SELECT id, sentence
     FROM issues, get_sentences(1, NULL, issues.content)
     WHERE issues.id = "sim_notes-and-queries_1849-11-10_1_2")
WHERE sentence LIKE "% custom %"
"""

# Show the full result record in each case
read_sql(q, db.conn).to_dict(orient="records")


[{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'An examination of the structure of books of this period would confirm this view, and show that their apparent clumsiness is to be explained by the facility it was then the custom to afford for the interpolation or extraction of “sheets,” by a contrivance somewhat resembling that\npapers in a cover, and known as the “ patent leaf-holder,”\n'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'At length the custom became general in Aden ; and it was not only drunk in the\nnight by those who were desirous of being kept awake, but in the day for the sake of its other agreeable qualities.'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'From hence the custom extended itself to many other towns of Arabia, particularly to Medina, and then to Grand Cairo in Egypt, where the dervises of Yemen, who lived in a district by themselves, drank coffee on the nights they intended to spend in\n| devotion.'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'In that year, Soliman Aga, ambassador from the Sultan Mahomet the Fourth, arrived, who, with his retinue, brought a considerable quantity of coffee with them, and made presents of it to per- sons both of the court and city, and is sup- posed to have established the custom of drinking it.'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'How often\nman stumble upon some elucidation of a doubtful |\nphrase, or disputed passage ;—some illustration of an obsolete custom hitherto unnoticed ; — some biogra- phical aneedote or precise date hitherto unrecorded ;— some book, or some edition, hitherto unknown or im- perfectly described.'}]


## Extracting Pages

To make for more efficient granular searching, it would be useful if our content was stored in a more granular way.

Ideally, we would extract items at the “article” level, but there is no simple way of chunking the document at this level. We could process it to extract items at the sentence or paragraph level and add those to their own table, but that might be too granular.

However, by inspection of the files available for each issue, there appears to be another level of organisation that we can access: the page level.

Page metadata is provided in the form of two files:

• OCR Page Index: downloaded as a compressed .gz file; the expanded file contains a list of lists, with one inner list per page. Each inner list contains four integers; the first and second integers give the character indexes into the search text file of the first and last characters on the corresponding page;

• Page Numbers JSON: downloaded as an uncompressed JSON file; it contains a JSON object whose "pages" attribute returns a list of records. Each record has four attributes: "leafNum": int (starting with index value 1), "ocr_value": list (a list of candidate OCR values), "pageNumber": str and "confidence": float. A top-level "confidence" attribute gives an indication of how likely it is that page numbers are available across the whole document.

We also need the OCR Search Text file.

Warning

There seems to be an off-by-one error in the page numbering. For example, the scanned pages for https://archive.org/details/sim_notes-and-queries_1859-08-06_8_188/ range from page 101 to 120, whereas the ...page_numbers.json file claims, with 100% confidence, that the pages range from 102 to 121.

Let’s get a complete set of necessary files for a few sample records:

%%writefile ia_utils/download_ia_records_by_format.py
# Download the tqdm progress bar tools
from tqdm.notebook import tqdm

from pathlib import Path
from internetarchive import download

def download_ia_records_by_format(records, dirname=".", formats=None):
    """Download the files of the specified formats for each record."""
    formats = formats if formats else ["OCR Search Text", "OCR Page Index", "Page Numbers JSON"]

    for record in tqdm(records):
        _id = record['id']
        download(_id, destdir=dirname,
                 formats=formats,
                 silent = True)

from ia_utils.download_ia_records_by_format import download_ia_records_by_format

# Grab page counts and page structure files
sample_records = data_records[:5]

download_ia_records_by_format(sample_records, dirname)



We now need to figure out how to open and parse the page index and page numbers files, and check the lists are the correct lengths.

The Python zip function lets us “zip” together elements from different, parallel lists. We can also insert the same item, repeatedly, into each row by using the itertools.repeat() function to generate as many repetitions of that item as are required:

import itertools


Example of using itertools.repeat():

# Example of list
list(zip(itertools.repeat("a"), [1, 2], ["x","y"]))

[('a', 1, 'x'), ('a', 2, 'y')]


We can now use this approach to create a zipped combination of the record ID values, page numbers and page character indexes.

import gzip
import json
import itertools

#for record in tqdm(sample_records):

record = sample_records[0]

id_val = record['id']
p_ = Path(dirname) / id_val

# Get the page numbers
with open(p_ / f'{id_val}_page_numbers.json', 'r') as f:
    page_numbers = json.load(f)

# Get the page character indexes
with gzip.open(p_ / f'{id_val}_hocr_pageindex.json.gz', 'rb') as g:
    # The last element seems to be redundant
    page_indexes = json.loads(g.read())[:-1]

# Optionally test the record counts are the same for page numbers and character indexes
#assert len(page_indexes) == len(page_numbers['pages'])

# Preview the result
list(zip(itertools.repeat(id_val), page_numbers['pages'], page_indexes))[:5]

[('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 1, 'ocr_value': [], 'pageNumber': '', 'confidence': 0},
[0, 301, 559, 14345]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 2, 'ocr_value': ['4', '3'], 'pageNumber': '', 'confidence': 0},
[301, 307, 14345, 15954]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 3, 'ocr_value': ['2'], 'pageNumber': '2', 'confidence': 100},
[307, 3212, 15954, 101879]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 4, 'ocr_value': ['3'], 'pageNumber': '3', 'confidence': 100},
[3212, 7431, 101879, 228974]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 5, 'ocr_value': [], 'pageNumber': '4', 'confidence': 100},
[7431, 12267, 228974, 370105])]


We could add this page related data directly to the pages table, or we could create another simple database table to store it.

Here’s what a separate table might look like:

%%writefile ia_utils/create_db_table_pages_metadata.py
def create_db_table_pages_metadata(db, drop=True):
    """Create a pages_metadata database table."""
    table_name = "pages_metadata"
    # If required, drop any previously defined table of the same name
    if drop:
        db[table_name].drop(ignore=True)
    db[table_name].create({
        "id": str,
        "page_idx": int, # This is just a count as we work through the pages
        "page_char_start": int,
        "page_char_end": int,
        "page_leaf_num": int,
        "page_num": str, # This is to allow for things like Roman numerals
        # Should we perhaps try to cast an int for the page number
        # and have a page_num_str for the original?
        "page_num_conf": float # A confidence value relating to the page number detection
    }, pk=("id", "page_idx")) # compound foreign keys not currently available via sqlite_utils?


Import that function from the local package and run it:

from ia_utils.create_db_table_pages_metadata import create_db_table_pages_metadata

create_db_table_pages_metadata(db)

The following function “zips” together the contents of the page index and page numbers files. Each “line item” is a rather unwieldy mishmash of elements, but we’ll deal with those in a moment:

%%writefile ia_utils/raw_pages_metadata.py
import itertools
import json
import gzip
from pathlib import Path

def raw_pages_metadata(id_val, dirname="ia-downloads"):
    """Generate raw page metadata items from the page numbers and page index files."""
    # The dirname default is assumed to be the download directory specified previously
    p_ = Path(dirname) / id_val

    # Get the page numbers
    with open(p_ / f'{id_val}_page_numbers.json', 'r') as f:
        page_numbers = json.load(f)

    # Get the page character indexes
    with gzip.open(p_ / f'{id_val}_hocr_pageindex.json.gz', 'rb') as g:
        # The last element seems to be redundant, so we can ignore that value
        page_indexes = json.loads(g.read())[:-1]

    # Add the id and an index count
    return zip(itertools.repeat(id_val), range(len(page_indexes)),
               page_numbers['pages'], page_indexes)


For each line item in the zipped datastructure, we can parse out values into a more readable data object:

%%writefile ia_utils/parse_page_metadata.py

def parse_page_metadata(item):
    """Parse out page attributes from the raw page metadata construct."""
    _id = item[0]
    page_idx = item[1]
    _page_nums = item[2]
    ix = item[3]
    obj = {'id': _id,
           'page_idx': page_idx, # Maintain our own count, just in case; should be page_leaf_num-1
           'page_char_start': ix[0],
           'page_char_end': ix[1],
           'page_leaf_num': _page_nums['leafNum'],
           'page_num': _page_nums['pageNumber'],
           'page_num_conf': _page_nums['confidence']
          }
    return obj


Let’s see how that looks:

from ia_utils.raw_pages_metadata import raw_pages_metadata
from ia_utils.parse_page_metadata import parse_page_metadata

# Preview the first parsed page metadata record
for item in raw_pages_metadata(sample_records[0]['id']):
    print(parse_page_metadata(item))
    break

{'id': 'sim_notes-and-queries_1849-11-03_1_1', 'page_idx': 0, 'page_char_start': 0, 'page_char_end': 301, 'page_leaf_num': 1, 'page_num': '', 'page_num_conf': 0}


We can now trivially add the page metadata to the pages_metadata database table. Let’s try it with our sample:

%%writefile ia_utils/add_page_metadata_to_db.py
from ia_utils.raw_pages_metadata import raw_pages_metadata
from ia_utils.parse_page_metadata import parse_page_metadata

def add_page_metadata_to_db(db, records, verbose=False):
    """Add page metadata records to the pages_metadata database table."""
    for record in records:
        id_val = record["id"]
        if verbose:
            print(id_val)

        # Add records to the database
        db["pages_metadata"].insert_all(
            [parse_page_metadata(item) for item in raw_pages_metadata(id_val)])

And run it, with the page metadata records for each issue selected via its id_val:

from ia_utils.add_page_metadata_to_db import add_page_metadata_to_db

# Clear the db table
db["pages_metadata"].delete_where()

add_page_metadata_to_db(db, sample_records)


Let’s see how that looks:

from pandas import read_sql

q = "SELECT * FROM pages_metadata LIMIT 5"

read_sql(q, db.conn)

id page_idx page_char_start page_char_end page_leaf_num page_num page_num_conf
0 sim_notes-and-queries_1849-11-03_1_1 0 0 301 1 0.0
1 sim_notes-and-queries_1849-11-03_1_1 1 301 307 2 0.0
2 sim_notes-and-queries_1849-11-03_1_1 2 307 3212 3 2 100.0
3 sim_notes-and-queries_1849-11-03_1_1 3 3212 7431 4 3 100.0
4 sim_notes-and-queries_1849-11-03_1_1 4 7431 12267 5 4 100.0

Alternatively, we can view the results as a Python dictionary:

read_sql(q, db.conn).to_dict(orient="records")[:3]

[{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 0,
'page_char_start': 0,
'page_char_end': 301,
'page_leaf_num': 1,
'page_num': '',
'page_num_conf': 0.0},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 1,
'page_char_start': 301,
'page_char_end': 307,
'page_leaf_num': 2,
'page_num': '',
'page_num_conf': 0.0},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 2,
'page_char_start': 307,
'page_char_end': 3212,
'page_leaf_num': 3,
'page_num': '2',
'page_num_conf': 100.0}]


For each file containing the search text for a particular issue, we also need a routine to extract the page level content. Which is to say, we need to chunk the content based on the character indexes associated with the first and last characters of each page in the corresponding search text file.

This essentially boils down to:

• grabbing the page index values;

• grabbing the page search text;

• chunking the search text according to the page index values.

We can apply a page chunker at the document level, paginating the content file, and adding things to the database.

%%writefile ia_utils/chunk_page_text.py
from pandas import read_sql

from ia_utils.get_txt_from_file import get_txt_from_file

def chunk_page_text(db, id_val):
    """Chunk text according to page_index values."""

    q = f'SELECT * FROM pages_metadata WHERE id="{id_val}"'
    page_indexes = read_sql(q, db.conn).to_dict(orient="records")

    text = get_txt_from_file(id_val)

    for ix in page_indexes:
        ix["page_text"] = text[ix["page_char_start"]:ix["page_char_end"]].strip()

    return page_indexes


Let’s see if we’ve managed to pull out some page text:

from ia_utils.chunk_page_text import chunk_page_text

# Create a sample index ID
sample_id_val = sample_records[0]["id"]

# Get the chunked text back as part of the metadata record
sample_pages = chunk_page_text(db, sample_id_val)

sample_pages[:3]

[{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 0,
'page_char_start': 0,
'page_char_end': 301,
'page_leaf_num': 1,
'page_num': '',
'page_num_conf': 0.0,
'page_text': 'NOTES anp QUERIES:\nA Medium of Enter-Communication\nFOR\nLITERARY MEN, ARTISTS, ANTIQUARIES, GENEALOGISTS, ETC.\n‘* When found, make a note of.’—Carrain Corrie.\nVOLUME FIRST.\nNoveMBER, 1849—May, 1850.\nLONDON: GEORGE BELL, 186. FLEET STREET 1850.\n[SOLD BY ALL BOOKSELLERS AND NEWSMEN. |'},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 1,
'page_char_start': 301,
'page_char_end': 307,
'page_leaf_num': 2,
'page_num': '',
'page_num_conf': 0.0,
'page_text': ''},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 2,
'page_char_start': 307,
'page_char_end': 3212,
'page_leaf_num': 3,
'page_num': '2',
'page_num_conf': 100.0,
'page_text': '...'}]

### Modifying the pages_metadata Table in the Database

Using the chunk_page_text() function, we can add page content to our pages metadata in memory. But what if we want to add it to the database? The pages_metadata table already exists, but does not include a text column. However, we can modify that table to include just such a column:

db["pages_metadata"].add_column("page_text", str)

<Table pages_metadata (id, page_idx, page_char_start, page_char_end, page_leaf_num, page_num, page_num_conf, page_text)>


We can also enable a full text search facility over the table. Our interest is primarily in searching over the page_text, but if we include a couple of other columns, that can help us key into records in other tables.

# Enable full text search
# This creates an extra virtual table to support the full text search
db["pages_metadata"].enable_fts(["id", "page_idx", "page_text"],
                                create_triggers=True, tokenize="porter")

<Table pages_metadata (id, page_idx, page_char_start, page_char_end, page_leaf_num, page_num, page_num_conf, page_text)>


We can now update the records in the pages_metadata table so they include the page_text:

q = 'SELECT DISTINCT(id) FROM pages_metadata;'
id_vals = read_sql(q, db.conn).to_dict(orient="records")

for sample_id_val in id_vals:
    updated_pages = chunk_page_text(db, sample_id_val["id"])
    db["pages_metadata"].upsert_all(updated_pages, pk=("id", "page_idx"))


We should now be able to search at the page level:

search_term = "customs"

q = f"""
SELECT id, page_idx, page_text
FROM pages_metadata_fts
WHERE pages_metadata_fts MATCH {db.quote(search_term)} LIMIT 10;
"""

read_sql(q, db.conn)


id page_idx page_text
0 sim_notes-and-queries_1849-11-10_1_2 5 22 NOTES\n \nAND QUERIES.\nCatalogue — in whic...
1 sim_notes-and-queries_1849-11-10_1_2 8 Nov. 10. 1849.)\nNOTES AND QUERIES.\n25\n \nne...
2 sim_notes-and-queries_1849-11-10_1_2 9 bring with him some coffee, which he believed ...
3 sim_notes-and-queries_1849-11-10_1_2 12 Nov. 10. 1849.]\nActing her passions on our st...
4 sim_notes-and-queries_1849-11-10_1_2 14 ~—\n \n|\n \nNov. 10. 1849.]\nNOTES AND QUERIE...
5 sim_notes-and-queries_1849-11-17_1_3 8 = 17. 1849.] }\nreceive his representations an...
6 sim_notes-and-queries_1849-11-24_1_4 2 ~vwe eS | FY\nweNTe 6 FS-r—lCUcUOrlClC hLOOlhC...
7 sim_notes-and-queries_1849-11-24_1_4 6 NOTES AND QUERIES.\n \n \n \nNov. 24. 1849.]\n...
8 sim_notes-and-queries_1849-11-24_1_4 15 NOTES AND QUERIES.\nJust published, Part II., ...
9 sim_notes-and-queries_1849-12-01_1_5 5 NOTES AND QUERIES.\n \nmore than three Passeng...

We can then bring in additional columns from the original pages_metadata table:

search_term = "customs"

q = f"""
SELECT pages_metadata.page_num, pages_metadata.page_leaf_num,
       pages_metadata_fts.id, pages_metadata_fts.page_idx, pages_metadata_fts.page_text
FROM pages_metadata_fts
JOIN pages_metadata
    ON pages_metadata_fts.id = pages_metadata.id
    AND pages_metadata_fts.page_idx = pages_metadata.page_idx
WHERE pages_metadata_fts MATCH {db.quote(search_term)} LIMIT 10;
"""

read_sql(q, db.conn)


page_num page_leaf_num id page_idx page_text
0 23 6 sim_notes-and-queries_1849-11-10_1_2 5 22 NOTES\n \nAND QUERIES.\nCatalogue — in whic...
1 26 9 sim_notes-and-queries_1849-11-10_1_2 8 Nov. 10. 1849.)\nNOTES AND QUERIES.\n25\n \nne...
2 27 10 sim_notes-and-queries_1849-11-10_1_2 9 bring with him some coffee, which he believed ...
3 30 13 sim_notes-and-queries_1849-11-10_1_2 12 Nov. 10. 1849.]\nActing her passions on our st...
4 32 15 sim_notes-and-queries_1849-11-10_1_2 14 ~—\n \n|\n \nNov. 10. 1849.]\nNOTES AND QUERIE...
5 42 9 sim_notes-and-queries_1849-11-17_1_3 8 = 17. 1849.] }\nreceive his representations an...
6 52 3 sim_notes-and-queries_1849-11-24_1_4 2 ~vwe eS | FY\nweNTe 6 FS-r—lCUcUOrlClC hLOOlhC...
7 56 7 sim_notes-and-queries_1849-11-24_1_4 6 NOTES AND QUERIES.\n \n \n \nNov. 24. 1849.]\n...
8 65 16 sim_notes-and-queries_1849-11-24_1_4 15 NOTES AND QUERIES.\nJust published, Part II., ...
9 71 6 sim_notes-and-queries_1849-12-01_1_5 5 NOTES AND QUERIES.\n \nmore than three Passeng...

## Automatically Populating the pages Table from the issues Table

Rather than manually adding the page data to the pages table, we can automatically create the pages table from the content contained in the issues table and the page metadata in the metadata table.

TO DO - CREATE TABLE AS ;

• maybe also, as an extra, demonstrate how to generate this automatically from a trigger

• discuss the various advantages and disadvantages of each approach; one is a step-wise pipeline (CREATE TABLE ... AS), the other is reactive and automatic (a trigger)
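As a starting point for the CREATE TABLE ... AS approach, something like the following might work (a sketch only: the pages table name and the create_pages_table helper are illustrative; note that SQLite's SUBSTR() is 1-indexed, hence the +1 offset on the 0-indexed character start values):

```python
import sqlite3

# Derive a pages table by joining each issue's content
# against the page character indexes in pages_metadata
PAGES_AS_SQL = """
CREATE TABLE IF NOT EXISTS pages AS
    SELECT pages_metadata.id, pages_metadata.page_idx,
           SUBSTR(issues.content,
                  pages_metadata.page_char_start + 1,
                  pages_metadata.page_char_end - pages_metadata.page_char_start
           ) AS page_text
    FROM pages_metadata
        JOIN issues ON pages_metadata.id = issues.id;
"""

def create_pages_table(conn):
    """Populate a pages table from the issues content and pages_metadata indexes."""
    conn.execute(PAGES_AS_SQL)
    conn.commit()

# Usage sketch: create_pages_table(db.conn)
```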