Generating a Full Text Searchable Database for Notes & Queries#
Whilst the PDF documents corresponding to each issue of Notes and Queries are quite large files, the searchable, OCR retrieved text documents are much smaller and can be easily added to a full-text searchable database.
We can create a simple, file based SQLite database that will provide a full-text search facility over each issue of Notes & Queries.
Recall that we previously downloaded the metadata for issues of Notes & Queries held by the Internet Archive to a CSV file.
We can load that metadata in from the CSV file using the function we created and put into a simple Python package directory previously:
from ia_utils.open_metadata_records import open_metadata_records
data_records = open_metadata_records()
data_records[:3]
[{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'date': '1849-11-03',
'title': 'Notes and Queries 1849-11-03: Vol 1 Iss 1',
'vol': '1',
'iss': '1',
'prev_id': 'sim_notes-and-queries_1849-1850_1_index',
'next_id': 'sim_notes-and-queries_1849-11-10_1_2',
'restricted': ''},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'date': '1849-11-10',
'title': 'Notes and Queries 1849-11-10: Vol 1 Iss 2',
'vol': '1',
'iss': '2',
'prev_id': 'sim_notes-and-queries_1849-11-03_1_1',
'next_id': 'sim_notes-and-queries_1849-11-17_1_3',
'restricted': ''},
{'id': 'sim_notes-and-queries_1849-11-17_1_3',
'date': '1849-11-17',
'title': 'Notes and Queries 1849-11-17: Vol 1 Iss 3',
'vol': '1',
'iss': '3',
'prev_id': 'sim_notes-and-queries_1849-11-10_1_2',
'next_id': 'sim_notes-and-queries_1849-11-24_1_4',
'restricted': ''}]
We also saved the data to a simple local database, so we could alternatively retrieve the data from there.
First open up a connection to the database:
from sqlite_utils import Database
db_name = "nq_demo.db"
db = Database(db_name)
And then make a simple query onto it:
from pandas import read_sql
q = "SELECT * FROM metadata;"
data_records_from_db = read_sql(q, db.conn)
data_records_from_db.head(3)
| | id | date | datetime | series | vol | iss | title | next_id | prev_id | is_index | restricted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sim_notes-and-queries_1849-11-03_1_1 | 1849-11-03 | 1849-11-03T00:00:00 | None | 1 | 1 | Notes and Queries 1849-11-03: Vol 1 Iss 1 | sim_notes-and-queries_1849-11-10_1_2 | sim_notes-and-queries_1849-1850_1_index | 0 | |
| 1 | sim_notes-and-queries_1849-11-10_1_2 | 1849-11-10 | 1849-11-10T00:00:00 | None | 1 | 2 | Notes and Queries 1849-11-10: Vol 1 Iss 2 | sim_notes-and-queries_1849-11-17_1_3 | sim_notes-and-queries_1849-11-03_1_1 | 0 | |
| 2 | sim_notes-and-queries_1849-11-17_1_3 | 1849-11-17 | 1849-11-17T00:00:00 | None | 1 | 3 | Notes and Queries 1849-11-17: Vol 1 Iss 3 | sim_notes-and-queries_1849-11-24_1_4 | sim_notes-and-queries_1849-11-10_1_2 | 0 | |
Adding an `issues` Table to the Database#
We already have a metadata table in the database, but we can also add more tables to it.
For at least the 19th century issues of Notes & Queries, a file is available for each issue that contains searchable text extracted from that issue. If we download those text files and add them to our own database, then we can create our own full text searchable database over the content of those issues.
Let’s create a simple table structure for the searchable text extracted from each issue of Notes & Queries containing the content and a unique identifier for each record.
We can relate this table to the metadata table through a foreign key. What this means is that for each entry in the issues table, we also expect to find an entry in the metadata table under the same identifier value.
We will also create a full text search table associated with the table:
%%writefile ia_utils/create_db_table_issues.py
def create_db_table_issues(db, drop=True):
    """Create an issues database table and an associated full-text search table."""
    table_name = "issues"
    # If required, drop any previously defined tables of the same name
    if drop:
        db[table_name].drop(ignore=True)
        db[f"{table_name}_fts"].drop(ignore=True)
    elif db[table_name].exists():
        print(f"Table {table_name} exists...")
        return
    # Create the table structure for the simple issues table
    db[table_name].create({
        "id": str,
        "content": str
    }, pk=("id"),
       foreign_keys=[("id", "metadata", "id")]) # (local-table-id, foreign-table, foreign-table-id)
    # Enable full text search
    # This creates an extra virtual table (issues_fts) to support the full text search
    # A stemmer is applied to improve the effectiveness of the full-text search
    db[table_name].enable_fts(["id", "content"],
                              create_triggers=True, tokenize="porter")
Load that function in from the local package and call it:
from ia_utils.create_db_table_issues import create_db_table_issues
create_db_table_issues(db)
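As a quick, optional check, we can list the tables now defined in the database; we would expect to see the `issues` table and its associated `issues_fts` full-text search support table alongside the `metadata` table, although the exact list will depend on what else has previously been added:

# Preview the table names currently defined in the database
db.table_names()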
To add the content data to the database, we need to download the searchable text associated with each record from the Internet Archive.
Before we add the data in bulk, let’s do a dummy run of the steps we need to follow.
First, we need to download the full text file from the Internet Archive, given a record identifier. We’ll use the first data record to provide us with the identifier:
data_records[0]
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'date': '1849-11-03',
'title': 'Notes and Queries 1849-11-03: Vol 1 Iss 1',
'vol': '1',
'iss': '1',
'prev_id': 'sim_notes-and-queries_1849-1850_1_index',
'next_id': 'sim_notes-and-queries_1849-11-10_1_2',
'restricted': ''}
The download step takes the identifier and requests the `OCR Search Text` file.
We will download the Internet Archive files to the directory we specified previously.
from pathlib import Path
# Create download dir file path, as before
dirname = "ia-downloads" # This is a default
p = Path(dirname)
And now download the text file for the sample record:
# Import the necessary packages
from internetarchive import download
download(data_records[0]['id'], destdir=p, silent = True,
formats=["OCR Search Text"])
[]
Recall that the files are downloaded into a directory with a name that corresponds to the record identifier.
The data files are actually downloaded as compressed archive files, as we can see if we review the download directory we saved our test download to:
import os
os.listdir( p / data_records[0]['id'])
['sim_notes-and-queries_1849-11-03_1_1_hocr_searchtext.txt.gz',
'sim_notes-and-queries_1849-11-03_1_1_hocr_pageindex.json.gz',
'sim_notes-and-queries_1849-11-03_1_1_hocr_searchtext.txt',
'sim_notes-and-queries_1849-11-03_1_1_page_numbers.json',
'sim_notes-and-queries_1849-11-03_1_1_hocr_pageindex.json']
We now need to uncompress the `.txt.gz` file to access the fully formed text file.
The `gzip` package provides us with the utility we need to access the contents of the archive file.
In fact, we don’t need to actually uncompress the file into the directory; we can open it and extract its contents “in memory”.
%%writefile ia_utils/get_txt_from_file.py
from pathlib import Path
import gzip

# Create a simple function to make it even easier to extract the full text content
def get_txt_from_file(id_val, dirname="ia-downloads", typ="searchtext"):
    """Retrieve text from downloaded text file."""
    if typ == "searchtext":
        p_ = Path(dirname) / id_val / f'{id_val}_hocr_searchtext.txt.gz'
        f = gzip.open(p_, 'rb')
        content = f.read().decode('utf-8')
    elif typ == "djvutxt":
        p_ = Path(dirname) / id_val / f'{id_val}_djvu.txt'
        content = p_.read_text()
    else:
        content = ""
    return content
Let’s see how it works, previewing the first 200 characters of the unarchived text file:
from ia_utils.get_txt_from_file import get_txt_from_file
get_txt_from_file(data_records[0]['id'])[:200]
' \n \n \n \nNOTES anp QUERIES:\nA Medium of Enter-Communication\nFOR\nLITERARY MEN, ARTISTS, ANTIQUARIES, GENEALOGISTS, ETC.\n‘* When found, make a note of.’—Carrain Corrie.\nVOLUME FIRST.\nNoveMBER, 1849—May, '
If we inspect the text in more detail, we see there are various things in it that we might want to simplify. For example, quotation marks appear in various guises, such as opening and closing quotes of different flavours. We could normalise these to a simpler form (for example, “straight” quotes `'` and `"`). However, if opening and closing quotes are reliably recognised, they do provide us with a simple way of matching text contained within the quotes. So for now, let’s leave the originally detected quotes in place.
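If we did later decide to normalise the quotes, a minimal sketch of the idea might look something like the following (the `normalise_quotes()` helper is our own, and the set of characters mapped is illustrative rather than exhaustive):

def normalise_quotes(text):
    """Map various curly opening/closing quote characters onto straight quotes."""
    # Illustrative mapping only; other quote-like characters may also appear in the OCR text
    replacements = {"‘": "'", "’": "'", "“": '"', "”": '"'}
    for curly, straight in replacements.items():
        text = text.replace(curly, straight)
    return text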
Having got a method in place, let’s now download the contents of the non-index issues for 1849.
q = """
SELECT id, title
FROM metadata
WHERE is_index = 0
AND strftime('%Y', datetime) = '1849'
"""
results = read_sql(q, db.conn)
results
| | id | title |
|---|---|---|
| 0 | sim_notes-and-queries_1849-11-03_1_1 | Notes and Queries 1849-11-03: Vol 1 Iss 1 |
| 1 | sim_notes-and-queries_1849-11-10_1_2 | Notes and Queries 1849-11-10: Vol 1 Iss 2 |
| 2 | sim_notes-and-queries_1849-11-17_1_3 | Notes and Queries 1849-11-17: Vol 1 Iss 3 |
| 3 | sim_notes-and-queries_1849-11-24_1_4 | Notes and Queries 1849-11-24: Vol 1 Iss 4 |
| 4 | sim_notes-and-queries_1849-12-01_1_5 | Notes and Queries 1849-12-01: Vol 1 Iss 5 |
| 5 | sim_notes-and-queries_1849-12-08_1_6 | Notes and Queries 1849-12-08: Vol 1 Iss 6 |
| 6 | sim_notes-and-queries_1849-12-15_1_7 | Notes and Queries 1849-12-15: Vol 1 Iss 7 |
| 7 | sim_notes-and-queries_1849-12-22_1_8 | Notes and Queries 1849-12-22: Vol 1 Iss 8 |
| 8 | sim_notes-and-queries_1849-12-29_1_9 | Notes and Queries 1849-12-29: Vol 1 Iss 9 |
The data is returned from the `read_sql()` function as a pandas dataframe.
The pandas package provides a very powerful set of tools for working with tabular data, including the ability to iterate over the rows of a table and apply a function to each one.
If we define a function to download the corresponding search text file from the Internet Archive and extract the text from the downloaded archive file, we can apply that function to a particular column value taken from each row of the dataframe and add the returned content to a new column in the same dataframe.
Here’s an example function:
%%writefile ia_utils/download_and_extract_text.py
from internetarchive import download

from ia_utils.get_txt_from_file import get_txt_from_file

def download_and_extract_text(id_val, p="ia-downloads", typ="searchtext", verbose=False):
    """Download search text from Internet Archive, extract the text and return it."""
    if verbose:
        print(f"Downloading {id_val} issue text")
    if typ == "searchtext":
        download(id_val, destdir=p, silent=True,
                 formats=["OCR Search Text"])
    elif typ == "djvutxt":
        download(id_val, destdir=p, silent=True,
                 formats=["DjVuTXT"])
    else:
        return ''
    text = get_txt_from_file(id_val, typ=typ)
    return text
The Python pandas package natively provides an `apply()` function. However, the tqdm progress bar package also provides an “apply with progress bar” function, `.progress_apply()`, if we enable the appropriate extensions:
# Download the tqdm progress bar tools
from tqdm.notebook import tqdm

# And enable the pandas extensions
tqdm.pandas()
Let’s apply our `download_and_extract_text()` function to each row of our records table for 1849, keeping track of progress with a progress bar:
from ia_utils.download_and_extract_text import download_and_extract_text
results['content'] = results["id"].progress_apply(download_and_extract_text)
results
| | id | title | content |
|---|---|---|---|
| 0 | sim_notes-and-queries_1849-11-03_1_1 | Notes and Queries 1849-11-03: Vol 1 Iss 1 | \n \n \n \nNOTES anp QUERIES:\nA Medium of En... |
| 1 | sim_notes-and-queries_1849-11-10_1_2 | Notes and Queries 1849-11-10: Vol 1 Iss 2 | \|\n \nA MEDIUM OF INTER-COMMUNICATION\nFOR\nLI... |
| 2 | sim_notes-and-queries_1849-11-17_1_3 | Notes and Queries 1849-11-17: Vol 1 Iss 3 | \n \n \nceeeeeeeeee eee\nA MEDIUM OF\nLITERAR... |
| 3 | sim_notes-and-queries_1849-11-24_1_4 | Notes and Queries 1849-11-24: Vol 1 Iss 4 | \n \n \n \n~ NOTES anp QUERIES:\nA MEDIUM OF\... |
| 4 | sim_notes-and-queries_1849-12-01_1_5 | Notes and Queries 1849-12-01: Vol 1 Iss 5 | \n \n \nNOTES anp\nQUERIES:\nA MEDIUM OF INTE... |
| 5 | sim_notes-and-queries_1849-12-08_1_6 | Notes and Queries 1849-12-08: Vol 1 Iss 6 | \n \nOTES\nAND QUERIES\nA MEDIUM OF INTER-COM... |
| 6 | sim_notes-and-queries_1849-12-15_1_7 | Notes and Queries 1849-12-15: Vol 1 Iss 7 | \n \n \nNOTES\nA MEDIUM OF\nAND QUERIES\nINTE... |
| 7 | sim_notes-and-queries_1849-12-22_1_8 | Notes and Queries 1849-12-22: Vol 1 Iss 8 | \n \nNOTES ann QUERIES\nA MEDIUM OF\nINTER-CO... |
| 8 | sim_notes-and-queries_1849-12-29_1_9 | Notes and Queries 1849-12-29: Vol 1 Iss 9 | \nNOTE\nA MEDIUM OF\nAND QUERIES\nINTER-COMMU... |
We can now add that data table directly to our database using the pandas `.to_sql()` method:
# Add the issue database table
table_name = "issues"
results[["id", "content"]].to_sql(table_name, db.conn, index=False, if_exists="append")
Note that this recipe does not represent a very efficient way of handling things: the pandas dataframe is held in memory, so as we add more rows, the memory requirements to store the data increase. A more efficient approach might be to create a function that retrieves each file, adds its contents to the database, and then perhaps even deletes the downloaded file, rather than adding the content to the in-memory dataframe.
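As a minimal sketch of that sort of approach (the `add_issue_text_to_db()` function name and the optional `cleanup` step are our own invention, reusing the helpers defined above), we might call a function like the following once per record identifier, rather than building up a large in-memory dataframe:

from shutil import rmtree
from pathlib import Path

from ia_utils.download_and_extract_text import download_and_extract_text

def add_issue_text_to_db(db, id_val, dirname="ia-downloads", cleanup=False):
    """Download the search text for a single issue, add it to the issues table,
    and optionally delete the downloaded files afterwards."""
    content = download_and_extract_text(id_val, p=dirname)
    # Insert or update the record directly, rather than holding it in a dataframe
    db["issues"].upsert({"id": id_val, "content": content}, pk="id")
    if cleanup:
        # Remove the per-issue download directory to reclaim disk space
        rmtree(Path(dirname) / id_val, ignore_errors=True)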
Let’s see if we can query it, first at the basic table level:
q = """
SELECT id, content
FROM issues
WHERE LOWER(content) LIKE "%customs%"
"""
read_sql(q, db.conn)
| | id | content |
|---|---|---|
| 0 | sim_notes-and-queries_1849-11-17_1_3 | \n \n \nceeeeeeeeee eee\nA MEDIUM OF\nLITERAR... |
| 1 | sim_notes-and-queries_1849-11-24_1_4 | \n \n \n \n~ NOTES anp QUERIES:\nA MEDIUM OF\... |
| 2 | sim_notes-and-queries_1849-12-15_1_7 | \n \n \nNOTES\nA MEDIUM OF\nAND QUERIES\nINTE... |
| 3 | sim_notes-and-queries_1849-12-29_1_9 | \nNOTE\nA MEDIUM OF\nAND QUERIES\nINTER-COMMU... |
This is not overly helpful, perhaps. We can do better with the full text search, which will also allow us to return a snippet around the first, or highest ranked, location of any matched search terms:
search_term = "customs"
q = f"""
SELECT id, snippet(issues_fts, -1, "__", "__", "...", 10) as clip
FROM issues_fts WHERE issues_fts MATCH {db.quote(search_term)} ;
"""
read_sql(q, db.conn)
| | id | clip |
|---|---|---|
| 0 | sim_notes-and-queries_1849-11-10_1_2 | ...At length the __custom__ became general in ... |
| 1 | sim_notes-and-queries_1849-11-17_1_3 | ...royal domains, leases of __customs__, &c., ... |
| 2 | sim_notes-and-queries_1849-11-24_1_4 | ...the Manners and __Customs__ of Ancient Gree... |
| 3 | sim_notes-and-queries_1849-12-01_1_5 | ...Morning, as was his __Custom__, attended by... |
| 4 | sim_notes-and-queries_1849-12-15_1_7 | ...So far as English usages and __customs__ ar... |
| 5 | sim_notes-and-queries_1849-12-22_1_8 | ...Sessions House and the __Custom__ House of ... |
| 6 | sim_notes-and-queries_1849-12-29_1_9 | ...elucidation of old world __customs__ and ob... |
This is okay as far as it goes: we can identify issues of Notes and Queries that contain a particular search term, retrieve the whole document, and even display a concordance for the first (or highest ranking) occurrence of the search term(s) to provide context for the response. But it’s not ideal. For example, to display a concordance of each term in the full text document that matches our search term, we need to generate our own concordance, which may be difficult where matches are inexact (for example, if the match relies on stemming). There are also many pages in each issue of Notes and Queries, and it would be useful if we could get the result at a better level of granularity.
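By way of illustration, a simple concordance function over the raw issue text might look something like the following sketch (the `simple_concordance()` helper is our own; it only finds exact, case-insensitive matches, so it would miss stemmed variants):

import re

def simple_concordance(text, term, width=40):
    """Return text fragments around each exact (case-insensitive) match of a term."""
    fragments = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        fragments.append(text[start:end].replace("\n", " "))
    return fragments

# Example: concordance over the first sample issue
simple_concordance(get_txt_from_file(data_records[0]['id']), "note")[:3]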
The `ouseful_sqlite_search_utils` package includes various functions that allow us to tunnel into a text document and retrieve content at a more granular level. The tools aren’t necessarily the fastest utilities to run, particularly on large databases, but they get there eventually.
One particular utility will split a document into sentences and return each sentence on a separate row of a newly created virtual table. We can then search within these values for our search term, although we are limited to running exact match queries, rather than the more forgiving full text search queries:
from ouseful_sqlite_search_utils import snippets
snippets.register_snippets(db.conn)
q = """
SELECT * FROM
(SELECT id, sentence
FROM issues, get_sentences(1, NULL, issues.content)
WHERE issues.id = "sim_notes-and-queries_1849-11-10_1_2")
WHERE sentence LIKE "% custom %"
"""
# Show the full result record in each case
read_sql(q, db.conn).to_dict(orient="records")
[{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'An examination of the structure of books of this period would confirm this view, and show that their apparent clumsiness is to be explained by the facility it was then the custom to afford for the interpolation or extraction of “sheets,” by a contrivance somewhat resembling that\npapers in a cover, and known as the “ patent leaf-holder,”\n'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'At length the custom became general in Aden ; and it was not only drunk in the\nnight by those who were desirous of being kept awake, but in the day for the sake of its other agreeable qualities.'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'From hence the custom extended itself to many other towns of Arabia, particularly to Medina, and then to Grand Cairo in Egypt, where the dervises of Yemen, who lived in a district by themselves, drank coffee on the nights they intended to spend in\n| devotion.'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'In that year, Soliman Aga, ambassador from the Sultan Mahomet the Fourth, arrived, who, with his retinue, brought a considerable quantity of coffee with them, and made presents of it to per- sons both of the court and city, and is sup- posed to have established the custom of drinking it.'},
{'id': 'sim_notes-and-queries_1849-11-10_1_2',
'sentence': 'How often\nman stumble upon some elucidation of a doubtful |\nphrase, or disputed passage ;—some illustration of an obsolete custom hitherto unnoticed ; — some biogra- phical aneedote or precise date hitherto unrecorded ;— some book, or some edition, hitherto unknown or im- perfectly described.'}]
Extracting Pages#
To make for more efficient granular searching, it would be useful if our content was stored in a more granular way.
Ideally, we would extract items at the “article” level, but there is no simple way of chunking the document at this level. We could process it to extract items at the sentence or paragraph level and add those to their own table, but that might be too granular.
However, by inspection of the files available for each issue, there appears to be another level of organisation that we can access: the page level.
Page metadata is provided in the form of two files:

`OCR Page Index`: downloaded as a compressed `.gz` file; the expanded file contains a list of lists, with one inner list per page. Each inner list contains four integers; the first and second integers give the character indexes in the search text file marking the start and end of the text on the corresponding page;

`Page Numbers JSON`: the page numbers JSON file, downloaded as an uncompressed JSON file, contains a JSON object with a `"pages"` attribute that returns a list of records; each record has four attributes: `"leafNum": int` (starting with index value 1), `"ocr_value": list` (a list of candidate OCR values), `"pageNumber": str` and `"confidence": float`. A top-level `"confidence"` attribute gives an indication of how likely it is that page numbers are available across the whole document.
We also need the `OCR Search Text` file.
Warning
There seems to be an off-by-one error in the page numbering. For example, the scanned pages for https://archive.org/details/sim_notes-and-queries_1859-08-06_8_188/ range from page 101 to 120, whereas the `...page_numbers.json` file claims, with confidence 100%, that the pages range from 102 to 121.
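One way to eyeball the reported numbering for a downloaded issue is to print the leaf numbers alongside the claimed page numbers from its `page_numbers.json` file. The following sketch assumes the file has already been downloaded to the `ia-downloads` directory (the `preview_page_numbers()` helper is our own):

import json
from pathlib import Path

def preview_page_numbers(id_val, dirname="ia-downloads", n=5):
    """Print (leafNum, pageNumber, confidence) for the first few pages of an issue."""
    with open(Path(dirname) / id_val / f"{id_val}_page_numbers.json") as f:
        pages = json.load(f)["pages"]
    for page in pages[:n]:
        print(page["leafNum"], repr(page["pageNumber"]), page["confidence"])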
Let’s get a complete set of necessary files for a few sample records:
%%writefile ia_utils/download_ia_records_by_format.py
# Download the tqdm progress bar tools
from tqdm.notebook import tqdm
from pathlib import Path

from internetarchive import download

def download_ia_records_by_format(records, path=".", formats=None):
    """Download records from Internet Archive given ID and desired format(s)."""
    formats = formats if formats else ["OCR Search Text", "OCR Page Index", "Page Numbers JSON"]
    for record in tqdm(records):
        _id = record['id']
        download(_id, destdir=path,
                 formats=formats,
                 silent=True)
from ia_utils.download_ia_records_by_format import download_ia_records_by_format
# Grab page counts and page structure files
sample_records = data_records[:5]
download_ia_records_by_format(sample_records, p)
We now need to figure out how to open and parse the page index and page numbers files, and check the lists are the correct lengths.
The Python `zip` function lets us “zip” together elements from different, parallel lists. We can also insert the same item, repeatedly, into each row using the `itertools.repeat()` function to generate as many repetitions of the same value as are required:
import itertools
Example of using `itertools.repeat()`:

# Zip a repeated value together with the elements of two parallel lists
list(zip(itertools.repeat("a"), [1, 2], ["x","y"]))
[('a', 1, 'x'), ('a', 2, 'y')]
We can now use this approach to create a zipped combination of the record ID values, page numbers and page character indexes.
import gzip
import json
import itertools

#for record in tqdm(sample_records):
record = sample_records[0]

id_val = record['id']
p_ = Path(dirname) / id_val

# Get the page numbers
with open(p_ / f'{id_val}_page_numbers.json', 'r') as f:
    page_numbers = json.load(f)

# Get the page character indexes
with gzip.open(p_ / f'{id_val}_hocr_pageindex.json.gz', 'rb') as g:
    # The last element seems to be redundant
    page_indexes = json.loads(g.read().decode('utf-8'))[:-1]

# Optionally test that the record counts are the same for page numbers and character indexes
#assert len(page_indexes) == len(page_numbers['pages'])

# Preview the result
list(zip(itertools.repeat(id_val), page_numbers['pages'], page_indexes))[:5]
[('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 1, 'ocr_value': [], 'pageNumber': '', 'confidence': 0},
[0, 301, 559, 14345]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 2, 'ocr_value': ['4', '3'], 'pageNumber': '', 'confidence': 0},
[301, 307, 14345, 15954]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 3, 'ocr_value': ['2'], 'pageNumber': '2', 'confidence': 100},
[307, 3212, 15954, 101879]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 4, 'ocr_value': ['3'], 'pageNumber': '3', 'confidence': 100},
[3212, 7431, 101879, 228974]),
('sim_notes-and-queries_1849-11-03_1_1',
{'leafNum': 5, 'ocr_value': [], 'pageNumber': '4', 'confidence': 100},
[7431, 12267, 228974, 370105])]
We could add this page related data directly to the pages table, or we could create another simple database table to store it.
Here’s what a separate table might look like:
%%writefile ia_utils/create_db_table_pages_metadata.py
def create_db_table_pages_metadata(db, drop=True):
    """Create a pages_metadata database table."""
    if drop:
        db["pages_metadata"].drop(ignore=True)
    db["pages_metadata"].create({
        "id": str,
        "page_idx": int, # This is just a count as we work through the pages
        "page_char_start": int,
        "page_char_end": int,
        "page_leaf_num": int,
        "page_num": str, # This is to allow for things like Roman numerals
        # Should we perhaps try to cast an int for the page number
        # and have a page_num_str for the original?
        "page_num_conf": float # A confidence value relating to the page number detection
    }, pk=("id", "page_idx")) # compound foreign keys not currently available via sqlite_utils?
Import that function from the local package and run it:
from ia_utils.create_db_table_pages_metadata import create_db_table_pages_metadata
create_db_table_pages_metadata(db)
The following function “zips” together the contents of the page index and page numbers files. Each “line item” is a rather unwieldy mishmash of elements, but we’ll deal with those in a moment:
%%writefile ia_utils/raw_pages_metadata.py
import itertools
import json
import gzip

from pathlib import Path

def raw_pages_metadata(id_val, dirname="ia-downloads"):
    """Get page metadata."""
    p_ = Path(dirname) / id_val

    # Get the page numbers
    with open(p_ / f'{id_val}_page_numbers.json', 'r') as f:
        page_numbers = json.load(f)

    # Get the page character indexes
    with gzip.open(p_ / f'{id_val}_hocr_pageindex.json.gz', 'rb') as g:
        # The last element seems to be redundant
        page_indexes = json.loads(g.read().decode('utf-8'))[:-1]

    # Add the id and an index count
    return zip(itertools.repeat(id_val), range(len(page_indexes)),
               page_numbers['pages'], page_indexes)
For each line item in the zipped data structure, we can parse out values into a more readable data object:
%%writefile ia_utils/parse_page_metadata.py
def parse_page_metadata(item):
    """Parse out page attributes from the raw page metadata construct."""
    _id = item[0]
    page_idx = item[1]
    _page_nums = item[2]
    ix = item[3]

    obj = {'id': _id,
           'page_idx': page_idx, # Maintain our own count, just in case; should be page_leaf_num-1
           'page_char_start': ix[0],
           'page_char_end': ix[1],
           'page_leaf_num': _page_nums['leafNum'],
           'page_num': _page_nums['pageNumber'],
           'page_num_conf': _page_nums['confidence']
          }

    return obj
Let’s see how that looks:
from ia_utils.raw_pages_metadata import raw_pages_metadata
from ia_utils.parse_page_metadata import parse_page_metadata
sample_pages_metadata_item = raw_pages_metadata(id_val)
for pmi in sample_pages_metadata_item:
    print(parse_page_metadata(pmi))
    break
{'id': 'sim_notes-and-queries_1849-11-03_1_1', 'page_idx': 0, 'page_char_start': 0, 'page_char_end': 301, 'page_leaf_num': 1, 'page_num': '', 'page_num_conf': 0}
We can now trivially add the page metadata to the `pages_metadata` database table. Let’s try it with our sample:
%%writefile ia_utils/add_page_metadata_to_db.py
from ia_utils.parse_page_metadata import parse_page_metadata
from ia_utils.raw_pages_metadata import raw_pages_metadata

def add_page_metadata_to_db(db, records, dirname="ia-downloads", verbose=False):
    """Add page metadata to database."""
    for record in records:
        id_val = record["id"]
        if verbose:
            print(id_val)
        # Use a separate name for the parsed page records so we don't shadow the records argument
        page_records = [parse_page_metadata(pmi) for pmi in raw_pages_metadata(id_val, dirname)]
        # Add records to the database
        db["pages_metadata"].insert_all(page_records)
And run it over our sample records, with the page metadata for each record being looked up via its `id` value:
from ia_utils.add_page_metadata_to_db import add_page_metadata_to_db
# Clear the db table
db["pages_metadata"].delete_where()
# Add the metadata to the table
add_page_metadata_to_db(db, sample_records)
Let’s see how that looks:
from pandas import read_sql
q = "SELECT * FROM pages_metadata LIMIT 5"
read_sql(q, db.conn)
| | id | page_idx | page_char_start | page_char_end | page_leaf_num | page_num | page_num_conf |
|---|---|---|---|---|---|---|---|
| 0 | sim_notes-and-queries_1849-11-03_1_1 | 0 | 0 | 301 | 1 | | 0.0 |
| 1 | sim_notes-and-queries_1849-11-03_1_1 | 1 | 301 | 307 | 2 | | 0.0 |
| 2 | sim_notes-and-queries_1849-11-03_1_1 | 2 | 307 | 3212 | 3 | 2 | 100.0 |
| 3 | sim_notes-and-queries_1849-11-03_1_1 | 3 | 3212 | 7431 | 4 | 3 | 100.0 |
| 4 | sim_notes-and-queries_1849-11-03_1_1 | 4 | 7431 | 12267 | 5 | 4 | 100.0 |
Alternatively, we can view the results as a Python dictionary:
read_sql(q, db.conn).to_dict(orient="records")[:3]
[{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 0,
'page_char_start': 0,
'page_char_end': 301,
'page_leaf_num': 1,
'page_num': '',
'page_num_conf': 0.0},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 1,
'page_char_start': 301,
'page_char_end': 307,
'page_leaf_num': 2,
'page_num': '',
'page_num_conf': 0.0},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 2,
'page_char_start': 307,
'page_char_end': 3212,
'page_leaf_num': 3,
'page_num': '2',
'page_num_conf': 100.0}]
For each file containing the search text for a particular issue, we also need a routine to extract the page level content. That is, we need to chunk the content based on the character indexes associated with the first and last characters on each page in the corresponding search text file.
This essentially boils down to:
grabbing the page index values;
grabbing the page search text;
chunking the search text according to the page index values.
We can apply a page chunker at the document level, paginating the content file and adding the results to the database.
The following function will load the page metadata records for an issue from the `pages_metadata` table, retrieve the corresponding search text, and chunk the text according to the page character start and end values:
%%writefile ia_utils/chunk_page_text.py
from pandas import read_sql

from ia_utils.get_txt_from_file import get_txt_from_file

def chunk_page_text(db, id_val):
    """Chunk text according to page_index values."""
    q = f'SELECT * FROM pages_metadata WHERE id="{id_val}"'
    page_indexes = read_sql(q, db.conn).to_dict(orient="records")

    text = get_txt_from_file(id_val)

    for ix in page_indexes:
        ix["page_text"] = text[ix["page_char_start"]:ix["page_char_end"]].strip()

    return page_indexes
Let’s see if we’ve managed to pull out some page text:
from ia_utils.chunk_page_text import chunk_page_text
# Create a sample index ID
sample_id_val = sample_records[0]["id"]
# Get the chunked text back as part of the metadata record
sample_pages = chunk_page_text(db, sample_id_val)
sample_pages[:3]
[{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 0,
'page_char_start': 0,
'page_char_end': 301,
'page_leaf_num': 1,
'page_num': '',
'page_num_conf': 0.0,
'page_text': 'NOTES anp QUERIES:\nA Medium of Enter-Communication\nFOR\nLITERARY MEN, ARTISTS, ANTIQUARIES, GENEALOGISTS, ETC.\n‘* When found, make a note of.’—Carrain Corrie.\nVOLUME FIRST.\nNoveMBER, 1849—May, 1850.\nLONDON: GEORGE BELL, 186. FLEET STREET 1850.\n[SOLD BY ALL BOOKSELLERS AND NEWSMEN. |'},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 1,
'page_char_start': 301,
'page_char_end': 307,
'page_leaf_num': 2,
'page_num': '',
'page_num_conf': 0.0,
'page_text': ''},
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
'page_idx': 2,
'page_char_start': 307,
'page_char_end': 3212,
'page_leaf_num': 3,
'page_num': '2',
'page_num_conf': 100.0,
'page_text': 'NOTES ann QUERIES:\nA MEDIUM OF\n——_——s\nINTER-COMMUNICATION\nFOR\nLITERARY MEN, ARTISTS,\nANTIQUARIES, GENEALOGISTS, ETC.\n“When found, make a note of.” — Carrain Currie.\nNOTES AND QUERIES.\nTue nature and design of the present work have been so fully stated in the Prospectus, and are indeed so far explained by its very Title, that it is unnecessary to occupy any great portion of its first number with details on the subject. We are under no temptation to fill its columns with an account of what we hope future numbers will be. Indeed, we would rather give a specimen than a de- scription; and only regret that, from the wide range of subjects which it is intended to embrace, and the correspondence and contri- butions of various kinds which we are led to expect, even this can only be done gradually. A few words of introduction and explanation may, however, be allowed; and, indeed, ought to be prefixed, that we may be understood by those readers who have not seen our Pro- spectus.\n“ WHEN FOUND, MAKE A NOTE OF,” is a most admirable rule; and if the excellent Captain had never uttered another word, he might have passed for a profound philosopher. It is a rule which should shine in gilt letters on the gingerbread of youth, and the specta- ele-case of age. Every man who reads with any view beyond mere pastime, knows the value of it. Every one, more or less, acts upon it. Every one regrets and suffers who\n \nSATURDAY, NOVEMBER 3. 1849.\nPrice Threepence. Stamped Edition, 4 d.\n \nneglects it. There is some trouble in it, to be sure; but in what good thing is there not? and what trouble does it save! Nay, what mischief! Half the lies that are current in the world owe their origin to a misplaced confidence in memory, rather than to inten- tional falsehood. We have never known more than one man who could deliberately and con- scientiously say that his memory had never deceived him; and he (when he saw that he had excited the surprise of his hearers, espe- cially those who knew how many years he had spent in the management of important com- mercial affairs) used to. add, — because he had never trusted it; but had uniformly written down what he was anxious to remember.\nBut, on the other hand, it cannot be denied that reading and writing men, of moderate industry, who act on this rule for any con- siderable length of time, will aceumulate a good deal of matter in various forms, shapes, and sizes—some more, some less legible and intelligible —some unposted in old pocket books — some on whole or half sheets, or mere seraps of paper, and backs of letters— some, lost sight of and forgotten, stuffing out old portfolios, or getting smoky edges in bundles tied up with faded tape. There are, we are quite sure, countless boxes and drawers, and pigeon-holes of such things, which want look- ing over, and would well repay the trouble.\n \n \nFOURTH EDITION.'}]
Modifying the `pages_metadata` Table in the Database#
Using the `chunk_page_text()` function, we can add page content to our pages metadata in memory. But what if we want to add it to the database? The `pages_metadata` table already exists, but does not include a column for the page text. However, we can modify that table to include just such a column:
db["pages_metadata"].add_column("page_text", str)
<Table pages_metadata (id, page_idx, page_char_start, page_char_end, page_leaf_num, page_num, page_num_conf, page_text)>
We can also enable a full text search facility over the table. Our interest is primarily in searching over the `page_text` column, but if we include a couple of other columns, that can help us key into records in other tables.
# Enable full text search
# This creates an extra virtual table to support the full text search
db["pages_metadata_fts"].drop(ignore=True)
db["pages_metadata"].enable_fts(["id", "page_idx", "page_text"], create_triggers=True, tokenize="porter")
<Table pages_metadata (id, page_idx, page_char_start, page_char_end, page_leaf_num, page_num, page_num_conf, page_text)>
We can now update the records in the `pages_metadata` table so they include the `page_text`:
q = 'SELECT DISTINCT(id) FROM pages_metadata;'
id_vals = read_sql(q, db.conn).to_dict(orient="records")

for sample_id_val in id_vals:
    updated_pages = chunk_page_text(db, sample_id_val["id"])
    db["pages_metadata"].upsert_all(updated_pages, pk=("id", "page_idx"))
We should now be able to search at the page level:
search_term = "customs"
q = f"""
SELECT * FROM pages_metadata_fts
WHERE pages_metadata_fts MATCH {db.quote(search_term)};
"""
read_sql(q, db.conn)
| | id | page_idx | page_text |
|---|---|---|---|
| 0 | sim_notes-and-queries_1849-11-10_1_2 | 5 | 22 NOTES\n \nAND QUERIES.\nCatalogue — in whic... |
| 1 | sim_notes-and-queries_1849-11-10_1_2 | 8 | Nov. 10. 1849.)\nNOTES AND QUERIES.\n25\n \nne... |
| 2 | sim_notes-and-queries_1849-11-10_1_2 | 9 | bring with him some coffee, which he believed ... |
| 3 | sim_notes-and-queries_1849-11-10_1_2 | 12 | Nov. 10. 1849.]\nActing her passions on our st... |
| 4 | sim_notes-and-queries_1849-11-10_1_2 | 14 | ~—\n \n\|\n \nNov. 10. 1849.]\nNOTES AND QUERIE... |
| 5 | sim_notes-and-queries_1849-11-17_1_3 | 8 | = 17. 1849.] }\nreceive his representations an... |
| 6 | sim_notes-and-queries_1849-11-24_1_4 | 2 | ~vwe eS \| FY\nweNTe 6 FS-r—lCUcUOrlClC hLOOlhC... |
| 7 | sim_notes-and-queries_1849-11-24_1_4 | 6 | NOTES AND QUERIES.\n \n \n \nNov. 24. 1849.]\n... |
| 8 | sim_notes-and-queries_1849-11-24_1_4 | 15 | NOTES AND QUERIES.\nJust published, Part II., ... |
| 9 | sim_notes-and-queries_1849-12-01_1_5 | 5 | NOTES AND QUERIES.\n \nmore than three Passeng... |
We can then bring in additional columns from the original `pages_metadata` table:
search_term = "customs"
q = f"""
SELECT page_num, page_leaf_num, pages_metadata_fts.* FROM pages_metadata_fts, pages_metadata
WHERE pages_metadata_fts MATCH {db.quote(search_term)}
AND pages_metadata.id = pages_metadata_fts.id
AND pages_metadata.page_idx = pages_metadata_fts.page_idx;
"""
read_sql(q, db.conn)
| | page_num | page_leaf_num | id | page_idx | page_text |
|---|---|---|---|---|---|
| 0 | 23 | 6 | sim_notes-and-queries_1849-11-10_1_2 | 5 | 22 NOTES\n \nAND QUERIES.\nCatalogue — in whic... |
| 1 | 26 | 9 | sim_notes-and-queries_1849-11-10_1_2 | 8 | Nov. 10. 1849.)\nNOTES AND QUERIES.\n25\n \nne... |
| 2 | 27 | 10 | sim_notes-and-queries_1849-11-10_1_2 | 9 | bring with him some coffee, which he believed ... |
| 3 | 30 | 13 | sim_notes-and-queries_1849-11-10_1_2 | 12 | Nov. 10. 1849.]\nActing her passions on our st... |
| 4 | 32 | 15 | sim_notes-and-queries_1849-11-10_1_2 | 14 | ~—\n \n\|\n \nNov. 10. 1849.]\nNOTES AND QUERIE... |
| 5 | 42 | 9 | sim_notes-and-queries_1849-11-17_1_3 | 8 | = 17. 1849.] }\nreceive his representations an... |
| 6 | 52 | 3 | sim_notes-and-queries_1849-11-24_1_4 | 2 | ~vwe eS \| FY\nweNTe 6 FS-r—lCUcUOrlClC hLOOlhC... |
| 7 | 56 | 7 | sim_notes-and-queries_1849-11-24_1_4 | 6 | NOTES AND QUERIES.\n \n \n \nNov. 24. 1849.]\n... |
| 8 | 65 | 16 | sim_notes-and-queries_1849-11-24_1_4 | 15 | NOTES AND QUERIES.\nJust published, Part II., ... |
| 9 | 71 | 6 | sim_notes-and-queries_1849-12-01_1_5 | 5 | NOTES AND QUERIES.\n \nmore than three Passeng... |
Automatically Populating the `pages` Table from the `issues` Table#
Rather than manually adding the page data to a `pages` table, we can automatically create the `pages` table from the content contained in the `issues` table and the page metadata in the `pages_metadata` table.
TO DO: use a `CREATE TABLE ... AS SELECT ...` query; maybe also, as an extra, demonstrate how to generate the table automatically from a trigger; discuss the various advantages and disadvantages of each approach: one is a stepwise pipeline (`CREATE TABLE ... AS`), the other is reactive and automatic (a trigger).
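As a rough sketch of the first approach (not run here; the `pages` table name and column choices are just one possibility), we could populate a `pages` table directly in SQL by joining the `issues` content against the page character indexes in `pages_metadata`, remembering that SQLite's `substr()` function is 1-indexed:

q = """
CREATE TABLE pages AS
SELECT pages_metadata.id,
       pages_metadata.page_idx,
       substr(issues.content,
              pages_metadata.page_char_start + 1,
              pages_metadata.page_char_end - pages_metadata.page_char_start) AS page_text
FROM issues
JOIN pages_metadata ON issues.id = pages_metadata.id;
"""

# Uncomment to actually create the table
#db.conn.execute(q)

A trigger-based version might instead attach an INSERT trigger to the `issues` table that chunks and inserts the corresponding page rows whenever a new issue record is added.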