Building a 19th c. Notes & Queries Full Text Search Engine
Building a 19th c. Notes & Queries Full Text Search Engine#
Having got various pieces in place, we’re now in a position to attempt to create a comprehensive full text search engine over 19th century issue of Notes & Queries. As we use the database, there may well be “optimisations” we can make, for example in trying to tidy up the content a little. But for now, let’s just put the pieces we’ve already assembled together and see how it looks.
Start off by loading some essentially packages, as well as package files we created for ourselves previously:
from pathlib import Path
from sqlite_utils import Database
from ia_utils.create_db_table_metadata import create_db_table_metadata
from ia_utils.open_metadata_records import open_metadata_records
from ia_utils.add_patched_metadata_records_to_db import add_patched_metadata_records_to_db
from ia_utils.create_db_table_issues import create_db_table_issues
from ia_utils.create_db_table_pages_metadata import create_db_table_pages_metadata
It takes several hours to download the datafiles, so if we already have a full database file, we may want to put in some guards so we don’t overwrite it an lose all that previously downloaded data:
RECREATE_FULL_DB = False
full_db_name = "full_nq.db"
db_exists = Path("full_nq.db").is_file()
dirname = "ia-downloads"
p = Path(dirname)
# Extra cautious...
if RECREATE_FULL_DB and not db_exists:
db_full = Database(full_db_name, recreate=True)
create_db_table_metadata(db_full)
data_records = open_metadata_records()
add_patched_metadata_records_to_db(db_full, data_records)
create_db_table_issues(db_full)
create_db_table_pages_metadata(db_full)
db_full["pages_metadata"].add_column("page_text", str)
db_full["pages_metadata_fts"].drop(ignore=True)
db_full["pages_metadata"].enable_fts(["id", "page_idx", "page_text"],
create_triggers=True, tokenize="porter")
else:
db_full = Database(full_db_name)
from pandas import read_sql
# Get the records for a particular year
q = """
SELECT id, title, date, is_index
FROM metadata
WHERE is_index = 0
AND strftime('%Y', datetime) = '{year}';
"""
results_19th_cent = read_sql(q.format(year=1849), db_full.conn)
results_19th_cent
id | title | date | is_index | |
---|---|---|---|---|
0 | sim_notes-and-queries_1849-11-03_1_1 | Notes and Queries 1849-11-03: Vol 1 Iss 1 | 1849-11-03 | 0 |
1 | sim_notes-and-queries_1849-11-10_1_2 | Notes and Queries 1849-11-10: Vol 1 Iss 2 | 1849-11-10 | 0 |
2 | sim_notes-and-queries_1849-11-17_1_3 | Notes and Queries 1849-11-17: Vol 1 Iss 3 | 1849-11-17 | 0 |
3 | sim_notes-and-queries_1849-11-24_1_4 | Notes and Queries 1849-11-24: Vol 1 Iss 4 | 1849-11-24 | 0 |
4 | sim_notes-and-queries_1849-12-01_1_5 | Notes and Queries 1849-12-01: Vol 1 Iss 5 | 1849-12-01 | 0 |
5 | sim_notes-and-queries_1849-12-08_1_6 | Notes and Queries 1849-12-08: Vol 1 Iss 6 | 1849-12-08 | 0 |
6 | sim_notes-and-queries_1849-12-15_1_7 | Notes and Queries 1849-12-15: Vol 1 Iss 7 | 1849-12-15 | 0 |
7 | sim_notes-and-queries_1849-12-22_1_8 | Notes and Queries 1849-12-22: Vol 1 Iss 8 | 1849-12-22 | 0 |
8 | sim_notes-and-queries_1849-12-29_1_9 | Notes and Queries 1849-12-29: Vol 1 Iss 9 | 1849-12-29 | 0 |
We now need to:
iterate through the records;
download the issue;
carve it into various parts;
add the parts to the database.
We have all the pieces we need, so let’s do it:
# Dowload the tqdm progrss bar tools
from tqdm.notebook import tqdm
#And enable the pandas extensions
tqdm.pandas()
from ia_utils.download_and_extract_text import download_and_extract_text
from ia_utils.download_ia_records_by_format import download_ia_records_by_format
from ia_utils.add_page_metadata_to_db import add_page_metadata_to_db
from ia_utils.chunk_page_text import chunk_page_text
# Extra cautious
if RECREATE_FULL_DB and not db_exists:
# Iterate through all the years we want to search over
for year in tqdm(range(1849, 1900)):
# Get issues by year
results_by_year = read_sql(q.format(year=str(year)), db_full.conn)
# Download issue content by year
results_by_year['content'] = results_by_year["id"].apply(download_and_extract_text,
verbose=False)
# Add issues content to database
results_by_year[["id", "content"]].to_sql("issues",
db_full.conn,
index=False, if_exists="append")
# For each issue, we need to grab the metadata and store it in the database
download_ia_records_by_format(results_by_year.to_dict(orient="records"), p)
add_page_metadata_to_db(db_full, results_by_year.to_dict(orient="records"),
verbose=False)
for record_id_val in results_by_year['id'].to_list():
updated_pages = chunk_page_text(db_full, record_id_val)
db_full["pages_metadata"].upsert_all(updated_pages, pk=("id", "page_idx"))
q = """
SELECT COUNT(*)
FROM pages_metadata;
"""
read_sql(q, db_full.conn)
COUNT(*) | |
---|---|
0 | 58406 |
q = """
SELECT COUNT(*) FROM issues;
"""
read_sql(q, db_full.conn)
COUNT(*) | |
---|---|
0 | 2565 |
search_term = "sin eater"
#q = f"""
#SELECT * FROM pages_metadata_fts
#WHERE pages_metadata_fts MATCH {db.quote(search_term)};
#"""
q = """
SELECT id, snippet(pages_metadata_fts, -1, "__", "__", "...", 10) as clip
FROM pages_metadata_fts WHERE pages_metadata_fts MATCH {query} ;
"""
read_sql(q.format(query=db_full.quote(search_term)), db_full.conn)
id | clip | |
---|---|---|
0 | sim_notes-and-queries_1851-09-20_4_99 | ...Minor Queries :— Mazer Wood __eaters__ —“ A... |
1 | sim_notes-and-queries_1851-09-20_4_99 | ...Mazer Wood and __Sin__-__eaters__ (Vol. iii... |
2 | sim_notes-and-queries_1851-12-27_4_113 | ...on mazer-wood and __sin__-__eaters__, 211. ... |
3 | sim_notes-and-queries_1851-12-27_4_113 | ...__Sin__-__eaters__, notices respecting, 211... |
4 | sim_notes-and-queries_1852-10-23_6_156 | ...Coffins - -\nMinor Queries ANSwereD : — __S... |
5 | sim_notes-and-queries_1852-10-23_6_156 | ...__Sin__-__eater__.— Can any of your readers... |
6 | sim_notes-and-queries_1852-12-04_6_162 | ...furnish its quota The __Sin__-__eater__, by... |
7 | sim_notes-and-queries_1852-12-04_6_162 | ...regoi — irst line, THE __SIN__-__EATER__. I... |
8 | sim_notes-and-queries_1852-12-25_6_165 | ...B. “CE. ) on the __sin__-__eater__, 541.\nB... |
9 | sim_notes-and-queries_1852-12-25_6_165 | ...Leeper ( Alex.) on the __sin__-__eater__, 5... |
10 | sim_notes-and-queries_1852-12-25_6_165 | ...off, 288, — origin of __sin__-__eater__, 30... |
11 | sim_notes-and-queries_1856-07-12_2_28 | ...MARRIOT THE GREAT __EATER__. (2™ §. ii. 6.)... |
12 | sim_notes-and-queries_1856-12-27_2_52 | ...the great __eater__, 33. “ Rebukes for __Si... |
13 | sim_notes-and-queries_1864-10-29_6_148 | ...John into __Sin__- jon, St. Clair into Sinc... |
14 | sim_notes-and-queries_1865-05-20_7_177 | ...this crying __sin__ did drawe downe\n__eate... |
15 | sim_notes-and-queries_1867-11-02_12_305 | ...4s ; __sin__ rle volumes, 288 . 48.3; sing ... |
16 | sim_notes-and-queries_1870-11-12_6_150 | ...The origin of the __Sin__-__eater__ is need... |
17 | sim_notes-and-queries_1876-12-23_6_156 | ...We are well rid of the __sin__-__eater__, w... |
18 | sim_notes-and-queries_1876-12-30_6_157 | ...364, 423\nSicilian, 507\n__Sin__-__eater__,... |
19 | sim_notes-and-queries_1876-12-30_6_157 | ...and the Bible, 509 __Sin__-__eater__, 505 W... |
20 | sim_notes-and-queries_1877-06-30_7_183 | ...local name, on the __Sin__-__eater__, 14 n ... |
21 | sim_notes-and-queries_1877-06-30_7_183 | ...66\nShrove Tuesday, 120\n__Sin__-__eater__,... |
22 | sim_notes-and-queries_1879-07-26_12_291 | ...Writings of the opium-__eater__. In one of ... |
23 | sim_notes-and-queries_1882-01-28_5_109 | ...and Hastings.\nEp. MarsHatt.\n__Sin__ Curis... |
24 | sim_notes-and-queries_1882-06-24_5_130 | ...on Franklin, 288\n \nMoon, __sin__ to point... |
25 | sim_notes-and-queries_1883-01-13_7_159 | ...The superstition of the __Sin__-__Eater__ i... |
26 | sim_notes-and-queries_1883-06-30_7_183 | ...for, 410, 448, 474 __Sin__-__eater__, 25, 3... |
27 | sim_notes-and-queries_1883-09-29_8_196 | ...John Aubrey has three passages concerning _... |
28 | sim_notes-and-queries_1883-12-29_8_209 | ...Swithin, 46 Scrofula, touching for, 113, 29... |
29 | sim_notes-and-queries_1885-02-21_11_269 | ...from Tennyson, The Lotos __Eaters__, stanza... |
30 | sim_notes-and-queries_1889-12-28_8_209 | ...Good night! There goes another year! (__Eat... |
31 | sim_notes-and-queries_1892-01-30_1_5 | ...prevailed, certain persons called “__sin__ ... |
32 | sim_notes-and-queries_1892-10-29_2_44 | ...it were not lead, that is, __sinful__, yet ... |
33 | sim_notes-and-queries_1893-11-04_4_97 | ...LOTOS-__EATER__ in CAPRI. By Dr. ALAN WALTE... |
34 | sim_notes-and-queries_1895-10-12_8_198 | ...Charterhouse — ‘‘ Oyster of Veal” — __Sin__... |
35 | sim_notes-and-queries_1895-10-12_8_198 | ...Here the __sin__-__eater__ was supposed to ... |
36 | sim_notes-and-queries_1895-10-26_8_200 | ...Moore, 331—Lan Stam\nct — __Sin__-__eaters_... |
37 | sim_notes-and-queries_1895-11-09_8_202 | ...Heene, Worthing :—‘ Though the __sin__-__ea... |
38 | sim_notes-and-queries_1895-12-28_8_209 | ...and Burns, 205, 515 __Sin__-__eaters__, 288... |
39 | sim_notes-and-queries_1895-12-28_8_209 | ...Westminster, anchorite at, 408 __Sin__-__ea... |
40 | sim_notes-and-queries_1896-02-08_9_215 | ...REPLIES :—__Sin__-__Eater__, 109—Vatican Em... |
41 | sim_notes-and-queries_1896-02-08_9_215 | ...Beylics,\n__SIN__-__EATER__. (8™ §S. viii. ... |
42 | sim_notes-and-queries_1896-02-08_9_215 | ...the myth of the __sin__-__eater__. The only... |
43 | sim_notes-and-queries_1896-02-29_9_218 | ...REPLIES :—__Sin__-__eater__, 169—Bream’s Bu... |
44 | sim_notes-and-queries_1896-02-29_9_218 | ...Beplies,\n__SIN__ - __EATER__, (8 §, viii, ... |
45 | sim_notes-and-queries_1896-03-21_9_221 | ...Revels for Scotland—Milton—__Sin__-__eater_... |
46 | sim_notes-and-queries_1896-03-21_9_221 | ...w.c.B\nSrn-__EaTER__ (8 §, viii, 288, 332 ;... |
47 | sim_notes-and-queries_1896-04-11_9_224 | ...s ‘ Richard III.,’ 295—__Sin__-__eater__—Th... |
48 | sim_notes-and-queries_1896-06-27_9_235 | ...Mass, its etymology, 334 __Sin__-__eaters__... |
49 | sim_notes-and-queries_1896-06-27_9_235 | ...38\nRose-galls, 93\n__Sin__-__eaters__, 110... |
50 | sim_notes-and-queries_1896-06-27_9_235 | ...Gastayne, 232 Senses, the seven, 493 __Sin_... |
51 | sim_notes-and-queries_1896-06-27_9_235 | ...438 Rose-gall, 93 __Sin__-__eaters__, 109, ... |
52 | sim_notes-and-queries_1896-06-27_9_235 | ...271 Penn (William), 243 __Sin__-__eaters__,... |
53 | sim_notes-and-queries_1896-06-27_9_235 | ...Trunion, 34 __Sin__-__eaters__, 110 Sterlin... |
54 | sim_notes-and-queries_1896-06-27_9_235 | ...P.) on __sin__-__eaters__, 109, 236 Owen (M... |
55 | sim_notes-and-queries_1896-06-27_9_235 | ...M. ) on maypoles, 10 __Sin__-__eaters__, 11... |
56 | sim_notes-and-queries_1896-06-27_9_235 | ...1, ‘‘ Bare bodkin,” 362, | __Sin__-__eaters... |
57 | sim_notes-and-queries_1896-06-27_9_235 | ...W.) on __sin__-__eaters__, 169 Thomas (R.) ... |
58 | sim_notes-and-queries_1897-08-28_12_296 | ...I am sorry that __Sin__ Hersert Maxwett wil... |