Exploring Notes & Queries#

This chapter, and several following ones, will describe how to create various search contexts for 19th century issues of Notes & Queries. These include:

  • a monolithic PDF of index issues up to 1900;

  • a searchable database of index issues up to 1900;

  • a full text searchable database of non-index issues up to 1900.

Scans of the original publication, as well as automatically extracted search text, are available for free from the Internet Archive.

Working With Documents From the Internet Archive#

The Internet Archive – archive.org – is an incredible resource. Amongst other things, it is home to a large number of out-of-copyright digitised books scanned by the Google Book project as well as other book scanning initiatives.

In this unbook, I will explore various ways in which we can build tools around the Internet Archive and documents retrieved from it.

Searching the Internet Archive#

Many people will be familiar with the web interface to the Internet Archive (and I suspect many more are not aware of the existence of the Internet Archive at all). This provides tools for discovering documents available in the archive, previewing the scanned versions of them, and even searching within them.

At times, the search inside a book can be a bit hit and miss, in part depending on the quality of the scanned images and the ability of the OCR tools - where “OCR” stands for “optical character recognition” - to convert the pictures of text into actual, searchable text.

One of the advantages of creating our own database is that as well as having the corpus available locally, we can use various fuzzy search tools to find partial matches to text to supplement our full text search activities.
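We will meet the actual tools later, but to give a flavour of the idea, here is a minimal sketch of fuzzy matching using Python’s standard difflib package (the index entries are made up for the purposes of the example):

from difflib import get_close_matches

# Some (made-up) index entries, as they might be recovered from OCR text
index_entries = ["Shakespeare, William", "Shakspeare quotations", "Sheridan, R. B."]

# Find entries that approximately match a search term,
# even if the spelling differs slightly from the OCR text
get_close_matches("Shakespere", index_entries, n=2, cutoff=0.6)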

To work with the archive, we’ll use the Python programming language. This lets us write instructions for our machine helpers to follow. One of the machine helpers comes in the form of the internetarchive Python package, a collection of routines that can access the Internet Archive at the programming, rather than human user interface, level.

The human level interface simply provides graphical tools that we can understand, such as menu items and toolbar buttons. Selecting or clicking these simply invokes machine level commands in a usable-for-us way. Writing program code lets us call those commands directly, in a textual way, rather than visually, by clicking menu items and buttons. Copying and pasting simple text instructions that perform a particular function is often quite straightforward, and modifying such commands may also be relatively straightforward. (For example, given a block of code that downloads a file from a web location using code of the form download_file("https://example.com/this_file.pdf"), you could probably work out how to download a file from http://another.example.com/myfile.pdf.) Creating graphical user interfaces, on the other hand, is hard. A graphical user interface also constrains users to just the functions and features that its designers and developers chose to support, in just the way that the interface allows. Being able to instruct a machine using code, even copied and pasted code, gives the end-user far more power over the machine.

Within any particular programming language, packages are often used to bundle together various tools and functions that can be used to support particular activities or tasks, or work with particular resources or resource types.

One of the most useful tools within the internetarchive package is the search_items() function, which lets us search the Internet Archive.

# If we haven't already installed the package into our computing environment,
# we need to download it and install it.
#%pip install internetarchive

# Load in a function to search the archive
from internetarchive import search_items

# We are going to build up a list of search results
items = []

Item Metadata#

At the data level, the Internet Archive holds metadata, or “data about data”, that provides key or summary information about each data record. For example, works can be organised as part of different collections via collection elements such as collection:"pub_notes-and-queries".

For periodicals, there may also be a publication identifier associated with the periodical (for example, sim_pubid:1250), as well as metadata identifying which volume or issue a particular edition of the periodical represents.
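For example, a minimal sketch of a query keyed on the publication identifier rather than the collection name (assuming sim_pubid:1250 is indeed the publication identifier for Notes and Queries):

from internetarchive import search_items

# Search on the publication identifier rather than the collection name
results = search_items('sim_pubid:1250 AND year:1867')

# Preview the first result; by default each result is a simple
# dictionary containing just the item identifier
print(next(iter(results)))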

In the following bit of code, we search over the Notes & Queries collection, retrieving data about each item in the collection.

This is quite a large collection, so to run a query that retrieves all the items in it may take a considerable amount of time. Instead, we can limit the search to issues published in a particular year, and further limit the query to only retrieve a certain number of records.

# We can use a programming loop to search for items, iterate through the items
# and retrieve a record for each one
# The enumerate() command will loop through all the items, returning a running count of items
# returned, as well as each separate item
# The count starts at 0...
for count, item in enumerate(search_items('collection:"pub_notes-and-queries" AND year:1867').iter_as_items()):
    # Display the count, the item identifier and title
    print(count, item.identifier, item.metadata['title'])

    # If we see an item with a count value of at least 3, which is to say, the fourth item,
    # (we start counting at zero, remember...)
    if count >= 3:
        # Then break out of this loop
        break
0 sim_notes-and-queries_1867-01-05_11_262 Notes and Queries  1867-01-05: Vol 11 Iss 262
1 sim_notes-and-queries_1867-01-12_11_263 Notes and Queries  1867-01-12: Vol 11 Iss 263
2 sim_notes-and-queries_1867-01-19_11_264 Notes and Queries  1867-01-19: Vol 11 Iss 264
3 sim_notes-and-queries_1867-01-26_11_265 Notes and Queries  1867-01-26: Vol 11 Iss 265

As well as the “official” collection, some copies of Notes and Queries from other providers are also available in the Internet Archive. For example, there are some submissions from Project Gutenberg.

The following retrieves an item obtained from the gutenberg collection, which is to say, Project Gutenberg, and previews its metadata:

from internetarchive import get_item

# Retrieve an item from its unique identifier
item = get_item('notesandqueriesi13536gut')

# And display its metadata
item.metadata
{'identifier': 'notesandqueriesi13536gut',
 'title': 'Notes and Queries, Index of Volume 1, November, 1849-May, 1850: A Medium of Inter-Communication for Literary Men, Artists, Antiquaries, Genealogists, Etc.',
 'possible-copyright-status': 'NOT_IN_COPYRIGHT',
 'copyright-region': 'US',
 'mediatype': 'texts',
 'collection': 'gutenberg',
 'creator': 'Various',
 'contributor': 'Project Gutenberg',
 'description': 'Book from Project Gutenberg: Notes and Queries, Index of Volume 1, November, 1849-May, 1850: A Medium of Inter-Communication for Literary Men, Artists, Antiquaries, Genealogists, Etc.',
 'language': 'eng',
 'call_number': 'gutenberg etext# 13536',
 'addeddate': '2006-12-07',
 'publicdate': '2006-12-07',
 'backup_location': 'ia903600_27'}

The items in the pub_notes-and-queries collection have much more metadata available, including volume and issue data, and the identifiers for the previous and next issue.

In some cases, the identifier values may be human readable, if you look closely enough. For example, Notes and Queries was published weekly, typically with two volumes per year, and an index for each. In the pub_notes-and-queries collection, the identifier for Volume 11, issue 262, published on January 5th, 1867, is sim_notes-and-queries_1867-01-05_11_262; and the identifier for the index of volume 12, published throughout the second half of 1867, is sim_notes-and-queries_1867_12_index.
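Since the identifiers follow a regular pattern, we can often unpick them directly. Here is a minimal sketch (parse_nq_identifier is a made-up helper, and it assumes the sim_notes-and-queries_DATE_VOL_ISSUE pattern holds):

# A sketch: split an identifier into its date, volume and issue parts
def parse_nq_identifier(identifier):
    """Unpick a pub_notes-and-queries identifier of the form
    sim_notes-and-queries_DATE_VOL_ISSUE (index issues use a year
    and the word "index" for the date and issue parts)."""
    _, _, date, vol, iss = identifier.split("_")
    return {"date": date, "vol": vol, "iss": iss}

parse_nq_identifier("sim_notes-and-queries_1867-01-05_11_262")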

Available Files#

As well as the data record, certain other files may be associated with an item, such as PDF scans or files containing the raw scanned text of the document.

We have already seen how we can retrieve an item given its identifier, but let’s see it in action again:

item = get_item("sim_notes-and-queries_1867_12_index")

item.metadata['title'], item.identifier
('Notes and Queries  1867: Vol 12 Index',
 'sim_notes-and-queries_1867_12_index')

We can make a call from this data item to return a list of the files associated with that item, and display their file formats:

for file_item in item.get_files():
    print(file_item.format)
Item Tile
JPEG 2000
JPEG 2000
Text PDF
Archive BitTorrent
chOCR
DjVuTXT
Djvu XML
Metadata
JSON
hOCR
OCR Page Index
OCR Search Text
Item Image
Single Page Processed JP2 ZIP
Metadata
Metadata
Page Numbers JSON
JSON
Scandata

For this item, then, we can get a PDF document, a file containing the search text, a record with information about page numbers, an XML version of the original scanned version, some image scans, and various other things containing who knows what!
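If we are only interested in particular file types, we can also pass a formats argument to get_files() to filter the list. A minimal sketch, using the formats we will rely on later:

# List just the files available in the formats we care about
for file_item in item.get_files(formats=["Text PDF", "OCR Search Text"]):
    print(file_item.format, file_item.name)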

A Complete List of Notes & Queries Issues#

To help us work with the pub_notes-and-queries collection, let’s construct a local copy of the most important metadata associated with each item in the collection, specifically the item identifier, date and title, as well as the volume and issue. (Notes and Queries also has a higher level of organisation, a Series, which means that volume and issue numbers can actually recycle, so by itself, a particular (volume, issue) pair does not identify a unique item, but a (series, volume, issue) or (year, volume, issue) triple does.)

For convenience, we might also collect the previous and next item identifiers, as well as a flag that tells us whether access is restricted or not. (For 19th century editions, there are no restrictions; but for more recent 20th century editions, access may be limited to library shelf access).

As we construct various tools for working with the Internet Archive and various files downloaded from it, it will be useful to also save those tools in a way that we can make use of them again later.

The Python programming language supports a simple mechanism for bundling files into “packages”: we simply include the files in a directory that is marked as a package directory. The simplest way to mark a directory as a Python package is to create an empty file called __init__.py inside it.

So let’s create a package called ia_utils by creating a directory of that name containing an empty __init__.py file:

from pathlib import Path

# Create the directory if it doesn't already exist
ia_utils = Path("ia_utils")
ia_utils.mkdir(exist_ok=True)

# Create the blank file
Path( ia_utils / "__init__.py" ).touch()

Note

The pathlib package contains powerful tools for working with directories, files, and file paths.
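For example, two of the pathlib conveniences we will keep relying on are joining path elements with the / operator and testing whether a path exists:

from pathlib import Path

# Build a path to the file we just created by joining path parts with /
init_path = Path("ia_utils") / "__init__.py"

# Check that the file exists, and show its full (resolved) location
print(init_path.is_file(), init_path.resolve())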

The following cell contains a set of instructions bundled together to define a function under a unique function name. Functions provide us with a shorthand way of writing a set of instructions once, then calling on them repeatedly via their function name.

In particular, the function takes in an item metadata record, tidies it up a little and returns just the fields we are interested in.

In the following cell, we use some magic to write the contents of the cell to a package file; in the next cell after that, we import the function from the file. This provides us with a convenient way of saving code to a file that we can also reuse elsewhere.

%%writefile ia_utils/out_ia_metadata.py
import csv

def out_ia_metadata(item):
    """Retrieve a subset of item metadata and return it as a list."""
    # This is a nested function that looks up a piece of metadata if it exists
    # If it doesn't exist, we set it to ''
    def _get(_item, field):
        return _item[field] if field in _item else ''

    identifier = item.metadata['identifier']
    date =  _get(item.metadata, 'date')
    title = _get(item.metadata, 'title')
    volume =_get(item.metadata, 'volume')
    issue = _get(item.metadata, 'issue')
    prev_ = _get(item.metadata, 'previous_item')
    next_ = _get(item.metadata, 'next_item')
    restricted = _get(item.metadata,'access-restricted-item')
    
    return [identifier, date, title, volume, issue, prev_, next_, restricted]

Now we can import the function from the package. And so can other notebooks.

from ia_utils.out_ia_metadata import out_ia_metadata

Here’s what the data retrieved from an item record by the out_ia_metadata function looks like:

# Get an item record from its identifier
item = get_item("sim_notes-and-queries_1867_12_index")

# Display the key metadata
out_ia_metadata(item)
['sim_notes-and-queries_1867_12_index',
 '1867',
 'Notes and Queries  1867: Vol 12 Index',
 '12',
 'Index',
 'sim_notes-and-queries_1867-06-29_11_287',
 'sim_notes-and-queries_1867-07-06_12_288',
 '']

We can now build up a list of lists containing the key metadata for all editions of Notes and Queries in the pub_notes-and-queries collection.

Our recipe will proceed in the following three steps:

  • search for all the items in the collection;

  • build up a list of records where each item contains the key metadata, extracted from the full record using the out_ia_metadata() function;

  • open a file (nandq_internet_archive.txt), give it a column header line, and write the key metadata records to it, one record per line.

The file will be written in “CSV” format (comma separated values), a simple text format for describing tabular data. CSV files can be read by spreadsheet applications, as well as other tools, and use comma separators to identify “columns” of information in each row.

# The name of the file we'll write our csv data to
csv_fn = "nandq_internet_archive.txt"

The file takes quite a long time to assemble (we need to download several thousand metadata records), so we only want to do it once.

So let’s check to see if the file exists (if it does, we won’t try to recreate it):

from pathlib import Path

csv_file_exists = Path(csv_fn).is_file()

Conveniently, identifiers for all the issues of Notes and Queries held by the Internet Archive can be retrieved via the pub_notes-and-queries collection.

The return object is an iterator with individual results that take the form {'identifier': 'sim_notes-and-queries_1849-11-03_1_1'} and from which we can obtain unique identifiers:

# Find records for all items in the collection
items = search_items('collection:"pub_notes-and-queries"')

The following incantation constructs one list from the members of another. In particular, we iterate through each item in the pub_notes-and-queries collection, extract the identifier, retrieve the corresponding metadata record (get_item()), create our own corresponding metadata record (out_ia_metadata()) and add it to a new list.

In all, there are several thousand records to download, and each takes a noticeable time, so rather than just sitting watching a progress bar for an hour, go and grab a meal rather than a coffee…

# The tqdm package provides a convenient progress bar
# for tracking progress through looped actions
from tqdm.notebook import tqdm

# If a local file containing the data doesn't already exist,
# then grab the data...
if not csv_file_exists:
    # Our list of custom metadata records
    csv_items = []

    for i in tqdm(items):
        id_val = i["identifier"]
        metadata_record = get_item(id_val)
        custom_metadata_record = out_ia_metadata( metadata_record )
        csv_items.append( custom_metadata_record )
        
# We should perhaps incrementally write the CSV file as we go along
# or incrementally save the data to a simple local database
# If something goes wrong during the downloads, then at least
# we won't have lost everything...
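A minimal sketch of what that incremental approach might look like, writing each record to the CSV file as soon as it has been retrieved (this is an alternative to the approach taken above, not something we run here):

import csv

# A sketch: write each metadata record out as soon as we have retrieved it,
# so that a failure part way through doesn't lose everything
if not csv_file_exists:
    with open(csv_fn, 'w') as outfile:
        csv_write = csv.writer(outfile)
        # Write the header row
        csv_write.writerow(['id', 'date', 'title', 'vol', 'iss',
                            'prev_id', 'next_id', 'restricted'])

        for i in tqdm(items):
            record = out_ia_metadata(get_item(i["identifier"]))
            # Write the row immediately and flush it to disk
            csv_write.writerow(record)
            outfile.flush()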

We can now open the CSV file and write the data to it:

# The csv package provides tools for writing data out in CSV format
import csv

# If a local file containing the data doesn't already exist,
# then write the data we have collected out to it...
if not csv_file_exists:
    with open(csv_fn, 'w') as outfile:
        print(f"Writing data to file {csv_fn}")

        # Create a "CSV writer" object that can write to the file 
        csv_write = csv.writer(outfile)
        # Write a header row at the top of the file
        csv_write.writerow(['id','date','title','vol','iss','prev_id', 'next_id','restricted'])
        # Then write out list of essential metadata items out, one record per row
        csv_write.writerows(csv_items)

    # Update the file exists flag
    csv_file_exists = Path(csv_fn).is_file()

We can use a simple Linux command line tool (head) to show the top five lines of the file:

!head -n 5 nandq_internet_archive.txt
id,date,title,vol,iss,prev_id,next_id,restricted
sim_notes-and-queries_1849-11-03_1_1,1849-11-03,Notes and Queries  1849-11-03: Vol 1 Iss 1,1,1,sim_notes-and-queries_1849-1850_1_index,sim_notes-and-queries_1849-11-10_1_2,
sim_notes-and-queries_1849-11-10_1_2,1849-11-10,Notes and Queries  1849-11-10: Vol 1 Iss 2,1,2,sim_notes-and-queries_1849-11-03_1_1,sim_notes-and-queries_1849-11-17_1_3,
sim_notes-and-queries_1849-11-17_1_3,1849-11-17,Notes and Queries  1849-11-17: Vol 1 Iss 3,1,3,sim_notes-and-queries_1849-11-10_1_2,sim_notes-and-queries_1849-11-24_1_4,
sim_notes-and-queries_1849-11-24_1_4,1849-11-24,Notes and Queries  1849-11-24: Vol 1 Iss 4,1,4,sim_notes-and-queries_1849-11-17_1_3,sim_notes-and-queries_1849-12-01_1_5,

So, with some idea of what’s available to us, data wise, and file wise, what can we start to do with it?

Generating a Monolithic PDF Index for Notes & Queries Up To 1900#

If we want to search for items in Notes and Queries “manually”, one of the most effective ways is to look up items in the volume indexes. With two volumes a year, this means checking almost 100 separate documents if we want to look up 19th century references. (That’s not quite true: from the 1890s, indexes were produced that started to aggregate indices over several years.)

So how might we go about producing a single index PDF for 19th c. editions of Notes & Queries? As a conjoined set of original index PDFs, this wouldn’t provide us with unified index terms - a search on an index item would return separate entries for each volume index in which the term appeared – but it would mean we only needed to search one PDF document.

We’ll use the Python csv package to simplify saving and loading the data:

import csv

To begin with, we can load in our list of Notes and Queries record data downloaded from the Internet Archive.

%%writefile ia_utils/open_metadata_records.py
import csv

# Specify the file name we want to read data in from
def open_metadata_records(fn='nandq_internet_archive.txt'):
    """Open and read metadata records file."""

    with open(fn, 'r') as f:
        # We are going to load the data into a data structure known as a dictionary, or dict
        # Each item in the dictionary contains several elements as `key:value` pairs
        # The key matches the column name in the CSV data file,
        # along with the corresponding value in a given item row

        # Read the data in
        csv_data = csv.DictReader(f)

        # And convert it to a list of data records
        data_records = list(csv_data)
        
    return data_records

# Import that function from the package we just wrote it to
from ia_utils.open_metadata_records import open_metadata_records

Let’s grab the metadata records from our saved file:

data_records = open_metadata_records()

# Preview the first record (index count starts at 0)
# The object returned is a dictionary / dict
data_records[0]
{'id': 'sim_notes-and-queries_1849-11-03_1_1',
 'date': '1849-11-03',
 'title': 'Notes and Queries  1849-11-03: Vol 1 Iss 1',
 'vol': '1',
 'iss': '1',
 'prev_id': 'sim_notes-and-queries_1849-1850_1_index',
 'next_id': 'sim_notes-and-queries_1849-11-10_1_2',
 'restricted': ''}

Populating a Database With Record Metadata#

Let’s start by creating a table in the database that can store our metadata data records, as loaded in from the data file.

from sqlite_utils import Database

db_name = "nq_demo.db"

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

%%writefile ia_utils/create_db_table_metadata.py
import datetime

def create_db_table_metadata(db, drop=True):
    """Create the metadata table, optionally dropping any previous copy."""
    # If we want to remove any previous copy of the table completely, we can drop it
    if drop:
        db["metadata"].drop(ignore=True)

    db["metadata"].create({
        "id": str,
        "date": str,
        "datetime": datetime.datetime, # Use an actual time representation
        "series": str,
        "vol": str,
        "iss": str,
        "title": str, 
        "next_id": str, 
        "prev_id": str,
        "is_index": bool, # Is the record an index record
        "restricted": str, # should really be boolean
    }, pk="id")

Now we can load the function back in from our package and call it:

from ia_utils.create_db_table_metadata import create_db_table_metadata

create_db_table_metadata(db)

We need to do a little tidying of the records, but then we can add them directly to the database:

%%writefile ia_utils/add_patched_metadata_records_to_db.py
from tqdm.notebook import tqdm
import dateparser

def add_patched_metadata_records_to_db(db, data_records):
    """Add metadata records to database."""
    # Patch records to include a parsed datetime element
    for record in tqdm(data_records):
        # Parse the raw date into a date object
        # Need to handle a YYYY - YYYY exception
        # If we detect this form, use the last year for the record
        if len(record['date'].split()) > 1:
            record['datetime'] = dateparser.parse(record['date'].split()[-1])
        else:
            record['datetime'] = dateparser.parse(record['date'])

        record['is_index'] = 'index' in record['title'].lower() # We assign the result of a logical test

    # Add records to the database
    db["metadata"].insert_all(data_records)

Let’s call that function and add our metadata data records:

from ia_utils.add_patched_metadata_records_to_db import add_patched_metadata_records_to_db

add_patched_metadata_records_to_db(db, data_records)
/usr/local/lib/python3.9/site-packages/dateparser/date_parser.py:35: PytzUsageWarning: The localize method is no longer necessary, as this time zone supports the fold attribute (PEP 495). For more details on migrating to a PEP 495-compliant implementation, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html
  date_obj = stz.localize(date_obj)

We can then query the data, for example returning the first few rows:

from pandas import read_sql

q = "SELECT * FROM metadata LIMIT 5"

read_sql(q, db.conn)
id date datetime series vol iss title next_id prev_id is_index restricted
0 sim_notes-and-queries_1849-11-03_1_1 1849-11-03 1849-11-03T00:00:00 None 1 1 Notes and Queries 1849-11-03: Vol 1 Iss 1 sim_notes-and-queries_1849-11-10_1_2 sim_notes-and-queries_1849-1850_1_index 0
1 sim_notes-and-queries_1849-11-10_1_2 1849-11-10 1849-11-10T00:00:00 None 1 2 Notes and Queries 1849-11-10: Vol 1 Iss 2 sim_notes-and-queries_1849-11-17_1_3 sim_notes-and-queries_1849-11-03_1_1 0
2 sim_notes-and-queries_1849-11-17_1_3 1849-11-17 1849-11-17T00:00:00 None 1 3 Notes and Queries 1849-11-17: Vol 1 Iss 3 sim_notes-and-queries_1849-11-24_1_4 sim_notes-and-queries_1849-11-10_1_2 0
3 sim_notes-and-queries_1849-11-24_1_4 1849-11-24 1849-11-24T00:00:00 None 1 4 Notes and Queries 1849-11-24: Vol 1 Iss 4 sim_notes-and-queries_1849-12-01_1_5 sim_notes-and-queries_1849-11-17_1_3 0
4 sim_notes-and-queries_1849-12-01_1_5 1849-12-01 1849-12-01T00:00:00 None 1 5 Notes and Queries 1849-12-01: Vol 1 Iss 5 sim_notes-and-queries_1849-12-08_1_6 sim_notes-and-queries_1849-11-24_1_4 0

Or we could return the identifiers for index issues between 1875 and 1877:

q = """
SELECT id, title
FROM metadata
WHERE is_index = 1
    -- Extract the year
    AND strftime('%Y', datetime) >= '1875'
    AND strftime('%Y', datetime) <= '1877'
"""

read_sql(q, db.conn)
id title
0 sim_notes-and-queries_1875_3_index Notes and Queries 1875: Vol 3 Index
1 sim_notes-and-queries_1875_4_index Notes and Queries 1875: Vol 4 Index
2 sim_notes-and-queries_1876_5_index Notes and Queries 1876: Vol 5 Index
3 sim_notes-and-queries_1876_6_index Notes and Queries 1876: Vol 6 Index
4 sim_notes-and-queries_1877_7_index Notes and Queries 1877: Vol 7 Index
5 sim_notes-and-queries_1877_8_index Notes and Queries 1877: Vol 8 Index

By inspection of the list of index entries, we note that at some point cumulative indexes over a set of years, as well as volume level indexes, were made available. Cumulative indexes include:

  • Notes and Queries 1892 - 1897: Vol 1-12 Index

  • Notes and Queries 1898 - 1903: Vol 1-12 Index

  • Notes and Queries 1904 - 1909: Vol 1-12 Index

  • Notes and Queries 1910 - 1915: Vol 1-12 Index

In this first pass, we shall just ignore the cumulative indexes.

At this point, it is not clear where we might reliably obtain the series information from.

To make the data easier to work with, we can parse the date as a date thing (technical term!;-) using tools in the Python dateparser package:

import dateparser

The parsed data provides ways of comparing dates, extracting month and year, and so on.
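For example (the exact value parsed from a bare year depends on dateparser’s defaults for filling in missing date parts):

# Parse a full date string into a datetime object
parsed_date = dateparser.parse("1849-11-03")

# We can then pull out the year and month, and compare dates directly
parsed_date.year, parsed_date.month, parsed_date < dateparser.parse("1850")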

indexes = []

# Get index records up to 1900
max_year = 1900

for record in data_records:
    # Only look at index records
    # exclude cumulative indexes
    if 'index' in record['id'] and "cumulative" not in record['id']:
        # Need to handle a YYYY - YYYY exception
        # If we detect it, ignore it
        if len(record['date'].split()) > 1:
            continue
        
        # Parse the year into a date object, then filter by year
        # The records are in date order, so once we reach max_year
        # we can stop looking
        if dateparser.parse(record['date'].split()[0]).year >= max_year:
            break
        indexes.append(record) 

# Preview the first three index records
indexes[:3]
PytzUsageWarning: The localize method is no longer necessary, as this time zone supports the fold attribute (PEP 495). For more details on migrating to a PEP 495-compliant implementation, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html [date_parser.py:35]
[{'id': 'sim_notes-and-queries_1850_2_index',
  'date': '1850',
  'title': 'Notes and Queries  1850: Vol 2 Index',
  'vol': '2',
  'iss': 'Index',
  'prev_id': 'sim_notes-and-queries_1850-05-25_1_30',
  'next_id': 'sim_notes-and-queries_1850-06-01_2_31',
  'restricted': '',
  'datetime': datetime.datetime(1850, 3, 20, 0, 0),
  'is_index': True},
 {'id': 'sim_notes-and-queries_1851_3_index',
  'date': '1851',
  'title': 'Notes and Queries  1851: Vol 3 Index',
  'vol': '3',
  'iss': 'Index',
  'prev_id': 'sim_notes-and-queries_1850-12-28_2_61',
  'next_id': 'sim_notes-and-queries_1851-01-04_3_62',
  'restricted': '',
  'datetime': datetime.datetime(1851, 3, 20, 0, 0),
  'is_index': True},
 {'id': 'sim_notes-and-queries_1851_4_index',
  'date': '1851',
  'title': 'Notes and Queries  1851: Vol 4 Index',
  'vol': '4',
  'iss': 'Index',
  'prev_id': 'sim_notes-and-queries_1851-06-28_3_87',
  'next_id': 'sim_notes-and-queries_1851-07-05_4_88',
  'restricted': '',
  'datetime': datetime.datetime(1851, 3, 20, 0, 0),
  'is_index': True}]

To generate the complete PDF index, we need to do several things:

  • iterate through the list of index records;

  • for each one, download the associated PDF to a directory;

  • merge all the downloaded files into a single PDF;

  • optionally, delete the original PDF files.

Working With PDF Files Downloaded from the Internet Archive#

We can download files from the Internet Archive using the internetarchive.download() function. This takes an item identifier, along with a formats parameter listing the file formats we want to download. For example, we might want to download the “Text PDF” (a PDF file with full text search), or a simple text file containing just the OCR captured text (OCR Search Text), or both.

We can also specify the directory into which the files are downloaded.

Let’s import the packages that help simplify this task, and create a path to our desired download directory:

# Import the necessary packages
from internetarchive import download

To keep our files organised, we’ll create a directory into which we can download the files:

# Create download dir file path
dirname = 'ia-downloads'

p = Path(dirname)

# Make sure the download directory actually exists
p.mkdir(exist_ok=True)

One of the ways we can work with the data is to process it using Python programming code.

For example, we can iterate through the index records and download the required files:

# Use tqdm to provide a progress bar
for record in tqdm(indexes):
    _id = record['id']
    
    # Download PDF - this may take time to retrieve / download
    # This downloads to a directory with the same name as the record id
    # The file name is akin to ${id}.pdf
    download(_id, destdir=p, silent = True,
             formats=["Text PDF", "OCR Search Text"])

To create a single monolithic PDF, we can use another fragment of code to iterate through the downloaded PDF files, adding each one to a single merged PDF file object. We can also create and insert a reference page between each of the original documents to provide provenance if there is no date on the index pages.

Let’s start by seeing how to create a simple PDF page. The reportlab Python package provides various tools for creating simple PDF documents:

#%pip install --upgrade reportlab
from reportlab.pdfgen.canvas import Canvas

For example, we can create a simple single page document that we can add index metadata to and then insert in front of the pages of each index issue:

# Create a page canvas
test_pdf = "test-page.pdf"
canvas = Canvas(test_pdf)

# Write something on the page at a particular location
# In this case, let's use the title from the first index record
txt = indexes[0]['title']
# Co-ordinate origin is bottom left of the page
# Scale is points, where 72 points = 1 inch
canvas.drawString(72, 10*72, txt)

# Save the page
canvas.save()

Now we can preview the test page:

from IPython.display import IFrame

IFrame(test_pdf, width=600, height=500)

We can wrap this up as a simple function that generates a single page rendering a short text string:

def make_pdf_page(txt, fn="test_pdf.pdf"):
    """Generate a single page PDF rendering a short text string."""
    canvas = Canvas(fn)

    # Write something on the page at a particular location
    # Co-ordinate origin is bottom left of the page
    # Scale is points, where 72 points = 1 inch
    canvas.drawString(72, 10*72, txt)

    # Save the page
    canvas.save()
    
    return fn

Let’s now create our monolithic index with metadata page inserts.

The PyPDF2 package contains various tools for splitting and combining PDF documents:

from PyPDF2 import PdfFileReader, PdfFileMerger

We can use it to merge our separate index cover pages and index issue documents, for example:

# Create a merged PDF file creating object
output = PdfFileMerger()

# Generate a monolithic PDF index file by concatenating the pages
# from each individual PDF index file
# Use tqdm to provide a progress bar
for record in tqdm(indexes):
    # Generate some metadata:
    txt = record['title']
    metadata_pdf = make_pdf_page(txt)
    # Add this to the output document
    output.append(metadata_pdf)
    # Delete the metadata file
    Path(metadata_pdf).unlink()

    # Get the record ID
    _id = record['id']

    # Locate the file and merge it into the monolithic PDF
    output.append((p / _id / f'{_id}.pdf').as_posix())
    
# Write merged PDF file
with open("notes_and_queries_big_index.pdf", "wb") as output_stream:
    output.write(output_stream)

output = None

The resulting PDF is a large document (about 100MB) that collects all the separate indexes in one place, although not as a single, reconciled index: if the same index term appears in multiple index documents, there will be multiple occurrences of that term in the merged document.

However, if we do need a PDF reference to the index, it is useful to have to hand.
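One possible refinement, rather than inserting separate metadata pages, would be to label each appended index with a PDF bookmark (outline entry), which would make the merged document easier to navigate. A sketch along the following lines should work, assuming the installed version of PyPDF2 supports the bookmark argument to append() (notes_and_queries_big_index_bookmarked.pdf is just an illustrative output name):

# A sketch: append each index issue with a bookmark named after its title
# (assumes the installed PyPDF2 supports the bookmark argument)
output = PdfFileMerger()

for record in tqdm(indexes):
    _id = record['id']
    pdf_path = (p / _id / f'{_id}.pdf').as_posix()
    # The bookmark labels the start of this index in the merged document
    output.append(pdf_path, bookmark=record['title'])

with open("notes_and_queries_big_index_bookmarked.pdf", "wb") as output_stream:
    output.write(output_stream)

output = None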