Duncan Williamson Audio Recordings

The Tobar an Dualchais website is an archival website for Scottish oral tradition.

The site hosts several Duncan Williamson stories, but they are not as discoverable as they could be. This notebook describes a recipe for scraping at least the metadata and making it easier to navigate, either as a CSV data file or as a single long text file that gathers the results for all the records into one searchable document.

Available Recordings

Recordings can be listed by artist. By default, the results are paged in groups of 10. The URL for the second page of results for a search on “Duncan Williamson” is:

url = "https://www.tobarandualchais.co.uk/search?l=en&page=2&page_size=10&term=%22Williamson%2C+Duncan%2C+1928-2007+%284292%29%22&type=archival_object"

The page size and page number are readily visible in the URL. The results report suggests 29 pages of results are available, so just under 300 results in all.

Rather than scrape all the next page links, we can generate them from the page size and the number of results pages.

Increasing the page size seems to make the server at the other end struggle a bit. A page size of 500 returns only 250 items, so we would need at least two calls to get all the results in any case. To make things easier for the server, we’ll limit ourselves to batches of 50 results, which means making six results page calls in all.

To support this, we can parameterise the URL:

_url = "https://www.tobarandualchais.co.uk/search?l=en&page={page}&page_size={page_size}&term=%22Williamson%2C+Duncan%2C+1928-2007+%284292%29%22&type=archival_object"

We’ll start by looking at a small page, with just five results. We can construct an appropriate URL as follows:

url = _url.format(page=1, page_size=5)
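
The same template can be used to generate every results page URL up front. The following is a minimal sketch rather than the notebook's own code: the function name `results_page_urls` is hypothetical, and the total of 290 results is an assumption based on the reported 29 pages of 10.

```python
import math

# Same template as above, repeated so the sketch is self-contained
URL_TEMPLATE = (
    "https://www.tobarandualchais.co.uk/search?l=en&page={page}"
    "&page_size={page_size}"
    "&term=%22Williamson%2C+Duncan%2C+1928-2007+%284292%29%22"
    "&type=archival_object"
)


def results_page_urls(total_results, page_size=50):
    """Return one search URL per results page (pages are numbered from 1)."""
    num_pages = math.ceil(total_results / page_size)
    return [
        URL_TEMPLATE.format(page=page, page_size=page_size)
        for page in range(1, num_pages + 1)
    ]


# Assumption: roughly 290 results in total (29 pages of 10)
page_urls = results_page_urls(total_results=290, page_size=50)
print(len(page_urls))
```

With a page size of 50, this yields the six calls mentioned above.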

The bs4 / BeautifulSoup package is a Python package that supports the parsing and processing of HTML and XML documents.

import requests
from bs4 import BeautifulSoup

From the raw HTML text, we can create a navigable “soup” that allows us to reference different elements within the HTML structure.

response = requests.get(url)
# Naming the parser explicitly avoids a warning and keeps results consistent
soup = BeautifulSoup(response.text, "html.parser")

Using a browser’s developer tools, we can explore the HTML structure of the page in relation to the rendered view.

For example, the number of results pages is given in a p element with class search-page:

[Image: path to the search results page count]

We can retrieve the text contained in the element by referencing the element:

num_results_pages = soup.find("p", {"class": "search-page"}).text
'Page 1 of 58'

We can extract the number of results pages by splitting that string on whitespace, picking the last item, and casting it to an integer.

num_results_pages = int(num_results_pages.split()[-1])
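
Splitting on whitespace works as long as the label keeps exactly that shape. A slightly more defensive alternative, sketched here with a hypothetical helper `parse_page_count`, uses a regular expression and returns `None` if the label does not match:

```python
import re


def parse_page_count(label):
    """Extract the total page count from text such as 'Page 1 of 58'."""
    match = re.search(r"of\s+(\d+)", label)
    return int(match.group(1)) if match else None


print(parse_page_count("Page 1 of 58"))
```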

Each results item in the results page includes some metadata and a link to a record results page.

Looking at the page structure, we see that the results links have the class search-item__link. We can use this as a crib to extract the links:

example_track_links = soup.find_all("a", {"class": "search-item__link"})
[<a class="search-item__link" href="/track/60155?l=en">View Track</a>,
 <a class="search-item__link" href="/track/60156?l=en">View Track</a>,
 <a class="search-item__link" href="/track/60158?l=en">View Track</a>,
 <a class="search-item__link" href="/track/60162?l=en">View Track</a>,
 <a class="search-item__link" href="/track/60167?l=en">View Track</a>]

The links are relative to the domain, https://www.tobarandualchais.co.uk.

domain = "https://www.tobarandualchais.co.uk"

The metadata that appears on the search results page is duplicated in an actual record page, so there is no need to scrape it from the results page. Instead, we’ll get what we need from the results record pages.

Let’s get an example record page down. First we construct a page URL:

example_page_url = f"{domain}{example_track_links[0]['href']}"
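
As an alternative to string formatting, the standard library's `urllib.parse.urljoin` can resolve relative links against the domain. This is a small sketch using two of the relative paths shown earlier; the variable names are illustrative only.

```python
from urllib.parse import urljoin

domain = "https://www.tobarandualchais.co.uk"

# Relative hrefs as they appear in the search results links
relative_links = ["/track/60155?l=en", "/track/60156?l=en"]

# Resolve each relative link against the domain
track_urls = [urljoin(domain, href) for href in relative_links]
print(track_urls[0])
```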

Then we grab the page and make soup from it:

example_record_soup = BeautifulSoup(requests.get(example_page_url).text, "html.parser")

The title, which appears to be the first line of summary with a maximum character limit, is in a span element with a contributor__title class:

# Title
example_record_soup.find("span", {"class": "contributor__title" }).text
'Balmoral Highlanders/Father John MacMillan of Barra/Jean Mauchline'

The rest of the page is not so conveniently structured, with the same classes appearing in each part of the record. However, we can identify the appropriate block from an h3 element containing the text Summary and then just grab the next p element that follows it:

# Summary
str(example_record_soup.find("h3", string="Summary").find_next("p"))
'<p class="contributor-bio-item__content">Diddling of three marches. They are \'Balmoral Highlanders\', \'Father John MacMillan of Barra\' and \'Jean Mauchline\'.</p>'
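
Because the same label-then-value pattern recurs for other fields, it may be convenient to wrap it in a helper that returns an empty string when a field is missing. The sketch below assumes the structure described above; the helper name `labelled_text` and the HTML fragment are hypothetical stand-ins for a real record page.

```python
from bs4 import BeautifulSoup


def labelled_text(soup, tag, label, value_tag):
    """Return the text of the first value_tag after a tag whose text is label."""
    heading = soup.find(tag, string=label)
    if heading is None:
        return ""
    value = heading.find_next(value_tag)
    return value.get_text(strip=True) if value else ""


# Hypothetical fragment mimicking the record page structure
html = """
<h3>Summary</h3>
<p class="contributor-bio-item__content">Diddling of three marches.</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(labelled_text(soup, "h3", "Summary", "p"))
# A field that is absent from the page gives an empty string
print(repr(labelled_text(soup, "h3", "Genre", "p")))
```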

The date is another useful metadata field, which we can locate from the span labelled "Date" that precedes it:

example_date_str = example_record_soup.find("span", string='Date').find_next("span").text

To work with dates as date objects, we can use the dateparser and datetime packages:

from dateparser import parse
import datetime

We can try to parse this into a datetime object:

# Output date format
dt = "%Y-%m-%d"

# If only a year is specified, the missing parts of the parsed datetime
# are filled in relative to the current datetime by default.
# Setting RELATIVE_BASE forces a fixed dummy date to be used instead.
try_date = parse(example_date_str.strip(),
                 settings={'RELATIVE_BASE': datetime.datetime(2000, 1, 1)})
example_record_date = try_date.strftime(dt) if try_date else ''
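
If installing dateparser is not an option, a rough standard-library fallback is to try a short list of formats in turn. This is a swapped-in technique rather than the approach used above, and the candidate formats are assumptions: the date strings actually used on the site may differ.

```python
from datetime import datetime

# Assumed input formats; the site may use others
CANDIDATE_FORMATS = ("%d %B %Y", "%B %Y", "%Y")


def normalise_date(date_str, out_format="%Y-%m-%d"):
    """Return date_str reformatted as out_format, or '' if no format matches."""
    cleaned = date_str.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime(out_format)
        except ValueError:
            continue
    return ""


print(normalise_date(" 1976 "))
```

Here a year-only value falls back to 1 January of that year, which plays the same role as the dummy relative base above.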

If available, the genre is also likely to be of interest to us, so that we can search for songs, or stories, for example:

example_genre = example_record_soup.find("h3", string='Genre').find_next("p").text

The audio file(s) seem to be loaded via turbo-frame elements. Each frame in turn appears to load a page containing a media player, with the audio file referenced from a source element. So we can grab all the turbo-frame elements from a record page, iterate through them, extract the frame path from each one, and then load the corresponding frame page. Each of these frame pages contains an audio source element from which we can grab the audio file URL.

example_sources = []

# Grab and iterate through each turbo-frame element
for turbo_frame in example_record_soup.find_all('turbo-frame'):
    # The frame URL is given by the src attribute
    turbo_frame_url = f'{domain}{turbo_frame["src"]}'
    # Get the frame page text, make soup from it
    # and find the (first and only) source element
    # Append this element to our sources list for the record page
    frame_html = requests.get(turbo_frame_url).text
    example_sources.append(BeautifulSoup(frame_html, "html.parser").find("source"))

[<source src="https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10602/SOSS_007913_060155.mp4" type="audio/mp4"/>]
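
The first half of that loop, collecting the frame URLs, can be separated out and tried without any network access. The sketch below is illustrative only: the helper name `turbo_frame_urls` is hypothetical, and the frame path in the sample HTML is made up rather than taken from the site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def turbo_frame_urls(record_html, base_url):
    """Return absolute URLs for every turbo-frame that has a src attribute."""
    soup = BeautifulSoup(record_html, "html.parser")
    return [
        urljoin(base_url, frame["src"])
        for frame in soup.find_all("turbo-frame")
        if frame.get("src")
    ]


# Hypothetical fragment; the frame path is invented for illustration
html = (
    '<turbo-frame id="a" src="/frames/example"></turbo-frame>'
    '<turbo-frame id="b"></turbo-frame>'
)
frame_urls = turbo_frame_urls(html, "https://www.tobarandualchais.co.uk")
print(frame_urls)
```

Frames without a src attribute are skipped rather than raising a KeyError.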

If we want, we can embed that audio in our own player. Let’s start by downloading a local copy of the audio file.

The pathlib package provides a range of tools for working with files:

from pathlib import Path

Start by ensuring we have a directory available to download the audio files into:

download_dir_name = "audio"

# Generate a path
download_dir = Path(download_dir_name)

# Ensure the directory (and its parents for a long path) exist
download_dir.mkdir(parents=True, exist_ok=True)

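The urllib.request module provides a function for downloading a file from a URL to a specified location:

# Import the submodule explicitly; a bare "import urllib" does not guarantee it is loaded
import urllib.request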

Now we can download the audio file into that directory. The filename is the last part of the URL:

# The URL is given by the src attribute of a source element
audio_url = example_sources[0]["src"]
# The file name is the last part of the URL
audio_filename = audio_url.split("/")[-1]

# Create a path to the audio file in the download directory
local_audio = download_dir / audio_filename

# Download the audio file from the specified URL to the required location
urllib.request.urlretrieve(audio_url, local_audio)
 <http.client.HTTPMessage at 0x10f0097f0>)
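
When many records are scraped, it may be worth skipping files that have already been downloaded. The following is a sketch under that assumption; the helper names `filename_from_url` and `download_once` are hypothetical, and only the filename step is exercised here so that no network request is made.

```python
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the final path segment of a URL, ignoring any query string."""
    return PurePosixPath(unquote(urlparse(url).path)).name


def download_once(url, directory):
    """Download url into directory unless the file is already there."""
    target = Path(directory) / filename_from_url(url)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, target)
    return target


audio_url = (
    "https://digitalpreservation.is.ed.ac.uk/bitstream/handle/"
    "20.500.12734/10602/SOSS_007913_060155.mp4"
)
print(filename_from_url(audio_url))
```

Calling `download_once(audio_url, "audio")` would then fetch the file only on the first call and return the local path each time.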

Now we can play it from the local copy:

from IPython.display import Audio