Duncan Williamson Audio Recordings#
The Tobar an Dualchais website is an archival website for Scottish oral tradition.
The site hosts several Duncan Williamson stories, but they are not as discoverable as they could be. So this notebook will describe a recipe for scraping the metadata, at least, and making it available in a more easily navigable form in the form of a CSV data file or a long text file that includes the results for all the records in a single, searchable document.
Available Recordings#
Recordings can be listed by artist. The results are paged, by default, in groups of 10. The URL for the second page of search results for a search on “Duncan Williamson” is given as:
url = "https://www.tobarandualchais.co.uk/search?l=en&page=2&page_size=10&term=%22Williamson%2C+Duncan%2C+1928-2007+%284292%29%22&type=archival_object"
The page size and page number are readily visible in the URL. The results report suggests 29 pages of results are available, so just under 300 results in all.
Rather than scrape all the next page links, we can generate them from the page size and the number of results pages.
Upping the page size seems to cause the server on the other end to struggle a bit: setting a page size of 500 returns only 250 items. Given that we're going to have to make at least two calls to get all the results anyway, let's make things a bit easier for the server and limit ourselves to batches of 50 results, which means we'll need to make six results page calls in all.
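As a quick sanity check, a little arithmetic confirms how many page requests that batch size implies, assuming the roughly 290 results suggested by the results report:

```python
import math

# Approximate number of results suggested by the results report
approx_num_results = 290
# Number of results we will request per results page
batch_size = 50

# Number of results page calls we expect to have to make
math.ceil(approx_num_results / batch_size)
```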
To support this, we can parameterise the URL:
_url = "https://www.tobarandualchais.co.uk/search?l=en&page={page}&page_size={page_size}&term=%22Williamson%2C+Duncan%2C+1928-2007+%284292%29%22&type=archival_object"
We’ll start by looking at a small page, with just five results. We can construct an appropriate URL as follows:
url = _url.format(page=1, page_size=5)
url
'https://www.tobarandualchais.co.uk/search?l=en&page=1&page_size=5&term=%22Williamson%2C+Duncan%2C+1928-2007+%284292%29%22&type=archival_object'
The `bs4` / BeautifulSoup package supports the parsing and processing of HTML and XML documents.
import requests
from bs4 import BeautifulSoup
From the raw HTML text, we can create a navigable “soup” that allows us to reference different elements within the HTML structure.
response = requests.get(url)
soup = BeautifulSoup(response.text)
Using a browser’s developer tools, we can explore the HTML structure of the page in relation to the rendered view.
For example, the number of results pages is given in a `p` element with class `search-page`:
We can retrieve the text contained in the element by referencing the element:
num_results_pages = soup.find("p", {"class": "search-page"}).text
num_results_pages
'Page 1 of 58'
We can easily extract the number of results pages by splitting that string on white space, picking the last item, and casting it to an integer.
num_results_pages = int(num_results_pages.split()[-1])
num_results_pages
58
Each results item in the results page includes some metadata and a link to a record results page.
Looking at the page structure, we see that the results links have the class `search-item__link`. We can use this as a crib to extract the links:
example_track_links = soup.find_all("a", {"class": "search-item__link"})
example_track_links
[<a class="search-item__link" href="/track/60155?l=en">View Track</a>,
<a class="search-item__link" href="/track/60156?l=en">View Track</a>,
<a class="search-item__link" href="/track/60158?l=en">View Track</a>,
<a class="search-item__link" href="/track/60162?l=en">View Track</a>,
<a class="search-item__link" href="/track/60167?l=en">View Track</a>]
example_track_links[0]['href']
'/track/60155?l=en'
The links are relative to the domain, https://www.tobarandualchais.co.uk.
domain = "https://www.tobarandualchais.co.uk"
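As an aside, rather than concatenating strings ourselves, we could resolve the relative links against the domain using urllib.parse.urljoin; a minimal sketch:

```python
from urllib.parse import urljoin

# Resolve a relative track link against the site domain
urljoin(domain, example_track_links[0]['href'])
```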
The metadata that appears on the search results page is duplicated in an actual record page, so there is no need to scrape it from the results page. Instead, we’ll get what we need from the results record pages.
Let’s get an example record page down. First we construct a page URL:
example_page_url = f"{domain}{example_track_links[0]['href']}"
example_page_url
'https://www.tobarandualchais.co.uk/track/60155?l=en'
Then we grab the page and make soup from it:
example_record_soup = BeautifulSoup(requests.get(example_page_url).text)
The title, which appears to be the first line of the summary up to a maximum character limit, is in a `span` element with a `contributor__title` class:
# Title
example_record_soup.find("span", {"class": "contributor__title" }).text
'Balmoral Highlanders/Father John MacMillan of Barra/Jean Mauchline'
The rest of the page is not so conveniently structured, with the same class names appearing in each part of the result record. However, we can identify the appropriate block from an `h3` element with the text "Summary" contained within it, and then just grab the following `p` element:
# Summary
str(example_record_soup.find("h3", string="Summary").find_next("p"))
'<p class="contributor-bio-item__content">Diddling of three marches. They are \'Balmoral Highlanders\', \'Father John MacMillan of Barra\' and \'Jean Mauchline\'.</p>'
The date is another useful metadata field, which we can identify from a preceding `span` element containing the label "Date":
example_date_str = example_record_soup.find("span", string='Date').find_next("span").text
example_date_str
'1977'
To work with dates as date objects, we can use the `dateparser` and `datetime` packages:
from dateparser import parse
import datetime
We can try to parse this into a datetime object:
# Output date format
dt = "%Y-%m-%d"
# If only a year is specified, by default the parsed datetime
# will be set relative to the current datetime
# Or we can force a relative dummy date
try_date = parse(example_date_str.strip(),
settings={'RELATIVE_BASE': datetime.datetime(2000, 1, 1)})
example_record_date = try_date.strftime(dt) if try_date else ''
example_record_date
'1977-01-01'
If available, the genre is also likely to be of interest to us, so that we can search for songs or stories, for example:
example_genre = example_record_soup.find("h3", string='Genre').find_next("p").text
example_genre
'Music'
The audio file(s) seem to be loaded via `turbo-frame` elements. These in turn appear to load a page containing the media player in a `source` element. So we can grab all the `turbo-frame` elements from a page, iterate through them, extract the frame path from each one, and then load the corresponding frame page. Each of these frame pages then contains an audio `source` element from which we can grab the audio file URL.
example_sources = []
# Grab and iterate through each turbo-frame element
for turbo_frame in example_record_soup.find_all('turbo-frame'):
# The frame URL is given by the src attribute
turbo_frame_url = f'{domain}{turbo_frame["src"]}'
# Get the frame page text, make soup from it
# and find the (first and only) source element
# Append this element to our sources list for the record page
example_sources.append( BeautifulSoup(requests.get(turbo_frame_url).text).find("source") )
example_sources
[<source src="https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10602/SOSS_007913_060155.mp4" type="audio/mp4"/>]
If we want, we can embed that audio in our own player. Let's start by downloading a local copy of the audio file.
The `pathlib` package provides a range of tools for working with files:
from pathlib import Path
Start by ensuring we have a directory available to download the audio files into:
download_dir_name = "audio"
# Generate a path
download_dir = Path(download_dir_name)
# Ensure the directory (and its parents for a long path) exist
download_dir.mkdir(parents=True, exist_ok=True)
The `urllib.request` module provides a method for downloading a file from a URL to a specified location:
import urllib.request
Now we can download the audio file into that directory. The filename is the last part of the URL:
# The URL is given by the src attribute of a source element
audio_url = example_sources[0]["src"]
# The file name is the last part of the URL
audio_filename = audio_url.split("/")[-1]
# Create a path to the audio file in the download directory
local_audio = download_dir / audio_filename
# Download the audio file from the specified URL to the required location
urllib.request.urlretrieve(audio_url, local_audio)
(PosixPath('audio/SOSS_007913_060155.mp4'),
<http.client.HTTPMessage at 0x10f0097f0>)
Now we can play it from the local copy:
from IPython.display import Audio
Audio(local_audio)
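Alternatively, if we don't want to keep a local copy, the IPython Audio display object can also be pointed at the remote file via its url parameter; a minimal sketch:

```python
# Embed a player that streams the remote audio file directly
Audio(url=audio_url)
```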
Putting all the pieces together#
We can now put all the pieces together to make a scraper for the metadata (and optionally all the audio files) for all the Duncan Williamson tracks on the Tobar an Dualchais website (or at least, all those records identified by a search on Duncan Williamson).
The recipe will run something like:

- load the first search results page and find how many results pages there are;
- grab the results links from that first page;
- grab the results links from each of the remaining pages;
- for each results link, grab the result record page and extract the required data.
If we use the `requests-cache` package, we can keep a local copy of downloaded pages, so if we need to rerun things (for example, to extract more data fields) we will have locally cached copies of the files to work from.
We can set the cache to last for a specified amount of time:
import requests_cache
from datetime import timedelta
# Cache into a sqlite database; set to expire after 500 days
requests_cache.install_cache("tobar_cache", backend="sqlite",
expire_after=timedelta(days=500))
To make the scrape easier, let's set up some functions based on the sketches we've already made:
import time
#Relative links are relative to this domain
DOMAIN = "https://www.tobarandualchais.co.uk"
def get_search_results_page(search_term='"Williamson, Duncan, 1928-2007 (4292)"',
page_num=1, page_size=10):
"""Get paged search results page with specified number of results per page.
Return as soup.
"""
# Generate the request using provided parameters
response = requests.get("https://www.tobarandualchais.co.uk/search",
params= {"l": "en", "term": search_term,
"type":"archival_object",
"page": page_num, "page_size": page_size})
#print(response.url)
results_page_soup = BeautifulSoup(response.text)
return results_page_soup
def get_number_of_results_pages(results_page_soup):
"""Return the number of results pages available."""
num_results_pages_ = results_page_soup.find("p", {"class": "search-page"}).text
num_results_pages = int(num_results_pages_.split()[-1])
return num_results_pages
def get_result_links(results_page_soup):
"""Get results links from a results soup page."""
result_links_ = results_page_soup.find_all("a", {"class": "search-item__link"})
result_links = [result_link["href"] for result_link in result_links_]
return result_links
def get_audio_files_url(result_record_soup, domain=DOMAIN):
"""Get audio file link from turbo-frame loaded page."""
audio_sources = []
# Grab and iterate through each turbo-frame element
for turbo_frame in result_record_soup.find_all('turbo-frame'):
# The frame URL is given by the src attribute
turbo_frame_url = f'{domain}{turbo_frame["src"]}'
# Get the frame page text, make soup from it
# and find the (first and only) source element
# Append this element to our sources list for the record page
audio_url_ = BeautifulSoup(requests.get(turbo_frame_url).text).find("source")
if audio_url_:
audio_sources.append(audio_url_["src"])
return audio_sources
# Provide a delay between calls to the website
def get_result_record_data(result_path, domain=DOMAIN, audio=False, be_nice=0.1):
"""Get result record data for a result page."""
# Slight delay before we make a call
# This is just so we don't hammer the server
# If we are hitting the cached pages, this can be set to 0
time.sleep(be_nice)
result_record_data = {}
result_record_url = f"{domain}{result_path}"
response = requests.get(result_record_url)
# Is the response ok?
if response.status_code!=200:
print(f"Something wrong with {result_record_url}")
return {}
result_record_soup = BeautifulSoup(response.text)
# Title
result_record_data["title"] = result_record_soup.find("span", {"class": "contributor__title" }).text
# Summary
_summary = result_record_soup.find("h3", string="Summary")
result_record_data["summary"] = str(_summary.find_next("p")) if _summary else ''
# Date
result_record_data["raw_date"] = result_record_soup.find("span", string='Date').find_next("span").text
# Genre
result_record_data["genre"] = result_record_soup.find("h3", string='Genre').find_next("p").text
# URL
result_record_data["url"] = result_record_url
# Output date format
dt = "%Y-%m-%d"
# If only a year is specified, by default the parsed datetime
# will be set relative to the current datetime
# Or we can force a relative dummy date
try_date = parse(result_record_data["raw_date"].strip(),
settings={'RELATIVE_BASE': datetime.datetime(2000, 1, 1)})
result_record_data["date"] = try_date.strftime(dt) if try_date else ''
if audio:
result_record_data["audio_url"] = get_audio_files_url(result_record_soup,
domain=domain)
return result_record_data
Let’s try for just a single small results page:
results_page_soup = get_search_results_page(page_size=5)
results_links = get_result_links(results_page_soup)
num_results_pages = get_number_of_results_pages(results_page_soup)
results_links, num_results_pages
(['/track/60155?l=en',
'/track/60156?l=en',
'/track/60158?l=en',
'/track/60162?l=en',
'/track/60167?l=en'],
58)
And let’s see if we can iterate through those to get metadata results, first without the audio link:
results_metadata = []
for results_link in results_links:
results_metadata.append( get_result_record_data(results_link) )
results_metadata
[{'title': 'Balmoral Highlanders/Father John MacMillan of Barra/Jean Mauchline',
'summary': '<p class="contributor-bio-item__content">Diddling of three marches. They are \'Balmoral Highlanders\', \'Father John MacMillan of Barra\' and \'Jean Mauchline\'.</p>',
'raw_date': '1977',
'genre': 'Music',
'url': 'https://www.tobarandualchais.co.uk/track/60155?l=en',
'date': '1977-01-01'},
{'title': 'Diddling',
'summary': '<p class="contributor-bio-item__content">Duncan Williamson points out that the pronunciation of the \'words\' change in faster diddling. His mother was a great diddler, \'Bundle and Go\' being one of her favourites. He diddles two tunes.</p>',
'raw_date': '1977',
'genre': 'Information',
'url': 'https://www.tobarandualchais.co.uk/track/60156?l=en',
'date': '1977-01-01'},
{'title': 'Leaving Lismore',
'summary': '<p class="contributor-bio-item__content">A piper would choose a tune to play according to the location, occasion and listeners. He might play a slow tune, such as \'Leaving Lismore\'. It is diddled here.</p>',
'raw_date': '1977',
'genre': 'Information',
'url': 'https://www.tobarandualchais.co.uk/track/60158?l=en',
'date': '1977-01-01'},
{'title': 'Diddling',
'summary': '<p class="contributor-bio-item__content">Duncan Williamson diddles a tune he calls \'Loch Duie\' [Bràigh Loch Iall], to which he has also set words at the request of old Maggie Cameron. She liked the tune because Sandy played it on the pipes. He then diddles fragments of 2/4 marches.</p>',
'raw_date': '1977',
'genre': 'Information',
'url': 'https://www.tobarandualchais.co.uk/track/60162?l=en',
'date': '1977-01-01'},
{'title': 'The Bloody Fields of Flanders',
'summary': '<p class="contributor-bio-item__content">Duncan Williamson says this is his version of the 3/4 march \'The Bloody Fields of Flanders\', and that the tune was a great favourite with Hamish Henderson.</p>',
'raw_date': '1977',
'genre': 'Song',
'url': 'https://www.tobarandualchais.co.uk/track/60167?l=en',
'date': '1977-01-01'}]
And then with an audio link:
results_metadata = []
for results_link in results_links:
results_metadata.append( get_result_record_data(results_link, audio=True) )
results_metadata
[{'title': 'Balmoral Highlanders/Father John MacMillan of Barra/Jean Mauchline',
'summary': '<p class="contributor-bio-item__content">Diddling of three marches. They are \'Balmoral Highlanders\', \'Father John MacMillan of Barra\' and \'Jean Mauchline\'.</p>',
'raw_date': '1977',
'genre': 'Music',
'url': 'https://www.tobarandualchais.co.uk/track/60155?l=en',
'date': '1977-01-01',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10602/SOSS_007913_060155.mp4']},
{'title': 'Diddling',
'summary': '<p class="contributor-bio-item__content">Duncan Williamson points out that the pronunciation of the \'words\' change in faster diddling. His mother was a great diddler, \'Bundle and Go\' being one of her favourites. He diddles two tunes.</p>',
'raw_date': '1977',
'genre': 'Information',
'url': 'https://www.tobarandualchais.co.uk/track/60156?l=en',
'date': '1977-01-01',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10603/SOSS_007913_060156.mp4']},
{'title': 'Leaving Lismore',
'summary': '<p class="contributor-bio-item__content">A piper would choose a tune to play according to the location, occasion and listeners. He might play a slow tune, such as \'Leaving Lismore\'. It is diddled here.</p>',
'raw_date': '1977',
'genre': 'Information',
'url': 'https://www.tobarandualchais.co.uk/track/60158?l=en',
'date': '1977-01-01',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10604/SOSS_007913_060158.mp4']},
{'title': 'Diddling',
'summary': '<p class="contributor-bio-item__content">Duncan Williamson diddles a tune he calls \'Loch Duie\' [Bràigh Loch Iall], to which he has also set words at the request of old Maggie Cameron. She liked the tune because Sandy played it on the pipes. He then diddles fragments of 2/4 marches.</p>',
'raw_date': '1977',
'genre': 'Information',
'url': 'https://www.tobarandualchais.co.uk/track/60162?l=en',
'date': '1977-01-01',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10605/SOSS_007913_060162.mp4']},
{'title': 'The Bloody Fields of Flanders',
'summary': '<p class="contributor-bio-item__content">Duncan Williamson says this is his version of the 3/4 march \'The Bloody Fields of Flanders\', and that the tune was a great favourite with Hamish Henderson.</p>',
'raw_date': '1977',
'genre': 'Song',
'url': 'https://www.tobarandualchais.co.uk/track/60167?l=en',
'date': '1977-01-01',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10607/SOSS_007913_060167.mp4']}]
As that seems to work, now we can try for a complete scrape.
To keep track of progress, we can use a progress indicator:
# tqdm is a simple progress bar indicator
from tqdm.notebook import tqdm
To minimise the number of calls, we get all the results links first, then dedupe them:
results_batch_size = 50
results_page_num = num_results_pages = 1
# Get all results links
results_links = []
while results_page_num <= num_results_pages:
results_page_soup = get_search_results_page(page_num=results_page_num,
page_size=results_batch_size)
# Extend the list of results links we have so far
results_links.extend( get_result_links(results_page_soup) )
# If this is the first page of results, how many pages are there
if results_page_num==1:
num_results_pages = get_number_of_results_pages(results_page_soup)
print(f"{num_results_pages} results pages to download.")
# Increment which page we are on
results_page_num += 1
# Simple progress track
print(".", end="")
print(f"{len(results_links)} results links identified from search results.")
# Dedupe the results list by making a set from them
results_links = list(set(results_links))
print(f"{len(results_links)} unique results links.")
290 results links identified from search results.
290 unique results links.
Having got all the results links, we can now grab all the individual results pages:
# Now we can iterate though the unique results
results_metadata = []
for results_link in tqdm(results_links):
metadata_record = get_result_record_data(results_link, audio=True)
if metadata_record:
results_metadata.append( metadata_record )
Something wrong with https://www.tobarandualchais.co.uk/track/91107?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/66370?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/39404?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/30879?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/29116?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/64461?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/33313?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/28969?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/33323?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/30876?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/91097?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/91102?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/29143?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/65839?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/65843?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/29897?l=en
Let’s preview the last few results:
results_metadata[-5:]
[{'title': 'The Wee Toon Clerk',
'summary': '<p class="contributor-bio-item__content">Comic night-visiting song.</p>',
'raw_date': '18 January 1987',
'genre': 'Song',
'url': 'https://www.tobarandualchais.co.uk/track/91091?l=en',
'date': '1987-01-18',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/5508/SOSS_009517_091091.mp4']},
{'title': 'Black Plague of Mingaley',
'summary': '<p class="contributor-bio-item__content">The contributor mentions a song about a plague on Mingulay which killed the entire local population, and recites one verse.<br/><br/>Followed by a brief discussion of the contributor\'s song repertoire. If a song doesn\'t appeal to people he\'s travelling with, he won\'t sing it. If they don\'t know the song or it doesn\'t mean much to them he gives them something else.</p>',
'raw_date': '21 February 1976',
'genre': 'Song',
'url': 'https://www.tobarandualchais.co.uk/track/110378?l=en',
'date': '1976-02-21',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/54205/SOSS_008651_110378.mp4']},
{'title': "Kilmarnock's Town",
'summary': '<p class="contributor-bio-item__content">A fragment of the song often called \'Kilmarnock\'s Town\', in which Willie discovers the body of his sweetheart floating down the tide; he rescues her corpse and laments.<br/><br/>Duncan Williamson describes how his mother sang this song (the events of which take place in Ayrshire) in an Argyllshire style; she liked romantic songs such as this, whereas his father preferred old ballads and war songs.</p>',
'raw_date': '31 October 1976',
'genre': 'Song',
'url': 'https://www.tobarandualchais.co.uk/track/32113?l=en',
'date': '1976-10-31',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10872/SOSS_007847_032113.mp4']},
{'title': "Duncan Williamson's favourite songs.",
'summary': '<p class="contributor-bio-item__content">Duncan Williamson\'s favourite songs.<br/><br/>Duncan Williamson\'s favourite songs are about courage. He lists: \'Sir Patrick Spens\', \'Jock o Braidisley\', \'Hind Horn\', \'The Factory Girl\', \'Marlin Green\', \'Thomas the Rhymer\'. \'Boreland Shore\' is a favourite because his father sang it and it has a good tune. He likes the stories contained in these songs. Travellers all like to hear songs about courage.</p>',
'raw_date': '31 October 1976',
'genre': 'Information',
'url': 'https://www.tobarandualchais.co.uk/track/32116?l=en',
'date': '1976-10-31',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/10873/SOSS_007847_032116.mp4']},
{'title': 'Tramps and Hawkers',
'summary': '<p class="contributor-bio-item__content">In this Traveller\'s song, the man talks of the various parts of Scotland to which he has travelled for work, and his carefree life on the road. He soon resigns himself to the fact that he can no longer earn enough in Scotland so he resolves to go to Paddy\'s land [Ireland].</p>',
'raw_date': '23 December 1978',
'genre': 'Song',
'url': 'https://www.tobarandualchais.co.uk/track/66365?l=en',
'date': '1978-12-23',
'audio_url': ['https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/8799/SOSS_008353_066365.mp4']}]
Let’s now serialise the results into a text file.
The following function will generate a single record report:
import markdownify
def display_record(record):
"""text display of record."""
txt = f"""
{record['title']}
{record["url"]}
{record['genre']}
{record['raw_date']}
{markdownify.markdownify(record['summary']).strip()}
{" :: ".join(record['audio_url'])}
"""
return txt
from IPython.display import Markdown
Markdown( display_record(results_metadata[0]) )
The Princess on the Glass Hill https://www.tobarandualchais.co.uk/track/28935?l=en Story 06 May 1976
The story of how lazy Jack got horses and armour and won the princess on the glass hill.
Jack was the youngest of three sons of a widow. He was very lazy and dirty and untidy. He never helped his mother as his brothers did. However one night he was persuaded to go and keep watch in their little cornfield to stop the deer eating the crop. He sat down leaning his back on a tree and fell asleep. A noise awoke him, and he saw a black horse with black armour tied to the saddle. He took the horse and hid it in the wood, so that his brothers would not get it. The next night he agreed to keep watch again, and this time he was wakened by a white horse, with silver armour. He hid it with the first horse. On the third night when Jack was keeping watch a brown horse with gold armour came, and Jack hid it too. The following day Jack’s brothers brought news from the market that the king would come and make an announcement. Jack went with his brothers and heard the king announce from the royal coach that his daughter would marry the man who could climb on horseback the glass hill on which she was sitting, and who could catch three golden apples which she would throw. If no-one could do this within the next three days the princess would not marry anyone. On the first day countless horsemen of all kinds tried to climb the hill, but the horses’ hooves slipped on the glass. Then Jack came on the black horse and dressed in the black armour and managed to get some distance up the hill. The princess threw a golden apple and Jack caught it and rode away. On the second day still no-one could climb the glass hill. Jack came on the white horse and dressed in the silver armour and got half-way up the glass hill. The princess threw him a golden apple and he again rode off, too fast for the king’s soldiers to catch him. The same thing happened on the third day, and this time Jack came on the brown horse and wore the gold armour. He got to the top of the glass hill and kissed the princess, who gave him the third apple. Again he rode off before anyone could catch him. The king ordered every village and every house to be searched for the three knights, but they were not found. Jack set out for the palace in his usual dirty and untidy condition, but with the three golden apples in his pockets. Soldiers and guards tried to stop him reaching the king, thinking he was a tramp, but he said he had news for the king and princess and was eventually admitted. He presented the princess with the golden apples and revealed that he was the three knights. The king was delighted, and ordered Jack to be cleaned up and dressed in good clothes. The princess fell in love with him and they were married. Meanwhile Jack’s mother and brothers wondered what had happened to him, but his brothers would not go looking for him. Then a royal carriage attended by soldiers came to their hut. Jack’s mother thought Jack had got them into some kind of trouble, and begged for his life to be spared. Jack then revealed himself, and took his mother and his horses and armour back to the castle. Jack’s brothers were left in the woodcutter’s hut.
Before and after telling the story contributor tells his grandchildren to be quiet and listen so that they will be able to pass on the story. At the end he promises to tell them other stories.
https://digitalpreservation.is.ed.ac.uk/bitstream/handle/20.500.12734/7870/SOSS_008607_028935.mp4
full_text = """# Duncan Williamson Audio on Tobar an Dualchais
Via: https://www.tobarandualchais.co.uk/
---
"""
full_text += "\n---\n".join([display_record(record_result) for record_result in results_metadata])
Markdown(full_text[:1000])
Duncan Williamson Audio on Tobar an Dualchais
Via: https://www.tobarandualchais.co.uk/
The Princess on the Glass Hill https://www.tobarandualchais.co.uk/track/28935?l=en Story 06 May 1976
The story of how lazy Jack got horses and armour and won the princess on the glass hill.
Jack was the youngest of three sons of a widow. He was very lazy and dirty and untidy. He never helped his mother as his brothers did. However one night he was persuaded to go and keep watch in their little cornfield to stop the deer eating the crop. He sat down leaning his back on a tree and fell asleep. A noise awoke him, and he saw a black horse with black armour tied to the saddle. He took the horse and hid it in the wood, so that his brothers would not get it. The next night he agreed to keep watch again, and this time he was wakened by a white horse, with silver armour. He hid it with the first horse. On the third night when Jack was keeping watch a brown horse with gold armour came, and Jack
with open("williamson_audio.md", "w") as f:
f.write(full_text)
The pandas package provides powerful tools for working with tabular datasets; its `DataFrame` object can be used to represent such data:
from pandas import DataFrame
We can also generate a simple CSV file, most conveniently via a pandas dataframe:
df = DataFrame(results_metadata)
df.head()
|   | title | summary | raw_date | genre | url | date | audio_url |
|---|---|---|---|---|---|---|---|
| 0 | The Princess on the Glass Hill | <p class="contributor-bio-item__content">The s... | 06 May 1976 | Story | https://www.tobarandualchais.co.uk/track/28935... | 1976-05-06 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 1 | Jack mistook a thorn tree for an old woman, an... | <p class="contributor-bio-item__content">Jack ... | 11 July 1976 | Story | https://www.tobarandualchais.co.uk/track/30609... | 1976-07-11 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 2 | The story of an old ballad about a farmer refu... | <p class="contributor-bio-item__content">The s... | 13 November 1976 | Story | https://www.tobarandualchais.co.uk/track/33139... | 1976-11-13 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 3 | Down in Yonder Bushes | <p class="contributor-bio-item__content">Jilte... | 17 July 1976 | Song | https://www.tobarandualchais.co.uk/track/30890... | 1976-07-17 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 4 | Lord Ullin's Daughter | <p class="contributor-bio-item__content">A son... | September 1977 | Song | https://www.tobarandualchais.co.uk/track/78630... | 1977-09-01 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
Convert the summary HTML to markdown:
df['summary'] = df['summary'].apply(lambda x: markdownify.markdownify(x).strip())
df.head()
|   | title | summary | raw_date | genre | url | date | audio_url |
|---|---|---|---|---|---|---|---|
| 0 | The Princess on the Glass Hill | The story of how lazy Jack got horses and armo... | 06 May 1976 | Story | https://www.tobarandualchais.co.uk/track/28935... | 1976-05-06 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 1 | Jack mistook a thorn tree for an old woman, an... | Jack mistook a thorn tree for an old woman, an... | 11 July 1976 | Story | https://www.tobarandualchais.co.uk/track/30609... | 1976-07-11 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 2 | The story of an old ballad about a farmer refu... | The story of an old ballad about a farmer refu... | 13 November 1976 | Story | https://www.tobarandualchais.co.uk/track/33139... | 1976-11-13 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 3 | Down in Yonder Bushes | Jilted lover's song. | 17 July 1976 | Song | https://www.tobarandualchais.co.uk/track/30890... | 1976-07-17 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
| 4 | Lord Ullin's Daughter | A song derived from Thomas Campbell's poem 'Lo... | September 1977 | Song | https://www.tobarandualchais.co.uk/track/78630... | 1977-09-01 | [https://digitalpreservation.is.ed.ac.uk/bitst... |
And stringify the list of audio_urls:
df['audio_url'] = df['audio_url'].apply(lambda x: ' :: '.join(x))
df.head()
|   | title | summary | raw_date | genre | url | date | audio_url |
|---|---|---|---|---|---|---|---|
| 0 | The Princess on the Glass Hill | The story of how lazy Jack got horses and armo... | 06 May 1976 | Story | https://www.tobarandualchais.co.uk/track/28935... | 1976-05-06 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 1 | Jack mistook a thorn tree for an old woman, an... | Jack mistook a thorn tree for an old woman, an... | 11 July 1976 | Story | https://www.tobarandualchais.co.uk/track/30609... | 1976-07-11 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 2 | The story of an old ballad about a farmer refu... | The story of an old ballad about a farmer refu... | 13 November 1976 | Story | https://www.tobarandualchais.co.uk/track/33139... | 1976-11-13 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 3 | Down in Yonder Bushes | Jilted lover's song. | 17 July 1976 | Song | https://www.tobarandualchais.co.uk/track/30890... | 1976-07-17 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 4 | Lord Ullin's Daughter | A song derived from Thomas Campbell's poem 'Lo... | September 1977 | Song | https://www.tobarandualchais.co.uk/track/78630... | 1977-09-01 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
We can trivially save the dataframe as a CSV file, adding some structure along the way by first sorting the dataframe by genre and date.
df.sort_values(["genre", "date"]).to_csv("duncan_williamson_audio.csv", index=False)
The saved files should contain a complete summary of records available.
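As a quick check that the saved CSV file is usable, we can read it back in and count the records and genres; a minimal sketch, assuming the file was written as above:

```python
from pandas import read_csv

# Load the saved CSV file back in
check_df = read_csv("duncan_williamson_audio.csv")

# How many records, and how many of each genre?
print(len(check_df), "records")
print(check_df["genre"].value_counts())
```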
For convenience, I have placed copies of the files I generated here.
Grabbing A Catalogue for Betsy Whyte#
We can also use the above approach to grab a catalogue for other artists from the *Tobar an Dualchais* website.
For example, let’s get a catalogue for Betsy Whyte:
search_term = ' "Whyte, Betsy, 1919-1988 (3523)" '
results_batch_size = 50
results_page_num = num_results_pages = 1
# Get all results links
results_links = []
while results_page_num <= num_results_pages:
results_page_soup = get_search_results_page(search_term,
page_num=results_page_num,
page_size=results_batch_size)
# Extend the list of results links we have so far
results_links.extend( get_result_links(results_page_soup) )
# If this is the first page of results, how many pages are there
if results_page_num==1:
num_results_pages = get_number_of_results_pages(results_page_soup)
print(f"{num_results_pages} results pages to download.")
# Increment which page we are on
results_page_num += 1
print(".", end="")
print(f"{len(results_links)} results links identified from search results.")
# Dedupe the results list by making a set from them
results_links = list(set(results_links))
print(f"{len(results_links)} unique results links.")
7 results pages to download.
.......330 results links identified from search results.
330 unique results links.
# Now we can iterate though the unique results
results_metadata = []
for results_link in tqdm(results_links):
metadata_record = get_result_record_data(results_link, audio=True)
if metadata_record:
results_metadata.append( metadata_record )
Something wrong with https://www.tobarandualchais.co.uk/track/76984?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/69721?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76909?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76986?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/36989?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/77000?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/69748?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/77213?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76426?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/39246?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/69716?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/69710?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76951?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76430?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76994?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/33049?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/39244?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/69785?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76401?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76970?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/36971?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/69722?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/36982?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/36991?l=en
Something wrong with https://www.tobarandualchais.co.uk/track/76405?l=en
full_text = """# Betsy Whyte Audio on Tobar an Dualchais
Via: https://www.tobarandualchais.co.uk/
---
"""
full_text += "\n---\n".join([display_record(record_result) for record_result in results_metadata])
with open("whyte_audio.md", "w") as f:
f.write(full_text)
df = DataFrame(results_metadata)
df['summary'] = df['summary'].apply(lambda x: markdownify.markdownify(x).strip())
df['audio_url'] = df['audio_url'].apply(lambda x: ' :: '.join(x))
df = df.sort_values(["genre", "date"])
# Save to CSV
df.to_csv("betsy_whyte_audio.csv", index=False)
df.head()
|     | title | summary | raw_date | genre | url | date | audio_url |
|---|---|---|---|---|---|---|---|
| 142 | Travellers used cant words when people in auth... | Travellers used cant words when people in auth... | 14 December 1973 | Information | https://www.tobarandualchais.co.uk/track/76578... | 1973-12-14 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 248 | Traveller families, their nicknames and rules ... | Traveller families, their nicknames and rules ... | 14 December 1973 | Information | https://www.tobarandualchais.co.uk/track/76567... | 1973-12-14 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 0 | Discussion about putting the 'buchloch' on peo... | Discussion about putting the 'buchloch' on peo... | 02 May 1974 | Information | https://www.tobarandualchais.co.uk/track/36963... | 1974-05-02 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 90 | Itinerant singers. | Itinerant singers. \n \nIn Betsy Whyte's gra... | 02 May 1974 | Information | https://www.tobarandualchais.co.uk/track/36920... | 1974-05-02 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
| 138 | Betsy Whyte recites a toast of her mother's. | Betsy Whyte recites a toast of her mother's. ... | 02 May 1974 | Information | https://www.tobarandualchais.co.uk/track/36908... | 1974-05-02 | https://digitalpreservation.is.ed.ac.uk/bitstr... |
Gist of Betsy Whyte files here.