Ashliman “Folklore and Mythology Electronic Texts” Scraper

A scraper to grab all the stories edited / translated by D. L. Ashliman that are referenced from https://sites.pitt.edu/~dash/folktexts.html .

Note that these are Copyright D. L. Ashliman, University of Pittsburgh, 1996-2022.

My intention is to create a personal collection of these stories to support personal search and discovery of stories.

The stories appear to have been added to the website over several years. Whilst there is a lot of structure that we can exploit to extract the story texts, there may be some stories that aren’t extracted correctly, or even at all.

story_list_url = "https://sites.pitt.edu/~dash/folktexts.html"

The stories are collected on thematic / story type pages that are referenced from https://sites.pitt.edu/~dash/folktexts.html.

Let’s start by grabbing links to all those pages; we’ll use requests-cache to cache the pages so we don’t repeatedly hit the website as we test the script…

# Suppress any warnings
import warnings
warnings.filterwarnings('ignore')
# We wouldn't normally do this but it makes the book output cleaner in a couple of places...

#%pip install requests-html requests-cache
import requests_cache
from datetime import timedelta

requests_cache.install_cache('web_cache', backend='sqlite', expire_after=timedelta(days=100))

from requests_html import HTMLSession
 
url = "https://sites.pitt.edu/~dash/folktexts.html"
session = HTMLSession()
index_response = session.get(url)

# Clean the set of links to just the links we are interested in
links = [l for l in index_response.html.absolute_links if \
         l.startswith(url.replace(url.split("/")[-1], "")) and "#" not in l]
# Note that the order of the links is not guaranteed, so we'll sort them to fix the order
links.sort()

links[:3]
['https://sites.pitt.edu/~dash/abduct.html',
 'https://sites.pitt.edu/~dash/aesopold.html',
 'https://sites.pitt.edu/~dash/aesopskids.html']
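The filter above keeps only links that sit in the same directory as the index page and drops in-page anchors. A minimal standard-library sketch of the same idea, run against a hand-written set of links rather than the live site (the helper name `filter_story_links` and the sample links are illustrative assumptions, not part of the original script):

```python
def filter_story_links(candidate_links, index_url):
    """Keep links under the index page's directory, dropping in-page anchors."""
    # "https://sites.pitt.edu/~dash/folktexts.html" -> "https://sites.pitt.edu/~dash/"
    base = index_url.rsplit("/", 1)[0] + "/"
    kept = {link for link in candidate_links
            if link.startswith(base) and "#" not in link}
    # Sets are unordered, so sort to get a reproducible order
    return sorted(kept)


# Hand-written sample links (assumed for illustration)
sample_links = {
    "https://sites.pitt.edu/~dash/abduct.html",
    "https://sites.pitt.edu/~dash/folktexts.html#a",
    "https://www.pitt.edu/",
    "https://sites.pitt.edu/~dash/aesopold.html",
}
story_links = filter_story_links(
    sample_links, "https://sites.pitt.edu/~dash/folktexts.html"
)
print(story_links)
```

Using `rsplit("/", 1)` only ever touches the final path segment, which is a little more defensive than replacing the filename wherever it occurs in the URL.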

The story pages have a consistent form, so if we inspect one of the pages, we should be able to generate some sort of template for scraping all the pages.

First of all, grab a story page:

page_response = session.get(links[0])

Let’s see if we can extract some information from the header at the top of the page:

# Get the head information
# This works for many pages, but not all..
header = page_response.html.find('center')
if not header:
    header = page_response.html.xpath('//h1[@align="CENTER"]')

if header:
    header = header[0].full_text

header
'\nAbducted by Aliens  \nEdited by\n\n\nD. L. Ashliman  \n© 1999-2021\n'

The general category is given in the first line:

category = header.strip().split('\n')[0].strip()
category
'Abducted by Aliens'

We may also have subheader content:

# eg https://sites.pitt.edu/~dash/type1586.html
subheading = page_response.html.xpath('//p[@align="CENTER"]')

if subheading:
    subheading = subheading[0].text

subheading
[]

Some page URLs are ATU codes; we can co-opt these as low hanging fruit if we can’t otherwise access the ATU reference:

import re

url_regex = r"^type(?P<atu>[\da-zA-Z]+)\.html"

_url = 'https://sites.pitt.edu/~dash/type1586.html'.split('/')[-1]  # or, e.g., links[12].split('/')[-1]
matches = re.search(url_regex, _url, re.IGNORECASE | re.MULTILINE)

if matches:
    atu = matches.group("atu")

atu
'1586'

In some headers, there may be an indication of the Aarne-Thompson-Uther story type. If the tale type is described in a conventional way, we can easily extract it:

import re

regex= r"^.*Aarne-Thompson-Uther type (?P<atu>[\da-zA-Z]+).*"

matches= re.search(regex, page_response.html.find('center')[0].full_text, re.IGNORECASE | re.MULTILINE)

if matches:
    atu = matches.group("atu")

atu
'1586'
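Note that in the cells above `atu` is only assigned when a match is found, so a value left over from an earlier cell can survive a failed match. A minimal sketch that wraps both lookups in one function and returns `None` when nothing is found; the function name `find_atu` and the precedence (header text first, then URL) are assumptions for illustration:

```python
import re

# Same patterns as above, compiled once
URL_ATU = re.compile(r"^type(?P<atu>[\da-zA-Z]+)\.html", re.IGNORECASE)
TEXT_ATU = re.compile(r"Aarne-Thompson-Uther type (?P<atu>[\da-zA-Z]+)", re.IGNORECASE)


def find_atu(url, header_text=""):
    """Return an ATU code from the header text, else from the URL, else None."""
    text_match = TEXT_ATU.search(header_text or "")
    if text_match:
        return text_match.group("atu")
    # Fall back to the filename part of the URL, e.g. "type1586.html"
    url_match = URL_ATU.search(url.split("/")[-1])
    return url_match.group("atu") if url_match else None


print(find_atu("https://sites.pitt.edu/~dash/type1586.html"))
print(find_atu("https://sites.pitt.edu/~dash/abduct.html",
               "\nAbducted by Aliens  \nEdited by\n\n\nD. L. Ashliman  \n"))
```

The first call prints `1586` and the second prints `None`, because neither the filename nor the header of the second page carries a tale type.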

Inspection of the page response HTML suggests that stories are separated by a repeated <hr/><hr/> element. There is also a break before the first tale, of the form \n\n<hr/>\n<p>\n.

# Get the full HTML text
html = page_response.html.find('body')
# In some cases, the page may not be in an HTML tag but may just be body content
if html:
    html = html[0].html
else:
    html = page_response.html.html
    # These pages are old, so patch them
    html = html.replace("<hr>", "<hr/>")

# Split out the section containing the stories
stories_html = html.split("\n\n<hr/>\n<p>\n")[-1]

# Chunk the stories
story_chunks = stories_html.split("<hr/><hr/>")
# Some old pages may just carry one story with no splits
# If there is only one item, use that...
story_chunks = story_chunks if len(story_chunks)==1 else story_chunks[:-1]

# Preview the text in the first item
story_chunks[:2][0][:500]
'<body text="#000000" link="#0000ff" vlink="#800080" bgcolor="#eeffee">\n<hr/>\n<center>\n<h1>Abducted by Aliens </h1> \nEdited by<br/>\n<a href="http://www.pitt.edu/~dash/ashliman.html"><img src="dash.gif" align="top" border="2" width="42" height="38"/><br/>\n\nD. L. Ashliman</a><br/>  \n© 1999-2021\n</center>\n<p>\nThe aliens in the legends that follow are not those from outer space, but\nrather underground people from our own earth: fairies, trolls, elves, and\nthe like.\n<p>\n<hr/>\n<h2><a name="contents">Co'
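The splitting logic can be tried out on a small hand-written page without touching the website. The following is a sketch only: the sample HTML is made up to follow the layout described above, and the names `chunk_stories`, `STORY_SECTION_MARKER` and `STORY_SEPARATOR` are illustrative.

```python
STORY_SECTION_MARKER = "\n\n<hr/>\n<p>\n"  # break before the first tale
STORY_SEPARATOR = "<hr/><hr/>"              # break between tales


def chunk_stories(body_html):
    """Split page HTML into per-story chunks."""
    # Keep whatever follows the last section marker (the whole page if there is none)
    stories_html = body_html.split(STORY_SECTION_MARKER)[-1]
    chunks = stories_html.split(STORY_SEPARATOR)
    # With several chunks, the last one is assumed to be trailing page boilerplate
    return chunks if len(chunks) == 1 else chunks[:-1]


# Made-up sample page (assumption for illustration)
sample_page = (
    "<h1>Sample Page</h1>\nIntroductory text."
    "\n\n<hr/>\n<p>\n"
    "<h2>First Tale</h2>\nOnce...<hr/><ul><li>Source A</li></ul>"
    "<hr/><hr/>"
    "<h2>Second Tale</h2>\nLong ago...<hr/><ul><li>Source B</li></ul>"
    "<hr/><hr/>"
    "Return to folktexts."
)
sample_chunks = chunk_stories(sample_page)
print(len(sample_chunks))
```

If the section marker is missing from a page, the first chunk still carries the page header and contents list, which is what the preview above shows for this page.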

Each chunk may have additional information after the story, separated by a single <hr/>. So let’s split each story into a 2-tuple for now, of the form (story, additional_info).

story_chunk_tuples = []

for story_chunk in story_chunks:
    story_chunk_items = story_chunk.split('<hr/>')
    if len(story_chunk_items)==2:
        story_chunk_tuples.append( (story_chunk_items[0], story_chunk_items[1]) )
    
story_chunk_tuples[:2]
[('\n<h2><a name="brosnan">Taken by the Good People</a></h2>\n<h3>Ireland</h3>\nI was serving my time to the cattle trade, with a man the name of Lynch --\nGod be good to him! I suppose I was no more than twelve years of age at\nthe time. \'Twas a very out of the way place and mountainy. \n<p>\nWell, not far from my master\'s house there was a family of the Brogans.\n\'Twas the will of God that Mrs. Brogan took sick, and there was a baby\nborn, but the poor woman died. Well, the sister, a younger girl than the\nwoman that died, came to nurse the child. After some time she began to\nlook very delicate and uneasy. The naghbours were beginning to talk amongs\nthemselves about her, and it came to Brogan\'s ears, and, begor, it made\nhim vexed. So he asked the sister what was up with her.\n<p>\n"Well, John," says she, "I did not like to tell you, but Ellie" -- that\nwas the name of the dead woman -- "comes every night, and takes the baby\nand nurses it, and goes away without a word."\n<p>\n"By my word," says John, "she is not dead at all, but taken, and I will\nwatch her to-night."\n<p>\nGood enough, he remained up, and about 12 o\'clock in she came, and he put\nhis arms around her, but as he said, felt no substance.\n<p>\n"You can\'t keep me now," says she, "for I\'m married agin; but if you come\nto the Bottle Hill field to-morrow night, there will be about 40 of us\ngoin\' t\'words Blarney, and we will all be on horses, with our husbands.\nAll the horses will be white, and I and my man will be last. Bring a hazel\nstick woud [with] you and strike the horse on the right side, and I will\nfall off. Just as I fall, ketch me with all your might. You will know my\nman, for he is the only one of them that has a red head."\n<p>\nWell, he went, and he must have a great heart, for on they come, gallopin\'\nlike mad. Just as the man with the red head\'s horse came he stood one-side\nand struck. She fell and he gripped her like iron. 
Well, such a hullabaloo\nas there was, was never heard, and all the other men makin\' game of the\nred-headed man.\n<p>\nWell, he brought her home, and they lived for years after, and had a good\nfamily, and were the happiest people around the place. I often see some of\nher children; of course they are all married now, and gone here and there,\nbut that\'s as true as my name is Tim Brosnan.  <p>\n',
  '\n<ul>\n<li>Source (Internet Archive): "Folk-Tales from County Limerick collected by Miss D. Knox,"\n<a target="_blank" href="https://archive.org/stream/folklore28folkuoft#page/n4/mode/1up"><i>Folk-Lore: A Quarterly Review of Myth, Tradition, Institution, &amp;\nCustom</i></a> (London: Folk-Lore Society, 1917), v. 28, <a target="_blank" href="https://archive.org/stream/folklore28folkuoft#page/218/mode/2up">pp. 218-219</a>.<p>\n<li>Knox\'s source: Told by Tim Brosnan, Dungeagan, County Kerry.<p>\n<li>I have retained Knox\'s spelling.<p>\n<li>Return to the <a href="#contents">table of contents</a>.\n</li></p></li></p></li></p></li></ul>\n'),
 ('\n<h2><a name="kelly">Twenty Years with the Good People</a></h2>\n<h3>Ireland</h3>\nI had a gran\'uncle, he was a shoemaker; he was only about 3 or 4 months\nmarried. I\'m up to fourscore now. Well, God rest all their souls, for they\nare all gone, I hope to a better world!\n<p>\nWell, sir, he says to his wife, and a purty girl she was, as I hear um\nsay, -- the fortune wasn\'t very big but \'twould buy him a good bit of\nleather, and I might tell you, \'twas all brogues that was worn at the\ntime, and faith, you should be big before you would get them same. \n<p>\nHowisever, he started one day for Limerick would [with] and ass and car,\nto bring home leather and other little things he wanted. He did not return\nthat night or the next, nor the next. Begor, the wife and some frinds went\nto Limerick next day, but no trace of the husband could be found. I forgot\nto tell you that the third morning after he was gone the wife rose very\nearly, and there at the dure [door] was the ass and car. The whole country\nwas searched, up high and low down, but no trace. Weeks, monts and years\ncame and went, but he never turned up.\n<p>\nNow the wife kept on a little business, sellin\' nick-nacks to support\nherself, and a son, that grew to be a fine strapping man, as I hear um\nsay, the picture of his father.\n<p>\nNow, sir, the boy was in or about twenty, when one day, himself and his\nmother were atin\' their dinner, whin in comes a man and says, "God save\nye!"\n<p>\n"And you too," says the mother. "Will you ate a spud, sir?" says she.\n<p>\nHe rached for the spud, and in doin\' so the sleeve of his coat shortned as\nhe reached out his hand. He had a mole on his wrist and she see it, and\nher husband had one in the same spot.\n<p>\n"Good God!" says she, "are you John M\'Namara?" -- for that was his name.\n<p>\n"I am," says he, "and your husband, and that\'s my son, but I can\'t tell\nyou for some time where I was since I left you. 
But some time I might have\nthe power, but not now."\n<p>\nWell, lo and behold you, in a week\'s time he started to work, and the\nboots he made were a surprise to the whole country round, and I believe he\nlived for nine or ten years ater that, but he never tould her or any one\nwhere he was, but of course everbody knew that \'twas wood [with] the good\npeople. <p>\n',
  '\n<ul>\n<li>Source: "Folk-Tales from County Limerick collected by Miss D. Knox,"\n<a target="_blank" href="https://archive.org/stream/folklore28folkuoft#page/n4/mode/1up"><i>Folk-Lore: A Quarterly Review of Myth, Tradition, Institution, &amp;\nCustom</i></a> (London: Folk-Lore Society, 1917), v. 28, <a target="_blank" href="https://archive.org/stream/folklore28folkuoft#page/214/mode/2up">pp. 215-216</a>.<p>\n<li>Knox\'s source: Told by John Kelly, Cooraclare?, County Clare.<p>\n<li>I have retained Knox\'s spelling.<p>\n<li>Return to the <a href="#contents">table of contents</a>. <p>\n</p></li></p></li></p></li></p></li></ul>\n')]
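The loop above silently drops any chunk that does not split into exactly two parts. A hedged alternative, assuming we would rather keep a story that has no additional information, is to split on the first `<hr/>` only and return an empty string for the missing part (the helper name `split_story_chunk` is illustrative):

```python
def split_story_chunk(story_chunk):
    """Split a chunk into (story_html, additional_info_html).

    Splits on the first <hr/> only; additional_info_html is '' if there is none.
    """
    story_html, _separator, info_html = story_chunk.partition("<hr/>")
    return story_html, info_html


print(split_story_chunk("<h2>A Tale</h2>\nText<hr/><ul><li>Source</li></ul>"))
print(split_story_chunk("<h2>A Tale</h2>\nText"))
```

This is a different policy from the exploratory cell, not a drop-in equivalent: a chunk containing several single `<hr/>` elements keeps everything after the first one as additional information.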

We can now start to parse the separate chunks of story. For example, the title appears in an <h2> element:

story_chunk_tuples[0][0][:500]
'\n<h2><a name="brosnan">Taken by the Good People</a></h2>\n<h3>Ireland</h3>\nI was serving my time to the cattle trade, with a man the name of Lynch --\nGod be good to him! I suppose I was no more than twelve years of age at\nthe time. \'Twas a very out of the way place and mountainy. \n<p>\nWell, not far from my master\'s house there was a family of the Brogans.\n\'Twas the will of God that Mrs. Brogan took sick, and there was a baby\nborn, but the poor woman died. Well, the sister, a younger girl than the'
from requests_html import HTML

example_story_html = HTML(html=story_chunk_tuples[0][0])
example_story_html.find('h2')[0].text
'Taken by the Good People'

There is sometimes a subtitle in an <h3> element:

example_subheading = example_story_html.find('h3')
if example_subheading:
    print(example_subheading[0].text)
Ireland

We have access to the full story text:

example_story_html.full_text
'Taken by the Good People\nIreland\nI was serving my time to the cattle trade, with a man the name of Lynch --\nGod be good to him! I suppose I was no more than twelve years of age at\nthe time. \'Twas a very out of the way place and mountainy. \n\nWell, not far from my master\'s house there was a family of the Brogans.\n\'Twas the will of God that Mrs. Brogan took sick, and there was a baby\nborn, but the poor woman died. Well, the sister, a younger girl than the\nwoman that died, came to nurse the child. After some time she began to\nlook very delicate and uneasy. The naghbours were beginning to talk amongs\nthemselves about her, and it came to Brogan\'s ears, and, begor, it made\nhim vexed. So he asked the sister what was up with her.\n\n"Well, John," says she, "I did not like to tell you, but Ellie" -- that\nwas the name of the dead woman -- "comes every night, and takes the baby\nand nurses it, and goes away without a word."\n\n"By my word," says John, "she is not dead at all, but taken, and I will\nwatch her to-night."\n\nGood enough, he remained up, and about 12 o\'clock in she came, and he put\nhis arms around her, but as he said, felt no substance.\n\n"You can\'t keep me now," says she, "for I\'m married agin; but if you come\nto the Bottle Hill field to-morrow night, there will be about 40 of us\ngoin\' t\'words Blarney, and we will all be on horses, with our husbands.\nAll the horses will be white, and I and my man will be last. Bring a hazel\nstick woud [with] you and strike the horse on the right side, and I will\nfall off. Just as I fall, ketch me with all your might. You will know my\nman, for he is the only one of them that has a red head."\n\nWell, he went, and he must have a great heart, for on they come, gallopin\'\nlike mad. Just as the man with the red head\'s horse came he stood one-side\nand struck. She fell and he gripped her like iron. 
Well, such a hullabaloo\nas there was, was never heard, and all the other men makin\' game of the\nred-headed man.\n\nWell, he brought her home, and they lived for years after, and had a good\nfamily, and were the happiest people around the place. I often see some of\nher children; of course they are all married now, and gone here and there,\nbut that\'s as true as my name is Tim Brosnan.  \n'

If the original HTML included “useful” HTML tags, these would be stripped out in the full_text, so instead it might be useful to parse the story and additional information HTML as markdown (a simple structured text format).

story_chunk_tuples[0][0]
'\n<h2><a name="brosnan">Taken by the Good People</a></h2>\n<h3>Ireland</h3>\nI was serving my time to the cattle trade, with a man the name of Lynch --\nGod be good to him! I suppose I was no more than twelve years of age at\nthe time. \'Twas a very out of the way place and mountainy. \n<p>\nWell, not far from my master\'s house there was a family of the Brogans.\n\'Twas the will of God that Mrs. Brogan took sick, and there was a baby\nborn, but the poor woman died. Well, the sister, a younger girl than the\nwoman that died, came to nurse the child. After some time she began to\nlook very delicate and uneasy. The naghbours were beginning to talk amongs\nthemselves about her, and it came to Brogan\'s ears, and, begor, it made\nhim vexed. So he asked the sister what was up with her.\n<p>\n"Well, John," says she, "I did not like to tell you, but Ellie" -- that\nwas the name of the dead woman -- "comes every night, and takes the baby\nand nurses it, and goes away without a word."\n<p>\n"By my word," says John, "she is not dead at all, but taken, and I will\nwatch her to-night."\n<p>\nGood enough, he remained up, and about 12 o\'clock in she came, and he put\nhis arms around her, but as he said, felt no substance.\n<p>\n"You can\'t keep me now," says she, "for I\'m married agin; but if you come\nto the Bottle Hill field to-morrow night, there will be about 40 of us\ngoin\' t\'words Blarney, and we will all be on horses, with our husbands.\nAll the horses will be white, and I and my man will be last. Bring a hazel\nstick woud [with] you and strike the horse on the right side, and I will\nfall off. Just as I fall, ketch me with all your might. You will know my\nman, for he is the only one of them that has a red head."\n<p>\nWell, he went, and he must have a great heart, for on they come, gallopin\'\nlike mad. Just as the man with the red head\'s horse came he stood one-side\nand struck. She fell and he gripped her like iron. 
Well, such a hullabaloo\nas there was, was never heard, and all the other men makin\' game of the\nred-headed man.\n<p>\nWell, he brought her home, and they lived for years after, and had a good\nfamily, and were the happiest people around the place. I often see some of\nher children; of course they are all married now, and gone here and there,\nbut that\'s as true as my name is Tim Brosnan.  <p>\n'

The following function will attempt to repair any broken HTML before we pass it to markdownify. There are probably better ways of doing this!

#%pip install --upgrade parse # Tools to support "format" style parsing
#%pip install --upgrade markdownify # Tools to convert HTML to markdown

from parse import parse

import html5lib
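import lxml.html  # import the submodule explicitly; a bare "import lxml" does not guarantee lxml.html is loaded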

import markdownify

def repair_HTML(html):
    """Repair HTML."""
    # html5lib repairs the broken HTML
    tree = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
    html_ = parse("<body>{html}</body>", lxml.html.tostring(tree.find("body")).decode('utf-8'))
    if html_:
        return html_['html'], tree
    return '', None


(_html, _tree) = repair_HTML(story_chunk_tuples[0][0])

md = markdownify.markdownify(_html, bullets="-").strip()
md
'Taken by the Good People\n------------------------\n\n\n### Ireland\n\n\nI was serving my time to the cattle trade, with a man the name of Lynch --\nGod be good to him! I suppose I was no more than twelve years of age at\nthe time. \'Twas a very out of the way place and mountainy. \n\nWell, not far from my master\'s house there was a family of the Brogans.\n\'Twas the will of God that Mrs. Brogan took sick, and there was a baby\nborn, but the poor woman died. Well, the sister, a younger girl than the\nwoman that died, came to nurse the child. After some time she began to\nlook very delicate and uneasy. The naghbours were beginning to talk amongs\nthemselves about her, and it came to Brogan\'s ears, and, begor, it made\nhim vexed. So he asked the sister what was up with her.\n\n\n\n"Well, John," says she, "I did not like to tell you, but Ellie" -- that\nwas the name of the dead woman -- "comes every night, and takes the baby\nand nurses it, and goes away without a word."\n\n\n\n"By my word," says John, "she is not dead at all, but taken, and I will\nwatch her to-night."\n\n\n\nGood enough, he remained up, and about 12 o\'clock in she came, and he put\nhis arms around her, but as he said, felt no substance.\n\n\n\n"You can\'t keep me now," says she, "for I\'m married agin; but if you come\nto the Bottle Hill field to-morrow night, there will be about 40 of us\ngoin\' t\'words Blarney, and we will all be on horses, with our husbands.\nAll the horses will be white, and I and my man will be last. Bring a hazel\nstick woud [with] you and strike the horse on the right side, and I will\nfall off. Just as I fall, ketch me with all your might. You will know my\nman, for he is the only one of them that has a red head."\n\n\n\nWell, he went, and he must have a great heart, for on they come, gallopin\'\nlike mad. Just as the man with the red head\'s horse came he stood one-side\nand struck. She fell and he gripped her like iron. 
Well, such a hullabaloo\nas there was, was never heard, and all the other men makin\' game of the\nred-headed man.\n\n\n\nWell, he brought her home, and they lived for years after, and had a good\nfamily, and were the happiest people around the place. I often see some of\nher children; of course they are all married now, and gone here and there,\nbut that\'s as true as my name is Tim Brosnan.'
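The converted markdown contains runs of several blank lines where the original HTML had unclosed `<p>` tags. These render the same way, but if we want tidier stored text, a small post-processing step can collapse them. This is an optional sketch (the helper name `tidy_markdown` is an assumption), shown on a shortened, hand-trimmed version of the output above:

```python
import re


def tidy_markdown(md_text):
    """Collapse runs of three or more newlines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", md_text).strip()


# Shortened sample in the shape of the markdownify output above
sample_md = (
    "Taken by the Good People\n------------------------\n\n\n"
    "### Ireland\n\n\n"
    "I was serving my time...\n\n\n\n"
    '"Well, John," says she...'
)
tidied_md = tidy_markdown(sample_md)
print(tidied_md)
```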

We can use the IPython.display.Markdown function to preview the formatted story:

from IPython.display import Markdown

Markdown(md)

Taken by the Good People

Ireland

I was serving my time to the cattle trade, with a man the name of Lynch – God be good to him! I suppose I was no more than twelve years of age at the time. ‘Twas a very out of the way place and mountainy.

Well, not far from my master’s house there was a family of the Brogans. ‘Twas the will of God that Mrs. Brogan took sick, and there was a baby born, but the poor woman died. Well, the sister, a younger girl than the woman that died, came to nurse the child. After some time she began to look very delicate and uneasy. The naghbours were beginning to talk amongs themselves about her, and it came to Brogan’s ears, and, begor, it made him vexed. So he asked the sister what was up with her.

“Well, John,” says she, “I did not like to tell you, but Ellie” – that was the name of the dead woman – “comes every night, and takes the baby and nurses it, and goes away without a word.”

“By my word,” says John, “she is not dead at all, but taken, and I will watch her to-night.”

Good enough, he remained up, and about 12 o’clock in she came, and he put his arms around her, but as he said, felt no substance.

“You can’t keep me now,” says she, “for I’m married agin; but if you come to the Bottle Hill field to-morrow night, there will be about 40 of us goin’ t’words Blarney, and we will all be on horses, with our husbands. All the horses will be white, and I and my man will be last. Bring a hazel stick woud [with] you and strike the horse on the right side, and I will fall off. Just as I fall, ketch me with all your might. You will know my man, for he is the only one of them that has a red head.”

Well, he went, and he must have a great heart, for on they come, gallopin’ like mad. Just as the man with the red head’s horse came he stood one-side and struck. She fell and he gripped her like iron. Well, such a hullabaloo as there was, was never heard, and all the other men makin’ game of the red-headed man.

Well, he brought her home, and they lived for years after, and had a good family, and were the happiest people around the place. I often see some of her children; of course they are all married now, and gone here and there, but that’s as true as my name is Tim Brosnan.

If we look at the additional information, we see that it contains a list of items, which may contain links. The final list item is boilerplate and we can remove it.

story_chunk_tuples[3][1]
'\n<ul>\n<li>Source (books.google.com): William Butler Yeats, <i><a target="_blank" href="https://books.google.com/books?id=AhZLAAAAMAAJ&amp;pg=PP13#v=onepage&amp;q&amp;f=false">The Celtic Twilight</a></i> (London: A. H. Bullen, 1902), <a target="_blank" href="https://books.google.com/books?id=AhZLAAAAMAAJ&amp;pg=PA117#v=onepage&amp;q&amp;f=false">pp. 117-29</a>. <p>\n<li>Source (Internet Archive): William Butler Yeats, <i><a target="_blank" href="https://archive.org/details/celtictwilight00yeatrich">The Celtic Twilight</a></i> (London: A. H. Bullen, 1902), <a target="_blank" href="xxx">pp. 117-29</a>. <p>\n<li>Return to the <a href="#contents">table of contents</a>. <p>\n</p></li></p></li></p></li></ul>\n'
additional_info = [li.html for li in HTML(html=story_chunk_tuples[0][1]).find("li")[:-1]]
additional_info
['<li>Source (Internet Archive): "Folk-Tales from County Limerick collected by Miss D. Knox,"\n<a target="_blank" href="https://archive.org/stream/folklore28folkuoft#page/n4/mode/1up"><i>Folk-Lore: A Quarterly Review of Myth, Tradition, Institution, &amp;\nCustom</i></a> (London: Folk-Lore Society, 1917), v. 28, <a target="_blank" href="https://archive.org/stream/folklore28folkuoft#page/218/mode/2up">pp. 218-219</a>.<p>\n<li>Knox\'s source: Told by Tim Brosnan, Dungeagan, County Kerry.<p>\n<li>I have retained Knox\'s spelling.<p>\n<li>Return to the <a href="#contents">table of contents</a>.\n</li></p></li></p></li></p></li>',
 '<li>Knox\'s source: Told by Tim Brosnan, Dungeagan, County Kerry.<p>\n<li>I have retained Knox\'s spelling.<p>\n<li>Return to the <a href="#contents">table of contents</a>.\n</li></p></li></p></li>',
 '<li>I have retained Knox\'s spelling.<p>\n<li>Return to the <a href="#contents">table of contents</a>.\n</li></p></li>']
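Because the `<li>` and `<p>` tags in the source are not closed where they should be, each parsed list item above also swallows the items that follow it. One possible workaround, sketched here with the standard library only, is to split the raw string on the `<li>` openers instead of relying on the parsed tree. The helper name `split_list_items` is an assumption, and the sample is a trimmed copy of the fragment shown above:

```python
import re


def split_list_items(ul_html):
    """Split a <ul> fragment with unclosed <li>/<p> tags into separate items."""
    # Remove list and paragraph tags; only the <li> openers are used as split points
    cleaned = re.sub(r"</?(?:ul|p)>|</li>", "", ul_html)
    items = [item.strip() for item in cleaned.split("<li>")]
    # Drop empty pieces and the "Return to the table of contents" boilerplate
    return [item for item in items if item and not item.startswith("Return to the")]


sample_info = (
    "\n<ul>\n<li>Knox's source: Told by Tim Brosnan, Dungeagan, County Kerry.<p>\n"
    "<li>I have retained Knox's spelling.<p>\n"
    '<li>Return to the <a href="#contents">table of contents</a>.\n'
    "</li></p></li></p></li></ul>\n"
)
info_items = split_list_items(sample_info)
print(info_items)
```

Any inline markup inside an item, such as links, is left untouched, so each item can still be passed on for markdown conversion.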

In some records, the Aarne-Thompson-Uther type is given, which we can easily parse out if it is presented in a regular form:

#Aarne-Thompson-Uther type 1586.

# Requires exact match
parse("Aarne-Thompson-Uther type {atu}.", "Aarne-Thompson-Uther type 1586.")
# Accepts arbitrary but required prefix
parse("{}Aarne-Thompson-Uther type {atu}.", "Similar to Aarne-Thompson-Uther type 1586.")
<Result ('Similar to ',) {'atu': '1586'}>

Create a Simple Database

We can create a simple database structure to hold the stories.

from sqlite_utils import Database

db_name = "ashliman_demo.db"

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

Let’s start off with a table for the stories. This should include:

  • story title;

  • story text;

  • story metadata.

There may also be optional subheading information.

From the page, we will also be able to get the general tale type. We may also be able to get the ATU tale type, either from the page, or from the story record. We can include these in our table definition.

db["ashliman_stories"].delete_where()
db["ashliman_stories"].create({
    "title": str,
    "subheading": str,
    "text": str,
    "metadata": str,
    "generic_type": str,
    "generic_info": str,
    "atu": str, # This may be null
    "url": str,
})

# Create a full text search table to improve search support
db["ashliman_stories"].enable_fts(["text"], create_triggers=True, tokenize="porter")
<Table ashliman_stories (title, subheading, text, metadata, generic_type, generic_info, atu, url)>
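To see what the porter tokenizer gives us, here is a self-contained sketch using the standard library `sqlite3` module as a stand-in for sqlite_utils. It assumes the local SQLite build includes the FTS5 extension; the table name `stories_fts` and the two sample rows (short excerpts of stories shown on this page) are for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A standalone full text search table using the porter stemmer
conn.execute(
    "CREATE VIRTUAL TABLE stories_fts USING fts5(title, text, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO stories_fts (title, text) VALUES (?, ?)",
    [
        ("Taken by the Good People",
         "Just as the man with the red head's horse came he stood one-side and struck."),
        ("The Mischievous Dog",
         "There was once a dog who used to snap at people and bite them."),
    ],
)
# Stemming means a search for "horses" also matches "horse"
fts_rows = conn.execute(
    "SELECT title FROM stories_fts WHERE stories_fts MATCH ?", ("horses",)
).fetchall()
print(fts_rows)
```

Without stemming, a search for the plural form would not find a story that only uses the singular.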

Let’s see if we can create an example set of records from a single page and add it to the database:

def fetch_and_parse_page(url, exceptions=None):
    """Fetch and parse Ashliman story pages."""
    
    page_response = session.get(url)
    
    atu_regex = r"^.*Aarne-Thompson-Uther type (?P<atu>[\da-zA-Z]+).*"
    
    story_records = []
    
    atu = ''
    generic_type = ''
    
    # Low hanging fruit: ATU in URL
    url_regex = r"^type(?P<atu>[\da-zA-Z]+)\.html"
    _url = url.split('/')[-1]
    matches= re.search(url_regex, _url, re.IGNORECASE | re.MULTILINE)
    if matches:
        atu = matches.group("atu")
    # Low hanging fruit - atu in text
    generic_info = page_response.html.xpath('//p[@align="CENTER"]')
    generic_info = generic_info[0].text.strip() if generic_info else ''
    # Check to see if we can find an ATU; if we do, this overrides the page atu
    matches = re.search(atu_regex, generic_info, re.IGNORECASE | re.MULTILINE)
    if matches:
        atu = matches.group("atu")
    
    # The generic type is the page title
    # Get the head information
    header = page_response.html.find('center')
    if not header:
        header = page_response.html.xpath('//h1[@align="CENTER"]')

    if header:
        header = header[0].full_text
        generic_type = header.strip().split('\n')[0].strip()
        
        # See if an ATU is defined in the header
        matches = re.search(atu_regex, header, re.IGNORECASE | re.MULTILINE)

        if matches:
            atu = matches.group("atu")

    # Now find stories
    html = page_response.html.find('body')
    if html:
        html = html[0].html
    else:
        html = page_response.html.html
        # These pages are old, so patch them
        html = html.replace("<hr>", "<hr/>")
        
    # Split out the section containing the stories
    story_section = "\n\n<hr/>\n<p>\n"
    if exceptions and url in exceptions:
        html = html.replace(exceptions[url], f"{story_section}{exceptions[url]}")
    stories_html = html.split(story_section)
    ix = 0 if len(stories_html)==1 else -1
    stories_html = stories_html[ix]

    # Chunk the stories
    story_chunks = stories_html.split("<hr/><hr/>")
    # Some old pages may not have the non-story boilerplate at the end
    story_chunks = story_chunks if len(story_chunks)==1 else story_chunks[:-1]

    # Now iterate through all the stories
    for story_chunk in story_chunks:
        record = {'generic_type': generic_type,
                  'generic_info': generic_info,
                  'atu':atu,
                  'metadata':'',
                  'url': url}
        
        story_chunk_items = story_chunk.split('<hr/>')
    
    
    
        # Trap any broken html...
        #try:
        #    story_html = HTML(html=story_chunk_items[0])
        #except:
        #    pass
        #record['text'] = story_html.full_text.strip()

        try:
            (_html, story_tree) = repair_HTML(story_chunk_items[0])
            record['text'] = markdownify.markdownify(_html, bullets="-").strip()
            #record['html'] = _html
            story_tree = HTML(html=_html)
        except Exception:
            record['text'] = ''
            
        
        # If there is no story text, move on
        if not record['text']:
            continue
        
        # Get title
        title = story_tree.find('h2')
        title = title[0].text.strip() if title else generic_type
        # If we have not dropped the contents section, and we can identify it, move on...
        if title=='Contents':
            continue
        
        record['title'] = title
        
        # Get optional subheading
        subheading = story_tree.find('h3')
        subheading = subheading[0].text.strip() if subheading else ''
        
        record['subheading'] = subheading
        
        if len(story_chunk_items)==2:
            (_html, _) = repair_HTML(story_chunk_items[1])
            metadata = markdownify.markdownify(_html, bullets="-").strip()
            # Cleaning, added as and when I spot it's necessary
            metadata = metadata.replace("- Return to the [table of contents](#contents).", "")
            record['metadata'] = metadata.strip()
            # Check to see if we can find an ATU; if we do, this overrides the page atu
            matches = re.search(atu_regex, record['metadata'], re.IGNORECASE | re.MULTILINE)
            if matches:
                record['atu'] = matches.group("atu")
        
        # Depending on the layout, we may get what is essentially a null record
        if record['text']!='folktexts' and record['text']!=record['title'] \
                and not record['text'].startswith("Return to D. L. Ashliman"):
            story_records.append(record)
        
    return story_records

Let’s try that function out:

# This is a hack fix - some pages don't have the conventional split string
exceptions = {"https://sites.pitt.edu/~dash/hildebrand.html": "I have heard tell",
              "https://sites.pitt.edu/~dash/bluebelt.html": "Once upon a time",
              "https://sites.pitt.edu/~dash/lowell.html": "I had a little daughter",
              "https://sites.pitt.edu/~dash/changeling.html": '<h2><a name="legends">The legends </a></h2>'}
    
example_records = fetch_and_parse_page(links[1], exceptions)
example_records[:2]
[{'generic_type': "Old Folks in Aesop's Fables",
  'generic_info': '',
  'atu': '',
  'metadata': '',
  'url': 'https://sites.pitt.edu/~dash/aesopold.html',
  'text': "- Sources:\n\t- Fables 1-18: *Aesop's Fables*. Translated by V. S. Vernon Jones. \n\tLondon: Heinemann, 1912.\n\t- Fables 19-20: *The Fables of Aesop*. Edited by Joseph Jacobs. \n\tLondon and New York: Macmillan, 1894.\n\t- Fables 21-28: *The Fables of Aesop*. Based on the texts of \n\tL'Estrange and Croxall. New York and Boston: \n\t\n\tBooks, Inc., n.d.\n- Return to D. L. Ashliman's [**folktexts**](folktexts.html), a library of folktales, folklore, \nfairy tales, and mythology.",
  'title': "Old Folks in Aesop's Fables",
  'subheading': ''},
 {'generic_type': "Old Folks in Aesop's Fables",
  'generic_info': '',
  'atu': '',
  'metadata': '',
  'url': 'https://sites.pitt.edu/~dash/aesopold.html',
  'text': 'The Mischievous Dog\n-------------------\n\n \nThere was once a dog who used to snap at people and bite them without any \nprovocation, and who was a great nuisance to \n\neveryone who came to his master\'s house. So his master fastened a bell \nround his neck to warn people of his presence. \n\nThe dog was very proud of the bell, and strutted about tinkling it with \nimmense satisfaction. But an old dog came up \n\nto him and said, "The fewer airs you give yourself the better, my friend. \nYou don\'t think, do you, that your bell was \n\ngiven you as a reward of merit? On the contrary, it is a badge of \ndisgrace." \nMoral: Notoriety is often mistaken for fame.',
  'title': 'The Mischievous Dog',
  'subheading': ''}]

One option for the metadata would be a separate metadata table, with one or more rows per story depending on the number of metadata list items. That only makes sense if lists are used consistently in the metadata, which still needs checking.

Markdown(example_records[1]['metadata'])
<IPython.core.display.Markdown object>
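
If we did go down the metadata table route, one way of splitting the metadata into rows might look something like the following. This is only a sketch: the `metadata_rows` helper, the row layout and the demo record are all assumptions, and it presumes that each top-level metadata item starts with `- ` at the beginning of a line (as produced by `markdownify` with `bullets="-"`).

```python
import re


def metadata_rows(record):
    """Split a record's Markdown metadata into one row per top-level list item."""
    metadata = record.get("metadata", "")
    # Top-level items start with "- " at the very start of a line;
    # indented sub-items and wrapped lines stay attached to their parent item.
    items = [item.strip()
             for item in re.split(r"^- ", metadata, flags=re.MULTILINE)
             if item.strip()]
    return [
        {"url": record["url"], "title": record["title"],
         "item_no": n, "item": item}
        for n, item in enumerate(items, start=1)
    ]


# Made-up record, for demonstration only
demo_record = {
    "url": "https://example.com/demo.html",
    "title": "Demo story",
    "metadata": ("- Source: a made-up citation,\nwrapped over two lines.\n"
                 "- Note: a second item.\n\t- A nested sub-item."),
}
rows = metadata_rows(demo_record)
```

Metadata that isn't marked up as a list at all would come back as a single row, which might be one way of spotting where lists are inconsistently applied.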

We can add these to the database and have a go at querying them:

db["ashliman_stories"].delete_where()
db["ashliman_stories"].insert_all(example_records)
<Table ashliman_stories (title, subheading, text, metadata, generic_type, generic_info, atu, url)>
from pandas import read_sql

q = "SELECT * FROM ashliman_stories LIMIT 12"

read_sql(q, db.conn)
|    | title | subheading | text | metadata | generic_type | generic_info | atu | url |
|----|-------|------------|------|----------|--------------|--------------|-----|-----|
| 0 | Old Folks in Aesop's Fables | | - Sources:\n\t- Fables 1-18: *Aesop's Fables*.... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 1 | The Mischievous Dog | | The Mischievous Dog\n-------------------\n\n \... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 2 | The Mice in Council | | The Mice in Council\n-------------------\n\n \... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 3 | The Old Woman and the Doctor | | The Old Woman and the Doctor\n----------------... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 4 | The Crab and His Mother | | The Crab and His Mother\n---------------------... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 5 | The Old Lion | | The Old Lion\n------------\n\n \nA lion, enfee... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 6 | The Peasant and the Apple Tree | | The Peasant and the Apple Tree\n--------------... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 7 | The Mice and the Weasels | | The Mice and the Weasels\n--------------------... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 8 | The Ass and the Old Peasant | | The Ass and the Old Peasant\n-----------------... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 9 | The Old Woman and the Wine Jar | | The Old Woman and the Wine Jar\n--------------... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 10 | The Oxen and the Butchers | | The Oxen and the Butchers\n-------------------... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |
| 11 | The Old Hound | | The Old Hound\n-------------\n\n \nA hound who... | | Old Folks in Aesop's Fables | | | https://sites.pitt.edu/~dash/aesopold.html |

Let’s now try to populate the whole database:

# tqdm gives us a progress bar
from tqdm.notebook import tqdm

# Empty the database
db["ashliman_stories"].delete_where()

# Iterate through all the pages, adding the stories to the db as we do so
broken_parser_pages = []

# Also recall that requests-cache should be caching pages
# so we should only hit the original site once per page
# no matter how many times we run this in testing and development
for link in tqdm(links):
    try:
        records = fetch_and_parse_page(link)
        db["ashliman_stories"].insert_all(records)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        broken_parser_pages.append(link)

Are there any pages that didn’t get parsed?

broken_parser_pages
[]
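
An empty list here only tells us that no page raised an error; a page that parsed "successfully" but returned no records would not show up. As a sketch of a slightly more informative loop (the `scrape_all` helper and the `fake_parser` stand-in are made up for illustration; in practice the parser would be something like `fetch_and_parse_page`), we could keep the reason for each failure and separately track the pages that returned nothing:

```python
def scrape_all(page_links, parse_page):
    """Run parse_page over each link, separating failures from empty results."""
    parsed, failed, empty = {}, {}, []
    for link in page_links:
        try:
            records = parse_page(link)
        except Exception as err:
            # Keep the reason for the failure, not just the link
            failed[link] = repr(err)
            continue
        if records:
            parsed[link] = records
        else:
            # Parsed without error, but no stories were extracted
            empty.append(link)
    return parsed, failed, empty


# Stand-in parser, for demonstration only
def fake_parser(link):
    if "bad" in link:
        raise ValueError("no split string found")
    return [] if "empty" in link else [{"title": "A story", "url": link}]


parsed, failed, empty = scrape_all(["ok.html", "bad.html", "empty.html"],
                                   fake_parser)
```

The `failed` dictionary then gives a starting point for deciding which pages need adding to the exceptions list.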

How many stories are there?

q = "SELECT COUNT(*) AS num_stories FROM ashliman_stories"

read_sql(q, db.conn)
num_stories
0 1122
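
A total on its own doesn't say much about how the stories are spread across pages. As a rough sketch, a `GROUP BY` query can give a count per page; the example below uses a throwaway in-memory SQLite database with made-up rows rather than the real `ashliman_stories` table, although the column names are the same as the ones used above.

```python
import sqlite3

# Throwaway in-memory database with made-up rows, for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ashliman_stories (title TEXT, generic_type TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO ashliman_stories VALUES (?, ?, ?)",
    [("Story A", "Type one", "one.html"),
     ("Story B", "Type one", "one.html"),
     ("Story C", "Type two", "two.html")],
)

q = """
SELECT generic_type, url, COUNT(*) AS num_stories
FROM ashliman_stories
GROUP BY generic_type, url
ORDER BY num_stories DESC, generic_type
"""

counts = conn.execute(q).fetchall()
```

The same query string should also work with `read_sql(q, db.conn)` against the real table; pages with suspiciously few (or suspiciously many) stories may be worth checking by hand.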
q = "SELECT * FROM ashliman_stories LIMIT 10"

read_sql(q, db.conn)
|    | title | subheading | text | metadata | generic_type | generic_info | atu | url |
|----|-------|------------|------|----------|--------------|--------------|-----|-----|
| 0 | Taken by the Good People | Ireland | Taken by the Good People\n--------------------... | - Source (Internet Archive): "Folk-Tales from ... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 1 | Twenty Years with the Good People | Ireland | Twenty Years with the Good People\n-----------... | - Source: "Folk-Tales from County Limerick col... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 2 | Jamie Freel and the Young Lady: A Donegal Tale | Ireland | Jamie Freel and the Young Lady: A Donegal Tale... | - Source (books.google.com): William Butler Ye... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 3 | Kidnappers | Ireland | Kidnappers\n----------\n\n\n### Ireland\n\n\nA... | - Source (books.google.com): William Butler Ye... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 4 | Ethna the Bride | Ireland | Ethna the Bride\n---------------\n\n\n### Irel... | - Source (books.google.com): Lady [Jane France... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 5 | Ned the Jockey | Wales | Ned the Jockey\n--------------\n\n\n### Wales\... | - Source (books.google.com): Edward Hamer, "Pa... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 6 | The Old Man and the Fairies | Wales | The Old Man and the Fairies\n-----------------... | - Source (books.google.com): P. H. Emerson, *[... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 7 | A Visit to Fairyland | Wales | A Visit to Fairyland\n--------------------\n\n... | - Source (books.google.com): D. E. Jenkins, *[... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 8 | Four Years in Faery | Isle of Man | Four Years in Faery\n-------------------\n\n\n... | - Source (books.google.com): John Rhys, *[Celt... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |
| 9 | The Lost Wife of Ballaleece | Isle of Man | The Lost Wife of Ballaleece\n-----------------... | - Source (Internet Archive): Sophia Morrison, ... | Abducted by Aliens | | | https://sites.pitt.edu/~dash/abduct.html |