Tale Types Scraper#

The University of Missouri Libraries publish a useful look-up table of AT, ATU and Thompson Motif codes which might be handy to have in a simple database.

The original pages can be found from this explanatory guide: https://libraryguides.missouri.edu/c.php?g=1039894&p=7541717

url = "https://libraryguides.missouri.edu/c.php?g=1039894&p=7609090"
import requests_cache
from datetime import timedelta

requests_cache.install_cache('web_cache', backend='sqlite', expire_after=timedelta(days=100))

from requests_html import HTMLSession
 
session = HTMLSession()
index_response = session.get(url)

The resource pages are linked from a menu bar. The links we want are all but the first and the last one:

link_elements = index_response.html.find('.nav-tabs')[0].find("li")[1:-1]
link_elements
[<Element 'li' class=('active',)>,
 <Element 'li' class=()>,
 <Element 'li' class=()>,
 <Element 'li' class=()>,
 <Element 'li' class=()>,
 <Element 'li' class=()>,
 <Element 'li' class=()>,
 <Element 'li' class=()>]
index_links = [el.xpath("//a/@href")[0] for el in link_elements]
index_links
['https://libraryguides.missouri.edu/c.php?g=1039894&p=7609090',
 'https://libraryguides.missouri.edu/c.php?g=1039894&p=7610331',
 'https://libraryguides.missouri.edu/c.php?g=1039894&p=7619154',
 'https://libraryguides.missouri.edu/c.php?g=1039894&p=7620890',
 'https://libraryguides.missouri.edu/c.php?g=1039894&p=7621605',
 'https://libraryguides.missouri.edu/c.php?g=1039894&p=7624305',
 'https://libraryguides.missouri.edu/c.php?g=1039894&p=7624836',
 'https://libraryguides.missouri.edu/c.php?g=1039894&p=7635705']

The pages are split into separate tables which are embedded in particular block elements.

response = session.get(index_links[2])

# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = response.html.find('.s-lib-box')
boxes
[<Element 'div' id='s-lg-box-24196290' class=('s-lib-box', 's-lib-box-std')>,
 <Element 'div' id='s-lg-box-24196308' class=('s-lib-box', 's-lib-box-std')>,
 <Element 'div' id='s-lg-box-24197326' class=('s-lib-box', 's-lib-box-std')>,
 <Element 'div' id='s-lg-box-24197327' class=('s-lib-box', 's-lib-box-std')>,
 <Element 'div' id='s-lg-box-24197417' class=('s-lib-box', 's-lib-box-std')>]

We can extract the box title as a category label:

category = boxes[0].find('h2')[0].text
category
'Tales of Magic - Magic Object 560 - 571'

And we can grab the table within the box (I think there is just one table per box, so take the first, but we coul also go defensive on this to check for multiple tables…):

example_table = boxes[1].find('table')[0]
example_table
<Element 'table' class=('table', 'table-bordered') style='border: 0px solid rgb(180, 180, 180);'>
example_table.html
'<table class="table table-bordered" style="border: 0px solid rgb(180, 180, 180);">\n<tbody>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;571 \\&quot;All Stick Together\\&quot;&quot;}" height="21" style="height:21px;width:256px;">ATU Classification Type\xa0</td>\n<td style="width:276px;">\xa0AT Classification Type\xa0\xa0</td>\n<td style="width:631px;">Thompson Motif</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td style="width:631px;">\xa0</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">ATU 571B Lover Exposed</td>\n<td>AT 571B Himphamp</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1010. &#x2020;H1010. Impossible tasks&quot;}" style="width:631px;">H1010. †H1010. Impossible tasks</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;571B Lover Exposed&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D812.3. &#x2020;D812.3. Magic object received from devil.&quot;}">\n<div>\n<div>D812.3. †D812.3. Magic object received from devil.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;571B Lover Exposed&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1413.8. &#x2020;D1413.8. Chamber-pot to which one sticks.&quot;}">\n<div>\n<div>D1413.8. †D1413.8. Chamber-pot to which one sticks.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;571B Lover Exposed&quot;}" height="21" style="height:21px;">ATU 576 Magic Knife</td>\n<td>AT 576 Magic Knife</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D838. &#x2020;D838. Magic object acquired by stealing&quot;}" style="width:631px;">D838. †D838. Magic object acquired by stealing</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;576 Magic Knife&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1083. &#x2020;D1083. Magic knife.&quot;}" style="width:631px;">D1083. †D1083. Magic knife.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;576 Magic Knife&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1400.1.4.3. &#x2020;D1400.1.4.3. Magic knife conquers enemy.&quot;}">\n<div>\n<div>D1400.1.4.3. †D1400.1.4.3. Magic knife conquers enemy.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;576 Magic Knife&quot;}" height="21" style="height:21px;">ATU 577 The King\'s Tasks</td>\n<td>AT 577 The King\'s Tasks</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H335. &#x2020;H335. Tasks assigned suitors.&quot;}" style="width:631px;">H335. †H335. Tasks assigned suitors.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;577 The King\'s Tasks&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1601.14. &#x2020;D1601.14. Self-chopping axe.&quot;}" style="width:631px;">D1601.14. †D1601.14. Self-chopping axe.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;577 The King\'s Tasks&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1601.15. &#x2020;D1601.15. Automatic shovel.&quot;}" style="width:631px;">D1601.15. †D1601.15. Automatic shovel.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;577 The King\'s Tasks&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1601.18.2. &#x2020;D1601.18.2. Self-playing violin.&quot;}" style="width:631px;">D1601.18.2. †D1601.18.2. Self-playing violin.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;577 The King\'s Tasks&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1581. &#x2020;D1581. Tasks performed by use of magic object.&quot;}">\n<div>\n<div>D1581. †D1581. Tasks performed by use of magic object.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;577 The King\'s Tasks&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H971.1. &#x2020;H971.1. Tasks performed with help of old woman&quot;}">\n<div>\n<div>H971.1. †H971.1. Tasks performed with help of old woman</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;577 The King\'s Tasks&quot;}" height="21" style="height:21px;">ATU 580 Beloved of Women</td>\n<td>AT 580 Beloved of Women</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1900. &#x2020;D1900. Love induced by magic.&quot;}" style="width:631px;">D1900. †D1900. Love induced by magic.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;580 Beloved of Women&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D856. &#x2020;D856. Magic object acquired by gaining love of owner.&quot;}">\n<div>\n<div>D856. †D856. Magic object acquired by gaining love of owner.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;580 Beloved of Women&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1472.1.7. &#x2020;D1472.1.7. Magic table supplies food and drink.&quot;}">\n<div>\n<div>D1472.1.7. †D1472.1.7. Magic table supplies food and drink.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;580 Beloved of Women&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1473. &#x2020;D1473. Magic object furnishes clothes.&quot;}" style="width:631px;">D1473. †D1473. Magic object furnishes clothes.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;580 Beloved of Women&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;T45. &#x2020;T45. Lover buys admission to woman\'s room.&quot;}">\n<div>\n<div>T45. †T45. Lover buys admission to woman\'s room.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;580 Beloved of Women&quot;}" height="21" style="height:21px;">ATU 566 Three Magic Objects and Wonderful Fruits</td>\n<td>AT 580* Money without End (now ATU 566)</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1451. &#x2020;D1451. Inexhaustible purse furnishes money.&quot;}">\n<div>\n<div>D1451. †D1451. Inexhaustible purse furnishes money.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">ATU 303 The Twins or Blood-Brothers</td>\n<td>AT 581 Twin Brother and Thier Magic Objects (now ATU 303)</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1081. &#x2020;D1081. Magic sword.&quot;}" style="width:631px;">D1081. †D1081. Magic sword.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;303 The Twins or Blood-Brothers&quot;}" height="21" style="height:21px;">ATU 585 Spindle, Shuttle, and Needle</td>\n<td>AT 585 Spindle, Shuttle, and Needle</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1311.2. &#x2020;H1311.2. Quest for bride richest and poorest.&quot;}">\n<div>\n<div>H1311.2. †H1311.2. Quest for bride richest and poorest.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;585 Spindle, Shuttle, and Needle&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1425.1. &#x2020;D1425.1. Magic spindle brings back prince for heroine.&quot;}">\n<div>\n<div>D1425.1. †D1425.1. Magic spindle brings back prince for heroine.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;585 Spindle, Shuttle, and Needle&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1484.1. &#x2020;D1484.1. Magic shuttle makes road.&quot;}" style="width:631px;">D1484.1. †D1484.1. Magic shuttle makes road.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;585 Spindle, Shuttle, and Needle&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1337.1.7. &#x2020;D1337.1.7. Magic needle transforms a room from plainness to beauty.&quot;}">\n<div>\n<div>D1337.1.7. †D1337.1.7. Magic needle transforms a room from plainness to beauty.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;585 Spindle, Shuttle, and Needle&quot;}" height="21" style="height:21px;">ATU 590 Faithless Mother</td>\n<td>AT 590 Magic Belt</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1840. &#x2020;D1840. Magic invulnerability.&quot;}" style="width:631px;">D1840. †D1840. Magic invulnerability.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;589 Faithless Mother&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1335.4. &#x2020;D1335.4. Magic belt gives strength&quot;}" style="width:631px;">D1335.4. †D1335.4. Magic belt gives strength</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;589 Faithless Mother&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;S12.1. &#x2020;S12.1. Treacherous mother marries ogre and plots against son.&quot;}">\n<div>\n<div>S12.1. †S12.1. Treacherous mother marries ogre and plots against son.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H931. &#x2020;H931. Tasks assigned in order to get rid of hero.&quot;}">\n<div>\n<div>H931. †H931. Tasks assigned in order to get rid of hero.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1212. &#x2020;H1212. Quest assigned because of feigned illness.&quot;}">\n<div>\n<div>H1212. †H1212. Quest assigned because of feigned illness.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1333.3.1.1. &#x2020;H1333.3.1.1. Quest for magic apple&quot;}" style="width:631px;">H1333.3.1.1. †H1333.3.1.1. Quest for magic apple</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1361. &#x2020;H1361. Quest for lion\'s milk.&quot;}" style="width:631px;">H1361. †H1361. Quest for lion\'s milk.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;F615. &#x2020;F615. Strong man evades death.&quot;}" style="width:631px;">F615. †F615. Strong man evades death.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;B431.2. &#x2020;B431.2. Helpful lion.&quot;}" style="width:631px;">B431.2. †B431.2. Helpful lion.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;K975. &#x2020;K975. Secret of strength treacherously discovered.&quot;}">\n<div>\n<div>K975. †K975. Secret of strength treacherously discovered.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D861. &#x2020;D861. Magic object stolen&quot;}" style="width:631px;">D861. †D861. Magic object stolen</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;B512. &#x2020;B512. Medicine shown by animal.&quot;}" style="width:631px;">B512. †B512. Medicine shown by animal.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D880. &#x2020;D880. Recovery of magic object.&quot;}" style="width:631px;">D880. †D880. Recovery of magic object.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">ATU 592 The Dance Among Thorns</td>\n<td>AT 592 Dancing in Thorns</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;S322.4. &#x2020;S322.4. Evil stepmother casts boy forth.&quot;}" style="width:631px;">S322.4. †S322.4. Evil stepmother casts boy forth.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;592 The Dance Among Thorns&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;Q42.1. &#x2020;Q42.1. Spendthrift knight.&quot;}" style="width:631px;">Q42.1. †Q42.1. Spendthrift knight.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;592 The Dance Among Thorns&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1761.0.2. &#x2020;D1761.0.2. Limited number of wishes granted.&quot;}">\n<div>\n<div>D1761.0.2. †D1761.0.2. Limited number of wishes granted.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;592 The Dance Among Thorns&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1653.1.7. &#x2020;D1653.1.7. Infallible gun.&quot;}" style="width:631px;">D1653.1.7. †D1653.1.7. Infallible gun.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;592 The Dance Among Thorns&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1415.2.5. &#x2020;D1415.2.5. Magic fiddle causes dancing.&quot;}">\n<div>\n<div>D1415.2.5. †D1415.2.5. Magic fiddle causes dancing.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;592 The Dance Among Thorns&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N55. &#x2020;N55. Shooting contest on wager.&quot;}" style="width:631px;">N55. †N55. Shooting contest on wager.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;592 The Dance Among Thorns&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;K551.3.1. &#x2020;K551.3.1. Respite from death while one plays the fiddle&quot;}">\n<div>\n<div>K551.3.1. †K551.3.1. Respite from death while one plays the fiddle</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;592 The Dance Among Thorns&quot;}" height="21" style="height:21px;">ATU 593 Fiddevav</td>\n<td>AT 593 Blowing Up the Fire</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N825.3. &#x2020;N825.3. Old woman helper.&quot;}" style="width:631px;">N825.3. †N825.3. Old woman helper.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;593 Fiddevav&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1413. &#x2020;D1413. Magic object holds person fast.&quot;}" style="width:631px;">D1413. †D1413. Magic object holds person fast.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;593 Fiddevav&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1413.17. &#x2020;D1413.17. Magic adhesive stone.&quot;}" style="width:631px;">D1413.17. †D1413.17. Magic adhesive stone.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;593 Fiddevav&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D2172.1. &#x2020;D2172.1. Magic repetition.&quot;}" style="width:631px;">D2172.1. †D2172.1. Magic repetition.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;593 Fiddevav&quot;}" height="21" style="height:21px;">ATU 610 The Healing Fruit</td>\n<td>AT 610 Fruit to Cure the Princess</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;T68.1. &#x2020;T68.1. Princess offered as prize to rescuer.&quot;}">\n<div>\n<div>T68.1. †T68.1. Princess offered as prize to rescuer.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H346. &#x2020;H346. Princess given to man who can heal her.&quot;}">\n<div>\n<div>H346. †H346. Princess given to man who can heal her.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;Q2. &#x2020;Q2. Kind and unkind.&quot;}" style="width:631px;">Q2. †Q2. Kind and unkind.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;L13. &#x2020;L13. Compassionate youngest son.&quot;}" style="width:631px;">L13. †L13. Compassionate youngest son.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;L10. &#x2020;L10. Victorious youngest son.&quot;}" style="width:631px;">L10. †L10. Victorious youngest son.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N825.3. &#x2020;N825.3. Old woman helper.&quot;}" style="width:631px;">N825.3. †N825.3. Old woman helper.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1500.1.5. &#x2020;D1500.1.5. Magic healing fruit.&quot;}" style="width:631px;">D1500.1.5. †D1500.1.5. Magic healing fruit.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1010. &#x2020;H1010. Impossible tasks&quot;}" style="width:631px;">H1010. †H1010. Impossible tasks</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1533.1.1. &#x2020;D1533.1.1. Magic land and water ship.&quot;}">\n<div>\n<div>D1533.1.1. †D1533.1.1. Magic land and water ship.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1112. &#x2020;H1112. Task: herding rabbits.&quot;}" style="width:631px;">H1112. †H1112. Task: herding rabbits.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H1331.1.2.1. &#x2020;H1331.1.2.1. Quest for remarkable bird feathers.&quot;}">\n<div>\n<div>H1331.1.2.1. †H1331.1.2.1. Quest for remarkable bird feathers.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;L161. &#x2020;L161. Lowly hero marries princess&quot;}" style="width:631px;">L161. †L161. Lowly hero marries princess</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;610 The Healing Fruit&quot;}" height="21" style="height:21px;">ATU 612 The Three Snake-Leaves</td>\n<td>AT 612 The Three Snake-Leaves</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;M254. &#x2020;M254. Promise to be buried with wife if she dies first.&quot;}">\n<div>\n<div>M254. †M254. Promise to be buried with wife if she dies first.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;B512. &#x2020;B512. Medicine shown by animal.&quot;}" style="width:631px;">B512. †B512. Medicine shown by animal.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;B491.1. &#x2020;B491.1. Helpful serpent.&quot;}" style="width:631px;">B491.1. †B491.1. Helpful serpent.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1500.1.4. &#x2020;D1500.1.4. Magic healing plant.&quot;}" style="width:631px;">D1500.1.4. †D1500.1.4. Magic healing plant.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;E165. &#x2020;E165. Resuscitation of wife by husband giving up half his remaining life.&quot;}">\n<div>\n<div>E165. †E165. Resuscitation of wife by husband giving up half his remaining life.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;K2213.5. &#x2020;K2213.5. The faithless resuscitated wife&quot;}">\n<div>\n<div>K2213.5. †K2213.5. The faithless resuscitated wife</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;K2213.2. &#x2020;K2213.2. Faithless wife and paramour throw hero overboard.&quot;}">\n<div>\n<div>K2213.2. †K2213.2. Faithless wife and paramour throw hero overboard.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;S142. &#x2020;S142. Person thrown into the water and abandoned.&quot;}">\n<div>\n<div>S142. †S142. Person thrown into the water and abandoned.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;P361. &#x2020;P361. Faithful servant.&quot;}" style="width:631px;">P361. †P361. Faithful servant.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;E105. &#x2020;E105. Resuscitation by herbs (leaves).&quot;}" style="width:631px;">E105. †E105. Resuscitation by herbs (leaves).</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;612 The Three Snake-Leaves&quot;}" height="21" style="height:21px;">ATU 613 The Two Travelers</td>\n<td>AT 613 The Two Travelers: Truth and Falsehood</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N61. &#x2020;N61. Wager that falsehood is better than truth.&quot;}">\n<div>\n<div>N61. †N61. Wager that falsehood is better than truth.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;K451.1. &#x2020;K451.1. Unjust umpire decides a religious dispute&quot;}">\n<div>\n<div>K451.1. †K451.1. Unjust umpire decides a religious dispute</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N92. &#x2020;N92. Wager to win or lose according to whether jackal howls or ass brays before game is finished.&quot;}">\n<div>\n<div>N92. †N92. Wager to win or lose according to whether jackal howls or ass brays before game is finished.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N2.3.3. &#x2020;N2.3.3. Eyes wagered.&quot;}" style="width:631px;">N2.3.3. †N2.3.3. Eyes wagered.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;M225. &#x2020;M225. Eyes exchanged for food.&quot;}" style="width:631px;">M225. †M225. Eyes exchanged for food.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;F1045. &#x2020;F1045. Night spent in tree.&quot;}" style="width:631px;">F1045. †F1045. Night spent in tree.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N451.1. &#x2020;N451.1. Secrets of animals (demons) accidentally overheard from tree (bridge) hiding place.&quot;}">\n<div>\n<div>N451.1. †N451.1. Secrets of animals (demons) accidentally overheard from tree (bridge) hiding place.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D1505.5. &#x2020;D1505.5. Magic water restores sight.&quot;}" style="width:631px;">D1505.5. †D1505.5. Magic water restores sight.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;D2064.1. &#x2020;D2064.1. Magic sickness because girl has thrown away her consecrated wafer.&quot;}">\n<div>\n<div>D2064.1. †D2064.1. Magic sickness because girl has thrown away her consecrated wafer.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;F933.2. &#x2020;F933.2. Dry spring restored by removal of certain stone.&quot;}">\n<div>\n<div>F933.2. †F933.2. Dry spring restored by removal of certain stone.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H346. &#x2020;H346. Princess given to man who can heal her.&quot;}">\n<div>\n<div>H346. †H346. Princess given to man who can heal her.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;N471. &#x2020;N471. Foolish attempt of second man to overhear secrets (from animals, demons etc.).&quot;}">\n<div>\n<div>N471. †N471. Foolish attempt of second man to overhear secrets (from animals, demons etc.).</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;613 The Two Travelers&quot;}" height="21" style="height:21px;">ATU 857 Louse-Skin</td>\n<td>AT 621 The Louse-Skin (now ATU 857)</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;B873.1. &#x2020;B873.1. Giant louse.&quot;}" style="width:631px;">B873.1. †B873.1. Giant louse.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;857 Louse-Skin&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;F983.2. &#x2020;F983.2. Louse fattened.&quot;}" style="width:631px;">F983.2. †F983.2. Louse fattened.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;857 Louse-Skin&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H551. &#x2020;H551. Princess offered to man who can out-riddle her.&quot;}" style="width:631px;">H551. †H551. Princess offered to man who can out-riddle her.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;857 Louse-Skin&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H522.1. &#x2020;H522.1. Test: guessing origin of certain skin.&quot;}">\n<div>\n<div>H522.1. †H522.1. Test: guessing origin of certain skin.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;857 Louse-Skin&quot;}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;H573.3. &#x2020;H573.3. Riddle solved by listening to propounder talk in his sleep.&quot;}">\n<div>\n<div>H573.3. †H573.3. Riddle solved by listening to propounder talk in his sleep.</div>\n</div>\n</td>\n</tr>\n</tbody>\n</table>'

We can parse that table into a pandas dataframe:

import pandas as pd

#Parse HTML table as dataframe
pd.read_html(example_table.html)[0]
0 1 2
0 ATU Classification Type AT Classification Type Thompson Motif
1 NaN NaN NaN
2 ATU 571B Lover Exposed AT 571B Himphamp H1010. †H1010. Impossible tasks
3 NaN NaN D812.3. †D812.3. Magic object received from de...
4 NaN NaN D1413.8. †D1413.8. Chamber-pot to which one st...
... ... ... ...
83 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse.
84 NaN NaN F983.2. †F983.2. Louse fattened.
85 NaN NaN H551. †H551. Princess offered to man who can o...
86 NaN NaN H522.1. †H522.1. Test: guessing origin of cert...
87 NaN NaN H573.3. †H573.3. Riddle solved by listening to...

88 rows × 3 columns

The first table in the page has a structure distinct from all the rest. In particular, it includes some blank lines and the column headers.

We could parse out the column headers, or we could set them explicitly. For conveniences, let’s set them explicilty, then remove them if they appear in the table.

For the first table in the page, we can then tidy up the dataframe by dropping at least the first three lines (the index count strats at 0), as well as dropping empty rows, resetting the index to account for that, and replacing null values with an empty string:

table_df = pd.read_html(example_table.html, skiprows=3)[0]
# Do some cleaning
table_df = table_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')

colnames = ["ATU Classification Type", "AT Classification Type", "Thompson Motif"]
#Explicitly set column names
table_df.columns = colnames

# Remove column names if they were in the table and reset the indeex
table_df = table_df[table_df["ATU Classification Type"]!="ATU Classification Type"].reset_index(drop=True)

table_df
ATU Classification Type AT Classification Type Thompson Motif
0 D812.3. †D812.3. Magic object received from de...
1 D1413.8. †D1413.8. Chamber-pot to which one st...
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing
3 D1083. †D1083. Magic knife.
4 D1400.1.4.3. †D1400.1.4.3. Magic knife conquer...
... ... ... ...
80 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse.
81 F983.2. †F983.2. Louse fattened.
82 H551. †H551. Princess offered to man who can o...
83 H522.1. †H522.1. Test: guessing origin of cert...
84 H573.3. †H573.3. Riddle solved by listening to...

85 rows × 3 columns

We notice that sometimes there are blanks in the ATU and AT columns, the assumption being that the values are inherited from above. We can explicitly fill down these values:

table_df = table_df.replace("", None).fillna(method="ffill")
table_df
ATU Classification Type AT Classification Type Thompson Motif
0 D812.3. †D812.3. Magic object received from de...
1 D1413.8. †D1413.8. Chamber-pot to which one st...
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing
3 ATU 576 Magic Knife AT 576 Magic Knife D1083. †D1083. Magic knife.
4 ATU 576 Magic Knife AT 576 Magic Knife D1400.1.4.3. †D1400.1.4.3. Magic knife conquer...
... ... ... ...
80 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse.
81 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) F983.2. †F983.2. Louse fattened.
82 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H551. †H551. Princess offered to man who can o...
83 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H522.1. †H522.1. Test: guessing origin of cert...
84 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H573.3. †H573.3. Riddle solved by listening to...

85 rows × 3 columns

We can also add in the category:

table_df["category"] = category
table_df
ATU Classification Type AT Classification Type Thompson Motif category
0 D812.3. †D812.3. Magic object received from de... Tales of Magic - Magic Object 560 - 571
1 D1413.8. †D1413.8. Chamber-pot to which one st... Tales of Magic - Magic Object 560 - 571
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing Tales of Magic - Magic Object 560 - 571
3 ATU 576 Magic Knife AT 576 Magic Knife D1083. †D1083. Magic knife. Tales of Magic - Magic Object 560 - 571
4 ATU 576 Magic Knife AT 576 Magic Knife D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... Tales of Magic - Magic Object 560 - 571
... ... ... ... ...
80 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse. Tales of Magic - Magic Object 560 - 571
81 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) F983.2. †F983.2. Louse fattened. Tales of Magic - Magic Object 560 - 571
82 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H551. †H551. Princess offered to man who can o... Tales of Magic - Magic Object 560 - 571
83 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H522.1. †H522.1. Test: guessing origin of cert... Tales of Magic - Magic Object 560 - 571
84 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H573.3. †H573.3. Riddle solved by listening to... Tales of Magic - Magic Object 560 - 571

85 rows × 4 columns

For the other tables, we don’t need to drop any rows and we can recycle the column names, retaining the empty row dropper and null value cleaner just in case.

Some of the other tables may have column names, others may not. So let’s always add column names explicitly, then drop any row that appears to duplicate them:

# Work with the second table in a page... 
example_df2 = pd.read_html(boxes[1].find('table')[0].html)[0].dropna(axis=0, how='all').reset_index().fillna('')
example_df2.columns = table_df.columns
# Remove column names if they were in the table
example_df2 = example_df2[example_df2["ATU Classification Type"]!="ATU Classification Type"]

example_df2
ATU Classification Type AT Classification Type Thompson Motif category
0 0 ATU Classification Type AT Classification Type Thompson Motif
1 2 ATU 571B Lover Exposed AT 571B Himphamp H1010. †H1010. Impossible tasks
2 3 D812.3. †D812.3. Magic object received from de...
3 4 D1413.8. †D1413.8. Chamber-pot to which one st...
4 5 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing
... ... ... ... ...
82 83 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse.
83 84 F983.2. †F983.2. Louse fattened.
84 85 H551. †H551. Princess offered to man who can o...
85 86 H522.1. †H522.1. Test: guessing origin of cert...
86 87 H573.3. †H573.3. Riddle solved by listening to...

87 rows × 4 columns

Some of the tables include “metadata” rows rather than splitting things into separate pages.

response = session.get(index_links[5])

# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = response.html.find('.s-lib-box')
borked_df = pd.read_html(boxes[0].find('table')[0].html, skiprows=3)[0]
borked_df = borked_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')
borked_df.columns = colnames
borked_df.iloc[10:25]
ATU Classification Type AT Classification Type Thompson Motif
10 ATU 1008 Lighting the Road AT 1008 Lighting the Way by Setting the Barn o... K1412. Lighting the road (or painting the hous...
11 ATU 1009 Guarding the Store-room Door AT 1009 Guarding the Door by Carrying it Away K1413. Guarding the door.
12 ATU 1010 Repairing the House AT 1010 Fixing Things Up K1415. Repairing the house
13 ATU 1011 Tearing up the Orchard (Vineyard) AT 1011 Cutting Down the Orchard K1416. Tearing up the orchard (vineyard).
14 ATU 1012 Cleaning the Child AT 1012 Cleaning the Master's Child K1461.1. Cleaning the child.
15 ATU 1013 Bathing (Warming) Grandmother AT 1013 Bathing the Grandmother in Boiling Water K1462. Washing the grandmother -- in boiling w...
16 ATU 1029 The woman as Cuckoo in the Tree AT 1029 The Bargain Ends When the Cuckoo Sings K1691. The woman as cuckoo on the tree shot down.
17 _____________ _____________ Partnership between Man and Ogre 1030 - 1059
18 ATU 1030 The Crop Division AT 1030 Man and Ogre Share the Harvest K171.1. Deceptive crop division: above the gro...
19 ATU 1036 Hogs with Curly Tails AT 1036 Hogs with Curly Tails K171.4. Deceptive division of pigs: curly and ...
20 ATU 1045 Pulling the Lake Together AT 1045 Pulling Down the Forest with a Rope K1744. Hero threatens to pull the lake togethe...
21 ATU 1049 The heavy axe AT 1049 The heavy Axe or the Giant Bucket K1741.1. Felling the whole forest.
22 K1741.3. Bringing the whole well.
23 ATU 1050 Felling trees AT 1050 Contest in Felling Trees K44. Deceptive contest in chopping.
24 K178. Deceptive bargain: felling the tree.

We can split the table on these “headers” and use the new heading as the category:

# Find index of split rows
borked_df[borked_df["ATU Classification Type"].str.startswith("__")]
ATU Classification Type AT Classification Type Thompson Motif
17 _____________ _____________ Partnership between Man and Ogre 1030 - 1059
29 _____________ _____________ Contest between Man and Ogre 1060 - 1114
53 _____________ _____________ Man Kills (Injures) Ogre 1115 - 1144
72 _____________ _____________ Ogre Frightened by Man 1145 - 1154
80 _____________ _____________ Man Outwits the Devil 1155 - 1169
92 _____________ _____________ Souls Saved from the Devil 1170 - 1199

If we get the index values of these we can iteratively split the dataframe:

split_indexes = borked_df[borked_df["ATU Classification Type"].str.startswith("__")].index.to_list()
split_indexes

split_dfs = []

prev_ix = 0
for ix in split_indexes:
    tmp_df = borked_df.iloc[prev_ix:ix, :].reset_index(drop=True)
    split_dfs.append(tmp_df)
    prev_ix = ix

# Remaining part:
tmp_df = borked_df.iloc[prev_ix:, :].reset_index(drop=True)
split_dfs.append(tmp_df)

The first row of each of the split dataframes contains the new category:

split_dfs[2].iloc[0]
ATU Classification Type                               _____________
AT Classification Type                                _____________
Thompson Motif             Contest between Man and Ogre 1060 - 1114
Name: 0, dtype: object

We can process the dataframes by extracting the category, updating the dataframe with it, and dropping the metadata row:

for split_df in split_dfs[1:]:
    category = split_df.iloc[0, 2]
    split_df["category"] = category
    # Because we are modifying the dataframes as referenced objects, we change them inplace
    split_df.drop(0, inplace=True)
    split_df.reset_index(drop=True, inplace=True)
    
split_dfs[1].head()
ATU Classification Type AT Classification Type Thompson Motif category
0 ATU 1030 The Crop Division AT 1030 Man and Ogre Share the Harvest K171.1. Deceptive crop division: above the gro... Partnership between Man and Ogre 1030 - 1059
1 ATU 1036 Hogs with Curly Tails AT 1036 Hogs with Curly Tails K171.4. Deceptive division of pigs: curly and ... Partnership between Man and Ogre 1030 - 1059
2 ATU 1045 Pulling the Lake Together AT 1045 Pulling Down the Forest with a Rope K1744. Hero threatens to pull the lake togethe... Partnership between Man and Ogre 1030 - 1059
3 ATU 1049 The heavy axe AT 1049 The heavy Axe or the Giant Bucket K1741.1. Felling the whole forest. Partnership between Man and Ogre 1030 - 1059
4 K1741.3. Bringing the whole well. Partnership between Man and Ogre 1030 - 1059

We can split out each row into more detail:

def ATU_splitter(cell):
    """Split out ATU Classification type elements."""
    if cell:
        parts = cell.split()
        return pd.Series({#"ATU_code":" ".join(parts[:2]),
                          "ATU_code": parts[1],
                          "ATU_label": " ".join(parts[2:])})
    return pd.Series({"ATU_code":'', "ATU_label":''})

table_df[["ATU_code","ATU_label"]]= table_df["ATU Classification Type"].apply(ATU_splitter)
table_df
ATU Classification Type AT Classification Type Thompson Motif category ATU_code ATU_label
0 D812.3. †D812.3. Magic object received from de... Tales of Magic - Magic Object 560 - 571
1 D1413.8. †D1413.8. Chamber-pot to which one st... Tales of Magic - Magic Object 560 - 571
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing Tales of Magic - Magic Object 560 - 571 576 Magic Knife
3 ATU 576 Magic Knife AT 576 Magic Knife D1083. †D1083. Magic knife. Tales of Magic - Magic Object 560 - 571 576 Magic Knife
4 ATU 576 Magic Knife AT 576 Magic Knife D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... Tales of Magic - Magic Object 560 - 571 576 Magic Knife
... ... ... ... ... ... ...
80 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse. Tales of Magic - Magic Object 560 - 571 857 Louse-Skin
81 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) F983.2. †F983.2. Louse fattened. Tales of Magic - Magic Object 560 - 571 857 Louse-Skin
82 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H551. †H551. Princess offered to man who can o... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin
83 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H522.1. †H522.1. Test: guessing origin of cert... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin
84 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H573.3. †H573.3. Riddle solved by listening to... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin

85 rows × 6 columns

def AT_splitter(cell):
    """Split out AT Classification type elements."""
    if cell:
        parts = cell.split()
        return pd.Series({#"AT_code":" ".join(parts[:2]),
                          "AT_code": parts[1],
                          "AT_label": " ".join(parts[2:])})
    return pd.Series({"AT_code":'', "AT_label":''})

table_df[["AT_code","AT_label"]]= table_df["AT Classification Type"].apply(AT_splitter)
table_df
ATU Classification Type AT Classification Type Thompson Motif category ATU_code ATU_label AT_code AT_label
0 D812.3. †D812.3. Magic object received from de... Tales of Magic - Magic Object 560 - 571
1 D1413.8. †D1413.8. Chamber-pot to which one st... Tales of Magic - Magic Object 560 - 571
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife
3 ATU 576 Magic Knife AT 576 Magic Knife D1083. †D1083. Magic knife. Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife
4 ATU 576 Magic Knife AT 576 Magic Knife D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife
... ... ... ... ... ... ... ... ...
80 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse. Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857)
81 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) F983.2. †F983.2. Louse fattened. Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857)
82 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H551. †H551. Princess offered to man who can o... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857)
83 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H522.1. †H522.1. Test: guessing origin of cert... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857)
84 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H573.3. †H573.3. Riddle solved by listening to... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857)

85 rows × 8 columns

def Thompson_Motif_splitter(cell):
    """Split out Thompson Motif elements."""
    if cell:
        parts = cell.split()
        return pd.Series({"Motif_code":" ".join(parts[:1]).strip("."), "Motif_label": " ".join(parts[1:])})
    return pd.Series({"Motif_code":'', "Motif_label":''})

table_df[["Motif_code","Motif_label"]]= table_df["Thompson Motif"].apply(Thompson_Motif_splitter)
table_df
ATU Classification Type AT Classification Type Thompson Motif category ATU_code ATU_label AT_code AT_label Motif_code Motif_label
0 D812.3. †D812.3. Magic object received from de... Tales of Magic - Magic Object 560 - 571 D812.3 †D812.3. Magic object received from devil.
1 D1413.8. †D1413.8. Chamber-pot to which one st... Tales of Magic - Magic Object 560 - 571 D1413.8 †D1413.8. Chamber-pot to which one sticks.
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife D838 †D838. Magic object acquired by stealing
3 ATU 576 Magic Knife AT 576 Magic Knife D1083. †D1083. Magic knife. Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife D1083 †D1083. Magic knife.
4 ATU 576 Magic Knife AT 576 Magic Knife D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife D1400.1.4.3 †D1400.1.4.3. Magic knife conquers enemy.
... ... ... ... ... ... ... ... ... ... ...
80 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) B873.1. †B873.1. Giant louse. Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857) B873.1 †B873.1. Giant louse.
81 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) F983.2. †F983.2. Louse fattened. Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857) F983.2 †F983.2. Louse fattened.
82 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H551. †H551. Princess offered to man who can o... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857) H551 †H551. Princess offered to man who can out-rid...
83 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H522.1. †H522.1. Test: guessing origin of cert... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857) H522.1 †H522.1. Test: guessing origin of certain skin.
84 ATU 857 Louse-Skin AT 621 The Louse-Skin (now ATU 857) H573.3. †H573.3. Riddle solved by listening to... Tales of Magic - Magic Object 560 - 571 857 Louse-Skin 621 The Louse-Skin (now ATU 857) H573.3 †H573.3. Riddle solved by listening to propoun...

85 rows × 10 columns

We can add those codes to a simple database. Let’s create the database first:

from sqlite_utils import Database

db_name = "tale_types_demo.db"

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)
db["tale_types"].delete_where()

table_df.to_sql("tale_types", index=False, con=db.conn)
/usr/local/lib/python3.9/site-packages/pandas/core/generic.py:2872: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  sql.to_sql(

Try a test query:

from pandas import read_sql

q = "SELECT * FROM tale_types LIMIT 3"

read_sql(q, db.conn)
ATU Classification Type AT Classification Type Thompson Motif category ATU_code ATU_label AT_code AT_label Motif_code Motif_label
0 D812.3. †D812.3. Magic object received from de... Tales of Magic - Magic Object 560 - 571 D812.3 †D812.3. Magic object received from devil.
1 D1413.8. †D1413.8. Chamber-pot to which one st... Tales of Magic - Magic Object 560 - 571 D1413.8 †D1413.8. Chamber-pot to which one sticks.
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife D838 †D838. Magic object acquired by stealing

We can create a simple function to scrape all the tables on the page, process them and add them to the database.

We can also add the page title as a column entry.

def process_dataframe1(table_df, category):
    """Process the """
    #Explicitly set column names
    table_df.columns = ["ATU Classification Type", "AT Classification Type", "Thompson Motif"]

    # Remove column names if they were in the table
    table_df = table_df[table_df["ATU Classification Type"]!="ATU Classification Type"].reset_index(drop=True)

    table_df['category'] = category
   
    return table_df

def patch_motif_label(cell):
    """Fix borked motif label."""
    if cell and cell.startswith("†"):
        parts = cell.split()
        return  pd.Series({"Motif_label": " ".join(parts[1:])})
    return pd.Series({"Motif_label":''})
    
def process_dataframe2(table_df):
    """Process the """   
    # Fill down on blanks
    table_df = table_df.replace("", None).fillna(method="ffill")
    table_df[["ATU_code","ATU_label"]]= table_df["ATU Classification Type"].apply(ATU_splitter)
    table_df[["AT_code","AT_label"]]= table_df["AT Classification Type"].apply(AT_splitter)
    table_df[["Motif_code","Motif_label"]]= table_df["Thompson Motif"].apply(Thompson_Motif_splitter)
    
    # There may be some cruft in Motif_label
    table_df[["Motif_label"]]= table_df["Motif_label"].apply(patch_motif_label)

    return table_df

    
def scrape_page_tables(url):
    """Scrape all the tables on the page."""
    response = session.get(url)

    # dot notation: . is class, # is id
    #s-lib-box s-lib-box-std
    boxes = response.html.find('.s-lib-box')
    
    tables = []
    
    for i, box in enumerate(boxes):
        category = box.find('h2')[0].text
        
        # Some pages have cruft boxes with no tables
        _tables = box.find('table')
        if not _tables:
            continue
        table = _tables[0]
    
        if i==0:
            # There may be cruft empty tables at the top of the page
            try:
                table_df = pd.read_html(table.html, skiprows=3)[0]
                table_df = table_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')
            except:
                continue
        else:
            table_df = pd.read_html(table.html)[0]
            table_df = table_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')

        table_df = process_dataframe1(table_df, category)

        # Some dataframes are actually multiple dataframes
        split_indexes = table_df[table_df["ATU Classification Type"].str.startswith("__")].index.to_list()
        if split_indexes:
            split_dfs = []
            prev_ix = 0
            for ix in split_indexes:
                tmp_df = table_df.iloc[prev_ix:ix, :].reset_index(drop=True)
                split_dfs.append(tmp_df)
                prev_ix = ix

            # Remaining part:
            tmp_df = table_df.iloc[prev_ix:, :].reset_index(drop=True)
            split_dfs.append(tmp_df)

            for split_df in split_dfs[1:]:
                partial_category = split_df.iloc[0, 2]
                split_df["category"] = partial_category
                split_df = split_df.drop(0).reset_index(drop=True)
                split_df = process_dataframe2(split_df)
                tables.append(split_df)
                
        else:
            table_df = process_dataframe2(table_df)
            tables.append(table_df)
    return tables
index_links[5]
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7624305'

Let’s try it out:

scrape_page_tables(index_links[5])[0]
ATU Classification Type AT Classification Type Thompson Motif category ATU_code ATU_label AT_code AT_label Motif_code Motif_label
0 ATU 1030 The Crop Division AT 1030 Man and Ogre Share the Harvest K171.1. Deceptive crop division: above the gro... Partnership between Man and Ogre 1030 - 1059 1030 The Crop Division 1030 Man and Ogre Share the Harvest K171.1
1 ATU 1036 Hogs with Curly Tails AT 1036 Hogs with Curly Tails K171.4. Deceptive division of pigs: curly and ... Partnership between Man and Ogre 1030 - 1059 1036 Hogs with Curly Tails 1036 Hogs with Curly Tails K171.4
2 ATU 1045 Pulling the Lake Together AT 1045 Pulling Down the Forest with a Rope K1744. Hero threatens to pull the lake togethe... Partnership between Man and Ogre 1030 - 1059 1045 Pulling the Lake Together 1045 Pulling Down the Forest with a Rope K1744
3 ATU 1049 The heavy axe AT 1049 The heavy Axe or the Giant Bucket K1741.1. Felling the whole forest. Partnership between Man and Ogre 1030 - 1059 1049 The heavy axe 1049 The heavy Axe or the Giant Bucket K1741.1
4 ATU 1049 The heavy axe AT 1049 The heavy Axe or the Giant Bucket K1741.3. Bringing the whole well. Partnership between Man and Ogre 1030 - 1059 1049 The heavy axe 1049 The heavy Axe or the Giant Bucket K1741.3
5 ATU 1050 Felling trees AT 1050 Contest in Felling Trees K44. Deceptive contest in chopping. Partnership between Man and Ogre 1030 - 1059 1050 Felling trees 1050 Contest in Felling Trees K44
6 ATU 1050 Felling trees AT 1050 Contest in Felling Trees K178. Deceptive bargain: felling the tree. Partnership between Man and Ogre 1030 - 1059 1050 Felling trees 1050 Contest in Felling Trees K178
7 ATU 1050 Felling trees AT 1050 Contest in Felling Trees K1421. Clearing land: axe broken. Partnership between Man and Ogre 1030 - 1059 1050 Felling trees 1050 Contest in Felling Trees K1421
8 ATU 1051 Bending a Tree AT 1051 Springing with a Bent Tree K1112. Bending the tree. Partnership between Man and Ogre 1030 - 1059 1051 Bending a Tree 1051 Springing with a Bent Tree K1112
9 ATU 1052 Carrying a Tree AT 1052 A Contest in Carrying a Tree K71. Deceptive contest in carrying a tree: rid... Partnership between Man and Ogre 1030 - 1059 1052 Carrying a Tree 1052 A Contest in Carrying a Tree K71
10 ATU 1053 Shooting Wild Boars AT 1053 Shooting Wild Boars K1741.2. A thousand at one shot. Partnership between Man and Ogre 1030 - 1059 1053 Shooting Wild Boars 1053 Shooting Wild Boars K1741.2

We can now scrape the whole collection:

from tqdm.notebook import tqdm


# Note that this assumes a globally defined requests session and database
def scrape_pages(url="https://libraryguides.missouri.edu/c.php?g=1039894&p=7609090",
                 db_table="tale_types"):
    """Scrape all tale type lookup pages."""
    index_response = session.get(url)
    
    # Get links for lookup pages
    link_elements = response.html.find('.nav-tabs')[0].find("li")[1:-1]
    index_links = [el.xpath("//a/@href")[0] for el in link_elements]
    
    # Parse each page
    for index_link in tqdm(index_links):
        tables = scrape_page_tables(index_link)
        
        # Add to database
        # Note that we need to ensure we append the data to an existing table
        for table in tables:
            table.to_sql(db_table, index=False, if_exists="append", con=db.conn)
# Clear out the table
#db["tale_types"].drop()
db["tale_links"].delete_where()

# And scrape everything
scrape_pages()
/usr/local/lib/python3.9/site-packages/pandas/core/generic.py:2872: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  sql.to_sql(
q = "SELECT * FROM tale_types LIMIT 3"

read_sql(q, db.conn)
ATU Classification Type AT Classification Type Thompson Motif category ATU_code ATU_label AT_code AT_label Motif_code Motif_label
0 D812.3. †D812.3. Magic object received from de... Tales of Magic - Magic Object 560 - 571 D812.3 †D812.3. Magic object received from devil.
1 D1413.8. †D1413.8. Chamber-pot to which one st... Tales of Magic - Magic Object 560 - 571 D1413.8 †D1413.8. Chamber-pot to which one sticks.
2 ATU 576 Magic Knife AT 576 Magic Knife D838. †D838. Magic object acquired by stealing Tales of Magic - Magic Object 560 - 571 576 Magic Knife 576 Magic Knife D838 †D838. Magic object acquired by stealing
q = "SELECT COUNT(*) AS numrows FROM tale_types"

read_sql(q, db.conn)
numrows
0 2311
q = "SELECT * FROM tale_types WHERE ATU_code=610"

read_sql(q, db.conn)
ATU Classification Type AT Classification Type Thompson Motif category ATU_code ATU_label AT_code AT_label Motif_code Motif_label
0 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess T68.1. †T68.1. Princess offered as prize to re... Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess T68.1 †T68.1. Princess offered as prize to rescuer.
1 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H346. †H346. Princess given to man who can hea... Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess H346 †H346. Princess given to man who can heal her.
2 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess Q2. †Q2. Kind and unkind. Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess Q2 †Q2. Kind and unkind.
3 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess L13. †L13. Compassionate youngest son. Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess L13 †L13. Compassionate youngest son.
4 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess L10. †L10. Victorious youngest son. Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess L10 †L10. Victorious youngest son.
5 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess N825.3. †N825.3. Old woman helper. Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess N825.3 †N825.3. Old woman helper.
6 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess D1500.1.5. †D1500.1.5. Magic healing fruit. Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess D1500.1.5 †D1500.1.5. Magic healing fruit.
7 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H1010. †H1010. Impossible tasks Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess H1010 †H1010. Impossible tasks
8 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess D1533.1.1. †D1533.1.1. Magic land and water ship. Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess D1533.1.1 †D1533.1.1. Magic land and water ship.
9 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H1112. †H1112. Task: herding rabbits. Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess H1112 †H1112. Task: herding rabbits.
10 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H1331.1.2.1. †H1331.1.2.1. Quest for remarkabl... Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess H1331.1.2.1 †H1331.1.2.1. Quest for remarkable bird feathers.
11 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess L161. †L161. Lowly hero marries princess Tales of Magic - Magic Object 560 - 571 610 The Healing Fruit 610 Fruit to Cure the Princess L161 †L161. Lowly hero marries princess
12 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess T68.1. †T68.1. Princess offered as prize to re... Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess T68.1 Princess offered as prize to rescuer.
13 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H346. †H346. Princess given to man who can hea... Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess H346 Princess given to man who can heal her.
14 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess Q2. †Q2. Kind and unkind. Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess Q2 Kind and unkind.
15 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess L13. †L13. Compassionate youngest son. Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess L13 Compassionate youngest son.
16 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess L10. †L10. Victorious youngest son. Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess L10 Victorious youngest son.
17 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess N825.3. †N825.3. Old woman helper. Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess N825.3 Old woman helper.
18 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess D1500.1.5. †D1500.1.5. Magic healing fruit. Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess D1500.1.5 Magic healing fruit.
19 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H1010. †H1010. Impossible tasks Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess H1010 Impossible tasks
20 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess D1533.1.1. †D1533.1.1. Magic land and water ship. Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess D1533.1.1 Magic land and water ship.
21 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H1112. †H1112. Task: herding rabbits. Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess H1112 Task: herding rabbits.
22 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess H1331.1.2.1. †H1331.1.2.1. Quest for remarkabl... Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess H1331.1.2.1 Quest for remarkable bird feathers.
23 ATU 610 The Healing Fruit AT 610 Fruit to Cure the Princess L161. †L161. Lowly hero marries princess Tales of Magic - Magic Object 571B - 649 610 The Healing Fruit 610 Fruit to Cure the Princess L161 Lowly hero marries princess

Story Scraper#

The Missouri Libraries website also maintain a collection of stories referenced by AT and ATU categorised tale types: https://libraryguides.missouri.edu/c.php?g=1083510

tales_index_url = "https://libraryguides.missouri.edu/c.php?g=1083510"
session = HTMLSession()
tales_index_response = session.get(tales_index_url)

We can get links for the separate index pages as we did before:

tales_link_elements = tales_index_response.html.find('.nav-tabs')[0].find("li")[:-2]
tales_index_links = [el.xpath("//a/@href")[0] for el in tales_link_elements]
tales_index_links
['https://libraryguides.missouri.edu/c.php?g=1083510&p=7898512',
 'https://libraryguides.missouri.edu/c.php?g=1083510&p=7901911',
 'https://libraryguides.missouri.edu/c.php?g=1083510&p=7913834',
 'https://libraryguides.missouri.edu/c.php?g=1083510&p=7916779',
 'https://libraryguides.missouri.edu/c.php?g=1083510&p=7916818',
 'https://libraryguides.missouri.edu/c.php?g=1083510&p=7916872',
 'https://libraryguides.missouri.edu/c.php?g=1083510&p=7917477']

There is just a single box per page although it may contain several tables:

tales_response = session.get(tales_index_links[1])

# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = tales_response.html.find('.s-lib-box')
boxes
[<Element 'div' id='s-lg-box-25061504' class=('s-lib-box', 's-lib-box-std')>]

As before, we can extract the box title as a category heading:

category = boxes[0].find('h2')[0].text
category
'ATU 300 - 559 Tales of Magic'
example_tales_tables = boxes[0].find('table')
example_tales_tables[:3]
[<Element 'table' border='1' cellpadding='0' cellspacing='0' dir='ltr' style='table-layout:fixed;font-size:10pt;font-family:Arial;width:0px;border-collapse:collapse;border:none' xmlns='http://www.w3.org/1999/xhtml'>,
 <Element 'table' border='1' cellpadding='0' cellspacing='0' dir='ltr' style='table-layout:fixed;font-size:10pt;font-family:Arial;width:0px;border-collapse:collapse;border:none' xmlns='http://www.w3.org/1999/xhtml'>,
 <Element 'table' border='1' cellpadding='0' cellspacing='0' dir='ltr' style='table-layout:fixed;font-size:10pt;font-family:Arial;width:0px;border-collapse:collapse;border:none' xmlns='http://www.w3.org/1999/xhtml'>]

Each table can be mapped to a dataframe, removing empty rows and filling down on the blanks.

The columns could be detected, but they are consistent and will need rolling over for multiple tables in a box, so we might as well set them explicitly and then tidy up.

example_df = pd.read_html(example_tales_tables[0].html)[0]
example_df = example_df.dropna(how='all').fillna(method='ffill')
example_df.columns = ["ATU Title", "AT", "Title", "Origin"]

# Remove column names if they were in the table and reset index
example_df = example_df[example_df["ATU Title"]!="ATU Title"].reset_index(drop=True)

# Tidy AT
example_df["AT"] = example_df["AT"].str.replace("*", "", regex=False)
# Tidy ATU Title
example_df["ATU Name"] = example_df["ATU Title"].apply(lambda x: " ".join(x.split()[2:]))
example_df.head()
ATU Title AT Title Origin ATU Name
0 ATU 300 The Dragon-Slayer 300 Three Brothers and the three Tasks British The Dragon-Slayer
1 ATU 300 The Dragon-Slayer 300 Three Dogs German The Dragon-Slayer
2 ATU 300 The Dragon-Slayer 300 Black Crow and the White Cheese Egyptian The Dragon-Slayer
3 ATU 300 The Dragon-Slayer 300 Five Counsels Mexican The Dragon-Slayer
4 ATU 300 The Dragon-Slayer 300 Monster's Hairs Armenian The Dragon-Slayer

Also note that we need to parse out the link for the tale in the HTML table. The pandas .read_html() function does not do this out of the can, although we can patch pandas to support it. Alternatively, we can grab the links from the table.

from bs4 import BeautifulSoup

def get_table_links(html):
    """Parse an HMTL table and extract links from it."""
    table_soup = BeautifulSoup(html, 'lxml')

    links = []
    for row in table_soup.find_all('tr'):
        for td in row.find_all('td'):
            if td.find('a'):
                links.append((td.a['href'], td.text))

    link_df = pd.DataFrame(links, columns=["link", "Title"])
    
    return link_df

link_df = get_table_links(example_tales_tables[0].html)
link_df.head()
link Title
0 https://archive.org/details/dictionaryofbrit01... Three Brothers and the three Tasks
1 https://archive.org/details/greenfairybook00la... Three Dogs
2 https://archive.org/details/folktalesofegypt00... Black Crow and the White Cheese
3 https://archive.org/details/folktalesofmexic00... Five Counsels
4 https://archive.org/details/100armeniantales00... Monster's Hairs

Note that there may be duplicate story names within a table, so we will join by the index value.

example_df = example_df.join(link_df["link"])
example_df.head()
ATU Title AT Title Origin ATU Name link
0 ATU 300 The Dragon-Slayer 300 Three Brothers and the three Tasks British The Dragon-Slayer https://archive.org/details/dictionaryofbrit01...
1 ATU 300 The Dragon-Slayer 300 Three Dogs German The Dragon-Slayer https://archive.org/details/greenfairybook00la...
2 ATU 300 The Dragon-Slayer 300 Black Crow and the White Cheese Egyptian The Dragon-Slayer https://archive.org/details/folktalesofegypt00...
3 ATU 300 The Dragon-Slayer 300 Five Counsels Mexican The Dragon-Slayer https://archive.org/details/folktalesofmexic00...
4 ATU 300 The Dragon-Slayer 300 Monster's Hairs Armenian The Dragon-Slayer https://archive.org/details/100armeniantales00...

To parse the page completely, we’d need to iterate over all the tables.

Let’s build a scraper for all the pages and all the tables in a page:

def scrape_tale_links(url="https://libraryguides.missouri.edu/c.php?g=1083510",
                      db_table="tale_links"):
    """Scrape all tale links."""
    
    tales_index_response = session.get(url)
    
    tales_link_elements = tales_index_response.html.find('.nav-tabs')[0].find("li")[:-2]
    tales_index_links = [el.xpath("//a/@href")[0] for el in tales_link_elements]

    for tales_page_link in tqdm(tales_index_links):
        tales_response = session.get(tales_page_link)
        # dot notation: . is class, # is id
        #s-lib-box s-lib-box-std
        boxes = tales_response.html.find('.s-lib-box')

        category = boxes[0].find('h2')[0].text

        tables =  boxes[0].find('table')

        for table in tables:
            table_df = pd.read_html(table.html)[0]
            table_df = table_df.dropna(how='all').reset_index(drop=True).fillna(method='ffill')
            table_df.columns = ["ATU Title", "AT", "Title", "Origin"]
            table_df["AT"] = table_df["AT"].astype(str)
            table_df["ATU Title"] = table_df["AT"].astype(str)

            table_df["category"] = category

            # Remove column names if they were in the table
            table_df = table_df[table_df["ATU Title"]!="ATU Title"]

            # Tidy AT
            table_df["AT"] = table_df["AT"].str.replace("*", "", regex=False)
            # Tidy ATU Title
            table_df["ATU Name"] = table_df["ATU Title"].apply(lambda x: " ".join(x.split()[2:]))
            
            # Add links
            link_df = get_table_links(example_tales_tables[0].html)
            table_df = table_df.join(link_df["link"])
            
            table_df.to_sql(db_table, if_exists="append", index=False, con=db.conn)
# Clear the db table (if it exists)
#db["tale_links"].drop()
db["tale_links"].delete_where()

scrape_tale_links()
/usr/local/lib/python3.9/site-packages/pandas/core/generic.py:2872: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  sql.to_sql(
q = "SELECT * FROM tale_links LIMIT 10"

read_sql(q, db.conn)
ATU Title AT Title Origin category ATU Name link
0 AT AT Title Origin ATU 1-299 Animal Tales https://archive.org/details/dictionaryofbrit01...
1 1 1 The Wolf and the Fox French ATU 1-299 Animal Tales https://archive.org/details/greenfairybook00la...
2 1 1 Fox in Inishkea Irish ATU 1-299 Animal Tales https://archive.org/details/folktalesofegypt00...
3 1 1 Fish Thief Japanesse ATU 1-299 Animal Tales https://archive.org/details/folktalesofmexic00...
4 1 1 Tub of Butter Canadian ATU 1-299 Animal Tales https://archive.org/details/100armeniantales00...
5 1 1 Fox and Lapp Lapp ATU 1-299 Animal Tales https://archive.org/details/irishfolktales00gl...
6 1 1 Reynard and Bruin European ATU 1-299 Animal Tales https://archive.org/details/greenhillsofmagi00...
7 1* 1 Fox and the Hare in Winter German ATU 1-299 Animal Tales https://archive.org/details/irishfolktales00gl...
8 1* 1 Fox and the Magpie British ATU 1-299 Animal Tales https://archive.org/details/moreenglishfairy00...
9 1* 1 Hare, Badger, Monkey, and Otter Japanesse ATU 1-299 Animal Tales https://archive.org/details/greyfairybook00lan...

Lots of the links are to resources held by archive.org, although many of them are limited preview books.

It might be interesting trying to build something to check whether a resource is openly available there, or limited access, and if it is openly available, perhaps attempt to scrape the story text from the Internet Archive.

Identifying Openly Available archive.org Resources#

Let’s start by just grabbing a list of unique resources on archive.org:

q = 'SELECT * FROM tale_links WHERE link LIKE "https://archive.org/details/%";'

xx = read_sql(q, db.conn)
xx['resource'] = xx['link'].str.replace("https://archive.org/details/", "", regex=False)
xx['resource'] = xx['resource'].apply(lambda x: x.split("/")[0])
xx['resource'].unique()
array(['dictionaryofbrit01brig', 'greenfairybook00lang_0',
       'folktalesofegypt00elsh', 'folktalesofmexic0000pare',
       '100armeniantales0000unse', 'irishfolktales00glas',
       'greenhillsofmagi0000musi', 'moreenglishfairy00jaco',
       'greyfairybook00lang', 'russianfairytale00afan',
       'folktalesofgreec0000mega', 'englishfairytale1902jaco',
       'italianfolktales00calv', 'populartalesfrom00daseiala',
       'yellowfairybook00lang02', 'folktalesofgerma00kurt',
       'bluefairybook00langiala', 'brownfairybook00langrich',
       'folktalesoffranc00mass', 'folktalesofjapan00seki',
       'moregreekfolktal0000dawk', 'violetfairybook00lang'], dtype=object)

I’ve already got a database of the Lang coloured Fairy Books, so it should be possible to join in to that.