Tale Types Scraper
Contents
Tale Types Scraper#
The University of Missouri Libraries publish a useful look-up table of AT, ATU and Thompson Motif codes which might be handy to have in a simple database.
The original pages can be found from this explanatory guide: https://libraryguides.missouri.edu/c.php?g=1039894&p=7541717
url = "https://libraryguides.missouri.edu/c.php?g=1039894&p=7609090"
import requests_cache
from datetime import timedelta
requests_cache.install_cache('web_cache', backend='sqlite', expire_after=timedelta(days=100))
from requests_html import HTMLSession
session = HTMLSession()
index_response = session.get(url)
The resource pages are linked from a menu bar. The links we want are all but the first and the last one:
link_elements = index_response.html.find('.nav-tabs')[0].find("li")[1:-1]
link_elements
[<Element 'li' class=('active',)>,
<Element 'li' class=()>,
<Element 'li' class=()>,
<Element 'li' class=()>,
<Element 'li' class=()>,
<Element 'li' class=()>,
<Element 'li' class=()>,
<Element 'li' class=()>]
index_links = [el.xpath("//a/@href")[0] for el in link_elements]
index_links
['https://libraryguides.missouri.edu/c.php?g=1039894&p=7609090',
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7610331',
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7619154',
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7620890',
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7621605',
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7624305',
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7624836',
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7635705']
The pages are split into separate tables which are embedded in particular block elements.
response = session.get(index_links[2])
# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = response.html.find('.s-lib-box')
boxes
[<Element 'div' id='s-lg-box-24196290' class=('s-lib-box', 's-lib-box-std')>,
<Element 'div' id='s-lg-box-24196308' class=('s-lib-box', 's-lib-box-std')>,
<Element 'div' id='s-lg-box-24197326' class=('s-lib-box', 's-lib-box-std')>,
<Element 'div' id='s-lg-box-24197327' class=('s-lib-box', 's-lib-box-std')>,
<Element 'div' id='s-lg-box-24197417' class=('s-lib-box', 's-lib-box-std')>]
We can extract the box title as a category label:
category = boxes[0].find('h2')[0].text
category
'Tales of Magic - Magic Object 560 - 571'
And we can grab the table within the box (I think there is just one table per box, so take the first, but we coul also go defensive on this to check for multiple tables…):
example_table = boxes[1].find('table')[0]
example_table
<Element 'table' class=('table', 'table-bordered') style='border: 0px solid rgb(180, 180, 180);'>
example_table.html
'<table class="table table-bordered" style="border: 0px solid rgb(180, 180, 180);">\n<tbody>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"571 \\"All Stick Together\\""}" height="21" style="height:21px;width:256px;">ATU Classification Type\xa0</td>\n<td style="width:276px;">\xa0AT Classification Type\xa0\xa0</td>\n<td style="width:631px;">Thompson Motif</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td style="width:631px;">\xa0</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">ATU 571B Lover Exposed</td>\n<td>AT 571B Himphamp</td>\n<td data-sheets-value="{"1":2,"2":"H1010. †H1010. Impossible tasks"}" style="width:631px;">H1010. †H1010. Impossible tasks</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"571B Lover Exposed"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D812.3. †D812.3. Magic object received from devil."}">\n<div>\n<div>D812.3. †D812.3. Magic object received from devil.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"571B Lover Exposed"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1413.8. †D1413.8. Chamber-pot to which one sticks."}">\n<div>\n<div>D1413.8. †D1413.8. Chamber-pot to which one sticks.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"571B Lover Exposed"}" height="21" style="height:21px;">ATU 576 Magic Knife</td>\n<td>AT 576 Magic Knife</td>\n<td data-sheets-value="{"1":2,"2":"D838. †D838. Magic object acquired by stealing"}" style="width:631px;">D838. †D838. Magic object acquired by stealing</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"576 Magic Knife"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1083. †D1083. Magic knife."}" style="width:631px;">D1083. †D1083. Magic knife.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"576 Magic Knife"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1400.1.4.3. †D1400.1.4.3. Magic knife conquers enemy."}">\n<div>\n<div>D1400.1.4.3. †D1400.1.4.3. Magic knife conquers enemy.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"576 Magic Knife"}" height="21" style="height:21px;">ATU 577 The King\'s Tasks</td>\n<td>AT 577 The King\'s Tasks</td>\n<td data-sheets-value="{"1":2,"2":"H335. †H335. Tasks assigned suitors."}" style="width:631px;">H335. †H335. Tasks assigned suitors.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"577 The King\'s Tasks"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1601.14. †D1601.14. Self-chopping axe."}" style="width:631px;">D1601.14. †D1601.14. Self-chopping axe.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"577 The King\'s Tasks"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1601.15. †D1601.15. Automatic shovel."}" style="width:631px;">D1601.15. †D1601.15. Automatic shovel.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"577 The King\'s Tasks"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1601.18.2. †D1601.18.2. Self-playing violin."}" style="width:631px;">D1601.18.2. †D1601.18.2. Self-playing violin.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"577 The King\'s Tasks"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1581. †D1581. Tasks performed by use of magic object."}">\n<div>\n<div>D1581. †D1581. Tasks performed by use of magic object.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"577 The King\'s Tasks"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H971.1. †H971.1. Tasks performed with help of old woman"}">\n<div>\n<div>H971.1. †H971.1. Tasks performed with help of old woman</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"577 The King\'s Tasks"}" height="21" style="height:21px;">ATU 580 Beloved of Women</td>\n<td>AT 580 Beloved of Women</td>\n<td data-sheets-value="{"1":2,"2":"D1900. †D1900. Love induced by magic."}" style="width:631px;">D1900. †D1900. Love induced by magic.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"580 Beloved of Women"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D856. †D856. Magic object acquired by gaining love of owner."}">\n<div>\n<div>D856. †D856. Magic object acquired by gaining love of owner.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"580 Beloved of Women"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1472.1.7. †D1472.1.7. Magic table supplies food and drink."}">\n<div>\n<div>D1472.1.7. †D1472.1.7. Magic table supplies food and drink.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"580 Beloved of Women"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1473. †D1473. Magic object furnishes clothes."}" style="width:631px;">D1473. †D1473. Magic object furnishes clothes.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"580 Beloved of Women"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"T45. †T45. Lover buys admission to woman\'s room."}">\n<div>\n<div>T45. †T45. Lover buys admission to woman\'s room.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"580 Beloved of Women"}" height="21" style="height:21px;">ATU 566 Three Magic Objects and Wonderful Fruits</td>\n<td>AT 580* Money without End (now ATU 566)</td>\n<td data-sheets-value="{"1":2,"2":"D1451. †D1451. Inexhaustible purse furnishes money."}">\n<div>\n<div>D1451. †D1451. Inexhaustible purse furnishes money.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">ATU 303 The Twins or Blood-Brothers</td>\n<td>AT 581 Twin Brother and Thier Magic Objects (now ATU 303)</td>\n<td data-sheets-value="{"1":2,"2":"D1081. †D1081. Magic sword."}" style="width:631px;">D1081. †D1081. Magic sword.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"303 The Twins or Blood-Brothers"}" height="21" style="height:21px;">ATU 585 Spindle, Shuttle, and Needle</td>\n<td>AT 585 Spindle, Shuttle, and Needle</td>\n<td data-sheets-value="{"1":2,"2":"H1311.2. †H1311.2. Quest for bride richest and poorest."}">\n<div>\n<div>H1311.2. †H1311.2. Quest for bride richest and poorest.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"585 Spindle, Shuttle, and Needle"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1425.1. †D1425.1. Magic spindle brings back prince for heroine."}">\n<div>\n<div>D1425.1. †D1425.1. Magic spindle brings back prince for heroine.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"585 Spindle, Shuttle, and Needle"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1484.1. †D1484.1. Magic shuttle makes road."}" style="width:631px;">D1484.1. †D1484.1. Magic shuttle makes road.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"585 Spindle, Shuttle, and Needle"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1337.1.7. †D1337.1.7. Magic needle transforms a room from plainness to beauty."}">\n<div>\n<div>D1337.1.7. †D1337.1.7. Magic needle transforms a room from plainness to beauty.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"585 Spindle, Shuttle, and Needle"}" height="21" style="height:21px;">ATU 590 Faithless Mother</td>\n<td>AT 590 Magic Belt</td>\n<td data-sheets-value="{"1":2,"2":"D1840. †D1840. Magic invulnerability."}" style="width:631px;">D1840. †D1840. Magic invulnerability.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"589 Faithless Mother"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1335.4. †D1335.4. Magic belt gives strength"}" style="width:631px;">D1335.4. †D1335.4. Magic belt gives strength</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"589 Faithless Mother"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"S12.1. †S12.1. Treacherous mother marries ogre and plots against son."}">\n<div>\n<div>S12.1. †S12.1. Treacherous mother marries ogre and plots against son.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H931. †H931. Tasks assigned in order to get rid of hero."}">\n<div>\n<div>H931. †H931. Tasks assigned in order to get rid of hero.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H1212. †H1212. Quest assigned because of feigned illness."}">\n<div>\n<div>H1212. †H1212. Quest assigned because of feigned illness.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H1333.3.1.1. †H1333.3.1.1. Quest for magic apple"}" style="width:631px;">H1333.3.1.1. †H1333.3.1.1. Quest for magic apple</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H1361. †H1361. Quest for lion\'s milk."}" style="width:631px;">H1361. †H1361. Quest for lion\'s milk.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"F615. †F615. Strong man evades death."}" style="width:631px;">F615. †F615. Strong man evades death.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"B431.2. †B431.2. Helpful lion."}" style="width:631px;">B431.2. †B431.2. Helpful lion.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"K975. †K975. Secret of strength treacherously discovered."}">\n<div>\n<div>K975. †K975. Secret of strength treacherously discovered.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D861. †D861. Magic object stolen"}" style="width:631px;">D861. †D861. Magic object stolen</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"B512. †B512. Medicine shown by animal."}" style="width:631px;">B512. †B512. Medicine shown by animal.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D880. †D880. Recovery of magic object."}" style="width:631px;">D880. †D880. Recovery of magic object.</td>\n</tr>\n<tr height="21">\n<td height="21" style="height:21px;">ATU 592 The Dance Among Thorns</td>\n<td>AT 592 Dancing in Thorns</td>\n<td data-sheets-value="{"1":2,"2":"S322.4. †S322.4. Evil stepmother casts boy forth."}" style="width:631px;">S322.4. †S322.4. Evil stepmother casts boy forth.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"592 The Dance Among Thorns"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"Q42.1. †Q42.1. Spendthrift knight."}" style="width:631px;">Q42.1. †Q42.1. Spendthrift knight.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"592 The Dance Among Thorns"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1761.0.2. †D1761.0.2. Limited number of wishes granted."}">\n<div>\n<div>D1761.0.2. †D1761.0.2. Limited number of wishes granted.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"592 The Dance Among Thorns"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1653.1.7. †D1653.1.7. Infallible gun."}" style="width:631px;">D1653.1.7. †D1653.1.7. Infallible gun.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"592 The Dance Among Thorns"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1415.2.5. †D1415.2.5. Magic fiddle causes dancing."}">\n<div>\n<div>D1415.2.5. †D1415.2.5. Magic fiddle causes dancing.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"592 The Dance Among Thorns"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"N55. †N55. Shooting contest on wager."}" style="width:631px;">N55. †N55. Shooting contest on wager.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"592 The Dance Among Thorns"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"K551.3.1. †K551.3.1. Respite from death while one plays the fiddle"}">\n<div>\n<div>K551.3.1. †K551.3.1. Respite from death while one plays the fiddle</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"592 The Dance Among Thorns"}" height="21" style="height:21px;">ATU 593 Fiddevav</td>\n<td>AT 593 Blowing Up the Fire</td>\n<td data-sheets-value="{"1":2,"2":"N825.3. †N825.3. Old woman helper."}" style="width:631px;">N825.3. †N825.3. Old woman helper.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"593 Fiddevav"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1413. †D1413. Magic object holds person fast."}" style="width:631px;">D1413. †D1413. Magic object holds person fast.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"593 Fiddevav"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1413.17. †D1413.17. Magic adhesive stone."}" style="width:631px;">D1413.17. †D1413.17. Magic adhesive stone.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"593 Fiddevav"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D2172.1. †D2172.1. Magic repetition."}" style="width:631px;">D2172.1. †D2172.1. Magic repetition.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"593 Fiddevav"}" height="21" style="height:21px;">ATU 610 The Healing Fruit</td>\n<td>AT 610 Fruit to Cure the Princess</td>\n<td data-sheets-value="{"1":2,"2":"T68.1. †T68.1. Princess offered as prize to rescuer."}">\n<div>\n<div>T68.1. †T68.1. Princess offered as prize to rescuer.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H346. †H346. Princess given to man who can heal her."}">\n<div>\n<div>H346. †H346. Princess given to man who can heal her.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"Q2. †Q2. Kind and unkind."}" style="width:631px;">Q2. †Q2. Kind and unkind.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"L13. †L13. Compassionate youngest son."}" style="width:631px;">L13. †L13. Compassionate youngest son.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"L10. †L10. Victorious youngest son."}" style="width:631px;">L10. †L10. Victorious youngest son.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"N825.3. †N825.3. Old woman helper."}" style="width:631px;">N825.3. †N825.3. Old woman helper.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1500.1.5. †D1500.1.5. Magic healing fruit."}" style="width:631px;">D1500.1.5. †D1500.1.5. Magic healing fruit.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H1010. †H1010. Impossible tasks"}" style="width:631px;">H1010. †H1010. Impossible tasks</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1533.1.1. †D1533.1.1. Magic land and water ship."}">\n<div>\n<div>D1533.1.1. †D1533.1.1. Magic land and water ship.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H1112. †H1112. Task: herding rabbits."}" style="width:631px;">H1112. †H1112. Task: herding rabbits.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H1331.1.2.1. †H1331.1.2.1. Quest for remarkable bird feathers."}">\n<div>\n<div>H1331.1.2.1. †H1331.1.2.1. Quest for remarkable bird feathers.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"L161. †L161. Lowly hero marries princess"}" style="width:631px;">L161. †L161. Lowly hero marries princess</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"610 The Healing Fruit"}" height="21" style="height:21px;">ATU 612 The Three Snake-Leaves</td>\n<td>AT 612 The Three Snake-Leaves</td>\n<td data-sheets-value="{"1":2,"2":"M254. †M254. Promise to be buried with wife if she dies first."}">\n<div>\n<div>M254. †M254. Promise to be buried with wife if she dies first.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"B512. †B512. Medicine shown by animal."}" style="width:631px;">B512. †B512. Medicine shown by animal.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"B491.1. †B491.1. Helpful serpent."}" style="width:631px;">B491.1. †B491.1. Helpful serpent.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1500.1.4. †D1500.1.4. Magic healing plant."}" style="width:631px;">D1500.1.4. †D1500.1.4. Magic healing plant.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"E165. †E165. Resuscitation of wife by husband giving up half his remaining life."}">\n<div>\n<div>E165. †E165. Resuscitation of wife by husband giving up half his remaining life.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"K2213.5. †K2213.5. The faithless resuscitated wife"}">\n<div>\n<div>K2213.5. †K2213.5. The faithless resuscitated wife</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"K2213.2. †K2213.2. Faithless wife and paramour throw hero overboard."}">\n<div>\n<div>K2213.2. †K2213.2. Faithless wife and paramour throw hero overboard.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"S142. †S142. Person thrown into the water and abandoned."}">\n<div>\n<div>S142. †S142. Person thrown into the water and abandoned.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"P361. †P361. Faithful servant."}" style="width:631px;">P361. †P361. Faithful servant.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"E105. †E105. Resuscitation by herbs (leaves)."}" style="width:631px;">E105. †E105. Resuscitation by herbs (leaves).</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"612 The Three Snake-Leaves"}" height="21" style="height:21px;">ATU 613 The Two Travelers</td>\n<td>AT 613 The Two Travelers: Truth and Falsehood</td>\n<td data-sheets-value="{"1":2,"2":"N61. †N61. Wager that falsehood is better than truth."}">\n<div>\n<div>N61. †N61. Wager that falsehood is better than truth.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"K451.1. †K451.1. Unjust umpire decides a religious dispute"}">\n<div>\n<div>K451.1. †K451.1. Unjust umpire decides a religious dispute</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"N92. †N92. Wager to win or lose according to whether jackal howls or ass brays before game is finished."}">\n<div>\n<div>N92. †N92. Wager to win or lose according to whether jackal howls or ass brays before game is finished.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"N2.3.3. †N2.3.3. Eyes wagered."}" style="width:631px;">N2.3.3. †N2.3.3. Eyes wagered.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"M225. †M225. Eyes exchanged for food."}" style="width:631px;">M225. †M225. Eyes exchanged for food.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"F1045. †F1045. Night spent in tree."}" style="width:631px;">F1045. †F1045. Night spent in tree.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"N451.1. †N451.1. Secrets of animals (demons) accidentally overheard from tree (bridge) hiding place."}">\n<div>\n<div>N451.1. †N451.1. Secrets of animals (demons) accidentally overheard from tree (bridge) hiding place.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D1505.5. †D1505.5. Magic water restores sight."}" style="width:631px;">D1505.5. †D1505.5. Magic water restores sight.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"D2064.1. †D2064.1. Magic sickness because girl has thrown away her consecrated wafer."}">\n<div>\n<div>D2064.1. †D2064.1. Magic sickness because girl has thrown away her consecrated wafer.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"F933.2. †F933.2. Dry spring restored by removal of certain stone."}">\n<div>\n<div>F933.2. †F933.2. Dry spring restored by removal of certain stone.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H346. †H346. Princess given to man who can heal her."}">\n<div>\n<div>H346. †H346. Princess given to man who can heal her.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"N471. †N471. Foolish attempt of second man to overhear secrets (from animals, demons etc.)."}">\n<div>\n<div>N471. †N471. Foolish attempt of second man to overhear secrets (from animals, demons etc.).</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"613 The Two Travelers"}" height="21" style="height:21px;">ATU 857 Louse-Skin</td>\n<td>AT 621 The Louse-Skin (now ATU 857)</td>\n<td data-sheets-value="{"1":2,"2":"B873.1. †B873.1. Giant louse."}" style="width:631px;">B873.1. †B873.1. Giant louse.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"857 Louse-Skin"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"F983.2. †F983.2. Louse fattened."}" style="width:631px;">F983.2. †F983.2. Louse fattened.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"857 Louse-Skin"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H551. †H551. Princess offered to man who can out-riddle her."}" style="width:631px;">H551. †H551. Princess offered to man who can out-riddle her.</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"857 Louse-Skin"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H522.1. †H522.1. Test: guessing origin of certain skin."}">\n<div>\n<div>H522.1. †H522.1. Test: guessing origin of certain skin.</div>\n</div>\n</td>\n</tr>\n<tr height="21">\n<td data-sheets-value="{"1":2,"2":"857 Louse-Skin"}" height="21" style="height:21px;">\xa0</td>\n<td>\xa0</td>\n<td data-sheets-value="{"1":2,"2":"H573.3. †H573.3. Riddle solved by listening to propounder talk in his sleep."}">\n<div>\n<div>H573.3. †H573.3. Riddle solved by listening to propounder talk in his sleep.</div>\n</div>\n</td>\n</tr>\n</tbody>\n</table>'
We can parse that table into a pandas dataframe:
import pandas as pd
#Parse HTML table as dataframe
pd.read_html(example_table.html)[0]
0 | 1 | 2 | |
---|---|---|---|
0 | ATU Classification Type | AT Classification Type | Thompson Motif |
1 | NaN | NaN | NaN |
2 | ATU 571B Lover Exposed | AT 571B Himphamp | H1010. †H1010. Impossible tasks |
3 | NaN | NaN | D812.3. †D812.3. Magic object received from de... |
4 | NaN | NaN | D1413.8. †D1413.8. Chamber-pot to which one st... |
... | ... | ... | ... |
83 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. |
84 | NaN | NaN | F983.2. †F983.2. Louse fattened. |
85 | NaN | NaN | H551. †H551. Princess offered to man who can o... |
86 | NaN | NaN | H522.1. †H522.1. Test: guessing origin of cert... |
87 | NaN | NaN | H573.3. †H573.3. Riddle solved by listening to... |
88 rows × 3 columns
The first table in the page has a structure distinct from all the rest. In particular, it includes some blank lines and the column headers.
We could parse out the column headers, or we could set them explicitly. For conveniences, let’s set them explicilty, then remove them if they appear in the table.
For the first table in the page, we can then tidy up the dataframe by dropping at least the first three lines (the index count strats at 0), as well as dropping empty rows, resetting the index to account for that, and replacing null values with an empty string:
table_df = pd.read_html(example_table.html, skiprows=3)[0]
# Do some cleaning
table_df = table_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')
colnames = ["ATU Classification Type", "AT Classification Type", "Thompson Motif"]
#Explicitly set column names
table_df.columns = colnames
# Remove column names if they were in the table and reset the indeex
table_df = table_df[table_df["ATU Classification Type"]!="ATU Classification Type"].reset_index(drop=True)
table_df
ATU Classification Type | AT Classification Type | Thompson Motif | |
---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | ||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | ||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing |
3 | D1083. †D1083. Magic knife. | ||
4 | D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... | ||
... | ... | ... | ... |
80 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. |
81 | F983.2. †F983.2. Louse fattened. | ||
82 | H551. †H551. Princess offered to man who can o... | ||
83 | H522.1. †H522.1. Test: guessing origin of cert... | ||
84 | H573.3. †H573.3. Riddle solved by listening to... |
85 rows × 3 columns
We notice that sometimes there are blanks in the ATU and AT columns, the assumption being that the values are inherited from above. We can explicitly fill down these values:
table_df = table_df.replace("", None).fillna(method="ffill")
table_df
ATU Classification Type | AT Classification Type | Thompson Motif | |
---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | ||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | ||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing |
3 | ATU 576 Magic Knife | AT 576 Magic Knife | D1083. †D1083. Magic knife. |
4 | ATU 576 Magic Knife | AT 576 Magic Knife | D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... |
... | ... | ... | ... |
80 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. |
81 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | F983.2. †F983.2. Louse fattened. |
82 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H551. †H551. Princess offered to man who can o... |
83 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H522.1. †H522.1. Test: guessing origin of cert... |
84 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H573.3. †H573.3. Riddle solved by listening to... |
85 rows × 3 columns
We can also add in the category:
table_df["category"] = category
table_df
ATU Classification Type | AT Classification Type | Thompson Motif | category | |
---|---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | Tales of Magic - Magic Object 560 - 571 | ||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | Tales of Magic - Magic Object 560 - 571 | ||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing | Tales of Magic - Magic Object 560 - 571 |
3 | ATU 576 Magic Knife | AT 576 Magic Knife | D1083. †D1083. Magic knife. | Tales of Magic - Magic Object 560 - 571 |
4 | ATU 576 Magic Knife | AT 576 Magic Knife | D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... | Tales of Magic - Magic Object 560 - 571 |
... | ... | ... | ... | ... |
80 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. | Tales of Magic - Magic Object 560 - 571 |
81 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | F983.2. †F983.2. Louse fattened. | Tales of Magic - Magic Object 560 - 571 |
82 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H551. †H551. Princess offered to man who can o... | Tales of Magic - Magic Object 560 - 571 |
83 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H522.1. †H522.1. Test: guessing origin of cert... | Tales of Magic - Magic Object 560 - 571 |
84 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H573.3. †H573.3. Riddle solved by listening to... | Tales of Magic - Magic Object 560 - 571 |
85 rows × 4 columns
For the other tables, we don’t need to drop any rows and we can recycle the column names, retaining the empty row dropper and null value cleaner just in case.
Some of the other tables may have column names, others may not. So let’s always add column names explicitly, then drop any row that appears to duplicate them:
# Work with the second table in a page...
example_df2 = pd.read_html(boxes[1].find('table')[0].html)[0].dropna(axis=0, how='all').reset_index().fillna('')
example_df2.columns = table_df.columns
# Remove column names if they were in the table
example_df2 = example_df2[example_df2["ATU Classification Type"]!="ATU Classification Type"]
example_df2
ATU Classification Type | AT Classification Type | Thompson Motif | category | |
---|---|---|---|---|
0 | 0 | ATU Classification Type | AT Classification Type | Thompson Motif |
1 | 2 | ATU 571B Lover Exposed | AT 571B Himphamp | H1010. †H1010. Impossible tasks |
2 | 3 | D812.3. †D812.3. Magic object received from de... | ||
3 | 4 | D1413.8. †D1413.8. Chamber-pot to which one st... | ||
4 | 5 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing |
... | ... | ... | ... | ... |
82 | 83 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. |
83 | 84 | F983.2. †F983.2. Louse fattened. | ||
84 | 85 | H551. †H551. Princess offered to man who can o... | ||
85 | 86 | H522.1. †H522.1. Test: guessing origin of cert... | ||
86 | 87 | H573.3. †H573.3. Riddle solved by listening to... |
87 rows × 4 columns
Some of the tables include “metadata” rows rather than splitting things into separate pages.
response = session.get(index_links[5])
# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = response.html.find('.s-lib-box')
borked_df = pd.read_html(boxes[0].find('table')[0].html, skiprows=3)[0]
borked_df = borked_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')
borked_df.columns = colnames
borked_df.iloc[10:25]
ATU Classification Type | AT Classification Type | Thompson Motif | |
---|---|---|---|
10 | ATU 1008 Lighting the Road | AT 1008 Lighting the Way by Setting the Barn o... | K1412. Lighting the road (or painting the hous... |
11 | ATU 1009 Guarding the Store-room Door | AT 1009 Guarding the Door by Carrying it Away | K1413. Guarding the door. |
12 | ATU 1010 Repairing the House | AT 1010 Fixing Things Up | K1415. Repairing the house |
13 | ATU 1011 Tearing up the Orchard (Vineyard) | AT 1011 Cutting Down the Orchard | K1416. Tearing up the orchard (vineyard). |
14 | ATU 1012 Cleaning the Child | AT 1012 Cleaning the Master's Child | K1461.1. Cleaning the child. |
15 | ATU 1013 Bathing (Warming) Grandmother | AT 1013 Bathing the Grandmother in Boiling Water | K1462. Washing the grandmother -- in boiling w... |
16 | ATU 1029 The woman as Cuckoo in the Tree | AT 1029 The Bargain Ends When the Cuckoo Sings | K1691. The woman as cuckoo on the tree shot down. |
17 | _____________ | _____________ | Partnership between Man and Ogre 1030 - 1059 |
18 | ATU 1030 The Crop Division | AT 1030 Man and Ogre Share the Harvest | K171.1. Deceptive crop division: above the gro... |
19 | ATU 1036 Hogs with Curly Tails | AT 1036 Hogs with Curly Tails | K171.4. Deceptive division of pigs: curly and ... |
20 | ATU 1045 Pulling the Lake Together | AT 1045 Pulling Down the Forest with a Rope | K1744. Hero threatens to pull the lake togethe... |
21 | ATU 1049 The heavy axe | AT 1049 The heavy Axe or the Giant Bucket | K1741.1. Felling the whole forest. |
22 | K1741.3. Bringing the whole well. | ||
23 | ATU 1050 Felling trees | AT 1050 Contest in Felling Trees | K44. Deceptive contest in chopping. |
24 | K178. Deceptive bargain: felling the tree. |
We can split the table on these “headers” and use the new heading as the category:
# Find index of split rows
borked_df[borked_df["ATU Classification Type"].str.startswith("__")]
ATU Classification Type | AT Classification Type | Thompson Motif | |
---|---|---|---|
17 | _____________ | _____________ | Partnership between Man and Ogre 1030 - 1059 |
29 | _____________ | _____________ | Contest between Man and Ogre 1060 - 1114 |
53 | _____________ | _____________ | Man Kills (Injures) Ogre 1115 - 1144 |
72 | _____________ | _____________ | Ogre Frightened by Man 1145 - 1154 |
80 | _____________ | _____________ | Man Outwits the Devil 1155 - 1169 |
92 | _____________ | _____________ | Souls Saved from the Devil 1170 - 1199 |
If we get the index values of these we can iteratively split the dataframe:
split_indexes = borked_df[borked_df["ATU Classification Type"].str.startswith("__")].index.to_list()
split_indexes
split_dfs = []
prev_ix = 0
for ix in split_indexes:
tmp_df = borked_df.iloc[prev_ix:ix, :].reset_index(drop=True)
split_dfs.append(tmp_df)
prev_ix = ix
# Remaining part:
tmp_df = borked_df.iloc[prev_ix:, :].reset_index(drop=True)
split_dfs.append(tmp_df)
The first row of each of the split dataframes contains the new category:
split_dfs[2].iloc[0]
ATU Classification Type _____________
AT Classification Type _____________
Thompson Motif Contest between Man and Ogre 1060 - 1114
Name: 0, dtype: object
We can process the dataframes by extracting the category, updating the dataframe with it, and dropping the metadata row:
for split_df in split_dfs[1:]:
category = split_df.iloc[0, 2]
split_df["category"] = category
# Because we are modifying the dataframes as referenced objects, we change them inplace
split_df.drop(0, inplace=True)
split_df.reset_index(drop=True, inplace=True)
split_dfs[1].head()
ATU Classification Type | AT Classification Type | Thompson Motif | category | |
---|---|---|---|---|
0 | ATU 1030 The Crop Division | AT 1030 Man and Ogre Share the Harvest | K171.1. Deceptive crop division: above the gro... | Partnership between Man and Ogre 1030 - 1059 |
1 | ATU 1036 Hogs with Curly Tails | AT 1036 Hogs with Curly Tails | K171.4. Deceptive division of pigs: curly and ... | Partnership between Man and Ogre 1030 - 1059 |
2 | ATU 1045 Pulling the Lake Together | AT 1045 Pulling Down the Forest with a Rope | K1744. Hero threatens to pull the lake togethe... | Partnership between Man and Ogre 1030 - 1059 |
3 | ATU 1049 The heavy axe | AT 1049 The heavy Axe or the Giant Bucket | K1741.1. Felling the whole forest. | Partnership between Man and Ogre 1030 - 1059 |
4 | K1741.3. Bringing the whole well. | Partnership between Man and Ogre 1030 - 1059 |
We can split out each row into more detail:
def ATU_splitter(cell):
"""Split out ATU Classification type elements."""
if cell:
parts = cell.split()
return pd.Series({#"ATU_code":" ".join(parts[:2]),
"ATU_code": parts[1],
"ATU_label": " ".join(parts[2:])})
return pd.Series({"ATU_code":'', "ATU_label":''})
table_df[["ATU_code","ATU_label"]]= table_df["ATU Classification Type"].apply(ATU_splitter)
table_df
ATU Classification Type | AT Classification Type | Thompson Motif | category | ATU_code | ATU_label | |
---|---|---|---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | Tales of Magic - Magic Object 560 - 571 | ||||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | Tales of Magic - Magic Object 560 - 571 | ||||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife |
3 | ATU 576 Magic Knife | AT 576 Magic Knife | D1083. †D1083. Magic knife. | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife |
4 | ATU 576 Magic Knife | AT 576 Magic Knife | D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife |
... | ... | ... | ... | ... | ... | ... |
80 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin |
81 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | F983.2. †F983.2. Louse fattened. | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin |
82 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H551. †H551. Princess offered to man who can o... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin |
83 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H522.1. †H522.1. Test: guessing origin of cert... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin |
84 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H573.3. †H573.3. Riddle solved by listening to... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin |
85 rows × 6 columns
def AT_splitter(cell):
"""Split out AT Classification type elements."""
if cell:
parts = cell.split()
return pd.Series({#"AT_code":" ".join(parts[:2]),
"AT_code": parts[1],
"AT_label": " ".join(parts[2:])})
return pd.Series({"AT_code":'', "AT_label":''})
table_df[["AT_code","AT_label"]]= table_df["AT Classification Type"].apply(AT_splitter)
table_df
ATU Classification Type | AT Classification Type | Thompson Motif | category | ATU_code | ATU_label | AT_code | AT_label | |
---|---|---|---|---|---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | Tales of Magic - Magic Object 560 - 571 | ||||||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | Tales of Magic - Magic Object 560 - 571 | ||||||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife |
3 | ATU 576 Magic Knife | AT 576 Magic Knife | D1083. †D1083. Magic knife. | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife |
4 | ATU 576 Magic Knife | AT 576 Magic Knife | D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife |
... | ... | ... | ... | ... | ... | ... | ... | ... |
80 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) |
81 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | F983.2. †F983.2. Louse fattened. | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) |
82 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H551. †H551. Princess offered to man who can o... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) |
83 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H522.1. †H522.1. Test: guessing origin of cert... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) |
84 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H573.3. †H573.3. Riddle solved by listening to... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) |
85 rows × 8 columns
def Thompson_Motif_splitter(cell):
"""Split out Thompson Motif elements."""
if cell:
parts = cell.split()
return pd.Series({"Motif_code":" ".join(parts[:1]).strip("."), "Motif_label": " ".join(parts[1:])})
return pd.Series({"Motif_code":'', "Motif_label":''})
table_df[["Motif_code","Motif_label"]]= table_df["Thompson Motif"].apply(Thompson_Motif_splitter)
table_df
ATU Classification Type | AT Classification Type | Thompson Motif | category | ATU_code | ATU_label | AT_code | AT_label | Motif_code | Motif_label | |
---|---|---|---|---|---|---|---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | Tales of Magic - Magic Object 560 - 571 | D812.3 | †D812.3. Magic object received from devil. | ||||||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | Tales of Magic - Magic Object 560 - 571 | D1413.8 | †D1413.8. Chamber-pot to which one sticks. | ||||||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife | D838 | †D838. Magic object acquired by stealing |
3 | ATU 576 Magic Knife | AT 576 Magic Knife | D1083. †D1083. Magic knife. | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife | D1083 | †D1083. Magic knife. |
4 | ATU 576 Magic Knife | AT 576 Magic Knife | D1400.1.4.3. †D1400.1.4.3. Magic knife conquer... | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife | D1400.1.4.3 | †D1400.1.4.3. Magic knife conquers enemy. |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
80 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | B873.1. †B873.1. Giant louse. | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) | B873.1 | †B873.1. Giant louse. |
81 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | F983.2. †F983.2. Louse fattened. | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) | F983.2 | †F983.2. Louse fattened. |
82 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H551. †H551. Princess offered to man who can o... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) | H551 | †H551. Princess offered to man who can out-rid... |
83 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H522.1. †H522.1. Test: guessing origin of cert... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) | H522.1 | †H522.1. Test: guessing origin of certain skin. |
84 | ATU 857 Louse-Skin | AT 621 The Louse-Skin (now ATU 857) | H573.3. †H573.3. Riddle solved by listening to... | Tales of Magic - Magic Object 560 - 571 | 857 | Louse-Skin | 621 | The Louse-Skin (now ATU 857) | H573.3 | †H573.3. Riddle solved by listening to propoun... |
85 rows × 10 columns
We can add those codes to a simple database. Let’s create the database first:
from sqlite_utils import Database
db_name = "tale_types_demo.db"
# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)
db["tale_types"].delete_where()
table_df.to_sql("tale_types", index=False, con=db.conn)
/usr/local/lib/python3.9/site-packages/pandas/core/generic.py:2872: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
sql.to_sql(
Try a test query:
from pandas import read_sql
q = "SELECT * FROM tale_types LIMIT 3"
read_sql(q, db.conn)
ATU Classification Type | AT Classification Type | Thompson Motif | category | ATU_code | ATU_label | AT_code | AT_label | Motif_code | Motif_label | |
---|---|---|---|---|---|---|---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | Tales of Magic - Magic Object 560 - 571 | D812.3 | †D812.3. Magic object received from devil. | ||||||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | Tales of Magic - Magic Object 560 - 571 | D1413.8 | †D1413.8. Chamber-pot to which one sticks. | ||||||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife | D838 | †D838. Magic object acquired by stealing |
We can create a simple function to scrape all the tables on the page, process them and add them to the database.
We can also add the page title as a column entry.
def process_dataframe1(table_df, category):
"""Process the """
#Explicitly set column names
table_df.columns = ["ATU Classification Type", "AT Classification Type", "Thompson Motif"]
# Remove column names if they were in the table
table_df = table_df[table_df["ATU Classification Type"]!="ATU Classification Type"].reset_index(drop=True)
table_df['category'] = category
return table_df
def patch_motif_label(cell):
"""Fix borked motif label."""
if cell and cell.startswith("†"):
parts = cell.split()
return pd.Series({"Motif_label": " ".join(parts[1:])})
return pd.Series({"Motif_label":''})
def process_dataframe2(table_df):
"""Process the """
# Fill down on blanks
table_df = table_df.replace("", None).fillna(method="ffill")
table_df[["ATU_code","ATU_label"]]= table_df["ATU Classification Type"].apply(ATU_splitter)
table_df[["AT_code","AT_label"]]= table_df["AT Classification Type"].apply(AT_splitter)
table_df[["Motif_code","Motif_label"]]= table_df["Thompson Motif"].apply(Thompson_Motif_splitter)
# There may be some cruft in Motif_label
table_df[["Motif_label"]]= table_df["Motif_label"].apply(patch_motif_label)
return table_df
def scrape_page_tables(url):
"""Scrape all the tables on the page."""
response = session.get(url)
# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = response.html.find('.s-lib-box')
tables = []
for i, box in enumerate(boxes):
category = box.find('h2')[0].text
# Some pages have cruft boxes with no tables
_tables = box.find('table')
if not _tables:
continue
table = _tables[0]
if i==0:
# There may be cruft empty tables at the top of the page
try:
table_df = pd.read_html(table.html, skiprows=3)[0]
table_df = table_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')
except:
continue
else:
table_df = pd.read_html(table.html)[0]
table_df = table_df.dropna(axis=0, how='all').reset_index(drop=True).fillna('')
table_df = process_dataframe1(table_df, category)
# Some dataframes are actually multiple dataframes
split_indexes = table_df[table_df["ATU Classification Type"].str.startswith("__")].index.to_list()
if split_indexes:
split_dfs = []
prev_ix = 0
for ix in split_indexes:
tmp_df = table_df.iloc[prev_ix:ix, :].reset_index(drop=True)
split_dfs.append(tmp_df)
prev_ix = ix
# Remaining part:
tmp_df = table_df.iloc[prev_ix:, :].reset_index(drop=True)
split_dfs.append(tmp_df)
for split_df in split_dfs[1:]:
partial_category = split_df.iloc[0, 2]
split_df["category"] = partial_category
split_df = split_df.drop(0).reset_index(drop=True)
split_df = process_dataframe2(split_df)
tables.append(split_df)
else:
table_df = process_dataframe2(table_df)
tables.append(table_df)
return tables
index_links[5]
'https://libraryguides.missouri.edu/c.php?g=1039894&p=7624305'
Let’s try it out:
scrape_page_tables(index_links[5])[0]
ATU Classification Type | AT Classification Type | Thompson Motif | category | ATU_code | ATU_label | AT_code | AT_label | Motif_code | Motif_label | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ATU 1030 The Crop Division | AT 1030 Man and Ogre Share the Harvest | K171.1. Deceptive crop division: above the gro... | Partnership between Man and Ogre 1030 - 1059 | 1030 | The Crop Division | 1030 | Man and Ogre Share the Harvest | K171.1 | |
1 | ATU 1036 Hogs with Curly Tails | AT 1036 Hogs with Curly Tails | K171.4. Deceptive division of pigs: curly and ... | Partnership between Man and Ogre 1030 - 1059 | 1036 | Hogs with Curly Tails | 1036 | Hogs with Curly Tails | K171.4 | |
2 | ATU 1045 Pulling the Lake Together | AT 1045 Pulling Down the Forest with a Rope | K1744. Hero threatens to pull the lake togethe... | Partnership between Man and Ogre 1030 - 1059 | 1045 | Pulling the Lake Together | 1045 | Pulling Down the Forest with a Rope | K1744 | |
3 | ATU 1049 The heavy axe | AT 1049 The heavy Axe or the Giant Bucket | K1741.1. Felling the whole forest. | Partnership between Man and Ogre 1030 - 1059 | 1049 | The heavy axe | 1049 | The heavy Axe or the Giant Bucket | K1741.1 | |
4 | ATU 1049 The heavy axe | AT 1049 The heavy Axe or the Giant Bucket | K1741.3. Bringing the whole well. | Partnership between Man and Ogre 1030 - 1059 | 1049 | The heavy axe | 1049 | The heavy Axe or the Giant Bucket | K1741.3 | |
5 | ATU 1050 Felling trees | AT 1050 Contest in Felling Trees | K44. Deceptive contest in chopping. | Partnership between Man and Ogre 1030 - 1059 | 1050 | Felling trees | 1050 | Contest in Felling Trees | K44 | |
6 | ATU 1050 Felling trees | AT 1050 Contest in Felling Trees | K178. Deceptive bargain: felling the tree. | Partnership between Man and Ogre 1030 - 1059 | 1050 | Felling trees | 1050 | Contest in Felling Trees | K178 | |
7 | ATU 1050 Felling trees | AT 1050 Contest in Felling Trees | K1421. Clearing land: axe broken. | Partnership between Man and Ogre 1030 - 1059 | 1050 | Felling trees | 1050 | Contest in Felling Trees | K1421 | |
8 | ATU 1051 Bending a Tree | AT 1051 Springing with a Bent Tree | K1112. Bending the tree. | Partnership between Man and Ogre 1030 - 1059 | 1051 | Bending a Tree | 1051 | Springing with a Bent Tree | K1112 | |
9 | ATU 1052 Carrying a Tree | AT 1052 A Contest in Carrying a Tree | K71. Deceptive contest in carrying a tree: rid... | Partnership between Man and Ogre 1030 - 1059 | 1052 | Carrying a Tree | 1052 | A Contest in Carrying a Tree | K71 | |
10 | ATU 1053 Shooting Wild Boars | AT 1053 Shooting Wild Boars | K1741.2. A thousand at one shot. | Partnership between Man and Ogre 1030 - 1059 | 1053 | Shooting Wild Boars | 1053 | Shooting Wild Boars | K1741.2 |
We can now scrape the whole collection:
from tqdm.notebook import tqdm
# Note that this assumes a globally defined requests session and database
def scrape_pages(url="https://libraryguides.missouri.edu/c.php?g=1039894&p=7609090",
db_table="tale_types"):
"""Scrape all tale type lookup pages."""
index_response = session.get(url)
# Get links for lookup pages
link_elements = response.html.find('.nav-tabs')[0].find("li")[1:-1]
index_links = [el.xpath("//a/@href")[0] for el in link_elements]
# Parse each page
for index_link in tqdm(index_links):
tables = scrape_page_tables(index_link)
# Add to database
# Note that we need to ensure we append the data to an existing table
for table in tables:
table.to_sql(db_table, index=False, if_exists="append", con=db.conn)
# Clear out the table
#db["tale_types"].drop()
db["tale_links"].delete_where()
# And scrape everything
scrape_pages()
/usr/local/lib/python3.9/site-packages/pandas/core/generic.py:2872: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
sql.to_sql(
q = "SELECT * FROM tale_types LIMIT 3"
read_sql(q, db.conn)
ATU Classification Type | AT Classification Type | Thompson Motif | category | ATU_code | ATU_label | AT_code | AT_label | Motif_code | Motif_label | |
---|---|---|---|---|---|---|---|---|---|---|
0 | D812.3. †D812.3. Magic object received from de... | Tales of Magic - Magic Object 560 - 571 | D812.3 | †D812.3. Magic object received from devil. | ||||||
1 | D1413.8. †D1413.8. Chamber-pot to which one st... | Tales of Magic - Magic Object 560 - 571 | D1413.8 | †D1413.8. Chamber-pot to which one sticks. | ||||||
2 | ATU 576 Magic Knife | AT 576 Magic Knife | D838. †D838. Magic object acquired by stealing | Tales of Magic - Magic Object 560 - 571 | 576 | Magic Knife | 576 | Magic Knife | D838 | †D838. Magic object acquired by stealing |
q = "SELECT COUNT(*) AS numrows FROM tale_types"
read_sql(q, db.conn)
numrows | |
---|---|
0 | 2311 |
q = "SELECT * FROM tale_types WHERE ATU_code=610"
read_sql(q, db.conn)
ATU Classification Type | AT Classification Type | Thompson Motif | category | ATU_code | ATU_label | AT_code | AT_label | Motif_code | Motif_label | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | T68.1. †T68.1. Princess offered as prize to re... | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | T68.1 | †T68.1. Princess offered as prize to rescuer. |
1 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H346. †H346. Princess given to man who can hea... | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H346 | †H346. Princess given to man who can heal her. |
2 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | Q2. †Q2. Kind and unkind. | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | Q2 | †Q2. Kind and unkind. |
3 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | L13. †L13. Compassionate youngest son. | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | L13 | †L13. Compassionate youngest son. |
4 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | L10. †L10. Victorious youngest son. | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | L10 | †L10. Victorious youngest son. |
5 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | N825.3. †N825.3. Old woman helper. | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | N825.3 | †N825.3. Old woman helper. |
6 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | D1500.1.5. †D1500.1.5. Magic healing fruit. | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | D1500.1.5 | †D1500.1.5. Magic healing fruit. |
7 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H1010. †H1010. Impossible tasks | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H1010 | †H1010. Impossible tasks |
8 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | D1533.1.1. †D1533.1.1. Magic land and water ship. | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | D1533.1.1 | †D1533.1.1. Magic land and water ship. |
9 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H1112. †H1112. Task: herding rabbits. | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H1112 | †H1112. Task: herding rabbits. |
10 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H1331.1.2.1. †H1331.1.2.1. Quest for remarkabl... | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H1331.1.2.1 | †H1331.1.2.1. Quest for remarkable bird feathers. |
11 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | L161. †L161. Lowly hero marries princess | Tales of Magic - Magic Object 560 - 571 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | L161 | †L161. Lowly hero marries princess |
12 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | T68.1. †T68.1. Princess offered as prize to re... | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | T68.1 | Princess offered as prize to rescuer. |
13 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H346. †H346. Princess given to man who can hea... | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H346 | Princess given to man who can heal her. |
14 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | Q2. †Q2. Kind and unkind. | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | Q2 | Kind and unkind. |
15 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | L13. †L13. Compassionate youngest son. | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | L13 | Compassionate youngest son. |
16 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | L10. †L10. Victorious youngest son. | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | L10 | Victorious youngest son. |
17 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | N825.3. †N825.3. Old woman helper. | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | N825.3 | Old woman helper. |
18 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | D1500.1.5. †D1500.1.5. Magic healing fruit. | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | D1500.1.5 | Magic healing fruit. |
19 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H1010. †H1010. Impossible tasks | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H1010 | Impossible tasks |
20 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | D1533.1.1. †D1533.1.1. Magic land and water ship. | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | D1533.1.1 | Magic land and water ship. |
21 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H1112. †H1112. Task: herding rabbits. | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H1112 | Task: herding rabbits. |
22 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | H1331.1.2.1. †H1331.1.2.1. Quest for remarkabl... | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | H1331.1.2.1 | Quest for remarkable bird feathers. |
23 | ATU 610 The Healing Fruit | AT 610 Fruit to Cure the Princess | L161. †L161. Lowly hero marries princess | Tales of Magic - Magic Object 571B - 649 | 610 | The Healing Fruit | 610 | Fruit to Cure the Princess | L161 | Lowly hero marries princess |
Story Scraper#
The Missouri Libraries website also maintain a collection of stories referenced by AT and ATU categorised tale types: https://libraryguides.missouri.edu/c.php?g=1083510
tales_index_url = "https://libraryguides.missouri.edu/c.php?g=1083510"
session = HTMLSession()
tales_index_response = session.get(tales_index_url)
We can get links for the separate index pages as we did before:
tales_link_elements = tales_index_response.html.find('.nav-tabs')[0].find("li")[:-2]
tales_index_links = [el.xpath("//a/@href")[0] for el in tales_link_elements]
tales_index_links
['https://libraryguides.missouri.edu/c.php?g=1083510&p=7898512',
'https://libraryguides.missouri.edu/c.php?g=1083510&p=7901911',
'https://libraryguides.missouri.edu/c.php?g=1083510&p=7913834',
'https://libraryguides.missouri.edu/c.php?g=1083510&p=7916779',
'https://libraryguides.missouri.edu/c.php?g=1083510&p=7916818',
'https://libraryguides.missouri.edu/c.php?g=1083510&p=7916872',
'https://libraryguides.missouri.edu/c.php?g=1083510&p=7917477']
There is just a single box per page although it may contain several tables:
tales_response = session.get(tales_index_links[1])
# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = tales_response.html.find('.s-lib-box')
boxes
[<Element 'div' id='s-lg-box-25061504' class=('s-lib-box', 's-lib-box-std')>]
As before, we can extract the box title as a category heading:
category = boxes[0].find('h2')[0].text
category
'ATU 300 - 559 Tales of Magic'
example_tales_tables = boxes[0].find('table')
example_tales_tables[:3]
[<Element 'table' border='1' cellpadding='0' cellspacing='0' dir='ltr' style='table-layout:fixed;font-size:10pt;font-family:Arial;width:0px;border-collapse:collapse;border:none' xmlns='http://www.w3.org/1999/xhtml'>,
<Element 'table' border='1' cellpadding='0' cellspacing='0' dir='ltr' style='table-layout:fixed;font-size:10pt;font-family:Arial;width:0px;border-collapse:collapse;border:none' xmlns='http://www.w3.org/1999/xhtml'>,
<Element 'table' border='1' cellpadding='0' cellspacing='0' dir='ltr' style='table-layout:fixed;font-size:10pt;font-family:Arial;width:0px;border-collapse:collapse;border:none' xmlns='http://www.w3.org/1999/xhtml'>]
Each table can be mapped to a dataframe, removing empty rows and filling down on the blanks.
The columns could be detected, but they are consistent and will need rolling over for multiple tables in a box, so we might as well set them explicitly and then tidy up.
example_df = pd.read_html(example_tales_tables[0].html)[0]
example_df = example_df.dropna(how='all').fillna(method='ffill')
example_df.columns = ["ATU Title", "AT", "Title", "Origin"]
# Remove column names if they were in the table and reset index
example_df = example_df[example_df["ATU Title"]!="ATU Title"].reset_index(drop=True)
# Tidy AT
example_df["AT"] = example_df["AT"].str.replace("*", "", regex=False)
# Tidy ATU Title
example_df["ATU Name"] = example_df["ATU Title"].apply(lambda x: " ".join(x.split()[2:]))
example_df.head()
ATU Title | AT | Title | Origin | ATU Name | |
---|---|---|---|---|---|
0 | ATU 300 The Dragon-Slayer | 300 | Three Brothers and the three Tasks | British | The Dragon-Slayer |
1 | ATU 300 The Dragon-Slayer | 300 | Three Dogs | German | The Dragon-Slayer |
2 | ATU 300 The Dragon-Slayer | 300 | Black Crow and the White Cheese | Egyptian | The Dragon-Slayer |
3 | ATU 300 The Dragon-Slayer | 300 | Five Counsels | Mexican | The Dragon-Slayer |
4 | ATU 300 The Dragon-Slayer | 300 | Monster's Hairs | Armenian | The Dragon-Slayer |
Also note that we need to parse out the link for the tale in the HTML table. The pandas .read_html()
function does not do this out of the can, although we can patch pandas to support it. Alternatively, we can grab the links from the table.
from bs4 import BeautifulSoup
def get_table_links(html):
"""Parse an HMTL table and extract links from it."""
table_soup = BeautifulSoup(html, 'lxml')
links = []
for row in table_soup.find_all('tr'):
for td in row.find_all('td'):
if td.find('a'):
links.append((td.a['href'], td.text))
link_df = pd.DataFrame(links, columns=["link", "Title"])
return link_df
link_df = get_table_links(example_tales_tables[0].html)
link_df.head()
link | Title | |
---|---|---|
0 | https://archive.org/details/dictionaryofbrit01... | Three Brothers and the three Tasks |
1 | https://archive.org/details/greenfairybook00la... | Three Dogs |
2 | https://archive.org/details/folktalesofegypt00... | Black Crow and the White Cheese |
3 | https://archive.org/details/folktalesofmexic00... | Five Counsels |
4 | https://archive.org/details/100armeniantales00... | Monster's Hairs |
Note that there may be duplicate story names within a table, so we will join by the index value.
example_df = example_df.join(link_df["link"])
example_df.head()
ATU Title | AT | Title | Origin | ATU Name | link | |
---|---|---|---|---|---|---|
0 | ATU 300 The Dragon-Slayer | 300 | Three Brothers and the three Tasks | British | The Dragon-Slayer | https://archive.org/details/dictionaryofbrit01... |
1 | ATU 300 The Dragon-Slayer | 300 | Three Dogs | German | The Dragon-Slayer | https://archive.org/details/greenfairybook00la... |
2 | ATU 300 The Dragon-Slayer | 300 | Black Crow and the White Cheese | Egyptian | The Dragon-Slayer | https://archive.org/details/folktalesofegypt00... |
3 | ATU 300 The Dragon-Slayer | 300 | Five Counsels | Mexican | The Dragon-Slayer | https://archive.org/details/folktalesofmexic00... |
4 | ATU 300 The Dragon-Slayer | 300 | Monster's Hairs | Armenian | The Dragon-Slayer | https://archive.org/details/100armeniantales00... |
To parse the page completely, we’d need to iterate over all the tables.
Let’s build a scraper for all the pages and all the tables in a page:
def scrape_tale_links(url="https://libraryguides.missouri.edu/c.php?g=1083510",
db_table="tale_links"):
"""Scrape all tale links."""
tales_index_response = session.get(url)
tales_link_elements = tales_index_response.html.find('.nav-tabs')[0].find("li")[:-2]
tales_index_links = [el.xpath("//a/@href")[0] for el in tales_link_elements]
for tales_page_link in tqdm(tales_index_links):
tales_response = session.get(tales_page_link)
# dot notation: . is class, # is id
#s-lib-box s-lib-box-std
boxes = tales_response.html.find('.s-lib-box')
category = boxes[0].find('h2')[0].text
tables = boxes[0].find('table')
for table in tables:
table_df = pd.read_html(table.html)[0]
table_df = table_df.dropna(how='all').reset_index(drop=True).fillna(method='ffill')
table_df.columns = ["ATU Title", "AT", "Title", "Origin"]
table_df["AT"] = table_df["AT"].astype(str)
table_df["ATU Title"] = table_df["AT"].astype(str)
table_df["category"] = category
# Remove column names if they were in the table
table_df = table_df[table_df["ATU Title"]!="ATU Title"]
# Tidy AT
table_df["AT"] = table_df["AT"].str.replace("*", "", regex=False)
# Tidy ATU Title
table_df["ATU Name"] = table_df["ATU Title"].apply(lambda x: " ".join(x.split()[2:]))
# Add links
link_df = get_table_links(example_tales_tables[0].html)
table_df = table_df.join(link_df["link"])
table_df.to_sql(db_table, if_exists="append", index=False, con=db.conn)
# Clear the db table (if it exists)
#db["tale_links"].drop()
db["tale_links"].delete_where()
scrape_tale_links()
/usr/local/lib/python3.9/site-packages/pandas/core/generic.py:2872: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
sql.to_sql(
q = "SELECT * FROM tale_links LIMIT 10"
read_sql(q, db.conn)
ATU Title | AT | Title | Origin | category | ATU Name | link | |
---|---|---|---|---|---|---|---|
0 | AT | AT | Title | Origin | ATU 1-299 Animal Tales | https://archive.org/details/dictionaryofbrit01... | |
1 | 1 | 1 | The Wolf and the Fox | French | ATU 1-299 Animal Tales | https://archive.org/details/greenfairybook00la... | |
2 | 1 | 1 | Fox in Inishkea | Irish | ATU 1-299 Animal Tales | https://archive.org/details/folktalesofegypt00... | |
3 | 1 | 1 | Fish Thief | Japanesse | ATU 1-299 Animal Tales | https://archive.org/details/folktalesofmexic00... | |
4 | 1 | 1 | Tub of Butter | Canadian | ATU 1-299 Animal Tales | https://archive.org/details/100armeniantales00... | |
5 | 1 | 1 | Fox and Lapp | Lapp | ATU 1-299 Animal Tales | https://archive.org/details/irishfolktales00gl... | |
6 | 1 | 1 | Reynard and Bruin | European | ATU 1-299 Animal Tales | https://archive.org/details/greenhillsofmagi00... | |
7 | 1* | 1 | Fox and the Hare in Winter | German | ATU 1-299 Animal Tales | https://archive.org/details/irishfolktales00gl... | |
8 | 1* | 1 | Fox and the Magpie | British | ATU 1-299 Animal Tales | https://archive.org/details/moreenglishfairy00... | |
9 | 1* | 1 | Hare, Badger, Monkey, and Otter | Japanesse | ATU 1-299 Animal Tales | https://archive.org/details/greyfairybook00lan... |
Lots of the links are to resources held by archive.org
, although many of them are limited preview books.
It might be interesting trying to build something to check whether a resource is openly available there, or limited access, and if it is openly available, perhaps attempt to scrape the story text from the Internet Archive.
Identifying Openly Available archive.org
Resources#
Let’s start by just grabbing a list of unique resources on archive.org
:
q = 'SELECT * FROM tale_links WHERE link LIKE "https://archive.org/details/%";'
xx = read_sql(q, db.conn)
xx['resource'] = xx['link'].str.replace("https://archive.org/details/", "", regex=False)
xx['resource'] = xx['resource'].apply(lambda x: x.split("/")[0])
xx['resource'].unique()
array(['dictionaryofbrit01brig', 'greenfairybook00lang_0',
'folktalesofegypt00elsh', 'folktalesofmexic0000pare',
'100armeniantales0000unse', 'irishfolktales00glas',
'greenhillsofmagi0000musi', 'moreenglishfairy00jaco',
'greyfairybook00lang', 'russianfairytale00afan',
'folktalesofgreec0000mega', 'englishfairytale1902jaco',
'italianfolktales00calv', 'populartalesfrom00daseiala',
'yellowfairybook00lang02', 'folktalesofgerma00kurt',
'bluefairybook00langiala', 'brownfairybook00langrich',
'folktalesoffranc00mass', 'folktalesofjapan00seki',
'moregreekfolktal0000dawk', 'violetfairybook00lang'], dtype=object)
I’ve already got a database of the Lang coloured Fairy Books, so it should be possible to join in to that.