Finding Book Index-Like Things In Lang’s Fairy Stories…
One of the problems associated with search as a form of discovery is knowing what to search for. If you have a traditional print book, it contains various navigational clues that let you quickly get a sense of what the book contains, and in what order.
A look at the table of contents gives you great big signposts in the form of chapter headings, and the distance from one signpost to the next is given not in leagues or miles, but pages.
At the back of the book, the index provides a jewellery box of trinkets, each with its own story, or stories, to tell, at a signposted location inside the book. Skimming the index gives you a summary of the key players and key topics contained inside the book: the indexer is truly of the cunning kind. A large index entry suggests the important role the person, or place, or topic so indexed, plays within the book.
So for our fairy tales, if we aren’t sure what to search for in order to discover, can we create an index-like thing that will tell us what the key nouns are – the names, places and events – and perhaps other key references, such as to large sums of money mentioned somewhere within the tales?
As well as the precious objects worthy of mention, there are other things we can tease out of the text that might make it easier to search for what we want. For example, if we want to find a story that explicitly mentions a large financial reward, it might be useful if we could search for tales mentioning at least 500 gold pieces (or equivalent). But how could we possibly do that, you might ask…
In technicalese, by extracting entities we can search around, that’s how… And I don’t necessarily mean “entities” of the supernatural kind, the bogles and boggarts and other things that come out at night that we might generically class as “ghouls”.
So what sorts of entity might we be able to find in the stories…?!
Connecting to the Database#
Let’s start by connecting to the database of Lang’s fairy stories that we created earlier, as well as loading in some magic that lets us talk to it:
from sqlite_utils import Database
db_name = "lang_fairy_tale.db"
db = Database(db_name)
# Load in the sql magic
%load_ext sql
%sql sqlite:///$db_name
We can test it works with a simple query:
%%sql
SELECT title FROM books LIMIT 3;
* sqlite:///lang_fairy_tale.db
Done.
title |
---|
A Voyage To Lilliput |
Aladdin And The Wonderful Lamp |
Beauty And The Beast |
Detecting Entities the spacy Way#
To try to identify “things” in the stories, we can invoke the powers of the great God spacy, which walks amongst us in the guise of a natural language processing toolkit:
#%pip install --upgrade spacy
# Get the spell book from a library shelf
import spacy
# And summon some magic from it
nlp = spacy.load("en_core_web_sm")
We also need some text to work with, so load in a dataframe from the database containing our book records:
response = %sql SELECT * FROM books;
df = response.DataFrame()
df.head()
* sqlite:///lang_fairy_tale.db
Done.
book | title | text | last_para | first_line | provenance | chapter_order | |
---|---|---|---|---|---|---|---|
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 |
1 | The Blue Fairy Book | Prince Hyacinth And The Dear Little Princess | Once upon a time there lived a king who was de... | [1] Le Prince Desir et la Princesse Mignonne. ... | Once upon a time there lived a king who was de... | Le Prince Desir et la Princesse Mignonne. Par ... | 1 |
2 | The Blue Fairy Book | East Of The Sun And West Of The Moon | Once upon a time there was a poor husbandman w... | [1] Asbjornsen and Moe. | Once upon a time there was a poor husbandman w... | Asbjornsen and Moe. | 2 |
3 | The Blue Fairy Book | The Yellow Dwarf | Once upon a time there lived a queen who had b... | [1] Madame d'Aulnoy. | Once upon a time there lived a queen who had b... | Madame d'Aulnoy. | 3 |
4 | The Blue Fairy Book | Little Red Riding Hood | Once upon a time there lived in a certain vill... | And, saying these words, this wicked wolf fell... | Once upon a time there lived in a certain vill... | | 4 |
Now let’s have a go at extracting some entities (this may take some time!).
Note
The invocation is a little cryptic, but what it boils down to is that we define a get_entities invocation that uses an entity divining nlp spell from the spacy spell book, and then we use some pandas magic to call a genie with that spell on the book text from every row in the dataframe.
# Extract a set of entities, rather than a list...
get_entities = lambda desc: {f"{entity.label_} :: {entity.text}" for entity in nlp(desc).ents}
# Apply the entity extraction routine to the text column element in each row
# The full run takes some time....
df['entities'] = df["text"].apply(get_entities)
df.head(10)
book | title | text | last_para | first_line | provenance | chapter_order | entities | |
---|---|---|---|---|---|---|---|---|
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | {DATE :: mid-day, CARDINAL :: three, GPE :: Ma... |
1 | The Blue Fairy Book | Prince Hyacinth And The Dear Little Princess | Once upon a time there lived a king who was de... | [1] Le Prince Desir et la Princesse Mignonne. ... | Once upon a time there lived a king who was de... | Le Prince Desir et la Princesse Mignonne. Par ... | 1 | {NORP :: Greek, NORP :: Roman, LAW :: the Dear... |
2 | The Blue Fairy Book | East Of The Sun And West Of The Moon | Once upon a time there was a poor husbandman w... | [1] Asbjornsen and Moe. | Once upon a time there was a poor husbandman w... | Asbjornsen and Moe. | 2 | {FAC :: the White Bear, DATE :: day, DATE :: a... |
3 | The Blue Fairy Book | The Yellow Dwarf | Once upon a time there lived a queen who had b... | [1] Madame d'Aulnoy. | Once upon a time there lived a queen who had b... | Madame d'Aulnoy. | 3 | {CARDINAL :: seven, CARDINAL :: six, CARDINAL ... |
4 | The Blue Fairy Book | Little Red Riding Hood | Once upon a time there lived in a certain vill... | And, saying these words, this wicked wolf fell... | Once upon a time there lived in a certain vill... | | 4 | {ORG :: Grandmamma, PERSON :: Gaffer Wolf, PER... |
5 | The Blue Fairy Book | The Sleeping Beauty In The Wood | There were formerly a king and a queen, who we... | No one dared to tell him, when the Ogress, all... | There were formerly a king and a queen, who we... | | 5 | {CARDINAL :: seven, CARDINAL :: three, WORK_OF... |
6 | The Blue Fairy Book | Cinderella, Or The Little Glass Slipper | Once there was a gentleman who married, for hi... | [1] Charles Perrault. | Once there was a gentleman who married, for hi... | Charles Perrault. | 6 | {CARDINAL :: six, CARDINAL :: three, CARDINAL ... |
7 | The Blue Fairy Book | Aladdin And The Wonderful Lamp | There once lived a poor tailor, who had a son ... | [1] Arabian Nights. | There once lived a poor tailor, who had a son ... | Arabian Nights. | 7 | {CARDINAL :: forty, CARDINAL :: six, PERSON ::... |
8 | The Blue Fairy Book | The Tale Of A Youth Who Set Out To Learn What ... | A father had two sons, of whom the eldest was ... | [1] Grimm. | A father had two sons, of whom the eldest was ... | Grimm. | 8 | {CARDINAL :: seven, CARDINAL :: six, CARDINAL ... |
9 | The Blue Fairy Book | Rumpelstiltzkin | There was once upon a time a poor miller who h... | [1] Grimm. | There was once upon a time a poor miller who h... | Grimm. | 9 | {TIME :: early dawn, PERSON :: Rumpelstiltzkin... |
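If the lambda still looks cryptic, it may help to see the same set-building spell written out as a named function. The sketch below uses stand-in entity objects (a hypothetical `FakeEntity` with just the `label_` and `text` attributes) rather than a real spacy document, so it runs without the language model; `format_entities` is likewise a name invented for illustration.

```python
from collections import namedtuple

# Stand-in for a spacy entity span: only the two attributes used here.
FakeEntity = namedtuple("FakeEntity", ["label_", "text"])


def format_entities(ents):
    """Return a set of 'LABEL :: text' strings, one per distinct entity."""
    return {f"{ent.label_} :: {ent.text}" for ent in ents}


# Hypothetical entities, as might be found in a single story
ents = [
    FakeEntity("CARDINAL", "three"),
    FakeEntity("GPE", "Albania"),
    FakeEntity("CARDINAL", "three"),  # a repeat mention
]

labelled = format_entities(ents)
print(sorted(labelled))
```

Because a set is used, repeat mentions within one story collapse into a single item, so any counts made later over these sets are counts of stories that mention an entity, not counts of individual mentions.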
Danger
We should probably just do this once and add an appropriate table of entities to the database…
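As a rough sketch of what "doing this once" might look like, the entity sets could be unpacked into rows and saved in a table of their own. The example below uses the standard library `sqlite3` module with an in-memory database and made-up data so that it leaves the real database file alone; the table name `entities` and its column names are assumptions, not something defined elsewhere in this book.

```python
import sqlite3

# Hypothetical per-story entity sets, in the "LABEL :: text" form used above
story_entities = {
    "The Bronze Ring": {"GPE :: Albania", "CARDINAL :: three"},
    "Rumpelstiltzkin": {"PERSON :: Rumpelstiltzkin"},
}

# An in-memory database keeps this sketch away from the real file
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (title TEXT, entity_typ TEXT, entity_value TEXT)"
)

rows = []
for title, entities in story_entities.items():
    for entity in sorted(entities):
        # Split on the first separator only, in case the text contains one
        entity_typ, entity_value = entity.split(" :: ", 1)
        rows.append((title, entity_typ, entity_value))

conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", rows)
conn.commit()

# Later sessions could then query the table instead of re-running spacy
gpe_titles = conn.execute(
    "SELECT title FROM entities WHERE entity_typ = 'GPE'"
).fetchall()
print(gpe_titles)
```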
We can use some more pandas magic to explode these entity lists out into a long format dataframe:
from pandas import Series
# Explode the entities one per row...
df_long = df.explode('entities')
df_long.rename(columns={"entities":"entity"}, inplace=True)
# And then separate out entity type and value
df_long[["entity_typ", "entity_value"]] = df_long["entity"].str.split(" :: ").apply(Series)
df_long.head()
book | title | text | last_para | first_line | provenance | chapter_order | entity | entity_typ | entity_value | |
---|---|---|---|---|---|---|---|---|---|---|
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | DATE :: mid-day | DATE | mid-day |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | CARDINAL :: three | CARDINAL | three |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | GPE :: Maisonneuve | GPE | Maisonneuve |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | GPE :: Albania | GPE | Albania |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | CARDINAL :: one | CARDINAL | one |
The data frame object can often be quite helpful. If we ask it nicely, it will tell us what different entity types it contains, and how many rows are associated with each one:
df_long["entity_typ"].value_counts()
DATE 3108
CARDINAL 2536
TIME 1928
ORG 1564
PERSON 1499
ORDINAL 879
GPE 564
WORK_OF_ART 358
NORP 343
QUANTITY 209
LOC 165
PRODUCT 144
FAC 126
MONEY 41
LAW 34
EVENT 28
LANGUAGE 12
Name: entity_typ, dtype: int64
How Much?!#
What sort of money has been identified in the stories?
money_filter = df_long["entity_typ"]=="MONEY"
df_long[money_filter]["entity_value"].value_counts().head(10)
a penny 6
a hundred dollars 5
a few pence 4
three hundred dollars 4
two hundred dollars 3
the hundred dollars 2
ten dollars 2
the thousand dollars 1
only a few hundred yards 1
the hundred marks 1
Name: entity_value, dtype: int64
Dollars? Really??? What about gold coins?!
The magic works using a classifier, a special type of gossip that has read everything there is to read about everything, and that recognises what sorts of thing the words in our text are likely to be, based on how they appear in the text and what other words appear around them.
So I wonder: do I need to train a new classifier that is a bit more specialised in terms of its familiarity with the sorts of thing you find in fairy tales?! Or was the original text really like that (are there really lots of mentions of “dollar” amounts?!) Or has the text been got at in some way, perhaps in editing or translation, on its way to the website I downloaded my source texts from?
Note
Maybe I should do my own digitisation project to extract the text from copies of the original books on the Internet Archive?
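One quick way to find out whether the dollars really are in the source text, rather than being conjured up by the classifier, is to search the raw text for the word directly. The sketch below does that over a couple of made-up snippets standing in for the `text` column; the snippets and the variable names are assumptions for illustration.

```python
import re

# Hypothetical snippets standing in for the `text` column of the books table
texts = {
    "Story A": "He paid a hundred dollars, and then ten dollars more.",
    "Story B": "She found three gold pieces and a penny.",
}

# Whole-word, case-insensitive match for "dollar" or "dollars"
dollar_pattern = re.compile(r"\bdollars?\b", flags=re.IGNORECASE)

dollar_counts = {
    title: len(dollar_pattern.findall(text)) for title, text in texts.items()
}
print(dollar_counts)
```

If the real stories return non-zero counts, the dollars were already in the downloaded text and the classifier is simply reporting what it found.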
Having identified various fragments of text as currency related, can we actually convert those phrases into something even more abstract, such as numerical amounts and recognisable currencies or units?
The quantulum3 spell book is useful to us here… so let’s create an invocation of our own around it:
#%pip install quantulum3
from quantulum3.parser import parse as qp
def quantity_conversions(quantities):
    """Attempt to identify quantities and units."""
    # We'll keep track of which phrases the parser accepts
    # (an empty result list still counts as accepted)
    conversions = []
    # and those that make it raise an error
    not_converted = []
    # Try to parse each phrase in turn
    for q in quantities:
        try:
            conversions.append((q, qp(q)))
        except Exception:
            not_converted.append(q)
    # How did we do?
    return conversions, not_converted
And now let’s see if we can identify some monetary amounts and currencies, or units:
money = df_long[money_filter]["entity_value"].value_counts().index.tolist()
quantity_conversions(money[:10])
([('a penny',
[Quantity(1, "Unit(name="penny", entity=Entity("currency"), uri=Penny)")]),
('a hundred dollars',
[Quantity(100, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('three hundred dollars',
[Quantity(300, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('two hundred dollars',
[Quantity(200, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('the hundred dollars', []),
('ten dollars',
[Quantity(10, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('the thousand dollars', []),
('the hundred marks', [])],
['a few pence', 'only a few hundred yards'])
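In the output above, “a hundred dollars” parses but “the hundred dollars” comes back with an empty list, which suggests the leading “the” may be what trips the parser up. One possible workaround, sketched below, is to rewrite the phrase before handing it to quantity_conversions. The helper `normalise_amount` is a hypothetical name, and whether quantulum3 then parses every rewritten phrase has not been tested here.

```python
import re


def normalise_amount(phrase):
    """Swap a leading 'the' for 'a' and collapse extra whitespace."""
    phrase = " ".join(phrase.split())
    return re.sub(r"^the\s+", "a ", phrase, flags=re.IGNORECASE)


failed = ["the hundred dollars", "the thousand dollars", "the hundred marks"]
normalised = [normalise_amount(p) for p in failed]
print(normalised)
```

The normalised phrases could then be passed back through quantity_conversions to see how many of the empty results are rescued.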
If we can start to glue these various pieces together, it’s not hard to imagine labeling each story with monetary amounts, which means we can then start to search for stories that mention sums of a certain magnitude (for example, “more than a thousand dollars”).
Various currency conversion packages also exist, which means we could start to imagine searching across currencies, if we can identify appropriate currency conversion rates. For example, “give me all stories mentioning an amount of at least 1000 gold coins, or equivalent”.
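To make that a little more concrete, here is a minimal sketch of a magnitude search over amounts that have already been parsed. Everything in it is assumed for illustration: the story titles, the `(title, value, unit)` record shape, the function name, and above all the exchange rates, which are invented and have no historical basis.

```python
# Hypothetical parsed amounts: (story title, value, unit name)
amounts = [
    ("Story A", 300, "dollar"),
    ("Story B", 1, "penny"),
    ("Story C", 2000, "gold coin"),
]

# Made-up exchange rates into "gold coins", purely for illustration
RATES_TO_GOLD = {"gold coin": 1.0, "dollar": 0.5, "penny": 0.005}


def stories_with_at_least(amounts, threshold, rates):
    """Return sorted titles mentioning an amount worth at least `threshold`."""
    titles = set()
    for title, value, unit in amounts:
        rate = rates.get(unit)
        if rate is None:
            continue  # unknown currency: skip rather than guess
        if value * rate >= threshold:
            titles.add(title)
    return sorted(titles)


rich_stories = stories_with_at_least(amounts, 100, RATES_TO_GOLD)
print(rich_stories)
```

Skipping unknown currencies, rather than guessing a rate, means a story can be missed but is never matched on a made-up value.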
How Far Have We Still To Go?#
As well as monetary amounts, did our classifier genie spot any other quantities in the texts?
quantity_filter = df_long["entity_typ"]=="QUANTITY"
df_long[quantity_filter]["entity_value"].value_counts().head(10)
one foot 15
a mile 12
a few miles 7
twenty miles 6
a hundred miles 5
three miles 5
seven miles 3
a few yards 3
a few feet 3
several miles 3
Name: entity_value, dtype: int64
We can also attempt to parse these values using the quantulum3 incantation:
# Get a list of quantities extracted from our texts
quantities = df_long[quantity_filter]["entity_value"].value_counts().index.tolist()
quantity_conversions(quantities[:10])
([('one foot',
[Quantity(1, "Unit(name="foot", entity=Entity("length"), uri=Foot_(unit))")]),
('a mile',
[Quantity(1, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('twenty miles',
[Quantity(20, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('a hundred miles',
[Quantity(100, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('three miles',
[Quantity(3, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('seven miles',
[Quantity(7, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('several miles', [])],
['a few miles', 'a few yards', 'a few feet'])
How Many Did You Say?#
We can also identify cardinal values, which is to say, numbers:
cardinal_filter = df_long["entity_typ"]=="CARDINAL"
cardinals = df_long[cardinal_filter]["entity_value"].value_counts().index.tolist()
cardinals[:10]
['one',
'two',
'three',
'half',
'One',
'four',
'six',
'twelve',
'seven',
'thousand']
Again, can we cast these to numerics by the power of quantulum3?
quantity_conversions(cardinals[:10])
([('one', []),
('two',
[Quantity(2, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('three',
[Quantity(3, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('half',
[Quantity(0, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('One', []),
('four',
[Quantity(4, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('six',
[Quantity(6, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('twelve',
[Quantity(12, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('seven',
[Quantity(7, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('thousand', [])],
[])
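The output above shows “one”, “One” and “thousand” coming back empty, and “half” being reported as 0. Where the parser returns nothing useful, a small lookup of our own could be tried as a fallback. The sketch below is deliberately tiny and only copes with simple phrases; `words_to_number` and its word lists are hypothetical and not part of quantulum3.

```python
UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
}
MULTIPLIERS = {"hundred": 100, "thousand": 1000}


def words_to_number(phrase):
    """Convert a simple number phrase to a number, or None if unrecognised."""
    words = phrase.lower().replace("-", " ").split()
    if words == ["half"]:
        return 0.5
    total, current, seen = 0, 0, False
    for word in words:
        if word in ("a", "an") and not seen:
            current, seen = 1, True
        elif word in UNITS:
            current, seen = current + UNITS[word], True
        elif word in MULTIPLIERS:
            # A bare multiplier such as "thousand" counts as one thousand
            current = max(current, 1) * MULTIPLIERS[word]
            if MULTIPLIERS[word] >= 1000:
                total, current = total + current, 0
            seen = True
        else:
            return None
    return total + current if seen else None


samples = ["one", "One", "thousand", "half", "a hundred", "several"]
print([words_to_number(s) for s in samples])
```

Returning None for anything unrecognised, such as “several”, keeps vague amounts visibly unconverted rather than silently turning them into a number.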
Who’s There?#
As well as quantities, what people or sorts of people have been identified?
df_long[df_long["entity_typ"]=="PERSON"]["entity_value"].value_counts().head(10)
Fairy 31
Majesty 29
Prince 29
Queen 27
bush 12
wolf 11
Madam 11
Campbell 11
Lang 11
Jack 10
Name: entity_value, dtype: int64
Where’s There?#
How about geo-political entities (GPEs)?
df_long[df_long["entity_typ"]=="GPE"]["entity_value"].value_counts().head(10)
thou 11
Paris 10
France 9
Japan 8
Greece 7
Denmark 6
Ireland 6
Contes Armeniens 5
Gedichte 5
Finland 5
Name: entity_value, dtype: int64
When’s There?#
When did things happen?
df_long[df_long["entity_typ"]=="DATE"]["entity_value"].value_counts().head(10)
one day 188
One day 154
the day 86
three days 79
next day 62
the next day 61
a few days 50
all day 44
many years 41
that day 40
Name: entity_value, dtype: int64
And how about time considerations?
df_long[df_long["entity_typ"]=="TIME"]["entity_value"].value_counts().head(10)
night 117
evening 115
a few minutes 77
morning 69
next morning 63
one morning 58
Next morning 53
the night 47
midnight 44
one night 40
Name: entity_value, dtype: int64
Organisationally Speaking…#
Were organisations or organisational types recognised?
df_long[df_long["entity_typ"]=="ORG"]["entity_value"].value_counts().head(10)
King 102
Princess 71
Court 48
Prince 42
Grimm 28
King's 23
Quick 18
eagle 12
Majesty 11
I. 10
Name: entity_value, dtype: int64
What’s a NORP? (Ah… Nationalities Or Religious or Political groups.)
df_long[df_long["entity_typ"]=="NORP"]["entity_value"].value_counts().head(10)
German 24
Danish 18
French 14
Russian 11
Indian 10
Christian 9
Italian 8
Portuguese 7
Spanish 7
Chinese 6
Name: entity_value, dtype: int64
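Pulling the threads together, the index-like thing we set out to find could be sketched as a mapping from each entity to the stories that mention it. The rows below are made up but shaped like the long dataframe (title, entity type, entity value); the choice of which entity types belong in the index is an editorial assumption, as are the variable names.

```python
from collections import defaultdict

# Hypothetical (title, entity type, entity value) rows, shaped like df_long
rows = [
    ("The Bronze Ring", "GPE", "Albania"),
    ("Aladdin And The Wonderful Lamp", "PERSON", "Aladdin"),
    ("Beauty And The Beast", "PERSON", "Beauty"),
    ("The Bronze Ring", "CARDINAL", "three"),
    ("Beauty And The Beast", "CARDINAL", "three"),
]

# Entity types we might want in a book-style index (an editorial choice)
INDEX_TYPES = {"PERSON", "GPE", "LOC", "EVENT"}

index = defaultdict(set)
for title, entity_typ, entity_value in rows:
    if entity_typ in INDEX_TYPES:
        index[entity_value].add(title)

# Print entries alphabetically, as a book index would
for entry in sorted(index, key=str.lower):
    print(f"{entry}: {', '.join(sorted(index[entry]))}")
```

As in a printed index, an entry with a long list of stories after it hints at a person or place that plays a large role across the collection.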