Finding Book Index-Like Things In Lang’s Fairy Stories…
One of the problems associated with search as a form of discovery is knowing what to search for. If you have a traditional print book, it contains various navigational clues that let you quickly get a sense of what the book contains, and in what order.
A look at the table of contents gives you great big signposts in the form of chapter headings, and the distance from one signpost to the next is given not in leagues or miles, but pages.
At the back of the book, the index provides a jewellery box of trinkets, each with its own story, or stories, to tell, at a signposted location inside the book. Skimming the index gives you a summary of the key players and key topics contained inside the book: the indexer is truly of the cunning kind. A large index entry suggests the important role the person, or place, or topic so indexed, plays within the book.
So for our fairy tales, if we aren’t sure what to search for in order to discover, can we create an index-like thing that will tell us what the key nouns are – the names, places and events – and perhaps other key references, such as to large sums of money mentioned somewhere within the tales?
As well as the precious objects worthy of mention, there are other things we can tease out of the text that might make it easier to search for what we want. For example, if we want to find a story that explicitly mentions a large financial reward, it might be useful if we could search for tales mentioning at least 500 gold pieces (or equivalent). But how could we possibly do that, you might ask…
In technicalese, by extracting entities we can search around, that’s how… And I don’t necessarily mean “entities” of the supernatural kind, the bogles and boggarts and other things that come out at night that we might generically class as “ghouls”.
So what sorts of entity might we be able to find in the stories…?!
Connecting to the Database#
Let’s start by connecting to the database of Lang’s fairy stories that we created earlier, as well as loading in some magic that lets us talk to it:
We can test it works with a simple query:
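As a rough sketch of what connecting and running a test query might look like using only Python's built-in sqlite3 module: the `books` table name and its columns are assumptions made for illustration, and an in-memory stand-in is used rather than the database created earlier.

```python
import sqlite3

# Assumption: the real database lives in a file and would be opened with
# sqlite3.connect("<path to the database file>"). An in-memory stand-in
# with a hypothetical `books` table is used here so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (book TEXT, title TEXT, text TEXT)")
conn.execute(
    "INSERT INTO books VALUES (?, ?, ?)",
    ("The Blue Fairy Book", "The Bronze Ring", "Once upon a time..."),
)

# A simple test query: do we get any rows back?
rows = conn.execute("SELECT book, title FROM books LIMIT 5").fetchall()
print(rows)
```

If the query returns rows, the connection is working.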
Detecting Entities the spacy Way#
To try to identify “things” in the stories, we can invoke the powers of the great God spacy, which walks amongst us in the guise of a natural language processing toolkit:
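One way of summoning an `nlp` object is sketched below. The model name `en_core_web_sm` is an assumption (the text does not say which model is used), and the `load_nlp` helper is a hypothetical wrapper that returns `None` rather than failing if spacy or the model is not installed.

```python
def load_nlp(model_name="en_core_web_sm"):
    """Return a loaded spaCy pipeline, or None if spaCy or the model is missing."""
    try:
        import spacy
        return spacy.load(model_name)
    except (ImportError, OSError):
        # spacy.load raises OSError when the named model is not installed
        return None

# Assumption: the small English model; a larger one may find different entities
nlp = load_nlp()
print("spaCy pipeline loaded" if nlp is not None else "spaCy model not available")
```

If the model is missing, it can usually be installed with `python -m spacy download en_core_web_sm`.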
We also need some text to work with, so load in a dataframe from the database containing our book records:
 | book | title | text | last_para | first_line | provenance | chapter_order |
---|---|---|---|---|---|---|---|
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 |
1 | The Blue Fairy Book | Prince Hyacinth And The Dear Little Princess | Once upon a time there lived a king who was de... | [1] Le Prince Desir et la Princesse Mignonne. ... | Once upon a time there lived a king who was de... | Le Prince Desir et la Princesse Mignonne. Par ... | 1 |
2 | The Blue Fairy Book | East Of The Sun And West Of The Moon | Once upon a time there was a poor husbandman w... | [1] Asbjornsen and Moe. | Once upon a time there was a poor husbandman w... | Asbjornsen and Moe. | 2 |
3 | The Blue Fairy Book | The Yellow Dwarf | Once upon a time there lived a queen who had b... | [1] Madame d'Aulnoy. | Once upon a time there lived a queen who had b... | Madame d'Aulnoy. | 3 |
4 | The Blue Fairy Book | Little Red Riding Hood | Once upon a time there lived in a certain vill... | And, saying these words, this wicked wolf fell... | Once upon a time there lived in a certain vill... | | 4 |
Now let’s have a go at extracting some entities (this may take some time!).
Note
The invocation is a little cryptic, but what it boils down to is that we define a get_entities invocation that uses an entity divining nlp spell from the spacy spell book, and then we use some pandas magic to call a genie with that spell on the book text from every row in the dataframe.
# Extract a set of entities, rather than a list...
get_entities = lambda desc: {f"{entity.label_} :: {entity.text}" for entity in nlp(desc).ents}
# Apply the entity extraction routine to the text column element in each row
# The full run takes some time....
df['entities'] = df["text"].apply(get_entities)
df.head(10)
 | book | title | text | last_para | first_line | provenance | chapter_order | entities |
---|---|---|---|---|---|---|---|---|
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | {DATE :: mid-day, CARDINAL :: three, GPE :: Ma... |
1 | The Blue Fairy Book | Prince Hyacinth And The Dear Little Princess | Once upon a time there lived a king who was de... | [1] Le Prince Desir et la Princesse Mignonne. ... | Once upon a time there lived a king who was de... | Le Prince Desir et la Princesse Mignonne. Par ... | 1 | {NORP :: Greek, NORP :: Roman, LAW :: the Dear... |
2 | The Blue Fairy Book | East Of The Sun And West Of The Moon | Once upon a time there was a poor husbandman w... | [1] Asbjornsen and Moe. | Once upon a time there was a poor husbandman w... | Asbjornsen and Moe. | 2 | {FAC :: the White Bear, DATE :: day, DATE :: a... |
3 | The Blue Fairy Book | The Yellow Dwarf | Once upon a time there lived a queen who had b... | [1] Madame d'Aulnoy. | Once upon a time there lived a queen who had b... | Madame d'Aulnoy. | 3 | {CARDINAL :: seven, CARDINAL :: six, CARDINAL ... |
4 | The Blue Fairy Book | Little Red Riding Hood | Once upon a time there lived in a certain vill... | And, saying these words, this wicked wolf fell... | Once upon a time there lived in a certain vill... | | 4 | {ORG :: Grandmamma, PERSON :: Gaffer Wolf, PER... |
5 | The Blue Fairy Book | The Sleeping Beauty In The Wood | There were formerly a king and a queen, who we... | No one dared to tell him, when the Ogress, all... | There were formerly a king and a queen, who we... | | 5 | {CARDINAL :: seven, CARDINAL :: three, WORK_OF... |
6 | The Blue Fairy Book | Cinderella, Or The Little Glass Slipper | Once there was a gentleman who married, for hi... | [1] Charles Perrault. | Once there was a gentleman who married, for hi... | Charles Perrault. | 6 | {CARDINAL :: six, CARDINAL :: three, CARDINAL ... |
7 | The Blue Fairy Book | Aladdin And The Wonderful Lamp | There once lived a poor tailor, who had a son ... | [1] Arabian Nights. | There once lived a poor tailor, who had a son ... | Arabian Nights. | 7 | {CARDINAL :: forty, CARDINAL :: six, PERSON ::... |
8 | The Blue Fairy Book | The Tale Of A Youth Who Set Out To Learn What ... | A father had two sons, of whom the eldest was ... | [1] Grimm. | A father had two sons, of whom the eldest was ... | Grimm. | 8 | {CARDINAL :: seven, CARDINAL :: six, CARDINAL ... |
9 | The Blue Fairy Book | Rumpelstiltzkin | There was once upon a time a poor miller who h... | [1] Grimm. | There was once upon a time a poor miller who h... | Grimm. | 9 | {TIME :: early dawn, PERSON :: Rumpelstiltzkin... |
Danger
We should probably just do this once and add an appropriate table of entities to the database…
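A minimal sketch of what stashing the entities in a database table might look like, using the built-in sqlite3 module. The `entities` table, its column names and the `entity_rows` helper are all hypothetical, and an in-memory database stands in for the real one.

```python
import sqlite3

def entity_rows(title, entities):
    """Split 'LABEL :: text' strings into (title, type, value) rows."""
    rows = []
    for entity in sorted(entities):
        # Split on the first separator only, in case the text itself contains one
        entity_typ, entity_value = entity.split(" :: ", 1)
        rows.append((title, entity_typ, entity_value))
    return rows

# Stand-in for the real database file
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (title TEXT, entity_typ TEXT, entity_value TEXT)"
)

# Two of the entities found in Little Red Riding Hood
sample = {"PERSON :: Gaffer Wolf", "ORG :: Grandmamma"}
conn.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    entity_rows("Little Red Riding Hood", sample),
)
conn.commit()

# The slow extraction step now only needs to run once; later we just query
people = conn.execute(
    "SELECT entity_value FROM entities WHERE entity_typ = 'PERSON'"
).fetchall()
print(people)
```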
We can use some more pandas magic to explode these entity lists out into a long format dataframe:
from pandas import Series
# Explode the entities one per row...
df_long = df.explode('entities')
df_long.rename(columns={"entities":"entity"}, inplace=True)
# And then separate out entity type and value
df_long[["entity_typ", "entity_value"]] = df_long["entity"].str.split(" :: ").apply(Series)
df_long.head()
 | book | title | text | last_para | first_line | provenance | chapter_order | entity | entity_typ | entity_value |
---|---|---|---|---|---|---|---|---|---|---|
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | DATE :: mid-day | DATE | mid-day |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | CARDINAL :: three | CARDINAL | three |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | GPE :: Maisonneuve | GPE | Maisonneuve |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | GPE :: Albania | GPE | Albania |
0 | The Blue Fairy Book | The Bronze Ring | Once upon a time in a certain country there li... | [1] Traditions Populaires de l'Asie Mineure. C... | Once upon a time in a certain country there li... | Traditions Populaires de l'Asie Mineure. Carno... | 0 | CARDINAL :: one | CARDINAL | one |
The data frame object can often be quite helpful. If we ask it nicely, it will tell us what different entity types it contains, and how many rows are associated with each one:
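With pandas, this sort of tally is usually a `value_counts()` call on the entity type column. The same idea can be sketched with nothing more than the standard library, here over a handful of the entity strings shown in the tables above; the `values_of_type` helper is a hypothetical convenience for pulling out all the values of one type.

```python
from collections import Counter

# A handful of "LABEL :: text" strings of the kind shown in the tables above
entities = [
    "DATE :: mid-day",
    "CARDINAL :: three",
    "GPE :: Maisonneuve",
    "GPE :: Albania",
    "CARDINAL :: one",
]

# Separate each string into an (entity type, entity value) pair
pairs = [tuple(e.split(" :: ", 1)) for e in entities]

# How many entities of each type did we find?
type_counts = Counter(entity_typ for entity_typ, _ in pairs)
print(type_counts.most_common())

def values_of_type(pairs, wanted):
    """Return the values of every entity with the wanted type."""
    return [value for entity_typ, value in pairs if entity_typ == wanted]

print(values_of_type(pairs, "GPE"))
```

The same filter, pointed at a type such as MONEY or PERSON, gives a quick list of everything of that sort the classifier found.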
How Much?!#
What sort of money has been identified in the stories?
Dollars? Really??? What about gold coins?!
The magic works using a classifier, a special type of gossip that has read everything there is to read about everything, and that recognises what sorts of thing the various things in a text are likely to be, based on how they appear in the text and what other words appear around them.
So I wonder: do I need to train a new classifier that is a bit more specialised in terms of its familiarity with the sorts of thing you find in fairy tales?! Or was the original text really like that (are there really lots of mentions of “dollar” amounts?!)? Or has the text been got at in some way, perhaps in editing or translation, on its way to the website I downloaded my source texts from?
Note
Maybe I should do my own digitisation project to extract the text from copies of the original books on the Internet Archive?
Having identified various fragments of text as currency related, can we actually convert those phrases into something even more abstract, such as numerical amounts and recognisable currencies or units?
The quantulum3 spell book is useful to us here… so let’s create an invocation of our own around it:
#%pip install quantulum3
from quantulum3.parser import parse as qp

def quantity_conversions(quantities):
    """Attempt to identify quantities and units."""
    # We'll keep track of which monetary amounts we can convert
    conversions = []
    # and those we can't
    not_converted = []
    # Let's try with just a few monetary amounts
    for q in quantities:
        try:
            # Note: an empty list means the parser ran but found no quantity
            conversions.append((q, qp(q)))
        except Exception:
            # The parser raised an error on this phrase
            not_converted.append(q)
    # How did we do?
    return conversions, not_converted
And now let’s see if we can identify some monetary amounts and currencies, or units:
([('a penny',
[Quantity(1, "Unit(name="penny", entity=Entity("currency"), uri=Penny)")]),
('a hundred dollars',
[Quantity(100, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('three hundred dollars',
[Quantity(300, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('two hundred dollars',
[Quantity(200, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('the hundred dollars', []),
('ten dollars',
[Quantity(10, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
('the thousand dollars', []),
('the hundred marks', [])],
['a few pence', 'only a few hundred yards'])
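Several of the phrases that start with “the” come back with no quantity at all, whereas “a hundred dollars” parses happily. One thing that might be worth trying is to swap a leading “the” for “a” before parsing. The `normalise_article` helper below is a hypothetical pre-processing step, and whether the parser then succeeds on every rewritten phrase would still need checking.

```python
import re

def normalise_article(phrase):
    """Replace a leading 'the' with 'a', e.g. 'the hundred dollars' -> 'a hundred dollars'."""
    # Only a whole leading word "the" is replaced; "then..." is left alone
    return re.sub(r"^\s*the\s+", "a ", phrase, flags=re.IGNORECASE)

for phrase in ["the hundred dollars", "the thousand dollars", "the hundred marks", "ten dollars"]:
    print(phrase, "->", normalise_article(phrase))
```

The normalised phrases could then be passed to quantity_conversions in place of the originals.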
If we can start to glue these various pieces together, it’s not hard to imagine labeling each story with monetary amounts, which means we can then start to search for stories that mention sums of a certain magnitude (for example, “more than a thousand dollars”).
Various currency conversion packages also exist, which means we could start to imagine searching across currencies, if we can identify appropriate currency conversion rates. For example, “give me all stories mentioning an amount of at least 1000 gold coins, or equivalent”.
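To make that a little more concrete, here is a minimal sketch of such a search. Everything in it is made up for illustration: the story labels, the `stories_with_at_least` helper, and in particular the conversion rates, which are invented numbers rather than anything taken from a currency conversion package.

```python
# Hypothetical rates expressing each unit in "gold coin equivalents";
# these numbers are invented purely for illustration.
GOLD_COIN_RATES = {"gold coin": 1.0, "dollar": 0.5, "penny": 0.005}

# Hypothetical story labels: (title, amount, unit)
labelled = [
    ("Story A", 300, "dollar"),
    ("Story B", 1, "penny"),
    ("Story C", 1200, "gold coin"),
    ("Story D", 2500, "dollar"),
]

def stories_with_at_least(labelled, threshold, rates):
    """Return titles mentioning an amount worth at least `threshold` gold coins."""
    matches = []
    for title, amount, unit in labelled:
        rate = rates.get(unit)
        if rate is None:
            # Unknown unit: we cannot compare it, so skip it
            continue
        if amount * rate >= threshold:
            matches.append(title)
    return matches

print(stories_with_at_least(labelled, 1000, GOLD_COIN_RATES))
```

Amounts in units with no known rate are skipped rather than guessed at.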
How Far Have We Still To Go?#
As well as monetary amounts, did our classifier genie spot any other quantities in the texts?
We can also attempt to parse these values using the quantulum3 incantation:
([('one foot',
[Quantity(1, "Unit(name="foot", entity=Entity("length"), uri=Foot_(unit))")]),
('a mile',
[Quantity(1, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('twenty miles',
[Quantity(20, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('a hundred miles',
[Quantity(100, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('three miles',
[Quantity(3, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('seven miles',
[Quantity(7, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
('several miles', [])],
['a few miles', 'a few yards', 'a few feet'])
How Many Did You Say?#
We can also identify cardinal values, which is to say, numbers:
Again, can we cast these to numerics by the power of quantulum3?
([('one', []),
('two',
[Quantity(2, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('three',
[Quantity(3, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('half',
[Quantity(0, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('One', []),
('four',
[Quantity(4, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('six',
[Quantity(6, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('twelve',
[Quantity(12, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('seven',
[Quantity(7, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
('thousand', [])],
[])
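Not everything converts: “one” and “thousand” come back empty, and “half” comes back as 0. For single number words like these, a small lookup table might serve as a fallback. The `WORD_NUMBERS` table and `word_to_number` helper below are a hypothetical sketch covering only a few words, not a general number parser.

```python
# A deliberately small lookup of single number words (an illustrative subset)
WORD_NUMBERS = {
    "half": 0.5,
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
    "hundred": 100, "thousand": 1000,
}

def word_to_number(word):
    """Return the numeric value of a single number word, or None if unknown."""
    return WORD_NUMBERS.get(word.strip().lower())

for word in ["one", "One", "half", "thousand", "several"]:
    print(word, "->", word_to_number(word))
```

Words that are not in the table, such as “several”, return None, so they can still be flagged as unconverted.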
Who’s There?#
As well as quantities, what people or sorts of people have been identified?
Where’s There?#
How about geo-political entities (GPEs)?
Organisationally Speaking…#
Were organisations or organisational types recognised?
What’s a NORP? (Ah… Nationalities Or Religious or Political groups.)