Finding Book Index-Like Things In Lang’s Fairy Stories…#

One of the problems associated with search as a form of discovery is knowing what to search for. If you have a traditional print book, it contains various navigational clues that let you quickly get a sense of what the book contains, and in what order.

A look at the table of contents gives you great big signposts in the form of chapter headings, and the distance from one signpost to the next is given not in leagues or miles, but in pages.

At the back of the book, the index provides a jewellery box of trinkets, each with its own story, or stories, to tell, at a signposted location inside the book. Skimming the index gives you a summary of the key players and key topics contained inside the book: the indexer is truly of the cunning kind. A large index entry suggests the important role that the person, place, or topic so indexed plays within the book.

So for our fairy tales, if we aren’t sure what to search for in order to discover, can we create an index-like thing that will tell us what the key nouns are – the names, places and events – and perhaps other key references, such as large sums of money mentioned somewhere within the tales?

As well as the precious objects worthy of mention, there are other things we can tease out of the text that might make it easier to search for what we want. For example, if we want to find a story that explicitly mentions a large financial reward, it might be useful if we could search for tales mentioning at least 500 gold pieces (or equivalent). But how could we possibly do that, you might ask…

In technicalese, by extracting entities we can search around, that’s how… And I don’t necessarily mean “entities” of the supernatural kind, the bogles and boggarts and other things that come out at night that we might generically class as “ghouls”.

So what sorts of entity might we be able to find in the stories…?!

Connecting to the Database#

Let’s start by connecting to the database of Lang’s fairy stories that we created earlier, as well as loading in some magic that lets us talk to it:

from sqlite_utils import Database

db_name = "lang_fairy_tale.db"
db = Database(db_name)

# Load in the sql magic
%load_ext sql
%sql sqlite:///$db_name

We can test it works with a simple query:

%%sql
SELECT title FROM books LIMIT 3;
 * sqlite:///lang_fairy_tale.db
Done.
title
A Voyage To Lilliput
Aladdin And The Wonderful Lamp
Beauty And The Beast

Detecting Entities the spacy Way#

To try to identify “things” in the stories, we can invoke the powers of the great God spacy, which walks amongst us in the guise of a natural language processing toolkit:

#%pip install --upgrade spacy
# Get the spell book from a library shelf
import spacy

# And summon some magic from it
nlp = spacy.load("en_core_web_sm")

We also need some text to work with, so load in a dataframe from the database containing our book records:

response = %sql SELECT * FROM books;

df = response.DataFrame()
df.head()
 * sqlite:///lang_fairy_tale.db
Done.
book title text last_para first_line provenance chapter_order
0 The Blue Fairy Book The Bronze Ring Once upon a time in a certain country there li... [1] Traditions Populaires de l'Asie Mineure. C... Once upon a time in a certain country there li... Traditions Populaires de l'Asie Mineure. Carno... 0
1 The Blue Fairy Book Prince Hyacinth And The Dear Little Princess Once upon a time there lived a king who was de... [1] Le Prince Desir et la Princesse Mignonne. ... Once upon a time there lived a king who was de... Le Prince Desir et la Princesse Mignonne. Par ... 1
2 The Blue Fairy Book East Of The Sun And West Of The Moon Once upon a time there was a poor husbandman w... [1] Asbjornsen and Moe. Once upon a time there was a poor husbandman w... Asbjornsen and Moe. 2
3 The Blue Fairy Book The Yellow Dwarf Once upon a time there lived a queen who had b... [1] Madame d'Aulnoy. Once upon a time there lived a queen who had b... Madame d'Aulnoy. 3
4 The Blue Fairy Book Little Red Riding Hood Once upon a time there lived in a certain vill... And, saying these words, this wicked wolf fell... Once upon a time there lived in a certain vill... 4

Now let’s have a go at extracting some entities (this may take some time!).

Note

The invocation is a little cryptic, but what it boils down to is that we define a get_entities invocation that uses an entity divining nlp spell from the spacy spell book, and then we use some pandas magic to call a genie with that spell on the book text from every row in the dataframe.

# Extract a set of entities, rather than a list...
get_entities = lambda desc: {f"{entity.label_} :: {entity.text}" for entity in nlp(desc).ents}

# Apply the entity extraction routine to the text column element in each row
# The full run takes some time....
df['entities'] = df["text"].apply(get_entities)

df.head(10)
book title text last_para first_line provenance chapter_order entities
0 The Blue Fairy Book The Bronze Ring Once upon a time in a certain country there li... [1] Traditions Populaires de l'Asie Mineure. C... Once upon a time in a certain country there li... Traditions Populaires de l'Asie Mineure. Carno... 0 {DATE :: mid-day, CARDINAL :: three, GPE :: Ma...
1 The Blue Fairy Book Prince Hyacinth And The Dear Little Princess Once upon a time there lived a king who was de... [1] Le Prince Desir et la Princesse Mignonne. ... Once upon a time there lived a king who was de... Le Prince Desir et la Princesse Mignonne. Par ... 1 {NORP :: Greek, NORP :: Roman, LAW :: the Dear...
2 The Blue Fairy Book East Of The Sun And West Of The Moon Once upon a time there was a poor husbandman w... [1] Asbjornsen and Moe. Once upon a time there was a poor husbandman w... Asbjornsen and Moe. 2 {FAC :: the White Bear, DATE :: day, DATE :: a...
3 The Blue Fairy Book The Yellow Dwarf Once upon a time there lived a queen who had b... [1] Madame d'Aulnoy. Once upon a time there lived a queen who had b... Madame d'Aulnoy. 3 {CARDINAL :: seven, CARDINAL :: six, CARDINAL ...
4 The Blue Fairy Book Little Red Riding Hood Once upon a time there lived in a certain vill... And, saying these words, this wicked wolf fell... Once upon a time there lived in a certain vill... 4 {ORG :: Grandmamma, PERSON :: Gaffer Wolf, PER...
5 The Blue Fairy Book The Sleeping Beauty In The Wood There were formerly a king and a queen, who we... No one dared to tell him, when the Ogress, all... There were formerly a king and a queen, who we... 5 {CARDINAL :: seven, CARDINAL :: three, WORK_OF...
6 The Blue Fairy Book Cinderella, Or The Little Glass Slipper Once there was a gentleman who married, for hi... [1] Charles Perrault. Once there was a gentleman who married, for hi... Charles Perrault. 6 {CARDINAL :: six, CARDINAL :: three, CARDINAL ...
7 The Blue Fairy Book Aladdin And The Wonderful Lamp There once lived a poor tailor, who had a son ... [1] Arabian Nights. There once lived a poor tailor, who had a son ... Arabian Nights. 7 {CARDINAL :: forty, CARDINAL :: six, PERSON ::...
8 The Blue Fairy Book The Tale Of A Youth Who Set Out To Learn What ... A father had two sons, of whom the eldest was ... [1] Grimm. A father had two sons, of whom the eldest was ... Grimm. 8 {CARDINAL :: seven, CARDINAL :: six, CARDINAL ...
9 The Blue Fairy Book Rumpelstiltzkin There was once upon a time a poor miller who h... [1] Grimm. There was once upon a time a poor miller who h... Grimm. 9 {TIME :: early dawn, PERSON :: Rumpelstiltzkin...
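The one-line invocation above can be unpacked into a named function, which may make it easier to see what the spell is doing. This is only a sketch: it assumes `nlp` is the loaded spacy pipeline from earlier, and it takes that pipeline as an argument so the function does not depend on a global.

```python
def get_entities(text, nlp):
    """Return the set of 'LABEL :: text' strings found in one story.

    `nlp` is assumed to be a loaded spacy pipeline, such as the
    `nlp = spacy.load("en_core_web_sm")` object created earlier.
    """
    # Running the pipeline over the text gives a parsed document...
    doc = nlp(text)

    entities = set()
    # ...whose .ents attribute lists the entity spans that were recognised
    for entity in doc.ents:
        # label_ is the entity type (PERSON, GPE, MONEY, ...),
        # text is the matched fragment of the story
        entities.add(f"{entity.label_} :: {entity.text}")

    # Using a set means repeated mentions in one story are recorded only once
    return entities


# Usage in the notebook, equivalent to the lambda version:
# df["entities"] = df["text"].apply(get_entities, nlp=nlp)
```

One consequence of collecting a set per story is that any later counts of an entity tell us how many stories mention it, not how many times it is mentioned overall.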

Danger

We should probably just do this once and add an appropriate table of entities to the database…
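As a rough sketch of what "doing this once" might look like, the per-story entity sets could be flattened into rows and written to a table of their own. The `entities` table name and its columns are assumptions made for illustration, and the sketch uses the standard library `sqlite3` module with an in-memory database so that it runs on its own; the two sample stories and their entities are taken from the output shown above.

```python
import sqlite3

# A couple of per-story entity sets, in the "LABEL :: text" form used above
story_entities = {
    "The Bronze Ring": {"DATE :: mid-day", "CARDINAL :: three", "GPE :: Albania"},
    "Rumpelstiltzkin": {"TIME :: early dawn", "PERSON :: Rumpelstiltzkin"},
}

# An in-memory database keeps the sketch self-contained;
# in practice we would connect to the existing database file instead
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (title TEXT, entity_typ TEXT, entity_value TEXT)"
)

rows = []
for title, entities in story_entities.items():
    for entity in sorted(entities):
        # Split on the first separator only, in case the entity text contains one
        entity_typ, entity_value = entity.split(" :: ", 1)
        rows.append((title, entity_typ, entity_value))

conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", rows)
conn.commit()

# The entities can now be queried without re-running the slow extraction step
found = conn.execute(
    "SELECT title FROM entities WHERE entity_typ = 'PERSON'"
).fetchall()
print(found)
```

With the `sqlite_utils` connection already open in this notebook, something along the lines of `db["entities"].insert_all(...)` with a list of dictionaries would presumably do the same job.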

We can use some more pandas magic to explode these entity lists out into a long format dataframe:

from pandas import Series

# Explode the entities one per row...
df_long = df.explode('entities')
df_long.rename(columns={"entities":"entity"}, inplace=True)

# And then separate out entity type and value
df_long[["entity_typ", "entity_value"]] = df_long["entity"].str.split(" :: ").apply(Series)
df_long.head()
book title text last_para first_line provenance chapter_order entity entity_typ entity_value
0 The Blue Fairy Book The Bronze Ring Once upon a time in a certain country there li... [1] Traditions Populaires de l'Asie Mineure. C... Once upon a time in a certain country there li... Traditions Populaires de l'Asie Mineure. Carno... 0 DATE :: mid-day DATE mid-day
0 The Blue Fairy Book The Bronze Ring Once upon a time in a certain country there li... [1] Traditions Populaires de l'Asie Mineure. C... Once upon a time in a certain country there li... Traditions Populaires de l'Asie Mineure. Carno... 0 CARDINAL :: three CARDINAL three
0 The Blue Fairy Book The Bronze Ring Once upon a time in a certain country there li... [1] Traditions Populaires de l'Asie Mineure. C... Once upon a time in a certain country there li... Traditions Populaires de l'Asie Mineure. Carno... 0 GPE :: Maisonneuve GPE Maisonneuve
0 The Blue Fairy Book The Bronze Ring Once upon a time in a certain country there li... [1] Traditions Populaires de l'Asie Mineure. C... Once upon a time in a certain country there li... Traditions Populaires de l'Asie Mineure. Carno... 0 GPE :: Albania GPE Albania
0 The Blue Fairy Book The Bronze Ring Once upon a time in a certain country there li... [1] Traditions Populaires de l'Asie Mineure. C... Once upon a time in a certain country there li... Traditions Populaires de l'Asie Mineure. Carno... 0 CARDINAL :: one CARDINAL one

The data frame object can often be quite helpful. If we ask it nicely, it will tell us what different entity types it contains, and how many rows are associated with each one:

df_long["entity_typ"].value_counts()
DATE           3108
CARDINAL       2536
TIME           1928
ORG            1564
PERSON         1499
ORDINAL         879
GPE             564
WORK_OF_ART     358
NORP            343
QUANTITY        209
LOC             165
PRODUCT         144
FAC             126
MONEY            41
LAW              34
EVENT            28
LANGUAGE         12
Name: entity_typ, dtype: int64

How Much?!#

What sort of money has been identified in the stories?

money_filter = df_long["entity_typ"]=="MONEY"

df_long[money_filter]["entity_value"].value_counts().head(10)
a penny                     6
a hundred dollars           5
a few pence                 4
three hundred dollars       4
two hundred dollars         3
the hundred dollars         2
ten dollars                 2
the thousand dollars        1
only a few hundred yards    1
the hundred marks           1
Name: entity_value, dtype: int64

Dollars? Really??? What about gold coins?!

The magic works using a classifier, a special type of gossip that has read everything there is to read about everything, and that recognises what sorts of thing the words and phrases in a text are likely to be, based on how they appear in the text we provide and what other words appear around them.

So I wonder: do I need to train a new classifier that is a bit more specialised in terms of its familiarity with the sorts of thing you find in fairy tales?! Or was the original text really like that (are there really lots of mentions of “dollar” amounts?!) Or has the text been got at in some way, perhaps in editing or translation, on its way to the website I downloaded my source texts from?

Note

Maybe I should do my own digitisation project to extract the text from copies of the original books on the Internet Archive?

Having identified various fragments of text as currency related, can we actually convert those phrases into something even more abstract, such as numerical amounts and recognisable currencies or units?

The quantulum3 spell book is useful to us here… so let’s create an invocation of our own around it:

#%pip install quantulum3
from quantulum3.parser import parse as qp

def quantity_conversions(quantities):
    """Attempt to identify quantities and units."""
    # We'll keep track of which phrases the parser accepted
    conversions = []
    # and those it raised an error on
    not_converted = []

    for q in quantities:
        try:
            # An empty list means the phrase parsed but no quantity was found
            conversions.append((q, qp(q)))
        except Exception:
            not_converted.append(q)

    # How did we do?
    return conversions, not_converted

And now let’s see if we can identify some monetary amounts and currencies, or units:

money = df_long[money_filter]["entity_value"].value_counts().index.tolist()

quantity_conversions(money[:10])
([('a penny',
   [Quantity(1, "Unit(name="penny", entity=Entity("currency"), uri=Penny)")]),
  ('a hundred dollars',
   [Quantity(100, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
  ('three hundred dollars',
   [Quantity(300, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
  ('two hundred dollars',
   [Quantity(200, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
  ('the hundred dollars', []),
  ('ten dollars',
   [Quantity(10, "Unit(name="dollar", entity=Entity("currency"), uri=Dollar)")]),
  ('the thousand dollars', []),
  ('the hundred marks', [])],
 ['a few pence', 'only a few hundred yards'])
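Three of the phrases that came back empty start with “the”, whereas “a hundred dollars” converted cleanly. One thing that might be worth trying is to swap a leading “the” for “a” before parsing. The helper below is hypothetical and not part of quantulum3, and whether it rescues “the hundred marks” is untested.

```python
import re


def normalise_quantity_phrase(phrase):
    """Replace a leading 'the' with 'a' (hypothetical helper, not part of quantulum3)."""
    # \s+ after "the" makes sure words such as "then" are left alone
    return re.sub(r"^\s*the\s+", "a ", phrase, flags=re.IGNORECASE)


# The three phrases that parsed to an empty list in the output above
unparsed = ["the hundred dollars", "the thousand dollars", "the hundred marks"]
retry = [normalise_quantity_phrase(p) for p in unparsed]
print(retry)

# In the notebook, the normalised phrases could then be passed back through
# quantity_conversions(retry) to see whether any more of them convert
```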

If we can start to glue these various pieces together, it’s not hard to imagine labeling each story with monetary amounts, which means we can then start to search for stories that mention sums of a certain magnitude (for example, “more than a thousand dollars”).

Various currency conversion packages also exist, which means we could start to imagine searching across currencies, if we can identify appropriate currency conversion rates. For example, “give me all stories mentioning an amount of at least 1000 gold coins, or equivalent”.
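To make that sort of search a little more concrete, here is a minimal sketch of a threshold query over converted amounts. Everything in it is an assumption made for illustration: the `story_amounts` and `rates` tables, the placeholder story titles and amounts, and above all the made-up conversion rates into “gold coins”.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE story_amounts (title TEXT, amount REAL, unit TEXT)")

# Placeholder rows: the titles and amounts are made up for illustration
conn.executemany(
    "INSERT INTO story_amounts VALUES (?, ?, ?)",
    [
        ("Story A", 100, "dollar"),
        ("Story B", 1000, "dollar"),
        ("Story C", 1, "penny"),
        ("Story D", 300, "dollar"),
    ],
)

# Made-up conversion rates into a common unit ("gold coins")
conn.execute("CREATE TABLE rates (unit TEXT PRIMARY KEY, gold_coins REAL)")
conn.executemany(
    "INSERT INTO rates VALUES (?, ?)",
    [("dollar", 2.0), ("penny", 0.02)],
)


def stories_mentioning_at_least(conn, threshold):
    """Return titles of stories mentioning an amount worth at least `threshold` gold coins."""
    query = """
        SELECT DISTINCT title
        FROM story_amounts JOIN rates USING (unit)
        WHERE amount * gold_coins >= ?
        ORDER BY title
    """
    return [row[0] for row in conn.execute(query, (threshold,))]


print(stories_mentioning_at_least(conn, 500))
```

Keeping the rates in their own table means the exchange rate for a fairy tale dollar can be argued about, and changed, without touching the amounts extracted from the stories.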

How Far Have We Still To Go?#

As well as monetary amounts, did our classifier genie spot any other quantities in the texts?

quantity_filter = df_long["entity_typ"]=="QUANTITY"

df_long[quantity_filter]["entity_value"].value_counts().head(10)
one foot           15
a mile             12
a few miles         7
twenty miles        6
a hundred miles     5
three miles         5
seven miles         3
a few yards         3
a few feet          3
several miles       3
Name: entity_value, dtype: int64

We can also attempt to parse these values using the quantulum3 incantation:

# Get a list of quantities extracted from our texts
quantities = df_long[quantity_filter]["entity_value"].value_counts().index.tolist()

quantity_conversions(quantities[:10])
([('one foot',
   [Quantity(1, "Unit(name="foot", entity=Entity("length"), uri=Foot_(unit))")]),
  ('a mile',
   [Quantity(1, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
  ('twenty miles',
   [Quantity(20, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
  ('a hundred miles',
   [Quantity(100, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
  ('three miles',
   [Quantity(3, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
  ('seven miles',
   [Quantity(7, "Unit(name="mile", entity=Entity("length"), uri=Mile)")]),
  ('several miles', [])],
 ['a few miles', 'a few yards', 'a few feet'])

How Many Did You Say?#

We can also identify cardinal values, which is to say, numbers:

cardinal_filter = df_long["entity_typ"]=="CARDINAL"

cardinals = df_long[cardinal_filter]["entity_value"].value_counts().index.tolist()
cardinals[:10]
['one',
 'two',
 'three',
 'half',
 'One',
 'four',
 'six',
 'twelve',
 'seven',
 'thousand']

Again, can we cast these to numerics by the power of quantulum3?

quantity_conversions(cardinals[:10])
([('one', []),
  ('two',
   [Quantity(2, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
  ('three',
   [Quantity(3, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
  ('half',
   [Quantity(0, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
  ('One', []),
  ('four',
   [Quantity(4, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
  ('six',
   [Quantity(6, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
  ('twelve',
   [Quantity(12, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
  ('seven',
   [Quantity(7, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]),
  ('thousand', [])],
 [])
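The conversion above misses “one”, “One” and “thousand”, and turns “half” into 0. For single number words, a small lookup table might be a simpler fallback. The sketch below is an assumption rather than anything quantulum3 provides, and the word list is deliberately incomplete.

```python
# A small, deliberately incomplete lookup of number words; extend it with
# whatever words actually turn up in the CARDINAL list
NUMBER_WORDS = {
    "half": 0.5,
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
    "hundred": 100, "thousand": 1000,
}


def cardinal_to_number(text):
    """Convert a single number word or digit string to a number, or return None."""
    # Lower-casing means "One" and "one" are treated alike
    word = text.strip().lower()
    if word in NUMBER_WORDS:
        return NUMBER_WORDS[word]
    if word.isdigit():
        return int(word)
    # Anything else (for example "several") is left unconverted
    return None


for cardinal in ["one", "One", "half", "thousand", "several"]:
    print(cardinal, cardinal_to_number(cardinal))
```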

Who’s There?#

As well as quantities, what people or sorts of people have been identified?

df_long[df_long["entity_typ"]=="PERSON"]["entity_value"].value_counts().head(10)
Fairy       31
Majesty     29
Prince      29
Queen       27
bush        12
wolf        11
Madam       11
Campbell    11
Lang        11
Jack        10
Name: entity_value, dtype: int64

Where’s There?#

How about geo-political entities (GPEs)?

df_long[df_long["entity_typ"]=="GPE"]["entity_value"].value_counts().head(10)
thou                11
Paris               10
France               9
Japan                8
Greece               7
Denmark              6
Ireland              6
Contes Armeniens     5
Gedichte             5
Finland              5
Name: entity_value, dtype: int64

When’s There?#

When did things happen?

df_long[df_long["entity_typ"]=="DATE"]["entity_value"].value_counts().head(10)
one day         188
One day         154
the day          86
three days       79
next day         62
the next day     61
a few days       50
all day          44
many years       41
that day         40
Name: entity_value, dtype: int64

And how about time considerations?

df_long[df_long["entity_typ"]=="TIME"]["entity_value"].value_counts().head(10)
night            117
evening          115
a few minutes     77
morning           69
next morning      63
one morning       58
Next morning      53
the night         47
midnight          44
one night         40
Name: entity_value, dtype: int64
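The DATE and TIME listings count “one day” and “One day”, and “next morning” and “Next morning”, as separate values. If the capitalisation does not matter, the counts could be merged by lower-casing first. A minimal sketch, using counts copied from the listings above:

```python
from collections import Counter

# Counts copied from the DATE and TIME listings above
date_counts = {"one day": 188, "One day": 154, "the day": 86}
time_counts = {"next morning": 63, "Next morning": 53, "one morning": 58}


def merge_case_variants(counts):
    """Add together the counts of values that differ only in capitalisation."""
    merged = Counter()
    for value, count in counts.items():
        merged[value.lower()] += count
    return merged


print(merge_case_variants(date_counts).most_common())
print(merge_case_variants(time_counts).most_common())
```

In the notebook itself, lower-casing the column before counting, for example with `df_long["entity_value"].str.lower().value_counts()`, should have the same effect.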

Organisationally Speaking…#

Were organisations or organisational types recognised?

df_long[df_long["entity_typ"]=="ORG"]["entity_value"].value_counts().head(10)
King        102
Princess     71
Court        48
Prince       42
Grimm        28
King's       23
Quick        18
eagle        12
Majesty      11
I.           10
Name: entity_value, dtype: int64

What’s a NORP? (Ah… Nationalities Or Religious or Political groups.)

df_long[df_long["entity_typ"]=="NORP"]["entity_value"].value_counts().head(10)
German        24
Danish        18
French        14
Russian       11
Indian        10
Christian      9
Italian        8
Portuguese     7
Spanish        7
Chinese        6
Name: entity_value, dtype: int64
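Coming back to the original aim of an index-like thing: once the entities are in long form, they can be inverted into a mapping from each entry to the stories that mention it. The sketch below is one possible approach, not a finished index. The rows are a handful of examples taken from the outputs above, and the choice of which entity types deserve an index entry is an assumption.

```python
from collections import defaultdict

# A few (title, entity type, entity value) rows in the long format used above
rows = [
    ("The Bronze Ring", "GPE", "Albania"),
    ("The Bronze Ring", "CARDINAL", "three"),
    ("Little Red Riding Hood", "PERSON", "Gaffer Wolf"),
    ("Rumpelstiltzkin", "PERSON", "Rumpelstiltzkin"),
    ("Rumpelstiltzkin", "TIME", "early dawn"),
]


def build_index(rows, entity_types=("PERSON", "GPE", "LOC")):
    """Map each entity value of the chosen types to the set of stories mentioning it."""
    index = defaultdict(set)
    for title, entity_typ, entity_value in rows:
        # Numbers, dates and times would swamp the index, so skip them here
        if entity_typ in entity_types:
            index[entity_value].add(title)
    return index


index = build_index(rows)

# Print the entries alphabetically, as a back-of-the-book index would
for entry in sorted(index, key=str.lower):
    print(f"{entry}: {', '.join(sorted(index[entry]))}")
```

As in a print index, the number of stories listed against an entry gives a rough sense of how important that person or place is across the collection.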