A demonstration of how to use the ipysankeywidget
package to generate a Sankey diagram from a pandas
dataframe.
This notebook also demonstrates how widget libraries can also be thought of as code generators capable of generating reusable code that can be used directly elsewhere, or can be treated as an automatically generated "first draft" of the code for interactive chart that can be further enhanced and edited by hand to produce a more polished production quality output.
Originally motivated by Oli Hawkins' Internal migration flows in the UK [about].
#!pip3 install ipysankeywidget
#!jupyter nbextension enable --py --sys-prefix ipysankeywidget
import pandas as pd
#Data from ONS: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/migrationwithintheuk/datasets/matricesofinternalmigrationmovesbetweenlocalauthoritiesandregionsincludingthecountriesofwalesscotlandandnorthernireland
#Read in the CSV file
#If we specify the null character and thousands separator, the flows whould be read in as numerics not strings
df=pd.read_csv("../data/laandregionsquarematrices2015/regionsquarematrix2015.csv",
skiprows = 8,thousands=',',na_values='-')
df.head()
from ipysankeywidget import SankeyWidget
#The widget requires an edgelist with source, target and value columns
dfm=pd.melt(df,id_vars=['DESTINATION','Region'], var_name='source', value_name='value')
dfm.columns=['DESTINATION','target','source','value']
dfm['target']=dfm['target']+'_'
dfm.head()
#The SankeyWidget function expects a list of dicts, each dict specifying an edge
#Also check how to drop rows where the weight is NA
links=dfm.dropna()[['source','target','value']].to_dict(orient='records')
links[:3]
#Generate and display default styled Sankey diagram
SankeyWidget(value={'links': links},
width=800, height=800,margins=dict(top=0, bottom=0))
We can also add a colour mapping to the chart - provide the mapping based on the first letter of the area code:
colormap={'E':'#ffcc00','N':'green','S':'blue','W':'red'}
dfm['color']=dfm['source'].apply(lambda x: colormap[x[0]])
links = dfm.dropna()[['source','target','value','color']].to_dict(orient='records')
SankeyWidget(value={'links': links},
width=800, height=800,margins=dict(top=0, bottom=0))
The original diagram just showed flows between nations, and did not include intra-nation flows.
So let's drop flows between regions of the same country - that is, flows where the leading country code is the same for both the source and the target:
#Create a data frame with dropped flow between countries
#That is, ignore rows where the country code indication is the same between source and target
#Again, drop the rows with unspecificed flows
dfmb = dfm[dfm['source'].str[0]!=dfm['target'].str[0]].dropna()
links= dfmb[['source','target','value','color']].to_dict(orient='records')
SankeyWidget(value={'links': links}, width=800, height=800,margins=dict(top=0, bottom=0))
The original diagram aggregated counts for regions within a particular country. So let's do the same...
Start by defining a country mapping - we can also use this to label the country nodes rather more meaningfully.
Note that to prevent circularity, we distinguish between the source and target nodes by naming them slightly differently: the target node label identifiers have whitespace to distinguish them from the source node label identifiers.
countrymap={'E':'England','N':'Northern Ireland','S':'Scotland','W':'Wales'}
dfmb['countrysource']=dfmb['source'].apply(lambda x: countrymap[x[0]])
dfmb['countrytarget']=dfmb['target'].apply(lambda x: countrymap[x[0]]+' ')
Aggregate (sum) the counts on a country-country flow basis, as well as colouring by source country:
#Group the (regional) country-country rows and sum the flows, resetting the table to flat columns
dfmg = dfmb.groupby(['countrysource','countrytarget']).aggregate(sum).reset_index()
#Rename the columns as required by the Sankey plotter
dfmg.columns=['source','target','value']
#And apply colour
dfmg['color']=dfmg['source'].apply(lambda x: colormap[x[0]])
dfmg
Now we can render this table to give a country-country migrant flow Sankey diagram:
links=dfmg.to_dict(orient='records')
s=SankeyWidget(value={'links': links},
width=800, height=800,margins=dict(top=0, bottom=0,left=150,right=120))
s
One of the under-appreciated benefits that arises from using widget libraries to generate rich interactive outputs for use in live documents is that the generated code can also be reused elsewhere.
For example, it's not hard to see the benefits that might arise from being able to generate a flat image rendering of a generated chart such that that image can be reused elsewhere:
#!mkdir -p images
#Save a png version
s.save_png('images/mySankey.png')
Render the saved png as an image in a markdown cell:
But in many cases we can also render output code.
Some widget libraries generate HTML output files (or HTML fragments) from code templates, enriched with suitably formatted data when the output widget is generated.
In this case, the widget that is produced is an SVG file - which we can export as such, and then reuse directly elsewhere, or use as a first draft of our own customised version of the output chart:
#save svg
s.save_svg('images/mySankey.svg')
from IPython.display import SVG, display
display(SVG('images/mySankey.svg'))