Migration Flows Sankey

As well as supporting the generation of parameterised reports, reproducible workflows also support the automated generation of (templated) code that implements interactive charts.

For example, inspired by Oli Hawkins (Visualising migration between the countries of the UK [demo]), we can generate interactive Sankey plots using the googleVis or rCharts packages.

Note that the packages generate standalone HTML/Javascript code to implement the charts, code that can be used elsewhere or embedded in other HTML pages.

#The RCharts package throws a wobbly if we don't load knitr in explicitly
library(knitr)
library(readr)

#Data from ONS: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/migrationwithintheuk/datasets/matricesofinternalmigrationmovesbetweenlocalauthoritiesandregionsincludingthecountriesofwalesscotlandandnorthernireland
regionsquarematrix2015 = read_csv("../data/laandregionsquarematrices2015/regionsquarematrix2015.csv", skip = 8)

#The data has thousand separator commas - so remove them and convert to numeric
#There is probably a more idiomatic way of doing this using tidyr...
regionsquarematrix2015 = cbind(regionsquarematrix2015[1:2],
                               sapply(regionsquarematrix2015[3:ncol(regionsquarematrix2015)],
                                      function(x)  as.numeric(gsub(",", "", x)) ) )

head(regionsquarematrix2015)
##                DESTINATION    Region E12000001 E12000002 E12000003
## 1               North East E12000001        NA      6870     10820
## 2               North West E12000002      6670        NA     22930
## 3 Yorkshire and The Humber E12000003     10830     22050        NA
## 4            East Midlands E12000004      3030     10300     19520
## 5            West Midlands E12000005      2260     13440      8220
## 6                     East E12000006      2850      7120      7600
##   E12000004 E12000005 E12000006 E12000007 E12000008 E12000009 W92000004
## 1      3580      2360      3560      4400      4580      2250      1010
## 2     11130     15000      8020     14870     12240      7570     10190
## 3     19280      8470      9530     11230     10680      5710      2910
## 4        NA     19180     20820     16010     19050      6980      3140
## 5     17110        NA      9390     17760     16540     13250      8260
## 6     15500      7630        NA     72460     28570      9500      3050
##   S92000003 N92000002
## 1      3350       630
## 2      6000      2150
## 3      3690       620
## 4      2310       540
## 5      2230       540
## 6      3030       700

The Sankey diagram generators seem to expect the data to be provided as edge lists (from, to, value).

library(tidyr)

#Melt the data (wide to long) so we have from/to/value flows
rr=regionsquarematrix2015 %>% gather(source, value, 3:ncol(.))

#Merge in names for the source areas
rr=merge(rr, unique(data.frame(SOURCE=rr$DESTINATION,
                               source=rr$Region)),
         by='source')

#The Sankey diagram generators dislike cycles - so set unique labels for from/to
rr$source=paste0(rr$source,'_')
rr$SOURCE=paste0(rr$SOURCE,' ')

#Drop rows that have no flow associated with them
rr=rr[!is.na(rr$value),]

colnames(rr) = c("source","targetName","target","value","sourceName")
rr = rr[,c("sourceName","targetName","source","target","value")]
head(rr)
##    sourceName               targetName     source    target value
## 2 North East                North West E12000001_ E12000002  6670
## 3 North East  Yorkshire and The Humber E12000001_ E12000003 10830
## 4 North East             East Midlands E12000001_ E12000004  3030
## 5 North East             West Midlands E12000001_ E12000005  2260
## 6 North East                      East E12000001_ E12000006  2850
## 7 North East                    London E12000001_ E12000007  6110

googleVis

googleVis is an R package that provides an R wrapper around/interface to Google Chart tools.

We can generate a Sankey diagram using googleVis from a data frame representing an edge list in the following way:

#For use in Rmd/knitr, set the block parameter: results='asis'
library(googleVis)
options(gvis.plot.tag='chart')

#Generate the Sankey diagram HTML
s=gvisSankey(rr[,c('source','target','value')])
#And render it
plot(s)

Notwithstanding the availability of from, to and weight parameters for specifying column names, the function appears to want the dataframe passed in in a particular way, specifically from, to, weight.

According to the Google Sankey diagram documentation, node labels, as well as the color of nodes and edges, can be controlled; see Sankey diagrams with googleVis or this StackOverflow answer for an example of how to pass the parameters in.

To color the nodes, we need to provide node colors in the order in which the nodes are added to the chart. We can find the node order by interleaving the source and target columns:

nodeOrder=unique(c(rbind(rr$source, rr$target)))

Add some node colors:

colormapl=c(E='#ffcc00',N='green',S='blue',W='red')

#Now we need to get the color for the node order.
nodeColor=unname(colormapl[substring(nodeOrder, 1, 1)])

#http://stackoverflow.com/a/32111596/454773
colors_node_array = paste0("[", paste0("'", nodeColor,"'", collapse = ','), "]")

opts = paste0("{ node: { colors: ", colors_node_array ," } }" )

s=gvisSankey(rr[,c('source','target','value')], options=list( sankey=opts))
plot(s)

Add some edge colors, again in the order of edges supplied.

#Use the originating node colour for the edge
opts = paste0("{ link: { colorMode: 'source' },node: { colors: ", colors_node_array ," } }" )
s=gvisSankey(rr[,c('source','target','value')], options=list(sankey=opts))
plot(s)

The labels are the node values. If we map the identifiers to (distinct) labels, they make the chart more informative.

s=gvisSankey(rr[,c('sourceName','targetName','value')], options=list(sankey=opts))
plot(s)

Generate a view of the chart that omits flows within the same country.

#Limit the rows
rr2=rr[substring(rr$target, 1, 1)!=substring(rr$source, 1, 1),]

#ABstract out the code that allows us to generate a new color array
setNodeColors=function(df,source='source',target='target'){
  #Interleave the nodes from the edgelist in the order they are introduced
  nodeOrder=unique(c(rbind(df[[source]], df[[target]])))
  #Generae a color mapping from the country indicator at the start of the country/region code
  nodeColor=unname(colormapl[substring(nodeOrder, 1, 1)])
  #Get the data in the form that the Sankey widget wants it...
  colors_node_array = paste0("[", paste0("'", nodeColor,"'", collapse = ','), "]")
  colors_node_array
}

colors_node_array=setNodeColors(rr2)

opts = paste0("{ link: { colorMode: 'source' }, node: { colors: ", colors_node_array ," } }" )
s=gvisSankey(rr2[,c('sourceName','targetName','value')], options=list(sankey=opts))
plot(s)

Finally, how about we group the (English) regional flows to a single English flow.

library(dplyr)
countrymap=c(E='England',N='Northern Ireland',S='Scotland',W='Wales')
  
rr2$countrysource=countrymap[substring(rr2$source, 1, 1)]
rr2$countrytarget=paste0(countrymap[substring(rr2$target, 1, 1)],' ')
rrg = rr2 %>%
          group_by(countrysource,countrytarget) %>%
          summarise(value = sum(value))

#Generate new color array
colors_node_array=setNodeColors(rrg,'countrysource','countrytarget')

opts = paste0("{ link: { colorMode: 'source' }, node: { colors: ", colors_node_array ," } }" )
s=gvisSankey(rrg,options=list(sankey=opts))
plot(s)

We can also change the edge color to a gradient between the source and target color values, but this just looks a horrible mess to me!

opts = paste0("{ link: { colorMode: 'gradient' }, node: { colors: ", colors_node_array ," } }" )
s=gvisSankey(rrg,options=list(sankey=opts))
plot(s)

rCharts

Generate a Sankey diagram using rCharts:

#Based on http://bl.ocks.org/timelyportfolio/6085852
#There is also a particle flow enhancement demoed at https://bl.ocks.org/micahstubbs/6a366e759f029599678e293521d7e26c
library(rCharts)

sankeyPlot2 <- rCharts$new()
sankeyPlot2$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey/')
sankeyPlot2$set(
  data = rr[,c('source','target','value')],
  nodeWidth = 15,
  nodePadding = 10,
  layout = 32,
  width = 750,
  height = 500
)

sankeyPlot2$show('iframesrc', cdn = TRUE)
#Note that at the time of writing, the rCharts_d3_sankey bakes in the http protocol for loading three
#resources that breaks if the output HTML page is served as https.

Some control over colouring can be introduced by extending the template, as demonstrated in this Stack Overflow answer (following this original explanation from @timelyportfolio, the author of the rCharts Sankey package.)

A wide range of interactive chart types can be generated in this way. The htmlwidgets for R project represents the latest iteration in the production of interactive Javascript widgets for use in RMarkdown documents and Shiny applications.

sankeyD3

This seems to be the most recent attempt at an R/Sankey diagram library, again using D3.js.

The library appears to require nodes being identified as consecutive integers, starting at 0.

#devtools::install_github("fbreitwieser/sankeyD3")
library(sankeyD3)
library(plyr)

#Get a mapping from codes to numeric node IDs
#Need to interleave the nodes appropriately
rrd=data.frame(rid= c(rbind(regionsquarematrix2015$Region,
                            paste0(regionsquarematrix2015$Region,'_'))) )

rrd['num']=0:(nrow(rrd)-1)

rrd['name']=c(rbind(c(regionsquarematrix2015$DESTINATION,
                      paste0(regionsquarematrix2015$DESTINATION,'_'))))

#Map the edges
rr$source=as.integer(mapvalues(unlist(rr$source),from=unlist(rrd['rid']),to=unlist(rrd['num'])))
rr$target=as.integer(mapvalues(unlist(rr$target),from=unlist(rrd['rid']),to=unlist(rrd['num'])))

sankeyNetwork(Links = rr, Nodes = rrd, Source = "source", title="Migration Flows",
            Target = "target", Value = "value", NodeID = "name",
              fontSize = 12, nodeWidth = 30,showNodeValues = FALSE)