This article was published as a part of the Data Science Blogathon.
We often need to visualize how data flows between entities. For example, consider how residents have migrated from one country to another within the UK. It would be interesting to see how many residents have migrated from England to, say, Northern Ireland, Scotland, and Wales.
From this Sankey diagram visualization, it is apparent that more residents have migrated from England to Wales than to Scotland or to Northern Ireland.
Sankey diagrams typically depict the flow of data from one entity (or node) to another.
The entity from or to which data flows is referred to as a node: the node where the flow originates is the source node (e.g. England on the left-hand side) and the node where the flow ends is the target node (e.g. Wales on the right-hand side). The source and target nodes are often represented as rectangles with a label.
The flow itself is represented by a straight or curved path called the link. The width of the link is proportional to the quantity of flow. In the above example, the flow (i.e. migration of residents) from England to Wales is wider than that from England to Scotland or Northern Ireland, indicating that more residents migrated to Wales than to the other countries.
Sankey diagrams can be used to represent the flow of energy, money, costs, or anything else that has a notion of flow.
Minard’s classic diagram of Napoleon’s invasion of Russia is perhaps the most famous example of the Sankey diagram. This visualization using the Sankey diagram displays very effectively how the French army progressed (or dwindled?) on its way to Russia and back.
Now, let’s see how we can use Python’s plotly library to plot a Sankey diagram.
For plotting a Sankey diagram, let’s use the Olympics 2021 dataset. This dataset has details about the medals tally – country, total medals, and the split across the gold, silver, and bronze medals. Let’s plot a Sankey diagram to understand how many of the medals a country won are Gold, Silver, and Bronze.
import pandas as pd
df_medals = pd.read_excel("Medals.xlsx")
print(df_medals.info())
df_medals.rename(columns={'Team/NOC':'Country', 'Total': 'Total Medals', 'Gold':'Gold Medals', 'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)
print(df_medals)
We will use Sankey from plotly’s go (graph_objects) interface, which takes 2 main parameters – node and link.
Note that all the nodes, both source and target, should have unique identifiers.
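In this case, the countries are the source nodes, the medal types (Gold, Silver, and Bronze) are the target nodes, and each link carries the number of medals of one type won by one country.
We will need to instantiate 2 Python dict objects to represent the nodes and the links, and pass these to plotly’s go interface Sankey.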
Each index of the lists – label, source, target, value, and color – corresponds to one node or link respectively.
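To make the index correspondence concrete, here is a minimal sketch of how the label, source, target, and value lists could be derived from a table instead of being typed by hand. The small DataFrame below is an assumption standing in for df_medals: it holds only the three rows and the renamed columns used in this article, and the names medal_columns and index_of are illustrative helpers, not part of plotly or pandas.

```python
import pandas as pd

# Stand-in for df_medals: three rows with the values used in this article.
df = pd.DataFrame({
    "Country": ["United States of America", "People's Republic of China", "Japan"],
    "Gold Medals": [39, 38, 27],
    "Silver Medals": [41, 32, 14],
    "Bronze Medals": [33, 18, 17],
})

# Map each medal column to the node label it should flow into.
medal_columns = {"Gold Medals": "Gold", "Silver Medals": "Silver", "Bronze Medals": "Bronze"}

# Every node (country or medal type) gets exactly one unique integer index.
labels = list(df["Country"]) + list(medal_columns.values())
index_of = {label: i for i, label in enumerate(labels)}

# One link per (country, medal type) pair; position i in each list describes link i.
source, target, value = [], [], []
for _, row in df.iterrows():
    for column, medal in medal_columns.items():
        source.append(index_of[row["Country"]])
        target.append(index_of[medal])
        value.append(int(row[column]))

print(labels)
print(source)
print(target)
print(value)
```

The resulting lists line up position by position, which is the layout the node and link dicts expect.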
import plotly.graph_objects as go

NODES = dict(
    #        0                           1                             2        3       4         5
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
)

LINKS = dict(
    source = [0, 0, 0, 1, 1, 1, 2, 2, 2],           # The origin or the source nodes of the link
    target = [3, 4, 5, 3, 4, 5, 3, 4, 5],           # The destination or the target nodes of the link
    value  = [39, 41, 33, 38, 32, 18, 27, 14, 17],  # The width (quantity) of the links
    # Color of the links
    # Target Node: 3 - Gold, 4 - Silver, 5 - Bronze
    color = [
        "lightgreen", "lightgreen", "lightgreen",        # Source Node: 0 - United States of America
        "lightskyblue", "lightskyblue", "lightskyblue",  # Source Node: 1 - People's Republic of China
        "bisque", "bisque", "bisque",                    # Source Node: 2 - Japan
    ],
)

data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.show()
Sankey diagram – a basic plot
Here we have a very basic plot. But do you notice how the diagram is too wide and Silver appears before the Gold? Let’s adjust the position of the nodes and the width.
Let’s add the x and y positions for the nodes to explicitly specify the positions of the nodes. The values should be between 0 and 1.
NODES = dict(
    #        0                           1                             2        3       4         5
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
    x = [0, 0,   0, 0.5, 0.5, 0.5],
    y = [0, 0.5, 1, 0.1, 0.5, 1],
)

data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country & Medals", font_size=16)
fig.show()
With this, we get a compact diagram:
Sankey diagram – node position adjusted
See below how the various parameters passed in the code map to the nodes and links in the diagram:
Sankey diagram – how code maps to diagram
The plot is interactive. You could hover on the nodes and the links for more information.
Sankey diagram – with default hover labels
Currently, the hover labels on both the nodes and the links show plotly’s default text.
Don’t you think these labels are too verbose? All of this can be improved.
Let’s improve the format of the hover labels using the hovertemplate parameter:
NODES = dict(
    #        0                           1                             2        3       4         5
    label = ["United States of America", "People's Republic of China", "Japan", "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
    x = [0, 0,   0, 0.5, 0.5, 0.5],
    y = [0, 0.5, 1, 0.1, 0.5, 1],
    hovertemplate=" ",
)
LINK_LABELS = []
for country in ["USA", "China", "Japan"]:
    for medal in ["Gold", "Silver", "Bronze"]:
        LINK_LABELS.append(f"{country}-{medal}")
LINKS = dict(
    source = [0, 0, 0, 1, 1, 1, 2, 2, 2],           # The origin or the source nodes of the link
    target = [3, 4, 5, 3, 4, 5, 3, 4, 5],           # The destination or the target nodes of the link
    value  = [39, 41, 33, 38, 32, 18, 27, 14, 17],  # The width (quantity) of the links
    # Color of the links
    # Target Node: 3 - Gold, 4 - Silver, 5 - Bronze
    color = [
        "lightgreen", "lightgreen", "lightgreen",        # Source Node: 0 - United States of America
        "lightskyblue", "lightskyblue", "lightskyblue",  # Source Node: 1 - People's Republic of China
        "bisque", "bisque", "bisque",                    # Source Node: 2 - Japan
    ],
    label = LINK_LABELS,
    hovertemplate="%{label}",
)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country & Medals", font_size=16)
fig.update_traces(valueformat='3d', valuesuffix=' Medals', selector=dict(type='sankey'))
fig.update_layout(hoverlabel=dict(bgcolor="lightgray", font_size=16, font_family="Rockwell"))
fig.show()
Sankey diagram – with improved hover labels
Nodes are referred to as source and target with respect to a link. A node that is a target for one link can be a source for another.
We saw how Sankey diagrams can be used to represent flows effectively and how the plotly Python library can be used to generate Sankey diagrams for a sample dataset.
About the author
A technical architect who also loves to break complex concepts into easily digestible capsules! Currently, finding my way around the fascinating world of data visualizations and data storytelling!