This article was published as a part of the Data Science Blogathon.
When exploring data, one compact visualization that shows the relationships among several variables is often more useful than a separate visualization for each variable. For high-dimensional numerical data, a single Parallel Coordinates plot can do the work of several bar or line charts (one for each numerical variable).
A Parallel coordinates plot is used to analyze multivariate numerical data. It allows a comparison of the samples or observations across multiple numerical variables.
The Parallel Coordinates plot of the iris dataset below draws each sample as a line running from the left-most axis to the right-most axis, so we can see how a sample behaves on each of the features.
Image source: https://upload.wikimedia.org/wikipedia/en/4/4a/ParCorFisherIris.png
From the above example, we can see that there are 4 variables/features: Sepal Width, Sepal Length, Petal Width and Petal Length. Each variable is represented by its own axis, and each axis has its own min-max range.
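To make the idea of per-axis ranges concrete, here is a minimal sketch in plain Python of how one sample could be turned into the corner points of its line. The function name `polyline_vertices`, the axis ranges and the sample values are all illustrative assumptions; plotting libraries do this scaling internally and may do it differently.

```python
def polyline_vertices(sample, axis_ranges):
    """Map one sample to (axis_index, relative_height) pairs.

    Each value is scaled with the min-max range of its own axis,
    so 0.0 is the bottom of that axis and 1.0 is the top.
    """
    vertices = []
    for i, (name, (lo, hi)) in enumerate(axis_ranges.items()):
        height = (sample[name] - lo) / (hi - lo)
        vertices.append((i, round(height, 3)))
    return vertices


# Hypothetical axis ranges and one hypothetical sample (values in cm)
axis_ranges = {
    "Sepal Width": (2.0, 4.5),
    "Sepal Length": (4.0, 8.0),
    "Petal Width": (0.0, 2.5),
    "Petal Length": (1.0, 7.0),
}
sample = {
    "Sepal Width": 3.5,
    "Sepal Length": 5.0,
    "Petal Width": 0.5,
    "Petal Length": 1.0,
}

print(polyline_vertices(sample, axis_ranges))
```

Because every axis is scaled independently, a value of 3.5 can sit high on one axis while 5.0 sits low on the next. This is why the lines are read axis by axis rather than by absolute height.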
Let’s use the Olympics 2021 dataset to illustrate the use of a parallel coordinates plot. This dataset has details about the
Read and prepare the data
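import pandas as pd
import matplotlib.pyplot as plt

# Load the three Excel files of the Olympics dataset
df_teams = pd.read_excel("data/Teams.xlsx")
df_atheletes = pd.read_excel("data/Athletes.xlsx")
df_medals = pd.read_excel("data/Medals.xlsx")

# info() prints its summary itself and returns None, so print() is not needed
df_teams.info()
df_atheletes.info()
df_medals.info()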
There is no missing data, so no specific missing data handling is needed.
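One way to double-check a statement like this is to count the null values per column. The sketch below uses a small made-up frame (the name `toy` and its contents are hypothetical stand-ins, not the Olympics data) that deliberately contains one missing value so the output is not all zeros.

```python
import pandas as pd

# Hypothetical stand-in for one of the loaded frames
toy = pd.DataFrame({
    "NOC": ["A", "B", "C"],
    "Discipline": ["Archery", "Judo", None],
})

# Count missing values per column; all zeros would mean nothing to handle
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isna().any().any())
print("Any missing values?", has_missing)
```

Running the same two checks on `df_teams`, `df_atheletes` and `df_medals` should report zero missing values if the data is as clean as the `info()` output suggests.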
Let’s find the number of disciplines each country participated in and the number of athletes each country sent, and then merge these counts with the medals table into a single data frame.
# Use consistent, descriptive column names in the medals table
df_medals.rename(columns={'Team/NOC': 'NOC', 'Total': 'Total Medals', 'Gold': 'Gold Medals',
                          'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)

# Number of distinct disciplines and distinct athletes per country
df_disciplines_per_country = df_teams.groupby(by='NOC').agg({'Discipline': 'nunique'})
df_atheletes_per_country = (
    df_atheletes.groupby(by='NOC')
    .agg({'Name': 'nunique'})
    .rename(columns={'Name': 'Athletes'})
)

# Merge everything into a single data frame; the inner joins keep only
# countries that appear in all three tables
df = pd.merge(left=df_disciplines_per_country, right=df_medals, how='inner', on='NOC')
df = pd.merge(left=df, right=df_atheletes_per_country, how='inner', on='NOC')
df.rename(columns={'NOC': 'Country'}, inplace=True)
df = df[['Country', 'Rank', 'Total Medals', 'Gold Medals', 'Silver Medals',
         'Bronze Medals', 'Athletes', 'Discipline']]

# Order by rank and rebuild a clean 0..n-1 index
df = df.sort_values(by='Rank').reset_index(drop=True)
df.head(10)
Final dataset after merging all the different datasets
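The aggregation and merge steps can be easier to follow on a tiny example. The sketch below uses made-up toy tables (the names `teams` and `medals` and their contents are hypothetical) to show what `nunique` counts and which rows an inner join silently drops.

```python
import pandas as pd

# Hypothetical toy data: one row per team entry, plus a medals table
teams = pd.DataFrame({
    "NOC": ["A", "A", "A", "B", "C"],
    "Discipline": ["Archery", "Archery", "Judo", "Judo", "Rowing"],
})
medals = pd.DataFrame({"NOC": ["A", "B"], "Total Medals": [5, 2]})

# Number of distinct disciplines per country
disciplines = teams.groupby("NOC")["Discipline"].nunique().reset_index()

# Inner join keeps only countries present in both tables
merged = disciplines.merge(medals, how="inner", on="NOC")
print(merged)

# An outer join with indicator=True shows which countries the inner join dropped
check = disciplines.merge(medals, how="outer", on="NOC", indicator=True)
dropped = check.loc[check["_merge"] != "both", "NOC"].tolist()
print("Dropped by the inner join:", dropped)
```

In this toy case country C took part but has no row in the medals table, so it disappears from the merged frame. The same would presumably apply to countries without medals in the real data, assuming the medals table lists only medal winners.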
# Athletes and disciplines for the first 40 countries by rank, side by side
plt.figure(figsize=(20, 5))
ax = plt.subplot(1, 2, 1)
df[['Country', 'Athletes']][:40].plot.bar(x='Country', xlabel='', ax=ax)
ax = plt.subplot(1, 2, 2)
df[['Country', 'Discipline']][:40].plot.bar(x='Country', xlabel='', ax=ax)
# Rank and stacked medal counts for the same 40 countries
plt.figure(figsize=(20, 5))
ax = plt.subplot(1, 2, 1)
df[['Country', 'Rank']][:40].plot.bar(x='Country', xlabel='', ax=ax)
ax = plt.subplot(1, 2, 2)
df[['Country', 'Gold Medals', 'Silver Medals', 'Bronze Medals']][:40].plot.bar(
    stacked=True, x='Country', xlabel='', ax=ax)
# Parallel coordinates for the first 20 countries by rank, one line per country
df_20 = df.head(20).copy()
df_20 = df_20[['Country', 'Athletes', 'Discipline', 'Rank', 'Total Medals',
               'Gold Medals', 'Silver Medals', 'Bronze Medals']]
plt.figure(figsize=(16, 8))
pd.plotting.parallel_coordinates(df_20, 'Country', color=('#556270', '#4ECDC4', '#C7F464'))
Parallel Coordinates Plot using Pandas
With the pandas interface, we have 2 issues
In a parallel coordinates plot with px.parallel_coordinates, each row (or sample) of the DataFrame is represented by a polyline mark which traverses a set of parallel axes, one for each of the dimensions.
import plotly.express as px

df_ = df.copy()

# color: values from this column are used to assign color to the polylines.
# dimensions: values from these columns form the axes in the plot.
fig = px.parallel_coordinates(
    df_,
    color="Rank",
    dimensions=['Rank', 'Athletes', 'Discipline', 'Total Medals'],
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=2,
)
fig.show()
Parallel Coordinates Plot using Plotly express
import plotly.graph_objects as go

df_ = df.copy()

dimensions = [
    dict(range=(df_['Rank'].min(), df_['Rank'].max()),
         tickvals=df_['Rank'], ticktext=df_['Country'],
         label='Country', values=df_['Rank']),
    dict(range=(df_['Athletes'].min(), df_['Athletes'].max()),
         label='Athletes', values=df_['Athletes']),
    dict(range=(df_['Discipline'].min(), df_['Discipline'].max()),
         label='Discipline', values=df_['Discipline']),
    dict(range=(df_['Total Medals'].min(), df_['Total Medals'].max()),
         label='Total Medals', values=df_['Total Medals']),
    dict(range=(df_['Gold Medals'].min(), df_['Gold Medals'].max()),
         label='Gold Medals', values=df_['Gold Medals']),
    dict(range=(df_['Silver Medals'].min(), df_['Silver Medals'].max()),
         label='Silver Medals', values=df_['Silver Medals']),
    dict(range=(df_['Bronze Medals'].min(), df_['Bronze Medals'].max()),
         label='Bronze Medals', values=df_['Bronze Medals']),
]

fig = go.Figure(data=go.Parcoords(line=dict(color=df_['Rank'], colorscale='agsunset'),
                                  dimensions=dimensions))
fig.show()
Parallel Coordinates Plot using Plotly graph objects
This is definitely a better plot than what pandas gave us, but the default figure size is too small and the labels are cut off. Let us adjust the size and margins using update_layout.
# Adjust the size to fit all the labels
fig.update_layout(width=1200, height=800, margin=dict(l=150, r=60, t=60, b=40))
fig.show()
There is still one issue. The USA won the most medals but is displayed at the bottom of the Country axis. This produces unnecessary crisscrossed lines and is not very intuitive. We would like to see the countries in descending order, with the top-ranked country at the top.
# Let's reverse the min and max values for the Rank, so that the country with top rank comes on the top.
dimensions = [
    dict(range=(df_['Rank'].max(), df_['Rank'].min()),
         tickvals=df_['Rank'], ticktext=df_['Country'],
         label='Country', values=df_['Rank']),
    dict(range=(df_['Athletes'].min(), df_['Athletes'].max()),
         label='Athletes', values=df_['Athletes']),
    dict(range=(df_['Discipline'].min(), df_['Discipline'].max()),
         label='Discipline', values=df_['Discipline']),
    dict(range=(df_['Total Medals'].min(), df_['Total Medals'].max()),
         label='Total Medals', values=df_['Total Medals']),
    dict(range=(df_['Gold Medals'].min(), df_['Gold Medals'].max()),
         label='Gold Medals', values=df_['Gold Medals']),
    dict(range=(df_['Silver Medals'].min(), df_['Silver Medals'].max()),
         label='Silver Medals', values=df_['Silver Medals']),
    dict(range=(df_['Bronze Medals'].min(), df_['Bronze Medals'].max()),
         label='Bronze Medals', values=df_['Bronze Medals']),
]

fig = go.Figure(data=go.Parcoords(line=dict(color=df_['Rank'], colorscale='agsunset'),
                                  dimensions=dimensions))
fig.update_layout(width=1200, height=800, margin=dict(l=150, r=60, t=60, b=40))
fig.show()
Parallel Coordinates Plot using Plotly graph objects
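The list of dimension dictionaries is repetitive, so it could also be generated. Below is a minimal sketch of a hypothetical helper, `build_dimensions` (not part of Plotly), which builds one dictionary per column and swaps the range endpoints for any column that should be drawn upside down. It is written in plain Python so it can be tried without Plotly. The sample numbers are the values for the top two countries quoted in this article.

```python
def build_dimensions(data, columns, reversed_columns=()):
    """Build one Parcoords-style dimension dict per column.

    `data` maps column names to sequences of numbers (a dict of lists
    here; a pandas DataFrame indexed by column name should behave the
    same way). Columns listed in `reversed_columns` get a (max, min)
    range so that small values are drawn at the top of the axis.
    """
    dimensions = []
    for col in columns:
        values = list(data[col])
        lo, hi = min(values), max(values)
        axis_range = (hi, lo) if col in reversed_columns else (lo, hi)
        dimensions.append(dict(range=axis_range, label=col, values=values))
    return dimensions


# Rank, athletes and total medals for the top two countries
data = {
    "Rank": [1, 2],
    "Athletes": [614, 400],
    "Total Medals": [113, 88],
}
dims = build_dimensions(
    data,
    ["Rank", "Athletes", "Total Medals"],
    reversed_columns=("Rank",),
)
for d in dims:
    print(d["label"], d["range"])
```

The resulting list is meant to be passed as `dimensions=` to `go.Parcoords`. The sketch leaves out the `tickvals` and `ticktext` entries that turn the Rank axis into a Country axis, so those would still need to be added to the first dictionary.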
Now the plot looks much better. Don’t you agree? If you follow the line for the United States of America, ranked first at the top of the Country axis, you can see that its 614 athletes participated in 18 disciplines and won a total of 113 medals, of which 39 are Gold. China, which fielded 400 athletes in 15 disciplines, is in 2nd position with 88 medals, including 37 Golds.
From this one chart, we can draw essentially the same insights as earlier, only this time from a single summary view.
What are the insights?
We saw how Parallel Coordinates plots, as compact visualizations, can bring out meaningful insights from high-dimensional multivariate numerical data. To generate them we used pandas and the Plotly Python library, which provides a lot of convenient functions.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.