Data Visualization Techniques You MUST Know as a Data Scientist

Harsh Last Updated : 12 Nov, 2024

11 min read

This article was published as a part of the Data Science Blogathon

Introduction

Have you ever been confused while reading large data tables? Don’t worry, that happens to everyone (including me)!

Image source: Google Images

With around 2.5 quintillion bytes of data being generated each day, including text messages, emails, videos, images, etc., it becomes difficult to analyze all this data and find insights in it. FYI, a quintillion is 10¹⁸.

Thomas Redman has stated that:

“Where there is data smoke, there is business fire.”

After reading this article, you will know:

What and Why of the Data Visualizations
Different types of plots including Relational Plots, Regression Plots, Distribution Plots, Multi-plot Grids, Categorical Plots, and Matrix Plots.
How to Choose a Plot?

So, let’s begin finding insights!

What is Data Visualization?
Why Data Visualizations?
Plot Types
- Relational Plots
- Regression Plots
- Distribution Plots
- Multi-plot Grids
- Categorical Plots
- Matrix Plots
How to Choose a Plot?
Summary
Endnotes

What is Data Visualization?

Data visualization is the technique of translating information or data into a visual context, such as a map or graph, to make data easier for the human brain to understand.

Why Data Visualizations?

Image source: Google Images

Indeed, you’ve understood the meme very well. Still, let me throw in some statistics and psychology on why we need data visualizations.

Psychology

Our brain is uniquely built to recognize patterns and connections. Thus, presenting data visually accelerates our perception and helps to reduce cognitive load.

Statistics

Companies with the most advanced analytics capabilities are:

2x more likely to be in the top quartile of their financial performance.
2x more likely to use data very frequently when making decisions.
3x more likely to execute decisions as intended.
5x more likely to make decisions much faster than market peers. (Source: Bain & Company)

In simple words, it helps us understand data better and faster 🙂

Types of Plots

Enough of these theories now, right? Let’s get our hands dirty on Data Visualization techniques. We’ll be using the seaborn library for Data Visualizations and a couple of datasets. You may read seaborn docs here and can download all the datasets from here to follow along.

Please note that we’ll mostly be using figure-level functions instead of axes-level functions. Axes-level functions plot data onto a single matplotlib.pyplot.Axes object, while figure-level functions work through a seaborn object, usually a FacetGrid, that manages the figure. Each plot category has one figure-level function that can draw every axes-level plot in that category. Refer to the following diagram.

[Figure: types of plots]

For example, relplot() is the figure-level function for relational plots. It can draw both scatterplot() and lineplot(); the default is scatterplot(), and setting kind='line' draws a line chart instead. The same applies to the other plot types.

I highly recommend that you read more about this.

Grab the datasets? Let’s begin by importing the necessary modules.

Importing Modules

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_theme(style='darkgrid', color_codes=True)

Let’s dive deep into the different types of plots.

Relational Plots

Relational plots show the statistical relationship between variables. They help us understand how variables in a dataset relate to each other and how those relationships depend on other variables.

We will be using the relplot() function for relational plots. By default, this function draws a scatter plot. We can use kind='line' to draw a line plot.

Note: One can also use the lineplot() and the scatterplot() functions to draw a line plot and a scatter plot respectively but these are the Axes-level functions.

Line Plot

A line plot is simply a set of data points connected by a line. Line plots are generally used to track changes over a period of time. Sometimes, they are also used to compare changes over time for different groups.

covid_country_wise_data = pd.read_csv('/content/drive/MyDrive/AV-Article/Datasets/covid_country_wise.csv', index_col=0)
covid_country_wise_data.head()

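# The line plot uses the day-wise data (one row per date, with 'Date' and
# 'Confirmed'), which must be loaded into covid_day_wise_data with
# pd.read_csv() in the same way; its file path is not shown in this article.
sns.relplot(x='Date', y='Confirmed', data=covid_day_wise_data.iloc[::10, :], kind='line', height=4.5, aspect=2)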
plt.ylabel('Rate', fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.title('Daily Confirmed COVID-19 Cases Globally', fontsize=16)
plt.xticks(rotation=90)

Note: .iloc[::10, :] in the data parameter denotes that we’re plotting the line chart at an interval of 10. If we omit this, the date values on the x-axis will be overlapped. You may try it on your own.
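If the slicing in the note above looks cryptic, here is a small sketch of what it does. The demo DataFrame is a hypothetical stand-in with 30 consecutive dates, not the real COVID dataset.

```python
import pandas as pd

# Hypothetical stand-in for the day-wise data: 30 consecutive dates
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-22", periods=30, freq="D").strftime("%Y-%m-%d"),
    "Confirmed": range(30),
})

# Keep every 10th row (positions 0, 10, 20) and all columns
thinned = demo.iloc[::10, :]
print(thinned)
```

Keep in mind that this removes the in-between rows from the plotted data itself, not only from the tick labels, so the line is drawn through every 10th observation.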

Output:

[Figure: line plot]

Inference

COVID cases start rising exponentially from April.

Line Chart Use Cases

Stock prices over a period of time.
Total views on this article over a period of time.

Scatter Plot

A scatter plot uses dots to represent values for two different numeric variables. Generally, it is used to observe relationships between two variables.

insurance_data = pd.read_csv('/content/drive/MyDrive/AV-Article/Datasets/insurance.csv')
insurance_data.head()

sns.relplot(data=insurance_data, x='bmi', y='charges', hue='smoker', height=5, aspect=1.5)
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.title('BMI vs Charges Based on Smoking Habit', fontsize=16)

hue parameter is used to separate the data based on the different categories present in a provided column (smoker in this case). This can be used while performing classification by passing a target variable as hue.

height and aspect are the parameters to set a figure size in Figure-level functions, width = height * aspect. Note that we can’t use matplotlib figsize to set a figure size in Figure-level functions.
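Here is a quick sketch to check the width = height * aspect rule on a made-up DataFrame (the toy data is my own assumption). I left out hue on purpose, because a legend drawn outside the axes may add some extra width to the figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [3, 1, 4, 2]})

# One facet, 5 inches tall, 1.5 times as wide as it is tall
g = sns.relplot(data=toy, x="x", y="y", height=5, aspect=1.5)

# Figure size in inches: the width should be height * aspect = 5 * 1.5
width, height = g.figure.get_size_inches()
print(width, height)

plt.close("all")
```

Note that height and aspect describe each facet, so a grid with several columns or rows gets proportionally larger.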

Output:

[Figure: scatter plot]

Inference

People with a higher BMI tend to pay higher charges than people with a lower BMI.
Smokers tend to pay more charges compared to non-smokers.

Understandable, right?

Scatter Plot Use Cases

Weight and Height Distributions.
Beach visitors based on the temperature.

Regression Plots

This type of plot fits a linear regression line onto a scatter plot, which gives a quick visual check of whether two variables are linearly related. We’ll be using regplot() and lmplot() in this section. Both functions draw a scatter plot of two variables x and y, then fit the regression model and plot the resulting regression line.

regplot() is an axes-level function that does not provide hue as a parameter while lmplot() is a figure-level function that provides hue as a parameter.
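To get a feel for what the fitted line represents, here is a minimal sketch of a straight-line least-squares fit on made-up points. I am using np.polyfit purely for illustration; it is not necessarily what seaborn calls internally, and the data is hypothetical.

```python
import numpy as np

# Hypothetical toy data that lies exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Degree-1 least-squares fit: returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 6), round(intercept, 6))

# The regression line evaluated at the original x values
fitted = slope * x + intercept
```

On real data the points will not sit exactly on the line, and the slope and intercept are the values that minimize the squared vertical distances to it.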

Reg Plot

plt.figure(figsize=(10, 6))
sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.title('BMI vs Charges with Regression Line', fontsize=16)

Output:

[Figure: regression plot]

Inference

Shows the linear relationship between BMI and Charges.

LM Plot

sns.lmplot(data=insurance_data, x='bmi', y='charges', hue='smoker', height=5, aspect=1.7)
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.title('BMI vs Charges Based on Smoking Habit with Multiple Regression Line', fontsize=16)

Output:

Inference

Shows the linear relationship between BMI and Charges based on smoking habits.

Let’s add one more parameter!

sns.lmplot(data=insurance_data, x='bmi', y='charges', hue='smoker', col='sex')

Output:

Inference

Shows the linear relationship between BMI and Charges based on smoking habits separately for a male and a female.

Looks Interesting? Try the following code on your own and interpret the results! Note that we’re adding one more parameter.

sns.lmplot(data=insurance_data, x='bmi', y='charges', hue='smoker', col='sex', row='region')

Let me know the inference in the comments section!

Distribution Plots

This type of plot is mainly used to see how variables are distributed, which is an early step in data analysis. These plots help us perform univariate analysis, which can answer the following important questions:

What ranges do the observations cover?
Are they skewed or not?
Are there any outliers?
Do the answers to the above questions vary across the different subsets of the data?

As shown in the first diagram of the Types of Plots section, displot() is a figure-level function whose default behavior is histplot(), the most common approach to visualizing a distribution. We can use the kind parameter to draw other kinds of distribution plots.

Histograms

Histograms are used to perform a Univariate or Bivariate Analysis on the datasets.

sns.displot(data=iris_data, x='Sepal Length (cm)', aspect=1.6, bins=30)
plt.title('Sepal Length Distribution', fontsize=16)

Output:

Inference

Most flowers have a sepal length in the range of 5.5 to 6.8.

Let us add our favorite parameter i.e. hue 🙂

sns.displot(data=iris_data, x='Sepal Length (cm)', hue='Species', aspect=1.6, element='step', bins=30)
plt.title('Species wise Sepal Length Distribution', fontsize=16)

Output:

Inference

Most setosa flowers lie in the range of 4.9 to 5.1.
Only setosa flowers have a sepal length less than 5.
Only virginica flowers have a sepal length greater than 7.

KDE Plot

A Kernel Density Estimate (KDE) plot is analogous to a histogram, but it represents the data using a continuous probability density curve in one or more dimensions.
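To show where that continuous curve comes from, here is a hand-rolled sketch of a one-dimensional Gaussian KDE in plain NumPy. The function name gaussian_kde_sketch, the sample values and the fixed bandwidth of 0.3 are all assumptions for illustration; seaborn chooses a bandwidth automatically, so its curve will generally look different.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

samples = [4.9, 5.0, 5.1, 5.4, 6.3, 6.5]  # hypothetical measurements
grid = np.linspace(2.0, 9.0, 701)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.3)

# A density curve should enclose an area of about 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths them together.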

sns.displot(data=iris_data, x='Sepal Length (cm)', hue='Species', kind='kde', aspect=1.6)
plt.title('Species wise Sepal Length KDE', fontsize=16)

Output:

Inference

Setosa flowers have the highest density, and their distribution is less skewed.
Virginica flowers have the lowest density, and their distribution is highly skewed.

Multi-plot Grids

This type of plot helps us to show the relationship between two variables by plotting two or more separate graphs. This is one of the favorite choices of Data Scientists as this provides Univariate and Bivariate Analysis in a single figure.

Joint Plot

It combines the functionality of a scatter plot and a histogram or a KDE plot.

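We can use the jointplot() function to draw a joint plot. It draws a scatter plot in the middle, which helps with bivariate analysis, and marginal plots at the top and on the right side, which help with univariate analysis. The marginal plots are KDE plots when hue is provided, as in the example below, and histograms otherwise.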

sns.jointplot(data=iris_data, x='Sepal Length (cm)', y='Petal Length (cm)', hue='Species')

Output:

Inference

The sepal length of Setosa flowers lies in the range of 4-6.
The petal length of Setosa flowers lies in the range of 1-2 and it is easily distinguishable from the other species of flowers.
Petal length distribution for Setosa flowers is less skewed while it is highly skewed for Versicolor and Virginica flowers.

Pair Plot

It helps us to plot pairwise relationships in a dataset.

In the simplest invocation, pairplot() uses scatterplot() for each pairing of the variables. For the marginal plots along the diagonal, it uses histplot() if we do not provide the hue parameter and kdeplot() if we do.

We can use the pairplot() function to draw a pair plot.

sns.pairplot(data=iris_data, hue='Species', height=2)

Output:

Inference

Same as above.
Setosa flowers distribution is less skewed for Sepal Length, Petal Length and Petal Width while it is highly skewed for Sepal Width.

Categorical Plots

In the relational plots, we focused on cases where the main relationship was between two numerical variables. If one of the variables is categorical, it may be helpful to use a more specialized approach to visualization i.e. Categorical Plots.

There are a number of axes-level functions, such as barplot() and countplot(), for drawing categorical plots, but we’ll be using catplot(), a figure-level function that gives unified, higher-level access to the different axes-level functions.

Let’s get started!

Bar Plot

A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle.
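By default, the estimate seaborn uses for the bar height is the mean of each category. Here is a small pandas sketch of that calculation on made-up numbers (the toy DataFrame is an assumption of mine, loosely modeled on the insurance columns).

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no"],
    "charges": [30000.0, 34000.0, 4000.0, 6000.0, 8000.0],
})

# Mean of the numeric variable per category: these are the bar heights
bar_heights = toy.groupby("smoker")["charges"].mean()
print(bar_heights)
```

When every category has a single value, as with one delay figure per month, the bar height is simply that value.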

sns.catplot(data=flight_delays_data, x=flight_delays_data.index, y='HA', kind='bar', aspect=1.75)
plt.ylabel('Delays')
plt.title('Month wise Delays of the HA Airline', fontsize=16)

Inference

Highest Delay in August.
Lowest Delay in November.

Count Plot

This type of plot helps us to show the count of each category of the categorical variables. It can be thought of as a histogram for categorical variables.
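The numbers behind a count plot with hue are just a cross-tabulation of two categorical columns. A minimal sketch with made-up rows (the toy data is an assumption, not the insurance dataset):

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no", "yes"],
    "sex": ["male", "male", "female", "male", "female", "female"],
})

# Rows: smoker categories, columns: sex categories, cells: number of rows
counts = pd.crosstab(toy["smoker"], toy["sex"])
print(counts)
```

Each cell of this table corresponds to the height of one bar in the count plot.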

sns.catplot(data=insurance_data, x='smoker', hue='sex', kind='count', aspect=1.5)
plt.xlabel('Smoker')
plt.title('Gender wise Total Smoker and Non Smoker', fontsize=16)

[Figure: count plot]

Inference

More male smokers compared to female smokers.

Swarm Plot

This can be thought of as a scatter plot for the categorical variables. It shows all the data points in a figure which helps us to identify the outliers.

sns.catplot(data=insurance_data, x='smoker', y='charges', kind='swarm', aspect=1.5)
plt.title('Swarm Plot - Charges Based on Smoking Habits', fontsize=16)

Inference

Most non-smokers lie in the charges range of roughly 1,000-14,000.
More outliers in the case of non-smokers compared to smokers.
Smokers’ charges vary more than non-smokers’.

Box and Whisker Plot

This is one of the most used plots in the field of Data Science. It shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. It helps us to detect outliers more easily compared to the swarm plots.

Please refer to the following diagram.

Image Source: Google Images
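The pieces of a box plot can be computed by hand. Below is a sketch with made-up values, assuming the common 1.5 * IQR rule for flagging outliers (this is the default in seaborn and matplotlib, but it can be changed with the whis parameter).

```python
import numpy as np

# Hypothetical values; 30 is deliberately far from the rest
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, outliers)
```

The whiskers extend to the most extreme data points that still fall inside the fences, not to the fences themselves.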

sns.catplot(data=insurance_data, x='smoker', y='charges', kind='box', aspect=1.4)
plt.title('Box Plot - Charges Based on Smoking Habits', fontsize=16)

[Figure: box plot]

Inference

Same as above.

Violin Plot

A violin plot is similar to a box and whisker plot, except that, where all components of a box plot correspond to actual data points, a violin plot shows a kernel density estimate of the underlying distribution.

Note: Because a violin plot relies on a KDE, it is influenced by the sample size. Using it for a small sample is not advisable.
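Here is a minimal sketch of a violin plot drawn with catplot() on randomly generated data. The two groups and their values are hypothetical and only serve to give each violin a reasonably large sample.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: two groups of 200 values each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": ["a"] * 200 + ["b"] * 200,
    "value": np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 2, 200)]),
})

# kind='violin' routes catplot() to the axes-level violinplot()
g = sns.catplot(data=toy, x="group", y="value", kind="violin", aspect=1.5)
g.set_axis_labels("Group", "Value")

plt.close("all")
```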

Try this on your own!

You should try adding our favorite parameters including hue, col in a swarm plot, box and whisker plot, and a violin plot as we previously did in the Regression plots.

Matrix Plots

This type of plot is used to find quick conclusions from the dataset by plotting data in the form of a matrix.

Heat Map

A heatmap is a graphical representation of data in which data values are represented as a color-encoded matrix. This can be used to see the relationship between two variables. It can also be used to plot a correlation matrix.

We can use the heatmap() function in seaborn to plot a heatmap.
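As a sketch of the correlation-matrix use mentioned above, the snippet below builds a small made-up DataFrame, computes its correlation matrix with pandas, and passes it to heatmap(). The columns a, b and c are assumptions for illustration; b is constructed to follow a closely, while c is independent noise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: b is almost a multiple of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})

# Pairwise Pearson correlations, a 3 x 3 matrix with 1.0 on the diagonal
corr = toy.corr()

plt.figure(figsize=(6, 5))
# Fix the color scale to [-1, 1] so that 0 sits in the middle of the colormap
ax = sns.heatmap(corr, annot=True, center=0, vmin=-1, vmax=1, linewidths=.5)
plt.close("all")
```

Fixing vmin and vmax to the full correlation range is a design choice here; it keeps the colors comparable between different datasets.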

plt.figure(figsize=(14, 8))
sns.heatmap(data=flight_delays_data, annot=True, center=0, linewidths=.5)
plt.xlabel('Airlines Code')
plt.title('Heatmap of Airline and Month wise Delays', fontsize=16)

annot=True: Writes the data value in each cell.

center=0: Centers the colormap at 0, which suits a diverging colormap.

linewidths=.5: Draws lines of width 0.5 between the cells.

You may try different combinations of these parameters and other parameters mentioned in the documentation on your own.

Output:

Note: White cells for the US airline denote missing data, and a negative delay denotes that the flights arrived earlier than the scheduled time.

Inference

NK airline has the highest delays compared to other airlines, with its overall highest delay in June.
AS airline has a negative delay.
The first six months saw higher delays than the last six months.
And many more! Let us discuss them in the comment section.

How to Choose a Plot?

Please refer to the following diagram to choose a plot according to use cases from various available plots.

[Figure: how to choose a plot]

Please do not hesitate to credit me, if you use the above image elsewhere in the future.

Summary

In this article, you have understood some amazing Data Visualization techniques.

Specifically, you learned:

What and Why of the Data Visualizations
Different types of plots including Relational Plots, Regression Plots, Distribution Plots, Categorical Plots, Matrix Plots and Multi-plot Grids.
How to Choose a Plot?

Endnotes

This completes today’s discussion. Thank you for reading this article!

I hope you enjoyed the article and that it was worth 10 minutes of your time.

Please feel free to put up your questions in the comments down below. I would love to answer those.

Did I miss something important, or do you want to share your thoughts? Comment down below and I’ll get back to you.

About the Author

I am Harsh Dhamecha, an aspiring Data Scientist. Currently, I am pursuing a Bachelor of Technology in Computer Science with a specialization in Artificial Intelligence. I love sharing my knowledge through the articles.

You may also look at another article I have written on the Generative Adversarial Networks(GANs).

Still reading! Special thanks to you 🙌

The media shown in this article on Data Visualization Techniques are not owned by Analytics Vidhya and are used at the Author’s discretion.
