Understanding how EDA is done in Python
Various steps involved in the Exploratory Data Analysis
Performing EDA on a given dataset
Exploratory Data Analysis, popularly known as EDA, is the process of performing initial investigations on a dataset to discover its structure and content. It is often known as Data Profiling. It is an unavoidable step in the entire journey of data analysis, right from the business understanding phase to the deployment of the models created.
EDA is where we get a basic understanding of the data in hand, which then helps us in the further process of Data Cleaning and Data Preparation.
We will be covering a wide range of topics under EDA, starting from basic structure-based data exploration and moving on to content-based exploration, univariate analysis, and bivariate analysis. In this article, we will be using the Python programming language to perform the EDA steps.
Let’s see what all we are going to cover!
Introducing the Dataset
Importing the Python Libraries
Loading the Dataset in Python
Structured Based Data Exploration
Handling Duplicates
Handling Outliers
Handling Missing Values
Univariate Analysis
Bivariate Analysis
For this article, we will be using the Black Friday dataset which can be downloaded from here.
Let’s import all the Python libraries we will need for our analysis, namely NumPy, Pandas, Matplotlib, and Seaborn.
Now let’s load our dataset into Python. We will be reading the data from a CSV (comma-separated values) file into a Pandas DataFrame naming it as df here.
Python Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read the Black Friday data from the CSV file into a DataFrame
df = pd.read_csv('train.csv')
print('Displaying first five rows:')
print(df.head())
print('Displaying last five rows:')
print(df.tail())
# df.shape is a (rows, columns) tuple
print(f"Number of rows: {df.shape[0]}\nNumber of columns: {df.shape[1]}")
Let’s begin with the basic exploration of the data we have!
It is the very first step in EDA which can also be referred to as Understanding the MetaData! That’s correct, ‘Data about the Data’.
It is here that we get the description of the data we have in our data frame.
Let’s try now.
Display the FIRST 5 Observations
We already displayed these above using df.head(); the first five rows give us a quick feel for the variables and their values.
Display the Number of Variables & Number of Observations
df.shape gives us a tuple having 2 values: the number of observations (rows) and the number of variables (columns). Note that shape is an attribute, not a method, so it is written without parentheses.
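A quick sketch of using that tuple (re-using the df loaded above; this snippet is illustrative and not part of the original walkthrough):
rows, cols = df.shape   # unpack the (rows, columns) tuple; no parentheses after shape
print(rows, cols)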
Display the Data Type of each Variable
df.dtypes
This gives us the type of variables in our dataset.
Count the Number of Non-Missing Values for each Variable
df.count()
This gives the number of non-missing values for each variable and is extremely useful while handling missing values in a data frame.
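For instance, subtracting these counts from the total number of rows gives the number of missing values per variable (a small sketch, not part of the original walkthrough):
# missing per column = total rows - non-missing count
print(df.shape[0] - df.count())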
Now to know about the characteristics of the data set we will use the df.describe() method which by default gives the summary of all the numerical variables present in our data frame.
df.describe()
Using the df.describe() method we get the following characteristics of the numerical variables: count (number of non-missing values), mean, standard deviation, and the five-point summary, which includes the minimum, first quartile, second quartile (median), third quartile, and maximum.
What about the categorical variables?
df.describe(include = 'all')
By providing the include argument and assigning it the value 'all', we get the summary of the categorical variables too. For the categorical variables, we get the characteristics: count (number of non-missing values), unique (number of unique values), top (the most frequent value), and freq (the frequency of the most frequent value).
df.info()
With just this one command, df.info() gives us the complete picture of the data in hand: the column names, their non-null counts, their data types, and the memory usage.
With this, we are done with the Structure-Based Exploratory Data Analysis and now it’s time to get into the Content Based Exploratory Data Analysis.
This involves 2 steps: Detecting duplicates and Removing duplicates.
To check for duplicates in our data:
df.duplicated()
Here, duplicates mean the exact same observations repeating themselves. As we can see, there are no duplicate observations in our data, and hence each observation is unique.
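Since df.duplicated() returns a Boolean Series, summing it gives the duplicate count as a single number (a quick sketch, not in the original):
# True counts as 1, so the sum is the number of duplicated rows
print(df.duplicated().sum())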
However, to remove the duplicates (if any), we can use the code:
df.drop_duplicates()
Further, we can see that there are duplicate values in some of the variables like User_ID. How can we remove those?
df.drop_duplicates(subset='User_ID')
This by default keeps just the first occurrence of each duplicated value in the User_ID variable and drops the rest. Hold on! Here we do not want to remove the duplicate values from the User_ID variable permanently, so just to see the output without making any permanent change to our data frame, we can write the command as:
df.drop_duplicates(subset='User_ID', inplace=False)
As we can see, the values in the User_ID variable are all unique in the returned data frame, while df itself remains unchanged.
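As an aside (not part of the original walkthrough), drop_duplicates also accepts a keep parameter to control which occurrence survives:
df.drop_duplicates(subset='User_ID', keep='last')   # keep the last occurrence instead of the first
df.drop_duplicates(subset='User_ID', keep=False)    # drop every row whose User_ID is duplicated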
So this is how detection and removal of duplicated observations/values are done in a data frame.
What are Outliers? Outliers are the extreme values on the low and the high side of the data. Handling Outliers involves 2 steps: Detecting outliers and Treatment of outliers.
Detecting Outliers
For this we consider any variable from our data frame and determine the upper cutoff and the lower cutoff of its values. Common methods for this include the IQR (interquartile range) method, the standard deviation method, and percentile-based cutoffs.
Let’s consider the Purchase variable. Now we will determine if there are any outliers in our data set using the IQR (interquartile range) method. What is this method about? You will get to know as we go along, so let’s start by finding the minimum (p0), maximum (p100), first quartile (q1), second quartile (q2), third quartile (q3), and interquartile range (iqr) of the values in the Purchase variable.
p0 = df.Purchase.min()            # minimum
p100 = df.Purchase.max()          # maximum
q1 = df.Purchase.quantile(0.25)   # first quartile
q2 = df.Purchase.quantile(0.5)    # second quartile (median)
q3 = df.Purchase.quantile(0.75)   # third quartile
iqr = q3 - q1                     # interquartile range
Now since we have all the values we need to find the lower cutoff(lc) and the upper cutoff(uc) of the values.
lc = q1 - 1.5*iqr   # lower cutoff
uc = q3 + 1.5*iqr   # upper cutoff
print(lc)
print(uc)
We have the upper cutoff and the lower cutoff, what now? We will be using the convention:
If lc < p0 → There are NO Outliers on the lower side
If uc > p100 → There are NO Outliers on the higher side
print(f"p0 = {p0}, p100 = {p100}, lc = {lc}, uc = {uc}")
Clearly lc < p0, so there are no outliers on the lower side. But uc < p100, so there are outliers on the higher side. We can get a pictorial representation of the outliers by drawing a box plot.
df.Purchase.plot(kind='box')
plt.show()
Now that we have detected the outliers, it is time to treat them.
Do not worry about data loss, as here we are not going to remove any value from the variable but rather clip it. In this process, we replace the values falling outside the range with the lower or the upper cutoff accordingly. This way the extreme values are capped, and all the data falls within the range.
Clipping all values greater than the upper cutoff to the upper cutoff :
df.Purchase.clip(upper=uc)
To finally treat the outliers and make the changes permanent:
df['Purchase'] = df['Purchase'].clip(upper=uc)   # cap values at the upper cutoff
df.Purchase.plot(kind='box')
plt.show()
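For comparison, a percentile-based cutoff is another common treatment. Here is a minimal sketch, assuming we cap at the 1st and 99th percentiles (illustrative only, not the method used above):
# Hypothetical alternative: cap Purchase at its 1st and 99th percentiles
low, high = df.Purchase.quantile([0.01, 0.99])
capped = df.Purchase.clip(lower=low, upper=high)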
What are Missing Values? Missing Values are the unknown values in the data. Handling them involves 2 steps: detecting the missing values and treating the missing values.
Detecting the Missing Values
df.isna()
df.isna() returns True for the missing values and False for the non-missing values.
Here we are going to find out the percentage of missing values in each variable.
df.isna().sum()/df.shape[0]*100
And we get from the output that we do have missing values in our data frame in 2 variables: Product_Category_2 and Product_Category_3, so detection is done.
To treat the missing values we can opt for one of the following methods: dropping the observations, dropping the variable, or imputing the values.
For the variable Product_Category_2, 31.56% of the values are missing. We should not drop such a large number of observations, nor should we drop the variable itself, hence we will go for imputation. Data imputation is done on the Series: we replace the missing values with some value, which could be a static value, the mean, the median, the mode, or the output of a predictive model.
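For a numeric variable, mean or median imputation would look like this (a minimal sketch; the Purchase column is used purely for illustration, it has no missing values here):
# Illustrative only: fill missing values of a numeric column with its median
df['Purchase'] = df['Purchase'].fillna(df['Purchase'].median())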
Since it is a categorical variable, let’s impute the values by mode.
mode_value = df.Product_Category_2.mode()[0]   # the most frequent value
df['Product_Category_2'] = df['Product_Category_2'].fillna(mode_value)
Done!
df.isna().sum()
For the variable Product_Category_3, 69.67% of the values are missing, which is a lot, hence we will drop this variable.
df.dropna(axis=1,inplace=True)
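Note that df.dropna(axis=1) removes every column that still contains any missing value; since Product_Category_3 is now the only such column, only it gets dropped. A more targeted equivalent (a hedged alternative, not in the original) would be:
# Drop just the intended column by name
df.drop(columns=['Product_Category_3'], inplace=True)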
df.dtypes
Univariate Analysis: Analysis using Charts
In this type of analysis, we use a single variable at a time and plot charts on it. The charts are created to see the distribution and the composition of the data, depending on whether the variable is categorical or numerical.
For Continuous Variables: To see the distribution of data we create Box plots and Histograms.
Distribution of Purchase
Histogram
df.Purchase.hist()
plt.show()
We created this histogram using the hist() method of the Series, but there is another method, plot(), with which we can create many more kinds of charts.
df.Purchase.plot(kind='hist', grid=True)
plt.show()
We have another way to create this chart by directly using matplotlib!
plt.hist(df.Purchase)
plt.grid(True)
plt.show()
Box Plot
df.Purchase.plot(kind='box')
plt.show()
plt.boxplot(df.Purchase)
plt.show()
For Categorical Variables: To see the distribution and the composition of the data we create Count Plots (bar charts) and Pie Charts.
Composition of Gender
df.groupby('Gender').City_Category.count().plot(kind='pie')
plt.show()
Distribution of Marital_Status
sns.countplot(x='Marital_Status', data=df)
plt.show()
Composition of City_Category
df.groupby('City_Category').City_Category.count().plot(kind='pie')
plt.show()
Distribution of Age
sns.countplot(x='Age', data=df)
plt.show()
Composition of Stay_In_Current_City_Years
df.groupby('Stay_In_Current_City_Years').City_Category.count().plot(kind='pie')
plt.show()
Distribution of Occupation
sns.countplot(x='Occupation', data=df)
plt.show()
Distribution of Product_Category_1
df.groupby('Product_Category_1').City_Category.count().plot(kind='barh')
plt.show()
Bivariate Analysis
In this type of analysis, we take two variables at a time and create charts on them. Since we have 2 types of variables, categorical and numerical, there can be 3 cases in bivariate analysis:
Numerical & Numerical: To see the relationship between the 2 variables we create Scatter Plots and a Correlation Matrix with a Heatmap on top.
Scatter Plot
Since there is only 1 truly numerical variable (Purchase) in our dataset, we cannot create a meaningful scatter plot here. But how can we demonstrate one? Let’s take a hypothetical example and treat all the numeric-typed variables (having dtype int or float) as numerical variables.
Considering the 2 numeric-typed (though really categorical) variables Product_Category_1 and Product_Category_2:
df.plot(x='Product_Category_1', y='Product_Category_2', kind='scatter')
plt.show()
plt.scatter(x=df.Product_Category_1, y=df.Product_Category_2)
plt.show()
Finding the correlation between all the numeric variables:
df.select_dtypes(['float64', 'int64']).corr()
Creating a heatmap using Seaborn on top of the correlation matrix obtained above to visualize the correlation between the different numerical columns of the data. This is especially helpful when we have a large number of variables.
sns.heatmap(df.select_dtypes(['float64', 'int64']).corr(), annot=True)
plt.show()
Numerical & Categorical
Comparison between Purchase and Occupation: Bar Chart
df.groupby('Occupation').Purchase.sum().plot(kind='bar')
plt.show()
summary = df.groupby('Occupation').Purchase.sum()
plt.bar(x=summary.index, height=summary.values)
plt.show()
sns.barplot(x=summary.index, y=summary.values)
plt.show()
Comparison between Purchase and Age: Line Chart
df.groupby('Age').Purchase.sum().plot(kind='line')
plt.show()
Composition of Purchase by Gender: Pie Chart
df.groupby('Gender').Purchase.sum().plot(kind='pie')
plt.show()
Comparison between Purchase and City_Category: Area Chart
df.groupby('City_Category').Purchase.sum().plot(kind='area')
plt.show()
Comparison between Purchase and Stay_In_Current_City_Years: Horizontal Bar Chart
df.groupby('Stay_In_Current_City_Years').Purchase.sum().plot(kind='barh')
plt.show()
Comparison between Purchase and Marital_Status: Box Plot
sns.boxplot(x='Marital_Status', y='Purchase', data=df)
plt.show()
Categorical & Categorical: To see the relationship between the 2 variables we create a crosstab and a heatmap on top.
Relationship between Age and Gender: Creating a crosstab showing the data for Age and Gender
pd.crosstab(df.Age,df.Gender)
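As an optional extra (not in the original walkthrough), pd.crosstab can also show proportions instead of raw counts via its normalize parameter:
# Row-wise proportions: within each Age group, the share of each Gender
pd.crosstab(df.Age, df.Gender, normalize='index')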
Heatmap: Creating a heat map on top of the crosstab.
sns.heatmap(pd.crosstab(df.Age, df.Gender))
plt.show()
Relationship between City_Category and Stay_In_Current_City_Years
sns.heatmap(pd.crosstab(df.City_Category, df.Stay_In_Current_City_Years))
plt.show()
Finally, we have come to the end of this article. We took a sample dataset and performed exploratory data analysis on it using the Python programming language and the Pandas DataFrame. However, this was just a basic idea of how EDA is done; you can definitely explore it to as much depth as you want and try performing these steps on bigger datasets as well.
Read more articles on our blog.
You can connect with me on LinkedIn.