Exploratory Data Analysis (EDA) in Python

Subhi Last Updated : 06 Apr, 2022


Introduction

Exploratory Data Analysis is a method of evaluating or comprehending data in order to derive insights or key characteristics. EDA can be divided into two categories: graphical analysis and non-graphical analysis.

EDA is a critical component of any data science or machine learning process. You must explore the data and understand both the relationships between variables and the underlying structure of the data in order to build a reliable and valuable output based on it.

The EDA stages will be carried out in this tutorial using the Python programming language.

The Dataset

For this article, we will work with a customer churn dataset of the kind used for Customer Churn Prediction. When clients stop doing business with a company, this is known as customer churn or customer attrition.

Because the cost of getting a new customer is usually higher than keeping an existing one, understanding customer churn is critical to a company’s success. As a result, churn analysis is the first step in gaining a better understanding of your clients.

To gain a deeper grasp of our data, we will go deep into Exploratory Data Analysis (EDA). The dataset is available here.

Importing the Python Libraries

First of all, we need to import the libraries required for the analysis: Pandas for handling data, NumPy for numerical calculations, and Matplotlib and Seaborn for visualization.
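A typical import block for these four libraries might look like the following. The aliases (np, pd, plt, sns) are only the usual community convention, not something the analysis depends on.

```python
import numpy as np               # numerical calculations
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # base plotting interface
import seaborn as sns            # statistical plots built on Matplotlib
```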

Loading the Dataset in Python

Now, load the dataset into a pandas DataFrame.
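A minimal sketch of the loading step is shown here. Because the file itself is not part of this article, a small inline CSV stands in for it; the column names and the file name mentioned in the comment are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
# With the downloaded file you would pass its path instead, for example
# pd.read_csv("customer_churn.csv")  (the file name is an assumption).
csv_text = """customerID,tenure,MonthlyCharges,TotalCharges,Churn
A001,1,29.85,29.85,No
A002,34,56.95,1889.5,No
A003,2,53.85,108.15,Yes
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```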

Structure-Based Data Exploration

This is the first part of EDA where the data frame is evaluated for structure, columns and data types. The goal of this step is to get a general understanding of the dataset.

Display the first 5 Observations

df.head() returns the first five rows of the dataframe.

Display the Last 5 Observations

df.tail() returns the last five rows of the dataframe.
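The first and last observations can be pulled with head() and tail(). The sketch below uses a hypothetical eight-row frame in place of the real dataset; both methods return five rows unless a different number is passed.

```python
import pandas as pd

# Hypothetical eight-row frame; the real dataset is much larger.
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(1, 9)],
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10],
})

first_rows = df.head()  # first 5 rows by default
last_rows = df.tail()   # last 5 rows by default

print(first_rows)
print(last_rows)
```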

Display the Number of Variables and Observations

This can be done with df.shape which gives the output as a tuple having 2 values. The first value counts the number of data points and the second value represents the number of features in the dataset.

In this dataframe, there are 7043 rows and 21 columns.
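As a small illustration, the tuple returned by df.shape can be unpacked directly. The frame here is a hypothetical stand-in, so the numbers differ from the 7043 rows and 21 columns of the real data.

```python
import pandas as pd

# Hypothetical stand-in frame with 3 observations and 2 variables.
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

n_rows, n_cols = df.shape  # (number of rows, number of columns)
print(f"{n_rows} rows and {n_cols} columns")
```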

Display the Variable Names and their Data Types
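One way to list the variable names together with their data types is the df.dtypes attribute. The sketch below assumes a small hypothetical frame in which TotalCharges was read as text, which is a common reason for a numeric-looking column to need conversion later.

```python
import pandas as pd

# Hypothetical frame: TotalCharges is deliberately stored as text.
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", "1889.5", " "],
})

column_types = df.dtypes  # Series indexed by column name
print(column_types)

# Columns that pandas does not treat as numeric.
non_numeric = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]
print(non_numeric)
```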

Count the number of Non-Missing Values for each variable

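df.count() counts the number of non-null values in each column, which gives a first idea of the missing values in our dataset. Note that a blank string still counts as a value, so text columns may hide missing entries.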
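A short sketch of this count on a hypothetical four-row frame, where two TotalCharges entries are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing TotalCharges values.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, np.nan, 108.15, np.nan],
})

non_missing = df.count()  # non-null values per column
print(non_missing)
```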

Descriptive Statistics

To learn more about the characteristics of the dataset, we will use df.describe(), which by default gives statistical information about all numerical features in the data frame.


For each numerical feature, df.describe() reports the count, mean, and standard deviation, together with the five-point summary: minimum, first quartile (25%), second quartile or median (50%), third quartile (75%), and maximum.
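The sketch below runs describe() on a hypothetical five-row frame to show which rows the summary contains and that non-numeric columns are left out by default.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5],
    "Contract": ["Month-to-month", "One year", "Two year",
                 "Month-to-month", "One year"],
})

summary = df.describe()  # numeric columns only by default
print(summary)
```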

What about the categorical features?

By passing the include argument with the value 'all', we also get a summary of the categorical features: the number of unique values, the most frequent value (top), and how often it occurs (freq).
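A minimal sketch of the include argument on a hypothetical four-row frame follows. Statistics that do not apply to a column, such as unique for a numeric column, are filled with NaN.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month",
                 "Two year", "Month-to-month"],
    "tenure": [1, 2, 60, 5],
})

full_summary = df.describe(include="all")
print(full_summary)
```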

Display the Complete Summary of the Dataset

df.info() gives a summary of the dataframe, including the number of entries, each column's data type and non-null count, and the memory usage.
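Calling df.info() prints the summary directly. The sketch below captures the same text in a buffer instead, which is only a convenience for inspecting it programmatically; the frame is a hypothetical stand-in.

```python
import io
import pandas as pd

# Hypothetical three-row frame.
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

buffer = io.StringIO()
df.info(buf=buffer)  # plain df.info() would print to stdout instead
summary_text = buffer.getvalue()
print(summary_text)
```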

Handling Missing Values

Missing values are the unknown values in the dataset. Understanding them is important for managing data successfully. The first step is to detect the missing values in the dataset, and the second is to treat them using an appropriate method.

Detecting the Missing Values

Using errors='coerce' (for example, in pd.to_numeric()) replaces every value that cannot be parsed as a number with NaN.
isnull().sum() returns the number of missing values in each column of the dataset.

We have 11 missing values in the ‘Total Charges’ column. Now, we will see different methods to deal with them.
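The detection step can be sketched as follows, assuming the charges column is named TotalCharges and was read as text with blank entries. Both the column name and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: blanks in TotalCharges are stored as " ".
df = pd.DataFrame({
    "customerID": ["A001", "A002", "A003", "A004"],
    "TotalCharges": ["29.85", " ", "108.15", " "],
})

# Values that cannot be parsed as numbers become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

missing_counts = df.isnull().sum()  # missing values per column
print(missing_counts)
```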

Missing Value Treatment

To treat missing values we can use the following ways:

Drop the variable
Drop the observation(s)
Mean imputation or median imputation or mode imputation

For the variable ‘Total Charges’, only 11 of the 7043 values are missing (about 0.16%). Since this is a very small share of the total dataset, we can drop these records.

Done. Let’s check!
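The drop and the follow-up check can be sketched together, again on a hypothetical frame with an assumed TotalCharges column. dropna(subset=...) removes only the rows where that column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing TotalCharges values.
df = pd.DataFrame({
    "customerID": ["A001", "A002", "A003", "A004"],
    "TotalCharges": [29.85, np.nan, 108.15, np.nan],
})

# Drop the observations with a missing TotalCharges value.
df_clean = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)

# Check that no missing values remain in that column.
remaining_missing = df_clean["TotalCharges"].isnull().sum()
print(len(df_clean), remaining_missing)
```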


Analysis using Charts

Data Visualizations

Now, it’s time to visualize the data. Data visualization shows how the data is distributed and what sort of relationships its features hold. It is also a quick way to check whether the features relate to the target.

Target Variable

Let’s visualize the target variable, i.e. Churn. It has two categories: Yes and No.

Display a frequency distribution of churn


The plot shows a class imbalance between churners and non-churners. Resampling is one suitable way to address this before modelling.
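A frequency plot of the target can be sketched as below. It uses plain Matplotlib rather than Seaborn to keep dependencies minimal, and the eight Churn values are hypothetical, so the proportions do not reflect the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical target column.
df = pd.DataFrame({"Churn": ["No", "No", "Yes", "No", "No", "Yes", "No", "No"]})

churn_counts = df["Churn"].value_counts()                # absolute frequencies
churn_share = df["Churn"].value_counts(normalize=True)   # relative frequencies

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(churn_counts.index.tolist(), churn_counts.tolist(),
       color=["steelblue", "indianred"])
ax.set_xlabel("Churn")
ax.set_ylabel("Number of customers")
ax.set_title("Frequency distribution of churn")
fig.tight_layout()

print(churn_share)
```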

Categorical Variables

There are 17 Categorical features in the dataset. Let’s see their churning rate with respect to the target variable.

Note: only the five graphs I consider most important are shown here.
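One way to list the categorical features before plotting them is to select the non-numeric columns. This sketch assumes categorical features are stored as text; a category encoded as integers would not be picked up this way. The frame is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical frame with two numeric and two text columns.
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})

categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Number of distinct categories in each categorical column.
category_counts = {col: df[col].nunique() for col in categorical_cols}
print(category_counts)
```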

Relationship between Monthly Charges and Total Charges

Total charges accumulate the monthly charges over a customer's time with the company, so we expect the two to be related. Let’s visualize their relationship.

Here we can see that total charges and monthly charges are highly correlated.
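The relationship can be checked both visually and numerically. The sketch below uses a scatter plot and the Pearson correlation from Series.corr(); the column names and the six hypothetical customers (with total charges built as tenure times monthly charge) are assumptions, so the resulting coefficient says nothing about the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical customers; TotalCharges is built as tenure * MonthlyCharges.
df = pd.DataFrame({
    "tenure": [1, 5, 10, 20, 40, 60],
    "MonthlyCharges": [20.0, 30.0, 50.0, 70.0, 90.0, 110.0],
})
df["TotalCharges"] = df["tenure"] * df["MonthlyCharges"]

# Pearson correlation between the two charge columns.
correlation = df["MonthlyCharges"].corr(df["TotalCharges"])

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(df["MonthlyCharges"], df["TotalCharges"])
ax.set_xlabel("Monthly charges")
ax.set_ylabel("Total charges")
ax.set_title("Monthly charges vs. total charges")
fig.tight_layout()

print(round(correlation, 2))
```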

Customer Contract Distribution

Here we are trying to visualize the churning rate with respect to Contract.

About 75% of customers with month-to-month contracts opted to move out, compared to 13% of customers with one-year contracts and 3% of customers with two-year contracts.
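A per-contract churn rate of this kind can be computed with pd.crosstab and normalize="index", which turns each row into shares that sum to 1. The ten customers below are hypothetical, so the rates are not the ones quoted above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical customers.
df = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 2,
    "Churn": ["Yes", "No", "Yes", "No",
              "No", "No", "Yes", "No",
              "No", "No"],
})

# Share of churners and non-churners within each contract type.
rates = pd.crosstab(df["Contract"], df["Churn"], normalize="index")

fig, ax = plt.subplots(figsize=(6, 4))
rates.plot(kind="bar", stacked=True, ax=ax, rot=0)
ax.set_ylabel("Share of customers")
ax.set_title("Churn by contract type")
fig.tight_layout()

print(rates)
```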

Payment Method Distribution

This is the visualization of the payment method. It has four categories.


Electronic check is the payment method with the most users.
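A distribution like this can be drawn as a pie chart from value_counts(). In the sketch below the column name PaymentMethod, the four category labels, and the counts are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical payment methods for ten customers.
df = pd.DataFrame({
    "PaymentMethod": (["Electronic check"] * 4 + ["Mailed check"] * 3
                      + ["Bank transfer"] * 2 + ["Credit card"] * 1),
})

# Counts per category, sorted from most to least frequent.
method_counts = df["PaymentMethod"].value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
wedges, texts, autotexts = ax.pie(
    method_counts.tolist(),
    labels=method_counts.index.tolist(),
    autopct="%1.1f%%",  # print each share as a percentage
)
ax.set_title("Payment method distribution")

print(method_counts)
```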

Dependents distribution

This graph shows the churning rate with respect to Dependents.


Customers without dependents are more likely to churn.

Churn distribution w.r.t Partners

This graph shows the churning rate with respect to Partners.

Customers that do not have partners are likely to churn more.
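Comparisons like the ones for Dependents and Partners come down to a churn rate per category. The helper churn_rate_by below is a hypothetical name introduced for this sketch, the column name Partner is an assumption, and the six customers are made up, so the rates are illustrative only.

```python
import pandas as pd

def churn_rate_by(frame, column):
    """Return the share of churned customers within each category of `column`."""
    churned = frame["Churn"].eq("Yes")  # boolean Series: True for churners
    return churned.groupby(frame[column]).mean()

# Hypothetical customers.
df = pd.DataFrame({
    "Dependents": ["No", "No", "No", "No", "Yes", "Yes"],
    "Partner": ["No", "No", "Yes", "No", "Yes", "Yes"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})

by_dependents = churn_rate_by(df, "Dependents")
by_partner = churn_rate_by(df, "Partner")

print(by_dependents)
print(by_partner)
```

The same helper can be applied to any other categorical column to compare churn rates across its categories.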

Conclusion

In this article, we tried to analyze customer behaviour. First, we explored the dataset on a basic level. We looked for missing values and treated them by dropping those values. Then we used the Pandas DataFrame to do Exploratory Data Analysis on sample data by plotting different graphs like Count plot, Pie Chart, Line Plot and Histplot. From this, we got some useful insights like: “Customers with month-to-month contracts churn the most”, “Total charges and monthly charges were highly correlated”, etc. This way, we perform EDA on the datasets to explore the data and extract all possible insights from it, which can help in model building and also better decision making.

However, this was only a basic overview of how EDA works; you can go deeper into it and attempt the stages on larger datasets.

You can reach out to me on LinkedIn.
