Introduction to Exploratory Data Analysis (EDA)

Nikhil Last Updated : 27 Nov, 2024


Exploratory Data Analysis (EDA) is the process of examining a dataset to understand it, extract insights, and identify its patterns and main characteristics. EDA is generally classified into two methods: graphical analysis and non-graphical analysis.

EDA is essential because it is good practice to first understand the problem statement and the relationships between the data features before getting your hands dirty with modeling.

This article walks through EDA step by step. Keeping notes as you go helps summarize findings and keeps the exploration thorough and organized.

This article was published as a part of the Data Science Blogathon.

Exploratory Data Analysis (EDA)
Types of Exploratory Data Analysis
Understanding EDA
- How to handle the missing values by using a few techniques
Conclusion

Exploratory Data Analysis (EDA)

The primary goals of EDA are to:

Examine the data distribution
Handle missing values (one of the most common issues in any dataset)
Handle outliers
Remove duplicate data
Encode categorical variables
Normalize and scale the data

Note – Don’t worry if you are not familiar with some of the above terms. Most of them are covered in the steps below.

Types of Exploratory Data Analysis

Univariate Analysis

Univariate analysis focuses on analyzing a single variable at a time. It aims to describe the data and find patterns rather than establish causation or relationships. Techniques used include:

Descriptive statistics (mean, median, mode, standard deviation, etc.)
Frequency distributions (histograms, bar graphs, etc.)
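As a small illustration of the descriptive statistics listed above, here is a minimal sketch using Pandas. The values in the series are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy values for a single variable
scores = pd.Series([2, 3, 3, 5, 7], name="toy_score")

mean_value = scores.mean()        # arithmetic mean
median_value = scores.median()    # middle value after sorting
mode_value = scores.mode().iloc[0]  # most frequent value
std_value = scores.std()          # sample standard deviation (ddof=1)

# Frequency distribution: how often each value occurs
frequencies = scores.value_counts().sort_index()

print(mean_value, median_value, mode_value, std_value)
print(frequencies)
```

The frequency table is the non-graphical counterpart of a histogram or bar graph of the same variable.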

Bivariate Analysis

Bivariate analysis explores relationships between two variables. It helps find correlations, relationships, and dependencies between pairs of variables. Techniques include:

Scatter plots
Correlation analysis

Multivariate Analysis

Multivariate analysis extends bivariate analysis to include more than two variables. It focuses on understanding complex interactions and dependencies between multiple variables. Techniques include:

Heat maps
Scatter plot matrices
Principal Component Analysis (PCA)

Understanding EDA

To understand the steps involved in Exploratory Data Analysis, we will use Python as the programming language and Jupyter Notebooks as the environment. Jupyter is open-source, and it is not only an excellent environment for writing code but also very good for visualization and presentation.

Step 1

First, we will import all the Python libraries that are required for this, which include NumPy for numerical calculations and scientific computing, Pandas for handling data, and Matplotlib and Seaborn for visualization.

Step 2

Then we will load the data into a Pandas data frame. For this analysis, we will use the “World Happiness Report” dataset, which has columns such as GDP per Capita, Family, Life Expectancy, Freedom, Generosity, and Trust Government Corruption. These describe the extent to which each factor contributes to the happiness score.

You can find this dataset over here.

Step 3

We can observe the dataset by checking a few of the rows using the head() method, which returns the first five records from the dataset.

Step 4

Using shape, we can observe the dimensions of the data.

Step 5

The info() method shows some of the characteristics of the data, such as the column names, the number of non-null values in each column, the dtype of each column, and the memory usage.

Python Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: load the dataset into a Pandas data frame
happinessData = pd.read_csv('happiness.csv')

# Step 3: preview the first five records
print(happinessData.head())
print("-------------------------")

# Step 4: dimensions of the data as (rows, columns)
print(f"Shape of the data: {happinessData.shape}")
print("-------------------------")

# Step 5: info() prints its summary directly and returns None,
# so it is called without wrapping it in print()
happinessData.info()

From this, we can observe that our data doesn’t have any missing values. We are very lucky in this case, but in real-life scenarios the data usually has missing values, which we need to handle for our model to work accurately. (Note – Later on, I’ll show you how to handle the data if it has missing values in it.)

Step 6

We will use the describe() method, which shows basic statistical characteristics of each numerical feature (int64 and float64 types): the number of non-missing values, mean, standard deviation, minimum, maximum, and the 0.25, 0.50 (median), and 0.75 quartiles.
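A minimal sketch of what describe() returns is shown here. The small data frame is a made-up stand-in for the happiness data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the happiness dataset
toy = pd.DataFrame({
    "Happiness Score": [7.5, 6.9, 5.2, 4.1, 3.3],
    "Freedom": [0.66, 0.61, 0.45, 0.38, 0.20],
})

# One row per statistic, one column per numerical feature
summary = toy.describe()
print(summary)
```

The row labelled 50% is the median of each column.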

Step 7

Handling missing values in the dataset. Luckily, this dataset doesn’t have any missing values, but real-world data is rarely this clean.

So I have removed a few values intentionally, just to show how to handle this particular case.

We can check whether our data contains null values using isnull().sum(), which counts the missing values in each column.
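A minimal sketch of this check follows. The toy data frame and its values are hypothetical; one value has been blanked out in each of two columns to mimic the situation described here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each numeric column
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness Score": [7.5, np.nan, 5.2, 4.1],
    "Freedom": [0.66, 0.61, np.nan, 0.38],
})

# Number of null values per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```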

As we can see, the “Happiness Score” and “Freedom” features have one missing value each.

How to handle the missing values by using a few techniques

Drop the missing values – If the dataset is huge and the missing values are very few, we can drop the affected rows directly because doing so will not have much impact.
Replace with mean values – We can replace the missing values with the mean, but this is not advisable if the data has outliers, because outliers pull the mean towards them.
Replace with median values – We can replace the missing values with the median, which is recommended if the data has outliers.
Replace with mode values – We can do this in the case of a categorical feature.
Regression – A regression model can be used to predict the missing value from the other columns of the dataset.

For our case, we will handle missing values by replacing them with the median value.

We can now check again whether the missing values have been handled.
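The median replacement and the follow-up check could look roughly like this. It is a sketch on made-up values, not the exact code used for the happiness dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Happiness Score": [7.5, np.nan, 5.2, 4.1, 3.3],
    "Freedom": [0.66, 0.61, np.nan, 0.38, 0.20],
})

# Replace each missing value with the median of its own column
for col in ["Happiness Score", "Freedom"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Re-check: total number of nulls left in the data frame
remaining = int(toy.isnull().sum().sum())
print(toy)
print(f"Remaining null values: {remaining}")
```

median() skips missing values by default, so the fill value is computed from the observed entries only.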

Now we can see that our dataset doesn’t have any null values.

Step 8

We can check for duplicate rows in our dataset using duplicated(), as the presence of duplicates can hamper the accuracy of our ML model.

We can remove duplicate rows using drop_duplicates().
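A minimal sketch of both calls on a made-up data frame that contains one repeated row:

```python
import pandas as pd

# Hypothetical toy data: the row for country "B" appears twice
toy = pd.DataFrame({
    "Country": ["A", "B", "B", "C"],
    "Happiness Score": [7.5, 6.9, 6.9, 5.2],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```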

As we can see, the duplicate values are now handled.

Step 9

Handling the outliers in the data, i.e. the extreme values. We can find the outliers in our data using a boxplot.
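A boxplot for one column could be drawn roughly as follows. The column name and values are made up, and the boxplot_stats helper from Matplotlib is used here only to list the points that fall outside the whiskers.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.cbook import boxplot_stats

# Hypothetical toy column with one extreme value
toy = pd.DataFrame({
    "Generosity": [0.10, 0.15, 0.20, 0.22, 0.25, 0.28, 0.30, 0.35, 0.95],
})

# Draw the boxplot; points beyond the whiskers are drawn individually
ax = sns.boxplot(x=toy["Generosity"])
ax.set_title("Boxplot (toy data)")

# Same 1.5 * IQR whisker rule, returned as numbers instead of a picture
stats = boxplot_stats(toy["Generosity"].to_numpy())[0]
fliers = stats["fliers"]
print(f"Points outside the whiskers: {list(fliers)}")
```

In a Jupyter notebook the matplotlib.use("Agg") line can be dropped so that the figure is displayed inline.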

As we can observe from the boxplot, the box spans the middle 50% of the data (the interquartile range), the whiskers extend to the most extreme values that are not outliers, and the small circles beyond the whiskers denote the outliers.

To handle them, we can either drop the outlier values or replace them using the IQR (interquartile range) method.

The IQR is the difference between the 75th and 25th percentiles of the data (Q3 - Q1). The percentiles can be calculated by sorting the values and selecting those at specific indices. The IQR is used to identify outliers by defining limits that lie a factor k of the IQR below the 25th percentile and above the 75th percentile. A common value for k is 1.5.
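The IQR rule can be sketched as follows on made-up values. Capping the outliers at the limits with clip() is shown as one possible way of replacing them; it is an assumption for illustration, and dropping the rows is an equally valid choice.

```python
import pandas as pd

# Hypothetical toy values with one obvious outlier
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0], name="toy")

q1 = values.quantile(0.25)   # 25th percentile
q3 = values.quantile(0.75)   # 75th percentile
iqr = q3 - q1
k = 1.5                      # common choice for the factor k

lower_limit = q1 - k * iqr
upper_limit = q3 + k * iqr

# Values outside the limits are flagged as outliers
outliers = values[(values < lower_limit) | (values > upper_limit)]

# Option 1: drop the outliers
dropped = values[(values >= lower_limit) & (values <= upper_limit)]

# Option 2: replace the outliers by capping them at the limits
capped = values.clip(lower=lower_limit, upper=upper_limit)

print(f"Limits: [{lower_limit}, {upper_limit}]")
print(f"Outliers: {outliers.tolist()}")
```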

Now we can again plot the boxplot and check if the outliers have been handled or not.

Finally, we can observe that our data is now free from outliers.

Step 10

Normalizing and Scaling – Feature scaling brings the features of the data onto a comparable range, since their raw ranges may vary a lot. Many ML algorithms work better on scaled inputs, so scaling is done as a preprocessing step. Here we will use StandardScaler for the numerical values, which standardizes each value as (x - mean) / standard deviation.
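A minimal sketch of StandardScaler on a single made-up column, together with the same formula written out by hand for comparison:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values for one numerical column
toy = pd.DataFrame({"Happiness Score": [7.5, 6.9, 5.2, 4.1, 3.3]})

# StandardScaler expects a 2-D input, hence the double brackets
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["Happiness Score"]])

# The same result by hand: (x - mean) / standard deviation.
# StandardScaler uses the population standard deviation (ddof=0).
col = toy["Happiness Score"]
manual = (col - col.mean()) / col.std(ddof=0)

print(scaled.ravel())
print(manual.to_numpy())
```

After scaling, the column has a mean of 0 and a standard deviation of 1.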

As we can see, the “Happiness Score” column has been standardized.

Step 11

We can find the pairwise correlation between the different columns of the data using the corr() method, for example happinessData.corr(). Missing (NaN) values are excluded automatically. (Note – Only numeric columns can be correlated. Older pandas versions silently ignore non-numeric columns, while pandas 2.0 and later raise an error unless you pass numeric_only=True or select the numeric columns first.)

The resulting coefficient is a value between -1 and 1 inclusive, where:

1: Total positive linear correlation
0: No linear correlation (the two variables may still be related in a non-linear way)
-1: Total negative linear correlation

Pearson Correlation is the default method of the function “corr”.

Now, we will create a heatmap using Seaborn to visualize the correlation between the different columns of our data:
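The correlation matrix and heatmap could be produced roughly like this. The data frame is a made-up stand-in, so its coefficients will not match the values reported for the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy values, not taken from the World Happiness Report
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Happiness Score": [3.5, 4.4, 5.1, 6.2, 7.0],
    "Economy (GDP per Capita)": [0.2, 0.5, 0.8, 1.1, 1.4],
    "Freedom": [0.30, 0.55, 0.35, 0.60, 0.50],
})

# Keep the numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr)

# Annotated heatmap with a colour scale fixed to the range [-1, 1]
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (toy data)")
fig.tight_layout()
```

Selecting the numeric columns explicitly keeps the sketch independent of the pandas version in use.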

As we can observe from the above heatmap of correlations, there is a high correlation between –

Happiness Score – Economy (GDP per Capita) = 0.78
Happiness Score – Family = 0.74
Happiness Score – Health (Life Expectancy) = 0.72
Economy (GDP per Capita) – Health (Life Expectancy) = 0.82

Step 12

Now, using Seaborn, we will visualize the relation between Economy (GDP per Capita) and Happiness Score by using a regression plot. As we can see, the Happiness Score increases as the Economy increases, denoting a positive relation.
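A regression plot of this kind could be drawn as follows. The values are made up for illustration, and the slope from np.polyfit is added only to put a number on the direction of the fitted line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy values, not taken from the World Happiness Report
toy = pd.DataFrame({
    "Economy (GDP per Capita)": [0.2, 0.5, 0.8, 1.1, 1.4],
    "Happiness Score": [3.5, 4.4, 5.1, 6.2, 7.0],
})

# Scatter plot with a fitted linear regression line
ax = sns.regplot(x="Economy (GDP per Capita)", y="Happiness Score", data=toy)
ax.set_title("Economy vs. Happiness Score (toy data)")

# Slope of the least-squares line: positive means a positive relation
slope, intercept = np.polyfit(
    toy["Economy (GDP per Capita)"], toy["Happiness Score"], 1
)
print(f"Fitted slope: {slope:.3f}")
```

The same call works for the other pairs of columns by changing the x argument.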

Now, we will visualize the relation between Family and Happiness Score by using a regression plot.

Now, we will visualize the relation between Health (Life Expectancy) and Happiness Score by using a regression plot. As we can see, the Happiness Score rises with Health: in this data, better health goes together with higher happiness.

Now, we will visualize the relation between Freedom and Happiness Score by using a regression plot. As we can see, the correlation between these two parameters is lower, so the points are more scattered and the dependency between the two is weaker.

I hope we all now have a basic understanding of how to perform Exploratory Data Analysis(EDA).

Hence, the above are the steps that I personally follow for Exploratory Data Analysis, but there are various other plots and commands, which we can use to explore more into the data.

Conclusion

Exploratory Data Analysis (EDA) includes examining datasets to discover patterns through Univariate, Bivariate, and Multivariate analysis techniques. These methods concentrate on individual, paired, and multiple variables, in that order. Exploratory Data Analysis (EDA) also involves dealing with missing values, which analysts typically address using methods such as mean or median imputation and predictive modeling to ensure data integrity.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Q1. What is the purpose of using the .corr() method in EDA?

A. The .corr() method in Exploratory Data Analysis (EDA) serves the purpose of calculating the pairwise correlation between all columns in a dataset. This method helps identify relationships and dependencies between variables, which is essential for understanding the data’s structure and underlying patterns. By examining the correlation coefficients, analysts can determine how strongly pairs of variables are related, aiding in feature selection and the development of predictive models.

Q2. What are the key aspects of exploratory data analysis?

A. The key aspects of Exploratory Data Analysis (EDA) include:
Data Distribution Examination: Analyzing how data is distributed to identify patterns and trends.
Handling Missing Values: Addressing gaps in the dataset using various techniques to maintain data integrity.
Outlier Detection: Identifying outliers that may skew results or indicate significant anomalies.
Data Cleaning: Removing duplicates and correcting inconsistencies to prepare the dataset for analysis.
Encoding Categorical Variables: Converting categorical data into numerical formats for better analysis.
Normalization and Scaling: Adjusting data scales to ensure equal contribution to analyses, especially in distance-based algorithms.
Graphical and Non-Graphical Analysis: Utilizing both visual methods (like histograms) and statistical summaries to explore data comprehensively.
Correlation Analysis: Understanding relationships between variables to identify potential predictors for modeling.

Nikhil

Data Scientist with 6 years of experience in analysing large datasets and delivering valuable insights via advanced data-driven methods. Proficient in Time Series Forecasting, Natural Language Processing and with a demonstrated history of working in the Telecom, Healthcare and Retail Supply Chain industries.
