Exploratory Data Analysis – The Go-To Technique to Explore Your Data!

Sameer Last Updated : 24 Oct, 2024

This article was published as a part of the Data Science Blogathon.

Introduction

Exploratory Data Analysis (EDA) is one of the most underrated and under-utilized approaches in any data science project. It is the first step data scientists perform: they study the data and extract valuable information and non-obvious insights, which ultimately helps during model building.

Before you model the data and test it, you need to build a relationship with the data. You can build this relationship by exploring the data, plotting it against the target variable, and observing how it behaves. This process of analysis before modeling is called Exploratory Data Analysis.

In this article, we are going to perform a hands-on EDA on a complex dataset from Kaggle (House Prices: Advanced Regression Techniques). The link to the dataset is given below:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

The lifecycle of a Data Science Project

1) Exploratory Data Analysis

2) Feature Engineering

3) Feature Selection

4) Hyperparameter tuning

5) Model Building and deployment

Let us perform EDA on this complex dataset, which has 81 columns: an Id, 79 independent features, and 1 target variable (SalePrice). It is a regression problem.

EDA consists of some basic steps, such as analyzing missing values, the distribution of numerical and categorical features, outliers, and multicollinearity. We will go through these steps one by one.

Missing Values

Most of the time the data we obtain contains missing values, and we need to find out whether there is any relationship between the missing data and the sale price (target variable). Depending on that, we replace the missing values with something like the median of that column.

The Python code for this step collects the features with missing values in a list, replaces each missing value with 1 and each non-missing value with 0, and plots this indicator against the median sale price to see whether the null values are related to the target variable. The output below lists the features with missing values:

LotFrontage 0.1774 % missing values
Alley 0.9377 % missing values
MasVnrType 0.0055 % missing values
MasVnrArea 0.0055 % missing values
BsmtQual 0.0253 % missing values
BsmtCond 0.0253 % missing values
BsmtExposure 0.026 % missing values
BsmtFinType1 0.0253 % missing values
BsmtFinType2 0.026 % missing values
FireplaceQu 0.4726 % missing values
GarageType 0.0555 % missing values
GarageYrBlt 0.0555 % missing values
GarageFinish 0.0555 % missing values
GarageQual 0.0555 % missing values
GarageCond 0.0555 % missing values
PoolQC 0.9952 % missing values
Fence 0.8075 % missing values
MiscFeature 0.963 % missing values

Since many features have missing values, we need to examine the relationship between the null values and the target variable (sale price). Note that the figures above are fractions rather than percentages, despite the label: 0.1774 means that about 17.7% of the LotFrontage values are missing.

One of the resulting plots shows that null values in the LotFrontage feature have an impact on the target variable: the median sale price is different where the value is missing. So yes, there is a relationship between the two, and we need to replace the null values with something meaningful, like the median of that feature.
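The missing-value check described above can be sketched as follows. This is a minimal illustration rather than the original code: it uses a small made-up DataFrame in place of train.csv (only the column names come from the Kaggle dataset) and prints the medians instead of plotting them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
data = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 60.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Features that contain at least one missing value.
features_with_na = [col for col in data.columns if data[col].isnull().any()]

# Fraction of missing values per feature (multiply by 100 for a percentage).
missing_fraction = data[features_with_na].isnull().mean().round(4)
print(missing_fraction)

# Median sale price where the value is missing (1) vs. present (0).
median_by_missing = {}
for col in features_with_na:
    is_missing = data[col].isnull().astype(int)
    median_by_missing[col] = data.groupby(is_missing)["SalePrice"].median()
    print(col, median_by_missing[col].to_dict())
```

If the two medians differ clearly for a feature, the fact that a value is missing carries information about the sale price.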

Numerical Features

Since this is a large dataset, we need to visualize the different types of variables, namely date-time (year) features, discrete and continuous numerical features, and categorical features, and their behavior with respect to the target variable.

There are 39 numerical features in this dataset. Columns that hold strings, or a mix of strings and numbers, have the object data type, which we can check using the dtypes attribute.
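As a rough sketch of how the numerical features can be picked out, the snippet below inspects the column types and keeps the numeric ones. The DataFrame is a made-up stand-in for train.csv, and using select_dtypes is one possible approach, not necessarily the one used to obtain the count above.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

# Data type of every column; string columns are not numeric.
print(data.dtypes)

# Keep only the columns with a numeric dtype.
numerical_features = data.select_dtypes(include="number").columns.tolist()
print("Number of numerical features:", len(numerical_features))
```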

Date Time variable (year feature or temporal variable)

There are four year features in this dataset. The Python code below plots the median sale price against the year of sale (YrSold) to see how it behaves with respect to the target variable.

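import pandas as pd
import matplotlib.pyplot as plt

# Load the Kaggle training set (assumes train.csv is in the working directory)
data = pd.read_csv('train.csv')

# Median sale price for each year of sale
data.groupby('YrSold')['SalePrice'].median().plot()
plt.xlabel('Year sold')
plt.ylabel('Median sale price')
plt.title('Median house price vs. year sold')
plt.show()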

We can see here that as the year sold increases, the median sale price decreases. This looks like an anomaly, since house prices usually rise over time, so we need to do more analysis before drawing conclusions. This shows the importance of EDA and how it can affect our conclusions.

Instead of comparing the sale price with the YrSold feature directly, let us compare the sale price with the difference between YrSold and each of the other year features.

Now we can compare the median sale price with the number of years since the house was built or remodelled, and draw conclusions such as: as the value on the X-axis increases, the price decreases.
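The year differences can be sketched like this. The column names (YearBuilt, YearRemodAdd, GarageYrBlt, YrSold) are the ones in the Kaggle dataset, while the values are made up and the name-based search for year columns is an assumption about how the features are found.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
data = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915],
    "YearRemodAdd": [2003, 1976, 2002, 1970],
    "GarageYrBlt": [2003, 1976, 2001, 1998],
    "YrSold": [2008, 2007, 2008, 2006],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Year features: columns whose name contains "Yr" or "Year".
year_features = [col for col in data.columns if "Yr" in col or "Year" in col]

# Years elapsed between each event (built, remodelled, garage built) and the sale.
elapsed = {}
for col in year_features:
    if col != "YrSold":
        elapsed[col] = data["YrSold"] - data[col]
        median_price = data["SalePrice"].groupby(elapsed[col]).median()
        print(col, median_price.to_dict())
```

On the real data, plotting each elapsed series against SalePrice (for example with plt.scatter) gives the comparison described above.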

Discrete numerical features

Discrete variables are variables that take a limited, countable set of distinct values.

I have treated a numerical feature as discrete if it has fewer than 25 unique values and is not one of the year features. Now let us see whether there is a relationship between the discrete features and the target variable.

We can see that some features, such as OverallQual (overall quality), have a direct relationship with the target variable.
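A minimal sketch of this selection is shown below. The DataFrame is made up, and because it is tiny the cut-off MAX_UNIQUE is set to 5 instead of the 25 used in the article; both the constant name and the smaller value are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
data = pd.DataFrame({
    "OverallQual": [7, 6, 7, 5, 8, 5],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "YrSold": [2008, 2007, 2008, 2006, 2008, 2009],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# The article uses 25; the toy data is tiny, so a smaller value is used here.
MAX_UNIQUE = 5
year_features = [col for col in data.columns if "Yr" in col or "Year" in col]

# Discrete features: numeric, few distinct values, not a year and not the target.
discrete_features = [
    col for col in data.select_dtypes(include="number").columns
    if data[col].nunique() < MAX_UNIQUE
    and col not in year_features + ["SalePrice"]
]

# Median sale price for each level of every discrete feature.
discrete_medians = {}
for col in discrete_features:
    discrete_medians[col] = data.groupby(col)["SalePrice"].median()
    print(discrete_medians[col])
```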

Continuous numerical features

These are features whose values can be any number within a range. Using histograms, we analyze their distribution across the dataset.

We can see that the distribution we obtained is skewed. In regression problems, it often helps to convert a skewed distribution to an approximately normal one, as this can improve the accuracy of models such as linear regression.

Logarithmic transformation is one technique for making a skewed distribution closer to normal: we take the log of all values of a feature and store the result as a new log feature.
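The log transformation can be sketched as below, assuming a made-up DataFrame in place of train.csv. Skipping features that contain zeros is one simple choice made here (log(0) is undefined); it is an assumption, not necessarily what was done for the plots in this article.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
data = pd.DataFrame({
    "LotArea": [1000, 2000, 4000, 8000, 16000, 32000, 64000],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362, 1694],
    "PoolArea": [0, 0, 0, 0, 0, 0, 512],
})

log_data = pd.DataFrame(index=data.index)
for col in data.columns:
    # log(0) is undefined, so features containing zeros are skipped here.
    if (data[col] <= 0).any():
        continue
    log_data[col + "_log"] = np.log(data[col])

# Skewness before and after; values closer to 0 indicate a more symmetric shape.
print(data["LotArea"].skew(), log_data["LotArea_log"].skew())
```

A log transform reduces right skew but does not guarantee a normal distribution, so it is worth re-plotting the histogram of each new log feature.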

Outliers

An outlier is any data point that lies far away from the rest of the distribution of the dataset.

The presence of outliers in the dataset can hamper the accuracy of the model. Algorithms like linear regression are very sensitive to outliers, so they need to be handled carefully.

The standard deviation method is a common way to identify and replace outliers: any data point that lies more than three standard deviations from the mean is considered an outlier. That threshold can change depending on the size of the dataset.
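A minimal sketch of the standard deviation method is given below. The values are made up, and N_STD is a hypothetical name for the threshold, which should be adjusted to the dataset.

```python
import pandas as pd

# Toy feature values, made up for illustration; 400 is the obvious extreme.
values = pd.Series([50, 52, 47, 49, 51, 48, 53, 50, 49, 51,
                    52, 48, 50, 47, 53, 49, 51, 50, 48, 400])

N_STD = 3  # threshold from the text; adjust to the dataset
mean, std = values.mean(), values.std()
lower, upper = mean - N_STD * std, mean + N_STD * std

# Points outside [lower, upper] are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

Note that the outlier itself inflates the mean and the standard deviation, which is one reason the threshold may need tuning on small samples.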

Here in EDA, let us analyze the outliers in the dataset using boxplots.

The black dots denote the outliers, which lie away from the rest of the distribution. The lower edge of the rectangular box is the 25th percentile and the upper edge is the 75th percentile.

Those black dots are the values that need to be removed or replaced, which we will see in feature engineering.
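To make the boxplot reading concrete, the sketch below computes the quartiles and the limits that decide which points are drawn as individual dots. It assumes matplotlib's default whisker setting of 1.5 times the interquartile range and uses made-up values.

```python
import pandas as pd

# Toy feature values, made up for illustration.
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 40])

# Edges of the box: 25th and 75th percentiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Limits used by matplotlib's boxplot by default (1.5 * IQR beyond the box).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones drawn as individual dots.
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, flagged.tolist())
```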

Categorical features

The data type of a categorical feature is object, which we can check with the dtypes attribute of pandas.

We generally convert the categorical values of a feature into dummy variables so that the algorithm can work with them. This is called one-hot encoding. If the cardinality of a feature is very high, we do not use one-hot encoding, as it might lead to the curse of dimensionality.

The feature is MSZoning and number of categories are 5
The feature is Street and number of categories are 2
The feature is Alley and number of categories are 3
The feature is LotShape and number of categories are 4
The feature is LandContour and number of categories are 4
The feature is Utilities and number of categories are 2
The feature is LotConfig and number of categories are 5
The feature is LandSlope and number of categories are 3
The feature is Neighborhood and number of categories are 25
The feature is Condition1 and number of categories are 9
The feature is Condition2 and number of categories are 8
The feature is BldgType and number of categories are 5
The feature is HouseStyle and number of categories are 8
The feature is RoofStyle and number of categories are 6
The feature is RoofMatl and number of categories are 8
The feature is Exterior1st and number of categories are 15
The feature is Exterior2nd and number of categories are 16
The feature is MasVnrType and number of categories are 5
The feature is ExterCond and number of categories are 5
The feature is Foundation and number of categories are 6
The feature is BsmtQual and number of categories are 5
The feature is BsmtCond and number of categories are 5
The feature is BsmtExposure and number of categories are 5
The feature is BsmtFinType1 and number of categories are 7
The feature is BsmtFinType2 and number of categories are 7
The feature is Heating and number of categories are 6
The feature is HeatingQC and number of categories are 5
The feature is CentralAir and number of categories are 2
The feature is Electrical and number of categories are 6
The feature is KitchenQual and number of categories are 4
The feature is Functional and number of categories are 7
The feature is FireplaceQu and number of categories are 6
The feature is GarageType and number of categories are 7
The feature is GarageFinish and number of categories are 4
The feature is GarageQual and number of categories are 6
The feature is GarageCond and number of categories are 6
The feature is PavedDrive and number of categories are 3
The feature is PoolQC and number of categories are 4
The feature is Fence and number of categories are 5
The feature is SaleType and number of categories are 9
The feature is SaleCondition and number of categories are 6

The threshold number of categories that I have chosen for performing one-hot encoding in this case is 10.
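The cardinality check and the encoding can be sketched as follows. The DataFrame is made up, and because it is tiny the cut-off MAX_CATEGORIES is set to 3 rather than 10; the constant name and the use of pd.get_dummies are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
data = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "NoRidge"],
})

# The article uses 10; the toy data is tiny, so a smaller value is used here.
MAX_CATEGORIES = 3

categorical_features = data.select_dtypes(exclude="number").columns
low_cardinality = []
for col in categorical_features:
    n_categories = data[col].nunique()
    print(f"The feature is {col} and number of categories are {n_categories}")
    if n_categories <= MAX_CATEGORIES:
        low_cardinality.append(col)

# One-hot encode only the low-cardinality features; the rest are left as-is.
encoded = pd.get_dummies(data, columns=low_cardinality)
print(encoded.columns.tolist())
```

Here nunique() ignores missing values, so on the real data the counts may differ from a listing that treats NaN as its own category.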

Now let us check whether there is any relationship between the categorical features and the median of the target variable (sale price).
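One way to run this check is to group by each categorical feature and take the median sale price per category. The sketch below does this on a made-up DataFrame and prints the medians instead of drawing bar plots.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
data = pd.DataFrame({
    "CentralAir": ["Y", "Y", "N", "Y", "N"],
    "SalePrice": [208500, 181500, 118000, 223500, 129500],
})

categorical_features = data.select_dtypes(exclude="number").columns.tolist()

# Median sale price for every category of every categorical feature.
median_price = {}
for col in categorical_features:
    median_price[col] = data.groupby(col)["SalePrice"].median()
    print(median_price[col])
```

Calling .plot.bar() on each resulting Series would give one bar chart per feature.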

Multicollinearity

In any dataset, when the independent features are strongly correlated with each other, it hampers the model because the individual contribution of each feature cannot be isolated. This is called multicollinearity.

This is a serious problem for algorithms like linear and logistic regression.

How to fix it?

We use a correlation matrix with a heatmap to visualize the relationship of all the independent features with each other through their correlation coefficients.

Generally, 0.7 is taken as the threshold: if any two features have a correlation above 0.7, one of them can be dropped.
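The thresholding step can be sketched as below, using made-up values for three columns of the dataset. Taking the absolute correlation and always dropping the second feature of a pair are choices made for this illustration; the heatmap itself is omitted, but it could be drawn from the same matrix, for example with seaborn's heatmap function.

```python
import numpy as np
import pandas as pd

# Toy features, made up for illustration; GarageArea closely tracks GarageCars.
X = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 1, 2],
    "GarageArea": [250, 480, 500, 800, 280, 460],
    "LotArea": [9600, 8450, 14260, 9550, 11250, 10084],
})

THRESHOLD = 0.7  # threshold from the text
corr = X.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second feature of every pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
X_reduced = X.drop(columns=to_drop)
print(to_drop)
```

Which feature of a correlated pair to keep is a judgment call; a common choice is to keep the one more strongly related to the target.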

Conclusion

These were some important steps in Exploratory Data Analysis, and they show the importance of EDA in real-life projects. I hope everyone uses this technique in their own projects.

Happy Learning! 🙂

Sameer
