Mastering Exploratory Data Analysis (EDA) For Data Science Enthusiasts

Pranshu Sharma Last Updated : 25 Oct, 2024


This article was published as a part of the Data Science Blogathon.

Overview

Step-by-step approach to performing EDA
Resources like blogs and MOOCs for getting familiar with EDA
Getting familiar with various data visualization techniques, charts, and plots
Demonstration of some steps with Python code snippets

What is the one thing that differentiates one data science professional from another?

Not Machine Learning, not Deep Learning, not SQL. It's Exploratory Data Analysis (EDA). How well one can identify hidden patterns and trends in the data, and how valuable the extracted insights are, is what sets data professionals apart.

1. What Is Exploratory Data Analysis

Exploratory Data Analysis is an approach to analyzing data sets in order to summarize their main characteristics, often using statistical graphics and other data visualization methods.
EDA assists data science professionals in various ways:

1 Getting a better understanding of data
2 Identifying various data patterns
3 Getting a better understanding of the problem statement

[Note: the dataset used in this blog is the iris dataset.]

2. Checking Introductory Details About Data

The first step of any data analysis, after loading the data file, should be checking a few introductory details such as the number of columns, the number of rows, the types of features (categorical or numerical), and the data types of the column entries.

Python Code Snippet

import seaborn as sns
data = sns.load_dataset('iris')  # load the iris dataset as a pandas DataFrame
data.info()  # prints column names, non-null counts, and data types

data.head()  # displays the first five rows

data.tail()  # displays the last five rows
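
The introductory checks described above can also be done programmatically. Here is a minimal sketch, assuming a small hand-typed sample with the same column names as the iris dataset (the six rows are illustrative and are not loaded from seaborn, so the snippet runs without a download):

```python
import pandas as pd

# Hypothetical hand-typed sample that mimics the columns of the iris dataset
data = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Number of rows and columns
n_rows, n_cols = data.shape

# Split the features into numerical and categorical by their dtype
numerical_cols = data.select_dtypes(include="number").columns.tolist()
categorical_cols = data.select_dtypes(exclude="number").columns.tolist()

print(f"rows: {n_rows}, columns: {n_cols}")
print("numerical features:", numerical_cols)
print("categorical features:", categorical_cols)
print(data.dtypes)  # data type of each column
```

On the full dataset, the same calls should work unchanged on the DataFrame returned by sns.load_dataset('iris').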

3. Statistical Insight

This step gives summary statistics for each numerical column, such as the mean, standard deviation, median (the 50% row), maximum value, and minimum value.

Python Code Snippet

data.describe()
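
To see what describe() reports, here is a small sketch on a hand-typed sample (the values and the per-species grouping are illustrative assumptions, not part of the original walkthrough):

```python
import pandas as pd

# Hypothetical hand-typed sample with two iris-style columns
data = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# count, mean, std, min, 25%, 50%, 75%, max for every numerical column
summary = data.describe()
print(summary)

# The "50%" row of describe() is the median
median = data["petal_length"].median()

# Statistics per category are often more telling than the global ones
per_species_mean = data.groupby("species")["petal_length"].mean()
print(per_species_mean)
```

Grouping by species before summarizing is one optional extra step; it often makes differences between the classes visible before any plot is drawn.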

4. Data cleaning

This is one of the most important steps in EDA. It involves removing duplicate rows or columns, filling missing entries with values such as the mean or median of the column, and dropping rows that still contain null entries.

Checking Null entries

Python Code Snippet

data.isnull().sum()  # gives the number of missing values for each variable

Removing Null Entries

Python Code Snippet

data.dropna(axis=0, inplace=True)  # drops every row that contains a null entry

Filling values in place of Null Entries (If Numerical feature)

The fill value can be the mean, the median, or any constant.

Python Code Snippet

data["sepal_length"] = data["sepal_length"].fillna(data["sepal_length"].mean())  # fills null entries, if any, with the column mean
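
The iris dataset may not contain any null entries to practice on, so here is a minimal sketch of the three null-handling steps on a hypothetical sample with a few values deliberately set to NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values injected on purpose
data = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 7.0, 6.4],
    "sepal_width": [3.5, 3.0, np.nan, 3.2],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# 1. Count the missing values in each column
missing_per_column = data.isnull().sum()
print(missing_per_column)

# 2. Option A: drop every row that has at least one null entry
dropped = data.dropna(axis=0)

# 3. Option B: keep all rows and fill the gaps instead
filled = data.copy()
filled["sepal_length"] = filled["sepal_length"].fillna(filled["sepal_length"].mean())
filled["sepal_width"] = filled["sepal_width"].fillna(filled["sepal_width"].median())

print(len(data), "rows originally,", len(dropped), "rows after dropping")
print(filled)
```

Dropping is simple but discards whole rows, while filling keeps the rows at the cost of introducing estimated values. Which one suits a dataset depends on how many entries are missing.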

Checking Duplicates

Python Code Snippet

data.duplicated().sum()  # returns the total number of duplicate rows

Removing Duplicates

Python Code Snippet

data.drop_duplicates(inplace=True)
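
Here is a minimal sketch of the duplicate check and removal on a hypothetical sample in which one row is repeated on purpose:

```python
import pandas as pd

# Hypothetical sample: the third row repeats the first one
data = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 7.0],
    "sepal_width": [3.5, 3.0, 3.5, 3.2],
    "species": ["setosa", "setosa", "setosa", "versicolor"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = data.duplicated().sum()
print("duplicate rows:", n_duplicates)

# Keep the first occurrence of each row and renumber the index
deduplicated = data.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

By default only the later copies are flagged, so the first occurrence of each row survives.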

5. Data Visualization

Data visualization is the method of converting raw data into a visual form, such as a map or graph, to make the data easier to understand and to extract useful insights from it.

The main goal of data visualization is to put large datasets into a visual representation. It is one of the most important steps in data science, and also one of the simplest.

You can refer to the blog below for more details about data visualization:

Choosing The Right Visualization Techniques for extracting Data Insights

The various types of visualization analysis are:

a. Univariate analysis:

This examines the distribution of a single variable at a time. It can be shown with plots such as histograms, box plots, and violin plots; a scatter or line plot of the values against their index also works.

b. Bivariate analysis:

Bivariate analysis reveals the relationship between two variables. It can be shown with scatter plots, heat maps, and histograms, box plots, or violin plots split by a second variable.

c. Multivariate analysis:

Multivariate analysis, as the name suggests, reveals the relationships among more than two variables.

Scatter plots, histograms, box plots, and violin plots can be used for multivariate analysis, for example by encoding an additional variable as color (hue) or by arranging several plots in a grid.
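
A heat map for bivariate analysis is usually drawn from a correlation matrix. Here is a minimal sketch of computing that matrix, assuming a small hand-typed sample with iris-style columns (the values and the "strongest pair" helper are illustrative, not part of the original walkthrough):

```python
from itertools import combinations

import pandas as pd

# Hypothetical hand-typed sample that mimics the columns of the iris dataset
data = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Pearson correlation between every pair of numerical columns
numeric = data.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))

# Pair of distinct features with the largest absolute correlation
strongest_pair = max(
    combinations(corr.columns, 2),
    key=lambda pair: abs(corr.loc[pair[0], pair[1]]),
)
print("most strongly related pair:", strongest_pair)
```

Passing the corr table to sns.heatmap(corr, annot=True) would then draw it as a heat map.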

Various Plots

Below are some of the plots that can be used for univariate, bivariate, and multivariate analysis.

a. Scatter Plot

Python Code Snippet

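import matplotlib.pyplot as plt

plt.figure(figsize=(17, 9))
plt.title('Comparison between various species based on sepal length and width')
# recent seaborn releases require x and y to be passed as keyword arguments
sns.scatterplot(data=data, x='sepal_length', y='sepal_width', hue='species', s=50)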

For multivariate analysis

Python Code Snippet

sns.pairplot(data, hue="species", height=4)

b. Box Plot

Box plots show how each of the four numerical input variables is distributed across the categorical feature "species".

Python Code Snippet

fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot(y="petal_width", x="species", data=data, orient='v', ax=axes[0, 0])
sns.boxplot(y="petal_length", x="species", data=data, orient='v', ax=axes[0, 1])
sns.boxplot(y="sepal_length", x="species", data=data, orient='v', ax=axes[1, 0])
sns.boxplot(y="sepal_width", x="species", data=data, orient='v', ax=axes[1, 1])
plt.show()
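
To read a box plot, it helps to know the numbers behind it. Here is a minimal sketch computing them by hand, assuming the usual 1.5 × IQR whisker rule (the seaborn and matplotlib default) and a hypothetical sample with one extreme value added on purpose:

```python
import pandas as pd

# Hypothetical petal widths with one deliberately extreme value
values = pd.Series([0.2, 0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 1.5], name="petal_width")

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

# Whiskers reach the most extreme points that still lie inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1:.3f}, median={median:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print("outliers:", outliers.tolist())
```

In this sample, only the value 1.5 falls outside the fences, so it is the single point a box plot would draw beyond the whiskers.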

c. Violin Plot

A violin plot is more informative than a box plot because it shows the full distribution of the data.

Python Code Snippet

fig, axes = plt.subplots(2, 2, figsize=(16,10))
sns.violinplot(y="petal_width", x="species", data=data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y="petal_length", x="species", data=data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y="sepal_length", x="species", data=data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y="sepal_width", x="species", data=data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()

d. Histograms

A histogram can be used for visualizing the probability density function (PDF) of a variable.

Python Code Snippet

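# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram plus density curve
(sns.FacetGrid(data, hue="species", height=5)
    .map(sns.histplot, "petal_width", kde=True, stat="density")
    .add_legend())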
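
What makes a histogram an estimate of the PDF is the density scaling: the bar heights are normalized so that the total area of the bars is 1. Here is a minimal sketch with NumPy, assuming a hypothetical hand-typed sample of petal widths and an arbitrary choice of 5 bins:

```python
import numpy as np

# Hypothetical hand-typed petal widths
petal_width = np.array([0.2, 0.2, 0.4, 1.3, 1.5, 1.4, 2.5, 1.9, 2.1, 1.8])

# density=True rescales the counts so the bars integrate to 1
density, bin_edges = np.histogram(petal_width, bins=5, density=True)

# Area of each bar = height * width; the areas sum to 1
bin_widths = np.diff(bin_edges)
total_area = float(np.sum(density * bin_widths))

for left, right, height in zip(bin_edges[:-1], bin_edges[1:], density):
    print(f"[{left:.2f}, {right:.2f}): density {height:.3f}")
print("total area:", round(total_area, 6))
```

The number of bins changes how smooth or jagged the estimate looks, so it is worth trying a few values.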

With this, I conclude this blog.
Hello everyone, Namaste.
My name is Pranshu Sharma and I am a Data Science enthusiast.
Thank you so much for taking the time to read this blog. Feel free to point out any mistakes (I'm a learner, after all), share your feedback, or leave a comment.
Dhanyvaad!!
Feedback:
Email: [email protected]

You can refer to the blog mentioned below to get more familiar with Exploratory Data Analysis:

Exploratory Data Analysis: Iris Dataset

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.


Pranshu Sharma

Aspiring Data Scientist | M.TECH, CSE at NIT DURGAPUR
