Exploratory Data Analysis is an approach for discovering insights in data, and it is one of the best practices in data science today. People often confuse Data Analysis with Exploratory Data Analysis. The two may not look very different, but they serve different purposes.
Exploratory Data Analysis is a complement to inferential statistics, which works with fairly rigid rules and formulas, whereas Data Analysis combines statistics and probability to figure out the trends and patterns in a dataset.
EDA is the first step in the data analysis phase, where you explore and manipulate the dataset before drawing conclusions. It is carried out before formal statistical techniques are applied. Statistical techniques are usually applied to a dataset together with plots such as histograms or box plots, but EDA does not come with a fixed set of techniques or procedures. It is more a form of art than a science.
The EDA process lets the analyst get a feel for the dataset and use their own judgment to identify its important elements.
For example, multidimensional scaling gives a visual representation of the similarity distances between a set of objects; by looking at the multidimensional representation, the user can read off the approximate distances between the objects.
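As a quick illustration (separate from the retail analysis later in this article), here is a minimal multidimensional scaling sketch using scikit-learn; the dissimilarity matrix is made up purely for the example.

```python
import numpy as np
from sklearn.manifold import MDS

# Pairwise dissimilarities between four objects (symmetric, zero diagonal)
dissimilarities = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.5, 2.5],
    [2.0, 1.5, 0.0, 1.0],
    [3.0, 2.5, 1.0, 0.0],
])

# Embed the objects in 2-D so that the plotted distances approximate
# the original dissimilarities
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=42)
coords = mds.fit_transform(dissimilarities)
print(coords)
```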
EDA is a crucial step before getting deep into machine learning or modelling your data to solve business problems. It helps you choose an appropriate model and interpret its results correctly.
Machine learning now offers powerful, advanced algorithms, so people are tempted to skip the Exploratory Data Analysis phase and feed the data straight into the algorithm as a black box, hoping for better results.
Exploratory Data Analysis surfaces a lot of crucial information that people usually miss, and this information helps in the long run. EDA has been around for a long time: it was developed in the 1970s by John Tukey, the statistician who also coined the word “bit”. It is often described as a philosophy, since there are no hard and fast rules for approaching it.
The purpose of EDA is-
S+ and R are among the most important statistical programming languages for performing EDA. They come with a bundle of tools to perform functions such as-
Exploratory Data Analysis delivers enormous value to the business by helping the data scientist interpret results correctly in the relevant business context. It also helps stakeholders check whether they have asked the right questions, and they often discover interesting trends they were not even aware of.
There are many data connectors available that help companies incorporate EDA into Business Intelligence software. We can build and run statistical models in R that use BI data and automatically update as new data flows through the model.
In the e-commerce world, we often want to know which customers place the most orders and spend the most money, and where they come from. These insights help drive the company's sales.
Let's explore a dataset that contains transactions from customers in various countries who make purchases from a UK-based online retail company that sells all-occasion gifts.
In the real world, data is messy, and it is necessary to clean it before exploring the dataset.
Below is the snapshot of the original dataset.
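If you want to reproduce this snapshot in pandas, a minimal loading sketch is given below; the file name online_retail.csv and the encoding are assumptions, so adjust them to wherever your copy of the data lives.

```python
import pandas as pd

# Load the transactional data (file name and encoding are assumptions)
df = pd.read_csv("online_retail.csv", encoding="ISO-8859-1")

# The first few rows give the "snapshot" of the raw data
print(df.head())
print(df.shape)
```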
Below are the details about the features-
Let's check whether any missing values are present in the columns.
We can see that there are some missing values in CustomerID and Description. The rows containing missing values should be removed; pandas provides dropna() to handle them.
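A minimal sketch of both steps, continuing with the df loaded above:

```python
# Count missing values per column
print(df.isnull().sum())

# Drop the rows with missing CustomerID or Description
df = df.dropna(subset=["CustomerID", "Description"])
print(df.isnull().sum().sum())  # should now be 0
```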
Let's look at some statistical information about the dataset using the describe() function. It shows the count, mean, standard deviation, minimum, maximum, and quartile values of the continuous or discrete variables.
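For example, on the numeric columns of the cleaned dataframe:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(df.describe())
```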
From this summary, we can see that Quantity has negative values and UnitPrice has zero values. Such values are not plausible for a purchase, so we remove the rows with negative or zero values in these two variables.
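One way to do this, assuming the column names shown above:

```python
# Keep only rows with a positive quantity and a positive unit price
df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)]
```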
In this dataset, we also need to add extra features that help us gain deeper insights into the sales.
We need to add an amount_spent variable. To calculate the total amount spent on each purchase, we simply multiply Quantity by UnitPrice.
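In pandas this is a single line:

```python
# Total amount spent on each line item
df["amount_spent"] = df["Quantity"] * df["UnitPrice"]
```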
In the e-commerce world, we often want to know which customers place the most orders and spend the most money on their purchases, and which countries they come from. This helps us analyze and improve the company's sales; a sketch of how such a per-country summary can be computed is shown below.
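This sketch groups the cleaned data by Country; the InvoiceNo and amount_spent column names follow the schema used above.

```python
# Orders and total money spent per country
country_summary = (
    df.groupby("Country")
      .agg(orders=("InvoiceNo", "nunique"),
           money_spent=("amount_spent", "sum"))
      .sort_values("orders", ascending=False)
)
print(country_summary.head(10))
```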
From the table, we can observe that the UK has the highest number of orders and the Netherlands spends the highest amount of money on its purchases.
Likewise, from the chart above, we can observe that the company receives the highest number of orders in November 2011. We can carry out similar in-depth analysis with the purchase order dataset, use EDA to validate business assumptions, and then build a machine learning model to predict next year's sales.
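For reference, the monthly order counts behind such a chart can be computed along these lines (assuming InvoiceDate holds the purchase timestamp):

```python
import matplotlib.pyplot as plt

# Number of distinct orders per month
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
orders_per_month = df.groupby(df["InvoiceDate"].dt.to_period("M"))["InvoiceNo"].nunique()
orders_per_month.plot(kind="bar", title="Number of orders per month")
plt.show()
```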
We can also look at the transactional patterns for each country. The chart below shows the number of orders from each country (including the UK).
The next chart shows the number of orders from each country (excluding the UK).
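Both charts can be produced with a couple of lines; dropping the UK simply makes the remaining countries visible on the scale (again assuming the column names used above).

```python
# Orders per country, with and without the dominant UK market
orders_by_country = (
    df.groupby("Country")["InvoiceNo"].nunique().sort_values(ascending=False)
)
orders_by_country.plot(kind="bar", title="Number of orders per country (with the UK)")
plt.show()

orders_by_country.drop("United Kingdom").plot(
    kind="bar", title="Number of orders per country (without the UK)"
)
plt.show()
```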
As expected, the company receives far more orders from the UK than from any other country, since it is a UK-based company.
So, the top 5 countries (including the UK) that place the highest number of orders are as follows:
Let's also explore the top 5 countries that spend the most money. The chart below shows the amount of money spent by customers in each country (including the UK).
The next chart shows the amount of money spent by each country (excluding the UK).
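A sketch of the corresponding computation, using the amount_spent column added earlier:

```python
# Money spent per country, with and without the UK
spend_by_country = (
    df.groupby("Country")["amount_spent"].sum().sort_values(ascending=False)
)
spend_by_country.plot(kind="bar", title="Money spent per country (with the UK)")
plt.show()

spend_by_country.drop("United Kingdom").plot(
    kind="bar", title="Money spent per country (without the UK)"
)
plt.show()

print(spend_by_country.head(5))  # the top 5 spending countries, including the UK
```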
So, the top 5 countries (including the UK) that spend the most money on their purchases are as follows:
The following points are observed from the EDA:
- The company receives the highest number of orders in November 2011.
We can identify interesting patterns simply by performing EDA. It is all about understanding your data before making any assumptions, and it helps you avoid building inaccurate models on the dataset.
Finally, predictive analytics and machine learning are important tools for analyzing the dataset properly and interpreting results that are aligned with the business objectives.