This article was published as a part of the Data Science Blogathon.
Well! We all love cakes. If you take a deeper look at the baking process, you will notice how the right blend of ingredients and one clever leavening agent, baking powder, can decide the rise and fall of your cake.
`Baking the cake` might sound off-track in a technical article, but I believe it is a relatable and delicious analogy for understanding the importance of EDA in the Data Science pipeline.
If baking the cake is the Data Science pipeline, then the clever leavening agent (baking powder) is Exploratory Data Analysis.
Before your mouth starts watering for a cake, as mine already is, let’s understand.
Exploratory Data Analysis is an approach for Data Analysis that employs a variety of techniques to-
Let’s take the famous `BLACK FRIDAY SALES` case study to understand why we need EDA.
The core problem is to understand customer behavior by predicting the purchase amount. But isn’t that too abstract? It can leave you baffled about what to do with the data, especially when you have so many different products across various categories.
Before reading further, give a little thought to this question: would you put all the ingredients available in the kitchen into the oven as they are to bake the cake?
Obviously, the answer is no! Before you take the entire dataset as it is and bake it into the Machine Learning model, you would want to
**EDA, in essence, can make or break any machine learning model.**
There are 5 steps in EDA:
To showcase Univariate Analysis on one of the continuous variables of the Black Friday Sale dataset, `Purchase`, I have created a function that takes the data as input and plots a KDE graph explaining the characteristics of the feature.
To showcase Univariate Analysis on the categorical variables of the Black Friday Sale dataset, `City_Category` and `Marital_Status`, I have created a function that takes the data and the features as input and returns a count plot showing the frequency of the categories in each feature.
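A count-plot helper along these lines might look like the sketch below. It is not the author's function: the name `plot_counts`, the one-panel-per-feature layout, and the synthetic `demo` frame (including the 0/1 coding of `Marital_Status`) are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_counts(data: pd.DataFrame, features):
    """Draw one count plot per categorical feature, side by side.

    Hypothetical helper: the name and layout are illustrative choices.
    """
    fig, axes = plt.subplots(
        1, len(features), figsize=(5 * len(features), 4), squeeze=False
    )
    for ax, feature in zip(axes[0], features):
        # Bar height = number of rows that fall in each category
        sns.countplot(data=data, x=feature, ax=ax)
        ax.set_title(f"Frequency of {feature}")
    fig.tight_layout()
    return axes[0]


# Synthetic stand-in for the real data (assumed category codes)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "City_Category": rng.choice(["A", "B", "C"], size=500),
    "Marital_Status": rng.integers(0, 2, size=500),
})
axes = plot_counts(demo, ["City_Category", "Marital_Status"])
```

Returning the axes rather than calling `plt.show()` inside the function keeps the helper usable both in notebooks and in scripts.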
It is important to note that relying only on Univariate and Bivariate Analysis can be quite misleading, so the inferences drawn from these two should be validated with Hypothesis Testing. We can run a t-test, a chi-square test, or ANOVA, which allow us to quantify whether two samples are significantly similar to or different from each other. Here I have created a function to analyze the relationship between a continuous and a categorical variable; it returns the t-statistic value.
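As a rough sketch of such a function, assuming SciPy is available, the example below splits a continuous column by a two-level categorical column and runs a two-sample t-test. The name `ttest_by_group`, the choice of Welch's variant (`equal_var=False`), and the synthetic data are assumptions; the article does not say which variant the author used. The sketch returns the p-value alongside the t-statistic, because the p-value is the number that gets compared with the significance level.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_by_group(data: pd.DataFrame,
                   continuous: str = "Purchase",
                   categorical: str = "Marital_Status"):
    """Two-sample t-test of a continuous column across two categories.

    Hypothetical helper; uses Welch's t-test (no equal-variance assumption).
    """
    groups = [g[continuous].dropna() for _, g in data.groupby(categorical)]
    if len(groups) != 2:
        raise ValueError("t-test needs exactly two groups; use ANOVA for more")
    result = stats.ttest_ind(groups[0], groups[1], equal_var=False)
    return float(result.statistic), float(result.pvalue)


# Synthetic stand-in: both groups drawn from the same distribution
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Marital_Status": rng.integers(0, 2, size=2000),
    "Purchase": rng.normal(9000, 5000, size=2000),
})
t_stat, p_value = ttest_by_group(demo)
print(f"t-statistic = {t_stat:.3f}, p-value = {p_value:.3f}")
```

If the p-value is above the chosen significance level (commonly 0.05), we fail to reject the null hypothesis that the two group means are equal.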
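In Univariate Analysis we observed a significant difference between the number of married and unmarried customers. The t-test, however, gives a value of 0.89, which is greater than the significance level of 0.05. Note that it is the p-value, not the t-statistic itself, that is compared with the significance level, so 0.89 is best read as the p-value here. We therefore fail to reject the null hypothesis: there is no significant difference between the average purchase of single and married customers.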
In this article, I have briefly discussed the importance of EDA in the Data Science pipeline and the steps involved in a proper analysis. I have also shown how a wrong or incomplete analysis can be quite misleading and can considerably affect the performance of machine learning models.
“If you don’t roast your data, you are just another person with an opinion.”;)