Optimizing Exploratory Data Analysis using Functions in Python!

Rohit Last Updated : 11 Jan, 2021

This article was published as a part of the Data Science Blogathon.

“The more, the merrier.”

The saying fits the amount of analysis done on any dataset.

As more and more people opt for a career in data science, the need grows for a fast-track way to guide each of them along the path. I learned Python as my base and then gradually added the skills that helped me grow in the data science domain.

In this post, I cover the important steps and the Python functions you can use for Exploratory Data Analysis (EDA) on any dataset.

Today's plan is to run our fingers through the data and figure out as much as we can, in an optimized way. I am writing this article to share user-defined functions that shorten EDA coding time.

The most important steps to follow in a project are:

Importing the data
Data validation
    1. Column datatype
    2. Imputing null/missing values
Data exploration (EDA)
    1. Univariate
    2. Bivariate
    3. Multivariate
Feature engineering
Transformation/scaling
Model building (applying machine-learning algorithms) and tuning
Score calculation

Of the steps above, this article covers the functions for EDA. If you run into any issues while using them, or need help with any other step, please let me know in the comments. There are several ways to implement each function; I have chosen the most general one.
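The data validation step listed above (checking column datatypes and imputing missing values) could be sketched roughly as follows. This is an illustration only, not the notebook's method: the helper name `validate_and_impute`, the toy DataFrame, and the choice of median and most-frequent-value imputation are all assumptions.

```python
import pandas as pd

def validate_and_impute(data):
    """Report dtypes and null counts, then fill the gaps (hypothetical helper)."""
    report = pd.DataFrame({"dtype": data.dtypes.astype(str),
                           "nulls": data.isnull().sum()})
    out = data.copy()
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                # assumption: fill numeric columns with the median
                out[col] = out[col].fillna(out[col].median())
            else:
                # assumption: fill other columns with the most frequent value
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return report, out

# toy data, made up for illustration
toy = pd.DataFrame({"weight": [9.3, None, 17.5, 19.2],
                    "outlet_size": ["Small", "Medium", None, "Medium"]})
report, clean = validate_and_impute(toy)
```

The `report` frame gives a quick view of which columns need attention before any plotting starts; the imputation rule is something you would normally choose per column.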

Index

Introduction
Univariate analysis
Bi-variate analysis
Multi-variate analysis
Helpful functions
Summary

Introduction

The most important and time-consuming part of any analytics problem is understanding the data. It is better to spend time studying the data rather than coding the same thing again and again.

The functions we are going to build today are pretty general and you can adapt them as per your requirement.

The pseudo-code for a user-defined function in Python is:

Function definition:

          def func_name(parameters):          # function name and parameters
                 """function_steps: a docstring describing what the function does"""
                 function_commands
                 return return_value          # the return statement is optional

Function call:

            func_name(parameters)
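As a concrete instance of this template, here is a small sketch that uses only the standard library. The function name `summarize` and the sample values are made up for illustration.

```python
from statistics import mean, median

def summarize(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

# function call: define once, reuse for any list of numbers
stats = summarize([4, 8, 15, 16, 23, 42])
```

The EDA functions later in the article follow the same pattern: the body stays fixed and only the argument (a column name) changes between calls.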

Function for Univariate analysis:

Moving on to EDA: we define each function once and then call it by passing the name of a feature in the dataset as a parameter. The functions below read from a DataFrame named df1 that has already been loaded. A notebook demonstrating all of the functions described below is on GitHub: https://github.com/r-pant/data-hacks/blob/master/big%20mart%20sales/file1.ipynb

Categorical:

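The function below draws a count plot for the feature passed to it, with the bars ordered from the most to the least frequent category.

            import matplotlib.pyplot as plt
            import seaborn as sns

            def plot_cat(var, l=8, b=5):
                      plt.figure(figsize=(l, b))
                      sns.countplot(x=df1[var], order=df1[var].value_counts().index)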
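A count plot is easier to read next to the numbers behind it. The sketch below is a possible companion, not part of the original notebook: the helper name `cat_summary` and the toy column `fat_content` are assumptions. It takes the DataFrame as a parameter instead of relying on a global `df1`, which makes the function reusable across datasets.

```python
import pandas as pd

def cat_summary(data, var):
    """Counts and percentage share per category, most frequent first."""
    counts = data[var].value_counts()
    return pd.DataFrame({"count": counts,
                         "percent": (counts / counts.sum() * 100).round(2)})

# toy data, made up for illustration
toy = pd.DataFrame({"fat_content": ["Low", "Regular", "Low", "Low"]})
summary = cat_summary(toy, "fat_content")
```

The row order of `summary` matches the bar order used in `plot_cat`, since both come from `value_counts()`.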

Continuous:

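1. For a simple distribution plot of a continuous feature:

               def plot_cont(var, l=8, b=5):
                    plt.figure(figsize=(l, b))
                    # histplot with kde=True stands in for distplot, which is deprecated in recent seaborn releases
                    sns.histplot(x=df1[var], kde=True, stat='density')
                    plt.xlabel(var)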

2. For a detailed KDE plot annotated with summary statistics:

               # box plot and KDE plot with mean, median, min/max and std-dev markers
               def plot_cont_kde(var, l=8, b=5):
                    mini = df1[var].min()
                    maxi = df1[var].max()
                    ran = maxi - mini
                    mean = df1[var].mean()
                    skew = df1[var].skew()
                    kurt = df1[var].kurtosis()
                    median = df1[var].median()
                    st_dev = df1[var].std()
                    points = mean - st_dev, mean + st_dev   # one standard deviation either side of the mean
                    fig, axes = plt.subplots(1, 2)
                    sns.boxplot(data=df1, x=var, ax=axes[0])
                    # histplot with kde=True stands in for distplot, which is deprecated in recent seaborn releases
                    sns.histplot(x=df1[var], kde=True, stat='density', ax=axes[1], color='#ff4125')
                    # markers are drawn on the baseline of the right-hand plot
                    sns.lineplot(x=list(points), y=[0, 0], color='black', label="std_dev", ax=axes[1])
                    sns.scatterplot(x=[mini, maxi], y=[0, 0], color='orange', label="min/max", ax=axes[1])
                    sns.scatterplot(x=[mean], y=[0], color='red', label="mean", ax=axes[1])
                    sns.scatterplot(x=[median], y=[0], color='blue', label="median", ax=axes[1])
                    fig.set_size_inches(l, b)
                    axes[1].set_title('std_dev = {}; kurtosis = {};\nskew = {}; range = {}\nmean = {}; median = {}'.format(
                        (round(points[0], 2), round(points[1], 2)),
                        round(kurt, 2), round(skew, 2),
                        (round(mini, 2), round(maxi, 2), round(ran, 2)),
                        round(mean, 2), round(median, 2)))
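The statistics that end up in the plot title can also be collected into a table, which is handy when comparing several features side by side. This is a sketch under assumptions: the helper name `cont_summary` and the toy column `mrp` are not from the original notebook.

```python
import pandas as pd

def cont_summary(data, var):
    """Collect the statistics shown in the plot title into one Series."""
    col = data[var]
    return pd.Series({
        "min": col.min(), "max": col.max(), "range": col.max() - col.min(),
        "mean": col.mean(), "median": col.median(), "std_dev": col.std(),
        "skew": col.skew(), "kurtosis": col.kurtosis(),
    }).round(2)

# toy data, made up for illustration
toy = pd.DataFrame({"mrp": [1.0, 2.0, 3.0, 4.0, 5.0]})
summary = cont_summary(toy, "mrp")
```

Calling it for every numeric column and concatenating the results gives a compact overview before any plots are drawn.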

Functions for Bi-variate analysis:

Bi-variate analysis helps uncover relationships between pairs of features and test our hypotheses about them. What we find guides which features we build and feed into the model.

Categorical-Categorical:

         def BVA_categorical_plot(data, tar, cat):
              '''take data and two categorical variables,
              calculate the chi2 significance between the two variables
              and show the result with a countplot and a percent stacked bar plot
              '''
              from scipy.stats import chi2_contingency
              #isolating the variables
              data = data[[cat, tar]].copy()
              #forming a crosstab
              table = pd.crosstab(data[tar], data[cat])
              #performing chi2 test on the full table, so targets with more than two classes also work
              chi, p, dof, expected = chi2_contingency(table)
              #checking whether results are significant
              sig = p < 0.05
              #plotting grouped plot
              sns.countplot(x=cat, hue=tar, data=data)
              plt.title("p-value = {}\n difference significant? = {}\n".format(round(p, 8), sig))
              #plotting percent stacked bar plot, with the normalized crosstab as its title
              ax1 = data.groupby(cat)[tar].value_counts(normalize=True).unstack()
              ax1.plot(kind='bar', stacked=True, title=str(ax1))
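To see the chi-square step in isolation, without the plotting, a minimal sketch on made-up data might look like this. The column names `outlet_type` and `high_sales` and the counts are invented for illustration; `pd.crosstab` and `scipy.stats.chi2_contingency` are the same calls the function above relies on.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# toy data, made up for illustration: 35 rows per outlet type
toy = pd.DataFrame({
    "outlet_type": ["Grocery"] * 35 + ["Supermarket"] * 35,
    "high_sales": ["no"] * 30 + ["yes"] * 5 + ["no"] * 5 + ["yes"] * 30,
})

# contingency table of observed counts
table = pd.crosstab(toy["high_sales"], toy["outlet_type"])

# chi-square test of independence on the full table
chi, p, dof, expected = chi2_contingency(table)
significant = p < 0.05
```

Here the two variables are strongly associated by construction, so the test reports a significant difference. The 0.05 threshold is the one used in the function above; adjust it to your own significance level.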

Categorical-Continuous:

Here, I have used two functions: one to calculate the p-value of a two-sample Z-test, and the other to plot the relation between our features.

    def TwoSampleZ(X1, X2, sigma1, sigma2, N1, N2):
         '''
         takes the means, standard deviations, and numbers of observations of two samples
         and returns the p-value of a two-sided 2-sample Z-test
         '''
         from numpy import sqrt
         from scipy.stats import norm
         ovr_sigma = sqrt(sigma1**2/N1 + sigma2**2/N2)
         z = (X1 - X2)/ovr_sigma
         pval = 2*(1 - norm.cdf(abs(z)))
         return pval

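The Z-test computes z = (X1 - X2) / sqrt(sigma1²/N1 + sigma2²/N2) and converts it to a two-sided p-value. The sketch below reproduces that calculation with the standard library only, so the arithmetic can be checked by hand. The function name `two_sample_z` and the input numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, x2, sigma1, sigma2, n1, n2):
    """Two-sided p-value of a two-sample Z-test from summary statistics."""
    ovr_sigma = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = (x1 - x2) / ovr_sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

# identical means: z = 0, so the p-value is 1
p_same = two_sample_z(10, 10, 2, 2, 50, 50)

# means 10 vs 8, sigma 2, n = 50 each: the standard error is about 0.4, so z is about 5
p_diff = two_sample_z(10, 8, 2, 2, 50, 50)
```

A p-value near 1 means the two group means are indistinguishable; a very small one, as in the second call, means the difference is unlikely to be due to chance.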
      def Bivariate_cont_cat(data, cont, cat, category):
           #creating 2 samples
           x1 = data[cont][data[cat]==category]      # observations in the chosen category
           x2 = data[cont][~(data[cat]==category)]   # all remaining observations
           #calculating descriptives
           n1, n2 = x1.shape[0], x2.shape[0]
           m1, m2 = x1.mean(), x2.mean()     # means
           std1, std2 = x1.std(), x2.std()   # standard deviations
           #calculating p-values
           z_p_val = TwoSampleZ(m1, m2, std1, std2, n1, n2)
           #table
           table = pd.pivot_table(data=data, values=cont, columns=cat, aggfunc='mean')
           #plotting
           plt.figure(figsize=(15,6), dpi=140)
           #barplot
           plt.subplot(1,2,1)
           sns.barplot(x=[str(category), 'not {}'.format(category)], y=[m1, m2])
           plt.ylabel('mean {}'.format(cont))
           plt.xlabel(cat)
           plt.title(' \n z-test p-value = {}\n {}'.format(z_p_val, table))
           # boxplot
           plt.subplot(1,2,2)
           sns.boxplot(x=cat, y=cont, data=data)
           plt.title('categorical boxplot')

Continuous-Continuous:

     #Defining a function to cross-tabulate two columns;
     #'perc' is the row-wise share of the second Col2 value, so it is meaningful when Col2 takes two values
     def corr_2_cols(Col1, Col2):
          res = pd.crosstab(df1[Col1], df1[Col2])
          # res = df1.groupby([Col1, Col2]).size().unstack()
          res['perc'] = (res[res.columns[1]]/(res[res.columns[0]] + res[res.columns[1]]))
          return res
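For two genuinely continuous columns, a correlation coefficient is the more usual summary. The following is a sketch of one way to compute it, not the notebook's method: the helper name `corr_cont` and the toy columns `mrp` and `sales` are assumptions, while `Series.corr` is the standard pandas call.

```python
import pandas as pd

def corr_cont(data, col1, col2, method="pearson"):
    """Correlation coefficient between two numeric columns (hypothetical helper)."""
    return data[col1].corr(data[col2], method=method)

# toy data, made up for illustration: sales rise almost linearly with mrp
toy = pd.DataFrame({"mrp": [10, 20, 30, 40],
                    "sales": [105, 210, 290, 405]})
r = corr_cont(toy, "mrp", "sales")
```

A value of `r` close to 1 indicates a strong positive linear relationship. Passing `method="spearman"` switches to a rank-based measure, which is less sensitive to outliers.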

Functions for Multi-variate analysis:

       def Grouped_Box_Plot(data, cont, cat1, cat2):
            #boxplot
            sns.boxplot(x=cat1, y=cont, hue=cat2, data=data, orient='v')
            plt.title('Boxplot')
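The grouped box plot compares the distribution of a continuous feature across two categorical features at once. The centre line of each box is the group median, and those medians can be tabulated directly. This sketch is an assumption-laden companion: the helper name `grouped_median` and the toy columns are invented for illustration.

```python
import pandas as pd

def grouped_median(data, cont, cat1, cat2):
    """Median of `cont` for every (cat1, cat2) pair (hypothetical helper)."""
    return data.pivot_table(values=cont, index=cat1, columns=cat2, aggfunc="median")

# toy data, made up for illustration
toy = pd.DataFrame({
    "outlet": ["A", "A", "A", "B", "B", "B"],
    "fat": ["Low", "Low", "Regular", "Low", "Regular", "Regular"],
    "sales": [100, 120, 90, 200, 150, 170],
})
medians = grouped_median(toy, "sales", "outlet", "fat")
```

Each cell of `medians` corresponds to one box in the plot, with rows for `cat1` and columns for `cat2`.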

Summary

All the above functions help us save time and reduce redundancy in our code.

There will be times when you need to change the type of plot or add more detail to it. You can alter any function to suit your requirements. Do note: always follow a structure to complete your EDA. I have shared above the steps you should follow while working with a dataset.

-Rohit
