Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization techniques in order to bring important aspects of that data into focus for further analysis. This involves inspecting the dataset from many angles, describing & summarizing it without making any assumptions about its contents.
EDA is a significant step to take before diving into statistical modeling or machine learning, to ensure the data is really what it is claimed to be and that there are no obvious errors. It should be part of data science projects in every organization.
For example, in Python, you can perform EDA techniques by importing necessary libraries, loading your dataset, and using functions to display basic information, summary statistics, check for missing values, and visualize distributions and relationships between variables. Here’s a basic example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('your_dataset.csv')
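# Display basic information (column names, dtypes, non-null counts)
data.info()
# Display summary statistics
print(data.describe())
# Check for missing values
print(data.isnull().sum())
# Visualize distributions of the numeric columns
data.hist(figsize=(10, 8))
plt.show()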
EDA is crucial because raw data is often skewed and may contain outliers or too many missing values. A model built on such data tends to perform sub-optimally. In the hurry to get to the machine learning stage, some data professionals either skip the EDA process entirely or do a very mediocre job. This is a mistake with many implications.
In this article, we’ll be using Pandas, Seaborn, and Matplotlib libraries of Python to demonstrate various EDA techniques applied to Haberman’s Breast Cancer Survival Dataset. This will provide a practical understanding of EDA and highlight its importance in the data analysis workflow.
Before diving into the dataset, let’s first understand the different types of Exploratory Data Analysis (EDA) techniques. Here are five key types of EDA techniques:
By using these EDA techniques, we can gain a comprehensive understanding of the data, identify key patterns and relationships, and ensure the data’s integrity before proceeding with more complex analyses.
The dataset used is open source and comprises cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The dataset is publicly available for download.
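Attribute Information
1. patient_age: age of the patient at the time of the operation
2. operation_year: year in which the operation took place
3. positive_axillary_nodes: number of positive axillary nodes detected
4. survival_status: 1 = survived 5 years or more, 2 = died within 5 years
Attributes 1, 2, and 3 form our features (independent variables), while attribute 4 is our class label (dependent variable).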
Let’s begin our analysis . . .
Import all necessary packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
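Load the dataset into a pandas DataFrame:
# header=0 treats the first row of the file as column names, which are then
# replaced below; use header=None instead if your copy of the file has no header row
df = pd.read_csv('haberman.csv', header=0)
df.columns = ['patient_age', 'operation_year', 'positive_axillary_nodes', 'survival_status']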
To understand the dataset, let’s look at the first few rows.
df.head()
Shape of the DataFrame
To understand the size of the dataset, we check its shape.
df.shape
(305, 4)
Class Distribution
The dataset has 305 rows and 4 columns. Next, let’s see how many data points there are for each class label.
df['survival_status'].value_counts()
Checking for Missing Values
Let’s check for any missing values in the dataset.
print("Missing values in each column:\n", df.isnull().sum())
There are no missing values in the dataset.
Data Information
Let’s get a summary of the dataset to understand the data types and further verify the absence of missing values.
df.info()
By understanding the basic structure, distribution, and completeness of the data, we can proceed with more detailed exploratory data analysis (EDA) and uncover deeper insights.
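The structural checks above can also be collected into a single report, which is handy when the same checks are repeated on several files. The following is only a minimal sketch: the `sample` frame contains made-up rows that merely reuse our column names, and `structure_report` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for the real data: same column names, made-up rows.
sample = pd.DataFrame({
    "patient_age": [30, 34, 38, 42, 45, 52],
    "operation_year": [64, 59, 60, 65, 66, 61],
    "positive_axillary_nodes": [1, 0, 21, 0, 3, 4],
    "survival_status": [1, 2, 1, 1, 2, 1],
})


def structure_report(frame: pd.DataFrame, label_col: str) -> dict:
    """Collect shape, missing-value counts, and class counts in one place."""
    return {
        "shape": frame.shape,
        "missing": frame.isnull().sum().to_dict(),
        "class_counts": frame[label_col].value_counts().to_dict(),
    }


report = structure_report(sample, "survival_status")
print(report)
```

Comparing the class counts in such a report is a quick way to spot an imbalanced label before any modeling starts.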
Before proceeding with statistical analysis and visualization, we need to modify the original class labels. The current labels are 1 (survived 5 years or more) and 2 (died within 5 years), which are not very descriptive. We’ll map these to more intuitive categorical variables: ‘yes’ for survival and ‘no’ for non-survival.
# Map survival status values to categorical variables 'yes' and 'no'
df['survival_status'] = df['survival_status'].map({1: 'yes', 2: 'no'})
# Display the updated DataFrame to verify changes
print(df.head())
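One thing worth checking after a mapping like this: `Series.map` with a dictionary turns any value that is not a key of the dictionary into NaN. Here is a small sketch of how that could be verified. The `labels` series is made up, and the value 3 is deliberately invalid to show the effect.

```python
import pandas as pd

# Hypothetical label column; the value 3 is deliberately invalid.
labels = pd.Series([1, 2, 1, 3, 2])
mapping = {1: "yes", 2: "no"}

mapped = labels.map(mapping)

# Values missing from the mapping become NaN, so count them explicitly.
unmapped = int(mapped.isnull().sum())
print(mapped.value_counts(dropna=False))
print("unmapped labels:", unmapped)
```

If the unmapped count is zero, every original label was covered by the mapping.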
We will now perform a general statistical analysis to understand the overall distribution and central tendencies of the data.
# Display summary statistics of the DataFrame
df.describe()
Notice the significant difference between the mean and the median (the 50% row) in the summary above. This happens because the data contains some outliers: the mean is pulled toward extreme values, while the median is not.
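To see how strongly a single extreme value can pull the mean away from the median, here is a tiny sketch with made-up node counts (these numbers are purely illustrative and not taken from the dataset).

```python
import numpy as np

# Hypothetical node counts: mostly small, with one extreme value at the end.
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 52])

mean_all = nodes.mean()
median_all = np.median(nodes)

# Drop the extreme value and recompute both statistics.
trimmed = nodes[:-1]
mean_trimmed = trimmed.mean()
median_trimmed = np.median(trimmed)

print(f"with outlier:    mean={mean_all:.3f}, median={median_all}")
print(f"without outlier: mean={mean_trimmed:.3f}, median={median_trimmed}")
```

Removing one point moves the mean from 7.875 to about 1.571, while the median only moves from 1.5 to 1.0.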
To gain deeper insights, we’ll perform a statistical analysis for each class (survived vs. not survived) separately.
Survived (Yes) Analysis:
survival_yes = df[df['survival_status'] == 'yes']
survival_yes.describe()
Not Survived (No) Analysis:
survival_no = df[df['survival_status'] == 'no']
survival_no.describe()
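As an alternative to filtering each class by hand, the same class-wise statistics can be computed in one step with `groupby`. This is just a sketch on made-up rows that reuse our column names.

```python
import pandas as pd

# Hypothetical rows with the same column names as our DataFrame.
sample = pd.DataFrame({
    "patient_age": [30, 34, 38, 42, 45, 52],
    "positive_axillary_nodes": [1, 9, 0, 0, 21, 4],
    "survival_status": ["yes", "no", "yes", "yes", "no", "yes"],
})

# One row per class, one column per statistic.
per_class = sample.groupby("survival_status")["positive_axillary_nodes"].agg(
    ["count", "mean", "median", "max"]
)
print(per_class)
```

Having both classes side by side in one table makes the comparison easier than reading two separate summaries.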
From the above class-wise analysis, differences between the two groups can be observed by comparing their summary statistics.
Note that all these observations are based solely on the data at hand.
“A picture is worth ten thousand words”
– Fred R. Barnard
Uni-variate analysis involves studying one variable at a time. This type of analysis helps in understanding the distribution and characteristics of each variable individually. Below are different ways to perform uni-variate analysis along with their outputs and interpretations.
Distribution plots, also known as probability density function (PDF) plots, show how values in a dataset are spread out. They help us see the shape of the data distribution and identify patterns.
Patient’s Age
sns.FacetGrid(df, hue="survival_status", height=5).map(sns.histplot, "patient_age", kde=True).add_legend()
plt.title('Distribution of Age')
plt.xlabel('Patient Age')
plt.show()
Operation Year
sns.FacetGrid(df, hue="survival_status", height=5).map(sns.histplot, "operation_year", kde=True).add_legend()
plt.title('Distribution of Operation Year')
plt.xlabel('Operation Year')
plt.show()
Positive Axillary Nodes
sns.FacetGrid(df, hue="survival_status", height=5).map(sns.histplot, "positive_axillary_nodes", kde=True).add_legend()
plt.title('Distribution of Positive Axillary Nodes')
plt.xlabel('Number of Positive Axillary Nodes')
plt.show()
But we must back our observations with some quantitative measure. That’s where cumulative distribution function (CDF) plots come into the picture.
CDF plots show the probability that a variable will take a value less than or equal to a specific value. They provide a cumulative measure of the distribution.
# Binned CDF of node counts for patients who survived
counts, bin_edges = np.histogram(df[df['survival_status'] == 'yes']['positive_axillary_nodes'], density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label='CDF Survival status = Yes')
# Same computation for patients who did not survive
counts, bin_edges = np.histogram(df[df['survival_status'] == 'no']['positive_axillary_nodes'], density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label='CDF Survival status = No')
plt.xlabel("Positive Axillary Nodes")
plt.ylabel("Cumulative Probability")
plt.title('Cumulative Distribution Function for Positive Axillary Nodes')
plt.legend()
plt.show()
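The histogram-based CDF depends on the choice of bins. If exact values are needed, an empirical CDF can be built directly from the sorted data. The sketch below uses made-up node counts, and `ecdf` is a hypothetical helper name.

```python
import numpy as np


def ecdf(values):
    """Return the sorted values and the cumulative fractions i/n for a step plot."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Hypothetical node counts, for illustration only.
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 52])
x, y = ecdf(nodes)

# Fraction of patients with at most 4 positive nodes.
share_at_most_4 = np.mean(nodes <= 4)
print(share_at_most_4)
```

Statements such as "87.5% of these patients have at most 4 nodes" can then be read off directly, without any binning.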
Box plots, also known as box-and-whisker plots, summarize data using five key metrics: minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. They also highlight outliers.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.boxplot(x='survival_status', y='patient_age', data=df)
plt.title('Box Plot of Age')
plt.subplot(1, 3, 2)
sns.boxplot(x='survival_status', y='operation_year', data=df)
plt.title('Box Plot of Operation Year')
plt.subplot(1, 3, 3)
sns.boxplot(x='survival_status', y='positive_axillary_nodes', data=df)
plt.title('Box Plot of Positive Axillary Nodes')
plt.tight_layout()
plt.show()
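The numbers behind a box plot can be computed by hand. One detail to keep in mind is that, by default, the whiskers drawn by matplotlib and seaborn stop at the most extreme points within 1.5 times the interquartile range (IQR), and anything beyond is drawn as an outlier. Here is a rough sketch on made-up ages.

```python
import numpy as np

# Hypothetical ages, for illustration only.
ages = np.array([30, 34, 38, 42, 45, 52, 60, 83])

# Five-number summary.
minimum, q1, median, q3, maximum = np.percentile(ages, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Whisker ends: the most extreme data points within 1.5 * IQR of the box.
lower_whisker = ages[ages >= q1 - 1.5 * iqr].min()
upper_whisker = ages[ages <= q3 + 1.5 * iqr].max()

# Points beyond the whiskers are the ones drawn individually as outliers.
outliers = ages[(ages < lower_whisker) | (ages > upper_whisker)]
print(q1, median, q3, lower_whisker, upper_whisker, outliers)
```

In this toy example the maximum (83) lies beyond the upper whisker (60), so it would appear as a separate point above the box.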
Violin plots combine the features of box plots and density plots. They provide a visual summary of the data and show the distribution’s shape, density, and variability.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.violinplot(x='survival_status', y='patient_age', data=df)
plt.title('Violin Plot of Age')
plt.subplot(1, 3, 2)
sns.violinplot(x='survival_status', y='operation_year', data=df)
plt.title('Violin Plot of Operation Year')
plt.subplot(1, 3, 3)
sns.violinplot(x='survival_status', y='positive_axillary_nodes', data=df)
plt.title('Violin Plot of Positive Axillary Nodes')
plt.tight_layout()
plt.show()
These observations align with our previous analyses and provide a deeper understanding of the data.
Bar charts display the frequency or count of categories within a single variable, making them useful for comparing different groups.
Survival Status Count
sns.countplot(x='survival_status', data=df)
plt.title('Count of Survival Status')
plt.xlabel('Survival Status')
plt.show()
Histograms show the distribution of numerical data by grouping data points into bins. They help understand the frequency distribution of a variable.
Age Distribution
df['patient_age'].plot(kind='hist', bins=20, edgecolor='black')
plt.title('Histogram of Age')
plt.xlabel('Patient Age')
plt.show()
Bi-variate data analysis involves studying the relationship between two variables at a time. This helps in understanding how one variable affects another and can reveal underlying patterns or correlations. Here are some common methods for bi-variate analysis.
A pair plot visualizes the pairwise relationships between variables in a dataset. It displays both the distributions of individual variables and their relationships.
sns.pairplot(df, hue='survival_status')
plt.show()
While the pair plot provides an overview of the relationships between all pairs of variables, sometimes it is useful to focus on the relationship between just two specific variables in more detail. This is where the joint plot comes in.
A joint plot provides a detailed view of the relationship between two variables along with their individual distributions.
sns.jointplot(x='patient_age', y='positive_axillary_nodes', data=df, kind='scatter')
plt.show()
While joint plots and pair plots help visualize the relationships between pairs of variables, a heatmap can provide a broader view of the correlations among all the variables in the dataset simultaneously.
A heatmap visualizes the correlation between different variables. It uses color coding to represent the strength of the correlations, which can help identify relationships between variables.
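# Correlate only the numeric columns; survival_status now holds 'yes'/'no' strings
sns.heatmap(df.select_dtypes('number').corr(), cmap='YlGnBu', annot=True)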
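A correlation matrix only makes sense for numeric columns, so a text label such as 'yes'/'no' has to be excluded or encoded first (recent pandas versions raise an error rather than silently dropping such columns). The sketch below encodes the label as 0/1 so that it can be included. The rows are made up, and the `survived` column name is our own assumption.

```python
import pandas as pd

# Hypothetical rows with the same column names as our DataFrame.
sample = pd.DataFrame({
    "patient_age": [30, 34, 38, 42, 45, 52],
    "operation_year": [64, 59, 60, 65, 66, 61],
    "positive_axillary_nodes": [1, 9, 0, 0, 21, 4],
    "survival_status": ["yes", "no", "yes", "yes", "no", "yes"],
})

# Encode the label as 0/1 so it can take part in the correlation matrix.
encoded = sample.assign(survived=(sample["survival_status"] == "yes").astype(int))

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = encoded.select_dtypes("number").corr()
print(corr.round(2))
```

The resulting `corr` frame can be passed to `sns.heatmap` in the same way as before.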
These bi-variate analysis techniques provide valuable insights into the relationships between different features in the dataset, helping to understand how they interact and influence each other. Understanding these relationships is crucial for building more accurate models and making informed decisions in data analysis and machine learning tasks.
Multivariate analysis involves examining more than two variables simultaneously to understand their relationships and combined effects. This type of analysis is essential for uncovering complex interactions in data. Let’s explore several multivariate analysis techniques.
A contour plot is a graphical technique that represents a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format. This allows us to visualize complex relationships between three variables in an easily interpretable 2-D chart.
For example, let’s examine the relationship between patient’s age and operation year, and how these relate to the number of patients.
sns.jointplot(x='patient_age', y='operation_year', data=df, kind='kde', fill=True)
plt.show()
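The "third dimension" in such a plot is the density of patients at each combination of age and operation year. A kernel density estimate is a smoothed version of that; the unsmoothed counterpart is a simple 2-D histogram, sketched below with NumPy. The values, bin counts, and ranges are made up for illustration.

```python
import numpy as np

# Hypothetical ages and operation years, for illustration only.
ages = np.array([30, 34, 38, 42, 45, 52, 47, 49])
years = np.array([64, 59, 60, 65, 66, 61, 64, 65])

# Count patients in a grid of 3 age bins x 2 year bins.
counts, age_edges, year_edges = np.histogram2d(
    ages, years, bins=[3, 2], range=[[30, 60], [58, 68]]
)

# Locate the grid cell with the most patients.
densest_cell = np.unravel_index(counts.argmax(), counts.shape)
print(counts)
print("densest cell (age bin, year bin):", densest_cell)
```

The densest cell corresponds to the innermost, darkest region of the contour plot.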
By utilizing contour plots, we can effectively consolidate information from three dimensions into a two-dimensional format, making it easier to identify patterns and relationships in the data. This approach enhances our ability to perform comprehensive multivariate analysis and extract valuable insights from complex datasets.
A 3D scatter plot is an extension of the traditional scatter plot into three dimensions, which allows us to visualize the relationship among three variables.
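from mpl_toolkits.mplot3d import Axes3D  # only needed to register the 3D projection on older matplotlib
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['patient_age'], df['operation_year'], df['positive_axillary_nodes'])
ax.set_xlabel('Patient Age')
ax.set_ylabel('Operation Year')
ax.set_zlabel('Positive Axillary Nodes')
plt.show()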
In this article, we learned some common steps involved in exploratory data analysis. We also saw several types of charts and plots and what information each of them conveys. This is not all there is to EDA; I encourage you to play with the data, come up with different kinds of visualizations, and observe what insights you can extract from them.
A. Exploratory Data Analysis (EDA) in data science involves examining datasets to summarize their main characteristics, often through visual methods. EDA helps data scientists understand data structure, detect patterns, identify anomalies, and generate hypotheses, which are crucial for informed decision-making and preparing data for further analysis or modeling.
A. The four steps of exploratory data analysis (EDA) typically involve:
1. Data Cleaning
2. Data Exploration
3. Feature Engineering
4. Data Visualization
1. Summarize the Data: Calculate basic statistics for numerical variables and determine frequency distribution for categorical variables.
2. Visualize the Data: Create histograms, scatter plots, box plots, bar charts, and pie charts.
3. Identify Outliers: Detect and investigate outliers using statistical methods or visualization techniques.
4. Transform the Data: Apply transformations to improve the performance of machine learning algorithms and handle missing values.
5. Identify Relationships: Calculate correlation coefficients and create correlation matrices.
6. Generate Hypotheses: Formulate hypotheses about the underlying patterns and relationships in the data.
7. Iterate and Refine: EDA is an iterative process, so revisit previous steps and refine your analysis as needed.
A. The four main types of Exploratory Data Analysis (EDA) are:
1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis
4. Visualization Techniques
A. Several tools and libraries are commonly used for Exploratory Data Analysis (EDA), including: