This article was published as a part of the Data Science Blogathon.
Exploratory Data Analysis (EDA) helps in identifying outlying data points, understanding the structure of the data and the relationships between its attributes, and recognizing the important variables. It helps in framing questions and visualizing the results, paving the way for an informed choice of machine learning algorithm for the problem at hand.
While performing Exploratory Data Analysis, it is important to keep the objective in mind. The aim is not to plot fancy graphs but to derive useful insights.
Keeping that in mind, in this article we will look at an example of Exploratory Data Analysis performed on Haberman’s survival dataset, which is available on Kaggle.
The objective of this analysis is to find patterns within the dataset, to gain a further understanding of the data, and to use that understanding to choose a machine learning algorithm for predicting the survival of patients who undergo the surgery.
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Data attributes:
Age of patient at the time of operation (numerical)
Patient’s year of operation (year minus 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years
We start by loading the data into a data frame.
import pandas as pd
df = pd.read_csv("/kaggle/input/habermans-survival-data-set/haberman.csv")
df.shape
df.info()
The data has 305 rows and 4 columns with no NULL values. The columns do not have a heading/title, hence we provide a meaningful title to the columns in our dataset.
df.columns = ["age", "year", "nodes", "status"]
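Since the file has no header row, the names can also be supplied while the file is read. The sketch below is a minimal illustration on three made-up rows held in memory (`sample_csv` and `sample_df` are hypothetical names, not part of the original notebook). `header=None` tells pandas to treat the first line as data rather than as column titles; with the default settings the first record is consumed as the header, which may be one reason the row count reported above is lower than expected.

```python
import io

import pandas as pd

# Three made-up rows in the same four-column layout (not real patient records).
sample_csv = io.StringIO("41,60,2,1\n52,63,0,1\n47,59,11,2\n")

column_names = ["age", "year", "nodes", "status"]

# header=None: the first line is data, so no record is lost to the header.
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.shape)
print(sample_df.head())
```

On the real file the same two arguments would be passed to the `pd.read_csv` call shown earlier.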
df.describe()
Observations
A quick look at the count of records for the attributes “status” and “year” (when the operation was performed) gives us the following insights.
print(df["status"].value_counts())
print(df["year"].value_counts())
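Raw counts are easier to compare as proportions, especially when checking whether the two survival classes are balanced. A small sketch on a made-up `status` column (the `toy` frame is hypothetical) using the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Made-up labels: three survivors (1) and one non-survivor (2).
toy = pd.DataFrame({"status": [1, 1, 1, 2]})

# normalize=True returns the share of each class instead of the raw count.
class_share = toy["status"].value_counts(normalize=True)

print(class_share)
```

A strongly uneven split is worth noting early, because it affects how a later classifier should be evaluated.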
Univariate analysis is the simplest form of data analysis: it uses only one variable, hence the name.
We will use the probability density function, the cumulative distribution function, box plots, and violin plots for our analysis.
Probability Density Function
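The probability density function (PDF) describes how likely a random variable is to take values near a given point; the area under the curve over a range of values gives the probability of the variable falling in that range.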
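To make the "area under the curve" idea concrete, here is a rough numerical sketch. It approximates a PDF with a density-normalised histogram on synthetic ages (the values are generated, not taken from the dataset) and checks that the bar areas add up to 1.

```python
import numpy as np

# Synthetic ages for illustration only.
rng = np.random.default_rng(0)
ages = rng.normal(loc=52, scale=10, size=300)

# density=True scales the bar heights so that the total area equals 1.
density, edges = np.histogram(ages, bins=10, density=True)
widths = np.diff(edges)

total_area = float(np.sum(density * widths))

# Probability of falling in the first three bins = area of those bars.
prob_first_three_bins = float(np.sum(density[:3] * widths[:3]))

print(total_area, prob_first_three_bins)
```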
We have plotted below the PDF of the age
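The plotting code is not part of the article text, so the following is only one possible way to draw such a plot. It overlays a density-normalised histogram per survival status using matplotlib; `plot_density_by_status` and the `toy` frame are hypothetical, and the ages are randomly generated.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_density_by_status(frame, column):
    """Overlay one density-normalised histogram per survival status."""
    fig, ax = plt.subplots()
    for status_value, group in frame.groupby("status"):
        ax.hist(group[column], bins=10, density=True, alpha=0.5,
                label=f"status {status_value}")
    ax.set_xlabel(column)
    ax.set_ylabel("density")
    ax.legend()
    return ax


# Synthetic stand-in for the real data frame.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "age": rng.integers(30, 84, size=60),
    "status": np.tile([1, 2], 30),
})

ax = plot_density_by_status(toy, "age")
# In a notebook the figure is displayed automatically; in a script call plt.show().
```

The same helper can be reused for the `nodes` and `year` columns by changing the second argument.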
Observations
PDF of Number of Nodes
Observations
The distributions overlap, but we can note that the survival rate is better in patients who have 0–2 nodes, and that it decreases as the number of nodes increases.
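An observation like this can be cross-checked numerically by grouping patients into node ranges and computing the share of survivors in each. The sketch below uses a made-up frame, and the bucket boundaries are an assumption chosen for illustration, not values from the article.

```python
import pandas as pd

# Made-up records for illustration.
toy = pd.DataFrame({
    "nodes": [0, 1, 2, 0, 5, 8, 12, 30],
    "status": [1, 1, 1, 2, 1, 2, 2, 2],
})

# Assumed buckets: 0-2, 3-10 and 11 or more nodes.
toy["node_bucket"] = pd.cut(
    toy["nodes"], bins=[-1, 2, 10, 100], labels=["0-2", "3-10", "11+"]
)
toy["survived"] = toy["status"] == 1

# The mean of a boolean column is the fraction of True values.
survival_share = toy.groupby("node_bucket", observed=True)["survived"].mean()

print(survival_share)
```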
PDF of the year of operation
Observations
The distributions overlap, but we can see that between 1963 and 1966 the data contains more patients who survived, and between 1958 and 1961 it contains more patients who died within 5 years of the operation.
The cumulative distribution function (CDF) describes the probability that a random variable will be found at a value less than or equal to the point at which the CDF is calculated.
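For a sample, this definition translates into an empirical CDF: the fraction of observations that are less than or equal to a given point. A minimal sketch on made-up node counts (`empirical_cdf` is a hypothetical helper):

```python
import numpy as np

# Made-up node counts for illustration.
nodes = np.array([0, 0, 1, 2, 4, 7, 13, 23])
sorted_nodes = np.sort(nodes)

# CDF value at each sorted observation: 1/n, 2/n, ..., 1.
cdf_values = np.arange(1, len(sorted_nodes) + 1) / len(sorted_nodes)


def empirical_cdf(point):
    """Fraction of observations less than or equal to `point`."""
    return np.searchsorted(sorted_nodes, point, side="right") / len(sorted_nodes)


share_up_to_two = empirical_cdf(2)
print(share_up_to_two)
```

Plotting `sorted_nodes` against `cdf_values` as a step line gives the usual CDF curve.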
CDF of number of nodes
Observations
Observations
Box Plots
Box plots help us visualize the distribution of the data based on its quartiles and give some indication of the data’s symmetry and skewness. Unlike many other methods of data display, box plots show outliers.
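The quantities behind a box plot can be computed directly. The sketch below, on made-up node counts, derives the quartiles and flags outliers with the common 1.5 × IQR rule, which is also the default whisker length in matplotlib box plots.

```python
import numpy as np

# Made-up node counts for illustration.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 13, 23, 52])

q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

print(q1, median, q3, outliers)
```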
Boxplot on Age
Observations
As noted before, the data overlaps to a great extent, so we cannot draw an accurate conclusion from the age of the patient alone.
Boxplot on Nodes
Observations
We can see that for the nodes attribute we have some outlier points.
Bivariate analysis aims to find patterns and relationships within the dataset using two attributes at a time. It is useful for testing simple associations.
One plot that can be used for this analysis is the pair plot.
Pair plots are an easy way to visualize relationships within your data. A pair plot produces a matrix of plots in which each variable is plotted against every other variable.
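The code behind the pair plot is not shown in the article. As a stand-in, the sketch below builds a similar grid with `pandas.plotting.scatter_matrix` on randomly generated data, colouring each point by survival status. The `toy` frame and the colour choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for the real data frame.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "age": rng.integers(30, 84, size=40),
    "year": rng.integers(58, 70, size=40),
    "nodes": rng.integers(0, 25, size=40),
    "status": np.tile([1, 2], 20),
})

features = toy[["age", "year", "nodes"]]

# One colour per point, chosen from its survival status.
point_colours = toy["status"].map({1: "tab:green", 2: "tab:red"}).tolist()

# Returns a 3x3 array of axes: histograms on the diagonal, scatter plots elsewhere.
axes = scatter_matrix(features, c=point_colours, alpha=0.6,
                      diagonal="hist", figsize=(7, 7))
```

Reading the grid row by row, panels 2, 3 and 6 hold the age/year, age/nodes and year/nodes pairs.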
Example of Pair Plots
Observations
Plot 2 (attributes: age and year)
The points overlap, so not all of them are clearly visible on the plot, which makes it difficult to draw a conclusion.
Plot 3 (attributes: age and nodes)
The points overlap, so not all of them are clearly visible on the plot, which makes it difficult to draw a conclusion. We can, however, see that older patients with a higher number of nodes generally have status 2 (did not survive 5 years).
Plot 6 (attributes: year and nodes)
The points overlap, so not all of them are clearly visible on the plot, which makes it difficult to draw a conclusion.
Contour plots can be used for multivariate analysis. They are used to represent a three-dimensional surface on a two-dimensional plane. One variable is represented on the horizontal axis and a second variable is represented on the vertical axis. The third variable is represented by a colour gradient.
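As a rough illustration of the idea, the sketch below bins synthetic (year, age) pairs on a grid and draws filled contours of the counts with matplotlib. This is a simplified stand-in: the article's plots may well use a smoothed density estimate instead of raw bin counts, and all values here are generated rather than taken from the dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic (year, age) pairs for illustration.
rng = np.random.default_rng(3)
year = rng.integers(58, 70, size=200)
age = rng.normal(loc=52, scale=10, size=200)

# Count how many points fall into each cell of a 6x6 grid.
counts, year_edges, age_edges = np.histogram2d(year, age, bins=(6, 6))
year_centres = (year_edges[:-1] + year_edges[1:]) / 2
age_centres = (age_edges[:-1] + age_edges[1:]) / 2

fig, ax = plt.subplots()
# contourf expects Z with shape (len(y), len(x)), hence the transpose.
filled = ax.contourf(year_centres, age_centres, counts.T, levels=5)
fig.colorbar(filled, ax=ax, label="patients per bin")
ax.set_xlabel("year")
ax.set_ylabel("age")
```

Filtering the data to a single status before binning gives one plot per survival class.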
A contour plot with year on the X-axis and age on the Y-axis, for the patients with status = 1 (survived 5 years or longer after the operation).
Observations
The patients who survived are mostly in the approximate age group of 45–55, with operations in the years 1962–1964.
A contour plot with year on the X-axis and age on the Y-axis, for the patients with status = 2 (did not survive 5 years).
Observations
The patients who did not survive were mostly in the approximate age group of 45–50, with operations between the years 1962 and 1965.
Hope you liked reading my article on Exploratory Data Analysis.
I am currently working as an analyst. By writing these articles I try to deepen my understanding of applied machine learning.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.