This article was published as a part of the Data Science Blogathon
As the title suggests, we will be exploring data and analyze Haberman Data set of cancer Survivals in this blog. Exploring data is the first step to draw conclusions on how to approach the objective. Keeping objective in mind one analyses data. Let’s understand what and how data is explored and analyzed.
The prerequisite for exploring data is data acquisition.
Steps to be followed before data is acquired :
As we’ll explore Haberman Data set of cancer Survivals , Let’s download the Data set using the below link
https://www.kaggle.com/gilsousa/habermans-survival-data-set
With data acquired, we’ll understand the data set using the following metrics
Count of whole instances along the number of attributes with shape method in data frame created using pandas. Names of the columns in the data set can be checked using the columns method in the data frame created using pandas.
Dataset columns and shape
import pandas as pd haberman = pd.read_csv("haberman.csv") print(haberman.columns) print(haberman.shape)
The above code gives us information that the dataset contains and the shape (i.e., no. of instances and no. of columns)
Class attribute/Dependent variable in the data set determines how balanced the data set is. Instances distributed over each class decide the balance of the data set. In this data set, Survival years is the class attribute. For all instances, the value of the Survival years attribute is either 1 or 2.
Determining the survival span, the attribute Survival years is given value as bulleted below
import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np haberman = pd.read_csv("haberman.csv") print(haberman["Survival years"].value_counts())
Independent variables are used in data analysis and machine learning model training to determine/predict class attribute/s. Data analysis could be progressed on plotting. With univariate/ bivariate and multivariate analysis available, let’s perform and draw conclusions from each analysis
Only one variable analysis is concentrated in this analysis. Let’s see the range of one independent variable using histogram and pdf curves(continuous line over histograms)
Distplot function combines kde plot and rug plot functions from the seaborn module for the curve and hist function from matplotlib module for the histogram. Operation Year attribute is plotted against the class attribute, Survival years.
Same graphs can be obtained and the behavior of the independent variable at every instance range as shown in X-axis across class attribute in Y-axis can be understood.
In the below plot, there’s overlapping of Operation Year attribute, this can be said as a grim situation, as this makes classification of Operation Year attribute over class attribute Survival years hard i.e., if Operation Year attribute is distributed evenly against the class attribute, then a range of the distribution would have made classification easy and usable.
Overlapping can be spotted with the different color area, that is not displayed in the legend.
Operation Year independent attribute distribution against Class attribute Survival years
import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np haberman = pd.read_csv("haberman.csv") sns.FacetGrid(haberman,hue="Survival years",size=12). map(sns.distplot,"Operation Year"). add_legend() plt.show()
The other two independent variables are distributed in the same style as the Operation Year attribute when tried to classify against class attribute Survival years.
Patient Age Distribution
import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np haberman = pd.read_csv("haberman.csv") sns.set_style("whitegrid"); sns.FacetGrid(haberman,hue="Survival years",size = 3) .map(sns.distplot,"Patient Age"). add_legend() plt.show()
Number of nodes Distribution
import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np haberman = pd.read_csv("haberman.csv") sns.set_style("whitegrid"); sns.FacetGrid(haberman,hue="Survival years",size = 3) .map(sns.distplot,"Num of nodes"). add_legend() plt.show()
CDF of each independent variable against class attribute shows us the distribution across the range of its value, below is the graph that shows CDF.along with its PDF.
Patient Age CDF and PDF
The orange curve in the graph is CDF(Cumulative Density Function) for Patient Age independent attribute against class attribute Survival years with the class as 1(i.e., patients survived less than 5 years after the operation).
CDF gives information of patient age, such as every patient age is greater than 70, for all patients who could not survive more than 5 years after the operation and half of the patient's age lie between 50 and 60. Blue line in the plot below is PDF shows how patient age is distributed across the age range, with an increase of bins size in the histogram method of NumPy module, PDF is plotted more specifically for all points. import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np haberman = pd.read_csv("haberman.csv") haberman_Survival_years = haberman.loc[haberman["Survival years"] == 1] counts,bin_edges = np.histogram(haberman_Survival_years["Patient Age"], bins = 10, density = True) pdf = counts/(sum(counts)) cdf = np.cumsum(pdf) plt.plot(bin_edges[1:],pdf) plt.plot(bin_edges[1:],cdf) plt.show()
Num of nodes CDF and PDF
The orange curve in the graph is CDF(Cumulative Density Function) for the Number of Nodes independent attribute against class attribute Survival years with the class as 2(i.e., patients survived greater than 5 years after the operation).
CDF gives information of patient age, such as 60 % of patients had less than 10 nodes, for all patients who could survive more than 5 years after the operation and every patient’s number of nodes are between 5 and 55(approximately). Blueline in the plot below is PDF shows how a number of nodes in patients are distributed across the number of nodes range. Blueline(PDF) gives information that no patient survived more than 5 years after an operation who diagnosed with more than 30 nodes.
import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np haberman = pd.read_csv("haberman.csv") haberman_Survival_years = haberman.loc[haberman["Survival years"] == 2] counts,bin_edges = np.histogram(haberman_Survival_years["Num of nodes"], bins = 10, density = True) pdf = counts/(sum(counts)) cdf = np.cumsum(pdf) plt.plot(bin_edges[1:],pdf) plt.plot(bin_edges[1:],cdf) plt.show()
Let’s pair plot, using two variables on each axis and color on the data points to classify the class attribute. Pair plot is a data visualization technique, coloring class attribute with plotting two independent variables in all combinations possible.
The last graph in the pair plot below shows that the number of lymph nodes contributed to deciding class attribute( Survival years ).
The first graph in the pair plot below shows that the older patients are leaning to class1( Survival years after operation 5).
Pair Plots
import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np haberman = pd.read_csv("haberman.csv") sns.pairplot(haberman, hue="Survival years", size = 3). add_legend() plt.show()
Using more than two independent variables in plotting graphs can be hard to visualize, for 3-D analysis link provided below can be used
https://plotly.com/python/3d-scatter-plots/
When data is imbalanced and data(independent variables) are not greatly contributing to deciding class attribute better data analysis and visualization techniques are required. Haberman Cancer Survival Dataset is imbalanced and independent attributes relevance to the class attribute is not great when analyzed using the above techniques. Univariate analysis(CDF) provides information about the distribution of attributes across the range for each class in the class attribute, irrespective of how much the dataset is balanced/imbalanced.