Not Machine Learning, Not Deep Learning, Not SQL, It’s Exploratory Data Analysis (EDA). How good one is with the identification of hidden patterns/trends of the data and how valuable the extracted insights are, is what differentiates Data Professionals.
Exploratory Data Analysis is an approach in analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
EDA assists Data science professionals in various ways:-
1 Getting a better understanding of data
2 Identifying various data patterns
3 Getting a better understanding of the problem statement
[ Note: the dataset in this blog is being opted as iris dataset]
The first and foremost step of any data analysis, after loading the data file, should be about checking few introductory details like, no. Of columns, no. of rows, types of features( categorical or Numerical), data types of column entries.
Python Code Snippet
Python Code:
import seaborn as sns
df = sns.load_dataset('iris')
print(df.info())
data.head() For displaying first five rows
data.tail() For Displaying last Five Rows
This step should be performed for getting details about various statistical data like Mean, Standard Deviation, Median, Max Value, Min Value
data.describe()
This is the most important step in EDA involving removing duplicate rows/columns, filling the void entries with values like mean/median of the data, dropping various values, removing null entries
data.IsNull().sum gives the number of missing values for each variable
data.dropna(axis=0,inplace=True) If null entries are there
Values can either be mean, median or any integer
data[“sepal_length”].fillna(value=data[“sepal_length”].mean(), inplace = True) if there’s a null entry
data.duplicated().sum() returning total number of duplicates entries
data.drop_duplicates(inplace=True)
Data visualization is the method of converting raw data into a visual form, such as a map or graph, to make data easier for us to understand and extract useful insights.
The main goal of data visualization is to put large datasets into a visual representation. It is one of the important steps and simple steps when it comes to data science
You Can refer to the blog below for getting more details about Data Visualization
This shows every observation/distribution in data on a single data variable. It can be shown with the help of various plots like Scatter Plot, Line plot, Histogram(summary)plot, box plots, violin plot, etc.
Bivariate analysis displays are done to reveal the relationship between two data variables. It can also be shown with the help of Scatter plots, histograms, Heat Maps, Box Plots, Violin Plots, etc.
Multivariate analysis, as the name suggests, displays are done to reveal the relationship between more than two data variables.
Scatterplots, Histograms, box plots, violin plots can be used for Multivariate Analysis
Below are some of the plots that can be deployed for Univariate, Bivariate, Multivariate analysis
plt.figure(figsize=(17,9))
plt.title(‘Comparison between various species based on sapel length and width’)
sns.scatterplot(data[‘sepal_length’],data[‘sepal_width’],hue =data[‘species’],s=50)
sns.pairplot(data,hue=”species”,height=4)
Boxplot to see how the categorical feature “Species” is distributed with all other four input variables
fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot( y=”petal_width”, x= “species”, data=iris_data, orient=’v’ , ax=axes[0, 0])
sns.boxplot( y=”petal_length”, x= “species”, data=iris_data, orient=’v’ , ax=axes[0, 1])
sns.boxplot( y=”sepal_length”, x= “species”, data=iris_data, orient=’v’ , ax=axes[1, 0])
sns.boxplot( y=”sepal_width”, x= “species”, data=iris_data, orient=’v’ , ax=axes[1, 1])
plt.show()
More informative, than box plot, and shows full distribution of data
fig, axes = plt.subplots(2, 2, figsize=(16,10))
sns.violinplot( y=”petal_width”, x= “species”, data=iris_data, orient=’v’ , ax=axes[0, 0],inner=’quartile’)
sns.violinplot( y=”petal_length”, x= “species”, data=iris_data, orient=’v’ , ax=axes[0, 1],inner=’quartile’)
sns.violinplot( y=”sepal_length”, x= “species”, data=iris_data, orient=’v’ , ax=axes[1, 0],inner=’quartile’)
sns.violinplot( y=”sepal_width”, x= “species”, data=iris_data, orient=’v’ , ax=axes[1, 1],inner=’quartile’)
plt.show()
It can be used for visualizing the Probability density function(PDF)
sns.FacetGrid(iris_data, hue=”species”, height=5)
.map(sns.distplot, “petal_width”)
.add_legend();
With this I finish this blog.
Hello Everyone, Namaste
My name is Pranshu Sharma and I am a Data Science Enthusiast
Thank you so much for taking your precious time to read this blog. Feel free to point out any mistake(I’m a learner after all) and provide respective feedback or leave a comment.
Dhanyvaad!!
Feedback:
Email: [email protected]
You can refer to the blog being, mentioned below for getting familiar with Exploratory Data Analysis
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.
Great Article! Thank you for sharing this is a very informative post, and looking forward to the latest one