This article was published as a part of the Data Science Blogathon.
The first step in a data science project is to summarize, describe, and visualize the data. Get to know the different aspects of the data and its attributes. The best models are built by those who understand their data.
Explore the features of the data and its attributes using Descriptive Statistics. The insights and numerical summaries you get from Descriptive Statistics help you understand the data better and handle it more efficiently for machine learning tasks.
Descriptive Statistics is the default first step in data analysis. Exploratory Data Analysis (EDA) is not complete without a Descriptive Statistics analysis.
So, in this article, I will explain the attributes of the dataset using Descriptive Statistics. The article is divided into two parts: Measure of Central Tendency and Measure of Dispersion. Before we start with our analysis, we need to complete the data collection and cleaning process.
We will collect data from here. I will only use the train data for analysis. You can also combine both test and train data for analysis. Here is the code for the data cleaning process of the train data.
Python Code:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
from collections import Counter
# Load the train and test data
train_df = pd.read_csv('train_BMS.csv')
train_df['df_type'] = 'train'
test_df = pd.read_csv('test_BMS.csv')
test_df['df_type'] = 'test'
# concatenating test and train data
combined_data = pd.concat([train_df, test_df],ignore_index=True)
# check null values: count the missing entries in each column
print(train_df.isnull().sum())
# fill missing Item_Weight with the average weight of the same item
avg_weight = combined_data.pivot_table(values='Item_Weight', index='Item_Identifier')
missing_bool = combined_data['Item_Weight'].isnull()
combined_data.loc[missing_bool,'Item_Weight'] = combined_data.loc[missing_bool,'Item_Identifier'].map(avg_weight['Item_Weight'])
# treat a zero Item_Visibility as missing and fill it with the average visibility of the same item
avg_visibility = combined_data.pivot_table(values='Item_Visibility', index='Item_Identifier')
missing_bool = combined_data['Item_Visibility'] == 0
combined_data.loc[missing_bool,'Item_Visibility'] = combined_data.loc[missing_bool,'Item_Identifier'].map(avg_visibility['Item_Visibility'])
combined_data['Item_Fat_Content'] = combined_data['Item_Fat_Content'].replace({'LF':'Low Fat',
'reg':'Regular Fat',
'low fat':'Low Fat'})
combined_data['Outlet_Years'] = 2013 - combined_data['Outlet_Establishment_Year']
# keep only the train rows and drop the columns that are not needed
train = combined_data[combined_data['df_type'] == 'train'].drop(columns=['Outlet_Size','Outlet_Establishment_Year','df_type'])
# train data information (info() prints its report itself)
train.info()
Take Away from the code
Let’s start with Descriptive Statistics analysis of data.
Finding the center of numerical and categorical data using the mean, median, and mode is known as measuring central tendency. The mean, median, and mode are each calculated differently, so they can give different central values for the same column.
Okay then, let’s calculate the mean, median, count, and mode of the dataset attributes using Python.
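The sum of the values in a column divided by the number of values in that column is known as the mean. It is also known as the average.
Use train.mean(numeric_only=True) to calculate the mean of the numerical columns of the train dataset. In pandas 2.0 and later, a plain train.mean() raises an error when the dataset also contains text columns.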
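As a minimal sketch of the definition above, the snippet below computes a mean by hand and compares it with pandas. The toy DataFrame is hypothetical: the column names mirror the article's dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for this illustration.
toy = pd.DataFrame({
    "Item_Weight": [9.0, 12.0, 15.0, 20.0],
    "Item_MRP": [100.0, 150.0, 200.0, 250.0],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type1"],
})

# Mean by hand: the sum of the values divided by how many values there are.
manual_mean = toy["Item_Weight"].sum() / toy["Item_Weight"].count()

# numeric_only=True skips the text column instead of trying to average it.
column_means = toy.mean(numeric_only=True)

print(manual_mean)
print(column_means)
```

Both approaches give the same value for Item_Weight, and the text column Outlet_Type is left out of the result.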
Here is the code to calculate the mean of a numerical column for each category of a categorical column of the train dataset.
print(train[['Item_Outlet_Sales','Outlet_Type']].groupby(['Outlet_Type']).agg({'Item_Outlet_Sales':'mean'}))
Analysis of the output
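The middle value of an attribute is known as the median. How do we calculate the median? First, sort the column data in ascending or descending order. Then count the rows and divide the count by 2 to find the middle position.
The value at that middle position is the median for that column. If the column has an even number of rows, the median is the average of the two middle values.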
The median divides the data points into two halves: 50% of the data points lie above the median and 50% lie below it.
Generally, the median and mean values are different for the same data.
The median is largely unaffected by outliers, while the mean is pulled towards them. So the more extreme the outliers are, the larger the difference between the mean and the median.
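The behaviour described above can be checked on a few made-up numbers. This is only a sketch with hypothetical values, not data from the article's dataset: it shows the median for an odd and an even number of rows, and how a single outlier moves the mean but not the median.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the idea.
sales_odd = pd.Series([10, 20, 30, 40, 50])         # odd number of rows
sales_even = pd.Series([10, 20, 30, 40])            # even number of rows
with_outlier = pd.Series([10, 20, 30, 40, 5000])    # last value is an outlier

median_odd = sales_odd.median()      # the single middle value
median_even = sales_even.median()    # the average of the two middle values

# The outlier pulls the mean up sharply, while the median stays put.
print("mean:  ", sales_odd.mean(), "->", with_outlier.mean())
print("median:", median_odd, "->", with_outlier.median())
print("median of even-length data:", median_even)
```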
Use train.median(numeric_only=True) to calculate the median of the numerical columns of the train dataset.
Here is the code to calculate the median of a numerical column for each category of a categorical column of the train dataset.
print(train[['Item_Outlet_Sales','Outlet_Type']].groupby(['Outlet_Type']).agg({'Item_Outlet_Sales':'median'}))
Analysis of Output
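The value that occurs most often in a column is known as the mode. Here is the code to find the most frequent value of each column within each outlet type.
print(train[['Item_Outlet_Sales', 'Outlet_Type', 'Outlet_Identifier', 'Item_Identifier']].groupby(['Outlet_Type']).agg(lambda x:x.value_counts().index[0]))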
Analysis of Output
A measure of dispersion explains how diverse the attribute values are in the dataset. It is also known as the measure of spread. From these statistics, you come to know how widely the data is spread around its center.
The statistics covered here under the measure of dispersion are the range, the quartiles and interquartile range, the standard deviation, the variance, and the skewness.
The difference between the max value and the min value in a column is known as the range.
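Here is the code to calculate the range. Here, num_col holds the names of the numerical columns.
num_col = train.select_dtypes(include='number').columns
for i in num_col:
    print(f"Column: {i} Max_Value: {max(train[i])} Min_Value: {min(train[i])} Range: {round(max(train[i]) - min(train[i]),2)}")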
You can also calculate the range of a numerical column within each category of a categorical column. For example, you can find the min and max sales values in each outlet category.
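One way to get the min and max per category is a groupby followed by agg. The snippet below is a sketch on hypothetical toy data (the column names follow the article's dataset, the values are invented), not necessarily the code originally used in the article.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the train dataset.
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1",
                    "Supermarket Type1"],
    "Item_Outlet_Sales": [100.0, 400.0, 1500.0, 900.0, 3200.0],
})

# Min and max sales within each outlet type.
sales_range = toy.groupby("Outlet_Type")["Item_Outlet_Sales"].agg(["min", "max"])

# The range is the difference between the two.
sales_range["range"] = sales_range["max"] - sales_range["min"]

print(sales_range)
```

On the real data, the same pattern should work with train in place of toy.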
Analysis of Output
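Quartiles split the sorted data into four equal parts: the 1st quartile (Q1) is the 25th percentile, the 2nd quartile (Q2) is the median, and the 3rd quartile (Q3) is the 75th percentile. You can calculate them with the quantile() function of pandas.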
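A minimal sketch of a quartile calculation with pandas follows. The values are hypothetical; on the article's data you would call quantile() on a numerical column of train instead. Note that pandas interpolates linearly between neighbouring values by default, so quartiles are not always actual data points.

```python
import pandas as pd

# Hypothetical values, already sorted to make the result easy to read.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 0.25, 0.5 and 0.75 correspond to the 1st, 2nd and 3rd quartiles.
quartiles = values.quantile([0.25, 0.5, 0.75])

print(quartiles)
```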
The difference between the 3rd and the 1st quartile is known as the interquartile range (IQR). The middle 50% of the data points fall within the IQR.
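To make the IQR concrete, here is a small sketch on hypothetical data (the integers 1 to 100). It computes the IQR and then checks what share of the points lies between the 1st and 3rd quartiles.

```python
import pandas as pd

# Hypothetical data: the integers from 1 to 100.
values = pd.Series(range(1, 101))

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Share of data points that lie between Q1 and Q3 (inclusive).
share_inside = values.between(q1, q3).mean()

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Share of points inside the IQR:", share_inside)
```

On small or heavily tied datasets the share will only be approximately one half.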
The standard deviation tells us how much the data points deviate from the mean on average. The standard deviation is affected by outliers because it uses the mean in its calculation.
Here is the code to calculate the standard deviation.
for i in num_col:
    print(i , round(train[i].std(),2))
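It can help to see what std() does under the hood. The sketch below uses hypothetical values and computes the standard deviation by hand. One detail worth knowing is that pandas divides by n - 1 by default (the sample standard deviation, ddof=1), while dividing by n gives the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd

# Hypothetical values with a mean of exactly 5.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Squared distance of every point from the mean.
squared_dev = (values - values.mean()) ** 2

# Sample standard deviation: divide by n - 1 (the pandas default).
sample_std = np.sqrt(squared_dev.sum() / (len(values) - 1))

# Population standard deviation: divide by n.
population_std = np.sqrt(squared_dev.sum() / len(values))

print(sample_std, values.std())            # these two match
print(population_std, values.std(ddof=0))  # and so do these
```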
Pandas also has a shortcut to calculate most of the above statistics in one go.
train.describe()
The variance is the square of the standard deviation. Here is the code to calculate the variance.
for i in num_col:
    print(i , round(train[i].var(),2))
Analysis of output.
Ideally, the distribution of the data would have the symmetric shape of a Gaussian (bell curve). But in practice, the distribution is often asymmetric, with a longer tail on one side. This asymmetry is known as skewness.
You can calculate the skewness of the train data with train.skew(numeric_only=True). The skewness value can be negative (left skew) or positive (right skew). For a symmetric distribution, its value is close to zero.
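The sign of the skewness can be illustrated with a few made-up numbers. In this sketch the data is hypothetical, and the log transform is only one common way to reduce right skew; it assumes strictly positive values, which may not hold for every column of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one symmetric sample and one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 4, 8, 16, 32, 64])

skew_symmetric = symmetric.skew()   # close to zero
skew_right = right_skewed.skew()    # positive: the tail is on the right

# A log transform pulls in the long right tail (values must be positive).
skew_after_log = np.log(right_skewed).skew()

print(skew_symmetric, skew_right, skew_after_log)
```

Here the log of the doubling values is evenly spaced, so the skewness drops to roughly zero after the transform.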
These are the go-to statistics when we perform exploratory data analysis on a dataset. Pay attention to the values these statistics produce and ask why each number looks the way it does. These statistics help us decide which attributes need transformation and which variables to remove from further processing.
The Pandas library has really good functions that give you Descriptive Statistics values in one line of code.