As we all know there are certain processes to analyze data. First, we define the problem then we mine the data and prepare the data for analysis. Before future engineering and model building, there is an important step.
Exploratory Data Analysis refers to the critical process of conducting initial research on data to discover patterns, detect anomalies, and check assumptions with the help of summary statistics and graphical representations. Exploratory Data Analysis is an important step before starting to analyze or modeling of the data. It provides the context needed to develop an appropriate model and interpret the results correctly.
Let look at a sample R implementation.
In this part, we discover the variable types and their summary statistics in the data. First, we upload the USArrests dataset in R. Then we print the dataset using “headTail” function which prints the first 4, and last 4 rows dataset in default.
Then we look types of the variables and summary statistics of the variables.
“glimpse” and “str” functions give us types of variables.
“profiling_num” function in funModeling library gives us detailed statistics like mean, standard deviation, skewness, kurtosis, interquartile range, etc.
Thus, while the urban population is left skewed, Rape is right skewed.
Thus, while the urban population distributes sharply, assault distribute flattened.
A combination of unusual values on at least two variables is a multivariate outlier. The effect of statistical studies can be affected by all kinds of outliers. They may distort the statistical analyses and violate their assumptions.
Let us show both multivariate and individual outliers.
“plot_outlier” function is very useful function in “dlookr” library. It shows Boxplots and histograms of all numeric variables with outlier and without the outlier. The reason for shows Boxplots is that they are very useful tools to visualize the outliers.
As seen in the plots only rape variables has outliers. Moreover, when we look at the histogram without outliers, its shape being more symmetric.
Let look at multivariate outliers. (it is very useful in multivariate analysis, just an example let us show it)
As seen there are 7 outliers in the data.
To proceed with statistical methods, it is important to evaluate normality. This assumption allows us to construct confidence intervals and conduct hypothesis tests. For normality checking, there is no best method that is correct in all conditions. It is very convenient to use graphical approaches to decide on multivariate normality, in addition to numerical results. It can be helpful to combine them to provide more accurate choices.
In this part, we can look at different graphs of variables and the relationship between variables visually. Let us write some research questions.
For this question, we can use map or bar plot.
R code for the plot in the below is:
For this question, we can draw an interactive plot like seen in the above to see the state names.
R codes for the interactive plot is:
Or, we can drow it using ggplot.
As seen there is a positive relationship between murder and assault.
Let look at correlation between variables. To look at this we can draw hat maps.
Positive correlations are shown in blue and negative correlations in red color. Color intensity is proportional to the correlation coefficients. When we look at the correlation matrix, it is seen that between some variables there is a strong positive relationship such as assault and rape, assault and murder.
To conclude, in this article, we examine the explanatory data analysis and which visualization types we can use for the explanatory data analysis. As stated above it is a very crucial step and it must be done before future engineering and model building to better understand data. You can access the codes from the link below.
https://github.com/iremtanriverdi/R_codes
The media shown in this article on Exploratory Data Analysis are not owned by Analytics Vidhya and is used at the Author’s discretion.