In any data science project, the Statistical Data Exploration phase, or Exploratory Data Analysis (EDA), plays a crucial role in model building. It begins once we’ve translated our business problem into a data science problem and have identified and listed all associated hypotheses. This phase aims to uncover key characteristics and hidden patterns within the dataset. This article focuses on conducting Data Exploration using statistical measures such as P-values, R-squared, hypothesis testing, and Analysis of Variance (ANOVA) to compare different groups, emphasizing practical application over theoretical concepts.
Analytical tools like Tableau are used for visualizations, and Python packages like scipy are used for statistical tests such as one-way ANOVA and comparison of F-ratios. Many statistical tests assume a bell-curve (Gaussian) distribution; in this case, the dependent variable (the study variable) does exhibit a roughly Gaussian shape, which prompts a statistical exploration to draw inferences.
In regression analysis and statistical data exploration, R-squared and P-value are critical measures often overlooked. However, modern analytical tools like Tableau or Power BI simplify the computation of these measures and facilitate the creation of informative plots with trend lines. Leveraging these tools allows for efficient inference generation without extensive coding.
This article is divided into three sections, as outlined in the Overview. But before we go to the individual sections, here are a few statistical data exploration terms we should be familiar with:
We often denote this as R² or r², more commonly known as R-squared; it indicates the extent of influence a specific independent variable exerts on the dependent variable. It typically ranges between 0 and 1: values below 0.3 suggest weak influence, values between 0.3 and 0.5 indicate moderate influence, and values exceeding 0.7 signify a strong effect on the dependent variable. Further discussion on this topic will be provided later in the blog.
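As a quick, minimal sketch of how R-squared relates to the data (the y values below are made up purely for illustration), it can be computed directly from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares:

import numpy as np

# Illustrative values only: observed dependent variable and model predictions
y_actual = np.array([10.0, 12.5, 14.0, 18.2, 21.0])
y_predicted = np.array([9.5, 13.0, 14.5, 17.8, 20.4])

ss_res = np.sum((y_actual - y_predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))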
The P-value is a probabilistic measure indicating the likelihood that an observed value occurred by random chance. It assesses the significance of differences observed in the dependent variable when the corresponding independent variable changes. A lower P-value signifies a greater significance of the observed difference. Typically used in statistical hypothesis testing, a P-value < 0.05 suggests rejection of the null hypothesis, while P > 0.05 indicates no significant differences when the variable changes. In the figure below, the shaded portion illustrates the P-value.
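To make the "shaded area" idea concrete, here is a minimal sketch (with a made-up z statistic) that computes a two-sided P-value from the standard normal distribution using scipy:

from scipy.stats import norm

z = 2.1                            # hypothetical test statistic
p_two_sided = 2 * norm.sf(abs(z))  # area in both tails beyond |z|
print(round(p_two_sided, 4))       # about 0.036, which is < 0.05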
The idea here is to reject or nullify the Null Hypothesis and come up with an Alternate Hypothesis that better explains the phenomenon.
The Alternate Hypothesis is the opposite of the Null Hypothesis. For example, if the Null Hypothesis states "I am going to win $10", then the Alternate Hypothesis would be "I am going to win more than $10". We check whether there is enough evidence (in favor of the Alternate Hypothesis) to reject the Null Hypothesis. The hypothesis test can be one-tailed or two-tailed, as in the figure below, which depicts the standard normal model (mean = 0, standard deviation = 1); here Pc is the critical value, or test statistic.
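As a hedged sketch of the one-tailed versus two-tailed idea (the winnings sample is invented, and the alternative argument needs scipy 1.6 or later), a one-sample t-test against the $10 figure could look like this:

from scipy.stats import ttest_1samp

winnings = [11.2, 9.8, 12.5, 10.9, 11.7, 12.1]  # hypothetical sample of winnings

# Two-tailed test: Ha is "mean winnings != 10"
t_two, p_two = ttest_1samp(winnings, popmean=10, alternative='two-sided')

# One-tailed test: Ha is "mean winnings > 10"
t_one, p_one = ttest_1samp(winnings, popmean=10, alternative='greater')

print(p_two, p_one)  # the one-tailed p-value is half the two-tailed one here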
The Confidence Interval (CI) is the range of values (-R, +R) within which we are confident the population parameter (true value) lies. It is mainly used in hypothesis testing. The significance level defines how much evidence we require to reject H0 in favor of Ha and serves as the cutoff; the commonly used default is 0.05. A CI table with critical values at the 1%, 5%, and 10% significance levels for a standard normal distribution is listed below:
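For reference, the same two-tailed critical values of the standard normal can be generated with scipy; a minimal sketch:

from scipy.stats import norm

for alpha in (0.01, 0.05, 0.10):
    # Two-tailed critical value: z such that P(|Z| > z) equals alpha
    z_crit = norm.ppf(1 - alpha / 2)
    print(alpha, round(z_crit, 3))
# Prints roughly 2.576, 1.96, and 1.645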
Usually, when regression is referred to in the context of machine learning, we mean the line of linear regression and its y-intercept, the point where this line cuts the y-axis. This line can be mathematically represented as a straight line passing through the data point coordinates (independent variable, dependent variable). In equation form,
y = m * x + C, where C is the y-intercept and m is the gradient or slope
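As a minimal sketch (with made-up x and y values), the slope m and intercept C of such a line can be estimated in Python with numpy:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (illustrative)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # dependent variable (illustrative)

m, C = np.polyfit(x, y, deg=1)  # degree-1 (straight line) fit: y = m*x + C
print(m, C)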
In real-world situations, this may not always be a straight line; there may be nonlinearity in the independent variables (the predictors) in relation to the dependent variable (the variable whose outcome we want to predict). So we need to look at other regressions, such as polynomial, exponential, or even logarithmic, based on the dataset we are mining. In this article, the data (target variable) looks roughly like a Gaussian curve, and hence I will be trying to fit a polynomial regression on it.
In statistics, polynomial regression is a form of regression analysis that accounts for nonlinearity in the independent variables: the target variable is modeled as an nth-degree polynomial of the predictor variables. That is,

y = b0 + b1*x1 + b2*x2^2 + b3*x3^3 + … + bn*xn^n

where y is the target or dependent variable, b0 is the y-intercept, b1, b2, …, bn are the regression coefficients for each degree of the polynomial, and x1, x2, …, xn are the predictors or independent variables.
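A minimal sketch of polynomial regression in Python, for the simplest case of a single predictor expanded into its powers, assuming scikit-learn is available (the data is synthetic and degree 2 is used purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic single predictor and target with some curvature and noise
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 + 2 * x.ravel() - 0.5 * x.ravel() ** 2 + np.random.normal(0, 1, 50)

poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)         # columns: x, x^2

model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)   # b0 and (b1, b2)
print(model.score(x_poly, y))          # R-squared of the fit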
For the demonstration, I will take 3 independent variables (Temperature, Current, Voltage) and the dependent variable (Power) from my private project dataset. The data pertains to an energy system, in which continuous instantaneous power is generated at each timestep on any given day for the time the system is active. Let's take a look at the power trend plot (generated using Tableau) for any given day.
The above plot is quite similar to a bell curve, with many visible spikes, since this is instantaneous power recorded at 35–45 second intervals.
df.dtypes
Datetime object
Power float64
Temperature float64
Current float64
Voltage float64
dtype: object
Sample data frame records
As we can see, the Power value changes every 30–40 seconds. The dataset contains data for two years, 2019 and 2020. Let us look at the scatter plots of the dependent variable against each of the independent variables for a particular month.
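If you would rather stay in Python than Tableau for this step, a hedged sketch of the same scatter plots with matplotlib might look like this (df and the column names are the ones shown in the dtypes output above; the month filter assumes Datetime has been parsed with pd.to_datetime):

import pandas as pd
import matplotlib.pyplot as plt

# Assumes df holds the columns shown above; filter one month, e.g. April 2019
df["Datetime"] = pd.to_datetime(df["Datetime"])
month = df[(df["Datetime"].dt.year == 2019) & (df["Datetime"].dt.month == 4)]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["Temperature", "Current", "Voltage"]):
    ax.scatter(month[col], month["Power"], s=5, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("Power")
plt.tight_layout()
plt.show()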
As the output seems to follow the trend of a normal curve, I will test it with a polynomial regression (with nonlinearity of degree 6). We could also try to fit a 3rd-order polynomial; the degree is basically a sort of hyperparameter. I have used the Tableau analytical tool here, as it lets us do a bit of statistical analytics and draw trend lines with ease, without having to write our own code.
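Outside Tableau, treating the degree as a small hyperparameter could be sketched like this: fit degree 3 and degree 6 with numpy and compare the R-squared of each (the bell-shaped series below is synthetic and only stands in for the daily Power curve):

import numpy as np

x = np.arange(100, dtype=float)                                       # timestep index within a day
y = np.exp(-((x - 50) ** 2) / 300) + np.random.normal(0, 0.02, 100)   # bell-like curve

for degree in (3, 6):
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print(degree, round(1 - ss_res / ss_tot, 4))   # in-sample R-squared per degree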
Let us see how to interpret these values in the next section.
This can be drawn from Tableau Desktop via Analytics > Model > Trend Lines > Polynomial.
Before we interpret the data, we need to gather it all in one place. I have collected those values month-wise for a device and stored them in tabular form (see below). Let us understand the data first: there are 12 rows and 9 columns. The rows contain each month's data, and the columns hold data for the 3 independent variables in relation to the target. The first three columns have the median value of that particular month (you can also use mean values), the next three columns have the P-values, and the last three have the R-squared values. The green lines are the polynomial trend lines.
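For readers without Tableau, one hedged way to assemble a similar month-wise summary is statsmodels OLS on polynomial terms, which exposes both an R-squared and an overall p-value for each fit (df and the column names are those shown earlier; this is a sketch, not necessarily the exact computation Tableau performs):

import numpy as np
import pandas as pd
import statsmodels.api as sm

df["Datetime"] = pd.to_datetime(df["Datetime"])

rows = []
for month, grp in df.groupby(df["Datetime"].dt.month):
    row = {"month": month}
    for col in ["Temperature", "Current", "Voltage"]:
        # Degree-6 polynomial terms of the predictor, plus an intercept
        X = sm.add_constant(np.column_stack([grp[col] ** d for d in range(1, 7)]))
        result = sm.OLS(grp["Power"], X).fit()
        row[col + "_median"] = grp[col].median()
        row[col + "_pvalue"] = result.f_pvalue   # overall p-value of the fit
        row[col + "_r2"] = result.rsquared
    rows.append(row)

summary = pd.DataFrame(rows)
print(summary)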
From the above table, we can make some first-hand inferences like:
Analysis of Variance and F-statistics: We perform ANOVA tests to compare two groups (in this case, 2 different devices) and compute the F-statistics to determine variability.
In this section, I conduct several statistical hypotheses tests using similar data from another device. I demonstrate how to perform a one-way ANOVA test on a particular independent variable of two different devices. If these devices are placed adjacent to one another at the same location, then we fail to reject the Null Hypothesis as both devices would perform similarly. However, if these devices are placed elsewhere at different geographical locations, then we observe variance. Below, we present the data of device 2 at another distant location. Using Python’s scipy, we conduct a simple test to compare the Temperature variability of these 2 devices and evaluate the f-ratio for each month. For demonstration purposes, we focus on data from April to August to calculate the f-ratio.
We can also do more complex tests like
# Enter the temperature scores of the 2 devices
device1 = [52.34, 57.36, 53.47, 57.84, 56.21]
device2 = [61.97, 65.42, 64.27, 62.98, 63.22]

# Perform a one-way ANOVA
from scipy.stats import f_oneway
f_oneway(device1, device2)

F_onewayResult(statistic=43.35900660252281, pvalue=0.00017210195536532808)
Since the p-value is < 0.05, we reject the Null Hypothesis; the population means of the 2 devices are not the same.
F = variation between the sample means / variation within the samples (about 43 in this case)
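The same F ratio can be recomputed by hand to see where the ~43 comes from; a minimal sketch of the between-group versus within-group variance calculation for the two temperature samples above:

import numpy as np

device1 = np.array([52.34, 57.36, 53.47, 57.84, 56.21])
device2 = np.array([61.97, 65.42, 64.27, 62.98, 63.22])

groups = [device1, device2]
grand_mean = np.concatenate(groups).mean()

# Between-group mean square: k - 1 degrees of freedom (k = 2 groups)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (len(groups) - 1)

# Within-group mean square: N - k degrees of freedom
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (sum(len(g) for g in groups) - len(groups))

print(ms_between / ms_within)  # ~43.36, matching scipy's f_oneway result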
This article emphasizes statistical data exploration’s vital role in model building within data science projects. Utilizing regression models, sample size, adjusted R-squared, correlation coefficients, and other metrics, we drew valuable insights. Through polynomial regression, we analyzed the variance of the dependent variable against independent variables, uncovering nuanced relationships. Real-time data interpretation, focusing on P-values and R-squared scores, offered actionable insights. Moreover, ANOVA facilitated comparing different system parameters, shedding light on device performance. This article underscores the importance of meticulous exploration, hypothesis testing, and continuous inquiry in data analysis, essential for robust model development across diverse datasets.
Q. What is R-squared in statistics?
A. R-squared, or the coefficient of determination, measures the proportion of the dependent variable’s variance predictable from the independent variable(s). A higher R-squared (closer to 1) indicates better explanatory power, but no universal threshold defines a “good” value.
Q. What is a good R-squared value?
A. A good R-squared varies based on factors like dataset, predictors, and sample size. Generally, higher values suggest better model fit. Adjusted R-squared, which accounts for the number of predictors and the sample size, provides a more accurate measure.
Q. What does a high R-squared indicate?
A. A high R-squared in regression analysis signifies strong model fit, indicating how well the model explains variability in the response variable. However, context, outliers, and other diagnostics are crucial for interpretation.
Q. What does an R-squared of 0.3 mean?
A. An R-squared of 0.3 implies that 30% of the dependent variable’s variability is explained by the predictors. Context, data nature, and model specifics influence the interpretation of adequacy.
Q. What does an R-squared of 0.4 mean?
A. An R-squared of 0.4 indicates that 40% of the dependent variable’s variability is explained by the model’s independent variables. Context, data nature, and model criteria impact the assessment of model fit.