In any data science project, the Statistical Data Exploration phase, or Exploratory Data Analysis (EDA), plays a crucial role in model building. It begins once we’ve translated our business problem into a data science problem and have identified and listed all associated hypotheses. This phase aims to uncover key characteristics and hidden patterns within the dataset. This article focuses on conducting Data Exploration using statistical measures such as P-values, R-squared, hypothesis testing, and Analysis of Variance (ANOVA) to compare different groups, emphasizing practical application over theoretical concepts.
Analytical tools like Tableau are used for visualizations, and Python packages like scipy are used for statistical tests such as one-way ANOVA and comparison of F-ratios. Many statistical tests assume a bell-curve (Gaussian) distribution; in this case, the dependent variable (the study variable) does exhibit a roughly Gaussian shape, which prompts a statistical exploration to draw inferences.
In regression analysis and statistical data exploration, R-squared and P-value are critical measures often overlooked. However, modern analytical tools like Tableau or Power BI simplify the computation of these measures and facilitate the creation of informative plots with trend lines. Leveraging these tools allows for efficient inference generation without extensive coding.
This article is divided into three sections, as outlined in the Overview. But before we go to the individual sections, here are a few statistical data exploration terms we should be familiar with:
We often denote this as R² or r², more commonly known as R-squared; it indicates the extent of influence a specific independent variable exerts on the dependent variable. It typically ranges between 0 and 1: values below 0.3 suggest weak influence, values between 0.3 and 0.5 indicate moderate influence, and values exceeding 0.7 signify a strong effect on the dependent variable. Further discussion on this topic will be provided later in the blog.
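As a quick, minimal sketch of how R-squared relates to the data (the y values below are made up purely for illustration), it can be computed directly from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares:

import numpy as np

# Illustrative values only: observed dependent variable and model predictions
y_actual = np.array([10.0, 12.5, 14.0, 18.2, 21.0])
y_predicted = np.array([9.5, 13.0, 14.5, 17.8, 20.4])

ss_res = np.sum((y_actual - y_predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))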
The P-value is a probabilistic measure indicating the likelihood that an observed value occurred by random chance. It assesses the significance of differences observed in the dependent variable when the corresponding independent variable changes. A lower P-value signifies a greater significance of the observed difference. Typically used in statistical hypothesis testing, a P-value < 0.05 suggests rejection of the null hypothesis, while P > 0.05 indicates no significant differences when the variable changes. In the figure below, the shaded portion illustrates the P-value.
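To make the "shaded area" idea concrete, here is a minimal sketch (with a made-up z statistic) that computes a two-sided P-value from the standard normal distribution using scipy:

from scipy.stats import norm

z = 2.1                            # hypothetical test statistic
p_two_sided = 2 * norm.sf(abs(z))  # area in both tails beyond |z|
print(round(p_two_sided, 4))       # about 0.036, which is < 0.05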
The idea here is to reject or nullify the Null Hypothesis and come up with an Alternate Hypothesis that better explains the phenomenon.
The Alternate Hypothesis is the opposite of the Null Hypothesis. For example, if the Null Hypothesis states "I am going to win $10", then the Alternate Hypothesis would be "I am going to win more than $10". We check whether there is enough evidence (in favor of the Alternate Hypothesis) to reject the Null Hypothesis. The hypothesis test can be one-tailed or two-tailed, as in the figure below, which depicts the standard normal model (mean = 0, standard deviation = 1); here Pc is the critical value, or test statistic.
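As a hedged sketch of the one-tailed versus two-tailed idea (the winnings sample is invented, and the alternative argument needs scipy 1.6 or later), a one-sample t-test against the $10 figure could look like this:

from scipy.stats import ttest_1samp

winnings = [11.2, 9.8, 12.5, 10.9, 11.7, 12.1]  # hypothetical sample of winnings

# Two-tailed test: Ha is "mean winnings != 10"
t_two, p_two = ttest_1samp(winnings, popmean=10, alternative='two-sided')

# One-tailed test: Ha is "mean winnings > 10"
t_one, p_one = ttest_1samp(winnings, popmean=10, alternative='greater')

print(p_two, p_one)  # the one-tailed p-value is half the two-tailed one here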
The Confidence Interval (CI) is the range of values (-R, +R) within which we are confident the population parameter (true value) lies. It is mainly used in hypothesis testing. The significance level defines how much evidence we require to reject H0 in favor of Ha and serves as the cutoff; the commonly used default is 0.05. A CI table with critical values at the 1%, 5%, and 10% significance levels for a standard normal distribution is listed below:
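For reference, the same two-tailed critical values of the standard normal can be generated with scipy; a minimal sketch:

from scipy.stats import norm

for alpha in (0.01, 0.05, 0.10):
    # Two-tailed critical value: z such that P(|Z| > z) equals alpha
    z_crit = norm.ppf(1 - alpha / 2)
    print(alpha, round(z_crit, 3))
# Prints roughly 2.576, 1.96, and 1.645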
Usually, when regression is referred to in the context of machine learning, we mean the line of linear regression and its y-intercept, the point where this line cuts the y-axis. This line can be mathematically represented as a straight line passing through the data point coordinates (independent variable, dependent variable). In equation form,
y = m * x + C, where C is the y-intercept and m is the gradient or slope
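As a minimal sketch (with made-up x and y values), the slope m and intercept C of such a line can be estimated in Python with numpy:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (illustrative)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # dependent variable (illustrative)

m, C = np.polyfit(x, y, deg=1)  # degree-1 (straight line) fit: y = m*x + C
print(m, C)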
In real-world situations, this may not always be a straight line; there may be nonlinearity in the independent variables (the predictors) in relation to the dependent variable (the variable whose outcome we want to predict). So we need to look at other regressions, such as polynomial, exponential, or even logarithmic, based on the dataset we are mining. In this article, the data (target variable) looks roughly like a Gaussian curve, and hence I will be trying to fit a polynomial regression on it.
In statistics, polynomial regression is a form of regression analysis that accounts for nonlinearity in the independent variables: the target variable is modeled as an nth-degree polynomial of the predictor variables. That is,

y = b0 + b1*x1 + b2*x2^2 + b3*x3^3 + … + bn*xn^n

where y is the target or dependent variable, b0 is the y-intercept, b1, b2, …, bn are the regression coefficients for each degree of the polynomial, and x1, x2, …, xn are the predictors or independent variables.
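A minimal sketch of polynomial regression in Python, for the simplest case of a single predictor expanded into its powers, assuming scikit-learn is available (the data is synthetic and degree 2 is used purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic single predictor and target with some curvature and noise
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 + 2 * x.ravel() - 0.5 * x.ravel() ** 2 + np.random.normal(0, 1, 50)

poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)         # columns: x, x^2

model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)   # b0 and (b1, b2)
print(model.score(x_poly, y))          # R-squared of the fit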
For the demonstration, I will take 3 independent variables (Temperature, Current, Voltage) and the dependent variable (Power) from my private project dataset. The data pertains to an energy system, in which continuous instantaneous power is generated at each timestep on any given day for the time the system is active. Let's take a look at the power trend plot (generated using Tableau) for any given day.
The above plot is quite similar to a bell curve, with many visible spikes, since this is instantaneous power recorded at 35–45 second intervals.
df.dtypes
Datetime object
Power float64
Temperature float64
Current float64
Voltage float64
dtype: object
Sample data frame records
As we can see, the Power value changes every 30–40 seconds. The dataset contains data for two years, 2019 and 2020. Let us look at the scatter plots of the dependent variable against each of the independent variables for a particular month.
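If you would rather stay in Python than Tableau for this step, a hedged sketch of the same scatter plots with matplotlib might look like this (df and the column names are the ones shown in the dtypes output above; the month filter assumes Datetime has been parsed with pd.to_datetime):

import pandas as pd
import matplotlib.pyplot as plt

# Assumes df holds the columns shown above; filter one month, e.g. April 2019
df["Datetime"] = pd.to_datetime(df["Datetime"])
month = df[(df["Datetime"].dt.year == 2019) & (df["Datetime"].dt.month == 4)]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["Temperature", "Current", "Voltage"]):
    ax.scatter(month[col], month["Power"], s=5, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("Power")
plt.tight_layout()
plt.show()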
As the output seems to follow the trend of a normal curve, I will test it with a polynomial regression (with nonlinearity of degree 6). We could also try to fit a 3rd-order polynomial; the degree is basically a sort of hyperparameter. I have used the Tableau analytical tool here, as it lets us do a bit of statistical analytics and draw trend lines with ease, without having to write our own code.
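Outside Tableau, treating the degree as a small hyperparameter could be sketched like this: fit degree 3 and degree 6 with numpy and compare the R-squared of each (the bell-shaped series below is synthetic and only stands in for the daily Power curve):

import numpy as np

x = np.arange(100, dtype=float)                                       # timestep index within a day
y = np.exp(-((x - 50) ** 2) / 300) + np.random.normal(0, 0.02, 100)   # bell-like curve

for degree in (3, 6):
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print(degree, round(1 - ss_res / ss_tot, 4))   # in-sample R-squared per degree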
Let us see how to interpret these values in the next section.
This can be drawn from Tableau Desktop via Analytics > Model > Trend Lines > Polynomial.
Before we interpret the data, we need to gather it all in one place. I have collected those values month-wise for a device and stored them in tabular form (see below). Let us understand the data first: there are 12 rows and 9 columns. The rows contain each month's data, and the columns hold data for the 3 independent variables in relation to the target. The first three columns have the median value of that particular month (you can also use mean values), the next three columns have the P-values, and the last three have the R-squared values. The green lines are the polynomial trend lines.
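For readers without Tableau, one hedged way to assemble a similar month-wise summary is statsmodels OLS on polynomial terms, which exposes both an R-squared and an overall p-value for each fit (df and the column names are those shown earlier; this is a sketch, not necessarily the exact computation Tableau performs):

import numpy as np
import pandas as pd
import statsmodels.api as sm

df["Datetime"] = pd.to_datetime(df["Datetime"])

rows = []
for month, grp in df.groupby(df["Datetime"].dt.month):
    row = {"month": month}
    for col in ["Temperature", "Current", "Voltage"]:
        # Degree-6 polynomial terms of the predictor, plus an intercept
        X = sm.add_constant(np.column_stack([grp[col] ** d for d in range(1, 7)]))
        result = sm.OLS(grp["Power"], X).fit()
        row[col + "_median"] = grp[col].median()
        row[col + "_pvalue"] = result.f_pvalue   # overall p-value of the fit
        row[col + "_r2"] = result.rsquared
    rows.append(row)

summary = pd.DataFrame(rows)
print(summary)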
From the above table, we can make some first-hand inferences like:
Analysis of Variance and F-statistics: We perform ANOVA tests to compare two groups (in this case, 2 different devices) and compute the F-statistics to determine variability.
In this section, I conduct several statistical hypotheses tests using similar data from another device. I demonstrate how to perform a one-way ANOVA test on a particular independent variable of two different devices. If these devices are placed adjacent to one another at the same location, then we fail to reject the Null Hypothesis as both devices would perform similarly. However, if these devices are placed elsewhere at different geographical locations, then we observe variance. Below, we present the data of device 2 at another distant location. Using Python’s scipy, we conduct a simple test to compare the Temperature variability of these 2 devices and evaluate the f-ratio for each month. For demonstration purposes, we focus on data from April to August to calculate the f-ratio.
We can also do more complex tests like
# Enter the temperature scores of the 2 devices
device1 = [52.34, 57.36, 53.47, 57.84, 56.21]
device2 = [61.97, 65.42, 64.27, 62.98, 63.22]

# Perform a one-way ANOVA
from scipy.stats import f_oneway
f_oneway(device1, device2)

F_onewayResult(statistic=43.35900660252281, pvalue=0.00017210195536532808)
Since the p-value is < 0.05, we reject the Null Hypothesis; the population means of the 2 devices are not the same.
F = variation between the sample means / variation within the samples (about 43 in this case)
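The same F ratio can be recomputed by hand to see where the ~43 comes from; a minimal sketch of the between-group versus within-group variance calculation for the two temperature samples above:

import numpy as np

device1 = np.array([52.34, 57.36, 53.47, 57.84, 56.21])
device2 = np.array([61.97, 65.42, 64.27, 62.98, 63.22])

groups = [device1, device2]
grand_mean = np.concatenate(groups).mean()

# Between-group mean square: k - 1 degrees of freedom (k = 2 groups)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (len(groups) - 1)

# Within-group mean square: N - k degrees of freedom
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (sum(len(g) for g in groups) - len(groups))

print(ms_between / ms_within)  # ~43.36, matching scipy's f_oneway result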
This article emphasizes statistical data exploration’s vital role in model building within data science projects. Utilizing regression models, sample size, adjusted R-squared, correlation coefficients, and other metrics, we drew valuable insights. Through polynomial regression, we analyzed the variance of the dependent variable against independent variables, uncovering nuanced relationships. Real-time data interpretation, focusing on P-values and R-squared scores, offered actionable insights. Moreover, ANOVA facilitated comparing different system parameters, shedding light on device performance. This article underscores the importance of meticulous exploration, hypothesis testing, and continuous inquiry in data analysis, essential for robust model development across diverse datasets.
Q. What is R-squared in statistics?
A. R-squared, or the coefficient of determination, measures the proportion of the dependent variable’s variance predictable from the independent variable(s). A higher R-squared (closer to 1) indicates better explanatory power, but no universal threshold defines a “good” value.
Q. What is a good R-squared value?
A. A good R-squared varies based on factors like dataset, predictors, and sample size. Generally, higher values suggest better model fit. Adjusted R-squared, which accounts for the number of predictors and the sample size, provides a more accurate measure.
Q. What does a high R-squared indicate?
A. A high R-squared in regression analysis signifies strong model fit, indicating how well the model explains variability in the response variable. However, context, outliers, and other diagnostics are crucial for interpretation.
Q. What does an R-squared of 0.3 mean?
A. An R-squared of 0.3 implies that 30% of the dependent variable’s variability is explained by the predictors. Context, data nature, and model specifics influence the interpretation of adequacy.
Q. What does an R-squared of 0.4 mean?
A. An R-squared of 0.4 indicates that 40% of the dependent variable’s variability is explained by the model’s independent variables. Context, data nature, and model criteria impact the assessment of model fit.