The Game of Increasing R-squared in a Regression Model

Chirag Goyal Last Updated : 15 May, 2021

This article was published as a part of the Data Science Blogathon.

Introduction

After building a Machine Learning model, the next and very crucial step is to evaluate the model performance on the unseen or test data and see how good our model is against a benchmark model.

The evaluation metric to use depends on the type of problem you are trying to solve: whether it is supervised, unsupervised, or a mix of the two (like semi-supervised), and whether it is a classification or a regression task.

In this article, we will discuss two important evaluation metrics used for regression problems, R² and adjusted-R². We will look at the key difference between them and learn why they are often preferred over Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) for a regression problem statement.

Some important questions that this article tries to answer are:

👉 The Game of increasing R-squared (R²)

👉 Why do we go for adjusted-R²?

👉 When should we use R² and when adjusted-R²?

Let’s first understand what exactly R-squared is

R-squared, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable (target or response) that can be explained by the independent variables (features or predictors).

Let us understand this with an example: say the R² value for a regression model having Income as the independent variable (predictor) and Expenditure as the dependent variable (response) comes out to be 0.76.

In general terms, this means that 76% of the variation in the dependent variable is explained by the independent variable.

But for our defined regression problem statement, it can be understood as,

👉 76% of the variability in expenditure is associated with the regression equation, and 24% is due to other factors.

👉 76% of the variability in expenditure is explained by its linear relationship with income, while 24% is unaccounted for.

👉 76% of the variation in expenditure can be attributed to variation in income, while we can’t say anything about the remaining 24%. God knows better about it.

Fig. R-squared in linear regression

Important points about R Squared

👉 Ideally, we would want the independent variables to explain all of the variation in the target variable. In that scenario, the R² value would be equal to 1. In general, the higher the R² value, the better the model fits the data.

👉 In simple terms, the higher the R², the more variation is explained by your input variables. For a least-squares model with an intercept, evaluated on the data it was fitted on, R² lies in the range [0, 1]; on unseen data it can be negative when the model predicts worse than the mean. Here is the formula for calculating R²:

R² is calculated by dividing the sum of squared residuals of the regression model (SS_RES) by the total sum of squared errors of the average model (SS_TOT), which always predicts the mean of the target, and then subtracting the result from 1.

R² = 1 - (SS_RES / SS_TOT)

Fig. Formula for calculating R²
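The calculation described above (one minus SS_RES divided by SS_TOT) can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name `r_squared` and the toy numbers are made up for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return R² = 1 - SS_RES / SS_TOT."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors of the regression model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared errors of the "average" model
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data, for illustration only
expenditure = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 32.0, 38.0])
print(r_squared(expenditure, predicted))
```

Here SS_RES is 16 and SS_TOT is 500, so R² comes out at about 0.968. Predicting the mean for every observation would give R² = 0, and a perfect prediction would give R² = 1.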

Drawbacks of using R Squared

👉 Every time we add an independent (predictor/explanatory) variable X_i to a regression model, R² increases or at least stays the same, even if that variable is insignificant for the model.

👉 R² implicitly treats every independent variable in the model as helping to explain variation in the dependent variable. In fact, some independent variables don’t contribute to predicting the dependent variable at all.

👉 So, if we add new features to the data (which may or may not be useful), the R² value of the model on the training data will either increase or remain the same, but it will never decrease.
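The drawback listed above can be checked with a small simulation. The sketch below assumes an ordinary least squares fit with an intercept (via `np.linalg.lstsq`) and uses synthetic income/expenditure data; the helper `fit_r2`, the coefficients, and the random seed are all illustrative choices, not part of the original article.

```python
import numpy as np

def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return the training R²."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 100
income = rng.normal(50, 10, size=n)                    # synthetic predictor
expenditure = 0.6 * income + rng.normal(0, 5, size=n)  # synthetic target

X = income.reshape(-1, 1)
r2_values = [fit_r2(X, expenditure)]
for _ in range(5):
    noise = rng.normal(size=(n, 1))   # a feature unrelated to the target
    X = np.hstack([X, noise])
    r2_values.append(fit_r2(X, expenditure))

for k, r2 in enumerate(r2_values, start=1):
    print(f"{k} predictor(s): R² = {r2:.4f}")
```

Each added column is pure noise, yet the printed R² never goes down, because the least-squares fit can always set the new coefficient to zero and do no worse than before.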

So, to overcome all these problems, we have adjusted-R² which is a slightly modified version of R².

Let’s understand what Adjusted R² is

👉 Similar to R², adjusted-R² measures the proportion of variation explained by the model, but it corrects for the number of independent variables, so that only variables that really help in explaining the dependent variable raise it.

👉 Unlike R², adjusted-R² penalizes the addition of independent variables that don’t help in predicting the dependent variable (target).

Let us understand mathematically how this penalty is built into adjusted-R². Here is the formula for adjusted-R²:

Adjusted-R² = 1 - (1 - R²)(n - 1) / (n - k - 1)

where n is the number of observations (sample size) and k is the number of independent variables in the model.

Fig. Formula for calculating adjusted-R²
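As a rough sketch, the standard adjusted-R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1) with n observations and k independent variables, can be written as a small Python function. The function name `adjusted_r2` and the sample size used below are assumptions made for illustration; only the R² value of 0.76 comes from the earlier income/expenditure example.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters (n > k + 1)")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# R² = 0.76 is taken from the income/expenditure example;
# n = 50 is a hypothetical sample size chosen for illustration.
print(round(adjusted_r2(0.76, n=50, k=1), 3))
```

With these numbers the adjusted value is about 0.755, slightly below the raw R² of 0.76. The gap widens as k grows relative to n.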

Let’s take an example to see how the values of these metrics change as independent variables are added to a regression model:

Independent variable added	R² (%)	Adjusted-R² (%)
X₁	67.8	67.1
X₂	88.3	85.6
X₃	92.5	82.7

In this example, R² keeps rising as each variable is added, but adjusted-R² falls from 85.6 to 82.7 when X₃ is included. This tells us that X₃ is insignificant: it contributes little to explaining the variation in the dependent variable, so the penalty for the extra variable outweighs the small gain in fit.

R² vs Adjusted-R²

👉 Adjusted-R² is an improved version of R².

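👉 Adjusted-R² rewards an independent variable only on merit, that is, only when it genuinely improves the fit.

👉 Adjusted-R² ≤ R², with strict inequality whenever the model has at least one independent variable and R² < 1.

👉 R² credits every reduction in residual error, including chance fit from extraneous variables, whereas adjusted-R² discounts it.

👉 The only difference between R² and adjusted-R² is the adjustment for degrees of freedom.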
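The degrees-of-freedom point can be made explicit with a short derivation. It assumes the standard definitions, with n observations and k independent variables; the symbols n, k and the bar notation for adjusted R² are introduced here for illustration.

```latex
\begin{aligned}
R^2 &= 1 - \frac{SS_{RES}}{SS_{TOT}} \\[4pt]
\bar{R}^2 &= 1 - \frac{SS_{RES}/(n-k-1)}{SS_{TOT}/(n-1)}
           = 1 - \frac{SS_{RES}}{SS_{TOT}} \cdot \frac{n-1}{n-k-1}
           = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}
\end{aligned}
```

In other words, adjusted-R² divides each sum of squares by its degrees of freedom before taking the ratio. Since (n - 1)/(n - k - 1) is greater than 1 for any k ≥ 1, the adjusted value can never exceed R².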

The Game of Increasing R²

Researchers sometimes try their best to increase R² in every possible way.

👉 One way is to include more and more explanatory (independent) variables in the model, because:

R² is a non-decreasing function of the number of independent variables, i.e., when one more independent variable is included, R² is likely to increase and will never decrease.
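Taken to the extreme, this game can push R² all the way to 1 without any real signal. The sketch below is a hypothetical experiment: both the target and the predictors are random noise, and the model is an ordinary least squares fit with an intercept. The helper `training_r2`, the sample size, and the seed are illustrative assumptions.

```python
import numpy as np

def training_r2(X, y):
    """Training R² of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
n = 20
y = rng.normal(size=n)                  # target: pure noise
X_noise = rng.normal(size=(n, n - 1))   # n - 1 predictors, all unrelated to y

for k in (1, 5, 10, 15, 19):
    r2 = training_r2(X_noise[:, :k], y)
    print(f"{k:2d} noise predictors: R² = {r2:.4f}")
```

With 19 predictors plus an intercept there are as many parameters as observations, so the fit passes through every point and R² reaches 1 (up to rounding), even though none of the predictors carries any information about the target.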

When to use which?

Comparing models using R²

Comparing two models based on R² alone is dangerous because:

👉 Models with different numbers of independent variables may have the same value of R².

👉 R² ignores the total sample size and the respective degrees of freedom.

Hence, there is a likelihood that one would choose the wrong model.
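To make the risk concrete, consider two hypothetical models that report the same R² but use different numbers of independent variables. The sketch below applies the standard adjusted-R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1); the sample size, the R² value, and the variable counts are invented for illustration.

```python
def adjusted_r2(r2, n, k):
    """Standard adjusted R² for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n = 50       # hypothetical sample size
r2 = 0.85    # both hypothetical models report the same R²

small_model = adjusted_r2(r2, n, k=3)    # model with 3 independent variables
large_model = adjusted_r2(r2, n, k=10)   # model with 10 independent variables

print(f"k = 3:  adjusted R² = {small_model:.3f}")
print(f"k = 10: adjusted R² = {large_model:.3f}")
```

R² alone cannot separate the two models, while the adjusted value (roughly 0.840 versus 0.812) favors the smaller one, which achieves the same fit with fewer variables.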

Problem solved by adjusted-R²

To compare two different models, or to choose the best model, adjusted-R² is used because:

👉 It is adjusted for the respective degrees of freedom.

👉 It takes into account the total sample size and the number of independent variables.

👉 It is not an increasing function of the number of independent variables.

👉 It increases only if a newly added independent variable improves the fit by more than the penalty for the extra variable.
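The points above can be illustrated by fitting nested models on synthetic data. In this sketch the target depends on two useful variables (`x1`, `x2`) and not on a third (`junk`); all names, coefficients, and the seed are assumptions made for the example, and the fit is plain least squares with an intercept.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R², adjusted R²) for an OLS fit with an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(1)
n = 100
x1, x2, junk = rng.normal(size=(3, n))          # junk is unrelated to y
y = 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)    # synthetic target

results = {
    "x1": r2_and_adjusted(np.column_stack([x1]), y),
    "x1 + x2": r2_and_adjusted(np.column_stack([x1, x2]), y),
    "x1 + x2 + junk": r2_and_adjusted(np.column_stack([x1, x2, junk]), y),
}
for name, (r2, adj) in results.items():
    print(f"{name:15s} R² = {r2:.4f}   adjusted R² = {adj:.4f}")
```

Adding the useful variable `x2` raises both metrics clearly. Adding `junk` can only raise R² by a tiny amount, while adjusted-R² will usually drop; on a particular random sample an irrelevant variable can still nudge adjusted-R² up by chance, so the penalty is a safeguard, not a guarantee.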

Conclusion

To conclude the discussion:

👉 R² can be used to assess the goodness of fit of a single model, whereas

👉 Adjusted-R² is used to compare two models and to see the real impact of newly added independent variables.

👉 Adjusted-R² should be used when selecting important predictors for the regression model.

End Notes

Thanks for reading!

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link

Please feel free to contact me on Linkedin, Email.

Something not mentioned, or want to share your thoughts? Feel free to comment below and I’ll get back to you.

About the author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the Indian Institute of Technology Jodhpur (IITJ). I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.
