What is Multicollinearity? Understand Causes, Effects and Detection Using VIF

Aniruddha Bhandari Last Updated : 03 Jan, 2025


Multicollinearity might be a handful to pronounce, but it’s a topic you should be aware of in the field of data science and machine learning, especially if you’re sitting for data scientist interviews! In this article, we will understand what multicollinearity is and how it is caused. We will also try to understand why it is a problem and how we can detect and fix it, with a focus on using the variance inflation factor (VIF) to spot multicollinearity in regression models.

Before diving further, it is imperative to have a basic understanding of regression and some statistical terms. For this, I highly recommend going through the below resources:

Fundamentals of Regression Analysis (Free Course!)
Beginner’s Guide to Linear Regression

In this article, you will learn what multicollinearity is, how it affects regression analysis, and the role of VIF in identifying multicollinearity in regression models.

Learning Objective

Understand what multicollinearity is and why it is a problem in a regression model.
Learn the causes of multicollinearity.
Understand how to detect multicollinearity using the variance inflation factor (VIF).
Learn about the methods used to fix multicollinearity, including dropping correlated features.

What is Multicollinearity?
Issues with Multicollinearity in Regression Models
Understanding the Impact of Multicollinearity
What is the assumption of multicollinearity in linear regression?
What Causes Multicollinearity?
Detecting Multicollinearity Using a Variance Inflation Factor (VIF)
Fixing Multicollinearity
How to Interpret Multicollinearity in SPSS?
Conclusion

What is Multicollinearity?

Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a regression model are highly correlated, indicating a strong linear relationship among the predictor variables. This issue complicates regression analysis by making it difficult to accurately determine the individual effects of each independent variable on the dependent variable.

The presence of multicollinearity can lead to unstable and unreliable coefficient estimates, making it challenging to interpret the results and draw meaningful conclusions from the model. Detecting and addressing multicollinearity is crucial to ensure the validity and robustness of regression models. For example, in a regression model, variables such as height and weight or household income and water consumption often show high correlation.

Key takeaways

When multiple independent variables in a model are correlated, this is known statistically as multicollinearity.
If the correlation between two variables is +/- 1.0, then the variables are said to be perfectly collinear.
Less trustworthy statistical conclusions will arise from multicollinearity among independent variables.

Example

An everyday example of multicollinearity can be illustrated with Colin, who experiences happiness from watching television while eating chips. His happiness is hard to attribute to either activity individually because the more television he watches, the more chips he eats, making the two activities highly correlated.

This makes it difficult to determine whether his happiness is more influenced by eating chips or watching television, exemplifying the multicollinearity problem. In the context of machine learning, multicollinearity, marked by a correlation coefficient close to +1.0 or -1.0 between variables, can lead to less dependable statistical conclusions. Therefore, managing multicollinearity is essential in predictive modeling to obtain reliable and interpretable results.

Issues with Multicollinearity in Regression Models

Multicollinearity can be a problem in a regression model fitted with ordinary least squares (OLS), for example with the OLS implementation in statsmodels, a Python library that provides a range of tools for statistical analysis, including regression analysis. This is because the estimated regression coefficients become unstable and difficult to interpret in the presence of multicollinearity.

When multicollinearity is present, the estimated regression coefficients may become large and unpredictable, leading to unreliable inferences about the effects of the predictor variables on the response variable. Therefore, it is important to check for multicollinearity and consider using other regression techniques that can handle this problem, such as ridge regression or principal component regression.
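
To give a flavor of why ridge regression is suggested here, below is a minimal sketch on synthetic data (not the salary dataset used later). The helper `fit_linear`, the penalty value `alpha=10.0`, and the simulated variables are all illustrative assumptions, and the closed-form solution is a simplified stand-in for a library implementation.

```python
import numpy as np

def fit_linear(X, y, alpha=0.0):
    """Closed-form least squares with an L2 penalty on the slopes (alpha=0 is plain OLS)."""
    Xc = X - X.mean(axis=0)          # centering removes the need for an intercept term
    yc = y - y.mean()
    k = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(k), Xc.T @ yc)

rng = np.random.default_rng(1)
ols_w1, ridge_w1 = [], []
for _ in range(300):
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.1, size=100)   # x2 is nearly a copy of x1
    X = np.column_stack([x1, x2])
    y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=100)
    ols_w1.append(fit_linear(X, y)[0])
    ridge_w1.append(fit_linear(X, y, alpha=10.0)[0])

# Spread of the estimated coefficient of x1 across the 300 simulated samples
print("std of W1 under OLS:  ", np.std(ols_w1))
print("std of W1 under ridge:", np.std(ridge_w1))
```

In this sketch the ridge estimate of the first coefficient varies far less from sample to sample than the OLS estimate. The price is that ridge coefficients are biased, so this trades some bias for stability.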

Understanding the Impact of Multicollinearity

For example, let’s assume that in the following linear equation:

Y = W0+W1*X1+W2*X2

Coefficient W1 is the increase in Y for a unit increase in X1 while keeping X2 constant. But since X1 and X2 are highly correlated, changes in X1 would also cause changes in X2, and we would not be able to see their individual effect on Y.

The regression coefficient, also known as the beta coefficient, measures the strength and direction of the relationship between a predictor variable (X) and the response variable (Y). In the presence of multicollinearity, the regression coefficients become unstable and difficult to interpret because the variance of the coefficients becomes large. This results in wide confidence intervals and increased variability in the predicted values of Y for a given value of X. As a result, it becomes challenging to determine the individual contribution of each predictor variable to the response variable and make reliable inferences about their effects on Y.
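
The inflated coefficient variance described above can be seen in a small simulation. This is a sketch under stated assumptions: the data are synthetic, the true model Y = 1 + 2*X1 + 3*X2 + noise is made up for illustration, and `ols_coef_std` is a hypothetical helper name.

```python
import numpy as np

def ols_coef_std(rho, n=200, n_trials=500, seed=0):
    """Empirical std. dev. of the OLS estimate of W1 when corr(X1, X2) = rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    w1_estimates = []
    for _ in range(n_trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        # Assumed true model: Y = 1 + 2*X1 + 3*X2 + noise
        y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=n)
        A = np.column_stack([np.ones(n), X])          # add the intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares fit
        w1_estimates.append(coef[1])
    return float(np.std(w1_estimates))

std_low = ols_coef_std(rho=0.0)     # uncorrelated predictors
std_high = ols_coef_std(rho=0.95)   # highly correlated predictors
print(f"std of W1 estimate, rho=0.00: {std_low:.3f}")
print(f"std of W1 estimate, rho=0.95: {std_high:.3f}")
```

The true coefficient is the same in both settings, but the estimate of W1 is much noisier when X1 and X2 are highly correlated, which is exactly what widens the confidence intervals.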

“ This makes the effects of X1 on Y difficult to distinguish from the effects of X2 on Y. ”

Multicollinearity may not affect the predictive accuracy of a machine-learning model very much. But we might lose reliability in determining the effects of individual features in the model, and that can be a problem when it comes to interpretability.

What is the assumption of multicollinearity in linear regression?

Linear regression assumes that the independent variables are not perfectly, or nearly perfectly, linearly related to one another. Multicollinearity breaks this assumption: two or more independent variables correlate strongly with one another. As a result, the model may encounter problems like:

Unpredictable coefficients: The coefficients indicating the influence of each variable can change greatly with even minor changes to the model or the data, which makes them less dependable.

Complicated interpretation: Due to the substantial information overlap between the variables, it is difficult to pinpoint each one’s exact contribution to the final result.

Increased variability: Because of the higher level of uncertainty in the estimated coefficients, it is more challenging to assess the significance of the variables.

Multicollinearity may not make a model much less predictive, but it can make interpretation more difficult and cause trust in the findings to be misplaced.

What Causes Multicollinearity?

Multicollinearity could occur due to the following problems:

Multicollinearity could exist because of the problems in the dataset at the time of creation. These problems could be because of poorly designed experiments, highly observational data, or the inability to manipulate the data.
For example, determining the electricity consumption of a household from the household income and the number of electrical appliances. Here, we know that the number of electrical appliances in a household will increase with household income. However, this cannot be removed from the dataset.
Multicollinearity could also occur when new variables are created which are dependent on other variables.
For example, creating a variable for BMI from the height and weight variables would include redundant information in the model, and the new variable will be a highly correlated variable.
Including identical variables in the dataset.
For example, including variables for temperature in Fahrenheit and temperature in Celsius.
Inaccurate use of dummy variables can also cause a multicollinearity problem. This is called the Dummy variable trap.
For example, consider a dataset containing a marital status variable with two unique values: ‘married’ and ‘single’. Creating dummy variables for both of them would include redundant information. We can make do with only one variable containing 0/1 for ‘married’/‘single’ status.
Insufficient data, in some cases, can also cause multicollinearity problems.
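
The dummy variable trap from the list above can be sketched in a few lines. The tiny `marital_status` column below is made-up data, and using `pd.get_dummies` with `drop_first=True` is one common way to avoid the trap, not necessarily the one used for this article's dataset.

```python
import numpy as np
import pandas as pd

status = pd.DataFrame(
    {"marital_status": ["married", "single", "single", "married", "single"]}
)

# One dummy per category: 'married' + 'single' always sums to 1,
# which duplicates the intercept column of a regression.
full = pd.get_dummies(status["marital_status"], dtype=float)
full.insert(0, "intercept", 1.0)

# drop_first=True keeps a single 0/1 column and avoids the redundancy.
reduced = pd.get_dummies(status["marital_status"], drop_first=True, dtype=float)
reduced.insert(0, "intercept", 1.0)

# A design matrix whose rank is below its column count is perfectly collinear.
rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print("full:   ", full.shape[1], "columns, rank", rank_full)
print("reduced:", reduced.shape[1], "columns, rank", rank_reduced)
```

With both dummies included, the design matrix has three columns but only rank 2, so one column carries no new information. Dropping one dummy restores full rank.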


Detecting Multicollinearity Using a Variance Inflation Factor (VIF)

Let’s try detecting multicollinearity in a dataset to give you a flavor of what can go wrong.

I have created a dataset determining the salary of a person in a company based on the following features:

Gender (0 – female, 1 – male)
Age
Years of service (Years spent working in the company)
Education level (0 – no formal education, 1 – under-graduation, 2 – post-graduation)

import pandas as pd

df = pd.read_csv(r'C:/Users/Dell/Desktop/salary.csv')
df.head()

In Python, there are several ways to detect multicollinearity in a dataset, such as using the Variance Inflation Factor (VIF) or calculating the correlation matrix of the independent variables. To address multicollinearity, techniques such as regularization or feature selection can be applied to select a subset of independent variables that are not highly correlated with each other. In this article, we will focus on the most common one – VIF (Variance Inflation Factors).

“VIF determines the strength of the correlation between the independent variables. It is computed by taking a variable and regressing it against every other variable.”

or

VIF score of an independent variable represents how well the variable is explained by other independent variables.

The R^2 value is determined to find out how well an independent variable is described by the other independent variables. A high value of R^2 means that the variable is highly correlated with the other variables. This is captured by the VIF, which is defined as:

VIF = 1 / (1 - R^2)

So, the closer the R^2 value to 1, the higher the value of VIF and the higher the multicollinearity with the particular independent variable.
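
To make the link between R^2 and VIF concrete: using the standard definition VIF = 1 / (1 - R^2), an R^2 of 0.8 gives a VIF of 5 and an R^2 of 0.9 gives a VIF of 10. The sketch below computes VIF this way from scratch. The helper `vif_from_r2` and the synthetic age, years-of-service, and education columns are assumptions made for illustration and are not the article's salary dataset.

```python
import numpy as np

def vif_from_r2(X):
    """VIF of each column of X, computed as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: years of service is built to depend strongly on age
rng = np.random.default_rng(42)
n = 500
age = rng.normal(40, 10, n)
years_of_service = 0.5 * (age - 20) + rng.normal(0, 2, n)
education = rng.integers(0, 3, n).astype(float)

vifs = vif_from_r2(np.column_stack([age, years_of_service, education]))
for name, value in zip(["age", "years_of_service", "education"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

The two related columns end up with a high VIF, while the unrelated education column stays close to 1.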

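# Import libraries for VIF
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    """Return a DataFrame with the VIF of every column in X."""
    # Note: statsmodels does not add an intercept column to X;
    # the values are computed on X exactly as passed in.
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif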

VIF starts at 1 and has no upper limit
VIF = 1, no correlation between the independent variable and the other variables
VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others

X = df.iloc[:,:-1]
calc_vif(X)

We can see here that the ‘Age’ and ‘Years of service’ have a high VIF value, meaning they can be predicted by other independent variables in the dataset.

Although correlation matrix and scatter plots can also be used to find multicollinearity, their findings only show the bivariate relationship between the independent variables. VIF is preferred as it can show the correlation of a variable with a group of other variables.
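
Here is a small sketch of a case where the correlation matrix looks harmless but VIF does not. The four synthetic variables are made up for illustration. The code uses the fact that, for a model with an intercept, the VIFs equal the diagonal of the inverse of the predictors' correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1, x2, x3 = rng.normal(size=(3, n))
# x4 is almost a sum of the other three, so it is explained by them as a group
x4 = x1 + x2 + x3 + rng.normal(scale=0.3, size=n)
X = np.column_stack([x1, x2, x3, x4])

corr = np.corrcoef(X, rowvar=False)
largest_pairwise = np.abs(corr - np.eye(4)).max()   # ignore the diagonal of ones
vif = np.diag(np.linalg.inv(corr))                  # VIF of each column

print(f"largest pairwise |correlation|: {largest_pairwise:.2f}")
print("VIF per variable:", np.round(vif, 1))
```

No single pair of variables crosses the usual 0.8 correlation warning level, yet x4 has a VIF well above 10 because it is explained by the other three variables together.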


Fixing Multicollinearity

Dropping one of the correlated features will help in bringing down the multicollinearity between correlated features:


X = df.drop(['Age','Salary'],axis=1)
calc_vif(X)

The image on the left contains the original VIF value for variables, and the one on the right is after dropping the ‘Age’ variable. We were able to drop the variable ‘Age’ from the dataset because its information was being captured by the ‘Years of service’ variable. This has reduced the redundancy in our dataset. Dropping variables should be an iterative process starting with the variable having the largest VIF value, because its trend is largely captured by the other variables. If you do this, you will notice that VIF values for the other variables reduce too, although to a varying extent.

In our example, after dropping the ‘Age’ variable, VIF values for all variables have decreased to varying degrees.
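
The iterative dropping process can be sketched as a small loop. Everything here is an assumption made for illustration: the helper names `vif_table` and `drop_high_vif`, the threshold of 5, and the synthetic data that merely borrows the column names of the salary dataset. The VIFs are computed from the inverse correlation matrix, which corresponds to a regression with an intercept, so the numbers will differ from those returned by `calc_vif`.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

def drop_high_vif(X, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_table(X)
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            break
        X = X.drop(columns=worst)
    return X

# Synthetic stand-in for the features of the salary dataset
rng = np.random.default_rng(3)
n = 500
age = rng.normal(40, 10, n)
X = pd.DataFrame({
    "Gender": rng.integers(0, 2, n),
    "Age": age,
    "Years of service": 0.5 * (age - 20) + rng.normal(0, 2, n),
    "Education level": rng.integers(0, 3, n),
})

kept = drop_high_vif(X, threshold=5.0)
print(list(kept.columns))
print(vif_table(kept).round(2))
```

Recomputing the VIFs after every drop matters, because removing one variable changes the VIF of all the others.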

Next, combine the correlated variables into one and drop the others. This will reduce the multicollinearity.

df2 = df.copy()
df2['Age_at_joining'] = df2['Age'] - df2['Years of service']
X = df2.drop(['Age','Years of service','Salary'],axis=1)
calc_vif(X)

Multicollinearity: VIF values after combining features

The image on the left contains the original VIF value for variables, and the one on the right is after combining the ‘Age’ and ‘Years of service’ variables. Combining ‘Age’ and ‘Years of service’ into a single variable, ‘Age_at_joining’, allows us to capture the information in both variables.

However, multicollinearity may not be a problem every time. Whether you need to fix it depends primarily on the following considerations:

If you care more about how much each individual feature, rather than a group of features, affects the target variable, then removing multicollinearity may be a good option.
If multicollinearity is not present in the features you are interested in, then it may not be a problem.


How to Interpret Multicollinearity in SPSS?

To interpret multicollinearity in SPSS, here are some points:

VIF (Variance Inflation Factor) • Where: Coefficients table • Problem if: VIF > 5-10 • Higher VIF = More multicollinearity
Tolerance • Where: Coefficients table • Problem if: < 0.1 • Lower tolerance = More multicollinearity
Condition Index • Where: Collinearity Diagnostics table • Caution if: 15-30 • Problem if: > 30
Variance Proportions • Where: Collinearity Diagnostics table • Problem if: Multiple variables > 0.5 on same row
Correlation Matrix • Problem if: Correlations > 0.8 between variables
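
For readers working in Python rather than SPSS, the tolerance and condition index diagnostics can be approximated with NumPy. This is a rough sketch on synthetic data. It assumes tolerance is 1 / VIF and that condition indices are ratios of singular values of the design matrix (intercept included) after scaling each column to unit length. SPSS's exact conventions may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.03, size=n)   # nearly identical to x1
x3 = rng.normal(size=n)                    # unrelated to the others
X = np.column_stack([x1, x2, x3])

# Tolerance is the reciprocal of VIF (VIF taken from the inverse correlation matrix)
vif = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
tolerance = 1.0 / vif

# Condition indices: scale each design-matrix column to unit length,
# then divide the largest singular value by each singular value.
design = np.column_stack([np.ones(n), X])
scaled = design / np.linalg.norm(design, axis=0)
singular_values = np.linalg.svd(scaled, compute_uv=False)
condition_index = singular_values.max() / singular_values

print("tolerance:", np.round(tolerance, 4))
print("condition indices:", np.round(condition_index, 1))
```

Here the two near-duplicate variables fall below the 0.1 tolerance level and the largest condition index exceeds 30, matching the warning levels listed above.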

Conclusion

We learned how the problem of multicollinearity could occur in regression models when two or more independent variables in a data frame have a high correlation with one another. Its presence can cause the regression coefficients to become unstable and difficult to interpret, which can lead to wide confidence intervals and increased variability in the predicted values of the dependent variable. Understanding what causes it and how to detect and fix it can help us to overcome these problems.

In this article, we explored how the Variance Inflation Factor (VIF) can be used to detect the existence of multicollinearity in our dataset and how to fix the problem by identifying and dropping the correlated variables. Remember, when assessing the statistical significance of predictor variables in a regression model, it is important to consider their individual coefficients and their standard errors, p-values, and confidence intervals. Predictor variables with high multicollinearity may have inflated standard errors and p-values, which can lead to incorrect conclusions about their statistical significance.

Hope you like the article on multicollinearity in regression! To recap: multicollinearity occurs when independent variables are closely related, making it hard to see their individual effects. This can inflate standard errors and lead to unreliable results. The Variance Inflation Factor (VIF) is a useful tool to check for multicollinearity, helping you decide if you need to adjust your variables for clearer analysis.

If you want to understand other regression models or want to understand model interpretation, I highly recommend going through the following wonderfully written articles:

Regression Modeling
Machine Learning Model Interpretability

As a next step, you should also check out the Fundamentals of Regression (free) course.

Key Takeaways

Multicollinearity occurs when two or more independent variables have a high correlation with one another in a regression model, which makes it difficult to determine the individual effect of each independent variable on the dependent variable.
Multicollinearity can occur due to poorly designed experiments, highly observational data, creating new variables that are dependent on other variables, including identical variables in the dataset, inaccurate use of dummy variables, or insufficient data.
One method to detect multicollinearity is to calculate the variance inflation factor (VIF) for each independent variable. A VIF of 1 means no correlation with the other variables, and a VIF exceeding 5 or 10 indicates high multicollinearity.
To fix multicollinearity, one can remove one of the highly correlated variables, combine them into a single variable, or use a dimensionality reduction technique such as principal component analysis to reduce the number of variables while retaining most of the information.
In short, this article covered what multicollinearity means, how it affects regression, and how to detect it with VIF.

Frequently Asked Questions

Q1. How can we identify the linearity of correlation?

A. Use scatter plots for visual relationships, correlation coefficients for numerical strength and direction, and linear regression models for prediction, with high R-squared values indicating strong linear relationships.

Q2. What is the relationship between VIF and R-squared?

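A. The VIF of a predictor is computed from the R-squared obtained by regressing that predictor on all the other predictors: VIF = 1 / (1 - R^2). The closer this R-squared is to 1, the higher the VIF. This is different from the R-squared of the main regression model, which indicates how well the predictors explain the response and doesn’t directly indicate multicollinearity.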

Q3. Why do we need to use VIF?

A. VIF identifies variables contributing to multicollinearity. Removing high VIF variables reduces multicollinearity, improving the stability and interpretability of the regression coefficients. Values above 5 or 10 are typically targeted.

Q4. What is multicollinearity?

A. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to determine their individual effects on the dependent variable.

Q5. What are the two types of multicollinearity?

A. Perfect Multicollinearity: Exact linear relationship between variables.
High Multicollinearity: Strong correlation but not perfect.

Aniruddha Bhandari

I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize it and predict the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from the challenges that I face, with the community to make the world a better place!


Naveen kumar Mamidala

How does non linear algo handle multi colinearity


For tree-based algorithms, multicollinearity wouldn't matter much as they split on the feature that gives higher information gain. However, for other algorithms like polynomial regression and SVM, regularization can be used.

Christophe Bunn

Hi Aniruddha, when you wrote "Coefficient W1 is the increase in Y for a unit increase in W1 while keeping X2 constant." didn't you mean "Coefficient W1 is the increase in Y for a unit increase in X1 while keeping X2 constant."? Cheers, Chris.

Hey Chris, thanks for pointing out the mistake.

Parvesh

Hi Aniruddha I found this article very useful, could you share dataset so that readers may implement code at their end to get maximum out of this article

Hi Parvesh Glad you liked the article. I created a dummy dataset for this article. You can access it at this link. Thanks
