A Guide to Understanding Interaction Terms

Mncedisi Last Updated : 29 Nov, 2024


Introduction

Interaction terms are added to regression models to capture the combined effect of two or more independent variables on the dependent variable. Sometimes the question is not just how each predictor relates to the target on its own, and interaction terms are helpful in exactly those cases. They are useful whenever the relationship between one independent variable and the dependent variable is conditional on the level of another independent variable.

In other words, the effect of one predictor on the response variable depends on the level of another predictor. In this blog, we examine interaction terms through a simulated scenario: predicting how much time users spend on an e-commerce platform using their past behavior.

Learning Objectives

Understand how interaction terms enhance the predictive power of regression models.
Learn to create and incorporate interaction terms in a regression analysis.
Analyze the impact of interaction terms on model accuracy through a practical example.
Visualize and interpret the effects of interaction terms on predicted outcomes.
Gain insights into when and why to apply interaction terms in real-world scenarios.

This article was published as a part of the Data Science Blogathon.

Introduction
Understanding the Basics of Interaction Terms
How Interaction Terms Influence Regression Coefficients?
Simulated Scenario: User Behavior on an E-Commerce Platform
Model Without an Interaction Term
Model With an Interaction Term
Comparing Model Performance
Conclusion
Frequently Asked Questions

Understanding the Basics of Interaction Terms

In real life, variables rarely act in isolation from one another, so real-world models are more complex than the ones we study in class. For example, the effect of a navigation action such as adding items to a cart on the time a user spends on an e-commerce platform differs depending on whether the user also goes on to buy those items. Adding interaction terms to a regression model accounts for these dependencies and can therefore improve how well the model explains the patterns in the observed data and predicts future values of the dependent variable.

Mathematical Representation

Let’s consider a linear regression model with two independent variables, X1 and X2:

Y = β0 + β1X1 + β2X2 + ϵ,

where Y is the dependent variable, β0 is the intercept, β1 and β2 are the coefficients for the independent variables X1 and X2, respectively, and ϵ is the error term.

Adding an Interaction Term

To include an interaction term between X1 and X2, we introduce a new variable X1⋅X2 :

Y = β0 + β1X1 + β2X2 + β3(X1⋅X2) + ϵ,

where β3 represents the interaction effect between X1 and X2. The term X1⋅X2 is the product of the two independent variables.
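
As a minimal sketch of the equation above, the product column can be built by hand and the four coefficients estimated by ordinary least squares. The coefficient values, the sample size, and the noise level used here are illustrative assumptions, not values taken from this article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two hypothetical continuous predictors
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Assumed "true" coefficients: beta0, beta1, beta2, beta3
true_beta = np.array([1.0, 2.0, -1.0, 0.5])

# Design matrix: intercept, X1, X2, and the interaction column X1*X2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares estimate of the four coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

The last entry of beta_hat is the estimate of β3. With this much data it should land close to the assumed value of 0.5.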

How Interaction Terms Influence Regression Coefficients?

β0: The intercept, representing the expected value of Y when all independent variables are zero.
β1: The effect of X1 on Y when X2 is zero.
β2: The effect of X2 on Y when X1 is zero.
β3: The change in the effect of X1 on Y for a one-unit change in X2, or equivalently, the change in the effect of X2 on Y for a one-unit change in X1.
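
The interpretation of β3 can be made concrete with a small sketch. In the interaction model, the effect of a one-unit increase in X1 is β1 + β3·X2, so it shifts by β3 for every unit of X2. The function names and coefficient values below are hypothetical and chosen only for illustration.

```python
def predict(beta, x1, x2):
    """Prediction from Y = b0 + b1*X1 + b2*X2 + b3*(X1*X2), ignoring the error term."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def effect_of_x1(beta, x2):
    """Change in Y for a one-unit increase in X1 at a fixed value of X2."""
    _, b1, _, b3 = beta
    return b1 + b3 * x2


# Hypothetical coefficients: beta0, beta1, beta2, beta3
beta = (1.0, 2.0, 3.0, 0.5)

for x2 in (0, 1, 2):
    # Compare the formula with the actual change in the prediction
    difference = predict(beta, 1, x2) - predict(beta, 0, x2)
    print(f"X2={x2}: effect of X1 = {effect_of_x1(beta, x2)}, difference = {difference}")
```

With these assumed coefficients, the effect of X1 is 2.0 when X2 is zero and grows by 0.5 for each additional unit of X2, which matches the difference between the two predictions.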

Example: User Activity and Time Spent

First, let’s create a simulated dataset to represent user behavior on an online store. The data consists of:

added_in_cart: Indicates if a user has added products to their cart (1 for adding and 0 for not adding).
purchased: Whether or not the user completed a purchase (1 for completion or 0 for non-completion).
time_spent: The amount of time a user spent on the e-commerce platform.

Our goal is to predict the duration of a user's visit to the online store from whether they add products to their cart and whether they complete a purchase.

# import libraries
import pandas as pd
import numpy as np

# Generate synthetic data
def generate_synthetic_data(n_samples=2000):

    np.random.seed(42)
    added_in_cart = np.random.randint(0, 2, n_samples)
    purchased = np.random.randint(0, 2, n_samples)
    time_spent = 3 + 2*purchased + 2.5*added_in_cart + 4*purchased*added_in_cart + np.random.normal(0, 1, n_samples)
    return pd.DataFrame({'purchased': purchased, 'added_in_cart': added_in_cart, 'time_spent': time_spent})

df = generate_synthetic_data()
df.head()

Output:


Simulated Scenario: User Behavior on an E-Commerce Platform

As a first step, we build an ordinary least squares (OLS) regression model that includes both user actions but not their interaction. The hypothesis behind this model is that each action affects the time spent on the website separately. We then construct a second model that adds the interaction term between adding products to the cart and making a purchase.

This lets us compare the impact of these actions on time spent, separately and combined. In particular, we want to find out whether users who both add products to the cart and make a purchase spend more time on the site than the individual effects of each behavior would suggest.
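
Before fitting any model, a quick way to look at this question is to compare the average time spent across the four combinations of the two actions. The sketch below regenerates the simulated data with the same recipe as the generator shown earlier so that it runs on its own. The variable names group_means and extra_time are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Regenerate the simulated data with the same recipe used earlier
np.random.seed(42)
n_samples = 2000
added_in_cart = np.random.randint(0, 2, n_samples)
purchased = np.random.randint(0, 2, n_samples)
time_spent = (3 + 2 * purchased + 2.5 * added_in_cart
              + 4 * purchased * added_in_cart
              + np.random.normal(0, 1, n_samples))
df = pd.DataFrame({'purchased': purchased,
                   'added_in_cart': added_in_cart,
                   'time_spent': time_spent})

# Average time spent for each of the four behavior combinations
group_means = df.groupby(['purchased', 'added_in_cart'])['time_spent'].mean()
print(group_means.round(2))

# Extra time for doing both actions, beyond the sum of the two separate effects
extra_time = ((group_means.loc[(1, 1)] - group_means.loc[(1, 0)])
              - (group_means.loc[(0, 1)] - group_means.loc[(0, 0)]))
print("Extra time when both actions occur:", round(extra_time, 2))
```

By construction, the expected group means are 3, 5.5, 5, and 11.5, so the sample means should be roughly those values, and extra_time should be close to the interaction coefficient of 4 used in the generator.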

Model Without an Interaction Term

After fitting the model, we observed the following:

The model without the interaction term explains roughly 82% of the variance in time_spent on the training set and 80% on the test set (R-squared), with a test mean squared error (MSE) of 2.11. In other words, the squared difference between predicted and actual time_spent averages 2.11, which corresponds to a root mean squared error of about 1.45. The model is reasonably accurate, but it can be improved.
The plot below shows the same thing graphically: the model performs fairly well, but there is still much room for improvement, especially in capturing higher values of time_spent.

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Model without interaction term
X = df[['purchased', 'added_in_cart']]
y = df['time_spent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Add a constant for the intercept
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

model = sm.OLS(y_train, X_train_const).fit()
y_pred = model.predict(X_test_const)

# Calculate metrics for model without interaction term
train_r2 = model.rsquared
test_r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print("Model without Interaction Term:")
print('Training R-squared Score (%):', round(train_r2 * 100, 4))
print('Test R-squared Score (%):', round(test_r2 * 100, 4))
print("MSE:", round(mse, 4))
print(model.summary())


# Function to plot actual vs predicted
def plot_actual_vs_predicted(y_test, y_pred, title):

    plt.figure(figsize=(8, 4))
    plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0))
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title(title)
    plt.show()

# Plot without interaction term
plot_actual_vs_predicted(y_test, y_pred, 'Actual vs Predicted Time Spent (Without Interaction Term)')

Output:
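
One way to see where the model without the interaction term falls short is to average its residuals within each combination of the two actions. The sketch below is an assumption-laden illustration rather than the article's own evaluation: it regenerates the simulated data, fits the additive model on the full dataset (not on the train split) with NumPy's least-squares solver, and uses a hypothetical variable named resid_by_group.

```python
import numpy as np
import pandas as pd

# Same simulated data as in the generator shown earlier
np.random.seed(42)
n = 2000
added_in_cart = np.random.randint(0, 2, n)
purchased = np.random.randint(0, 2, n)
time_spent = (3 + 2 * purchased + 2.5 * added_in_cart
              + 4 * purchased * added_in_cart
              + np.random.normal(0, 1, n))

# Additive model: intercept + purchased + added_in_cart (no product column)
X = np.column_stack([np.ones(n), purchased, added_in_cart])
coef, *_ = np.linalg.lstsq(X, time_spent, rcond=None)
residuals = time_spent - X @ coef  # actual minus predicted

# Mean residual within each of the four behavior combinations
resid_by_group = (
    pd.DataFrame({'purchased': purchased,
                  'added_in_cart': added_in_cart,
                  'residual': residuals})
    .groupby(['purchased', 'added_in_cart'])['residual']
    .mean()
)
print(resid_by_group.round(2))
```

If the additive model were adequate, all four averages would be near zero. Here they should be clearly positive for users who did both actions or neither, and clearly negative for users who did exactly one, which is the pattern an omitted interaction leaves behind.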

Model With an Interaction Term

The scatter plot for the model with the interaction term shows predicted values substantially closer to the actual values, indicating a better fit.
With the interaction term, the model explains much more of the variance in time_spent: the test R-squared rises from 80.36% to 90.46%.
The predictions are also more accurate, as shown by the lower MSE (down from 2.11 to 1.02).
The points lie closer to the diagonal line, particularly for higher values of time_spent. The interaction term captures how the two user actions jointly affect the amount of time spent.

# Add interaction term
df['purchased_added_in_cart'] = df['purchased'] * df['added_in_cart']
X = df[['purchased', 'added_in_cart', 'purchased_added_in_cart']]
y = df['time_spent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Add a constant for the intercept
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

model_with_interaction = sm.OLS(y_train, X_train_const).fit()
y_pred_with_interaction = model_with_interaction.predict(X_test_const)

# Calculate metrics for model with interaction term
train_r2_with_interaction = model_with_interaction.rsquared
test_r2_with_interaction = r2_score(y_test, y_pred_with_interaction)
mse_with_interaction = mean_squared_error(y_test, y_pred_with_interaction)

print("\nModel with Interaction Term:")
print('Training R-squared Score (%):', round(train_r2_with_interaction * 100, 4))
print('Test R-squared Score (%):', round(test_r2_with_interaction * 100, 4))
print("MSE:", round(mse_with_interaction, 4))
print(model_with_interaction.summary())


# Plot with interaction term
plot_actual_vs_predicted(y_test, y_pred_with_interaction, 'Actual vs Predicted Time Spent (With Interaction Term)')

# Print comparison
print("\nComparison of Models:")
print("R-squared without Interaction Term:", round(r2_score(y_test, y_pred)*100,4))
print("R-squared with Interaction Term:", round(r2_score(y_test, y_pred_with_interaction)*100,4))
print("MSE without Interaction Term:", round(mean_squared_error(y_test, y_pred),4))
print("MSE with Interaction Term:", round(mean_squared_error(y_test, y_pred_with_interaction),4))

Output:
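
As an alternative to building the product column by hand, the statsmodels formula interface can create it from the model formula. This is a sketch of that option, not the approach used in this article: it regenerates the simulated data and fits on the full dataset, so the estimates will differ slightly from the train/test results reported here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same simulated data as in the generator shown earlier
np.random.seed(42)
n = 2000
added_in_cart = np.random.randint(0, 2, n)
purchased = np.random.randint(0, 2, n)
time_spent = (3 + 2 * purchased + 2.5 * added_in_cart
              + 4 * purchased * added_in_cart
              + np.random.normal(0, 1, n))
df = pd.DataFrame({'purchased': purchased,
                   'added_in_cart': added_in_cart,
                   'time_spent': time_spent})

# 'purchased * added_in_cart' expands to both main effects plus their product
formula_model = smf.ols('time_spent ~ purchased * added_in_cart', data=df).fit()
print(formula_model.params.round(2))
```

The interaction coefficient appears under the name purchased:added_in_cart, and its estimate should be close to the value of 4 used to generate the data.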

Comparing Model Performance

The blue points represent the predictions of the model without the interaction term. For higher actual values of time spent, these points are more widely dispersed around the diagonal line.
The red points represent the predictions of the model with the interaction term. This model produces more accurate predictions, especially for higher actual values of time spent, where its points lie closer to the diagonal line.

# Compare model with and without interaction term

def plot_actual_vs_predicted_combined(y_test, y_pred1, y_pred2, title1, title2):

    plt.figure(figsize=(10, 6))
    # Set the marker color explicitly so the points match the blue/red description
    plt.scatter(y_test, y_pred1, color='blue', label=title1, alpha=0.6)
    plt.scatter(y_test, y_pred2, color='red', label=title2, alpha=0.6)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title('Actual vs Predicted User Time Spent')
    plt.legend()
    plt.show()

plot_actual_vs_predicted_combined(y_test, y_pred, y_pred_with_interaction, 'Model Without Interaction Term', 'Model With Interaction Term')

Output:

Conclusion

The improvement in the model's performance with the interaction term demonstrates that adding interaction terms can sometimes improve a model's predictive power. This example highlights how interaction terms can capture additional information that is not apparent from the main effects alone. In practice, considering interaction terms in regression models can lead to more accurate and insightful predictions.

In this blog, we first generated a synthetic dataset to simulate user behavior on an e-commerce platform. We then constructed two regression models: one without interaction terms and one with interaction terms. By comparing their performance, we demonstrated the significant impact of interaction terms on the accuracy of the model.

Check out the full code and resources on GitHub.

Key Takeaways

Regression models with interaction terms can help to better understand the relationships between two or more variables and the target variable by capturing their combined effects.
Including interaction terms can significantly improve model performance, as evidenced by higher R-squared values and lower MSE in this guide.
Interaction terms are not just theoretical concepts; they can be applied to real-world scenarios.

Frequently Asked Questions

Q1. What are interaction terms in regression analysis?

A. They are variables created by multiplying two or more independent variables. They are used to capture the combined effect of these variables on the dependent variable. This can provide a more nuanced understanding of the relationships in the data.

Q2. When should I consider using interaction terms in my model?

A. You should consider using interaction terms when you suspect that the effect of one independent variable on the dependent variable depends on the level of another independent variable. For example, if you believe that the impact of adding items to the cart on the time spent on an e-commerce platform depends on whether the user makes a purchase, you should include an interaction term between these variables.

Q3. How do I interpret the coefficients of interaction terms?

A. The coefficient of an interaction term represents the change in the effect of one independent variable on the dependent variable for a one-unit change in another independent variable. In the example above, the interaction term is between purchased and added_in_cart, so its coefficient tells us how the effect of adding items to the cart on time spent changes when a purchase is made. In the simulated data, that effect was set to 2.5 when no purchase is made and 2.5 + 4 = 6.5 when one is.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Mncedisi

Data Scientist with 4+ years of experience in Data Science and Analytics roles within the Retail/eCommerce, Delivery Optimisation and Media & Entertainment industries. I’ve worked extensively with developing and deploying machine learning solutions, data visualisation or reporting, building actionable insights for the business to drive data-driven strategies.
