Interaction terms are incorporated in regression modelling to capture the joint effect of two or more independent variables on the dependent variable. They are helpful when the simple, separate relationships between the control variables and the target variable are not the whole story — specifically, when the relationship between one independent variable and the dependent variable depends on the level of another independent variable.
In other words, an interaction term lets the effect of one predictor on the response vary with the level of another predictor. In this blog, we examine the idea of interaction terms through a simulated scenario: predicting the amount of time users spend on an e-commerce platform from their past behavior.
This article was published as a part of the Data Science Blogathon.
In real life, variables rarely act in isolation, so real-world models are much more complex than those we study in class. For example, the effect of a navigation action such as adding items to a cart on the time spent on an e-commerce platform differs depending on whether the user goes on to buy those items. Adding interaction terms to a regression model allows us to acknowledge these intersections and, therefore, enhances the model's fitness for purpose — both in explaining the patterns underlying the observed data and in predicting future values of the dependent variable.
Let’s consider a linear regression model with two independent variables, X1 and X2:
Y = β0 + β1X1 + β2X2 + ϵ,
where Y is the dependent variable, β0 is the intercept, β1 and β2 are the coefficients for the independent variables X1 and X2, respectively, and ϵ is the error term.
To include an interaction term between X1 and X2, we introduce a new variable X1⋅X2 :
Y = β0 + β1X1 + β2X2 + β3(X1⋅X2) + ϵ,
where β3 represents the interaction effect between X1 and X2. The term X1⋅X2 is the product of the two independent variables.
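To see concretely what β3 does, here is a minimal sketch (using illustrative coefficient values chosen for this example, not estimates from any fitted model) of how the marginal effect of X1 on Y shifts with the level of X2:

```python
# In the interaction model, the marginal effect of X1 on Y is
# dY/dX1 = beta1 + beta3 * X2, so it depends on the level of X2.
beta1, beta3 = 2.0, 4.0  # illustrative values, not fitted estimates

def marginal_effect_x1(x2, b1=beta1, b3=beta3):
    """Slope of Y with respect to X1 at a given level of X2."""
    return b1 + b3 * x2

print(marginal_effect_x1(0))  # effect of X1 when X2 = 0 -> 2.0
print(marginal_effect_x1(1))  # effect of X1 when X2 = 1 -> 6.0
```

Without the interaction term, the slope of X1 would be the constant β1; the interaction lets it change by β3 for every unit increase in X2.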
First, let's create a simulated dataset to represent user behavior on an online store. Each record contains two binary behavior flags, purchased and added_in_cart, along with the target variable time_spent:
# import libraries
import pandas as pd
import numpy as np
# Generate synthetic data
# Generate synthetic data
def generate_synthetic_data(n_samples=2000):
    np.random.seed(42)
    added_in_cart = np.random.randint(0, 2, n_samples)
    purchased = np.random.randint(0, 2, n_samples)
    time_spent = (3 + 2*purchased + 2.5*added_in_cart
                  + 4*purchased*added_in_cart
                  + np.random.normal(0, 1, n_samples))
    return pd.DataFrame({'purchased': purchased,
                         'added_in_cart': added_in_cart,
                         'time_spent': time_spent})
df = generate_synthetic_data()
df.head()
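Before modelling, we can sanity-check the simulated interaction by comparing the mean time spent in each combination of the two flags. This sketch rebuilds the same dataset so it runs standalone:

```python
import numpy as np
import pandas as pd

# Rebuild the simulated data (same seed and formula as above)
np.random.seed(42)
n_samples = 2000
added_in_cart = np.random.randint(0, 2, n_samples)
purchased = np.random.randint(0, 2, n_samples)
time_spent = (3 + 2*purchased + 2.5*added_in_cart
              + 4*purchased*added_in_cart
              + np.random.normal(0, 1, n_samples))
df = pd.DataFrame({'purchased': purchased,
                   'added_in_cart': added_in_cart,
                   'time_spent': time_spent})

# Mean time spent per (purchased, added_in_cart) cell
group_means = df.groupby(['purchased', 'added_in_cart'])['time_spent'].mean()
print(group_means.round(2))
```

The (1, 1) cell should come out near 3 + 2 + 2.5 + 4 = 11.5, well above the 7.5 that a purely additive model would predict — that gap is exactly what the interaction term will capture.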
Output:
As our next step, we will first build an ordinary least squares regression model that considers these user actions but not their interaction. Our hypothesis here is that each action affects the time spent on the website independently. We will then construct a second model that includes the interaction term between adding products to the cart and making a purchase.
This will let us compare the impact of those actions on time spent, separately and combined. In other words, we want to find out whether users who both add products to the cart and make a purchase spend more time on the site than the two behaviors, considered individually, would suggest.
After constructing the models, the following outcomes were noted:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Model without interaction term
X = df[['purchased', 'added_in_cart']]
y = df['time_spent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Add a constant for the intercept
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
model = sm.OLS(y_train, X_train_const).fit()
y_pred = model.predict(X_test_const)
# Calculate metrics for model without interaction term
train_r2 = model.rsquared
test_r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("Model without Interaction Term:")
print('Training R-squared Score (%):', round(train_r2 * 100, 4))
print('Test R-squared Score (%):', round(test_r2 * 100, 4))
print("MSE:", round(mse, 4))
print(model.summary())
# Function to plot actual vs predicted
def plot_actual_vs_predicted(y_test, y_pred, title):
    plt.figure(figsize=(8, 4))
    plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0))
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title(title)
    plt.show()
# Plot without interaction term
plot_actual_vs_predicted(y_test, y_pred, 'Actual vs Predicted Time Spent (Without Interaction Term)')
Output:
# Add interaction term
df['purchased_added_in_cart'] = df['purchased'] * df['added_in_cart']
X = df[['purchased', 'added_in_cart', 'purchased_added_in_cart']]
y = df['time_spent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Add a constant for the intercept
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
model_with_interaction = sm.OLS(y_train, X_train_const).fit()
y_pred_with_interaction = model_with_interaction.predict(X_test_const)
# Calculate metrics for model with interaction term
train_r2_with_interaction = model_with_interaction.rsquared
test_r2_with_interaction = r2_score(y_test, y_pred_with_interaction)
mse_with_interaction = mean_squared_error(y_test, y_pred_with_interaction)
print("\nModel with Interaction Term:")
print('Training R-squared Score (%):', round(train_r2_with_interaction * 100, 4))
print('Test R-squared Score (%):', round(test_r2_with_interaction * 100, 4))
print("MSE:", round(mse_with_interaction, 4))
print(model_with_interaction.summary())
# Plot with interaction term
plot_actual_vs_predicted(y_test, y_pred_with_interaction, 'Actual vs Predicted Time Spent (With Interaction Term)')
# Print comparison
print("\nComparison of Models:")
print("R-squared without Interaction Term:", round(r2_score(y_test, y_pred)*100,4))
print("R-squared with Interaction Term:", round(r2_score(y_test, y_pred_with_interaction)*100,4))
print("MSE without Interaction Term:", round(mean_squared_error(y_test, y_pred),4))
print("MSE with Interaction Term:", round(mean_squared_error(y_test, y_pred_with_interaction),4))
Output:
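As a side note, the same interaction model can be specified more compactly with the statsmodels formula API, where `a * b` expands to the main effects plus the interaction `a:b`. The sketch below rebuilds the simulated data so it runs standalone and fits on the full dataset rather than a train/test split:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the simulated data (same seed and formula as above)
np.random.seed(42)
n = 2000
added_in_cart = np.random.randint(0, 2, n)
purchased = np.random.randint(0, 2, n)
time_spent = (3 + 2*purchased + 2.5*added_in_cart
              + 4*purchased*added_in_cart + np.random.normal(0, 1, n))
df = pd.DataFrame({'purchased': purchased,
                   'added_in_cart': added_in_cart,
                   'time_spent': time_spent})

# 'purchased * added_in_cart' expands to
# purchased + added_in_cart + purchased:added_in_cart
model = smf.ols('time_spent ~ purchased * added_in_cart', data=df).fit()
print(model.params.round(2))
```

The fitted coefficients should land close to the true simulation values (intercept 3, main effects 2 and 2.5, interaction 4), without having to build the product column by hand.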
# Compare model with and without interaction term
def plot_actual_vs_predicted_combined(y_test, y_pred1, y_pred2, title1, title2):
    plt.figure(figsize=(10, 6))
    # Use color (not edgecolors) so the two models are visually distinct
    plt.scatter(y_test, y_pred1, color='blue', label=title1, alpha=0.6)
    plt.scatter(y_test, y_pred2, color='red', label=title2, alpha=0.6)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.title('Actual vs Predicted User Time Spent')
    plt.legend()
    plt.show()
plot_actual_vs_predicted_combined(y_test, y_pred, y_pred_with_interaction, 'Model Without Interaction Term', 'Model With Interaction Term')
Output:
The improvement in the model's performance with the interaction term demonstrates that adding interaction terms can substantially enhance a model's explanatory power. This example highlights how interaction terms capture information that is not apparent from the main effects alone. In practice, considering interaction terms in regression models can lead to more accurate and insightful predictions.
In this blog, we first generated a synthetic dataset to simulate user behavior on an e-commerce platform. We then constructed two regression models: one without interaction terms and one with interaction terms. By comparing their performance, we demonstrated the significant impact of interaction terms on the accuracy of the model.
Check out the full code and resources on GitHub.
Q1. What are interaction terms in regression?
A. They are variables created by multiplying two or more independent variables. They are used to capture the combined effect of these variables on the dependent variable, which can provide a more nuanced understanding of the relationships in the data.
Q2. When should you use interaction terms?
A. You should consider using interaction terms when you suspect that the effect of one independent variable on the dependent variable depends on the level of another independent variable. For example, if you believe that the impact of adding items to the cart on the time spent on an e-commerce platform depends on whether the user makes a purchase, you should include an interaction term between these variables.
Q3. How do you interpret the coefficient of an interaction term?
A. The coefficient of an interaction term represents the change in the effect of one independent variable on the dependent variable for a one-unit change in another independent variable. In our example above, with an interaction term between purchased and added_in_cart, the coefficient tells us how the effect of adding items to the cart on time spent changes when a purchase is made.
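Using the true coefficients from our simulation (intercept 3, main effects 2 and 2.5, interaction 4), the model-implied mean time spent for each user profile works out as follows — a small sketch of the interpretation, not fitted output:

```python
# True coefficients used to simulate the data
b0, b_purch, b_cart, b_inter = 3.0, 2.0, 2.5, 4.0

def expected_time(purchased, added_in_cart):
    """Model-implied mean time spent for a given user profile."""
    return (b0 + b_purch*purchased + b_cart*added_in_cart
            + b_inter*purchased*added_in_cart)

print(expected_time(0, 0))  # neither action           -> 3.0
print(expected_time(1, 0))  # purchase only            -> 5.0
print(expected_time(0, 1))  # added to cart only       -> 5.5
print(expected_time(1, 1))  # both: 3 + 2 + 2.5 + 4    -> 11.5
```

Adding to the cart raises expected time spent by 2.5 when no purchase is made, but by 2.5 + 4 = 6.5 when the user also purchases — that extra 4 is exactly the interaction coefficient.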
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.