Granger Causality in Time Series – Explained using Chicken and Egg problem

Pallavi Last Updated : 12 Nov, 2024

9 min read

This article was published as a part of the Data Science Blogathon

Introduction

The purpose of this article is to understand what is granger causality and its application in Time series forecasting for better prediction.

Image Source: https://www.wallpaperflare.com

Overview of Time series:

Time series is a set of observations on the values that the variable takes at different times. The time interval at which the data is collected may be hourly, daily, weekly, monthly, quarterly, annually, etc.

Time series data can be categorized as follows:

Univariate time series – Consists of a single variable that changes over time. Eg Account balance over a span of ten years.
Multivariate time series – Consists of more than one variable which varies over time and there may be dependency among the variables. Eg Different expenditures of a family over a span of ten years.

Time series analysis provides insight into the pattern or characteristics of time series data. Time series data can be decomposed into three components:

Trend – This shows the tendency of the data over a long period of time, it may be upward, downward, or stable.
Seasonality – It is the variation that occurs in a periodic manner and repeats each year.
Noise or Random – Fluctuations in the data that are erratic.

Image Source: https://www.researchgate.net

Forecasting the data for a particular time in the future is known as Time Series Forecasting. It is one of the powerful machine learning models widely used in the fields like finance, weather forecasting, the health sector, environmental studies, business, retail, etc to arrive at strategic decisions.

In real-time, most of the data consists of multiple variables, in which independent variables might depend on other independent variables, these relations will have an impact on the predictions or forecasting. Most of the time people generally get misled and build multilinear regression models in such cases. A High R square value will further mislead and make a poor prediction.

Spurious Regression

Linear regression might indicate a strong relationship between two or more variables, but these variables may be totally unrelated in reality. Predictions fail when it comes to domain knowledge, this scenario is known as spurious regression.

There is a strong relationship between chicken consumption and crude oil exports in the below graph even though they are unrelated.

Source: http://www.tylervigen.com/spurious-correlations

Strong trend / nonstationary and higher R square are observed in spurious regression. Spurious regression has to be eliminated while building the model since they are unrelated and have no causal relationship.

Multilinear regression makes use of a correlation matrix to check the dependency between all the independent variables. If the correlation coefficient value is high between two variables any one variable is retained and another one is discarded to remove the dependency. In the dataset when time is a confounding factor multilinear regression fails, the correlation coefficient which is used to eliminate the variable is not time-bound instead it just gives the correlation between the two variables.

trend Granger Causality — Image Source: https://en.wikipedia.org/wiki/Granger_causality

Consider the above time-series graph, variable X has a direct influence on variable Y but there is a lag of 5 between X and Y in which case we cant use the correlation matrix. For eg. an increase in coronavirus positive cases in the city and an increase in the number of people getting hospitalized. For better forecasting, here we would like to know if there is a causal relationship.

Image Source: https://clockwise.software/blog/solve-the-chicken-and-egg-problem/

Granger Causality comes to Rescue

Prof. Clive W.J. Granger, recipient of the 2003 Nobel Prize in Economics developed the concept of causality to improve the performance of forecasting.

It is basically an econometric hypothetical test for verifying the usage of one variable in forecasting another in multivariate time series data with a particular lag.

Granger causality is a statistical concept used to determine whether one time series can predict the future values of another time series. It measures the extent to which the past values of one variable provide valuable information for forecasting the other variable’s future behavior. Granger causality helps analyze potential relationships between variables, aiding in predicting trends. However, it’s important to note that Granger causality doesn’t prove direct causation and should be interpreted cautiously when evaluating causal relationships between variables.

A prerequisite for performing the Granger Causality test is that the data need to be stationary i.e it should have a constant mean, constant variance, and no seasonal component. Transform the non-stationary data to stationary data by differencing it, either first-order or second-order differencing. Do not proceed with the Granger causality test if the data is not stationary after second-order differencing.

Let us consider three variables X_t , Y_t , and W_tpreset in time series data.

Case 1: Forecast X_t+1 based on past values X_t.

Case 2: Forecast X_t+1 based on past values X_t and Y_t.

Case3 : Forecast X_t+1 based on past values X_t , Y_t , and W_t, where variable Y_thas direct dependency on variable W_t.

Here Case 1 is univariate time series also known as the autoregressive model in which there is a single variable and forecasting is done based on the same variable lagged by say order p.

The equation for the Auto-regressive model of order p (RESTRICTED MODEL, RM)

X_t = α + 𝛾₁ X_𝑡−1 + 𝛾₂X_𝑡−2 + ⋯ + 𝛾_𝑝X_𝑡−𝑝

where p parameters (degrees of freedom) to be estimated.

In Case 2 the past values of Y contain information for forecasting X_t+1. Y_t is said to “Granger cause” X_t+1 provided Y_t occurs before X_t+1and it contains data for forecasting X_t+1.

Equation using a predictor Y_t (UNRESTRICTED MODEL, UM)

X_t = α + 𝛾₁ X_𝑡−1 + 𝛾₂X_𝑡−2 + ⋯ + 𝛾_𝑝X_𝑡−𝑝+ α₁Y_t-1+ ⋯ + α_𝑝Y_t-p

2p parameters (degrees of freedom) to be estimated.

If Y_t causes X_t, then Y must precede X which implies:

Lagged values of Y should be significantly related to X.

Lagged values of X should not be significantly related to Y.

Case 3 can not be used to find Granger causality since variable Y_t is influenced by variable W_t.

Hypothesis test

Null Hypothesis (H₀) : Y_t does not “Granger cause” X_t+1 i.e., 𝛼₁ = 𝛼₂ = ⋯ = 𝛼_𝑝 = 0

Alternate Hypothesis(H_A): Y_t does “Granger cause” X_t+1, i.e., at least one of the lags of Y is significant.

Calculate the f-statistic

F_{p,n-2𝑝−1} = (𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑒 𝑜𝑓 𝐸𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒) / (𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑒 𝑜𝑓 𝑈𝑛𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒)

F_{p,n-2𝑝−1} = ( (𝑆𝑆𝐸_𝑅𝑀−𝑆𝑆𝐸_𝑈𝑀) /𝑝) /(𝑆𝑆𝐸_𝑈𝑀 /𝑛−2𝑝−1)

where n is the number of observations and
SSE is Sum of Squared Errors.

If the p-values are less than a significance level (0.05) for at least one of the lags then reject the null hypothesis.

Perform test for both the direction X_t->Y_t and Y_t->X_t.

Try different lags (p). The optimal lag can be determined using AIC.

Limitation

Granger causality does not provide any insight on the relationship between the variable hence it is not true causality unlike ’cause and effect’ analysis.
Granger causality fails to forecast when there is an interdependency between two or more variables (as stated in Case 3).
Granger causality test can’t be performed on non-stationary data.

Resolving Chicken and Egg problem

Let us apply Granger causality to check whether the egg came first or chicken came first.

Importing libraries

import matplotlib.pyplot as plt
 import seaborn as sns
 import numpy as np
 import pandas as pd

Loading Dataset

Data is from the U.S. Department of Agriculture. It consists of two-time series variables from 1930 to 1983, one of U.S. egg production and the other the estimated U.S. chicken population.

df = pd.read_csv('chickegg.csv')

Exploring the Dataset

df.head()

df.dtypes

df.shape

(53, 3)

df.describe()

Check if the data is stationary, if not make it stationary to proceed.

# Draw Plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()
plot_df(df, x=df.Year, y=df.chicken, title='Polulation of the chicken across US')   
plot_df(df, x=df.Year, y=df.egg, title='Egg Produciton')

By visual inspection, both the chicken and egg data are not stationary. Let us confirm this by running Augmented Dickey-Fuller Test (ADF Test).

Augmented Dickey-Fuller Test (ADF test)

ADF test is a popular statistical test for checking whether the Time Series is stationary or not which works based on the unit root test. The number of unit roots present in the series indicates the number of differencing operations that are required to make it stationary

Consider the hypothesis test where:

Null Hypothesis (H₀): Series has a unit root and is non-stationary.

Alternative Hypothesis (H_A): Series has no unit root and is stationary.

from statsmodels.tsa.stattools import adfuller
result = adfuller(df['chicken'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')

if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

result = adfuller(df['egg'])

print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

p-values of both the egg and chicken variables are greater than the significant value (0.05), the Null hypothesis is valid and the series is not stationary.

Data Transformation

Granger causality test is carried out only on stationary data hence we need to transform the data by differencing it to make it stationary. Let us perform the first-order differencing on chicken and egg data.

df_transformed = df.diff().dropna()
df = df.iloc[1:]
print(df.shape)
df_transformed.shape

df_transformed.head()

plot_df(df_transformed, x=df.Year, y=df_transformed.chicken, title='Polulation of the chicken across US')   
plot_df(df_transformed, x=df.Year, y=df_transformed.egg, title='Egg Produciton')

Repeat the ADF test again on differenced data to check for stationarity.

result = adfuller(df_transformed['chicken'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

result = adfuller(df_transformed['egg'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

Transformed chicken and egg data are stationary, hence there is no need to go for second-order differencing.

Test the Granger Causality

There are several ways to find the optimal lag but for simplicity let’s consider 4th lag as of now.

Do eggs granger cause chickens?

Null Hypothesis (H₀) : eggs do not granger cause chicken.

Alternative Hypothesis (H_A) : eggs granger cause chicken.

from statsmodels.tsa.stattools import grangercausalitytests

grangercausalitytests(df_transformed[['chicken', 'egg']], maxlag=4)

p-value is very low, Null hypothesis is rejected hence eggs are granger causing chicken.

That implies eggs came first.

Now repeat the Granger causality test in the opposite direction.

Do chickens granger cause eggs at lag 4?

Null Hypothesis (H₀) : chicken does not granger cause eggs.

Alternative Hypothesis (H_A) : chicken granger causes eggs.

grangercausalitytests(df_transformed[['egg', 'chicken']], maxlag=4)

The p-value is considerably high thus chickens do not granger cause eggs.

The above analysis concludes that the egg came first and not the chicken.

Once the analysis is done the next step is to begin forecasting using time series forecasting models.

Frequently Asked Questions

Q1. How do you explain Granger causality?

A. Granger causality is a statistical concept used to determine if one time series can predict another’s future values. If a time series A Granger-causes another time series B, it means that past values of A provide useful information in predicting future values of B, beyond what B’s own past values provide. This concept is widely applied in econometrics, neuroscience, and more. However, Granger causality doesn’t imply true causation; it only suggests a predictive relationship. Both variables’ behavior and external factors should be considered to draw meaningful conclusions about causal relationships.

Q2. What is Granger causality of two variables?

A. Granger causality between two variables suggests that the past values of one variable provide predictive information about the future values of another variable, beyond the information contained in its own past values. It’s a statistical approach to assess temporal relationships but doesn’t imply direct causation; rather, it indicates whether one variable helps predict the other.

EndNote

Whenever you come across time-bound data having multiple variables be suspicious about high R² and possible spurious regression. Make use of the time series forecasting for better performance. Check for bidirectional Granger causality between each variable and eliminate the variable based on test results before proceeding with forecasting techniques. I hope you enjoyed the article and gained insight into the Granger causality concept. Please drop your suggestions or queries in the comment section.

Thank you for reading!

About Author

Hello, Pallavi Padav from Mangalore holding PG degree in Data Science from INSOFE. Passionate about participating in Data Science hackathons, blogathons and workshops.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Pallavi

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Granger Causality in Time Series – Explained using Chicken and Egg problem

Introduction

Overview of Time series:

Spurious Regression

Granger Causality comes to Rescue

Hypothesis test

Calculate the f-statistic

Limitation

Resolving Chicken and Egg problem

Importing libraries

Loading Dataset

Exploring the Dataset

Augmented Dickey-Fuller Test (ADF test)

Data Transformation

Test the Granger Causality

Frequently Asked Questions

EndNote

About Author

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk