Random Forest is a popular supervised machine learning algorithm. It is an ensemble method based on bagging (bootstrap aggregating): it builds many decision trees at training time, each on a bootstrap sample of the data, and outputs the mode of the trees' predicted classes (classification) or the mean of their predictions (regression). It can therefore be used for both classification and regression problems. It can also be used for time series analysis and forecasting, on both univariate and multivariate datasets, by manually creating lag variables and seasonality variables. When evaluating a Random Forest forecast, error metrics such as MAE, RMSE, and MAPE give insight into its effectiveness.
No algorithm works best for every dataset, so it is worth trying several and choosing the one that suits your data. I have tried various time series models for forecasting, including ARIMA, SARIMA, ETS, LSTM (deep learning), Random Forest, XGBoost, and fbprophet, and each worked best for one category of data or another. Random Forest, XGBoost, and fbprophet outperformed the others on multivariate and intermittent data.
Intermittent demand data has a highly irregular pattern: the value is non-zero in periods when there is demand and zero when there is none. It typically arises in customer demand or sales data for an item that does not sell in every period.
In this tutorial, you will learn how to develop a Random Forest model for time series forecasting.
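To make the idea of intermittent demand concrete, here is a minimal sketch that measures how intermittent a series is. The `demand` values are made up for illustration, and the two summary numbers (share of zero periods and average demand interval) are common descriptive choices, not something computed elsewhere in this tutorial.

```python
import numpy as np

# Hypothetical monthly demand for a slow-moving item (illustrative values only)
demand = np.array([0, 0, 5, 0, 0, 0, 3, 0, 7, 0, 0, 2])

# Share of periods with no demand at all
zero_share = np.mean(demand == 0)

# Average demand interval: total periods divided by periods with non-zero demand
nonzero_periods = np.count_nonzero(demand)
adi = len(demand) / nonzero_periods

print(f"zero share: {zero_share:.2f}")
print(f"average demand interval: {adi:.2f}")
```

A high zero share and a long average interval both suggest that the series is intermittent rather than smooth.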
Let’s get started.
Problem: Forecast demand for a jeans brand for the coming 6 months.
Data: We have monthly sales quantity available for 2 years (from May 2019 to May 2021) in the CSV file.
import pandas as pd
from sklearn.feature_selection import RFE
# import random forest regression from scikit-learn to perform regression model
from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame
import numpy as np
from datetime import timedelta
import calendar
jeans_data = pd.read_csv('jeans_data.csv')
jeans_data.head()
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test on the sales series (null hypothesis: the series has a unit root)
result = adfuller(jeans_data.SaleQty.dropna())
print('p-value: %f' % result[1])
p-value: 0.024419
Since the p-value is below 0.05, we reject the null hypothesis of a unit root and treat the data as stationary, so we can proceed without any transformation.
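Had the p-value been above 0.05, a common remedy would be to difference the series before creating lags. The following is a minimal sketch of that idea with pandas, on a made-up trending series. The values are illustrative, and the last lines show how values on the differenced scale can be converted back to the original scale.

```python
import pandas as pd

# Hypothetical trending series (illustrative values only)
sales = pd.Series([100, 104, 109, 115, 118, 126, 131, 139], dtype=float)

# First difference removes a linear trend; the first value becomes NaN and is dropped
diffed = sales.diff().dropna()

# To return to the original scale, cumulatively sum the differences
# and add back the first observed value
restored = diffed.cumsum() + sales.iloc[0]

print(diffed.tolist())
print(restored.tolist())
```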
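dataframe = DataFrame()
# create 12 lag columns: t-12 (oldest) to t-1 (most recent)
for i in range(12, 0, -1):
    dataframe['t-' + str(i)] = jeans_data.SaleQty.shift(i)
final_data = pd.concat([jeans_data, dataframe], axis=1)
# the first 12 rows have incomplete lags, so drop them
final_data.dropna(inplace=True)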
You can use any value in place of 12, depending on your time interval and the number of lags you want to create. A full year of lags is a reasonable starting point (12 for monthly data, 52 for weekly data); you can limit the number of independent variables later. These lag variables are used for autoregressive prediction.
Create a variable that takes a different value for each month. This adds a seasonal component to the model, which may help improve the forecast.
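The lag construction can also be wrapped in a small reusable function. This is a minimal sketch on a toy series; the name `make_lags` and the column name `y` are illustrative and not part of the tutorial's code.

```python
import pandas as pd

def make_lags(series, n_lags):
    """Return a DataFrame with the target and columns t-n_lags ... t-1."""
    frame = pd.DataFrame({'y': series})
    for i in range(n_lags, 0, -1):
        frame['t-' + str(i)] = series.shift(i)
    # the first n_lags rows have incomplete history, so drop them
    return frame.dropna().reset_index(drop=True)

# Hypothetical series 1..8 so the shifts are easy to follow
toy = pd.Series(range(1, 9), dtype=float)
lagged = make_lags(toy, n_lags=3)
print(lagged)
```

Each remaining row holds the current value together with the three values that preceded it, which is what turns the series into a supervised learning problem.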
final_data['date'] = pd.to_datetime(final_data['date'], format='%Y-%m-%d')
final_data['month'] = final_data['date'].dt.month
Or we can add dummy variables for each month:
# the prefix keeps all column names as strings (month_1, month_2, ...)
dummy = pd.get_dummies(final_data['month'], prefix='month')
final_data = pd.concat([final_data, dummy], axis=1)
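One thing to watch with dummy variables is that a short window of data may not contain every month, so the training data and the future rows can end up with different dummy columns. A possible way around this, sketched below, is to fix the 12 categories up front. The helper name `month_dummies` is hypothetical.

```python
import pandas as pd

def month_dummies(months):
    """One-hot encode month numbers with a fixed set of 12 columns."""
    dtype = pd.CategoricalDtype(categories=list(range(1, 13)))
    return pd.get_dummies(months.astype(dtype), prefix='month')

# Hypothetical month column covering only part of the year
months = pd.Series([5, 6, 7, 5])
dummies = month_dummies(months)
print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())
```

Because the categories are declared explicitly, months that never appear still get an all-zero column, and the feature set stays the same between training and prediction.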
We will take the most recent 6 months’ subset of data as the test data and the rest of the data as the training dataset.
finaldf = final_data.drop(['date'], axis=1)
finaldf = finaldf.reset_index(drop=True)
test_length=6
end_point = len(finaldf)
x = end_point - test_length
finaldf_train = finaldf.loc[:x - 1, :]
finaldf_test = finaldf.loc[x:, :]
finaldf_test_x = finaldf_test.loc[:, finaldf_test.columns != 'SaleQty']
finaldf_test_y = finaldf_test['SaleQty']
finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
finaldf_train_y = finaldf_train['SaleQty']
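The same chronological split can be written as a small helper. This is a sketch on a toy frame; the function name `time_split` is illustrative. The rows are not shuffled, since the test set has to come after the training set in time.

```python
import pandas as pd

def time_split(frame, target, test_length):
    """Split chronologically: the last test_length rows become the test set.

    test_length must be greater than zero.
    """
    train = frame.iloc[:-test_length]
    test = frame.iloc[-test_length:]
    return (train.drop(columns=[target]), train[target],
            test.drop(columns=[target]), test[target])

# Hypothetical frame with one lag feature and a target
toy = pd.DataFrame({'SaleQty': range(10), 't-1': range(-1, 9)})
train_x, train_y, test_x, test_y = time_split(toy, 'SaleQty', test_length=3)
print(len(train_x), len(test_x))
```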
print("Starting model train..")
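# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)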
fit = rfe.fit(finaldf_train_x, finaldf_train_y)
y_pred = fit.predict(finaldf_test_x)
I have used RFE (recursive feature elimination) to limit the number of independent variables (features) to 4. You can change this value and choose the one that gives the least error. I have set n_estimators (the number of trees in the forest) to 100, which is the default value. Other hyperparameters can be tuned as well.
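Choosing the number of features by error can be sketched as a simple loop. The example below uses synthetic data instead of the jeans data, and the candidate values (2, 4, 6) are arbitrary. Selecting on the same hold-out set that is later used for reporting makes the reported error optimistic, so a separate validation window would be cleaner in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data (illustrative only): 8 features, the target depends on the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=80)

# Keep the last 20 rows as the hold-out set, mimicking a chronological split
X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

errors = {}
for k in (2, 4, 6):
    model = RFE(RandomForestRegressor(n_estimators=50, random_state=1),
                n_features_to_select=k)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    errors[k] = float(np.mean(np.abs(y_test - pred)))  # mean absolute error

best_k = min(errors, key=errors.get)
print(errors, best_k)
```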
y_true = np.array(finaldf_test_y)
sumvalue = np.sum(y_true)
# total absolute error as a percentage of total actual sales
mape = np.sum(np.abs(y_true - y_pred)) / sumvalue * 100
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')
# we can also use other metrics like MAE or RMSE
Accuracy: 89.42 %.
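The other metrics mentioned in the comment can be computed in a few lines of NumPy. This is a minimal sketch with made-up numbers; the function name `forecast_errors` is illustrative. The percentage error used above divides the total absolute error by the total actual sales (often called WAPE), which stays defined even when individual months have zero sales.

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Return MAE, RMSE, and total-absolute-error percentage (WAPE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    mae = abs_err.mean()
    rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
    wape = abs_err.sum() / y_true.sum() * 100
    return {'mae': mae, 'rmse': rmse, 'wape': wape}

# Hypothetical actuals and forecasts, including a zero-demand period
metrics = forecast_errors([100, 0, 50, 50], [90, 10, 60, 40])
print(metrics)
```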
We will now predict the sale quantity for the next 6 months. The lags are null for future dates, so we have to predict one month at a time and use each predicted sale to create the lag for the following month's prediction, and so on. Note that the predicted sales are used only to create lag variables; the model is not retrained.
# predictive modelling
def create_lag(df3):
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4

yhat = []
n = 6
future_dataframe = jeans_data[['SaleQty', 'date']].copy()
future_dataframe['date'] = pd.to_datetime(future_dataframe['date'], format='%Y-%m-%d')
x = future_dataframe['date'].iloc[-1]
# append n placeholder rows (SaleQty = 0), one per future month
future_rows = DataFrame({'SaleQty': 0.0,
                         'date': [x + pd.DateOffset(months=i + 1) for i in range(n)]})
future_dataframe = pd.concat([future_dataframe, future_rows], ignore_index=True)
future_dataframe['month'] = future_dataframe['date'].dt.month
future_dataframe = future_dataframe.drop(['date'], axis=1)
future_dataframe_end = len(future_dataframe)
finaldf = create_lag(future_dataframe)
finaldf = finaldf.reset_index(drop=True)
end_point = len(finaldf)
for i in range(n, 0, -1):
    y = end_point - i
    inputfile = finaldf.loc[y:end_point, :]
    inputfile_x = inputfile.loc[:, inputfile.columns != 'SaleQty']
    # use the same columns, in the same order, as in training
    # (assumes the model was trained with the numeric month column, not the dummies)
    pred_set = inputfile_x.head(1)[finaldf_train_x.columns]
    pred = fit.predict(pred_set)
    # write the prediction back so that it becomes a lag for the next month
    future_dataframe.at[future_dataframe.index[future_dataframe_end - i], 'SaleQty'] = pred[0]
    finaldf = create_lag(future_dataframe)
    finaldf = finaldf.reset_index(drop=True)
    yhat.append(pred)
predicted_value = np.array(yhat)
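Stripped of the DataFrame bookkeeping, the recursive strategy used here looks like the sketch below. The `toy_model` function is a hypothetical stand-in for a trained model (it simply averages the lags), so the numbers are only there to show how each prediction is fed back in as the newest lag.

```python
import numpy as np

def recursive_forecast(history, n_lags, steps, predict_one):
    """Forecast `steps` periods ahead, feeding each prediction back as a lag."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        lags = np.array(values[-n_lags:])   # most recent n_lags values, oldest first
        next_value = predict_one(lags)      # the model is not refit here
        forecasts.append(next_value)
        values.append(next_value)           # the prediction becomes the newest lag
    return forecasts

# Hypothetical stand-in for a trained model: the mean of the lags
def toy_model(lags):
    return float(lags.mean())

print(recursive_forecast([2.0, 4.0, 6.0], n_lags=2, steps=3, predict_one=toy_model))
# prints [5.0, 5.5, 5.25]
```

Because later steps depend on earlier predictions, errors can accumulate as the horizon grows, which is worth keeping in mind when forecasting many periods ahead.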
You can add any other independent variables available like promotions, special_days, weekends, start_of_month, etc.
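Calendar-based variables of this kind can be derived directly from the date column. The sketch below assumes daily data and uses illustrative column names. For monthly data such flags would first need to be aggregated, for example into the number of weekend days in each month.

```python
import pandas as pd

# Hypothetical daily dates; the feature names below are illustrative
frame = pd.DataFrame({'date': pd.to_datetime(['2021-05-01', '2021-05-02', '2021-05-03'])})

# Saturday is 5 and Sunday is 6 in pandas' dayofweek numbering
frame['weekend'] = (frame['date'].dt.dayofweek >= 5).astype(int)
# 1 on the first day of a month, 0 otherwise
frame['start_of_month'] = frame['date'].dt.is_month_start.astype(int)
print(frame)
```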
Find below the complete code:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame
import numpy as np

def add_month(df, forecast_length, forecast_period):
    # append forecast_length placeholder rows (SaleQty = 0) after the last observed date
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    x = df['date'].iloc[-1]
    if forecast_period == 'Week':
        dates = [x + pd.DateOffset(weeks=i + 1) for i in range(forecast_length)]
    elif forecast_period == 'Month':
        dates = [x + pd.DateOffset(months=i + 1) for i in range(forecast_length)]
    else:
        raise ValueError("forecast_period must be 'Week' or 'Month'")
    df1 = DataFrame({'SaleQty': 0.0, 'date': dates})
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    df = pd.concat([df, df1], ignore_index=True)
    df['month'] = df['date'].dt.month
    df = df.drop(['date'], axis=1)
    return df

def create_lag(df3):
    # add lag columns t-12 ... t-1 and drop rows with incomplete history
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4

def randomForest(df1, forecast_length, forecast_period):
    df3 = df1[['SaleQty', 'date']]
    df3 = add_month(df3, forecast_length, forecast_period)
    finaldf = create_lag(df3)
    finaldf = finaldf.reset_index(drop=True)
    n = forecast_length
    end_point = len(finaldf)
    x = end_point - n
    # train on every row except the n placeholder rows at the end
    finaldf_train = finaldf.loc[:x - 1, :]
    finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
    finaldf_train_y = finaldf_train['SaleQty']
    print("Starting model train..")
    rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)
    fit = rfe.fit(finaldf_train_x, finaldf_train_y)
    print("Model train completed..")
    print("Creating forecasted set..")
    yhat = []
    df3_end = len(df3)
    for i in range(n, 0, -1):
        y = end_point - i
        inputfile = finaldf.loc[y:end_point, :]
        inputfile_x = inputfile.loc[:, inputfile.columns != 'SaleQty']
        pred_set = inputfile_x.head(1)
        pred = fit.predict(pred_set)
        # write the prediction back so that it becomes a lag for the next period
        df3.at[df3.index[df3_end - i], 'SaleQty'] = pred[0]
        finaldf = create_lag(df3)
        finaldf = finaldf.reset_index(drop=True)
        yhat.append(pred)
    yhat = np.array(yhat)
    print("Forecast complete..")
    return yhat

jeans_data = pd.read_csv('jeans_data.csv')
predicted_value = randomForest(jeans_data, 6, 'Month')
The random forest ensemble method bootstraps observations by randomly sampling the training set, so the order of the data points is not preserved. As a result, it may not perform well on many time series, but it does perform well on intermittent data because it captures the probability of demand (or a sale) for a rarely selling product well.
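The bootstrapping step can be illustrated with a few lines of NumPy. This is only a sketch of the sampling idea, not of scikit-learn's internals; the seed and the number of rows are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows = 10

# One bootstrap sample: row indices drawn with replacement, in no particular order
sample = rng.integers(0, n_rows, size=n_rows)
print(sample)

# Rows never drawn are "out of bag" for this tree
out_of_bag = sorted(set(range(n_rows)) - set(sample.tolist()))
print(out_of_bag)
```

Each tree sees the rows in a shuffled, partly repeated order, so the only temporal information available to it is what each row carries in its own lag and month columns.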
Please let me know your queries and suggestions if any.
In this tutorial, we analyzed intermittent demand data with a random forest model. We imported the required packages, checked the data for stationarity, and created lag and seasonal variables. We then trained and evaluated the model, which gave satisfactory results (about 89% accuracy on the held-out 6 months), and generated predictions for future data points. The exercise shows how much careful data preparation and an appropriate choice of model matter for intermittent data. As next steps, you could try alternative models or add further variables to improve predictive accuracy.
A. In data science, the random forest algorithm can be adapted for time series prediction by using lagged observations as predictors. This approach, which involves creating a supervised learning task from univariate time series data, leverages the algorithm’s capacity for handling complex, non-linear relationships.
A. Time series forecasting methods predict future data points by analyzing historical patterns. These include ARIMA for series that can be made stationary by differencing, SARIMA for seasonal data, Exponential Smoothing for trend and seasonality, Prophet for series with strong seasonal and holiday effects, and machine learning models such as Random Forests and Neural Networks.
A. Advantages: handles non-linear relationships well, is fairly robust to overfitting, and provides feature importance estimates. Disadvantages: can be resource-intensive with many trees, is harder to interpret than a single decision tree, and is not ideal for very high-dimensional sparse data.
Thank you for sharing this. Where can I download the data you used? I'd like to run it with a different algorithm.
Thank you for sharing this. Could you let me know where the data is? I'd like to run it with a different algorithm.
Nice post, thank you! Question: at the beginning of your article, you state that "Random forest, XGBoost, and fbprophet outperformed for multivariate and intermittent data". However, I do not recall fbprophet working with multivariate data (only univariate data). So how did you manage to use it with explanatory variables?