This article was published as a part of the Data Science Blogathon
From this article, you will learn how to predict the monthly sales of all different items at different shops, given the historical daily sales data (with code!).
Sales forecasting is an integral part of making intelligent business decisions to save costs or increase profit. Good forecasting helps retailers maintain adequate inventory levels, reduce the risk of stock obsolescence, and supply the right product at the right time and location. For example, if the forecast indicates a 25% increase in sales of a product or service, the store can purchase that product ahead of time to meet the demand. Conversely, a forecast of a shortfall in sales allows the business to mitigate the impact by taking action in advance.
Building a good model from observed time series data is challenging because time series are usually non-stationary and continuously evolving. For instance, future sales can be affected by new product launches, promotions, seasonality, and other changes that make them very difficult to predict from past behavior alone. Furthermore, each time step provides only a single observation, so we typically need to add lag terms to frame the forecasting task as a supervised learning problem.
The dataset can be downloaded from here. The data files and fields are defined as follows:
sales_train.csv – training dataset. Daily historical data from January 2013 to October 2015.
test.csv – test dataset. The task is to predict the total sales for every item and shop in the next month, i.e., November 2015, for which date_block_num is 34.
item_categories.csv – information about the items’ categories.
shops.csv – information about the shops.
items.csv – information about the items or products.
1. Checking missing values: there are no missing values in any of the data files mentioned above.
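A quick sanity check, for example (assuming the dataframes are loaded with the names used in the full code below):

# Check for missing values in each file (all counts are zero for this dataset).
for name, df in [("train", train), ("test", test), ("items", items), ("shops", shops), ("cats", cats)]:
    print(name, df.isnull().sum().sum())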
2. Remove outliers: remove noisy values/obvious outliers by defining thresholds.
Defined thresholds: keep only records with item_price below 300,000 and item_cnt_day below 1,000, as in the sketch below.
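A minimal sketch of this step (the same thresholds are applied in the full cleaning code further down):

# Drop obvious outliers from the raw daily sales data.
train = train[(train.item_price < 300000) & (train.item_cnt_day < 1000)]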
3. Remove data instances with a negative price or item count: remove any rows where the item_price or item_cnt_day value is negative. These data points may correspond to refunds, but for simplicity I decided to remove them (see the snippet below).
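For example (the full code below filters on price; an analogous filter could be applied to item_cnt_day):

# Keep only rows with a positive price.
train = train[train.item_price > 0].reset_index(drop=True)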
4. Combine duplicated feature values into a single feature value: several shops appear to be duplicates of one another, possibly due to stores moving location or re-opening. For instance, the shops named “!Якутск Орджоникидзе, 56 фран” and “Якутск Орджоникидзе, 56” are treated as the same shop after pre-processing, as sketched below.
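In practice this is done by remapping the duplicate shop ids onto a single id (the id pairs come from the full cleaning code below):

# Merge duplicate shops by mapping each old shop_id onto the surviving one.
for old_id, new_id in [(0, 57), (1, 58), (10, 11)]:
    train.loc[train.shop_id == old_id, "shop_id"] = new_id
    test.loc[test.shop_id == old_id, "shop_id"] = new_id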
5. Feature construction: construct new features that capture important information.
Since the time series is non-stationary, the differencing features “delta_price” and “delta_revenue” are transformations that help stabilize, or at least reduce, the effect of trend and seasonality. A simplified sketch of the idea follows.
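A minimal sketch of the “delta” idea, assuming the per-item average price as the reference point (the full pipeline below uses lagged monthly averages, plus an analogous delta_revenue per shop):

# delta_price: current price relative to the item's overall average price.
item_avg_price = train.groupby("item_id")["item_price"].transform("mean")
train["delta_price"] = (train["item_price"] - item_avg_price) / item_avg_price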
6. Add lag features: a time series is a sequence of observations taken sequentially in time. To predict future observations, the model needs to learn from historical ones. The number of steps by which the data is shifted back in time is called the lag. A time series problem is therefore usually transformed into a supervised machine learning problem by adding lags as input features.
Combine all the features generated in step 5 and create a dataframe with every combination of (month, shop, item) in increasing month order. After that, delete the first three months of data from the dataframe, since they have no lag values. A minimal sketch of building a lag feature is shown below.
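The sketch below adds a one-month lag of item_cnt_month by shifting the month index and merging back. Here monthly is a placeholder name for the per-month aggregated dataframe (called matrix in the full code), which wraps this pattern in a reusable lag_feature function:

# Add a 1-month lag: the value observed in month t becomes a feature for month t+1.
lagged = monthly[["date_block_num", "shop_id", "item_id", "item_cnt_month"]].copy()
lagged["date_block_num"] += 1
lagged = lagged.rename(columns={"item_cnt_month": "item_cnt_month_lag_1"})
monthly = monthly.merge(lagged, on=["date_block_num", "shop_id", "item_id"], how="left")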
7. Label Encoding: all the categorical variables in my dataset, such as category, sub_category, and shop_city, have no natural order. However, I am using label encoding instead of one-hot encoding because I am going to use a tree-based model (XGBRegressor), which chooses splits based on how well they purify the child nodes, so the artificial ordering introduced by label encoding matters far less than it would for a linear model.
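A minimal example of this encoding with scikit-learn's LabelEncoder (the same pattern is applied to the other categorical columns in the full code):

from sklearn.preprocessing import LabelEncoder

# Map each city name to an integer code.
shops["shop_city"] = LabelEncoder().fit_transform(shops["city"])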
ML Model
The model I am going to use is the eXtreme Gradient Boosting Regressor, aka XGBRegressor. It is particularly popular because it has been the winning algorithm in several recent Kaggle competitions. It is an ensemble learning method that builds a final model by combining several weak models.
As the name suggests, a gradient boosting machine combines gradient descent with boosting. Boosting uses an iterative procedure that adaptively changes the distribution of the training data, focusing more on previously mispredicted records when building each base learner. It can be viewed as adding models sequentially until no further improvement can be made. Gradient descent is used to minimize the loss: in gradient boosting, where the predictions of multiple models are combined, the gradient is used to optimize the boosted model's prediction in each boosting round. A toy sketch of this sequential procedure is shown below.
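To make the sequential idea concrete, here is a toy sketch (not the code used in this article) of gradient boosting for regression with squared-error loss, where each new tree is fitted to the current residuals, i.e. the negative gradient:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1):
    """Fit a sequence of small trees, each one on the residuals of the current ensemble."""
    prediction = np.full(len(y), y.mean())   # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)  # shrink and add the weak learner
        trees.append(tree)
    return trees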
XGBoost is a particular implementation of a gradient boosting machine that uses more accurate approximations to find the best model. It improves on the gradient boosting framework through systems optimizations and algorithmic enhancements. Among these improvements are the use of second-order gradients, which need fewer steps to converge to the optimum, and regularization terms, which improve model generalization.
Train-val-test split
Since I am dealing with a time series dataset, splitting the data randomly would leak future information into training and lead to misleading results. Therefore, I selected months 4 to 32 (date_block_num) as the training set, month 33 as the validation set, and month 34 as the test set. Next, I train the model on the training data and tune hyperparameters using the validation set, as sketched below.
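In code, the split is done on date_block_num rather than at random (this mirrors the split used in the full training code further down):

# Time-based split: months 4-32 for training, month 33 for validation, month 34 for testing.
X_train = data[data.date_block_num < 33].drop(["item_cnt_month"], axis=1)
Y_train = data[data.date_block_num < 33]["item_cnt_month"]
X_valid = data[data.date_block_num == 33].drop(["item_cnt_month"], axis=1)
Y_valid = data[data.date_block_num == 33]["item_cnt_month"]
X_test = data[data.date_block_num == 34].drop(["item_cnt_month"], axis=1)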
Training and Tuning hyperparameters
I have chosen 5 hyperparameters that usually have a high impact on performance: “max_depth”, “min_child_weight”, “subsample”, “colsample_bytree”, and “alpha”. The evaluation metric used to train the model is the root mean squared error (RMSE).
Tuning all combinations of the above hyperparameters at once would require running 4x4x4x4x5 = 1280 different models. With limited time and computational resources, I decided to tune these hyperparameters in pairs: first max_depth together with min_child_weight, then subsample together with colsample_bytree, then alpha, and finally eta. A sketch of one such pairwise search is shown below.
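An illustrative version of one pairwise grid search (this loop is a sketch, not the exact tuning script; it assumes the time-based split from the previous section and picks the pair with the lowest validation RMSE):

import itertools
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Grid search over max_depth and min_child_weight, scored by validation RMSE.
results = {}
for max_depth, min_child_weight in itertools.product([5, 9, 10, 14], [1, 5, 6, 10]):
    model = XGBRegressor(max_depth=max_depth, min_child_weight=min_child_weight,
                         n_estimators=1000, eta=0.1, seed=42)
    model.fit(X_train, Y_train,
              eval_set=[(X_valid, Y_valid)], eval_metric="rmse",
              early_stopping_rounds=20, verbose=False)
    rmse = np.sqrt(mean_squared_error(Y_valid, model.predict(X_valid)))
    results[(max_depth, min_child_weight)] = rmse

best_pair = min(results, key=results.get)

The other pairs (subsample with colsample_bytree, then alpha, then eta) are tuned with the same pattern.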
Table 1: The performance on the training set and validation set by varying values of max_depth and min_child_weight in XGBRegressor (in RMSE)
| | RMSE on training set | RMSE on validation set |
| --- | --- | --- |
| max_depth = 5, min_child_weight = 1 | 0.85729 | 0.90985 |
| max_depth = 5, min_child_weight = 5 | 0.85209 | 0.90801 |
| max_depth = 5, min_child_weight = 6 | 0.84928 | 0.90789 |
| max_depth = 5, min_child_weight = 10 | 0.84632 | 0.90671 |
| max_depth = 9, min_child_weight = 1 | 0.80990 | 0.89646 |
| max_depth = 9, min_child_weight = 5 | 0.78785 | 0.89408 |
| max_depth = 9, min_child_weight = 6 | 0.80670 | 0.89383 |
| max_depth = 9, min_child_weight = 10 | 0.80193 | 0.89292 |
| max_depth = 10, min_child_weight = 1 | 0.78412 | 0.89490 |
| max_depth = 10, min_child_weight = 5 | 0.76146 | 0.89195 |
| max_depth = 10, min_child_weight = 6 | 0.78534 | 0.89334 |
| max_depth = 10, min_child_weight = 10 | 0.75548 | 0.89255 |
| max_depth = 14, min_child_weight = 1 | 0.67286 | 0.89773 |
| max_depth = 14, min_child_weight = 5 | 0.72487 | 0.89512 |
| max_depth = 14, min_child_weight = 6 | 0.72679 | 0.89628 |
| max_depth = 14, min_child_weight = 10 | 0.71907 | 0.89314 |
We get the best score with max_depth = 10 and min_child_weight = 5.
Table 2: The performance on the training set and validation set by varying values of subsample and colsample_bytree in XGBRegressor (in RMSE)
| | RMSE – training set | RMSE – validation set |
| --- | --- | --- |
| subsample = 1, colsample_bytree = 1 | 0.77881 | 0.89499 |
| subsample = 1, colsample_bytree = 0.8 | 0.80461 | 0.89839 |
| subsample = 1, colsample_bytree = 0.6 | 0.77588 | 0.89328 |
| subsample = 1, colsample_bytree = 0.3 | 0.79709 | 0.89300 |
| subsample = 0.8, colsample_bytree = 1 | 0.77680 | 0.89501 |
| subsample = 0.8, colsample_bytree = 0.8 | 0.76146 | 0.89195 |
| subsample = 0.8, colsample_bytree = 0.6 | 0.77924 | 0.89056 |
| subsample = 0.8, colsample_bytree = 0.3 | 0.79472 | 0.88637 |
| subsample = 0.6, colsample_bytree = 1 | 0.79843 | 0.89901 |
| subsample = 0.6, colsample_bytree = 0.8 | 0.78591 | 0.90145 |
| subsample = 0.6, colsample_bytree = 0.6 | 0.77927 | 0.88560 |
| subsample = 0.6, colsample_bytree = 0.3 | 0.79306 | 0.88662 |
| subsample = 0.3, colsample_bytree = 1 | 0.79265 | 0.89602 |
| subsample = 0.3, colsample_bytree = 0.8 | 0.81131 | 0.89944 |
| subsample = 0.3, colsample_bytree = 0.6 | 0.79335 | 0.89067 |
| subsample = 0.3, colsample_bytree = 0.3 | 0.78281 | 0.88789 |
We get the best score with subsample = 0.6 and colsample_bytree = 0.6.
Table 3: The performance on the training set and validation set by varying values of alpha in XGBRegressor (in RMSE)
| | RMSE – training set | RMSE – validation set |
| --- | --- | --- |
| alpha = 0.5 | 0.77921 | 0.89105 |
| alpha = 0.3 | 0.77943 | 0.88729 |
| alpha = 0.2 | 0.78550 | 0.89182 |
| alpha = 0.1 | 0.77927 | 0.88560 |
| alpha = 0.05 | 0.78821 | 0.89116 |
We get the best score with alpha = 0.1
Here is what the final model's parameters look like:
model = XGBRegressor(max_depth=10, n_estimators=1000, min_child_weight=5, subsample=0.6, colsample_bytree=0.6, alpha=0.1, eta=0.1, seed=42)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

# Load data
items = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/items.csv")
shops = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/shops.csv")
cats = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv")
train = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv")
test = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/test.csv")

# Data cleaning: visualize item_cnt_day and item_price with boxplots to spot outliers
plt.figure(figsize=(10, 4))
plt.xlim(-100, 3000)
flierprops = dict(marker='o', markerfacecolor='purple', markersize=6,
                  linestyle='none', markeredgecolor='black')
sns.boxplot(x=train.item_cnt_day, flierprops=flierprops)

plt.figure(figsize=(10, 4))
plt.xlim(train.item_price.min(), train.item_price.max() * 1.1)
sns.boxplot(x=train.item_price, flierprops=flierprops)

# Remove outliers
train = train[(train.item_price < 300000) & (train.item_cnt_day < 1000)]
# Remove negative item prices
train = train[train.item_price > 0].reset_index(drop=True)

# Cleaning shops data: merge duplicate shops
# Якутск Орджоникидзе, 56
train.loc[train.shop_id == 0, 'shop_id'] = 57
test.loc[test.shop_id == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
train.loc[train.shop_id == 1, 'shop_id'] = 58
test.loc[test.shop_id == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
train.loc[train.shop_id == 10, 'shop_id'] = 11
test.loc[test.shop_id == 10, 'shop_id'] = 11

# Clean up some shop names and add 'city' and 'category' to the shops df
shops.loc[shops.shop_name == 'Сергиев Посад ТЦ "7Я"', "shop_name"] = 'СергиевПосад ТЦ "7Я"'
shops["city"] = shops.shop_name.str.split(" ").map(lambda x: x[0])
shops["category"] = shops.shop_name.str.split(" ").map(lambda x: x[1])
shops.loc[shops.city == "!Якутск", "city"] = "Якутск"

# Only keep a shop category if there are 5 or more shops of that category; group the rest as "other"
category = []
for cat in shops.category.unique():
    if len(shops[shops.category == cat]) >= 5:
        category.append(cat)
shops.category = shops.category.apply(lambda x: x if (x in category) else "other")

# Label encoding of shop category and city
from sklearn.preprocessing import LabelEncoder
shops["shop_category"] = LabelEncoder().fit_transform(shops.category)
shops["shop_city"] = LabelEncoder().fit_transform(shops.city)
shops = shops[["shop_id", "shop_category", "shop_city"]]

# Cleaning item category data
cats["type_code"] = cats.item_category_name.apply(lambda x: x.split(" ")[0]).astype(str)
cats.loc[(cats.type_code == "Игровые") | (cats.type_code == "Аксессуары"), "category"] = "Игры"
category = []
for cat in cats.type_code.unique():
    if len(cats[cats.type_code == cat]) >= 5:
        category.append(cat)
cats.type_code = cats.type_code.apply(lambda x: x if (x in category) else "etc")

# Label encoding of item category type and subtype
cats.type_code = LabelEncoder().fit_transform(cats.type_code)
cats["split"] = cats.item_category_name.apply(lambda x: x.split("-"))
cats["subtype"] = cats.split.apply(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
cats["subtype_code"] = LabelEncoder().fit_transform(cats["subtype"])
cats = cats[["item_category_id", "subtype_code", "type_code"]]

# Cleaning item data
import re

def name_correction(x):
    x = x.lower()                              # all letters lower case
    x = x.partition('[')[0]                    # keep the part before any square bracket
    x = x.partition('(')[0]                    # keep the part before any round bracket
    x = re.sub('[^A-Za-z0-9А-Яа-я]+', ' ', x)  # remove special characters
    x = x.replace('  ', ' ')                   # replace double spaces with single spaces
    x = x.strip()                              # remove leading and trailing whitespace
    return x

# Cleaning item names
# Split item names by the first bracket
items["name1"], items["name2"] = items.item_name.str.split("[", 1).str
items["name1"], items["name3"] = items.item_name.str.split("(", 1).str
# Replace special characters and convert to lower case
items["name2"] = items.name2.str.replace('[^A-Za-z0-9А-Яа-я]+', " ").str.lower()
items["name3"] = items.name3.str.replace('[^A-Za-z0-9А-Яа-я]+', " ").str.lower()
# Fill nulls with '0'
items = items.fillna('0')
items["item_name"] = items["item_name"].apply(lambda x: name_correction(x))
# Return all characters except the last if name2 is not "0" - drops the closing bracket
items.name2 = items.name2.apply(lambda x: x[:-1] if x != "0" else "0")

# Clean item type
items["type"] = items.name2.apply(lambda x: x[0:8] if x.split(" ")[0] == "xbox" else x.split(" ")[0])
items.loc[(items.type == "x360") | (items.type == "xbox360") | (items.type == "xbox 360"), "type"] = "xbox 360"
items.loc[items.type == "", "type"] = "mac"
items.type = items.type.apply(lambda x: x.replace(" ", ""))
items.loc[(items.type == 'pc') | (items.type == 'pс') | (items.type == "pc"), "type"] = "pc"
items.loc[items.type == 'рs3', "type"] = "ps3"

# Group rare name2 values (types with fewer than 40 items) as "other"
group_sum = items.groupby(["type"]).agg({"item_id": "count"})
group_sum = group_sum.reset_index()
drop_cols = []
for cat in group_sum.type.unique():
    if group_sum.loc[(group_sum.type == cat), "item_id"].values[0] < 40:
        drop_cols.append(cat)
items.name2 = items.name2.apply(lambda x: "other" if (x in drop_cols) else x)
items = items.drop(["type"], axis=1)
items.name2 = LabelEncoder().fit_transform(items.name2)
items.name3 = LabelEncoder().fit_transform(items.name3)
items.drop(["item_name", "name1"], axis=1, inplace=True)
items.head()

# Preprocessing: create a matrix df with every combination of month, shop and item
# in order of increasing month; item_cnt_day is summed into item_cnt_month
from itertools import product

matrix = []
cols = ["date_block_num", "shop_id", "item_id"]
for i in range(34):
    sales = train[train.date_block_num == i]
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype=np.int16))

matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix["date_block_num"] = matrix["date_block_num"].astype(np.int8)
matrix["shop_id"] = matrix["shop_id"].astype(np.int8)
matrix["item_id"] = matrix["item_id"].astype(np.int16)
matrix.sort_values(cols, inplace=True)

# Add revenue to the train df
train["revenue"] = train["item_cnt_day"] * train["item_price"]

# Aggregate daily sales into monthly sales per (month, shop, item)
group = train.groupby(["date_block_num", "shop_id", "item_id"]).agg({"item_cnt_day": ["sum"]})
group.columns = ["item_cnt_month"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=cols, how="left")
matrix["item_cnt_month"] = matrix["item_cnt_month"].fillna(0).astype(np.float16)

# Create a test set for month 34
test["date_block_num"] = 34
test["date_block_num"] = test["date_block_num"].astype(np.int8)
test["shop_id"] = test.shop_id.astype(np.int8)
test["item_id"] = test.item_id.astype(np.int16)

# Concatenate train and test sets
matrix = pd.concat([matrix, test.drop(["ID"], axis=1)], ignore_index=True, sort=False, keys=cols)
matrix.fillna(0, inplace=True)

# Add shop, item and category data onto the matrix df
matrix = pd.merge(matrix, shops, on=["shop_id"], how="left")
matrix = pd.merge(matrix, items, on=["item_id"], how="left")
matrix = pd.merge(matrix, cats, on=["item_category_id"], how="left")
matrix["shop_city"] = matrix["shop_city"].astype(np.int8)
matrix["shop_category"] = matrix["shop_category"].astype(np.int8)
matrix["item_category_id"] = matrix["item_category_id"].astype(np.int8)
matrix["subtype_code"] = matrix["subtype_code"].astype(np.int8)
matrix["name2"] = matrix["name2"].astype(np.int8)
matrix["name3"] = matrix["name3"].astype(np.int16)
matrix["type_code"] = matrix["type_code"].astype(np.int8)

# Feature engineering: add lag features to the matrix df
# Define a lag feature function
def lag_feature(df, lags, cols):
    for col in cols:
        print(col)
        tmp = df[["date_block_num", "shop_id", "item_id", col]]
        for i in lags:
            shifted = tmp.copy()
            shifted.columns = ["date_block_num", "shop_id", "item_id", col + "_lag_" + str(i)]
            shifted.date_block_num = shifted.date_block_num + i
            df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return df

# Add item_cnt_month lag features
matrix = lag_feature(matrix, [1, 2, 3], ["item_cnt_month"])

# Add the previous month's average item_cnt
group = matrix.groupby(["date_block_num"]).agg({"item_cnt_month": ["mean"]})
group.columns = ["date_avg_item_cnt"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=["date_block_num"], how="left")
matrix.date_avg_item_cnt = matrix["date_avg_item_cnt"].astype(np.float16)
matrix = lag_feature(matrix, [1], ["date_avg_item_cnt"])
matrix.drop(["date_avg_item_cnt"], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / item_id combination
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id'], how='left')
matrix.date_item_avg_item_cnt = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1, 2, 3], ['date_item_avg_item_cnt'])
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / shop combination
group = matrix.groupby(["date_block_num", "shop_id"]).agg({"item_cnt_month": ["mean"]})
group.columns = ["date_shop_avg_item_cnt"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=["date_block_num", "shop_id"], how="left")
matrix.date_shop_avg_item_cnt = matrix["date_shop_avg_item_cnt"].astype(np.float16)
matrix = lag_feature(matrix, [1, 2, 3], ["date_shop_avg_item_cnt"])
matrix.drop(["date_shop_avg_item_cnt"], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / shop / item combination
group = matrix.groupby(["date_block_num", "shop_id", "item_id"]).agg({"item_cnt_month": ["mean"]})
group.columns = ["date_shop_item_avg_item_cnt"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=["date_block_num", "shop_id", "item_id"], how="left")
matrix.date_shop_item_avg_item_cnt = matrix["date_shop_item_avg_item_cnt"].astype(np.float16)
matrix = lag_feature(matrix, [1, 2, 3], ["date_shop_item_avg_item_cnt"])
matrix.drop(["date_shop_item_avg_item_cnt"], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / shop / item subtype combination
group = matrix.groupby(['date_block_num', 'shop_id', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_subtype_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'subtype_code'], how='left')
matrix.date_shop_subtype_avg_item_cnt = matrix['date_shop_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_shop_subtype_avg_item_cnt'])
matrix.drop(['date_shop_subtype_avg_item_cnt'], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / city combination
group = matrix.groupby(['date_block_num', 'shop_city']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_city'], how='left')
matrix.date_city_avg_item_cnt = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_city_avg_item_cnt'])
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / city / item combination
group = matrix.groupby(['date_block_num', 'item_id', 'shop_city']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'shop_city'], how='left')
matrix.date_item_city_avg_item_cnt = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_item_city_avg_item_cnt'])
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)

# Add average item price onto the matrix df, plus lag values of item price per month.
# Add delta price values - how the current month's average price relates to the global average.
group = train.groupby(["item_id"]).agg({"item_price": ["mean"]})
group.columns = ["item_avg_item_price"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["item_id"], how="left")
matrix["item_avg_item_price"] = matrix.item_avg_item_price.astype(np.float16)

group = train.groupby(["date_block_num", "item_id"]).agg({"item_price": ["mean"]})
group.columns = ["date_item_avg_item_price"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["date_block_num", "item_id"], how="left")
matrix["date_item_avg_item_price"] = matrix.date_item_avg_item_price.astype(np.float16)

lags = [1, 2, 3]
matrix = lag_feature(matrix, lags, ["date_item_avg_item_price"])
for i in lags:
    matrix["delta_price_lag_" + str(i)] = (
        matrix["date_item_avg_item_price_lag_" + str(i)] - matrix["item_avg_item_price"]
    ) / matrix["item_avg_item_price"]

def select_trends(row):
    # Take the most recent non-zero delta price lag
    for i in lags:
        if row["delta_price_lag_" + str(i)]:
            return row["delta_price_lag_" + str(i)]
    return 0

matrix["delta_price_lag"] = matrix.apply(select_trends, axis=1)
matrix["delta_price_lag"] = matrix.delta_price_lag.astype(np.float16)
matrix["delta_price_lag"].fillna(0, inplace=True)

features_to_drop = ["item_avg_item_price", "date_item_avg_item_price"]
for i in lags:
    features_to_drop.append("date_item_avg_item_price_lag_" + str(i))
    features_to_drop.append("delta_price_lag_" + str(i))
matrix.drop(features_to_drop, axis=1, inplace=True)

# Add total shop revenue per month to the matrix df, plus a lag value of revenue per month.
# Add delta revenue values - how the current month's revenue relates to the shop's average.
group = train.groupby(["date_block_num", "shop_id"]).agg({"revenue": ["sum"]})
group.columns = ["date_shop_revenue"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["date_block_num", "shop_id"], how="left")
matrix['date_shop_revenue'] = matrix['date_shop_revenue'].astype(np.float32)

group = group.groupby(["shop_id"]).agg({"date_shop_revenue": ["mean"]})
group.columns = ["shop_avg_revenue"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["shop_id"], how="left")
matrix["shop_avg_revenue"] = matrix.shop_avg_revenue.astype(np.float32)

matrix["delta_revenue"] = (matrix['date_shop_revenue'] - matrix['shop_avg_revenue']) / matrix['shop_avg_revenue']
matrix["delta_revenue"] = matrix["delta_revenue"].astype(np.float32)
matrix = lag_feature(matrix, [1], ["delta_revenue"])
matrix["delta_revenue_lag_1"] = matrix["delta_revenue_lag_1"].astype(np.float32)
matrix.drop(["date_shop_revenue", "shop_avg_revenue", "delta_revenue"], axis=1, inplace=True)

# Add month and number of days in each month to the matrix df
matrix["month"] = matrix["date_block_num"] % 12
days = pd.Series([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
matrix["days"] = matrix["month"].map(days).astype(np.int8)

# Add the month of each shop/item and item first sale
matrix["item_shop_first_sale"] = matrix["date_block_num"] - matrix.groupby(["item_id", "shop_id"])["date_block_num"].transform('min')
matrix["item_first_sale"] = matrix["date_block_num"] - matrix.groupby(["item_id"])["date_block_num"].transform('min')

# Delete the first three months from the matrix - they don't have lag values
matrix = matrix[matrix["date_block_num"] > 3]

# Model
import gc
import pickle
from xgboost import XGBRegressor
from matplotlib.pylab import rcParams

data = matrix.copy()
del matrix
gc.collect()

# Time-based split: months 4-32 for training, month 33 for validation, month 34 for testing
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)

# Clip the target to [0, 20], as the competition does for the true target values
Y_train = Y_train.clip(0, 20)
Y_valid = Y_valid.clip(0, 20)

del data
gc.collect()

# Training with the tuned hyperparameters
model = XGBRegressor(
    max_depth=10,
    n_estimators=1000,
    min_child_weight=5,
    colsample_bytree=0.6,
    subsample=0.6,
    alpha=0.1,
    eta=0.1,
    seed=42)

model.fit(
    X_train,
    Y_train,
    eval_metric="rmse",
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)],
    verbose=True,
    early_stopping_rounds=20)

rcParams['figure.figsize'] = 12, 4

# Testing
Y_pred = model.predict(X_valid).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index,
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# Plot feature importance
from xgboost import plot_importance

def plot_features(booster, figsize):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    return plot_importance(booster=booster, ax=ax)

plot_features(model, (10, 14))
From this tutorial, I hope you gained some practical experience in dealing with a time series dataset: data pre-processing and cleaning, feature aggregation and construction, and training and tuning a model's hyperparameters.
Connect with me on LinkedIn Here.
Thanks for giving your time!
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.