This article was published as a part of the Data Science Blogathon
From this article, you will learn how to predict the monthly sales of all different items at different shops, given the historical daily sales data (with code!).
Sales forecasting is an integral part of making intelligent business decisions to save costs or increase profit. Good forecasting helps retailers maintain adequate inventory levels, reduce the risk of stock obsolescence, and supply the right product at the right time and location. For example, if the forecast indicates a 25% increase in sales of a product or service, the store can purchase that product ahead of time to meet the demand. Conversely, a forecast of a shortfall in sales allows the business to mitigate the impact by taking action in advance.
Building a good model from observed time series data is challenging because time series are usually non-stationary and continuously evolving. For instance, future sales can be affected by new product launches, promotions, seasonality, and other changes that make them very difficult to predict from past behavior alone. Furthermore, each time step provides only a single observation, so we typically need to add lag terms to frame the forecasting task as a supervised learning problem.
The dataset can be downloaded from here. The data files and fields are defined as follows:
sales_train.csv – training dataset. Daily historical data from January 2013 to October 2015.
test.csv – test dataset. The task is to predict the total sales for every item and shop in the next month, i.e., November 2015, for which date_block_num is 34.
item_categories.csv – information about the items’ categories.
shops.csv – information about the shops.
items.csv – information about the items or products.
1. Checking missing values: there are no missing values in any of the data files mentioned above.
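A quick sanity check, for example (assuming the dataframes are loaded with the names used in the full code below):

# Check for missing values in each file (all counts are zero for this dataset).
for name, df in [("train", train), ("test", test), ("items", items), ("shops", shops), ("cats", cats)]:
    print(name, df.isnull().sum().sum())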
2. Remove outliers: remove noisy values/obvious outliers by defining thresholds.
Defined thresholds: keep only records with item_price below 300,000 and item_cnt_day below 1,000, as in the sketch below.
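A minimal sketch of this step (the same thresholds are applied in the full cleaning code further down):

# Drop obvious outliers from the raw daily sales data.
train = train[(train.item_price < 300000) & (train.item_cnt_day < 1000)]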
3. Remove data instances with a negative price or item count: remove any rows where the item_price or item_cnt_day value is negative. These data points may correspond to refunds, but for simplicity I decided to remove them (see the snippet below).
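For example (the full code below filters on price; an analogous filter could be applied to item_cnt_day):

# Keep only rows with a positive price.
train = train[train.item_price > 0].reset_index(drop=True)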
4. Combine duplicated feature values into a single feature value: several shops appear to be duplicates of one another, possibly due to stores moving location or re-opening. For instance, the shops named “!Якутск Орджоникидзе, 56 фран” and “Якутск Орджоникидзе, 56” are treated as the same shop after pre-processing, as sketched below.
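In practice this is done by remapping the duplicate shop ids onto a single id (the id pairs come from the full cleaning code below):

# Merge duplicate shops by mapping each old shop_id onto the surviving one.
for old_id, new_id in [(0, 57), (1, 58), (10, 11)]:
    train.loc[train.shop_id == old_id, "shop_id"] = new_id
    test.loc[test.shop_id == old_id, "shop_id"] = new_id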
5. Feature construction: construct new features that capture important information.
Since the time series is non-stationary, the differencing features “delta_price” and “delta_revenue” are transformations that help stabilize, or at least reduce, the effect of trend and seasonality. A simplified sketch of the idea follows.
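A minimal sketch of the “delta” idea, assuming the per-item average price as the reference point (the full pipeline below uses lagged monthly averages, plus an analogous delta_revenue per shop):

# delta_price: current price relative to the item's overall average price.
item_avg_price = train.groupby("item_id")["item_price"].transform("mean")
train["delta_price"] = (train["item_price"] - item_avg_price) / item_avg_price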
6. Add lag features: a time series is a sequence of observations taken sequentially in time. To predict future observations, the model needs to learn from historical ones. The number of steps by which the data is shifted back in time is called the lag. A time series problem is therefore usually transformed into a supervised machine learning problem by adding lags as input features.
Combine all the features generated in step 5 and create a dataframe with every combination of (month, shop, item) in increasing month order. After that, delete the first three months of data from the dataframe, since they have no lag values. A minimal sketch of building a lag feature is shown below.
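The sketch below adds a one-month lag of item_cnt_month by shifting the month index and merging back. Here monthly is a placeholder name for the per-month aggregated dataframe (called matrix in the full code), which wraps this pattern in a reusable lag_feature function:

# Add a 1-month lag: the value observed in month t becomes a feature for month t+1.
lagged = monthly[["date_block_num", "shop_id", "item_id", "item_cnt_month"]].copy()
lagged["date_block_num"] += 1
lagged = lagged.rename(columns={"item_cnt_month": "item_cnt_month_lag_1"})
monthly = monthly.merge(lagged, on=["date_block_num", "shop_id", "item_id"], how="left")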
7. Label Encoding: all the categorical variables in my dataset, such as category, sub_category, and shop_city, have no natural order. However, I am using label encoding instead of one-hot encoding because I am going to use a tree-based model (XGBRegressor), which chooses splits based on how well they purify the child nodes, so the artificial ordering introduced by label encoding matters far less than it would for a linear model.
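A minimal example of this encoding with scikit-learn's LabelEncoder (the same pattern is applied to the other categorical columns in the full code):

from sklearn.preprocessing import LabelEncoder

# Map each city name to an integer code.
shops["shop_city"] = LabelEncoder().fit_transform(shops["city"])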
ML Model
The model I am going to use is the eXtreme Gradient Boosting Regressor, aka XGBRegressor. It is particularly popular because it has been the winning algorithm in several recent Kaggle competitions. It is an ensemble learning method that builds a final model by combining several weak models.
As the name suggests, a gradient boosting machine combines gradient descent with boosting. Boosting uses an iterative procedure that adaptively changes the distribution of the training data, focusing more on previously mispredicted records when building each base learner. It can be viewed as adding models sequentially until no further improvement can be made. Gradient descent is used to minimize the loss: in gradient boosting, where the predictions of multiple models are combined, the gradient is used to optimize the boosted model's prediction in each boosting round. A toy sketch of this sequential procedure is shown below.
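To make the sequential idea concrete, here is a toy sketch (not the code used in this article) of gradient boosting for regression with squared-error loss, where each new tree is fitted to the current residuals, i.e. the negative gradient:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1):
    """Fit a sequence of small trees, each one on the residuals of the current ensemble."""
    prediction = np.full(len(y), y.mean())   # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)  # shrink and add the weak learner
        trees.append(tree)
    return trees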
XGBoost is a particular implementation of a gradient boosting machine that uses more accurate approximations to find the best model. It improves on the gradient boosting framework through systems optimizations and algorithmic enhancements. Among these improvements are the use of second-order gradients, which need fewer steps to converge to the optimum, and regularization terms, which improve model generalization.
Train-val-test split
Since I am dealing with a time series dataset, splitting the data randomly would leak future information into training and lead to misleading results. Therefore, I selected months 4 to 32 (date_block_num) as the training set, month 33 as the validation set, and month 34 as the test set. Next, I train the model on the training data and tune hyperparameters using the validation set, as sketched below.
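In code, the split is done on date_block_num rather than at random (this mirrors the split used in the full training code further down):

# Time-based split: months 4-32 for training, month 33 for validation, month 34 for testing.
X_train = data[data.date_block_num < 33].drop(["item_cnt_month"], axis=1)
Y_train = data[data.date_block_num < 33]["item_cnt_month"]
X_valid = data[data.date_block_num == 33].drop(["item_cnt_month"], axis=1)
Y_valid = data[data.date_block_num == 33]["item_cnt_month"]
X_test = data[data.date_block_num == 34].drop(["item_cnt_month"], axis=1)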
Training and Tuning hyperparameters
I have chosen 5 hyperparameters that usually have a high impact on performance: “max_depth”, “min_child_weight”, “subsample”, “colsample_bytree”, and “alpha”. The evaluation metric used to train the model is the root mean squared error (RMSE).
Tuning all combinations of the above hyperparameters at once would require running 4x4x4x4x5 = 1280 different models. With limited time and computational resources, I decided to tune these hyperparameters in pairs: first max_depth together with min_child_weight, then subsample together with colsample_bytree, then alpha, and finally eta. A sketch of one such pairwise search is shown below.
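An illustrative version of one pairwise grid search (this loop is a sketch, not the exact tuning script; it assumes the time-based split from the previous section and picks the pair with the lowest validation RMSE):

import itertools
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Grid search over max_depth and min_child_weight, scored by validation RMSE.
results = {}
for max_depth, min_child_weight in itertools.product([5, 9, 10, 14], [1, 5, 6, 10]):
    model = XGBRegressor(max_depth=max_depth, min_child_weight=min_child_weight,
                         n_estimators=1000, eta=0.1, seed=42)
    model.fit(X_train, Y_train,
              eval_set=[(X_valid, Y_valid)], eval_metric="rmse",
              early_stopping_rounds=20, verbose=False)
    rmse = np.sqrt(mean_squared_error(Y_valid, model.predict(X_valid)))
    results[(max_depth, min_child_weight)] = rmse

best_pair = min(results, key=results.get)

The other pairs (subsample with colsample_bytree, then alpha, then eta) are tuned with the same pattern.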
Table 1: The performance on the training set and validation set by varying values of max_depth and min_child_weight in XGBRegressor (in RMSE)
| | RMSE on training set | RMSE on validation set |
| --- | --- | --- |
| max_depth = 5, min_child_weight = 1 | 0.85729 | 0.90985 |
| max_depth = 5, min_child_weight = 5 | 0.85209 | 0.90801 |
| max_depth = 5, min_child_weight = 6 | 0.84928 | 0.90789 |
| max_depth = 5, min_child_weight = 10 | 0.84632 | 0.90671 |
| max_depth = 9, min_child_weight = 1 | 0.80990 | 0.89646 |
| max_depth = 9, min_child_weight = 5 | 0.78785 | 0.89408 |
| max_depth = 9, min_child_weight = 6 | 0.80670 | 0.89383 |
| max_depth = 9, min_child_weight = 10 | 0.80193 | 0.89292 |
| max_depth = 10, min_child_weight = 1 | 0.78412 | 0.89490 |
| max_depth = 10, min_child_weight = 5 | 0.76146 | 0.89195 |
| max_depth = 10, min_child_weight = 6 | 0.78534 | 0.89334 |
| max_depth = 10, min_child_weight = 10 | 0.75548 | 0.89255 |
| max_depth = 14, min_child_weight = 1 | 0.67286 | 0.89773 |
| max_depth = 14, min_child_weight = 5 | 0.72487 | 0.89512 |
| max_depth = 14, min_child_weight = 6 | 0.72679 | 0.89628 |
| max_depth = 14, min_child_weight = 10 | 0.71907 | 0.89314 |
We get the best score with max_depth = 10 and min_child_weight = 5.
Table 2: The performance on the training set and validation set by varying values of subsample and colsample_bytree in XGBRegressor (in RMSE)
| | RMSE – training set | RMSE – validation set |
| --- | --- | --- |
| subsample = 1, colsample_bytree = 1 | 0.77881 | 0.89499 |
| subsample = 1, colsample_bytree = 0.8 | 0.80461 | 0.89839 |
| subsample = 1, colsample_bytree = 0.6 | 0.77588 | 0.89328 |
| subsample = 1, colsample_bytree = 0.3 | 0.79709 | 0.89300 |
| subsample = 0.8, colsample_bytree = 1 | 0.77680 | 0.89501 |
| subsample = 0.8, colsample_bytree = 0.8 | 0.76146 | 0.89195 |
| subsample = 0.8, colsample_bytree = 0.6 | 0.77924 | 0.89056 |
| subsample = 0.8, colsample_bytree = 0.3 | 0.79472 | 0.88637 |
| subsample = 0.6, colsample_bytree = 1 | 0.79843 | 0.89901 |
| subsample = 0.6, colsample_bytree = 0.8 | 0.78591 | 0.90145 |
| subsample = 0.6, colsample_bytree = 0.6 | 0.77927 | 0.88560 |
| subsample = 0.6, colsample_bytree = 0.3 | 0.79306 | 0.88662 |
| subsample = 0.3, colsample_bytree = 1 | 0.79265 | 0.89602 |
| subsample = 0.3, colsample_bytree = 0.8 | 0.81131 | 0.89944 |
| subsample = 0.3, colsample_bytree = 0.6 | 0.79335 | 0.89067 |
| subsample = 0.3, colsample_bytree = 0.3 | 0.78281 | 0.88789 |
We get the best score with subsample = 0.6 and colsample_bytree = 0.6.
Table 3: The performance on the training set and validation set by varying values of alpha in XGBRegressor (in RMSE)
| | RMSE – training set | RMSE – validation set |
| --- | --- | --- |
| alpha = 0.5 | 0.77921 | 0.89105 |
| alpha = 0.3 | 0.77943 | 0.88729 |
| alpha = 0.2 | 0.78550 | 0.89182 |
| alpha = 0.1 | 0.77927 | 0.88560 |
| alpha = 0.05 | 0.78821 | 0.89116 |
We get the best score with alpha = 0.1
Here is what the final model's parameters look like:
model = XGBRegressor(max_depth=10, n_estimators=1000, min_child_weight=5, subsample=0.6, colsample_bytree=0.6, alpha=0.1, eta=0.1, seed=42)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

# Load data
items = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/items.csv")
shops = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/shops.csv")
cats = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv")
train = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv")
test = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/test.csv")

# Data cleaning: visualize item_cnt_day and item_price with boxplots to spot outliers
plt.figure(figsize=(10, 4))
plt.xlim(-100, 3000)
flierprops = dict(marker='o', markerfacecolor='purple', markersize=6,
                  linestyle='none', markeredgecolor='black')
sns.boxplot(x=train.item_cnt_day, flierprops=flierprops)

plt.figure(figsize=(10, 4))
plt.xlim(train.item_price.min(), train.item_price.max() * 1.1)
sns.boxplot(x=train.item_price, flierprops=flierprops)

# Remove outliers
train = train[(train.item_price < 300000) & (train.item_cnt_day < 1000)]
# Remove negative item prices
train = train[train.item_price > 0].reset_index(drop=True)

# Cleaning shops data: merge duplicate shops
# Якутск Орджоникидзе, 56
train.loc[train.shop_id == 0, 'shop_id'] = 57
test.loc[test.shop_id == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
train.loc[train.shop_id == 1, 'shop_id'] = 58
test.loc[test.shop_id == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
train.loc[train.shop_id == 10, 'shop_id'] = 11
test.loc[test.shop_id == 10, 'shop_id'] = 11

# Clean up some shop names and add 'city' and 'category' to the shops df
shops.loc[shops.shop_name == 'Сергиев Посад ТЦ "7Я"', "shop_name"] = 'СергиевПосад ТЦ "7Я"'
shops["city"] = shops.shop_name.str.split(" ").map(lambda x: x[0])
shops["category"] = shops.shop_name.str.split(" ").map(lambda x: x[1])
shops.loc[shops.city == "!Якутск", "city"] = "Якутск"

# Only keep a shop category if there are 5 or more shops of that category; group the rest as "other"
category = []
for cat in shops.category.unique():
    if len(shops[shops.category == cat]) >= 5:
        category.append(cat)
shops.category = shops.category.apply(lambda x: x if (x in category) else "other")

# Label encoding of shop category and city
from sklearn.preprocessing import LabelEncoder
shops["shop_category"] = LabelEncoder().fit_transform(shops.category)
shops["shop_city"] = LabelEncoder().fit_transform(shops.city)
shops = shops[["shop_id", "shop_category", "shop_city"]]

# Cleaning item category data
cats["type_code"] = cats.item_category_name.apply(lambda x: x.split(" ")[0]).astype(str)
cats.loc[(cats.type_code == "Игровые") | (cats.type_code == "Аксессуары"), "category"] = "Игры"
category = []
for cat in cats.type_code.unique():
    if len(cats[cats.type_code == cat]) >= 5:
        category.append(cat)
cats.type_code = cats.type_code.apply(lambda x: x if (x in category) else "etc")

# Label encoding of item category type and subtype
cats.type_code = LabelEncoder().fit_transform(cats.type_code)
cats["split"] = cats.item_category_name.apply(lambda x: x.split("-"))
cats["subtype"] = cats.split.apply(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
cats["subtype_code"] = LabelEncoder().fit_transform(cats["subtype"])
cats = cats[["item_category_id", "subtype_code", "type_code"]]

# Cleaning item data
import re

def name_correction(x):
    x = x.lower()                              # all letters lower case
    x = x.partition('[')[0]                    # keep the part before any square bracket
    x = x.partition('(')[0]                    # keep the part before any round bracket
    x = re.sub('[^A-Za-z0-9А-Яа-я]+', ' ', x)  # remove special characters
    x = x.replace('  ', ' ')                   # replace double spaces with single spaces
    x = x.strip()                              # remove leading and trailing whitespace
    return x

# Cleaning item names
# Split item names by the first bracket
items["name1"], items["name2"] = items.item_name.str.split("[", 1).str
items["name1"], items["name3"] = items.item_name.str.split("(", 1).str
# Replace special characters and convert to lower case
items["name2"] = items.name2.str.replace('[^A-Za-z0-9А-Яа-я]+', " ").str.lower()
items["name3"] = items.name3.str.replace('[^A-Za-z0-9А-Яа-я]+', " ").str.lower()
# Fill nulls with '0'
items = items.fillna('0')
items["item_name"] = items["item_name"].apply(lambda x: name_correction(x))
# Return all characters except the last if name2 is not "0" - drops the closing bracket
items.name2 = items.name2.apply(lambda x: x[:-1] if x != "0" else "0")

# Clean item type
items["type"] = items.name2.apply(lambda x: x[0:8] if x.split(" ")[0] == "xbox" else x.split(" ")[0])
items.loc[(items.type == "x360") | (items.type == "xbox360") | (items.type == "xbox 360"), "type"] = "xbox 360"
items.loc[items.type == "", "type"] = "mac"
items.type = items.type.apply(lambda x: x.replace(" ", ""))
items.loc[(items.type == 'pc') | (items.type == 'pс') | (items.type == "pc"), "type"] = "pc"
items.loc[items.type == 'рs3', "type"] = "ps3"

# Group rare name2 values (types with fewer than 40 items) as "other"
group_sum = items.groupby(["type"]).agg({"item_id": "count"})
group_sum = group_sum.reset_index()
drop_cols = []
for cat in group_sum.type.unique():
    if group_sum.loc[(group_sum.type == cat), "item_id"].values[0] < 40:
        drop_cols.append(cat)
items.name2 = items.name2.apply(lambda x: "other" if (x in drop_cols) else x)
items = items.drop(["type"], axis=1)
items.name2 = LabelEncoder().fit_transform(items.name2)
items.name3 = LabelEncoder().fit_transform(items.name3)
items.drop(["item_name", "name1"], axis=1, inplace=True)
items.head()

# Preprocessing: create a matrix df with every combination of month, shop and item
# in order of increasing month; item_cnt_day is summed into item_cnt_month
from itertools import product

matrix = []
cols = ["date_block_num", "shop_id", "item_id"]
for i in range(34):
    sales = train[train.date_block_num == i]
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype=np.int16))

matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix["date_block_num"] = matrix["date_block_num"].astype(np.int8)
matrix["shop_id"] = matrix["shop_id"].astype(np.int8)
matrix["item_id"] = matrix["item_id"].astype(np.int16)
matrix.sort_values(cols, inplace=True)

# Add revenue to the train df
train["revenue"] = train["item_cnt_day"] * train["item_price"]

# Aggregate daily sales into monthly sales per (month, shop, item)
group = train.groupby(["date_block_num", "shop_id", "item_id"]).agg({"item_cnt_day": ["sum"]})
group.columns = ["item_cnt_month"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=cols, how="left")
matrix["item_cnt_month"] = matrix["item_cnt_month"].fillna(0).astype(np.float16)

# Create a test set for month 34
test["date_block_num"] = 34
test["date_block_num"] = test["date_block_num"].astype(np.int8)
test["shop_id"] = test.shop_id.astype(np.int8)
test["item_id"] = test.item_id.astype(np.int16)

# Concatenate train and test sets
matrix = pd.concat([matrix, test.drop(["ID"], axis=1)], ignore_index=True, sort=False, keys=cols)
matrix.fillna(0, inplace=True)

# Add shop, item and category data onto the matrix df
matrix = pd.merge(matrix, shops, on=["shop_id"], how="left")
matrix = pd.merge(matrix, items, on=["item_id"], how="left")
matrix = pd.merge(matrix, cats, on=["item_category_id"], how="left")
matrix["shop_city"] = matrix["shop_city"].astype(np.int8)
matrix["shop_category"] = matrix["shop_category"].astype(np.int8)
matrix["item_category_id"] = matrix["item_category_id"].astype(np.int8)
matrix["subtype_code"] = matrix["subtype_code"].astype(np.int8)
matrix["name2"] = matrix["name2"].astype(np.int8)
matrix["name3"] = matrix["name3"].astype(np.int16)
matrix["type_code"] = matrix["type_code"].astype(np.int8)

# Feature engineering: add lag features to the matrix df
# Define a lag feature function
def lag_feature(df, lags, cols):
    for col in cols:
        print(col)
        tmp = df[["date_block_num", "shop_id", "item_id", col]]
        for i in lags:
            shifted = tmp.copy()
            shifted.columns = ["date_block_num", "shop_id", "item_id", col + "_lag_" + str(i)]
            shifted.date_block_num = shifted.date_block_num + i
            df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return df

# Add item_cnt_month lag features
matrix = lag_feature(matrix, [1, 2, 3], ["item_cnt_month"])

# Add the previous month's average item_cnt
group = matrix.groupby(["date_block_num"]).agg({"item_cnt_month": ["mean"]})
group.columns = ["date_avg_item_cnt"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=["date_block_num"], how="left")
matrix.date_avg_item_cnt = matrix["date_avg_item_cnt"].astype(np.float16)
matrix = lag_feature(matrix, [1], ["date_avg_item_cnt"])
matrix.drop(["date_avg_item_cnt"], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / item_id combination
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id'], how='left')
matrix.date_item_avg_item_cnt = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1, 2, 3], ['date_item_avg_item_cnt'])
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / shop combination
group = matrix.groupby(["date_block_num", "shop_id"]).agg({"item_cnt_month": ["mean"]})
group.columns = ["date_shop_avg_item_cnt"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=["date_block_num", "shop_id"], how="left")
matrix.date_shop_avg_item_cnt = matrix["date_shop_avg_item_cnt"].astype(np.float16)
matrix = lag_feature(matrix, [1, 2, 3], ["date_shop_avg_item_cnt"])
matrix.drop(["date_shop_avg_item_cnt"], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / shop / item combination
group = matrix.groupby(["date_block_num", "shop_id", "item_id"]).agg({"item_cnt_month": ["mean"]})
group.columns = ["date_shop_item_avg_item_cnt"]
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=["date_block_num", "shop_id", "item_id"], how="left")
matrix.date_shop_item_avg_item_cnt = matrix["date_shop_item_avg_item_cnt"].astype(np.float16)
matrix = lag_feature(matrix, [1, 2, 3], ["date_shop_item_avg_item_cnt"])
matrix.drop(["date_shop_item_avg_item_cnt"], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / shop / item subtype combination
group = matrix.groupby(['date_block_num', 'shop_id', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_subtype_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'subtype_code'], how='left')
matrix.date_shop_subtype_avg_item_cnt = matrix['date_shop_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_shop_subtype_avg_item_cnt'])
matrix.drop(['date_shop_subtype_avg_item_cnt'], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / city combination
group = matrix.groupby(['date_block_num', 'shop_city']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_city'], how='left')
matrix.date_city_avg_item_cnt = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_city_avg_item_cnt'])
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)

# Add lag values of item_cnt_month for every month / city / item combination
group = matrix.groupby(['date_block_num', 'item_id', 'shop_city']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'shop_city'], how='left')
matrix.date_item_city_avg_item_cnt = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], ['date_item_city_avg_item_cnt'])
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)

# Add average item price onto the matrix df, plus lag values of item price per month.
# Add delta price values - how the current month's average price relates to the global average.
group = train.groupby(["item_id"]).agg({"item_price": ["mean"]})
group.columns = ["item_avg_item_price"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["item_id"], how="left")
matrix["item_avg_item_price"] = matrix.item_avg_item_price.astype(np.float16)

group = train.groupby(["date_block_num", "item_id"]).agg({"item_price": ["mean"]})
group.columns = ["date_item_avg_item_price"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["date_block_num", "item_id"], how="left")
matrix["date_item_avg_item_price"] = matrix.date_item_avg_item_price.astype(np.float16)

lags = [1, 2, 3]
matrix = lag_feature(matrix, lags, ["date_item_avg_item_price"])
for i in lags:
    matrix["delta_price_lag_" + str(i)] = (
        matrix["date_item_avg_item_price_lag_" + str(i)] - matrix["item_avg_item_price"]
    ) / matrix["item_avg_item_price"]

def select_trends(row):
    # Take the most recent non-zero delta price lag
    for i in lags:
        if row["delta_price_lag_" + str(i)]:
            return row["delta_price_lag_" + str(i)]
    return 0

matrix["delta_price_lag"] = matrix.apply(select_trends, axis=1)
matrix["delta_price_lag"] = matrix.delta_price_lag.astype(np.float16)
matrix["delta_price_lag"].fillna(0, inplace=True)

features_to_drop = ["item_avg_item_price", "date_item_avg_item_price"]
for i in lags:
    features_to_drop.append("date_item_avg_item_price_lag_" + str(i))
    features_to_drop.append("delta_price_lag_" + str(i))
matrix.drop(features_to_drop, axis=1, inplace=True)

# Add total shop revenue per month to the matrix df, plus a lag value of revenue per month.
# Add delta revenue values - how the current month's revenue relates to the shop's average.
group = train.groupby(["date_block_num", "shop_id"]).agg({"revenue": ["sum"]})
group.columns = ["date_shop_revenue"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["date_block_num", "shop_id"], how="left")
matrix['date_shop_revenue'] = matrix['date_shop_revenue'].astype(np.float32)

group = group.groupby(["shop_id"]).agg({"date_shop_revenue": ["mean"]})
group.columns = ["shop_avg_revenue"]
group.reset_index(inplace=True)
matrix = matrix.merge(group, on=["shop_id"], how="left")
matrix["shop_avg_revenue"] = matrix.shop_avg_revenue.astype(np.float32)

matrix["delta_revenue"] = (matrix['date_shop_revenue'] - matrix['shop_avg_revenue']) / matrix['shop_avg_revenue']
matrix["delta_revenue"] = matrix["delta_revenue"].astype(np.float32)
matrix = lag_feature(matrix, [1], ["delta_revenue"])
matrix["delta_revenue_lag_1"] = matrix["delta_revenue_lag_1"].astype(np.float32)
matrix.drop(["date_shop_revenue", "shop_avg_revenue", "delta_revenue"], axis=1, inplace=True)

# Add month and number of days in each month to the matrix df
matrix["month"] = matrix["date_block_num"] % 12
days = pd.Series([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
matrix["days"] = matrix["month"].map(days).astype(np.int8)

# Add the month of each shop/item and item first sale
matrix["item_shop_first_sale"] = matrix["date_block_num"] - matrix.groupby(["item_id", "shop_id"])["date_block_num"].transform('min')
matrix["item_first_sale"] = matrix["date_block_num"] - matrix.groupby(["item_id"])["date_block_num"].transform('min')

# Delete the first three months from the matrix - they don't have lag values
matrix = matrix[matrix["date_block_num"] > 3]

# Model
import gc
import pickle
from xgboost import XGBRegressor
from matplotlib.pylab import rcParams

data = matrix.copy()
del matrix
gc.collect()

# Time-based split: months 4-32 for training, month 33 for validation, month 34 for testing
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)

# Clip the target to [0, 20], as the competition does for the true target values
Y_train = Y_train.clip(0, 20)
Y_valid = Y_valid.clip(0, 20)

del data
gc.collect()

# Training with the tuned hyperparameters
model = XGBRegressor(
    max_depth=10,
    n_estimators=1000,
    min_child_weight=5,
    colsample_bytree=0.6,
    subsample=0.6,
    alpha=0.1,
    eta=0.1,
    seed=42)

model.fit(
    X_train,
    Y_train,
    eval_metric="rmse",
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)],
    verbose=True,
    early_stopping_rounds=20)

rcParams['figure.figsize'] = 12, 4

# Testing
Y_pred = model.predict(X_valid).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index,
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# Plot feature importance
from xgboost import plot_importance

def plot_features(booster, figsize):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    return plot_importance(booster=booster, ax=ax)

plot_features(model, (10, 14))
From this tutorial, I hope you gained some practical experience in dealing with a time series dataset: data pre-processing and cleaning, feature aggregation and construction, and training and tuning a model's hyperparameters.
Connect with me on LinkedIn Here.
Thanks for giving your time!
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.