As a data scientist participating in multiple machine learning competitions, I am always on the lookout for "not-yet-popular" algorithms. The way I define them is that these algorithms, by themselves, may not end up winning a competition, but they bring a different way of making predictions to the table.
Why am I interested in these algorithms? The key is the phrase "by themselves" in the statement above. These algorithms can be used in ensemble models to get an extra edge over the more popular gradient boosting algorithms (XGBoost, LightGBM, etc.).
This article talks about one such algorithm, called Regularized Greedy Forest (RGF). It performs comparably to (if not better than) boosting algorithms on a large number of datasets, produces less correlated predictions, and does well in ensembles with other tree-based models.
In order to get the most out of this article, you should know the basics of gradient boosting and basic decision tree terminology.
In boosting algorithms, each classifier/regressor is trained on the data, taking into account the success of the previous classifiers/regressors. After each training step, the weights are redistributed: misclassified samples have their weights increased to emphasize the most difficult cases, so that subsequent learners focus on them during training.
However, boosting methods treat the decision tree base learner as a black box and do not take advantage of the tree structure itself. In a sense, boosting performs only a partial corrective step on the model at each iteration.
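As a quick illustration of this reweighting idea, here is a minimal sketch using scikit-learn's AdaBoost implementation on synthetic data (the dataset and parameter values are arbitrary, chosen only for demonstration):

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Toy regression data, purely for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Each successive tree is fit on a reweighted sample that emphasizes the examples
# the previous trees predicted worst
booster = AdaBoostRegressor(n_estimators=100, learning_rate=0.05, random_state=0)
booster.fit(X, y)
print(booster.score(X, y))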
In contrast, RGF performs 2 steps:
Search for the optimum structure change:
Let me explain this with an example.
Figure 3 shows that, at the same stage as Figure 2, we may either consider splitting one of the leaf nodes marked with the symbol X or grow a new tree T4 (i.e., split T4's root).
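In pseudocode, the search described above might look roughly like this (an illustrative sketch with hypothetical object names, not the library's actual internals):

def grow_one_step(forest, data, regularized_loss):
    # Hypothetical sketch of RGF's greedy structure search
    candidates = []
    # Option 1: split any existing leaf of any tree already in the forest
    for tree in forest.trees:
        for leaf in tree.leaves():
            candidates.append(forest.with_split(tree, leaf, data))
    # Option 2: start a new tree (split the root of a fresh, empty tree)
    candidates.append(forest.with_new_tree(data))
    # Keep whichever structure change reduces the regularized loss the most
    return min(candidates, key=lambda f: regularized_loss(f, data))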
Weight Optimization
Weights for each of the nodes are also optimized in order to minimize the loss function further:
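Loosely following the original paper (the notation here is a simplified sketch), the forest is an additive model over its leaf nodes, and weight optimization refits all the leaf weights to minimize the regularized loss:

F(x) = \sum_{v} \alpha_v \, b_v(x)

\min_{\{\alpha_v\}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(F(x_i),\, y_i\big) \;+\; \lambda \sum_{v} \alpha_v^{2}

where b_v(x) indicates whether x falls into leaf v and \lambda is the L2 penalty. Because refitting every weight after every single split would be expensive, the weights are re-optimized only after a fixed number of new leaf nodes has been added (100 by default in the implementation).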
Regularization
Explicit regularization of the loss function is essential for this algorithm, as it overfits very quickly otherwise. It is possible to use different L2 regularization parameters for the process of growing the forest and the process of weight correction.
There are three methods of regularization, corresponding to the algorithm options exposed by the library: L2 regularization on leaf-only models (the RGF algorithm), minimum-penalty regularization (RGF_Opt), and minimum-penalty regularization with a sum-to-zero constraint on sibling nodes (RGF_Sib).
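Mapping this onto the Python wrapper, a minimal sketch (parameter values below are arbitrary) could look like this, with l2 used for the weight-correction step and sl2, when given, overriding it for the forest-growing step:

from rgf.sklearn import RGFRegressor

rgf = RGFRegressor(algorithm="RGF_Sib",  # regularization variant
                   l2=0.1,               # L2 penalty for weight correction
                   sl2=0.01,             # L2 penalty while growing the forest
                   max_leaf=500)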
Tree Size
RGF does not require the tree size parameters (e.g., number of trees, max depth) needed in gradient boosted decision trees. With RGF, the size of each tree is determined automatically as a result of minimizing the regularized loss. What we do declare is the maximum number of leaves in the forest and the regularization parameters (L1 and L2).
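To see the difference in what has to be specified, compare a typical gradient boosted model with an RGF model (the values below are arbitrary):

from sklearn.ensemble import GradientBoostingRegressor
from rgf.sklearn import RGFRegressor

# GBDT: number of trees and their depth are fixed up front
gbm = GradientBoostingRegressor(n_estimators=500, max_depth=4, learning_rate=0.05)

# RGF: only a cap on the total number of leaves plus the regularization strength
rgf = RGFRegressor(max_leaf=1000, l2=0.1)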
Model Size
Since RGF performs fully corrective steps on the model/forest, it can train a simpler model compared to boosting algorithms, which require a small learning rate/shrinkage and a large number of estimators to produce good results.
The original RGF implementation for binary classification and regression was written in C++ by the authors of the original research paper, Rie Johnson and Tong Zhang. The most popular Python wrapper of that implementation, developed by fukatani, even supports multiclass classification (using the 'one vs. all' technique). Much of it is based on the RGF wrapper by MLWave.
Let’s talk about the most important parameters that affect the accuracy of the model, the speed of training, or both: max_leaf (the maximum number of leaf nodes in the forest), algorithm (the regularization variant: RGF, RGF_Opt or RGF_Sib), l2 and sl2 (the L2 regularization strengths), and min_samples_leaf (the minimum number of samples required in a leaf node).
Let us try RGF on the Big Mart Sales Prediction problem. The dataset for the same can be downloaded from this datahack link. I have borrowed some of the pre-processing steps used in this article.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Read files:
train = pd.read_csv("Train_UWu5bXk.csv")
test = pd.read_csv("Test_u94Q5KV.csv")
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)

# Filter categorical variables, excluding ID columns and source:
categorical_columns = [x for x in data.dtypes.index if data.dtypes[x] == 'object']
categorical_columns = [x for x in categorical_columns
                       if x not in ['Item_Identifier', 'Outlet_Identifier', 'source']]

# Get the first two characters of the item ID and rename them to more intuitive categories:
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD': 'Food',
                                                             'NC': 'Non-Consumable',
                                                             'DR': 'Drinks'})

# Years of operation of the outlet:
data['Outlet_Years'] = 2013 - data['Outlet_Establishment_Year']

# Normalize the categories of low fat:
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF': 'Low Fat',
                                                              'reg': 'Regular',
                                                              'low fat': 'Low Fat'})

# Mark non-consumables as a separate category in Item_Fat_Content:
data.loc[data['Item_Type_Combined'] == "Non-Consumable", 'Item_Fat_Content'] = "Non-Edible"

# Fill missing values with a very large negative value:
data.fillna(-9999, inplace=True)

# Label-encode the categorical variables (Item_Type included so all features are numeric):
le = LabelEncoder()
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
var_mod = ['Item_Fat_Content', 'Outlet_Location_Type', 'Outlet_Size', 'Item_Type',
           'Item_Type_Combined', 'Outlet_Type', 'Outlet']
for i in var_mod:
    data[i] = le.fit_transform(data[i].astype(str))

# Split back into train and test, dropping IDs, the target and the helper column:
train_new = data.loc[data['source'] == 'train'].drop(
    ['Item_Identifier', 'Outlet_Identifier', 'source', 'Item_Outlet_Sales'], axis=1)
test_new = data.loc[data['source'] == 'test'].drop(
    ['Item_Identifier', 'Outlet_Identifier', 'source', 'Item_Outlet_Sales'], axis=1)
y_all = data.loc[data['source'] == 'train', 'Item_Outlet_Sales']
Once we have stored the pre-processed data, we can import RGF using the following commands:
from rgf.sklearn import RGFRegressor
from sklearn.model_selection import GridSearchCV
The two most important parameters to set here are the maximum number of leaves allowed (max_leaf) and the L2 regularization (l2). We can use a grid search to find the parameter values with the best cross-validation MSE.
# Instantiate the base RGF regressor for the grid search
rgf = RGFRegressor()
parameters = {'max_leaf': [1000, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000],
              'l2': [0.1, 0.2, 0.3],
              'min_samples_leaf': [5, 10]}
clf = GridSearchCV(estimator=rgf, param_grid=parameters,
                   scoring='neg_mean_squared_error', n_jobs=-1, cv=3)
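Assuming the pre-processed train_new and y_all from above, the search can then be run and inspected like this (a sketch, not output from the article):

clf.fit(train_new, y_all)
print(clf.best_params_)           # parameter combination with the best CV score
print(np.sqrt(-clf.best_score_))  # RMSE implied by the negative MSE score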
It looks like we are trying to fit too complex a model with too many leaves: the search settles on a high regularization term while max_leaf stays at the low end of the range. Let us do a different grid search with lower max_leaf values:
parameters = {'max_leaf': [100, 200, 300, 400, 500, 800, 900, 1000],
              'algorithm': ("RGF_Sib", "RGF"),
              'l2': [0.1, 0.2, 0.3],
              'min_samples_leaf': [5, 10]}
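For completeness, a sketch of running this second search and predicting on the test set (the submission step is not shown):

rgf = RGFRegressor()
clf = GridSearchCV(estimator=rgf, param_grid=parameters,
                   scoring='neg_mean_squared_error', n_jobs=-1, cv=3)
clf.fit(train_new, y_all)
test_predictions = clf.best_estimator_.predict(test_new)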
It looks like these parameters give the best fit. This scores an RMSE of 1146 on the public leaderboard.
RGF is yet another tree-ensemble technique that sits next to gradient boosting algorithms and can be used to effectively model non-linear relationships. The documentation for the library used can be found at this link. There are several other parameters to try and tune with RGF. Let me know in the comments how it works out for you.
Excellent work, Mr. Ankit; it is very helpful for us to discuss with our students...
Thank you for the kind words.
This is super useful! I will definitely follow your example and try it out. By saying that this algorithm "may train simpler models", what do you mean? Is it that it takes less training time, or that the trained model file is smaller?
The problem with boosting is that, in order to perform well, it requires a large number of estimators with very small learning rates, while RGF is regularized at the forest level at each iteration and hence trains a model with fewer trees, i.e. a simpler model.
Thanks for the very useful and interesting article! As of Feb 10th, FastRGF isn't in the alpha stage anymore, and we removed the FastRGF.rst document from the GitHub repo (all the info is in the Readme.rst document now).
Thanks for the information.