Gradient Boosting is a powerful machine learning algorithm that reduces bias error in models. Unlike AdaBoost, where we specify the base estimator, Gradient Boosting uses a fixed Decision Stump. We can tune the n_estimators
parameter, which defaults to 100 if not specified. This algorithm works for both regression (using Mean Squared Error) and classification (using Log Loss). This article will cover key topics related to Gradient Boosting.
This article was published as a part of the Data Science Blogathon.
Let us now understand the working of the Gradient Boosting Algorithm with the help of one example. In the following example, Age is the Target variable whereas LikesExercising, GotoGym, DrivesCar are independent variables. As in this example, the target variable is continuous, GradientBoostingRegressor is used here.
For estimator-1, the root level (level 0) will consist of all the records. The predicted age at this level is equal to the mean of the entire Age column i.e. 41.33(addition of all values in Age column divided by a number of records i.e. 9). Let us find out what is the MSE for this level. MSE is calculated as the mean of the square of errors. Here error is equal to actual age-predicted age. The predicted age for the particular node is always equal to the mean of age records of that node. So, the MSE of the root node of the 1st estimator is calculated as given below.
MSE=(∑(Agei –mu)2 )/9=577.11
The cost function hers is MSE and the objective of the algorithm here is to minimize the MSE.
Now, one of the independent variables will be used by the Gradient boosting to create the Decision Stump. Let us suppose that the LikesExercising
is used here for prediction. So, the records with False LikesExercising will go in one child node, and records with True LikesExercising will go in another child node as shown below.
Let us find the means and MSEs of both the nodes of level 1. For the left node, the mean is equal to 20.25 and MSE is equal to 83.19. Whereas, for the right node, the mean is equal to 58.2 and MSE is equal to 332.16. Total MSE for level 1 will be equal to the addition of all nodes of level 1 i.e. 83.19+332.16=415.35. We can see here, the cost function i.e. MSE of level 1 is better than level 0.
Let us now find out the estimator-2. Unlike AdaBoost, in the Gradient boosting algorithm, residues (agei – mu)of the first estimator are taken as root nodes as shown below. Let us suppose for this estimator another dependent variable is used for prediction. So, the records with False GotoGym
will go in one child node, and records with True GotoGym will go in another child node as shown below.
The prediction of age here is slightly tricky. First, the age will be predicted from estimator 1 as per the value of LikeExercising, and then the mean from the estimator is found out with the help of the value of GotoGym and then that means is added to age-predicted from the first estimator and that is the final prediction of Gradient boosting with two estimators.
Let us consider if we want to predict the age for the following records:
Here, LikesExercising is equal to False. So, the predicted age from the first estimator will be 20.25 (i.e. mean of the left node of the first estimator). Now we need to check what is the value of GotoGym for the second predictor and its value is True. So, the mean of True GotoGym in the 2nd estimator is -3.56. This will be added to the prediction of the first estimator i.e. 20.25. So final prediction of this model will be 20.25+(-3.56) = 16.69.
Let us predict of ages of all records we have in the example:
Let us now find out the Final MSE for above all 9 records.
MSE = ((-2.69)2 +(-1.69)2 + (-0.69)2 +(-28.64)2 +(19.31)2 +(-15.33)2 + (14.36)2 +(6.67)2 +(7.67)2 )/9
= 194.2478
So, we can see that the final MSE is much better than the MSE of the root node of the 1st Estimator. This is only for 2 estimators. There can be n number of estimators in gradient boosting algorithm.
Checkout this article about Hyperparameter Tuning and its Techniques
Let us now write a python code for the same.
# Importing required modules
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder
# Let us create the Data-Frame for above
X=pd.DataFrame({'LikesExercising':[False,False,False,True,False,True,True,True,True],
'GotoGym':[True,True,True,True,True,False,True,False,False],
'DrivesCar':[True,False,False,True,True,False,True,False,True]})
Y=pd.Series(name='Age',data=[14,15,16,26,36,50,69,72,74])
# Let us encode true and false to number value 0 and 1
LE=LabelEncoder()
X['LikesExercising']=LE.fit_transform(X['LikesExercising'])
X['GotoGym']=LE.fit_transform(X['GotoGym'])
X['DrivesCar']=LE.fit_transform(X['DrivesCar'])
#We will now see the effect of different numbers of estimators on MSE.
# 1) Let us now use GradientBoostingRegressor with 2 estimators to train the model and to predict the age for the same inputs.
GB=GradientBoostingRegressor(n_estimators=2)
GB.fit(X,Y)
Y_predict=GB.predict(X) #ages predicted by model with 2 estimators
Y_predict
# Output
#Y_predict=[38.23 , 36.425, 36.425, 42.505, 38.23 , 45.07 , 42.505, 45.07 ,47.54]
#Following code is used to find out MSE of prediction with Gradient boosting algorithm having estimator 2.
MSE_2=(sum((Y-Y_predict)**2))/len(Y)
print('MSE for two estimators :',MSE_2)
#Output: MSE for two estimators : 432.482055555546
# 2) Let us now use GradientBoostingRegressor with 3 estimators to train the model and to predict the age for the same inputs.
GB=GradientBoostingRegressor(n_estimators=3)
GB.fit(X,Y)
Y_predict=GB.predict(X) #ages predicted by model with 3 estimators
Y_predict
# Output
#Y_predict=[36.907, 34.3325, 34.3325, 43.0045, 36.907 , 46.663 , 43.0045, 46.663 , 50.186]
#Following code is used to find out MSE of prediction with Gradient boosting algorithm having estimator 3.
MSE_3=(sum((Y-Y_predict)**2))/len(Y)
print('MSE for three estimators :',MSE_3)
#Output: MSE for three estimators : 380.05602055555556
# 3) Let us now use GradientBoostingRegressor with 50 estimators to train the model and to predict the age for the same inputs.
GB=GradientBoostingRegressor(n_estimators=50)
GB.fit(X,Y)
Y_predict=GB.predict(X) #ages predicted by model with 50 estimators
Y_predict
# Output
#Y_predict=[25.08417833, 15.63313919, 15.63313919, 47.46821839, 25.08417833, 60.89864242, 47.46821839, 60.89864242, 73.83164334]
#Following code is used to find out MSE of prediction with Gradient boosting algorithm having estimator 50.
MSE_50=(sum((Y-Y_predict)**2))/len(Y)
print('MSE for fifty estimators :',MSE_50)
#Output: MSE for fifty estimators: 156.5667260994211
As we can see here, MSE reduces as we increase the estimator value. The situation comes where MSE becomes saturated which means even if we increase the estimator value there will be no significant decrease in MSE.
Let us now see, how GridSearchCV can be used to find the best estimator for the above example.
from sklearn.model_selection import GridSearchCV
model=GradientBoostingRegressor()
params={'n_estimators':range(1,200)}
grid=GridSearchCV(estimator=model,cv=2,param_grid=params,scoring='neg_mean_squared_error')
grid.fit(X,Y)
print("The best estimator returned by GridSearch CV is:",grid.best_estimator_)
#Output
#The best estimator returned by GridSearch CV is: GradientBoostingRegressor(n_estimators=19)
GB=grid.best_estimator_
GB.fit(X,Y)
Y_predict=GB.predict(X)
Y_predict
#output:
#Y_predict=[27.20639114, 18.98970027, 18.98970027, 46.66697477, 27.20639114,58.34332496, 46.66697477, 58.34332496, 69.58721772]
MSE_best=(sum((Y-Y_predict)**2))/len(Y)
print('MSE for best estimators :',MSE_best)
#Following code is used to find out MSE of prediction for Gradient boosting algorithm with best estimator value given by GridSearchCV
#Output: MSE for best estimators : 164.2298548605391
You may think that MSE for n_estimator=50 is better than MSE for n_estimator=19. Still GridSearchCV returns 19 not 50. Actually, we can observe here is until 19 with each increment in estimator value the reduction in MSE was significant, but after 19 there is no significant decrease in MSE with increment in estimators. So, n_estimator=19 was returned by GridSearchSV.
This article explains how Gradient Boosting works using a simple example. The Gradient Boosting Algorithm is primarily used to reduce Bias error and includes models like GradientBoostingRegressor (using MSE as the cost function) and GradientBoostingClassifier (using Log-Loss for classification). A key aspect of this algorithm is determining the optimal value for n_estimators, which can be achieved using GridSearchCV. After reading, it is recommended to practice applying this algorithm and then learn how to tune its hyper-parameters effectively.
A. Gradient boosting, an ensemble machine learning technique, constructs a robust predictive model by sequentially combining multiple weak models. It minimizes errors using a gradient descent-like approach during training.
A. In regression, gradient boosting fits a series of weak learners, typically decision trees, to residuals, gradually reducing prediction errors through model updates.
A. Light Gradient Boosting, or LightGBM, employs a tree-based learning algorithm, utilizing a histogram-based method for faster training and reduced memory usage.
A. Gradient boosting reduces bias by iteratively adding weak models, focusing on remaining errors and emphasizing previously mispredicted instances. This sequential process refines predictions, ultimately minimizing bias in the final model
Very informative article, please also write a detailed article for classification using log-loss.