A Comprehensive Guide to Ensemble Learning (with Python codes)

Aishwarya Singh Last Updated : 03 Jan, 2025


Ensemble techniques in machine learning function much like seeking advice from multiple sources before making a significant decision, such as purchasing a car. Just as you wouldn’t rely solely on one opinion, ensemble models combine predictions from multiple base models to enhance overall performance. One popular method, majority voting, aggregates predictions to select the class label by majority. This tutorial explores ensemble learning concepts, including bootstrap sampling to train models on different subsets, the role of predictors in building diverse models, and practical implementation in Python using scikit-learn.

It addresses binary classification scenarios and shows how ensemble methods help manage high variance when mining data for valuable patterns. Additionally, the tutorial covers the hyperparameters used to tune ensemble performance, providing a comprehensive foundation for leveraging ensemble learning effectively.


Learning Outcomes

Understand the principles and algorithms behind various machine learning models and neural networks.
Implement machine learning models and neural networks to solve real-world problems.
Analyze and interpret the predictive performance of machine learning models and neural networks.
Explain the concept of stacked generalization and its use in improving model accuracy.
Perform exploratory data analysis and data preprocessing to prepare datasets for modeling.
Apply feature selection methods to reduce dimensionality and improve model efficiency.
Compare and contrast different learning methods, such as supervised, unsupervised, and reinforcement learning.
Implement popular ensemble techniques, including bagging, boosting, and stacking, to improve model performance.

This article was published as a part of the Data Science Blogathon.

What is Ensemble Learning with example?
Simple Ensemble Techniques
Advanced Ensemble techniques
Algorithms based on Bagging and Boosting
Conclusion

What is Ensemble Learning with example?

Ensemble learning is a machine learning technique that enhances accuracy and resilience in forecasting by merging predictions from multiple models. It aims to mitigate errors or biases that may exist in individual models by leveraging the collective intelligence of the ensemble.

The underlying concept behind ensemble learning is to combine the outputs of diverse models to create a more precise prediction. By considering multiple perspectives and utilizing the strengths of different models, ensemble learning improves the overall performance of the learning system. This approach not only enhances accuracy but also provides resilience against uncertainties in the data. By effectively merging predictions from multiple models, ensemble learning has proven to be a powerful tool in various domains, offering more robust and reliable forecasts.

Let’s understand the concept of ensemble learning with an example. Suppose you are a movie director and you have created a short movie on a very important and interesting topic. Now, you want to take preliminary feedback (ratings) on the movie before making it public.

Possible Ways

You may ask one of your friends to rate the movie for you.
Now it’s entirely possible that the person you have chosen loves you very much and doesn’t want to break your heart by providing a 1-star rating to the horrible work you have created.
Another way could be by asking 5 colleagues of yours to rate the movie.
This should provide a better idea of the movie. This method may provide honest ratings for your movie. But a problem still exists. These 5 people may not be “Subject Matter Experts” on the topic of your movie. Sure, they might understand the cinematography, the shots, or the audio, but at the same time may not be the best judges of dark humor.
How about asking 50 people to rate the movie?
Some of them can be your friends, some can be your colleagues, and some may even be total strangers.

The responses, in this case, would be more generalized and diversified since now you have people with different sets of skills. And as it turns out – this is a better approach to get honest ratings than the previous cases we saw.

With these examples, you can infer that a diverse group of people is likely to make better decisions than any single individual. The same holds for a diverse set of models compared to a single model. Ensemble learning, a technique in machine learning, achieves this diversification.

Now that you have a gist of ensemble learning, let’s explore various techniques in ensemble learning along with their implementations.

Simple Ensemble Techniques

In this section, we will look at a few simple but powerful techniques, namely:

Max Voting
Averaging
Weighted Averaging

Max Voting

The max voting method generally serves classification problems. In this technique, multiple models make predictions for each data point. Each model’s predictions count as a ‘vote.’ The majority of the models’ predictions determine the final prediction.

For example, when you asked 5 of your colleagues to rate your movie (out of 5); we’ll assume three of them rated it as 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating will be taken as 4. You can consider this as taking the mode of all the predictions.

The result of max voting would be something like this:

Colleague 1	Colleague 2	Colleague 3	Colleague 4	Colleague 5	Final rating
5	4	5	4	4	4

Sample Code:

Here, x_train holds the independent variables of the training data and y_train is the corresponding target variable. The validation set is x_test (independent variables) and y_test (target variable).

# IMPORTS
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import statistics as st
import warnings
warnings.filterwarnings('ignore')

# SPLITTING THE DATASET
df = pd.read_csv('heart.csv')
x = df.drop('target', axis = 1)
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# MODELS CREATION
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

# PREDICTION
pred1=model1.predict(x_test)
pred2=model2.predict(x_test)
pred3=model3.predict(x_test)

# FINAL_PREDICTION
final_pred = np.array([])
for i in range(0,len(x_test)):
    final_pred = np.append(final_pred, st.mode([pred1[i], pred2[i], pred3[i]]))
print(final_pred)

Alternatively, you can use the VotingClassifier class in sklearn as follows:

from sklearn.ensemble import VotingClassifier
model1 = LogisticRegression(random_state=1)
model2 = DecisionTreeClassifier(random_state=1)
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train,y_train)
model.score(x_test,y_test)
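
For readers who do not have the heart.csv file at hand, here is a minimal self-contained sketch of hard voting. The synthetic dataset from make_classification, the variable names and the parameter values are assumptions made for illustration; they are not part of the original example. The sketch also recomputes the majority vote by hand to show what the ensemble does with the individual predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

estimators = [
    ('dt', DecisionTreeClassifier(random_state=1)),
    ('knn', KNeighborsClassifier()),
    ('lr', LogisticRegression(max_iter=1000)),
]
voter = VotingClassifier(estimators=estimators, voting='hard')
voter.fit(X_tr, y_tr)

# Recompute the majority vote by hand from the fitted base models
votes = np.array([est.predict(X_te) for est in voter.estimators_])
manual_vote = (votes.sum(axis=0) >= 2).astype(int)  # class 1 wins with 2 of 3 votes

print('Ensemble accuracy:', voter.score(X_te, y_te))
print('Matches manual vote:', np.array_equal(manual_vote, voter.predict(X_te)))
```

Using an odd number of base models on a two-class problem means a vote can never tie.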

Averaging

Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction. Use averaging to make predictions in regression problems or to calculate probabilities for classification problems.

For example, in the below case, the averaging method would take the average of all the values.

i.e. (5+4+5+4+4)/5 = 4.4

Colleague 1	Colleague 2	Colleague 3	Colleague 4	Colleague 5	Final rating
5	4	5	4	4	4.4

Sample Code:

model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1+pred2+pred3)/3

Weighted Averaging

This is an extension of the averaging method. Each model is assigned a weight that defines its importance for the prediction. For instance, if two of your colleagues are critics while the others have no prior experience in this field, the answers from these two colleagues are given more importance than the others.

The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.

Quantity	Colleague 1	Colleague 2	Colleague 3	Colleague 4	Colleague 5	Final rating
weight	0.23	0.23	0.18	0.18	0.18	
rating	5	4	5	4	4	4.41
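
The arithmetic of both the simple and the weighted average can be checked in a few lines of NumPy. This is only a sketch of the calculation for the ratings and weights in the tables above; the variable names are illustrative.

```python
import numpy as np

ratings = np.array([5, 4, 5, 4, 4])
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])  # weights from the table; they sum to 1

simple_avg = ratings.mean()                          # every colleague counts equally
weighted_avg = np.average(ratings, weights=weights)  # critics count for more

print(round(simple_avg, 2))    # 4.4
print(round(weighted_avg, 2))  # 4.41
```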

Sample Code:

model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1*0.3+pred2*0.3+pred3*0.4)
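
The averaged result is a matrix of class probabilities, so one more step is needed to obtain class labels. The sketch below shows one way to do this on a synthetic dataset; the data, the variable names and the reuse of the 0.3/0.3/0.4 weights are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = [
    DecisionTreeClassifier(random_state=1),
    KNeighborsClassifier(),
    LogisticRegression(max_iter=1000),
]
weights = [0.3, 0.3, 0.4]  # illustrative weights, one per model

# Weighted average of the probability matrices, shape (n_samples, n_classes)
probas = [m.fit(X_tr, y_tr).predict_proba(X_te) for m in models]
avg_proba = np.average(probas, axis=0, weights=weights)

# Pick the most probable class; classes_ maps column positions back to labels
labels = models[0].classes_[np.argmax(avg_proba, axis=1)]
print('Weighted-average accuracy:', np.mean(labels == y_te))
```

Because the weights sum to 1, each row of avg_proba is still a valid probability distribution.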

Advanced Ensemble techniques

Now that we have covered the basic ensemble techniques, let’s move on to understanding the advanced techniques.

Stacking

Stacking is an ensemble learning technique that uses predictions from multiple models (for example decision tree, knn or svm) to build a new model. This model makes predictions on the test set. Below is a step-by-step explanation of a simple stacked ensemble:

1. The train set is split into 10 parts.
2. A base model, such as a decision tree, is fitted on nine parts and makes predictions for the tenth part. This is repeated for each part of the train set.
3. The base model (in this case, the decision tree) is then fitted on the whole train set.
4. Using this model, predictions are made on the test set.
5. Steps 2 to 4 are repeated for another base model, such as KNN, resulting in another set of predictions for the train set and the test set.
6. The predictions from the train set are used as features to build a new model.
7. This model is used to make final predictions on the test prediction set.

Sample code:

We first define a function to make predictions on n-folds of train and test dataset. This function returns the predictions for train and test for each model.

from sklearn.model_selection import StratifiedKFold

def Stacking(model,train,y,test,n_fold):
    # shuffle=True is required when random_state is set
    folds=StratifiedKFold(n_splits=n_fold,shuffle=True,random_state=1)
    train_pred=np.zeros(train.shape[0])
    for train_indices,val_indices in folds.split(train,y.values):
        x_tr,x_val=train.iloc[train_indices],train.iloc[val_indices]
        y_tr=y.iloc[train_indices]

        model.fit(X=x_tr,y=y_tr)
        # store out-of-fold predictions at their original row positions
        train_pred[val_indices]=model.predict(x_val)
    # refit on the whole train set before predicting the test set
    model.fit(X=train,y=y)
    test_pred=model.predict(test)
    return test_pred.reshape(-1,1),train_pred

Now we’ll create two base models – decision tree and knn.

model1 = DecisionTreeClassifier(random_state=1)

test_pred1 ,train_pred1=Stacking(model=model1,n_fold=10, train=x_train,test=x_test,y=y_train)

train_pred1=pd.DataFrame(train_pred1)
test_pred1=pd.DataFrame(test_pred1)

model2 = KNeighborsClassifier()

test_pred2 ,train_pred2=Stacking(model=model2,n_fold=10,train=x_train,test=x_test,y=y_train)

train_pred2=pd.DataFrame(train_pred2)
test_pred2=pd.DataFrame(test_pred2)

Create a third model, logistic regression, on the predictions of the decision tree and knn models.

df = pd.concat([train_pred1, train_pred2], axis=1)
df_test = pd.concat([test_pred1, test_pred2], axis=1)

model = LogisticRegression(random_state=1)
model.fit(df,y_train)
model.score(df_test, y_test)

To keep the explanation simple, the stacking model we have created has only two levels. The decision tree and KNN models are built at level zero, while the logistic regression model is built at level one. Feel free to create multiple levels in a stacking model.
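
scikit-learn (version 0.22 and later) also ships a StackingClassifier that handles the cross-validated level-zero predictions internally. The following is a minimal sketch on synthetic data, not a reproduction of the code above: the dataset and parameter values are assumptions, and by default StackingClassifier feeds class probabilities rather than predicted labels to the final estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=1)),   # level-zero model
        ('knn', KNeighborsClassifier()),                  # level-zero model
    ],
    final_estimator=LogisticRegression(),                 # level-one model
    cv=10,                                                # 10 folds, as in the steps above
)
stack.fit(X_tr, y_tr)

# transform() returns the level-zero outputs that the final estimator sees
meta_features = stack.transform(X_te)
print('Meta-feature shape:', meta_features.shape)
print('Stacked accuracy  :', stack.score(X_te, y_te))
```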

Blending

Blending follows the same approach as stacking but uses only a holdout (validation) set from the train set to make predictions. In other words, unlike stacking, you make predictions only on the holdout set. Use the holdout set and the predictions to build a model that runs on the test set. Here is a detailed explanation of the blending process:

The train set is split into training and validation sets.

Model(s) are fitted on the training set.
The predictions are made on the validation set and the test set.

The validation set and its predictions are used as features to build a new model.
This model is used to make final predictions on the test set and its meta-features.

Sample Code:

We’ll build two models, decision tree and knn, on the train set in order to make predictions on the validation set.

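# hold out part of the train set as the validation set
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=0)

model1 = DecisionTreeClassifier()
model1.fit(x_train, y_train)
# keep the original row index so the predictions line up in pd.concat
val_pred1=pd.DataFrame(model1.predict(x_val), index=x_val.index, columns=['dt_pred'])
test_pred1=pd.DataFrame(model1.predict(x_test), index=x_test.index, columns=['dt_pred'])

model2 = KNeighborsClassifier()
model2.fit(x_train,y_train)
val_pred2=pd.DataFrame(model2.predict(x_val), index=x_val.index, columns=['knn_pred'])
test_pred2=pd.DataFrame(model2.predict(x_test), index=x_test.index, columns=['knn_pred'])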

Combine the meta-features and the validation set to build a logistic regression model that makes predictions on the test set.

df_val=pd.concat([x_val, val_pred1,val_pred2],axis=1)
df_test=pd.concat([x_test, test_pred1,test_pred2],axis=1)

model = LogisticRegression()
model.fit(df_val,y_val)
model.score(df_test,y_test)

Bagging

The idea behind bagging is combining the results of multiple models (for instance, all decision trees) to get a generalized result. Here’s a question: If you create all the models on the same set of data and combine it, will it be useful? There is a high chance that these models will give the same result since they are getting the same input. So how can we solve this problem? One of the techniques is bootstrapping.

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of the subsets is the same as the size of the original set.

The Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to provide a fair idea of the distribution (complete set). The size of subsets created for bagging may be less than the original set.

Create multiple subsets from the original dataset by selecting observations with replacement.
A base model (weak model) is created on each of these subsets.
The models run in parallel and are independent of each other.
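
The three steps above can be sketched by hand with NumPy and a decision tree. This is an illustrative sketch only: the synthetic data, the number of models and the variable names are assumptions, and each bag is drawn at the full size of the train set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd number, so a two-class vote can never tie
all_preds = []
for _ in range(n_models):
    # Bootstrap: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree_model = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree_model.predict(X_te))

# Aggregate: majority vote across the independently trained trees
all_preds = np.array(all_preds)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
print('Bagged accuracy:', np.mean(bagged_pred == y_te))
```

Since no tree depends on another, the loop body could run in parallel, which is what the n_jobs parameter of the scikit-learn bagging estimators controls.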

Boosting

Let’s consider another question before we go further: If the first model, and then the next (or probably all models), incorrectly predict a data point, will combining predictions improve results? Boosting addresses such situations.

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model. Let’s understand the way boosting works in the below steps.

A subset is created from the original dataset.
Initially, all data points are given equal weights.
A base model is created on this subset.
This model is used to make predictions on the whole dataset.

Errors are calculated using the actual values and predicted values.
The observations which are incorrectly predicted, are given higher weights.
(Here, the three misclassified blue-plus points will be given higher weights)
Another model is created and predictions are made on the dataset.
(This model tries to correct the errors from the previous model)

Similarly, multiple models are created, each correcting the errors of the previous model.
The final model (strong learner) is the weighted mean of all the models (weak learners).

Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they work well for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.
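
A hand-written, AdaBoost-style sketch of this reweighting loop is shown below. It is a simplification under stated assumptions: the data is synthetic, the classes are relabelled as -1 and +1, each weak learner is a depth-1 decision tree, and the weights are applied to the full dataset through sample_weight rather than by drawing a subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption), with classes relabelled as -1 / +1
X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y01 - 1

n_rounds = 20
w = np.full(len(X), 1 / len(X))  # initially, all points get equal weight
stumps, alphas = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    # weighted error rate, clipped to keep the logarithm finite
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this model's say in the final vote
    w = w * np.exp(-alpha * y * pred)      # misclassified points gain weight
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Strong learner: sign of the alpha-weighted sum of the weak predictions
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(scores)
print('First stump accuracy:', np.mean(stumps[0].predict(X) == y))
print('Ensemble accuracy   :', np.mean(ensemble_pred == y))
```

Both accuracies are measured on the training data, so they show how the ensemble fits the data rather than how well it generalizes.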

Algorithms based on Bagging and Boosting

Bagging and Boosting are two of the most commonly used techniques in machine learning. In this section, we will look at them in detail. Following are the algorithms we will be focusing on:

Bagging algorithms:

Bagging meta-estimator
Random forest

Boosting algorithms:

AdaBoost
GBM
XGBoost
Light GBM
CatBoost

For all the algorithms discussed in this section, we will follow this procedure:

Introduction to the algorithm
Sample code
Parameters

For this article, I have used the Loan Prediction Problem. You can download the dataset from here. Please note that a few code lines (reading the data, splitting into train-test sets, etc.) will be the same for each algorithm. In order to avoid repetition, I have written the code for the same below, and further discussed only the code for the algorithm.

#importing important packages
import pandas as pd
import numpy as np

#reading the dataset
df=pd.read_csv("/home/user/Desktop/train.csv")

#filling missing values
df['Gender'] = df['Gender'].fillna('Male')

Similarly, fill values for all the columns. EDA, missing value treatment and outlier treatment have been skipped for the purposes of this article. To understand these topics, you can go through this article: Ultimate guide for Data Exploration in Python using NumPy, Matplotlib and Pandas.

#split dataset into train and test

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3, random_state=0)

x_train=train.drop('Loan_Status',axis=1)
y_train=train['Loan_Status']

x_test=test.drop('Loan_Status',axis=1)
y_test=test['Loan_Status']

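#create dummies
x_train=pd.get_dummies(x_train)
x_test=pd.get_dummies(x_test)
#make sure the test set has the same dummy columns as the train set
x_test=x_test.reindex(columns=x_train.columns, fill_value=0)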

Let’s jump into the bagging and boosting algorithms!

Bagging meta-estimator

The bagging meta-estimator serves as an ensembling algorithm that you can use for both classification (BaggingClassifier) and regression (BaggingRegressor) problems. It follows the typical bagging technique to make predictions. Following are the steps for the bagging meta-estimator algorithm:

Random subsets are created from the original dataset (Bootstrapping).
The subset of the dataset includes all features.
A user-specified base estimator is fitted on each of these smaller sets.
Predictions from each model are combined to get the final result.

Python Code

from sklearn.ensemble import BaggingClassifier
from sklearn import tree
model = BaggingClassifier(tree.DecisionTreeClassifier(random_state=1))
model.fit(x_train, y_train)
model.score(x_test,y_test)
0.75135135135135134

Sample code for regression problem

from sklearn.ensemble import BaggingRegressor
model = BaggingRegressor(tree.DecisionTreeRegressor(random_state=1))
model.fit(x_train, y_train)
model.score(x_test,y_test)

Parameters used in the algorithms

base_estimator:
- It defines the base estimator to fit on random subsets of the dataset. In scikit-learn 1.2 and later, this parameter is named estimator.
- When nothing is specified, the base estimator is a decision tree.
n_estimators:
- It is the number of base estimators to be created.
- Carefully tune the number of estimators because a large number will take a very long time to run, while a very small number might not provide the best results.
max_samples:
- This parameter controls the size of the subsets.
- It is the maximum number of samples to train each base estimator.
max_features:
- Controls the number of features to draw from the whole dataset.
- It defines the maximum number of features required to train each base estimator.
n_jobs:
- The number of jobs to run in parallel.
- Set this value equal to the cores in your system.
- If -1, the number of jobs is set to the number of cores.
random_state:
- It sets the seed for the random sampling. When the random_state value is the same for two models, the random selection is the same for both models.
- This parameter is useful when you want to compare different models.

Random Forest

Random Forest is another ensemble machine learning algorithm that follows the bagging technique. It is an extension of the bagging estimator algorithm. The base estimators in random forest are decision trees. Unlike the bagging meta-estimator, the random forest randomly selects a set of features to decide the best split at each node of the decision tree.

Looking at it step-by-step, this is what a random forest model does:

Random subsets are created from the original dataset (bootstrapping).
At each node in the decision tree, consider only a random set of features to decide the best split.
A decision tree model is fitted on each of the subsets.
Calculate the final prediction by averaging the predictions from all decision trees.

Note: The decision trees in a random forest can be built on a subset of data and features. Specifically, the sklearn implementation of random forest makes all features available to each decision tree, while a random subset of features is considered for splitting at each node.

To sum up, Random forest randomly selects data points and features, and builds multiple trees (Forest).

Python Code:

'''
The following code is for the Random Forest
Created by - ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# view the top 3 rows of the dataset
print(train_data.head(3))

# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)

# Now, we need to predict the missing target variable in the test data
# target variable - Survived

# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']

# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

'''

Create the object of the Random Forest model
You can also add other parameters and test your code here
Some parameters are : n_estimators and max_depth
Documentation of sklearn RandomForestClassifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

'''
model = RandomForestClassifier()

# fit the model with the training data
model.fit(train_x,train_y)

# number of trees used
print('Number of Trees used : ', model.n_estimators)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test) 

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

Parameters

n_estimators:
- It defines the number of decision trees to be created in a random forest.
- Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.
criterion:
- It defines the function that is to be used for splitting.
- The function measures the quality of a split for each feature and chooses the best split.
max_features :
- It defines the maximum number of features allowed for the split in each decision tree.
- Increasing max_features usually improves performance, but a very high number can decrease the diversity of the trees.
max_depth:
- Random forest has multiple decision trees. This parameter defines the maximum depth of the trees.
min_samples_split:
- Used to define the minimum number of samples required in a node before a split is attempted.
- If the number of samples is less than the required number, the node is not split.
min_samples_leaf:
- This defines the minimum number of samples required to be at a leaf node.
- Smaller leaf size makes the model more prone to capturing noise in train data.
max_leaf_nodes:
- This parameter specifies the maximum number of leaf nodes for each tree.
- The tree stops splitting when the number of leaf nodes becomes equal to the max leaf node.
n_jobs:
- This indicates the number of jobs to run in parallel.
- Set value to -1 if you want it to run on all cores in the system.
random_state:
- This parameter is used to define the random selection.
- It is used for comparison between various models.

AdaBoost

Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision trees are used as the base models. The process creates multiple sequential models, with each model correcting errors from the previous one. AdaBoost assigns higher weights to incorrectly predicted observations, and the subsequent model aims to predict these values accurately.

Below are the steps for performing the AdaBoost algorithm:

Initially, all observations in the dataset are given equal weights.
A model is built on a subset of data.
Using this model, predictions are made on the whole dataset.
Errors are calculated by comparing the predictions and actual values.
While creating the next model, assign higher weights to the data points that the model predicted incorrectly.
Determine weights using the error value. For instance, a higher error assigns more weight to the observation.
Repeat this process until the error function no longer changes or you reach the maximum limit of the number of estimators.

Python Code

from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(random_state=1)
model.fit(x_train, y_train)
model.score(x_test,y_test)
0.81081081081081086

Sample code for regression problem

from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor()
model.fit(x_train, y_train)
model.score(x_test,y_test)

Parameters

base_estimator:
- Specifies the type of base estimator, that is, the machine learning algorithm used as the base learner. In scikit-learn 1.2 and later, this parameter is named estimator.
n_estimators:
- It defines the number of base estimators.
- The default value is 50; a higher value can give better performance at the cost of training time.
learning_rate:
- This parameter controls the contribution of the estimators in the final combination.
- There is a trade-off between learning_rate and n_estimators.
max_depth:
- Defines the maximum depth of the individual estimator. It is set on the base estimator (for example, the decision tree), not on AdaBoost itself.
- Tune this parameter for best performance.
n_jobs
- Specifies the number of processors it is allowed to use.
- Set value to -1 for maximum processors allowed.
- Note that scikit-learn's AdaBoostClassifier does not accept n_jobs itself; the setting applies only where the base estimator supports it.
random_state :
- An integer value used as the seed for the random number generator.
- A definite value of random_state will always produce same results if given with same parameters and training data.

Gradient Boosting (GBM)

Gradient Boosting or GBM is another ensemble machine learning algorithm that works for both regression and classification problems. GBM uses the boosting technique, combining a number of weak learners to form a strong learner. Regression trees are used as base learners, and each subsequent tree in the series is built on the errors calculated by the previous tree.

We will use a simple example to understand the GBM algorithm. We have to predict the age of a group of people using the below data:

1. Assume the mean age as the predicted value for all observations in the dataset (prediction 1).
2. Calculate the errors using this mean prediction and the actual values of age.
3. A tree model is created using the errors calculated above as the target variable. Our objective is to find the best split to minimize the error.
4. The predictions by this model are combined with prediction 1.
5. The value calculated above is the new prediction.
6. New errors are calculated using this predicted value and the actual value.
7. Repeat steps 2 to 6 until you reach the maximum number of iterations or the error function no longer changes.
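
The loop described above can be sketched by hand for a squared-error regression. Everything specific here is an assumption for illustration: the toy ages and the two made-up binary features are hypothetical (they are not the article's data), each tree is a depth-1 regression tree, and a learning rate of 0.5 is used to shrink each tree's contribution.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: columns are two made-up yes/no features
X = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]])
age = np.array([13.0, 14.0, 15.0, 25.0, 35.0, 49.0])

learning_rate = 0.5
prediction = np.full(len(age), age.mean())  # step 1: start from the mean age
initial_mse = np.mean((age - prediction) ** 2)

trees = []
for _ in range(10):
    residuals = age - prediction            # steps 2 and 6: current errors
    tree_model = DecisionTreeRegressor(max_depth=1, random_state=0)
    tree_model.fit(X, residuals)            # step 3: fit a tree to the errors
    # steps 4 and 5: add the tree's (shrunken) output to get the new prediction
    prediction = prediction + learning_rate * tree_model.predict(X)
    trees.append(tree_model)

final_mse = np.mean((age - prediction) ** 2)
print('MSE with mean only :', initial_mse)
print('MSE after boosting :', final_mse)
```

A smaller learning rate makes each step more cautious, which usually means more trees are needed to reach the same training error.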

Python Code

from sklearn.ensemble import GradientBoostingClassifier
model= GradientBoostingClassifier(learning_rate=0.01,random_state=1)
model.fit(x_train, y_train)
model.score(x_test,y_test)
0.81621621621621621

Sample code for regression problem

from sklearn.ensemble import GradientBoostingRegressor
model= GradientBoostingRegressor()
model.fit(x_train, y_train)
model.score(x_test,y_test)

Parameters

min_samples_split
- Define the minimum number of samples (or observations) required in a node for consideration in splitting.
- Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
min_samples_leaf
- Defines the minimum samples required in a terminal or leaf node.
- Generally, lower values should be chosen for imbalanced class problems because the regions where the minority class becomes the majority will be very small.
min_weight_fraction_leaf
- Similar to min_samples_leaf but defined as a fraction of the total number of observations instead of an integer.
max_depth
- The maximum depth of a tree.
- Used to control over-fitting as higher depth will allow the model to learn relations very specific to a particular sample.
- Should be tuned using CV.
max_leaf_nodes
- The maximum number of terminal nodes or leaves in a tree.
- Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
- If this is defined, GBM will ignore max_depth.
max_features
- The number of features to consider while searching for the best split. These will be randomly selected.
- As a thumb-rule, the square root of the total number of features works great but we should check up to 30-40% of the total number of features.
- Higher values can lead to over-fitting but it generally depends on a case to case scenario.

XGBoost

XGBoost (extreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. It has proved to be a highly effective ML algorithm, extensively used in machine learning competitions and hackathons. XGBoost has high predictive power and is often reported to be considerably faster than other gradient boosting implementations (figures of almost 10 times are commonly quoted). It also includes a variety of regularization techniques that reduce overfitting and improve overall performance. Hence it is also known as a ‘regularized boosting’ technique.

Let us see how XGBoost is comparatively better than other techniques:

Regularization:
- Standard GBM implementation has no regularisation like XGBoost.
- Thus XGBoost also helps to reduce overfitting.
Parallel Processing:
- XGBoost implements parallel processing and is faster than GBM .
- XGBoost also supports implementation on Hadoop.
High Flexibility:
- XGBoost allows users to define custom optimization objectives and evaluation criteria adding a whole new dimension to the model.
Handling Missing Values:
- XGBoost has an in-built routine to handle missing values.
Tree Pruning:
- XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards and removes splits beyond which there is no positive gain.
Built-in Cross-Validation:
- XGBoost allows a user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.

Python Code

Since XGBoost takes care of the missing values itself, you do not have to impute the missing values. You can skip the step for missing value imputation from the code mentioned above. Follow the remaining steps as always and then apply xgboost as below.

import xgboost as xgb
model=xgb.XGBClassifier(random_state=1,learning_rate=0.01)
model.fit(x_train, y_train)
model.score(x_test,y_test)
0.82702702702702702

Sample code for regression problem

import xgboost as xgb
model=xgb.XGBRegressor()
model.fit(x_train, y_train)
model.score(x_test,y_test)

Parameters

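nthread
- Use this for parallel processing and enter the number of cores in the system. In the scikit-learn wrapper (XGBClassifier, XGBRegressor) the same setting is called n_jobs.
- If you wish to run on all cores, do not input this value. The algorithm will detect it automatically.
eta
- Analogous to learning rate in GBM. In the scikit-learn wrapper it is passed as learning_rate, as in the code above.
- Makes the model more robust by shrinking the weights on each step.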
min_child_weight
- Defines the minimum sum of weights of all observations required in a child.
- Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
max_depth
- It is used to define the maximum depth.
- Higher depth will allow the model to learn relations very specific to a particular sample.
max_leaves
- The maximum number of terminal nodes or leaves in a tree. This is the counterpart of max_leaf_nodes in GBM.
- Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
- If max_leaf_nodes is defined, GBM will ignore max_depth.
gamma
- Split a node only when the resulting split provides a positive reduction in the loss function.
- Gamma specifies the minimum loss reduction required to make a split.
- Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
subsample
- Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
- Lower values make the algorithm more conservative and prevent overfitting but values that are too small might lead to under-fitting.
colsample_bytree
- It is similar to max_features in GBM.
- Denotes the fraction of columns to be randomly sampled for each tree.
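To see what "shrinking the weights on each step" means for eta, here is a toy sketch of boosting under squared error where the weak learner is simply the mean of the current residuals. The function name is hypothetical and the learner is deliberately trivial; real boosting fits a tree at each round.

```python
def boost_with_constant_learner(targets, eta, n_rounds):
    # Start from a prediction of 0 and repeatedly correct part of the residual
    predictions = [0.0] * len(targets)
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(targets, predictions)]
        step = sum(residuals) / len(residuals)  # the "weak learner": mean residual
        # eta scales down how much of the correction is actually applied
        predictions = [p + eta * step for p in predictions]
    return predictions


targets = [10.0, 10.0, 10.0]
fast = boost_with_constant_learner(targets, eta=1.0, n_rounds=1)
slow = boost_with_constant_learner(targets, eta=0.1, n_rounds=1)
slow_many = boost_with_constant_learner(targets, eta=0.1, n_rounds=50)
```

With eta=1.0 a single round jumps straight to the target, while eta=0.1 moves only a tenth of the way per round and needs many more rounds to get close. This is why a small eta is usually paired with a larger number of boosting iterations.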

Light GBM

Before discussing how Light GBM works, let’s first understand why we need this algorithm when we have so many others (like the ones we have seen above). Light GBM's main advantage shows up when the dataset is extremely large: compared to the other algorithms, it takes less time to run on a huge dataset.

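LightGBM is a gradient boosting framework that uses tree-based algorithms and grows trees leaf-wise, while most other algorithms grow them level-wise. In level-wise growth, every node at the current depth is split before the tree moves one level deeper. In leaf-wise growth, only the leaf whose split reduces the loss the most is split next, so the tree can become deep and unbalanced.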

Leaf-wise growth may cause overfitting on smaller datasets, but you can avoid that by limiting the depth of the trees with the ‘max_depth’ parameter. You can read more about Light GBM and its comparison with XGB in this article.
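The difference between the two growth strategies can be sketched with a toy tree. In the snippet below, toy_gain is a made-up function that pretends splits along the left branch are the most valuable; the paths, gains and function names are all hypothetical and only serve to show the shape of the resulting trees.

```python
import heapq


def toy_gain(path):
    # Hypothetical loss reduction from splitting the node reached by `path`
    # ("L" = left, "R" = right). Left-heavy paths are given higher gains.
    return 1.0 / (1 + len(path) + 3 * path.count("R"))


def grow_leaf_wise(num_leaves):
    # Always split the single leaf with the highest gain (best-first growth)
    heap = [(-toy_gain(""), "")]
    while len(heap) < num_leaves:
        _, path = heapq.heappop(heap)
        for side in "LR":
            child = path + side
            heapq.heappush(heap, (-toy_gain(child), child))
    return sorted(path for _, path in heap)


def grow_level_wise(max_depth):
    # Split every leaf at the current depth before going one level deeper
    leaves = [""]
    for _ in range(max_depth):
        leaves = [path + side for path in leaves for side in "LR"]
    return leaves


leaf_wise_leaves = grow_leaf_wise(4)    # ['LLL', 'LLR', 'LR', 'R']
level_wise_leaves = grow_level_wise(2)  # ['LL', 'LR', 'RL', 'RR']
```

Both trees end up with four leaves, but the leaf-wise tree reaches depth 3 along its most promising branch, while the level-wise tree stays balanced at depth 2. That extra depth is what makes leaf-wise growth powerful on large datasets and prone to overfitting on small ones.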

Python Code

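import lightgbm as lgb
from sklearn.metrics import accuracy_score
train_data=lgb.Dataset(x_train,label=y_train)
#define parameters
params = {'learning_rate':0.001}
model= lgb.train(params, train_data, 100)
y_pred=model.predict(x_test)
#convert the predicted scores into class labels using a 0.5 threshold
for i in range(len(y_pred)):
    if y_pred[i]>=0.5:
        y_pred[i]=1
    else:
        y_pred[i]=0
accuracy_score(y_test,y_pred)
0.81621621621621621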

Sample code for regression problem

import lightgbm as lgb
from sklearn.metrics import mean_squared_error
train_data=lgb.Dataset(x_train,label=y_train)
params = {'learning_rate':0.001}
model= lgb.train(params, train_data, 100)
#predict on the test set before computing the error
y_pred=model.predict(x_test)
rmse=mean_squared_error(y_test,y_pred)**0.5

Parameters

num_iterations:
- It specifies the number of boosting iterations to perform.
num_leaves:
- This parameter sets the number of leaves to form in a tree.
- In case of Light GBM, since splitting takes place leaf-wise rather than depth-wise, num_leaves should be smaller than 2^(max_depth), otherwise, it may lead to overfitting.
min_data_in_leaf:
- It specifies the minimum number of data points a leaf must contain.
- A very small value may cause overfitting.
- It is also one of the most important parameters in dealing with overfitting.
max_depth:
- It specifies the maximum depth or level up to which a tree can grow.
- A very high value for this parameter can cause overfitting.
bagging_fraction:
- It specifies the fraction of data to use for each iteration.
- This parameter generally speeds up the training.
max_bin :
- It defines the maximum number of bins that feature values will be bucketed into.
- A smaller value of max_bin can save a lot of time as it buckets the feature values in discrete bins which is computationally inexpensive.
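The effect of max_bin can be illustrated with a simple equal-width bucketing of one numeric feature. This is only a sketch with a hypothetical helper and made-up values; LightGBM builds its histogram bins with its own routine, which is not reproduced here.

```python
def bin_feature(values, max_bin):
    # Bucket raw feature values into at most `max_bin` equal-width bins
    lo, hi = min(values), max(values)
    width = (hi - lo) / max_bin
    if width == 0:
        return [0 for _ in values]  # constant feature: everything in one bin
    # The largest value would land in bin `max_bin`, so clamp it to the last bin
    return [min(int((v - lo) / width), max_bin - 1) for v in values]


values = [0.5, 1.2, 3.3, 4.8, 7.1, 7.9, 9.6, 10.0]
binned = bin_feature(values, max_bin=4)

# Number of split points a tree would have to evaluate on this feature
candidate_splits_raw = len(set(values)) - 1      # between every pair of distinct values
candidate_splits_binned = len(set(binned)) - 1   # only between bins
```

On this toy feature the number of candidate split points drops from 7 to 3. On a feature with millions of distinct values the saving is far larger, which is where the speed-up from a smaller max_bin comes from, at the cost of coarser splits.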

CatBoost

Handling categorical variables is a tedious process, especially when you have a large number of such variables. When your categorical variables have too many labels (i.e. they have high cardinality), performing one-hot-encoding on them adds one new column per label, so the dimensionality grows quickly and it becomes really difficult to work with the dataset.
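A small sketch shows how one-hot encoding widens the data. The helper and the city names below are made up purely for illustration.

```python
def one_hot(values):
    # One new 0/1 column per distinct label
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


cities = ["Delhi", "Mumbai", "Pune", "Delhi", "Chennai", "Mumbai"]
categories, encoded = one_hot(cities)

n_new_columns = len(categories)  # one column for each distinct city
```

A single column with 4 distinct labels becomes 4 columns. A column with 10,000 distinct labels would become 10,000 mostly-zero columns, which is the situation CatBoost is designed to avoid.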

CatBoost can automatically deal with categorical variables and does not require extensive data preprocessing like other machine learning algorithms. Here is an article that explains CatBoost in detail.
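One of the ideas behind CatBoost's handling of categorical variables is to replace each label with a statistic of the target, computed only from the rows that come before the current one. Below is a heavily simplified sketch of such an ordered target encoding for a binary target. The function name and the prior value are assumptions made for this example, and the real library additionally works with random permutations of the data.

```python
def ordered_target_encoding(categories, targets, prior=0.5):
    # For each row, use only the targets of EARLIER rows with the same label,
    # so a row never sees its own target value.
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior) / (n + 1))  # smoothed mean of earlier targets
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


cats = ["A", "B", "A", "A", "B"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_encoding(cats, ys)
# enc is approximately [0.5, 0.5, 0.75, 0.833, 0.25]
```

Each categorical column stays a single numeric column no matter how many labels it has, and because only earlier rows are used, the encoding does not leak a row's own target into its feature.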

Python Code

CatBoost algorithm effectively deals with categorical variables. Thus, you should not perform one-hot encoding for categorical variables. Just load the files, impute missing values, and you’re good to go.

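import numpy as np
from catboost import CatBoostClassifier
model=CatBoostClassifier()
#indices of the non-float columns in df (np.float has been removed from recent NumPy versions, the built-in float works instead)
categorical_features_indices = np.where(df.dtypes != float)[0]
#the categorical column indices are hard-coded here for the dataset used in this article
model.fit(x_train,y_train,cat_features=([ 0,  1, 2, 3, 4, 10]),eval_set=(x_test, y_test))
model.score(x_test,y_test)
0.80540540540540539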

Sample code for regression problem

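import numpy as np
from catboost import CatBoostRegressor
model=CatBoostRegressor()
#indices of the non-float columns in df (use the built-in float, np.float has been removed from recent NumPy versions)
categorical_features_indices = np.where(df.dtypes != float)[0]
model.fit(x_train,y_train,cat_features=([ 0,  1, 2, 3, 4, 10]),eval_set=(x_test, y_test))
model.score(x_test,y_test)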

Parameters

loss_function:
- Defines the metric to be used for training.

iterations:
- The maximum number of trees that can be built.
- The final number of trees may be less than or equal to this number.
learning_rate:
- Defines the learning rate.
- Used for reducing the gradient step.
border_count:
- It specifies the number of splits for numerical features.
- It is similar to the max_bin parameter.
depth:
- Defines the depth of the trees.
random_seed:
- This parameter is similar to the ‘random_state’ parameter we have seen previously.
- It is an integer value to define the random seed for training.
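One common reason the final number of trees can be smaller than iterations is early stopping on a validation set. The sketch below shows the general idea with a hypothetical helper and made-up validation losses. It is not CatBoost's own overfitting detector, which is configured through the library's parameters.

```python
def best_iteration(validation_losses, patience):
    # Keep the trees up to the best validation loss, and stop looking once
    # `patience` rounds have passed without any improvement.
    best_index, best_loss = 0, float("inf")
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss = i, loss
        elif i - best_index >= patience:
            break
    return best_index + 1  # number of trees kept


# Illustrative validation loss after each boosting iteration
losses = [0.60, 0.52, 0.47, 0.45, 0.46, 0.47, 0.48, 0.44]
trees_kept = best_iteration(losses, patience=3)
```

Although 8 iterations were allowed, the loss stops improving after the 4th tree and training halts three rounds later, so only 4 trees are kept. A larger patience would have let training reach the later improvement at the 8th iteration.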

This brings us to the end of the ensemble algorithms section. We have covered quite a lot in this article!

Conclusion

Ensemble modeling can substantially boost the performance of your model and can sometimes be the deciding factor between first place and second! In this article, we covered various ensemble learning techniques and saw how these techniques are applied in machine learning algorithms. Further, we implemented the algorithms on our loan prediction dataset.

I hope this article has given you a solid understanding of ensemble learning. If you have any suggestions or questions, do share in the comment section below. Also, I encourage you to implement these algorithms at your end and share your results with us!


Frequently Asked Questions

Q1. What is bagging and boosting in machine learning?

A. Bagging and boosting are ensemble learning techniques in machine learning. Bagging trains multiple models on different subsets of training data with replacement and combines their predictions to reduce variance and improve generalization. Boosting combines multiple weak learners to create a strong learner by focusing on misclassified data points and assigning higher weights in the next iteration. Examples of bagging algorithms include Random Forest while boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.

Q2. What is the difference between bagging and boosting?

A. Bagging reduces variance by training multiple models independently on different subsets of training data and combining their predictions, while boosting reduces bias by iteratively training weak learners and focusing on misclassified data points to create a strong learner. Random Forest is a popular bagging algorithm, while AdaBoost, Gradient Boosting, and XGBoost are popular boosting algorithms.
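The "focusing on misclassified data points" part of boosting can be sketched with the weight update used in AdaBoost. The helper name below is hypothetical, and it assumes the weights sum to 1 and that the weak learner's weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, misclassified):
    # Weighted error of the current weak learner
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Say of the learner: the lower its error, the larger alpha
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weights of misclassified points, decrease the others
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = adaboost_reweight(weights, [False, False, False, True])
```

After the update, the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: every model is trained independently on its own bootstrap sample.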

Q3. Which are the three types of ensemble learning?

A. The three main types of ensemble learning are:
1. Bagging: Creates multiple models from different training data samples, reducing variance.
2. Boosting: Creates multiple models with weighted training data, reducing bias.
3. Stacking: Combines predictions of multiple models using a meta-model, enhancing generalization.

The media in this article does not belong to Analytics Vidhya and the author uses it at their discretion.


