What is XGBoost Algorithm?

Aayush Tyagi Last Updated : 28 Feb, 2025

XGBoost is a machine learning algorithm from the ensemble learning family, specifically the gradient boosting framework. It uses decision trees as base learners and applies regularization to improve generalization. XGBoost is known for its computational efficiency, built-in feature importance analysis, and native handling of missing values. It is a popular choice for a wide range of tasks, including regression, classification, and ranking.

In this article, you will learn how the XGBoost algorithm works, how to build an XGBoost classifier, and why it is useful in machine learning, along with a use case.

What is XGBoost in Machine Learning?
Why Ensemble Learning?
Demonstrating the Potential of Gradient Boosting
Using Gradient Descent for Optimizing the Loss Function
Unique Features of XGBoost Model
Python Code for XGBoost
XGBoost Model Benefits and Attributes
XGBoost vs Gradient Boosting
Difference between XGBoost and Random Forest
Conclusion
Frequently Asked Questions

What is XGBoost in Machine Learning?

XGBoost, or eXtreme Gradient Boosting, is a machine learning algorithm that falls under ensemble learning. It is popular for supervised learning tasks such as regression and classification. XGBoost builds a predictive model by iteratively combining the predictions of multiple individual models, typically decision trees.

The algorithm works by sequentially adding weak learners to the ensemble, with each new learner focusing on correcting the errors made by the existing ones. It uses a gradient descent optimization technique to minimize a predefined loss function during training.

Key features of XGBoost Algorithm include its ability to handle complex relationships in data, regularization techniques to prevent overfitting and incorporation of parallel processing for efficient computation.

Why Ensemble Learning?

XGBoost is an ensemble learning method. Sometimes, it may not be sufficient to rely upon the results of just one machine learning model. Ensemble learning offers a systematic way to combine the predictive power of multiple learners. The result is a single model that gives the aggregated output of several models.

The models that form the ensemble, also known as base learners, can come from the same learning algorithm or from different learning algorithms. Bagging and boosting are two widely used ensemble techniques. Although these techniques can be applied to many statistical models, they are most commonly used with decision trees.

Let’s briefly discuss bagging before taking a more detailed look at the concept of gradient boosting.

Bagging

While decision trees are one of the most easily interpretable models, they exhibit highly variable behavior. Consider a single training dataset that we randomly split into two parts. Now, let’s use each part to train a decision tree in order to obtain two models.

Fitted to different halves of the data, these two models would yield different results. This sensitivity to the training data is why decision trees are said to have high variance. Bagging, or bootstrap aggregation, helps to reduce the variance of a learner. In bagging, several decision trees are trained in parallel as base learners, each on data sampled with replacement from the training set. The final prediction is the averaged output of all the learners.
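The bagging procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is made up, the base learner is a one-split regression tree (a "stump") rather than a full decision tree, and the helper names `fit_stump` and `bagged_predictor` are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) by minimizing squared error."""
    best = None
    # Candidate thresholds: every distinct x except the smallest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:
        # All x values are identical, so no split is possible: predict the mean.
        overall = mean(ys)
        return lambda x: overall
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

def bagged_predictor(xs, ys, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples and average their outputs."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)

# Made-up toy data for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [12, 15, 14, 18, 30, 33, 31, 36, 35, 40]

predict = bagged_predictor(xs, ys)
print(predict(2), predict(9))
```

Each stump sees a slightly different bootstrap sample, so the individual split points vary; averaging them smooths out that variation.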

Boosting

In boosting, the trees are built sequentially so that each subsequent tree aims to reduce the errors of the previous tree. Each tree learns from its predecessors and updates the residual errors. Hence, the tree that grows next in the sequence will learn from an updated version of the residuals.

The base learners in boosting are weak learners in which the bias is high, and the predictive power is just a tad better than random guessing. Each of these weak learners contributes some vital information for prediction, enabling the boosting technique to produce a strong learner by effectively combining these weak learners. The final strong learner brings down both the bias and the variance.

In contrast to bagging techniques like Random Forest, boosting uses trees with fewer splits. Such small trees, which are not very deep, are highly interpretable. You can optimally select parameters like the number of trees or iterations, the learning rate of gradient boosting, and the depth of the tree through validation techniques like k-fold cross-validation. Having a large number of trees might lead to overfitting. So, it is necessary to carefully choose the stopping criteria for boosting.
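As a rough illustration of choosing the number of trees by validation, the sketch below boosts stumps on a training set and keeps the round with the lowest validation error. It is a simplified stand-in for the k-fold procedure mentioned above: it uses a single hand-made hold-out split instead of k folds, and the data, the learning rate of 0.5, and the helper names are all assumptions made for the example.

```python
from statistics import mean

def fit_stump(xs, targets):
    """One-split regression tree fit to the targets by least squares."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

def mse(ys, preds):
    return mean((y - p) ** 2 for y, p in zip(ys, preds))

# Made-up data, split by hand into training and validation parts
train_x, train_y = [1, 2, 4, 5, 7, 8, 10], [12, 15, 18, 30, 31, 36, 40]
valid_x, valid_y = [3, 6, 9], [14, 33, 35]

learning_rate, max_rounds = 0.5, 20
f0 = mean(train_y)
train_pred = [f0] * len(train_x)
valid_pred = [f0] * len(valid_x)
valid_errors = []

for _ in range(max_rounds):
    # Fit the next tree to the current training residuals
    residuals = [y - p for y, p in zip(train_y, train_pred)]
    h = fit_stump(train_x, residuals)
    train_pred = [p + learning_rate * h(x) for x, p in zip(train_x, train_pred)]
    valid_pred = [p + learning_rate * h(x) for x, p in zip(valid_x, valid_pred)]
    # Track the error on data the trees have not seen
    valid_errors.append(mse(valid_y, valid_pred))

# Rounds are counted from 1; pick the one with the lowest validation error
best_round = min(range(max_rounds), key=lambda i: valid_errors[i]) + 1
print("best number of trees:", best_round)
```

Stopping at `best_round` rather than always running all rounds is one simple stopping criterion; with k-fold cross-validation the same curve would be averaged over the folds.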

Gradient Boosting Ensemble Technique

The gradient boosting ensemble technique consists of three simple steps:

An initial model F0 is defined to predict the target variable y. This model is associated with a residual (y – F0).
A new model h1 is fit to the residuals from the previous step.
F0 and h1 are then combined to give F1, the boosted version of F0. The mean squared error of F1 will be lower than that of F0:

$$F_1(x) = F_0(x) + h_1(x)$$

To improve the performance of F1, we can model the residuals of F1 and create a new model F2:

$$F_2(x) = F_1(x) + h_2(x)$$

This can be done for ‘m’ iterations, until the residuals have been minimized as much as possible:

$$F_m(x) = F_{m-1}(x) + h_m(x)$$

Here, the additive learners do not disturb the functions created in the previous steps. Instead, they impart information of their own to bring down the errors.
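The loop described in this section can be sketched as follows. This is a minimal illustration, not a production implementation: the data is made up, each additive learner is a one-split regression tree, and the function names are hypothetical.

```python
from statistics import mean

def fit_stump(xs, targets):
    """One-split regression tree fit to the targets by least squares."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

def mse(ys, preds):
    return mean((y - p) ** 2 for y, p in zip(ys, preds))

# Made-up toy data for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [12, 15, 14, 18, 30, 33, 31, 36, 35, 40]

# F0: a constant prediction (the mean of y)
preds = [mean(ys)] * len(ys)
errors = [mse(ys, preds)]

for m in range(1, 4):
    # Residuals of the current model F_{m-1}
    residuals = [y - p for y, p in zip(ys, preds)]
    # h_m is fit to those residuals
    h = fit_stump(xs, residuals)
    # F_m = F_{m-1} + h_m; earlier learners are left untouched
    preds = [p + h(x) for x, p in zip(xs, preds)]
    errors.append(mse(ys, preds))

print(errors)  # training MSE of F0, F1, F2, F3
```

Because every stump predicts the mean residual in each of its leaves, adding it can never increase the training MSE, which is why the printed sequence does not go up.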

Demonstrating the Potential of Gradient Boosting

In this section, we will explore the power of gradient boosting, a machine learning technique, by building an ensemble model to predict salary based on years of experience. By utilizing regression trees and optimizing loss functions, we aim to showcase the significant reduction in error that gradient boosting can achieve.

Introduction to the Predictive Model

Consider data where years of experience is the predictor variable and salary (in thousand dollars) is the target. Using regression trees as base learners, we can create an ensemble model to predict the salary. For the sake of simplicity, we choose square loss as our loss function, so our objective is to minimize the squared error.

Initializing the Model and Understanding Residuals

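As the first step, the model should be initialized with a function F0(x). F0(x) should be the function that minimizes the loss function, which in this case is the MSE (mean squared error):

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) = \arg\min_{\gamma} \sum_{i=1}^{n} (y_i - \gamma)^2$$

Taking the first derivative of the above expression with respect to γ shows that it is minimized at the mean $\frac{1}{n}\sum_{i=1}^{n} y_i$. So, you can initialize the boosting model with:

$$F_0(x) = \frac{1}{n} \sum_{i=1}^{n} y_i$$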

F0(x) gives the predictions from the first stage of our model. Now, the residual error for each instance is (yi – F0(x)).
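For readers who want the differentiation step spelled out, here is a short derivation under the square-loss assumption used in this section. Setting the derivative of the summed squared error with respect to γ to zero gives the mean:

```latex
\begin{aligned}
\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} (y_i - \gamma)^2
  &= -2 \sum_{i=1}^{n} (y_i - \gamma) = 0 \\
\Rightarrow \quad \sum_{i=1}^{n} y_i &= n\gamma \\
\Rightarrow \quad \gamma &= \frac{1}{n} \sum_{i=1}^{n} y_i
\end{aligned}
```

The second derivative is 2n, which is positive, so this stationary point is a minimum.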

Building Additive Learners

We can use the residuals from F0(x) to create h1(x). h1(x) will be a regression tree which will try and reduce the residuals from the previous step. The output of h1(x) won’t be a prediction of y; instead, it will help in predicting the successive function F1(x) which will bring down the residuals.

The additive model h1(x) computes the mean of the residuals (y – F0) at each leaf of the tree. The boosted function F1(x) is obtained by summing F0(x) and h1(x). This way h1(x) learns from the residuals of F0(x) and suppresses it in F1(x).

This can be repeated for 2 more iterations to compute h2(x) and h3(x). Each of these additive learners, hm(x), will make use of the residuals from the preceding function, Fm-1(x).

The MSEs for F0(x), F1(x) and F2(x) are 875, 692 and 540. It’s amazing how these simple weak learners can bring about a huge reduction in error!

Observing the Reduction in Error

Note that each learner, hm(x), trains on the residuals. Every additive learner in boosting is modeled on the residual errors left at the previous step. Intuitively, the boosting learners exploit patterns in the residual errors. At the stage where boosting reaches maximum accuracy, the residuals appear randomly distributed without any pattern.

Figure: Plots of Fn and hn

Using Gradient Descent for Optimizing the Loss Function

In the case discussed above, MSE was the loss function, and the mean minimized the error. When MAE (mean absolute error) is the loss function, the median would be used as F₀(x) to initialize the model. A unit change in y causes a unit change in the absolute error as well. Libraries such as scikit-learn let you train tree boosting models and linear regression models with different loss functions and compare their effect on model performance.
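A quick numerical check of the two initializations, using only the Python standard library. The data is made up and includes one deliberate outlier; the grid search over candidate constants is just a simple way to find the minimizer without calculus.

```python
from statistics import mean, median

# Made-up target values with one outlier
ys = [12, 15, 14, 18, 30, 33, 31, 36, 120]

def total_squared_error(c):
    return sum((y - c) ** 2 for y in ys)

def total_absolute_error(c):
    return sum(abs(y - c) for y in ys)

# Candidate constant predictions 0.0, 0.1, ..., 130.0
candidates = [c / 10 for c in range(0, 1301)]

best_for_squared = min(candidates, key=total_squared_error)
best_for_absolute = min(candidates, key=total_absolute_error)

print("squared loss :", best_for_squared, "mean   =", mean(ys))
print("absolute loss:", best_for_absolute, "median =", median(ys))
```

The constant that minimizes the squared error lands next to the mean, which the outlier pulls upward, while the constant that minimizes the absolute error lands on the median.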

For MSE, the change observed would be roughly quadratic. Instead of fitting hm(x) on the residuals, fitting it on the negative gradient of the loss function, that is, the direction in which the loss decreases most steeply, makes this process generic and applicable across all differentiable loss functions.

Gradient descent helps us minimize any differentiable function. Earlier, the regression tree for hm(x) predicted the mean residual at each terminal node of the tree. In gradient boosting, the average gradient component would be computed.

For each node, there is a factor γ with which hm(x) is multiplied. This accounts for the difference in impact of each branch of the split. Gradient boosting helps in predicting the optimal gradient for the additive model, unlike classical gradient descent techniques which reduce error in the output at each iteration.

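Gradient boosting involves the following steps:

F0(x), with which we initialize the boosting algorithm, is defined:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

The negative gradient of the loss function is computed iteratively:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

Each hm(x) is fit on the gradient obtained at each step.
The multiplicative factor γm for each terminal node is derived and the boosted model Fm(x) is defined:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$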
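To make the steps concrete for a loss other than MSE, here is a hedged sketch using absolute loss. Several things are assumptions made for illustration: the data is made up, the tree is a single split, and the per-leaf update is folded into one value per terminal node (the median residual in that leaf, which is the constant that minimizes absolute loss there) instead of a separate factor times the tree output.

```python
from statistics import mean, median

def sign(v):
    return (v > 0) - (v < 0)

def fit_split(xs, targets):
    """Return the threshold of the least-squares one-split tree fit to targets."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t)
    return best[1]

def mae(ys, preds):
    return mean(abs(y - p) for y, p in zip(ys, preds))

# Made-up toy data for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [12, 15, 14, 18, 30, 33, 31, 36, 35, 40]

# Step 1: F0 minimizes the absolute loss, so it is the median of y
preds = [median(ys)] * len(ys)
mae_history = [mae(ys, preds)]

for m in range(3):
    # Step 2: the negative gradient of |y - F| with respect to F is sign(y - F)
    neg_grads = [sign(y - p) for y, p in zip(ys, preds)]
    # Step 3: fit the tree structure of h_m to the negative gradients
    t = fit_split(xs, neg_grads)
    # Step 4: pick the value for each terminal node that minimizes the loss
    left_value = median([y - p for x, y, p in zip(xs, ys, preds) if x < t])
    right_value = median([y - p for x, y, p in zip(xs, ys, preds) if x >= t])
    preds = [p + (left_value if x < t else right_value) for x, p in zip(xs, preds)]
    mae_history.append(mae(ys, preds))

print(mae_history)
```

Only steps 2 and 4 depend on the loss function; swapping in another differentiable loss means changing the gradient formula and the per-leaf minimizer, while the rest of the loop stays the same.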

Unique Features of XGBoost Model

XGBoost is a popular implementation of gradient boosting. Let’s discuss some of the features of XGBoost that make it so interesting:

Regularization: XGBoost has an option to penalize complex models through both L1 and L2 regularization. Regularization helps in preventing overfitting.
Handling sparse data: Missing values or data processing steps like one-hot encoding make data sparse. XGBoost incorporates a sparsity-aware split finding algorithm to handle different types of sparsity patterns in the data.
Weighted quantile sketch: Most existing tree-based algorithms can find the split points when the data points have equal weights (using a quantile sketch algorithm). However, they are not equipped to handle weighted data. XGBoost has a distributed weighted quantile sketch algorithm to effectively handle weighted data.
Block structure for parallel learning: For faster computing, XGBoost can make use of multiple cores on the CPU. This is possible because of a block structure in its system design. Data is sorted and stored in in-memory units called blocks. Unlike other algorithms, this approach enables subsequent iterations to reuse the data layout instead of computing it again. This feature is also useful for steps like split finding and column sub-sampling.
Cache awareness: In XGBoost, non-contiguous memory access is required to obtain the gradient statistics by row index. Hence, XGBoost has been designed to make optimal use of hardware. This is done by allocating internal buffers in each thread, where the gradient statistics can be stored.
Out-of-core computing: This feature optimizes the available disk space and maximizes its usage when handling huge datasets that do not fit into memory.
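To give a feel for how regularization enters the tree-building step, the sketch below computes the optimal leaf weight and the split gain with an L2 penalty, following the formulas given in the XGBoost paper (weight = -G / (H + λ), where G and H are the summed first and second derivatives of the loss in a leaf). The function names, the toy numbers, and the choice to leave out the L1 term are assumptions made for this illustration; it is not the library's code.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf value for summed gradients G and Hessians H with an L2 penalty."""
    return -G / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda, gamma):
    """Loss reduction from splitting a node into left/right children.

    gamma is the penalty paid for adding one more leaf.
    """
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Made-up example with loss 1/2 * (y - pred)^2:
# gradient = pred - y, Hessian = 1 for every row
ys = [12, 15, 14, 18, 30, 33]
pred = 20.0
grads = [pred - y for y in ys]
hess = [1.0] * len(ys)

# Candidate split: first four rows go left, last two go right
GL, HL = sum(grads[:4]), sum(hess[:4])
GR, HR = sum(grads[4:]), sum(hess[4:])

for lam in (0.0, 1.0, 10.0):
    print(
        "lambda =", lam,
        "left weight =", leaf_weight(GL, HL, lam),
        "right weight =", leaf_weight(GR, HR, lam),
        "gain =", split_gain(GL, HL, GR, HR, lam, gamma=0.0),
    )
```

A larger λ shrinks the leaf weights toward zero and lowers the gain of a split, and a larger γ means a split must reduce the loss by more before it is worth making. Both effects favor simpler trees.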

Python Code for XGBoost

The following code trains an XGBoost classifier and evaluates it on a train and a test dataset, so you can see how XGBoost works and experiment with the code yourself.

'''
The following code is for XGBoost
Created by - ANALYTICS VIDHYA
'''

# import the required libraries
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# read the train and test datasets
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the datasets
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# The goal is to predict the target variable for the test data
# target variable - Survived

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Survived'])
train_y = train_data['Survived']

# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Survived'])
test_y = test_data['Survived']

'''
Create the object of the XGBoost model
You can also add other parameters and test your code here
Some parameters are : max_depth and n_estimators
Documentation of xgboost:
https://xgboost.readthedocs.io/en/latest/
'''
model = XGBClassifier()

# fit the model with the training data
model.fit(train_x, train_y)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data', predict_train)

# accuracy score on the train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data', predict_test)

# accuracy score on the test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)
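The code above uses the default settings. As a hedged starting point for experimenting, the dictionary below collects a few commonly used `XGBClassifier` parameters, including the `max_depth` and `n_estimators` options mentioned in the code comments. The values are hypothetical placeholders, not tuned recommendations, and the `try`/`except` is only there so the snippet still runs where xgboost is not installed.

```python
# Hypothetical starting values for illustration, not tuned recommendations
params = {
    "n_estimators": 200,      # number of boosting rounds (trees)
    "max_depth": 3,           # maximum depth of each tree
    "learning_rate": 0.1,     # shrinkage applied to each tree's contribution
    "reg_alpha": 0.0,         # L1 regularization on leaf weights
    "reg_lambda": 1.0,        # L2 regularization on leaf weights
    "subsample": 0.8,         # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,  # fraction of columns sampled for each tree
}

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(**params)
except ImportError:
    model = None  # xgboost is not installed; the dictionary is still usable

print(params)
```

With xgboost installed, `model` can be fitted exactly like the default classifier, for example `model.fit(train_x, train_y)`, and the values can then be tuned with a validation technique such as k-fold cross-validation.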

XGBoost Model Benefits and Attributes

High accuracy: XGBoost delivers high accuracy and often outperforms other machine learning algorithms in predictive modeling tasks, particularly on structured data.
Scalability: It is highly scalable and can handle large datasets with millions of rows and columns.
Efficiency: Its design ensures computational efficiency, allowing it to quickly train models on large datasets.
Flexibility: It supports a variety of data types and objectives, including regression, classification, and ranking problems.
Regularization: It incorporates regularization techniques to avoid overfitting and improve generalization performance.
Interpretability: It provides feature importance scores that can help users understand which features are most important for making predictions.
Open-source: XGBoost is an open-source library that is widely used and supported by the data science community.

XGBoost vs Gradient Boosting

Feature	XGBoost	Gradient Boosting
Description	Advanced implementation of gradient boosting	Ensemble technique using weak learners
Optimization	Regularized objective function	Error gradient minimization
Efficiency	Highly optimized, efficient	Computationally intensive
Missing Values	Built-in support	Requires preprocessing
Regularization	Built-in L1 and L2	Requires external steps
Feature Importance	Built-in measures	Limited, needs external calculation
Interpretability	Complex, less interpretable	More interpretable models

Difference between XGBoost and Random Forest

Feature	XGBoost	Random Forest
Description	Corrects mistakes from previous trees	Builds trees independently
Algorithm Type	Boosting	Bagging
Handling of Weak Learners	Corrects errors sequentially	Combines predictions of independently built trees
Regularization	Uses L1 and L2 regularization to prevent overfitting	Usually doesn’t employ regularization techniques
Performance	Often performs better on structured data but needs more tuning	Simpler and less prone to overfitting

Conclusion

So that was all about the mathematics that powers the popular XGBoost algorithm. If your basics are solid, this article should have been a breeze for you. It is a powerful algorithm, and while other techniques have spawned from it (like CatBoost), XGBoost remains a game changer in the machine learning community. To sharpen your skills in machine learning and learn the state-of-the-art techniques used in the field, we recommend our Applied Machine Learning – Beginner to Professional course.

To recap: XGBoost, or eXtreme Gradient Boosting, builds predictive models by sequentially adding weak learners, primarily decision trees, to improve accuracy. It handles large datasets well and offers features such as regularization and parallel processing for better performance.

Frequently Asked Questions

Q1. Is XGBoost better than random forest?

A. Neither is universally better; performance depends on the data and the problem you are solving. XGBoost often performs better on structured data but needs more tuning, while random forest is simpler and less prone to overfitting.

Q2. What is XGBoost Python used for?

A. XGBoost Python is a Python package that enables building and training models using the XGBoost algorithm in Python. It includes many functions for tuning and optimizing model performance.

Q3. Is XGBoost a classifier or regression?

A. XGBoost is a versatile algorithm that applies to both classification and regression tasks. It also supports ranking problems.

Q4. What is the difference between XGBoost and random forest?

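A. XGBoost is a boosting method: it builds trees sequentially, with each tree correcting the errors of the previous ones, and it uses L1 and L2 regularization to prevent overfitting. Random forest is a bagging method: it builds trees independently and combines their predictions. XGBoost is often more accurate on structured data but needs more tuning, while random forest is simpler to use.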

Q5. When should we use XGBoost?

A. XGBoost is a good choice for large, structured datasets. It offers strong predictive performance, built-in feature importance scores, and options for handling imbalanced data.

Responses From Readers

srinivas

Hi. Nice article. Thanks for sharing. A couple of clarifications: 1. What is the formula for calculating h1(x)? 2. How did the split happen at x = 23?

Ramya Bhaskar

Hi Srinivas, The split was decided based on a simple approach. A tree with a split at x = 23 returned the least SSE during prediction. Hope this answers your question. Thanks & Regards, Ramya Bhaskar
