How to Use XGBoost Algorithm in R?

Tavish Srivastava Last Updated : 11 Nov, 2024

8 min read

Do you know that employing the XGBoost algorithm is a winning strategy in many data science competitions? So, what makes it more powerful than a traditional Random Forest or Neural Network? In broad terms, it’s this algorithm’s efficiency, accuracy, and feasibility. In the last few years, predictive modeling has become much faster and more accurate. I remember spending long hours on feature engineering to improve the model by a few decimals. A lot of that difficult work can now be done using better algorithms. One of those algorithms is XGBoost algorithm.

Technically, “XGBoost” is a short form for Extreme Gradient Boosting. It gained popularity in data science after the famous Kaggle competition called the Otto Classification Challenge. The latest implementation on “xgboost” in R was launched in August 2015. This post will refer to this version (0.4-2).

In this article, I’ve explained a simple approach to using xgboost in R. So, consider this algorithm the next time you build a model. I’m sure you’ll be in shock and then happy!

Overview:

Learn the fundamentals of XGBoost and how it differs from other algorithms like Random Forest and Neural Networks.
Discover why XGBoost is a favorite choice in data science competitions for its speed, accuracy, and high predictive power.
Explore a hands-on tutorial to implement XGBoost in R, including data preparation, model tuning, and evaluation.
Understand advanced features and best practices for optimizing XGBoost models to enhance performance in predictive modeling tasks.

What is XGBoost?
Preparation of Data for using XGBoost
Building Model using Xgboost in R
Parameters Used in XGBoost
Advanced Functionality of Xgboost
Testing whether the results make sense
Conclusion
Frequently Asked Questions

What is XGBoost?

Extreme gradient boosting (xgboost) is similar to the gradient boosting framework but more efficient. It has both linear model solver and tree learning algorithms. What makes it fast is that it can do parallel computation on a single machine.

This makes xgboost at least 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification, and ranking.

Since it is very high in predictive power but relatively slow to implement, “xgboost” becomes an ideal fit for many competitions. It also has additional features for cross-validation and finding important variables. Many parameters need to be controlled to optimize the model. We will discuss these factors in the next section.

Preparation of Data for using XGBoost

XGBoost only works with numeric vectors. Yes! You need to work on data types here.

Therefore, you need to convert all other forms of data into numeric vectors. One Hot Encoding is a simple method to convert categorical variables into numeric vectors. This term emanates from digital circuit language, which means an array of binary signals, and the only legal values are 0s and 1s.

In R, one hot encoding is quite easy. This step (shown below) will essentially make a sparse matrix using flags on every possible value of that variable. A sparse Matrix is a matrix where most of the values are zeros. Conversely, a dense matrix is a matrix where most values are non-zero.

Suppose you have a dataset named ‘campaign’ and want to convert all categorical variables except the response variable into such flags. Here is how you do it :

sparse_matrix <- sparse.model.matrix(response ~ .-1, data = campaign)

Now, let’s break down this code as follows:

“sparse.model.matrix” is the command, and all other inputs inside parentheses are parameters.
The parameter “response” says this statement should ignore the “response” variable.
“-1” removes an extra column, which this command creates as the first column.
And finally, you specify the dataset name.

To convert the target variables as well, you can use the following code:

output_vector = df[,response] == "Responder"

Here is what the code does:

set output_vector to 0
set output_vector to 1 for rows where the response is "Responder" is TRUE ;
return output_vector.

Building Model using Xgboost in R

Here are simple steps you can use to crack any data problem using Xgboost in R:

Step 1: Load all the libraries

library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)

Step 2: Load the dataset

(Here, I use bank data where we need to find whether a customer is eligible for a loan or not).

set.seed(100)
setwd("C:\\Users\\ts93856\\Desktop\\datasource")
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
# Loading labels of train data

labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]

# combine train and test data
df_all = rbind(df_train,df_test)

Step 3: Data Cleaning & Feature Engineering

# clean Variables :  here I clean people with age less than 14 or more than 100

df_all[df_all$age < 14 | df_all$age > 100,'age'] <- -1
df_all$age[df_all$age < 0] <- mean(df_all$age[df_all$age > 0])

# one-hot-encoding categorical features
ohe_feats = c('gender', 'education', 'employer')

dummies <- dummyVars(~ gender +  education + employer, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)df_all_combined$agena <- as.factor(ifelse(df_all_combined$age < 0,1,0))

I am using a list of variables in “feature_selected” to be used by the model. Later in this article on XGBoost in R, I will share a quick and smart way to choose variables.

df_all_combined <- df_all_combined[,c('id',features_selected)] 
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id,]
y <- recode(labels$labels,"'True'=1; 'False'=0)
X_test = df_all_combined[df_all_combined$id %in% df_test$id,]

Step 4: Tune and Run the model

xgb <- xgboost(data = data.matrix(X[,-1]), 
 label = y, 
 eta = 0.1,
 max_depth = 15, 
 nround=25, 
 subsample = 0.5,
 colsample_bytree = 0.5,
 seed = 1,
 eval_metric = "merror",
 objective = "multi:softprob",
 num_class = 12,
 nthread = 3
)

Step 5: Score the Test Population

And that’s it! You now have an object “xgb” which is an xgboost model. Here is how you score a test population :

# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test[,-1]))

Parameters Used in XGBoost

I understand that, by now, you would be highly curious about the various parameters used in the XGBoost model. Three types of parameters can be used for XGBoost classification in R: General Parameters, Booster Parameters, and Task Parameters.

General parameters refer to which booster we are using to do boosting. The commonly used are tree or linear models.
Booster parameters depend on which booster you have chosen
Learning Task parameters that decide on the learning scenario, for example, regression tasks, may use different parameters with ranking tasks.

Let’s understand these parameters in detail. I require you to pay attention here. This is the most critical aspect of implementing the xgboost algorithm in R:

General Parameters

silent: The default value is 0. You need to specify 0 for printing running messages and 1 for silent mode.
booster: The default value is gbtree. You need to specify the booster to use: gbtree (tree-based) or gblinear (linear function).
num_pbuffer: This is set automatically by xgboost; no need to be set by the user. Read the documentation of xgboost for more details.
num_feature: This is set automatically by xgboost, no need to be set by user.

Booster Parameters

Let us now look at some tree-specific parameters for XGBoost in R:

eta

The default value is set to 0.3. To prevent overfitting, you must specify the step size shrinkage used in the update. After each boosting step, we can get the weights of new features directly. And eta shrinks the feature weights to make the boosting process more conservative. The range is 0 to 1. A low eta value means the model is more robust to overfitting.

gamma

The default value is set to 0. You need to specify the minimum loss reduction required to partition the leaf node of the tree further. The larger the gamma, the more conservative the algorithm will be. The range is 0 to ∞.

max_depth

The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to ∞.

min_child_weight

The default value is set to 1. You must specify the minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process will give up further partitioning. In linear regression mode, this simply corresponds to a minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. The range is 0 to ∞.

max_delta_step

The default value is set to 0. This is the maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually, this parameter is not needed, but it might help in logistic regression when the class is extremely imbalanced. Setting it to a value of 1-10 might help control the update.The range is 0 to ∞.

subsample

The default value is 1. You need to specify the subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees, which will prevent overfitting. The range is 0 to 1.

colsample_bytree

The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1.

Linear Booster Specific Parameters

lambda and alpha: These are regularization terms on weights. Lambda default value assumed is 1, and alpha is 0.
lambda_bias: L2 regularization term on bias with a default value of 0.

Learning Task Parameters

base_score: The default value is set to 0.5. You need to specify the initial prediction score of all instances, global bias.
objective: The default value is set to reg:linear. You need to specify the type of learner you want, which includes linear regression, logistic regression, Poisson regression, etc.
eval_metric: You need to specify the evaluation metrics for validation data; a default metric will be assigned according to the objective( rmse for regression, error for classification, mean average precision for ranking
seed: As always, here you specify the seed to reproduce the same set of outputs.

Advanced Functionality of Xgboost

Compared to other machine learning techniques, implementing xgboost is really simple. If you have done all we have until now, you already have a model.

Let’s take it one step further and try to find the importance of the variable in the model and subset our variable list.

# Lets start with finding what the actual tree looks like
model <- xgb.dump(xgb, with.stats = T)
model[1:10] #This statement prints top 10 nodes of the model

# Get the feature real names
names <- dimnames(data.matrix(X[,-1]))[[2]]

# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])

#In case last step does not work for you because of a version issue, you can try following :
barplot(importance_matrix[,1])

How to use XGBoost algorithm in R in easy steps_-80

As you can observe, many variables are just not worth using in our model. You can conveniently remove these variables and run the model again. This time, you can expect better accuracy.

Testing whether the results make sense

Let’s assume age was the variable that came out to be the most important from the above analysis. Here is a simple chi-square test which you can do to see whether the variable is important or not.

test <- chisq.test(train$Age, output_vector)
print(test)

We can do the same process for all important variables. This will reveal whether the model has accurately identified all possible important variables.

Conclusion

In conclusion, the XGBoost algorithm offers powerful capabilities for building numeric predictive models in both R and Python. By monitoring metrics like test error and leveraging features such as watchlist, we can iteratively refine our models for improved performance. With its versatility and efficiency, XGBoost classification in R remains a cornerstone in modern data science, enabling precise and actionable insights from complex datasets.

Frequently Asked Questions

Q1. Can you use XGBoost in R?

Ans: Yes, XGBoost can be used in R. It’s implemented through the ‘xgboost’ package, which provides powerful tools for building gradient boosting models efficiently.

Q2. How does XGBoost work?

Ans: XGBoost works by iteratively training a sequence of decision trees, each correcting the previous trees’ errors. It employs a gradient-boosting framework that optimizes a differentiable loss function to minimize prediction errors.

Q3. Is XGBoost better than random forest for regression?

Ans: XGBoost and random forest are both powerful algorithms for regression tasks, but their performance may vary depending on the dataset and specific requirements. While random forest builds multiple independent trees, XGBoost sequentially improves upon them, often leading to higher accuracy, especially in complex datasets.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter, or like our Facebook page.

Tavish Srivastava

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Aditya

How to find best parameter values for the model?

Show 1 reply

Tavish

Aditya, Its an iterative process. You generally start with the default value and then move towards either extremes depending on the CV gain. Tavish

kumar

Below code is giving an error : labels = df_train['labels']

Mikhail

I think in the dataset "label" is "Loan_Status" and this code is right labels = df_train['Loan_Status'] df_train = df_train[-grep('Loan_Status', colnames(df_train))]

HighSpirits

Very helpful article Srivastava. I heard about XGBOOST but did not implement it. Will definitely try this in the next competition, using this article.

How to Use XGBoost Algorithm in R?

Table of contents

What is XGBoost?

Preparation of Data for using XGBoost

Building Model using Xgboost in R

Step 1: Load all the libraries

Step 2: Load the dataset

Step 3: Data Cleaning & Feature Engineering

Step 4: Tune and Run the model

Step 5: Score the Test Population

Parameters Used in XGBoost

General Parameters

Booster Parameters

eta

gamma

max_depth

min_child_weight

max_delta_step

subsample

colsample_bytree

Linear Booster Specific Parameters

Learning Task Parameters

Advanced Functionality of Xgboost

Testing whether the results make sense

Conclusion

Frequently Asked Questions

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)

_gcl_au

SID