Do you know that employing the XGBoost algorithm is a winning strategy in many data science competitions? So, what makes it more powerful than a traditional Random Forest or Neural Network? In broad terms, it’s this algorithm’s efficiency, accuracy, and feasibility. In the last few years, predictive modeling has become much faster and more accurate. I remember spending long hours on feature engineering to improve the model by a few decimals. A lot of that difficult work can now be done using better algorithms. One of those algorithms is XGBoost algorithm.
Technically, “XGBoost” is a short form for Extreme Gradient Boosting. It gained popularity in data science after the famous Kaggle competition called the Otto Classification Challenge. The latest implementation on “xgboost” in R was launched in August 2015. This post will refer to this version (0.4-2).
In this article, I’ve explained a simple approach to using xgboost in R. So, consider this algorithm the next time you build a model. I’m sure you’ll be in shock and then happy!
Overview:
Extreme gradient boosting (xgboost) is similar to the gradient boosting framework but more efficient. It has both linear model solver and tree learning algorithms. What makes it fast is that it can do parallel computation on a single machine.
This makes xgboost at least 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification, and ranking.
Since it is very high in predictive power but relatively slow to implement, “xgboost” becomes an ideal fit for many competitions. It also has additional features for cross-validation and finding important variables. Many parameters need to be controlled to optimize the model. We will discuss these factors in the next section.
XGBoost only works with numeric vectors. Yes! You need to work on data types here.
Therefore, you need to convert all other forms of data into numeric vectors. One Hot Encoding is a simple method to convert categorical variables into numeric vectors. This term emanates from digital circuit language, which means an array of binary signals, and the only legal values are 0s and 1s.
In R, one hot encoding is quite easy. This step (shown below) will essentially make a sparse matrix using flags on every possible value of that variable. A sparse Matrix is a matrix where most of the values are zeros. Conversely, a dense matrix is a matrix where most values are non-zero.
Suppose you have a dataset named ‘campaign’ and want to convert all categorical variables except the response variable into such flags. Here is how you do it :
sparse_matrix <- sparse.model.matrix(response ~ .-1, data = campaign)
Now, let’s break down this code as follows:
To convert the target variables as well, you can use the following code:
output_vector = df[,response] == "Responder"
Here is what the code does:
is "Responder"
is TRUE ;Here are simple steps you can use to crack any data problem using Xgboost in R:
library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)
(Here, I use bank data where we need to find whether a customer is eligible for a loan or not).
set.seed(100)
setwd("C:\\Users\\ts93856\\Desktop\\datasource")
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
# Loading labels of train data
labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]
# combine train and test data
df_all = rbind(df_train,df_test)
# clean Variables : here I clean people with age less than 14 or more than 100
df_all[df_all$age < 14 | df_all$age > 100,'age'] <- -1
df_all$age[df_all$age < 0] <- mean(df_all$age[df_all$age > 0])
# one-hot-encoding categorical features
ohe_feats = c('gender', 'education', 'employer')
dummies <- dummyVars(~ gender + education + employer, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)df_all_combined$agena <- as.factor(ifelse(df_all_combined$age < 0,1,0))
I am using a list of variables in “feature_selected” to be used by the model. Later in this article on XGBoost in R, I will share a quick and smart way to choose variables.
df_all_combined <- df_all_combined[,c('id',features_selected)]
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id,]
y <- recode(labels$labels,"'True'=1; 'False'=0)
X_test = df_all_combined[df_all_combined$id %in% df_test$id,]
xgb <- xgboost(data = data.matrix(X[,-1]),
label = y,
eta = 0.1,
max_depth = 15,
nround=25,
subsample = 0.5,
colsample_bytree = 0.5,
seed = 1,
eval_metric = "merror",
objective = "multi:softprob",
num_class = 12,
nthread = 3
)
And that’s it! You now have an object “xgb” which is an xgboost model. Here is how you score a test population :
# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test[,-1]))
I understand that, by now, you would be highly curious about the various parameters used in the XGBoost model. Three types of parameters can be used for XGBoost classification in R: General Parameters, Booster Parameters, and Task Parameters.
Let’s understand these parameters in detail. I require you to pay attention here. This is the most critical aspect of implementing the xgboost algorithm in R:
Let us now look at some tree-specific parameters for XGBoost in R:
The default value is set to 0.3. To prevent overfitting, you must specify the step size shrinkage used in the update. After each boosting step, we can get the weights of new features directly. And eta shrinks the feature weights to make the boosting process more conservative. The range is 0 to 1. A low eta value means the model is more robust to overfitting.
The default value is set to 0. You need to specify the minimum loss reduction required to partition the leaf node of the tree further. The larger the gamma, the more conservative the algorithm will be. The range is 0 to ∞.
The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to ∞.
The default value is set to 1. You must specify the minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process will give up further partitioning. In linear regression mode, this simply corresponds to a minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. The range is 0 to ∞.
The default value is set to 0. This is the maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually, this parameter is not needed, but it might help in logistic regression when the class is extremely imbalanced. Setting it to a value of 1-10 might help control the update.The range is 0 to ∞.
The default value is 1. You need to specify the subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees, which will prevent overfitting. The range is 0 to 1.
The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1.
Compared to other machine learning techniques, implementing xgboost is really simple. If you have done all we have until now, you already have a model.
Let’s take it one step further and try to find the importance of the variable in the model and subset our variable list.
# Lets start with finding what the actual tree looks like
model <- xgb.dump(xgb, with.stats = T)
model[1:10] #This statement prints top 10 nodes of the model
# Get the feature real names
names <- dimnames(data.matrix(X[,-1]))[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])
#In case last step does not work for you because of a version issue, you can try following :
barplot(importance_matrix[,1])
As you can observe, many variables are just not worth using in our model. You can conveniently remove these variables and run the model again. This time, you can expect better accuracy.
Let’s assume age was the variable that came out to be the most important from the above analysis. Here is a simple chi-square test which you can do to see whether the variable is important or not.
test <- chisq.test(train$Age, output_vector)
print(test)
We can do the same process for all important variables. This will reveal whether the model has accurately identified all possible important variables.
In conclusion, the XGBoost algorithm offers powerful capabilities for building numeric predictive models in both R and Python. By monitoring metrics like test error and leveraging features such as watchlist, we can iteratively refine our models for improved performance. With its versatility and efficiency, XGBoost classification in R remains a cornerstone in modern data science, enabling precise and actionable insights from complex datasets.
Ans: Yes, XGBoost can be used in R. It’s implemented through the ‘xgboost’ package, which provides powerful tools for building gradient boosting models efficiently.
Ans: XGBoost works by iteratively training a sequence of decision trees, each correcting the previous trees’ errors. It employs a gradient-boosting framework that optimizes a differentiable loss function to minimize prediction errors.
Ans: XGBoost and random forest are both powerful algorithms for regression tasks, but their performance may vary depending on the dataset and specific requirements. While random forest builds multiple independent trees, XGBoost sequentially improves upon them, often leading to higher accuracy, especially in complex datasets.
If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter, or like our Facebook page.
How to find best parameter values for the model?
Aditya, Its an iterative process. You generally start with the default value and then move towards either extremes depending on the CV gain. Tavish
Below code is giving an error : labels = df_train['labels']
I think in the dataset "label" is "Loan_Status" and this code is right labels = df_train['Loan_Status'] df_train = df_train[-grep('Loan_Status', colnames(df_train))]
Very helpful article Srivastava. I heard about XGBOOST but did not implement it. Will definitely try this in the next competition, using this article.