How to perform feature selection (i.e. pick important variables) using the Boruta package in R?

Guest Blog Last Updated : 26 Aug, 2021

Introduction

Variable selection is an important aspect of model building which every analyst must learn. After all, it helps in building predictive models free from correlated variables, biases and unwanted noise.

A lot of novice analysts assume that keeping all (or more) variables will result in the best model as you are not losing any information. Sadly, that is not true!

How many times has removing a variable from a model increased its accuracy?

At least, it has happened to me. Such variables are often correlated with other predictors and hinder achieving higher model accuracy. Today, we’ll learn one way to get rid of such variables in R. I must say, R has an incredible CRAN repository, and one of the packages it offers for variable selection is Boruta.

In this article, we’ll focus on understanding the theory and practical aspects of using the Boruta package. I’ve followed a stepwise approach to help you understand better.

I’ve also drawn a comparison of Boruta with a traditional feature selection algorithm. Using this, you can arrive at a more meaningful set of features which can pave the way for a robust prediction model. The terms “features”, “variables” and “attributes” have been used interchangeably, so don’t get confused!

What is the Boruta algorithm and why such a strange name?

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest. The package derives its name from a demon in Slavic mythology who dwelled in pine forests.

We know that feature selection is a crucial step in predictive modeling. It becomes especially important when the data set given for model building contains many variables.

Boruta can be your algorithm of choice to deal with such data sets, particularly when you are interested in understanding the mechanisms related to the variable of interest rather than just building a black box predictive model with good prediction accuracy.

How does it work?

Below is the stepwise working of the Boruta algorithm:

Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature, where higher means more important.
At every iteration, it checks whether a real feature has a higher importance than the best of the shadow features (i.e. whether the feature has a higher Z score than the maximum Z score among the shadow features) and progressively removes features which are deemed highly unimportant.
Finally, the algorithm stops either when all features have been confirmed or rejected, or when it reaches a specified limit of random forest runs.
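
The steps above can be illustrated with a toy sketch in base R. It is only an approximation of the idea: absolute correlation with the target stands in for the random forest importance that Boruta actually uses, and the variable names (signal, noise) are made up for illustration.

```r
# Toy sketch of the shadow-feature idea using base R only.
# Assumption: absolute correlation with the target stands in for the
# random forest importance (Z score) that Boruta actually computes.
set.seed(42)
n <- 200
x <- data.frame(
  signal = rnorm(n),  # truly related to the target
  noise  = rnorm(n)   # unrelated to the target
)
y <- 2 * x$signal + rnorm(n)

# Step 1: shadow features are column-wise shuffled copies of the real ones
shadow <- as.data.frame(lapply(x, sample))
names(shadow) <- paste0("shadow_", names(x))
extended <- cbind(x, shadow)

# Step 2: score every column of the extended data set
importance <- sapply(extended, function(col) abs(cor(col, y)))

# Step 3: a real feature scores a "hit" when it beats the best shadow
best_shadow <- max(importance[names(shadow)])
hits <- importance[names(x)] > best_shadow
print(hits)
```

Boruta repeats this comparison over many random forest runs and counts the hits per feature before confirming or rejecting it, which a single pass like this one does not capture.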

What makes it different from traditional feature selection algorithms?

Boruta follows an all-relevant feature selection method, where it captures all features which are in some circumstances relevant to the outcome variable. In contrast, most traditional feature selection algorithms follow a minimal optimal method, where they rely on a small subset of features which yields a minimal error on a chosen classifier.

While fitting a random forest model on a data set, you can recursively get rid of the features which didn’t perform well in each iteration. This eventually leads to a minimal optimal subset of features, as the method minimizes the error of the random forest model. The price is an over-pruned version of the input data set, which in turn throws away some relevant features.

On the other hand, Boruta finds all features which are either strongly or weakly relevant to the decision variable. This makes it well suited for biomedical applications where one might be interested in determining which human genes (features) are connected in some way to a particular medical condition (target variable).
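
To make the distinction concrete, here is a small simulated sketch in base R. It assumes a linear model as a stand-in for the random forest, and the feature names (gene_a, gene_b) are hypothetical. gene_b is a near-duplicate of gene_a, so it adds almost nothing to the fit, yet it is clearly related to the target on its own.

```r
# Simulated illustration of minimal optimal vs all-relevant selection.
# Assumption: lm() is used as a simple stand-in for a random forest.
set.seed(7)
n <- 500
gene_a <- rnorm(n)
gene_b <- gene_a + rnorm(n, sd = 0.1)  # near-duplicate of gene_a
y <- gene_a + rnorm(n, sd = 0.5)

# Minimal optimal view: adding gene_b barely changes the fit,
# so an error-minimizing method has no reason to keep it
r2_a  <- summary(lm(y ~ gene_a))$r.squared
r2_ab <- summary(lm(y ~ gene_a + gene_b))$r.squared
gain  <- r2_ab - r2_a

# All-relevant view: gene_b on its own is strongly related to y
cor_b <- cor(gene_b, y)

print(c(gain = gain, cor_b = cor_b))
```

A minimal optimal method would be justified in dropping gene_b, while an all-relevant method aims to report it because it carries information about the target.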

Boruta in Action in R (Practical)

So far, we have covered the theoretical aspects of the Boruta package. But that isn’t enough. The real challenge starts now. Let’s learn to implement this package in R.

First things first. Let’s install and call this package for use.

> install.packages("Boruta")
> library(Boruta)

Now, we’ll load the data set. For this tutorial I’ve taken the data set from the Loan Prediction practice problem.

> setwd("../Data/Loan_Prediction")
> traindata <- read.csv("train.csv", header = TRUE, stringsAsFactors = FALSE)

Let’s have a look at the data.

> str(traindata)
> names(traindata) <- gsub("_", "", names(traindata))

The gsub() function replaces every match of a pattern with another string. In this case, I’ve replaced the underscore (_) in the column names with an empty string (“”).
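
As a quick illustration of what this call does, the sketch below applies gsub() to three column names. The underscored originals are an assumption inferred from the cleaned names that appear later in this article.

```r
# Column names with underscores (assumed originals, for illustration)
cols <- c("Loan_ID", "Self_Employed", "Loan_Amount_Term")

# Replace every underscore with an empty string
clean <- gsub("_", "", cols)
print(clean)
```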

Let’s check if this data set has missing values.

> summary(traindata)

We find that many variables have missing values. It’s important to treat missing values before running Boruta. Moreover, this data set also has blank values. Let’s clean this data set.

First, we’ll replace blank cells with NA. This lets us treat all missing values at once.

> traindata[traindata == ""] <- NA

Here, I’m following the simplest method of missing value treatment, i.e. list wise deletion. More sophisticated methods and packages for missing value imputation are available.

> traindata <- traindata[complete.cases(traindata),]
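
The two cleaning steps can be checked on a tiny made-up data frame before running them on the real data. The column names and values below are hypothetical.

```r
# Toy data frame; column names and values are made up for illustration
df <- data.frame(
  Gender     = c("Male", "", "Female", "Male"),
  LoanAmount = c(120, 100, NA, 66),
  stringsAsFactors = FALSE
)

# Turn blank strings into NA so both kinds of gaps are handled together
df[df == ""] <- NA

# List wise deletion: keep only rows with no missing value in any column
cleaned <- df[complete.cases(df), ]
print(cleaned)
```

Only the first and last rows survive, since the second row had a blank Gender and the third a missing LoanAmount. On a real data set it is worth comparing nrow() before and after to see how much data list wise deletion costs.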

Let’s convert the categorical variables into factor data type.

> convert <- c(2:6, 11:13)
> traindata[convert] <- lapply(traindata[convert], as.factor)
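
A small check on a made-up data frame (the column names and values are hypothetical) shows one way to convert selected columns to factors with lapply() and confirm the resulting classes. Checking the classes matters because data.frame() no longer turns strings into factors by default since R 4.0.

```r
# Toy data frame; names and values are made up for illustration
df <- data.frame(
  Gender        = c("Male", "Female", "Male"),
  CreditHistory = c(1, 0, 1),
  Income        = c(5000, 3000, 4000),
  stringsAsFactors = FALSE
)

# Convert only the categorical columns, column by column
convert <- c(1, 2)
df[convert] <- lapply(df[convert], as.factor)

# Inspect the class of every column
print(sapply(df, class))
```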

Now it is time to run Boruta and check its results. Boruta() accepts a formula, much like the regression function lm().

> set.seed(123)
> boruta.train <- Boruta(LoanStatus~.-LoanID, data = traindata, doTrace = 2)
> print(boruta.train)

Boruta performed 99 iterations in 18.80749 secs.
5 attributes confirmed important: ApplicantIncome, CoapplicantIncome,
CreditHistory, LoanAmount, LoanAmountTerm.
4 attributes confirmed unimportant: Dependents, Education, Gender, SelfEmployed.
2 tentative attributes left: Married, PropertyArea.

Boruta gives a clear call on the significance of each variable in the data set. In this case, out of 11 attributes, 5 are confirmed, 4 are rejected and 2 are designated as tentative. Tentative attributes have importance so close to that of their best shadow attributes that Boruta is not able to make a decision with the desired confidence in the default number of random forest runs.

Now, we’ll plot the Boruta variable importance chart.

By default, the plot function in Boruta writes the attribute names horizontally along the x-axis, so not all of them are displayed due to lack of space.

Here I’m adding the attribute names to the x-axis vertically.

> plot(boruta.train, xlab = "", xaxt = "n")
> lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i) boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
> names(lz) <- colnames(boruta.train$ImpHistory)
> Labels <- sort(sapply(lz,median))
> axis(side = 1,las=2,labels = names(Labels), at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
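
The label ordering used above can be followed on a tiny stand-in for ImpHistory. The matrix below is made up; it assumes, as in Boruta's stored history, that a rejected attribute is recorded as -Inf in the runs after its rejection, which is why only finite values go into the medians.

```r
# Made-up importance history: rows are runs, columns are attributes
imp_history <- cbind(
  A         = c(5.1, 4.8, 5.3),
  B         = c(0.2, -Inf, -Inf),  # rejected after the first run
  shadowMax = c(1.9, 2.2, 2.0)
)

# Keep only the finite importance values of each column
lz <- lapply(seq_len(ncol(imp_history)), function(i) {
  col <- imp_history[, i]
  col[is.finite(col)]
})
names(lz) <- colnames(imp_history)

# Sort attributes by median importance to get the left-to-right label order
labels <- sort(sapply(lz, median))
print(names(labels))
```

The sorted names are what gets passed to axis() as the vertical labels, one per boxplot, which assumes the plot orders its boxes by median importance.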

[Figure: Boruta variable importance plot]

Blue boxplots correspond to the minimal, average and maximum Z scores of the shadow attributes. Red, yellow and green boxplots represent the Z scores of rejected, tentative and confirmed attributes respectively.

Now it is time to decide on the tentative attributes. TentativeRoughFix() classifies each tentative attribute as confirmed or rejected by comparing its median Z score with the median Z score of the best shadow attribute. Let’s do it.

> final.boruta <- TentativeRoughFix(boruta.train)
> print(final.boruta)

Boruta performed 99 iterations in 18.399 secs.
Tentatives roughfixed over the last 99 iterations.
6 attributes confirmed important: ApplicantIncome, CoapplicantIncome,
CreditHistory, LoanAmount, LoanAmountTerm and 1 more.
5 attributes confirmed unimportant: Dependents, Education, Gender, PropertyArea,
SelfEmployed.
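
The rule behind this rough fix, as described above, can be sketched in a few lines of base R. The Z scores and attribute names below are invented for illustration and are not taken from the loan data.

```r
# Invented Z scores over five runs (illustration only)
shadow_max <- c(2.4, 2.7, 2.6, 2.5, 2.8)  # best shadow attribute per run
tentative <- list(
  feature_a = c(2.9, 2.6, 3.1, 2.7, 2.8),
  feature_b = c(2.2, 2.9, 2.4, 2.5, 2.3)
)

# Confirm a tentative attribute if its median Z score beats the
# median Z score of the best shadow attribute, otherwise reject it
decision <- sapply(tentative, function(z) {
  if (median(z) > median(shadow_max)) "Confirmed" else "Rejected"
})
print(decision)
```

This is a weaker test than the one Boruta applies during its runs, so attributes settled this way deserve a little more caution than those confirmed outright.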

[Figure: Boruta result plot after the classification of tentative attributes]

It’s time for results now. Let’s obtain the list of confirmed attributes:

> getSelectedAttributes(final.boruta, withTentative = FALSE)
[1] "Married" "ApplicantIncome" "CoapplicantIncome" "LoanAmount"
[5] "LoanAmountTerm" "CreditHistory"

We’ll create a data frame of the final result derived from Boruta.

> boruta.df <- attStats(final.boruta)
> class(boruta.df)
[1] "data.frame"
> print(boruta.df)

                      meanImp  medianImp     minImp    maxImp   normHits  decision
Gender             1.04104738  0.9181620 -1.9472672  3.767040 0.01010101  Rejected
Married            2.76873080  2.7843600 -1.5971215  6.685000 0.56565657 Confirmed
Dependents         1.15900910  1.0383850 -0.7643617  3.399701 0.01010101  Rejected
Education          0.64114702  0.4747312 -1.0773928  3.745441 0.03030303  Rejected
SelfEmployed      -0.02442418 -0.1511711 -0.9536783  1.495992 0.00000000  Rejected
ApplicantIncome    6.05487791  6.0311639  2.9801751  9.197305 0.94949495 Confirmed
CoapplicantIncome  5.76704389  5.7920332  1.9322989 10.184245 0.97979798 Confirmed
LoanAmount         5.19167613  5.3606935  1.7489061  8.855464 0.88888889 Confirmed
LoanAmountTerm     5.50553498  5.3938036  2.0361781  9.025020 0.90909091 Confirmed
CreditHistory     59.57931404 60.2352549 51.7297906 69.721650 1.00000000 Confirmed
PropertyArea       2.77155525  2.4715892 -1.2486696  8.719109 0.54545455  Rejected
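
One way to read the normHits column: it is the share of runs in which an attribute scored higher than the best shadow attribute. The sketch below converts three of the reported values back to hit counts over 99 runs and checks them against a coin flip with a one-sided binomial test. This is a simplified illustration; Boruta applies its own test during the run and, by default, corrects for multiple comparisons, so these p-values are indicative only.

```r
# normHits values copied from the table above
runs <- 99
norm_hits <- c(Married = 0.56565657,
               PropertyArea = 0.54545455,
               ApplicantIncome = 0.94949495)

# Recover the number of runs in which each attribute beat the best shadow
hits <- round(norm_hits * runs)

# One-sided binomial test against a 50% hit rate (simplified check)
p_values <- sapply(hits, function(h) {
  binom.test(h, runs, p = 0.5, alternative = "greater")$p.value
})

print(hits)
print(signif(p_values, 3))
```

Married and PropertyArea sit close to a 50% hit rate, which is consistent with Boruta leaving them tentative, while ApplicantIncome beats the best shadow in almost every run.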

Let’s understand the main parameters of Boruta():

maxRuns: maximal number of random forest runs. Consider increasing this parameter if tentative attributes are left. Default is 100.

doTrace: verbosity level. 0 means no tracing, 1 means reporting each attribute decision as soon as it is made, and 2 means all of 1 plus reporting each iteration. Default is 0.

holdHistory: if TRUE (the default), the full history of importance runs is stored. This history is what the plotImpHistory function uses to plot importance against classifier run.

For more complex parameters, please refer to the package documentation of Boruta.

Boruta vs Traditional Feature Selection Algorithm

So far, we have learnt about the concept and the steps to implement the Boruta package in R.

What if we used a traditional feature selection algorithm such as recursive feature elimination on the same data set? Do we end up with the same set of important features? Let us find out.

Now, we’ll go through the steps to implement recursive feature elimination (RFE). In R, the RFE algorithm can be run using the caret package.

Let’s start by loading the required libraries and defining a control function to be used with the RFE algorithm:

> library(caret)
> library(randomForest)
> set.seed(123)
> control <- rfeControl(functions=rfFuncs, method="cv", number=10)

Here we have specified random forest selection functions through the rfFuncs option (random forest is also the underlying algorithm in Boruta).

Let’s implement the RFE algorithm now.

> rfe.train <- rfe(traindata[,2:12], traindata[,13], sizes=1:11, rfeControl=control)

traindata[,2:12] selects all independent variables except the ID variable, traindata[,13] selects only the dependent variable, and sizes lists the subset sizes to evaluate. It might take some time to run.

We can also check the outcome of this algorithm.

> rfe.train

Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.8083 0.4702    0.03810  0.1157        *
         2   0.8041 0.4612    0.03575  0.1099
         3   0.8021 0.4569    0.04201  0.1240
         4   0.7896 0.4378    0.03991  0.1249
         5   0.7978 0.4577    0.04557  0.1348
         6   0.7957 0.4471    0.04422  0.1315
         7   0.8061 0.4754    0.04230  0.1297
         8   0.8083 0.4767    0.04055  0.1203
         9   0.7897 0.4362    0.05044  0.1464
        10   0.7918 0.4453    0.05549  0.1564
        11   0.8041 0.4751    0.04419  0.1336

The top 1 variables (out of 1):
CreditHistory

This algorithm gives the highest weightage to CreditHistory. Now, we’ll plot the result of the RFE algorithm, which shows the cross-validated accuracy for each subset size.

> plot(rfe.train, type=c("g", "o"), cex = 1.0, col = 1:11)

[Figure: RFE cross-validated accuracy by number of variables]

Let’s extract the chosen features. I am confident it would result in Credit History.

> predictors(rfe.train)
[1] "CreditHistory"

Hence, we see that the recursive feature elimination algorithm has selected “CreditHistory” as the only important feature among the 11 features in the data set.
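
It helps to look at how narrow this choice is. The sketch below copies the accuracy column from the RFE output above and picks the best subset size with which.max(), which returns the first maximum. That caret resolves the tie between sizes 1 and 8 in the same way is an assumption here, although it agrees with the reported result.

```r
# Cross-validated accuracy per subset size, copied from the RFE output
accuracy <- c(0.8083, 0.8041, 0.8021, 0.7896, 0.7978, 0.7957,
              0.8061, 0.8083, 0.7897, 0.7918, 0.8041)
names(accuracy) <- 1:11

# Best size by accuracy; which.max() keeps the first of tied maxima
best_size <- as.integer(names(accuracy)[which.max(accuracy)])

# Spread between the best and worst subset sizes
spread <- max(accuracy) - min(accuracy)

print(best_size)
print(spread)
```

The spread across all subset sizes is smaller than the reported AccuracySD of any single size (roughly 0.04 to 0.06), so accuracy alone hardly separates a one-variable model from larger ones on this data set.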

Compared to this traditional feature selection algorithm, Boruta returned a richer picture of variable importance: six confirmed features instead of one, with a decision for every attribute that is easy to interpret. I find it awesome to work in R where one has access to so many amazing packages. I’m sure there are many other packages for feature selection. I’d love to read about them.

End notes

Boruta is an easy to use package as there aren’t many parameters to tune or remember. You shouldn’t use a data set with missing values to check important variables using Boruta. It’ll blatantly throw errors. You can use this algorithm on any classification or regression problem in hand to come up with a subset of meaningful features.

In this article, I’ve used a quick method to handle missing values (list wise deletion) because the scope of this article was to understand Boruta (theory and practice). I’d suggest you use more advanced methods of missing value imputation. After all, information available in data is all we look for! Keep going.

Did you like reading this article? What other methods of variable selection do you use? Do share your suggestions and opinions in the comments section below.

About the Author

Debarati Dutta is an MA Econometrics graduate from the University of Madras. She has more than 3 years of experience in data analytics and predictive modeling across multiple domains. She has worked in companies such as Amazon, Antuit and Netlink. Currently, she’s based out of Montreal, Canada.

Debarati is the first winner of Blogathon. She won an Amazon voucher worth INR 5000.

Sreenivas Kalahasti

Gr8 Share Debrati. I was looking for the same kind of info. Keep sharing. Thanks.

Debarati

Thanks Sreenivas. Glad that it helped.

Dr.D.K.Samuel

Beautiful, meaningful info, thanks a lot

Thank you so much Dr. Samuel.

Vlad

Hi, This is not the best package for the determination of the importance of predictors. See this article: https://www.mql5.com/en/articles/2029

Hi Vlad, Thanks for the interesting article. Well, I would say "best" is more likely a relative term which depends a lot on the problem we have in hand as well as our needs. As mentioned in another comment, if prediction accuracy is your only concern, it might / might not be the best method for feature selection. But, if you are also interested in understanding the relationships in your data, it would do a much better job. Hence, application of machine learning techniques involve a lot of trial and error to arrive at the "best" method.
