Random Forest is one of the most commonly used algorithms in Kaggle competitions. Along with good predictive power, Random Forest models are fairly simple to build. We have previously explained the algorithm of a random forest (Introduction to Random Forest). This article is the second part of the series comparing a Random Forest with a CART model. In the first article, we used an inbuilt R dataset to predict the classification of a species. In this article, we will build a Random Forest model on the same dataset and compare its performance with the previously built CART model. I did this experiment a week back and found the results very insightful. I recommend reading the first part of this series before reading this one.
[stextbox id=”section”] Background on Dataset “Iris” [/stextbox]
The dataset "iris" gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The dataset has 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species. We intend to predict the species based on the 4 flower characteristic variables.
We will first load the dataset into R and then look at some of the key statistics. You can use the following code to do so.
[stextbox id=”grey”]
# load ggplot2 for qplot()
library(ggplot2)
data(iris)
# look at summary statistics of the dataset
summary(iris)
# visually inspect the dataset
qplot(Petal.Length, Petal.Width, colour = Species, data = iris)
[/stextbox]
[stextbox id=”section”] Results using CART Model [/stextbox]
The first step we follow in any modeling exercise is to split the data into training and validation sets. You can use the following code for the split (we will use the same split for the Random Forest as well).
[stextbox id=”grey”]
# createDataPartition() comes from the caret package
library(caret)
set.seed(100)   # for a reproducible split
train.flag <- createDataPartition(y = iris$Species, p = 0.5, list = FALSE)
training   <- iris[train.flag, ]
Validation <- iris[-train.flag, ]
[/stextbox]
The CART model gave the following results on the training and validation data:
Misclassification rate in training data = 3/75
Misclassification rate in validation data = 4/75
As you can see, the CART model gave decent results in terms of accuracy and stability. We will now fit a Random Forest on the same training dataset and validate it using the same validation dataset.
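For reference, here is a minimal sketch of how the CART model from Part 1 could be reproduced with caret (assuming the rpart method; the exact code and tuning used in the first article may differ):
[stextbox id=”grey”]
library(caret)
library(rpart)
# fit a CART (rpart) model on the training split
cart.fit <- train(Species ~ ., method = "rpart", data = training)
# misclassification counts on training and validation data
sum(predict(cart.fit, newdata = training) != training$Species)     # article reported 3/75
sum(predict(cart.fit, newdata = Validation) != Validation$Species) # article reported 4/75
[/stextbox]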
[stextbox id=”section”] Building a Random forest model [/stextbox]
We have used the "caret", "randomForest" and "randomForestSRC" packages to build this model. You can use the following code to generate a Random Forest model on the training dataset.
[stextbox id=”grey”]
> library(randomForest)
> library(randomForestSRC)
> library(caret)
> modfit <- train(Species~ .,method="rf",data=training)
> pred <- predict(modfit,training)
> table(pred,training$Species)
pred         setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         0
  virginica       0          0        25
[/stextbox]
Misclassification rate in training data = 0/75 [This is simply awesome!]
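As a quick sanity check, the training misclassification rate can also be computed directly from the predictions above (a small sketch using the objects already defined):
[stextbox id=”grey”]
# number of misclassified training observations and the error rate
sum(pred != training$Species)    # 0 misclassified
mean(pred != training$Species)   # misclassification rate = 0
[/stextbox]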
[stextbox id=”section”] Validating the model [/stextbox]
Having built such an accurate model, we would like to make sure that we are not overfitting it on the training data. This is done by validating the same model on an independent dataset. We use the following code to do so:
[stextbox id=”grey”]
> pred.val <- predict(modfit, newdata = Validation)
> table(pred.val, Validation$Species)
pred.val     setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         22         0
  virginica       0          3        25
# Misclassification rate = 3/75
[/stextbox]
Only 3 misclassified observations out of 75 signifies good predictive power. However, this is a noticeable drop in predictive power compared to the training misclassification rate.
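If you want richer validation metrics than the raw table, caret's confusionMatrix() reports overall accuracy along with per-class sensitivity and specificity (a short sketch using the objects above):
[stextbox id=”grey”]
# detailed accuracy and per-class statistics on the validation set
confusionMatrix(pred.val, Validation$Species)
[/stextbox]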
[stextbox id=”section”] Comparison between the two models [/stextbox]
Up to this point, everything was by the book. Here comes the tricky part. Once you have all the performance metrics, you need to select the best model as per your business requirement. In this case, apart from the business requirements, we will make this judgement based on three criteria:
1. Stability : The model should have similar performance metrics across both training and validation. This is essential because a business can live with lower accuracy but not with lower stability. We will give the highest weight to stability; for this case let's take it as 5.
2. Performance on training data : This is an important metric, but nothing conclusive can be said based on this metric alone, because an overfit model is unacceptable yet will score very high on this parameter. Hence, we will give a low weight to this parameter (say 2).
3. Performance on validation data : This metric catches overfit models and is therefore important. We will score it higher than training performance and lower than stability; for this case let's take it as 3.
Note that the weights and scores depend entirely on the business case. Following is a score table as per my judgement in this case.
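To make the mechanics of the comparison concrete, here is a minimal sketch of the weighted scoring described above. The weights (5, 2, 3) come from the three criteria listed; the individual model scores below are purely illustrative placeholders, not the values from my score table:
[stextbox id=”grey”]
# weights for the three criteria described above
weights <- c(stability = 5, train.perf = 2, valid.perf = 3)
# illustrative scores for each model (placeholders only, on a 1-10 scale)
cart.scores <- c(stability = 9, train.perf = 7, valid.perf = 8)
rf.scores   <- c(stability = 6, train.perf = 10, valid.perf = 8)
# weighted totals used to compare the two models
sum(weights * cart.scores)
sum(weights * rf.scores)
[/stextbox]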
As you can see from the table, although Random Forest gives better performance, I will still go ahead and use the CART model because of the stability factor. Another factor in favour of the CART model is easy business justification: a Random Forest is very difficult to explain to people working in the field, whereas CART models are simple cuts that can be justified with straightforward business reasons. But the choice of model is entirely dependent on the business requirement.
[stextbox id=”section”] End Notes [/stextbox]
Every model has its own strengths. Random Forest, as seen in this case study, achieves very high accuracy on the training population because it uses many different characteristics to make a prediction. But, for the same reason, it sometimes overfits the data. The CART model, on the other hand, is a simple criterion-cut model. This might be an oversimplification in some cases but works pretty well in most business scenarios. While the choice of model depends on the business requirement, it is always good to compare the performance of different models before making this call.
Did you find the article useful? Did this article solve any of your existing dilemmas? Have you compared the two models in any of your projects? If you did, share with us your thoughts on the topic.
Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industries including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.
I would say this is not a fair comparison. Random Forests are not really meant for such small datasets. Also, try using CART where p > n. Here is a quote from the Kaggle wiki on RFs: "Unlike single decision trees which are likely to suffer from high Variance or high [Bias] (depending on how they are tuned) Random Forests use averaging to find a natural balance between the two extremes." As your data grows in either dimension, RFs are much more stable than single trees because of the inherent virtue of averaging.
Lalit, thank you for bringing this up. As mentioned in the article, the choice is always made based on the business requirement. The case in hand has a low number of observations and a clean split between the different species, which is why CART works so well. For this reason my preference is the CART model, as Random Forest does not add significant value in terms of predictive power and stability. But I would say choosing an algorithm is more of an art than a straightforward science. I appreciate you bringing up a valid point for comparing the two models. Tavish
Good explanation Tavish, thank you for it. Would you mind explaining why you did your validation on the training dataset? That is the dataset you used to build your prediction model. You also have a validation (test) dataset with you, don't you? I might be wrong as I am a beginner in this.