While trying to build a better predictive model, we come across a famous ensemble technique in machine learning known as Random Forest. The Random Forest algorithm comes with the concept of the Out-of-Bag Score (OOB_Score).
Random Forest is a powerful ensemble technique for machine learning and data science, but most people tend to skip the concept of OOB_Score while learning the algorithm and hence fail to understand the full importance of Random Forest as an ensemble method.
This blog will walk you through the OOB_Score concept with the help of examples.
One of the most interpretable models used for supervised learning is the Decision Tree, where the algorithm makes decisions and predicts values using if-else conditions, as shown in the example.
Although decision trees are easy to understand and interpret, they have one major issue: high variance, meaning a single tree tends to overfit the training data.
Hence, to have the best of both worlds, that is, less variance and more interpretability, the Random Forest algorithm was introduced.
Random Forests, or Random Decision Forests, are an ensemble learning method for classification and regression problems. They operate by constructing a multitude of independent decision trees (using bootstrapping) at training time and outputting the majority vote of all the trees (for classification) or the average of their predictions (for regression) as the final output.
Constructing many decision trees in a Random Forest helps the model generalize the data pattern rather than memorize it, and therefore reduces the variance (reduces overfitting).
But how do we select a training set for every new decision tree in a Random Forest? This is where bootstrapping comes in.
We create new training sets for multiple decision trees in Random Forest using the concept of Bootstrapping, which is essentially random sampling with replacement.
Let us look at an example to understand how bootstrapping works:
Here, the main training dataset consists of five animals, and we now make different samples out of this one main training set.
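The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure any library uses. The animals "Rat", "Cow" and "Cat" come from the example, while "Dog" and "Horse" and the function name `bootstrap_sample` are hypothetical placeholders added to reach five rows.

```python
import random

# Main training set: five animals. "Rat", "Cow" and "Cat" appear in the text;
# "Dog" and "Horse" are hypothetical placeholders.
training_set = ["Rat", "Cow", "Cat", "Dog", "Horse"]


def bootstrap_sample(rows, rng):
    """Draw len(rows) items at random WITH replacement."""
    return rng.choices(rows, k=len(rows))


rng = random.Random(42)  # fixed seed so the run is repeatable
samples = [bootstrap_sample(training_set, rng) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    print(f"Sample {i}: {sample}")
```

Each sample has the same size as the main training set, but because rows are drawn with replacement, some animals can appear more than once and others not at all.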
Note: Random Forest bootstraps the data points and also uses a random subset of features (considered at each split in scikit-learn's implementation) while building multiple independent decision trees.
The total number of trees in a random forest, which are also called estimators, can be set using n_estimators.
In the above example, you can observe that we repeated some animals while making the sample, and some animals did not even occur once in the sample.
Here, Sample 1 does not have Rat and Cow, whereas Sample 3 contains every animal in the main training set.
While making the samples, data points were chosen randomly and with replacement; the data points that fail to be part of a particular sample are known as out-of-bag (OOB) points for that sample.
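To make the idea of out-of-bag points concrete, here is a small sketch that draws a bootstrap sample for a few trees and lists the rows each tree never saw. The helper `bootstrap_with_oob` and the animals "Dog" and "Horse" are assumptions made for illustration. The last lines compute the chance that one particular row is left out of a sample of size n, which is (1 - 1/n)^n.

```python
import random

training_set = ["Rat", "Cow", "Cat", "Dog", "Horse"]  # last two are hypothetical


def bootstrap_with_oob(rows, rng):
    """Return (bootstrap sample, out-of-bag rows) for one tree."""
    sample = rng.choices(rows, k=len(rows))  # sampling with replacement
    in_bag = set(sample)
    oob = [row for row in rows if row not in in_bag]  # rows never drawn
    return sample, oob


rng = random.Random(0)
for tree_id in range(1, 4):
    sample, oob = bootstrap_with_oob(training_set, rng)
    print(f"DT_{tree_id}: sample={sample}, OOB={oob}")

# Chance that one particular row is left out of a bootstrap sample of size n
n = len(training_set)
p_oob = (1 - 1 / n) ** n
print(f"P(row is OOB) for n={n}: {p_oob:.3f}")  # 0.328 for n=5
```

For large training sets this probability approaches 1/e, so roughly 37% of the rows are out-of-bag for any given tree.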
Where does OOB_Score come into the picture? OOB_Score is a powerful validation technique used especially with the Random Forest algorithm: it estimates how well the model generalizes by scoring each row only with the trees that did not train on it.
Note: If we validate a trained forest on rows taken from its own training data, every such row has already been seen in training by some of the decision trees, so there is data leakage and the estimate is overly optimistic. OOB_Score avoids this leakage because each row is scored only by trees that did not train on it, and unlike cross-validation it needs no separate hold-out set or repeated refitting, so we can use OOB_Score for validating the model.
Let’s understand OOB_Score through an example:
Here, we have a training set with 5 rows and a classification target variable indicating whether each animal is domestic/pet.
In the random forest, we build multiple decision trees. Below, we show a bootstrapped sample for one particular decision tree, say DT_1.
Here, the Rat and Cat rows have been left out. Since Rat and Cat are OOB for DT_1, we predict their values using DT_1. (Note: DT_1 has not seen the Rat and Cat data while training.)
Just like DT_1, there would be many more decision trees where either Rat or Cat, or both of them, were left out.
Let’s say that the 3rd, 7th, and 100th decision trees have ‘Rat’ as an OOB datapoint. This means that none of them saw the ‘Rat’ data before predicting the value for ‘Rat’.
So, we record all the predicted values for “Rat” from the trees DT_1, DT_3, DT_7, and DT_100.
Suppose we see that the aggregated (majority) prediction is the same as the actual value for “Rat”.
(Note: None of these trees had seen the “Rat” data before, and together they still predicted its value correctly.)
Similarly, every data point is passed for prediction to the trees for which it is OOB, and an aggregated prediction is recorded for each row.
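The aggregation step can be sketched as a simple majority vote. The tree names for "Rat" follow the example above, but the predicted labels, the trees listed for "Cat", and the helper `aggregate_oob_prediction` are all hypothetical values chosen for illustration.

```python
from collections import Counter

# Hypothetical OOB predictions: for each row, the labels predicted by the
# trees that did NOT see that row during training.
oob_votes = {
    "Rat": {"DT_1": "No", "DT_3": "No", "DT_7": "Yes", "DT_100": "No"},
    "Cat": {"DT_1": "Yes", "DT_5": "Yes", "DT_9": "No"},
}


def aggregate_oob_prediction(votes_by_tree):
    """Majority vote over the trees for which the row was out-of-bag."""
    counts = Counter(votes_by_tree.values())
    label, _ = counts.most_common(1)[0]
    return label


aggregated = {row: aggregate_oob_prediction(v) for row, v in oob_votes.items()}
print(aggregated)  # {'Rat': 'No', 'Cat': 'Yes'}
```

Only the trees that never trained on a row get a vote for that row, which is what keeps the aggregated prediction free of leakage.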
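The OOB_Score is computed from the number of correctly predicted rows in the out-of-bag sample, usually reported as a fraction of all rows (that is, as accuracy).
The OOB Error is the corresponding fraction of out-of-bag rows that are wrongly classified, that is, 1 - OOB_Score.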
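As a rough sketch of this calculation, the snippet below treats the score as the fraction of correctly predicted rows, which is the convention scikit-learn uses for classifiers. The labels and predictions are hypothetical values, and `oob_score` here is an illustrative helper, not a library function.

```python
# Hypothetical actual labels and aggregated OOB predictions for five rows.
actual = {"Rat": "No", "Cow": "Yes", "Cat": "Yes", "Dog": "Yes", "Horse": "Yes"}
oob_predicted = {"Rat": "No", "Cow": "Yes", "Cat": "Yes", "Dog": "No", "Horse": "Yes"}


def oob_score(actual, predicted):
    """Fraction of rows whose aggregated OOB prediction matches the actual label."""
    correct = sum(1 for row in actual if predicted[row] == actual[row])
    return correct / len(actual)


score = oob_score(actual, oob_predicted)
error = 1 - score
print(f"OOB score: {score:.2f}, OOB error: {error:.2f}")  # 0.80 and 0.20
```

Four of the five rows are predicted correctly, so the score is 0.80 and the error is the remaining 0.20.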
Random Forest can be a very powerful technique when we validate it with OOB_Score. Even though you spend a bit more time training the random forest model with the OOB_Score parameter set to True, the validation estimate you get without holding out any data justifies the time consumed.
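In scikit-learn, the parameter is spelled `oob_score` and the result is exposed as the `oob_score_` attribute after fitting. The following is a minimal sketch on synthetic data generated with `make_classification`; the dataset and the parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,  # number of trees (estimators)
    bootstrap=True,    # required for OOB scoring (this is the default)
    oob_score=True,    # score each row using only trees that did not see it
    random_state=0,
)
model.fit(X, y)

oob_accuracy = model.oob_score_  # accuracy on out-of-bag predictions
print(f"OOB score: {oob_accuracy:.3f}")
print(f"OOB error: {1 - oob_accuracy:.3f}")
```

With very few trees, some rows may never be out-of-bag for any tree, so it is usually sensible to keep `n_estimators` reasonably large when relying on the OOB score.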
A. The out-of-bag error is a performance metric that estimates the performance of the Random Forest model using samples not included in the bootstrap sample for training.
A. In Random Forest classification, bagging, or bootstrap aggregation, combines predictions from multiple decision trees to reduce variance and avoid overfitting. By using different subsets of the training data (via sklearn’s RandomForestClassifier), it ensures that individual models generalize better. The model enhances its overall performance by making the final prediction based on a majority vote.
A. In a Random Forest model, each tree within the ensemble calculates the Out-of-Bag (OOB) error using the data samples it did not select for training during the bootstrap sampling process. These samples, referred to as “out-of-bag” samples, are the ones left out for each tree.