Random Forest is an ensemble, supervised machine learning algorithm built from decision trees. It is heavily used in industries such as banking and e-commerce to predict behavior and outcomes.
Therefore it becomes necessary for every aspiring Data Scientist and Machine Learning Engineer to have a good knowledge of the Random Forest Algorithm.
In this article, we will discuss the most important questions on the Random Forest Algorithm, covering everything from fundamental to more advanced concepts, to give you a clear understanding of the algorithm and to help you prepare for Data Science interviews.
Let’s get started,
Random forest is an ensemble machine learning technique that averages several decision trees trained on different parts of the same training set, with the objective of overcoming the overfitting problem of individual decision trees.
In other words, the random forest algorithm can be used for both classification and regression problems, and it operates by constructing a large number of decision trees at training time.
Random Forest is one of the most popular and widely used machine learning algorithms for classification problems. It can also be used for regression problems, but it is best known for its performance on classification tasks.
It has become a lethal weapon in the toolkit of modern data scientists for refining predictive models. The best part of the algorithm is that it makes very few assumptions, so data preparation is less challenging, which saves time. It is also consistently among the most popular algorithms (alongside other ensembling methods) in Kaggle competitions.
Yes, Random Forest can be used for both continuous and categorical target (dependent) variables.
In a random forest, i.e., a combination of decision trees, the classification model handles a categorical dependent variable, while the regression model handles a numeric or continuous dependent variable.
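For instance, here is a minimal scikit-learn sketch (the datasets and parameters are illustrative choices, not from the original discussion) showing the same algorithm applied to a categorical target and to a continuous target:

```python
# A minimal sketch: the same random forest idea applied to a categorical target
# (classification) and to a continuous target (regression).
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: categorical dependent variable
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Regression: continuous dependent variable
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Regression R^2:", reg.score(X_test, y_test))
```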
Step-1: Pick K records at random (with replacement) from the training dataset.
Step-2: Build and train a decision tree model on these K records.
Step-3: Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
Step-4: In the case of a regression problem, for an unseen data point, each tree in the forest predicts a value for the output, and the final value is the average of the values predicted by all the trees in the forest.
In the case of a classification problem, each tree in the forest predicts the class to which the new data point belongs, and the data point is finally assigned to the class that wins the majority vote among the trees (see the sketch below).
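A simplified Python sketch of these steps (illustrative only; it assumes NumPy arrays with non-negative integer class labels and uses scikit-learn's DecisionTreeClassifier for the individual trees):

```python
# Sketch of the procedure above: bootstrap records per tree, train a decision
# tree on each sample, then aggregate the predictions by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X_train, y_train, X_new, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X_train)
    votes = []
    for _ in range(n_trees):
        # Steps 1-2: draw a bootstrap sample (with replacement) and fit one tree on it
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X_train[idx], y_train[idx])
        # Step 4 (classification): collect each tree's vote for the new points
        votes.append(tree.predict(X_new))
    votes = np.array(votes)  # shape: (n_trees, n_new)
    # Majority vote across trees for every new data point
    # (assumes class labels are non-negative integers)
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), axis=0, arr=votes)
```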
So, we have to build a model carefully by keeping the bias-variance tradeoff in mind.
The main reason a decision tree overfits when no limit is placed on its maximum depth is its unlimited flexibility: it keeps growing until it has a separate leaf node for every single observation.
Instead of limiting the depth of the tree, which reduces variance at the cost of increased bias, we can combine many decision trees into a single ensemble model, known as the random forest.
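A quick, illustrative way to see this variance reduction (the dataset and settings below are arbitrary choices, not from the original discussion) is to compare a single unpruned tree with a forest on held-out data:

```python
# Sketch: an unpruned decision tree typically overfits (large train/test gap),
# while averaging many such trees in a random forest narrows that gap.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("Single tree   - train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))
print("Random forest - train:", forest.score(X_tr, y_tr), "test:", forest.score(X_te, y_te))
```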
This is known as the out-of-bag (OOB) error estimate which, in short, is an internal error estimate of a random forest computed as the forest is being constructed.
Random Record Selection: Each tree in the forest is trained on roughly 2/3 of the total training data (about 63.2% of the distinct records), and these data points are drawn at random with replacement from the original training dataset. This sample acts as the training set for growing the tree.
Random Variable Selection: Some number m of independent variables (predictors) is selected at random out of all the predictor variables, and the best split on these m variables is used to split the node.
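Both sources of randomness, as well as the out-of-bag estimate discussed above, map directly onto scikit-learn parameters; a minimal, illustrative sketch:

```python
# Sketch: max_features controls the random variable selection per split (the "m"
# above), bootstrap=True enables random record selection, and oob_score=True
# reports the internal out-of-bag estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # m random predictors considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample (~63.2% unique rows)
    oob_score=True,        # estimate generalization accuracy from the left-out rows
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
print("OOB error estimate   :", 1 - rf.oob_score_)
```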
NOTE: The main features of Bagged Trees are as follows:
1. Reduces variance by averaging the ensemble’s results.
2. The resulting model uses the entire feature space when considering node splits.
3. It allows the trees to grow without pruning; the resulting deep trees have high variance but low bias, and averaging them helps improve the predictive power.
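For comparison (an illustrative sketch, not part of the original discussion), bagged trees can be built with scikit-learn's BaggingClassifier around an unpruned decision tree; unlike a random forest, every split considers the entire feature space:

```python
# Sketch: bagged trees = bootstrap samples + fully grown trees, but with all
# features available at every split (contrast with max_features in a random forest).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bagged = BaggingClassifier(
    DecisionTreeClassifier(max_depth=None),  # unpruned trees
    n_estimators=200,
    bootstrap=True,      # random record selection, as in a random forest
    oob_score=True,      # the OOB estimate works here as well
    random_state=0,
).fit(X, y)

print("Bagged trees OOB accuracy:", bagged.oob_score_)
```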
The forest error rate depends on two things:
1. How correlated any two trees in the forest are, i.e.,
The correlation between any two different trees in the forest. Increasing the correlation increases the forest error rate.
2. How strong each individual tree in the forest is, i.e.,
The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the forest error rate.
Moreover, reducing the value of mtry, i.e., the number of random variables considered at each split, reduces both the correlation and the strength, while increasing it increases both. Somewhere in between lies an "optimal" range of mtry, which is usually quite wide.
Using the OOB error rate, a value of mtry in that range can be found quickly. This is the only adjustable parameter to which random forests are somewhat sensitive.
For example, suppose we fit 500 trees in a forest, and a particular case is out-of-bag in 200 of them. If 160 of those 200 trees vote for class 1, the RF score for that case is class 1, since the estimated probability is 160/200 = 0.8. Similarly, for a regression problem, the OOB prediction is the average of the values predicted by those 200 trees.
The detailed explanation of the proof is as follows:
Input: n labelled training examples S = {(x_i, y_i)}, i = 1, …, n
Suppose we select n samples out of n with replacement to get a bootstrap training set S_i; this is still different from working with the entire training set:
Pr(S_i = S) = n!/n^n (a very small number, exponentially small in n)
Pr((x_i, y_i) not in S_i) = (1 − 1/n)^n ≈ e^(−1) ≈ 0.37
Hence for large data sets, about 37% of the data set is left out!
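A quick simulation (purely illustrative) confirms this ~63.2% / ~36.8% split:

```python
# Sketch: empirically check that a bootstrap sample of size n leaves out
# roughly (1 - 1/n)^n ≈ e^(-1) ≈ 36.8% of the original rows.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
idx = rng.integers(0, n, size=n)           # sample n rows with replacement
left_out = 1 - len(np.unique(idx)) / n     # fraction of rows never drawn
print(f"Left out: {left_out:.3f}  (theory: {(1 - 1/n)**n:.3f})")
```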
Finally, it creates a proximity matrix, i.e., a square matrix with 1s on the diagonal and values between 0 and 1 in the off-diagonal positions. Proximities close to 1 mean the observations are "alike", and conversely, the closer a proximity is to 0, the more dissimilar the two cases are.
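scikit-learn does not expose proximities directly, but a simplified version can be sketched from the leaf indices returned by apply() (here computed over all observations rather than only out-of-bag pairs, which is a simplification):

```python
# Sketch: proximity(i, j) = fraction of trees in which observations i and j
# fall into the same terminal (leaf) node.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

leaves = rf.apply(X)                       # shape: (n_samples, n_trees), leaf index per tree
n_samples, n_trees = leaves.shape
proximity = np.zeros((n_samples, n_samples))
for t in range(n_trees):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= n_trees                       # diagonal entries are exactly 1

print(proximity.shape, proximity.diagonal()[:5])
```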
As a result, we get one Random Forest classifier for each of the 10 candidate values of n_tree; we record the OOB error rate for each and look for the value of n_tree at which the out-of-bag error rate stabilizes and reaches its minimum.
2. In this method, we experiment with candidate values of mtry such as the square root of the total number of predictors, half of that square root, and twice that square root, and check which value of mtry gives the maximum area under the ROC curve.
For example, suppose we have 1,000 predictors; the candidate numbers of predictors to try at each node would then be roughly 16, 32, and 64 (see the sketch below).
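A hedged sketch of this tuning loop with scikit-learn (the dataset, the candidate values of n_tree, and the use of the OOB error instead of AUC are illustrative choices):

```python
# Sketch: grid over n_estimators (n_tree) and max_features (mtry), tracking OOB error.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
p = X.shape[1]
sqrt_p = int(round(np.sqrt(p)))
mtry_candidates = [max(1, sqrt_p // 2), sqrt_p, min(p, 2 * sqrt_p)]  # half, sqrt, double

for n_trees in (100, 300, 500):
    for mtry in mtry_candidates:
        rf = RandomForestClassifier(
            n_estimators=n_trees, max_features=mtry,
            oob_score=True, random_state=0,
        ).fit(X, y)
        print(f"n_tree={n_trees:3d}  mtry={mtry:2d}  OOB error={1 - rf.oob_score_:.4f}")
```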
Conclusion: The higher the value of mean decrease accuracy or mean decrease Gini score, the higher the importance of the variable in the model.
The steps for calculating variable importance in Random Forest Algorithm are as follows:
1. For each tree grown in a random forest, find the number of votes for the correct class in out-of-bag data.
2. Now perform a random permutation of a predictor's values (say, variable k) in the OOB data and then check the number of votes for the correct class. By "random permutation of a predictor's values", we mean shuffling the order of its values.
3. At this step, we subtract the number of votes for the correct class in the variable-k-permuted data from the number of votes for the correct class in the original OOB data.
4. The raw importance score for variable k is the average of this difference over all trees in the forest. The score is then normalized by dividing by the standard deviation of these differences.
5. Variables with large values of this score are ranked as more important: if building predictions without the original values of a variable (by permuting them) gives worse results, that variable is important to the model (see the sketch below).
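The same idea is available out of the box as permutation_importance in scikit-learn; the sketch below uses a held-out test set instead of the OOB data, which is a simplification:

```python
# Sketch of permutation importance: shuffle one predictor, measure how much the
# score drops, and average over repeats; larger drops => more important variables.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i:2d}  mean accuracy drop={result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```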
The shortcomings of the Random Forest algorithm are as follows:
1. Random Forests aren’t good at generalizing cases with completely new data.
For Example, If we know that the cost of one ice cream is $1, 2 ice-creams cost $2, and 3 ice-creams cost $3, then how much do 10 ice-creams cost? In such cases, Linear regression models can easily figure this out, while a Random Forest has no way of finding the answer.
2. Random forests are biased towards categorical variables having many levels or categories. This is because the impurity-based feature selection favors variables with more categories, so the variable selection is not accurate for this type of data.
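Here is an illustrative sketch of point 1 using the ice-cream example above: a linear model extrapolates the $1-per-ice-cream pattern, while the forest's prediction stays inside the range of targets it has seen:

```python
# Sketch: a random forest cannot predict outside the range of targets it has
# seen, whereas a linear model extrapolates the $1-per-ice-cream pattern.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X_train = np.array([[1], [2], [3]])      # number of ice creams
y_train = np.array([1.0, 2.0, 3.0])      # price in dollars
X_new = np.array([[10]])                 # far outside the training range

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
lr = LinearRegression().fit(X_train, y_train)

print("Random forest prediction for 10 ice creams:", rf.predict(X_new)[0])  # stuck near 3
print("Linear regression prediction              :", lr.predict(X_new)[0])  # ~10
```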
Thanks for reading!
I hope you enjoyed the questions and were able to test your knowledge about Random Forest Algorithm.
If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link
Please feel free to contact me on LinkedIn or via email.
Something not mentioned or want to share your thoughts? Feel free to comment below, and I'll get back to you.
Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur (IITJ). I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.