Random Forest Algorithm in Machine Learning

Sruthi Last Updated : 06 Oct, 2025


Random Forest is a widely used machine learning algorithm developed by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems. In this article, we will look at how the random forest algorithm works, its advantages, how it applies to classification and regression, how it differs from other algorithms, and how to use it.

This article was published as a part of the Data Science Blogathon.

What is Random forest?
Random Forest Applications
Real-Life Analogy of Random Forest
Working of Random Forest Algorithm
Important Features of Random Forest
Difference Between Decision Tree and Random Forest
Important Hyperparameters in Random Forest
- Increase the Predictive Power
- Increase the Speed
Coding in Python – Random Forest Classifier
Advantages and Disadvantages of Random Forest Algorithm
Conclusion
Frequently Asked Questions

What is Random forest?

Random forest, a popular machine learning algorithm developed by Leo Breiman and Adele Cutler, merges the outputs of numerous decision trees to produce a single outcome. Its popularity stems from its user-friendliness and versatility, which make it suitable for both classification and regression tasks. The algorithm’s strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.

One of the most important features of the random forest algorithm is that it can handle datasets with a continuous target variable, as in the case of regression, and a categorical target variable, as in the case of classification. It often performs well on both kinds of task. In this tutorial, we will understand the working of random forest and implement it on a classification task.

Random Forest Applications

Customer churn prediction: Businesses can use random forests to predict which customers are likely to churn (cancel their service) so that they can take steps to retain them. For example, a telecom company might use a random forest model to identify customers who are using their phone less frequently or who have a history of late payments.
Fraud detection: Random forests can identify fraudulent transactions in real-time. For instance, a bank might employ a random forest model to spot transactions made from unusual locations or involving unusually large amounts of money.
Stock price prediction: Random forests can be used to predict future stock prices. However, it is important to note that stock price prediction is a very difficult task, and no model is ever going to be perfectly accurate.
Medical diagnosis: Random forests can help doctors diagnose diseases. For example, a doctor might use a random forest model to help assess whether a patient has cancer.
Image recognition: Random forests can recognize objects in images. For example, a self-driving car might use a random forest model to identify pedestrians and other vehicles on the road.

Real-Life Analogy of Random Forest

Let’s dive into a real-life analogy to understand this concept further. A student named X wants to choose a course after his 10+2, and he can’t decide which course fits his skill set. So he decides to consult various people, such as his cousins, teachers, parents, degree students, and working people. He asks them varied questions: why he should choose a given course, what job opportunities it offers, what the course fee is, and so on. Finally, after consulting various people, he decides to take the course suggested by most of them.

Working of Random Forest Algorithm

Before understanding the working of the random forest algorithm in machine learning, we must look into the ensemble learning technique. Ensemble simply means combining multiple models. Thus a collection of models is used to make predictions rather than an individual model.

Ensemble uses two types of methods: bagging and boosting.

The random forest classifier works on the bagging principle. Now let’s dive in and understand bagging in detail.

Bagging

Bagging, also known as Bootstrap Aggregation, serves as the ensemble technique in the Random Forest algorithm. Here are the steps involved in Bagging:

Selection of Subset: Bagging starts by choosing a random sample, or subset, from the entire dataset.
Bootstrap Sampling: Each model is built from one of these samples, called bootstrap samples, which we draw from the original data with replacement. This process is known as row sampling.
Bootstrapping: The step of row sampling with replacement is referred to as bootstrapping.
Independent Model Training: We train each model independently on its corresponding bootstrap sample. This training process generates results for each model.
Majority Voting: The final output is determined by combining the results of all models through majority voting. We select the most commonly predicted outcome among the models.
Aggregation: This step of combining all the results and generating the final output based on majority voting is called aggregation.

Now let’s look at an example by breaking it down with the help of the following figure. Here the bootstrap samples (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) are taken from the actual data with replacement, which means there is a high possibility that a sample will contain repeated rows rather than only unique ones. The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently. Each model generates results as shown. The Happy emoji has a majority when compared to the Sad emoji, so based on majority voting the final output is the Happy emoji.
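The two ideas in this section, row sampling with replacement and majority voting, can be sketched in a few lines of NumPy. This is a minimal illustration rather than how any particular library implements bagging; the row count, the seed, and the three model outputs are made-up values.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

n_rows = 1000                        # hypothetical dataset size
row_ids = np.arange(n_rows)

# Row sampling with replacement: one bootstrap sample per model.
bootstrap_samples = [
    rng.choice(row_ids, size=n_rows, replace=True) for _ in range(3)
]

# Because rows are drawn with replacement, each sample repeats some rows
# and leaves other rows out entirely.
unique_fractions = [
    len(np.unique(sample)) / n_rows for sample in bootstrap_samples
]


def majority_vote(predictions):
    """Return the most common prediction among the models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of Model 01, Model 02, and Model 03 for one input.
final_output = majority_vote(["happy", "sad", "happy"])

print("Fraction of distinct rows per sample:", unique_fractions)
print("Final output:", final_output)
```

Each bootstrap sample typically contains a bit under two-thirds of the distinct rows; the remaining rows are the ones later referred to as out-of-bag samples.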

Boosting

Boosting is one of the techniques that use the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output. It does this by building the weak models in series, with each new model trying to correct the errors of the ones before it.

There are several boosting algorithms; AdaBoost was the first really successful boosting algorithm, developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting and is a prevalent boosting technique that combines multiple “weak classifiers” into a single “strong classifier.” Other boosting techniques exist as well.
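For contrast with bagging, here is a small sketch of AdaBoost using scikit-learn’s AdaBoostClassifier. The synthetic dataset and the parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# By default scikit-learn boosts depth-1 decision trees ("stumps"),
# fitting them one after another on reweighted training data.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)

test_accuracy = booster.score(X_test, y_test)
print(f"Weak learners fitted: {len(booster.estimators_)}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Unlike the independently trained models in bagging, the weak learners here are fitted sequentially, so they cannot be trained in parallel in the same way.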

For more, you can visit – 4 Boosting Algorithms You Should Know: GBM, XGBoost, LightGBM & CatBoost.

Steps Involved in Random Forest Algorithm

Step 1: In this model, we select a subset of data points and a subset of features to construct each decision tree. Simply put, we take n random records and m features from a dataset containing k records.
Step 2: We construct individual decision trees for each sample.
Step 3: Each decision tree will generate an output.
Step 4: We consider the final output based on Majority Voting for classification and Averaging for regression, respectively.

For example:

Consider the fruit basket as the data, as shown in the figure below. Now n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure. The final output is decided by majority voting. In the figure below, you can see that the majority of decision trees output an apple rather than a banana, so we take the final output as an apple.
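The four steps can be sketched by hand with scikit-learn’s DecisionTreeClassifier. This is a simplified illustration, not the library’s own implementation: the data is synthetic, the number of trees is arbitrary, and the feature subset is drawn at every split (scikit-learn’s convention via max_features) rather than once per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 0/1 labels (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n_trees = 15  # odd, so a binary majority vote cannot tie

trees = []
for i in range(n_trees):
    # Step 1: draw a bootstrap sample of rows (sampling with replacement).
    rows = rng.integers(0, len(X), size=len(X))
    # Step 2: build one tree per sample; max_features="sqrt" makes the tree
    # look at a random subset of the features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Step 3: every tree generates an output for every row.
all_preds = np.array([tree.predict(X) for tree in trees])  # (n_trees, n_rows)

# Step 4: majority voting across trees for each row.
votes_for_one = all_preds.sum(axis=0)
forest_pred = (votes_for_one > n_trees / 2).astype(int)

train_accuracy = (forest_pred == y).mean()
print(f"Training accuracy of the hand-built forest: {train_accuracy:.3f}")
```

For regression, step 4 would average the tree outputs instead of counting votes.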

Important Features of Random Forest

Random Forest is distinguished by several key features that contribute to its effectiveness and versatility:
Diversity: Each decision tree in the Random Forest is built from a different subset of data and features. This diversity helps in reducing overfitting and improving the model’s generalization capability.
Robustness: By averaging the results from multiple trees, Random Forest reduces the variance and improves the robustness of the predictions.
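Handling of Missing Values: Depending on the implementation, it can handle missing values internally, for example by using surrogate splits or by averaging results from other trees that do not have missing values for the same data points. Other implementations expect missing values to be imputed beforehand.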
Feature Importance: It provides insights into the importance of each feature in the prediction process. This can be particularly useful for feature selection and understanding the underlying data patterns.
Scalability: Random Forest can be parallelized because each tree is built independently of the others. This makes it scalable to large datasets and high-dimensional data.
Versatility: It can be used for both classification and regression tasks. The algorithm is also effective for tasks involving categorical and continuous variables.
Stability: Due to its ensemble nature, it is less sensitive to changes in the training data than a single decision tree.
Out-of-Bag Error Estimation: Random Forest provides an internal mechanism for estimating the model error without the need for a separate validation set. This is done using the out-of-bag (OOB) samples, which are not used in the construction of each tree.
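The robustness and versatility points can be made concrete with a regression forest. The sketch below, which assumes scikit-learn’s RandomForestRegressor and synthetic data, checks that the forest’s prediction is the average of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Prediction of each individual tree, shape (n_trees, n_rows).
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)

matches = np.allclose(manual_average, forest_prediction)
print("Forest prediction equals the mean of its trees:", matches)

# Individual trees disagree with each other; averaging smooths this out.
print("Mean spread across trees:", per_tree.std(axis=0).mean())
```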

Difference Between Decision Tree and Random Forest

Random forest is a collection of decision trees; still, there are a lot of differences in their behavior.

Decision trees	Random Forest
1. Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control.	1. Random forests are created from subsets of data, and the final output is based on averaging or majority voting, which reduces the problem of overfitting.
2. A single decision tree is faster in computation.	2. It is comparatively slower.
3. When a data set with features is taken as input by a decision tree, it will formulate some rules to make predictions.	3. Random forest randomly selects observations and features, builds many decision trees, and aggregates their results. Its prediction does not follow a single set of rules.

Thus random forests are much more successful than single decision trees, but only if the individual trees are diverse and reasonably accurate.
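A quick way to see the overfitting difference is to train both models on the same noisy data and compare training and test accuracy. The sketch below uses synthetic data with deliberately noisy labels; the exact numbers depend on the data and seed, so treat the output as indicative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which an uncontrolled tree tends to memorize.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single tree grown without any depth or leaf-size limit.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A forest of 100 trees, each trained on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, model in [("Decision tree", tree), ("Random forest", forest)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

The unrestricted tree fits the training split perfectly, noise included; the forest usually gives up little on the training side and generalizes better on the test split.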

Important Hyperparameters in Random Forest

Random forests use hyperparameters to enhance model performance and predictive power or to increase the model’s speed.

Increase the Predictive Power

n_estimators: Number of trees the algorithm builds before aggregating the predictions (majority vote or average).
max_features: Maximum number of features random forest considers when splitting a node.
min_samples_leaf: Minimum number of samples required to be at a leaf node.
criterion: The function used to measure the quality of a split in each tree (Gini impurity, entropy, or log loss).
max_leaf_nodes: Maximum number of leaf nodes in each tree.
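As a sketch of how these hyperparameters are passed in scikit-learn (where the leaf-size parameter is spelled min_samples_leaf), the snippet below sets each of them on a RandomForestClassifier. The values and the synthetic data are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,       # number of trees in the forest
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=5,    # minimum number of samples at a leaf
    criterion="entropy",   # split quality measure
    max_leaf_nodes=16,     # cap on the number of leaves per tree
    random_state=0,
)
forest.fit(X, y)

# Check that the constraints were respected by the fitted trees.
leaves_per_tree = [tree.get_n_leaves() for tree in forest.estimators_]
print("Trees built:", len(forest.estimators_))
print("Largest number of leaves in any tree:", max(leaves_per_tree))
```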

Increase the Speed

n_jobs: Tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor; if the value is -1, it uses all available processors.
random_state: Controls the randomness of the sampling. The model always produces the same results when it is given a fixed random_state value along with the same hyperparameters and training data.
oob_score: OOB means out-of-bag. It is a built-in validation method for random forests. Because each tree is trained on a bootstrap sample, roughly one-third of the rows (about 37% on average) are not used to train that tree. We call these rows out-of-bag samples and use them to evaluate performance.
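The effect of random_state and oob_score can be checked with a short experiment. This sketch assumes synthetic data and a hypothetical helper, fit_forest, written only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def fit_forest(seed):
    """Hypothetical helper: fit a forest with a given random_state."""
    model = RandomForestClassifier(
        n_estimators=100,
        n_jobs=-1,          # use all available processors
        random_state=seed,  # fixes the bootstrap samples and feature draws
        oob_score=True,     # score each row using trees that never saw it
    )
    return model.fit(X, y)


first = fit_forest(seed=42)
second = fit_forest(seed=42)

# Same seed, same data, same hyperparameters: the results should agree.
same_oob = first.oob_score_ == second.oob_score_
same_importances = np.allclose(
    first.feature_importances_, second.feature_importances_
)

print("OOB score:", round(first.oob_score_, 3))
print("Reproducible:", same_oob and same_importances)
```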

Coding in Python – Random Forest Classifier

Now let’s implement Random Forest in scikit-learn.

1. Let’s import the libraries.

# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter notebook magic: render plots inline (remove when running as a script)
%matplotlib inline

2. Import the dataset.

# Load the heart disease dataset and inspect the first rows
df = pd.read_csv('heart_v2.csv')
print(df.head())
# Plot how many patients fall into each class of the target column
sns.countplot(x='heart disease', data=df)
plt.title('Value counts of heart disease patients')
plt.show()

3. Putting Feature Variable to X and Target variable to y.

# Putting feature variable to X
X = df.drop('heart disease',axis=1)
# Putting response variable to y
y = df['heart disease']

4. Train-Test-Split is performed

# now lets split the data into train and test
from sklearn.model_selection import train_test_split

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape

5. Let’s import RandomForestClassifier and fit the data.

from sklearn.ensemble import RandomForestClassifier

classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, max_depth=5,
                                       n_estimators=100, oob_score=True)

%%time
classifier_rf.fit(X_train, y_train)

# checking the oob score
classifier_rf.oob_score_

6. Let’s do hyperparameter tuning for Random Forest using GridSearchCV and fit the data.

rf = RandomForestClassifier(random_state=42, n_jobs=-1)

params = {
    'max_depth': [2,3,5,10,20],
    'min_samples_leaf': [5,10,20,50,100,200],
    'n_estimators': [10,25,50,100,200]
}

from sklearn.model_selection import GridSearchCV

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy")

%%time
grid_search.fit(X_train, y_train)

GridSearchCV(cv=4, estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
             n_jobs=-1,
             param_grid={'max_depth': [2, 3, 5, 10, 20],
                         'min_samples_leaf': [5, 10, 20, 50, 100, 200],
                         'n_estimators': [10, 25, 50, 100, 200]},
             scoring='accuracy', verbose=1)

grid_search.best_score_

rf_best = grid_search.best_estimator_
rf_best

From hyperparameter tuning, we can fetch the best estimator, as shown. The best set of parameters identified was max_depth=5, min_samples_leaf=10, and n_estimators=10.
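The cross-validated best_score_ is computed on the training split only, so it can be worth checking the tuned model on the held-out test split as well. The sketch below is self-contained and uses synthetic data and a deliberately small grid as stand-ins for the heart disease data and the grid above; the names small_grid and search are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the dataset used in this walkthrough.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# A reduced grid so the sketch runs quickly.
small_grid = {
    "max_depth": [3, 5],
    "min_samples_leaf": [5, 10],
    "n_estimators": [10, 25],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=4,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split by default,
# so it can be scored directly on the untouched test split.
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(test_accuracy, 3))
```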

7. Now, let’s visualize

from sklearn.tree import plot_tree

# class_names must follow ascending label order; this assumes 0 = no disease and 1 = disease.
# feature_names is passed as a plain list, which recent scikit-learn versions expect.
plt.figure(figsize=(80,40))
plot_tree(rf_best.estimators_[5], feature_names=list(X.columns), class_names=['No Disease', 'Disease'], filled=True);

plt.figure(figsize=(80,40))
plot_tree(rf_best.estimators_[7], feature_names=list(X.columns), class_names=['No Disease', 'Disease'], filled=True);

The trees created by estimators_[5] and estimators_[7] are different. Thus we can say that each tree is independent of the other.

8. Now let’s rank the features with the help of feature importance

rf_best.feature_importances_

imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": rf_best.feature_importances_
})

imp_df.sort_values(by="Imp", ascending=False)
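To see what these importance values look like without the heart disease file, here is a self-contained sketch on synthetic data. The column names are made up, and shuffle=False is used so that the three informative features sit in the first three columns.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative features are columns 0 to 2 and the
# remaining columns are pure noise.
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
columns = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

imp_df = pd.DataFrame(
    {"Varname": columns, "Imp": forest.feature_importances_}
).sort_values(by="Imp", ascending=False)

print(imp_df)
print("Sum of importances:", imp_df["Imp"].sum())
```

The importances are normalized to sum to 1, so each value can be read as a share of the total; the informative columns would be expected to rank near the top.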

Advantages and Disadvantages of Random Forest Algorithm

Advantages

You can use random forest for classification and regression problems.
It reduces the problem of overfitting, as the output is based on majority voting or averaging.
Depending on the implementation, it can perform well even if the data contains null/missing values.
Each decision tree is created independently of the others, so training can be parallelized.
It maintains high stability by averaging the answers from a large number of trees.
It maintains diversity because each decision tree does not consider all the attributes, although this is not true in all cases.
It is less affected by the curse of dimensionality. Since each tree ignores some attributes, the feature space it works with is reduced.
It provides a built-in error estimate without a separate validation set, because roughly one-third of the data is left out of the bootstrap sample used to build each decision tree. A held-out test set is still useful for a final check.

Disadvantages

Random forest is more complex and harder to interpret than a single decision tree, where you can make decisions by following the path of the tree.
Training takes longer than for a single decision tree because many trees have to be built. Prediction is also slower, since each decision tree has to generate an output for the given input data.

Conclusion

Random forest is a great choice for anyone who wants to build a model quickly and efficiently, and many implementations can handle missing values. It is a high-performing technique that is widely used in various industries. It can handle binary, continuous, and categorical data. Overall, random forest is a fast, simple, flexible, and robust model with some limitations.

Frequently Asked Questions

Q1. How do you explain a random forest?

A. Random forest is an ensemble learning method combining multiple decision trees, enhancing prediction accuracy, reducing overfitting, and providing insights into feature importance, widely used in classification and regression tasks.

Q2. How does random forest work step by step?

A. Random forest works by first randomly selecting subsets of the training data. For each subset, it constructs decision trees, splitting the data at each node based on the best feature from a random subset of features. Each tree then makes a prediction for a given input. Finally, the random forest combines the predictions of all trees, using averaging for regression tasks or majority voting for classification tasks, to produce the final output.

Q3. What are the advantages of Random Forest?

A. Random Forest tends to have low variance since it averages many trees built through bagging. It works well even with a dataset with a large number of features, since each split considers only a subset of them. Moreover, the trees are independent of each other, which makes the training process parallelizable.

Q4. Why do we use random forest algorithms?

A. Random forest algorithms provide superior prediction accuracy, handle large datasets effectively, offer versatility in tasks, are robust to noise, and reveal feature importance insights.

Q5. What is the difference between random forest and regression?

A. Random forest is an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting, used for both classification and regression tasks. Regression, on the other hand, is a statistical technique that models the relationship between dependent and independent variables to predict continuous outcomes.


