In the previous article, we walked through the complete flow of the decision tree algorithm. In this article, let's understand why we need to learn about random forest when we already have the decision tree algorithm. Why do we need random forest, and what is it all about? Like the decision tree, random forest is a supervised machine learning algorithm, and it is widely used for both classification and regression. However, a single decision tree is prone to overfitting.
Wondering what overfitting is? Overfitting occurs when the model is too complex and fits the training data too closely, noise included. As a result, the model does not make accurate predictions on unseen data. The random forest algorithm addresses this issue by building multiple decision trees and combining their predictions into a more accurate one.
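To make overfitting concrete, here is a minimal sketch that compares a single fully grown decision tree with a random forest. The synthetic dataset from make_classification and all of its settings are assumptions chosen only for illustration; they are not part of the original walkthrough, and the exact scores will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration).
# flip_y=0.1 randomly reassigns the label of roughly 10% of the rows.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A single tree grown without any depth limit memorizes the training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A forest combines many trees, each trained on a different bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_test_acc = forest.score(X_test, y_test)

print(f"Decision tree  train accuracy: {tree_train_acc:.3f}")
print(f"Decision tree  test accuracy:  {tree_test_acc:.3f}")
print(f"Random forest  test accuracy:  {forest_test_acc:.3f}")
```

A large gap between the tree's training accuracy and its test accuracy is the typical sign of overfitting. On noisy data like this, the forest usually narrows that gap, although this is not guaranteed for every dataset.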
Learning Objectives
This article was published as a part of the Data Science Blogathon.
In a decision tree, we have only one tree to answer the question. Let’s say we want to buy a phone. The following decision tree can be used to decide whether we should buy an iPhone or an Android phone.
However, in real life, do we do the same thing? Do we ask just one person before buying a phone? Definitely not. We ask multiple people, such as family, friends, experts, and salespeople.
Think of each tree as one person. Suppose you ask 10 people, and 8 say iPhone while 2 say Android. In random forest classification, the final output is decided by majority voting, so we will buy an iPhone.
Note: What if we are building a regression model? In that case, the final output is the mean (average) of the individual trees' predictions.
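The two ways of combining answers can be sketched in a few lines of plain Python. The votes mirror the phone example above; the regression numbers are hypothetical values made up for illustration.

```python
from collections import Counter
from statistics import mean

# Classification: 10 "trees" vote, 8 for iPhone and 2 for Android.
votes = ["iPhone"] * 8 + ["Android"] * 2
majority = Counter(votes).most_common(1)[0][0]
print(majority)  # iPhone

# Regression: average the numeric outputs of the trees.
# These three predictions are hypothetical.
tree_outputs = [200.0, 220.0, 210.0]
average = mean(tree_outputs)
print(average)  # 210.0
```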
We call the process of combining several decision tree models a random forest. But the question is, can we combine multiple models of other kinds, like logistic regression, naïve Bayes, and KNN, and if yes, what do we call it? We call it ensemble learning, and bagging and boosting are its two main methods. Both are used in machine learning to improve on the performance of a single model by combining the predictions of multiple models.
Bagging: Bagging is a way of combining multiple models; it may be any model, as we discussed above, like KNN, naïve Bayes, logistic regression, etc. However, if every copy of a model is trained on the same input data, every copy will produce the same result. To handle that, we use bootstrap aggregation: each model is trained on a different bootstrap sample of the data, and their predictions are then aggregated.
In addition to that, Bagging also reduces variance because of the two points above.
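As a rough sketch of bagging with a model other than a decision tree, scikit-learn's BaggingClassifier can wrap a KNN classifier. The choice of the iris dataset, 10 estimators, and the split settings are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The base model is passed as the first positional argument.
# bootstrap=True means each KNN copy is trained on rows sampled with replacement,
# so the copies see different data and can disagree with each other.
bagged_knn = BaggingClassifier(KNeighborsClassifier(), n_estimators=10,
                               bootstrap=True, random_state=0)
bagged_knn.fit(X_train, y_train)

bagging_acc = bagged_knn.score(X_test, y_test)
print(f"Bagged KNN test accuracy: {bagging_acc:.3f}")
```

Swapping KNeighborsClassifier for another estimator, such as logistic regression or naïve Bayes, should work the same way.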
Boosting: In contrast, boosting produces a strong learner by combining weak ones. As the above image shows, the models are trained sequentially, with each new model focusing on the errors made by the previous ones.
Types of Boosting algorithms
1. AdaBoost
2. Gradient boosting
3. XGBoost
We will cover all of these topics in the upcoming articles.
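Until then, here is a small preview of the first two algorithms using the implementations that ship with scikit-learn. The dataset, split, and n_estimators values are assumptions for illustration. XGBoost lives in a separate library, so it is left out of this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# AdaBoost: weak learners are added one after another,
# and rows that were misclassified earlier receive more weight.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Gradient boosting: each new tree is fit to the errors left by the trees before it.
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

ada_acc = ada.score(X_test, y_test)
gb_acc = gb.score(X_test, y_test)
print(f"AdaBoost test accuracy:          {ada_acc:.3f}")
print(f"Gradient boosting test accuracy: {gb_acc:.3f}")
```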
Yes, it's possible to understand the working of the random forest in 4 simple steps. But before that, we need to answer one question about random forest.
Which type of ensemble learning does random forest belong to?
It belongs to bagging: we build multiple decision trees, and together they are called a random forest.
Understanding random forest requires a step-by-step approach. Here is a step-by-step guide.
Given a training dataset, the model first creates bootstrap samples by sampling with replacement.
What is Bootstrap?
Creating multiple subsets from the actual training dataset.
How do we create multiple subsets when we have rows and columns in the training dataset? And what does "with replacement" mean?
Rows:
When we say "with replacement" (refer to the image below for a better understanding), we mean that the same row can appear multiple times in a subset. As you can see, in subset 2 the 2nd row is repeated 2 times, and in subset 3 the 1st row is repeated 2 times.
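Sampling rows with replacement can be sketched with NumPy and pandas. The small table below is hypothetical; its column names follow the loan example used later in this article, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical training data for illustration.
train = pd.DataFrame({
    "salary": [10, 20, 30, 40],
    "property": ["No", "Yes", "No", "Yes"],
    "loan_approval": [0, 1, 0, 1],
})

rng = np.random.default_rng(0)
subsets = []
for i in range(3):
    # Draw as many row positions as the dataset has rows, WITH replacement,
    # so the same row can be picked more than once.
    row_positions = rng.integers(0, len(train), size=len(train))
    subsets.append(train.iloc[row_positions])

for i, subset in enumerate(subsets, start=1):
    print(f"Subset {i}: original row indices {list(subset.index)}")
```

Any index that shows up twice in a printed list is a repeated row, and any index that is missing was left out of that subset.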
Columns:
1. For classification, the number of features is the square root of the total number of features.
Example: Let's say we have a total of 4 features. The square root of 4 is 2, so each tree gets 2 features.
2. For regression, the number of features is the total number of features divided by 3.
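The two rules of thumb are simple arithmetic, sketched below. The regression feature count of 9 is a hypothetical number chosen for illustration. Note that in scikit-learn the corresponding setting is the max_features parameter, which is applied at each split of a tree rather than once per tree, and its default may differ from these rules depending on the estimator and version.

```python
import math

# Classification: square root of the total number of features.
n_features_clf = 4
features_per_tree_clf = int(math.sqrt(n_features_clf))
print(features_per_tree_clf)  # 2

# Regression: total number of features divided by 3 (at least 1).
n_features_reg = 9  # hypothetical
features_per_tree_reg = max(1, n_features_reg // 3)
print(features_per_tree_reg)  # 3
```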
After completing step 1, we will build a decision tree for each subset. In the above example, we have 3 decision trees.
How do we build the decision trees from scratch?
To build a decision tree, we choose each split using one of two criteria.
1. Gini
2. Entropy and Information gain
For a detailed understanding of math, you can refer to my Decision tree article in Analytics Vidhya.
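As a quick refresher, the standard definitions of these criteria can be sketched in plain Python. The helper names and the tiny label lists are hypothetical and only serve as an example.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the share of each class among the labels of a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]           # perfectly mixed node
left, right = [0, 0], [1, 1]    # a split that separates the classes completely

print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A tree builder evaluates candidate splits like this one and keeps the split with the lowest impurity or the highest information gain.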
After building the decision trees, it's time to get the results. Suppose we have new information for prediction:
salary | property | loan approval |
10k | No | ? |
Combining the predictions of all the decision trees above, the model predicts “0”, as you can see in the image.
What do we mean by combining the prediction of all the trees?
To understand it, we go to step four.
In Step 4, we can clearly see the process of combining the predictions of multiple trees, which we call aggregation.
With this, we understand what exactly bootstrap aggregation is all about.
Now, we need to understand how it benefits us.
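The four steps can be put together in a minimal from-scratch sketch. It assumes the iris dataset, three trees, and a hard majority vote; these choices are for illustration only. As far as I know, scikit-learn's own RandomForestClassifier averages the trees' predicted class probabilities instead of counting hard votes, so this sketch is not a copy of that implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 3  # three subsets, as in the walkthrough

trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample of the training rows (with replacement).
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: build one decision tree per bootstrap sample.
    # max_features="sqrt" limits the features considered at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Step 3: every tree predicts every test row; shape is (n_trees, n_test_rows).
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Step 4: aggregation, here a majority vote over the trees for each test row.
final_pred = np.array([np.bincount(column).argmax() for column in all_preds.T])

manual_acc = (final_pred == y_test).mean()
print(f"Accuracy of the hand-made forest: {manual_acc:.3f}")
```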
It reduces the variance. This helps build robust models, which work well even on unseen data.
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
Here, we load the iris dataset from the sklearn library.
# Convert the iris data into a Pandas data frame with feature names as column names
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add a new column 'target' to the dataframe using target names and target codes
df['target'] = pd.Categorical.from_codes(data.target, data.target_names)
# Print first 5 rows of the dataframe
print(df.head())
We’re printing the first 5 rows after converting the data into a data frame.
# Split the data into features X and target y
X = df.drop('target', axis=1)
y = df['target']
# Import train_test_split function from sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Import the RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=20)
# Fit the training data
classifier_rf.fit(X_train, y_train)
# Predict on testing data
y_pred = classifier_rf.predict(X_test)
# Compare the actual and predicted labels
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
We have the predictions saved in the y_pred variable. We can compare the actual vs. predicted using the report below.
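The snippet above imports accuracy_score but never calls it. A minimal, self-contained sketch of how it could be used is shown below. It repeats the same split and model settings, and for brevity it uses the numeric target codes instead of the data frame, which is a small departure from the walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

classifier_rf = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=20)
classifier_rf.fit(X_train, y_train)
y_pred = classifier_rf.predict(X_test)

# Share of test rows whose predicted class matches the actual class.
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")
```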
In this article, we looked at one of the most popular machine learning algorithms. To summarize, we learned about random forest in detail. Let's take a look at the key takeaways.
Key takeaways:
Did you enjoy my article? Please comment below.