Complete Flow of Decision Tree Algorithm

ELLURU PAVAN KUMAR Last Updated : 10 May, 2022

8 min read

This article was published as a part of the Data Science Blogathon.

Introduction

In Machine Learning, there are two types of algorithms. One is Supervised, and the other is Unsupervised algorithms. A decision tree algorithm is a supervised Machine Learning Algorithm. There are many algorithms in supervised Machine Learning Algorithms like Random Forest, K-nearest neighbour, Naive Bayes, Logistic Regression, Linear Regression, Boosting algorithms, etc. These algorithms are used for predicting the output. Based on historical data, it will indicate the new output for new data. In detail will see below.

What is a Decision Tree Algorithm?

It is a Supervised Machine Learning Algorithm used for both classification and regression problems but primarily for classification models. It is a tree-structured classifier consisting of Root nodes, Leaf nodes, Branch trees, parent nodes, and child nodes. These will helps to predict the output.

Root node: It represents the entire population.

The leaf node represents the last node, nothing but the output label.

Branch tree: A subsection of the entire tree is called a Branch tree.

Parent node: A node, which is divided into sub-nodes, is called a parent node.

Child nodes: sub-nodes of the parent node are called child nodes.

Splitting: It is a process of dividing the node into subnodes is called splitting.

Pruning: It is a process of stopping the sub-nodes of a decision node is called Pruning. Simply, opposite process of splitting.

Decision Node: Splitting of a sub-node into further sub-nodes based on conditions is called Decision Nodes.

Working on Decision Tree Algorithm

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree, and the decision tree algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node. The algorithm again compares the attribute value with the other sub-nodes for the next node and moves further. It continues the process until it reaches the leaf node of the tree. The complete process can be better understood using the below algorithm:

Step-1: Select the root node based on the information gained value from the complete dataset.

Step-2: Divide the root node into sub-nodes based on information gain and entropy values.

Step-3: Continue this process till we cannot further classify the node into sub-nodes called leaf nodes.

An elementary example uses titanic data set for predicting whether a passenger will survive or not survive. I have used only three Columns/attributes/features for this example—namely, Sex, age, and sibs(number of spouses or children).

In the above figure, is sex male a root node? It will divide into two sub-nodes based on condition(yes or no). Is age>9.5?, is branch node and survived is Leaf node. Is sibsp>2.5?, is also a branch node, and died is a Leaf node. Both died and survived are Leaf nodes; there is no chance to split. In real datasets, there are more columns/attributes/features.

Implementing the Decision Tree Algorithm

While implementing the decision tree algorithm, everyone will doubt how to select the Root node and sub-nodes. We have a technique called ASM(Attributes Selection Measures). In this technique, there are two methods:

1. Information Gain:

2. Gini Index

Information Gain

entropy: It is the sum of the probability of each label times, the log probability of that same label. It is an average rate at which a stochastic data source produces information, Or it measures the uncertainty associated with a random variable.

Where S=Total number of samples

P(yes)=Probability of Yes

P(No)=Probability of No

information Gain: An amount of information gained about a random variable or signal from observing another random variable.

It favours smaller partitions with distinct values.

Gini Index

It is calculated by subtracting the sum of the squared probabilities of each class from one.

It favours larger partitions.

Decision Tree Algorithm: Pruning

In the tree algorithm, the Pruning concept will play a significant role. We may get the model overfitting issue when a model builds on a large dataset. For reducing this issue, Pruning will help.

Pruning is the process to stop the splitting of nodes into sub-nodes.

In this, there are two types:

1. Cost Complexity Pruning

2. Reduced Error Pruning

Implementation of Decision Tree Algorithm

For building a model, we need to preprocess the data, Transform the data, split the data into train and test, and then make the model.

Firstly, we need to import the dataset, assign it to a variable, and then view it.

import pandas as pd
import numpy as np
data=pd.read_csv('train.csv')
data.head()

Output:-

Survived	Pclass	Sex	Age	SibSp
0	0	3	male	22.0	1
1	1	1	female	38.0	1
2	1	3	female	26.0	0
3	1	1	female	35.0	1
4	0	3	male	35.0	0

The Survived column is the output variable from the above image, and the remaining columns are input variables.

Next, we can describe the dataset and check the mean, standard deviation, and percentile values.

data.describe()

Output:-

	Survived	Pclass	Age	SibSp
count	891.000000	891.000000	714.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008
std	0.486592	0.836071	14.526497	1.102743
min	0.000000	1.000000	0.420000	0.000000
25%	0.000000	2.000000	20.125000	0.000000
50%	0.000000	3.000000	28.000000	0.000000
75%	1.000000	3.000000	38.000000	1.000000
max	1.000000	3.000000	80.000000	8.000000

Preprocessing the Data

Check the missing values, whether is there any or not. If there, then replace it with mean or median or drop.

data.isna().sum()

Output:-

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
dtype: int64

From the above image, 0 means no missing values present in the column, and 177 points 177 missing values current in that column. So, as of now, I removed the entire rows where missing values will be Present.

data.dropna(inplace=True)

After removing the missing values, rows then check once whether it was removed or not.

data.isna().sum()

Output:-

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
dtype: int64

By seeing the above image missing values, rows are removed successfully.

The model only deals with Numerical data. So now check whether any categorical data will be present in the dataset. If any categorical attributes are present in input data, convert them into numerical points using dummy variables.

data.info()

Output:-

Int64Index: 714 entries, 0 to 890
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Sex       714 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     714 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 33.5+ KB

In the above image, the Dtype object is present, which means that the Sex column is categorical. Now apply the dummy variable method and convert it to numerical.

data_1=pd.get_dummies(data.Sex,drop_first=True)
data_1.head()

Output:-

	male
0	1
1	0
2	0
3	0
4	1

Let’s see the above image, converting the sex categorical column into numerical. ‘1’ represents male, and ‘0’ represents female. As of now, I have not changed the male column name. If you want, you can. And now, remove the original column from the dataset and add the new column to it.

data.drop(['Sex'],inplace=True,axis=1)
data.head(3)

Output:-

	Survived	Pclass	Age	tbsp
0	0	3	22.0	1
1	1	1	38.0	1
2	1	3	26.0	0

data1=pd.concat([data,data_1],axis=1)
data1.head(3)

Output:-

	Survived	Pclass	Age	tbsp	male
0	0	3	22.0	1	1
1	1	1	38.0	1	0
2	1	3	26.0	0	0

See the above image new column will be added. Now all columns will be numerical.

Splitting the Data

Before splitting the data, First, divide the input and output data separately. Split the dataset into two parts training and testing with some ratio. When a model suffers from a fitting problem, then adjust the ratio.

y=data1[['Survived']]
y.head(2)
Output:-

	Survived
0	0
1	1

x=data1.drop([‘Survived’],axis=1)
x.head(2)
Output:-

	Pclass	Age	SibSp	male
0	3	22.0	1	1
1	1	38.0	1	0

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.30, random_state=0)

Let’s see the above image, x is the input variable, and y is the output variable. For both data, we imported the train_test_split method. We divided the data into 70:30 ratio means training data is 70%, and testing data is 30%. For that, we have given test_size=0.30.

Apply Transformation

Apply Normalization or standardization to bring all attributes values to 0-1. This will helps to reduce the variance effect.

from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.fit_transform(x_test)
print(x_train[1])
print(x_test[1])
Output:-
array([-0.29658278, 0.10654022, -0.55031787, 0.75771531])
array([-1.42442296, 1.34921876, -0.56156739, -1.31206669])

From the above image, we can observe that all values are standardized.

Model Building

From the Sklearn library, import the model and build it.

from sklearn.tree import DecisionTreeClassifier  
classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)  
classifier.fit(x_train, y_train)

Output:-

DecisionTreeClassifier(criterion='entropy', random_state=0)

By seeing the above image, I successfully built the model. This model has been created on training data.

Predict the Results

After building the model, we can predict the output.

y_pred= classifier.predict(x_test)  
y_pred

Output:-

array([0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0], dtype=int64)

Model built on training data and the prediction made on testing data. So, x_test was provided, For prediction, and it will predict the output of the x_train data.

Check the Accuracy

After predicting the output, For x_test, we can check with y_test and y_pred at what percentage of accuracy it will Predict.

from sklearn.metrics import confusion_matrix  
cm= confusion_matrix(y_test, y_pred) 
cm

Output:-

array([[106,  19],
       [ 32,  58]], dtype=int64)

The above image shows the confusion matrix of y_test and y_pred. We can calculate all the metrics like accuracy score, recall, precision, and sensitivity from this.

Conclusion

The decision tree algorithm is a supervised machine learning algorithm where data is continuously divided at each row based on specific rules until the outcome is generated. It works for both classification and regression models.

Decision tree algorithms deal with complex datasets and can be pruned if necessary to avoid overfitting. This algorithm is not suited for imbalanced datasets. This algorithm is more prevalent in the health, finance, and technology sectors.

Now that you have learned the basics, We will cover some practical applications of decision trees in more detail in future posts.

If you have any queries, content with me on LinkedIn

Read the latest articles on our blog.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

ELLURU PAVAN KUMAR

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Complete Flow of Decision Tree Algorithm

Introduction

What is a Decision Tree Algorithm?

Working on Decision Tree Algorithm

Implementing the Decision Tree Algorithm

Information Gain

Gini Index

Decision Tree Algorithm: Pruning

Implementation of Decision Tree Algorithm

Preprocessing the Data

Splitting the Data

Apply Transformation

Model Building

Predict the Results

Check the Accuracy

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM