Decision trees are a simple machine learning tool used for classification and regression tasks. They break complex decisions into smaller steps, making them easy to understand and implement. This article explains what decision trees are, how they work, and their advantages, disadvantages, and applications.
A decision tree is a non-parametric supervised learning approach used for classification and regression tasks. It has a hierarchical structure made up of a root node, branches, internal nodes, and leaf nodes.
It is a tool with applications spanning several different areas, and it can be used for both classification and regression problems. As the name suggests, it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with a decision made at the leaf nodes.
Types of Decision Trees
ID3: This algorithm measures how mixed up the data is at a node using entropy. It then chooses the feature that clarifies the data the most, i.e., the one with the highest information gain.
C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes.
CART: This algorithm uses a different measure, called Gini impurity, to decide how to split the data. It can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks. The short scikit-learn sketch below shows how the split criterion can be switched between Gini impurity and entropy.
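The sketch below is only an illustration (the Iris dataset, depth limit, and random seeds are assumptions chosen for demonstration): scikit-learn's DecisionTreeClassifier builds a CART-style tree, and switching its criterion parameter moves between Gini impurity and entropy-based splitting.

```python
# Minimal sketch: comparing Gini impurity and entropy as split criteria.
# Dataset and hyperparameters are illustrative assumptions, not from the article.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    print(criterion, "test accuracy:", tree.score(X_test, y_test))
```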
Decision Tree Terminologies
Before learning more about decision trees, let's get familiar with some of the terminology:
Root Node: The initial node at the beginning of a decision tree, where the entire population or dataset starts dividing based on various features or conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes. These nodes represent intermediate decisions or conditions within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification or outcome. Leaf nodes are also referred to as terminal nodes.
Sub-Tree: Just as a subsection of a graph is called a sub-graph, a subsection of a decision tree is referred to as a sub-tree. It represents a specific portion of the decision tree.
Pruning: The process of removing or cutting down specific nodes in a tree to prevent overfitting and simplify the model (a short pruning sketch follows this list).
Branch / Sub-Tree: A subsection of the entire tree is referred to as a branch or sub-tree. It represents a specific path of decisions and outcomes within the tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as a parent node, and the sub-nodes emerging from it are referred to as child nodes. The parent node represents a decision or condition, while the child nodes represent the potential outcomes or further decisions based on that condition.
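As a hedged illustration of the pruning idea mentioned above: scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter. In the sketch below, the dataset and the alpha value are assumed placeholders chosen only to show the effect of pruning on tree size.

```python
# Illustrative sketch of pruning via scikit-learn's cost-complexity parameter ccp_alpha.
# The dataset and the ccp_alpha value are assumed placeholders for demonstration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# A pruned tree has fewer leaves and often generalizes better than a fully grown one.
print("unpruned leaves:", full_tree.get_n_leaves(), "test accuracy:", full_tree.score(X_test, y_test))
print("pruned leaves:  ", pruned_tree.get_n_leaves(), "test accuracy:", pruned_tree.score(X_test, y_test))
```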
Let’s understand decision trees with the help of an example:
Decision trees are drawn upside down, which means the root is at the top, and this root is then split into several nodes. In layman's terms, they are nothing but a series of if-else statements: the tree checks whether a condition is true and, if it is, moves on to the next node attached to that decision.
In the diagram below, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, humidity or wind. It then checks whether the wind is strong or weak; if the wind is weak and it is rainy, the person may go out and play.
Did you notice anything in the above flowchart? We see that if the weather is cloudy, the decision is always to play. Why didn't it split more? Why did it stop there?
To answer this question, we need to know about a few more concepts, like entropy, information gain, and the Gini index. But in simple terms, the output for the training dataset is always "yes" for cloudy weather; since there is no disorder here, we don't need to split the node further.
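To make the flowchart concrete, here is a tiny if-else sketch of those rules. The sunny/humidity branch is an assumed detail borrowed from the classic play-tennis example rather than taken from the figure itself.

```python
# Illustrative if-else translation of the weather flowchart.
# The sunny/humidity rule is an assumption from the classic play-tennis example.
def play_decision(weather: str, humidity: str, wind: str) -> str:
    if weather == "cloudy":
        return "play"                      # pure node: the training data is always "yes" here
    if weather == "sunny":
        return "play" if humidity == "normal" else "don't play"
    if weather == "rainy":
        return "play" if wind == "weak" else "don't play"
    return "don't play"                    # fallback for unseen weather values

print(play_decision("rainy", humidity="normal", wind="weak"))   # -> play
```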
The goal of the tree-building process is to decrease uncertainty, or disorder, in the dataset, and decision trees do this one split at a time.
Now you must be thinking: how do I know what the root node should be? What should the decision nodes be? When should I stop splitting? To decide this, there is a metric called entropy, which measures the amount of uncertainty in the dataset.
How Do Decision Tree Algorithms Work?
The decision tree algorithm works in a few simple steps:
Starting at the Root: The algorithm begins at the top, called the “root node,” representing the entire dataset.
Asking the Best Questions: It looks for the most important feature or question that splits the data into the most distinct groups. This is like asking a question at a fork in the tree.
Branching Out: Based on the answer to that question, it divides the data into smaller subsets, creating new branches. Each branch represents a possible route through the tree.
Repeating the Process: The algorithm continues asking questions and splitting the data at each branch until it reaches the final "leaf nodes," representing the predicted outcomes or classifications. A small sketch of this workflow follows these steps.
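As a hedged illustration of these steps (the Iris dataset and the depth limit are assumptions for demonstration), the sketch below fits a small tree with scikit-learn and prints the learned questions at each node using export_text.

```python
# Sketch of the fit-and-inspect workflow; dataset and depth are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text prints the tree as nested "feature <= threshold" questions,
# ending in leaf nodes that carry the predicted class.
print(export_text(tree, feature_names=list(iris.feature_names)))
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))   # classify a single new sample
```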
Several assumptions are made to build effective models when creating decision trees. These assumptions help guide the tree’s construction and impact its performance. Here are some common assumptions and considerations when creating decision trees:
Binary Splits
Decision trees typically make binary splits, meaning each node divides the data into two subsets based on a single feature or condition. This assumes that each decision can be represented as a binary choice.
Recursive Partitioning
Decision trees use a recursive partitioning process, where each node is divided into child nodes, and this process continues until a stopping criterion is met. This assumes that data can be effectively subdivided into smaller, more manageable subsets.
Feature Independence
Decision trees often assume that the features used for splitting nodes are independent. In practice, feature independence may not hold, but decision trees can still perform well when features are correlated.
Homogeneity
Decision trees aim to create homogeneous subgroups in each node, meaning that the samples within a node are as similar as possible with respect to the target variable. This assumption helps in achieving clear decision boundaries.
Top-Down Greedy Approach
Decision trees are constructed using a top-down, greedy approach, where each split is chosen to maximize information gain or minimize impurity at the current node. This may not always result in the globally optimal tree. A toy sketch of this greedy procedure follows.
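To make the greedy, impurity-minimizing split choice concrete, here is a toy sketch in plain Python (not the article's implementation; the function names, Gini criterion, and depth limit are illustrative assumptions). At each node it tries every feature and threshold, keeps the split with the lowest weighted Gini impurity, and recurses until a node is pure or the depth limit is reached.

```python
# Toy sketch of top-down greedy tree building with Gini impurity (illustrative only).
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build_tree(rows, labels, depth=0, max_depth=2):
    # Stop when the node is pure or the depth limit is reached -> leaf node.
    if gini(labels) == 0.0 or depth == max_depth:
        return {"leaf": max(set(labels), key=labels.count)}
    best = None
    for i in range(len(rows[0])):                      # try every feature...
        for t in {row[i] for row in rows}:             # ...and every observed threshold
            left = [j for j, row in enumerate(rows) if row[i] <= t]
            right = [j for j, row in enumerate(rows) if row[i] > t]
            if not left or not right:
                continue                               # a split must produce two non-empty children
            score = (len(left) * gini([labels[j] for j in left]) +
                     len(right) * gini([labels[j] for j in right])) / len(labels)
            if best is None or score < best[0]:        # greedy: keep the lowest-impurity split
                best = (score, i, t, left, right)
    if best is None:                                   # no useful split found -> leaf node
        return {"leaf": max(set(labels), key=labels.count)}
    _, i, t, left, right = best
    return {"feature": i, "threshold": t,
            "left": build_tree([rows[j] for j in left], [labels[j] for j in left], depth + 1, max_depth),
            "right": build_tree([rows[j] for j in right], [labels[j] for j in right], depth + 1, max_depth)}

# Tiny example: feature 0 separates the two classes at threshold 2.
print(build_tree([[1], [2], [3], [4]], ["no", "no", "yes", "yes"]))
```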
Advantages of Decision Trees
Easy to Understand: They are simple to visualize and interpret, making them easy to understand even for non-experts.
Handles Both Numerical and Categorical Data: They can work with both types of data without needing much preprocessing.
No Need for Data Scaling: These trees do not require normalization or scaling of data.
Automated Feature Selection: They automatically identify the most important features for decision-making.
Handles Non-Linear Relationships: They can capture non-linear patterns in the data effectively.
Disadvantages of Decision Trees
Overfitting Risk: They can easily overfit the training data, especially if they are too deep.
Unstable with Small Changes: Small changes in data can lead to completely different trees.
Biased with Imbalanced Data: They tend to be biased if one class dominates the dataset.
Limited to Axis-Parallel Splits: They struggle with diagonal or complex decision boundaries.
Can Become Complex: Large trees can become hard to interpret and may lose their simplicity.
How do Decision Trees use Entropy?
Now that we know what entropy is and how its formula is defined, we need to understand how exactly it works in this algorithm.
Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells us how random our data is. A pure sub-split means that you should be getting either all "yes" or all "no".
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split, the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see here that the split is not pure. Why? Because we can still see some negative classes in both nodes. To build the tree, we need to calculate the impurity of each split, and when the purity is 100%, we make that node a leaf node.
To check the impurity of the left and right nodes, we take the help of the entropy formula:
Entropy = −p(yes)·log₂ p(yes) − p(no)·log₂ p(no)
where p(yes) and p(no) are the proportions of "yes" and "no" samples in the node.
We can clearly see from the tree itself that the left node has lower entropy, i.e., more purity, than the right node, since the left node has a greater proportion of "yes" samples and it is easier to make a decision there.
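A short sketch (using the counts from the split above) makes this concrete by computing the entropy of the parent node and of the two child nodes directly from the formula.

```python
# Entropy of the parent and child nodes from the example split above.
from math import log2

def entropy(yes: int, no: int) -> float:
    total = yes + no
    result = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:                        # treat 0 * log2(0) as 0
            result -= p * log2(p)
    return result

print(round(entropy(8, 4), 3))   # parent node:      ~0.918
print(round(entropy(5, 2), 3))   # left child node:  ~0.863 (purer, lower entropy)
print(round(entropy(3, 2), 3))   # right child node: ~0.971 (less pure, higher entropy)
```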
Always remember that the higher the Entropy, the lower will be the purity and the higher will be the impurity.
As mentioned earlier, the goal is to decrease the uncertainty or impurity in the dataset. By using entropy we are measuring the impurity of a particular node, but that alone does not tell us whether the entropy has decreased compared to the parent node.
For this, we bring in a new metric called "information gain," which tells us how much the parent entropy has decreased after splitting on some feature.
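Continuing the same example, here is a hedged sketch of information gain: the parent node's entropy minus the size-weighted average entropy of its children after the split.

```python
# Information gain for the example split: parent entropy minus the
# size-weighted average entropy of the child nodes.
from math import log2

def entropy(yes: int, no: int) -> float:
    total = yes + no
    return -sum(p * log2(p) for p in (yes / total, no / total) if p > 0)

parent = entropy(8, 4)                                            # ~0.918
children = (7 / 12) * entropy(5, 2) + (5 / 12) * entropy(3, 2)    # weighted child entropy
print("information gain:", round(parent - children, 3))           # ~0.010
```

A small but positive gain like this means the split helped only slightly; the algorithm would prefer whichever candidate feature yields the largest gain.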
Applications of Decision Trees
Healthcare
Diagnosing diseases based on patient symptoms: Decision trees help doctors analyze symptoms and medical history to identify potential illnesses. For example, they can determine if a patient has diabetes or heart disease by evaluating factors like age, weight, and test results.
Predicting patient outcomes and treatment effectiveness: Decision trees can predict how a patient might respond to a specific treatment, helping doctors choose the best course of action.
Identifying risk factors for specific health conditions: They can analyze data to find patterns, such as lifestyle habits or genetic factors, that increase the risk of diseases like cancer or diabetes.
Finance
Assessing credit risk for loan approvals: Decision trees evaluate an applicant’s credit history, income, and other factors to decide whether to approve or reject a loan application.
Detecting fraudulent transactions: By analyzing transaction patterns, decision trees can flag unusual or suspicious activities, helping banks prevent fraud.
Predicting stock market trends and investment risks: They analyze historical data to forecast market trends, helping investors make informed decisions.
Marketing
Segmenting customers for targeted campaigns: Decision trees group customers based on their behavior, preferences, or demographics, allowing businesses to create personalized marketing strategies.
Predicting customer churn and retention: They analyze customer data to identify those likely to stop using a service, enabling companies to take proactive steps to retain them.
Recommending products based on customer preferences: Decision trees suggest products or services to customers based on their past purchases or browsing history.
Education
Predicting student performance and outcomes: Decision trees analyze factors like attendance, grades, and study habits to predict how well a student might perform in exams or courses.
Identifying factors affecting student dropout rates: They help schools understand why students drop out, such as financial issues or academic struggles, so they can intervene early.
Personalizing learning paths for students: They can recommend tailored learning materials or courses based on a student's strengths and weaknesses.
Conclusion
To summarize, in this article we learned about decision trees: on what basis the tree splits its nodes, how we can stop overfitting, and why linear regression doesn't work for classification problems. To check out the full implementation, please refer to my GitHub repository. You can master all the Data Science topics with our Black Belt Plus Program, with our 50+ projects and 20+ tools. We hope you liked this article and that it gave you a clear understanding of the decision tree algorithm and decision tree examples. Start your learning journey today!
Frequently Asked Questions
Q1. Why is it called a decision tree?
A. A decision tree is a tree-like structure that represents a series of decisions and their possible consequences. It is used in machine learning for classification and regression tasks. An example of a decision tree is a flowchart that helps a person decide what to wear based on the weather conditions.
Q2. What are the three types of decision trees?
A. The three main types are: Classification Trees: Used to predict categories (e.g., yes/no, spam/not spam). Regression Trees: Used to predict numerical values (e.g., house prices, temperature). CART (Classification and Regression Trees): A combination of both classification and regression trees.
Q3. What are the 4 types of decision tree?
A. The four types of decision trees are Classification tree, Regression tree, Cost-complexity pruning tree, and Reduced Error Pruning tree.
Q4. What is a decision tree algorithm?
A. A decision tree algorithm is a machine learning algorithm that uses a decision tree to make predictions. It follows a tree-like model of decisions and their possible consequences. The algorithm works by recursively splitting the data into subsets based on the most significant feature at each node of the tree.
Q5. What is an example of a decision tree?
A. A decision tree is like a flowchart that helps make decisions. For example, imagine deciding whether to play outside or stay indoors. The tree might ask, “Is it raining?” If yes, you stay indoors. If no, it might ask, “Is it too hot?” and so on, until you reach a decision.