
Reading Time: 8 Minutes
In the vast realm of machine learning, decision trees stand out as a versatile and comprehensible tool that mirrors human thinking capabilities. They provide a structured approach to problem-solving and are a valuable addition to the data scientist’s toolkit. In this blog post, we’ll embark on a journey to uncover the inner workings of decision trees, demystifying the process and terminology associated with this powerful algorithm.

The Theory and Logic
The Building Blocks of Decision Trees:
At its core, a decision tree is a supervised machine learning technique, primarily used for classification tasks. It takes labeled data, where categories are known, and predicts the target variable. The magic happens by constructing a decision tree where internal nodes represent features, branches convey decision rules, and leaf nodes yield outcomes.
The Essence of Decision Trees:
Imagine decision trees as a way to build a graph that captures all possible solutions to a problem based on given conditions. The tree is constructed by asking yes/no questions and further branching into subtrees until we reach a definitive answer. This method is not just robust, but it’s also incredibly transparent and comprehensible.
The Power of Simplicity:
One of the most significant advantages of decision trees is their simplicity. They provide a direct route to solutions by mimicking human decision-making. In an age of complex algorithms and black-box models, decision trees offer clarity and transparency. This simplicity makes them a valuable tool for both beginners and experts in machine learning.
Key Terminologies:
Let’s familiarise ourselves with some essential terminologies linked to decision trees:
- Root Node: This is the top node of the tree, representing the entire dataset.
- Internal Nodes: These nodes split the data into subsets based on a particular feature.
- Branches: Decision rules that guide us from one node to another.
- Leaf Nodes: The final outcomes or predictions reside in these nodes.
Attribute Selection Method (ASM):
Now, the pressing question: How do we choose attributes at each level of the tree? Here, we introduce the Attribute Selection Method (ASM), a crucial step in constructing a decision tree. ASM helps us decide which features are the most informative for making decisions. Common ASM techniques include Gini impurity, information gain, and gain ratio, among others.
In this post, we will use entropy and information gain.
Entropy is a fundamental concept in machine learning that measures the amount of disorder within a dataset. In this context, it represents the impurity of an attribute: the more evenly the records are spread across the categories, the higher the entropy, and the more they concentrate in a single category, the lower it is. For a binary target like ours, it lies in the range of 0 to 1.
Information Gain quantifies the reduction in entropy after a dataset is split on a particular attribute. In essence, it measures how much useful information an attribute provides about which class each data point belongs to.
With this framework in mind, our primary objective is to select the most suitable attribute as the parent node when constructing the decision tree. The choice is guided by Information Gain: we pick the attribute with the highest gain, since it contributes the most information to the classification process.
Constructing a Decision Tree:
The process of building a decision tree involves several steps, with ASM being a pivotal one. Here’s a simplified algorithm to construct a decision tree:
- Start with the entire dataset at the root node.
- Select the best attribute (using ASM) to split the data into subsets.
- Create child nodes for each outcome of the selected attribute.
- Recursively repeat steps 2 and 3 for each child node until a stopping condition is met (e.g., a maximum depth or purity threshold); a minimal code sketch of this procedure follows the list.
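To make these steps concrete, here is a minimal Python sketch of the same recursive procedure, using entropy and information gain as the ASM. The function names and the dictionary-based tree representation are illustrative choices for this post, not part of any particular library; `rows` is assumed to be a list of dictionaries (one per record) and `labels` the matching list of target values.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy after splitting the rows on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [label for row, label in zip(rows, labels) if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    """Recursively pick the highest-gain attribute (steps 2-4 above)."""
    if len(set(labels)) == 1:          # pure node -> leaf with that class
        return labels[0]
    if not attributes:                 # nothing left to split on -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attributes if a != best],
        )
    return tree
```

Calling build_tree(rows, labels, attributes) on the records of the example below should reproduce the same splits we derive by hand in the rest of this post.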
Now, let's take the following example to understand the logic behind a decision tree.

In this case, our goal is to predict whether a person buys a computer, given details such as their Age, Income, whether they are a Student, and their Credit Rating. Now let's understand the math and calculations behind forming the decision tree.

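Spelling out the definition we will rely on below (this is the standard form of entropy used with decision trees), for a set S whose records fall into classes indexed by i:

Entropy(S) = − Σᵢ pᵢ · log₂(pᵢ)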
Here, pᵢ is the probability of a certain category occurring within an attribute; you will get a more detailed understanding when we delve into the calculation section.
Initially, we need to find the entropy of our target variable, Buys Computer; let's name it BC for simplicity.
The sum over probabilities is quite intuitive, but questions may arise about why the logarithm is used: logarithms let us quantify and compare the amount of information and uncertainty in different situations.
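As a worked example, assume the target column contains 9 'Yes' and 5 'No' records out of 14 in total (the only binary split of 14 records consistent with the 0.9402 value discussed below). Then:

Entropy(BC) = −(9/14) · log₂(9/14) − (5/14) · log₂(5/14) ≈ 0.940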


To understand what this 0.9402 actually means, we need to go back to the definition of entropy given above. As the entropy value increases from 0 towards 1, it indicates a higher level of uncertainty and impurity in the dataset.
This means that the elements are distributed among multiple categories or classes and no single class dominates strongly. An entropy of 1 (1.00) represents a state of maximum impurity, where elements are equally distributed among the categories. If there were 7 yes and 7 no, the entropy would have turned out to be exactly 1; you can verify this yourself, since log₂(7/14) = −1, so −(7/14)·(−1) − (7/14)·(−1) = 1.
Now that we have the entropy of our target variable, let's calculate the information gain of the other attributes in the dataset to decide which one should become the root node.

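Information gain is the entropy of the target minus the weighted entropy that remains after splitting on an attribute A:

Gain(A) = Entropy(BC) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

where S is the full set of records and Sᵥ is the subset for which A takes the value v. For Age, assuming the per-group counts of the standard 14-record buys-computer example (Youth: 2 Yes / 3 No, Middle-aged: 4 Yes / 0 No, Senior: 3 Yes / 2 No; these counts are not spelled out above, but they reproduce the figure that follows):

Gain(Age) = 0.940 − [ (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 ] ≈ 0.246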
Information gain for Age = 0.246
So if we perform the same procedure for the rest of the attributes, we’ll get the following values:
Information gain for Income = 0.029
Information gain for Student = 0.151
Information gain for Credit Rating = 0.048
Therefore, ranked by information gain, the attributes come out as Age > Student > Credit Rating > Income, so Age is chosen as the root node. This is how the tree would look:
Now, Middle-aged people buy a computer in every case in the given dataset, so we stop expanding the Middle-aged branch and label it 'Yes'.

Now, for Youth, we only take the records where Age is Youth, together with the remaining attributes.

Within this Youth branch, we have to calculate the entropy and then the information gain of the other attributes, just as we did above.
If we do the calculations, we find that the Student attribute has a higher information gain than Income and Credit Rating. So the tree would look like this:

Finally, for the Senior branch, a similar approach has to be followed and the best attribute chosen; in this case it turns out to be Credit Rating.

So this is how the decision tree looks once we sketch it out based on our information-gain findings.
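In text form, the structure derived so far is shown below. Only the splits established above are included; the leaf labels under Student and Credit Rating depend on the individual records in the original table.

```
Age?
├── Youth       → split further on Student
├── Middle-aged → Yes (buys a computer)
└── Senior      → split further on Credit Rating
```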
Now, for real-world applications, we don't need to hand-code functions that calculate entropy and information gain for every node: scikit-learn's tree module provides DecisionTreeClassifier, which builds the decision tree for us. Now that we've covered all the fundamentals, we'll delve into the code in detail in the next post to keep this one concise.
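As a small preview of that post, a minimal sketch of the scikit-learn route might look like the following. The DataFrame rows and column names here are illustrative stand-ins for the dataset used above, and the one-hot encoding step is one possible way to feed categorical attributes to the classifier.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records mirroring the attributes of the worked example above.
df = pd.DataFrame({
    "Age": ["Youth", "Middle-aged", "Senior", "Youth"],
    "Income": ["High", "High", "Medium", "Low"],
    "Student": ["No", "No", "Yes", "Yes"],
    "Credit Rating": ["Fair", "Excellent", "Fair", "Fair"],
    "Buys Computer": ["No", "Yes", "Yes", "Yes"],
})

# Trees in scikit-learn need numeric inputs, so one-hot encode the categorical features.
X = pd.get_dummies(df.drop(columns=["Buys Computer"]))
y = df["Buys Computer"]

# criterion="entropy" makes the classifier split on information gain, as in this post.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned splits as text to compare with the hand-built tree.
print(export_text(clf, feature_names=list(X.columns)))
```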
If you have any questions or need further clarification, please feel free to ask in the comment section below. Your curiosity and engagement are highly valued. Click here to view all the concepts related to machine learning.
Thank you for reading all along, subscribe to sapiencespace and enable notifications to get regular insights.