Entropy: how Decision Trees make decisions

You are a data science specialist currently following a learning path, and you have come a long way since you wrote your first line of code in Python or R. You know Scikit-Learn like the back of your hand. These days you spend more time on Kaggle than on Facebook. You are no stranger to building impressive random forests and other ensembles of decision trees that do an excellent job. Still, you know you will not get far without developing a well-rounded understanding. You want to dig deeper and understand the intricacies and concepts behind popular machine learning models. Well, so do I.

Today I will talk about the concept of entropy, one of the most important topics in statistics, and later we will talk about Information Gain and find out why these fundamental concepts form the basis of how decision trees are built from data.

Alright. Now let's dive in.

What is entropy? In simple terms, entropy is nothing more than a measure of disorder. (It can also be thought of as a measure of purity, and you will soon see why. But I prefer "disorder" because it sounds cooler.)

The mathematical formula for entropy is:

E(S) = -Σᵢ pᵢ · log₂(pᵢ)

Entropy. It is sometimes written as H.

Here pᵢ is the frequency probability of an element/class i in our data. For simplicity, suppose we have only two classes: positive and negative. Then i takes the value "+" or "-". If we had 100 points in total in our dataset, 30 of which belonged to the positive class and 70 to the negative, then p₊ would be 3/10 and p₋ would be 7/10. Simple enough.

If I calculate the entropy of the classes in this example, here is what I get using the formula above:

E = -(3/10)·log₂(3/10) - (7/10)·log₂(7/10) ≈ 0.88
The entropy is about 0.88. This is considered quite high, that is, we have a high level of entropy or disorder (in other words, low purity). For two classes, entropy lies in the range from 0 to 1. Depending on the number of classes in your dataset, the entropy can exceed 1, but it means the same thing: the level of disorder is extremely high. For simplicity, in today's article entropy will stay between 0 and 1.
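To make the formula concrete, here is a minimal sketch of that calculation in Python (the function name entropy and the example counts are my own, not from the article):

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a collection of class counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a class with zero members contributes nothing
        p = count / total
        result -= p * math.log2(p)
    return result

# The example above: 30 positive and 70 negative observations.
print(entropy([30, 70]))  # ~0.881
```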

Take a look at the chart below.



The X axis shows the number of points from the positive class in each circle, and the Y axis shows the corresponding entropy. You can immediately notice the inverted U-shape of the graph. Entropy is lowest at the extremes, when the circle contains no positive elements at all or only positive elements. That is, when the circle holds identical elements, the disorder is 0. Entropy is highest in the middle of the graph, where positive and negative elements are evenly mixed inside the circle; there the entropy, or disorder, reaches its maximum, since no class dominates.
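If you want to reproduce a chart like the one described here, a small sketch using matplotlib could look like this (my assumption; the original figure may have been produced differently):

```python
import math
import matplotlib.pyplot as plt

def binary_entropy(p):
    """Entropy of a two-class set with a share p of positive points."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

n_total = 100  # total number of points in the circle
positives = list(range(n_total + 1))
entropies = [binary_entropy(k / n_total) for k in positives]

plt.plot(positives, entropies)
plt.xlabel("Number of positive points")
plt.ylabel("Entropy")
plt.title("Entropy peaks at an even mix and drops to 0 for a pure set")
plt.show()
```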

Is there a deep reason why entropy is measured with the base-2 logarithm, or why it falls between 0 and 1 rather than some other range? Not really; it is just a metric, and it is not so important to dwell on why it is defined this way. What matters is knowing how the value we computed above is calculated and how it behaves. Entropy is a measure of confusion or uncertainty, and the goal of machine learning models, and of data science specialists in general, is to reduce that uncertainty.

Now we know how disorder is measured. Next, we need a quantity that measures how much the disorder in the target variable/class is reduced by additional information (features/independent variables). This is where Information Gain comes into play. Mathematically, it can be written as follows:

IG(Y, X) = E(Y) - E(Y | X)

We simply subtract the entropy of Y given X from the entropy of Y alone to calculate the reduction in uncertainty about Y, given the additional information X provides about Y. The greater the reduction in uncertainty, the more information about Y is gained from X.
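As a sketch of that subtraction in code (the helper names and the example split are mine, purely for illustration):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_per_value):
    """E(Y) minus the weighted average entropy of Y within each value of X."""
    total = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(counts) / total) * entropy(counts)
        for counts in child_counts_per_value
    )
    return entropy(parent_counts) - weighted_child_entropy

# Hypothetical example: 30 positive / 70 negative observations split by
# some feature X into two groups, [30, 10] and [0, 60].
print(information_gain([30, 70], [[30, 10], [0, 60]]))  # ~0.56
```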

Let's look at a simple contingency-table example to get closer to how decision trees use entropy and information gain to decide which feature to split nodes on while learning from data.

Example: Contingency table



Here our target variable is Liability, which can take only two values: "Normal" and "High". We also have only one feature, Credit Rating, which splits the observations into three categories: "Excellent", "Good" and "Poor". A total of 14 observations were made: 7 belong to the Normal Liability class and 7 to the High Liability class. That is the split of the target variable on its own.

If we look at the total of the first row, we see that there are 4 observations with the value Excellent for Credit Rating. Moreover, I can also say how my target variable splits among the observations with an "Excellent" Credit Rating: 3 of them belong to the Normal Liability class and 1 belongs to the High Liability class. I can read off similar results for the other Credit Rating values from the contingency table.

I will use the contingency table above to calculate the entropy of our target variable on its own, and then calculate its entropy taking into account the additional information from the Credit Rating feature. That way I can calculate how much additional information Credit Rating gives me about the Liability target variable.

So let's get started.

E(Liability) = -(7/14)·log₂(7/14) - (7/14)·log₂(7/14) = 1
The entropy of our target variable is 1, which means maximum disorder due to the even split between "Normal" and "High". The next step is to calculate the entropy of the Liability target variable taking into account the additional information from Credit Rating. To do this, we calculate the entropy of Liability for each value of Credit Rating and combine them as a weighted average, weighting each value by its share of the observations. Why we use the weighted average will become clearer when we talk about decision trees.

E(Liability | Credit Rating) = Σᵥ P(Credit Rating = v) · E(Liability | Credit Rating = v)
We have obtained the entropy of our target variable given the Credit Rating feature. Now we can calculate the information gain of Liability from Credit Rating to see how informative this feature is.

IG(Liability, Credit Rating) = E(Liability) - E(Liability | Credit Rating)
Knowing Credit Rating helped us reduce the uncertainty about our target variable Liability. Isn't that exactly what a good feature should do, give us information about the target variable? Well, that is precisely why decision trees use entropy and information gain. They determine which feature to split the nodes on, so that each subsequent split gets closer to the target variable, and they also determine when to stop building the tree (in addition to hyperparameters such as maximum depth, of course). Let's see how all of this works in the following example with decision trees.
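Here is a tiny sketch of that whole calculation. Only the "Excellent" row (3 Normal, 1 High) and the overall 7/7 split come from the article; the counts I use for "Good" and "Poor" are placeholders of my own so the code can run, not the article's actual table:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# [Normal, High] counts for each Credit Rating value.
# "Excellent" is from the article; "Good" and "Poor" are made-up
# placeholders chosen only to keep the overall 7/7 split.
table = {
    "Excellent": [3, 1],
    "Good": [4, 2],  # placeholder
    "Poor": [0, 4],  # placeholder
}

total = sum(sum(counts) for counts in table.values())  # 14 observations
conditional = sum((sum(c) / total) * entropy(c) for c in table.values())
gain = entropy([7, 7]) - conditional

print(round(conditional, 2), round(gain, 2))
```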

Example: Decision Tree

Let's look at an example of building a decision tree to predict whether a person's credit will be written off or not. The population consists of 30 instances: 16 belong to the "write-off" class and the other 14 to the "non-write-off" class. We have two features: "Balance", which can take two values, "<50K" or ">50K", and "Residence", which takes three values, "OWN", "RENT" or "OTHER". I will demonstrate how the decision tree algorithm decides which feature to split on first and which feature is more informative, that is, which best removes the uncertainty in the target variable, using the concepts of entropy and information gain.

Feature 1: Balance



Here the circles belong to the "write-off" class and the stars to the "non-write-off" class. Splitting the parent (root) node on the Balance feature gives us 2 child nodes. The left node gets 13 observations, of which 12/13 (probability 0.92) are from the "write-off" class and only 1/13 (probability 0.08) is from the "non-write-off" class. The right node gets 17 of the 30 observations, of which 13/17 (probability 0.76) are from the "write-off" class and 4/17 (probability 0.24) are from the "non-write-off" class.

Let's calculate the entropy of the root node and see how much uncertainty the tree can remove by splitting on Balance.

E(root) = -(16/30)·log₂(16/30) - (14/30)·log₂(14/30) ≈ 0.99
E(left child) = -(12/13)·log₂(12/13) - (1/13)·log₂(1/13) ≈ 0.39
E(right child) = -(13/17)·log₂(13/17) - (4/17)·log₂(4/17) ≈ 0.79

Weighted entropy after the split = (13/30)·0.39 + (17/30)·0.79 ≈ 0.62
IG(Balance) = 0.99 - 0.62 ≈ 0.37
A split on Balance gives an information gain of 0.37. Let's compute the same for the Residence feature and compare the results.

Feature 2: Residence



Splitting the tree on Residence gives 3 child nodes. The left child gets 8 observations, of which 7/8 (probability 0.88) are from the "write-off" class and only 1/8 (probability 0.12) is from the "non-write-off" class. The middle child gets 10 observations, of which 4/10 (probability 0.4) are from the "write-off" class and 6/10 (probability 0.6) are from the "non-write-off" class. The right child gets 12 observations, of which 5/12 (probability 0.42) are from the "write-off" class and 7/12 (probability 0.58) are from the "non-write-off" class. We already know the entropy of the parent node, so we simply calculate the entropy after the split to find the information gain from the Residence feature.

E(left child) = -(7/8)·log₂(7/8) - (1/8)·log₂(1/8) ≈ 0.54
E(middle child) = -(4/10)·log₂(4/10) - (6/10)·log₂(6/10) ≈ 0.97
E(right child) = -(5/12)·log₂(5/12) - (7/12)·log₂(7/12) ≈ 0.98

Weighted entropy after the split = (8/30)·0.54 + (10/30)·0.97 + (12/30)·0.98 ≈ 0.86
IG(Residence) = 0.99 - 0.86 ≈ 0.13
The information gain from the Balance feature is almost 3 times larger than from Residence! If you look at the diagrams again, you will see that splitting on Balance produces cleaner child nodes than splitting on Residence. The leftmost node for Residence is also quite pure, but this is exactly where the weighted average comes into play: although that node is pure, it has the fewest observations, so its contribution is diluted when the total entropy for Residence is computed. This matters because we are looking for the overall informativeness of a feature and do not want the final result to be skewed by a rare value of that feature.
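For completeness, here is a short sketch that redoes both calculations from the counts given above (the helper names are mine). Working with the exact fractions gives roughly 0.38 and 0.14; the 0.37 and 0.13 above come from rounding the intermediate entropies to two decimals.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    total = sum(parent)
    weighted = sum((sum(c) / total) * entropy(c) for c in children)
    return entropy(parent) - weighted

root = [16, 14]  # [write-off, non-write-off] at the root node

# Balance split: left node (12 write-off, 1 non), right node (13, 4)
balance_children = [[12, 1], [13, 4]]

# Residence split: left (7, 1), middle (4, 6), right (5, 7)
residence_children = [[7, 1], [4, 6], [5, 7]]

print(round(information_gain(root, balance_children), 2))    # ~0.38
print(round(information_gain(root, residence_children), 2))  # ~0.14
```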

The Balance feature on its own provides more information about the target variable than Residence, reducing the entropy of the target variable more. The decision tree algorithm uses this result to make the first split on Balance, and then decides which feature to split the following nodes on. In the real world, when there are more than two features, the first split is made on the most informative feature, and then, at each subsequent split, the information gain has to be recomputed for every remaining feature, since it will no longer be the same as the gain of each feature taken on its own. Entropy and information gain must be calculated after one or more splits have already been made, and that affects the result. The decision tree repeats this process as it grows deeper, until it either reaches a predefined depth or no split yields an information gain above a certain threshold, which can usually also be specified as a hyperparameter!
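In practice you rarely code this by hand: scikit-learn's decision tree exposes this criterion directly. A minimal sketch (the toy data below is invented purely for illustration; it is not the 30-instance sample from the article):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: [balance_over_50k, owns_residence] -> 1 = write-off, 0 = non-write-off
X = [
    [0, 0], [0, 1], [0, 0], [0, 1],
    [1, 1], [1, 0], [1, 1], [1, 0],
]
y = [1, 1, 1, 0, 0, 0, 0, 1]

# criterion="entropy" makes the tree choose splits by information gain
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["balance_over_50k", "owns_residence"]))
```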

That's all! Now you know what entropy and information gain are and how they are calculated. You understand how a decision tree, on its own or as part of an ensemble, decides on the best order of features to split on and decides when to stop while learning from the available data. And if you ever have to explain to someone how decision trees work, I hope you will handle the task well.

I hope you have learned something useful from this article. If I missed something or put it inaccurately, write to me about it; I will be very grateful. Thank you!


