












Project on decision tree machine learning













instances. A decision tree mainly consists of a root node, interior nodes, and leaf nodes, which are connected by branches (see Figure 1). Decision trees are further subdivided according to whether the target feature is continuously scaled (for instance, housing or rental prices) or categorically scaled (for instance, different animal species). To explain this better, we introduce and discuss the concept in more depth below.
Figure 1: A Typical Example of a Decision Tree
Every split in a decision tree is based on a feature. If the feature is categorical, the split groups the elements that belong to a particular class. If the feature is continuous, the split is made at a threshold: subset A holds the values less than or equal to the threshold and subset B holds the values greater than it. At every split, the decision tree takes the best variable available at that moment, judged by an impurity measure computed on the resulting branches. Whether the variable used for the split is categorical or continuous is irrelevant to this choice (in fact, decision trees categorize continuous variables by creating binary regions with the threshold). How categorical variables are treated, however, depends on the algorithm. Some algorithms, for example ID3, are able to handle categorical variables directly. Others, such as many implementations of the CART algorithm, are not. There are two basic approaches to encoding categorical data as continuous: one-hot encoding and mean encoding.
One-hot encoding is straightforward and is implemented in most software packages. Its drawback is that it runs into problems when there are many categories, because the number of encoding dimensions equals the number of categories. Mean encoding (also sometimes called target encoding) encodes each category with the mean of the target for that category (for example, in regression with a feature that takes the classes 0 and 1, class 0 is encoded by the mean of the response over the examples with class 0, and so on). These processes are further illustrated in the figure.
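The two encodings described above can be sketched in a few lines of plain Python. This is only an illustration under assumed inputs: the helper names `one_hot_encode` and `mean_encode` and the toy weather data are hypothetical and do not come from the project code.

```python
from collections import defaultdict

def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one dimension per category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, vectors

def mean_encode(values, targets):
    """Replace each category by the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return means, [means[v] for v in values]

# Hypothetical toy data: a categorical feature and a numeric target.
weather = ["Sunny", "Rainy", "Sunny", "Overcast", "Rainy"]
target = [10.0, 4.0, 14.0, 8.0, 6.0]

categories, one_hot = one_hot_encode(weather)
means, encoded = mean_encode(weather, target)

print(categories)  # one column per distinct category
print(one_hot)     # indicator vectors, as wide as the number of categories
print(encoded)     # one number per row: the mean target of its category
```

The one-hot result grows by one column for every new category, which is the drawback noted above, while the mean encoding always produces a single numeric column.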
Figure 2: An Example of a Classification Tree
We have introduced classification decision trees together with some basic concepts underlying decision tree models and how they can be built from scratch in Python, which is our approach for this project. We have also explained some advantages and disadvantages of decision tree models, as well as important extensions and variations in comparison with other algorithms such as CART and ID3. One disadvantage of classification decision trees is that they need a target feature which is categorically scaled, for instance weather in Tianjin City = Sunny, Rainy, Overcast, Thunderstorm (this is based on our assumptions as international students at Nankai University, Jinnan Campus). Here a problem arises: what if we want our tree to predict, for instance, the price of a house given descriptive feature attributes such as the number of rooms and the location? Here the values of the target feature (price) are no longer categorically scaled but continuous: a house can have, theoretically, an infinite number of different prices. That is where regression trees come in.
Regression trees work in principle in the same way as classification trees, with the large difference that the target feature values can now take on an infinite number of continuously scaled values. Hence the task is now to predict the value of a continuously scaled target feature Y given the values of a set of categorically (or continuously) scaled descriptive features X. In a regression tree the idea is this: since the target variable does not have classes, we fit a regression model to the target variable using each of the independent variables. Then, for each independent variable, the data is split at several split points. At each split point, the error between the predicted values and the actual values is squared and summed to give a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared, and the variable and split point yielding the lowest SSE are chosen as the root node and split point.
This process is continued recursively (see Figure 3).
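The SSE-based choice of a split point can be illustrated with a minimal sketch. It assumes that the prediction on each side of a split is simply the mean of the target values on that side; the function names `sse` and `best_split` and the toy housing numbers are hypothetical.

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Try every distinct value of x (except the largest) as a threshold and
    return the (threshold, total SSE) pair with the lowest total SSE."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical toy data: number of rooms and sale price.
rooms = [1, 2, 2, 3, 4, 4]
price = [1000.0, 1200.0, 1300.0, 2200.0, 3000.0, 3200.0]

threshold, error = best_split(rooms, price)
print(threshold, round(error, 2))
```

Running the same search over every descriptive feature and keeping the overall minimum gives the root split; repeating it on each resulting subset gives the recursion.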
Regression-type problems: Regression-type problems are generally those where we attempt to predict the values of a continuous variable from one or more contin-
0.2 Idea of Decision Tree [Classification and Regression]
The main idea of decision trees is explained below to help conceptualize the CART and ID3 algorithms. In principle, decision trees can be used to predict the target feature of an unknown query instance by building a model based on existing data for which the target feature values are known (supervised learning). This model can make predictions for unknown query instances because it models the relationship between the known descriptive features and the known target feature. In the example that follows, the tree model learns what a specific animal species looks like, that is, the combination of descriptive feature values that is distinctive for that species. To train a decision tree model we need a dataset consisting of a number of training examples, each characterized by a number of descriptive features and a target feature.
How can we build a tree model? To answer this question, we should recapitulate what we try to achieve with a decision tree model (see Figure 1). Given a dataset, we want to train a model that learns the relationship between the descriptive features and a target feature, such that we can present the model with a new, unseen set of query instances and predict the target feature values for them. Let us also recapitulate the general shape of a decision tree. At the bottom of the tree are the leaf nodes, which contain (in the optimal case) target feature values. To make this more concrete, we use as a practical example a simplified version of the UCI Machine Learning Zoo Animal Classification dataset, which includes properties of animals as descriptive features and the animal species as the target feature (see Figure 4).
Figure 4: An Example of a Python Code that generated a table of the Animal Classification Dataset
Each leaf node should (in the best case) contain only "Mammals" or only "Reptiles". Our task is now to find the best way to split the dataset such that this can be achieved. Consider the dataset obtained from the Python code (see Figure 4) and think about what must be done to split it into a Dataset 1 containing only Mammals as target feature values (species) and a Dataset 2 containing only Reptiles. To achieve that, in this simplified example, we only need the descriptive feature hair, since if hair is TRUE the associated species is always a Mammal. Hence in this case our tree model would look like the one shown in Figure 5.
Figure 5: An Example of the Animal Dataset after It Has Been Split
Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. Classification tree analysis is one of the main techniques used in Data Mining. The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the available techniques have much in common with the techniques used in the more traditional methods of Discriminant Analysis, Cluster Analysis, Nonparametric Statistics, and Nonlinear Estimation. The flexibility of classification trees makes them a very attractive analysis option, but this is not to say that their use is recommended to the exclusion of more traditional methods. Indeed, when the typically more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable. But as an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees are, in the opinion of many researchers, unsurpassed.
The study and use of classification trees are not widespread in the fields of probability and statistical pattern recognition (Ripley, 1996), but classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer science (data structures), botany (classification), and psychology (decision theory). Classification trees readily lend themselves to being displayed graphically, helping to make them easier to interpret than they would be if only a strict numerical interpretation were possible.
using that subsample as a test sample for cross-validation, so that each subsample is used (V - 1) times in the learning sample and just once as the test sample. The CV costs (cross-validation costs) computed for each of the V test samples are then averaged to give the V-fold estimate of the CV costs.
Minimal cost-complexity cross-validation pruning. In CART, minimal cost-complexity cross-validation pruning is performed if "Prune on misclassification error" has been selected as the stopping rule. If, on the other hand, "Prune on deviance" has been selected as the stopping rule, then minimal deviance-complexity cross-validation pruning is performed. The only difference between the two options is the measure of prediction error that is used. Prune on misclassification error uses the costs, which equal the misclassification rate when priors are estimated and misclassification costs are equal, while Prune on deviance uses a measure based on maximum-likelihood principles, called the deviance (see Figure 6).
In simplified terms, the process of training a decision tree and predicting the target features of query instances is as follows:
I Present a dataset consisting of a number of training instances, each characterized by a number of descriptive features and a target feature
II Train the decision tree model by repeatedly splitting the dataset along the values of the descriptive features, using a measure of information gain to choose each split
III Grow the tree until a stopping criterion is met, creating the leaf nodes that represent the predictions we want to make for new query instances
IV Show query instances to the tree and run down the tree until we arrive at leaf nodes
Figure 6: Training Feature Dataset Vs Predictive Unknown Feature
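Steps I to IV can be sketched as a compact ID3-style learner. This is a minimal illustration, not the project's implementation: the toy rows, the dictionary-based tree layout, and the function names (`entropy`, `info_gain`, `majority`, `train`, `predict`) are all assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def majority(rows, target):
    """Most frequent target value (the mode) among the rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def train(rows, features, target):
    """Steps II and III: split on the best feature until a stopping criterion holds."""
    labels = {r[target] for r in rows}
    if len(labels) == 1:          # pure node: stop and return its label
        return labels.pop()
    if not features:              # no features left: return the mode
        return majority(rows, target)
    best = max(features, key=lambda f: info_gain(rows, f, target))
    node = {"feature": best, "default": majority(rows, target), "children": {}}
    remaining = [f for f in features if f != best]
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = train(subset, remaining, target)
    return node

def predict(tree, query):
    """Step IV: run the query down the tree until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(query[tree["feature"]], tree["default"])
    return tree

# Step I: a hypothetical toy dataset in the spirit of the animal example.
rows = [
    {"hair": True, "toothed": True, "species": "Mammal"},
    {"hair": True, "toothed": False, "species": "Mammal"},
    {"hair": False, "toothed": True, "species": "Reptile"},
    {"hair": False, "toothed": False, "species": "Reptile"},
    {"hair": True, "toothed": True, "species": "Mammal"},
]
tree = train(rows, ["hair", "toothed"], "species")
print(tree["feature"])
print(predict(tree, {"hair": True, "toothed": False}))
```

In this toy data the hair feature separates the two species perfectly, so the learner stops after a single split, which mirrors the hair example discussed earlier.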
0.3.2.1 Background Information
To be able to calculate the information gain, we first have to introduce the entropy of a dataset. The entropy of a dataset measures its impurity, and we will use this kind of informativeness measure in our calculations (see Figure 5). Other measures can also be used in place of the information gain, such as Chi-Square, the information gain ratio, and variance, to name a few. The term entropy (in information theory) goes back to Claude E. Shannon. The idea behind entropy is, in simplified terms, the following. Imagine a lottery wheel which contains 200 green balls. The set of balls within the lottery wheel is totally pure because only green balls are included. In the terminology of entropy, this set of balls has an entropy of 0 (we can also say zero impurity). Now suppose that 60 of these balls are replaced by red balls and 40 by blue balls. When we now draw a ball from the lottery wheel, the probability of receiving a green ball has dropped from 1.0 to 0.5. Since the impurity has increased, the purity has decreased, and hence the entropy has increased. We can therefore say: the more "impure" a dataset, the higher the entropy, and the less "impure" a dataset, the lower the entropy. Shannon's entropy model uses the logarithm function log2(P(x)) to measure the entropy, and therewith the impurity, of a dataset, since the higher the probability P(x) of getting a specific result (for example, randomly drawing a green ball), the closer the binary logarithm gets to 0. For the regression tree, we also introduce the mathematical concepts needed to compute it. This will help us to understand the concept of CART for this project.
0.3.2.2 Mathematical Concepts of Classification Tree
Step 1: Once our dataset contains more than one "type" of element, specifically more than one target feature value, the impurity will be greater than zero, and therewith the entropy of the dataset will also be greater than zero. Hence it is useful to sum up the entropies of each possible target feature value, weighting each by the probability that we obtain this value when drawing randomly from the target feature value space. (What is the probability of drawing a green ball just by chance? Exactly 0.5, so we have to weight the entropy calculated for the green balls by 0.5.) This finally leads to the formal definition of Shannon's entropy, which serves as the baseline for the information gain calculation:
\[ H(x) = -\sum_{k \in \text{target}} P(x = k) \cdot \log_2\big(P(x = k)\big) \tag{1} \]
Green balls: \( H(x = \text{green}) = 0.5 \cdot \log_2(0.5) = -0.5 \)
Blue balls: \( H(x = \text{blue}) = 0.2 \cdot \log_2(0.2) = -0.464 \)
Red balls: \( H(x = \text{red}) = 0.3 \cdot \log_2(0.3) = -0.521 \)
\[ H(x) = -\big((-0.5) + (-0.464) + (-0.521)\big) = 1.485 \tag{2} \]
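The arithmetic of the lottery-wheel example can be checked with a few lines of Python. The ball counts follow the example above (100 green, 60 red, and 40 blue out of 200); the variable names are chosen only for this illustration.

```python
import math

# Ball counts after the replacement described in the lottery-wheel example.
counts = {"green": 100, "red": 60, "blue": 40}
total = sum(counts.values())

# Per-colour terms P(x = k) * log2(P(x = k)); each one is zero or negative.
terms = {colour: (n / total) * math.log2(n / total) for colour, n in counts.items()}

# Shannon entropy is the negated sum of these terms.
entropy = -sum(terms.values())

for colour, term in terms.items():
    print(f"{colour}: {term:.3f}")
print(f"H(x) = {entropy:.3f}")
```

Setting all counts to a single colour makes every term zero, which reproduces the entropy of 0 for the completely pure wheel.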
Step 2: Let’s apply this approach to our original dataset where we want to predict the animal species. Our dataset has two target feature values in its target
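Toothed:
\[ H(\text{toothed}) = \frac{5}{7} \cdot -\Big(\big(1 \cdot \log_2(1)\big) + (0)\Big) + \frac{2}{7} \cdot -\Big(\big(\tfrac{1}{2} \cdot \log_2(\tfrac{1}{2})\big) + \big(\tfrac{1}{2} \cdot \log_2(\tfrac{1}{2})\big)\Big) = \]
\[ \text{InfoGain}(\text{toothed}) = 0.5917 - 0.5917 = 0 \tag{12} \]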
This is an example of how our tree model generalizes beyond the training data. If we consider the other branch, that is breathes == True, we know that after splitting the dataset on the values of a specific feature (here breathes, with the values True and False), that feature must be removed. This leads to a dataset in which no features remain to split on. Hence, we stop growing the tree and return the mode value of the direct parent node, which is "Mammal" (see Figure 7).
0.3.2.3 Mathematical Concepts of Regression Tree
As stated above, the task when growing a regression tree is in principle the same as when creating classification trees. However, since the information gain (IG) is no longer an appropriate splitting criterion (and neither is the Gini index) because of the continuous character of the target feature, we need a new splitting criterion. Since we have generated the table for the calculation, we can first compute the entropy of the "Number of Bedrooms" feature, which is given by the formula:
\[ H(N) = \sum_{j \in N} \left( \frac{|D_{N = j}|}{|D|} \cdot \left( \sum_{k \in S} -P(k \mid j) \cdot \log_2\big(P(k \mid j)\big) \right) \right) \tag{13} \]
where N is the Number of Bedrooms and S is the Price of Sale. When we calculate the weighted entropies, we see that for j = 3 we get a weighted entropy of 0. We get this result because there is only one house in the dataset with 3 bedrooms. On the other hand, for j = 2 (which occurs three times) we get a weighted entropy of 0.59436. However, since our target feature is continuously scaled, the IGs of the categorically scaled descriptive features are no longer appropriate splitting criteria. We could instead categorize the target feature along its values, where for instance housing prices between 0 and 800 RMB are categorized as low, between 801 and 1500 RMB as middle, and above 1500 RMB as high.
What we want is a splitting criterion which allows us to split the dataset in such a way that, when arriving at a tree node, the predicted value is closest to the actual value. (We define the predicted value as the mean target feature value of the instances at the leaf node, with a minimum of 5 instances as the early stopping criterion.) It turns out that the variance is one of the most commonly used splitting criteria for regression trees, and we will use it here. The reasoning is that we want to find the feature attributes which point most exactly to the real target feature values when the dataset is split along the values of these descriptive features. We therefore use the variance, which we introduce now for illustration with the example in Figure 8, produced with Python code. We use this example to explain the mathematical principle of the regression tree:
\[ Var(x) = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \]
where \(y_i\) are the single target feature values and \(\bar{y}\) is the mean of these target feature values. Step 1: Since we want to know which descriptive feature is best suited to split the target feature on, we have to calculate the variance for each value of the descriptive feature with respect to the target feature values. Hence, for the "Number of Rooms" descriptive feature above, we get for each number of rooms:
Var(N = 1) =
Var(N = 2) =
Var(N = 3) = (2200 − 2200) = 0 (17)
Var(N = 4) =
Step 2:
WeightVar(N = 1) =
WeightVar(N = 2) =
WeightVar(N = 3) =
WeightVar(N = 4) =
Step 3: Finally, we sum up these weighted variances to make an assessment about the feature as a whole:
\[ SumVar(feature) = \sum_{value \in feature} WeightVar(feature_{value}) \tag{23} \]
In our case this is 10125000 + 5083333.3 + 0 + 78125000. Putting all this together finally leads to the formula for the weighted feature variance, which we will use at each node in the splitting process to determine which feature to split our dataset on next.
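\[ feature[choose] = \underset{f \in features}{\operatorname{argmin}} \sum_{l \in levels(f)} \frac{|f = l|}{|f|} \cdot Var(t, f = l) \tag{24} \]
\[ = \underset{f \in features}{\operatorname{argmin}} \sum_{l \in levels(f)} \frac{|f = l|}{|f|} \cdot \frac{\sum_{i=1}^{n} (t_i - \bar{t})^2}{n - 1} \]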
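The weighted-variance criterion can be sketched as follows. This is an illustration under stated assumptions: the toy house rows and the function names `level_variance`, `weighted_variance`, and `choose_feature` are hypothetical, and a level with a single instance is given a variance of 0, in line with the single 3-room house in the text.

```python
from statistics import variance  # sample variance, divides by n - 1

def level_variance(values):
    """Sample variance of the target at one level; 0 when only one instance exists."""
    return variance(values) if len(values) > 1 else 0.0

def weighted_variance(rows, feature, target):
    """Sum over the levels of a feature of |f = l| / |f| times the target variance."""
    total = len(rows)
    result = 0.0
    for level in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == level]
        result += len(subset) / total * level_variance(subset)
    return result

def choose_feature(rows, features, target):
    """Pick the feature with the lowest weighted variance (the argmin)."""
    return min(features, key=lambda f: weighted_variance(rows, f, target))

# Hypothetical toy data, not the table used in the project.
rows = [
    {"rooms": 1, "location": "centre", "price": 1000.0},
    {"rooms": 1, "location": "suburb", "price": 1100.0},
    {"rooms": 2, "location": "centre", "price": 2000.0},
    {"rooms": 2, "location": "suburb", "price": 2100.0},
    {"rooms": 3, "location": "centre", "price": 2200.0},
]

for feature in ("rooms", "location"):
    print(feature, weighted_variance(rows, feature, "price"))
print(choose_feature(rows, ["rooms", "location"], "price"))
```

In this toy data the prices within each room count lie close together, so the rooms feature has the smaller weighted variance and is chosen for the split.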
C1, C2, ..., Cn, C the target attribute, and a set S of learning records. Figure 10 shows the pseudocode for the C4.5 algorithm, which is discussed in the next stage.
0.5 Experimental Result of Decision Tree Using ID3 and C4.5 Algorithm
Figure 7: Final Generalized Model of Animal Training Dataset
Figure 10: Pseudocode of C4.5 Algorithm Written in Python
Breiman, Leo et al. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
Hssina, Badr et al. (2014). “A comparative study of decision tree ID3 and C4.5”. In: International Journal of Advanced Computer Science and Applications 4.2, pp. 0–0.
Quinlan, J. R. (1986). “Induction of decision trees”. In: Machine Learning 1, pp. 81–
White, Allan P. and Wei Zhong Liu (1994). “Bias in information-based measures in decision tree induction”. In: Machine Learning 15.3, pp. 321–329.