







Prof. David C Parkes, Computer Science, Cross-Validation, ID3 Algorithm, Gain Heuristic, Decision Tree Learners, Overfitting Phenomenon, Inductive Bias, Chi-Squared Pruning, Validation-Set Pruning, Trading Empirical Error for Complexity, Harvard, Lecture Notes








Today we continue our discussion of decision tree learning algorithms. The main focus will be on the phenomenon of overfitting, which is an issue for virtually all machine learning algorithms. After discussing why overfitting happens, we will look at several methods for dealing with it.
Several comments on the ID3 algorithm are in order. First, let us try to understand better what kinds of splits the information gain criterion selects. Recall that it selects the feature that maximizes information gain or, equivalently (since the current entropy is fixed), minimizes Remainder(Xk, D), the weighted entropy of the partition of the data induced by a split on feature Xk, given data D consistent with the current node in the tree. To gain a better understanding of the heuristic used by ID3, we first examine the entropy function a little more closely. For binary data, where we denote pT = nT/n (the fraction of instances with classification true), the entropy is:
\[
\mathrm{Entropy}(p_T) = -p_T \log_2 p_T - (1 - p_T) \log_2 (1 - p_T).
\]

[Figure: Entropy(pT) plotted against pT, with pT on the x-axis (0 to 1) and entropy on the y-axis (0 to 1).]

Taking the derivative,

\[
\frac{d}{dp_T}\,\mathrm{Entropy}(p_T)\Big|_{p_T = z}
= \frac{-\ln z - 1 + \ln(1 - z)}{\ln 2} + \frac{1 - z}{(1 - z)\ln 2}
= \frac{1}{\ln 2}\bigl(\ln(1 - z) - \ln z\bigr), \tag{2}
\]
which is ∞ at 0 and −∞ at 1. The greatest effect on entropy therefore occurs near the extremes of 0 and 1, while changes to mid-range distributions have a relatively small effect. For example, Entropy(5 : 9) = 0.94 and Entropy(7 : 7) = 1, while Entropy(1 : 6) = 0.59 but Entropy(0 : 7) = 0. Small changes to the distribution of the target class make a big difference in entropy at the extremes. This implies that the Remainder function selects strongly for splits that generate a partition of the data in which some of the parts are very extreme, even if the other parts are very mixed and retain a lot of disorder. Put another way, Remainder prefers a split that generates a partition with one very extreme and one very mixed part to a partition with two fairly well-sorted parts. The second component of the Remainder function is the weights nx/n, where nx is the number of instances in part x of the partition. These mean that getting low entropy on a large part of the data partition is more important than getting it on a small part. To summarize, the information gain criterion will try to choose a feature to split on that induces a partition in which large parts of the data are classified extremely well.
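The entropy values quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name `entropy` and the representation of a class distribution as a tuple of counts are choices made here for illustration, not part of the lecture.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a sequence of counts."""
    total = sum(counts)
    # By convention 0 * log2(0) = 0, so classes with no instances are skipped.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# The four distributions mentioned in the text.
for counts in [(5, 9), (7, 7), (1, 6), (0, 7)]:
    print(counts, round(entropy(counts), 2))
```

Moving from 7:7 to 5:9 changes the entropy only slightly, whereas moving from 1:6 to 0:7 removes all of the remaining disorder, which is the asymmetry the derivative describes.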
Let’s have a look at how this works in the earlier example of poisonous and nutritious plants. Recall that there are four features and two target classes. Suppose that the original data set is as follows:
Skin     Color    Thorny   Flowering   Class
smooth   pink     false    true        Nutritious
smooth   pink     false    false       Nutritious
scaly    pink     false    true        Poisonous
rough    purple   false    true        Poisonous
rough    orange   true     true        Poisonous
scaly    orange   true     false       Poisonous
smooth   purple   false    true        Nutritious
smooth   orange   true     true        Poisonous
rough    purple   true     true        Poisonous
smooth   purple   true     false       Poisonous
scaly    purple   false    false       Poisonous
scaly    pink     true     true        Poisonous
rough    purple   false    false       Nutritious
rough    orange   true     false       Nutritious
Overall, there are 5 Nutritious cases and 9 Poisonous cases in the data, for an initial entropy before splitting of 0.94. Splitting on the various features results in the following information gain:
Feature     Value    Nutritious   Poisonous   Entropy   Remainder   Gain
Skin        smooth   3            2           0.971     0.694       0.247
            rough    2            3           0.971
            scaly    0            4           0
Color       pink     2            2           1         0.911       0.029
            purple   2            4           0.918
            orange   1            3           0.811
Thorny      false    4            3           0.985     0.788       0.152
            true     1            6           0.592
Flowering   false    3            3           1         0.892       0.048
            true     2            6           0.811

(The Value, Nutritious and Poisonous columns give the induced data partition. The entropy, Remainder and Gain entries are computed from these counts, with Gain = 0.940 − Remainder.)
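The Remainder and Gain for each feature follow directly from the (Nutritious, Poisonous) counts in the induced partitions. The sketch below recomputes them; the helper names `entropy`, `remainder` and the `partitions` dictionary are illustrative assumptions, and the counts are copied from the table above.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a sequence of counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def remainder(partition):
    """Weighted entropy of a partition; each part is (nutritious, poisonous)."""
    n = sum(sum(part) for part in partition)
    return sum(sum(part) / n * entropy(part) for part in partition)

# (Nutritious, Poisonous) counts for each value of each feature.
partitions = {
    "Skin": [(3, 2), (2, 3), (0, 4)],
    "Color": [(2, 2), (2, 4), (1, 3)],
    "Thorny": [(4, 3), (1, 6)],
    "Flowering": [(3, 3), (2, 6)],
}

root_entropy = entropy((5, 9))  # entropy before splitting
gains = {f: root_entropy - remainder(p) for f, p in partitions.items()}
for feature, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{feature:10s} remainder={root_entropy - gain:.3f} gain={gain:.3f}")

best = max(gains, key=gains.get)  # the feature ID3 would split on first
```

Skin has the largest gain because one of its parts (scaly) is perfectly sorted, even though the other two parts remain quite mixed.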
Now we can consider what ID3 would do when presented with some training data from this domain. Because ID3 is greedy and considers only one feature at a time, it is no more likely to split on X 1 or X 2 than it is to split on any one of the other features. Once it has split on X 1 or X 2 , it will of course split on the other at the next step, and then terminate. But it may take a long time until it splits on X 1 or X 2. So ID3 will likely grow a much larger tree than necessary. Along the way it is fitting to spurious patterns in the training data and learning a hypothesis that does not generalize. Cases like this, in which two features are not individually predictive but work as co-predictors, would not be a problem for an algorithm that looked ahead one step in deciding what to split on, or split on pairs directly. The ID3 algorithm could be modified in this way. But the general problem of overfitting would remain, and this is what we turn to next.
We can see the effect of overfitting in a graph that tracks the performance of the sequence of trees constructed by ID3 as it makes successive split decisions. The x-axis plots the number of nodes in the tree (a measure of its complexity, or the training effort) and the y-axis plots the percentage of correct classifications on the training data and also on a test set of data that is not used for learning and represents future data. The following graph is typical: the training set performance continues to improve, while the test set performance peaks before the complete tree is grown and dips afterwards.
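[Figure: accuracy (y-axis, up to 100%) against training effort (x-axis) for the training and test sets. Training accuracy rises toward its asymptotic value, while test accuracy reaches its best value at the optimal stopping point and then declines.]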
Why does this happen? The problem is that a full decision tree is fitting the noise in the training data, or finding other spurious patterns that hold in the training data by happenstance but don’t generalize. In later splits, there may be very few instances on which to base decisions, and so the algorithm may make a splitting decision that is not statistically supported by the data. Such an unsupported decision will in fact override statistically supported decisions at a higher level. It may be the case that a higher level decision will generalize, while a lower level decision will not.
Overfitting is a huge issue in machine learning. It comes up for just about all learning algorithms. The basic issue is that there is always a tradeoff between producing a model that fits the training data as well as possible, and a model that generalizes well to new instances. Note:
the problem of overfitting occurs even in the absence of noisy data; it can arise simply because there are patterns in the training data that turn out to be insignificant.
Before discussing ways of dealing with overfitting, let us look at some of the possible causes. Overfitting can occur when at least one of the following occurs:
For example, we saw in lecture that if there are |D| = 8 training examples and the true concept is f(x) = x1, but there are also 1999 uninformative features ({X2, ..., X2000}), each taking value true or false on any x with probability 0.5, then:
The basic idea behind virtually all methods for dealing with overfitting is to increase the inductive bias of the learning algorithm. Overfitting is the result of the algorithm being too heavily swayed by the training data. Increasing the bias means making stronger assumptions that are not supported by the data. This means in turn that more data will be needed to counter the assumptions. In particular, noise in the data or spurious patterns, which will tend not to be supported by large amounts of data, will not override the inductive assumptions.

As discussed in the last class, increasing the inductive bias could mean using a more restricted hypothesis space. For example, in using decision trees one could stipulate that only trees whose depth is at most 3, or that have at most 10 nodes, or that use at most 4 features (or any combination of these) will be considered. The size of the decision tree could also remain a parameter of the model and be selected through a cross-validation approach. We discuss this idea below.

Alternatively, increasing the inductive bias could involve increasing the preference bias of the learning algorithm, so that some hypotheses will be preferred over others even if they perform worse on the training data. This is often referred to as regularization, and we will revisit this idea frequently in the course. For decision trees, this means that simpler trees that have some training
Note: Once I tell you how many instances have Xk false and how many have Xk true, there is only one degree of freedom left to fully determine the labels after the split on Xk. Once you know that there are 3 negative examples after an Xk false split, you can complete all the other numbers of labeled instances (i.e., 5 = 8 − 3, 7 = 10 − 3, 5 = 10 − 5). We define a test statistic Dev(Xk) as follows. Let p and n denote the number of positive and negative examples in the data D before the split. (We consider the Boolean classification setting for simplicity, but everything generalizes.) Remember that D is not the complete training set, but rather the data associated with the current node. For each possible value x ∈ Xk, let px and nx denote the number of positive and negative examples in Dx, the subset of D in which Xk = x. Furthermore, let p̂x = (p/(p + n)) |Dx| and n̂x = (n/(p + n)) |Dx| denote the number of positive and negative examples we would expect on average after a split on an unrelated feature. Our test statistic is the deviation of the data from this, defined as:
\[
\mathrm{Dev}(X_k) = \sum_{x \in X_k} \left[ \frac{(p_x - \hat{p}_x)^2}{\hat{p}_x} + \frac{(n_x - \hat{n}_x)^2}{\hat{n}_x} \right]
\]
What does this mean? The larger this test statistic, the stronger the evidence that Xk is in fact informative. So, we will tend to reject the null hypothesis and accept the split as this deviation increases. In our example, we have Dev(Xk) = (5 − 4)^2/4 + (3 − 4)^2/4 + (7 − 6)^2/6 + (5 − 6)^2/6 = 0.833. Back to our question: what is the probability that a split of labels such as this could occur just by chance under the null hypothesis? For this, we define the chi-square distribution and will assume that
Dev (Xk) ∼ χ^2 (v) (4)
where χ^2(v) is the chi-square distribution with v degrees of freedom. In our setting of a binary feature and a Boolean classification problem, we just have v = 1, as intuited in the discussion above (where once a single number is defined, all remaining numbers are defined, given that you know the number of instances for which Xk takes each value).

What is the χ^2(v) distribution? Let

\[
Q = \sum_{i=1}^{v} Z_i^2
\]

define a random variable that is the sum of the squares of v independent standard Normal random variables Z_1, ..., Z_v. A standard Normal Z_i ∼ N(0, 1) has mean 0 and variance 1. Then this random variable Q is distributed χ^2(v). Chi-square is a one-parameter distribution (a special case of the Gamma distribution). Let F(z; v) denote the cumulative distribution function for chi-square with v degrees of freedom; for v = 2, chi-square has cumulative distribution function F(z; 2) = 1 − e^{−z/2}.

Why is this relevant here? With enough data (typically, having px, nx ≥ 5 for each x ∈ Xk is sufficient), the deviations in the terms of Dev(Xk) are well approximated as Normal under the null hypothesis. Because there is only one degree of freedom in our statistic, it is appropriate to model Dev(Xk) as being distributed χ^2(v) for v = 1 under the null hypothesis. When there is less data, alternative statistical tests (Yates’ correction for continuity and Fisher’s exact test) can be used, but these are outside the scope of the class.

We are now ready to use the chi-square distribution to determine the probability that a test statistic of at least Dev(Xk) could have occurred by chance. This is the p-value, and gives the probability of obtaining at least as extreme a classification under the null hypothesis. For example, one can use the chi2cdf function in Matlab or the CHIDIST function in Excel. In our running example, 1 − F(z = 0.83333; v = 1) ≈ 0.36, and we see that there is a reasonable possibility that the split could occur simply by chance. If the p-value is small (less than a threshold α), then we reject the null hypothesis with high confidence; thresholds of α = 0.05 or α = 0.01 are common, where we are said to reject the null hypothesis at the 5% or 1% significance level respectively. We accept the split if the p-value is less than α.
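For the binary case the test statistic and its p-value need only the standard library. The sketch below is one way to compute them; the function names `dev` and `p_value_one_dof` are illustrative, and it assumes both classes occur in D and every part of the split is non-empty, so that the expected counts are positive. With one degree of freedom Q = Z^2, so P(Q ≥ z) = P(|Z| ≥ √z) = erfc(√(z/2)).

```python
import math

def dev(parts):
    """Deviation statistic for a split; parts is a list of (p_x, n_x) counts."""
    p = sum(px for px, _ in parts)
    n = sum(nx for _, nx in parts)
    total = 0.0
    for px, nx in parts:
        size = px + nx
        p_hat = p / (p + n) * size  # expected positives under the null hypothesis
        n_hat = n / (p + n) * size  # expected negatives under the null hypothesis
        total += (px - p_hat) ** 2 / p_hat + (nx - n_hat) ** 2 / n_hat
    return total

def p_value_one_dof(z):
    """1 - F(z; 1) for the chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(z / 2.0))

# Running example: (positive, negative) counts for Xk false and Xk true.
running_example = [(5, 3), (5, 7)]
z = dev(running_example)
p_value = p_value_one_dof(z)
print(round(z, 3), round(p_value, 2))
```

The closed form used here holds only for v = 1; a feature with more values or a problem with more classes needs the general chi-square cumulative distribution function.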
In our example, we would not reject the null hypothesis, and so we would not accept the split (and would create a leaf instead). For a general (non-binary) classification problem, with c = |Y| classes and r = |Xk| values of feature Xk, there are (r − 1)(c − 1) degrees of freedom. To see why, one can imagine a table where each feature value is a row and each class value is a column. The table is completed with the number of examples in each category after the split. In our running example, it is a 2 by 2 table:
              classes
Xk       negative   positive
false    3          5
true     7          5

Once all but one number in every row and all but one number in every column have been completed, there is enough information to complete the table. Why? Because the numbers in each row must add up to the number of examples known to have that feature value, and the numbers in each column must sum to the number of examples in the data with that class value. In our example, we have (2 − 1)(2 − 1) = 1 degree of freedom. Given this, we can modify the ID3 algorithm with this additional “pre-pruning” step by splicing in the following lines:
    Choose the best splitting feature Xk in X
    Dev(Xk) = Σ_{x∈Xk} [ (px − p̂x)^2 / p̂x + (nx − n̂x)^2 / n̂x ]
    v = (r − 1)(c − 1), where Xk has r values and there are c class values
    p-value = 1 − F(Dev(Xk); v)
    If p-value > α    // do not reject the null hypothesis
    Then
        Label(T) = the most common classification in D
        Return T
For simple examples of chi-square pruning, consider the following splits on a binary feature of data D consisting of 20 instances, with an equal number of positive and negative examples, and an equal number of instances with Xk true and Xk false:
(a) 10:10 splits to 5:5 and 5:5
(b) 10:10 splits to 6:4 and 4:6
(c) 10:10 splits to 7:3 and 3:7
(d) 10:10 splits to 8:2 and 2:8
Which of these could occur just because of chance under the null hypothesis that the feature is unrelated to the classification? Let confidence parameter α = 0.05.
For (a), Dev(Xk) = 0 and 1 − F(0; 1) = 1 > α. So, we do not reject the null hypothesis, and we would terminate ID3 and not split.
For (b), Dev(Xk) = (6 − 5)^2/5 + (4 − 5)^2/5 + (4 − 5)^2/5 + (6 − 5)^2/5 = 0.8 and 1 − F(0.8; 1) = 0.37 > α, so we do not reject the null hypothesis, and we would terminate ID3 and not split.
For (c), Dev(Xk) = (7 − 5)^2/5 + (3 − 5)^2/5 + (3 − 5)^2/5 + (7 − 5)^2/5 = 3.2 and 1 − F(3.2; 1) = 0.074 > α, so we do not reject the null hypothesis, and we would terminate ID3 and not split.
For (d), Dev(Xk) = (8 − 5)^2/5 + (2 − 5)^2/5 + (2 − 5)^2/5 + (8 − 5)^2/5 = 7.2 and 1 − F(7.2; 1) = 0.0073 < α, so we reject the null hypothesis, and we would accept the split and continue with ID3.
This all seems quite reasonable: the pattern induced by the fourth split is judged to be statistically significant, while the other three are not and could be explained with some probability under the null hypothesis.
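The pre-pruning test can be sketched for a general table of r feature values by c classes. This is an illustration under stated assumptions rather than the course's implementation: the names `chi2_cdf` and `pre_prune_test` are invented here, the chi-square cumulative distribution function is computed with the standard power series for the regularized lower incomplete gamma function, and every expected count is assumed to be positive.

```python
import math

def chi2_cdf(z, v):
    """F(z; v), via the series for the regularized lower incomplete gamma function."""
    if z <= 0:
        return 0.0
    a, x = v / 2.0, z / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, 500):
        term *= x / (a + k)
        total += term
        if term < 1e-16 * total:
            break
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))

def pre_prune_test(table, alpha=0.05):
    """table[i][j] = number of examples with feature value i and class j.

    Returns (Dev, p-value, accept_split).
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    dev = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n  # count expected under the null
            dev += (observed - expected) ** 2 / expected
    v = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom
    p_value = 1.0 - chi2_cdf(dev, v)
    return dev, p_value, p_value < alpha

# The four example splits (a)-(d) of a 10:10 data set.
splits = {
    "a": [[5, 5], [5, 5]],
    "b": [[6, 4], [4, 6]],
    "c": [[7, 3], [3, 7]],
    "d": [[8, 2], [2, 8]],
}
results = {name: pre_prune_test(t) for name, t in splits.items()}
for name, (dev, p, accept) in results.items():
    print(name, round(dev, 2), round(p, 4), "split" if accept else "make a leaf")
```

Only split (d) falls below α = 0.05, so it is the only one for which ID3 would continue growing the tree.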
The second pruning method tries to judge whether or not a particular split in a tree is justified by seeing if it actually works for real data (i.e., data that is not used for training). The idea is to use a validation set. This is a portion of the training data that is set aside purely to test whether or not a split is actually a good split. In other words, the validation set is used to “validate” the model. In this case, a split is validated if it improves performance on the validation set. This provides a simple pruning criterion: prune a subtree if it does not improve performance on the validation set. This turns out to be a simple but powerful approach.
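The pruning criterion can be sketched as a bottom-up pass over a tree that has already been grown. Everything in this example is an assumption made for illustration: the `Node` class, the rule that an unseen feature value falls back to the node's majority label, the tiny hand-built tree, and the hypothetical validation set.

```python
class Node:
    def __init__(self, label, feature=None, children=None):
        self.label = label              # majority training class at this node
        self.feature = feature          # feature tested here, or None for a leaf
        self.children = children or {}  # feature value -> child Node

def predict(node, x):
    # Walk down the tree; stop at a leaf or at an unseen feature value.
    while node.feature is not None and x[node.feature] in node.children:
        node = node.children[x[node.feature]]
    return node.label

def accuracy(root, data):
    return sum(predict(root, x) == y for x, y in data) / len(data)

def prune(root, node, validation):
    """Replace a subtree by a leaf unless it improves validation accuracy."""
    if node.feature is None:
        return
    for child in node.children.values():
        prune(root, child, validation)  # prune the lowest subtrees first
    with_subtree = accuracy(root, validation)
    feature, children = node.feature, node.children
    node.feature, node.children = None, {}  # tentatively turn the node into a leaf
    if accuracy(root, validation) < with_subtree:
        node.feature, node.children = feature, children  # the subtree helped: restore it

# Hypothetical tree: the split on "a" is real, the split on "b" is spurious.
tree = Node("pos", "a", {
    False: Node("neg"),
    True: Node("pos", "b", {False: Node("pos"), True: Node("neg")}),
})
validation = [
    ({"a": True, "b": True}, "pos"),
    ({"a": True, "b": False}, "pos"),
    ({"a": False, "b": True}, "neg"),
    ({"a": False, "b": False}, "neg"),
]
prune(tree, tree, validation)
```

After the call, the split on "b" has been replaced by a leaf, while the split on "a" is kept because removing it would lower accuracy on the validation set.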
where λ > 0 is a parameter, Error(h, D) is the number of examples in D classified incorrectly by h, and Complexity(h) is some measure of the complexity or size of the hypothesis. The parameter λ provides a tradeoff between training error and complexity and needs to be tuned. The appropriate notion of Complexity(h) is sometimes a bit hard to define. Later in the term we will see Bayesian methods to determine a measure of complexity penalty. So far in class we have seen the use of the scalar product on coefficients, w · w, as a measure of complexity. For decision trees, complexity is often taken to be the number of nodes or just the height of the tree. This process of making a tradeoff between empirical error and hypothesis complexity is called regularization, because it looks for a function that is more regular, or less complex. One seeks the hypothesis h that minimizes the total cost, Cost(h, D). This can typically be formulated as an optimization problem and then solved optimally or approximately. An unfortunate aspect of regularization is that we need to find a value of the parameter λ that provides the best generalization. For this we can return to the idea of a validation set that we saw for validation-set pruning:
we hold out some data and use accuracy on this held out data to identify the value of λ that gives the best generalization performance.
This can be achieved through “cross-validation,” as explained in the next section.
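Tuning λ on held-out data can be sketched as follows. The candidate hypotheses and all of their error counts are hypothetical numbers invented for illustration, and the objective is assumed to take the additive form Error(h, D) + λ · Complexity(h), which is consistent with the description of λ above but is not spelled out in the text.

```python
# Hypothetical candidates: (name, training errors, number of nodes, held-out errors).
candidates = [
    ("stump", 12, 3, 13),
    ("medium", 5, 9, 7),
    ("full", 0, 41, 11),
]

def cost(train_errors, complexity, lam):
    # Assumed form of the regularized objective: Error + lambda * Complexity.
    return train_errors + lam * complexity

def select(lam):
    """Hypothesis that minimizes the regularized cost on the training data."""
    return min(candidates, key=lambda h: cost(h[1], h[2], lam))

def tune(lams):
    """The lambda whose selected hypothesis makes the fewest held-out errors."""
    return min(lams, key=lambda lam: select(lam)[3])

best_lam = tune([0.01, 0.5, 5.0])
print(best_lam, select(best_lam)[0])
```

A very small λ selects the full tree and a very large λ selects the stump; the held-out data is what identifies the intermediate value as the best tradeoff in this made-up example.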
Sometimes, one can also try to directly compare Error and Complexity and in doing so avoid the need to tune λ. Information theory provides a method to do this: we will measure both of them in bits. This is the idea of the minimum-description length approach. We select the hypothesis for which
the total number of bits required to encode the hypothesis and also the data, given that the hypothesis is known, is minimized.
To gain some intuition: one encoding is a hypothesis that exactly classifies all of the input data D. But this hypothesis may itself be costly to describe. Another encoding is a trivial hypothesis that always predicts true, and requires exactly describing the training data, which can itself be costly. The minimum-description length (MDL) principle asserts that the best hypothesis makes the optimal tradeoff between these two encoding costs! The total cost function, which we will try to minimize, becomes:
Cost(h, D) = Bits(D|h) + Bits(h), (6)
where Bits(D|h) is a measure of the error of the hypothesis h on data D: the number of bits needed to encode D in the optimal encoding given h. If h classifies D exactly then no bits are required. If h is uninformative then more bits are required. Similarly, Bits(h) is the number of bits needed to describe hypothesis h from the space H of hypotheses, again in the optimal encoding given this hypothesis space. This idea of MDL has an elegant theoretical underpinning, the full details of which are beyond the scope of this course. Here is a very brief summary. Bits(D|h) is given by Shannon’s information theory, where the hypothesis is assumed to provide a probabilistic model of the data and the optimal encoding for x uses −log_2 Pr(x|h) bits, where Pr(x|h) is the probability of x given h. Given this,

\[
\mathrm{Bits}(D \mid h) = -\sum_{x \in D} \log_2 \Pr(x \mid h).
\]

Bits(h) is determined in a similar way, with an optimal coding that depends on Pr(h), the prior on different hypotheses. This is where Occam’s razor is captured: we can put a higher prior on simpler hypotheses and therefore require smaller numbers of bits for their encoding. From a practical perspective, we can adapt the principle of MDL to decision trees by adopting reasonable encodings for the hypothesis (a tree) and the data. For the tree, we can choose an encoding whose description length grows with the number of nodes (or the number of nodes and edges if this is not a binary tree).
For the data D given a particular hypothesis h, we can suppose that the instances x_1, ..., x_n are known to a receiver, and what is not known is the target classes y_1, ..., y_n. (Note that the cost of transmitting x_1, ..., x_n is independent of the hypothesis in any case, and so wouldn’t affect any comparison of different h.) Now if the classifications are identical to h(x_i) for every instance i then we need no additional data for their description. For any misclassified instance, we need to identify the instance (requiring log_2 n bits) and provide the correct classification (in log_2 c bits, where there are c = |Y| target classes). Let Ne(h, D) denote the number of misclassified instances in D given h. Putting this together, we seek a tree with corresponding hypothesis h that minimizes the cost
Cost(h, D) = Ne(h, D)(log_2 n + log_2 c) + |h|, (7)
where |h| is the number of nodes in the decision tree. The tree that minimizes this can be identified through search. Experimental results suggest that MDL-based approaches for decision trees produce results comparable to those with standard tree-pruning methods such as the pre- and post-pruning methods described above.
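Equation (7) is straightforward to evaluate when comparing candidate trees. In the sketch below the function name `mdl_cost` and the three candidate trees, with their node and error counts, are hypothetical and chosen only to illustrate the tradeoff.

```python
import math

def mdl_cost(num_errors, n, c, num_nodes):
    """Cost(h, D) = Ne(h, D) * (log2 n + log2 c) + |h|, following equation (7)."""
    return num_errors * (math.log2(n) + math.log2(c)) + num_nodes

# Hypothetical candidates on n = 16 instances with c = 2 classes:
# each misclassified instance costs log2(16) + log2(2) = 5 bits.
n, c = 16, 2
trees = {"single leaf": (6, 1), "small tree": (2, 5), "full tree": (0, 21)}
costs = {name: mdl_cost(errors, n, c, nodes) for name, (errors, nodes) in trees.items()}
best_tree = min(costs, key=costs.get)
print(costs, best_tree)
```

In this made-up example the small tree wins: its two errors cost fewer bits than the extra nodes the full tree needs in order to classify everything exactly.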
5 Experimental Methodology: Cross-Validation
One moral of overfitting is that we need to be very careful about our experimental methodology in testing machine learning algorithms. Early machine learning work reported success by showing that algorithms could learn a model that fit the training data. As we know, this is the wrong goal. We want algorithms that generalize well to unseen data. In this section we discuss the general and extremely important methodology of cross-validation. This is a technique that can be used both for validation set methods (e.g., for pruning, or for tuning a model parameter such as λ) and also for the evaluation of the performance of a fixed learning algorithm. In general, there are two kinds of questions we might be interested in:
A naive approach, and one that is WRONG, would be to put all the data D into a big training set, run each algorithm and each parameterization of the algorithms on the data, and find the algorithm and parameter settings that minimize prediction error. But this would badly overfit and would not be expected to find the method that generalizes best. Moreover, any reported accuracy would be badly optimistic.
What we do instead is to split data D into training data and validation data (with typically 10-30% reserved for validation.) The validation set is often referred to as the “hold-out” set. Given this, one can
But, we might have insufficient data for such a test to be statistically valid. Instead, a nice approach is to use cross-validation. The data D is divided up into k equal sized pieces, referred to as “folds.” A typical value of k is 10. This is k-fold cross-validation. Given this, we run k different experiments. In each
The test set is ONLY used once all model selection is complete. It is to be kept locked up until then. Only right at the final step should it be used in reporting final performance. We can also extend the cross-validation approach so as to continue to make good use of our precious data. For this, we run k experiments, where two folds of the data are held out in each experiment, one to be used for validation and one to be used for testing. The best algorithm is selected by training on the remaining k − 2 folds in each experiment, with the average performance determined over the validation sets. Finally, the performance of the selected learner (and parameterization) is reported over the k test folds. This is an extremely important idea. For example, when using 5-fold cross-validation with both validation and test sets, the approach would be to divide the data D according to components (ABC|D|E), (ABE|C|D), (ADE|B|C), (CDE|A|B), (BCD|E|A), where the first part is the training set, the second the validation set, and the third the test set. The algorithm that performs best on D, C, B, A and E, when trained on ABC, ABE, ADE, CDE and BCD respectively, is selected, and its performance is finally reported in each case on the independent test sets E, D, C, B and A.
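The rotation of training, validation and test folds in this example can be generated mechanically. The function name `rotations` and the particular indexing rule are assumptions chosen to reproduce the ordering listed above; any rotation in which each fold serves once for validation and once for testing would do.

```python
def rotations(folds):
    """Yield (train, validation, test) folds, rotating the held-out pair."""
    k = len(folds)
    for i in range(k):
        val = folds[(k - 2 - i) % k]   # fold used to select the algorithm
        test = folds[(k - 1 - i) % k]  # fold used only for the final report
        train = [f for f in folds if f not in (val, test)]
        yield train, val, test

for train, val, test in rotations(list("ABCDE")):
    print(f"({''.join(train)}|{val}|{test})")
```

Each fold appears exactly once as a validation set and exactly once as a test set, and in no experiment does the test fold overlap with the data used for training or model selection.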