
CS 446: Machine Learning

Lecture 2: Decision Trees

1 What Did We Learn?

The last set of notes discussed the learning problem: instances are given some representation and a label, and the task is to find a function that best separates the data. What counts as 'best' can be defined in different ways. A function that is best on the data that has been seen carries no guarantee of being best on unseen data. 'Best' can also be defined relative to a class of functions (e.g., the best linear function). In the context of the best linear function, two different algorithms were discussed that can find a linear separation of the data.

2 Decision Trees

In the earlier discussion, the generation of the feature space was decoupled from learning: the given examples can be mapped into another space in which the target functions are linearly separable. But questions remain about whether changing the feature space is the best solution, how good mappings are determined, and whether there is a 'best' learning algorithm. If very little is known about the domain and the dimensionality is not too high, decision trees are a good choice because they are a general-purpose algorithm.

Figure 1: Collection of Examples (top) and Resulting Decision Tree

The study of decision trees may shed some light on these questions. With decision trees, learning is done directly from the given data representation; the algorithm 'transforms' the data itself. A decision tree is a hierarchical data structure that represents data by implementing a divide-and-conquer strategy. It can be used as a non-parametric classification and regression method: given a collection of examples, a decision tree is learned that represents the collection, and the representation derived from the collection is then used to classify new examples.

2.1 The Representation of Decision Trees

Decision trees are classifiers for instances represented as feature vectors. In Figure 1, the features are color and shape, and each example carries a label. Each internal node tests a feature, with one branch for each possible value of that feature, and the leaves specify the category (label). An instance such as (color = RED; shape = TRIANGLE) is assigned to exactly one of several disjoint categories by following the branches that match its feature values.

2.2 Boolean Decision Trees

Over Boolean features, decision trees can represent any Boolean function. The rules that a tree represents can be rewritten as rules in Disjunctive Normal Form (DNF). For example, in Figure 2, the positive rules can be written as: ((green ∧ square) ∨ (blue ∧ circle) ∨ (blue ∧ square) ∨ (red)). The fact that a Boolean decision tree can represent any Boolean function has trade-offs. It is a very expressive formalism, but the result is not necessarily simple. The decision tree is a universal representation, in contrast to linear functions: decision trees can represent functions, such as parity, that cannot be represented by linear functions. The family of decision trees is an expressive class of functions, sometimes too expressive for a given problem.
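The DNF above can be checked mechanically; here is a tiny Python sketch that evaluates the positive-class formula from Figure 2 (the function name is my own):

# The positive-class DNF from Figure 2 as a predicate.
def is_positive(color: str, shape: str) -> bool:
    return ((color == "green" and shape == "square") or
            (color == "blue" and shape == "circle") or
            (color == "blue" and shape == "square") or
            (color == "red"))

print(is_positive("blue", "square"))   # True: matches the (blue ∧ square) disjunct
print(is_positive("green", "circle"))  # False: no disjunct matches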

Figure 2: A Boolean Decision Tree

2.3 Decision Boundaries

Usually, instances are represented as attribute-value pairs, for example: (color = blue, shape = square, +). Numerical values can be handled either by discretizing them or by using thresholds at splitting nodes. In that case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels. In Figure 3, the left panel shows a plane with real-valued coordinates x and y, and each node of the corresponding tree (right) tests a threshold such as x < 3; such a tree amounts to nested threshold tests, as sketched below.
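The following minimal Python sketch illustrates how a threshold tree carves the (x, y) plane into labeled axis-parallel rectangles. The test x < 3 comes from the text; the second test y < 5 and the labels are hypothetical, chosen only for illustration.

# Hypothetical threshold tree over a real-valued plane (x, y).
def classify(x: float, y: float) -> str:
    if x < 3:              # first axis-parallel split (from the text)
        return "+"
    if y < 5:              # assumed second split inside the x >= 3 half-plane
        return "-"
    return "+"

print(classify(2.0, 7.0))  # x < 3 rectangle            -> "+"
print(classify(4.0, 1.0))  # x >= 3 and y < 5 rectangle -> "-"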

Figure 3: Decision Boundaries

In summary, decision trees can represent any Boolean function and can be viewed as a way to represent a lot of data compactly. One advantage is that decision trees can handle non-metric data. The representation is a natural one and resembles a game of '20 Questions', in which a feature is tested at each node. Evaluating a decision tree classifier on a new instance is easy. Given data, however, there are many ways to represent it as a decision tree; learning a good representation from the data is the challenge.

3 Representing Data

Consider a large table with N attributes, and assume you want to know something about the people represented as entries in this table. For example, some features could be 'owns an expensive car' or 'older than 28'. The simplest way to represent the data is with histograms: the first histogram could be over the attribute 'gender', the next over the pair of attributes 'gender' and 'owns an expensive car', and so on. Assuming that N = 16, there are 16 1-d histograms (contingency tables). There are 120 2-d contingency tables: 16 choose 2 = (16 × 15)/2 = 120. There are 560 3-d contingency tables. With 100 attributes, there are 161,700 3-d tables. Clearly it is necessary to find a better way to represent the data, deciding which are the important attributes to look at first. Information theory provides some tools to help represent the data more efficiently.
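These counts can be checked directly; a quick sketch using Python's standard library:

# Verify the contingency-table counts quoted above.
from math import comb

print(comb(16, 1))    # 16     1-d tables for 16 attributes
print(comb(16, 2))    # 120    2-d tables: (16 * 15) / 2
print(comb(16, 3))    # 560    3-d tables
print(comb(100, 3))   # 161700 3-d tables for 100 attributes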

3.1 Learning Decision Trees

The output of a decision tree is a discrete category, although real-valued outputs are possible with regression trees. There are efficient algorithms for processing large amounts of data, provided that there are not too many features. There are also methods for handling noisy data (data with classification or attribute noise) and for handling missing attribute values.

3.2 Basic Decision Trees Learning Algorithm

In the basic learning algorithm for decision trees, the data is processed in batch, so all of the data is available. Decision trees are built recursively, top-down.

Figure 4: Attributes and Values (left) and Decision Tree

Basic Decision Tree Learning Algorithm

DT(Examples, Attributes)
    If all Examples have the same label: return a leaf node with that Label
    Else if Attributes is empty: return a leaf node with the majority Label
    Else:
        Pick an attribute A as the root
        For each value v of A:
            Let Examples(v) be all the examples for which A = v
            Add a branch out of the root for the test A = v
            If Examples(v) is empty: create a leaf node labeled with the majority label in Examples
            Else: recursively create the subtree by calling DT(Examples(v), Attributes − {A})
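Below is a minimal Python sketch of this recursive procedure, under assumptions that are not in the notes: each example is a dict with a "label" key, attributes is a set, domains maps each attribute to its set of possible values, and pick_attribute is any splitting criterion (for instance, the information gain of Section 3.5).

# Sketch of the recursive DT procedure above (assumptions noted in the text).
from collections import Counter

def majority_label(examples):
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def dt(examples, attributes, domains, pick_attribute):
    labels = {e["label"] for e in examples}
    if len(labels) == 1:                         # all Examples share one label
        return ("leaf", labels.pop())
    if not attributes:                           # Attributes is empty
        return ("leaf", majority_label(examples))
    a = pick_attribute(examples, attributes)     # pick an attribute A as root
    children = {}
    for v in domains[a]:                         # one branch per possible value of A
        subset = [e for e in examples if e[a] == v]
        if not subset:                           # Examples(v) is empty
            children[v] = ("leaf", majority_label(examples))
        else:
            children[v] = dt(subset, attributes - {a}, domains, pick_attribute)
    return ("node", a, children)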

3.3 Picking the Root Attribute

In choosing the root attribute, the goal is to have the resulting decision tree be as small as possible, following the common heuristic of Occam's Razor. But finding the minimal decision tree consistent with the data is NP-hard, so the recursive algorithm is a greedy heuristic search for a simple tree and cannot guarantee optimality. The main decision in the algorithm is the selection of the next attribute on which to condition. Given the data below with two Boolean attributes (A, B), which attribute should be selected first?

< (A = 0, B = 0), − >: 50 examples

< (A = 0, B = 1), − >: 50 examples

< (A = 1, B = 0), − >: 0 examples

< (A = 1, B = 1), + >: 100 examples

Splitting on A produces purely labeled nodes; splitting on B does not.

[Diagram: splitting on A gives two pure leaves (A = 0 → −, A = 1 → +), while splitting on B leaves the B = 1 branch mixed and requires a further test on A.]

Now, consider a similar example:

< (A = 0, B = 0), − >: 50 examples

< (A = 0, B = 1), − >: 50 examples

< (A = 1, B = 0), − >: 3 examples

< (A = 1, B = 1), + >: 100 examples

In this case, both choices of decision tree look similar.

[Diagram: the two candidate trees. With B at the root, the B = 1 branch still mixes 100 positive and 50 negative examples; with A at the root, the A = 1 branch mixes 100 positive and only 3 negative examples.]

The tree with A as the root is better because almost half of the examples are classified by the first test alone. The best attributes to choose first are those that split the examples into sets that are relatively pure in one label. With this splitting heuristic, the leaf nodes are more likely to be at shallower depth. The most popular way of quantifying this heuristic is information gain, which originated with Quinlan's ID3 system.

3.4 Entropy

The measure of entropy (impurity, disorder) is an important component of measuring information gain.

The entropy of a set of examples, S, relative to a binary classification is:

Entropy(S) = −p⊕ log(p⊕) − p⊖ log(p⊖)

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples.

If all of the examples belong to the same category, Entropy = 0. If the examples are equally mixed (0.5, 0.5), Entropy = 1. The general measure of entropy, which extends to cases with more than two classification labels, is:

Entropy({p_1, p_2, ..., p_k}) = − ∑_{i=1}^{k} p_i log(p_i)

where p_i is the fraction of examples labeled i. Entropy can be viewed as the number of bits required, on average, to encode the class labels. If the probability of + is 0.5, a single bit is required for each example; if it is 0.8, fewer than one bit is needed on average.
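As a quick illustration, here is a small Python sketch of this entropy computation (log base 2, so entropy is measured in bits):

# Entropy of a label distribution given as a list of fractions summing to 1.
from math import log2

def entropy(proportions):
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy([1.0]))        # 0.0 bits: all examples share one label
print(entropy([0.5, 0.5]))   # 1.0 bit:  equally mixed labels
print(entropy([0.8, 0.2]))   # ~0.72 bits: fewer than one bit per example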

3.5 Information Gain

The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute.

Gain(S, a) = Entropy(S) − ∑_{v ∈ values(a)} (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which attribute a has value v. The entropy of the partitioned data is calculated by weighting the entropy of each partition by its size relative to the original set. Partitions of low entropy lead to high gain.
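A small Python sketch of this computation, applied to the first (A, B) dataset from Section 3.3, is below; the labels and counts are taken from the notes, while the helper names are my own.

# Information gain of an attribute over a set of (features, label) examples.
from collections import Counter
from math import log2

def entropy_of(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, attribute):
    labels = [label for _, label in examples]
    total = len(examples)
    remainder = 0.0
    for v in {ex[attribute] for ex, _ in examples}:
        subset = [label for ex, label in examples if ex[attribute] == v]
        remainder += (len(subset) / total) * entropy_of(subset)   # size-weighted entropy
    return entropy_of(labels) - remainder

# First dataset from Section 3.3: 50 (A=0,B=0,-), 50 (A=0,B=1,-), 100 (A=1,B=1,+).
data = ([({"A": 0, "B": 0}, "-")] * 50 +
        [({"A": 0, "B": 1}, "-")] * 50 +
        [({"A": 1, "B": 1}, "+")] * 100)
print(gain(data, "A"))  # 1.0: splitting on A yields two pure partitions
print(gain(data, "B"))  # ~0.31: splitting on B leaves a mixed partition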

4 An Illustrative Example

Given the attributes and values in Figure 6 below, what is the best decision tree? There are 9 positive and 5 negative examples in the data. The entropy is:

Entropy(S) = −(9/14) log(9/14) − (5/14) log(5/14) = 0.94

In determining which attribute should be the root node, the gain of each is considered. For Humidity, the entropy of the subset with value High is .985 and of the subset with value Normal is .592. The gain is:

Gain(S, Humidity) = .94 − (7/14)(.985) − (7/14)(.592) = .151

For the attribute Wind, the entropy of the subset with value Weak is .811 and of the subset with value Strong is 1.0. The gain is:

Gain(S, Wind) = .94 − (8/14)(.811) − (6/14)(1.0) = .048

Figure 5: Positive Examples (left black bar), Negative Examples (right black bar), Entropy (blue bar)

Figure 6: Example of Features and Attributes

The attribute Humidity has a higher gain than Wind, but the attribute Outlook has the highest gain of 0.246. So Outlook is the root node in Figure 7 below, and the gain is calculated for each remaining feature within each branch. For the Sunny branch, the gains for Humidity, Temperature, and Wind are now:

Gain(S_sunny, Humidity) = .97 − (3/5)(0) − (2/5)(0) = .97

Gain(S_sunny, Temp) = .97 − 0 − (2/5)(1) = .57

Gain(S_sunny, Wind) = .97 − (2/5)(1) − (3/5)(.92) = .02

The highest gain within the Sunny branch is for Humidity. This process continues until every attribute is already included on the path or all examples at the leaf have the same label.
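The gain arithmetic quoted above for the root split can be checked with a short sketch; the per-branch entropies and weights are the ones stated in the example.

# Check of the root-level gains: parent entropy minus size-weighted branch entropies.
def gain_from_parts(parent_entropy, weighted_parts):
    return parent_entropy - sum(w * e for w, e in weighted_parts)

# Humidity: 7/14 of the examples are High (entropy .985), 7/14 are Normal (entropy .592).
print(gain_from_parts(0.94, [(7/14, 0.985), (7/14, 0.592)]))  # ~0.151
# Wind: 8/14 are Weak (entropy .811), 6/14 are Strong (entropy 1.0).
print(gain_from_parts(0.94, [(8/14, 0.811), (6/14, 1.0)]))    # ~0.048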

4.1 Summary

The basic ID3 algorithm, as illustrated in the last section, is shown below.

ID3(Examples, Attributes, Label)
    Label is the target attribute (the prediction); Attributes is the set of measured attributes
    Let S be the set of Examples
    Create a Root node for the tree
    If all examples in S are labeled the same, return the single-node tree Root with that Label
    Otherwise begin:
        A = the attribute in Attributes that best classifies S
        For each possible value v of A:
            Add a new tree branch corresponding to A = v
            Let Sv be the subset of examples in S with A = v
            If Sv is empty: add a leaf node labeled with the most common value of Label in S
            Else: below this branch add the subtree ID3(Sv, Attributes − {A}, Label)
    End
    Return Root

Figure 7: Corresponding Decision Tree

4.2 Hypothesis Space in Decision Tree Induction

Decision tree induction conducts a search over the space of decision trees, which can represent all possible discrete functions. The goal is to find the best decision tree, but finding the minimal decision tree consistent with the data is NP-hard. The algorithm performs a greedy heuristic search, using hill climbing without backtracking, and it makes statistically based decisions using all of the data. The bias in decision tree induction is toward trees of minimal depth. Greedy search, however, introduces complications: it positions features with high information gain high in the tree and may not find the minimal tree. It is possible to implement a preference bias (search bias), as opposed to a restriction bias (language bias). Occam's razor can be defended on the basis that there are relatively few simple hypotheses compared to complex ones; therefore, a simple hypothesis that is consistent with the data is less likely to be a statistical coincidence.

4.3 History of Decision Tree Research

Decision trees were not invented by computer scientists; they were invented by psychologists. Hunt and colleagues in psychology used full-search decision tree methods to model human concept learning in the 1960s. Quinlan developed ID3, with the information gain heuristic, in the late 1970s in order to learn expert systems from examples. Breiman, Friedman, and colleagues in statistics developed CART (classification and regression trees) at around the same time. A variety of improvements were made in the 1980s to cope with noise, continuous attributes, missing data, and non-axis-parallel splits, among other things. Quinlan's updated algorithm, C4.5 (1993), is commonly used. Boosting or bagging over decision trees is a very good general-purpose algorithm.

5 Overfitting the Data

When learning decision trees, the representation is expressive: if the data is noise-free, the tree can represent the data completely. The problem is that representing the data completely can mean generalizing from information that is really just coincidental. A tree that classifies the training data perfectly may not be the tree with the best generalization performance. The problems could be due to noise in the training data that the tree is fitting, or the algorithm might be making decisions based on very little data. A hypothesis h is said to overfit the training data if there is another hypothesis h′ such that h has a smaller error than h′ on the training data, but h has a larger error than h′ on the test data.

5.1 Avoiding Overfitting

There are two basic approaches to avoiding overfitting. The first is prepruning: the tree is made to stop growing at some point during construction, when it is determined that there is not enough data to make reliable choices. The difficulty is deciding how to keep the complexity of the tree from growing too much. One general way to limit complexity is to avoid exploiting the full power of the hypothesis class. For instance, with linear separators there is always a separator if there are enough dimensions, so one way to avoid overfitting is to keep the dimensionality from getting too high. The second approach is postpruning: grow a full tree, then remove nodes that seem to lack sufficient evidence. There are several methods for evaluating which subtrees to prune. Cross-validation reserves held-out data in order to evaluate their utility. Statistical testing asks whether the observed regularity can be dismissed as likely to have occurred by chance. Minimum Description Length asks whether the additional complexity of the hypothesis is smaller than the cost of remembering the exceptions.

5.2 Trees and Rules

Decision trees can be represented as rules. For example, the tree in Figure 7 above can be described by conditional rules such as:

If (outlook = sunny) and (humidity = high) then YES
If (outlook = rain) and (wind = strong) then NO

Figure 8: Graph of Error with Overfitting

5.3 Reduced-Error Pruning

The following algorithm for post-pruning follows a general strategy of overfitting and then simplifying. It is a post-pruning, cross-validation approach:

Partition the training data into a 'grow' set and a 'validation' set
Build a complete tree from the 'grow' data
Until accuracy on the validation set decreases, do:
    For each non-leaf node in the tree:
        Temporarily prune the subtree below it and replace it by a majority-vote leaf
        Test the accuracy of the resulting hypothesis on the validation set
    Permanently prune the node whose removal gives the greatest increase in accuracy on the validation set

The problem is that this algorithm uses less data to construct the tree. Sometimes pruning is done at the level of rules instead, and rules are generalized by erasing a condition.
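A minimal Python sketch of this loop is below, under an assumed (hypothetical) tree representation: leaves are ("leaf", label) and internal nodes are ("node", attribute, children, majority_label), where majority_label is the majority label of the 'grow' examples that reached the node; examples are dicts with a "label" key.

# Reduced-error pruning sketch over the assumed tuple representation described above.
def classify(tree, example):
    while tree[0] == "node":
        _, attr, children, majority = tree
        tree = children.get(example[attr], ("leaf", majority))
    return tree[1]

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["label"] for e in examples) / len(examples)

def internal_paths(tree, path=()):
    if tree[0] == "node":
        yield path
        for v, child in tree[2].items():
            yield from internal_paths(child, path + (v,))

def pruned_at(tree, path):
    # Replace the node at `path` with a leaf carrying that node's majority label.
    if not path:
        return ("leaf", tree[3])
    _, attr, children, majority = tree
    children = dict(children)
    children[path[0]] = pruned_at(children[path[0]], path[1:])
    return ("node", attr, children, majority)

def reduced_error_prune(tree, validation):
    best_acc = accuracy(tree, validation)
    while True:
        candidates = [(accuracy(pruned_at(tree, p), validation), p)
                      for p in internal_paths(tree)]
        if not candidates:                 # the tree is already a single leaf
            return tree
        acc, path = max(candidates, key=lambda c: c[0])
        if acc < best_acc:                 # stop once pruning hurts validation accuracy
            return tree
        tree, best_acc = pruned_at(tree, path), acc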

6 Continuous Attributes, Missing Values, and Other Issues

6.1 Continuous Attributes

Real-valued attributes can be discretized in advance into ranges such as big, medium, and small. The problem with this approach is that some information is lost; for example, the ordering among the original values is no longer available once they are grouped into big, medium, and small. Alternatively, splitting nodes can be based on thresholds of the form A < c and A >= c. The information gain for these splits is calculated in the same way and compared to the information gain of discrete splits.

How to find the Split with the Highest Gain

For each continuous feature A:
    Sort the examples according to the value of A
    For each ordered pair of consecutive values (x, y) with different labels:
        Check the mid-point t = (x + y)/2 as a possible threshold, i.e., the split into S_{A < t} and S_{A ≥ t}

For example, given the length and class:

Table 1:
Length (L): 10  15  21  28  32  40  50
Class:       -   +   +   -   +   +   -

Check thresholds such as L < 12.5, L < 24.5, and L < 45. For each candidate threshold, the examples on either side split into some k positive and j negative examples, and the information gain is computed as before.
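The midpoint procedure on the Table 1 data can be sketched as follows; note that it also surfaces the midpoint 30.0 (between 28 and 32), which the notes do not list among their examples.

# Candidate thresholds for a continuous feature, using the Table 1 data.
lengths = [10, 15, 21, 28, 32, 40, 50]
labels  = ['-', '+', '+', '-', '+', '+', '-']

# Examples are already sorted by length; propose the midpoint between every
# consecutive pair whose labels differ.
candidates = [(x + y) / 2
              for (x, lx), (y, ly) in zip(zip(lengths, labels), zip(lengths[1:], labels[1:]))
              if lx != ly]
print(candidates)  # [12.5, 24.5, 30.0, 45.0]

# Each threshold t induces a split whose gain is computed as for a discrete
# attribute; here we just report the label counts on each side.
for t in candidates:
    left  = [l for L, l in zip(lengths, labels) if L < t]
    right = [l for L, l in zip(lengths, labels) if L >= t]
    print(t, left.count('+'), left.count('-'), right.count('+'), right.count('-'))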

6.2 Missing Values

Often, the values of all attributes are not available during training or testing. For example, in medical diagnosis:

diagnosis = <fever, blood pressure, ..., blood test = ?, ...>

To deal with this situation, training must be able to evaluate Gain(S, a) even when some of the examples do not give a value for a, and testing must be able to classify an example without knowing the value of a.
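One standard way to handle a missing value, not necessarily the one illustrated in Figure 9, is to substitute the most common observed value of the attribute before computing Gain(S, a); a tiny hypothetical sketch:

# Hypothetical sketch: fill missing values ("?") with the attribute's most common
# observed value before computing Gain(S, a). Other strategies (e.g., fractional
# counts) are also used.
from collections import Counter

def fill_missing(examples, attribute):
    observed = [e[attribute] for e in examples if e[attribute] != "?"]
    most_common = Counter(observed).most_common(1)[0][0]
    return [dict(e, **{attribute: most_common}) if e[attribute] == "?" else e
            for e in examples]

patients = [
    {"fever": "high", "blood_test": "?",      "label": "+"},
    {"fever": "low",  "blood_test": "normal", "label": "-"},
    {"fever": "high", "blood_test": "normal", "label": "+"},
]
print(fill_missing(patients, "blood_test"))  # the "?" becomes "normal"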

Figure 9: Training with Missing Values

7 Summary

Rather than using a single decision tree to represent the target function, it is becoming more common to use small decision trees as features. When learning over a large number of features, learning a decision tree is difficult and the resulting tree may be very large, which can lead to overfitting. Instead, it is possible to learn many small decision trees with limited depth. The small trees are treated as 'experts': they are correct, but only on a small region of the domain. A linear function is then learned over these features; boosting and other linear learners are used on top of the small decision trees, which provide either Boolean or real-valued features. In summary, the hypothesis space for decision trees is of variable size and contains all functions; it is deterministic and can handle discrete or continuous attributes. For search algorithms, ID3 was presented: it uses an eager, batch, constructive search and can be extended to deal with missing values. Other issues involve determining what the goal of the decision tree is and when to stop. It is important to guarantee good generalization and not overfit the data. These notes did not address how to evaluate the methods correctness-wise or complexity-wise.