
CS 446: Machine Learning

Lecture 2: Decision Trees

1 What Did We Learn?

The last set of notes discussed the learning problem: instances are given some representation and a label, and the task is to find a function that best separates the data. What counts as 'best' can be defined in different ways. A function that is best on the data that has been seen carries no guarantee of being best on unseen data. 'Best' can also be defined relative to a class of functions (e.g., the best linear function). In the context of the best linear function, two different algorithms were discussed that can find a linear separation of the data.

2 Decision Trees

In the earlier discussion, the generation of the feature space was decoupled from learning: the given examples can be mapped into another space in which the target functions are linearly separable. But questions remain about whether changing the feature space is the best solution, how good mappings are determined, and whether there is a 'best' learning algorithm. If very little is known about the domain and the dimensionality is not too high, decision trees are a good choice because they are a general-purpose algorithm.

Figure 1: Collection of Examples (top) and Resulting Decision Tree

The study of decision trees may shed some light on these questions. With decision trees, learning is done directly from the given data representation; the algorithm 'transforms' the data itself. A decision tree is a hierarchical data structure that represents data by implementing a divide-and-conquer strategy. It can be used as a non-parametric classification and regression method: given a collection of examples, a decision tree is learned that represents the collection, and the representation derived from the collection is then used to classify new examples.

2.1 The Representation of Decision Trees

Decision trees are classifiers for instances represented as feature vectors. In Figure 1, the features are color and shape, and each example carries a label. Each internal node tests a feature, with one branch for each possible value of that feature, and the leaves specify the category (label). An instance such as (color = RED; shape = TRIANGLE) is assigned to exactly one of several disjoint categories by following the branches that match its feature values.

2.2 Boolean Decision Trees

Over Boolean features, decision trees can represent any Boolean function. The rules that a tree represents can be rewritten as rules in Disjunctive Normal Form (DNF). For example, in Figure 2, the positive rules can be written as: ((green ∧ square) ∨ (blue ∧ circle) ∨ (blue ∧ square) ∨ (red)). The fact that a Boolean decision tree can represent any Boolean function has trade-offs. It is a very expressive formalism, but the result is not necessarily simple. The decision tree is a universal representation, in contrast to linear functions: decision trees can represent functions, such as parity, that cannot be represented by linear functions. The family of decision trees is an expressive class of functions, sometimes too expressive for a given problem.
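The DNF above can be checked mechanically; here is a tiny Python sketch that evaluates the positive-class formula from Figure 2 (the function name is my own):

# The positive-class DNF from Figure 2 as a predicate.
def is_positive(color: str, shape: str) -> bool:
    return ((color == "green" and shape == "square") or
            (color == "blue" and shape == "circle") or
            (color == "blue" and shape == "square") or
            (color == "red"))

print(is_positive("blue", "square"))   # True: matches the (blue ∧ square) disjunct
print(is_positive("green", "circle"))  # False: no disjunct matches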

Figure 2: A Boolean Decision Tree

2.3 Decision Boundaries

Usually, instances are represented as attribute-value pairs, for example: (color = blue, shape = square, +). Numerical values can be handled either by discretizing them or by using thresholds at splitting nodes. In that case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels. In Figure 3, the left panel shows a plane with real-valued coordinates x and y, and each node of the corresponding tree (right) tests a threshold such as x < 3; such a tree amounts to nested threshold tests, as sketched below.
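The following minimal Python sketch illustrates how a threshold tree carves the (x, y) plane into labeled axis-parallel rectangles. The test x < 3 comes from the text; the second test y < 5 and the labels are hypothetical, chosen only for illustration.

# Hypothetical threshold tree over a real-valued plane (x, y).
def classify(x: float, y: float) -> str:
    if x < 3:              # first axis-parallel split (from the text)
        return "+"
    if y < 5:              # assumed second split inside the x >= 3 half-plane
        return "-"
    return "+"

print(classify(2.0, 7.0))  # x < 3 rectangle            -> "+"
print(classify(4.0, 1.0))  # x >= 3 and y < 5 rectangle -> "-"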

Figure 3: Decision Boundaries

In summary, decision trees can represent any Boolean function and can be viewed as a way to represent a lot of data compactly. One advantage is that decision trees can handle non-metric data. The representation is a natural one and resembles a game of '20 Questions', in which a feature is tested at each node. Evaluating a decision tree classifier on a new instance is easy. Given data, however, there are many ways to represent it as a decision tree; learning a good representation from the data is the challenge.

3 Representing Data

Consider a large table with N attributes, and assume you want to know something about the people represented as entries in this table. For example, some features could be 'owns an expensive car' or 'older than 28'. The simplest way to represent the data is with histograms: the first histogram could be over the attribute 'gender', the next over the pair of attributes 'gender' and 'owns an expensive car', and so on. Assuming that N = 16, there are 16 1-d histograms (contingency tables). There are 120 2-d contingency tables: 16 choose 2 = (16 × 15)/2 = 120. There are 560 3-d contingency tables. With 100 attributes, there are 161,700 3-d tables. Clearly it is necessary to find a better way to represent the data, deciding which are the important attributes to look at first. Information theory provides some tools to help represent the data more efficiently.
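These counts can be checked directly; a quick sketch using Python's standard library:

# Verify the contingency-table counts quoted above.
from math import comb

print(comb(16, 1))    # 16     1-d tables for 16 attributes
print(comb(16, 2))    # 120    2-d tables: (16 * 15) / 2
print(comb(16, 3))    # 560    3-d tables
print(comb(100, 3))   # 161700 3-d tables for 100 attributes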

3.1 Learning Decision Trees

The output of a decision tree is a discrete category, although real-valued outputs are possible with regression trees. There are efficient algorithms for processing large amounts of data, provided that there are not too many features. There are also methods for handling noisy data (data with classification or attribute noise) and for handling missing attribute values.

3.2 Basic Decision Trees Learning Algorithm

In the basic learning algorithm for decision trees, the data is processed in batch, so all of the data is available. Decision trees are built recursively, top-down.

Figure 4: Attributes and Values (left) and Decision Tree

Basic Decision Tree Learning Algorithm

DT(Examples, Attributes)
    If all Examples have the same label: return a leaf node with that Label
    Else if Attributes is empty: return a leaf node with the majority Label
    Else:
        Pick an attribute A as the root
        For each value v of A:
            Let Examples(v) be all the examples for which A = v
            Add a branch out of the root for the test A = v
            If Examples(v) is empty: create a leaf node labeled with the majority label in Examples
            Else: recursively create the subtree by calling DT(Examples(v), Attributes − {A})
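Below is a minimal Python sketch of this recursive procedure, under assumptions that are not in the notes: each example is a dict with a "label" key, attributes is a set, domains maps each attribute to its set of possible values, and pick_attribute is any splitting criterion (for instance, the information gain of Section 3.5).

# Sketch of the recursive DT procedure above (assumptions noted in the text).
from collections import Counter

def majority_label(examples):
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def dt(examples, attributes, domains, pick_attribute):
    labels = {e["label"] for e in examples}
    if len(labels) == 1:                         # all Examples share one label
        return ("leaf", labels.pop())
    if not attributes:                           # Attributes is empty
        return ("leaf", majority_label(examples))
    a = pick_attribute(examples, attributes)     # pick an attribute A as root
    children = {}
    for v in domains[a]:                         # one branch per possible value of A
        subset = [e for e in examples if e[a] == v]
        if not subset:                           # Examples(v) is empty
            children[v] = ("leaf", majority_label(examples))
        else:
            children[v] = dt(subset, attributes - {a}, domains, pick_attribute)
    return ("node", a, children)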

3.3 Picking the Root Attribute

In choosing the root attribute, the goal is to have the resulting decision tree be as small as possible, following the common heuristic of Occam's Razor. But finding the minimal decision tree consistent with the data is NP-hard, so the recursive algorithm is a greedy heuristic search for a simple tree and cannot guarantee optimality. The main decision in the algorithm is the selection of the next attribute on which to condition. Given the data below with two Boolean attributes (A, B), which attribute should be selected first?

< (A = 0, B = 0), − >: 50 examples

< (A = 0, B = 1), − >: 50 examples

< (A = 1, B = 0), − >: 0 examples

< (A = 1, B = 1), + >: 100 examples

Splitting on A produces purely labeled nodes; splitting on B does not.

[Diagram: splitting on A gives two pure leaves (A = 0 → −, A = 1 → +), while splitting on B leaves the B = 1 branch mixed and requires a further test on A.]

Now, consider a similar example:

< (A = 0, B = 0), − >: 50 examples

< (A = 0, B = 1), − >: 50 examples

< (A = 1, B = 0), − >: 3 examples

< (A = 1, B = 1), + >: 100 examples

In this case, both choices of decision tree look similar.

[Diagram: the two candidate trees. With B at the root, the B = 1 branch still mixes 100 positive and 50 negative examples; with A at the root, the A = 1 branch mixes 100 positive and only 3 negative examples.]

The tree with A as the root is better because almost half of the examples are classified by the first test alone. The best attributes to choose first are those that split the examples into sets that are relatively pure in one label. With this splitting heuristic, the leaf nodes are more likely to be at shallower depth. The most popular way of quantifying this heuristic is information gain, which originated with Quinlan's ID3 system.

3.4 Entropy

The measure of entropy (impurity, disorder) is an important component of measuring information gain.

The entropy of a set of examples, S, relative to a binary classification is:

Entropy(S) = −p⊕ log(p⊕) − p⊖ log(p⊖)

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples.

If all of the examples belong to the same category, Entropy = 0. If the examples are equally mixed (0.5, 0.5), Entropy = 1. The general measure of entropy, which extends to cases with more than two classification labels, is:

Entropy({p_1, p_2, ..., p_k}) = − ∑_{i=1}^{k} p_i log(p_i)

where p_i is the fraction of examples labeled i. Entropy can be viewed as the number of bits required, on average, to encode the class labels. If the probability of + is 0.5, a single bit is required for each example; if it is 0.8, fewer than one bit is needed on average.
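As a quick illustration, here is a small Python sketch of this entropy computation (log base 2, so entropy is measured in bits):

# Entropy of a label distribution given as a list of fractions summing to 1.
from math import log2

def entropy(proportions):
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy([1.0]))        # 0.0 bits: all examples share one label
print(entropy([0.5, 0.5]))   # 1.0 bit:  equally mixed labels
print(entropy([0.8, 0.2]))   # ~0.72 bits: fewer than one bit per example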

3.5 Information Gain

The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute.

Gain(S, a) = Entropy(S) − ∑_{v ∈ values(a)} (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which attribute a has value v. The entropy of the partitioned data is calculated by weighting the entropy of each partition by its size relative to the original set. Partitions of low entropy lead to high gain.
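A small Python sketch of this computation, applied to the first (A, B) dataset from Section 3.3, is below; the labels and counts are taken from the notes, while the helper names are my own.

# Information gain of an attribute over a set of (features, label) examples.
from collections import Counter
from math import log2

def entropy_of(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, attribute):
    labels = [label for _, label in examples]
    total = len(examples)
    remainder = 0.0
    for v in {ex[attribute] for ex, _ in examples}:
        subset = [label for ex, label in examples if ex[attribute] == v]
        remainder += (len(subset) / total) * entropy_of(subset)   # size-weighted entropy
    return entropy_of(labels) - remainder

# First dataset from Section 3.3: 50 (A=0,B=0,-), 50 (A=0,B=1,-), 100 (A=1,B=1,+).
data = ([({"A": 0, "B": 0}, "-")] * 50 +
        [({"A": 0, "B": 1}, "-")] * 50 +
        [({"A": 1, "B": 1}, "+")] * 100)
print(gain(data, "A"))  # 1.0: splitting on A yields two pure partitions
print(gain(data, "B"))  # ~0.31: splitting on B leaves a mixed partition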

4 An Illustrative Example

Given the attributes and values in Figure 6 below, what is the best decision tree? There are 9 positive and 5 negative examples in the data. The entropy is:

Entropy(S) = −(9/14) log(9/14) − (5/14) log(5/14) = 0.94

In determining which attribute should be the root node, the gain of each is considered. For Humidity, the entropy of the subset with value High is .985 and of the subset with value Normal is .592. The gain is:

Gain(S, Humidity) = .94 − (7/14)(.985) − (7/14)(.592) = .151

For the attribute Wind, the entropy of the subset with value Weak is .811 and of the subset with value Strong is 1.0. The gain is:

Gain(S, Wind) = .94 − (8/14)(.811) − (6/14)(1.0) = .048

Figure 5: Positive Examples (left black bar), Negative Examples (right black bar), Entropy (blue bar)

Figure 6: Example of Features and Attributes

The attribute Humidity has a higher gain than Wind, but the attribute Outlook has the highest gain of 0.246. So Outlook is the root node in Figure 7 below, and the gain is calculated for each remaining feature within each branch. For the Sunny branch, the gains for Humidity, Temperature, and Wind are now:

Gain(S_sunny, Humidity) = .97 − (3/5)(0) − (2/5)(0) = .97

Gain(S_sunny, Temp) = .97 − 0 − (2/5)(1) = .57

Gain(S_sunny, Wind) = .97 − (2/5)(1) − (3/5)(.92) = .02

The highest gain within the Sunny branch is for Humidity. This process continues until every attribute is already included on the path or all examples at the leaf have the same label.
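The gain arithmetic quoted above for the root split can be checked with a short sketch; the per-branch entropies and weights are the ones stated in the example.

# Check of the root-level gains: parent entropy minus size-weighted branch entropies.
def gain_from_parts(parent_entropy, weighted_parts):
    return parent_entropy - sum(w * e for w, e in weighted_parts)

# Humidity: 7/14 of the examples are High (entropy .985), 7/14 are Normal (entropy .592).
print(gain_from_parts(0.94, [(7/14, 0.985), (7/14, 0.592)]))  # ~0.151
# Wind: 8/14 are Weak (entropy .811), 6/14 are Strong (entropy 1.0).
print(gain_from_parts(0.94, [(8/14, 0.811), (6/14, 1.0)]))    # ~0.048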

4.1 Summary

The basic ID3 algorithm, as illustrated in the last section, is shown below.

ID3(Examples, Attributes, Label)
    Label is the target attribute (the prediction); Attributes is the set of measured attributes
    Let S be the set of Examples
    Create a Root node for the tree
    If all examples in S are labeled the same, return the single-node tree Root with that Label
    Otherwise begin:
        A = the attribute in Attributes that best classifies S
        For each possible value v of A:
            Add a new tree branch corresponding to A = v
            Let Sv be the subset of examples in S with A = v
            If Sv is empty: add a leaf node labeled with the most common value of Label in S
            Else: below this branch add the subtree ID3(Sv, Attributes − {A}, Label)
    End
    Return Root

Figure 7: Corresponding Decision Tree

4.2 Hypothesis Space in Decision Tree Induction

Decision tree induction conducts a search over the space of decision trees, which can represent all possible discrete functions. The goal is to find the best decision tree, but finding the minimal decision tree consistent with the data is NP-hard. The algorithm performs a greedy heuristic search, using hill climbing without backtracking, and it makes statistically based decisions using all of the data. The bias in decision tree induction is toward trees of minimal depth. Greedy search, however, introduces complications: it positions features with high information gain high in the tree and may not find the minimal tree. It is possible to implement a preference bias (search bias), as opposed to a restriction bias (language bias). Occam's razor can be defended on the basis that there are relatively few simple hypotheses compared to complex ones; therefore, a simple hypothesis that is consistent with the data is less likely to be a statistical coincidence.

4.3 History of Decision Tree Research

Decision trees were not invented by computer scientists; they were invented by psychologists. Hunt and colleagues in psychology used full-search decision tree methods to model human concept learning in the 1960s. Quinlan developed ID3, with the information gain heuristic, in the late 1970s in order to learn expert systems from examples. Breiman, Friedman, and colleagues in statistics developed CART (classification and regression trees) at around the same time. A variety of improvements were made in the 1980s to cope with noise, continuous attributes, missing data, and non-axis-parallel splits, among other things. Quinlan's updated algorithm, C4.5 (1993), is commonly used. Boosting or bagging over decision trees is a very good general-purpose algorithm.

5 Overfitting the Data

When learning decision trees, the representation is expressive: if the data is noise-free, the tree can represent the data completely. The problem is that representing the data completely can mean generalizing from information that is really just coincidental. A tree that classifies the training data perfectly may not be the tree with the best generalization performance. The problems could be due to noise in the training data that the tree is fitting, or the algorithm might be making decisions based on very little data. A hypothesis h is said to overfit the training data if there is another hypothesis h′ such that h has a smaller error than h′ on the training data, but h has a larger error than h′ on the test data.

5.1 Avoiding Overfitting

There are two basic approaches to avoiding overfitting. The first is prepruning: the tree is made to stop growing at some point during construction, when it is determined that there is not enough data to make reliable choices. The difficulty is deciding how to keep the complexity of the tree from growing too much. One general way to limit complexity is to avoid exploiting the full power of the hypothesis class. For instance, with linear separators there is always a separator if there are enough dimensions, so one way to avoid overfitting is to keep the dimensionality from getting too high. The second approach is postpruning: grow a full tree, then remove nodes that seem to lack sufficient evidence. There are several methods for evaluating which subtrees to prune. Cross-validation reserves held-out data in order to evaluate their utility. Statistical testing asks whether the observed regularity can be dismissed as likely to have occurred by chance. Minimum Description Length asks whether the additional complexity of the hypothesis is smaller than the cost of remembering the exceptions.

5.2 Trees and Rules

Decision trees can be represented as rules. For example, the tree in Figure 7 above can be described by conditional rules such as:

If (outlook = sunny) and (humidity = high) then YES
If (outlook = rain) and (wind = strong) then NO

Figure 8: Graph of Error with Overfitting

5.3 Reduced-Error Pruning

The following algorithm for post-pruning follows a general strategy of overfitting and then simplifying. It is a post-pruning, cross-validation approach:

Partition the training data into a 'grow' set and a 'validation' set
Build a complete tree from the 'grow' data
Until accuracy on the validation set decreases, do:
    For each non-leaf node in the tree:
        Temporarily prune the subtree below it and replace it by a majority-vote leaf
        Test the accuracy of the resulting hypothesis on the validation set
    Permanently prune the node whose removal gives the greatest increase in accuracy on the validation set

The problem is that this algorithm uses less data to construct the tree. Sometimes pruning is done at the level of rules instead, and rules are generalized by erasing a condition.
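A minimal Python sketch of this loop is below, under an assumed (hypothetical) tree representation: leaves are ("leaf", label) and internal nodes are ("node", attribute, children, majority_label), where majority_label is the majority label of the 'grow' examples that reached the node; examples are dicts with a "label" key.

# Reduced-error pruning sketch over the assumed tuple representation described above.
def classify(tree, example):
    while tree[0] == "node":
        _, attr, children, majority = tree
        tree = children.get(example[attr], ("leaf", majority))
    return tree[1]

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["label"] for e in examples) / len(examples)

def internal_paths(tree, path=()):
    if tree[0] == "node":
        yield path
        for v, child in tree[2].items():
            yield from internal_paths(child, path + (v,))

def pruned_at(tree, path):
    # Replace the node at `path` with a leaf carrying that node's majority label.
    if not path:
        return ("leaf", tree[3])
    _, attr, children, majority = tree
    children = dict(children)
    children[path[0]] = pruned_at(children[path[0]], path[1:])
    return ("node", attr, children, majority)

def reduced_error_prune(tree, validation):
    best_acc = accuracy(tree, validation)
    while True:
        candidates = [(accuracy(pruned_at(tree, p), validation), p)
                      for p in internal_paths(tree)]
        if not candidates:                 # the tree is already a single leaf
            return tree
        acc, path = max(candidates, key=lambda c: c[0])
        if acc < best_acc:                 # stop once pruning hurts validation accuracy
            return tree
        tree, best_acc = pruned_at(tree, path), acc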

6 Continuous Attributes, Missing Values, and Other Issues

6.1 Continuous Attributes

Real-valued attributes can be discretized in advance into ranges such as big, medium, and small. The problem with this approach is that some information is lost; for example, the ordering among the original values is no longer available once they are grouped into big, medium, and small. Alternatively, splitting nodes can be based on thresholds of the form A < c and A >= c. The information gain for these splits is calculated in the same way and compared to the information gain of discrete splits.

How to find the Split with the Highest Gain

For each continuous feature A:
    Sort the examples according to the value of A
    For each ordered pair of consecutive values (x, y) with different labels:
        Check the mid-point t = (x + y)/2 as a possible threshold, i.e., the split into S_{A < t} and S_{A ≥ t}

For example, given the length and class:

Table 1:
Length (L): 10  15  21  28  32  40  50
Class:       -   +   +   -   +   +   -

Check thresholds such as L < 12.5, L < 24.5, and L < 45. For each candidate threshold, the examples on either side split into some k positive and j negative examples, and the information gain is computed as before.
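The midpoint procedure on the Table 1 data can be sketched as follows; note that it also surfaces the midpoint 30.0 (between 28 and 32), which the notes do not list among their examples.

# Candidate thresholds for a continuous feature, using the Table 1 data.
lengths = [10, 15, 21, 28, 32, 40, 50]
labels  = ['-', '+', '+', '-', '+', '+', '-']

# Examples are already sorted by length; propose the midpoint between every
# consecutive pair whose labels differ.
candidates = [(x + y) / 2
              for (x, lx), (y, ly) in zip(zip(lengths, labels), zip(lengths[1:], labels[1:]))
              if lx != ly]
print(candidates)  # [12.5, 24.5, 30.0, 45.0]

# Each threshold t induces a split whose gain is computed as for a discrete
# attribute; here we just report the label counts on each side.
for t in candidates:
    left  = [l for L, l in zip(lengths, labels) if L < t]
    right = [l for L, l in zip(lengths, labels) if L >= t]
    print(t, left.count('+'), left.count('-'), right.count('+'), right.count('-'))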

6.2 Missing Values

Often, the values of all attributes are not available during training or testing. For example, in medical diagnosis:

diagnosis = <fever, blood pressure, ..., blood test = ?, ...>

To deal with this situation, training must be able to evaluate Gain(S, a) even when some of the examples do not give a value for a, and testing must be able to classify an example without knowing the value of a.
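One standard way to handle a missing value, not necessarily the one illustrated in Figure 9, is to substitute the most common observed value of the attribute before computing Gain(S, a); a tiny hypothetical sketch:

# Hypothetical sketch: fill missing values ("?") with the attribute's most common
# observed value before computing Gain(S, a). Other strategies (e.g., fractional
# counts) are also used.
from collections import Counter

def fill_missing(examples, attribute):
    observed = [e[attribute] for e in examples if e[attribute] != "?"]
    most_common = Counter(observed).most_common(1)[0][0]
    return [dict(e, **{attribute: most_common}) if e[attribute] == "?" else e
            for e in examples]

patients = [
    {"fever": "high", "blood_test": "?",      "label": "+"},
    {"fever": "low",  "blood_test": "normal", "label": "-"},
    {"fever": "high", "blood_test": "normal", "label": "+"},
]
print(fill_missing(patients, "blood_test"))  # the "?" becomes "normal"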

Figure 9: Training with Missing Values

7 Summary

Rather than using a single decision tree to represent the target function, it is becoming more common to use small decision trees as features. When learning over a large number of features, learning a decision tree is difficult and the resulting tree may be very large, which can lead to overfitting. Instead, it is possible to learn many small decision trees with limited depth. The small trees are treated as 'experts': they are correct, but only on a small region of the domain. A linear function is then learned over these features; boosting and other linear learners are used on top of the small decision trees, which provide either Boolean or real-valued features. In summary, the hypothesis space for decision trees is of variable size and contains all functions; it is deterministic and can handle discrete or continuous attributes. For search algorithms, ID3 was presented: it uses an eager, batch, constructive search and can be extended to deal with missing values. Other issues involve determining what the goal of the decision tree is and when to stop. It is important to guarantee good generalization and not overfit the data. These notes did not address how to evaluate the methods correctness-wise or complexity-wise.