Decision Trees for Linear Function Separation in Machine Learning - Prof. Dan Roth, Study notes of Computer Science

An introduction to decision trees as a method for finding a linear function that best separates data. Topics covered include the basics of decision trees, their use as a non-parametric classification and regression method, and the process of representing data and learning the algorithm. Entropy and information gain are also discussed as they relate to decision trees.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

koofers-user-miy-2
Decision Trees, CS446 Fall 08
What Did We Learn?
• Learning problem: find a function that best separates the data.
  • What function?
  • What's best?
  • How to find it?
• A possibility: define the learning problem to be: find a (linear) function that best separates the data.
• Linear: X = data representation; w = the classifier
$$Y = \mathrm{sgn}\{X^T w\}$$
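
As a small illustration of the linear rule Y = sgn{X^T w}, the sketch below classifies one example at a time. The weights, the examples, and the choice to map a score of exactly 0 to +1 are assumptions made for illustration; they are not taken from the slides.

```python
def sgn(value):
    """Sign function; mapping 0 to +1 is an arbitrary tie-breaking assumption."""
    return 1 if value >= 0 else -1


def predict(x, w):
    """Return sgn(x . w) for one example x and weight vector w."""
    return sgn(sum(xi * wi for xi, wi in zip(x, w)))


# Hypothetical weights and examples, chosen only for illustration.
w = [1.0, -2.0]
print(predict([3.0, 1.0], w))   # score 3 - 2 = 1, so +1
print(predict([1.0, 1.0], w))   # score 1 - 2 = -1, so -1
```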
Home work
Registration

Introduction - Summary
• We introduced the technical part of the class by giving two examples of (very different) approaches to linear discrimination.
• There are many other solutions.
• Question 1: Our solution assumed that the target functions are linear. Can we learn a function that is more flexible in terms of what it does with the feature space?
• Question 2: Can we say something about the quality of what we learn (sample complexity, time complexity, quality)?

Decision Trees
• A hierarchical data structure that represents data by implementing a divide and conquer strategy.
• Can be used as a non-parametric classification and regression method.
• Given a collection of examples, learn a decision tree that represents it.
• Use this representation to classify new examples.

Decision Trees: The Representation
• Decision trees are classifiers for instances represented as feature vectors, e.g. (color= ; shape= ; label= ).
• Nodes are tests for feature values.
• There is one branch for each value of the feature.
• Leaves specify the categories (labels).
• Can categorize instances into multiple disjoint categories.
[Figure: a decision tree with root test Color (values blue, red, green), Shape tests (values triangle, circle, square) at internal nodes, and leaves labeled A, B, C.]
• Evaluation of a decision tree: e.g. the instance (color=RED; shape=triangle).
• Learning a decision tree.
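
A minimal sketch of this representation and of evaluation: internal nodes test one feature, each branch corresponds to one feature value, and leaves carry labels. The class names `Node` and `Leaf` are hypothetical, and the leaf assignments below are made up because the tree in the original figure cannot be recovered from the text.

```python
class Leaf:
    def __init__(self, label):
        self.label = label


class Node:
    def __init__(self, feature, branches):
        self.feature = feature      # name of the feature tested at this node
        self.branches = branches    # dict: feature value -> subtree


def classify(tree, instance):
    """Follow the branch matching the instance's feature value until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[instance[tree.feature]]
    return tree.label


# Hypothetical tree: the leaf labels are assumptions, not read off the slide.
tree = Node("color", {
    "blue": Node("shape", {"triangle": Leaf("B"), "circle": Leaf("A"), "square": Leaf("C")}),
    "red": Leaf("B"),
    "green": Node("shape", {"triangle": Leaf("A"), "circle": Leaf("A"), "square": Leaf("B")}),
})

print(classify(tree, {"color": "red", "shape": "triangle"}))   # B
print(classify(tree, {"color": "blue", "shape": "circle"}))    # A
```

Evaluation costs one dictionary lookup per level, so it is linear in the depth of the tree.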

Decision Trees: Decision Boundaries
• Usually, instances are represented as attribute-value pairs, e.g. (color=blue, shape=square, +).
• Numerical values can be used either by discretizing or by using thresholds for splitting nodes.
• In this case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels.
[Figure: a tree with threshold tests such as X<3, Y>7 and X<1 and leaves labeled + and -, shown next to the corresponding partition of the X-Y plane (ticks 1 and 3 on the X axis, 5 and 7 on the Y axis) into rectangles labeled + and -.]
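
The following sketch shows how threshold tests carve the plane into axis-parallel rectangles. The tuple encoding, the thresholds, and the leaf labels are assumptions chosen for illustration; they are not the tree from the figure.

```python
# A test node is (feature, threshold, subtree_if_below, subtree_otherwise);
# a leaf is just a label string. Thresholds and labels are hypothetical.
tree = ("x", 3.0,
        ("y", 7.0, "-", "+"),     # region x < 3: split on y at 7
        ("y", 5.0, "+", "-"))     # region x >= 3: split on y at 5


def classify(tree, point):
    """Descend by comparing one coordinate against a threshold at each node."""
    while not isinstance(tree, str):
        feature, threshold, below, otherwise = tree
        tree = below if point[feature] < threshold else otherwise
    return tree


def regions(tree, box):
    """Yield (box, label) for every axis-parallel rectangle the tree induces."""
    if isinstance(tree, str):
        yield box, tree
        return
    feature, threshold, below, otherwise = tree
    low, high = box[feature]
    yield from regions(below, {**box, feature: (low, threshold)})
    yield from regions(otherwise, {**box, feature: (threshold, high)})


inf = float("inf")
for box, label in regions(tree, {"x": (-inf, inf), "y": (-inf, inf)}):
    print(box, label)
print(classify(tree, {"x": 1.0, "y": 8.0}))   # x < 3 and y >= 7, so "+"
```

Each leaf corresponds to exactly one rectangle, which is why a tree with threshold tests can only draw boundaries parallel to the axes.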

Decision Trees
• Can represent any Boolean function.
• Can be viewed as a way to compactly represent a lot of data.
• Advantage: non-metric data.
• Natural representation (20 questions).
• The evaluation of the decision tree classifier is easy.
• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.
[Figure: decision tree with root Outlook (Sunny, Overcast, Rain). Sunny leads to a Humidity test (High: No, Normal: Yes), Overcast leads to Yes, and Rain leads to a Wind test (Strong: No, Weak: Yes).]
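
One way to see that a decision tree can represent any Boolean function: a complete tree that tests every variable has one leaf per row of the truth table. The sketch below builds such a tree for XOR; the helper names are hypothetical and the construction is only an illustration, since a complete tree over n variables has 2^n leaves and is usually far from the most compact representation.

```python
from itertools import product


def build_full_tree(function, variables):
    """Build a complete tree: test variables in order, one leaf per assignment."""
    def grow(index, assignment):
        if index == len(variables):
            return function(**assignment)            # leaf: the function's value
        name = variables[index]
        return (name, {value: grow(index + 1, {**assignment, name: value})
                       for value in (False, True)})
    return grow(0, {})


def classify(tree, instance):
    """Internal nodes are (variable, branches) tuples; leaves are booleans."""
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[instance[name]]
    return tree


def xor(a, b):
    return a != b


tree = build_full_tree(xor, ["a", "b"])
for a, b in product((False, True), repeat=2):
    assert classify(tree, {"a": a, "b": b}) == xor(a, b)
print(tree)
```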

Learning Decision Trees
• Output is a discrete category. Real-valued outputs are possible (regression trees).
• There are efficient algorithms for processing large amounts of data (but not too many features).
• There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.
[Figure: the Color/Shape decision tree shown earlier.]

Basic Decision Tree Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top-down.

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

[Figure: the PlayTennis decision tree with root Outlook, a Humidity test under Sunny, a Wind test under Rain, and Yes/No leaves.]
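
The recursive top-down construction can be sketched as follows on the PlayTennis data. This is an illustration under stated assumptions, not the course's implementation: the attribute chooser is passed in as a parameter, and the default `first_attribute` is a deliberately naive placeholder that simply takes the first unused attribute.

```python
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
DATA = [(dict(zip(ATTRIBUTES, f[:4])), f[4])
        for f in (row.split() for row in ROWS.strip().splitlines())]


def first_attribute(examples, attributes):
    """Placeholder chooser (an assumption): take the first unused attribute."""
    return attributes[0]


def build(examples, attributes, choose=first_attribute):
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                   # leaf: pure node, or nothing left to test
    best = choose(examples, attributes)
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({x[best] for x, _ in examples}):
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = build(subset, rest, choose)
    return (best, branches)


def classify(tree, instance):
    """Internal nodes are (attribute, branches) tuples; leaves are label strings."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


tree = build(DATA, ATTRIBUTES)
assert all(classify(tree, x) == y for x, y in DATA)   # consistent with the training data
```

Branches are created only for attribute values seen in the data at that node, so an unseen combination of values would raise a `KeyError`; a fuller version could fall back to the majority label.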

Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible (Occam's Razor).
• Finding the minimal decision tree consistent with the data is NP-hard.
• The recursive algorithm is a greedy heuristic search for a simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next attribute to condition on.

Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
    <(A=0, B=0), ->: 50 examples
    <(A=0, B=1), ->: 50 examples
    <(A=1, B=0), ->: 0 examples
    <(A=1, B=1), +>: 100 examples
• What should be the first attribute we select?
• Splitting on A: we get purely labeled nodes.
• Splitting on B: we don't get purely labeled nodes.
• What if we have <(A=1, B=0), ->: 3 examples?
[Figure: the two candidate trees. Splitting on A gives the leaves - (A=0) and + (A=1). Splitting on B gives the leaf - for B=0 and a further test on A for B=1, with leaves - (A=0) and + (A=1).]
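
The purity claims above can be checked by counting labels in each branch. The sketch below uses the example counts from the slide; the function names and the data encoding are hypothetical.

```python
from collections import Counter

# (A, B, label) -> number of examples, as listed above.
COUNTS = {(0, 0, "-"): 50, (0, 1, "-"): 50, (1, 0, "-"): 0, (1, 1, "+"): 100}


def split_counts(counts, attribute_index):
    """Label counts in each branch after splitting on attribute A (0) or B (1)."""
    branches = {}
    for example, n in counts.items():
        value, label = example[attribute_index], example[2]
        branches.setdefault(value, Counter())[label] += n
    return branches


def is_pure(label_counts):
    """A branch is pure when at most one label has a nonzero count."""
    return sum(1 for n in label_counts.values() if n > 0) <= 1


for name, index in (("A", 0), ("B", 1)):
    branches = split_counts(COUNTS, index)
    verdict = "pure" if all(is_pure(c) for c in branches.values()) else "not pure"
    print(name, {v: dict(c) for v, c in branches.items()}, verdict)

# The "what if" variant: three negative examples with A=1, B=0.
noisy = {**COUNTS, (1, 0, "-"): 3}
print(all(is_pure(c) for c in split_counts(noisy, 0).values()))   # False
```

With the three extra examples neither split is pure any more, which is why a graded measure of impurity is needed rather than a yes/no purity test.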

Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible (Occam's Razor).
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with the ID3 system of Quinlan.

Entropy
• Entropy (impurity, disorder) of a set of examples, S, relative to a binary classification is:
$$\mathrm{Entropy}(S) = -p_+ \log(p_+) - p_- \log(p_-)$$
where $p_+$ is the proportion of positive examples in S and $p_-$ is the proportion of negatives. Logarithms are base 2, so entropy is measured in bits.
• If all the examples belong to the same category: Entropy = 0.
• If the examples are equally mixed (0.5, 0.5): Entropy = 1.
• In general, when $p_i$ is the fraction of examples labeled i:
$$\mathrm{Entropy}(\{p_1, p_2, \ldots, p_k\}) = -\sum_{i=1}^{k} p_i \log(p_i)$$
• Entropy can be viewed as the number of bits required, on average, to encode the class of labels. If the probability for + is 0.5, a single bit is required for each example; if it is 0.8, fewer than 1 bit is needed on average.
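
A short numerical check of these statements, assuming base-2 logarithms and the usual convention that a term with p = 0 contributes 0. The function name `entropy` and its list-of-proportions interface are choices made for this sketch.

```python
from math import log2


def entropy(proportions):
    """Entropy in bits of a label distribution; terms with p = 0 contribute 0."""
    return sum(-p * log2(p) for p in proportions if p > 0)


print(entropy([1.0, 0.0]))             # all examples in one category: 0.0
print(entropy([0.5, 0.5]))             # equally mixed: 1.0
print(round(entropy([0.8, 0.2]), 3))   # skewed: 0.722, less than one bit
print(entropy([0.25] * 4))             # four equally likely labels: 2.0
```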

Entropy
[Figure: plot of Entropy(S) against the proportion of positive examples $p_+$ (equivalently, of negative examples $p_-$). The entropy is 0 when all the examples belong to the same category and 1 when the examples are equally mixed (0.5, 0.5).]

Information Gain
• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:
$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
where $S_v$ is the subset of S for which attribute a has value v, and the entropy of partitioning the data is calculated by weighing the entropy of each partition by its size relative to the original set.
• Partitions of low entropy lead to high gain.
• Go back to check which of the A, B splits is better.
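
The check suggested in the last bullet can be carried out numerically. The sketch below computes the gain of splitting on A and on B for the earlier two-attribute example; the encoding of the data as (attributes, label, count) triples and the function names are assumptions made for this illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of label counts."""
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)


def gain(examples, attribute):
    """Gain(S, a) for examples given as (attribute_dict, label, count) triples."""
    def label_counts(subset):
        totals = {}
        for _, label, n in subset:
            totals[label] = totals.get(label, 0) + n
        return list(totals.values())

    total = sum(n for _, _, n in examples)
    remainder = 0.0
    for value in {x[attribute] for x, _, _ in examples}:
        subset = [e for e in examples if e[0][attribute] == value]
        size = sum(n for _, _, n in subset)
        if size:                                  # skip empty partitions
            remainder += size / total * entropy(label_counts(subset))
    return entropy(label_counts(examples)) - remainder


S = [({"A": 0, "B": 0}, "-", 50), ({"A": 0, "B": 1}, "-", 50),
     ({"A": 1, "B": 0}, "-", 0), ({"A": 1, "B": 1}, "+", 100)]
print(round(gain(S, "A"), 3), round(gain(S, "B"), 3))   # 1.0 0.311

# Variant with three negative examples at (A=1, B=0).
S3 = S[:2] + [({"A": 1, "B": 0}, "-", 3)] + S[3:]
print(gain(S3, "A") > gain(S3, "B"))                    # True
```

Splitting on A removes all uncertainty in the original data (gain of 1 bit), while splitting on B leaves a mixed partition of 150 examples. In the variant with three extra negatives, A is no longer a pure split but still has the larger gain.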