Decision Trees for Linear Function Separation in Machine Learning - Prof. Dan Roth, Study notes of Computer Science

University of Illinois - Urbana-Champaign Computer Science

Prof. Dan Roth

An introduction to decision trees as a method for finding a linear function that best separates data. Topics covered include the basics of decision trees, their use as a non-parametric classification and regression method, and the process of representing data and learning the algorithm. Entropy and information gain are also discussed as they relate to decision trees.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

koofers-user-miy-2 🇺🇸


Decision Trees CS446 Fall 08

What Did We Learn?

• Learning problem: Find a function that best separates the data
• What function?
• What's best?
• How to find it?
• A possibility: Define the learning problem to be: Find a (linear) function that best separates the data

Linear: X = data representation; w = the classifier

$$Y = \mathrm{sgn}\{X^{T} w\}$$

Homework

Registration


Introduction - Summary

• We introduced the technical part of the class by giving two examples of (very different) approaches to linear discrimination.
• There are many other solutions.
• Question 1: Our solution assumed that the target functions are linear. Can we learn a function that is more flexible in terms of what it does with the feature space?
• Question 2: Can we say something about the quality of what we learn (sample complexity, time complexity, quality)?

Decision Trees

• A hierarchical data structure that represents data by implementing a divide-and-conquer strategy.
• Can be used as a non-parametric classification and regression method.
• Given a collection of examples, learn a decision tree that represents it.
• Use this representation to classify new examples.

[Figure: categories labeled A, B, C]

Decision Trees: The Representation

• Decision trees are classifiers for instances represented as feature vectors, e.g. (color= ; shape= ; label= ).
• Nodes are tests for feature values.
• There is one branch for each value of the feature.
• Leaves specify the categories (labels).
• Can categorize instances into multiple disjoint categories.

[Figure: a decision tree with a Color test (branches blue, red, green), Shape tests (branches triangle, circle, square), and leaves labeled A, B, C. Two callouts: "Evaluation of a Decision Tree", illustrated with the instance (color=RED; shape=triangle), and "Learning a Decision Tree".]
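Evaluating a tree of this kind amounts to following one branch per test until a leaf is reached. The sketch below illustrates that idea only: the exact branch layout of the figure is not recoverable from these notes, so the tree literal, the tuple encoding, and the function name `classify` are all assumptions.

```python
# Hypothetical tree: the branch layout is assumed for illustration only.
# Internal nodes are (feature, {value: subtree}) pairs; leaves are label strings.
tree = ("color", {
    "blue": ("shape", {"circle": "A", "square": "B"}),
    "red": "B",
    "green": ("shape", {"triangle": "B", "circle": "A", "square": "C"}),
})


def classify(node, instance):
    """Follow one branch per feature test until a leaf (a label) is reached."""
    while not isinstance(node, str):
        feature, branches = node
        node = branches[instance[feature]]
    return node


# In this assumed tree the red branch is already a leaf, so shape is never tested.
print(classify(tree, {"color": "red", "shape": "triangle"}))
```

The cost of classifying one instance is bounded by the depth of the tree, not by the number of training examples.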

Decision Trees: Decision Boundaries

• Usually, instances are represented as attribute-value pairs, e.g. (color=blue, shape=square, +).
• Numerical values can be used either by discretizing or by using thresholds for splitting nodes.
• In this case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels.

[Figure: a tree of threshold tests (X<3, Y>7, a second test on Y, and X<1) with yes/no branches and +/- leaves, shown next to the matching partition of the X-Y plane (ticks at 1 and 3 on the X axis, 7 and 5 on the Y axis) into rectangles labeled + or -.]
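To see why threshold tests produce axis-parallel rectangles, it may help to write a small threshold tree as nested comparisons. The thresholds below echo the tick values in the figure, but the tree shape and the leaf labels are assumptions made for illustration, not a transcription of the slide.

```python
def classify(x, y):
    """Hypothetical threshold tree; each root-to-leaf path bounds one rectangle."""
    if x < 3:
        # Left half-plane, cut horizontally at y = 7.
        return "+" if y > 7 else "-"
    # Right half-plane, cut horizontally at y = 5.
    return "+" if y < 5 else "-"


# Every point inside the same rectangle receives the same label.
print(classify(1, 8), classify(2, 9))
print(classify(4, 6), classify(10, 7))
```

Each test compares a single coordinate with a constant, so every boundary is a line parallel to one of the axes.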

Decision Trees

• Can represent any Boolean function.
• Can be viewed as a way to compactly represent a lot of data.
• Advantage: non-metric data.
• Natural representation (20 questions).
• The evaluation of the decision tree classifier is easy.
• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.

[Figure: decision tree with Outlook at the root. Sunny leads to a Humidity test (High: No, Normal: Yes), Overcast leads to Yes, Rain leads to a Wind test (Strong: No, Weak: Yes).]

Learning Decision Trees

• Output is a discrete category. Real-valued outputs are possible (regression trees).
• There are efficient algorithms for processing large amounts of data (but not too many features).
• There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.

[Figure: decision tree with Color and Shape tests; branches blue, red, green and triangle, circle, square.]

Basic Decision Trees Learning Algorithm

• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top-down.

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

[Figure: decision tree with Outlook at the root. Sunny leads to a Humidity test (High: No, Normal: Yes), Overcast leads to Yes, Rain leads to a Wind test (Strong: No, Weak: Yes).]
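The top-down recursion can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: it assumes information gain (the heuristic these notes introduce later) for choosing the attribute, it takes the labels that are missing from the extracted table to be No as in the standard PlayTennis data set, and the names `COLUMNS`, `ROWS`, `entropy`, `gain` and `build` are hypothetical.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind"]
# PlayTennis rows; the last field is the label (missing labels assumed "No").
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Entropy in bits of a list of labels."""
    total = len(labels)
    return sum(-n / total * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, col):
    """Reduction in label entropy obtained by partitioning rows on column col."""
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder


def build(rows, cols):
    """Greedy top-down construction; returns a label or {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:          # pure node: stop with a leaf
        return labels[0]
    if not cols:                       # no tests left: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))
    rest = [c for c in cols if c != best]
    return {COLUMNS[best]: {
        value: build([r for r in rows if r[best] == value], rest)
        for value in sorted(set(r[best] for r in rows))
    }}


tree = build(ROWS, list(range(len(COLUMNS))))
print(tree)
```

On this data the sketch places Outlook at the root, with Humidity under Sunny and Wind under Rain, which matches the tree drawn in the notes.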

Picking the Root Attribute

• The goal is to have the resulting decision tree as small as possible (Occam's Razor).
• Finding the minimal decision tree consistent with the data is NP-hard.
• The recursive algorithm is a greedy heuristic search for a simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next attribute to condition on.

Picking the Root Attribute

• Consider data with two Boolean attributes (A, B).

    < (A=0,B=0), - >: 50 examples
    < (A=0,B=1), - >: 50 examples
    < (A=1,B=0), - >: 0 examples
    < (A=1,B=1), + >: 100 examples

• What should be the first attribute we select?
• Splitting on A: we get purely labeled nodes.
  [Figure: tree with root A; branch 0 leads to -, branch 1 leads to +.]
• Splitting on B: we don't get purely labeled nodes.
  [Figure: tree with root B; branch 0 leads to -, branch 1 leads to a test on A (0: -, 1: +).]
• What if we have: < (A=1,B=0), - >: 3 examples?
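The purity claim can be checked by counting labels on each side of a candidate split. The sketch below uses the example counts given above; the row layout and the helper name `split_counts` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# (A, B, label, number of examples) rows from the example above.
data = [(0, 0, "-", 50), (0, 1, "-", 50), (1, 0, "-", 0), (1, 1, "+", 100)]


def split_counts(rows, attr_index):
    """Return {attribute value: {label: count}} for one candidate split."""
    groups = defaultdict(Counter)
    for row in rows:
        label, count = row[2], row[3]
        if count:  # skip empty cells so they do not show up as zero counts
            groups[row[attr_index]][label] += count
    return {value: dict(counter) for value, counter in groups.items()}


print("split on A:", split_counts(data, 0))
print("split on B:", split_counts(data, 1))
```

Splitting on A leaves a single label on each branch, while the B=1 branch still mixes 50 negative and 100 positive examples and needs a further test.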

Picking the Root Attribute

• The goal is to have the resulting decision tree as small as possible (Occam's Razor).
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with the ID3 system of Quinlan.

Entropy

• Entropy (impurity, disorder) of a set of examples, S, relative to a binary classification is:

$$\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$

  where $p_+$ is the proportion of positive examples in S and $p_-$ is the proportion of negatives.
• If all the examples belong to the same category: Entropy = 0
• If the examples are equally mixed (0.5, 0.5): Entropy = 1
• In general, when $p_i$ is the fraction of examples labeled i:

$$\mathrm{Entropy}(\{p_1, p_2, \ldots, p_k\}) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

• Entropy can be viewed as the number of bits required, on average, to encode the class of labels. If the probability for + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit.

[Figure: plot of Entropy for a binary classification against the class proportions $p_+$ and $p_-$.]
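The two boundary cases and the 0.8 example can be verified numerically. This is a minimal sketch that assumes base-2 logarithms (consistent with Entropy = 1 for an even split) and the usual convention that a zero proportion contributes nothing; the function name `entropy` is an assumption.

```python
import math


def entropy(proportions):
    """Entropy in bits of a label distribution; terms with p = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


print(entropy([1.0, 0.0]))   # all examples in one category
print(entropy([0.5, 0.5]))   # equally mixed
print(entropy([0.8, 0.2]))   # skewed: fewer than 1 bit per example
```

The same function covers the general k-label case, since it simply sums over whatever proportions it is given.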

Information Gain

• The information gain of an attribute a, Gain(S, a), is the expected reduction in entropy caused by partitioning on this attribute. Here $S_v$ is the subset of S for which attribute a has value v, and the entropy of partitioning the data is calculated by weighing the entropy of each partition by its size relative to the original set.
• Partitions of low entropy lead to high gain.


$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v)$$

Go back to check which of the A, B splits is better.
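The check suggested above can be carried out directly from the example counts. The sketch below is one way to do it, assuming base-2 logarithms; the helper names and the (negatives, positives) encoding of each partition are assumptions made for illustration.

```python
import math


def entropy(counts):
    """Entropy in bits of a tuple of label counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)


def gain(partitions):
    """partitions: one (negatives, positives) count pair per attribute value."""
    totals = [sum(column) for column in zip(*partitions)]
    size = sum(totals)
    remainder = sum(sum(p) / size * entropy(p) for p in partitions if sum(p) > 0)
    return entropy(totals) - remainder


# Original example: 100 negative and 100 positive examples in total.
gain_a = gain([(100, 0), (0, 100)])   # A=0, A=1
gain_b = gain([(50, 0), (50, 100)])   # B=0, B=1

# Variant with 3 negative examples at (A=1, B=0).
gain_a3 = gain([(100, 0), (3, 100)])
gain_b3 = gain([(53, 0), (50, 100)])

print(gain_a, gain_b)
print(gain_a3, gain_b3)
```

Splitting on A removes all of the entropy in the original data (gain 1), while splitting on B gains only about 0.31 bits. With the three extra examples A no longer gives pure nodes, but its gain remains well above that of B.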