Decision Trees for Linear Function Separation in Machine Learning - Prof. Dan Roth (study notes, Computer Science)
Decision Trees
CS446 Fall 08
What Did We Learn?
• Learning problem: find a function that best separates the data.
  • What function?
  • What's best?
  • How to find it?
• A possibility: define the learning problem to be: find a (linear) function that best separates the data.
• Linear: X = data representation; w = the classifier; $Y = \mathrm{sgn}\{w^T X\}$
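As a reminder of what the linear rule Y = sgn{...} computes, here is a minimal sketch. The weights, the example points, and the convention that a score of exactly zero maps to +1 are all assumptions made for illustration; they are not taken from the course.

```python
def sgn(z):
    """Sign function; a score of exactly zero maps to +1 (assumed convention)."""
    return 1 if z >= 0 else -1


def predict(w, x):
    """Predict a label in {-1, +1} as the sign of the dot product w . x."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical weight vector and examples, chosen only for illustration.
w = [1.0, -2.0]
examples = [[3.0, 1.0], [1.0, 3.0]]
predictions = [predict(w, x) for x in examples]
```

The first point scores 3 - 2 = 1 and the second scores 1 - 6 = -5, so they fall on opposite sides of the separating line.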
Homework; Registration
Introduction - Summary
• We introduced the technical part of the class by giving two examples of (very different) approaches to linear discrimination.
• There are many other solutions.
• Question 1: Our solution assumed that the target functions are linear. Can we learn a function that is more flexible in terms of what it does with the feature space?
• Question 2: Can we say something about the quality of what we learn (sample complexity, time complexity, quality)?
Decision Trees
• A hierarchical data structure that represents data by implementing a divide and conquer strategy.
• Can be used as a non-parametric classification and regression method.
• Given a collection of examples, learn a decision tree that represents it.
• Use this representation to classify new examples.
Decision Trees: The Representation
• Decision Trees are classifiers for instances represented as feature vectors, e.g. (color= ; shape= ; label= ).
• Nodes are tests for feature values.
• There is one branch for each value of the feature.
• Leaves specify the categories (labels).
• Can categorize instances into multiple disjoint categories.
[Figure: example decision tree. The root tests Color (blue, red, green); inner nodes test Shape (triangle, circle, square); leaves carry the labels A, B, C. Annotations: "Learning a Decision Tree" and "Evaluation of a Decision Tree" on the instance (color=RED; shape=triangle).]
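Evaluating a tree amounts to following one branch per feature test until a leaf is reached. The sketch below illustrates this with a hypothetical tree: the exact shape of the slide's tree is not recoverable from these notes, so the branch-to-label assignments here are assumptions, as are the names `tree` and `classify`.

```python
# Hypothetical tree (NOT the slide's exact tree): an internal node is a pair
# (feature, {value: subtree}); a leaf is a plain label string.
tree = ("color", {
    "blue": ("shape", {"triangle": "A", "circle": "B", "square": "C"}),
    "red": "B",
    "green": ("shape", {"circle": "A", "square": "B"}),
})


def classify(node, instance):
    """Follow the branch matching each tested feature until a leaf is reached."""
    while not isinstance(node, str):
        feature, branches = node
        node = branches[instance[feature]]
    return node


label = classify(tree, {"color": "blue", "shape": "triangle"})
```

Note that a feature is consulted only if it lies on the path taken: the red branch above never looks at shape.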
Decision Trees: Decision Boundaries
• Usually, instances are represented as attribute-value pairs (color=blue, shape=square, +).
• Numerical values can be used either by discretizing or by using thresholds for splitting nodes.
• In this case, the tree divides the feature space into axis-parallel rectangles, each labeled with one of the labels.
[Figure: a tree of threshold tests on X and Y (including X<3, Y>7 and X<1) with yes/no branches, next to the matching partition of the (X, Y) plane into axis-parallel rectangles labeled + and -. Axis ticks: X at 1 and 3; Y at 5 and 7.]
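A threshold tree can be read as nested comparisons, each one cutting the plane with an axis-parallel line. The sketch below reuses the tick values from the figure (X at 1 and 3, Y at 5 and 7), but the arrangement of the tests and the label of each region are assumptions, since the figure's layout did not survive extraction.

```python
def classify(x, y):
    """Hypothetical threshold tree; each return corresponds to one rectangle."""
    if x < 3:
        if x < 1:
            return "+"                    # strip x < 1
        return "-" if y < 5 else "+"      # strip 1 <= x < 3, cut at y = 5
    return "+" if y > 7 else "-"          # half-plane x >= 3, cut at y = 7


# One sample point per rectangle.
regions = [classify(0, 0), classify(2, 4), classify(2, 6),
           classify(4, 8), classify(4, 2)]
```

Every test compares a single coordinate with a constant, which is why the resulting regions are always rectangles aligned with the axes.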
Decision Trees
• Can represent any Boolean function.
• Can be viewed as a way to compactly represent a lot of data.
• Advantage: non-metric data.
• Natural representation (20 questions).
• The evaluation of the Decision Tree Classifier is easy.
• Clearly, given data, there are many ways to represent it as a decision tree.
• Learning a good representation from data is the challenge.
[Figure: decision tree with Outlook at the root. Sunny leads to a Humidity test (High: No; Normal: Yes); Overcast leads to Yes; Rain leads to a Wind test (Strong: No; Weak: Yes).]
Learning Decision Trees
• Output is a discrete category. Real-valued outputs are possible (regression trees).
• There are efficient algorithms for processing large amounts of data (but not too many features).
• There are methods for handling noisy data (classification noise and attribute noise) and for handling missing attribute values.
[Figure: example decision tree testing Color (blue, red, green) and Shape (triangle, circle, square).]
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top-down.

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

[Figure: decision tree with Outlook at the root. Sunny leads to a Humidity test (High: No; Normal: Yes); Overcast leads to Yes; Rain leads to a Wind test (Strong: No; Weak: Yes).]
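The top-down recursion can be sketched on the PlayTennis table as follows. This is a minimal illustration, not the course's reference implementation: it picks each split by information gain (the heuristic these notes attribute to ID3), falls back to the majority label when no attributes remain, and the names `build`, `gain`, and `entropy` are chosen here for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
# Each example is (attribute dict, label), in the order of days 1..14.
DATA = [(dict(zip(ATTRS, r.split()[:4])), r.split()[4]) for r in ROWS.splitlines()]


def entropy(examples):
    """Entropy (in bits) of the label distribution of a list of examples."""
    counts = Counter(label for _, label in examples)
    n = len(examples)
    return sum(-c / n * log2(c / n) for c in counts.values())


def gain(examples, attr):
    """Expected reduction in entropy from partitioning on attr."""
    n = len(examples)
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder


def build(examples, attrs):
    """Greedy top-down construction; returns a label or (attr, {value: subtree})."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # pure node: make it a leaf
        return labels[0]
    if not attrs:                        # nothing left to test: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))
    rest = [a for a in attrs if a != best]
    values = sorted({x[best] for x, _ in examples})
    return (best, {v: build([(x, y) for x, y in examples if x[best] == v], rest)
                   for v in values})


tree = build(DATA, ATTRS)
```

On this table the sketch places Outlook at the root, then tests Humidity under Sunny and Wind under Rain, while Overcast is already pure.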
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible (Occam's Razor).
• Finding the minimal decision tree consistent with the data is NP-hard.
• The recursive algorithm is a greedy heuristic search for a simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next attribute to condition on.
Picking the Root Attribute
• Consider data with two Boolean attributes (A, B):
    <(A=0,B=0), ->: 50 examples
    <(A=0,B=1), ->: 50 examples
    <(A=1,B=0), ->: 0 examples
    <(A=1,B=1), +>: 100 examples
• What should be the first attribute we select?
• Splitting on A: we get purely labeled nodes.
• Splitting on B: we don't get purely labeled nodes.
• What if we have <(A=1,B=0), ->: 3 examples?
[Figure: two candidate trees. Splitting on A first: A=0 gives -, A=1 gives +. Splitting on B first: B=0 gives -, B=1 leads to a test on A (A=0 gives -, A=1 gives +).]
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as possible (Occam's Razor).
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with the ID3 system of Quinlan.
Entropy
• Entropy (impurity, disorder) of a set of examples, S, relative to a binary classification is:
  $$Entropy(S) = -p_+ \log(p_+) - p_- \log(p_-)$$
  where $p_+$ is the proportion of positive examples in S and $p_-$ is the proportion of negatives (logarithms are base 2).
• If all the examples belong to the same category: Entropy = 0
• If the examples are equally mixed (0.5, 0.5): Entropy = 1
• In general, when $p_i$ is the fraction of examples labeled i:
  $$Entropy(\{p_1, p_2, \ldots, p_k\}) = -\sum_{i=1}^{k} p_i \log(p_i)$$
Entropy can be viewed as the number of bits required, on average, to encode the class of labels. If the probability for + is 0.5, a single bit is required for each example; if it is 0.8, we can use less than 1 bit.
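The entropy cases discussed here (a pure set, an evenly mixed set, and a 0.8 / 0.2 split) can be checked numerically. This is a small sketch that assumes base-2 logarithms and treats 0 log 0 as 0; the function name `entropy` is chosen here for illustration.

```python
from math import log2


def entropy(proportions):
    """Entropy in bits of a label distribution; terms with p = 0 contribute 0."""
    return sum(-p * log2(p) for p in proportions if p > 0)


pure = entropy([1.0, 0.0])     # all examples in one category
mixed = entropy([0.5, 0.5])    # equally mixed: one full bit per example
skewed = entropy([0.8, 0.2])   # mostly positive: less than one bit per example
```

The same function handles the general k-label case, since it only sums over whatever proportions it is given.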
[Figure: plot of Entropy(S) against the class proportions $p_+$ and $p_-$, with axis ticks at 1.]
Information Gain
• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:
  $$Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|} Entropy(S_v)$$
  where $S_v$ is the subset of S for which attribute a has value v, and the entropy of partitioning the data is calculated by weighing the entropy of each partition by its size relative to the original set.
• Partitions of low entropy lead to high gain.
• Go back to check which of the A, B splits is better.
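To check which of the A, B splits is better, the gain can be computed directly from the example counts given earlier in these notes. This is a minimal sketch assuming base-2 logarithms; the helper names and the (positives, negatives) count encoding are chosen here for illustration.

```python
from math import log2


def entropy(pos, neg):
    """Binary entropy in bits from positive/negative counts (0 log 0 = 0)."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)


def gain(parts):
    """parts: one (positives, negatives) count pair per attribute value."""
    pos = sum(p for p, _ in parts)
    neg = sum(n for _, n in parts)
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n)
                    for p, n in parts if p + n > 0)
    return entropy(pos, neg) - remainder


# Original data: 50 + 50 negatives with A=0, 100 positives with A=1, B=1.
gain_a = gain([(0, 100), (100, 0)])    # A=0: 100 negatives; A=1: 100 positives
gain_b = gain([(0, 50), (100, 50)])    # B=0: 50 negatives; B=1: 100 +, 50 -

# Variant with 3 additional examples <(A=1,B=0), ->.
gain_a3 = gain([(0, 100), (100, 3)])
gain_b3 = gain([(0, 53), (100, 50)])
```

In both settings the gain for A exceeds the gain for B. With the 3 extra negatives the A=1 branch is no longer pure, but it is still far purer than the B=1 branch, so A remains the better root.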