Decision Trees Approach to Machine Learning: An Overview of ID3 Algorithm and Applications, Slides of Robotics

An overview of decision trees as a machine learning approach, focusing on the ID3 algorithm. It discusses the practical inductive inference method, its goal, and its advantages, such as robustness to noise and ease of interpretation. The document also covers the ID3 algorithm's process, including the search through hypothesis space and the use of entropy and information gain. Applications of decision trees, including equipment diagnosis, medical diagnosis, and pattern recognition, are also mentioned.

Machine Learning Approach Based on Decision Trees

Decision Tree Learning
• Practical inductive inference method
• Same goal as the Candidate-Elimination algorithm: find a Boolean function of the attributes
  • Decision trees can be extended to functions with more than two output values
• Widely used
• Robust to noise
• Can handle disjunctive (OR) expressions
• Completely expressive hypothesis space
• Easily interpretable (tree structure, if-then rules)

The tree itself forms the hypothesis
• A disjunction (ORs) of conjunctions (ANDs)
  • Each path from root to leaf forms a conjunction of constraints on attributes
  • Separate branches are disjunctions
• Example from the PlayTennis decision tree:
  (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)

Types of problems decision tree learning is good for
• Instances represented by attribute-value pairs
  • For the algorithm in the book, attributes take on a small number of discrete values
  • Can be extended to real-valued (numerical) attributes
• Target function has discrete output values
  • The algorithm in the book assumes Boolean functions
  • Can be extended to multiple output values
• Hypothesis space can include disjunctive expressions
  • In fact, the hypothesis space is the complete space of finite discrete-valued functions
• Robust to imperfect training data
  • classification errors
  • errors in attribute values
  • missing attribute values
• Example applications:
  • equipment diagnosis
  • medical diagnosis
  • credit card risk analysis
  • robot movement
  • pattern recognition (face recognition, hexapod walking gaits)

What is a greedy search?
• At each step, make the decision that gives the greatest improvement in whatever you are trying to optimize
• Do not backtrack (unless you hit a dead end)
• This type of search is unlikely to find a globally optimal solution, but it generally works well
• What are we really doing here?
  • At each node of the tree, decide which attribute best classifies the training data at that point
  • Never backtrack (in ID3)
  • Do this for each branch of the tree
  • The end result is a tree structure representing the hypothesis that works best for the training data

Information Theory Background
• If there are n equally probable possible messages, then the probability p of each is 1/n
• The information conveyed by a message is -log(p) = log(n)
  • E.g., if there are 16 messages, then log2(16) = 4 and we need 4 bits to identify/send each message
• In general, if we are given a probability distribution P = (p1, p2, ..., pn), the information conveyed by the distribution (a.k.a. the entropy of P) is:
  I(P) = -(p1*log(p1) + p2*log(p2) + ... + pn*log(pn))

Question: how do you determine which attribute best classifies the data? Answer: entropy!
• Information gain: a statistical quantity measuring how well an attribute classifies the data
  • Calculate the information gain for each attribute
  • Choose the attribute with the greatest information gain

But how do you measure information?
• Claude Shannon, in 1948 at Bell Labs, established the field of information theory
• A mathematical function, entropy, measures the information content of a random process:
  • it takes on its largest value when events are equiprobable
  • it takes on its smallest value when only one event has non-zero probability
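To make the entropy measure just introduced concrete, here is a minimal Python sketch (the function name `entropy` and the surrounding helper code are my own, not from the slides) that computes I(P) for a discrete distribution:

```python
import math

def entropy(probs):
    """I(P) = -(p1*log2(p1) + ... + pn*log2(pn)), in bits.
    Terms with p == 0 are skipped: events that never occur carry no information."""
    total = 0.0
    for p in probs:
        if p > 0:
            total -= p * math.log2(p)
    return total

# 16 equally probable messages -> log2(16) = 4 bits per message
print(entropy([1 / 16] * 16))   # 4.0

# A fair coin carries 1 bit; a certain event carries no information
print(entropy([0.5, 0.5]))      # 1.0
print(entropy([1.0, 0.0]))      # 0.0
```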
Entropy for two states
• For positive and negative examples from a set S:
  H(S) = -p+ log2(p+) - p- log2(p-)
• Entropy is a measure of the disorder of the set S

In general
• For an ensemble of random events {A1, A2, ..., An} occurring with probabilities {P(A1), P(A2), ..., P(An)}:
  H = -Σ(i=1..n) P(Ai) log2(P(Ai))
  (Note: Σ(i=1..n) P(Ai) = 1 and 0 ≤ P(Ai) ≤ 1)
• If you consider the self-information of event i to be -log2(P(Ai)), then entropy is the weighted average of the information carried by each event
• Does this make sense?

• If an event conveys information, that means it is a surprise
• If an event always occurs, P(Ai) = 1, it carries no information: -log2(1) = 0
• If an event rarely occurs (e.g. P(Ai) = 0.001), it carries a lot of information: -log2(0.001) ≈ 9.97
• The less likely the event, the more information it carries, since for 0 ≤ P(Ai) ≤ 1, -log2(P(Ai)) increases as P(Ai) goes from 1 to 0
  (Note: ignore events with P(Ai) = 0, since they never occur)
• Does this make sense?

• The choice of base-2 log corresponds to the choice of units of information (bits)
• Another remarkable thing: this is the same definition of entropy used in statistical mechanics as the measure of disorder
  • It corresponds to the macroscopic thermodynamic quantity of the Second Law of Thermodynamics

• The concept of a quantitative measure of information content plays an important role in many areas, for example:
  • data communications (channel capacity)
  • data compression (limits on error-free encoding)
• The entropy of a message corresponds to the minimum number of bits needed to encode that message
• In our case, for a set of training data, the entropy measures the number of bits needed to encode the classification of an instance
  • Use probabilities found from the entire set of training data:
    Prob(Class=Pos) = number of positive cases / total number of cases
    Prob(Class=Neg) = number of negative cases / total number of cases
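For the two-class case, a small companion sketch (again with my own function name, not from the slides) computes H(S) directly from the counts of positive and negative examples, using the class probabilities described above:

```python
import math

def two_class_entropy(n_pos, n_neg):
    """H(S) = -p+*log2(p+) - p-*log2(p-) for a set with n_pos positive
    and n_neg negative examples; a term is 0 when its probability is 0."""
    total = n_pos + n_neg
    entropy = 0.0
    for count in (n_pos, n_neg):
        p = count / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

print(two_class_entropy(7, 7))                # 1.0  (perfectly mixed set)
print(two_class_entropy(14, 0))               # 0.0  (pure set)
print(round(two_class_entropy(9, 5), 3))      # 0.94, the PlayTennis set used below
```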
(Back to the story of ID3)
• Information gain is our metric for how well an attribute Ai classifies the training data
• Information gain for a particular attribute = information about the target function, given the value of that attribute (a conditional entropy)
• Mathematical expression for information gain:
  Gain(S, Ai) = H(S) - Σ(v ∈ Values(Ai)) P(Ai=v) H(Sv)
  i.e. the entropy of S minus the weighted average of the entropies H(Sv) of the subsets Sv in which Ai takes the value v

Example: PlayTennis
• Four attributes used for classification:
  • Outlook = {Sunny, Overcast, Rain}
  • Temperature = {Hot, Mild, Cool}
  • Humidity = {High, Normal}
  • Wind = {Weak, Strong}
• One predicted (target) attribute (binary):
  • PlayTennis = {Yes, No}
• Given 14 training examples:
  • 9 positive
  • 5 negative

Training Examples (also called examples, minterms, cases, objects, test cases)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

• Step 1: calculate the entropy for all cases:
  NPos = 9, NNeg = 5, NTot = 14
  H(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.940

• Attribute = Humidity (repeat the process, looping over {High, Normal}): Gain(S, Humidity) = 0.151
• Attribute = Wind (repeat the process, looping over {Weak, Strong}): Gain(S, Wind) = 0.048
• Find the attribute with the greatest information gain:
  Gain(S, Outlook) = 0.246, Gain(S, Temperature) = 0.029,
  Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048
  → Outlook is the root node of the tree
  (a code sketch reproducing these gains appears after these slides)

• Iterate the algorithm to find the attributes that best classify the training examples under the values of the root node
• Example continued: take three subsets:
  • Outlook = Sunny (NTot = 5)
  • Outlook = Overcast (NTot = 4)
  • Outlook = Rain (NTot = 5)
• For each subset, repeat the above calculation, looping over all attributes other than Outlook

• For example, Outlook = Sunny (NPos = 2, NNeg = 3, NTot = 5), H = 0.971:
  • Temp = Hot (NPos = 0, NNeg = 2, NTot = 2): H = 0.0
  • Temp = Mild (NPos = 1, NNeg = 1, NTot = 2): H = 1.0
  • Temp = Cool (NPos = 1, NNeg = 0, NTot = 1): H = 0.0
  Gain(S_Sunny, Temperature) = 0.971 - (2/5)*0 - (2/5)*1 - (1/5)*0 = 0.571
• Similarly:
  Gain(S_Sunny, Humidity) = 0.971
  Gain(S_Sunny, Wind) = 0.020
  → Humidity classifies the Outlook=Sunny instances best and is placed as the node under the Sunny branch
• Repeat this process for Outlook = Overcast and Outlook = Rain

• Note: in this example the data were perfect
  • no contradictions
  • every branch led to an unambiguous Yes or No decision
  • if there are contradictions, take the majority vote; this handles noisy data
• Another note: attributes are eliminated once they are assigned to a node and are never reconsidered
  • e.g. you would not go back and reconsider Outlook under Humidity
• ID3 uses all of the training data at once
  • in contrast to Candidate-Elimination
  • it can handle noisy data
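The PlayTennis calculations above can be reproduced with a short Python sketch; the data literal is the training table from the slides, while the helper names (`entropy`, `gain`, `DATA`, `ATTRIBUTES`) are my own. The slides round H(S) to 0.940 before subtracting, so the third decimal of the gains can differ slightly here:

```python
import math
from collections import Counter

# PlayTennis examples from the table above: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(examples):
    """H(S) over the class labels (last column) of the examples."""
    counts = Counter(row[-1] for row in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr_index):
    """Gain(S, A) = H(S) - sum over values v of P(A=v) * H(Sv)."""
    total = len(examples)
    remainder = 0.0
    for value in {row[attr_index] for row in examples}:
        subset = [row for row in examples if row[attr_index] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(examples) - remainder

print(round(entropy(DATA), 3))              # 0.94
for name, index in ATTRIBUTES.items():
    print(name, round(gain(DATA, index), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
```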
Another Example: Russell and Norvig's Restaurant Domain
• Develop a decision tree to model the decision a patron makes when deciding whether or not to wait for a table at a restaurant
• Two classes: wait, leave
• Ten attributes: alternative restaurant available?, bar in restaurant?, is it Friday?, are we hungry?, how full is the restaurant?, how expensive?, is it raining?, do we have a reservation?, what type of restaurant is it?, what is the purported waiting time?
• Training set of 12 examples
• ~7000 possible cases

A Training Set
(Table of the 12 example restaurant visits, listing the ten attributes above and the wait/leave outcome for each.)

ID3
• A greedy algorithm for decision tree construction developed by Ross Quinlan, 1987
• Considers a smaller tree to be a better tree
• Top-down construction of the decision tree by recursively selecting the "best attribute" to use at the current node, based on the examples belonging to that node
• Once the attribute is selected for the current node, generate child nodes, one for each possible value of the selected attribute
• Partition the examples of this node using the possible values of this attribute, and assign these subsets of examples to the appropriate child node
• Repeat for each child node until all examples associated with a node are either all positive or all negative
  (a code sketch of this recursion appears after these slides)

Choosing the Best Attribute
• The key problem is choosing which attribute to split a given set of examples on. Some possibilities are:
  • Random: select any attribute at random
  • Least-Values: choose the attribute with the smallest number of possible values (fewer branches)
  • Most-Values: choose the attribute with the largest number of possible values (smaller subsets)
  • Max-Gain: choose the attribute that has the largest expected information gain, i.e. select the attribute that will result in the smallest expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute

Splitting Examples by Testing Attributes
(Figure: the 12 restaurant examples split (a) on Patrons? {None, Some, Full}, (b) on Type? {French, Italian, Thai, Burger}, and (c) on Patrons? followed by Hungry? {Yes, No} under the Full branch.)

Resulting Decision Tree (for the PlayTennis data)
• Root: Outlook (+: D3, D4, D5, D7, D9, D10, D11, D12, D13; -: D1, D2, D6, D8, D14)
  • Sunny (+: D9, D11; -: D1, D2, D8) → Humidity
    • High → No (D1, D2, D8)
    • Normal → Yes (D9, D11)
  • Overcast → Yes (D3, D7, D12, D13)
  • Rain (+: D4, D5, D10; -: D6, D14) → Wind
    • Strong → No (D6, D14)
    • Weak → Yes (D4, D5, D10)

• The entropy is the average number of bits per message needed to represent a stream of messages
• Examples:
  • if P is (0.5, 0.5), then I(P) is 1
  • if P is (0.67, 0.33), then I(P) is 0.92
  • if P is (1, 0), then I(P) is 0
• The more uniform the probability distribution, the greater its entropy
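The recursive top-down construction described in the ID3 slide above can be sketched in a few lines of Python. This is only an illustrative sketch under my own representation choices (examples as dictionaries, the tree as nested dictionaries), not Quinlan's implementation, and it omits real-valued attributes and missing values:

```python
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(examples, attr, target):
    n = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += (len(subset) / n) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Pick the max-gain attribute, split on its values, and recurse until
    a node's examples all share one class (or no attributes remain)."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                     # pure node -> leaf
        return labels[0]
    if not attributes:                            # contradictory data -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]   # never reconsider the chosen attribute
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Tiny usage example with hypothetical weather records:
data = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]
print(id3(data, ["Outlook", "Wind"], "Play"))
# e.g. {'Outlook': {'Sunny': 'No', 'Overcast': 'Yes',
#                   'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}  (branch order may vary)
```

The returned nested dictionary mirrors the tree structure shown earlier: each attribute name maps to one subtree or leaf label per attribute value.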
What is the hypothesis space for decision tree learning?
• Search through the space of all possible decision trees
  • from simple to more complex
  • guided by a heuristic: information gain
• The space searched is the complete space of finite, discrete-valued functions
  • includes disjunctive and conjunctive expressions
• The method only maintains one current hypothesis
  • in contrast to Candidate-Elimination
• Not necessarily a global optimum
  • attributes are eliminated when assigned to a node
  • no backtracking
  • different trees are possible

Extensions of the Decision Tree Learning Algorithm
• Using gain ratios
• Real-valued data
• Noisy data and overfitting
• Generation of rules
• Setting parameters
• Cross-validation for experimental validation of performance
• Incremental learning

Algorithms used
• ID3, Quinlan (1986)
• C4.5, Quinlan (1993)
• C5.0, Quinlan
• Cubist, Quinlan
• CART (Classification And Regression Trees), Breiman (1984)
• ASSISTANT, Kononenko (1984) & Cestnik (1987)
• ID3 is the algorithm discussed in the textbook
  • simple, but representative
  • source code publicly available
  • entropy was used here for the first time
• C4.5 (and C5.0) is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on

Real-valued data
• Select a set of thresholds defining intervals; each interval becomes a discrete value of the attribute
• We can use simple heuristics
  • e.g. always divide into quartiles
• We can use domain knowledge
  • e.g. divide age into infant (0-2), toddler (3-5), and school-aged (5-8)
• Or treat this as another learning problem
  • try a range of ways to discretize the continuous variable
  • find out which yields "better results" with respect to some metric

Pruning Decision Trees
• Pruning of the decision tree is done by replacing a whole subtree by a leaf node
• The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf. E.g.:
  • Training: one red training example was a success and one blue training example was a failure
  • Test: three red failures and one blue success
  • Consider replacing this subtree by a single FAILURE node
  • After replacement we will have only two errors instead of five
(Figure: a Color node with red and blue branches. On the training data: red = 1 success / 0 failures, blue = 0 successes / 1 failure. With the test data added: red = 1 success / 3 failures, blue = 1 success / 1 failure, i.e. 2 successes / 4 failures overall, so the subtree is replaced by a single FAILURE leaf.)

Incremental Learning
• Incremental learning: changes can be made with each training example
  • non-incremental learning is also called batch learning
• Good for
  • adaptive systems (learning while experiencing)
  • environments that undergo changes
• Often comes with
  • higher computational cost
  • lower quality of learning results
• ITI (by U. Mass): an incremental decision tree learning package

Evaluation Methodology
• Standard methodology: cross-validation
  1. Collect a large set of examples (all with correct classifications!)
  2. Randomly divide the collection into two disjoint sets: training and test
  3. Apply the learning algorithm to the training set, giving hypothesis H
  4. Measure the performance of H w.r.t. the test set
• Important: keep the training and test sets disjoint!
• Learning should not minimize the training error (w.r.t. the data) but the error on the test/cross-validation set: a way to fix overfitting
• To study the efficiency and robustness of an algorithm, repeat steps 2-4 for different training sets and sizes of training sets
• If you improve your algorithm, start again with step 1 to avoid evolving the algorithm to work well on just this collection
  (a short code sketch of this train/test loop follows below)
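The train/test procedure in the Evaluation Methodology slide can be sketched as follows; `evaluate`, `majority_learn`, and the example data are my own illustrative names, and any learner (for instance the `id3` sketch above) could be plugged in for the `learn`/`classify` arguments:

```python
import random

def evaluate(learn, classify, examples, train_fraction=0.7, trials=10, seed=0):
    """Repeatedly split the data into disjoint training and test sets (steps 2-4),
    train on one and measure accuracy on the other. Examples are dicts with a
    'class' key; learn/classify stand in for any learning algorithm."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]      # disjoint sets
        hypothesis = learn(train)
        correct = sum(classify(hypothesis, e) == e["class"] for e in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)              # mean test-set accuracy

# Tiny demonstration with a trivial majority-class learner (illustration only):
def majority_learn(train):
    labels = [e["class"] for e in train]
    return max(set(labels), key=labels.count)

def majority_classify(hypothesis, example):
    return hypothesis

data = [{"x": i, "class": "pos" if i % 3 else "neg"} for i in range(30)]
print(evaluate(majority_learn, majority_classify, data))  # mean accuracy over 10 random splits
```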
C4.5
• C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on
• Reference: C4.5: Programs for Machine Learning, J. Ross Quinlan, The Morgan Kaufmann Series in Machine Learning, Pat Langley, Series Editor, 1993. 302 pages, paperback book & 3.5" Sun disk, $77.95, ISBN 1-55860-240-2

Summary of DT Learning
• Inducing decision trees is one of the most widely used learning methods in practice
• Can out-perform human experts in many problems
• Strengths include:
  • fast
  • simple to implement
  • can convert the result to a set of easily interpretable rules (see the rule-extraction sketch at the end)
  • empirically validated in many commercial products
  • handles noisy data
• Weaknesses include:
  • "univariate" splits/partitioning (using only one attribute at a time), which limits the types of possible trees
  • large decision trees may be hard to understand
  • requires fixed-length feature vectors

Summary of ID3 Inductive Bias
• Short trees are preferred over long trees
  • it accepts the first tree it finds
• Information gain heuristic
  • places high-information-gain attributes near the root
  • the greedy search method is an approximation to finding the shortest tree
• Why would short trees be preferred?
  • An example of Occam's Razor: prefer the simplest hypothesis consistent with the data
    (like the Copernican vs. Ptolemaic view of Earth's motion)
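As an illustration of the "convert the result to a set of easily interpretable rules" point, here is a small sketch (my own function, reusing the nested-dictionary tree format from the earlier ID3 sketch) that turns the PlayTennis tree shown above into if-then rules:

```python
def tree_to_rules(tree, conditions=None):
    """Walk a nested-dict decision tree (attribute -> {value: subtree_or_label})
    and yield one 'IF ... THEN ...' rule per root-to-leaf path."""
    conditions = conditions or []
    if not isinstance(tree, dict):                # leaf node: emit the accumulated path
        yield "IF " + " AND ".join(conditions) + " THEN " + str(tree)
        return
    (attribute, branches), = tree.items()         # one attribute test per internal node
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + [f"{attribute}={value}"])

playtennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}
for rule in tree_to_rules(playtennis_tree):
    print(rule)
# IF Outlook=Sunny AND Humidity=High THEN No
# IF Outlook=Sunny AND Humidity=Normal THEN Yes
# IF Outlook=Overcast THEN Yes
# IF Outlook=Rain AND Wind=Strong THEN No
# IF Outlook=Rain AND Wind=Weak THEN Yes
```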