Model Selection and Assessment
CS4780/5780 – Machine Learning, Fall 2014
Thorsten Joachims, Cornell University

Reading: Mitchell, Chapter 5.
Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924. (http://sci2s.ugr.es/keel/pdf/algorithm/articulo/dietterich1998.pdf)

Outline
• Model Selection
  – Controlling overfitting in decision trees
  – Train, validation, test
  – K-fold cross validation
• Evaluation
  – What is the true error of classification rule h?
  – Is rule h1 more accurate than h2?
  – Is learning algorithm A1 better than A2?

Learning as Prediction
[Figure omitted]

Overfitting
[Figure from Mitchell omitted]
• Note: Accuracy = 1.0 - Error [Mitchell]

Controlling Overfitting in Decision Trees
• Early Stopping: Stop growing the tree and introduce a leaf when splitting is no longer “reliable”.
  – Restrict size of tree (e.g., number of nodes, depth)
  – Minimum number of examples in node
  – Threshold on splitting criterion
• Post Pruning: Grow full tree, then simplify.
  – Reduced-error tree pruning
  – Rule post-pruning

Reduced-Error Pruning
[Figure omitted]

Model Selection
• Training: Run the learning algorithm m times (e.g., with different parameters).
• Validation Error: The error ErrSval(ĥi) is an estimate of ErrP(ĥi) for each ĥi.
• Selection: Use the ĥi with minimum ErrSval(ĥi) for prediction on test examples.
[Figure: The real-world process draws the train sample Strain and the test sample Stest i.i.d. Strain is split randomly into Strain’ and the validation sample Sval. Learner 1 … Learner m are trained on Strain’ and produce ĥ1 … ĥk; the figure also shows the selected ĥ and the test sample Stest.]
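The three model selection steps above (train several candidates, estimate each one's error on a validation sample, pick the minimum) can be sketched in Python as follows. This is only an illustrative sketch under assumptions that are not in the slides: the learner is a toy 1-D k-nearest-neighbour rule, the parameter being varied is k, the validation fraction is 30%, and the names `make_toy_data`, `knn_learner`, and `select_model` are hypothetical.

```python
import random


def make_toy_data(n, seed=0):
    """Hypothetical data: x uniform in [0, 1], label +1 if x > 0.5 else -1, 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else -1
        if rng.random() < 0.1:  # flip the label with probability 0.1
            y = -y
        data.append((x, y))
    return data


def error_rate(h, sample):
    """Fraction of examples (x, y) in the sample that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)


def knn_learner(train, k):
    """Toy learner (an assumption): k-nearest-neighbour vote on 1-D inputs."""
    def h(x):
        nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
        return 1 if sum(y for _, y in nearest) >= 0 else -1
    return h


def select_model(s_train, params, val_fraction=0.3, seed=0):
    """Split Strain into Strain' and Sval, train one candidate per parameter,
    and return the candidate with the smallest validation error."""
    rng = random.Random(seed)
    shuffled = list(s_train)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    s_val, s_train_prime = shuffled[:n_val], shuffled[n_val:]

    candidates = {p: knn_learner(s_train_prime, p) for p in params}          # Training
    val_errors = {p: error_rate(h, s_val) for p, h in candidates.items()}    # Validation Error
    best = min(val_errors, key=val_errors.get)                               # Selection
    return best, candidates[best], val_errors


if __name__ == "__main__":
    s_train = make_toy_data(200, seed=1)
    best_k, h_hat, errors = select_model(s_train, params=[1, 3, 5, 9])
    print("validation errors:", errors)
    print("selected k:", best_k)
```

Because the minimum validation error was itself used to pick ĥ, it is an optimistic estimate; the sketch therefore leaves the test sample untouched for the final assessment.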
Text Classification Example: “Corporate Acquisitions” Results
• Unpruned Tree (ID3 Algorithm):
  – Size: 437 nodes, Training Error: 0.0%, Test Error: 11.0%
• Early Stopping Tree (ID3 Algorithm):
  – Size: 299 nodes, Training Error: 2.6%, Test Error: 9.8%
• Reduced-Error Tree Pruning (C4.5 Algorithm):
  – Size: 167 nodes, Training Error: 4.0%, Test Error: 10.8%
• Rule Post-Pruning (C4.5 Algorithm):
  – Size: 164 tests, Training Error: 3.1%, Test Error: 10.3%
  – Examples of rules
    • IF vs = 1 THEN - [99.4%]
    • IF vs = 0 & export = 0 & takeover = 1 THEN + [93.6%]

Evaluating Learned Hypotheses
• Goal: Find h with small prediction error ErrP(h) over P(X,Y).
• Question: How good is ErrP(ĥ) of the ĥ found on training sample Strain?
• Training Error: Error ErrStrain(ĥ) on the training sample.
• Test Error: Error ErrStest(ĥ) is an estimate of ErrP(ĥ).
[Figure: The real-world process draws sample S i.i.d. S is split randomly into training sample Strain = (x1,y1), …, (xn,yn) and test sample Stest = (x1,y1), …, (xk,yk). The learner (incl. model selection) is trained on Strain and outputs ĥ.]

What is the True Error of a Hypothesis?
• Given
  – Sample of labeled instances S
  – Learning Algorithm A
• Setup
  – Partition S randomly into Strain (70%) and Stest (30%).
  – Train learning algorithm A on Strain; the result is ĥ.
  – Apply ĥ to Stest and compare predictions against true labels.
• Test
  – Error on test sample ErrStest(ĥ) is an estimate of true error ErrP(ĥ).
  – Compute confidence interval.
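The setup above (random 70/30 partition, train on Strain, measure the error on Stest) can be sketched in Python as follows. This is a minimal sketch, not the course's implementation: `split_sample`, `majority_learner`, and `sample_error` are hypothetical names, and the majority-label rule is only a placeholder standing in for learning algorithm A.

```python
import random


def split_sample(sample, train_fraction=0.7, seed=0):
    """Randomly partition S into (Strain, Stest)."""
    rng = random.Random(seed)
    shuffled = list(sample)
    rng.shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


def majority_learner(s_train):
    """Placeholder for algorithm A (an assumption): always predict the
    most frequent label seen in the training sample."""
    labels = [y for _, y in s_train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def sample_error(h, sample):
    """Fraction of examples (x, y) on which h(x) differs from the true label y."""
    return sum(h(x) != y for x, y in sample) / len(sample)


if __name__ == "__main__":
    # Hypothetical labeled sample: x is an integer, label is +1 for multiples of 3.
    s = [(i, 1 if i % 3 == 0 else -1) for i in range(100)]
    s_train, s_test = split_sample(s)
    h_hat = majority_learner(s_train)
    print("training error:", sample_error(h_hat, s_train))
    print("test error:", sample_error(h_hat, s_test))
```

Only the test error is an unbiased estimate of ErrP(ĥ) here, because Stest played no part in producing ĥ.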
[Figure: The training sample Strain = (x1,y1), …, (xn,yn) is given to the learner, which outputs ĥ; the test sample Stest = (x1,y1), …, (xk,yk) is held out.]

Binomial Distribution
• The probability of observing x heads in a sample of n independent coin tosses, where in each toss the probability of heads is p, is
  $$P(X = x \mid p, n) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$$
• Normal approximation: For np(1-p) ≥ 5 the binomial can be approximated by the normal distribution with
  – Expected value: E(X) = np, Variance: Var(X) = np(1-p)
  – With probability δ, the observation x falls in the interval
    $$E(X) \pm z_{\delta} \sqrt{\mathrm{Var}(X)}$$

  δ      50%   68%   80%   90%   95%   98%   99%
  z_δ    0.67  1.00  1.28  1.64  1.96  2.33  2.58

Text Classification Example: Results
• Data
  – Training Sample: 2000 examples
  – Test Sample: 600 examples
• Unpruned Tree:
  – Size: 437 nodes, Training Error: 0.0%, Test Error: 11.0%
• Early Stopping Tree:
  – Size: 299 nodes, Training Error: 2.6%, Test Error: 9.8%
• Post-Pruned Tree:
  – Size: 167 nodes, Training Error: 4.0%, Test Error: 10.8%
• Rule Post-Pruning:
  – Size: 164 tests, Training Error: 3.1%, Test Error: 10.3%
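To connect the normal approximation to these results, the following sketch computes an approximate confidence interval for a test error rate. It rests on two assumptions that go beyond the slides: the unknown p is replaced by the observed error rate (so the interval is only approximate), and the interval for the error count is divided by n to obtain an interval for the rate. The function name `error_confidence_interval` is hypothetical; the z values are the ones from the table above.

```python
import math

# z values from the table above, keyed by confidence level.
Z_VALUES = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64,
            0.95: 1.96, 0.98: 2.33, 0.99: 2.58}


def error_confidence_interval(num_errors, n, confidence=0.95):
    """Approximate confidence interval for the true error rate, given
    num_errors mistakes on n i.i.d. test examples.

    Assumption: the observed rate p_hat is plugged in for the unknown p."""
    p_hat = num_errors / n
    if n * p_hat * (1 - p_hat) < 5:
        raise ValueError("normal approximation not justified: n*p*(1-p) < 5")
    z = Z_VALUES[confidence]
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


if __name__ == "__main__":
    # Unpruned tree: 11.0% test error on 600 test examples, i.e. 66 errors.
    low, high = error_confidence_interval(66, 600, confidence=0.95)
    print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

For the unpruned tree this gives roughly 8.5% to 13.5% at the 95% level, an interval wide enough to contain the test errors of all four trees listed above.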