1 CS 391L: Machine Learning: Experimental Evaluation
Raymond J. Mooney
University of Texas at Austin

2 Evaluating Inductive Hypotheses
• Accuracy of hypotheses on training data is obviously biased, since the hypothesis was constructed to fit this data.
• Accuracy must be evaluated on an independent (usually disjoint) test set.
• The larger the test set, the more accurate the measured accuracy and the lower the variance observed across different test sets.

3 Variance in Test Accuracy
• Let error_S(h) denote the percentage of examples in an independently sampled test set S of size n that are incorrectly classified by hypothesis h.
• Let error_D(h) denote the true error rate for the overall data distribution D.
• When n is at least 30, the central limit theorem ensures that the distribution of error_S(h) for different random samples will be closely approximated by a normal (Gaussian) distribution.
[Figure: bell-shaped distribution of error_S(h), centered on error_D(h)]

4 Comparing Two Learned Hypotheses
• When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies.
– Assume h1 is tested on test set S1 of size n1
– Assume h2 is tested on test set S2 of size n2
[Figure: overlapping distributions of error_S1(h1) and error_S2(h2); here h1 is observed to be more accurate than h2]

5 Comparing Two Learned Hypotheses (continued)
• When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies.
– Assume h1 is tested on test set S1 of size n1
– Assume h2 is tested on test set S2 of size n2
[Figure: overlapping distributions of error_S1(h1) and error_S2(h2); here h1 is observed to be less accurate than h2]

6 Statistical Hypothesis Testing
• Determine the probability that an empirically observed difference in a statistic could be due purely to random chance, assuming there is no true underlying difference.
• Specific tests exist for determining the significance of the difference between two means computed from two samples gathered under different conditions.
• Such a test determines the probability of the null hypothesis: that the two samples were actually drawn from the same underlying distribution.
• By scientific convention, we reject the null hypothesis and say the difference is statistically significant if the probability of the null hypothesis is less than 5% (p < 0.05); alternatively, we accept that the difference is due to an underlying cause with a confidence of (1 − p).

7 One-sided vs. Two-sided Tests
• A one-sided test assumes you expected a difference in one direction (A is better than B) and the observed difference is consistent with that assumption.
• A two-sided test does not assume an expected difference in either direction.
• A two-sided test is more conservative, since it requires a larger difference to conclude that the difference is significant.

8 Z-Score Test for Comparing Learned Hypotheses
• Assumes h1 is tested on test set S1 of size n1 and h2 is tested on test set S2 of size n2.
• Compute the difference between the error rates of h1 and h2:
    d = |error_S1(h1) − error_S2(h2)|
• Compute the standard deviation of the sample estimate of the difference:
    σ_d = sqrt( error_S1(h1)·(1 − error_S1(h1)) / n1 + error_S2(h2)·(1 − error_S2(h2)) / n2 )
• Compute the z-score for the difference:
    z = d / σ_d

9 Z-Score Test for Comparing Learned Hypotheses (continued)
• Determine the confidence in the difference by looking up the highest confidence, C, for the given z-score in a table.
• This gives the confidence for a two-tailed test; for a one-tailed test, increase the confidence halfway towards 100%:
    C′ = 100 − (100 − C)/2

    z-score:           0.67   1.00   1.28   1.64   1.96   2.33   2.58
    confidence level:  50%    68%    80%    90%    95%    98%    99%

10 Sample Z-Score Test 1
Assume we test two hypotheses on different test sets of size 100 and observe:
    error_S1(h1) = 0.20,  error_S2(h2) = 0.30
    d = |error_S1(h1) − error_S2(h2)| = |0.20 − 0.30| = 0.10
    σ_d = sqrt( 0.2·(1 − 0.2)/100 + 0.3·(1 − 0.3)/100 ) = 0.0608
    z = d / σ_d = 0.10 / 0.0608 = 1.644
Confidence for two-tailed test: 90%
Confidence for one-tailed test: 100 − (100 − 90)/2 = 95%

11 Sample Z-Score Test 2
Assume we test two hypotheses on different test sets of size 100 and observe:
    error_S1(h1) = 0.20,  error_S2(h2) = 0.25
    d = |error_S1(h1) − error_S2(h2)| = |0.20 − 0.25| = 0.05
    σ_d = sqrt( 0.2·(1 − 0.2)/100 + 0.25·(1 − 0.25)/100 ) = 0.0589
    z = d / σ_d = 0.05 / 0.0589 = 0.848
Confidence for two-tailed test: 50%
Confidence for one-tailed test: 100 − (100 − 50)/2 = 75%

12 Z-Score Test Assumptions
• Hypotheses can be tested on different test sets; if the same test set is used, stronger conclusions might be warranted.
• Test sets contain at least 30 independently drawn examples.
• Hypotheses were constructed from independent training sets.
• The test only compares two specific hypotheses, regardless of the methods used to construct them; it does not compare the underlying learning methods in general.
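The z-score computation from slides 8–11 can be sketched in Python; the function name `z_score_diff` is illustrative, not from the slides:

```python
import math

def z_score_diff(e1, n1, e2, n2):
    """Z-score for the difference between two error rates e1 and e2,
    measured on independent test sets of sizes n1 and n2."""
    d = abs(e1 - e2)  # observed difference in error rates
    # standard deviation of the sample estimate of the difference
    sigma_d = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d / sigma_d

# Sample Z-Score Test 1: error rates 0.20 and 0.30 on test sets of size 100
z1 = z_score_diff(0.20, 100, 0.30, 100)  # about 1.644: 90% two-tailed, 95% one-tailed
# Sample Z-Score Test 2: error rates 0.20 and 0.25 on test sets of size 100
z2 = z_score_diff(0.20, 100, 0.25, 100)  # about 0.848: only 50% by table lookup
```

Instead of the table lookup on slide 9, an exact two-tailed confidence could be computed from the normal CDF (e.g. with `scipy.stats.norm.cdf`), if SciPy is available.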