This document covers the importance of evaluating machine learning hypotheses on independent test sets and the concept of variance in test accuracy. It also covers comparing two learned hypotheses, statistical hypothesis testing, and z-score tests for comparing learned hypotheses. The document assumes a background in machine learning and statistics.

Pre 2010

CS 391L: Machine Learning: Experimental Evaluation
Raymond J. Mooney
University of Texas at Austin

Evaluating Inductive Hypotheses
• Accuracy of hypotheses on training data is obviously biased, since the hypothesis was constructed to fit this data.
• Accuracy must be evaluated on an independent (usually disjoint) test set.
• The larger the test set, the more accurate the measured accuracy and the lower the variance observed across different test sets.

Variance in Test Accuracy
• Let $\mathrm{error}_S(h)$ denote the fraction of examples in an independently sampled test set S of size n that are incorrectly classified by hypothesis h.
• Let $\mathrm{error}_D(h)$ denote the true error rate for the overall data distribution D.
• When n is at least 30, the central limit theorem ensures that the distribution of $\mathrm{error}_S(h)$ for different random samples will be closely approximated by a normal (Gaussian) distribution.
[Figure: distribution of $\mathrm{error}_S(h)$; horizontal axis $\mathrm{error}_S(h)$, vertical axis $P(\mathrm{error}_S(h))$, with $\mathrm{error}_D(h)$ marked on the horizontal axis.]

Comparing Two Learned Hypotheses
• When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies.
– Assume h1 is tested on test set S1 of size n1
– Assume h2 is tested on test set S2 of size n2
[Figure: horizontal axis $\mathrm{error}_S(h)$, vertical axis $P(\mathrm{error}_S(h))$, with $\mathrm{error}_{S_1}(h_1)$ and $\mathrm{error}_{S_2}(h_2)$ marked. Caption: Observe h1 more accurate than h2.]
[Figure: same axes and marks. Caption: Observe h1 less accurate than h2.]

Statistical Hypothesis Testing
• Determine the probability that an empirically observed difference in a statistic could be due purely to random chance, assuming there is no true underlying difference.
• Specific tests exist for determining the significance of the difference between two means computed from two samples gathered under different conditions.
• The null hypothesis is that the two samples were actually drawn from the same underlying distribution; the test determines the probability p of observing a difference at least this large if the null hypothesis holds.
• By scientific convention, we reject the null hypothesis and say the difference is statistically significant if this probability is less than 5% (p < 0.05); alternatively, we accept that the difference is due to an underlying cause with a confidence of (1 – p).

One-sided vs Two-sided Tests
• A one-sided test assumes you expected a difference in one direction (A is better than B) and the observed difference is consistent with that assumption.
• A two-sided test does not assume an expected difference in either direction.
• A two-sided test is more conservative, since it requires a larger difference to conclude that the difference is significant.

Z-Score Test for Comparing Learned Hypotheses
• Assumes h1 is tested on test set S1 of size n1 and h2 is tested on test set S2 of size n2.
• Compute the difference between the accuracy of h1 and h2:
$$d = \left| \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2) \right|$$
• Compute the standard deviation of the sample estimate of the difference:
$$\sigma_d = \sqrt{\frac{\mathrm{error}_{S_1}(h_1)\,(1 - \mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)\,(1 - \mathrm{error}_{S_2}(h_2))}{n_2}}$$
• Compute the z-score for the difference:
$$z = \frac{d}{\sigma_d}$$

Z-Score Test for Comparing Learned Hypotheses (continued)
• Determine the confidence in the difference by looking up the highest confidence, C, for the given z-score in a table.

confidence level   50%    68%    80%    90%    95%    98%    99%
z-score            0.67   1.00   1.28   1.64   1.96   2.33   2.58

• This gives the confidence for a two-tailed test; for a one-tailed test, increase the confidence halfway towards 100%:
$$C' = 100 - \frac{100 - C}{2}$$

Sample Z-Score Test 1
Assume we test two hypotheses on different test sets of size 100 and observe:
$$\mathrm{error}_{S_1}(h_1) = 0.20 \qquad \mathrm{error}_{S_2}(h_2) = 0.30$$
$$d = \left| \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2) \right| = |0.2 - 0.3| = 0.1$$
$$\sigma_d = \sqrt{\frac{\mathrm{error}_{S_1}(h_1)\,(1 - \mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)\,(1 - \mathrm{error}_{S_2}(h_2))}{n_2}} = \sqrt{\frac{0.2\,(1 - 0.2)}{100} + \frac{0.3\,(1 - 0.3)}{100}} = 0.0608$$
$$z = \frac{d}{\sigma_d} = \frac{0.1}{0.0608} = 1.644$$
Confidence for two-tailed test: 90%
Confidence for one-tailed test: (100 – (100 – 90)/2) = 95%

Sample Z-Score Test 2
Assume we test two hypotheses on different test sets of size 100 and observe:
$$\mathrm{error}_{S_1}(h_1) = 0.20 \qquad \mathrm{error}_{S_2}(h_2) = 0.25$$
$$d = \left| \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2) \right| = |0.2 - 0.25| = 0.05$$
$$\sigma_d = \sqrt{\frac{\mathrm{error}_{S_1}(h_1)\,(1 - \mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)\,(1 - \mathrm{error}_{S_2}(h_2))}{n_2}} = \sqrt{\frac{0.2\,(1 - 0.2)}{100} + \frac{0.25\,(1 - 0.25)}{100}} = 0.0589$$
$$z = \frac{d}{\sigma_d} = \frac{0.05}{0.0589} = 0.848$$
Confidence for two-tailed test: 50%
Confidence for one-tailed test: (100 – (100 – 50)/2) = 75%

Z-Score Test Assumptions
• Hypotheses can be tested on different test sets; if the same test set is used, stronger conclusions might be warranted.
• Test sets have at least 30 independently drawn examples.
• Hypotheses were constructed from independent training sets.
• The test only compares two specific hypotheses, regardless of the methods used to construct them. It does not compare the underlying learning methods in general.
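
The z-score procedure and the two sample tests above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the function names (z_score_test, two_tailed_confidence, one_tailed_confidence) are hypothetical and not part of the slides, error rates are passed as fractions rather than percentages, the difference d is taken as an absolute value so that it matches the sample arithmetic, and the confidence lookup uses only the seven table entries given in the slides rather than an exact normal distribution.

```python
import math

# Two-tailed confidence table from the slides: (z-score threshold, confidence in %).
Z_TABLE = [(0.67, 50), (1.00, 68), (1.28, 80), (1.64, 90),
           (1.96, 95), (2.33, 98), (2.58, 99)]


def z_score_test(err1, n1, err2, n2):
    """Return (d, sigma_d, z) for two observed error rates given as fractions.

    Assumes at least one error rate is strictly between 0 and 1;
    otherwise sigma_d is zero and the division fails.
    """
    d = abs(err1 - err2)  # absolute difference (assumption, see lead-in)
    sigma_d = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return d, sigma_d, d / sigma_d


def two_tailed_confidence(z):
    """Highest tabulated confidence whose z threshold does not exceed z (0 if none)."""
    confidence = 0
    for threshold, level in Z_TABLE:
        if z >= threshold:
            confidence = level
    return confidence


def one_tailed_confidence(c):
    """Move a two-tailed confidence c (in %) halfway towards 100%."""
    return 100 - (100 - c) / 2


if __name__ == "__main__":
    # The two sample tests from the slides: test sets of size 100 each.
    for err1, err2 in [(0.20, 0.30), (0.20, 0.25)]:
        d, sigma_d, z = z_score_test(err1, 100, err2, 100)
        c = two_tailed_confidence(z)
        print(f"d={d:.2f} sigma_d={sigma_d:.4f} z={z:.3f} "
              f"two-tailed={c}% one-tailed={one_tailed_confidence(c)}%")
```

A table lookup is used here to stay close to the slides; an exact confidence could instead be computed from the normal cumulative distribution function, which would give slightly different (non-tabulated) values.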