Machine Learning (CS 567), Fall 2008
Time: T-Th 5:00pm - 6:20pm
Location: GFS 118
Instructor: Sofus A. Macskassy ([email protected])
Office: SAL 216
Office hours: by appointment
Teaching assistant: Cheol Han ([email protected])
Office: SAL 229
Office hours: M 2-3pm, W 11-12
Class web page: http://www-scf.usc.edu/~csci567/index.html

Administrative – 599 Spring 2009
• I teach a 599 seminar next semester (Spring 2009)
  – Style is seminar: weekly readings, class discussion
• Title: Advanced Topics in Machine Learning: Statistical Relational Learning
  – This is not the same as what Prof. Fei Sha is teaching, which is a different advanced-topics-in-ML seminar
• The focus of the course is relational learning
  – Standard ML considers instances to be independent
  – What if they are not, as in relational databases, social networks, and other graph data such as the web or hypertext?
  – Topics include collective inference, relational inference, search space, bias, graphical models, and more.
• A preliminary syllabus is posted on the csci567 page

Cost-Sensitive Learning
• In most applications, false positive and false negative errors are not equally important, so we want to adjust the tradeoff between them. Many learning algorithms provide a way to do this:
  – Probabilistic classifiers: combine the cost matrix with decision theory to make classification decisions
  – Discriminant functions: adjust the threshold for classifying into the positive class
  – Ensembles: adjust the number of votes required to classify as positive

Example: 30 Trees Constructed by Bagging
• Classify as positive if K out of 30 trees predict positive. Vary K.

Directly Visualizing the Tradeoff
• We can plot the false positives versus the false negatives directly.
• If R·L(0,1) = L(1,0) (i.e., a FP is R times more expensive than a FN), then
  total cost = L(0,1)·FN(θ) + L(1,0)·FP(θ) ∝ FN(θ) + R·FP(θ),
  so lines of equal cost have the form FN(θ) = c − R·FP(θ).
• The best operating point will be tangent to a line with a slope of −R.
• For the bagging example above: if R=1, we should set the threshold to 10; if R=10, the threshold should be 29.

SVM: Asymmetric Margins
• Minimize ||w||² + C·Σ_i ξ_i
  subject to   w·x_i + ξ_i ≥ R   (positive examples)
              −w·x_i + ξ_i ≥ 1   (negative examples)

ROC: Sub-Optimal Models
• [Figure: ROC plot with the sub-optimal region marked]

ROC Convex Hull
• If we have two classifiers h1 and h2 with (fp1, fn1) and (fp2, fn2), then we can construct a stochastic classifier that interpolates between them. Given a new data point x, we use classifier h1 with probability p and h2 with probability (1 − p). The resulting classifier has an expected false positive level of p·fp1 + (1 − p)·fp2 and an expected false negative level of p·fn1 + (1 − p)·fn2.
• This means that we can create a classifier that matches any point on the convex hull of the ROC curve.

Computing AUC
• Let S1 = the sum of the ranks r(i) of the positive examples (those with yi = 1). Then the estimated AUC is
  ÂUC = (S1 − N1(N1 + 1)/2) / (N0·N1),
  where N0 is the number of negative examples and N1 is the number of positive examples.
• This can also be computed exactly using standard geometry.
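As a minimal sketch of the rank-sum formula (Python; the helper name and the use of scipy's rankdata for tie handling are implementation choices, not anything prescribed by the slides):

    import numpy as np
    from scipy.stats import rankdata  # average ranks for tied scores

    def auc_from_ranks(scores, labels):
        """Estimate AUC via AUC-hat = (S1 - N1*(N1+1)/2) / (N0*N1),
        where S1 is the sum of the ranks of the positive examples."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels)
        r = rankdata(scores)            # rank 1 = lowest score
        n1 = int(np.sum(labels == 1))   # number of positive examples
        n0 = int(np.sum(labels == 0))   # number of negative examples
        s1 = r[labels == 1].sum()       # S1: rank sum of the positives
        return (s1 - n1 * (n1 + 1) / 2) / (n0 * n1)

    # Example: three of the four positive-negative pairs are ranked
    # correctly, so this prints 0.75.
    print(auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))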
Optimizing AUC
• A hot topic in machine learning right now is developing algorithms for optimizing AUC.
• RankBoost: a modification of AdaBoost. The main idea is to define a “ranking loss” function and then penalize a training example x by the number of examples of the other class that are misranked (relative to x).

Rejection Curves
• In most learning algorithms, we can specify a threshold θ for making a rejection decision
  – Probabilistic classifiers: adjust the cost of rejecting versus the cost of FP and FN
  – Decision-boundary method: if a test point x is within θ of the decision boundary, then reject
• Equivalent to requiring that the “activation” of the best class is larger than that of the second-best class by at least θ

Precision Recall Graph
• Plot recall on the horizontal axis and precision on the vertical axis, and vary the threshold θ for making positive predictions (or vary K).

The F1 Measure
• A figure of merit that combines precision and recall:
  F1 = 2·P·R / (P + R),
  where P = precision and R = recall. This is the harmonic mean of P and R.
• We can plot F1 as a function of the classification threshold θ.

Summarizing a Single Operating Point
• WEKA and many other systems normally report various measures for a single operating point (e.g., a threshold of θ = 0.5). Here is example output from WEKA:

  === Detailed Accuracy By Class ===
  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.854     0.1       0.899       0.854    0.876       0
  0.9       0.146     0.854       0.9      0.876       1

Estimating the Error Rate of a Classifier
• Compute the error rate on hold-out data
  – Suppose a classifier makes k errors on n holdout data points
  – The estimated error rate is ε̂ = k / n
• Compute a confidence interval on this estimate
  – The standard error of this estimate is SE = √( ε̂·(1 − ε̂) / n )
  – A (1 − α) confidence interval on the true error ε is ε̂ − z_{α/2}·SE ≤ ε ≤ ε̂ + z_{α/2}·SE
  – For a 95% confidence interval, z_{0.025} = 1.96, so we use ε̂ − 1.96·SE ≤ ε ≤ ε̂ + 1.96·SE

Hypothesis Testing
• Instead of computing the error directly, we sometimes want to know whether the error is less than some value X
  – e.g., the error of another learner
  – i.e., our hypothesis is that ε < X
• If the sample data are consistent with this hypothesis, then we accept the hypothesis; otherwise we reject it.
• However, we can only assess this given our observed data, so we can only be sure to a certain degree of confidence.

Hypothesis Testing (2)
• For example, we may want to know whether the error ε is equal to another value ε0.
• We define the null hypothesis H0: ε = ε0 against the alternative hypothesis H1: ε ≠ ε0.
• It is reasonable to accept H0 if ε̂ is not too far from ε0.
• Using the confidence interval from before, we accept H0 with level of significance α if ε0 lies in the range
  ε̂ − z_{α/2}·SE ≤ ε0 ≤ ε̂ + z_{α/2}·SE
• This is a two-sided test.
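A minimal sketch of the interval and the two-sided test just described, assuming the normal approximation from the slide (the function names are illustrative, not from any course code):

    from math import sqrt

    def error_confidence_interval(k, n, z=1.96):
        """Normal-approximation confidence interval for the true error
        rate, given k errors on n held-out examples (95% when z = 1.96)."""
        eps_hat = k / n                          # estimated error rate
        se = sqrt(eps_hat * (1 - eps_hat) / n)   # standard error
        return eps_hat - z * se, eps_hat + z * se

    def accept_h0(k, n, eps0, z=1.96):
        """Two-sided test of H0: error = eps0 with z = z_{alpha/2};
        accept H0 if eps0 falls inside the confidence interval."""
        lo, hi = error_confidence_interval(k, n, z)
        return lo <= eps0 <= hi

    # Example: 20 errors on 200 test points gives eps_hat = 0.10 and a 95%
    # interval of roughly (0.058, 0.142), so H0: error = 0.15 is rejected.
    print(error_confidence_interval(20, 200))
    print(accept_h0(20, 200, eps0=0.15))   # False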
Problem of Multiple Comparisons
• A multiple comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins.
  – Imagine testing 100 coins by this method.
  – Given that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107, one would expect that in flipping 100 fair coins ten times each, seeing some coin (not a particular, pre-selected one) come up heads 9 or 10 times would be a relatively likely event.
  – Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore, applying our single-test coin-fairness criterion to multiple comparisons would, more likely than not, falsely identify at least one fair coin as unfair.

Problem of Multiple Comparisons
• Technically, the problem of multiple comparisons (also known as the multiple testing problem) is the potential increase in Type I error that occurs when statistical tests are used repeatedly:
  – If n independent comparisons are performed, the experiment-wide significance level is α_experiment = 1 − (1 − α_single_comparison)^n
  – (α_single_comparison = 0.0107 on the previous slide)
  – Therefore, for the 100 fair-coin tests from the previous slide, α_experiment = 0.66
• This increases as the number of comparisons increases.

Comparing Two Classifiers
• Goal: decide which of two classifiers h1 and h2 has the lower error rate.
• Method: run them both on the same test data set and record the following counts:
  – n00: the number of examples correctly classified by both classifiers
  – n01: the number of examples correctly classified by h1 but misclassified by h2
  – n10: the number of examples misclassified by h1 but correctly classified by h2
  – n11: the number of examples misclassified by both h1 and h2

                h2 correct   h2 wrong
  h1 correct       n00          n01
  h1 wrong         n10          n11

Cost-Sensitive Comparison of Two Classifiers
• Suppose we have a non-0/1 loss matrix L(ŷ, y) and we have two classifiers h1 and h2. Goal: determine which classifier has the lower expected loss.
• A method that does not work well:
  – For each algorithm a and each test example (x_i, y_i), compute ℓ_{a,i} = L(h_a(x_i), y_i).
  – Let δ_i = ℓ_{1,i} − ℓ_{2,i}
  – Treat the δ's as normally distributed and compute a normal confidence interval
• The problem is that there are only a finite number of different possible values for δ_i. They are not normally distributed, and the resulting confidence intervals are too wide.

A Better Method: BDeltaCost
• Let Δ = {δ_i}, i = 1, …, N, be the set of δ_i's computed as above.
• For b from 1 to 1000:
  – Let T_b be a bootstrap replicate of Δ
  – Let s_b = the average of the δ's in T_b
• Sort the s_b's and identify the 26th and 975th items. These form a 95% confidence interval on the average difference between the loss from h1 and the loss from h2.
• The bootstrap confidence interval quantifies the uncertainty due to the size of the test set. It does not allow us to compare algorithms, only classifiers.
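A minimal sketch of the BDeltaCost bootstrap, assuming the per-example loss differences δ_i have already been computed (the function name and the NumPy-based resampling are implementation choices):

    import numpy as np

    def bdeltacost_interval(deltas, seed=0):
        """95% bootstrap confidence interval on the mean loss difference
        delta_i = L(h1(x_i), y_i) - L(h2(x_i), y_i), using 1000 replicates."""
        rng = np.random.default_rng(seed)
        deltas = np.asarray(deltas, dtype=float)
        n = len(deltas)
        # 1000 bootstrap replicates: resample the deltas with replacement
        # and record the average of each replicate.
        means = np.sort([rng.choice(deltas, size=n, replace=True).mean()
                         for _ in range(1000)])
        # 26th and 975th of the sorted averages, as on the slide.
        return means[25], means[974]

    # If the resulting interval excludes 0, we conclude that h1 and h2
    # differ in expected loss on this test set.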
Estimating the Error Rate of a Learning Algorithm
• Under the PAC model, training examples x are drawn from an underlying distribution D and labeled according to an unknown function f to give (x, y) pairs where y = f(x).
• The error rate of a classifier h is error(h) = P_D(h(x) ≠ f(x)).
• Define the error rate of a learning algorithm A for sample size m and distribution D as error(A, m, D) = E_S[error(A(S))].
• This is the expected error rate of h = A(S) for training sets S of size m drawn according to D.
• We could estimate this if we had several training sets S1, …, SL all drawn from D. We could compute A(S1), A(S2), …, A(SL), measure their error rates, and average them.
• Unfortunately, we don't have enough data to do this!

5x2CV F test
• Perform 5 replications of 2-fold cross-validation. For each replication i = 1, …, 5 and each fold j = 1, 2, record:
  – p_A^(i,j) and p_B^(i,j): the error rates of algorithms A and B on fold j of replication i
  – p_i^(j) = p_A^(i,j) − p_B^(i,j): their difference
  – p̄_i = (p_i^(1) + p_i^(2)) / 2 and s_i² = (p_i^(1) − p̄_i)² + (p_i^(2) − p̄_i)²: the mean difference and variance for replication i

5x2CV F test
• If F > 4.47, then with 95% confidence we can reject the null hypothesis that algorithms A and B have the same error rate when trained on data sets of size m/2.
• An F-test is any statistical test in which the test statistic has an F-distribution if the null hypothesis is true, e.g.:
  – The hypothesis that the means of multiple normally distributed populations, all having the same standard deviation, are equal.
  – The hypothesis that the standard deviations of two normally distributed populations are equal, and thus that they are of comparable origin.
• (A brief code sketch of this test follows the summary.)

Summary
• ROC Curves
• Reject Curves
• Precision-Recall Curves
• Statistical Tests
  – Estimating the error rate of a classifier
  – Comparing two classifiers
  – Estimating the error rate of a learning algorithm
  – Comparing two algorithms
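A minimal sketch of the 5x2cv F test in Python. The slides give the table of fold-wise error rates and the 4.47 threshold; the specific combination of the p_i^(j) and s_i² into an F statistic below follows the standard 5x2cv F test and is an assumption about the intended formula, as it is not spelled out above.

    import numpy as np

    def five_by_two_cv_f(p_a, p_b, threshold=4.47):
        """5x2cv F test from fold-wise error rates.

        p_a, p_b: 5x2 arrays, where p_a[i, j] is algorithm A's error rate
        on fold j of replication i (and likewise for p_b).
        Returns (F, reject); reject=True means we reject the hypothesis
        that A and B have the same error rate.
        """
        p = np.asarray(p_a, dtype=float) - np.asarray(p_b, dtype=float)  # p_i^(j)
        p_bar = p.mean(axis=1, keepdims=True)        # per-replication mean
        s2 = ((p - p_bar) ** 2).sum(axis=1)          # per-replication variance s_i^2
        f_stat = (p ** 2).sum() / (2.0 * s2.sum())   # assumed 5x2cv F statistic
        return f_stat, f_stat > threshold

    # p_a and p_b would be filled in by training each algorithm on one half
    # of each of 5 random splits of the data and testing on the other half.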