Administrative Machine Learning - Project | CSCI 567, Study Guides, Projects, Research of Computer Science

Material Type: Project; Professor: Sha; Class: Machine Learning; Subject: Computer Science; University: University of Southern California; Term: Fall 2008;


Slide 1: Machine Learning (CS 567) Fall 2008
Time: T-Th 5:00pm - 6:20pm
Location: GFS 118
Instructor: Sofus A. Macskassy ([email protected])
Office: SAL 216
Office hours: by appointment
Teaching assistant: Cheol Han ([email protected])
Office: SAL 229
Office hours: M 2-3pm, W 11-12
Class web page: http://www-scf.usc.edu/~csci567/index.html

Slide 2: Administrative – 599 Spring 2009
• I am teaching a 599 seminar next semester (Spring 2009)
  – Style is seminar: weekly readings, class discussion
• Title: Advanced Topics in Machine Learning: Statistical Relational Learning
  – This is not the same as what Prof. Fei Sha is teaching, which is a different advanced-topics-in-ML seminar
• The focus of the course is relational learning
  – Standard ML considers instances to be independent
  – What if they are not, as in relational databases, social networks, or other graph data such as the web or hypertext?
  – Topics include collective inference, relational inference, search space, bias, graphical models, and more
• A preliminary syllabus is posted on the csci567 page

Slide 5: Cost-Sensitive Learning
• In most applications, false positive and false negative errors are not equally important, so we want to adjust the tradeoff between them. Many learning algorithms provide a way to do this:
  – Probabilistic classifiers: combine a cost matrix with decision theory to make classification decisions
  – Discriminant functions: adjust the threshold for classifying into the positive class
  – Ensembles: adjust the number of votes required to classify as positive

Slide 6: Example: 30 Trees Constructed by Bagging
• Classify as positive if K out of 30 trees predict positive. Vary K.

Slide 7: Directly Visualizing the Tradeoff
• We can plot the false positives versus false negatives directly.
• If $R \cdot L(0,1) = L(1,0)$ (i.e., a FP is R times more expensive than a FN), then total cost $= L(0,1)\,FN(\theta) + L(1,0)\,FP(\theta) \propto FN(\theta) + R \cdot FP(\theta)$, so lines of constant cost satisfy $FN(\theta) = c - R \cdot FP(\theta)$.
• The best operating point will be tangent to a line with a slope of $-R$.
• For the 30-tree bagging example: if R = 1, we should set the threshold to 10; if R = 10, the threshold should be 29.

Slide 10: SVM: Asymmetric Margins
• Minimize $\|w\|^2 + C \sum_i \xi_i$
  subject to $w \cdot x_i + \xi_i \ge R$ (positive examples)
  and $-w \cdot x_i + \xi_i \ge 1$ (negative examples)

Slide 11: ROC: Sub-Optimal Models
• (Figure: ROC plot highlighting the sub-optimal region.)

Slide 12: ROC Convex Hull
• If we have two classifiers h1 and h2 with (fp1, fn1) and (fp2, fn2), then we can construct a stochastic classifier that interpolates between them. Given a new data point x, we use classifier h1 with probability p and h2 with probability (1 - p). The resulting classifier has an expected false positive level of p fp1 + (1 - p) fp2 and an expected false negative level of p fn1 + (1 - p) fn2.
• This means that we can create a classifier that matches any point on the convex hull of the ROC curve.
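To make the convex-hull interpolation concrete, here is a minimal Python sketch (not from the course notes): it mixes two arbitrary 0/1 classifiers, given as callables, by routing each test point to h1 with probability p, and computes the expected (FP, FN) operating point exactly as stated on the slide. The function names and the use of NumPy are my own assumptions.

import numpy as np

def stochastic_mixture(h1, h2, p, X, rng=None):
    # Route each row of X to h1 with probability p, otherwise to h2.
    # h1 and h2 are callables that return arrays of 0/1 labels.
    rng = np.random.default_rng() if rng is None else rng
    y1 = np.asarray(h1(X))
    y2 = np.asarray(h2(X))
    use_h1 = rng.random(len(X)) < p
    return np.where(use_h1, y1, y2)

def expected_operating_point(fp1, fn1, fp2, fn2, p):
    # Expected FP and FN levels of the mixture, per the slide.
    return p * fp1 + (1 - p) * fp2, p * fn1 + (1 - p) * fn2

Sweeping p from 0 to 1 traces the line segment between the two classifiers' operating points, which is why any point on the convex hull is achievable.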
Slide 15: Computing AUC
• Let $S_1$ be the sum of the ranks $r(i)$ for which $y_i = 1$ (the sum of the ranks of the positive examples). Then, where $N_0$ is the number of negative examples and $N_1$ is the number of positive examples,
  $\widehat{AUC} = \dfrac{S_1 - N_1(N_1+1)/2}{N_0 N_1}$
• This can also be computed exactly using standard geometry.

Slide 16: Optimizing AUC
• A hot topic in machine learning right now is developing algorithms for optimizing AUC.
• RankBoost: a modification of AdaBoost. The main idea is to define a "ranking loss" function and then penalize a training example x by the number of examples of the other class that are misranked relative to x.

Slide 17: Rejection Curves
• In most learning algorithms, we can specify a threshold for making a rejection decision:
  – Probabilistic classifiers: adjust the cost of rejecting versus the cost of a FP or FN
  – Decision-boundary method: if a test point x is within θ of the decision boundary, then reject
• Equivalent to requiring that the "activation" of the best class be larger than that of the second-best class by at least θ

Slide 20: Precision-Recall Graph
• Plot recall on the horizontal axis and precision on the vertical axis, and vary the threshold for making positive predictions (or vary K).

Slide 21: The F1 Measure
• A figure of merit that combines precision and recall:
  $F_1 = \dfrac{2 \cdot P \cdot R}{P + R}$
  where P = precision and R = recall. This is the harmonic mean of P and R.
• We can plot F1 as a function of the classification threshold θ.

Slide 22: Summarizing a Single Operating Point
• WEKA and many other systems normally report various measures for a single operating point (e.g., θ = 0.5). Here is example output from WEKA:

  === Detailed Accuracy By Class ===
  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  0.854     0.1       0.899       0.854    0.876       0
  0.9       0.146     0.854       0.9      0.876       1

Slide 25: Estimating the Error Rate of a Classifier
• Compute the error rate on hold-out data:
  – Suppose a classifier makes k errors on n hold-out data points
  – The estimated error rate is $\hat{\epsilon} = k / n$
• Compute a confidence interval on this estimate:
  – The standard error of this estimate is $SE = \sqrt{\hat{\epsilon}(1 - \hat{\epsilon}) / n}$
  – A $(1 - \alpha)$ confidence interval on the true error $\epsilon$ is $\hat{\epsilon} - z_{\alpha/2} SE \le \epsilon \le \hat{\epsilon} + z_{\alpha/2} SE$
  – For a 95% confidence interval, $z_{0.025} = 1.96$, so we use $\hat{\epsilon} - 1.96\,SE \le \epsilon \le \hat{\epsilon} + 1.96\,SE$

Slide 26: Hypothesis Testing
• Instead of computing the error directly, we sometimes want to know whether the error is less than some value X
  – e.g., the error of another learner
  – i.e., our hypothesis is that $\epsilon < X$
• If the sample data is consistent with this hypothesis, then we accept the hypothesis; otherwise we reject it.
• However, we can only assess this given our observed data, so we can only be sure to a certain degree of confidence.

Slide 27: Hypothesis Testing (2)
• For example, we may want to know whether the error $\epsilon$ is equal to some other value $\epsilon_0$.
• We define the null hypothesis $H_0: \epsilon = \epsilon_0$ against the alternative hypothesis $H_1: \epsilon \ne \epsilon_0$.
• It is reasonable to accept $H_0$ if $\epsilon_0$ is not too far from $\hat{\epsilon}$.
• Using the confidence intervals from before:
  – We accept $H_0$ at significance level $\alpha$ if $\epsilon_0$ lies in the range $\hat{\epsilon} - z_{\alpha/2} SE \le \epsilon_0 \le \hat{\epsilon} + z_{\alpha/2} SE$
  – This is a two-sided test.
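As a minimal sketch of the two preceding slides (my own function names, assuming the normal approximation used there), the Python below computes the hold-out confidence interval and applies the two-sided test:

import math

def error_confidence_interval(k, n, z=1.96):
    # Normal-approximation CI for the true error rate, given k errors
    # observed on n hold-out examples (z = 1.96 gives a 95% interval).
    eps_hat = k / n
    se = math.sqrt(eps_hat * (1 - eps_hat) / n)
    return eps_hat - z * se, eps_hat + z * se

def accept_null(k, n, eps0, z=1.96):
    # Two-sided test: accept H0 (epsilon == eps0) iff eps0 lies inside the CI.
    lo, hi = error_confidence_interval(k, n, z)
    return lo <= eps0 <= hi

# Example: 38 errors on 500 hold-out points, tested against eps0 = 0.10.
print(error_confidence_interval(38, 500))  # roughly (0.053, 0.099)
print(accept_null(38, 500, 0.10))          # False: 0.10 falls just outside the interval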
Slide 30: Problem of Multiple Comparisons
• A multiple comparisons problem arises if one wants to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins.
  – Imagine testing 100 coins by this method.
  – Given that the probability of a fair coin coming up heads 9 or 10 times in 10 flips is 0.0107, one would expect that in flipping 100 fair coins ten times each, seeing a particular coin come up heads 9 or 10 times would be a relatively likely event.
  – Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is $(1 - 0.0107)^{100} \approx 0.34$. Therefore, applying our single-test coin-fairness criterion to multiple comparisons would, more likely than not, falsely identify at least one fair coin as unfair.

Slide 31: Problem of Multiple Comparisons (continued)
• Technically, the problem of multiple comparisons (also known as the multiple testing problem) is the potential increase in Type I error that occurs when statistical tests are used repeatedly:
  – If n independent comparisons are performed, the experiment-wide significance level $\bar{\alpha}$ is given by $\bar{\alpha} = 1 - (1 - \alpha_{\mathrm{single\ comparison}})^n$
  – ($\alpha_{\mathrm{single\ comparison}} = 0.0107$ on the previous slide)
  – Therefore $\bar{\alpha}$ for the 100 fair-coin tests from the previous slide is 0.66
• This increases as the number of comparisons increases.

Slide 32: Comparing Two Classifiers
• Goal: decide which of two classifiers h1 and h2 has the lower error rate.
• Method: run both on the same test data set and record the following information:
  – n00: the number of examples correctly classified by both classifiers
  – n01: the number of examples correctly classified by h1 but misclassified by h2
  – n10: the number of examples misclassified by h1 but correctly classified by h2
  – n11: the number of examples misclassified by both h1 and h2
• These counts form a 2x2 contingency table:

                 h2 correct   h2 wrong
    h1 correct      n00          n01
    h1 wrong        n10          n11

Slide 35: Cost-Sensitive Comparison of Two Classifiers
• Suppose we have a non-0/1 loss matrix L(ŷ, y) and two classifiers h1 and h2. Goal: determine which classifier has the lower expected loss.
• A method that does not work well:
  – For each algorithm a and each test example (xi, yi), compute $\ell_{a,i} = L(h_a(x_i), y_i)$
  – Let $\delta_i = \ell_{1,i} - \ell_{2,i}$
  – Treat the $\delta_i$'s as normally distributed and compute a normal confidence interval
• The problem is that there are only a finite number of different possible values for $\delta_i$. They are not normally distributed, and the resulting confidence intervals are too wide.

Slide 36: A Better Method: BDeltaCost
• Let $\Delta = \{\delta_i\}_{i=1}^{N}$ be the set of $\delta_i$'s computed as above.
• For b from 1 to 1000:
  – Let $T_b$ be a bootstrap replicate of $\Delta$
  – Let $s_b$ be the average of the $\delta$'s in $T_b$
• Sort the $s_b$'s and identify the 26th and 975th items. These form a 95% confidence interval on the average difference between the loss from h1 and the loss from h2.
• The bootstrap confidence interval quantifies the uncertainty due to the size of the test set. It does not allow us to compare algorithms, only classifiers.
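The BDeltaCost procedure above translates almost directly into code. The NumPy sketch below (my own function name, assuming the per-example losses have already been computed) follows the slide's 1000-replicate recipe:

import numpy as np

def bdeltacost_interval(losses_h1, losses_h2, n_boot=1000, seed=0):
    # 95% bootstrap CI on the mean loss difference between two classifiers
    # evaluated on the same test set (losses_hk[i] = L(hk(x_i), y_i)).
    rng = np.random.default_rng(seed)
    deltas = np.asarray(losses_h1, dtype=float) - np.asarray(losses_h2, dtype=float)
    means = np.empty(n_boot)
    for b in range(n_boot):
        replicate = rng.choice(deltas, size=len(deltas), replace=True)  # bootstrap replicate of Delta
        means[b] = replicate.mean()
    means.sort()
    lo = means[int(0.025 * n_boot)]        # the 26th item when n_boot = 1000
    hi = means[int(0.975 * n_boot) - 1]    # the 975th item when n_boot = 1000
    return lo, hi

If the resulting interval lies entirely below zero, h1 has lower expected loss than h2 on this test set; entirely above zero, the reverse; if it straddles zero, the test set is too small to distinguish them.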
Slide 37: Estimating the Error Rate of a Learning Algorithm
• Under the PAC model, training examples x are drawn from an underlying distribution D and labeled according to an unknown function f to give (x, y) pairs, where y = f(x).
• The error rate of a classifier h is $error(h) = P_D(h(x) \ne f(x))$
• Define the error rate of a learning algorithm A for sample size m and distribution D as $error(A, m, D) = E_S[error(A(S))]$
• This is the expected error rate of h = A(S) for training sets S of size m drawn according to D.
• We could estimate this if we had several training sets S1, …, SL, all drawn from D. We could compute A(S1), A(S2), …, A(SL), measure their error rates, and average them.
• Unfortunately, we don't have enough data to do this!

Slide 40: 5x2CV F Test
• For each replication i of 2-fold cross-validation and each split j, the table records the error rates of algorithms A and B, their difference $p_i^{(j)}$, and the per-replication mean $\bar{p}_i$ and variance $s_i^2$ of the differences:

    p_A^(1,1)   p_B^(1,1)   p_1^(1)   p̄_1   s_1^2
    p_A^(1,2)   p_B^(1,2)   p_1^(2)
    p_A^(2,1)   p_B^(2,1)   p_2^(1)   p̄_2   s_2^2
    p_A^(2,2)   p_B^(2,2)   p_2^(2)
    p_A^(3,1)   p_B^(3,1)   p_3^(1)   p̄_3   s_3^2
    p_A^(3,2)   p_B^(3,2)   p_3^(2)
    p_A^(4,1)   p_B^(4,1)   p_4^(1)   p̄_4   s_4^2
    p_A^(4,2)   p_B^(4,2)   p_4^(2)
    p_A^(5,1)   p_B^(5,1)   p_5^(1)   p̄_5   s_5^2
    p_A^(5,2)   p_B^(5,2)   p_5^(2)

Slide 41: 5x2CV F Test (continued)
• If F > 4.47, then with 95% confidence we can reject the null hypothesis that algorithms A and B have the same error rate when trained on data sets of size m/2.
• An F-test is any statistical test in which the test statistic has an F-distribution if the null hypothesis is true, e.g.:
  – the hypothesis that the means of multiple normally distributed populations, all having the same standard deviation, are equal;
  – the hypothesis that the standard deviations of two normally distributed populations are equal, and thus that they are of comparable origin.
• (A code sketch of the 5x2cv computation appears after the summary below.)

Slide 42: Summary
• ROC curves
• Reject curves
• Precision-recall curves
• Statistical tests
  – Estimating the error rate of a classifier
  – Comparing two classifiers
  – Estimating the error rate of a learning algorithm
  – Comparing two algorithms
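The slides state the decision rule (F > 4.47) but not the statistic itself. The sketch below follows the combined 5x2cv F statistic of Alpaydin (1999), $F = \sum_{i=1}^{5}\sum_{j=1}^{2} (p_i^{(j)})^2 / (2 \sum_{i=1}^{5} s_i^2)$, which is approximately F(10, 5)-distributed under the null hypothesis; treat this as an assumption about what the slide intends. The function name and the use of NumPy/SciPy are my own.

import numpy as np
from scipy import stats

def five_by_two_cv_f(p):
    # p is a 5x2 array: p[i, j] is the difference in error rates of
    # algorithms A and B on split j of replication i of 2-fold CV.
    p = np.asarray(p, dtype=float)
    p_bar = p.mean(axis=1)                        # per-replication mean difference
    s2 = ((p - p_bar[:, None]) ** 2).sum(axis=1)  # per-replication variance
    f = (p ** 2).sum() / (2.0 * s2.sum())
    p_value = stats.f.sf(f, 10, 5)                # upper tail of F(10, 5)
    return f, p_value

# Hypothetical error-rate differences from 5 replications of 2-fold CV:
diffs = [[0.02, 0.03], [0.01, 0.04], [0.03, 0.02], [0.00, 0.05], [0.02, 0.01]]
f_stat, p_value = five_by_two_cv_f(diffs)
print(f_stat, p_value)  # reject "same error rate" at 95% confidence if p_value < 0.05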