Understanding Support Vector Machines (SVM) in Machine Learning
CS 540, University of Wisconsin-Madison, C. R. Dyer

What is a Support Vector Machine?
• An optimally defined surface
• Typically nonlinear in the input space
• Linear in a higher-dimensional space
• Implicitly defined by a kernel function

Acknowledgments: These slides combine and modify ones provided by Andrew Moore (CMU), Glenn Fung (Wisconsin), and Olvi Mangasarian (Wisconsin)

What are Support Vector Machines Used For?
• Classification
• Regression and data-fitting
• Supervised and unsupervised learning

Linear Classifiers
[Figure: 2-D data points x with labels y, where one symbol denotes class +1 and the other denotes class -1]
f(x, w, b) = sign(w · x + b)
How would you classify this data?

Linear Classifiers (aka Linear Discriminant Functions)
• Definition: a linear classifier is a function that is a linear combination of the components of the input x,
  f(x) = Σ_{j=1..m} wj xj + b = w · x + b
  where w is the weight vector and b is the bias
• A two-category classifier then uses the rule:
  Decide class c1 if f(x) > 0 and class c2 if f(x) < 0
  ⇔ decide c1 if w · x > -b and c2 otherwise

Linear Classifiers
[Figure: the same data with several different separating lines, each of which classifies every point correctly]
Any of these would be fine …
… but which is best?
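To make the decision rule above concrete, here is a minimal Python sketch (not part of the original slides). The toy 2-D dataset and the candidate (w, b) pairs are made-up values, chosen only to show that several different linear separators can classify the same data perfectly, which is what motivates asking which one is best.

```python
import numpy as np

def linear_classifier(x, w, b):
    """Decision rule from the slides: f(x, w, b) = sign(w . x + b)."""
    return np.sign(np.dot(w, x) + b)

# Toy data: class +1 clustered near (2, 2), class -1 near (-2, -2).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# Several hypothetical (w, b) pairs -- each one separates this data correctly.
candidates = [
    (np.array([1.0, 1.0]), 0.0),
    (np.array([1.0, 0.0]), 0.5),
    (np.array([0.3, 1.0]), -0.2),
]
for w, b in candidates:
    preds = np.array([linear_classifier(x, w, b) for x in X])
    print(w, b, "classifies all points correctly:", bool(np.all(preds == y)))
```

All three candidates print True, so accuracy on the training data alone cannot pick between them; the margin, introduced next, is the tie-breaker.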
Computing the Margin
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the plus-plane
• Let x⁻ be any point on the minus-plane (any location in Rᵐ, not necessarily a datapoint)
• Let x⁺ be the closest plus-plane point to x⁻
• Claim: x⁺ = x⁻ + λw for some value of λ. Why? The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ we travel some distance in the direction of w.
[Figure: "Predict Class = +1" zone above the plane w · x + b = +1, "Predict Class = -1" zone below w · x + b = -1, with margin M between the two planes]
How do we compute M in terms of w and b?

Computing the Margin
What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M
It's now easy to get M in terms of w and b:
  w · (x⁻ + λw) + b = 1
  ⇒ w · x⁻ + b + λ w · w = 1
  ⇒ -1 + λ w · w = 1
  ⇒ λ = 2 / (w · w)
and therefore
  M = |x⁺ - x⁻| = |λw| = λ |w| = λ √(w · w) = 2 √(w · w) / (w · w) = 2 / √(w · w) = 2 / ||w||

Learning the Maximum Margin Classifier
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that classifies all the data points correctly. How?

Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints
• Here: minimize ||w||² subject to
  w · x + b ≥ +1 if x is in class 1
  w · x + b ≤ -1 if x is in class 2

Uh-oh!
[Figure: data in which the two classes cannot be separated by any line]
This is going to be a problem! What should we do?
• Idea 1: find the minimum ||w||² while also minimizing the number of training-set errors
  Problem: two things to minimize makes for an ill-defined optimization
• Idea 1.1: minimize ||w||² + C (#train errors), where C is a tradeoff parameter
  There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
  This objective can't be expressed as a quadratic programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So … any other ideas?

• Project the examples into some higher-dimensional space where the data is linearly separable, defined by z = F(x)
• Training depends only on dot products of the form F(xi) · F(xj)
• Example: K(xi, xj) = F(xi) · F(xj) = (xi · xj)², with F(x) = (x1², √2 x1x2, x2²) for 2-D inputs
• The dimensionality of the z space is generally much larger than the dimension of the input space x
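Returning to the "Learning via Quadratic Programming" slide above: the hard-margin problem can be handed directly to an off-the-shelf QP solver. Below is a rough sketch of one way to do that, assuming the third-party cvxpy package is available and reusing a small made-up separable dataset; it is an illustration, not the course's reference implementation.

```python
import numpy as np
import cvxpy as cp  # assumed available; any QP solver could be substituted

# Made-up linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# Minimize ||w||^2 subject to y_i (w . x_i + b) >= 1, which combines the two
# constraints on the slide: w . x + b >= +1 for class 1, <= -1 for class 2.
objective = cp.Minimize(cp.sum_squares(w))
constraints = [y[i] * (X[i] @ w + b) >= 1 for i in range(len(y))]
cp.Problem(objective, constraints).solve()

w_opt, b_opt = w.value, b.value
print("w =", w_opt, " b =", b_opt)
print("margin width M = 2 / ||w|| =", 2 / np.linalg.norm(w_opt))
```

The printed margin width is the same 2 / √(w · w) expression derived on the "Computing the Margin" slides.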
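The example kernel just above, K(xi, xj) = (xi · xj)² with the explicit map F(x) = (x1², √2 x1x2, x2²), is easy to check numerically. A minimal sketch with made-up input vectors:

```python
import numpy as np

def F(x):
    """Explicit degree-2 feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def K(a, b):
    """Kernel trick: the same value computed in the input space, no F needed."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
print(np.dot(F(a), F(b)))  # dot product in the higher-dimensional z space
print(K(a, b))             # identical value (up to rounding), here ~1.0
```

Because training only ever needs these dot products, the (possibly huge) z space never has to be constructed explicitly.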
Common SVM Basis Functions
• zk = (polynomial terms of xk of degree 1 to q)
  For example, when q = 2 and m = 2:
  K(x, y) = (x1y1 + x2y2 + 1)² = 1 + 2x1y1 + 2x2y2 + 2x1x2y1y2 + x1²y1² + x2²y2²
• zk = (radial basis functions of xk): zk[j] = φj(xk) = KernelFn(|xk - cj| / KW)
• zk = (sigmoid functions of xk)

SVM Kernel Functions
• K(a, b) = (a · b + 1)^d is an example of an SVM kernel function
• Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function
• Radial-basis-style kernel function: K(a, b) = exp(-(a - b)² / (2σ²))
• Neural-net-style kernel function: K(a, b) = tanh(κ a · b - δ)
• σ, κ and δ are "magic" parameters that must be chosen by a model selection method such as cross-validation (CV) or VCSRM

The Federalist Papers
• Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the constitution
• The papers consisted of short essays, 900 to 3500 words in length
• The authorship of 12 of those papers has been in dispute (Madison or Hamilton); these are referred to as the disputed Federalist papers

Description of the Data
• For every paper:
  • Machine-readable text was created using a scanner
  • Relative frequencies were computed for 70 words that Mosteller and Wallace identified as good candidates for author-attribution studies
  • Each document is represented as a vector of the 70 real numbers corresponding to the 70 word frequencies
• The dataset consists of 118 papers:
  • 50 Madison papers
  • 56 Hamilton papers
  • 12 disputed papers

Function Words Based on Relative Frequencies
[Table of the 70 function words not reproduced]

SLA Feature Selection for Classifying the Disputed Federalist Papers
Apply the SVM Successive Linearization Algorithm (SLA) for feature selection to:
• Train on the 106 Federalist papers with known authors
• Find a classification hyperplane that uses as few words as possible
• Use the hyperplane to classify the 12 disputed papers

Hyperplane Classifier Using 3 Words
• A hyperplane depending on three words was found:
  0.537 (to) + 24.663 (upon) + 2.953 (would) = 66.616
  where each word stands for its relative frequency in a document
• All disputed papers ended up on the Madison side of the plane

Results: 3D Plot of Hyperplane
[3-D plot of the separating plane not reproduced]

Multi-Class Classification
• SVMs can only handle two-class outputs
• What can be done?
• Answer: for N-class problems, learn N SVMs:
  • SVM 1, f1, learns "Output = 1" vs "Output ≠ 1"
  • SVM 2, f2, learns "Output = 2" vs "Output ≠ 2"
  • …
  • SVM N, fN, learns "Output = N" vs "Output ≠ N"

Multi-Class Classification
• Ideally only one fi(x) > 0 and all the others are < 0, but this is often not the case in practice
• Instead, to predict the output for a new input, run each SVM and find the one that puts the prediction furthest into the positive region:
• Classify as class Ci if fi(x) = max over all j of fj(x) (see the sketch below)
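As the sketch referenced on the last slide, here is a small hypothetical Python example of the one-vs-rest prediction rule; the weight vectors and biases are made-up placeholders standing in for N = 3 already-trained linear SVMs.

```python
import numpy as np

# Placeholder parameters for three one-vs-rest SVMs f_i(x) = w_i . x + b_i.
W = np.array([[ 1.0,  0.5],   # "Output = 1" vs "Output != 1"
              [-0.5,  1.0],   # "Output = 2" vs "Output != 2"
              [-1.0, -1.0]])  # "Output = 3" vs "Output != 3"
b = np.array([0.0, -0.2, 0.1])

def predict(x):
    """Classify as the class whose SVM puts x furthest into the positive region."""
    scores = W @ x + b                 # f_i(x) for every one-vs-rest SVM
    return int(np.argmax(scores)) + 1  # class labels are 1..N

print(predict(np.array([2.0, 1.0])))   # -> 1 for this made-up input
```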