Understanding Support Vector Machines (SVM) in Machine Learning

An introduction to support vector machines (SVM), a popular supervised machine learning algorithm used for classification and regression tasks. It describes a support vector machine as an optimally defined surface, typically nonlinear in the input space but linear in a higher-dimensional space, implicitly defined by a kernel function. The document also covers the use of SVMs for classification, regression, and data-fitting, as well as the concepts of maximum margin and support vectors. The slides are from the University of Wisconsin-Madison's CS 540 course, taught by C. R. Dyer.

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009

koofers-user-qwz-2


CS 540, University of Wisconsin-Madison, C. R. Dyer

What is a Support Vector Machine?
• An optimally defined surface
• Typically nonlinear in the input space
• Linear in a higher dimensional space
• Implicitly defined by a kernel function

Acknowledgments: These slides combine and modify ones provided by Andrew Moore (CMU), Glenn Fung (Wisconsin), and Olvi Mangasarian (Wisconsin)

What are Support Vector Machines Used For?
• Classification
• Regression and data-fitting
• Supervised and unsupervised learning

Linear Classifiers
• f(x, w, b) = sign(w · x + b)
• (Figure: a set of 2-D points labeled +1 and -1.) How would you classify this data?

Linear Classifiers (aka Linear Discriminant Functions)
• Definition: a linear classifier is a function that is a linear combination of the components of the input x:
  f(x) = Σ_{j=1..m} w_j x_j + b = w^T x + b
  where w is the weight vector and b is the bias
• A two-category classifier then uses the rule: decide class c1 if f(x) > 0 and class c2 if f(x) < 0, i.e., decide c1 if w^T x > -b and c2 otherwise

Linear Classifiers (continued)
• (Figures: several different lines, each of which separates the +1 points from the -1 points.) How would you classify this data? Any of these would be fine … but which is best?

Computing the Margin
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• (Figure: the plus-plane w · x + b = +1, the decision boundary w · x + b = 0, and the minus-plane w · x + b = -1, with the "Predict Class = +1" zone on one side, the "Predict Class = -1" zone on the other, and the margin M between the two planes.)
• How do we compute M in terms of w and b?
• The vector w is perpendicular to the plus-plane
• Let x⁻ be any point on the minus-plane (any location in R^m, not necessarily a datapoint)
• Let x⁺ be the closest plus-plane point to x⁻
• Claim: x⁺ = x⁻ + λw for some value of λ. Why? The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ we travel some distance in the direction w
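Before the margin derivation continues, a quick numerical check of this geometry may help. The sketch below is a NumPy illustration with made-up values for w and b (it is not part of the original slides): two points on the plus-plane differ by a vector orthogonal to w, and stepping from a minus-plane point along the direction w lands exactly on the plus-plane.

```python
import numpy as np

# Hypothetical weight vector and bias (illustrative values only)
w = np.array([3.0, 4.0])
b = -2.0

# Two arbitrary points on the plus-plane w.x + b = +1 (solve for x2 given x1)
u = np.array([1.0, (1.0 - b - 3.0 * 1.0) / 4.0])
v = np.array([-2.0, (1.0 - b - 3.0 * (-2.0)) / 4.0])
print(np.dot(w, u) + b, np.dot(w, v) + b)  # both +1: u and v lie on the plus-plane
print(np.dot(w, u - v))                    # 0: w is perpendicular to the plus-plane

# A point x_minus on the minus-plane, and the nearest plus-plane point x_plus,
# reached by moving from x_minus along the direction w
x_minus = np.array([0.0, (-1.0 - b) / 4.0])
lam = (1.0 - (np.dot(w, x_minus) + b)) / np.dot(w, w)  # step size that makes w.x + b equal +1
x_plus = x_minus + lam * w
print(np.dot(w, x_plus) + b)               # +1: x_plus lies on the plus-plane
print(np.linalg.norm(x_plus - x_minus))    # 0.4, which the next slide derives as 2 / sqrt(w.w)
```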
Computing the Margin (continued)
• What we know:
  • w · x⁺ + b = +1
  • w · x⁻ + b = -1
  • x⁺ = x⁻ + λw
  • |x⁺ - x⁻| = M
• It's now easy to get M in terms of w and b:
  w · (x⁻ + λw) + b = 1  ⇒  w · x⁻ + b + λ w·w = 1  ⇒  -1 + λ w·w = 1  ⇒  λ = 2 / (w·w)
• Therefore:
  M = |x⁺ - x⁻| = |λw| = λ |w| = λ √(w·w) = 2 √(w·w) / (w·w) = 2 / √(w·w)

Learning the Maximum Margin Classifier
• Given a guess of w and b we can:
  • Compute whether all data points are in the correct half-planes
  • Compute the width of the margin, M = 2 / √(w·w)
• So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?

Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms for maximizing a quadratic function of some real-valued variables subject to linear constraints
• Minimize ||w||² (equivalently, maximize the margin M = 2 / √(w·w)) subject to:
  w · x + b ≥ +1 if x is in class 1
  w · x + b ≤ -1 if x is in class 2

Uh-oh!
• (Figure: +1 and -1 points that no straight line can separate.) This is going to be a problem! What should we do?
• Idea 1: Find the minimum ||w||² while also minimizing the number of training set errors
  • Problem: two things to minimize makes for an ill-defined optimization
• Idea 1.1: Minimize ||w||² + C (#train errors), where C is a tradeoff parameter
  • There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
  • It can't be expressed as a Quadratic Programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So… any other ideas?

• Project examples into some higher dimensional space where the data is linearly separable, defined by z = F(x)
• Training depends only on dot products of the form F(xi) · F(xj)
• Example: F(x) = (x1², √2 x1 x2, x2²), so that K(xi, xj) = F(xi) · F(xj) = (xi · xj)²
• The dimensionality of the z space is generally much larger than the dimension of the input space x
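The kernel identity in the example above is easy to verify numerically. The sketch below is a NumPy illustration, not part of the original slides, and the helper names F and K are made up here: it checks that the explicit map F(x) = (x1², √2 x1 x2, x2²) and the kernel K(xi, xj) = (xi · xj)² produce the same dot product, which is why training never has to form F(x) explicitly.

```python
import numpy as np

def F(x):
    """Explicit feature map for the degree-2 kernel: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(xi, xj):
    """Kernel form of the same dot product: K(xi, xj) = (xi . xj)^2."""
    return np.dot(xi, xj) ** 2

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

print(np.dot(F(xi), F(xj)))  # 1.0 via the explicit mapping (up to floating-point rounding)
print(K(xi, xj))             # 1.0 via the kernel, without ever computing F(x)
```

The same identity is what makes the much higher-dimensional polynomial and radial-basis expansions on the next slides practical.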
Common SVM Basis Functions
• z_k = (polynomial terms of x_k of degree 1 to q)
  For example, when q = 2 and m = 2:
  K(x, y) = (x1 y1 + x2 y2 + 1)² = 1 + 2 x1 y1 + 2 x2 y2 + 2 x1 x2 y1 y2 + x1² y1² + x2² y2²
• z_k = (radial basis functions of x_k): z_k[j] = φ_j(x_k) = KernelFn(|x_k - c_j| / KW)
• z_k = (sigmoid functions of x_k)

SVM Kernel Functions
• K(a, b) = (a · b + 1)^d is an example of an SVM kernel function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function
• Radial-Basis-style kernel function: K(a, b) = exp(-(a - b)² / (2σ²))
• Neural-Net-style kernel function: K(a, b) = tanh(κ a · b - δ)
• σ, κ, and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM

The Federalist Papers
• Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the constitution
• The papers consisted of short essays, 900 to 3500 words in length
• The authorship of 12 of those papers has been in dispute (Madison or Hamilton); these papers are referred to as the disputed Federalist papers

Description of the Data
• For every paper:
  • Machine-readable text was created using a scanner
  • Relative frequencies were computed for 70 words that Mosteller-Wallace identified as good candidates for author-attribution studies
  • Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies
• The dataset consists of 118 papers:
  • 50 Madison papers
  • 56 Hamilton papers
  • 12 disputed papers

Function Words Based on Relative Frequencies
• (Table of the 70 function words; not reproduced in this preview.)

SLA Feature Selection for Classifying the Disputed Federalist Papers
• Apply the SVM Successive Linearization Algorithm for feature selection to:
  • Train on the 106 Federalist papers with known authors
  • Find a classification hyperplane that uses as few words as possible
  • Use the hyperplane to classify the 12 disputed papers

Hyperplane Classifier Using 3 Words
• A hyperplane depending on three word frequencies was found: 0.537 (to) + 24.663 (upon) + 2.953 (would) = 66.616
• All disputed papers ended up on the Madison side of the plane

Results: 3D Plot of Hyperplane
• (3-D plot of the separating plane in the "to"/"upon"/"would" frequency space; not reproduced in this preview.)

Multi-Class Classification
• SVMs can only handle two-class outputs
• What can be done?
• Answer: for N-class problems, learn N SVMs:
  • SVM 1, f1, learns "Output = 1" vs "Output ≠ 1"
  • SVM 2, f2, learns "Output = 2" vs "Output ≠ 2"
  • …
  • SVM N, fN, learns "Output = N" vs "Output ≠ N"
• Ideally, only one fi(x) > 0 and all the others are < 0, but this is often not the case in practice
• Instead, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction furthest into the positive region: classify as class Ci if fi(x) = max over j of fj(x)
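The one-vs-rest recipe above can be sketched in a few lines. The example below assumes scikit-learn and a small synthetic three-class dataset, neither of which appears in the original slides: it trains one binary SVM fi per class and assigns a new point to the class whose fi pushes it furthest into the positive region.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic 3-class data: three Gaussian blobs in 2-D (illustrative only)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

# Train one binary SVM per class: "class i" vs "not class i"
classifiers = []
for i in range(3):
    f_i = SVC(kernel="linear", C=1.0)
    f_i.fit(X, np.where(y == i, 1, -1))
    classifiers.append(f_i)

# Classify a new point as class Ci where f_i(x) is largest
x_new = np.array([[2.8, 0.2]])
scores = [f_i.decision_function(x_new)[0] for f_i in classifiers]
print(int(np.argmax(scores)))  # expected: 1 (x_new sits in the blob centered at (3, 0))
```

scikit-learn can also handle multi-class labels directly; the explicit loop here is only meant to mirror the construction described in the slides.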