Overview: These notes give an overview of support vector machines (SVMs), a supervised machine learning algorithm used for classification and regression analysis. The main objective is to construct a hyperplane with the maximum margin of separation between two classes. Topics include the derivation of the SVM based on structural risk minimization, the calculation of the distance to the optimal hyperplane, the concept of support vectors, and the primal and dual optimization problems and their solutions.


Haykin Chapter 6: Support-Vector Machines

CPSC 636-

Instructor: Yoonsuck Choe, Spring 2008

Note: Part of this lecture drew material from Ricardo Gutierrez-Osuna’s Pattern Analysis lectures.


Introduction

  • The support vector machine (SVM) is a linear machine with some very nice properties.
  • The basic idea of the SVM is to construct a separating hyperplane such that the margin of separation between positive and negative examples is maximized.
  • Principled derivation: structural risk minimization. The generalization error rate is bounded by (1) the training error rate and (2) the VC dimension of the model; the SVM makes (1) zero and minimizes (2).


Optimal Hyperplane

For linearly separable patterns $\{(x_i, d_i)\}_{i=1}^N$ with $d_i \in \{+1, -1\}$:

  • The separating hyperplane is $w^T x + b = 0$, with

    $w^T x + b \ge 0$ for $d_i = +1$
    $w^T x + b < 0$ for $d_i = -1$

  • Let $w_o$ be the weight vector of the optimal hyperplane and $b_o$ the optimal bias.

Distance to the Optimal Hyperplane

[Figure: a point x_i at angle θ from w_o; d is the distance from the origin to the hyperplane and r the distance from the point to the hyperplane.]

  • From $w_o^T x = -b_o$, the distance from the origin to the hyperplane is calculated as

    $d = \|x_i\| \cos(x_i, w_o) = \frac{-b_o}{\|w_o\|}$

Distance to the Optimal Hyperplane (cont'd)

  • The distance $r$ from an arbitrary point $x$ to the hyperplane can be calculated as follows.

    - When the point is in the positive area:

      $r = \|x\| \cos(x, w_o) - d = \frac{x^T w_o}{\|w_o\|} + \frac{b_o}{\|w_o\|} = \frac{x^T w_o + b_o}{\|w_o\|}$

    - When the point is in the negative area:

      $r = d - \|x\| \cos(x, w_o) = -\frac{x^T w_o}{\|w_o\|} - \frac{b_o}{\|w_o\|} = -\frac{x^T w_o + b_o}{\|w_o\|}$
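As a quick illustrative sketch (not from the original slides), the signed-distance formula can be checked numerically; the hyperplane parameters w_o and b_o below are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters (for illustration only).
w_o = np.array([3.0, 4.0])
b_o = -5.0

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w^T x + b = 0.

    Positive on the side the normal w points to, negative on the other side.
    """
    return (w @ x + b) / np.linalg.norm(w)

x_pos = np.array([3.0, 2.0])   # w_o^T x + b_o = 9 + 8 - 5 = 12 > 0
x_neg = np.array([0.0, 0.0])   # w_o^T x + b_o = -5 < 0

print(signed_distance(x_pos, w_o, b_o))   # 12 / 5 = 2.4
print(signed_distance(x_neg, w_o, b_o))   # -5 / 5 = -1.0
```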


Optimal Hyperplane and Support Vectors

[Figure: a separating hyperplane between two classes (x and o); the points closest to the hyperplane on either side are the support vectors, and ρ is the margin of separation.]

  • Support vectors: input points closest to the separating hyperplane.
  • Margin of separation ρ: distance between the separating hyperplane and the closest input point.


Optimal Hyperplane and Support Vectors (cont’d)

  • The optimal hyperplane is the one that maximizes the margin of separation ρ.

  • With that requirement, we can write the conditions that $w_o$ and $b_o$ must meet:

    $w_o^T x + b_o \ge +1$ for $d_i = +1$
    $w_o^T x + b_o \le -1$ for $d_i = -1$

    Note the $\ge +1$ and $\le -1$; support vectors are those $x^{(s)}$ for which equality holds (i.e., $w_o^T x^{(s)} + b_o = +1$ or $-1$).

  • Since $r = (w_o^T x + b_o)/\|w_o\|$, for a support vector

    $r = \begin{cases} 1/\|w_o\| & \text{if } d = +1 \\ -1/\|w_o\| & \text{if } d = -1 \end{cases}$

Optimal Hyperplane and Support Vectors (cont’d)

[Figure: the same separating hyperplane, support vectors, and margin ρ as in the previous slide.]

  • The margin of separation between the two classes is

    $\rho = 2r = \frac{2}{\|w_o\|}$.

  • Thus, maximizing the margin of separation between the two classes is equivalent to minimizing the Euclidean norm of the weight vector $w_o$!
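For a quick numeric illustration (values made up here, not from the slides): if $w_o = (3, 4)^T$, then $\|w_o\| = 5$ and the margin is $\rho = 2/\|w_o\| = 0.4$; halving the norm of $w_o$ would double the margin.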

Primal Problem: Constrained Optimization

For the training set $T = \{(x_i, d_i)\}_{i=1}^N$, find $w$ and $b$ such that

  • they minimize a certain value (essentially $1/\rho$) while satisfying a constraint (all examples are correctly classified):

    - Constraint: $d_i(w^T x_i + b) \ge 1$ for $i = 1, 2, ..., N$.
    - Cost function: $\Phi(w) = \frac{1}{2} w^T w$.

This problem can be solved using the method of Lagrange multipliers (see the next two slides).


Mathematical Aside: Lagrange Multipliers

Turn a constrained optimization problem into an unconstrained optimization problem by absorbing the constraints into the cost function, weighted by the Lagrange multipliers.

Example: Find the closest point on the circle $x^2 + y^2 = 1$ to the point $(2, 3)$ (adapted from Ballard, An Introduction to Natural Computation, 1997, pp. 119–120).

  • Minimize $F(x, y) = (x - 2)^2 + (y - 3)^2$ subject to the constraint $x^2 + y^2 - 1 = 0$.

  • Absorb the constraint into the cost function, after multiplying by the Lagrange multiplier $\alpha$:

    $F(x, y, \alpha) = (x - 2)^2 + (y - 3)^2 + \alpha(x^2 + y^2 - 1)$.


Lagrange Multipliers (cont’d)

We must find $x$, $y$, and $\alpha$ that minimize $F(x, y, \alpha) = (x - 2)^2 + (y - 3)^2 + \alpha(x^2 + y^2 - 1)$. Set the partial derivatives to 0 and solve the resulting system of equations:

  $\partial F/\partial x = 2(x - 2) + 2\alpha x = 0$
  $\partial F/\partial y = 2(y - 3) + 2\alpha y = 0$
  $\partial F/\partial \alpha = x^2 + y^2 - 1 = 0$

Solve for $x$ and $y$ in the first two equations and substitute into the third:

  $x = \frac{2}{1 + \alpha}, \quad y = \frac{3}{1 + \alpha}, \quad \text{so} \quad \left(\frac{2}{1 + \alpha}\right)^2 + \left(\frac{3}{1 + \alpha}\right)^2 = 1,$

from which we get $\alpha = \sqrt{13} - 1$ (taking the root that gives the minimum). Thus, $(x, y) = (2/\sqrt{13}, 3/\sqrt{13})$.
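As an illustrative check (not part of the original slides), the same constrained problem can be handed to a generic numerical solver; the result should match the point $(2/\sqrt{13}, 3/\sqrt{13}) \approx (0.5547, 0.8321)$ obtained above. The use of scipy here is my own choice, not something from the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Objective: squared distance from (x, y) to the target point (2, 3).
def objective(p):
    x, y = p
    return (x - 2.0) ** 2 + (y - 3.0) ** 2

# Equality constraint: the point must lie on the unit circle x^2 + y^2 = 1.
circle = {"type": "eq", "fun": lambda p: p[0] ** 2 + p[1] ** 2 - 1.0}

result = minimize(objective, x0=[1.0, 0.0], method="SLSQP", constraints=[circle])
print(result.x)                               # approx. [0.5547, 0.8321]
print(np.array([2.0, 3.0]) / np.sqrt(13.0))   # analytic answer for comparison
```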

Primal Problem: Constrained Optimization (cont’d)

Putting the constrained optimization problem into Lagrangian form (utilizing the Kuhn-Tucker theorem), we get

  $J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ d_i (w^T x_i + b) - 1 \right]$.

  • From $\partial J(w, b, \alpha) / \partial w = 0$:

    $w = \sum_{i=1}^{N} \alpha_i d_i x_i$.

  • From $\partial J(w, b, \alpha) / \partial b = 0$:

    $\sum_{i=1}^{N} \alpha_i d_i = 0$.

Primal Problem: Constrained Optimization (cont’d)

  • Note that when the optimal solution is reached, the following condition must hold (the Karush-Kuhn-Tucker complementarity condition):

    $\alpha_i \left[ d_i (w^T x_i + b) - 1 \right] = 0$ for all $i = 1, 2, ..., N$.

  • Thus, a non-zero $\alpha_i$ can be attained only when $d_i (w^T x_i + b) - 1 = 0$, i.e., when $\alpha_i$ is associated with a support vector $x^{(s)}$!

  • Other conditions include $\alpha_i \ge 0$.


Primal Problem: Constrained Optimization (cont’d)

  • Plugging $w = \sum_{i=1}^{N} \alpha_i d_i x_i$ and $\sum_{i=1}^{N} \alpha_i d_i = 0$ back into $J(w, b, \alpha)$, we get the dual problem:

    $J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ d_i (w^T x_i + b) - 1 \right]$

    $= \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i - b \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i$

    Noting that $w^T w = \sum_{i=1}^{N} \alpha_i d_i w^T x_i$ and that $\sum_{i=1}^{N} \alpha_i d_i = 0$,

    $= -\frac{1}{2} \sum_{i=1}^{N} \alpha_i d_i w^T x_i + \sum_{i=1}^{N} \alpha_i$

    $= -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^T x_j + \sum_{i=1}^{N} \alpha_i = Q(\alpha)$.

  • So, $J(w, b, \alpha) = Q(\alpha)$ (with $\alpha_i \ge 0$).

  • This results in the dual problem (next slide).

Dual Problem

  • Given the training sample $\{(x_i, d_i)\}_{i=1}^N$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^N$ that maximize the objective function

    $Q(\alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^T x_j + \sum_{i=1}^{N} \alpha_i$

    subject to the constraints

    - $\sum_{i=1}^{N} \alpha_i d_i = 0$
    - $\alpha_i \ge 0$ for all $i = 1, 2, ..., N$.

  • The problem is stated entirely in terms of the training data $(x_i, d_i)$, and the dot products $x_i^T x_j$ play a key role.

Solution to the Optimization Problem

Once all the optimal Lagrange multipliers $\alpha_{o,i}$ are found, $w_o$ and $b_o$ can be found as follows:

  $w_o = \sum_{i=1}^{N} \alpha_{o,i} d_i x_i$

and, from $w_o^T x_i + b_o = d_i$ when $x_i$ is a support vector,

  $b_o = d^{(s)} - w_o^T x^{(s)}$.

Note: evaluating the final decision function does not require any explicit calculation of $w_o$, since $w_o^T x$ can be computed from dot products between input vectors:

  $w_o^T x = \sum_{i=1}^{N} \alpha_{o,i} d_i x_i^T x$.
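To make the pipeline concrete, here is a minimal numerical sketch (not from the slides) that solves the dual for a tiny, made-up 2-D linearly separable data set with a general-purpose solver and then recovers $w_o$ and $b_o$ from the nonzero $\alpha_i$. In practice a dedicated QP solver would be used; the data and solver choice here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, made-up linearly separable data set (labels d_i in {+1, -1}).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
d = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
N = len(d)

# Gram matrix of dot products x_i^T x_j, scaled by the labels d_i d_j.
H = (d[:, None] * d[None, :]) * (X @ X.T)

def neg_Q(alpha):
    # Negative of the dual objective Q(alpha), since the solver minimizes.
    return 0.5 * alpha @ H @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ d}]   # sum_i alpha_i d_i = 0
bounds = [(0.0, None)] * N                               # alpha_i >= 0

res = minimize(neg_Q, x0=np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

# Recover w_o and b_o from a support vector (nonzero alpha).
w_o = (alpha * d) @ X
sv = np.argmax(alpha)                  # index of one support vector
b_o = d[sv] - w_o @ X[sv]

print(np.round(alpha, 3))              # nonzero only for the support vectors
print(w_o, b_o)
print(np.sign(X @ w_o + b_o))          # should reproduce the labels d
```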

Margin of Separation in SVM and VC Dimension

Statistical learning theory shows that it is desirable to reduce both the error (empirical risk) and the VC dimension of the classifier.

  • Vapnik (1995, 1998) showed the following. Let $D$ be the diameter of the smallest ball containing all input vectors $x_i$. The set of optimal hyperplanes defined by $w_o^T x + b_o = 0$ has a VC dimension $h$ bounded from above as

    $h \le \min\left\{ \left\lceil \frac{D^2}{\rho^2} \right\rceil, m_0 \right\}$

    where $\lceil \cdot \rceil$ is the ceiling, $\rho$ is the margin of separation, equal to $2/\|w_o\|$, and $m_0$ is the dimensionality of the input space.

  • The implication is that the VC dimension can be controlled independently of $m_0$ by choosing an appropriate (large) $\rho$!
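For instance (numbers made up purely for illustration): if all inputs fit in a ball of diameter $D = 4$ and the achieved margin is $\rho = 1$ in an input space of dimension $m_0 = 100$, the bound gives $h \le \min\{\lceil 16 \rceil, 100\} = 16$, far below the input dimensionality.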


Soft-Margin Classification

[Figure: a separating hyperplane with margin ρ and its support vectors; some points fall inside the margin, either correctly classified or incorrectly classified (on the wrong side of the hyperplane).]

  • In some problems the data can violate the condition

    $d_i (w^T x_i + b) \ge 1$.

  • We can introduce a new set of variables $\{\xi_i\}_{i=1}^N$:

    $d_i (w^T x_i + b) \ge 1 - \xi_i$

    where $\xi_i$ is called the slack variable.


Soft-Margin Classification (cont’d)

  • We want to find a separating hyperplane that minimizes

    $\Phi(\xi) = \sum_{i=1}^{N} I(\xi_i - 1)$

    where $I(\xi) = 0$ if $\xi \le 0$ and $1$ otherwise.

  • Solving the above is NP-complete, so we instead solve an approximation:

    $\Phi(\xi) = \sum_{i=1}^{N} \xi_i$

  • Furthermore, the weight vector can be factored in:

    $\Phi(w, \xi) = \underbrace{\tfrac{1}{2} w^T w}_{\text{controls VC dim}} + C \underbrace{\sum_{i=1}^{N} \xi_i}_{\text{controls error}}$

    with a control parameter $C$.
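As an illustrative sketch (using scikit-learn, which is my own choice and not something from the slides), the parameter $C$ trades the margin width against training errors: a small $C$ tolerates more margin violations, while a large $C$ penalizes them heavily. The data set below is made up and contains one overlapping point, so a hard margin is impossible.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D data; the point (0.4, 0.3) with label +1 sits among the negatives.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [0.4, 0.3],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
d = np.array([+1, +1, +1, +1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, d)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])   # rho = 2 / ||w||
    errors = np.sum(clf.predict(X) != d)
    print(f"C={C:>6}: margin={margin:.3f}, training errors={errors}")
```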

Soft-Margin Classification: Solution

  • Following a similar route involving Lagrange multipliers, with a more restrictive condition $0 \le \alpha_i \le C$, we get the solution:

    $w_o = \sum_{i=1}^{N_s} \alpha_{o,i} d_i x_i$

    $b_o = d_i (1 - \xi_i) - w_o^T x_i$


Nonlinear SVM

[Figure: a nonlinear mapping φ(·) takes an input vector x_i in the input space to φ(x_i) in the feature space.]

  • Nonlinear mapping of an input vector to a high-dimensional feature space (exploit Cover’s theorem)
  • Construction of an optimal hyperplane for separating the features identified in the above step.


Inner-Product Kernel

  • The input $x$ is mapped to $\varphi(x)$.

  • With the weight $w$ (including the bias $b$), the decision surface in the feature space becomes (assuming $\varphi_0(x) = 1$):

    $w^T \varphi(x) = 0$

  • Using the same steps as in the linear SVM, we get

    $w = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i)$

  • Combining the two equations above, we get the decision surface

    $\sum_{i=1}^{N} \alpha_i d_i \varphi^T(x_i) \varphi(x) = 0$.


Inner-Product Kernel (cont’d)

  • The inner product $\varphi^T(x) \varphi(x_i)$ is between two vectors in the feature space.

  • The calculation of this inner product can be simplified by using an inner-product kernel $K(x, x_i)$:

    $K(x, x_i) = \varphi^T(x) \varphi(x_i) = \sum_{j=1}^{m_1} \varphi_j(x) \varphi_j(x_i)$

    where $m_1$ is the dimension of the feature space. (Note: $K(x, x_i) = K(x_i, x)$.)

  • So the optimal hyperplane becomes:

    $\sum_{i=1}^{N} \alpha_i d_i K(x, x_i) = 0$

Inner-Product Kernel (cont’d)

  • Mercer's theorem states that any kernel $K(x, x_i)$ satisfying certain conditions (continuous, symmetric, positive semi-definite) can be expressed as an inner product in a nonlinearly mapped feature space.

  • The kernel function $K(x, x_i)$ allows us to calculate the inner product $\varphi^T(x) \varphi(x_i)$ in the mapped feature space without any explicit evaluation of the mapping function $\varphi(\cdot)$.

Examples of Kernel Functions

  • Linear: $K(x, x_i) = x^T x_i$.
  • Polynomial: $K(x, x_i) = (x^T x_i + 1)^p$.
  • RBF: $K(x, x_i) = \exp\left( -\frac{1}{2\sigma^2} \|x - x_i\|^2 \right)$.
  • Two-layer perceptron: $K(x, x_i) = \tanh\left( \beta_0 x^T x_i + \beta_1 \right)$ (for some $\beta_0$ and $\beta_1$).
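A minimal sketch of these kernel functions in code (the parameter values p, sigma, beta0, and beta1 are arbitrary examples, not values from the slides):

```python
import numpy as np

def linear_kernel(x, xi):
    return x @ xi

def polynomial_kernel(x, xi, p=2):
    return (x @ xi + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def mlp_kernel(x, xi, beta0=1.0, beta1=-1.0):
    # Two-layer perceptron kernel; satisfies Mercer's conditions only for some beta0, beta1.
    return np.tanh(beta0 * (x @ xi) + beta1)

x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xi), polynomial_kernel(x, xi), rbf_kernel(x, xi))
```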


Kernel Example

  • Expanding $K(x, x_i) = (1 + x^T x_i)^2$ with $x = [x_1, x_2]^T$ and $x_i = [x_{i1}, x_{i2}]^T$:

    $K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
    $= [1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2] \; [1,\; x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2}]^T$
    $= \varphi(x)^T \varphi(x_i)$,

    where $\varphi(x) = [1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2]^T$.
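As a quick numerical check of this identity (a sketch with arbitrary test vectors, not from the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel K(x, xi) = (1 + x^T xi)^2 in 2-D."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

x = np.array([1.0, 2.0])
xi = np.array([0.5, -1.0])

k_direct = (1.0 + x @ xi) ** 2          # kernel computed in the input space
k_mapped = phi(x) @ phi(xi)             # inner product in the feature space
print(k_direct, k_mapped)               # the two values should match (0.25 here)
```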


Nonlinear SVM: Solution

  • The solution is basically the same as in the linear case, with $x^T x_i$ replaced by $K(x, x_i)$ and the additional constraint $\alpha_i \le C$ added.

Nonlinear SVM Summary

Project the input to a high-dimensional space to turn the problem into a linearly separable one. Issues with projecting to a higher-dimensional feature space (see the sketch below):

  • Statistical problem: danger of invoking the curse of dimensionality and a higher chance of overfitting.
    - Remedy: use large margins to reduce the VC dimension.

  • Computational problem: computational overhead of calculating the mapping $\varphi(\cdot)$.
    - Remedy: solve by using the kernel trick.
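As a closing illustrative sketch (again using scikit-learn, which is an assumption of mine rather than part of the slides), a kernel SVM handles data that are not linearly separable in the input space; here an RBF kernel separates two concentric rings while a linear kernel cannot.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # typically near chance
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))       # typically close to 1.0
```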