Haykin Chapter 6: Support-Vector Machines
CPSC 636-
Instructor: Yoonsuck Choe Spring 2008
Note: Part of this lecture drew material from Ricardo Gutierrez-Osuna’s Pattern Analysis lectures.
Introduction
- The support vector machine is a linear machine with some very nice properties.
- The basic idea of SVM is to construct a separating hyperplane where the margin of separation between positive and negative examples is maximized.
- Principled derivation: structural risk minimization. The error rate is bounded by (1) the training error rate and (2) the VC dimension of the model; SVM makes (1) zero and minimizes (2).
Optimal Hyperplane
For linearly separable patterns {(x_i, d_i)}_{i=1}^N (with d_i ∈ {+1, −1}):
- The separating hyperplane is w^T x + b = 0:
  w^T x + b ≥ 0 for d_i = +1
  w^T x + b < 0 for d_i = −1
- Let w_o and b_o denote the weight vector and bias of the optimal hyperplane.
Distance to the Optimal Hyperplane
[Figure: a point x_i and the hyperplane with normal w_o; d is the distance from the origin to the hyperplane and r the distance from x_i to the hyperplane (d + r along the direction of w_o).]
- From w_o^T x_i = −b_o (for any point x_i on the hyperplane), the distance from the origin to the hyperplane is calculated as
  d = ‖x_i‖ cos(x_i, w_o) = −b_o / ‖w_o‖
Distance to the Optimal Hyperplane (cont’d)
- The distance r from an arbitrary point x to the hyperplane can be calculated as:
  - When the point is in the positive area:
    r = ‖x‖ cos(x, w_o) − d = x^T w_o/‖w_o‖ + b_o/‖w_o‖ = (x^T w_o + b_o)/‖w_o‖.
  - When the point is in the negative area:
    r = d − ‖x‖ cos(x, w_o) = −x^T w_o/‖w_o‖ − b_o/‖w_o‖ = −(x^T w_o + b_o)/‖w_o‖.
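A quick numerical illustration (not from the original slides; w, b, and the test points below are made-up values): the signed quantity (w^T x + b)/‖w‖ gives r, positive on the positive side and negative on the negative side.

```python
import numpy as np

# Hypothetical hyperplane parameters, for illustration only.
w = np.array([2.0, 1.0])   # plays the role of w_o
b = -4.0                   # plays the role of b_o

def signed_distance(x, w, b):
    """Signed distance (w^T x + b) / ||w|| from x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

for x in [np.array([3.0, 1.0]),    # positive side: w^T x + b = 3 > 0
          np.array([1.0, 0.0])]:   # negative side: w^T x + b = -2 < 0
    print(x, signed_distance(x, w, b))
```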
Optimal Hyperplane and Support Vectors
[Figure: two classes of points (x and o) separated by the optimal hyperplane; the points nearest the hyperplane are the support vectors, and ρ marks the margin of separation.]
- Support vectors: input points closest to the separating hyperplane.
- Margin of separation ρ: distance between the separating hyperplane and the closest input point.
Optimal Hyperplane and Support Vectors (cont’d)
- The optimal hyperplane is the one that maximizes the margin of separation ρ.
- With that requirement, we can write the conditions that w_o and b_o must meet:
  w_o^T x_i + b_o ≥ +1 for d_i = +1
  w_o^T x_i + b_o ≤ −1 for d_i = −1
  Note: the bounds are ≥ +1 and ≤ −1, and the support vectors are those x^(s) where equality holds (i.e., w_o^T x^(s) + b_o = +1 or −1).
- Since r = (w_o^T x + b_o)/‖w_o‖, at a support vector:
  r = 1/‖w_o‖  if d = +1
  r = −1/‖w_o‖  if d = −1
Optimal Hyperplane and Support Vectors (cont’d)
- The margin of separation between the two classes is
  ρ = 2r = 2/‖w_o‖.
- Thus, maximizing the margin of separation between the two classes is equivalent to minimizing the Euclidean norm of the weight vector w_o!
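A tiny numerical check (not from the slides; the weight vector below is a made-up example): the margin is simply 2/‖w_o‖.

```python
import numpy as np

w_o = np.array([3.0, 4.0])        # hypothetical optimal weight vector
rho = 2.0 / np.linalg.norm(w_o)   # margin of separation rho = 2 / ||w_o||
print(rho)                        # 0.4
```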
Primal Problem: Constrained Optimization
For the training set T = {(x_i, d_i)}_{i=1}^N, find w and b such that
- they minimize a certain value (essentially 1/ρ) while satisfying a constraint (all examples are correctly classified):
  - Constraint: d_i(w^T x_i + b) ≥ 1 for i = 1, 2, ..., N.
  - Cost function: Φ(w) = (1/2) w^T w.
This problem can be solved using the method of Lagrange multipliers (see next two slides).
Mathematical Aside: Lagrange Multipliers
Turn a constrained optimization problem into an unconstrained optimization problem by absorbing the constraints into the cost function, weighted by the Lagrange multipliers.
Example: Find the closest point on the circle x^2 + y^2 = 1 to the point (2, 2) (adapted from Ballard, An Introduction to Natural Computation, 1997, pp. 119–120).
- Minimize F(x, y) = (x − 2)^2 + (y − 2)^2 subject to the constraint x^2 + y^2 − 1 = 0.
- Absorb the constraint into the cost function, after multiplying by the Lagrange multiplier α:
  F(x, y, α) = (x − 2)^2 + (y − 2)^2 + α(x^2 + y^2 − 1).
Lagrange Multipliers (cont’d)
Find x, y, α that minimize F(x, y, α) = (x − 2)^2 + (y − 2)^2 + α(x^2 + y^2 − 1): set the partial derivatives to 0 and solve the resulting system of equations.
∂F/∂x = 2(x − 2) + 2αx = 0
∂F/∂y = 2(y − 2) + 2αy = 0
∂F/∂α = x^2 + y^2 − 1 = 0
Solving for x and y in the 1st and 2nd equations and plugging them into the 3rd equation,
x = y = 2/(1 + α), so (2/(1 + α))^2 + (2/(1 + α))^2 = 1,
from which we get α = 2√2 − 1. Thus, (x, y) = (1/√2, 1/√2).
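A small symbolic check of this worked example (not from the slides), using sympy to solve the stationarity conditions:

```python
import sympy as sp

x, y, a = sp.symbols('x y a', real=True)   # 'a' stands in for the multiplier alpha
F = (x - 2)**2 + (y - 2)**2 + a*(x**2 + y**2 - 1)

# Stationarity conditions dF/dx = dF/dy = dF/da = 0.
eqs = [sp.diff(F, v) for v in (x, y, a)]
print(sp.solve(eqs, (x, y, a), dict=True))
# Two stationary points; the one closest to (2, 2) is
# x = y = sqrt(2)/2 with a = 2*sqrt(2) - 1, matching the result above.
```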
Primal Problem: Constrained Optimization (cont’d)
Putting the constrained optimization problem into Lagrangian form, we get (utilizing the Kuhn-Tucker theorem)
J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + b) − 1].
Setting the partial derivatives ∂J/∂w and ∂J/∂b to zero gives the two conditions
w = Σ_{i=1}^N α_i d_i x_i,
Σ_{i=1}^N α_i d_i = 0
Primal Problem: Constrained Optimization (cont’d)
- Note that when the optimal solution is reached, the following condition must hold (Karush-Kuhn-Tucker complementary condition):
  α_i [d_i(w^T x_i + b) − 1] = 0
  for all i = 1, 2, ..., N.
- Thus, non-zero α_i can be attained only when d_i(w^T x_i + b) − 1 = 0, i.e., when α_i is associated with a support vector x^(s)!
- Other conditions include α_i ≥ 0.
Primal Problem: Constrained Optimization (cont’d)
- Substituting w = Σ_{i=1}^N α_i d_i x_i and Σ_{i=1}^N α_i d_i = 0 back into J(w, b, α), we get the dual problem:
  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + b) − 1]
  = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i − b Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i
  = −(1/2) Σ_{i=1}^N α_i d_i w^T x_i + Σ_{i=1}^N α_i
  (noting that w^T w = Σ_{i=1}^N α_i d_i w^T x_i and that Σ_{i=1}^N α_i d_i = 0)
  = −(1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j + Σ_{i=1}^N α_i
  = Q(α).
- So, J(w, b, α) = Q(α) (with α_i ≥ 0).
- This results in the dual problem (next slide).
Dual Problem
- Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function
  Q(α) = −(1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j + Σ_{i=1}^N α_i
  subject to the constraints
  - Σ_{i=1}^N α_i d_i = 0
  - α_i ≥ 0 for all i = 1, 2, ..., N.
- The problem is stated entirely in terms of the training data (x_i, d_i), and the dot products x_i^T x_j play a key role.
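To make the dual concrete, here is a minimal sketch (not from the slides; the toy data and the choice of scipy's SLSQP solver are illustrative assumptions) that solves the dual QP for a small linearly separable 2-D data set:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0],    # class +1
              [0.0, 0.0], [1.0, 0.0]])   # class -1
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)

# Matrix H[i, j] = d_i d_j x_i^T x_j appearing in Q(alpha).
H = (d[:, None] * d[None, :]) * (X @ X.T)

# Maximize Q(alpha) = sum_i alpha_i - 1/2 alpha^T H alpha
# <=> minimize -Q(alpha), subject to sum_i alpha_i d_i = 0 and alpha_i >= 0.
def neg_Q(alpha):
    return 0.5 * alpha @ H @ alpha - alpha.sum()

res = minimize(neg_Q, np.zeros(N), method='SLSQP',
               bounds=[(0.0, None)] * N,
               constraints={'type': 'eq', 'fun': lambda alpha: alpha @ d})
alpha = res.x
print(alpha)   # the non-zero entries correspond to the support vectors
```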
Solution to the Optimization Problem
Once all the optimal Lagrange multipliers α_{o,i} are found, w_o and b_o can be found as follows:
  w_o = Σ_{i=1}^N α_{o,i} d_i x_i
and, from w_o^T x_i + b_o = d_i when x_i is a support vector,
  b_o = d^(s) − w_o^T x^(s).
Note: calculation of the final estimated function does not need any explicit calculation of w_o, since it can be computed from the dot products between the input vectors:
  w_o^T x = Σ_{i=1}^N α_{o,i} d_i x_i^T x
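Continuing the dual-QP sketch above (same illustrative toy data and variable names), w_o and b_o can be recovered from the resulting α like this:

```python
# Recover w_o from the dual solution and b_o from one support vector.
w_o = (alpha * d) @ X                # w_o = sum_i alpha_i d_i x_i
sv = int(np.argmax(alpha))           # index of one support vector
b_o = d[sv] - w_o @ X[sv]            # b_o = d^(s) - w_o^T x^(s)

# Decision function written purely in terms of dot products (no explicit w_o).
def f(x):
    return (alpha * d) @ (X @ x) + b_o

print(w_o, b_o, [np.sign(f(x)) for x in X])   # signs should reproduce the labels d
```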
Margin of Separation in SVM and VC Dimension
Statistical learning theory shows that it is desirable to reduce both the error (empirical risk) and the VC dimension of the classifier.
- Vapnik (1995, 1998) showed: Let D be the diameter of the smallest ball containing all input vectors x_i. The set of optimal hyperplanes defined by w_o^T x + b_o = 0 has a VC dimension h bounded from above as
  h ≤ min{ ⌈D^2/ρ^2⌉, m_0 }
  where ⌈·⌉ is the ceiling, ρ is the margin of separation equal to 2/‖w_o‖, and m_0 is the dimensionality of the input space.
- The implication is that the VC dimension can be controlled independently of m_0 by choosing an appropriate (large) ρ!
Soft-Margin Classification
[Figure: soft-margin case; besides the support vectors on the margin boundaries, some points fall inside the margin ρ, either correctly classified or incorrectly classified, relative to the optimal hyperplane.]
- In some problems the data violate the condition
  d_i(w^T x_i + b) ≥ 1.
- We can introduce a new set of variables {ξ_i}_{i=1}^N and relax the condition to
  d_i(w^T x_i + b) ≥ 1 − ξ_i,
  where ξ_i is called the slack variable.
Soft-Margin Classification (cont’d)
- We want to find a separating hyperplane that minimizes
  Φ(ξ) = Σ_{i=1}^N I(ξ_i − 1),
  where I(ξ) = 0 if ξ ≤ 0 and 1 otherwise (so Φ(ξ) counts the points with ξ_i > 1, i.e., the misclassified points).
- Solving the above is NP-complete, so we instead solve an approximation:
  Φ(ξ) = Σ_{i=1}^N ξ_i
- Furthermore, the weight vector can be factored in:
  Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^N ξ_i,
  where the first term controls the VC dimension, the second term controls the error, and C is a control parameter.
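This soft-margin objective with the parameter C is what scikit-learn's SVC solves; below is a minimal sketch (assuming scikit-learn is available, with made-up toy data) showing how C trades margin width against training error:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data with points that crowd the margin (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 0.4],
              [0.0, 0.0], [1.0, 0.0], [0.5, 2.1]])
d = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 10.0):
    clf = SVC(kernel='linear', C=C).fit(X, d)
    w = clf.coef_[0]
    # Smaller C -> larger margin (smaller ||w||) but more slack; larger C -> the opposite.
    print(C, 2.0 / np.linalg.norm(w), clf.support_)
```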
Soft-Margin Classification: Solution
- Following a similar route involving Lagrange multipliers, with the more restrictive condition 0 ≤ α_i ≤ C, we get the solution:
  w_o = Σ_{i=1}^{N_s} α_{o,i} d_i x_i    (N_s: number of support vectors)
  b_o = d_i(1 − ξ_i) − w_o^T x_i
Nonlinear SVM
[Figure: the nonlinear map ϕ(·) takes a point x_i in the input space to ϕ(x_i) in the feature space.]
- Nonlinear mapping of an input vector to a high-dimensional feature space (exploit Cover’s theorem)
- Construction of an optimal hyperplane for separating the features identified in the above step.
Inner-Product Kernel
- Input x is mapped to ϕ(x).
- With the weight w (including the bias b), the decision surface in the feature space becomes (assume ϕ_0(x) = 1):
  w^T ϕ(x) = 0
- Using the steps in the linear SVM, we get
  w = Σ_{i=1}^N α_i d_i ϕ(x_i)
- Combining the above two, we get the decision surface
  Σ_{i=1}^N α_i d_i ϕ^T(x_i) ϕ(x) = 0.
Inner-Product Kernel (cont’d)
- The inner product ϕ^T(x) ϕ(x_i) is between two vectors in the feature space.
- The calculation of this inner product can be simplified by use of an inner-product kernel K(x, x_i):
  K(x, x_i) = ϕ^T(x) ϕ(x_i) = Σ_{j=1}^{m_1} ϕ_j(x) ϕ_j(x_i)
  where m_1 is the dimension of the feature space. (Note: K(x, x_i) = K(x_i, x).)
- So, the optimal hyperplane becomes:
  Σ_{i=1}^N α_i d_i K(x, x_i) = 0
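As an illustrative sketch (not from the slides; alpha, d, X, and the kernel below are made-up stand-ins for the quantities above), the kernelized decision surface can be evaluated directly:

```python
import numpy as np

def decision(x, X, d, alpha, K):
    # f(x) = sum_i alpha_i d_i K(x, x_i); the sign of f(x) gives the predicted class.
    return sum(a * di * K(x, xi) for a, di, xi in zip(alpha, d, X))

# Tiny hypothetical example with a linear kernel.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
linear = lambda x, z: x @ z
print(np.sign(decision(np.array([2.0, 0.0]), X, d, alpha, linear)))
```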
Inner-Product Kernel (cont’d)
- Mercer’s theorem states that any K(x, x_i) satisfying certain conditions (continuous, symmetric, positive semi-definite) can be expressed as an inner product in a nonlinearly mapped feature space.
- The kernel function K(x, x_i) allows us to calculate the inner product ϕ^T(x) ϕ(x_i) in the mapped feature space without any explicit calculation of the mapping function ϕ(·).
Examples of Kernel Functions
- Linear: K(x, x_i) = x^T x_i.
- Polynomial: K(x, x_i) = (x^T x_i + 1)^p.
- RBF: K(x, x_i) = exp(−‖x − x_i‖^2 / (2σ^2)).
- Two-layer perceptron: K(x, x_i) = tanh(β_0 x^T x_i + β_1) (for some β_0 and β_1).
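These kernels are straightforward to implement directly; the sketch below is illustrative (the parameter values for σ, p, β_0, β_1 are arbitrary choices, not values from the slides):

```python
import numpy as np

def linear_kernel(x, xi):
    return x @ xi

def poly_kernel(x, xi, p=2):
    return (x @ xi + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def mlp_kernel(x, xi, beta0=1.0, beta1=-1.0):
    return np.tanh(beta0 * (x @ xi) + beta1)

x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xi), poly_kernel(x, xi), rbf_kernel(x, xi), mlp_kernel(x, xi))
```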
Kernel Example
- Expanding K(x, x_i) = (1 + x^T x_i)^2 with x = [x_1, x_2]^T, x_i = [x_{i1}, x_{i2}]^T:
  K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}
  = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2] [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T
  = ϕ(x)^T ϕ(x_i),
  where ϕ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2]^T.
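A quick numeric check of this identity (illustrative values):

```python
import numpy as np

def phi(v):
    # Feature map for the p = 2 polynomial kernel, as derived above.
    x1, x2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s*x1*x2, x2**2, s*x1, s*x2])

x = np.array([1.0, 2.0])
xi = np.array([-0.5, 3.0])
print((1.0 + x @ xi) ** 2, phi(x) @ phi(xi))   # both print 42.25
```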
Nonlinear SVM: Solution
- The solution is basically the same as in the linear case, with the dot products x_i^T x_j replaced by K(x_i, x_j) and the additional constraint α_i ≤ C added.
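Relating this back to the dual-QP sketch shown earlier (same illustrative toy data), the only changes are the Gram matrix and the upper bound on α:

```python
import numpy as np

def rbf(x, z, sigma=1.0):                      # illustrative kernel choice
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def gram(X, kernel):
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

# Replace the linear Gram matrix X @ X.T by the kernel Gram matrix, and
# change the bounds from (0, None) to (0, C); the rest of the solver is unchanged.
H = (d[:, None] * d[None, :]) * gram(X, rbf)
bounds = [(0.0, C)] * len(d)
```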
Nonlinear SVM Summary
Project the input to a high-dimensional space to turn the problem into a linearly separable one. Issues with a projection to a higher-dimensional feature space:
- Statistical problem: danger of invoking the curse of dimensionality and a higher chance of overfitting
  - Use large margins to reduce the VC dimension.
- Computational problem: computational overhead for calculating the mapping ϕ(·)
  - Solve by using the kernel trick.