

Material Type: Notes; Class: Machine Learning; Subject: Computer Science; University: University of Utah; Term: Fall 2008;
Typology: Study notes


Machine Learning (CS 5350/CS 6350) 09 Oct 2008
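Large margin principle: we want w, b that separate the classes maximally. We assume that w, b separate the data and achieve a functional margin of at least 1. That is, $w^\top x + b \ge 1$ for all positive x and $w^\top x + b \le -1$ for all negative x. We defined the geometric margin based on the normalized weight vector: $\gamma = \min_n y_n (u^\top x_n + b)$, where $u = w / \|w\|$. Compute the margin as a function of the normalized weight vector u, for a positive point $x^+$ and a negative point $x^-$ that lie exactly on the margin (so $w^\top x^+ + b = 1$ and $w^\top x^- + b = -1$, hence $w^\top x^+ - w^\top x^- = 2$):

$$
\begin{aligned}
\gamma &= \frac{1}{2}\left[ u^\top x^+ + b - u^\top x^- - b \right] \\
       &= \frac{1}{2}\left[ \frac{w^\top}{\|w\|} x^+ - \frac{w^\top}{\|w\|} x^- \right] \\
       &= \frac{1}{2\|w\|}\left[ w^\top x^+ - w^\top x^- \right] \\
       &= \frac{1}{\|w\|}
\end{aligned}
$$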
This shows that the margin is inversely proportional to the norm of the weight vector, and independent of
the bias. Moreover, having a large margin is equivalent to having a small weight vector norm (why does this
make intuitive sense?).
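As a numerical check of the relation between margin and weight norm, here is a minimal sketch in plain Python. The toy data set, the particular (w, b), and all function names are assumptions made for illustration; they do not come from the lecture. The geometric margin is computed by dividing the smallest functional margin by ||w||.

```python
import math

# Hypothetical 2-D toy data, chosen only for illustration.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [+1, +1, -1, -1]

# A separating (w, b) scaled so that the smallest functional margin is 1.
w = (0.5, 0.5)
b = -2.0

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def functional_margins(w, b, X, y):
    """y_n (w . x_n + b) for every training point."""
    return [yn * (dot(w, xn) + b) for xn, yn in zip(X, y)]

def geometric_margin(w, b, X, y):
    """Smallest functional margin divided by the norm of w."""
    return min(functional_margins(w, b, X, y)) / math.sqrt(dot(w, w))

fm = functional_margins(w, b, X, y)
gamma = geometric_margin(w, b, X, y)
inv_norm = 1.0 / math.sqrt(dot(w, w))

print("functional margins:", fm)
print("geometric margin:  ", gamma)
print("1 / ||w||:         ", inv_norm)
```

Because the smallest functional margin is exactly 1 here, the geometric margin and 1/||w|| coincide.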
Now, we write the learning problem as an optimization problem:
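$$
\begin{aligned}
\underset{w,\,b}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 \\
\text{subject to} \quad & y_n\left(w^\top x_n + b\right) - 1 \ge 0 \quad (\forall n)
\end{aligned}
$$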
Now we need to figure out how to solve this. Enter convex optimization and Lagrange theory...
Introduce Lagrange multipliers $\alpha_{1:N}$, one for each constraint. This leads to the Lagrangian:

$$
L(w, \alpha) = \frac{1}{2}\|w\|^2 - \sum_n \alpha_n \left[ y_n\left(w^\top x_n + b\right) - 1 \right]
$$
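Now, we want to minimize L(w, α) with respect to w and b, and maximize it with respect to α. Take derivatives with respect to w and b and set them to zero:

$$
\begin{aligned}
\nabla_w L &= w - \sum_n \alpha_n y_n x_n = 0 \;\Longrightarrow\; w = \sum_n \alpha_n y_n x_n \\
\nabla_b L &= -\sum_n y_n \alpha_n = 0 \;\Longrightarrow\; \sum_n y_n \alpha_n = 0
\end{aligned}
$$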
So, given α, w is fully determined. Plug this back into L:
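$$
\begin{aligned}
L(\alpha) &= \frac{1}{2}\Big\|\sum_n \alpha_n y_n x_n\Big\|^2 - \sum_n \alpha_n \Big[ y_n \Big( \big(\sum_m \alpha_m y_m x_m\big)^\top x_n + b \Big) - 1 \Big] \\
&= \frac{1}{2}\sum_m \sum_n \alpha_m \alpha_n y_m y_n x_m^\top x_n - \sum_m \sum_n \alpha_m \alpha_n y_m y_n x_m^\top x_n - b \sum_n \alpha_n y_n + \sum_n \alpha_n \\
&= -\frac{1}{2}\sum_m \sum_n \alpha_m \alpha_n y_m y_n x_m^\top x_n + \sum_n \alpha_n
\end{aligned}
$$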
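The substitution can be checked numerically. The sketch below, in plain Python, uses a hypothetical toy data set and hypothetical function names. It builds w from α, evaluates the Lagrangian for several values of b, and compares the result with the simplified expression in α alone. The two should agree for any b whenever the α values satisfy Σ αn yn = 0.

```python
# Hypothetical toy data and multipliers; the multipliers satisfy sum_n alpha_n y_n = 0.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [+1, +1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def weights_from_alpha(alpha, X, y):
    """w = sum_n alpha_n y_n x_n, computed coordinate by coordinate."""
    dim = len(X[0])
    return tuple(
        sum(a * yn * xn[j] for a, xn, yn in zip(alpha, X, y))
        for j in range(dim)
    )

def lagrangian(w, b, alpha, X, y):
    """(1/2)||w||^2 - sum_n alpha_n [y_n (w . x_n + b) - 1]."""
    penalty = sum(
        a * (yn * (dot(w, xn) + b) - 1.0)
        for a, xn, yn in zip(alpha, X, y)
    )
    return 0.5 * dot(w, w) - penalty

def dual_objective(alpha, X, y):
    """sum_n alpha_n - (1/2) sum_{m,n} alpha_m alpha_n y_m y_n x_m . x_n."""
    idx = range(len(X))
    quad = sum(
        alpha[m] * alpha[n] * y[m] * y[n] * dot(X[m], X[n])
        for m in idx for n in idx
    )
    return sum(alpha) - 0.5 * quad

w = weights_from_alpha(alpha, X, y)
dual = dual_objective(alpha, X, y)
# The bias drops out because sum_n alpha_n y_n = 0, so any b gives the same value.
gap = max(abs(lagrangian(w, b_try, alpha, X, y) - dual) for b_try in (-2.0, 0.0, 7.0))

print("w from alpha:", w)
print("dual objective:", dual)
print("largest |L - dual| over the tried biases:", gap)
```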
Maximum margin classifiers 2
So now just solve:
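$$
\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & \sum_n \alpha_n - \frac{1}{2}\sum_{n,m} y_n y_m \alpha_n \alpha_m x_n^\top x_m \\
\text{subject to} \quad & \sum_n y_n \alpha_n = 0 \\
& \alpha_n \ge 0 \quad (\forall n)
\end{aligned}
$$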
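With exactly one positive and one negative training point, this problem can be solved by hand. The equality constraint forces both multipliers to a common value a, the objective becomes 2a − (1/2) a² ||x⁺ − x⁻||², and setting its derivative to zero gives a = 2 / ||x⁺ − x⁻||². The sketch below covers only that two-point case and uses hypothetical names. General data sets need a quadratic-programming solver, which is not shown here.

```python
def solve_two_point_dual(x_pos, x_neg):
    """Closed-form dual solution for one positive and one negative point.

    Returns the shared multiplier a, the weight vector w = a (x_pos - x_neg),
    and the squared distance between the two points.
    """
    diff = [p - q for p, q in zip(x_pos, x_neg)]
    sq_dist = sum(d * d for d in diff)
    a = 2.0 / sq_dist  # non-negative, so alpha_n >= 0 holds automatically
    w = tuple(a * d for d in diff)
    return a, w, sq_dist

def dual_value(a, sq_dist):
    """Dual objective restricted to alpha_pos = alpha_neg = a."""
    return 2.0 * a - 0.5 * a * a * sq_dist

# Hypothetical example points.
a, w, sq_dist = solve_two_point_dual((3.0, 3.0), (1.0, 1.0))

print("multiplier a:", a)
print("weight vector w:", w)
print("dual value at a:", dual_value(a, sq_dist))
```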
Then compute the bias:
$$
b = -\frac{1}{2}\left[ \max_{n \,:\, y_n = -1} w^\top x_n + \min_{n \,:\, y_n = +1} w^\top x_n \right]
$$
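The bias formula places the decision boundary halfway between the closest negative and the closest positive projection onto w. A minimal sketch in plain Python, with hypothetical toy data and function names:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def bias_from_w(w, X, y):
    """b = -(max over negatives of w.x + min over positives of w.x) / 2."""
    max_neg = max(dot(w, xn) for xn, yn in zip(X, y) if yn == -1)
    min_pos = min(dot(w, xn) for xn, yn in zip(X, y) if yn == +1)
    return -0.5 * (max_neg + min_pos)

# Hypothetical toy data and a weight vector assumed to come from the dual solution.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [+1, +1, -1, -1]
w = (0.5, 0.5)

b = bias_from_w(w, X, y)
closest_pos = min(dot(w, xn) + b for xn, yn in zip(X, y) if yn == +1)
closest_neg = max(dot(w, xn) + b for xn, yn in zip(X, y) if yn == -1)

print("bias b:", b)
print("closest positive score:", closest_pos)
print("closest negative score:", closest_neg)
```

With this choice of b, the closest positive and closest negative points receive scores of equal magnitude and opposite sign.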
This leads to a sparse solution: most α are zero. Why? The Karush-Kuhn-Tucker conditions say that at
the optimum:
$$
\alpha_n \left[ y_n\left(w^\top x_n + b\right) - 1 \right] = 0 \quad (\forall n)
$$

This means that $\alpha_n \ne 0$ only when the point is right on the margin. These points are the support vectors.
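The condition can be inspected directly on a small example. In the sketch below (plain Python; the data, multipliers, and names are hypothetical), each product αn [yn (w·xn + b) − 1] is computed and the indices with non-zero αn are reported as support vectors.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kkt_report(alpha, w, b, X, y, tol=1e-9):
    """Return margin excesses, complementary-slackness products, support indices."""
    # How far each point's functional margin exceeds 1 (zero means on the margin).
    margin_excess = [yn * (dot(w, xn) + b) - 1.0 for xn, yn in zip(X, y)]
    # KKT says each of these products is zero at the optimum.
    products = [a * e for a, e in zip(alpha, margin_excess)]
    # Support vectors are the points whose multiplier is non-zero.
    support = [n for n, a in enumerate(alpha) if a > tol]
    return margin_excess, products, support

# Hypothetical toy problem with an assumed optimal solution.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [+1, +1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]
w, b = (0.5, 0.5), -2.0

margin_excess, products, support = kkt_report(alpha, w, b, X, y)

print("margin excess per point:", margin_excess)
print("alpha_n * excess_n:     ", products)
print("support vector indices: ", support)
```

Only the two points sitting exactly on the margin carry non-zero multipliers, which is the sparsity described above.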
Not linearly-separable data...
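Introduce slack parameters: $\xi_n$ is how far “on the wrong side” of the margin the point $x_n$ (with label $y_n$) lies. Then, with a trade-off constant $C > 0$:

$$
\begin{aligned}
\underset{w,\,b,\,\xi}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C \sum_n \xi_n \\
\text{subject to} \quad & y_n\left(w^\top x_n + b\right) - 1 + \xi_n \ge 0 \quad (\forall n) \\
& \xi_n \ge 0 \quad (\forall n)
\end{aligned}
$$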
High C means “fit data” while low C means “have a large margin.”
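To make the role of C concrete, the sketch below evaluates the soft-margin objective (1/2)||w||² + C Σ ξn for two hand-picked candidates on a hypothetical data set with one "noisy" positive point. One candidate keeps a wide margin and pays slack for the noisy point; the other fits every point with a narrow margin. Neither candidate is claimed to be the optimum. The slack used is the smallest value that satisfies the constraints, ξn = max(0, 1 − yn (w·xn + b)), and all names are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(w, b, X, y):
    """Smallest xi_n >= 0 with y_n (w . x_n + b) - 1 + xi_n >= 0."""
    return [max(0.0, 1.0 - yn * (dot(w, xn) + b)) for xn, yn in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Hypothetical data: the last positive point sits close to the negatives.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0), (1.5, 1.5)]
y = [+1, +1, -1, -1, +1]

wide = ((0.5, 0.5), -2.0)   # large margin, violates the noisy point
tight = ((2.0, 2.0), -5.0)  # small margin, no violations

results = {}
for C in (1.0, 10.0):
    results[C] = (
        soft_margin_objective(wide[0], wide[1], X, y, C),
        soft_margin_objective(tight[0], tight[1], X, y, C),
    )
    print("C =", C, "wide:", results[C][0], "tight:", results[C][1])
```

Between these two candidates, the wide-margin one has the lower objective for the small C and the data-fitting one has the lower objective for the large C.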
Following the same dual formulation, we get:
$$
L(w, b, \xi, \alpha, r) = \frac{1}{2} w^\top w + C \sum_n \xi_n - \sum_n \alpha_n \left[ y_n\left(w^\top x_n + b\right) - 1 + \xi_n \right] - \sum_n r_n \xi_n
$$
Differentiate:
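$$
\begin{aligned}
\nabla_w L &= w - \sum_n y_n \alpha_n x_n = 0 \;\Longrightarrow\; w = \sum_n y_n \alpha_n x_n \\
\nabla_b L &= -\sum_n y_n \alpha_n = 0 \;\Longrightarrow\; \sum_n y_n \alpha_n = 0 \\
\nabla_{\xi_n} L &= C - \alpha_n - r_n = 0
\end{aligned}
$$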
Thus:
$$
L(w, b, \xi, \alpha, r) = \sum_n \alpha_n - \frac{1}{2}\sum_{n,m} y_n y_m \alpha_n \alpha_m x_n^\top x_m
$$
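This is the same dual objective as before, but now we have the constraints $\alpha_n \ge 0$, $r_n \ge 0$ and $C - \alpha_n - r_n = 0$. This means: (1) $\alpha_n \le C$; (2) if $r_n = 0$ then $\alpha_n = C$. By complementary slackness ($r_n \xi_n = 0$), any point with $\xi_n > 0$ has $r_n = 0$ and hence $\alpha_n = C$. Geometrically, support vectors are now also the “noisy” points, those that fall inside the margin or on the wrong side of it.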
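The constraints split the training points into three cases according to where αn sits in the interval [0, C]. The sketch below encodes that case analysis in plain Python. The function name and labels are hypothetical, and the comments rely on the standard complementary-slackness conditions rn ξn = 0 and αn [yn (w·xn + b) − 1 + ξn] = 0, which are assumed here rather than derived in these notes.

```python
def describe_multiplier(alpha_n, C, tol=1e-9):
    """Classify one training point from its multiplier alpha_n and the constant C.

    Returns (r_n, label), where r_n = C - alpha_n follows from C - alpha_n - r_n = 0.
    """
    r_n = C - alpha_n
    if alpha_n < -tol or r_n < -tol:
        raise ValueError("need 0 <= alpha_n <= C")
    if alpha_n <= tol:
        # alpha_n = 0: r_n = C > 0 forces xi_n = 0; point is on or outside the margin.
        return r_n, "non-support"
    if r_n <= tol:
        # alpha_n = C: r_n = 0, so xi_n may be positive (a "noisy" point).
        return r_n, "at-bound"
    # 0 < alpha_n < C: r_n > 0 forces xi_n = 0, and alpha_n > 0 puts the point on the margin.
    return r_n, "on-margin"

C = 1.0
for alpha_n in (0.0, 0.25, 1.0):
    r_n, label = describe_multiplier(alpha_n, C)
    print("alpha_n =", alpha_n, "r_n =", r_n, "->", label)
```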
Why large margins? Because they mean simple solutions.