Lecture Notes on Maximum Margin Classifiers | CS 5350, Study notes of Computer Science

Material Type: Notes; Class: Machine Learning; Subject: Computer Science; University: University of Utah; Term: Fall 2008;

Uploaded on 08/30/2009

Machine Learning (CS 5350/CS 6350) 09 Oct 2008
Maximum margin classifiers
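Large margin principle: we want $w, b$ that separate the classes maximally. We assume that $w, b$ separate
the data and achieve a functional margin of at least 1. That is, $w^\top x + b \ge 1$ for all positive $x$ and
$w^\top x + b \le -1$ for all negative $x$. We defined the geometric margin based on the normalized weight vector:
$\gamma = \min_n y_n (u^\top x_n + b)$, where $u = w / \|w\|$. Compute the margin as a function of the normalized
weight vector $u$, for a positive point $x_+$ and a negative point $x_-$:

$$
\gamma \le \frac{1}{2} \left[ u^\top x_+ + b - u^\top x_- - b \right]
$$

Take $x_+$ and $x_-$ to be the points closest to the boundary, so that $w^\top x_+ + b = 1$ and
$w^\top x_- + b = -1$, and therefore $w^\top x_+ - w^\top x_- = 2$. Then:

$$
\begin{aligned}
\gamma &= \frac{1}{2} \left[ \frac{w}{\|w\|}^\top x_+ - \frac{w}{\|w\|}^\top x_- \right] \\
&= \frac{1}{2\|w\|} \left[ w^\top x_+ - w^\top x_- \right] \\
&= \frac{1}{\|w\|}
\end{aligned}
$$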
This shows that the margin is inversely proportional to the norm of the weight vector, and independent of
the bias. Moreover, having a large margin is equivalent to having a small weight vector norm (why does this
make intuitive sense?).
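As a quick numerical check of the result $\gamma = 1/\|w\|$, here is a minimal sketch in plain Python. The toy data set, the values of `w` and `b`, and the helper names are all made up for illustration. The helper divides the whole functional margin (bias included) by $\|w\|$, which is an assumption about how the definition is meant to be read; with `b = 0` in this example it makes no difference.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def geometric_margin(w, b, X, y):
    """Smallest value of y_n (w^T x_n + b) / ||w|| over the data set."""
    norm = math.sqrt(dot(w, w))
    return min(yn * (dot(w, xn) + b) / norm for xn, yn in zip(X, y))

# Hypothetical toy data: two positive and two negative points in the plane.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]

# w, b scaled so that the closest points have functional margin exactly 1.
w, b = (0.5, 0.5), 0.0

margin = geometric_margin(w, b, X, y)
inv_norm = 1.0 / math.sqrt(dot(w, w))
print(margin, inv_norm)  # the two numbers agree
```

Doubling `w` would keep the separating line where it is but push the functional margin of the closest points to 2, so only the scaling with functional margin 1 gives the identity above.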
Now, we write the learning problem as an optimization problem:
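
$$
\underset{w,\, b}{\text{minimize}} \quad \frac{1}{2} \|w\|^2
\qquad \text{subject to} \quad y_n \left[ w^\top x_n + b \right] - 1 \ge 0 \quad (\forall n)
$$
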
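To make the optimization problem concrete, the sketch below checks the constraints and evaluates the objective $\frac{1}{2}\|w\|^2$ for a few candidate weight vectors. The data set, the candidates, and the function names are hypothetical and only serve as an illustration.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def is_feasible(w, b, X, y, tol=1e-12):
    """True if every point satisfies y_n (w^T x_n + b) - 1 >= 0."""
    return all(yn * (dot(w, xn) + b) - 1.0 >= -tol for xn, yn in zip(X, y))

def objective(w):
    """The quantity being minimized: (1/2) ||w||^2."""
    return 0.5 * dot(w, w)

# Hypothetical toy data.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]

# Three made-up candidates that all point in the same direction.
candidates = {
    "w = (0.5, 0.5)": ((0.5, 0.5), 0.0),      # feasible, smallest norm
    "w = (1, 1)": ((1.0, 1.0), 0.0),          # feasible, but larger norm
    "w = (0.25, 0.25)": ((0.25, 0.25), 0.0),  # violates the constraints
}
for name, (w, b) in candidates.items():
    print(name, is_feasible(w, b, X, y), objective(w))
```

Among the feasible candidates, the one with the smaller norm has the smaller objective, which matches the claim that a large margin corresponds to a small weight vector.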
Now we need to figure out how to solve this. Enter convex optimization and Lagrange theory...
Introduce Lagrange multipliers $\alpha_{1:N}$, one for each constraint. This leads to the Lagrangian:

$$
L(w, \alpha) = \frac{1}{2} \|w\|^2 - \sum_n \alpha_n \left\{ y_n \left[ w^\top x_n + b \right] - 1 \right\}
$$

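Now, we want to minimize $L(w, \alpha)$ with respect to $w$ and $b$, and maximize it with respect to $\alpha \ge 0$.
Take derivatives with respect to $w$ and $b$ and set them to zero:

$$
\nabla_w L = w - \sum_n \alpha_n y_n x_n = 0 \implies w = \sum_n \alpha_n y_n x_n
$$

$$
\nabla_b L = -\sum_n y_n \alpha_n = 0 \implies \sum_n y_n \alpha_n = 0
$$
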
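The stationarity condition says that the weight vector is a weighted sum of the training points. A minimal sketch, assuming a hypothetical toy data set and a hand-picked `alpha` (neither comes from the notes):

```python
def weights_from_alpha(alpha, X, y):
    """Compute w = sum_n alpha_n y_n x_n, one coordinate at a time."""
    dim = len(X[0])
    return [sum(a * yn * xn[d] for a, xn, yn in zip(alpha, X, y))
            for d in range(dim)]

# Hypothetical toy data.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]

# Hand-picked multipliers; only the two closest points get nonzero weight.
alpha = [0.25, 0.0, 0.25, 0.0]

w = weights_from_alpha(alpha, X, y)
balance = sum(a * yn for a, yn in zip(alpha, y))  # should be 0 by the b-condition
print(w, balance)
```

Points with $\alpha_n = 0$ drop out of the sum entirely, so they have no influence on $w$.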
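So, given $\alpha$, $w$ is fully determined. Plug it back in to $L$:

$$
\begin{aligned}
L(\alpha) &= \frac{1}{2} \Big\| \sum_n \alpha_n y_n x_n \Big\|^2 - \sum_n \alpha_n \Big( y_n \Big[ \Big( \sum_m \alpha_m y_m x_m \Big)^\top x_n + b \Big] - 1 \Big) \\
&= \frac{1}{2} \sum_m \sum_n \alpha_m \alpha_n y_m y_n x_m^\top x_n - \sum_m \sum_n \alpha_m \alpha_n y_m y_n x_m^\top x_n - b \sum_n \alpha_n y_n + \sum_n \alpha_n \\
&= -\frac{1}{2} \sum_m \sum_n \alpha_m \alpha_n y_m y_n x_m^\top x_n + \sum_n \alpha_n
\end{aligned}
$$

The term $b \sum_n \alpha_n y_n$ vanishes because $\sum_n y_n \alpha_n = 0$.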
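The substitution can be checked numerically: for any $\alpha$ with $\sum_n y_n \alpha_n = 0$, the Lagrangian evaluated at $w = \sum_n \alpha_n y_n x_n$ should equal the reduced expression in $\alpha$ alone, whatever the value of $b$. The sketch below does this on made-up data with an arbitrary (not optimal) `alpha`; all names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def lagrangian(w, b, alpha, X, y):
    """L = (1/2)||w||^2 - sum_n alpha_n { y_n (w^T x_n + b) - 1 }."""
    penalty = sum(a * (yn * (dot(w, xn) + b) - 1.0)
                  for a, xn, yn in zip(alpha, X, y))
    return 0.5 * dot(w, w) - penalty

def dual_objective(alpha, X, y):
    """sum_n alpha_n - (1/2) sum_{m,n} alpha_m alpha_n y_m y_n x_m^T x_n."""
    n = len(X)
    quad = sum(alpha[m] * alpha[k] * y[m] * y[k] * dot(X[m], X[k])
               for m in range(n) for k in range(n))
    return sum(alpha) - 0.5 * quad

# Hypothetical toy data and an arbitrary alpha with sum_n y_n alpha_n = 0.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]
alpha = [0.3, 0.1, 0.2, 0.2]

# w as given by the stationarity condition.
w = [sum(a * yn * xn[d] for a, xn, yn in zip(alpha, X, y)) for d in range(2)]

# The bias should not matter, so try two different values.
gaps = [lagrangian(w, b, alpha, X, y) - dual_objective(alpha, X, y)
        for b in (0.0, 5.0)]
print(gaps)  # both entries are zero up to rounding
```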

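So now just solve:

$$
\underset{\alpha}{\text{maximize}} \quad \sum_n \alpha_n - \frac{1}{2} \sum_{n,m} y_n y_m \alpha_n \alpha_m x_n^\top x_m
$$

$$
\text{subject to} \quad \sum_n y_n \alpha_n = 0, \qquad \alpha_n \ge 0 \quad (\forall n)
$$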
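The notes do not say which solver to use for this quadratic program. As one possible illustration, the sketch below uses a simplified pairwise coordinate ascent in the spirit of sequential minimal optimization (SMO): it repeatedly picks two multipliers and optimizes them jointly, which keeps $\sum_n y_n \alpha_n = 0$ satisfied at every step. The solver choice, the function name `solve_dual`, the upper bound `C` (infinite here, so it has no effect), and the toy data are all assumptions of this example.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def solve_dual(X, y, C=float("inf"), sweeps=100, tol=1e-12):
    """Pairwise coordinate ascent on the dual; returns the list of alphas."""
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            for j in range(i + 1, n):
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if eta <= tol:
                    continue  # no curvature along this pair, skip it
                # g_k = sum_m alpha_m y_m K[m][k] - y_k; the bias cancels in g_i - g_j.
                g_i = sum(alpha[m] * y[m] * K[m][i] for m in range(n)) - y[i]
                g_j = sum(alpha[m] * y[m] * K[m][j] for m in range(n)) - y[j]
                # Range of alpha_j that keeps both multipliers inside [0, C].
                if y[i] != y[j]:
                    lo = max(0.0, alpha[j] - alpha[i])
                    hi = min(C, C + alpha[j] - alpha[i])
                else:
                    lo = max(0.0, alpha[i] + alpha[j] - C)
                    hi = min(C, alpha[i] + alpha[j])
                new_j = min(hi, max(lo, alpha[j] + y[j] * (g_i - g_j) / eta))
                if abs(new_j - alpha[j]) > tol:
                    # Move alpha_i the opposite way so sum_n y_n alpha_n stays 0.
                    alpha[i] += y[i] * y[j] * (alpha[j] - new_j)
                    alpha[j] = new_j
                    changed = True
        if not changed:
            break  # a full sweep made no progress
    return alpha

# Hypothetical toy data.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]

alpha = solve_dual(X, y)
w = [sum(a * yn * xn[d] for a, xn, yn in zip(alpha, X, y)) for d in range(2)]
print(alpha, w)
```

On this small example only the two points closest to the boundary end up with nonzero multipliers. For real data sets a dedicated quadratic programming or SVM library would be the more sensible choice.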

Then compute the bias:

$$
b = -\frac{1}{2} \left[ \max_{n : y_n = -1} w^\top x_n + \min_{n : y_n = +1} w^\top x_n \right]
$$
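The bias formula places the boundary halfway between the closest negative and the closest positive point along the direction $w$. A minimal sketch, with a made-up data set and a `w` that is assumed to be the maximum margin weight vector for it:

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def bias_from_w(w, X, y):
    """b = -(1/2) [ max over negatives of w^T x + min over positives of w^T x ]."""
    max_neg = max(dot(w, xn) for xn, yn in zip(X, y) if yn == -1)
    min_pos = min(dot(w, xn) for xn, yn in zip(X, y) if yn == +1)
    return -0.5 * (max_neg + min_pos)

# Hypothetical toy data, chosen so that the bias is not zero.
X = [(3.0, 1.0), (4.0, 3.0), (1.0, -1.0), (-1.0, -2.0)]
y = [+1, +1, -1, -1]
w = (0.5, 0.5)  # assumed maximum margin weight vector for this data

b = bias_from_w(w, X, y)
functional_margins = [yn * (dot(w, xn) + b) for xn, yn in zip(X, y)]
print(b, functional_margins)
```

With this bias the closest point on each side has functional margin exactly 1, as the constraints require.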

This leads to a sparse solution: most $\alpha_n$ are zero. Why? The Karush-Kuhn-Tucker conditions say that at the optimum:

$$
\alpha_n \left[ y_n (w^\top x_n + b) - 1 \right] = 0 \quad (\forall n)
$$

This means that $\alpha_n \ne 0$ only when the point is right on the margin. These points are the support vectors.
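The complementarity condition is easy to verify on a small example. In the sketch below the data, `alpha`, `w`, and `b` are hypothetical values that are assumed to form an optimal solution for the toy problem; the code checks that each product is zero and lists the support vectors.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical toy data with an assumed optimal solution.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]
w, b = (0.5, 0.5), 0.0

# How far each point is beyond the margin: y_n (w^T x_n + b) - 1.
gaps = [yn * (dot(w, xn) + b) - 1.0 for xn, yn in zip(X, y)]

# KKT complementarity: alpha_n times the gap must be zero for every n.
products = [a * g for a, g in zip(alpha, gaps)]

# Support vectors are the points with a nonzero multiplier.
support = [n for n, a in enumerate(alpha) if a > 1e-9]
print(gaps, products, support)
```

The points with a positive gap are strictly outside the margin and have $\alpha_n = 0$; the points with zero gap sit on the margin and are the only ones that can carry a nonzero multiplier.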

Not linearly-separable data...

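Introduce slack parameters: $\xi_n$ is how far “on the wrong side” $y_n x_n$ is from the margin. Then:

$$
\underset{w,\, b,\, \xi}{\text{minimize}} \quad \frac{1}{2} \|w\|^2 + C \sum_n \xi_n
$$

$$
\text{subject to} \quad y_n \left[ w^\top x_n + b \right] - 1 + \xi_n \ge 0 \quad (\forall n), \qquad \xi_n \ge 0 \quad (\forall n)
$$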
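For a fixed $w$ and $b$, the smallest slack that satisfies both constraints is $\xi_n = \max(0,\, 1 - y_n (w^\top x_n + b))$. The sketch below computes these slacks and the resulting objective for two values of $C$. The data (including the two deliberately misplaced points), the fixed `w` and `b`, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def slacks(w, b, X, y):
    """Smallest feasible slack per point: max(0, 1 - y_n (w^T x_n + b))."""
    return [max(0.0, 1.0 - yn * (dot(w, xn) + b)) for xn, yn in zip(X, y)]

def soft_objective(w, b, X, y, C):
    """(1/2) ||w||^2 + C * sum_n xi_n, using the smallest feasible slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Hypothetical toy data; the last two points are placed to need slack.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0),
     (0.5, 0.5),   # positive point inside the margin
     (1.0, 0.0)]   # negative point on the wrong side of the boundary
y = [+1, +1, -1, -1, +1, -1]
w, b = (0.5, 0.5), 0.0  # a fixed candidate, not necessarily optimal

xi = slacks(w, b, X, y)
print(xi)
print(soft_objective(w, b, X, y, C=1.0), soft_objective(w, b, X, y, C=10.0))
```

A slack between 0 and 1 means the point is inside the margin but still correctly classified; a slack above 1 means it is misclassified. Raising `C` makes the same violations more expensive.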

High C means “fit data” while low C means “have a large margin.”

Following the same dual formulation, we get:

$$
L(w, b, \xi, \alpha, r) = \frac{1}{2} w^\top w + C \sum_n \xi_n - \sum_n \alpha_n \left[ y_n \left( w^\top x_n + b \right) - 1 + \xi_n \right] - \sum_n r_n \xi_n
$$

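Differentiate:

$$
\nabla_w L = w - \sum_n y_n \alpha_n x_n = 0 \implies w = \sum_n y_n \alpha_n x_n
$$

$$
\nabla_b L = -\sum_n y_n \alpha_n = 0
$$

$$
\nabla_{\xi_n} L = C - \alpha_n - r_n = 0
$$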

Thus:

$$
L(w, b, \xi, \alpha, r) = \sum_n \alpha_n - \frac{1}{2} \sum_{n,m} y_n y_m \alpha_n \alpha_m x_n^\top x_m
$$

Which is the same as before, but now we have constraints: $\alpha_n \ge 0$ and $r_n \ge 0$ and $C - \alpha_n - r_n = 0$. This
means: (1) $\alpha_n \le C$. (2) If $r_n = 0$ then $\alpha_n = C$. Geometrically, support vectors are now also the “noisy”
points.
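The box constraint $0 \le \alpha_n \le C$ splits the training points into three groups, which the sketch below makes explicit. The multiplier values, the value of `C`, and the group names are made up for illustration; they are not the output of a solver.

```python
def categorize(alpha, C, tol=1e-9):
    """Label each point by where its multiplier sits in the interval [0, C]."""
    labels = []
    for a in alpha:
        r = C - a  # multiplier of the constraint xi_n >= 0
        if a <= tol:
            labels.append("non-support")  # alpha_n = 0: no influence on w
        elif r <= tol:
            labels.append("bound")        # alpha_n = C, r_n = 0: slack may be positive
        else:
            labels.append("margin")       # 0 < alpha_n < C: exactly on the margin
    return labels

# Hypothetical multipliers for five points, with C = 1.
alpha = [0.0, 0.4, 1.0, 0.0, 1.0]
labels = categorize(alpha, C=1.0)
print(labels)
```

Only points in the "bound" group can have $\xi_n > 0$, because $r_n \xi_n = 0$ at the optimum forces $r_n = 0$ whenever the slack is positive. These are the “noisy” support vectors.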

Why large margins? Because they mean simple solutions.