Lecture 12: Support Vector Machines

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

Lecture by Andreas Argyriou

TTI–Chicago

October 22, 2010

Plan for today

SVM:

  • non-separable case;
  • nonlinear classification with the kernel trick.

SVM: summary so far

We start with

\[
\operatorname*{argmax}_{w,\,w_0} \left\{ \frac{1}{\|w\|} \min_i\, y_i \left( w^T x_i + w_0 \right) \right\}
\]

In the linearly separable case, we get a quadratic program

\[
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j
\]

subject to

\[
\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \ \text{for all } i = 1, \ldots, N.
\]

Solving it for α, we get the SVM classifier

\[
\hat{y} = \operatorname{sign}\Big( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, x_i^T x \Big).
\]

SVM classification

[Figure: the separating hyperplane with weight vector w and margin 1/‖w‖; points on the margin have α > 0, all other points have α = 0.]

Only support vectors (points with α_i > 0) determine the boundary.
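To make this concrete, here is a minimal sketch (my addition, not part of the slides) that fits a linear SVM with scikit-learn and rebuilds the decision rule ŷ = sign(ŵ_0 + Σ_{α_i>0} α_i y_i x_iᵀx) from the stored support vectors; the toy data and variable names are illustrative assumptions.

```python
# Hedged sketch: recover the dual SVM decision rule
#   yhat = sign(w0_hat + sum_{alpha_i > 0} alpha_i y_i x_i^T x)
# from a fitted scikit-learn SVC; the toy data are made up for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)),   # class -1
               rng.normal(+2.0, 1.0, (20, 2))])  # class +1
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, stored for support vectors only
sv = clf.support_vectors_          # the x_i with alpha_i > 0
w0 = clf.intercept_[0]

x_new = np.array([1.5, 1.0])
score = w0 + np.sum(alpha_y * (sv @ x_new))      # w0 + sum alpha_i y_i x_i^T x
assert np.isclose(score, clf.decision_function([x_new])[0])
print("predicted label:", int(np.sign(score)))
```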


Non-separable case

Not linearly separable data: we can no longer satisfy

\[
y_i \left( w^T x_i + w_0 \right) \ge 1 \quad \text{for all } i.
\]

We introduce slack variables ξ_i ≥ 0:

\[
y_i \left( w_0 + w^T x_i \right) - 1 + \xi_i \ge 0.
\]

Whenever the original constraint is satisfied, ξ_i = 0.

The updated objective:

\[
\min_{w} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i.
\]

The parameter C determines the penalty paid for violating margin constraints.

This is applicable even when the data are separable!
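As a small numerical illustration (my addition, with made-up data and a hand-picked w), the sketch below takes each slack as small as the constraints allow, ξ_i = max(0, 1 − y_i(wᵀx_i + w_0)), and evaluates the soft-margin objective.

```python
# Hedged sketch: slack variables for a fixed (w, w0) on toy data.
# Taking each xi_i as small as the constraints allow gives
#   xi_i = max(0, 1 - y_i (w^T x_i + w0)),
# and the soft-margin objective is 0.5*||w||^2 + C * sum(xi).
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 0.5], [0.2, 0.1],        # label +1
              [-2.0, -1.0], [-0.5, -1.5], [0.3, -0.2]])  # label -1
y = np.array([1, 1, 1, -1, -1, -1])

w = np.array([1.0, 1.0])   # hand-picked, not the optimum
w0 = 0.0
C = 1.0

margins = y * (X @ w + w0)              # y_i (w^T x_i + w0)
xi = np.maximum(0.0, 1.0 - margins)     # slack: 0 whenever the constraint holds
objective = 0.5 * w @ w + C * xi.sum()

print("slacks:", np.round(xi, 3))
print("objective:", round(objective, 3))
```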

Non-separable case: solution

\[
\min_{w} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i.
\]

We can solve this using Lagrange multipliers:

  • introduce additional multipliers for the ξ_i.

The resulting dual problem:

\[
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j
\]

subject to

\[
\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \ \text{for all } i = 1, \ldots, N.
\]
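To connect the dual to something executable, here is a hedged sketch (not from the lecture) that solves this box-constrained dual for a tiny toy problem with SciPy's SLSQP solver; the data and the choice of solver are my own assumptions, and any QP solver would do.

```python
# Hedged sketch: the soft-margin dual
#   max_a  sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j x_i^T x_j
#   s.t.   sum_i a_i y_i = 0,  0 <= a_i <= C,
# solved numerically for a tiny toy dataset.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-0.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0
N = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T     # G_ij = y_i y_j x_i^T x_j

def neg_dual(a):                              # minimize the negated dual objective
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,
               constraints={"type": "eq", "fun": lambda a: a @ y})

alpha = res.x
w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
print("alpha:", np.round(alpha, 3))           # only support vectors are nonzero
print("w:", np.round(w, 3))
```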

SVM with slack variables

[Figure: soft-margin SVM with margin width 1/‖w‖ and slack distances ξ_i/‖w‖; points on the margin have 0 < α < C and ξ = 0, points over the margin have α = C, with 0 < ξ < 1 if still correctly classified and ξ > 1 if misclassified.]

Support vectors: points with α > 0.

  • If 0 < α < C: SVs on the margin, ξ = 0.
  • If α = C: SVs over the margin, either misclassified (ξ > 1) or not (0 < ξ ≤ 1).
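These α/ξ cases can be read off a fitted model. The sketch below (my illustration, using scikit-learn and made-up overlapping data) splits the support vectors of a soft-margin SVC into margin SVs (0 < α < C) and bound SVs (α = C).

```python
# Hedged sketch: split the support vectors of a fitted soft-margin SVC into
# margin SVs (0 < alpha < C, sitting on the margin, xi = 0) and
# bound SVs (alpha = C, over the margin, xi > 0). Toy data are made up.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.5, (50, 2)),
               rng.normal(+1.0, 1.5, (50, 2))])   # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])     # dual_coef_ stores alpha_i * y_i, so |.| = alpha_i
at_bound = np.isclose(alpha, C)       # alpha = C: margin violators
print("margin SVs (0 < alpha < C):", int((~at_bound).sum()))
print("bound  SVs (alpha = C):    ", int(at_bound.sum()))
```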


SVM and regularization

\[
\min_{w} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i
\]

C is a regularization parameter, controlling the penalty for an imperfect fit to the training labels.

Larger C ⇒ more reluctant to make mistakes.

How do we select the value of C? Cross-validation is a common practical way to do that.
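For instance, a standard way to pick C by cross-validation (a sketch with scikit-learn; the grid of C values and the toy data are illustrative assumptions):

```python
# Hedged sketch: choosing C by k-fold cross-validation with a grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.5, (100, 2)),
               rng.normal(+1.0, 1.5, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5)                      # 5-fold cross-validation
search.fit(X, y)
print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```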

Loss in SVM

Loss is measured as ∑_{i=1}^N ξ_i. Since each slack is ξ_i = max(0, 1 − y_i f(x_i)) at the optimum, this surrogate loss is known as the hinge loss.

[Figure: losses L(yf(x), 1) plotted against yf(x), comparing the 0/1 loss, log loss, squared error, and hinge loss.]
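A small sketch (my addition) evaluating the losses from the figure at a few margin values m = yf(x), so their shapes can be compared numerically; the exact scaling conventions are assumptions.

```python
# Hedged sketch: surrogate losses as functions of the margin m = y*f(x).
# Hinge: max(0, 1-m); 0/1: [m < 0]; squared error: (1-m)^2; log loss: log(1+exp(-m)).
import numpy as np

m = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])   # y * f(x)
hinge = np.maximum(0.0, 1.0 - m)
zero_one = (m < 0).astype(float)
squared = (1.0 - m) ** 2
logistic = np.log1p(np.exp(-m))

for row in zip(m, zero_one, hinge, squared, logistic):
    print("m=%5.1f  0/1=%.0f  hinge=%.2f  sq=%.2f  log=%.2f" % row)
```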


Nonlinear features

As with logistic regression, we can move to nonlinear classifiers by mapping the data into a nonlinear feature space:

\[
\phi : [x_1, x_2]^T \to [\,x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\,]^T
\]

An elliptical decision boundary in the input space becomes linear in the feature space z = φ(x):

\[
\frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} = c \quad \Rightarrow \quad \frac{z_1}{a^2} + \frac{z_3}{b^2} = c.
\]
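To see the mapping in action, here is a hedged sketch (my own toy data, not from the slides) that labels points by an ellipse, applies φ(x) = [x_1², √2 x_1x_2, x_2²], and checks that a linear SVM does much better in feature space than in input space.

```python
# Hedged sketch: points labelled by an elliptical boundary x1^2/a^2 + x2^2/b^2 = c
# are not linearly separable in input space, but become linearly separable after
# the quadratic feature map phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2].
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, (200, 2))
a, b, c = 2.0, 1.0, 1.0
y = np.where(X[:, 0] ** 2 / a ** 2 + X[:, 1] ** 2 / b ** 2 <= c, 1, -1)

def phi(X):
    return np.column_stack([X[:, 0] ** 2,
                            np.sqrt(2.0) * X[:, 0] * X[:, 1],
                            X[:, 1] ** 2])

lin_input = SVC(kernel="linear", C=10.0).fit(X, y)
lin_feature = SVC(kernel="linear", C=10.0).fit(phi(X), y)
print("training accuracy in input space:  ", lin_input.score(X, y))
print("training accuracy in feature space:", lin_feature.score(phi(X), y))
```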

Example of nonlinear mapping

Consider the mapping:

\[
\phi : [x_1, x_2]^T \to [\,1,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\,]^T.
\]

The (linear) SVM classifier in the feature space:

\[
\hat{y} = \operatorname{sign}\Big( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, \phi(x_i)^T \phi(x) \Big)
\]

The dot product in the feature space:

\[
\phi(x)^T \phi(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = \left( 1 + x^T z \right)^2.
\]
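A short numerical check (my addition) of the identity φ(x)ᵀφ(z) = (1 + xᵀz)² for this six-dimensional map; the two test points are arbitrary.

```python
# Hedged sketch: numerically verify phi(x)^T phi(z) == (1 + x^T z)^2
# for the 6-dimensional quadratic feature map from the slide.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.7, -1.3])
z = np.array([2.0, 0.4])
assert np.isclose(phi(x) @ phi(z), (1.0 + x @ z) ** 2)
print(phi(x) @ phi(z), (1.0 + x @ z) ** 2)
```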

Dot products and feature space

We defined a non-linear mapping into feature space

\[
\phi : [x_1, x_2]^T \to [\,1,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\,]^T
\]

and saw that φ(x)^T φ(z) = K(x, z) using the kernel

\[
K(x, z) = \left( 1 + x^T z \right)^2.
\]

I.e., we can calculate dot products in the feature space implicitly, without ever writing the feature expansion!
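As a closing sketch (my illustration, not from the slides): in scikit-learn the polynomial kernel with gamma=1, coef0=1, degree=2 is exactly K(x, z) = (1 + xᵀz)², so training with that kernel should give essentially the same decision values as explicitly expanding φ and training a linear SVM, assuming the same C; the toy data are made up.

```python
# Hedged sketch: the kernel trick in practice. With gamma=1, coef0=1, degree=2,
# scikit-learn's polynomial kernel is K(x, z) = (1 + x^T z)^2, so training with it
# matches training a linear SVM on the explicit 6-D features phi(x).
import numpy as np
from sklearn.svm import SVC

def phi(X):
    x1, x2 = X[:, 0], X[:, 1]
    s = np.sqrt(2.0)
    return np.column_stack([np.ones_like(x1), s * x1, s * x2,
                            x1 ** 2, x2 ** 2, s * x1 * x2])

rng = np.random.default_rng(4)
X = rng.uniform(-2.0, 2.0, (100, 2))
y = np.where(X[:, 0] ** 2 + 0.5 * X[:, 1] ** 2 <= 1.0, 1, -1)

implicit = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
explicit = SVC(kernel="linear", C=1.0).fit(phi(X), y)

X_test = rng.uniform(-2.0, 2.0, (5, 2))
print(np.round(implicit.decision_function(X_test), 4))
print(np.round(explicit.decision_function(phi(X_test)), 4))   # should match closely
```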