TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
Lecture by Andreas Argyriou
TTI–Chicago
October 22, 2010
We start with
argmax_{w, w_0}  (1 / ‖w‖) min_i  y_i (w^T x_i + w_0)
In the linearly separable case, we get a quadratic program:
max_α  Σ_{i=1}^N α_i − (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^T x_j
subject to  Σ_{i=1}^N α_i y_i = 0,  α_i ≥ 0 for all i = 1, …, N.
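As a concrete illustration (not part of the slides), this dual is a standard quadratic program and can be handed to an off-the-shelf solver. The sketch below assumes the cvxopt package, whose solvers.qp minimizes (1/2) αᵀPα + qᵀα subject to Gα ≤ h and Aα = b; the function name solve_hard_margin_dual is my own.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed available; any QP solver would do

def solve_hard_margin_dual(X, y):
    """Sketch: the hard-margin SVM dual as a generic QP.

    Negating the objective: P_ij = y_i y_j x_i^T x_j, q = -1;
    G = -I, h = 0 encode alpha_i >= 0; A = y^T, b = 0 encode sum_i alpha_i y_i = 0.
    """
    N = X.shape[0]
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.asarray(sol['x']).ravel()  # the alpha_i
```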
Solving it for α we get the SVM classifier
ŷ = sign( ŵ_0 + Σ_{i: α_i > 0} α_i y_i x_i^T x )
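Continuing the sketch above (again an illustration, not the course code): support vectors are picked out numerically as the α_i above a small tolerance, and ŵ_0 is recovered from any support vector s, for which y_s (w^T x_s + ŵ_0) = 1.

```python
import numpy as np

def svm_predict(X_train, y_train, alpha, X_new, tol=1e-6):
    """Sketch: classify new points from the hard-margin dual solution alpha."""
    sv = alpha > tol                              # support vectors have alpha_i > 0
    w = (alpha[sv] * y_train[sv]) @ X_train[sv]   # w = sum_i alpha_i y_i x_i
    s = np.flatnonzero(sv)[0]                     # any support vector index
    w0 = y_train[s] - X_train[s] @ w              # from y_s (w^T x_s + w0) = 1
    return np.sign(w0 + X_new @ w)
```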
[Figure: maximum-margin separating boundary with weight vector w and margin width 1/‖w‖; support vectors are the points with α > 0, all other points have α = 0.]
Only support vectors (points with α_i > 0) determine the boundary.
When the data are not linearly separable, we can no longer satisfy
y_i (w^T x_i + w_0) ≥ 1 for all i.
We introduce slack variables ξ_i ≥ 0:
y_i (w_0 + w^T x_i) − 1 + ξ_i ≥ 0.
Whenever the original constraint is satisfied, ξ_i = 0.
The updated objective:
min_{w, w_0, ξ}  ‖w‖² + C Σ_{i=1}^N ξ_i.
The parameter C determines the penalty paid for violating the margin constraints.
This is applicable even when the data are separable!
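As a side note (my own sketch, not from the slides): for a fixed (w, w_0) the smallest feasible slack is ξ_i = max(0, 1 − y_i (w^T x_i + w_0)), so the objective can be evaluated directly.

```python
import numpy as np

def soft_margin_objective(w, w0, X, y, C):
    """Sketch: ||w||^2 + C * sum_i xi_i with the minimal feasible slacks."""
    margins = y * (X @ w + w0)                 # y_i (w^T x_i + w_0)
    xi = np.maximum(0.0, 1.0 - margins)        # smallest xi_i satisfying the constraint
    return w @ w + C * xi.sum()
```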
We can solve
min_{w, w_0, ξ}  ‖w‖² + C Σ_{i=1}^N ξ_i
using Lagrange multipliers.
The resulting dual problem:
max_α  Σ_{i=1}^N α_i − (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^T x_j
subject to  Σ_{i=1}^N α_i y_i = 0,  0 ≤ α_i ≤ C for all i = 1, …, N.
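The only change from the hard-margin QP sketch earlier is the extra upper bound α_i ≤ C. Assuming cvxopt again, the inequality block Gα ≤ h now stacks −I (for α ≥ 0) on top of I (for α ≤ C).

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed available, as before

def solve_soft_margin_dual(X, y, C):
    """Sketch: soft-margin SVM dual with box constraints 0 <= alpha_i <= C."""
    N = X.shape[0]
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -alpha_i <= 0 and alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.asarray(sol['x']).ravel()
```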
[Figure: soft-margin boundary with weight vector w, margin width 1/‖w‖, and slacks ξ_i/‖w‖; points are labeled 0 < α < C, ξ = 0 (on the margin), α = C, 0 < ξ < 1 (inside the margin), and α = C, ξ > 1 (misclassified).]
Support vectors: points with α > 0.
If 0 < α < C: SVs on the margin, ξ = 0.
If α = C: SVs over the margin, either misclassified (ξ > 1) or not (0 < ξ ≤ 1).
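In code, these three cases can be read off the solved α up to a numerical tolerance (a sketch with my own names; ŵ_0 is typically recovered from the margin SVs, for which ξ = 0):

```python
import numpy as np

def categorize_support_vectors(alpha, C, tol=1e-6):
    """Sketch: split points by their dual variables."""
    margin_sv = (alpha > tol) & (alpha < C - tol)   # 0 < alpha < C: on the margin, xi = 0
    bound_sv = alpha >= C - tol                     # alpha = C: over the margin, xi > 0
    non_sv = alpha <= tol                           # alpha = 0: no effect on the boundary
    return margin_sv, bound_sv, non_sv
```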
min_{w, w_0, ξ}  ‖w‖² + C Σ_{i=1}^N ξ_i
C is a regularization parameter, controlling the penalty for an imperfect fit to the training labels.
Larger C ⇒ more reluctant to make mistakes.
How do we select the value of C? Cross-validation is a common practical way to do that.
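For example (an illustration with scikit-learn, not part of the lecture), one can grid-search C with k-fold cross-validation; the toy data and grid values below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Sketch: choose C by 5-fold cross-validation over a logarithmic grid.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # stand-in data
search = GridSearchCV(SVC(kernel='linear'), {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print('selected C:', search.best_params_['C'])
```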
The loss is measured as Σ_{i=1}^N ξ_i. This surrogate loss is known as the hinge loss, since at the optimum ξ_i = max(0, 1 − y_i (w^T x_i + w_0)).
[Figure: surrogate losses L(y f(x), 1) plotted against y f(x): 0/1 loss, log loss, squared error, and hinge loss.]
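The curves in that figure can be reproduced as functions of the margin value y f(x); the exact scaling of the log and squared losses is my assumption.

```python
import numpy as np

# Sketch: surrogate losses as functions of the margin value m = y * f(x).
m = np.linspace(-2.0, 2.0, 401)
zero_one = (m <= 0).astype(float)       # 0/1 loss
hinge = np.maximum(0.0, 1.0 - m)        # hinge loss: equals the slack xi_i
squared = (1.0 - m) ** 2                # squared error
log_loss = np.log2(1.0 + np.exp(-m))    # logistic log loss (base-2 scaling assumed)
```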
As with logistic regression, we can move to nonlinear classifiers by mapping the data into a nonlinear feature space, e.g.
φ : [x_1, x_2]^T → [x_1², √2 x_1 x_2, x_2²]^T
An elliptical decision boundary in the input space becomes linear in the feature space z = φ(x):
x_1²/a² + x_2²/b² = c  ⇒  z_1/a² + z_3/b² = c.
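A quick numerical check of this claim (my own sketch; the constants a, b, c are arbitrary):

```python
import numpy as np

def phi(x):
    """phi([x1, x2]) = [x1^2, sqrt(2) x1 x2, x2^2]."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

# Points on the ellipse x1^2/a^2 + x2^2/b^2 = c ...
a, b, c = 2.0, 1.0, 1.0
t = np.linspace(0.0, 2.0 * np.pi, 100)
ellipse = np.stack([a * np.sqrt(c) * np.cos(t), b * np.sqrt(c) * np.sin(t)], axis=1)
# ... satisfy the linear equation z1/a^2 + z3/b^2 = c in feature space.
Z = np.array([phi(x) for x in ellipse])
print(np.allclose(Z[:, 0] / a ** 2 + Z[:, 2] / b ** 2, c))   # True
```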
Consider the mapping:
φ : [x_1, x_2]^T → [1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2]^T.
The (linear) SVM classifier in the feature space:
ŷ = sign( ŵ_0 + Σ_{i: α_i > 0} α_i y_i φ(x_i)^T φ(x) )
The dot product in the feature space:
φ(x)^T φ(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1² z_1² + x_2² z_2² + 2 x_1 x_2 z_1 z_2 = (1 + x^T z)²
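This identity is easy to verify numerically (a small sketch, with phi6 as my own helper name):

```python
import numpy as np

def phi6(x):
    """phi([x1, x2]) = [1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2]."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi6(x) @ phi6(z), (1.0 + x @ z) ** 2))   # True: dot product equals the kernel
```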
We defined a non-linear mapping into feature space
φ : [x_1, x_2]^T → [1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2]^T
and saw that φ(x)^T φ(z) = K(x, z) using the kernel
K(x, z) = (1 + x^T z)².
I.e., we can calculate dot products in the feature space
implicitly, without ever writing the feature expansion!
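This is how kernelized SVM libraries operate in practice. For instance (an illustration, not from the slides), scikit-learn's polynomial kernel is (γ x^T z + coef0)^degree, so degree=2, γ=1, coef0=1 gives exactly K(x, z) = (1 + x^T z)²; the toy dataset below is a placeholder.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Sketch: train an SVM with the quadratic kernel (1 + x^T z)^2, never forming phi(x).
X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=0)  # stand-in data
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=1.0)
clf.fit(X, y)
print('training accuracy:', clf.score(X, y))
```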