ENCS5341 Machine Learning and Data Science
Kernels and SVM
Yazan Abu Farha, Birzeit University
Based on slides prepared by Tamás Horváth
Support Vector Machine
Noise tolerating linear separation
noisy examples: suppose some amount of noise has been added to each example
Choose the hyperplane with the largest margin
hyperplane with maximum margin
Arguments for maximum margin hyperplane
• Robust against noise.
• Excellent predictive performance in practice.
• Separating hyperplane becomes unique.
Example: distance from a hyperplane
• $f(x) = x_1 + x_2 - 3$
• Signed distance of the point (0,0) from f is $d = \frac{f((0,0))}{\|(1,1)\|} = -\frac{3}{\sqrt{2}}$
• Signed distance of the point (3,3) from f is $d = \frac{f((3,3))}{\|(1,1)\|} = \frac{3}{\sqrt{2}}$
[Figure: plot with axes x1 and x2]
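The distance example can be checked numerically. A minimal sketch, assuming the hyperplane is written as $\langle w, x \rangle + b = 0$ with $w = (1, 1)$ and $b = -3$; the helper name `signed_distance` is illustrative, not from the slides.

```python
import math

def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane <w, x> + b = 0."""
    f_x = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return f_x / norm_w

# f(x) = x1 + x2 - 3, i.e. w = (1, 1) and b = -3
w, b = (1.0, 1.0), -3.0
d_origin = signed_distance(w, b, (0.0, 0.0))  # equals -3 / sqrt(2)
d_33 = signed_distance(w, b, (3.0, 3.0))      # equals  3 / sqrt(2)
```

The sign tells on which side of the hyperplane the point lies; the absolute value is the usual Euclidean distance.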
The Maximum Margin Hyperplane
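hyperplane $\langle w, \vec{x} \rangle + b = 0$: scale w and b such that $|\langle w, \vec{x} \rangle + b| = 1$ for all points $\vec{x}$ on the dashed hyperplanes
⇒ margin: $\frac{2}{\|w\|}$ (distance between the two dashed hyperplanes)
prediction of the class of an unseen instance $\vec{x}$: $\mathrm{sign}(\langle w, \vec{x} \rangle + b)$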
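The rescaling step can be sketched in a few lines. This assumes a separating hyperplane is already given and that no data point lies exactly on it; the toy data and the helper names (`canonical_form`, `margin`, `predict`) are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def canonical_form(w, b, points):
    """Rescale (w, b) so that the closest point satisfies |<w, x> + b| = 1."""
    scale = min(abs(dot(w, x) + b) for x in points)
    return [wi / scale for wi in w], b / scale

def margin(w):
    """Distance between the hyperplanes <w, x> + b = +1 and <w, x> + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

def predict(w, b, x):
    """Sign of <w, x> + b (a value of exactly 0 is mapped to +1 here)."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical toy data; 4*x1 + 4*x2 - 8 = 0 describes the same hyperplane
# as any other positive multiple of (w, b).
points = [(0.0, 0.0), (-1.0, 0.0), (2.0, 2.0), (3.0, 3.0)]
w, b = canonical_form([4.0, 4.0], -8.0, points)
```

After rescaling, `w` is `[0.5, 0.5]` and `b` is `-1.0`, so the margin is the distance between the closest points (0,0) and (2,2).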
Support Vector Machines: Hard Margin Constraint
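optimization problem for $S = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\} \subseteq \mathbb{R}^d \times \{+1, -1\}$ such that S is linearly separable:
$$\max_{w, b} \frac{2}{\|w\|}$$
subject to $y_i(\langle w, \vec{x}_i \rangle + b) \geq 1$ for $i = 1, \ldots, n$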
remark: hard margin constraints: all data points are classified correctly
problem: objective function is nonconvex
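A standard way around the nonconvexity, stated here as a sketch rather than taken from the slides: maximizing $2/\|w\|$ has the same solutions as minimizing $\frac{1}{2}\|w\|^2$, which is a convex quadratic objective under the same linear constraints.

```latex
\max_{w,b} \; \frac{2}{\|w\|}
\quad\Longleftrightarrow\quad
\min_{w,b} \; \frac{1}{2}\|w\|^2
\qquad \text{subject to } y_i\bigl(\langle w, \vec{x}_i \rangle + b\bigr) \ge 1, \quad i = 1, \ldots, n
```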
Dual Optimization Problem
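solve:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \vec{x}_i, \vec{x}_j \rangle$$
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \geq 0$, $i = 1, \ldots, n$
maximum margin hyperplane ($w = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i$):
$$f(\vec{x}) = \sum_{i=1}^{n} y_i \alpha_i \langle \vec{x}_i, \vec{x} \rangle + b$$
$$b = -\frac{1}{2} \left( \max_{i:\, y_i = -1} \Big( \sum_{j=1}^{n} y_j \alpha_j \langle \vec{x}_i, \vec{x}_j \rangle \Big) + \min_{i:\, y_i = 1} \Big( \sum_{j=1}^{n} y_j \alpha_j \langle \vec{x}_i, \vec{x}_j \rangle \Big) \right)$$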
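How the hyperplane is recovered from a dual solution can be sketched as follows. The toy data are hypothetical, and `alpha` is simply stated as the dual solution for this data set (it is not computed here); the snippet only checks feasibility and evaluates the formulas for w, b and f.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical toy data with an assumed dual solution alpha
X = [(0.0, 0.0), (2.0, 2.0), (3.0, 3.0), (-1.0, 0.0)]
y = [-1, 1, 1, -1]
alpha = [0.25, 0.25, 0.0, 0.0]

# Feasibility for the dual problem: alpha_i >= 0 and sum_i alpha_i y_i = 0
feasible = (all(a >= 0 for a in alpha)
            and abs(sum(a * yi for a, yi in zip(alpha, y))) < 1e-12)

# w = sum_i alpha_i y_i x_i
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

def g(x):
    """sum_j y_j alpha_j <x_j, x>, i.e. f(x) without the offset b."""
    return sum(yj * aj * dot(xj, x) for aj, yj, xj in zip(alpha, y, X))

# Offset from the largest negative-class and smallest positive-class value
b = -0.5 * (max(g(xi) for xi, yi in zip(X, y) if yi == -1)
            + min(g(xi) for xi, yi in zip(X, y) if yi == 1))

def f(x):
    return g(x) + b
```

For this data the result is w = (0.5, 0.5) and b = -1, and every training point satisfies the hard margin constraint.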
Dual Form: Remark
maximum margin hyperplane: $f(\vec{x}) = \sum_{i=1}^{n} y_i \alpha_i \langle \vec{x}_i, \vec{x} \rangle + b$
optimization theory: the dual complementarity conditions guarantee that
$$\alpha_i \left( y_i(\langle w, \vec{x}_i \rangle + b) - 1 \right) = 0$$
⇒ only the points on the margin hyperplanes are active in the prediction (i.e., $\alpha_i > 0$); all other points are inactive ($\alpha_i = 0$)
- points on the margin hyperplanes: support vectors
⇒ sparse kernel method because after training, a significant proportion of the data can be discarded; only the support vectors must be kept
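The sparsity remark can be illustrated with a small sketch. The data, `alpha` and `b` below are hypothetical values assumed to come from a trained hard margin SVM; the point is only that dropping the examples with $\alpha_i = 0$ leaves every prediction unchanged.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical training data with an assumed dual solution and offset
X = [(0.0, 0.0), (2.0, 2.0), (3.0, 3.0), (-1.0, 0.0)]
y = [-1, 1, 1, -1]
alpha = [0.25, 0.25, 0.0, 0.0]
b = -1.0

# Support vectors: the examples with alpha_i > 0
support = [i for i, a in enumerate(alpha) if a > 0]

def f_all(x):
    """Decision function using every training example."""
    return sum(alpha[i] * y[i] * dot(X[i], x) for i in range(len(X))) + b

def f_sv(x):
    """Decision function using the support vectors only."""
    return sum(alpha[i] * y[i] * dot(X[i], x) for i in support) + b

queries = [(1.0, 0.0), (2.5, 1.0), (-3.0, 4.0)]
same = all(abs(f_all(q) - f_sv(q)) < 1e-12 for q in queries)

# Complementarity: support vectors satisfy y_i * f(x_i) = 1
on_margin = [y[i] * f_sv(X[i]) for i in support]
```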
Dual Form: Remark
[Figure: data points labelled $\alpha_i = 0$ and support vectors]
Soft Margin SVM
hard margin constraints are relaxed to soft margin constraints:
$y_i(\langle w, \vec{x}_i \rangle + b) \geq 1 - \xi_i$ for $\xi_i \geq 0$ ($i = 1, \ldots, n$)
$\xi_i$: slack variables:
• $\xi_i = 0$: correct classification
• $0 < \xi_i \leq 1$: lies inside the margin, but on the correct side
• $\xi_i > 1$: lies on the wrong side
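The three cases can be made concrete with a short sketch. It assumes a fixed hyperplane (w, b) and takes the smallest slack that satisfies the soft margin constraint; the data are hypothetical, and grouping $\xi_i = 1$ (a point exactly on the decision boundary) with the middle case is an assumption of this sketch.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slack(w, b, x, y):
    """Smallest xi >= 0 with y * (<w, x> + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def describe(xi):
    if xi == 0.0:
        return "correct classification"
    if xi <= 1.0:
        return "inside the margin, but on the correct side"
    return "on the wrong side"

# Hypothetical hyperplane and labelled points
w, b = (0.5, 0.5), -1.0
data = [((3.0, 3.0), 1), ((1.5, 1.5), 1), ((0.0, 0.0), 1)]
slacks = [slack(w, b, x, y) for x, y in data]
labels = [describe(xi) for xi in slacks]
```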
Soft Margin SVM: Optimization Problem
optimization problem with $C > 0$:
$$\min_{w, b, \xi} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2$$
s.t. $y_i(\langle w, \vec{x}_i \rangle + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \ldots, n$
remarks:
• regularized empirical risk minimization with the hinge loss function
  $V(f(\vec{x}), y) = \max(0, 1 - y f(\vec{x}))$
  - f in our case: $f(\vec{x}) = \langle w, \vec{x} \rangle + b$
  - $C > 0$ plays the role of the regularization parameter $\lambda$ ($C = 1/\lambda$)
• $\sum_{i=1}^{n} \xi_i$ is an upper bound on the number of misclassified points
[Figure: loss functions (Hinge Loss, Binomial Deviance, Squared Error, Class Huber); vertical axis: Loss]
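A small sketch of the objective and of the bound on misclassified points. It evaluates the objective for a given (w, b), which is not claimed to be the minimizer; the data, the value of C and the function names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def hinge(fx, y):
    """Hinge loss V(f(x), y) = max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * fx)

def soft_margin_objective(w, b, X, y, C):
    """C * sum_i xi_i + 0.5 * ||w||^2, with xi_i set to the hinge loss,
    the smallest slack that satisfies the constraints for this (w, b)."""
    slacks = [hinge(dot(w, xi) + b, yi) for xi, yi in zip(X, y)]
    return C * sum(slacks) + 0.5 * dot(w, w), slacks

# Hypothetical data: the last point is a noisy example on the wrong side
X = [(0.0, 0.0), (2.0, 2.0), (3.0, 3.0), (-1.0, 0.0), (0.5, 0.0)]
y = [-1, 1, 1, -1, 1]
w, b, C = (0.5, 0.5), -1.0, 1.0

objective, slacks = soft_margin_objective(w, b, X, y, C)
misclassified = sum(1 for xi, yi in zip(X, y) if yi * (dot(w, xi) + b) < 0)
```

Every misclassified point has slack greater than 1, which is why the sum of the slacks bounds the number of misclassified points from above.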
Soft Margin SVM: Dual Form
dual can be obtained in a way similar to the case of hard margins:
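$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \vec{x}_i, \vec{x}_j \rangle$$
s.t. $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$, $i = 1, \ldots, n$
remark 1: quadratic programming problem
remark 2: almost the same as for hard margin SVM
difference: instead of $0 \leq \alpha_i$ (hard margin) we have the box constraints: $0 \leq \alpha_i \leq C$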
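To see the box constraints at work, here is a rough sketch of a solver for this quadratic program. It uses pairwise coordinate ascent (in the spirit of SMO), which is a technique chosen for this illustration and not the method of the slides; it runs a fixed number of sweeps without a convergence check. Setting C to infinity gives the hard margin dual. All names and data are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def solve_dual(X, y, C=float("inf"), sweeps=100):
    """Maximise sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j>
    subject to 0 <= a_i <= C and sum_i a_i y_i = 0."""
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                # Step a_i += y_i * t, a_j -= y_j * t keeps sum_k a_k y_k fixed.
                g_i = sum(alpha[k] * y[k] * K[i][k] for k in range(n))
                g_j = sum(alpha[k] * y[k] * K[j][k] for k in range(n))
                slope = (y[i] - y[j]) - (g_i - g_j)
                curvature = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if curvature <= 1e-12:
                    continue  # skip degenerate pairs (e.g. identical points)
                t = slope / curvature  # unconstrained maximiser along the step
                # The box constraints on both multipliers bound the step t.
                if y[i] == 1:
                    lo_i, hi_i = -alpha[i], C - alpha[i]
                else:
                    lo_i, hi_i = alpha[i] - C, alpha[i]
                if y[j] == 1:
                    lo_j, hi_j = alpha[j] - C, alpha[j]
                else:
                    lo_j, hi_j = -alpha[j], C - alpha[j]
                t = max(max(lo_i, lo_j), min(t, min(hi_i, hi_j)))
                alpha[i] += y[i] * t
                alpha[j] -= y[j] * t
    return alpha

# Separable toy data, hard margin (C = infinity)
X = [(0.0, 0.0), (2.0, 2.0), (3.0, 3.0), (-1.0, 0.0)]
y = [-1, 1, 1, -1]
alpha_hard = solve_dual(X, y)

# Same data plus a noisy point that makes it non-separable, soft margin
C = 0.1
X_noisy, y_noisy = X + [(2.5, 2.5)], y + [-1]
alpha_soft = solve_dual(X_noisy, y_noisy, C=C)
```

With finite C every multiplier stays inside the box, so a single noisy example cannot receive an arbitrarily large weight.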
Learning in Feature Space
learning in input space: difficult if the input-output relationship is nonlinear
common strategy in ML: using some appropriate function
$$\Phi: \mathbb{R}^d \to \mathbb{R}^D$$
transform your data (in $\mathbb{R}^d$) into another space ($\mathbb{R}^D$), called the feature space, in which the relationship becomes linear
Example: XOR
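XOR is not linearly separable in the input space, but it becomes separable after a suitable transformation. A minimal sketch, assuming the feature map $\Phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$ and a hand-picked hyperplane in feature space; both are illustrative choices, not necessarily the ones used in the lecture.

```python
def phi(x):
    """Hypothetical feature map R^2 -> R^3 that appends the product x1 * x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# XOR with labels in {-1, +1}
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]

# A separating hyperplane in feature space, chosen by hand
w, b = (1.0, 1.0, -2.0), -0.5
predictions = [1 if dot(w, phi(x)) + b > 0 else -1 for x in X]
```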
Challenges of learning nonlinear relationships
• How to choose the transformation such that the relation becomes linear?
• The transformation increases the feature dimension, which increases the computation cost
The Kernel Trick solves both problems
The Kernel Trick
Def.: a kernel is a function $k: X \times X \to \mathbb{R}$ such that for all $x, y \in X$,
$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$
for some function $\Phi$ mapping X to an inner product feature space H.
kernel trick: substitute all occurrences of $\langle \cdot, \cdot \rangle$ by a kernel k with
$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$
where $\Phi$ is the underlying function mapping the input space into the feature space
crucial point: $\Phi$ does not have to be calculated; it can even be unknown!
The Kernel Trick
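example: let $k: X \times X \to \mathbb{R}$ with $X \subseteq \mathbb{R}^d$ be defined by
$$k(\vec{x}, \vec{y}) = \langle \vec{x}, \vec{y} \rangle^2 \quad \text{for all } \vec{x}, \vec{y} \in X$$
claim: k is a kernel corresponding to the feature map $\Phi: \mathbb{R}^d \to \mathbb{R}^{d^2}$ defined by
$$\Phi: \vec{x} \mapsto (x_i x_j)_{i,j=1}^{d} \quad \text{for all } \vec{x} \in \mathbb{R}^d$$
proof:
$$\langle \Phi(\vec{x}), \Phi(\vec{y}) \rangle = \left\langle (x_i x_j)_{i,j=1}^{d}, (y_i y_j)_{i,j=1}^{d} \right\rangle = \sum_{i,j=1}^{d} x_i x_j y_i y_j = \Big( \sum_{i=1}^{d} x_i y_i \Big) \Big( \sum_{j=1}^{d} x_j y_j \Big) = \langle \vec{x}, \vec{y} \rangle^2$$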
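The claim can be checked numerically on a pair of vectors. A minimal sketch; the test vectors are arbitrary, and the function names are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def k(x, y):
    """Kernel k(x, y) = <x, y>^2, evaluated in the input space."""
    return dot(x, y) ** 2

def phi(x):
    """Feature map x -> (x_i * x_j) for i, j = 1..d, a vector of length d^2."""
    return [xi * xj for xi in x for xj in x]

x = (1.0, 2.0, 3.0)
z = (-1.0, 0.5, 2.0)

via_kernel = k(x, z)                   # one inner product in R^3
via_features = dot(phi(x), phi(z))     # one inner product in R^9
```

Both values agree, but the kernel needs d multiplications where the explicit feature map needs d squared, which is the computational point of the kernel trick.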
Kernel Construction
we present some basic kernel functions and show rules for constructing more complex kernels from simple ones
proof techniques used to show these results:
• construct the underlying feature map $\Phi$ corresponding to the kernel
• or use Mercer's characterization theorem (will not be discussed in this course)
Kernel Construction
Prop. $k(x, y) = f(x) f(y)$ is a kernel over $X \times X$ for all functions $f: X \to \mathbb{R}$
proof: let $\Phi: X \to \mathbb{R}$ be defined by
$$\Phi: x \mapsto f(x) \quad \text{for all } x \in X$$
⇒ $k(x, y) = f(x) f(y) = \langle \Phi(x), \Phi(y) \rangle$
q.e.d.
Kernel Construction
Prop. Let $k_1, k_2$ be kernels over $X \times X$. Then for all $\alpha, \beta \geq 0$,
$$k(x, y) = \alpha k_1(x, y) + \beta k_2(x, y)$$
is a kernel.
Prop. Let $k_1, k_2$ be kernels over $X \times X$. Then
$$k(x, y) = k_1(x, y) k_2(x, y)$$
is a kernel.
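Both rules can be illustrated by constructing the feature maps explicitly. The sketch below assumes $k_1$ is the linear kernel and $k_2(x, y) = \langle x, y \rangle^2$ with their usual feature maps; the weighted sum then corresponds to concatenating the scaled feature vectors, and the product to taking all pairwise products of features. These are standard constructions, shown here as an illustration rather than as the proof from the lecture.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def k1(x, y):
    return dot(x, y)               # linear kernel

def k2(x, y):
    return dot(x, y) ** 2          # quadratic kernel

def phi1(x):
    return list(x)                 # feature map of k1

def phi2(x):
    return [xi * xj for xi in x for xj in x]   # feature map of k2

def phi_sum(x, alpha, beta):
    """Feature map of alpha * k1 + beta * k2: scaled concatenation."""
    return ([math.sqrt(alpha) * v for v in phi1(x)]
            + [math.sqrt(beta) * v for v in phi2(x)])

def phi_prod(x):
    """Feature map of k1 * k2: all pairwise products of features."""
    return [u * v for u in phi1(x) for v in phi2(x)]

x, z = (1.0, 2.0), (3.0, 1.0)
alpha, beta = 2.0, 0.5

sum_direct = alpha * k1(x, z) + beta * k2(x, z)
sum_via_map = dot(phi_sum(x, alpha, beta), phi_sum(z, alpha, beta))
prod_direct = k1(x, z) * k2(x, z)
prod_via_map = dot(phi_prod(x), phi_prod(z))
```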
Common Kernel Functions
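common kernel functions over $\mathbb{R}^d \times \mathbb{R}^d$:
linear kernel: $k(\vec{x}, \vec{y}) := \vec{x}^\top \vec{y}$
polynomial kernel: $k(\vec{x}, \vec{y}) := (\vec{x}^\top \vec{y} + c)^p$
Gaussian or RBF kernel: $k(\vec{x}, \vec{y}) := \exp\left( -\frac{\|\vec{x} - \vec{y}\|_2^2}{2\sigma^2} \right)$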
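The three kernels are straightforward to write down in code. A minimal sketch; the default parameter values (`c=1.0`, `degree=2`, `sigma=1.0`) are assumptions made for the example, not values from the slides.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, c=1.0, degree=2):
    return (dot(x, y) + c) ** degree

def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = (1.0, 2.0), (3.0, 1.0)
k_lin = linear_kernel(x, z)       # <x, z> = 5
k_poly = polynomial_kernel(x, z)  # (5 + 1)^2 = 36
k_rbf = rbf_kernel(x, z)          # exp(-5 / 2)
```

The RBF kernel equals 1 for identical points and decays towards 0 as the points move apart; sigma controls how fast.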
Recap: Dual Optimization Problem
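solve:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \vec{x}_i, \vec{x}_j \rangle$$
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \geq 0$, $i = 1, \ldots, n$
maximum margin hyperplane ($w = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i$):
$$f(\vec{x}) = \sum_{i=1}^{n} y_i \alpha_i \langle \vec{x}_i, \vec{x} \rangle + b$$
$$b = -\frac{1}{2} \left( \max_{i:\, y_i = -1} \Big( \sum_{j=1}^{n} y_j \alpha_j \langle \vec{x}_i, \vec{x}_j \rangle \Big) + \min_{i:\, y_i = 1} \Big( \sum_{j=1}^{n} y_j \alpha_j \langle \vec{x}_i, \vec{x}_j \rangle \Big) \right)$$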
Dual Form: Remark
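optimization problem:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \vec{x}_i, \vec{x}_j \rangle$$
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \geq 0$, $i = 1, \ldots, n$
maximum margin hyperplane: $f(\vec{x}) = \sum_{i=1}^{n} y_i \alpha_i \langle \vec{x}_i, \vec{x} \rangle + b$
remark 1: input data and new points ($\vec{x}$) are used only through inner products
⇒ kernel trick is applicable!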
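Replacing every inner product by a kernel gives a nonlinear classifier. The sketch below applies this to XOR with inputs in {-1, +1}, assuming the polynomial kernel $(\langle x, z \rangle + 1)^2$; the dual variables are stated rather than computed, and the final check confirms that with them every training point lies exactly on a margin hyperplane.

```python
def poly_kernel(x, z):
    """Polynomial kernel (<x, z> + 1)^2; the kernel choice is an assumption."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

# XOR with inputs and labels in {-1, +1}
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [-1, 1, 1, -1]

# Assumed dual variables for this data and kernel (not computed here)
alpha = [0.125] * 4

def g(x):
    """sum_i alpha_i y_i k(x_i, x): the dual decision function without b."""
    return sum(a * yi * poly_kernel(xi, x) for a, yi, xi in zip(alpha, y, X))

# Offset b from the dual formula, with inner products replaced by the kernel
b = -0.5 * (max(g(xi) for xi, yi in zip(X, y) if yi == -1)
            + min(g(xi) for xi, yi in zip(X, y) if yi == 1))

def f(x):
    return g(x) + b

margins = [yi * f(xi) for xi, yi in zip(X, y)]
predictions = [1 if f(xi) > 0 else -1 for xi in X]
```

No feature vector is ever formed: training data and new points enter only through `poly_kernel`, which is what makes the substitution possible.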