ENCS5341 Machine Learning and Data Science
Kernels and SVM
Yazan Abu Farha - Birzeit University
Based on slides prepared by Tamás Horváth

Support Vector Machine: Noise-Tolerating Linear Separation

Noisy examples: suppose some amount of noise has been added to each example. Choose the hyperplane with the largest margin.

Arguments for the maximum margin hyperplane:
• Robust against noise.
• Excellent predictive performance in practice.
• The separating hyperplane becomes unique.

Example: Distance from a Hyperplane

• $f(x) = x_1 + x_2 - 3$
• Signed distance of the point $(0,0)$ from $f$: $d = \frac{f(0,0)}{\|(1,1)\|} = \frac{-3}{\sqrt{2}}$
• Signed distance of the point $(3,3)$ from $f$: $d = \frac{f(3,3)}{\|(1,1)\|} = \frac{3}{\sqrt{2}}$

[Figure: the line $x_1 + x_2 - 3 = 0$ in the $(x_1, x_2)$-plane with the points $(0,0)$ and $(3,3)$.]

The Maximum Margin Hyperplane

Hyperplane $\langle w, x \rangle + b = 0$: scale $w$ and $b$ such that $|\langle w, x' \rangle + b| = 1$ for all points $x'$ on the dashed margin hyperplanes $\Rightarrow$ margin: $\frac{2}{\|w\|}$.

Prediction of the class of an unseen instance $z$: $\operatorname{sign}(\langle w, z \rangle + b)$.

Support Vector Machines: Hard Margin

Constrained optimization problem for $S = \{(x_1, y_1), \dots, (x_n, y_n)\} \subseteq \mathbb{R}^d \times \{+1, -1\}$ such that $S$ is linearly separable:

$$\max_{w, b} \; \frac{2}{\|w\|} \quad \text{subject to } |\langle w, x_i \rangle + b| \geq 1 \;\; \text{for } i = 1, \dots, n$$

Remark: hard margin constraints, i.e. all data points are classified correctly.
Problem: the objective function is non-convex.

Dual Optimization Problem

Solve:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
$$\text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \geq 0, \quad i = 1, \dots, n$$

Maximum margin hyperplane ($w = \sum_{i=1}^{n} y_i \alpha_i x_i$):

$$f(x) = \sum_{i=1}^{n} y_i \alpha_i \langle x_i, x \rangle + b$$
$$b = -\frac{1}{2} \left( \max_{i : y_i = -1} \sum_{j=1}^{n} y_j \alpha_j \langle x_j, x_i \rangle \; + \; \min_{i : y_i = +1} \sum_{j=1}^{n} y_j \alpha_j \langle x_j, x_i \rangle \right)$$

Dual Form: Remark

Maximum margin hyperplane: $f(x) = \sum_{i=1}^{n} y_i \alpha_i \langle x_i, x \rangle + b$

Optimization theory: the dual complementarity conditions guarantee that

$$\alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 \right) = 0$$

$\Rightarrow$ only the points on the margin hyperplanes are active in the prediction (i.e., $\alpha_i > 0$); all other points are inactive ($\alpha_i = 0$).

Points on the margin hyperplanes: support vectors.

$\Rightarrow$ sparse kernel method: after training, a significant proportion of the data can be discarded; only the support vectors must be kept.

[Figure: a separating hyperplane with its margin; points off the margin hyperplanes have $\alpha_i = 0$, the points on them are the support vectors.]

Soft Margin SVM

The hard margin constraints are relaxed to soft margin constraints:

$$y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i \quad \text{for } \xi_i \geq 0 \;\; (i = 1, \dots, n)$$

$\xi_i$: slack variables:
• $\xi_i = 0$: correct classification
• $0 < \xi_i \leq 1$: the point lies inside the margin, but on the correct side
• $\xi_i > 1$: the point lies on the wrong side

Soft Margin SVM: Optimization Problem

Optimization problem with $C > 0$:

$$\min_{w, b, \xi} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2$$
$$\text{s.t. } y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n$$

Remarks:
• This is regularized empirical risk minimization with the hinge loss function $V(f(x), y) = \max(0, 1 - y f(x))$; in our case $f(x) = \langle w, x \rangle + b$.
• $C > 0$ plays the role of the regularization parameter $\lambda$ ($C = 1/\lambda$).
• $\sum_{i=1}^{n} \xi_i$ is an upper bound on the number of misclassified points.

[Figure: classification loss functions plotted against the margin — hinge loss, binomial deviance, squared error, Huber.]

Soft Margin SVM: Dual Form

The dual can be obtained in a way similar to the hard margin case:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
$$\text{s.t. } 0 \leq \alpha_i \leq C \;\text{ and }\; \sum_{i=1}^{n} \alpha_i y_i = 0, \quad i = 1, \dots, n$$

Remark 1: this is a quadratic programming problem.
Remark 2: almost the same as for the hard margin SVM; the only difference is that instead of $0 \leq \alpha_i$ (hard margin) we have the box constraints $0 \leq \alpha_i \leq C$.
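Before moving on to kernels, here is a minimal sketch (not from the slides, and assuming scikit-learn is available) of how the soft-margin pieces above look in practice: the toy blob dataset and the choice $C = 1$ are illustrative assumptions. It fits a linear soft-margin SVM, checks the box constraint $0 \leq \alpha_i \leq C$ on the dual coefficients, and reproduces the prediction rule $\operatorname{sign}(\langle w, z \rangle + b)$.

```python
# Minimal soft-margin SVM sketch (illustrative; dataset and C are assumptions).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping classes, so some slack variables xi_i > 0 are needed.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.8, random_state=0)
y = 2 * y - 1  # relabel the classes to {+1, -1} as in the slides

# C = 1/lambda: large C penalizes slack heavily (harder margin),
# small C tolerates more margin violations (softer margin).
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# Only support vectors have alpha_i > 0; sklearn stores y_i * alpha_i in
# dual_coef_, so the box constraint 0 <= alpha_i <= C means |dual_coef_| <= C.
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))
print("box constraint holds:", bool(np.all(np.abs(clf.dual_coef_) <= C + 1e-9)))

# Prediction rule sign(<w, z> + b), with w = sum_i y_i alpha_i x_i.
w, b = clf.coef_.ravel(), clf.intercept_[0]
z = X[0]
print("manual:", int(np.sign(w @ z + b)), " sklearn:", int(clf.predict([z])[0]))
```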
Learning in Feature Space

Learning in the input space is difficult if the input-output relationship is nonlinear. A common strategy in ML: use some appropriate function $\Phi : \mathbb{R}^d \to \mathbb{R}^D$ to transform the data (in $\mathbb{R}^d$) into another space ($\mathbb{R}^D$), called the feature space, in which the relationship becomes linear.

Example: XOR

[Figure: the XOR problem — not linearly separable in the input space, but linearly separable after a suitable transformation $\Phi$.]

Challenges of Learning Non-Linear Relationships

• How to choose the transformation such that the relationship becomes linear?
• The transformation increases the feature dimension, which increases the computational cost.

The kernel trick solves both problems.

The Kernel Trick

Def.: a kernel is a function $k : X \times X \to \mathbb{R}$ such that for all $x, y \in X$,

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$

for some function $\Phi$ mapping $X$ to an inner product feature space $H$.

Kernel trick: substitute all occurrences of $\langle \cdot, \cdot \rangle$ by a kernel $k$ with $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi$ is the underlying function mapping the input space into the feature space.

Crucial point: $\Phi$ does not have to be calculated; it can even be unknown!

The Kernel Trick: Example

Let $k : X \times X \to \mathbb{R}$ with $X \subseteq \mathbb{R}^d$ be defined by

$$k(x, y) = \langle x, y \rangle^2 \quad \text{for all } x, y \in X$$

Claim: $k$ is a kernel corresponding to the feature map $\Phi : \mathbb{R}^d \to \mathbb{R}^{d^2}$ defined by

$$\Phi : x \mapsto (x_i x_j)_{i,j=1}^{d} \quad \text{for all } x \in \mathbb{R}^d$$

Proof:

$$\langle \Phi(x), \Phi(y) \rangle = \left\langle (x_i x_j)_{i,j=1}^{d}, \, (y_i y_j)_{i,j=1}^{d} \right\rangle = \sum_{i,j=1}^{d} x_i x_j y_i y_j = \left( \sum_{i=1}^{d} x_i y_i \right) \left( \sum_{j=1}^{d} x_j y_j \right) = \langle x, y \rangle^2$$

Kernel Construction

We present some basic kernel functions, as well as rules for constructing more complex kernels from simple ones. Proof techniques used to show these results:
• construct the underlying feature map $\Phi$ corresponding to the kernel, or
• use Mercer's characterization theorem (not discussed in this course).

Kernel Construction

Prop. $k(x, y) = f(x) f(y)$ is a kernel over $X \times X$ for all functions $f : X \to \mathbb{R}$.

Proof: let $\Phi : X \to \mathbb{R}$ be defined by $\Phi : x \mapsto f(x)$ for all $x \in X$. Then $k(x, y) = f(x) f(y) = \langle \Phi(x), \Phi(y) \rangle$. q.e.d.

Kernel Construction

Prop. Let $k_1, k_2$ be kernels over $X \times X$. Then for all $\alpha, \beta \geq 0$,

$$k(x, y) = \alpha k_1(x, y) + \beta k_2(x, y)$$

is a kernel.

Prop. Let $k_1, k_2$ be kernels over $X \times X$. Then

$$k(x, y) = k_1(x, y) \, k_2(x, y)$$

is a kernel.

Common Kernel Functions

Common kernel functions over $\mathbb{R}^d \times \mathbb{R}^d$:
• linear kernel: $k(x, y) := x^{\top} y$
• polynomial kernel: $k(x, y) := (x^{\top} y + c)^k$
• Gaussian or RBF kernel: $k(x, y) := \exp\left( -\frac{\|x - y\|_2^2}{2 \sigma^2} \right)$

Recap: Dual Optimization Problem

Solve:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
$$\text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \geq 0, \quad i = 1, \dots, n$$

Maximum margin hyperplane ($w = \sum_{i=1}^{n} y_i \alpha_i x_i$):

$$f(x) = \sum_{i=1}^{n} y_i \alpha_i \langle x_i, x \rangle + b$$
$$b = -\frac{1}{2} \left( \max_{i : y_i = -1} \sum_{j=1}^{n} y_j \alpha_j \langle x_j, x_i \rangle \; + \; \min_{i : y_i = +1} \sum_{j=1}^{n} y_j \alpha_j \langle x_j, x_i \rangle \right)$$

Dual Form: Remark

In the dual optimization problem above and in the resulting maximum margin hyperplane $f(x) = \sum_{i=1}^{n} y_i \alpha_i \langle x_i, x \rangle + b$, the input data and new points $x$ are used only through inner products $\Rightarrow$ the kernel trick is applicable!
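To close, a small numerical check — only a sketch, assuming NumPy is available, and not part of the deck: it verifies the $\langle x, y \rangle^2$ example from the kernel-trick slide against its explicit feature map $\Phi(x) = (x_i x_j)_{i,j}$, and writes out the three common kernels listed above. The parameter values ($c$, the polynomial degree, and $\sigma$) are arbitrary illustrative choices.

```python
# Kernel-trick sanity check (illustrative sketch; parameter values are assumptions).
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

def phi(v):
    # Explicit feature map Phi : R^d -> R^(d^2), v -> (v_i * v_j)_{i,j}
    return np.outer(v, v).ravel()

lhs = np.dot(x, y) ** 2          # kernel evaluated in the input space
rhs = np.dot(phi(x), phi(y))     # inner product in the feature space
print("k(x, y) == <Phi(x), Phi(y)>:", bool(np.isclose(lhs, rhs)))

# Common kernels over R^d x R^d from the last slides.
def linear_kernel(u, v):
    return np.dot(u, v)

def polynomial_kernel(u, v, c=1.0, degree=3):
    return (np.dot(u, v) + c) ** degree

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```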