





















































The document consists of notes taken in class. The topics covered are: - Linear Classification; - Linear Regression; - Logistic Regression; - Model Selection; - Non Linear Modeling; - Non Parametric Classification; - Clustering; - Stochastic Processes;
Type: Notes






















































THE FEASIBILITY OF LEARNING

THE LEARNING PROBLEM - SUPERVISED
· Output/label: $y \in \mathcal{Y}$, e.g. $\mathcal{Y} = \{\text{spam}, \text{ham}\}$
· Target function: $f: \mathcal{X} \to \mathcal{Y}$, the "ruled law" that maps email attributes into a certain label
· Data: $D_N = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})\}$, a record of emails, where $x^{(i)}$ is the entire set of attributes of email $i$
· Hypothesis/model: $g$, a candidate model for $f$, chosen from a set of candidates
· Hypothesis set: $g \in \mathcal{H}$, the set of candidate formulas under consideration

THE LEARNING PROCEDURE
unknown target $f$ -> examples (DATA) -> learning algorithm -> final hypothesis $g \approx f$
The learning algorithm picks $g$ from the hypothesis set $\mathcal{H}$ (the models).
Learning from data is meaningful if:

[Figure: seven inputs, each a group of balls that are coloured or empty.]
· For $x^{(1)}, x^{(2)}, x^{(3)}$: $y = +1$
· For $x^{(4)}, x^{(5)}, x^{(6)}$: $y = -1$
What about $x^{(7)}$?
· It can be $y^{(7)} = +1$ if you consider the colour of the first ball.
· It can be $y^{(7)} = -1$ if you consider a different rule, based on the number of coloured balls $n_c$.
We don't have enough information to answer the question.
[Figure: points in the plane, with $x \in \mathbb{R}$ on the horizontal axis and $y$ on the vertical axis. The relationship between the points can be linear.]

From a deterministic point of view, learning from data is not feasible (Hume's problem). But it is possible from a statistical point of view: statistical models can be wrong, but they are not wrong [...]

BIN EXPERIMENT
There is a bin that we cannot see inside. It contains an unknown fraction of blue marbles.
We pick $N$ marbles independently.
· $\mu = \mathbb{P}[\text{picking a blue marble}]$ ($\mu$ is unknown)
· $\nu$: fraction of blue marbles in the sample ($\nu$ is known)
Connection to learning: a blue marble corresponds to $g(x) \neq f(x)$.
For each data point $x^{(i)}$ we know $f(x^{(i)})$, and for each $g \in \mathcal{H}$ we can find out whether $g(x^{(i)}) = f(x^{(i)})$.
This doesn't tell us what $f$ is, but it can tell us whether $g$ will approximate $f$: the sample fraction $\nu$ gives an estimate of the error rate $\mu$ (the unknown probability of a blue marble) that $g$ makes in approximating $f$.
We can explore the entire hypothesis set $\mathcal{H}$ to find a function with a small error rate.

$E_{in}(g) = \frac{1}{N} \sum_{x^{(i)} \in D_N} \mathbb{1}[g(x^{(i)}) \neq f(x^{(i)})]$ (error on the dataset)
$E_{out}(g) = \mathbb{P}[g(x) \neq f(x)]$. This is the real probability of misclassifying one input.
We can apply Hoeffding's inequality to the learning problem.
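The bin experiment can be checked numerically. The sketch below repeats the experiment many times and compares how often the sample fraction deviates from the true probability with Hoeffding's inequality in its standard single-hypothesis form, $\mathbb{P}[|\nu - \mu| > \epsilon] \le 2e^{-2\epsilon^2 N}$. The function names, the seed and the parameter values are illustrative assumptions, not part of the notes.

```python
import math
import random


def sample_fraction(mu, n, rng):
    """Draw n marbles independently; return the fraction nu that are blue."""
    return sum(rng.random() < mu for _ in range(n)) / n


def hoeffding_bound(eps, n):
    """Single-hypothesis Hoeffding bound on P[|nu - mu| > eps]."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * n)


def deviation_frequency(mu, n, eps, trials, seed=0):
    """Fraction of repeated bin experiments in which |nu - mu| > eps."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    bad = sum(abs(sample_fraction(mu, n, rng) - mu) > eps for _ in range(trials))
    return bad / trials


if __name__ == "__main__":
    freq = deviation_frequency(mu=0.3, n=100, eps=0.1, trials=2000)
    print("observed frequency:", freq)
    print("Hoeffding bound:   ", hoeffding_bound(0.1, 100))
```

The observed frequency should sit well below the bound, since Hoeffding's inequality holds for any value of $\mu$ and is therefore loose for a specific one.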
Example: $g_1, g_2, g_3$ -> $M = 3$ candidate models, where $M = |\mathcal{H}|$ is the cardinality of the hypothesis set.
$\mathbb{P}[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le 2M e^{-2\epsilon^2 N}$
If $M$ increases we need more samples: we are no longer sure that $E_{in}(g) \approx E_{out}(g)$. On the other hand, the higher $M$, the easier it is to find a $g$ that fits the data well: trade-off.

OBJECTIVES OF LEARNING
1. Generalization: $E_{in}(g) \approx E_{out}(g)$
2. Fitting: minimize $E_{in}(g)$ with respect to $g \in \mathcal{H}$
There is a trade-off between the two.
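To make the dependence on $M$ concrete, the bound $2M e^{-2\epsilon^2 N}$ can be inverted to find how many samples keep it below a chosen level. This is a small sketch; the names `union_bound`, `samples_needed` and the tolerance `delta` are assumptions introduced here for illustration.

```python
import math


def union_bound(m, eps, n):
    """Bound 2*M*exp(-2*eps^2*N) on P[|E_in - E_out| > eps] for M models."""
    return 2.0 * m * math.exp(-2.0 * eps ** 2 * n)


def samples_needed(m, eps, delta):
    """Smallest N for which the bound is at most delta.

    Solving 2*M*exp(-2*eps^2*N) <= delta gives N >= ln(2M/delta) / (2*eps^2).
    """
    return math.ceil(math.log(2.0 * m / delta) / (2.0 * eps ** 2))


if __name__ == "__main__":
    for m in (1, 3, 100):
        print(m, samples_needed(m, eps=0.1, delta=0.05))
```

The required $N$ grows only logarithmically with $M$, which is why a larger hypothesis set costs more samples but not prohibitively many.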
[Diagram: target function $f(x)$ -> DATA -> learning algorithm $\mathcal{A}$ -> model $g$. The target function is a description of the real world. The hypothesis set $\mathcal{H}$ is a guess: start with a simple model class.]
To find the best-fitting model we use the PLA.

PERCEPTRON LEARNING ALGORITHM (PLA)
Key assumption: there exists a hyperplane that divides the classes.
The main rationale behind PLA is to move the line until there are no misclassified points:
$\mathrm{sign}(h(x^{(i)}, w)) = y^{(i)}, \quad \forall (x^{(i)}, y^{(i)}) \in D_N$
with $h(x, w) = \sum_{i=0}^{d} w_i x_i = w^T x$.
[Figure: two classes of points in the (weight, height) plane, separated by a line.]

It's an iterative algorithm that serves to improve the quality of $w$:
· Start from an initial guess for $w$.
· At iteration $t$: $w(t+1) = w(t) + y^{(m)}(t)\, x^{(m)}(t)$, where $(x^{(m)}, y^{(m)})$ denotes one misclassified point. This gives a new coefficient vector that is better for point $m$.
· Under the key assumption, PLA converges to perfect classification in a finite number of iterations.
· It is good in terms of data fitting ($E_{in}$); we don't know anything about $E_{out}$.

Why does the PLA update rule work?
Assume $(x^{(m)}, y^{(m)})$ is a misclassified point: $y^{(m)} \neq \mathrm{sign}(h(x^{(m)}, w(t)))$.
Assume $y^{(m)} = +1$, so $w(t)^T x^{(m)}$ is negative. After the update,
$w(t+1)^T x^{(m)} = w(t)^T x^{(m)} + \|x^{(m)}\|^2$.
If the result is still negative the algorithm continues, and the term $w(t)^T x^{(m)}$ becomes less negative every time, because a positive term is added.
· PLA solves the fitting problem.
· The generalization problem could be an issue.
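The PLA iteration described above can be sketched in a few lines. This is a minimal illustration, assuming a zero initial guess, the first misclassified point as the one used for each update, and a toy dataset; the helper names (`sign`, `dot`, `pla`) and the `max_iters` safeguard are not from the notes.

```python
def sign(s):
    """Return +1 for positive scores and -1 otherwise (a score of 0 counts as -1)."""
    return 1 if s > 0 else -1


def dot(w, x):
    """Inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pla(points, labels, max_iters=10_000):
    """Perceptron learning algorithm.

    points: list of feature tuples (without the constant coordinate).
    labels: list of +1 / -1.
    Returns a weight vector w (with w[0] the bias) that classifies all points.
    """
    xs = [[1.0] + list(p) for p in points]  # prepend x_0 = 1
    w = [0.0] * len(xs[0])                  # initial guess (assumption: zeros)
    for _ in range(max_iters):
        misclassified = [(x, y) for x, y in zip(xs, labels) if sign(dot(w, x)) != y]
        if not misclassified:
            return w
        x, y = misclassified[0]             # pick one misclassified point
        w = [wi + y * xi for wi, xi in zip(w, x)]  # w(t+1) = w(t) + y * x
    raise RuntimeError("no separating hyperplane found within max_iters")


if __name__ == "__main__":
    pts = [(2, 3), (3, 4), (4, 3), (-1, -2), (-2, -1), (-3, -3)]
    lbl = [1, 1, 1, -1, -1, -1]
    print(pla(pts, lbl))
```

The `max_iters` limit is only a guard for data that is not linearly separable, where the update rule would otherwise loop forever.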
THEOREM
$E_{out}(g) = \mathbb{P}[g(x) \neq f(x)] = E_{in}(g) + O\left(\sqrt{\frac{d}{N} \ln N}\right)$
· $d$: dimension of the input vector
· $N$: size of the dataset
The difference between $E_{out}$ and $E_{in}$ decreases with $N$ and increases with the number of [...]
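Objective to minimize: $E_{out}(h) = \mathbb{E}[(h(x) - f(x))^2]$, which is not possible to compute.
Approximate, sample-based objective: $E_{in}(h) = \frac{1}{N} \sum_{i=1}^{N} (h(x^{(i)}) - y^{(i)})^2$, which can be computed.
We want to find the line/surface that minimizes $E_{in}$: a least squares problem.

Hypothesis set:
· 2D case: $y = w_1 x + w_0$
· General case: $y = \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=0}^{d} w_i x_i$, with $x_0 = 1$

General idea: we want to minimize the vertical distance between the points and the predictor $y$.
Learning problem = optimization problem:
$h(x) = w^T x, \qquad E_{in}(w) = \frac{1}{N} \sum_{i=1}^{N} (w^T x^{(i)} - y^{(i)})^2$

Conditions for optimality:
· Necessary condition: $\nabla_w E_{in}(w) = 0$
· Second condition: the Hessian $\nabla_w^2 E_{in}(w)$ is positive definite

Linear algebra recall:
A matrix $M \in \mathbb{R}^{(d+1) \times (d+1)}$ is positive definite ($M \succ 0$) if $x^T M x > 0$ for all $x \neq 0$, $x \in \mathbb{R}^{d+1}$.
Tricks:
· $\nabla_w (w^T A w) = (A + A^T) w$, with $A \in \mathbb{R}^{(d+1) \times (d+1)}$
· $\nabla_w (w^T b) = b$

In matrix form, with $X$ the matrix whose rows are the $x^{(i)T}$ and $y$ the vector of labels:
$E_{in}(w) = \frac{1}{N} (w^T X^T X w + y^T y - 2 w^T X^T y)$
$\nabla_w E_{in}(w) = \frac{1}{N} (2 X^T X w - 2 X^T y) = \frac{2}{N} (X^T X w - X^T y)$
So $\nabla_w E_{in}(w) = 0$ gives $X^T X w = X^T y$, i.e. $w = (X^T X)^{-1} X^T y$ when $X^T X$ is invertible.
The second condition is satisfied: $\nabla_w^2 E_{in}(w) = \frac{2}{N} X^T X$, which is positive definite whenever $X$ has full column rank.
[Figure: $E_{in}$ as a bowl-shaped surface over $(w_1, w_2)$.]
We have one stationary point, and it is a minimum.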
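For the 2D case $y = w_1 x + w_0$ the normal equations $X^T X w = X^T y$ reduce to a 2x2 linear system that can be solved by hand. The sketch below does exactly that in plain Python; the function names and the sample data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w1*x + w0 by solving X^T X w = X^T y.

    With rows (1, x_i) in X:
        X^T X = [[n,  sx ],     X^T y = [sy ]
                 [sx, sxx]]             [sxy]
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular: all x values are identical")
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


def e_in(w0, w1, xs, ys):
    """In-sample mean squared error of the line y = w1*x + w0."""
    return sum((w1 * x + w0 - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.1, 2.9, 5.2, 6.8]
    w0, w1 = fit_line(xs, ys)
    print(w0, w1, e_in(w0, w1, xs, ys))
```

For higher-dimensional inputs the same system is normally solved with a linear algebra library rather than by explicit inversion.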
Example: $x$ = patient attributes (age, blood, weight); prediction of a heart attack.
It can be formulated as a problem of the form $g \approx f(x)$ that predicts whether a heart attack occurs, but the corresponding prediction will have very low accuracy.
In the supervised learning problems seen so far: $x \to f(x) \to y$
Here: $x \to \mathbb{P}(y \mid x) \to y$, where $y$ is random (stochastic description).
We are not interested in $y$ but in $\mathbb{P}(y \mid x)$.
Target function: $f(x) = \mathbb{P}(y = +1 \mid x)$ (an equivalent description takes $f(x) = \mathbb{P}(y = -1 \mid x)$ instead)
[Figure: $f(x)$ generates the data $y$.]

We have two problems:
· we need to determine the hypothesis set $\mathcal{H}$
· we need to determine the learning algorithm $\mathcal{A}$
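Models seen so far, all based on the score $s = w^T x$:
· linear classification model: $w^T x \to s \to \mathrm{sign}(s) \to y \in \{-1, +1\}$
· linear regression model: $w^T x \to s \to y \in \mathbb{R}$
Here $s$ represents the risk, and the risk is linear.
In our problem the output of the model is a probability: it must be a number in $[0, 1]$.

LOGISTIC MODEL
Logistic or sigmoid function:
$h(s) = \frac{e^s}{1 + e^s}$
[Figure: S-shaped curve of $h(s)$ against the score $s$.]
This can be used to output a probability:
· $y = +1$: $\mathbb{P}[y = +1 \mid x] = h(s)$
· $y = -1$: $\mathbb{P}[y = -1 \mid x] = 1 - h(s) = \frac{1}{1 + e^s} = h(-s)$
In compact form: $\mathbb{P}[y \mid x] = h(y\, w^T x)$
Error measure for one point: $\mathrm{err}(y^{(i)}, h_w(x^{(i)})) = \ln(1 + e^{-y^{(i)} w^T x^{(i)}})$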
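The logistic function and the pointwise error are easy to check numerically. The sketch below evaluates $h(s)$, the probability $h(y\, w^T x)$ and the error $\ln(1 + e^{-y w^T x})$, using numerically stable forms; the function names and the example values of `w` and `x` are assumptions for illustration.

```python
import math


def logistic(s):
    """h(s) = e^s / (1 + e^s), written so that exp never overflows."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def score(w, x):
    """Linear score s = w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def prob(y, w, x):
    """P[y | x] = h(y * w^T x) for y in {-1, +1}."""
    return logistic(y * score(w, x))


def pointwise_error(y, w, x):
    """err = ln(1 + exp(-y * w^T x)), which equals -ln P[y | x]."""
    margin = y * score(w, x)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    w = [0.5, -1.0, 2.0]   # hypothetical weights, w[0] is the bias
    x = [1.0, 0.3, 0.8]    # hypothetical input with x[0] = 1
    print(prob(+1, w, x), prob(-1, w, x), pointwise_error(+1, w, x))
```

The two probabilities sum to one because $h(-s) = 1 - h(s)$, and a confident wrong prediction (large negative margin) gives an error that grows roughly linearly.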
NONLINEAR OPTIMIZATION
In logistic regression:
$E_{in}(w) = \frac{1}{N} \sum_{i=1}^{N} \ln(1 + e^{-y^{(i)} w^T x^{(i)}})$
This is a nonlinear function of $w$: there is no analytical expression for $w$ such that $\nabla E_{in}(w) = 0$, so we cannot solve $\nabla E_{in}(w) = 0$ in closed form.
We'll use an algorithm for numerical minimization, called gradient descent: start from an initial guess of $w$ and move towards a minimum of the curve.
[Figure: $E_{in}(w)$ as a curve over $w$ with a local and a global minimum.]

Open problems:
1. How to "roll down" the surface in a high-dimensional space.
2. How not to get stuck in local minima.

Problem 2. Smart strategy: we can try several initial conditions and take the best minimizer.
Good news: the logistic model has only one minimum (the cost is convex). $E_{in}(w)$ is a convex function, so it has no local minima other than the global one. GD cannot end up in a local minimum, so problem 2 does not exist.

Problem 1: how to roll down the surface.
Key idea of gradient descent (GD): take a "small step" in the direction of a unit vector $\hat{v}$:
$w[i+1] = w[i] + \eta \hat{v}$, where $\eta$ is the step size.
What is the best choice for $\hat{v}$? We need to find the direction of $\hat{v}$ corresponding to the steepest (negative) slope.
$\Delta E_{in}(w) = E_{in}(w[i+1]) - E_{in}(w[i]) \approx \eta\, \nabla E_{in}(w[i])^T \hat{v}$
We want $\Delta E_{in}(w)$ to be the largest possible (with negative sign!), which gives
$\hat{v} = -\frac{\nabla E_{in}(w[i])}{\|\nabla E_{in}(w[i])\|}$

The best situation is a variable $\eta$ that gets smaller as we get close to the minimum:
$\eta[i] = \eta\, \|\nabla E_{in}(w[i])\|$, where $\eta$ is the learning rate and $\|\nabla E_{in}(w[i])\|$ is an indicator of "how far" the minimum is.
Final update rule:
$w[i+1] = w[i] - \eta[i] \frac{\nabla E_{in}(w[i])}{\|\nabla E_{in}(w[i])\|} = w[i] - \eta\, \nabla E_{in}(w[i])$
GD rule: $w[i+1] = w[i] - \eta\, \nabla E_{in}(w[i])$

Termination criteria include:
· a maximum number of iterations
· a threshold on $E_{in}$
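The GD rule can be sketched for the logistic-regression error $E_{in}(w) = \frac{1}{N} \sum_i \ln(1 + e^{-y^{(i)} w^T x^{(i)}})$, whose gradient is $-\frac{1}{N} \sum_i y^{(i)} x^{(i)} h(-y^{(i)} w^T x^{(i)})$. The code below is a minimal illustration in plain Python; the zero initial guess, the learning rate, the stopping values and the toy dataset are all assumptions rather than values from the notes.

```python
import math


def logistic(s):
    """h(s) = e^s / (1 + e^s), in an overflow-safe form."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def e_in(w, xs, ys):
    """Mean of ln(1 + exp(-y * w^T x)) over the dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        m = y * dot(w, x)
        total += math.log1p(math.exp(-m)) if m >= 0 else -m + math.log1p(math.exp(m))
    return total / len(xs)


def grad_e_in(w, xs, ys):
    """Gradient of E_in: -(1/N) * sum_i y_i * x_i * h(-y_i * w^T x_i)."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        coeff = -y * logistic(-y * dot(w, x))
        for j, xj in enumerate(x):
            g[j] += coeff * xj
    return [gj / len(xs) for gj in g]


def gradient_descent(xs, ys, eta=0.1, max_iters=1000, e_in_threshold=0.05):
    """GD rule w[i+1] = w[i] - eta * grad E_in(w[i]).

    Stops at max_iters iterations or once E_in falls below e_in_threshold.
    """
    w = [0.0] * len(xs[0])  # initial guess (assumption: zeros)
    for _ in range(max_iters):
        if e_in(w, xs, ys) <= e_in_threshold:
            break
        g = grad_e_in(w, xs, ys)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


if __name__ == "__main__":
    xs = [[1, 0.5], [1, 1.5], [1, 2.0], [1, -0.5], [1, -1.0], [1, 0.2]]  # x_0 = 1
    ys = [1, 1, 1, -1, -1, -1]
    w = gradient_descent(xs, ys)
    print(w, e_in(w, xs, ys))
```

A fixed $\eta$ is used here because, as derived above, the variable step $\eta[i]$ and the normalised direction combine into the plain rule with a constant learning rate.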
In machine learning there are two main problems: data fitting and generalization.
[Diagram: data -> $g(x) \approx f(x)$.]
To have a trade-off between data fitting and generalization, we have to choose the right hypothesis set $\mathcal{H}$.
If $\mathcal{H}$ is too simple, it is very likely that the target function is outside $\mathcal{H}$.
The bias/variance decomposition of $E_{out}$ is the right tool to choose $\mathcal{H}$.

BIAS-VARIANCE DECOMPOSITION
(we focus on linear regression only, for simplicity)
· $D_N = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})\}$: data set ($N$ fixed)
· $h(x^{(i)}) = w^T x^{(i)} = \hat{y}^{(i)}$: class of models ($\mathcal{H}$ fixed)
· We assume that $y^{(i)} = f(x^{(i)})$: no noise in the dataset
· $E_{in}(w) = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2$: cost function to minimize
Real objective (unfeasible): minimization of $E_{out}(w)$, where
$E_{out}(h) = \mathbb{E}_x[(f(x) - h(x))^2]$
#p (Eout^ (n"))^ :^ che +^ (f(x)^