TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 6, 2010
Learning by estimating parameters w∗ that minimize the empirical loss

L(w) = (1/N) ∑_{i=1}^N L(f(xi; w), yi)

Expected loss (risk):

R(w) = E_{(x0,y0)∼p(x,y)} [L(f(x0; w), y0)]

Least squares: f(x; w) = w^T x,

w∗ = argmin_w ∑_{i=1}^N (f(xi; w) − yi)^2 = (X^T X)^{−1} X^T y
Typically, as N increases, the empirical loss L_N goes up while the expected risk R goes down.
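As a quick illustration of the least-squares closed form above, here is a minimal NumPy sketch on synthetic data; the data-generating line, noise level, and sample size are made up for the example and are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: each row of X is an input, with a constant 1 for the bias term.
N = 100
x = rng.uniform(-3, 3, size=N)
X = np.column_stack([np.ones(N), x])             # N x 2 design matrix
y = 1.5 - 0.8 * x + rng.normal(0, 0.5, size=N)   # noisy linear targets (made up)

# Closed-form least-squares estimate: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Empirical loss L(w) = (1/N) sum_i (w^T x_i - y_i)^2
L = np.mean((X @ w_star - y) ** 2)
print(w_star, L)
```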
The optimal regression function
Error decomposition for parametric regression
Statistical view of regression
Practical issues
What is the best possible predictor of y, in terms of expected squared loss, if we do not restrict H at all?

f∗ = argmin_{f: X → R} E_{(x0,y0)∼p(x,y)} [(f(x0) − y0)^2]

Any f: X → R is allowed.
The product rule of probability: p(x, y) = p(y|x) p(x). By definition,

E_{p(y,x)} [g(y, x)] = ∫_x ∫_y g(y, x) p(y|x) p(x) dy dx
E_{(x0,y0)∼p(x,y)} [(f(x0) − y0)^2] = E_{x0∼p(x)} [ E_{y0∼p(y|x)} [(f(x0) − y0)^2 | x0] ]

                                    = ∫_{x0} E_{y0∼p(y|x)} [(f(x0) − y0)^2 | x0] p(x0) dx0

Must minimize the inner conditional expectation for each x0!

δ/δf(x0) E_{p(y|x)} [(f(x0) − y0)^2 | x0] = 2 E_{p(y|x)} [f(x0) − y0 | x0]

                                          = 2 ( f(x0) − E_{p(y|x)} [y0 | x0] )

We minimize the expected loss by setting f to the conditional expectation of y for each x:

f∗(x0) = E_{p(y|x)} [y0 | x0]
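As a sanity check of this result (not part of the original slides), the sketch below estimates the expected squared loss by Monte Carlo for a joint distribution whose conditional mean is known; the particular distribution and the competing predictors are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A joint p(x, y) with known conditional mean: y | x = sin(x) + noise, so E[y|x] = sin(x).
M = 200_000
x0 = rng.uniform(-np.pi, np.pi, size=M)
y0 = np.sin(x0) + rng.normal(0, 0.3, size=M)

def risk(f):
    """Monte Carlo estimate of E[(f(x0) - y0)^2]."""
    return np.mean((f(x0) - y0) ** 2)

print(risk(np.sin))                      # conditional mean: risk ~ 0.09 (the noise variance)
print(risk(lambda x: np.sin(x) + 0.2))   # any other predictor has strictly larger risk
print(risk(lambda x: 0.5 * x))           # e.g., a linear predictor
```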
Generative approach:
Let's take a closer look at the expected loss. ŵ = [ŵ0, ŵ1]^T are the LSQ estimates from the training data (assuming the 1D case); w∗ = [w∗0, w∗1]^T are the optimal linear regression parameters (generally unknown!). Decompose the prediction error:

y − ŵ0 − ŵ1 x = (y − w∗0 − w∗1 x) + (w∗0 + w∗1 x − ŵ0 − ŵ1 x)

Squaring and taking expectations,

E_{p(x,y)} [(y − ŵ0 − ŵ1 x)^2] = E_{p(x,y)} [(y − w∗0 − w∗1 x)^2] + E_{p(x,y)} [(w∗0 + w∗1 x − ŵ0 − ŵ1 x)^2]

The cross term vanishes since the prediction errors y − w∗0 − w∗1 x of the optimal linear predictor are uncorrelated with any linear function of x, including w∗0 + w∗1 x − ŵ0 − ŵ1 x.

Structural error E_{p(x,y)} [(y − w∗0 − w∗1 x)^2] measures the inherent limitations of the chosen hypothesis class (linear functions). This error remains even with infinite training data.

Approximation error E_{p(x,y)} [(w∗0 + w∗1 x − ŵ0 − ŵ1 x)^2] measures how close the estimate ŵ, obtained from finite training data, is to the optimal w∗. Note: since the training data X, Y are random variables drawn from p(x, y), the estimate ŵ is a random variable as well.
[Figure: structural error E[(y − w∗0 − w∗1 x)^2] and approximation error E[(w∗0 + w∗1 x − ŵ0 − ŵ1 x)^2], illustrated relative to the best regression f∗ = E[y|x], the best linear regression w∗, and the estimate ŵ.]
For a consistent estimation procedure, lim_{N→∞} ŵ = w∗, and so the approximation error decreases as the training set grows.
The structural error cannot be removed without changing the hypothesis class (e.g., moving from linear to quadratic regression).
Structural error is minimized if f∗ ∈ H. But is it 0 then?
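The decomposition can be illustrated numerically. The sketch below, with an arbitrarily chosen quadratic ground truth and noise level (not from the slides), fits a line to data whose conditional mean lies outside the linear class; the structural error persists as N grows while the approximation error shrinks. The optimal linear parameters w∗ are themselves only approximated here, from a very large sample.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    y = x**2 + rng.normal(0, 0.1, size=n)   # E[y|x] = x^2 lies outside the linear class
    return x, y

def fit_line(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Approximate the optimal linear parameters w* with a very large sample.
w_opt = fit_line(*sample(1_000_000))

x_test, y_test = sample(1_000_000)
pred_opt = w_opt[0] + w_opt[1] * x_test
structural = np.mean((y_test - pred_opt) ** 2)            # stays well above the noise variance

for N in (10, 100, 10_000):
    w_hat = fit_line(*sample(N))
    pred_hat = w_hat[0] + w_hat[1] * x_test
    approximation = np.mean((pred_opt - pred_hat) ** 2)   # shrinks as N grows
    print(N, structural, approximation)
```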
We will now explicitly model the randomness in the data:
y = f (x; w) + ν
where the noise ν accounts for everything not captured by f.
Under this model, the best predictor is
Ep(y|x) [f (x; w) + ν | x] = f (x; w) + Ep(ν) [ν]
Typically, Ep(ν) [ν] = 0 (white noise).
Under such a model, f (x; w) captures the expected value of y|x if we believe the distribution in the model.
y = f(x; w) + ν,   ν ∼ N(ν; 0, σ^2)

Given the input x, the label y is a random variable

p(y|x; w, σ) = N(y; f(x; w), σ^2)

that is,

p(y|x; w, σ) = 1/(σ√(2π)) · exp( −(y − f(x; w))^2 / (2σ^2) )
This is an explicit model of y that allows us, for instance, to sample y for a given x.
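As a small illustration (the linear form of f and the parameter values below are assumptions made for the example, not taken from the slides), one can both evaluate p(y|x; w, σ) and draw samples of y for a fixed x:

```python
import numpy as np

rng = np.random.default_rng(3)

w = np.array([1.0, -2.0])     # assumed parameters [w0, w1]
sigma = 0.5                   # assumed noise standard deviation

def f(x):
    return w[0] + w[1] * x    # assumed linear regression function f(x; w)

def p_y_given_x(y, x):
    """Gaussian density N(y; f(x; w), sigma^2)."""
    return np.exp(-(y - f(x)) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

x = 0.7
samples = f(x) + sigma * rng.normal(size=5)   # draw y ~ p(y | x; w, sigma)
print(samples, p_y_given_x(samples, x))
```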
The likelihood of the parameters w given the observed data X = [x1, ..., xN], Y = [y1, ..., yN]^T is defined as

p(Y | X; w, σ)

i.e., the probability of observing these ys for the given xs, under the model parametrized by w and σ.
Under the assumption that the data are i.i.d. (independently and identically distributed) according to p(x),

p(Y | X; w, σ) = ∏_{i=1}^N p(yi | xi; w, σ)
Maximum likelihood (ML) estimation principle:
ŵ_ML = argmax_w p(Y | X; w, σ)
Here we focus on likelihood as a function of w.
For the Gaussian noise model:

ŵ_ML = argmax_w ∏_{i=1}^N 1/(σ√(2π)) · exp( −(yi − f(xi; w))^2 / (2σ^2) )
This may become numerically unwieldy...
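One standard remedy, sketched below on synthetic data that is not from the slides, is to work with the log-likelihood: the product of many small densities underflows in floating point, while the sum of log-densities stays well behaved, and for fixed σ maximizing it is equivalent to minimizing the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from y = 2x + noise (model and parameters made up for illustration).
N, sigma = 5000, 0.3
x = rng.uniform(-1, 1, size=N)
y = 2.0 * x + rng.normal(0, sigma, size=N)

def log_likelihood(w):
    """Sum of per-point Gaussian log-densities log p(y_i | x_i; w, sigma)."""
    resid = y - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

def likelihood(w):
    return np.exp(log_likelihood(w))   # the raw product underflows to 0.0 for N this large

ws = np.linspace(1.5, 2.5, 1001)
w_ml = ws[np.argmax([log_likelihood(w) for w in ws])]
w_lsq = np.sum(x * y) / np.sum(x * x)  # least-squares solution for this no-intercept 1D model

print(likelihood(2.0))                 # numerically 0.0: the product form is unusable
print(w_ml, w_lsq)                     # the two estimates agree (up to the grid resolution)
```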