Lecture 5: Regression model and error decomposition

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

October 6, 2010

Review

Learning by estimating parameters w* that minimize the empirical loss

L(w) = (1/N) ∑_{i=1}^N L(f(x_i; w), y_i)

Expected loss (risk):

R(w) = E_{(x_0,y_0)∼p(x,y)}[L(f(x_0; w), y_0)]

Least squares: f(x; w) = w^T [1, x]^T,

w* = argmin_w (1/N) ∑_{i=1}^N (f(x_i; w) − y_i)^2 = (X^T X)^{−1} X^T y

  • Prediction errors y_i − f(x_i; w) have zero mean and are uncorrelated with any linear function of the inputs x.

Typically, as N increases, the empirical loss L_N(ŵ) goes up while the expected risk R(ŵ) goes down.
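As a concrete illustration, here is a minimal NumPy sketch (my own, not from the lecture; the data-generating function and variable names are invented) that computes the closed-form least-squares solution and checks the stated residual properties: zero mean and zero correlation with the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D data: y depends on x plus noise (hypothetical example).
N = 200
x = rng.uniform(-3, 3, size=N)
y = 1.5 * x - 0.5 + rng.normal(scale=1.0, size=N)

# Design matrix with a column of ones for the bias term: f(x; w) = w^T [1, x].
X = np.column_stack([np.ones(N), x])

# Closed-form least squares: w = (X^T X)^{-1} X^T y
# (lstsq is used instead of an explicit inverse for numerical stability).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ w_hat
print("w_hat:", w_hat)
print("mean residual:", residuals.mean())          # ~0
print("corr(residual, x):", np.dot(residuals, x))  # ~0: uncorrelated with x
```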

Plan for today

The optimal regression function

Error decomposition for parametric regression

Statistical view of regression

Practical issues

Best unrestricted predictor

What is the best possible predictor of y, in terms of expected squared loss, if we do not restrict H at all?

f* = argmin_{f: X→R} E_{(x_0,y_0)∼p(x,y)}[(f(x_0) − y_0)^2]

Any f: X → R is allowed.

The product rule of probability: p(x, y) = p(y|x) p(x). By definition,

E_{p(x,y)}[g(x, y)] = ∫_x ∫_y g(x, y) p(y|x) p(x) dy dx

Hence

E_{(x_0,y_0)∼p(x,y)}[(f(x_0) − y_0)^2] = E_{x_0∼p(x)}[E_{y_0∼p(y|x)}[(f(x_0) − y_0)^2 | x_0]]

= ∫_{x_0} {E_{y_0∼p(y|x)}[(f(x_0) − y_0)^2 | x_0]} p(x_0) dx_0

Must minimize the inner conditional expectation for each x_0!

δ/δf(x_0) E_{p(y|x)}[(f(x_0) − y_0)^2 | x_0] = 2 E_{p(y|x)}[f(x_0) − y_0 | x_0]

= 2 (f(x_0) − E_{p(y|x)}[y_0 | x_0]) = 0

We minimize the expected loss by setting f to the conditional expectation of y for each x: f*(x_0) = E_{p(y|x)}[y_0 | x_0]
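To make the result concrete, here is a small Monte Carlo sketch (my own toy example, not from the slides; the joint distribution is invented) comparing the expected squared loss of the conditional-mean predictor f*(x) = E[y|x] against two other unrestricted predictors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint distribution p(x, y): x ~ Uniform(-2, 2), y | x ~ N(x^2, 1),
# so the conditional mean is E[y | x] = x^2.
M = 1_000_000
x0 = rng.uniform(-2, 2, size=M)
y0 = x0**2 + rng.normal(size=M)

def expected_sq_loss(f):
    """Monte Carlo estimate of E[(f(x0) - y0)^2]."""
    return np.mean((f(x0) - y0) ** 2)

print("conditional mean x^2   :", expected_sq_loss(lambda x: x**2))        # ~1.0 (noise variance)
print("biased predictor x^2+.5:", expected_sq_loss(lambda x: x**2 + 0.5))  # ~1.25
print("different shape |x|    :", expected_sq_loss(lambda x: np.abs(x)))   # larger still
```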

Generative versus discriminative approach

Generative approach:

  • Estimate the joint probability density p(x, y).
  • Normalize to find the conditional density p(y|x).
  • Given a specific x_0, average over p(y|x_0) to find the conditional expectation ŷ = E_{p(y|x)}[y_0 | x_0].

Discriminative approach:

  • Estimate/infer the conditional density p(y|x) directly from the data; don't bother with p(x, y).
  • Average over p(y|x_0) to obtain ŷ.

Non-probabilistic approach: don't deal with probabilities; fit f(x) directly to the data.
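The following sketch (my own toy comparison, not from the lecture) contrasts the two probabilistic routes on jointly Gaussian data: the generative route fits the joint Gaussian and derives E[y|x] from it, while the direct route fits a linear predictor by least squares. For jointly Gaussian data the two recover essentially the same linear predictor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy jointly Gaussian data (hypothetical parameters).
N = 5000
x = rng.normal(loc=1.0, scale=2.0, size=N)
y = 0.8 * x + 0.3 + rng.normal(scale=0.5, size=N)

# Generative route: estimate the joint Gaussian p(x, y), then condition.
mu = np.array([x.mean(), y.mean()])
cov = np.cov(x, y)
# For a joint Gaussian, E[y | x0] = mu_y + (cov_xy / cov_xx) * (x0 - mu_x).
slope_gen = cov[0, 1] / cov[0, 0]
intercept_gen = mu[1] - slope_gen * mu[0]

# Direct (non-probabilistic) route: least-squares fit of y on [1, x].
X = np.column_stack([np.ones(N), x])
intercept_lsq, slope_lsq = np.linalg.lstsq(X, y, rcond=None)[0]

print("generative   :", intercept_gen, slope_gen)
print("least squares:", intercept_lsq, slope_lsq)   # essentially identical
```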

Decomposition of error

Let's take a closer look at the expected loss:

ŵ = [ŵ_0, ŵ_1]^T are the LSQ estimates from the training data (assuming the 1D case).
w* = [w_0*, w_1*]^T are the optimal linear regression parameters (generally unknown!).

y − ŵ_0 − ŵ_1 x = (y − w_0* − w_1* x) + (w_0* + w_1* x − ŵ_0 − ŵ_1 x)

E_{p(x,y)}[(y − ŵ_0 − ŵ_1 x)^2] = E_{p(x,y)}[(y − w_0* − w_1* x)^2]
  + 2 E_{p(x,y)}[(y − w_0* − w_1* x)(w_0* + w_1* x − ŵ_0 − ŵ_1 x)]
  + E_{p(x,y)}[(w_0* + w_1* x − ŵ_0 − ŵ_1 x)^2].

The second (cross) term vanishes, since the prediction errors y − w_0* − w_1* x are uncorrelated with any linear function of x, including w_0* + w_1* x − ŵ_0 − ŵ_1 x.

Decomposition of error

E_{p(x,y)}[(y − ŵ_0 − ŵ_1 x)^2] = E_{p(x,y)}[(y − w_0* − w_1* x)^2] + E_{p(x,y)}[(w_0* + w_1* x − ŵ_0 − ŵ_1 x)^2]

Structural error E_{p(x,y)}[(y − w_0* − w_1* x)^2] measures the inherent limitations of the chosen hypothesis class (linear functions). This error remains even with infinite training data.

Approximation error E_{p(x,y)}[(w_0* + w_1* x − ŵ_0 − ŵ_1 x)^2] measures how close ŵ, estimated from finite training data, is to the optimal w*.

Note: since the training data X, Y are random variables drawn from p(x, y), the estimate ŵ is a random variable as well.

Decomposition of error

Structural error: E[(y − w_0* − w_1* x)^2]

Approximation error: E[(w_0* + w_1* x − ŵ_0 − ŵ_1 x)^2]

[Figure: the best regression f* = E[y|x], the best linear regression w*, and the estimate ŵ; the structural error is the gap between f* and w*, and the approximation error is the gap between w* and ŵ.]

For a consistent estimation procedure, lim_{N→∞} ŵ = w*, and so the approximation error decreases.

The structural error cannot be removed without changing the hypothesis class (e.g., moving from linear to quadratic regression).

Structural error is minimized if f* ∈ H. But is it 0 then?
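Below is a small simulation sketch (my own, with an invented quadratic ground truth; not from the lecture) that estimates the two terms for a linear hypothesis class: the structural error of the best linear fit to a quadratic target, and the approximation error of a least-squares estimate ŵ from a small training set. The "optimal" w* is approximated on a very large sample.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    """Draw (x, y) from a hypothetical model: y = x^2 + noise, x ~ Uniform(-1, 1)."""
    x = rng.uniform(-1, 1, size=n)
    y = x**2 + rng.normal(scale=0.1, size=n)
    return x, y

def lsq_fit(x, y):
    """Least-squares fit of a line w0 + w1*x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Approximate the optimal linear parameters w* on a huge sample.
x_big, y_big = sample(2_000_000)
w_star = lsq_fit(x_big, y_big)

# Fit w_hat on a small training set.
x_tr, y_tr = sample(20)
w_hat = lsq_fit(x_tr, y_tr)

# Estimate the two error terms on a fresh large test sample.
x_te, y_te = sample(500_000)
pred_star = w_star[0] + w_star[1] * x_te
pred_hat = w_hat[0] + w_hat[1] * x_te

structural = np.mean((y_te - pred_star) ** 2)          # stays roughly constant as N grows
approximation = np.mean((pred_star - pred_hat) ** 2)   # shrinks as the training set grows
print("structural error   :", structural)
print("approximation error:", approximation)
```

Rerunning with a larger training set (replace 20 with, say, 2000) shrinks the approximation error while the structural error stays put, which is exactly the decomposition above.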

Statistical view of regression

We will now explicitly model the randomness in the data:

y = f(x; w) + ν

where the noise ν accounts for everything not captured by f.


  • This definition of "noise" may include a meaningful component of the signal, which stops being noise once we move to a more complex f.

Statistical view of regression

y = f(x; w) + ν

Under this model, the best predictor is

E_{p(y|x)}[f(x; w) + ν | x] = f(x; w) + E_{p(ν)}[ν]

Typically, E_{p(ν)}[ν] = 0 (white noise).

Under such a model, f(x; w) captures the expected value of y|x, if we believe the distribution in the model.

  • If (and only if) the model is "correct", f is optimal.
  • Real data are unlikely to have a true "correct" model.

Gaussian noise model

y = f(x; w) + ν,   ν ∼ N(ν; 0, σ^2)

Given the input x, the label y is a random variable

p(y | x; w, σ) = N(y; f(x; w), σ^2)

that is,

p(y | x; w, σ) = (1 / (σ√(2π))) exp(−(y − f(x; w))^2 / (2σ^2))

This is an explicit model of y that allows us, for instance, to sample y for a given x.
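As a sanity check of this generative story, the sketch below (a hypothetical linear f and made-up parameters, not from the lecture) samples y at a fixed x under the Gaussian noise model and verifies that the sample mean and standard deviation match f(x; w) and σ.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x, w):
    """Linear model f(x; w) = w0 + w1*x (hypothetical choice of f)."""
    return w[0] + w[1] * x

w = np.array([0.5, 2.0])   # hypothetical parameters
sigma = 0.3

# Sample y for a single fixed x under the Gaussian noise model y = f(x; w) + nu.
x0 = 1.2
samples = f(x0, w) + rng.normal(scale=sigma, size=100_000)

print("f(x0; w)   :", f(x0, w))
print("sample mean:", samples.mean())   # ~ f(x0; w)
print("sample std :", samples.std())    # ~ sigma
```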

Likelihood

The likelihood of the parameters w, given the observed data X = [x_1, ..., x_N] and Y = [y_1, ..., y_N]^T, is defined as

p(Y | X; w, σ)

i.e., the probability of observing these ys for the given xs, under the model parametrized by w and σ.

Under the assumption that the data are i.i.d. (independent and identically distributed) according to p(x),

p(Y | X; w, σ) = ∏_{i=1}^N p(y_i | x_i; w, σ)

Maximum likelihood estimation

Maximum likelihood (ML) estimation principle:

ŵ_ML = argmax_w p(Y | X; w, σ)

Here we focus on the likelihood as a function of w.

For the Gaussian noise model:

ŵ_ML = argmax_w ∏_{i=1}^N (1 / (σ√(2π))) exp(−(y_i − f(x_i; w))^2 / (2σ^2))

This may become numerically unwieldy...
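In practice one works with the log-likelihood instead; for the Gaussian noise model, maximizing it in w is equivalent to minimizing the sum of squared errors. The sketch below (a hypothetical linear f and synthetic data of my own) compares the raw product of densities, the log form, and the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data from a hypothetical linear model with Gaussian noise.
N, sigma = 100, 0.5
x = rng.uniform(-2, 2, size=N)
y = 1.0 + 3.0 * x + rng.normal(scale=sigma, size=N)
X = np.column_stack([np.ones(N), x])

def log_likelihood(w):
    """Gaussian log-likelihood: sum_i log N(y_i; f(x_i; w), sigma^2)."""
    resid = y - X @ w
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# The direct product of densities underflows easily for large N; the log form does not.
w_guess = np.array([0.0, 0.0])
densities = np.exp(-(y - X @ w_guess)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print("product of densities:", np.prod(densities))      # may underflow to 0.0
print("log-likelihood      :", log_likelihood(w_guess))  # well-behaved

# Maximizing the log-likelihood in w is the same as minimizing squared error,
# so the ML estimate coincides with the least-squares solution.
w_ml = np.linalg.lstsq(X, y, rcond=None)[0]
print("w_ML (= least squares):", w_ml)
```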