
Lecture 8: Intro to classification

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

October 13, 2010

Review: bias-variance decomposition

Question raised last time: should the bias^2 term be

$$\mathbb{E}_X\!\left[\left(F(x_0) - \bar f(x_0)\right)^2\right] \quad\text{or}\quad \left(\mathbb{E}_X\!\left[F(x_0) - \bar f(x_0)\right]\right)^2 \,?$$

Answer: both are correct. Neither $F(x_0)$ nor $\bar f(x_0)$ depends on the training sample $X$, so the quantity inside the expectation is a constant and

$$\mathbb{E}_X\!\left[\left(F(x_0) - \bar f(x_0)\right)^2\right] = \left(\mathbb{E}_X\!\left[F(x_0) - \bar f(x_0)\right]\right)^2 = \left(F(x_0) - \bar f(x_0)\right)^2.$$

See notes posted with Lecture 7 for a cleaned-up derivation.
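For context, here is a standard statement of the decomposition being reviewed, with notation adapted to match the slide: $F$ is the true function, $\bar f(x_0) = \mathbb{E}_X[\hat f(x_0; X)]$ is the prediction at $x_0$ averaged over training sets, and the observation noise $\nu$ has variance $\sigma^2$ (this restatement is mine, not part of the slide):

$$
\mathbb{E}_{X,\nu}\!\left[\left(y_0 - \hat f(x_0; X)\right)^2\right]
= \underbrace{\sigma^2}_{\text{noise}}
+ \underbrace{\left(F(x_0) - \bar f(x_0)\right)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_X\!\left[\left(\hat f(x_0; X) - \bar f(x_0)\right)^2\right]}_{\text{variance}}
$$

where $y_0 = F(x_0) + \nu$. The bias^2 term is exactly the expression discussed above.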

Review

Ridge regression:

$$\min_w \; \sum_{i=1}^{N} \left(y_i - w^T x_i\right)^2 + \lambda \sum_{j=1}^{m} w_j^2$$

Lasso:

$$\min_w \; \sum_{i=1}^{N} \left(y_i - w^T x_i\right)^2 + \lambda \sum_{j=1}^{m} |w_j|$$

[Figure: contours of the squared-error objective around the unregularized solution $\hat w_{ML}$ in the $(w_1, w_2)$ plane, together with the ridge constraint region $w_1^2 + w_2^2 \le \text{const}$ and the lasso constraint region $|w_1| + |w_2| \le \text{const}$; the corresponding solutions are marked $\hat w_{\text{ridge}}$ and $\hat w_{\text{lasso}}$.]
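To make the contrast concrete, here is a minimal numerical sketch (mine, not from the lecture) on synthetic data: ridge has the closed-form solution $(X^T X + \lambda I)^{-1} X^T y$, while the lasso penalty is not differentiable at zero, so the sketch uses a few passes of coordinate descent with soft-thresholding. The data-generating setup, $\lambda$, and all names are illustrative assumptions.

```python
import numpy as np

# Synthetic data with a sparse ground-truth weight vector.
rng = np.random.default_rng(0)
N, m = 50, 10
X = rng.normal(size=(N, m))
w_true = np.zeros(m)
w_true[:3] = [2.0, -1.0, 0.5]                  # only three relevant features
y = X @ w_true + 0.1 * rng.normal(size=N)

lam = 5.0

# Ridge has a closed form: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Lasso has no closed form; a simple solver is coordinate descent,
# where each coordinate update is a soft-thresholding step.
w_lasso = np.zeros(m)
for _ in range(200):                           # fixed number of sweeps, for brevity
    for j in range(m):
        r_j = y - X @ w_lasso + X[:, j] * w_lasso[j]   # residual excluding feature j
        rho = X[:, j] @ r_j
        z = X[:, j] @ X[:, j]
        w_lasso[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / z

print("ridge:", np.round(w_ridge, 3))          # shrunk, but typically all nonzero
print("lasso:", np.round(w_lasso, 3))          # typically many coordinates exactly zero
```

The printed coefficients usually show the qualitative difference suggested by the figure: the quadratic penalty shrinks every coordinate a little, while the $\ell_1$ penalty drives many coordinates exactly to zero.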

Plan for today

Review stagewise regression

Intro to classification

Linear discriminant functions

Combination of regressors

Consider linear regression model

y = f (x; w) = w 0 φ 0 (x) ︸ ︷︷ ︸ ≡ 1

+w 1 φ 1 (x) +... + wdφd(x).

We can see this as a combination of d + 1 simple regressors:

y =

∑^ d

j=

fj (x; w), fj (x; w) , wj φj (x)

−4−5 0 5

0

2

4

6

8

−4−5 0 5

0

2

4

6

8

−4−5 0 5

0

2

4

6

8

−4−5 0 5

0

2

4

6

8

−5^ −4 0 5

0

2

4

6

8

Forward stepwise regression

$$y = \sum_{j=0}^{d} f_j(x; w), \qquad f_j(x; w) = w_j \phi_j(x)$$

We can build this combination greedily, one function at a time.

Parametrize the set of functions: $f(x; \theta)$, $\theta = [w, j]$.

Step 1: fit the first simple model

$$\theta_1 = \arg\min_\theta \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2$$

Step 2: fit a second simple model to the residuals of the first:

$$\theta_2 = \arg\min_\theta \sum_{i=1}^{N} \Big(\underbrace{y_i - f(x_i; \theta_1)}_{\text{residual}} - f(x_i; \theta)\Big)^2$$

... Step $n$: fit a simple model to the residuals of the previous step.

Stop when there is no significant improvement in training error.

Final estimate after $M$ steps:

$$\hat y(x) = f(x; \theta_1) + \ldots + f(x; \theta_M)$$
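As a rough sketch of this greedy procedure, assuming monomial basis functions $\phi_j(x) = x^j$ and a simple "no significant improvement" stopping rule (the basis choice, the threshold, and all names here are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def forward_stepwise(x, y, degrees=range(6), max_steps=20, tol=1e-3):
    """Greedy stagewise fit: at each step pick the single (j, w) pair,
    with f(x; theta) = w * x**j, that best fits the current residuals."""
    residual = y.astype(float).copy()
    model = []                                    # the chosen theta_k = (j, w) pairs
    prev_err = np.inf
    for _ in range(max_steps):
        best = None
        for j in degrees:
            phi = x ** j
            w = (phi @ residual) / (phi @ phi)    # 1D least squares on the residuals
            err = np.sum((residual - w * phi) ** 2)
            if best is None or err < best[2]:
                best = (j, w, err)
        j, w, err = best
        if prev_err - err < tol:                  # stop: no significant improvement
            break
        model.append((j, w))
        residual = residual - w * x ** j          # the next step fits these residuals
        prev_err = err
    return model

def predict(model, x):
    return sum(w * x ** j for j, w in model)      # y_hat(x) = f(x;theta_1)+...+f(x;theta_M)

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = 1.0 + 0.05 * x ** 3 + rng.normal(scale=0.5, size=100)
print(forward_stepwise(x, y))
```

With data of this shape, the procedure typically picks a cubic term first and then a constant term, in the spirit of the example on the next slide.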

Stepwise regression: example

[Figure: for each step $k = 1, \ldots, 4$, the left panel shows the current fit $\sum_{j=1}^{k} f(x; \theta_j)$ and the right panel shows the residuals; the chosen parameters $\theta_k$ are:
k = 1: d = 3, w = -0.0512
k = 2: d = 0, w = 1.1024
k = 3: d = 5, w = 0.0002
k = 4: d = 0, w = 0.0536]

Classification

Shifting gears: classification. Many successful applications of ML: vision, speech, medicine, etc.

Setup: need to map x ∈ X to a label y ∈ Y.

Examples:

  • digit recognition; Y = {0, ..., 9}
  • prediction from microarray data; Y = {disease present, disease absent}

Classification versus regression

Formally: just like in regression, we want to learn a mapping from X to Y, but Y is finite and non-metric.

One approach is to (naïvely) ignore this.

Regression on the indicator matrix:

  • Code the possible values of the label as 1, ..., C.
  • Define the N × C matrix Y:

$$Y_{ic} = \begin{cases} 1 & \text{if } y_i = c, \\ 0 & \text{otherwise.} \end{cases}$$

  • This defines C independent regression problems; solving them with least squares yields

$$\hat{Y}_0 = X_0 \left(X^T X\right)^{-1} X^T Y.$$
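A minimal sketch of regression on the indicator matrix, assuming labels coded as 0, ..., C−1 and a bias column appended to X; the argmax decision rule at the end is the usual choice but is an assumption on my part, not stated on this slide.

```python
import numpy as np

def fit_indicator_regression(X, y, C):
    """Least-squares fit of C indicator targets (one column per class)."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])          # prepend a bias feature
    Y = np.zeros((N, C))
    Y[np.arange(N), y] = 1.0                      # Y[i, c] = 1 iff y_i = c
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)    # solves the C problems at once
    return W                                      # shape (d + 1, C)

def predict(W, X0):
    X0b = np.hstack([np.ones((X0.shape[0], 1)), X0])
    Y0_hat = X0b @ W                              # Y_hat_0 = X_0 (X^T X)^{-1} X^T Y
    return np.argmax(Y0_hat, axis=1)              # assumed rule: largest fitted value

# Toy usage with three Gaussian blobs.
rng = np.random.default_rng(0)
centers = [(-2, -2), (2, -2), (0, 2)]
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])
y = np.repeat(np.arange(3), 30)
W = fit_indicator_regression(X, y, C=3)
print("training accuracy:", np.mean(predict(W, X) == y))
```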

Classification as regression

Suppose we have a binary problem, $y \in \{-1, 1\}$.

Assuming the standard model $y = f(x; w) + \nu$, and solving with least squares, we get $\hat{w}$.

This corresponds to squared loss as a measure of classification performance! Does this make sense?

How do we decide on the label based on $f(x; \hat{w})$?

Classification as regression: example

[Figure: a 1D example, built up in stages: the data points $(x, y)$ with $y \in \{-1, +1\}$, the fitted line $w_0 + w^T x$, and the resulting decision regions $\hat y = +1$ and $\hat y = -1$.]

Classification as regression

$$f(x; \hat{w}) = w_0 + \hat{w}^T x$$

Can't just take $\hat y = f(x; \hat{w})$ since it won't be a valid label.

A reasonable decision rule: decide on $\hat y = 1$ if $f(x; \hat{w}) \ge 0$, otherwise $\hat y = -1$:

$$\hat y = \operatorname{sign}\left(w_0 + \hat{w}^T x\right)$$

This specifies a linear classifier:

  • The linear decision boundary (hyperplane) given by the equation $w_0 + \hat{w}^T x = 0$ separates the space into two "half-spaces".
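A short sketch of this rule on synthetic 2D data (the data, names, and settings here are illustrative): fit the ±1 labels by least squares, then classify with the sign of $w_0 + \hat{w}^T x$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),     # class y = -1
               rng.normal(+2.0, 1.0, size=(50, 2))])    # class y = +1
y = np.concatenate([-np.ones(50), np.ones(50)])

Xb = np.hstack([np.ones((X.shape[0], 1)), X])            # prepend bias: w = [w_0, w_1, w_2]
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)               # least-squares fit to +-1 targets

y_hat = np.sign(Xb @ w)                                  # decision rule: sign(w_0 + w^T x)
print("training accuracy:", np.mean(y_hat == y))
print("decision boundary: %.2f + %.2f*x1 + %.2f*x2 = 0" % tuple(w))
```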

Classification as regression

This seems to work well in some cases, but not so well in others.

Geometry of projections

[Figure, built up over several slides: in the $(x_1, x_2)$ plane, the hyperplane $w_0 + w^T x = 0$; the weight vector $w$, normal to the hyperplane; the hyperplane's offset from the origin along $w$, equal to $-w_0 / \|w\|$; a point $x_0$, its signed distance to the hyperplane $(w_0 + w^T x_0)/\|w\|$, and its projection $x_{0\perp}$ onto the hyperplane.]
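A tiny numeric check of the quantities in the figure (the specific values of $w$, $w_0$, and $x_0$ are made up for illustration): the signed distance from $x_0$ to the hyperplane $w_0 + w^T x = 0$ is $(w_0 + w^T x_0)/\|w\|$, and subtracting that much of the unit normal gives the projection $x_{0\perp}$.

```python
import numpy as np

w = np.array([3.0, 4.0])        # normal to the hyperplane
w0 = -5.0                       # offset term; hyperplane is 3*x1 + 4*x2 - 5 = 0
x0 = np.array([2.0, 1.0])       # an arbitrary point

norm_w = np.linalg.norm(w)
offset_from_origin = -w0 / norm_w               # how far along w the hyperplane sits
signed_dist = (w0 + w @ x0) / norm_w            # signed distance of x_0 to the hyperplane
x0_perp = x0 - signed_dist * (w / norm_w)       # projection of x_0 onto the hyperplane

print("offset from origin:", offset_from_origin)
print("signed distance of x0:", signed_dist)    # (w_0 + w^T x_0) / ||w||
print("check, should be ~0:", w0 + w @ x0_perp) # x0_perp lies on the hyperplane
```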