A lecture note from TTIC 31020: Introduction to Machine Learning, covering the topic of classification. The instructor, Greg Shakhnarovich, discusses bias-variance decomposition, ridge regression, lasso, forward stepwise regression, and the concept of classification as regression. The document also includes examples and explanations of linear discriminant functions and the geometry of projections.
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 13, 2010
Question raised last time: should the bias$^2$ term be
\[
E_X\left[\bigl(F(x_0) - \bar{f}(x_0)\bigr)^2\right]
\quad\text{or}\quad
\Bigl(E_X\bigl[F(x_0) - \bar{f}(x_0)\bigr]\Bigr)^2\,?
\]
Answer: both are correct, since $F(x_0) - \bar{f}(x_0)$ does not depend on the training set $X$, so the expectation of its square equals the square of its expectation.
See notes posted with Lecture 7 for a cleaned-up derivation.
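For completeness, a one-line check of this equivalence (a sketch, using the convention that $\bar{f}(x_0) = E_X[\hat{f}(x_0)]$ and that $F$ is the fixed target function, so neither $F(x_0)$ nor $\bar{f}(x_0)$ varies with the training set $X$):
\[
E_X\Bigl[\bigl(F(x_0) - \bar{f}(x_0)\bigr)^2\Bigr]
= \bigl(F(x_0) - \bar{f}(x_0)\bigr)^2
= \Bigl(E_X\bigl[F(x_0) - \bar{f}(x_0)\bigr]\Bigr)^2 .
\]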
Ridge regression:
\[
\min_w \; \sum_{i=1}^{n} \bigl(y_i - w^T x_i\bigr)^2 + \lambda \sum_{j=1}^{m} w_j^2
\]
Lasso:
\[
\min_w \; \sum_{i=1}^{n} \bigl(y_i - w^T x_i\bigr)^2 + \lambda \sum_{j=1}^{m} |w_j|
\]
[Figure: in the $(w_1, w_2)$ plane, contours of the squared-error objective around the unconstrained least-squares solution $\hat{w}_{ML}$, together with the ridge constraint region $w_1^2 + w_2^2 \le t$ and the lasso constraint region $|w_1| + |w_2| \le t$; the constrained solutions $\hat{w}_{\text{ridge}}$ and $\hat{w}_{\text{lasso}}$ lie where the contours first touch each region.]
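A minimal numpy sketch of both estimators, for illustration only (it assumes the penalty is applied to all coefficients, with no separate intercept term; the lasso is solved here by cyclic coordinate descent with soft-thresholding, one common approach):

import numpy as np

def ridge(X, y, lam):
    # Ridge regression: minimize sum_i (y_i - w^T x_i)^2 + lam * sum_j w_j^2.
    # Closed-form solution: w = (X^T X + lam * I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def lasso(X, y, lam, n_iters=500):
    # Lasso: minimize sum_i (y_i - w^T x_i)^2 + lam * sum_j |w_j|,
    # solved by cyclically updating each coordinate with soft-thresholding.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]          # residual ignoring feature j
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / z
    return w

The soft-thresholding step is what drives some coordinates exactly to zero, matching the geometric picture above: the corners of the lasso constraint region lie on the axes.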
Review stagewise regression
Intro to classification
Linear discriminant functions
Consider the linear regression model
\[
y = f(x; w) = \underbrace{w_0 \phi_0(x)}_{\phi_0(x)\,\equiv\,1} + w_1 \phi_1(x) + \dots + w_d \phi_d(x).
\]
We can see this as a combination of $d+1$ simple regressors:
\[
y = \sum_{j=0}^{d} f_j(x; w), \qquad f_j(x; w) \triangleq w_j \phi_j(x).
\]
We can build this combination greedily, one function at a time.
Parametrize the set of simple functions: $f(x; \theta)$, with $\theta = [w, j]$, i.e., a coefficient and the index of a basis function.
Step 1: fit the first simple model:
\[
\theta_1 = \arg\min_\theta \sum_{i=1}^{n} \bigl(y_i - f(x_i; \theta)\bigr)^2
\]
Step 2: fit the second simple model to the residuals of the first:
\[
\theta_2 = \arg\min_\theta \sum_{i=1}^{n} \Bigl(\underbrace{y_i - f(x_i; \theta_1)}_{\text{residual}} - f(x_i; \theta)\Bigr)^2
\]
... Step $n$: fit a simple model to the residuals of the previous step.
Stop when there is no significant improvement in training error.
Final estimate after $M$ steps:
\[
\hat{y}(x) = f(x; \theta_1) + \dots + f(x; \theta_M)
\]
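A sketch of this greedy procedure in numpy, assuming (as in the example below) that the simple regressors are monomials $f(x; \theta) = w\,x^d$ with $\theta = (w, d)$; the function names and the stopping threshold are illustrative:

import numpy as np

def fit_simple(x, r, max_degree=5):
    # Fit one simple regressor f(x; theta) = w * x**d to the current residuals r,
    # choosing the (w, d) pair with the smallest squared error.
    best = None
    for d in range(max_degree + 1):
        phi = x ** d
        w = (phi @ r) / (phi @ phi)                 # 1-D least squares
        err = np.sum((r - w * phi) ** 2)
        if best is None or err < best[0]:
            best = (err, w, d)
    return best[1], best[2]

def stagewise(x, y, n_steps=10, tol=1e-4):
    # Greedily add one simple regressor at a time, each fit to the residuals
    # of the model built so far; stop when the training error barely improves.
    thetas, residual = [], y.astype(float)
    prev_err = np.sum(residual ** 2)
    for _ in range(n_steps):
        w, d = fit_simple(x, residual)
        new_residual = residual - w * x ** d
        err = np.sum(new_residual ** 2)
        if prev_err - err < tol * prev_err:
            break
        thetas.append((w, d))
        residual, prev_err = new_residual, err
    return thetas

def predict(thetas, x):
    # Final estimate: y_hat(x) = f(x; theta_1) + ... + f(x; theta_M).
    return sum(w * x ** d for w, d in thetas)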
[Figure: four stagewise iterations. Each row shows the current fit $\sum_{j=1}^{k} f(x; \theta_j)$, the residuals, and the newly fitted parameters $\theta_j$: $k=1$: $d = 3$, $w = -0.0512$; $k=2$: $d = 0$, $w = 1.1024$; $k=3$: $d = 5$, $w = 0.0002$; $k=4$: $d = 0$, $w = 0.0536$.]
Shifting gears: classification. Many successful applications of ML: vision, speech, medicine, etc.
Setup: we need to map $x \in \mathcal{X}$ to a label $y \in \mathcal{Y}$.
Examples:
digit recognition: $\mathcal{Y} = \{0, \dots, 9\}$
prediction from microarray data: $\mathcal{Y} = \{\text{disease present}, \text{disease absent}\}$
Formally: just like in regression, we want to learn a mapping from $\mathcal{X}$ to $\mathcal{Y}$, but $\mathcal{Y}$ is finite and non-metric.
One approach is to (naïvely) ignore this.
Regression on the indicator matrix: encode the labels as an indicator matrix with one column per class $c_j$,
\[
Y_{ij} =
\begin{cases}
1 & \text{if } y_i = c_j,\\
0 & \text{otherwise,}
\end{cases}
\]
and fit each column with least squares.
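A minimal sketch of this approach (assuming, as is standard for indicator regression, that a new point is assigned the class whose fitted indicator score is largest):

import numpy as np

def indicator_regression(X, y, classes):
    # Build the indicator matrix Y[i, j] = 1 if y[i] == classes[j], else 0,
    # and fit one least-squares regression per column (per class).
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])                 # prepend intercept column
    Y = (np.asarray(y)[:, None] == np.asarray(classes)[None, :]).astype(float)
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def classify(W, X, classes):
    # Predict the class whose regressed indicator score is largest.
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])
    return np.asarray(classes)[np.argmax(Xb @ W, axis=1)]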
Suppose we have a binary problem, $y \in \{-1, 1\}$.
Assuming the standard model $y = f(x; w) + \nu$ and solving with least squares, we get $\hat{w}$.
This corresponds to squared loss as a measure of classification performance! Does this make sense?
How do we decide on the label based on $f(x; \hat{w})$?
A 1D example: [Figure: binary labels $y \in \{-1, +1\}$ plotted against a scalar $x$, with the least-squares fit $f(x; \hat{w}) = \hat{w}_0 + \hat{w}^T x$ and the resulting decision regions $\hat{y} = +1$ and $\hat{y} = -1$.]
Can't just take $\hat{y} = f(x; \hat{w})$, since it won't be a valid label.
A reasonable decision rule: decide on $\hat{y} = 1$ if $f(x; \hat{w}) \ge 0$, otherwise $\hat{y} = -1$:
\[
\hat{y} = \operatorname{sign}\bigl(\hat{w}_0 + \hat{w}^T x\bigr)
\]
This specifies a linear classifier: the equation $\hat{w}_0 + \hat{w}^T x = 0$ separates the space into two half-spaces.
It seems to work well in some examples, but not so well in others.
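A corresponding sketch for the binary case (illustrative only; it fits the $\pm 1$ labels by least squares and applies the sign rule above, with $\operatorname{sign}(0)$ mapped to $+1$):

import numpy as np

def fit_binary_ls(X, y):
    # Least-squares fit with labels y in {-1, +1} treated as real targets;
    # returns w, with w[0] playing the role of w_0 (intercept).
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])
    w, *_ = np.linalg.lstsq(Xb, np.asarray(y, dtype=float), rcond=None)
    return w

def predict_binary(w, X):
    # Decision rule: y_hat = sign(w_0 + w^T x).
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])
    return np.where(Xb @ w >= 0, 1, -1)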
[Figure: geometry of the linear discriminant in the $(x_1, x_2)$ plane. The hyperplane $w_0 + w^T x = 0$ has normal vector $w$ and lies at signed distance $-w_0/\|w\|$ from the origin; a point $x_0$ lies at signed distance $(w_0 + w^T x_0)/\|w\|$ from the hyperplane, and $x_{0\perp}$ is its orthogonal projection onto the hyperplane.]
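A small numeric check of this geometry (a sketch with arbitrary example values; the projection formula $x_{0\perp} = x_0 - \frac{w_0 + w^T x_0}{\|w\|^2}\, w$ follows from moving along the normal direction $w$):

import numpy as np

# Arbitrary example hyperplane and point.
w = np.array([2.0, -1.0])
w0 = 3.0
x0 = np.array([1.5, 4.0])

norm_w = np.linalg.norm(w)
signed_dist = (w0 + w @ x0) / norm_w               # signed distance of x0 from the hyperplane
x0_perp = x0 - ((w0 + w @ x0) / norm_w**2) * w     # orthogonal projection onto the hyperplane

print(signed_dist)                                  # how far x0 is from the plane (with sign)
print(w0 + w @ x0_perp)                             # ~0: the projection lies on the hyperplane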