


These notes cover linear regression, focusing on the use of cost functions and gradient descent to find the optimal parameters ($\theta_0$ and $\theta_1$) of a linear prediction model. They present the least-squares cost function, its optimization by gradient descent, and the derivation of the closed-form solution. They also introduce the maximum likelihood formulation and the concept of overfitting.



Machine Learning (CS 5350/CS 6350) 16 Jan 2006
Our old simple (perhaps unrealistic) regression example:
Square footage Price
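Linear prediction model:

$$\text{price} \approx \theta_0 + \theta_1 \times \text{square footage} \tag{1}$$

We can write this as one equation per training example:

$$\begin{bmatrix} \text{square footage}_n & 1 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_0 \end{bmatrix} = \text{price}_n, \qquad n = 1, \dots, N$$

It is easy to check that no $\{\theta_0, \theta_1\}$ exists that satisfies all of these equations exactly.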
Define a cost function, least-squares:

$$J(\theta) = \frac{1}{2} \sum_{n=1}^{N} \left( \theta^\top x_n - y_n \right)^2$$
Because the errors are squared, this cost function penalizes outliers heavily.
Now, we’ve changed the learning problem to an optimization problem: find θ to minimize J(θ).
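As a minimal sketch of evaluating the least-squares cost, the snippet below computes $J(\theta)$ for two candidate parameter vectors. The toy data and the helper names `predict` and `least_squares_cost` are assumptions made for illustration; they do not come from the notes.

```python
def predict(theta, x):
    """Linear prediction theta^T x for one feature vector x."""
    return sum(t * xi for t, xi in zip(theta, x))


def least_squares_cost(theta, xs, ys):
    """J(theta) = 1/2 * sum_n (theta^T x_n - y_n)^2."""
    return 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (not from the notes): each x_n is [1, feature],
# so theta = [theta_0, theta_1] as in the linear prediction model.
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [2.0, 2.5, 4.0]

cost_flat = least_squares_cost([0.0, 0.0], xs, ys)  # predicts 0 everywhere
cost_line = least_squares_cost([1.0, 1.0], xs, ys)  # predicts 1 + feature
print(cost_flat, cost_line)
```

The second candidate misses only one target (by 0.5), so its cost is far lower than that of the all-zero parameters.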
Gradient Descent
Iteratively update $\theta$ according to:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{\partial}{\partial \theta} J(\theta)$$
For the least-squares cost function, the partial is:

$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{n=1}^{N} \left( \theta^\top x_n - y_n \right) x_n$$
The gradient is big on examples for which there is a high error.
α is a learning rate. Too low → slow convergence; too high → no convergence.
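The update rule can be sketched in a few lines of plain Python. The toy data, the zero starting point, the step counts, and the two learning rates (0.05 and 0.2) are all assumptions chosen for illustration; on this particular data the smaller rate converges and the larger one blows up.

```python
def gradient(theta, xs, ys):
    """Gradient of the least-squares cost: sum_n (theta^T x_n - y_n) * x_n."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def gradient_descent(xs, ys, alpha, steps):
    """Run a fixed number of updates theta <- theta - alpha * gradient."""
    theta = [0.0] * len(xs[0])  # assumed starting point
    for _ in range(steps):
        g = gradient(theta, xs, ys)
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta


# Hypothetical toy data: each x_n is [1, feature], theta = [theta_0, theta_1].
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [2.0, 2.5, 4.0]

theta = gradient_descent(xs, ys, alpha=0.05, steps=5000)   # converges
diverged = gradient_descent(xs, ys, alpha=0.2, steps=50)   # step too large
print(theta)
print(diverged)
```

With the smaller rate the iterates settle at the least-squares fit for this data, while with the larger rate the parameter values grow without bound.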
It turns out that we can actually obtain a solution in closed form. Let $X$ be the data matrix, and let $Y$ be a (column) vector containing the targets. Then $X\theta - Y$ is a column vector whose $n$th element is $\theta^\top x_n - y_n$.

So:

$$J(\theta) = \frac{1}{2} [X\theta - Y]^\top [X\theta - Y]$$
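Then, we can compute the gradient:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \tfrac{1}{2} [X\theta - Y]^\top [X\theta - Y] \\
&= \tfrac{1}{2} \nabla_\theta \left( \theta^\top X^\top X \theta - \theta^\top X^\top Y - Y^\top X \theta + Y^\top Y \right) \\
&= \tfrac{1}{2} \nabla_\theta \operatorname{tr} \left( \theta^\top X^\top X \theta - \theta^\top X^\top Y - Y^\top X \theta + Y^\top Y \right) \\
&= \tfrac{1}{2} \nabla_\theta \left( \operatorname{tr} \theta^\top X^\top X \theta - 2 \operatorname{tr} Y^\top X \theta \right) \\
&= \tfrac{1}{2} \left( X^\top X \theta + X^\top X \theta - 2 X^\top Y \right) \\
&= X^\top X \theta - X^\top Y
\end{aligned}
$$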
Thus, setting the gradient equal to zero, we obtain:

$$X^\top X \theta = X^\top Y$$

So:

$$\theta = \left( X^\top X \right)^{-1} X^\top Y$$
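For the two-parameter model, the closed-form solution reduces to a 2×2 linear system that can be solved directly. The sketch below is restricted to that case and uses Cramer's rule; the function name `normal_equations_2d` and the toy data are assumptions for illustration.

```python
def normal_equations_2d(xs, ys):
    """Solve X^T X theta = X^T Y when each x_n has exactly two entries."""
    # Entries of the symmetric 2x2 matrix A = X^T X and the vector b = X^T Y.
    a00 = sum(x[0] * x[0] for x in xs)
    a01 = sum(x[0] * x[1] for x in xs)
    a11 = sum(x[1] * x[1] for x in xs)
    b0 = sum(x[0] * y for x, y in zip(xs, ys))
    b1 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("X^T X is singular; the closed form does not apply")
    # Cramer's rule for the 2x2 system A theta = b.
    return [(a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det]


# Hypothetical toy data: each x_n is [1, feature], theta = [theta_0, theta_1].
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [2.0, 2.5, 4.0]

theta = normal_equations_2d(xs, ys)
print(theta)
```

For larger problems one would normally hand the system to a linear-algebra library rather than invert $X^\top X$ by hand.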
Maximum Likelihood
An alternative formulation: $y = \theta^\top x + \epsilon$, where $\epsilon \sim \text{Nor}(0, \sigma^2)$. Then $y \sim \text{Nor}(\theta^\top x, \sigma^2)$. Now, find $\theta$ to maximize the likelihood of the training set.
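As a sketch of where this formulation leads, assuming the training examples are independent, the log-likelihood of the training set is:

$$
\begin{aligned}
\log p(Y \mid X, \theta)
&= \sum_{n=1}^{N} \log \text{Nor}\left( y_n \mid \theta^\top x_n, \sigma^2 \right) \\
&= -\frac{N}{2} \log \left( 2 \pi \sigma^2 \right) - \frac{1}{2 \sigma^2} \sum_{n=1}^{N} \left( \theta^\top x_n - y_n \right)^2 \\
&= -\frac{N}{2} \log \left( 2 \pi \sigma^2 \right) - \frac{1}{\sigma^2} J(\theta)
\end{aligned}
$$

Only the last term depends on $\theta$, so under this Gaussian noise assumption maximizing the likelihood is the same as minimizing the least-squares cost $J(\theta)$.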
This is an $\ell_2$ penalty. $\lambda$ controls how complex the functions we allow can be.
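Easy to compute gradient:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \left( \tfrac{1}{2} [X\theta - Y]^\top [X\theta - Y] + \tfrac{\lambda}{2} \theta^\top \theta \right) \\
&= X^\top X \theta - X^\top Y + \lambda \theta
\end{aligned}
$$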
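So we can solve for $\theta$:

$$
\left( X^\top X + \lambda I \right) \theta = X^\top Y
\;\Longrightarrow\;
\theta = \left[ X^\top X + \lambda I \right]^{-1} X^\top Y
$$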
This is especially nice when $X^\top X$ is ill-conditioned.
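To illustrate the ill-conditioned case, the sketch below uses an extreme example in which the two features are identical, so $X^\top X$ is exactly singular and the unregularized system has no unique solution. The function name `ridge_2d`, the data, and the value $\lambda = 0.1$ are assumptions chosen for illustration.

```python
def ridge_2d(xs, ys, lam):
    """Solve (X^T X + lam * I) theta = X^T Y for two-entry feature vectors."""
    a00 = sum(x[0] * x[0] for x in xs) + lam
    a01 = sum(x[0] * x[1] for x in xs)
    a11 = sum(x[1] * x[1] for x in xs) + lam
    b0 = sum(x[0] * y for x, y in zip(xs, ys))
    b1 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("matrix is singular")
    # Cramer's rule for the 2x2 system.
    return [(a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det]


# Hypothetical data: the two features are identical, so X^T X is singular.
xs = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
ys = [2.0, 4.0, 6.0]

try:
    ridge_2d(xs, ys, lam=0.0)  # lam = 0 is the plain normal equations
    unregularized_ok = True
except ValueError:
    unregularized_ok = False

theta = ridge_2d(xs, ys, lam=0.1)  # adding lam * I makes the system solvable
print(unregularized_ok, theta)
```

With the penalty in place the solver returns a unique answer that splits the weight equally between the two duplicated features, each slightly below 1.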
We can also do a probabilistic interpretation, putting a prior on $\theta$: $\theta \sim \text{Nor}(0, \lambda^{-1})$.
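One way to make this interpretation concrete is the following sketch. It assumes unit noise variance ($\sigma^2 = 1$) and an isotropic prior covariance $\lambda^{-1} I$; neither assumption is stated in the notes. Under those assumptions the negative log-posterior is:

$$
-\log p(\theta \mid X, Y)
= \frac{1}{2} \sum_{n=1}^{N} \left( \theta^\top x_n - y_n \right)^2
+ \frac{\lambda}{2} \theta^\top \theta
+ \text{const}
$$

So, in this setting, the most probable $\theta$ under the posterior is the minimizer of the $\ell_2$-regularized least-squares cost.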
In general, having too many features is bad, and so is having too few. Why? We want to minimize the expected cost (going back to the un-regularized case). Suppose $f = f(x)$ and $t = f + \epsilon$. Write $y$ for $\theta^\top x$. Then:
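$$
\mathbb{E}[J(\theta)]
= \mathbb{E} \left[ \frac{1}{2} \sum_n (t_n - y_n)^2 \right]
= \frac{1}{2} \sum_n \mathbb{E} \left[ (t_n - y_n)^2 \right]
$$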
Let’s look at the expectation:
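$$
\begin{aligned}
\mathbb{E} \left[ (t_n - y_n)^2 \right]
&= \mathbb{E} \left[ (t_n - f_n + f_n - y_n)^2 \right] \\
&= \mathbb{E} \left[ (t_n - f_n)^2 \right] + \mathbb{E} \left[ (f_n - y_n)^2 \right] + 2\, \mathbb{E} \left[ (t_n - f_n)(f_n - y_n) \right] \\
&= \mathbb{E}[\epsilon^2] + \mathbb{E} \left[ (f_n - y_n)^2 \right] + 2 \left( \mathbb{E}[f_n t_n] - \mathbb{E}[f_n^2] - \mathbb{E}[y_n t_n] + \mathbb{E}[y_n f_n] \right) \\
&= \mathbb{E}[\epsilon^2] + \mathbb{E} \left[ (f_n - y_n)^2 \right]
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}[t_n] = f_n$ and, assuming the noise in $t_n$ is independent of $y_n$, $\mathbb{E}[y_n t_n] = \mathbb{E}[y_n f_n]$.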
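Expanding the remaining expectation in the same way, with the cross term again vanishing because $\mathbb{E} \left[ \mathbb{E}[y_n] - y_n \right] = 0$:

$$
\begin{aligned}
\mathbb{E} \left[ (f_n - y_n)^2 \right]
&= \mathbb{E} \left[ (f_n - \mathbb{E}[y_n] + \mathbb{E}[y_n] - y_n)^2 \right] \\
&= \mathbb{E} \left[ (f_n - \mathbb{E}[y_n])^2 \right] + \mathbb{E} \left[ (\mathbb{E}[y_n] - y_n)^2 \right] \\
&= (f_n - \mathbb{E}[y_n])^2 + \mathbb{E} \left[ (\mathbb{E}[y_n] - y_n)^2 \right]
\end{aligned}
$$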
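Putting the two pieces together:

$$
\begin{aligned}
\mathbb{E} \left[ (t_n - y_n)^2 \right]
&= \mathbb{E}[\epsilon^2] + (f_n - \mathbb{E}[y_n])^2 + \mathbb{E} \left[ (\mathbb{E}[y_n] - y_n)^2 \right] \\
&= \mathbb{V}[\text{noise}] + \text{bias}^2 + \text{variance}
\end{aligned}
$$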
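The decomposition can be checked numerically with a small simulation. Everything specific here is an assumption made for illustration: the true function $f(x) = x^2$, the one-parameter model $y = \theta x$, the noise level, the training inputs, the test input, and the number of trials. The model is deliberately too simple for the true function, so the bias term is large.

```python
import random


def f(x):
    """Hypothetical true function (an assumption for this demo)."""
    return x * x


def fit_slope(xs, ts):
    """Least-squares fit of the one-parameter model y = theta * x."""
    return sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)


rng = random.Random(0)          # fixed seed so the run is repeatable
sigma = 0.5                     # noise standard deviation
xs = [0.0, 0.5, 1.0, 1.5, 2.0]  # fixed training inputs
x0 = 1.0                        # test input
trials = 20000

preds, sq_errors = [], []
for _ in range(trials):
    # Draw a fresh noisy training set t_n = f(x_n) + eps and refit.
    ts = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    y0 = fit_slope(xs, ts) * x0
    # Fresh noisy target at the test input, independent of the training noise.
    t0 = f(x0) + rng.gauss(0.0, sigma)
    preds.append(y0)
    sq_errors.append((t0 - y0) ** 2)

mean_pred = sum(preds) / trials
bias_sq = (f(x0) - mean_pred) ** 2
variance = sum((y - mean_pred) ** 2 for y in preds) / trials
noise = sigma ** 2
expected_error = sum(sq_errors) / trials

print(f"noise + bias^2 + variance = {noise + bias_sq + variance:.4f}")
print(f"average squared error     = {expected_error:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, and most of the total comes from the bias term because a line through the origin cannot follow $x^2$.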