Linear Regression: Learning from Data using Cost Functions and Gradient Descent, Exams of Computer Science

These notes cover linear regression, focusing on the use of cost functions and gradient descent to find the optimal parameters (θ0 and θ1) of a linear prediction model. They present the least-squares cost function, its optimization by gradient descent, and the derivation of the closed-form solution. They also introduce the maximum likelihood formulation and the concept of overfitting.

Typology: Exams

Pre 2010

Uploaded on 08/30/2009

koofers-user-r9d-1
Machine Learning (CS 5350/CS 6350) 16 Jan 2006
Linear models for regression
Our old simple (perhaps unrealistic) regression example:
Square footage Price
1200 $120
1340 $125
1390 $105
1400 $130
1420 $135
1500 $145
1550 $160
1700 $155
1900 $140
2150 $130
2300 $135
Linear prediction model:
$$
\text{price} \approx \theta_0 + \theta_1 \times \text{square footage} \tag{1}
$$
We can write this as:
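$$
\begin{bmatrix} \theta_1 & \theta_0 \end{bmatrix}
\begin{bmatrix}
1200 & 1 \\
1340 & 1 \\
1390 & 1 \\
1400 & 1 \\
1420 & 1 \\
1500 & 1 \\
1550 & 1 \\
1700 & 1 \\
1900 & 1 \\
2150 & 1 \\
2300 & 1
\end{bmatrix}^{\top}
=
\begin{bmatrix}
120 \\ 125 \\ 105 \\ 130 \\ 135 \\ 145 \\ 160 \\ 155 \\ 140 \\ 130 \\ 135
\end{bmatrix}^{\top}
\tag{2}
$$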
It is easy to check that no pair $\{\theta_0, \theta_1\}$ satisfies this exactly.
Define a cost function, least-squares:
$$
J(\theta) = \frac{1}{2} \sum_{n=1}^{N} \left[ \theta^\top x_n - y_n \right]^2
$$
Because the errors are squared, this cost function penalizes outliers heavily.
Now we have changed the learning problem into an optimization problem: find θ to minimize J(θ).
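To make the cost function concrete, here is a minimal sketch that evaluates J(θ) on the housing table above. The function names and the two parameter guesses are hypothetical choices for illustration, not part of the notes.

```python
# Housing data from the table above: (square footage, price).
DATA = [
    (1200, 120), (1340, 125), (1390, 105), (1400, 130), (1420, 135),
    (1500, 145), (1550, 160), (1700, 155), (1900, 140), (2150, 130),
    (2300, 135),
]


def predict(theta0, theta1, sqft):
    """Linear prediction model: theta0 + theta1 * square footage."""
    return theta0 + theta1 * sqft


def least_squares_cost(theta0, theta1, data=DATA):
    """J(theta) = 1/2 * sum over examples of (prediction - target)^2."""
    return 0.5 * sum((predict(theta0, theta1, x) - y) ** 2 for x, y in data)


# Two hypothetical guesses: a flat line at the mean price,
# and a line through the origin with an arbitrary slope.
mean_price = sum(y for _, y in DATA) / len(DATA)
cost_flat = least_squares_cost(mean_price, 0.0)
cost_origin = least_squares_cost(0.0, 0.08)
print(cost_flat, cost_origin)
```

Comparing the two printed values shows how J(θ) ranks candidate parameter settings: the smaller the cost, the better the line fits the table.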
Gradient Descent
Iteratively update θ according to:

$$
\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{\partial}{\partial \theta} J(\theta)
$$

For the least-squares cost function, the partial derivative is:

$$
\frac{\partial}{\partial \theta} J(\theta) = \sum_{n=1}^{N} \left[ \theta^\top x_n - y_n \right] x_n
$$

Examples with a large error contribute most to the gradient.

α is the learning rate. Too low → slow convergence; too high → no convergence.
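The update rule and the effect of the learning rate can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notes' own implementation: the square footage is standardized first (a preprocessing step the notes do not mention, added so that a single α works for both parameters), and the two values of α are hypothetical picks, one small enough to converge and one too large.

```python
# Housing data from the table above: (square footage, price).
DATA = [
    (1200, 120), (1340, 125), (1390, 105), (1400, 130), (1420, 135),
    (1500, 145), (1550, 160), (1700, 155), (1900, 140), (2150, 130),
    (2300, 135),
]


def standardize(values):
    """Rescale to zero mean and unit variance (an assumption, not in the notes)."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


def gradient(theta, xs, ys):
    """Gradient of J: sum_n (theta^T x_n - y_n) x_n, with x_n = (1, x)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        error = theta[0] + theta[1] * x - y
        g0 += error        # component for the constant feature 1
        g1 += error * x    # component for the scaled square footage
    return g0, g1


def gradient_descent(xs, ys, alpha, steps):
    """Repeat theta <- theta - alpha * gradient, starting from zero."""
    theta = (0.0, 0.0)
    for _ in range(steps):
        g0, g1 = gradient(theta, xs, ys)
        theta = (theta[0] - alpha * g0, theta[1] - alpha * g1)
    return theta


xs = standardize([x for x, _ in DATA])
ys = [float(y) for _, y in DATA]

theta_ok = gradient_descent(xs, ys, alpha=0.05, steps=200)   # small step: converges
theta_bad = gradient_descent(xs, ys, alpha=0.2, steps=200)   # step too large: diverges
print(theta_ok, theta_bad)
```

With the smaller α the gradient at the returned θ is essentially zero, while with the larger α each step overshoots the minimum and the parameters grow without bound.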

It turns out that we can actually obtain a solution in closed form. Let X be the data matrix and let Y be a (column) vector containing the targets. Then Xθ − Y is a column vector whose nth element is $\theta^\top x_n - y_n$.

So:

$$
J(\theta) = \frac{1}{2} [X\theta - Y]^\top [X\theta - Y]
$$

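Then, we can compute the gradient:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \tfrac{1}{2} [X\theta - Y]^\top [X\theta - Y] \\
&= \tfrac{1}{2} \nabla_\theta \left[ \theta^\top X^\top X\theta - \theta^\top X^\top Y - Y^\top X\theta + Y^\top Y \right] \\
&= \tfrac{1}{2} \nabla_\theta \operatorname{tr} \left[ \theta^\top X^\top X\theta - \theta^\top X^\top Y - Y^\top X\theta + Y^\top Y \right] \\
&= \tfrac{1}{2} \nabla_\theta \left[ \operatorname{tr} \theta^\top X^\top X\theta - 2 \operatorname{tr} Y^\top X\theta \right] \\
&= \tfrac{1}{2} \left[ X^\top X\theta + X^\top X\theta - 2 X^\top Y \right] \\
&= X^\top X\theta - X^\top Y
\end{aligned}
$$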

Thus, setting the gradient equal to zero, we obtain:

$$
X^\top X\theta = X^\top Y
$$

So:

$$
\theta = \left[ X^\top X \right]^{-1} X^\top Y
$$
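As a rough illustration of the closed-form solution, the sketch below solves the normal equations for the housing table in plain Python. It assumes the feature vector x_n = (1, square footage), so that X⊤X is a 2×2 matrix that can be solved with Cramer's rule; the function name is hypothetical.

```python
# Housing data from the table above: (square footage, price).
DATA = [
    (1200, 120), (1340, 125), (1390, 105), (1400, 130), (1420, 135),
    (1500, 145), (1550, 160), (1700, 155), (1900, 140), (2150, 130),
    (2300, 135),
]


def normal_equation_fit(data):
    """Solve X^T X theta = X^T Y for x_n = (1, sqft) using Cramer's rule."""
    n = len(data)
    sx = sum(x for x, _ in data)         # off-diagonal entry of X^T X
    sxx = sum(x * x for x, _ in data)    # lower-right entry of X^T X
    sy = sum(y for _, y in data)         # first entry of X^T Y
    sxy = sum(x * y for x, y in data)    # second entry of X^T Y
    det = n * sxx - sx * sx              # determinant of X^T X
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1


theta0, theta1 = normal_equation_fit(DATA)
print(theta0, theta1)
```

At the returned θ the gradient X⊤Xθ − X⊤Y is zero up to rounding. For more than a handful of features, a general linear solver would be a more practical choice than writing out the inverse by hand.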

Maximum Likelihood

An alternative formulation: $y = \theta^\top x + \epsilon$, where $\epsilon \sim \mathrm{Nor}(0, \sigma^2)$. Then $y \sim \mathrm{Nor}(\theta^\top x, \sigma^2)$. Now, find θ to maximize the likelihood of the training set.
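To see how this connects to least squares, the log-likelihood can be written out as follows. This is the standard derivation, sketched under the assumptions that the N training examples are independent and that σ² is fixed:

$$
\begin{aligned}
\log p(y_1, \dots, y_N \mid x_1, \dots, x_N; \theta)
&= \sum_{n=1}^{N} \log \mathrm{Nor}(y_n; \theta^\top x_n, \sigma^2) \\
&= -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \left[ \theta^\top x_n - y_n \right]^2 \\
&= -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{\sigma^2} J(\theta)
\end{aligned}
$$

The first term does not depend on θ, so under these assumptions maximizing the likelihood is the same as minimizing the least-squares cost J(θ).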

Adding the term $\frac{\lambda}{2}\theta^\top\theta$ to the least-squares cost is an $\ell_2$ penalty. λ controls how complex the functions we allow can be.

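It is easy to compute the gradient:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \left( \tfrac{1}{2} [X\theta - Y]^\top [X\theta - Y] + \tfrac{\lambda}{2} \theta^\top \theta \right) \\
&= X^\top X\theta - X^\top Y + \lambda\theta
\end{aligned}
$$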

So we can solve for θ:

$$
(X^\top X + \lambda I)\theta = X^\top Y
\implies
\theta = \left[ X^\top X + \lambda I \right]^{-1} X^\top Y
$$

This is especially nice when $X^\top X$ is ill-conditioned: adding λI with λ > 0 makes the matrix invertible and better conditioned.
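The regularized solution can be sketched in the same way as the unregularized one. The example below assumes the feature vector x_n = (1, square footage) and applies λI to both components, as the formula above does, so the intercept is penalized as well. The function names and the values of λ are hypothetical.

```python
# Housing data from the table above: (square footage, price).
DATA = [
    (1200, 120), (1340, 125), (1390, 105), (1400, 130), (1420, 135),
    (1500, 145), (1550, 160), (1700, 155), (1900, 140), (2150, 130),
    (2300, 135),
]


def ridge_fit(data, lam):
    """Solve (X^T X + lam * I) theta = X^T Y for x_n = (1, sqft)."""
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxy = sum(x * y for x, y in data)
    # Entries of the symmetric 2x2 matrix X^T X + lam * I.
    a = len(data) + lam
    b = sx
    d = sum(x * x for x, _ in data) + lam
    det = a * d - b * b
    theta0 = (d * sy - b * sxy) / det
    theta1 = (a * sxy - b * sy) / det
    return theta0, theta1


def squared_norm(theta):
    """theta^T theta, the quantity the l2 penalty shrinks."""
    return theta[0] ** 2 + theta[1] ** 2


ols = ridge_fit(DATA, 0.0)       # lam = 0 recovers the unregularized solution
mild = ridge_fit(DATA, 1.0)
strong = ridge_fit(DATA, 10.0)
print(squared_norm(ols), squared_norm(mild), squared_norm(strong))
```

The printed norms decrease as λ grows, which is the sense in which λ limits the complexity of the fitted function.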

We can also give this a probabilistic interpretation by putting a prior on θ: $\theta \sim \mathrm{Nor}(0, \lambda^{-1})$.

In general, too many features is bad and too few is bad. Why? We want to minimize the expected cost (going back to the un-regularized case). Suppose $f = f(x)$ and $t = f + \epsilon$. Write y for $\theta^\top x$. Then:

$$
E[J(\theta)] = E\left[ \sum_n (t_n - y_n)^2 \right] = \sum_n E\left[ (t_n - y_n)^2 \right]
$$

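Let's look at the expectation:

$$
\begin{aligned}
E\left[ (t_n - y_n)^2 \right]
&= E\left[ (t_n - f_n + f_n - y_n)^2 \right] \\
&= E\left[ (t_n - f_n)^2 \right] + E\left[ (f_n - y_n)^2 \right] + 2E\left[ (f_n - y_n)(t_n - f_n) \right] \\
&= E\left[ \epsilon^2 \right] + E\left[ (f_n - y_n)^2 \right] + 2\left( E[f_n t_n] - E[f_n^2] - E[y_n t_n] + E[y_n f_n] \right) \\
&= E\left[ \epsilon^2 \right] + E\left[ (f_n - y_n)^2 \right]
\end{aligned}
$$

The cross term vanishes assuming the noise has zero mean and is independent of $y_n$: then $E[f_n t_n] = E[f_n^2]$ and $E[y_n t_n] = E[y_n f_n]$.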

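And:

$$
\begin{aligned}
E\left[ (f_n - y_n)^2 \right]
&= E\left[ (f_n - E[y_n] + E[y_n] - y_n)^2 \right] \\
&= E\left[ (f_n - E[y_n])^2 \right] + E\left[ (E[y_n] - y_n)^2 \right] + 2E\left[ (E[y_n] - y_n)(f_n - E[y_n]) \right] \\
&= E\left[ (f_n - E[y_n])^2 \right] + E\left[ (E[y_n] - y_n)^2 \right]
\end{aligned}
$$

Here the cross term vanishes because $f_n - E[y_n]$ is not random and $E\left[ E[y_n] - y_n \right] = 0$.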

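Therefore:

$$
\begin{aligned}
E\left[ (t_n - y_n)^2 \right]
&= E\left[ \epsilon^2 \right] + E\left[ (f_n - E[y_n])^2 \right] + E\left[ (E[y_n] - y_n)^2 \right] \\
&= V[\text{noise}] + \text{bias}^2 + V[y]
\end{aligned}
$$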
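The decomposition can be checked numerically with a small simulation. Everything in this sketch is a hypothetical setup chosen for illustration: the true function f(x) = x², the noise level, the evaluation point x0, and the number of trials. A straight line is fitted to fresh noisy samples many times, and the average squared error at x0 is compared with noise variance + bias² + variance.

```python
import random


def fit_line(xs, ts):
    """Closed-form least-squares line; returns (theta0, theta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    theta1 = sxt / sxx
    return mean_t - theta1 * mean_x, theta1


def true_f(x):
    """Hypothetical ground-truth function; a line cannot fit it exactly."""
    return x * x


def simulate(trials=2000, sigma=0.1, x0=1.0, seed=0):
    """Estimate noise, bias^2, variance and expected squared error at x0."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]
    preds, sq_errors = [], []
    for _ in range(trials):
        # Draw a fresh training set t_n = f(x_n) + noise and fit a line.
        ts = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        theta0, theta1 = fit_line(xs, ts)
        y0 = theta0 + theta1 * x0
        # Draw an independent noisy target at the evaluation point.
        t0 = true_f(x0) + rng.gauss(0.0, sigma)
        preds.append(y0)
        sq_errors.append((t0 - y0) ** 2)
    mean_pred = sum(preds) / trials
    bias_sq = (true_f(x0) - mean_pred) ** 2
    variance = sum((y - mean_pred) ** 2 for y in preds) / trials
    expected_cost = sum(sq_errors) / trials
    return sigma ** 2, bias_sq, variance, expected_cost


noise, bias_sq, variance, expected_cost = simulate()
print(noise + bias_sq + variance, expected_cost)
```

The two printed numbers should agree up to Monte Carlo error. In this setup the bias term dominates because a line is too simple for a quadratic target, which illustrates the "too few features" side of the trade-off.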