Linear Regression: Finding the Best Linear Function for Data, Study notes of Programming Languages

These notes introduce linear regression, a method for finding the linear function that best fits a given dataset. They cover the assumptions behind linear regression, the math involved, and the process of minimizing the squared-error loss function, along with useful definitions and linear algebra identities.

Typology: Study notes

Pre 2010

Uploaded on 07/22/2009

koofers-user-tg6
Intro to Linear Methods
Reading: DH&S, Ch 5.{1-4,8}
hip to be hyperplanar...

Then & now...

• Last time:

• k-NN geometry

• Bayesian decision theory -- Bayes optimal classifiers & Bayes error

• NN in daily life

• Today:

• Intro to linear methods

• Formulation

• Math

• The linear regression problem

Warning!

• Change of notation!

• I usually write

• $\mathbf{x}$ for the data vector

• $y$ for the class var; $\mathbf{y}$ for the class vector

• In this section, your book uses

• $\mathbf{x}$ for the data vector

• $\mathbf{y}$ for the “augmented” data vector

• $g(\mathbf{x})$ for the class; $\mathbf{b}$ for the class vector

Linear regression prelims

• Basic idea: assume $g(\mathbf{x})$ is a linear function of $\mathbf{x}$:

$$g(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$$

• Our job: find the best $w_i$ to fit $g(\mathbf{x})$ “as well as possible”

• Note: $w_i \neq \omega_i$. The first is a double-u; the second is an omega (little curly squiggle...)

• Stupid math fonts...

Linear regression prelims

• By “as well as possible”, we mean here minimum squared error:

$$J_s(w_0, \ldots, w_d) = \sum_{i=1}^{n} \left( g(\mathbf{x}_i) - \hat{f}(\mathbf{x}_i) \right)^2 = \sum_{i=1}^{n} \left( b_i - \left( w_0 + \sum_{j=1}^{d} w_j x_{ji} \right) \right)^2$$

• $J_s$ is the loss function
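As a rough illustration of the squared-error loss above, here is a minimal pure-Python sketch. The function names `predict` and `squared_error` and the toy data are made up for this example; they are not from the slides or the book.

```python
def predict(w, x):
    """Linear function w[0] + w[1]*x[0] + ... + w[d]*x[d-1]."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def squared_error(w, xs, bs):
    """Sum over all samples of (target - prediction)^2."""
    return sum((b - predict(w, x)) ** 2 for x, b in zip(xs, bs))


# Made-up toy data: three 1-d points lying exactly on b = 1 + 2x.
xs = [[0.0], [1.0], [2.0]]
bs = [1.0, 3.0, 5.0]

loss_exact = squared_error([1.0, 2.0], xs, bs)  # weights that fit perfectly
loss_off = squared_error([0.0, 2.0], xs, bs)    # intercept off by 1 on every point
```

With these numbers the perfect fit has zero loss, and shifting the intercept by 1 costs 1 per sample.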

A helpful “method”

• Recall: $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$

• Want to be able to easily write: $w_0 + w_1 x_1 + \cdots + w_d x_d$

• Introduce “pseudo-feature” of $\mathbf{x}$: $x_0 = 1$

• Now have: $\mathbf{y} = [1, x_1, x_2, \ldots, x_d]^T$

A helpful “method”

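• Now have: $\mathbf{y} = [1, x_1, x_2, \ldots, x_d]^T$

• And: $\mathbf{a} = [w_0, w_1, w_2, \ldots, w_d]^T$

• So: $\hat{f}(\mathbf{x}) = \mathbf{a}^T \mathbf{y}$

• And our “loss function” becomes:

$$J_s(\mathbf{a}) = \sum_{i=1}^{n} \left( b_i - \mathbf{a}^T \mathbf{y}_i \right)^2$$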
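The augmentation trick can be sketched in NumPy as follows. The helper names `augment` and `loss` are assumptions for this example, as is the choice to store one data vector per row of `X`.

```python
import numpy as np


def augment(x):
    """Prepend the pseudo-feature x_0 = 1 to a data vector x."""
    return np.concatenate(([1.0], x))


def loss(a, X, b):
    """J_s(a) = sum_i (b_i - a^T y_i)^2, with one data vector per row of X."""
    rows = np.array([augment(x) for x in X])  # row i holds y_i^T
    residuals = b - rows @ a
    return float(residuals @ residuals)


x = np.array([2.0, 3.0])
a = np.array([0.5, 1.0, -1.0])  # [w_0, w_1, w_2]
y = augment(x)                  # [1, x_1, x_2]
f_hat = float(a @ y)            # w_0 + w_1*x_1 + w_2*x_2 as a single dot product
```

The point of the pseudo-feature is visible in the last line: the bias term no longer needs special handling.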

Minimizing loss

• Back up to the 1-d case

• Suppose you had the function: $l(w) = aw^2 + bw + c$

• And wanted to find $w$ that minimizes $l(w)$

• Std answer: take derivative, set equal to 0, and solve:

$$\frac{\partial}{\partial w} l(w) = 2aw + b = 0 \;\Rightarrow\; w_{\min} = -\frac{b}{2a}$$

• To be sure of a min, check 2nd derivative too...
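A small sketch of the 1-d recipe, with the second-derivative check made explicit. The function name `argmin_quadratic` and the coefficients are made up for illustration.

```python
def argmin_quadratic(a, b):
    """Minimizer of l(w) = a*w**2 + b*w + c (c does not affect the location).

    The second derivative is 2a, so a minimum exists only when a > 0.
    """
    if a <= 0:
        raise ValueError("l(w) has no minimum unless a > 0")
    return -b / (2 * a)


def l(w, a=2.0, b=-8.0, c=1.0):
    """Example quadratic with made-up coefficients."""
    return a * w**2 + b * w + c


w_min = argmin_quadratic(2.0, -8.0)  # -(-8) / (2 * 2)
```

Evaluating `l` slightly to either side of `w_min` gives a larger value, as expected for a minimum.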

5 minutes of math...

• Some useful linear algebra identities:

• If $A$ and $B$ are matrices,

$$(A + B)^T = A^T + B^T$$

$$(AB)^T = B^T A^T$$

$$(AB)^{-1} = B^{-1} A^{-1} \quad \text{(for invertible square matrices)}$$
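These identities are easy to spot-check numerically. The sketch below uses random 3x3 matrices. Shifting them by a multiple of the identity is an assumption made here only to keep them comfortably invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random matrices, shifted toward 3*I so the inverses are well behaved.
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
B = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)

sum_ok = np.allclose((A + B).T, A.T + B.T)
prod_ok = np.allclose((A @ B).T, B.T @ A.T)
inv_ok = np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))
```

A numerical check like this is not a proof, but it catches a reversed order of factors quickly.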

Exercise

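• Derive the vector derivative expressions:

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^T = I$$

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^T A = A$$

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^T A \mathbf{x} = \mathbf{x}^T (A + A^T)$$

• Find an expression for the minimum squared error weight vector, $\mathbf{a}$, in the loss function:

$$J_s(\mathbf{a}) = (\mathbf{b} - Y^T \mathbf{a})^T (\mathbf{b} - Y^T \mathbf{a})$$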
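One way the last part of the exercise might be worked, as a sketch. It assumes $Y$ holds one augmented vector $\mathbf{y}_i$ per column, that $YY^T$ is invertible, and that derivatives are written as row vectors (other layout conventions give the transpose of each intermediate line but the same final answer).

```latex
\begin{align*}
J_s(\mathbf{a})
  &= (\mathbf{b} - Y^T \mathbf{a})^T (\mathbf{b} - Y^T \mathbf{a}) \\
  &= \mathbf{b}^T \mathbf{b} - 2\,\mathbf{b}^T Y^T \mathbf{a}
     + \mathbf{a}^T Y Y^T \mathbf{a} \\
\frac{\partial J_s}{\partial \mathbf{a}}
  &= -2\,\mathbf{b}^T Y^T + \mathbf{a}^T \left( Y Y^T + (Y Y^T)^T \right)
   = -2\,\mathbf{b}^T Y^T + 2\,\mathbf{a}^T Y Y^T \\
\frac{\partial J_s}{\partial \mathbf{a}} = 0
  &\;\Rightarrow\; Y Y^T \mathbf{a} = Y \mathbf{b}
   \;\Rightarrow\; \mathbf{a} = (Y Y^T)^{-1} Y \mathbf{b}
\end{align*}
```

The second line uses the fact that $\mathbf{b}^T Y^T \mathbf{a}$ is a scalar and so equals its own transpose; the third uses $(YY^T)^T = YY^T$.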

The LSE method

• The quantity $YY^T$ is called a Gram matrix and is positive semidefinite and symmetric

• The quantity $(YY^T)^{-1} Y$ is the pseudoinverse of $Y^T$

• May not exist if the Gram matrix is not invertible

• The complete “learning algorithm” is 2 whole lines of Matlab code
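The slides mention Matlab; a NumPy analogue might look like the sketch below. The function name `fit_lse`, the row-per-sample layout of `X`, and the toy data are all assumptions for this example, not the original code.

```python
import numpy as np


def fit_lse(X, b):
    """Least-squared-error weights a = (Y Y^T)^{-1} Y b.

    X has one data vector per row. Y is built with one augmented vector
    per column, matching J_s(a) = (b - Y^T a)^T (b - Y^T a).
    """
    Y = np.vstack([np.ones(X.shape[0]), X.T])  # shape (d+1, n)
    # Solve (Y Y^T) a = Y b; raises LinAlgError if the Gram matrix is singular.
    return np.linalg.solve(Y @ Y.T, Y @ b)


# Made-up data lying exactly on b = 1 + 2*x_1 - x_2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 3.0, 0.0, 2.0])
a = fit_lse(X, b)  # approximately [1, 2, -1]
```

Solving the linear system rather than forming the inverse explicitly is a common numerical choice. When the Gram matrix is not invertible, `np.linalg.lstsq(Y.T, b, rcond=None)` or `np.linalg.pinv(Y.T) @ b` still returns a least-squares solution.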