Matrix Cookbook: Linear Regression and Basis Functions, Lecture notes of Linear Algebra

An in-depth exploration of matrix operations and their applications in linear regression. It covers topics such as matrix notation, transpose, product, inner product, matrix derivatives, and the use of basis functions for non-linear regression. The text also discusses the concepts of structure error and approximation error in the context of function approximation.

Typology: Lecture notes

2021/2022

Uploaded on 09/27/2022

laurinda

1 The Matrix Cookbook

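Notation: A vector

$$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix}$$

always denotes a column vector. With $\mathbf{a}^T = [a_1, a_2, \dots, a_d]$ we will denote a row vector.

Transpose:

$$(A + B)^T = A^T + B^T, \quad (AB)^T = B^T A^T, \quad (A^{-1})^T = (A^T)^{-1} = A^{-T} \tag{1}$$

Product:

$$(AB)_{ij} = \sum_k A_{ik} B_{kj} \tag{2}$$

$$(AB)C = A(BC), \quad AB \neq BA \text{ (in general)} \tag{3}$$

Inner product of two vectors:

$$\mathbf{a}^T \mathbf{b} = \sum_k a_k b_k \tag{4}$$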
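The identities in Eqs. 1 to 4 can be checked numerically on small matrices. The following is a minimal sketch using NumPy; the matrices `A`, `B`, `C` and the vectors `a`, `b` are arbitrary values chosen for illustration and do not come from the notes.

```python
import numpy as np

# Arbitrary example matrices (A is invertible: det(A) = 25).
A = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]])
B = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [4.0, 0.0, 1.0]])
C = np.array([[0.0, 1.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Eq. 1: transpose rules
transpose_sum_ok = np.allclose((A + B).T, A.T + B.T)
transpose_prod_ok = np.allclose((A @ B).T, B.T @ A.T)
inverse_transpose_ok = np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T))

# Eq. 2: entry (i, j) of the product written as an explicit sum over k
i, j = 1, 2
entry_ij = sum(A[i, k] * B[k, j] for k in range(3))
product_entry_ok = np.isclose(entry_ij, (A @ B)[i, j])

# Eq. 3: the product is associative but generally not commutative
associative_ok = np.allclose((A @ B) @ C, A @ (B @ C))
commutes = np.allclose(A @ B, B @ A)

# Eq. 4: inner product as a sum of elementwise products
inner_ok = np.isclose(a @ b, sum(a[k] * b[k] for k in range(3)))
```

For these particular matrices `commutes` is `False`, which illustrates that $AB \neq BA$ in general; matrices that do commute (for example, any matrix with the identity) exist as special cases.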

1.1 Matrix derivatives

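Gradient of a function $f(\mathbf{x}) : \mathbb{R}^d \to \mathbb{R}$: The gradient is defined as a row vector:

$$\frac{\partial f}{\partial \mathbf{x}} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \right] \tag{5}$$

Chain Rule

$$\frac{\partial Z}{\partial X} = \frac{\partial Z}{\partial Y} \frac{\partial Y}{\partial X} \tag{6}$$

Product Rule

$$\frac{\partial (YZ)}{\partial X} = \frac{\partial Y}{\partial X} Z + Y \frac{\partial Z}{\partial X} \tag{7}$$

Linear derivatives

$$\frac{\partial \mathbf{a}^T \mathbf{x}}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^T \mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}^T \tag{8}$$

$$\frac{\partial A\mathbf{x}}{\partial \mathbf{x}} = A, \quad \frac{\partial \mathbf{x}^T A}{\partial \mathbf{x}} = A^T \tag{9}$$

Quadratic derivatives

$$\frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^T A + \mathbf{x}^T A^T, \quad \frac{\partial \mathbf{x}^T \mathbf{x}}{\partial \mathbf{x}} = 2\mathbf{x}^T \tag{10}$$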
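One way to build confidence in the derivative rules in Eqs. 8 and 10 is to compare them with a finite-difference estimate. The sketch below assumes NumPy; the helper `numerical_gradient` and the values of `A`, `a` and `x` are illustrative choices, not part of the notes. NumPy's 1-D arrays do not distinguish row from column vectors, so the row-vector convention only shows up in the order of the products.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Arbitrary example values
A = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]])
a = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -1.2, 2.0])

# Eq. 8: d(a^T x)/dx = a^T
grad_linear = numerical_gradient(lambda v: a @ v, x)
linear_ok = np.allclose(grad_linear, a, atol=1e-6)

# Eq. 10: d(x^T A x)/dx = x^T A + x^T A^T
grad_quadratic = numerical_gradient(lambda v: v @ A @ v, x)
quadratic_ok = np.allclose(grad_quadratic, x @ A + x @ A.T, atol=1e-6)

# Eq. 10: d(x^T x)/dx = 2 x^T
grad_norm = numerical_gradient(lambda v: v @ v, x)
norm_ok = np.allclose(grad_norm, 2 * x, atol=1e-6)
```

Because `A` here is not symmetric, the check distinguishes $\mathbf{x}^T A + \mathbf{x}^T A^T$ from the symmetric special case $2\mathbf{x}^T A$.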

Figure 1: Linear Regression: The training data points are given by the blue circles, the original function is plotted by the green line. Based on the knowledge from the data-points we want to find the original function.

2 Linear Regression

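We are given a dataset $D = \langle \mathbf{x}_i, y_i \rangle_{i=1 \dots N}$ (for simplicity we assume that $y$ is a scalar and $\mathbf{x}$ is a vector of dimensionality $d$). We want to find a linear function

$$f(\mathbf{x}; \mathbf{w}) = w_0 + \sum_{k=1}^{d} w_k x_k = \tilde{\mathbf{x}}^T \mathbf{w}, \quad \text{with } \tilde{\mathbf{x}}^T = [1 \;\; \mathbf{x}^T],$$

which minimizes the quadratic error function:

$$E = \frac{1}{N} \sum_i (\tilde{\mathbf{x}}_i^T \mathbf{w} - y_i)^2 \tag{11}$$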

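The setting is illustrated in Figure 1. The error function can be easily written in matrix form by noting that a sum over the squared error terms can be represented as the inner product of the error vector

$$\mathbf{z} = \begin{bmatrix} \tilde{\mathbf{x}}_1^T \mathbf{w} - y_1 \\ \tilde{\mathbf{x}}_2^T \mathbf{w} - y_2 \\ \vdots \\ \tilde{\mathbf{x}}_N^T \mathbf{w} - y_N \end{bmatrix} = X\mathbf{w} - \mathbf{y} \tag{12}$$

with itself, where

$$X = \begin{bmatrix} \tilde{\mathbf{x}}_1^T \\ \tilde{\mathbf{x}}_2^T \\ \vdots \\ \tilde{\mathbf{x}}_N^T \end{bmatrix} \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}.$$

This gives

$$E = \frac{1}{N} \mathbf{z}^T \mathbf{z} = \frac{1}{N} (X\mathbf{w} - \mathbf{y})^T (X\mathbf{w} - \mathbf{y}) \tag{13}$$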
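To make the matrix form concrete, the sketch below builds the matrix $X$ from a small made-up dataset and evaluates the error both as the sum in Eq. 11 and as the inner product in Eq. 13. The data in `inputs` and `y` and the weights `w` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: N = 4 samples with d = 2 input dimensions.
inputs = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -1.0], [3.0, 2.0]])
y = np.array([1.0, 2.0, 0.5, 4.0])
w = np.array([0.5, 1.0, -0.25])  # [w_0, w_1, w_2], arbitrary values

N = inputs.shape[0]

# Each row of X is the augmented input [1, x_i^T].
X = np.hstack([np.ones((N, 1)), inputs])

# Eq. 11: mean of the squared errors, written as an explicit sum
error_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(N)) / N

# Eqs. 12 and 13: error vector z and its inner product with itself
z = X @ w - y
error_matrix = (z @ z) / N
```

Both expressions give the same value; the matrix form is the one that the derivative rules from Section 1.1 apply to directly.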

2.1 Least Squares Solution

We now differentiate $E$ with respect to $\mathbf{w}$ and set the gradient to $\mathbf{0}^T$. Applying Eqs. 6, 10 and 9:
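The derivation itself is not included in this excerpt. A sketch of the standard steps, consistent with Eqs. 6, 9, 10 and 13 and assuming that $X^T X$ is invertible, would be the following (the original equation numbers are unknown, so none are assigned here):

```latex
\begin{aligned}
\frac{\partial E}{\partial \mathbf{w}}
  &= \frac{1}{N} \frac{\partial \mathbf{z}^T \mathbf{z}}{\partial \mathbf{z}}
     \frac{\partial \mathbf{z}}{\partial \mathbf{w}}
  && \text{chain rule (Eq. 6) with } \mathbf{z} = X\mathbf{w} - \mathbf{y} \\
  &= \frac{1}{N} \, 2\mathbf{z}^T X
  && \text{Eq. 10 and Eq. 9} \\
  &= \frac{2}{N} (X\mathbf{w} - \mathbf{y})^T X = \mathbf{0}^T \\
\Rightarrow \quad X^T X \mathbf{w} &= X^T \mathbf{y} \\
\Rightarrow \quad \mathbf{w} &= (X^T X)^{-1} X^T \mathbf{y}
\end{aligned}
```

The second-to-last line is obtained by transposing $(X\mathbf{w} - \mathbf{y})^T X = \mathbf{0}^T$ with the product rule for transposes in Eq. 1.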

Figure 3: RBF basis functions

3 Decomposition in Structure and Approximation Error

Definitions:

Expected Error: $E_D[E(D)]$ is the expected error if we sample a data set of given size $N$ and use this data set to fit our function. The expectation is taken over all possible data sets of size $N$. $E(D)$ denotes the error if we use training set $D$ to fit the function (see below).

Structure Error: The error that comes solely from the structure of the function used to fit the data. It can be estimated by fitting the function on a very large data set (so that the approximation error vanishes).

Approximation Error: Given by the variance of the function estimates if we sample different training sets of size $N$. The more complex the functions we use, the higher the variance of our estimate.

Let $f(\mathbf{x}; D)$ denote a function learned from a specific dataset $D$ containing $N$ examples, and let $y(\mathbf{x})$ denote the target function. We will also denote the expected learned function as $E_D[f(\mathbf{x}; D)]$, which is calculated by taking the expectation with respect to all possible datasets $D$ of size $N$. The error $E(D)$ for the training set $D$ is given by

$$E(D) = \int_{\mathbf{x}} \big( f(\mathbf{x}; D) - y(\mathbf{x}) \big)^2 \, p(\mathbf{x}) \, d\mathbf{x} \tag{17}$$

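We now add and subtract $E_D[f(\mathbf{x}; D)]$:

$$\begin{aligned}
E(D) &= \int_{\mathbf{x}} \big( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] + E_D[f(\mathbf{x}; D)] - y(\mathbf{x}) \big)^2 \, p(\mathbf{x}) \, d\mathbf{x} \\
     &= \int_{\mathbf{x}} \Big( \big( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] \big)^2 + \big( E_D[f(\mathbf{x}; D)] - y(\mathbf{x}) \big)^2 \\
     &\qquad\quad + 2 \big( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] \big) \big( E_D[f(\mathbf{x}; D)] - y(\mathbf{x}) \big) \Big) \, p(\mathbf{x}) \, d\mathbf{x}
\end{aligned}$$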

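If we now calculate the expected error with respect to all data sets, $E_D[E(D)]$, the last line of this equation vanishes: the factor $E_D[f(\mathbf{x}; D)] - y(\mathbf{x})$ does not depend on $D$, and $E_D\big[f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)]\big] = 0$. Therefore

$$E_D[E(D)] = \int_{\mathbf{x}} E_D\Big[ \big( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] \big)^2 \Big] \, p(\mathbf{x}) \, d\mathbf{x} + \int_{\mathbf{x}} \big( E_D[f(\mathbf{x}; D)] - y(\mathbf{x}) \big)^2 \, p(\mathbf{x}) \, d\mathbf{x}$$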

The first term corresponds to the approximation error (the variance) and the last term to the structure error (the squared bias). This decomposition is also known as the bias-variance tradeoff.

Figure 4: Structure error (left) and Approximation error (right) for a 1-degree polynomial

Figure 5: Structure error (left) and Approximation error (right) for a 2-degree polynomial

$$\text{expected loss} = \text{variance} + \text{bias}^2 \tag{18}$$

When decreasing the bias (which is usually done by increasing the complexity of $f$), the variance of our function estimate will usually increase. Note that instead of using many datasets of size $N$ for $E_D[f(\mathbf{x}; D)]$, we can use a single huge dataset of size $M \gg N$. This can easily be shown to be equivalent.

In Figures 4, 5 and 6 we can see the structure error (left) and data fits for different data sets of size 10 (right) for a 1-degree, 2-degree and 6-degree polynomial. The gray points illustrate the probability mass of all data sets (these points are also used to estimate the structure error); the blue points illustrate the small data sets used for the single fits. The approximation error is given by the deviation of the single fits from the optimal hypothesis.
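The decomposition can also be explored by simulation. The following is a minimal Monte Carlo sketch, not the procedure used to produce the figures: it assumes a sine target function, inputs drawn uniformly from $[0, 1]$, Gaussian noise on the training targets, and an evenly spaced evaluation grid in place of the integral over $p(\mathbf{x})$. All function names and parameter values (`target`, `decompose`, `noise_std`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Hypothetical target function y(x); the notes do not specify one."""
    return np.sin(2 * np.pi * x)

def design_matrix(x, degree):
    """Polynomial features [1, x, x^2, ..., x^degree], one row per sample."""
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree):
    """Least-squares weights for a polynomial of the given degree."""
    w, *_ = np.linalg.lstsq(design_matrix(x, degree), y, rcond=None)
    return w

def decompose(degree, n_points=10, n_datasets=500, noise_std=0.2):
    """Estimate expected error, variance and squared bias over many data sets."""
    x_eval = np.linspace(0.0, 1.0, 200)  # grid standing in for the integral over p(x)
    predictions = np.empty((n_datasets, x_eval.size))
    for m in range(n_datasets):
        x_train = rng.uniform(0.0, 1.0, n_points)
        y_train = target(x_train) + rng.normal(0.0, noise_std, n_points)
        predictions[m] = design_matrix(x_eval, degree) @ fit(x_train, y_train, degree)
    mean_prediction = predictions.mean(axis=0)  # estimate of E_D[f(x; D)]
    variance = ((predictions - mean_prediction) ** 2).mean()      # approximation error
    bias_sq = ((mean_prediction - target(x_eval)) ** 2).mean()    # structure error
    expected_error = ((predictions - target(x_eval)) ** 2).mean() # estimate of E_D[E(D)]
    return expected_error, variance, bias_sq

results = {degree: decompose(degree) for degree in (1, 2, 6)}
for degree, (err, var, bias_sq) in results.items():
    print(f"degree {degree}: expected error {err:.3f} = variance {var:.3f} + bias^2 {bias_sq:.3f}")
```

Because the expectation over data sets is replaced by an average over the same sampled data sets in all three quantities, the estimated expected error equals the sum of the estimated variance and squared bias up to floating-point rounding. The squared-bias estimate still contains some sampling noise from the finite number of data sets, so the individual numbers should be read as rough indications only.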