



An in-depth exploration of matrix operations and their applications in linear regression. It covers topics such as matrix notation, transpose, product, inner product, matrix derivatives, and the use of basis functions for non-linear regression. The text also discusses the concepts of structure error and approximation error in the context of function approximation.
Typology: Lecture notes




Notation: A vector
$$a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix}$$
always denotes a column vector. With $a^T = [a_1, a_2, \dots, a_d]$ we denote the corresponding row vector.
Transpose:
$$(A + B)^T = A^T + B^T, \quad (AB)^T = B^T A^T, \quad (A^{-1})^T = (A^T)^{-1} = A^{-T} \tag{1}$$
Product:
$$(AB)_{ij} = \sum_k A_{ik} B_{kj} \tag{2}$$
Inner product of two vectors:
$$a^T b = \sum_k a_k b_k \tag{4}$$
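The transpose, product and inner-product identities above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random 3×3 matrices and the seed are arbitrary illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

# Transpose rules, Eq. (1)
sum_rule = np.allclose((A + B).T, A.T + B.T)
product_rule = np.allclose((A @ B).T, B.T @ A.T)
inverse_rule = np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T))

# Matrix product written out as the explicit sum over k: (AB)_ij = sum_k A_ik B_kj
AB_loop = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        for k in range(3):
            AB_loop[i, j] += A[i, k] * B[k, j]
product_matches = np.allclose(AB_loop, A @ B)

# Inner product of two vectors, Eq. (4): a^T b = sum_k a_k b_k
a = rng.normal(size=3)
b = rng.normal(size=3)
inner_matches = np.isclose(sum(a[k] * b[k] for k in range(3)), a @ b)
```

Note that NumPy's one-dimensional arrays carry no row or column orientation, so `a @ b` plays the role of $a^T b$ here.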
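Gradient of a function $f(x): \mathbb{R}^d \to \mathbb{R}$: The gradient is defined as a row vector:
$$\frac{\partial f}{\partial x} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \right] \tag{5}$$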
Chain Rule:
$$\frac{\partial Z}{\partial X} = \frac{\partial Z}{\partial Y} \frac{\partial Y}{\partial X} \tag{6}$$
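Product Rule:
$$\frac{\partial (YZ)}{\partial X} = \frac{\partial Y}{\partial X} Z + Y \frac{\partial Z}{\partial X} \tag{7}$$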
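Linear derivatives:
$$\frac{\partial a^T x}{\partial x} = \frac{\partial x^T a}{\partial x} = a^T \tag{8}$$
$$\frac{\partial A x}{\partial x} = A, \quad \frac{\partial x^T A}{\partial x} = A \tag{9}$$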
Quadratic derivatives:
$$\frac{\partial x^T A x}{\partial x} = x^T A + x^T A^T, \quad \frac{\partial x^T x}{\partial x} = 2x^T \tag{10}$$
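The derivative rules above can be sanity-checked with finite differences. The following is a sketch under stated assumptions: `numerical_jacobian` is a hypothetical helper written for this illustration, and the dimension, seed and tolerances are arbitrary. Each row of the returned Jacobian holds the row-vector gradient of one output component, which matches the row-vector convention used above.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x; row i holds d f_i / d x."""
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        diff = np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))
        J[:, j] = diff / (2 * eps)
    return J

rng = np.random.default_rng(1)
d = 4
x = rng.normal(size=d)
a = rng.normal(size=d)
A = rng.normal(size=(d, d))

# Linear rule: d(a^T x)/dx = a^T, a 1 x d row vector
lin_ok = np.allclose(numerical_jacobian(lambda v: a @ v, x), a[None, :], atol=1e-6)

# Linear rule: d(Ax)/dx = A
mat_ok = np.allclose(numerical_jacobian(lambda v: A @ v, x), A, atol=1e-6)

# Quadratic rule: d(x^T A x)/dx = x^T A + x^T A^T
quad_ok = np.allclose(
    numerical_jacobian(lambda v: v @ A @ v, x),
    (x @ A + x @ A.T)[None, :],
    atol=1e-5,
)

# Quadratic rule: d(x^T x)/dx = 2 x^T
sq_ok = np.allclose(numerical_jacobian(lambda v: v @ v, x), (2 * x)[None, :], atol=1e-6)
```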
Figure 1: Linear Regression: The training data points are given by the blue circles, the original function is plotted by the green line. Based on the knowledge from the data-points we want to find the original function.
We are given a dataset $D = \langle x_i, y_i \rangle_{i=1 \dots N}$ (for simplicity we assume that $y$ is a scalar and $x$ is a vector of dimensionality $d$). We want to find a linear function $f(x; w) = w_0 + \sum_{k=1}^{d} w_k x_k = \tilde{x}^T w$ with $\tilde{x}^T = [1 \;\; x^T]$ which minimizes the quadratic error function:
$$E = \frac{1}{N} \sum_{i=1}^{N} (\tilde{x}_i^T w - y_i)^2 \tag{11}$$
The setting is illustrated in Figure 1. The error function can be easily written in matrix form by noting that a sum over the squared error terms can be represented as the inner product of the error vector
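$$z = \begin{bmatrix} \tilde{x}_1^T w - y_1 \\ \tilde{x}_2^T w - y_2 \\ \vdots \\ \tilde{x}_N^T w - y_N \end{bmatrix} = Xw - y \tag{12}$$
with
$$X = \begin{bmatrix} \tilde{x}_1^T \\ \tilde{x}_2^T \\ \vdots \\ \tilde{x}_N^T \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
$$E = \frac{1}{N} z^T z = \frac{1}{N} (Xw - y)^T (Xw - y) \tag{13}$$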
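To make the matrix form concrete, the sketch below builds the matrix $X$ from augmented inputs $\tilde{x}_i^T = [1 \;\; x_i^T]$ and compares the term-by-term sum of squared errors with the inner product $z^T z$, both scaled by $1/N$ as in Eq. (13). The data, weights and sizes are hypothetical random values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 20, 3
x = rng.normal(size=(N, d))   # hypothetical inputs, one row per x_i
y = rng.normal(size=N)        # hypothetical scalar targets y_i
w = rng.normal(size=d + 1)    # arbitrary weights [w_0, w_1, ..., w_d]

# Row i of X is the augmented input [1, x_i^T]
X = np.hstack([np.ones((N, 1)), x])

# Error as an explicit sum over the squared error terms
E_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(N)) / N

# Error in matrix form: E = 1/N (Xw - y)^T (Xw - y)
z = X @ w - y
E_matrix = z @ z / N
```

Both expressions give the same value up to floating-point rounding; the matrix form is the one that the derivative rules can be applied to directly.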
We now differentiate $E$ with respect to $w$ and set the gradient to $0^T$. Apply Eqs. 6, 10 and 9:
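As a sketch of where this step leads, assuming the matrix form of Eq. (13): the chain rule together with the quadratic and linear derivative rules gives $\frac{\partial E}{\partial w} = \frac{2}{N} (Xw - y)^T X$, and setting this row vector to $0^T$ yields the normal equations $X^T X w = X^T y$. The NumPy snippet below solves them on synthetic data and checks that the gradient vanishes at the solution. The data-generating weights `true_w`, the noise level and the sample size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 2
x = rng.uniform(-1.0, 1.0, size=(N, d))
true_w = np.array([0.5, -2.0, 1.5])          # hypothetical [w_0, w_1, w_2]
X = np.hstack([np.ones((N, 1)), x])          # rows are [1, x_i^T]
y = X @ true_w + 0.1 * rng.normal(size=N)    # noisy linear targets

# Solve the normal equations X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of E at the solution: (2/N) (Xw - y)^T X, should be close to 0^T
grad = 2.0 / N * (X @ w_hat - y) @ X

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $X^T X$ explicitly; `np.linalg.lstsq` is usually the more robust choice when $X^T X$ is badly conditioned.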
Figure 3: RBF basis functions
Definitions:

Expected Error: $E_D[E(D)]$. The expected error if we sample a data set of given size $N$ and use this data set to fit our function. Here, the expectation is also taken over all possible data sets of size $N$. $E(D)$ denotes the error if we use training set $D$ to fit the function (see below).

Structure Error: The error that comes solely from the structure of the function used to fit the data. It can be estimated by fitting the function on a very large data set (so that the approximation error vanishes).

Approximation Error: Given by the variance of the function estimates if we sample different training sets of size $N$. The more complex the functions we use, the higher the variance of our estimate becomes.

Let $f(x; D)$ denote a function learned from a specific dataset $D$ containing $N$ examples, and let $y(x)$ denote the target function. We also denote the expected learned function by $E_D[f(x; D)]$, which is calculated by taking the expectation with respect to all possible datasets $D$ of size $N$. The error $E(D)$ for the training set $D$ is given by
$$E(D) = \int_x (f(x; D) - y(x))^2 \, p(x) \, dx \tag{17}$$
We now add and subtract $E_D[f(x; D)]$:
$$\begin{aligned}
E(D) &= \int_x (f(x; D) - E_D[f(x; D)] + E_D[f(x; D)] - y(x))^2 \, p(x) \, dx \\
&= \int_x \Big( (f(x; D) - E_D[f(x; D)])^2 + (E_D[f(x; D)] - y(x))^2 \\
&\qquad + 2 (f(x; D) - E_D[f(x; D)]) (E_D[f(x; D)] - y(x)) \Big) \, p(x) \, dx
\end{aligned}$$
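If we now want to calculate the expected error with respect to all data sets, $E_D[E(D)]$, the cross term (the last line of this equation) vanishes, because $E_D\big[f(x; D) - E_D[f(x; D)]\big] = 0$ and the factor $E_D[f(x; D)] - y(x)$ does not depend on $D$. Therefore
$$E_D[E(D)] = \int_x E_D\left[ (f(x; D) - E_D[f(x; D)])^2 \right] p(x) \, dx + \int_x (E_D[f(x; D)] - y(x))^2 \, p(x) \, dx$$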
The first term corresponds to the approximation error (the variance) and the last term to the structure error (the squared bias). This decomposition is also known as the bias-variance tradeoff.
Figure 4: Structure error (left) and Approximation error (right) for a 1-degree polynomial
Figure 5: Structure error (left) and Approximation error (right) for a 2-degree polynomial
expected loss = variance + bias^2 (18)
When we decrease the bias (which is usually done by increasing the complexity of $f$), the variance of our function estimate will usually increase. Note that instead of using many datasets of size $N$ for $E_D[f(x; D)]$, we can use a single huge dataset of size $M \gg N$. This can easily be shown to be equivalent.

In Figures 4, 5 and 6 we can see the structure error (left) and data fits for different data sets of size 10 (right) for a 1-degree, 2-degree and 6-degree polynomial. The gray points illustrate the probability mass of all data sets (these points are also used to estimate the structure error); the blue points illustrate the small data sets used for the single fits. The approximation error is given by the deviation of the single fits from the optimal hypothesis.
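The decomposition can be estimated by simulation. The sketch below repeatedly samples data sets of size 10, fits polynomials of degree 1, 2 and 6 by least squares, and estimates the squared bias (structure error) and the variance (approximation error) on a grid. Everything specific here is an assumption made for illustration and is not taken from the notes: the target function `target`, the uniform input density on [0, 1], the Gaussian noise level and the number of sampled data sets.

```python
import numpy as np

def target(x):
    """Hypothetical target function y(x); not taken from the notes."""
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_datasets=200, n_points=10, noise=0.1, seed=0):
    """Estimate (bias^2, variance, expected error) for polynomial fits."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(0.0, 1.0, 200)   # grid standing in for a uniform p(x)
    X_eval = np.vander(x_eval, degree + 1, increasing=True)
    fits = np.empty((n_datasets, x_eval.size))
    for m in range(n_datasets):
        # Sample one training set D of size n_points and fit f(x; D)
        x = rng.uniform(0.0, 1.0, size=n_points)
        y = target(x) + noise * rng.normal(size=n_points)
        X = np.vander(x, degree + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        fits[m] = X_eval @ w
    mean_fit = fits.mean(axis=0)                          # estimate of E_D[f(x; D)]
    variance = np.mean((fits - mean_fit) ** 2)            # approximation error
    bias_sq = np.mean((mean_fit - target(x_eval)) ** 2)   # structure error
    expected_error = np.mean((fits - target(x_eval)) ** 2)
    return bias_sq, variance, expected_error

results = {degree: bias_variance(degree) for degree in (1, 2, 6)}
for degree, (bias_sq, variance, expected_error) in results.items():
    print(f"degree {degree}: bias^2={bias_sq:.4f}, "
          f"variance={variance:.4f}, expected error={expected_error:.4f}")
```

Because the mean fit is computed from the same sampled fits, the cross term cancels exactly in the empirical averages, so the printed expected error equals the sum of the two components up to rounding. The estimate of the structure error still carries sampling noise from the finite number of data sets.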