












Artificial Intelligence. Lectures Handout of Machine Learning. Linear Algebra Review and Reference. Prof. Andrew Ng - Stanford University
Linear algebra provides a way of compactly representing and operating on sets of linear equations. For example, consider the following system of equations:
$$
\begin{aligned}
4x_1 - 5x_2 &= -13 \\
-2x_1 + 3x_2 &= 9.
\end{aligned}
$$
This is two equations and two variables, so as you know from high school algebra, you can find a unique solution for $x_1$ and $x_2$ (unless the equations are somehow degenerate, for example if the second equation is simply a multiple of the first, but in the case above there is in fact a unique solution). In matrix notation, we can write the system more compactly as
$$Ax = b$$
with
$$A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} -13 \\ 9 \end{bmatrix}.$$
As we will see shortly, there are many advantages (including the obvious space savings) to analyzing linear equations in this form.
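As a small illustration, the example system can be solved numerically once it is written in matrix form. This is only a sketch that assumes NumPy is available; the variable names are chosen here for illustration.

```python
import numpy as np

# Coefficient matrix and right-hand side of the example system.
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

# Solve Ax = b directly, without forming the inverse of A explicitly.
x = np.linalg.solve(A, b)
print(x)
```

Substituting the result back into both equations confirms that it satisfies the system.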
We use the following notation:
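• By $x \in \mathbb{R}^n$ we denote a vector with $n$ entries, written as a column, where $x_i$ is the $i$th entry:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$
• By $A \in \mathbb{R}^{m \times n}$ we denote a matrix with $m$ rows and $n$ columns, where $a_{ij}$ is the entry in row $i$ and column $j$:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.$$
• We denote the $j$th column of $A$ by $a_j$:
$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}.$$
• We denote the $i$th row of $A$ by $a_i^T$:
$$A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}.$$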
2 Matrix Multiplication
The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is the matrix
$$C = AB \in \mathbb{R}^{m \times p},$$
where
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$$
Note that in order for the matrix product to exist, the number of columns in A must equal the number of rows in B. There are many ways of looking at matrix multiplication, and we’ll start by examining a few special cases.
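The definition translates directly into a triple loop. The sketch below assumes NumPy; the helper name `matmul_definition` is hypothetical and exists only to mirror the formula, since in practice one would use the built-in `@` operator.

```python
import numpy as np

def matmul_definition(A, B):
    """Compute C = AB entry by entry from C_ij = sum_k A_ik * B_kj."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # m = 3, n = 4
B = rng.standard_normal((4, 2))   # n = 4, p = 2
C = matmul_definition(A, B)       # result has shape (3, 2)
```

The result should agree with `A @ B` up to floating-point rounding.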
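For the vector-matrix product $y^T = x^T A$, with $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^m$, and $y \in \mathbb{R}^n$, we can express $A$ in terms of its rows or columns. In the first case we express $A$ in terms of its columns, which gives
$$y^T = x^T \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} = \begin{bmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{bmatrix},$$
which demonstrates that the $i$th entry of $y^T$ is equal to the inner product of $x$ and the $i$th column of $A$. Finally, expressing $A$ in terms of rows we get the final representation of the vector-matrix product,
$$y^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} = x_1 a_1^T + x_2 a_2^T + \cdots + x_m a_m^T,$$
so we see that $y^T$ is a linear combination of the rows of $A$, where the coefficients for the linear combination are given by the entries of $x$.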
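Both readings of the vector-matrix product can be checked numerically. This is a minimal sketch assuming NumPy, with randomly generated data and illustrative variable names.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))   # m = 3 rows, n = 4 columns
x = rng.standard_normal(3)        # x has one entry per row of A

# Direct product y^T = x^T A.
y_direct = x @ A

# Column view: the j-th entry is the inner product of x with column j of A.
y_by_columns = np.array([x @ A[:, j] for j in range(A.shape[1])])

# Row view: a linear combination of the rows of A, weighted by entries of x.
y_by_rows = sum(x[i] * A[i, :] for i in range(A.shape[0]))
```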
Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the matrix-matrix multiplication $C = AB$ as defined at the beginning of this section.

First, we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the $(i, j)$ entry of $C$ is equal to the inner product of the $i$th row of $A$ and the $j$th column of $B$. Symbolically, this looks like the following:
$$C = AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix} = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\ \vdots & \vdots & \ddots & \vdots \\ a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{bmatrix}.$$
Remember that since $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, $a_i \in \mathbb{R}^n$ and $b_j \in \mathbb{R}^n$, so these inner products all make sense. This is the most “natural” representation when we represent $A$ by rows and $B$ by columns. Alternatively, we can represent $A$ by columns, and $B$ by rows, which leads to the interpretation of $AB$ as a sum of outer products. Symbolically,
$$C = AB = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_n^T \end{bmatrix} = \sum_{i=1}^{n} a_i b_i^T.$$
Put another way, $AB$ is equal to the sum, over all $i$, of the outer product of the $i$th column of $A$ and the $i$th row of $B$. Since, in this case, $a_i \in \mathbb{R}^m$ and $b_i \in \mathbb{R}^p$, the dimension of the outer product $a_i b_i^T$ is $m \times p$, which coincides with the dimension of $C$.

Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we represent $B$ by columns, we can view the columns of $C$ as matrix-vector products between $A$ and the columns of $B$. Symbolically,
$$C = AB = A \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix} = \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_p \end{bmatrix}.$$
Here the $i$th column of $C$ is given by the matrix-vector product with the vector on the right, $c_i = Ab_i$. These matrix-vector products can in turn be interpreted using both viewpoints given in the previous subsection. Finally, we have the analogous viewpoint, where we represent $A$ by rows, and view the rows of $C$ as the matrix-vector product between the rows of $A$ and $B$. Symbolically,
$$C = AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} B = \begin{bmatrix} a_1^T B \\ a_2^T B \\ \vdots \\ a_m^T B \end{bmatrix}.$$
Here the $i$th row of $C$ is given by the matrix-vector product with the vector on the left, $c_i^T = a_i^T B$.

It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these viewpoints follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section. However, virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend some time trying to develop an intuitive understanding of the viewpoints presented here. In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:
• Matrix multiplication is associative: $(AB)C = A(BC)$.
• Matrix multiplication is distributive: $A(B + C) = AB + AC$.
• Matrix multiplication is, in general, not commutative; that is, it can be the case that $AB \neq BA$.
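The four viewpoints on $C = AB$ discussed in this section can be compared numerically. The sketch below assumes NumPy and random test matrices; each construction should reproduce the same product.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = A @ B
m, n = A.shape
p = B.shape[1]

# 1. Inner products: C_ij = (row i of A) . (column j of B).
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(p)] for i in range(m)])

# 2. Sum of outer products of column k of A with row k of B.
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(n))

# 3. Columns of C are matrix-vector products A b_j.
C_cols = np.column_stack([A @ B[:, j] for j in range(p)])

# 4. Rows of C are vector-matrix products a_i^T B.
C_rows = np.vstack([A[i, :] @ B for i in range(m)])
```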
3 Operations and Properties
In this section we present several operations and properties of matrices and vectors. Hopefully a great deal of this will be review for you, so the notes can just serve as a reference for these topics.

A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^T$ and anti-symmetric if $A = -A^T$. Any square matrix can be written as
$$A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T),$$
and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to denote the set of all symmetric matrices of size $n$ as $\mathbb{S}^n$, so that $A \in \mathbb{S}^n$ means that $A$ is a symmetric $n \times n$ matrix.
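The trace of a square matrix $A \in \mathbb{R}^{n \times n}$, denoted $\mathrm{tr}(A)$ (or just $\mathrm{tr}A$ if the parentheses are obviously implied), is the sum of diagonal elements in the matrix:
$$\mathrm{tr}A = \sum_{i=1}^{n} A_{ii}.$$
As described in the CS229 lecture notes, the trace has the following properties (included here for the sake of completeness):
• For $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}A = \mathrm{tr}A^T$.
• For $A, B \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}A + \mathrm{tr}B$.
• For $A \in \mathbb{R}^{n \times n}$ and $t \in \mathbb{R}$, $\mathrm{tr}(tA) = t\,\mathrm{tr}A$.
• For $A$, $B$ such that $AB$ is square, $\mathrm{tr}AB = \mathrm{tr}BA$.
• For $A$, $B$, $C$ such that $ABC$ is square, $\mathrm{tr}ABC = \mathrm{tr}BCA = \mathrm{tr}CAB$.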
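As a quick numerical sanity check, the trace can be computed from its definition and compared against standard identities such as $\mathrm{tr}A = \mathrm{tr}A^T$ and $\mathrm{tr}AB = \mathrm{tr}BA$. This sketch assumes NumPy and random square matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Trace from the definition: the sum of the diagonal entries.
trace_by_sum = sum(A[i, i] for i in range(n))

# Quantities that the standard trace identities say should match.
trace_transpose = np.trace(A.T)
trace_ab = np.trace(A @ B)
trace_ba = np.trace(B @ A)
```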
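A norm of a vector $\|x\|$ is informally a measure of the “length” of the vector. For example, we have the commonly-used Euclidean or $\ell_2$ norm,
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$$
Note that $\|x\|_2^2 = x^T x$. More formally, a norm is any function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies 4 properties:
1. For all $x \in \mathbb{R}^n$, $f(x) \geq 0$ (non-negativity).
2. $f(x) = 0$ if and only if $x = 0$ (definiteness).
3. For all $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = |t| f(x)$ (homogeneity).
4. For all $x, y \in \mathbb{R}^n$, $f(x + y) \leq f(x) + f(y)$ (triangle inequality).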
Other examples of norms are the $\ell_1$ norm,
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$
and the $\ell_\infty$ norm,
$$\|x\|_\infty = \max_i |x_i|.$$
In fact, all three norms presented so far are examples of the family of $\ell_p$ norms, which are parameterized by a real number $p \geq 1$, and defined as
$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$
Norms can also be defined for matrices, such as the Frobenius norm,
$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}.$$
Many other norms exist, but they are beyond the scope of this review.
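The norms above are easy to compute by hand for a small example. The sketch below assumes NumPy; the helper `lp_norm` is a hypothetical name introduced here to mirror the $\ell_p$ formula.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.sum(np.abs(x))            # |3| + |-4| + |0|
l2 = np.sqrt(np.sum(x ** 2))      # sqrt(9 + 16 + 0)
linf = np.max(np.abs(x))          # largest absolute entry

def lp_norm(v, p):
    """General l_p norm for a real number p >= 1."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
# Frobenius norm, written as sqrt(tr(A^T A)).
fro = np.sqrt(np.trace(A.T @ A))
```

These values can be compared with `np.linalg.norm`, which accepts `1`, `2`, `np.inf` for vectors and `'fro'` for matrices.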
A set of vectors $\{x_1, x_2, \ldots, x_n\}$ is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, a vector which can be represented as a linear combination of the remaining vectors is said to be (linearly) dependent. For example, if
$$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$$
for some $\{\alpha_1, \ldots, \alpha_{n-1}\}$, then $x_n$ is dependent on $\{x_1, \ldots, x_{n-1}\}$; otherwise, it is independent of $\{x_1, \ldots, x_{n-1}\}$.

The column rank of a matrix $A$ is the largest number of columns of $A$ that constitute a linearly independent set. This is often referred to simply as the number of linearly independent columns, but this terminology is a little sloppy, since it is possible that any vector in some set $\{x_1, \ldots, x_n\}$ can be expressed as a linear combination of the remaining vectors, even though some subset of the vectors might be independent. In the same way, the row rank is the largest number of rows of $A$ that constitute a linearly independent set.

It is a basic fact of linear algebra that for any matrix $A$, $\mathrm{columnrank}(A) = \mathrm{rowrank}(A)$, and so this quantity is simply referred to as the rank of $A$, denoted as $\mathrm{rank}(A)$. The following are some basic properties of the rank:
• For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \leq \min(m, n)$. If $\mathrm{rank}(A) = \min(m, n)$, then $A$ is said to be full rank.
• For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) = \mathrm{rank}(A^T)$.
• For $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B))$.
• For $A, B \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A + B) \leq \mathrm{rank}(A) + \mathrm{rank}(B)$.
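A small example makes the rank concrete. The sketch below assumes NumPy and uses a hand-picked matrix whose second row is twice the first, so only two rows are linearly independent.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # 2 * (row 0), hence dependent
              [1.0, 0.0, 1.0]])

rank_A = np.linalg.matrix_rank(A)
rank_At = np.linalg.matrix_rank(A.T)   # row rank equals column rank
```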
The span of a set of vectors $\{x_1, x_2, \ldots, x_n\}$ is the set of all vectors that can be expressed as a linear combination of $\{x_1, \ldots, x_n\}$. That is,
$$\mathrm{span}(\{x_1, \ldots, x_n\}) = \left\{ v : v = \sum_{i=1}^{n} \alpha_i x_i, \ \alpha_i \in \mathbb{R} \right\}.$$
It can be shown that if $\{x_1, \ldots, x_n\}$ is a set of $n$ linearly independent vectors, where each $x_i \in \mathbb{R}^n$, then $\mathrm{span}(\{x_1, \ldots, x_n\}) = \mathbb{R}^n$. In other words, any vector $v \in \mathbb{R}^n$ can be written as a linear combination of $x_1$ through $x_n$.

The projection of a vector $y \in \mathbb{R}^m$ onto the span of $\{x_1, \ldots, x_n\}$ (here we assume $x_i \in \mathbb{R}^m$) is the vector $v \in \mathrm{span}(\{x_1, \ldots, x_n\})$ such that $v$ is as close as possible to $y$, as measured by the Euclidean norm $\|v - y\|_2$. We denote the projection as $\mathrm{Proj}(y; \{x_1, \ldots, x_n\})$ and can define it formally as
$$\mathrm{Proj}(y; \{x_1, \ldots, x_n\}) = \operatorname{argmin}_{v \in \mathrm{span}(\{x_1, \ldots, x_n\})} \|y - v\|_2.$$
The range (sometimes also called the columnspace) of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$, is the span of the columns of $A$. In other words,
$$\mathcal{R}(A) = \{ v \in \mathbb{R}^m : v = Ax, \ x \in \mathbb{R}^n \}.$$
Making a few technical assumptions (namely that $A$ is full rank and that $n < m$), the projection of a vector $y \in \mathbb{R}^m$ onto the range of $A$ is given by
$$\mathrm{Proj}(y; A) = \operatorname{argmin}_{v \in \mathcal{R}(A)} \|v - y\|_2 = A(A^T A)^{-1} A^T y.$$
This last equation should look extremely familiar, since it is almost the same formula we derived in class (and which we will soon derive again) for the least squares estimation of parameters. Looking at the definition for the projection, it should not be too hard to convince yourself that this is in fact the same objective that we minimized in our least squares problem (except for a squaring of the norm, which doesn’t affect the optimal point) and so these problems are naturally very connected. When $A$ contains only a single column, $a \in \mathbb{R}^m$, this gives the special case for a projection of a vector onto a line:
$$\mathrm{Proj}(y; a) = \frac{a a^T}{a^T a} y.$$
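The projection formula can be checked numerically: the residual $y - \mathrm{Proj}(y; A)$ should be orthogonal to every column of $A$. The sketch below assumes NumPy and a random tall matrix (which is full rank with probability one); it solves a linear system rather than forming $(A^T A)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 2))   # m = 5, n = 2, so n < m
y = rng.standard_normal(5)

# Proj(y; A) = A (A^T A)^{-1} A^T y, computed via a linear solve.
proj = A @ np.linalg.solve(A.T @ A, A.T @ y)
residual = y - proj

# Special case: projection onto the line spanned by one column a.
a = A[:, 0]
proj_line = (a @ y) / (a @ a) * a
```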
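The nullspace of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{N}(A)$, is the set of all vectors that equal 0 when multiplied by $A$, i.e.,
$$\mathcal{N}(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}.$$
Note that vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can say much more. It turns out that
$$\{ w : w = u + v, \ u \in \mathcal{R}(A^T), \ v \in \mathcal{N}(A) \} = \mathbb{R}^n \quad \text{and} \quad \mathcal{R}(A^T) \cap \mathcal{N}(A) = \{0\}.$$
In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are subspaces that share only the zero vector and together span the entire space of $\mathbb{R}^n$. Sets of this type are called orthogonal complements, and we denote this $\mathcal{R}(A^T) = \mathcal{N}(A)^\perp$.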
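The relationship between $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ can be illustrated numerically. The sketch below uses the singular value decomposition, which these notes do not cover, purely as a convenient tool for obtaining orthonormal bases; it assumes NumPy and a small hand-picked matrix.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])   # m = 2, n = 3

# Rows of Vt form an orthonormal basis of R^n; the first `rank` rows
# span R(A^T) and the remaining rows span N(A).
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-12))
row_basis = Vt[:rank].T     # columns span R(A^T)
null_basis = Vt[rank:].T    # columns span N(A)
```

The two bases are mutually orthogonal, and their dimensions add up to $n$.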
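The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $\det : \mathbb{R}^{n \times n} \to \mathbb{R}$, and is denoted $|A|$ or $\det A$ (like the trace operator, we usually omit parentheses). The full formula for the determinant gives little intuition about its meaning, so we instead first give three defining properties of the determinant, from which all the rest follow (including the general formula):
1. The determinant of the identity is 1, $|I| = 1$.
2. Given a matrix $A \in \mathbb{R}^{n \times n}$, if we multiply a single row in $A$ by a scalar $t \in \mathbb{R}$, then the determinant of the new matrix is $t|A|$:
$$\left| \begin{bmatrix} t a_1^T \\ a_2^T \\ \vdots \\ a_n^T \end{bmatrix} \right| = t|A|.$$
3. If we exchange any two rows $a_i^T$ and $a_j^T$ of $A$, then the determinant of the new matrix is $-|A|$, for example
$$\left| \begin{bmatrix} a_2^T \\ a_1^T \\ \vdots \\ a_n^T \end{bmatrix} \right| = -|A|.$$
These properties, however, also give very little intuition about the nature of the determinant, so we now list several properties that follow from the three properties above:
• For $A \in \mathbb{R}^{n \times n}$, $|A| = |A^T|$.
• For $A, B \in \mathbb{R}^{n \times n}$, $|AB| = |A||B|$.
• For $A \in \mathbb{R}^{n \times n}$, $|A| = 0$ if and only if $A$ is singular (i.e., non-invertible).
• For $A \in \mathbb{R}^{n \times n}$ and $A$ non-singular, $|A^{-1}| = 1/|A|$.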
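Before giving the general definition for the determinant, we define, for $A \in \mathbb{R}^{n \times n}$, $A_{\backslash i, \backslash j} \in \mathbb{R}^{(n-1) \times (n-1)}$ to be the matrix that results from deleting the $i$th row and $j$th column from $A$. The general (recursive) formula for the determinant is
$$
\begin{aligned}
|A| &= \sum_{i=1}^{n} (-1)^{i+j} a_{ij} |A_{\backslash i, \backslash j}| \quad \text{(for any } j \in 1, \ldots, n\text{)} \\
&= \sum_{j=1}^{n} (-1)^{i+j} a_{ij} |A_{\backslash i, \backslash j}| \quad \text{(for any } i \in 1, \ldots, n\text{)}
\end{aligned}
$$
with the initial case that $|A| = a_{11}$ for $A \in \mathbb{R}^{1 \times 1}$.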
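The recursive formula can be implemented directly, although its cost grows factorially with $n$, so it is useful only for illustration. The sketch below assumes NumPy; `det_cofactor` is a hypothetical helper name, and indices are zero-based, which leaves the sign $(-1)^{i+j}$ unchanged.

```python
import numpy as np

def det_cofactor(A, i=0):
    """Determinant by cofactor expansion along row i (zero-based)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]            # base case: 1 x 1 matrix
    total = 0.0
    for j in range(n):
        # Delete row i and column j to form the minor.
        minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
        total += (-1) ** (i + j) * A[i, j] * det_cofactor(minor)
    return total

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 1.0, 4.0]])
d_row0 = det_cofactor(A, i=0)
d_row1 = det_cofactor(A, i=1)     # any row gives the same value
```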
It should be obvious that if $A$ is positive definite, then $-A$ is negative definite and vice versa. Likewise, if $A$ is positive semidefinite then $-A$ is negative semidefinite and vice versa. If $A$ is indefinite, then so is $-A$. It can also be shown that positive definite and negative definite matrices are always invertible.

Finally, there is one type of positive definite matrix that comes up frequently, and so deserves some special mention. Given any matrix $A \in \mathbb{R}^{m \times n}$ (not necessarily symmetric or even square), the matrix $G = A^T A$ (sometimes called a Gram matrix) is always positive semidefinite. Further, if $m \geq n$ (and we assume for convenience that $A$ is full rank), then $G = A^T A$ is positive definite.
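The claim about Gram matrices follows from $x^T G x = x^T A^T A x = \|Ax\|_2^2 \geq 0$, and it can be checked numerically. The sketch below assumes NumPy and a random tall matrix, which is full rank with probability one.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))   # m = 6 >= n = 3
G = A.T @ A                       # Gram matrix, symmetric 3 x 3

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigvals = np.linalg.eigvalsh(G)

# The quadratic form x^T G x equals the squared norm of Ax.
x = rng.standard_normal(3)
quad = x @ G @ x
```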
Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and $x \in \mathbb{C}^n$ is the corresponding eigenvector¹ if
$$Ax = \lambda x, \quad x \neq 0.$$
Intuitively, this definition means that multiplying $A$ by the vector $x$ results in a new vector that points in the same direction as $x$, but scaled by a factor $\lambda$. Also note that for any eigenvector $x \in \mathbb{C}^n$ and nonzero scalar $c \in \mathbb{C}$, $A(cx) = cAx = c\lambda x = \lambda(cx)$, so $cx$ is also an eigenvector. For this reason when we talk about “the” eigenvector associated with $\lambda$, we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since $x$ and $-x$ will both be eigenvectors, but we will have to live with this).

We can rewrite the equation above to state that $(\lambda, x)$ is an eigenvalue-eigenvector pair of $A$ if
$$(\lambda I - A)x = 0, \quad x \neq 0.$$
¹ Note that $\lambda$ and the entries of $x$ are actually in $\mathbb{C}$, the set of complex numbers, not just the reals; we will see shortly why this is necessary. Don’t worry about this technicality for now; you can think of complex vectors in the same way as real vectors.

But $(\lambda I - A)x = 0$ has a non-zero solution $x$ if and only if $(\lambda I - A)$ has a non-trivial nullspace, which is only the case if $(\lambda I - A)$ is singular, i.e.,
$$|(\lambda I - A)| = 0.$$
We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial in $\lambda$, where $\lambda$ will have maximum degree $n$. We then find the $n$ (possibly complex) roots of this polynomial to find the $n$ eigenvalues $\lambda_1, \ldots, \lambda_n$. To find the eigenvector corresponding to the eigenvalue $\lambda_i$, we simply solve the linear equation $(\lambda_i I - A)x = 0$. It should be noted that this is not the method which is actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that the complete expansion of the determinant has $n!$ terms); it is rather a mathematical argument.

The following are properties of eigenvalues and eigenvectors (in all cases assume $A \in \mathbb{R}^{n \times n}$ has eigenvalues $\lambda_1, \ldots, \lambda_n$ and associated eigenvectors $x_1, \ldots, x_n$):
• The trace of $A$ is equal to the sum of its eigenvalues,
$$\mathrm{tr}A = \sum_{i=1}^{n} \lambda_i.$$
• The determinant of $A$ is equal to the product of its eigenvalues,
$$|A| = \prod_{i=1}^{n} \lambda_i.$$
We can write all the eigenvector equations simultaneously as
$$AX = X\Lambda$$
where the columns of $X \in \mathbb{R}^{n \times n}$ are the eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $A$, i.e.,
$$X \in \mathbb{R}^{n \times n} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n).$$
If the eigenvectors of $A$ are linearly independent, then the matrix $X$ will be invertible, so $A = X \Lambda X^{-1}$. A matrix that can be written in this form is called diagonalizable.
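The relations $AX = X\Lambda$ and $A = X\Lambda X^{-1}$, together with the trace and determinant identities, can be verified on a small example. The sketch below assumes NumPy and a hand-picked symmetric matrix, for which the eigenvalues are real.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Columns of X are eigenvectors; eigvals holds the matching eigenvalues.
eigvals, X = np.linalg.eig(A)
Lam = np.diag(eigvals)

# Rebuild A from its eigendecomposition.
A_rebuilt = X @ Lam @ np.linalg.inv(X)
```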
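Suppose that $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ is a function that takes as input a matrix $A$ of size $m \times n$ and returns a real value. Then the gradient of $f$ (with respect to $A \in \mathbb{R}^{m \times n}$) is the matrix of partial derivatives, defined as:
$$\nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}$$
i.e., an $m \times n$ matrix with
$$(\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}.$$
Note that the size of $\nabla_A f(A)$ is always the same as the size of $A$. So if, in particular, $A$ is just a vector $x \in \mathbb{R}^n$,
$$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}.$$
It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value. We cannot, for example, take the gradient of $Ax$, $A \in \mathbb{R}^{n \times n}$, with respect to $x$, since this quantity is vector-valued. It follows directly from the equivalent properties of partial derivatives that:
• $\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)$.
• For $t \in \mathbb{R}$, $\nabla_x (t f(x)) = t \nabla_x f(x)$.

It is a little bit trickier to determine what the proper expression is for $\nabla_x f(Ax)$, $A \in \mathbb{R}^{n \times n}$, but this is doable as well (in fact, you’ll have to work this out for a homework problem).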
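Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a function that takes a vector in $\mathbb{R}^n$ and returns a real number. Then the Hessian matrix with respect to $x$, written $\nabla_x^2 f(x)$ or simply as $H$, is the $n \times n$ matrix of partial derivatives,
$$\nabla_x^2 f(x) \in \mathbb{R}^{n \times n} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}.$$
In other words, $\nabla_x^2 f(x) \in \mathbb{R}^{n \times n}$, with
$$(\nabla_x^2 f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.$$
Note that the Hessian is always symmetric (assuming the second partial derivatives are continuous), since
$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}.$$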
Similar to the gradient, the Hessian is defined only when $f(x)$ is real-valued.

It is natural to think of the gradient as the analogue of the first derivative for functions of vectors, and the Hessian as the analogue of the second derivative (and the symbols we use also suggest this relation). This intuition is generally correct, but there are a few caveats to keep in mind. First, for real-valued functions of one variable $f : \mathbb{R} \to \mathbb{R}$, it is a basic definition that the second derivative is the derivative of the first derivative, i.e.,
$$\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x} \frac{\partial}{\partial x} f(x).$$
However, for functions of a vector, the gradient of the function is a vector, and we cannot take the gradient of a vector, i.e.,
$$\nabla_x \nabla_x f(x) = \nabla_x \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}$$
and this expression is not defined. Therefore, it is not the case that the Hessian is the gradient of the gradient. However, this is almost true, in the following sense: if we look at the $i$th entry of the gradient, $(\nabla_x f(x))_i = \partial f(x) / \partial x_i$, and take the gradient with respect to $x$, we get
$$\nabla_x \frac{\partial f(x)}{\partial x_i} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_i \partial x_1} \\ \frac{\partial^2 f(x)}{\partial x_i \partial x_2} \\ \vdots \\ \frac{\partial^2 f(x)}{\partial x_i \partial x_n} \end{bmatrix}$$
which is the $i$th column (or row) of the Hessian. Therefore,
$$\nabla_x^2 f(x) = \begin{bmatrix} \nabla_x (\nabla_x f(x))_1 & \nabla_x (\nabla_x f(x))_2 & \cdots & \nabla_x (\nabla_x f(x))_n \end{bmatrix}.$$
If we don’t mind being a little bit sloppy we can say that (essentially) $\nabla_x^2 f(x) = \nabla_x (\nabla_x f(x))^T$, so long as we understand that this really means taking the gradient of each entry of $(\nabla_x f(x))^T$, not the gradient of the whole vector.
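The column-by-column construction of the Hessian can be mimicked with finite differences. The sketch below assumes NumPy; the helper names, the step sizes, and the test function $f(x) = x^T A x$ with a symmetric $A$ (whose Hessian is $2A$) are all illustrative choices, not part of the notes.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-3):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def numerical_hessian(f, x, h=1e-3):
    """Column i is the gradient of the i-th partial derivative of f."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        partial_i = lambda z, i=i: numerical_gradient(f, z, h)[i]
        H[:, i] = numerical_gradient(partial_i, x, h)
    return H

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
f = lambda x: x @ A @ x
x0 = np.array([0.5, -1.0])
H = numerical_hessian(f, x0)
```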
Let’s apply the equations we obtained in the last section to derive the least squares equations. Suppose we are given a matrix $A \in \mathbb{R}^{m \times n}$ (for simplicity we assume $A$ is full rank) and a vector $b \in \mathbb{R}^m$ such that $b \notin \mathcal{R}(A)$. In this situation we will not be able to find a vector $x \in \mathbb{R}^n$ such that $Ax = b$, so instead we want to find a vector $x$ such that $Ax$ is as close as possible to $b$, as measured by the square of the Euclidean norm $\|Ax - b\|_2^2$. Using the fact that $\|x\|_2^2 = x^T x$, we have
$$\|Ax - b\|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b.$$
Taking the gradient with respect to $x$, and using the properties we derived in the previous section, we have
$$
\begin{aligned}
\nabla_x (x^T A^T A x - 2 b^T A x + b^T b) &= \nabla_x x^T A^T A x - \nabla_x 2 b^T A x + \nabla_x b^T b \\
&= 2 A^T A x - 2 A^T b.
\end{aligned}
$$
Setting this last expression equal to zero and solving for $x$ gives the normal equations
$$x = (A^T A)^{-1} A^T b$$
which is the same as what we derived in class.
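The normal equations can be checked against a library least squares solver. The sketch below assumes NumPy and random data; it solves $A^T A x = A^T b$ as a linear system instead of forming the inverse, and confirms that the gradient $2A^T A x - 2A^T b$ vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 3))   # m = 8 observations, n = 3 unknowns
b = rng.standard_normal(8)

# Solution of the normal equations A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Gradient of ||Ax - b||^2 evaluated at the solution.
gradient_at_solution = 2 * A.T @ A @ x_normal - 2 * A.T @ b
```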
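Now let’s consider a situation where we find the gradient of a function with respect to a matrix, namely for $A \in \mathbb{R}^{n \times n}$, we want to find $\nabla_A |A|$. Recall from our discussion of determinants that
$$|A| = \sum_{i=1}^{n} (-1)^{i+j} A_{ij} |A_{\backslash i, \backslash j}| \quad \text{(for any } j \in 1, \ldots, n\text{)}$$
so, expanding along column $j = \ell$ (where only the $i = k$ term depends on $A_{k\ell}$),
$$\frac{\partial}{\partial A_{k\ell}} |A| = \frac{\partial}{\partial A_{k\ell}} \sum_{i=1}^{n} (-1)^{i+j} A_{ij} |A_{\backslash i, \backslash j}| = (-1)^{k+\ell} |A_{\backslash k, \backslash \ell}| = (\mathrm{adj}(A))_{\ell k}.$$
From this it immediately follows from the properties of the adjoint that
$$\nabla_A |A| = (\mathrm{adj}(A))^T = |A| A^{-T}.$$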
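The identity $\nabla_A |A| = |A| A^{-T}$ can be checked entry by entry with central differences. The sketch below assumes NumPy, a hand-picked invertible matrix, and an illustrative step size `h`.

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 1.0, 4.0]])
n = A.shape[0]
h = 1e-6

# Finite-difference estimate of d|A| / dA_kl for every entry (k, l).
grad_fd = np.zeros_like(A)
for k in range(n):
    for l in range(n):
        E = np.zeros_like(A)
        E[k, l] = h
        grad_fd[k, l] = (np.linalg.det(A + E) - np.linalg.det(A - E)) / (2 * h)

# Closed form from the text: |A| A^{-T}.
grad_formula = np.linalg.det(A) * np.linalg.inv(A).T
```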
Now let’s consider the function $f : \mathbb{S}^n_{++} \to \mathbb{R}$, $f(A) = \log |A|$. Note that we have to restrict the domain of $f$ to be the positive definite matrices, since this ensures that $|A| > 0$, so that the log of $|A|$ is a real number. In this case we can use the chain rule (nothing fancy, just the ordinary chain rule from single-variable calculus) to see that
$$\frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{ij}}.$$
From this it should be obvious that
$$\nabla_A \log |A| = \frac{1}{|A|} \nabla_A |A| = A^{-1},$$
where we can drop the transpose in the last expression because $A$ is symmetric. Note the similarity to the scalar case, where $\partial / (\partial x) \log x = 1/x$.
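The result $\nabla_A \log |A| = A^{-1}$ can be checked the same way, perturbing one entry at a time. The sketch below assumes NumPy and a small hand-picked symmetric positive definite matrix; the perturbations are small enough that the determinant stays positive.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # symmetric positive definite

def logdet(M):
    """log |M|, valid here because the determinant is positive."""
    return np.log(np.linalg.det(M))

h = 1e-6
grad_fd = np.zeros_like(A)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(A)
        E[i, j] = h
        grad_fd[i, j] = (logdet(A + E) - logdet(A - E)) / (2 * h)

grad_formula = np.linalg.inv(A)
```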
Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following equality constrained optimization problem:
$$\max_{x \in \mathbb{R}^n} \ x^T A x \quad \text{subject to } \|x\|_2^2 = 1$$
for a symmetric matrix $A \in \mathbb{S}^n$. A standard way of solving optimization problems with equality constraints is by forming the Lagrangian, an objective function that includes the equality constraints.² The Lagrangian in this case can be given by
$$\mathcal{L}(x, \lambda) = x^T A x - \lambda x^T x$$
where $\lambda$ is called the Lagrange multiplier associated with the equality constraint. It can be established that for $x^*$ to be an optimal point to the problem, the gradient of the Lagrangian has to be zero at $x^*$ (this is not the only condition, but it is required). That is,
$$\nabla_x \mathcal{L}(x, \lambda) = \nabla_x (x^T A x - \lambda x^T x) = 2 A^T x - 2 \lambda x = 0.$$
Notice that, since $A$ is symmetric, this is just the linear equation $Ax = \lambda x$. This shows that the only points which can possibly maximize (or minimize) $x^T A x$ assuming $x^T x = 1$ are the eigenvectors of $A$.
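This conclusion can be illustrated numerically: on the unit sphere, $x^T A x$ evaluated at the eigenvector of the largest eigenvalue equals that eigenvalue, and no random unit vector should exceed it. The sketch below assumes NumPy and a randomly generated symmetric matrix; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                  # symmetrize to get A in S^n

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(A)
x_star = eigvecs[:, -1]            # unit eigenvector of the largest eigenvalue
best = x_star @ A @ x_star

# Evaluate x^T A x on many random unit vectors for comparison.
samples = rng.standard_normal((1000, 4))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
values = np.einsum('ij,jk,ik->i', samples, A, samples)

# Gradient of the Lagrangian at x_star with lambda = best.
lagrangian_grad = 2 * A @ x_star - 2 * best * x_star
```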
² Don’t worry if you haven’t seen Lagrangians before, as we will cover them in greater detail later in CS229.