






















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
To get a quick executive guide, please refer to the cheat sheet in section(4). To see how matrix calculus simplify the process of derivation, please.
Typology: Slides
1 / 30
This page cannot be seen from the preview
Don't miss anything!























Abstract Matrix Calculus[3] is a very useful tool in many engineering prob- lems. Basic rules of matrix calculus are nothing more than ordinary calculus rules covered in undergraduate courses. However, using ma- trix calculus, the derivation process is more compact. This document is adapted from the notes of a course the author recently attends. It builds matrix calculus from scratch. Only prerequisites are basic cal- culus notions and linear algebra operation. To get a quick executive guide, please refer to the cheat sheet in section(4). To see how matrix calculus simplify the process of derivation, please refer to the application in section(3.4).
∗hupili [at] ie [dot] cuhk [dot] edu [dot] hk †Last compile:April 24, 2012
Table 1: Derivation and Application Correspondance Derivation Application 2.1-2.7 3. 2.9,2.10 3. 2.8,2.11 3.
2 Derivation
From the introductary example, we already see that matrix calculus does not distinguish from ordinary calculus by fundamental rules. However, with better organization of elements and proving useful properties, we can sim- plify the derivation process in real problems. The author would like to adopt the following definition:
Definition 1. For a scalar valued function f (x), the result ∂f ∂x has the same
size with x. That is
∂f ∂x
∂f ∂x 11
∂f ∂x 12
∂f ∂x 1 n ∂f ∂x 21
∂f ∂x 22
∂f ∂x 2 n .. .
∂f ∂xm 1
∂f ∂xm 2
∂f ∂xmn
In eqn(2), x is a 1-by-1 matrix and the result ∂f ∂x
= a is also a 1-by-
matrix. In eqn(5), x is a column vector(known as n-by-1 matrix) and the
result ∂f ∂x
= a has the same size.
Example 1. By this definition, we have:
∂f ∂xT^
∂f ∂x
)T^ = aT^ (7)
Note that we only use the organization definition in this example. Later we’ll show that with some matrix properties, this formula can be derived without
using ∂f ∂x as a bridge.
Theorem 1. If there’s a multivariate scalar function f (x) = aTx, we have ∂f ∂x
= a.
Proof. See introductary example.
Since aTx is scalar, we can write it equivalently as the trace of its own. Thus,
Proposition 2. If there’s a multivariate scalar function f (x) = Tr
aTx
we have
∂f ∂x = a.
Tr [•] is the operator to sum up diagonal elements of a matrix. In the next section, we’ll explore more properties of trace. As long as we can transform our target function into the form of theorem(1) or proposition(2), the result can be written out directly. Notice in proposition(2), a and x are both vectors. We’ll show later as long as their sizes agree, it holds for matrix a and x.
Definition 2. Trace of square matrix is defined as: Tr [A] =
i Aii
Example 2. Using definition(1,2), it is very easy to show:
∂Tr [A] ∂A
since only diagonal elements are kept by the trace operator.
Theorem 3. Matrix trace has the following properties:
i
j Aij^ Bij
where A, B are matrices with proper sizes, and c is a scalar value.
Proof. See wikipedia [5] for the proof.
Here we explain the intuitions behind each property to make it eas- ier to remenber. Property(1) and property(2) shows the linearity of trace. Property(3) means two matrices’ multiplication inside a the trace operator is commutative. Note that the matrix multiplication without trace is not commutative and the commutative property inside the trace does not hold
Although we want matrix derivative at most time, it turns out matrix differ- ential is easier to operate due to the form invariance property of differential. Matrix differential inherit this property as a natural consequence of the fol- lowing definition.
Definition 3. Define matrix differential:
dA =
dA 11 dA 12... dA 1 n dA 21 dA 22... dA 2 n .. .
dAm 1 dAm 2... dAmn
Theorem 5. Differential operator is distributive through trace operator: dTr [A] = Tr [dA]
Proof.
LHS = d(
i
Aii) =
i
dAii (19)
RHS = Tr
dA 11 dA 12... dA 1 n dA 21 dA 22... dA 2 n .. .
dAm 1 dAm 2... dAmn
i
dAii = LHS (21)
Now that matrix differential is well defined, we want to relate it back to matrix derivative. The scalar version differential and derivative can be related as follows:
df =
∂f ∂x dx (22)
So far, we’re dealing with scalar function f and matrix variable x.
∂f ∂x and
dx are both matrix according to definition. In order to make the quantities in eqn(22) equal, we must figure out a way to make the RHS a scalar. It’s not surprising that trace is what we want.
Theorem 6.
df = Tr
∂f ∂x
)Tdx
for scalar function f and arbitrarily sized x.
Proof.
LHS = df (24)
(definition of scalar differential) =
ij
∂f ∂xij dxij (25)
RHS = Tr
∂f ∂x
)Tdx
(trace property (5)) =
ij
∂f ∂x )ij (dx)ij (27)
(definition(3)) =
ij
∂f ∂x
)ij dxij (28)
(definition(1)) =
ij
∂f ∂xij dxij (29)
= LHS (30)
Theorem(6) is the bridge between matrix derivative and matrix differ- ential. We’ll see in later applications that matrix differential is more con- venient to manipulate. After certain manipulation we can get the form of theorem(6). Then we can directly write out matrix derivative using this theorem.
Theorem 7. We claim the following properties of matrix differential:
Proof. They’re all natural consequences given the definition(3). We only show the 3rd one in this document. Note that the equivalence holds if LHS and RHS are equivalent element by element. We consider the (ij)-th element.
LHSij = d(
k
AikBkj ) (31)
k
(dAikBkj + AikdBkj ) (32)
k
dAikBkj +
k
AikdBkj (34)
= LHSij (35)
To this point, you can handle many problems. In this schema, matrix differential and trace play crucial roles. Later we’ll deduce some widely used formula to facilitate potential applications. As you will see, although we rely on matrix differential in the schema, the deduction of certain formula may be more easily done using matrix derivatives.
For a background of determinant, please refer to [6]. We first quote some definitions and properties without proof:
Theorem 8. Let A be a square matrix:
j Aij^ Cij^.
∂ det(A) ∂Aij
j Aij^ Cij^ ) ∂Aij
= Cij (52)
First equality is from determinant definition and second equality is by the observation that only coefficient of Aij is left. Grouping all elements using definition(1), we have:
∂ det(A) ∂A = C = adj(A)T^ (53)
If A is non-singular, we have:
∂ det(A) ∂A = (det(A)A−^1 )T^ = det(A)(A−^1 )T^ (54)
Next, we use theorem(6) to give the differential relationship:
d det(A) = Tr
∂ det(A) ∂A
)TdA
= Tr
(det(A)(A−^1 )T)TdA
= Tr
det(A)A−^1 dA
In many practical problem, the log determinant is more widely used: ∂ ln det(A) ∂A
det(A)
∂ det(A) ∂A
The first equality comes from chain rule of ordinary calculus( ln det(A) and det(A) are both scalars). Similarly, we derive for differential:
d ln det(A) = Tr
A−^1 dA
The above sections show how to deal with scalar functions. In order to deal with vector function, we should restrict our attention to vector variables. It’s no surprising that the tractable forms in matrix calculus is so scarse. If we allow matrix functions and matrix variables, given the fact that fully specification of all partial derivatives calls for a tensor, it will be difficult to visualize the result on a paper. An alternative is to stretch functions and variables such that they appear as vectors. An annoying fact of matrix calculus is that, when you try to find reference materials, there are always two kinds of people. One group calculates as the transpose of another. Many online resources are not coherent, which mislead people. We borrow the following definitions of Hessian matrix[7] and Jacobian matrix[8] from Wikipedia:
H(f ) =
∂^2 f ∂x^21
∂^2 f ∂x 1 ∂x 2 · · ·^
∂^2 f ∂x 1 ∂xn
∂^2 f ∂x 2 ∂x 1
∂^2 f ∂x^22 · · ·^
∂^2 f ∂x 2 ∂xn
.. .
∂^2 f ∂xn ∂x 1
∂^2 f ∂xn ∂x 2 · · ·^
∂^2 f ∂x^2 n
∂y 1 ∂x 1
∂y 1 ∂xn .. .
∂ym ∂x 1
∂ym ∂xn
In the previous sections, we relate matrix differential with matrix derivatvie using theorem(6). This theorem bridges the two quantities using trace. Thus we can handle the problem using preferred form intechangeably. (As is seen: calculating derivative of determinant is more direct; calculating differential of inverse is tractable.) In this section, we provide another theorem to relate vector function differential to vector-to-vector derivatives. Amazingly, it takes a cleaner form.
Theorem 9. Consider the definition(4), we have: df = (
∂f ∂x )Tdx
Proof. Apparently, df has n components, so we prove element by element. Consider the j-th component:
LHSj = dfj (65)
=
∑^ m
i=
∂fj ∂xi
dxi (66)
RHSj = (( ∂f ∂x
)Tx)j (67)
∑^ m
i=
∂f ∂x )T jixi (68)
∑^ m
i=
∂f ∂x
)ij xi (69)
∑^ m
i=
∂f ∂x )ij xi (70)
∑^ m
i=
∂fj ∂xi )xi = LHSj (71)
Note that the trace operator is gone compared with theorem(6) due to the nice way of defining matrix vector multiplication. We can have a similar schema of handling vector-to-vector derivatives using this scheme. We don’t bother to list the schema again. Instead, we provide an example.
Example 7. Consider the variable transformation:x = σΛ−^0.^5 W Tξ, where σ is a real value, Λ is full rank diagonal matrix, and W is orthonormal square matrix (namelyW W T^ = W TW = I). Compute the absolute value of Jacobian.
First, we want to find ∂x ∂ξ
. This can be easily done by computing the
differential: dx = d(σΛ−^0.^5 W Tξ) = σΛ−^0.^5 W Tdξ (72)
Applying theorem(9), we have:
∂x ∂ξ = (σΛ−^0.^5 W T)T^ (73)
Thus,
Jm = ( ∂x ∂ξ )T^ = ((σΛ−^0.^5 W T)T)T^ = σΛ−^0.^5 W T^ (74)
where Jm is the Jacobian matrix and J = det(Jm). Then we use some property of determinant to calculate the absolute value of Jacobian:
| det(J mT)| | det(Jm)| (77)
=
| det(J mTJm)| (78)
=
| det(JmJ mT)| (79)
=
| det(W Λ−^0.^5 σσΛ−^0.^5 W T)| (80)
=
| det(σ^2 W Λ−^1 W T)| (81)
If the dimension of x and ξ is d and we define Σ = W ΛW T. A nice result is calculated: |J| = σd^ det(Σ)−^1 /^2 (82)
which we’ll see an application of generalizing the multivariate Gaussian dis- tribution.
Theorem 10. When f and x are of the same size, with definition(4):
( ∂f ∂x
∂x ∂f
Proof. Using theorem(9):
df = (
∂f ∂x )Tdx (84)
(( ∂f ∂x )T)−^1 df = dx (85)
dx = (( ∂f ∂x
)−^1 )Tdf (86) (87)
Proposition 12. Consider a chain of: x a scalar, y a column vector, z a scalar. z = z(y), yi = yi(x), i = 1, 2... , n. Apply the chain rule, we have
∂z ∂x
∂y ∂x
∂z ∂y
∑^ n
i=
∂yi ∂x
∂z ∂yi
Now we explain the intuition behind. x, z are both scalar, so we’re basically calculating the derivative in ordinary calculus. Besides, we have a
group of ”bridge variables”, yi. ∂yi ∂x
∂z ∂yi is just the result of applying scalar
chain rule on the chain:x → yi → z. The separate results of different bridge variables are additive! To see why I make this proposition, interested readers can refer to the comments in the corresponding LATEX source file.
Example 8. Show the derivative of (x − μ)TΣ−^1 (x − μ) to μ (for symmetric Σ−^1 ).
∂[(x − μ)TΣ−^1 (x − μ)] ∂μ
∂[x − μ] ∂μ
∂[(x − μ)TΣ−^1 (x − μ)] ∂[x − μ]
(example(4)) = ∂[x − μ] ∂μ
2 Σ−^1 (x − μ) (100)
(d[x − μ] = −I dμ) = −I 2 Σ−^1 (x − μ) (101) = −2Σ−^1 (x − μ) (102)
3 Application
The induced norm of matrix is defined as [4]:
||A||p = max x
||Ax||p ||x||p
where || • ||p denotes the p-norm of vectors. Now we solve for p = 2. (By default, || • || means || • || 2 ) The problem can be restated as:
||A||^2 = max x
||Ax||^2 ||x||^2
since all quantities involved are non-negative. Then we consider a scaling of vector x′^ = tx, thus:
||A||^2 = max x′
||Ax′||^2 ||x′||^2
= max x
||tAx||^2 ||tx||^2
= max x
t^2 ||Ax||^2 t^2 ||x||^2
= max x
||Ax||^2 ||x||^2
This shows the invariance under scaling. Now we can restrict our attention to those x with ||x|| = 1, and reach the following formulation:
Maximize f (x) = ||Ax||^2 (106) s.t. ||x||^2 = 1 (107)
The standard way to handle this constrained optimization is using La- grange relaxation: L(x) = f (x) − λ(||x||^2 − 1) (108)
Then we apply the general schema of handling scalar function on L(x). First take differential:
dL(x) = dTr [L(x)] (109) = Tr [d(L(x))] (110) = Tr
d(xTATAx − λ(xTx − 1))
= Tr
2 xTATAdx − λ(2xTdx)
= Tr
(2xTATA − 2 λxT)dx
Next write out derivative:
∂L ∂x = 2ATAx − 2 λx (114)
Let
∂x
= 0, we have: (ATA)x = λx (115)
That means x is the eigen vector of (ATA) (normalized to ||x|| = 1), and λ is corresponding eigen value. We plug this result back to objective function:
f (x) = xT(ATAx) = xT(λx) = λ (116)
which means, to maximize f (x), we should pick the maximum eigen value:
||A||^2 = max f (x) = λmax(ATA) (117)
That is: ||A|| =
λmax(ATA) = σmax(A) (118)
where σmax denotes the maximum singular value. If A is real symmetric, σmax(A) = λmax(A). Now we consider a real symmetric A and check whether:
λ^2 max(A) = max x
||Ax||^2 ||x||^2
= max x
xTATAx xTx
2 σ }.^ σ^ can also be thought to control the uncertainty.
Now we already get Gaussian distribution. The division by 2 in the exponent is only to simplify deductions. Writting σ^2 instead of σ is to align with some statistic quantities. The rest term is basically used to do normalization, which can be derived by extending the 1-D Gaussian integral to a 2-D area integral using Fubini’s thorem[11]. Interested reader can refer to [10]. Now we’re ready to extend the univariate version to multivariate version. The above four characteristics appear as simple interpretation to everyone who learned Gaussian distribution before. However, they’re the real ”ax- ioms” behind. Now consider an isotropic Gaussian distribution and we start with perpendicular axes. The uncertainty along each axis is the same, so the distance measure is now ||x − μ||^22 = (x − μ)T(x − μ). The exponent is exp{− (^21) σ 2 (x − μ)T(x − μ)}. Integrating over the volume of x can be done by transforming it into iterated form using Fubini’s theorem. Then we simply apply the method that we deal with univariate Gaussian integral. Now we have multivariate isotropic Gaussian distribution:
G(x|μ, σ^2 ) =
(2π)d/^2 σd^
exp{−
2 σ^2
(x − μ)T(x − μ)} (125)
where d is the dimension of x, μ. We are just one step away from a general Gaussian. Suppose we have a general Gaussian, whose covariance is not isotropic nor does it all along the standard Euclidean axes. Denote the sample from this Gaussian by ξ. We can first apply a rotation on ξ to bring it back to the standard axes, which can be done by left multiplying an orthongonal matrix W T^. Then we scale each component of W ξ by different ratio to make them isotropic, which can be done by left multiplying a diagonal matrix Λ−^0.^5. The exponent − 0 .5 is simply to make later discussion convenient. Finally, we multiply it by σ to be able to control the uncertainty at each direction again. The transform is given by x = σΛ−^0.^5 W Tξ. Plugging this back to eqn(125), we derive the following exponent:
exp{−
2 σ^2
(x − μ)T(x − μ)} (126)
= exp{−
2 σ^2
(σΛ−^0.^5 W Tξ − μ)T(σΛ−^0.^5 W Tξ − μ)} (127)
= exp{−
2 σ^2
(ξ − μξ)Tσ^2 W Λ−^1 W T(ξ − μξ)} (128)
= exp{−
(ξ − μξ)TΣ−^1 (ξ − μξ)} (129)
where we let μξ = Eξ, and we plugged in the following result:
μ = Ex = E[σΛ−^0.^5 W Tξ] = σΛ−^0.^5 W TEξ = σΛ−^0.^5 W Tμξ (130)
Now we only need to find the normalization factor to make it a probability distribution. Note eqn(125) integrates to 1, which is: ∫ G(x|μ, σ^2 )dx =
(2π)d/^2 σd^
exp{−
2 σ^2
(x − μ)T(x − μ)}dx = 1 (131)
Transforming variable from x to ξ causes a scaling by absolute Jacobian, which we already calculated in example(7). That is: ∫ 1 (2π)d/^2 σd^
exp{−
(ξ − μξ)TΣ−^1 (ξ − μξ)}|J|dξ = 1 (132) ∫ σd^ det(Σ)−^1 /^2 (2π)d/^2 σd^
exp{−
(ξ − μξ)TΣ−^1 (ξ − μξ)}dξ = 1 (133) ∫ 1 (2π)d/^2 det(Σ)^1 /^2
exp{−
(ξ − μξ)TΣ−^1 (ξ − μξ)}dξ = 1 (134)
The term inside integral is just the general Gaussian density we want to find:
G(x|μ, Σ) =
(2π)d/^2 det(Σ)^1 /^2
exp{−
(x − μ)TΣ−^1 (x − μ)} (135)
Given N samples, xt, t = 1, 2 ,... , N , independently indentically distributed that are drawn from the following Gaussian distribution:
G(xt|μ, Σ) =
(2π)d/^2 det(Σ)^1 /^2
exp{−
(xt − μ)TΣ−^1 (xt − μ)} (136)
solve the paramters θ = {μ, Σ}, that maximize:
p(X|θ) =
t=
G(xt|θ) (137)
It’s more conveninent to handle the log likelihood as defined below:
L = ln p(X|θ) =
t=
ln G(xt|θ) (138)
We write each term of L out to facilitate further processing:
L = − N d 2
ln(2π) −
ln |Σ| −
t
(xt − μ)TΣ−^1 (xt − μ) (139)
Taking derivative of μ, the first two terms are gone and the third term is already handled by example(8):
∂L ∂μ
t
Σ−^1 (xt − μ) (140)