Matrix Calculus: Derivation and Simple Application, Slides of Calculus

To get a quick executive guide, please refer to the cheat sheet in section(4). To see how matrix calculus simplify the process of derivation, please.

Typology: Slides

2021/2022

Uploaded on 08/05/2022

jacqueline_nel
jacqueline_nel 🇧🇪

4.4

(242)

3.2K documents

1 / 30

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Matrix Calculus:
Derivation and Simple Application
HU, Pili
March 30, 2012
Abstract
Matrix Calculus[3] is a very useful tool in many engineering prob-
lems. Basic rules of matrix calculus are nothing more than ordinary
calculus rules covered in undergraduate courses. However, using ma-
trix calculus, the derivation process is more compact. This document
is adapted from the notes of a course the author recently attends. It
builds matrix calculus from scratch. Only prerequisites are basic cal-
culus notions and linear algebra operation.
To get a quick executive guide, please refer to the cheat sheet in
section(4).
To see how matrix calculus simplify the process of derivation, please
refer to the application in section(3.4).
hupili [at] ie [dot] cuhk [dot] edu [dot] hk
Last compile:April 24, 2012
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e

Partial preview of the text

Download Matrix Calculus: Derivation and Simple Application and more Slides Calculus in PDF only on Docsity!

Matrix Calculus:

Derivation and Simple Application

HU, Pili∗

March 30, 2012†

Abstract Matrix Calculus[3] is a very useful tool in many engineering prob- lems. Basic rules of matrix calculus are nothing more than ordinary calculus rules covered in undergraduate courses. However, using ma- trix calculus, the derivation process is more compact. This document is adapted from the notes of a course the author recently attends. It builds matrix calculus from scratch. Only prerequisites are basic cal- culus notions and linear algebra operation. To get a quick executive guide, please refer to the cheat sheet in section(4). To see how matrix calculus simplify the process of derivation, please refer to the application in section(3.4).

∗hupili [at] ie [dot] cuhk [dot] edu [dot] hk †Last compile:April 24, 2012

Contents

  • 1 Introductory Example
  • 2 Derivation
    • 2.1 Organization of Elements
    • 2.2 Deal with Inner Product
    • 2.3 Properties of Trace
    • 2.4 Deal with Generalized Inner Product
    • 2.5 Define Matrix Differential
    • 2.6 Matrix Differential Properties
    • 2.7 Schema of Hanlding Scalar Function
    • 2.8 Determinant
    • 2.9 Vector Function and Vector Variable
    • 2.10 Vector Function Differential
    • 2.11 Chain Rule
  • 3 Application
    • 3.1 The 2nd Induced Norm of Matrix
    • 3.2 General Multivaraite Gaussian Distribution
    • 3.3 Maximum Likelihood Estimation of Gaussian
    • 3.4 Least Square Error Inference: a Comparison
  • 4 Cheat Sheet
    • 4.1 Definition
    • 4.2 Schema for Scalar Function
    • 4.3 Schema for Vector Function
    • 4.4 Properties
    • 4.5 Frequently Used Formula
    • 4.6 Chain Rule
  • Acknowledgements
  • References
  • Appendix

Table 1: Derivation and Application Correspondance Derivation Application 2.1-2.7 3. 2.9,2.10 3. 2.8,2.11 3.

2 Derivation

2.1 Organization of Elements

From the introductary example, we already see that matrix calculus does not distinguish from ordinary calculus by fundamental rules. However, with better organization of elements and proving useful properties, we can sim- plify the derivation process in real problems. The author would like to adopt the following definition:

Definition 1. For a scalar valued function f (x), the result ∂f ∂x has the same

size with x. That is

∂f ∂x

∂f ∂x 11

∂f ∂x 12

∂f ∂x 1 n ∂f ∂x 21

∂f ∂x 22

∂f ∂x 2 n .. .

∂f ∂xm 1

∂f ∂xm 2

∂f ∂xmn

In eqn(2), x is a 1-by-1 matrix and the result ∂f ∂x

= a is also a 1-by-

matrix. In eqn(5), x is a column vector(known as n-by-1 matrix) and the

result ∂f ∂x

= a has the same size.

Example 1. By this definition, we have:

∂f ∂xT^

∂f ∂x

)T^ = aT^ (7)

Note that we only use the organization definition in this example. Later we’ll show that with some matrix properties, this formula can be derived without

using ∂f ∂x as a bridge.

2.2 Deal with Inner Product

Theorem 1. If there’s a multivariate scalar function f (x) = aTx, we have ∂f ∂x

= a.

Proof. See introductary example.

Since aTx is scalar, we can write it equivalently as the trace of its own. Thus,

Proposition 2. If there’s a multivariate scalar function f (x) = Tr

[

aTx

]

we have

∂f ∂x = a.

Tr [•] is the operator to sum up diagonal elements of a matrix. In the next section, we’ll explore more properties of trace. As long as we can transform our target function into the form of theorem(1) or proposition(2), the result can be written out directly. Notice in proposition(2), a and x are both vectors. We’ll show later as long as their sizes agree, it holds for matrix a and x.

2.3 Properties of Trace

Definition 2. Trace of square matrix is defined as: Tr [A] =

i Aii

Example 2. Using definition(1,2), it is very easy to show:

∂Tr [A] ∂A

= I (8)

since only diagonal elements are kept by the trace operator.

Theorem 3. Matrix trace has the following properties:

  • (1) Tr [A + B] = Tr [A] + Tr [B]
  • (2) Tr [cA] = cTr [A]
  • (3) Tr [AB] = Tr [BA]
  • (4) Tr [A 1 A 2... An] = Tr [AnA 1... An− 1 ]
  • (5) Tr

[

ATB

]

i

j Aij^ Bij

  • (6) Tr [A] = Tr

[

AT

]

where A, B are matrices with proper sizes, and c is a scalar value.

Proof. See wikipedia [5] for the proof.

Here we explain the intuitions behind each property to make it eas- ier to remenber. Property(1) and property(2) shows the linearity of trace. Property(3) means two matrices’ multiplication inside a the trace operator is commutative. Note that the matrix multiplication without trace is not commutative and the commutative property inside the trace does not hold

2.5 Define Matrix Differential

Although we want matrix derivative at most time, it turns out matrix differ- ential is easier to operate due to the form invariance property of differential. Matrix differential inherit this property as a natural consequence of the fol- lowing definition.

Definition 3. Define matrix differential:

dA =

dA 11 dA 12... dA 1 n dA 21 dA 22... dA 2 n .. .

....^

dAm 1 dAm 2... dAmn

Theorem 5. Differential operator is distributive through trace operator: dTr [A] = Tr [dA]

Proof.

LHS = d(

i

Aii) =

i

dAii (19)

RHS = Tr

dA 11 dA 12... dA 1 n dA 21 dA 22... dA 2 n .. .

dAm 1 dAm 2... dAmn

i

dAii = LHS (21)

Now that matrix differential is well defined, we want to relate it back to matrix derivative. The scalar version differential and derivative can be related as follows:

df =

∂f ∂x dx (22)

So far, we’re dealing with scalar function f and matrix variable x.

∂f ∂x and

dx are both matrix according to definition. In order to make the quantities in eqn(22) equal, we must figure out a way to make the RHS a scalar. It’s not surprising that trace is what we want.

Theorem 6.

df = Tr

[

∂f ∂x

)Tdx

]

for scalar function f and arbitrarily sized x.

Proof.

LHS = df (24)

(definition of scalar differential) =

ij

∂f ∂xij dxij (25)

RHS = Tr

[

∂f ∂x

)Tdx

]

(trace property (5)) =

ij

∂f ∂x )ij (dx)ij (27)

(definition(3)) =

ij

∂f ∂x

)ij dxij (28)

(definition(1)) =

ij

∂f ∂xij dxij (29)

= LHS (30)

Theorem(6) is the bridge between matrix derivative and matrix differ- ential. We’ll see in later applications that matrix differential is more con- venient to manipulate. After certain manipulation we can get the form of theorem(6). Then we can directly write out matrix derivative using this theorem.

2.6 Matrix Differential Properties

Theorem 7. We claim the following properties of matrix differential:

  • d(cA) = cdA
  • d(A + B) = dA + dB
  • d(AB) = dAB + AdB

Proof. They’re all natural consequences given the definition(3). We only show the 3rd one in this document. Note that the equivalence holds if LHS and RHS are equivalent element by element. We consider the (ij)-th element.

LHSij = d(

k

AikBkj ) (31)

k

(dAikBkj + AikdBkj ) (32)

RHSij = (dAB)ij + (AdB)ij (33)

k

dAikBkj +

k

AikdBkj (34)

= LHSij (35)

  1. Apply theorem(6) to get: ∂f ∂x

= A (51)

To this point, you can handle many problems. In this schema, matrix differential and trace play crucial roles. Later we’ll deduce some widely used formula to facilitate potential applications. As you will see, although we rely on matrix differential in the schema, the deduction of certain formula may be more easily done using matrix derivatives.

2.8 Determinant

For a background of determinant, please refer to [6]. We first quote some definitions and properties without proof:

Theorem 8. Let A be a square matrix:

  • The minor Mij is obtained by remove i-th row and j-th column of A and then take determinant of the resulting (n-1) by (n-1) matrix.
  • The ij-th cofactor is defined as Cij = (−1)i+j^ Mij.
  • If we expand determinant with respect to the i-th row, det(A) =

j Aij^ Cij^.

  • The adjugate of A is defined as adj(A)ij = (−1)i+j^ Mji = Cji. So adj(A) = CT
  • For non-singular matrix A, we have: A−^1 = adj(det(AA)) = C T det(A) Now we’re ready to show the derivative of determinant. Note that de- terminant is just a scalar function, so all techniques discussed above is applicable. We first write the derivative element by element. Expanding determinant on the i-th row, we have:

∂ det(A) ∂Aij

j Aij^ Cij^ ) ∂Aij

= Cij (52)

First equality is from determinant definition and second equality is by the observation that only coefficient of Aij is left. Grouping all elements using definition(1), we have:

∂ det(A) ∂A = C = adj(A)T^ (53)

If A is non-singular, we have:

∂ det(A) ∂A = (det(A)A−^1 )T^ = det(A)(A−^1 )T^ (54)

Next, we use theorem(6) to give the differential relationship:

d det(A) = Tr

[

∂ det(A) ∂A

)TdA

]

= Tr

[

(det(A)(A−^1 )T)TdA

]

= Tr

[

det(A)A−^1 dA

]

In many practical problem, the log determinant is more widely used: ∂ ln det(A) ∂A

det(A)

∂ det(A) ∂A

= (A−^1 )T^ (58)

The first equality comes from chain rule of ordinary calculus( ln det(A) and det(A) are both scalars). Similarly, we derive for differential:

d ln det(A) = Tr

[

A−^1 dA

]

2.9 Vector Function and Vector Variable

The above sections show how to deal with scalar functions. In order to deal with vector function, we should restrict our attention to vector variables. It’s no surprising that the tractable forms in matrix calculus is so scarse. If we allow matrix functions and matrix variables, given the fact that fully specification of all partial derivatives calls for a tensor, it will be difficult to visualize the result on a paper. An alternative is to stretch functions and variables such that they appear as vectors. An annoying fact of matrix calculus is that, when you try to find reference materials, there are always two kinds of people. One group calculates as the transpose of another. Many online resources are not coherent, which mislead people. We borrow the following definitions of Hessian matrix[7] and Jacobian matrix[8] from Wikipedia:

H(f ) =

∂^2 f ∂x^21

∂^2 f ∂x 1 ∂x 2 · · ·^

∂^2 f ∂x 1 ∂xn

∂^2 f ∂x 2 ∂x 1

∂^2 f ∂x^22 · · ·^

∂^2 f ∂x 2 ∂xn

.. .

∂^2 f ∂xn ∂x 1

∂^2 f ∂xn ∂x 2 · · ·^

∂^2 f ∂x^2 n

J =

∂y 1 ∂x 1

∂y 1 ∂xn .. .

∂ym ∂x 1

∂ym ∂xn

2.10 Vector Function Differential

In the previous sections, we relate matrix differential with matrix derivatvie using theorem(6). This theorem bridges the two quantities using trace. Thus we can handle the problem using preferred form intechangeably. (As is seen: calculating derivative of determinant is more direct; calculating differential of inverse is tractable.) In this section, we provide another theorem to relate vector function differential to vector-to-vector derivatives. Amazingly, it takes a cleaner form.

Theorem 9. Consider the definition(4), we have: df = (

∂f ∂x )Tdx

Proof. Apparently, df has n components, so we prove element by element. Consider the j-th component:

LHSj = dfj (65)

=

∑^ m

i=

∂fj ∂xi

dxi (66)

RHSj = (( ∂f ∂x

)Tx)j (67)

∑^ m

i=

∂f ∂x )T jixi (68)

∑^ m

i=

∂f ∂x

)ij xi (69)

∑^ m

i=

∂f ∂x )ij xi (70)

∑^ m

i=

∂fj ∂xi )xi = LHSj (71)

Note that the trace operator is gone compared with theorem(6) due to the nice way of defining matrix vector multiplication. We can have a similar schema of handling vector-to-vector derivatives using this scheme. We don’t bother to list the schema again. Instead, we provide an example.

Example 7. Consider the variable transformation:x = σΛ−^0.^5 W Tξ, where σ is a real value, Λ is full rank diagonal matrix, and W is orthonormal square matrix (namelyW W T^ = W TW = I). Compute the absolute value of Jacobian.

First, we want to find ∂x ∂ξ

. This can be easily done by computing the

differential: dx = d(σΛ−^0.^5 W Tξ) = σΛ−^0.^5 W Tdξ (72)

Applying theorem(9), we have:

∂x ∂ξ = (σΛ−^0.^5 W T)T^ (73)

Thus,

Jm = ( ∂x ∂ξ )T^ = ((σΛ−^0.^5 W T)T)T^ = σΛ−^0.^5 W T^ (74)

where Jm is the Jacobian matrix and J = det(Jm). Then we use some property of determinant to calculate the absolute value of Jacobian:

|J| = | det(Jm)| (75)

| det(Jm)| | det(Jm)| (76)

| det(J mT)| | det(Jm)| (77)

=

| det(J mTJm)| (78)

=

| det(JmJ mT)| (79)

=

| det(W Λ−^0.^5 σσΛ−^0.^5 W T)| (80)

=

| det(σ^2 W Λ−^1 W T)| (81)

If the dimension of x and ξ is d and we define Σ = W ΛW T. A nice result is calculated: |J| = σd^ det(Σ)−^1 /^2 (82)

which we’ll see an application of generalizing the multivariate Gaussian dis- tribution.

Theorem 10. When f and x are of the same size, with definition(4):

( ∂f ∂x

)−^1 =

∂x ∂f

Proof. Using theorem(9):

df = (

∂f ∂x )Tdx (84)

(( ∂f ∂x )T)−^1 df = dx (85)

dx = (( ∂f ∂x

)−^1 )Tdf (86) (87)

Proposition 12. Consider a chain of: x a scalar, y a column vector, z a scalar. z = z(y), yi = yi(x), i = 1, 2... , n. Apply the chain rule, we have

∂z ∂x

∂y ∂x

∂z ∂y

∑^ n

i=

∂yi ∂x

∂z ∂yi

Now we explain the intuition behind. x, z are both scalar, so we’re basically calculating the derivative in ordinary calculus. Besides, we have a

group of ”bridge variables”, yi. ∂yi ∂x

∂z ∂yi is just the result of applying scalar

chain rule on the chain:x → yi → z. The separate results of different bridge variables are additive! To see why I make this proposition, interested readers can refer to the comments in the corresponding LATEX source file.

Example 8. Show the derivative of (x − μ)TΣ−^1 (x − μ) to μ (for symmetric Σ−^1 ).

∂[(x − μ)TΣ−^1 (x − μ)] ∂μ

∂[x − μ] ∂μ

∂[(x − μ)TΣ−^1 (x − μ)] ∂[x − μ]

(example(4)) = ∂[x − μ] ∂μ

2 Σ−^1 (x − μ) (100)

(d[x − μ] = −I dμ) = −I 2 Σ−^1 (x − μ) (101) = −2Σ−^1 (x − μ) (102)

3 Application

3.1 The 2nd Induced Norm of Matrix

The induced norm of matrix is defined as [4]:

||A||p = max x

||Ax||p ||x||p

where || • ||p denotes the p-norm of vectors. Now we solve for p = 2. (By default, || • || means || • || 2 ) The problem can be restated as:

||A||^2 = max x

||Ax||^2 ||x||^2

since all quantities involved are non-negative. Then we consider a scaling of vector x′^ = tx, thus:

||A||^2 = max x′

||Ax′||^2 ||x′||^2

= max x

||tAx||^2 ||tx||^2

= max x

t^2 ||Ax||^2 t^2 ||x||^2

= max x

||Ax||^2 ||x||^2

This shows the invariance under scaling. Now we can restrict our attention to those x with ||x|| = 1, and reach the following formulation:

Maximize f (x) = ||Ax||^2 (106) s.t. ||x||^2 = 1 (107)

The standard way to handle this constrained optimization is using La- grange relaxation: L(x) = f (x) − λ(||x||^2 − 1) (108)

Then we apply the general schema of handling scalar function on L(x). First take differential:

dL(x) = dTr [L(x)] (109) = Tr [d(L(x))] (110) = Tr

[

d(xTATAx − λ(xTx − 1))

]

= Tr

[

2 xTATAdx − λ(2xTdx)

]

= Tr

[

(2xTATA − 2 λxT)dx

]

Next write out derivative:

∂L ∂x = 2ATAx − 2 λx (114)

Let

∂L

∂x

= 0, we have: (ATA)x = λx (115)

That means x is the eigen vector of (ATA) (normalized to ||x|| = 1), and λ is corresponding eigen value. We plug this result back to objective function:

f (x) = xT(ATAx) = xT(λx) = λ (116)

which means, to maximize f (x), we should pick the maximum eigen value:

||A||^2 = max f (x) = λmax(ATA) (117)

That is: ||A|| =

λmax(ATA) = σmax(A) (118)

where σmax denotes the maximum singular value. If A is real symmetric, σmax(A) = λmax(A). Now we consider a real symmetric A and check whether:

λ^2 max(A) = max x

||Ax||^2 ||x||^2

= max x

xTATAx xTx

  • How fast the likelihood decreases should be controlled by a parameter, say exp{ −(x−μ)

2 σ }.^ σ^ can also be thought to control the uncertainty.

Now we already get Gaussian distribution. The division by 2 in the exponent is only to simplify deductions. Writting σ^2 instead of σ is to align with some statistic quantities. The rest term is basically used to do normalization, which can be derived by extending the 1-D Gaussian integral to a 2-D area integral using Fubini’s thorem[11]. Interested reader can refer to [10]. Now we’re ready to extend the univariate version to multivariate version. The above four characteristics appear as simple interpretation to everyone who learned Gaussian distribution before. However, they’re the real ”ax- ioms” behind. Now consider an isotropic Gaussian distribution and we start with perpendicular axes. The uncertainty along each axis is the same, so the distance measure is now ||x − μ||^22 = (x − μ)T(x − μ). The exponent is exp{− (^21) σ 2 (x − μ)T(x − μ)}. Integrating over the volume of x can be done by transforming it into iterated form using Fubini’s theorem. Then we simply apply the method that we deal with univariate Gaussian integral. Now we have multivariate isotropic Gaussian distribution:

G(x|μ, σ^2 ) =

(2π)d/^2 σd^

exp{−

2 σ^2

(x − μ)T(x − μ)} (125)

where d is the dimension of x, μ. We are just one step away from a general Gaussian. Suppose we have a general Gaussian, whose covariance is not isotropic nor does it all along the standard Euclidean axes. Denote the sample from this Gaussian by ξ. We can first apply a rotation on ξ to bring it back to the standard axes, which can be done by left multiplying an orthongonal matrix W T^. Then we scale each component of W ξ by different ratio to make them isotropic, which can be done by left multiplying a diagonal matrix Λ−^0.^5. The exponent − 0 .5 is simply to make later discussion convenient. Finally, we multiply it by σ to be able to control the uncertainty at each direction again. The transform is given by x = σΛ−^0.^5 W Tξ. Plugging this back to eqn(125), we derive the following exponent:

exp{−

2 σ^2

(x − μ)T(x − μ)} (126)

= exp{−

2 σ^2

(σΛ−^0.^5 W Tξ − μ)T(σΛ−^0.^5 W Tξ − μ)} (127)

= exp{−

2 σ^2

(ξ − μξ)Tσ^2 W Λ−^1 W T(ξ − μξ)} (128)

= exp{−

(ξ − μξ)TΣ−^1 (ξ − μξ)} (129)

where we let μξ = Eξ, and we plugged in the following result:

μ = Ex = E[σΛ−^0.^5 W Tξ] = σΛ−^0.^5 W TEξ = σΛ−^0.^5 W Tμξ (130)

Now we only need to find the normalization factor to make it a probability distribution. Note eqn(125) integrates to 1, which is: ∫ G(x|μ, σ^2 )dx =

(2π)d/^2 σd^

exp{−

2 σ^2

(x − μ)T(x − μ)}dx = 1 (131)

Transforming variable from x to ξ causes a scaling by absolute Jacobian, which we already calculated in example(7). That is: ∫ 1 (2π)d/^2 σd^

exp{−

(ξ − μξ)TΣ−^1 (ξ − μξ)}|J|dξ = 1 (132) ∫ σd^ det(Σ)−^1 /^2 (2π)d/^2 σd^

exp{−

(ξ − μξ)TΣ−^1 (ξ − μξ)}dξ = 1 (133) ∫ 1 (2π)d/^2 det(Σ)^1 /^2

exp{−

(ξ − μξ)TΣ−^1 (ξ − μξ)}dξ = 1 (134)

The term inside integral is just the general Gaussian density we want to find:

G(x|μ, Σ) =

(2π)d/^2 det(Σ)^1 /^2

exp{−

(x − μ)TΣ−^1 (x − μ)} (135)

3.3 Maximum Likelihood Estimation of Gaussian

Given N samples, xt, t = 1, 2 ,... , N , independently indentically distributed that are drawn from the following Gaussian distribution:

G(xt|μ, Σ) =

(2π)d/^2 det(Σ)^1 /^2

exp{−

(xt − μ)TΣ−^1 (xt − μ)} (136)

solve the paramters θ = {μ, Σ}, that maximize:

p(X|θ) =

∏^ N

t=

G(xt|θ) (137)

It’s more conveninent to handle the log likelihood as defined below:

L = ln p(X|θ) =

∑^ N

t=

ln G(xt|θ) (138)

We write each term of L out to facilitate further processing:

L = − N d 2

ln(2π) −

N

ln |Σ| −

t

(xt − μ)TΣ−^1 (xt − μ) (139)

Taking derivative of μ, the first two terms are gone and the third term is already handled by example(8):

∂L ∂μ

t

Σ−^1 (xt − μ) (140)