Matrix Differential Calculus for Optimization: Notes from 10-725 Course, Fall 2012

These notes cover matrix differential calculus: matrix differentials, the chain rule, the product rule, and related identities. They also show how to find a maximum or minimum of a scalar or matrix function by setting the coefficient of dX in its differential to zero. Examples are provided for Infomax Independent Component Analysis (ICA) and Newton's method.

Matrix differential calculus
10-725 Optimization
Geoff Gordon
Ryan Tibshirani

Geoff Gordon—10-725 Optimization—Fall 2012

Review

Matrix differentials: sol’n to matrix calculus pain

‣ compact way of writing Taylor expansions, or …

‣ definition:

‣ df = a(x; dx) [+ r(dx)]

‣ a(x; .) linear in 2nd arg

‣ r(dx)/||dx|| → 0 as dx → 0

d(…) is linear: passes thru +, scalar *

Generalizes Jacobian, Hessian, gradient, velocity
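The defining property above can be checked numerically. The sketch below assumes a specific example that is not on the slide, f(X) = ln |det X| with linear part a(X; dX) = tr(X^{-1} dX), and prints |r(dX)| / ||dX|| for shrinking dX. The test point and direction are arbitrary choices.

```python
import numpy as np

def f(X):
    """Scalar function of a matrix: f(X) = ln |det X| (assumed example)."""
    _, logabsdet = np.linalg.slogdet(X)
    return logabsdet

def a(X, dX):
    """Candidate linear part of the differential: tr(X^{-1} dX)."""
    return np.trace(np.linalg.solve(X, dX))

rng = np.random.default_rng(0)
X = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # arbitrary invertible test point
D = rng.standard_normal((3, 3))                    # arbitrary fixed direction

ratios = []
for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    dX = eps * D
    r = f(X + dX) - f(X) - a(X, dX)                # remainder r(dX)
    ratios.append(abs(r) / np.linalg.norm(dX))
    print(f"eps={eps:.0e}  |r|/||dX|| = {ratios[-1]:.3e}")
```

If a(X; dX) is the right linear part, the printed ratio should shrink roughly in proportion to eps.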

Finding a maximum

or minimum, or saddle point

ID for df(x)    scalar x        vector x         matrix X

scalar f        df = a dx       df = a^T dx      df = tr(A^T dX)

vector f        df = a dx       df = A dx

matrix F        dF = A dx
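As an illustration of reading a stationary point off the scalar-f, matrix-X entry of the table, the sketch below uses a hypothetical least-squares objective f(X) = (1/2) ||A X - B||_F^2, which is not from the slides. Its differential is df = tr((A^T (A X - B))^T dX), so setting the coefficient of dX to zero gives A^T A X = A^T B. The matrices A and B are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))   # placeholder data, not from the slides
B = rng.standard_normal((8, 2))

def f(X):
    """Assumed objective: f(X) = 0.5 * ||A X - B||_F^2."""
    return 0.5 * np.linalg.norm(A @ X - B, "fro") ** 2

def grad(X):
    """Coefficient G in df = tr(G^T dX), here G = A^T (A X - B)."""
    return A.T @ (A @ X - B)

# Check the identification df = tr(G^T dX) to first order at a random point.
X0 = rng.standard_normal((3, 2))
dX = 1e-6 * rng.standard_normal((3, 2))
lhs = f(X0 + dX) - f(X0)
rhs = np.sum(grad(X0) * dX)       # tr(G^T dX) as an elementwise sum

# Setting the coefficient of dX to zero gives the normal equations.
X_star = np.linalg.solve(A.T @ A, A.T @ B)
print(abs(lhs - rhs), np.linalg.norm(grad(X_star)))
```

Because this f is convex, the stationary point found this way is a minimum. In general the same condition also picks out maxima and saddle points.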

Ex: Infomax ICA

  • Training examples x_i ∈ ℝ^d, i = 1:n
  • Transformation y_i = g(W x_i)
      ‣ W ∈ ℝ^{d×d}
      ‣ g(z) =
  • Want:

[Figure: plots of x_i, W x_i, and y_i]

Volume rule
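The volume rule here is presumably the change-of-variables formula for densities under an invertible differentiable map. A sketch under that assumption, with p_X and p_Y as notation introduced for this note only:

```latex
p_Y(y) = \frac{p_X(x)}{\lvert \det J \rvert},
\qquad J = \frac{\partial y}{\partial x}
\quad\Longrightarrow\quad
\ln p_Y(y_i) = \ln p_X(x_i) - \ln \lvert \det J_i \rvert
```

Read this way, the log-density of each y_i differs from that of x_i only by the ln |det J_i| term, which is the quantity summed in the objective L.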

Gradient

L = ∑_i ln |det J_i|

y_i = g(W x_i)

dy_i = J_i dx_i

Gradient

J_i = diag(u_i) W

dJ_i = diag(u_i) dW + diag(v_i) diag(dW x_i) W
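The expression for dJ_i can be sanity-checked by finite differences. The sketch below assumes g is the logistic sigmoid (the choice in Bell & Sejnowski's Infomax paper; the slide's own definition of g is not shown here), so that u_i = g'(W x_i) and v_i = g''(W x_i). All function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jacobian(W, x):
    """J = diag(u) W with u = g'(W x), for logistic g (an assumption)."""
    y = sigmoid(W @ x)
    u = y * (1.0 - y)
    return np.diag(u) @ W

def jacobian_differential(W, x, dW):
    """dJ = diag(u) dW + diag(v) diag(dW x) W, with v = g''(W x)."""
    y = sigmoid(W @ x)
    u = y * (1.0 - y)            # g'(z) for the logistic sigmoid
    v = u * (1.0 - 2.0 * y)      # g''(z) for the logistic sigmoid
    return np.diag(u) @ dW + np.diag(v) @ np.diag(dW @ x) @ W

rng = np.random.default_rng(2)
d = 3
W = np.eye(d) + 0.1 * rng.standard_normal((d, d))
x = rng.standard_normal(d)
dW = 1e-6 * rng.standard_normal((d, d))

finite_diff = jacobian(W + dW, x) - jacobian(W, x)
predicted = jacobian_differential(W, x, dW)
max_err = np.max(np.abs(finite_diff - predicted))
print(max_err)   # should be second order in the size of dW
```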

dL =
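One way to complete this step uses d ln |det J| = tr(J^{-1} dJ). Substituting dJ_i gives dL = tr(G^T dW) with G = n W^{-T} + ∑_i (v_i / u_i) x_i^T, where the division is elementwise. The sketch below checks that against finite differences, again assuming the logistic sigmoid for g, in which case v_i / u_i = 1 - 2 y_i. Function names and data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(W, X):
    """L = sum_i ln|det J_i| with J_i = diag(g'(W x_i)) W; rows of X are the x_i."""
    Y = sigmoid(X @ W.T)
    U = Y * (1.0 - Y)                                # u_i = g'(W x_i)
    return np.sum(np.log(U)) + X.shape[0] * np.linalg.slogdet(W)[1]

def gradient(W, X):
    """G with dL = tr(G^T dW): n W^{-T} + sum_i (1 - 2 y_i) x_i^T (logistic g)."""
    Y = sigmoid(X @ W.T)
    return X.shape[0] * np.linalg.inv(W).T + (1.0 - 2.0 * Y).T @ X

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.standard_normal((n, d))                      # placeholder data
W = np.eye(d) + 0.1 * rng.standard_normal((d, d))
dW = 1e-6 * rng.standard_normal((d, d))

actual = objective(W + dW, X) - objective(W, X)
predicted = np.sum(gradient(W, X) * dW)              # tr(G^T dW)
print(actual, predicted)
```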

ICA natural gradient

[W^{-T} + C] W^T W =

start with W_0 = I

[Figure: plots of W x_i and y_i]
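A minimal sketch of an ascent loop built on this direction follows. It assumes that C is the per-example average (1/n) ∑_i (1 - 2 y_i) x_i^T for a logistic g, and that the update is W ← W + step · [W^{-T} + C] W^T W. Neither the definition of C nor the step size appears in the text above. Since W^{-T} W^T = I, the direction simplifies to W + C W^T W, so no inverse is needed. The sources and mixing matrix are made up for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def natural_gradient(W, X):
    """[W^{-T} + C] W^T W, computed as W + C W^T W.
    C = mean_i (1 - 2 y_i) x_i^T is an assumed definition (logistic g)."""
    Y = sigmoid(X @ W.T)
    C = (1.0 - 2.0 * Y).T @ X / X.shape[0]
    return W + C @ W.T @ W

def ica_natural_gradient(X, step=0.05, iters=500):
    """Ascent along the natural gradient, starting from W_0 = I."""
    W = np.eye(X.shape[1])
    for _ in range(iters):
        W = W + step * natural_gradient(W, X)
    return W

rng = np.random.default_rng(4)
S = rng.laplace(size=(2000, 2))                # hypothetical independent sources
A_mix = np.array([[1.0, 0.5], [0.3, 1.0]])     # hypothetical mixing matrix
X = S @ A_mix.T                                # observed mixtures, one per row
W = ica_natural_gradient(X)
print(W @ A_mix)   # if unmixing worked, this is near a scaled permutation matrix
```

The step size and iteration count are untuned placeholders. A practical implementation would monitor the objective and stop on convergence.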

ICA on natural image patches

More info

Minka’s cheat sheet:

‣ http://research.microsoft.com/en-us/um/people/minka/papers/matrix/

Magnus & Neudecker. Matrix Differential Calculus. Wiley, 1999. 2nd ed.

‣ http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X

Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.

Nonlinear equations

x ∈ ℝ^d

f: ℝ^d → ℝ^d, diff’ble

‣ solve:

Taylor:

‣ J:

Newton:
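For reference, the standard form of these steps is: solve f(x) = 0, linearize f(x + dx) ≈ f(x) + J dx with J the Jacobian of f at x, and iterate x ← x - J^{-1} f(x). The sketch below implements that textbook iteration. The test system (unit circle intersected with the line x_0 = x_1), the starting point, and the tolerance are illustrative choices, not taken from the slides.

```python
import numpy as np

def newton_solve(f, jac, x0, tol=1e-12, max_iter=50):
    """Newton's method for f(x) = 0 with f: R^d -> R^d."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        # Solve J dx = -f(x) instead of forming J^{-1} explicitly.
        x = x + np.linalg.solve(jac(x), -fx)
    return x

def f(x):
    """Hypothetical test system: x0^2 + x1^2 = 1 and x0 = x1."""
    return np.array([x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]])

def jac(x):
    """Jacobian of the test system."""
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

root = newton_solve(f, jac, x0=[1.0, 0.5])
print(root, np.linalg.norm(f(root)))
```

From this starting point the iterates head to the root with both coordinates equal to 1/√2. Other starting points may reach the opposite root or hit a singular Jacobian.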

Error analysis
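The usual result for Newton's method near a simple root is quadratic convergence: the error is roughly squared at each step. The sketch below illustrates this on a scalar example chosen for this note, f(x) = x^2 - 2, by tracking e_k = |x_k - √2| and the ratio e_{k+1} / e_k^2.

```python
import math

def newton_errors(x0, steps):
    """Newton's method on f(x) = x^2 - 2; returns |x_k - sqrt(2)| for each iterate."""
    root = math.sqrt(2.0)
    x = x0
    errors = [abs(x - root)]
    for _ in range(steps):
        x = x - (x * x - 2.0) / (2.0 * x)   # x <- x - f(x) / f'(x)
        errors.append(abs(x - root))
    return errors

errors = newton_errors(1.5, 3)
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(3)]
for e, r in zip(errors[1:], ratios):
    print(f"error = {e:.3e}   error / previous_error^2 = {r:.4f}")
```

For this f the ratio equals 1 / (2 x_k) exactly, so it settles near 1 / (2√2) as the iterates approach the root.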