









These slides cover principal angles, which characterize the relationship between two subspaces of a multidimensional space. They also introduce the multivariate Gaussian distribution, a probabilistic model for multivariate data, and show how to find the mean and covariance matrix of a Gaussian distribution by maximum likelihood estimation.
Typology: Slides










Multivariate Gaussian, Mahalanobis distance, Probabilistic PCA, Factor analysis
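Principal angles are related to the notion of distance between equidimensional subspaces. If $p = q$, then

$$\mathrm{dist}(F, G) = \sqrt{1 - \cos(\theta_p)^2}$$

If the columns of $Q_F \in \mathbb{R}^{m \times p}$ and $Q_G \in \mathbb{R}^{m \times q}$ define orthonormal bases for $F$ and $G$ respectively (from QR decomposition), then

$$\max_{\substack{u \in F \\ \|u\|_2 = 1}} \; \max_{\substack{v \in G \\ \|v\|_2 = 1}} u^\top v \;=\; \max_{\substack{y \in \mathbb{R}^p \\ \|y\|_2 = 1}} \; \max_{\substack{z \in \mathbb{R}^q \\ \|z\|_2 = 1}} y^\top (Q_F^\top Q_G)\, z$$

The right-hand side is the largest singular value of $Q_F^\top Q_G$; the cosines of all the principal angles are the singular values of this matrix.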
If A = [...] and B = [...], then the cosines of the principal angles between ran(A) and ran(B) are 1.000 and 0.
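The QR-based characterization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the slides' own code: the function name `principal_angle_cosines` and the two example matrices are hypothetical, and the inputs are assumed to have full column rank.

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Cosines of the principal angles between ran(A) and ran(B).

    Assumes A and B have full column rank (an assumption of this sketch).
    """
    # Orthonormal bases for the two column spaces (reduced QR).
    QF, _ = np.linalg.qr(A)
    QG, _ = np.linalg.qr(B)
    # The singular values of QF^T QG are the cosines, largest first.
    s = np.linalg.svd(QF.T @ QG, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against round-off slightly above 1

# Hypothetical example: two planes in R^3 that share the e1 direction.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
cosines = principal_angle_cosines(A, B)

# For p = q, dist(F, G) = sqrt(1 - cos(theta_p)^2), using the smallest cosine.
dist = np.sqrt(1.0 - cosines[-1] ** 2)
```

Here the shared direction gives a cosine of 1, and the second principal angle is 45 degrees, so the second cosine and the subspace distance are both $1/\sqrt{2}$.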
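Assume $X = \{x_1, \dots, x_n\}$ can be modeled with a Gaussian distribution

$$p(x \mid \mu, C) = |2\pi C|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu)^\top C^{-1} (x - \mu) \right\}$$

where $\mu$ is the mean and $C$ is the covariance matrix. Assuming independent observations, find the $\mu$ and $C$ that maximize the log likelihood:

$$p(X \mid \mu, C) = \prod_{i=1}^{n} p(x_i \mid \mu, C)$$

$$L = \log \prod_{i=1}^{n} p(x_i \mid \mu, C) = -\frac{n}{2} \log |2\pi C| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^\top C^{-1} (x_i - \mu)$$

Maximum likelihood estimate:

$$\frac{\partial L}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{(sample mean)}$$

$$\frac{\partial L}{\partial C} = 0 \;\Rightarrow\; \hat{C} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^\top \quad \text{(sample covariance)}$$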
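As a rough numerical check of these estimates, the sketch below computes the sample mean and the $1/n$ sample covariance and evaluates the log likelihood $L$. The function names and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood mean and covariance; X has one observation per row."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    C_hat = centered.T @ centered / n  # divides by n, not n - 1
    return mu_hat, C_hat

def log_likelihood(X, mu, C):
    """L = -(n/2) log|2 pi C| - (1/2) sum_i (x_i - mu)^T C^{-1} (x_i - mu)."""
    n = X.shape[0]
    centered = X - mu
    _, logdet = np.linalg.slogdet(2.0 * np.pi * C)
    quad = np.einsum("ij,ij->", centered @ np.linalg.inv(C), centered)
    return -0.5 * n * logdet - 0.5 * quad

# Synthetic, correlated data (hypothetical mixing matrix).
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.5]])
X = rng.normal(size=(500, 3)) @ mix

mu_hat, C_hat = gaussian_mle(X)
L_hat = log_likelihood(X, mu_hat, C_hat)
```

`C_hat` matches `np.cov(X, rowvar=False, bias=True)`, and perturbing either `mu_hat` or `C_hat` should only lower the log likelihood.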
The equidensity contours of a non-singular Gaussian are ellipsoids (i.e., linear transformations of hyperspheres). The directions of the principal axes of the ellipsoids are the eigenvectors of the covariance matrix $C$, and the axis lengths are proportional to the square roots of the corresponding eigenvalues.

Let $C = U \Sigma U^\top = (U \Sigma^{1/2})(U \Sigma^{1/2})^\top$ (i.e., the eigendecomposition), where the columns of $U$ form an orthonormal basis and $\Sigma$ is a diagonal matrix of eigenvalues. Then

$$X \sim N(\mu, C) \iff X \sim \mu + U \Sigma^{1/2} N(0, I) \iff X \sim \mu + U N(0, \Sigma)$$

The distribution $N(\mu, C)$ is equivalent to $N(0, I)$ scaled by $\Sigma^{1/2}$, rotated by $U$, and translated by $\mu$.
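The scale, rotate, translate view suggests a simple way to draw samples. The following is a minimal sketch under the assumption that $C$ is symmetric positive semidefinite; the function name and the example $\mu$ and $C$ are hypothetical.

```python
import numpy as np

def sample_gaussian(mu, C, n, rng):
    """Draw n samples from N(mu, C) as mu + U Sigma^{1/2} z with z ~ N(0, I)."""
    # Eigendecomposition C = U diag(evals) U^T for symmetric C.
    evals, U = np.linalg.eigh(C)
    evals = np.clip(evals, 0.0, None)          # remove tiny negative round-off
    z = rng.standard_normal((n, mu.shape[0]))  # each row is a draw from N(0, I)
    # Row form of mu + U Sigma^{1/2} z: scale columns, rotate, translate.
    return mu + (z * np.sqrt(evals)) @ U.T

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
C = np.array([[2.0, 0.8], [0.8, 1.0]])
X = sample_gaussian(mu, C, 200_000, rng)
```

With this many samples, the empirical mean and covariance of `X` should agree with `mu` and `C` to about two decimal places.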
The quantity

$$d_M^2 = (x - \mu)^\top C^{-1} (x - \mu) = \left(C^{-1/2}(x - \mu)\right)^\top \left(C^{-1/2}(x - \mu)\right)$$

is called the (squared) Mahalanobis distance from $x$ to $\mu$.

- Also known as the generalized squared inter-point distance
- It is the distance of a point $x$ from the center of mass divided by the width of the ellipsoid in the direction of $x$
- It amounts to a linear transformation of the coordinate system
- It keeps its quadratic form and remains non-negative
- If $C = I$, the Mahalanobis distance reduces to the Euclidean distance
- If $C$ is diagonal, the resulting distance is the normalized Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{m} \frac{(x_i - y_i)^2}{\sigma_i^2}}$$

  where $\sigma_i$ is the standard deviation of $x_i$
- Can be approximated with the eigenvectors of $C$
- Used in distance metric learning
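A short sketch of the squared Mahalanobis distance and its two special cases follows. The helper name `mahalanobis_sq` and the example vectors are illustrative assumptions; the last lines check the whitened form $C^{-1/2}(x - \mu)$ numerically.

```python
import numpy as np

def mahalanobis_sq(x, mu, C):
    """Squared Mahalanobis distance (x - mu)^T C^{-1} (x - mu)."""
    diff = x - mu
    # Solve C w = diff rather than forming C^{-1} explicitly.
    return float(diff @ np.linalg.solve(C, diff))

x = np.array([3.0, 4.0])
mu = np.zeros(2)

# C = I: reduces to the squared Euclidean distance, 3^2 + 4^2 = 25.
d_identity = mahalanobis_sq(x, mu, np.eye(2))

# Diagonal C: squared normalized Euclidean distance, (3/3)^2 + (4/2)^2 = 5.
sigma = np.array([3.0, 2.0])
d_diag = mahalanobis_sq(x, mu, np.diag(sigma ** 2))

# General C: the same value is obtained after whitening with C^{-1/2}.
C = np.array([[2.0, 0.8], [0.8, 1.0]])
evals, U = np.linalg.eigh(C)
C_inv_sqrt = U @ np.diag(1.0 / np.sqrt(evals)) @ U.T
y = C_inv_sqrt @ (x - mu)
d_whitened = float(y @ y)
```

Solving a linear system instead of inverting $C$ is a common numerical choice, not something the slides prescribe.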
Factor analysis is a generative dimensionality reduction algorithm. Let $x \in \mathbb{R}^m$ and $z \in \mathbb{R}^d$ with $d < m$; $x$ is modeled by $z$, whose elements are called factors:

$$x = \Lambda z + \epsilon$$

- $\Lambda$ is the factor loading matrix
- $z$ is assumed to be $N(0, I)$ distributed (zero-mean, unit-variance normals)
- The factors $z$ model the correlation between the elements of $x$
- $\epsilon$ is a random variable that accounts for noise and is assumed to be distributed as $N(0, \Psi)$, where $\Psi$ is a diagonal matrix (whereas probabilistic PCA uses an isotropic error model with $\psi_i = \sigma^2$)
- $\epsilon$ accounts for independent noise in each element of $x$
- The diagonality of $\Psi$ is a key assumption: it constrains the error covariance $\Psi$ so that it can be estimated
- The observed variables $x_i$ are conditionally independent given the factors $z$
- $x$ is $N(0, \Lambda\Lambda^\top + \Psi)$ distributed (whereas probabilistic PCA models $x$ with $N(0, \Lambda\Lambda^\top + \sigma^2 I)$)
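The generative model can be simulated directly to check that the marginal covariance of $x$ is $\Lambda\Lambda^\top + \Psi$. In this sketch the dimensions, the loading matrix `Lam`, and the noise variances `psi` are randomly generated placeholders, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 5, 2, 200_000

Lam = rng.normal(size=(m, d))        # hypothetical factor loading matrix
psi = rng.uniform(0.1, 0.5, size=m)  # hypothetical diagonal of Psi

z = rng.standard_normal((n, d))                   # factors, z ~ N(0, I)
eps = rng.standard_normal((n, m)) * np.sqrt(psi)  # noise, eps ~ N(0, Psi)
X = z @ Lam.T + eps                               # rows are x = Lam z + eps

model_cov = Lam @ Lam.T + np.diag(psi)
sample_cov = np.cov(X, rowvar=False)
max_abs_err = np.max(np.abs(sample_cov - model_cov))
```

The off-diagonal entries of the covariance come entirely from $\Lambda\Lambda^\top$, which is the sense in which the factors model the correlations between the elements of $x$.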
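Factor analysis: $x = \Lambda z + \epsilon$

- Latent variables $z$ explain the correlations between the elements of $x$
- $\epsilon_i$ represents variability unique to a particular $x_i$
- This differs from PCA, which treats covariance and variance identically

We want to infer $\Lambda$ and $\Psi$ from $x$. Suppose $\Lambda$ and $\Psi$ are known; then by linear projection $E[z \mid x] = \beta x$, where $\beta = \Lambda^\top (\Psi + \Lambda\Lambda^\top)^{-1}$, since the data $x$ and the factors $z$ are jointly Gaussian:

$$p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda^\top + \Psi & \Lambda \\ \Lambda^\top & I \end{bmatrix} \right)$$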
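Expectation-Maximization (EM) is a useful technique for dealing with missing data:

- Start with some initial guess of the parameters and evaluate the expected values of the missing data
- Optimize the parameters by taking the derivative of the expected likelihood of the observed and missing data with respect to the parameters
- Repeat until the data likelihood does not change

E-step: given $\Lambda$ and $\Psi$, for each data point $x_i$ compute

$$E[z \mid x_i] = \beta x_i$$

$$E[zz^\top \mid x_i] = \mathrm{Var}(z \mid x_i) + E[z \mid x_i]\, E[z \mid x_i]^\top = I - \beta\Lambda + \beta x_i x_i^\top \beta^\top$$

M-step:

$$\Lambda^{\text{new}} = \left( \sum_{i=1}^{n} x_i\, E[z \mid x_i]^\top \right) \left( \sum_{i=1}^{n} E[zz^\top \mid x_i] \right)^{-1}$$

$$\Psi^{\text{new}} = \frac{1}{n}\, \mathrm{diag}\left\{ \sum_{i=1}^{n} x_i x_i^\top - \Lambda^{\text{new}} E[z \mid x_i]\, x_i^\top \right\}$$

where the diag operator sets all off-diagonal elements to zero.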
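The E-step and M-step can be written compactly with matrix products. The sketch below is one possible vectorized reading of these updates, assuming zero-mean data with one observation per row; the function names, the initialization, the fixed iteration count, and the synthetic data are all assumptions of this example.

```python
import numpy as np

def fa_loglik(X, Lam, Psi):
    """Log likelihood of zero-mean data under N(0, Lam Lam^T + diag(Psi))."""
    n = X.shape[0]
    C = Lam @ Lam.T + np.diag(Psi)
    S = X.T @ X / n
    _, logdet = np.linalg.slogdet(2.0 * np.pi * C)
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(C, S)))

def fa_em(X, d, n_iter=200, seed=0):
    """EM for factor analysis; X is (n, m) and assumed zero-mean."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    Lam = rng.normal(scale=0.1, size=(m, d))  # arbitrary small initial loadings
    Psi = np.var(X, axis=0)                   # diagonal of Psi, stored as a vector
    S = X.T @ X / n                           # (1/n) sum_i x_i x_i^T
    for _ in range(n_iter):
        # E-step: beta = Lam^T (Psi + Lam Lam^T)^{-1}
        beta = Lam.T @ np.linalg.inv(np.diag(Psi) + Lam @ Lam.T)
        Ez = X @ beta.T                                     # rows are E[z | x_i]
        sum_Ezz = n * (np.eye(d) - beta @ Lam) + Ez.T @ Ez  # sum_i E[z z^T | x_i]
        # M-step: update loadings, then the diagonal noise variances.
        Lam = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)
        Psi = np.diag(S - Lam @ (Ez.T @ X) / n).copy()
    return Lam, Psi

# Synthetic data from a hypothetical two-factor model.
rng = np.random.default_rng(3)
true_Lam = rng.normal(size=(6, 2))
true_psi = rng.uniform(0.2, 1.0, size=6)
X = (rng.standard_normal((2000, 2)) @ true_Lam.T
     + rng.standard_normal((2000, 6)) * np.sqrt(true_psi))
X = X - X.mean(axis=0)

Lam_hat, Psi_hat = fa_em(X, d=2)
```

In practice one would monitor `fa_loglik` between iterations and stop once it no longer changes, rather than running a fixed number of iterations as this sketch does. The recovered `Lam_hat` is only identified up to an orthogonal rotation, so it should be compared to `true_Lam` through `Lam_hat @ Lam_hat.T`.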
Factor analysis provides a proper probabilistic model. PCA is rotationally invariant; FA is not. Given a set of data points, would $\Lambda$ correspond to an orthonormal basis of a PCA subspace? No, in most cases. However, if FA has an isotropic error model, i.e., $\psi_i = \sigma^2$, the columns of $\Lambda$ span the PCA subspace: they are the leading eigenvectors up to scaling and rotation.
Maximizing the log likelihood, for example with the EM algorithm, gives

$$\Lambda = U (\Sigma - \sigma^2 I)^{1/2} R$$

- $U_{m \times d}$ contains the first $d$ eigenvectors of the sample covariance matrix $S$
- $\Sigma_{d \times d}$ is a diagonal matrix of the corresponding first $d$ eigenvalues $\lambda_i$
- $R_{d \times d}$ is an arbitrary orthogonal rotation matrix (note that $z$ has an isotropic Gaussian distribution)
- The noise variance $\sigma^2$ is the residual variance per dimension

$$\sigma^2 = \frac{1}{m - d} \sum_{i=d+1}^{m} \lambda_i$$
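The closed-form solution can be evaluated directly from an eigendecomposition of the covariance matrix. This is a minimal sketch that takes $R = I$ and assumes $d < m$; the function name and the test covariance, built from a random orthogonal basis, are hypothetical.

```python
import numpy as np

def ppca_closed_form(S, d):
    """Lam = U (Sigma - sigma^2 I)^{1/2} with R = I, from an m x m covariance S."""
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]  # sort eigenvalues in descending order
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[d:].mean()        # average of the m - d discarded eigenvalues
    Lam = evecs[:, :d] * np.sqrt(evals[:d] - sigma2)
    return Lam, sigma2

# Hypothetical covariance with known eigenvalues 5, 3, 1, 0.5.
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
S = Q @ np.diag([5.0, 3.0, 1.0, 0.5]) @ Q.T

Lam, sigma2 = ppca_closed_form(S, d=2)
model_cov = Lam @ Lam.T + sigma2 * np.eye(4)
```

With $d = 2$ the residual variance is $(1 + 0.5)/2 = 0.75$, and the model covariance keeps the two leading eigenvalues while replacing the remaining ones by $\sigma^2$.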
from “A unifying review of linear Gaussian models” by Sam Roweis and Zoubin Ghahramani