Expectation-Maximization (EM) Algorithm: A Comprehensive Guide, Study notes of Algorithms and Programming

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, ...

Typology: Study notes

2022/2023

Uploaded on 05/11/2023

astarloa
astarloa 🇺🇸

4.2

(12)

298 documents

1 / 14

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CS229 Lecture notes
Tengyu Ma and Andrew Ng
May 13, 2019
Part IX
The EM algorithm
In the previous set of notes, we talked about the EM algorithm as applied to
fitting a mixture of Gaussians. In this set of notes, we give a broader view
of the EM algorithm, and show how it can be applied to a large family of
estimation problems with latent variables. We begin our discussion with a
very useful result called Jensen’s inequality
1 Jensen’s inequality
Let fbe a function whose domain is the set of real numbers. Recall that
fis a convex function if f′′(x)0 (for all xR). In the case of ftaking
vector-valued inputs, this is generalized to the condition that its hessian H
is positive semi-definite (H0). If f′′(x)>0 for all x, then we say fis
strictly convex (in the vector-valued case, the corresponding statement is
that Hmust be positive definite, written H > 0). Jensen’s inequality can
then be stated as follows:
Theorem. Let fbe a convex function, and let Xbe a random variable.
Then:
E[f(X)] f(EX).
Moreover, if fis strictly convex, then E[f(X)] = f(EX) holds true if and
only if X= E[X] with probability 1 (i.e., if Xis a constant).
Recall our convention of occasionally dropping the parentheses when writ-
ing expectations, so in the theorem above, f(EX) = f(E[X]).
For an interpretation of the theorem, consider the figure below.
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe

Partial preview of the text

Download Expectation-Maximization (EM) Algorithm: A Comprehensive Guide and more Study notes Algorithms and Programming in PDF only on Docsity!

CS229 Lecture notes

Tengyu Ma and Andrew Ng

May 13, 2019

Part IX

The EM algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. We begin our discussion with a very useful result called Jensen’s inequality

1 Jensen’s inequality

Let f be a function whose domain is the set of real numbers. Recall that f is a convex function if f ′′(x) ≥ 0 (for all x ∈ R). In the case of f taking vector-valued inputs, this is generalized to the condition that its hessian H is positive semi-definite (H ≥ 0). If f ′′(x) > 0 for all x, then we say f is strictly convex (in the vector-valued case, the corresponding statement is that H must be positive definite, written H > 0). Jensen’s inequality can then be stated as follows:

Theorem. Let f be a convex function, and let X be a random variable. Then: E[f (X)] ≥ f (EX).

Moreover, if f is strictly convex, then E[f (X)] = f (EX) holds true if and only if X = E[X] with probability 1 (i.e., if X is a constant).

Recall our convention of occasionally dropping the parentheses when writ- ing expectations, so in the theorem above, f (EX) = f (E[X]). For an interpretation of the theorem, consider the figure below.

a E[X] (^) b

f(a)

f(b) f(EX)

E[f(X)]

f

Here, f is a convex function shown by the solid line. Also, X is a random variable that has a 0.5 chance of taking the value a, and a 0.5 chance of taking the value b (indicated on the x-axis). Thus, the expected value of X is given by the midpoint between a and b. We also see the values f (a), f (b) and f (E[X]) indicated on the y-axis. Moreover, the value E[f (X)] is now the midpoint on the y-axis between f (a) and f (b). From our example, we see that because f is convex, it must be the case that E[f (X)] ≥ f (EX). Incidentally, quite a lot of people have trouble remembering which way the inequality goes, and remembering a picture like this is a good way to quickly figure out the answer. Remark. Recall that f is [strictly] concave if and only if −f is [strictly] convex (i.e., f ′′(x) ≤ 0 or H ≤ 0). Jensen’s inequality also holds for concave functions f , but with the direction of all the inequalities reversed (E[f (X)] ≤ f (EX), etc.).

2 The EM algorithm

Suppose we have an estimation problem in which we have a training set {x(1),... , x(n)} consisting of n independent examples. We have a latent vari- able model p(x, z; θ) with z being the latent variable (which for simplicity is assumed to take finite number of values). The density for x can be obtained by marginalized over the latent variable z:

p(x; θ) =

z

p(x, z; θ) (1)

Let Q be a distribution over the possible values of z. That is,

z Q(z) = 1, Q(z) ≥ 0). Consider the following:^3

log p(x; θ) = log

z

p(x, z; θ)

= log

z

Q(z)

p(x, z; θ) Q(z)

z

Q(z) log

p(x, z; θ) Q(z)

The last step of this derivation used Jensen’s inequality. Specifically, f (x) = log x is a concave function, since f ′′(x) = − 1 /x^2 < 0 over its domain x ∈ R+. Also, the term

z

Q(z)

[

p(x, z; θ) Q(z)

]

in the summation is just an expectation of the quantity [p(x, z; θ)/Q(z)] with respect to z drawn according to the distribution given by Q.^4 By Jensen’s inequality, we have

f

Ez∼Q

[

p(x, z; θ) Q(z)

])

≥ Ez∼Q

[

f

p(x, z; θ) Q(z)

)]

where the “z ∼ Q” subscripts above indicate that the expectations are with respect to z drawn from Q. This allowed us to go from Equation (6) to Equation (7). Now, for any distribution Q, the formula (7) gives a lower-bound on log p(x; θ). There are many possible choices for the Q’s. Which should we choose? Well, if we have some current guess θ of the parameters, it seems natural to try to make the lower-bound tight at that value of θ. I.e., we will make the inequality above hold with equality at our particular value of θ. To make the bound tight for a particular value of θ, we need for the step involving Jensen’s inequality in our derivation above to hold with equality.

(^3) If z were continuous, then Q would be a density, and the summations over z in our

discussion are replaced with integrals over z. (^4) We note that the notion p(x,z;θ) Q(z) only makes sense if^ Q(z)^6 = 0 whenever^ p(x, z;^ θ)^6 = 0. Here we implicitly assume that we only consider those Q with such a property.

For this to be true, we know it is sufficient that the expectation be taken over a “constant”-valued random variable. I.e., we require that

p(x, z; θ) Q(z)

= c

for some constant c that does not depend on z. This is easily accomplished by choosing Q(z) ∝ p(x, z; θ).

Actually, since we know

z Q(z) = 1 (because it is a distribution), this further tells us that

Q(z) =

p(x, z; θ) ∑ z p(x, z;^ θ) =

p(x, z; θ) p(x; θ) = p(z|x; θ) (8)

Thus, we simply set the Q’s to be the posterior distribution of the z’s given x and the setting of the parameters θ. Indeed, we can directly verify that when Q(z) = p(z|x; θ), then equa- tion (7) is an equality because

z

Q(z) log

p(x, z; θ) Q(z)

z

p(z|x; θ) log

p(x, z; θ) p(z|x; θ)

=

z

p(z|x; θ) log

p(z|x; θ)p(x; θ) p(z|x; θ)

=

z

p(z|x; θ) log p(x; θ)

= log p(x; θ)

z

p(z|x; θ)

= log p(x; θ) (because

z p(z|x;^ θ) = 1)

For convenience, we call the expression in Equation (7) the evidence lower bound (ELBO) and we denote it by

ELBO(x; Q, θ) =

z

Q(z) log

p(x, z; θ) Q(z)

Repeat until convergence {

(E-step) For each i, set

Qi(z(i)) := p(z(i)|x(i); θ).

(M-step) Set

θ := arg max θ

∑n

i=

ELBO(x(i); Qi, θ)

= arg max θ

i

z(i)

Qi(z(i)) log

p(x(i), z(i); θ) Qi(z(i))

How do we know if this algorithm will converge? Well, suppose θ(t)^ and θ(t+1)^ are the parameters from two successive iterations of EM. We will now prove that ℓ(θ(t)) ≤ ℓ(θ(t+1)), which shows EM always monotonically im- proves the log-likelihood. The key to showing this result lies in our choice of the Qi’s. Specifically, on the iteration of EM in which the parameters had started out as θ(t), we would have chosen Q( it )(z(i)) := p(z(i)|x(i); θ(t)). We saw earlier that this choice ensures that Jensen’s inequality, as applied to get Equation (11), holds with equality, and hence

ℓ(θ(t)) =

∑^ n

i=

ELBO(x(i); Q( it ), θ(t)) (13)

The parameters θ(t+1)^ are then obtained by maximizing the right hand side of the equation above. Thus,

ℓ(θ(t+1)) ≥

∑^ n

i=

ELBO(x(i); Q( i t), θ(t+1))

(because ineqaulity (11) holds for all Q and θ)

∑^ n

i=

ELBO(x(i); Q( i t), θ(t)) (see reason below)

= ℓ(θ(t)) (by equation (13))

where the last inequality follows from that θ(t+1)^ is chosen explicitly to be

arg max θ

∑n

i=

ELBO(x(i); Q( it ), θ)

Hence, EM causes the likelihood to converge monotonically. In our de- scription of the EM algorithm, we said we’d run it until convergence. Given the result that we just showed, one reasonable convergence test would be to check if the increase in ℓ(θ) between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving ℓ(θ) too slowly.

Remark. If we define (by overloading ELBO(·))

ELBO(Q, θ) =

∑^ n

i=

ELBO(x(i); Qi, θ) =

i

z(i)

Qi(z(i)) log

p(x(i), z(i); θ) Qi(z(i)) (14)

then we know ℓ(θ) ≥ ELBO(Q, θ) from our previous derivation. The EM can also be viewed an alternating maximization algorithm on ELBO(Q, θ), in which the E-step maximizes it with respect to Q (check this yourself), and the M-step maximizes it with respect to θ.

2.1 Other interpretation of ELBO

Let ELBO(x; Q, θ) =

z Q(z) log^

p(x,z;θ) Q(z) be defined as in equation (9). There are several other forms of ELBO. First, we can rewrite

ELBO(x; Q, θ) = Ez∼Q[log p(x, z; θ)] − Ez∼Q[log Q(z)] = Ez∼Q[log p(x|z; θ)] − DKL(Q‖pz ) (15)

where we use pz to denote the marginal distribution of z (under the distri- bution p(x, z; θ)), and DKL() denotes the KL divergence

DKL(Q‖pz ) =

z

Q(z) log

Q(z) p(z)

In many cases, the marginal distribution of z does not depend on the param- eter θ. In this case, we can see that maximizing ELBO over θ is equivalent to maximizing the first term in (15). This corresponds to maximizing the conditional likelihood of x conditioned on z, which is often a simpler question than the original question. Another form of ELBO(·) is (please verify yourself) ELBO(x; Q, θ) = log p(x) − DKL(Q‖pz|x) (17)

where pz|x is the conditional distribution of z given x under the parameter θ. This forms shows that the maximizer of ELBO(Q, θ) over Q is obtained when Q = pz|x, which was shown in equation (8) before.

which was what we had in the previous set of notes. Let’s do one more example, and derive the M-step update for the param- eters φj. Grouping together only the terms that depend on φj , we find that we need to maximize ∑n

i=

∑^ k

j=

w( ji )log φj.

However, there is an additional constraint that the φj ’s sum to 1, since they represent the probabilities φj = p(z(i)^ = j; φ). To deal with the constraint that

∑k j=1 φj^ = 1, we construct the Lagrangian

L(φ) =

∑^ n

i=

∑^ k

j=

w( ji )log φj + β(

∑^ k

j=

φj − 1),

where β is the Lagrange multiplier.^5 Taking derivatives, we find

∂ ∂φj

L(φ) =

∑^ n

i=

w j(i) φj

  • β

Setting this to zero and solving, we get

φj =

∑n i=1 w

(i) j −β

I.e., φj ∝

∑n i=1 w

(i) j.^ Using the constraint that^

j φj^ = 1, we easily find that −β =

∑n i=

∑k j=1 w

(i) j =^

∑n i=1 1 =^ n.^ (This used the fact that^ w

(i) j = Qi(z(i)^ = j), and since probabilities sum to 1,

j w

(i) j = 1.)^ We therefore have our M-step updates for the parameters φj :

φj :=

n

∑^ n

i=

w (i) j.

The derivation for the M-step updates to Σj are also entirely straightfor- ward.

(^5) We don’t need to worry about the constraint that φj ≥ 0, because as we’ll shortly see, the solution we’ll find from this derivation will automatically satisfy that anyway.

4 Variational inference and variational auto-

encoder

Loosely speaking, variational auto-encoder [2] generally refers to a family of algorithms that extend the EM algorithms to more complex models parame- terized by neural networks. It extends the technique of variational inference with the additional “re-parametrization trick” which will be introduced be- low. Variational auto-encoder may not give the best performance for many datasets, but it contains several central ideas about how to extend EM algo- rithms to high-dimensional continuous latent variables with non-linear mod- els. Understanding it will likely give you the language and backgrounds to understand various recent papers related to it. As a running example, we will consider the following parameterization of p(x, z; θ) by a neural network. Let θ be the collection of the weights of a neural network g(z; θ) that maps z ∈ Rk^ to Rd. Let

z ∼ N (0, Ik×k) (18) x|z ∼ N (g(z; θ), σ^2 Id×d) (19)

Here Ik×k denotes identity matrix of dimension k by k, and σ is a scalar that we assume to be known for simplicity. For the Gaussian mixture models in Section 3, the optimal choice of Q(z) = p(z|x; θ) for each fixed θ, that is the posterior distribution of z, can be analytically computed. In many more complex models such as the model (19), it’s intractable to compute the exact the posterior distribution p(z|x; θ). Recall that from equation (10), ELBO is always a lower bound for any choice of Q, and therefore, we can also aim for finding an approximation of the true posterior distribution. Often, one has to use some particular form to approximate the true posterior distribution. Let Q be a family of Q’s that we are considering, and we will aim to find a Q within the family of Q that is closest to the true posterior distribution. To formalize, recall the definition of the ELBO lower bound as a function of Q and θ defined in equation (14)

ELBO(Q, θ) =

∑^ n

i=

ELBO(x(i); Qi, θ) =

i

z(i)

Qi(z(i)) log

p(x(i), z(i); θ) Qi(z(i))

Recall that EM can be viewed as alternating maximization of ELBO(Q, θ). Here instead, we optimize the EBLO over Q ∈ Q

max Q∈Q max θ ELBO(Q, θ) (20)

rewrite the ELBO as a function of φ, ψ, θ by

ELBO(φ, ψ, θ) =

∑^ n

i=

Ez(i)∼Qi

[

log

p(x(i), z(i); θ) Qi(z(i))

]

where Qi = N (q(x(i); φ), diag(v(x(i); ψ))^2 )

Note that to evaluate Qi(z(i)) inside the expectation, we should be able to compute the density of Qi. To estimate the expectation Ez(i)∼Qi , we should be able to sample from distribution Qi so that we can build an empirical estimator with samples. It happens that for Gaussian distribution Qi = N (q(x(i); φ), diag(v(x(i); ψ))^2 ), we are able to be both efficiently. Now let’s optimize the ELBO. It turns out that we can run gradient ascent over φ, ψ, θ instead of alternating maximization. There is no strong need to compute the maximum over each variable at a much greater cost. (For Gaussian mixture model in Section 3, computing the maximum is analytically feasible and relatively cheap, and therefore we did alternating maximization.) Mathematically, let η be the learning rate, the gradient ascent step is

θ := θ + η∇θELBO(φ, ψ, θ) φ := φ + η∇φELBO(φ, ψ, θ) ψ := ψ + η∇ψELBO(φ, ψ, θ)

Computing the gradient over θ is simple because

∇θELBO(φ, ψ, θ) = ∇θ

∑n

i=

Ez(i)∼Qi

[

log p(x(i), z(i); θ) Qi(z(i))

]

= ∇θ

∑n

i=

Ez(i)∼Qi

[

log p(x(i), z(i); θ)

]

∑^ n

i=

Ez(i)∼Qi

[

∇θ log p(x(i), z(i); θ)

]

But computing the gradient over φ and ψ is tricky because the sampling distribution Qi depends on φ and ψ. (Abstractly speaking, the issue we face can be simplified as the problem of computing the gradient Ez∼Qφ [f (φ)] with respect to variable φ. We know that in general, ∇Ez∼Qφ [f (φ)] 6 = Ez∼Qφ [∇f (φ)] because the dependency of Qφ on φ has to be taken into ac- count as well. ) The idea that comes to rescue is the so-called re-parameterization trick: we rewrite z(i)^ ∼ Qi = N (q(x(i); φ), diag(v(x(i); ψ))^2 ) in an equivalent

way:

z(i)^ = q(x(i); φ) + v(x(i); ψ) ⊙ ξ(i)^ where ξ(i)^ ∼ N (0, Ik×k) (24) Here x ⊙ y denotes the entry-wise product of two vectors of the same dimension. Here we used the fact that x ∼ N (μ, σ^2 ) is equivalent to that x = μ+ξσ with ξ ∼ N (0, 1). We mostly just used this fact in every dimension simultaneously for the random variable z(i)^ ∼ Qi. With this re-parameterization, we have that

Ez(i)∼Qi

[

log

p(x(i), z(i); θ) Qi(z(i))

]

= Eξ(i)∼N (0,1)

[

log

p(x(i), q(x(i); φ) + v(x(i); ψ) ⊙ ξ(i); θ) Qi(q(x(i); φ) + v(x(i); ψ) ⊙ ξ(i))

]

It follows that

∇φEz(i)∼Qi

[

log

p(x(i), z(i); θ) Qi(z(i))

]

= ∇φEξ(i)∼N (0,1)

[

log

p(x(i), q(x(i); φ) + v(x(i); ψ) ⊙ ξ(i); θ) Qi(q(x(i); φ) + v(x(i); ψ) ⊙ ξ(i))

]

= Eξ(i)∼N (0,1)

[

∇φ log

p(x(i), q(x(i); φ) + v(x(i); ψ) ⊙ ξ(i); θ) Qi(q(x(i); φ) + v(x(i); ψ) ⊙ ξ(i))

]

We can now sample multiple copies of ξ(i)’s to estimate the the expecta- tion in the RHS of the equation above.^7 We can estimate the gradient with respect to ψ similarly, and with these, we can implement the gradient ascent algorithm to optimize the ELBO over φ, ψ, θ. There are not many high-dimensional distributions with analytically com- putable density function are known to be re-parameterizable. We refer to [2] for a few other choices that can replace Gaussian distribution.

References

[1] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational in- ference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[2] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. (^7) Empirically people sometimes just use one sample to estimate it for maximum com-

putational efficiency.