19. Mixture Models, Latent Variables and the EM Algorithm

36-402, Advanced Data Analysis

31 March 2011

Contents

1 Two Routes to Mixture Models
  1.1 From Factor Analysis to Mixture Models
  1.2 From Kernel Density Estimates to Mixture Models
  1.3 Mixture Models
  1.4 Geometry
  1.5 Identifiability
  1.6 Probabilistic Clustering

2 Estimating Parametric Mixture Models
  2.1 More about the EM Algorithm
  2.2 Further Reading on and Applications of EM
  2.3 Topic Models and Probabilistic LSA

3 Non-parametric Mixture Modeling

4 R

5 Exercises

1 Two Routes to Mixture Models

1.1 From Factor Analysis to Mixture Models

In factor analysis, the origin myth is that we have a fairly small number, q, of real variables which happen to be unobserved (“latent”), and that the much larger number, p, of variables we do observe arise as linear combinations of these factors, plus noise. The mythology is that it’s possible for us (or for Someone) to continuously adjust the latent variables, and the distribution of observables likewise changes continuously. What if the latent variables are not continuous but ordinal, or even categorical? The natural idea would be that each value of the latent variable would give a different distribution of the observables.

1.2 From Kernel Density Estimates to Mixture Models

We have also previously looked at kernel density estimation, where we approximate the true distribution by sticking a small (weight 1/n) copy of a kernel pdf at each observed data point and adding them up. With enough data, this comes arbitrarily close to any (reasonable) probability density, but it does have some drawbacks. Statistically, it labors under the curse of dimensionality. Computationally, we have to remember all of the data points, which is a lot. We saw similar problems when we looked at fully non-parametric regression, and then saw that both could be ameliorated by using things like additive models, which impose more constraints than, say, unrestricted kernel smoothing. Can we do something like that with density estimation?

Additive modeling for densities is not as common as it is for regression (it’s harder to think of times when it would be natural and well-defined^1), but we can do things to restrict density estimation. For instance, instead of putting a copy of the kernel at every point, we might pick a small number K ≪ n of points, which we feel are somehow typical or representative of the data, and put a copy of the kernel at each one (with weight 1/K). This uses less memory, but it ignores the other data points, and lots of them are probably very similar to those points we’re taking as prototypes. The differences between prototypes and many of their neighbors are just matters of chance or noise. Rather than remembering all of those noisy details, why not collapse those data points, and just remember their common distribution? Different regions of the data space will have different shared distributions, but we can just combine them.
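To make the comparison concrete, here is a minimal R sketch that writes a kernel density estimate explicitly as an equal-weight mixture, once with a kernel at every data point and once with a kernel at only K prototypes. The Gaussian kernel, the bandwidth of 0.5, the simulated data, and the choice of prototypes by random sampling are all assumptions made for illustration; the function name `kernel_mixture` is hypothetical.

```r
# Gaussian kernel density estimate written explicitly as a mixture:
# one kernel copy per centre, each with equal weight 1/length(centres).
kernel_mixture <- function(x, centres, bandwidth) {
  sapply(x, function(x0) mean(dnorm(x0, mean = centres, sd = bandwidth)))
}

set.seed(1)
obs <- c(rnorm(150, mean = -2), rnorm(150, mean = 3))  # simulated data (assumed)
grid <- seq(-8, 9, length.out = 400)

# Full KDE: n = 300 kernel copies, weight 1/n each
full_kde <- kernel_mixture(grid, centres = obs, bandwidth = 0.5)

# Restricted version: K = 10 randomly chosen prototypes, weight 1/K each
prototypes <- sample(obs, 10)
proto_kde <- kernel_mixture(grid, centres = prototypes, bandwidth = 0.5)

# Both are proper densities: a Riemann sum over the grid is close to 1
step <- grid[2] - grid[1]
c(full = sum(full_kde) * step, prototypes = sum(proto_kde) * step)
```

The prototype version stores 10 numbers instead of 300, at the price of ignoring how many observations sit near each prototype and how spread out they are. Those are exactly the details that per-component weights and per-component distributions can put back.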

1.3 Mixture Models

More formally, we say that a distribution $f$ is a mixture of $K$ component distributions $f_1, f_2, \ldots f_K$ if

$$ f(x) = \sum_{k=1}^{K} \lambda_k f_k(x) \tag{1} $$

with the $\lambda_k$ being the mixing weights, $\lambda_k > 0$, $\sum_k \lambda_k = 1$. Eq. 1 is a complete stochastic model, so it gives us a recipe for generating new data points: first pick a distribution, with probabilities given by the mixing weights, and then generate one observation according to that distribution. Symbolically,

$$ Z \sim \mathrm{Mult}(\lambda_1, \lambda_2, \ldots \lambda_K) \tag{2} $$
$$ X | Z \sim f_Z \tag{3} $$

where I’ve introduced the discrete random variable $Z$ which says which component $X$ is drawn from.

^1 Remember that the integral of a probability density over all space must be 1, while the integral of a regression function doesn’t have to be anything in particular. If we had an additive density, $f(x) = \sum_j f_j(x_j)$, ensuring normalization is going to be very tricky; we’d need $\sum_j \int f_j(x_j) \, dx_1 dx_2 \ldots dx_p = 1$. It would be easier to ensure normalization while making the log-density additive, but that assumes the features are independent of each other.
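The two-stage recipe of Eqs. 2 and 3 translates almost directly into R. The sketch below assumes Gaussian components purely for illustration; the function names `rmixture` and `dmixture` and all parameter values are hypothetical.

```r
# Draw n points by the two-stage recipe: Z ~ Mult(lambda), then X | Z ~ f_Z.
# Gaussian components are an assumption made for illustration.
rmixture <- function(n, lambda, means, sds) {
  z <- sample(seq_along(lambda), size = n, replace = TRUE, prob = lambda)
  x <- rnorm(n, mean = means[z], sd = sds[z])
  list(x = x, z = z)
}

# Mixture density of Eq. 1: f(x) = sum_k lambda_k f_k(x)
dmixture <- function(x, lambda, means, sds) {
  dens <- numeric(length(x))
  for (k in seq_along(lambda)) {
    dens <- dens + lambda[k] * dnorm(x, mean = means[k], sd = sds[k])
  }
  dens
}

set.seed(42)
lambda <- c(0.5, 0.3, 0.2)
draws <- rmixture(10000, lambda, means = c(-3, 0, 4), sds = c(1, 0.5, 2))

# Empirical component frequencies should be close to the mixing weights
table(draws$z) / length(draws$z)
```

In real data only `draws$x` would be observed; the labels `draws$z` are the latent variable.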

this mixture distribution will hardly ever be exactly the same as the factor model’s distribution — mixtures of Gaussians aren’t Gaussian, the mixture will usually (but not always) be multimodal while the factor distribution is always unimodal — but it will have the same geometry, the same mean and the same covariances, so we will have to look beyond those to tell them apart. Which, frankly, people hardly ever do.

1.5 Identifiability

Before we set about trying to estimate our probability models, we need to make sure that they are identifiable — that if we have distinct representations of the model, they make distinct observational claims. It is easy to let there be too many parameters, or the wrong choice of parameters, and lose identifiability. If there are distinct representations which are observationally equivalent, we either need to change our model, change our representation, or fix on a unique representation by some convention.

  • With additive regression, $E[Y | X = x] = \alpha + \sum_j f_j(x_j)$, we can add arbitrary constants so long as they cancel out. That is, we get the same predictions from $\alpha + c_0 + \sum_j (f_j(x_j) + c_j)$ when $c_0 = -\sum_j c_j$. This is another model of the same form, $\alpha' + \sum_j f'_j(x_j)$, so it’s not identifiable. We dealt with this by imposing the convention that $\alpha = E[Y]$ and $E[f_j(X_j)] = 0$: we picked out a favorite, convenient representation from the infinite collection of equivalent representations.

  • Linear regression becomes unidentifiable with collinear features. Collinearity is a good reason to not use linear regression (i.e., we change the model).
  • Factor analysis is unidentifiable because of the rotation problem. Some people respond by trying to fix on a particular representation, others just ignore it.

Two kinds of identification problems are common for mixture models; one is trivial and the other is fundamental. The trivial one is that we can always swap the labels of any two components with no effect on anything observable at all: if we decide that component number 1 is now component number 7 and vice versa, that doesn’t change the distribution of X at all. This label degeneracy can be annoying, especially for some estimation algorithms, but that’s the worst of it.

A more fundamental lack of identifiability happens when mixing two distributions from a parametric family just gives us a third distribution from the same family. For example, suppose we have a single binary feature, say an indicator for whether someone will pay back a credit card. We might think there are two kinds of customers, with high and low risk of not paying, and try to represent this as a mixture of Bernoulli distributions (binomials with a single trial). If we try this, we’ll see that we’ve gotten a single Bernoulli distribution with an intermediate risk of not paying. A mixture of Bernoullis is always just another Bernoulli. In fact, a mixture of single-trial multinomials (categorical distributions) is always just another distribution of the same kind.
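Both identification problems can be checked numerically for the credit-card example. The sketch below uses made-up mixing weights and repayment probabilities. It treats each customer as a single binary trial, which is the setting in which the collapse happens; the final lines suggest that it does not carry over to counts from several trials.

```r
# Two hypothetical customer types with different probabilities of repaying
lambda <- c(0.3, 0.7)      # mixing weights (assumed values)
p_repay <- c(0.95, 0.60)   # per-type repayment probabilities (assumed values)

# Mixture pmf over the single binary outcome x in {0, 1}
mix_pmf <- sapply(0:1, function(x) sum(lambda * dbinom(x, size = 1, prob = p_repay)))

# A single one-trial distribution with the weighted-average probability
p_bar <- sum(lambda * p_repay)
single_pmf <- dbinom(0:1, size = 1, prob = p_bar)
all.equal(mix_pmf, single_pmf)    # the two pmfs coincide

# Label swapping: permuting the components leaves the distribution unchanged
swapped_pmf <- sapply(0:1, function(x) {
  sum(rev(lambda) * dbinom(x, size = 1, prob = rev(p_repay)))
})
all.equal(mix_pmf, swapped_pmf)

# With several trials per customer (here 5), the mixture is no longer
# a single binomial with the averaged probability
mix_pmf5 <- sapply(0:5, function(x) sum(lambda * dbinom(x, size = 5, prob = p_repay)))
isTRUE(all.equal(mix_pmf5, dbinom(0:5, size = 5, prob = p_bar)))
```

Since the observable distribution depends on the parameters only through `p_bar`, no amount of single-trial data can separate the two customer types.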

1.6 Probabilistic Clustering

Yet another way to view mixture models, which I hinted at when I talked about how they are a way of putting similar data points together, is as a way of grouping data into “clusters”, where clusters are represented by, precisely, the component distributions. The idea is that all data points of the same type, belonging to the same cluster, are more or less equivalent and all come from the same distribution, and any differences between them are matters of chance. This view exactly corresponds to mixture models like Eq. 1; the hidden variable Z I introduced above is just the cluster label.

One of the very nice things about probabilistic clustering is that Eq. 1 actually claims something about what the data looks like; it says that it follows a certain distribution. We can check whether it does, and we can check whether new data follows this distribution. If it does, great; if not, if the predictions systematically fail, then the model is wrong. We can compare different probabilistic clusterings by how well they predict (say under cross-validation).^3

In particular, probabilistic clustering gives us a sensible way of answering the question “how many clusters?” The best number of clusters to use is the number which will best generalize to future data. If we don’t want to wait around to get new data, we can approximate generalization performance by cross-validation, or by any other adaptive model selection procedure.

2 Estimating Parametric Mixture Models

From intro stats., we remember that it’s generally a good idea to estimate distributions using maximum likelihood, when we can. How could we do that here? Remember that the likelihood is the probability (or probability density) of observing our data, as a function of the parameters. Assuming independent samples, that would be

$$ \prod_{i=1}^{n} f(x_i; \theta) \tag{5} $$

for observations $x_1, x_2, \ldots x_n$. As always, we’ll use the logarithm to turn multiplication into addition:

$$ \ell(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta) \tag{6} $$
$$ = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_k f(x_i; \theta_k) \tag{7} $$
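Eq. 7 is straightforward to evaluate even though it is awkward to maximize directly. A minimal R sketch follows, assuming Gaussian components so that each θ_k is a (mean, standard deviation) pair; the function name `mixture_loglik` and the simulated data are hypothetical.

```r
# Log-likelihood of Eq. 7 for a Gaussian mixture; theta_k = (means[k], sds[k]).
# Gaussian components are an assumption made for illustration.
mixture_loglik <- function(x, lambda, means, sds) {
  # n x K matrix with entries lambda_k * f(x_i; theta_k)
  weighted <- sapply(seq_along(lambda), function(k) {
    lambda[k] * dnorm(x, mean = means[k], sd = sds[k])
  })
  weighted <- matrix(weighted, nrow = length(x))
  # Inner sum over components, then log, then outer sum over observations
  sum(log(rowSums(weighted)))
}

set.seed(7)
x <- c(rnorm(200, mean = -2, sd = 1), rnorm(100, mean = 3, sd = 1))

ll_true <- mixture_loglik(x, lambda = c(2/3, 1/3), means = c(-2, 3), sds = c(1, 1))
ll_bad <- mixture_loglik(x, lambda = c(0.5, 0.5), means = c(0, 1), sds = c(1, 1))
c(generating_parameters = ll_true, poor_guess = ll_bad)
```

The logarithm sits outside the sum over components, so the parameters of different components do not separate. For observations far from every component the inner sum can underflow to zero; a more careful implementation would work with log-densities throughout.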

^3 Contrast this with k-means or hierarchical clustering, which you may have seen in other classes: they make no predictions, and so we have no way of telling if they are right or wrong. Consequently, comparing different non-probabilistic clusterings is a lot harder!

  1. Start with guesses about the mixture components $\theta_1, \theta_2, \ldots \theta_K$ and the mixing weights $\lambda_1, \ldots \lambda_K$.
  2. Until nothing changes very much:
     (a) Using the current parameter guesses, calculate the weights $w_{ij}$ (E-step)
     (b) Using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step)
  3. Return the final parameter estimates (including mixing proportions) and cluster probabilities
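The steps above can be sketched in R for the simplest interesting case, a one-dimensional Gaussian mixture. This is an illustration under stated assumptions, not a reference implementation: the function name `em_gaussian_mixture` is hypothetical, the weights are taken to be the usual posterior probabilities λ_k f(x_i; θ_k) / Σ_l λ_l f(x_i; θ_l), and "nothing changes very much" is read as "the log-likelihood improves by less than `tol`".

```r
# Minimal EM for a one-dimensional Gaussian mixture (illustrative sketch)
em_gaussian_mixture <- function(x, lambda, means, sds, tol = 1e-8, max_iter = 500) {
  n <- length(x)
  K <- length(lambda)
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: w[i, k] = probability that x[i] came from component k,
    # under the current parameter guesses
    weighted <- matrix(0, n, K)
    for (k in seq_len(K)) weighted[, k] <- lambda[k] * dnorm(x, means[k], sds[k])
    row_totals <- rowSums(weighted)
    w <- weighted / row_totals
    loglik <- sum(log(row_totals))  # log-likelihood at the current guesses

    # M-step: weighted maximum likelihood for each component
    n_k <- colSums(w)
    lambda <- n_k / n
    means <- colSums(w * x) / n_k
    sds <- sqrt(colSums(w * outer(x, means, "-")^2) / n_k)

    # Stop once the log-likelihood has stopped improving
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik
  }
  # weights and loglik refer to the parameters at the start of the last iteration
  list(lambda = lambda, means = means, sds = sds, weights = w,
       loglik = loglik, iterations = iter)
}

set.seed(3)
x <- c(rnorm(300, mean = -2, sd = 1), rnorm(200, mean = 3, sd = 0.7))
fit <- em_gaussian_mixture(x, lambda = c(0.5, 0.5), means = c(-1, 1), sds = c(1, 1))
fit[c("lambda", "means", "sds", "iterations")]
```

Each row of `fit$weights` gives the cluster probabilities for one data point. The sketch has no safeguards against a component collapsing onto a single point or against poor starting values, both of which matter in practice.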

The M in “M-step” and “EM” stands for “maximization”, which is pretty transparent. The E stands for “expectation”, because it gives us the conditional probabilities of different values of Z, and probabilities are expectations of indicator functions. (In fact in some early applications, Z was binary, so one really was computing the expectation of Z.) The whole thing is also called the “expectation-maximization” algorithm.

2.1 More about the EM Algorithm

The EM algorithm turns out to be a general way of maximizing the likelihood when some variables are unobserved, and hence useful for other things besides mixture models. So in this section, where I try to explain why it works, I am going to be a bit more general and abstract. (Also, it will actually cut down on notation.) I’ll pack the whole sequence of observations $x_1, x_2, \ldots x_n$ into a single variable $d$ (for “data”), and likewise the whole sequence of $z_1, z_2, \ldots z_n$ into $h$ (for “hidden”). What we want to do is maximize

$$ \ell(\theta) = \log p(d; \theta) = \log \sum_h p(d, h; \theta) \tag{14} $$

This is generally hard, because even if p(d, h; θ) has a nice parametric form, that is lost when we sum up over all possible values of h (as we saw above). The essential trick of the EM algorithm is to maximize not the log likelihood, but a lower bound on the log-likelihood, which is more tractable; we’ll see that this lower bound is sometimes tight, i.e., coincides with the actual log-likelihood, and in particular does so at the global optimum.

    curve(log(x), from=0.4, to=2.1)
    segments(0.5, log(0.5), 2, log(2), lty=2)

Figure 1: The logarithm is a concave function, i.e., the curve connecting any two points lies above the straight line doing so. Thus the average of logarithms is less than the logarithm of the average.

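We can introduce an arbitrary^5 distribution on $h$, call it $q(h)$, and we’ll multiply and divide by it inside the sum:

$$ \ell(\theta) = \log \sum_h p(d, h; \theta) \tag{15} $$
$$ = \log \sum_h \frac{q(h)}{q(h)} p(d, h; \theta) \tag{16} $$
$$ = \log \sum_h q(h) \frac{p(d, h; \theta)}{q(h)} \tag{17} $$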

So far so trivial. Now we need a geometric fact about the logarithm function, which is that its curve is concave: if we take any two points on the curve and connect them by a straight line, the curve lies above the line (Figure 1). Algebraically, this means that

$$ w \log t_1 + (1 - w) \log t_2 \le \log\left(w t_1 + (1 - w) t_2\right) \tag{18} $$

for any $0 \le w \le 1$, and any points $t_1, t_2 > 0$. Nor does this just hold for two points: for any $r$ points $t_1, t_2, \ldots t_r > 0$, and any set of non-negative weights

^5 Well, almost arbitrary; it shouldn’t give probability zero to a value of h which has positive probability for all θ.
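The concavity inequality is easy to spot-check numerically. The sketch below draws r = 5 arbitrary positive points and non-negative weights, and assumes the weights are normalized to sum to one, as w and 1 − w do in Eq. 18; the specific numbers are arbitrary.

```r
set.seed(11)
r <- 5
t_points <- runif(r, min = 0.1, max = 10)  # arbitrary positive points
w <- runif(r)
w <- w / sum(w)                            # non-negative weights, normalized to sum to 1

avg_of_logs <- sum(w * log(t_points))      # weighted average of the logarithms
log_of_avg <- log(sum(w * t_points))       # logarithm of the weighted average
c(avg_of_logs = avg_of_logs, log_of_avg = log_of_avg)
avg_of_logs <= log_of_avg                  # holds because log is concave
```

The two sides agree only when all the points that receive positive weight are equal; the more spread out the points, the larger the gap.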

We saw above that the maximization in the E step is just computing the posterior probability p(h|d; θ). What about the maximization in the M step?

$$ \sum_h q(h) \log \frac{p(d, h; \theta)}{q(h)} = \sum_h q(h) \log p(d, h; \theta) - \sum_h q(h) \log q(h) \tag{27} $$

The second sum doesn’t depend on θ at all, so it’s irrelevant for maximizing, giving us back the optimization problem from the last section. This confirms that using the lower bound from Jensen’s inequality hasn’t yielded a different algorithm!
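A tiny numerical example may help here. With a single observation d and a hidden label h taking two values, the lower bound can be computed for any q, compared with the actual log-likelihood, and split as in Eq. 27. The two-component Gaussian setup and every number below are assumptions chosen for illustration.

```r
# One observation d and a hidden label h in {1, 2}; all numbers are assumed
d <- 0.8
lambda <- c(0.4, 0.6)
mu <- c(-1, 2)
joint <- lambda * dnorm(d, mean = mu, sd = 1)   # p(d, h; theta) for h = 1, 2
loglik <- log(sum(joint))                        # log p(d; theta)

# Lower bound: sum_h q(h) log( p(d, h; theta) / q(h) )
lower_bound <- function(q) sum(q * log(joint / q))

# Eq. 27: the bound splits into a theta-dependent term and a q-only term
q_any <- c(0.5, 0.5)
theta_term <- sum(q_any * log(joint))      # sum_h q(h) log p(d, h; theta)
q_only_term <- -sum(q_any * log(q_any))    # -sum_h q(h) log q(h), free of theta
all.equal(lower_bound(q_any), theta_term + q_only_term)

# The bound never exceeds the log-likelihood, and is tight at the posterior
posterior <- joint / sum(joint)                  # p(h | d; theta)
c(bound_at_q_any = lower_bound(q_any),
  bound_at_posterior = lower_bound(posterior),
  loglik = loglik)
```

Setting q to the posterior makes the bound touch the log-likelihood, and for fixed q only `theta_term` changes with the parameters.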

2.2 Further Reading on and Applications of EM

My presentation of the EM algorithm draws heavily on Neal and Hinton (1998).

Because it’s so general, the EM algorithm is applied to lots of problems with missing data or latent variables. Traditional estimation methods for factor analysis, for example, can be replaced with EM. (Arguably, some of the older methods were versions of EM.) A common problem in time-series analysis and signal processing is that of “filtering” or “state estimation”: there’s an unknown signal $S_t$, which we want to know, but all we get to observe is some noisy, corrupted measurement, $X_t = h(S_t) + \eta_t$. (A historically important example of a “state” to be estimated from noisy measurements is “Where is our rocket and which way is it headed?”; see McGee and Schmidt, 1985.) This is solved by the EM algorithm, with the signal as the hidden variable; Fraser (2008) gives a really good introduction to such models and how they use EM.

Instead of just doing mixtures of densities, one can also do mixtures of predictive models, say mixtures of regressions, or mixtures of classifiers. The hidden variable $Z$ here controls which regression function to use. A general form of this is what’s known as a mixture-of-experts model (Jordan and Jacobs, 1994; Jacobs, 1997): each predictive model is an “expert”, and there can be a quite complicated set of hidden variables determining which expert to use when.

The EM algorithm is so useful and general that it has in fact been re-invented multiple times. The name “EM algorithm” comes from the statistics of mixture models in the late 1970s; in the time series literature it’s been known since the 1960s as the “Baum-Welch” algorithm.

2.3 Topic Models and Probabilistic LSA

Mixture models over words provide an alternative to latent semantic indexing for document analysis. Instead of finding the principal components of the bag-of-words vectors, the idea is as follows. There are a certain number of topics which documents in the corpus can be about; each topic corresponds to a distribution over words. The distribution of words in a document is a mixture of the topic distributions. That is, one can generate a bag of words by first picking a topic according to a multinomial distribution (topic $i$ occurs with probability $\lambda_i$), and then picking a word from that topic’s distribution. The distribution of topics varies from document to document, and this is what’s used, rather than projections on to the principal components, to summarize the document. This idea was, so far as I can tell, introduced by Hofmann (1999), who estimated everything by EM. Latent Dirichlet allocation, due to Blei and collaborators (Blei et al., 2003), is an important variation which smooths the topic distributions; there is a CRAN package called lda. Blei and Lafferty (2009) is a good recent review paper of the area.
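The generative story for a single document can be sketched in a few lines of R. Everything concrete here is made up for illustration: the six-word vocabulary, the two topics and their word probabilities, the document's topic proportions, and the function name `generate_document`.

```r
set.seed(5)
vocab <- c("gene", "cell", "protein", "market", "stock", "price")  # toy vocabulary

# Each row is one topic's distribution over words (made-up numbers)
topic_word <- rbind(
  biology = c(0.40, 0.30, 0.25, 0.02, 0.02, 0.01),
  finance = c(0.01, 0.02, 0.02, 0.35, 0.30, 0.30)
)
colnames(topic_word) <- vocab

# Generate a bag of words: for each word, pick a topic with probabilities
# lambda, then pick a word from that topic's distribution
generate_document <- function(n_words, lambda) {
  topics <- sample(nrow(topic_word), n_words, replace = TRUE, prob = lambda)
  words <- sapply(topics, function(k) sample(vocab, 1, prob = topic_word[k, ]))
  table(factor(words, levels = vocab))
}

doc_lambda <- c(0.8, 0.2)        # this document is mostly about biology
doc <- generate_document(200, doc_lambda)
doc

# Implied word distribution for the document: the mixture of the topic rows
drop(doc_lambda %*% topic_word)
```

Estimation runs in the opposite direction: given only word counts like `doc` for many documents, EM would be used to recover both the topic rows and each document's `lambda`.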

3 Non-parametric Mixture Modeling

We could replace the M step of EM by some other way of estimating the distribution of each mixture component. This could be a fast-but-crude estimate of parameters (say a method-of-moments estimator if that’s simpler than the MLE), or it could even be a non-parametric density estimator of the type we talked about last time. (Similarly for mixtures of regressions, etc.) Issues of dimensionality re-surface now, as well as convergence: because we’re not, in general, increasing J at each step, it’s harder to be sure that the algorithm will in fact converge. This is an active area of research.

4 R

There are several R packages which implement mixture models. The mclust package (http://www.stat.washington.edu/mclust/) is pretty much standard for Gaussian mixtures. One of the most recent and powerful is mixtools (Benaglia et al., 2009), which, in addition to classic mixtures of parametric densities, handles mixtures of regressions and some kinds of non-parametric mixtures. The FlexMix package (Leisch, 2004) is (as the name implies) very good at flexibly handling complicated situations, though you have to do some programming to take advantage of this.

5 Exercises

Not to hand in.

  1. Work through the E-step and M-step for a mixture of two Poisson distributions.
  2. Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K = 3 Gaussians. How well does your code assign data-points to components if you give it the actual Gaussian parameters as your initial guess? If you give it other initial parameters?