Probabilistic Methods in AI: EM Algorithm for Clustering & Maximum Likelihood, Study notes of Computer Numerical Control

These notes cover the use of probabilistic models in artificial intelligence, focusing on the EM algorithm for clustering and on maximum likelihood estimation. Topics include classification, learning with missing data, probabilistic reasoning, and decision theory. Worked examples of maximum likelihood estimation are given for the Bernoulli distribution, the multivariate Gaussian, and the mixture-of-Gaussians model.


CS181 Lecture 10 — Probabilistic Methods I

Avi Pfeffer; Revised by David Parkes

Feb 21, 2011

In this lecture we explore the method of maximum likelihood and the expectation-maximization algorithm in some detail. An application is developed around using the EM algorithm for clustering. We also discuss the many ways in which probabilistic methods are used within AI.

1 Uses of Probabilistic Models

For the remainder of the course, our focus will shift somewhat towards learning and reasoning with probabilistic models. Probabilistic models are powerful tools for building intelligent agents and also provide some of the state-of-the-art approaches to machine learning.

An agent that has a probabilistic model of the way the world works can do many things. Amongst them:

Classification. Given a probability model of the joint distribution on random variables $X$ and $Y$ (taking values in feature space $\mathcal{X}$ and label set $\mathcal{Y}$, respectively), an example $x$ can be classified as

$$\arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x) = \arg\max_{y \in \mathcal{Y}} \frac{P(Y = y, X = x)}{P(X = x)} \tag{1}$$

$$= \arg\max_{y \in \mathcal{Y}} P(Y = y, X = x), \tag{2}$$

where the second equality follows since $P(X = x)$ is the same for all $y$. Note that the posterior $P(Y = y \mid X = x)$ also gives the degree of confidence an agent has in the classification.
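To make the rule concrete, here is a minimal sketch that classifies from a small joint table. The table `joint`, its numbers, and the helper `classify` are made up for illustration and are not part of the lecture.

```python
import numpy as np

# Hypothetical joint distribution P(Y = y, X = x) with 2 labels and 3 feature values.
# Rows index y, columns index x; the entries are invented and sum to 1.
joint = np.array([
    [0.10, 0.25, 0.05],   # y = 0
    [0.30, 0.05, 0.25],   # y = 1
])

def classify(joint, x):
    """Return the arg max over y of P(Y=y, X=x) and the posterior P(Y | X=x)."""
    column = joint[:, x]                # P(Y = y, X = x) for every y
    posterior = column / column.sum()   # dividing by P(X = x) does not change the arg max
    return int(np.argmax(column)), posterior

label, posterior = classify(joint, x=1)
print("predicted label:", label)
print("posterior (degree of confidence):", posterior)
```

The arg max is taken over the unnormalized column, as in equation (2); the normalization is only needed when the confidence itself is of interest.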

Classification with missing data. Up till now, we have assumed that in every example to be classified, all the attribute values are available to the classifier. This is a very strong assumption, and not generally true. For example, consider a medical diagnosis system that tries to diagnose a person based on test results. Different people will have different tests, so the set of inputs available for every example will be different. This is a problem for decision tree and neural network approaches.

However, if one has a probability distribution $P(X_1, \ldots, X_m, Y)$ and observes an example with values known for features $1$ to $j$ (only), then classification can proceed as:

$$\arg\max_{y \in \mathcal{Y}} P(Y = y \mid X_1 = x_1, \ldots, X_j = x_j) \tag{3}$$

$$= \arg\max_{y \in \mathcal{Y}} \frac{P(Y = y, X_{1 \ldots j} = x_{1 \ldots j})}{P(X_{1 \ldots j} = x_{1 \ldots j})} \tag{4}$$

$$= \arg\max_{y \in \mathcal{Y}} \sum_{x'_{j+1}, \ldots, x'_m} P(Y = y, X_{1 \ldots j} = x_{1 \ldots j}, X_{j+1 \ldots m} = x'_{j+1 \ldots m}) \tag{5}$$

where the notation $X_{a \ldots b} = x_{a \ldots b}$ indicates that the random variables in the sequence $X_a, \ldots, X_b$ assume the values in the sequence $x_a, \ldots, x_b$. The final equality comes from noticing that $P(X_{1 \ldots j} = x_{1 \ldots j})$ is invariant to $y$, and then using marginalization. A basic rule from probability theory is that the unconditional probability of a random variable $A$ can be determined by summing the joint probability of $A$ and $B$ over the possible values $B_j$ of another random variable $B$:

$$P(A) = \sum_j P(A, B_j) \tag{6}$$

Note that the summation in (5) has a number of terms that is exponential in the number of unobserved features (one term for each joint assignment of values to the features whose values are unknown). Later in the course we will use Bayesian networks to make this kind of reasoning tractable.
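The marginalization in (5) can be sketched by brute force when the joint distribution is small enough to store as a table. The random table `joint` and the helper `classify_partial` below are illustrative assumptions; the explicit loop is there to show the exponential number of terms, not to be efficient.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint P(Y, X1, X2, X3) over binary variables, stored as a 2x2x2x2 table.
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()

def classify_partial(joint, observed):
    """Classify using only the first j features; `observed` holds x_1, ..., x_j."""
    j = len(observed)
    # Value ranges of the unobserved features X_{j+1}, ..., X_m.
    ranges = [range(size) for size in joint.shape[1 + j:]]
    scores = np.zeros(joint.shape[0])
    for y in range(joint.shape[0]):
        # One term per joint assignment of the unobserved features.
        for rest in itertools.product(*ranges):
            scores[y] += joint[(y, *observed, *rest)]
    return int(np.argmax(scores)), scores

label, scores = classify_partial(joint, observed=(1,))
print("label:", label, "scores P(Y=y, X1=1):", scores)
```

With `m - j` unobserved binary features, the inner loop runs `2 ** (m - j)` times per label, which is the blow-up the notes refer to.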

Learning with missing data. In addition to missing attributes in examples used for classification, there may also be missing attributes in examples used for learning. The missing data may be the class, the values of some attributes, or both. Missing attributes can be a problem for many learning methods. However, it is generally possible to take missing data into account when learning a probabilistic model. The basic approach is through the expectation-maximization (EM) algorithm.

Clustering (unsupervised learning). An extreme form of missing data occurs in clustering. Here, there is no class label, but the goal is to associate each example with a class (that is, a cluster). Clustering is well supported by probabilistic methods such as the EM algorithm applied to a mixture-of-Gaussians model or the Naive Bayes model. We will see this soon.

Inference (prediction, diagnosis, temporal reasoning). Probabilistic models support several kinds of reasoning. Prediction involves trying to determine the future based on the present. This can be an easy task for probabilistic approaches, as one can specify directly a probabilistic model of how things happen in the world. Prediction can be viewed as reasoning

2 Generative vs. Discriminative Models

One high level decision to make when designing a probabilistic model is whether it should be generative or discriminative.

  • In a generative approach, the joint distribution $P(X_1, \ldots, X_m, Y)$ is modeled explicitly. For this, it is typical to use the product rule of probability theory and write

$$P(X_1, \ldots, X_m, Y) = P(Y)\, P(X_1, \ldots, X_m \mid Y) \tag{9}$$

Recall that the product rule states, for two random variables $A$ and $B$, that

$$P(A, B) = P(A)\, P(B \mid A) = P(B)\, P(A \mid B) \tag{10}$$

This is known as a generative approach because it is possible to generate new examples $(x, y)$ from a learned model. The generative approach can be a very natural and easy-to-use modeling approach, with latent (or hidden) variables used to model the causal process by which data is generated, and has become extremely popular in recent years. We will focus mainly on the generative approach.

  • On the other hand, the generative approach learns a model of the distribution on $X$, and this is not strictly necessary if the goal is classification. In a discriminative approach, one learns just enough to be able to "discriminate" (i.e., decide) the target class of an example; that is, the model learned is

$$P(Y \mid X_1, \ldots, X_m) \tag{11}$$

The discriminative approach can require less data because it adopts a less ambitious goal. However, additional structure can sometimes be usefully recognized by learning the model on $x$ as well, as in the generative approach. The generative approach can also be used to identify outliers because the probability of any example $x$ can be computed. This is not possible with a discriminative approach.

A word of warning: in additional reading you may also find the distinction drawn differently, with "generative" meaning probabilistic and "discriminative" meaning non-probabilistic (e.g., an SVM or similar). SVMs, neural networks, decision trees and so forth are examples of discriminative approaches, but probabilistic methods may also be discriminative.
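As a rough illustration of the two properties claimed for the generative approach (generating new examples via the factorization in (9), and scoring how probable an example is), consider the following sketch. The one-dimensional Gaussian class-conditionals and all numbers (`prior`, `means`, `stds`) are assumptions chosen for the example, not a model from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generative model with a binary label and one real-valued feature.
prior = np.array([0.3, 0.7])    # P(Y = y)
means = np.array([-2.0, 3.0])   # mean of X given Y = y
stds = np.array([1.0, 0.5])     # standard deviation of X given Y = y

def sample(n):
    """Generate (x, y) pairs: draw y from P(Y), then x from P(X | Y = y)."""
    y = rng.choice(len(prior), size=n, p=prior)
    x = rng.normal(means[y], stds[y])
    return x, y

def density(x):
    """P(x) = sum_y P(y) P(x | y); a low value flags a possible outlier."""
    per_class = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(prior @ per_class)

x, y = sample(1000)
print("fraction with y = 1:", y.mean())
print("density at a typical point:", density(3.0))
print("density at a far-away point:", density(50.0))
```

A purely discriminative model of $P(Y \mid X)$ would support neither `sample` nor `density`, since it never models the distribution on $X$.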

3 Maximum Likelihood

Let us return now to parameterized models, with $P(X = x \mid \theta)$ a distribution parameterized by parameter vector $\theta$. In the maximum likelihood method, we seek parameters

$$\theta_{ML} = \arg\max_{\theta} P(D \mid \theta) \tag{12}$$

where $D = \{x_1, \ldots, x_n\}$. From last time, we know that it is helpful to work with the log likelihood, and find

$$\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{n} P(X = x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \ln P(X = x_i \mid \theta) \tag{13}$$

Let us consider three increasingly involved examples for how to solve for maximum likelihood parameters by taking partial derivatives and setting to zero.

3.1 Example 1: Bernoulli

Suppose that each data point $x \in \{0, 1\}$ indicates whether a soccer team won ($x = 1$) or lost ($x = 0$) a game, and that $D = \{1, 0, 0\}$, indicating one win and two losses. Adopt the Bernoulli distribution with

$$P(X = x \mid \theta) = \theta^{x} (1 - \theta)^{(1 - x)} \tag{14}$$

where $\theta \in [0, 1]$ is the probability of a win, and $x \in \{0, 1\}$. Given data $D$, we write

$$P(D \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{(1 - x_i)} = \theta^{N_T} (1 - \theta)^{N_F}, \tag{15}$$

where $N_T, N_F \in \{0, 1, \ldots\}$ are the number of 1's and the number of 0's in the data, respectively. Taking logs, we have log likelihood

$$\ln P(D \mid \theta) = N_T \ln \theta + N_F \ln(1 - \theta) \tag{16}$$

Taking the derivative with respect to $\theta$ and equating to zero gives $N_T / \theta - N_F / (1 - \theta) = 0$, so that

$$\theta_{ML} = \frac{N_T}{N_T + N_F} = \frac{N_T}{n},$$

which is $1/3$ given the data.
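The closed form can be checked numerically with a small sketch. The helper names `bernoulli_mle` and `log_likelihood` and the grid search are illustrative choices; the grid is only a sanity check on the calculus, not part of the method.

```python
import math

def bernoulli_mle(data):
    """Closed-form estimate theta_ML = N_T / n."""
    return sum(data) / len(data)

def log_likelihood(data, theta):
    """ln P(D | theta) = N_T ln(theta) + N_F ln(1 - theta), as in equation (16)."""
    n_true = sum(data)
    n_false = len(data) - n_true
    return n_true * math.log(theta) + n_false * math.log(1.0 - theta)

data = [1, 0, 0]   # one win, two losses
theta_ml = bernoulli_mle(data)

# Sanity check: search a fine grid in (0, 1) for the best log likelihood.
grid = [k / 1000 for k in range(1, 1000)]
best_on_grid = max(grid, key=lambda t: log_likelihood(data, t))

print("closed form:", theta_ml)
print("best grid point:", best_on_grid)
```

The grid excludes the endpoints 0 and 1, where the log likelihood is undefined for data containing both outcomes.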

3.3 Example 3: Multivariate Gaussian

Consider now the multivariate Gaussian, which provides a density function on a continuous $m$-dimensional random variable, $x$, with

$$P(X = x \mid \theta) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

with parameter $\mu \in \mathbb{R}^m$ the mean, $m$-by-$m$ matrix $\Sigma$ the covariance matrix, $\Sigma^{-1}$ the inverse matrix, and $|\Sigma|$ the determinant. This has $m + m(m + 1)/2$ parameters. Consider data $D = \{x_1, \ldots, x_n\}$. The log likelihood for the data given a multivariate Gaussian is:

$$\ln[L(D, \theta)] = -\frac{nm}{2} \ln(2\pi) - \frac{n}{2} \ln |\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \tag{25}$$

For maximum likelihood parameters, we can take partial derivatives with respect to $\mu$ and set them to zero, obtaining

$$\mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{26}$$

which is the sample mean. Solving for $\Sigma$ is a bit more involved, but produces the estimate

$$\Sigma_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})(x_i - \mu_{ML})^T \tag{27}$$

Both are familiar from what we derived for the univariate Gaussian.
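A minimal NumPy sketch of estimates (26) and (27), together with the log likelihood (25), is given below. The synthetic data and the function names `gaussian_mle` and `log_likelihood` are assumptions made for illustration.

```python
import numpy as np

def gaussian_mle(X):
    """Return (mu_ML, Sigma_ML) for data X of shape (n, m), per equations (26)-(27)."""
    n = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    sigma = centered.T @ centered / n   # divides by n, not n - 1
    return mu, sigma

def log_likelihood(X, mu, sigma):
    """Log likelihood of X under N(mu, sigma), following equation (25)."""
    n, m = X.shape
    centered = X - mu
    # Quadratic forms (x_i - mu)^T Sigma^{-1} (x_i - mu), summed over all i.
    quad = np.einsum("ij,ij->", centered, np.linalg.solve(sigma, centered.T).T)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * n * m * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

# Synthetic data from a made-up 2-dimensional Gaussian.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.6], [0.6, 1.0]], size=500)

mu_ml, sigma_ml = gaussian_mle(X)
print("mu_ML:", mu_ml)
print("Sigma_ML:\n", sigma_ml)
print("log likelihood at the ML estimate:", log_likelihood(X, mu_ml, sigma_ml))
```

Note that $\Sigma_{ML}$ divides by $n$, so it matches `np.cov(X, rowvar=False, bias=True)` rather than NumPy's default unbiased estimate.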

3.4 Example 4: Mixture-of-Gaussians

Now suppose that we adopt as the model a mixture of Gaussians, with $K$ component distributions, and

$$P(X = x \mid \theta) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(X = x \mid \mu_k, \Sigma_k), \tag{28}$$

where the parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}$ define the prior $\pi_k$ for generating an example from Gaussian $k$, and the mean $\mu_k$ and covariance $\Sigma_k$ of Gaussian $k$. (Do not confuse the parameter $\pi_k$ with the irrational number $\pi$, of "Life of $\pi$" fame, which is used to define a Gaussian distribution amongst other things!) Each multivariate Gaussian is called a component of the distribution.

This distribution implicitly introduces a new random variable, which is the component from which an example is generated. For now, let us simply assume that we are able to observe the instantiation of this random variable, $y_i \in \{1, \ldots, K\}$, for every example $x_i$ in the data. The data is $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Given this, we want a maximum likelihood estimate of the parameters, which solves

$$\arg\max_{\theta} \sum_{i=1}^{n} \ln P(X = x_i, Y = y_i \mid \theta) \tag{29}$$

It is convenient to introduce index vector $y_i = (y_{i1}, \ldots, y_{iK})$ with $y_{ik} \in \{0, 1\}$ and $\sum_k y_{ik} = 1$, and $y_{ik}$ indicating the value assumed by $y_i$. In particular, if $y_{ik} = 1$ then example $i$ was generated by Gaussian component $k$. By simple algebra, we have

$$\sum_{i=1}^{n} \ln P(X = x_i, Y = y_i \mid \theta) \tag{30}$$

$$= \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \ln[\pi_k \mathcal{N}(X = x_i \mid \mu_k, \Sigma_k)] \tag{31}$$

$$= \sum_{k=1}^{K} \left( \sum_{i=1}^{n} y_{ik} \right) \ln \pi_k + \sum_{k=1}^{K} \left( \sum_{i=1}^{n} y_{ik} \ln[\mathcal{N}(X = x_i \mid \mu_k, \Sigma_k)] \right) \tag{32}$$

Given this, we can solve for $\theta_{ML}$, with

$$\pi_{ML,k} = \frac{N_k}{n}, \quad \text{and} \quad N_k = \sum_{i=1}^{n} y_{ik} \tag{33}$$

coming from the first term (by a simple use of Lagrangian optimization, noting that $\sum_{k=1}^{K} \pi_k = 1$). The second term decomposes completely by component $k$, and the solution follows from essentially the same analysis as that used to derive the ML parameters for a single multivariate Gaussian. We have

$$\mu_{ML,k} = \frac{1}{N_k} \sum_{i=1}^{n} y_{ik} x_i \tag{34}$$

and

$$\Sigma_{ML,k} = \frac{1}{N_k} \sum_{i=1}^{n} y_{ik} (x_i - \mu_{ML,k})(x_i - \mu_{ML,k})^T \tag{35}$$
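For the complete-information case, estimates (33) to (35) amount to per-component counting and averaging. The sketch below assumes labels coded as integers 0 to K-1 (rather than 1 to K) and at least one example per component; the data and the name `mixture_mle_labeled` are hypothetical.

```python
import numpy as np

def mixture_mle_labeled(X, y, K):
    """ML estimates for a mixture of Gaussians when component labels y are observed.

    X has shape (n, m); y holds integers in {0, ..., K-1}.
    Assumes every component has at least one example, so that N_k > 0.
    """
    n, m = X.shape
    Y = np.eye(K)[y]                 # indicator matrix y_ik, shape (n, K)
    N = Y.sum(axis=0)                # N_k, equation (33)
    pi = N / n
    mu = (Y.T @ X) / N[:, None]      # equation (34)
    sigma = np.empty((K, m, m))
    for k in range(K):
        centered = X - mu[k]
        # Indicator-weighted outer products, equation (35).
        sigma[k] = (Y[:, k, None] * centered).T @ centered / N[k]
    return pi, mu, sigma

# Made-up labeled data: 80 examples from component 0 and 120 from component 1.
rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 120)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
X = centers[y] + rng.normal(size=(200, 2))

pi, mu, sigma = mixture_mle_labeled(X, y, K=2)
print("pi:", pi)
print("mu:\n", mu)
```

Because each $y_{ik}$ is 0 or 1, this is the same as fitting one Gaussian to each labeled group separately.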

4 Clustering via the EM Algorithm

The challenge for clustering is that the data $D$ only consists of examples $x_1, \ldots, x_n$, and includes no association of a cluster with each example. It is the clustering that we want to find! The maximum likelihood problem remains

This is a little different from the assignment determined in K-means because the covariance associated with each "prototype" $\mu_k$ need not be the same (and need not be spherical) and because the prior $\pi_k$ need not be uniform.

The EM algorithm will guess parameters $\theta$, then determine the probability $P(Y = k \mid X = x_i, \theta)$ with which each example $x_i$ is generated from each component, and then use this to maximize the expected log likelihood given these probabilities. Then repeat! Continue until convergence or until some maximum number of iterations is reached.

To see why this can allow for tractability, and to understand the specifics of the EM update for this problem, consider again the log likelihood expression (32) for the complete information version of the problem,

$$\sum_{k=1}^{K} \left( \sum_{i=1}^{n} y_{ik} \right) \ln \pi_k + \sum_{k=1}^{K} \left( \sum_{i=1}^{n} y_{ik} \ln[\mathcal{N}(X = x_i \mid \mu_k, \Sigma_k)] \right)$$

In each step of EM, we are interested in finding new parameters $\theta^{(\text{new})}$ that maximize the expected log likelihood, fixing the probability of assignment $\gamma_{ik} = P(Y = k \mid X = x_i, \theta^{(\text{old})})$ of each example to each component,

$$\arg\max_{\theta} \; E_Y \left[ \sum_{i=1}^{n} \ln P(X = x_i, Y_i \mid \theta, \{\gamma_{ik}\}) \right]$$

It is a simple matter to recognize that the expected log likelihood, fixing probabilities $\{\gamma_{ik}\}$, is obtained by substituting $\gamma_{ik}$ for $y_{ik}$ in the complete information log likelihood, so that the expected log likelihood is just

$$\sum_{k=1}^{K} \left( \sum_{i=1}^{n} \gamma_{ik} \right) \ln \pi_k + \sum_{k=1}^{K} \left( \sum_{i=1}^{n} \gamma_{ik} \ln[\mathcal{N}(X = x_i \mid \mu_k, \Sigma_k)] \right)$$

Recognizing that $y_{ik} \in \{0, 1\}$ and $\gamma_{ik} \in [0, 1]$ are both just constants for the purpose of likelihood maximization, almost exactly the same expressions from the complete information version can now be adopted for the purpose of the EM algorithm. We have:

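  • (Step 0) Initialize the Gaussian mixture by specifying $\theta^{(0)}$. For example, we may set $\pi_k^{(0)} = 1/K$, equate each $\mu_k^{(0)}$ with a random example, and set $\Sigma_k^{(0)}$ equal to the overall data covariance. These parameters become $\theta^{(\text{old})}$ for the first EM step.

  • Repeat Until Convergence:
    • (E-Step) Evaluate the current posterior assignment probabilities $\gamma_{ik} = P(Y = k \mid X = x_i, \theta^{(\text{old})})$.
    • (M-Step) Update the parameters according to

$$\pi_k^{(\text{new})} = \frac{\hat{N}_k}{n}, \quad \text{where} \quad \hat{N}_k = \sum_{i=1}^{n} \gamma_{ik} \tag{44}$$

$$\mu_k^{(\text{new})} = \frac{1}{\hat{N}_k} \sum_{i=1}^{n} \gamma_{ik} x_i \tag{45}$$

$$\Sigma_k^{(\text{new})} = \frac{1}{\hat{N}_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \mu_k^{(\text{new})})(x_i - \mu_k^{(\text{new})})^T \tag{46}$$

    • Replace $\theta^{(\text{old})} \leftarrow \theta^{(\text{new})}$.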

Upon termination, adopt $\theta^{(\text{new})}$ as parameters $\theta_{ML}$. The first, (E)xpectation, step finds the expected assignment of each example to each cluster (equivalently, the probability that each indicator variable is one). The second, (M)aximization, step then solves for maximum likelihood parameters, adopting the solution to the (E) step as weights with which to consider the influence of each example. In effect, the (M) step maximizes the expected log likelihood.
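The whole procedure can be sketched in NumPy as follows. This is one possible reading of the steps, not a reference implementation: the function names, the fixed iteration count in place of a convergence test, the synthetic data, and the small `ridge` term added to each covariance for numerical stability are all assumptions that do not come from the notes. The E-step uses Bayes' rule, so $\gamma_{ik}$ is proportional to $\pi_k \mathcal{N}(X = x_i \mid \mu_k, \Sigma_k)$, and it is computed in log space to avoid underflow.

```python
import numpy as np

def log_gaussian(X, mu, sigma):
    """ln N(x_i | mu, sigma) for every row x_i of X (shape (n, m), m >= 2)."""
    m = X.shape[1]
    centered = X - mu
    quad = np.einsum("ij,ij->i", centered, np.linalg.solve(sigma, centered.T).T)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (m * np.log(2 * np.pi) + logdet + quad)

def em_gmm(X, K, n_iter=50, seed=0, ridge=1e-6):
    """EM for a mixture of K Gaussians; returns parameters, gamma, and a log-likelihood trace."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    # Step 0: uniform priors, means at random examples, overall data covariance.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    sigma = np.stack([np.cov(X, rowvar=False, bias=True)] * K)
    history = []
    for _ in range(n_iter):
        # E-step: gamma_ik = P(Y = k | X = x_i, theta_old), via Bayes' rule.
        log_joint = np.stack(
            [np.log(pi[k]) + log_gaussian(X, mu[k], sigma[k]) for k in range(K)], axis=1
        )
        log_px = np.logaddexp.reduce(log_joint, axis=1)   # ln P(x_i | theta_old)
        history.append(log_px.sum())                      # incomplete-data log likelihood
        gamma = np.exp(log_joint - log_px[:, None])
        # M-step: equations (44)-(46), with gamma_ik in place of y_ik.
        N_hat = gamma.sum(axis=0)
        pi = N_hat / n
        mu = (gamma.T @ X) / N_hat[:, None]
        for k in range(K):
            centered = X - mu[k]
            # The ridge term is an added assumption to keep sigma[k] well conditioned.
            sigma[k] = (gamma[:, k, None] * centered).T @ centered / N_hat[k] + ridge * np.eye(m)
    return pi, mu, sigma, gamma, history

# Made-up data: two clusters in the plane.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(150, 2)),
    rng.normal(loc=[6.0, 6.0], scale=1.0, size=(150, 2)),
])

pi, mu, sigma, gamma, history = em_gmm(X, K=2)
print("pi:", pi)
print("mu:\n", mu)
print("log likelihood, first and last iteration:", history[0], history[-1])
```

Each row of `gamma` is a soft cluster assignment; taking `gamma.argmax(axis=1)` gives a hard clustering if one is needed. The result depends on the random initialization, so different seeds may give different local optima.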

Caution: this algorithm should be initialized with different parameters for each component to ensure diversity during learning. Suppose that we initially set all the clusters to have identical means and covariances. In this case they would all be changed in exactly the same way in each EM iteration, and we would be estimating a single Gaussian distribution! To see this, note that the posterior assignments $\gamma_{ik}$ would be the same for every cluster. This is why it is important that the initial mean vectors $\{\mu_1^{(0)}, \ldots, \mu_K^{(0)}\}$ are set to randomly chosen examples, or are even chosen iteratively so that each new mean is far from those already chosen.
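The degenerate case can be seen directly in one E-step and one mean update. In the sketch below, the data, the choice of $K = 3$, and the helper `responsibilities` are illustrative assumptions; all components start from the same mean and covariance with uniform priors.

```python
import numpy as np

def responsibilities(X, pi, mu, sigma):
    """E-step: gamma_ik proportional to pi_k * N(x_i | mu_k, sigma_k), in log space."""
    n, m = X.shape
    K = len(pi)
    log_joint = np.empty((n, K))
    for k in range(K):
        centered = X - mu[k]
        quad = np.einsum("ij,ij->i", centered, np.linalg.solve(sigma[k], centered.T).T)
        _, logdet = np.linalg.slogdet(sigma[k])
        log_joint[:, k] = np.log(pi[k]) - 0.5 * (m * np.log(2 * np.pi) + logdet + quad)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # made-up data
K = 3

# Identical initialization for every component.
pi = np.full(K, 1.0 / K)
mu = np.tile(X.mean(axis=0), (K, 1))
sigma = np.stack([np.cov(X, rowvar=False, bias=True)] * K)

gamma = responsibilities(X, pi, mu, sigma)
new_mu = (gamma.T @ X) / gamma.sum(axis=0)[:, None]

print("distinct responsibility values:", np.unique(np.round(gamma, 12)))
print("updated means (one row per component):\n", new_mu)
```

Every example is shared equally among the components, so all updated means coincide with the sample mean and the components never separate.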

Why does EM work? EM behaves a lot like gradient ascent: at each step, it adjusts the parameter values so as to improve the likelihood of the data. It is a basic fact (that we won't prove in this course) that the EM algorithm increases the likelihood of the (incomplete) data $D$ at the end of every M-step. Indexing the parameters $\theta^{(0)}, \theta^{(1)}, \ldots$, we have

$$P(D \mid \theta^{(0)}) < P(D \mid \theta^{(1)}) < P(D \mid \theta^{(2)}) < \ldots \tag{47}$$

except once a local optimum has been reached, where the likelihood no longer changes. From this we obtain convergence to a set of parameter values that locally maximize the likelihood. As always with iterative improvement algorithms, local optima may be a problem. We can use standard techniques such as random restarts to deal with this.

It is important to understand how this algorithm works, because it is paradigmatic of a general algorithm that can be applied to many different probabilistic models. If you can reconstruct this algorithm, you should be able to construct an EM algorithm for many different examples. The EM algorithm is used, for example, to learn the parameters in Bayesian networks (graphical models