
Lecture 16: Mixture models, EM

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

November 1, 2010

Review: generative models

General idea: assume (pretend?) that $p(x \mid y)$ comes from a certain parametric class, $p(x \mid y; \theta_y)$.

Estimate $\hat\theta_y$ from the data in each class.

Under this estimate, select the class with the highest $p(x_0 \mid y; \hat\theta_y)$.

Example: Gaussian model

  • Can make various assumptions regarding the form and complexity of the Gaussian covariance!

Review: Naïve Bayes classifier

Assume $x$ is represented by $m$ features $\phi_1(x), \ldots, \phi_m(x)$, independent given the class:

$$p(x \mid c) = p(\phi_1(x), \ldots, \phi_m(x) \mid c) = \prod_{j=1}^{m} p(\phi_j(x) \mid c).$$

Under this assumption, the Bayes (optimal) classifier is

$$h^*(x) = \operatorname{sign}\left( \sum_{j=1}^{m} \log \frac{p(\phi_j(x) \mid +1)}{p(\phi_j(x) \mid -1)} + \log P_{+1} - \log P_{-1} \right).$$

For Gaussian models, equivalent to assuming diagonal covariance
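To make the review concrete, here is a minimal NumPy sketch (ours, not from the lecture) of a two-class Gaussian naïve Bayes classifier with per-feature variances, i.e. diagonal covariances; all function and variable names are made up for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class, per-feature Gaussian ML estimates (diagonal covariance)."""
    params = {}
    for c in (+1, -1):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # small jitter for numerical stability
        }
    return params

def log_gaussian(x, mean, var):
    """Elementwise log N(x_j; mean_j, var_j) for independent features."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def predict(params, x):
    """sign( sum_j log p(phi_j|+1)/p(phi_j|-1) + log P_+1 - log P_-1 )."""
    score = (log_gaussian(x, params[+1]["mean"], params[+1]["var"]).sum()
             - log_gaussian(x, params[-1]["mean"], params[-1]["var"]).sum()
             + np.log(params[+1]["prior"]) - np.log(params[-1]["prior"]))
    return 1 if score >= 0 else -1

# Toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)
params = fit_gaussian_nb(X, y)
print(predict(params, np.array([1.5, 2.0])))   # expected: +1
```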

Plan for today

Semi-parametric models

the EM algorithm

Mixture models

So far, we have assumed that each class has a single coherent model.


What if the examples (within the same class) are from a number of distinct “types”?

Examples

Images of the same person under different conditions: with/without glasses, different expressions, different views.

Images of the same category but different sorts of objects: chairs with/without armrests.

Multiple topics within the same document.

Different ways of pronouncing the same phonemes.

Mixture models

Assumptions:

  • $k$ underlying types (components);
  • $y_i$ is the identity of the component “responsible” for $x_i$;
  • $y_i$ is a hidden (latent) variable: never observed.

A mixture model:


$$p(x; \pi) = \sum_{c=1}^{k} p(y = c)\, p(x \mid y = c).$$

$\pi_c = p(y = c)$ are the mixing probabilities.

We need to parametrize the component densities $p(x \mid y = c)$.

Parametric mixtures

Suppose that the parameters of the $c$-th component are $\theta_c$. Then we can denote $\theta = [\theta_1, \ldots, \theta_k]$ and write

$$p(x; \theta, \pi) = \sum_{c=1}^{k} \pi_c \cdot p(x; \theta_c).$$

Any valid setting of $\theta$ and $\pi$, subject to $\pi_c \ge 0$ and $\sum_{c=1}^{k} \pi_c = 1$, produces a valid pdf. Example: mixture of Gaussians.

[Figure: two Gaussian densities combined with mixing weights 0.7 and 0.3 to give a mixture density.]
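As a quick illustration of the formula above, the following sketch (ours; the component parameters are made-up toy values echoing the 0.7/0.3 mixture in the figure) evaluates a two-component 1-D Gaussian mixture density with SciPy.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy parameters: two 1-D Gaussian components
pis    = np.array([0.7, 0.3])        # mixing probabilities, sum to 1
mus    = np.array([-3.0, 4.0])
sigmas = np.array([1.0, 2.0])

def mixture_pdf(x):
    """p(x; theta, pi) = sum_c pi_c * N(x; mu_c, sigma_c^2)."""
    return sum(p * norm.pdf(x, loc=m, scale=s) for p, m, s in zip(pis, mus, sigmas))

xs = np.linspace(-10, 10, 5)
print(mixture_pdf(xs))               # pointwise values of the mixture density
```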

Generative model for a mixture

The generative process for a $k$-component mixture:

  • The parameters $\theta_c$ for each component $c$ are fixed.
  • Draw $y_i \sim [\pi_1, \ldots, \pi_k]$;
  • Given $y_i$, draw $x_i \sim p(x \mid y_i; \theta_{y_i})$.
  • The graphical model representation:

[Graphical model: $\pi \rightarrow y \rightarrow x \leftarrow \theta$]

$$p(x, y; \theta, \pi) = p(y; \pi) \cdot p(x \mid y; \theta_y)$$

Any data point $x_i$ could have been generated in $k$ ways.
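The generative process translates directly into code. Below is a small sketch (ours, with invented parameters) that draws $y_i$ from the mixing probabilities and then $x_i$ from the selected Gaussian component.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy mixture: k = 2 components in 2-D
pis    = np.array([0.6, 0.4])
mus    = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def sample_mixture(n):
    """Draw y_i ~ Categorical(pi), then x_i ~ N(mu_{y_i}, Sigma_{y_i})."""
    ys = rng.choice(len(pis), size=n, p=pis)            # hidden component labels
    xs = np.stack([rng.multivariate_normal(mus[y], Sigmas[y]) for y in ys])
    return xs, ys

X, y = sample_mixture(500)
print(X.shape, np.bincount(y) / len(y))   # empirical mixing proportions ≈ pi
```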

Gaussian mixture models

If the $c$-th component is a Gaussian, $p(x \mid y = c) = \mathcal{N}(x; \mu_c, \Sigma_c)$, then

$$p(x; \theta, \pi) = \sum_{c=1}^{k} \pi_c \cdot \mathcal{N}(x; \mu_c, \Sigma_c),$$

where $\theta = [\mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k]$.

The graphical model:

[Graphical model: $\pi \rightarrow y \rightarrow x \leftarrow \mu_{1,\ldots,k}, \Sigma_{1,\ldots,k}$]

Likelihood of a mixture model

Idea: estimate the set of parameters that maximizes the likelihood of the observed data.

The log-likelihood of π, θ:

$$\log p(X; \pi, \theta) = \sum_{i=1}^{N} \log \sum_{c=1}^{k} \pi_c\, \mathcal{N}(x_i; \mu_c, \Sigma_c).$$

No closed-form solution because of the sum inside the log.

  • We need to take into account all possible components that could have generated $x_i$.
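For reference, here is a minimal sketch (ours) of evaluating this log-likelihood; the inner sum over components is computed with scipy.special.logsumexp for numerical stability, since the component log-densities can be very negative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log p(X; pi, theta) = sum_i log sum_c pi_c N(x_i; mu_c, Sigma_c)."""
    # log_comp[i, c] = log pi_c + log N(x_i; mu_c, Sigma_c)
    log_comp = np.column_stack([
        np.log(p) + multivariate_normal.logpdf(X, mean=m, cov=S)
        for p, m, S in zip(pis, mus, Sigmas)
    ])
    # logsumexp over components, then sum over data points
    return logsumexp(log_comp, axis=1).sum()

# Toy usage with two spherical 2-D components (made-up parameters)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
print(gmm_log_likelihood(X, [0.5, 0.5],
                         [np.zeros(2), np.full(2, 4.0)],
                         [np.eye(2), np.eye(2)]))
```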

Mixture density estimation

Suppose that we do observe $y_i \in \{1, \ldots, k\}$ for each $i = 1, \ldots, N$.

Let us introduce a set of binary indicator variables $z_i = [z_{i1}, \ldots, z_{ik}]$, where

$$z_{ic} = \begin{cases} 1 & \text{if } y_i = c, \\ 0 & \text{otherwise.} \end{cases}$$

The count of examples from the $c$-th component:

$$N_c = \sum_{i=1}^{N} z_{ic}.$$

Mixture density estimation: known labels

If we know $z_i$, the ML estimates of the Gaussian components, just like in the class-conditional model, are

$$\hat\pi_c = \frac{N_c}{N}, \qquad \hat\mu_c = \frac{1}{N_c} \sum_{i=1}^{N} z_{ic}\, x_i, \qquad \hat\Sigma_c = \frac{1}{N_c} \sum_{i=1}^{N} z_{ic} (x_i - \hat\mu_c)(x_i - \hat\mu_c)^T.$$

[Figure: data from the components, colored by the known labels.]
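A short sketch (ours) of these known-label estimates, taking an $N \times k$ indicator matrix Z; labels are 0-indexed here purely for convenience, and every component is assumed to have at least one point.

```python
import numpy as np

def ml_estimates_known_labels(X, Z):
    """pi_c = N_c/N; mu_c, Sigma_c = indicator-weighted sample mean/covariance."""
    N, d = X.shape
    Nc = Z.sum(axis=0)                                  # counts per component
    pis = Nc / N
    mus = (Z.T @ X) / Nc[:, None]                       # mu_c = (1/N_c) sum_i z_ic x_i
    Sigmas = []
    for c in range(Z.shape[1]):
        diff = X - mus[c]
        Sigmas.append((Z[:, c, None] * diff).T @ diff / Nc[c])
    return pis, mus, np.array(Sigmas)

# Toy usage: labels y in {0, ..., k-1} turned into one-hot indicators
y = np.array([0, 0, 1, 1, 1])
X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [6.0, 4.0], [7.0, 6.0]])
Z = np.eye(2)[y]
print(ml_estimates_known_labels(X, Z))
```

Replacing the hard indicators $z_{ic}$ with the soft responsibilities $\gamma_{ic}$ gives exactly the weighted updates derived later in the lecture.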

Credit assignment

When we don’t know $y$, we face a credit assignment problem: which component is responsible for $x_i$?

Suppose for a moment that we do know the component parameters $\theta = [\mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k]$ and the mixing probabilities $\pi = [\pi_1, \ldots, \pi_k]$.

Then, we can compute the posterior of each label using Bayes’ theorem:

$$\gamma_{ic} = \hat p(y = c \mid x_i; \theta, \pi) = \frac{\pi_c \cdot p(x_i; \mu_c, \Sigma_c)}{\sum_{l=1}^{k} \pi_l \cdot p(x_i; \mu_l, \Sigma_l)}.$$

We will call $\gamma_{ic}$ the responsibility of the $c$-th component for $x_i$.

  • Note: $\sum_{c=1}^{k} \gamma_{ic} = 1$ for each $i$.
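Computing the responsibilities is Bayes’ theorem applied to each point; the sketch below (ours) does it in log space and exponentiates at the end, so each row of gamma sums to 1.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """gamma[i, c] proportional to pi_c * N(x_i; mu_c, Sigma_c), normalized over c."""
    # log numerator: log pi_c + log N(x_i; mu_c, Sigma_c), shape (N, k)
    log_num = np.column_stack([
        np.log(p) + multivariate_normal.logpdf(X, mean=m, cov=S)
        for p, m, S in zip(pis, mus, Sigmas)
    ])
    # divide by the denominator sum_l pi_l N(x_i; mu_l, Sigma_l) in log space
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))

# Toy usage: with well-separated components, each point gets responsibility ~1 from one of them
X = np.array([[0.0, 0.0], [6.0, 6.0]])
gamma = responsibilities(X, [0.5, 0.5],
                         [np.zeros(2), np.full(2, 6.0)],
                         [np.eye(2), np.eye(2)])
print(gamma)          # rows sum to 1
```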

Expected likelihood

The “complete data” likelihood (when z are known):

$$p(X, Z; \pi, \theta) \propto \prod_{i=1}^{N} \prod_{c=1}^{k} \left( \pi_c\, \mathcal{N}(x_i; \mu_c, \Sigma_c) \right)^{z_{ic}},$$

and the log:

$$\log p(X, Z; \pi, \theta) = \mathrm{const} + \sum_{i=1}^{N} \sum_{c=1}^{k} z_{ic} \left( \log \pi_c + \log \mathcal{N}(x_i; \mu_c, \Sigma_c) \right).$$

We can’t compute it, but can take the expectation w.r.t. the posterior of $z$, which is just $\gamma_{ic}$:

$$\mathbb{E}_{z_{ic} \sim \gamma_{ic}} \left[ \log p(x_i, z_{ic}; \pi, \theta) \right].$$


Expectation of $z_{ic}$:

$$\mathbb{E}_{z_{ic} \sim \gamma_{ic}}[z_{ic}] = \sum_{z \in \{0,1\}} z \cdot p(z_{ic} = z) = \gamma_{ic}.$$

The expected likelihood of the data:

$$\mathbb{E}_{z_{ic} \sim \gamma_{ic}} \left[ \log p(X, Z; \pi, \theta) \right] = \mathrm{const} + \sum_{i=1}^{N} \sum_{c=1}^{k} \gamma_{ic} \left( \log \pi_c + \log \mathcal{N}(x_i; \mu_c, \Sigma_c) \right).$$

Expectation maximization

$$\mathbb{E}_{z_{ic} \sim \gamma_{ic}} \left[ \log p(X, Z; \pi, \theta) \right] = \mathrm{const} + \sum_{i=1}^{N} \sum_{c=1}^{k} \gamma_{ic} \left( \log \pi_c + \log \mathcal{N}(x_i; \mu_c, \Sigma_c) \right).$$

We can find $\pi, \theta$ that maximize this expected likelihood by setting derivatives to zero and, for $\pi$, using Lagrange multipliers to enforce $\sum_c \pi_c = 1$:

$$\hat\pi_c = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ic}, \qquad \hat\mu_c = \frac{\sum_{i=1}^{N} \gamma_{ic}\, x_i}{\sum_{i=1}^{N} \gamma_{ic}}, \qquad \hat\Sigma_c = \frac{\sum_{i=1}^{N} \gamma_{ic} (x_i - \hat\mu_c)(x_i - \hat\mu_c)^T}{\sum_{i=1}^{N} \gamma_{ic}}.$$

Summary so far

  • If we know the parameters and the indicators (assignments), we are done.

  • If we know the indicators but not the parameters, we can do ML estimation of the parameters, and we are done.

  • If we know the parameters but not the indicators, we can compute the posteriors of the indicators;

  • With known posteriors, we can estimate parameters that maximize the expected likelihood, and then we are done.

But in reality we know neither the parameters nor the indicators.

The EM algorithm

Start with a guess of θ, π.

  • Typically, random $\theta$ and $\pi_c = 1/k$.

Iterate between:

E-step: Compute the expected assignments, i.e. calculate $\gamma_{ic}$, using the current estimates of $\theta, \pi$.

M-step: Maximize the expected likelihood under the current $\gamma_{ic}$.

Repeat until convergence.
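Putting the E-step and M-step together, here is a compact, self-contained sketch of EM for a Gaussian mixture (our illustration, not the instructor’s code); it initializes the means at random data points, uses $\pi_c = 1/k$, and stops when the log-likelihood improves by less than a tolerance.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, k, n_iter=100, tol=1e-6, seed=0):
    """EM for a k-component Gaussian mixture; returns (pi, mu, Sigma, log-likelihood)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: means at random data points, identity covariances, pi_c = 1/k
    mus = X[rng.choice(N, size=k, replace=False)].copy()
    Sigmas = np.stack([np.eye(d) for _ in range(k)])
    pis = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log pi_c + log N(x_i; mu_c, Sigma_c), shape (N, k)
        log_comp = np.column_stack([
            np.log(pis[c]) + multivariate_normal.logpdf(X, mean=mus[c], cov=Sigmas[c])
            for c in range(k)
        ])
        ll = logsumexp(log_comp, axis=1).sum()        # current log-likelihood
        if ll - prev_ll < tol:                        # converged
            break
        prev_ll = ll
        gamma = np.exp(log_comp - logsumexp(log_comp, axis=1, keepdims=True))
        # M-step: responsibility-weighted ML estimates (the update formulas above)
        Nc = gamma.sum(axis=0)
        pis = Nc / N
        mus = (gamma.T @ X) / Nc[:, None]
        Sigmas = np.stack([
            (gamma[:, c, None] * (X - mus[c])).T @ (X - mus[c]) / Nc[c]
            + 1e-6 * np.eye(d)                        # small ridge keeps covariances invertible
            for c in range(k)
        ])
    return pis, mus, Sigmas, ll

# Toy usage: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(6, 1, (150, 2))])
pis, mus, Sigmas, ll = em_gmm(X, k=2)
print(pis, mus, ll)
```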

EM for Gaussian mixture: an example

Colors represent $\gamma_{ic}$ after the E-step.

[Figures: the fitted mixture and responsibilities after the 1st, 2nd, 3rd, 4th, and 7th iterations.]