Markov Chain Monte Carlo and Gibbs Sampling in Machine Learning, Study notes of Computer Science

Markov chain Monte Carlo (MCMC) and Gibbs sampling are techniques used in machine learning for drawing samples from probability distributions. MCMC maintains a “random walk” of parameters over parameter space. Gibbs sampling applies only to multivariate distributions and samples each coordinate directly from its conditional distribution p(θd | θ−d). The Metropolis-Hastings acceptance rule moves the walk toward high-probability regions while correcting for any bias the proposal distribution has toward particular parameter values.

Typology: Study notes

Pre 2010

Uploaded on 08/31/2009

Machine Learning (CS 5350/CS 6350) 05 Apr 2007
Bayesian inference II


The key problem with (the good) Monte Carlo techniques is that they require a proposal distribution q that is good everywhere. What we’d like to do is to have a proposal distribution q that is good locally. Of course, the question is “locally to what?”

In Markov Chain Monte Carlo techniques, we maintain a “random walk” of parameters over parameter space. That is, instead of drawing a bunch of samples $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(R)}$ independently from a proposal distribution, we will draw $\theta^{(r+1)}$ conditioned on $\theta^{(r)}$. This introduces a few new problems (samples are no longer independent), but will enable us to define a better proposal distribution.

Metropolis-Hastings

Let $q(\theta' \mid \theta)$ be a proposal distribution. MH works as follows:

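  1. Initialize $\theta^{(1)}$
  2. For $r = 1 \dots R - 1$,

     (a) Draw $\theta'$ according to $q(\theta' \mid \theta^{(r)})$

     (b) Compute acceptance probability:

     $$a = \min\left(1, \frac{p(\theta')\, q(\theta^{(r)} \mid \theta')}{p(\theta^{(r)})\, q(\theta' \mid \theta^{(r)})}\right)$$

     (c) Accept $\theta'$ with probability $a$. If accepted, set $\theta^{(r+1)}$ to $\theta'$; otherwise, set $\theta^{(r+1)}$ to $\theta^{(r)}$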

The idea behind the acceptance probability is as follows. We want to move to $\theta'$ if it has high probability under $p$; we want to stay at $\theta^{(r)}$ if it has high probability under $p$. If $q$ unnecessarily favors $\theta'$, we don’t want to move there; if it unnecessarily favors $\theta^{(r)}$, we want to move away.
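The MH loop above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's implementation: the function name `metropolis_hastings`, the helper signatures `log_p`, `propose` and `log_q`, the standard-normal toy target, the step size and the seed are all assumptions made for the example. It works in log space to avoid underflow, and it relies on the fact that any normalizing constant of p cancels in the ratio.

```python
import math
import random


def metropolis_hastings(log_p, propose, log_q, theta_init, num_samples, rng):
    """Run Metropolis-Hastings and return [theta(1), ..., theta(R)].

    log_p(theta):  log of the (possibly unnormalized) target density p
    propose(theta): draws theta' from q(. | theta)
    log_q(a, b):   log q(a | b)
    """
    samples = [theta_init]
    for _ in range(num_samples - 1):
        current = samples[-1]
        candidate = propose(current)
        # log of p(theta') q(theta | theta') / (p(theta) q(theta' | theta))
        log_ratio = (log_p(candidate) + log_q(current, candidate)
                     - log_p(current) - log_q(candidate, current))
        accept_prob = math.exp(min(0.0, log_ratio))  # a = min(1, ratio)
        if rng.random() < accept_prob:
            samples.append(candidate)   # accepted: move to theta'
        else:
            samples.append(current)     # rejected: stay at theta(r)
    return samples


# Toy target (an assumption for illustration): unnormalized standard normal.
def log_target(theta):
    return -0.5 * theta * theta


rng = random.Random(0)
STEP = 1.0  # proposal standard deviation, chosen arbitrarily


def propose(theta):
    # Gaussian random walk centered at the current state.
    return rng.gauss(theta, STEP)


def log_q(a, b):
    # Log density of the random-walk proposal up to a constant; it is symmetric in a and b.
    return -0.5 * ((a - b) / STEP) ** 2


chain = metropolis_hastings(log_target, propose, log_q, 0.0, 20000, rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, var)  # should be roughly 0 and 1 for this toy target
```

Because the random-walk proposal here is symmetric, the two q terms cancel and the acceptance probability reduces to min(1, p(θ′)/p(θ(r))). The q terms only matter when the proposal favors one direction over the other.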

Gibbs

Gibbs sampling is a bit different from other sampling algorithms we’ve talked about. First, it only works in very particular cases. Pretty much, if you’ve constrained yourself to conjugate distributions, then it works. It also only works for multivariate distributions.

Let’s say our parameters are a vector $\theta = \langle \theta_1, \dots, \theta_D \rangle$. We’ll write $\theta_{-d}$ for $\theta$ without position $d$; namely, $\theta_{-d} = \langle \theta_1, \dots, \theta_{d-1}, \theta_{d+1}, \dots, \theta_D \rangle$.

Now, we have to assume that we can directly sample from the distribution $p(\theta_d \mid \theta_{-d})$ for all $d$. While this seems strong, it’s actually not that unheard of. We’ll shortly see an example.

The Gibbs sampler works as follows:

  1. Initialize $\theta^{(1)}$
  2. For $r = 2 \dots R$,

     (a) Set $\theta^{(r)} = \theta^{(r-1)}$

     (b) For each dimension $d$,
         i. Sample $\theta^{(r)}_d \sim p(\theta^{(r)}_d \mid \theta^{(r)}_{-d})$
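To make the sweep concrete, here is a minimal sketch of the Gibbs sampler on a toy target that is not part of these notes: a zero-mean, unit-variance bivariate normal with correlation `rho`. For that target each conditional is itself normal, with mean `rho` times the other coordinate and variance `1 - rho**2`, so it can be sampled directly. The function name, the value of `rho` and the seed are assumptions chosen for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, num_samples, rng, init=(0.0, 0.0)):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal.

    Each conditional is N(rho * other, 1 - rho**2), so it can be drawn directly.
    """
    cond_std = math.sqrt(1.0 - rho * rho)
    samples = [list(init)]                   # theta(1)
    for _ in range(2, num_samples + 1):      # r = 2 ... R
        theta = list(samples[-1])            # (a) start from a copy of theta(r-1)
        for d in range(2):                   # (b) sweep over the dimensions
            other = theta[1 - d]             # theta_{-d}, using the newest values
            theta[d] = rng.gauss(rho * other, cond_std)
        samples.append(theta)
    return samples


rng = random.Random(1)
draws = gibbs_bivariate_normal(rho=0.5, num_samples=20000, rng=rng)
n = len(draws)
mean0 = sum(t[0] for t in draws) / n
mean1 = sum(t[1] for t in draws) / n
cov01 = sum((t[0] - mean0) * (t[1] - mean1) for t in draws) / n
print(mean0, mean1, cov01)  # should be roughly 0, 0 and rho
```

Note that the inner loop conditions on the most recent value of the other coordinate, which is what step (a) followed by the in-place updates in step (b) achieves. There is no accept/reject step: every draw from the exact conditional is kept.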

Latent Dirichlet Allocation

We’ll now explore one particular model: LDA. LDA is a probabilistic model of text. It posits that a document is composed of a mixture of topics. E.g., we might have a sports topic, an entertainment topic, a tech topic and a science topic. Something about Blu-ray might be a mix of tech and science.

The model is formally specified as follows. We have a vocabulary of $V$ words. We have $K$ “topics.” For each topic $k$, there is a multinomial $\beta_k$ over the $V$ words. The $\beta$s have a symmetric Dirichlet prior with concentration $\eta$. The corpus consists of $D$ documents. Each word $w_{dn}$ in document $d$ is assigned a discrete latent variable $z_{dn}$ that specifies which topic ($1 \dots K$) this word comes from. Each document $d$ has its own mixture parameter $\theta_d$, a $K$-dimensional multinomial with a symmetric Dirichlet prior whose concentration $\alpha$ is shared globally.

The generative model is as follows:

  1. Choose $\alpha \sim \mathrm{Uni}(0, 10)$
  2. For each topic $k = 1 \dots K$,

     (a) Choose $\beta_k \sim \mathrm{Dir}(\eta, \dots, \eta)$

  3. For each document $d = 1 \dots D$,

     (a) Choose $\theta_d \sim \mathrm{Dir}(\alpha, \dots, \alpha)$

     (b) For each word $n = 1 \dots N$,
         i. Choose topic $z_{dn} \sim \mathrm{Mult}(\theta_d)$
         ii. Choose word $w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})$

Or, written as a hierarchical model:

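$$
\begin{aligned}
\alpha &\sim \mathrm{Uni}(0, 10) \\
\beta_k \mid \eta &\sim \mathrm{Dir}(\eta, \dots, \eta) \\
\theta_d \mid \alpha &\sim \mathrm{Dir}(\alpha, \dots, \alpha) \\
z_{dn} \mid \theta_d &\sim \mathrm{Mult}(\theta_d) \\
w_{dn} \mid z_{dn}, \beta &\sim \mathrm{Mult}(\beta_{z_{dn}})
\end{aligned}
$$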
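The generative story can be simulated directly, which is a handy way to check one's reading of the model. The sketch below uses only the Python standard library. The helper names, the corpus sizes, the value of η and the seed are assumptions for illustration, and the symmetric Dirichlet draw uses the standard construction of normalizing independent Gamma variates.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_discrete(probs, rng):
    """Draw one index from a discrete distribution (a single-trial multinomial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(V, K, D, N, eta, rng):
    """Follow the generative model: alpha first, then topics, then documents."""
    alpha = rng.uniform(0.0, 10.0)                             # alpha ~ Uni(0, 10)
    beta = [sample_dirichlet(eta, V, rng) for _ in range(K)]   # beta_k ~ Dir(eta, ..., eta)
    theta, z, w = [], [], []
    for _ in range(D):
        theta_d = sample_dirichlet(alpha, K, rng)              # theta_d ~ Dir(alpha, ..., alpha)
        z_d = [sample_discrete(theta_d, rng) for _ in range(N)]   # z_dn ~ Mult(theta_d)
        w_d = [sample_discrete(beta[k], rng) for k in z_d]        # w_dn ~ Mult(beta_{z_dn})
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return alpha, beta, theta, z, w


# Small hypothetical sizes: 6 word types, 2 topics, 3 documents of 5 words each.
rng = random.Random(0)
alpha, beta, theta, z, w = generate_corpus(V=6, K=2, D=3, N=5, eta=0.5, rng=rng)
print(alpha)
print(w)  # each document is a list of word indices in 0 .. V-1
```

Topics and words are indexed from 0 here rather than from 1 as in the notes. In inference the direction is reversed: only the words w are observed, and the sampler has to recover α, β, θ and z.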

It turns out we can construct a simple Gibbs sampler for this model. A useful fact for Gibbs samplers is that if we have a graphical model, then $p(\theta_d \mid \theta_{-d})$ only depends on the Markov blanket of $d$. The Markov blanket contains $d$’s parents, $d$’s children, and the parents of $d$’s children.
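The Markov blanket definition is easy to check mechanically. The sketch below encodes a tiny instance of the LDA graph (one document with two words, an assumption made to keep the example small) as a dictionary from each node to its parents, and treats β as a single node as in the hierarchical specification. The node names and the function `markov_blanket` are hypothetical.

```python
def markov_blanket(parents, node):
    """Return the parents, children, and children's other parents of `node`.

    `parents` maps each node name to the list of its parents in the DAG.
    """
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, []))     # the node's parents
    blanket.update(children)                 # the node's children
    for child in children:
        blanket.update(parents[child])       # the parents of its children
    blanket.discard(node)                    # a node is not in its own blanket
    return blanket


# Tiny LDA instance: one document (d = 1) with two words (n = 1, 2).
lda_parents = {
    "eta": [],
    "alpha": [],
    "beta": ["eta"],
    "theta_1": ["alpha"],
    "z_11": ["theta_1"],
    "z_12": ["theta_1"],
    "w_11": ["z_11", "beta"],
    "w_12": ["z_12", "beta"],
}

print(sorted(markov_blanket(lda_parents, "alpha")))
print(sorted(markov_blanket(lda_parents, "beta")))
print(sorted(markov_blanket(lda_parents, "z_11")))
```

For this graph the blanket of `alpha` contains only `theta_1`, and the blanket of `beta` contains `eta` together with the word and topic-assignment nodes.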

Using the Markov blanket, we can easily see that α depends only on the θs, βk depends only on η, the ws and the zs, etc. We get the following Gibbs updates: