

Notes on Markov chain Monte Carlo (MCMC) and Gibbs sampling, two techniques used in machine learning for drawing samples from probability distributions. MCMC maintains a "random walk" of parameters over parameter space, while Gibbs sampling applies only to multivariate distributions and requires sampling directly from the conditional distribution p(θd | θ−d). Both methods aim to move towards high-probability regions without unnecessarily favoring particular parameter values.
Typology: Study notes


Machine Learning (CS 5350/CS 6350) 05 Apr 2007
The key problem with (the good) Monte Carlo techniques is that they require a proposal distribution q that is good everywhere. What we’d like to do is to have a proposal distribution q that is good locally. Of course, the question is “locally to what?”
In Markov Chain Monte Carlo techniques, we maintain a “random walk” of parameters over parameter space. That is, instead of drawing a bunch of samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(R)}$ independently from a proposal distribution, we will draw $\theta^{(r+1)}$ conditioned on $\theta^{(r)}$. This introduces a few new problems (samples are no longer independent), but will enable us to define a better proposal distribution.
Metropolis-Hastings
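Let $q(\theta' \mid \theta)$ be a proposal distribution. MH works as follows:
(a) Draw $\theta'$ according to $q(\theta' \mid \theta^{(r)})$
(b) Compute acceptance probability:
$$a = \min\left\{ 1,\; \frac{p(\theta')\, q(\theta^{(r)} \mid \theta')}{p(\theta^{(r)})\, q(\theta' \mid \theta^{(r)})} \right\}$$
(c) With probability $a$, accept and set $\theta^{(r+1)}$ to $\theta'$; otherwise, set it to $\theta^{(r)}$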
The idea behind the acceptance probability is as follows. We want to move to $\theta'$ if it has high probability under $p$; we want to stay at $\theta^{(r)}$ if it has high probability under $p$. If $q$ unnecessarily favors $\theta'$, we don’t want to move there; if it unnecessarily favors $\theta^{(r)}$, we want to move away.
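The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the course: the target (a standard normal, known up to a constant), the Gaussian random-walk proposal, the step size `STEP`, and all function names are hypothetical choices. The ratio is computed in log space to avoid underflow.

```python
import math
import random

def metropolis_hastings(log_p, propose, log_q, theta0, num_samples, rng):
    """Run a Metropolis-Hastings chain.

    log_p(theta)        : unnormalized log target density
    propose(theta, rng) : draws theta' from q(. | theta)
    log_q(a, b)         : log q(a | b), up to a constant shared by both directions
    """
    theta = theta0
    chain = []
    for _ in range(num_samples):
        # (a) draw a candidate from the proposal, conditioned on the current state
        theta_new = propose(theta, rng)
        # (b) log of p(theta') q(theta | theta') / (p(theta) q(theta' | theta))
        log_ratio = (log_p(theta_new) + log_q(theta, theta_new)
                     - log_p(theta) - log_q(theta_new, theta))
        accept_prob = math.exp(min(0.0, log_ratio))  # a = min{1, ratio}
        # (c) accept with probability a, otherwise keep the current state
        if rng.random() < accept_prob:
            theta = theta_new
        chain.append(theta)
    return chain

# Assumed example: standard normal target, Gaussian random-walk proposal.
STEP = 1.0

def log_p(theta):
    return -0.5 * theta * theta

def propose(theta, rng):
    return rng.gauss(theta, STEP)

def log_q(a, b):
    # Symmetric proposal, so this term cancels in the ratio; kept for generality.
    return -0.5 * ((a - b) / STEP) ** 2

rng = random.Random(0)
chain = metropolis_hastings(log_p, propose, log_q, 0.0, 20000, rng)
kept = chain[2000:]  # discard an initial stretch as burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
```

With a symmetric proposal like this one the $q$ terms cancel and the rule reduces to the plain Metropolis ratio $p(\theta')/p(\theta^{(r)})$; the `log_q` argument only matters for asymmetric proposals.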
Gibbs
Gibbs sampling is a bit different from other sampling algorithms we’ve talked about. First, it only works in very particular cases. Pretty much, if you’ve constrained yourself to conjugate distributions, then it works. It also only works for multivariate distributions.
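Let’s say our parameters are a vector $\theta = \langle \theta_1, \ldots, \theta_D \rangle$. We’ll write $\theta_{-d}$ for $\theta$ without position $d$; namely, $\theta_{-d} = \langle \theta_1, \ldots, \theta_{d-1}, \theta_{d+1}, \ldots, \theta_D \rangle$.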
Now, we have to assume that we can directly sample from the distribution p(θd | θ−d) for all d. While this seems strong, it’s actually not that unheard of. We’ll shortly see an example.
The Gibbs sampler works as follows:
(a) Set $\theta^{(r)} = \theta^{(r-1)}$
(b) For each dimension $d$,
    i. Sample $\theta_d^{(r)} \sim p(\theta_d^{(r)} \mid \theta_{-d}^{(r)})$
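As a concrete illustration of the sweep above, here is a minimal sketch for a case where the conditionals are known in closed form. The target is an assumption chosen for the example, not something from these notes: a zero-mean, unit-variance bivariate normal with correlation `rho`, for which each coordinate given the other is normal with mean `rho * other` and variance `1 - rho**2`.

```python
import math
import random

def gibbs_bivariate_normal(rho, num_sweeps, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    cond_sd = math.sqrt(1.0 - rho * rho)  # std. dev. of each conditional
    theta = list(start)
    chain = []
    for _ in range(num_sweeps):
        # (a) theta^(r) starts as theta^(r-1): we simply update theta in place
        # (b) resample each dimension from its conditional given the others,
        #     using the already-updated value of the other coordinate
        for d in range(2):
            other = theta[1 - d]
            theta[d] = rng.gauss(rho * other, cond_sd)
        chain.append(tuple(theta))
    return chain

rng = random.Random(0)
chain = gibbs_bivariate_normal(0.8, 20000, rng)
kept = chain[1000:]  # discard an initial stretch as burn-in
n = len(kept)
mean_x = sum(x for x, _ in kept) / n
mean_y = sum(y for _, y in kept) / n
var_x = sum((x - mean_x) ** 2 for x, _ in kept) / n
var_y = sum((y - mean_y) ** 2 for _, y in kept) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in kept) / n
corr = cov_xy / math.sqrt(var_x * var_y)  # should be close to rho
```

Note that there is no accept/reject step: every draw from an exact conditional is kept.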
Latent Dirichlet Allocation
We’ll now explore one particular model: LDA. LDA is a probabilistic model of text. It posits that a document is composed of a mixture of topics. E.g., we might have a sports topic, an entertainment topic, a tech topic and a science topic. A document about Blu-ray might be a mix of tech and science.
The model is formally specified as follows. We have a vocabulary of $V$ words. We have $K$ “topics.” For each topic $k$, there is a multinomial $\beta_k$ over the $V$ words. The $\beta$s have a symmetric Dirichlet prior with concentration $\eta$. The corpus consists of $D$ documents. Each word $w_{dn}$ in document $d$ is assigned a discrete latent variable $z_{dn}$ that specifies which topic ($1 \ldots K$) this word comes from. Each document $d$ has a mixture parameter $\theta_d$ that is a $K$-dimensional multinomial with a symmetric (global) Dirichlet prior with concentration $\alpha$.
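The generative model is as follows:
(a) For each topic $k = 1 \ldots K$, choose $\beta_k \sim \text{Dir}(\eta, \ldots, \eta)$
(b) For each document $d = 1 \ldots D$,
    i. Choose $\theta_d \sim \text{Dir}(\alpha, \ldots, \alpha)$
    ii. For each word $n = 1 \ldots N$,
        A. Choose topic $z_{dn} \sim \text{Mult}(\theta_d)$
        B. Choose word $w_{dn} \sim \text{Mult}(\beta_{z_{dn}})$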
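The generative story can be run forward to produce a synthetic corpus, which is a handy way to check one's reading of the model. The sketch below uses only the Python standard library; the function names and the particular sizes passed at the bottom are illustrative assumptions. Dirichlet draws are obtained by normalizing independent gamma draws, and every document is given the same length `N` for simplicity.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Symmetric Dirichlet draw via normalized Gamma(concentration, 1) variables."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw an index in 0..len(probs)-1 with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(V, K, D, N, alpha, eta, rng):
    # One distribution over the V words for each of the K topics
    beta = [sample_dirichlet(eta, V, rng) for _ in range(K)]
    theta, z, w = [], [], []
    for _ in range(D):
        # Per-document topic mixture
        theta_d = sample_dirichlet(alpha, K, rng)
        # For each word position: pick a topic, then pick a word from that topic
        z_d = [sample_categorical(theta_d, rng) for _ in range(N)]
        w_d = [sample_categorical(beta[k], rng) for k in z_d]
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return beta, theta, z, w

rng = random.Random(0)
beta, theta, z, w = generate_corpus(V=20, K=3, D=5, N=50,
                                    alpha=0.5, eta=0.5, rng=rng)
```

Words and topics are represented as integer indices here; in inference only `w` would be observed, while `beta`, `theta` and `z` are the quantities a sampler has to recover.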
Or, written as a hierarchical model:
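$\alpha \sim \text{Uni}(0, 10)$
$\beta_k \mid \eta \sim \text{Dir}(\eta, \ldots, \eta)$
$\theta_d \mid \alpha \sim \text{Dir}(\alpha, \ldots, \alpha)$
$z_{dn} \mid \theta_d \sim \text{Mult}(\theta_d)$
$w_{dn} \mid z_{dn}, \beta \sim \text{Mult}(\beta_{z_{dn}})$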
It turns out we can construct a simple Gibbs sampler for this model. A useful fact for Gibbs samplers is that if we have a graphical model, then p(θd | θ−d) only depends on the Markov blanket of d. The Markov blanket contains d’s parents, d’s children, and the parents of d’s children.
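The Markov blanket can be read off mechanically from a parent list. The sketch below is an illustration, with the LDA graph written in a collapsed form that is an assumption of this example: one node per variable type (the plates over $k$, $d$ and $n$ are not unrolled), and node names chosen for readability.

```python
def markov_blanket(parents, node):
    """Return the parents, children, and children's other parents of `node`.

    `parents` maps each node name to the list of its parents in the DAG.
    """
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, ()))
    blanket.update(children)
    for child in children:
        blanket.update(parents[child])  # co-parents of each child
    blanket.discard(node)               # a node is not in its own blanket
    return blanket

# Collapsed LDA graph (assumed encoding): alpha -> theta -> z -> w <- beta <- eta
lda_parents = {
    "alpha": [],
    "eta": [],
    "theta": ["alpha"],
    "beta": ["eta"],
    "z": ["theta"],
    "w": ["z", "beta"],
}

blanket_alpha = markov_blanket(lda_parents, "alpha")
blanket_beta = markov_blanket(lda_parents, "beta")
blanket_z = markov_blanket(lda_parents, "z")
```

In this encoding the blanket of `alpha` is just `theta`, and the blanket of `beta` is `eta`, `w` and `z`, since `z` is a co-parent of `w`.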
Using the Markov blanket, we can easily see that α depends only on the θs, βk depends only on η, the ws and the zs, etc. We get the following Gibbs updates: