Markov Chain Monte Carlo and Gibbs Sampling in Machine Learning, Study notes of Computer Science

University of Utah (The U), Computer Science

Notes on Markov chain Monte Carlo (MCMC) and Gibbs sampling, two techniques used in machine learning for drawing samples from probability distributions. MCMC maintains a "random walk" of parameters over parameter space, while Gibbs sampling only works for multivariate distributions and requires sampling directly from each conditional distribution p(θd | θ−d). The Metropolis-Hastings acceptance rule moves toward high-probability regions while correcting for a proposal that unduly favors certain parameter values.

Typology: Study notes

Pre 2010

Uploaded on 08/31/2009

koofers-user-xc2-1 🇺🇸


Machine Learning (CS 5350/CS 6350) 05 Apr 2007

Bayesian inference II

The key problem with (the good) Monte Carlo techniques is that they require a proposal distribution q that is good everywhere. What we’d like to do is to have a proposal distribution q that is good locally. Of course, the question is “locally to what?”

In Markov Chain Monte Carlo techniques, we maintain a “random walk” of parameters over parameter space. That is, instead of drawing a bunch of samples $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(R)}$ independently from a proposal distribution, we will draw $\theta^{(r+1)}$ conditioned on $\theta^{(r)}$. This introduces a few new problems (samples are no longer independent), but will enable us to define a better proposal distribution.

Metropolis-Hastings

Let $q(\theta' \mid \theta)$ be a proposal distribution. MH works as follows:

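1. Initialize $\theta^{(1)}$.
2. For $r = 1, \dots, R - 1$:
   (a) Draw $\theta'$ according to $q(\theta' \mid \theta^{(r)})$.
   (b) Compute the acceptance probability:
       $a = \min\left\{ 1, \dfrac{p(\theta')\, q(\theta^{(r)} \mid \theta')}{p(\theta^{(r)})\, q(\theta' \mid \theta^{(r)})} \right\}$
   (c) Accept with probability $a$. If accepted, set $\theta^{(r+1)}$ to $\theta'$; otherwise, set $\theta^{(r+1)}$ to $\theta^{(r)}$.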

The idea behind the acceptance probability is as follows. We want to move to $\theta'$ if it has high probability under $p$; we want to stay at $\theta^{(r)}$ if it has high probability under $p$. If $q$ unnecessarily favors $\theta'$, we don’t want to move there; if it unnecessarily favors $\theta^{(r)}$, we want to move away.
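As a minimal sketch of the Metropolis-Hastings steps, assuming a one-dimensional standard normal target and a Gaussian random-walk proposal (both chosen only for illustration; the names `metropolis_hastings`, `log_p`, `propose` and `log_q` are hypothetical and not from the notes), the loop could look like this. Working with logs avoids underflow, and because this proposal is symmetric the two q terms cancel, so the ratio reduces to p(θ′)/p(θ(r)).

```python
import math
import random


def metropolis_hastings(log_p, propose, log_q, theta_init, num_samples, rng):
    """Run Metropolis-Hastings and return the chain theta^(1), ..., theta^(R).

    log_p(theta)        -- log of the (possibly unnormalized) target density
    propose(theta, rng) -- draws theta' from q(. | theta)
    log_q(a, b)         -- log q(a | b), up to a constant
    """
    samples = [theta_init]                      # 1. initialize theta^(1)
    for _ in range(num_samples - 1):            # 2. for r = 1 .. R-1
        current = samples[-1]
        candidate = propose(current, rng)       # (a) draw theta'
        # (b) log of p(theta') q(theta^(r)|theta') / (p(theta^(r)) q(theta'|theta^(r)))
        log_ratio = (log_p(candidate) + log_q(current, candidate)
                     - log_p(current) - log_q(candidate, current))
        accept_prob = math.exp(min(0.0, log_ratio))   # a = min(1, ratio)
        # (c) accept with probability a, otherwise repeat the current state
        if rng.random() < accept_prob:
            samples.append(candidate)
        else:
            samples.append(current)
    return samples


# Illustrative target (an assumption): standard normal, unnormalized.
def log_p(theta):
    return -0.5 * theta * theta


# Illustrative proposal (an assumption): Gaussian random walk with unit step.
def propose(theta, rng):
    return rng.gauss(theta, 1.0)


def log_q(a, b):
    return -0.5 * (a - b) ** 2  # symmetric in a and b, so it cancels in the ratio


rng = random.Random(0)
samples = metropolis_hastings(log_p, propose, log_q, 0.0, 20000, rng)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean {mean:.3f}, sample variance {var:.3f}")
```

Note that a rejected proposal still adds a sample: the chain repeats its current state, which is why consecutive samples are correlated rather than independent.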

Gibbs

Gibbs sampling is a bit different from other sampling algorithms we’ve talked about. First, it only works in very particular cases. Pretty much, if you’ve constrained yourself to conjugate distributions, then it works. It also only works for multivariate distributions.

Let’s say our parameters are a vector $\theta = \langle \theta_1, \dots, \theta_D \rangle$. We’ll write $\theta_{-d}$ for $\theta$ without position $d$; namely, $\theta_{-d} = \langle \theta_1, \dots, \theta_{d-1}, \theta_{d+1}, \dots, \theta_D \rangle$.

Now, we have to assume that we can directly sample from the distribution $p(\theta_d \mid \theta_{-d})$ for all $d$. While this seems strong, it’s actually not that unheard of. We’ll shortly see an example.

The Gibbs sampler works as follows:

1. Initialize $\theta^{(1)}$.
2. For $r = 2, \dots, R$:
   (a) Set $\theta^{(r)} = \theta^{(r-1)}$.
   (b) For each dimension $d$:
       i. Sample $\theta_d^{(r)} \sim p(\theta_d^{(r)} \mid \theta_{-d}^{(r)})$.
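To make the loop concrete, here is a small sketch under an assumed target that is not from the notes: a zero-mean, unit-variance bivariate normal with correlation `rho`. For that target each conditional is itself normal, $\theta_d \mid \theta_{-d} \sim \mathcal{N}(\rho\,\theta_{-d},\, 1 - \rho^2)$, so it can be sampled directly. The function name `gibbs_bivariate_normal` is hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, num_samples, rng):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal
    with correlation rho. Returns a list of [theta_1, theta_2] states."""
    cond_std = math.sqrt(1.0 - rho * rho)
    samples = [[0.0, 0.0]]                      # 1. initialize theta^(1)
    for _ in range(num_samples - 1):            # 2. for r = 2 .. R
        theta = list(samples[-1])               # (a) theta^(r) = theta^(r-1)
        for d in range(2):                      # (b) for each dimension d
            other = theta[1 - d]                # current value of theta_{-d}
            # theta_d | theta_{-d} ~ Normal(rho * other, 1 - rho^2)
            theta[d] = rng.gauss(rho * other, cond_std)
        samples.append(theta)
    return samples


rng = random.Random(0)
samples = gibbs_bivariate_normal(rho=0.8, num_samples=20000, rng=rng)
n = len(samples)
mean_1 = sum(s[0] for s in samples) / n
mean_2 = sum(s[1] for s in samples) / n
# Both variances are 1, so the covariance estimates the correlation rho.
cov_12 = sum((s[0] - mean_1) * (s[1] - mean_2) for s in samples) / n
print(f"means ({mean_1:.3f}, {mean_2:.3f}), covariance {cov_12:.3f}")
```

Because step (a) copies the previous state first, the second coordinate is drawn conditioned on the freshly updated first coordinate, not on its value from the previous iteration.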

Latent Dirichlet Allocation

We’ll now explore one particular model: LDA. LDA is a probabilistic model of text. It posits that a document is composed of a mixture of topics. E.g., we might have a sports topic, an entertainment topic, a tech topic and a science topic. Something about Blu-ray might be a mix of tech and science.

The model is formally specified as follows. We have a vocabulary of $V$ words. We have $K$ “topics.” For each topic $k$, there is a multinomial $\beta_k$ over the $V$ words. The $\beta$s have a symmetric Dirichlet prior with concentration $\eta$. The corpus consists of $D$ documents. Each word $w_{dn}$ in document $d$ is assigned a discrete latent variable $z_{dn}$ that specifies which topic ($1, \dots, K$) this word comes from. Each document $d$ has a mixture parameter $\theta_d$ that is a $K$-dimensional multinomial with a symmetric (global) Dirichlet prior with concentration $\alpha$.

The generative model is as follows:

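1. Choose $\alpha \sim \mathrm{Uni}(0, 10)$.
2. For each topic $k = 1, \dots, K$:
   (a) Choose $\beta_k \sim \mathrm{Dir}(\eta, \dots, \eta)$.
3. For each document $d = 1, \dots, D$:
   (a) Choose $\theta_d \sim \mathrm{Dir}(\alpha, \dots, \alpha)$.
   (b) For each word $n = 1, \dots, N$:
       i. Choose topic $z_{dn} \sim \mathrm{Mult}(\theta_d)$.
       ii. Choose word $w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})$.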
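The generative steps can be sketched directly in code. This is an illustration only, using the Python standard library; the helper names (`sample_dirichlet`, `sample_index`, `generate_lda_corpus`) and the sizes in the usage lines are assumptions, and topics and words are indexed from 0 rather than 1.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [g / total for g in draws]


def sample_index(probs, rng):
    """Draw an index in 0..len(probs)-1 with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_corpus(V, K, D, N, eta, rng):
    """Draw alpha, beta, theta, z and w following the generative process."""
    alpha = rng.uniform(0.0, 10.0)                            # 1. alpha ~ Uni(0, 10)
    beta = [sample_dirichlet([eta] * V, rng)                  # 2(a). beta_k ~ Dir(eta, ..., eta)
            for _ in range(K)]
    theta, z, w = [], [], []
    for _ in range(D):                                        # 3. for each document d
        theta_d = sample_dirichlet([alpha] * K, rng)          # (a) theta_d ~ Dir(alpha, ..., alpha)
        z_d = [sample_index(theta_d, rng) for _ in range(N)]  # (b) i.  z_dn ~ Mult(theta_d)
        w_d = [sample_index(beta[k], rng) for k in z_d]       # (b) ii. w_dn ~ Mult(beta_{z_dn})
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return alpha, beta, theta, z, w


rng = random.Random(0)
alpha, beta, theta, z, w = generate_lda_corpus(V=20, K=3, D=5, N=30, eta=0.5, rng=rng)
print("alpha:", round(alpha, 3))
print("first document's word ids:", w[0])
```

Only `w` would be observed in practice; `alpha`, `beta`, `theta` and `z` are the quantities a sampler has to recover.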

Or, written as a hierarchical model:
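One way to write it, reconstructed from the generative process rather than taken from the notes (the conditioning bars are an assumption about the layout):

```latex
\begin{align*}
\alpha &\sim \mathrm{Uni}(0, 10) \\
\beta_k \mid \eta &\sim \mathrm{Dir}(\eta, \dots, \eta) & k &= 1, \dots, K \\
\theta_d \mid \alpha &\sim \mathrm{Dir}(\alpha, \dots, \alpha) & d &= 1, \dots, D \\
z_{dn} \mid \theta_d &\sim \mathrm{Mult}(\theta_d) & n &= 1, \dots, N \\
w_{dn} \mid z_{dn}, \beta &\sim \mathrm{Mult}(\beta_{z_{dn}})
\end{align*}
```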

It turns out we can construct a simple Gibbs sampler for this model. A useful fact for Gibbs samplers is that if we have a graphical model, then $p(\theta_d \mid \theta_{-d})$ only depends on the Markov blanket of $d$. The Markov blanket contains $d$’s parents, $d$’s children, and the parents of $d$’s children.

Using the Markov blanket, we can easily see that $\alpha$ depends only on the $\theta$s, $\beta_k$ depends only on $\eta$, the $w$s and the $z$s, etc. We get the following Gibbs updates:
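The update equations themselves are not included in this copy. As a hedged sketch, not the notes' own derivation, standard Dirichlet-multinomial conjugacy gives these conditionals: $p(z_{dn} = k \mid \cdot) \propto \theta_{dk}\, \beta_{k, w_{dn}}$; $\theta_d \mid \cdot \sim \mathrm{Dir}(\alpha + \text{topic counts in } d)$; and $\beta_k \mid \cdot \sim \mathrm{Dir}(\eta + \text{word counts for topic } k)$. The code below holds $\alpha$ and $\eta$ fixed as a simplifying assumption (the notes also resample $\alpha$), and the toy corpus and the names `sample_dirichlet` and `gibbs_sweep` are made up for illustration.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [g / total for g in draws]


def gibbs_sweep(w, z, theta, beta, alpha, eta, rng):
    """One in-place sweep over z, theta and beta (alpha and eta held fixed)."""
    K, V = len(beta), len(beta[0])
    for d, doc in enumerate(w):
        # z_dn | theta_d, beta, w_dn: p(z_dn = k) proportional to theta_dk * beta_k[w_dn]
        for n, word in enumerate(doc):
            weights = [theta[d][k] * beta[k][word] for k in range(K)]
            z[d][n] = rng.choices(range(K), weights=weights, k=1)[0]
        # theta_d | z_d, alpha ~ Dir(alpha + number of words in d assigned to each topic)
        topic_counts = [z[d].count(k) for k in range(K)]
        theta[d] = sample_dirichlet([alpha + c for c in topic_counts], rng)
    # beta_k | z, w, eta ~ Dir(eta + number of times each word is assigned to topic k)
    word_counts = [[0] * V for _ in range(K)]
    for d, doc in enumerate(w):
        for n, word in enumerate(doc):
            word_counts[z[d][n]][word] += 1
    for k in range(K):
        beta[k] = sample_dirichlet([eta + c for c in word_counts[k]], rng)


# Toy corpus (made up for illustration): word ids 0-2 and 3-5 never co-occur.
w = [[0, 1, 2, 0, 1, 2], [3, 4, 5, 3, 4, 5], [0, 0, 1, 2, 2, 1], [5, 4, 3, 3, 5, 4]]
K, V, alpha, eta = 2, 6, 1.0, 0.5
rng = random.Random(0)
z = [[rng.randrange(K) for _ in doc] for doc in w]   # random initial topic assignments
theta = [[1.0 / K] * K for _ in w]                   # uniform initial mixtures
beta = [[1.0 / V] * V for _ in range(K)]             # uniform initial topics
for _ in range(200):
    gibbs_sweep(w, z, theta, beta, alpha, eta, rng)
print("topic assignments:", z)
```

Each update reads only the variable's Markov blanket: `z[d][n]` needs `theta[d]`, `beta` and the word itself; `theta[d]` needs `alpha` and the `z`s of document `d`; `beta[k]` needs `eta` and every word currently assigned to topic `k`.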
