Markov Chain Monte Carlo (MCMC) Sampling for Approximate Counting and Sampling, Study notes of Approximation Algorithms

These notes cover Markov chain Monte Carlo (MCMC) approaches for sampling approximately from a given distribution: the motivation behind MCMC methods, the properties of random walks and Markov chains, and the analysis of mixing times. They also cover canonical paths and coupling techniques for bounding mixing times.

CS880: Approximation Algorithms
Scribe: Dave Andrzejewski Lecturer: Shuchi Chawla
Topic: Approx counting/sampling, MCMC methods Date: 4/24/07


The previous lecture showed that, for self-reducible problems, the problem of estimating the size of the set of feasible solutions is equivalent to the problem of sampling nearly uniformly from that set. This lecture explores the applications of that result by developing techniques for sampling from a uniform distribution. Specifically, this lecture introduces the concept of Markov Chain Monte Carlo (MCMC) sampling approaches.

25.1 Markov Chain Monte Carlo (MCMC)

25.1.1 Problem motivation

There are many situations where we wish to sample from a given distribution, but it is not immediately clear how to do so. For example, we may have a function that gives the probability of an event up to a normalization factor, but no way to calculate that normalization factor. Or the sample space of possible outcomes may be exponentially large, or even infinite.

25.1.2 Random walks and their properties

We approach this problem by considering our state space to be a graph, with individual events as nodes on this graph. If we can define this graph in a way such that the stationary distribution of a random walk over this graph is equal to our target distribution that we wish to draw samples from, then samples from a random walk on this graph can be used to approximate samples from the target distribution. We introduce the following definitions:

  • Ω = the state space
  • n = |Ω|
  • P = the transition matrix, P_ij = Pr[i → j]
  • π = a distribution on the nodes in Ω

Then, if we start from a node chosen from the distribution π and take a single step according to our transition matrix P, we get the distribution πP. Note that a random walk obeying these definitions is memoryless. That is, the next step in the walk depends only on the current state, and not on any history beyond that. This is also known as the Markov property, and our random walk is an example of a Markov chain.
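As a minimal sketch of these definitions, the snippet below computes the one-step distribution πP and simulates a memoryless walk. The 3-state matrix `P` and the helper names `push_forward` and `walk` are illustrative assumptions, not part of the lecture.

```python
import random

def push_forward(pi, P):
    """Return the row vector pi P: the distribution after one step."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def walk(P, start, steps, rng):
    """Simulate the chain; each move looks only at the current state."""
    state, path = start, [start]
    for _ in range(steps):
        # Row P[state] holds the transition probabilities out of `state`.
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append(state)
    return path

# Hypothetical 3-state chain (each row sums to 1).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

pi = [1.0, 0.0, 0.0]           # start at state 0 with probability 1
print(push_forward(pi, P))     # distribution after one step
print(walk(P, 0, 10, random.Random(0)))
```

Note that `walk` never consults `path` when choosing the next state, which is exactly the Markov property.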

For our random walk, there are two quantities of particular interest. The first is the stationary distribution $\pi^*$. This is a distribution over all states with the special property that $\pi^* P = \pi^*$. If we follow a random walk on a Markov chain, after a while we expect our position to be distributed according to $\pi^*$: when the walk converges, $\pi^*$ is the limit distribution of the location of the walk as the number of steps taken goes to infinity. There are special properties of the chain which are required to guarantee the existence and uniqueness of $\pi^*$, and these will be introduced shortly.

The second quantity of interest is the mixing time $\tau_\epsilon$, which is a measure of how long a random walk on the graph will take to converge to $\pi^*$. This will be defined more formally.

To illustrate these concepts, we will study the simple example of a $d$-regular undirected graph (all vertices have degree $d$). We define the transition probabilities from each vertex to be uniform over all outgoing edges. That is, each edge is taken with probability $1/d$. We then make the following claims.

Claim 25.1.1 For a uniform random walk over our $d$-regular graph, $\pi^*$ is uniform.

Claim 25.1.2 For the directed version of our graph with $d_{in} = d_{out} = d$, $\pi^*$ is uniform.

Claim 25.1.3 If $G$ is undirected, $\pi^*(v) = \frac{d(v)}{2m}$, where $d(v)$ is the degree of $v$ and $m$ is the number of edges.

To see this last claim, consider that we can also define our random walk as a distribution over all edges. Let $Q(u, v)$ be the probability of taking the edge $(u, v)$. Then:

$$Q(u, v) = \pi^*(u) P_{uv} \tag{25.1.1}$$

$$= \frac{d(u)}{2m} \cdot \frac{1}{d(u)}$$

$$= \frac{1}{2m}$$

Summing this over all $d(v)$ neighbors $u$ of $v$ gives $\sum_u \pi^*(u) P_{uv} = \frac{d(v)}{2m} = \pi^*(v)$. So this $\pi^*$ satisfies $\pi^* P = \pi^*$ at every $v$ and is therefore a valid stationary distribution. The first two claims follow by a similar argument.
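The degree-proportional claim can be checked exactly on a small graph. The sketch below uses exact rational arithmetic; the example graph and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def degree_stationary(adj):
    """Candidate stationary distribution pi*(v) = d(v) / 2m."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2m
    return {v: Fraction(len(nbrs), two_m) for v, nbrs in adj.items()}

def step_distribution(pi, adj):
    """One step of the uniform random walk: returns pi P."""
    out = {v: Fraction(0) for v in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            out[v] += pi[u] / len(nbrs)   # P_uv = 1 / d(u)
    return out

# Hypothetical undirected graph: triangle 0-1-2 plus a pendant vertex 3 on 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

pi_star = degree_stationary(adj)
print(pi_star)
print(step_distribution(pi_star, adj) == pi_star)
```

Because the arithmetic is exact, the equality test checks $\pi^* P = \pi^*$ without any floating-point tolerance.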

But when can we know that a unique $\pi^*$ exists? Consider the two example graphs shown in Figure 25.1.2. For the first graph, if you start in the left node with probability 1, then your probability of being in that node is 1 at the start of iterations 1, 3, 5, ... and 0 at the start of iterations 2, 4, 6, ... Although the uniform distribution satisfies $\pi P = \pi$, not every random walk on the graph converges to it, as this example shows. Because the chain is periodic, the walk has no limiting distribution. For the second graph, you can reach two different stationary distributions starting from the center node, depending on which direction is taken on the first step. In this case there is no unique stationary distribution. These ideas are formalized in the following theorem:

Theorem 25.1.4 An aperiodic irreducible finite Markov chain is ergodic and has a unique stationary distribution.
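The periodic two-node example can be reproduced numerically. The sketch below assumes the first graph is a single edge between two nodes, and compares it with a "lazy" variant in which each node also has a self-loop taken with probability 1/2; the matrices and function names are illustrative.

```python
def push_forward(pi, P):
    """Return the row vector pi P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def evolve(pi, P, steps):
    """Return the list of distributions pi, pi P, pi P^2, ..."""
    history = [pi]
    for _ in range(steps):
        pi = push_forward(pi, P)
        history.append(pi)
    return history

periodic = [[0.0, 1.0], [1.0, 0.0]]   # always cross the single edge
lazy = [[0.5, 0.5], [0.5, 0.5]]       # assumed variant: self-loop w.p. 1/2

start = [1.0, 0.0]                    # left node with probability 1
print(evolve(start, periodic, 4))     # mass alternates between the nodes
print(evolve(start, lazy, 4))         # uniform after the first step
print(push_forward([0.5, 0.5], periodic))  # uniform is still stationary
```

The uniform distribution is a fixed point of both chains, but only the lazy chain converges to it from the point-mass start.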

A chain is aperiodic if, for every state, there is no integer greater than 1 that divides every step count at which the chain has non-zero probability of returning to that state. That is, given that you are in the state, there is no periodic pattern to when you can return (every second step, every third step, etc.). This can be a somewhat tricky notion to prove about a graph, but adding self-edges to all nodes

Figure 25.1.2: A graph with a slow mixing time (two parts, each labeled n/2).

The key property of the transition matrix which determines mixing time is the eigenvalue gap $\gamma$ between the principal and second eigenvalues. Assume that all eigenvalues $\lambda$ are real (which is the case for undirected graphs). Order the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$, and call the eigenvalue gap $\gamma = \lambda_1 - \lambda_2$.

We know that at least one of the $\lambda_i$'s is equal to 1, because by definition $\pi^*$ is an eigenvector of $P$ whenever a stationary distribution exists ($\pi^* P = \pi^*$). Since $P$ is a stochastic matrix, it is also easy to see that $|\lambda_i| \le 1$ for all $i$; therefore $\lambda_1 = 1$ and $\pi^*$ is the first eigenvector. It is also interesting to note that if $|\lambda_i| < 1$, then $v_i$ must have mixed-sign components and cannot be normalized to sum to 1. Also, $\lambda_2$ cannot be 1 unless the graph has more than one strongly connected component, in which case the stationary distribution is non-unique.

Also, the eigenvectors of $P$ are guaranteed to be orthogonal if $P$ is real and symmetric, which corresponds to a time-reversible Markov chain, where time-reversible means that $\pi^*(u) P_{uv} = \pi^*(v) P_{vu}$ for all $u, v$.
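To make the eigenvalue gap concrete, the sketch below estimates $\lambda_2$ for one symmetric chain by power iteration restricted to vectors orthogonal to the all-ones vector. The lazy walk on a cycle is an assumed example (it is not from the lecture), chosen because its matrix is symmetric and its eigenvalues are non-negative, so the iteration converges to $\lambda_2$.

```python
import math

def lazy_cycle(n):
    """Lazy walk on an n-cycle: stay w.p. 1/2, else move to a uniform neighbor."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 0.5
        P[i][(i - 1) % n] += 0.25
        P[i][(i + 1) % n] += 0.25
    return P

def second_eigenvalue(P, iters=2000):
    """Estimate lambda_2 of a symmetric P with non-negative eigenvalues."""
    n = len(P)
    v = [float(i) for i in range(n)]   # arbitrary non-constant start vector
    lam = 0.0
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]      # project out the lambda_1 = 1 direction
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        w = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        lam = sum(a * b for a, b in zip(v, w))   # Rayleigh quotient v P v^T
        v = w
    return lam

P = lazy_cycle(8)
lam2 = second_eigenvalue(P)
print("lambda_2 ~", lam2, " gap ~", 1.0 - lam2)
```

For this particular chain the eigenvalues are known in closed form, $\frac{1}{2} + \frac{1}{2}\cos(2\pi k/n)$, which gives a way to sanity-check the estimate.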

Theorem 25.1.5 $\tau_\epsilon \le O\left(\frac{1}{\gamma} \log\frac{n}{\epsilon}\right)$ for a time-reversible Markov chain.

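Proof: Consider the representation of a distribution over states $\pi$ in the basis of the eigenvectors, $\pi = \sum_i c_i v_i$.

$$\pi P = \Big(\sum_i c_i v_i\Big) P = \sum_i c_i \lambda_i v_i \tag{25.1.6}$$

$$\pi P^2 = \sum_i c_i \lambda_i^2 v_i \tag{25.1.7}$$

$$\pi P^t = \sum_i c_i \lambda_i^t v_i \tag{25.1.8}$$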

Observe that for $|\lambda_i| < 1$, $\lambda_i^t \to 0$ as $t \to \infty$. The lone exception is $i = 1$, since $|\lambda_1| = 1$. Thus we can see that $\pi P^t \to \pi^* = c_1 v_1$ as $t \to \infty$, where $c_1 = 1$ as $\pi^* = v_1$. We can express the distribution at time $t$ in terms of the stationary distribution and an error term.

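$$\pi P^t = \pi^* + \sum_{i>1} c_i \lambda_i^t v_i \tag{25.1.9}$$

$$\|\pi P^t - \pi^*\|_2^2 = \Big\|\sum_{i>1} c_i \lambda_i^t v_i\Big\|_2^2 \tag{25.1.10}$$

$$= \sum_{i>1} c_i^2 \lambda_i^{2t} \|v_i\|_2^2 \tag{25.1.11}$$

(the previous step uses the orthogonality of the $v_i$'s)

$$\le \lambda_2^{2t} \sum_{i>1} c_i^2 \|v_i\|_2^2 \tag{25.1.12}$$

$$\le \lambda_2^{2t} \tag{25.1.13}$$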

The final step can be seen by noting that $\sum_{i>1} c_i^2 \|v_i\|_2^2 \le \sum_i c_i^2 \|v_i\|_2^2 = \|\pi\|_2^2 \le 1$, because our original $\pi$ is a probability distribution.

Now we wish to use this $\ell_2$ bound to get an $\ell_1$ bound. It is a general result that, in $n$-dimensional space, $\|x\|_1 \le \sqrt{n}\,\|x\|_2$ (Figure 25.1.3). This result is essentially a restatement of the Cauchy-Schwarz inequality. Plugging this result into our bound from above, we get:

Figure 25.1.3: Visual representation of bounding the ℓ 1 -norm in terms of the ℓ 2 -norm.

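$$\|\pi P^t - \pi^*\|_1 \le \lambda_2^t \sqrt{n} \tag{25.1.14}$$

We want to pick $t$ such that $\lambda_2^t \sqrt{n} \le \epsilon$. Since $\lambda_2 = 1 - \gamma \le e^{-\gamma}$, it suffices to satisfy the second condition below.

$$(1 - \gamma)^t \sqrt{n} \le \epsilon \tag{25.1.15}$$

$$e^{-t\gamma} \sqrt{n} \le \epsilon \tag{25.1.16}$$

$$\frac{\sqrt{n}}{\epsilon} \le e^{t\gamma} \tag{25.1.17}$$

$$t \ge \frac{1}{\gamma} \log\Big(\frac{\sqrt{n}}{\epsilon}\Big)$$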
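As a rough numerical check of this bound, the sketch below computes $t = \lceil \frac{1}{\gamma}\log(\sqrt{n}/\epsilon) \rceil$ and measures the $\ell_1$ distance to uniform after $t$ steps. The chain (a lazy walk on an 8-cycle) and its closed-form gap $\gamma = \frac{1}{2} - \frac{1}{2}\cos(2\pi/n)$ are assumptions chosen for illustration; the function names are hypothetical.

```python
import math

def mixing_time_bound(gamma, n, eps):
    """Smallest integer t with t >= (1/gamma) * log(sqrt(n) / eps)."""
    return math.ceil(math.log(math.sqrt(n) / eps) / gamma)

def lazy_cycle(n):
    """Lazy walk on an n-cycle (symmetric, so pi* is uniform)."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 0.5
        P[i][(i - 1) % n] += 0.25
        P[i][(i + 1) % n] += 0.25
    return P

def l1_distance_after(P, start, t):
    """||pi P^t - pi*||_1 for a point-mass start, with pi* uniform."""
    n = len(P)
    pi = [0.0] * n
    pi[start] = 1.0
    for _ in range(t):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return sum(abs(p - 1.0 / n) for p in pi)

n, eps = 8, 0.01
gamma = 0.5 - 0.5 * math.cos(2 * math.pi / n)   # gap of the lazy n-cycle
t = mixing_time_bound(gamma, n, eps)
print("t =", t, " distance =", l1_distance_after(lazy_cycle(n), 0, t))
```

The measured distance should come out below $\epsilon$, typically with room to spare, since the derivation uses several loose inequalities.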

25.1.6 Coupling

The coupling technique works by running two Markov chains $X$ and $Y$ in parallel. The chain $X$ is started from some initial distribution $\pi$, while the chain $Y$ is started from the stationary distribution $\pi^*$. The evolution of the chains in time is then coupled, or linked. If it can be shown that for some $t$ we get $X^t = Y^t$, then by the Markov property the chain $X$ will have reached the stationary distribution $\pi^*$.

Consider the example of the random walk on the graph shown in Figure 25.1.4. From our earlier claims, we can immediately see the stationary distribution $\pi^*$ for this walk is uniform. So now consider our two random walks $X$ and $Y$, where their evolution is defined such that for each step:

  • X chooses a neighbor uniformly
  • Y goes in the same 'direction' as X

Figure 25.1.4: Coupling method example.

The key consequence of this scheme is that whenever one of the random walks is at an endpoint, the distance between the two points will be reduced by 1 if that walk takes the self-loop edge of that endpoint, which will happen with probability 1/2. This means that, no matter where the two chains started, we can guarantee that $X = Y$ after a self-loop edge has been taken $n$ times. The following lemma formalizes the relationship between mixing (small $\|X^t - \pi^*\|_1$) and coupling ($X^t = Y^t$).

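Lemma 25.1.10 $\|X^t - \pi^*\|_1 \le 2 \Pr[X^t \ne Y^t]$

Proof: Write $X_i^t = \Pr[X^t = i]$ and $Y_i^t = \Pr[Y^t = i]$. Since $Y$ starts from the stationary distribution, $Y_i^t = \pi^*(i)$ for every $t$, so $\|X^t - Y^t\|_1 = \|X^t - \pi^*\|_1$.

$$\Pr[X^t = Y^t] \le \sum_i \min\{X_i^t, Y_i^t\} \tag{25.1.24}$$

$$\|X^t - Y^t\|_1 = \sum_i \big(\max\{X_i^t, Y_i^t\} - \min\{X_i^t, Y_i^t\}\big) \tag{25.1.25}$$

$$= \sum_i \big((\max\{X_i^t, Y_i^t\} + \min\{X_i^t, Y_i^t\}) - 2\min\{X_i^t, Y_i^t\}\big) \tag{25.1.26}$$

$$= 2 - 2\sum_i \min\{X_i^t, Y_i^t\} \tag{25.1.27}$$

$$\le 2(1 - \Pr[X^t = Y^t]) \tag{25.1.28}$$

$$= 2 \Pr[X^t \ne Y^t] \tag{25.1.29}$$

Step (25.1.27) uses $\sum_i (\max\{X_i^t, Y_i^t\} + \min\{X_i^t, Y_i^t\}) = \sum_i (X_i^t + Y_i^t) = 2$.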
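The coupling scheme for the line graph can be simulated directly. This is a minimal sketch assuming the graph is a path on nodes 0..n-1 whose two endpoints each carry a self-loop taken with probability 1/2; for simplicity both walks start from fixed nodes (the worst case, opposite endpoints) rather than from $\pi$ and $\pi^*$. The function names are hypothetical.

```python
import random

def step(pos, direction, n):
    """Move one step on the path 0..n-1; stepping off an end is a self-loop."""
    return min(max(pos + direction, 0), n - 1)

def coupling_time(n, x0, y0, rng):
    """Run the coupled walks until they meet; return (steps, meeting node)."""
    x, y, t = x0, y0, 0
    while x != y:
        direction = rng.choice((-1, 1))   # X picks left or right uniformly
        new_x = step(x, direction, n)
        new_y = step(y, direction, n)     # Y copies X's direction
        assert abs(new_x - new_y) <= abs(x - y)   # distance never grows
        x, y, t = new_x, new_y, t + 1
    return t, x

rng = random.Random(0)
n = 10
times = [coupling_time(n, 0, n - 1, rng)[0] for _ in range(2000)]
print("average coupling time from opposite endpoints:", sum(times) / len(times))
```

The distance only shrinks when one walk is pinned at an endpoint, which is why the coupling time is governed by how long it takes to hit an endpoint.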

In order to complete the argument, we need to determine a $t$ such that $\Pr[X^t \ne Y^t] \le \epsilon$. The crux of this method is determining the hitting time $h(u, v)$ for each pair of points $(u, v)$, where $h(u, v)$ is the expected time to reach $v$, starting from $u$. This will give us the time it takes to go to an endpoint on the line starting from an arbitrary point in the middle.

For this example, the maximum $h(u, v)$ occurs when $u$ and $v$ are the two endpoints on opposite sides of the graph, $A$ and $B$. This can be shown to be $n^2$ by solving the following set of equations.

$$h(i, 0) = 1 + \tfrac{1}{2} h(i - 1, 0) + \tfrac{1}{2} h(i + 1, 0) \quad \forall\, 1 < i < n \tag{25.1.30}$$

$$h(1, 0) = 1 + \tfrac{1}{2} h(2, 0) \tag{25.1.31}$$

This O(n^2) hitting time in turn implies an O(1/ε, n^3) mixing time for this random walk. We omit the details of this step, which involves the use of Markov's inequality.
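The hitting-time equations can be solved exactly for small $n$ to see the quadratic growth. The sketch below assumes each move goes left or right with probability 1/2, and it adds one boundary condition that is not stated above: the far endpoint $n$ has a self-loop taken with probability 1/2, as in the coupling example. The solver is plain Gaussian elimination over rationals.

```python
from fractions import Fraction

def hitting_times_to_zero(n):
    """Return [h(0,0), h(1,0), ..., h(n,0)] on the path 0..n."""
    half = Fraction(1, 2)
    # Row k is the equation for node i = k + 1; unknowns are h(1,0)..h(n,0).
    A = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(1)] * n
    for k in range(n):
        A[k][k] = Fraction(1)
        if k > 0:
            A[k][k - 1] = -half      # step toward 0 (h(0,0) = 0 drops out)
        if k < n - 1:
            A[k][k + 1] = -half      # step away from 0
    A[n - 1][n - 1] -= half          # ASSUMPTION: self-loop at node n w.p. 1/2
    # Forward elimination.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            if factor:
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    # Back substitution.
    h = [Fraction(0)] * n
    for k in reversed(range(n)):
        tail = sum(A[k][j] * h[j] for j in range(k + 1, n))
        h[k] = (b[k] - tail) / A[k][k]
    return [Fraction(0)] + h

for n in (5, 10, 20):
    print(n, hitting_times_to_zero(n)[n])
```

Under the assumed boundary condition, the endpoint-to-endpoint hitting time grows quadratically in $n$, consistent with the $O(n^2)$ claim.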

The coupling analysis technique typically requires the underlying graph to have a nice, known structure. It can be applied to a number of Markov chains, such as electrical networks, card shuffling, and random graph colorings.
