Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Monte Carlo Methods: Rejection Sampling, Markov Chain Monte Carlo, and Applications, Essays (high school) of Mathematics

Yale University Mathematics

Monte carlo methods, focusing on rejection sampling and markov chain monte carlo (mcmc). Rejection sampling is a technique for generating samples from a distribution by proposing samples from a simpler distribution and accepting or rejecting them based on their likelihood ratio. Mcmc, on the other hand, is a more general method for generating samples from complex distributions by constructing a markov chain that converges to the desired distribution. The document also covers various applications of these methods, such as computing integrals, bayesian inference, and decoding population spike trains.

Typology: Essays (high school)

2011/2012

Uploaded on 03/05/2012

tarquin 🇺🇸

4.3

(15)

260 documents

1 / 35

This page cannot be seen from the preview

Don't miss anything!

Statistical analysis of neural data:

Monte Carlo techniques for decoding spike trains

Liam Paninski

Department of Statistics and Center for Theoretical Neuroscience

Columbia University

http://www.stat.columbia.edu/∼liam

March 31, 2009

Discover Essays (high school) of Mathematics Yale University

Partial preview of the text

Download Monte Carlo Methods: Rejection Sampling, Markov Chain Monte Carlo, and Applications and more Essays (high school) Mathematics in PDF only on Docsity!

Statistical analysis of neural data:

Monte Carlo techniques for decoding spike trains

Liam Paninski

Department of Statistics and Center for Theoretical Neuroscience

Columbia University

http://www.stat.columbia.edu/∼liam

March 31, 2009

Contents

1 Often we need to integrate over the posterior, not just maximize it 3

2 “Monte Carlo” techniques provide a tractable method for computing integrals 3 2.1 Example: decoding hippocampal place field activity (the static case)......... 4

3 The “importance sampling” idea extends the range of the basic Monte Carlo technique, but is only effective in low dimensions 5

4 Rejection sampling is a basic building block of more general, powerful Monte Carlo methods 6

5 “Rao-Blackwellization”: it is wise to compute as much as possible analytically 7

6 Markov chain Monte Carlo (MCMC) provides a general-purpose integration technique 8 6.1 The Metropolis-Hastings sampling algorithm is a fundamental tool for constructing Markov chains with the correct equilibrium density................... 9 6.2 Choosing the proposal density is a key step in optimizing the efficiency of Metropolis- Hastings........................................... 10 6.2.1 Non-isotropic random-walk Metropolis is the simplest effective proposal... 11 6.2.2 Hybrid (Hamiltonian) Monte Carlo typically mixes much faster than random- walk Metropolis.................................. 13 6.2.3 The “Gibbs sampler” is a key special case of the Metropolis-Hastings algorithm 15 6.2.4 The hit-and-run algorithm performs Gibbs sampling in random directions.. 16 6.3 Comparing mixing speed in the log-concave case: for smooth densities, hybrid meth- ods are best; for densities with “corners,” hit-and-run is more effective........ 17 6.4 Example: comparing MAP and Monte Carlo stimulus decoding given GLM obser- vations............................................ 19

7 Gibbs sampling is useful in a wide array of applications 24 7.1 Example: sampling network spike trains given the observed activity of a subset of neurons in the population................................. 25 7.2 Blockwise sampling is often more efficient than Gibbs sampling each variable indi- vidually........................................... 25 7.3 Example: computing errorbars in low-rank stimulus encoding models......... 27 7.4 Example: Bayesian adaptive regression splines for time-varying firing rate estimation 28 7.5 Example: estimating hierarchical network spike train models............. 29 7.6 Example: “slice sampling” provides another easy way to sample from a unimodal density............................................ 29 7.7 Example: decoding in the presence of model uncertainty................ 29

8 Computing marginal likelihood is nontrivial but necessary in a number of appli- cations 31

9 Further reading 33 1 (^1) Much of these notes are directly adapted from (Ahmadian et al., 2008). Thanks also to Y. Mishchenko for helpful

Thus we have reduced our hard integration problem to an apparently simpler problem of drawing i.i.d. samples from p(~x). To decide how many samples N are sufficient, we may simply compute

V ar(ˆgN ) =

N

V ar[g(~x)];

when N is large enough that this variance is sufficiently small, we may stop sampling. Of course, in general, computing V ar[g(~x)] is usually also intractable, and so we have to estimate this variance from data, and this can be a difficult problem in itself, e.g. if the random variable g(~x) has large tails.

2.1 Example: decoding hippocampal place field activity (the static case)

As a simple concrete example, imagine that we would like to decode the position x of a rat along a one-dimensional track, given some observed set of neural spike counts D = {ni}. Assume that, given x, the spike counts ni are conditionally independent Poisson random variables with rate

λi(x) = fi(x),

where fi(.) is neuron i’s tuning curve for position x. Further assume that we know the a priori distribution p(x) of observing any position x (i.e., we have watched the rat’s behavior long enough to estimate p(x). Then, if we write

E(x|D) =

xp(x|D)dx =

x p(D|x)p(x) p(D)

dx =

xp(D|x)p(x)dx p(D)

with

p(D) =

p(x, D)dx =

p(x)p(D|x)dx,

we see that we may apply the Monte Carlo idea twice to compute E(x|D), once for the numerator and once for the denominator: if xj are i.i.d. samples from the prior density p(x), then

∫ xp(D|x)p(x)dx = lim N →∞

N

∑^ N

xj p(D|xj )

and

p(D) = lim N →∞

N

∑^ N

p(D|xj ),

with

p(D|x) =

[λn i iexp(−λi)/ni!] =

fi(x)ni^ exp(−fi(x)) ni!

However, these direct methods for computing p(D) and E(x|D) are numerically unstable (i.e., we may need a very large number of samples N to obtain reliable estimates) and are not recommended, for reasons we will discuss below.

3 The “importance sampling” idea extends the range of the basic

Monte Carlo technique, but is only effective in low dimensions

Two major problems can arise in a direct application of the Monte Carlo idea. First, it might be difficult to sample directly from p(~x), particularly when ~x is high-dimensional. Second, the likelihood term p(D|~x) might be very sharply peaked; in the preceding position-decoding example, this is often the case when many neurons are observed, since the product over neurons i tends to become exponentially concentrated around its maximum. This means that most of our samples are wasted, since they fall in regions where the p(D|~x) term is relatively very small, leading to a large variance in our estimate of E(~x|D). One solution to both of these problems is to use the “importance sampling” trick: instead of drawing i.i.d. samples from p(~x), we draw ~xj from a “proposal” density q(~x) instead, and then form

ˆg( Nq )≡

N

∑^ N

p(~xj ) q(~xj ) g(~xj ) →N →∞

p(~x) q(~x) g(~x)q(~x)d~x =

p(~x)g(~x)d~x = Ep(g).

The terms p q((~x~xjj^ ) ) — defined so that we obtain the correct expectation, despite the fact that we are not sampling from p(~x) — are known as “importance weights”: we are biasing our sampler so that we can potentially (assuming we choose a good q(~x)) spend more of our time sampling from the “important” part of the density. What constitutes a good choice for q(~x) here? It is reasonable to try to make the variance of our estimate g N(q )as small as possible. We may write out this variance explicitly:

V ar(ˆgN ) =

N

g(~x)^2 p(~x)^2 q(~x)

d~x − Ep(g)^2

(note that this reduces to the simpler V ar[g(~x)] formula above in the case that p = q); a simple calculus of variations argument gives that the optimal q is

q∗(~x) =

Z

|g(~x)|p(~x),

with the normalizer Z =

|g(~x)|p(~x)d~x < ∞, by assumption. For example, if we want to compute p(D) =

p(~x)p(D|~x)d~x for some fixed observation D, the optimal proposal density is

q(~x) = (1/Z)p(~x)|p(D|~x)| = (1/Z)p(~x)p(D|~x) = p(~x|D),

which makes intuitive sense. It is worth noting that a bad choice of q(~x), on the other hand, can be quite damaging. For

example, if q(~x) is such that

∫ (^) g(~x) (^2) p(~x) 2 q(~x) d~x^ is infinite (i.e., if the importance weight^

p(~xj ) q(~xj ) is very large with high p(~x)-probabiliy), then the variance of our estimate ˆgN will be infinite, a situation we would obviously like to avoid. The rule of thumb is that the proposal q should have “fatter tails” than p, in order to avoid very large values of the importance weight p/q. If we return to the example discussed at length above — decoding the spike trains from a population of GLM-like neurons — then a simple proposal density q(~x) suggests itself: our Gaussian approximation to p(~x|D). Once we obtain ~xM AP and the Hessian Jx of the log-posterior evaluated

at ~xM AP , then to sample from this distribution we need only form ~xj = J x− 1 /^2 ~zj + ~xM AP , where ~zj is a standard white Gaussian vector: this proposal q(~x) is therefore very easy to sample from, and is

provide exact samples. We apply the rejection sampling idea: that is, we draw from our proposal q, and then accept with probability cp/q. The constant c is chosen such that maxx cp(x)/q(x) ≤ 1, i.e., that the acceptance probability is always less than or equal to one. It is clear that this algorithm provides exact samples, once we write down the density p∗(x) from which the rejection sampler is drawing:

p∗(x) =

Z

q(x)p(accept|x) =

Z

q(x)c

p(x) q(x)

where the normalization

Z =

q(x)c p(x) q(x)

dx = c

is just the proportion of samples that are accepted; thus

p∗(x) =

q(x)c p(x) q(x)

= p(x).

(Note that the thinning algorithm we discussed in the point process chapter was a special case of this idea.) However, in many cases (particularly in high-dimensional applications) the restriction that p(x)/q(x) be bounded is not met, or the bound 1/c = maxx p(x)/q(x) is so large as to make the method useless. In this case, we must turn to the more general MCMC methods discussed below.

5 “Rao-Blackwellization”: it is wise to compute as much as pos-

sible analytically

We saw above that the importance sampling idea (or indeed, the Monte Carlo approach in general) is most effective when we can control the dimensionality of the vector ~x which needs to be sampled. One natural idea along these lines is to perform as much of the integral as possible analytically. More precisely, it is sometimes possible to decompose p(~x) into two pieces,

p(~x) = p(~x 1 )p(~x 2 |~x 1 ).

In some cases, this decomposition may be chosen such that the integral

g(~x 1 ) ≡

p(~x 2 |~x 1 )g(~x 1 , ~x 2 )d~x 2

may be calculated analytically^2. Now the integral Ep(g) may be computed as

Ep(g) =

d~x 1 p(~x 1 )

d~x 2 p(~x 2 |~x 1 )g(~x 1 , ~x 2 ) =

d~x 1 p(~x 1 )g(~x 1 );

in particular, we may apply Monte Carlo methods (now in a the lower-dimensional ~x 1 space) to the integral on the right. The Rao-Blackwell theorem (a version of the usual bias-variance decom- position (Schervish, 1995)) guarantees that this approach, which effectively replaces a randomized estimator for g(~x 1 ) with its analytical conditional expectation, will perform at least as well as Monte Carlo integration applied to ~x directly; hence this approach is often referred to as “Rao- Blackwellization.” We will see a number of examples of this approach below. (^2) The most common example of this is that ~x 2 is conditionally Gaussian given ~x 1 , and g(.) is some simple function such as a polynomial or exponential. For example, it is easy to see that this pertains when we are decoding Gaussian stimuli ~x given observations from a GLM model and the number m of spike counts observed is smaller than the dimensionality of ~x. Thus, in this case, just as in the case that we are optimizing (not integrating) over ~x — where we made use of the representer theorem to reduce the computational load — the computation time is determined by the smaller of dim(~x) and m.

6 Markov chain Monte Carlo (MCMC) provides a general-purpose

integration technique

As we discussed above, if we can sample from a distribution we can (in principle) compute any quan- tity of interest: means, variances, correlations, and even quantities which are nonlinear functions of the distribution, such as mutual information (as we will see below). However, it can often be quite challenging to sample directly from a complex distribution p(~x). One (quite clever) idea that has become the default method (Robert and Casella, 2005) is to invent an ergodic Markov chain^3 whose conditional distributions p(~y|~x) may be sampled tractably and whose equilibrium measure is exactly p(~x). Then we may simply iteratively run the chain forward, by recursively sampling p(~xN +1|~xN ).

We know that the distribution of ~x after N steps of the chain is

pN (~x) ≡ p(~xN ) = T N^ p 0 (~x),

where T abbreviates the transition operator

T g =

g(~x)p(~y|~x)d~x

(think of the simplest case, that ~x can take on only a finite number of values d — then g is a d-dimensional vector and T corresponds to a d × d matrix) and that

pN (~x) → p(~x), N → ∞.

Thus if we run the chain long enough, ~xN will be arbitrarily close to a sample from p(~x). One drawback is that we don’t know how long we need to run the Markov chain: how large does N have to be for pN (~x) to be “close enough” to p(~x)? In the case that the sample space of ~x is finite, we know from basic matrix theory that the largest eigenvalue of T is 1, with corresponding eigenvector p(~x); in an ergodic chain, all other eigenvalues are strictly less than one. Since the size of these smaller eigenvalues determines the rate of convergence of T N^ p 0 (~x) to p(~x), the goal is to choose T such that the second-largest eigenvalue is as small as possible. More generally, we tend to argue more qualitatively that the chain should “mix” — i.e., “forget” the initial distribution p 0 (~x) and evolve towards equilibrium — as quickly as possible; we will discuss this issue more quantitatively in section 6.3 below. We also want to choose the initial density p 0 (~x) to be as close to the target density p(~x) as possible (within the consraint that we may sample from p 0 (~x) directly); for example, in the decoding context, a reasonable choice for p 0 (~x) is the Laplace-approximation proposal density we discussed above. Typically the chain is run for some “burn-in” period before any samples are recorded, so that the effects of the initialization are forgotten and the chain may be considered to be in equilibrium. (If the chain mixes slowly, we’ll have to “burn” more samples — yet another reason to construct faster-mixing chains.) Finding a Markov chain that satisfies our desiderata — (1), ergodicity (or more quantitatively, a rapid mixing speed), and (2), T p(~x) = p(~x),

that is, p(~x) as an equilibrium measure — is the hard part, but a few general-purpose approaches have been developed and are quite widely used. (^3) Ergodicity is a property of a Markov chain that guarantees that the Markov chain evolves towards a unique equilibrium density; roughly speaking, a Markov chain is ergodic if the probability of getting from one state to another is nonzero, and if the chain is acyclic in the sense that particles which start at any state ~x do not return back to ~x at any fixed time with probability one.

p(~y)q(~x|~y)/p(~x)q(~y|~x) on the left-hand side, and flip this ratio to set α(~x|~y) on the right-hand side. Putting these three cases together, we can express our acceptance probability a bit more elegantly,

α(~y|~x) ≡ min

p(~y)q(~x|~y) p(~x)q(~y|~x)

for any pair (~x, ~y). This completes the definition of the Metropolis-Hastings (M-H) algorithm. Thus we have succeeded in our goal of constructing a simple algorithm that allows us to sample from a distribution which becomes arbitrarily close to p(~x) (as N increases). See (Billera and Diaconis, 2001) for a geometric interpretation of the algorithm; it turns out that the choice of the acceptance ratio α(.|.) corresponds to a kind of projection of our proposal Markov chain onto the set of chains which satisfy detailed balance under p(~x). See (Neal, 2004) for further details, including a generalization to the more powerful class of non-reversible chains. It is worth pointing out a couple special features of this algorithm. First, note that in the special case of symmetric proposals, q(~y|~x) = q(~x|~y), then the acceptance probability simplifies:

αsymm(~y|~x) ≡ min

p(~y) p(~x)

this was the original Metropolis algorithm (later generalized to the nonsymmetric case by Hast- ings). This proposal has the nice feature that proposals which increase the value of p(.) are always accepted (i.e., if p(~y) ≥ p(~x), then αsymm(~y|~x) = 1), while proposals which decrease p(.) are ony accepted probabilistically. This makes the analogy with the “simulated annealing” technique from optimization quite transparent; the M-H algorithm is just simulated annealing when the “temper- ature” is set to one. Note also, very importantly, that no matter what proposal density we use we only need to know p(~x) up to a constant. That is, it is enough to know that

p(~x) =

Z

r(~x)

for some relatively easy-to-compute r(~x) — we do not need to calculate the normalization constant Z (since the Z term cancels in the ratio p(~y)/p(~x)), which in many cases is intractable. This prop- erty has made the Metropolis-Hastings algorithm perhaps the fundamental algorithm in Bayesian statistics.

6.2 Choosing the proposal density is a key step in optimizing the efficiency of

Metropolis-Hastings

As discussed above, the key to the success of any MCMC algorithm is to find a chain that reaches equilibrium and mixes as quickly as possible. In the context of the M-H algorithm, we want to choose the proposal density q(~y|~x) to maximize the mixing speed and to minimize the time needed to compute and sample from q(.). A good rule of thumb is that we would like the proposals q(.) to match the true density p(.) as well as possible (in the limiting case that p(.) and q(.) are exactly equal, it is clear that the acceptance probability α will be one, and the M-H algorithm gives i.i.d. samples from p(.), which is the best we can hope for; of course, in general, we can’t sample directly from p(.), or else the MCMC technique would not be necessary). The so-called “independence” sampler is the simplest possible proposal: we choose the proposal q(.) to be independent of the current state ~x, i.e. q(~y|~x) = q(~y), for some fixed q(~y). For example, the proposal q(~x) could be taken to be our Gaussian or student-t approximation to p(~x|D). This

is extremely easy to code, and the simplicity of the proposal makes detailed mathematical analysis of the mixing properties of the resulting Markov chain much more tractable than in the general case (Liu, 2002). However, as we discussed above, it can be very difficult to choose a q(~x) which is both easy to sample from directly and which is close enough to the (possibly high-dimensional) p(~x) that the importance weights p(~x)/q(~x) — and therefore the M-H term

p(~y)q(~x|~y) p(~x)q(~y|~x)

p(~y)q(~x) p(~x)q(~y)

— are well-behaved. Thus in most cases we need to choose a proposal density which depends on the current point ~y. We will discuss a few of the most useful example proposals below, but before going into these specifics, it is worth emphasizing that, all of the M-H algorithms we will discuss propose jumps ~y − ~x which are in some sense “local.” Thus, just as in the optimization setting, M-H methods are typically much more effective in sampling from unimodal (e.g., log-concave) target densities p(~x) than more general p(~x), which might have local maxima that act as “traps” for the sampler, decreasing the overall mixing speed of the chain. Therefore our focus in this section will be on constructing efficient M-H methods for log-concave densities; in particular, we will focus on the example of stimulus decoding given GLM observations, where we know that log-concave posteriors arise quite naturally.

6.2.1 Non-isotropic random-walk Metropolis is the simplest effective proposal

Perhaps the most common proposal is of the random walk type: q(~y|~x) = q(~y − ~x), for some fixed density q(.). This is a special case of the symmetric proposal discussed above; recall that the acceptance probability simplifies somewhat in this case. Centered isotropic Gaussian distributions are a simple choice for q(.), leading to proposals

~y ∼ ~x + σ~ǫ, (2)

where ~ǫ is Gaussian of zero mean and identity covariance, and σ determines the proposal jump scale. (We will discuss the choice of σ at more length in the next section.) Of course, different choices of the proposal distribution will affect the mixing rate of the chain. To increase this rate, it is generally a good idea to align the axes of q(.) with the target density, if possible, so that the proposal jump scales in different directions are roughly proportional to the width of p(~x) along those directions. Such proposals will reduce the rejection probability and increase the average jump size by biasing the chain to jump in more favorable directions. For Gaussian proposals, we can thus choose the covariance matrix of q(.) to be proportional to the covariance of p(~x). Of course, calculating the latter covariance is often a difficult problem (which the MCMC method is intended to solve!), but in many cases we can exploit our usual Laplace approximation ideas and take the inverse of the Hessian of the log-posterior at MAP as a rough approximation for the covariance. This is equivalent to modifying the proposal rule (2) to

~y ∼ ~x + σA~ǫ, (3)

where A is the Cholesky decomposition of H−^1

T = H−^1 , (4)

and H is the negative log-Hessian of the log-posterior at the MAP solution. We refer to chains with such jump proposals as non-isotropic Gaussian RWM. Figure 1 compares the isotropic and nonisotropic proposals.

where C is the prior covariance. Thus, if C−^1 is banded, H will be banded in time as well. As an example, Gaussian autoregressive processes of any finite order form a large class of priors which have banded C−^1. In particular, for white-noise stimuli, C−^1 is diagonal, and therefore H will have the same bandwidth as J. Efficient algorithms can find the Cholesky decomposition of a banded d × d matrix, with bandwidth nb, in a number of computations ∝ n^2 b d, instead of ∝ d^3. Likewise, if B is a banded triangular matrix with bandwidth nb, the linear equation B~x = ~y can be solved for ~x in ∝ nbd computations. Therefore, to calculate ~x = A˜x from ˜x in each step of the Markov chain, we proceed as follows. Before starting the chain, we first calculate the Cholesky decomposition of H, such that H = BT B and ~x = Ax˜ = B−^1 x˜. Then, at each step of the MCMC, given ˜xt, we find ~xt by solving the equation B~xt = ˜xt. Since both of these procedures involve a number of computations that only scale with d (and thus with T ), we can perform the whole MCMC decoding in O(T ) computational time. This allows us to decode stimuli with essentially unlimited durations. Similar O(T ) methods have been applied frequently in the state-space model literature (Shephard and Pitt, 1997; Davis and Rodriguez-Yam, 2005; Jungbacker and Koopman, 2007); we will discuss these state-space methods in more depth in a later chapter.

6.2.2 Hybrid (Hamiltonian) Monte Carlo typically mixes much faster than random- walk Metropolis

The random-walk chains discussed above are convenient, but often mix slowly in practice, basically because random walks spend a lot of time re-covering old ground: in the simplest case, a Brownian motion requires O(n^2 ) time to achieve a distance of n steps from its starting point, on average. The so-called hybrid (or Hamiltonian) Monte Carlo (HMC) method is a powerful method for constructing chains which avoid the slow mixing behavior of random-walk chains. This method was originally inspired by the equations of Hamiltonian dynamics for the molecules in a gas (Duane et al., 1987), but has since been used extensively in Bayesian settings (for its use in sampling from posteriors based on GLMs see (Ishwaran, 1999); see also (Neal, 1996) for further applications and extensions). Let’s begin with the physical analogy. Let ~x be the vector of positions of all the molecules in a gas at a moment in time, and ~p the vector of their momenta (~x and ~p have the same dimension). In thermal equilibrium, these variables have the distribution p(~x, ~p) ∝ e−H(~x,~p), where

H(~x, ~p) =

~pT~p + E(~x)

is the Hamiltonian (total energy) of the system, which is the sum of the kinetic energy, 12 ~pT~p, and the potential energy, E(~x). Notice that due to the factorization of this distribution, ~x and ~p are independently distributed; the marginal distribution of ~x is proportional to exp(−E(~x)) and that of ~p is the Gaussian ∝ exp(− 12 ~pT~p). Thus if we could obtain samples {(~xt, ~pt)} from p(~x, ~p) with the potential energy E(~x) = − log p(~x) + const., we would also have obtained samples from the marginal {~xt} from p(~x). Now, ~x and ~p evolve according to the Hamiltonian dynamics

~p˙ = − ∂H ∂~x

= −∇E(~x), ~x˙ =

∂H

∂~p

= p,~ (5)

under which the energy H(~x, ~p) is conserved. These dynamics also preserve the volume element d~xd~p in the (~x, ~p) space (this is the content of Liouville’s theorem from classical dynamics). These

two facts imply that the continuous-time (deterministic) Markov chain in the (~x, ~p) space that cor- responds to the solution of Eqs. (5) leaves the distribution p(~x, ~p) ∝ e−H(~x,~p)^ invariant. Ergodicity of this Markov chain can be assured if we “thermalize” the momentum variables, by sampling ~p from its exact marginal distribution (the standard Gaussian) at regular time intervals. Thus so- lutions of Eqs. (5) should lead to good transition densities for our MCMC chain. Moreover, we expect these proposals to outperform the basic random-walk proposals for two reasons: first, by construction, the Hamiltonian method will follow the energy gradient and thus propose jumps to- wards higher-probability regions, instead of just proposing random jumps without paying attention to the gradient of p(~x). Second, the Hamiltonian method will move for many time steps in the same direction, allowing us to move n steps in O(n) time, instead of the O(n^2 ) time required by the random-walk method. In practice, a numerical solution to Eqs. (5) is obtained by discretizing the Hamiltonian equa- tions, in which case H(~x, ~p) may change slightly with each step, and the M-H correction is required to guarantee that p(~x, ~p) remains the equilibrium distribution. The discretization customarily used in HMC applications is the “leapfrog” method, which is time reversible and leaves the (~x, ~p)-volume element invariant. The complete HMC algorithm with the leapfrog discretization has the following steps for generating {~xt+1, ~pt+1}, starting from {~xt, ~pt}, in each step of the Markov chain:

Set ~x := ~xt, and sample ~p from the isotropic Gaussian distribution Nd(0, 1 ).
Set ~p 0 := ~p, and update (~x, ~p) by repeating the following leapfrog steps L times
- ~p := ~p − σ 2 ∇E(~x)
- ~x := ~x + σ~p
- ~p := ~p − σ 2 ∇E(~x)
With probability max{ 1 , exp (−∆H)}, where ∆H ≡ H(~x, ~p) − H(~xt, ~p 0 ), accept the proposal ~x as ~xt+1. Otherwise reject it and set ~xt+1 = ~xt. (It can be shown that this is a bona fide Metropolis-Hastings rejection rule, ensuring that the resulting MCMC chain indeed has the desired equilibrium density (Duane et al., 1987).)

This chain has two parameters, L and σ, which can be chosen to maximize the mixing rate of the chain while minimizing the number of evaluations of E(~x) and its gradient. In practice, even a small L (e.g., L < 5) often yields a rapidly mixing chain. The special case L = 1 corresponds to a chain that has proposals of the form

~y ∼ ~x −

σ^2 2 ∇E(~x) + σ~p,

where ~p is normal with zero mean and identity covariance, and the proposal ~y is accepted according to the usual M-H rule. In the limit σ → 0, this chain becomes a continuous Langevin process with the potential function E(~x) = − log p(~x), whose stationary distribution is the Gibbs measure, p(~x) = exp(−E(~x)), without the Metropolis-Hastings rejection step. For a finite σ, however, the Metropolis-Hastings acceptance step is necessary to guarantee that p(~x) is the invariant distribution. The chain is thus refered to as the “Metropolis-adjusted Langevin” algorithm (MALA) (Roberts and Rosenthal, 2001). The scale parameter σ (which also needs to be adjusted for the RWM chain) sets the average size of the proposal jumps: we must typically choose this scale to be small enough to avoid jumping wildly into a region of low p(~x) (and therefore wasting the proposal, since it will be rejected with high probability). At the same time, we want to make the jumps as large as possible, on average,

a) Gibbs b) ARS Hit−and−Run c) Random Walk Metropolis

Figure 2: Comparison of different MCMC algorithms in sampling from a non-isotropic truncated Gaussian distribution. Panels (a-c) show 50-sample chains for a Gibbs, isotropic hit-and-run, and isotropic random walk Metropolis (RWM) samplers, respectively. The grayscale indicates the height of the probability density. As seen in panel (a), the narrow, non-isotropic likelihood can significantly hamper the mixing of the Gibbs chain as it chooses its jump directions unfavorably. The hit-and- run chain, on the other hand, mixes much faster as it samples the direction randomly and hence can move within the narrow high likelihood region with relative ease. The mixing of the RWM chain is relatively slower due to its rejections (note that there are fewer than 50 distinct dots in panel (c) due to rejections; the acceptance rate was about 0.4 here). For illustrative purposes, the hit-and-run direction and the RWM proposal distributions were taken to be isotropic here, which is disadvantageous, as explained in the text (also see Fig. 1).

where p(Ai) denotes the probability that we have chosen to update the subset Ai. It is important to note that the Gibbs update rule can sometimes fail to lead to an ergodic chain (i.e., the chain can get “stuck” and not sample from p(~x) properly) (Robert and Casella, 2005). An extreme case of this is when the conditional distributions pm(xm|~x⊥m) are deterministic: then the Gibbs algorithm will never move, clearly breaking the ergodicity of the chain. More generally, in cases where strong correlations between the components of ~x lead to nearly deterministic condi- tionals, the mixing rate of the Gibbs method can be extremely low (panel (a) of Fig. 2, shows this phenomenon for a 2-dimensional distribution with strong correlation between the two components). Thus, it is a good idea to choose the parameterization of the model carefully before applying the Gibbs algorithm. For example, we can change the basis, or more systematically, exploit the Laplace approximation, as described above, to sample from the auxiliary distribution ˜p(˜x) instead.

6.2.4 The hit-and-run algorithm performs Gibbs sampling in random directions

The hit-and-run algorithm (Lovasz and Vempala, 2003) can be thought of as “random-direction Gibbs”: in each step of the hit-and-run algorithm, instead of updating ~x along one of the coordinate axes, we update it along a random general direction not necessarily parallel to any coordinate axis. More precisely, the sampler is defined in two steps: first, choose a direction ~n from some positive density ρ(~n) on the unit sphere ~nT~n = 1. Then, similar to Gibbs, sample the new point on the line defined by ~n and ~x, with a density proportional to the underlying distribution. That is, we sample s from the one-dimensional density

s ∼ h(s|~n, ~x) ∝ p(~x + s~n), −∞ < s < ∞,

and then set ~y = ~x + s~n.

The main gain over RWM or HMC is that instead of taking small local steps (of size proportional to σ), we may take very large jumps in the ~n direction; the jump size is set by the underlying distribution itself, not an arbitrary scale σ. This, together with the fact that all hit-and-run proposals are accepted, makes the chain better at escaping from sharp high-dimensional corners (see (Lovasz and Vempala, 2003) and the discussion at the end of Sec. 6.2.2 above). Moreover, the algorithm is parameter-free, once we choose the direction sampling distribution r(.), unlike the random walk case, where we have to choose the jump size somewhat carefully. The advantage over Gibbs is in situations such as depicted in Fig. 1, where jumps parallel to coordinates lead to small steps but there are directions that allow long jumps to be made by hit-and-run. The price to pay for these possibly long nonlocal jumps, however, is that now (as well as in the Gibbs case) we need to sample from the one-dimensional density s ∼ (^) Z^1 p(~x + s~n), which is in general non-trivial. This may be done by rejection sampling, the cumulative inverse transformation, or one-dimensional Metropolis. If the posterior distribution is log-concave, then so are all of its “slices,” and very efficient specialized methods such as slice sampling (Neal, 2003) or adaptive rejection sampling (Gilks and Wild, 1992) can be used; see (Ahmadian et al., 2008) for a discussion of the adaptive rejection sampling method, and section 7.6 below for a brief discussion of the slice sampling method. To establish the validity of the hit-and-run sampler, we argue as in the Gibbs case: we compute the ratio p(~y)q(~x|~y) p(~x)q(~y|~x)

p(~y)

ρ(~n)q(~x|~y, ~n)d~n p(~x)

ρ(~n)q(~y|~x, ~n)d~n

∫ ρ(~n)p(~y)q(~x|~y, ~n)d~n ρ(~n)p(~x)q(~y|~x, ~n)d~n

since p(~x)q(~y|~x, ~n) = p(~y)q(~x|~y, ~n) ∝ p(~x)p(~y) for any ~x and ~y connected by a line along the direction ~n; thus we see that the hit-and-run proposal is indeed a M-H proposal which is accepted with probability one. Regarding the direction density, ρ(~n), the easiest choice is the isotropic ρ(~n) = 1. More generally it is easy to sample from ellipses, by sampling from the appropriate Gaussian distribution and normalizing. Thus, again, a reasonable approach is to exploit the Laplace approximation: we sample ~n by sampling an auxiliary point ˜x from N (0, H−^1 ), and setting ~n = ˜x/‖x˜‖ (see Fig. 1). Sampling from N (0, H−^1 ) may be performed via the efficient Cholesky techniques discussed in the previous sections. This non-isotropic direction density enhances hit-and-run’s advantage over Gibbs by giving more weight to directions that allow larger jumps to be made.

6.3 Comparing mixing speed in the log-concave case: for smooth densities,

hybrid methods are best; for densities with “corners,” hit-and-run is more

effective

In the last few sections, we have discussed some qualitative arguments supporting various MCMC algorithms, in terms of their mixing rates and computational costs. Here we give a more quantitative account, and also compare the different methods based on their performance in the neural decoding setup. From a practical point of view, the most relevant notion of mixing is how fast the estimate ˆgN converges to the true expectation of the quantity of interest, E(g). As one always has access to finitely many samples, N , even in the optimal case of i.i.d. samples from p, ˆgN has a finite random error, (1/N )V arp(g). For correlated samples from the MCMC chain, the asymptotic error is typically larger (since each sample gives somewhat redundant information), and the error may be written (Robert and Casella, 2005) as

Var(ˆgN ) = τcorr N

Var[g(~x)] + o

( (^) τ corr N

regarding sampling from some special classes of high-dimensional distributions. Based on their results, the authors argue that in general, the jump scales of RWM and MALA proposals should be chosen such that their acceptance rates are roughly 0.25 and 0.55, respectively. For the special case of sampling from a d dimensional standard Gaussian distribution, p(~x) ∝ exp (−‖~x‖^2 /2), and for optimally chosen proposal jump scales they show that the FOE of Gaussian MALA and RWM are asymptotically equal to 1. 6 d^2 /^3 and 1.33, respectively, for large d; in comparison, (Ahmadian et al.,

compute the FOE of Gaussian hit-and-run as 2. Therefore, while hit-and-run has higher FOE than RWM in this case, we see that for unimodal, nearly Gaussian distributions, MALA will mix much faster (by a factor ∝ d^2 /^3 ) than both RWM and hit-and-run in large dimensions, and HMC will mix even faster (see (Kennedy and Pendleton, 1991; Neal, 1993) for further discussion of the mixing speed of HMC in this simple multivariate Gaussian case). The superiority of HMC over the other chains is clearly visible in panel (a) of Fig. 3, which shows a plot of the estimated autocorrelation function γt for the sampling of the three chains from the GLM posterior with standard Gaussian priors. More generally, in simulations with Gaussian priors and smooth GLM nonlinearities, HMC (including MALA) had an order of magnitude advantage over the other chains for most of the relevant parameter ranges (Ahmadian et al., 2008). The situation can be very different, however, for highly non-Gaussian (but still log-concave) distributions, such as those with sharp boundaries. In our GLM decoding setting this can be the case with flat stimulus priors on convex sets, when the likelihood is broad as a function of ~x and does not restrict the posterior support away from the boundaries and corners of the set on which the prior is supported. In this case, HMC and MALA lose their advantage because they do not take advantage of the information in the prior distribution, which has zero gradient within its support. Furthermore, as mentioned in Secs. 6.2.2 and 6.2.4, when the convex body has sharp corners, hit- and-run will have an advantage over both RWM and HMC in avoiding getting trapped in those corners, which can otherwise considerably slow down the chain (see the arguments in (Lovasz and Vempala, 2003)). Figure 3, panel (b), shows the estimated autocorrelation function for different chains in sampling from the posterior distribution in GLM-based decoding with a flat stimulus prior distribution with cubic support. For this prior, the correlation time of the hit-and-run chain was consistently lower than those of the RWM, MALA, and Gibbs (not shown in the figure) chains, unless the likelihood was sharp and concentrated away from the boundaries of the prior cube. As we mentioned above (also see the next section), the Laplace approximation is adequate in this latter case. Thus we see that hit-and-run is the faster chain when this approximation fails, which is also the case where MCMC is more indispensable. Finally, we note that a number of other robust empirical methods of diagnosing mixing and convergence have been introduced in the literature; the so-called Rˆ statistic discussed in (Gelman et al., 2003) is particularly popular. These alternate methods give consistent results with those based on the autocorrelation time, τcorr, presented here.

6.4 Example: comparing MAP and Monte Carlo stimulus decoding given GLM

observations

In this section we compare Bayesian stimulus decoding using the MAP and the MCMC posterior mean estimates, based on the response of a population of neurons modeled via the GLM. We will show that in the flat prior case, the MAP estimate is much less efficient (in terms of its mean squared error) than the posterior mean estimate. We contrast this with the Gaussian prior case, where the Laplace approximation is accurate over a large range of model parameters, and thus the two estimates are close. Furthermore, for both kinds of priors, in the limit of strong likelihoods

−

amplitude

−

amplitude

0 100 200 300 400 500

−

time(ms)

amplitude

−

amplitude

−

amplitude

0 100 200 300 400 500

−

time(ms)

amplitude

−

amplitude

−

amplitude

0 100 200 300 400 500

−

time(ms)

amplitude

Stimulus

MAP

BLS

c‖k‖ =. 5 c‖k‖ = 1 c‖k‖ = 2. 4

Figure 4: Comparison of MAP and posterior mean estimates, for a pair of ON and OFF model neurons, for different values of the stimulus filter amplitude (‖k‖ = 0.5, 1, and 2.4; i.e., the dependence of the firing rate on the stimulus was increasing from left to right). The stimulus (the black trace shown on all panels) consists of a 500 ms interval of uniformly distributed white noise, refreshed every 10 ms. Thus the stimulus space is 50 dimensional. The cyan horizontal lines mark the boundaries of the flat prior distribution of the stimulus intensity on each 10 ms subinterval. They are set at ±

3, corresponding to intensity variance of 1 and zero mean. Dots on the top row show the spikes of the ON (gray) and the OFF (black) cell. The red traces in the middle row are the MAP estimates, and the blue traces in the bottom rows show the posterior means estimated from 10000 samples of a hit-and-run chain (after burning 2500 samples). The shaded regions in the second and third rows are error bars representing the estimated marginal posterior uncertainty about each stimulus value for the MAP and mean estimates. For the MAP (second rows), these are calculated as the square root of the diagonal of the inverse Hessian H−^1 , but they have been cut-off where they would have encroached on the zero prior region beyond the horizontal cyan lines. For the posterior mean (third rows), they represent one standard deviation about the mean, calculated as the square root of the diagonal of the covariance matrix, itself estimated from the MCMC chain (the standard error of the posterior mean estimate due to the finite samples of the MCMC were much smaller than these error bars, and are not shown). Note that the errorbars of the mean are in general smaller than those for the MAP, and that all estimate uncertainties decrease as the stimulus informativeness grows.

(e.g., due to a strong stimulus filter or a large number of neurons) the posterior distribution will be sharply concentrated, the Laplace approximation becomes asymptotically more and more accurate, and both estimates will eventually converge to the true stimulus (more precisely the part of the stimulus that is not outside the receptive field of all the neurons; see footnote 5, below). In the first two examples (Figs. 4–5), the stimulus estimates were computed given the simulated spike trains of a population of pairs of ON and OFF retinal ganglion cells (RGC), in response

Monte Carlo Methods: Rejection Sampling, Markov Chain Monte Carlo, and Applications, Essays (high school) of Mathematics

Related documents

Partial preview of the text

Download Monte Carlo Methods: Rejection Sampling, Markov Chain Monte Carlo, and Applications and more Essays (high school) Mathematics in PDF only on Docsity!

Statistical analysis of neural data:

Monte Carlo techniques for decoding spike trains

Liam Paninski

Department of Statistics and Center for Theoretical Neuroscience

Columbia University

http://www.stat.columbia.edu/∼liam

March 31, 2009

N

2.1 Example: decoding hippocampal place field activity (the static case)

N

∑^ N

N

∑^ N

N

∑^ N

N

Z

Z

Z

Z

6.2 Choosing the proposal density is a key step in optimizing the efficiency of

Metropolis-Hastings

∂H

6.3 Comparing mixing speed in the log-concave case: for smooth densities,

hybrid methods are best; for densities with “corners,” hit-and-run is more

effective

6.4 Example: comparing MAP and Monte Carlo stimulus decoding given GLM

observations