



























Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Monte carlo methods, focusing on rejection sampling and markov chain monte carlo (mcmc). Rejection sampling is a technique for generating samples from a distribution by proposing samples from a simpler distribution and accepting or rejecting them based on their likelihood ratio. Mcmc, on the other hand, is a more general method for generating samples from complex distributions by constructing a markov chain that converges to the desired distribution. The document also covers various applications of these methods, such as computing integrals, bayesian inference, and decoding population spike trains.
Typology: Essays (high school)
1 / 35
This page cannot be seen from the preview
Don't miss anything!




























Contents
1 Often we need to integrate over the posterior, not just maximize it 3
2 “Monte Carlo” techniques provide a tractable method for computing integrals 3 2.1 Example: decoding hippocampal place field activity (the static case)......... 4
3 The “importance sampling” idea extends the range of the basic Monte Carlo technique, but is only effective in low dimensions 5
4 Rejection sampling is a basic building block of more general, powerful Monte Carlo methods 6
5 “Rao-Blackwellization”: it is wise to compute as much as possible analytically 7
6 Markov chain Monte Carlo (MCMC) provides a general-purpose integration technique 8 6.1 The Metropolis-Hastings sampling algorithm is a fundamental tool for constructing Markov chains with the correct equilibrium density................... 9 6.2 Choosing the proposal density is a key step in optimizing the efficiency of Metropolis- Hastings........................................... 10 6.2.1 Non-isotropic random-walk Metropolis is the simplest effective proposal... 11 6.2.2 Hybrid (Hamiltonian) Monte Carlo typically mixes much faster than random- walk Metropolis.................................. 13 6.2.3 The “Gibbs sampler” is a key special case of the Metropolis-Hastings algorithm 15 6.2.4 The hit-and-run algorithm performs Gibbs sampling in random directions.. 16 6.3 Comparing mixing speed in the log-concave case: for smooth densities, hybrid meth- ods are best; for densities with “corners,” hit-and-run is more effective........ 17 6.4 Example: comparing MAP and Monte Carlo stimulus decoding given GLM obser- vations............................................ 19
7 Gibbs sampling is useful in a wide array of applications 24 7.1 Example: sampling network spike trains given the observed activity of a subset of neurons in the population................................. 25 7.2 Blockwise sampling is often more efficient than Gibbs sampling each variable indi- vidually........................................... 25 7.3 Example: computing errorbars in low-rank stimulus encoding models......... 27 7.4 Example: Bayesian adaptive regression splines for time-varying firing rate estimation 28 7.5 Example: estimating hierarchical network spike train models............. 29 7.6 Example: “slice sampling” provides another easy way to sample from a unimodal density............................................ 29 7.7 Example: decoding in the presence of model uncertainty................ 29
8 Computing marginal likelihood is nontrivial but necessary in a number of appli- cations 31
9 Further reading 33 1 (^1) Much of these notes are directly adapted from (Ahmadian et al., 2008). Thanks also to Y. Mishchenko for helpful
Thus we have reduced our hard integration problem to an apparently simpler problem of drawing i.i.d. samples from p(~x). To decide how many samples N are sufficient, we may simply compute
V ar(ˆgN ) =
V ar[g(~x)];
when N is large enough that this variance is sufficiently small, we may stop sampling. Of course, in general, computing V ar[g(~x)] is usually also intractable, and so we have to estimate this variance from data, and this can be a difficult problem in itself, e.g. if the random variable g(~x) has large tails.
As a simple concrete example, imagine that we would like to decode the position x of a rat along a one-dimensional track, given some observed set of neural spike counts D = {ni}. Assume that, given x, the spike counts ni are conditionally independent Poisson random variables with rate
λi(x) = fi(x),
where fi(.) is neuron i’s tuning curve for position x. Further assume that we know the a priori distribution p(x) of observing any position x (i.e., we have watched the rat’s behavior long enough to estimate p(x). Then, if we write
E(x|D) =
xp(x|D)dx =
x p(D|x)p(x) p(D)
dx =
xp(D|x)p(x)dx p(D)
with
p(D) =
p(x, D)dx =
p(x)p(D|x)dx,
we see that we may apply the Monte Carlo idea twice to compute E(x|D), once for the numerator and once for the denominator: if xj are i.i.d. samples from the prior density p(x), then
∫ xp(D|x)p(x)dx = lim N →∞
j=
xj p(D|xj )
and
p(D) = lim N →∞
j=
p(D|xj ),
with
p(D|x) =
i
[λn i iexp(−λi)/ni!] =
i
fi(x)ni^ exp(−fi(x)) ni!
However, these direct methods for computing p(D) and E(x|D) are numerically unstable (i.e., we may need a very large number of samples N to obtain reliable estimates) and are not recommended, for reasons we will discuss below.
3 The “importance sampling” idea extends the range of the basic
Monte Carlo technique, but is only effective in low dimensions
Two major problems can arise in a direct application of the Monte Carlo idea. First, it might be difficult to sample directly from p(~x), particularly when ~x is high-dimensional. Second, the likelihood term p(D|~x) might be very sharply peaked; in the preceding position-decoding example, this is often the case when many neurons are observed, since the product over neurons i tends to become exponentially concentrated around its maximum. This means that most of our samples are wasted, since they fall in regions where the p(D|~x) term is relatively very small, leading to a large variance in our estimate of E(~x|D). One solution to both of these problems is to use the “importance sampling” trick: instead of drawing i.i.d. samples from p(~x), we draw ~xj from a “proposal” density q(~x) instead, and then form
ˆg( Nq )≡
i=
p(~xj ) q(~xj ) g(~xj ) →N →∞
p(~x) q(~x) g(~x)q(~x)d~x =
p(~x)g(~x)d~x = Ep(g).
The terms p q((~x~xjj^ ) ) — defined so that we obtain the correct expectation, despite the fact that we are not sampling from p(~x) — are known as “importance weights”: we are biasing our sampler so that we can potentially (assuming we choose a good q(~x)) spend more of our time sampling from the “important” part of the density. What constitutes a good choice for q(~x) here? It is reasonable to try to make the variance of our estimate g N(q )as small as possible. We may write out this variance explicitly:
V ar(ˆgN ) =
g(~x)^2 p(~x)^2 q(~x)
d~x − Ep(g)^2
(note that this reduces to the simpler V ar[g(~x)] formula above in the case that p = q); a simple calculus of variations argument gives that the optimal q is
q∗(~x) =
|g(~x)|p(~x),
with the normalizer Z =
|g(~x)|p(~x)d~x < ∞, by assumption. For example, if we want to compute p(D) =
p(~x)p(D|~x)d~x for some fixed observation D, the optimal proposal density is
q(~x) = (1/Z)p(~x)|p(D|~x)| = (1/Z)p(~x)p(D|~x) = p(~x|D),
which makes intuitive sense. It is worth noting that a bad choice of q(~x), on the other hand, can be quite damaging. For
example, if q(~x) is such that
∫ (^) g(~x) (^2) p(~x) 2 q(~x) d~x^ is infinite (i.e., if the importance weight^
p(~xj ) q(~xj ) is very large with high p(~x)-probabiliy), then the variance of our estimate ˆgN will be infinite, a situation we would obviously like to avoid. The rule of thumb is that the proposal q should have “fatter tails” than p, in order to avoid very large values of the importance weight p/q. If we return to the example discussed at length above — decoding the spike trains from a population of GLM-like neurons — then a simple proposal density q(~x) suggests itself: our Gaussian approximation to p(~x|D). Once we obtain ~xM AP and the Hessian Jx of the log-posterior evaluated
at ~xM AP , then to sample from this distribution we need only form ~xj = J x− 1 /^2 ~zj + ~xM AP , where ~zj is a standard white Gaussian vector: this proposal q(~x) is therefore very easy to sample from, and is
provide exact samples. We apply the rejection sampling idea: that is, we draw from our proposal q, and then accept with probability cp/q. The constant c is chosen such that maxx cp(x)/q(x) ≤ 1, i.e., that the acceptance probability is always less than or equal to one. It is clear that this algorithm provides exact samples, once we write down the density p∗(x) from which the rejection sampler is drawing:
p∗(x) =
q(x)p(accept|x) =
q(x)c
p(x) q(x)
where the normalization
Z =
q(x)c p(x) q(x)
dx = c
is just the proportion of samples that are accepted; thus
p∗(x) =
c
q(x)c p(x) q(x)
= p(x).
(Note that the thinning algorithm we discussed in the point process chapter was a special case of this idea.) However, in many cases (particularly in high-dimensional applications) the restriction that p(x)/q(x) be bounded is not met, or the bound 1/c = maxx p(x)/q(x) is so large as to make the method useless. In this case, we must turn to the more general MCMC methods discussed below.
5 “Rao-Blackwellization”: it is wise to compute as much as pos-
sible analytically
We saw above that the importance sampling idea (or indeed, the Monte Carlo approach in general) is most effective when we can control the dimensionality of the vector ~x which needs to be sampled. One natural idea along these lines is to perform as much of the integral as possible analytically. More precisely, it is sometimes possible to decompose p(~x) into two pieces,
p(~x) = p(~x 1 )p(~x 2 |~x 1 ).
In some cases, this decomposition may be chosen such that the integral
g(~x 1 ) ≡
p(~x 2 |~x 1 )g(~x 1 , ~x 2 )d~x 2
may be calculated analytically^2. Now the integral Ep(g) may be computed as
Ep(g) =
d~x 1 p(~x 1 )
d~x 2 p(~x 2 |~x 1 )g(~x 1 , ~x 2 ) =
d~x 1 p(~x 1 )g(~x 1 );
in particular, we may apply Monte Carlo methods (now in a the lower-dimensional ~x 1 space) to the integral on the right. The Rao-Blackwell theorem (a version of the usual bias-variance decom- position (Schervish, 1995)) guarantees that this approach, which effectively replaces a randomized estimator for g(~x 1 ) with its analytical conditional expectation, will perform at least as well as Monte Carlo integration applied to ~x directly; hence this approach is often referred to as “Rao- Blackwellization.” We will see a number of examples of this approach below. (^2) The most common example of this is that ~x 2 is conditionally Gaussian given ~x 1 , and g(.) is some simple function such as a polynomial or exponential. For example, it is easy to see that this pertains when we are decoding Gaussian stimuli ~x given observations from a GLM model and the number m of spike counts observed is smaller than the dimensionality of ~x. Thus, in this case, just as in the case that we are optimizing (not integrating) over ~x — where we made use of the representer theorem to reduce the computational load — the computation time is determined by the smaller of dim(~x) and m.
6 Markov chain Monte Carlo (MCMC) provides a general-purpose
integration technique
As we discussed above, if we can sample from a distribution we can (in principle) compute any quan- tity of interest: means, variances, correlations, and even quantities which are nonlinear functions of the distribution, such as mutual information (as we will see below). However, it can often be quite challenging to sample directly from a complex distribution p(~x). One (quite clever) idea that has become the default method (Robert and Casella, 2005) is to invent an ergodic Markov chain^3 whose conditional distributions p(~y|~x) may be sampled tractably and whose equilibrium measure is exactly p(~x). Then we may simply iteratively run the chain forward, by recursively sampling p(~xN +1|~xN ).
We know that the distribution of ~x after N steps of the chain is
pN (~x) ≡ p(~xN ) = T N^ p 0 (~x),
where T abbreviates the transition operator
T g =
g(~x)p(~y|~x)d~x
(think of the simplest case, that ~x can take on only a finite number of values d — then g is a d-dimensional vector and T corresponds to a d × d matrix) and that
pN (~x) → p(~x), N → ∞.
Thus if we run the chain long enough, ~xN will be arbitrarily close to a sample from p(~x). One drawback is that we don’t know how long we need to run the Markov chain: how large does N have to be for pN (~x) to be “close enough” to p(~x)? In the case that the sample space of ~x is finite, we know from basic matrix theory that the largest eigenvalue of T is 1, with corresponding eigenvector p(~x); in an ergodic chain, all other eigenvalues are strictly less than one. Since the size of these smaller eigenvalues determines the rate of convergence of T N^ p 0 (~x) to p(~x), the goal is to choose T such that the second-largest eigenvalue is as small as possible. More generally, we tend to argue more qualitatively that the chain should “mix” — i.e., “forget” the initial distribution p 0 (~x) and evolve towards equilibrium — as quickly as possible; we will discuss this issue more quantitatively in section 6.3 below. We also want to choose the initial density p 0 (~x) to be as close to the target density p(~x) as possible (within the consraint that we may sample from p 0 (~x) directly); for example, in the decoding context, a reasonable choice for p 0 (~x) is the Laplace-approximation proposal density we discussed above. Typically the chain is run for some “burn-in” period before any samples are recorded, so that the effects of the initialization are forgotten and the chain may be considered to be in equilibrium. (If the chain mixes slowly, we’ll have to “burn” more samples — yet another reason to construct faster-mixing chains.) Finding a Markov chain that satisfies our desiderata — (1), ergodicity (or more quantitatively, a rapid mixing speed), and (2), T p(~x) = p(~x),
that is, p(~x) as an equilibrium measure — is the hard part, but a few general-purpose approaches have been developed and are quite widely used. (^3) Ergodicity is a property of a Markov chain that guarantees that the Markov chain evolves towards a unique equilibrium density; roughly speaking, a Markov chain is ergodic if the probability of getting from one state to another is nonzero, and if the chain is acyclic in the sense that particles which start at any state ~x do not return back to ~x at any fixed time with probability one.
p(~y)q(~x|~y)/p(~x)q(~y|~x) on the left-hand side, and flip this ratio to set α(~x|~y) on the right-hand side. Putting these three cases together, we can express our acceptance probability a bit more elegantly,
α(~y|~x) ≡ min
p(~y)q(~x|~y) p(~x)q(~y|~x)
for any pair (~x, ~y). This completes the definition of the Metropolis-Hastings (M-H) algorithm. Thus we have succeeded in our goal of constructing a simple algorithm that allows us to sample from a distribution which becomes arbitrarily close to p(~x) (as N increases). See (Billera and Diaconis, 2001) for a geometric interpretation of the algorithm; it turns out that the choice of the acceptance ratio α(.|.) corresponds to a kind of projection of our proposal Markov chain onto the set of chains which satisfy detailed balance under p(~x). See (Neal, 2004) for further details, including a generalization to the more powerful class of non-reversible chains. It is worth pointing out a couple special features of this algorithm. First, note that in the special case of symmetric proposals, q(~y|~x) = q(~x|~y), then the acceptance probability simplifies:
αsymm(~y|~x) ≡ min
p(~y) p(~x)
this was the original Metropolis algorithm (later generalized to the nonsymmetric case by Hast- ings). This proposal has the nice feature that proposals which increase the value of p(.) are always accepted (i.e., if p(~y) ≥ p(~x), then αsymm(~y|~x) = 1), while proposals which decrease p(.) are ony accepted probabilistically. This makes the analogy with the “simulated annealing” technique from optimization quite transparent; the M-H algorithm is just simulated annealing when the “temper- ature” is set to one. Note also, very importantly, that no matter what proposal density we use we only need to know p(~x) up to a constant. That is, it is enough to know that
p(~x) =
r(~x)
for some relatively easy-to-compute r(~x) — we do not need to calculate the normalization constant Z (since the Z term cancels in the ratio p(~y)/p(~x)), which in many cases is intractable. This prop- erty has made the Metropolis-Hastings algorithm perhaps the fundamental algorithm in Bayesian statistics.
As discussed above, the key to the success of any MCMC algorithm is to find a chain that reaches equilibrium and mixes as quickly as possible. In the context of the M-H algorithm, we want to choose the proposal density q(~y|~x) to maximize the mixing speed and to minimize the time needed to compute and sample from q(.). A good rule of thumb is that we would like the proposals q(.) to match the true density p(.) as well as possible (in the limiting case that p(.) and q(.) are exactly equal, it is clear that the acceptance probability α will be one, and the M-H algorithm gives i.i.d. samples from p(.), which is the best we can hope for; of course, in general, we can’t sample directly from p(.), or else the MCMC technique would not be necessary). The so-called “independence” sampler is the simplest possible proposal: we choose the proposal q(.) to be independent of the current state ~x, i.e. q(~y|~x) = q(~y), for some fixed q(~y). For example, the proposal q(~x) could be taken to be our Gaussian or student-t approximation to p(~x|D). This
is extremely easy to code, and the simplicity of the proposal makes detailed mathematical analysis of the mixing properties of the resulting Markov chain much more tractable than in the general case (Liu, 2002). However, as we discussed above, it can be very difficult to choose a q(~x) which is both easy to sample from directly and which is close enough to the (possibly high-dimensional) p(~x) that the importance weights p(~x)/q(~x) — and therefore the M-H term
p(~y)q(~x|~y) p(~x)q(~y|~x)
p(~y)q(~x) p(~x)q(~y)
— are well-behaved. Thus in most cases we need to choose a proposal density which depends on the current point ~y. We will discuss a few of the most useful example proposals below, but before going into these specifics, it is worth emphasizing that, all of the M-H algorithms we will discuss propose jumps ~y − ~x which are in some sense “local.” Thus, just as in the optimization setting, M-H methods are typically much more effective in sampling from unimodal (e.g., log-concave) target densities p(~x) than more general p(~x), which might have local maxima that act as “traps” for the sampler, decreasing the overall mixing speed of the chain. Therefore our focus in this section will be on constructing efficient M-H methods for log-concave densities; in particular, we will focus on the example of stimulus decoding given GLM observations, where we know that log-concave posteriors arise quite naturally.
6.2.1 Non-isotropic random-walk Metropolis is the simplest effective proposal
Perhaps the most common proposal is of the random walk type: q(~y|~x) = q(~y − ~x), for some fixed density q(.). This is a special case of the symmetric proposal discussed above; recall that the acceptance probability simplifies somewhat in this case. Centered isotropic Gaussian distributions are a simple choice for q(.), leading to proposals
~y ∼ ~x + σ~ǫ, (2)
where ~ǫ is Gaussian of zero mean and identity covariance, and σ determines the proposal jump scale. (We will discuss the choice of σ at more length in the next section.) Of course, different choices of the proposal distribution will affect the mixing rate of the chain. To increase this rate, it is generally a good idea to align the axes of q(.) with the target density, if possible, so that the proposal jump scales in different directions are roughly proportional to the width of p(~x) along those directions. Such proposals will reduce the rejection probability and increase the average jump size by biasing the chain to jump in more favorable directions. For Gaussian proposals, we can thus choose the covariance matrix of q(.) to be proportional to the covariance of p(~x). Of course, calculating the latter covariance is often a difficult problem (which the MCMC method is intended to solve!), but in many cases we can exploit our usual Laplace approximation ideas and take the inverse of the Hessian of the log-posterior at MAP as a rough approximation for the covariance. This is equivalent to modifying the proposal rule (2) to
~y ∼ ~x + σA~ǫ, (3)
where A is the Cholesky decomposition of H−^1
AA
T = H−^1 , (4)
and H is the negative log-Hessian of the log-posterior at the MAP solution. We refer to chains with such jump proposals as non-isotropic Gaussian RWM. Figure 1 compares the isotropic and nonisotropic proposals.
where C is the prior covariance. Thus, if C−^1 is banded, H will be banded in time as well. As an example, Gaussian autoregressive processes of any finite order form a large class of priors which have banded C−^1. In particular, for white-noise stimuli, C−^1 is diagonal, and therefore H will have the same bandwidth as J. Efficient algorithms can find the Cholesky decomposition of a banded d × d matrix, with bandwidth nb, in a number of computations ∝ n^2 b d, instead of ∝ d^3. Likewise, if B is a banded triangular matrix with bandwidth nb, the linear equation B~x = ~y can be solved for ~x in ∝ nbd computations. Therefore, to calculate ~x = A˜x from ˜x in each step of the Markov chain, we proceed as follows. Before starting the chain, we first calculate the Cholesky decomposition of H, such that H = BT B and ~x = Ax˜ = B−^1 x˜. Then, at each step of the MCMC, given ˜xt, we find ~xt by solving the equation B~xt = ˜xt. Since both of these procedures involve a number of computations that only scale with d (and thus with T ), we can perform the whole MCMC decoding in O(T ) computational time. This allows us to decode stimuli with essentially unlimited durations. Similar O(T ) methods have been applied frequently in the state-space model literature (Shephard and Pitt, 1997; Davis and Rodriguez-Yam, 2005; Jungbacker and Koopman, 2007); we will discuss these state-space methods in more depth in a later chapter.
6.2.2 Hybrid (Hamiltonian) Monte Carlo typically mixes much faster than random- walk Metropolis
The random-walk chains discussed above are convenient, but often mix slowly in practice, basically because random walks spend a lot of time re-covering old ground: in the simplest case, a Brownian motion requires O(n^2 ) time to achieve a distance of n steps from its starting point, on average. The so-called hybrid (or Hamiltonian) Monte Carlo (HMC) method is a powerful method for constructing chains which avoid the slow mixing behavior of random-walk chains. This method was originally inspired by the equations of Hamiltonian dynamics for the molecules in a gas (Duane et al., 1987), but has since been used extensively in Bayesian settings (for its use in sampling from posteriors based on GLMs see (Ishwaran, 1999); see also (Neal, 1996) for further applications and extensions). Let’s begin with the physical analogy. Let ~x be the vector of positions of all the molecules in a gas at a moment in time, and ~p the vector of their momenta (~x and ~p have the same dimension). In thermal equilibrium, these variables have the distribution p(~x, ~p) ∝ e−H(~x,~p), where
H(~x, ~p) =
~pT~p + E(~x)
is the Hamiltonian (total energy) of the system, which is the sum of the kinetic energy, 12 ~pT~p, and the potential energy, E(~x). Notice that due to the factorization of this distribution, ~x and ~p are independently distributed; the marginal distribution of ~x is proportional to exp(−E(~x)) and that of ~p is the Gaussian ∝ exp(− 12 ~pT~p). Thus if we could obtain samples {(~xt, ~pt)} from p(~x, ~p) with the potential energy E(~x) = − log p(~x) + const., we would also have obtained samples from the marginal {~xt} from p(~x). Now, ~x and ~p evolve according to the Hamiltonian dynamics
~p˙ = − ∂H ∂~x
= −∇E(~x), ~x˙ =
∂~p
= p,~ (5)
under which the energy H(~x, ~p) is conserved. These dynamics also preserve the volume element d~xd~p in the (~x, ~p) space (this is the content of Liouville’s theorem from classical dynamics). These
two facts imply that the continuous-time (deterministic) Markov chain in the (~x, ~p) space that cor- responds to the solution of Eqs. (5) leaves the distribution p(~x, ~p) ∝ e−H(~x,~p)^ invariant. Ergodicity of this Markov chain can be assured if we “thermalize” the momentum variables, by sampling ~p from its exact marginal distribution (the standard Gaussian) at regular time intervals. Thus so- lutions of Eqs. (5) should lead to good transition densities for our MCMC chain. Moreover, we expect these proposals to outperform the basic random-walk proposals for two reasons: first, by construction, the Hamiltonian method will follow the energy gradient and thus propose jumps to- wards higher-probability regions, instead of just proposing random jumps without paying attention to the gradient of p(~x). Second, the Hamiltonian method will move for many time steps in the same direction, allowing us to move n steps in O(n) time, instead of the O(n^2 ) time required by the random-walk method. In practice, a numerical solution to Eqs. (5) is obtained by discretizing the Hamiltonian equa- tions, in which case H(~x, ~p) may change slightly with each step, and the M-H correction is required to guarantee that p(~x, ~p) remains the equilibrium distribution. The discretization customarily used in HMC applications is the “leapfrog” method, which is time reversible and leaves the (~x, ~p)-volume element invariant. The complete HMC algorithm with the leapfrog discretization has the following steps for generating {~xt+1, ~pt+1}, starting from {~xt, ~pt}, in each step of the Markov chain:
This chain has two parameters, L and σ, which can be chosen to maximize the mixing rate of the chain while minimizing the number of evaluations of E(~x) and its gradient. In practice, even a small L (e.g., L < 5) often yields a rapidly mixing chain. The special case L = 1 corresponds to a chain that has proposals of the form
~y ∼ ~x −
σ^2 2 ∇E(~x) + σ~p,
where ~p is normal with zero mean and identity covariance, and the proposal ~y is accepted according to the usual M-H rule. In the limit σ → 0, this chain becomes a continuous Langevin process with the potential function E(~x) = − log p(~x), whose stationary distribution is the Gibbs measure, p(~x) = exp(−E(~x)), without the Metropolis-Hastings rejection step. For a finite σ, however, the Metropolis-Hastings acceptance step is necessary to guarantee that p(~x) is the invariant distribution. The chain is thus refered to as the “Metropolis-adjusted Langevin” algorithm (MALA) (Roberts and Rosenthal, 2001). The scale parameter σ (which also needs to be adjusted for the RWM chain) sets the average size of the proposal jumps: we must typically choose this scale to be small enough to avoid jumping wildly into a region of low p(~x) (and therefore wasting the proposal, since it will be rejected with high probability). At the same time, we want to make the jumps as large as possible, on average,
a) Gibbs b) ARS Hit−and−Run c) Random Walk Metropolis
Figure 2: Comparison of different MCMC algorithms in sampling from a non-isotropic truncated Gaussian distribution. Panels (a-c) show 50-sample chains for a Gibbs, isotropic hit-and-run, and isotropic random walk Metropolis (RWM) samplers, respectively. The grayscale indicates the height of the probability density. As seen in panel (a), the narrow, non-isotropic likelihood can significantly hamper the mixing of the Gibbs chain as it chooses its jump directions unfavorably. The hit-and- run chain, on the other hand, mixes much faster as it samples the direction randomly and hence can move within the narrow high likelihood region with relative ease. The mixing of the RWM chain is relatively slower due to its rejections (note that there are fewer than 50 distinct dots in panel (c) due to rejections; the acceptance rate was about 0.4 here). For illustrative purposes, the hit-and-run direction and the RWM proposal distributions were taken to be isotropic here, which is disadvantageous, as explained in the text (also see Fig. 1).
where p(Ai) denotes the probability that we have chosen to update the subset Ai. It is important to note that the Gibbs update rule can sometimes fail to lead to an ergodic chain (i.e., the chain can get “stuck” and not sample from p(~x) properly) (Robert and Casella, 2005). An extreme case of this is when the conditional distributions pm(xm|~x⊥m) are deterministic: then the Gibbs algorithm will never move, clearly breaking the ergodicity of the chain. More generally, in cases where strong correlations between the components of ~x lead to nearly deterministic condi- tionals, the mixing rate of the Gibbs method can be extremely low (panel (a) of Fig. 2, shows this phenomenon for a 2-dimensional distribution with strong correlation between the two components). Thus, it is a good idea to choose the parameterization of the model carefully before applying the Gibbs algorithm. For example, we can change the basis, or more systematically, exploit the Laplace approximation, as described above, to sample from the auxiliary distribution ˜p(˜x) instead.
6.2.4 The hit-and-run algorithm performs Gibbs sampling in random directions
The hit-and-run algorithm (Lovasz and Vempala, 2003) can be thought of as “random-direction Gibbs”: in each step of the hit-and-run algorithm, instead of updating ~x along one of the coordinate axes, we update it along a random general direction not necessarily parallel to any coordinate axis. More precisely, the sampler is defined in two steps: first, choose a direction ~n from some positive density ρ(~n) on the unit sphere ~nT~n = 1. Then, similar to Gibbs, sample the new point on the line defined by ~n and ~x, with a density proportional to the underlying distribution. That is, we sample s from the one-dimensional density
s ∼ h(s|~n, ~x) ∝ p(~x + s~n), −∞ < s < ∞,
and then set ~y = ~x + s~n.
The main gain over RWM or HMC is that instead of taking small local steps (of size proportional to σ), we may take very large jumps in the ~n direction; the jump size is set by the underlying distribution itself, not an arbitrary scale σ. This, together with the fact that all hit-and-run proposals are accepted, makes the chain better at escaping from sharp high-dimensional corners (see (Lovasz and Vempala, 2003) and the discussion at the end of Sec. 6.2.2 above). Moreover, the algorithm is parameter-free, once we choose the direction sampling distribution r(.), unlike the random walk case, where we have to choose the jump size somewhat carefully. The advantage over Gibbs is in situations such as depicted in Fig. 1, where jumps parallel to coordinates lead to small steps but there are directions that allow long jumps to be made by hit-and-run. The price to pay for these possibly long nonlocal jumps, however, is that now (as well as in the Gibbs case) we need to sample from the one-dimensional density s ∼ (^) Z^1 p(~x + s~n), which is in general non-trivial. This may be done by rejection sampling, the cumulative inverse transformation, or one-dimensional Metropolis. If the posterior distribution is log-concave, then so are all of its “slices,” and very efficient specialized methods such as slice sampling (Neal, 2003) or adaptive rejection sampling (Gilks and Wild, 1992) can be used; see (Ahmadian et al., 2008) for a discussion of the adaptive rejection sampling method, and section 7.6 below for a brief discussion of the slice sampling method. To establish the validity of the hit-and-run sampler, we argue as in the Gibbs case: we compute the ratio p(~y)q(~x|~y) p(~x)q(~y|~x)
p(~y)
ρ(~n)q(~x|~y, ~n)d~n p(~x)
ρ(~n)q(~y|~x, ~n)d~n
∫ ρ(~n)p(~y)q(~x|~y, ~n)d~n ρ(~n)p(~x)q(~y|~x, ~n)d~n
since p(~x)q(~y|~x, ~n) = p(~y)q(~x|~y, ~n) ∝ p(~x)p(~y) for any ~x and ~y connected by a line along the direction ~n; thus we see that the hit-and-run proposal is indeed a M-H proposal which is accepted with probability one. Regarding the direction density, ρ(~n), the easiest choice is the isotropic ρ(~n) = 1. More generally it is easy to sample from ellipses, by sampling from the appropriate Gaussian distribution and normalizing. Thus, again, a reasonable approach is to exploit the Laplace approximation: we sample ~n by sampling an auxiliary point ˜x from N (0, H−^1 ), and setting ~n = ˜x/‖x˜‖ (see Fig. 1). Sampling from N (0, H−^1 ) may be performed via the efficient Cholesky techniques discussed in the previous sections. This non-isotropic direction density enhances hit-and-run’s advantage over Gibbs by giving more weight to directions that allow larger jumps to be made.
In the last few sections, we have discussed some qualitative arguments supporting various MCMC algorithms, in terms of their mixing rates and computational costs. Here we give a more quantitative account, and also compare the different methods based on their performance in the neural decoding setup. From a practical point of view, the most relevant notion of mixing is how fast the estimate ˆgN converges to the true expectation of the quantity of interest, E(g). As one always has access to finitely many samples, N , even in the optimal case of i.i.d. samples from p, ˆgN has a finite random error, (1/N )V arp(g). For correlated samples from the MCMC chain, the asymptotic error is typically larger (since each sample gives somewhat redundant information), and the error may be written (Robert and Casella, 2005) as
Var(ˆgN ) = τcorr N
Var[g(~x)] + o
( (^) τ corr N
regarding sampling from some special classes of high-dimensional distributions. Based on their results, the authors argue that in general, the jump scales of RWM and MALA proposals should be chosen such that their acceptance rates are roughly 0.25 and 0.55, respectively. For the special case of sampling from a d dimensional standard Gaussian distribution, p(~x) ∝ exp (−‖~x‖^2 /2), and for optimally chosen proposal jump scales they show that the FOE of Gaussian MALA and RWM are asymptotically equal to 1. 6 d^2 /^3 and 1.33, respectively, for large d; in comparison, (Ahmadian et al.,
In this section we compare Bayesian stimulus decoding using the MAP and the MCMC posterior mean estimates, based on the response of a population of neurons modeled via the GLM. We will show that in the flat prior case, the MAP estimate is much less efficient (in terms of its mean squared error) than the posterior mean estimate. We contrast this with the Gaussian prior case, where the Laplace approximation is accurate over a large range of model parameters, and thus the two estimates are close. Furthermore, for both kinds of priors, in the limit of strong likelihoods
−
0
2
amplitude
−
0
2
amplitude
0 100 200 300 400 500
−
0
2
time(ms)
amplitude
−
0
2
amplitude
−
0
2
amplitude
0 100 200 300 400 500
−
0
2
time(ms)
amplitude
−
0
2
amplitude
−
0
2
amplitude
0 100 200 300 400 500
−
0
2
time(ms)
amplitude
Stimulus
MAP
BLS
c‖k‖ =. 5 c‖k‖ = 1 c‖k‖ = 2. 4
Figure 4: Comparison of MAP and posterior mean estimates, for a pair of ON and OFF model neurons, for different values of the stimulus filter amplitude (‖k‖ = 0.5, 1, and 2.4; i.e., the dependence of the firing rate on the stimulus was increasing from left to right). The stimulus (the black trace shown on all panels) consists of a 500 ms interval of uniformly distributed white noise, refreshed every 10 ms. Thus the stimulus space is 50 dimensional. The cyan horizontal lines mark the boundaries of the flat prior distribution of the stimulus intensity on each 10 ms subinterval. They are set at ±
3, corresponding to intensity variance of 1 and zero mean. Dots on the top row show the spikes of the ON (gray) and the OFF (black) cell. The red traces in the middle row are the MAP estimates, and the blue traces in the bottom rows show the posterior means estimated from 10000 samples of a hit-and-run chain (after burning 2500 samples). The shaded regions in the second and third rows are error bars representing the estimated marginal posterior uncertainty about each stimulus value for the MAP and mean estimates. For the MAP (second rows), these are calculated as the square root of the diagonal of the inverse Hessian H−^1 , but they have been cut-off where they would have encroached on the zero prior region beyond the horizontal cyan lines. For the posterior mean (third rows), they represent one standard deviation about the mean, calculated as the square root of the diagonal of the covariance matrix, itself estimated from the MCMC chain (the standard error of the posterior mean estimate due to the finite samples of the MCMC were much smaller than these error bars, and are not shown). Note that the errorbars of the mean are in general smaller than those for the MAP, and that all estimate uncertainties decrease as the stimulus informativeness grows.
(e.g., due to a strong stimulus filter or a large number of neurons) the posterior distribution will be sharply concentrated, the Laplace approximation becomes asymptotically more and more accurate, and both estimates will eventually converge to the true stimulus (more precisely the part of the stimulus that is not outside the receptive field of all the neurons; see footnote 5, below). In the first two examples (Figs. 4–5), the stimulus estimates were computed given the simulated spike trains of a population of pairs of ON and OFF retinal ganglion cells (RGC), in response