






























































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
The application of bayesian approaches for model uncertainty problems, where multiple models are under consideration. It explains how the bayesian approach induces a posterior distribution of model uncertainty, which can be used for model selection and other inference and decision problems. Practical implementations for bayesian variable selection for the linear model and bayesian cart model selection, illustrating the benefits of model averaging. Related approaches for generalized linear models, time series models, loglinear models, and graphical model selection are also mentioned.
Typology: Study notes
1 / 70
This page cannot be seen from the preview
Don't miss anything!































































Model Selection IMS Lecture Notes - Monograph Series (2001) Volume 38
Hugh Chipman, Edward I. George and Robert E. McCulloch
The University of Waterloo, The University of Pennsylvania and The University of Chicago
Abstract
In principle, the Bayesian approach to model selection is straightforward. Prior probability distributions are used to describe the uncertainty surround- ing all unknowns. After observing the data, the posterior distribution pro- vides a coherent post data summary of the remaining uncertainty which is relevant for model selection. However, the practical implementation of this approach often requires carefully tailored priors and novel posterior calcula- tion methods. In this article, we illustrate some of the fundamental practical issues that arise for two different model selection problems: the variable se- lection problem for the linear model and the CART model selection problem.
(^0) Hugh Chipman is Associate Professor of Statistics, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada; email: [email protected]. Edward I. George is Professor of Statistics, Department of Statistics, The Wharton School of the University of Penn- sylvania, 3620 Locust Walk, Philadelphia, PA 19104-6302, U.S.A; email: [email protected]. Robert E. McCulloch is Professor of Statistics, Graduate School of Business, University of Chicago, 1101 East 58th Street, Chicago, IL, U.S.A; email: [email protected]. This work was supported by NSF grant DMS-98.03756 and Texas ARP grant 003658.690.
68 H. Chipman, E. I. George and R. E. McCulloch
the need for subjective inputs, automatic default hyperparameter choices are often rec- ommended. For this purpose, empirical Bayes considerations, either formal or informal, can be helpful, especially when informative choices are needed. However, subjective con- siderations can also be helpful, at least for roughly gauging prior location and scale and for putting small probability on implausible values. Of course, when substantial prior information is available, the Bayesian model selection implementations provide a natural environment for introducing realistic and important views.
By abandoning the pure subjective point of view, the evaluation of such Bayesian methods must ultimately involve frequentist considerations. Typically, such evaluations have taken the form of average performance over repeated simulations from hypothetical models or of cross validations on real data. Although such evaluations are necessarily limited in scope, the Bayesian procedures have consistently performed well compared to non-Bayesian alternatives. Although more work is clearly needed on this crucial aspect, there is cause for optimism, since by the complete class theorems of decision theory, we need look no further than Bayes and generalized Bayes procedures for good frequentist performance.
The second major challenge confronting the practical application of Bayesian model selection approaches is posterior calculation or perhaps more accurately, posterior ex- ploration. Recent advances in computing technology coupled with developments in nu- merical and Monte Carlo methods, most notably Markov Chain Monte Carlo (MCMC), have opened up new and promising directions for addressing this challenge. The basic idea behind MCMC here is the construction of a sampler which simulates a Markov chain that is converging to the posterior distribution. Although this provides a route to calculation of the full posterior, such chains are typically run for a relatively short time and used to search for high posterior models or to estimate posterior character- istics. However, constructing effective samplers and the use of such methods can be a delicate matter involving problem specific considerations such as model structure and the prior formulations. This very active area of research continues to hold promise for future developments.
In this introduction, we have described our overall point of view to provide context for the implementations we are about to describe. In Section 2, we describe the general Bayesian approach in more detail. In Sections 3 and 4, we illustrate the practical im- plementation of these general ideas to Bayesian variable selection for the linear model and Bayesian CART model selection, respectively. In Section 5, we conclude with a brief discussion of related recent implementations for Bayesian model selection.
Practical Bayes Model Selection 69
Suppose a set of K models M = {M 1 ,... , MK } are under consideration for data Y , and that under Mk, Y has density p(Y | θk, Mk) where θk is a vector of unknown parameters that indexes the members of Mk. (Although we refer to Mk as a model, it is more precisely a model class). The Bayesian approach proceeds by assigning a prior probability distribution p(θk | Mk) to the parameters of each model, and a prior probability p(Mk) to each model. Intuitively, this complete specification can be understood as a three stage hierarchical mixture model for generating the data Y ; first the model Mk is generated from p(M 1 ),... , p(MK ), second the parameter vector θk is generated from p(θk | Mk), and third the data Y is generated from p(Y | θk, Mk).
Letting Yf be future observations of the same process that generated Y , this prior formulation induces a joint distribution p(Yf , Y, θk, Mk) = p(Yf , Y | θk, Mk) p(θk | Mk) p(Mk). Conditioning on the observed data Y , all remaining uncertainty is captured by the joint posterior distribution p(Yf , θk, Mk | Y ). Through conditioning and marginal- ization, this joint posterior can be used for a variety Bayesian inferences and decisions. For example, when the goal is exclusively prediction of Yf , attention would focus on the predictive distribution p(Yf | Y ), which is obtained by margining out both θk and Mk. By averaging over the unknown models, p(Yf |Y ) properly incorporates the model uncer- tainty embedded in the priors. In effect, the predictive distribution sidesteps the problem of model selection, replacing it by model averaging. However, sometimes interest focuses on selecting one of the models in M for the data Y , the model selection problem. One may simply want to discover a useful simple model from a large speculative class of models. Such a model might, for example, provide valuable scientific insights or perhaps a less costly method for prediction than the model average. One may instead want to test a theory represented by one of a set of carefully studied models.
In terms of the three stage hierarchical mixture formulation, the model selection problem becomes that of finding the model in M that actually generated the data, namely the model that was generated from p(M 1 ),... , p(MK ) in the first step. The probability that Mk was in fact this model, conditionally on having observed Y , is the posterior model probability
p(Mk | Y ) = ∑p(Y^ |^ Mk)p(Mk) k p(Y^ |^ Mk)p(Mk)^
where
p(Y | Mk) =
∫ p(Y | θk, Mk)p(θk | Mk)dθk (2.2)
Practical Bayes Model Selection 71
the posterior weighted logarithmic divergence
∑ k′
p(Mk′ | Y )
∫ p(∆ | Mk′ , Y ) log p(∆^ |^ Mk
p(∆ | Mk, Y )
(San Martini and Spezzaferri 1984).
However, if the goal is strictly prediction and not model selection, then expected logarithmic utility is maximized by using the posterior weighted mixture p(∆ | Y ) in (2.5). Under squared error loss, the best prediction of ∆ is the overall posterior mean
E(∆ | Y ) =
∑ k
E(∆ | Mk, Y )p(Mk | Y ). (2.8)
Such model averaging or mixing procedures incorporate model uncertainty and have been advocated by Geisser (1993), Draper (1995), Hoeting, Madigan, Raftery and Volinsky (1999) and Clyde, Desimone and Parmigiani (1995). Note however, that if a cost of model complexity is introduced into these utilities, then model selection may dominate model averaging.
Another interesting modification of the decision theory setup is to allow for the possibility that the “true” model is not one of the Mk, a commonly held perspective in many applications. This aspect can be incorporated into a utility analysis by using the actual predictive density in place of p(∆ | Y ). In cases where the form of the true model is completely unknown, this approach serves to motivate cross validation types of training sample approaches, (see Bernardo and Smith 1994, Berger and Pericchi 1996 and Key, Perrichi and Smith 1998).
For a given set of models M, the effectiveness of the Bayesian approach rests firmly on the specification of the parameter priors p(θk|Mk) and the model space prior p(M 1 ),... , p(MK ). Indeed, all of the utility results in the previous section are predicated on the assump- tion that this specification is correct. If one takes the subjective point of view that these priors represent the statistician’s prior uncertainty about all the unknowns, then the posterior would be the appropriate update of this uncertainty after the data Y has been observed. However appealing, the pure subjective point of view here has practical limitations. Because of the sheer number and complexity of unknowns in most model uncertainty problems, it is probably unrealistic to assume that such uncertainty can be meaningfully described.
The most common and practical approach to prior specification in this context is to try and construct noninformative, semi-automatic formulations, using subjective and
72 H. Chipman, E. I. George and R. E. McCulloch
empirical Bayes considerations where needed. Roughly speaking, one would like to spec- ify priors that allow the posterior to accumulate probability at or near the actual model that generated the data. At the very least, such a posterior can serve as a heuristic device to identify promising models for further examination.
Beginning with considerations for choosing the model space prior p(M 1 ),... , p(MK ), a simple and popular choice is the uniform prior
p(Mk) ≡ 1 /K (2.9)
which is noninformative in the sense of favoring all models equally. Under this prior, the model posterior is proportional to the marginal likelihood, p(Mk |Y ) ∝ p(Y |Mk), and posterior odds comparisons in (2.3) reduce to Bayes factor comparisons. However, the apparent noninformativeness of (2.9) can be deceptive. Although uniform over models, it will typically not be uniform on model characteristics such as model size. A more subtle problem occurs in setups where many models are very similar and only a few are dis- tinct. In such cases, (2.9) will not assign probability uniformly to model neighborhoods and may bias the posterior away from good models. As will be seen in later sections, alternative model space priors that dilute probability within model neighborhoods can be meaningfully considered when specific model structures are taken into account.
Turning to the choice of parameter priors p(θk | Mk), direct insertion of improper noninformative priors into (2.1) and (2.2) must be ruled out because their arbitrary norming constants are problematic for posterior odds comparisons. Although one can avoid some of these difficulties with constructs such as intrinsic Bayes factors (Berger and Pericchi 1996) or fractional Bayes factors (O’Hagan 1995), many Bayesian model selection implementations, including our own, have stuck with proper parameter priors, especially in large problems. Such priors guarantee the internal coherence of the Bayesian formulation, allow for meaningful hyperparameter specifications and yield proper poste- rior distributions which are crucial for the MCMC posterior calculation and exploration described in the next section.
Several features are typically used to narrow down the choice of proper parameter priors. To ease the computational burden, it is very useful to choose priors under which rapidly computable closed form expressions for the marginal p(Y | Mk) in (2.2) can be obtained. For exponential family models, conjugate priors serve this purpose and so have been commonly used. When such priors are not used, as is sometimes necessary outside the exponential family, computational efficiency may be obtained with the ap- proximations of p(Y |Mk) described in Section 2.3. In any case, it is useful to parametrize p(θk | Mk) by a small number of interpretable hyperparameters. For nested model for- mulations, which are obtained by setting certain parameters to zero, it is often natural
74 H. Chipman, E. I. George and R. E. McCulloch
α(η∗^ | η(j)) = min
{ 1 , q(η(j)^ | η∗) q(η∗^ | η(j))
p(η∗^ | Y ) p(η(j)^ | Y )
}
Otherwise set η(j+1)^ = η(j),
This is a special case of the more elaborate reversible jump MH algorithms (Green 1995) which can be used when dimension of η is changing. The general availability of such MH algorithms derives from the fact that p(η | Y ) is only needed up to the norming constant for the calculation of α above.
The are endless possibilities for constructing Markov transition kernels p(η | η(j)) that guarantee convergence to p(η | Y ). The GS can be applied to different groupings and reorderings of the coordinates, and these can be randomly chosen. For MH algorithms, only weak conditions restrict considerations of the choice of q(η|η(j)) and can also be con- sidered componentwise. The GS and MH algorithms can be combined and used together in many ways. Recently proposed variations such as tempering, importance sampling, perfect sampling and augmentation offer a promising wealth of further possibilities for sampling the posterior. As with prior specification, the construction of effective transi- tion kernels and how they can be exploited is meaningfully guided by problem specific considerations as will be seen in later sections. Various illustrations of the broad practical potential of MCMC are described in Gilks, Richardson, and Spieglehalter (1996).
The use of MCMC to simulate the posterior distribution of a model index η is greatly facilitated when rapidly computable closed form expressions for the marginal p(Y | Mk) in (2.2) are available. In such cases, p(Y | η)p(η) ∝ p(η | Y ) can be used to implement GS and MH algorithms. Otherwise, one can simulate an index of the values of (θk, Mk) (or at least Mk and the values of parameters that cannot be eliminated analytically). When the dimension of such an index is changing, MCMC implementations for this purpose typically require more delicate design, see Carlin and Chib (1995), Dellaportas, Forster and Ntzoufras (2000), Geweke (1996), Green (1995), Kuo and Mallick (1998) and Phillips and Smith (1996).
Because of the computational advantages of having closed form expressions for p(Y |Mk), it may be preferable to use a computable approximation for p(Y | Mk) when exact ex- pressions are unavailable. An effective approximation for this purpose, when h(θk) ≡ log p(Y | θk, Mk)p(θk | Mk) is sufficiently well-behaved, is obtained by Laplace’s method (see Tierney and Kadane 1986) as
p(Y | Mk) ≈ (2π)dk^ /^2 |H(θ˜k)|^1 /^2 p(Y | θ˜k, Mk)p(θ˜k | Mk) (2.10)
where dk is the dimension of θk, θ˜k is the maximum of h(θk), namely the posterior mode of p(θ˜k |Y, Mk), and H(θ˜k) is minus the inverse Hessian of h evaluated at θ˜k. This is obtained
Practical Bayes Model Selection 75
by substituting the Taylor series approximation h(θk) ≈ h(θ˜k)− 12 (θk − θ˜k)′H(θ˜k)(θk − θ˜k) for h(θk) in p(Mk | Y ) = ∫ exp{h(θk)}dθk. When finding θ˜k is costly, further approximation of p(Y | M ) can be obtained by
p(Y | Mk) ≈ (2π)dk^ /^2 |H∗(θˆk)|^1 /^2 p(Y | ˆθk, Mk)p(θˆk | Mk) (2.11)
where θˆk is the maximum likelihood estimate and H∗^ can be H, minus the inverse Hessian of the log likelihood or Fisher’s information matrix. Going one step further, by ignoring the terms in (2.11) that are constant in large samples, yields the BIC approximation (Schwarz 1978) log p(Y | M ) ≈ log p(Y | θˆk, Mk) − (dk/2) log n (2.12)
where n is the sample size. This last approximation was successfully implemented for model averaging in a survival analysis problem by Raftery, Madigan and Volinsky (1996). Although it does not explicitly depend on a parameter prior, (2.12) may be considered an implicit approximation to p(Y | M ) under a “unit information prior” (Kass and Wasser- man 1995) or under a “normalized” Jeffreys prior (Wasserman 2000). It should be emphasized that the asymptotic justification for these successive approximations, (2.10), (2.11), (2.12), may not be very good in small samples, see for example, McCulloch and Rossi (1991).
Suppose Y a variable of interest, and X 1 ,... , Xp a set of potential explanatory variables or predictors, are vectors of n observations. The problem of variable selection, or subset selection as it often called, arises when one wants to model the relationship between Y and a subset of X 1 ,... , Xp, but there is uncertainty about which subset to use. Such a situation is particularly of interest when p is large and X 1 ,... , Xp is thought to contain many redundant or irrelevant variables.
The variable selection problem is usually posed as a special case of the model selec- tion problem, where each model under consideration corresponds to a distinct subset of X 1 ,... , Xp. This problem is most familiar in the context of multiple regression where at- tention is restricted to normal linear models. Many of the fundamental developments in variable selection have occurred in the context of the linear model, in large part because its analytical tractability greatly facilitates insight and computational reduction, and because it provides a simple first order approximation to more complex relationships. Furthermore, many problems of interest can be posed as linear variable selection prob- lems. For example, for the problem of nonparametric function estimation, the values of the unknown function are represented by Y , and a linear basis such as a wavelet basis or
Practical Bayes Model Selection 77
which are easy to specify, substantially reduce computational requirements, and often yield sensible results, see, for example, Clyde, Desimone and Parmigiani (1996), George and McCulloch (1993, 1997), Raftery, Madigan and Hoeting (1997) and Smith and Kohn (1996). Under this prior, each Xi enters the model independently of the other coefficients, with probability p(γi = 1) = 1 − p(γi = 0) = wi. Smaller wi can be used to downweight Xi which are costly or of less interest.
A useful reduction of (3.2) has been to set wi ≡ w, yielding
p(γ) = wqγ^ (1 − w)p−qγ^ , (3.3)
in which case the hyperparameter w is the a priori expected proportion of X i′s in the model. In particular, setting w = 1/2, yields the popular uniform prior
p(γ) ≡ 1 / 2 p, (3.4)
which is often used as a representation of ignorance. However, this prior puts most of its weight near models of size qγ = p/2 because there are more of them. Increased weight on parsimonious models, for example, could instead be obtained by setting w small. Alternatively, one could put a prior on w. For example, combined with a beta prior w ∼ Beta(α, β), (3.3) yields
p(γ) = B(α^ +^ qγ^ , β^ +^ p^ −^ qγ^ ) B(α, β)
where B(α, β) is the beta function. More generally, one could simply put a prior h(qγ ) on the model dimension and let
p(γ) =
( p qγ
)− 1 h(qγ ), (3.6)
of which (3.5) is a special case. Under priors of the form (3.6), the components of γ are exchangeable but not independent, (except for the special case (3.3)).
Independence and exchangeable priors on γ may be less satisfactory when the models under consideration contain dependent components such as might occur with interac- tions, polynomials, lagged variables or indicator variables (Chipman 1996). Common practice often rules out certain models from consideration, such as a model with an X 1 X 2 interaction but no X 1 or X 2 linear terms. Priors on γ can encode such prefer- ences.
With interactions, the prior for γ can capture the dependence relation between the importance of a higher order term and those lower order terms from which it was formed. For example, suppose there are three independent main effects A, B, C and three two- factor interactions AB, AC, and BC. The importance of the interactions such as AB will
78 H. Chipman, E. I. George and R. E. McCulloch
often depend only on whether the main effects A and B are included in the model. This belief can be expressed by a prior for γ = (γA,... , γBC ) of the form:
p(γ) = p(γA)p(γB )p(γC )p(γAB | γA, γB )p(γAC | γA, γC )p(γBC | γB , γC ). (3.7)
The specification of terms like p(γAC |γA, γC ) in (3.7) would entail specifying four probabilities, one for each of the values of (γA, γC ). Typically p(γAC | 0 , 0) < (p(γAC | 1 , 0), // p(γAC | 0 , 1)) < p(γAC | 1 , 1). Similar strategies can be considered to downweight or eliminate models with isolated high order terms in polynomial regressions or isolated high order lagged variables in ARIMA models. With indicators for a categorical predictor, it may be of interest to include either all or none of the indicators, in which case p(γ) = 0 for any γ violating this condition.
The number of possible models using interactions, polynomials, lagged variables or indicator variables grows combinatorially as the number of variables increases. In con- trast to independence priors of the form (3.2), priors for dependent component models, such as (3.7), is that they concentrate mass on “plausible” models, when the number of possible models is huge. This can be crucial in applications such as screening de- signs, where the number of candidate predictors may exceed the number of observations (Chipman, Hamada, and Wu 1997).
Another more subtle shortcoming of independence and exchangeable priors on γ is their failure to account for similarities and differences between models due to covariate collinearity or redundancy. An interesting alternative in this regard are priors that “dilute” probability across neighborhoods of similar models, the so called dilution priors (George 1999). Consider the following simple example.
Suppose that only two uncorrelated predictors X 1 and X 2 are considered, and that they yield the following posterior probabilities:
Variables in γ X 1 X 2 X 1 , X 2 p(γ | Y ) 0.3 0.4 0.
Suppose now a new potential predictor X 3 is introduced, and that X 3 is very highly correlated with X 2 , but not with X 1. If the model prior is elaborated in a sensible way, as is discussed below, the posterior may well look something like
Variables in γ X 1 X 2 X 3 X 1 , X 2 · · · p(γ | Y ) 0.3 0.13 0.13 0.06 · · ·
80 H. Chipman, E. I. George and R. E. McCulloch
We now consider parameter prior formulations for variable selection where the goal is to ignore only those Xi for which βi = 0 in (3.1). In effect, the problem then becomes that of selecting a submodel of (3.1) of the form
p(Y | βγ , σ^2 , γ) = Nn(Xγ βγ , σ^2 I) (3.8)
where Xγ is the n x qγ matrix whose columns correspond to the γth subset of X 1 ,... , Xp, βγ is a qγ × 1 vector of unknown regression coefficients, and σ^2 is the unknown residual variance. Here, (βγ , σ^2 ) plays the role of the model parameter θk described in Section 2.
Perhaps the most useful and commonly applied parameter prior form for this setup, especially in large problems, is the normal-inverse-gamma, which consists of a qγ -dimensional normal prior on βγ p(βγ | σ^2 , γ) = Nqγ ( β¯γ , σ^2 Σγ ), (3.9)
coupled with an inverse gamma prior on σ^2
p(σ^2 | γ) = p(σ^2 ) = IG(ν/ 2 , νλ/2), (3.10)
(which is equivalent to νλ/σ^2 ∼ χ^2 ν ). For example, see Clyde, DeSimone and Parmigiani (1996), Fernandez, Ley and Steel (2001), Garthwaite and Dickey (1992, 1996), George and McCulloch (1997), Kuo and Mallick (1998), Raftery, Madigan and Hoeting (1997) and Smith and Kohn (1996). Note that the coefficient prior (3.9), when coupled with p(γ), implicitly assigns a point mass at zero for coefficients in (3.1) that are not contained in βγ. As such, (3.9) may be thought of as a point-normal prior. It should also be mentioned that in one the first Bayesian variable selection treatments of the setup (3.8), Mitchell and Beauchamp (1988) proposed spike-and-slab priors. The normal-inverse- gamma prior above is obtained by simply replacing their uniform slab by a normal distribution.
A valuable feature of the prior combination (3.9) and (3.10) is analytical tractability; the conditional distribution of βγ and σ^2 given γ is conjugate for (3.8), so that βγ and σ^2 can be eliminated by routine integration from p(Y, βγ , σ^2 | γ) = p(Y | βγ , σ^2 , γ)p(βγ | σ^2 , γ) p(σ^2 | γ) to yield
p(Y | γ) ∝ |X γ′ Xγ + Σ− γ 1 |−^1 /^2 |Σγ |−^1 /^2 (νλ + S γ^2 )−(n+ν)/^2 (3.11)
where S^2 γ = Y ′Y − Y ′Xγ (X γ′ Xγ + Σ− γ 1 )−^1 X γ′ Y. (3.12)
As will be seen in subsequent sections, the use of these closed form expressions can substantially speed up posterior evaluation and MCMC exploration. Note that the scale
Practical Bayes Model Selection 81
of the prior (3.9) for βγ depends on σ^2 , and this is needed to obtain conjugacy. Integrating out σ^2 with respect to (3.10), the prior for βγ conditionally only on γ is
p(βγ | γ) = Tqγ (ν, β¯γ , λΣγ ) (3.13)
the qγ -dimensional multivariate T -distribution centered at β¯γ with ν degrees of freedom and scale λΣγ.
The priors (3.9) and (3.10) are determined by the hyperparameters β¯γ , Σγ , λ, ν,, which must be specified for implementations. Although a good deal of progress can be made through subjective elicitation of these hyperparameter values in smaller prob- lems when substantial expert information is available, for example see Garthwaite and Dickey (1996), we focus here on the case where such information is unavailable and the goal is roughly to assign values that “minimize” prior influence.
Beginning with the choice of λ and ν, note that (3.10) corresponds to the likelihood information about σ^2 provided by ν independent observations from a N (0, λ) distribu- tion. Thus, λ may be thought of as a prior estimate of σ^2 and ν may be thought of as the prior sample size associated with this estimate. By using the data and treating s^2 Y , the sample variance of Y , as a rough upper bound for σ^2 , a simple default strategy is to choose ν small, say around 3, and λ near s^2 Y. One might also go a bit further, by treating s^2 F U LL, the traditional unbiased estimate of σ^2 based on a saturated model, as a rough lower bound for σ^2 , and then choosing λ and ν so that (3.10) assigns substantial probability to the interval (s^2 F U LL, s^2 Y ). Similar informal approaches based on the data are considered by Clyde, Desimone and Parmigiani (1996) and Raftery, Madigan and Hoeting (1997). Alternatively, the explicit choice of λ and ν can be avoided by using p(σ^2 | γ) ∝ 1 /σ^2 , the limit of (3.10) as ν → 0, a choice recommended by Smith and Kohn (1996) and Fernandez, Ley and Steel (2001). This prior corresponds to the uniform dis- tribution on log σ^2 , and is invariant to scale changes in Y. Although improper, it yields proper marginals p(Y | γ) when combined with (3.9) and so can be used formally.
Turning to (3.9), the usual default for the prior mean β¯γ has been β¯γ = 0, a neutral choice reflecting indifference between positive and negative values. This specification is also consistent with standard Bayesian approaches to testing point null hypotheses, where under the alternative, the prior is typically centered at the point null value. For choosing the prior covariance matrix Σγ , the specification is substantially simplified by setting Σγ = c Vγ , where c is a scalar and Vγ is a preset form such as Vγ = (X γ′ Xγ )−^1 or Vγ = Iqγ , the qγ ×qγ identity matrix. Note that under such Vγ , the conditional priors (3.9) provide a consistent description of uncertainty in the sense that they are the conditional distributions of the nonzero components of β given γ when β ∼ Np(0, c σ^2 (X′X)−^1 ) or β ∼ Np(0, c σ^2 I), respectively. The choice Vγ = (X γ′ Xγ )−^1 serves to replicate the covariance structure of the likelihood, and yields the g-prior recommended by Zellner
Practical Bayes Model Selection 83
where SSγ = βˆ′ γ X γ′ Xγ βˆγ , βˆγ ≡ (X′ γ Xγ )−^1 X γ′ Y and F is a fixed penalty. This corre- spondence may be of interest because a wide variety of popular model selection criteria are obtained by maximizing (3.16) with particular choices of F and with σ^2 replaced by an estimate ˆσ^2. For example F = 2 yields Cp (Mallows 1973) and, approximately, AIC (Akaike 1973), F = log n yields BIC (Schwarz 1978) and F = 2 log p yields RIC (Donoho and Johnstone 1994, Foster and George 1994). The motivation for these choices are varied; Cp is motivated as an unbiased estimate of predictive risk, AIC by an expected information distance, BIC by an asymptotic Bayes factor and RIC by minimax predictive risk inflation.
The posterior correspondence with (3.16) is obtained by reexpressing the model pos- terior under (3.14) and (3.15) as
p(γ | Y ) ∝ wqγ^ (1 − w)p−qγ^ (1 + c)−qγ^ /^2 exp
{ − Y ′Y − SSγ 2 σ^2 −^
SSγ 2 σ^2 (1 + c)
}
∝ exp
[ (^) c 2(1 + c) {SSγ^ /σ
(^2) − F (c, w) qγ }
] , (3.17)
where F (c, w) = 1 +c^ c
{ 2 log^1 −w^ w+ log(1 + c)
}
. (3.18)
The expression (3.17) reveals that, as a function of γ for fixed Y , p(γ | Y ) is increasing in (3.16) when F = F (c, w). Thus, both (3.16) and (3.17) are simultaneously maximized by the same γ when c and w are chosen to satisfy F (c, w) = F. In this case, Bayesian model selection based on p(γ | Y ) is equivalent to model selection based on the criterion (3.16).
This correspondence between seemingly different approaches to model selection pro- vides additional insight and interpretability for users of either approach. In particular, when c and w are such that F (c, w) = 2, log n or 2 log p, selecting the highest posterior model (with σ^2 set equal to ˆσ^2 ) will be equivalent to selecting the best AIC/Cp, BIC or RIC models, respectively. For example, F (c, w) = 2, log n and 2 log p are obtained when c ' 3. 92 , n and p^2 respectively, all with w = 1/2. Similar asymptotic connections are pointed out by Fernandez, Ley and Steel (2001) when p(σ^2 | γ) ∝ 1 /σ^2 and w = 1/2. Because the posterior probabilities are monotone in (3.16) when F = F (c, w), this cor- respondence also reveals that the MCMC methods discussed in Section 3.5 can also be used to search for large values of (3.16) in large problems where global maximization is not computationally feasible. Furthermore, since c and w control the expected size and proportion of the nonzero components of β, the dependence of F (c, w) on c and w provides an implicit connection between the penalty F and the profile of models for which its value may be appropriate.
Ideally, the prespecified values of c and w in (3.14) and (3.15) will be consistent with
84 H. Chipman, E. I. George and R. E. McCulloch
the true underlying model. For example, large c will be chosen when the regression coef- ficients are large, and small w will be chosen when the proportion of nonzero coefficients are small. To avoid the difficulties of choosing such c and w when the true model is completely unknown, it may be preferable to treat c and w as unknown parameters, and use empirical Bayes estimates of c and w based on the data. Such estimates can be obtained, at least in principle, as the values of c and w that maximize the marginal likelihood under (3.14) and (3.15), namely
L(c, w | Y ) ∝
∑ γ
p(γ | w) p(Y | γ, c)
∝
∑ γ
wqγ^ (1 − w)p−qγ^ (1 + c)−qγ^ /^2 exp
{ (^) c SS γ 2 σ^2 (1 + c)
}
. (3.19)
Although this maximization is generally impractical when p is large, the likelihood (3.19) simplifies considerably when X is orthogonal, a setup that occurs naturally in nonpara- metric function estimation with orthogonal bases such as wavelets. In this case, letting ti = bivi/σ where v^2 i is the ith diagonal element of X′X and bi is the ith component of βˆ = (X′X)−^1 X′Y , (3.19) reduces to
L(c, w | Y ) ∝
∏^ p i=
{ (1 − w)e−t (^2) i / 2
. (3.20)
Since many fewer terms are involved in the product in (3.20) than in the sum in (3.19), maximization of (3.20) is feasible by numerical methods even for moderately large p.
Replacing σ^2 by an estimate ˆσ^2 , the estimators ˆc and ˆw that maximize the marginal likelihood L above can be used as prior inputs for an empirical Bayes analysis under the priors (3.14) and (3.15). In particular, (3.17) reveals that the γ maximizing the posterior p(γ | Y, ˆc, wˆ) can be obtained as the γ that maximizes the marginal maximum likelihood criterion CMML = SSγ /ˆσ^2 − F (ˆc, wˆ) qγ , (3.21)
where F (c, w) is given by (3.18). Because maximizing (3.19) to obtain ˆc and wˆ can be computationally overwhelming when p is large and X is not orthogonal, one might instead consider a computable empirical Bayes approximation, the conditional maximum likelihood criterion
CCML = SSγ /σˆ^2 − qγ
{ 1 + log+(SSγ /σˆ^2 qγ )
} − 2
{ log(p − qγ )−(p−qγ^ )^ + log q− γqγ
} (3.22)
where log+(·) is the positive part of log(·). Selecting the γ that maximizes CCML provides an approximate empirical Bayes alternative to selection based on CMML.
In contrast to criteria of the form (3.16), which penalize SSγ /σˆ^2 by F qγ , with F constant, CMML uses an adaptive penalty F (ˆc, wˆ) that is implicitly based on the esti- mated distribution of the regression coefficients. CCML also uses an adaptive penalty,