















































































Abstract In inference for set-identified parameters, Bayesian probability statements about unknown parameters do not coincide, even asymptotically, with frequentist's confidence statements. This paper aims to smooth out this disagreement from a robust Bayes perspective. I show that a class of prior distributions exists, with which the posterior inference statements drawn via the lower envelope (lower probability) of the class of posterior distributions asymptotically agree with frequentist confidence statements for the identified set. With this class of priors, the statistical decision problems, including the point and set estimation of the set-identified parameters, are analyzed under the posterior gamma-minimax criterion.

Keywords: Partial Identification, Bayesian Robustness, Belief Function, Imprecise Probability, Gamma-minimax, Random Set.

Email: [email protected]. I thank Gary Chamberlain, Andrew Chesher, Siddhartha Chib, Larry Epstein, Jean-Pierre Florens, Guido Imbens, Hiro Kaido, Charles Manski, Ulrich Müller, Andriy Norets, Adam Rosen, Kevin Song, and Elie Tamer for their valuable discussions and comments. I also thank the seminar participants at Academia Sinica, Brown, Cornell, Cowles Conference 2011, EC^2 2010, Harvard/MIT, Midwest Econometrics 2010, Northwestern, NYU, RES Conference 2011, Seoul National University, Simon Fraser University, and UBC for their helpful comments. All remaining errors are mine. Financial support from the ESRC through the ESRC Centre for Microdata Methods and Practice (CeMMAP) (grant number RES-589-28-0001) is gratefully acknowledged.
1 Introduction
In inferring identified parameters in a parametric setup, the Bayesian probability statements about unknown parameters are found to be similar, at least asymptotically, to the frequentist confidence statements about the true value of the parameters. In partial identification analyses initiated by Manski (1989, 1990, 2003, 2007), such asymptotic harmony between the two inference paradigms breaks down (Moon and Schorfheide (2011)). The Bayesian interval estimates for the set-identified parameter are shorter, even asymptotically, than the frequentist ones, and they asymptotically lie inside the frequentist confidence intervals. Frequentists might interpret this phenomenon, Bayesian over-confidence in their inferential statements, as being fictitious. Bayesians, on the other hand, might consider that the frequentist confidence statements, which apparently lack posterior probability interpretation, raise some interpretative difficulty once data are observed.

The primary aim of this paper is to smooth out the disagreement between the two schools of statistical inference by applying the perspective of a robust Bayes inference, where one can incorporate partial prior knowledge into posterior inference. While there is a variety of robust Bayes approaches, this paper focuses on a multiple prior Bayes analysis, where the partial prior knowledge, or the robustness concern against prior misspecification, is modeled with a class of priors (ambiguous belief). The Bayes rule is applied to each prior to form a class of posteriors. The posterior inference procedures considered in this paper operate on the class of posteriors by focusing on their lower and upper envelopes, the so-called posterior lower and upper probabilities.

When the parameters are not identified, the prior distribution of the model parameters can be decomposed into two components: one that can be updated by data (revisable prior knowledge) and one that can never be updated by data (unrevisable prior knowledge).
Given that the ultimate goal of the partial identification analysis is to establish a "domain of consensus" (Manski (2007)) among the set of assumptions that data are silent about, a natural way to incorporate this agenda into the robust Bayes framework is to design a prior class in such a way that it shares a single prior distribution for the revisable prior knowledge, but allows for arbitrary prior distributions for the unrevisable prior knowledge. Using this prior class as a prior input, this paper derives the posterior lower probability and investigates
identified model of entry games with multiple equilibria, and provide an axiomatic argument that justifies a single-prior Bayesian inference for a set-identified parameter. The current paper does not intend to provide any normative argument as to whether one should proceed with a single prior or multiple priors in inferring non-identified parameters.

The analysis of lower and upper probabilities originates with Dempster (1966, 1967a, 1967b, 1968), in his fiducial argument of drawing posterior inferences without specifying a prior distribution. Dempster's influence appears in the belief function analysis of Shafer (1976, 1982) and the imprecise probability analysis of Walley (1991). In the context of robust Bayes analysis, the lower and upper probabilities have played important roles in measuring the global sensitivity of the posterior (Berger (1984), Berger and Berliner (1986)) and also in characterizing a class of priors/posteriors (DeRobertis and Hartigan (1981), Wasserman (1989, 1990), and Wasserman and Kadane (1990)). In econometrics, pioneering work using multiple priors was carried out by Chamberlain and Leamer (1976), and Leamer (1982), who obtained the bounds for the posterior mean of the regression coefficients when a prior varies over a certain class. None of these previous studies explicitly considered non-identified models. This paper, in contrast, focuses on non-identified models, and aims to clarify a link between the early idea of the lower and upper probabilities and a recent issue on inferences in set-identified models.

The posterior lower probability to be obtained in this paper is an infinite-order monotone capacity, or equivalently, a containment functional in the random set theory.
Beresteanu and Molinari (2008) and Beresteanu, Molchanov, and Molinari (2012) show the usefulness and wide applicability of the random set theory to a class of partially identified models by viewing observations as random sets, and the estimand (identified set) as its Aumann expectation. They propose an asymptotically valid frequentist inference procedure for the identified set by employing the central limit theorem applicable to the properly defined sum of random sets. Galichon and Henry (2006, 2009) and Beresteanu, Molchanov, and Molinari (2011) propose a use of infinite-order capacity in defining and inferring the identified set in the structural econometric model with multiple equilibria. The robust Bayes analysis of this paper closely relates to the literature of non-additive measures and random sets, but the way that these theories enter the analysis differs from these previous works in the following ways. First, the class of models to be considered is assumed to have well-defined likelihood functions, and the lack of identification is modeled in terms of the "data-independent flat regions" of the likelihood. Ambiguity is not explicitly modeled at the level of observations, but instead ambiguity for the parameters is introduced through the absence of prior knowledge on each flat region of the likelihood. Second, I obtain the identified set as random sets, whose probability law is represented by the posterior lower probability. Here, the source of probability that induces the random identified set is the posterior uncertainty for the identifiable parameters, not the sampling probability of the observations. Third, the inferential statements to be proposed in the paper are made conditional on data, and they do not invoke any large-sample approximations.

The decision theoretic analysis in this paper employs the posterior gamma-minimax criterion, which leads to a decision that minimizes the worst case posterior risk over the class of posteriors. The gamma-minimax decision analysis often becomes challenging, both analytically and numerically, and the existing analyses are limited to rather simple parametric models with a certain choice of prior class (Betro and Ruggeri (1992), Chamberlain (2000), and Vidakovic (2000)). The specified prior class, in contrast, offers a general and feasible way to solve the posterior gamma-minimax decision problem, provided that the identified set for the parameter of interest can be computed for each of the identified parameter values. In a recent study by Song (2012), point estimation for an interval-identified parameter from the local asymptotic minimax approach is considered.
The rest of the paper is organized as follows. In Section 2, the main results of this paper are presented using a simple example of missing data. Section 3 introduces the general framework, where I construct a class of prior distributions that can contain arbitrary unrevisable prior knowledge. I then derive the posterior lower and upper probabilities. Statistical decision analyses with multiple priors are examined in Section 4. Section 5 discusses how to construct the posterior credible regions based on the posterior lower probability and examines their large-sample behavior in an interval-identified parameter case. Proofs and lemmas are provided in Appendix A.
robust Bayes procedure considered in this paper aims to make the posterior inference free from such sensitivity concerns by introducing multiple priors for $\theta$. The way to construct a prior class is as follows. I first specify a single prior $\mu_\phi$ for the identified parameters $\phi$. In view of $\theta$, the prior $\mu_\phi$ specifies how much prior belief should be assigned to each flat region of the likelihood of $\theta$, $\Gamma(\phi)$, whereas, depending on the way the assigned belief is allocated over $\theta \in \Gamma(\phi)$ (for each $\phi$), the implied prior for $\theta$ may differ. Therefore, by collecting all the possible ways of allocating the assigned belief over $\{\theta \in \Gamma(\phi)\}$ for each $\phi$, I can construct the following class of prior distributions of $\theta$,

$$\mathcal{M}(\mu_\phi) = \{\mu_\theta : \mu_\theta(\Gamma(B)) = \mu_\phi(B) \text{ for all } B \subset \Phi\},$$

where $\mu_\theta$ denotes a prior distribution for $\theta$. By applying the Bayes rule to each $\mu_\theta \in \mathcal{M}(\mu_\phi)$ and marginalizing each posterior of $\theta$ for $\eta$, I obtain the class of posteriors of $\eta$, $\{F_{\eta|X} : \mu_\theta \in \mathcal{M}(\mu_\phi)\}$. I now summarize the class of posteriors of $\eta$ by its lower envelope (lower probability),

$$F_{\eta|X*}(D) = \inf_{\mu_\theta \in \mathcal{M}(\mu_\phi)} F_{\eta|X}(D),$$

which maps a subset $D$ in the parameter space of $\eta$ to $[0, 1]$. In words, the posterior lower probability evaluated at $D$ says that the posterior belief allocated for $\{\eta \in D\}$ is at least $F_{\eta|X*}(D)$, no matter which $\mu_\theta \in \mathcal{M}(\mu_\phi)$ is used. The main theorem of this paper shows that the posterior lower probability satisfies

$$F_{\eta|X*}(D) = F_{\phi|X}(\{\phi : H(\phi) \subset D\}),$$

where $F_{\phi|X}$ denotes the posterior distribution of $\phi$ implied from the prior $\mu_\phi$. The key insight of this equality is that, with prior class $\mathcal{M}(\mu_\phi)$, drawing inference for $\eta$ based on its posterior lower probability is done by analyzing the probability law of random sets $H(\phi)$, $\phi \sim F_{\phi|X}$.

Leaving their formal analysis to the later sections of this paper, I now outline the implementation of the posterior lower probability inference for $\eta$ proposed in this paper.
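As a rough illustration of the containment identity (a sketch under stated assumptions, not the procedure of this paper), the posterior lower probability can be approximated by Monte Carlo: draw the identified parameters from their posterior, map each draw to its identified set, and count how often the set is contained in D. The setup below is assumed purely for illustration: a binary outcome with missing observations, parameter of interest Pr(Y = 1), identified set [Pr(Y = 1, observed), Pr(Y = 1, observed) + Pr(missing)], and a Dirichlet posterior under a flat prior. The function names, the counts, and the set D = [0.3, 0.8] are hypothetical. The upper probability is obtained from the conjugacy relation (one minus the lower probability of the complement of D), which amounts to counting the draws whose identified set intersects D.

```python
import random


def draw_phi(counts, rng):
    """One draw from a Dirichlet posterior (flat Dirichlet(1, 1, 1) prior assumed)."""
    g = [rng.gammavariate(c + 1.0, 1.0) for c in counts]
    total = sum(g)
    return [x / total for x in g]


def identified_set(phi):
    """Interval identified set for Pr(Y = 1) in the assumed missing-data setup.

    phi = (Pr(Y=1, observed), Pr(Y=0, observed), Pr(missing)).
    """
    p1_obs, _, p_mis = phi
    return (p1_obs, p1_obs + p_mis)


def lower_upper_probability(draws, d_lo, d_hi):
    """Monte Carlo lower and upper probabilities of D = [d_lo, d_hi]."""
    contained = 0  # draws with H(phi) inside D
    hit = 0        # draws with H(phi) intersecting D
    for phi in draws:
        lo, hi = identified_set(phi)
        if d_lo <= lo and hi <= d_hi:
            contained += 1
        if lo <= d_hi and hi >= d_lo:
            hit += 1
    n = len(draws)
    return contained / n, hit / n


rng = random.Random(0)
counts = (40, 30, 30)  # hypothetical cell counts: (Y=1 observed, Y=0 observed, missing)
draws = [draw_phi(counts, rng) for _ in range(5000)]
lower, upper = lower_upper_probability(draws, 0.3, 0.8)
print(f"lower probability: {lower:.3f}, upper probability: {upper:.3f}")
```

Because every identified set is non-empty, any draw that is contained in D also intersects D, so the lower probability can never exceed the upper one in this sketch.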
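decision criterion, and they can be approximated by

$$\arg\min_{a} \frac{1}{S} \sum_{s=1}^{S} \sup_{\eta \in H(\phi_s)} (a - \eta)^2 \quad \text{and} \quad \arg\min_{a} \frac{1}{S} \sum_{s=1}^{S} \sup_{\eta \in H(\phi_s)} |a - \eta|,$$

respectively.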
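The two approximations can be sketched in a few lines of Python when each identified set is an interval. This is a minimal sketch under assumptions that are not in the text: the sets are taken to be intervals computed from S draws of the identified parameters, the minimization uses a plain grid search, and all names and numbers are hypothetical. For an interval, the inner supremum is attained at an endpoint, so it equals the larger of the two endpoint distances raised to the relevant power.

```python
def worst_case_loss(a, interval, power):
    """sup over eta in [lo, hi] of |a - eta|**power, attained at an endpoint."""
    lo, hi = interval
    return max(abs(a - lo), abs(a - hi)) ** power


def gamma_minimax_estimate(intervals, power, grid_size=2001):
    """Grid search for arg min_a (1/S) * sum_s sup_{eta in H_s} |a - eta|**power."""
    lo_all = min(lo for lo, _ in intervals)
    hi_all = max(hi for _, hi in intervals)
    best_a, best_risk = lo_all, float("inf")
    for k in range(grid_size):
        # The objective is convex in a, and its minimizer lies in [lo_all, hi_all].
        a = lo_all + (hi_all - lo_all) * k / (grid_size - 1)
        risk = sum(worst_case_loss(a, iv, power) for iv in intervals) / len(intervals)
        if risk < best_risk:
            best_a, best_risk = a, risk
    return best_a


# Hypothetical identified sets, one per draw of the identified parameters.
intervals = [(0.35, 0.65), (0.40, 0.72), (0.38, 0.70), (0.42, 0.69)]
quadratic_estimate = gamma_minimax_estimate(intervals, power=2)
absolute_estimate = gamma_minimax_estimate(intervals, power=1)
print(quadratic_estimate, absolute_estimate)
```

A grid search is used only to keep the sketch dependency-free; since the objective is convex in a, any one-dimensional convex minimizer would serve equally well.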
3 Multiple-prior Analysis and the Lower and Upper Probabilities
Let $(\mathbf{X}, \mathcal{X})$ and $(\Theta, \mathcal{A})$ be measurable spaces of a sample $X \in \mathbf{X}$ and a parameter vector $\theta \in \Theta$, respectively. The analytical framework of this paper covers both a parametric model $\Theta = \mathbb{R}^d$, $d < \infty$, and a non-parametric model where $\Theta$ is a separable Banach space. The sample size is implicit in the notation. Let $\mu_\theta$ be a marginal probability distribution on the parameter space $(\Theta, \mathcal{A})$, referred to as a prior distribution for $\theta$. Assume that the conditional distribution of $X$ given $\theta$ exists and has the probability density $p(x|\theta)$ at every $\theta \in \Theta$ with respect to a $\sigma$-finite measure on $(\mathbf{X}, \mathcal{X})$. The parameter vector $\theta$ may consist of parameters that determine the behaviors of the economic agents, as well as those that characterize the distribution of the unobserved heterogeneities in the population. In the context of the missing data or counterfactual causal models, $\theta$ indexes the distribution of the underlying population outcomes or the potential outcomes. In all of these cases, the parameter $\theta$ should be distinguished from the parameters
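where $\Gamma(\phi)$ and $\Gamma(\phi')$ for $\phi \neq \phi'$ are disjoint, and $\{\Gamma(\phi) : \phi \in \Phi\}$ constitutes a partition of $\Theta$. I assume $g(\Theta) = \Phi$, so $\Gamma(\phi)$ is non-empty for every $\phi \in \Phi$.^3 In the set-identified model, the parameter of interest $\eta \in \mathcal{H}$ is a subvector or a transformation of $\theta$ denoted by $\eta = h(\theta)$, $h : (\Theta, \mathcal{A}) \to (\mathcal{H}, \mathcal{D})$. The formal definition of the identified set of $\eta$ is given as follows.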
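Definition 3.1 (Identified Set of $\eta$) (i) The identified set of $\eta$ is a set-valued map $H : \Phi \rightrightarrows \mathcal{H}$ defined by the projection of $\Gamma(\phi)$ onto $\mathcal{H}$ through $h(\cdot)$, $H(\phi) \equiv \{h(\theta) : \theta \in \Gamma(\phi)\}$.

(ii) The parameter $\eta = h(\theta)$ is point-identified at $\phi$ if $H(\phi)$ is a singleton, and is set-identified at $\phi$ if $H(\phi)$ is not a singleton.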
Note that the identification of $\eta$ is defined in the pre-posterior sense because it is based on the likelihood evaluated at every possible realization of a sample, not only for the observed one.
I now provide some examples, in addition to the illustrative example of Section 2, both to illustrate the above concepts and notation, and to provide a concrete focus for the later development.
^3 In an observationally restrictive model, in the sense of Koopman and Reiersol (1950), $\hat{p}(x|\phi)$, the likelihood function for the sufficient parameters, is well defined for a domain larger than $g(\Theta)$ (see Example 3.1 in Section 3.2). In this case, the model possesses the falsifiability property, and $\Gamma(\phi)$ can be empty for some $\phi \in \Phi$.

Example 3.1 (Bounding ATE by Linear Programming) Consider the treatment effect model with noncompliance and a binary instrument $Z \in \{1, 0\}$, as considered in Imbens and Angrist (1994), and Angrist, Imbens, and Rubin (1996). Assume that the treatment status and the outcome of interest are both binary. Let $(W_1, W_0) \in \{1, 0\}^2$ be the potential treatment status in response to the instrument, and $W = ZW_1 + (1 - Z)W_0$ be the observed treatment status. $(Y_1, Y_0) \in \{1, 0\}^2$ is a pair of treated and control outcomes and $Y = WY_1 + (1 - W)Y_0$ is the observed outcome. The data are a random sample of $(Y_i, W_i, Z_i)$. Following Imbens and Angrist (1994), consider partitioning the population into four subpopulations defined in terms of the potential treatment-selection responses:
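$$T_i = \begin{cases} c & \text{if } W_{1i} = 1 \text{ and } W_{0i} = 0: \text{ complier,} \\ at & \text{if } W_{1i} = W_{0i} = 1: \text{ always-taker,} \\ nt & \text{if } W_{1i} = W_{0i} = 0: \text{ never-taker,} \\ d & \text{if } W_{1i} = 0 \text{ and } W_{0i} = 1: \text{ defier,} \end{cases}$$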
where $T_i$ is the indicator for the types of selection responses. Assume a randomized instrument, $Z \perp (Y_1, Y_0, W_1, W_0)$. Then, the distribution of observables and the distribution of potential outcomes satisfy the following equalities for $y \in \{1, 0\}$:
$$\begin{aligned} \Pr(Y = y, W = 1 \mid Z = 1) &= \Pr(Y_1 = y, T = c) + \Pr(Y_1 = y, T = at), \\ \Pr(Y = y, W = 1 \mid Z = 0) &= \Pr(Y_1 = y, T = d) + \Pr(Y_1 = y, T = at), \\ \Pr(Y = y, W = 0 \mid Z = 1) &= \Pr(Y_0 = y, T = d) + \Pr(Y_0 = y, T = nt), \\ \Pr(Y = y, W = 0 \mid Z = 0) &= \Pr(Y_0 = y, T = c) + \Pr(Y_0 = y, T = nt). \end{aligned} \tag{3.2}$$
Ignoring the marginal distribution of $Z$, a full parameter vector of the model can be specified by a joint distribution of $(Y_1, Y_0, T)$:

$$\theta = (\Pr(Y_1 = y, Y_0 = y', T = t) : y = 1, 0;\ y' = 1, 0;\ t = c, nt, at, d) \in \Theta,$$

where $\Theta$ is the 16-dimensional probability simplex. Let ATE be the parameter of interest.
$$\begin{aligned} E(Y_1 - Y_0) &= \sum_{t = c, nt, at, d} [\Pr(Y_1 = 1, T = t) - \Pr(Y_0 = 1, T = t)] \\ &= \sum_{t = c, nt, at, d} \sum_{y = 1, 0} [\Pr(Y_1 = 1, Y_0 = y, T = t) - \Pr(Y_1 = y, Y_0 = 1, T = t)] \equiv h(\theta). \end{aligned}$$
The likelihood conditional on $Z$ depends on $\theta$ only through the distribution of $(Y, W)$ given $Z$, so the sufficient parameter vector consists of eight probability masses:

$$\phi = (\Pr(Y = y, W = w \mid Z = z) : y = 1, 0;\ w = 1, 0;\ z = 1, 0).$$
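The example's title suggests computing the bounds of the identified set for the ATE by linear programming. The following Python sketch shows one way this could be set up, assuming SciPy's `linprog` is available: the sixteen cell probabilities of (Y_1, Y_0, T) are the decision variables, the eight observable probabilities impose linear equality constraints, and the ATE is a linear objective with coefficient y1 - y0 on each cell. The W = 0 constraints are written in terms of Y_0, since the observed outcome equals Y_0 when W = 0. The helper names and the numerical parameter values are hypothetical and chosen only so that the true ATE (0.3) is known.

```python
from itertools import product

from scipy.optimize import linprog

TYPES = ("c", "at", "nt", "d")
# Cells of the joint distribution of (Y1, Y0, T), in a fixed order.
CELLS = [(y1, y0, t) for y1, y0, t in product((1, 0), (1, 0), TYPES)]
# Types that take the treatment (W = 1) under each instrument value z.
TREATED = {1: ("c", "at"), 0: ("d", "at")}
# Index (y, w, z) of the eight observable probabilities Pr(Y=y, W=w | Z=z).
KEYS = [(y, w, z) for y, w, z in product((1, 0), (1, 0), (1, 0))]


def constraint_row(y, w, z):
    """0/1 row selecting the cells that add up to Pr(Y=y, W=w | Z=z)."""
    row = []
    for y1, y0, t in CELLS:
        takes = t in TREATED[z]
        if w == 1:
            row.append(1.0 if takes and y1 == y else 0.0)
        else:
            row.append(1.0 if (not takes) and y0 == y else 0.0)
    return row


A_EQ = [constraint_row(*k) for k in KEYS]


def phi_from_theta(theta):
    """Map the full parameter (dict over CELLS) to the observable probabilities."""
    return {k: sum(a * theta[c] for a, c in zip(row, CELLS))
            for k, row in zip(KEYS, A_EQ)}


def ate_bounds(phi):
    """Smallest and largest ATE compatible with phi, via two linear programs."""
    cost = [float(y1 - y0) for y1, y0, _ in CELLS]  # ATE = sum (y1 - y0) * cell prob
    b_eq = [phi[k] for k in KEYS]
    # Non-negativity is imposed by bounds; summing to one is implied by b_eq.
    lo = linprog(cost, A_eq=A_EQ, b_eq=b_eq, bounds=(0, None))
    hi = linprog([-c for c in cost], A_eq=A_EQ, b_eq=b_eq, bounds=(0, None))
    if not (lo.success and hi.success):
        raise ValueError("phi is not compatible with the model")
    return lo.fun, -hi.fun


# Hypothetical parameter value: type shares and independent potential outcomes.
share = {"c": 0.5, "at": 0.2, "nt": 0.2, "d": 0.1}
theta = {(y1, y0, t): share[t] * (0.7 if y1 else 0.3) * (0.4 if y0 else 0.6)
         for y1, y0, t in CELLS}
lower, upper = ate_bounds(phi_from_theta(theta))
print(f"ATE bounds: [{lower:.3f}, {upper:.3f}]")
```

Repeating `ate_bounds` over posterior draws of the observable probabilities would yield one interval per draw, which is the input that lower-probability calculations on the ATE require.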
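where

$$w_i(\theta) = \frac{\exp\{\gamma(g(\theta))'(m(x_i) - g(\theta))\}}{\sum_{i=1}^{n} \exp\{\gamma(g(\theta))'(m(x_i) - g(\theta))\}}, \qquad \gamma(g(\theta)) = \arg\min_{\gamma \in \mathbb{R}^J} \left\{ \sum_{i=1}^{n} \exp\{\gamma'(m(x_i) - g(\theta))\} \right\}.$$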
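Weights of this form (an exponential of a tilting parameter times the centered moment, normalized to sum to one) can be computed numerically. The sketch below is only an illustration under simplifying assumptions that are not in the text: a scalar moment (J = 1), a damped Newton iteration chosen here for convenience, and hypothetical names and data. The symbol gamma stands for the tilting parameter that solves the inner minimization, and g for the value of the moment implied by the parameters.

```python
import math


def tilting_parameter(moments, g, tol=1e-12, max_iter=100):
    """Minimize sum_i exp(gamma * (m_i - g)) over scalar gamma (damped Newton)."""
    d = [m - g for m in moments]
    if not (min(d) < 0.0 < max(d)):
        raise ValueError("g must lie strictly inside the range of the moments")

    def objective(c):
        return sum(math.exp(c * di) for di in d)

    gamma = 0.0
    for _ in range(max_iter):
        e = [math.exp(gamma * di) for di in d]
        grad = sum(di * ei for di, ei in zip(d, e))
        hess = sum(di * di * ei for di, ei in zip(d, e))
        if abs(grad) < tol:
            break
        step = grad / hess
        current = sum(e)
        # Halve the Newton step until the convex objective does not increase.
        while objective(gamma - step) > current:
            step /= 2.0
        gamma -= step
    return gamma


def tilting_weights(moments, g):
    """Normalized weights exp(gamma * (m_i - g)) / sum_j exp(gamma * (m_j - g))."""
    gamma = tilting_parameter(moments, g)
    e = [math.exp(gamma * (m - g)) for m in moments]
    total = sum(e)
    return [ei / total for ei in e]


moments = [0.1, 0.4, 0.5, 0.9, 1.3]  # hypothetical values of m(x_i)
g = 0.5                              # hypothetical value of g(theta)
weights = tilting_weights(moments, g)
print(weights)
```

The first-order condition of the inner minimization implies that the weighted average of the moments equals g, which gives a convenient numerical check on the computed weights.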
Thus, the parameter $\theta = (\eta, \lambda)$ enters the likelihood only through $g(\theta) = A\eta + \lambda$. Consequently, I take $\phi = g(\theta)$ to be the sufficient parameters. The identified set for $\eta$ is given by