The Annals of Statistics
2002, Vol. 30, No. 5, 1225–1310

WHAT IS A STATISTICAL MODEL?¹

BY PETER MCCULLAGH
University of Chicago

This paper addresses two closely related questions, "What is a statistical model?" and "What is a parameter?" The notions that a model must "make sense," and that a parameter must "have a well-defined meaning" are deeply ingrained in applied statistical work, reasonably well understood at an instinctive level, but absent from most formal theories of modelling and inference. In this paper, these concepts are defined in algebraic terms, using morphisms, functors and natural transformations. It is argued that inference on the basis of a model is not possible unless the model admits a natural extension that includes the domain for which inference is required. For example, prediction requires that the domain include all future units, subjects or time points. Although it is usually not made explicit, every sensible statistical model admits such an extension. Examples are given to show why such an extension is necessary and why a formal theory is required. In the definition of a subparameter, it is shown that certain parameter functions are natural and others are not. Inference is meaningful only for natural parameters. This distinction has important consequences for the construction of prior distributions and also helps to resolve a controversy concerning the Box–Cox model.

1. Introduction. According to currently accepted theories [Cox and Hinkley (1974), Chapter 1; Lehmann (1983), Chapter 1; Barndorff-Nielsen and Cox (1994), Section 1.1; Bernardo and Smith (1994), Chapter 4] a statistical model is a set of probability distributions on the sample space S.
A parameterized statistical model is a parameter set Θ together with a function P : Θ → P(S), which assigns to each parameter point θ ∈ Θ a probability distribution P_θ on S. Here P(S) is the set of all probability distributions on S. In much of the following, it is important to distinguish between the model as a function P : Θ → P(S), and the associated set of distributions PΘ ⊂ P(S).

In the literature on applied statistics [McCullagh and Nelder (1989); Gelman, Carlin, Stern and Rubin (1995); Cox and Wermuth (1996)], sound practical advice is understandably considered to be more important than precise mathematical definitions. Thus, most authors do not offer a precise mathematical definition of a statistical model. Typically, the preceding definition is taken for granted, and applied to a range of sensibly constructed models.

Received January 2000; revised July 2001.
¹Supported in part by NSF Grants DMS-97-05347 and DMS-00-71726.
AMS 2000 subject classifications. Primary 62A05; secondary 62F99.
Key words and phrases. Aggregation, agricultural field experiment, Bayes inference, Box–Cox model, category, causal inference, commutative diagram, conformal model, contingency table, embedding, exchangeability, extendability, extensive variable, fertility effect, functor, Gibbs model, harmonic model, intensive variable, interference, Kolmogorov consistency, lattice process, measure process, morphism, natural parameterization, natural subparameter, opposite category, quadratic exponential model, representation, spatial process, spline model, type III model.

At a minimum, a Bayesian model requires an additional component in the form of a prior distribution on Θ. A Bayesian model in the sense of Berger (1985), Smith (1984) or Bernardo and Smith (1994) requires an extra component in the form of a judgment of infinite exchangeability or partial exchangeability in which parameters are defined by limits of certain statistics.
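The parameterized-model definition above can be made concrete in code. The following is a minimal sketch, not from the paper: the function `normal_model` (an illustrative name) plays the role of P : Θ → P(S), with Θ the half-plane of (µ, σ) pairs and each P_θ represented by its density on S = R.

```python
import math

# A minimal sketch of a parameterized statistical model P : Theta -> P(S),
# with S = R and Theta = {(mu, sigma) : sigma > 0}.  Each parameter point
# theta is mapped to a probability distribution, represented by its density.
# All names here are illustrative, not taken from the paper.

def normal_model(theta):
    """P_theta: the N(mu, sigma^2) distribution, represented by its density."""
    mu, sigma = theta
    def density(y):
        return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return density

# The model is the *function* theta -> P_theta; the associated *set* of
# distributions is its image {P_theta : theta in Theta}.
p = normal_model((0.0, 1.0))
print(round(p(0.0), 4))   # density of N(0,1) at 0 is 1/sqrt(2*pi) ≈ 0.3989
```

The distinction the paper draws between the function and its image is visible here: two different maps Θ → P(S) can have the same image, so the function carries more structure than the bare set of distributions.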
Although Bayesian formulations are not the primary focus of this paper, the notion that the model is extendable to a sequence, usually infinite, is a key concept.

The parameterization is said to be identifiable if distinct parameter values give rise to distinct distributions; that is, P_θ = P_θ′ implies θ = θ′. Thus, the parameter is identifiable if and only if P : Θ → P(S) is injective. Apart from these conditions, the standard definition permits arbitrary families of distributions to serve as statistical models, and arbitrary sets Θ to serve as parameter spaces. For applied work, the inadequacy of the standard definition is matched only by the eccentricity of the formulations that are permitted. These examples make it abundantly clear that, unless the model is embedded in a suitable structure that permits extrapolation, no useful inference is possible, either Bayesian or non-Bayesian.

To be fair, most authors sound a note of warning in their discussion of statistical models. Thus, for example, Cox and Hinkley [(1974), Chapter 1], while admitting that "it is hard to lay down precise rules for the choice of the family of models," go on to offer a range of recommendations concerning the model and the parameterization. In particular, "the model should be consistent with known limiting behavior" and the parameterization should be such that "different parameter [components] have individually clear-cut interpretations." The intention of this article is to define some of these concepts in purely algebraic terms.

2. Examples.

2.1. Twelve statistical exercises. The following list begins with four exercises in which the models are plainly absurd. The point of the exercises, however, is not so much the detection of absurd models as understanding the sources of absurdity. From a more practical viewpoint, the more interesting exercises are those in which the absurdity is not obvious at first sight.

EXERCISE 1 (A binary regression model).
Consider a model for independent binary responses in which certain covariates are prespecified. One of these covariates is designated the treatment indicator. The model specifies a logit link if the number of subjects is even, and a probit link otherwise. Inference is required for the treatment effect. Predictions are required for the response of a future subject with specified covariate values.

[…]

the observations are independent, normally distributed with conditional mean E(Y | x) = α + βx and constant variance σ². The parameter space is (α, β, σ²) ∈ R² × [0, ∞). A confidence interval or posterior distribution is required for the correlation coefficient.

EXERCISE 11 [Spatial simultaneous equation model (Section 6.5)]. Let Y be a spatial process observed on a rectangular lattice of sites. The joint distribution is defined by a set of simultaneous equations

    Y_i = Σ_{j∼i} β Y_j + ε_i,

in which j ∼ i means that site j ≠ i is a neighbour of site i. This expression is interpreted to mean that (I − B)Y = ε is standard normal. The components of B are zero except for neighboring sites, for which b_ij = β. The system is such that I − B is invertible, so that Y is normal with zero mean and inverse covariance matrix (I − B)ᵀ(I − B). A confidence interval is required for β.

EXERCISE 12 [Contingency table model (Section 6.4)]. All three factors in a contingency table are responses. In principle, each factor has an unbounded number of levels, but some aggregation has occurred, and factor B is in fact recorded in binary form. The log-linear model AB + BC is found to fit well, but no log-linear submodel fits. What conclusions can be drawn?

2.2. Remarks. These exercises are not intended to be comparable in their degree of absurdity. They are intended to illustrate a range of model formulations and inferential questions, some clearly artificial, some a little fishy, and others bordering on acceptability.
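The joint distribution in Exercise 11 is determined directly by its defining equations, and the inverse covariance matrix (I − B)ᵀ(I − B) can be formed explicitly. The following sketch does so on a small 2 × 2 lattice; the lattice, the value of β, and all names are illustrative, not from the paper.

```python
# Sketch of Exercise 11 on a 2x2 lattice of sites 0..3, where two sites are
# neighbours if they differ in exactly one grid coordinate.
# (I - B)Y = eps with eps standard normal, so the inverse covariance
# (precision) matrix of Y is (I - B)^T (I - B).

beta = 0.1
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
n = len(coords)

def neighbours(i, j):
    (a, b), (c, d) = coords[i], coords[j]
    return i != j and abs(a - c) + abs(b - d) == 1

# B has b_ij = beta for neighbouring sites, zero otherwise.
B = [[beta if neighbours(i, j) else 0.0 for j in range(n)] for i in range(n)]
IminusB = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(n)] for i in range(n)]

# Precision matrix K = (I - B)^T (I - B).
K = [[sum(IminusB[k][i] * IminusB[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]

# K is symmetric by construction, whatever the neighbourhood structure.
assert all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(n) for j in range(n))
```

On this lattice the diagonal entries work out to 1 + 2β² and neighbouring pairs to −2β, so the precision matrix couples exactly the neighbouring sites.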
In the artificial class, I include Exercises 1, 2, 4, and possibly 3. Exercise 9 ought also to be regarded as artificial or meaningless on the grounds of common sense. The type III model has been criticized from a scientific angle by Nelder (1977) in that it represents a hypothesis of no scientific interest. Despite this, the type III model continues to be promoted in textbooks [Yandell (1997), page 172] and similar models obeying the so-called weak heredity principle [Hamada and Wu (1992)] are used in industrial experiments.

The absurdity in the other exercises is perhaps less obvious. Although they appear radically different, from an algebraic perspective Exercises 2 and 5 are absurd in rather similar ways. Each of the formulations satisfies the standard definition of a statistical model. With respect to a narrowly defined inferential universe, each formulation is also a statistical model in the sense of the definition in Sections 1 and 4. Inference, however, is concerned with natural extension, and the absurdity of each formulation lies in the extent of the inferential universe or the scope of the model. Although a likelihood function is available in each case, and a posterior distribution can be computed, no inferential statement can breach the bounds of the inferential universe. To the extent that the scope of the model is excessively narrow, none of the formulations permits inference in the sense that one might reasonably expect. We conclude that, in the absence of a suitable extension, a likelihood and a prior are not sufficient to permit inference in any commonly understood sense.

3. Statistical models.

3.1. Experiments. Each statistical experiment or observational study is built from the following objects:

1. A set U of statistical units, also called experimental units, plots, or subjects;
2. A covariate space Ω;
3. A response scale V.
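These three components can be made concrete with a toy experiment. The sketch below uses hypothetical names throughout (plot and treatment labels are invented for illustration); it also builds the associated design space Ω^U and sample space V^U of all functions from the units into Ω and V.

```python
from itertools import product

# A toy experiment: units, covariate space, response scale, and a design
# x : U -> Omega assigning a treatment to each unit.  Names are illustrative.
U = ["plot1", "plot2", "plot3"]                 # statistical units
Omega = ["control", "treatment"]                # covariate space (treatments)
V = ["low", "medium", "high"]                   # response scale

design = {"plot1": "control", "plot2": "treatment", "plot3": "treatment"}
assert set(design) == set(U) and set(design.values()) <= set(Omega)

# The design space D = Omega^U and the sample space S = V^U are the sets of
# all functions from the units into Omega and into V, respectively.
D = [dict(zip(U, w)) for w in product(Omega, repeat=len(U))]
S = [dict(zip(U, w)) for w in product(V, repeat=len(U))]
print(len(D), len(S))   # |Omega|^|U| = 8, |V|^|U| = 27
```

Representing functions as dictionaries keyed by the units makes the later functorial operations (restriction to a subset of units, relabelling of treatments) literal dictionary operations.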
The design is a function x : U → Ω that associates with each statistical unit i ∈ U a point x_i in the covariate space Ω. The set D = Ω^U of all such functions from the units into the covariate space is called the design space. The response is a function y : U → V that associates with each statistical unit i a response value y_i in V. The set S = V^U of all such functions is called the sample space for the experiment.

In the definition given in Section 1, a statistical model consists of a design x : U → Ω, a sample space S = V^U and a family of distributions on S. A statistical model P : Θ → P(S) associates with each parameter value θ a distribution P_θ on S. Of necessity, this map depends on the design, for example, on the association of units with treatments, and the number of treatment levels that occur. Thus, to each design x : U → Ω, there corresponds a map P_x : Θ → P(S) such that P_x θ is a probability distribution on S. Exercise 2 suggests strongly that the dependence of P_x on x ∈ D cannot be arbitrary or capricious.

3.2. The inferential universe. Consider an agricultural variety trial in which the experimental region is a regular 6 × 10 grid of 60 rectangular plots, each seven meters by five meters, in which the long side has an east–west orientation. It is invariably understood, though seldom stated explicitly, that the purpose of such a trial is to draw conclusions concerning variety differences, not just for plots of this particular shape, size and orientation, but for comparable plots of various shapes, sizes and orientations. Likewise, if the trial includes seven potato varieties, it should ordinarily be possible to draw conclusions about a subset of three varieties from the subexperiment in which the remaining four varieties are ignored. These introductory remarks may seem obvious and unnecessary, but they have far-reaching implications for the construction of statistical models.
The first point is that before any model can be discussed it is necessary to establish an inferential universe. The mathematical universe of experimental units might be defined as a regular 6 × 10 grid of 7 × 5 plots with an east–west orientation. Alternatively, it could be defined as the set of all regular grids of rectangular plots, with an east–west orientation. Finally, and more usefully, the universe could be defined as a suitably large class of subsets of the plane, regardless of size, shape and orientation. From a purely mathematical perspective, each of these choices is internally perfectly consistent. From the viewpoint of inference, statistical or otherwise, the first and second choices determine a limited universe in which all inferential statements concerning the likely yields on odd-shaped plots of arbitrary size are out of reach.

In addition to the inferential universe of statistical units, there is a universe of response scales. In an agricultural variety trial, the available interchangeable response scales might be bushels per acre, tonnes per hectare, kg/m², and so on. In a physics or chemistry experiment, the response scales might be °C, °K, °F, or other suitable temperature scale. In a food-tasting experiment, a seven-point ordered scale might be used in which the levels are labelled as V = {unacceptable, mediocre, …, excellent}, or a three-point scale with levels V′ = {unacceptable, satisfactory, very good}. It is then necessary to consider transformations V → V′ in which certain levels of V are, in effect, aggregated to form a three-level scale. Finally, if the response scale is bivariate, including both yield and quality, this should not usually preclude inferential statements concerning quality or quantity in isolation. Similar comments are in order regarding the covariate space.
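The aggregation of scale levels just described is a surjective map γ : V → V′, and a distribution on V pushes forward along γ to a distribution on V′. A minimal sketch follows; the intermediate level names and the particular grouping are invented for illustration, since the paper elides the middle levels of the seven-point scale.

```python
# Aggregating a seven-point ordered scale V to a three-point scale V' via a
# surjective morphism gamma : V -> V', and pushing a distribution on V
# forward to V'.  The level names and grouping are illustrative.
V7 = ["unacceptable", "poor", "mediocre", "fair", "good", "very good", "excellent"]
V3 = ["unacceptable", "satisfactory", "very good"]

gamma = {
    "unacceptable": "unacceptable",
    "poor": "satisfactory", "mediocre": "satisfactory", "fair": "satisfactory",
    "good": "very good", "very good": "very good", "excellent": "very good",
}
assert set(gamma) == set(V7) and set(gamma.values()) == set(V3)  # surjective

def pushforward(p, gamma):
    """Image of a distribution p on V under gamma : V -> V'."""
    q = {}
    for level, prob in p.items():
        q[gamma[level]] = q.get(gamma[level], 0.0) + prob
    return q

p = dict(zip(V7, [0.05, 0.10, 0.15, 0.20, 0.25, 0.15, 0.10]))
q = pushforward(p, gamma)
print(q)   # aggregated probabilities; total mass is preserved
```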
If, in the experiment actually conducted, fertilizer was applied at the rates 0, 100, 200 and 300 kg/ha, the inferential universe should usually include all nonnegative doses. Likewise, if seven varieties of potato were tested, the inferential universe should include all sets of varieties because these particular varieties are a subset of other sets of varieties. This does not mean that informative statements can be made about the likely yield for an unobserved variety, but it does mean that the experiment performed may be regarded as a subset of a larger notional experiment in which further varieties were tested but not reported.

3.3. Categories. Every logically defensible statistical model in the classical sense has a natural extension from the set of observed units, or observed varieties, or observed dose levels, to other unobserved units, unobserved blocks, unobserved treatments, unobserved covariate values and so on. Thus, in constructing a statistical model, it is essential to consider not only the observed units, the observed blocks, the observed treatments and so on, but the inferential universe of all

[…]

[(1981), Section 2.1]. The implications for the theory of random processes are discussed briefly by Kingman [(1984), page 235].

In terms of its logical structure, each statistical model is constructed from the following three components:

1. A category cat_U in which each object U is a set of statistical units. The morphisms U → U′ in cat_U are all injective maps preserving the structure of the units. In typical regression problems, cat_U may be identified with the category I of all injective maps on finite sets.
2. A category cat_Ω in which each object Ω is a covariate space. The morphisms Ω → Ω′ are all injective maps preserving the structure of the covariate spaces.
3. A category cat_V in which each object V is a response scale. The morphisms V → V′ are all maps preserving the structure of the response scale. These are typically surjective.
These three categories are the building blocks from which the design category, the sample space category and all statistical models are constructed. Given a set of units U and a covariate space Ω, the design is a map x : U → Ω associating with each unit u ∈ U a point x_u in the covariate space Ω. In practice, this information is usually coded numerically for the observed units in the form of a matrix X whose uth row is the coded version of x_u. The set of all such designs is a category cat_D in which each object x is a pair (U, Ω) together with a map x : U → Ω. Since the domain and codomain are understood to be part of the definition of x, the set of designs is the set of all such maps with U in cat_U and Ω in cat_Ω, in effect, the set of model matrices X with labelled rows and columns. Each morphism ϕ : x → x′, in which x′ : U′ → Ω′, may be associated with a pair of injective maps ϕ_d : U → U′ in cat_U, and ϕ_c : Ω → Ω′ in cat_Ω, such that the diagram

    U  --x-->  Ω
    ϕ_d ↓      ↓ ϕ_c
    U′ --x′--> Ω′

commutes [Tjur (2000)]. In other words, x′ϕ_d and ϕ_c x represent the same design U → Ω′. In matrix notation, X′ = UXW, in which U is a row selection matrix and W is a code-transformation matrix.

The general idea behind this construction can be understood by asking what it means for one design to be embedded in another. Here we consider simple embeddings obtained by selection of units or selection of covariate values. First, consider the effect of selecting a subset U ⊂ U′ of the units and discarding the remainder. Let ϕ_d : U → U′ be the insertion map that carries each u ∈ U to itself as an element of U′. The design that remains when the units not in U are discarded is the composition x′ϕ_d : U → Ω′, which is the restriction of x′ to U. The diagram may thus be completed by taking Ω = Ω′, ϕ_c the identity, and x = x′ϕ_d. Next, consider the effect of selection based on covariate values, that is, selecting only those units U whose covariate values lie in the subset Ω ⊂ Ω′.
Let ϕ_c : Ω → Ω′ and ϕ_d : U → U′ be the associated insertion maps. The design map x : U → Ω is given by x′ restricted to the domain U and codomain Ω ⊂ Ω′. Finally, if U = U′, Ω = Ω′ and ϕ_c is a permutation of Ω, the design ϕ_c x is simply the design x with a rearrangement of labels. The same design could be obtained by suitably permuting the units before using the design map x, that is, the design xϕ_d.

4.2. Model. Let V be a fixed response scale. A response y on U is a function y : U → V, a point in the sample space V^U. To each set U there corresponds a sample space V^U of V-valued functions on U. Likewise, to each injective morphism ϕ : U → U′ in cat_U there corresponds a coordinate-projection map ϕ* : V^U′ → V^U. For f ∈ V^U′, the pullback map defined by functional composition, ϕ*f = f ∘ ϕ, is a V-valued function on U. Thus (V, ∗) is a functor on cat_U, associating with each set U the sample space V^U, and with each morphism ϕ : U → U′ the map ϕ* : V^U′ → V^U by functional composition. The identity map U → U is carried to the identity V^U → V^U, and the composite map ψϕ : U → U′′ to the composite ϕ*ψ* : V^U′′ → V^U in reverse order.

Before presenting a general definition of a statistical model, it may be helpful to give a definition of a linear model. Let V be a vector space, so that the sample space V^U is also a vector space. A linear model is a subspace of V^U, suitably related to the design. In the functor diagram below, each map in the right square is a linear transformation determined by functional composition. Thus, for f ∈ V^Ω′, the pullback by ϕ_c is ϕ_c* f = f ∘ ϕ_c, which is a vector in V^Ω. Likewise, ψ′* f = f ∘ ψ′ is a vector in V^U′.

        Design                          Linear model
    U  --ψ-->  Ω              S = V^U   <--ψ*--   Θ_Ω ⊂ V^Ω
    ϕ_d ↓      ↓ ϕ_c          ϕ_d* ↑              ↑ ϕ_c*
    U′ --ψ′--> Ω′             S′ = V^U′ <--ψ′*--  Θ_Ω′ ⊂ V^Ω′

As defined in McCullagh (2000), a linear model is determined by a subrepresentation Θ_Ω ⊂ V^Ω, together with the design pullback map ψ*.
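The sample-space functor (V, ∗) just described can be exercised numerically: pulling back along a composite of unit maps reverses the order of composition. A small check on finite sets follows (the maps and value labels are illustrative):

```python
# Check contravariance of the pullback: (psi . phi)* = phi* . psi*.
U1 = [0, 1]                # U
U2 = [0, 1, 2]             # U'
U3 = [0, 1, 2, 3]          # U''

phi = {0: 2, 1: 0}         # injective map U -> U'
psi = {0: 1, 1: 3, 2: 0}   # injective map U' -> U''

def pullback(m, f):
    """phi* f = f . m: for f in V^{codomain}, a V-valued function on the domain."""
    return {u: f[m[u]] for u in m}

f = {0: "a", 1: "b", 2: "c", 3: "d"}       # a point of V^{U''}

composite = {u: psi[phi[u]] for u in phi}  # psi . phi : U -> U''
lhs = pullback(composite, f)               # (psi . phi)* f
rhs = pullback(phi, pullback(psi, f))      # phi* (psi* f)
assert lhs == rhs
print(lhs)
```

The reversal of order is exactly the sense in which (V, ∗) is contravariant: maps that enlarge the set of units act on sample spaces by coordinate projection in the opposite direction.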
A subrepresentation Θ in the standard representation V^Ω is a sequence of subspaces {Θ_Ω ⊂ V^Ω}, indexed by the objects in cat_Ω, such that, for each map ϕ_c : Ω → Ω′ in cat_Ω, the linear transformation ϕ_c* : V^Ω′ → V^Ω also satisfies ϕ_c* Θ_Ω′ = Θ_Ω. This skeleton description of a linear model focuses only on the linear subspace, and says nothing about probability distributions on the sample space. In the usual complete formulation, the parameter space is extended to Θ × R⁺ by the inclusion of a dispersion parameter σ². For the design ψ : U → Ω, the sample space is S = V^U, and the full normal-theory linear model is the set of normal distributions with mean vector in the subspace ψ*Θ_Ω, and variance matrix proportional to the identity.

The skeletal description is shared by nonnormal linear models, certain nonlinear models and all generalized linear models [McCullagh and Nelder (1989)]. The only modification required is the step by which the representation Θ determines a family of distributions on S. In generalized linear models, this step is determined by the linear predictor η = ψ*θ from Θ_Ω into V^U, an invertible link function η = g(µ) that is also a natural componentwise transformation V^U → V^U, together with a suitable error distribution with independent components.

The general construction follows the same lines. For each fixed V, the parameter (Θ, ∗) is a contravariant functor cat_Ω → K associating with each covariate set Ω a parameter set Θ_Ω, and with each (injective) covariate morphism ϕ_c : Ω → Ω′ a surjective parameter map ϕ_c* : Θ_Ω′ → Θ_Ω. Frequently, K is the category of surjective linear transformations on vector spaces, in which case Θ is called a representation of cat_Ω. A statistical model is a functor on cat_D associating with each design object ψ : U → Ω a model object, which is a map P_ψ : Θ_Ω → P(S) such that P_ψ θ is a probability distribution on S. The set of distributions thus generated is F_ψ = P_ψ Θ_Ω.
To each morphism (ϕ_d, ϕ_c) : ψ → ψ′ in cat_D, the functor associates a map (ϕ_d†, ϕ_c*) : P_ψ′ → P_ψ as illustrated below:

        Design                Sample space          Model
    U  --ψ-->  Ω              S = V^U        P(S)  <--P_ψ--  Θ_Ω
    ϕ_d ↓      ↓ ϕ_c          ϕ_d* ↑         ϕ_d† ↑          ↑ ϕ_c*          (1)
    U′ --ψ′--> Ω′             S′ = V^U′      P(S′) <--P_ψ′-- Θ_Ω′

As usual in such diagrams, the maps in both squares are assumed to commute. Some consequences of this definition are as follows:

1. The sample-space transformation ϕ_d* : S′ → S also carries each distribution F on S′ to the transformed distribution ϕ_d† F = F ∘ ϕ_d*⁻¹ on S.
2. Commutativity: P_ψ ϕ_c* = ϕ_d† P_ψ′ : Θ_Ω′ → P(S). In other symbols, for each θ ∈ Θ_Ω′, P_ψ ϕ_c* θ = ϕ_d† P_ψ′ θ.

Condition (1) ensures that the family of distributions on S is suitably embedded in the family of distributions on S′. The maps ϕ_d* : S′ → S that define this embedding depend on the category of morphisms on units. For ϕ_c equal to the identity on Ω, and thus ϕ_c* the identity on Θ_Ω, the consistency condition (2) asserts that the distribution P_ψ θ on S is the same as the marginal distribution of P_ψ′ θ on S′ by the embedding map ϕ_d* : S′ → S.

[…]

subparameter σ/µ for different varieties, but the "subparameters" µ + σ² and µ + σ/µ are clearly meaningless. We are thus led to the concept of a natural subparameter. A natural subparameter is a natural transformation g : Θ → Ξ of functors on cat_D. To each design ψ : U → Ω the natural transformation g associates a map g_ψ : Θ_Ω → Ξ_Ω in such a way that the diagram shown below commutes:

    ψ          Θ_Ω  --g_ψ-->  Ξ_Ω
    (ϕ_d,ϕ_c) ↓     ϕ_c* ↑    ↑ ϕ_c′
    ψ′         Θ_Ω′ --g_ψ′--> Ξ_Ω′

In other words, g_ψ ϕ_c* = ϕ_c′ g_ψ′. Let g : Θ → Ξ and h : Θ → Λ be two natural subparameters such that (g, h) : Θ → Ξ × Λ is a natural isomorphism of functors. When such a pair exists, we say that g, h are complementary subparameters in Θ. In general, for a given subparameter g there need not exist a complementary subparameter. When a complementary parameter exists, it need not be unique.
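The consistency condition (2) can be checked concretely for a model with independent components: the distribution assigned on an inserted subset of units coincides with the marginal of the distribution assigned on the larger set. A sketch, with all names illustrative, representing each independent-components distribution by its per-unit (mean, variance) pairs:

```python
# For a model with independent N(theta, 1) components, the marginal of
# P_{psi'} theta under the coordinate projection phi_d* : S' -> S is
# restriction to the inserted units, and it equals P_psi theta.
Uprime = ["u1", "u2", "u3"]
U = ["u1", "u3"]                      # inserted subset of units

def model(units, theta):
    """P_psi theta: per-unit (mean, variance) pairs; independence is implicit."""
    return {u: (theta, 1.0) for u in units}

def marginal(dist, units):
    """phi_d†: marginalizing an independent-components distribution = restriction."""
    return {u: dist[u] for u in units}

theta = 2.5
assert model(U, theta) == marginal(model(Uprime, theta), U)
```

This is the Kolmogorov-type consistency the paper refers to: the same θ indexes compatible distributions on every sample space in the inferential universe.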
The notion of orthogonal parameters [Cox and Reid (1987)] does not arise here unless the parameter sets Θ_Ω are inner product spaces, and the maps ϕ_c* preserve inner products. Such functors do arise in rather specialized applications.

4.6. Identifiability. The parameter Θ is said to be identifiable at the design ψ : U → Ω if distinct parameter values give rise to distinct distributions on the sample space. In other words, Θ is identifiable at ψ if the associated map P_ψ : Θ_Ω → P(S) is injective. Likewise, a natural subparameter Ξ is identifiable at ψ if distinct ξ-values give rise to nonoverlapping sets of distributions. To each point ξ ∈ Ξ_Ω there corresponds a set of points g_ψ⁻¹ ξ in Θ_Ω, and a set of distributions P_ψ g_ψ⁻¹ ξ in F_ψ = P_ψ Θ_Ω. The subparameter g : Θ → Ξ is said to be identifiable at ψ if, for each ξ ≠ ξ′ in Ξ_Ω, the sets P_ψ g_ψ⁻¹ ξ and P_ψ g_ψ⁻¹ ξ′ are nonempty and nonoverlapping. An identifiable subparameter is thus a surjective natural transformation g : Θ → Ξ that generates a partition of F_ψ by nonoverlapping sets.

4.7. Response-scale morphisms. In the discussion thus far, the response scale has been kept fixed. That is to say, the analysis has not considered the effect on the model of changing the units of measurement from, say, bushels per acre to tonnes per hectare, or of aggregating selected levels of an ordinal response variable. To complete the story, it is necessary to consider the various morphisms γ : V → V′ of the response scale and their effect on the sample space and model. For each response scale V in cat_V, the sample space is a contravariant functor on cat_U, and the model is a contravariant functor on the design. To each morphism γ : V → V′ in cat_V there corresponds a natural transformation of sample spaces as illustrated in the commutative diagram below:

    V --γ--> V′

    U          V^U  --γ_U-->  V′^U
    ϕ ↓        ϕ*_V ↑         ↑ ϕ*_V′
    U′         V^U′ --γ_U′--> V′^U′
The category of morphisms of sample spaces, cat_S, is the category whose objects are all sample spaces S = V^U with V ∈ cat_V and U ∈ cat_U. The morphisms V^U′ → V′^U are all compositions γ_U ϕ*_V = ϕ*_V′ γ_U′ in which γ : V → V′ is a morphism of the codomain, and ϕ* is a functional composition with the domain morphism ϕ : U → U′. In this way, the category of morphisms on sample spaces is built up from a category of injective morphisms on units, and a category of morphisms on response scales. In fact, cat_S is isomorphic with the product category cat_V × cat_U^op, where cat_U^op is the opposite category derived from cat_U by the process of reversing all arrows. The categories cat_D^op and cat_Ω^op are defined in the same way. The logical structure of a statistical model is then illustrated by the following functor sequences, which show that cat_S and the parameter Θ are covariant functors on cat_V × cat_D^op:

    cat_V × cat_D^op → cat_V × cat_U^op ≅ cat_S ≅ P(S),
    cat_V × cat_D^op → cat_V × cat_Ω^op → K.                (2)

Model: P : Θ → P(S). The functor cat_S → P(S) is an isomorphism, associating with each sample space S = V^U the set of all probability distributions on S, and with each morphism f : S → S′ of sample spaces a morphism on distributions defined by composition with the inverse image of f. A model is a natural transformation P : Θ → P(S) associating with each pair (V, ψ : U → Ω) a map P_ψ : Θ → P(V^U) satisfying the properties in (1).

In the absence of covariate effects, the commutative diagram of a statistical model takes the following simplified form, in which the model P : Θ → P(S) is indexed by sample spaces S, S′, …. First, the parameter space (Θ, †) is a covariant functor on the category cat_V, associating with each response scale V a parameter set Θ_V, and with each morphism γ : V → V′ a parameter morphism γ† : Θ_V → Θ_V′.
Second, to each morphism of sample spaces γϕ* : V^U′ → V′^U, the transformation on the model is described by the commutative diagram

    S′ = V′^U      P(S′)  <--P_S′--  Θ_V′
    γϕ* ↑          ∘(γϕ*)⁻¹ ↑        ↑ γ†
    S  = V^U′      P(S)   <--P_S--   Θ_V

To see what this means, consider a random variable Y taking values in S = V^U′ with distribution P_S θ for some θ ∈ Θ_V. Each morphism S → S′ is the composition of a coordinate projection map ϕ* and a response-scale morphism γ. The marginal distribution of Y is (P_S θ) ∘ (γϕ*)⁻¹ on S′. Commutativity ensures that this distribution is also equal to the image under P_S′ of the induced parameter γ†θ. In other words, P : Θ → P(S) is a natural transformation of functors.

5. Special cases. The statement in equations (1) and (2), that a statistical model is a natural transformation P : Θ → P(S) of functors, is a consistency condition ensuring that different ways of computing the probability of equivalent events give the same answer. As a consequence, equivalent events determine the same likelihood. Some familiar instances are as follows.

Equivariance and transformation models. Let C = G be a group acting on the response scale V. The group induces a homomorphic action S → S on the sample space, usually by componentwise transformation. Likewise, the same group induces a homomorphic action Θ → Θ on the parameter space. The group actions on S and on Θ are conventionally denoted by the same symbol, y ↦ gy and θ ↦ gθ, respectively. Ordinarily, this abuse of notation leads to no confusion because the functors G → S and G → Θ are usually group isomorphisms. The group transformation formulation is an assertion that the event–parameter pairs (E; θ) and (gE; gθ) are equivalent. The commutativity condition may then be written in the form P(E; θ) = P(gE; gθ). Examples include the Cauchy model, location-scale models, and the Box–Cox model (Section 7).
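The commutativity condition P(E; θ) = P(gE; gθ) can be verified numerically for the normal location-scale model. In the sketch below (an illustrative check, not from the paper), g = (a, b) acts on the response by y ↦ a + by and on the parameter by (µ, σ) ↦ (a + bµ, bσ); the affine constants are the Celsius-to-Fahrenheit map of Section 6.1.

```python
import math

# Equivariance check for the normal location-scale model: the event
# {Y <= y} and its image {Y <= a + b*y} receive the same probability
# under theta = (mu, sigma) and g(theta) = (a + b*mu, b*sigma).
def normal_cdf(y, mu, sigma):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

a, b = 32.0, 9.0 / 5.0        # y in deg C maps to a + b*y in deg F
y, mu, sigma = 1.3, 0.7, 2.0

lhs = normal_cdf(y, mu, sigma)                       # P(E; theta)
rhs = normal_cdf(a + b * y, a + b * mu, b * sigma)   # P(gE; g theta)
assert abs(lhs - rhs) < 1e-12
```

The identity holds because (a + by − (a + bµ))/(bσ) = (y − µ)/σ for b > 0, so the standardized argument of the cdf is invariant under the group.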
A natural, or equivariant, subparameter [Lehmann and Casella (1998), page 160], or a permissible subparameter [Helland (1999a, b)], is a function h : Θ → Ξ such that h(θ₁) = h(θ₂) implies h(gθ₁) = h(gθ₂) for each g ∈ G. Provided that h is surjective, this condition ensures that hgh⁻¹ : Ξ → Ξ is well defined, and is a group homomorphism. The term "invariantly estimable parameter" has been used by Hora and Buehler (1966), but this terminology is doubly misleading. First, a natural parameter is equivariant and not ordinarily invariant under the group; that is, hgh⁻¹ need not be the identity on Ξ. Second, estimability depends on the design, and a natural parameter need not be estimable or even identifiable at a given design.

[…]

A completely randomized design. Consider a design in which cat_U = I, implying that the units are infinitely exchangeable. We assume also that cat_Ω includes all insertion maps. In a linear or generalized linear model, Θ is a representation of cat_Ω by surjective linear transformations. For generalized linear models with unknown dispersion, the most common choice is the representation Θ_Ω = R ⊕ R^Ω, or the subrepresentation R ⊕ 1 implying equal treatment effects. Depending on the choice of P : Θ → P(S), the representation Θ may serve as an index set for various families, typically with independent components, such as

    Y_i ~ N(θ_ψ(i), θ₀²),    Y_i ~ Cauchy(θ_ψ(i), |θ₀|),
    Y_i ~ Po(exp(θ_ψ(i))),   Y_i ~ Bi(1, 1/(1 + exp(−θ_ψ(i)))),

in which ψ(i) is the treatment associated with unit i. In addition to the two component parameters θ₀ ∈ R and θ ∈ R^Ω, the only natural linear subparameter g : Θ → Ξ is the quotient projection R^Ω → R^Ω/1, corresponding to the space of treatment contrasts [McCullagh (1999)].

Block designs. Let cat_U = ND be the category in which each object U is a set together with the equivalence relation u ∼ u′ if units u and u′ belong to the same block.
The morphisms ϕ : U → U′ are all injective maps that preserve the equivalence relationship: u ∼ u′ in U if and only if ϕu ∼ ϕu′ in U′. To say the same thing in a slightly different way, each object in ND is a set of plots together with a map b : plots → blocks. If b is surjective, we may write U = {(i, b(i)) : i ∈ plots}. The morphisms ϕ : b → b′ are all ordered pairs (ϕ_d, ϕ_b) of injective maps such that the following diagram commutes:

    plots  --b-->  blocks
    ϕ_d ↓          ↓ ϕ_b
    plots′ --b′--> blocks′

Let cat_Ω = {0} be a one-point set, so that Θ is also a set. According to (1), a model for a block design with no specific factors is a natural transformation P : Θ → P(S) in which S is the representation R^U, meaning real-valued functions on plots. In other words, each P_θ is a distribution in Inv_ND(R^U). Let ε₀, {ε_i}, {η_b} be independent standard normal random variables. Let Y be an infinite sequence of random variables indexed by the objects U such that, for some measurable function g : R³ → R, Y_U has the same distribution as X_U on R^U, where X_U(i) = g(ε₀, ε_i, η_b(i)) for i ∈ U. Then Y has a distribution in Inv_ND(S). The usual normal-theory variance-components model is obtained if g is a linear combination of ε_i and η_b(i). Thus, we may have Θ = R³, and P_θ equal to the distribution of Y, where Y_i = θ₀ + θ₁ε_i + θ₂η_b(i).

Causality and nonspecific effects. Although the same category of morphisms arises in these last two examples, the standard models are quite different. If the units are regarded as having the infinitely exchangeable block structure associated with ND, we arrive at a random-effects model. If the units are regarded as infinitely exchangeable modulo specific covariate effects, the model contains specific parameters for those effects. Nonspecific factors such as blocks, temporal structure and spatial structure, for which no level-specific inference is required, are best regarded as a property defining the structure of the units.
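The variance-components model Y_i = θ₀ + θ₁ε_i + θ₂η_{b(i)} implies a block-structured covariance, Cov(Y_i, Y_j) = θ₁²[i = j] + θ₂²[b(i) = b(j)], since the ε's and η's are independent with unit variance. A small deterministic sketch (block labels and parameter values are illustrative):

```python
# Covariance structure implied by Y_i = theta0 + theta1*eps_i + theta2*eta_{b(i)}
# with independent standard-normal eps_i and eta_b:
#   Cov(Y_i, Y_j) = theta1^2 * [i == j] + theta2^2 * [b(i) == b(j)].
theta0, theta1, theta2 = 1.0, 2.0, 0.5
b = {0: "blockA", 1: "blockA", 2: "blockB", 3: "blockB"}   # plots -> blocks
plots = sorted(b)

cov = [[theta1**2 * (i == j) + theta2**2 * (b[i] == b[j]) for j in plots]
       for i in plots]

# Plots in the same block share the eta component; different blocks are
# uncorrelated; the variance adds both components.
assert cov[0][1] == theta2**2
assert cov[0][2] == 0.0
assert cov[0][0] == theta1**2 + theta2**2
```

Note that the covariance depends on the block map b only through the equivalence relation u ∼ u′, which is exactly the structure the morphisms of ND preserve.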
Specific factors such as treatment, variety, age, race, sex or religion, for which level-specific inferences may be required, must be regarded as objects in catΩ. Level-specific effects may be ascribed to specific factors, whether intrinsic or not, but causal effects are not ordinarily ascribed to intrinsic factors [Holland (1986); Cox (1986)]. Thus, longevity may vary with race, sex and religion, in a specific manner, but it would be at best an abuse of language to assert that race, sex or religion is a cause of longevity. The definitions given in Section 4 deliberately avoid all reference to causal mechanisms, which are necessarily context dependent in ways that categories and morphisms are not. A given model such as quasi-symmetry, arising in very different fields of application from linkage disequilibrium in genetics to citation studies in science, may be capable of numerous mechanistic interpretations. Also, models exist for which no mechanistic interpretation is readily available, or which are in conflict with accepted scientific laws. Such alternative models are necessary in order that scientific laws may be tested.

6. Examples.

6.1. Location-scale models. For simplicity of exposition in this section, we consider only models for independent and identically distributed scalar responses. Such models are determined by their one-dimensional marginal distributions on the response scale, so we may ignore catU and catΩ. The motivation for location-scale models comes from transformations of the response scale, that is, the morphisms in catV. For definiteness, take catV to be the category of morphisms of temperature scales. The objects in this category are all temperature scales, °F, °C, °K, and possibly others not yet invented. Associated with each scale V is a certain set of real numbers, and to each pair (V, V′) of scales there corresponds a single invertible affine map (a, b): V → V′, which transforms y on the V-scale into a + by on the V′-scale.
If V = °C and V′ = °F, the map (32, 9/5) carries y in °C to 32 + 9y/5 in °F. One example of a functor is the map T: catV → GA(R) in which each object in catV is associated with R, and each map (a, b): V → V′ is associated with an affine transformation (a, b)†: R → R defined by x → a + bx. If catV contains the three objects °C, °F and °K, and nine maps, the image maps R → R in T catV are the identity, the affine maps (32, 9/5), (273, 1), (523.4, 9/5) and their inverses, making a total of seven maps. These are not closed under composition, so the image T catV is not a category. However, if all conceivable scales are included in catV, the image of T is equal to the affine subgroup with b > 0 acting on R. Another example of a functor is the map Θ: catV → GA(R²) in which each (a, b): V → V′ is associated with the affine map R² → R² defined by

(a, b)†: (θ1, θ2) → (a + bθ1, bθ2),

and we may take Θ as the parameter space for a location-scale model. Two examples of models Θ → P(VU) with independent components are (θ1, θ2) → N(θ1, θ2²) and (θ1, θ2) → Cauchy(θ1 + iθ2). Since the points (θ1, ±θ2) give rise to the same distribution, the parameter is not identifiable. Provided that b > 0, the subobject R × R⁺ determines a subfunctor, and the submodel with θ2 ≥ 0 generates the same family. These models are equivalent to the usual group formulation by affine transformations acting componentwise on RU only if catV contains all scales, that is, all pairs (a, b) with b > 0. Otherwise, the image of Θ is not a group. To each affine transformation (a, b) of response scales, there corresponds a transformation (a, b)†: R² → R² on the object in Θ. The first component of this transformation θ1 → a + bθ1 does not depend on θ2, so the coordinate projection (θ1, θ2) → θ1 is a natural transformation. The mean or median is a natural subparameter. The same is true for the second component, so (θ1, θ2) → θ2 is also a natural subparameter.
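These claims are easy to check numerically. The sketch below (helper names mine, not the paper's) represents each scale morphism as a pair (a, b) acting by y → a + by, composes the listed maps, and applies the induced parameter map (a, b)†; it also shows that a ratio such as θ2/θ1 does not transform through a function of the ratio alone, whereas the first coordinate transforms through a + bθ1.

```python
# Affine scale morphisms (a, b): y -> a + b*y, their composition law,
# and the induced parameter map (a, b)+ on (theta1, theta2).
def apply_map(m, y):
    a, b = m
    return a + b * y

def compose(m2, m1):            # first m1, then m2
    a1, b1 = m1
    a2, b2 = m2
    return (a2 + b2 * a1, b2 * b1)

def invert(m):
    a, b = m
    return (-a / b, 1.0 / b)

C_to_F = (32.0, 9 / 5)
C_to_K = (273.0, 1.0)
K_to_F = compose(C_to_F, invert(C_to_K))    # K -> C -> F

def induced(m, theta):          # (a, b)+ : (t1, t2) -> (a + b*t1, b*t2)
    a, b = m
    return (a + b * theta[0], b * theta[1])

# two parameter points with the same ratio theta2/theta1 = 0.2 are carried
# to points with different ratios, since b*t2/(a + b*t1) involves t1 itself
t, s = (10.0, 2.0), (20.0, 4.0)
rt, rs = induced(C_to_F, t), induced(C_to_F, s)
```

With all conceivable scales included, compose and invert generate the full affine subgroup with b > 0, matching the closure property described above.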
However, the coefficient of variation τ = θ2/θ1 is not a natural subparameter because the transformation (θ1, θ2) → θ2/θ1 is not natural for the category of location-scale transformations. The location-scale morphism (a, b) carries τ = θ2/θ1 to |b|θ2/(a + bθ1), which cannot be expressed as a function of (τ, a, b) alone. On the other hand, for the subcategory (0, b) of componentwise nonzero scalar multiplication, the coefficient of variation is a natural subparameter. The induced transformation is τ → sign(b)τ.

The analysis for other location-scale families follows the same lines. Consider, for example, the location-scale t family,

F = {tν(µ, σ²): µ ∈ R, σ² ≥ 0, ν = 1, 2, . . .}

with integer degrees of freedom. Then ν is invariant under all maps in the category, and the induced transformations for (µ, σ²) are those described above. Relative to the category of location-scale and coordinate-projection maps, µ and σ² are natural subparameters, but combinations such as τ = σ/µ or µ + σ are not. However, µ + σ and each percentile is a natural subparameter relative to the subcategory in which b ≥ 0.

in which each letter occurs exactly once. No formulation such as AB + BC whose interpretation necessarily accords preferential status to the recorded levels can be a statistical model in the sense of being a covariant functor on Surj³. If in fact some or all factors have ordered levels, we may restrict Surj to the subcategory Surjord of weakly monotone surjective maps. The set of models is thereby greatly increased. In the bivariate case, the set of distributions having constant global cross-ratio [Pearson (1913); Plackett (1965); Dale (1984)] is closed under monotone marginal transformation. The sets of distributions having additive or log-additive global cross-ratios are likewise closed. The importance of the distinction between response and explanatory factor is evident in the functor sequence (2).
Suppose that A is the response and B, C are explanatory, all having unordered levels. It may then be appropriate to choose catV = Surj for surjective morphisms on the levels of A, and catΩ = I² for selection of the levels of B and C. Then Θ is a covariant functor on Surj × Iop × Iop. Among log-linear formulations, each factorial expression that includes A + BC represents a model in the sense of (2). For two-way tables, the set of nonnegative matrices of rank ≤ k is a functor corresponding to a model not of the log-linear type. If the levels of A are ordered and attention is restricted to order-preserving surjective maps, further models are available in the form of cumulative logit and related formulations [McCullagh (1980)]. Apart from the rank 2 canonical correlation models, none of the association models described by Goodman (1979, 1981) for two-way tables with ordered levels is a covariant functor on either Surj²ord or Surjord × Iopord. Goodman’s association structure is not preserved under aggregation of adjacent response levels. In the absence of an extension to other levels of aggregation, all conclusions derived from fitting such models are necessarily specific to the particular level of aggregation observed, so the scope for inference is rather narrow.

6.5. Embeddability and stochastic processes. The point of Exercise 11 is that such a specification need not define a process that extends beyond the observed lattice. Let the observed 3 × 3 lattice be embedded as the central square of a larger n × n lattice. The distribution of the response on the sublattice may be computed from the n²-dimensional normal distribution with zero mean and inverse variance matrix (In − Bn)ᵀ(In − Bn) by integrating out the n² − 9 unwanted variables. The matrices In and Bn are of order n² × n². But the result of this computation is different for each n, and is not equal to the normal distribution with inverse covariance matrix (I3 − B3)ᵀ(I3 − B3).
In other words, Exercise 11 does not identify an embeddable process. Similar issues arise in connection with edge effects in Gibbs models for finite lattice systems: for details, see Besag and Higdon [(1999), Section 2.3.1].

6.6. Comments on selected exercises. With the usual understanding of what is meant by natural embedding, the formulations in Exercises 1–5 do not determine a unique probability distribution or likelihood function. In Exercise 1, which asks for predictions for a future subject, the observed event is either E ⊂ Rn or the equivalent set E′ = E × R ⊂ Rn+1. The likelihood function may thus legitimately be calculated either by Pn(E; θ) or by Pn+1(E′; θ). The absurdity of Exercise 1 lies in the fact that these expressions are not equal. To the extent that this embedding is natural, the likelihood function is not well defined. The nature of the embeddings may be different, but the same objection applies to Exercises 1–5 and 11. Although the set P in Exercise 4 is the set of all i.i.d. normal models, the parameterization is unnatural and the likelihood function is not well defined.

The logical contradiction in the type III model in Exercise 5 is exactly the same as in Exercises 1 and 2, but the formulation is less flagrant and consequently more persistent. Let Fkb be the family of normal distributions on Rkb with constant variance σ² > 0, such that µ lies in the linear subspace IIIkb. It is understood that the sample-space morphisms include the coordinate projection map Rkb+k → Rkb in which one entire block is deleted or ignored. It is easy to see that the fixed-sum constraint is not satisfied by the blocks that remain. In other words, the sequence of vector spaces {IIIkb} is not closed under such maps, and consequently the sequence F = {Fkb} of families is not closed either.
Put more bluntly, the observation y ⊂ Rkb is equivalent to the incomplete observation y′ = y × Rk ⊂ Rk(b+1) taken from a larger design in which one entire block is unobserved. But the likelihood function based on (y, Fkb) is not equivalent to the likelihood function based on (y′, Fk(b+1)). Given this embedding, no unique likelihood can be associated with the type III model.

Gibbsian models have the property that the conditional distribution of certain subsets of the variables given the values of all other variables is also of the same type. This property makes it straightforward to construct families that are closed under conditioning. Exercise 7 illustrates a familiar problem in the use of quadratic exponential families and Gibbs models. Although the normal family is closed under coordinate projection, the implied parameterization, in which (θ1, θ2) does not depend on cluster size, does not yield a commutative model diagram. In other words, if (Y1, . . . , Yk) have the exchangeable normal distribution with parameter (θ1, θ2), the marginal distribution of (Y1, . . . , Yk−1) is exchangeable and normal, but the parameter is not θ. By contrast, the conventional parameterization by variance and nonnegative covariance independent of cluster size does yield a commutative model diagram. Even though the two formulations may coincide when applied to an example in which the cluster size is constant, their extensions to variable-sized clusters are different, and this difference is critical for prediction and inference. The version in which the variance and covariance are nonnegative and independent of cluster size is a statistical model. The version described in Exercise 7 is not.

The problem with the Cauchy model is that, although the family is closed under the group of real fractional linear transformations, the projection (θ1, θ2) → θ1 is not a natural transformation.
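The marginalization argument for Exercise 7 can be checked numerically. In the sketch below (notation mine), the exchangeable covariance (v − c)I + cJ for cluster size k marginalizes to exactly the same (v, c) for size k − 1, whereas the canonical parameters read off the inverse covariance change with k.

```python
import numpy as np

# Exchangeable normal covariance with variance v and covariance c:
# Sigma_k = (v - c) I_k + c J_k, where J_k is the all-ones matrix.
def exch_cov(k, v, c):
    return (v - c) * np.eye(k) + c * np.ones((k, k))

v, c = 2.0, 0.5
Sigma4, Sigma3 = exch_cov(4, v, c), exch_cov(3, v, c)

# In the (v, c) parameterization, the size-3 marginal of a size-4 cluster
# is the leading submatrix, with the same (v, c): the diagram commutes.
# In the canonical (inverse-covariance) parameterization, the off-diagonal
# precision entry depends on the cluster size k.
Q4, Q3 = np.linalg.inv(Sigma4), np.linalg.inv(Sigma3)
```

Here Q4 and Q3 are exchangeable precision matrices of the form aI + bJ, but with different (a, b), which is why holding the canonical parameter fixed across cluster sizes fails to give a commutative model diagram.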
Since the components are independent, it is sufficient to take S = R and ϕy = (ay + b)/(cy + d) in the diagram below:

         y              C(θ)  ←−  θ                     −→  g(θ) = θ1
         ϕ |            ϕ† |      ϕ∗ |                      ϕ′ = ? |
           v               v         v                             v
    (ay+b)/(cy+d)       C(ψ)  ←−  ψ = (aθ+b)/(cθ+d)     −→  g(ψ) = ψ1

In fact, ψ1 = Re((aθ + b)/(cθ + d)) depends on θ2 and is not expressible as a function of (θ1, ϕ) alone. The real part is not a natural parameter relative to the group of fractional linear transformations. It appears that the only natural transformations are the identity on Θ and the constant function Θ → {0}.

In each particular application, it is essential to establish the category of relevant morphisms of sample spaces at the outset. Even though the observations are modelled by Cauchy distributions, it is perfectly acceptable, and frequently entirely reasonable, to choose the family of location-scale and coordinate projection maps. Unless the response is an overt ratio of outcomes, fractional linear transformations are seldom appealing in applied work, and they should be excluded on that basis. The category of morphisms on sample spaces must not be determined by the chosen family of distributions, but by the context. The mere fact that the Cauchy family happens to be closed under reciprocals does not mean that the category must include reciprocals.

The implicit category of morphisms in Exercise 9 is generated by affine transformations of temperature, together with scalar and rotational transformations of physical space. When the temperature scale is morphed from Fahrenheit to Celsius, there is a corresponding morph,

(µ, σ², λ) → (5(µ − 32)/9, (5σ/9)², λ),

of the parameter space. But there exists no corresponding morph for θ1 = µ/σ or θ2 = µ/λ, or for any other nonsensical combination such as σ + λ.

7. The Box–Cox model.

7.1. Natural subparameter. It is sufficient in what follows to consider the simplified Box–Cox model in which the observations y1, . . . , yn are independent and identically distributed.
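A direct numerical check of this failure of naturality (example values mine): under a genuine fractional linear map, two Cauchy parameters with the same real part θ1 but different θ2 are carried to parameters with different real parts, while under an affine map the image real part depends on θ1 alone.

```python
# Cauchy parameter theta = theta1 + i*theta2 transforms under
# phi(y) = (a*y + b)/(c*y + d) to psi = (a*theta + b)/(c*theta + d).
def induced(theta, a, b, c, d):
    return (a * theta + b) / (c * theta + d)

a, b, c, d = 1.0, 2.0, 1.0, 3.0            # a map with c != 0
psi_u = induced(complex(1.0, 1.0), a, b, c, d)
psi_v = induced(complex(1.0, 2.0), a, b, c, d)
# same theta1 = 1, different theta2  =>  different Re(psi)
different = abs(psi_u.real - psi_v.real) > 1e-6

# affine special case (c = 0, d = 1): Re(psi) = a*theta1 + b, a function
# of theta1 alone, consistent with naturality in the location-scale category
aff_u = induced(complex(1.0, 1.0), a, b, 0.0, 1.0)
aff_v = induced(complex(1.0, 2.0), a, b, 0.0, 1.0)
same = abs(aff_u.real - aff_v.real) < 1e-12
```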
The sample spaces are all finite-dimensional real vector spaces {Rn: n ≥ 0}. For m ≥ n, the morphisms Rm → Rn include all coordinate permutation and coordinate projection maps. In addition, all scalar multiples y → γy for γ > 0 and all componentwise power transformations y → ±|y|λ for λ ≠ 0 are also included in C. The inferential difficulties that arise in the Box–Cox model are due principally to the interaction between the power transformation

1254 P. MCCULLAGH

The yield of crop in a field trial is an additive set function. Typically, the process is defined on Borel sets, and observed on the algebra generated by the plots. In the discussion that follows, it is assumed that the process Y takes values in a set V that includes zero and is closed under addition (an additive semigroup). In practice, this usually means that V is the set of nonnegative integers, the set of nonnegative reals, the set of real or complex numbers, or the Cartesian product of such sets.

Let C be the category in which each object A is a finite algebra of Borel-measurable subsets of D, and each morphism ϕ: A → A′ is the insertion map in which A ⊂ A′. The sample space is a contravariant functor on C that associates with each algebra A the set measV(A) of V-valued additive set functions on A, and with each insertion map ϕ: A → A′ the map ϕ∗: measV(A′) → measV(A) by restriction to A ⊂ A′. A probability model P for the process is a contravariant functor on C that associates with each A a probability distribution PA on measV(A) in such a way that, when S ⊂ measV(A) is PA-measurable, the inverse image ϕ∗−1S ⊂ measV(A′) is PA′-measurable, and PA(S) = PA′(ϕ∗−1S). In other words, PA is the marginal distribution of PA′:

    A        measV(A)       PA
    ϕ |        ↑ ϕ∗          ↑ ◦ϕ∗−1
      v        |             |
    A′       measV(A′)      PA′

This is a category-style statement of the Kolmogorov consistency condition that must be satisfied by the finite-dimensional distributions of a measure process on D [Kingman (1984)].
The term “process” refers both to the random variable Y and to the probability model P on D satisfying the preceding conditions. A point process is a process for which each outcome is a countable set of points in D. In other words, a point process is a nonnegative integer-valued random measure such that, with probability 1, Y has countable support and no multiple points. Then Y(A) is the number of points in A. A Poisson process [Kingman (1993)] is a point process in which, for some weakly finite nonatomic measure µ on D,

(a) Y(A) has the Poisson distribution with mean µ(A);
(b) for nonoverlapping sets A1, A2, . . . , the random variables Y(A1), Y(A2), . . . are independent.

A process satisfying condition (b) is said to be completely random. A measure process is called Gaussian if, for each finite collection of subsets {A1, . . . , An}, the random variable (Y(A1), . . . , Y(An)) is normally distributed.

8.2. Domain morphisms. We consider first some of the issues connected with statistical models for processes in which covariate effects do not arise. Ordinarily we require a family that is extendable to a large class of subsets of the plane. Accordingly, we begin with a category catD in which the objects D, D′, . . . are some or all such subsets of the plane. The class of morphisms ϕ: D → D′ depends on the context, but is restricted to injective maps that preserve Borel sets. In practice, the morphisms are usually translations, rotations and scalar multiples that preserve some aspect of Euclidean structure. Every injective map ϕ: D → D′ preserves Boolean structure in the sense that, for A, B ⊂ D, ϕ(A ∪ B) = ϕ(A) ∪ ϕ(B), and ϕ(A ∩ B) = ϕ(A) ∩ ϕ(B). Thus, ϕ carries each algebra of Borel subsets of D into an algebra of Borel subsets of D′. In the diagram below, to each domain D there corresponds a sample space, SD, of V-valued measures on (the Borel subsets of) D.
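Concretely, a point-process outcome is one such V-valued measure. A minimal simulation sketch (construction mine, using numpy): for a homogeneous Poisson process, the total count is Poisson with mean µ(D), the points are conditionally independent and uniform given that count, and the resulting counting measure is additive over disjoint sets.

```python
import numpy as np

# Homogeneous Poisson process on the unit square D = [0,1]^2 with
# intensity 50, realized as a counting measure Y(A).
rng = np.random.default_rng(0)
intensity = 50.0
n = rng.poisson(intensity)        # Y(D) ~ Po(mu(D)), mu(D) = 50 * area(D)
points = rng.random((n, 2))       # given n, points are i.i.d. uniform on D

def Y(pred):                      # counting measure of the realization
    return int(sum(1 for p in points if pred(p)))

# additivity over the disjoint left and right halves of D
left = Y(lambda p: p[0] < 0.5)
right = Y(lambda p: p[0] >= 0.5)
```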
For each point Y ∈ SD′, a measure on D′, (ϕ∗Y)(A) = Y(ϕA) is the composition measure on D.

    D       Y ◦ ϕ      SD       P(SD)   ←PD−   ΛD
    ϕ |       ↑          ↑ ϕ∗      ↑ ϕ†          ↑ ϕ′
      v       |          |         |             |
    D′        Y          SD′      P(SD′)  ←PD′−  ΛD′
(3)

The map ϕ† on probability models is defined by composition with the inverse image of ϕ∗, that is, (ϕ†F)(S) = F(ϕ∗−1S) for S ⊂ SD and F ∈ P(SD′). Typically, the process Y to be studied is modelled as a measure on D, assigning values to each of the Borel sets. Although we refer to the sample space as the set of measures on D, it is important to understand that Y is ordinarily not observed on the Borel sets, but on the finite subalgebra generated by the plots.

An invariant probability model is a natural transformation {0} → P(S) that associates with each object D a process PD on meas(D), in such a way that, for each domain morphism ϕ: D → D′, the sample-space map ϕ∗: SD′ → SD satisfies PD = PD′ ◦ ϕ∗−1. For example, the Poisson process with constant unit intensity function is invariant under translation and rotation of domains. For many categories of domain morphisms, however, invariant processes do not exist. For example, there exists no nontrivial process that is invariant under translation and scalar multiplication. However, the family of Poisson processes with constant intensity function is closed under similarity transformations. Likewise, the family of stationary Gaussian processes with constant intensity and isotropic covariance density

σ² exp(−γ|x − x′|) dx dx′,    σ², γ ≥ 0,

is also closed under similarity transformations.

A statistical model P: Λ → P(S) is a natural transformation of functors on catD. For example, if catD is the group of similarity transformations acting on R², we may take ΛD = R⁺ and ϕ′ equal to the Jacobian of ϕ. If PDξ is the homogeneous Poisson process on D with intensity ξ, the diagram commutes. More generally, let ΛD be the set of nonnegative measures on D.
For each ϕ: D → D′ the induced map ϕ′ on measures is given by (ϕ′µ)(A) = µ(ϕA) for µ ∈ ΛD′ and A ⊂ D. If µ is nonatomic, so also is ϕ′µ. The set of nonatomic measures is thus a subfunctor of Λ. If PD′µ is the Poisson process on D′ with nonatomic mean measure µ, the induced process ϕ†PD′µ on D is Poisson with nonatomic mean measure ϕ′µ, and the diagram commutes. An arbitrary measure µ with atoms gives rise to a process in which multiple events occur with nonzero probability at the same point. Although PDµ is a Poisson process, it is not a point process according to the usual definition [Kingman (1993)]. Further submodels exist if catD is restricted to continuous maps.

8.3. Covariate effects in spatial processes. In a field trial over a domain D, the design is a map ψ: D → Ω, associating with each point u in the domain a point ψu in Ω. In practice, since observations are made on plots, not points, ψ is constant on each plot. Each morphism ψ → ψ′ of designs is a pair of injective maps (ϕd, ϕc) such that ψ′ϕd = ϕcψ. To construct a model for the effect of covariates on a measure-valued process, it is essential to begin with a suitable baseline family, or uniformity model, for the process when covariate effects are absent. Such a family is necessarily closed under the category of morphisms of the domains. In the diagram shown below, the parameter space is a contravariant functor associating with each design ψ: D → Ω a parameter set ΛD × Ω, and with each morphism (ϕd, ϕc): ψ → ψ′ a map (ϕ′d, ϕ′c): ΛD′ × Ω′ → ΛD × Ω. A statistical model is a natural transformation of functors on the design. The baseline uniformity model is obtained by choosing Ω = {0}.

       Design                Sample space                 Model
    D  −ψ→  Ω        SD = meas(D)      P(SD)    ←Pψ−   ΛD × Ω
    ϕd |    | ϕc       ↑ ϕ∗d             ↑ ϕ†d           ↑ (ϕ′d, ϕ′c)
       v    v          |                 |               |
    D′ −ψ′→ Ω′       SD′ = meas(D′)    P(SD′)   ←Pψ′−  ΛD′ × Ω′
(4)

In order to see what this diagram means in practice, let ϕ: D → D′ be a similarity transformation with scalar multiple √2.
Then the area of the image ϕdD ⊂ D′ is twice the area of D, and the mean intensity of Yϕd = ϕ∗dY on D is twice the intensity of Y on D′. Thus, in the baseline model, we may choose ΛD = R and ϕ′d = |∂ϕd/∂u| = 2 as a scalar multiple. Let ϕc be the identity on Ω, so that ϕ′c is the identity on Ω. Suppose that the design ψ′ on D′ uses two varieties and that the yield for variety 2 exceeds the yield for variety 1 by one ton per acre. As observed on D, the yield per unit area for each variety is doubled, and the difference is also doubled. For the two varieties, the processes on D′ are Pψ′(ξ, θ1) and Pψ′(ξ, θ2) with intensity functions λ(ξ, θ1), λ(ξ, θ2). On D we obtain Pψ(2ξ, θ1) and Pψ(2ξ, θ2) with intensity functions λ(2ξ, θ1), λ(2ξ, θ2). Commutativity requires a model parameterized in such a way that the difference between the intensity functions on D is twice the difference between the intensity functions on D′. So far as

λ2 = λ′2 ◦ ϕ are in HD′. Further, if λ11 ∈ H⊗2_D, the corresponding transformed parameter λ′11 is a vector in H⊗2_D′. For example, log |x − x′| is a vector in H⊗2_D′ (excluding the diagonal set x = x′ in D × D), and log |ϕ(x) − ϕ(x′)| is a vector in H⊗2_D, also excluding the diagonal. The baseline processes considered here are those conformal processes for which the log-transformed cumulant densities are harmonic functions. The subfamily of completely random processes for which λ11 = −∞, and thus λ2(x) = λ1(x) + const, is of particular interest in the analysis of field experiments.

8.5. A conformal model with covariates. We consider a model in which, conditionally on the real-valued function λ, the mean and variance of the process Y at x ∈ D are modelled as follows:

µ(dx) = E(Y(dx)) = exp(λ(x) + (ψ∗θ)(x)) dx,
var(Y(dx)) = σ²µ(dx),
cov(Y(dx), Y(dx′)) = 0,
(5)

for some volatility parameter σ² > 0.
Following Besag and Higdon (1999), we may interpret the term λ(x) as the fertility intensity at x. Although the notation (ψ∗θ)(x) may appear unfamiliar, the effects of variety, treatment and other covariates are modelled in the usual manner, though their effects on the mean are multiplicative rather than additive. The multiplicative form in (5) is necessary to ensure that the diagram (4) is commutative for the category of conformal transformations. It is possible but not necessary that Y should be a Gaussian process.

The key issue concerns the assumptions to be made about fertility effects. It is generally understood that, for field experiments, the assumption of uniform fertility is unduly optimistic. Neighboring plots tend to have similar fertilities, but there can be substantial fertility gradients over the domain of interest. At the other extreme, if no assumptions are made about fertility patterns, that is, if λ is regarded as an arbitrary continuous function in RD, the variety effects are not identifiable. One intermediate option is to assume that fertility variation can be modelled as a quadratic function or a rational function of suitable low order. In this section, we explore a third option, namely assuming that the fertility intensity λ on D lies in the vector space of harmonic functions HD. This assumption means that the fertility at x ∈ D is equal to the average fertility on each disk D(x, r) ⊂ D with center x and radius r > 0. The model allows considerable flexibility for fertility gradients, but it does imply that the fertility function cannot have a local maximum or minimum in the interior of D. As with all such assumptions born of algebraic convenience, empirical work is needed to see if the assumption is reasonable. The available numerical evidence is limited, but very positive.

A harmonic function has the property that the Laplacian at x = (x1, x2) vanishes:

∇λ(x) = ∂²λ/∂x1² + ∂²λ/∂x2² = 0    for each x ∈ D.
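The mean-value property quoted above is easy to verify numerically (the example functions are mine): averaging a harmonic function over a circle returns its value at the center, while a nonharmonic function fails.

```python
import math

# Average of f over the circle of radius r centered at x, by a uniform
# discretization of the angle.
def circle_average(f, x, r, n=10000):
    total = 0.0
    for k in range(n):
        t = 2 * math.pi * k / n
        total += f(x[0] + r * math.cos(t), x[1] + r * math.sin(t))
    return total / n

harmonic = lambda x1, x2: x1**2 - x2**2      # Re((x1 + i*x2)^2), harmonic
nonharmonic = lambda x1, x2: x1**2 + x2**2   # Laplacian = 4, not harmonic

center = (0.7, -0.3)
mv = circle_average(harmonic, center, r=0.5)       # equals harmonic(center)
bad = circle_average(nonharmonic, center, r=0.5)   # exceeds center value by r^2
```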
Formal application of the Laplacian operator to our model gives

∇ log(dµ/dx) = ∇(ψ∗θ)(x)

since the Laplacian is a linear operator. Provided that the fertility function is sufficiently smooth to be approximated by a vector in HD, no prior distribution is required. Even though the vector space HD of fertility intensities is infinite dimensional, certain treatment and variety contrasts are identifiable. Identifiability requires that ∇(ψ∗θ) not be identically zero. Numerical calculation shows that, in many cases, all variety contrasts remain identifiable.

8.6. Lattice approximation. We assume here that the response measure Y is observed on a regular m × n grid of rectangular plots, Dx × Dy, sufficiently small that the Laplacian can be approximated at each internal plot. Let the rows be indexed by 1 ≤ i ≤ m, and the columns by 1 ≤ j ≤ n. On the assumption that the plot sizes are sufficiently small that λ is effectively constant on plots, the multiplicative model in the preceding section gives

µ(A) = E(Y(A)) = exp(λA + (ψ∗θ)A) × |A|,
var(Y(A)) = σ²µ(A)

in which (ψ∗θ)A is the effect of the treatment or variety associated with plot A. Reverting to more classical notation in which the plots are indexed by (i, j) ∈ m × n, we have

log µij = λij + (Xθ)ij + log |Aij|.

In the terminology of generalized linear models [McCullagh and Nelder (1989)], the offset is log |A|, the link function is log, and the variance function is the mean. Provided that the plot areas are constant, the term log |A| can be absorbed into the harmonic λ. The simplest lattice version of the Laplacian is proportional to

(∇Y)ij = (Yi−1,j − 2Yij + Yi+1,j)/Dy² + (Yi,j−1 − 2Yij + Yi,j+1)/Dx²

for 2 ≤ i ≤ m − 1 and 2 ≤ j ≤ n − 1. It is absolutely essential to take account of the geometry of the plots by including the factors Dx and Dy in the definition of the Laplacian.
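A short numerical sketch of this lattice construction (implementation mine), taking square plots with Dx = Dy = 1: it builds the discrete Laplacian ∇ (called Lap below), forms the projection H = I − ∇′(∇∇′)⁻¹∇ onto the lattice-harmonic subspace, and confirms that ∇ annihilates constants and linear trends and that the harmonic subspace has dimension 2(m + n − 2), the number of boundary plots.

```python
import numpy as np

# Discrete Laplacian for an m x n grid of square plots: one row per
# interior plot, with -4 at the plot and +1 at its four neighbors.
m, n = 5, 6
idx = lambda i, j: i * n + j                  # row-major plot index

Lap = np.zeros(((m - 2) * (n - 2), m * n))
row = 0
for i in range(1, m - 1):
    for j in range(1, n - 1):                 # interior plots only
        Lap[row, idx(i, j)] = -4.0
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            Lap[row, idx(i + di, j + dj)] = 1.0
        row += 1

# Projection onto the kernel of Lap, the lattice-harmonic subspace.
H = np.eye(m * n) - Lap.T @ np.linalg.inv(Lap @ Lap.T) @ Lap

harmonic_dim = round(np.trace(H))             # rank of a projection = its trace
```

For the 5 × 6 grid the harmonic subspace has dimension 18 = 2(5 + 6 − 2), matching the count of boundary plots.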
If the plots are square, the Laplacian at (i, j) may be approximated by the sum of the four neighbors minus four times the value at (i, j).

The kernel of the linear transformation ∇: Rmn → R(m−2)(n−2) is the subspace H of lattice harmonic functions, of dimension equal to the number of boundary plots. The linear projection Rmn → Rmn,

H = I − ∇′(∇∇′)−1∇,

has the property that the image of H is the kernel of ∇, which is the subspace H. The model formula is thus ker∇ + X, or H + X, with log link and variance proportional to the mean. An approximate estimate of the variety contrasts may be obtained by solving

(X′WX)β̃ = X′W(log Y),

where W = ∇′(∇∇′)−1∇. Since ∇ annihilates constants and linear functions, the matrix X′WX is not invertible. At best, only contrasts are estimable. In principle, the nonlinear model can be fitted by standard software such as GLIM or the glm() function in Splus. In practice, since H is mn × mn of rank h = 2(m + n − 2), it is necessary to eliminate the redundant columns. This can be done by using an equivalent full-rank matrix H′ = HJ in which J is mn × h of full rank. A matrix of independent uniform random numbers is adequate for this purpose.

The data from the wheat uniformity trial reported by Mercer and Hall (1911), and reproduced in Andrews and Herzberg (1985), were reanalysed in the light of the preceding discussion. The design is a grid of 20 × 25 nearly square plots, each 1/500 of an acre in area. Each plot contains 11 drill rows 10.82 feet long. From the given total area, we infer that the drill rows are approximately nine inches apart and that the plot size is 10.82 × 8.05 in feet. After allowing for multiplicative row and column effects, it is found that the mean square due to harmonic variations is 0.304 on 82 degrees of freedom, whereas the residual mean square is 0.107 on 374 degrees of freedom.
At the expense of the boundary plots, a 25% reduction in the variance of treatment contrasts is achieved by the elimination of harmonic trends. The observed variance ratio of 2.85 provides strong evidence for the presence of at least two variance components in these data, and the ratio might well have been considerably larger had the region not been selected on the basis of its uniform appearance. Numerically, it matters little whether the model is additive or multiplicative.

The real and imaginary parts of (x1 + ix2)^k are homogeneous polynomials of degree k, both of which are harmonic functions. The set Hk ⊂ H of harmonics of degree k or less on D = R² is a subspace of dimension 2k + 1 closed under similarity transformations, that is, planar rotation, translation and scalar multiplication. That is to say, Hk and H/Hk are representations of this group, and there is a corresponding invariant decomposition of the harmonic sum of squares. In the Mercer–Hall data, low-order harmonics tend to dominate, but this dominance is much less pronounced than I had anticipated. After allowing for

Isomorphism. An isomorphism of categories is a covariant functor T: C → K that is a bijection on both objects and arrows. A contravariant functor T: C → K that is a bijection on both objects and arrows is an opposite-isomorphism C → Cop ≅ K.

Representation. Let K be the category of linear transformations on vector spaces. A functor T: C → K is said to be a representation of C by linear transformations, or simply a representation of C. The standard contravariant representation of C associates with each object Ω the vector space RΩ, and with each morphism ϕ: Ω → Ω′ the linear transformation ϕ∗: RΩ′ → RΩ by functional composition: ϕ∗f = f ◦ ϕ.

Dual representation. Let K be the category of linear transformations on vector spaces. The dual functor associates with each vector space V in K the dual vector space V′ of linear functionals on V.
To each morphism, or linear transformation, A: V → W in K there corresponds a dual linear transformation A′: W′ → V′, also called the vector-space adjoint transformation. This operation reverses the direction of each arrow, so the dual is a contravariant functor K → K. In matrix notation, A′ is the transpose of A. The dual of a covariant representation T: C → K is the composition C → K → K of T with the dual functor, associating with each object Ω the dual vector space T′Ω, and with each morphism ϕ: Ω → Ω′ the dual linear transformation (Tϕ)′: T′Ω′ → T′Ω.

Opposites. To each category C we make correspond an abstract category Cop by retaining the objects, and reversing all arrows. The functor C → Cop is contravariant, and the order of composition is reversed. Each category of invertible maps (groupoid) is isomorphic with its opposite. The category obtained by replacing each object Ω of C by the power set 2Ω, and each map ϕ: Ω → Ω′ by its inverse image ϕ−1: 2Ω′ → 2Ω, is isomorphic with Cop. Other concrete instances of opposites include the standard representation and the dual category of linear transformations.

Natural transformation. Let C, K be two categories, and let S, T be two covariant functors C → K. A natural transformation g: S → T associates with each object Ω in C an arrow gΩ: SΩ → TΩ in such a way that the arrow diagram commutes:

    Ω        SΩ   −gΩ→   TΩ
    ϕ |      ϕ∗s |         | ϕ∗t
      v          v         v
    Ω′       SΩ′  −gΩ′→  TΩ′

In other words, for each ϕ: Ω → Ω′ in C the image arrows ϕ∗s and ϕ∗t satisfy ϕ∗t gΩ = gΩ′ϕ∗s: SΩ → TΩ′. If S and T are contravariant functors, the direction of the arrows ϕ∗s and ϕ∗t is reversed. The commutativity condition then becomes ϕ∗t gΩ′ = gΩϕ∗s: SΩ′ → TΩ.

Acknowledgments.
I am grateful to Yali Amit, Steen Andersson, Jim Berger, Peter Bickel, David Cox, Art Dempster, Gary Glonek, Inge Helland, Saunders Mac Lane, Nick Polson, Nancy Reid, Michael Stein, Stephen Stigler, Burt Totaro, Ernst Wit and the referees for helpful discussion and comments on various points. REFERENCES ALDOUS, D. (1981). Representations for partially exchangeable arrays of random variables. J. Multivariate Analysis 11 581–598. ANDREWS, D. F. and HERZBERG, A. (1985). Data. Springer, New York. BARNDORFF-NIELSEN, O. E. and COX, D. R. (1994). Inference and Asymptotics. Chapman and Hall, London. BARTLETT, M. S. (1978). Nearest neighbour models in the analysis of field experiments (with discussion). J. Roy. Statist. Soc. Ser. B 40 147–174. BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York. BERNARDO, J. M. and SMITH, A. F. M. (1994). Bayesian Theory. Wiley, New York. BESAG, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. Ser. B 36 192–236. BESAG, J. and HIGDON, D. (1999). Bayesian analysis of agricultural field experiments (with discussion). J. Roy. Statist. Soc. Ser. B 61 691–746. BESAG, J. and KOOPERBERG, C. (1995). On conditional and intrinsic autoregressions. Biometrika 82 733–746. BEST, N. G., ICKSTADT, K. and WOLPERT, R. L. (1999). Contribution to the discussion of Besag (1999). J. Roy. Statist. Soc. Ser. B 61 728–729. BICKEL, P. and DOKSUM, K. A. (1981). An analysis of transformations revisited. J. Amer. Statist. Assoc. 76 296–311. BILLINGSLEY, P. (1986). Probability and Measure, 2nd ed. Wiley, New York. BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations (with discussion). J. Roy. Statist. Soc. Ser. B 26 211–252. BOX, G. E. P. and COX, D. R. (1982). An analysis of transformations revisited, rebutted. J. Amer. Statist. Assoc. 77 209–210. COX, D. R. (1958). Planning of Experiments. Wiley, New York. COX, D. R. (1986). 
Comment on Holland (1986). J. Amer. Statist. Assoc. 81 963–964. COX, D. R. and HINKLEY, D. V. (1974). Theoretical Statistics. Chapman and Hall, London. COX, D. R. and REID, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). J. Roy. Statist. Soc. Ser. B 49 1–39. COX, D. R. and SNELL, E. J. (1981). Applied Statistics. Chapman and Hall, London. COX, D. R. and WERMUTH, N. (1996). Multivariate Dependencies. Chapman and Hall, London. DALE, J. R. (1984). Local versus global association for bivariate ordered responses. Biometrika 71 507–514. DE FINETTI, B. (1975). Theory of Probability 2. Wiley, New York. GELMAN, A., CARLIN, J. B., STERN, H. and RUBIN, D. B. (1995). Bayesian Data Analysis. Chapman and Hall, London. 1266 P. MCCULLAGH GOODMAN, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74 537–552. GOODMAN, L. A. (1981). Association models and canonical correlation in the analysis of cross- classifications having ordered categories. J. Amer. Statist. Assoc. 76 320–334. HAMADA, M. and WU, C. F. J. (1992). Analysis of designed experiments with complex aliasing. J. Qual. Technology 24 130–137. HARVILLE, D. A. and ZIMMERMANN, D. L. (1999). Contribution to the discussion of Besag (1999). J. Roy. Statist. Soc. Ser. B 61 733–734. HELLAND, I. S. (1999a). Quantum mechanics from symmetry and statistical modelling. Internat. J. Theoret. Phys. 38 1851–1881. HELLAND, I. S. (1999b). Quantum theory from symmetries in a general statistical parameter space. Technical report, Dept. Mathematics, Univ. Oslo. HINKLEY, D. V. and RUNGER, G. (1984). The analysis of transformed data (with discussion). J. Amer. Statist. Assoc. 79 302–320. HOLLAND, P. (1986). Statistics and causal inference (with discussion). J. Amer. Statist. Assoc. 81 945–970. HORA, R. B. and BUEHLER, R. J. (1966). Fiducial theory and invariant estimation. Ann. Math. Statist. 37 643–656. KINGMAN, J. 
F. C. (1984). Present position and potential developments: Some personal views. Probability and random processes. J. Roy. Statist. Soc. Ser. A 147 233–244. KINGMAN, J. F. C. (1993). Poisson Processes. Oxford Univ. Press. LAURITZEN, S. (1988). Extremal Families and Systems of Sufficient Statistics. Lecture Notes in Statist. 49. Springer, New York. LEHMANN, E. L. (1983). Theory of Point Estimation. Wiley, New York. LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point Estimation, 2nd ed. Springer, New York. LITTELL, R., FREUND, R. J. and SPECTOR, P. C. (1991). SAS System for Linear Models, 3rd ed. SAS Institute, Cary, NC. MAC LANE, S. (1998). Categories for the Working Mathematician, 2nd ed. Springer, New York. MCCULLAGH, P. (1980). Regression models for ordinal data (with discussion). J. Roy. Statist. Soc. Ser. B 42 109–142. MCCULLAGH, P. (1992). Conditional inference and Cauchy models. Biometrika 79 247–259. MCCULLAGH, P. (1996). Möbius transformation and Cauchy parameter estimation. Ann. Statist. 24 787–808. MCCULLAGH, P. (1999). Quotient spaces and statistical models. Canad. J. Statist. 27 447–456. MCCULLAGH, P. (2000). Invariance and factorial models (with discussion). J. Roy. Statist. Soc. Ser. B 62 209–256. MCCULLAGH, P. and NELDER, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London. MCCULLAGH, P. and WIT, E. (2000). Natural transformation and the Bayes map. Technical report. MERCER, W. B. and HALL, A. D. (1911). The experimental error of field trials. J. Agric. Research 50 331–357. NELDER, J. A. (1977). A re-formulation of linear models (with discussion). J. Roy. Statist. Soc. Ser. A 140 48–77. PEARSON, K. (1913). Note on the surface of constant association. Biometrika 9 534–537. PLACKETT, R. L. (1965). A class of bivariate distributions. J. Amer. Statist. Assoc. 60 516–522. RUBIN, D. (1978). Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6 34–58. RUBIN, D. (1986). Comment on Holland (1986). J. Amer.
Statist. Assoc. 81 961–962.

model occurs merely because the sample size is too small. That is, the argument becomes irrelevant if one recognizes that the generality of statistical formulations should normally increase with sample size, until ultimately one may indeed be allowed the luxury of machine learning. I may need to add that my interpretation of a p-value is merely as an exploratory device that quantifies inconsistency between the observed data and a particular formulation. McCullagh seems not to accept that sample size should be an important ingredient in statistical modeling; see Section 4.2. Of course, I agree that generally there should be coherence between formulations in going from sample size n to n + 1 but this process also takes us from n to n^2 and then there will often be overriding concerns and a need for greater flexibility.

Markov random fields. Markov random fields (MRFs) are distributions specified via their full conditionals (originally called local characteristics). The identification between any particular MRF and a corresponding Gibbs distribution and vice versa follows from the Hammersley–Clifford theorem [e.g., Besag (1974)]. The only restriction on either class is a minor positivity condition, so Gibbs (say) distributions are not required to have the property McCullagh ascribes to them in Section 6.6. Exercise 7 is supposedly an example of the inconsistencies that plague MRFs but seems badly chosen. I agree of course that the formulation is bizarre but, without a context, there is no reason why the distribution of Y in a cluster of size k, marginalized over its kth component, should be required to coincide with that of Y in a cluster of size k − 1. For example, suppose that k refers to the litter sizes in a study of piglets and that a measurement is made on each piglet.
Then exchangeability within litters might be a reasonable assumption but marginalizing over a piglet in a litter of size k does not produce a litter of size k − 1. And whatever the formulation, it seems reckless even to contemplate drawing inferences about litters of size eight from data merely on litters of size two! A better example of the contradictions that arise in MRF formulations is mentioned in McCullagh's discussion of Exercise 11. That is, a parametric MRF on a grid (say) is generally inconsistent with the corresponding MRF on a subset of the grid. In principle, consistency can be restored by conditioning on an appropriate boundary but this is usually too wasteful in practice. Partial fixes may be available by using marginalizations of MRFs on Z^2 or otherwise; see Besag and Kooperberg (1995), Besag and Higdon (1999) and Rue and Tjelmeland (2002). However, spatial effects are often of secondary importance, as in variety trials, and the main intention is to absorb an appropriate level of spatial variation in the formulation, rather than produce a spatial model with scientifically interpretable parameters. Nevertheless, McCullagh's basic point is well taken. For example, I view the use of MRFs in geographical epidemiology [e.g., Besag, York and Mollié (1991)] as mainly of exploratory value, in suggesting additional spatially related covariates whose inclusion would ideally dispense with the need for a spatial formulation; see Byers and Besag (2000) for an example on prostate cancer and ethnicity. A particularly blatant abuse of MRFs occurs in the analysis of social networks, where the parameters in Markov random graphs are often ascribed substantive interpretations that are meaningless, if only because they depend on the size of the system. I anticipate that MRFs will play a diminishing role in statistical analysis but currently they still have useful computational advantages when MCMC is used.

Agricultural field experiments.
Although field experiments no longer feature in most graduate programs, their design and analysis comprise an important area of application for statistics. Variety trials usually involve say 25 to 75 varieties of a crop, with very limited replication, perhaps three or, even worse, only two plots being assigned to each variety. Here I exclude early generation trials, often having very large numbers of varieties and no replication but with check plots of a standard variety used as controls. It has always been recognized that generally a crop will perform much more similarly on two plots close together than on plots further apart. Thus, Fisher [(1928), page 229] wrote, ". . . the peculiarity of agricultural field experiments lies in the fact, verified in all careful uniformity trials, that the area of ground chosen may be assumed to be markedly heterogeneous, in that its fertility varies in a systematic, and often a complicated manner from point to point." Soil measurements, such as pH, are not generally available to make adjustments for fertility. Fisher's solution to the problem resulted in the design and analysis of experiments, an approach that provides rigorous inference via randomization analysis but, for modern-day experiments, can be very inefficient when compared to model-based analysis. McCullagh refers frequently to agricultural experiments and in Section 8 proposes a spatial formulation based on harmonic functions. Despite initial resistance, spatial methods have become increasingly popular: for example, frequentist spatial analysis is now used in some 5000 experiments annually in Australia alone [Gilmour, Cullis, Smith and Verbyla (1999)]. In several places, McCullagh mentions Besag and Higdon (1999), henceforth BH, though with no obvious enthusiasm.
As background, BH describes a Bayesian approach to statistical analysis, with some emphasis on variety trials, and examines several complicated data sets; easier examples are analyzed in Besag and Higdon (1993), in Besag, Green, Higdon and Mengersen [(1995), Section 5] and in the rejoinder to discussion in BH. A first-order Gaussian intrinsic autoregression is used as a simple but flexible baseline representation of the spatial variation in fertility. BH does not pretend to provide a complete solution and indeed discusses some current deficiencies. Below I will confine my comments to points raised by McCullagh.

Response scale (Section 3.2). McCullagh makes the basic point that statistical analysis should not depend on response scale. BH achieves this by standardizing the raw data, which is effective but rather untidy in a Bayesian framework.

Covariate space (Section 3.2). McCullagh states that, for a trial in which fertilizer is applied at rates in the range 0–300 kg/ha, he requires the inferential universe to extend to all nonnegative rates. Yet this seems pointless without a corresponding extension of the model itself, even though any such extension cannot be assessed. Indeed, I presume that McCullagh himself would be unwilling to draw inferences at a rate of, say, 400 kg/ha, without additional information. Similarly, in his example on potatoes, I would not address varieties that are not in the experiment or remain yet to be invented. Of course, this is not the case if the tested varieties form a random sample from a definite population.

Experimental units (Section 3.2).
In discussing variety trials, McCullagh claims, "It is invariably understood, though seldom stated explicitly, that the purpose of such a trial is to draw conclusions concerning variety differences, not just for plots of this particular shape, size and orientation, but for comparable plots of various shapes, sizes and orientations." Here I would replace "invariably" by "almost never." McCullagh seems to confuse variety trials with uniformity trials in which a single variety is grown. Uniformity trials (e.g., the Mercer and Hall data in Section 8.6) were used long ago to investigate optimal plot size for genuine experiments, but they are performed very rarely now and plot dimensions are often determined by management practice rather than by statistical criteria. However, in passing I note that the asymptotic logarithmic behavior of the variogram for the BH intrinsic autoregression is in good agreement with the empirical findings from uniformity trials in Fairfield Smith (1938) and Pearce (1976). Of course, in a genuine variety trial, one might want to predict what the aggregate yield over the entire field would have been for a few individual varieties but this does not require any extension of the formulation to McCullagh's conceptual plots. Indeed, such calculations are especially well suited to the Bayesian paradigm, both theoretically, because one is supposed to deal with potentially observable quantities rather than merely with parameters, and in practice, via MCMC, because the posterior predictive distributions are available rigorously. That is, for the aggregate yield of variety A, one uses the observed yields on plots that were sown with A and generates a set of observations from the likelihood for those that were not, for each MCMC sample of parameter values, hence building a corresponding distribution of total yield.
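The posterior predictive calculation just described can be sketched in a few lines. This is purely illustrative: a plain normal likelihood stands in for the BH spatial formulation, and the "posterior draws" of (mu_A, sigma) are simulated rather than produced by a real MCMC run; all numbers are invented.

```python
# Toy sketch of the posterior predictive distribution of aggregate yield:
# for each posterior draw, impute the yields of plots not sown with A
# from the likelihood, then add the observed yields.
import random
random.seed(1)

observed_A = [5.1, 4.8, 5.4]          # plots actually sown with variety A
n_other = 9                           # plots sown with other varieties
draws = [(5.0 + 0.1 * random.gauss(0, 1),             # stand-in posterior
          0.5 + 0.05 * abs(random.gauss(0, 1)))       # draws of (mu_A, sigma)
         for _ in range(2000)]

totals = []
for mu_A, sigma in draws:
    # what the unobserved plots would have yielded had they been sown with A
    imputed = [random.gauss(mu_A, sigma) for _ in range(n_other)]
    totals.append(sum(observed_A) + sum(imputed))

totals.sort()
interval = (totals[50], totals[-51])  # central 95% interval for total yield
print(round(interval[0], 1), round(interval[1], 1))
```

The same collection of draws supports the ranking-and-selection summaries mentioned next, e.g. the posterior probability that one variety's total exceeds another's by 10%.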
One may also construct credible intervals for the difference in total yields between varieties A and B and easily address all manner of questions in ranking and selection that simply cannot be considered in a frequentist framework; for example, the posterior probability that the total yield obtained by sowing any particular variety (perhaps chosen in the light of the experiment) would have been at least 10% greater than that of growing any other test variety in the field.

three-dimensional space or merely in two dimensions. Even in two dimensions, there has been rather limited success in devising plausible formulations that are amenable to integration. In geography, the issue of invariance is referred to as the modifiable areal unit problem and has a long history. In statistics, early references are Heine (1955), Whittle (1962) and Matérn (1986), first published (remarkably) in 1960. However, it seems extremely unlikely that any formulation can be thought of as an accurate model for variation in fertility without additional measurements of factors that influence the growth of the particular crop under study. These could be in the form of extensive soil measurements, such as pH levels, or the use of check plots of a standard variety, dispersed over the experimental site, as occurs in single-replicate early generation trials. Fortunately, the sole purpose of variety trials is to compare varieties, not to assess spatial variation, which enters the formulation merely as a nuisance factor. With the benefit of some replication, it seems reasonable to expect that an approximate representation of "fertility" is generally adequate for statistical analysis. All the investigations that I know of support this view. Such investigations usually involve uniformity data to which dummy variety effects are added, so that the true values are known to be zero. An early example is in Besag and Kempton (1986).
The findings typically suggest that the gains from spatial analysis in a badly designed experiment provide improvements commensurate with standard analysis and optimal design. This is not a reason to adopt poor designs but the simple fact is that, despite the efforts of statisticians, many experiments are carried out using nothing better than randomized complete blocks. It is highly desirable that the representation of fertility is flexible but is also parsimonious because there are many variety effects to be estimated, with very limited replication. McCullagh's use of discrete approximations to harmonic functions in Section 8 fails on both counts: first, local maxima or minima cannot exist except (artificially) at plots on the edge of the trial; second, the degrees of freedom lost in the fit equals the number of such plots and is therefore substantial (in fact, four less in a rectangular layout because the corner plots are ignored throughout the analysis!). Nevertheless, there is something appealing about the averaging property of harmonic functions, if only it were a little more flexible. What is required is a random effects (in frequentist terms) version and that is precisely the thinking behind the use of intrinsic autoregressions in BH and elsewhere. Indeed, such schemes fit McCullagh's discretized harmonic functions perfectly, except for edge effects (because BH embeds the array in a larger one to cater for such effects), and they also provide a good fit to more plausible fertility functions. For specific comments on the Mercer and Hall data, see below. Of course, spatial scale remains an important issue for variety trials and indeed is discussed empirically in Section 2.3 and in the rejoinder of BH. For one-dimensional adjustment, the simplest plausible continuum process is Brownian motion with an arbitrary level, for which the necessary integrations can be
In the example in BH, there is close agreement between the estimates from the discrete and continuous formulations (which are not quite consistent mathematically). In two-dimensional adjustment, one can experiment with splitting the plots and comparing the results obtained from the fertility priors at the two scales. This can be done rigorously via MCMC by treating the yields in each pair of half plots as unknown but summing to the observed value. The few examples I have tried again suggest close agreement but, of course, I would much rather see a sound mathematical justification of approximate closure under spatial aggregation. This might be available via an appropriate extension of McCullagh’s harmonic processes. Mercer and Hall data (Section 8.6). McCullagh does not draw a clear distinction between the purposes of analyzing data from uniformity trials and from genuine variety trials. He also comments on a purely illustrative analysis of mine from more than 25 years ago about which I wrote [Besag (1974)], “It cannot be claimed that the present auto-normal schemes have been successful in reflecting the overall probabilistic structure of the wheat plots process.” The perhaps definitive discussion of the Mercer and Hall data is McBratney and Webster (1981), which uses the original records from Rothamsted Experimental Station to explain the characteristics noted by McCullagh, in terms of a previous ridge and furrow system. McCullagh’s formulation includes fixed effects for rows and columns in addition to those for the harmonic approximation, so that more than 120 parameters are fitted. This type of approach does not seem well suited to variety trials. The BH formulation fits two parameters, one of which provides data- driven directional flexibility, which McCullagh does not allow. 
Although, after 90 years, the Mercer and Hall data are now very tired indeed and deserve a decent burial, it might be worth noting that the basic BH fit at least retains all of the peaks and troughs in McBratney and Webster's Table 1, though it is certainly not a physical model.

Spatial design (Section 8.7). McCullagh proposes the development of a theory of efficient harmonic designs for agricultural experiments. Such designs would be very close to those that exist already for more relevant spatial formulations. For a recent review, see Atkinson and Bailey (2001), especially Section 10.

Conclusion. Although McCullagh's paper makes many valuable points, I believe that the approach is too rigid and places an emphasis on pure mathematics that is inappropriate for applied statistics. The paper promotes a universality in statistical modeling that is seldom present or necessary in practice. The role of approximation and its interrelationship with sample size seem to be ignored. As regards spatial statistics, the paper rediscovers old problems but does not yet provide effective solutions. Nevertheless, I am glad to end on a positive note by agreeing that the generalized covariance function in Section 8.6, known as the de Wijs model in geostatistics and dating back to the early 1950s, may be useful in developing more coherent spatial formulations for the analysis of agricultural field experiments.

Acknowledgments. I am grateful to Ted Harding for very helpful discussions on this topic and to Peter McCullagh for his patience in arguing about specific points.

REFERENCES

ATKINSON, A. C. and BAILEY, R. A. (2001). One hundred years of the design of experiments on and off the pages of Biometrika. Biometrika 88 53–97. BESAG, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. Ser. B 36 192–236. BESAG, J. E. (1975). Statistical analysis of non-lattice data. The Statistician 24 179–195. BESAG, J.
E., GREEN, P. J., HIGDON, D. M. and MENGERSEN, K. L. (1995). Bayesian computation and stochastic systems (with discussion). Statist. Sci. 10 3–66. BESAG, J. E. and HIGDON, D. M. (1993). Bayesian inference for agricultural field experiments. Bull. Internat. Statist. Inst. 55 121–136. BESAG, J. E. and HIGDON, D. M. (1999). Bayesian analysis of agricultural field experiments (with discussion). J. Roy. Statist. Soc. Ser. B 61 691–746. BESAG, J. E. and KEMPTON, R. A. (1986). Statistical analysis of field experiments using neighbouring plots. Biometrics 42 231–251. BESAG, J. E. and KOOPERBERG, C. L. (1995). On conditional and intrinsic autoregressions. Biometrika 82 733–746. BESAG, J. E., YORK, J. C. and MOLLIÉ, A. (1991). Bayesian image restoration, with two applications in spatial statistics (with discussion). Ann. Inst. Statist. Math. 43 1–59. BREIMAN, L. (2001). Statistical modeling: the two cultures (with discussion). Statist. Sci. 16 199– 231. BYERS, S. D. and BESAG, J. E. (2000). Inference on a collapsed margin in disease mapping. Statistics in Medicine 19 2243–2249. FAIRFIELD SMITH, H. (1938). An empirical law describing heterogeneity in the yields of agricultural crops. J. Agric. Sci. 28 1–23. FISHER, R. A. (1922). On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London Ser. A 222 309–368. FISHER, R. A. (1928). Statistical Methods for Research Workers, 2nd ed. Oliver and Boyd, Edinburgh. GILMOUR, A. R., CULLIS, B. R., SMITH, A. B. and VERBYLA, A. P. (1999). Discussion of paper by J. E. Besag and D. M. Higdon. J. Roy. Statist. Soc. B 61 731–732. HEINE, V. (1955). Models for two-dimensional stationary stochastic processes. Biometrika 42 170– 178. KÜNSCH, H. R. (1987). Intrinsic autoregressions and related models on the two-dimensional lattice. Biometrika 74 517–524. MATÉRN, B. (1986). Spatial Variation. Springer, New York. MCBRATNEY, A. B. and WEBSTER, R. (1981). 
Detection of ridge and furrow pattern by spectral analysis of crop yield. Internat. Statist. Rev. 49 45–52. PEARCE, S. C. (1976). An examination of Fairfield Smith's law of environmental variation. J. Agric. Sci. 87 21–24.

DISCUSSION

BY HANS BRØNS

University of Copenhagen

Peter McCullagh's paper is exciting, because it can be seen as the start of a new, long overdue discussion of the mathematical foundation of the theory of statistics. He rightly points out that only part of the thinking in theoretical statistics is formalized mathematically and tries to extend the existing theory of statistical modelling using modern abstract mathematical tools. This is a great philosophical challenge, but it is a formidable pedagogical task to communicate the results to the statisticians. The paper contains beyond the abstract definition of the new extended concept of a statistical model a treasure trove of examples and counterexamples, but I shall concentrate on an analysis of the definition of models. McCullagh's idea is that parameter spaces and sample spaces which are usually treated as sets or measure spaces in applications have a richer structure defining what one could be tempted to call their "physical nature," which should be reflected in the models and in the choice of transformations between them. This is done by giving them an inner structure, symmetry for example, and by considering each model as a unit in a greater universe of models. To give this a mathematical expression, the much loved and much hated theory of categories is used. McCullagh's categories are a little bushy. Any object in any category can be connected by an arrow with any other object in any of the categories considered. They are evidently assumed to be built on sets and can therefore be connected by functions; that is, they are all concrete categories [Mac Lane (1998), page 26].
A category A is concrete, if it is equipped with a faithful (injective on hom-sets) functor to the category Set of sets and functions. Let this forgetful functor be denoted by U_A. Several of the categories constructed by McCullagh are so-called comma categories [Mac Lane (1998), page 46]. Consider a diagram of three categories and two functors with the same codomain, T : B → A and S : C → A. The comma category from T to S, denoted (T, S) or (T ↓ S) as we shall do, has as objects triples (b, c, f), where b is an object in category B, c an object in category C, and f : Tb → Sc is an arrow in category A. Arrows between objects (b, c, f) and (b′, c′, f′) in the category (T ↓ S) are all pairs (g, g′) consisting of an arrow g : b → b′ in category B and an arrow g′ : c → c′ in category C such that the diagram

    Tb  ---f--->  Sc
    |Tg          |Sg′
    v            v
    Tb′ ---f′---> Sc′

is commutative. Composition in (T ↓ S) is coordinatewise, and identities are pairs of identities. By taking the first and second coordinates of the objects and arrows in the comma category it is provided with a pair of natural (forgetful) projection functors

    B <--PrB-- (T ↓ S) --PrC--> C.

Let D be a category with object class obj(D) and let D : D → (T ↓ S) be a functor (if D is small enough D would be called a diagram over D in the comma category). For d ∈ obj(D) put D(d) = (D1(d), D2(d), D3(d)) such that D3(d) : T(D1(d)) → S(D2(d)) in A. By joggling the symbols it is seen that the family (D3(d))_{d ∈ obj(D)} is a natural transformation T ◦ D1 → S ◦ D2 : D → A. The mapping D → (PrB ◦ D, PrC ◦ D, (D3(d))_{d ∈ obj(D)}) is a 1–1 correspondence between functors D : D → (T ↓ S) and triples (DB, DC, π) consisting of a functor DB : D → B, a functor DC : D → C, and a natural transformation π : T ◦ DB → S ◦ DC : D → A. In this way a diagram in (T ↓ S) is identified with a natural transformation between certain functors.
In his most general definition of a statistical model McCullagh starts with three categories: (1) the category catU of statistical units, (2) the category catV of response scales, and (3) the category catΩ of covariate spaces. From these he first constructs the category catD of designs as a comma category, in our notation:

    catD = (U_catU ↓ U_catΩ)

and then the category catS of sample spaces as a product category, catS = catV × catU^op. Finally, the basic category is the product category catV × catD^op. Let

    catV <--PrV-- catV × catD^op --Pr_{catD^op}--> catD^op

be the diagram of projections. Since

    catV × catD^op --id_catV × Pr_{catU^op}--> catV × catU^op = catS,

both catD and catS can be derived from the basic category.

The objects of catS are the sample spaces S which are carriers of probability distributions with P(S) the set of probability distributions on S. Sample spaces are therefore not just sets in general, but must be sets with some kind of measurability property by having an attached σ-algebra of subsets to support abstract probability measures, by having the structure of a locally compact topological space to be the basis for a normed Radon-measure, or by having some similar structure. Arrows between two such spaces should be measurable, proper or something similar, such that transformation of probability measures is defined. These properties should be part of the definition of the basic category, and they ensure the existence of a (faithful) functor V from catS to the concrete category MeasSet of measurable sets and measurable mappings. If to an arrow t : S → S′ in MeasSet we assign the mapping P(t) : P(S) → P(S′) that carries a probability measure on S into the t-transformed probability on S′, we get a functor P : MeasSet → Set.
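For finite sample spaces the functor P just defined is easy to make concrete; the following sketch (my notation, not Brøns's) represents a distribution as a dict of point masses and checks the functor property P(t ◦ s) = P(t) ◦ P(s):

```python
# The functor P sends a map t : S -> S' to the map P(t) that pushes a
# probability distribution forward along t.  Functoriality means that
# pushing forward along a composite equals composing the pushforwards.

def pushforward(t):
    def P_t(p):
        q = {}
        for x, mass in p.items():
            q[t(x)] = q.get(t(x), 0.0) + mass
        return q
    return P_t

p = {1: 0.2, 2: 0.3, 3: 0.5}                 # a distribution on S = {1, 2, 3}
s = lambda x: x % 2                          # s : S -> {0, 1}
t = lambda x: "even" if x == 0 else "odd"    # t : {0, 1} -> {"even", "odd"}

via_composite = pushforward(lambda x: t(s(x)))(p)     # P(t o s)(p)
via_steps = pushforward(t)(pushforward(s)(p))         # P(t)(P(s)(p))
assert via_composite == via_steps
assert abs(via_steps["odd"] - 0.7) < 1e-12            # mass of {1, 3}
assert abs(sum(via_steps.values()) - 1.0) < 1e-12     # total mass preserved
```

The same pushforward operation is what makes the identity P(t) : P(S) → P(S′) a well-defined arrow in Set for each measurable t.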
The parameter Θ in McCullagh's model is a contravariant functor on catΩ to a not very accurately specified concrete category K, and the model is a natural transformation,

    P : U_K ◦ Θ ◦ Pr_{catΩ^op} ◦ Pr_{catD^op} → P ◦ V ◦ (id_catV × Pr_{catU^op}) : catV × catD^op → Set

or in more detail, a natural transformation P between the two composite functors

    catV × catD^op --Pr_{catD^op}--> catD^op --Pr_{catΩ^op}--> catΩ^op --Θ--> K --U_K--> Set
    catV × catD^op --id_catV × Pr_{catU^op}--> catV × catU^op = catS --V--> MeasSet --P--> Set

Besides considering this overall structure McCullagh puts specific constraints on the arrows in some of the categories, constraints of the same sort as limiting factors to be surjective mappings in the multifactor analysis of variance models. Such restrictions tend to make the mathematical formalization heavier and are really not needed. The simplest statistical models in McCullagh's paper are the (parametrized) statistical models, which are triples (Θ, S, P) consisting of a parameter set, a

in addition a mapping from a parameter space into the set of distributions. The discussion focusses on how a model impacts on the context being examined, and how it might extend in applications to broader contexts. Category theory is used to provide the general framework for this. One could reasonably question the familiar definition of a model in various simple ways. Do we wish to speak of just a set of possible response distributions as providing the core model of statistics? Why not perhaps the minimal step up to the perception of a black box mechanism, with the output being the response variable and an input dial giving the parameter value that determines the response distribution, or are we so influenced by set theory that we would view these as the same? One could argue that the two formulations are isomorphic, but the perception and accessibility and influence for the latter are quite different.
The former describes an essentially unrelated set of possible statistical behaviors while the latter at least conceptualizes a behavioral pattern that is changed by the input dial [Fraser (1968a)].

The modelling should perhaps be more elaborate still. Most deterministic systems identify discrete component elements with linkages among these. The system is not modelled by just recording the set of pairs consisting of the determined outcome response with the corresponding input values. The components and the directions of causes and influences are modelled as well. Why should it be less for the statistical model? While the theory of graphical models does attempt to describe such linkages, much of applied statistics does not.

In many contexts there are identifiable sources for variation and error contribution. Shouldn't these individual sources, rather than the global effect, be directly modelled? The discussion in the paper does not address such more elaborate modelling. Does it matter when the response behavior is the same? It can matter in terms of what is known about the system when output values are observed. When the detailed model provides more information, one should certainly expect to obtain more for inference. An examination of statistical models that does not go beyond the pre-black box modelling to include specific elements of structure such as the error variables and other internal component elements can be viewed as short on the "extensive" side of coverage.

The attention given to component parameters is needed and very welcome. The artificiality of the regression parameters for the Box–Cox model has taken a long time to emerge with such clarity as it does here. It is also very interesting to have a mathematical explanation of the widely (but not uniformly) adopted rule in applied work that models with interactions and no corresponding main effects are not sensible.
In the context of transformation models there is some discussion of the nature of particular component parameters in Fraser [(1968b), page 65]. Consider a transformation model with parameter θ taking values in a transformation group and with identifiable error variable, say z; thus y = θz. In this transformation model setting what does it mean to say a component parameter ϕ is natural? It is argued in Fraser [(1968b), page 65] that a component parameter is natural if it indexes the left cosets of a subgroup. The arguments are not dissimilar to those arguing in reverse that the regression parameter β is not a natural parameter for the Box and Cox model.

Also, suppose for simplicity that the application of the group is unitary and that left and right invariant measures on the group are available. The model is invariant in the sense that ỹ = gy has distribution with parameter value θ̃ = gθ. The application of the group to the parameter space would suggest the left invariant measure dµ(θ) as the natural default prior measure for the parameter. But if we view the transformation θ as being relative to some initial error presentation then we might change that presentation and have θ̄ = θh applied to the modified error as h⁻¹z. We might then want invariance relative to the transformation θ̄ = θh. This then suggests that we should use the right invariant measure dν(θ) as the natural default measure for the parameter, a default measure often preferred in the Bayesian context.

The weighing in of category theory leading to the definition of a natural parameter would seem to need to be qualified by an acceptance that a statistical model in an application is only providing some reasonable approximation to the reality under investigation. Thus, for example, in Section 6.1 it is noted that the coefficient of variation is not a natural component parameter; but in many applications it can be a very useful and informative parameter.
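Fraser's left-coset criterion can be made concrete in the location–scale (affine) group. The sketch below is my own illustration, assuming the composition rule (a1, b1) ∘ (a2, b2) = (a1 + b1·a2, b1·b2), which corresponds to applying (a2, b2) first: the left cosets of the scale subgroup H = {(0, b): b > 0} are indexed by the location component a, and left translation by any g carries cosets to cosets, so the location component is natural in Fraser's sense.

```python
import random

# The affine group on the line: g = (a, b) acts by x -> a + b*x, with b > 0.
# Composition corresponds to composing the maps (apply g2 first, then g1).
def compose(g1, g2):
    a1, b1 = g1
    a2, b2 = g2
    return (a1 + b1 * a2, b1 * b2)

# Left coset of theta = (a, b) relative to the scale subgroup H = {(0, s)}:
# theta * (0, s) = (a, b*s), so the coset theta*H is determined by a alone.
def coset_label(theta):
    return theta[0]

# Left translation g.theta maps cosets onto cosets: the coset label of
# g.theta depends on theta only through the coset label of theta.
random.seed(0)
g = (1.5, 2.0)
for _ in range(100):
    a = random.uniform(-5, 5)
    s1, s2 = random.uniform(0.1, 5), random.uniform(0.1, 5)
    t1, t2 = (a, s1), (a, s2)       # same coset, different scale components
    assert coset_label(t1) == coset_label(t2)
    assert coset_label(compose(g, t1)) == coset_label(compose(g, t2))
```

Repeating the check with the translation subgroup {(a, 1)} in place of H would show that the scale component indexes its right cosets instead, which is where the left-versus-right invariant measure question above comes from.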
McCullagh shows how category theory provides a frame of reference for examining various simple but anomalous examples. But can category theory do more than provide a frame of reference? Can it provide a positive way of implementing and extracting anomalies in modelling? It is clear that the subject of category theory is difficult to master. Perhaps it would be better to extract simple and broadly based principles for modelling. Or will the approach need to be one of seeking the anomalous examples and then seeking the related principle?

The Cauchy example highlights an important point. McCullagh notes that the location parameter is not a natural parameter in the context of the linear fractional group. The concerns for the linear fractional group should perhaps go further. A typical transformation is not a mapping of the real line to itself, but can map a point to the point at infinity and vice versa; a rather severe lack of continuity. McCullagh (1992) used these transformations to illustrate the apparent nonuniqueness of the common ancillary statistic for location scale analysis. For this the proposed reference set from the use of the linear fractional group is the contour corresponding to the observed value of the ancillary, and will have multiple points with one or another coordinate at infinity, hardly a reference set with the reasonable continuity expected for applications. It would seem appropriate that such anomalies be avoided at the foundational assessment level for the model. Thus the use of the linear fractional group for assessing the location model should be rejected less for the anomalous nature of the location parameter and more for the fact that it does not provide mappings of the sample space to itself.

This is a very comprehensive examination of serious foundational issues for statistical theory and practice.
We look forward to new examples of anomalies uncovered by category theory and also to a wider concern for foundational issues.

REFERENCES

FRASER, D. A. S. (1968a). A black box or a comprehensive model. Technometrics 10 219–229.
FRASER, D. A. S. (1968b). The Structure of Inference. Wiley, New York.
MCCULLAGH, P. (1992). Conditional inference and Cauchy models. Biometrika 79 247–259.

DEPARTMENT OF STATISTICS
UNIVERSITY OF TORONTO
100 ST. GEORGE STREET
TORONTO, ONTARIO M5S 3G3
CANADA
E-MAIL: [email protected]

DISCUSSION

BY INGE S. HELLAND

University of Oslo

Theoretical statistics has since the beginning been founded on the idea that a statistical model is just a parametric family of probability measures. Now Peter McCullagh has given us a much richer model concept, and, importantly, this is strictly motivated from applied science considerations. In fact this new model concept is so rich and at the same time so abstract that it will take time for us to absorb it and for the statistical community to assess its implications. Therefore I will limit these comments to the simplest situation, which is a model under symmetry, so that the categories involved are groups.

The idea that it might be useful to supplement the choice of a statistical model—in the sense of a parametric class of measures—by the choice of a symmetry group, is an old one, albeit still somewhat controversial regarding its implications. One thing that is new here in McCullagh's approach is the systematic use of the concept of natural subparameter. I will expand a little on this concept, using parallel developments from Helland (2002). Let a group G be defined on the parameter space Θ of a model. A measurable function ξ from Θ to another space Ξ is called a natural subparameter if ξ(θ1) = ξ(θ2) implies ξ(gθ1) = ξ(gθ2) for all g ∈ G.
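This defining implication can be checked numerically. The sketch below is my own, assuming the location–scale group acts on θ = (µ, σ) by g = (a, b): θ ↦ (a + bµ, bσ); the location µ passes the test on the pairs tried, while µ/σ fails on an explicit counterexample.

```python
# Group action of g = (a, b), b > 0, on the parameter theta = (mu, sigma).
def act(g, theta):
    a, b = g
    mu, sigma = theta
    return (a + b * mu, b * sigma)

def is_natural_on(xi, g, theta1, theta2):
    """Check the defining implication for one g and one pair theta1, theta2."""
    if xi(theta1) == xi(theta2):
        return xi(act(g, theta1)) == xi(act(g, theta2))
    return True   # the implication is vacuously true when the premise fails

mu = lambda theta: theta[0]
cv = lambda theta: theta[0] / theta[1]     # the ratio mu/sigma

# mu: equal locations remain equal after any group element is applied.
t3, t4 = (1.0, 1.0), (1.0, 3.0)            # same mu, different sigma
assert is_natural_on(mu, (2.0, 3.0), t3, t4)

# mu/sigma: a counterexample under a pure translation g = (1, 1).
t1, t2 = (1.0, 1.0), (2.0, 2.0)            # cv(t1) == cv(t2) == 1
assert not is_natural_on(cv, (1.0, 1.0), t1, t2)   # 2/1 != 3/2 afterwards
```

A single failing pair is enough to show that µ/σ is not natural for this group, whereas naturality of µ would of course require the implication to hold for all pairs and all g, which here follows algebraically from a + bµ1 = a + bµ2 whenever µ1 = µ2.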
For example, in the location and scale case the location parameter µ and the scale parameter σ are natural, while the coefficient of variation µ/σ is not natural (it is if the group is changed to the pure scale group). In general the parameter ξ is natural iff the level sets of the function ξ = ξ(θ) are transformed onto other level sets.

WHAT IS A STATISTICAL MODEL? 1289

Consider as an example the following situation: assume that the model of two experiments on the same unit depends upon which experiment is carried out first, say that the parameter of the first experiment is θ1(ψ,α) if this is done first, otherwise θ1(ψ,β), where there is a given relationship between α and β. Let the morphisms contain permutation of the order of the experiments. Then, in general, θ1 will not be a natural subparameter.

It is tempting to approach quantum physics from a statistical point of view by saying that it contains a number of situations with similar properties. In fact, one can proceed quite a bit from this, but there are also very difficult questions in this approach, in particular in understanding the formal structure of quantum mechanics. It seems, however, that group representation theory may be of some help, combined with a systematic reduction of statistical models. In working in such new directions it is very important that we now have a better characterization of what a statistical model is. The kind of precise link between common sense and formal theory which Peter McCullagh has demonstrated so forcefully in his paper, is indeed useful for several purposes, and it may set a standard for much theoretical work in statistics.

REFERENCES

DAWID, A. P., STONE, M. and ZIDEK, J. V. (1973). Marginalization paradoxes in Bayesian and structural inference (with discussion). J. Roy. Statist. Soc. Ser. B 35 189–233.
HELLAND, I. S. (2001). Reduction of regression models under symmetry. In Algebraic Methods in Statistics and Probability (M. Viana and D. Richards, eds.) 139–153. Amer. Math.
Soc., Providence, RI.
HELLAND, I. S. (2002). Statistical inference under a fixed symmetry group. Available at http://www.math.uio.no/~ingeh/.

DEPARTMENT OF MATHEMATICS
UNIVERSITY OF OSLO
P.O. BOX 1053 BLINDERN
N-0316 OSLO
NORWAY
E-MAIL: [email protected]

DISCUSSION

BY PETER J. HUBER

Klosters, Switzerland

McCullagh's paper is a plea for more realistic statistical models. He hopes to enforce sensibility of models through category theory, by imbedding some equivalent of "having a well-defined meaning" into the formal theory. I certainly agree with his aims and that it is necessary to pay attention to the inference domain and that category theory may help you to do so. But I must confess that I stumbled over some telltale idiosyncrasies of his terminology. He variously denotes bad models as "absurd," "eccentric," "arbitrary" or "capricious." For me, a statistical model ordinarily is supposed to model some real-world situation, and as such it either is adequate, or it is not. His attributes do not apply, except to straw-man models concocted by a mathematical statistician. Correspondingly, he may go too far with his attempts to discuss sensibility of models in the abstract; in my opinion, constant reference to the underlying real-world situation is advisable and even necessary. If the latter is capricious, the model should faithfully render this capriciousness! I am thinking here, for example, of the distinction between integral and half-integral spin values in quantum mechanics, which to the uninitiated looks no less absurd or artificial than the even-odd distinction in McCullagh's Exercises 1, 2 and 4.

In the following, I shall concentrate on the category theory aspects of the paper. I should begin by recounting some bygone experiences of my own. More than forty years ago—I had just finished a thesis in category theory [Huber (1961)]—I went to Berkeley to learn statistics. There, I sat in Lucien Le Cam's courses on decision theory.
He devoted much time, I believe an entire term, to the comparison of experiments in the style of his recently submitted, but as yet unpublished 1964 paper in the Annals. One should know that Le Cam loved abstractness and at that time was enthusiastic about Grothendieck's (1955) thesis. To avoid possible misunderstandings I should add that that thesis was in functional analysis, not in homological algebra [see Cartier (2001), page 392 for some of the background].

In Le Cam's terminology, an "experiment" is exactly the same as a "model" in McCullagh's sense, namely a parameter set Θ together with a function P: Θ → P(X), which assigns to each parameter point θ ∈ Θ a probability distribution Pθ on the sample space X. Le Cam, however, found it convenient to replace this classical notion of experiment by a weaker structure consisting of a vector lattice with a unit, E, together with a family indexed by Θ of positive normalized linear functionals on E [Le Cam (1964), pages 1420–1421]. With some restriction of generality, E may be thought of as the space of bounded observable random variables on X, and to facilitate this interpretation, the set X was retained by Le Cam as a part of the structure. Moreover, since he wanted to handle sufficiency, simple pointwise maps would not do, and Le Cam had to replace them by randomized maps, or transition functions (see the example below). Anyway, things got pretty complex, and to sort them out in my own mind, I rephrased the fundamentals of Le Cam's theory in categorical language.

Example: Sufficient statistics. Let Experiment A consist in observing X = (X1, . . . , Xn), where the Xi are i.i.d. N(µ,σ2), while Experiment B consists in observing two independent variables (U,V), where U is N(µ,σ2/n), and V/σ2 is χ2 with n−1 degrees of freedom. There is a natural map from A to B [U = X̄ = n−1∑Xi, V = S2 = ∑(Xi − X̄)2]. On the other hand, since the conditional distribution of X = (X1, . . .
,Xn) given (X̄,S2) does not depend on (µ,σ2), there is a randomized map (or transition function) from B to A, reconstructing a sample Y = (Y1, . . . , Yn) having the same stochastic properties as the original sample X for all values of (µ,σ). With suitably chosen definitions, these two morphisms from A to B and from B to A would be inverse to each other.

In order to deal with situations occurring in the analysis of variance, one had to consider randomized maps not only of the sample spaces, but also of the parameter spaces. Thus, morphisms in the category of statistical experiments really were suitably defined equivalence classes at least of pairs of transition functions, and built-in redundancies (vector lattices and sample spaces) complicated matters even further.

Categorical thinking, that is: thinking in terms of morphisms, functors and commutative diagrams, always helps to clarify the underlying structure of mathematical objects, to define structural isomorphisms properly, and in particular, to recognize morphisms that are natural (or canonical). Rephrasing statistical problems in categorical language had clarified my thinking about some of them in a crucial fashion. In particular, it had helped me to understand and properly formalize the concept of invariance of a statistical experiment within Le Cam's framework of topological vector lattices, and hence to formulate and prove the Hunt–Stein theorem in what I still believe to be its natural habitat; see the remarks by Brown [(1984), page 407].

For my own purposes I collected the categorized foundations of Le Cam's theory of statistical experiments in a memo which ultimately grew to about 10 to 12 pages, all of it definitions. But then, why did I not expand my little memo and prepare it for publication? There were two reasons: First, abstract categorical thinking had helped me only because I already was fluent in that language.
Very few people would profit from a paper piling more abstraction on Le Cam's already very abstract setup, and I foresaw that I would have even more difficulties getting it accepted than Lucien had with his papers (he told us some anecdotes about his prolonged fights with editors and referees). Second, while my translation into categorese had streamlined the exposition, it had not added content to these particular problems. It did not unify different lines of thought (as I had been able to do in my thesis, for example), it did not lead to new theorems, except very indirectly, and did not even simplify the proof of a central statement such as the Blackwell–Sherman–Stein theorem. Apart from that, around that time I became engrossed in much more exciting robustness problems.

Now, how about McCullagh's paper? First, when we compare what he does now to what I did 40 years ago, it is quite illuminating to notice that basically the same situation can be categorized in more than one fashion. But I think the same comments, benefits and difficulties apply here too. It may be heartening to learn that such papers have become publishable in statistical journals now. McCullagh certainly is able to clarify thinking about models (but I must admit

1294 DISCUSSION

REFERENCES

ARBUTHNOTT, J. (1712). An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes. Philos. Trans. Roy. Soc. London 27 186–190.

DEPARTMENT OF MATHEMATICS
ETH ZURICH
WEN B17
ZURICH
SWITZERLAND
E-MAIL: [email protected]

DISCUSSION

BY STEVE PINCUS

Guilford, Connecticut

This is a very interesting and thought-provoking paper, which I enjoyed reading.
While the notions of sound statistical models and parameters are essential to the core of a classical statistical framework, heretofore a broadly applicable, single (parsimonious) unifying conceptual approach and formalism had not been put forth, nor one addressing the caveats pointed out in the Introduction in the present paper. Professor McCullagh now successfully advances such a formalism, based on the natural algebraic methodology given by category theory, building on previous work on invariance and factorial models [McCullagh (2000)] and on quotient spaces and statistical models [McCullagh (1999)]. The author shows that a diverse range of sensible statistical applications, often considered as distinct settings, all can be handled within the framework of this formalism. The breadth of the range of the 12 examples in Section 2 reinforces the scope of the formalism. Some might argue that this unification could be achieved without introducing the technologic machinery of category theory or algebra. The crucial point, to my sensibilities, clarifying that the present (or similar) technology is required for a general formalism is the discussion of the naturality of subparameter specification, Section 4.5, with its implications to inference. The concept of a natural transformation is a central notion within group and category theory, but not ubiquitous throughout mathematics. Insofar as cross-pollination between and from algebra to (probability and) statistics, there is a long and rich history that goes back at least half a century. Thus the utility of Professor McCullagh’s approach is not unprecedented, although the extensive application at the level of generality of category theory may be so. 
The series of contributions to a variety of statistical problems, including topics as diverse as experimental design and card shuffling made by Persi Diaconis, Graham and Kantor (1983), Diaconis (1988) and Rosemary Bailey (1981, 1991), among others, exploiting group invariance and symmetries are probably well known to the readers here. Notably, as well, the history also includes significant and influential contributions made by Ulf Grenander [e.g., Grenander (1963)] to the study of patterns, and by Harry Furstenberg (1963) to the study of products of random matrices (from which the Lyapunov spectra and related dynamical systems theory parameters are derived). I especially note these latter two mathematicians as each integrally features Lie group and Lie algebra theory throughout their efforts. Thus, disciples of these developments, already facile with the algebraic machinery utilized herein, may be able to make additional contributions atop those developed to date by the author.

I have two stylistic comments. First, the organization is fine for someone relatively familiar with category theory, but for other readers it may be considerably more challenging, as it leads heavily with this theory in Sections 3–5. Without a road map, the interested reader might give up before recognizing the "meat" of the approach. I'd strongly suggest that relatively early on, before diving headlong into Sections 3–5, the reader jump to Section 6 and find an example of particular interest, then go back and forth between the theory sections and the examples. One can then appreciate the formal developments "concretely" from which the abstractions and generalizations become much clearer. One illuminating example for me was the example in 6.2, Exercise 10 of Section 2, regarding inference concerning the correlation coefficient ρ derived from the standard linear regression model with one covariate.
The second (and cosmetic) comment is that I prefer the adjective ad hoc to absurd in describing, for example, the models of Section 2. I agree with the substance of the author's point, that there is no hope of a broad or natural theoretic general framework or basis for these models. Nonetheless, such models are of course used by well-intended yet (somewhat) naïve practitioners to produce, for example, figures of merit within specific contexts, and I see no need to embarrass the target user.

My major concern is primarily a recognition of a limitation of the algebraic approach. Namely, the (category theory) technology utilized herein cannot be readily extended or adapted to address some spiritually related operational questions that are central to the core of statistical practice. To my sensibilities, this paper is probabilistic (rather than statistical) in spirit, albeit evidently applied to model types and problems classically considered by statisticians. I would probably be happier with the title "What is a parametrized probabilistic (or stochastic process) model?" This is not "bad," as I am a probabilist, but the formalism inferentially bears on processes and distributions, not short and moderate length sequences apart from their definition as possible initial segments from typical realizations of a process.

The specific technical issue is that an algebraic approach has a considerable limitation, in the lack of a natural metric. This point is noted by Professor McCullagh in Section 8.3, albeit towards a very different consideration (the Besag–Higdon formulation of yield in agricultural field trials): "Algebra does not easily deal with approximations or inequalities. . . ." The presence of a metric is not required in the present paper, and the lack thereof within category theory does not obviate any of the formal developments.
But suppose we wish to consider which model form is better, given one or several (necessarily finite) realizations, among two proposed forms? As David Cox (1990) states, "In particular, choice of an appropriate family of distributions may be the most challenging phase of analysis." Most sensible means to resolve such a question, and there are many such, require a metric, and thus primarily algebraically (particularly category theory) based attempts would likely be artificial and inadequate. I do wish to point out that if there is topological as well as algebraic structure in a formal model framework (e.g., a Lie group), then convergence defined by the topology can resolve the metric question in terms of some limiting or asymptotic behavior; thus some probabilistic, that is, infinite sequence comparisons can possibly be resolved within an algebraic framework. But for decidedly nonasymptotic data lengths, and statistical procedures based on them, many formal descriptions will still require nonalgebraic approaches.

Finally, also on a related topic, I would like to mention that for discrete state data, there is a means of assessing the "extent of randomness" or irregularity of a finite or infinite sequence that does not require a model-based (or stochastic process) setting, parametrized or otherwise, in the classical sense [Pincus and Singer (1996); Pincus and Kalman (1997)]. A formalism is developed strictly on a combinatorial basis, again for both finite and infinite length sequences. This could be used, for example, in both discrimination and classification contexts, although this is a separate discussion altogether.

REFERENCES

BAILEY, R. A. (1981). A unified approach to design of experiments. J. Roy. Statist. Soc. Ser. A 144 214–223.
BAILEY, R. A. (1991). Strata for randomized experiments (with discussion). J. Roy. Statist. Soc. Ser. B 53 27–78.
COX, D. R. (1990). Roles of models in statistical analysis. Statist. Sci. 5 169–174.
DIACONIS, P. (1988).
Group Representations in Probability and Statistics. IMS, Hayward, CA.
DIACONIS, P., GRAHAM, R. L. and KANTOR, W. M. (1983). The mathematics of perfect shuffles. Adv. in Appl. Math. 4 175–196.
FURSTENBERG, H. (1963). Noncommuting random products. Trans. Amer. Math. Soc. 108 377–428.
GRENANDER, U. (1963). Probabilities on Algebraic Structures. Wiley, New York.
MCCULLAGH, P. (1999). Quotient spaces and statistical models. Canad. J. Statist. 27 447–456.
MCCULLAGH, P. (2000). Invariance and factorial models (with discussion). J. Roy. Statist. Soc. Ser. B 62 209–256.
PINCUS, S. and KALMAN, R. E. (1997). Not all (possibly) "random" sequences are created equal. Proc. Nat. Acad. Sci. U.S.A. 94 3513–3518.
PINCUS, S. and SINGER, B. H. (1996). Randomness and degrees of irregularity. Proc. Nat. Acad. Sci. U.S.A. 93 2083–2088.

990 MOOSE HILL ROAD
GUILFORD, CONNECTICUT 06437
E-MAIL: [email protected]

WHAT IS A STATISTICAL MODEL? 1299

Statistics as practiced today suffers from many similar problems and paradoxes. In survival analysis we have Cox's likelihood, which undoubtedly is a canonical thing, but unfortunately not a likelihood in any model that we know of. In basic statistical inference, paradoxes related to sequential situations and Birnbaum's paradox are among the problems we have learned to ignore and live with. This does not have much to do with Peter McCullagh's paper, but it illustrates the need for a clarification of basic concepts. A more restrictive definition of a statistical model than the usual one by arbitrary families of probability distributions on a sample space may turn out to be a necessary ingredient here.

Certain aspects of Peter McCullagh's exposition take us back to the roots, in the sense that many of the ideas that he expresses in terms of category theory are very similar to ingredients of the slightly old-fashioned way of expressing models in terms of populations and samples, rather than probability distributions and random variables.
I would like to stress this by repeating a single point he has concerning regression analysis. Just to emphasize that category theory is not a necessary tool here, but only a convenient language, I will try to do it without this tool.

The standard textbook way of specifying a simple regression model is to say that we have observations yi of independent normal random variables Yi, i = 1, . . . , n, with the same variance σ2 and expectations of the form α + βxi, where x1, . . . , xn are known quantities. This is undoubtedly a useful way of saying it, but it has the drawback that it does not specify the parameters that can be meaningfully estimated. Quantities like the sample size n and the average expectation α + βx̄ (or, more complicated, the correlation coefficient as suggested by the author) can be estimated without problems in this model, but this is irrelevant for most purposes because these quantities are related not only to the unknown parameters, but also to the design—in this case represented by the choice of the values x1, . . . , xn of the independent variable.

A way of stating the model that takes this into account goes as follows. Take as the starting point an infinite population {(xi, yi)} of x-values and corresponding y-values. Only the x-values are visible from the beginning, and what we do when we perform the experiment is actually to draw—not at random, but rather by some criterion for selection of a representative set of x-values—a sample from this population. Now, define a parameter function as a rule which to each such selection x1, . . . , xn of finitely many x-values assigns a parameter in the usual sense, that is, a function of the unknown distribution in the corresponding statistical model. This includes any function of (α,β,σ2), and also, so far, expressions like n or α + βx̄. Here, a technical condition is required, stating that at least two distinct x-values should be present, but we will ignore this irrelevant detail in the following.
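The design dependence of a quantity such as α + βx̄ is easy to exhibit numerically. The sketch below is my own illustration of the parameter functions just described: with (α, β) held fixed, α + βx̄ changes when x-values are removed from the design, while β itself does not.

```python
# Parameter functions: rules assigning, to each finite design (x_1, ..., x_n),
# a function of the model parameters (alpha, beta, sigma2).  Fix (alpha, beta).
alpha, beta = 2.0, 0.5

def slope(design):
    return beta                          # does not depend on the design

def mean_response(design):
    xbar = sum(design) / len(design)
    return alpha + beta * xbar           # depends on the design through xbar

full = [0.0, 2.0, 4.0, 6.0]
sub = [0.0, 2.0]                         # the design with two x-values removed

assert slope(full) == slope(sub)                   # stable under reduction
assert mean_response(full) != mean_response(sub)   # design-dependent
# mean_response(full) = 2 + 0.5*3 = 3.5; mean_response(sub) = 2 + 0.5*1 = 2.5
```

The same distribution over (α, β, σ2) thus induces different values of α + βx̄ on different designs, which is exactly the sense in which such quantities are tied to the design rather than to the unknown distribution alone.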
Now, define a meaningful parameter function as a parameter function which is invariant under the formation of marginal models, that is, when a design is reduced by removal of some of the x's, the parameter associated with the marginal distribution should equal the parameter associated with the distribution in the model for the bigger sample. This reduces the possible parameter functions to the set of functions of (α,β,σ2). A quantity like α + βx̄, where x̄ is the average of the x-values in the design we happened to select, is clearly not a meaningful parameter function in this sense. Moreover, this kind of set-up is exactly what is required if one wants a theory that does not allow for such absurdities as those suggested in "exercises" 1, 2, 3, 4 and 5 of Section 2.1.

REFERENCES

MCCULLAGH, P. (1980). Regression models for ordinal data (with discussion). J. Roy. Statist. Soc. Ser. B 42 109–142.
MCCULLAGH, P. and NELDER, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
NELDER, J. A. and WEDDERBURN, R. W. M. (1972). Generalized linear models. J. Roy. Statist. Soc. Ser. A 135 370–384.
TJUR, T. (1998). Nonlinear regression, quasi likelihood, and overdispersion in generalized linear models. Amer. Statist. 52 222–227.

STATISTICS GROUP
COPENHAGEN BUSINESS SCHOOL
SOLBJERG PLADS 3
DK-2000 FREDERIKSBERG
DENMARK
E-MAIL: [email protected]

REJOINDER

BY PETER MCCULLAGH

University of Chicago

Jourdain: And what are they, these three operations of the mind?
Philosopher: The first, the second and the third. The first to conceive by means of premises; the second to reason by means of categories; and the third to draw conclusions by means of figures.
Molière, Le bourgeois gentilhomme.

Major themes. The discussants cover a wide range of topics. Before responding individually, I will first address a few of the major themes.

Abstraction.
The purpose of abstraction is to incorporate into the mathematics those essential relationships that are taken for granted in the domain of application. Yet that is seldom how it is seen from the outside. In my view, necessity is the mother of abstraction, so there can be no virtue in abstraction for its own sake. It follows that this paper should not have been written had I not judged the need to be clear and the proposed solution reasonably compelling. Unless the information WHAT IS A STATISTICAL MODEL? 1301 is suitably encoded, individual plots, pigs, varieties, treatments and subjects are, in mathematical terms, simply elements in a set. We can, and we must, do better than that. A neighbor relationship on a graph is a possible step in this direction, but this paper pushes that idea a little further by expecting each object (graph) to be embedded in other objects in such a way that relationships are preserved. Many of the discussants are clearly more comfortable with algebraic abstraction than I am, others evidently less so. To those floating in the abstract wing it may appear that my hesitant steps have not gone far enough. To others anchored in the concrete it may appear that I have fallen into an abyss of mindless nonsense. The tension between the concrete and the abstract is seldom comfortable, but I believe it is healthy and central to the art of statistical modelling. One does not have to look far in probability, statistics or physics to see that today’s concrete foundations are yesterday’s abstractions. On the other hand, it must ruefully be admitted that most of yesterday’s abstractions are buried elsewhere. Model versus analysis. Statistical models are usually encountered in connec- tion with the statistical analysis of data, and for that reason many statisticians seem to regard the model and the analysis as inextricably intertwined concepts. 
The situation is quite different in classical applied mathematics, where it is accepted practice to discuss properties of the wave equation or Poisson's equation independently of any data that might (or might not) be collected. While it is difficult to discuss statistical models without also mentioning analysis, it seems to me that the two notions are logically distinct. As I have used the term in this paper, a model is divorced from any particular area of application in much the same way that the heat equation is not necessarily tied to heat transmission. Although I do not recommend the practice, one might talk of the Bradley–Terry model without indicating whether the intended application is to competition experiments or citation studies or transmission disequilibrium in genetics. In that sense, I am interested in what Tjur calls canonical models. An analysis might well involve the comparison of two or more mutually incompatible models by methods that require numerical approximation of integrals using MCMC. Model checking is an essential component of every analysis, and this may lead to the consideration of further models. An analysis of financial transaction data might lead to the conclusion that Brownian motion is not consistent with the data but that an alternative long-tailed model is satisfactory. Thus, in any definition of the term "statistical model," it is essential to bear in mind that models must exist that are not compatible with any data likely to be collected. By comparison with models, statistical analysis is a complicated topic, and I have tried in the paper to say as little on the matter as I could hope to get away with. As the title suggests, the topic of the paper is narrow. So far as I can tell, it has nothing to say about computation, design or model checking, and very little to say about model selection, or Bayesian versus frequentist analysis.
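The Bradley–Terry model makes the point concretely: it can be written down, and even fitted, as a bare family of distributions with no commitment to competitions, citations or genetics. The sketch below is a minimal illustration of my own, not code from the paper; the win matrix is invented, and the fitting step uses the standard minorization–maximization iteration for the likelihood equations.

```python
# Bradley-Terry model: P(item i beats item j) = p[i] / (p[i] + p[j]).
# The model is just this family of distributions; whether "i beats j"
# refers to chess games, paired comparisons or citations is irrelevant.

def fit_bradley_terry(wins, n_iter=200):
    """Fit the worths p by the standard MM iteration,
    where wins[i][j] = number of times item i beat item j."""
    n = len(wins)
    total_wins = [sum(row) for row in wins]
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            # n_ij / (p_i + p_j), summed over opponents j
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins[i] / denom)
        s = sum(new_p)
        p = [x * n / s for x in new_p]  # fix the scale: worths sum to n
    return p

# Invented win matrix among three abstract items.
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
p = fit_bradley_terry(wins)
prob_0_beats_1 = p[0] / (p[0] + p[1])
```

Only the interpretation of the indices ties the fitted worths to an application; the family of distributions, and hence the model, is the same in every case.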
One may be inclined to regard the proposition as a matter of established agricultural fact, so much so that it does not count as an assumption. What makes it true, agriculturally speaking, is that agronomy knows something about plots and varieties that mathematics does not. Agronomy knows that variety v planted in a 2 × 2 plot is equivalent to the same variety planted in each of the individual subplots. On the scale of scientific insights this revelation may achieve a record low, but it is not vacuous. First, it admits the existence of larger and smaller plots, and their relation to one another by subset inclusion. Second, it says something substantive about the relationship between varieties and plots that is true for most agricultural treatments but is not true in general. Remember that mathematics knows nothing about anything except mathematics, so mathematics must be instructed in the facts of rural life. In the formalism of the present paper these facts and relationships are encoded in the category and thereby made available explicitly. It seems that everyone does this instinctively, much like Molière's M. Jourdain who was delighted to learn that he had been speaking prose all his life.

Responses to individual discussants. Many of Julian Besag's remarks are concerned with the more complicated subject of statistical strategy, ranging from model construction and statistical analysis to model adequacy and numerical approximation. I do not dispute the importance or the subtlety of statistical strategy and model construction, but I view it as complementary to the topic of the paper. Even if we cannot agree on a definition, or even the need for one, I welcome Besag's remarks concerning the relevance of models in applications. It is also hard to disagree openly with his preference for models that have a physical justification, or what are sometimes called mechanistic or causal models.
While this preference is understandable, it must be tempered with the knowledge that certain rather successful models, such as Maxwell's theory of electromagnetism, are hard to justify on such grounds. The universality that I demand of a model is only the universality that follows from the category of embeddings, and one can always opt for a smaller category if this seems sensible. One interpretation of definition [equation (1)] is that it states in an unfamiliar language only what is intuitively obvious. Thus, Besag is entirely correct in his assertion that the commutativity conditions amount to nothing more than common sense. But where I demand internal consistency for a range of plot sizes, it seems that Besag attaches little weight to this. Deep down, though, I'll bet he must! Aside from this point, I aim to show below that the extent of our disagreement is considerably less than I had originally thought.

I agree with essentially all of Peter Bickel's remarks, including the plea for expressing categorical ideas in more acceptable language. In the last section of his remarks, Tue Tjur goes some way in this direction, explaining what is meant by a natural parameter in regression. The connection between natural parameters and identifiable parameters is not an obvious one, so I wonder whether it is a lucky coincidence that β/σ in the Box–Cox model is both natural and also identifiable in a larger model.

A bushy category is evidence, if any were needed, of a novice gardener at work, one who has acquired a few sharp tools but not yet mastered the principles of effective pruning. Brøns takes the formal definition, dissects it, extends it, reassembles it and reinterprets it with a dexterity I can admire but not emulate. The comma category StatMod is a very natural and familiar statistical object with the right properties and only those properties.
The new proposal, to define a statistical model as a functor C → (idSet ↓ P) from an arbitrary category C into StatMod, achieves a degree of simplicity and generality through a new level of abstraction. The parameter space and the sample space remain accessible by composition with the natural projections, so nothing has been sacrificed. The pedagogical hurdles remain formidable, but Brøns's construction is right on the mark, and I thank him for it. One difficulty in constructing a general theory of models is to determine where various conditions belong most naturally. In my view, no theory of linear models can be considered satisfactory unless, to each subrepresentation Θ′ ⊂ Θ, there corresponds a quotient representation Θ/Θ′ in a natural way. This is not a consequence of the extended definition, and is the reason for the additional surjectivity condition that I had included. If this is unnecessary in the general definition, I suspect it must be a part of the definition of linear models.

Peter Huber is right to point out that, in some circumstances at least, what appears to be absurd might well be real. After all, most of quantum mechanics appears absurd and yet we must suppose it to be real. At some stage, one must ask where the category comes from, and the answer here must depend on the application in ways that are not easy to describe. A very sparse category containing many objects but few morphisms is not very useful, but the inclusion of too many morphisms may lead to the exclusion of otherwise useful formulations. I was fascinated to read of Huber's earlier categorization of aspects of statistical theory, which seems to be connected with morphisms on the sample side.

Inge Helland brings a fresh quantum-mechanical perspective to the subject, and particularly to the matter of natural subparameters.
I must confess that I was initially skeptical of the restriction to natural subparameters when I first encountered this in Helland (1999a), and I took an opposing view at the time, but I have to agree now that the arguments are convincing. The arguments in Fraser (1968b) are essentially the same. It is instructive to note that the objections raised in the paper to the correlation coefficient in regression do not apply to autocorrelation coefficients in time series or spatial processes.

Kalman has much to say about models and statistical practice, all negative. I'm sure glad he's not pinning the blame entirely on me! By the statement that there does not exist a concrete i.i.d. process in nature, I presume that he means that this mathematical model is not a satisfactory description of observed processes. If so, we need something better, most likely a family of non-i.i.d. processes. History offers ample support for Kalman's assertion that scientists demand a physical or mechanical explanation, in our case an explanation for the origin of randomness in nature. But this demand does not persuade me that a mechanical explanation should necessarily be forthcoming. The existence of the ether was widely accepted in the nineteenth century, not for its beauty as Kalman claims, but for its supposed mechanical properties which were deemed necessary for the transmission of light. In 1839, the currently accepted mathematical model for the transmission of light was put forward by James MacCullagh (no relation), but rejected by physicists of the day because no mechanical explanation could be provided. Although the date is given as 1842 and the name is misspelled, other details provided by Feynman, Leighton and Sands [(1964) 2, Section 1–5] are reliable. Two quotes from the 1839 paper suffice to illustrate my point.
"Concerning the peculiar constitution of the ether, we know nothing and shall assume nothing except what is involved in the foregoing assumptions" (symmetry!). And later, by way of summary, "the reasoning which has been used to account for the [potential] function is indirect, and cannot be regarded as sufficient in a mechanical point of view. It is, however, the only kind of reasoning that we are able to employ, as the constitution of the luminiferous ether is entirely unknown." Indeed it was, and remained so for much of the rest of the century, even after the same equations were resurrected as one leg of Maxwell's theory in 1864. Incidentally, the 1839 paper may be the first to employ the combination of partial derivatives now known as the curl operator, and to show that it transforms as a vector under coordinate rotation in R³.

Steve Pincus points out that the inferential formalism emphasizes processes and families of processes rather than families of distributions. This is true, but to some extent a matter of presentation. There is no objection in principle to considering a truncated category containing a finite number of objects, such as the category of injective maps on sets of size 12 and smaller. A process that is exchangeable relative to this category is not necessarily extendable to an infinitely exchangeable process, so inferences extending beyond 12 elements are inaccessible. The absence of a metric may look odd, but we must bear in mind that each sample space object must be a measure space with an attached σ-algebra, and possibly additional topological structure that I would rather not think about. One generic example is the category C in which the objects are finite-dimensional inner product spaces and the morphisms are injective linear maps ϕ : X → X′ such that 〈ϕx, ϕy〉 = 〈x, y〉.
If S is the opposite, or dual, functor C → C, the objects are implicitly assumed to be equipped with the usual Borel σ-algebra, and each ϕ is sent to the conjugate linear transformation ϕ∗ : X′ → X. These maps also preserve inner products, but in a different sense, such that the induced map (ker ϕ∗)⊥ ≅ X′/ker ϕ∗ → X is an isomorphism of inner product spaces. A process relative to C is necessarily orthogonally invariant, or spherically symmetric, and is characterized by scale mixtures of normals [Kingman (1972)].

Ignoring variety effects, the Besag–Higdon model is a sum of white noise and a Markov random field on the lattice. Besag's remark that the variogram exhibits logarithmic behavior suggests that the fitted MRF is close to Wσ for some σ. This is certainly plausible since Wσ is Markov and can be well approximated by a low-order lattice scheme, in which the MRF coefficients in the interior are an approximation to the Laplacian. If this speculation is confirmed in applications, the fitted model is approximately a sum of white noise and Wσ, which is in fact a conformal process. In principle, the relation between the MRF coefficients and σ (in Wσ) can be determined. Thus, if the fitted MRF is close to Wσ, or even to a linear deformation of Wσ, the fitted Besag–Higdon process has a natural extension beyond the lattice to plots of arbitrary size and shape, in which case my criticism on these grounds does not apply. At the time I wrote Section 8, I was aware of certain peculiarities of the de Wijs process connected with its interpretation as the Newtonian potential, or inverse of the Laplace operator. But I did not fully appreciate the Markov property, and I was unaware of its conformal invariance. So this revelation comes as a pleasant surprise, and helps me to understand better the connection with MRF lattice schemes in the analysis of field trials.
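The claim that a lattice scheme whose interior coefficients approximate the Laplacian has a logarithmically growing variogram is easy to check numerically. The sketch below is my own construction, not the Besag–Higdon fitted scheme: an intrinsic Gaussian field on a torus with the discrete Laplacian as precision operator, whose exact variogram is computed by FFT and regressed on log distance.

```python
import numpy as np

# Intrinsic Gaussian field on an n x n torus whose precision operator is
# the discrete Laplacian: spectral density 1/lambda(k), where
#   lambda(k) = 4 - 2 cos(2 pi k1/n) - 2 cos(2 pi k2/n).
# Its variogram gamma(h) = C(0) - C(h) should grow like log|h|, the
# logarithmic signature associated with the de Wijs process W_sigma.
n = 256
k = 2 * np.pi * np.fft.fftfreq(n)
lam = 4 - 2 * np.cos(k)[:, None] - 2 * np.cos(k)[None, :]
spec = np.zeros_like(lam)
mask = lam > 1e-12
spec[mask] = 1.0 / lam[mask]      # drop the zero mode (intrinsic field)

cov = np.fft.ifft2(spec).real     # covariances C(h) on the torus
gamma = cov[0, 0] - cov           # variogram gamma(h)

# Fit gamma((d, 0)) ~ a + b log d for lags well inside the torus.
lags = np.arange(2, 17)
b, a = np.polyfit(np.log(lags), gamma[0, lags], 1)
```

Under this construction the fitted slope b comes out near 1/(2π), the coefficient suggested by the asymptotics of the two-dimensional lattice Green's function, so the variogram is indeed close to linear in log distance over these lags.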
With twenty years of first-hand experience in farming I thought I understood the meaning of fertility, but I knew of no plausible argument to suggest that fertility processes should be spatially Markov, and I failed to see why this proposition should be so readily accepted. A wide range of arguments suggested that fertility might be a compound process or sum of elementary processes, indicating that a model, as a set of processes, should be closed under convolution. From the present viewpoint of conformal transformation, this mechanistic reasoning is replaced by an argument of an entirely different sort in which the entire concept of fertility is absent. The set of conformal Gaussian processes is a convex cone of dimension two generated by white noise and Wσ, in essence the random effects model that Besag suggests. Although the Besag–Higdon scheme is not closed under convolution, all of the conformal processes are already included at least as lattice approximations. It would be good to have this important subset explicitly labelled. This analysis suggests that two very different lines of argument may, in practice, lead to very similar conclusions.

The application of conformal transformations to agricultural processes in muddy fields may seem frivolous to the point of absurdity, but that is not how I see it. The driving thesis is that, whatever the generating process, it should look very much the same after any transformation that preserves the relevant aspects of the geometry. Conformal maps preserve Euclidean geometry locally with no angular distortion, which is the only argument for their relevance. However, Euclidean geometry is only the visible side of a corn field. Specific nonisotropic effects connected with topography, ploughing, harvesting and drainage are inevitable and may even dominate the nonspecific variations. To the extent possible, these are included in the category, and are carried along with the conformal maps.
Even with this adjustment, we should not delude ourselves by pretending that we have learned anything about the laws of biology or agricultural science by studying the axioms of algebra. To learn the laws of agriculture, we must study agricultural data, but it undoubtedly helps to do so in the light of potential models. The observed logarithmic behavior of variograms suggests that some field processes have a strong Wσ component, consistent with conformal processes, but we should not be surprised to find additional components that are not conformal.

Acknowledgments. I want to thank John Marden for arranging this discussion paper and also the discussants for their remarks. My colleagues Michael Stein, Steve Lalley, Norman Lebovitz, Steve Stigler and Sydney Webster have given advice freely on a range of points. But I am especially indebted to Julian Besag for his generous help and forthright remarks, which have helped to clarify my thinking on spatial processes and Markov random fields.

REFERENCES

CHILÈS, J.-P. and DELFINER, P. (1999). Geostatistics. Wiley, New York.
FEYNMAN, R. P., LEIGHTON, R. B. and SANDS, M. (1964). The Feynman Lectures on Physics. Addison-Wesley, Reading, MA.
FRASER, D. A. S. (1968b). The Structure of Inference. Wiley, New York.
HELLAND, I. S. (1999a). Quantum mechanics from symmetry and statistical modelling. Internat. J. Theoret. Phys. 38 1851–1881.
KINGMAN, J. F. C. (1972). On random sequences with spherical symmetry. Biometrika 59 492–494.
MACCULLAGH, J. (1839). An essay towards the dynamical theory of crystalline reflexion and refraction. Trans. Roy. Irish Academy 21 17–50.
MATHERON, G. (1971). The theory of regionalized variables and its applications. Cahiers du Centre de Morphologie Mathématique de Fontainebleau 5.
WHITTLE, P. (1954). On stationary processes in the plane. Biometrika 41 434–449.
WICHURA, M. (2001). Some de Finetti type theorems. Preprint.
DEPARTMENT OF STATISTICS UNIVERSITY OF CHICAGO 5734 UNIVERSITY AVENUE CHICAGO, ILLINOIS 60637-1514 E-MAIL: [email protected]