






































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
An overview of Generalized Linear Models (GLMs), a statistical modeling approach that extends traditional regression models. GLMs consist of three components: the linear predictor, the link function, and the error distribution. the structure of GLMs, their application to count data, and diagnostics for model assessment. GLMs have become essential tools in statistical data analysis due to their flexibility and wide range of applications.
Typology: Exams
1 / 46
This page cannot be seen from the preview
Don't miss anything!







































ue originally to Nelder and Wedderburn (1972), generalized linear models are a remarkable synthesis and extension of familiar regression models such as the linear models described in Part II of this text and the logit and probit models described in the preceding chapter. The current chapter begins with a consideration of the general structure and range of application of generalized linear models; proceeds to examine in greater detail generalized linear models for count data, including contingency tables; briefly sketches the statistical theory underlying generalized linear models; and concludes with the extension of regression diagnostics to generalized linear models. The unstarred sections of this chapter are perhaps more difficult than the unstarred material in preceding chapters. Generalized linear models have become so central to effective statistical data analysis, however, that it is worth the additional effort required to acquire a basic understanding of the subject.
A generalized linear model (or GLM 1 ) consists of three components:
(^1) Some authors use the acronym “GLM” to refer to the “ general linear model”—that is, the linear regression model with normal errors described in Part II of the text—and instead employ “GLIM” to denote generalized linear models (which is also the name of a computer program used to fit GLMs).
379
380 Chapter 15. Generalized Linear Models
Table 15.1 Some Common Link Functions and Their Inverses
Link η i = g(μ i ) μ i = g−^1 (η i )
Identity μ i η i Log log e μ i e η i Inverse μ− i^1 η− i^1 Inverse-square μ− i^2 η i −^1 /^2 Square-root √μ i η^2 i Logit log e μ i 1 − μ i
1 1 + e −η i Probit −^1 (μ i ) (η i ) Log-log −log e [−log e (μ i )] exp[−exp(−η i )] Complementary log-log log e [−log e ( 1 −μ i )] 1 −exp[−exp (η i )]
NOTE: μ i is the expected value of the response; η i is the linear predictor; and (·) is the cumulative distribution function of the standard-normal distribution.
Because the link function is invertible, we can also write μi = g−^1 (η (^) i ) = g−^1 (α + β 1 X (^) i 1 + β 2 X (^) i 2 + · · · + β (^) k X (^) ik ) and, thus, the GLM may be thought of as a linear model for a transformation of the expected response or as a nonlinear regression model for the response. The inverse link g−^1 (·) is also called the mean function. Commonly employed link functions and their inverses are shown in Table 15.1. Note that the identity link simply returns its argument unaltered, ηi = g(μ (^) i ) = μi , and thus μ (^) i = g−^1 (η (^) i ) = ηi. The last four link functions in Table 15.1 are for binomial data, where Y (^) i represents the observed proportion of “successes” in ni independent binary trials; thus, Y (^) i can take on any of the values 0, 1 /ni , 2 /ni ,... , (ni − 1 )/n (^) i , 1. Recall from Chapter 15 that binomial data also encompass binary data, where all the observations represent ni = 1 trial, and consequently Yi is either 0 or 1. The expectation of the response μ (^) i = E(Y (^) i ) is then the probability of success, which we symbolized by πi in the previous chapter. The logit, probit, log-log, and complementary log-log links are graphed in Figure 15.1. In contrast to the logit and probit links (which, as we noted previously, are nearly indistinguishable once the variances of the underlying normal and logistic distributions are equated), the log-log and complementary log-log links approach the asymptotes of 0 and 1 asymmetrically. 2 Beyond the general desire to select a link function that renders the regression of Y on the Xs linear, a promising link will remove restrictions on the range of the expected response. This is a familiar idea from the logit and probit models discussed in Chapter 14, where the object was to model the probability of “success,” represented by μ (^) i in our current general notation. As a probability, μi is confined to the unit interval [0,1]. The logit and probit links map this interval to the entire real line, from −∞ to +∞. Similarly, if the response Y is a count, taking on only non-negative integer values, 0, 1, 2,... , and consequently μ (^) i is an expected count, which (though not necessarily an integer) is also non-negative, the log link maps μi to the whole real line. This is not to say that the choice of link function is entirely determined by the range of the response variable.
(^2) Because the log-log link can be obtained from the complementary log-log link by exchanging the definitions of “success” and “failure,” it is common for statistical software to provide only one of the two—typically, the complementary log-log link.
382 Chapter 15. Generalized Linear Models
Table 15.2 Canonical Link, Response Range, and Conditional Variance Function for Exponential Families
Family Canonical Link Range of Yi V(Yi |η i )
Gaussian Identity (−∞, +∞) φ Binomial Logit 0,1,..., ni ni
μ i ( 1 − μ i ) ni Poisson Log 0,1,2,... μ i Gamma Inverse (0,∞) φμ^2 i Inverse-Gaussian Inverse-square (0,∞) φμ^3 i
NOTE: φ is the dispersion parameter, η i is the linear predictor, and μ i is the expectation of Y (^) i (the response). In the binomial family, n i is the number of trials.
simplifies the GLM, 3 but other link functions may be used as well. Indeed, one of the strengths of the GLM paradigm—in contrast to transformations of the response variable in linear regression— is that the choice of linearizing transformation is partly separated from the distribution of the response, and the same transformation does not have to both normalize the distribution of Y and make its regression on the Xs linear. 4 The specific links that may be used vary from one family to another and also—to a certain extent—from one software implementation of GLMs to another. For example, it would not be promising to use the identity, log, inverse, inverse-square, or square-root links with binomial data, nor would it be sensible to use the logit, probit, log-log, or complementary log-log link with nonbinomial data. I assume that the reader is generally familiar with the Gaussian and binomial families and simply give their distributions here for reference. The Poisson, gamma, and inverse-Gaussian distributions are perhaps less familiar, and so I provide some more detail: 5
p(y) =
σ
2 π
exp
(y − μ)^2 2 σ 2
p(y) =
n ny
μny^ ( 1 − μ) n(^1 −y)^ (15.2)
(^3) This point is pursued in Section 15.3. (^4) There is also this more subtle difference: When we transform Y and regress the transformed response on the Xs, we are modeling the expectation of the transformed response,
E[g(Y (^) i )] = α + β 1 x (^) i 1 + β 2 xi 2 + · · · + β (^) k xik
In a GLM, in contrast, we model the transformed expectation of the response,
g[E(Y (^) i )] = α + β 1 x (^) i 1 + β 2 x (^) i 2 + · · · + β (^) k xik
While similar in spirit, this is not quite the same thing when (as is true except for the identity link) the link function g(·) is nonlinear. (^5) The various distributions used in this chapter are described in a general context in Appendix D on probability and estimation.
15.1. The Structure of Generalized Linear Models 383
Here, ny is the observed number of successes in the n trials, and n( 1 − y) is the number of failures; and ( n ny
n! (ny)![n( 1 − y)]! is the binomial coefficient.
p(y) = μ y^ ×
e−μ y!
for y = 0 , 1 , 2 ,...
The expectation and variance of a Poisson random variable are both equal to μ. Poisson distributions for several values of the parameter μ are graphed in Figure 15.2. As we will see in Section 15.2, the Poisson distribution is useful for modeling count data. As μ increases, the Poisson distribution grows more symmetric and is eventually well approximated by a normal distribution.
p(y) =
( (^) y ω
)ψ− 1 ×
exp
−y ω
ω (ψ)
for y > 0 (15.3)
where (·) is the gamma function.^6 The expectation and variance of the gamma distri- bution are, respectively, E(Y ) = ωψ and V (Y ) = ω^2 ψ. In the context of a generalized linear model, where, for the gamma family, V (Y ) = φμ^2 (recall Table 15.2 on page 382), the dispersion parameter is simply the inverse of the shape parameter, φ = 1 /ψ. As the names of the parameters suggest, the scale parameter in the gamma family influences the spread (and, incidentally, the location) but not the shape of the distribution, while the shape parameter controls the skewness of the distribution. Figure 15.3 shows gamma distributions for scale ω = 1 and several values of the shape parameter ψ. (Altering the scale param- eter would change only the labelling of the horizontal axis in the graph.) As the shape parameter gets larger, the distribution grows more symmetric. The gamma distribution is useful for modeling a positive continuous response variable, where the conditional variance of the response grows with its mean but where the coefficient of variation of the response, SD(Y )/μ, is constant.
p(y) =
λ 2 πy^3
exp
λ(y − μ)^2 2 yμ^2
for y > 0
The expectation and variance of Y are E(Y ) = μ and V (Y ) = μ^3 /λ. In the context of a GLM, where, for the inverse-Gaussian family, V (Y ) = φμ^3 (as recorded in Table 15.
(^6) * The gamma function is defined as
(x) =
∫ (^) ∞ 0
e−z^ z x−^1 dz
and may be thought of as a continuous generalization of the factorial function in that when x is a non-negative integer, x! = (x + 1 ).
15.1. The Structure of Generalized Linear Models 385
0 2 4 6 8 10 y
p(y)
ψ = 2 ψ = 5
ψ = 1
1.5 ψ = 0.
Figure 15.3 Several gamma distributions for scale ω =1 and various values of the shape parameter ψ.
on page 382), λ is the inverse of the dispersion parameter φ. Like the gamma distribution, therefore, the variance of the inverse-Gaussian distribution increases with its mean, but at a more rapid rate. Skewness also increases with the value of μ and decreases with λ. Figure 15.4 shows several inverse-Gaussian distributions.
A convenient property of distributions in the exponential families is that the conditional variance of Yi is a function of its mean μ (^) i and, possibly, a dispersion parameter φ. In addi- tion to the familiar Gaussian and binomial families (the latter for proportions), the Poisson family is useful for modeling count data, and the gamma and inverse-Gaussian families for modeling positive continuous data, where the conditional variance of Y increases with its expectation.
GLMs are fit to data by the method of maximum likelihood, providing not only estimates of the regression coefficients but also estimated asymptotic (i.e., large-sample) standard errors of
the coefficients. 7 To test the null hypothesis H 0 : βj = β (^) j(^0 )we can compute the Wald statistic
Z 0 =
B (^) j − β (^) j(^0 )
/SE(B (^) j ), where SE(B (^) j ) is the asymptotic standard error of the estimated
coefficient Bj. Under the null hypothesis, Z 0 follows a standard normal distribution. 8 As explained, some of the exponential families on which GLMs are based include an unknown dispersion parameter φ. Although this parameter can, in principle, be estimated by maximum likelihood as well, it is more common to use a “method of moments” estimator, which I will denote ˜φ. 9
(^7) Details are provided in Section 15.3.2. The method of maximum likelihood is introduced in Appendix D on probability and estimation. (^8) Wald tests and F -tests of more general linear hypotheses are described in Section 15.3.3. (^9) Again, see Section 15.3.2.
386 Chapter 15. Generalized Linear Models
0 1 2 3 4 5 y
p(y)
λ = 1, μ = 1 λ = 2, μ = 1 λ = 1, μ = 5 λ = 2, μ = 5
Figure 15.4 Inverse-Gaussian distributions for several combinations of values of the mean μ and inverse-dispersion λ.
As is familiar from the preceding chapter on logit and probit models, the ANOVA for linear models has a close analog in the analysis of deviance for GLMs. In the current more general context, the residual deviance for a GLM is
D (^) m ≡ 2 (loge L (^) s − loge L (^) m )
where Lm is the maximized likelihood under the model in question and L (^) s is the maximized likelihood under a saturated model , which dedicates one parameter to each observation and consequently fits the data as closely as possible. The residual deviance is analogous to (and, indeed, is a generalization of) the residual sum of squares for a linear model. In GLMs for which the dispersion parameter is fixed to 1 (i.e., binomial and Poisson GLMs), the likelihood-ratio test statistic is simply the difference in the residual deviances for nested models. Suppose that Model 0, with k 0 + 1 coefficients, is nested within Model 1, with k 1 + 1 coefficients (where, then, k 0 < k 1 ); most commonly, Model 0 would simply omit some of the regressors in model 1. We test the null hypothesis that the restrictions on Model 1 represented by Model 0 are correct by computing the likelihood-ratio test statistic
G^20 = D 0 − D 1
Under the hypothesis, G^20 is asymptotically distributed as chi-square with k 1 − k 0 degrees of freedom. Likelihood-ratio tests can be turned around to provide confidence intervals for coefficients; as mentioned in Section 14.1.4 in connection with logit and probit models, tests and intervals based on the likelihood-ratio statistic tend to be more reliable than those based on the Wald statistic. For example, the 95% confidence interval for βj includes all values β′ j for which the
hypothesis H 0 : βj = β′ j is acceptable at the .05 level—that is, all values of β j′ for which
2 (loge L 1 − loge L 0 ) ≤ χ.^205 , 1 = 3 .84, where loge L 1 is the maximized log likelihood for the full model, and loge L 0 is the maximized log likelihood for a model in which β (^) j is constrained to the value β j′. This procedure is computationally intensive because it required “profiling” the
likelihood—refitting the model for various fixed values β j′ of β (^) j.
388 Chapter 15. Generalized Linear Models
0 20 40 60 80 100 Number of Interlocks
Frequency
25
20
15
10
5
0
Figure 15.5 The distribution of number of interlocks among 248 dominant Canadian corporations.
that a firm maintained by virtue of its board members and top executives also serving as board members or executives of other firms in the data set. Ornstein was interested in the regression of number of interlocks on other characteristics of the firms—specifically, on their assets (mea- sured in billions of dollars), nation of control (Canada, the United States, the United Kingdom, or another country), and the principal sector of operation of the firm (10 categories, including banking, other financial institutions, heavy manufacturing, etc.). Examining the distribution of number of interlocks (Figure 15.5) reveals that the variable is highly positively skewed, and that there are many zero counts. Although the conditional distribution of interlocks given the explanatory variables could differ from its marginal dis- tribution, the extent to which the marginal distribution of interlocks departs from symme- try bodes ill for least-squares regression. Moreover, no transformation will spread out the zeroes. 12 The results of the Poisson regression of number of interlocks on assets, nation of control, and sector are summarized in Table 15.3. I set the United States as the baseline category for nation of control, and Construction as the baseline category for sector—these are the categories with the smallest fitted numbers of interlocks controlling for the other variables in the regression, and the dummy-regressor coefficients are therefore all positive. The residual deviance for this model is D(Assets, Nation, Sector) = 1887 .402 on n − k − 1 = 248 − 13 − 1 = 234 degrees of freedom. Deleting each explanatory variable in turn from the model produces the following residual deviances and degrees of freedom:
Explanatory Variables Residual Deviance df
Nation, Sector 2278.298 235 Assets, Sector 2216.345 237 Assets, Nation 2248.861 243
(^12) Ornstein (1976) in fact performed a linear least-squares regression for these data, though one with a slightly different specification from that given here. He cannot be faulted for having done so, however, inasmuch as Poisson regression models—and, with the exception of loglinear models for contingency tables, other specialized models for counts—were not typically in sociologists’ statistical toolkit at the time.
15.2. Generalized Linear Models for Counts 389
Table 15.3 Estimated Coefficients for the Poisson Regression of Number of Interlocks on Assets, Nation of Control, and Sector, for Ornstein’s Canadian Interlocking-Directorate Data
Coefficient Estimate Standard Error
Constant 0.8791 0. Assets 0.02085 0. Nation of Control (baseline: United States) Canada 0.8259 0. Other 0.6627 0. United Kingdom 0.2488 0. Sector (Baseline: Construction) Wood and paper 1.331 0. Transport 1.297 0. Other financial 1.297 0. Mining, metals 1.241 0. Holding companies 0.8280 0. Merchandising 0.7973 0. Heavy manufacturing 0.6722 0. Agriculture, food, light industry 0.6196 0. Banking 0.2104 0.
Taking differences between these deviances and the residual deviance for the full model yields the following analysis-of-deviance table:
Source G^20 df p
Assets 390.90 1 . Nation 328.94 3 . Sector 361.46 9 .
All the terms in the model are therefore highly statistically significant. Because the model uses the log link, we can interpret the exponentiated coefficients (i.e., the e Bj^ ) as multiplicative effects on the expected number of interlocks. Thus, for example, holding nation of control and sector constant, increasing assets by 1 billion dollars (the unit of the assets variable) multiplies the estimated expected number of interlocks by e^0.^02085 = 1 .021—that is, an increase of just over 2%. Similarly, the estimated expected number of interlocks is e^0.^8259 = 2 .283 times as high in a Canadian-controlled firm as in a comparable U.S.-controlled firm. As mentioned, the residual deviance for the full model fit to Ornstein’s data is D 1 = 1887 .402; the deviance for a model fitting only the constant (i.e., the null deviance) is D 0 = 3737 .010. Consequently, R^2 = 1 − 1887. 402 / 3737. 010 = .495, revealing that the model accounts for nearly half the deviance in number of interlocks. The Poisson-regression model is a nonlinear model for the expected response, and I therefore find it generally simpler to interpret the model graphically using effect displays than to examine the estimated coefficients directly. The principles of construction of effect displays for GLMs are essentially the same as for linear models and for logit and probit models: 13 We usually construct one display for each high-order term in the model, allowing the explanatory variables in that
(^13) See Section 15.3.4 for details.
15.2. Generalized Linear Models for Counts 391
is highly skewed to the right. Skewness produces some high-leverage observations and suggests the possibility of a nonlinear effect for assets, points that I pursue later in the chapter. 14
The residual deviance for the Poisson regression model fit to the interlocking-directorate data, D = 1887 .4, is much larger than the 234 residual degrees of freedom for the model. If the Poisson model fits the data reasonably, we would expect the residual deviance to be roughly equal to the residual degrees of freedom. 15 That the residual deviance is so large suggests that the conditional variation of the expected number of interlocks exceeds the variation of a Poisson-distributed variable, for which the variance equals the mean. This common occurrence in the analysis of count data is termed overdispersion.^16 Indeed, overdispersion is so common in regression models for count data, and its consequences are potentially so severe, that models such as the quasi- Poisson and negative-binomial GLMs discussed in this section should be employed as a matter of course.
A simple remedy for overdispersed count data is to introduce a dispersion parameter into the Poisson model, so that the conditional variance of the response is now V (Y (^) i |η (^) i ) = φμi. If φ > 1, therefore, the conditional variance of Y increases more rapidly than its mean. There is no expo- nential family corresponding to this specification, and the resulting GLM does not imply a specific probability distribution for the response variable. Rather, the model specifies the conditional mean and variance of Yi directly. Because the model does not give a probability distribution for Y (^) i , it cannot be estimated by maximum likelihood. Nevertheless, the usual procedure for maximum- likelihood estimation of a GLM yields the so-called quasi-likelihood estimators of the regression coefficients, which share many of the properties of maximum-likelihood estimators. 17 As it turns out, the quasi-likelihood estimates of the regression coefficients are identical to the ML estimates for the Poisson model. The estimated coefficient standard errors differ, however: If ˜φ is the estimated dispersion for the model, then the coefficient standard errors for the quasi- Poisson model are ˜φ^1 /^2 times those for the Poisson model. In the event of overdispersion, therefore, where ˜φ > 1, the effect of introducing a dispersion parameter and obtaining quasi-likelihood esti- mates is (realistically) to inflate the coefficient standard errors. Likewise, F -tests for terms in the model will reflect the estimated dispersion parameter, producing smaller test statistics and larger p-values. As explained in the following section, we use a method-of-moments estimator for the dispersion parameter. In the quasi-Poisson model, the dispersion estimator takes the form
˜φ = 1 n − k − 1
∑ (^) (Y (^) i − ̂μ (^) i )^2 ̂ μ (^) i
(^14) See Section 15.4 on diagnostics for GLMs. (^15) That is, the ratio of the residual deviance to degrees of freedom can be taken as an estimate of the dispersion parameter φ, which, in a Poisson model, is fixed to 1. It should be noted, however, that this deviance-based estimator of the dispersion can perform poorly. A generally preferable “method of moments” estimator is given in Section 15.3. (^16) Although it is much less common, it is also possible for count data to be underdispersed —that is, for the conditional variation of the response to be less than the mean. The remedy for underdispsered count data is the same as for overdispersed data; for example, we can fit a quasi-Poisson model with a dispersion parameter, as described immediately below. (^17) See Section 15.3.2.
392 Chapter 15. Generalized Linear Models
where ̂μ (^) i = g−^1 (̂η (^) i ) is the fitted expectation of Yi. Applied to Ornstein’s interlocking-directorate regression, for example, we get ˜φ = 7 .9435, and, therefore, the standard errors of the regression coefficients for the Poisson model in Table 15.3 are each multiplied by
I note in passing that there is a similar quasi-binomial model for over-dispersed proportions, replacing the fixed dispersion parameter of 1 in the binomial distribution with a dispersion param- eter φ to be estimated from the data. Overdispersed binomial data can arise, for example, when different individuals who share the same values of the explanatory variables nevertheless differ in their probability μ of success, a situation that is termed unmodelled heterogeneity. Similarly, overdispersion can occur when binomial observations are not independent, as required by the binomial distribution—for example, when each binomial observation is for related individuals, such as members of a family.
There are several routes to models for counts based on the negative-binomial distribution (see, e.g., Long, 1997, sect. 8.3; McCullagh & Nelder, 1989, sect. 6.2.3). One approach (following McCullagh & Nelder, 1989, p. 233) is to adopt a Poisson model for the count Y (^) i but to suppose that the expected count μ∗ i is itself an unobservable random variable that is gamma-distributed with mean μi and constant scale parameter ω (implying that the the gamma shape parameter is ψ (^) i = μi /ω 18 ). Then the observed count Y (^) i follows a negative-binomial distribution , 19
p(y (^) i ) =
(yi + ω) y! (ω)
μyi iω ω (μ (^) i + ω) μi^ +ω^
with expected value E (Yi ) = μi and variance V (Y (^) i ) = μi + μ^2 i /ω. Unless the parameter ω is large, therefore, the variance of Y increases more rapidly with the mean than the variance of a Poisson variable. Making the expected value of Y (^) i a random variable incorporates additional variation among observed counts for observations that share the same values of the explanatory variables and consequently have the same linear predictor η (^) i. With the gamma scale parameter ω fixed to a known value, the negative-binomial distribution is an exponential family (in the sense of Equation 15.15 in Section 15.3.1), and a GLM based on this distribution can be fit by iterated weighted least squares (as developed in the next section). If instead—and is typically the case—the value of ω is unknown, and must therefore be estimated from the data, standard methods for GLMs based on exponential families do not apply. We can, however, obtain estimates of both the regression coefficients and ω by the method of maximum likelihood. Applied to Ornstein’s interlocking-directorate regression, and using the log link, the negative-binomial GLM produces results very similar to those of the quasi-Poisson model (as the reader may wish to verify). The estimated scale parameter for the negative-binomial model is ̂ω = 1 .312, with standard error SE(̂ω) = 0 .143; we have, therefore, strong evidence that the conditional variance of the number of interlocks increases more rapidly than its expected value. 20
A particular kind of overdispersion obtains when there are more zeroes in the data than is consistent with a Poisson (or negative-binomial) distribution, a situation that can arise when only certain members of the population are “at risk” of a nonzero count. Imagine, for example, that
(^18) See Equation 15.3 on page 383. (^19) A simpler form of the negative-binomial distribution is given in Appendix D on probability and estimation. (^20) See Exercise 15.1 for a test of overdispersion based on the negative-binomial GLM.
394 Chapter 15. Generalized Linear Models
∗Estimation of the ZIP model would be simple if we knew to which latent class each observation
belongs, but, as I have pointed out, that is not true. Instead, we must maximize the somewhat more complex combined log likelihood for the two components of the ZIP model: 24
loge L(β, γ) =
yi = 0
loge
exp
z ′ i γ
− exp( x ′ i β)
yi > 0
y (^) i x ′ i β − exp( x ′ i β)
∑^ n
i= 1
loge
1 + exp( z ′ i γ)
yi > 0
loge (y (^) i !)
where z ′ i ≡ [1, z (^) i 1 ,... , zip], x ′ i ≡ [1, x (^) i 1 ,... , xik ], γ ≡ [γ 0 , γ 1 ,... , γp]′, and β ≡ [α, β 1 ,... , βk ]′.
The basic GLM for count data is the Poisson model with log link. Frequently, however, when the response variable is a count, its conditional variance increases more rapidly than its mean, producing a condition termed overdispersion, and invalidating the use of the Poisson distribution. The quasi-Poisson GLM adds a dispersion parameter to handle overdispersed count data; this model can be estimated by the method of quasi-likelihood. A similar model is based on the negative-binomial distribution, which is not an exponential family. Negative-binomial GLMs can nevertheless be estimated by maximum likelihood. The zero-inflated Poisson regression model may be appropriate when there are more zeroes in the data than is consistent with a Poisson distribution.
The joint distribution of several categorical variables defines a contingency table. As discussed in the preceding chapter, 25 if one of the variables in a contingency table is treated as the response variable, we can fit a logit or probit model (that is, for a dichotomous response, a binomial GLM) to the table. Loglinear models , in contrast, which are models for the associations among the variables in a contingency table, treat the variables symmetrically—they do not distinguish one variable as the response. There is, however, a relationship between loglinear models and logit models that I will develop later in this section. As we will see as well, loglinear models have the formal structure of two-way and higher-way ANOVA models 26 and can be fit to data by Poisson regression. Loglinear models for contingency tables have many specialized applications in the social sciences—for example to “square” tables, such as mobility tables, where the variables in the table have the same categories. The treatment of loglinear models in this section merely scratches the surface.^27
(^24) See Exercise 15.2. (^25) See Section 14.3. (^26) See Sections 8.2 and 8.3. (^27) More extensive accounts are available in many sources, including Agresti (2002), Fienberg (1980), and Powers and Xie (2000).
15.2. Generalized Linear Models for Counts 395
Table 15.4 Voter Turnout by Intensity of Partisan Preference, for the 1956 U.S. Presidential Election
Voter Turnout
Intensity of Preference Voted Did Not Vote Total
Weak 305 126 431 Medium 405 125 530 Strong 265 49 314 Total 975 300 1275
Table 15.5 General Two-Way Frequency Table
Variable C
Variable R 1 2 · · · c Total
1 Y 11 Y 12 · · · Y 1 c Y 1+ 2 Y 21 Y 22 · · · Y 2 c Y 2+ .. .
.. .
.. .
.. .
.. . r Yr 1 Yr 2 · · · Y (^) rc Yr+ Total Y +1 Y +2 · · · Y + c n
I will examine contingency tables for two variables in some detail, for this is the simplest case, and the key results that I establish here extend straightforwardly to tables of higher dimension. Consider the illustrative two-way table shown in Table 15.4, constructed from data reported in the American Voter (Campbell, Converse, Miller, & Stokes, 1960), introduced in the previous chapter. 28 The table relates intensity of partisan preference to voting turnout in the 1956 U.S. presidential election. To anticipate my analysis, the data indicate that voting turnout is positively associated with intensity of partisan preference. More generally, two categorical variables with r and c categories, respectively, define an r × c contingency table, as shown in Table 15.5, where Y (^) ij is the observed frequency count in the i, j th cell of the table. I use a “+” to represent summation over a subscript; thus Yi+ ≡
∑c j = 1 Y^ ij^ is the marginal frequency in the ith row; Y+j ≡
∑r i= 1 Y^ ij^ is the marginal frequency in the^ j^ th column; and n = Y++ ≡
∑r i= 1
∑c j = 1 Y^ ij^ is the number of observations in the sample. I assume that the n observations in Table 15.5 are independently sampled from a population with proportion π (^) ij in cell i, j , and therefore that the probability of sampling an individual observation in this cell is π (^) ij. Marginal probability distributions π (^) i+ and π+j may be defined as above; note that π++ = 1. If the row and column variables are statistically independent in the population, then the joint probability π (^) ij is the product of the marginal probabilities for all i and j : π (^) ij = πi+π+j. Because the observed frequencies Yij result from drawing a random sample, they are random variables that generally take on different values in different samples. The expected frequency in
(^28) Table 14.9 (page 371) examined the relationship of voter turnout to intensity of partisan preference and perceived closeness of the election. The current example collapses the table for these three variables over the categories of perceived closeness to examine the marginal table for turnout and preference. I return below to the analysis of the full three-way table.
15.2. Generalized Linear Models for Counts 397
Under the model in Equation 15.11, called the saturated model for the two-way table, the number of independent parameters is equal to the number of cells in the table,
1 + (r − 1 ) + (c − 1 ) + (r − 1 )(c − 1 ) = rc
The model is therefore capable of capturing any pattern of association in a two-way table. Remarkably, maximum-likelihood estimates for the parameters of a loglinear model (that is, in the present case, either the model for independence in Equation 15.9 or the saturated model in Equation 15.11) may be obtained by treating the observed cell counts Y (^) ij as the response variable in a Poisson GLM; the log expected counts ηij are then just the linear predictor for the GLM, as the notation suggests. 29 The constraint that all γ (^) ij = 0 imposed by the model of independence can be tested by a likelihood-ratio test, contrasting the model of independence (Equation 15.9) with the more gen- eral model (Equation 15.11). Because the latter is a saturated model, its residual deviance is necessarily 0, and the likelihood-ratio statistic for the hypothesis of independence H 0 : γ (^) ij = 0 is simply the residual deviance for the independence model, which has (r − 1 )(c − 1 ) residual degrees of freedom. Applied to the illustrative two-way table for the American Voter data, we get G^20 = 19 .428 with ( 3 − 1 )( 2 − 1 ) = 2 degrees of freedom, for which p < .0001, suggesting that there is strong evidence that intensity of preference and turnout are related. 30 Maximum-likelihood estimates of the parameters of the saturated loglinear model are shown in Table 15.6. It is clear from the estimated association parameters ̂γ (^) ij that turning out to vote, j = 1, increases with partisan preference (and, of course, that not turning out to vote, j = 2, decreases with preference).
The saturated loglinear model for a three-way (a × b × c) table for variables A, B, and C is defined in analogy to the three-way ANOVA model, although, as in the case of two-way tables, the meaning of the parameters is different:
ηij k = μ + α (^) A(i) + α (^) B(j ) + α (^) C(k) + α (^) AB(ij ) + α (^) AC(ik) + α (^) BC(j k) + α (^) ABC(ij k) (15.12)
(^29) * The reason that this result is remarkable is that a direct route to a likelihood function for the loglinear model leads to the multinomial distribution (discussed in Appendix D on probability and estimation), not to the Poisson distribution. That is, selecting n independent observations from a population characterized by cell probabilities πij results in cell counts following the multinomial distribution,
p(y 11 ,... , yrc ) = (^) r n! ∏ i= 1
∏^ c j = 1
y (^) ij!
∏^ r i= 1
∏^ c j = 1
π (^) ijnij
= (^) r n! ∏ i= 1
∏^ c j = 1
y (^) ij!
∏^ r i= 1
∏^ c j = 1
( (^) μij n
)nij
Noting that the expected counts μ (^) ij are functions of the parameters of the loglinear model leads to the multinomial likelihood function for the model. It turns out that maximizing this multinomial likelihood is equivalent to maximizing the likelihood for the Poisson GLM described in the text (see, e.g., Fienberg, 1980, app. II). (^30) This test is very similar to the usual Pearson chi-square test for independence in a two-way table. See Exercise 15. for details, and for an alternative formula for calculating the likelihood-ratio test statistic G^20 directly from the observed frequencies, Yij , and estimated expected frequencies under independence, ̂μ (^) ij.
398 Chapter 15. Generalized Linear Models
Table 15.6 Estimated Parameters for the Saturated Loglinear Model Fit in Table 15.
̂ γ (^) ij
i j = 1 j = 2 ̂α i
1 −0.183 0.183 0. 2 −0.037 0.037 0. 3 0.219 −0.219 −0. ̂ β j 0.625 −0.625 ̂μ = 5.
with sigma constraints specifying that each set of parameters sums to zero over each subscript; for example α 1 (+) = α 12 (i+) = α 123 (ij +) = 0. Given these constraints, we may solve for the parameters in terms of the log expected counts, with the solution following the usual ANOVA pattern; for example,
μ =
η+++ abc α (^) A(i) =
ηi++ bc
− μ
αAB(ij ) =
ηij + c
− μ − α (^) A(i) − α (^) B(j )
α (^) ABC(ij k) = ηij k − μ − α (^) A(i) − α (^) B(j ) − α (^) C(k) − α (^) AB(ij ) − α (^) AC(ik) − α (^) BC(j k)
The presence of the three-way term αABC in the model implies that the relationship between any pair of variables (say, A and B) depends on the category of the third variable (say, C). 31 Other loglinear models are defined by suppressing certain terms in the saturated model, that is, by setting parameters to zero. In specifying a restricted loglinear model, we will be guided by the principle of marginality: 32 Whenever a high-order term is included in the model, its lower- order relatives are included as well. Loglinear models of this type are often called hierarchical. Nonhierarchical loglinear models may be suitable for special applications, but they are not sensible in general (see Fienberg, 1980). According to the principle of marginality, for example, if αAB appears in the model, so do αA and αB.
(^31) Here and below I use the shorthand notation αABC to represent the whole set of αABC(ij k), and similarly for the other terms in the model. (^32) See Section 7.3.2.