
















































This document covers methods for building regression models, including model selection procedures such as backward elimination, forward selection, and stepwise regression, along with cautions about automated variable selection. It also covers diagnostics for checking model assumptions and detecting influential observations. Examples use home selling price data, and the chapter introduces generalized linear models for non-normal distributions and nonlinear relationships.
630 CHAPTER 14. REGRESSION MODEL BUILDING
This chapter introduces tools for building regression models and evaluating the effects on their fit of unusual observations or highly correlated predictors. It also shows ways of modeling variables that badly violate the assumptions of straight-line relationships with a normal response variable. Section 14.1 discusses criteria for selecting a regression model by deciding which of a possibly large collection of variables to include in the model. Section 14.2 introduces methods for checking regression assumptions and evaluating the influence of individual observations. Section 14.3 discusses effects of multicollinearity: such strong “overlap” among the explanatory variables that no one of them seems useful when the others are also in the model. Section 14.4 introduces a generalized model that can handle response variables having distributions other than the normal. Sections 14.5 and 14.6 introduce models for nonlinear relationships.
Social research studies usually have several explanatory variables. For example, for modeling mental impairment, potential predictors include income, educational attainment, an index of life events, social and environmental stress, marital status, age, self-assessment of health, number of jobs held in previous five years, number of relatives who live nearby, number of close friends, membership in social organizations, frequency of church attendance, and so forth. Usually, the regression model for the study includes some explanatory variables for theoretical reasons. Others may be included for exploratory purposes, to check whether they explain much variation in the response variable. The model might also include terms to allow for interactions. In such situations, it can be difficult to decide which variables to use and which to exclude.
One possible strategy may be obvious: include every potentially useful predictor and then delete those terms not making significant partial contributions at some preassigned α-level. Unfortunately, this usually is inadequate. Because of correlations among the explanatory variables, any one variable may have little unique predictive power, especially when the number of predictors is large. It is conceivable that few, if any, explanatory variables would make significant partial contributions, given that all of the other explanatory variables are in the model. Here are two general guidelines for selecting explanatory variables: First, include enough of them to make the model useful for theoretical
where the remaining variables all make significant partial contributions to predicting y. For most software, the variable deleted at each stage is the one that is the least significant, having the largest P-value in the significance test for its effect. Specifically, here’s the sequence of steps for backward elimination: The initial model contains all potential explanatory variables. If all variables make significant partial contributions at some fixed α-level, according to the usual t test or F test, then that model is the final one. Otherwise, the explanatory variable having the largest P-value, controlling for the other variables in the model, is removed. Next, for the model with that variable removed, the partial contributions of the variables remaining in the model are reassessed, controlling for the other variables still in the model. If they are all significant, that model is the final model. Otherwise, the variable having the largest P-value is removed. The process continues until each remaining predictor explains a significant partial amount of the variability in y.
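The backward elimination steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure any particular software package uses: the data are synthetic, the function names `t_statistics` and `backward_elimination` are hypothetical, and the cutoff `t_crit=2.0` is only a rough stand-in for a two-sided test at α = 0.05 with many degrees of freedom. Within one fitted model all t tests share the same degrees of freedom, so the predictor with the largest P-value is the one with the smallest |t|, which lets the sketch avoid computing P-values directly.

```python
import numpy as np

def t_statistics(X, y):
    """Fit y on X (plus an intercept) by least squares; return slope t statistics."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ y
    resid = y - design @ coef
    df = n - design.shape[1]                 # residual degrees of freedom
    s2 = resid @ resid / df                  # estimated error variance
    se = np.sqrt(s2 * np.diag(xtx_inv))      # standard errors of the estimates
    return (coef / se)[1:]                   # drop the intercept's t statistic

def backward_elimination(X, y, names, t_crit=2.0):
    """Repeatedly drop the least significant predictor until all have |t| >= t_crit."""
    keep = list(range(X.shape[1]))
    while keep:
        t_abs = np.abs(t_statistics(X[:, keep], y))
        weakest = int(np.argmin(t_abs))      # smallest |t| = largest P-value
        if t_abs[weakest] >= t_crit:
            break                            # every remaining predictor is significant
        del keep[weakest]
    return [names[j] for j in keep]

# Synthetic data (an assumption for illustration): only x1 and x3 affect y.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

selected = backward_elimination(X, y, ["x1", "x2", "x3", "x4"])
print(selected)
```

With effects this strong, x1 and x3 survive the elimination. Whether an irrelevant predictor such as x2 or x4 also survives depends on chance.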
Example 14.1 Selecting Predictors of Home Selling Price
We refer to the 100 observations in the “house selling price” data file at the text website, which were introduced in Example 9.10 on page
$$\hat{y} = -21,353.8 + 61.7(\text{SIZE}) + 46,373.7(\text{NEW}) + 37.23(\text{TAXES}).$$
Other things being equal, an extra thousand square feet of size increases the selling price by about 62 thousand dollars, and having a new home
14.1. MODEL SELECTION PROCEDURES 633
Table 14.1: Model Fit at Initial Stage of Backward Elimination for Predicting Home Selling Price
Variable      B           Std. Error   t       Sig
(Constant)    4525.75     24474.
SIZE          68.35       13.94        4.90    .
NEW           41711.43    16887.20     2.47    .
TAXES         38.13       6.81         5.60    .
BATHS         -2114.37    11465.11     -0.18   .
BEDS          -11259.1    9115.00      -1.23   .
increases it by about 46 thousand. Using standardized variables, the equation is $\hat{z}_y = 0.406 z_S + 0.144 z_N + 0.464 z_T$.
Table 14.2: Model Fit at Third Stage of Backward Elimination for Predicting Home Selling Price
Variable      B           Std. Error   Std. Coeff   t      Sig
(Constant)    -21353.8    13311.
SIZE          61.70       12.50        0.406        4.94   .
NEW           46373.70    16459.02     0.144        2.82   .
TAXES         37.23       6.74         0.466        5.53   .
If we had included interactions in the original model, we would have ended up with a different final model. However, the model given here has the advantage of simplicity, and it has good predictive power (R^2 = 0.790, compared to 0.793 with all the predictors). In fact, R^2 adj = 0.783 for this model (see Table 14.4) is higher than R^2 adj = 0.782 for the original model.
Whereas backward elimination begins with all the potential explanatory variables in the model, forward selection begins with none of them. It adds one variable at a time to the model until reaching a point where no remaining variable not yet in the model makes a significant partial contribution to predicting y. At each step, the variable added is the
as forward selection. At each stage, each variable in the model makes a significant contribution, so no variables are dropped. For these variables, backward elimination, forward selection, and stepwise regression all agree. This need not happen.
It seems appealing to have a procedure that automatically selects variables according to established criteria. But any variable selection method should be used with caution and should not substitute for careful thought. There is no guarantee that the final model chosen will be sensible.
For instance, when it is not known whether explanatory variables interact in their effects on a response variable, one might specify all the pairwise interactions as well as the main effects as the potential explanatory variables. In this case, it is inappropriate to remove a main effect from a model that contains an interaction composed of that variable. Yet, most software does not have this safeguard. To illustrate, we used forward selection with the home sales data, including the five predictors from above as well as their ten cross-product interaction terms. The final model had R^2 = 0.866, using four interaction terms (SIZE×TAXES, SIZE×NEW, TAXES×NEW, BATHS×NEW) and the TAXES main effect. It is inappropriate, however, to use these interactions as predictors without the SIZE, NEW, and BATHS main effects.
Also, a variable selection procedure may exclude an important predictor that really should be in the model according to other criteria. For instance, using backward elimination with the five predictors of home selling price and their interactions, TAXES was removed. In other words, at a certain stage, TAXES explained an insignificant part of the variation in selling price. Nevertheless, it is the best single predictor of selling price, having r^2 = 0.709 by itself. (Refer to step 1 of the forward selection process in Table 14.3.) Since TAXES is such an important determinant of selling price, it seems sensible that any final model should include it as a predictor.
Figure 14.1: Variability in y Explained by $x_1$, $x_2$, and $x_3$. Shaded portion is amount explained by $x_1$ that is also explained by $x_2$ and $x_3$
Although P-values provide a guide for making decisions about adding or dropping variables in selection procedures, they are not the true P-values for the tests conducted. We add or drop a variable at each stage according to a minimum or maximum P-value, but the sampling distribution of the maximum or minimum of a set of t or F statistics differs from the sampling distribution for the statistic for an a priori chosen test. For instance, suppose we add variables in forward selection according to whether the P-value is less than 0.05. Even if none of the potential predictors truly affect y, the probability is considerably larger than 0.05 that at least one of the separate test statistics provides a P-value below 0.05 (Exercise 52). At least one variable that is not really important may look impressive merely due to chance.
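A rough calculation shows how quickly this probability grows. The sketch below assumes, purely for illustration, that the k separate tests are independent and that no predictor truly affects y; tests on correlated predictors in a real regression are not independent, so the numbers are indicative only. The function name is hypothetical.

```python
def prob_at_least_one_false_positive(k, alpha=0.05):
    """P(at least one of k independent null tests has P-value below alpha)."""
    return 1.0 - (1.0 - alpha) ** k

# Probability of at least one spuriously "significant" predictor
# when screening k candidate predictors that have no real effect.
for k in (1, 5, 10, 20):
    print(k, round(prob_at_least_one_false_positive(k), 3))
```

Under these assumptions, screening ten unrelated predictors already gives about a 40% chance that at least one of them looks significant at the 0.05 level.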
Similarly, once we choose a final model with a selection procedure, any inferences conducted with that model are highly approximate. In particular, P-values are likely to appear smaller than they should be and confidence intervals are likely to be too narrow, since the model was chosen that most closely reflects the data, in some sense. The inferences are more believable if performed for that model with a new set of data.
There is a basic difference between explanatory and exploratory modes of model selection. In explanatory research, there is a theoretical model to test using multiple regression. We might test whether a hypothesized spurious association disappears when other variables are controlled, for example. In such research, automated selection procedures are usually not appropriate, because theory dictates which variables are in the model.
In exploratory research, by contrast, the goal is not to examine theoretically specified relationships but merely to find a good set of predictors. This approach searches for predictors that give a large R^2, without concern about theoretical explanations. Thus, educational researchers might use a variable selection procedure to search for a set of test scores and other factors that predict well how students perform in college. They should be cautious about giving causal interpretations to the effects of the different variables. For example, possibly the “best” predictor of students’ success in college is whether their parents use the Internet for voice communication (with a program such as SKYPE).
In summary, automated variable selection procedures are no substitute for careful thought in formulating models. For most scientific research, they are not appropriate.
Table 14.4: Model Selection Criteria for Models for Home Selling Price
Variables in Model               R^2     R^2 adj   PRESS   Cp
TAXES                            0.709   0.706     3.17    36.4
TAXES, SIZE                      0.772   0.767     2.73    9.6
TAXES, SIZE, NEW                 0.790   0.783     2.67    3.7
TAXES, SIZE, NEW, BEDS           0.793   0.785     2.85    4.0
TAXES, SIZE, NEW, BEDS, BATHS    0.793   0.782     2.91    6.0
Note: Actual PRESS equals value reported times 10^11.
equation we get, let $\hat{y}_{(1)}$ denote the predicted selling price for observation
$$\text{PRESS} = \sum (y_i - \hat{y}_{(i)})^2.$$
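As a minimal sketch of how PRESS can be computed, the code below refits the model n times, each time leaving out one observation and predicting it from the rest. The data are synthetic and the function names are hypothetical. The second function uses a standard least squares identity that is not stated in the text, namely that the leave-one-out prediction error equals the ordinary residual divided by one minus the observation's leverage, and the two results are compared as a check.

```python
import numpy as np

def press(X, y):
    """PRESS by brute force: refit without observation i, then predict y_i."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        total += (y[i] - design[i] @ coef) ** 2
    return float(total)

def press_shortcut(X, y):
    """Same quantity from a single fit, using residual / (1 - leverage)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    resid = y - hat @ y
    return float(np.sum((resid / (1.0 - np.diag(hat))) ** 2))

# Synthetic data, assumed only for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=50)

print(press(X, y), press_shortcut(X, y))
```

The brute-force version mirrors the definition directly; the shortcut is what makes PRESS cheap to report for many candidate models.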
The smaller the value of PRESS, the better the predictions tend to be, in a summary sense. According to this criterion, the best-fitting model is the one with the smallest value of PRESS. Table 14.4 shows the PRESS values for five models for the house selling price data. According to this criterion, the selected model is the one with predictors TAXES, SIZE, and NEW, which has the minimum PRESS = 2.67. (The y values were in dollars, so squared residuals tended to be huge numbers, and the actual PRESS values are the numbers reported multiplied by 10^11.) This was also the model selected by backward elimination and by forward selection. A related approach reports a statistic that describes how well each model fits compared to the full model with all the predictors. Roughly speaking, it attempts to find the simplest model that has a relatively
14.2. REGRESSION DIAGNOSTICS 639
small expected value of $[\hat{y} - E(y)]^2$, which measures the distance between a predicted value and the true mean of y at the given values of the explanatory variables. When you have a full model that you believe has enough terms to eliminate important bias, you can use this statistic to search for a simpler model that also has little bias. The statistic is denoted by Cp, where p denotes the number of parameters in the regression model (including the y-intercept). For a given number of parameters p, smaller values of Cp indicate a better fit. For the full model, necessarily Cp = p. A simpler model than the full one that has Cp close to p provides essentially as good a fit, apart from sampling error. Models having values of Cp considerably larger than p do not fit as well. In using Cp to help select a model, the goal is to have the smallest number of predictors necessary to give a value of Cp close to p. For that number of predictors, the selected model is the one with the minimum value of Cp.
Consider the models in Table 14.4. The full model shown on the last line of that table has five predictors and p = 6 parameters, so it has Cp = 6.0. The model removing BATHS has p = 5 parameters and has Cp = 4.0. The model removing BEDS has p = 4 parameters and has Cp = 3.7. Since Cp is then close to p = 4, this model seems to fit essentially as well as the full model, apart from sampling error. The simpler models listed in the table have Cp considerably larger than p (Cp = 9.6 with p = 3 and Cp = 36.4 with p = 2), and provide poorer fits.
Some software does not report PRESS or Cp but does present a measure that has a similar purpose. The AIC, short for Akaike information criterion, also attempts to find a model for which the $\{\hat{y}_i\}$ tend to be closest to $\{E(y_i)\}$ in an average sense. Its formula, not considered here, penalizes a model for having more parameters than are useful for getting good predictions. The AIC is also scaled in such a way that the lower the value, the better the model. The “best” model is the one with smallest AIC. An advantage of AIC is that it is also useful for models that have nonnormal distributions for y, in which case a sum of squared errors may not be a useful summary.
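Returning to the Cp and adjusted R^2 criteria, the sketch below shows one way they can be computed from R^2 values alone. It relies on two standard definitions that the text above does not spell out, so treat them as assumptions here: adjusted R^2 = 1 − (1 − R^2)(n − 1)/(n − p), and Mallows' Cp = SSE_p / s^2_full − (n − 2p), with p counting the intercept. The function names are hypothetical, and the inputs are the rounded R^2 values reported for the n = 100 homes.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p parameters (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

def mallows_cp(r2, p, r2_full, p_full, n):
    """Cp = SSE_p / s2_full - (n - 2p), written in terms of R^2 values.

    SSE_p = (1 - r2) * TSS and s2_full = (1 - r2_full) * TSS / (n - p_full),
    so the total sum of squares TSS cancels.
    """
    scaled_sse = (1.0 - r2) / (1.0 - r2_full) * (n - p_full)
    return scaled_sse - (n - 2 * p)

n = 100
models = [  # (predictors, R^2, number of parameters p)
    ("TAXES", 0.709, 2),
    ("TAXES, SIZE", 0.772, 3),
    ("TAXES, SIZE, NEW", 0.790, 4),
    ("TAXES, SIZE, NEW, BEDS", 0.793, 5),
    ("TAXES, SIZE, NEW, BEDS, BATHS", 0.793, 6),
]
r2_full, p_full = 0.793, 6

for name, r2, p in models:
    print(name, round(adjusted_r2(r2, n, p), 3),
          round(mallows_cp(r2, p, r2_full, p_full, n), 1))
```

For the full model the formula returns Cp = p exactly, as the text says it must. Because the R^2 inputs are rounded to three decimals, the Cp values for the smaller models agree only approximately with those reported from the unrounded fit.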
Once we have selected predictors for a model, how do we know that model fits the data adequately? This section introduces diagnostics that indicate (1) when model assumptions are grossly violated and (2) when certain observations are highly influential in affecting the model fit or inference about model parameters. Recall that inference about parameters in a regression model makes these assumptions:
Example 14.2 Residuals for Modeling Home Selling Price
For the data of Table 9.4 (page 387) with y = selling price, variable selection procedures in Example 14.1 (page 632) suggested the model having predictors SIZE of home, TAXES, and whether NEW. The prediction equation is
$$\hat{y} = -21,353.8 + 61.7(\text{SIZE}) + 46,373.7(\text{NEW}) + 37.2(\text{TAXES}).$$
Figure 14.2 is a histogram of the studentized residuals for this fit, as plotted by SPSS. No severe nonnormality seems to be indicated, since they are roughly bell-shaped about 0. However, the plot indicates that two observations have relatively large residuals. On further inspection, we find that observation 6 had a selling price of $499,900, which was $168,747 higher than the predicted selling price for a new home of 3153 square feet with a tax bill of $2997. The residual of $168,747 has a studentized value of 3.88. Likewise observation 64 had a selling price of $225,000, which was $165,501 lower than the predicted selling price for a non-new home of 4050 square feet with a tax bill of $4350. Its residual of −$165,501 has a studentized value of −3.93. A severe outlier on y can substantially affect the fit, especially when the values of the explanatory variables are not near their means. So, we refitted the model without these two observations. The R^2 value changes from 0.79 to 0.83, and the prediction equation changes to
$$\hat{y} = -32,226 + 68.9(\text{SIZE}) + 20,436(\text{NEW}) + 38.3(\text{TAXES}).$$
The parameter estimates are similar for SIZE and TAXES, but the estimated effect of NEW drops from 46,374 to 20,436. Moreover, the effect of NEW is no longer significant, having a P-value of 0.17. Because the estimated effect of NEW is affected substantially by these two observations, we should be cautious in making conclusions about its effect. Of the 100 homes in the sample, only 11 were new. It is difficult to make precise estimates about the NEW effect with so few new homes, and results are highly affected by a couple of unusual observations.
Figure 14.2: Histogram of Studentized Residuals for Multiple Regression Model Fitted to Housing Price Data, with Predictors Size, Taxes, and New
The normality assumption is not as important as the assumption that the model provides a good approximation for the true relationship between the predictors and the mean of y. If the model assumes a linear effect but the effect is actually strongly nonlinear, the conclusions will be faulty. For bivariate models, the scatterplot provides a simple check on the form of the relationship. For multiple regression, it is also useful to construct a scatterplot of each explanatory variable against the response variable. This displays only the bivariate relationships, however, whereas the model refers to the partial effect of each predictor, with the others held constant. The partial regression plot introduced in Chapter 11 provides some information about this. It provides a summary picture of the partial relationship.
For multiple regression models, plots of the residuals (or studentized residuals) against the predicted values $\hat{y}$ or against each explanatory variable also help us check for potential problems. If the residuals appear to fluctuate randomly about 0 with no obvious trend or change in variation as the values of a particular $x_i$ increase, then no violation of assumptions is indicated. The pattern should be roughly like Figure 14.3a. In Figure 14.3c, y tends to be below $\hat{y}$ for very small and very large $x_i$-values (giving negative residuals) and above $\hat{y}$ for medium-sized $x_i$-values (giving positive residuals). Such a scattering of residuals suggests that y is actually nonlinearly related to $x_i$. Sections 14.5 and 14.6 show how to address nonlinearity.
Figure 14.3: Possible Patterns for Residuals (e), Plotted Against an Explanatory Variable x
In practice, most response variables can take only nonnegative values. For such responses, a fairly common occurrence is that the variability increases dramatically as the mean increases. For example, consider y = annual income (in dollars), using several predictors. For those subjects having E(y) = $10,000, the standard deviation of income is probably much less than for those subjects having E(y) = $200,000. Plausible standard deviations might be $4000 and $80,000. When this happens, the conditional standard deviation of y is not constant, whereas ordinary regression assumes that it is. An indication that this is happening is when the residuals are more spread out as the $y_i$ values increase. If we were to plot the residuals against a predictor that has a positive partial association with y, such as number of years of education, the residuals
in a random pattern about 0 over time, rather than showing a trend or periodic cycle. The methods presented in this text are based on independent observations and are inappropriate when time effects occur. For example, when observations next to each other tend to be positively correlated, the standard error of the sample mean is larger than the $\sigma/\sqrt{n}$ formula that applies for independent observations. Books specializing in time series or econometrics, such as Kennedy (2004), present methods for time series data.
Least squares estimates of parameters in regression models can be strongly influenced by an outlier, especially when the sample size is small. A variety of statistics summarize the influence each observation has. These statistics refer to how much the predicted values $\hat{y}$ or the model parameter estimates change when the observation is removed from the data set. An observation’s influence depends on two factors: (1) how far the response on y falls from the overall trend in the sample and (2) how far the values of the explanatory variables fall from their means.
The first factor on influence (how far an observation falls from the overall trend) is measured by the residual for the observation, $y - \hat{y}$. The larger the residual, the farther the observation falls from the overall trend. We can search for observations with large studentized residuals (say, larger than about 3 in absolute value) to find observations that may be influential.
The second factor on influence (how far the explanatory variables fall from their means) is summarized by the leverage of the observation. The leverage is a nonnegative statistic such that the larger its value, the greater weight that observation receives in determining the $\hat{y}$ values (hence, it also is sometimes called a hat value). The formula for the leverage in multiple regression is complex. For the bivariate model, the leverage for observation i simplifies to
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x - \bar{x})^2}.$$
So, the leverage gets larger as the x-value gets farther from the mean, but it gets smaller as the sample size increases. When calculated for each observation in a sample, the average leverage equals p/n, where p is the number of parameters in the model.
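A short sketch can make these two properties concrete. It computes the bivariate leverage, 1/n plus the squared distance of x from its mean divided by the sum of all such squared distances, for five made-up x-values (an assumption for illustration), one of which lies far from the others. The function name is hypothetical.

```python
def leverages(x):
    """Leverage of each observation in the bivariate (one-predictor) model."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)   # sum of squared deviations
    return [1.0 / n + (xi - mean) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 10.0]   # the last value is far from the mean of 4
h = leverages(x)

for xi, hi in zip(x, h):
    print(xi, round(hi, 2))

# The bivariate model has p = 2 parameters, so the average leverage is 2/n.
print(sum(h) / len(h), 2 / len(x))
```

The observation at x = 10 receives by far the largest leverage, while the average over all five observations equals p/n = 2/5.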
Most statistical software reports other diagnostics that depend on the studentized residuals and the leverages. Two popular ones are called
DFFIT and DFBETA. For a given observation, DFBETA summarizes the effect on the model parameter estimates of removing the observation from the data set. For the effect $\beta_j$ of $x_j$, DFBETA equals the change in the estimate $\hat{\beta}_j$ due to deleting the observation. The larger the absolute value of DFBETA, the greater the influence of the observation on that parameter estimate. Each observation has a DFBETA value for each parameter in the model. DFFIT summarizes the effect on the fit of deleting the observation. It summarizes more broadly the influence of an observation, as each observation has a single DFFIT value whereas it has a separate DFBETA for each parameter. For observation i, DFFIT equals the change in the predicted value due to deleting that observation (i.e., $\hat{y}_i - \hat{y}_{(i)}$). The DFFIT value has the same sign as the residual. The larger its absolute value, the greater the influence that observation has on the fitted values. Cook’s distance is an alternative measure with the same purpose, but it is based on the effect that observation i has on all the predicted values.
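The deletion idea behind both measures can be sketched directly: fit the model with all observations, refit without observation i, and take differences. The code below computes the unstandardized versions described above on a small made-up data set in which the last point is an outlier; the data and the function names are assumptions for illustration, and statistical packages typically report standardized variants instead.

```python
import numpy as np

def fit(design, y):
    """Least squares coefficients for a design matrix that includes the intercept."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def deletion_diagnostics(X, y, i):
    """Return (DFBETA vector, DFFIT) for observation i, both unstandardized."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    coef_all = fit(design, y)                  # estimates using every observation
    keep = np.arange(n) != i
    coef_del = fit(design[keep], y[keep])      # estimates with observation i deleted
    dfbeta = coef_all - coef_del               # change in each parameter estimate
    dffit = design[i] @ coef_all - design[i] @ coef_del   # change in predicted y_i
    return dfbeta, dffit

# Made-up data: points on the line y = 1 + 2x, except the last one sits 20 units above it.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[9] += 20.0
X = x.reshape(-1, 1)

dfbeta, dffit = deletion_diagnostics(X, y, 9)
print("DFBETA (intercept, slope):", dfbeta)
print("DFFIT:", dffit)
```

Here the outlier has a positive residual and a positive DFFIT, and deleting it lowers the estimated slope, so its DFBETA for the slope is positive as well.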
Example 14.3 DFBETA and DFFIT for an Influential Observation
Example 14.2 (page 641) showed that observations 6 and 64 were influential on the equation for predicting home selling price using size of home, taxes, and whether the house is new. The prediction equation is
$$\hat{y} = -21,354 + 61.7(\text{SIZE}) + 46,373.7(\text{NEW}) + 37.2(\text{TAXES}).$$
For observation 6, the DFBETA values are 12.5 for size, 16,318.5 for new, and −5.7 for taxes. This means, for example, that if this observation is deleted from the data set, the effect of NEW changes from 46,373.7 to 46,373.7 − 16,318.5 = 30,055.2. Observation 6 had a predicted selling price of $\hat{y}$ = 331,152.8. Its DFFIT value is 29,417.0. This means that if observation 6 is deleted from the data set, then $\hat{y}$ at the predictor values for observation 6 changes to 331,152.8 − 29,417.0 = 301,735.8. This analysis also shows that this observation is quite influential.
Some software reports standardized versions of the DFBETA and DFFIT measures. The standardized DFBETA divides the change in the estimate $\hat{\beta}_j$ due to deleting the observation by the standard error of $\hat{\beta}_j$ for the adjusted data set. For observation i, the standardized DFFIT equals the change in the predicted value due to deleting that observation, divided by the standard error of $\hat{y}$ for the adjusted data set. In practice, you scan or plot these diagnostic measures to see if some observations stand out from the rest, having relatively large values. Each
Parent Percentage to Predict Murder Rate for 50 U.S. States and District of Columbia Dep Var Predict Student Leverage POVERTY SINGLE
An observation with a large studentized residual does not have a major influence if its values on the explanatory variables do not fall far from their means. Recall that the leverage summarizes how far the explanatory variables fall from their means. For instance, New Mexico has a relatively large negative studentized residual (−2.25) but a relatively small leverage (0.047), so it does not have large values of DFFIT or DFBETA. Similarly, an observation far from the mean on the explanatory variables (i.e., with a large leverage) need not have a major influence, if it falls close to the prediction equation and has a small studentized residual. For instance, West Virginia has a relatively large poverty rate and its leverage of 0.18 is triple the average. However, its studentized residual is small (0.66), so it has little influence on the fit.
In many social science studies using multiple regression, the explanatory variables “overlap” considerably. Each variable may be nearly redundant, in the sense that it can be predicted well using the others. If we regress an explanatory variable on the others and get a large R^2 value, this suggests that it may not be needed in the model once the others are there. This condition is called multicollinearity. This section describes the effects of multicollinearity and ways to diagnose it.
Multicollinearity causes inflated standard errors for estimates of regression parameters. To show why, we first consider the regression model $E(y) = \alpha + \beta_1 x_1 + \beta_2 x_2$. The estimate of $\beta_1$ has standard error
$$se = \frac{1}{\sqrt{1 - r^2_{x_1 x_2}}} \left[ \frac{s}{\sqrt{n - 1}\, s_{x_1}} \right],$$
where s is the estimated conditional standard deviation of y and $s_{x_1}$ denotes the sample standard deviation of $x_1$ values. The effect of the correlation $r_{x_1 x_2}$ between the explanatory variables enters through the term $\sqrt{1 - r^2_{x_1 x_2}}$ in the denominator. Other things being equal, the stronger that squared correlation, the larger the standard error of $b_1$. Similarly, the standard error of the estimator of $\beta_2$ also is larger with larger values of $r^2_{x_1 x_2}$.
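To get a feel for the size of this effect, the sketch below evaluates the standard error formula, se = [1/sqrt(1 − r^2)] × [s / (sqrt(n − 1) × s_x1)], for several values of the correlation r between the two predictors. The values of s, n, and s_x1 are hypothetical numbers chosen only for illustration, and the function name is likewise an assumption.

```python
import math

def se_b1(s, n, s_x1, r):
    """Standard error of b1 in the two-predictor model, given correlation r of x1 and x2."""
    inflation = 1.0 / math.sqrt(1.0 - r ** 2)     # factor due to correlated predictors
    baseline = s / (math.sqrt(n - 1) * s_x1)      # standard error when r = 0
    return inflation * baseline

# Hypothetical values: s = 10, n = 100, s_x1 = 2.
for r in (0.0, 0.5, 0.9, 0.99):
    print(r, round(se_b1(s=10.0, n=100, s_x1=2.0, r=r), 3))
```

With these inputs the standard error is a little more than twice its uncorrelated value at r = 0.9 and about seven times as large at r = 0.99, which is why highly correlated predictors can each appear nonsignificant.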