Model Building in Multiple Regression - Lecture Notes | STAT 102, Study notes of Statistics

Material Type: Notes; Class: INTRO BUSINESS STAT; Subject: Statistics; University: University of Pennsylvania; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-unx
koofers-user-unx 🇺🇸

4.3

(3)

10 documents

1 / 19

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Lecture 13: Model Building in multiple regression
Stat 102
We will illustrate the practice of model building and discuss the
theory through examination of a model building exercise.
Data set pollution.JMP provides information about the
relationship between pollution and mortality for 60 cities
between 1959-1961.
Goal: Build a multiple regression model that can be used to
examine the effect on mortality of several pollution related
variables.
Build the most useful model based on the available data.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13

Partial preview of the text

Download Model Building in Multiple Regression - Lecture Notes | STAT 102 and more Study notes Statistics in PDF only on Docsity!

Lecture 13: Model Building in multiple regression

Stat 102

 We will illustrate the practice of model building and discuss the theory through examination of a model building exercise.

 Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959-1961.

 Goal: Build a multiple regression model that can be used to examine the effect on mortality of several pollution related variables.  Build the most useful model based on the available data.

The Data  The Dependent Variable ( Y ) is MORT ality = total age adjusted mortality in deaths per 100,000 population  The x -variables that could be used are  PRECIP=mean annual precipitation (in inches);  EDUC = median number of school years completed for persons 25 and older;  NONWHITE = percentage of 1960 population that is nonwhite;  NOX = relative pollution potential of N 2 O (related to amount of tons of = “Nitrous Oxide” (N 2 0) emitted per day per square kilometer);  SO2 = relative pollution potential of SO 2.  Among these, the pollution-related descriptors are NOX, SO2 and PRECIP (indirectly)  The remaining 2 variables are included as “controls”. Controls help answer whether pollution is important after controlling for other relevant (non-pollution) factors.

The Model Building Process is Interactive

 The goal is the best end product.

 There is no uniquely correct order in which to proceed.

 Decisions as to which variables to use, and how to transform them may need to be revisited as the analysis progresses.

Scatterplot Matrix

 This is often a useful first step.  Examine the Simple Regression Plots of Y on the potential x - variables  See whether they seem mostly linear.

 And see whether the Y -variable looks as if it will be

homoscedastic in future analyses. ▪ If not, try a transformation such as Log( Y ) before going further.  Our data looks good in terms of heteroscedasticity  It also looks pretty good in terms of nonlinearity but we will still find some useful corrective actions.  To get the Scatterplot Matrix in JMP click Multivariate and then put the Y- variable and all x-variables into the “Y, Columns” box.

Transformation to Correct Non-Linearities

 If non-linearity is present in the relationship between Y and some x it is usually a good idea to try to correct it now, before going further. We’ll want to use only transformations of x – We probably don’t want to transform Y since we seem to already have desirable homoscedasticity.  It’s not very evident from the small size plot in the Scatterplot Matrix, but there is some nonlinearity in the plot of MORT on SO2. We’ll work on that first --

Plot of MORT on SO

750

800

850

900

950

1000

1050

1100

1150

MORT

0 50 100 150 200 250 300 SO ___ Linear Fit: R^2 =. ___ Transformed Fit to Log: R^2 =. ___ Transformed Fit to SqRt: R^2 =.

SqRt(SO2) is the best, if only by a modest amount. And it has another advantage we’ll look at soon. {& The residual plot also looks good.}

Scatterplot of MORT by Log(NOX)

750

800

850

900

950

1000

1050

1100

1150

MORT

-1 0 1 2 3 4 5 6 Log(NOX)

 The values of this x-variable are nicely spread out.  The Y on x relationship looks not too non-linear. ▪ It’s not perfect, but it’s not clear how we could get it to be better. ▪ Mainly, there are 4 offending points (lower right); and we’ll look more at these later.

The Bonus in Using SqRt(SO2)

 SO2 is also a little crunched.

 SqRt(SO2) “uncrunches” it. Scatterplot of MORT by SqRt(SO2)

750

800

850

900

950

1000

1050

1100

1150

MORT

0 5 10 15 SqRt(SO2)

There’s even a nicely linear pattern here!

Forward Selection for Pollution Data

  1. Multivariate Correlations –

MORT PRECIP EDUC NONWHITE Log(NOX) SqRt(SO2) MORT 1.0000 0.5095 - 0.5110 0.6437 0.2920 0.

 NONWHITE has the largest R^2 = .6437^2 = .4143.  We can find the its P-value from the Parameter Estimates table: Parameter Estimates Term Estimate Std Error t Ratio Prob>|t| Intercept 887.1 10.37 85. 53 <. NONWHITE 4.488 0.7006 6.41 <.  It has P < .0001. So we include this variable in our model.

  1. In JMP save the residuals from this model.
  2. Repeat Step #1, but using these residuals instead of the original Y -values.

Forward Selection for the Data (cont)

3(cont). Multivariate Correlations with the Residuals

Residuals MORT PRECIP EDUC NONWHITE Log(NOX) SqRt(SO2) Residuals MORT 1.0000 0.3182 - 0.4921 0.0000 0.2220 0.

 EDUC now has the largest R^2 = (-.4921)^2 = .2422.  We can find the its P-value from the Parameter Estimates table: Parameter Estimates (with Y = ResidualsMORT) Term Estimate Std Error t Ratio Prob>|t| Intercept 304.08 70.84 4.29 <. EDUC - 27.71 6.44 - 4.31 <.  It has P < .0001. So we include this variable in our model.  Additional Note: The correlation between ResidualsMORT and NONWHITE is 0.0000. It is a property of linear regression (simple or multiple) that the correlation between the residuals and any explanatory variable in the model that produced them is 0.

Automatic Model Building in JMP :

  1. Click Analyze, Fit Model and add all variables under consideration to the Construct Model Effects box. Change the personality to Stepwise and click Run Model.
  2. If there are variables which you would like to include in the model for substantive reasons, regardless of their significance, check Lock next to the variable.
  3. Enter into the model the variable with the largest F ratio if Prob>F is less than .05 for this variable (do this by clicking the Enter box).
  4. Enter into the model the variable that has not already been entered into the model with the largest F ratio if Prob>F is less than .05. The F ratio for a variable X_j that has not been included in the model is the F statistic for testing the reduced model that includes only the variables already included in the model versus the full model that includes variable X_j in addition to the variables that have already been included in the model.
  5. Repeat Step 4 until no more variables can be entered into the model.

Here are the results of the model building process for the Pollution data:

Stepwise Fit: Response: MORT Stepwise Regression Control Prob to Enter 0. Prob to Leave 0. Current Estimates SSE DFE MSE RSquare Cp AIC 72504 55 1318 0.6824 4.79 435. Lock Entered Parameter Estimate nDF SS F Ratio Prob>F X X Intercept 956.7 1 0 0.000 1. X PRECIP 1.725 1 10167 7.713 0. X EDUC - 1 4.00 1 5404 4.100 0. X NONWHITE 3.04 8 1 34372 26.074 0. Log(NOX) 0 1 1051 0 .795 0. X SqRt(SO2) 5.8 3 1 24858 18.857 0. Step History Step Parameter Action "Sig Prob" Seq SS RSquare Cp p 1 NONWHITE Entered 0.0000 94595 0.4144 45.03 2 2 EDUC Entered 0.0000 33848 0.5627 21.45 3 3 SqRt(SO2) Entered 0.0012 17157 0.6 378 10.48 4 4 PRECIP Entered 0.0075 10167 0.6824 4.79 5

Tables from the Full (5-factor) Model RSquare^ Summary of Fit 0.68 7 Observations 60 Analysis of Variance Source DF Sum of Squares Mean Square F Ratio Model 5 156820 31364 23. Error C. To (^) tal 5459 71452228273 1323 Prob > F<.

Term Estimate^ Parameter Estimates Std Error t Ratio Prob>|t| Intercept 950.35 93.23 10.19 <. PRECIP 2.01 0.698 2.88 0. EDUC - 14.71 6.963 - 2.11 0. NONWHITE Log(NOX) 2.8256.708 0.64857.526 4.360.89 <.00010. SqRt(SO2) 4.389 2.102 2.09 0. Effect Tests Source DF Sum of Squares F Ratio Prob > F PRECIP 1 10940 8.27 0. EDUC NONWHITE 11 590725120 4.4618.99 0.0392<. Log(NOX) 1 1051 0.7955 0. SqRt(SO2) 1 5767 4.36 0. Log(NOX) is not statistically significant, as was also claimed on p. 15 and shown in the table on p. 17.