











Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Material Type: Notes; Class: INTRO BUSINESS STAT; Subject: Statistics; University: University of Pennsylvania; Term: Unknown 1989;
Typology: Study notes
1 / 19
This page cannot be seen from the preview
Don't miss anything!












Stat 102
We will illustrate the practice of model building and discuss the theory through examination of a model building exercise.
Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959-1961.
Goal: Build a multiple regression model that can be used to examine the effect on mortality of several pollution related variables. Build the most useful model based on the available data.
The Data The Dependent Variable ( Y ) is MORT ality = total age adjusted mortality in deaths per 100,000 population The x -variables that could be used are PRECIP=mean annual precipitation (in inches); EDUC = median number of school years completed for persons 25 and older; NONWHITE = percentage of 1960 population that is nonwhite; NOX = relative pollution potential of N 2 O (related to amount of tons of = “Nitrous Oxide” (N 2 0) emitted per day per square kilometer); SO2 = relative pollution potential of SO 2. Among these, the pollution-related descriptors are NOX, SO2 and PRECIP (indirectly) The remaining 2 variables are included as “controls”. Controls help answer whether pollution is important after controlling for other relevant (non-pollution) factors.
The Model Building Process is Interactive
The goal is the best end product.
There is no uniquely correct order in which to proceed.
Decisions as to which variables to use, and how to transform them may need to be revisited as the analysis progresses.
Scatterplot Matrix
This is often a useful first step. Examine the Simple Regression Plots of Y on the potential x - variables See whether they seem mostly linear.
And see whether the Y -variable looks as if it will be
homoscedastic in future analyses. ▪ If not, try a transformation such as Log( Y ) before going further. Our data looks good in terms of heteroscedasticity It also looks pretty good in terms of nonlinearity but we will still find some useful corrective actions. To get the Scatterplot Matrix in JMP click Multivariate and then put the Y- variable and all x-variables into the “Y, Columns” box.
If non-linearity is present in the relationship between Y and some x it is usually a good idea to try to correct it now, before going further. We’ll want to use only transformations of x – We probably don’t want to transform Y since we seem to already have desirable homoscedasticity. It’s not very evident from the small size plot in the Scatterplot Matrix, but there is some nonlinearity in the plot of MORT on SO2. We’ll work on that first --
Plot of MORT on SO
750
800
850
900
950
1000
1050
1100
1150
MORT
0 50 100 150 200 250 300 SO ___ Linear Fit: R^2 =. ___ Transformed Fit to Log: R^2 =. ___ Transformed Fit to SqRt: R^2 =.
SqRt(SO2) is the best, if only by a modest amount. And it has another advantage we’ll look at soon. {& The residual plot also looks good.}
Scatterplot of MORT by Log(NOX)
750
800
850
900
950
1000
1050
1100
1150
MORT
-1 0 1 2 3 4 5 6 Log(NOX)
The values of this x-variable are nicely spread out. The Y on x relationship looks not too non-linear. ▪ It’s not perfect, but it’s not clear how we could get it to be better. ▪ Mainly, there are 4 offending points (lower right); and we’ll look more at these later.
The Bonus in Using SqRt(SO2)
SO2 is also a little crunched.
SqRt(SO2) “uncrunches” it. Scatterplot of MORT by SqRt(SO2)
750
800
850
900
950
1000
1050
1100
1150
MORT
0 5 10 15 SqRt(SO2)
There’s even a nicely linear pattern here!
Forward Selection for Pollution Data
MORT PRECIP EDUC NONWHITE Log(NOX) SqRt(SO2) MORT 1.0000 0.5095 - 0.5110 0.6437 0.2920 0.
NONWHITE has the largest R^2 = .6437^2 = .4143. We can find the its P-value from the Parameter Estimates table: Parameter Estimates Term Estimate Std Error t Ratio Prob>|t| Intercept 887.1 10.37 85. 53 <. NONWHITE 4.488 0.7006 6.41 <. It has P < .0001. So we include this variable in our model.
Forward Selection for the Data (cont)
3(cont). Multivariate Correlations with the Residuals
Residuals MORT PRECIP EDUC NONWHITE Log(NOX) SqRt(SO2) Residuals MORT 1.0000 0.3182 - 0.4921 0.0000 0.2220 0.
EDUC now has the largest R^2 = (-.4921)^2 = .2422. We can find the its P-value from the Parameter Estimates table: Parameter Estimates (with Y = ResidualsMORT) Term Estimate Std Error t Ratio Prob>|t| Intercept 304.08 70.84 4.29 <. EDUC - 27.71 6.44 - 4.31 <. It has P < .0001. So we include this variable in our model. Additional Note: The correlation between ResidualsMORT and NONWHITE is 0.0000. It is a property of linear regression (simple or multiple) that the correlation between the residuals and any explanatory variable in the model that produced them is 0.
Automatic Model Building in JMP :
Here are the results of the model building process for the Pollution data:
Stepwise Fit: Response: MORT Stepwise Regression Control Prob to Enter 0. Prob to Leave 0. Current Estimates SSE DFE MSE RSquare Cp AIC 72504 55 1318 0.6824 4.79 435. Lock Entered Parameter Estimate nDF SS F Ratio Prob>F X X Intercept 956.7 1 0 0.000 1. X PRECIP 1.725 1 10167 7.713 0. X EDUC - 1 4.00 1 5404 4.100 0. X NONWHITE 3.04 8 1 34372 26.074 0. Log(NOX) 0 1 1051 0 .795 0. X SqRt(SO2) 5.8 3 1 24858 18.857 0. Step History Step Parameter Action "Sig Prob" Seq SS RSquare Cp p 1 NONWHITE Entered 0.0000 94595 0.4144 45.03 2 2 EDUC Entered 0.0000 33848 0.5627 21.45 3 3 SqRt(SO2) Entered 0.0012 17157 0.6 378 10.48 4 4 PRECIP Entered 0.0075 10167 0.6824 4.79 5
Tables from the Full (5-factor) Model RSquare^ Summary of Fit 0.68 7 Observations 60 Analysis of Variance Source DF Sum of Squares Mean Square F Ratio Model 5 156820 31364 23. Error C. To (^) tal 5459 71452228273 1323 Prob > F<.
Term Estimate^ Parameter Estimates Std Error t Ratio Prob>|t| Intercept 950.35 93.23 10.19 <. PRECIP 2.01 0.698 2.88 0. EDUC - 14.71 6.963 - 2.11 0. NONWHITE Log(NOX) 2.8256.708 0.64857.526 4.360.89 <.00010. SqRt(SO2) 4.389 2.102 2.09 0. Effect Tests Source DF Sum of Squares F Ratio Prob > F PRECIP 1 10940 8.27 0. EDUC NONWHITE 11 590725120 4.4618.99 0.0392<. Log(NOX) 1 1051 0.7955 0. SqRt(SO2) 1 5767 4.36 0. Log(NOX) is not statistically significant, as was also claimed on p. 15 and shown in the table on p. 17.