Simple Linear Regression and Multiple Linear Regression | STAT 102, Study notes of Statistics

Material Type: Notes; Class: INTRO BUSINESS STAT; Subject: Statistics; University: University of Pennsylvania; Term: Unknown 1989;

Uploaded on 03/28/2010

Lecture 14 STAT 102

  • Outliers and influential observations in simple linear regression (Review)
  • Outliers and influential observations in multiple linear regression
  • Leverage plots

Outliers and influential points in simple regression

  • Does the age at which a child begins to talk predict a score on a test of mental ability at a later age?
  • gesell.JMP contains, for each child, the age at first word (x) and the Gesell Adaptive score (y), an ability test taken at a later age.
  • Child 18 is an outlier in the x direction, so it is a leverage point and potentially influential.
  • Child 19 is a regression outlier.

[Figure: scatterplot of Score against Age]

Leverage and Influential Points

• An observation has high leverage if it is an outlier in the x direction.
• An observation is influential if removing it would markedly change the least squares line.
• Observations that have high leverage and are outliers tend to be influential.
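As an illustration of "outlier in the x direction", the sketch below computes the standard leverage value for simple regression, h_i = 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2). The ages are made-up numbers, not the gesell.JMP data, and the cutoff of three times the average leverage is only one common rule of thumb, assumed here for illustration.

```python
import numpy as np

def leverage(x):
    """Leverage h_i = 1/n + (x_i - mean)^2 / sum((x - mean)^2) in simple regression."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / x.size + dev**2 / np.sum(dev**2)

# Hypothetical ages at first word (months); the last child is far from the rest.
age = np.array([9, 10, 10, 11, 11, 12, 15, 17, 20, 42])

h = leverage(age)
# Assumed rule of thumb: flag points whose leverage exceeds 3x the average leverage.
flagged = np.where(h > 3 * h.mean())[0]

print(np.round(h, 3))
print("high-leverage rows:", flagged)
```

In simple regression the leverages always sum to 2, so the average leverage is 2/n; a point far from the mean of x takes a disproportionate share of that total.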

Outliers and influential points in simple linear regression

  • To assess whether a point is influential, fit the least squares line with and without the point (exclude the row to fit the line without it) and see how much of a difference it makes.
  • Child 18 is highly influential; child 19 is not highly influential.
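The with-and-without comparison can be sketched in a few lines. The data below are hypothetical (they are not the gesell.JMP values), and the helper names `fit_line` and `fit_without` are illustrative only. Row 9 plays the role of a high-leverage point and row 7 the role of a regression outlier near the middle of the x range.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return intercept, slope

def fit_without(x, y, i):
    """Refit the line after excluding row i."""
    keep = np.arange(len(x)) != i
    return fit_line(x[keep], y[keep])

# Hypothetical data: row 9 is far out in x, row 7 has an unusually high y.
age = np.array([9, 10, 10, 11, 11, 12, 15, 17, 20, 42], dtype=float)
score = np.array([96, 100, 83, 102, 113, 105, 95, 121, 94, 57], dtype=float)

_, slope_all = fit_line(age, score)
_, slope_wo_lev = fit_without(age, score, 9)   # drop the high-leverage row
_, slope_wo_out = fit_without(age, score, 7)   # drop the regression outlier

print("slope, all rows:      ", round(slope_all, 3))
print("slope, without row 9: ", round(slope_wo_lev, 3))
print("slope, without row 7: ", round(slope_wo_out, 3))
```

In this made-up example, dropping the high-leverage row changes the slope far more than dropping the outlier in y, which is the same pattern the notes report for children 18 and 19.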

[Figure: scatterplot of Score (50 to 130) against Age (5 to 45), with children 18 and 19 labeled]

Bivariate Fit of Score By Age (w/o influential point #18)

[Figure: scatterplot of Score (70 to 130) against Age (5 to 25) with the fitted line]

Linear Fit

Score = 105.63 - 0.78 Age

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio
Model       1         280.5195       280.519   2.
Error      18        2220.4805       123.360   Prob > F
C. Total   19        2501.

Parameter Estimates

Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   105.62987    7.161928      14.75   <.
Age          -0.779221   0.516733      -1.

Bivariate Fit of Score By Age (all data)

[Figure: scatterplot of Score (50 to 130) against Age (5 to 45) with the fitted line]

Linear Fit

Score = 109.87 - 1.13 Age

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio
Model       1        1604.0809       1604.08   13.
Error      19        2308.5858        121.50   Prob > F
C. Total   20        3912.

Parameter Estimates

Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   109.87384    5.067802      21.68   <.
Age          -1.126989   0.310172      -3.63   0.

Conclusion: once child 18 is removed, it is not at all clear that scores and ages are related for normal children.

How do we identify outliers and influential points in multiple regression?

With leverage plots!

Based on the previous study we will use PRECIP, EDUC, NONWHITE, Log(NOX) and SO2 to predict MORT.

a) We will use stepwise selection, guided by the effect tests, to add predictors to (or delete them from) the model.

b) Under Analyze > Fit Model, assign MORT to Y and add PRECIP, EDUC, NONWHITE, Log(NOX) and SO2 to Construct Model Effects.

c) Choose Stepwise under Personality, then Run Model.

We will check (or uncheck) each variable according to its "F Ratio" statistic. The final model is chosen based on R-squared and the p-values. Usually, only variables that are significant should stay in the final model.

Here are the steps:

Stepwise Fit

Response: MORT

Stepwise Regression Control

Prob to Enter   0.
Prob to Leave   0.
Direction:

Step History

Step   Parameter   Action    "Sig Prob"   Seq SS     RSquare   Cp       p
1      NONWHITE    Entered   0.0000       94595.56   0.4144    43.366   2
2      EDUC        Entered   0.0000       33848.33   0.5627    20.206   3
3      SO2         Entered   0.0030       14603.66   0.6267    11.35    4
4      PRECIP      Entered   0.0138       8965.8     0.6659    6.6858   5

Current Estimates

Parameter   Estimate      nDF   SS         "F Ratio"   "Prob>F"
Intercept   999.316169    1     0          0.000       1.
PRECIP      1.61112351    1     8965.8     6.466       0.
EDUC        -15.773367    1     7056.097   5.089       0.
NONWHITE    3.06092577    1     34458.46   24.852      0.
Log(NOX)    .             1     3613.212   2.686       0.
SO2         0.3271823     1     21102.28   15.219      0.
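The idea behind the stepwise procedure above can be sketched outside JMP as a greedy forward selection. This is a minimal sketch, not JMP's algorithm: it enters variables by comparing a partial F ratio to a fixed threshold (`f_to_enter=10.0`, a made-up value) instead of using Prob to Enter, it never removes a variable, and it runs on simulated data, not the pollution data.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from regressing y on X (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=10.0):
    """Greedy forward selection: at each step enter the candidate with the largest partial F."""
    n, k = X.shape
    selected = []
    current = np.ones((n, 1))                 # start from the intercept-only model
    sse_current = sse(current, y)
    while len(selected) < k:
        best = None
        for j in range(k):
            if j in selected:
                continue
            trial = np.column_stack([current, X[:, j]])
            sse_trial = sse(trial, y)
            dfe = n - trial.shape[1]          # error degrees of freedom of the trial model
            f_ratio = (sse_current - sse_trial) / (sse_trial / dfe)
            if best is None or f_ratio > best[1]:
                best = (j, f_ratio, trial, sse_trial)
        if best[1] < f_to_enter:              # no remaining candidate is strong enough
            break
        selected.append(best[0])
        current, sse_current = best[2], best[3]
    return selected

# Simulated data: only columns 0 and 3 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)

selected = forward_select(X, y)
print("entered, in order:", selected)
```

Each step's partial F ratio is the same quantity as the "F Ratio" column in Current Estimates: the drop in SSE from adding the variable, divided by the error mean square of the larger model.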

Outliers in Multiple Regression

• Outliers in terms of multiple regression: observations with large residuals.
• If the residuals come from a normal distribution, then a residual with absolute value larger than about 2.6 s_e is expected only 1% of the time.
• Investigate observations with residuals of large magnitude.
• Residual plot for the regression of MORT on PRECIP, EDUC, NONWHITE and SO2: the four cities labeled on the plot have large residuals.
• Note that residual plots for multiple regression plot the residuals against the predicted values.

[Figure: MORT Residual against MORT Predicted (750 to 1050); labeled points: Lancaster, PA; Miami, FL; Albany, NY; New Orleans, LA]
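The "about 2.6 s_e, 1% of the time" rule can be checked directly from the normal distribution. The sketch below also turns it into a cutoff for the pollution model, assuming s_e is the square root of the Error mean square (1386.5) reported in the final-model output elsewhere in these notes.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))

p_beyond = two_sided_tail(2.6)
print(round(p_beyond, 4))        # close to 0.01

# Assumption: s_e = sqrt(Error mean square) for the final pollution model.
s_e = math.sqrt(1386.5)
cutoff = 2.6 * s_e
print(round(s_e, 2), round(cutoff, 1))
```

Under that assumption, residuals beyond roughly plus or minus 97 in the MORT residual plot would be the ones worth investigating.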

Leverage Plots

  • A "simple regression view" of a multiple regression coefficient. For x_j, plot the residuals of y (from the regression without x_j) against the residuals of x_j (from the regression of x_j on the rest of the x's). Both axes are recentered at their means.
  • The slope is the coefficient for that variable in the multiple regression.
  • The p-value is the same as the effect test p-value.
  • The distances from the points to the LS line are the multiple regression residuals.
  • Useful for identifying, relative to x_j, outliers, leverage points and influential points. (Use them the same way as in a simple regression.)
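The two properties listed above (the slope equals the multiple regression coefficient, and the distances to the line are the multiple regression residuals) can be verified numerically. The sketch below uses simulated data and builds the pair of residual vectors by hand. Unlike JMP's leverage plot it does not recenter the axes at the means, which shifts the picture but leaves the slope and residuals unchanged. The helper names are illustrative.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients for y regressed on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def partial_residuals(X, y, j):
    """Residuals of y and of column j after regressing each on all the other columns."""
    others = np.delete(X, j, axis=1)
    ry = y - others @ ols(others, y)
    rx = X[:, j] - others @ ols(others, X[:, j])
    return rx, ry

# Simulated data with two correlated predictors (not the pollution data).
rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])     # column 0 is the intercept
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

beta = ols(X, y)
rx, ry = partial_residuals(X, y, j=2)

slope = (rx @ ry) / (rx @ rx)                 # simple regression of ry on rx
full_resid = y - X @ beta

print("multiple regression coefficient:", round(beta[2], 6))
print("slope in the residual plot:     ", round(slope, 6))
print("same residuals:", np.allclose(ry - slope * rx, full_resid))
```

Because both residual vectors already have mean zero (the other columns include the intercept), the simple regression of `ry` on `rx` needs no intercept of its own.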

Pollution data: the final model

Summary of Fit

RSquare   0.

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio
Model       4        152013.34       38003.3   27.
Error      55         76259.74        1386.5   Prob > F
C. Total   59        228273.08                 <.

Parameter Estimates

Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   999.31617    92.07861      10.85   <.
PRECIP        1.6111235   0.633579      2.54   0.
EDUC        -15.77337     6.992113     -2.26   0.
NONWHITE      3.0609258   0.614004      4.99   <.
SO2           0.3271823   0.083867      3.90   0.

Effect Tests

Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
PRECIP         1    1         8965.800    6.4663   0.
EDUC           1    1         7056.097    5.0890   0.
NONWHITE       1    1        34458.462   24.8521   <.
SO2            1    1        21102.285   15.2194   0.

[Figure: Residual by Predicted Plot, MORT Residual against MORT Predicted (750 to 1150)]
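The Effect Tests and Parameter Estimates tables are tied together by simple arithmetic: for a one-degree-of-freedom effect, F Ratio = (effect sum of squares) / (error mean square), and it also equals the square of the t Ratio. The sketch below recomputes both from the numbers printed above; the dictionary layout is just for illustration.

```python
# Values copied from the final-model output: (effect SS, estimate, std error, reported F).
effects = {
    "PRECIP":   (8965.800,   1.6111235, 0.633579,  6.4663),
    "EDUC":     (7056.097, -15.77337,   6.992113,  5.0890),
    "NONWHITE": (34458.462,  3.0609258, 0.614004, 24.8521),
    "SO2":      (21102.285,  0.3271823, 0.083867, 15.2194),
}

mse = 76259.74 / 55          # Error sum of squares / Error DF

checks = {}
for name, (ss, estimate, std_error, reported_f) in effects.items():
    f_from_ss = ss / mse                 # effect SS over error mean square
    t_ratio = estimate / std_error
    checks[name] = (f_from_ss, t_ratio**2, reported_f)
    print(f"{name:9s} F from SS = {f_from_ss:8.4f}   t^2 = {t_ratio**2:8.4f}   reported = {reported_f}")
```

The small differences between the two recomputed columns come only from rounding in the printed estimates and standard errors.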

Whole Model

[Figure: Actual by Predicted Plot]

Summary of Fit

RSquare                      0.
RSquare Adj                  0.
Root Mean Square Error       37.
Mean of Response             940.
Observations (or Sum Wgts)   60

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio
Model       4        152013.34       38003.3   27.
Error      55         76259.74        1386.5   Prob > F
C. Total   59        228273.08                 <.

Parameter Estimates

Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   999.31617    92.07861      10.85   <.
PRECIP        1.6111235   0.633579      2.54   0.
EDUC        -15.77337     6.992113     -2.26   0.
NONWHITE      3.0609258   0.614004      4.99   <.
SO2           0.3271823   0.083867      3.90   0.

Effect Tests

Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
PRECIP         1    1         8965.800    6.4663   0.
EDUC           1    1         7056.097    5.0890   0.
NONWHITE       1    1        34458.462   24.8521   <.
SO2            1    1        21102.285   15.2194   0.

Leverage Plot

[Figure: MORT Leverage Residuals against NONWHITE Leverage, P<.; New Orleans, LA is labeled]

Bivariate Fit of Y Leverage of NONWHITE for MORT By X Leverage of NONWH

[Figure: Y Leverage of NONWHITE for MORT (800 to 1100) against X Leverage of NONWHITE for MORT (-5 to 35), with the fitted line; New Orleans, LA is labeled]

Linear Fit

Y Leverage of NONWHITE for MORT = 904.02358 + 3.0609258 X Leverage of NONWHITE for MORT

Summary of Fit

RSquare                      0.
RSquare Adj                  0.
Root Mean Square Error       36.
Mean of Response             940.
Observations (or Sum Wgts)   60

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio
Model       1         34458.46       34458.5   26.
Error      58         76259.74        1314.8   Prob > F
C. Total   59        110718.20                 <.

Parameter Estimates

Term                              Estimate     Std Error   t Ratio   Prob>|t|
Intercept                         904.02358    8.502028     106.33   <.
X Leverage of NONWHITE for MORT     3.0609258  0.597914       5.12   <.

  • The whole-model output is shown together with the leverage plot for NONWHITE.
  • We can reproduce the leverage plot via Analyze > Fit Model > Save Columns > Effect Leverage Pairs, then fitting Y leverage to X leverage in a simple regression (the Bivariate Fit output).
  • Notice that the coefficient for NONWHITE (3.0609258) is the same in both outputs.
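The two outputs share the same slope and the same Error sum of squares (76259.74), but the simple regression on the saved leverage pairs reports 58 error degrees of freedom where the whole model has 55. That difference alone appears to account for the slightly different standard error and t Ratio, as the arithmetic below suggests. The sqrt(55/58) scaling is my own reading of the two tables; the notes do not state it.

```python
import math

sse = 76259.74                  # Error sum of squares, identical in both outputs
estimate = 3.0609258            # NONWHITE coefficient, identical in both outputs
se_full = 0.614004              # Std Error of NONWHITE in the whole model (55 error DF)

mse_full = sse / 55             # whole-model error mean square
mse_lev = sse / 58              # error mean square in the leverage-pair regression

# Rescale the whole-model standard error by the ratio of the two root mean squares.
se_lev = se_full * math.sqrt(mse_lev / mse_full)
t_lev = estimate / se_lev

print(round(mse_lev, 1))        # compare with the reported Mean Square, 1314.8
print(round(se_lev, 6))         # compare with the reported Std Error, 0.597914
print(round(t_lev, 2))          # compare with the reported t Ratio, 5.12
```

So the refit reproduces the coefficient exactly, while its t Ratio (5.12) is slightly larger than the whole-model value (4.99); the whole-model effect test remains the one to report.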

Interpretation of Leverage Plots

• The highlighted observation, New Orleans, is a moderate outlier and has some leverage for estimating the coefficients of both SO2 and NONWHITE, and possibly of EDUC. Since New Orleans both has moderately high leverage and is an outlier, we suspect that it might be influential.