Model Adequacy Checking - Applied Regression Analysis - Lecture Notes, Study notes of Mathematical Statistics

The key points of these lecture notes on Applied Regression Analysis are: Model Adequacy Checking, Residual Analysis, Residual Plots, Detection and Treatment of Outliers, PRESS Statistic, Testing for Lack of Fit, Major Assumptions, Regression Analysis, Zero Mean, Normally Distributed

Typology: Study notes

2012/2013

Uploaded on 01/11/2013

m-alam

Chapter 4: Model Adequacy Checking

In this chapter, we discuss some introductory aspects of model adequacy checking, including:

  • Residual Analysis,
  • Residual plots,
  • Detection and treatment of outliers,
  • The PRESS statistic
  • Testing for lack of fit.

The major assumptions that we have made in regression analysis are:

  • The relationship between the response Y and the regressors is linear, at least approximately.
  • The error term ε has zero mean.
  • The error term ε has constant variance $\sigma^2$.
  • The errors are uncorrelated.
  • The errors are normally distributed.

Assumptions 4 and 5 (uncorrelated and normally distributed errors) together imply that the errors are independent. Recall that assumption 5 is required for hypothesis testing and interval estimation.

Residual Analysis: The residuals $e_1, e_2, \ldots, e_n$ have the following important properties:

(a) The mean of $e_i$ is 0.

(b) The estimate of population variance computed from the $n$ residuals is:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (e_i - \bar{e})^2}{n-p} = \frac{\sum_{i=1}^{n} e_i^2}{n-p} = \frac{SS_{Res}}{n-p} = MS_{Res}$$

(c) Since the sum of the $e_i$ is zero, they are not independent. However, if the number of residuals ($n$) is large relative to the number of parameters ($p$), the dependency effect can be ignored in an analysis of residuals.

Standardized Residual: The quantity

$$d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, 2, \ldots, n,$$

is called the standardized residual. The standardized residuals have mean zero and approximately unit variance. A large standardized residual ($|d_i| > 3$) potentially indicates an outlier.
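The definition above can be sketched numerically. The following is a minimal illustration on simulated data (the data, seed, and variable names are assumptions made for this example, not part of the notes): it fits a least-squares model with NumPy, forms $MS_{Res}$, and flags observations whose standardized residual exceeds 3 in absolute value.

```python
import numpy as np

# Simulated data (hypothetical): n observations, intercept plus two regressors.
rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x1, x2])  # p = 3 parameters
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 1, n)

# Least-squares fit and ordinary residuals e_i = y_i - yhat_i.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# Residual mean square MS_Res = SS_Res / (n - p).
p = X.shape[1]
ms_res = np.sum(e**2) / (n - p)

# Standardized residuals d_i = e_i / sqrt(MS_Res).
d = e / np.sqrt(ms_res)

# Indices of potential outliers under the |d_i| > 3 rule of thumb.
flagged = np.flatnonzero(np.abs(d) > 3)
print("sum of residuals:", e.sum())
print("flagged observations:", flagged)
```

Because the model contains an intercept, the residuals sum to zero up to floating-point error, which is property (a) of the residuals.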

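Recall that

$$e = (I - H)Y = (I - H)(X\beta + \varepsilon) = (I - H)\varepsilon$$

Therefore,

$$\mathrm{Var}(e) = \mathrm{Var}[(I - H)\varepsilon] = (I - H)\,\mathrm{Var}(\varepsilon)\,(I - H)' = \sigma^2 (I - H).$$

In particular, $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$, so the residuals do not all have the same variance.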
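The result $\mathrm{Var}(e) = \sigma^2 (I - H)$ relies on $I - H$ being symmetric and idempotent and on $(I - H)X = 0$. A small numerical check of these facts, on an arbitrary simulated design matrix (an assumption for illustration only), might look like this:

```python
import numpy as np

# Hypothetical design matrix with an intercept and two regressors.
rng = np.random.default_rng(1)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X', computed with a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H  # M = I - H

# Residuals obtained two ways: y - X beta_hat and (I - H) y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

print("H symmetric:      ", np.allclose(H, H.T))
print("H idempotent:     ", np.allclose(H @ H, H))
print("trace(H) equals p:", np.isclose(np.trace(H), p))
print("(I - H) X = 0:    ", np.allclose(M @ X, 0))
print("e = (I - H) y:    ", np.allclose(M @ y, e))
print("(I-H)(I-H)' = I-H:", np.allclose(M @ M.T, M))
```

The last check is the step that collapses $(I - H)\,\sigma^2 I\,(I - H)'$ to $\sigma^2 (I - H)$.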

R-student Residual: The quantity

$$r_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, 2, \ldots, n,$$

is called the R-student residual or jackknife residual, where the quantity $S_{(i)}^2$ is the residual variance computed with the $i$th observation removed. It can be shown that

$$S_{(i)}^2 = \frac{(n-p)\,MS_{Res} - e_i^2 / (1 - h_{ii})}{n - p - 1}$$

If the usual assumptions in regression analysis are met, the jackknife residual follows exactly a $t$-distribution with $n - p - 1$ degrees of freedom.
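The shortcut for the deleted residual variance, $S_{(i)}^2 = [(n-p)MS_{Res} - e_i^2/(1-h_{ii})]/(n-p-1)$, can be checked against an explicit refit with observation $i$ removed. The sketch below uses simulated data, and the helper names `r_student` and `s2_deleted_bruteforce` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def r_student(X, y):
    """R-student residuals and deleted variances via the shortcut formula."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)                      # leverages h_ii
    e = y - H @ y                       # ordinary residuals
    ms_res = e @ e / (n - p)
    s2_del = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)
    return e / np.sqrt(s2_del * (1 - h)), s2_del

def s2_deleted_bruteforce(X, y, i):
    """Residual variance from refitting the model without observation i."""
    keep = np.arange(len(y)) != i
    Xi, yi = X[keep], y[keep]
    b, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
    res = yi - Xi @ b
    return res @ res / (len(yi) - X.shape[1])   # (n - 1) - p degrees of freedom

# Hypothetical data for the comparison.
rng = np.random.default_rng(2)
n = 15
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(0, 1, n)

r, s2_shortcut = r_student(X, y)
s2_refit = np.array([s2_deleted_bruteforce(X, y, i) for i in range(n)])
print("shortcut matches refit:", np.allclose(s2_shortcut, s2_refit))
```

The shortcut needs only one fit of the full model, whereas the brute-force version refits the model $n$ times.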

Example 1: Consider the following data:

[Data table with columns y, x1, x2 and the matrices X'X and (X'X)^{-1} were not recovered.]

The hat matrix is $H = X(X'X)^{-1}X'$

⇒ $h_{11} = 0.9252$, $h_{22} = 0.3832$, $h_{33} = 0.7030$, $h_{44} = 0.6096$, $h_{55} = 0.3790$

With $n = 5$, $p = 3$ and $MS_{Res} = 6.97$, the deleted residual variances $S_{(i)}^2 = \dfrac{(n-p)\,MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}$ are:

$$S_{(1)}^2 = \frac{(5-3)(6.97) - (0.84)^2/(1 - 0.9252)}{5 - 3 - 1}$$

$$S_{(2)}^2 = \frac{(5-3)(6.97) - (-0.45)^2/(1 - 0.3832)}{5 - 3 - 1}$$

$$S_{(3)}^2 = \frac{(5-3)(6.97) - (0.16)^2/(1 - 0.7030)}{5 - 3 - 1}$$

$$S_{(4)}^2 = \frac{(5-3)(6.97) - (2.26)^2/(1 - 0.6096)}{5 - 3 - 1}$$

$$S_{(5)}^2 = \frac{(5-3)(6.97) - (-2.81)^2/(1 - 0.3790)}{5 - 3 - 1}$$

The R-student residuals are then

$$r_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, 2, \ldots, 5.$$

SAS Output: Residuals, Studentized Residuals and R-student Residuals

Obs   Residuals   student    Rstudent
1      0.84112    1.16423    1.
2     -0.44860   -0.21618   -0.
3      0.15888    0.11034    0.
4      2.26168    1.36988    3.
5     -2.81308   -1.35107   -3.
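The arithmetic of Example 1 can be reproduced from the quantities that appear in the notes: the five residuals, the leverages $h_{ii}$, and $MS_{Res} = 6.97$. The sketch below assumes that the "student" column is the internally studentized residual $e_i/\sqrt{MS_{Res}(1-h_{ii})}$, which is consistent with the listed values but is not stated explicitly in the notes.

```python
import numpy as np

# Values taken from Example 1 and the SAS output.
e = np.array([0.84112, -0.44860, 0.15888, 2.26168, -2.81308])  # residuals
h = np.array([0.9252, 0.3832, 0.7030, 0.6096, 0.3790])         # leverages h_ii
n, p = 5, 3
ms_res = 6.97

# Deleted residual variances S^2_(i) from the shortcut formula.
s2_del = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)

# Studentized residuals (assumed definition of the "student" column).
student = e / np.sqrt(ms_res * (1 - h))

# R-student residuals.
r = e / np.sqrt(s2_del * (1 - h))

for i in range(n):
    print(f"obs {i + 1}: student = {student[i]: .4f}, Rstudent = {r[i]: .4f}")
```

The computed studentized residuals agree closely with the "student" column; small differences arise because $MS_{Res}$ is rounded to 6.97. Note also that the leverages sum to $p = 3$, as they must.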

[Figure: Scatter plot of X2 versus X1]

(b) Plot of Residuals versus the Fitted values: A plot of the residuals $e_i$ (or the scaled residuals $d_i$, $t_i$ or $r_i$) versus the corresponding fitted values $\hat{y}_i$ is useful for detecting several common types of model inadequacies.

If the plot of residuals versus the fitted values can be contained in a horizontal band, then there are no obvious model defects.

The outward-opening funnel pattern implies that the variance of ε is an increasing function of Y. An inward-opening funnel indicates that the variance of ε decreases as Y increases. The double-bow often occurs when Y is a proportion between zero and one. The usual approach for dealing with inequality of variance is to apply a suitable transformation to either the regressor or the response variable.

A curved plot indicates nonlinearity. This could mean that other regressor variables are needed in the model. For example, a squared term may be necessary. Transformation on the regressor and/or the response variable may be helpful in these cases.

A plot of residuals versus the predicted values may also reveal one or more unusually large residuals. These points are potential outliers. An extreme predicted value with a large residual could also indicate that either the variance is not constant or the true relationship between Y and X is not linear. These possibilities should be investigated before the points are considered outliers.
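The outward-opening funnel described above can also be seen numerically, without drawing the plot. In the sketch below, the data are simulated so that the error standard deviation grows with the regressor (an assumption made purely for illustration), and the spread of the residuals is compared between the lower and upper halves of the fitted values.

```python
import numpy as np

# Simulated heteroscedastic data: error s.d. proportional to x (hypothetical).
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(1, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5 * x)

# Straight-line least-squares fit.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
e = y - fitted

# Split the residuals by fitted value and compare their spread.
order = np.argsort(fitted)
low = e[order[: n // 2]]     # residuals at small fitted values
high = e[order[n // 2:]]     # residuals at large fitted values
spread_ratio = high.std(ddof=1) / low.std(ddof=1)
print("ratio of residual s.d. (high / low fitted values):", spread_ratio)
```

A ratio well above 1 corresponds to the outward-opening funnel; a ratio near 1 corresponds to the horizontal band. This is only an informal diagnostic and does not replace looking at the plot.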

(c) Plot of Residuals versus the Regressors: Plotting the residuals versus corresponding values of each regressor variable can also be helpful. Once again a horizontal band containing the residuals is desirable. The funnel and double-bow patterns indicate nonconstant variance. The curved band or a nonlinear pattern in general indicates that the assumed relationship between Y and the regressor $X_j$ is not correct. Thus, either higher-order terms in $X_j$ (such as $X_j^2$) or a transformation should be considered.

Note that in simple linear regression it is not necessary to plot residuals versus both the predicted values and the regressor variable, since the predicted values are a linear function of the regressor values.

(d) Plot of Residuals in Time sequence: It is a good idea to plot the residuals against time order, if the time sequence in which the data were collected is known. If a horizontal band encloses all of the residuals and the residuals fluctuate in a more or less random fashion within this band, then there is no evidence of autocorrelation.
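As a numerical companion to the time-sequence plot, one can compute the lag-1 sample autocorrelation of the residuals and the Durbin-Watson statistic $\sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2$. The Durbin-Watson statistic is not discussed in these notes; it is brought in here only as a standard summary, and the simulated AR(1) errors and helper names are assumptions for illustration.

```python
import numpy as np

def lag1_autocorr(e):
    """Lag-1 sample autocorrelation of a residual series in time order."""
    e = e - e.mean()
    return (e[1:] @ e[:-1]) / (e @ e)

def durbin_watson(e):
    """Durbin-Watson statistic; values near 2 suggest no lag-1 autocorrelation."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def simulate_residuals(rho, n=300, seed=2):
    """Residuals from a straight-line fit when errors follow an AR(1) process."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)
    shocks = rng.normal(0, 1, n)
    eps = np.zeros(n)
    eps[0] = shocks[0]
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + shocks[t]
    y = 1.0 + 0.5 * x + eps
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta_hat

e_corr = simulate_residuals(rho=0.8)   # positively autocorrelated errors
e_indep = simulate_residuals(rho=0.0)  # uncorrelated errors
print("autocorrelated:", lag1_autocorr(e_corr), durbin_watson(e_corr))
print("uncorrelated:  ", lag1_autocorr(e_indep), durbin_watson(e_indep))
```

Positively autocorrelated residuals give a lag-1 autocorrelation well above zero and a Durbin-Watson value well below 2, which matches a time plot in which residuals drift in long runs rather than scattering randomly.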

(f) Partial Residual plots: Suppose that the model contains the regressors $X_1, X_2, \ldots, X_k$. The partial residuals for regressor $X_j$ are defined as

$$e_i^*(Y \mid x_j) = e_i + \hat{\beta}_j x_{ij}, \quad i = 1, 2, \ldots, n,$$

where the $e_i$ are the residuals from the model with all $k$ regressors included. The partial residuals are plotted versus $x_{ij}$, and the interpretation of the partial residual plot is very similar to that of the partial regression plot.
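A short sketch of the partial residual computation follows, using simulated data with two regressors (the data and names are assumptions for illustration). It also checks a useful property: a straight line fitted to the partial residuals against $x_j$ has slope equal to $\hat{\beta}_j$, because the ordinary residuals are orthogonal to both the intercept and $x_j$.

```python
import numpy as np

# Hypothetical data: intercept plus two regressors.
rng = np.random.default_rng(3)
n = 40
x1 = rng.uniform(0, 30, n)
x2 = rng.uniform(0, 1500, n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 1.6 * x1 + 0.015 * x2 + rng.normal(0, 3, n)

# Full-model fit and ordinary residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# Partial residuals for regressor j: e_i + beta_hat_j * x_ij.
j = 2                                   # column index of x2 in X
partial = e + beta_hat[j] * X[:, j]

# Slope of the least-squares line through the points (x_ij, partial residual).
slope, intercept = np.polyfit(X[:, j], partial, 1)
print("beta_hat_j:", beta_hat[j], " slope of partial residual plot:", slope)
```

In a partial residual plot, systematic curvature around this line suggests that $x_j$ should enter the model through a transformation or a higher-order term.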

Example 2 (Delivery Time Data): A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time ( Y ) are the number of cases of product stocked ( $X_1$ ) and the distance walked by the route driver ( $X_2$ ). The engineer has collected 40 observations on delivery time.

SAS Output:

Regression Model Y on X1 and X2

Fitted model: y = 2.4123 + 1.6392 x1 + 0.0136 x2

N = 20, Rsq = 0.9525, Adj Rsq = 0.9469, RMSE = 3.7303

[Figure: Normal probability plot, CDF of RSTUDENT]

[Figure: Q-Q plot of Rstudent residuals; Rstudent versus Normal Quantiles]

[Figure: Rstudent residuals versus Predicted Value]

[Figure: Partial residual plot, pr1 versus x1]

[Figure: Partial residual plot, pr2 versus x2]

PRESS Statistic: PRESS residuals are defined by $e_{(i)} = y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the predicted value of the $i$th observed response based on a fit to the remaining $n - 1$ sample points. Large PRESS residuals are potentially useful in identifying observations where the model does not fit the data well or observations for which the model is likely to provide poor future predictions. The PRESS statistic is defined by

$$PRESS = \sum_{i=1}^{n} \left[ y_i - \hat{y}_{(i)} \right]^2 = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2$$

PRESS is generally regarded as a measure of how well a regression model will perform in predicting new data. One very important use of the PRESS statistic is in comparing regression models. Generally, a model with a small value of PRESS is desired. The PRESS statistic can also be used to compute an $R^2$-like statistic for prediction, say

$$R^2_{Prediction} = 1 - \frac{PRESS}{SS_T}$$

This statistic gives some indication of the predictive capability of the regression model.
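The identity behind PRESS, namely that the leave-one-out prediction error equals $e_i/(1-h_{ii})$, means that PRESS can be computed from a single fit. The sketch below checks this against an explicit leave-one-out loop and then forms $R^2$ and $R^2_{Prediction}$. The data are simulated and do not reproduce the delivery time example.

```python
import numpy as np

# Hypothetical data: intercept plus two regressors.
rng = np.random.default_rng(4)
n = 25
X = np.column_stack([np.ones(n), rng.uniform(0, 30, n), rng.uniform(0, 1500, n)])
y = X @ np.array([2.0, 1.6, 0.015]) + rng.normal(0, 3, n)

# Single full fit: residuals and leverages.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y

# PRESS via the shortcut e_i / (1 - h_ii).
press = np.sum((e / (1 - h)) ** 2)

# PRESS via explicit leave-one-out refits, for comparison.
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_loo += (y[i] - X[i] @ b) ** 2

# Ordinary R^2 and the prediction R^2 based on PRESS.
ss_t = np.sum((y - y.mean()) ** 2)
ss_res = e @ e
r2 = 1 - ss_res / ss_t
r2_pred = 1 - press / ss_t
print("PRESS (shortcut):", press, " PRESS (leave-one-out):", press_loo)
print("R^2:", r2, " R^2 prediction:", r2_pred)
```

Since $0 < 1 - h_{ii} \le 1$, PRESS is never smaller than $SS_{Res}$, so $R^2_{Prediction}$ cannot exceed $R^2$.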

Example 2 (Cont.):

$$R^2 = 1 - \frac{SS_{Res}}{SS_T} = 0.9525$$

$$R^2_{Prediction} = 1 - \frac{PRESS}{SS_T} = 0.8903$$

Therefore, we could expect this model to “explain” about 89.03% of the variation in predicting new observations, as compared to approximately 95.25% of the variability in the original data explained by the least-squares fit.

Lack of Fit of the Regression Model: