









The key points of these Applied Regression Analysis lecture notes are: Model Adequacy Checking, Residual Analysis, Residual Plots, Detection and Treatment of Outliers, PRESS Statistic, Testing for Lack of Fit, Major Assumptions, Regression Analysis, Zero Mean, Normally Distributed
Typology: Study notes










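In this chapter, we discuss some introductory aspects of model adequacy checking, including residual analysis, residual plots, detection and treatment of outliers, the PRESS statistic, and testing for lack of fit.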
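The major assumptions that we have made in regression analysis are:
1. The relationship between the response Y and the regressors is linear, at least approximately.
2. The error term ε has zero mean.
3. The error term ε has constant variance σ².
4. The errors are uncorrelated.
5. The errors are normally distributed.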
Assumptions 4 and 5 together imply that the errors are independent. Recall that assumption 5 is required for hypothesis testing and interval estimation.
(b) The estimate of the population variance computed from the n residuals is:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (e_i - \bar{e})^2}{n-p} = \frac{\sum_{i=1}^{n} e_i^2}{n-p} = \frac{SS_{Res}}{n-p} = MS_{Res}$$
(c) Since the sum of the residuals $e_i$ is zero, they are not independent. However, if the number of residuals ($n$) is large relative to the number of parameters ($p$), the dependency effect can be ignored in an analysis of residuals.
Standardized Residual: The quantity $d_i = e_i / \sqrt{MS_{Res}}$, $i = 1, 2, \ldots, n$, is called the standardized residual. The standardized residuals have mean zero and approximately unit variance.
Recall that
$$e = (I - H)Y = (I - H)(X\beta + \varepsilon) = (I - H)\varepsilon$$
Therefore,
$$\mathrm{Var}(e) = \mathrm{Var}[(I - H)\varepsilon] = (I - H)\,\mathrm{Var}(\varepsilon)\,(I - H)' = \sigma^2 (I - H)$$
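The last equality above relies on $I - H$ being symmetric and idempotent, and the first line on $(I - H)X = 0$. The following is a minimal numerical sketch of those facts, assuming a small made-up design matrix (the data, seed, and variable names are illustrative and not from the notes).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 2

# Hypothetical design matrix: an intercept column plus k made-up regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
p = X.shape[1]                      # number of parameters

# Hat matrix H = X (X'X)^{-1} X' and the residual-maker matrix M = I - H.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H
e = M @ y                           # residuals e = (I - H) y

leverages = np.diag(H)              # h_ii, so Var(e_i) = sigma^2 (1 - h_ii)

print("M symmetric:      ", np.allclose(M, M.T))
print("M idempotent:     ", np.allclose(M @ M, M))
print("(I - H) X = 0:    ", np.allclose(M @ X, 0.0))
print("sum of residuals: ", e.sum())
print("trace(H), p:      ", leverages.sum(), p)
```

Because the model contains an intercept, the residuals sum to zero (up to rounding), which is the dependency among residuals noted in property (c).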
R-student Residual: The quantity
$$r_{(-i)} = \frac{e_i}{\sqrt{S^2_{(-i)}\,(1 - h_{ii})}}, \quad i = 1, 2, \ldots, n,$$
is called the R-student residual or jackknife residual, where the quantity $S^2_{(-i)}$ is the residual variance computed with the $i$th observation removed. It can be shown that
$$S^2_{(-i)} = \frac{(n-p)\,MS_{Res} - e_i^2/(1 - h_{ii})}{n-p-1}$$
If the usual assumptions in regression analysis are met, the jackknife residual follows exactly a t -distribution with n − p − 1 degrees of freedom.
Example 1: Consider the following data:
The diagonal elements of the hat matrix are $h_{11} = 0.9252$, $h_{22} = 0.3832$, $h_{33} = 0.7030$, $h_{44} = 0.6096$, and $h_{55} = 0.3790$. For each observation $i = 1, \ldots, 5$, the residual variance with the $i$th observation removed is
$$S^2_{(-i)} = \frac{(n-p)\,MS_{Res} - e_i^2/(1 - h_{ii})}{n-p-1}$$
and the R-student residual is
$$r_{(-i)} = \frac{e_i}{\sqrt{S^2_{(-i)}\,(1 - h_{ii})}}$$
SAS Output: Residuals, Studentized Residuals and R-student Residuals
Obs Residuals student Rstudent
1 0.84112 1.16423 1.
2 -0.44860 -0.21618 -0.
3 0.15888 0.11034 0.
4 2.26168 1.36988 3.
5 -2.81308 -1.35107 -3.
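The scaled residuals in this table can be recomputed from the raw residuals and the hat-matrix diagonals quoted in Example 1. The sketch below assumes that the values 0.9252, 0.3832, 0.7030, 0.6096 and 0.3790 are the $h_{ii}$, and recovers $p$ from the fact that the $h_{ii}$ sum to $p$; both are inferences from the example, not statements in the SAS output.

```python
import numpy as np

# Residuals from the SAS output and hat-matrix diagonals h_ii from Example 1.
e = np.array([0.84112, -0.44860, 0.15888, 2.26168, -2.81308])
h = np.array([0.9252, 0.3832, 0.7030, 0.6096, 0.3790])

n = e.size
p = int(round(h.sum()))            # assumption: trace(H) = p, so p is recovered from the h_ii

ms_res = np.sum(e**2) / (n - p)    # MS_Res = SS_Res / (n - p)

standardized = e / np.sqrt(ms_res)
studentized = e / np.sqrt(ms_res * (1 - h))

# Residual variance with observation i removed, then the R-student residuals.
s2_deleted = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)
r_student = e / np.sqrt(s2_deleted * (1 - h))

for i in range(n):
    print(i + 1, round(studentized[i], 4), round(r_student[i], 4))
```

Under these assumptions the studentized values agree with the "student" column to about three decimal places (the $h_{ii}$ are only given to four digits), and the leading digits of the R-student values agree with the truncated "Rstudent" column.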
[Figure: Scatterplot of X2 versus X1]
(b) Plot of Residuals versus the Fitted values: A plot of the residuals (or the scaled residuals) versus the corresponding fitted values is useful for detecting several common types of model inadequacies.
If the plot of residuals versus the fitted values can be contained in a horizontal band, then there are no obvious model defects.
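A funnel or double-bow pattern indicates that the variance of the errors is not constant. In an outward-opening funnel, the variance increases as the fitted value increases. The double-bow often occurs when Y is a proportion between zero and one. The usual approach for dealing with inequality of variance is to apply a suitable transformation to either the regressor or the response variable.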
A curved plot indicates nonlinearity. This could mean that other regressor variables are needed in the model. For example, a squared term may be necessary. Transformations of the regressor and/or the response variable may be helpful in these cases.
A plot of residuals versus the predicted values may also reveal one or more unusually large residuals. These points are potential outliers. An extreme predicted value with a large residual could also indicate that either the variance is not constant or the true relationship between Y and X is not linear. These possibilities should be investigated before the points are considered outliers.
(c) Plot of Residuals versus the Regressors: Plotting the residuals versus corresponding values of each regressor variable can also be helpful. Once again a horizontal band containing the residuals is desirable. The funnel and double-bow patterns indicate nonconstant variance. The curved band or a nonlinear pattern in general indicates that the assumed relationship between Y and the regressor is not correct, so a higher-order term in the regressor (such as a squared term) or a transformation should be considered.
Note that in the simple linear regression it is not necessary to plot residuals versus both predicted values and the regressor variable since the predicted values are linear combinations of the regressor values.
(d) Plot of Residuals in Time sequence: It is a good idea to plot the residuals against time order, if the time sequence in which the data were collected is known. If a horizontal band encloses all of the residuals and the residuals fluctuate in a more or less random fashion within this band, then there is no obvious autocorrelation.
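A time-sequence plot is a visual check. As a numerical complement, one can compute the lag-1 sample autocorrelation of the residuals or the Durbin-Watson statistic. These two summaries are swapped in here for illustration and are not part of the notes; the two series below are simulated stand-ins for residuals, not data from the examples.

```python
import numpy as np

def lag1_autocorrelation(e):
    """Lag-1 sample autocorrelation of a residual series in time order."""
    e = np.asarray(e, dtype=float)
    e = e - e.mean()
    return np.sum(e[1:] * e[:-1]) / np.sum(e**2)

def durbin_watson(e):
    """Durbin-Watson statistic: near 2 when successive residuals are uncorrelated."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e**2)

rng = np.random.default_rng(1)
white = rng.normal(size=500)        # simulated residuals with no autocorrelation
drifting = np.cumsum(white)         # simulated residuals with strong positive autocorrelation

for name, series in [("white", white), ("drifting", drifting)]:
    print(name, lag1_autocorrelation(series), durbin_watson(series))
```

A lag-1 autocorrelation near zero (Durbin-Watson near 2) is what the "random fluctuation within a horizontal band" pattern corresponds to; values far from these suggest the residuals are correlated in time.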
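(f) Partial Residual plots: Suppose that the model contains the regressors $x_1, x_2, \ldots, x_k$. The partial residuals for the regressor $x_j$ are defined as
$$e_i^*(Y \mid x_j) = e_i + \hat{\beta}_j x_{ij}, \quad i = 1, 2, \ldots, n,$$
where the $e_i$ are the residuals from the model with all k regressors included. The partial residuals are plotted versus $x_{ij}$, and the interpretation of the partial residual plot is very similar to that of the partial regression plot.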
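A short sketch of how partial residuals can be computed, assuming the usual definition (residual plus the fitted coefficient times the regressor value). The data are simulated and the names `x1`, `x2`, and `partial_residuals` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Simulated data, used only to illustrate the computation.
x1 = rng.uniform(0, 30, size=n)
x2 = rng.uniform(0, 1500, size=n)
y = 2.0 + 1.5 * x1 + 0.01 * x2 + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat                      # residuals from the full model

def partial_residuals(j):
    """Partial residuals for column j of X: e_i + beta_hat_j * x_ij."""
    return e + beta_hat[j] * X[:, j]

pr1 = partial_residuals(1)
pr2 = partial_residuals(2)

# A straight line fitted to (x_j, partial residual) has slope beta_hat_j,
# because the residuals are orthogonal to every column of X.
slope1 = np.polyfit(x1, pr1, 1)[0]
slope2 = np.polyfit(x2, pr2, 1)[0]
print(slope1, beta_hat[1])
print(slope2, beta_hat[2])
```

Plotting `pr1` against `x1` (and `pr2` against `x2`) gives the partial residual plots. Curvature around the fitted line would suggest that the regressor enters the model nonlinearly.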
Example 2 (Delivery Time Data): A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time ( Y ) are the number of cases of
has collected 40 observations on delivery time.
SAS Output:
Regression model of Y on X1 and X2: y = 2.4123 + 1.6392 x1 + 0.0136 x2 (N = 20)
[Figure: Q-Q plot of R-student residuals versus normal quantiles (CDF of RSTUDENT)]
[Figure: Plot versus predicted value for the regression model of Y on X1 and X2]
[Figure: Partial residual plot, pr1 versus x1]
[Figure: Partial residual plot, pr2 versus x2]
The PRESS residuals are defined as $e_{(-i)} = y_i - \hat{y}_{(-i)}$, $i = 1, 2, \ldots, n$, where $\hat{y}_{(-i)}$ is the predicted value of the $i$th observed response based on a fit to the remaining $n - 1$ sample points. Large PRESS residuals are potentially useful in identifying observations where the model does not fit the data well or observations for which the model is likely to provide poor future predictions. The PRESS statistic is defined by
$$PRESS = \sum_{i=1}^{n} e_{(-i)}^2 = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2$$
PRESS is generally regarded as a measure of how well a regression model will perform in predicting new data. One very important use of the PRESS statistic is in comparing regression models. Generally, a model with a small value of PRESS is desired. The $R^2$ for prediction based on PRESS is
$$R^2_{Prediction} = 1 - \frac{PRESS}{SS_T}$$
This statistic gives some indication of the predictive capability of the regression model.
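The PRESS statistic does not require n separate fits, because each PRESS residual equals the ordinary residual divided by one minus the corresponding hat-matrix diagonal. The sketch below checks that shortcut against an explicit leave-one-out loop and then computes the prediction R-squared. The data are simulated, and the helper `ols_fit` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20

# Simulated data, used only to illustrate the computation.
x1 = rng.uniform(0, 30, size=n)
x2 = rng.uniform(0, 1500, size=n)
y = 2.0 + 1.5 * x1 + 0.01 * x2 + rng.normal(scale=3.0, size=n)
X = np.column_stack([np.ones(n), x1, x2])

def ols_fit(X, y):
    """Least-squares coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

beta_hat, e = ols_fit(X, y)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))    # hat-matrix diagonals h_ii

# Shortcut: PRESS residual = e_i / (1 - h_ii).
press = np.sum((e / (1 - h)) ** 2)

# Brute force: refit without observation i and predict y_i.
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i, _ = ols_fit(X[keep], y[keep])
    press_loo += (y[i] - X[i] @ beta_i) ** 2

ss_t = np.sum((y - y.mean()) ** 2)
ss_res = np.sum(e**2)
r2 = 1 - ss_res / ss_t
r2_prediction = 1 - press / ss_t

print(press, press_loo)
print(r2, r2_prediction)
```

Since every $1 - h_{ii}$ is less than one, PRESS is never smaller than $SS_{Res}$, so the prediction R-squared is never larger than the ordinary R-squared.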
Example 2 (Cont.):
$$R^2 = 1 - \frac{SS_{Res}}{SS_T} = 0.9525$$
$$R^2_{Prediction} = 1 - \frac{PRESS}{SS_T} = 0.8903$$
Therefore, we could expect this model to “explain” about 89.03% of the variation in predicting new observations, as compared to approximately 95.25% of the variability in the original data explained by the least-squares fit.
Lack of Fit of the Regression Model: