Homoscedasticity and Independence of Residuals in Linear Regression, Exercises of Design

This paper examines the properties of residuals in linear regression, focusing on their homoscedasticity and independence. It also explores the effect of monotonic heteroscedasticity on residuals and the use of transformed residuals for testing against alternative hypotheses. The paper was originally presented at the Georgia Institute of Technology in 1967.



INDEPENDENT STEPWISE RESIDUALS FOR TESTING HOMOSCEDASTICITY

A. Hedayat and D. S. Robson, Cornell University

ABSTRACT

Regression models which specify independent, homoscedastic and normally distributed errors may be analyzed in a stepwise manner to produce calculated residuals having this same property. If the n'th residual is calculated as the deviation of the n'th observation from its predicted value based on a least squares fit to only the first n observations then the resulting sequence of residuals, appropriately normalized, are not only mutually independent and homoscedastic but also are independent of all of the calculated regression functions. If error variance is a monotonic function of the mean then, under certain regularity conditions, the calculated stepwise residuals are likewise monotonically heteroscedastic. Simple linear regression with equally spaced values of the independent variable constitutes one such regular case, and a Monte Carlo study of the "peak-test" of homoscedasticity in this instance shows that for small samples the stepwise residuals are substantially more sensitive to monotonic heteroscedasticity than conventional, untransformed residuals.


1. INTRODUCTION AND SUMMARY

Consider the fixed effects general linear model

$$Y = X\beta + \epsilon \qquad (1)$$

where $Y$ is an $N$-vector of responses, $X$ is an $N \times p$ matrix with rank $r \le p$ having either fixed known coefficients or coefficients that are stochastically independent of the error term, $\beta$ is a $p$-vector of unknown parameters, and $\epsilon$ is an $N$-vector of unknown stochastic components with mean zero, usually called the error (residual or disturbance) vector. Linear models dealt with in practice usually include in their basic structure the assumption that the covariance matrix of $\epsilon$ is $\sigma^2 I_N$, where $\sigma^2$ is a scalar and $I_N$ denotes the identity matrix of order $N$. Specifically it is often assumed that $\epsilon \sim N(0, \sigma^2 I_N)$. A diagnosis of the validity of these conditions imposed on the linear model residuals is impeded by the fact that under the usual hypothesis of independent and identically distributed errors, deviations from the least squares fit are neither independent nor, in general, identically distributed. Calculated residuals $e = Y - X\hat\beta$ are linear functions of the true errors $\epsilon = Y - X\beta$ at the $N$ points of the experimental design and are subject to linear constraints equal

This paper was originally presented under the title "Homoscedasticity in Linear Regression Analysis with Equally Spaced X's," on April 4, 1967 at the one hundred fourteenth meeting of the Institute of Mathematical Statistics, at the Georgia Institute of Technology, Atlanta, Georgia. Research supported in part by Grant Number GB-4502 from the National Science Foundation. Paper No. BU-135 in the Biometrics Unit and No. 524 in the Department of Plant Breeding and Biometry, Cornell University, Ithaca, N.Y.

residuals and their expected values with respect to any specified heteroscedastic model, or the "peak-test" of heteroscedasticity as developed by Goldfeld and Quandt [1, 2]. For further discussion of the use of transformed residuals see [5], [6], [7], and [8]. As shown in Section 2, an uncorrelated set of residuals retaining essentially the same intuitive appeal as the original residuals may be obtained by a stepwise fitting of the linear model to successively more observations. Thus, if $x_n\hat\beta_{(n)}$ is the predicted value of $Y_n$ calculated by fitting only the first $n$ observations $Y_1, \ldots, Y_n$ to the linear model $Y = X\beta + \epsilon$ then, excluding all $n$ for which $x_n\hat\beta_{(n)} = Y_n$, the residuals in the sequence

$$f_n = Y_n - x_n\hat\beta_{(n)}$$

are linearly uncorrelated if the components of $\epsilon$ are uncorrelated and homoscedastic. The degenerate case $Y_n = x_n\hat\beta_{(n)}$ arises when inclusion of the $n$'th observation increases the rank of the design matrix (by unity).* Normalizing scalars $c_n = \sigma_\epsilon / \sigma_{f_n}$ are known constants and the residuals $d_n = c_n f_n$ are then linearly uncorrelated with common variance $\sigma^2_{y \cdot x} = \sigma^2_\epsilon$, and if the $\epsilon$-distribution is normal then so is the $d$-distribution.

*If the $r$ degeneracies occur at $Y_1, \ldots, Y_r$ then $f_n$ could be defined as $Y_n - x_n\hat\beta_{(n-1)}$ for $n > r$ in order to give more weight to $\epsilon_n$; in any other case, however, the $f_n$ so defined would depend on $\beta$ as well as $\epsilon$.
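The stepwise construction described above can be sketched numerically. The following is a minimal illustration, not the authors' procedure verbatim: the function name `stepwise_residuals`, the tolerance `tol`, and the use of NumPy's pseudoinverse are assumptions made here. It uses the fact that under homoscedasticity $\operatorname{Var}(f_n) = \sigma^2_\epsilon (1 - h_{nn})$, where $h_{nn}$ is the last diagonal element of the hat matrix for the fit to the first $n$ observations, so that $c_n = (1 - h_{nn})^{-1/2}$.

```python
import numpy as np

def stepwise_residuals(X, y, tol=1e-10):
    """Normalized stepwise residuals d_n = c_n * f_n (hypothetical helper).

    f_n is the deviation of y_n from its fitted value when only the first n
    rows are used. Steps at which the n-th row raises the rank of the design
    matrix are fitted exactly (h_nn = 1) and are skipped.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = []
    for n in range(1, len(y) + 1):
        Xn, yn = X[:n], y[:n]
        pinv = np.linalg.pinv(Xn)          # generalized inverse, shape (p, n)
        beta_n = pinv @ yn                 # least squares fit to first n rows
        h_nn = Xn[-1] @ pinv[:, -1]        # last diagonal entry of the hat matrix
        if 1.0 - h_nn < tol:               # degenerate step: f_n is identically 0
            continue
        f_n = yn[-1] - Xn[-1] @ beta_n
        d.append(f_n / np.sqrt(1.0 - h_nn))
    return np.array(d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.arange(1.0, 11.0)
    X = np.column_stack([np.ones_like(x), x])
    y = 1.0 + 2.0 * x + rng.standard_normal(10)
    print(stepwise_residuals(X, y))
```

For a full-rank design with $p$ columns the first $p$ steps are degenerate and are skipped, so $N - p$ values are returned; their sum of squares equals the residual sum of squares of the fit to all $N$ observations, which gives a convenient check on an implementation.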

The set of numbers $\{d_n\}$ obtained in this manner depends upon the ordering imposed on the set of $N$ observations; for a given set of $N$ observations there are $N!/r!$ possible sets $\{d_n\}$. The choice of a particular set, again, will depend upon the statistician's objective in analyzing residuals. Fortunately, this choice may be made to depend upon calculated values of the regression functions, $X\hat\beta_{(n)}$, for any $n$, without affecting the probability distribution of $\{d_n\}$ under the homoscedastic normal hypothesis. Since residuals are statistically independent of estimated regression functions then in constructing a set of residuals $d_{r+1}, \ldots, d_N$ to test for monotonic heteroscedasticity, for example, $d_N$ may be chosen as the normalized residual associated with the largest of the $N$ predicted values $X\hat\beta$. Similarly, $Y_{N-1}$ may be defined as the observed $Y$ at the design point corresponding to the second largest of $X\hat\beta$, and so on. In the simplest and, in the present context, degenerate case where $Y_1, \ldots, Y_N$ are assumed to be identically distributed, say $Y_i = \alpha + \epsilon_i$, the $N$ predicted values $X\hat\beta$ are identically $\hat\alpha$. For any given ordering of the observations, specified by some external consideration, the sequence $d_2, \ldots, d_N$ becomes the Helmert statistics as employed by Hogg, for example, in his heuristic method of iterated tests for equality of means [4]. We note that his iterative scheme may in general be applied to the sequence of test statistics

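$$F_{1,\,i-r-1} = \frac{(i-r-1)\,d_i^2}{d_{r+1}^2 + \cdots + d_{i-1}^2}, \qquad i = r+2, \ldots, N,$$

to test the sequence of nested hypotheses $H_{r+2}: \sigma^2_{\epsilon_1} = \cdots = \sigma^2_{\epsilon_{r+2}}$, $H_{r+3}: \sigma^2_{\epsilon_1} = \cdots = \sigma^2_{\epsilon_{r+3}}$, ..., $H_N: \sigma^2_{\epsilon_1} = \cdots = \sigma^2_{\epsilon_N}$.

If $H_k$ is true (and the $\epsilon$'s are normally distributed) then $F_{1,1}, \ldots, F_{1,k-r-1}$ are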

correct ordering of the observations there are design configurations for which monotonicity of $\sigma^2_{y \cdot x}$ is not sufficient to guarantee monotonicity of the sequence $\{\sigma^2_{d_n}\}$. Counterexamples violating this property are easily constructed with simple linear regression models; in the important special case of simple linear regression with equally spaced values of $x$, however, the monotonicity is preserved.

This fact is demonstrated in Section 3.

The utility of independent residuals is illustrated in Section 4 where the "peak-test" developed by Goldfeld and Quandt [1] is applied to simulated residuals from a simple linear regression with equally spaced values of $x$. This test was devised to detect monotonic trends in a sequence of random variables, and the distribution of the peak-test statistic was tabulated for the case of independent and identically distributed (continuous) random variables. Monte Carlo computations are given here, comparing the properties of the peak-test applied to $|d_3|, \ldots, |d_N|$ and applied (as in Goldfeld and Quandt) to the original residuals $|e_1|, \ldots, |e_N|$.

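2. ZERO CORRELATION BETWEEN OLD AND NEW RESIDUALS WHEN ADDITIONAL OBSERVATIONS ARE INCORPORATED INTO A LINEAR (MULTIPLE) REGRESSION ANALYSIS

Let us rewrite the model (1) in the following form

$$\begin{bmatrix} Y_{(1)} \\ Y_{(2)} \end{bmatrix} = \begin{bmatrix} X_{(1)} \\ X_{(2)} \end{bmatrix} \beta + \begin{bmatrix} \epsilon_{(1)} \\ \epsilon_{(2)} \end{bmatrix}$$

where $Y_{(1)}$ and $Y_{(2)}$ contain $n$ and $N-n$ observations respectively. Now suppose that we ignore the $Y_{(2)}$ observations and fit only the $n$ observations $Y_{(1)}$; viz.,

$$Y_{(1)} = X_{(1)}\beta + \epsilon_{(1)}. \qquad (3)$$

If $H$ is a generalized inverse of $X_{(1)}'X_{(1)}$, where $X_{(1)}'$ denotes the transpose of $X_{(1)}$, the least squares estimate $f_{(1)}$ of $\epsilon_{(1)}$ from model (3) will be

$$f_{(1)} = Y_{(1)} - X_{(1)}HX_{(1)}'Y_{(1)}. \qquad (4)$$

If we now fit the entire $N$ observations to the model, and if $G$ is a generalized inverse of $X'X$, then the least squares estimate $e_{(2)}$ of $\epsilon_{(2)}$ will be

$$e_{(2)} = Y_{(2)} - X_{(2)}GX_{(1)}'Y_{(1)} - X_{(2)}GX_{(2)}'Y_{(2)}.$$

Note that while $f_{(1)}$ is a function of $Y_{(1)}$ only, $e_{(2)}$ is a function of both $Y_{(1)}$ and $Y_{(2)}$.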

We now prove the following theorem.

THEOREM 2.1. $f_{(1)}$ and $e_{(2)}$ are linearly uncorrelated (independent) if the components of $\epsilon$ are independent and identically (normally) distributed.

To prove the theorem we need the following well-known lemma.

LEMMA 2.1. Let $W$ be a $p \times q$ matrix. Then if $K$ is a generalized inverse of $W'W$, $WKW'W = W$.

Proof of Theorem. $f_{(1)}$ and $e_{(2)}$ can be expressed as follows:

$$f_{(1)} = Y_{(1)} - X_{(1)}HX_{(1)}'Y_{(1)} = (I_n - X_{(1)}HX_{(1)}')\,\epsilon_{(1)}$$

$$e_{(2)} = Y_{(2)} - X_{(2)}GX_{(1)}'Y_{(1)} - X_{(2)}GX_{(2)}'Y_{(2)} = (I_{N-n} - X_{(2)}GX_{(2)}')\,\epsilon_{(2)} - X_{(2)}GX_{(1)}'\,\epsilon_{(1)}.$$

Now if the components of $\epsilon$ are independent and identically distributed, then the covariance between $f_{(1)}$ and $e_{(2)}$ is
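The covariance referred to above can be worked out directly from the two expressions for $f_{(1)}$ and $e_{(2)}$. The following is a sketch rather than the authors' own display; it assumes $E(\epsilon\epsilon') = \sigma^2 I_N$ as in Section 1 and applies Lemma 2.1 with $W = X_{(1)}$ and $K = H$.

```latex
\begin{aligned}
\operatorname{Cov}(f_{(1)}, e_{(2)})
  &= E\!\left[ (I_n - X_{(1)} H X_{(1)}')\,\epsilon_{(1)}
       \left\{ (I_{N-n} - X_{(2)} G X_{(2)}')\,\epsilon_{(2)}
             - X_{(2)} G X_{(1)}'\,\epsilon_{(1)} \right\}' \right] \\
  &= -\sigma^2\, (I_n - X_{(1)} H X_{(1)}')\, X_{(1)} G' X_{(2)}' \\
  &= -\sigma^2\, (X_{(1)} - X_{(1)} H X_{(1)}' X_{(1)})\, G' X_{(2)}' \\
  &= 0 .
\end{aligned}
```

The cross term in $\epsilon_{(1)}\epsilon_{(2)}'$ vanishes because the two blocks of errors are uncorrelated, and the last step is Lemma 2.1. Under normality, zero covariance between these jointly normal vectors gives independence.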

3. THE MONOTONICITY PROPERTY OF THE VARIANCE OF $d_n$ IN HETEROSCEDASTIC SIMPLE REGRESSION MODELS

If the errors $\epsilon_i$ in the simple linear regression model $Y_i = \alpha + \beta x_i + \epsilon_i$ are normally and independently distributed with mean 0 and variance $\sigma^2_{y \cdot x_i} = \sigma_i^2$ then the transformed residuals $f_3, \ldots, f_N$ are likewise normal with mean 0 and variances

$$\sigma^2_{f_n} = \sum_{i=1}^{n-1} \left[ \frac{1}{n} + \frac{(x_i - \bar x_{(n)})(x_n - \bar x_{(n)})}{\sum_{j=1}^{n} (x_j - \bar x_{(n)})^2} \right]^2 \sigma_i^2 + \left[ 1 - \frac{1}{n} - \frac{(x_n - \bar x_{(n)})^2}{\sum_{j=1}^{n} (x_j - \bar x_{(n)})^2} \right]^2 \sigma_n^2 ,$$

where $\bar x_{(n)}$ is the mean of $x_1, \ldots, x_n$.

When the error variance $\sigma^2_{y \cdot x}$ is an increasing function of $x$ the condition $x_1 < \cdots < x_n$, which implies $\sigma_1^2 \le \cdots \le \sigma_n^2$, is not sufficient to ensure that the normalized residuals $d_n = c_n f_n$ will have increasing variances; however, if $x_i = a + bi$, $b > 0$, then the following theorem obtains: $\{\sigma^2_{d_n}\}$ is an increasing sequence.

Proof. The variance $\sigma^2_{d_n}$ in this case becomes

$$\sigma^2_{d_n} = \sum_{i=1}^{n-1} \frac{(6i - 2 - 2n)^2}{n(n^2 - 1)(n - 2)}\, \sigma_i^2 + \frac{(n - 2)(n - 1)}{n(n + 1)}\, \sigma_n^2$$

and

$$\sigma^2_{d_{n+1}} - \sigma^2_{d_n} = \sum_{i=1}^{n-1} \frac{-144 i^2 + i(72n + 144) - 4(n + 2)(2n + 5)}{n(n^2 - 1)(n^2 - 4)}\, \sigma_i^2 + \frac{(n - 1)(20 - n^2)}{n(n + 1)(n + 2)}\, \sigma_n^2 + \frac{n(n - 1)}{(n + 1)(n + 2)}\, \sigma_{n+1}^2 = \sum_{i=1}^{n+1} \delta_{ni}\, \sigma_i^2 \quad \text{(say)}.$$

Note that $\sum_{i=1}^{n+1} \delta_{ni} = 0$, because of the normalization, and that for $n \ge 5$

$$\delta_{ni} > 0 \quad \text{for } \ldots \text{ and } i = n + 1,$$

and $\delta_{ni} < 0$ otherwise (where $[z]$ denotes the integer part of $z$). This information concerning the signs of the $\delta_{ni}$ implies that

$$\sum_{i=1}^{k} \delta_{ni} \le \sum_{i=1}^{[\ldots]} \delta_{ni} \quad \text{for } [\ldots] < k \le n$$

where

also satisfies the conditions of the lemma, but in this case

$$\sum_{i=1}^{n} (\sigma_i^2 - \sigma_{i+1}^2)\, D_i \ \ldots$$

in contradiction to the assumption. Calculation of the numerical values of $\delta_{ni}$ for $n = 3$ and $n = 4$ reveals that the conditions of the lemma are also satisfied in these cases. Thus, the monotonicity preserving property holds for all $n$ when $x_i = a + bi$, $b > 0$, and the correct direction of monotonicity is preserved provided only that $\operatorname{sgn}(\hat\beta) = \operatorname{sgn}(\beta)$.
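The monotonicity property of this section can be checked numerically for a particular design. The sketch below is an illustration under stated assumptions: the function name `stepwise_variances` is hypothetical, the error variances $\sigma_i^2 = 2x_i$ are chosen only as an example of an increasing variance function, and the variance of $d_n$ is obtained from the weights of $\epsilon_1, \ldots, \epsilon_n$ in $f_n$ for simple linear regression.

```python
import numpy as np

def stepwise_variances(x, sigma2):
    """Variance of each normalized stepwise residual d_n, n = 3..N (hypothetical helper).

    x      : design points of a simple linear regression, in the order fitted
    sigma2 : error variances sigma_i^2 at those points
    """
    x = np.asarray(x, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    out = []
    for n in range(3, len(x) + 1):
        dev = x[:n] - x[:n].mean()
        # h[i] is the weight of eps_i in the fitted value at x_n
        h = 1.0 / n + dev * dev[-1] / np.sum(dev ** 2)
        w = -h
        w[-1] += 1.0                          # f_n = sum_i w[i] * eps_i
        var_f = np.sum(w ** 2 * sigma2[:n])   # variance under heteroscedasticity
        var_f_homo = np.sum(w ** 2)           # variance when every sigma_i^2 = 1
        out.append(var_f / var_f_homo)        # d_n = f_n / sqrt(var_f_homo)
    return np.array(out)

if __name__ == "__main__":
    x = np.arange(1.0, 11.0)                  # equally spaced design, x_i = i
    v = stepwise_variances(x, 2.0 * x)
    print(v)
    print("increasing:", bool(np.all(np.diff(v) > 0)))
```

Passing a constant `sigma2` returns a constant sequence, which is the homoscedastic case; passing an unequally spaced `x` allows a search for the counterexamples mentioned in Section 1.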

4. SIMULATION OF THE "PEAK-TEST" OF HOMOSCEDASTICITY IN SIMPLE LINEAR REGRESSION

Goldfeld and Quandt [1] discuss the problem of testing homoscedasticity against a monotone heteroscedastic alternative hypothesis, and present tabulated critical values for a so-called "peak-test" of the residuals. A "peak" residual is said to occur at $\hat Y_j = \hat\alpha + \hat\beta x_j$, $\hat Y_1 < \cdots < \hat Y_N$, if and only if $|Y_i - \hat Y_i| < |Y_j - \hat Y_j|$ for all $i < j$, and the peak-test statistic is then defined as the number of peaks occurring among $|Y_2 - \hat Y_2|, \ldots, |Y_N - \hat Y_N|$. Critical values are obtained from the tabulated distribution of the number of peaks occurring in a random sample of size $N$ from an absolutely continuous distribution. In their original application of the peak-test to simple linear regression residuals, Goldfeld and Quandt [1] failed to take into account both the dependence which exists between residuals and the fact that the distribution of $Y_i - \hat Y_i$ is a function of $x_i$; under the homoscedastic hypothesis the stochastically largest absolute residual occurs with the $x_i$ nearest to $\bar x$. If sample size is large then these shortcomings of their procedure are minor, as the authors later pointed out [2]; however, as sample size increases, the mechanics of performing the peak-test become unduly time consuming and a computationally simpler procedure such as the F-test described by Goldfeld and Quandt becomes more expedient. For small samples, the peak-test applied to the untransformed residuals is clearly invalid with respect to the size of the test, and also has poor power characteristics.
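The peak-test statistic defined above amounts to counting running maxima in the sequence of absolute residuals. A minimal sketch follows; the function name `count_peaks` is an assumption made here, and the input is taken to be already ordered by increasing predicted value.

```python
def count_peaks(abs_residuals):
    """Count entries that strictly exceed every earlier entry.

    The first entry is the starting reference and is not itself counted,
    so the peaks are sought among the second through last entries.
    """
    peaks = 0
    running_max = abs_residuals[0]
    for value in abs_residuals[1:]:
        if value > running_max:
            peaks += 1
            running_max = value
    return peaks

if __name__ == "__main__":
    print(count_peaks([1.0, 3.0, 2.0, 5.0, 4.0]))
```

In the example call, the entries 3.0 and 5.0 each exceed all earlier entries, so two peaks are counted.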

Table 1 illustrates these points for sample size $N = 10$, and also indicates how they are overcome by applying the peak-test to normalized, stepwise residuals. The cumulative distribution of number of peaks for selected, monotone heteroscedastic alternative hypotheses was estimated by generating 1000 samples of size $N = 10$ from the standard normal distribution and transforming to heteroscedastic errors by appropriate scale changes. After scale changes the least squares residuals and stepwise least squares residuals were then constructed as appropriate linear functions of the errors; each sample of size $N = 10$ was thus used in all eight columns of observed values in Table 1. The columns labeled "$H_0$ Nominal c.d.f." and "$H_0$ Exact c.d.f." were calculated from recursion formulae presented by Goldfeld and Quandt for the exact probability distribution of number of peaks in random samples of size 10 and 8, respectively. Note that the homoscedastic exact and observed distribution of peaks in normalized stepwise residuals stand in close agreement, as expected, providing a crude guide as to the amount of precision inherent in the other columns of observed probabilities.
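A simulation of the kind just described can be sketched as follows. This is an illustration under stated assumptions and does not reproduce Table 1: the function names, the seed, the number of simulated samples, and the choice of design $x_i = i$ are all made up here, and the observations are taken in the order of $x$ rather than in the order of the fitted values. Both kinds of residuals are built as linear functions of the simulated errors, as in the text.

```python
import numpy as np

def count_peaks(abs_values):
    """Number of entries strictly larger than every earlier entry."""
    peaks, running_max = 0, abs_values[0]
    for v in abs_values[1:]:
        if v > running_max:
            peaks, running_max = peaks + 1, v
    return peaks

def residual_maps(x):
    """Matrices (M_ols, M_step) with e = M_ols @ eps and d = M_step @ eps."""
    N = len(x)
    X = np.column_stack([np.ones(N), x])
    M_ols = np.eye(N) - X @ np.linalg.pinv(X)
    rows = []
    for n in range(3, N + 1):                 # first two points are fitted exactly
        Xn = X[:n]
        w = -(Xn[-1] @ np.linalg.pinv(Xn))    # minus the last row of the hat matrix
        w[-1] += 1.0                          # f_n = w @ eps[:n]
        row = np.zeros(N)
        row[:n] = w / np.sqrt(w @ w)          # unit variance under homoscedasticity
        rows.append(row)
    return M_ols, np.array(rows)

def rejection_rates(x, scale, n_sim=2000, critical=4, seed=0):
    """Estimated P(number of peaks >= critical) for OLS and stepwise residuals.

    scale holds the error standard deviations at each design point.
    """
    rng = np.random.default_rng(seed)
    M_ols, M_step = residual_maps(x)
    eps = rng.standard_normal((n_sim, len(x))) * scale
    e = np.abs(eps @ M_ols.T)
    d = np.abs(eps @ M_step.T)
    rate_ols = np.mean([count_peaks(r) >= critical for r in e])
    rate_step = np.mean([count_peaks(r) >= critical for r in d])
    return rate_ols, rate_step

if __name__ == "__main__":
    x = np.arange(1.0, 11.0)
    print("homoscedastic:", rejection_rates(x, np.ones(10)))
    print("variance 2x:  ", rejection_rates(x, np.sqrt(2.0 * x)))
```

Because the rows of `M_step` are orthonormal, the stepwise residuals are uncorrelated with unit variance whenever the errors are, which is why the tabulated distribution for independent samples applies to them.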

The "peak-test" which treats the least squares residuals as if they were independent and identically distributed errs substantially in the size of the test. Thus, taking 4 or more peaks among the 10 residuals as the critical region gives a nominal significance level $\alpha_4 = 1 - .9055 = .0945$ while the actual size of the test is approximately $1 - .9710 \approx .03$; and if any of the three heteroscedastic models obtained then the probability of rejecting homoscedasticity would be at best approximately .05 (less than the nominal size of the test). Applied to normalized stepwise residuals the critical region of 4 or more peaks has size $1 - .9385 = .0615$, and the probability of detecting the alternative $\sigma^2_{y \cdot x} = 2x$ is approximately $1 - .709 \approx .29$.

Since only the errors $\epsilon$ were simulated in this Monte Carlo operation the estimated distributions under the heteroscedastic models in Table 1 must be regarded as estimates of conditional probabilities, the condition being that $\operatorname{sgn}(\hat\beta) = \operatorname{sgn}(\beta)$. With independent, normally distributed heteroscedastic errors

where $\Phi(\cdot)$ denotes the standard cumulative normal distribution.

ACKNOWLEDGMENT

The authors are grateful to the referees, associate editor, and editor for several helpful suggestions.

References

[1] Goldfeld, S. M. and Quandt, R. E. "Some tests for homoscedasticity," Journal of the American Statistical Association, 60(1965)539-47.
[2] Goldfeld, S. M. and Quandt, R. E. Corrigenda, Journal of the American Statistical Association, 62(1967)1518.
[3] Hogg, R. V. "On the resolution of statistical hypotheses," Journal of the American Statistical Association, 56(1961)978-89.
[4] Hogg, R. V. "Iterated tests of the equality of several distributions," Journal of the American Statistical Association, 57(1962)579-85.
[5] Koerts, J. "Some further notes on disturbance estimates in regression analysis," Journal of the American Statistical Association, 62(1967)169-83.
[6] Putter, J. "Orthonormal bases of error spaces and their use for investigating the normality and variances of residuals," Journal of the American Statistical Association, 62(1967)1022-36.
[7] Theil, H. "The analysis of disturbances in regression analysis," Journal of the American Statistical Association, 60(1965)1067-79.
[8] Theil, H. "A simplification of the BLUS procedure for analyzing regression disturbances," Journal of the American Statistical Association, 63(1968)242-51.