Linear Models Outline - Probability Theory | STAT 542, Study notes of Probability and Statistics

Material Type: Notes; Class: THRY PROBAB &STAT I; Subject: STATISTICS; University: Iowa State University; Term: Spring 2008;


Uploaded on 09/02/2009



Stat 511 Outline

(Spring 2008 Revision of 2004 Version)

Steve Vardeman

Iowa State University

February 1, 2008

Abstract

This outline summarizes the main points of lectures based on Ken Koehler’s class notes and other sources.

1 Linear Models

The basic linear model structure is

$$Y = X\beta + \epsilon \qquad (1)$$

for $Y$ an $n \times 1$ vector of observables, $X$ an $n \times k$ matrix of known constants, $\beta$ a $k \times 1$ vector of (unknown) constants (parameters), and $\epsilon$ an $n \times 1$ vector of unobservable random errors. Almost always one assumes that $E\epsilon = 0$. Often one also assumes that for an unknown constant (a parameter) $\sigma^2 > 0$, $\mathrm{Var}\,\epsilon = \sigma^2 I$ (these are the Gauss-Markov model assumptions) or somewhat more generally assumes that $\mathrm{Var}\,\epsilon = \eta^2 V$ (these are the Aitken model assumptions). These assumptions can be phrased as “the mean vector $EY$ is in the column space of the matrix $X$ ($EY \in C(X)$) and the variance-covariance matrix $\mathrm{Var}\,Y$ is known up to a multiplicative constant.”

1.1 Ordinary Least Squares

The ordinary least squares estimate for $EY = X\beta$ is made by minimizing

$$(Y - \hat{Y})'(Y - \hat{Y})$$

over choices of $\hat{Y} \in C(X)$. $\hat{Y}$ is then “the (perpendicular) projection of $Y$ onto $C(X)$.” This is minimization of the squared distance between $Y$ and $\hat{Y}$ belonging to $C(X)$. Computation of this projection can be accomplished using a (unique) “projection matrix” $P_X$ as

$$\hat{Y} = P_X Y$$

There are various ways of constructing $P_X$. One is as

$$P_X = X(X'X)^{-}X'$$

for $(X'X)^{-}$ any generalized inverse of $X'X$. As it turns out, $P_X$ is both symmetric and idempotent. It is sometimes called the “hat matrix” and written as $H$ rather than $P_X$. (It is used to compute the “y hats.”) The vector

$$e = Y - \hat{Y} = (I - P_X)Y$$

is the vector of residuals. As it turns out, the matrix $I - P_X$ is also a perpendicular projection matrix. It projects onto the subspace of $\mathbb{R}^n$ consisting of all vectors perpendicular to the elements of $C(X)$. That is, $I - P_X$ projects onto

$$C(X)^{\perp} \equiv \{u \in \mathbb{R}^n \mid u'v = 0 \ \forall v \in C(X)\}$$

It is the case that $C(X)^{\perp} = C(I - P_X)$ and

$$\mathrm{rank}(X) = \mathrm{rank}(X'X) = \mathrm{rank}(P_X) = \text{dimension of } C(X) = \mathrm{trace}(P_X)$$

and

$$\mathrm{rank}(I - P_X) = \text{dimension of } C(X)^{\perp} = \mathrm{trace}(I - P_X)$$

and

$$n = \mathrm{rank}(I) = \mathrm{rank}(P_X) + \mathrm{rank}(I - P_X)$$

Further, there is the Pythagorean Theorem/ANOVA identity

$$Y'Y = (P_X Y)'(P_X Y) + ((I - P_X)Y)'((I - P_X)Y) = \hat{Y}'\hat{Y} + e'e$$
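The projection facts above can be checked numerically. The following is a minimal NumPy sketch, not part of the original outline: the small rank-deficient matrix `X`, the response `Y`, and the helper name `projection_matrix` are illustrative assumptions, and the Moore-Penrose inverse is used as one convenient choice of generalized inverse.

```python
import numpy as np

def projection_matrix(X):
    """Perpendicular projection onto C(X): P_X = X (X'X)^- X'."""
    # The Moore-Penrose inverse is one particular generalized inverse of X'X.
    return X @ np.linalg.pinv(X.T @ X) @ X.T

# Illustrative rank-deficient 4 x 3 model matrix (column 0 = column 1 + column 2).
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
Y = np.array([1., 3., 4., 8.])

P = projection_matrix(X)
Y_hat = P @ Y          # projection of Y onto C(X); here the two group means
e = Y - Y_hat          # residuals, (I - P_X) Y

rank_X = np.linalg.matrix_rank(X)
is_symmetric = np.allclose(P, P.T)
is_idempotent = np.allclose(P @ P, P)
trace_equals_rank = np.isclose(np.trace(P), rank_X)
pythagoras_holds = np.isclose(Y @ Y, Y_hat @ Y_hat + e @ e)
```

Although $X$ here is not of full rank, $P_X$ and $\hat{Y}$ are still well defined, which is the point of allowing any generalized inverse in the construction.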

When $\mathrm{rank}(X) = k$ (one has a “full rank” $X$) every $w \in C(X)$ has a unique representation as a linear combination of the columns of $X$. In this case there is a unique $b$ that solves

$$Xb = P_X Y = \hat{Y} = \widehat{X\beta} \qquad (2)$$

We can call this solution of equation (2) the ordinary least squares estimate of $\beta$. Notice that we then have

$$Xb_{OLS} = P_X Y = X(X'X)^{-}X'Y$$

so that

$$X'Xb_{OLS} = X'X(X'X)^{-}X'Y$$

These are the so called “normal equations.” (Since $X'X(X'X)^{-}X' = X'$, the right-hand side is simply $X'Y$.) $X'X$ is $k \times k$ with the same rank as $X$ (namely $k$) and is thus non-singular. So $(X'X)^{-} = (X'X)^{-1}$ and the normal equations can be solved to give

$$b_{OLS} = (X'X)^{-1}X'Y$$

When $X$ is not of full rank, there are multiple $b$’s that will solve equation (2) and multiple $\beta$’s that could be used to represent $EY \in C(X)$. There is thus no sensible “least squares estimate of $\beta$.”
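A short NumPy sketch may make the contrast between the two cases concrete. The data and variable names below are illustrative assumptions, not taken from the outline. In the full rank case the normal equations have one solution; in the rank-deficient case two different vectors $b$ give exactly the same fitted values.

```python
import numpy as np

# Full rank case: simple linear regression with y = 1 + 2x exactly.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
Y = np.array([1., 3., 5., 7.])
# Solve the normal equations X'X b = X'Y (X'X is non-singular here).
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Rank-deficient case: column 0 = column 1 + column 2.
X_def = np.array([[1., 1., 0.],
                  [1., 1., 0.],
                  [1., 0., 1.],
                  [1., 0., 1.]])
Y_def = np.array([1., 3., 4., 8.])
# One solution of X b = P_X Y, from the Moore-Penrose generalized inverse.
b1 = np.linalg.pinv(X_def.T @ X_def) @ X_def.T @ Y_def
# (1, -1, -1) is in the null space of X_def, so b2 is a second solution.
b2 = b1 + np.array([1., -1., -1.])
same_fit = np.allclose(X_def @ b1, X_def @ b2)
different_b = not np.allclose(b1, b2)
```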

and talk about estimating the vector

$$C\beta = \begin{pmatrix} c_1'\beta \\ c_2'\beta \\ \vdots \\ c_l'\beta \end{pmatrix}$$

An ordinary least squares estimator of $C\beta$ is then

$$\widehat{C\beta}_{OLS} = C(X'X)^{-}X'Y$$

Related to the notion of the estimability of $C\beta$ is the concept of “testability” of hypotheses. Roughly speaking, several hypotheses like $H_0: c_i'\beta = \#$ are (simultaneously) testable if each $c_i'\beta$ can be estimated and the hypotheses are not internally inconsistent. To be more precise, suppose that (as above) $C$ is an $l \times k$ matrix of constants.

Definition 3 For a matrix $C$ of the form (4), the hypothesis $H_0: C\beta = d$ is testable provided each $c_i'\beta$ is estimable and $\mathrm{rank}(C) = l$.

Cases of testing $H_0: C\beta = 0$ are of particular interest. If such a hypothesis is testable, $\exists\, a_i \in \mathbb{R}^n$ such that $c_i' = a_i'X$ for each $i$, and thus with

$$A = \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_l' \end{pmatrix}$$

one can write

$$C = AX$$

Then the basic linear model says that $EY \in C(X)$ while the hypothesis says that $EY \in C(A')^{\perp}$. $C(X) \cap C(A')^{\perp}$ is a subspace of $C(X)$ of dimension

$$\mathrm{rank}(X) - \mathrm{rank}(A') = \mathrm{rank}(X) - l$$

and the hypothesis can thus be thought of in terms of specifying that the mean vector is in a subspace of $C(X)$.

1.3 Means and Variances for Ordinary Least Squares (Under the Gauss-Markov Assumptions)

Elementary rules about how means and variances of linear combinations of random variables are computed can be applied to find means and variances for OLS estimators. Under the Gauss-Markov Model some of these are

$$E\hat{Y} = X\beta \quad\text{and}\quad \mathrm{Var}\,\hat{Y} = \sigma^2 P_X$$

$$Ee = 0 \quad\text{and}\quad \mathrm{Var}\,e = \sigma^2(I - P_X)$$

Further, for $l$ estimable functions $c_1'\beta, c_2'\beta, \ldots, c_l'\beta$ and

$$C = \begin{pmatrix} c_1' \\ c_2' \\ \vdots \\ c_l' \end{pmatrix}$$

the estimator $\widehat{C\beta}_{OLS} = C(X'X)^{-}X'Y$ has mean and covariance matrix

$$E\,\widehat{C\beta}_{OLS} = C\beta \quad\text{and}\quad \mathrm{Var}\,\widehat{C\beta}_{OLS} = \sigma^2 C(X'X)^{-}C'$$

Notice that in the case that $X$ is full rank and $\beta = I\beta$ is estimable, the above says that

$$Eb_{OLS} = \beta \quad\text{and}\quad \mathrm{Var}\,b_{OLS} = \sigma^2(X'X)^{-1}$$

It is possible to use Theorem 5.2A of Rencher about the mean of a quadratic form to also argue that

$$Ee'e = E(Y - \hat{Y})'(Y - \hat{Y}) = \sigma^2(n - \mathrm{rank}(X))$$

This fact suggests the ratio

$$MSE \equiv \frac{e'e}{n - \mathrm{rank}(X)}$$

as an obvious estimate of $\sigma^2$.

In the Gauss-Markov model, ordinary least squares estimation has some optimality properties. Foremost there is the guarantee provided by the Gauss-Markov Theorem. This says that under the linear model assumptions with $\mathrm{Var}\,\epsilon = \sigma^2 I$, for estimable $c'\beta$ the ordinary least squares estimator $\widehat{c'\beta}_{OLS}$ is the Best (in the sense of minimizing variance) Linear (in the entries of $Y$) Unbiased (having mean $c'\beta$ for all $\beta$) Estimator of $c'\beta$.
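As an illustration of the formulas in this section, the sketch below computes $MSE$ and then plugs it in for $\sigma^2$ in $\sigma^2 C(X'X)^{-}C'$ to get an estimated covariance matrix. The four-point regression data set and the choice $C = I$ are assumptions made only for this example.

```python
import numpy as np

# Illustrative full rank regression data (intercept plus one predictor).
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
Y = np.array([1., 3., 4., 8.])

n = X.shape[0]
rank_X = np.linalg.matrix_rank(X)
XtX_inv = np.linalg.inv(X.T @ X)       # equals (X'X)^- because X is full rank

b_ols = XtX_inv @ X.T @ Y
resid = Y - X @ b_ols                  # e = (I - P_X) Y
mse = resid @ resid / (n - rank_X)     # MSE = e'e / (n - rank(X))

# Estimated Var of C b_OLS, with MSE substituted for the unknown sigma^2.
C = np.eye(2)
cov_est = mse * C @ XtX_inv @ C.T
```

For these data the fit is $\hat{y} = 0.7 + 2.2x$ with $e'e = 1.8$, so $MSE = 0.9$ and the estimated variance of the slope is $0.9 \times 0.2 = 0.18$.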

1.4 Generalized Least Squares

For $V$ positive definite, suppose that $\mathrm{Var}\,\epsilon = \eta^2 V$. There exists a symmetric positive definite square root matrix for $V^{-1}$, call it $V^{-1/2}$. Then

$$U = V^{-1/2}Y$$

satisfies the Gauss-Markov model assumptions with model matrix

$$W = V^{-1/2}X$$

It then makes sense to do ordinary least squares estimation of $EU \in C(W)$ (with $P_W U = \hat{U} = \widehat{W\beta}$). Note that for $c \in C(W')$ the BLUE of the parametric function $c'\beta$ is

$$c'(W'W)^{-}W'U = c'(W'W)^{-}W'V^{-1/2}Y$$
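The transformation above can be sketched in NumPy as follows. The matrix `V`, the data, and the helper name `inv_sqrt_spd` are illustrative assumptions. The symmetric square root of $V^{-1}$ is built from an eigendecomposition, and the result is compared with the algebraically equivalent expression $(X'V^{-1}X)^{-1}X'V^{-1}Y$, which follows from $W'W = X'V^{-1}X$ and $W'U = X'V^{-1}Y$.

```python
import numpy as np

def inv_sqrt_spd(V):
    """Symmetric positive definite square root of V^{-1}, via eigendecomposition."""
    w, Q = np.linalg.eigh(V)           # V = Q diag(w) Q' with all w > 0
    return Q @ np.diag(w ** -0.5) @ Q.T

X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
Y = np.array([1., 3., 4., 8.])
# Illustrative positive definite V (neighbouring errors are correlated).
V = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0, 0.5],
              [0.0, 0.0, 0.5, 1.0]])

V_inv_half = inv_sqrt_spd(V)
U = V_inv_half @ Y                     # transformed response
W = V_inv_half @ X                     # transformed model matrix

# OLS in the transformed (Gauss-Markov) model.
b_gls = np.linalg.solve(W.T @ W, W.T @ U)

# Same estimate written directly in terms of V^{-1}.
V_inv = np.linalg.inv(V)
b_direct = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)
```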

What then is there to choose between two fundamentally equivalent linear models? There are two issues. Computational/formula simplicity pushes one in the direction of using full rank versions. Sometimes, scientific interpretability of parameters pushes one in the opposite direction. It must be understood that the set of inferences one can make can ONLY depend on the column space of the model matrix, NOT on how that column space is represented.

1.6 Normal Distribution Theory and Inference

If one adds to the basic Gauss-Markov (or Aitken) linear model assumptions an assumption that $\epsilon$ (and therefore $Y$) is multivariate normal, inference formulas (for making confidence intervals, tests and predictions) follow. These are based primarily on two basic results.

Theorem 4 (Koehler 4.7, panel 309. See also Rencher Theorem 5.5.A.) Suppose that $A$ is $n \times n$ and symmetric with $\mathrm{rank}(A) = k$, $Y \sim \mathrm{MVN}_n(\mu, \Sigma)$ for $\Sigma$ positive definite. If $A\Sigma$ is idempotent, then

$$Y'AY \sim \chi^2_k(\mu'A\mu)$$

(So, if in addition $A\mu = 0$, then $Y'AY \sim \chi^2_k$.)

Theorem 5 (Theorem 1.3.7 of Christensen) Suppose that $Y \sim \mathrm{MVN}(\mu, \sigma^2 I)$ and $BA = 0$. a) If $A$ is symmetric, $Y'AY$ and $BY$ are independent, and b) if both $A$ and $B$ are symmetric, then $Y'AY$ and $Y'BY$ are independent.

(Part b) is Koehler’s 4.8. Part a) is a weaker form of Corollary 1 to Rencher’s Theorem 5.6.A and part b) is a weaker form of Corollary 1 to Rencher’s Theorem 5.6.B.) Here are some implications of these theorems.

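Example 6 In the normal Gauss-Markov model

$$\frac{1}{\sigma^2}(Y - \hat{Y})'(Y - \hat{Y}) = \frac{SSE}{\sigma^2} \sim \chi^2_{n - \mathrm{rank}(X)}$$

This leads, for example, to $1 - \alpha$ level confidence limits for $\sigma^2$

$$\left(\frac{SSE}{\text{upper } \frac{\alpha}{2} \text{ point of } \chi^2_{n - \mathrm{rank}(X)}},\ \frac{SSE}{\text{lower } \frac{\alpha}{2} \text{ point of } \chi^2_{n - \mathrm{rank}(X)}}\right)$$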

Example 7 (Estimation and testing for an estimable function) In the normal Gauss-Markov model, if $c'\beta$ is estimable,

$$\frac{\widehat{c'\beta}_{OLS} - c'\beta}{\sqrt{MSE}\sqrt{c'(X'X)^{-}c}} \sim t_{n - \mathrm{rank}(X)}$$

This implies that $H_0: c'\beta = \#$ can be tested using the statistic

$$T = \frac{\widehat{c'\beta}_{OLS} - \#}{\sqrt{MSE}\sqrt{c'(X'X)^{-}c}}$$

and a $t_{n - \mathrm{rank}(X)}$ reference distribution. Further, if $t$ is the upper $\frac{\alpha}{2}$ point of the $t_{n - \mathrm{rank}(X)}$ distribution, $1 - \alpha$ level two-sided confidence limits for $c'\beta$ are

$$\widehat{c'\beta}_{OLS} \pm t\sqrt{MSE}\sqrt{c'(X'X)^{-}c}$$
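The computations in Example 7 might be organized as in the sketch below. The function name `t_inference`, the data, and the value `4.303` (the upper 0.025 point of the $t_2$ distribution, taken from a standard $t$ table) are assumptions for illustration; the quantile is passed in by the caller rather than computed.

```python
import numpy as np

def t_inference(X, Y, c, null_value, t_upper):
    """Estimate, standard error, T statistic and confidence limits for c'beta."""
    n = X.shape[0]
    rank_X = np.linalg.matrix_rank(X)
    G = np.linalg.pinv(X.T @ X)                 # one generalized inverse of X'X
    estimate = c @ G @ X.T @ Y                  # OLS estimate of c'beta
    resid = Y - X @ G @ X.T @ Y                 # e = (I - P_X) Y
    mse = resid @ resid / (n - rank_X)
    se = np.sqrt(mse) * np.sqrt(c @ G @ c)      # sqrt(MSE) sqrt(c'(X'X)^- c)
    T = (estimate - null_value) / se
    limits = (estimate - t_upper * se, estimate + t_upper * se)
    return estimate, se, T, limits

X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
Y = np.array([1., 3., 4., 8.])
c = np.array([0., 1.])                          # picks out the slope

# Test H0: slope = 0 and form 95% limits; 4.303 is a table value for t with 2 df.
estimate, se, T, limits = t_inference(X, Y, c, null_value=0.0, t_upper=4.303)
```

Passing the quantile as an argument keeps the sketch free of any particular statistics library; in practice one would obtain it from whatever $t$ quantile routine is at hand.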

Example 8 (Prediction) In the normal Gauss-Markov model, suppose that $c'\beta$ is estimable and $y^* \sim N(c'\beta, \gamma\sigma^2)$ independent of $Y$ is to be observed. (We assume that $\gamma$ is known.) Then

$$\frac{\widehat{c'\beta}_{OLS} - y^*}{\sqrt{MSE}\sqrt{\gamma + c'(X'X)^{-}c}} \sim t_{n - \mathrm{rank}(X)}$$

This means that if $t$ is the upper $\frac{\alpha}{2}$ point of the $t_{n - \mathrm{rank}(X)}$ distribution, $1 - \alpha$ level two-sided prediction limits for $y^*$ are

$$\widehat{c'\beta}_{OLS} \pm t\sqrt{MSE}\sqrt{\gamma + c'(X'X)^{-}c}$$

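Example 9 (Testing) In the normal Gauss-Markov model, suppose that the hypothesis $H_0: C\beta = d$ is testable. Then with

$$SSH_0 = \left(\widehat{C\beta}_{OLS} - d\right)'\left(C(X'X)^{-}C'\right)^{-1}\left(\widehat{C\beta}_{OLS} - d\right)$$

it’s easy to see that

$$\frac{SSH_0}{\sigma^2} \sim \chi^2_l(\delta^2)$$

for

$$\delta^2 = \frac{1}{\sigma^2}(C\beta - d)'\left(C(X'X)^{-}C'\right)^{-1}(C\beta - d)$$

This in turn implies that

$$F = \frac{SSH_0/l}{MSE} \sim F_{l,\,n - \mathrm{rank}(X)}(\delta^2)$$

So with $f$ the upper $\alpha$ point of the $F_{l,\,n - \mathrm{rank}(X)}$ distribution, an $\alpha$ level test of $H_0: C\beta = d$ can be made by rejecting if $\frac{SSH_0/l}{MSE} > f$. The power of this test is

$$\mathrm{power}(\delta^2) = P\left[\text{an } F_{l,\,n - \mathrm{rank}(X)}(\delta^2) \text{ random variable} > f\right]$$

Or taking a significance testing point of view, a p-value for testing this hypothesis is

$$P\left[\text{an } F_{l,\,n - \mathrm{rank}(X)} \text{ random variable exceeds the observed value of } \frac{SSH_0/l}{MSE}\right]$$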
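A minimal NumPy sketch of the test statistic in Example 9 follows. The function name `f_statistic` and the data are assumptions for illustration, and the comparison with an $F$ quantile (or the p-value computation) is left to whatever distribution routine is available.

```python
import numpy as np

def f_statistic(X, Y, C, d):
    """SSH0 and F = (SSH0 / l) / MSE for a testable hypothesis H0: C beta = d."""
    n = X.shape[0]
    rank_X = np.linalg.matrix_rank(X)
    l = C.shape[0]
    G = np.linalg.pinv(X.T @ X)                   # a generalized inverse of X'X
    estimate = C @ G @ X.T @ Y                    # OLS estimate of C beta
    P = X @ G @ X.T
    mse = Y @ (np.eye(n) - P) @ Y / (n - rank_X)  # SSE / (n - rank(X))
    diff = estimate - d
    # (C beta_hat - d)' [C (X'X)^- C']^{-1} (C beta_hat - d)
    ssh0 = diff @ np.linalg.solve(C @ G @ C.T, diff)
    return ssh0, (ssh0 / l) / mse

X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
Y = np.array([1., 3., 4., 8.])
C = np.array([[0., 1.]])                          # H0: slope = 0
d = np.array([0.])

ssh0, F = f_statistic(X, Y, C, d)
```

For this single-row $C$, $F$ equals the square of the $T$ statistic of Example 7 computed for the same hypothesis.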

Unless $n$ is small or one is very unlucky, in regression contexts $X$ is of full rank (i.e. is of rank $r + 1$). A few specifics of what has gone before that are of particular interest in the regression context are as follows.

As always $\hat{Y} = P_X Y$ and in the regression context it is especially common to call $P_X = H$ the hat matrix. It is $n \times n$ and its diagonal entries $h_{ii}$ are sometimes used as indices of “influence” or “leverage” of a particular case on the regression fit. It is the case that each $h_{ii} \geq 0$ and

$$\sum_i h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(H) = r + 1$$

so that the “hats” $h_{ii}$ average to $\frac{r+1}{n}$. In light of this, a case with $h_{ii} > \frac{2(r+1)}{n}$ is sometimes flagged as an “influential” case. Further

$$\mathrm{Var}\,\hat{Y} = \sigma^2 P_X = \sigma^2 H$$

so that $\mathrm{Var}\,\hat{y}_i = h_{ii}\sigma^2$ and an estimated standard deviation of $\hat{y}_i$ is

$$\sqrt{h_{ii}}\sqrt{MSE}.$$
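The leverage computations just described can be sketched as follows. The five-point data set, with one predictor value far from the rest, is an assumption chosen so that exactly one case exceeds the $2(r+1)/n$ threshold.

```python
import numpy as np

# Illustrative data: the last case has an x value far from the others.
x = np.array([0., 1., 2., 3., 10.])
Y = np.array([1., 3., 4., 8., 20.])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape                              # p = r + 1 columns

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix P_X
h = np.diag(H)                              # leverages h_ii

# Cases whose leverage exceeds twice the average leverage (r + 1) / n.
flagged = np.where(h > 2 * p / n)[0]

resid = Y - H @ Y
mse = resid @ resid / (n - p)
se_fitted = np.sqrt(h) * np.sqrt(mse)       # estimated sd of each fitted value
```

Here the leverages sum to 2, and only the last case (leverage about 0.94, against a threshold of 0.8) is flagged.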

(This is useful, for example, in making confidence intervals for $Ey_i$.) Also as always, $e = (I - P_X)Y$ and $\mathrm{Var}\,e = \sigma^2(I - P_X) = \sigma^2(I - H)$. So $\mathrm{Var}\,e_i = (1 - h_{ii})\sigma^2$ and it is typical to compute and plot standardized versions of the residuals

$$e_i^* = \frac{e_i}{\sqrt{MSE}\sqrt{1 - h_{ii}}}$$

The general testing of hypothesis framework discussed in Section 1.6 has a particular important specialization in regression contexts. That is, it is common in regression contexts (for $p < r$) to test

$$H_0: \beta_{p+1} = \beta_{p+2} = \cdots = \beta_r = 0 \qquad (6)$$

and in first methods courses this is done using the “full model/reduced model” paradigm. With

$$X_i = (\mathbf{1} \mid x_1 \mid x_2 \mid \cdots \mid x_i)$$

this is the hypothesis

$$H_0: EY \in C(X_p)$$

It is also possible to write this hypothesis in the standard form $H_0: C\beta = 0$ using the matrix

$$\underset{(r-p)\times(r+1)}{C} = \left(\underset{(r-p)\times(p+1)}{0} \;\Big|\; \underset{(r-p)\times(r-p)}{I}\right)$$

So from Section 1.6 the hypothesis can be tested using an F test with numerator sum of squares

$$SSH_0 = (Cb_{OLS})'\left(C(X'X)^{-1}C'\right)^{-1}(Cb_{OLS})$$

What is interesting and perhaps not initially obvious is that

$$SSH_0 = Y'(P_X - P_{X_p})Y \qquad (7)$$

and that this kind of sum of squares is the elementary $SSR_{full} - SSR_{reduced}$. (A proof of the equivalence (7) is on a handout posted on the course web page.) Further, the sum of squares in display (7) can be made part of any number of interesting partitions of the (uncorrected) overall sum of squares $Y'Y$. For example, it is clear that

$$Y'Y = Y'\left(P_{\mathbf{1}} + (P_{X_p} - P_{\mathbf{1}}) + (P_X - P_{X_p}) + (I - P_X)\right)Y \qquad (8)$$

so that

$$Y'Y - Y'P_{\mathbf{1}}Y = Y'(P_{X_p} - P_{\mathbf{1}})Y + Y'(P_X - P_{X_p})Y + Y'(I - P_X)Y$$

In elementary regression analysis notation

$$Y'Y - Y'P_{\mathbf{1}}Y = SSTot \text{ (corrected)}$$

$$Y'(P_{X_p} - P_{\mathbf{1}})Y = SSR_{reduced}$$

$$Y'(P_X - P_{X_p})Y = SSR_{full} - SSR_{reduced}$$

$$Y'(I - P_X)Y = SSE_{full}$$

(and then of course $Y'(P_X - P_{\mathbf{1}})Y = SSR_{full}$). These four sums of squares are often arranged in an ANOVA table for testing the hypothesis (6). It is common in regression analysis to use “reduction in sums of squares” notation and write

$$R(\beta_0) = Y'P_{\mathbf{1}}Y$$

$$R(\beta_1, \ldots, \beta_p \mid \beta_0) = Y'(P_{X_p} - P_{\mathbf{1}})Y$$

$$R(\beta_{p+1}, \ldots, \beta_r \mid \beta_0, \beta_1, \ldots, \beta_p) = Y'(P_X - P_{X_p})Y$$

so that in this notation, identity (8) becomes

$$Y'Y = R(\beta_0) + R(\beta_1, \ldots, \beta_p \mid \beta_0) + R(\beta_{p+1}, \ldots, \beta_r \mid \beta_0, \beta_1, \ldots, \beta_p) + SSE$$
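The partition (8) and the equivalence (7) can be checked numerically on a small example. In the sketch below the data, the helper name `proj`, and the choice $r = 2$, $p = 1$ are illustrative assumptions.

```python
import numpy as np

def proj(X):
    """Perpendicular projection matrix onto C(X)."""
    return X @ np.linalg.pinv(X.T @ X) @ X.T

# Illustrative data with r = 2 predictors; the reduced model keeps p = 1 of them.
x1 = np.array([0., 1., 2., 3., 4., 5.])
x2 = np.array([1., 0., 2., 1., 3., 0.])
Y = np.array([1., 2., 2., 4., 7., 5.])
n = len(Y)
one = np.ones((n, 1))
X1 = np.column_stack([one, x1])         # reduced model matrix X_p
X = np.column_stack([one, x1, x2])      # full model matrix

P1, PX1, PX = proj(one), proj(X1), proj(X)
R_b0 = Y @ P1 @ Y                       # R(beta_0)
R_b1 = Y @ (PX1 - P1) @ Y               # R(beta_1 | beta_0)
R_b2 = Y @ (PX - PX1) @ Y               # R(beta_2 | beta_0, beta_1)
SSE = Y @ (np.eye(n) - PX) @ Y

# SSH0 for H0: beta_2 = 0 written in the C beta = 0 form.
C = np.array([[0., 0., 1.]])
XtX_inv = np.linalg.inv(X.T @ X)
Cb = C @ XtX_inv @ X.T @ Y
SSH0 = Cb @ np.linalg.solve(C @ XtX_inv @ C.T, Cb)
```

The four pieces add back to $Y'Y = 99$, and `SSH0` agrees with `R_b2`, which is the content of display (7) for this example.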

And in fact, even more elaborate breakdowns of the overall sum of squares are possible. For example,

$$R(\beta_0) = Y'P_{\mathbf{1}}Y$$

$$R(\beta_1 \mid \beta_0) = Y'(P_{X_1} - P_{\mathbf{1}})Y$$

$$R(\beta_2 \mid \beta_0, \beta_1) = Y'(P_{X_2} - P_{X_1})Y$$

$$\vdots$$

$$R(\beta_r \mid \beta_0, \beta_1, \ldots, \beta_{r-1}) = Y'(P_X - P_{X_{r-1}})Y$$

represents a “Type I” or “Sequential” sum of squares breakdown of $Y'Y - SSE$. (Note that these sums of squares are appropriate numerator sums of squares for testing significance of individual $\beta$’s in models that include terms only up to the one in question.) The enterprise of trying to assign a sum of squares to a predictor variable strikes Vardeman as of little real interest, but is nevertheless a common one. Rather than think of

$$R(\beta_i \mid \beta_0, \beta_1, \ldots, \beta_{i-1}) = Y'(P_{X_i} - P_{X_{i-1}})Y$$

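Each of these is a linear combination of the $I \times J$ means $\mu_{ij}$ and is thus estimable. So are the linear combinations of them

$$\alpha_i = \mu_{i.} - \mu_{..}, \quad \beta_j = \mu_{.j} - \mu_{..}, \quad\text{and}\quad \alpha\beta_{ij} = \mu_{ij} - (\mu_{..} + \alpha_i + \beta_j) \qquad (10)$$

The “factorial effects” (10) here are particular (estimable) linear combinations of the cell means. It is a consequence of how these are defined that

$$\sum_i \alpha_i = 0, \quad \sum_j \beta_j = 0, \quad \sum_i \alpha\beta_{ij} = 0 \ \forall j, \quad\text{and}\quad \sum_j \alpha\beta_{ij} = 0 \ \forall i \qquad (11)$$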
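A small numerical illustration of the factorial effects and of the restrictions (11) follows. The $2 \times 3$ table of cell means is a made-up example, and the row, column, and grand averages are taken to be unweighted averages of the cell means.

```python
import numpy as np

# Illustrative I x J = 2 x 3 table of cell means mu_ij.
mu = np.array([[1., 2., 3.],
               [3., 6., 9.]])

mu_dotdot = mu.mean()                   # grand average mu..
mu_i = mu.mean(axis=1)                  # row averages mu_i.
mu_j = mu.mean(axis=0)                  # column averages mu_.j

alpha = mu_i - mu_dotdot                # main effects of the first factor
beta = mu_j - mu_dotdot                 # main effects of the second factor
# Interactions: mu_ij - (mu.. + alpha_i + beta_j), built by broadcasting.
alpha_beta = mu - (mu_dotdot + alpha[:, None] + beta[None, :])
```

For this table `alpha` is $(-2, 2)$, `beta` is $(-2, 0, 2)$, and the interactions are nonzero, so the two rows of means are not parallel across columns.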

An issue of particular interest in two way factorials is whether the hypothesis

$$H_0: \alpha\beta_{ij} = 0 \ \forall i \text{ and } j \qquad (12)$$

is tenable. (If it is, great simplification of interpretation is possible ... changing levels of one factor has the same impact on mean response regardless of which level of the second factor is considered.) This hypothesis can be equivalently written as

$$\mu_{ij} = \mu_{..} + \alpha_i + \beta_j \ \forall i \text{ and } j$$

or as

$$(\mu_{ij} - \mu_{ij'}) - (\mu_{i'j} - \mu_{i'j'}) = 0 \ \forall i, i', j \text{ and } j'$$

and is a statement of “parallelism” on “interaction plots” of means. To test this, one could write the hypothesis in terms of $(I - 1)(J - 1)$ statements

$$\mu_{ij} - \mu_{i.} - \mu_{.j} + \mu_{..} = 0$$

about the cell means and use the machinery for testing $H_0: C\beta = d$ from Example 9. In this case, $d = 0$ and the test is about $EY$ falling in some subspace of $C(X)$.

For thinking about the nature of this subspace and issues related to the hypothesis (12), it is probably best to back up and consider an alternative to the cell means model approach. Rather than begin with the cell means model, one might instead begin with the non-full-rank “effects model”

$$y_{ijk} = \mu^* + \alpha_i^* + \beta_j^* + \alpha\beta_{ij}^* + \epsilon_{ijk} \qquad (13)$$

I have put stars on the parameters to make clear that this is something different from beginning with cell means and defining effects as linear combinations of them. Here there are $k = 1 + I + J + IJ$ parameters for the means and only $IJ$ different means. A model including all of these parameters can not be of full rank. To get simple computations/formulas, one must impose some restrictions. There are several possibilities.

In the first place, the facts (11) suggest the so called “sum restrictions” in the effects model (13)

$$\sum_i \alpha_i^* = 0, \quad \sum_j \beta_j^* = 0, \quad \sum_i \alpha\beta_{ij}^* = 0 \ \forall j, \quad\text{and}\quad \sum_j \alpha\beta_{ij}^* = 0 \ \forall i$$

Alternative restrictions are so-called “baseline restrictions.” SAS uses the baseline restrictions

$$\alpha_I^* = 0, \quad \beta_J^* = 0, \quad \alpha\beta_{Ij}^* = 0 \ \forall j, \quad\text{and}\quad \alpha\beta_{iJ}^* = 0 \ \forall i$$

while R and Splus use the baseline restrictions

$$\alpha_1^* = 0, \quad \beta_1^* = 0, \quad \alpha\beta_{1j}^* = 0 \ \forall j, \quad\text{and}\quad \alpha\beta_{i1}^* = 0 \ \forall i$$

Under any of these sets of restrictions one may write a full rank model matrix as

$$\underset{n \times IJ}{X} = \left(\underset{n \times 1}{\mathbf{1}} \;\Big|\; \underset{n \times (I-1)}{X_{\alpha^*}} \;\Big|\; \underset{n \times (J-1)}{X_{\beta^*}} \;\Big|\; \underset{n \times (I-1)(J-1)}{X_{\alpha\beta^*}}\right)$$

and the no interaction hypothesis (12) is the hypothesis $H_0: EY \in C((\mathbf{1} \mid X_{\alpha^*} \mid X_{\beta^*}))$. So using the full model/reduced model paradigm from the regression discussion, one then has an appropriate numerator sum of squares

$$SSH_0 = Y'\left(P_X - P_{(\mathbf{1} \mid X_{\alpha^*} \mid X_{\beta^*})}\right)Y$$

and numerator degrees of freedom $(I - 1)(J - 1)$ (in complete factorials where every $n_{ij} > 0$).

Other hypotheses sometimes of interest are

$$H_0: \alpha_i = 0 \ \forall i \quad\text{or}\quad H_0: \beta_j = 0 \ \forall j \qquad (14)$$

These are the hypotheses that all row averages of cell means are the same and that all column averages of cell means are the same. That is, these hypotheses could be written as

$$H_0: \mu_{i.} - \mu_{i'.} = 0 \ \forall i, i' \quad\text{or}\quad H_0: \mu_{.j} - \mu_{.j'} = 0 \ \forall j, j'$$

It is possible to write the first of these in the cell means model as $H_0: C\beta = 0$ for $C$ that is $(I - 1) \times k$ and each row of $C$ specifying $\alpha_i = 0$ for one of $i = 1, 2, \ldots, (I - 1)$ (or equality of two row average means). Similarly, the second can be written in the cell means model as $H_0: C\beta = 0$ for $C$ that is $(J - 1) \times k$ and each row of $C$ specifying $\beta_j = 0$ for one of $j = 1, 2, \ldots, (J - 1)$ (or equality of two column average means). Appropriate numerator sums of squares and degrees of freedom for testing these hypotheses are then obvious using the material of Example 9. These sums of squares are often referred to as “Type III” sums of squares.

How to interpret standard partitions of sums of squares and to relate them to tests of hypotheses (12) and (14) is problematic unless all “cell” sample sizes are the same (all $n_{ij} = m$, the data are “balanced”). That is, depending upon what kind of partition one asks for in a call of a standard two-way ANOVA routine, the program produces the following breakdowns

That is, let $X$ (which is $n \times k$) be the cell means model matrix (for $k$ “full” cells) and

$$X^* = \left(\underset{n \times 1}{\mathbf{1}} \;\Big|\; \underset{n \times (I-1)}{X^*_{\alpha^*}} \;\Big|\; \underset{n \times (J-1)}{X^*_{\beta^*}}\right)$$

be an appropriate restricted version of an effects model model matrix (with no interaction terms). If the pattern of empty cells is such that $X^*$ is full rank (has rank $I + J - 1$), the hypothesis (15) can be tested using

$$F = \frac{Y'(P_X - P_{X^*})Y / (k - (I + J - 1))}{Y'(I - P_X)Y / (n - k)}$$

and an $F_{(k - (I + J - 1)),\,(n - k)}$ reference distribution. Further, every

$$\mu^* + \alpha_i^* + \beta_j^*$$

is estimable in the no interaction effects model. Provided this model extends to all $I \times J$ combinations of levels of A and B, this provides estimates of mean responses for all cells. (Note that this is essentially the same kind of extrapolation one does in a regression context to sets of predictors not in the original data set. However, on an intuitive basis, the link supporting extrapolation is probably stronger with quantitative regressors than it is with the qualitative predictors of the present context.)

2 Nonlinear Models

A generalization of the linear model is the (potentially) “nonlinear” model that for $\beta$ a $k \times 1$ vector of (unknown) constants (parameters) and for some function

$$f(x, \beta)$$

that is smooth (differentiable) in the elements of $\beta$, says that what is observed can be represented as

$$y_i = f(x_i, \beta) + \epsilon_i \qquad (16)$$

for each $x_i$ a known vector of constants. (The dimension of $x$ is fixed but basically irrelevant for what follows. In particular, it need not be $k$.) As is typical in the linear model, one usually assumes that $E\epsilon_i = 0 \ \forall i$, and it is also common to assume that for an unknown constant (a parameter) $\sigma^2 > 0$, $\mathrm{Var}\,\epsilon = \sigma^2 I$.

2.1 Ordinary Least Squares in the Nonlinear Model

In general (unlike the case when $f(x_i, \beta) = x_i'\beta$ and the model (16) is a linear model) there are typically no explicit formulas for least squares estimation of $\beta$. That is, minimization of

$$g(b) = \sum_{i=1}^{n} (y_i - f(x_i, b))^2 \qquad (17)$$

is a problem in numerical analysis. There are a variety of standard algorithms used for this purpose. They are all based on the fact that a necessary condition for $b_{OLS}$ to be a minimizer of $g(b)$ is that

$$\left.\frac{\partial g}{\partial b_j}\right|_{b = b_{OLS}} = 0 \ \forall j$$

so that in search for an ordinary least squares estimator, one might try to find a simultaneous solution to these $k$ “estimating” equations. A bit of calculus and algebra shows that $b_{OLS}$ must then solve the matrix equation

$$0 = D'(Y - f(X, b)) \qquad (18)$$

where we use the notations

$$\underset{n \times k}{D} = \left(\frac{\partial f(x_i, b)}{\partial b_j}\right) \quad\text{and}\quad \underset{n \times 1}{f(X, b)} = \begin{pmatrix} f(x_1, b) \\ f(x_2, b) \\ \vdots \\ f(x_n, b) \end{pmatrix}$$

In the case of the linear model

$$D = \left(\frac{\partial}{\partial b_j} x_i'b\right) = (x_{ij}) = X \quad\text{and}\quad f(X, b) = Xb$$

so that equation (18) is $0 = X'(Y - Xb)$, i.e. is the set of normal equations $X'Y = X'Xb$.

One of many iterative algorithms for searching for a solution to the equation (18) is the Gauss-Newton algorithm. It proceeds as follows. For

$$b^r = \begin{pmatrix} b^r_1 \\ b^r_2 \\ \vdots \\ b^r_k \end{pmatrix}$$

the approximate solution produced by the $r$th iteration of the algorithm ($b^0$ is some vector of starting values that must be supplied by the user), let

$$D^r = \left(\left.\frac{\partial f(x_i, b)}{\partial b_j}\right|_{b = b^r}\right)$$

The first order Taylor (linear) approximation to $f(X, \beta)$ at $b^r$ is

$$f(X, \beta) \approx f(X, b^r) + D^r(\beta - b^r)$$

So the nonlinear model $Y = f(X, \beta) + \epsilon$ can be written as

$$Y \approx f(X, b^r) + D^r(\beta - b^r) + \epsilon$$
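The linearization above suggests treating $Y - f(X, b^r)$ as a response in an approximate linear model with model matrix $D^r$ and updating $b^{r+1} = b^r + (D^{r\prime}D^r)^{-1}D^{r\prime}(Y - f(X, b^r))$, which is the usual form of the Gauss-Newton step. The sketch below is one way this might be coded, under several assumptions that are not in the outline: the exponential mean function, the noiseless data, the starting values, and the step-halving safeguard are all illustrative choices.

```python
import numpy as np

def f(x, b):
    """Hypothetical example mean function f(x, b) = b[0] * exp(b[1] * x)."""
    return b[0] * np.exp(b[1] * x)

def jacobian(x, b):
    """n x k matrix D with (i, j) entry d f(x_i, b) / d b_j."""
    return np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])

def gauss_newton(x, y, b_start, max_iter=50, tol=1e-12):
    b = np.asarray(b_start, dtype=float)
    for _ in range(max_iter):
        resid = y - f(x, b)
        D = jacobian(x, b)
        # OLS of the current residuals on D^r, i.e. solve D'D step = D'resid.
        step, *_ = np.linalg.lstsq(D, resid, rcond=None)
        # Step halving (an added safeguard): shrink until SSE does not increase.
        t = 1.0
        while t > 1e-8 and np.sum((y - f(x, b + t * step)) ** 2) > resid @ resid:
            t /= 2.0
        b_new = b + t * step
        converged = np.linalg.norm(b_new - b) < tol
        b = b_new
        if converged:
            break
    return b

x = np.linspace(0.0, 2.0, 9)
y = f(x, np.array([2.0, 0.5]))          # noiseless responses generated at b = (2, 0.5)
b_hat = gauss_newton(x, y, b_start=[1.5, 0.4])

# Left side of the estimating equation (18) at the solution; should be near 0.
score = jacobian(x, b_hat).T @ (y - f(x, b_hat))
```

With poor starting values the iteration can fail to converge, which is why the starting vector $b^0$ has to be supplied with some care.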

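$(D'D)^{-1}$ “typically gets small” with increasing sample size.

  2. $MSE = \frac{SSE}{n - k} \approx \sigma^2$.

  3. $(D'D)^{-1} \approx (\hat{D}'\hat{D})^{-1}$ where $\hat{D} = \left(\left.\frac{\partial f(x_i, b)}{\partial b_j}\right|_{b = b_{OLS}}\right)$.

  4. For a smooth (differentiable) function $h$ that maps $\mathbb{R}^k \to \mathbb{R}^q$, claim 1) and the “delta method” (Taylor’s Theorem of Section 7.2 of the Appendix) imply that $h(b_{OLS}) \overset{\cdot}{\sim} \mathrm{MVN}_q\left(h(\beta), \sigma^2 G(D'D)^{-1}G'\right)$ for $\underset{q \times k}{G} = \left(\left.\frac{\partial h_i(b)}{\partial b_j}\right|_{b = \beta}\right)$.

  5. $G \approx \hat{G} = \left(\left.\frac{\partial h_i(b)}{\partial b_j}\right|_{b = b_{OLS}}\right)$.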

Using this set of approximations, essentially exactly as in Section 1.6, one can develop inference methods. Some of these are outlined below.

Example 11 (Inference for a single $\beta_j$) From part 1) of Claim 10 we get the approximation

$$\frac{b_{OLS,j} - \beta_j}{\sigma\sqrt{\eta_j}} \overset{\cdot}{\sim} N(0, 1)$$

for $\eta_j$ the $j$th diagonal entry of $(D'D)^{-1}$. But then from parts 2) and 3) of the claim,

$$\frac{b_{OLS,j} - \beta_j}{\sigma\sqrt{\eta_j}} \approx \frac{b_{OLS,j} - \beta_j}{\sqrt{MSE}\sqrt{\hat{\eta}_j}}$$

for $\hat{\eta}_j$ the $j$th diagonal entry of $(\hat{D}'\hat{D})^{-1}$. In the (normal Gauss-Markov) linear model context, this last random variable is in fact $t$ distributed for any $n$. Then both so that the nonlinear model formulas reduce to the linear model formulas, and as a means of making the already very approximate inference formulas somewhat more conservative, it is standard to say

$$\frac{b_{OLS,j} - \beta_j}{\sqrt{MSE}\sqrt{\hat{\eta}_j}} \overset{\cdot}{\sim} t_{n-k}$$

and thus to test $H_0: \beta_j = \#$ using

$$T = \frac{b_{OLS,j} - \#}{\sqrt{MSE}\sqrt{\hat{\eta}_j}}$$

and a $t_{n-k}$ reference distribution, and to use the values

$$b_{OLS,j} \pm t\sqrt{MSE}\sqrt{\hat{\eta}_j}$$

as confidence limits for $\beta_j$.
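The approximate standard errors $\sqrt{MSE}\sqrt{\hat{\eta}_j}$ can be computed from the derivative matrix evaluated at the estimate and the residual vector. The sketch below is illustrative: the function name `approx_standard_errors` is an assumption, and it is exercised on a linear model (where $\hat{D} = X$) so that the result can be compared with the linear model formula.

```python
import numpy as np

def approx_standard_errors(D_hat, resid):
    """sqrt(MSE) * sqrt(eta_hat_j), eta_hat_j the jth diagonal entry of (D_hat'D_hat)^{-1}."""
    n, k = D_hat.shape
    mse = resid @ resid / (n - k)                  # MSE = SSE / (n - k)
    eta_hat = np.diag(np.linalg.inv(D_hat.T @ D_hat))
    return np.sqrt(mse) * np.sqrt(eta_hat)

# Check in the linear special case f(x_i, b) = x_i'b, where D_hat is just X.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
Y = np.array([1., 3., 4., 8.])
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ b_ols

se = approx_standard_errors(X, resid)
```

In a genuinely nonlinear fit one would pass the derivative matrix evaluated at $b_{OLS}$ and the residuals $Y - f(X, b_{OLS})$ in place of `X` and the linear residuals.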

Example 12 (Inference for a univariate function of $\beta$, including a single mean response) For $h$ that maps $\mathbb{R}^k \to \mathbb{R}^1$ consider inference for $h(\beta)$ (with application to $f(x, \beta)$ for a given set of predictor variables $x$). Facts 4) and 5) of Claim 10 suggest that

$$\frac{h(b_{OLS}) - h(\beta)}{\sqrt{MSE}\sqrt{\hat{G}(\hat{D}'\hat{D})^{-1}\hat{G}'}} \overset{\cdot}{\sim} t_{n-k}$$

This leads (as in the previous application/example) to testing $H_0: h(\beta) = \#$ using

$$T = \frac{h(b_{OLS}) - \#}{\sqrt{MSE}\sqrt{\hat{G}(\hat{D}'\hat{D})^{-1}\hat{G}'}}$$

and a $t_{n-k}$ reference distribution, and to use of the values

$$h(b_{OLS}) \pm t\sqrt{MSE}\sqrt{\hat{G}(\hat{D}'\hat{D})^{-1}\hat{G}'}$$

as confidence limits for $h(\beta)$.

For a set of predictor variables $x$, this can then be applied to $h(\beta) = f(x, \beta)$ to produce inferences for the mean response at $x$. That is, with

$$\underset{1 \times k}{G} = \left(\left.\frac{\partial f(x, b)}{\partial b_j}\right|_{b = \beta}\right)$$

and, as expected,

$$\hat{G} = \left(\left.\frac{\partial f(x, b)}{\partial b_j}\right|_{b = b_{OLS}}\right)$$

one may test $H_0: f(x, \beta) = \#$ using

$$T = \frac{f(x, b_{OLS}) - \#}{\sqrt{MSE}\sqrt{\hat{G}(\hat{D}'\hat{D})^{-1}\hat{G}'}}$$

and a $t_{n-k}$ reference distribution, and use the values

$$f(x, b_{OLS}) \pm t\sqrt{MSE}\sqrt{\hat{G}(\hat{D}'\hat{D})^{-1}\hat{G}'}$$

as confidence limits for $f(x, \beta)$.

Example 13 (Prediction) Suppose that in the future, $y^*$ normal with mean $h(\beta)$ and variance $\gamma\sigma^2$ independent of $Y$ will be observed. (The constant $\gamma$ is assumed to be known.) Approximate prediction limits for $y^*$ are then

$$h(b_{OLS}) \pm t\sqrt{MSE}\sqrt{\gamma + \hat{G}(\hat{D}'\hat{D})^{-1}\hat{G}'}$$