Non Linear Regression - Data Analysis II - Notes | STAT 8320, Study notes of Statistics

Material Type: Notes; Class: Data Analysis II; Subject: Statistics; University: University of Missouri - Columbia; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 08/18/2009

koofers-user-z7o
koofers-user-z7o 🇺🇸

10 documents

1 / 24

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Non-Linear Regression
You have seen models that are linear in the param-
eters. For example,
A model which is linear in the regressors and
linear in the parameters,
Yi=β0+β1X1i+β2X2i+²i.
A model which is linear in the parameters and
nonlinear in the regressors,
Yi=β0+β1X1iX2i+β2X2
1i+²i.
What happens if the model is non-linear in the pa-
rameters? For example
Yi=αXβ1
1iXβ2
2ie²i.
We could consider re-parameterizing the model by
log(Yi) = log(α) + β1log(X1i) + β2log(X2i) + ²i.
Note that this model is equivalent to the nonlinear
model above.
We could approximate a nonlinear model by a linear
model on a small interval.
F.1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18

Partial preview of the text

Download Non Linear Regression - Data Analysis II - Notes | STAT 8320 and more Study notes Statistics in PDF only on Docsity!

Non-Linear Regression

  • You have seen models that are linear in the param- eters. For example, - A model which is linear in the regressors and linear in the parameters,

Yi = β 0 + β 1 X 1 i + β 2 X 2 i + ≤i.

  • A model which is linear in the parameters and nonlinear in the regressors,

Yi = β 0 + β 1 X 1 iX 2 i + β 2 X 12 i + ≤i.

  • What happens if the model is non-linear in the pa- rameters? For example

Yi = αX 1 βi^1 X 2 βi^2 e≤i^.

  • We could consider re-parameterizing the model by

log(Yi) = log(α) + β 1 log(X 1 i) + β 2 log(X 2 i) + ≤i. Note that this model is equivalent to the nonlinear model above.

  • We could approximate a nonlinear model by a linear model on a small interval.

Nonlinear Models, Cont.

However, we will sometimes need to consider nonlinear models because

  • Many models cannot be made linear using a trans- formation:

Yi = αX 1 βi^1 X 2 βi^2 + ≤i.

  • We don’t wish to re-parameterize because there is a “natural” or “theoretical” parameterization. For example, the βs make physical sense, however, the transformed βs do not.
  • Some models are not well approximated by any simple linear model. For instance, Yi = α/{1 + exp(−β 0 − β 1 Xi)}.

0.0 0.2 0.4 0.6 0.8 1.

0

1

2

3

4

5

Logistic Curve

x

y

  • We cannot approximate this with a polynomial, as they go to ±∞.
  • No simple linear model fits “well”.

Fitting Nonlinear Regression Models

Least Squares:

  • The LS estimate of θ, ˆθ, is the set of parameters that minimizes the sum of squared residuals:

SSR(ˆθ) =

∑^ n

i=

{Yi − f (x′ i; ˆθ)}^2.

  • We could equivalently consider this from a matrix perspective by writing

SSR(ˆθ) = [Y − f (ˆθ)]′[Y − f (ˆθ)],

where f (ˆθ) = [f (x′ 1 ; ˆθ), f (x′ 2 ; ˆθ), · · · , f (x′ n; ˆθ)]′^ and Y = [Y 1 , Y 2 , · · · Yn]′.

  • To obtain the normal equations we consider the partial derivatives of SSR(ˆθ) with respect to each θˆj, and set them equal to zero. This gives us a system of p equations. Each normal equation is given by

∂SSR(ˆθ) ∂θˆj

∑^ n

i=

{Yi − f (x′ i; ˆθ)}

[

∂f (x′ i; ˆθ) ∂ˆθj

]

Unfortunately, for nonlinear models, the partial deriva- tives are functions of the parameters. Thus, the normal equations are nonlinear and an explicit so- lution for ˆθ cannot be obtained.

Simple Example

Exponential Growth Model:

Recall that this model has the form

Yi = θ 1 exp(θ 2 ti) + ≤i.

Then, the derivatives are

∂f ∂θ 1

= exp(θ 2 ti) and

∂f ∂θ 2

= θ 1 ti exp(θ 2 ti).

So, the normal equations are

∑^ n

i=

[Yi − θˆ 1 exp(ˆθ 2 ti)][exp(ˆθ 2 ti)] = 0, and

∑^ n

i=

[Yi − θˆ 1 exp(ˆθ 2 ti)][ˆθ 1 ti exp(ˆθ 2 ti)] = 0.

Even for this simple example, there is no analytic solu- tion for ˆθ. We will have to consider numerical methods.

Numerical Methods for Solving Nonlinear LS:

Grid Search:

  • Consider a “grid” of possible parameter values and find those that minimize the SSR.
  • Use progressively finer grids to achieve accuracy.
  • Inefficient and not practical when p large.

Gauss-Newton Algorithm, Cont.

  • We could then solve for δ = θ − θˆ(0)^ using ordinary least squares: ˆδ(j)^ = {F (ˆθ(j−1))′F (ˆθ(j−1))}−^1 F (ˆθ(j−1))′[Y − f (ˆθ(j−1))].
  • Finally, since δ = θ − ˆθ(0), we could solve for ˆθ(1)^ = θˆ(0)^ + ˆδ(1). We can view this as finding ˆθ(1), the new estimate of θ.
  • This new estimate, ˆθ(1), can then be used to find another new estimate, ˆθ(2), and so on.

Let’s find the values of f (ˆθ(0)) and F (ˆθ(0)) for the func-

tion f (xi; θ 1 , θ 2 ) = θ 1 exp(θ 2 xi), where there are three observations, (x 1 , y 1 ), (x 2 , y 2 ), and, (x 3 , y 3 ).

Gauss-Newton Algorithm, Cont.

The Gauss-Newton Algorithm proceeds according to the following steps:

  1. Obtain an initial estimate.
  2. Perform the Taylor series expansion, and calculate f (ˆθ(j)) and F (ˆθ(j)).
  3. Use OLS to find ˆδ(j+1), the adjustment to the cur- rent estimate.
  4. Find a new estimate of θ. Use this estimate to return to step 2.
  5. Continue until

(a) There is only a minor change in the objective function,

|SSR(ˆθ(j+1)) − SSR(ˆθ(j))| SSR(ˆθ(j))

< γ 1 , and

(b) There is only a minor change in the parameter estimates,

|θˆ(j+1)^ − ˆθ(j)| < γ 2.

Modified Gauss-Newton Algorithm

We can avoid overstepping using the Modified Gauss- Newton Algorithm. In this case, we will define a new proposal for θ as

θˆ(j+1)^ = ˆθ(j)^ + αjˆδ(j+1), 0 < αj ≤ 1.

Here, the value of αj is used to modify the step length. There are many different ways that we could choose αj. The simplest is

Step Halving: Here, we will set

ˆθ(j+1)^ = ˆθ(j)^ +^1 2 k^

ˆδ(j+1),

where k is the smallest non-negative integer such that

SSR(ˆθ(j)^ +

2 k^

ˆδ(j+1)) < SSR(ˆθ(j)).

That is, try ˆδ(j+1), then ˆδ(j+1)/2, ˆδ(j+1)/4, etc.

SAS PROC NLIN uses this procedure for k = 0, 1 , · · · , 10.

Other Algorithms:

In general, convergence algorithms have the form

θˆ(j+1)^ = ˆθ(j)^ + αjAj^ ∂Q(ˆθ

(j))

∂θ

where Aj is some positive definite matrix, ∂Q(ˆθ(j))/∂θ is the gradient, based upon some function Q, which is typically the SSR.

Other Algorithms, Cont.

The modified Gauss-Newton algorithm satisfies this form with

θˆ(j+1)^ = ˆθ(j)^ − αj[F (ˆθ(j))′F (ˆθ(j))]−^1 ∂Q(ˆθ

(j))

∂θ

where here, Q = SSR.

Steepest Descent:

The steepest descent algorithm proceeds according to

ˆθ(j+1)^ = ˆθ(j)^ + αjIp×p^ ∂Q(ˆθ

(j))

∂θ

This algorithm is slow to converge, but it moves rapidly initially. Although it is seldom used anymore, it can be useful when starting values are not well known.

Levenberg-Marquardt:

The Levenberg-Marquardt algorithm has the form

θˆ(j+1)^ = ˆθ(j)^ − αj

[

F (ˆθ(j))′F (ˆθ(j)) + τ Ip×p

]− (^1) ∂Q(ˆθ(j))

∂θ

This method is a compromise between the Gauss-Newton and the steepest descent methods. It performs best when F (ˆθ(j))′F (ˆθ(j)) is nearly singular (F (ˆθ(j)) is not of full rank). This is essentially a ridge regression.

How do we choose τ? SAS starts with τ = 10−^3. If SSR(ˆθ(j+1)) < SSR(ˆθ(j)), then τ = τ /10 for the next iteration, otherwise, τ = 10τ.

Other Algorithms, Cont.

Quasi-Newton:

For the Quasi-Newton method, we will update θ accord- ing to

ˆθ(j+1)^ = ˆθ(j)^ − αjH j−^1 ∂Q(ˆθ

(j))

∂θ

where H is a positive definite approximation of the the Hessian, which gets closer as j → ∞. Hj is computed iteratively, and this method is best among first-order methods (only first derivatives are required).

Derivative Free Methods:

Ex. Secant Method: This is like the Gauss-Newton method, but it calculates the derivatives numerically, from past iterations.

These methods generally work well.

Practical Considerations:

There are several practical considerations that you must be aware of when using these techniques.

  • Failure to converge:
    • SSR(θ) may be “flat” in a neighborhood of the minimum.

Practical Considerations, Cont.

  • Failure to Converge:
    • Using better starting values can alleviate this problem.
    • This may also suggest that the model is too complex for the data, and a simpler model should be considered.
  • Convergence to a Local Minimum:
    • Linear least squares has the property that SSR(θ) = (Y −Xβ)′(Y −Xβ), which is quadratic and there- fore has a unique minimum.
    • Nonlinear least squares need not have a unique minimum.
    • Using multiple starting values can identify this problem if it exists.
    • If the dimension of θ is low, graph SSR(θ) as a function of θi.
  • Starting Values:
    • To converge, algorithms need good initial esti- mates.
    • Prior or theoretical information can be used.

Practical Considerations, Cont.

  • Starting Values, Cont.
    • Model interpretation, cont. Furthermore, when x = 0, then Y = θ 0 /{1 + exp(−θ 1 )}. Thus, an estimate of θ 1 can be obtained by solving

θˆ 0 (0) 1 + exp(−θˆ(0) 1 )

= Y (0).

Finally, notice that Y takes on the value θ 0 / 2 when x = −θ 1 /θ 2. Thus, we can find the x value which corresponds to the “average” Y , and use it to solve

θˆ(0) 1 θ^ ˆ(0) 2

= x.

We could then use ˆθ 0 (0) , ˆθ 1 (0) , and ˆθ(0) 2 as starting values.

  • Constrained Parameters: often we have constraints on parameters like θi > a or a < θi < b. - Try fitting the model ignoring the constraints and see if the converged parameter estimates satisfy the constraints. If the results satisfy the constraints, no further action is needed.

Practical Considerations, Cont.

  • Constrained Parameters, cont.
    • Try re-parameterizing: for example, for the con- straint θi > a we could write

θi = a + θi − a = a + exp{log(θi − a)} = a + exp(αi). Here, −∞ < αi < ∞ and this parametrization is unconstrained. Or, for the constraint a < θi < b we could write

θi = a +

b − a 1 + exp(−γi)

Again, −∞ < γi < ∞ and this parametrization is unconstrained.

  • Estimation of σ^2. By analogy to the linear case, we will use

s^2 =

∑n i=1[Yi^ −^ f^ (xi;^ θ)]

2

n − p

where p is the number of parameters in θ (not the number of regressors in xi).

Linear Functions of Parameters, Cont.

Using this statistic, we can form an approximate 100(1− α)% confidence interval for a′θ as

a′θˆ ± t(1−α/ 2 ,n−p)s(a′[F (θ)′F (θ)]−^1 a)^1 /^2.

To evaluate this interval, we can substitute ˆθ into F (θ).

As a specific case, suppose that a′^ = (0, 0 , · · · , 0 , 1 , 0 , · · · , 0) where the jth element of a is 1 and all other elements are 0. Then, a confidence interval for the jth element of θ is

θˆj ± t(1−α/ 2 ,n−p)s

ˆcjj,

where ˆcjj^ is the jth diagonal element of [F (θ)′F (θ)]−^1.

Nonlinear Functions of the Parameters:

More often, we are interested in nonlinear functions of θ. Suppose that h(θ) is some nonlinear function of the parameters. Note:

h(ˆθ) ≈ h(θ) + h′(ˆθ − θ),

where h = (∂h/∂θ 1 , · · · , ∂h/∂θp)′.

Then,

E{h(ˆθ)} ≈ h(θ) and var{h(ˆθ)} ≈ σ^2 h′[F (θ)′F (θ)]−^1 h.

Thus, h(ˆθ) ∼ AN(h(θ), σ^2 h′[F (θ)′F (θ)]−^1 h) and an ap- proximate 100(1 − α)% confidence interval for h(θ) is

h(ˆθ) ± t(1−α/ 2 ,n−p)s(h′[F (θ)′F (θ)]−^1 h)^1 /^2.

Confidence Intervals for Prediction

Suppose that we are interested in a prediction interval for Y at x = x 0. Notice

y 0 = f (x 0 ; θ) + ≤ 0 , ≤ 0 ∼ N(0, σ^2 ), and ˆy 0 = f (x 0 ; ˆθ).

When n is large, ˆθ is close to θ, so we will use the Taylor series expansion

f (x 0 , θˆ) ≈ f (x 0 , θ) + f 0 (θ)′(ˆθ − θ),

where f 0 (θ) = (∂f (x 0 , θ)/∂θ 1 , · · · , ∂f (x 0 , θ)/∂θp)′. Then,

y 0 − ˆy 0 ≈ y 0 − f (x 0 , θ) − f 0 (θ)′(ˆθ − θ) = ≤ 0 − f 0 (θ)′(ˆθ − θ).

From this form, we can see that

E(y 0 − ˆy 0 ) ≈ E(≤ 0 ) − f 0 (θ)′E(ˆθ − θ) = 0,

and

var(y 0 − yˆ 0 ) ≈ var(≤ 0 ) − var{f 0 (θ)′(ˆθ − θ)} = σ^2 + σ^2 f 0 (θ)′[F (θ)′F (θ)]−^1 f 0 (θ) = σ^2 {1 + f 0 (θ)′[F (θ)′F (θ)]−^1 f 0 (θ)}

Thus, y 0 − yˆ 0 ∼ AN(0, σ^2 {1 + f 0 (θ)′[F (θ)′F (θ)]−^1 f 0 (θ)}).

Now, since s is asymptotically independent of θˆ (and thus of y 0 − ˆy 0 ),