
















Material Type: Notes; Class: Data Analysis II; Subject: Statistics; University: University of Missouri - Columbia; Term: Unknown 1989;
Typology: Study notes

















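$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i.$$
$$Y_i = \beta_0 + \beta_1 X_{1i} X_{2i} + \beta_2 X_{1i}^2 + \varepsilon_i.$$
$$Y_i = \alpha X_{1i}^{\beta_1} X_{2i}^{\beta_2} e^{\varepsilon_i}.$$
$$\log(Y_i) = \log(\alpha) + \beta_1 \log(X_{1i}) + \beta_2 \log(X_{2i}) + \varepsilon_i.$$
Note that this model is equivalent to the nonlinear model above.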
However, we will sometimes need to consider nonlinear models because
$$Y_i = \alpha X_{1i}^{\beta_1} X_{2i}^{\beta_2} + \varepsilon_i.$$
[Figure: Logistic Curve. Horizontal axis: x (0.0 to 1.0); vertical axis: y (0 to 5).]
Least Squares:
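$$\mathrm{SSR}(\hat\theta) = \sum_{i=1}^{n} \{Y_i - f(x_i'; \hat\theta)\}^2.$$
Equivalently, in matrix form,
$$\mathrm{SSR}(\hat\theta) = [Y - f(\hat\theta)]'[Y - f(\hat\theta)],$$
where $f(\hat\theta) = [f(x_1'; \hat\theta), f(x_2'; \hat\theta), \cdots, f(x_n'; \hat\theta)]'$ and $Y = [Y_1, Y_2, \cdots, Y_n]'$.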
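$$\frac{\partial\, \mathrm{SSR}(\hat\theta)}{\partial \hat\theta_j} = -2 \sum_{i=1}^{n} \{Y_i - f(x_i'; \hat\theta)\} \frac{\partial f(x_i'; \hat\theta)}{\partial \hat\theta_j} = 0, \quad j = 1, \cdots, p.$$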
Unfortunately, for nonlinear models, the partial derivatives are functions of the parameters. Thus, the normal equations are nonlinear and an explicit solution for $\hat\theta$ cannot be obtained.
Exponential Growth Model:
Recall that this model has the form
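$$Y_i = \theta_1 \exp(\theta_2 t_i) + \varepsilon_i.$$
Then, the derivatives are
$$\frac{\partial f}{\partial \theta_1} = \exp(\theta_2 t_i) \quad \text{and} \quad \frac{\partial f}{\partial \theta_2} = \theta_1 t_i \exp(\theta_2 t_i).$$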
So, the normal equations are
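$$\sum_{i=1}^{n} [Y_i - \hat\theta_1 \exp(\hat\theta_2 t_i)][\exp(\hat\theta_2 t_i)] = 0, \quad \text{and}$$
$$\sum_{i=1}^{n} [Y_i - \hat\theta_1 \exp(\hat\theta_2 t_i)][\hat\theta_1 t_i \exp(\hat\theta_2 t_i)] = 0.$$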
Even for this simple example, there is no analytic solution for $\hat\theta$. We will have to consider numerical methods.
Numerical Methods for Solving Nonlinear LS:
Grid Search:
Let’s find the values of $f(\hat\theta^{(0)})$ and $F(\hat\theta^{(0)})$ for the function $f(x_i; \theta_1, \theta_2) = \theta_1 \exp(\theta_2 x_i)$, where there are three observations, $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$.
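The calculation can be written out as follows. This is a sketch that assumes $F(\theta)$ denotes the $n \times p$ matrix of partial derivatives $\partial f(x_i; \theta)/\partial \theta_j$, which is the usual Gauss-Newton convention; that definition does not survive in these notes. The entries follow from the derivatives of the exponential growth model given earlier.

```latex
f(\hat\theta^{(0)}) =
\begin{bmatrix}
\hat\theta_1^{(0)} \exp(\hat\theta_2^{(0)} x_1) \\
\hat\theta_1^{(0)} \exp(\hat\theta_2^{(0)} x_2) \\
\hat\theta_1^{(0)} \exp(\hat\theta_2^{(0)} x_3)
\end{bmatrix},
\qquad
F(\hat\theta^{(0)}) =
\begin{bmatrix}
\exp(\hat\theta_2^{(0)} x_1) & \hat\theta_1^{(0)} x_1 \exp(\hat\theta_2^{(0)} x_1) \\
\exp(\hat\theta_2^{(0)} x_2) & \hat\theta_1^{(0)} x_2 \exp(\hat\theta_2^{(0)} x_2) \\
\exp(\hat\theta_2^{(0)} x_3) & \hat\theta_1^{(0)} x_3 \exp(\hat\theta_2^{(0)} x_3)
\end{bmatrix}.
```

Each row corresponds to one observation and each column to one parameter, so $F(\hat\theta^{(0)})$ is $3 \times 2$ here.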
The Gauss-Newton Algorithm proceeds according to the following steps:
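(a) There is only a minor change in the objective function,
$$\frac{|\mathrm{SSR}(\hat\theta^{(j+1)}) - \mathrm{SSR}(\hat\theta^{(j)})|}{\mathrm{SSR}(\hat\theta^{(j)})} < \gamma_1, \quad \text{and}$$
(b) There is only a minor change in the parameter estimates,
$$|\hat\theta^{(j+1)} - \hat\theta^{(j)}| < \gamma_2.$$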
We can avoid overstepping using the Modified Gauss-Newton Algorithm. In this case, we will define a new proposal for $\theta$ as
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \alpha_j \hat\delta^{(j+1)}, \quad 0 < \alpha_j \le 1.$$
Here, the value of $\alpha_j$ is used to modify the step length. There are many different ways that we could choose $\alpha_j$. The simplest is
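Step Halving: Here, we will set
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \frac{1}{2^k} \hat\delta^{(j+1)},$$
where k is the smallest non-negative integer such that
$$\mathrm{SSR}\left(\hat\theta^{(j)} + \frac{1}{2^k} \hat\delta^{(j+1)}\right) < \mathrm{SSR}(\hat\theta^{(j)}).$$
That is, try $\hat\delta^{(j+1)}$, then $\hat\delta^{(j+1)}/2$, $\hat\delta^{(j+1)}/4$, etc.
SAS PROC NLIN uses this procedure for $k = 0, 1, \cdots, 10$.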
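The modified Gauss-Newton iteration with step halving can be sketched in Python for the exponential growth model. The individual Gauss-Newton steps do not survive in these notes, so this sketch assumes the standard increment $\hat\delta = [F'F]^{-1}F'(Y - f)$. The function names, the default tolerances, and the choice to stop only when both criteria (a) and (b) hold are illustrative assumptions, not part of the notes.

```python
import math

def model(theta, t):
    """Mean function f(t; theta) = theta1 * exp(theta2 * t)."""
    return theta[0] * math.exp(theta[1] * t)

def ssr(theta, ts, ys):
    """Sum of squared residuals at theta."""
    return sum((y - model(theta, t)) ** 2 for t, y in zip(ts, ys))

def gauss_newton_increment(theta, ts, ys):
    """Solve (F'F) delta = F'(Y - f) for this two-parameter model (assumed form)."""
    a11 = a12 = a22 = g1 = g2 = 0.0
    for t, y in zip(ts, ys):
        e = math.exp(theta[1] * t)
        d1 = e                     # df/dtheta1
        d2 = theta[0] * t * e      # df/dtheta2
        r = y - theta[0] * e       # residual
        a11 += d1 * d1
        a12 += d1 * d2
        a22 += d2 * d2
        g1 += d1 * r
        g2 += d2 * r
    det = a11 * a22 - a12 * a12
    return ((a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det)

def fit(ts, ys, theta0, gamma1=1e-10, gamma2=1e-8, max_iter=100, max_halvings=10):
    """Modified Gauss-Newton: try delta, delta/2, delta/4, ... until SSR decreases."""
    theta = tuple(theta0)
    current = ssr(theta, ts, ys)
    for _ in range(max_iter):
        delta = gauss_newton_increment(theta, ts, ys)
        for k in range(max_halvings + 1):      # k = 0, 1, ..., 10
            scale = 0.5 ** k
            proposal = (theta[0] + scale * delta[0], theta[1] + scale * delta[1])
            new = ssr(proposal, ts, ys)
            if new < current:
                break
        else:
            return theta                       # no halving reduced the SSR
        rel_change = (current - new) / current                      # criterion (a)
        step = max(abs(p - q) for p, q in zip(proposal, theta))     # criterion (b)
        theta, current = proposal, new
        if rel_change < gamma1 and step < gamma2:
            break
    return theta

# Illustrative data generated without noise from theta = (2, 0.5).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
responses = [2.0 * math.exp(0.5 * t) for t in times]
print(fit(times, responses, (1.5, 0.6)))
```

Because a proposal is accepted only when it lowers the SSR, the objective can never increase from one iteration to the next, which is the point of the modification.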
Other Algorithms:
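In general, these iterative algorithms have the form
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j A_j \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta},$$
where $A_j$ is some positive definite matrix and $\partial Q(\hat\theta^{(j)})/\partial\theta$ is the gradient of some function $Q$, which is typically the SSR. The minus sign is needed because $Q$ is being minimized: with $A_j$ positive definite, $-A_j\,\partial Q/\partial\theta$ is a descent direction.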
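The modified Gauss-Newton algorithm satisfies this form with
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j [F(\hat\theta^{(j)})'F(\hat\theta^{(j)})]^{-1} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta},$$
where here, $Q = \mathrm{SSR}$.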
Steepest Descent:
The steepest descent algorithm proceeds according to
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j I_{p \times p} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta}.$$
This algorithm is slow to converge, but it moves rapidly initially. Although it is seldom used anymore, it can be useful when starting values are not well known.
Levenberg-Marquardt:
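The Levenberg-Marquardt algorithm has the form
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j \left[F(\hat\theta^{(j)})'F(\hat\theta^{(j)}) + \tau I_{p \times p}\right]^{-1} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta}.$$
This method is a compromise between the Gauss-Newton and the steepest descent methods. It performs best when $F(\hat\theta^{(j)})'F(\hat\theta^{(j)})$ is nearly singular (that is, when $F(\hat\theta^{(j)})$ is close to not being of full rank). This is essentially a ridge regression.
How do we choose $\tau$? SAS starts with $\tau = 10^{-3}$. If $\mathrm{SSR}(\hat\theta^{(j+1)}) < \mathrm{SSR}(\hat\theta^{(j)})$, then $\tau$ is replaced by $\tau/10$ for the next iteration; otherwise, $\tau$ is replaced by $10\tau$.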
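The $\tau$ update rule can be illustrated with a short Python sketch for the exponential growth model. Several choices here are assumptions rather than anything stated in the notes: the increment is computed as $[F'F + \tau I]^{-1}F'(Y - f)$ (the factor of 2 in the gradient of the SSR is absorbed into the step length), a proposal that does not reduce the SSR is rejected before $\tau$ is increased, and the stopping tolerance and the cap on $\tau$ are arbitrary. The function names are hypothetical.

```python
import math

def normal_equation_terms(theta, ts, ys):
    """Return the entries of F'F, the entries of F'(Y - f), and the SSR."""
    a11 = a12 = a22 = g1 = g2 = ssr = 0.0
    for t, y in zip(ts, ys):
        e = math.exp(theta[1] * t)
        d1, d2 = e, theta[0] * t * e   # df/dtheta1, df/dtheta2
        r = y - theta[0] * e           # residual
        a11 += d1 * d1
        a12 += d1 * d2
        a22 += d2 * d2
        g1 += d1 * r
        g2 += d2 * r
        ssr += r * r
    return a11, a12, a22, g1, g2, ssr

def fit_levenberg_marquardt(ts, ys, theta0, tau=1e-3, tol=1e-12, max_iter=200):
    theta = tuple(theta0)
    for _ in range(max_iter):
        a11, a12, a22, g1, g2, current = normal_equation_terms(theta, ts, ys)
        b11, b22 = a11 + tau, a22 + tau            # diagonal of F'F + tau * I
        det = b11 * b22 - a12 * a12
        delta = ((b22 * g1 - a12 * g2) / det, (b11 * g2 - a12 * g1) / det)
        proposal = (theta[0] + delta[0], theta[1] + delta[1])
        new = normal_equation_terms(proposal, ts, ys)[5]
        if new < current:
            theta = proposal
            tau /= 10.0                            # success: move toward Gauss-Newton
            if current - new <= tol * current:     # assumed stopping rule
                break
        else:
            tau *= 10.0                            # failure: move toward steepest descent
            if tau > 1e12:                         # assumed cap to end the search
                break
    return theta
```

A small $\tau$ leaves the step close to the Gauss-Newton increment, while a large $\tau$ shrinks it toward a short step along the negative gradient, which is the compromise described above.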
Quasi-Newton:
For the Quasi-Newton method, we will update $\theta$ according to
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j H_j^{-1} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta},$$
where $H_j$ is a positive definite approximation of the Hessian, which gets closer to the true Hessian as $j \to \infty$. $H_j$ is computed iteratively, and this method is best among first-order methods (only first derivatives are required).
Derivative Free Methods:
Ex. Secant Method: This is like the Gauss-Newton method, but it calculates the derivatives numerically, from past iterations.
These methods generally work well.
Practical Considerations:
There are several practical considerations that you must be aware of when using these techniques.
$$\frac{\hat\theta_0^{(0)}}{1 + \exp(-\hat\theta_1^{(0)})}$$
Finally, notice that Y takes on the value $\theta_0/2$ when $x = -\theta_1/\theta_2$. Thus, we can find the x value which corresponds to the “average” Y, and use it to solve
$$-\frac{\hat\theta_1^{(0)}}{\hat\theta_2^{(0)}} = x.$$
We could then use $\hat\theta_0^{(0)}$, $\hat\theta_1^{(0)}$, and $\hat\theta_2^{(0)}$ as starting values.
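$$\theta_i = a + (\theta_i - a) = a + \exp\{\log(\theta_i - a)\} = a + \exp(\alpha_i), \quad \text{with } \alpha_i = \log(\theta_i - a).$$
Here, $-\infty < \alpha_i < \infty$ and this parametrization is unconstrained. Or, for the constraint $a < \theta_i < b$ we could write
$$\theta_i = a + \frac{b - a}{1 + \exp(-\gamma_i)}.$$
Again, $-\infty < \gamma_i < \infty$ and this parametrization is unconstrained.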
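The two reparametrizations can be sketched in Python as a pair of transforms and their inverses. The lower-bound case assumes the constraint $\theta_i > a$, which is implied by the $\log(\theta_i - a)$ term but not stated in the surviving text. The function names are hypothetical.

```python
import math

def from_unconstrained_lower(alpha, a):
    """Map any real alpha to theta > a via theta = a + exp(alpha)."""
    return a + math.exp(alpha)

def to_unconstrained_lower(theta, a):
    """Inverse map: alpha = log(theta - a), defined for theta > a."""
    return math.log(theta - a)

def from_unconstrained_interval(gamma, a, b):
    """Map any real gamma to a < theta < b via theta = a + (b - a) / (1 + exp(-gamma))."""
    return a + (b - a) / (1.0 + math.exp(-gamma))

def to_unconstrained_interval(theta, a, b):
    """Inverse map: gamma = log((theta - a) / (b - theta)), defined for a < theta < b."""
    return math.log((theta - a) / (b - theta))

# An optimizer can search freely over gamma; theta always stays inside (a, b).
for gamma in (-5.0, 0.0, 5.0):
    print(gamma, from_unconstrained_interval(gamma, 1.0, 3.0))
```

In practice the optimizer works with $\alpha_i$ or $\gamma_i$, and the estimate is mapped back to $\theta_i$ at the end. For very large negative $\gamma$ the call to `math.exp(-gamma)` overflows, so extreme values would need to be guarded in real use.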
$$s^2 = \frac{\sum_{i=1}^{n} [Y_i - f(x_i; \hat\theta)]^2}{n - p},$$
where p is the number of parameters in $\theta$ (not the number of regressors in $x_i$).
Using this statistic, we can form an approximate $100(1-\alpha)\%$ confidence interval for $a'\theta$ as
$$a'\hat\theta \pm t_{(1-\alpha/2,\, n-p)}\, s\, (a'[F(\theta)'F(\theta)]^{-1} a)^{1/2}.$$
To evaluate this interval, we can substitute $\hat\theta$ into $F(\theta)$.
As a specific case, suppose that $a' = (0, 0, \cdots, 0, 1, 0, \cdots, 0)$ where the jth element of a is 1 and all other elements are 0. Then, a confidence interval for the jth element of $\theta$ is
$$\hat\theta_j \pm t_{(1-\alpha/2,\, n-p)}\, s \sqrt{\hat c^{jj}},$$
where $\hat c^{jj}$ is the jth diagonal element of $[F(\theta)'F(\theta)]^{-1}$, evaluated at $\hat\theta$.
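For the two-parameter exponential growth model, the pieces of this interval can be computed directly. The following Python sketch is illustrative only: the function names are hypothetical, the $2 \times 2$ inverse is written out by hand, and the critical value `t_crit` is assumed to be supplied by the user (for example, from a t table with $n - p$ degrees of freedom).

```python
import math

def parameter_standard_errors(theta, ts, ys):
    """Return (s, se1, se2) for f(t; theta) = theta1 * exp(theta2 * t).

    s^2 = SSR / (n - p), and se_j = s * sqrt(c^jj), where c^jj is the jth
    diagonal element of the inverse of F'F evaluated at theta.
    """
    n, p = len(ts), 2
    a11 = a12 = a22 = ssr = 0.0
    for t, y in zip(ts, ys):
        e = math.exp(theta[1] * t)
        d1, d2 = e, theta[0] * t * e     # df/dtheta1, df/dtheta2
        a11 += d1 * d1
        a12 += d1 * d2
        a22 += d2 * d2
        ssr += (y - theta[0] * e) ** 2
    s = math.sqrt(ssr / (n - p))
    det = a11 * a22 - a12 * a12
    c11, c22 = a22 / det, a11 / det      # diagonal of the inverse of F'F
    return s, s * math.sqrt(c11), s * math.sqrt(c22)

def confidence_interval(estimate, std_error, t_crit):
    """Return (lower, upper) = estimate -/+ t_crit * std_error."""
    half_width = t_crit * std_error
    return (estimate - half_width, estimate + half_width)
```

The argument `theta` would normally be the least squares estimate $\hat\theta$ returned by one of the iterative algorithms, so that $F$ and the residuals are both evaluated at $\hat\theta$.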
Nonlinear Functions of the Parameters:
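More often, we are interested in nonlinear functions of $\theta$. Suppose that $h(\theta)$ is some nonlinear function of the parameters. Note that, by a first-order Taylor expansion,
$$h(\hat\theta) \approx h(\theta) + \mathbf{h}'(\hat\theta - \theta),$$
where $\mathbf{h} = (\partial h/\partial\theta_1, \cdots, \partial h/\partial\theta_p)'$.
Then,
$$E\{h(\hat\theta)\} \approx h(\theta) \quad \text{and} \quad \mathrm{var}\{h(\hat\theta)\} \approx \sigma^2\, \mathbf{h}'[F(\theta)'F(\theta)]^{-1}\mathbf{h}.$$
Thus, $h(\hat\theta) \sim AN(h(\theta),\, \sigma^2\, \mathbf{h}'[F(\theta)'F(\theta)]^{-1}\mathbf{h})$ and an approximate $100(1-\alpha)\%$ confidence interval for $h(\theta)$ is
$$h(\hat\theta) \pm t_{(1-\alpha/2,\, n-p)}\, s\, (\mathbf{h}'[F(\theta)'F(\theta)]^{-1}\mathbf{h})^{1/2}.$$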
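As a worked illustration, consider the exponential growth model $Y_i = \theta_1 \exp(\theta_2 t_i) + \varepsilon_i$ with $\theta_2 > 0$, and take $h(\theta)$ to be the doubling time. This example is not part of the original notes; it is a hypothetical choice of $h$ that shows how the gradient vector enters the interval. Here $c^{22}$ denotes the second diagonal element of $[F(\theta)'F(\theta)]^{-1}$.

```latex
\begin{align*}
h(\theta) &= \frac{\log 2}{\theta_2},
\qquad
\mathbf{h} = \left(\frac{\partial h}{\partial \theta_1},\; \frac{\partial h}{\partial \theta_2}\right)'
           = \left(0,\; -\frac{\log 2}{\theta_2^{2}}\right)', \\[4pt]
\mathrm{var}\{h(\hat\theta)\} &\approx \sigma^2\, \mathbf{h}'[F(\theta)'F(\theta)]^{-1}\mathbf{h}
           = \sigma^2\, \frac{(\log 2)^2}{\theta_2^{4}}\, c^{22}, \\[4pt]
\text{interval:}\quad & \frac{\log 2}{\hat\theta_2}
           \;\pm\; t_{(1-\alpha/2,\, n-p)}\; s\; \frac{\log 2}{\hat\theta_2^{2}}\, \sqrt{\hat c^{22}}.
\end{align*}
```

Because $h$ does not depend on $\theta_1$, only the $(2,2)$ element of $[F(\theta)'F(\theta)]^{-1}$ contributes to the variance.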
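Suppose that we are interested in a prediction interval for Y at $x = x_0$. Notice
$$y_0 = f(x_0; \theta) + \varepsilon_0, \quad \varepsilon_0 \sim N(0, \sigma^2), \quad \text{and} \quad \hat y_0 = f(x_0; \hat\theta).$$
When n is large, $\hat\theta$ is close to $\theta$, so we will use the Taylor series expansion
$$f(x_0; \hat\theta) \approx f(x_0; \theta) + f_0(\theta)'(\hat\theta - \theta),$$
where $f_0(\theta) = (\partial f(x_0; \theta)/\partial\theta_1, \cdots, \partial f(x_0; \theta)/\partial\theta_p)'$. Then,
$$y_0 - \hat y_0 \approx y_0 - f(x_0; \theta) - f_0(\theta)'(\hat\theta - \theta) = \varepsilon_0 - f_0(\theta)'(\hat\theta - \theta).$$
From this form, we can see that
$$E(y_0 - \hat y_0) \approx E(\varepsilon_0) - f_0(\theta)'E(\hat\theta - \theta) = 0,$$
and, because the new error $\varepsilon_0$ is independent of $\hat\theta$,
$$\mathrm{var}(y_0 - \hat y_0) \approx \mathrm{var}(\varepsilon_0) + \mathrm{var}\{f_0(\theta)'(\hat\theta - \theta)\} = \sigma^2 + \sigma^2 f_0(\theta)'[F(\theta)'F(\theta)]^{-1} f_0(\theta) = \sigma^2\{1 + f_0(\theta)'[F(\theta)'F(\theta)]^{-1} f_0(\theta)\}.$$
Thus, $y_0 - \hat y_0 \sim AN(0,\, \sigma^2\{1 + f_0(\theta)'[F(\theta)'F(\theta)]^{-1} f_0(\theta)\})$.
Now, since s is asymptotically independent of $\hat\theta$ (and thus of $y_0 - \hat y_0$),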