Basics of Unconstrained Optimization: Understanding Newton's Method and Its Alternatives -, Study notes of Computer Science

An in-depth exploration of unconstrained optimization, focusing on newton's method and its alternatives. Topics include the basics of newton's method, making it safe, descent directions and line searches, and alternatives like quasi-newton methods, finite-difference newton methods, steepest descent, nonlinear conjugate gradient methods, and truncated newton methods.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-yn5
koofers-user-yn5 🇺🇸

10 documents

1 / 31

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
AMSC/CMSC 660 Scientific Computing I
Fall 2008
Unit 3: Optimization
Dianne P. O’Leary
c
2008
Optimization: Fundamentals
Our goal is to develop algorithms to solve the problem
Problem P: Given a function f:S R, find
min
xSf(x)
with solution xopt.
The point xopt is called the minimizer, and the value f(xopt)is the minimum.
For unconstrained optimization, the set Sis usually taken to be Rn, but
sometimes we make use of upper or lower bounds on the variables, restricting our
search to a box
{x:`xu}
for some given vectors `,u Rn.
The plan
1. Basics of unconstrained optimization
2. Alternatives to Newton’s method
3. Fundamentals of constrained optimization
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f

Partial preview of the text

Download Basics of Unconstrained Optimization: Understanding Newton's Method and Its Alternatives - and more Study notes Computer Science in PDF only on Docsity!

AMSC/CMSC 660 Scientific Computing I Fall 2008 Unit 3: Optimization Dianne P. O’Leary ©^ c 2008

Optimization: Fundamentals

Our goal is to develop algorithms to solve the problem

Problem P: Given a function f : S → R, find

min x∈S f (x)

with solution xopt.

The point xopt is called the minimizer, and the value f (xopt) is the minimum.

For unconstrained optimization, the set S is usually taken to be Rn, but sometimes we make use of upper or lower bounds on the variables, restricting our search to a box {x : ` ≤ x ≤ u}

for some given vectors `, u ∈ Rn.

The plan

  1. Basics of unconstrained optimization
  2. Alternatives to Newton’s method
  3. Fundamentals of constrained optimization

Part 1: Basics of unconstrained optimization

Plan for Part 1:

The plan:

  • How do we recognize a solution?
  • Some geometry.
  • Our basic algorithm for finding a solution.
  • The model method: Newton.
  • How close to Newton do we need to be?
  • Making methods safe:
    • Descent directions and line searches.
    • Trust regions.

How do we recognize a solution?

What does it mean to be a solution?

The point xopt is a local solution to Problem P if there is a δ > 0 so that if x ∈ S and ‖x − xopt‖ < δ, then f (xopt) ≤ f (x).

In other words, xopt is at least as good as any point in its neighborhood.

The point xopt is a global solution to Problem P if for any x ∈ S, then f (xopt) ≤ f (x).

Note: It would be nice if every local solution was guaranteed to be global. This is true if f is convex. We’ll look at this case more carefully in the ”Geometry” section of these notes.

Some notation

We’ll assume throughout this unit that f is smooth enough that it has as many continuous derivatives as we need. For this section, that means 2 continuous derivatives plus one more, possibly discontinuous.

The gradient of f at x is defined to be the vector

g(x) = 5 f (x) =

∂f /∂x 1 .. . ∂f /∂xn

From linear algebra, this is equivalent to saying that the matrix H(xopt) must be positive semidefinite. In other words, all of its eigenvalues must be nonnegative.

Are these conditions sufficient?

Not quite.

Example: Let f be a function of a single variable:

f (x) = x^3.

Then f ′(x) = 3x^2 and f ′′(x) = 6x, so f ′(0) = 0 and f ′′(0) = 0, so x = 0 satisfies the first- and second-order necessary conditions for optimality, but it is not a minimizer of f. []

We are very close to sufficiency, though: Recall that a symmetric matrix is positive definite if all of its eigenvalues are positive.

If g(x) = 0 and H(x) is positive definite, then x is a local minimizer.

Some geometry

What all of this means geometrically

Imagine you are at point x on a mountain, described by the function f (x), and it is foggy. (So x ∈ R^2 .)

The direction g(x) is the direction of steepest ascent. So if you want to climb the mountain, it is the best direction to walk.

The direction −g(x) is the direction of steepest descent, the fastest way down.

Any direction p that makes a positive inner product with the gradient is an uphill direction, and any direction that makes a negative inner product is downhill.

If you are standing at a point where the gradient is zero, then there is no ascent direction and no descent direction, but a direction of positive curvature will lead you to a point where you can go uphill, and a direction of negative curvature will lead you to a point where you can descend.

If you can’t find any of these, then you are at the bottom of a valley!

The basic algorithm

The basic algorithm

Our basic strategy is inspired by the foggy mountain:

Take an initial guess at the solution x(0), our starting point on the mountain. Set k = 0.

Until x(k)^ is a good enough solution,

Find a search direction p(k). Set x(k+1)^ = x(k)^ + αkp(k), where αk is a scalar chosen to guarantee that progress is made. Set k = k + 1.

Initially, we will study algorithms for which αk = 1.

Unresolved details:

  • testing convergence.
  • finding a search direction.
  • computing the step-length αk.

The model method: Newton

Newton’s method

Newton’s method is one way to determine the search direction p(k). It is inspired by our Taylor series expansion

f (x + p) ≈ f (x) + pT^ g(x) +

pT^ H(x)p ≡ fˆ (p).

Suppose we replace f (x + p) by the quadratic model fˆ (p) and minimize that.

In general, the model won’t fit f well at all ... except in a neighborhood of the point x where it is built. But if our step p is not too big, that is ok!

So let’s try to minimize fˆ with respect to p. If we set the derivative equal to zero

g(x) + H(x)p = 0

Theorem: (Fletcher, p46) Suppose f ∈ C^2 (S) and there is a positive scalar λ such that ‖H(x) − H(y)‖ ≤ λ‖x − y‖

for all points x, y in a neighborhood of xopt. Then if x(k)^ is fficiently close to xopt and if H(xopt) is positive definite, then there exists a constant c such that

‖e(k+1)‖ ≤ c‖e(k)‖^2.

This rate of convergence is called quadratic convergence and it is remarkably fast. If we have an error of 10 −^1 at some iteration, then two iterations later the error will be about 10 −^4 (if c ≈ 1 ). After four iterations it will be about 10 −^16 , as many figures as we carry in double precision arithmetic!

Newton’s quadratic rate of convergence is nice, but Newton’s method is not an ideal method:

  • It requires the computation of H at each iteration.
  • It requires the solution of a linear system involving H.
  • It can fail if H fails to be positive definite.

So we would like to modify Newton’s method to make it cheaper and more widely applicable without sacrificing its fast convergence.

An important result

We can get superlinear convergence (convergence with rate r > 1 ) without walking exactly in the Newton direction.

Making the Newton method safe

When does Newton get into trouble?

We want to modify Newton whenever we are not sure that the direction it generates is downhill.

If the Hessian is positive definite, we know the direction will be downhill, although if H is nearly singular, we may have some computational difficulties.

If the Hessian is semidefinite or indefinite, we might or might not get a downhill direction.

Our strategy:

  • We’ll use the Hessian matrix whenever it is positive definite and not close to singular, because it leads to quadratic convergence.
  • We’ll replace H(x) by Hˆ(x) = H(x) + Eˆ whenever H is close to singularity or fails to be positive definite.

Conditions on Hˆ:

  • Hˆ is symmetric positive definite.
  • Hˆ is not too close to singular; in other words, its smallest eigenvalue is bounded below by a constant bigger than zero.

Modifying Newton’s method

Sample Strategy: Levenberg-Marquardt method. This one was actually proposed for least squares problems, but it works here, too.

Replace H by Hˆ = H + γI.

This shifts every eigenvalue up by γ.

How do we choose γ? It is usually done by trial and error: seek a γ so that Hˆ is positive definite and ‖p(k)‖ ≤ h(k)^ where {h(k)} is a given sequence of numbers.

Note: If the h’s are small enough, then we can avoid using a line search. (Line searches will be discussed later, but their disadvantage is that they require the function to be evaluated many times.)

Better Strategy: Cholesky Strategies: developed by Gill and Murray.

Background: Any symmetric positive definite matrix A can be factored as

A = LDLT

where D is a diagonal matrix and L is lower triangular with ones on its main diagonal.

Idea:

  • While factoring, if any dii ≤ 0 , modify it so that it is positive. This changes the factored matrix from H to Hˆ.

What is missing? How long a step should we take in the direction p?

Descent directions and line searches.

A backtracking line search

Reference: Dennis & Schnabel pp. 6-17. Algorithm due to Dennis, More’, Schnabel.

We take x(k+1)^ = x(k)^ + α(k)p(k)^.

How do we choose α(k)?

Let F (α) = f (x + αp).

Then F ′(α) = pT^ g(x + αp).

Backtracking line search:

Choose α = 1 (to give the full Newton step). While α is not good enough,

Choose a new αnew ∈ [0, α] by interpolation, and set α = αnew.

Note: If p is not the Newton direction, then we may need an initial braketing phase to find a good upper bound on α by testing larger values.

Reference: See Fletcher Section 2.6 for details.

How do we decide that α is good enough?

Our situation

  • Have a downhill direction p, so we know that for very small α, F (α) < F (0).
  • If p = the Newton direction, then we predict that α = 1 is the minimizer.
  • We want an upper bound on the αs to consider, since Newton’s method is based on a quadratic model and is not expected to fit the function well if we go too far.
  • We might have F ′^ available.
  • We really can’t afford an exact line search. In an exact linesearch we find the value of α that exactly minimizes f (x + αp). We can do this for quadratic functions, since in that case a formula for α can be derived, but in general exact linesearch is impossible and is only interesting because a lot of theorems demand it.

What do we do?

  • Goldstein conditions
  • Wolfe(1968)-Powell(1976) conditions

What do these conditions buy for us?

It can be shown that acceptable points exist as long as the minimizer is finite.

Typical theorem: Global convergence of descent methods. If

  • f is continuously differentiable and bounded below,
  • g is Lipschitz continuous for all x, i.e., there exists a constant L such that, for all x and y, ‖g(x) − g(y)‖ ≤ L‖x − y‖

Then either g(k)^ = 0 for some k or g(k)^ → 0.

Proof: See Fletcher.

Reference: There is a similar (but less satisfying) theorem on p. 316 of N&S.

[]

Trust regions.

Trust regions: an alternative to linesearch

Reference: Fletcher Chapter 5.

Trust region methods determine α and p simultaneously.

Idea: Use g and H to form a quadratic model

f (x + p) ≈ q(p) = f (x) + pT^ g +

pT^ Hp.

Pitfall in trust region methods: If the problem is poorly scaled, then the trust region will remain very small and we will never be able to take large steps to get us close to the solution.

Example: f (x) = f 1 (x 1 ) + f 1 (10000x 2 ) where f 1 is a well-behaved function. []

Convergence of trust region methods

Typical theorem: Global convergence of trust region methods. If

  • S = {x : f (x) ≤ f (x(0))} is bounded.
  • f ∈ C^2 (S)

Then the sequence {x(k)} has an accumulation point xopt that satisfies the first- and second-order necessary conditions for optimality.

Final words

We now know how to recognize a solution and compute a solution using Newton’s method.

We have added safeguards in case the Hessian fails to be positive definite, and we have added a linesearch to guarantee convergence.

The resulting algorithm converges rather rapidly, but each iteration is quite expensive.

Next, we want to investigate algorithms that have lower cost per iteration.

Part 2: Alternatives to Newton’s method

The plan:

Recall that Newton’s method is very fast (if it converges) but it requires the gradient and the Hessian matrix to be evaluated at each iteration.

Next we develop some alternatives to Newton’s method:

  • Some alternatives that avoid calculation of the Hessian:
    • Quasi-Newton methods
    • finite-difference Newton methods
  • Algorithms that avoid calculation of the Hessian and storage of any matrix:
    • steepest descent
    • nonlinear conjugate gradient methods
    • limited-memory Quasi-Newton methods
    • truncated Newton methods
  • Technology that helps in the calculation of the Hessian: automated differentiation.
  • Methods that require no derivatives
    • finite difference methods
    • Nelder&Meade simplex
    • pattern search

Methods that require only first derivatives

If the Hessian is unavailable...

Notation:

  • H = Hessian matrix.
  • B is its approximation.
  • C is the approximation to H−^1.

Problem: Solve min x

f (x)

when g(x) can be computed, but not H(x).

Two options:

  • Estimate H(x) using finite differences: discrete Newton. This can work well.
  • Approximate H(x) using Quasi-Newton methods. (Also called variable metric.)

No Hessian Method 1: Quasi-Newton Method

Quasi-Newton methods could be the basis for a full course; there is a textbook by Dennis and Schnabel. We’ll just hit the high points.

Reference: SIAM Rev 21 1979 443-459.

where

s(k)^ = x(k+1)^ − x(k)^ , y(k)^ = g(k+1)^ − g(k)^.

Unquiz: Verify that Broyden’s good method satisfies the secant equation and the no-change conditions. []

Notes: For Broyden’s good method,

  • B(k+1)^ is formed from B(k)^ by adding a rank-one matrix.
  • B(k+1)^ is not necessarily symmetric, even if B(k)^ is. This is undesirable since we know H is symmetric.

In order to regain symmetry, we need to sacrifice the no-change conditions. Instead, we formulate the problem in a least change sense:

min B(k+1)

‖B(k+1)^ − B(k)‖

subject to the secant condition

B(k+1)s(k)^ = y(k)^.

The solution (again) depends on the choice of norm.

A refinement to this idea: We can impose other constraints, too.

  • Perhaps we know that H is sparse, and we want B to have the same structure.
  • We might want to keep B(k+1)^ positive definite.

Some history

An alphabet soup of algorithms:

  • DFP: Davidon 1959, Fletcher-Powell 1963
  • BFGS: Broyden, Fletcher, Goldfarb, Shanno 1970
  • Broyden’s good method and Broyden’s bad

An example: DFP

Still one of the most popular because it has many desirable properties.

We accumulate an approximation C to H−^1 rather than to H.

C(k+1)^ = C(k)^ −

C(k)y(k)y(k)T^ C(k) y(k)T^ C(k)y(k)^

s(k)s(k)T y(k)T^ s(k)

An example: BFGS

The most popular method.

B(k+1)^ = B(k)^ −

B(k)s(k)s(k)T^ B(k) s(k)T^ B(k)s(k)^

y(k)y(k)T y(k)T^ s(k)

Use of Quasi-Newton methods

The algorithm looks very similar to Newton’s method:

Until x(k)^ is a good enough solution,

Compute a search direction p(k)^ from p(k)^ = −C(k)g(k)^ (or solve B(k)p(k)^ = −g(k)). Set x(k+1)^ = x(k)^ + αkp(k), where αk satisfies the Goldstein-Armijo or Wolfe linesearch conditions. Form the updated matrix C(k+1)^ (or B(k+1)). Increment k.

Initialization: Now need to initialize B(0)^ (or C(0)) as well as x(0). Take B(0)^ = I or some better guess.

Unanswered questions

  • Near a stationary point, H−^1 does not exist. How do we keep C from deteriorating?
  • What happens if H is indefinite?
  • How do we check optimality?

approximately zero and they can find no descent direction or direction of negative curvature.

Convergence rate

All of these methods have an n-, 2 n-, or (n + 2)-step quadratic convergence rate if the line search is exact. An n-step quadratic convergence rate, for example, means that

lim k→∞

‖x(k+n)^ − xopt‖ ‖x(k)^ − xopt‖^2

Weakening the line search to a Wolfe or Goldstein-Armijo search generally gives superlinear convergence. See N&S p. 356 for a typical result for the Huang family.

No Hessian Method 2: Finite-difference Newton Method

Finite-difference Newton methods (N&S 11.4.1)

One of our major time-sinks in using Newton’s method is to evaluate the Hessian matrix, with entries

hij =

∂gi ∂xj

One way to avoid these evaluations is to approximate the entries:

hij ≈

gi(x + τ ej ) − gi(x) τ

where τ is a small number and ej is the jth unit vector.

Cost: n extra gradient evaluations per iteration. Sometimes this is less than the cost of the Hessian evaluation, but sometimes it is more.

Practical matters:

  • How to choose τ.
    • If τ is large, the approximation is poor and we have large truncation error.
    • If τ is small, then there is cancellation error in forming the numerator of the approximation, so we have large round-off error.

Usually we try to balance the two errors by choosing τ to make them approximately equal.

  • If the problem is poorly scaled, we may need a different τ for each j.

Convergence rate

There are theorems that say that if we choose τ carefully enough, we can get superlinear convergence.

Methods that require only first derivatives and store no matrices

Sometimes problems are too big to allow n^2 storage space for the Hessian matrix.

Some alternatives:

  • steepest descent
  • nonlinear conjugate gradient
  • limited memory Quasi-Newton
  • truncated Newton

Low-storage Method 1: Steepest Descent N&S 11.

Back to that foggy mountain

If we walk in the direction of steepest descent until we stop going downhill, we clearly are guaranteed to get to a local minimizer.

The trouble is that the algorithm is terribly slow. See the examples in N&S.

If we apply steepest descent to a quadratic function of n variables, then after many steps, the algorithm takes alternate steps approximating two directions: those corresponding to the eigenvectors of the smallest and the largest eigenvalues of the Hessian matrix.

The convergence rate on quadratics is linear:

f (x(k+1)) − f (xopt) ≤

κ − 1 κ + 1

(f (x(k)) − f (xopt))

where κ is the ratio of the largest to the smallest eigenvalue of H.

See N&S p.342 for proof.