























Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
An in-depth exploration of unconstrained optimization, focusing on newton's method and its alternatives. Topics include the basics of newton's method, making it safe, descent directions and line searches, and alternatives like quasi-newton methods, finite-difference newton methods, steepest descent, nonlinear conjugate gradient methods, and truncated newton methods.
Typology: Study notes
1 / 31
This page cannot be seen from the preview
Don't miss anything!
























AMSC/CMSC 660 Scientific Computing I Fall 2008 Unit 3: Optimization Dianne P. O’Leary ©^ c 2008
Optimization: Fundamentals
Our goal is to develop algorithms to solve the problem
Problem P: Given a function f : S → R, find
min x∈S f (x)
with solution xopt.
The point xopt is called the minimizer, and the value f (xopt) is the minimum.
For unconstrained optimization, the set S is usually taken to be Rn, but sometimes we make use of upper or lower bounds on the variables, restricting our search to a box {x : ` ≤ x ≤ u}
for some given vectors `, u ∈ Rn.
The plan
Part 1: Basics of unconstrained optimization
Plan for Part 1:
The plan:
How do we recognize a solution?
What does it mean to be a solution?
The point xopt is a local solution to Problem P if there is a δ > 0 so that if x ∈ S and ‖x − xopt‖ < δ, then f (xopt) ≤ f (x).
In other words, xopt is at least as good as any point in its neighborhood.
The point xopt is a global solution to Problem P if for any x ∈ S, then f (xopt) ≤ f (x).
Note: It would be nice if every local solution was guaranteed to be global. This is true if f is convex. We’ll look at this case more carefully in the ”Geometry” section of these notes.
Some notation
We’ll assume throughout this unit that f is smooth enough that it has as many continuous derivatives as we need. For this section, that means 2 continuous derivatives plus one more, possibly discontinuous.
The gradient of f at x is defined to be the vector
g(x) = 5 f (x) =
∂f /∂x 1 .. . ∂f /∂xn
From linear algebra, this is equivalent to saying that the matrix H(xopt) must be positive semidefinite. In other words, all of its eigenvalues must be nonnegative.
Are these conditions sufficient?
Not quite.
Example: Let f be a function of a single variable:
f (x) = x^3.
Then f ′(x) = 3x^2 and f ′′(x) = 6x, so f ′(0) = 0 and f ′′(0) = 0, so x = 0 satisfies the first- and second-order necessary conditions for optimality, but it is not a minimizer of f. []
We are very close to sufficiency, though: Recall that a symmetric matrix is positive definite if all of its eigenvalues are positive.
If g(x) = 0 and H(x) is positive definite, then x is a local minimizer.
Some geometry
What all of this means geometrically
Imagine you are at point x on a mountain, described by the function f (x), and it is foggy. (So x ∈ R^2 .)
The direction g(x) is the direction of steepest ascent. So if you want to climb the mountain, it is the best direction to walk.
The direction −g(x) is the direction of steepest descent, the fastest way down.
Any direction p that makes a positive inner product with the gradient is an uphill direction, and any direction that makes a negative inner product is downhill.
If you are standing at a point where the gradient is zero, then there is no ascent direction and no descent direction, but a direction of positive curvature will lead you to a point where you can go uphill, and a direction of negative curvature will lead you to a point where you can descend.
If you can’t find any of these, then you are at the bottom of a valley!
The basic algorithm
The basic algorithm
Our basic strategy is inspired by the foggy mountain:
Take an initial guess at the solution x(0), our starting point on the mountain. Set k = 0.
Until x(k)^ is a good enough solution,
Find a search direction p(k). Set x(k+1)^ = x(k)^ + αkp(k), where αk is a scalar chosen to guarantee that progress is made. Set k = k + 1.
Initially, we will study algorithms for which αk = 1.
Unresolved details:
The model method: Newton
Newton’s method
Newton’s method is one way to determine the search direction p(k). It is inspired by our Taylor series expansion
f (x + p) ≈ f (x) + pT^ g(x) +
pT^ H(x)p ≡ fˆ (p).
Suppose we replace f (x + p) by the quadratic model fˆ (p) and minimize that.
In general, the model won’t fit f well at all ... except in a neighborhood of the point x where it is built. But if our step p is not too big, that is ok!
So let’s try to minimize fˆ with respect to p. If we set the derivative equal to zero
g(x) + H(x)p = 0
Theorem: (Fletcher, p46) Suppose f ∈ C^2 (S) and there is a positive scalar λ such that ‖H(x) − H(y)‖ ≤ λ‖x − y‖
for all points x, y in a neighborhood of xopt. Then if x(k)^ is fficiently close to xopt and if H(xopt) is positive definite, then there exists a constant c such that
‖e(k+1)‖ ≤ c‖e(k)‖^2.
This rate of convergence is called quadratic convergence and it is remarkably fast. If we have an error of 10 −^1 at some iteration, then two iterations later the error will be about 10 −^4 (if c ≈ 1 ). After four iterations it will be about 10 −^16 , as many figures as we carry in double precision arithmetic!
Newton’s quadratic rate of convergence is nice, but Newton’s method is not an ideal method:
So we would like to modify Newton’s method to make it cheaper and more widely applicable without sacrificing its fast convergence.
An important result
We can get superlinear convergence (convergence with rate r > 1 ) without walking exactly in the Newton direction.
Making the Newton method safe
When does Newton get into trouble?
We want to modify Newton whenever we are not sure that the direction it generates is downhill.
If the Hessian is positive definite, we know the direction will be downhill, although if H is nearly singular, we may have some computational difficulties.
If the Hessian is semidefinite or indefinite, we might or might not get a downhill direction.
Our strategy:
Conditions on Hˆ:
Modifying Newton’s method
Sample Strategy: Levenberg-Marquardt method. This one was actually proposed for least squares problems, but it works here, too.
Replace H by Hˆ = H + γI.
This shifts every eigenvalue up by γ.
How do we choose γ? It is usually done by trial and error: seek a γ so that Hˆ is positive definite and ‖p(k)‖ ≤ h(k)^ where {h(k)} is a given sequence of numbers.
Note: If the h’s are small enough, then we can avoid using a line search. (Line searches will be discussed later, but their disadvantage is that they require the function to be evaluated many times.)
Better Strategy: Cholesky Strategies: developed by Gill and Murray.
Background: Any symmetric positive definite matrix A can be factored as
A = LDLT
where D is a diagonal matrix and L is lower triangular with ones on its main diagonal.
Idea:
What is missing? How long a step should we take in the direction p?
Descent directions and line searches.
A backtracking line search
Reference: Dennis & Schnabel pp. 6-17. Algorithm due to Dennis, More’, Schnabel.
We take x(k+1)^ = x(k)^ + α(k)p(k)^.
How do we choose α(k)?
Let F (α) = f (x + αp).
Then F ′(α) = pT^ g(x + αp).
Backtracking line search:
Choose α = 1 (to give the full Newton step). While α is not good enough,
Choose a new αnew ∈ [0, α] by interpolation, and set α = αnew.
Note: If p is not the Newton direction, then we may need an initial braketing phase to find a good upper bound on α by testing larger values.
Reference: See Fletcher Section 2.6 for details.
How do we decide that α is good enough?
Our situation
What do we do?
What do these conditions buy for us?
It can be shown that acceptable points exist as long as the minimizer is finite.
Typical theorem: Global convergence of descent methods. If
Then either g(k)^ = 0 for some k or g(k)^ → 0.
Proof: See Fletcher.
Reference: There is a similar (but less satisfying) theorem on p. 316 of N&S.
[]
Trust regions.
Trust regions: an alternative to linesearch
Reference: Fletcher Chapter 5.
Trust region methods determine α and p simultaneously.
Idea: Use g and H to form a quadratic model
f (x + p) ≈ q(p) = f (x) + pT^ g +
pT^ Hp.
Pitfall in trust region methods: If the problem is poorly scaled, then the trust region will remain very small and we will never be able to take large steps to get us close to the solution.
Example: f (x) = f 1 (x 1 ) + f 1 (10000x 2 ) where f 1 is a well-behaved function. []
Convergence of trust region methods
Typical theorem: Global convergence of trust region methods. If
Then the sequence {x(k)} has an accumulation point xopt that satisfies the first- and second-order necessary conditions for optimality.
Final words
We now know how to recognize a solution and compute a solution using Newton’s method.
We have added safeguards in case the Hessian fails to be positive definite, and we have added a linesearch to guarantee convergence.
The resulting algorithm converges rather rapidly, but each iteration is quite expensive.
Next, we want to investigate algorithms that have lower cost per iteration.
Part 2: Alternatives to Newton’s method
The plan:
Recall that Newton’s method is very fast (if it converges) but it requires the gradient and the Hessian matrix to be evaluated at each iteration.
Next we develop some alternatives to Newton’s method:
Methods that require only first derivatives
If the Hessian is unavailable...
Notation:
Problem: Solve min x
f (x)
when g(x) can be computed, but not H(x).
Two options:
No Hessian Method 1: Quasi-Newton Method
Quasi-Newton methods could be the basis for a full course; there is a textbook by Dennis and Schnabel. We’ll just hit the high points.
Reference: SIAM Rev 21 1979 443-459.
where
s(k)^ = x(k+1)^ − x(k)^ , y(k)^ = g(k+1)^ − g(k)^.
Unquiz: Verify that Broyden’s good method satisfies the secant equation and the no-change conditions. []
Notes: For Broyden’s good method,
In order to regain symmetry, we need to sacrifice the no-change conditions. Instead, we formulate the problem in a least change sense:
min B(k+1)
‖B(k+1)^ − B(k)‖
subject to the secant condition
B(k+1)s(k)^ = y(k)^.
The solution (again) depends on the choice of norm.
A refinement to this idea: We can impose other constraints, too.
Some history
An alphabet soup of algorithms:
An example: DFP
Still one of the most popular because it has many desirable properties.
We accumulate an approximation C to H−^1 rather than to H.
C(k+1)^ = C(k)^ −
C(k)y(k)y(k)T^ C(k) y(k)T^ C(k)y(k)^
s(k)s(k)T y(k)T^ s(k)
An example: BFGS
The most popular method.
B(k+1)^ = B(k)^ −
B(k)s(k)s(k)T^ B(k) s(k)T^ B(k)s(k)^
y(k)y(k)T y(k)T^ s(k)
Use of Quasi-Newton methods
The algorithm looks very similar to Newton’s method:
Until x(k)^ is a good enough solution,
Compute a search direction p(k)^ from p(k)^ = −C(k)g(k)^ (or solve B(k)p(k)^ = −g(k)). Set x(k+1)^ = x(k)^ + αkp(k), where αk satisfies the Goldstein-Armijo or Wolfe linesearch conditions. Form the updated matrix C(k+1)^ (or B(k+1)). Increment k.
Initialization: Now need to initialize B(0)^ (or C(0)) as well as x(0). Take B(0)^ = I or some better guess.
Unanswered questions
approximately zero and they can find no descent direction or direction of negative curvature.
Convergence rate
All of these methods have an n-, 2 n-, or (n + 2)-step quadratic convergence rate if the line search is exact. An n-step quadratic convergence rate, for example, means that
lim k→∞
‖x(k+n)^ − xopt‖ ‖x(k)^ − xopt‖^2
Weakening the line search to a Wolfe or Goldstein-Armijo search generally gives superlinear convergence. See N&S p. 356 for a typical result for the Huang family.
No Hessian Method 2: Finite-difference Newton Method
Finite-difference Newton methods (N&S 11.4.1)
One of our major time-sinks in using Newton’s method is to evaluate the Hessian matrix, with entries
hij =
∂gi ∂xj
One way to avoid these evaluations is to approximate the entries:
hij ≈
gi(x + τ ej ) − gi(x) τ
where τ is a small number and ej is the jth unit vector.
Cost: n extra gradient evaluations per iteration. Sometimes this is less than the cost of the Hessian evaluation, but sometimes it is more.
Practical matters:
Usually we try to balance the two errors by choosing τ to make them approximately equal.
Convergence rate
There are theorems that say that if we choose τ carefully enough, we can get superlinear convergence.
Methods that require only first derivatives and store no matrices
Sometimes problems are too big to allow n^2 storage space for the Hessian matrix.
Some alternatives:
Low-storage Method 1: Steepest Descent N&S 11.
Back to that foggy mountain
If we walk in the direction of steepest descent until we stop going downhill, we clearly are guaranteed to get to a local minimizer.
The trouble is that the algorithm is terribly slow. See the examples in N&S.
If we apply steepest descent to a quadratic function of n variables, then after many steps, the algorithm takes alternate steps approximating two directions: those corresponding to the eigenvectors of the smallest and the largest eigenvalues of the Hessian matrix.
The convergence rate on quadratics is linear:
f (x(k+1)) − f (xopt) ≤
κ − 1 κ + 1
(f (x(k)) − f (xopt))
where κ is the ratio of the largest to the smallest eigenvalue of H.
See N&S p.342 for proof.