Quasi-Newton Method, Lecture Notes - Mathematics -, Study notes of Mathematical Methods

Quasi newton method, BFGS updates

Typology: Study notes

2010/2011

Uploaded on 09/09/2011

luber-1
luber-1 🇬🇧

4.8

(12)

293 documents

1 / 10

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
C12.1B: CONTINUOUS OPTIMISATION
LECTURE 4: QUASI-NEWTON METHODS
RAPHAEL HAUSER
MATHEMATICAL INSTITUTE, UNIVERSITY OF OXFORD
1. The Quasi-Newton Idea. In this lecture we will discuss unconstrained min-
imisation methods that form a tradeoff between the advantages and disadvantages of
the steepest descent and Newton-Raphson methods: we aim at methods that con-
verge faster than steepest descent but have a lower operation count per iteration than
Newton-Raphson. The family of algorithms we will consider are called quasi-Newton
methods, as justified by the following motivation.
In Problem 2 of this week’s exercises you will learn that an alternative derivation
of the Newton–Raphson method is via the replacement of the minimisation problem
minxRnf(x) by minxRnq(x), where q(x) is the second order Taylor approximation
of faround xk, that is,
q(x) = f(xk) + h∇f(xk), x xki+1
2(xxk)TD2f(xk)(xxk).
If D2f(xk) is positive definite, then the minimisation of q(x) yields
xk+1 =xkD2f(xk)1f(xk)
as the optimal solution, and this is taken to be the next iterate in the Newton-Raphson
process.
If instead of the Hessian D2f(xk) we use an approximation BkD2f(xk), we
can generalise this idea and consider the minimisation problem minxRnp(x), where
p(x) is a quadratic model
p(x) = f(xk) + h∇f(xk), x xki+1
2(xxk)TBk(xxk)
of the objective function f. The minimisation of p(x) yields the iterate
x=xkB1
kf(xk).
This update is well-defined when Bkis nonsingular, and in particular when Bkis
positive definite symmetric. Since Bkis only an approximation of D2f(xk), the
update dk=xxkis used as a search direction rather than an exact update. A
line-search then yields a new Quasi-Newton iterate
xk+1 =xk+αkdk.
Suppose we are given a rule for cheaply computing Bk+1 as a function of the
previously computed gradients f(x0),...,f(xk+1), or as a function of the previ-
ously computed quantities Bk,f(xk) and f(xk+1). Then a generic quasi Newton
algorithm proceeds as follows:
Algorithm 1.1 (Generic quasi-Newton).
S0 Choose a starting point x0Rn, a nonsingular B0Sn(often the choice is
B0=I), and a termination tolerance ǫ > 0. Set k= 0.
1
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Quasi-Newton Method, Lecture Notes - Mathematics - and more Study notes Mathematical Methods in PDF only on Docsity!

C12.1B: CONTINUOUS OPTIMISATION

LECTURE 4: QUASI-NEWTON METHODS

RAPHAEL HAUSER MATHEMATICAL INSTITUTE, UNIVERSITY OF OXFORD

  1. The Quasi-Newton Idea. In this lecture we will discuss unconstrained min- imisation methods that form a tradeoff between the advantages and disadvantages of the steepest descent and Newton-Raphson methods: we aim at methods that con- verge faster than steepest descent but have a lower operation count per iteration than Newton-Raphson. The family of algorithms we will consider are called quasi-Newton methods, as justified by the following motivation. In Problem 2 of this week’s exercises you will learn that an alternative derivation of the Newton–Raphson method is via the replacement of the minimisation problem minx∈Rn^ f (x) by minx∈Rn^ q(x), where q(x) is the second order Taylor approximation of f around xk, that is,

q(x) = f (xk) + 〈∇f (xk), x − xk〉 +

(x − xk)TD^2 f (xk)(x − xk).

If D^2 f (xk) is positive definite, then the minimisation of q(x) yields

xk+1 = xk −

D^2 f (xk)

∇f (xk)

as the optimal solution, and this is taken to be the next iterate in the Newton-Raphson process. If instead of the Hessian D^2 f (xk) we use an approximation Bk ≈ D^2 f (xk), we can generalise this idea and consider the minimisation problem minx∈Rn^ p(x), where p(x) is a quadratic model

p(x) = f (xk) + 〈∇f (xk), x − xk〉 +

(x − xk)TBk(x − xk)

of the objective function f. The minimisation of p(x) yields the iterate

x∗^ = xk − B k− 1 ∇f (xk).

This update is well-defined when Bk is nonsingular, and in particular when Bk is positive definite symmetric. Since Bk is only an approximation of D^2 f (xk), the update dk = x∗^ − xk is used as a search direction rather than an exact update. A line-search then yields a new Quasi-Newton iterate

xk+1 = xk + αkdk.

Suppose we are given a rule for cheaply computing Bk+1 as a function of the previously computed gradients ∇f (x 0 ),... , ∇f (xk+1), or as a function of the previ- ously computed quantities Bk, ∇f (xk) and ∇f (xk+1). Then a generic quasi Newton algorithm proceeds as follows:

Algorithm 1.1 (Generic quasi-Newton). S0 Choose a starting point x 0 ∈ Rn, a nonsingular B 0 ∈ Sn^ (often the choice is B 0 = I), and a termination tolerance ǫ > 0. Set k = 0.

S1 If ‖∇f (xk)‖ ≤ ǫ then stop and output xk as an approximate local minimiser of f. Else go to S2. S2 Compute the quasi-Newton search direction dk = −B− k 1 ∇f (xk). S3 Perform a practical line-search for the minimisation of φ(α) = f (xk + αdk): find a step length αk that satisfies the Wolfe conditions and compute the new iterate xk+1 = xk + αkdk. S4 Compute the new approximate Hessian Bk+1 according to the specified rule. S5 Replace k by k + 1 and go to S1.

1.1. A Wish List. To turn the generic procedure outlined in Algorithm 1.1 into a practical algorithm, we have to specify how to update the approximate Hessian Bk. Many variants of such updating rules have been proposed, and we will discuss the two most widely used choices in Sections 2 and 3. But before doing that, we draw up a “wish list” of properties we would like to find in Bk. These properties serve as a motivation for almost all quasi-Newton methods, not merely the ones discussed in Sections 2 and 3.

P1: Bk should be nonsingular, so that S2 is well-defined. P2: Bk should be such that dk is a descent direction, so that S3 is well-defined. P3: Bk should be symmetric, because Hessians are symmetric matrices.

Note that properties P1–P3 can be satisfied for example by requiring that Bk be positive definite symmetric, as P1 and P3 are then trivially true, and P2 follows from

〈∇f (xk), dk〉 = −∇f (xk)TB k− 1 ∇f (xk) < 0 ,

unless ∇f (xk) = 0, in which case Algorithm 1.1 already stopped and the line-search is unnecessary. One might argue that a positive definite matrix Bk is a poor choice as a Hes- sian approximation when D^2 f (xk) has at least one negative eigenvalue. However, such iterates xk are not near a local minimiser, and at such points one is primarily concerned with finding a descent direction, and dk is then a valid choice for any Bk positive definite. This avoids that the quasi-Newton method gets attracted to any point but a local minimiser.

P4: Bk+1 should be computable by “recycling” the quantities

∇f (xk+1), ∇f (xk),... , ∇f (x 0 ), dk, αk

which have to be computed anyway.

In order to meet this requirement while computing a Bk+1 that is a credible Hessian approximation, it is useful to observe that the gradient change

γk := ∇f (xk+1) − ∇f (xk)

yields information about the Hessian change D^2 f (xk+1) − D^2 f (xk): let us write δk := αkdk for the update chosen by Algorithm 1.1. Recall that the search direction

its columns are scalar multiples of u. Therefore, the rank of this matrix is 1, which explains the the name “rank-1 update”. Thus, if we start out with a symmetric matrix B 0 , then Bk will be symmetric for all k, as all the updates are symmetric. Moreover, the rank-1 condition (2.1) guarantees that Bk and Bk+1 are “close” to each other in the rank sense. So far we did not specify the choice of u. This choice is fixed when the require- ments of P4 are satisfied through the secant condition (1.2): multiplying (2.1) by δk and substituting the result into (1.2), we find

(uTδk)u = γk − Bkδk. (2.2)

Multiplying the transpose of this equation by δk, we obtain

(uTδk)^2 = (γk − Bkδk)Tδk. (2.3)

Equation (2.2) shows that

u = γk − Bkδk uTδk

thus, (2.1) and (2.3) imply that the updating rule should be as follows,

Bk+1 = Bk +

(γk − Bkδk)(γk − Bkδk)T (uTδk)^2

= Bk +

(γk − Bkδk)(γk − Bkδk)T (γk − Bkδk)Tδk

Note that since γk = ∇f (xk+1) − ∇f (xk) and δk = αkdk, we can compute the SR update from the “recycled” information referred to in P4. When Bk+1 is computed via the updating rule (2.4) Algorithm 1.1 is called the symmetric rank 1 method (or SR1). This method was independently suggested by Broyden, Davidson, Fiacco-McCormick, Murtagh-Sargent, and Wolfe in 1967-69. The updates of the SR1 method are very simple to compute, but they have the drawback that Bk is not always positive definite and dk might not always be defined or be a descent direction. Moreover, (γk − Bkδk)Tδ can be close to zero which leads to very large updates. It remains to address property P6. Note that once dk is known, computing αk, xk+1, ∇f (xk+1, γk and δk is very cheap. Moreover, since an outer product of two n × 1 vectors necessitates the computation of n^2 entries, and since adding two n × n matrices takes n^2 additions, the total work for computing the updated matrix Bk+ from Bk and dk is of order O(n^2 ). However, in order to compute dk we need to solve the linear system of equations

Bkdk = −∇f (xk). (2.5)

Gaussian elimination takes 2n^3 /3 operations to compute dk. There are better meth- ods to solve the system (2.5) numerically, but all practical methods take O(n^3 ) op- erations (although some theoretical methods reduce this complexity slightly). No known method can solve (2.5) in the O(n^2 ) operations apparently asked for in P6. Intuitively, it should be clear that this is the case, because even if we guessed the solution, verifying that dk satisfies the system necessitates the computation of n inner

products – one for each column of Bk multiplied with dk – each of which takes n multiplications and n − 1 additions, and hence this takes O(n^2 ) operations in total. (Our analysis is crude, as we perform only a simple operations count without taking into account facts such as that performing a multiplication on a computer takes much more time than an addition.) Thus, the SR1 method, it seems, does not satisfy the complexity requirements of P6. Luckily, a way out of the dilemma is given by the Sherman–Morrison–Woodbury formula:

Theorem 2.1 (Sherman–Morrison–Woodbury). If B ∈ Rn×n^ and U, V ∈ Rn×p are matrices then

(B + U V T)−^1 = B−^1 − B−^1 U (I + V TB−^1 U )−^1 V TB−^1.

Proof. See problem set.

The usefulness of this formula is quickly understood when we apply it in the context of SR1 updates: suppose we knew Hk = B− k 1. Then, applying the Sherman- Morrison-Woodbury formula to B+ = Bk+1, B = Bk, U = u = (γk − Bkδk) and V = U T^ (that is, p = 1 in this case), we find

Hk+1 = (B+)−^1 = B−^1 − B−^1 u

1 + uTB−^1 u

uTB−^1

= Hk +

(δk − Hkγk)(δk − Hkγk)T (δk − Hkγk)Tγk

Thus, Hk+1 is just a rank 1 update of Hk. But since we have assumed that Hk is known, computing dk takes only O(n^2 ) work because this is obtained via the matrix- vector multiplication dk = −Hk∇f (xk). Hence, computing δk and γk takes only O(n^2 ) operations. Finally, because outer products of vectors take O(n^2 ) operations, Hk+1 can be computed from Hk in O(n^2 ) time. Of course, this saving was only possible because we assumed that B k− 1 = Hk is known. But if we start Algorithm 1.1 with a matrix B 0 whose inverse is know, for example with B 0 = I, then we need never form the matrices Bk and can work directly with the updates Hk in every iteration. The SR1 method then takes only O(n^2 ) work per iteration, which is a considerable saving over the Newton-Raphson method which takes O(n^3 ) work due to the solving of a linear system in each iteration. It is possible to analyse the local convergence of the SR1 method and show that the method converges superlinearly in a neighbourhood of a local minimiser of f. Thus, if the SR1 method is properly implemented, it can combine convergence speeds similar to those of the Newton-Raphson method with a lower complexity.

  1. BFGS Updates. SR1 updates are easy to use, but they have the disadvan- tage that Bk is not guaranteed to be positive definite, and dk is not guaranteed to be a descent direction. We are now going to describe the most widely used optimisation algorithm in un- constrained optimisation: the BFGS (Broyden– Fletcher–Goldfarb–Shanno) method, which is based on an updating rule for Bk that satisfies all six properties P1–P6 on our wish list. Thus, this method overcomes the weaknesses of SR1 while retaining its strengths.

can be solved as follows: the value of x 1 can be read off the first equation, yielding x 1 = 3/2. After substituting this value into the second equation, we can simply read off the value of x 2 and find x 2 = (4 − 3 /2)/3. If we generalise this procedure, we find that solving (3.2) for gk takes

∑n l=1[2(l^ −^ 1) + 1] =^ n (^2) operations, and likewise solving

(3.3) for dk takes n^2 operations. Thus, if the Cholesky factorisation of Bk is available, the computation of dk incurs an O(n^2 ) cost and the BFGS method takes O(n^2 ) work per iteration. The main idea of the BFGS method then becomes to seek a sensible updating rule Lk ← Lk+ for the Cholesky factor and to take Bk+1 = Lk+1LT k+1, which by Proposition 3. automatically guarantees that Bk+1 is symmetric positive definite. In order to address the requirement P5, we will attempt to minimise the distance of Lk and Lk+1 in a well-defined sense. Furthermore, in order to meet the requirement P4, we will choose Lk+1 so that Bk+1 satisfies the secant condition (1.2). Let us put that programme into practice: In order to avoid cluttering our equa- tions with indices we write B, L, α d δ and γ for Bk, Lk, αk, dk, δk and γk respectively, and B+, L+ etc. instead of Bk+1, Lk+1 and so forth. At first, we will neglect the condition that L is lower triangular and replace it by an arbitrary nonsingular matrix J such that JJT^ = B and J+J+T = B+. We propose to solve the minimisation problem

min J+

‖J+ − J‖F (3.4)

s.t. J+g = γ, (3.5)

where ‖ · ‖F is the Frobenius norm and g ∈ Rn^ is a parameter vector, and then we choose g so that

J+T δ = g. (3.6)

The result will be that B+ = J+J+T satisfies the quasi-Newton equation (1.2), and that ‖J+ − J‖F , a measure of distance between B+ and B, is minimised under this condition.

The minimisation problem (3.4) is clearly the same as the smooth strictly convex optimisation problem

min J+

tr

(J+ − J)(J+ − J)T

s.t. J+g = γ,

where tr(A) =

i Aii^ denotes the trace of a square matrix^ A. This problem can be reformulated as

min J+

〈J+ − J, J+ − J〉 (3.8)

s.t. 〈J+, eigT〉 = γi (i = 1,... , n), (3.9)

where ei is the i-th coordinate vector and

〈·, ·〉 : Rn×n^ × Rn×n^ → R (A, B) 7 → 〈A, B〉 := tr(ABT) =

i,j

aij bij

is a Euclidean inner product on the vector space Rn×n^ of n × n real matrices. It is easy to check that the Euclidean norm that corresponds to this inner product is the usual Frobenius norm, defined by

‖A‖F =

i,j

A^2 ij.

Thus, (Rn×n, 〈·, ·〉) is a Euclidean space of dimension n^2. The problem (3.8) is then the task of finding the point closest to J in the affine subspace defined by the constraints (3.9). The minimum is achieved at the unique point J +∗ where (J +∗ − J) is orthogonal to the affine subspace, that is, (J +∗ − J) ∈ span(e 1 gT,... , engT). Another way to express this is to say that J∗ + is characterised by the existence of a vector λ∗^ ∈ Rn^ such that

2(J +∗ − J) − λ∗gT^ = 0, J+g = γ.

Both conditions are satisfied for

J∗ + = J +

(γ − Jg)gT gTg

and (3.10)

λ∗^ = 2

γ − Jg gTg

and hence J +∗ is the global minimiser of (3.4). Condition (3.6) now becomes

JTδ =

(γ − Jg)T gTg δ

g, (3.11)

and since JTδ 6 = 0, this can only be satisfied for g = βJTδ with β 6 = 0. Substituting into (3.11), we obtain

β(γTδ − αδTBδ) β^2 δTBδ

β,

or

β = ±

γTδ δTBδ

Note that if α satisfies the Wolfe conditions of Lecture 2, then

γTδ =

∇f (x + αd) − ∇f (x)

)T

αd ≥ (c 2 − 1)φ′(0) > 0.

Therefore, the square-root in (3.12) is real. Expressing the update of J in terms of β, we find

J+ = J +

(γ − βBδ)δTJ βδTBδ

S2 Otherwise solve the triangular system Lkgk = −∇f (xk) for gk, and then the triangular system LT k dk = gk for dk. S3 Perform a line search to find a positive step length αk > 0 such that f (xk + αkdk) < f (xk), and such that αk satisfies the Wolfe conditions. S4 Set δk := αkdk, xk+1 = xk + δk. Compute γk := ∇f (xk+1) − ∇f (xk) and βk = ±

γT k δk/δ kT Bkδk. S5 Compute

J kT+1 = LT k + LT k δk(γk − βkBkδk)T βkδT k Bδk

and then compute the QR factorisation J kT+1 = Qk+1Rk+1. Set Lk+1 := RT k+ and return to S1.

The BFGS algorithm spends only O(n^2 ) computation time per iteration, yet its convergence behaviour is not unlike that of the Newton-Raphson method: it can be shown that the BFGS method has local Q-superlinear convergence (but not Q- quadratic). It can also be shown that if the BFGS algorithm is used on a strictly convex quadratic function and in conjunction with exact line searches, then Bk be- comes the exact (constant) Hessian after n iterations.

  1. Final Comments. In the lectures on trust region algorithms we will see that quasi-Newton updates are useful not only in combination with line searches, but also in methods that rely on building local models of the objective function in the form of polynomials of degree 2. The complexity per iteration and the convergence speed of quasi-Newton meth- ods are an exact tradeoff between the advantages and disadvantages of the steepest descent and Newton-Raphson methods. We summarise this in the following table, where C(f ) denotes the cost of one function evaluation of f :

cost per iteration convergence rate

Steepest descent O

nC(f )

Q-linear

Quasi-Newton O

n^2 + nC(f )

Q-superlinear

Newton-Raphson O

n^3 + n^2 C(f )

Q-quadratic