






Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Quasi newton method, BFGS updates
Typology: Study notes
1 / 10
This page cannot be seen from the preview
Don't miss anything!







RAPHAEL HAUSER MATHEMATICAL INSTITUTE, UNIVERSITY OF OXFORD
q(x) = f (xk) + 〈∇f (xk), x − xk〉 +
(x − xk)TD^2 f (xk)(x − xk).
If D^2 f (xk) is positive definite, then the minimisation of q(x) yields
xk+1 = xk −
D^2 f (xk)
∇f (xk)
as the optimal solution, and this is taken to be the next iterate in the Newton-Raphson process. If instead of the Hessian D^2 f (xk) we use an approximation Bk ≈ D^2 f (xk), we can generalise this idea and consider the minimisation problem minx∈Rn^ p(x), where p(x) is a quadratic model
p(x) = f (xk) + 〈∇f (xk), x − xk〉 +
(x − xk)TBk(x − xk)
of the objective function f. The minimisation of p(x) yields the iterate
x∗^ = xk − B k− 1 ∇f (xk).
This update is well-defined when Bk is nonsingular, and in particular when Bk is positive definite symmetric. Since Bk is only an approximation of D^2 f (xk), the update dk = x∗^ − xk is used as a search direction rather than an exact update. A line-search then yields a new Quasi-Newton iterate
xk+1 = xk + αkdk.
Suppose we are given a rule for cheaply computing Bk+1 as a function of the previously computed gradients ∇f (x 0 ),... , ∇f (xk+1), or as a function of the previ- ously computed quantities Bk, ∇f (xk) and ∇f (xk+1). Then a generic quasi Newton algorithm proceeds as follows:
Algorithm 1.1 (Generic quasi-Newton). S0 Choose a starting point x 0 ∈ Rn, a nonsingular B 0 ∈ Sn^ (often the choice is B 0 = I), and a termination tolerance ǫ > 0. Set k = 0.
S1 If ‖∇f (xk)‖ ≤ ǫ then stop and output xk as an approximate local minimiser of f. Else go to S2. S2 Compute the quasi-Newton search direction dk = −B− k 1 ∇f (xk). S3 Perform a practical line-search for the minimisation of φ(α) = f (xk + αdk): find a step length αk that satisfies the Wolfe conditions and compute the new iterate xk+1 = xk + αkdk. S4 Compute the new approximate Hessian Bk+1 according to the specified rule. S5 Replace k by k + 1 and go to S1.
1.1. A Wish List. To turn the generic procedure outlined in Algorithm 1.1 into a practical algorithm, we have to specify how to update the approximate Hessian Bk. Many variants of such updating rules have been proposed, and we will discuss the two most widely used choices in Sections 2 and 3. But before doing that, we draw up a “wish list” of properties we would like to find in Bk. These properties serve as a motivation for almost all quasi-Newton methods, not merely the ones discussed in Sections 2 and 3.
P1: Bk should be nonsingular, so that S2 is well-defined. P2: Bk should be such that dk is a descent direction, so that S3 is well-defined. P3: Bk should be symmetric, because Hessians are symmetric matrices.
Note that properties P1–P3 can be satisfied for example by requiring that Bk be positive definite symmetric, as P1 and P3 are then trivially true, and P2 follows from
〈∇f (xk), dk〉 = −∇f (xk)TB k− 1 ∇f (xk) < 0 ,
unless ∇f (xk) = 0, in which case Algorithm 1.1 already stopped and the line-search is unnecessary. One might argue that a positive definite matrix Bk is a poor choice as a Hes- sian approximation when D^2 f (xk) has at least one negative eigenvalue. However, such iterates xk are not near a local minimiser, and at such points one is primarily concerned with finding a descent direction, and dk is then a valid choice for any Bk positive definite. This avoids that the quasi-Newton method gets attracted to any point but a local minimiser.
P4: Bk+1 should be computable by “recycling” the quantities
∇f (xk+1), ∇f (xk),... , ∇f (x 0 ), dk, αk
which have to be computed anyway.
In order to meet this requirement while computing a Bk+1 that is a credible Hessian approximation, it is useful to observe that the gradient change
γk := ∇f (xk+1) − ∇f (xk)
yields information about the Hessian change D^2 f (xk+1) − D^2 f (xk): let us write δk := αkdk for the update chosen by Algorithm 1.1. Recall that the search direction
its columns are scalar multiples of u. Therefore, the rank of this matrix is 1, which explains the the name “rank-1 update”. Thus, if we start out with a symmetric matrix B 0 , then Bk will be symmetric for all k, as all the updates are symmetric. Moreover, the rank-1 condition (2.1) guarantees that Bk and Bk+1 are “close” to each other in the rank sense. So far we did not specify the choice of u. This choice is fixed when the require- ments of P4 are satisfied through the secant condition (1.2): multiplying (2.1) by δk and substituting the result into (1.2), we find
(uTδk)u = γk − Bkδk. (2.2)
Multiplying the transpose of this equation by δk, we obtain
(uTδk)^2 = (γk − Bkδk)Tδk. (2.3)
Equation (2.2) shows that
u = γk − Bkδk uTδk
thus, (2.1) and (2.3) imply that the updating rule should be as follows,
Bk+1 = Bk +
(γk − Bkδk)(γk − Bkδk)T (uTδk)^2
= Bk +
(γk − Bkδk)(γk − Bkδk)T (γk − Bkδk)Tδk
Note that since γk = ∇f (xk+1) − ∇f (xk) and δk = αkdk, we can compute the SR update from the “recycled” information referred to in P4. When Bk+1 is computed via the updating rule (2.4) Algorithm 1.1 is called the symmetric rank 1 method (or SR1). This method was independently suggested by Broyden, Davidson, Fiacco-McCormick, Murtagh-Sargent, and Wolfe in 1967-69. The updates of the SR1 method are very simple to compute, but they have the drawback that Bk is not always positive definite and dk might not always be defined or be a descent direction. Moreover, (γk − Bkδk)Tδ can be close to zero which leads to very large updates. It remains to address property P6. Note that once dk is known, computing αk, xk+1, ∇f (xk+1, γk and δk is very cheap. Moreover, since an outer product of two n × 1 vectors necessitates the computation of n^2 entries, and since adding two n × n matrices takes n^2 additions, the total work for computing the updated matrix Bk+ from Bk and dk is of order O(n^2 ). However, in order to compute dk we need to solve the linear system of equations
Bkdk = −∇f (xk). (2.5)
Gaussian elimination takes 2n^3 /3 operations to compute dk. There are better meth- ods to solve the system (2.5) numerically, but all practical methods take O(n^3 ) op- erations (although some theoretical methods reduce this complexity slightly). No known method can solve (2.5) in the O(n^2 ) operations apparently asked for in P6. Intuitively, it should be clear that this is the case, because even if we guessed the solution, verifying that dk satisfies the system necessitates the computation of n inner
products – one for each column of Bk multiplied with dk – each of which takes n multiplications and n − 1 additions, and hence this takes O(n^2 ) operations in total. (Our analysis is crude, as we perform only a simple operations count without taking into account facts such as that performing a multiplication on a computer takes much more time than an addition.) Thus, the SR1 method, it seems, does not satisfy the complexity requirements of P6. Luckily, a way out of the dilemma is given by the Sherman–Morrison–Woodbury formula:
Theorem 2.1 (Sherman–Morrison–Woodbury). If B ∈ Rn×n^ and U, V ∈ Rn×p are matrices then
(B + U V T)−^1 = B−^1 − B−^1 U (I + V TB−^1 U )−^1 V TB−^1.
Proof. See problem set.
The usefulness of this formula is quickly understood when we apply it in the context of SR1 updates: suppose we knew Hk = B− k 1. Then, applying the Sherman- Morrison-Woodbury formula to B+ = Bk+1, B = Bk, U = u = (γk − Bkδk) and V = U T^ (that is, p = 1 in this case), we find
Hk+1 = (B+)−^1 = B−^1 − B−^1 u
1 + uTB−^1 u
uTB−^1
= Hk +
(δk − Hkγk)(δk − Hkγk)T (δk − Hkγk)Tγk
Thus, Hk+1 is just a rank 1 update of Hk. But since we have assumed that Hk is known, computing dk takes only O(n^2 ) work because this is obtained via the matrix- vector multiplication dk = −Hk∇f (xk). Hence, computing δk and γk takes only O(n^2 ) operations. Finally, because outer products of vectors take O(n^2 ) operations, Hk+1 can be computed from Hk in O(n^2 ) time. Of course, this saving was only possible because we assumed that B k− 1 = Hk is known. But if we start Algorithm 1.1 with a matrix B 0 whose inverse is know, for example with B 0 = I, then we need never form the matrices Bk and can work directly with the updates Hk in every iteration. The SR1 method then takes only O(n^2 ) work per iteration, which is a considerable saving over the Newton-Raphson method which takes O(n^3 ) work due to the solving of a linear system in each iteration. It is possible to analyse the local convergence of the SR1 method and show that the method converges superlinearly in a neighbourhood of a local minimiser of f. Thus, if the SR1 method is properly implemented, it can combine convergence speeds similar to those of the Newton-Raphson method with a lower complexity.
can be solved as follows: the value of x 1 can be read off the first equation, yielding x 1 = 3/2. After substituting this value into the second equation, we can simply read off the value of x 2 and find x 2 = (4 − 3 /2)/3. If we generalise this procedure, we find that solving (3.2) for gk takes
∑n l=1[2(l^ −^ 1) + 1] =^ n (^2) operations, and likewise solving
(3.3) for dk takes n^2 operations. Thus, if the Cholesky factorisation of Bk is available, the computation of dk incurs an O(n^2 ) cost and the BFGS method takes O(n^2 ) work per iteration. The main idea of the BFGS method then becomes to seek a sensible updating rule Lk ← Lk+ for the Cholesky factor and to take Bk+1 = Lk+1LT k+1, which by Proposition 3. automatically guarantees that Bk+1 is symmetric positive definite. In order to address the requirement P5, we will attempt to minimise the distance of Lk and Lk+1 in a well-defined sense. Furthermore, in order to meet the requirement P4, we will choose Lk+1 so that Bk+1 satisfies the secant condition (1.2). Let us put that programme into practice: In order to avoid cluttering our equa- tions with indices we write B, L, α d δ and γ for Bk, Lk, αk, dk, δk and γk respectively, and B+, L+ etc. instead of Bk+1, Lk+1 and so forth. At first, we will neglect the condition that L is lower triangular and replace it by an arbitrary nonsingular matrix J such that JJT^ = B and J+J+T = B+. We propose to solve the minimisation problem
min J+
s.t. J+g = γ, (3.5)
where ‖ · ‖F is the Frobenius norm and g ∈ Rn^ is a parameter vector, and then we choose g so that
J+T δ = g. (3.6)
The result will be that B+ = J+J+T satisfies the quasi-Newton equation (1.2), and that ‖J+ − J‖F , a measure of distance between B+ and B, is minimised under this condition.
The minimisation problem (3.4) is clearly the same as the smooth strictly convex optimisation problem
min J+
tr
s.t. J+g = γ,
where tr(A) =
i Aii^ denotes the trace of a square matrix^ A. This problem can be reformulated as
min J+
s.t. 〈J+, eigT〉 = γi (i = 1,... , n), (3.9)
where ei is the i-th coordinate vector and
〈·, ·〉 : Rn×n^ × Rn×n^ → R (A, B) 7 → 〈A, B〉 := tr(ABT) =
i,j
aij bij
is a Euclidean inner product on the vector space Rn×n^ of n × n real matrices. It is easy to check that the Euclidean norm that corresponds to this inner product is the usual Frobenius norm, defined by
i,j
A^2 ij.
Thus, (Rn×n, 〈·, ·〉) is a Euclidean space of dimension n^2. The problem (3.8) is then the task of finding the point closest to J in the affine subspace defined by the constraints (3.9). The minimum is achieved at the unique point J +∗ where (J +∗ − J) is orthogonal to the affine subspace, that is, (J +∗ − J) ∈ span(e 1 gT,... , engT). Another way to express this is to say that J∗ + is characterised by the existence of a vector λ∗^ ∈ Rn^ such that
2(J +∗ − J) − λ∗gT^ = 0, J+g = γ.
Both conditions are satisfied for
(γ − Jg)gT gTg
and (3.10)
λ∗^ = 2
γ − Jg gTg
and hence J +∗ is the global minimiser of (3.4). Condition (3.6) now becomes
JTδ =
(γ − Jg)T gTg δ
g, (3.11)
and since JTδ 6 = 0, this can only be satisfied for g = βJTδ with β 6 = 0. Substituting into (3.11), we obtain
β(γTδ − αδTBδ) β^2 δTBδ
β,
or
β = ±
γTδ δTBδ
Note that if α satisfies the Wolfe conditions of Lecture 2, then
γTδ =
∇f (x + αd) − ∇f (x)
αd ≥ (c 2 − 1)φ′(0) > 0.
Therefore, the square-root in (3.12) is real. Expressing the update of J in terms of β, we find
(γ − βBδ)δTJ βδTBδ
S2 Otherwise solve the triangular system Lkgk = −∇f (xk) for gk, and then the triangular system LT k dk = gk for dk. S3 Perform a line search to find a positive step length αk > 0 such that f (xk + αkdk) < f (xk), and such that αk satisfies the Wolfe conditions. S4 Set δk := αkdk, xk+1 = xk + δk. Compute γk := ∇f (xk+1) − ∇f (xk) and βk = ±
γT k δk/δ kT Bkδk. S5 Compute
J kT+1 = LT k + LT k δk(γk − βkBkδk)T βkδT k Bδk
and then compute the QR factorisation J kT+1 = Qk+1Rk+1. Set Lk+1 := RT k+ and return to S1.
The BFGS algorithm spends only O(n^2 ) computation time per iteration, yet its convergence behaviour is not unlike that of the Newton-Raphson method: it can be shown that the BFGS method has local Q-superlinear convergence (but not Q- quadratic). It can also be shown that if the BFGS algorithm is used on a strictly convex quadratic function and in conjunction with exact line searches, then Bk be- comes the exact (constant) Hessian after n iterations.
cost per iteration convergence rate
Steepest descent O
nC(f )
Q-linear
Quasi-Newton O
n^2 + nC(f )
Q-superlinear
Newton-Raphson O
n^3 + n^2 C(f )
Q-quadratic