Descent Method, Lecture Notes - Mathematics, Study notes of Mathematical Methods

Unconstrained Optimisation, Optimality conditions for unconstrained minimisation, Line search, descent methods, step length selection, convergence of descent method

Typology: Study notes

2010/2011

Uploaded on 09/09/2011


C12.1B: CONTINUOUS OPTIMISATION

LECTURE 2: THE DESCENT METHOD AND LINE-SEARCHES

RAPHAEL HAUSER MATHEMATICAL INSTITUTE, UNIVERSITY OF OXFORD

  1. Unconstrained Optimisation. The subject of this chapter is the unconstrained minimisation problem

$$\min_{x \in \mathbb{R}^n} f(x), \tag{1.1}$$

where $f$ is a continuous objective function. Note that no constraints are imposed on the decision variables! Furthermore, we usually assume that $f$ is $C^2$ with Lipschitz-continuous Hessian, that is, there exists $\Lambda > 0$ such that

$$\|D^2 f(x) - D^2 f(y)\| \le \Lambda \|x - y\| \quad \forall\, x, y \in \mathbb{R}^n.$$

In all that follows $\|x\| = \sqrt{\sum_i x_i^2}$ denotes the Euclidean norm of a vector $x \in \mathbb{R}^n$ and $\langle \cdot, \cdot \rangle$ is the corresponding Euclidean inner product. If $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map, then $\|A\|$ denotes the operator norm defined by the Euclidean norms on $\mathbb{R}^n$ and $\mathbb{R}^m$, that is,

$$\|A\| = \inf\{\lambda > 0 : \|Ax\| \le \lambda \|x\| \ \ \forall\, x \in \mathbb{R}^n\}.$$

The gradient $\nabla f(x)$ of a function $f : \mathbb{R}^n \to \mathbb{R}$ is sometimes denoted by $g_f(x)$, and its Hessian $D^2 f(x)$ by $H_f(x)$. The Jacobian $Df(x)$ of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is sometimes denoted by $J_f(x)$. Note: if $m = 1$ then $J_f(x) = g_f(x)^T$. We will also use the so-called "big O" notation: we say that a function $g(x)$ is of order $\|x\|^k$ and write $g(x) = O(\|x\|^k)$ if there exist a constant $c > 0$ and a $\delta > 0$ such that $|g(x)| \le c\|x\|^k$ whenever $\|x\| < \delta$.

Example 1.1 (Risk minimisation under short-selling). Let us go back to Example 2 of Lecture 1. By eliminating $x_n = 1 - \sum_{i=1}^{n-1} x_i$ we can get rid of the constraint

$$\sum_{i=1}^{n} x_i = 1.$$

Furthermore, if we allow short-selling of assets, the constraints

$$x_i \ge 0 \quad (i = 1, \dots, n)$$

are no longer imposed. Finally, let us suppose all the assets considered have the same expected return $\mu_i \equiv \mu$, so that the constraint

$$\sum_{i=1}^{n} \mu_i x_i \ge b$$

can be omitted. The investor's aim is to minimise the risk, which can be modelled as

$$\min_{x \in \mathbb{R}^{n-1}} f(x_1, \dots, x_{n-1}) = \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} \sigma_{ij} x_i x_j + \sum_{j=1}^{n-1} \sigma_{nj} \Bigl(1 - \sum_{i=1}^{n-1} x_i\Bigr) x_j + \sum_{i=1}^{n-1} \sigma_{in} x_i \Bigl(1 - \sum_{j=1}^{n-1} x_j\Bigr) + \sigma_{nn} \Bigl(1 - \sum_{i=1}^{n-1} x_i\Bigr) \Bigl(1 - \sum_{j=1}^{n-1} x_j\Bigr).$$

Since the objective function $f$ is a quadratic (degree 2) polynomial in the decision variables $x_1, \dots, x_{n-1}$, we have $f \in C^\infty$. Moreover, the Hessian $D^2 f(x)$ is the same $(n-1) \times (n-1)$ matrix

$$\begin{pmatrix} \sigma_{11} & \dots & \sigma_{1n} \\ \vdots & & \vdots \\ \sigma_{n1} & \dots & \sigma_{nn} \end{pmatrix}$$

for all $x$, and hence $x \mapsto D^2 f(x)$ is a constant function, which is of course Lipschitz-continuous:

$$\|D^2 f(x) - D^2 f(y)\| = 0 \le 0 \times \|x - y\| \quad \forall\, x, y \in \mathbb{R}^{n-1}.$$
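The elimination of $x_n$ in Example 1.1 can be checked numerically. The following is a minimal sketch, assuming a hypothetical $3 \times 3$ covariance matrix `sigma` whose entries are purely illustrative; the function names `risk_full` and `risk_reduced` are not from the lecture.

```python
# Hypothetical covariance matrix for n = 3 assets (illustrative values only).
sigma = [
    [0.10, 0.02, 0.01],
    [0.02, 0.08, 0.03],
    [0.01, 0.03, 0.12],
]


def risk_full(x):
    """Risk sum_{i,j} sigma_ij * x_i * x_j over all n assets."""
    n = len(x)
    return sum(sigma[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def risk_reduced(x_free):
    """Objective in the n-1 free variables after substituting x_n = 1 - sum(x_free)."""
    x_n = 1.0 - sum(x_free)
    return risk_full(list(x_free) + [x_n])


# Short-selling is allowed, so a negative weight is admissible.
x_free = [1.4, -0.7]
reduced_value = risk_reduced(x_free)
full_value = risk_full([1.4, -0.7, 0.3])
```

Both evaluations agree up to rounding, which is all the substitution claims: the reduced problem has $n-1$ unconstrained variables but measures the same risk.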

Example 1.2. On a CAD system it takes $n$ parameters $x_1, \dots, x_n$ to define the shape of a car. An engineer has a piece of software which takes the design parameters $x \in \mathbb{R}^n$ as input and computes the air resistance $f(x)$ of the corresponding fuselage as output. The software typically contains millions of lines of code, but for theoretical reasons it is known that $f \in C^2$. Using an automatic differentiation system, the engineer can automatically produce a piece of software that computes directional derivatives

$$D_v f(x) = \frac{d}{dt} f(x + tv)\Big|_{t=0}, \qquad D_{u,v} f(x) = \frac{d^2}{ds\,dt} f(x + su + tv)\Big|_{s=t=0}.$$

How should the design parameters be chosen so as to minimise the drag on the fuselage?

Note that in this example the objective function is not available explicitly. This is typical for many applications. In fact, evaluating the objective function might even involve measurements in a physical experiment. Besides appearing as subproblems in constrained optimisation procedures, unconstrained optimisation problems also appear in many applications directly.

  2. Optimality Conditions for Unconstrained Minimisation. A well-designed optimisation algorithm should be able to recognise when an approximate minimum has been attained. We therefore need a mathematical characterisation of local minimisers. At school we all learned that in the univariate case, a necessary condition is that $f'(x) = 0$, and that second derivatives help in deciding whether $x$ is a local maximiser or minimiser. The same idea works in higher dimensions:

Theorem 2.1.
(i) If $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $x^* \in \mathbb{R}^n$ and has a local minimum there, then $\nabla f(x^*) = 0$, that is, $x^*$ is a stationary point of $f$. This is a first order necessary optimality condition, because it involves first derivatives, or the first order Taylor approximation of $f$.
(ii) If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $x^* \in \mathbb{R}^n$ and has a local minimum there, then the Hessian $D^2 f(x^*)$ is positive semidefinite, that is, $h^T D^2 f(x^*) h \ge 0$ for all $h \in \mathbb{R}^n$. This is a second order necessary optimality condition.
(iii) If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $x^* \in \mathbb{R}^n$, and if $\nabla f(x^*) = 0$ and $D^2 f(x^*)$ is positive definite, that is, if $h^T D^2 f(x^*) h > 0$ for all $h \in \mathbb{R}^n \setminus \{0\}$, then $x^*$ is a local minimiser of $f$. These are sufficient optimality conditions.
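To see how the three parts of Theorem 2.1 interact, the following sketch tests them on three two-variable functions at the candidate point $x^* = (0, 0)$. The functions, the helper names `is_stationary` and `hessian_definiteness`, and the tolerances are illustrative assumptions; gradients and Hessians were computed by hand, and definiteness of a symmetric $2 \times 2$ matrix is tested through its diagonal entries and determinant.

```python
def is_stationary(grad, tol=1e-10):
    """First order condition (i): every gradient component vanishes."""
    return all(abs(g) <= tol for g in grad)


def hessian_definiteness(hessian, tol=1e-12):
    """Classify a symmetric 2x2 matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = hessian
    det = a * c - b * b
    if a > tol and det > tol:
        return "positive definite"          # condition (iii) applies
    if a >= -tol and c >= -tol and det >= -tol:
        return "positive semidefinite"      # only condition (ii) holds
    return "not positive semidefinite"      # condition (ii) fails


# (gradient, Hessian) at x* = (0, 0), computed by hand.
cases = {
    "x^2 + 3y^2": ([0.0, 0.0], [[2.0, 0.0], [0.0, 6.0]]),
    "x^2 - y^2": ([0.0, 0.0], [[2.0, 0.0], [0.0, -2.0]]),
    "x^2 + y^4": ([0.0, 0.0], [[2.0, 0.0], [0.0, 0.0]]),
}
results = {
    name: (is_stationary(g), hessian_definiteness(h))
    for name, (g, h) in cases.items()
}
```

The first case satisfies the sufficient conditions (iii), so the origin is a local minimiser. The second violates the necessary condition (ii), so the origin is a saddle point. The third sits in the gap between (ii) and (iii): the Hessian is only semidefinite, so the theorem is inconclusive, although the origin happens to be a minimiser here (and would not be for $x^2 - y^4$, which has the same Hessian).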

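Therefore,

$$\begin{aligned}
h^T D^2 f(x^*) h &= \Bigl(\sum_{i=1}^{n} \langle v_i, h\rangle v_i\Bigr)^T \Bigl(\sum_{j=1}^{n} \langle v_j, h\rangle \sigma_j v_j\Bigr) = \sum_{i,j} \langle v_i, h\rangle \langle v_j, h\rangle \sigma_j \langle v_i, v_j\rangle \\
&= \sum_{i=1}^{n} \langle v_i, h\rangle^2 \sigma_i \ \ge\ \sigma_n \sum_{i=1}^{n} \langle v_i, h\rangle^2 = \sigma_n \sum_{i=1}^{n} \bigl\langle \langle v_i, h\rangle v_i, \langle v_i, h\rangle v_i \bigr\rangle \\
&= \sigma_n \Bigl\langle \sum_{i=1}^{n} \langle v_i, h\rangle v_i, \sum_{j=1}^{n} \langle v_j, h\rangle v_j \Bigr\rangle = \sigma_n \|h\|^2.
\end{aligned} \tag{2.3}$$

Let $c, \delta > 0$ be as in part (ii). Then (2.3) implies that for all $h$ such that $\|h\| < \min(\delta, \sigma_n / 2c)$ we have

$$\begin{aligned}
f(x^* + h) &= f(x^*) + \tfrac{1}{2} h^T D^2 f(x^*) h + O(\|h\|^3) \\
&\overset{(2.3)}{\ge} f(x^*) + \tfrac{1}{2} \|h\|^2 \sigma_n - c\|h\|^3 \\
&\ge f(x^*) + \tfrac{1}{2} \|h\|^2 \sigma_n - c\,\frac{\sigma_n}{2c} \|h\|^2 = f(x^*),
\end{aligned}$$

which shows that $x^*$ is a local minimiser of $f$.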

  3. Line-Search Descent Methods. The optimality conditions we just derived play an important role in the construction of algorithms. Suppose we solve the simultaneous system of nonlinear equations

$$\nabla f(x) = 0$$

by an iterative procedure generating a sequence of points $(x_k)_{\mathbb{N}}$. If we can ensure that $f(x_k)$ decreases in each iteration,

$$f(x_{k+1}) \le f(x_k) \quad \forall\, k,$$

then in practice $(x_k)_{\mathbb{N}}$ can only converge to a local minimiser $x^*$, and

$$\|\nabla f(x_k)\| < \epsilon$$

can be used as a stopping criterion. Thus, solving unconstrained optimisation problems is closely related to the problem of solving simultaneous equations, with the added feature that progress can be controlled by monitoring a naturally defined merit function (i.e., one asks "does $f$ decrease?"). Most competitive algorithms for unconstrained minimisation are based on this idea. There are two main families of such methods: line-search methods and trust region methods. We start with a description of the former.

Example 3.1 (Steepest descent without line searches). A simple method is defined as follows: starting from some $x_0 \in \mathbb{R}^n$, compute a sequence of intermediate solutions $(x_k)_{\mathbb{N}}$ via

$$x_{k+1} = x_k - \nabla f(x_k).$$

The method is motivated by the fact that $-\nabla f(x_k)$ is the direction in which $f$ decreases fastest when moving away from $x_k$. But is it a descent method? The first order Taylor approximation of $f$ shows that $f(x_k - \alpha \nabla f(x_k)) \le f(x_k)$ for small $\alpha > 0$. However, it is not necessarily the case that $f(x_{k+1}) \le f(x_k)$, as the step $-\nabla f(x_k)$ can be too long. To make this a true descent method, we have to use line-searches: in each iteration we have to find $\alpha_k > 0$ such that

$$f(x_k - \alpha_k \nabla f(x_k)) < f(x_k),$$

and then we can set

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$

A word of warning: although this method works in principle, it is too primitive to produce any good results in practice! We will later learn why. For now we set out to generalise this example.
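A one-dimensional illustration of why the unit step of Example 3.1 can fail: the sketch below assumes the hypothetical objective $f(x) = 2x^2$, for which the full gradient step maps $x$ to $-3x$, and compares it with a fixed shorter step $\alpha = 0.1$ chosen by hand.

```python
def f(x):
    return 2.0 * x * x


def grad_f(x):
    return 4.0 * x


def iterate(x0, alpha, steps):
    """Apply x_{k+1} = x_k - alpha * f'(x_k) a fixed number of times."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * grad_f(xs[-1]))
    return xs


unit_step = iterate(1.0, 1.0, 3)    # iterates 1, -3, 9, -27: f increases
short_step = iterate(1.0, 0.1, 3)   # iterates shrink by a factor 0.6 per step
```

With $\alpha = 1$ the objective values grow by a factor of 9 in every iteration, whereas with $\alpha = 0.1$ they decrease monotonically, which is the behaviour a line search is meant to guarantee.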

Algorithm 3.2 (Descent method).
S0 Choose a starting point $x_0 \in \mathbb{R}^n$ and a tolerance parameter $\epsilon > 0$. Set $k = 0$.
S1 If $\|\nabla f(x_k)\| \le \epsilon$ then stop and output $x_k$ as an approximate local minimiser.
S2 Otherwise choose a search direction $d_k \in \mathbb{R}^n$ such that $\langle \nabla f(x_k), d_k \rangle < 0$.
S3 Choose a step size $\alpha_k > 0$ such that $f(x_k + \alpha_k d_k) < f(x_k)$.
S4 Set $x_{k+1} := x_k + \alpha_k d_k$, replace $k$ by $k + 1$, and go back to S1.
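Steps S0 to S4 can be sketched in a few lines of Python. The algorithm deliberately leaves the direction and the step size open, so two choices below are assumptions made only for illustration: the direction is $d_k = -\nabla f(x_k)$, and the step size is found by halving $\alpha$ from 1 until the objective decreases. The test function, the iteration cap `max_iter` and the lower bound on `alpha` are likewise hypothetical.

```python
import math


def descent_method(f, grad, x0, eps=1e-6, max_iter=10_000):
    """Sketch of a descent method with d_k = -grad f(x_k) and step halving."""
    x = list(x0)                                      # S0
    for k in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            return x, k                               # S1: gradient small, stop
        d = [-gi for gi in g]                         # S2: <g, d> = -||g||^2 < 0
        fx, alpha = f(x), 1.0
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        # S3: halve alpha until f decreases (guarded against rounding stalls)
        while f(trial) >= fx and alpha > 1e-16:
            alpha *= 0.5
            trial = [xi + alpha * di for xi, di in zip(x, d)]
        x = trial                                     # S4
    return x, max_iter


def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_min, iterations = descent_method(f, grad_f, [0.0, 0.0])
```

On this quadratic the iterates approach the minimiser $(1, -2)$. The halving rule only enforces the simple decrease required in S3; it does not enforce any stronger condition on the step length.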

Below we will see that the minimal assumption we need to make for this algorithm to work is f ∈ C^1 with Lipschitz continuous gradient. The generality of Algorithm 3.2 leaves flexibility both in the choice of the step length αk and the search direction dk. In the remainder of this lecture we discuss the step length selection and treat the choice of good search directions in the next few lectures.

3.1. Step Length Selection. The conceptually simplest way of choosing $\alpha_k$ is an exact line search, defined by

$$\alpha_k := \inf\{\alpha \ge 0 : \varphi'(\alpha) = 0\},$$

where $\varphi(\alpha) = f(x_k + \alpha d_k)$. That is to say, the point $x_k + \alpha_k d_k$ is the first stationary point of $f$ encountered along the half line $\{x_k + \alpha d_k : \alpha \ge 0\}$. Note that if $\{\alpha \ge 0 : \varphi'(\alpha) = 0\} = \emptyset$, as is the case for example when $\varphi(\alpha) = -\ln \alpha$, then $\alpha_k := \inf \emptyset = +\infty$ corresponds to an infinitely long step, which is still sensible. Exact line searches are mainly a theoretical tool in the convergence analysis of algorithms. In practice, they are computationally too expensive. We will now derive step length computations that are equally good choices for the purposes of Algorithm 3.2 and much cheaper to compute.
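One case in which the exact line search is available in closed form is a strictly convex quadratic. As a worked illustration, assume $f(x) = \tfrac{1}{2} x^T Q x - b^T x$ with $Q$ symmetric positive definite; $Q$ and $b$ are hypothetical and not part of the lecture's setting. Then $\nabla f(x) = Qx - b$ and

$$\begin{aligned}
\varphi(\alpha) &= f(x_k + \alpha d_k), \\
\varphi'(\alpha) &= \langle \nabla f(x_k + \alpha d_k), d_k \rangle = \langle \nabla f(x_k), d_k \rangle + \alpha\, d_k^T Q d_k, \\
\varphi'(\alpha_k) = 0 \ &\Longleftrightarrow\ \alpha_k = -\frac{\langle \nabla f(x_k), d_k \rangle}{d_k^T Q d_k}.
\end{aligned}$$

Since $d_k$ is a descent direction, $\langle \nabla f(x_k), d_k \rangle < 0$, and since $Q$ is positive definite, $d_k^T Q d_k > 0$, so $\alpha_k > 0$. Because $\varphi'$ is affine in $\alpha$, this is the only stationary point on the half line and hence the infimum in the definition. For a general nonlinear $f$ no such formula exists, and locating the first zero of $\varphi'$ requires an iterative procedure of its own.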

Definition 3.3. We say that $\alpha_k$ satisfies the Wolfe conditions if

$$\varphi(\alpha_k) \le \varphi(0) + c_1 \alpha_k \varphi'(0), \tag{3.1}$$

$$\varphi'(\alpha_k) \ge c_2 \varphi'(0), \tag{3.2}$$

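Algorithm 3.5 (Bisection method for step size).
S0 Choose $\alpha > 0$ and set $\alpha_{\text{low}} = \alpha_{\text{high}} = 0$.
S1 If $\alpha$ satisfies (3.1) (that is, if $\alpha$ is short enough) then go to S3.
S2 Else (if $\alpha$ does not satisfy (3.1)) make the replacements $\alpha_{\text{high}} \leftarrow \alpha$ and $\alpha \leftarrow (\alpha_{\text{low}} + \alpha_{\text{high}})/2$, and then go to S1.
S3 If $\alpha$ satisfies (3.2) (that is, $\alpha$ now satisfies both Wolfe conditions) output $\alpha_k = \alpha$ and stop.
S4 Otherwise (if $\alpha$ does not satisfy (3.2)), make the replacements $\alpha_{\text{low}} \leftarrow \alpha$ and

$$\alpha \leftarrow \begin{cases} 2\alpha_{\text{low}} & \text{if } \alpha_{\text{high}} = 0, \\ \tfrac{1}{2}(\alpha_{\text{low}} + \alpha_{\text{high}}) & \text{if } \alpha_{\text{high}} > 0, \end{cases}$$

and then go back to S1.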
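The bisection steps S0 to S4 translate almost line by line into code. The sketch below makes several assumptions that are not stated in the text above: the constants default to $c_1 = 10^{-4}$ and $c_2 = 0.9$ (a common choice with $0 < c_1 < c_2 < 1$), the loop is capped by a hypothetical `max_iter`, and the test function $\varphi(\alpha) = (\alpha - 1)^2$ is purely illustrative.

```python
def bisection_step_size(phi, dphi, alpha=1.0, c1=1e-4, c2=0.9, max_iter=100):
    """Search for a step length satisfying both Wolfe conditions (3.1), (3.2)."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    alpha_low = alpha_high = 0.0                        # S0
    for _ in range(max_iter):
        if phi(alpha) > phi0 + c1 * alpha * dphi0:      # S1 fails: alpha too long
            alpha_high = alpha                          # S2
            alpha = 0.5 * (alpha_low + alpha_high)
        elif dphi(alpha) >= c2 * dphi0:                 # S3: both conditions hold
            return alpha
        else:                                           # S4: alpha too short
            alpha_low = alpha
            if alpha_high == 0.0:
                alpha = 2.0 * alpha_low                 # no upper bracket yet
            else:
                alpha = 0.5 * (alpha_low + alpha_high)
    raise RuntimeError("no Wolfe step found within max_iter iterations")


def phi(a):
    return (a - 1.0) ** 2


def dphi(a):
    return 2.0 * (a - 1.0)


from_long = bisection_step_size(phi, dphi, alpha=4.0)    # shrinks 4 -> 2 -> 1
from_short = bisection_step_size(phi, dphi, alpha=0.01)  # doubles up to 0.16
```

Starting too long, the method brackets from above and bisects; starting too short, it doubles $\alpha$ until the curvature condition (3.2) holds. In both runs the returned value satisfies both conditions for this $\varphi$.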

Proposition 3.6. Under the assumptions of Proposition 3.4, Algorithm 3.5 terminates in finite time and outputs a choice of $\alpha_k$ that satisfies both Wolfe conditions.

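Proof. Note that the two sets

$$W_1 := \{\alpha \ge 0 : \text{(3.1) holds}\}, \qquad W_2 := \{\alpha \ge 0 : \text{(3.2) holds}\}$$

are closed subsets of $\mathbb{R}_+$. Moreover,

$$\varphi(\alpha) = \varphi(0) + \int_0^{\alpha} \varphi'(\tau)\,d\tau < \varphi(0) + \int_0^{\alpha} c_1 \varphi'(0)\,d\tau$$

for all $\alpha$ sufficiently small, because $\varphi'$ is continuous and $c_1 < 1$, showing that there exists $\delta_1 > 0$ such that $[0, \delta_1] \subset W_1$. Let $\alpha > 0$, $(\alpha_{\text{low}}^{[i]})_{\mathbb{N}} \subset W_1$ and $(\alpha_{\text{high}}^{[i]})_{\mathbb{N}} \subset W_1^c$ be such that

$$\alpha_{\text{low}}^{[i]} < \alpha \ \ \forall\, i \in \mathbb{N}, \qquad \alpha_{\text{low}}^{[i]} \xrightarrow{\,i \to \infty\,} \alpha,$$

$$\alpha_{\text{high}}^{[i]} > \alpha \ \ \forall\, i \in \mathbb{N}, \qquad \alpha_{\text{high}}^{[i]} \xrightarrow{\,i \to \infty\,} \alpha.$$

We claim that this implies $\alpha \in W_2^{\circ}$ (the topological interior of $W_2$). In fact, suppose to the contrary that $\alpha \in W_2^c$, and hence that $\varphi'(\alpha) \le c_2 \varphi'(0)$. Then there exists a value $\delta_2 > 0$ such that

$$\varphi'(\alpha + \tau) < c_1 \varphi'(0) \quad \forall\, \tau \in [0, \delta_2],$$

because $\varphi'$ is continuous, $\varphi'(0) < 0$ and $c_1 < c_2$. Therefore,

$$\varphi(\alpha + \tau) = \varphi(\alpha) + \int_{\alpha}^{\alpha + \tau} \varphi'(\theta)\,d\theta < \varphi(0) + c_1 (\alpha + \tau) \varphi'(0)$$

for all $\tau \in [0, \delta_2]$. Since $\alpha_{\text{high}}^{[i]}$ converges to $\alpha$ from the right, there exists an index $j$ large enough so that $\alpha_{\text{high}}^{[j]} \in [\alpha, \alpha + \delta_2]$, contradicting the assumption that $\alpha_{\text{high}}^{[j]} \in W_1^c$.

Therefore, it is indeed the case that $\alpha \in W_2^{\circ}$. Let us now start analysing the algorithm. Note that we only need to prove that the algorithm terminates in finite time, because the termination criterion is set such that if the algorithm terminates, then $\alpha_k$ satisfies both Wolfe conditions.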

  • We say that the algorithm starts iteration $i$ when it visits step S1 for the $i$-th time, starting with iteration $i = 0$. Let $\alpha_{\text{low}}^{[i]}$, $\alpha_{\text{high}}^{[i]}$ and $\alpha^{[i]}$ denote the values of $\alpha_{\text{low}}$, $\alpha_{\text{high}}$ and $\alpha$ respectively just before the algorithm enters iteration $i$.
  • Note that it is impossible that $\alpha_{\text{low}}^{[i]} = 0$ for all $i$, because in that case $\alpha^{[i]} = 2^{-i} \alpha^{[0]}$, and ultimately $\alpha^{[i]} \in [0, \delta_1] \subset W_1$ and $\alpha_{\text{low}}$ is updated to $\alpha^{[i]} > 0$.
  • $(\alpha_{\text{low}}^{[i]})_{\mathbb{N}}$ is an increasing sequence in $W_1$ such that $\alpha_{\text{low}}^{[i]} < \alpha^{[i]}$ for all $i$. In fact, these properties hold true at $i = 0$, and since $\alpha_{\text{low}}$ can only be updated in step S4, it will increase to the strictly larger value $\alpha_{\text{low}}^{[i+1]} = \alpha^{[i]}$, and $\alpha^{[i+1]}$ takes on a strictly larger value than $\alpha^{[i]}$ in the same step.
  • Initially, $\alpha_{\text{high}}^{[i]} = 0$ for a few iterations, but once it takes on a value $\alpha_{\text{high}}^{[i_0]} > 0$ in some iteration $i_0$, then this can only happen in step S2. From then on $(\alpha_{\text{high}}^{[i]})_{\{i \in \mathbb{N} : i \ge i_0\}}$ is a decreasing sequence of values from $W_1^c$, because $\alpha_{\text{high}}$ is only updated in step S2 to a value of $\alpha$ that is strictly smaller than $\alpha_{\text{high}}$ and not in $W_1$, and $\alpha$ itself is updated to a strictly smaller value.
  • Overall, there are only two possible scenarios. In the first, $\alpha_{\text{high}}^{[i]} = 0$ for all $i$, and then $\alpha_{\text{low}}^{[i]} = \alpha^{[0]} 2^{i-1}$ for all $i$, in which case the algorithm detects that $f$ is unbounded below in the direction $d_k$, a situation we excluded in the assumptions of Proposition 3.4. It is thus the second scenario that takes place, which is that there exists an index $i_0 \in \mathbb{N}$ such that $\alpha_{\text{high}}^{[i_0]} > 0$, and from then on $\alpha^{[i]} = (\alpha_{\text{high}}^{[i]} + \alpha_{\text{low}}^{[i]})/2$, $(\alpha_{\text{low}}^{[i]})_{\mathbb{N}}$ is increasing, $(\alpha_{\text{high}}^{[i]})_{\mathbb{N}}$ is decreasing, and the interval $[\alpha_{\text{low}}^{[i]}, \alpha_{\text{high}}^{[i]}]$ is halved in length in every iteration. This shows that $\alpha_{\text{low}}^{[i]}$ converges to a point $\alpha$ from within $W_1$ and $\alpha_{\text{high}}^{[i]}$ converges to the same point from within $W_1^c$. By the arguments above, $\alpha \in W_1 \cap W_2^{\circ}$. Therefore, $\alpha_{\text{low}}^{[i]} \in W_1 \cap W_2$ for $i$ sufficiently large, and the algorithm will detect this and terminate with this value.

3.2. Convergence of Descent Methods. It is now possible to give a fairly general convergence theorem for Algorithm 3.2 as long as the step lengths satisfy the Wolfe conditions. We prepare the proof through a lemma that gives a useful bound on the amount of decrease in the objective function that is achieved in every iteration:

Lemma 3.7. Let Algorithm 3.2 be applied to a $C^1$ function $f$ with $\Lambda$-Lipschitz continuous gradient and assume that the step length $\alpha_k$ satisfies the Wolfe conditions (3.1) and (3.2). Then

$$f(x_{k+1}) \le f(x_k) - \frac{c_1 (1 - c_2)}{\Lambda} \cos^2\theta_k \, \|\nabla f(x_k)\|^2,$$

where $\theta_k$ is the angle between $d_k$ and $-\nabla f(x_k)$, and where $c_1, c_2$ are the constants from Definition 3.3.

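Proof. The second Wolfe condition implies

$$\langle \nabla f(x_k + \alpha_k d_k), d_k \rangle - \langle \nabla f(x_k), d_k \rangle = \varphi'(\alpha_k) - \varphi'(0) \ge (c_2 - 1) \varphi'(0) = (1 - c_2) \bigl(-\langle \nabla f(x_k), d_k \rangle\bigr).$$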

Theorem 3.8 is valid under the assumption that the objective function is bounded below. It is interesting to note that when this is not the case, the algorithm fails to terminate in finite time but produces a sequence $(x_k)_{\mathbb{N}}$ such that $\lim_{k \to \infty} f(x_k) = -\infty$, which is a perfectly sensible and desirable behaviour under the circumstances.