Descent Method, Lecture Notes - Mathematics, Study notes of Mathematical Methods

University of Oxford Mathematical Methods

Unconstrained optimisation, optimality conditions for unconstrained minimisation, line search, descent methods, step length selection, convergence of descent methods

Typology: Study notes

2010/2011

Uploaded on 09/09/2011


C12.1B: CONTINUOUS OPTIMISATION

LECTURE 2: THE DESCENT METHOD AND LINE-SEARCHES

RAPHAEL HAUSER

MATHEMATICAL INSTITUTE, UNIVERSITY OF OXFORD



1. Unconstrained Optimisation. The subject of this chapter is the unconstrained minimisation problem

$$\min_{x \in \mathbb{R}^n} f(x), \tag{1.1}$$

where $f$ is a continuous objective function. Note that no constraints are imposed on the decision variables! Furthermore, we usually assume that $f$ is $C^2$ with Lipschitz-continuous Hessian, that is, there exists $\Lambda > 0$ such that

$$\|D^2 f(x) - D^2 f(y)\| \le \Lambda \|x - y\| \quad \forall\, x, y \in \mathbb{R}^n.$$

In all that follows $\|x\| = \sqrt{\sum_i x_i^2}$ denotes the Euclidean norm of a vector $x \in \mathbb{R}^n$ and $\langle \cdot, \cdot \rangle$ is the corresponding Euclidean inner product. If $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map, then $\|A\|$ denotes the operator norm defined by the Euclidean norms on $\mathbb{R}^n$ and $\mathbb{R}^m$, that is,

$$\|A\| = \inf\{\lambda > 0 : \|Ax\| \le \lambda \|x\| \ \ \forall\, x \in \mathbb{R}^n\}.$$

The gradient $\nabla f(x)$ of a function $f : \mathbb{R}^n \to \mathbb{R}$ is sometimes denoted by $g_f(x)$, and its Hessian $D^2 f(x)$ by $H_f(x)$. The Jacobian $Df(x)$ of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is sometimes denoted by $J_f(x)$. Note: if $m = 1$ then $J_f(x) = g_f(x)^T$. We will also use the so-called "big O" notation: we say that a function $g(x)$ is of order $\|x\|^k$ and write $g(x) = O(\|x\|^k)$ if there exist a constant $c > 0$ and a $\delta > 0$ such that $|g(x)| \le c\|x\|^k$ whenever $\|x\| < \delta$.

Example 1.1 (Risk minimisation under short-selling). Let us go back to Example 2 of Lecture 1. By eliminating $x_n = 1 - \sum_{i=1}^{n-1} x_i$ we can get rid of the constraint

$$\sum_{i=1}^{n} x_i = 1.$$

Furthermore, if we allow short-selling of assets, the constraints

$$x_i \ge 0 \quad (i = 1, \dots, n)$$

are no longer imposed. Finally, let us suppose all the assets considered have the same expected return $\mu_i \equiv \mu$, so that the constraint

$$\sum_{i=1}^{n} \mu_i x_i \ge b$$

can be omitted. The investor's aim is to minimise the risk, which can be modelled as

$$\min_{x \in \mathbb{R}^{n-1}} f(x_1, \dots, x_{n-1}) = \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} \sigma_{ij} x_i x_j + \sum_{j=1}^{n-1} \sigma_{nj} \Bigl(1 - \sum_{i=1}^{n-1} x_i\Bigr) x_j + \sum_{i=1}^{n-1} \sigma_{in} x_i \Bigl(1 - \sum_{j=1}^{n-1} x_j\Bigr) + \sigma_{nn} \Bigl(1 - \sum_{i=1}^{n-1} x_i\Bigr) \Bigl(1 - \sum_{j=1}^{n-1} x_j\Bigr).$$

Since the objective function $f$ is a quadratic (degree 2) polynomial in the decision variables $x_1, \dots, x_{n-1}$, we have $f \in C^\infty$. Moreover, the Hessian $D^2 f(x)$ is the same $(n-1) \times (n-1)$ matrix

$$\begin{pmatrix} \sigma_{11} & \dots & \sigma_{1n} \\ \vdots & & \vdots \\ \sigma_{n1} & \dots & \sigma_{nn} \end{pmatrix}$$

for all $x$, and hence $x \mapsto D^2 f(x)$ is a constant function, which is of course Lipschitz-continuous:

$$\|D^2 f(x) - D^2 f(y)\| = 0 \le \Lambda \|x - y\| \quad \forall\, x, y \in \mathbb{R}^{n-1},$$

for any choice of $\Lambda > 0$.

Example 1.2. On a CAD system it takes $n$ parameters $x_1, \dots, x_n$ to define the shape of a car. An engineer has a piece of software which takes the design parameters $x \in \mathbb{R}^n$ as input and computes the air resistance $f(x)$ of the corresponding fuselage as output. The software typically contains millions of lines of code, but for theoretical reasons it is known that $f \in C^2$. Using an automatic differentiation system, the engineer can automatically produce a piece of software that computes directional derivatives

$$D_v f(x) = \frac{d}{dt} f(x + tv)\Big|_{t=0}, \qquad D_{u,v} f(x) = \frac{d^2}{ds\,dt} f(x + su + tv)\Big|_{s=t=0}.$$

How to choose the design parameters so as to minimise the drag on the fuselage?

Note that in this example the objective function is not available explicitly. This is typical for many applications. In fact, evaluating the objective function might even involve measurements in a physical experiment. Besides appearing as subproblems in constrained optimisation procedures, unconstrained optimisation problems also appear directly in many applications.
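To make the idea of computing $D_v f(x)$ by automatic differentiation concrete, here is a minimal forward-mode sketch based on dual numbers. It is an illustration only: the class `Dual`, the helper `directional_derivative` and the toy objective `toy_drag` are hypothetical names introduced here, and the sketch supports only addition and multiplication, far less than a real automatic differentiation system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A number val + der*eps with eps**2 = 0; der carries the derivative."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (a*b)' = a*b' + a'*b
        return Dual(self.val * other.val,
                    self.val * other.der + self.der * other.val)

    __rmul__ = __mul__


def _lift(x):
    """Promote a plain number to a Dual with zero derivative part."""
    return x if isinstance(x, Dual) else Dual(float(x))


def directional_derivative(f, x, v):
    """Return D_v f(x), the derivative of t -> f(x + t*v) at t = 0."""
    seeded = [Dual(xi, vi) for xi, vi in zip(x, v)]
    return f(seeded).der


def toy_drag(x):
    """Hypothetical stand-in for the black-box objective of Example 1.2."""
    return x[0] * x[0] + 3.0 * x[0] * x[1] + 2.0 * x[1]


if __name__ == "__main__":
    # The gradient of toy_drag at (1, 2) is (8, 5), so D_v f = 8*v1 + 5*v2.
    print(directional_derivative(toy_drag, [1.0, 2.0], [1.0, 0.0]))
    print(directional_derivative(toy_drag, [1.0, 2.0], [0.0, 1.0]))
```

The point of the design is that the objective code itself is left untouched: it is simply evaluated on seeded inputs, which is what makes the approach usable when $f$ is only available as a program.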

2. Optimality Conditions for Unconstrained Minimisation. A well-designed optimisation algorithm should be able to recognise when an approximate minimum has been attained. We therefore need a mathematical characterisation of local minimisers. At school we all learned that in the univariate case, a necessary condition is that $f'(x) = 0$, and that second derivatives help to decide whether $x$ is a local maximiser or minimiser. The same idea works in higher dimensions:

Theorem 2.1.

(i) If $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $x^* \in \mathbb{R}^n$ and has a local minimum there, then $\nabla f(x^*) = 0$, that is, $x^*$ is a stationary point of $f$. This is a first order necessary optimality condition, because it involves first derivatives, or the first order Taylor approximation of $f$.

(ii) If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $x^* \in \mathbb{R}^n$ and has a local minimum there, then the Hessian $D^2 f(x^*)$ is positive semidefinite, that is, $h^T D^2 f(x^*) h \ge 0$ for all $h \in \mathbb{R}^n$. This is a second order necessary optimality condition.

(iii) If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $x^* \in \mathbb{R}^n$, and if $\nabla f(x^*) = 0$ and $D^2 f(x^*)$ is positive definite, that is, if $h^T D^2 f(x^*) h > 0$ for all $h \in \mathbb{R}^n \setminus \{0\}$, then $x^*$ is a local minimiser of $f$. These are sufficient optimality conditions.
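As an illustration of how the sufficient conditions in part (iii) can be tested numerically, the following sketch checks stationarity and positive definiteness for a function of two variables. Everything in it is an assumption made for the example: the finite-difference helpers, the tolerance, the restriction to two variables (so that Sylvester's criterion reduces to two inequalities) and the test function `example_f`.

```python
import math


def gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g


def hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                return f(y)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H


def satisfies_sufficient_conditions_2d(f, x, tol=1e-6):
    """Check grad f(x) ~ 0 and D^2 f(x) positive definite (two variables only)."""
    g = gradient(f, x)
    if math.hypot(g[0], g[1]) > tol:
        return False  # not (numerically) a stationary point
    H = hessian(f, x)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    # Sylvester's criterion for a symmetric 2x2 matrix
    return H[0][0] > 0.0 and det > 0.0


def example_f(x):
    """Quadratic with gradient (2*x0 + x1 - 3, x0 + 2*x1); stationary at (2, -1)."""
    return x[0] ** 2 + x[0] * x[1] + x[1] ** 2 - 3.0 * x[0]
```

For `example_f` the check succeeds at the point (2, -1) and fails at the origin, where the gradient is (-3, 0).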

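Therefore,

$$\begin{aligned}
h^T D^2 f(x^*) h &= \Bigl(\sum_{i=1}^{n} \langle v_i, h\rangle v_i\Bigr)^{T} \Bigl(\sum_{j=1}^{n} \langle v_j, h\rangle \sigma_j v_j\Bigr) = \sum_{i,j} \langle v_i, h\rangle \langle v_j, h\rangle \sigma_j \langle v_i, v_j\rangle \\
&= \sum_{i=1}^{n} \langle v_i, h\rangle^2 \sigma_i \ge \sigma_n \sum_{i=1}^{n} \langle v_i, h\rangle^2 = \sigma_n \sum_{i=1}^{n} \bigl\langle \langle v_i, h\rangle v_i, \langle v_i, h\rangle v_i \bigr\rangle \\
&= \sigma_n \Bigl\langle \sum_{i=1}^{n} \langle v_i, h\rangle v_i, \sum_{j=1}^{n} \langle v_j, h\rangle v_j \Bigr\rangle = \sigma_n \|h\|^2.
\end{aligned} \tag{2.3}$$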

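Let $c, \delta > 0$ be as in part (ii). Then (2.3) implies that for all $h$ such that $\|h\| < \min(\delta, \sigma_n / 2c)$ we have

$$\begin{aligned}
f(x^* + h) &= f(x^*) + \tfrac{1}{2} h^T D^2 f(x^*) h + O(\|h\|^3) \\
&\overset{(2.3)}{\ge} f(x^*) + \tfrac{1}{2} \|h\|^2 \sigma_n - c\|h\|^3 \\
&\ge f(x^*) + \tfrac{1}{2} \|h\|^2 \sigma_n - c\,\frac{\sigma_n}{2c}\,\|h\|^2 = f(x^*),
\end{aligned}$$

which shows that $x^*$ is a local minimiser of $f$.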

3. Line-Search Descent Methods. The optimality conditions we just derived play an important role in the construction of algorithms. Suppose we solve the simultaneous system of nonlinear equations

$$\nabla f(x) = 0$$

by an iterative procedure generating a sequence of points $(x_k)_{\mathbb{N}}$. If we can ensure that $f(x_k)$ decreases in each iteration,

$$f(x_{k+1}) \le f(x_k) \quad \forall\, k,$$

then in practice $(x_k)_{\mathbb{N}}$ can only converge to a local minimiser $x^*$, and

$$\|\nabla f(x_k)\| < \epsilon$$

can be used as a stopping criterion. Thus, solving unconstrained optimisation problems is closely related to the problem of solving simultaneous equations, with the added feature that progress can be controlled by monitoring a naturally defined merit function (i.e., one asks "does $f$ decrease?"). Most competitive algorithms for unconstrained minimisation are based on this idea. There are two main families of such methods: line-search methods and trust region methods. We start with a description of the former.

Example 3.1 (Steepest descent without line searches). A simple method is defined as follows: starting from some $x_0 \in \mathbb{R}^n$, compute a sequence of intermediate solutions $(x_k)_{\mathbb{N}}$ via

$$x_{k+1} = x_k - \nabla f(x_k).$$

The method is motivated by the fact that $-\nabla f(x_k)$ is the direction in which $f$ decreases fastest when moving away from $x_k$. But is it a descent method? The first order Taylor approximation of $f$ shows that $f(x_k - \alpha \nabla f(x_k)) \le f(x_k)$ for small $\alpha > 0$. However, it is not necessarily the case that $f(x_{k+1}) \le f(x_k)$, as the step $-\nabla f(x_k)$ can be too long. To make this a true descent method, we have to use line-searches: in each iteration we have to find $\alpha_k > 0$ such that

$$f(x_k - \alpha_k \nabla f(x_k)) < f(x_k),$$

and then we can set

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$

A word of warning: although this method works in principle, it is too primitive to produce any good results in practice! We will later learn why. For now we set out to generalise this example.
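A one-dimensional sketch shows how the full step $-\nabla f(x_k)$ can overshoot. The function $f(x) = 2x^2$ and the damped step length 0.1 are choices made purely for this illustration.

```python
def f(x):
    """Toy objective f(x) = 2*x**2 with minimiser x = 0."""
    return 2.0 * x * x


def grad_f(x):
    """Derivative of the toy objective."""
    return 4.0 * x


def step(x, alpha=1.0):
    """One iteration x - alpha * grad f(x); alpha = 1 is the method without line search."""
    return x - alpha * grad_f(x)


if __name__ == "__main__":
    x = 1.0
    # Full step: x moves from 1 to -3, so f grows from 2 to 18.
    print(f(x), f(step(x)))
    # Damped step: the objective value decreases.
    print(f(x), f(step(x, alpha=0.1)))
```

With the full step every iterate is multiplied by $-3$, so the sequence diverges, whereas any $\alpha \in (0, 1/2)$ produces a decrease for this particular function.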

Algorithm 3.2 (Descent method).

S0 Choose a starting point $x_0 \in \mathbb{R}^n$ and a tolerance parameter $\epsilon > 0$. Set $k = 0$.

S1 If $\|\nabla f(x_k)\| \le \epsilon$ then stop and output $x_k$ as an approximate local minimiser.

S2 Otherwise choose a search direction $d_k \in \mathbb{R}^n$ such that $\langle \nabla f(x_k), d_k \rangle < 0$.

S3 Choose a step size $\alpha_k > 0$ such that $f(x_k + \alpha_k d_k) < f(x_k)$.

S4 Set $x_{k+1} := x_k + \alpha_k d_k$, replace $k$ by $k + 1$, and go back to S1.
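The steps S0 to S4 can be sketched in a few lines of Python. Algorithm 3.2 leaves the search direction and the step size open, so the sketch has to make choices that are not part of the algorithm itself: it uses the steepest descent direction $d_k = -\nabla f(x_k)$ and halves a trial step until the objective decreases. The names `descent_method`, `bowl` and `bowl_grad`, the iteration cap and the lower limit on the step are likewise assumptions for illustration.

```python
import math


def descent_method(f, grad, x0, eps=1e-6, max_iter=10_000):
    """Generic descent loop following S0-S4, with illustrative choices in S2 and S3."""
    x = list(x0)                                        # S0
    for k in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:  # S1: stopping test
            return x, k
        d = [-gi for gi in g]                           # S2: <g, d> = -||g||^2 < 0
        fx = f(x)
        alpha = 1.0                                     # S3: halve until f decreases
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        while f(trial) >= fx and alpha > 1e-12:
            alpha *= 0.5
            trial = [xi + alpha * di for xi, di in zip(x, d)]
        x = trial                                       # S4
    return x, max_iter


def bowl(x):
    """Toy objective with minimiser (1, -2)."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2


def bowl_grad(x):
    """Gradient of the toy objective."""
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]


if __name__ == "__main__":
    print(descent_method(bowl, bowl_grad, [0.0, 0.0]))
```

Requiring only a plain decrease in S3, as done here, mirrors the algorithm as stated; it is not by itself enough for a convergence guarantee, which is why the step length selection is treated separately.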

Below we will see that the minimal assumption we need to make for this algorithm to work is $f \in C^1$ with Lipschitz-continuous gradient. The generality of Algorithm 3.2 leaves flexibility both in the choice of the step length $\alpha_k$ and the search direction $d_k$. In the remainder of this lecture we discuss the step length selection and treat the choice of good search directions in the next few lectures.

3.1. Step Length Selection. The conceptually simplest method of choosing $\alpha_k$ is the exact line search, defined by

$$\alpha_k := \inf\{\alpha \ge 0 : \varphi'(\alpha) = 0\},$$

where $\varphi(\alpha) = f(x_k + \alpha d_k)$. That is to say, the point $x_k + \alpha_k d_k$ is the first stationary point of $f$ encountered along the half line $\{x_k + \alpha d_k : \alpha \ge 0\}$. Note that if $\varphi$ has no stationary point on this half line, as is the case for example when $\varphi(\alpha) = -\ln \alpha$, then $\{\alpha \ge 0 : \varphi'(\alpha) = 0\} = \emptyset$, and hence $\alpha_k := \inf \emptyset = +\infty$ corresponds to an infinitely long step, which is still sensible. Exact line searches are mainly a theoretical tool in the convergence analysis of algorithms. In practice, they are computationally too expensive. We will now derive step length computations that are equally good choices for the purposes of Algorithm 3.2 and much cheaper to compute.
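One case in which the exact line search can be written down in closed form is a strictly convex quadratic objective. The following worked computation is an illustration under that assumption (the matrix $A$ and vector $b$ are not part of the notes); for a general $f$ no such formula is available.

```latex
% Assumption for this example only: f(x) = (1/2) x^T A x - b^T x,
% with A symmetric positive definite, so that \nabla f(x) = A x - b.
\begin{align*}
\varphi(\alpha) &= f(x_k + \alpha d_k), \\
\varphi'(\alpha) &= \langle \nabla f(x_k + \alpha d_k), d_k \rangle
  = \langle \nabla f(x_k), d_k \rangle + \alpha \langle d_k, A d_k \rangle, \\
\varphi'(\alpha_k) = 0 \quad &\Longleftrightarrow \quad
  \alpha_k = -\frac{\langle \nabla f(x_k), d_k \rangle}{\langle d_k, A d_k \rangle} > 0 .
\end{align*}
```

Here $\alpha_k$ is positive because $d_k$ is a descent direction, so the numerator $\langle \nabla f(x_k), d_k \rangle$ is negative, while the denominator is positive by the definiteness of $A$. Since $\varphi'$ is affine in $\alpha$, this is the only stationary point along the half line.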

Definition 3.3. We say that $\alpha_k$ satisfies the Wolfe conditions if

$$\varphi(\alpha_k) \le \varphi(0) + c_1 \alpha_k \varphi'(0), \tag{3.1}$$

$$\varphi'(\alpha_k) \ge c_2 \varphi'(0), \tag{3.2}$$
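The two inequalities (3.1) and (3.2) are straightforward to test in code once $\varphi$ and $\varphi'$ can be evaluated. The sketch below assumes constants with $0 < c_1 < c_2 < 1$; the default values `c1=1e-4` and `c2=0.9` are a common textbook choice and are an assumption here, as are the function names and the example $\varphi(\alpha) = (\alpha - 1)^2$.

```python
def wolfe_conditions(phi, dphi, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for the step length alpha.

    phi(a) stands for f(x_k + a*d_k) and dphi is its derivative.
    The defaults for c1 and c2 are assumed values with 0 < c1 < c2 < 1.
    """
    sufficient_decrease = phi(alpha) <= phi(0.0) + c1 * alpha * dphi(0.0)  # (3.1)
    curvature = dphi(alpha) >= c2 * dphi(0.0)                              # (3.2)
    return sufficient_decrease, curvature


def phi_example(a):
    """Example restriction to a line: phi(a) = (a - 1)**2."""
    return (a - 1.0) ** 2


def dphi_example(a):
    """Derivative of the example; dphi_example(0) = -2 < 0."""
    return 2.0 * (a - 1.0)


if __name__ == "__main__":
    for a in (0.01, 1.0, 2.5):
        print(a, wolfe_conditions(phi_example, dphi_example, a))
```

In this example the very short step 0.01 passes (3.1) but fails (3.2), the step 2.5 fails (3.1) but passes (3.2), and the step 1.0 satisfies both, which illustrates how the two conditions bracket acceptable step lengths from either side.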

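Algorithm 3.5 (Bisection method for step size).

S0 Choose $\alpha > 0$ and set $\alpha_{\mathrm{low}} = \alpha_{\mathrm{high}} = 0$.

S1 If $\alpha$ satisfies (3.1) (that is, if $\alpha$ is short enough) then goto S3.

S2 Else (if $\alpha$ does not satisfy (3.1)) make the replacements $\alpha_{\mathrm{high}} \leftarrow \alpha$ and $\alpha \leftarrow (\alpha_{\mathrm{low}} + \alpha_{\mathrm{high}})/2$, and then goto S1.

S3 If $\alpha$ satisfies (3.2) (that is, $\alpha$ now satisfies both Wolfe conditions) output $\alpha_k = \alpha$ and stop.

S4 Otherwise (if $\alpha$ does not satisfy (3.2)), make the replacements $\alpha_{\mathrm{low}} \leftarrow \alpha$ and

$$\alpha \leftarrow \begin{cases} 2\alpha_{\mathrm{low}} & \text{if } \alpha_{\mathrm{high}} = 0, \\ \tfrac{1}{2}(\alpha_{\mathrm{low}} + \alpha_{\mathrm{high}}) & \text{if } \alpha_{\mathrm{high}} > 0, \end{cases}$$

and then go back to S1.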
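The steps S0 to S4 translate almost line by line into code. The sketch below is one possible transcription, not a reference implementation: the function name, the iteration cap `max_iter`, the default constants `c1=1e-4` and `c2=0.9` (assumed to satisfy $0 < c_1 < c_2 < 1$) and the example $\varphi(\alpha) = (\alpha - 1)^2$ are all assumptions made for illustration.

```python
def wolfe_bisection(phi, dphi, alpha=1.0, c1=1e-4, c2=0.9, max_iter=100):
    """Search for a step length satisfying (3.1) and (3.2) by bisection.

    phi(a) stands for f(x_k + a*d_k) and dphi is its derivative.
    """
    phi0, dphi0 = phi(0.0), dphi(0.0)
    low, high = 0.0, 0.0                                # S0
    for _ in range(max_iter):
        if phi(alpha) > phi0 + c1 * alpha * dphi0:      # S1 fails, so do S2
            high = alpha
            alpha = 0.5 * (low + high)
        elif dphi(alpha) < c2 * dphi0:                  # S3 fails, so do S4
            low = alpha
            alpha = 2.0 * low if high == 0.0 else 0.5 * (low + high)
        else:                                           # both conditions hold
            return alpha
    raise RuntimeError("no Wolfe step found within max_iter iterations")


def phi_example(a):
    """Example restriction to a line: phi(a) = (a - 1)**2."""
    return (a - 1.0) ** 2


def dphi_example(a):
    """Derivative of the example; dphi_example(0) = -2 < 0."""
    return 2.0 * (a - 1.0)


if __name__ == "__main__":
    print(wolfe_bisection(phi_example, dphi_example, alpha=8.0))
    print(wolfe_bisection(phi_example, dphi_example, alpha=0.01))
```

Starting from a step that is too long, the search halves the bracket from above; starting from a step that is too short, it doubles the trial step until the curvature condition holds. The iteration cap is a practical safeguard for the case in which $\varphi$ is unbounded below along the search direction.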

Proposition 3.6. Under the assumptions of Proposition 3.4, Algorithm 3.5 terminates in finite time and outputs a choice of $\alpha_k$ that satisfies both Wolfe conditions.

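Proof. Note that the two sets

$$W_1 := \{\alpha \ge 0 : \text{(3.1) holds}\}, \qquad W_2 := \{\alpha \ge 0 : \text{(3.2) holds}\}$$

are closed subsets of $\mathbb{R}_+$. Moreover,

$$\varphi(\alpha) = \varphi(0) + \int_0^{\alpha} \varphi'(\tau)\,d\tau < \varphi(0) + \int_0^{\alpha} c_1 \varphi'(0)\,d\tau$$

for all $\alpha$ sufficiently small, because $\varphi'$ is continuous and $c_1 < 1$, showing that there exists $\delta_1 > 0$ such that $[0, \delta_1] \subset W_1$. Let $\alpha > 0$, $(\alpha_{\mathrm{low}}^{[i]})_{\mathbb{N}} \subset W_1$ and $(\alpha_{\mathrm{high}}^{[i]})_{\mathbb{N}} \subset W_1^c$ be such that

$$\alpha_{\mathrm{low}}^{[i]} < \alpha \ \ \forall\, i \in \mathbb{N}, \qquad \alpha_{\mathrm{low}}^{[i]} \xrightarrow{\,i \to \infty\,} \alpha,$$

$$\alpha_{\mathrm{high}}^{[i]} > \alpha \ \ \forall\, i \in \mathbb{N}, \qquad \alpha_{\mathrm{high}}^{[i]} \xrightarrow{\,i \to \infty\,} \alpha.$$

We claim that this implies $\alpha \in W_2^{\circ}$ (the topological interior of $W_2$). In fact, suppose to the contrary that $\alpha \in W_2^c$, and hence that $\varphi'(\alpha) \le c_2 \varphi'(0)$. Then there exists a value $\delta_2 > 0$ such that

$$\varphi'(\alpha + \tau) < c_1 \varphi'(0) \quad \forall\, \tau \in [0, \delta_2],$$

because $\varphi'$ is continuous and $c_1 < c_2$ (so that $c_2 \varphi'(0) < c_1 \varphi'(0)$, as $\varphi'(0) < 0$). Therefore,

$$\varphi(\alpha + \tau) = \varphi(\alpha) + \int_{\alpha}^{\alpha + \tau} \varphi'(\theta)\,d\theta < \varphi(0) + c_1 (\alpha + \tau) \varphi'(0)$$

for all $\tau \in [0, \delta_2]$. Since $\alpha_{\mathrm{high}}^{[i]}$ converges to $\alpha$ from the right, there exists an index $j$ large enough so that $\alpha_{\mathrm{high}}^{[j]} \in [\alpha, \alpha + \delta_2]$, contradicting the assumption that $\alpha_{\mathrm{high}}^{[j]} \in W_1^c$.

Therefore, it is indeed the case that $\alpha \in W_2^{\circ}$. Let us now start analysing the algorithm. Note that we only need to prove that the algorithm terminates in finite time, because the termination criterion is set such that if the algorithm terminates, then $\alpha_k$ satisfies both Wolfe conditions.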

We say that the algorithm starts iteration $i$ when it visits step S1 for the $i$-th time, starting with iteration $i = 0$. Let $\alpha_{\mathrm{low}}^{[i]}$, $\alpha_{\mathrm{high}}^{[i]}$ and $\alpha^{[i]}$ denote the values of $\alpha_{\mathrm{low}}$, $\alpha_{\mathrm{high}}$ and $\alpha$ respectively just before the algorithm enters iteration $i$.
Note that it is impossible that $\alpha_{\mathrm{low}}^{[i]} = 0$ for all $i$, because in that case $\alpha^{[i]} = 2^{-i} \alpha^{[0]}$, and ultimately $\alpha^{[i]} \in [0, \delta_1] \subset W_1$ and $\alpha_{\mathrm{low}}$ is updated to $\alpha^{[i]} > 0$.
$(\alpha_{\mathrm{low}}^{[i]})_{\mathbb{N}}$ is an increasing sequence in $W_1$ such that $\alpha_{\mathrm{low}}^{[i]} < \alpha^{[i]}$ for all $i$. In fact, these properties hold true at $i = 0$, and since $\alpha_{\mathrm{low}}$ can only be updated in step S4, it will increase to the strictly larger value $\alpha_{\mathrm{low}}^{[i+1]} = \alpha^{[i]}$, and $\alpha^{[i+1]}$ takes on a strictly larger value than $\alpha^{[i]}$ in the same step.
Initially, $\alpha_{\mathrm{high}}^{[i]} = 0$ for a few iterations, but once it takes on a value $\alpha_{\mathrm{high}}^{[i_0]} > 0$ in some iteration $i_0$, then this can only happen in step S2. From then on $(\alpha_{\mathrm{high}}^{[i]})_{\{i \in \mathbb{N} : i \ge i_0\}}$ is a decreasing sequence of values from $W_1^c$, because $\alpha_{\mathrm{high}}$ is only updated in step S2 to a value of $\alpha$ that is strictly smaller than $\alpha_{\mathrm{high}}$ and not in $W_1$, and $\alpha$ itself is updated to a strictly smaller value.
Overall, there are only two possible scenarios. In the first, $\alpha_{\mathrm{high}}^{[i]} = 0$ for all $i$, and then $\alpha_{\mathrm{low}}^{[i]} = \alpha^{[0]} 2^{i-1}$ for all $i$, in which case the algorithm detects that $f$ is unbounded below in the direction $d_k$, a situation we excluded in the assumptions of Proposition 3.4. It is thus the second scenario that takes place, which is that there exists an index $i_0 \in \mathbb{N}$ such that $\alpha_{\mathrm{high}}^{[i_0]} > 0$, and from then on $\alpha^{[i]} = (\alpha_{\mathrm{high}}^{[i]} + \alpha_{\mathrm{low}}^{[i]})/2$, $(\alpha_{\mathrm{low}}^{[i]})_{\mathbb{N}}$ is increasing, $(\alpha_{\mathrm{high}}^{[i]})_{\mathbb{N}}$ is decreasing, and the interval $[\alpha_{\mathrm{low}}^{[i]}, \alpha_{\mathrm{high}}^{[i]}]$ is halved in length in every iteration. This shows that $\alpha_{\mathrm{low}}^{[i]}$ converges to a point $\alpha$ from within $W_1$ and $\alpha_{\mathrm{high}}^{[i]}$ converges to the same point from within $W_1^c$. By the arguments above, $\alpha \in W_1 \cap W_2^{\circ}$. Therefore, $\alpha_{\mathrm{low}}^{[i]} \in W_1 \cap W_2$ for $i$ sufficiently large, and the algorithm will detect this and terminate with this value.

3.2. Convergence of Descent Methods. It is now possible to give a fairly general convergence theorem for Algorithm 3.2 as long as the step lengths satisfy the Wolfe conditions. We prepare the proof through a lemma that gives a useful bound on the amount of decrease in the objective function that is achieved in every iteration:

Lemma 3.7. Let Algorithm 3.2 be applied to a $C^1$ function $f$ with $\Lambda$-Lipschitz continuous gradient and assume that the step length $\alpha_k$ satisfies the Wolfe conditions (3.1) and (3.2). Then

$$f(x_{k+1}) \le f(x_k) - \frac{c_1 (1 - c_2)}{\Lambda} \cos^2 \theta_k \, \|\nabla f(x_k)\|^2,$$

where $\theta_k$ is the angle between $d_k$ and $-\nabla f(x_k)$, and where $c_1, c_2$ are the constants from Definition 3.3.
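The bound of Lemma 3.7 can be checked numerically on a simple example. The sketch below makes several assumptions for illustration: the quadratic $f(x) = \tfrac12 (x_1^2 + 10 x_2^2)$, whose gradient is Lipschitz with constant $\Lambda = 10$, the constants `c1=1e-4` and `c2=0.9`, and the helper names. It uses the identity $\cos^2 \theta_k \|\nabla f(x_k)\|^2 = \langle \nabla f(x_k), d_k \rangle^2 / \|d_k\|^2$ to avoid computing the angle explicitly.

```python
def dot(u, v):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def f_example(x):
    """Toy objective; its gradient is Lipschitz with constant 10."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad_example(x):
    """Gradient of the toy objective."""
    return [x[0], 10.0 * x[1]]


def check_decrease_bound(f, grad, x, d, alpha, lam, c1=1e-4, c2=0.9):
    """Return (wolfe_ok, actual_decrease, guaranteed_decrease) for one step.

    lam is the Lipschitz constant of the gradient; c1 and c2 are assumed
    Wolfe constants with 0 < c1 < c2 < 1.
    """
    slope0 = dot(grad(x), d)                      # phi'(0) = <grad f(x), d>
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    wolfe_ok = (f(x_new) <= f(x) + c1 * alpha * slope0          # (3.1)
                and dot(grad(x_new), d) >= c2 * slope0)         # (3.2)
    actual = f(x) - f(x_new)
    # cos^2(theta) * ||grad f(x)||^2 equals <grad f(x), d>^2 / ||d||^2
    guaranteed = c1 * (1.0 - c2) / lam * slope0 ** 2 / dot(d, d)
    return wolfe_ok, actual, guaranteed


if __name__ == "__main__":
    print(check_decrease_bound(f_example, grad_example,
                               [1.0, 1.0], [-1.0, -1.0], 1.0, 10.0))
```

In this instance the step $\alpha = 1$ along $d = (-1, -1)$ satisfies both Wolfe conditions and the actual decrease exceeds the guaranteed one by a wide margin, which is typical: the lemma gives a worst-case bound, not an estimate of the decrease.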

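Proof. The second Wolfe condition implies

$$\langle \nabla f(x_k + \alpha_k d_k), d_k \rangle - \langle \nabla f(x_k), d_k \rangle = \varphi'(\alpha_k) - \varphi'(0) \ge (c_2 - 1)\varphi'(0) = (1 - c_2)\bigl(-\langle \nabla f(x_k), d_k \rangle\bigr).$$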

Theorem 3.8 is valid under the assumption that the objective function is bounded below. It is interesting to note that when this is not the case, the algorithm fails to terminate in finite time but produces a sequence $(x_k)_{\mathbb{N}}$ such that $\lim_{k \to \infty} f(x_k) = -\infty$, which is a perfectly sensible and desirable behaviour under the circumstances.
