






Unconstrained Optimisation: optimality conditions for unconstrained minimisation, line search, descent methods, step length selection, convergence of descent methods







RAPHAEL HAUSER MATHEMATICAL INSTITUTE, UNIVERSITY OF OXFORD
$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1.1)$$
where $f$ is a continuous objective function. Note that no constraints are imposed on the decision variables! Furthermore, we usually assume that $f$ is $C^2$ with Lipschitz-continuous Hessian, that is, there exists $\Lambda > 0$ such that
$$\|D^2 f(x) - D^2 f(y)\| \le \Lambda \|x - y\| \quad \forall\, x, y \in \mathbb{R}^n.$$
In all that follows $\|x\| = \sqrt{\sum_{i=1}^n x_i^2}$ denotes the Euclidean norm of a vector $x \in \mathbb{R}^n$ and $\langle \cdot, \cdot \rangle$ is the corresponding Euclidean inner product. If $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map, then $\|A\|$ denotes the operator norm defined by the Euclidean norms on $\mathbb{R}^n$ and $\mathbb{R}^m$, that is,
$$\|A\| = \inf\{\lambda > 0 : \|Ax\| \le \lambda \|x\| \ \ \forall\, x \in \mathbb{R}^n\}.$$
The gradient $\nabla f(x)$ of a function $f : \mathbb{R}^n \to \mathbb{R}$ is sometimes denoted by $g_f(x)$, and its Hessian $D^2 f(x)$ by $H_f(x)$. The Jacobian $Df(x)$ of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is sometimes denoted by $J_f(x)$. Note: if $m = 1$ then $J_f(x) = g_f(x)^T$. We will also use the so-called "big O" notation: we say that a function $g(x)$ is of order $\|x\|^k$ and write $g(x) = O(\|x\|^k)$ if there exist constants $c > 0$ and $\delta > 0$ such that $|g(x)| \le c\|x\|^k$ whenever $\|x\| < \delta$.
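As a quick numerical illustration of the operator norm defined above, the following sketch uses the standard fact that the Euclidean operator norm of a matrix equals its largest singular value. Python with NumPy, the random matrix and the sample size are all assumptions of this example and not part of the notes.

```python
import numpy as np

def operator_norm(A):
    """Operator norm induced by the Euclidean norms: the largest singular value of A."""
    return np.linalg.norm(A, 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))      # arbitrary linear map from R^5 to R^3
lam = operator_norm(A)

# No sampled ratio ||Ax|| / ||x|| should exceed the operator norm.
xs = rng.standard_normal((1000, 5))
ratios = np.linalg.norm(xs @ A.T, axis=1) / np.linalg.norm(xs, axis=1)
worst_ratio = ratios.max()

# The bound is attained at the leading right singular vector.
_, _, vt = np.linalg.svd(A)
attained = np.linalg.norm(A @ vt[0])
```

Here `worst_ratio` stays below `lam`, while `attained` matches it up to rounding, which is what the infimum in the definition expresses.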
Example 1.1 (Risk minimisation under short-selling). Let us go back to Example 2 of Lecture 1. By eliminating $x_n = 1 - \sum_{i=1}^{n-1} x_i$ we can get rid of the constraint
$$\sum_{i=1}^{n} x_i = 1.$$
Furthermore, if we allow short-selling of assets, the constraints
$$x_i \ge 0 \quad (i = 1, \dots, n)$$
are no longer imposed. Finally, let us suppose all the assets considered have the same expected return $\mu_i \equiv \mu$, so that the constraint
$$\sum_{i=1}^{n} \mu_i x_i \ge b$$
can be omitted. The investor's aim is to minimise the risk, which can be modelled as
$$\min_{x \in \mathbb{R}^{n-1}} f(x_1, \dots, x_{n-1}) = \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} \sigma_{ij} x_i x_j + \sum_{j=1}^{n-1} \sigma_{nj} \Bigl(1 - \sum_{i=1}^{n-1} x_i\Bigr) x_j + \sum_{i=1}^{n-1} \sigma_{in} x_i \Bigl(1 - \sum_{j=1}^{n-1} x_j\Bigr) + \sigma_{nn} \Bigl(1 - \sum_{i=1}^{n-1} x_i\Bigr) \Bigl(1 - \sum_{j=1}^{n-1} x_j\Bigr).$$
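Since the objective function $f$ is a quadratic (degree 2) polynomial in the decision variables $x_1, \dots, x_{n-1}$, we have $f \in C^\infty$. Moreover, using the symmetry $\sigma_{ij} = \sigma_{ji}$, the Hessian $D^2 f(x)$ is the same $(n-1) \times (n-1)$ matrix with entries
$$\bigl(D^2 f(x)\bigr)_{ij} = 2\,(\sigma_{ij} - \sigma_{in} - \sigma_{nj} + \sigma_{nn}) \quad (i, j = 1, \dots, n-1)$$
for all $x$, and hence $x \mapsto D^2 f(x)$ is a constant function, which is of course Lipschitz-continuous: $\|D^2 f(x) - D^2 f(y)\| = 0 \le \Lambda \|x - y\|$ for all $x, y \in \mathbb{R}^{n-1}$ and any $\Lambda > 0$.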
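The claim that the Hessian in Example 1.1 does not depend on $x$ can be checked numerically. The sketch below is only an illustration under assumptions: the covariance matrix `sigma` is a randomly generated symmetric matrix, the helper names are hypothetical, and the Hessian is approximated by central finite differences. It compares the result at two different points with the closed form $2 B^T \Sigma B$, where $B$ maps the reduced variables to the full portfolio direction.

```python
import numpy as np

def risk(x_reduced, sigma):
    """Portfolio risk after eliminating x_n = 1 - (x_1 + ... + x_{n-1})."""
    x_full = np.append(x_reduced, 1.0 - x_reduced.sum())
    return x_full @ sigma @ x_full

def numerical_hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of a scalar function f at x."""
    m = x.size
    E = np.eye(m)
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            H[i, j] = (f(x + h * E[i] + h * E[j]) - f(x + h * E[i] - h * E[j])
                       - f(x - h * E[i] + h * E[j]) + f(x - h * E[i] - h * E[j])) / (4 * h * h)
    return H

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
sigma = M @ M.T                      # hypothetical symmetric covariance matrix

# x_full = e_n + B x, so the Hessian of the reduced problem is 2 B^T sigma B.
B = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])
closed_form = 2.0 * B.T @ sigma @ B

f = lambda x: risk(x, sigma)
H_a = numerical_hessian(f, rng.standard_normal(n - 1))
H_b = numerical_hessian(f, rng.standard_normal(n - 1))
```

Both `H_a` and `H_b` agree with `closed_form` up to rounding, so the Lipschitz condition on the Hessian holds trivially for this problem.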
Example 1.2. On a CAD system it takes $n$ parameters $x_1, \dots, x_n$ to define the shape of a car. An engineer has a piece of software which takes the design parameters $x \in \mathbb{R}^n$ as input and computes the air resistance $f(x)$ of the corresponding fuselage as output. The software typically contains millions of lines of code, but for theoretical reasons it is known that $f \in C^2$. Using an automatic differentiation system, the engineer can automatically produce a piece of software that computes the directional derivatives
$$D_v f(x) = \frac{d}{dt} f(x + tv)\Big|_{t=0}, \qquad D_{u,v} f(x) = \frac{d^2}{ds\,dt} f(x + su + tv)\Big|_{s=t=0}.$$
How should the design parameters be chosen so as to minimise the drag on the fuselage?
Note that in this example the objective function is not available explicitly. This is typical for many applications. In fact, evaluating the objective function might even involve measurements in a physical experiment. Besides appearing as subproblems in constrained optimisation procedures, unconstrained optimisation problems also appear directly in many applications.
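To make the two directional derivatives of Example 1.2 concrete, here is a minimal sketch that treats the objective as a black box and approximates both quantities by central finite differences. This is a substitute for the automatic differentiation system mentioned in the example, not an implementation of it, and the function `drag` is a hypothetical stand-in for the engineer's software.

```python
import numpy as np

def directional_derivative(f, x, v, h=1e-6):
    """Approximate D_v f(x) = d/dt f(x + t v) at t = 0 by a central difference."""
    return (f(x + h * v) - f(x - h * v)) / (2 * h)

def second_directional_derivative(f, x, u, v, h=1e-4):
    """Approximate D_{u,v} f(x) = d^2/(ds dt) f(x + s u + t v) at s = t = 0."""
    return (f(x + h * u + h * v) - f(x + h * u - h * v)
            - f(x - h * u + h * v) + f(x - h * u - h * v)) / (4 * h * h)

def drag(x):
    """Hypothetical smooth stand-in for the black-box air resistance code."""
    return np.sum(x ** 2) + np.sin(x[0]) * x[1]

x = np.array([0.3, -1.2, 0.5])
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

first = directional_derivative(drag, x, v)             # exact value: 2*x[1] + sin(x[0])
second = second_directional_derivative(drag, x, u, v)  # exact value: cos(x[0])
```

Only function evaluations are used, which matches the situation in the example where no explicit formula for $f$ is available. Automatic differentiation would give the same quantities without the truncation error of finite differences.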
Theorem 2.1.
(i) If $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $x^* \in \mathbb{R}^n$ and has a local minimum there, then $\nabla f(x^*) = 0$, that is, $x^*$ is a stationary point of $f$. This is a first order necessary optimality condition, because it involves first derivatives, or the first order Taylor approximation of $f$.
(ii) If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $x^* \in \mathbb{R}^n$ and has a local minimum there, then the Hessian $D^2 f(x^*)$ is positive semidefinite, that is, $h^T D^2 f(x^*) h \ge 0$ for all $h \in \mathbb{R}^n$. This is a second order necessary optimality condition.
(iii) If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $x^* \in \mathbb{R}^n$, and if $\nabla f(x^*) = 0$ and $D^2 f(x^*)$ is positive definite, that is, if $h^T D^2 f(x^*) h > 0$ for all $h \in \mathbb{R}^n \setminus \{0\}$, then $x^*$ is a local minimiser of $f$. These are sufficient optimality conditions.
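Therefore,
$$\begin{aligned}
h^T D^2 f(x^*) h &= \Bigl\langle \sum_{i=1}^n \langle v_i, h \rangle v_i,\ \sum_{j=1}^n \langle v_j, h \rangle \sigma_j v_j \Bigr\rangle = \sum_{i,j} \langle v_i, h \rangle \langle v_j, h \rangle \sigma_j \langle v_i, v_j \rangle \\
&= \sum_{i=1}^n \langle v_i, h \rangle^2 \sigma_i \ \ge\ \sigma_n \sum_{i=1}^n \langle v_i, h \rangle^2 = \sigma_n \sum_{i=1}^n \bigl\langle \langle v_i, h \rangle v_i,\ \langle v_i, h \rangle v_i \bigr\rangle \\
&= \sigma_n \Bigl\langle \sum_{i=1}^n \langle v_i, h \rangle v_i,\ \sum_{j=1}^n \langle v_j, h \rangle v_j \Bigr\rangle = \sigma_n \|h\|^2. \qquad (2.3)
\end{aligned}$$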
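Let $c, \delta > 0$ be as in part (ii). Then (2.3) implies that for all $h$ such that $\|h\| < \min(\delta, \sigma_n / 2c)$ we have
$$\begin{aligned}
f(x^* + h) &= f(x^*) + \tfrac{1}{2} h^T D^2 f(x^*) h + O(\|h\|^3) \\
&\overset{(2.3)}{\ge} f(x^*) + \tfrac{1}{2} \|h\|^2 \sigma_n - c \|h\|^3 \\
&\ge f(x^*) + \tfrac{1}{2} \|h\|^2 \sigma_n - c\, \frac{\sigma_n}{2c} \|h\|^2 = f(x^*),
\end{aligned}$$
which shows that $x^*$ is a local minimiser of $f$.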
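The conditions of Theorem 2.1 translate directly into a numerical test at a candidate point: check that the gradient vanishes, then inspect the smallest eigenvalue of the Hessian. The following sketch is an illustration only. The tolerance `tol`, the labels returned and the test function $f(x, y) = x^3/3 - x + y^2$ are assumptions chosen for this example.

```python
import numpy as np

def classify_stationary_point(grad, hess, x, tol=1e-8):
    """Apply the first and second order conditions of Theorem 2.1 at x."""
    if np.linalg.norm(grad(x)) > tol:
        return "not stationary"          # condition (i) fails
    smallest = np.linalg.eigvalsh(hess(x)).min()
    if smallest > tol:
        return "local minimiser"         # sufficient conditions (iii) hold
    if smallest < -tol:
        return "not a local minimiser"   # necessary condition (ii) fails
    return "inconclusive"                # Hessian is only positive semidefinite

# Hypothetical test function f(x, y) = x^3/3 - x + y^2.
grad = lambda p: np.array([p[0] ** 2 - 1.0, 2.0 * p[1]])
hess = lambda p: np.array([[2.0 * p[0], 0.0], [0.0, 2.0]])

at_min = classify_stationary_point(grad, hess, np.array([1.0, 0.0]))
at_saddle = classify_stationary_point(grad, hess, np.array([-1.0, 0.0]))
at_origin = classify_stationary_point(grad, hess, np.array([0.0, 0.0]))

# For f(x, y) = x^4 + y^4 the origin is a minimiser, but the test cannot tell.
flat = classify_stationary_point(lambda p: 4.0 * p ** 3,
                                 lambda p: np.diag(12.0 * p ** 2),
                                 np.array([0.0, 0.0]))
```

The last case shows the gap between the necessary condition (ii) and the sufficient conditions (iii): a positive semidefinite Hessian with a zero eigenvalue decides nothing by itself.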
$$\nabla f(x) = 0$$
by an iterative procedure generating a sequence of points $(x_k)_{\mathbb{N}}$, if we can ensure that $f(x_k)$ decreases in each iteration,
$$f(x_{k+1}) \le f(x_k) \quad \forall\, k,$$
then in practice $(x_k)_{\mathbb{N}}$ can only converge to a local minimiser $x^*$ and
$$\|\nabla f(x_k)\| < \epsilon$$
can be used as a stopping criterion. Thus, solving unconstrained optimisation problems is closely related to the problem of solving simultaneous equations with the added feature that progress can be controlled by monitoring a naturally defined merit function (i.e., one asks "does $f$ decrease?"). Most competitive algorithms for unconstrained minimisation are based on this idea. There are two main families of such methods: line-search methods and trust region methods. We start with a description of the former.
Example 3.1 (Steepest descent without line searches). A simple method is defined as follows: starting from some $x_0 \in \mathbb{R}^n$, compute a sequence of intermediate solutions $(x_k)_{\mathbb{N}}$ via
$$x_{k+1} = x_k - \nabla f(x_k).$$
The method is motivated by the fact that $-\nabla f(x_k)$ is the direction in which $f$ decreases fastest when moving away from $x_k$. But is it a descent method? The first order Taylor approximation of $f$ shows that $f(x_k - \alpha \nabla f(x_k)) \le f(x_k)$ for small $\alpha > 0$. However, it is not necessarily the case that $f(x_{k+1}) \le f(x_k)$, as the full step $-\nabla f(x_k)$ can be too long. To make this a true descent method, we have to use line searches: in each iteration we have to find $\alpha_k > 0$ such that
$$f(x_k - \alpha_k \nabla f(x_k)) < f(x_k),$$
and then we can set
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$
A word of warning: although this method works in principle, it is too primitive to produce any good results in practice! We will later learn why. For now we set out to generalise this example.
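The difference between the full gradient step and a step with a line search is easy to see on a small quadratic. The sketch below is an illustration under assumptions: the objective $f(x) = \tfrac12 x^T A x$ with $A = \mathrm{diag}(1, 10)$, the starting point, and the simple step-halving rule are all choices made for this example and are not taken from the notes.

```python
import numpy as np

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def unit_step(x):
    """Steepest descent without a line search: x - grad f(x)."""
    return x - grad(x)

def halving_step(x, alpha=1.0, max_halvings=60):
    """Halve alpha until f(x - alpha * grad f(x)) < f(x), then take that step."""
    g = grad(x)
    for _ in range(max_halvings):
        candidate = x - alpha * g
        if f(candidate) < f(x):
            return candidate
        alpha *= 0.5
    return x  # no decrease found (for example when the gradient is zero)

x0 = np.array([1.0, 1.0])
f_start = f(x0)                        # 5.5
f_unit = f(unit_step(x0))              # 405.0, the full step overshoots
f_halved = f(halving_step(x0))         # accepted at alpha = 0.125
```

The full step increases the objective from 5.5 to 405, while the halved step decreases it, which is exactly the situation the example warns about.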
Algorithm 3.2 (Descent method).
S0 Choose a starting point $x_0 \in \mathbb{R}^n$ and a tolerance parameter $\epsilon > 0$. Set $k = 0$.
S1 If $\|\nabla f(x_k)\| \le \epsilon$ then stop and output $x_k$ as an approximate local minimiser.
S2 Otherwise choose a search direction $d_k \in \mathbb{R}^n$ such that $\langle \nabla f(x_k), d_k \rangle < 0$.
S3 Choose a step size $\alpha_k > 0$ such that $f(x_k + \alpha_k d_k) < f(x_k)$.
S4 Set $x_{k+1} := x_k + \alpha_k d_k$, replace $k$ by $k + 1$, and go back to S1.
Below we will see that the minimal assumption we need to make for this algorithm to work is f ∈ C^1 with Lipschitz continuous gradient. The generality of Algorithm 3.2 leaves flexibility both in the choice of the step length αk and the search direction dk. In the remainder of this lecture we discuss the step length selection and treat the choice of good search directions in the next few lectures.
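The steps S0 to S4 of Algorithm 3.2 can be sketched as a small driver in which the search direction and the step length are supplied as functions, mirroring the flexibility described above. Everything beyond the five steps is an assumption of this sketch: the function names, the iteration cap `max_iter`, the step-halving rule and the quadratic test problem.

```python
import numpy as np

def descent_method(f, grad, x0, choose_direction, choose_step, eps=1e-6, max_iter=10_000):
    """Generic descent method; returns the final iterate and the iteration count."""
    x = np.asarray(x0, dtype=float)                      # S0
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:                     # S1: stopping criterion
            return x, k
        d = choose_direction(x, g)                       # S2: need <grad f(x), d> < 0
        if g @ d >= 0:
            raise ValueError("search direction is not a descent direction")
        alpha = choose_step(f, x, d)                     # S3: need f(x + alpha d) < f(x)
        if not f(x + alpha * d) < f(x):
            raise ValueError("step size does not decrease f")
        x = x + alpha * d                                # S4
    return x, max_iter

def negative_gradient(x, g):
    """Steepest descent direction, one admissible choice in S2."""
    return -g

def halving_step(f, x, d, alpha=1.0, max_halvings=60):
    """Hypothetical step rule: halve alpha until f decreases."""
    fx = f(x)
    for _ in range(max_halvings):
        if f(x + alpha * d) < fx:
            return alpha
        alpha *= 0.5
    raise RuntimeError("no decreasing step found")

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x_min, iterations = descent_method(f, grad, [1.0, 1.0], negative_gradient, halving_step)
```

Keeping S2 and S3 as arguments means that better directions or a Wolfe line search can be swapped in without touching the loop. The simple decrease test used here is enough for this small example but carries no general convergence guarantee.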
3.1. Step Length Selection. The conceptually simplest way of choosing $\alpha_k$ is an exact line search, defined by
$$\alpha_k := \inf\{\alpha \ge 0 : \phi'(\alpha) = 0\},$$
where $\phi(\alpha) = f(x_k + \alpha d_k)$. That is to say, the point $x_k + \alpha_k d_k$ is the first stationary point of $f$ encountered along the half line $\{x_k + \alpha d_k : \alpha \ge 0\}$. Note that if $\phi'$ has no zero, as is the case for example when $\phi(\alpha) = -\ln \alpha$, then $\{\alpha \ge 0 : \phi'(\alpha) = 0\} = \emptyset$, and hence $\alpha_k := \inf \emptyset = +\infty$ corresponds to an infinitely long step, which is still sensible. Exact line searches are mainly a theoretical tool in the convergence analysis of algorithms. In practice, they are computationally too expensive. We will now derive step length computations that are equally good choices for the purposes of Algorithm 3.2 and much cheaper to compute.
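One of the few cases in which an exact line search is cheap is a strictly convex quadratic. For $f(x) = \tfrac12 x^T A x - b^T x$ with $A$ symmetric positive definite, $\phi'(\alpha) = \langle \nabla f(x_k), d_k \rangle + \alpha\, d_k^T A d_k$ has a single zero, so the exact step is available in closed form. The sketch below illustrates this special case only. The quadratic model, the random data and the function name are assumptions of the example.

```python
import numpy as np

def exact_step_quadratic(A, b, x, d):
    """Exact line search step for f(x) = 0.5 x^T A x - b^T x along d.

    phi'(alpha) = g^T d + alpha * d^T A d with g = A x - b, so the unique zero
    is alpha = -(g^T d) / (d^T A d).
    """
    g = A @ x - b
    return -(g @ d) / (d @ A @ d)

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)          # symmetric positive definite
b = rng.standard_normal(4)
x = rng.standard_normal(4)
d = -(A @ x - b)                 # steepest descent direction

f = lambda z: 0.5 * z @ A @ z - b @ z
alpha = exact_step_quadratic(A, b, x, d)
slope_at_alpha = (A @ (x + alpha * d) - b) @ d   # phi'(alpha), should vanish
```

For a general objective no such formula exists and each evaluation of $\phi$ or $\phi'$ costs a function or gradient call, which is why inexact conditions are used in practice.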
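Definition 3.3. We say that $\alpha_k$ satisfies the Wolfe conditions if
$$\phi(\alpha_k) \le \phi(0) + c_1 \alpha_k \phi'(0), \qquad (3.1)$$
$$\phi'(\alpha_k) \ge c_2 \phi'(0), \qquad (3.2)$$
where $0 < c_1 < c_2 < 1$ are fixed constants.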
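Algorithm 3.5 (Bisection method for step size).
S0 Choose $\alpha > 0$ and set $\alpha_{\text{low}} = \alpha_{\text{high}} = 0$.
S1 If $\alpha$ satisfies (3.1) (that is, if $\alpha$ is short enough) then go to S3.
S2 Else (if $\alpha$ does not satisfy (3.1)) make the replacements $\alpha_{\text{high}} \leftarrow \alpha$ and $\alpha \leftarrow (\alpha_{\text{low}} + \alpha_{\text{high}})/2$, and then go to S1.
S3 If $\alpha$ satisfies (3.2) (that is, $\alpha$ now satisfies both Wolfe conditions) output $\alpha_k = \alpha$ and stop.
S4 Otherwise (if $\alpha$ does not satisfy (3.2)), make the replacements $\alpha_{\text{low}} \leftarrow \alpha$ and
$$\alpha \leftarrow \begin{cases} 2 \alpha_{\text{low}} & \text{if } \alpha_{\text{high}} = 0, \\ \tfrac{1}{2} (\alpha_{\text{low}} + \alpha_{\text{high}}) & \text{if } \alpha_{\text{high}} > 0, \end{cases}$$
and then go back to S1.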
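Algorithm 3.5 can be sketched in a few lines once $\phi$ and $\phi'$ are available as functions. The version below follows the steps S0 to S4. The default constants `c1 = 1e-4` and `c2 = 0.9`, the iteration cap `max_iter` and the test function $\phi(\alpha) = (\alpha - 3)^2$ are assumptions made for illustration and are not prescribed by the notes.

```python
def wolfe_bisection(phi, dphi, alpha=1.0, c1=1e-4, c2=0.9, max_iter=100):
    """Bisection search for a step length satisfying the Wolfe conditions (3.1), (3.2)."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    if dphi0 >= 0:
        raise ValueError("phi'(0) must be negative (descent direction)")
    low, high = 0.0, 0.0                                   # S0
    for _ in range(max_iter):
        if phi(alpha) > phi0 + c1 * alpha * dphi0:         # S1/S2: (3.1) fails, step too long
            high = alpha
            alpha = 0.5 * (low + high)
        elif dphi(alpha) < c2 * dphi0:                     # S3/S4: (3.2) fails, step too short
            low = alpha
            alpha = 2.0 * low if high == 0.0 else 0.5 * (low + high)
        else:
            return alpha                                   # both Wolfe conditions hold
    raise RuntimeError("no Wolfe step found within max_iter iterations")

# Hypothetical one-dimensional example with phi'(0) = -6 < 0.
phi = lambda a: (a - 3.0) ** 2
dphi = lambda a: 2.0 * (a - 3.0)

from_short = wolfe_bisection(phi, dphi, alpha=0.01)   # grows by doubling
from_long = wolfe_bisection(phi, dphi, alpha=100.0)   # shrinks by bisection

def satisfies_wolfe(a, c1=1e-4, c2=0.9):
    return phi(a) <= phi(0.0) + c1 * a * dphi(0.0) and dphi(a) >= c2 * dphi(0.0)
```

Starting from a step that is too short, the search doubles $\alpha$ until an upper bracket is found. Starting from a step that is too long, it bisects between $\alpha_{\text{low}}$ and $\alpha_{\text{high}}$. The iteration cap is a practical safeguard added here.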
Proposition 3.6. Under the assumptions of Proposition 3.4, Algorithm 3.5 terminates in finite time and outputs a choice of $\alpha_k$ that satisfies both Wolfe conditions.
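Proof. Note that the two sets
$$W_1 := \{\alpha \ge 0 : \text{(3.1) holds}\}, \qquad W_2 := \{\alpha \ge 0 : \text{(3.2) holds}\}$$
are closed subsets of $\mathbb{R}_+$. Moreover,
$$\phi(\alpha) = \phi(0) + \int_0^{\alpha} \phi'(\tau)\, d\tau < \phi(0) + \int_0^{\alpha} c_1 \phi'(0)\, d\tau$$
for all $\alpha > 0$ sufficiently small, because $\phi'$ is continuous and $c_1 < 1$, showing that there exists $\delta_1 > 0$ such that $[0, \delta_1] \subset W_1$. Let $\alpha > 0$, $(\alpha_{\text{low}}^{[i]})_{\mathbb{N}} \subset W_1$ and $(\alpha_{\text{high}}^{[i]})_{\mathbb{N}} \subset W_1^c$ be such that
$$\alpha_{\text{low}}^{[i]} < \alpha \ \ \forall\, i \in \mathbb{N}, \qquad \alpha_{\text{low}}^{[i]} \xrightarrow{\,i \to \infty\,} \alpha,$$
$$\alpha_{\text{high}}^{[i]} > \alpha \ \ \forall\, i \in \mathbb{N}, \qquad \alpha_{\text{high}}^{[i]} \xrightarrow{\,i \to \infty\,} \alpha.$$
We claim that this implies $\alpha \in W_2^{\circ}$ (the topological interior of $W_2$). In fact, suppose to the contrary that $\alpha$ lies in the closure of $W_2^c$, and hence that $\phi'(\alpha) \le c_2 \phi'(0)$. Then there exists a value $\delta_2 > 0$ such that
$$\phi'(\alpha + \tau) < c_1 \phi'(0) \quad \forall\, \tau \in [0, \delta_2],$$
because $\phi'$ is continuous, $\phi'(0) < 0$ and $c_1 < c_2$. Since $W_1$ is closed and $\alpha_{\text{low}}^{[i]} \to \alpha$, we have $\alpha \in W_1$, and therefore
$$\phi(\alpha + \tau) = \phi(\alpha) + \int_{\alpha}^{\alpha + \tau} \phi'(\theta)\, d\theta \le \phi(0) + c_1 (\alpha + \tau) \phi'(0)$$
for all $\tau \in [0, \delta_2]$. Since $\alpha_{\text{high}}^{[i]}$ converges to $\alpha$ from the right, there exists an index $j$ large enough so that $\alpha_{\text{high}}^{[j]} \in [\alpha, \alpha + \delta_2]$, contradicting the assumption that $\alpha_{\text{high}}^{[j]} \in W_1^c$.
Therefore, it is indeed the case that $\alpha \in W_2^{\circ}$. Let us now start analysing the algorithm. Note that we only need to prove that the algorithm terminates in finite time, because the termination criterion is set such that if the algorithm terminates, then $\alpha_k$ satisfies both Wolfe conditions.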
3.2. Convergence of Descent Methods. It is now possible to give a fairly general convergence theorem for Algorithm 3.2 as long as the step lengths satisfy the Wolfe conditions. We prepare the proof through a lemma that gives a useful bound on the amount of decrease in the objective function that is achieved in every iteration:
Lemma 3.7. Let Algorithm 3.2 be applied to a $C^1$ function $f$ with $\Lambda$-Lipschitz continuous gradient and assume that the step length $\alpha_k$ satisfies the Wolfe conditions (3.1) and (3.2). Then
$$f(x_{k+1}) \le f(x_k) - \frac{c_1 (1 - c_2)}{\Lambda} \cos^2 \theta_k \, \|\nabla f(x_k)\|^2,$$
where $\theta_k$ is the angle between $d_k$ and $-\nabla f(x_k)$, and where $c_1, c_2$ are the constants from Definition 3.3.
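Proof. The second Wolfe condition implies
$$\langle \nabla f(x_k + \alpha_k d_k), d_k \rangle - \langle \nabla f(x_k), d_k \rangle = \phi'(\alpha_k) - \phi'(0) \ge (c_2 - 1) \phi'(0) = (1 - c_2) \bigl(-\langle \nabla f(x_k), d_k \rangle\bigr).$$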
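The bound of Lemma 3.7 can be checked numerically on a quadratic, where the Lipschitz constant of the gradient is the largest eigenvalue of the Hessian. The sketch below rests on several assumptions: the objective is $f(x) = \tfrac12 x^T A x$ with a random symmetric positive definite $A$, the constants are $c_1 = 10^{-4}$ and $c_2 = 0.9$, and the step is the exact minimiser along $d$, which satisfies both Wolfe conditions for a quadratic whenever $c_1 \le 1/2$.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                       # symmetric positive definite
lipschitz = np.linalg.eigvalsh(A).max()       # Lipschitz constant of grad f(x) = A x
c1, c2 = 1e-4, 0.9

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def check_lemma_bound(x, d):
    """Return (Wolfe conditions hold, decrease bound of Lemma 3.7 holds)."""
    g = grad(x)
    slope = g @ d                             # phi'(0), negative for a descent direction
    alpha = -slope / (d @ A @ d)              # exact minimiser of phi for a quadratic
    x_new = x + alpha * d
    wolfe_ok = bool(f(x_new) <= f(x) + c1 * alpha * slope
                    and grad(x_new) @ d >= c2 * slope)
    cos_theta = -slope / (np.linalg.norm(g) * np.linalg.norm(d))
    bound = f(x) - c1 * (1 - c2) / lipschitz * cos_theta ** 2 * (g @ g)
    return wolfe_ok, bool(f(x_new) <= bound)

results = []
for _ in range(100):
    x = rng.standard_normal(5)
    g = grad(x)
    d = -g + 0.5 * np.linalg.norm(g) * rng.standard_normal(5)   # perturbed direction
    if g @ d < 0:                                               # keep descent directions only
        results.append(check_lemma_bound(x, d))
```

Every recorded pair is `(True, True)`. The guaranteed decrease shrinks with $\cos^2 \theta_k$, so directions nearly orthogonal to the gradient guarantee little progress.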
Theorem 3.8 is valid under the assumption that the objective function is bounded below. It is interesting to note that when this is not the case, the algorithm fails to terminate in finite time but produces a sequence $(x_k)_{\mathbb{N}}$ such that $\lim_{k \to \infty} f(x_k) = -\infty$, which is a perfectly sensible and desirable behaviour under the circumstances.