Optimization Techniques, Lecture Notes- Physics - Prof IB Leader 3, Study notes of Physics

Multidimensional search methods: Hooke and Jeeves, gradient search methods, Rosenbrock's function, Newton-Raphson method, conjugate gradient method, Gauss-Newton method

Typology: Study notes

2010/2011

Uploaded on 09/07/2011

4M13 GTP

6. Multidimensional Search Methods

Multidimensional searches can be divided into three types:

  • Random Search : Step direction and length are chosen at random; inefficient but given time it should eventually find the answer — like monkeys trying to type Shakespeare.
  • Direct Search : No attempt is made to evaluate the local gradient.
  • Gradient Search : The local gradient (and possibly the Hessian) is evaluated or estimated.

6.1 Direct Search Methods

Direct search methods can be described as systematic trial and error methods. They are mostly used when

  • the analytical relationship between the control variables and $f(x)$ is not known, but $f(x)$ can be evaluated experimentally at individual points (e.g. in industrial processes);
  • $f(x)$ is known but gradient information cannot be obtained in closed form.

6.1.1 Univariant Search

The simplest (and crudest) direct search method is univariant search, in which line searches are performed by varying each variable in turn.

Figure 6.1: An Example of Univariant Search (axes: $x_1$, $x_2$).
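A minimal Python sketch of univariant search follows. It assumes a golden-section line search over a fixed bracket around the current value of each variable; the names `golden_section` and `univariant_search`, the bracket half-width and the quadratic test function are illustrative assumptions and do not come from the notes.

```python
import math

def golden_section(phi, a, b, tol=1e-8):
    """Minimize a 1-D function phi on [a, b], assuming it is unimodal there."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            # Minimum lies in [a, d]: shrink from the right.
            b, d = d, c
            c = b - ratio * (b - a)
        else:
            # Minimum lies in [c, b]: shrink from the left.
            a, c = c, d
            d = a + ratio * (b - a)
    return 0.5 * (a + b)

def univariant_search(f, x0, half_width=5.0, tol=1e-6, max_sweeps=200):
    """Line-search along each variable in turn until a full sweep moves x by less than tol."""
    x = list(x0)
    for _ in range(max_sweeps):
        x_prev = list(x)
        for i in range(len(x)):
            def along_axis(t, i=i):
                trial = list(x)
                trial[i] = t
                return f(trial)
            x[i] = golden_section(along_axis, x[i] - half_width, x[i] + half_width)
        if max(abs(new - old) for new, old in zip(x, x_prev)) < tol:
            break
    return x

def objective(x):
    # Assumed test function: a convex quadratic with minimum at (8/3, -10/3).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + x[0] * x[1]

x_min = univariant_search(objective, [0.0, 0.0])
print(x_min)
```

Because the cross term `x[0] * x[1]` couples the two variables, no single sweep reaches the minimum; several sweeps are needed, which is the inefficiency the text refers to.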

A more efficient method is that by Hooke and Jeeves.

6.1.2 Hooke and Jeeves

  1. Choose a starting point using common sense or the shotgun approach [1]. Evaluate the objective function $f(x)$. Choose a suitably-sized increment $\delta_i$ for each variable $x_1, x_2, \dots, x_n$.
  2. Record the current point as the base point.
  3. For each variable in turn:
    • Increase it by its increment, i.e. $x_i + \delta_i$.
    • If this reduces the value of the objective function, retain the change, otherwise reverse it, i.e. $(x_i - \delta_i)$, and re-evaluate. If this produces no improvement either, leave the variable at its original value $x_i$.
  4. If step 3 resulted in a change of position, go to step 5. Otherwise, stop if each increment is smaller than the required tolerance. If some are not, reduce them and go back to step 2.
  5. Record the current point as the new base point, and perform a pattern move by repeating the vector from the last base point. If this reduces the value of the objective function, retain it, otherwise discard it.
  6. Return to step 3.

Note that there are several variants on this algorithm, e.g. with repeated pattern moves.

[1] The shotgun approach consists of randomly sampling the search space a few times and choosing the combination which gives the lowest objective.
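The steps above can be sketched in Python as follows. This is one possible reading of the algorithm, not a reference implementation: the function names, the common initial increment `delta`, the halving factor `shrink` and the test function are all assumptions made for illustration.

```python
def explore(f, x, fx, deltas):
    """Step 3: for each variable, try +delta, then -delta, keeping any improvement."""
    x = list(x)
    for i, delta in enumerate(deltas):
        for step in (delta, -delta):
            trial = list(x)
            trial[i] = x[i] + step
            f_trial = f(trial)
            if f_trial < fx:
                x, fx = trial, f_trial
                break
    return x, fx

def hooke_jeeves(f, x0, delta=0.5, tol=1e-6, shrink=0.5, max_iter=10000):
    x = list(x0)
    fx = f(x)                              # step 1: evaluate the objective
    deltas = [delta] * len(x)              # step 1: one increment per variable
    base = list(x)                         # step 2: record the base point
    for _ in range(max_iter):
        x_new, f_new = explore(f, x, fx, deltas)          # step 3
        if f_new < fx:                                    # step 4: position changed
            # Step 5: pattern move repeats the vector from the last base point.
            pattern = [2.0 * xn - xb for xn, xb in zip(x_new, base)]
            base = list(x_new)
            x, fx = x_new, f_new
            f_pattern = f(pattern)
            if f_pattern < fx:
                x, fx = pattern, f_pattern                # retain the pattern move
        else:
            if max(deltas) < tol:                         # step 4: all increments small
                break
            deltas = [d * shrink for d in deltas]         # reduce increments
            base = list(x)                                # back to step 2
    return x, fx

def objective(x):
    # Assumed test function: a convex quadratic with minimum at (8/3, -10/3).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + x[0] * x[1]

x_best, f_best = hooke_jeeves(objective, [0.0, 0.0])
print(x_best, f_best)
```

Only objective values are compared, so no gradient information is needed, which is the defining property of a direct search method.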

6.2 Gradient Search Methods

These methods involve evaluating the local gradient to determine the search direction (not necessarily the gradient itself), and then doing a line search to determine the minimum value of the function along that line. At this new point, the local gradient is evaluated again, and the process is repeated.

6.2.1 Method of Steepest Descent

The Steepest Descent method was devised by Cauchy in 1847. It involves evaluating the gradient at each point and then doing a line search along the negative gradient direction to minimize the function. With an exact line search, each successive search direction is orthogonal to the previous one.

Although this appears sensible, it has, in fact, a slow rate of convergence (see below).

Figure 6.3: An Example of Steepest Descent Search (axes: $x_1$, $x_2$).

Method:

  1. Select a starting point $x_0$.
  2. Determine the search direction $d_k$, which is the negative of the gradient at $x_k$, i.e. $d_k = -\nabla f(x_k)$ (negative since descent).
  3. Determine the step size $\alpha_k$ along $d_k$ to minimize the function.
  4. Update the estimate of the minimum: $x_{k+1} = x_k + \alpha_k d_k$.
  5. Check for convergence. If not converged, go to step 2.

To determine the step size $\alpha_k$, either use a line search or start with the Taylor series of the function

$$f(x_k + \alpha_k d_k) = f(x_k) + \alpha_k \nabla f(x_k)^T d_k + \frac{\alpha_k^2}{2} d_k^T H(x_k) d_k + R. \tag{6.1}$$

Differentiating with respect to $\alpha_k$ and neglecting $R$ gives

$$\frac{\partial f}{\partial \alpha_k}(x_k + \alpha_k d_k) = \nabla f(x_k)^T d_k + \alpha_k d_k^T H(x_k) d_k, \tag{6.2}$$

so that the minimum (when this expression equals 0) is when

$$\alpha_k = -\frac{\nabla f(x_k)^T d_k}{d_k^T H(x_k) d_k} = \frac{d_k^T d_k}{d_k^T H(x_k) d_k}. \tag{6.3}$$

This value of $\alpha_k$ thus gives the distance to the point where the gradient of the function is perpendicular to the line of search.

It should be noted that if the Hessian $H$ is available, then a number of search methods which are superior to the Steepest Descent method can be used.

6.2.2 Convergence of the Steepest Descent Method

The method of steepest descent gives a sequence of objective values $f(x_k)$ which converges linearly to $f(x^*)$ (i.e. the order of convergence p is approximately one).

Since most methods have linear convergence, we must look at the convergence ratio β to compare them.

Linear convergence implies

$$\|x_{k+1} - x^*\| = \beta \|x_k - x^*\|. \tag{6.4}$$

In the case of the Steepest Descent method, the value of β is given by

$$\beta \le \left(\frac{A - a}{A + a}\right)^2, \tag{6.5}$$

where $A > 0$ and $a > 0$ are the largest and smallest eigenvalues of $H(x^*)$.

Thus, the convergence of the Steepest Descent method is best when the largest and smallest eigenvalues of $H(x^*)$ are close in value, since the bound on β depends only on their ratio $A/a$.

(See Luenberger pp. 219-220 for proof).
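As a hedged illustration of steps 1 to 5 and of the bound (6.5), the sketch below runs steepest descent with the step size (6.3) on an assumed quadratic $f(x) = \frac{1}{2} x^T H x$ whose Hessian has eigenvalues $A = 10$ and $a = 1$. The test problem, starting point and function names are assumptions. The bound is checked on the ratio of successive objective errors $f(x_k) - f(x^*)$, which is the form in which it is usually proved; here $f(x^*) = 0$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Assumed test problem: f(x) = 0.5 x^T H x, with minimum x* = 0 and f(x*) = 0.
H = [[10.0, 0.0], [0.0, 1.0]]
A, a = 10.0, 1.0          # largest and smallest eigenvalues of H

def f(x):
    return 0.5 * dot(x, matvec(H, x))

def grad(x):
    return matvec(H, x)

def steepest_descent(x0, tol=1e-10, max_iter=1000):
    x = list(x0)                                      # step 1
    values = [f(x)]
    for _ in range(max_iter):
        g = grad(x)
        if dot(g, g) ** 0.5 < tol:                    # step 5: convergence check
            break
        d = [-gi for gi in g]                         # step 2: d_k = -grad f(x_k)
        alpha = dot(d, d) / dot(d, matvec(H, d))      # step 3: step size from (6.3)
        x = [xi + alpha * di for xi, di in zip(x, d)] # step 4: update
        values.append(f(x))
    return x, values

x_min, values = steepest_descent([1.0, 10.0])
bound = ((A - a) / (A + a)) ** 2
ratios = [values[k + 1] / values[k] for k in range(len(values) - 1)]
print(f"bound from (6.5): {bound:.4f}, largest observed ratio: {max(ratios):.4f}")
```

Moving the eigenvalues closer together (for example `H = [[2.0, 0.0], [0.0, 1.0]]` with `A = 2.0`) shrinks the bound and the number of iterations needed.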

6.2.4 Example: Rosenbrock's Function, $f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2$

Figure 6.4: Solution Path of a Steepest Descent Algorithm on Rosenbrock's Function.

The linear segments of Figure 6.4 correspond to the step taken within a given iteration. Note that the algorithm would have failed in the vicinity of the point (-0.3, 0.1) but for the fact that the line search found (by chance) the second minimum along the search direction. Several hundred iterations were performed close to the new point without any perceptible change in the objective function.
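For readers who want to experiment with this example, here is a small Python sketch of Rosenbrock's function and its gradient, driven by a steepest descent loop. Note the substitution: a simple backtracking (sufficient-decrease) step is used in place of the line search behind Figure 6.4, so the path will not match the figure. The starting point (-1.2, 1), the iteration count and all function names are assumptions.

```python
def rosenbrock(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def rosenbrock_grad(x):
    return [-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
            200.0 * (x[1] - x[0] ** 2)]

def backtracking_step(f, x, d, g, alpha=1.0, c=1e-4, shrink=0.5):
    """Halve alpha until f decreases sufficiently along d; return x if no step is found."""
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(g, d))   # directional derivative, negative for descent
    while alpha > 1e-16:
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c * alpha * slope:
            return trial
        alpha *= shrink
    return x

def steepest_descent(f, grad, x0, iterations):
    x = list(x0)
    values = [f(x)]
    for _ in range(iterations):
        g = grad(x)
        d = [-gi for gi in g]                      # steepest descent direction
        x = backtracking_step(f, x, d, g)
        values.append(f(x))
    return x, values

x_end, values = steepest_descent(rosenbrock, rosenbrock_grad, [-1.2, 1.0], 200)
print("start f =", values[0], " f after 200 iterations =", values[-1])
print("point reached:", x_end)
```

Comparing the point reached with the known minimum at (1, 1) gives a feel for how slowly steepest descent progresses along the curved valley of this function.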

6.2.5 Newton’s Method

Direct search methods simply evaluate the function and make no use of the gradient information. The Steepest Descent method evaluates the local gradient to determine the search direction (a first-order method), but takes no account of the rate at which the gradient is changing. Newton’s method makes use of this second-order information.

Suppose that we want to minimize $f(x)$ and that, at a point $x_k$, it is possible to evaluate $f(x_k)$, $\nabla f(x_k)$ and $H(x_k)$.

We can then write the Taylor series of $f(x_{k+1})$ as

$$f(x_{k+1}) = f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2} (x_{k+1} - x_k)^T H(x_k) (x_{k+1} - x_k) + R \tag{6.6}$$

and approximate $f$ as a quadratic function by setting $R = 0$.

Differentiating with respect to $x_{k+1}$,

$$\nabla f(x_{k+1}) = \nabla f(x_k) + H(x_k)(x_{k+1} - x_k).$$

The minimum is reached at $x_{k+1}$ when $\nabla f(x_{k+1}) = 0$, i.e.

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)$$

or

$$x_{k+1} = x_k + \alpha_k d_k, \tag{6.8}$$

where

$$\alpha_k = 1 \quad \text{and} \quad d_k = -H(x_k)^{-1} \nabla f(x_k). \tag{6.9}$$

If H is positive definite and f is quadratic, only one iteration is, of course, required to reach the minimum from any starting point (note that the step length is unity). Therefore, we expect good convergence from Newton’s method when the function is closely approximated by a quadratic.

6.2.6 Convergence of Newton’s Method

For a general nonlinear function f, Newton’s method converges quadratically to $x^*$ (i.e. the order of convergence p is approximately two) provided $x_0$ is sufficiently close to $x^*$, $H(x^*)$ is positive definite and the step lengths $\{\alpha_k\}$ converge to unity (which is always the case for the basic Newton’s method). That is

$$\|x_{k+1} - x^*\| \le \beta \|x_k - x^*\|^2, \tag{6.10}$$

where β is the (constant) convergence ratio.

The local properties of Newton’s method make it a very attractive algorithm for unconstrained optimization. In fact, it is often regarded as the standard against which other algorithms are measured.

However, difficulties, even failure, may occur if the quadratic model is a poor approximation to f near the current point. The search direction $d_k$ found at each iteration is a descent direction whenever $H(x_k)$ is positive definite, but, because the step length $\alpha_k$ is unity, if f is not well modelled by a quadratic, it is possible that the search will badly overshoot the minimum. At the next iteration the search may overshoot again coming back, and, thus, Newton’s Method can end up oscillating indefinitely rather than converging.

6.2.7 Newton-Raphson Method

One common modification of Newton’s method is therefore to choose

$$d_k = -H(x_k)^{-1} \nabla f(x_k) \tag{6.11}$$

and

$$x_{k+1} = x_k + \alpha_k d_k, \tag{6.12}$$

where $\alpha_k$ is the step size which minimizes $f(x_{k+1})$. $\alpha_k$ can be computed using a line search.
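The difference between the unit step of (6.9) and the line-search step of (6.11) and (6.12) can be seen in a short Python sketch on Rosenbrock's function. Several things here are assumptions: the starting point (-1.2, 1), the function names, the use of Cramer's rule for the 2x2 solve, and above all the line search, which is a simple backtracking (sufficient-decrease) rule rather than an exact minimization of $f$ along $d_k$.

```python
def f(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad(x):
    return [-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
            200.0 * (x[1] - x[0] ** 2)]

def hess(x):
    return [[1200.0 * x[0] ** 2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
            [-400.0 * x[0], 200.0]]

def newton_direction(x):
    """d_k = -H(x_k)^{-1} grad f(x_k), solved by Cramer's rule for the 2x2 case."""
    g, H = grad(x), hess(x)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(-g[0] * H[1][1] + H[0][1] * g[1]) / det,
            (-H[0][0] * g[1] + H[1][0] * g[0]) / det]

def minimize(x0, use_line_search, tol=1e-8, max_iter=200):
    x = list(x0)
    values = [f(x)]
    for _ in range(max_iter):
        g = grad(x)
        if (g[0] ** 2 + g[1] ** 2) ** 0.5 < tol:
            break
        d = newton_direction(x)
        alpha = 1.0                                   # basic Newton: alpha_k = 1
        if use_line_search:                           # Newton-Raphson: backtrack on alpha_k
            slope = g[0] * d[0] + g[1] * d[1]
            while alpha > 1e-12:
                trial = [x[0] + alpha * d[0], x[1] + alpha * d[1]]
                if f(trial) <= values[-1] + 1e-4 * alpha * slope:
                    break
                alpha *= 0.5
        x = [x[0] + alpha * d[0], x[1] + alpha * d[1]]
        values.append(f(x))
    return x, values

x_newton, values_newton = minimize([-1.2, 1.0], use_line_search=False)
x_nr, values_nr = minimize([-1.2, 1.0], use_line_search=True)
print("Newton:         ", x_newton, "largest f seen:", max(values_newton))
print("Newton-Raphson: ", x_nr, "largest f seen:", max(values_nr))
```

In this run the unit-step iteration overshoots on its second step, so that $f$ rises well above its starting value before the method recovers, whereas the line-search variant reduces $f$ at every iteration. Both reach the minimum at (1, 1) from this starting point.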

6.2.9 Conjugate Gradient Method

In the Steepest Descent method, consecutive steps are orthogonal. In the Conjugate Gradient method, consecutive steps follow conjugate directions, i.e. $d_i^T H d_j = 0 \;\; \forall i \ne j$, where H is the Hessian.

If the objective function is quadratic, the solution will be found after a number of iterations equal to the number of variables in the function. For higher-order functions, more steps may be needed since the method is essentially fitting a quadratic to the function at each stage.

The first step is the same as for the Steepest Descent method, but subsequent search directions are different: they ‘remember’ a bit of the previous direction.

This means that the directions tend to cut diagonally through the orthogonal steepest descent directions. Thus, they improve considerably the rate of convergence.

The search direction for $k \ge 0$ is defined as

$$d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k, \tag{6.14}$$

where

$$\beta_k = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}, \tag{6.15}$$

with step size $\alpha_k$ chosen to minimize $f(x_{k+1})$ in the search direction (as in the Steepest Descent method), i.e.

$$\alpha_k = -\frac{\nabla f(x_k)^T d_k}{d_k^T H(x_k) d_k}. \tag{6.16}$$

Conjugate Gradient Method:

  1. Choose $x_0$, and compute $d_0 = -\nabla f(x_0)$, $H(x_0)$ and $\alpha_0 = \frac{d_0^T d_0}{d_0^T H(x_0) d_0}$.
  2. Determine $x_{k+1} = x_k + \alpha_k d_k$, $\nabla f(x_{k+1})$ and $H(x_{k+1})$.
  3. Evaluate $\beta_k = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}$, $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$ and $\alpha_{k+1} = -\frac{d_{k+1}^T \nabla f(x_{k+1})}{d_{k+1}^T H(x_{k+1}) d_{k+1}}$.
  4. Go to step 2, until the algorithm has converged.

Figure 6.6: Two Variable Conjugate Gradient Search Pattern (axes: $x_1$, $x_2$; the figure shows fragments of the $f$ contours, the first search direction $d_0 = -\nabla f(x_0)$ with step $\alpha_0 d_0$ from $x_0$ to $x_1$, and the second search direction $d_1 = -\nabla f(x_1) + \beta_0 d_0$).
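The four steps above can be sketched in Python for the quadratic case, where the Hessian is a constant matrix. The test problem $f(x) = \frac{1}{2} x^T Q x - b^T x$ with three variables, the matrix `Q`, the vector `b` and the function names are assumptions chosen for illustration. The sketch checks the two claims made in the text: the gradient vanishes after n = 3 iterations, and the search directions are mutually conjugate.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Assumed quadratic test problem: f(x) = 0.5 x^T Q x - b^T x, so grad f = Q x - b and H = Q.
Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]      # symmetric, positive definite
b = [1.0, 2.0, 3.0]

def grad(x):
    return [qi - bi for qi, bi in zip(matvec(Q, x), b)]

def conjugate_gradient(x0, n_steps):
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]                              # step 1: d_0 = -grad f(x_0)
    directions = []
    for _ in range(n_steps):
        alpha = -dot(g, d) / dot(d, matvec(Q, d))      # step size, (6.16)
        x = [xi + alpha * di for xi, di in zip(x, d)]  # step 2: x_{k+1}
        g_next = grad(x)
        beta = dot(g_next, g_next) / dot(g, g)         # step 3: (6.15)
        directions.append(d)
        d = [-gn + beta * di for gn, di in zip(g_next, d)]   # (6.14)
        g = g_next
    return x, g, directions

x_min, g_final, directions = conjugate_gradient([0.0, 0.0, 0.0], n_steps=3)
print("x after 3 iterations:", x_min)
print("gradient norm:", dot(g_final, g_final) ** 0.5)
for i in range(3):
    for j in range(i + 1, 3):
        print(f"d_{i}^T Q d_{j} =", dot(directions[i], matvec(Q, directions[j])))
```

Only matrix-vector products with `Q` appear; no linear system is solved and no matrix is inverted.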

It is quite common for implementations of the Conjugate Gradient method to restart periodically (in particular every n iterations, where n is the number of control variables), i.e. to follow the steepest descent direction and then successively compute new conjugate directions. The rationale for this strategy is as follows:

Because the Conjugate Gradient method will converge after n iterations if the objective function is quadratic (and therefore the Hessian is constant), if the method has not converged after n iterations, the objective function is not quadratic and the Hessian is not constant. If the Hessian is not constant, then successive search directions are only approximately conjugate (see the next section for more details). By restarting periodically, memory of former search directions (which are no longer conjugate) is effectively erased, and new search directions reflecting more accurately the behaviour of the objective function around the current search location can be determined.

Conjugate Gradient Method with Restarts:

  1. Choose $x_0$, and compute $d_0 = -\nabla f(x_0)$, $H(x_0)$ and $\alpha_0 = \frac{d_0^T d_0}{d_0^T H(x_0) d_0}$.
  2. Determine $x_{k+1} = x_k + \alpha_k d_k$, $\nabla f(x_{k+1})$ and $H(x_{k+1})$.

For a minimum at $x_{k+1}$, set $\nabla f(x_{k+1}) = 0$, so

$$Q(x_{k+1} - x_k) = -\nabla f(x_k). \tag{6.22}$$

Now, using the above definitions and the fact that $(x_{k+1} - x_k)$ can be written as a series of linearly independent vectors which are conjugate in Q, we can write

$$d_i^T Q (x_{k+1} - x_k) = d_i^T Q (\alpha_0 d_0 + \dots + \alpha_k d_k) = d_i^T Q \alpha_i d_i \tag{6.23}$$

and thus

$$\alpha_i = \frac{d_i^T Q (x_{k+1} - x_k)}{d_i^T Q d_i} = -\frac{d_i^T \nabla f(x_k)}{d_i^T Q d_i}. \tag{6.24}$$

To derive the expression for $\beta_k$, we start from the definition of the search direction

$$d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k. \tag{6.25}$$

Since by definition the d’s are conjugate in Q, then

$$d_{k+1}^T Q d_k = 0. \tag{6.26}$$

Therefore

$$[-\nabla f(x_{k+1}) + \beta_k d_k]^T Q d_k = 0 \tag{6.27}$$

or

$$\nabla f(x_{k+1})^T Q d_k = \beta_k d_k^T Q d_k. \tag{6.28}$$

Thus

$$\beta_k = \frac{\nabla f(x_{k+1})^T Q d_k}{d_k^T Q d_k}. \tag{6.29}$$

Through a complicated inductive proof (see Luenberger pp. 244-246), it can be shown that $\beta_k$ can also be written as

$$\beta_k = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}, \tag{6.30}$$

which is simpler to compute.

This derivation relies on the assumption that Q is constant, i.e. the objective function is quadratic. If this is not the case, then successive search directions $d_k$ will not be truly conjugate.
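The equivalence of (6.29) and (6.30) for a quadratic objective can be checked numerically. The sketch below is only a spot check on one assumed problem, $f(x) = \frac{1}{2} x^T Q x - b^T x$ with an illustrative `Q` and `b`; it is not a substitute for the inductive proof cited above, and the variable names are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Assumed quadratic: f(x) = 0.5 x^T Q x - b^T x, so grad f(x) = Q x - b and Q is constant.
Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

def grad(x):
    return [qi - bi for qi, bi in zip(matvec(Q, x), b)]

x = [0.0, 0.0, 0.0]
g = grad(x)
d = [-gi for gi in g]
beta_pairs = []
for k in range(2):                       # the first two iterations, where the gradients are nonzero
    Qd = matvec(Q, d)
    alpha = -dot(g, d) / dot(d, Qd)      # exact step size along d_k
    x = [xi + alpha * di for xi, di in zip(x, d)]
    g_next = grad(x)
    beta_conjugacy = dot(g_next, Qd) / dot(d, Qd)        # (6.29)
    beta_gradients = dot(g_next, g_next) / dot(g, g)     # (6.30)
    beta_pairs.append((beta_conjugacy, beta_gradients))
    d = [-gn + beta_gradients * di for gn, di in zip(g_next, d)]   # (6.25)
    g = g_next

for k, (from_629, from_630) in enumerate(beta_pairs):
    print(f"k = {k}: (6.29) gives {from_629:.12f}, (6.30) gives {from_630:.12f}")
```

The form (6.30) needs only gradient norms, whereas (6.29) needs a product with `Q`, which is why the text calls it simpler to compute.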

6.2.11 Convergence of the Conjugate Gradient Method

The conjugate direction is nothing but a deflected steepest descent direction, which improves substantially the rate of convergence.

As for the Steepest Descent method, the Conjugate Gradient method converges linearly, i.e.

$$\|x_{k+1} - x^*\| = \beta \|x_k - x^*\|, \tag{6.31}$$

with, in theory, convergence ratios near zero (i.e. $\beta \approx 0$, called superlinear convergence).

In practice, however, numerical errors in the computation tend to make $\beta > 0$. Nevertheless, this method generally yields faster convergence than the Steepest Descent method.

The major advantage of the Conjugate Gradient method over Newton methods is that the former does not require the Hessian to be inverted.

Figure 6.7: An Example of Conjugate Gradient Search (axes: $x_1$, $x_2$).

Example: Conjugate Gradient (continued)

6.2.14 Gradient Search Methods Summary

Steepest Descent
  Advantages:
  • Only need to evaluate or estimate the gradient
  Disadvantages:
  • Order of convergence is only linear
  • Convergence can be very slow if convergence ratio is high

Newton’s Method
  Advantages:
  • Quadratic convergence
  Disadvantages:
  • Can oscillate indefinitely if objective function is not well modelled by a quadratic
  • Need to compute and invert Hessian

Newton-Raphson Method
  Advantages:
  • Quadratic convergence
  • More reliable convergence than Newton’s Method
  • Tailored to find minima
  Disadvantages:
  • Need to compute and invert Hessian

Conjugate Gradient Method
  Advantages:
  • Faster than Steepest Descent method
  • No need to invert Hessian
  Disadvantages:
  • Need to compute Hessian
  • Order of convergence is only linear, though convergence ratio is low

6.3 Nonlinear Least Squares

This class of methods is used for fitting hypothesized models to data.

For example, let $y(t_i)$, $i = 1, 2, \dots, m$ be the measured data and $\phi(t, x)$ be the model, where the independent parameters $x^T = [x_1, \dots, x_n]$ are to be manipulated in order to adjust the model to the data.

(We are assuming that the number of data points m is much larger than the number of model parameters n.)

The residual is given by

$$r_i(x) = \phi(t_i, x) - y(t_i), \tag{6.32}$$

so that the squared error is

$$f(x) = \sum_{i=1}^{m} r_i^2(x) = r(x)^T r(x), \tag{6.33}$$

where $r(x)^T = [r_1(x), \dots, r_m(x)]$.

The nonlinear least-squares problem is to minimize $f(x)$.

Although $f(x)$ can be minimized by a general unconstrained method, in most circumstances it is worthwhile to use methods designed specifically for least-squares problems. In particular, it is useful to make use of the special structure of the gradient and Hessian matrix of $f(x)$.

The gradient of $f(x)$ can be expressed as

$$\nabla f(x) = \sum_{i=1}^{m} 2 r_i(x) \nabla r_i(x) = 2 J(x)^T r(x), \tag{6.34}$$

where $J(x)$ is the Jacobian matrix of $r(x)$:

$$J(x) = \begin{bmatrix} \nabla r_1(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix} = \begin{bmatrix} \dfrac{\partial r_1}{\partial x_1} & \cdots & \dfrac{\partial r_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial r_m}{\partial x_1} & \cdots & \dfrac{\partial r_m}{\partial x_n} \end{bmatrix}$$

The Hessian matrix of $f(x)$ is given by

$$H(x) = \nabla(\nabla f(x)) = 2 J(x)^T J(x) + 2 \sum_{i=1}^{m} r_i(x) R_i(x), \tag{6.36}$$

where $R_i(x)$ is the Hessian of $r_i(x)$, i.e. $R_i(x) = \nabla(\nabla r_i(x))$.
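The structure of (6.34) and (6.36) can be illustrated with a small Python sketch. Everything specific in it is assumed for the purpose of the example: the exponential model $\phi(t, x) = x_1 e^{x_2 t}$, the four data points, the trial parameter values and the function names. The analytic gradient and Hessian are compared with central finite differences as a sanity check.

```python
import math

# Hypothetical data and model, chosen only to illustrate the formulas.
t_data = [0.0, 1.0, 2.0, 3.0]
y_data = [2.0, 2.7, 3.6, 4.9]

def model(t, x):
    return x[0] * math.exp(x[1] * t)

def residuals(x):                                  # (6.32)
    return [model(t, x) - y for t, y in zip(t_data, y_data)]

def objective(x):                                  # (6.33)
    return sum(r * r for r in residuals(x))

def jacobian(x):                                   # row i holds grad r_i(x)^T
    return [[math.exp(x[1] * t), x[0] * t * math.exp(x[1] * t)] for t in t_data]

def gradient(x):                                   # (6.34): 2 J^T r
    r, J = residuals(x), jacobian(x)
    return [2.0 * sum(J[i][j] * r[i] for i in range(len(r))) for j in range(2)]

def hessian(x):                                    # (6.36): 2 J^T J + 2 sum r_i R_i
    r, J = residuals(x), jacobian(x)
    m = len(r)
    first_order = [[2.0 * sum(J[i][p] * J[i][q] for i in range(m)) for q in range(2)]
                   for p in range(2)]
    full = [row[:] for row in first_order]
    for i, t in enumerate(t_data):
        e = math.exp(x[1] * t)
        R_i = [[0.0, t * e], [t * e, x[0] * t * t * e]]   # Hessian of r_i
        for p in range(2):
            for q in range(2):
                full[p][q] += 2.0 * r[i] * R_i[p][q]
    return full, first_order

def central_difference(func, x, j, h=1e-6):
    """Derivative of func (scalar or list valued) with respect to x[j]."""
    hi, lo = list(x), list(x)
    hi[j] += h
    lo[j] -= h
    f_hi, f_lo = func(hi), func(lo)
    if isinstance(f_hi, list):
        return [(a - c) / (2.0 * h) for a, c in zip(f_hi, f_lo)]
    return (f_hi - f_lo) / (2.0 * h)

x_trial = [1.5, 0.4]
g = gradient(x_trial)
H_full, H_first_order = hessian(x_trial)
g_fd = [central_difference(objective, x_trial, j) for j in range(2)]
H_fd_columns = [central_difference(gradient, x_trial, j) for j in range(2)]
print("gradient:", g, "finite difference:", g_fd)
print("full Hessian:", H_full)
print("2 J^T J only:", H_first_order)
```

The term `2 J^T J` needs only first derivatives of the residuals, which is the special structure that least-squares methods exploit; the remaining term is weighted by the residuals themselves and so shrinks when the model fits the data well.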