










Multidimensional Search Methods: Hooke and Jeeves, gradient search methods, Rosenbrock's function, Newton-Raphson method, conjugate gradient method, Gauss-Newton method
Typology: Study notes











Multidimensional searches can be divided into three types:
Direct search methods can be described as systematic trial and error methods. They are mostly used when
6.1.1 Univariant Search
The simplest (and crudest) direct search method is univariant search, in which line searches are performed by varying each variable in turn.
Figure 6.1: An Example of Univariant Search. (Figure not reproduced: a plot in the $x_1$, $x_2$ plane labelled with $f(x)$.)
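The univariant search described above can be sketched in a few lines. This is only an illustration under stated assumptions: the golden-section line search, the search interval `span`, the number of sweeps and the test function `f` are all choices made here, not taken from the notes.

```python
import math

def golden_section(phi, lo, hi, tol=1e-10):
    """Minimize a one-dimensional function phi on [lo, hi] (assumed unimodal there)."""
    r = (math.sqrt(5.0) - 1.0) / 2.0          # 1 / golden ratio, about 0.618
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - r * (b - a)
        d = a + r * (b - a)
        if phi(c) < phi(d):
            b = d                              # minimum lies in [a, d]
        else:
            a = c                              # minimum lies in [c, b]
    return 0.5 * (a + b)

def univariant_search(f, x0, sweeps=50, span=10.0):
    """Vary each variable in turn, doing a line search on that variable alone."""
    x = list(x0)
    for _ in range(sweeps):
        for i in range(len(x)):
            def phi(t, i=i):
                trial = list(x)
                trial[i] = x[i] + t
                return f(trial)
            x[i] += golden_section(phi, -span, span)
    return x

def f(x):
    # Hypothetical convex quadratic with a cross term, so the axes are not aligned
    # with the contours and several sweeps are needed.
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2 + x[0] * x[1]

x_min = univariant_search(f, [0.0, 0.0])
print(x_min)
```

For this particular `f` the stationary point is $(16/7, -18/7)$, which the sweeps approach by moving parallel to one axis at a time.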
A more efficient method is that by Hooke and Jeeves.
6.1.2 Hooke and Jeeves
Note that there are several variants on this algorithm, e.g. with repeated pattern moves etc.
Notation: the objective is $f(x)$ with variables $x_1, x_2, \dots, x_n$; each variable $x_i$ is perturbed to $x_i + \delta_i$ and to $x_i - \delta_i$.
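Since several variants of the algorithm exist, the following is a sketch of one common textbook form (exploratory moves of $\pm\delta_i$ followed by repeated pattern moves), not necessarily the variant intended in these notes. The function names, the step-halving rule and the test function are assumptions made for illustration.

```python
def explore(f, start, f_start, delta):
    """Exploratory move: try x_i + delta_i, then x_i - delta_i, keeping any improvement."""
    x, fx = list(start), f_start
    for i in range(len(x)):
        for step in (delta[i], -delta[i]):
            trial = list(x)
            trial[i] += step
            ft = f(trial)
            if ft < fx:
                x, fx = trial, ft
                break
    return x, fx

def hooke_jeeves(f, x0, delta0=0.5, shrink=0.5, tol=1e-6, max_iter=10000):
    """Pattern search; assumes f is bounded below so the pattern moves terminate."""
    base = list(x0)
    fbase = f(base)
    delta = [delta0] * len(base)
    for _ in range(max_iter):
        if max(delta) < tol:
            break
        x, fx = explore(f, base, fbase, delta)
        if fx >= fbase:
            delta = [d * shrink for d in delta]   # no improvement: reduce the steps
            continue
        # Pattern moves: keep jumping along (new point - old base) while it pays off.
        while fx < fbase:
            pattern = [2.0 * xn - xo for xn, xo in zip(x, base)]
            base, fbase = x, fx
            x, fx = explore(f, pattern, f(pattern), delta)
    return base, fbase

def f(x):
    # Hypothetical test function (convex quadratic with a cross term).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2 + x[0] * x[1]

x_best, f_best = hooke_jeeves(f, [0.0, 0.0])
print(x_best, f_best)
```

Only function values are used; no gradient is evaluated anywhere, which is what makes this a direct search method.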
These methods involve evaluating the local gradient to determine the search direction (not necessarily the gradient itself), and then doing a line search to determine the minimum value of the function along that line. At this new point, the local gradient is evaluated again, and the process is repeated.
6.2.1 Method of Steepest Descent
The Steepest Descent method was devised by Cauchy in 1847. It involves evaluating the gradient at each point and then doing a line search along the negative gradient direction to minimize the function. With an exact line search, each successive search direction is orthogonal to the previous one.
Although this appears sensible, it has, in fact, a slow rate of convergence (see below).
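Figure 6.3: An Example of Steepest Descent Search. (Figure not reproduced; axes $x_1$ and $x_2$.)
Method: starting from an initial point $x_0$, at each iterate $x_k$ take the search direction $d_k = -\nabla f(x_k)$ and move by $\alpha_k d_k$.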
To determine the step size $\alpha_k$ in the update $x_{k+1} = x_k + \alpha_k d_k$, either use a line search or start with the Taylor series of the function

$$f(x_k + \alpha_k d_k) = f(x_k) + \alpha_k \nabla f(x_k)^T d_k + \frac{\alpha_k^2}{2}\, d_k^T H(x_k)\, d_k + R. \tag{6.1}$$

Differentiating with respect to $\alpha_k$ and neglecting $R$ gives

$$\frac{\partial f}{\partial \alpha_k}(x_k + \alpha_k d_k) = \nabla f(x_k)^T d_k + \alpha_k\, d_k^T H(x_k)\, d_k, \tag{6.2}$$

so that the minimum (when this expression equals 0) is when

$$\alpha_k = -\frac{\nabla f(x_k)^T d_k}{d_k^T H(x_k)\, d_k} = \frac{d_k^T d_k}{d_k^T H(x_k)\, d_k}.$$

This value of $\alpha_k$ thus gives the distance to the point where the gradient of the function is perpendicular to the line of search.
It should be noted that if the Hessian H is available, then a number of search methods which are superior to the Steepest Descent method can be used.
6.2.2 Convergence of the Steepest Descent Method
The method of steepest descent gives a sequence of objective values $\{f(x_k)\}$ which converges linearly to $f(x^*)$ (i.e. the order of convergence p is approximately one).
Since most methods have linear convergence, we must look at the convergence ratio β to compare them.
Linear convergence implies

$$\|x_{k+1} - x^*\| = \beta\, \|x_k - x^*\|. \tag{6.4}$$

In the case of the Steepest Descent method, the value of β is given by

$$\beta \le \left(\frac{A - a}{A + a}\right)^2, \tag{6.5}$$

where $A > 0$ and $a > 0$ are the largest and smallest eigenvalues of $H(x^*)$.
Thus, the convergence of the Steepest Descent method is best when the largest and smallest eigenvalues of $H(x^*)$ are close in value; the bound (6.5) depends only on the ratio $A/a$.
(See Luenberger pp. 219-220 for proof).
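As a small numerical check of the step-size formula and the convergence ratio, the sketch below runs steepest descent on a quadratic with a known Hessian. The matrix `Q`, the starting point and the iteration count are illustrative assumptions, and the ratio is measured on the objective values $f(x_k) - f(x^*)$, the quantity for which the bound is proved in Luenberger.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(Q, v):
    return [dot(row, v) for row in Q]

# Hypothetical quadratic f(x) = 0.5 x^T Q x with Hessian Q (eigenvalues a = 1, A = 10).
# Its minimum is x* = 0 with f(x*) = 0.
Q = [[1.0, 0.0], [0.0, 10.0]]

def f(x):
    return 0.5 * dot(x, matvec(Q, x))

def grad(x):
    return matvec(Q, x)

x = [10.0, 1.0]                      # a deliberately unfavourable starting point
ratios = []
for k in range(20):
    d = [-g for g in grad(x)]        # d_k = -grad f(x_k)
    alpha = dot(d, d) / dot(d, matvec(Q, d))   # exact step for a quadratic
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    ratios.append(f(x_new) / f(x))   # observed reduction of f(x_k) - f(x*)
    x = x_new

bound = ((10.0 - 1.0) / (10.0 + 1.0)) ** 2
print(max(ratios), bound)
```

For this starting point every step reduces the objective by the same factor, which sits right at the bound; starting points closer to an eigenvector of `Q` do much better.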
6.2.4 Example: Rosenbrock’s Function, $f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2$
Figure 6.4: Solution Path of a Steepest Descent Algorithm on Rosenbrock’s Function.
The linear segments of Figure 6.4 correspond to the step taken within a given iteration. Note that the algorithm would have failed in the vicinity of the point (-0.3, 0.1) but for the fact that the line search found (by chance) the second minimum along the search direction. Several hundred iterations were performed close to the new point without any perceptible change in the objective function.
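For readers who want to reproduce this kind of experiment, here is a minimal sketch of Rosenbrock's function with its gradient and a plain steepest descent loop. The starting point $(-1.2, 1)$, the iteration count and the backtracking line search are assumptions made here (the notes use an exact line search), so the path will not match Figure 6.4 in detail.

```python
def rosenbrock(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def rosenbrock_grad(x):
    return [-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
            200.0 * (x[1] - x[0] ** 2)]

def steepest_descent(f, grad, x0, iters):
    x = list(x0)
    history = [f(x)]
    for _ in range(iters):
        g = grad(x)
        slope = -sum(gi * gi for gi in g)       # grad^T d for d = -grad
        if slope == 0.0:
            break
        alpha = 1.0
        # Backtracking line search: halve alpha until f decreases sufficiently.
        while f([xi - alpha * gi for xi, gi in zip(x, g)]) > f(x) + 1e-4 * alpha * slope:
            alpha *= 0.5
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        history.append(f(x))
    return x, history

x_end, history = steepest_descent(rosenbrock, rosenbrock_grad, [-1.2, 1.0], 1000)
print(history[0], history[-1], x_end)
```

Printing a few entries of `history` shows the pattern described above: a quick initial drop followed by many iterations with very little change in the objective.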
6.2.5 Newton’s Method
Direct search methods simply evaluate the function and make no use of the gradient information. The Steepest Descent method evaluates the local gradient to determine the search direction (a first-order method), but takes no account of the rate at which the gradient is changing. Newton’s method makes use of this second-order information.
Suppose that we want to minimize $f(x)$ and that, at a point $x_k$, it is possible to evaluate $f(x_k)$, $\nabla f(x_k)$ and $H(x_k)$.
We can then write the Taylor series of $f(x_{k+1})$ as

$$f(x_{k+1}) = f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2}(x_{k+1} - x_k)^T H(x_k)(x_{k+1} - x_k) + R \tag{6.6}$$

and approximate $f$ as a quadratic function by setting $R = 0$.
Differentiating with respect to $x_{k+1}$,

$$\nabla f(x_{k+1}) = \nabla f(x_k) + H(x_k)(x_{k+1} - x_k).$$
The minimum is reached at $x_{k+1}$ when $\nabla f(x_{k+1}) = 0$, i.e.

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)$$

or

$$x_{k+1} = x_k + \alpha_k d_k, \tag{6.8}$$

where

$$\alpha_k = 1 \quad \text{and} \quad d_k = -H(x_k)^{-1} \nabla f(x_k). \tag{6.9}$$

If H is positive definite and f is quadratic, only one iteration is, of course, required to reach the minimum from any starting point (note that the step length is unity). Therefore, we expect good convergence from Newton’s method when the function is closely approximated by a quadratic.
6.2.6 Convergence of Newton’s Method
For a general nonlinear function f, Newton’s method converges quadratically to $x^*$ (i.e. the order of convergence p is approximately two) provided $x_0$ is sufficiently close to $x^*$, $H(x^*)$ is positive definite and the step lengths $\{\alpha_k\}$ converge to unity (which is always the case for the basic Newton’s method). That is

$$\|x_{k+1} - x^*\| \le \beta\, \|x_k - x^*\|^2, \tag{6.10}$$

where β is the (constant) convergence ratio.
The local properties of Newton’s method make it a very attractive algorithm for unconstrained optimization. In fact, it is often regarded as the standard against which other algorithms are measured.
However, difficulties, even failure, may occur if the quadratic model is a poor approximation to f near the current point. The search direction found at each iteration is a descent direction whenever $H(x_k)$ is positive definite, but, because the step length is unity, if f is not well modelled by a quadratic, it is possible that the search will badly overshoot the minimum. At the next iteration the search may overshoot again coming back, and, thus, Newton’s Method can end up oscillating indefinitely rather than converging.
6.2.7 Newton-Raphson Method
One common modification of Newton’s method is therefore to choose

$$d_k = -H(x_k)^{-1} \nabla f(x_k) \tag{6.11}$$

and

$$x_{k+1} = x_k + \alpha_k d_k, \tag{6.12}$$

where $\alpha_k$ is the step size which minimizes $f(x_{k+1})$. $\alpha_k$ can be computed using a line search.
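The difference between the unit step and the line-search variant can be seen in a small two-variable sketch. Everything concrete here is an assumption made for illustration: the 2x2 solver, the two test functions, the starting points, and the use of a backtracking rule in place of an exact minimization over $\alpha_k$.

```python
import math

def solve2(H, g):
    """Solve the 2x2 linear system H p = g by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - g[0] * H[1][0]) / det]

def newton(f, grad, hess, x0, iters, line_search):
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        if math.hypot(g[0], g[1]) < 1e-12:
            break
        d = [-p for p in solve2(hess(x), g)]     # d_k = -H(x_k)^{-1} grad f(x_k)
        alpha = 1.0                               # basic Newton: unit step
        if line_search:
            # Backtracking stand-in for the exact minimization over alpha_k.
            slope = g[0] * d[0] + g[1] * d[1]
            while f([x[0] + alpha * d[0], x[1] + alpha * d[1]]) > f(x) + 1e-4 * alpha * slope:
                alpha *= 0.5
        x = [x[0] + alpha * d[0], x[1] + alpha * d[1]]
    return x

# (a) Quadratic objective: a single unit step reaches the minimum.
def fq(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2 + x[0] * x[1]

def gq(x):
    return [2.0 * (x[0] - 1.0) + x[1], 4.0 * (x[1] + 2.0) + x[0]]

def hq(x):
    return [[2.0, 1.0], [1.0, 4.0]]

x_quad = newton(fq, gq, hq, [0.0, 0.0], 1, line_search=False)

# (b) Non-quadratic objective sqrt(1 + x1^2) + sqrt(1 + x2^2), minimum at the origin.
def fs(x):
    return math.sqrt(1.0 + x[0] ** 2) + math.sqrt(1.0 + x[1] ** 2)

def gs(x):
    return [xi / math.sqrt(1.0 + xi ** 2) for xi in x]

def hs(x):
    return [[(1.0 + x[0] ** 2) ** -1.5, 0.0], [0.0, (1.0 + x[1] ** 2) ** -1.5]]

x_unit = newton(fs, gs, hs, [2.0, -1.5], 4, line_search=False)     # overshoots more each step
x_damped = newton(fs, gs, hs, [2.0, -1.5], 50, line_search=True)   # settles at the origin
print(x_quad, x_unit, x_damped)
```

In case (b) the Hessian is positive definite everywhere, so the Newton direction is always a descent direction; the unit step nevertheless maps each coordinate $x_i$ to $-x_i^3$, which alternates in sign and grows when $|x_i| > 1$.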
6.2.9 Conjugate Gradient Method
In the Steepest Descent method, consecutive steps are orthogonal. In the Conjugate Gradient method, consecutive steps follow conjugate directions, i.e. $d_i^T H d_j = 0 \;\; \forall\, i \neq j$, where H is the Hessian.
If the objective function is quadratic, the solution will be found after a number of iterations equal to the number of variables in the function. For higher-order functions, more steps may be needed since the method is essentially fitting a quadratic to the function at each stage.
The first step is the same as for the Steepest Descent method, but subsequent search directions are different: they ‘remember’ a bit of the previous direction.
This means that the directions tend to cut diagonally through the orthogonal steepest descent directions. Thus, they considerably improve the rate of convergence.
The search direction for $k \ge 0$ is defined as

$$d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k, \tag{6.14}$$

where

$$\beta_k = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}, \tag{6.15}$$

with step size $\alpha_k$ chosen to minimize $f(x_{k+1})$ in the search direction (as in the Steepest Descent method), i.e.

$$\alpha_k = -\frac{\nabla f(x_k)^T d_k}{d_k^T H(x_k)\, d_k}. \tag{6.16}$$

Conjugate Gradient Method:
1. Choose $x_0$; compute $d_0 = -\nabla f(x_0)$ and $H(x_0)$, and set $\alpha_0 = \dfrac{d_0^T d_0}{d_0^T H(x_0)\, d_0}$.
2. Set $x_{k+1} = x_k + \alpha_k d_k$; compute $\nabla f(x_{k+1})$ and $H(x_{k+1})$.
3. Compute $\beta_k = \dfrac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}$.
4. Set $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$.
5. Compute $\alpha_{k+1} = -\dfrac{d_{k+1}^T \nabla f(x_{k+1})}{d_{k+1}^T H(x_{k+1})\, d_{k+1}}$, then return to step 2.
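The steps above translate almost line for line into code when the objective is quadratic. The sketch below assumes an objective of the form $f(x) = \tfrac{1}{2} x^T Q x - b^T x$ (so the gradient is $Qx - b$ and the Hessian is the constant matrix $Q$); the particular `Q`, `b` and starting point are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(Q, v):
    return [dot(row, v) for row in Q]

def conjugate_gradient(Q, b, x0):
    """Minimize 0.5 x^T Q x - b^T x for symmetric positive definite Q."""
    x = list(x0)
    g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]    # grad f(x_0)
    d = [-gi for gi in g]                               # first step: steepest descent
    for _ in range(len(b)):                             # n iterations for n variables
        if dot(g, g) < 1e-30:
            break
        alpha = -dot(g, d) / dot(d, matvec(Q, d))       # step size along d_k
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = [qx - bi for qx, bi in zip(matvec(Q, x), b)]
        beta = dot(g_new, g_new) / dot(g, g)            # ratio of squared gradient norms
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x, g

Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_min, g_min = conjugate_gradient(Q, b, [0.0, 0.0, 0.0])
print(x_min, dot(g_min, g_min) ** 0.5)
```

With three variables the loop runs three times, after which the gradient is zero up to rounding error and `x_min` solves $Qx = b$.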
Figure 6.6: Two Variable Conjugate Gradient Search Pattern. (Figure not reproduced. In the $x_1$, $x_2$ plane it shows fragments of the contours of $f$; the first search direction $d_0 = -\nabla f(x_0)$ and the step $\alpha_0 d_0$ from $x_0$ to $x_1$; and the second search direction $d_1 = -\nabla f(x_1) + \beta_0 d_0$.)
It is quite common for implementations of the Conjugate Gradient method to restart periodically (in particular every n iterations, where n is the number of control variables), i.e. to follow the steepest descent direction and then successively compute new conjugate directions. The rationale for this strategy is as follows:
Because the Conjugate Gradient method will converge after n iterations if the objective function is quadratic (and therefore the Hessian is constant), if the method has not converged after n iterations, the objective function is not quadratic and the Hessian is not constant. If the Hessian is not constant, then successive search directions are only approximately conjugate (see the next section for more details). By restarting periodically, memory of former search directions (which are no longer conjugate) is effectively erased, and new search directions reflecting more accurately the behaviour of the objective function around the current search location can be determined.
Conjugate Gradient Method with Restarts:
1. Choose $x_0$; compute $d_0 = -\nabla f(x_0)$ and $H(x_0)$, and set $\alpha_0 = \dfrac{d_0^T d_0}{d_0^T H(x_0)\, d_0}$.
2. Set $x_{k+1} = x_k + \alpha_k d_k$; compute $\nabla f(x_{k+1})$ and $H(x_{k+1})$.
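A restart scheme of this kind might look as follows on a non-quadratic objective. This is a sketch under several assumptions that are not in the notes: the test function (a quadratic plus a quartic term, so the Hessian varies with $x$), the restart period of n iterations, an early restart whenever the new direction fails to point downhill, and a step-halving safeguard on $\alpha_k$.

```python
import math

A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def f(x):
    # Hypothetical objective: 0.5 x^T A x - b^T x + 0.25 * sum(x_i^4)
    return 0.5 * dot(x, [dot(row, x) for row in A]) - dot(b, x) + 0.25 * sum(xi ** 4 for xi in x)

def grad(x):
    return [dot(A[i], x) - b[i] + x[i] ** 3 for i in range(2)]

def hess(x):
    return [[A[i][j] + (3.0 * x[i] ** 2 if i == j else 0.0) for j in range(2)] for i in range(2)]

def cg_with_restarts(x0, tol=1e-6, max_iter=200):
    n = len(x0)
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    for k in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            return x, k
        Hd = [dot(row, d) for row in hess(x)]
        alpha = -dot(g, d) / dot(d, Hd)          # step size from the local quadratic model
        # Added safeguard: halve alpha until f actually decreases sufficiently.
        while f([xi + alpha * di for xi, di in zip(x, d)]) > f(x) + 1e-4 * alpha * dot(g, d):
            alpha *= 0.5
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = grad(x)
        if (k + 1) % n == 0:
            d = [-gi for gi in g_new]            # periodic restart: steepest descent step
        else:
            beta = dot(g_new, g_new) / dot(g, g)
            d = [-gi + beta * di for gi, di in zip(g_new, d)]
            if dot(g_new, d) >= 0.0:             # not a descent direction: restart early
                d = [-gi for gi in g_new]
        g = g_new
    return x, max_iter

x_res, n_iter = cg_with_restarts([0.0, 0.0])
print(x_res, n_iter)
```

Because the Hessian changes from point to point, the old directions gradually lose conjugacy; resetting to the steepest descent direction every n steps discards them.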
For a minimum at $x_{k+1}$, set $\nabla f(x_{k+1}) = 0$, so

$$Q(x_{k+1} - x_k) = -\nabla f(x_k). \tag{6.22}$$

Now, using the above definitions and the fact that $(x_{k+1} - x_k)$ can be written as a series of linearly independent vectors which are conjugate in Q, we can write

$$d_i^T Q (x_{k+1} - x_k) = d_i^T Q (\alpha_0 d_0 + \dots + \alpha_k d_k) = d_i^T Q\, \alpha_i d_i \tag{6.23}$$

and thus

$$\alpha_i = \frac{d_i^T Q (x_{k+1} - x_k)}{d_i^T Q d_i} = -\frac{d_i^T \nabla f(x_k)}{d_i^T Q d_i}. \tag{6.24}$$

To derive the expression for $\beta_k$, we start from the definition of the search direction

$$d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k. \tag{6.25}$$

Since by definition the d’s are conjugate in Q, then

$$d_{k+1}^T Q d_k = 0. \tag{6.26}$$

Therefore

$$[-\nabla f(x_{k+1}) + \beta_k d_k]^T Q d_k = 0 \tag{6.27}$$

or

$$\nabla f(x_{k+1})^T Q d_k = \beta_k\, d_k^T Q d_k. \tag{6.28}$$

Thus

$$\beta_k = \frac{\nabla f(x_{k+1})^T Q d_k}{d_k^T Q d_k}. \tag{6.29}$$

Through a complicated inductive proof (see Luenberger pp. 244-246), it can be shown that $\beta_k$ can also be written as

$$\beta_k = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}, \tag{6.30}$$

which is simpler to compute.
This derivation relies on the assumption that Q is constant, i.e. the objective function is quadratic. If this is not the case, then successive search directions will not be truly conjugate.
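The last link between the two expressions for $\beta_k$ is short once the orthogonality properties are taken as given. The sketch below assumes a quadratic objective with constant Hessian $Q$ and exact line searches, writes $g_k$ as shorthand (introduced here) for $\nabla f(x_k)$, and takes the three relations on the second line from the inductive proof rather than proving them.

```latex
% Assumptions: constant Hessian Q, exact line searches, g_k := \nabla f(x_k).
\begin{align*}
Q d_k &= \frac{1}{\alpha_k}\, Q (x_{k+1} - x_k) = \frac{1}{\alpha_k}\,(g_{k+1} - g_k), \\
g_{k+1}^T d_k &= 0, \qquad g_{k+1}^T g_k = 0, \qquad g_k^T d_k = -\,g_k^T g_k, \\
\beta_k &= \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k}
         = \frac{g_{k+1}^T (g_{k+1} - g_k)}{d_k^T (g_{k+1} - g_k)}
         = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}.
\end{align*}
```

The factor $1/\alpha_k$ cancels between numerator and denominator, which is why the final ratio needs only gradients and no Hessian.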
6.2.11 Convergence of the Conjugate Gradient Method
The conjugate direction is nothing but a deflected steepest descent direction, which improves substantially the rate of convergence.
As for the Steepest Descent method, the Conjugate Gradient method converges linearly, i.e.

$$\|x_{k+1} - x^*\| = \beta\, \|x_k - x^*\|, \tag{6.31}$$

with, in theory, convergence ratios near zero (i.e. $\beta \approx 0$, called superlinear convergence).
In practice, however, numerical errors in the computation tend to make $\beta > 0$. Nevertheless, this method always yields faster convergence than the Steepest Descent method.
The major advantage of the Conjugate Gradient method over Newton methods is that the former does not require the Hessian to be inverted.
Figure 6.7: An Example of Conjugate Gradient Search. (Figure not reproduced; axes $x_1$ and $x_2$.)
Example: Conjugate Gradient (continued)
6.2.14 Gradient Search Methods Summary
Method | Advantages | Disadvantages
Steepest Descent | • Only need to evaluate or estimate the gradient |
Conjugate Gradient Method | |
This class of methods is used for fitting hypothesized models to data.
For example, let $y(t_i)$, $i = 1, 2, \dots, m$ be the measured data and $\phi(t, x)$ be the model, where the independent parameters $x^T = [x_1, \dots, x_n]$ are to be manipulated in order to adjust the model to the data.
(We are assuming that the number of data points m is much larger than the number of model parameters n.)
The residual is given by

$$r_i(x) = \phi(t_i, x) - y(t_i), \tag{6.32}$$

so that the squared error is

$$f(x) = \sum_{i=1}^{m} r_i^2(x) = r(x)^T r(x), \tag{6.33}$$

where $r(x)^T = [r_1(x), \dots, r_m(x)]$.
The nonlinear least-squares problem is to minimize $f(x)$.
Although $f(x)$ can be minimized by a general unconstrained method, in most circumstances it is worthwhile to use methods designed specifically for least-squares problems. In particular, it is useful to make use of the special structure of the gradient and Hessian matrix of $f(x)$.
The gradient of $f(x)$ can be expressed as

$$\nabla f(x) = \sum_{i=1}^{m} 2\, r_i(x)\, \nabla r_i(x) = 2\, J(x)^T r(x), \tag{6.34}$$

where $J(x)$ is the Jacobian matrix of $r(x)$:

$$J(x) = \begin{bmatrix} \nabla r_1(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix} = \begin{bmatrix} \dfrac{\partial r_1}{\partial x_1} & \cdots & \dfrac{\partial r_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial r_m}{\partial x_1} & \cdots & \dfrac{\partial r_m}{\partial x_n} \end{bmatrix}.$$

The Hessian matrix $H(x)$ of $f(x)$ is given by

$$H(x) = \nabla(\nabla f(x)) = 2\, J(x)^T J(x) + 2 \sum_{i=1}^{m} r_i(x)\, R_i(x), \tag{6.36}$$

where $R_i(x)$ is the Hessian of $r_i(x)$, i.e. $R_i(x) = \nabla(\nabla r_i(x))$.
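To make the structure of the gradient and Hessian concrete, the sketch below evaluates them for a small made-up problem. The model $\phi(t, x) = x_1 e^{x_2 t}$, the data values and the evaluation point are all assumptions chosen for illustration; only the formulas for $\nabla f$ and $H$ come from the text.

```python
import math

t_data = [0.0, 1.0, 2.0, 3.0]
y_data = [2.0, 1.2, 0.8, 0.5]          # made-up measurements y(t_i)

def residuals(x):
    """r_i(x) = phi(t_i, x) - y(t_i) for the assumed model phi(t, x) = x1 * exp(x2 * t)."""
    return [x[0] * math.exp(x[1] * t) - y for t, y in zip(t_data, y_data)]

def jacobian(x):
    """Row i holds grad r_i(x)^T = [d r_i / d x1, d r_i / d x2]."""
    return [[math.exp(x[1] * t), x[0] * t * math.exp(x[1] * t)] for t in t_data]

def f(x):
    return sum(r * r for r in residuals(x))

def gradient(x):
    """grad f = 2 J^T r."""
    r, J = residuals(x), jacobian(x)
    return [2.0 * sum(J[i][j] * r[i] for i in range(len(r))) for j in range(2)]

def hessian(x):
    """H = 2 J^T J + 2 sum_i r_i R_i, with R_i the Hessian of r_i."""
    r, J = residuals(x), jacobian(x)
    H = [[2.0 * sum(J[i][p] * J[i][q] for i in range(len(r))) for q in range(2)]
         for p in range(2)]
    for i, t in enumerate(t_data):
        e = math.exp(x[1] * t)
        R = [[0.0, t * e], [t * e, x[0] * t * t * e]]
        for p in range(2):
            for q in range(2):
                H[p][q] += 2.0 * r[i] * R[p][q]
    return H

x_trial = [1.5, -0.3]
print(f(x_trial), gradient(x_trial), hessian(x_trial))
```

The first term of the Hessian needs only the Jacobian, which is already available from the gradient computation; the second term is the only part that requires second derivatives of the residuals.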