Download Nonlinear Optimization: Amoeba, Conjugate Gradient, Quasi-Newton, Newton-Raphson and more Slides Advanced Algorithms in PDF only on Docsity!
1.204 Lecture 22
Unconstrained nonlinear optimization:
AmoebaAmoeba
BFGS
Linear programming: Glpk
Multiple optimum values
From Press
Heuristics to deal with multiple optima:
- Start at many initial points. Choose best of optima found.
- Find local optimum. Take a step away from it and search again.
- Simulated annealing takes ‘random’ steps repeatedly
X 1 X 2
A
B
C
D
E
F
G
X Y
Z
Figure by MIT OpenCourseWare.
t t t t t t
Nonlinear optimization
- Unconstrained nonlinear optimization algorithms
generalllly use ththe same strategy as unconstt t t rainedi d
- Select a descent direction
- Use a one dimensional line search to set step size
- Step, and iterate until convergence
- Constrained optimization used the constraints to
limit the maximum step size
- UUnconstraiinedd optiimiizatiion must sellect maxiimum step size
- Step size is problem-specific and must be tuned
- Memory requirements are rarely a problem
- Convergence, accuracy and speed are the issues
Family of nonlinear algorithms
- Amoeba (Nelder-Mead) method
- Solves nonlinear optimization problem directlySolves nonlinear optimization problem directly
- Requires no derivatives or line search
- Adapts its step size based on change in function value
- Conjugate gradient and quasi-Newton methods
- Require function, first derivatives and line search*
- Line search step size adapts as algorithm proceeds
- NewtonNewton -Raphson method (last lecture)Raphson method (last lecture)
- Used to solve nonlinear optimization problems by solving set of first order conditions
- Uses step size dx that makes f(x+dx)= 0. Little control.
- ‘Globally convergent’ Newton variant has smaller step size
- Needs first and second derivatives (and function)
Amoeba steps
- Simplex is volume defined by n+1 points in n
dimensions
- In 3In 3 -D, it is a tetrahedron or pyramidD it is a tetrahedron or pyramid
- Select a starting guess at point P 0
- Set other points of simplex as Pi= P 0 + ∆ei
- Take a series of steps
- Most steps move point of simplex where function is highest (“highest point”): “reflection”
- Conserve volume of simplexConserve volume of simplex - > avoid degeneracy> avoid degeneracy
- Where function is flat, method expands simplex to take larger steps: “expansion and reflection”
- When it reaches a “valley floor”, simplex “contracts” itself in transverse direction and tries to ooze down the valley
- If trying to pass through “eye of needle” it “contracts itself in all directions” around its lowest (best) point From Press et al
Amoeba steps
From Press et al
Simplex at beginning of step
High (^) Low
Reflection Reflection and expansion
Contraction Multiple contraction
Possible outcomes for a step in the downhill simplex method
(a) (b)
(c) (d)
Figure by MIT OpenCourseWare.
t t
Amoeba pseudocode: minimization
- Start at initial guess
- Determine which point is highest by looping over simplex points and evaluating function at each
- If difference between highest and lowest is small, return
- Otherwise ooze (iterate):
- Reflect by factor= -1 through face of simplex from high point
- If this is better than low point, reflect/expand by factor of 2
- If this is worse than second highest, contract by 2 in this directionIf this is worse than second highest, contract by 2 in this direction
- If this is worst point, contract in all directions around lowest point
- Select new face based on new high point and reflect again
- Terminate if difference between highest and lowest points is small
- Or terminate at maximum iterations allowed
Amoeba termination criteria
- All nonlinear optimization algorithm termination
criteria are difficult
- TTermiinate when step si h ize iis small (ll (~10 10 -8^8 with dith doublbles))
- Change in function value is small (~10-14^ with doubles)
- Termination can occur at a local minimum
- Restart amoeba at a minimum it found
- Reset initial simplex using Pi= P 0 + ∆ei ,
- Don’t use simplex that existed at termination
- • Code is in download:Code is in download:
public Amoeba(double ftol) public double[] minimize(double[] point, double del, MathFunction3 mf3) // And other optional methods
- This is very easy; code ran first time on demand model
- Experiment with starting point and del, but it’s usually easy
Quasi-Newton methods
Split problem into two minimization problems, each with a “quadratic envelope” You must understand your problem well enough to do this
Newton method
Full Newton step
If the minimum region is narrow or twisty, it is hard to get down into it. Full Newton steps give little control. (Amoeba just munches along)
Quasi-Newton methods
Quasi Newton steps BFGS takes a fraction of the Newton step, so that f(x) decreases at some minimum rate proportional to the average decrease
High order derivatives
We may be tempted to use higher derivatives to fit our nonlinear functions. More terms in the Taylor series expansion mean a higher degree polynomial fit, but using higher order derivatives is difficult because they are “stiff”
This is the challenge of nonlinear methods: use only low order derivatives to explore complex surfaces. There aren’t many general ways to do this. Solutions are problem-specific but use BFGS or other general algorithm: Understand problem regions; have good starting point; understand surface
High order derivatives
We may be tempted to use higher derivatives to fit our nonlinear functions. More terms in the Taylor series expansion mean a higher degree polynomial fit, but using higher order derivatives is difficult because they are “stiff”
This is the challenge of nonlinear methods: use only low degree derivatives to explore complex surfaces. There aren’t many general ways to do this. Solutions are problem-specific but use BFGS or other general algorithm: Understand problem regions; have good starting point; understand surface
BFGS
x x
xx
f
x
x
f
ij
i i j
f(x) f(P)
TakeapointPasorigin.ApproximatefusingTaylorseries:
2
i
b f
c f P
where
c b x x A x
x xx
P
i i^ j i j
i^2 ,
[ ]
f A x b
xx
f
A
b f
P i j
ij
P
AistheHessianmatrix.Gradientoffis :
2
BFGS, p.
lim
BFGSconstructsasequenceofmatricesH such that:
1
i
i H^ i =^ A
−
∞
Tofindaminimum,findazeroofthegradientoff(x):
f(x) f(x) ( ) ( )
Atpoint x,usingaTaylorseriesagain
i
i
i i
i i i i
i i
f x f x A x x
x x f x x x A x x
−>∞
Newton'smethodsets f(x) 0 tofindthenextpoint:
1
x − xi =− A ⋅∇ f x i
−
BFGS, p.
- Newton’s method, far from the minimum, canNewton s method, far from the minimum, can
project us to points x where f(x) is greater than
the current value
- Large steps based on a quadratic approximation will not necessarily lead to improvements in ill-behaved function
- BFGS does a line search along the Newton direction
- It finds a point at which the objective has decreased
- This point is used to update the Hessian (which is a matrix of second derivatives)
- The Hessian is based on the previous Hessian and a set of correction terms based on ▼f , x and xi
- See Numerical Recipes for BFGS updating formula
BFGS lnsrch() pseudocode
- Takes point, function and gradient value at point,
and direction ((line)) to search alongg as inpput
- Finds new point with lower function value
- Finding minimum along line requires too much computation
- Finding any point whose f is less than current isn’t good enough either; we may take too many steps
- We find a point where the average decrease from the current point is a fraction of the gradient projectioncurrent point is a fraction of the gradient projection
- We require a minimum step size so we don’t stall
- We use a quadratic interpolation of f along the line to choose the first new point
- We then use a cubic interpolation, now that we have 3 points ( f(0), f(1) and f(m)) for remaining points
BFGS code
- First code set
- BFGS: constructor, dfppmin()(), lnsrch()()
- MathFunction4: interface with func(), df() [gradient]
- DemandModelBFGS: Almost same as in Newton
- DemandModelBFGSTest: Almost same as in Newton
- Second code set
- Same BFGS, MathFunction4, DemandModelBFGSTest
- DemandModelBFGS df() method computes gradient numerically, without analytic expressions for derivatives
- Both BFGS implementations converge quickly on
same logit model coefficients as Newton
- Code is subtle, especially interaction between dfpmin() and lnsrch().
BFGSTest
public class DemandModelBFGSTest { public static void main(String[] args) { double[][] x= { {1, 52.9 - 4.4}, {1,{ , 4.1 - 28.5},}, //// Etc. };}; double[] y= {0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1}; // Almost identical to DemandModel from last lecture // Implements MathFunction4, with func() and df() DemandModelBFGS d= new DemandModelBFGS(x, y); double[] beta= {0, 0}; // Initial guess double log0= d.func2(beta); doubledouble gtol=gtol 1E-14;1E 14; BFGS b= new BFGS(d, gtol); double logB= b.dfpmin(beta); double[][] jacobian= d.jacobian(beta); d.print(log0, logB, beta, jacobian); } } // Output identical to last lecture example. 10 iterations
Nonlinear method performance
Method Iterations Function evaluations
Newton 6 6n^2 = 24
BFGS 10 or 11 10n= 20
Amoeba 116 116
This example is illustrative only, based on our logit test case. n is the number of variables ((coefficients in our test case)). As n increases, amoeba performance is relatively more competitive
Amoeba and BFGS have better convergence and precision than Newton
Linear programming
From GLPK manual
© Andrew Makhorin. All rights reserved.
This content is excluded from our Creative Commons license.
For more information, see http://ocw.mit.edu/fairuse
Linear programming standard form
Formulate constraints as a set of equations and a set of bounded variables.
This is easy to translate to the simplex tableau used for the computations.
© Andrew Makhorin. All rights reserved.
This content is excluded from our Creative Commons license.
For more information, see http://ocw.mit.edu/fairuse
LP program, p.
import org.gnu.glpk.*; // Glpk starts numbering at 1, not 0 public class LP { publicpublic staticstatic voidvoid main(String[]main(String[] args)args) {{ int[] ia= new int[1+1000]; // Row index of coefficient int[] ja= new int[1+1000]; // Col index of coefficient double[] ar= new double[1+1000]; // Coefficient double z, x1, x2, x3; // Obj value, unknowns GlpkSolver solver = new GlpkSolver(); solver.setProbName("sample"); solver.setObjDir(GlpkSolver.LPX_MAX); // Maximization sollver.addRddRows(3);(3) solver.setRowName(1, "p"); solver.setRowBnds(1, GlpkSolver.LPX_UP, 0.0, 100.0); solver.setRowName(2, "q"); solver.setRowBnds(2, GlpkSolver.LPX_UP, 0.0, 600.0); solver.setRowName(3, "r"); solver.setRowBnds(3, GlpkSolver.LPX_UP, 0.0, 300.0);
LP program, p.
solver.addCols(3); solver.setColName(1, "x1"); solver.setColBnds(1, GlpkSolver.LPX_LO, 0.0, 0.0); solversolver .setObjCoef(1,setObjCoef(1 10 10.0); 0); solver.setColName(2, "x2"); solver.setColBnds(2, GlpkSolver.LPX_LO, 0.0, 0.0); solver.setObjCoef(2, 6.0); solver.setColName(3, "x3"); solver.setColBnds(3, GlpkSolver.LPX_LO, 0.0, 0.0); solver.setObjCoef(3, 4.0); ia[1] = 1; ja[1] = 1; ar[1] = 1.0; /* a[1,1] = 1 / ia[2]ia[2] = 1;1; ja[2]ja[2] = 2;2; ar[2] =ar[2] 1 1.0; / 0; /* a[1a[1,2] = 2] 11 // ia[3] = 1; ja[3] = 3; ar[3] = 1.0; /* a[1,3] = 1 / ia[4] = 2; ja[4] = 1; ar[4] = 10.0; / a[2,1] = 10 / ia[5] = 2; ja[5] = 2; ar[5] = 4.0; / a[2,2] = 4 / ia[6] = 2; ja[6] = 3; ar[6] = 5.0; / a[2,3] = 5 / ia[7] = 3; ja[7] = 1; ar[7] = 2.0; / a[3,1] = 2 / ia[8] = 3; ja[8] = 2; ar[8] = 2.0; / a[3,2] = 2 / ia[9] = 3; ja[9] = 3; ar[9] = 6.0; / a[3,3] = 6 */