Optimization Methods and Algorithms | Study notes Algorithms and Programming

I ♥ Grid Search

LPs

Standard Form:

$\min\ c^T x$ s.t. $Ax = b$, $x \ge 0$, $b \ge 0$.

Getting it to standard form:

Getting rid of ≥,≤:

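$x_1 \le 4 \to x_1 + x_2 = 4$, $x_2 \ge 0$ (add a slack variable $x_2$; for $\ge$, subtract a surplus variable instead).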

Getting rid of negative (free) vars:

$x \in \mathbb{R} \to x = u - v$, $u, v \in \mathbb{R}_+$

Bounded vars:

$x \in [2, 5] \to 2 \le x$, $x \le 5$.
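As a worked illustration of the slack and free-variable rewrites above, here is a small LP converted step by step. The specific problem is made up for this example and is not taken from the notes.

```latex
\begin{aligned}
&\text{original: } \min\ -x_1 - 2x_2 \quad \text{s.t.}\quad x_1 + x_2 \le 4,\quad x_1 \ge 0,\quad x_2 \in \mathbb{R} \\
&\text{free variable: } x_2 = u - v,\quad u, v \ge 0 \\
&\text{slack: } x_1 + u - v + s = 4,\quad s \ge 0 \\
&\text{standard form: } \min\ -x_1 - 2u + 2v \quad \text{s.t.}\quad x_1 + u - v + s = 4,\quad x_1, u, v, s \ge 0
\end{aligned}
```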

Simplex algorithm:

(1) Take the cost function and turn the problem into $\min z$ s.t. $c^T x = z$, with the remainder in standard LP form.

(2) Pivoting: do Gaussian elimination to get rid of as many variables as possible, without spreading $z$ into the other equations.

(3) Variables that have been eliminated from all but one equation are dependent/basic; the others are independent/non-basic. Can always get a basic solution by setting the non-basic variables to zero and reading out the basic variables; it is feasible as long as the right-hand side stays $\ge 0$.

$\begin{bmatrix} 1 & 0 & C \\ 0 & I_m & A \end{bmatrix} [-z,\ x_B,\ x_N]^T = [-z_0,\ b]^T$

(4) Improve the solution: find the smallest reduced cost $C_J = \min_j C_j$. If $C_J \ge 0$, optimality is reached; quit. Else, $x_J$ is the incoming variable.

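(5) Find how far we can go by picking the outgoing variable (ratio test on the incoming column $J$):

$r = \arg\min_{i \,:\, A_{i,J} > 0} b_i / A_{i,J}$

If no $A_{i,J} > 0$, the LP is unbounded.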

(6) Perform elimination to make $x_J$ basic, using the equation in which the outgoing variable is basic. That is, take the only equation in which the outgoing variable is non-zero (row $r$), and use it to eliminate the incoming variable from every other equation.

(7) Repeat from 4 until optimality reached.
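Steps (4) to (7) can be sketched as a small tableau routine. This is a minimal sketch, not a production solver: it assumes the problem is given as $\min c^T x$ s.t. $Ax \le b$, $x \ge 0$ with $b \ge 0$, so that the slack variables provide the initial basis. The function name `simplex_min` and the example data are made up for illustration, and there is no anti-cycling rule.

```python
from fractions import Fraction


def simplex_min(c, A, b):
    """Minimise c.x subject to A x <= b, x >= 0, assuming every b[i] >= 0.

    Slack variables turn the constraints into equalities and give an
    initial basic feasible solution, so no phase 1 is needed.
    """
    m, n = len(A), len(c)
    # Constraint rows [A | I | b]; the slacks start as the basic variables.
    rows = [[Fraction(v) for v in A[i]]
            + [Fraction(int(i == k)) for k in range(m)]
            + [Fraction(b[i])] for i in range(m)]
    # Cost row: reduced costs, with -z0 stored in the last entry.
    cost = [Fraction(v) for v in c] + [Fraction(0)] * (m + 1)
    basis = [n + i for i in range(m)]

    while True:
        # Step 4: incoming variable J has the smallest reduced cost.
        J = min(range(n + m), key=lambda j: cost[j])
        if cost[J] >= 0:
            break  # all reduced costs are non-negative: optimal
        # Step 5: ratio test over rows with a positive entry in column J.
        candidates = [i for i in range(m) if rows[i][J] > 0]
        if not candidates:
            raise ValueError("LP is unbounded below")
        r = min(candidates, key=lambda i: rows[i][-1] / rows[i][J])
        # Step 6: pivot on (r, J) so that J becomes basic in row r.
        pivot = rows[r][J]
        rows[r] = [v / pivot for v in rows[r]]
        for i in range(m):
            if i != r and rows[i][J] != 0:
                factor = rows[i][J]
                rows[i] = [a - factor * p for a, p in zip(rows[i], rows[r])]
        factor = cost[J]
        cost = [a - factor * p for a, p in zip(cost, rows[r])]
        basis[r] = J

    # Non-basic variables are zero; basic ones are read off the right-hand side.
    x = [Fraction(0)] * (n + m)
    for i, j in enumerate(basis):
        x[j] = rows[i][-1]
    return x[:n], -cost[-1]


# min -3*x1 - 5*x2  s.t.  x1 <= 4,  2*x2 <= 12,  3*x1 + 2*x2 <= 18
x_opt, z_opt = simplex_min([-3, -5], [[1, 0], [0, 2], [3, 2]], [4, 12, 18])
print([int(v) for v in x_opt], int(z_opt))  # x = [2, 6], z = -36
```

Exact `Fraction` arithmetic is used so that the sign tests on reduced costs and pivot entries are not affected by rounding.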

Convex sets, fcns:

Defns:

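A set is X (convex, affine, or conic) if every weighted sum $\sum_i \theta_i x_i$ of points $x_i$ in the set, with weights satisfying the matching condition Y below, is also in the set.

Convex: $\sum_i \theta_i = 1$, $\theta_i \ge 0$

Affine: $\sum_i \theta_i = 1$.

Conic: $\theta_i \ge 0$.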

Examples:

Lines, line segments, hyperplanes, halfspaces, $L_p$ balls for $p \ge 1$, polyhedra, polytopes.

Preserving operations:

Translation, scaling, intersection, affine functions (e.g., projection, coordinate dropping), set sum $\{c_1 + c_2 \mid c_1 \in C_1, c_2 \in C_2\}$, direct sum $\{(c_1, c_2) \mid c_1 \in C_1, c_2 \in C_2\}$, perspective projection.

Conv. Fcn. Defn:

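$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$ for all $\theta \in [0, 1]$

$f(y) \ge f(x) + \nabla f(x)^T (y - x)$ (first-order condition, for differentiable $f$)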

Preserving operations, functions:

Non-negative weighted sum, pointwise max, affine map $f(Ax + b)$, composition (under suitable monotonicity conditions), perspective map.

Strict, Strong Convexity

Defns:

Strict convexity:

$f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)$ for $x \ne y$, $\theta \in (0, 1)$ (basically, not linear anywhere).

m-Strong convexity:

$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) - \frac{1}{2} m \theta (1 - \theta) \|x - y\|^2$

Better strong convexity defns:

$(\nabla f(x) - \nabla f(y))^T (x - y) \ge m \|x - y\|^2$

$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|^2$

$\nabla^2 f(x) \succeq m I$.
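For a concrete instance (an illustrative choice, not from the notes), take the quadratic $f(x) = \frac{1}{2} x^T Q x - b^T x$ with symmetric $Q$. The characterizations above then reduce to a statement about the smallest eigenvalue of $Q$:

```latex
\begin{aligned}
&\nabla f(x) = Qx - b, \qquad \nabla^2 f(x) = Q \\
&(\nabla f(x) - \nabla f(y))^T (x - y) = (x - y)^T Q (x - y) \ge \lambda_{\min}(Q)\, \|x - y\|^2 \\
&\Rightarrow\ f \text{ is } m\text{-strongly convex with } m = \lambda_{\min}(Q), \text{ provided } \lambda_{\min}(Q) > 0
\end{aligned}
```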

Gradient Descent

Given $x_0$, repeat $x_k = x_{k-1} - t_k \nabla f(x_{k-1})$.

Picking $t$: can diverge if $t$ is too big, too slow if $t$ is too small.

Backtracking line search: start with $t = 1$; while $f(x - t \nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|^2$, update $t = \beta t$, with $0 < \alpha < 1/2$, $0 < \beta < 1$.
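The update and the backtracking rule above can be sketched in a few lines. This is a minimal sketch in plain Python; the function name, the default values of `alpha`, `beta`, and `tol`, and the quadratic test function are assumptions chosen for illustration.

```python
def grad_descent_backtracking(f, grad, x0, alpha=0.3, beta=0.5,
                              tol=1e-6, max_iter=1000):
    """Gradient descent where each step size is found by backtracking."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq < tol ** 2:
            break  # gradient is (numerically) zero
        t = 1.0
        # Shrink t until the sufficient-decrease condition holds.
        while (f([xi - t * gi for xi, gi in zip(x, g)])
               > f(x) - alpha * t * g_norm_sq):
            t *= beta
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x


# Illustrative objective with minimiser (1, -2).
def f(x):
    return (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2


def grad(x):
    return [2 * (x[0] - 1), 20 * (x[1] + 2)]


x_min = grad_descent_backtracking(f, grad, [0.0, 0.0])
print(x_min)
```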

Subgradients

Defn.:

Subgradient of convex $f$ at $x$ is $g$ s.t.

$f(y) \ge f(x) + g^T (y - x)$ for all $y$

Subdifferential $\partial f(x)$: set of all such $g$.

SG calculus:

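$\partial(a f) = a\, \partial f$ for $a > 0$; $\partial(f_1 + f_2) = \partial f_1 + \partial f_2$;

if $g(x) = f(Ax + b)$ then $\partial g(x) = A^T \partial f(Ax + b)$.

Finite pointwise max: $\partial \max_{f \in F} f(x)$ is the convex hull of the subdifferentials of the active functions (those achieving the max at $x$).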

Norms: if $f(x) = \|x\|_p$ and $1/p + 1/q = 1$, then $\|x\|_p = \max_{\|z\|_q \le 1} z^T x$; thus

$\partial \|x\|_p = \{ y : \|y\|_q \le 1,\ y^T x = \max_{\|z\|_q \le 1} z^T x \}$.

Optimality: $f(x^*) = \min_x f(x) \leftrightarrow 0 \in \partial f(x^*)$

Remember that subgradients may not exist for non-convex functions!

Subgradient Method

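Given $x_0$, repeat $x_k = x_{k-1} - t_k g_{k-1}$, with $g_{k-1} \in \partial f(x_{k-1})$.

The SG method is not a descent method; keep track of the best point so far.

Picking $t$: square summable but not summable (e.g., $t_k = 1/k$). Polyak steps: $t_k = (f(x_{k-1}) - f(x^*)) / \|g_{k-1}\|^2$

Projected SG method: project back onto the constraint set after taking a step.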
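A minimal one-dimensional sketch of the subgradient method with $t_k = 1/k$ and best-so-far tracking. The objective $|x - 3| + 0.5|x + 1|$ (minimum value 2 at $x = 3$) and all names are assumptions made for this illustration; Polyak steps and projection are not shown.

```python
def subgradient_method(f, subgrad, x0, n_iter=2000):
    """Subgradient method with step t_k = 1/k; returns the best iterate seen."""
    x = x0
    best_x, best_f = x0, f(x0)
    for k in range(1, n_iter + 1):
        x = x - (1.0 / k) * subgrad(x)
        fx = f(x)
        # Not a descent method: f(x) can go up, so remember the best point.
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f


def sign(v):
    return (v > 0) - (v < 0)


def f(x):
    return abs(x - 3) + 0.5 * abs(x + 1)


def subgrad(x):
    # sign(0) = 0 is a valid choice from the subdifferential of |.| at a kink.
    return sign(x - 3) + 0.5 * sign(x + 1)


best_x, best_f = subgradient_method(f, subgrad, 0.0)
print(best_x, best_f)
```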

Generalized GD

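Suppose $f(x) = g(x) + h(x)$ with $g$ convex and differentiable, $h$ convex but not necessarily differentiable.

Define $\mathrm{prox}_t(x) = \arg\min_z \frac{1}{2t} \|x - z\|^2 + h(z)$; GGD is:

$x_k = \mathrm{prox}_{t_k}(x_{k-1} - t_k \nabla g(x_{k-1}))$

Called generalized gradient since, if

$G_t(x) = (1/t)\,(x - \mathrm{prox}_t(x - t \nabla g(x)))$

then the update is

$x_k = x_{k-1} - t_k G_{t_k}(x_{k-1})$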

With backtracking: while $g(x - t G_t(x)) > g(x) - t \nabla g(x)^T G_t(x) + \frac{t}{2} \|G_t(x)\|_2^2$ (maybe with $\alpha$ in last term?), update $t = \beta t$.

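Example (Lasso): Prox is $\arg\min_z \frac{1}{2t} \|\beta - z\|_2^2 + \lambda \|z\|_1 = S_{\lambda t}(\beta)$. $S_\lambda(\beta)$ is the soft-threshold operator,

$[S_\lambda(\beta)]_i = \begin{cases} \beta_i - \lambda & : \beta_i > \lambda \\ 0 & : -\lambda \le \beta_i \le \lambda \\ \beta_i + \lambda & : \beta_i < -\lambda \end{cases}$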
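The soft-threshold operator and the resulting generalized gradient step for the lasso can be sketched as follows. This is a minimal sketch assuming the smooth part is $g(\beta) = \frac{1}{2} \|y - X\beta\|_2^2$ and a fixed step $t$ no larger than $1/L$, where $L$ is the largest eigenvalue of $X^T X$; the names `soft_threshold` and `ista` and the data are made up for illustration.

```python
def soft_threshold(beta, lam):
    """Apply the soft-threshold operator S_lam to each entry of beta."""
    return [b - lam if b > lam else (b + lam if b < -lam else 0.0)
            for b in beta]


def ista(X, y, lam, t, n_iter=200):
    """Fixed-step generalized gradient descent for the lasso."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        # Gradient of the smooth part: X^T (X beta - y).
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # Gradient step on g, then prox of t * lam * ||.||_1.
        beta = soft_threshold([b - t * g for b, g in zip(beta, grad)], lam * t)
    return beta


# Diagonal design, so each coordinate can be checked by hand:
# the minimiser is beta = (1.75, 0).
X = [[2.0, 0.0], [0.0, 1.0]]
y = [4.0, 0.5]
beta_hat = ista(X, y, lam=1.0, t=0.25)
print(beta_hat)
```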

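Example (Matrix Completion): Objective: $\frac{1}{2} \sum_{(i,j)\ \mathrm{observed}} (Y_{i,j} - B_{i,j})^2 + \lambda \|B\|_*$ with $\|B\|_* = \sum_{i=1}^r \sigma_i(B)$.

Prox function: $\arg\min_Z \frac{1}{2t} \|B - Z\|_F^2 + \lambda \|Z\|_*$.

Solution: matrix soft-thresholding; $S_\lambda(B) = U \Sigma_\lambda V^T$ where $B = U \Sigma V^T$ and $(\Sigma_\lambda)_{ii} = \max\{\Sigma_{ii} - \lambda, 0\}$. As in the lasso case, the prox above is this operator with threshold $\lambda t$.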

Newton's Method: Originally developed for finding roots; use it to find roots of the gradient. Want $\nabla f(x) + \nabla^2 f(x) \Delta x = 0$; solution is $\Delta x = -[\nabla^2 f(x)]^{-1} \nabla f(x)$.

Damped Newton method:

$x_{k+1} = x_k - h_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$.
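In one dimension the Newton step is just $f'(x) / f''(x)$, which makes the iteration easy to try out. The sketch below assumes a fixed damping factor `h` (the notes allow $h_k$ to vary) and uses the made-up test function $f(x) = e^x - 2x$, whose minimiser is $\ln 2$.

```python
import math


def damped_newton_1d(d1, d2, x0, h=1.0, tol=1e-12, max_iter=50):
    """1-D damped Newton: d1 and d2 are the first and second derivatives."""
    x = x0
    for _ in range(max_iter):
        step = d1(x) / d2(x)  # Newton direction [f''(x)]^{-1} f'(x)
        x -= h * step
        if abs(step) < tol:
            break
    return x


x_star = damped_newton_1d(lambda x: math.exp(x) - 2.0,  # f'(x)
                          lambda x: math.exp(x),        # f''(x)
                          x0=0.0)
print(x_star, math.log(2.0))
```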

Conjugate Direction methods: Want to solve $\min \frac{1}{2} x^T Q x - b^T x$ with $Q \succ 0$.

Define Q-orthogonality as $d_i^T Q d_j = 0$.

Exp. subspace thm.:

Let $\{d_i\}_{i=0}^{n-1}$ be Q-conjugate.

(for method) $g_k = Q x_k - b$

$x_{k+1} = x_k + \alpha_k d_k$

$\alpha_k = -g_k^T d_k / (d_k^T Q d_k)$

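Proof sketch ($g_k \perp B_k$, with $B_k = \mathrm{span}\{d_0, \dots, d_{k-1}\}$) by ind.:

$g_{k+1} = Q x_{k+1} - b = Q(x_k + \alpha_k d_k) - b = (Q x_k - b) + \alpha_k Q d_k = g_k + \alpha_k Q d_k$

From here, by defn of $\alpha_k$, $d_k^T g_{k+1} = d_k^T (g_k + \alpha_k Q d_k) = d_k^T g_k + \alpha_k d_k^T Q d_k = d_k^T g_k - g_k^T d_k = 0$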

Algorithm:

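Arbitrary $x_0$; set $d_0 = -g_0 = b - Q x_0$, then repeat:

$\alpha_k = -g_k^T d_k / (d_k^T Q d_k)$; $x_{k+1} = x_k + \alpha_k d_k$

$g_{k+1} = Q x_{k+1} - b$; $d_{k+1} = -g_{k+1} + \beta_k d_k$

$\beta_k = g_{k+1}^T Q d_k / (d_k^T Q d_k)$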
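The algorithm above translates almost line for line into code. This is a minimal sketch in plain Python using lists of lists for $Q$; the function name, the stopping tolerance, and the 2x2 example are assumptions for illustration. The gradient is updated with $g_{k+1} = g_k + \alpha_k Q d_k$, the identity from the proof sketch.

```python
def conjugate_gradient(Q, b, x0, tol=1e-12):
    """Minimise 0.5 x^T Q x - b^T x for symmetric positive definite Q."""
    n = len(b)

    def matvec(v):
        return [sum(Q[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = list(x0)
    g = [qi - bi for qi, bi in zip(matvec(x), b)]  # g_0 = Q x_0 - b
    d = [-gi for gi in g]                          # d_0 = -g_0
    for _ in range(n):  # at most n steps in exact arithmetic
        if dot(g, g) < tol:
            break
        Qd = matvec(d)
        alpha = -dot(g, d) / dot(d, Qd)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g = [gi + alpha * qdi for gi, qdi in zip(g, Qd)]  # g_{k+1}
        beta = dot(g, Qd) / dot(d, Qd)
        d = [-gi + beta * di for gi, di in zip(g, d)]
    return x


Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_cg = conjugate_gradient(Q, b, [2.0, 1.0])
print(x_cg)  # should solve Q x = b, i.e. x = (1/11, 7/11)
```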

Quasi-Newton Methods:

Gist: approximate the Hessian or the inverse Hessian.

Symmetric rank-one correction:

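Update: $x_{k+1} = x_k - \alpha_k H_k g_k$

$\alpha_k = \arg\min_\alpha f(x_k - \alpha H_k g_k)$ (LS)

$g_k = \nabla f(x_k)$

$H_{k+1} = H_k + \dfrac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{q_k^T (p_k - H_k q_k)}$

$p_k = x_{k+1} - x_k$; $q_k = g_{k+1} - g_k$

Might not be PSD!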

DFP (Rank 2)

$H_{k+1} = H_k + \dfrac{p_k p_k^T}{p_k^T q_k} - \dfrac{H_k q_k q_k^T H_k}{q_k^T H_k q_k}$

BFGS

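Update the inverse Hessian approximation directly (via Sherman-Morrison).

Let $q_k = g_{k+1} - g_k$, $p_k = x_{k+1} - x_k$.

$H_{k+1} = H_k + \left(1 + \dfrac{q_k^T H_k q_k}{p_k^T q_k}\right) \dfrac{p_k p_k^T}{p_k^T q_k} - \dfrac{p_k q_k^T H_k + H_k q_k p_k^T}{q_k^T p_k}$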
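The BFGS update above can be dropped into a quasi-Newton loop as follows. This is a minimal sketch in plain Python: the notes use an exact line search (LS), whereas this sketch swaps in a simple backtracking rule, and it skips the update when $p_k^T q_k$ is not safely positive. The function name, the constants, and the quadratic test problem are all assumptions for illustration.

```python
def bfgs(f, grad, x0, n_iter=100, tol=1e-6):
    """Quasi-Newton method with the BFGS inverse-Hessian update."""
    n = len(x0)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = list(x0)
    H = [[float(i == j) for j in range(n)] for i in range(n)]  # H_0 = I
    g = grad(x)
    for _ in range(n_iter):
        if dot(g, g) < tol ** 2:
            break
        d = [-v for v in matvec(H, g)]  # search direction -H_k g_k
        # Backtracking stands in for the exact line search of the notes.
        t = 1.0
        while (f([xi + t * di for xi, di in zip(x, d)])
               > f(x) + 1e-4 * t * dot(g, d)):
            t *= 0.5
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        p = [a - b for a, b in zip(x_new, x)]  # p_k
        q = [a - b for a, b in zip(g_new, g)]  # q_k
        pq = dot(p, q)
        if pq > 1e-12:  # curvature condition keeps H positive definite
            Hq = matvec(H, q)
            coef = (1.0 + dot(q, Hq) / pq) / pq
            H = [[H[i][j] + coef * p[i] * p[j]
                  - (p[i] * Hq[j] + Hq[i] * p[j]) / pq
                  for j in range(n)] for i in range(n)]
        x, g = x_new, g_new
    return x


# Test problem: 0.5 x^T Q x - b^T x with Q = [[4, 1], [1, 3]], b = (1, 2).
def f(x):
    return (0.5 * (4 * x[0] ** 2 + 2 * x[0] * x[1] + 3 * x[1] ** 2)
            - x[0] - 2 * x[1])


def grad(x):
    return [4 * x[0] + x[1] - 1, x[0] + 3 * x[1] - 2]


x_bfgs = bfgs(f, grad, [0.0, 0.0])
print(x_bfgs)  # should approach (1/11, 7/11)
```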

LP Duality

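Let $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $G \in \mathbb{R}^{r \times n}$, $h \in \mathbb{R}^r$.

(P) $\min\ c^T x$ s.t. $Ax = b$, $Gx \le h$

(D) $\max\ -b^T u - h^T v$ s.t. $-A^T u - G^T v = c$, $v \ge 0$.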

Duality:

Consider $\min f(x)$ s.t.

$h_i(x) \le 0$, $i = 1, \dots, m$

$l_j(x) = 0$, $j = 1, \dots, r$

Lagrangian:

$L(x, u, v) = f(x) + \sum_{i=1}^m u_i h_i(x) + \sum_{j=1}^r v_j l_j(x)$ with $u \in \mathbb{R}^m$, $v \in \mathbb{R}^r$ and $u \ge 0$.

Note: $f(x) \ge L(x, u, v)$ at feasible $x$.

Dual problem:

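Let $g(u, v) = \min_x L(x, u, v)$. The Lagrange dual function is $g$. Dual problem: $\max_{u \ge 0,\, v} g(u, v)$.

Note: the dual function $g$ is always concave, so the dual problem is always a convex problem (maximizing a concave function), even when the primal is not convex.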
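As a worked check, the LP dual (D) from the LP Duality section can be derived from this recipe. Note that in the LP section $u$ multiplies the equality constraints and $v \ge 0$ the inequality constraints, the reverse of the roles used in the general Lagrangian above; the derivation below follows the LP section's naming.

```latex
\begin{aligned}
L(x, u, v) &= c^T x + u^T (Ax - b) + v^T (Gx - h), \qquad v \ge 0 \\
           &= (c + A^T u + G^T v)^T x - b^T u - h^T v \\
g(u, v) = \min_x L(x, u, v) &=
\begin{cases}
  -b^T u - h^T v & \text{if } c + A^T u + G^T v = 0 \\
  -\infty & \text{otherwise}
\end{cases} \\
\Rightarrow\ \text{(D)}:\quad &\max\ -b^T u - h^T v \quad \text{s.t.}\quad -A^T u - G^T v = c,\quad v \ge 0
\end{aligned}
```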

Strong duality:

Always have $f^* \ge g^*$ (weak duality), where $f^*$, $g^*$ are the primal and dual optimal values. When $f^* = g^*$, have strong duality. If the primal is a convex problem ($f$, $h_i$ convex, $l_j$ affine) and there exists a strictly feasible $x$ (Slater's condition), then strong duality holds.

Dual example (lasso):

Have primal:


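| | Gr. | SG. | Prox. | New. | Conj. | QN | Bar. | P/D IPM |
|---|---|---|---|---|---|---|---|---|
| Crit. | f sm | any | sm g + simple h | 2× sm | 2× | 2× | 2× | 2× |
| Const. | Proj. | Proj. | Const. Prox | Equality | None | None | 2× sm. ineq. | 2× sm. ineq. |
| Param. | fix t / LS | t → 0 | fix t / LS | fix t = 1 / LS | fix / LS | LS | in: fixed / LS; out.: bar. → ∞ | in: LS; out.: bar. → ∞ |
| Cost/It. + Storage | chp | chp? | prox | Exp. ($\nabla^2$) | ≈ chp | ≈ chp | V.Exp | ≈ Exp |
| Rate | $O(1/\epsilon)$ | $O(1/\epsilon^2)$ | $O(1/\epsilon)$ | $O(\log(\log(1/\epsilon)))$ | superlin. | superlin. | $O(\log(1/\epsilon))$ | $O(\log(1/\epsilon))$ |

Gr. and Prox. Gr. are $O(1/\sqrt{\epsilon})$ w/ accel., $O(\log(1/\epsilon))$ w/ strong convexity.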