

These study notes summarize optimization methods and algorithms, including conjugate direction methods, gradient descent, the simplex algorithm, subgradients, quasi-Newton methods, and LP duality. They also cover LPs and standard form, subdifferentials, conic sets, and generalized gradient descent, with the key equations and proof sketches.
Typology: Study notes


I ♥ Grid Search
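LPs
Standard form: $\min c^T x$ s.t. $Ax = b$, $x \ge 0$, $b \ge 0$.
Getting it to standard form:
Getting rid of ≥, ≤ (add a slack variable): $x_1 \le 4 \to x_1 + x_2 = 4$, $x_2 \ge 0$.
Getting rid of free (possibly negative) variables: $x \in \mathbb{R} \to x = u - v$, $u, v \in \mathbb{R}_+$.
Bounded variables: $x \in [2, 5] \to 2 \le x$, $x \le 5$.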
Simplex algorithm:
(1) Take the cost function and turn it into $\min z$ s.t. $c^T x = z$, with the remainder in standard LP form.
(2) Pivoting: do Gaussian elimination to get rid of as many variables as possible, without distributing the $z$ around.
(3) Variables that have been eliminated except in one equation are dependent/basic; the others are independent/non-basic. Can always get a feasible point by setting the non-basic variables to zero and reading out the basic variables:
$\begin{bmatrix} 1 & 0 & C \\ 0 & I_m & A \end{bmatrix} [-z, x_B, x_N]^T = [-z_0, b]^T$
(4) Improve the solution: find the smallest reduced cost $C_j$ and call its index $J$. If $C_J \ge 0$, optimality is reached, quit. Else, $J$ is incoming.
(5) Find how far we can go by picking the outgoing variable with the ratio test: $r = \arg\min_{i : A_{i,J} > 0} b_i / A_{i,J}$.
(6) Perform elimination to get rid of $J$, using the equation that makes the outgoing variable a basic one. That is, take the only equation in which the outgoing variable is non-zero, and eliminate the incoming variable with it.
(7) Repeat from (4) until optimality is reached.
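The pivoting steps (4) to (7) can be sketched as a small tableau routine. This is a minimal illustration, not a production solver: it assumes the LP is already in canonical form with an identity basis (here the slack columns), it has no anti-cycling rule, and the function name `simplex` and the example LP are made up for this sketch.

```python
from fractions import Fraction


def simplex(c, A, b, basis):
    """Minimize c^T x s.t. Ax = b, x >= 0 from a canonical starting tableau.

    Assumes the columns listed in `basis` already form an identity matrix
    and that c is zero on those columns (illustrative assumption).
    """
    m, n = len(A), len(c)
    # Each row holds the constraint coefficients followed by its right-hand side.
    T = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    cost = [Fraction(v) for v in c] + [Fraction(0)]  # last entry holds -z
    basis = list(basis)
    while True:
        # Step (4): smallest reduced cost decides the incoming variable J.
        j = min(range(n), key=lambda k: cost[k])
        if cost[j] >= 0:
            break  # all reduced costs non-negative: optimal
        # Step (5): ratio test over rows with a positive entry in column J.
        rows = [i for i in range(m) if T[i][j] > 0]
        if not rows:
            raise ValueError("LP is unbounded")
        r = min(rows, key=lambda i: T[i][n] / T[i][j])
        # Step (6): eliminate column J using row r.
        piv = T[r][j]
        T[r] = [v / piv for v in T[r]]
        for i in range(m):
            if i != r and T[i][j] != 0:
                factor = T[i][j]
                T[i] = [a - factor * p for a, p in zip(T[i], T[r])]
        factor = cost[j]
        cost = [a - factor * p for a, p in zip(cost, T[r])]
        basis[r] = j
    # Non-basic variables are zero; basic variables are read off the tableau.
    x = [Fraction(0)] * n
    for i, j in enumerate(basis):
        x[j] = T[i][n]
    return x, -cost[n]


# Example LP: min -3*x1 - 2*x2 s.t. x1 + x2 <= 4, x1 + 3*x2 <= 6, x1 <= 3,
# written with slack variables s1, s2, s3 as the starting basis.
c = [-3, -2, 0, 0, 0]
A = [[1, 1, 1, 0, 0],
     [1, 3, 0, 1, 0],
     [1, 0, 0, 0, 1]]
b = [4, 6, 3]
x_opt, z_opt = simplex(c, A, b, basis=[2, 3, 4])
```

Exact rational arithmetic (`fractions.Fraction`) is used here so the ratio test and the optimality check are not affected by rounding.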
Convex sets, fcns:
Defns: A set is X if every weighted sum $\sum_i \theta_i x_i$ of points $x_i$ in the set, with weights satisfying Y, is again in the set.
Convex: $\sum_i \theta_i = 1$, $\theta_i \ge 0$.
Affine: $\sum_i \theta_i = 1$.
Conic: $\theta_i \ge 0$.
Examples: Lines, line segments, hyperplanes, halfspaces, $L_p$ balls for $p \ge 1$, polyhedrons, polytopes.
Preserving operations: Translation, scaling, intersection, affine functions (e.g., projection, coordinate dropping), set sum $\{c_1 + c_2 \mid c_1 \in C_1, c_2 \in C_2\}$, direct sum $\{(c_1, c_2) \mid c_1 \in C_1, c_2 \in C_2\}$, perspective projection.
Conv. fcn. defn: $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$ for $\theta \in [0, 1]$.
First-order condition (differentiable $f$): $f(y) \ge f(x) + \nabla f(x)^T (y - x)$.
Preserving operations, functions: Non-negative weighted sum, pointwise max, affine map $f(Ax + b)$, composition, perspective map.
Strict, Strong Convexity
Defns:
Strict convexity: $f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)$ for $x \ne y$, $\theta \in (0, 1)$ (basically, not linear).
m-Strong convexity: $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) - \frac{m}{2} \theta (1 - \theta) \|x - y\|_2^2$.
Better strong convexity defns:
$(\nabla f(x) - \nabla f(y))^T (x - y) \ge m \|x - y\|_2^2$
$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2$
$\nabla^2 f(x) \succeq mI$.
Gradient Descent
Given $x^0$, repeat $x^k = x^{k-1} - t_k \nabla f(x^{k-1})$.
Picking $t$: can diverge if $t$ is too big, too slow if $t$ is too small.
Backtracking line search: start with $t = 1$; while $f(x - t \nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2$, update $t = \beta t$, with $0 < \alpha < 1/2$, $0 < \beta < 1$.
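The gradient descent update with backtracking line search can be sketched in plain Python. The objective, the parameter values alpha = 0.3 and beta = 0.5, and the function name `backtracking_gd` are assumptions chosen for illustration.

```python
def backtracking_gd(f, grad, x0, alpha=0.3, beta=0.5, tol=1e-8, max_iter=1000):
    """Gradient descent with backtracking line search on a list-of-floats vector."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq < tol ** 2:
            break  # gradient is (numerically) zero
        t = 1.0
        # Shrink t until the sufficient-decrease condition holds.
        while f([xi - t * gi for xi, gi in zip(x, g)]) > f(x) - alpha * t * g_norm_sq:
            t *= beta
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x


# Illustrative objective: f(x) = (x0 - 1)^2 + 10 * (x1 + 2)^2, minimized at (1, -2).
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_star = backtracking_gd(f, grad_f, [0.0, 0.0])
```

Restarting from t = 1 at every iteration keeps the sketch close to the rule as stated; reusing the previous step size is a common variation.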
Subgradients
Defn.: A subgradient of convex $f$ at $x$ is any $g$ s.t. $f(y) \ge f(x) + g^T (y - x)$ for all $y$.
Subdifferential $\partial f(x)$: set of all such $g$.
SG calculus: $\partial(af) = a \, \partial f$ for $a > 0$; $\partial(f_1 + f_2) = \partial f_1 + \partial f_2$; $\partial f(Ax + b) = A^T \partial f(Ax + b)$.
Finite pointwise max: $\partial \max_{f \in F} f(x)$ is the convex hull of the union of the subdifferentials of the active functions (those achieving the max at $x$).
Norms: if $f(x) = \|x\|_p$ and $1/p + 1/q = 1$, then $\|x\|_p = \max_{\|z\|_q \le 1} z^T x$; thus $\partial \|x\|_p = \{y : \|y\|_q \le 1, \; y^T x = \max_{\|z\|_q \le 1} z^T x\}$.
Optimality: $f(x^*) = \min_x f(x) \leftrightarrow 0 \in \partial f(x^*)$.
Remember that subgradients may not exist for non-convex functions!
Subgradient Method
Given $x^0$, repeat $x^k = x^{k-1} - t_k g^{k-1}$, where $g^{k-1} \in \partial f(x^{k-1})$.
The SG method is not a descent method; keep track of the best point so far.
Picking $t$: square summable but not summable (e.g., $t_k = 1/k$).
Polyak steps: $t_k = (f(x^{k-1}) - f(x^*)) / \|g^{k-1}\|_2^2$.
Projected SG method: project onto the constraint set after taking each step.
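A minimal sketch of the subgradient method with the $t_k = 1/k$ step rule and best-so-far tracking, on the one-dimensional function $f(x) = |x - 3|$. The test function, the choice of subgradient 0 at the kink, and the helper names are assumptions for illustration.

```python
def subgradient_method(f, subgrad, x0, n_iter=2000):
    """Subgradient method with step sizes t_k = 1/k, returning the best iterate."""
    x = x0
    best_x, best_f = x0, f(x0)
    for k in range(1, n_iter + 1):
        g = subgrad(x)
        x = x - (1.0 / k) * g  # the step may increase f: not a descent method
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f


def f(x):
    return abs(x - 3.0)


def subgrad(x):
    # Any value in [-1, 1] is a valid subgradient at x = 3; 0 is chosen here.
    if x > 3.0:
        return 1.0
    if x < 3.0:
        return -1.0
    return 0.0


best_x, best_f = subgradient_method(f, subgrad, 0.0)
```

The iterates overshoot the minimizer and oscillate around it with shrinking amplitude, which is why the best value is recorded rather than the last one.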
Generalized GD
Suppose $f(x) = g(x) + h(x)$ with $g$ convex and differentiable, $h$ convex and not necessarily differentiable.
Define $\mathrm{prox}_t(x) = \arg\min_z \frac{1}{2t} \|x - z\|_2^2 + h(z)$.
GGD is: $x^k = \mathrm{prox}_{t_k}(x^{k-1} - t_k \nabla g(x^{k-1}))$.
Called the generalized gradient since, if $G_t(x) = \frac{1}{t}(x - \mathrm{prox}_t(x - t \nabla g(x)))$, then the update is $x^k = x^{k-1} - t_k G_{t_k}(x^{k-1})$.
With backtracking: while $g(x - t G_t(x)) > g(x) - t \nabla g(x)^T G_t(x) + \frac{t}{2} \|G_t(x)\|_2^2$ (maybe with $\alpha$ in the last term?), update $t = \beta t$.
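Example (Lasso): Prox is $\arg\min_z \frac{1}{2t} \|\beta - z\|_2^2 + \lambda \|z\|_1 = S_{\lambda t}(\beta)$. $S_\lambda(\beta)$ is the soft-threshold operator,
$[S_\lambda(\beta)]_i = \begin{cases} \beta_i - \lambda & \beta_i > \lambda \\ 0 & -\lambda \le \beta_i \le \lambda \\ \beta_i + \lambda & \beta_i < -\lambda \end{cases}$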
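The soft-threshold operator and a single generalized gradient step for the lasso can be sketched as follows. The helper names and the toy problem (design matrix equal to the identity, so that $\nabla g(\beta) = \beta - y$) are assumptions made to keep the example checkable by hand.

```python
def soft_threshold(beta, lam):
    """Apply the soft-threshold operator S_lam elementwise to a list of floats."""
    out = []
    for b in beta:
        if b > lam:
            out.append(b - lam)
        elif b < -lam:
            out.append(b + lam)
        else:
            out.append(0.0)
    return out


def prox_gradient_step(beta, grad_g, t, lam):
    """One generalized gradient step for g(beta) + lam * ||beta||_1."""
    shifted = [b - t * g for b, g in zip(beta, grad_g(beta))]
    return soft_threshold(shifted, lam * t)  # prox of t * lam * ||.||_1


# Toy lasso with X = I: g(beta) = 0.5 * ||y - beta||^2, so grad g(beta) = beta - y.
y = [3.0, -0.5, -2.0]


def grad_g(beta):
    return [b - yi for b, yi in zip(beta, y)]


beta_new = prox_gradient_step([0.0, 0.0, 0.0], grad_g, t=1.0, lam=1.0)
```

With the identity design and t = 1, one step lands on $S_\lambda(y)$, which is the exact lasso solution for this toy case.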
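Example (Matrix Completion): Objective: $\frac{1}{2} \sum_{(i,j) \text{ observed}} (Y_{i,j} - B_{i,j})^2 + \lambda \|B\|_*$ with $\|B\|_* = \sum_{i=1}^r \sigma_i(B)$.
Prox function: $\arg\min_Z \frac{1}{2t} \|B - Z\|_F^2 + \lambda \|Z\|_*$.
Solution: matrix soft-thresholding $S_{\lambda t}(B)$, where $S_\lambda(B) = U \Sigma_\lambda V^T$ with $B = U \Sigma V^T$ and $(\Sigma_\lambda)_{ii} = \max\{\Sigma_{ii} - \lambda, 0\}$.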
Newton's Method: Originally developed for finding roots; use it to find roots of the gradient. Want $\nabla f(x) + \nabla^2 f(x) \Delta x = 0$; the solution is $\Delta x = -[\nabla^2 f(x)]^{-1} \nabla f(x)$.
Damped Newton method: $x^{k+1} = x^k - h_k [\nabla^2 f(x^k)]^{-1} \nabla f(x^k)$.
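A one-dimensional sketch of the (damped) Newton update, where the Hessian inverse reduces to division by the second derivative. The test function $f(x) = e^x - 2x$, the constant damping factor h = 1, and the function name are assumptions for illustration.

```python
import math


def damped_newton(grad, hess, x0, h=1.0, tol=1e-12, max_iter=50):
    """Scalar damped Newton iteration: x <- x - h * f'(x) / f''(x)."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break  # root of the gradient found
        x = x - h * g / hess(x)
    return x


# f(x) = exp(x) - 2x has gradient exp(x) - 2 and second derivative exp(x);
# its minimizer is x = ln 2.
x_min = damped_newton(lambda x: math.exp(x) - 2.0, lambda x: math.exp(x), 0.0)
```

Choosing h below 1 slows the iteration but can help when the pure Newton step overshoots far from the solution.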
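Conjugate Direction methods:
Want to solve $\min \frac{1}{2} x^T Q x - b^T x$ with $Q \succ 0$.
Define Q-orthogonality (conjugacy) as $d_i^T Q d_j = 0$ for $i \ne j$.
Expanding subspace thm.: Let $\{d_i\}_{i=0}^{n-1}$ be Q-conjugate. For the method
$g_k = Q x_k - b$
$x_{k+1} = x_k + \alpha_k d_k$
$\alpha_k = -g_k^T d_k / (d_k^T Q d_k)$
we have $g_k \perp B_k$, where $B_k = \mathrm{span}\{d_0, \dots, d_{k-1}\}$.
Proof sketch ($g_k \perp B_k$) by induction:
$g_{k+1} = Q x_{k+1} - b = Q(x_k + \alpha_k d_k) - b = (Q x_k - b) + \alpha_k Q d_k = g_k + \alpha_k Q d_k$.
From here, by the defn of $\alpha_k$,
$d_k^T g_{k+1} = d_k^T (g_k + \alpha_k Q d_k) = d_k^T g_k + \alpha_k d_k^T Q d_k = 0$.
Algorithm (conjugate gradient): Arbitrary $x_0$, set $d_0 = -g_0 = b - Q x_0$, then repeat
$\alpha_k = -g_k^T d_k / (d_k^T Q d_k)$; $x_{k+1} = x_k + \alpha_k d_k$
$g_{k+1} = Q x_{k+1} - b$; $d_{k+1} = -g_{k+1} + \beta_k d_k$
$\beta_k = g_{k+1}^T Q d_k / (d_k^T Q d_k)$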
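The conjugate gradient recursion above can be sketched in plain Python for a small positive definite system. The 2x2 matrix, the right-hand side, and the helper names are assumptions for illustration; in exact arithmetic the method terminates in at most n steps.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    return [dot(row, v) for row in M]


def conjugate_gradient(Q, b, x0):
    """Minimize 0.5 * x^T Q x - b^T x for symmetric positive definite Q."""
    x = list(x0)
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]  # g_0 = Q x_0 - b
    d = [-gi for gi in g]                             # d_0 = -g_0
    for _ in range(len(b)):
        Qd = matvec(Q, d)
        dQd = dot(d, Qd)
        if dQd == 0.0:
            break  # direction is zero: already at the solution
        alpha = -dot(g, d) / dQd
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]  # g_{k+1}
        beta = dot(g, Qd) / dQd
        d = [-gi + beta * di for gi, di in zip(g, d)]     # d_{k+1}
    return x


Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_cg = conjugate_gradient(Q, b, [0.0, 0.0])  # solves Q x = b
```

The minimizer of the quadratic is the solution of $Qx = b$, so the result can be checked by substituting it back into the linear system.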
Quasi-Newton Methods:
Gist: approximate the Hessian/inverse Hessian.
Update: $x_{k+1} = x_k - \alpha_k H_k g_k$, with $\alpha_k = \arg\min_\alpha f(x_k - \alpha H_k g_k)$ (line search), $g_k = \nabla f(x_k)$, $p_k = x_{k+1} - x_k$, $q_k = g_{k+1} - g_k$.
Symmetric rank-one correction:
$H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{q_k^T (p_k - H_k q_k)}$
Might not be PSD!
DFP (rank 2):
$H_{k+1} = H_k + \frac{p_k p_k^T}{p_k^T q_k} - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k}$
BFGS: update the inverse of the Hessian (via Sherman-Morrison). Let $q_k = g_{k+1} - g_k$.
$H_{k+1} = H_k + \left(1 + \frac{q_k^T H_k q_k}{p_k^T q_k}\right) \frac{p_k p_k^T}{p_k^T q_k} - \frac{p_k q_k^T H_k + H_k q_k p_k^T}{p_k^T q_k}$
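As a sanity check on the BFGS inverse-Hessian update, the sketch below applies one update and verifies the secant condition $H_{k+1} q_k = p_k$, which the formula satisfies by construction. The function name and the test vectors are assumptions; the code also assumes $H_k$ is symmetric and $p_k^T q_k > 0$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    return [dot(row, v) for row in M]


def bfgs_inverse_update(H, p, q):
    """Return the BFGS update of the inverse-Hessian approximation H.

    Assumes H is symmetric (so q^T H equals (H q)^T) and p^T q > 0.
    """
    n = len(p)
    Hq = matvec(H, q)
    pq = dot(p, q)
    scale = (1.0 + dot(q, Hq) / pq) / pq  # coefficient of p p^T
    return [
        [
            H[i][j] + scale * p[i] * p[j] - (p[i] * Hq[j] + Hq[i] * p[j]) / pq
            for j in range(n)
        ]
        for i in range(n)
    ]


H0 = [[1.0, 0.0], [0.0, 1.0]]
p = [1.0, 0.5]   # step x_{k+1} - x_k
q = [2.0, 1.5]   # gradient change g_{k+1} - g_k
H1 = bfgs_inverse_update(H0, p, q)
secant_lhs = matvec(H1, q)  # should reproduce p
```

The updated matrix stays symmetric, and it stays positive definite when $H_k$ is positive definite and $p_k^T q_k > 0$.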
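LP Duality
Let $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $G \in \mathbb{R}^{r \times n}$, $h \in \mathbb{R}^r$.
(P) $\min c^T x$ s.t. $Ax = b$, $Gx \le h$
(D) $\max -b^T u - h^T v$ s.t. $-A^T u - G^T v = c$, $v \ge 0$.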
Duality: Consider $\min f(x)$ s.t. $h_i(x) \le 0$, $i = 1, \dots, m$, and $l_j(x) = 0$, $j = 1, \dots, r$.
Lagrangian: $L(x, u, v) = f(x) + \sum_{i=1}^m u_i h_i(x) + \sum_{j=1}^r v_j l_j(x)$ with $u \in \mathbb{R}^m$, $v \in \mathbb{R}^r$, and $u \ge 0$.
Note: $f(x) \ge L(x, u, v)$ at feasible $x$.
Dual problem: Let $g(u, v) = \min_x L(x, u, v)$. The Lagrange dual function is $g$. The dual problem is $\max_{u \ge 0, v} g(u, v)$.
Note: the dual function is always concave, so the dual problem is a convex problem even when the primal is not.
Strong duality: Always have $f^* \ge g^*$, where $f^*$, $g^*$ are the primal and dual optimal values (weak duality). When $f^* = g^*$, have strong duality. If the primal is a convex problem ($f$, $h_i$ convex, $l_j$ affine) and there exists a strictly feasible $x$, then strong duality holds.
Dual example (lasso): Have primal:
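$\min_\beta \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$. Introduce dummy $z$ and solve:
$\min_{\beta, z} \frac{1}{2} \|y - z\|_2^2 + \lambda \|\beta\|_1$ s.t. $z = X\beta$.
The dual function is then:
$g(u) = \min_{\beta, z} \frac{1}{2} \|y - z\|_2^2 + \lambda \|\beta\|_1 + u^T (z - X\beta) = \frac{1}{2} \|y\|_2^2 - \frac{1}{2} \|y - u\|_2^2 - I_{\{v : \|v\|_\infty \le 1\}}(X^T u / \lambda)$
So the dual problem is $\max_u \frac{1}{2} \left( \|y\|_2^2 - \|y - u\|_2^2 \right)$ s.t. $\|X^T u\|_\infty \le \lambda$, or equivalently $\min_u \|y - u\|_2^2$ s.t. $\|X^T u\|_\infty \le \lambda$.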
KKT Conditions:
Stationarity: $0 \in \partial f(x) + \sum_{i=1}^m u_i \partial h_i(x) + \sum_{j=1}^r v_j \partial l_j(x)$
Complementary slackness: $u_i \cdot h_i(x) = 0$ for all $i$
P feas.: $h_i(x) \le 0$, $l_j(x) = 0$ for all $i, j$
D feas.: $u_i \ge 0$ for all $i$
Necessary: under strong duality, if $x^*, u^*, v^*$ are primal and dual solutions, then they satisfy the KKT conditions.
Sufficient: always; if $x^*, u^*, v^*$ satisfy KKT, then they are primal and dual solutions.
Correspondence: Under strong duality, $x^*$ achieves the minimum in $L(x, u^*, v^*)$; if $L(x, u^*, v^*)$ has a unique minimum, then the corresponding point is the primal solution.
Correspondence, Conjugates:
Defn. convex conjugate: Given $f$, $f^*(y) = \max_x y^T x - f(x)$.
Implies $f(x) + f^*(y) \ge x^T y$. If $f$ is closed and convex, $f^{**} = f$.
Example, norm: If $f(x) = \|x\|$, then $f^*(y) = I_{\{z : \|z\|_* \le 1\}}(y)$.
Ellipsoid method for LP: Solves feasibility problems, but any LP can be turned into a feasibility problem.
Setup: Let $\Omega$ be the set satisfying the constraints. Assume $\Omega$ is contained in a ball of radius $R$ centered at $y_0$, and that there is a ball of radius $r$ centered at $y^*$ inside $\Omega$. We know $R$, $r$, $y_0$, but not $y^*$.
Iterations: Check whether the center of ellipsoid $E_k$ is in $\Omega$; if so, done. Else: find a constraint that is violated, take the side of it that is not violated, and fit the next ellipsoid to that half of $E_k$.
Convergence: $\left(\frac{\tau}{R}\right)^m \le \frac{\mathrm{Vol}(E_k)}{\mathrm{Vol}(E_0)} \le (\dots)^{k/m}$, which implies $k \le O(m^2 \log(R/\tau))$ where $\tau = 1/(m + 1)$.
Penalty Methods: Original constrained problem (P), $\min_{x \in S} f(x)$; replace with the unconstrained problem $\min_x f(x) + c \, p(x)$.
$p$ satisfies: $p$ continuous, $p(x) \ge 0$, $p(x) = 0$ iff $x \in S$.
Idea: find some solution, then increasingly penalize points outside $S$ by increasing $c \to \infty$.
Penalty functions: for constraints $g_i(x) \le 0$, $p(x) = \frac{1}{2} \sum_{i=1}^p (\max\{0, g_i(x)\})^2$.
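A small sketch of the quadratic penalty idea on a one-dimensional problem: minimize $x^2$ subject to $x \ge 1$, written as $g(x) = 1 - x \le 0$. The problem, the ternary-search minimizer, and the sequence of penalty weights are assumptions for illustration. For this problem the penalized minimizer is $c / (2 + c)$, which approaches the constrained solution $x = 1$ from outside the feasible set as $c$ grows.

```python
def minimize_1d(phi, lo, hi, n_iter=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(n_iter):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def f(x):
    return x * x


def g(x):
    return 1.0 - x  # constraint g(x) <= 0 encodes x >= 1


def penalty(x):
    return 0.5 * max(0.0, g(x)) ** 2


solutions = []
for c in [1.0, 10.0, 100.0, 1e4]:
    # Minimize the penalized objective for the current weight c.
    x_c = minimize_1d(lambda x, c=c: f(x) + c * penalty(x), -5.0, 5.0)
    solutions.append((c, x_c))
```

Each penalized minimizer is slightly infeasible, which is typical of exterior penalty methods.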
Barrier Methods: Replace the original problem with $\min_x f(x) + \frac{1}{c} B(x)$, where $B$ is continuous; $B(x) \ge 0$ for all $x \in \mathrm{int}(S)$; $B(x) \to \infty$ as $x \to \partial S$.
Idea: start out in the interior, don't let the algorithm leave $S$. Increase $c \to \infty$.
Barrier functions: Suppose $g_i(x) \le 0$:
$B(x) = -\sum_{i=1}^m \frac{1}{g_i(x)}$
$B(x) = -\sum_{i=1}^m \log(-g_i(x))$
SDP: Inner product: $\mathrm{tr}(A^T B) = \sum_{i,j} A_{i,j} B_{i,j}$ (equal to $\mathrm{tr}(AB)$ for symmetric $A$).
ICA: Step 1: whiten. Step 2: want to minimize Gaussian-likeness, but this is non-convex with lots of local minima. Assume an additive linear model.
Whitening: $\Sigma = \mathrm{cov}(X) = U D U^T$, $A^* = D^{-1/2} U^T A$.
Coordinate descent: Do argmin on each dimension, updating one-by-one. When does coordinate descent work? When the objective has the form $g(x) + \sum_i h_i(x_i)$, with $g$ convex and differentiable and each $h_i$ convex.
Non-convex problems: Specialized approach for each.
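Coordinate descent, as described above, can be sketched on a smooth convex quadratic $\frac{1}{2} x^T Q x - b^T x$, where the argmin over a single coordinate has a closed form. The matrix, the vector, and the function name are assumptions for illustration; for this objective the scheme coincides with Gauss-Seidel iteration.

```python
def coordinate_descent(Q, b, x0, n_sweeps=100):
    """Exact coordinate minimization of 0.5 * x^T Q x - b^T x (Q positive definite)."""
    x = list(x0)
    n = len(x)
    for _ in range(n_sweeps):
        for i in range(n):
            # Setting the i-th partial derivative to zero with other coordinates
            # fixed gives x_i = (b_i - sum_{j != i} Q_ij x_j) / Q_ii.
            off_diag = sum(Q[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diag) / Q[i][i]
    return x


Q = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_cd = coordinate_descent(Q, b, [0.0, 0.0])
```

Each coordinate update uses the most recent values of the other coordinates, which is the "updating one-by-one" part of the method.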
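Convex Conjugates:
$f^*(x^*) = \max_x \; x^T x^* - f(x) = -\min_x \; f(x) - x^T x^*$
Function → conjugate:
$f(ax)$ → $f^*(x^*/a)$
$f(x + b)$ → $f^*(x^*) - b^T x^*$
$a f(x)$ → $a f^*(x^*/a)$
$e^x$ → $x^* \log(x^*) - x^*$
$\|x\|$ → $I_{\|z\|_* \le 1}(x^*)$
Matrix derivatives:
$\partial A = 0$ ($A$ constant)
$\partial(aX) = a \, \partial X$
$\partial(\mathrm{tr}(X)) = \mathrm{tr}(\partial X)$
$\partial(XY) = (\partial X) Y + X (\partial Y)$
$\partial(x^T a) / \partial x = a$
$\partial(a^T X b) / \partial X = a b^T$
Suppose $s$, $r$ are functions of $x$ and $A$ is constant:
$\frac{\partial (s^T A r)}{\partial x} = \left(\frac{\partial s}{\partial x}\right)^T A r + \left(\frac{\partial r}{\partial x}\right)^T A^T s$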
Matrix properties:
SVD: $A = U \Sigma V^T$ where: $U$ are the eigenvectors of $A A^T$, $\Sigma = \mathrm{diag}(\sqrt{\mathrm{eig}(A A^T)})$, and $V$ are the eigenvectors of $A^T A$. Can also write $A$ as the weighted sum of $r$ rank-1 matrices. The rank-1 matrices are $\Sigma_{ii} U_i V_i^T$ for $1 \le i \le r$.
EVD: $X = V D V^{-1}$ with $D$ diagonal. If $X$ is symmetric, $V$ can be chosen so that $V V^T = I$.
Traces: Linear.
$\mathrm{tr}(A) = \mathrm{tr}(A^T)$
$\mathrm{tr}(X^T Y) = \mathrm{tr}(X Y^T)$
$\mathrm{tr}(X^T Y) = \mathrm{vec}(X)^T \mathrm{vec}(Y)$
$\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$
If $P^{-1}$ exists, $\mathrm{tr}(A) = \mathrm{tr}(P^{-1} A P)$.
$\mathrm{tr}(A) = \sum_i \lambda_i$
Sherman-Morrison Mat. Inv.: Suppose $A^{-1}$ exists and $1 + v^T A^{-1} u \ne 0$. Then
$(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$
Matrix norms:
Trace/Nuclear norm: $\|A\|_* = \sum_{i=1}^r \sigma_i(A)$
Spectral/Operator norm: $\|A\|_{op} = \sigma_1(A)$
Frobenius norm: $\|A\|_F = \sqrt{\mathrm{tr}(A^T A)}$.
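The Sherman-Morrison formula listed above can be checked numerically on a 2x2 example by comparing it with a direct inverse of $A + uv^T$. The matrices, vectors, and helper names are assumptions for illustration.

```python
def inv2x2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using the rank-one update formula."""
    n = len(u)
    Au = [sum(A_inv[i][k] * u[k] for k in range(n)) for i in range(n)]  # A^{-1} u
    vA = [sum(v[k] * A_inv[k][j] for k in range(n)) for j in range(n)]  # v^T A^{-1}
    denom = 1.0 + sum(v[i] * Au[i] for i in range(n))
    if denom == 0.0:
        raise ValueError("1 + v^T A^{-1} u is zero; the update is singular")
    return [[A_inv[i][j] - Au[i] * vA[j] / denom for j in range(n)] for i in range(n)]


A = [[4.0, 1.0], [2.0, 3.0]]
u = [1.0, 2.0]
v = [3.0, 1.0]

updated = sherman_morrison(inv2x2(A), u, v)
# Direct inverse of A + u v^T for comparison.
direct = inv2x2([[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)])
max_diff = max(abs(updated[i][j] - direct[i][j]) for i in range(2) for j in range(2))
```

The formula is useful when $A^{-1}$ is already available, since the update needs only matrix-vector products rather than a fresh inversion.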
Derivatives:
Function → derivative:
$f(x) g(x)$ → $f'(x) g(x) + f(x) g'(x)$
$f(g(x))$ → $f'(g(x)) g'(x)$
$x^n$ → $n x^{n-1}$
$1 / f(x)$ → $-f(x)^{-2} f'(x)$
$f(x) / g(x)$ → $(f'(x) g(x) - g'(x) f(x)) / g(x)^2$
$e^x$ → $e^x$
$\ln(x)$ → $1/x$
$\log_c(x)$ → $1 / (x \ln(c))$
Miscellaneous math:
Lipschitz: A function $f$ is Lipschitz continuous if $|f(x_1) - f(x_2)| \le L |x_1 - x_2|$; this controls how quickly the function changes.
Gradient Lipschitz: A differentiable function $f$ has Lipschitz continuous gradient if $\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|$; if it is twice-differentiable, $LI \succeq \nabla^2 f(x)$.
Useful inequalities:
Cauchy-Schwarz: $|x^T y| \le \|x\| \cdot \|y\|$.
Hölder: $\|fg\|_1 \le \|f\|_p \|g\|_q$ for $1/p + 1/q = 1$.