Optimization in Machine Learning
Recent Developments and Current Challenges
Stephen Wright
University of Wisconsin-Madison
NIPS Workshop, Whistler, 12 December 2008

Summary

  1. Sparse / Regularized Optimization
  2. SVM Formulations
  3. SVM Algorithms (recently proposed)
  4. New optimization tools of possible interest

Sparse Optimization

Traditionally, research on algorithmic optimization assumes that exact data is available and that precise solutions are needed. However, in many optimization applications we prefer simple, approximate solutions to more complicated exact solutions.

  • Simple solutions are easier to actuate.
  • Uncertain data does not justify precise solutions.
  • Regularized solutions are less sensitive to inaccuracies.
  • Simple solutions are more “generalizable.”

These new “ground rules” may change the algorithmic approach altogether. For example, an approximate first-order method applied to a nonsmooth formulation may be preferred to a second-order method applied to a smooth formulation.

Regularized Formulations

Vapnik: “...tradeoff between the quality of the approximation of the given data and the complexity of the approximating function.” Simplicity sometimes manifested as sparsity in the solution vector (or some simple transformation of it).

$$\min_x \; F(x) + \lambda R(x),$$

F is the model, data-fitting, or loss term (the function that would appear in a standard optimization formulation); R is a regularization function; λ ≥ 0 is a regularization parameter.

R can be nonsmooth, to promote sparsity in x (e.g. $\|\cdot\|_1$). Smooth choices of R such as $\|\cdot\|_2^2$ (Tikhonov regularization, ridge regression) suppress the size of x and improve conditioning.
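To make the contrast between the two regularizers concrete, here is a minimal sketch under stated assumptions: F is taken to be the least-squares term 0.5*||Ax - b||^2 on synthetic data, and the ℓ1 problem is solved with a basic proximal-gradient (ISTA) iteration chosen only for illustration. The names `soft_threshold` and `ista`, the problem sizes, and the value of λ are all hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=2000):
    """Proximal gradient for min_x 0.5*||A x - b||_2^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant of grad F
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)              # gradient of the smooth term F
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Synthetic data (assumption): 40 measurements of a 3-sparse vector in R^100.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [2.0, -3.0, 1.5]
b = A @ x_true

lam = 1.0
x_l1 = ista(A, b, lam)                                            # R = ||x||_1
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(100), A.T @ b)   # R = 0.5*||x||_2^2

print("nonzeros with l1 penalty:   ", np.count_nonzero(x_l1))
print("nonzeros with ridge penalty:", np.count_nonzero(x_ridge))
```

The soft-thresholding step sets small entries exactly to zero, which is how the nonsmooth penalty produces sparsity; the ridge solution shrinks entries but generally leaves all of them nonzero.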

Example: TV-regularized image denoising

Given an image f : Ω → R over a spatial domain Ω, find a nearby u that preserves edges while removing noise. (Recovered u has large constant regions.)

$$\min_u \int_\Omega (u - f)^2 \, dx + \lambda \int_\Omega |\nabla u| \, dx.$$

Here ∇u : Ω → R^2 is the spatial gradient of u, and λ controls fidelity to the image data. Recent work shows that gradient-projection methods applied to dual or primal-dual formulations are much faster at recovering approximate solutions than methods with fast asymptotic convergence.
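A one-dimensional discrete analogue gives a feel for gradient projection on the dual. This is a sketch under assumptions, not the method of any particular paper: the signal replaces the image, the fidelity term is scaled as 0.5*||u - f||^2 (which only rescales λ), and `tv_denoise_1d`, the step size, and the test signal are hypothetical. With (Du)_i = u[i+1] - u[i], the dual problem is min over |p_i| ≤ λ of 0.5*||f - D^T p||^2, and the primal solution is recovered as u = f - D^T p.

```python
import numpy as np

def tv_denoise_1d(f, lam, n_iter=2000):
    """Gradient projection on the dual of min_u 0.5*||u - f||^2 + lam*||D u||_1."""
    p = np.zeros(len(f) - 1)        # dual variable, one entry per difference
    tau = 0.25                      # safe step length, since ||D D^T||_2 < 4
    for _ in range(n_iter):
        u = f + np.diff(p, prepend=0.0, append=0.0)    # u = f - D^T p
        p = np.clip(p + tau * np.diff(u), -lam, lam)   # gradient step, then project onto the box
    return f + np.diff(p, prepend=0.0, append=0.0)

def total_variation(v):
    return np.abs(np.diff(v)).sum()

# Piecewise-constant signal plus noise (assumption).
rng = np.random.default_rng(0)
clean = np.repeat([0.0, 1.0, -0.5], 50)
noisy = clean + 0.1 * rng.standard_normal(clean.size)
u = tv_denoise_1d(noisy, lam=0.5)

print("TV of noisy signal:   ", total_variation(noisy))
print("TV of denoised signal:", total_variation(u))
```

Each iteration costs only a few vector operations, and the projection is a componentwise clip, which is why such methods reach a usable approximate solution cheaply.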

Example: Cancer Radiotherapy

In radiation treatment planning, there is an astronomical variety of possibilities for delivering radiation from a device to a treatment area: beam shape, exposure time (weight), and angle can all vary. The aim is to deliver a prescribed radiation dose to the tumor while avoiding surrounding critical organs and normal tissue. We also wish to use just a few beams, which makes delivery more practical and is observed to be more robust to data uncertainty.

Solving Regularized Formulations

Different applications have very different properties and requirements, which call for different algorithmic approaches. Some approaches transfer between applications and can be analyzed at a more abstract level. Duality is often key to getting a practical formulation. We often want to solve for a range of λ values (i.e. different tradeoffs between optimality and regularity).

Often, there is a choice between (i) methods with fast asymptotic convergence (e.g. interior-point, SQP) with expensive steps and (ii) methods with slow asymptotic convergence and cheap steps, requiring only (approximate) gradient information. The latter are more appealing when we need only an approximate solution. The best algorithms may combine both approaches!

SVM Classification: Primal

Feature vectors $x_i \in \mathbb{R}^n$, $i = 1, 2, \dots, N$, with binary labels $y_i \in \{-1, 1\}$.
Linear classifier: defined by $w \in \mathbb{R}^n$, $b \in \mathbb{R}$: $f(x) = w^T x + b$.
Perfect separation if $y_i f(x_i) \ge 1$ for all i. Otherwise, try to find (w, b) that keeps the classification errors $\xi_i$ small; the objective usually penalizes them through a separable, increasing function of the $\xi_i$.
Usually include in the objective a norm of w or (w, b). The particular choice $\|w\|_2^2$ yields a maximum-margin separating hyperplane.
A popular formulation is SVC-C, also known as L1-SVM (hinge loss):

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^N \xi_i \quad \text{subject to } \xi_i \ge 0, \; y_i (w^T x_i + b) \ge 1 - \xi_i, \; i = 1, 2, \dots, N.$$
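For a fixed (w, b), the smallest slack satisfying both constraints is ξ_i = max(0, 1 - y_i (w^T x_i + b)), which is where the name "hinge loss" comes from. The sketch below evaluates the objective this way on four hand-made points; it assumes the conventional factor 1/2 on the norm term, and the function name `svm_primal_objective` and the data are hypothetical.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, C):
    """Return 0.5*||w||^2 + C*sum(xi) and the slacks xi_i = max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)   # smallest feasible slack for each point
    return 0.5 * (w @ w) + C * xi.sum(), xi

# Four points in R^2; the last one lies on the wrong side of the classifier.
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0

objective, xi = svm_primal_objective(w, b, X, y, C=1.0)
print("slacks:   ", xi)
print("objective:", objective)
```

The first three points have margin 2 and contribute zero slack; the fourth has margin -0.5 and slack 1.5, so C trades that violation against the size of w.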

Kernel Trick, RKHS

For a more powerful classifier, we can map each feature vector $x_i$ into a higher-dimensional space via a function $\phi : \mathbb{R}^n \to \mathbb{R}^t$ and classify in that space. The dual formulation is the same, except for a redefined K:

$$K_{ij} = (y_i y_j) \, \phi(x_i)^T \phi(x_j).$$

Leads to classifier:

$$f(x) = \sum_{i=1}^N \alpha_i y_i \, \phi(x_i)^T \phi(x) + b.$$

Don’t actually need to use φ at all, just inner products $\phi(x)^T \phi(\bar{x})$. Instead of φ, work with a kernel function $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$. If k is continuous, symmetric in its arguments, and positive definite (a Mercer kernel), there exist a Hilbert space and a function φ mapping into this space such that $k(x, \bar{x}) = \phi(x)^T \phi(\bar{x})$.

Thus, a typical strategy is to choose a kernel k, form $K_{ij} = y_i y_j k(x_i, x_j)$, solve the dual to obtain α and b, and use the classifier

$$f(x) = \sum_{i=1}^N \alpha_i y_i \, k(x_i, x) + b.$$

Most popular kernels:

  • Linear: $k(x, \bar{x}) = x^T \bar{x}$
  • Gaussian: $k(x, \bar{x}) = \exp(-\gamma \|x - \bar{x}\|^2)$
  • Polynomial: $k(x, \bar{x}) = (x^T \bar{x} + 1)^d$

These (and other kernels) lead to dense K, often ill conditioned.
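The three kernels, the matrix $K_{ij} = y_i y_j k(x_i, x_j)$, and the resulting classifier can be sketched in a few lines of NumPy. All function names, the random data, and the parameter values γ = 1 and d = 3 are assumptions made for illustration.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def gaussian_kernel(X, Z, gamma=1.0):
    # Squared distances via ||x||^2 + ||z||^2 - 2 x^T z, clipped at 0 against rounding.
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def polynomial_kernel(X, Z, d=3):
    return (X @ Z.T + 1.0) ** d

def dual_matrix(X, y, kernel):
    """K_ij = y_i y_j k(x_i, x_j)."""
    return np.outer(y, y) * kernel(X, X)

def decision_function(alpha, b, X, y, kernel, X_new):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b, evaluated for each row x of X_new."""
    return kernel(X_new, X) @ (alpha * y) + b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)

K = dual_matrix(X, y, gaussian_kernel)
print("shape:", K.shape)
print("condition number:", np.linalg.cond(K))
```

Every entry of the Gaussian K is nonzero in exact arithmetic, and on data like this the reported condition number is typically enormous, which is the difficulty the dual solvers have to cope with.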

Solving the Dual

$$\min_\alpha \; \frac{1}{2} \alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{s.t. } 0 \le \alpha \le C \mathbf{1}, \; y^T \alpha = 0.$$

Convex QP with mostly bound constraints, but
  a. the dense, ill-conditioned Hessian makes it tricky;
  b. the linear constraint $y^T \alpha = 0$ is a nuisance!
Many methods have been proposed to work with this formulation.

Dual SVM: Coordinate Descent

(Hsieh et al 2008) Deal with the constraint $y^T \alpha = 0$ by getting rid of it! Corresponds to removing the “intercept” term b from the classifier:

$$\min_{w,\xi} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^N \xi_i \quad \text{subject to } \xi_i \ge 0, \; y_i w^T x_i \ge 1 - \xi_i, \; i = 1, 2, \dots, N.$$

Get a convex, bound-constrained QP:

$$\min_\alpha \; \frac{1}{2} \alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{s.t. } 0 \le \alpha \le C \mathbf{1}.$$

Basic step: for some i = 1, 2, ..., N, solve this problem in closed form for $\alpha_i$, holding all components $\alpha_j$, $j \ne i$, fixed.

  • Can cycle through i = 1, 2 ,... , N, or pick i at random.
  • Update K α by evaluating one column of the kernel.
  • Gets near-optimal solution quickly.
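The basic step and the bullets above can be sketched as follows. This is an illustrative reading of the slide, not the reference implementation of Hsieh et al.: it assumes the objective carries the conventional factor 1/2, so that the one-variable minimizer is α_i - ((Kα)_i - 1)/K_ii clipped to [0, C], and it uses a linear kernel on synthetic data. The name `dual_cd` and all data are hypothetical.

```python
import numpy as np

def dual_cd(K, C, n_epochs=50, seed=0):
    """Coordinate descent for min_a 0.5*a^T K a - 1^T a  subject to 0 <= a <= C."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    alpha = np.zeros(n)
    K_alpha = np.zeros(n)                   # maintained product K @ alpha
    for _ in range(n_epochs):
        for i in rng.permutation(n):        # random order; plain cycling also works
            if K[i, i] <= 0.0:
                continue                    # degenerate coordinate, skip it
            grad_i = K_alpha[i] - 1.0       # partial derivative with respect to alpha_i
            new = min(max(alpha[i] - grad_i / K[i, i], 0.0), C)   # closed-form 1-D solve
            delta = new - alpha[i]
            if delta != 0.0:
                K_alpha += delta * K[:, i]  # update K @ alpha using one column of K
                alpha[i] = new
    return alpha

# Two Gaussian clouds, linear kernel, no intercept (assumption).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
K = np.outer(y, y) * (X @ X.T)

alpha = dual_cd(K, C=1.0)
w = X.T @ (alpha * y)                       # recover the linear classifier
accuracy = np.mean(np.sign(X @ w) == y)
print("training accuracy:", accuracy)
print("support vectors:  ", np.count_nonzero(alpha))
```

Each inner step touches a single column of K and never violates the bounds, so no feasibility repair is needed once the equality constraint is gone.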

Dual SVM: Decomposition

Many algorithms for the dual formulation make use of decomposition: choose a subset of components of α and (approximately) solve a subproblem in just these components, fixing the other components at one of their bounds. Usually maintain feasible α throughout. Many variants, distinguished by strategy for selecting subsets, size of subsets, and inner-loop strategy for solving the reduced problem.

  • SMO (Platt 1998): subproblem has two components.
  • SVMlight (Joachims 1998): user chooses subproblem size (usually small); components selected with a first-order heuristic. (Could use an ℓ1 penalty as a surrogate for the cardinality constraint?)
  • PGPDT (Zanni, Serafini, Zanghirati 2006): decomposition, with gradient projection on the subproblems. Parallel implementation.
  • LIBSVM (Fan, Chen, Lin, Chang 2005): SMO framework, with first- and second-order heuristics for selecting the two subproblem components. Solves a 2-D QP to get the step.

Heuristics are vital to efficiency, to save expense of calculating components of kernel K and multiplying with them:

  • Shrinking: exclude from consideration the components αi that clearly belong at a bound (except for a final optimality check).
  • Caching: save some evaluated elements Kij in available memory.
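The caching idea can be illustrated with a memoized column evaluator. This is only a sketch of the principle, not how SVMlight or LIBSVM manage their caches: it assumes a Gaussian kernel, uses Python's `functools.lru_cache` as the eviction policy, and the helper `make_cached_column` and its `stats` counter are hypothetical.

```python
from functools import lru_cache
import numpy as np

def make_cached_column(X, y, gamma=1.0, maxsize=128):
    """Return column(i), giving column i of K_ij = y_i y_j k(x_i, x_j), with memoization."""
    stats = {"evaluated": 0}

    @lru_cache(maxsize=maxsize)             # keeps the most recently used columns
    def column(i):
        stats["evaluated"] += 1             # counts actual kernel evaluations
        sq_dist = ((X - X[i]) ** 2).sum(axis=1)
        return y * y[i] * np.exp(-gamma * sq_dist)

    return column, stats

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = np.where(X[:, 0] > 0, 1.0, -1.0)

column, stats = make_cached_column(X, y)
for i in [3, 7, 3, 3, 7]:                   # a solver revisiting the same components
    col = column(i)
print("requests: 5, columns evaluated:", stats["evaluated"])
```

Only the two distinct columns are computed; the three repeat requests are served from memory. The cached arrays are shared, so a caller should treat them as read-only.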

Performance of Decomposition:

Used widely and well for > 10 years. Solutions α are often not particularly sparse (many support vectors), so many outer (subset selection) iterations are required. Can be problematic for large data sets.