



















































































Recent Developments and Current Challenges
Stephen Wright
University of Wisconsin-Madison
NIPS Workshop, Whistler, 12 December 2008
1. Sparse / Regularized Optimization
2. SVM Formulations
3. SVM Algorithms (recently proposed)
4. New optimization tools of possible interest
Traditionally, research on algorithmic optimization assumes exact data available and precise solutions needed. However, in many optimization applications we prefer simple, approximate solutions to more complicated exact solutions.
Simple solutions are easier to actuate; uncertain data does not justify precise solutions; regularized solutions are less sensitive to inaccuracies; simple solutions are more “generalizable.”
These new “ground rules” may change the algorithmic approach altogether. For example, an approximate first-order method applied to a nonsmooth formulation may be preferred to a second-order method applied to a smooth formulation.
Vapnik: “...tradeoff between the quality of the approximation of the given data and the complexity of the approximating function.” Simplicity sometimes manifested as sparsity in the solution vector (or some simple transformation of it).
min F(x) + λR(x),
F is the model, data-fitting, or loss term (the function that would appear in a standard optimization formulation); R is a regularization function; λ ≥ 0 is a regularization parameter.
R can be nonsmooth, to promote sparsity in x (e.g. $\|\cdot\|_1$). Smooth choices of R such as $\|\cdot\|_2^2$ (Tikhonov regularization, ridge regression) suppress the size of x and improve conditioning.
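To illustrate how a nonsmooth R promotes sparsity, here is a minimal sketch of proximal-gradient (soft-thresholding) iterations. It assumes, purely for illustration, the least-squares loss F(x) = ½‖Ax − b‖² and R(x) = ‖x‖₁. The names `soft_threshold`, `ista`, `A`, and `b` are hypothetical and do not come from the slides.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each component toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, iters=500):
    """Proximal-gradient sketch for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Assumption: F is a least-squares loss; other smooth F need their own gradient.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad F (spectral norm squared)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)           # gradient of the smooth term F
        x = soft_threshold(x - grad / L, lam / L)  # gradient step, then shrinkage
    return x
```

With A equal to the identity, the minimizer is the soft-thresholded data, so components of b with magnitude below λ are set exactly to zero.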
Given an image f : Ω → R over a spatial domain Ω, find a nearby u that preserves edges while removing noise. (Recovered u has large constant regions.)
$$\min_u \int_\Omega (u - f)^2 \, dx + \lambda \int_\Omega |\nabla u| \, dx.$$
Here $\nabla u : \Omega \to \mathbb{R}^2$ is the spatial gradient of u. λ controls fidelity to image data. Recent work shows that gradient-projection methods applied to dual or primal-dual formulations are much faster at recovering approximate solutions than methods with fast asymptotic convergence.
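As a rough illustration of gradient projection on a dual formulation, the sketch below denoises a one-dimensional signal. It is a simplification under stated assumptions, not the method of any particular paper: the domain is a 1-D grid, ∇ is replaced by a forward difference, and the discrete objective is ½‖u − f‖² + λ‖Du‖₁ (the factor ½ only rescales λ). The function names are hypothetical.

```python
import numpy as np

def D(u):
    """Forward difference: (Du)_i = u_{i+1} - u_i."""
    return np.diff(u)

def Dt(p):
    """Adjoint of D: (D^T p)_j = p_{j-1} - p_j, with zero padding at the ends."""
    return np.concatenate(([0.0], p)) - np.concatenate((p, [0.0]))

def tv_denoise_1d(f, lam, iters=500):
    """Projected gradient on the dual: min_p 0.5*||f - D^T p||^2 s.t. |p_i| <= lam."""
    f = np.asarray(f, dtype=float)
    p = np.zeros(f.size - 1)
    step = 0.25                       # safe step since ||D D^T|| <= 4
    for _ in range(iters):
        u = f - Dt(p)                 # primal estimate recovered from the dual variable
        p = np.clip(p + step * D(u), -lam, lam)   # gradient step, then project onto the box
    return f - Dt(p)
```

Each iteration costs only a few vector operations, which is the kind of cheap step that suits approximate solutions.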
In radiation treatment planning, there is an astronomical variety of possibilities for delivering radiation from a device to a treatment area. Can vary beam shape, exposure time (weight), angle. Aim to deliver a prescribed radiation dose to the tumor while avoiding surrounding critical organs and normal tissue. Also wish to use just a few beams. This makes delivery more practical and is observed to be more robust to data uncertainty.
Different applications have very different properties and requirements, which call for different algorithmic approaches. Some approaches transfer between applications and can be analyzed at a more abstract level. Duality is often key to getting a practical formulation. Often want to solve for a range of λ values (i.e. different tradeoffs between optimality and regularity).
Often, there is a choice between (i) methods with fast asymptotic convergence (e.g. interior-point, SQP) with expensive steps and (ii) methods with slow asymptotic convergence and cheap steps, requiring only (approximate) gradient information. The latter are more appealing when we need only an approximate solution. The best algorithms may combine both approaches!
Feature vectors $x_i \in \mathbb{R}^n$, $i = 1, 2, \dots, N$, binary labels $y_i \in \{-1, 1\}$. Linear classifier: defined by $w \in \mathbb{R}^n$, $b \in \mathbb{R}$: $f(x) = w^T x + b$. Perfect separation if $y_i f(x_i) \ge 1$ for all $i$. Otherwise try to find $(w, b)$ that keeps the classification errors $\xi_i$ small (the penalty is usually a separable, increasing function of the $\xi_i$). Usually include in the objective a norm of $w$ or $(w, b)$. The particular choice $\|w\|_2^2$ yields a maximum-margin separating hyperplane. A popular formulation: SVC-C aka L1-SVM (hinge loss):
$$\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^N \xi_i,$$
subject to $\xi_i \ge 0$, $y_i (w^T x_i + b) \ge 1 - \xi_i$, $i = 1, 2, \dots, N$.
For a more powerful classifier, can project feature vector $x_i$ into a higher-dimensional space via a function $\phi : \mathbb{R}^n \to \mathbb{R}^t$ and classify in that space. Dual formulation is the same, except for redefined K:
$$K_{ij} = (y_i y_j)\, \phi(x_i)^T \phi(x_j).$$
Leads to classifier:
$$f(x) = \sum_{i=1}^N \alpha_i y_i \, \phi(x_i)^T \phi(x) + b.$$
Don’t actually need to use φ at all, just inner products $\phi(x)^T \phi(\bar{x})$. Instead of φ, work with a kernel function $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$. If k is continuous, symmetric in its arguments, and positive definite (Mercer kernel), there exists a Hilbert space and a function φ mapping into this space such that $k(x, \bar{x}) = \phi(x)^T \phi(\bar{x})$.
Thus, a typical strategy is to choose a kernel k, form $K_{ij} = y_i y_j k(x_i, x_j)$, solve the dual to obtain α and b, and use the classifier
$$f(x) = \sum_{i=1}^N \alpha_i y_i \, k(x_i, x) + b.$$
Most popular kernels:
Linear: $k(x, \bar{x}) = x^T \bar{x}$
Gaussian: $k(x, \bar{x}) = \exp(-\gamma \|x - \bar{x}\|^2)$
Polynomial: $k(x, \bar{x}) = (x^T \bar{x} + 1)^d$
These (and other kernels) lead to dense K , often ill conditioned.
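The three kernels and the matrix $K_{ij} = y_i y_j k(x_i, x_j)$ can be sketched in a few lines of NumPy. The function names and the default values for `gamma` and `d` are illustrative assumptions; rows of `X` and `Z` are feature vectors.

```python
import numpy as np

def linear_kernel(X, Z):
    """k(x, z) = x^T z for every pair of rows of X and Z."""
    return X @ Z.T

def gaussian_kernel(X, Z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * (X @ Z.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clamp tiny negatives from rounding

def polynomial_kernel(X, Z, d=2):
    """k(x, z) = (x^T z + 1)^d."""
    return (X @ Z.T + 1.0) ** d

def dual_matrix(X, y, kernel):
    """K_ij = y_i y_j k(x_i, x_j): the (generally dense) N x N matrix in the dual."""
    return (y[:, None] * y[None, :]) * kernel(X, X)
```

Forming the full matrix costs O(N²) memory, which is one reason the decomposition methods evaluate only the entries they need.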
$$\min_\alpha \; \tfrac{1}{2} \alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{s.t.} \quad 0 \le \alpha \le C \mathbf{1}, \;\; y^T \alpha = 0.$$
Convex QP with mostly bound constraints, but
a. Dense, ill conditioned Hessian makes it tricky.
b. The linear constraint $y^T \alpha = 0$ is a nuisance!
Many methods proposed to work with this formulation.
(Hsieh et al. 2008) Deal with the constraint $y^T \alpha = 0$ by getting rid of it! Corresponds to removing the “intercept” term $b$ from the classifier:
$$\min_{w, \xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^N \xi_i,$$
subject to $\xi_i \ge 0$, $y_i w^T x_i \ge 1 - \xi_i$, $i = 1, 2, \dots, N$.
Get a convex, bound-constrained QP:
$$\min_\alpha \; \tfrac{1}{2} \alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{s.t.} \quad 0 \le \alpha \le C \mathbf{1}.$$
Basic step: for some $i = 1, 2, \dots, N$, solve this problem in closed form for $\alpha_i$, holding all components $\alpha_j$, $j \ne i$, fixed.
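The basic step can be sketched as follows for the linear kernel, where $K_{ij} = y_i y_j x_i^T x_j$. This is an illustration under assumptions, not the authors' implementation: it keeps $w = \sum_i \alpha_i y_i x_i$ up to date so that the partial gradient $(K\alpha)_i - 1 = y_i w^T x_i - 1$ is cheap, and it visits the coordinates in random order. The function name and the `epochs` and `seed` parameters are hypothetical.

```python
import numpy as np

def dual_coordinate_descent(X, y, C, epochs=50, seed=0):
    """Coordinate descent on min 0.5*a^T K a - 1^T a, 0 <= a <= C (no intercept b)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    alpha = np.zeros(N)
    w = np.zeros(n)                      # invariant: w = sum_i alpha_i * y_i * x_i
    Kdiag = (X**2).sum(axis=1)           # K_ii = x_i^T x_i
    for _ in range(epochs):
        for i in rng.permutation(N):
            if Kdiag[i] == 0.0:
                continue                 # zero feature vector: coordinate is skipped
            G = y[i] * (w @ X[i]) - 1.0  # partial derivative with respect to alpha_i
            # Closed-form minimizer in alpha_i, clipped to the box [0, C].
            new = min(max(alpha[i] - G / Kdiag[i], 0.0), C)
            w += (new - alpha[i]) * y[i] * X[i]
            alpha[i] = new
    return w, alpha
```

The resulting classifier is $f(x) = w^T x$, with no intercept term.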
Many algorithms for the dual formulation make use of decomposition: choose a subset of components of α and (approximately) solve a subproblem in just these components, fixing the other components at one of their bounds. Usually maintain feasible α throughout. Many variants, distinguished by strategy for selecting subsets, size of subsets, and inner-loop strategy for solving the reduced problem.
SMO: (Platt 1998). Subproblem has two components.
SVMlight: (Joachims 1998). User chooses subproblem size (usually small); components selected with a first-order heuristic. (Could use an ℓ1 penalty as surrogate for cardinality constraint?)
PGPDT: (Zanni, Serafini, Zanghirati 2006). Decomposition, with gradient projection on the subproblems. Parallel implementation.
LIBSVM: (Fan, Chen, Lin, Chang 2005). SMO framework, with first- and second-order heuristics for selecting the two subproblem components. Solves a 2-D QP to get the step.
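A two-component decomposition step can be sketched as below. This is a simplified illustration, not LIBSVM's code: it uses a common first-order selection rule (often called the maximal violating pair), has no shrinking or caching, and assumes the full kernel matrix `Kern[i, j] = k(x_i, x_j)` is available and that both classes are present. All names are hypothetical.

```python
import numpy as np

def smo(Kern, y, C, tol=1e-6, max_iter=10000):
    """SMO-style sketch for min 0.5*a^T K a - 1^T a, 0 <= a <= C, y^T a = 0,
    where K_ij = y_i y_j Kern[i, j]."""
    N = len(y)
    alpha = np.zeros(N)
    g = -np.ones(N)                               # dual gradient K a - 1 at a = 0
    for _ in range(max_iter):
        score = -y * g
        up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
        low = ((y < 0) & (alpha < C)) | ((y > 0) & (alpha > 0))
        i = np.where(up)[0][np.argmax(score[up])]     # first-order working-set choice
        j = np.where(low)[0][np.argmin(score[low])]
        m, M = score[i], score[j]
        if m - M < tol:                           # approximate KKT conditions hold
            break
        # Curvature along the direction (a_i += y_i t, a_j -= y_j t), which keeps y^T a = 0.
        a = max(Kern[i, i] + Kern[j, j] - 2.0 * Kern[i, j], 1e-12)
        t = (m - M) / a                           # unconstrained minimizer of the 2-D QP
        t = min(t, C - alpha[i] if y[i] > 0 else alpha[i])   # keep a_i in [0, C]
        t = min(t, alpha[j] if y[j] > 0 else C - alpha[j])   # keep a_j in [0, C]
        alpha[i] += y[i] * t
        alpha[j] -= y[j] * t
        g += t * y * (Kern[:, i] - Kern[:, j])    # needs only two kernel columns
    b = 0.5 * (m + M)                             # intercept estimate from the KKT gap
    return alpha, b
```

Each iteration touches only two columns of the kernel matrix, which is why kernel evaluation dominates the cost.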
Heuristics are vital to efficiency, to save expense of calculating components of kernel K and multiplying with them:
Shrinking: exclude from consideration the components αi that clearly belong at a bound (except for a final optimality check).
Caching: save some evaluated elements Kij in available memory.
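The caching idea can be illustrated with a small least-recently-used cache of rows of K, built on Python's standard `functools.lru_cache`. This is only a sketch of the concept. Real solvers manage the cache by memory size, and the helper name and the `maxsize` default are assumptions.

```python
from functools import lru_cache
import numpy as np

def make_cached_kernel_rows(X, y, kernel, maxsize=128):
    """Return row(i) giving the i-th row of K_ij = y_i y_j k(x_i, x_j).

    `kernel(A, B)` is assumed to return the matrix of k over rows of A and B.
    At most `maxsize` rows are kept; the least recently used row is evicted first.
    """
    @lru_cache(maxsize=maxsize)
    def row(i):
        return y[i] * y * kernel(X[i:i + 1], X)[0]
    return row
```

Callers should treat the returned arrays as read-only, since the same array object is handed back on every cache hit.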
Performance of Decomposition:
Used widely and well for > 10 years. Solutions α are often not particularly sparse (many support vectors), so many outer (subset selection) iterations are required. Can be problematic for large data sets.