Optimization in Machine Learning: Recent Developments and Current Challenges, Lecture notes of Machine Learning

These lecture notes cover optimization in machine learning, specifically sparse/regularized optimization, SVM formulations and algorithms, and new optimization tools. They include examples from compressed sensing and TV-regularized image denoising, and they discuss different algorithmic approaches and the tradeoffs between optimality and regularity. The author's affiliation is the University of Wisconsin-Madison.

Optimization in Machine Learning

Recent Developments and Current Challenges

Stephen Wright

University of Wisconsin-Madison

NIPS Workshop, Whistler, 12 December 2008

Summary

1. Sparse / Regularized Optimization
2. SVM Formulations
3. SVM Algorithms (recently proposed)
4. New optimization tools of possible interest

Sparse Optimization

Traditionally, research on algorithmic optimization assumes that exact data are available and that precise solutions are needed. However, in many optimization applications we prefer simple, approximate solutions to more complicated exact solutions:

simple solutions are easier to actuate;
uncertain data does not justify precise solutions;
regularized solutions are less sensitive to inaccuracies;
simple solutions are more “generalizable.”

These new “ground rules” may change the algorithmic approach altogether. For example, an approximate first-order method applied to a nonsmooth formulation may be preferred to a second-order method applied to a smooth formulation.

Regularized Formulations

Vapnik: “...tradeoff between the quality of the approximation of the given data and the complexity of the approximating function.” Simplicity sometimes manifested as sparsity in the solution vector (or some simple transformation of it).

min F(x) + λR(x),

F is the model, data-fitting, or loss term (the function that would appear in a standard optimization formulation); R is a regularization function; λ ≥ 0 is a regularization parameter.

R can be nonsmooth, to promote sparsity in x (e.g. $\|\cdot\|_1$). Smooth choices of R such as $\|\cdot\|_2^2$ (Tikhonov regularization, ridge regression) suppress the size of x and improve conditioning.
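The difference between nonsmooth and smooth regularizers is easiest to see in one dimension. The sketch below assumes a scalar loss F(x) = (1/2)(x - a)^2, which is not from the slides and is chosen only because both minimizers then have closed forms: soft thresholding for R(x) = |x| and proportional shrinkage for R(x) = x^2. The function names are illustrative.

```python
def l1_minimizer(a, lam):
    """argmin_x 0.5*(x - a)**2 + lam*abs(x): soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small inputs are set exactly to zero


def l2_minimizer(a, lam):
    """argmin_x 0.5*(x - a)**2 + lam*x**2: proportional shrinkage."""
    return a / (1.0 + 2.0 * lam)


data = [3.0, -0.4, 0.2, -2.5, 0.05]
lam = 0.5
x_l1 = [l1_minimizer(a, lam) for a in data]  # sparse: small entries become 0
x_l2 = [l2_minimizer(a, lam) for a in data]  # shrunk, but no exact zeros
print(x_l1)
print(x_l2)
```

With the l1 term, every component with |a| ≤ λ is mapped to exactly zero, while the squared l2 term only scales components toward zero.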

Example: TV-regularized image denoising

Given an image f : Ω → R over a spatial domain Ω, find a nearby u that preserves edges while removing noise. (Recovered u has large constant regions.)

$$\min_u \int_\Omega (u - f)^2 \, dx + \lambda \int_\Omega |\nabla u| \, dx.$$

Here $\nabla u : \Omega \to \mathbb{R}^2$ is the spatial gradient of u, and λ controls the tradeoff between fidelity to the image data and regularity of u. Recent work shows that gradient-projection methods applied to dual or primal-dual formulations are much faster at recovering approximate solutions than methods with fast asymptotic convergence.
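As a rough illustration of gradient projection on a dual formulation, the sketch below treats a 1-D discrete analogue, min_u Σ(u_i - f_i)^2 + λ Σ|u_{i+1} - u_i|. The 1-D discretization, the step size, the iteration count, and all names are assumptions made for this example, not details taken from the slides. The dual variable p has one entry per difference and is kept in [-1, 1] by projection (clipping); the primal signal is recovered as u = f - (λ/2) Dᵀp.

```python
def diff(u):
    """Forward differences (D u)_i = u[i+1] - u[i]."""
    return [u[i + 1] - u[i] for i in range(len(u) - 1)]


def diff_t(p, n):
    """Adjoint of diff: (D^T p)_j = p[j-1] - p[j], with zero boundary terms."""
    out = []
    for j in range(n):
        left = p[j - 1] if j > 0 else 0.0
        right = p[j] if j < n - 1 else 0.0
        out.append(left - right)
    return out


def objective(u, f, lam):
    """Discrete objective: sum (u - f)^2 + lam * sum |u[i+1] - u[i]|."""
    fidelity = sum((ui - fi) ** 2 for ui, fi in zip(u, f))
    return fidelity + lam * sum(abs(d) for d in diff(u))


def primal_from_dual(f, p, lam):
    """Recover u = f - (lam/2) D^T p from the dual variable."""
    return [fi - 0.5 * lam * d for fi, d in zip(f, diff_t(p, len(f)))]


def tv_denoise_1d(f, lam, iters=500):
    """Projected gradient ascent on the dual, with constraint |p_i| <= 1."""
    p = [0.0] * (len(f) - 1)
    step = 1.0 / (2.0 * lam)  # from the bound ||D D^T|| <= 4
    for _ in range(iters):
        du = diff(primal_from_dual(f, p, lam))
        # Gradient step followed by projection onto the box [-1, 1].
        p = [max(-1.0, min(1.0, pi + step * d)) for pi, d in zip(p, du)]
    return primal_from_dual(f, p, lam)


f = [0.1, -0.1, 0.05, 0.0, 1.1, 0.9, 1.05, 1.0]  # noisy step signal
u = tv_denoise_1d(f, lam=0.5)
print([round(v, 3) for v in u])
```

Each iteration costs only a difference and a clip, which is the "cheap step" regime the slide refers to; the iterates reach a visually acceptable piecewise-constant signal long before they are accurate to many digits.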

Example: Cancer Radiotherapy

In radiation treatment planning, there is an astronomical variety of possibilities for delivering radiation from a device to a treatment area. Can vary beam shape, exposure time (weight), angle. Aim to deliver a prescribed radiation dose to the tumor while avoiding surrounding critical organs and normal tissue. Also wish to use just a few beams. This makes delivery more practical and is observed to be more robust to data uncertainty.

Solving Regularized Formulations

Different applications have very different properties and requirements, which call for different algorithmic approaches. Some approaches transfer between applications and can be analyzed at a more abstract level. Duality is often key to getting a practical formulation. Often want to solve for a range of λ values (i.e. different tradeoffs between optimality and regularity).

Often, there is a choice between (i) methods with fast asymptotic convergence (e.g. interior-point, SQP) with expensive steps and (ii) methods with slow asymptotic convergence and cheap steps, requiring only (approximate) gradient information. The latter are more appealing when we need only an approximate solution. The best algorithms may combine both approaches!

SVM Classification: Primal

Feature vectors $x_i \in \mathbb{R}^n$, $i = 1, 2, \dots, N$, binary labels $y_i \in \{-1, 1\}$.
Linear classifier: defined by $w \in \mathbb{R}^n$, $b \in \mathbb{R}$: $f(x) = w^T x + b$.
Perfect separation if $y_i f(x_i) \ge 1$ for all i. Otherwise try to find (w, b) that keeps a measure of the classification errors $\xi_i$ small (usually a separable, increasing function of the $\xi_i$).
Usually include in the objective a norm of w or (w, b). The particular choice $\|w\|_2^2$ yields a maximum-margin separating hyperplane.
A popular formulation: SVC-C aka L1-SVM (hinge loss):

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to } \xi_i \ge 0, \ y_i(w^T x_i + b) \ge 1 - \xi_i, \ i = 1, 2, \dots, N.$$
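For a fixed (w, b), the smallest feasible slack is ξ_i = max(0, 1 - y_i(wᵀx_i + b)), which is the hinge loss. The sketch below evaluates the primal objective this way; it assumes the common 1/2 scaling on the norm term, and the data, the function name, and the choice of (w, b) are made up for illustration.

```python
def primal_objective(w, b, X, y, C):
    """Return (objective value, slacks) for the soft-margin primal.

    Uses the optimal slack for fixed (w, b):
    xi_i = max(0, 1 - y_i * (w . x_i + b)), i.e. the hinge loss.
    """
    margins = [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]
    slacks = [max(0.0, 1.0 - m) for m in margins]
    value = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return value, slacks


# Toy data: the last point lies on the wrong side of the margin.
X = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.5, 0.5]]
y = [1, 1, -1, -1]
value, slacks = primal_objective(w=[1.0, 1.0], b=-1.0, X=X, y=y, C=1.0)
print(value, slacks)
```

Only points with margin below 1 contribute to the penalty term, so C directly prices each unit of margin violation against the size of w.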

Kernel Trick, RKHS

For a more powerful classifier, can project feature vector $x_i$ into a higher-dimensional space via a function $\phi : \mathbb{R}^n \to \mathbb{R}^t$ and classify in that space. Dual formulation is the same, except for redefined K:

$$K_{ij} = (y_i y_j)\, \phi(x_i)^T \phi(x_j).$$

Leads to classifier:

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \, \phi(x_i)^T \phi(x) + b.$$

Don’t actually need to use φ at all, just inner products $\phi(x)^T \phi(\bar{x})$. Instead of φ, work with a kernel function $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$. If k is continuous, symmetric in its arguments, and positive definite (Mercer kernel), there exists a Hilbert space and a function φ mapping into this space such that $k(x, \bar{x}) = \phi(x)^T \phi(\bar{x})$.

Thus, a typical strategy is to choose a kernel k, form $K_{ij} = y_i y_j k(x_i, x_j)$, solve the dual to obtain α and b, and use the classifier

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b.$$
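The strategy above can be sketched in a few lines. The Gaussian (RBF) kernel used here is an assumption, since the slides do not name a specific kernel, and the values of α and b are placeholders rather than the output of a dual solve; the function names are illustrative.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(X, y, kernel):
    """K_ij = y_i * y_j * k(x_i, x_j), as it appears in the dual."""
    n = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(n)]
            for i in range(n)]


def decision_function(x, X, y, alpha, b, kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b


X = [[0.0, 0.0], [1.0, 1.0]]
y = [1, -1]
alpha = [1.0, 1.0]  # placeholder values satisfying y^T alpha = 0
b = 0.0             # placeholder intercept

K = gram_matrix(X, y, rbf_kernel)
scores = [decision_function(x, X, y, alpha, b, rbf_kernel) for x in X]
print(K)
print(scores)
```

Note that φ never appears: both the matrix K and the classifier touch the data only through calls to the kernel.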

Solving the Dual

$$\min_{\alpha} \ \frac{1}{2}\alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{s.t. } 0 \le \alpha \le C\mathbf{1}, \ y^T \alpha = 0.$$

Convex QP with mostly bound constraints, but
a. Dense, ill-conditioned Hessian makes it tricky.
b. The linear constraint $y^T \alpha = 0$ is a nuisance!
Many methods proposed to work with this formulation.

Dual SVM: Coordinate Descent

(Hsieh et al 2008) Deal with the constraint $y^T \alpha = 0$ by getting rid of it! Corresponds to removing the “intercept” term b from the classifier:

$$\min_{w, \xi} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to } \xi_i \ge 0, \ y_i w^T x_i \ge 1 - \xi_i, \ i = 1, 2, \dots, N.$$

Get a convex, bound-constrained QP:

$$\min_{\alpha} \ \frac{1}{2}\alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{s.t. } 0 \le \alpha \le C\mathbf{1}.$$

Basic step: for some i = 1, 2, ..., N, solve this problem in closed form for $\alpha_i$, holding all components $\alpha_j$, $j \ne i$, fixed.

Can cycle through i = 1, 2 ,... , N, or pick i at random.
Update K α by evaluating one column of the kernel.
Gets near-optimal solution quickly.
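A minimal sketch of the cyclic variant follows, assuming the objective is scaled as (1/2)αᵀKα - 1ᵀα. Under that assumption the closed-form step for coordinate i is α_i ← clip(α_i - ((Kα)_i - 1)/K_ii, 0, C). The linear kernel, the toy data, the sweep count, and the function names are illustrative choices, not details from the slides or from Hsieh et al.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def dual_coordinate_descent(X, y, C, kernel, sweeps=50):
    """Cyclic coordinate descent on min 0.5*a'Ka - 1'a, 0 <= a <= C."""
    n = len(X)
    alpha = [0.0] * n
    k_alpha = [0.0] * n  # running product K @ alpha
    for _ in range(sweeps):
        for i in range(n):
            # Evaluate only column i of K (scaled by the labels).
            col = [y[i] * y[j] * kernel(X[i], X[j]) for j in range(n)]
            if col[i] <= 0.0:
                continue  # degenerate coordinate, skip
            grad_i = k_alpha[i] - 1.0
            new_ai = min(C, max(0.0, alpha[i] - grad_i / col[i]))
            delta = new_ai - alpha[i]
            if delta != 0.0:
                alpha[i] = new_ai
                # Rank-one update keeps K @ alpha current.
                k_alpha = [ka + delta * c for ka, c in zip(k_alpha, col)]
    return alpha


def decision(x, X, y, alpha, kernel):
    """Classifier without intercept: f(x) = sum_i alpha_i y_i k(x_i, x)."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))


X = [[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]
C = 10.0
alpha = dual_coordinate_descent(X, y, C, linear_kernel)
margins = [yi * decision(xi, X, y, alpha, linear_kernel)
           for xi, yi in zip(X, y)]
print(alpha, margins)
```

Because each step clips to [0, C], the iterate stays feasible throughout, and the only kernel work per step is one column.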

Dual SVM: Decomposition

Many algorithms for dual formulation make use of decomposition: Choose a subset of components of α and (approximately) solve a subproblem in just these components, fixing the other components at one of their bounds. Usually maintain feasible α throughout. Many variants, distinguished by strategy for selecting subsets, size of subsets, inner-loop strategy for solving the reduced problem.

SMO: (Platt 1998). Subproblem has two components.
SVMlight: (Joachims 1998). User chooses subproblem size (usually small); components selected with a first-order heuristic. (Could use an $\ell_1$ penalty as surrogate for cardinality constraint?)
PGPDT: (Zanni, Serafini, Zanghirati 2006) Decomposition, with gradient projection on the subproblems. Parallel implementation.
LIBSVM: (Fan, Chen, Lin, Chang 2005). SMO framework, with first- and second-order heuristics for selecting the two subproblem components. Solves a 2-D QP to get the step.

Heuristics are vital to efficiency, to save expense of calculating components of kernel K and multiplying with them:

Shrinking: exclude from consideration the components $\alpha_i$ that clearly belong at a bound (except for a final optimality check);
Caching: save some evaluated elements $K_{ij}$ in available memory.

Performance of Decomposition:

Used widely and well for > 10 years.
Solutions α are often not particularly sparse (many support vectors), so many outer (subset selection) iterations are required.
Can be problematic for large data sets.
