Support Vector Machines , Lecture Notes - Computer Science, Study notes of Artificial Intelligence

Harvard University Artificial Intelligence

Prof. David C Parkes, Computer Science, Support Vector Machines, Properties of SVMs, Regression, Harvard, Lecture Notes

Typology: Study notes

2010/2011

Uploaded on 10/25/2011



CS181 Lecture 8 — Support Vector Machines

Avi Pfeffer; Revised by David Parkes

Feb 15, 2011

Support vector machines (SVMs) are an approach to classification that has received a lot of interest in the past few years, and the SVM framework is currently one of the most popular approaches to supervised learning. This makes it worthwhile to try to understand some of the underlying mathematics behind SVMs. Still, the purpose of these notes is not for you to understand all the details of SVMs, but to know enough about them that you can apply them intelligently to a problem that you might encounter. [1]

SVMs are based on three big ideas. The first is maximizing the margin. We saw this earlier in the course when we discussed the amazing generalization performance of boosting algorithms. Intuitively, maximizing the margin means that when we learn a linear separator, we should try to choose the decision boundary so as to maximize the distance to the examples that are closest to the boundary. The second big idea is duality. This is an idea that is used many times in optimization problems. It allows one problem to be transformed into another problem that may be easier to solve. The third big idea is kernels. Kernels allow input vectors to be mapped into a higher-dimensional, and therefore more expressive, feature space, but without incurring the full computational cost one might expect.

[1] These notes are based, in part, on Bishop (1998).

1 Maximizing the Margin

We will mainly consider binary classification problems, so that the training data is $(x_1, t_1), \ldots, (x_n, t_n)$, with $t_i \in \{-1, +1\}$ denoting the target class for example $x_i$. SVMs seek to learn linear classifiers,

$$y'(x) = w^T \phi(x) + b \tag{1}$$

where $w$ is an $\ell \times 1$ weight vector, $\phi(x) \in \mathbb{R}^\ell$ is an $\ell \times 1$ feature vector, and $b$ is the bias, a scalar. As usual, $w^T$ denotes the transpose of $w$, so that $w^T \phi(x)$ is the scalar product of the weight vector and the feature vector. The function

$$\phi : \mathbb{R}^m \to \mathbb{R}^\ell \tag{2}$$

maps input vectors into a possibly higher-dimensional feature space, with $\ell \ge m$, and is often called a basis function. The classifier is constructed as:

$$h(x) = \begin{cases} +1 & \text{if } y'(x) \ge 0 \\ -1 & \text{otherwise} \end{cases} \tag{3}$$

We will assume for the most part that the training set is linearly separable in the feature space $\phi(x)$, so that there is a $(w, b)$ such that $y'(x_i) \ge 0$ for $t_i = +1$ and $y'(x_i) < 0$ for $t_i = -1$. The "margin" of a hypothesis $h$ measures, intuitively, the distance of the decision boundary to the closest examples. The (linear) decision boundary in the $\ell$-dimensional feature space is

$$\{\phi(x) \in \mathbb{R}^\ell \mid x \in \mathbb{R}^m,\ w^T \phi(x) + b = 0\} \tag{4}$$

Note: this decision boundary has a corresponding, perhaps non-linear, decision boundary in the input space, namely the set of points $x \in \mathbb{R}^m$ for which $w^T \phi(x) + b = 0$.
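As a minimal sketch of how a linear boundary in feature space can be non-linear in input space, the Python below uses a hypothetical basis function `phi(x) = (x, x^2)` on scalar inputs. The data, the weights `w`, and the bias `b` are made up for illustration and are not taken from the notes.

```python
def phi(x):
    """Hypothetical basis function mapping a scalar input to (x, x^2)."""
    return (x, x * x)

def y_prime(x, w, b):
    """Linear score y'(x) = w^T phi(x) + b, computed in feature space."""
    return sum(wj * fj for wj, fj in zip(w, phi(x))) + b

def h(x, w, b):
    """Classifier: +1 if y'(x) >= 0, otherwise -1."""
    return 1 if y_prime(x, w, b) >= 0 else -1

# Illustrative parameters: the boundary x^2 = 2.5 is linear in (x, x^2)
# but consists of two separate thresholds on the input line.
w, b = (0.0, 1.0), -2.5
inputs = [-2.0, -1.0, 1.0, 2.0]
targets = [1, -1, -1, 1]   # not separable by a single threshold on x
predictions = [h(x, w, b) for x in inputs]
print(predictions)
```

No single threshold on `x` separates these targets, yet the linear separator in the two-dimensional feature space classifies all four examples correctly.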

Definition 1 The margin for a single example $x$ is the orthogonal distance to the decision boundary in $\mathbb{R}^\ell$ if $x$ is classified correctly, and the negated orthogonal distance if $x$ is classified incorrectly.

Definition 2 The margin on a training set (or simply the “margin”) is the minimum margin on all examples in the training set.

Figure 1(a) shows the margin on examples in $\mathbb{R}^2$ for a particular hypothesis (as indicated through its linear decision boundary). One thing to note is that there are several closest examples to the hyperplane. This often happens in practice with SVMs. SVMs find a linear separator in a possibly high-dimensional feature space $\mathbb{R}^\ell$. The goal of SVMs is to find a linear separator that maximizes the margin given a particular input representation $x \in X$ and a feature mapping $\phi$. Figure 1(b) shows a hypothesis with a smaller margin than Figure 1(a). Intuitively, the second hypothesis is worse, because it is less robust when classifying unseen examples that are distributed around the current examples. We might therefore expect it to generalize less well to unseen examples.

Given parameters $(w, b)$, the vector $w$ is orthogonal to every vector in $\mathbb{R}^\ell$ that lies along the decision boundary. To see this, note that $w^T \phi(x_1) = w^T \phi(x_2)$ (both equal $-b$) for any two examples $x_1, x_2$ on the decision boundary, from which $w^T (\phi(x_1) - \phi(x_2)) = 0$, and $\phi(x_1) - \phi(x_2) \in \mathbb{R}^\ell$ is a vector aligned with the decision boundary.

• If $x_i$ is correctly classified, then either $y'(x_i) \ge 0$ and $t_i = +1$, or $y'(x_i) < 0$ and $t_i = -1$; either way $t_i y'(x_i) \ge 0$.
• If $x_i$ is incorrectly classified, then either $y'(x_i) \ge 0$ and $t_i = -1$, or $y'(x_i) < 0$ and $t_i = +1$; either way $t_i y'(x_i) \le 0$.

Following from this, the margin on example $x_i$, given the feature space defined by $\phi$ and trained parameters $(w, b)$, is exactly

$$\frac{t_i\, y'(x_i)}{\|w\|} = \frac{t_i (w^T \phi(x_i) + b)}{\|w\|}$$

Given this, the max-margin solution solves

$$\arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_{i \in D} \left[ t_i (w^T \phi(x_i) + b) \right] \right\}$$
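To make the quantity being maximized concrete, here is a small sketch that evaluates the training-set margin for two candidate hypotheses. It assumes $\phi$ is the identity map, and the toy data and the two `(w, b)` pairs are invented for illustration.

```python
import math

def margin(w, b, examples):
    """Training-set margin: min over examples of t * (w.x + b) / ||w||.

    Assumes phi is the identity, so each example is (x, t) with x a tuple.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(t * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
               for x, t in examples)

# Made-up, linearly separable toy data in R^2.
examples = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]

wide = margin((1.0, 1.0), -2.0, examples)    # boundary x1 + x2 = 2
narrow = margin((1.0, 0.0), -0.5, examples)  # boundary x1 = 0.5
print(wide, narrow)
```

Both hypotheses classify every example correctly, but the first has the larger minimum distance to the data, so the max-margin criterion prefers it.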

This is the optimization problem that must be solved in training an SVM. As formulated, the problem is hard to solve because it has a non-linear objective (the factor $1/\|w\|$ and the inner $\min_i[\ldots]$), and a large search space, given that $\ell$ may be considerably larger than $m$.

In working to simplify this formulation, we can observe that the normalized, orthogonal distance from point $x_i$ to the decision boundary is invariant to multiplying $w$ and $b$ by any constant $\beta > 0$, since

$$\frac{t_i (w^T \phi(x_i) + b)}{\|w\|} = \frac{t_i (\beta w^T \phi(x_i) + \beta b)}{\beta \|w\|} = \frac{t_i (w'^T \phi(x_i) + b')}{\|w'\|}$$

where $w' = \beta w$ and $b' = \beta b$. Given this, and given that the data is linearly separable, so that there exists a decision boundary with a strictly positive margin to every example, we can rescale $(w, b)$ so that the closest example satisfies $t_i (w^T \phi(x_i) + b) = 1$. We can therefore impose on the max-margin problem (10), without any loss in generality, the constraint

$$t_i (w^T \phi(x_i) + b) \ge 1, \quad \forall i \in \{1, \ldots, n\} \tag{12}$$

without losing any solutions to optimization problem (10). Consider now the following reformulation of the resulting optimization problem:

$$\arg\min_{w,b} \ \frac{1}{2} \|w\|^2 \tag{13}$$

$$\text{s.t. } t_i (w^T \phi(x_i) + b) \ge 1, \quad \forall i \in \{1, \ldots, n\}$$

How do we understand this optimization problem? First of all, notice that a solution that minimizes $\frac{1}{2}\|w\|^2$ will also maximize $\frac{1}{\|w\|}$.

Moreover, some of the constraints will be tight (i.e., binding) in an optimal solution, because if there is a potential solution in which the constraints are all strictly greater than one, there will be a better solution in which one or more constraints are tight and equal to one; i.e., a solution will be available that has a smaller squared norm in the objective. Consider a constraint that is binding in the optimal solution on example $x_i$, so that

$$t_i (w^T \phi(x_i) + b) = 1 \tag{14}$$

For this example, we have $t_i y'(x_i) = 1$, and the margin on this example is exactly $1/\|w\|$. We conclude that formulation (13) is correct, in that a solution will find $(w, b)$ to maximize the margin (noting that the margin on the other, non-binding examples is $\frac{t_i (w^T \phi(x_i) + b)}{\|w\|} > \frac{1}{\|w\|}$).
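The rescaling argument can be checked numerically. The sketch below, assuming $\phi$ is the identity and using made-up data, rescales an arbitrary separating `(w, b)` by a positive constant so that the closest example meets constraint (12) with equality, and then reads the margin off as $1/\|w\|$. The helper name `canonicalize` is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def canonicalize(w, b, examples):
    """Rescale (w, b) by beta > 0 so that min_i t_i (w.x_i + b) equals 1."""
    smallest = min(t * (dot(w, x) + b) for x, t in examples)
    if smallest <= 0:
        raise ValueError("(w, b) does not separate the examples")
    beta = 1.0 / smallest
    return tuple(beta * wj for wj in w), beta * b

# Made-up separable data; (w, b) is an arbitrary scaling of x1 + x2 = 2.
examples = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]
w, b = (3.0, 3.0), -6.0
w_c, b_c = canonicalize(w, b, examples)
norm_c = math.sqrt(dot(w_c, w_c))
tightest = min(t * (dot(w_c, x) + b_c) for x, t in examples)
print(w_c, b_c, tightest, 1.0 / norm_c)
```

The decision boundary is unchanged by the rescaling; only the representation of it is normalized, which is what allows the objective to be written purely in terms of $\|w\|$.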

2 Duality

Already considerable progress has been made towards being able to find the weights that maximize the margin on separable training examples, in that the associated optimization problem is now a convex optimization problem subject to linear constraints. But how can we finally solve this? The concern is that the feature space $\mathbb{R}^\ell$ may have a very high (even unbounded) dimension $\ell$.

The answer is that we can employ duality. Duality is a general principle that is often used to transform a difficult optimization problem into an equivalent problem that is simpler. The idea is to put the constraints into the objective function, with each constraint associated with a scalar multiplier that becomes a variable in the dual problem and indicates how important the constraint is in the solution. The first step is to introduce these Lagrange multipliers, $\alpha_1, \ldots, \alpha_n \ge 0$, one for each inequality in (13), to obtain the Lagrangian function:

$$L(w, \alpha, b) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( t_i (w^T \phi(x_i) + b) - 1 \right) \tag{15}$$

We can observe that formulation (13) is equivalent to:

$$\arg\min_{w,b} \ \max_{\alpha \ge 0} \ L(w, \alpha, b) \tag{16}$$

since if a constraint is violated in (13) with the choice of $(w, b)$, then $t_i (w^T \phi(x_i) + b) < 1$, so $t_i (w^T \phi(x_i) + b) - 1 < 0$, and $\alpha_i$ can be selected arbitrarily high, providing an unbounded objective value and thus a large penalty. Moreover, in the case that $(w, b)$ are such that $t_i (w^T \phi(x_i) + b) \ge 1$ for all $i$, then $\max_{\alpha \ge 0} L(w, \alpha, b) = \frac{1}{2}\|w\|^2$, since $\alpha_i = 0$ unless $t_i (w^T \phi(x_i) + b) = 1$, and so $\alpha_i (t_i (w^T \phi(x_i) + b) - 1) = 0$ for all $i$. Moreover, we also have (weak duality)

$$\min_{w,b} \ \max_{\alpha \ge 0} \ L(w, \alpha, b) \ \ge\ \max_{\alpha \ge 0} \ \min_{w,b} \ L(w, \alpha, b) \tag{17}$$
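The notes later refer to an expression for $w$ (Eq. (21)) and to a dual formulation (24), neither of which is reproduced in this text. As a hedged sketch only, the usual textbook route from the Lagrangian to a dual written in terms of scalar products of feature vectors is shown below. The unnumbered equations are the standard derivation and are not necessarily the authors' exact steps or numbering.

```latex
% Stationarity of L(w, \alpha, b) with respect to w and b:
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i t_i \phi(x_i),
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i t_i = 0.

% Substituting back into L eliminates w and b, leaving a problem in \alpha only:
\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j t_i t_j \, \phi(x_i)^T \phi(x_j)
\quad \text{s.t.} \quad \alpha_i \ge 0 \;\; \forall i,
\qquad \sum_{i=1}^{n} \alpha_i t_i = 0.
```

In this form the feature vectors appear only through the scalar products $\phi(x_i)^T \phi(x_j)$, which is the property the kernel function exploits.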

[...] and

$$K(x, x') = \phi(x)^T \phi(x') \tag{26}$$

is a kernel function. A kernel function is a scalar product on two vectors mapped by basis function $\phi$ into a (possibly higher-dimensional) feature space.

This we can solve! In particular, it is a quadratic program because it has a quadratic objective function (terms in the objective are zero-, first-, or second-degree monomials) and linear inequality constraints. The number of decision variables is exactly the number of examples in the training data. One typical solution technique is to solve via "iterative conjugate gradient methods." The particular details of this technique are outside the scope of this course.

The search space in the original formulation (10) has been transformed from feature space to example space, and now has dimension equal to the number of training examples. This may seem like a loss but is in fact a win, because the potentially high dimensionality of the feature space (as described through mapping $\phi$) is now handled entirely through the kernel function.

When we solve the program we obtain an optimal vector $\alpha^* \in \mathbb{R}^n_{\ge 0}$. With this, we can substitute into Eq. (21) to obtain

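$$w^* = \sum_{i=1}^{n} \alpha_i^* t_i \phi(x_i) \tag{27}$$

which together with $y(x) = w^{*T} \phi(x) + b^*$, where $b^*$ is also optimized, yields

$$
\begin{aligned}
y(x) &= \sum_{i=1}^{n} \alpha_i^* t_i \phi(x_i)^T \phi(x) + b^* && (28) \\
     &= \sum_{i=1}^{n} \alpha_i^* t_i K(x_i, x) + b^*. && (29)
\end{aligned}
$$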

The support vectors for a trained SVM classifier are those examples that define the margin of the hypothesis on the training data, that is, the examples closest to the decision boundary in the high-dimensional space. For this, let $S = \{i : \alpha_i > 0,\ i \in \{1, \ldots, n\}\}$ denote the index set of support vectors in the solution to formulation (24). The prediction $y(x)$ for example $x$ depends only on the weights associated with these examples; i.e., the examples that define the margin of the classifier. These examples are said to "support" the margin. This comes from interpreting the Karush-Kuhn-Tucker (KKT) conditions on the solution, which require

$$\alpha_i (t_i y'(x_i) - 1) = 0, \quad \forall i \in \{1, \ldots, n\} \tag{30}$$

in an optimal solution, and so $\alpha_i > 0 \Rightarrow t_i y'(x_i) = 1$. We see that for the support vectors, the classifier provides $t_i y'(x_i) = 1$, and so these are the examples for which the constraints are tight in formulation (13) (and thus define the margin). Looking at Figure 1(a), the support vectors are the examples for which there is an arrow between them and the decision boundary. We can classify any example using only the support vectors:

$$h(x) = \operatorname{sign}\left( \sum_{i \in S} \alpha_i^* t_i K(x, x_i) + b^* \right)$$

where $\operatorname{sign}(z)$ returns $+1$ if $z \ge 0$ and $-1$ otherwise. The sum is over support vectors. Note that this requires computing the kernel of each support vector and the instance to be classified. Essentially, the classification process compares a new instance $x$ with each of the support vectors. The kernel value $K(x, x_i)$, which is the scalar product $x^T x_i$ when $\phi$ is the identity, measures how similar the new instance $x$ is to the training instance $x_i$. Weight $\alpha_i^*$ measures the contribution of support vector $x_i$, i.e., how important the given support vector is. We multiply by $t_i$ so that each support vector pushes the prediction toward its own class.

Finally, we need to solve for $b^*$. For this, take any example $i \in S$ and note that $t_i y'(x_i) = 1$ (since the support vectors are those for which Eq. (12) is tight). From this,

$$t_i \left( \sum_{j \in S} \alpha_j^* t_j K(x_j, x_i) + b^* \right) = 1$$

Multiplying both sides by $t_i$ (where $t_i \in \{-1, +1\}$, so that $t_i^2 = 1$), we have

$$b^* = t_i - \sum_{j \in S} \alpha_j^* t_j K(x_j, x_i) \tag{33}$$

Note: in practice, given that the optimization problem is solved only to some degree of accuracy, it is standard to set $b^*$ to the average of this value computed over all examples in the support:

$$b^* = \frac{1}{|S|} \sum_{i \in S} \left( t_i - \sum_{j \in S} \alpha_j^* t_j K(x_j, x_i) \right)$$
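The following sketch shows how a prediction and the averaged bias could be computed once the multipliers are known. It assumes a linear kernel and a three-point toy problem whose multipliers `alphas` were worked out by hand for illustration; no quadratic program is solved here, and the function names are hypothetical.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def bias(alphas, xs, ts, kernel):
    """Average of t_i - sum_j alpha_j t_j K(x_j, x_i) over support vectors."""
    support = [i for i, a in enumerate(alphas) if a > 0]
    total = 0.0
    for i in support:
        total += ts[i] - sum(alphas[j] * ts[j] * kernel(xs[j], xs[i])
                             for j in support)
    return total / len(support)

def score(x, alphas, xs, ts, b, kernel):
    """y(x) = sum over support vectors of alpha_i t_i K(x, x_i), plus b."""
    support = [i for i, a in enumerate(alphas) if a > 0]
    return sum(alphas[i] * ts[i] * kernel(x, xs[i]) for i in support) + b

def classify(x, alphas, xs, ts, b, kernel):
    return 1 if score(x, alphas, xs, ts, b, kernel) >= 0 else -1

# Toy 1-D problem: the third example lies beyond the margin, so alpha_3 = 0.
xs = [(1.0,), (-1.0,), (3.0,)]
ts = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]   # hand-derived for this toy data (assumption)
b_star = bias(alphas, xs, ts, linear_kernel)
print(b_star, classify((2.0,), alphas, xs, ts, b_star, linear_kernel))
```

Only the two examples with positive multipliers take part in either computation, which is why the number of support vectors, rather than the size of the training set, governs prediction cost.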

3 Kernels

The wonderful thing that happened in formulating optimization (24) is that the inputs $x$ only enter into the formulation via the kernel function $K(x, x')$. This allows a very high-dimensional feature space to be handled implicitly, without first computing a feature vector and then taking the scalar product. In working with SVMs, we are in fact seeking classifiers that provide linear separation of data in a higher-dimensional space. Kernels are a clever way of [...]

[...] $(x_1, \ldots, x_m)$ to $x_1^2, x_1 x_2, x_1 x_3, \ldots, x_1 x_m, x_2 x_1, \ldots, x_2 x_m, \ldots, x_m x_{m-1}, x_m^2$.

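$$
\begin{aligned}
K(x, z) &= (x^T z)^2 && (36) \\
&= \Big( \sum_{j=1}^{m} x_j z_j \Big)^2 && (37) \\
&= \Big( \sum_{j=1}^{m} x_j z_j \Big) \Big( \sum_{k=1}^{m} x_k z_k \Big) && (38) \\
&= \sum_{j=1}^{m} \sum_{k=1}^{m} x_j x_k z_j z_k && (39) \\
&= \sum_{(j,k)=(1,1)}^{(m,m)} (x_j x_k)(z_j z_k) && (40) \\
&= \phi(x)^T \phi(z) && (41)
\end{aligned}
$$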

We see here the advantage of a kernel. Instead of having to compute the entire feature map (which has $m^2$ components in this case) and then computing the scalar product directly, you only have to compute the kernel function, which is $(x^T z)^2$ and takes time linear in $m$.
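The identity can be checked numerically with a few lines of Python. The explicit map `phi_quadratic` below lists all ordered pairwise products, and the vectors `x` and `z` are arbitrary illustrative values.

```python
def phi_quadratic(x):
    """Explicit feature map: all m^2 ordered pairwise products x_j * x_k."""
    return [xj * xk for xj in x for xk in x]

def quadratic_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed in time linear in len(x)."""
    return sum(a * b for a, b in zip(x, z)) ** 2

x = [1.0, 2.0, 3.0]
z = [4.0, -1.0, 0.5]
explicit = sum(a * b for a, b in zip(phi_quadratic(x), phi_quadratic(z)))
implicit = quadratic_kernel(x, z)
print(len(phi_quadratic(x)), explicit, implicit)
```

The explicit route touches $m^2$ features while the kernel route touches only $m$ inputs, and both give the same scalar product.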

Example 3 The previous example generalizes to polynomial kernel functions:

$$K(x, z) = (x^T z + c)^d \tag{42}$$

whose corresponding feature space consists of all monomials up to the $d$th degree (and whose dimension is therefore exponential in $d$).

Example 4 The Gaussian kernel is

$$K(x, z) = e^{-\|x - z\|^2 / (2\sigma^2)} \tag{43}$$

While this doesn’t look much like an inner product, it is a legal kernel. It compares the distances between two examples, with the importance of z to x decaying exponentially with the distance from z to x. In fact, the corresponding feature space for this kernel has an infinite dimension! A key point is that we can describe this kernel without describing the inner product space explicitly. When we come to computing with this kernel, we will be comparing training instances and computing their distance from each other.
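A short sketch of the Gaussian kernel as a similarity measure follows. The function name, the default `sigma`, and the sample points are illustrative choices rather than anything fixed by the notes.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x = (0.0, 0.0)
same = gaussian_kernel(x, x)            # zero distance
near = gaussian_kernel(x, (1.0, 0.0))   # distance 1
far = gaussian_kernel(x, (3.0, 4.0))    # distance 5
print(same, near, far)
```

The value is largest when the two points coincide and shrinks rapidly with distance, and a larger `sigma` makes the decay slower.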

The bottom line is that by adopting a non-trivial kernel function, SVMs can be used to learn max-margin classifiers in some high-dimensional space. In fact, one can also just ask directly whether a function $K : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is a valid kernel, in the sense that there exists some feature mapping $\phi$ so that $K(x, z) = \phi(x)^T \phi(z)$ for all $x, z$. The result (due to Mercer, and out of scope for this class) is that the kernel matrix, constructed by making entry $K_{ij} = K(x_i, x_j)$ for the $i$th and $j$th point in any set of examples $\{x_1, \ldots, x_n\}$, must be symmetric positive semi-definite. In addition, if $K, K'$ are kernels then $cK$ is a kernel for $c > 0$, and $aK + bK'$ is a kernel for $a, b > 0$. So we can make complex kernels from simple ones.
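As a rough numerical spot check (not a proof), the sketch below builds the kernel matrix of a combined kernel $aK + bK'$ on a few random points and verifies symmetry and that the quadratic form $c^T K c$ is non-negative for many random vectors $c$. The weights, the number of points, and the random seed are arbitrary assumptions.

```python
import math
import random

def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def combined_kernel(x, z):
    """aK + bK' with a = 2 and b = 0.5 (arbitrary positive weights)."""
    return 2.0 * gaussian_kernel(x, z) + 0.5 * linear_kernel(x, z)

def kernel_matrix(points, kernel):
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(matrix, c):
    """Compute c^T M c for a square matrix M given as nested lists."""
    n = len(c)
    return sum(c[i] * matrix[i][j] * c[j] for i in range(n) for j in range(n))

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]
K = kernel_matrix(points, combined_kernel)

is_symmetric = all(abs(K[i][j] - K[j][i]) < 1e-12
                   for i in range(5) for j in range(5))
forms = [quadratic_form(K, [rng.gauss(0, 1) for _ in range(5)])
         for _ in range(200)]
print(is_symmetric, min(forms))
```

A negative quadratic form for some `c` would show that a candidate function is not a valid kernel, whereas passing this check on sampled vectors is only supporting evidence.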

4 Properties of SVMs

• Accurate classification.
• With an appropriate kernel function, can express very complex hypotheses.
• No problem with local minima (quadratic objective function, and thus a convex minimization problem).
• Hypothesis directly represented as a set of support vectors, independent of all non-support vectors. This gives a fairly succinct representation of a hypothesis and thus reasonable classification speed.
• Very slow training. This is particularly a problem for large training sets. The objective function of the quadratic program (24) contains the kernel on all pairs of instances.
• Exactly separating examples in a high-dimensional space can lead to strange-looking decision boundaries. This can lead to over-fitting. Need to use regularization (see below).

4.1 Extensions

We briefly discuss some extensions to the basic approach introduced so far.

Non-separable data, regularization and model selection. What if the training data is not linearly separable, even with a high-dimensional feature space? What if the data is noisy, and so exactly fitting the training data is not the best way to generalize? The SVM method can be easily extended to act as a "soft-margin" method in which some examples are allowed to be misclassified. In particular, the new (primal) training problem is

$$\arg\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \tag{44}$$

$$\text{s.t. } t_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in \{1, \ldots, n\}$$

for some $C > 0$, where each slack variable $\xi_i$ measures by how much example $i$ falls short of the margin constraint. We can actually view the first term in the objective, which was derived as the component that provides for a large margin, as providing "regularization" in that it prefers simpler weights.
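To illustrate the trade-off in the soft-margin objective, the sketch below evaluates it for a fixed `(w, b)`. It assumes $\phi$ is the identity and uses the fact that, for fixed `(w, b)`, the smallest slack satisfying the constraints is $\xi_i = \max(0, 1 - t_i (w^T x_i + b))$. The data and parameter values are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, examples, C):
    """Return (objective, slacks) for 0.5 * ||w||^2 + C * sum_i xi_i.

    For fixed (w, b) the smallest feasible slack is
    xi_i = max(0, 1 - t_i (w.x_i + b)); phi is assumed to be the identity.
    """
    slacks = [max(0.0, 1.0 - t * (dot(w, x) + b)) for x, t in examples]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

# Made-up 1-D data with one noisy positive example at x = -0.5.
examples = [((2.0,), 1), ((1.0,), 1), ((-1.0,), -1), ((-2.0,), -1), ((-0.5,), 1)]
objective, slacks = soft_margin_objective((1.0,), 0.0, examples, C=1.0)
print(objective, slacks)
```

Only the noisy example receives a positive slack; a larger `C` would penalize that violation more heavily relative to the margin term.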

Regression. SVMs extend easily to regression problems. The hypothesis becomes

$$y'(x) = \sum_{i=1}^{n} (\alpha_i - \hat{\alpha}_i) K(x_i, x) + b, \tag{46}$$

where $\alpha_i, \hat{\alpha}_i$ are the weights on support vectors on either side of the hypothesis, outside of an "$\epsilon$-tube" of training data that is accurately fit. The parameter $\epsilon > 0$, which is part of the training problem, is determined through (cross-)validation, with smaller $\epsilon$ leading to more complex hypotheses.
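The notes do not spell out the loss behind the $\epsilon$-tube. The sketch below assumes the $\epsilon$-insensitive loss commonly used in support vector regression, which is zero inside the tube and grows linearly outside it. The predictions and targets are made-up numbers.

```python
def epsilon_insensitive_loss(prediction, target, epsilon):
    """Zero inside the epsilon-tube, linear outside it (assumed loss)."""
    return max(0.0, abs(target - prediction) - epsilon)

def outside_tube(predictions, targets, epsilon):
    """Indices of examples whose residual exceeds epsilon."""
    return [i for i, (y, t) in enumerate(zip(predictions, targets))
            if abs(t - y) > epsilon]

predictions = [1.0, 2.0, 3.0, 4.0]   # made-up outputs y'(x_i)
targets = [1.05, 2.5, 2.95, 3.0]     # made-up real-valued targets
losses = [epsilon_insensitive_loss(y, t, 0.25)
          for y, t in zip(predictions, targets)]
print(outside_tube(predictions, targets, 0.25), losses)
```

Shrinking `epsilon` pushes more examples outside the tube, which is consistent with smaller values leading to hypotheses that depend on more support vectors.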
