







Prof. David C Parkes, Computer Science, Support Vector Machines, Properties of SVMs, Regression, Harvard, Lecture Notes
Typology: Study notes








Support vector machines (SVMs) are an approach to classification that has received a lot of interest in the past few years, and the SVM framework is currently one of the most popular approaches to supervised learning. This makes it worthwhile to understand some of the mathematics underlying SVMs. Still, the purpose of these notes is not for you to understand all the details of SVMs, but to know enough about them that you can apply them intelligently to a problem that you might encounter.^1
SVMs are based on three big ideas. The first is maximizing the margin. We saw this earlier in the course when we discussed the amazing generalization performance of boosting algorithms. Intuitively, maximizing the margin means that when we learn a linear separator, we should try to choose the decision boundary so as to maximize the distance to the examples that are closest to the boundary. The second big idea is duality. This is an idea that is used many times in optimization problems. It allows one problem to be transformed into another problem that may be easier to solve. The third big idea is kernels. Kernels allow input vectors to be mapped into a higher-dimensional, and therefore more expressive, feature space, but without incurring the full computational cost one might expect.
^1 These notes are based, in part, on Bishop (1998).
We will mainly consider binary classification problems, so that the training data is $(x_1, t_1), \ldots, (x_n, t_n)$, with $t_i \in \{-1, +1\}$ denoting the target class for example $x_i$. SVMs seek to learn linear classifiers,
$$y'(x) = w^T \phi(x) + b \tag{1}$$
where $w$ is an ($\ell \times 1$ dimensional) weight vector, $\phi(x) \in \mathbb{R}^\ell$ is an ($\ell \times 1$ dimensional) feature vector, and $b$ is a scalar bias. As usual, $w^T$ denotes the transpose of $w$, so that $w^T \phi(x)$ is the scalar product of the weight vector and the feature vector. The function
$$\phi : \mathbb{R}^m \to \mathbb{R}^\ell \tag{2}$$
maps input vectors into a possibly higher-dimensional feature space, with $\ell \ge m$, and is often called a basis function. The classifier is constructed as:
$$h(x) = \begin{cases} +1 & \text{if } y'(x) \ge 0 \\ -1 & \text{otherwise} \end{cases} \tag{3}$$
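The construction of $y'(x)$ and $h(x)$ can be sketched in a few lines of Python. This is only an illustration: the feature map `phi_quadratic`, the helper `dot`, and the parameter values are assumptions made for the example, not something taken from the notes.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi_quadratic(x: Vector) -> List[float]:
    """Hypothetical feature map from R^1 to R^2: x -> (x, x^2)."""
    return [x[0], x[0] ** 2]


def y_prime(x: Vector, w: Vector, b: float,
            phi: Callable[[Vector], List[float]]) -> float:
    """Linear score w^T phi(x) + b, as in Eq. (1)."""
    return dot(w, phi(x)) + b


def h(x: Vector, w: Vector, b: float,
      phi: Callable[[Vector], List[float]]) -> int:
    """Classifier: +1 if the score is non-negative, -1 otherwise."""
    return 1 if y_prime(x, w, b, phi) >= 0 else -1


# Illustrative parameters (not learned): the boundary is x^2 = 1.
w = [0.0, 1.0]
b = -1.0
labels = [h([x], w, b, phi_quadratic) for x in (-2.0, 0.0, 0.5, 3.0)]
print(labels)  # [1, -1, -1, 1]
```

The classifier is linear in the feature space (x, x^2), yet in the input space it labels the two outer regions +1 and the middle region -1.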
We will assume for the most part that the training set is linearly separable in the feature space $\phi(x)$, so that there is a $(w, b)$ such that $y'(x_i) \ge 0$ for $t_i = +1$ and $y'(x_i) < 0$ for $t_i = -1$. The “margin” of a hypothesis $h$ measures, intuitively, the distance of the decision boundary to the closest examples. The (linear) decision boundary in the $\ell$-dimensional feature space is
$$\{\phi(x) \in \mathbb{R}^\ell \mid x \in \mathbb{R}^m,\ w^T \phi(x) + b = 0\} \tag{4}$$
Note: this decision boundary has a corresponding, perhaps non-linear, decision boundary in terms of the set of points $x \in \mathbb{R}^m$ for which $w^T \phi(x) + b = 0$.
Definition 1 The margin for a single example $x$ is the orthogonal distance to the decision boundary in $\mathbb{R}^\ell$ if $x$ is classified correctly, and the negated orthogonal distance if $x$ is classified incorrectly.
Definition 2 The margin on a training set (or simply the “margin”) is the minimum margin on all examples in the training set.
Figure 1(a) shows the margin on examples in $\mathbb{R}^2$ for a particular hypothesis (as indicated through its linear decision boundary). One thing to note is that there are several closest examples to the hyperplane. This often happens in practice with SVMs. SVMs find a linear separator in a possibly high-dimensional feature space $\mathbb{R}^\ell$. The goal of SVMs is to find a linear separator that maximizes the margin given a particular input representation $x \in X$ and a feature mapping $\phi$. Figure 1(b) shows a hypothesis with a smaller margin than Figure 1(a). Intuitively, the second hypothesis is less good, because it is less robust when classifying unseen examples that are distributed around the current examples. We might therefore expect it to generalize less well to unseen examples.
Given parameters $(w, b)$, the vector $w$ is orthogonal to every direction that lies within the decision boundary in $\mathbb{R}^\ell$. To see this, note that $w^T \phi(x_1) + b = 0 = w^T \phi(x_2) + b$ for any two examples $x_1, x_2$ on the decision boundary, so that $w^T \phi(x_1) = w^T \phi(x_2)$, from which $w^T (\phi(x_1) - \phi(x_2)) = 0$, and $\phi(x_1) - \phi(x_2) \in \mathbb{R}^\ell$ is a vector aligned with the decision boundary.
Following from this, the margin on example $x_i$, given the feature space defined by $\phi$ and trained parameters $(w, b)$, is exactly
$$\frac{t_i\, y'(x_i)}{\|w\|} = \frac{t_i (w^T \phi(x_i) + b)}{\|w\|}$$
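As a rough sketch of this formula, the snippet below computes the margin of each example and the margin on a small training set. The data, the parameters `w` and `b`, and the use of the identity feature map are all assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def example_margin(x: Vector, t: int, w: Vector, b: float) -> float:
    """Signed distance t * (w^T x + b) / ||w||, assuming phi(x) = x."""
    return t * (dot(w, x) + b) / math.sqrt(dot(w, w))


def training_margin(X: Sequence[Vector], T: Sequence[int],
                    w: Vector, b: float) -> float:
    """Margin on the training set: the minimum margin over all examples."""
    return min(example_margin(x, t, w, b) for x, t in zip(X, T))


# Illustrative, hand-picked data and parameters.
X = [[2.0, 2.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
T = [1, 1, -1, -1]
w, b = [1.0, 1.0], -2.0

margins = [example_margin(x, t, w, b) for x, t in zip(X, T)]
print(training_margin(X, T, w, b))  # about 0.707, set by the example (3, 0)
```

A misclassified example gets a negative value: `example_margin([0.0, 0.0], 1, w, b)` is negative because the point lies on the negative side of the boundary while its label is +1.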
Given this, the max-margin solution solves
$$\arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_{i \in D} \left[ t_i (w^T \phi(x_i) + b) \right] \right\} \tag{10}$$
This is the optimization problem that must be solved in training an SVM. As formulated, the problem is hard to solve because it has a non-linear objective (the factor $1/\|w\|$ and the inner $\min_i[\ldots]$), and a large search space, given that $\ell$ may be considerably larger than $m$.
To simplify this formulation, observe that the normalized, orthogonal distance from point $x_i$ to the decision boundary is invariant to multiplying $w$ and $b$ by any constant $\beta > 0$, since
$$\frac{t_i (w^T \phi(x_i) + b)}{\|w\|} = \frac{t_i (\beta w^T \phi(x_i) + \beta b)}{\beta \|w\|} = \frac{t_i (w'^T \phi(x_i) + b')}{\|w'\|}$$
where $w' = \beta w$ and $b' = \beta b$. Given this, and given that the data is linearly separable, so that there exists a decision boundary with a positive margin on every example, we can always rescale $(w, b)$ so that the examples closest to the boundary satisfy $t_i (w^T \phi(x_i) + b) = 1$. We can therefore impose on (10), without loss of generality, the constraint
$$t_i (w^T \phi(x_i) + b) \ge 1, \quad \forall i \in \{1, \ldots, n\} \tag{12}$$
without losing any solutions to optimization problem (10). Consider now the following reformulation of the resulting optimization problem:
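$$\begin{aligned} \arg\min_{w,b} \quad & \tfrac{1}{2} \|w\|^2 \\ \text{s.t.} \quad & t_i (w^T \phi(x_i) + b) \ge 1, \quad \forall i \in \{1, \ldots, n\} \end{aligned} \tag{13}$$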
How do we understand this optimization problem? First of all, notice that a solution that minimizes $\tfrac{1}{2}\|w\|^2$ will also maximize $\frac{1}{\|w\|}$.
Moreover, some of the constraints will be tight (i.e., binding) in an optimal solution, because if there is a potential solution in which the constraints are all strictly greater than one, there will be a better solution in which one or more constraints are tight and equal to one; i.e., a solution will be available that has a smaller squared norm in the objective. Consider a constraint that is binding in the optimal solution on example $x_i$, so that
$$t_i (w^T \phi(x_i) + b) = 1 \tag{14}$$
For this example, we have $t_i y'(x_i) = 1$, and the margin on this example is exactly $1/\|w\|$. We conclude that formulation (13) is correct, in that a solution will find $(w, b)$ to maximize the margin (noting that the margin on the other examples, whose constraints are not binding, is $\frac{t_i (w^T \phi(x_i) + b)}{\|w\|} > \frac{1}{\|w\|}$).
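As a small worked instance of formulation (13), offered as an illustration rather than as part of the original notes, take two one-dimensional examples with the identity feature map $\phi(x) = x$. The data values are assumptions chosen to keep the arithmetic simple: $x_1 = 2$ with $t_1 = +1$, and $x_2 = 0$ with $t_2 = -1$.

```latex
\begin{align*}
\text{Constraints:}\quad & t_1 (w x_1 + b) = 2w + b \ge 1,
  \qquad t_2 (w x_2 + b) = -b \ge 1 \\
\text{Hence:}\quad & b \le -1 \quad\text{and}\quad 2w \ge 1 - b \ge 2
  \;\Rightarrow\; w \ge 1 \\
\text{Minimizing } \tfrac{1}{2} w^2:\quad & w^* = 1, \quad b^* = -1
  \quad \text{(both constraints are tight)} \\
\text{Margin:}\quad & \frac{1}{\|w^*\|} = 1
\end{align*}
```

The decision boundary $w^* x + b^* = 0$ sits at $x = 1$, exactly one unit from each example, which agrees with the margin $1/\|w^*\|$.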
Already considerable progress has been made towards being able to find the weights that maximize the margin on separable training examples, in that the associated optimization problem is now a convex optimization problem subject to linear constraints. But how can we finally solve this? The concern is that the feature space $\mathbb{R}^\ell$ may have a very high (even unbounded) dimension $\ell$.
The answer is that we can employ duality. Duality is a general principle that is often used to transform a difficult optimization problem into an equivalent problem that is simpler. The idea is to put the constraints into the objective function, with each constraint associated with a scalar multiplier that becomes a variable in the dual problem and indicates how important the constraint is in the solution. The first step is to introduce these Lagrange multipliers, $\alpha_1, \ldots, \alpha_n \ge 0$, one for each inequality in (13), to obtain the Lagrangian function:
$$L(w, \alpha, b) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( t_i (w^T \phi(x_i) + b) - 1 \right) \tag{15}$$
We can observe that formulation (13) is equivalent to:
$$\arg\min_{w,b} \max_{\alpha \ge 0} L(w, \alpha, b) \tag{16}$$
since if a constraint is violated in (13) with the choice of $(w, b)$, then $t_i (w^T \phi(x_i) + b) < 1$, so $t_i (w^T \phi(x_i) + b) - 1 < 0$, and $\alpha_i$ can be made arbitrarily large, which makes the inner maximum unbounded and thus acts as an arbitrarily large penalty. Moreover, in the case that $(w, b)$ are such that $t_i (w^T \phi(x_i) + b) \ge 1$ for all $i$, then $\max_{\alpha \ge 0} L(w, \alpha, b) = \frac{1}{2}\|w\|^2$, since the maximizing $\alpha$ has $\alpha_i = 0$ unless $t_i (w^T \phi(x_i) + b) = 1$, and so $\alpha_i (t_i (w^T \phi(x_i) + b) - 1) = 0$ for all $i$. Moreover, we also have (weak duality)
$$\min_{w,b} \max_{\alpha \ge 0} L(w, \alpha, b) \ \ge\ \max_{\alpha \ge 0} \min_{w,b} L(w, \alpha, b) \tag{17}$$
and
$$K(x, x') = \phi(x)^T \phi(x') \tag{26}$$
is a kernel function. A kernel function is a scalar product of two vectors mapped by the basis function $\phi$ into a (possibly higher-dimensional) feature space.
This we can solve! In particular, it is a quadratic program because it has a quadratic objective function (each term in the objective has degree zero, one, or two in the decision variables) and linear inequality constraints. The number of decision variables is exactly the number of examples in the training data. One typical solution technique is to solve via “iterative conjugate gradient methods.” The particular details of this technique are outside the scope of this course.
The search space in the original formulation (10) has been transformed from feature space to example space, and now has dimension equal to the number of training examples. This may seem like a loss but is in fact a win, because the potentially high dimensionality of the feature space (as described through the mapping $\phi$) is now handled entirely through the kernel function.
When we solve the program we obtain an optimal vector $\alpha^* \in \mathbb{R}^n_{\ge 0}$. With this, we can substitute into Eq. (21) to obtain
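$$w^* = \sum_{i=1}^{n} \alpha_i^* t_i \phi(x_i) \tag{27}$$
which together with $y(x) = w^{*T} \phi(x) + b^*$, where $b^*$ is also optimized, yields
\begin{align}
y(x) &= \sum_{i=1}^{n} \alpha_i^* t_i \phi(x_i)^T \phi(x) + b^* \tag{28} \\
     &= \sum_{i=1}^{n} \alpha_i^* t_i K(x_i, x) + b^*. \tag{29}
\end{align}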
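The step from (27) and (28) to (29) can be checked numerically. The sketch below builds $w^*$ explicitly for a small feature map and confirms that the kernel form gives the same prediction. The feature map, the data, and the values of `alphas` and `b_star` are illustrative assumptions, not the output of a solver.

```python
import math
from typing import List


def phi(x: float) -> List[float]:
    """Hypothetical feature map R -> R^2: x -> (x, x^2)."""
    return [x, x * x]


def kernel(x: float, z: float) -> float:
    """Kernel matching phi: phi(x)^T phi(z) = x*z + x^2 * z^2."""
    return x * z + (x * x) * (z * z)


# Illustrative training data and multipliers (assumed, not optimized).
xs = [-1.0, 0.5, 2.0]
ts = [-1, 1, 1]
alphas = [0.3, 0.0, 0.3]
b_star = 0.1

# Eq. (27): w* is a weighted sum of the mapped training examples.
w_star = [sum(a * t * phi(x)[k] for a, t, x in zip(alphas, ts, xs))
          for k in range(2)]


def y_primal(x: float) -> float:
    """Prediction through the explicit weight vector: w*^T phi(x) + b*."""
    return sum(wk * fk for wk, fk in zip(w_star, phi(x))) + b_star


def y_kernel(x: float) -> float:
    """Prediction through Eq. (29): no feature vector is ever formed."""
    return sum(a * t * kernel(xi, x)
               for a, t, xi in zip(alphas, ts, xs)) + b_star


for x_new in (-2.0, 0.0, 1.5):
    assert math.isclose(y_primal(x_new), y_kernel(x_new))
```

The example with multiplier zero contributes nothing to either form of the prediction.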
The support vectors for a trained SVM classifier are those examples that define the margin of the hypothesis on the training data, that is, the examples closest to the decision boundary in the high-dimensional feature space. Let $S = \{i : \alpha_i > 0,\ i \in \{1, \ldots, n\}\}$ denote the index set of support vectors in the solution to formulation (24). The prediction $y(x)$ for example $x$ depends only on the weights associated with these examples; i.e., the examples that define the margin of the classifier. These examples are said to “support” the margin. This comes from interpreting the Karush-Kuhn-Tucker (KKT) conditions on the solution, which require
$$\alpha_i (t_i y'(x_i) - 1) = 0, \quad \forall i \in \{1, \ldots, n\} \tag{30}$$
in an optimal solution, and so $\alpha_i > 0 \Rightarrow t_i y'(x_i) = 1$. We see that for the support vectors, the classifier provides $t_i y'(x_i) = 1$, and so these are the examples for which the constraints are tight in formulation (13) (and thus define the margin). Looking at Figure 1(a), the support vectors are the examples for which there is an arrow between them and the decision boundary. We can classify any example using only the support vectors:
$$h(x) = \operatorname{sign}\left( \sum_{i \in S} \alpha_i^* t_i K(x, x_i) + b^* \right) \tag{31}$$
where $\operatorname{sign}(z)$ returns $+1$ if $z \ge 0$ and $-1$ otherwise. The sum is over support vectors. Note that this requires computing the kernel of each support vector and the instance to be classified. Essentially, the classification process compares a new instance $x$ with each of the support vectors. The kernel value $K(x, x_i)$ (for a linear kernel, the scalar product $x^T x_i$) measures how similar the new instance $x$ is to the training instance $x_i$. Weight $\alpha_i^*$ measures the contribution of support vector $x_i$, i.e., how important the given support vector is. We multiply by $t_i$ to take into account the direction in which the given support vector influences the classification.
Finally, we need to solve for $b^*$. For this, take any example $i \in S$ and note that $t_i y'(x_i) = 1$ (since the support vectors are those for which Eq. (12) is tight). From this,
$$t_i \left( \sum_{j \in S} \alpha_j^* t_j K(x_j, x_i) + b^* \right) = 1 \tag{32}$$
Multiplying both sides by $t_i$ and using $t_i^2 = 1$ (since $t_i \in \{-1, +1\}$), we have
$$b^* = t_i - \sum_{j \in S} \alpha_j^* t_j K(x_j, x_i) \tag{33}$$
Note: in practice, given that the optimization problem is solved only to some degree of accuracy, it is standard to set $b^*$ to the average of this value computed over all examples in the support:
$$b^* = \frac{1}{|S|} \sum_{i \in S} \left( t_i - \sum_{j \in S} \alpha_j^* t_j K(x_j, x_i) \right) \tag{34}$$
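The bias computation and the support-vector classifier can be sketched together. The toy problem below is an assumption made for illustration: two one-dimensional examples, $x = 2$ with label $+1$ and $x = 0$ with label $-1$, and a linear kernel. The multipliers `alpha = [0.5, 0.5]` were worked out by hand for this toy problem. They are not produced by the code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Linear kernel K(x, z) = x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


# Toy training set and hand-derived multipliers (assumptions for the sketch).
X: List[List[float]] = [[2.0], [0.0]]
T = [1, -1]
alpha = [0.5, 0.5]

# Index set of support vectors: examples with a positive multiplier.
S = [i for i, a in enumerate(alpha) if a > 0]


def bias_from(i: int) -> float:
    """Bias implied by support vector i: t_i - sum_j alpha_j t_j K(x_j, x_i)."""
    return T[i] - sum(alpha[j] * T[j] * linear_kernel(X[j], X[i]) for j in S)


# Average the per-support-vector estimates for numerical robustness.
b_star = sum(bias_from(i) for i in S) / len(S)


def h(x: Vector) -> int:
    """Classify x using only the support vectors."""
    score = sum(alpha[i] * T[i] * linear_kernel(x, X[i]) for i in S) + b_star
    return 1 if score >= 0 else -1


print(b_star, h([3.0]), h([0.5]))  # -1.0 1 -1
```

Both support vectors give the same bias estimate here because the multipliers are exact. With a numerically approximate solution the individual estimates would differ slightly, which is why the average is used.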
The wonderful thing that happened in formulating optimization (24) is that the inputs x only enter into the formulation via kernel function K(x, x′). This allows for a very high dimensional feature space to be handled implicitly and without first computing a feature vector and then taking the scalar product. In working with SVMs, we are in fact seeking classifiers that provide linear separation of data in a higher dimensional space. Kernels are a clever way of
$(x_1, \ldots, x_m)$ to $x_1^2, x_1 x_2, x_1 x_3, \ldots, x_1 x_m, x_2 x_1, \ldots, x_2 x_m, \ldots, x_m x_{m-1}, x_m^2$.
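\begin{align}
K(x, z) &= (x^T z)^2 \tag{36} \\
        &= \Big( \sum_{j=1}^{m} x_j z_j \Big)^2 \tag{37} \\
        &= \Big( \sum_{j=1}^{m} x_j z_j \Big) \Big( \sum_{j=1}^{m} x_j z_j \Big) \tag{38} \\
        &= \sum_{j=1}^{m} \sum_{k=1}^{m} x_j x_k z_j z_k \tag{39} \\
        &= \sum_{(j,k)=(1,1)}^{(m,m)} [x_j x_k]^T [z_j z_k] \tag{40} \\
        &= \phi(x)^T \phi(z) \tag{41}
\end{align}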
We see here the advantage of a kernel. Instead of having to compute the entire feature map (which has a number of features quadratic in $m$ in this case) and then computing the scalar product directly, you only have to compute the kernel function, which is $(x^T z)^2$, and takes time linear in $m$.
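The identity $K(x, z) = \phi(x)^T \phi(z)$ for the quadratic kernel can be checked directly. In this sketch the function names and the sample vectors are assumptions. The feature map lists all $m^2$ ordered products $x_j x_k$, matching the derivation above.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi_pairs(x: Vector) -> List[float]:
    """Explicit feature map: all m^2 ordered products x_j * x_k."""
    return [xj * xk for xj in x for xk in x]


def quadratic_kernel(x: Vector, z: Vector) -> float:
    """Kernel K(x, z) = (x^T z)^2, computed in time linear in m."""
    return dot(x, z) ** 2


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

explicit = dot(phi_pairs(x), phi_pairs(z))   # m^2 multiplications and sums
implicit = quadratic_kernel(x, z)            # m multiplications, then a square

assert math.isclose(explicit, implicit)
print(len(phi_pairs(x)), implicit)  # 9 20.25
```

With $m = 3$ the explicit map already has 9 features. The kernel evaluation never constructs them.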
Example 3 The previous example generalizes to polynomial kernel functions:
$$K(x, z) = (x^T z + c)^d \tag{42}$$
whose features consist of all monomials up to degree $d$ (so that the number of features grows exponentially in $d$).
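For a concrete case of the polynomial kernel, take $d = 2$ and $m = 2$. Expanding $(x^T z + c)^2$ gives an explicit feature vector with six entries, including the scaling factors $\sqrt{2}$ and $\sqrt{2c}$. The sketch below checks this. The function names and test vectors are assumptions, and the explicit map requires $c \ge 0$.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def polynomial_kernel(x: Vector, z: Vector, c: float = 1.0, d: int = 2) -> float:
    """Polynomial kernel K(x, z) = (x^T z + c)^d."""
    return (dot(x, z) + c) ** d


def phi_poly2(x: Vector, c: float = 1.0) -> List[float]:
    """Explicit features for d = 2, m = 2 (assumes c >= 0)."""
    x1, x2 = x
    r = math.sqrt(2.0 * c)
    return [x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2, r * x1, r * x2, c]


x = [1.0, -2.0]
z = [3.0, 0.5]
c = 1.5

lhs = polynomial_kernel(x, z, c=c, d=2)
rhs = dot(phi_poly2(x, c), phi_poly2(z, c))
assert math.isclose(lhs, rhs)
```

The feature vector contains monomials of degree two, one, and zero, which is the sense in which the kernel covers all monomials up to degree $d$.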
Example 4 The Gaussian kernel is
$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right) \tag{43}$$
While this doesn’t look much like an inner product, it is a legal kernel. It compares the distances between two examples, with the importance of z to x decaying exponentially with the distance from z to x. In fact, the corresponding feature space for this kernel has an infinite dimension! A key point is that we can describe this kernel without describing the inner product space explicitly. When we come to computing with this kernel, we will be comparing training instances and computing their distance from each other.
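A minimal sketch of the Gaussian kernel follows. The function name, the default `sigma`, and the sample points are assumptions. Only the formula itself comes from the notes.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, z: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


origin = [0.0, 0.0]
near = [0.5, 0.0]
far = [3.0, 4.0]

k_self = gaussian_kernel(origin, origin)   # identical points give exactly 1
k_near = gaussian_kernel(origin, near)     # close points give a value near 1
k_far = gaussian_kernel(origin, far)       # distant points give a value near 0

assert k_self == 1.0
assert k_far < k_near < k_self
```

A larger `sigma` makes the kernel decay more slowly, so distant training examples keep more influence on a prediction.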
The bottom line is that by adopting a non-trivial kernel function, SVMs can be used to learn max-margin classifiers in some high-dimensional space. In fact, one can also just ask directly whether a function $K : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is a valid kernel, in the sense that there exists some feature mapping $\phi$ so that $K(x, z) = \phi(x)^T \phi(z)$ for all $x, z$. The result (due to Mercer), which is out of scope for this class, is that the kernel matrix, constructed by setting entry $K_{ij} = K(x_i, x_j)$ for the $i$th and $j$th points in any set of examples $\{x_1, \ldots, x_n\}$, must be symmetric positive semidefinite. In addition, if $K, K'$ are kernels then $cK$ is a kernel for $c > 0$, and $aK + bK'$ is a kernel for $a, b > 0$. So we can make complex kernels from simple ones.
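The kernel-matrix condition can be explored with a rough numerical spot check. The sketch below builds the kernel matrix for a positive combination of two kernels and tests symmetry plus $v^T K v \ge 0$ for random vectors $v$. This is a necessary condition only, not a proof of positive semidefiniteness. All function names, sample points, and tolerances are assumptions.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear(x: Vector, z: Vector) -> float:
    return dot(x, z)


def quadratic(x: Vector, z: Vector) -> float:
    return dot(x, z) ** 2


def combine(k1: Kernel, k2: Kernel, a: float, b: float) -> Kernel:
    """Return the function a*k1 + b*k2 (a kernel when a, b > 0)."""
    return lambda x, z: a * k1(x, z) + b * k2(x, z)


def gram_matrix(kernel: Kernel, points: Sequence[Vector]) -> List[List[float]]:
    """Kernel matrix with entries K_ij = kernel(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def looks_psd(K: List[List[float]], trials: int = 200,
              tol: float = 1e-9, seed: int = 0) -> bool:
    """Spot check: symmetric, and v^T K v >= 0 for random v."""
    n = len(K)
    if any(abs(K[i][j] - K[j][i]) > tol for i in range(n) for j in range(n)):
        return False
    rng = random.Random(seed)
    for _ in range(trials):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n)) < -tol:
            return False
    return True


points = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
K_good = gram_matrix(combine(linear, quadratic, 2.0, 3.0), points)
K_bad = gram_matrix(lambda x, z: -dot(x, z), points)  # not a valid kernel
print(looks_psd(K_good), looks_psd(K_bad))  # True False
```

The negated scalar product fails the check because its quadratic form is the negative of a squared norm.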
4.1 Extensions
We briefly discuss some extensions to the basic approach introduced so far.
Non-separable data, regularization and model selection. What if the training data is not linearly separable, even with a high-dimensional feature space? What if the data is noisy, and so exactly fitting the training data is not the best way to generalize? The SVM method can be easily extended to act as a “soft-margin” method in which some examples are allowed to be misclassified, with a slack variable $\xi_i \ge 0$ measuring by how much example $i$ violates its margin constraint. In particular, the new (primal) training problem is
$$\begin{aligned} \arg\min_{w,b,\xi} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & t_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \forall i \in \{1, \ldots, n\} \\ & \xi_i \ge 0, \quad \forall i \in \{1, \ldots, n\} \end{aligned} \tag{44}$$
for some C > 0. We can actually view the first term in the objective, which was derived as the component that provides for a large margin, as providing “regularization” in that it is preferring simpler weights.
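To see how the two terms of the soft-margin objective trade off, the sketch below evaluates it for fixed parameters. For a given $(w, b)$, the smallest feasible slack is $\xi_i = \max(0,\ 1 - t_i (w^T \phi(x_i) + b))$, which is what the code uses. The data, the parameters, and the identity feature map are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(x: Vector, t: int, w: Vector, b: float) -> float:
    """Smallest feasible slack for one example, assuming phi(x) = x."""
    return max(0.0, 1.0 - t * (dot(w, x) + b))


def soft_margin_objective(X: Sequence[Vector], T: Sequence[int],
                          w: Vector, b: float, C: float) -> float:
    """Objective 0.5 * ||w||^2 + C * sum of slacks."""
    total_slack = sum(slack(x, t, w, b) for x, t in zip(X, T))
    return 0.5 * dot(w, w) + C * total_slack


# Two cleanly separated examples plus one positive example on the wrong side.
X: List[List[float]] = [[2.0], [0.0], [0.5]]
T = [1, -1, 1]
w, b = [1.0], -1.0

slacks = [slack(x, t, w, b) for x, t in zip(X, T)]
print(slacks)                                      # [0.0, 0.0, 1.5]
print(soft_margin_objective(X, T, w, b, C=1.0))    # 2.0
print(soft_margin_objective(X, T, w, b, C=10.0))   # 15.5
```

A larger `C` makes each margin violation more expensive, pushing the solution towards fitting the training data. A smaller `C` puts more weight on the regularization term.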
Regression. SVMs extend easily to regression problems. The hypothesis becomes
$$y'(x) = \sum_{i=1}^{n} (\alpha_i - \hat{\alpha}_i) K(x_i, x) + b, \tag{46}$$
where $\alpha_i, \hat{\alpha}_i$ are the weights on support vectors on either side of the hypothesis, outside of an “$\epsilon$-tube” of training data that is accurately fit. The parameter $\epsilon > 0$, which is part of the training problem, is determined through (cross-)validation, with smaller $\epsilon$ leading to more complex hypotheses.
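A small sketch of the regression hypothesis follows. The weights `alpha` and `alpha_hat`, the support points, and the linear kernel are hand-picked assumptions, not the result of training. The helper `eps_insensitive_loss` is the standard loss usually associated with the tube, included here as an assumption since the notes do not state it: errors smaller than $\epsilon$ cost nothing.

```python
from typing import List, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Linear kernel K(x, z) = x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


def svr_predict(x: Vector, support_x: Sequence[Vector],
                alpha: Sequence[float], alpha_hat: Sequence[float],
                b: float) -> float:
    """Regression hypothesis: sum_i (alpha_i - alpha_hat_i) K(x_i, x) + b."""
    return sum((a - ah) * linear_kernel(xi, x)
               for a, ah, xi in zip(alpha, alpha_hat, support_x)) + b


def eps_insensitive_loss(target: float, prediction: float, eps: float) -> float:
    """Zero inside the tube of half-width eps, linear outside it."""
    return max(0.0, abs(target - prediction) - eps)


# Hand-picked illustrative values: together they give y'(x) = x.
support_x: List[List[float]] = [[1.0], [3.0]]
alpha = [0.0, 0.5]
alpha_hat = [0.5, 0.0]
b = 0.0

prediction = svr_predict([2.0], support_x, alpha, alpha_hat, b)
print(prediction)                                    # 2.0
print(eps_insensitive_loss(2.3, prediction, 0.5))    # 0.0, inside the tube
print(eps_insensitive_loss(3.0, prediction, 0.5))    # 0.5, outside the tube
```

Shrinking `eps` leaves fewer training examples inside the tube, so more of them end up carrying nonzero weight in the hypothesis.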