##### Document information

• We have been discussing SVM method for learning classifiers.

PR NPTEL course – p.1/135


• The basic idea is to transform the feature space and learn a linear classifier in the new space.

• Using Kernel functions we can do this mapping implicitly.

• Thus Kernels give us an elegant method to learn nonlinear classifiers.

• We can use the same idea in regression problems also.

**Kernel Trick**

• We use $\phi : \Re^m \to H$ to map pattern vectors into an appropriate high-dimensional space.

• A kernel function allows us to compute inner products in $H$ **implicitly**, without using (or even knowing) $\phi$.

• Through kernel functions, many algorithms that use only inner products can be implicitly executed in a high-dimensional $H$ (e.g., Fisher discriminant, regression, etc.).

• We can elegantly construct nonlinear versions of linear techniques.
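
As a small illustration of computing inner products implicitly, the sketch below compares the polynomial kernel $(1 + X^T X')^2$ on $\Re^2$ with an explicit inner product under one possible feature map. The map `phi` and all function names are illustrative assumptions; the lecture does not specify them.

```python
import math

def poly_kernel(x, z):
    """Polynomial kernel K(x, z) = (1 + x.z)^2 for 2-D inputs."""
    dot = x[0] * z[0] + x[1] * z[1]
    return (1.0 + dot) ** 2

def phi(x):
    """One explicit 6-D feature map whose inner product reproduces poly_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def inner(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x, z = (0.5, -1.0), (2.0, 3.0)
k_implicit = poly_kernel(x, z)        # never builds the 6-D vectors
k_explicit = inner(phi(x), phi(z))    # builds them and takes the inner product
print(k_implicit, k_explicit)
```

The two numbers agree up to rounding, which is the point: the kernel evaluates the inner product in the high-dimensional space without ever forming `phi(x)`.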

**Support Vector Regression**

• Now we consider the regression problem.

• Given training data $\{(X_1, y_1), \ldots, (X_n, y_n)\}$, $X_i \in \Re^m$, $y_i \in \Re$, we want to find the ‘best’ function to predict $y$ given $X$.

• We search in a parameterized class of functions

$$g(X, W) = w_1 \phi_1(X) + \cdots + w_{m'} \phi_{m'}(X) + b = W^T \Phi(X) + b,$$

where $\phi_i : \Re^m \to \Re$ are some chosen functions.

• If we choose $\phi_i(X) = x_i$ (and hence $m = m'$), then it is a linear model.

• Denoting $Z = \Phi(X) \in \Re^{m'}$, we are essentially learning a linear model in a transformed space.

• This is in accordance with the basic idea of the SVM method.

• We want to formulate the problem so that we can use the kernel idea.

• Then, by using a kernel function, we never need to compute or even precisely specify the mapping $\Phi$.

**Loss function**

• As in a general regression problem, we need to find $W$ to minimize

$$\sum_i L(y_i, g(X_i, W))$$

where $L$ is a loss function.

• This is the general strategy of empirical risk minimization.

• We consider a special loss function that allows us to use the kernel trick.

**ε-insensitive loss**

• We employ the ε-insensitive loss function:

$$L_\epsilon(y_i, g(X_i, W)) = \begin{cases} 0 & \text{if } |y_i - g(X_i, W)| < \epsilon \\ |y_i - g(X_i, W)| - \epsilon & \text{otherwise} \end{cases}$$

Here, $\epsilon$ is a parameter of the loss function.

• If the prediction is within $\epsilon$ of the true value, there is no loss.

• Using the absolute value of the error rather than the square of the error allows for better robustness.

• It also gives us an optimization problem with the right structure.
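
A minimal sketch of this loss, shown next to the squared error for comparison. The function names and the sample error values are illustrative assumptions.

```python
def eps_insensitive_loss(y, pred, eps):
    """Return 0 inside the eps-tube, and the excess absolute error outside it."""
    return max(abs(y - pred) - eps, 0.0)

def squared_loss(y, pred):
    """Ordinary squared error, for comparison."""
    return (y - pred) ** 2

# Compare how the two losses grow as the error grows (prediction fixed at 0).
for err in (0.05, 0.5, 5.0):
    print(err, eps_insensitive_loss(err, 0.0, eps=0.1), squared_loss(err, 0.0))
```

Outside the tube the ε-insensitive loss grows linearly in the error, while the squared loss grows quadratically, so a single large error has less influence on the fit.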

• We have chosen the model as $g(X, W) = \Phi(X)^T W + b$.

• Hence empirical risk minimization under the ε-insensitive loss function would minimize

$$\sum_{i=1}^{n} \max\left( |y_i - \Phi(X_i)^T W - b| - \epsilon,\; 0 \right)$$

• We can write this as an equivalent constrained optimization problem.

• We can pose the problem as follows.

$$\min_{W, b, \xi, \xi'} \; \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \xi'_i$$

$$\begin{aligned}
\text{subject to} \quad y_i - W^T \Phi(X_i) - b &\le \epsilon + \xi_i, & i &= 1, \ldots, n \\
W^T \Phi(X_i) + b - y_i &\le \epsilon + \xi'_i, & i &= 1, \ldots, n \\
\xi_i \ge 0, \; \xi'_i &\ge 0, & i &= 1, \ldots, n
\end{aligned}$$

• This does not give a dual with the structure we want.

• So, we reformulate the optimization problem.

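**The Optimization Problem**

• Find $W$, $b$ and $\xi_i, \xi'_i$ to

$$\text{minimize} \quad \frac{1}{2} W^T W + C \left( \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \xi'_i \right)$$

subject to the same constraints as in the previous formulation.

• We have added the term $W^T W$ to the objective function. This is like a model-complexity term in a regularization context.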

• As earlier, we can form the Lagrangian and then, using the Kuhn-Tucker conditions, get the optimal values of $W$ and $b$.

• Given that this problem is similar to the earlier one, we would get $W^*$ in terms of the optimal Lagrange multipliers, as earlier.

• Essentially, the Lagrange multipliers corresponding to the inequality constraints on the errors would be the determining factors.

• We can use the same technique as earlier to formulate the dual to solve for the optimal Lagrange multipliers.
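
For reference, the Lagrangian step can be written out as follows. This is a sketch under two assumptions: that the reformulated problem keeps the two error constraints and the nonnegativity constraints of the earlier formulation, and that we name the multipliers $\alpha_i, \alpha'_i \ge 0$ (for the two error constraints) and $\gamma_i, \gamma'_i \ge 0$ (for $\xi_i \ge 0$, $\xi'_i \ge 0$). The names $\gamma_i, \gamma'_i$ are introduced here only for illustration.

```latex
\begin{align*}
L &= \tfrac{1}{2} W^T W + C \sum_{i=1}^{n} (\xi_i + \xi'_i)
     + \sum_{i=1}^{n} \alpha_i \bigl( y_i - W^T \Phi(X_i) - b - \epsilon - \xi_i \bigr) \\
  &\quad + \sum_{i=1}^{n} \alpha'_i \bigl( W^T \Phi(X_i) + b - y_i - \epsilon - \xi'_i \bigr)
     - \sum_{i=1}^{n} \gamma_i \xi_i - \sum_{i=1}^{n} \gamma'_i \xi'_i \\[6pt]
\frac{\partial L}{\partial W} = 0 \quad &\Rightarrow \quad W = \sum_{i=1}^{n} (\alpha_i - \alpha'_i)\, \Phi(X_i) \\
\frac{\partial L}{\partial b} = 0 \quad &\Rightarrow \quad \sum_{i=1}^{n} (\alpha_i - \alpha'_i) = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 \quad &\Rightarrow \quad C - \alpha_i - \gamma_i = 0 \\
\frac{\partial L}{\partial \xi'_i} = 0 \quad &\Rightarrow \quad C - \alpha'_i - \gamma'_i = 0
\end{align*}
```

Since $\gamma_i, \gamma'_i \ge 0$, the last two conditions bound the multipliers: $0 \le \alpha_i, \alpha'_i \le C$. Substituting the expression for $W$ back into $L$ leaves a function of $\alpha, \alpha'$ that involves the data only through inner products $\Phi(X_i)^T \Phi(X_j)$.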


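**The dual**

• The dual of this problem is

$$\max_{\alpha, \alpha'} \; \sum_{i=1}^{n} y_i (\alpha_i - \alpha'_i) - \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha'_i) - \frac{1}{2} \sum_{i,j} (\alpha_i - \alpha'_i)(\alpha_j - \alpha'_j)\, \Phi(X_i)^T \Phi(X_j)$$

$$\text{subject to} \quad \sum_{i=1}^{n} (\alpha_i - \alpha'_i) = 0, \qquad 0 \le \alpha_i, \alpha'_i \le C, \quad i = 1, \ldots, n$$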

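**The solution**

• We can use the Kuhn-Tucker conditions to derive the final optimal values of $W$ and $b$, as earlier.

• This gives us

$$W^* = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{*\prime})\, \Phi(X_i)$$

$$b^* = y_j - \Phi(X_j)^T W^* - \epsilon, \quad \text{for any } j \text{ such that } 0 < \alpha_j^* < C$$

(Equivalently, $b^* = y_j - \Phi(X_j)^T W^* + \epsilon$ for any $j$ such that $0 < \alpha_j^{*\prime} < C$.)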

• Note that we have $\alpha_i^* \alpha_i^{*\prime} = 0$. Also, $\alpha_i^*$ and $\alpha_i^{*\prime}$ are both zero for examples where the error is less than $\epsilon$.

• The final $W$ is a linear combination of some of the examples: the support vectors.

• Note that the dual and the final solution are such that we can use the kernel trick.

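• Let $K(X, X') = \Phi(X)^T \Phi(X')$.

• The optimal model learnt is

$$\begin{aligned}
g(X, W^*) &= \sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{*\prime})\, \Phi(X_i)^T \Phi(X) + b^* \\
&= \sum_{i=1}^{n} (\alpha_i^* - \alpha_i^{*\prime})\, K(X_i, X) + b^*
\end{aligned}$$

• As earlier, $b^*$ can also be written in terms of the kernel function.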
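
The kernel form of the learnt model can be sketched in a few lines. Everything below is illustrative: the function names, the Gaussian kernel width `sigma`, and the coefficient values are assumptions chosen only to exercise the formula; they do not come from a trained model.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def svr_predict(x, support_X, coef, b, kernel):
    """g(x) = sum_i coef_i * K(X_i, x) + b, where coef_i = alpha_i - alpha'_i."""
    return sum(c * kernel(xi, x) for xi, c in zip(support_X, coef)) + b

# Hypothetical values (not the result of solving any dual problem).
support_X = [(0.0,), (1.0,), (2.0,)]
coef = [0.5, -0.25, -0.25]   # sums to zero, as the dual equality constraint requires
b = 0.1
print(svr_predict((0.0,), support_X, coef, b, gaussian_kernel))
```

Only the examples with a nonzero coefficient need to be stored, and the mapping $\Phi$ never appears: prediction needs only kernel evaluations.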

**Support vector regression**

• Once again, the kernel trick allows us to learn nonlinear models using a linear method.

• For example, if we use a Gaussian kernel, we get a Gaussian RBF network as the nonlinear model. The RBF centers (the support vectors) are easily learnt here.

• The parameters are $C$, $\epsilon$ and the parameters of the kernel function.

• The basic idea of SVR can be used in many related problems.

**SV regression**

• With the ε-insensitive loss function, points whose targets are within $\epsilon$ of the prediction do not contribute any ‘loss’.

• This gives rise to some interesting robustness of the method. It can be proved that local movements of target values of points outside the ε-tube do not influence the regression.

• Robustness essentially comes through the support vector representation of the regression.

• In our formulation of the regression problem we did not explain why we added the $W^T W$ term to the objective function.

• We are essentially minimizing

$$\frac{1}{2} W^T W + C \sum_{i=1}^{n} \max\left( |y_i - \Phi(X_i)^T W - b| - \epsilon,\; 0 \right)$$

• This is ‘regularized risk minimization’.

• Then $W^T W$ is the model complexity term, which is intended to favour learning of ‘smoother’ models.
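
To make the trade-off concrete, the sketch below evaluates the regularized risk for a linear model (that is, assuming $\Phi$ is the identity). The function name, the tiny data set, and the two candidate models are illustrative assumptions.

```python
def regularized_risk(W, b, X, y, C, eps):
    """0.5 * W.W + C * sum_i max(|y_i - W.x_i - b| - eps, 0), with Phi = identity."""
    complexity = 0.5 * sum(w * w for w in W)
    empirical = 0.0
    for xi, yi in zip(X, y):
        pred = sum(w * v for w, v in zip(W, xi)) + b
        empirical += max(abs(yi - pred) - eps, 0.0)
    return complexity + C * empirical

X = [(0.0,), (1.0,), (2.0,)]
y = [0.0, 1.0, 2.0]
steep = regularized_risk([1.0], 0.0, X, y, C=1.0, eps=0.5)  # fits every point exactly
flat = regularized_risk([0.5], 0.5, X, y, C=1.0, eps=0.5)   # flatter line, still inside the tube
print(steep, flat)
```

Both candidates have zero ε-insensitive loss on this data, so the $W^T W$ term alone decides between them and prefers the flatter line.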

• Next we explain why $W^T W$ is a good term to capture the degree of smoothness of the model being fitted.

• Let $f : \Re^m \to \Re$ be a continuous function.

• Continuity means we can make $|f(X) - f(X')|$ as small as we want by taking $\|X - X'\|$ sufficiently small.

• There are ways to characterize the ‘degree of continuity’ of a function.

• We consider one such measure now.

**ε-Margin of a function**

• The ε-margin of a function $f : \Re^m \to \Re$ is

$$m_\epsilon(f) = \inf\{ \|X - X'\| : |f(X) - f(X')| \ge 2\epsilon \}$$

• The intuitive idea is: how small can $\|X - X'\|$ be while still keeping $|f(X) - f(X')|$ ‘large’?

• The larger $m_\epsilon(f)$, the smoother the function.

• If $f$ is discontinuous, then $m_\epsilon(f) = 0$ for all sufficiently small $\epsilon$.

• $m_\epsilon(f)$ can be zero even for continuous functions, e.g., $f(x) = 1/x$ on $x > 0$.

• $m_\epsilon(f) > 0$ for all $\epsilon > 0$ iff $f$ is uniformly continuous.

• A higher margin would mean the function is ‘slowly varying’ and hence is a ‘smoother’ model.

**SVR and margin**

• Consider regression with linear models. Then, $|f(X) - f(X')| = |W^T (X - X')|$.

• Among all $X, X'$ with $|W^T (X - X')| \ge 2\epsilon$, $\|X - X'\|$ is smallest when $|W^T (X - X')| = 2\epsilon$ and $(X - X')$ is parallel to $W$. That is, $X - X' = \pm \dfrac{2\epsilon W}{W^T W}$.

• Thus, $m_\epsilon(f) = \dfrac{2\epsilon}{\|W\|}$.

• Thus, in our optimization problem in SVR, minimizing $W^T W$ promotes learning of smoother models.
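
The closed form can be checked numerically for one choice of $W$ and $\epsilon$. The values and names below are arbitrary illustrative assumptions.

```python
import math

def eps_margin_linear(W, eps):
    """Closed form m_eps(f) = 2*eps / ||W|| for a linear model f(X) = W.X + b."""
    return 2.0 * eps / math.sqrt(sum(w * w for w in W))

W, eps = [3.0, 4.0], 0.5
norm_sq = sum(w * w for w in W)

# Displacement parallel to W with |W.d| = 2*eps: d = 2*eps*W / (W.W).
d = [2.0 * eps * w / norm_sq for w in W]
change = abs(sum(w * di for w, di in zip(W, d)))   # change in f along d
length = math.sqrt(sum(di * di for di in d))       # Euclidean length of d
print(change, length, eps_margin_linear(W, eps))
```

The displacement `d` changes $f$ by exactly $2\epsilon$ and has length $2\epsilon / \|W\|$, so a smaller $\|W\|$ means a larger margin.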

**Solving the SVM optimization problem**

• So far we have not considered any algorithms for solving for the SVM.

• We have to solve a constrained optimization problem to obtain the Lagrange multipliers and hence the SVM.

• Many specialized algorithms have been proposed for this.


• The optimization problem to be solved is

$$\max_{\mu} \; q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j K(X_i, X_j)$$

$$\text{subject to} \quad 0 \le \mu_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$

• This is a quadratic programming (QP) problem with interesting structure.

**Example**

• We will first consider a very simple example problem in $\Re^2$ to get a feel for the method of obtaining the SVM.

• Suppose we have 3 examples: $X_1 = (-1, 0)$, $X_2 = (1, 0)$, $X_3 = (0, 0)$, with $y_1 = y_2 = +1$ and $y_3 = -1$.

• As is easy to see, a linear classifier is not sufficient here.

• Suppose we use the kernel function $K(X, X') = (1 + X^T X')^2$.

• [Figure not reproduced: plot of this example.]

• Recall, the examples are $X_1 = (-1, 0)$, $X_2 = (1, 0)$, $X_3 = (0, 0)$.

• The objective function involves $K(X_i, X_j)$. These are given in the matrix below.

$$\left[ (1 + X_i^T X_j)^2 \right] = \begin{bmatrix} 4 & 0 & 1 \\ 0 & 4 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$
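
The kernel matrix can be reproduced directly from the three examples. The function and variable names are illustrative.

```python
def poly_kernel(x, z):
    """K(x, z) = (1 + x.z)^2."""
    return (1 + sum(a * b for a, b in zip(x, z))) ** 2

X = [(-1, 0), (1, 0), (0, 0)]

# Entry (i, j) of the matrix is K(X_i, X_j).
K = [[poly_kernel(xi, xj) for xj in X] for xi in X]
for row in K:
    print(row)
```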

• Now the objective function to be maximized is

$$q(\mu) = \sum_{i=1}^{3} \mu_i - \frac{1}{2} \left( 4\mu_1^2 + 4\mu_2^2 + \mu_3^2 - 2\mu_1\mu_3 - 2\mu_2\mu_3 \right)$$

• The constraints are $\mu_1 + \mu_2 - \mu_3 = 0$ and $\mu_i \ge 0$, $i = 1, 2, 3$.

• The Lagrangian for this problem is

$$L(\mu, \lambda, \alpha) = q(\mu) + \lambda(\mu_1 + \mu_2 - \mu_3) - \sum_{i=1}^{3} \alpha_i \mu_i$$

• Using the Kuhn-Tucker conditions, we have $\dfrac{\partial L}{\partial \mu_i} = 0$ and $\mu_1 + \mu_2 - \mu_3 = 0$.

• This gives us four equations; we have 7 unknowns. We use the complementary slackness conditions on $\alpha_i$.

• We have $\alpha_i \mu_i = 0$. Essentially, we need to guess which $\mu_i > 0$.

• In this simple problem we know all $\mu_i > 0$.

• This is because all $X_i$ would be support vectors.

• Hence we take all $\alpha_i = 0$.

• We now have four unknowns: $\mu_1, \mu_2, \mu_3, \lambda$.

• Using $\dfrac{\partial L}{\partial \mu_i} = 0$, $i = 1, 2, 3$, and feasibility, we can solve for $\mu_i$.

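• The equations are

$$\begin{aligned}
1 - 4\mu_1 + \mu_3 + \lambda &= 0 \\
1 - 4\mu_2 + \mu_3 + \lambda &= 0 \\
1 - \mu_3 + \mu_1 + \mu_2 - \lambda &= 0 \\
\mu_1 + \mu_2 - \mu_3 &= 0
\end{aligned}$$

• These give us $\lambda = 1$ and $\mu_3 = 2\mu_1 = 2\mu_2$.

• Thus we get $\mu_1 = \mu_2 = 1$ and $\mu_3 = 2$.

• This completely determines the SVM.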
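
The four equations form a small linear system in $(\mu_1, \mu_2, \mu_3, \lambda)$, so the solution can be double-checked mechanically. The sketch below uses a generic Gauss-Jordan elimination with exact fractions; the helper `solve_linear` is a hypothetical name written for this illustration and is not part of the lecture.

```python
from fractions import Fraction

def solve_linear(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination with exact rational arithmetic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        # Pick a row at or below the diagonal with a nonzero entry in this column.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[n] for row in M]

# Unknown order: mu1, mu2, mu3, lambda (constants moved to the right-hand side).
A = [[-4, 0, 1, 1],
     [0, -4, 1, 1],
     [1, 1, -1, -1],
     [1, 1, -1, 0]]
rhs = [-1, -1, -1, 0]
mu1, mu2, mu3, lam = solve_linear(A, rhs)
print(mu1, mu2, mu3, lam)
```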

• If we used the penalty constant with $C \ge 2$, we would get the same solution. (If $C < 2$, we cannot get this solution.)

• The classification of any $X$ by this SVM is by the sign of $f(X)$:

$$\begin{aligned}
f(X) &= \sum_i \mu_i y_i K(X_i, X) + b^* \\
&= K(X_1, X) + K(X_2, X) - 2K(X_3, X) + b^*
\end{aligned}$$

• Let us first calculate $b^*$.

• Recall the formula

$$b^* = y_j - \sum_i \mu_i y_i K(X_i, X_j), \quad j \text{ such that } \mu_j > 0$$

• We can take $j = 1, 2$ or $3$.

• With $j = 1$ we get $b^* = 1 - (4 + 0 - 2) = -1$.

• With $j = 3$ we get $b^* = -1 - (1 + 1 - 2) = -1$.

• If we solved our optimization problem correctly, we should get the same $b^*$ for every such $j$!

• We have $X_1 = (-1, 0)$, $X_2 = (1, 0)$, $X_3 = (0, 0)$ and $K(X, X') = (1 + X^T X')^2$.

• Hence, taking $X = (x_1, x_2)^T$, we have

$$\begin{aligned}
f(X) &= K(X_1, X) + K(X_2, X) - 2K(X_3, X) + b^* \\
&= (1 - x_1)^2 + (1 + x_1)^2 - 2(1) - 1 \\
&= 2x_1^2 - 1
\end{aligned}$$
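
The whole worked example can be replayed numerically: compute $b^*$ from each support vector, then compare the kernel form of $f$ with the closed form $2x_1^2 - 1$. The function names and the two extra test points are illustrative assumptions.

```python
def poly_kernel(x, z):
    """K(x, z) = (1 + x.z)^2."""
    return (1 + sum(a * b for a, b in zip(x, z))) ** 2

X = [(-1, 0), (1, 0), (0, 0)]
y = [1, 1, -1]
mu = [1, 1, 2]   # multipliers found analytically in the example

def bias(j):
    """b* = y_j - sum_i mu_i y_i K(X_i, X_j), using support vector j."""
    return y[j] - sum(m * yi * poly_kernel(xi, X[j]) for m, yi, xi in zip(mu, y, X))

b = bias(0)

def f(x):
    """Decision function f(x) = sum_i mu_i y_i K(X_i, x) + b*."""
    return sum(m * yi * poly_kernel(xi, x) for m, yi, xi in zip(mu, y, X)) + b

# Kernel form versus the closed form 2*x1^2 - 1, on training and extra points.
for point in X + [(0.5, 3.0), (2.0, -1.0)]:
    print(point, f(point), 2 * point[0] ** 2 - 1)
```

Every support vector gives the same bias, and each training point sits exactly on the margin, with $f(X_i) = y_i$.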

• Hence this SVM will assign class $+1$ to $X = (x_1, x_2)^T$ if $2x_1^2 \ge 1$, that is, if $|x_1| \ge \dfrac{1}{\sqrt{2}}$.

• Why not $|x_1| \ge 1/2$?

• We are maximizing the margin of the hyperplane in ‘$x^2$’-space: the boundary $x_1^2 = 1/2$ lies midway between $x_1^2 = 0$ (class $-1$) and $x_1^2 = 1$ (class $+1$).

• The final SVM is intuitively very reasonable, and we solve essentially the same problem whether we are seeking a linear classifier or a nonlinear classifier.

• Getting back to the general case, we need to solve

$$\max_{\mu} \; q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j K(X_i, X_j)$$

$$\text{subject to} \quad 0 \le \mu_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$

• We need a numerical method.

• Due to the special structure, many efficient algorithms have been proposed.

• One interesting idea: chunking.

• We optimize on only a few variables at a time.

• The dimensionality of the optimization problem is controlled.

• We keep randomly choosing the subset of variables.

• This gave rise to the first specialized algorithm for SVM: SVM Light.

• Taking chunking to extreme level – what is the smallest set of variables we can optimize on?

• We need to consider at least two variables because there is an equality constraint.

• Sequential Minimal Optimization (SMO) – works on optimizing two variables at a time.

• We can analytically find the optimum with respect to two variables.

• We need to decide which two we consider in each iteration.

• A very efficient algorithm.
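
The two-variable idea can be sketched on the toy example from earlier. This is a deliberately simplified illustration and not the full SMO algorithm: it cycles through all pairs in a fixed order instead of using a selection heuristic, it never updates the bias (the bias cancels in the update), and it runs a fixed number of sweeps instead of testing for convergence. The function name `smo_sweeps` and the values of `C` and `sweeps` are assumptions made for this sketch.

```python
def poly_kernel(x, z):
    """K(x, z) = (1 + x.z)^2."""
    return (1 + sum(a * b for a, b in zip(x, z))) ** 2

def smo_sweeps(X, y, C, kernel, sweeps=50):
    """Repeatedly apply the analytic two-variable update to every pair (i, j)."""
    n = len(X)
    K = [[kernel(a, c) for c in X] for a in X]
    mu = [0.0] * n

    def err(k):
        # f(X_k) - y_k without the bias; the bias cancels in err(i) - err(j).
        return sum(mu[t] * y[t] * K[t][k] for t in range(n)) - y[k]

    for _ in range(sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                eta = K[i][i] + K[j][j] - 2 * K[i][j]
                if eta <= 0:
                    continue  # skip degenerate pairs in this sketch
                # Interval for mu[j] that keeps both variables in [0, C]
                # while preserving sum_k y_k mu_k = 0.
                if y[i] != y[j]:
                    lo, hi = max(0.0, mu[j] - mu[i]), min(C, C + mu[j] - mu[i])
                else:
                    lo, hi = max(0.0, mu[i] + mu[j] - C), min(C, mu[i] + mu[j])
                # Unconstrained optimum along the feasible line, then clip.
                new_j = mu[j] + y[j] * (err(i) - err(j)) / eta
                new_j = min(hi, max(lo, new_j))
                mu[i] += y[i] * y[j] * (mu[j] - new_j)
                mu[j] = new_j
    return mu

X = [(-1, 0), (1, 0), (0, 0)]
y = [1, 1, -1]
mu = smo_sweeps(X, y, C=10.0, kernel=poly_kernel)
print([round(m, 6) for m in mu])
```

On this toy problem the sweeps approach the multipliers obtained analytically in the example, $\mu = (1, 1, 2)$. Each step changes two variables together precisely because the equality constraint $\sum_i y_i \mu_i = 0$ would be violated by changing only one.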