# Kernel Functions - Pattern Recognition - Lecture Slides, Slides for Advanced Engineering Dynamics. Punjab Engineering College

Description
The key points are: Kernel Functions, Mapping Implicitly, Nonlinear Classifiers, Regression Problems, Kernel Trick, Map Pattern Vectors, Fisher Discriminant, Non-Linear Versions, Support Vector Regression, Loss Function,...

PR NPTEL course – p.1/135

• We have been discussing the SVM method for learning classifiers.

• The basic idea is to transform the feature space and learn a linear classifier in the new space.

• Using kernel functions we can do this mapping implicitly.

• Thus kernels give us an elegant method to learn nonlinear classifiers.

• We can use the same idea in regression problems as well.

Kernel Trick

• We use $\phi : \Re^m \to H$ to map pattern vectors into an appropriate high-dimensional space $H$.

• A kernel function allows us to compute inner products in $H$ implicitly, without using (or even knowing) $\phi$.

• Through kernel functions, many algorithms that use only inner products can be implicitly executed in a high-dimensional $H$ (e.g., Fisher discriminant, regression, etc.).

• We can elegantly construct non-linear versions of linear techniques.
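
As a small illustration of computing inner products implicitly, the sketch below compares a degree-2 polynomial kernel with the inner product under an explicit feature map. The kernel choice, the 2-dimensional inputs, and the names `poly_kernel` and `phi` are assumptions made for this example only.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (1 + x.z)^2."""
    return (1.0 + np.dot(x, z)) ** 2

def phi(x):
    """One explicit feature map matching poly_kernel for 2-dimensional x."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

implicit = poly_kernel(x, z)              # works in the original 2-d space
explicit = float(np.dot(phi(x), phi(z)))  # works in the 6-d feature space

print(implicit, explicit)
```

Both numbers agree, but the kernel evaluation never builds the 6-dimensional vectors.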

Support Vector Regression

• Now we consider the regression problem.

• Given training data $\{(X_1, y_1), \ldots, (X_n, y_n)\}$, $X_i \in \Re^m$, $y_i \in \Re$, we want to find the ‘best’ function to predict $y$ given $X$.

• We search in a parameterized class of functions

$$g(X, W) = w_1 \phi_1(X) + \cdots + w_{m'} \phi_{m'}(X) + b = W^T \Phi(X) + b,$$

where $\phi_i : \Re^m \to \Re$ are some chosen functions.

• If we choose $\phi_i(X) = x_i$ (and hence $m' = m$), then it is a linear model.

• Denoting $Z = \Phi(X) \in \Re^{m'}$, we are essentially learning a linear model in a transformed space.

• This is in accordance with the basic idea of the SVM method.

• We want to formulate the problem so that we can use the kernel idea.

• Then, by using a kernel function, we never need to compute or even precisely specify the mapping $\Phi$.

Loss function

• As in a general regression problem, we need to find $W$ to minimize

$$\sum_{i} L(y_i, g(X_i, W))$$

where $L$ is a loss function.

• This is the general strategy of empirical risk minimization.

• We consider a special loss function that allows us to use the kernel trick.

ε-insensitive loss

• We employ the ε-insensitive loss function:

$$L_\epsilon(y_i, g(X_i, W)) = \begin{cases} 0 & \text{if } |y_i - g(X_i, W)| < \epsilon \\ |y_i - g(X_i, W)| - \epsilon & \text{otherwise} \end{cases}$$

Here, ε is a parameter of the loss function.

• If the prediction is within ε of the true value, there is no loss.

• Using the absolute value of the error rather than the square of the error allows for better robustness.

• It also gives us an optimization problem with the right structure.
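
A minimal sketch of this loss in NumPy follows. The function name `eps_insensitive_loss` and the sample numbers are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def eps_insensitive_loss(y, y_pred, eps):
    """Elementwise loss: max(|y - y_pred| - eps, 0)."""
    err = np.abs(np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(err - eps, 0.0)

# Made-up targets and predictions, for illustration only.
y_true = np.array([1.0, 1.0, 1.0, 1.0])
y_hat = np.array([1.05, 0.8, 1.5, 1.0])

losses = eps_insensitive_loss(y_true, y_hat, eps=0.1)
print(losses)  # errors inside the eps-tube cost nothing; others grow linearly
```

Errors of 0.05 and 0 fall inside the tube and incur no loss, while errors of 0.2 and 0.5 are charged only for the part exceeding ε.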

• We have chosen the model as

$$g(X, W) = \Phi(X)^T W + b.$$

• Hence empirical risk minimization under the ε-insensitive loss function would minimize

$$\sum_{i=1}^{n} \max\left( |y_i - \Phi(X_i)^T W - b| - \epsilon,\; 0 \right)$$

• We can write this as an equivalent constrained optimization problem.

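• We can pose the problem as follows.

$$\min_{W, b, \xi, \xi'} \; \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \xi'_i$$

$$\text{subject to} \quad \begin{aligned} y_i - W^T \Phi(X_i) - b &\le \epsilon + \xi_i, & i &= 1, \ldots, n \\ W^T \Phi(X_i) + b - y_i &\le \epsilon + \xi'_i, & i &= 1, \ldots, n \\ \xi_i \ge 0, \; \xi'_i &\ge 0, & i &= 1, \ldots, n \end{aligned}$$

• This does not give a dual with the structure we want.

• So, we reformulate the optimization problem.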

The Optimization Problem

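• Find $W$, $b$ and $\xi_i, \xi'_i$ to

$$\text{minimize} \quad \frac{1}{2} W^T W + C \left( \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \xi'_i \right)$$

$$\text{subject to} \quad \begin{aligned} y_i - W^T \Phi(X_i) - b &\le \epsilon + \xi_i, & i &= 1, \ldots, n \\ W^T \Phi(X_i) + b - y_i &\le \epsilon + \xi'_i, & i &= 1, \ldots, n \\ \xi_i \ge 0, \; \xi'_i &\ge 0, & i &= 1, \ldots, n \end{aligned}$$

• We have added the term $W^T W$ to the objective function. This plays the role of a model complexity term in a regularization context.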

• As earlier, we can form the Lagrangian and then, using the Kuhn-Tucker conditions, get the optimal values of $W$ and $b$.

• Given that this problem is similar to the earlier one, we would get $W^*$ in terms of the optimal Lagrange multipliers as earlier.

• Essentially, the Lagrange multipliers corresponding to the inequality constraints on the errors would be the determining factors.

• We can use the same technique as earlier to formulate the dual and solve for the optimal Lagrange multipliers.
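
The steps above can be sketched as follows. The multiplier names are assumptions made here: $\alpha_i, \alpha'_i \ge 0$ for the two error constraints and $\eta_i, \eta'_i \ge 0$ for the constraints $\xi_i \ge 0$, $\xi'_i \ge 0$. The symbols $\eta_i, \eta'_i$ do not appear in the slides.

$$\begin{aligned} \mathcal{L} ={} & \frac{1}{2} W^T W + C \sum_{i=1}^{n} (\xi_i + \xi'_i) - \sum_{i=1}^{n} \alpha_i \left( \epsilon + \xi_i - y_i + W^T \Phi(X_i) + b \right) \\ & - \sum_{i=1}^{n} \alpha'_i \left( \epsilon + \xi'_i + y_i - W^T \Phi(X_i) - b \right) - \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta'_i \xi'_i \right) \end{aligned}$$

Setting the derivatives with respect to the primal variables to zero gives

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial W} = 0 \;&\Rightarrow\; W = \sum_{i=1}^{n} (\alpha_i - \alpha'_i)\, \Phi(X_i) \\ \frac{\partial \mathcal{L}}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{n} (\alpha_i - \alpha'_i) = 0 \\ \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;&\Rightarrow\; C - \alpha_i - \eta_i = 0 \\ \frac{\partial \mathcal{L}}{\partial \xi'_i} = 0 \;&\Rightarrow\; C - \alpha'_i - \eta'_i = 0 \end{aligned}$$

Since $\eta_i, \eta'_i \ge 0$, the last two conditions bound the multipliers as $0 \le \alpha_i, \alpha'_i \le C$. Substituting these conditions back into $\mathcal{L}$ eliminates $W$, $b$, $\xi$ and $\xi'$, leaving a problem in $\alpha, \alpha'$ alone.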

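The dual

• The dual of this problem is

$$\max_{\alpha, \alpha'} \; \sum_{i=1}^{n} y_i (\alpha_i - \alpha'_i) - \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha'_i) - \frac{1}{2} \sum_{i,j} (\alpha_i - \alpha'_i)(\alpha_j - \alpha'_j)\, \Phi(X_i)^T \Phi(X_j)$$

$$\text{subject to} \quad \sum_{i=1}^{n} (\alpha_i - \alpha'_i) = 0, \qquad 0 \le \alpha_i, \alpha'_i \le C, \quad i = 1, \ldots, n$$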

The solution

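• We can use the Kuhn-Tucker conditions to derive the final optimal values of $W$ and $b$ as earlier.

• This gives us

$$W^* = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i'^*)\, \Phi(X_i)$$

$$b^* = y_j - \Phi(X_j)^T W^* - \epsilon, \quad \text{for any } j \text{ such that } 0 < \alpha_j^* < C$$

For such a $j$, complementary slackness gives $\xi_j = 0$ and makes the first constraint active, so $y_j - \Phi(X_j)^T W^* - b^* = \epsilon$.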

• Note that we have $\alpha_i^* \alpha_i'^* = 0$. Also, $\alpha_i^*$ and $\alpha_i'^*$ are both zero for examples where the error is less than ε.

• The final $W$ is a linear combination of some of the examples: the support vectors.

• Note that the dual and the final solution are such that we can use the kernel trick.

• Let $K(X, X') = \Phi(X)^T \Phi(X')$.

• The optimal model learnt is

$$g(X, W^*) = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i'^*)\, \Phi(X_i)^T \Phi(X) + b^* = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i'^*)\, K(X_i, X) + b^*$$

• As earlier, $b^*$ can also be written in terms of the kernel function.
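
The kernel form of the learnt model can be sketched as below. The coefficients stand for $\alpha_i^* - \alpha_i'^*$, but the numbers are hypothetical and were not produced by an SVR solver. The names `svr_predict` and `linear_kernel` are likewise illustrative. A linear kernel is used so that the result can be checked against the explicit $W^{*T} X + b^*$.

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x.z, i.e. Phi is the identity map."""
    return float(np.dot(x, z))

def svr_predict(x, X_train, coef, b, kernel):
    """g(x) = sum_i coef[i] * K(X_i, x) + b, with coef[i] = alpha_i - alpha'_i."""
    return sum(c * kernel(xi, x) for c, xi in zip(coef, X_train)) + b

# Hypothetical values, chosen only to exercise the formula.
X_train = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
coef = np.array([0.5, -0.25, 0.0])  # zero coefficient: not a support vector
b = 0.1
x_new = np.array([2.0, 1.0])

kernel_form = svr_predict(x_new, X_train, coef, b, linear_kernel)

# With a linear kernel, W can be formed explicitly for comparison.
W = coef @ X_train
primal_form = float(W @ x_new) + b

print(kernel_form, primal_form)
```

Swapping `linear_kernel` for any other valid kernel changes the model class without changing `svr_predict`.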

Support vector regression

• Once again, the kernel trick allows us to learn non-linear models using a linear method.

• For example, if we use a Gaussian kernel, we get a Gaussian RBF network as the nonlinear model. The RBF centers are the support vectors, so they are learnt as part of training.

• The parameters to be chosen are $C$, ε and the parameters of the kernel function.

• The basic idea of SVR can be used in many related problems.

SV regression

• With the ε-insensitive loss function, points whose targets are within ε of the prediction do not contribute any ‘loss’.

• This gives rise to some interesting robustness of the method. It can be proved that local movements of target values of points outside the ε-tube do not influence the regression.

• Robustness essentially comes through the support vector representation of the regression.

• In our formulation of the regression problem we did not explain why we added the $W^T W$ term to the objective function.

• We are essentially minimizing

$$\frac{1}{2} W^T W + C \sum_{i=1}^{n} \max\left( |y_i - \Phi(X_i)^T W - b| - \epsilon,\; 0 \right)$$

• This is ‘regularized risk minimization’.

• Then $W^T W$ is the model complexity term, which is intended to favour learning of ‘smoother’ models.

• Next we explain why $W^T W$ is a good term to capture the degree of smoothness of the model being fitted.

• Let $f : \Re^m \to \Re$ be a continuous function.

• Continuity means we can make $|f(X) - f(X')|$ as small as we want by taking $\|X - X'\|$ sufficiently small.

• There are ways to characterize the ‘degree of continuity’ of a function.

• We consider one such measure now.

ε-Margin of a function

• The ε-margin of a function $f : \Re^m \to \Re$ is

$$m_\epsilon(f) = \inf\{ \|X - X'\| \, : \, |f(X) - f(X')| \ge 2\epsilon \}$$

• The intuitive idea is: how small can $\|X - X'\|$ be, while still keeping $|f(X) - f(X')|$ ‘large’?

• The larger $m_\epsilon(f)$ is, the smoother the function.

• If $f$ is discontinuous, then $m_\epsilon(f) = 0$ for all sufficiently small ε.

• $m_\epsilon(f)$ can be zero even for continuous functions, e.g., $f(x) = 1/x$ on $x > 0$.

• $m_\epsilon(f) > 0$ for all $\epsilon > 0$ iff $f$ is uniformly continuous.

• A higher margin would mean the function is ‘slowly varying’ and hence is a ‘smoother’ model.

SVR and margin

• Consider regression with linear models. Then,

$$|f(X) - f(X')| = |W^T (X - X')|.$$

• Among all $X, X'$ with $|W^T (X - X')| \ge 2\epsilon$, $\|X - X'\|$ is smallest when $|W^T (X - X')| = 2\epsilon$ and $(X - X')$ is parallel to $W$. That is,

$$X - X' = \pm \frac{2 \epsilon W}{W^T W}.$$

• Thus, $m_\epsilon(f) = \dfrac{2\epsilon}{\|W\|}$.

• Thus in our optimization problem in SVR, minimizing $W^T W$ promotes learning of smoother models.
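
The margin formula for linear models can be checked numerically. In this sketch the weight vector, the value of ε and the name `eps_margin_linear` are arbitrary choices for illustration.

```python
import numpy as np

def eps_margin_linear(W, eps):
    """eps-margin of f(X) = W.X + b, namely 2 * eps / ||W||."""
    return 2.0 * eps / np.linalg.norm(W)

W = np.array([3.0, 4.0])  # arbitrary weight vector with ||W|| = 5
eps = 0.5

# Smallest displacement that changes f by 2 * eps: parallel to W.
d = 2.0 * eps * W / (W @ W)

change_in_f = abs(W @ d)          # equals 2 * eps
distance = np.linalg.norm(d)      # equals the eps-margin
margin = eps_margin_linear(W, eps)

print(change_in_f, distance, margin)
```

Doubling `W` halves the margin, which is why penalizing $W^T W$ favours slowly varying models.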

Solving the SVM optimization problem

• So far we have not considered any algorithms for solving for the SVM.

• We have to solve a constrained optimization problem to obtain the Lagrange multipliers and hence the SVM.

• Many specialized algorithms have been proposed for this.

• The optimization problem to be solved is

$$\max_{\mu} \; q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j K(X_i, X_j)$$

$$\text{subject to} \quad 0 \le \mu_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$

• This is a quadratic programming (QP) problem with interesting structure.

Example

• We will first consider a very simple example problem in $\Re^2$ to get a feel for the method of obtaining the SVM.

• Suppose we have 3 examples: $X_1 = (-1, 0)$, $X_2 = (1, 0)$, $X_3 = (0, 0)$, with $y_1 = y_2 = +1$ and $y_3 = -1$.

• As is easy to see, a linear classifier is not sufficient here, because the negative example lies between the two positive examples on the same line.

• Suppose we use the kernel function $K(X, X') = (1 + X^T X')^2$.

• This example is shown below.

• Recall, the examples are

$$X_1 = (-1, 0), \quad X_2 = (1, 0), \quad X_3 = (0, 0)$$

• The objective function involves $K(X_i, X_j)$. These are given in the matrix below.

$$\left[ (1 + X_i^T X_j)^2 \right] = \begin{bmatrix} 4 & 0 & 1 \\ 0 & 4 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$
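
The kernel matrix can be reproduced in a few lines of NumPy. This is only a check of the arithmetic. The variable names, and the matrix `Q` of products $y_i y_j K(X_i, X_j)$ that enters the quadratic term of the dual objective, are introduced here for illustration.

```python
import numpy as np

# The three examples and their labels from the text.
X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])

# K[i, j] = (1 + X_i . X_j)^2 for every pair of examples.
K = (1.0 + X @ X.T) ** 2

# Q[i, j] = y_i * y_j * K[i, j], the matrix of the quadratic term.
Q = np.outer(y, y) * K

print(K)
print(Q)
```

The sign pattern of `Q` explains the negative cross terms between $\mu_3$ and the other two multipliers in the quadratic part of the objective.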

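• Now the objective function to be maximized is

$$q(\mu) = \sum_{i=1}^{3} \mu_i - \frac{1}{2} \left( 4\mu_1^2 + 4\mu_2^2 + \mu_3^2 - 2\mu_1 \mu_3 - 2\mu_2 \mu_3 \right)$$

• The constraints are

$$\mu_1 + \mu_2 - \mu_3 = 0; \quad \text{and} \quad \mu_i \ge 0, \; i = 1, 2, 3.$$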

• The Lagrangian for this problem is

$$L(\mu, \lambda, \alpha) = q(\mu) + \lambda (\mu_1 + \mu_2 - \mu_3) - \sum_{i=1}^{3} \alpha_i \mu_i$$

• Using the Kuhn-Tucker conditions, we have $\dfrac{\partial L}{\partial \mu_i} = 0$ and $\mu_1 + \mu_2 - \mu_3 = 0$.

• This gives us four equations; we have 7 unknowns. We use the complementary slackness conditions on $\alpha_i$.

• We have $\alpha_i \mu_i = 0$. Essentially, we need to guess which $\mu_i > 0$.

• In this simple problem we know all $\mu_i > 0$.

• This is because all $X_i$ would be support vectors.

• Hence we take all $\alpha_i = 0$.

• We now have four unknowns: $\mu_1, \mu_2, \mu_3, \lambda$.

• Using $\dfrac{\partial L}{\partial \mu_i} = 0$, $i = 1, 2, 3$, and feasibility, we can solve for $\mu_i$.

• The equations are

$$\begin{aligned} 1 - 4\mu_1 + \mu_3 + \lambda &= 0 \\ 1 - 4\mu_2 + \mu_3 + \lambda &= 0 \\ 1 - \mu_3 + \mu_1 + \mu_2 - \lambda &= 0 \\ \mu_1 + \mu_2 - \mu_3 &= 0 \end{aligned}$$

• These give us λ = 1
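
These four linear equations can also be solved numerically as a check. The sketch below assumes the unknowns are ordered as $(\mu_1, \mu_2, \mu_3, \lambda)$ and moves the constant terms to the right-hand side. The array names are chosen here for illustration.

```python
import numpy as np

# Rows are the four equations; columns are (mu1, mu2, mu3, lambda).
A = np.array([
    [-4.0,  0.0,  1.0,  1.0],   # 1 - 4*mu1 + mu3 + lambda = 0
    [ 0.0, -4.0,  1.0,  1.0],   # 1 - 4*mu2 + mu3 + lambda = 0
    [ 1.0,  1.0, -1.0, -1.0],   # 1 - mu3 + mu1 + mu2 - lambda = 0
    [ 1.0,  1.0, -1.0,  0.0],   # mu1 + mu2 - mu3 = 0
])
rhs = np.array([-1.0, -1.0, -1.0, 0.0])

sol = np.linalg.solve(A, rhs)
mu, lam = sol[:3], sol[3]

print("mu =", mu, "lambda =", lam)
```

The solver returns $\lambda = 1$ together with $\mu = (1, 1, 2)$. All three multipliers are positive, which is consistent with the assumption that every example is a support vector.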
