• We have been discussing SVM method for learning classifiers.

PR NPTEL course – p.1/135

• The basic idea is to transform the feature space and learn a linear classifier in the new space.

• Using Kernel functions we can do this mapping implicitly.

• Thus Kernels give us an elegant method to learn nonlinear classifiers.

• We can use the same idea in regression problems also.

**Kernel Trick**

• We use φ : ℜ^m → H to map pattern vectors into an appropriate high-dimensional space.

• The kernel function allows us to compute inner products in H **implicitly**, without using (or even knowing) φ.

• Through kernel functions, many algorithms that use only inner products can be implicitly executed in a high-dimensional H (e.g., Fisher discriminant, regression, etc.).

• We can elegantly construct non-linear versions of linear techniques.
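As an illustration of the kernel trick, the following sketch uses a degree-2 polynomial kernel on ℜ² (this particular kernel and the numeric values are illustrative choices, not from the slides). It compares an explicit feature map φ with the implicit kernel computation; that (xᵀy + 1)² equals φ(x)ᵀφ(y) for this φ is a standard identity.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 polynomial feature map phi : R^2 -> R^6
    # (hypothetical choice for illustration; the kernel trick lets us avoid it).
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2.0) * x1, np.sqrt(2.0) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2.0) * x1 * x2])

def poly_kernel(x, y):
    # k(x, y) = (x^T y + 1)^2 equals phi(x)^T phi(y), computed without phi.
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))  # explicit inner product in H -> 4.0
print(poly_kernel(x, y))       # same value, implicitly      -> 4.0
```

Note that the explicit map lives in ℜ⁶ for m = 2; for higher degrees and dimensions it grows rapidly, which is exactly why computing the inner product implicitly is attractive.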

**Support Vector Regression**

• Now we consider the regression problem.

• Given training data {(X_1, y_1), . . . , (X_n, y_n)}, X_i ∈ ℜ^m, y_i ∈ ℜ, we want to find the ‘best’ function to predict y given X.

• We search in a parameterized class of functions

g(X, W) = w_1 φ_1(X) + · · · + w_{m′} φ_{m′}(X) + b = W^T Φ(X) + b,

where φ_i : ℜ^m → ℜ are some chosen functions.
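A minimal sketch of evaluating such a model, with a hypothetical choice of basis functions φ_i and illustrative weights (neither is prescribed by the slides):

```python
import numpy as np

# Hypothetical basis functions phi_i : R^2 -> R, so m = 2 and m' = 3.
basis = [lambda X: X[0],
         lambda X: X[1],
         lambda X: X[0] * X[1]]

def g(X, W, b):
    # g(X, W) = w_1 phi_1(X) + ... + w_m' phi_m'(X) + b = W^T Phi(X) + b
    Phi = np.array([phi(X) for phi in basis])
    return float(W @ Phi + b)

W = np.array([0.5, -1.0, 2.0])   # example weight vector in R^3
X = np.array([1.0, 2.0])
print(g(X, W, b=0.1))            # 0.5*1 - 1.0*2 + 2.0*(1*2) + 0.1 = 2.6
```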

• If we choose φ_i(X) = x_i (and hence m = m′), then it is a linear model.

• Denoting Z = Φ(X) ∈ ℜ^{m′}, we are essentially learning a linear model in a transformed space.

• This is in accordance with the basic idea of the SVM method.

• We want to formulate the problem so that we can use the kernel idea.

• Then, by using a kernel function, we never need to compute or even precisely specify the mapping Φ.

**Loss function**

• As in a general regression problem, we need to find W to minimize

∑_i L(y_i, g(X_i, W))

where L is a loss function.

• This is the general strategy of empirical risk minimization.

• We consider a special loss function that allows us to use the kernel trick.

**ε-insensitive loss**

• We employ the ε-insensitive loss function:

L_ε(y_i, g(X_i, W)) = 0, if |y_i − g(X_i, W)| < ε
L_ε(y_i, g(X_i, W)) = |y_i − g(X_i, W)| − ε, otherwise

Here, ε is a parameter of the loss function.
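The loss above translates directly into code; a minimal sketch (the numeric values are only illustrative):

```python
def eps_insensitive_loss(y, y_hat, eps):
    # L_eps(y, y_hat) = 0                 if |y - y_hat| < eps
    #                 = |y - y_hat| - eps otherwise
    return max(abs(y - y_hat) - eps, 0.0)

print(eps_insensitive_loss(1.0, 1.05, eps=0.1))  # inside the tube  -> 0.0
print(eps_insensitive_loss(1.0, 1.50, eps=0.1))  # outside the tube -> 0.4
```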

• If the prediction is within ε of the true value, there is no loss.

• Using the absolute value of the error, rather than the square of the error, allows for better robustness to outliers.

• It also gives us an optimization problem with the right structure.

• We have chosen the model as:

g(X, W) = Φ(X)^T W + b.

• Hence, empirical risk minimization under the ε-insensitive loss function would minimize

∑_{i=1}^{n} max( |y_i − Φ(X_i)^T W − b| − ε, 0 )
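This objective can be sketched as follows (the data, weights, and ε value below are hypothetical, chosen purely to show the computation):

```python
import numpy as np

def empirical_risk(Phi_X, y, W, b, eps):
    # sum_{i=1}^{n} max(|y_i - Phi(X_i)^T W - b| - eps, 0)
    residuals = np.abs(y - Phi_X @ W - b)
    return float(np.sum(np.maximum(residuals - eps, 0.0)))

# Toy data: n = 3 points, already mapped by Phi into R^2 (illustrative values).
Phi_X = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.5])
W = np.array([1.0, 2.0])

# Residuals are 0, 0, 0.5; only the third exceeds eps = 0.2, so the
# risk is approximately 0.3.
print(empirical_risk(Phi_X, y, W, b=0.0, eps=0.2))
```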
