Support Vector Machines: Hard and Soft Margins, Kernel Functions, and Ensemble Learning

These study notes cover support vector machines (SVMs) with hard and soft margins, the geometric margin, and the solutions to the SVM optimization problem. They also introduce the idea of mapping inputs to higher-dimensional feature spaces to handle linearly inseparable cases, present several kernel functions, and conclude with an overview of ensemble learning, including bagging.

Lecture 11

Oct 17, 2007

SVM without soft margin

  • Set functional margins yi(w·xi + b) ≥ 1 and maximize the geometric margin by minimizing ||w||.

[Figure: maximum-margin hyperplane separating the negative class from the positive class]

SVM with soft margin

  • Set functional margins yi(w·xi + b) ≥ 1 − ξi and minimize ||w||² + C ∑i ξi.

[Figure: soft-margin hyperplane; the slack variables ξi allow some points to fall inside the margin]

Solutions to SVM

  • No soft margin:   w = ∑i αi yi xi,   s.t.  ∑i αi yi = 0  and  αi ≥ 0   (sums over i = 1, …, N)
  • With soft margin: w = ∑i αi yi xi,   s.t.  ∑i αi yi = 0  and  0 ≤ αi ≤ C

For classifying a new input z

Compute

    w·z + b = (∑j αtj ytj xtj)·z + b = ∑j αtj ytj (xtj·z) + b

where the sums run over the s support vectors xt1, …, xts, and classify z as + if the result is positive, and − otherwise.

Note: w need not be formed explicitly; we can classify z by taking inner products with the support vectors.
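The dual-form decision rule above maps directly to code. Here is a minimal sketch (not part of the original notes) that assumes the multipliers alpha, labels y, support vectors X_sv, and bias b have already been obtained from training:

```python
import numpy as np

def svm_decision(z, alpha, y, X_sv, b):
    """Dual-form decision value: sum_j alpha_j * y_j * (x_j . z) + b.

    alpha : (s,) Lagrange multipliers of the support vectors (assumed given)
    y     : (s,) labels in {+1, -1} of the support vectors
    X_sv  : (s, m) support vectors
    b     : bias term
    """
    return np.sum(alpha * y * (X_sv @ z)) + b

def svm_classify(z, alpha, y, X_sv, b):
    # classify z as +1 if the decision value is positive, -1 otherwise
    return 1 if svm_decision(z, alpha, y, X_sv, b) > 0 else -1
```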

Mapping the input to a higher-dimensional space can solve the linearly inseparable cases.

[Figure: points on a 1-D line x that cannot be separated linearly become separable after the mapping x → (x, x²)]

Non-linear SVMs: Feature Spaces

  • General idea: for any data set, the original input space can always be mapped to some higher-dimensional feature space such that the data is linearly separable:

    x → Φ(x)

Example: Quadratic Feature Space

  • Assume m input dimensions: x = (x1, x2, …, xm)
  • Map x to all terms up to degree two: Φ(x) = (1, √2·x1, …, √2·xm, x1², …, xm², √2·x1x2, …, √2·xm−1xm)
  • Number of quadratic terms: 1 + m + m + m(m−1)/2 ≈ m²
  • The number of dimensions increases rapidly!

You may be wondering about the √2's. At least they won't hurt anything! You will find out why they are there soon!
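As an illustration (not from the original notes), here is a sketch of this quadratic feature map; phi_quadratic is a hypothetical helper name. Running it shows how quickly the feature dimension 1 + m + m + m(m−1)/2 grows with m.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit quadratic feature map: (1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    m = x.shape[0]
    cross = [np.sqrt(2.0) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, x ** 2, cross))

for m in (2, 10, 100):
    print(m, phi_quadratic(np.ones(m)).shape[0])  # 1 + 2m + m(m-1)/2 dimensions
```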

Dot product in quadratic feature space

Computing Φ(a)·Φ(b) directly gives

    Φ(a)·Φ(b) = 1 + 2 ∑i ai bi + ∑i (ai bi)² + 2 ∑i ∑j>i ai aj bi bj

which takes O(m²) operations.

Now let's just look at another interesting function of (a·b):

    (a·b + 1)²
      = (a·b)² + 2(a·b) + 1
      = (∑i ai bi)² + 2 ∑i ai bi + 1
      = ∑i ∑j ai bi aj bj + 2 ∑i ai bi + 1
      = ∑i (ai bi)² + 2 ∑i ∑j>i ai bi aj bj + 2 ∑i ai bi + 1

(all sums run from 1 to m)

They are the same! And the latter only takes O(m) to compute!
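A quick numerical check of this identity (my own sketch, building the explicit quadratic map again so the snippet is self-contained):

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map: (1, sqrt(2)x_i, x_i^2, sqrt(2)x_i x_j for i < j)
    m = len(x)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, x ** 2, cross))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(5), rng.standard_normal(5)

lhs = phi(a) @ phi(b)        # O(m^2): build the feature vectors, then dot them
rhs = (a @ b + 1) ** 2       # O(m): the kernel-trick shortcut
print(np.isclose(lhs, rhs))  # True
```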

Kernel Functions

  • If every data point is mapped into a high-dimensional space via some transformation x → φ(x), the dot product that we need to compute for classifying a point x becomes φ(xi)·φ(x) for all support vectors xi.
  • A kernel function is a function that is equivalent to an inner product in some feature space:

    k(a, b) = φ(a)·φ(b)

  • We have seen the example k(a, b) = (a·b + 1)², which is equivalent to mapping to the quadratic feature space!

More kernel functions

  • Linear kernel: k(a, b) = a·b
  • Polynomial kernel: k(a, b) = (a·b + 1)^d
  • Radial-Basis-Function (RBF) kernel: k(a, b) = exp(−||a − b||² / (2σ²))

    In this case, the corresponding mapping φ(x) is infinite-dimensional! Lucky that we don't have to compute the mapping explicitly!
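For concreteness, here is a minimal sketch (not from the notes) of these three kernels as plain functions; d and sigma are the kernel parameters named above.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, d=2):
    return (a @ b + 1) ** d

def rbf_kernel(a, b, sigma=1.0):
    # exp(-||a - b||^2 / (2 sigma^2)); its feature space is infinite-dimensional
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))
```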

With a kernel function, classifying a new input z becomes

    w·Φ(z) + b = ∑j αtj ytj (Φ(xtj)·Φ(z)) + b = ∑j αtj ytj K(xtj, z) + b

where the sum runs over the s support vectors xt1, …, xts.

Note: We will not get into the details but the learning of w can be achieved by using kernel functions as well!
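A sketch of this kernelized decision rule (my own illustration, again assuming the multipliers, labels, support vectors, and bias come from training; kernel can be any of the functions above):

```python
def kernel_decision(z, alpha, y, X_sv, b, kernel):
    """sum_j alpha_j * y_j * K(x_j, z) + b, using only the support vectors."""
    return sum(a_j * y_j * kernel(x_j, z)
               for a_j, y_j, x_j in zip(alpha, y, X_sv)) + b

def kernel_classify(z, alpha, y, X_sv, b, kernel):
    # + if the kernelized decision value is positive, - otherwise
    return 1 if kernel_decision(z, alpha, y, X_sv, b, kernel) > 0 else -1
```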

Nonlinear SVM summary

• Map the input space to a high-dimensional feature space and learn a linear decision boundary in the feature space.
• The decision boundary will be nonlinear in the original input space.
• Many possible choices of kernel functions.
  – How to choose? The most frequently used method is cross-validation (see the sketch below).
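One common way to carry out that cross-validation is shown below. This sketch uses scikit-learn, which is not part of these notes; the data set and parameter grid are placeholders.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy nonlinearly separable data standing in for a real training set
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# 5-fold cross-validation over several kernels and their parameters
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1, 10], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # kernel (and parameters) chosen by cross-validation
```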

Ensemble Learning

  • So far we have designed learning algorithms that take a training set and output a classifier.
  • What if we want more accuracy than current algorithms afford?
    – Develop new learning algorithms
    – Improve existing algorithms
  • Another approach is to leverage the algorithms we have via ensemble methods:
    – Instead of calling an algorithm just once and using its classifier,
    – call the algorithm multiple times and combine the multiple classifiers.

What is Ensemble Learning?

[Diagram comparing the two settings]

• Traditional: a single training set S goes to one learner L1, which outputs one hypothesis h1; a new input (x, ?) is labeled y* = h1(x).
• Ensemble method: different training sets and/or learning algorithms yield learners L1, L2, …, LS and hypotheses h1, h2, …, hS, which are combined into h* = F(h1, h2, …, hS); a new input (x, ?) is labeled y* = h*(x).
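As a concrete illustration of one possible combining function F (my own sketch; the notes later use exactly this majority-vote idea for bagging):

```python
from collections import Counter

def majority_vote(hypotheses, x):
    """Combine classifiers h_1, ..., h_S by a simple majority vote on input x."""
    predictions = [h(x) for h in hypotheses]
    return Counter(predictions).most_common(1)[0][0]

# usage sketch with three toy threshold "classifiers" on a scalar input
hs = [lambda x: 1 if x > 0 else -1,
      lambda x: 1 if x > 1 else -1,
      lambda x: 1 if x > -1 else -1]
print(majority_vote(hs, 0.5))   # -> 1 (two of the three vote +1)
```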

Ensemble Learning

  • INTUITION: combining the predictions of multiple classifiers (an ensemble) is often more accurate than using a single classifier.
  • Justification:
    – It is easy to find quite good "rules of thumb", but hard to find a single highly accurate prediction rule.
    – If the training set is small and the hypothesis space is large, then there may be many equally accurate classifiers.
    – The hypothesis space may not contain the true function, but it may contain several good approximations.
    – Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.

How to generate an ensemble?

• There are a variety of methods developed.
• We will look at two of them:
  – Bagging
  – Boosting (AdaBoost: adaptive boosting)
• Both of these methods take a single (base) learning algorithm and wrap around it to generate multiple classifiers.

Bagging: Bootstrap Aggregation (Breiman, 1996)

  • Generate a random sample from the training set by bootstrapping:
    – Sample with replacement.
    – The new sample contains the same number of training points (and may contain repeats).
    – On average, about 63.2% (≈ 1 − 1/e) of the original points will appear in a given random sample.
  • Repeat this sampling procedure, getting a sequence of T independent training sets.
  • Learn a sequence of classifiers C1, C2, …, CT, one for each of these training sets, using the same base learner.
  • To classify an unknown sample X, let each classifier predict, and take a simple majority vote to make the final prediction (see the sketch below).
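A minimal bagging sketch (my own illustration, not from the notes). It uses scikit-learn's CART-style decision tree as the base learner, since the next slide refers to decision-tree boundaries; the data set and T are placeholders.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)  # toy training set
T = 25                                                        # number of bootstrap samples
rng = np.random.default_rng(0)

classifiers = []
for _ in range(T):
    # bootstrap: draw n points with replacement from the training set
    idx = rng.integers(0, len(X), size=len(X))
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    # let each classifier predict, then take a simple majority vote
    votes = [int(c.predict(x.reshape(1, -1))[0]) for c in classifiers]
    return np.bincount(votes).argmax()

print(bagged_predict(np.array([0.0, 0.5])))
```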

Decision Boundary by the CART Decision Tree Algorithm