Deep Support Vector Machines
Marco A. Wiering
Institute of Artificial Intelligence and Cognitive Engineering
University of Groningen, the Netherlands
Presentation at ROKS’13, Leuven, 09 July 2013

Contents

- Support Vector Machines
- Deep Support Vector Machines
- Experimental Results on Regression Problems
- Experimental Results on Classification Problems
- Conclusion

Limitations of Support Vector Machines

- Support Vector Machines (SVMs) often outperform other machine learning methods
- However, the standard SVM has a single adjustable layer of weights
- Instead of using such “shallow models”, deep architectures can be better alternatives
- SVMs use a-priori chosen kernel functions to compute similarities between input vectors
- A problem is that the choice of kernel function is important, but kernel functions are not very flexible
- Therefore we propose the deep SVM (DSVM)
- The DSVM contains multiple layers of SVMs

Support Vector Regression

- The objective function of the SVM is based on structural risk minimization theory developed by Vapnik in the 1960s
- Goal: find $g(\mathbf{x})$ most suitable to the data, e.g. for regression, using the $\varepsilon$-insensitive (hinge) loss function:

  $$|y_i - g(\mathbf{x}_i)|_\varepsilon = \max\big(0,\ |y_i - g(\mathbf{x}_i)| - \varepsilon\big)$$

- But also generalize well!
- $g(\mathbf{x})$ as flat as possible ⇒ $\|\mathbf{w}\|$ as small as possible
- Yields a convex optimization problem
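
As a small illustration of the ε-insensitive loss named on this slide, the sketch below uses the standard definition max(0, |y − g(x)| − ε). The function name `eps_insensitive_loss` and the sample values are chosen here for illustration and are not from the presentation.

```python
def eps_insensitive_loss(y, prediction, eps):
    """Standard epsilon-insensitive loss: max(0, |y - prediction| - eps)."""
    return max(0.0, abs(y - prediction) - eps)


# Errors inside the epsilon tube cost nothing; larger errors grow linearly.
inside = eps_insensitive_loss(1.0, 1.05, eps=0.1)   # |error| = 0.05 < eps
outside = eps_insensitive_loss(1.0, 1.5, eps=0.1)   # |error| = 0.5 > eps
print(inside, outside)
```

Predictions that miss the target by less than ε are treated as exact, which is what makes the resulting regression function sparse in its support vectors.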

SVM Regression Objective Function

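- The resulting dual objective problem is:

  $$\max_{\alpha^{(*)}} W(\alpha^{(*)}) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(\mathbf{x}_i \cdot \mathbf{x}_j)$$

  subject to constraints

  $$0 \le \alpha_i, \alpha_i^* \le C, \qquad \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) = 0$$

- Then: $o = g(\mathbf{x}) = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i)(\mathbf{x}_i \cdot \mathbf{x}) + b$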
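
The dual objective, its constraints, and the output function can be evaluated directly for a given set of coefficients. The following is a minimal sketch for the linear (dot-product) case; the function names and the toy data are assumptions made for illustration, not part of the presentation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, alpha_star, X, y, eps):
    """W = -eps*sum(a*+a) + sum((a*-a)*y) - 1/2*sum_ij (a*-a)_i (a*-a)_j (x_i . x_j)."""
    beta = [s - a for a, s in zip(alpha, alpha_star)]  # beta_i = alpha*_i - alpha_i
    n = len(X)
    linear = -eps * sum(s + a for a, s in zip(alpha, alpha_star))
    linear += sum(b * t for b, t in zip(beta, y))
    quad = sum(beta[i] * beta[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return linear - 0.5 * quad


def is_feasible(alpha, alpha_star, C, tol=1e-9):
    """Box constraints 0 <= alpha_i, alpha*_i <= C and sum_i (alpha*_i - alpha_i) = 0."""
    in_box = all(0.0 <= v <= C for v in alpha + alpha_star)
    balanced = abs(sum(s - a for a, s in zip(alpha, alpha_star))) <= tol
    return in_box and balanced


def predict(x, alpha, alpha_star, X, b):
    """g(x) = sum_i (alpha*_i - alpha_i) (x_i . x) + b."""
    return sum((s - a) * dot(xi, x) for a, s, xi in zip(alpha, alpha_star, X)) + b


# Toy data (illustrative only): two one-dimensional training points.
X = [[1.0], [2.0]]
y = [1.0, 2.0]
alpha, alpha_star = [0.5, 0.0], [0.0, 0.5]
W = dual_objective(alpha, alpha_star, X, y, eps=0.1)
g = predict([3.0], alpha, alpha_star, X, b=0.0)
```

Only the differences α*_i − α_i enter the quadratic term and the output function, which is why the sketch computes them once as `beta`.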

Optimization Algorithms

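- For SVMs, specialized toolkits have been developed, such as SVMLight and LibSVM.
- There are multiple optimization algorithms that aim to maximize the dual objective:
- Sequential Minimal Optimization (SMO) is often used
- Quadratic Programming can be used as well
- A simple solution is to use gradient ascent, here with the dot product replaced by a kernel $K$ evaluated on the features $f(\mathbf{x} \mid \theta)$:

  $$\alpha_i \leftarrow \alpha_i + \lambda \frac{\partial W(\cdot)}{\partial \alpha_i} = \alpha_i + \lambda \Big( -\varepsilon - y_i + \sum_{j=1}^{\ell} (\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i \mid \theta), f(\mathbf{x}_j \mid \theta)\big) \Big)$$

  and the gradient ascent learning rule for $\alpha_i^*$ is:

  $$\alpha_i^* \leftarrow \alpha_i^* + \lambda \frac{\partial W(\cdot)}{\partial \alpha_i^*} = \alpha_i^* + \lambda \Big( -\varepsilon + y_i - \sum_{j=1}^{\ell} (\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i \mid \theta), f(\mathbf{x}_j \mid \theta)\big) \Big)$$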
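
The two gradient ascent rules can be sketched as a small training loop. This is an illustrative sketch under several assumptions that are not stated on the slide: the kernel acts directly on the inputs (no feature layer yet), the bias is omitted so the equality constraint is not enforced, coefficients are clipped to [0, C] after each step, and the RBF kernel uses a `2 * width` denominator. All names and hyperparameter values are hypothetical.

```python
import math


def rbf(u, v, width):
    """RBF kernel; the 2*width denominator is a convention assumed here."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * width))


def train_svr_gradient_ascent(X, y, eps=0.01, C=10.0, lam=0.1,
                              epochs=2000, width=0.5):
    n = len(X)
    K = [[rbf(X[i], X[j], width) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    alpha_star = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # s = sum_j (alpha*_j - alpha_j) K(x_i, x_j)
            s = sum((alpha_star[j] - alpha[j]) * K[i][j] for j in range(n))
            grad_alpha = -eps - y[i] + s        # dW / d alpha_i
            grad_alpha_star = -eps + y[i] - s   # dW / d alpha*_i
            # Ascent step, then clip to the box [0, C] (assumed projection).
            alpha[i] = min(C, max(0.0, alpha[i] + lam * grad_alpha))
            alpha_star[i] = min(C, max(0.0, alpha_star[i] + lam * grad_alpha_star))
    return alpha, alpha_star, K


def predict_train(i, alpha, alpha_star, K):
    """Output for training point i (bias omitted by assumption)."""
    return sum((alpha_star[j] - alpha[j]) * K[i][j] for j in range(len(K)))


# Toy one-dimensional regression data (illustrative only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.8, 0.9, 0.1]
alpha, alpha_star, K = train_svr_gradient_ascent(X, y)
residuals = [abs(y[i] - predict_train(i, alpha, alpha_star, K))
             for i in range(len(X))]
```

Specialized solvers such as SMO are far more efficient; the appeal of the plain gradient rule is that it stays usable when the kernel inputs themselves are being adapted.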

Main Ideas of DSVMs

- Choosing the right (parameterized) kernel may be difficult
- Instead, we will use a set of SVMs to map the input vector $\mathbf{x}$ to a feature vector $f(\mathbf{x})$
- More SVMs can be used to create larger feature representations
- All support vector coefficients (α-values) are trained using gradient ascent or descent on an adapted dual objective function
- Just as multi-layer perceptrons consist of simple perceptrons, the DSVM consists of SVMs

Architecture

[Figure: DSVM architecture. The inputs $[\mathbf{x}]_1, [\mathbf{x}]_2, \ldots, [\mathbf{x}]_{D-1}, [\mathbf{x}]_D$ feed the feature-layer SVMs $S_1, S_2, S_3$, whose outputs form $f(\mathbf{x})$; $f(\mathbf{x})$ is the input to the main SVM $M$.]

- Input layer of size $D$
- Total of $d$ SVMs $S_a$, each one extracting one feature
- Central feature layer of size $d$
- Main support vector machine $M$
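
The forward pass through this two-layer architecture can be sketched as follows. This is a minimal sketch, assuming every SVM is a kernel expansion over the same training inputs with an RBF kernel; the coefficient lists stand for the differences α*_i − α_i, and all names, widths, and values are hypothetical.

```python
import math


def rbf(u, v, width):
    """RBF kernel; the 2*width denominator is a convention assumed here."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * width))


def svm_output(x, support, coeffs, bias, width):
    """Kernel expansion: sum_i coeffs[i] * K(support[i], x) + bias."""
    return sum(c * rbf(s, x, width) for c, s in zip(coeffs, support)) + bias


def extract_features(x, X_train, feature_coeffs, feature_biases, width):
    """Feature layer: one SVM S_a per feature, giving f(x) of size d."""
    return [svm_output(x, X_train, coeffs, bias, width)
            for coeffs, bias in zip(feature_coeffs, feature_biases)]


def dsvm_output(x, X_train, feature_coeffs, feature_biases,
                main_coeffs, main_bias, width, main_width):
    """Main SVM M applied to the extracted features: g(f(x))."""
    F_train = [extract_features(xi, X_train, feature_coeffs, feature_biases, width)
               for xi in X_train]
    f_x = extract_features(x, X_train, feature_coeffs, feature_biases, width)
    return svm_output(f_x, F_train, main_coeffs, main_bias, main_width)


# Toy setup (illustrative only): D = 2 inputs, d = 2 feature SVMs.
X_train = [[0.0, 0.0], [1.0, 1.0]]
feature_coeffs = [[1.0, 0.0], [0.0, 1.0]]   # one coefficient list per S_a
feature_biases = [0.0, 0.0]
main_coeffs = [1.0, -1.0]

features = extract_features(X_train[0], X_train, feature_coeffs,
                            feature_biases, width=1.0)
output = dsvm_output(X_train[0], X_train, feature_coeffs, feature_biases,
                     main_coeffs, main_bias=0.0, width=1.0, main_width=1.0)
```

The main SVM never sees the raw inputs: its kernel compares feature vectors, so changing the feature-layer coefficients changes the similarity measure the main SVM works with.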

Adapted Objective

- Output function: $g(\mathbf{x}) \Rightarrow g(f(\mathbf{x}))$
- Objective function: $W(\alpha^{(*)}) \Rightarrow W(f(\mathbf{x}), \alpha^{(*)})$
- New optimization problem:

  $$\min_{f(\mathbf{x})} \max_{\alpha^{(*)}} W(f(\mathbf{x}), \alpha^{(*)})$$

- This is a min-max optimization problem
- Adapt $f(\mathbf{x})$ through gradient descent
- Adapt $\alpha^{(*)}$ through gradient ascent

Training Procedure (1)

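- Adapt $\alpha^{(*)}$ towards a (local) maximum of $W(f(\mathbf{x}), \alpha^{(*)})$:

  $$\alpha_i^{(*)} \leftarrow \alpha_i^{(*)} + \lambda \frac{\partial W}{\partial \alpha_i^{(*)}}$$

- Remember:

  $$\max_{\alpha^{(*)}} W(\alpha^{(*)}) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)$$

- The resulting gradient ascent SVM training rule for $\alpha_i^*$ is:

  $$\alpha_i^* \leftarrow \alpha_i^* - \lambda \Big( \varepsilon - y_i + \sum_{j} (\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big) \Big)$$

  The rule for $\alpha_i$ follows by flipping the signs of $y_i$ and of the sum.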

Training Procedure (3)

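- For the RBF kernel of the main SVM we have (assuming the convention $K(\mathbf{u}, \mathbf{v}) = \exp(-\|\mathbf{u} - \mathbf{v}\|^2 / (2\sigma_m))$; a different width convention changes only a constant factor):

  $$\frac{\partial K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)}{\partial f(\mathbf{x}_i)_a} = -\frac{f(\mathbf{x}_i)_a - f(\mathbf{x}_j)_a}{\sigma_m}\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)$$

- This leads to:

  $$\frac{\partial W}{\partial f(\mathbf{x}_i)_a} = \sum_{j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, \frac{f(\mathbf{x}_i)_a - f(\mathbf{x}_j)_a}{\sigma_m}\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)$$

- We create a new dataset for each feature-extracting SVM and then train it with the gradient ascent SVM algorithm
- We repeat the alternating training of the main SVM and the feature-layer SVMs a number of times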
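
The gradient of W with respect to a single feature value can be checked numerically. The sketch below assumes the RBF convention K(u, v) = exp(−‖u − v‖² / (2σ_m)), compares the analytic gradient with a central finite difference, and then takes one gradient descent step on the features. Using the stepped values f(x_i)_a − μ ∂W/∂f(x_i)_a as new training targets for feature SVM a is one plausible reading of "new dataset", not something the slide states; the step size `mu` and all names and data are hypothetical.

```python
import math


def rbf(u, v, sigma_m):
    """Assumed convention: exp(-||u - v||^2 / (2 * sigma_m))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma_m))


def quad_term(F, beta, sigma_m):
    """Feature-dependent part of W: -1/2 sum_ij beta_i beta_j K(F_i, F_j)."""
    n = len(F)
    return -0.5 * sum(beta[i] * beta[j] * rbf(F[i], F[j], sigma_m)
                      for i in range(n) for j in range(n))


def grad_W_feature(F, beta, i, a, sigma_m):
    """dW/df(x_i)_a = sum_j beta_i beta_j (F_ia - F_ja) / sigma_m * K(F_i, F_j)."""
    return sum(beta[i] * beta[j] * (F[i][a] - F[j][a]) / sigma_m
               * rbf(F[i], F[j], sigma_m) for j in range(len(F)))


def finite_difference(F, beta, i, a, sigma_m, h=1e-5):
    plus = [row[:] for row in F]
    minus = [row[:] for row in F]
    plus[i][a] += h
    minus[i][a] -= h
    return (quad_term(plus, beta, sigma_m) - quad_term(minus, beta, sigma_m)) / (2 * h)


# Toy feature vectors f(x_i) and coefficients beta_i = alpha*_i - alpha_i.
F = [[0.0, 0.0], [1.0, 0.5], [0.2, 1.0]]
beta = [0.5, -0.3, 0.8]
sigma_m = 1.0

analytic = grad_W_feature(F, beta, 0, 0, sigma_m)
numeric = finite_difference(F, beta, 0, 0, sigma_m)

# One descent step on all features; the results could serve as new targets
# for the feature-layer SVMs (assumed interpretation).
mu = 0.1
new_F = [[F[i][a] - mu * grad_W_feature(F, beta, i, a, sigma_m)
          for a in range(len(F[i]))] for i in range(len(F))]
```

Only the quadratic term of W depends on the features, so differentiating `quad_term` is enough to check the full gradient.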

Related Work

The DSVM is related to the following methods:

- Kernel learning. Often relies on a fixed set of basis kernels, where
  - Parameters are learned for a kernel (e.g. RBF kernel), or:
  - Different kernels are linearly or non-linearly combined
- There are recent developments in multi-layer kernel learning, e.g. Dinuzzo (2010)
- Suykens (1999) used logistic functions to learn features. The learning algorithm was quite different
- P. Vincent and Y. Bengio (2000) proposed a neural support vector network, but it used a random subset of support vectors and a heuristic to adapt the neural networks

Results on Regression Problems

Dataset            #inst.  #feat.  N     SVM results            DSVM results           Graczyk results
Baseball           337     6       4000  0.02413 ± 0.00011      0.02294 ± 0.00010      0.
Boston Housing     461     4       1000  0.006838 ± 0.000095    0.006381 ± 0.000091    0.
Concrete Strength  72      5       4000  0.00706 ± 0.000070     0.00621 ± 0.000054     0.
Diabetes           43      2       4000  0.02719 ± 0.000263     0.02327 ± 0.000219     0.
Machine-CPU        188     6       1000  0.00805 ± 0.000181     0.00638 ± 0.000123     0.
Mortgage           1049    6       1000  0.000080 ± 0.000001    0.000080 ± 0.000001    0.
Stock              950     5       1000  0.00086 ± 0.000006     0.00076 ± 0.000005     0.
Breast Cancer      152     6       4000  0.06947 ± 0.000297     0.06910 ± 0.000295     0.
Auto-MPG           392     7       1000  6.852 ± 0.091          6.715 ± 0.092          N/A
Housing            506     13      1000  8.71 ± 0.14            9.30 ± 0.15            N/A

Experiments on Classification Problems

- We experimented with 8 classification datasets from the UCI repository
- The data is split into 90% training data and 10% test data
- We repeat the cross-validation 1000 times
- For optimizing the parameters we have again used Particle Swarm Optimization (PSO)
- We will show average accuracies and standard errors
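
The evaluation protocol can be sketched as repeated random 90%/10% splits, reporting the mean accuracy and its standard error. Reading the "cross-validation" as independent random splits is an assumption, and the threshold classifier below is only a hypothetical stand-in for the SVM or DSVM being evaluated; all names and data are illustrative.

```python
import math
import random
import statistics


def repeated_holdout(samples, labels, train_fn, n_repeats=1000,
                     test_fraction=0.1, seed=0):
    """Return (mean accuracy, standard error) over repeated random splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, round(test_fraction * len(samples)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / n_test)
    mean_acc = statistics.mean(accuracies)
    std_err = statistics.stdev(accuracies) / math.sqrt(n_repeats)
    return mean_acc, std_err


def train_threshold(train_samples, train_labels):
    """Stand-in classifier: threshold halfway between the two class means."""
    mean0 = statistics.mean(s[0] for s, t in zip(train_samples, train_labels) if t == 0)
    mean1 = statistics.mean(s[0] for s, t in zip(train_samples, train_labels) if t == 1)
    threshold = (mean0 + mean1) / 2.0
    return lambda x: int(x[0] > threshold)


# Toy one-dimensional data (illustrative only).
samples = [[float(v)] for v in range(-10, 10)]
labels = [int(v >= 0) for v in range(-10, 10)]
mean_acc, std_err = repeated_holdout(samples, labels, train_threshold, n_repeats=100)
print(f"accuracy: {mean_acc:.3f} +/- {std_err:.3f}")
```

The standard error shrinks with the square root of the number of repetitions, which is why a large number of splits gives tight error bars even on small datasets.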