Deep Support Vector Machines
Marco A. Wiering
Institute of Artificial Intelligence and Cognitive Engineering
University of Groningen, the Netherlands
Presentation at ROKS’13, Leuven, 09 July 2013
Contents
- Support Vector Machines
- Deep Support Vector Machines
- Experimental Results on Regression Problems
- Experimental Results on Classification Problems
- Conclusion
Limitations of Support Vector Machines
- Support Vector Machines (SVMs) often outperform other machine learning methods
- However, the standard SVM has a single adjustable layer of weights
- Instead of using such “shallow models”, deep architectures can be better alternatives
- SVMs use a-priori chosen kernel functions to compute similarities between input vectors
- A problem is that the choice of kernel function is important, but kernel functions are not very flexible
- Therefore we propose the deep SVM (DSVM)
- The DSVM contains multiple layers of SVMs
Support Vector Regression
- The objective function of the SVM is based on structural risk minimization theory developed by Vapnik in the 1960s
- Goal: find the g(x) most suitable to the data, e.g. for regression, using the ε-insensitive (Hinge) loss function:
  $$|y_i - g(\mathbf{x}_i)|_\varepsilon$$
- But also generalize well!
- g(x) as flat as possible ⇒ ||w|| as small as possible
- Yields a convex optimization problem
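As a minimal sketch of the loss named above, assuming the standard definition |y − g(x)|_ε = max(0, |y − g(x)| − ε) (the function name is illustrative, not from the slides):

```python
def eps_insensitive_loss(y, prediction, eps):
    """Return max(0, |y - prediction| - eps): zero inside the eps-tube."""
    return max(0.0, abs(y - prediction) - eps)

# Errors no larger than eps cost nothing; larger errors grow linearly.
losses = [eps_insensitive_loss(1.0, p, eps=0.5) for p in (1.2, 1.5, 2.0)]
```

Only the third prediction, which misses the target by more than ε, is penalized.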
SVM Regression Objective Function
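- The resulting dual optimization problem is (with $\alpha^{(*)}$ denoting both $\alpha_i$ and $\alpha_i^*$):
  $$\max_{\alpha^{(*)}} W(\alpha^{(*)}) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(\mathbf{x}_i \cdot \mathbf{x}_j)$$
  subject to the constraints
  $$0 \le \alpha_i, \alpha_i^* \le C, \qquad \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0$$
- Then:
  $$o = g(\mathbf{x}) = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i)(\mathbf{x}_i \cdot \mathbf{x}) + b$$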
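The dual objective and the output function above can be evaluated directly. The sketch below assumes the linear (dot-product) form shown on this slide; the function names and the tiny dataset are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, alpha_star, X, y, eps):
    """W = -eps*sum(a*_i + a_i) + sum((a*_i - a_i) y_i)
           - 0.5 * sum_ij (a*_i - a_i)(a*_j - a_j)(x_i . x_j)"""
    beta = [s - a for a, s in zip(alpha, alpha_star)]  # a*_i - a_i
    n = len(X)
    linear = -eps * sum(a + s for a, s in zip(alpha, alpha_star))
    linear += sum(b * t for b, t in zip(beta, y))
    quad = sum(beta[i] * beta[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return linear - 0.5 * quad

def predict(x, alpha, alpha_star, X, b):
    """g(x) = sum_i (a*_i - a_i)(x_i . x) + b"""
    return sum((s - a) * dot(xi, x)
               for a, s, xi in zip(alpha, alpha_star, X)) + b

# Illustrative values that satisfy sum_i (a_i - a*_i) = 0 and 0 <= a <= C.
X = [[1.0], [2.0]]
y = [1.0, 2.0]
alpha, alpha_star = [0.5, 0.0], [0.0, 0.5]
W = dual_objective(alpha, alpha_star, X, y, eps=0.1)
g = predict([3.0], alpha, alpha_star, X, b=0.0)
```

These coefficients are feasible but not optimized; a solver would search for the feasible pair that maximizes W.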
Optimization Algorithms
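- For SVMs, specialized toolkits have been developed, such as SVMLight and LibSVM.
- There are multiple optimization algorithms that aim to maximize the dual objective:
- Sequential Minimal Optimization (SMO) is often used
- Quadratic Programming can be used as well
- A simple solution is to use gradient ascent with learning rate λ. The learning rule for $\alpha_i$ is:
  $$\alpha_i \leftarrow \alpha_i + \lambda \frac{\partial W(\cdot)}{\partial \alpha_i} = \alpha_i + \lambda \Big( -\varepsilon - y_i + \sum_{j=1}^{\ell} (\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i|\theta), f(\mathbf{x}_j|\theta)\big) \Big)$$
  and the gradient ascent learning rule for $\alpha_i^*$ is:
  $$\alpha_i^* \leftarrow \alpha_i^* + \lambda \frac{\partial W(\cdot)}{\partial \alpha_i^*} = \alpha_i^* + \lambda \Big( -\varepsilon + y_i - \sum_{j=1}^{\ell} (\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i|\theta), f(\mathbf{x}_j|\theta)\big) \Big)$$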
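A minimal sketch of the gradient ascent rules above, under several assumptions that are not on the slide: the kernel is applied to the raw inputs (f is the identity), the RBF kernel has the form exp(−‖u − v‖²/σ), the bias b and the equality constraint are ignored, and the coefficients are simply clipped to [0, C] after each step.

```python
import math

def rbf(u, v, sigma=1.0):
    """RBF kernel exp(-||u - v||^2 / sigma); this exact form is an assumption."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma)

def train_gradient_ascent(X, y, eps=0.01, C=10.0, lam=0.05, epochs=2000):
    n = len(X)
    K = [[rbf(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, alpha_star = [0.0] * n, [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            out = sum((alpha_star[j] - alpha[j]) * K[i][j] for j in range(n))
            grad_alpha = -eps - y[i] + out       # dW / d alpha_i
            grad_alpha_star = -eps + y[i] - out  # dW / d alpha*_i
            # Ascent step, then clip to the box constraints 0 <= a <= C.
            alpha[i] = min(C, max(0.0, alpha[i] + lam * grad_alpha))
            alpha_star[i] = min(C, max(0.0, alpha_star[i] + lam * grad_alpha_star))
    return alpha, alpha_star

def predict(x, X, alpha, alpha_star):
    """Output without bias term (an assumption of this sketch)."""
    return sum((s - a) * rbf(xi, x) for a, s, xi in zip(alpha, alpha_star, X))

# Tiny illustrative regression problem.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.8, 0.9, 0.1]
alpha, alpha_star = train_gradient_ascent(X, y)
errors = [abs(predict(xi, X, alpha, alpha_star) - yi) for xi, yi in zip(X, y)]
```

On this toy problem the training errors should end up close to the ε-tube. Toolkits such as LibSVM use SMO instead, which is usually much faster.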
Main Ideas of DSVMs
- Choosing the right (parameterized) kernel may be difficult
- Instead, we will use a set of SVMs to map the input vector x to a feature vector f(x)
- More SVMs can be used to create larger feature representations
- All support vector coefficients (α-values) are trained using gradient ascent or descent on an adapted dual objective function
- Just as multi-layer perceptrons consist of simple perceptrons, the DSVM consists of SVMs
Architecture
[Figure: DSVM architecture. The inputs [x]_1, [x]_2, ..., [x]_{D-1}, [x]_D feed the feature-extracting SVMs S_1, S_2, S_3, whose outputs form the feature vector f(x); the main SVM M maps f(x) to the output.]
- Input layer of size D
- Total of d SVMs S_a, each one extracting one feature
- Central feature layer of size d
- Main support vector machine M
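The forward pass through this architecture can be sketched as follows. All names and coefficient values are hypothetical. The sketch also assumes RBF kernels in both layers, and that every SVM uses the training inputs as its support vectors.

```python
import math

def rbf(u, v, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma)

def svm_output(x, support, coef, bias, sigma):
    """One kernel SVM: sum_i coef_i * K(support_i, x) + bias."""
    return sum(c * rbf(s, x, sigma) for c, s in zip(coef, support)) + bias

def feature_layer(x, X_train, feature_svms, sigma_f):
    """f(x): one feature per SVM S_a, so len(f(x)) == d."""
    return [svm_output(x, X_train, coef, bias, sigma_f)
            for coef, bias in feature_svms]

def dsvm_output(x, X_train, feature_svms, main_coef, main_bias, sigma_f, sigma_m):
    """g(f(x)): the main SVM M works on extracted features, not raw inputs."""
    F_train = [feature_layer(xi, X_train, feature_svms, sigma_f) for xi in X_train]
    f_x = feature_layer(x, X_train, feature_svms, sigma_f)
    return svm_output(f_x, F_train, main_coef, main_bias, sigma_m)

# D = 2 inputs, d = 2 feature SVMs, 3 training examples (made-up numbers).
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feature_svms = [([1.0, 0.0, 0.0], 0.0), ([0.0, 1.0, -1.0], 0.1)]
features = feature_layer(X_train[0], X_train, feature_svms, sigma_f=1.0)
output = dsvm_output([0.5, 0.5], X_train, feature_svms,
                     main_coef=[0.5, -0.5, 0.2], main_bias=0.0,
                     sigma_f=1.0, sigma_m=1.0)
```

Each (coef, bias) pair stands for the trained coefficients α*_i − α_i and the bias of one SVM.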
Adapted Objective
- Output function: g(x) ⇒ g(f(x))
- Objective function: $W(\alpha^{(*)}) \Rightarrow W(f(\mathbf{x}), \alpha^{(*)})$
- New optimization problem:
  $$\min_{f(\mathbf{x})} \max_{\alpha^{(*)}} W(f(\mathbf{x}), \alpha^{(*)})$$
- This is a min-max optimization problem
- Adapt f(x) through gradient descent
- Adapt $\alpha^{(*)}$ through gradient ascent
Training Procedure (1)
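- Adapt $\alpha^{(*)}$ towards a (local) maximum of $W(f(\mathbf{x}), \alpha^{(*)})$:
  $$\alpha_i^{(*)} \leftarrow \alpha_i^{(*)} + \lambda \frac{\partial W}{\partial \alpha_i^{(*)}}$$
- Remember:
  $$\max_{\alpha^{(*)}} W(\alpha^{(*)}) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)$$
- The resulting gradient ascent SVM training rule for $\alpha_i^*$ is:
  $$\alpha_i^* \leftarrow \alpha_i^* + \lambda \frac{\partial W}{\partial \alpha_i^*} = \alpha_i^* - \lambda \Big( \varepsilon - y_i + \sum_{j} (\alpha_j^* - \alpha_j)\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big) \Big)$$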
Training Procedure (3)
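- For the RBF kernel of the main SVM we have:
  $$\frac{\partial K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)}{\partial f(\mathbf{x}_i)_a} = -\frac{f(\mathbf{x}_i)_a - f(\mathbf{x}_j)_a}{\sigma_m}\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)$$
- This leads to:
  $$\frac{\partial W}{\partial f(\mathbf{x}_i)_a} = (\alpha_i^* - \alpha_i) \sum_{j=1}^{\ell} (\alpha_j^* - \alpha_j)\, \frac{f(\mathbf{x}_i)_a - f(\mathbf{x}_j)_a}{\sigma_m}\, K\big(f(\mathbf{x}_i), f(\mathbf{x}_j)\big)$$
- We create a new dataset for each feature-extracting SVM and then train it with the gradient ascent SVM algorithm
- We repeat the alternating training of the main SVM and the feature-layer SVMs a number of times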
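The feature gradient above can be checked numerically. This sketch assumes the kernel K(u, v) = exp(−‖u − v‖²/(2σ_m)), which is the form whose derivative matches the expression on this slide. All values are made up. The last line shows one plausible way to build a target for the new dataset, a gradient descent step with a hypothetical step size `mu`; the slides shown here do not spell this step out.

```python
import math

def rbf(u, v, sigma_m):
    # Assumed kernel form, chosen so that dK/du_a = -(u_a - v_a)/sigma_m * K.
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2.0 * sigma_m))

def dual_objective(F, alpha, alpha_star, y, eps, sigma_m):
    beta = [s - a for a, s in zip(alpha, alpha_star)]
    n = len(F)
    linear = -eps * sum(a + s for a, s in zip(alpha, alpha_star))
    linear += sum(b * t for b, t in zip(beta, y))
    quad = sum(beta[i] * beta[j] * rbf(F[i], F[j], sigma_m)
               for i in range(n) for j in range(n))
    return linear - 0.5 * quad

def grad_feature(F, alpha, alpha_star, i, a, sigma_m):
    """Analytic dW / d f(x_i)_a for the assumed RBF kernel."""
    beta = [s - al for al, s in zip(alpha, alpha_star)]
    return beta[i] * sum(
        beta[j] * (F[i][a] - F[j][a]) / sigma_m * rbf(F[i], F[j], sigma_m)
        for j in range(len(F)))

def numeric_grad(F, alpha, alpha_star, y, eps, sigma_m, i, a, h=1e-5):
    """Central finite difference of W with respect to f(x_i)_a."""
    plus = [row[:] for row in F]
    minus = [row[:] for row in F]
    plus[i][a] += h
    minus[i][a] -= h
    return (dual_objective(plus, alpha, alpha_star, y, eps, sigma_m)
            - dual_objective(minus, alpha, alpha_star, y, eps, sigma_m)) / (2 * h)

F = [[0.2, -0.1], [0.7, 0.4], [-0.3, 0.9]]   # feature vectors f(x_i), d = 2
alpha, alpha_star = [0.0, 0.3, 0.0], [0.5, 0.0, 0.2]
y = [1.0, -0.5, 0.3]
analytic = grad_feature(F, alpha, alpha_star, 0, 1, sigma_m=0.8)
numeric = numeric_grad(F, alpha, alpha_star, y, 0.05, 0.8, 0, 1)

mu = 0.1                                      # hypothetical step size
new_target = F[0][1] - mu * analytic          # assumed target for S_a, example 0
```

Agreement between `analytic` and `numeric` confirms the sign and the 1/σ_m factor under the assumed kernel form.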
Related Work
The DSVM is related to the following methods:
- Kernel learning. Often relies on a fixed set of basis kernels, where
  - parameters are learned for a kernel (e.g. RBF kernel), or
  - different kernels are linearly or non-linearly combined
- There are recent developments in multi-layer kernel learning, e.g. Dinuzzo (2010)
- Suykens (1999) used logistic functions to learn features. The learning algorithm was quite different
- P. Vincent and Y. Bengio (2000) proposed a neural support vector network, but it used a random subset of support vectors and a heuristic to adapt the neural networks
Results on Regression Problems
Dataset            #inst.  #feat.  N     SVM results            DSVM results           Graczyk results
Baseball           337     6       4000  0.02413 ± 0.00011      0.02294 ± 0.00010      0.
Boston Housing     461     4       1000  0.006838 ± 0.000095    0.006381 ± 0.000091    0.
Concrete Strength  72      5       4000  0.00706 ± 0.000070     0.00621 ± 0.000054     0.
Diabetes           43      2       4000  0.02719 ± 0.000263     0.02327 ± 0.000219     0.
Machine-CPU        188     6       1000  0.00805 ± 0.000181     0.00638 ± 0.000123     0.
Mortgage           1049    6       1000  0.000080 ± 0.000001    0.000080 ± 0.000001    0.
Stock              950     5       1000  0.00086 ± 0.000006     0.00076 ± 0.000005     0.
Breast Cancer      152     6       4000  0.06947 ± 0.000297     0.06910 ± 0.000295     0.
Auto-MPG           392     7       1000  6.852 ± 0.091          6.715 ± 0.092          N/A
Housing            506     13      1000  8.71 ± 0.14            9.30 ± 0.15            N/A
Experiments on Classification Problems
- We experimented with 8 classification datasets from the UCI repository
- The data is split into 90% training data and 10% testing data
- We repeat the cross-validation 1000 times
- For optimizing the parameters we have again used Particle Swarm Optimization (PSO)
- We will show average accuracies and standard errors
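The evaluation protocol above can be sketched as follows, assuming it means repeated random 90%/10% splits. The slides do not specify the exact scheme. A 1-nearest-neighbour classifier and a synthetic dataset stand in for the SVM/DSVM and the UCI data.

```python
import math
import random

def nearest_neighbour(train_X, train_y, test_X):
    """Stand-in classifier (not the DSVM): label of the closest training point."""
    preds = []
    for x in test_X:
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
        preds.append(train_y[dists.index(min(dists))])
    return preds

def repeated_holdout(X, y, fit_predict, repeats=1000, test_fraction=0.1, seed=0):
    """Return (mean accuracy, standard error) over random train/test splits."""
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, round(test_fraction * n))
    accuracies = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        accuracies.append(correct / n_test)
    mean = sum(accuracies) / repeats
    var = sum((a - mean) ** 2 for a in accuracies) / (repeats - 1)
    return mean, math.sqrt(var / repeats)

# Two well-separated synthetic classes, 10 examples each.
X = [[i * 0.1] for i in range(10)] + [[5.0 + i * 0.1] for i in range(10)]
y = [0] * 10 + [1] * 10
mean_acc, std_err = repeated_holdout(X, y, nearest_neighbour, repeats=100)
```

The standard error is the sample standard deviation of the per-split accuracies divided by the square root of the number of repeats.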