

Adapting Multiple Kernel Parameters for Support Vector Machines using Genetic Algorithms

Sergio A. Rojas
Division of Parasitology, National Institute for Medical Research, London NW7 1AA, UK
and Department of Computer Science, University College London
srojas@nimr.mrc.ac.uk

Delmiro Fernandez-Reyes
Division of Parasitology, National Institute for Medical Research, London NW7 1AA, UK
and Department of Computer Science, University College London
dfernan@nimr.mrc.ac.uk

Abstract- Kernel parameterization is a key design step in the application of support vector machines (SVM) to supervised learning problems. A grid-search with a cross-validation criterion is often conducted to choose the kernel parameters, but it is computationally infeasible for a large number of them. Here we describe a genetic algorithm (GA) as a method for tuning kernels with multiple parameters for classification tasks, with application to the weighted radial basis function (RBF) kernel. In this type of kernel the number of parameters equals the dimension of the input patterns, which is usually high for biological datasets. We show preliminary experimental results where adapted weighted RBF kernels for SVM achieve classification performance over 98% in human serum proteomic profile data. Further improvements to this method may lead to the discovery of relevant biomarkers in biomedical applications.

1 Introduction

The Support Vector Machine [1] (SVM) is a well-known supervised machine learning technique that has been applied successfully to a wide variety of problems, including classification [2], regression [3] and clustering [4], in diverse domains such as web text mining [5], gene expression [6] and proteome analysis of infectious diseases [work in progress]. The SVM was proposed originally as a learning algorithm to find an optimal discrimination function between two linearly-separable classes by maximizing the margin of the closest samples to a separating hyper-plane in the input-dimensional space [2]. Further extensions have been made to handle the non-separable cases with a soft margin parameter [7] and non-linear cases by the use of a kernel function [2, 7]. The kernel function computes a measure of similarity between input patterns in a transformed vectorial space.

The function chosen to carry out the kernel mapping may be dependent on parameters such as the degree in a polynomial kernel or the width in a radial basis function (RBF) kernel. These parameters must be tuned to the specific dataset in order to get the best performance of the SVM. Usually a grid-search through a range of values for the parameters is used, by varying one parameter with a fixed step-size while keeping the others constant [8]. However, for kernels with a large number of parameters, such as weighted kernels, this is computationally infeasible. Gradient descent might be used for this purpose as described in [11], although this method may lead to local minima.

We propose using a genetic algorithm [9] (GA) to search the parameterization space of SVM kernels with multiple parameters, with application to classification problems. In the next section we briefly describe the SVM and weighted kernels. Section 3 explains the GA approach for tuning weighted kernels, and experimental results on artificial and real datasets are shown in Section 4. The paper concludes with some directions for future work.

2 SVM and Weighted Kernels

We consider the problem of binary classification in a dataset of examples. Let $D = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ be the set of training examples, where $x_i \in \mathbb{R}^n$ is an n-dimensional input vector, $y_i \in \{+1, -1\}$ is its corresponding class label and $l$ is the number of examples. A kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ computes the inner product between two examples, $K(x, z) = \langle \Phi(x), \Phi(z) \rangle$, where $\Phi$ is a mapping from the input space to a transformed feature space. In this feature space an SVM learns a decision function or hyperplane of the form $f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$, where the coefficients $\alpha_i$ are found by solving a constrained quadratic optimization problem that aims to maximize the margin, or distance of opposite examples to the hyperplane, and to minimize a regularization factor that allows for misclassifications (for a comprehensive description of SVM the reader is referred to [10]). Support vectors are those examples $x_i$ with corresponding $\alpha_i > 0$. It is not necessary to know the underlying feature mapping if the function $K(x, z)$ satisfies the conditions of Mercer's theorem [10] (guaranteed with a positive semi-definite Gram matrix $K = (K(x_i, x_j))_{i,j=1}^{l}$).

0-7803-9363-5/05/$20.00 ©2005 IEEE.

Valid common kernel functions are the polynomial kernel

$$K(x, z) = (a \langle x, z \rangle + 1)^d \qquad (1)$$

and the radial basis function kernel

$$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) \qquad (2)$$

These two kernels have few parameters. In the polynomial kernel, $d$ determines the degree of the kernel (a linear kernel has $d = 1$) and $a$ is a scaling factor. In equation (2), $\sigma$ is a factor that shapes the width of the radial basis function. By including an independent scaling factor for each input variable, it is possible to define a more general form of these two kernels [11], the weighted polynomial kernel:

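$$K(x, z) = \left(\sum_{j=1}^{n} a_j x_j z_j + 1\right)^d \qquad (3)$$

and the weighted RBF kernel:

$$K(x, z) = \exp\left(-\sum_{j=1}^{n} \sigma_j (x_j - z_j)^2\right) \qquad (4)$$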
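As an illustration of equation (4), the following minimal sketch computes a weighted RBF kernel and its Gram matrix in plain Python. It assumes the kernel has the form $\exp(-\sum_j \sigma_j (x_j - z_j)^2)$, with one multiplicative scale factor per input variable; the function names `weighted_rbf` and `gram_matrix` and the toy data are hypothetical and not taken from the paper.

```python
import math


def weighted_rbf(x, z, sigma):
    """Weighted RBF kernel, assumed form: exp(-sum_j sigma[j] * (x[j] - z[j])**2)."""
    if not (len(x) == len(z) == len(sigma)):
        raise ValueError("x, z and sigma must have the same length")
    # Each squared difference is scaled by the factor of its own input variable.
    weighted_dist = sum(s * (a - b) ** 2 for a, b, s in zip(x, z, sigma))
    return math.exp(-weighted_dist)


def gram_matrix(X, sigma):
    """Gram matrix with entries K[i][j] = K(x_i, x_j) for a list of examples X."""
    return [[weighted_rbf(xi, xj, sigma) for xj in X] for xi in X]


# Toy data (hypothetical): three examples with two input variables each.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A zero scale factor makes the kernel ignore the second variable entirely.
K = gram_matrix(X, sigma=[1.0, 0.0])
```

Under this assumed form, a scale factor of zero removes a variable from the similarity measure, while a large scale factor makes the kernel very sensitive to differences in that variable.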

The number of parameters or scale factors for these

kernels equals the dimensionality of the input vectors.

Note that for dimension greater than 3 or 4 it becomes

intractable to adjust them by a grid-search. Hence we

propose a GA-based method to overcome this problem.
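The growth in grid-search cost can be made concrete with a short calculation. The sketch below assumes a hypothetical grid of 10 candidate values per scale factor; the helper name `grid_size` is illustrative. Every grid point would require a full cross-validated SVM training run.

```python
def grid_size(values_per_parameter, n_parameters):
    """Number of parameter combinations in an exhaustive grid."""
    return values_per_parameter ** n_parameters


# Assumed grid resolution: 10 candidate values for every scale factor.
for n in (1, 2, 4, 20):
    print(n, grid_size(10, n))
# 1 10
# 2 100
# 4 10000
# 20 100000000000000000000
```

With four parameters the grid already contains 10,000 combinations, and with twenty it is far beyond what can be evaluated exhaustively.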

3 GA for Adapting Weighted Kernels

Below we describe a kernel tuning approach for SVM using a GA. We parameterized weighted RBF kernels, but the approach can be applied to other kinds of weighted kernels. Genetic algorithms have not been applied to choosing multiple parameters in weighted kernels, although a related approach has been reported recently to tune generalized Gaussian kernels by means of evolutionary strategies [12]. In that study the kernel matrix is modified using a covariance matrix adaptation method with constraints to guarantee its applicability to an SVM (i.e. the resulting matrix must be symmetric and positive definite). The recombination of good individuals is made by averaging the population (obtaining its center of mass), which prevents useful crossover such as that of parents with opposite scale magnitudes.

3.1 Encoding kernel parameters

A standard GA [9] was used in this approach. We define a chromosome as an n-dimensional vector of real values, $s_i = (\sigma_1, \sigma_2, \ldots, \sigma_n)$. Each gene $\sigma_j$ represents the scale factor for the j-th input variable. The chromosome is then used in (3) or (4) when computing the kernel matrix K.

3.2 Genetic operators

The initial chromosome population is randomly generated with values between 0 and 1. We used single-point crossover to recombine subsets of scale factors. The number of parent individuals is defined by a crossover rate, $0 < p_c < 1$.
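The initialization and recombination steps described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give implementation details, so the function names, the choice of producing two children per crossover, and the uniform choice of the cut point are all hypothetical.

```python
import random


def init_population(pop_size, n_genes, rng):
    """Random chromosomes whose scale factors are drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at one random cut point (assumed variant)."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    # The cut lies strictly inside the chromosome, so both parents contribute.
    cut = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


rng = random.Random(0)  # fixed seed so the sketch is reproducible
population = init_population(pop_size=4, n_genes=5, rng=rng)
child_a, child_b = single_point_crossover(population[0], population[1], rng)
```

Because the cut point keeps contiguous blocks of genes together, each child inherits a subset of scale factors from each parent unchanged, which is what allows subsets of weights at different scales to be recombined.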

Variations of scale factors are introduced by a logarithmic mutation function, which is applied to a percentage $p_m = 1 - p_c$ of individuals in each generation. For these chromosomes a subset of genes $J$ is chosen randomly across the genome according to a mutation factor $0 < p_{lm} < 1$. Next, a normally distributed random number $R \sim N(0, 1)$ is generated and the values of the genes in $J$ are scaled up or down (depending on the sign of $R$) within two folds of the current value, by rule (5).

$$\sigma_j(t + 1) = 10^{(2R)} \sigma_j(t), \quad j \in J \subset \{1, 2, \ldots, n\} \qquad (5)$$
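A minimal sketch of the logarithmic mutation is given below. It rests on two assumptions that the text does not settle: that rule (5) multiplies each selected gene by $10^{2R}$, and that the mutation factor $p_{lm}$ is the fraction of genes placed in $J$. The function name `log_mutation` and the example chromosome are hypothetical.

```python
import random


def log_mutation(chromosome, p_lm, rng):
    """Scale a random subset J of genes by the common factor 10**(2*R).

    Assumptions: p_lm is the fraction of genes selected for J, and a single
    R ~ N(0, 1) is drawn per mutated chromosome.
    """
    n = len(chromosome)
    k = max(1, round(p_lm * n))      # size of the subset J (assumed rule)
    J = rng.sample(range(n), k)      # distinct gene positions
    R = rng.gauss(0.0, 1.0)          # the sign of R decides up or down scaling
    factor = 10.0 ** (2.0 * R)
    mutated = list(chromosome)
    for j in J:
        mutated[j] = factor * chromosome[j]
    return mutated, sorted(J), R


rng = random.Random(1)
s_i = [0.5, 0.1, 0.9, 0.3, 0.7]

# With p_lm = 1.0 every gene is multiplied by the same factor.
all_scaled, J_all, R_all = log_mutation(s_i, 1.0, rng)

# With p_lm = 0.4 only two of the five genes change.
some_scaled, J_some, R_some = log_mutation(s_i, 0.4, rng)
```

Since the factor is a power of ten with a real-valued exponent, a mutation moves the selected genes to a different order of magnitude rather than perturbing them additively.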

Note that because R is not necessarily an integer number, the power operation may introduce changes not only in the scale but also in the value itself. The mutation function was designed to resemble a random logarithmic grid-search over the scaling factors. The intuition behind it is to allow the GA to search in different scale regions for individual genes.

3.3 Fitness evaluation

The fitness of a chromosome is determined by its generalization capability when plugged into the weighted RBF kernel of an SVM classifier. We used the area under the curve (AUC) of the classifier in a Receiver Operating Characteristic (ROC) curve [13] as a measure of generalization performance. A given chromosome $s_i$ comprises the scale values $\sigma_j$ of equation (4), so a Gram kernel matrix $K_i$ can be computed using $s_i$ and all the examples in a dataset. An SVM classifier is trained with this matrix using a 5-fold cross-validation procedure. The fitness value is estimated by averaging the AUC over the 5 folds. We defined the fitness function as (6). Since the standard deviation is subtracted from the AUC mean, the fittest chromosomes are those with a high AUC average and a low dispersion. Thus the fitness value indicates the generalization capability of an SVM trained with a kernel with weights $s_i$.

$$f_i = \mathrm{AUC\_crossval\_avg}(s_i) - \mathrm{AUC\_crossval\_std}(s_i) \qquad (6)$$
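The fitness in (6) can be sketched as below, given the AUC obtained on each cross-validation fold. The SVM training itself is omitted. The rank-based AUC helper, the use of the population standard deviation, and the example fold values are assumptions made for illustration; the paper does not state which standard deviation estimator was used.

```python
import statistics


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def fitness(fold_aucs):
    """Mean AUC over the folds minus its standard deviation, as in (6)."""
    # Assumption: population standard deviation (pstdev) over the folds.
    return statistics.mean(fold_aucs) - statistics.pstdev(fold_aucs)


# Hypothetical per-fold AUC values for two chromosomes with the same mean.
stable = fitness([0.92, 0.92, 0.92, 0.92, 0.92])
unstable = fitness([0.90, 1.00, 0.95, 0.85, 0.90])
```

Both example chromosomes have the same mean AUC, but the second one is penalized for its dispersion across folds, so the GA would prefer the first.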

4 Experiments

4.1 Datasets and software

We performed experiments on a variety of datasets involving real and artificial data. We used the Iris and

Table 2. Classification performance of experiments. Rightmost columns show AUC estimate values averaged with standard deviation over a number of experiments. (N: number of experiments, G: number of generations, P: population size, p_c: crossover rate, p_m: mutation rate, p_lm: logarithmic mutation factor)

Dataset                  N    G    P     p_c   p_m   p_lm   Cross-validation   Held-out test
HAT (best in Fig. 1a)    20   25   200   0.8   0.2   1.0    99.81±1.78         98.26±1.
HAT (best in Fig. 1b)    20   25   200   0.8   0.2   1.0    99.81±1.78         98.26±1.
Iris                     30   30   30    0.8   0.2   0.3    96.04±1.67         89.23±6.
Iris-noise               30   30   30    0.8   0.2   0.3    91.92±2.77         87.25±6.
Heart                    20   30   30    0.8   0.2   0.3    86.14±1.75         81.32±6.
Heart-noise              20   30   30    0.8   0.2   0.3    85.36±1.51         77.47±7.
Random21                 10   30   30    0.8   0.2   0.3    87.80±0.99         86.68±3.
Repeat21                 10   30   30    0.8   0.2   0.3    89.65±0.99         90.01±3.
Redund21                 10   30   30    0.8   0.2   0.3    88.49±1.04         86.77±1.

Figure 2. Classification performance over evolutionary time. Plots of average fitness for the best individual and the mean population are shown for some of the experiments in Table 2. Values are averaged over the number of repetitions, N. (a) Iris dataset, (b) Heart, (c) Repeat21, (d) HAT.

In order to study the role of the $p_{lm}$ parameter in the quality of solutions found by the logarithmic mutation, we carried out further experiments on the HAT dataset. We are particularly interested in this experiment because proteomics is currently a hot topic for experimentation in bioinformatics. Besides, this dataset has a higher dimensionality than those previously described. We varied $p_{lm}$ stepwise within a range of 0. to 1.0 (Figure 1a). Note that the best classification results, over 95%, were obtained when setting $p_c = 0.8$ with $p_{lm} = 1.0$. The effect of this value of the mutation factor is that the complete genome, that is, the whole set of scaling factors, is translated in the same direction to a larger or smaller scale, allowing the GA to explore different orders of magnitude during the computation of the weighting parameters. Useful combinations of subsets of weights in dissimilar scales are then propelled by the crossover rate. Hence we studied the effect of changing the crossover rate using the best mutation factor of 1.0 (Figure 1b). There were no major changes in the classification performance when $p_c$ varies from 0.5 to

Lastly, we assessed the practicality of the GA for tuning the kernel parameters by tracing the SVM generalization performance during the evolutionary process. Figure 2 shows plots of AUC vs. generations for experiments with the Iris (Fig. 2a), Heart (2b), Repeat21 (2c) and HAT (2d) datasets, averaged over the number of experiments reported in Table 2 (up to the maximum number of generations before the algorithm stalled over the set of experiments). In all cases the AUC tends to increase as the number of generations grows. In the Heart and Repeat21 datasets the trend has a small slope, as these are noisy datasets. On the other hand, for the Iris and HAT datasets there is a sudden increase in both the population mean and the best chromosome fitness during the initial generations, after which fitness keeps growing gradually, showing that the sets of parameters searched by the GA improve meaningfully over time. A similar behaviour was observed in the remaining datasets.

5 Conclusions

We have described a GA approach for adjusting multiple parameters in SVM kernels. Although we considered weighted RBF kernels, the method can be extended to other weighted kernels. The experiments showed encouraging results in generalization performance for tuning kernels including a few (4, 6) or a large (20, ...) number of parameters. In the latter case parameterization is prohibitive with the standard grid-search technique due to computational costs. In the particular case of the HAT proteomic dataset, the performance achieved is similar to that reported by our collaborators in a previous study using other machine learning methods not related to weighted kernels or SVM [16].

This study showed the applicability of GA for adapting SVM kernels to a particular dataset. However, there are interesting questions arising from this approach. For example, we observed high variability in the results of the held-out tests. When examining the weights given by the best chromosomes evolved for a specific dataset, we found that they are very heterogeneous in scale of magnitude due to the logarithmic mutation that was used. This prompted us to design a different mutation strategy, where the weights are all maintained in a homogeneous scale by controlling a single global width parameterized beforehand using a grid-search. Preliminary results of this combined strategy are being reported in an ongoing paper.

Other ideas emerging from this work might provide useful insight in outlining new algorithms for tasks like feature subset selection and feature extraction. If the weights encoded in the chromosome represent scale factors of the input variables, they can indicate the degree of relevance of those variables while learning the concept implicit in the dataset. Once the GA has evolved, those variables with the highest scale factors can be regarded as the most important for solving the given task. We are currently working in this direction by having the GA method described above apply a cut-off threshold on the vector $s_i$, forcing the less relevant features to zero and thus giving sparse weights for the selected features (alternatively they can be ranked by magnitude). Since the kernel weights must be plugged into the SVM during training, this can be considered a wrapper method for feature selection [19], in contrast to other GA approaches where the chromosome encodes the inclusion, or the identification, of the variables to be included in a filter method [15, 20]. We intend to use this approach for biomarker discovery (results will be published elsewhere).
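The cut-off and ranking ideas mentioned above can be sketched as follows. The function names, the threshold value and the example weights are hypothetical, and the sketch only illustrates the thresholding step, not the authors' ongoing method.

```python
def sparsify(weights, threshold):
    """Set scale factors below the cut-off threshold to zero."""
    return [w if w >= threshold else 0.0 for w in weights]


def rank_features(weights):
    """Indices of the input variables ordered from largest to smallest weight."""
    return sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)


# Hypothetical evolved chromosome with heterogeneous scales.
s_i = [0.02, 3.5, 0.0007, 12.0]

print(sparsify(s_i, threshold=0.1))  # [0.0, 3.5, 0.0, 12.0]
print(rank_features(s_i))            # [3, 1, 0, 2]
```

In this toy case the second and fourth variables survive the cut-off, and the ranking orders all four variables by the magnitude of their evolved scale factors.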

Acknowledgments

We would like to thank our team of collaborators, Prof. Sanjeev Krishna, Dr. Dan Agranoff and Dr. Marios Papadopoulos at the Department of Cellular and Molecular Medicine, St George's Hospital Medical School, London, UK, for allowing us to use the HAT dataset in this preliminary work. Datasets and comprehensive analytical studies will be published elsewhere. We also thank Dr. Mark Herbster, Prof. Anthony Finkelstein (Dept. of Computer Science, UCL, London, UK) and Dr. Anthony A. Holder (Division of Parasitology, National Institute for Medical Research, London, UK) for valuable discussions and providing support for this work. Finally, we are grateful to the reviewers for their useful comments.