Adapting Multiple Kernel Parameters for Support Vector Machines
using Genetic Algorithms
Sergio A. Rojas
Division of Parasitology, National Institute for Medical Research, London NW7 1AA, UK
and Department of Computer Science, University College London
srojas@nimr.mrc.ac.uk

Delmiro Fernandez-Reyes
Division of Parasitology, National Institute for Medical Research, London NW7 1AA, UK
and Department of Computer Science, University College London
dfernan@nimr.mrc.ac.uk

Abstract- Kernel parameterization is a key design step in the application of support vector machines (SVM) for supervised learning problems. A grid-search with a cross-validation criterion is often conducted to choose the kernel parameters, but it is computationally infeasible for a large number of them. Here we describe a genetic algorithm (GA) as a method for tuning kernels of multiple parameters for classification tasks, with application to the weighted radial basis function (RBF) kernel. In this type of kernel the number of parameters equals the dimension of the input patterns, which is usually high for biological datasets. We show preliminary experimental results where adapted weighted RBF kernels for SVM achieve classification performance over 98% in human serum proteomic profile data. Further improvements to this method may lead to discovery of relevant biomarkers in biomedical applications.
1 Introduction
The Support Vector Machine [1] (SVM) is a well-known supervised machine learning technique that has been applied successfully to a wide variety of problems, including classification [2], regression [3] and clustering [4], in diverse domains such as web text mining [5], gene expression [6] and proteome analysis of infectious diseases [work in progress]. The SVM was proposed originally as a learning algorithm to find an optimal discrimination function between two linearly-separable classes by maximizing the margin of the closest samples to a separating hyperplane in the input-dimensional space [2]. Further extensions have been made to handle the non-separable cases with a soft margin parameter [7] and non-linear cases by the use of a kernel function [2, 7]. The kernel function computes a measure of similarity between input patterns in a transformed vectorial space.
The function chosen to carry out the kernel mapping may be dependent on parameters such as the degree in a polynomial kernel or the width in a radial basis function (RBF) kernel. These parameters must be tuned to the specific dataset in order to get the best performance of the SVM. Usually a grid-search through a range of values for the parameters is used, by varying one parameter with a fixed step-size while keeping the others constant [8]. However, for kernels with a large number of parameters, such as weighted kernels, this is computationally infeasible. Gradient descent might be used for this purpose, as described in [11], although this method may converge to local minima.
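To make the cost argument concrete, the following minimal sketch counts the kernel configurations in an exhaustive grid. It assumes that every parameter is tried at the same number of candidate values; the function name `grid_size` and the figure of 10 values per parameter are illustrative choices, not taken from the experiments.

```python
def grid_size(values_per_parameter: int, n_parameters: int) -> int:
    """Number of kernel configurations in an exhaustive grid.

    Assumption: each of the n parameters is tried at the same number of
    candidate values, and every combination is evaluated once.
    """
    return values_per_parameter ** n_parameters


if __name__ == "__main__":
    # Each configuration implies at least one SVM training run
    # (more when cross-validation is used).
    for n in (1, 2, 4, 20):
        print(f"{n:2d} parameters -> {grid_size(10, n)} configurations")
```

The count grows exponentially with the number of parameters, which is why a grid that is affordable for one or two parameters becomes impractical for a weighted kernel with one parameter per input variable.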
We propose using a genetic algorithm [9] (GA) to search the parameterization space of SVM kernels with multiple parameters, with application to classification problems. In the next section we briefly describe the SVM and weighted kernels. Section 3 explains the GA approach for tuning weighted kernels, and experimental results on artificial and real datasets are shown in Section 4. The paper concludes with some directions for future work.
2 SVM and Weighted Kernels
We consider the problem of binary classification in a dataset of examples. Let $D = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ be the set of training examples, where $x_i \in \mathbb{R}^n$ is an n-dimensional input vector, $y_i \in \{+1, -1\}$ is its corresponding class label and $l$ is the number of examples. A kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ computes the inner product between two examples, $K(x, z) = \langle \Phi(x), \Phi(z) \rangle$, where $\Phi$ is a mapping from the input space to a transformed feature space. In this feature space an SVM learns a decision function or hyperplane of the form $f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$, where the coefficients $\alpha_i$ are found by solving a constrained quadratic optimization problem aimed to maximize the margin or distance of opposite examples to the hyperplane, and to minimize a regularization factor that allows for misclassifications (for a comprehensive description of SVM the reader is referred to [10]). Support vectors are those examples $x_i$ with corresponding $\alpha_i > 0$. It is not necessary to know the underlying feature mapping if the function $K(x, z)$ satisfies the conditions of Mercer's theorem [10] (guaranteed with a positive semi-definite Gram matrix $K = (K(x_i, x_j))_{i,j=1}^{l}$).

0-7803-9363-5/05/$20.00 ©2005 IEEE.
Valid common kernel functions are the polynomial kernel
$$K(x, z) = (a \langle x, z \rangle + 1)^d \qquad (1)$$
and the radial basis function kernel
$$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) \qquad (2)$$
These two kernels have few parameters. In the polynomial kernel, $d$ determines the dimension of the kernel (a linear kernel has $d = 1$) and $a$ is a scaling factor. In equation (2), $\sigma$ is a factor that shapes the width of the radial basis function. By including an independent scaling factor for each input variable, it is possible to define a more general form of these two kernels [11], the weighted polynomial kernel:
$$K(x, z) = \left(\sum_{j=1}^{n} a_j x_j z_j + 1\right)^d \qquad (3)$$
and the weighted RBF kernel:
$$K(x, z) = \exp\left(-\sum_{j=1}^{n} \frac{(x_j - z_j)^2}{2\sigma_j^2}\right) \qquad (4)$$
The number of parameters or scale factors for these
kernels equals the dimensionality of the input vectors.
Note that for dimension greater than 3 or 4 it becomes
intractable to adjust them by a grid-search. Hence we
propose a GA-based method to overcome this problem.
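As an illustration of how one set of scale factors turns into a kernel matrix, the sketch below computes a weighted RBF kernel and its Gram matrix in plain Python. It rests on an assumption: that the weighted RBF kernel of equation (4) divides each squared coordinate difference by $2\sigma_j^2$, so that equal scale factors recover the unweighted kernel (2). The function names `weighted_rbf` and `gram_matrix` are hypothetical.

```python
import math
from typing import List, Sequence


def weighted_rbf(x: Sequence[float], z: Sequence[float],
                 sigma: Sequence[float]) -> float:
    """Weighted RBF kernel with one scale factor per input variable.

    Assumed form: exp(-sum_j (x_j - z_j)^2 / (2 * sigma_j^2)).
    """
    if not (len(x) == len(z) == len(sigma)):
        raise ValueError("x, z and sigma must have the same length")
    if any(s == 0 for s in sigma):
        raise ValueError("scale factors must be non-zero")
    exponent = sum((xj - zj) ** 2 / (2.0 * sj ** 2)
                   for xj, zj, sj in zip(x, z, sigma))
    return math.exp(-exponent)


def gram_matrix(examples: Sequence[Sequence[float]],
                sigma: Sequence[float]) -> List[List[float]]:
    """Kernel value for every pair of examples (symmetric, ones on the diagonal)."""
    return [[weighted_rbf(xi, xk, sigma) for xk in examples] for xi in examples]


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
    for row in gram_matrix(data, sigma=[1.0, 2.0]):
        print([round(v, 4) for v in row])
```

Every candidate vector of scale factors requires a fresh Gram matrix and a fresh SVM training run, which is the unit of cost that both grid-search and the GA have to pay per evaluation.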
3 GA for Adapting Weighted Kernels
Below we describe a kernel tuning approach for SVM using a GA. We parameterized weighted RBF kernels, but the approach can be applied to other kinds of weighted kernels. Genetic algorithms have not been applied to choosing multiple parameters in weighted kernels, although a related approach has been reported recently to tune generalized Gaussian kernels by means of evolutionary strategies [12]. In that study the kernel matrix is modified using a covariance matrix adaptation method with constraints to guarantee its applicability to an SVM (i.e. the resulting matrix must be symmetric, positive definite). The recombination of good individuals is made by averaging the population (obtaining its center of mass), which prevents useful crossover such as that of parents with opposite scale magnitude.
3.1 Encoding kernel parameters
A standard GA [9] was used in this approach. We define a chromosome as an n-dimensional vector of real values, $s_i = (\sigma_1, \sigma_2, \ldots, \sigma_n)$. Each gene $\sigma_j$ represents the scale factor for the j-th input variable. The chromosome is then used in (3) or (4) when computing the kernel matrix K.
3.2 Genetic operators
The initial chromosome population is randomly generated with values between 0 and 1. We used single-point crossover to recombine subsets of scale factors. The number of parent individuals is defined by a crossover rate, $0 < p_c < 1$.
Variations of scale factors are introduced by a logarithmic mutation function which is applied to a fraction $p_m = 1 - p_c$ of the individuals in each generation. For these chromosomes a subset of genes $J$ is chosen randomly across the genome according to a mutation factor $0 < p_{lm} < 1$. Next, a random normally distributed number $R \sim N(0, 1)$ is generated and the values of the genes in $J$ are scaled up or down (depending on the sign of $R$) within two folds of the current value, by rule (5).
$$\sigma_j(t + 1) = 10^{(2R)} \sigma_j(t), \quad j \in J \subseteq \{1, 2, \ldots, n\} \qquad (5)$$
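The encoding and the genetic operators described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: each gene is assumed to join the subset $J$ independently with probability $p_{lm}$, a single $R$ is drawn per mutated chromosome, and the names `random_population`, `single_point_crossover` and `log_mutation` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Chromosome = List[float]


def random_population(size: int, n_genes: int,
                      rng: random.Random) -> List[Chromosome]:
    """Initial population: every gene is drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(n_genes)] for _ in range(size)]


def single_point_crossover(a: Sequence[float], b: Sequence[float],
                           rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Swap the tails of two parents after one random cut point."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("parents need equal length of at least 2")
    cut = rng.randint(1, len(a) - 1)  # both ends inclusive
    return list(a[:cut]) + list(b[cut:]), list(b[:cut]) + list(a[cut:])


def log_mutation(chromosome: Sequence[float], p_lm: float,
                 rng: random.Random) -> Chromosome:
    """Scale the selected genes by 10**(2R), with one R ~ N(0, 1) per chromosome."""
    factor = 10.0 ** (2.0 * rng.gauss(0.0, 1.0))
    # Assumption: each gene joins the subset J independently with probability p_lm.
    return [g * factor if rng.random() < p_lm else g for g in chromosome]


if __name__ == "__main__":
    rng = random.Random(42)
    population = random_population(size=4, n_genes=5, rng=rng)
    child_1, child_2 = single_point_crossover(population[0], population[1], rng)
    print(log_mutation(child_1, p_lm=0.3, rng=rng))
```

With `p_lm = 1.0` every gene is multiplied by the same factor, so the whole chromosome moves to a different order of magnitude while the ratios between scale factors are preserved.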
Note that because $R$ is not necessarily an integer number, the power operation may introduce changes not only in the scale but also in the value itself. The mutation function was designed to resemble a random logarithmic grid-search over the scaling factors. The intuition behind it is to allow the GA to search in different scale regions for individual genes.
3.3 Fitness evaluation
The fitness of a chromosome is determined by its generalization capability when plugged into the weighted RBF kernel of an SVM classifier. We used the area under the curve (AUC) of the classifier in a Receiver Operating Characteristic (ROC) curve [13] as a measure of generalization performance. A given chromosome $s_i$ comprises the scale values $\sigma_j$ of equation (4), so a Gram kernel matrix $K_i$ can be computed using $s_i$ and all the examples in a dataset. An SVM classifier is trained with this matrix using a 5-fold cross-validation procedure. The fitness value is estimated by averaging the AUC over the 5 folds. We defined the fitness function as (6). Since the standard deviation is subtracted from the AUC mean, the fittest chromosomes are those with a high AUC average and a low dispersion. Thus the fitness value indicates the generalization capability of an SVM trained with a kernel with weights $s_i$.
$$f_i = \mathrm{AUC\_crossval\_avg}(s_i) - \mathrm{AUC\_crossval\_std}(s_i) \qquad (6)$$
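The fitness in (6) can be sketched as below, given the per-fold AUC values. The SVM training itself is left out; the sketch only shows how the AUC of one fold and the final fitness might be computed. The AUC is obtained from pairwise comparisons of positive and negative scores, which is equivalent to the area under the ROC curve. Whether the standard deviation in (6) is the population or the sample estimate is not stated, so the use of the population form here is an assumption, and the names `auc` and `fitness` are hypothetical.

```python
import statistics
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve from decision scores and labels in {+1, -1}.

    Computed as the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == -1]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def fitness(fold_aucs: Sequence[float]) -> float:
    """Mean AUC over the folds minus its standard deviation.

    Assumption: population standard deviation (statistics.pstdev).
    """
    return statistics.mean(fold_aucs) - statistics.pstdev(fold_aucs)


if __name__ == "__main__":
    # Illustrative per-fold AUC values, not results from the experiments.
    print(fitness([0.95, 0.90, 1.00, 0.85, 0.90]))
```

Subtracting the dispersion penalizes chromosomes whose performance varies strongly between folds, so two chromosomes with the same mean AUC are ranked by their stability.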
4 Experiments
4.1 Datasets and software
We performed experiments on a variety of datasets involving real and artificial data. We used the Iris and
Table 2. Classification performance of experiments. Rightmost columns show AUC estimate values averaged with standard deviation over a number of experiments. (N: number of experiments, G: number of generations, P: population size, p_c: crossover rate, p_m: mutation rate, p_lm: logarithmic mutation factor)

Dataset                 N   G   P    p_c  p_m  p_lm  Cross-validation  Held-out test
HAT (best in Fig. 1a)   20  25  200  0.8  0.2  1.0   99.81±1.78        98.26±1.
HAT (best in Fig. 1b)   20  25  200  0.8  0.2  1.0   99.81±1.78        98.26±1.
Iris                    30  30  30   0.8  0.2  0.3   96.04±1.67        89.23±6.
Iris-noise              30  30  30   0.8  0.2  0.3   91.92±2.77        87.25±6.
Heart                   20  30  30   0.8  0.2  0.3   86.14±1.75        81.32±6.
Heart-noise             20  30  30   0.8  0.2  0.3   85.36±1.51        77.47±7.
Random21                10  30  30   0.8  0.2  0.3   87.80±0.99        86.68±3.
Repeat21                10  30  30   0.8  0.2  0.3   89.65±0.99        90.01±3.
Redund21                10  30  30   0.8  0.2  0.3   88.49±1.04        86.77±1.
[Figure 2: four panels titled iris, heart, repeat21 and hat, each plotting fitness against Generations for two curves, Best and Population.]
Figure 2. Classification performance over evolutionary time. Plots of average fitness for the best individual and the mean population are shown for some of the experiments in Table 2. Values are averaged over the number of repetitions, N. (a) Iris dataset, (b) Heart, (c) Repeat21, (d) HAT.
In order to study the role of the $p_{lm}$ parameter in the quality of solutions found by the logarithmic mutation, we carried out further experiments on the HAT dataset. We are particularly interested in this experiment because proteomics is currently an active area of experimentation in bioinformatics. Besides, this dataset has a higher dimensionality than those previously described. We varied $p_{lm}$ stepwise within a range of 0. to 1.0 (Figure 1a). Note that the best classification results, over 95%, were obtained when setting $p_c = 0.8$ with $p_{lm} = 1.0$. The effect of this value of the mutation factor is that the complete genome, that is, the whole set of scaling factors, is translated in the same direction to a larger or smaller scale, allowing the GA to explore different orders of magnitude during the computation of the weighting parameters. Useful combinations of subsets of weights in dissimilar scales are then propelled by the crossover rate. Hence we studied the effect of changing the crossover rate using the best mutation factor of 1.0 (Figure 1b). There were no major changes in the classification performance when $p_c$ varies from 0.5 to
Lastly, we assessed the practicality of the GA for tuning the kernel parameters by tracing the SVM generalization performance during the evolutionary process. Figure 2 shows plots of AUC vs generations for experiments with the Iris (Fig. 2a), Heart (2b), Repeat21 (2c) and HAT (2d) datasets, averaged over the number of experiments reported in Table 2 (until the maximum number of generations before the algorithm became stalled over the set of experiments). In all cases the AUC tends to increase as the number of generations grows. In the Heart and Repeat21 datasets the trend has a small slope, as these are noisy datasets. On the other hand, for the Iris and HAT datasets there is a sudden increase in both the population mean and the best chromosome fitness during the initial generations, after which the fitness keeps growing gradually, showing that the set of parameters searched by the GA improves over time with a meaningful effect. A similar behaviour was observed in the remaining datasets.
5 Conclusions
We have described a GA approach for adjusting multiple parameters in SVM kernels. Although we considered weighted RBF kernels, the method can be extended to other weighted kernels. The experiments showed encouraging results in generalization performance for tuning kernels including a few (4, 6) or a large (20, - number of parameters. In the latter case parameterization is prohibitive with the standard grid-search technique due to computational costs. In the particular case of the HAT proteomic dataset, the performance achieved is similar to that reported by our collaborators in a previous study using other machine learning methods not related to weighted kernels or SVM [16]. This study showed the applicability of GA for adapting SVM kernels to a particular dataset. However, there are interesting questions arising from this approach. For example, we obtained a high variability in the results of the held-out tests. When examining the weights given by the best chromosomes evolved for a specific dataset, we found that they are very heterogeneous in scale of magnitude due to the logarithmic mutation that was used. This prompted us to design a different mutation strategy, where the weights are all maintained in a homogeneous scale by controlling a single global width parameterized beforehand using a grid-search. Preliminary results of this combined strategy are being reported in an ongoing paper.
Other ideas emerging from this work might provide useful insight in outlining new algorithms for tasks like feature subset selection and feature extraction. If the weights encoded in the chromosome represent scale factors of the input variables, they can indicate the degree of relevance of those variables while learning the concept implicit in the dataset. Once the GA has evolved, those variables with the highest scale factors can be regarded as the most important for solving the given task. We are currently working in this direction by having the GA method described above apply a cut-off threshold on the vector $s_i$, forcing the less relevant features to zero and thus giving sparse weights for the selected features (alternatively, they can be ranked by magnitude). Since the kernel weights must be plugged into the SVM during training, this can be considered a wrapper method for feature selection [19], in contrast to other GA approaches where the chromosome encodes the inclusion, or the identification, of the variables to be included in a filter method [15, 20]. We intend to use this approach for biomarker discovery (results will be published elsewhere).
Acknowledgments
We would like to thank our team of collaborators Prof. Sanjeev Krishna, Dr. Dan Agranoff and Dr. Marios Papadopoulos at the Department of Cellular and Molecular Medicine, St George's Hospital Medical School, London, UK for allowing us to use the HAT dataset in this preliminary work. Datasets and comprehensive analytical studies will be published elsewhere. We also thank Dr. Mark Herbster, Prof. Anthony Finkelstein (Dept. of Computer Science, UCL, London, UK) and Dr. Anthony A. Holder (Division of Parasitology, National Institute for Medical Research, London, UK) for valuable discussions and providing support for this work. Finally, we are grateful to the reviewers for their useful comments.