





Methods for applying kernel-learning techniques to similarity matrix data, specifically focusing on matrix exponential update as a more sophisticated approach than naïve and diffusion kernel methods. The report includes an explanation of matrix exponential update, its application to kernel learning, experiment results on a drug discovery dataset, and conclusions with future work.






Learning Kernel Matrix by Matrix Exponential Update
In many applications, it is more natural to describe the relationship between two examples by a similarity score than to represent each example by a vector of values, as traditional machine learning does. In these cases the data is a matrix. The matrix may have missing values, which indicate that the relationship between two examples is unknown or that the two examples cannot be compared directly. The matrix may be asymmetric, and the relationship between examples may be an ordered one. The matrix is also not necessarily square. A common case is that only a certain number of (positive) examples are of interest, and the similarity scores between a large number of examples and these examples of interest are computed. The Nearest Neighbor method is the most natural and widely used method for this kind of data.
In this project, we apply kernel-learning methods to similarity matrix data. For the sake of simplicity, we assume that the similarity matrix is square. The obstacle is that a similarity matrix is not generally a kernel matrix: a kernel matrix needs to be square, symmetric and positive semidefinite, and it should not contain missing values.
A simple way to construct a kernel matrix from a similarity matrix A is the following. When only one of the positions (i,j) and (j,i) is missing, we copy its symmetric counterpart into the missing position. We then replace A by (A+A’)/2, which gives a symmetric matrix. Any values that are still missing in A are set to zero. If A is not positive semidefinite, we add a positive constant λ to the diagonal elements of A to make it positive definite: K=A+λI. The constant λ is set slightly greater than the absolute value of the minimum eigenvalue of A. We call this the “naïve” approach. A kernel matrix constructed this way is called a “Diag” kernel.
A second approach, the “diffusion” kernel [2], makes use of a property of the matrix exponential function: the matrix exponential always maps a symmetric matrix to a symmetric positive definite matrix, so the result can be used as a kernel matrix. In this approach, K=exp(βA), where exp is the matrix exponential function, A is a symmetric matrix and β is a constant.
The third approach, matrix exponential update [1], is an on-line algorithm. It also makes use of the property of the matrix exponential function, so it is closely related to the diffusion kernel. However, matrix exponential update is more sophisticated. It is derived using the von Neumann divergence and the square loss, and a relative loss bound has been established for this update [1]. We show later in this report that matrix exponential update significantly outperforms the other two approaches.
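The naïve and diffusion constructions described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the report: the function names, the use of NaN to mark missing entries, the `margin` added on top of the absolute minimum eigenvalue, and the toy matrix are all assumptions made for the example.

```python
import numpy as np

def symmetrize(A):
    """Average A with its transpose; NaN marks a missing entry (assumption)."""
    A = np.asarray(A, dtype=float)
    # If only one of (i, j) and (j, i) is missing, copy the known value across.
    filled = np.where(np.isnan(A), A.T, A)
    S = (filled + filled.T) / 2.0
    # Entries missing in both positions are set to zero.
    return np.where(np.isnan(S), 0.0, S)

def diag_kernel(S, margin=1e-6):
    """'Diag' kernel: K = S + lambda * I, lambda slightly above |min eigenvalue|."""
    lam_min = np.linalg.eigvalsh(S).min()
    if lam_min >= 0:
        return S.copy()  # already positive semidefinite, no shift needed
    lam = abs(lam_min) + margin
    return S + lam * np.eye(S.shape[0])

def diffusion_kernel(S, beta=1.0):
    """Diffusion kernel: K = exp(beta * S), via the eigendecomposition of S."""
    vals, vecs = np.linalg.eigh(beta * S)
    return (vecs * np.exp(vals)) @ vecs.T

# Toy asymmetric similarity matrix with missing values (made-up numbers).
A = np.array([[1.0, 0.8, np.nan],
              [0.6, 1.0, -0.9],
              [np.nan, np.nan, 1.0]])
S = symmetrize(A)
K_diag = diag_kernel(S)
K_diff = diffusion_kernel(S, beta=0.5)
```

Because the input to the exponential is symmetric, the eigendecomposition route gives the matrix exponential directly: the eigenvalues are exponentiated and are therefore always positive.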
The report is organized as follows: Section 2 explains matrix exponential update; Section 3 discusses how it can be used for kernel learning; Section 4 shows the experimental results on a drug discovery dataset; Section 5 presents the conclusions and future work.
Matrix exponential update is a natural extension of the exponentiated gradient (EG) algorithm. EG’s parameter is a vector w, while matrix exponential update’s parameter is a symmetric positive definite matrix W. At each trial t, the algorithm receives an instance $X_t \in \mathbb{R}^{n \times n}$ and predicts $\hat{y}_t = \mathrm{tr}(W_t X_t)$ using its current parameter, a symmetric positive definite matrix $W_t$. After learning the instance’s true label $y_t$, it incurs a loss $(\hat{y}_t - y_t)^2$ and updates $W_t$. The objective function optimized by the update is:

$$W_{t+1} = \arg\min_{W} \; \Delta_F(W, W_t) + \eta \left( \mathrm{tr}(W X_t) - y_t \right)^2$$

where $\Delta_F$ is the Bregman divergence generated by a convex function $F$. Setting the derivative with respect to W to zero, we have

$$\nabla F(W_{t+1}) - \nabla F(W_t) + \eta \, \nabla \left( \left( \mathrm{tr}(W_{t+1} X_t) - y_t \right)^2 \right) = 0$$

The problem is not solvable in closed form. A common trick is to approximate $W_{t+1}$ in the loss term by $W_t$. Then we have

$$W_{t+1} = (\nabla F)^{-1} \left( \nabla F(W_t) - 2 \eta \left( \mathrm{tr}(W_t X_t) - y_t \right) X_t \right)$$

In our case, we use the von Neumann entropy as the convex function for the Bregman divergence:

$$F(W) = \mathrm{tr}(W \log W - W), \qquad \nabla F(W) = \log W, \qquad (\nabla F)^{-1}(W) = \exp(W)$$

We also add the constraint tr(W)=1. The update becomes

$$W_{t+1} = \frac{1}{Z_t} \exp \left( \log W_t - 2 \eta \left( \mathrm{tr}(W_t X_t) - y_t \right) X_t \right), \qquad Z_t = \mathrm{tr} \left( \exp \left( \log W_t - 2 \eta \left( \mathrm{tr}(W_t X_t) - y_t \right) X_t \right) \right)$$

When $W_0$ is symmetric positive definite and all the $X_t$ are symmetric, the term $\log W_t - 2 \eta \left( \mathrm{tr}(W_t X_t) - y_t \right) X_t$ is always symmetric, so $W_{t+1}$ is always symmetric positive definite after each update. This property is used for kernel learning.
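A single step of the normalized update can be sketched as follows. This is only an illustration under stated assumptions: the names `sym_fn` and `mexp_update`, the learning rate, and the toy instance are hypothetical, and the matrix logarithm and exponential are computed through an eigendecomposition, which is valid because the arguments are symmetric.

```python
import numpy as np

def sym_fn(M, fn):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * fn(vals)) @ vecs.T

def mexp_update(W, X, y, eta):
    """One trace-normalized matrix exponential update.

    W must be symmetric positive definite and X symmetric.
    Returns the new parameter and the square loss incurred before updating.
    """
    y_hat = np.trace(W @ X)                      # prediction tr(W_t X_t)
    G = sym_fn(W, np.log) - 2.0 * eta * (y_hat - y) * X
    G = (G + G.T) / 2.0                          # guard against round-off asymmetry
    E = sym_fn(G, np.exp)
    return E / np.trace(E), (y_hat - y) ** 2     # divide by Z_t = tr(exp(...))

# Toy trial (made-up numbers): with this X, tr(W X) equals the entry W[0, 1].
W0 = np.eye(3) / 3.0
X = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
y = 0.2
W1, loss = mexp_update(W0, X, y, eta=1.0)
```

After the step, `W1` is still symmetric with unit trace and positive eigenvalues, and its prediction on the same instance has moved from 0 toward the target 0.2.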
are so different from each other that no reasonable alignment can be made. The data is visualized in figure 1.
Figure 1. Original similarity matrix. Brighter or warmer color means higher value.
The test setup is as follows. We test the three approaches introduced above: the naïve approach (“Diag” kernel), the diffusion kernel, and matrix exponential update (“MExp” kernel). After learning the kernel matrices, we use an SVM as the kernel-based learning algorithm. We also test the Nearest Neighbor method. We perform 10 random splits of the dataset into 50% training and 50% testing data. The reported results (test error rates) are averaged over the 10 runs.
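The evaluation protocol can be sketched for the Nearest Neighbor case as below. The sketch makes several assumptions that the report does not state: a 1-nearest-neighbor rule, rows of the matrix indexing the query example and columns the candidate neighbor, no missing entries, and a made-up block-structured toy matrix. The function names and the seed are hypothetical.

```python
import numpy as np

def nearest_neighbor_error(S, labels, train_idx, test_idx):
    """1-NN error: each test example takes the label of the training example
    with the highest similarity score in its row of S."""
    sims = S[np.ix_(test_idx, train_idx)]
    nearest = train_idx[np.argmax(sims, axis=1)]
    return float(np.mean(labels[nearest] != labels[test_idx]))

def average_error(S, labels, n_runs=10, seed=0):
    """Average test error over random 50% / 50% train-test splits."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    errors = []
    for _ in range(n_runs):
        perm = rng.permutation(n)
        train_idx, test_idx = perm[: n // 2], perm[n // 2:]
        errors.append(nearest_neighbor_error(S, labels, train_idx, test_idx))
    return float(np.mean(errors))

# Toy data (made up): two classes, similarity 1 within a class and 0 across.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
S = (labels[:, None] == labels[None, :]).astype(float)
fixed_err = nearest_neighbor_error(S, labels,
                                   np.array([0, 1, 4, 5]),
                                   np.array([2, 3, 6, 7]))
avg = average_error(S, labels)
```

Since the rule only compares similarity scores, the same function can be applied to an asymmetric matrix without first turning it into a kernel.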
(a) The produced kernel matrices
We first make the original matrix symmetric, as discussed in the first section. The result is shown in figure 2. (Note that in this report, the same color in different figures does not mean the same value; a color is only meaningful relative to the other colors within the same figure.) All three approaches start from this symmetric matrix.
Figure 2. The symmetric matrix produced by averaging the similarity matrix and its transpose.
The “naïve” approach adds a positive constant to the diagonal elements of the above matrix (figure 3).
Figure 3. The “Diag” matrix produced by “naïve” approach.
In the following, we also show the obtained “diffusion” kernel and “MExp” kernel.
(b) Total loss of matrix exponential update
In figure 6, we show the total loss of matrix exponential update at each iteration. We can see that after 10 iterations, the total loss becomes stable.
Figure 6. Total loss of matrix exponential update.
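A curve of total loss per iteration can be produced with a loop like the one below. This is a hedged sketch, not the experiment from the report: it assumes that one iteration is one pass over all training instances, and the toy problem (instances that each read out one entry of W, with targets taken from a hypothetical trace-one matrix `W_star`), the learning rate, and the number of passes are made up for illustration.

```python
import numpy as np

def sym_fn(M, fn):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * fn(vals)) @ vecs.T

def mexp_update(W, X, y, eta):
    """One trace-normalized matrix exponential update; returns (W_next, loss)."""
    err = np.trace(W @ X) - y
    G = sym_fn(W, np.log) - 2.0 * eta * err * X
    E = sym_fn((G + G.T) / 2.0, np.exp)
    return E / np.trace(E), err ** 2

def total_loss_per_pass(instances, targets, W0, eta, n_passes):
    """Run several passes over the data and record the summed loss of each pass."""
    W, history = W0, []
    for _ in range(n_passes):
        total = 0.0
        for X, y in zip(instances, targets):
            W, loss = mexp_update(W, X, y, eta)
            total += loss
        history.append(total)
    return W, history

# Hypothetical toy problem: recover the entries of a trace-one matrix W_star.
n = 3
W_star = np.array([[0.5, 0.2, 0.0],
                   [0.2, 0.3, 0.1],
                   [0.0, 0.1, 0.2]])
instances, targets = [], []
for i in range(n):
    for j in range(i, n):
        X = np.zeros((n, n))
        X[i, j] += 0.5
        X[j, i] += 0.5          # symmetric instance with tr(W X) = W[i, j]
        instances.append(X)
        targets.append(W_star[i, j])

W, history = total_loss_per_pass(instances, targets, np.eye(n) / n,
                                 eta=0.5, n_passes=50)
```

Plotting `history` against the pass index gives a total-loss curve of the kind described for figure 6.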
(c) Generalization error
Figure 7 shows the generalization errors of SVM on the kernels produced by the three approaches. Matrix exponential update achieves the best test error rate. Interestingly, the test error stabilizes only after 60 iterations, whereas the total loss in the previous figure becomes stable after 10 iterations.
Figure 7
The generalization errors of Nearest Neighbor on the kernel matrices are shown in figure 8. We also show the result of Nearest Neighbor on the original asymmetric similarity matrix. It achieves the best overall error rate (even compared with the results using SVM). This indicates that the simple averaging used to make the matrix symmetric may not suit the nature of these data.
Figure 8