Numerical Linear Algebra for Data Exploration: Two-dimensional SVD and PCA Algorithm, Study notes of Algorithms and Programming

These notes cover numerical linear algebra for data exploration, focusing on two-dimensional singular value decomposition (SVD) and principal component analysis (PCA). The instructor is Jieping Ye. The notes present the problem formulation, the main algorithm (GLRAM), and the associated optimization problem. They also discuss the effect of the ratio of ℓ1 to ℓ2 on reconstruction error and the sensitivity of GLRAM to the choice of the initial L0.

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009

koofers-user-trs


CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration: Two-dimensional SVD and PCA

Instructor: Jieping Ye

1 Introduction

  • Traditional methods in information retrieval and machine learning deal with data in vectorized representation. A collection of data is then stored in a single matrix $A \in \mathbb{R}^{N \times n}$, where each column of $A$ corresponds to a vector in the $N$-dimensional space. A major benefit of this vector space model is that the algebraic structure of the vector space can be exploited.
  • For high-dimensional data, one would like to simplify the data, so that traditional machine learning and statistical techniques can be applied. However, crucial information intrinsic in the data should not be removed under this simplification. A widely used method for this purpose is to approximate the single data matrix, A, with a matrix of lower rank.
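As a small illustration of the low-rank approximation mentioned above, the following NumPy sketch truncates the SVD of a random matrix. The matrix sizes, the rank `k = 5`, and the helper name `low_rank_approx` are assumptions made for the example; they do not come from the notes.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k approximation of A obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and their singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))   # N = 50, n = 20 (illustrative sizes)
A_k = low_rank_approx(A, k=5)

# The squared Frobenius error equals the sum of the discarded
# squared singular values.
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, "fro") ** 2
print(err, np.sum(s[5:] ** 2))
```

The truncated SVD is the standard way to obtain the best rank-k approximation in the Frobenius norm, which is why it serves as the baseline in the vectorized setting.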

1.1 Problem formulation

  • Let $A_i \in \mathbb{R}^{r \times c}$, for $i = 1, \dots, n$, be the $n$ data points in the training set, where $r$ and $c$ denote the number of rows and columns respectively for each $A_i$. We aim to compute two matrices $L \in \mathbb{R}^{r \times \ell_1}$ and $R \in \mathbb{R}^{c \times \ell_2}$ with orthonormal columns, and $n$ matrices $M_i \in \mathbb{R}^{\ell_1 \times \ell_2}$, for $i = 1, \dots, n$, such that $L M_i R^T$ approximates $A_i$, for all $i$. Here, $\ell_1$ and $\ell_2$ are two pre-specified parameters that are best set to the same value, based on the experimental results. Mathematically, we can formulate this as the following minimization problem: compute optimal $L$, $R$ and $\{M_i\}_{i=1}^n$, which solve

$$\min_{\substack{L \in \mathbb{R}^{r \times \ell_1}:\ L^T L = I_{\ell_1} \\ R \in \mathbb{R}^{c \times \ell_2}:\ R^T R = I_{\ell_2} \\ M_i \in \mathbb{R}^{\ell_1 \times \ell_2}:\ i = 1, \dots, n}} \ \sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2. \tag{1}$$

  • The matrices L and R in the above approximations act as the two-sided linear transformations on the data in matrix form.
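To make the shapes in Eq. (1) concrete, here is a minimal sketch that evaluates the objective for arbitrary feasible inputs. The helper names `glram_objective` and `random_orthonormal`, the sizes, and the use of a QR factorization to build orthonormal columns are assumptions for illustration only.

```python
import numpy as np

def glram_objective(As, L, Ms, R):
    """Sum over i of ||A_i - L M_i R^T||_F^2, the objective in Eq. (1)."""
    return sum(np.linalg.norm(A - L @ M @ R.T, "fro") ** 2
               for A, M in zip(As, Ms))

def random_orthonormal(rows, cols, rng):
    """A rows x cols matrix with orthonormal columns (QR of a Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

rng = np.random.default_rng(0)
n, r, c, l1, l2 = 4, 8, 6, 3, 2            # illustrative sizes only
As = [rng.standard_normal((r, c)) for _ in range(n)]
L = random_orthonormal(r, l1, rng)          # r x l1 with L^T L = I
R = random_orthonormal(c, l2, rng)          # c x l2 with R^T R = I
Ms = [rng.standard_normal((l1, l2)) for _ in range(n)]  # arbitrary l1 x l2 matrices
value = glram_objective(As, L, Ms, R)
print(value)
```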

2 The main algorithm

  • The following theorem shows that the Mi’s are determined by the transformation matrices L and R, which significantly simplifies the minimization problem in Eq. (1).

Theorem 2.1. Let $L$, $R$ and $\{M_i\}_{i=1}^n$ be the optimal solution to the minimization problem in Eq. (1). Then $M_i = L^T A_i R$, for every $i$.

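Proof. By the property of the trace of matrices,

$$\begin{aligned} \sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2 &= \sum_{i=1}^{n} \mathrm{trace}\big((A_i - L M_i R^T)(A_i - L M_i R^T)^T\big) \\ &= \sum_{i=1}^{n} \mathrm{trace}(A_i A_i^T) + \sum_{i=1}^{n} \mathrm{trace}(M_i M_i^T) - 2 \sum_{i=1}^{n} \mathrm{trace}(L M_i R^T A_i^T), \end{aligned} \tag{2}$$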

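where the second term $\sum_{i=1}^{n} \mathrm{trace}(M_i M_i^T)$ results from the fact that both $L$ and $R$ have orthonormal columns, and $\mathrm{trace}(AB) = \mathrm{trace}(BA)$ for any two matrices of compatible sizes. Since the first term on the right-hand side of Eq. (2) is a constant, the minimization in Eq. (1) is equivalent to minimizing

$$\sum_{i=1}^{n} \mathrm{trace}(M_i M_i^T) - 2 \sum_{i=1}^{n} \mathrm{trace}(L M_i R^T A_i^T). \tag{3}$$

Because $\mathrm{trace}(L M_i R^T A_i^T) = \mathrm{trace}\big(M_i (L^T A_i R)^T\big)$, the expression in (3) equals $\sum_{i=1}^{n} \|M_i - L^T A_i R\|_F^2 - \sum_{i=1}^{n} \|L^T A_i R\|_F^2$. Hence, for fixed $L$ and $R$, the minimum of (3) is achieved only if $M_i = L^T A_i R$, for every $i$. This completes the proof of the theorem.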
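Theorem 2.1 can be checked numerically for a single data matrix. The sketch below (sizes, seed, and variable names are illustrative assumptions) verifies that for fixed $L$ and $R$ with orthonormal columns, any $M$ incurs an error larger than that of $M^\ast = L^T A R$ by exactly $\|M - M^\ast\|_F^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
r, c, l1, l2 = 8, 6, 3, 2                      # illustrative sizes
A = rng.standard_normal((r, c))
L, _ = np.linalg.qr(rng.standard_normal((r, l1)))   # orthonormal columns
R, _ = np.linalg.qr(rng.standard_normal((c, l2)))   # orthonormal columns

def err(M):
    """Squared reconstruction error ||A - L M R^T||_F^2."""
    return np.linalg.norm(A - L @ M @ R.T, "fro") ** 2

M_star = L.T @ A @ R                           # the choice given by Theorem 2.1
M = rng.standard_normal((l1, l2))              # an arbitrary competitor

# Completing the square: err(M) - err(M_star) = ||M - M_star||_F^2 >= 0.
gap = err(M) - err(M_star)
print(gap, np.linalg.norm(M - M_star, "fro") ** 2)
```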

  • Theorem 2.1 implies that $M_i$ is uniquely determined by $L$ and $R$ with $M_i = L^T A_i R$, for all $i$. Hence the key step for the minimization in Eq. (1) is the computation of the common transformations $L$ and $R$. A key property of the optimal transformations $L$ and $R$ is stated in the following theorem:

Theorem 2.2. Let $L$, $R$ and $\{M_i\}_{i=1}^n$ be the optimal solution to the minimization problem in Eq. (1). Then $L$ and $R$ solve the following optimization problem:

$$\max_{\substack{L \in \mathbb{R}^{r \times \ell_1}:\ L^T L = I_{\ell_1} \\ R \in \mathbb{R}^{c \times \ell_2}:\ R^T R = I_{\ell_2}}} \ \sum_{i=1}^{n} \|L^T A_i R\|_F^2. \tag{4}$$

Proof. From Theorem 2.1, $M_i = L^T A_i R$, for every $i$. Substituting this into $\sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2$, we obtain

$$\sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2 = \sum_{i=1}^{n} \|A_i\|_F^2 - \sum_{i=1}^{n} \|L^T A_i R\|_F^2. \tag{5}$$

Hence the minimization in Eq. (1) is equivalent to the maximization of $\sum_{i=1}^{n} \|L^T A_i R\|_F^2$, which completes the proof of the theorem.
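The identity in Eq. (5) is easy to confirm numerically. The following sketch uses random data and random orthonormal $L$ and $R$; the sizes and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, c, l1, l2 = 5, 8, 6, 3, 2                # illustrative sizes
As = [rng.standard_normal((r, c)) for _ in range(n)]
L, _ = np.linalg.qr(rng.standard_normal((r, l1)))
R, _ = np.linalg.qr(rng.standard_normal((c, l2)))

Ms = [L.T @ A @ R for A in As]                 # M_i from Theorem 2.1

# Left-hand side of Eq. (5): total squared reconstruction error.
lhs = sum(np.linalg.norm(A - L @ M @ R.T, "fro") ** 2 for A, M in zip(As, Ms))
# Right-hand side: total energy minus the energy captured by L^T A_i R.
rhs = (sum(np.linalg.norm(A, "fro") ** 2 for A in As)
       - sum(np.linalg.norm(M, "fro") ** 2 for M in Ms))
print(lhs, rhs)
```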

  • To the best of our knowledge, there is no closed form solution for the maximization problem in Eq. (4). A key observation, which leads to an iterative algorithm for the computation of L and R, is stated in the following theorem:

... is achieved only if $R \in \mathbb{R}^{c \times \ell_2}$ consists of the $\ell_2$ eigenvectors of the matrix $M_R$ corresponding to the largest $\ell_2$ eigenvalues. This completes the proof of the theorem.
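The eigenvector characterization can be illustrated for a fixed $L$. The sketch below assumes $M_R = \sum_j A_j^T L L^T A_j$, as in the GLRAM pseudo-code, and relies on the fact that $\sum_i \|L^T A_i R\|_F^2 = \mathrm{trace}(R^T M_R R)$. It compares the top-eigenvector choice of $R$ against random orthonormal alternatives; sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, c, l1, l2 = 5, 8, 6, 3, 2                # illustrative sizes
As = [rng.standard_normal((r, c)) for _ in range(n)]
L, _ = np.linalg.qr(rng.standard_normal((r, l1)))

def objective(L, R):
    """Sum over i of ||L^T A_i R||_F^2, the quantity maximized in Eq. (4)."""
    return sum(np.linalg.norm(L.T @ A @ R, "fro") ** 2 for A in As)

M_R = sum(A.T @ L @ L.T @ A for A in As)       # c x c, symmetric positive semidefinite
w, V = np.linalg.eigh(M_R)                     # eigenvalues in ascending order
R_best = V[:, ::-1][:, :l2]                    # eigenvectors of the l2 largest eigenvalues
best = objective(L, R_best)

# No random orthonormal R should beat the eigenvector solution.
random_vals = [objective(L, np.linalg.qr(rng.standard_normal((c, l2)))[0])
               for _ in range(100)]
print(best, max(random_vals))
```

The maximum value equals the sum of the $\ell_2$ largest eigenvalues of $M_R$, which the check `best == w[-l2:].sum()` (up to rounding) confirms.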

  • Theorem 2.3 results in an iterative procedure for computing L and R as follows: for a given L, we can compute R by computing the eigenvectors of the matrix MR; with the computed R, we can then update L by computing the eigenvectors of the matrix ML. The procedure can be repeated until convergence. The pseudo-code of the above iterative procedure is given in Algorithm GLRAM below.

Algorithm GLRAM
Input: matrices $\{A_i\}_{i=1}^n$, $\ell_1$, and $\ell_2$
Output: matrices $L$, $R$, and $\{M_i\}_{i=1}^n$

  1. Obtain initial $L_0$ for $L$ and set $i \leftarrow 1$;
  2. While not convergent
  3.     form the matrix $M_R = \sum_{j=1}^{n} A_j^T L_{i-1} L_{i-1}^T A_j$;
  4.     compute the $\ell_2$ eigenvectors $\{\phi_j^R\}_{j=1}^{\ell_2}$ of $M_R$ corresponding to the largest $\ell_2$ eigenvalues;
  5.     $R_i \leftarrow [\phi_1^R, \dots, \phi_{\ell_2}^R]$;
  6.     form the matrix $M_L = \sum_{j=1}^{n} A_j R_i R_i^T A_j^T$;
  7.     compute the $\ell_1$ eigenvectors $\{\phi_j^L\}_{j=1}^{\ell_1}$ of $M_L$ corresponding to the largest $\ell_1$ eigenvalues;
  8.     $L_i \leftarrow [\phi_1^L, \dots, \phi_{\ell_1}^L]$;
  9.     $i \leftarrow i + 1$;
  10. EndWhile
  11. $L \leftarrow L_{i-1}$;
  12. $R \leftarrow R_{i-1}$;
  13. For $j$ from 1 to $n$
  14.     $M_j \leftarrow L^T A_j R$;
  15. EndFor
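The pseudo-code above can be sketched in NumPy as follows. This is one possible reading, not the author's implementation: the stopping rule (relative decrease of the RMSRE below `tol`), the default $L_0 = (I, 0)^T$, the iteration cap `max_iter`, and the helper names `top_eigvecs` and `rmsre` are all assumptions, since the notes do not spell them out at this point.

```python
import numpy as np

def top_eigvecs(S, k):
    """Eigenvectors of the symmetric matrix S for its k largest eigenvalues."""
    _, V = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
    return V[:, ::-1][:, :k]

def rmsre(As, L, R):
    """Root mean square reconstruction error with M_i = L^T A_i R."""
    sq = [np.linalg.norm(A - L @ (L.T @ A @ R) @ R.T, "fro") ** 2 for A in As]
    return np.sqrt(np.mean(sq))

def glram(As, l1, l2, L0=None, tol=1e-6, max_iter=100):
    r, c = As[0].shape
    L = np.eye(r, l1) if L0 is None else L0   # assumed default: L0 = (I, 0)^T
    prev = None
    for _ in range(max_iter):
        M_R = sum(A.T @ L @ L.T @ A for A in As)      # line 3
        R = top_eigvecs(M_R, l2)                      # lines 4-5
        M_L = sum(A @ R @ R.T @ A.T for A in As)      # line 6
        L = top_eigvecs(M_L, l1)                      # lines 7-8
        cur = rmsre(As, L, R)
        # Assumed stopping rule: relative RMSRE decrease below tol.
        if prev is not None and prev - cur <= tol * prev:
            break
        prev = cur
    Ms = [L.T @ A @ R for A in As]                    # lines 13-15
    return L, R, Ms

rng = np.random.default_rng(4)
As = [rng.uniform(0, 255, size=(12, 10)) for _ in range(20)]  # toy data
L, R, Ms = glram(As, l1=3, l2=3)
print(L.shape, R.shape, Ms[0].shape, rmsre(As, L, R))
```

Using `np.linalg.eigh` is a natural choice here because $M_R$ and $M_L$ are symmetric positive semidefinite by construction.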
  • Theorem 2.3 implies that the matrix updates in Lines 5 and 8 of GLRAM do not decrease the value of $\sum_{i=1}^{n} \|L^T A_i R\|_F^2$, since the computed $R$ and $L$ are locally optimal. Hence by Theorem 2.2, the value of $\sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2$, or

$$\mathrm{RMSRE} \equiv \sqrt{\frac{1}{n} \sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2} \tag{8}$$

does not increase. Here RMSRE stands for the Root Mean Square Reconstruction Error. The convergence of GLRAM follows, since RMSRE is bounded from below by 0, as stated in the following theorem:

Theorem 2.4. The RMSRE value defined in Eq. (8) is monotonically non-increasing over the iterations of the GLRAM algorithm; hence it converges in the limit.
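The monotonicity claimed in Theorem 2.4 can be observed on toy data. The sketch below records the RMSRE after each sweep of the two eigenvector updates; the data, sizes, number of sweeps, and the starting point $L_0 = (I, 0)^T$ are assumptions for illustration.

```python
import numpy as np

def top_eigvecs(S, k):
    """Eigenvectors of the symmetric matrix S for its k largest eigenvalues."""
    _, V = np.linalg.eigh(S)
    return V[:, ::-1][:, :k]

rng = np.random.default_rng(5)
As = [rng.uniform(0, 255, size=(12, 10)) for _ in range(20)]  # toy data
l1 = l2 = 3
L = np.eye(12, l1)                      # assumed starting point L0 = (I, 0)^T
history = []
for _ in range(15):
    R = top_eigvecs(sum(A.T @ L @ L.T @ A for A in As), l2)
    L = top_eigvecs(sum(A @ R @ R.T @ A.T for A in As), l1)
    # RMSRE with M_i = L^T A_i R, so L M_i R^T = L L^T A_i R R^T.
    sq = [np.linalg.norm(A - L @ L.T @ A @ R @ R.T, "fro") ** 2 for A in As]
    history.append(np.sqrt(np.mean(sq)))

for it, e in enumerate(history, start=1):
    print(it, e)
```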

Table 1: Statistics of our test datasets.

Dataset   Size (n)   Dimension (r × c)       Number of classes
-------   --------   ---------------------   -----------------
RAND      500        100 × 100 = 10000       -
PIX       300        100 × 100 = 10000       30
ORL       400        92 × 112 = 10304        40
AR        1638       101 × 88 = 8888         126
PIE       6615       32 × 24 = 768           63
USPS      3000       16 × 16 = 256           10

3 Evaluation

3.1 Datasets

  • The statistics of all datasets are summarized in Table 1.
  • RAND is a synthetic dataset, consisting of 500 data points of size 100 × 100. All the entries are randomly generated between 0 and 255 (the same range as the four face image datasets).

3.2 Effect of the ratio of $\ell_1$ to $\ell_2$ on reconstruction error

  • In this experiment, we study the effect of the ratio of $\ell_1$ to $\ell_2$ on reconstruction error, where $\ell_1$ and $\ell_2$ are the row and column dimensions of the reduced representation $M_i$ in GLRAM. To this end, we run GLRAM with different combinations of $\ell_1$ and $\ell_2$ with a constant product $\ell_1 \cdot \ell_2 = 400$. The results on PIX, ORL, and AR are shown in Table 2. It is clear from the table that the RMSRE value is small when $\ell_1 / \ell_2 \approx 1$, and the minimum is achieved when $\ell_1 / \ell_2 = 1$ in all cases.
  • To examine whether this is related to the fact that for images, the number of rows ($r$) and the number of columns ($c$) are comparable, we subsample the images in PIX down to a size of 50 × 100 = 5000. The result on this dataset is included in Table 2. Interestingly, we observe the same trend in this dataset. That is, the RMSRE value is small when $\ell_1 / \ell_2 \approx 1$. We have conducted similar experiments on other datasets and observed the same trend. This may be related to the effect of balancing between the left and right transformations involved in GLRAM.
  • Finally, we examine the effect of the ratio using the synthetic dataset. The result on RAND is included in the last column of Table 2. We observe the same trend as on the other datasets. That is, the RMSRE value is small when $\ell_1 / \ell_2 \approx 1$.
  • The above experiment on both the synthetic and real-world datasets suggests that choosing $\ell_1 / \ell_2 \approx 1$ may be a good strategy in practice. In all the following experiments, we set both $\ell_1$ and $\ell_2$ equal to a common value $d$.
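The setup of this experiment can be sketched on a small scale as follows. This does not reproduce Table 2: the toy data (30 random 20 × 20 matrices), the constant product of 36, the fixed number of sweeps, and the helper names are all assumptions chosen so that the script runs quickly.

```python
import numpy as np

def top_eigvecs(S, k):
    """Eigenvectors of the symmetric matrix S for its k largest eigenvalues."""
    _, V = np.linalg.eigh(S)
    return V[:, ::-1][:, :k]

def glram_rmsre(As, l1, l2, n_iter=20):
    """Run a fixed number of GLRAM sweeps and return the final RMSRE."""
    L = np.eye(As[0].shape[0], l1)      # assumed starting point L0 = (I, 0)^T
    for _ in range(n_iter):
        R = top_eigvecs(sum(A.T @ L @ L.T @ A for A in As), l2)
        L = top_eigvecs(sum(A @ R @ R.T @ A.T for A in As), l1)
    sq = [np.linalg.norm(A - L @ L.T @ A @ R @ R.T, "fro") ** 2 for A in As]
    return float(np.sqrt(np.mean(sq)))

rng = np.random.default_rng(6)
As = [rng.uniform(0, 255, size=(20, 20)) for _ in range(30)]   # toy stand-in data

# Combinations of (l1, l2) with constant product l1 * l2 = 36.
pairs = [(2, 18), (3, 12), (4, 9), (6, 6), (9, 4), (12, 3), (18, 2)]
results = {(l1, l2): glram_rmsre(As, l1, l2) for l1, l2 in pairs}
for (l1, l2), e in results.items():
    print(f"l1={l1:2d}  l2={l2:2d}  ratio={l1 / l2:5.2f}  RMSRE={e:.2f}")
```

Replacing `As` with real image matrices and the product with 400 would give the configuration described in the text.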

3.3 Sensitivity of GLRAM to the choice of the initial $L_0$

  • In this experiment, we examine the sensitivity of GLRAM to the choice of the initial $L_0$ for $L$ (see Line 1 of the GLRAM algorithm). To this end, we run GLRAM with 10 different initial $L_0$'s. The first one is $L_0 = (I_d, 0)^T$, while the remaining nine are randomly generated.
  • Next, we examine the sensitivity of GLRAM using RAND, the synthetic dataset. The result is shown in Figure 1 (right). It is clear from the figure that GLRAM converges much more slowly on RAND than on the image datasets. We run GLRAM with the threshold $\eta = 10^{-6}$, and it takes 78 iterations to converge. Furthermore, GLRAM does not converge to the same solution (measured by the angle between two subspaces). Further experiments also show that the final RMSRE value may be different for different initial $L_0$'s, even though the difference always seems small. This is likely due to the fact that there are some similarities among the images in the same image dataset, while the data in RAND is randomly generated.
  • The experiment above implies that for datasets with some hidden structure, such as faces and handwritten digits, GLRAM may converge to the global solution regardless of the choice of the initial $L_0$. However, this is not true in general, as the RAND dataset shows.
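One way to compare the solutions reached from two starting points is the largest principal angle between the column spaces of the resulting $L$ matrices. The sketch below is an assumption about how such a comparison could be set up: the toy data, the fixed number of sweeps, the QR-based random start, and the helper names are not from the notes.

```python
import numpy as np

def top_eigvecs(S, k):
    """Eigenvectors of the symmetric matrix S for its k largest eigenvalues."""
    _, V = np.linalg.eigh(S)
    return V[:, ::-1][:, :k]

def glram_L(As, d, L0, n_iter=50):
    """Run n_iter GLRAM sweeps from L0 with l1 = l2 = d and return the final L."""
    L = L0
    for _ in range(n_iter):
        R = top_eigvecs(sum(A.T @ L @ L.T @ A for A in As), d)
        L = top_eigvecs(sum(A @ R @ R.T @ A.T for A in As), d)
    return L

def largest_principal_angle(U, V):
    """Largest principal angle (radians) between the column spaces of U and V.

    U and V must have orthonormal columns; the cosines of the principal
    angles are the singular values of U^T V.
    """
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.arccos(np.clip(s.min(), -1.0, 1.0)))

rng = np.random.default_rng(7)
r, c, d, n = 15, 15, 3, 25
As = [rng.uniform(0, 255, size=(r, c)) for _ in range(n)]      # RAND-like toy data

L_a = glram_L(As, d, np.eye(r, d))                             # L0 = (I_d, 0)^T
L_b = glram_L(As, d, np.linalg.qr(rng.standard_normal((r, d)))[0])  # random L0
angle = largest_principal_angle(L_a, L_b)
print(angle)
```

An angle near zero indicates that both runs ended in the same subspace; a clearly nonzero angle indicates different solutions.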

3.4 Compression effectiveness

  • In this experiment, we examine the quality of the images compressed by the proposed algorithm and compare it with SVD and 2DPCA. Image compression is commonly applied as a pre-processing step for storage and transmission of large image data. There exists a tradeoff between the quality of compressed images and the compression ratio, as a high compression ratio usually leads to poor quality of compressed images.
  • Figure 2 shows images of 10 different persons from the ORL dataset. The 10 images in the first row are the original images from the dataset. The 10 images in the second row are the ones compressed by the GLRAM algorithm with d = 10. The compression ratio is about 98.0. The images compressed by SVD and 2DPCA with approximately the same number of reduced dimensions as GLRAM are shown in the third and fourth rows of Figure 2 respectively. It is clear that the images compressed by our proposed algorithm have slightly better visual quality than those compressed by 2DPCA, while the ones compressed by SVD have the best visual quality. However, the compression ratio of SVD (3.85) is much smaller than that of GLRAM (98.0).
  • Figure 3 shows images of 10 different digits from the USPS dataset. $d = 5$ is used in GLRAM. The compression ratio is about 10. GLRAM and SVD perform slightly better than 2DPCA. Furthermore, the compression ratio of SVD (9.4) is close to that of GLRAM (10.2). The different behavior between ORL and USPS is related to the fact that USPS has a relatively large number of data points compared to its dimension, i.e., $n > rc$.
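The compression ratios quoted above can be reproduced with simple storage counts. The accounting below is an assumption, since the notes do not state it here: GLRAM stores $n$ matrices of size $d \times d$ plus $L$ and $R$, and SVD with rank $k = d^2$ stores $k$ left and $k$ right singular vectors. Under these assumptions, and with the sizes from Table 1, the numbers match the ratios in the text.

```python
def glram_ratio(n, r, c, d):
    """Original storage n*r*c divided by GLRAM storage n*d*d + (r + c)*d."""
    return (n * r * c) / (n * d * d + (r + c) * d)

def svd_ratio(n, r, c, k):
    """Original storage divided by rank-k SVD storage k*(N + n), with N = r*c."""
    N = r * c
    return (n * N) / (k * (N + n))

# ORL: n = 400 images of size 92 x 112, d = 10, SVD rank k = d^2 = 100.
orl_glram = glram_ratio(n=400, r=92, c=112, d=10)
orl_svd = svd_ratio(n=400, r=92, c=112, k=100)

# USPS: n = 3000 images of size 16 x 16, d = 5, SVD rank k = d^2 = 25.
usps_glram = glram_ratio(n=3000, r=16, c=16, d=5)
usps_svd = svd_ratio(n=3000, r=16, c=16, k=25)

print(f"ORL:  GLRAM {orl_glram:.2f}, SVD {orl_svd:.2f}")
print(f"USPS: GLRAM {usps_glram:.2f}, SVD {usps_svd:.2f}")
```

The formulas make the contrast between the two datasets visible: for ORL the SVD cost is dominated by the $k \cdot N$ term for the left singular vectors, whereas for USPS the per-sample term $k \cdot n$ dominates, which is the same size as the GLRAM cores.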

Figure 2: First row: raw images from ORL dataset. Second row: images compressed by GLRAM. Third row: images compressed by SVD. Fourth row: images compressed by 2DPCA. Note that the compression ratio of SVD (3.85) is much smaller than that of GLRAM (98.0).

Figure 3: First row: raw images from USPS dataset. Second row: images compressed by GLRAM. Third row: images compressed by SVD. Fourth row: images compressed by 2DPCA.