CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data
Exploration— Two dimensional SVD and PCA
Instructor: Jieping Ye
1 Introduction
- Traditional methods in information retrieval and machine learning deal with data in vectorized representation. A collection of data is then stored in a single matrix $A \in \mathbb{R}^{N \times n}$, where each column of $A$ corresponds to a vector in the $N$-dimensional space. A major benefit of this vector space model is that the algebraic structure of the vector space can be exploited.
- For high-dimensional data, one would like to simplify the data, so that traditional machine learning and statistical techniques can be applied. However, crucial information intrinsic in the data should not be removed under this simplification. A widely used method for this purpose is to approximate the single data matrix, A, with a matrix of lower rank.
1.1 Problem formulation
- Let $A_i \in \mathbb{R}^{r \times c}$, for $i = 1, \cdots, n$, be the $n$ data points in the training set, where $r$ and $c$ denote the number of rows and columns, respectively, of each $A_i$. We aim to compute two matrices $L \in \mathbb{R}^{r \times \ell_1}$ and $R \in \mathbb{R}^{c \times \ell_2}$ with orthonormal columns, and $n$ matrices $M_i \in \mathbb{R}^{\ell_1 \times \ell_2}$, for $i = 1, \cdots, n$, such that $L M_i R^T$ approximates $A_i$ for all $i$. Here, $\ell_1$ and $\ell_2$ are two pre-specified parameters that are best set to the same value, based on the experimental results. Mathematically, we can formulate this as the following minimization problem: computing optimal $L$, $R$ and $\{M_i\}_{i=1}^{n}$, which solve

$$\min_{\substack{L \in \mathbb{R}^{r \times \ell_1}:\ L^T L = I_{\ell_1} \\ R \in \mathbb{R}^{c \times \ell_2}:\ R^T R = I_{\ell_2} \\ M_i \in \mathbb{R}^{\ell_1 \times \ell_2}:\ i = 1, \cdots, n}} \ \sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2. \tag{1}$$
- The matrices L and R in the above approximations act as the two-sided linear transformations on the data in matrix form.
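To make the shapes in Eq. (1) concrete, the following is a minimal NumPy sketch that evaluates the objective for a given triple $(L, R, \{M_i\})$. The function names `glram_objective` and `random_orthonormal`, the stacking of the $A_i$ into one array of shape $(n, r, c)$, and the sizes used are illustrative assumptions, not part of the notes.

```python
import numpy as np

def glram_objective(A, L, R, M):
    """Sum over i of ||A_i - L M_i R^T||_F^2, the quantity minimized in Eq. (1).

    A: (n, r, c) stack of data matrices; L: (r, l1); R: (c, l2); M: (n, l1, l2).
    """
    recon = L @ M @ R.T          # matmul broadcasts over the leading axis n
    return float(np.sum((A - recon) ** 2))

def random_orthonormal(rows, cols, rng):
    """Random matrix with orthonormal columns, taken from a reduced QR factorization."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

rng = np.random.default_rng(0)
n, r, c, l1, l2 = 5, 8, 6, 3, 2   # illustrative sizes only
L = random_orthonormal(r, l1, rng)
R = random_orthonormal(c, l2, rng)
M = rng.standard_normal((n, l1, l2))
A_exact = L @ M @ R.T             # data that is exactly of the form L M_i R^T
A_noisy = A_exact + 0.1 * rng.standard_normal((n, r, c))

print(glram_objective(A_exact, L, R, M))   # essentially zero
print(glram_objective(A_noisy, L, R, M))   # strictly positive
```

The objective vanishes only when every $A_i$ can be written exactly as $L M_i R^T$; for general data it measures how much is lost by the two-sided projection.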
2 The main algorithm
- The following theorem shows that the $M_i$'s are determined by the transformation matrices $L$ and $R$, which significantly simplifies the minimization problem in Eq. (1).
Theorem 2.1. Let $L$, $R$ and $\{M_i\}_{i=1}^{n}$ be the optimal solution to the minimization problem in Eq. (1). Then $M_i = L^T A_i R$, for every $i$.
Proof. By the property of the trace of matrices,

$$\begin{aligned}
\sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2
&= \sum_{i=1}^{n} \mathrm{trace}\left( (A_i - L M_i R^T)(A_i - L M_i R^T)^T \right) \\
&= \sum_{i=1}^{n} \mathrm{trace}(A_i A_i^T) + \sum_{i=1}^{n} \mathrm{trace}(M_i M_i^T) - 2 \sum_{i=1}^{n} \mathrm{trace}(L M_i R^T A_i^T),
\end{aligned} \tag{2}$$

where the second term $\sum_{i=1}^{n} \mathrm{trace}(M_i M_i^T)$ results from the fact that both $L$ and $R$ have orthonormal columns, and $\mathrm{trace}(AB) = \mathrm{trace}(BA)$ for any two matrices. Since the first term on the right-hand side of Eq. (2) is a constant, the minimization in Eq. (1) is equivalent to minimizing

$$\sum_{i=1}^{n} \mathrm{trace}(M_i M_i^T) - 2 \sum_{i=1}^{n} \mathrm{trace}(L M_i R^T A_i^T). \tag{3}$$

It is easy to check that the minimum of (3) is achieved only if $M_i = L^T A_i R$, for every $i$. This completes the proof of the theorem.
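One way to carry out the "easy to check" step is to complete the square in expression (3). The derivation below is a sketch added for illustration; it uses only the cyclic property of the trace and the identity $\|X\|_F^2 = \mathrm{trace}(X X^T)$.

```latex
% Cyclic property of the trace:
%   trace(L M_i R^T A_i^T) = trace(M_i R^T A_i^T L) = trace(M_i (L^T A_i R)^T).
\begin{aligned}
\sum_{i=1}^{n} \mathrm{trace}(M_i M_i^T) - 2 \sum_{i=1}^{n} \mathrm{trace}(L M_i R^T A_i^T)
&= \sum_{i=1}^{n} \left[ \|M_i\|_F^2 - 2\, \mathrm{trace}\left( M_i (L^T A_i R)^T \right) \right] \\
&= \sum_{i=1}^{n} \|M_i - L^T A_i R\|_F^2 - \sum_{i=1}^{n} \|L^T A_i R\|_F^2 .
\end{aligned}
```

For fixed $L$ and $R$ the last sum does not depend on the $M_i$, and the first sum is non-negative and vanishes exactly when $M_i = L^T A_i R$ for every $i$.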
- Theorem 2.1 implies that $M_i$ is uniquely determined by $L$ and $R$ with $M_i = L^T A_i R$, for all $i$. Hence the key step for the minimization in Eq. (1) is the computation of the common transformations $L$ and $R$. A key property of the optimal transformations $L$ and $R$ is stated in the following theorem:
Theorem 2.2. Let $L$, $R$ and $\{M_i\}_{i=1}^{n}$ be the optimal solution to the minimization problem in Eq. (1). Then $L$ and $R$ solve the following optimization problem:

$$\max_{\substack{L \in \mathbb{R}^{r \times \ell_1}:\ L^T L = I_{\ell_1} \\ R \in \mathbb{R}^{c \times \ell_2}:\ R^T R = I_{\ell_2}}} \ \sum_{i=1}^{n} \|L^T A_i R\|_F^2. \tag{4}$$
Proof. From Theorem 2.1, $M_i = L^T A_i R$, for every $i$. Substituting this into $\sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2$, we obtain

$$\sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2 = \sum_{i=1}^{n} \|A_i\|_F^2 - \sum_{i=1}^{n} \|L^T A_i R\|_F^2. \tag{5}$$

Hence the minimization in Eq. (1) is equivalent to the maximization of

$$\sum_{i=1}^{n} \|L^T A_i R\|_F^2,$$

which completes the proof of the theorem.
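The identity in Eq. (5) can be checked numerically. The sketch below draws random data and random $L$, $R$ with orthonormal columns (via a reduced QR factorization), sets $M_i = L^T A_i R$ as in Theorem 2.1, and compares the two sides. The sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, c, l1, l2 = 6, 9, 7, 3, 3          # illustrative sizes only
A = rng.standard_normal((n, r, c))       # stack of n matrices A_i
L, _ = np.linalg.qr(rng.standard_normal((r, l1)))   # orthonormal columns
R, _ = np.linalg.qr(rng.standard_normal((c, l2)))   # orthonormal columns

M = L.T @ A @ R                           # M_i = L^T A_i R, shape (n, l1, l2)
lhs = np.sum((A - L @ M @ R.T) ** 2)      # sum_i ||A_i - L M_i R^T||_F^2
rhs = np.sum(A ** 2) - np.sum(M ** 2)     # sum_i ||A_i||_F^2 - sum_i ||L^T A_i R||_F^2
print(lhs, rhs)                           # the two values agree up to rounding
```

Since $\sum_i \|A_i\|_F^2$ is fixed by the data, a larger value of $\sum_i \|L^T A_i R\|_F^2$ directly means a smaller reconstruction error.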
- To the best of our knowledge, there is no closed form solution for the maximization problem in Eq. (4). A key observation, which leads to an iterative algorithm for the computation of L and R, is stated in the following theorem:
is achieved only if $R \in \mathbb{R}^{c \times \ell_2}$ consists of the $\ell_2$ eigenvectors of the matrix $M_R$ corresponding to the largest $\ell_2$ eigenvalues. This completes the proof of the theorem.
- Theorem 2.3 results in an iterative procedure for computing $L$ and $R$ as follows: for a given $L$, we can compute $R$ by computing the eigenvectors of the matrix $M_R$; with the computed $R$, we can then update $L$ by computing the eigenvectors of the matrix $M_L$. The procedure can be repeated until convergence. The pseudo-code of the above iterative procedure is given in Algorithm GLRAM below.
Algorithm GLRAM
Input: matrices $\{A_i\}_{i=1}^{n}$, $\ell_1$, and $\ell_2$
Output: matrices $L$, $R$, and $\{M_i\}_{i=1}^{n}$
1. Obtain initial $L_0$ for $L$ and set $i \leftarrow 1$;
2. While not convergent
3.     form the matrix $M_R = \sum_{j=1}^{n} A_j^T L_{i-1} L_{i-1}^T A_j$;
4.     compute the $\ell_2$ eigenvectors $\{\phi_j^R\}_{j=1}^{\ell_2}$ of $M_R$ corresponding to the largest $\ell_2$ eigenvalues;
5.     $R_i \leftarrow [\phi_1^R, \cdots, \phi_{\ell_2}^R]$;
6.     form the matrix $M_L = \sum_{j=1}^{n} A_j R_i R_i^T A_j^T$;
7.     compute the $\ell_1$ eigenvectors $\{\phi_j^L\}_{j=1}^{\ell_1}$ of $M_L$ corresponding to the largest $\ell_1$ eigenvalues;
8.     $L_i \leftarrow [\phi_1^L, \cdots, \phi_{\ell_1}^L]$;
9.     $i \leftarrow i + 1$;
10. EndWhile
11. $L \leftarrow L_{i-1}$;
12. $R \leftarrow R_{i-1}$;
13. For $j$ from 1 to $n$
14.     $M_j \leftarrow L^T A_j R$;
15. EndFor
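The pseudo-code above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the course's reference code: the data are stacked in an array of shape $(n, r, c)$, the initial $L_0$ is taken to be $(I, 0)^T$, and the stopping rule (relative decrease of the reconstruction error at most `eta`) is an assumption, since the notes only say "while not convergent". The names `glram`, `top_eigvecs`, `eta` and `max_iter` are hypothetical.

```python
import numpy as np

def top_eigvecs(S, k):
    """Columns are eigenvectors of the symmetric matrix S for its k largest eigenvalues."""
    _, V = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    return V[:, ::-1][:, :k]

def glram(A, l1, l2, eta=1e-6, max_iter=100):
    """A: (n, r, c) stack of matrices. Returns L, R, M and the error after each iteration."""
    n, r, c = A.shape
    L = np.eye(r, l1)             # assumed initial L_0 = (I, 0)^T
    history = []
    for _ in range(max_iter):
        LtA = L.T @ A                                  # (n, l1, c)
        M_R = np.einsum("nkc,nkd->cd", LtA, LtA)       # sum_j A_j^T L L^T A_j
        R = top_eigvecs(M_R, l2)
        AR = A @ R                                     # (n, r, l2)
        M_L = np.einsum("nrk,nsk->rs", AR, AR)         # sum_j A_j R R^T A_j^T
        L = top_eigvecs(M_L, l1)
        M = L.T @ A @ R                                # M_j = L^T A_j R
        resid = A - L @ M @ R.T
        # Root mean square reconstruction error over the n data points.
        history.append(float(np.sqrt(np.sum(resid ** 2) / n)))
        # Assumed stopping rule: relative decrease of the error is at most eta.
        if len(history) > 1 and history[-2] - history[-1] <= eta * history[-2]:
            break
    return L, R, M, history

rng = np.random.default_rng(0)
A = rng.uniform(0, 255, size=(20, 12, 10))   # small synthetic stack, illustrative only
L, R, M, history = glram(A, l1=3, l2=3)
print(len(history), history[-1])
```

`np.linalg.eigh` is used because $M_R$ and $M_L$ are symmetric positive semidefinite; the recorded `history` makes it easy to inspect how the error evolves from one iteration to the next.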
- Theorem 2.3 implies that the matrix updates in Lines 5 and 8 of GLRAM do not decrease the value of $\sum_{i=1}^{n} \|L^T A_i R\|_F^2$, since the computed $R$ and $L$ are locally optimal. Hence by Theorem 2.2, the value of $\sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2$, or

$$\mathrm{RMSRE} \equiv \sqrt{\frac{1}{n} \sum_{i=1}^{n} \|A_i - L M_i R^T\|_F^2} \tag{8}$$

does not increase. Here RMSRE stands for the Root Mean Square Reconstruction Error. The convergence of GLRAM follows, since RMSRE is bounded from below by 0, as stated in the following theorem:
Theorem 2.4. The RMSRE value defined in Eq. (8) is monotonically non-increasing over the iterations of the GLRAM algorithm; hence the algorithm converges in the limit.
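Table 1: Statistics of our test datasets.

| Dataset | Size ($n$) | Dimension ($r \times c$) | Number of classes |
|---------|------------|--------------------------|-------------------|
| RAND | 500 | 100 × 100 = 10000 | - |
| PIX | 300 | 100 × 100 = 10000 | 30 |
| ORL | 400 | 92 × 112 = 10304 | 40 |
| AR | 1638 | 101 × 88 = 8888 | 126 |
| PIE | 6615 | 32 × 24 = 768 | 63 |
| USPS | 3000 | 16 × 16 = 256 | 10 |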
3 Evaluation
3.1 Datasets
- The statistics of all datasets are summarized in Table 1.
- RAND is a synthetic dataset, consisting of 500 data points of size 100 × 100. All the entries are randomly generated between 0 and 255 (the same range as the four face image datasets).
3.2 Effect of the ratio of $\ell_1$ to $\ell_2$ on reconstruction error
- In this experiment, we study the effect of the ratio of $\ell_1$ to $\ell_2$ on reconstruction error, where $\ell_1$ and $\ell_2$ are the row and column dimensions of the reduced representation $M_i$ in GLRAM. To this end, we run GLRAM with different combinations of $\ell_1$ and $\ell_2$ with a constant product $\ell_1 \cdot \ell_2 = 400$. The results on PIX, ORL, and AR are shown in Table 2. It is clear from the table that the RMSRE value is small when $\ell_1 / \ell_2 \approx 1$, and the minimum is achieved when $\ell_1 / \ell_2 = 1$ in all cases.
- To examine whether this is related to the fact that for images, the number of rows ($r$) and the number of columns ($c$) are comparable, we subsample the images in PIX down to a size of 50 × 100 = 5000. The result on this dataset is included in Table 2. Interestingly, we observe the same trend in this dataset. That is, the RMSRE value is small when $\ell_1 / \ell_2 \approx 1$. We have conducted similar experiments on other datasets and observed the same trend. This may be related to the effect of balancing between the left and right transformations involved in GLRAM.
- Finally, we examine the effect of the ratio using the synthetic dataset. The result on RAND is included in the last column of Table 2. We observe the same trend as on the other datasets. That is, the RMSRE value is small when $\ell_1 / \ell_2 \approx 1$.
- The above experiment on both the synthetic and real-world datasets suggests that choosing $\ell_1 / \ell_2 \approx 1$ may be a good strategy in practice. In all the following experiments, we set both $\ell_1$ and $\ell_2$ equal to a common value $d$.
3.3 Sensitivity of GLRAM to the choice of the initial $L_0$
- In this experiment, we examine the sensitivity of GLRAM to the choice of the initial $L_0$ for $L$ (see Line 1 of the GLRAM algorithm). To this end, we run GLRAM with 10 different initial $L_0$'s. The first one is $L_0 = (I_d, 0)^T$, while the other nine are randomly generated.
- Next, we examine the sensitivity of GLRAM using RAND, the synthetic dataset. The result is shown in Figure 1 (right). It is clear from the figure that GLRAM converges much more slowly on RAND than on the image datasets. We run GLRAM with the threshold $\eta = 10^{-6}$, and it takes 78 iterations to converge. Furthermore, GLRAM does not converge to the same solution (measured by the angle between two subspaces). Further experiments also show that the final RMSRE value may be different for different initial $L_0$'s, although the difference always seems small. This is likely due to the fact that there are some similarities among the images in the same image dataset, while the data in RAND is randomly generated.
- The experiment above suggests that for datasets with some hidden structure, such as faces and handwritten digits, GLRAM may converge to the global solution regardless of the choice of the initial $L_0$. However, this is not true in general, as the RAND dataset shows.
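The notes compare solutions by "the angle between two subspaces" without giving a formula. One standard choice, shown here as an assumption rather than as the measure actually used in the experiment, is the largest principal angle: for two matrices $U$ and $V$ with orthonormal columns, the singular values of $U^T V$ are the cosines of the principal angles between their column spaces. The function name `largest_principal_angle` is hypothetical.

```python
import numpy as np

def largest_principal_angle(U, V):
    """Largest principal angle (in radians) between span(U) and span(V).

    U and V are assumed to have orthonormal columns and the same number of columns.
    """
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)   # guard against rounding slightly above 1
    return float(np.arccos(cosines.min()))

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((10, 3)))   # orthonormal basis of a 3-dim subspace
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # random 3 x 3 orthogonal matrix
same_space = U @ Q                                   # different basis, same subspace

e1 = np.eye(4)[:, [0]]   # span of the first coordinate axis
e2 = np.eye(4)[:, [1]]   # span of the second coordinate axis

print(largest_principal_angle(U, same_space))   # close to 0
print(largest_principal_angle(e1, e2))          # pi / 2
```

The measure depends only on the subspaces, not on the particular bases, which matters here because the eigenvectors returned for $L$ are only determined up to sign and, for repeated eigenvalues, up to rotation.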
3.4 Compression effectiveness
- In this experiment, we examine the quality of the images compressed by the proposed algorithm and compare it with SVD and 2DPCA. Image compression is commonly applied as a pre-processing step for storage and transmission of large image data. There exists a tradeoff between quality of compressed images and compression ratio, as a high compression ratio usually leads to poor quality of compressed images.
- Figure 2 shows images of 10 different persons from the ORL dataset. The 10 images in the first row are the original images from the dataset. The 10 images in the second row are the ones compressed by the GLRAM algorithm with d = 10. The compression ratio is about 98.0. The images compressed by SVD and 2DPCA with approximately the same number of reduced dimensions as GLRAM are shown in the third and fourth rows of Figure 2 respectively. It is clear that the images compressed by our proposed algorithm have slightly better visual quality than those compressed by 2DPCA, while the ones compressed by SVD have the best visual quality. However, the compression ratio of SVD (3.85) is much smaller than that of GLRAM (98.0).
- Figure 3 shows images of 10 different digits from the USPS dataset. We use $d = 5$ in GLRAM, which gives a compression ratio of about 10. GLRAM and SVD perform slightly better than 2DPCA. Furthermore, the compression ratio of SVD (9.4) is close to that of GLRAM (10.2). The different behavior between ORL and USPS is related to the fact that USPS has a relatively large number of data points compared to its dimension, i.e., $n \gg rc$.
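The compression ratios quoted above can be reproduced with a simple storage count. The accounting below is an assumption, since the notes do not state the formula: GLRAM is taken to store $L$, $R$ and the $n$ matrices $M_i$, and a rank-$k$ SVD of the $rc \times n$ data matrix is taken to store $k$ left and $k$ right singular vectors, with $k = d^2$ so that both methods use the same number of reduced dimensions. The dataset sizes come from Table 1; the function names are hypothetical.

```python
def glram_compression_ratio(n, r, c, l1, l2):
    """Original storage n*r*c divided by the storage of L, R and the n matrices M_i."""
    return (n * r * c) / (n * l1 * l2 + r * l1 + c * l2)

def svd_compression_ratio(n, r, c, k):
    """Original storage divided by the storage of k left and k right singular vectors."""
    return (n * r * c) / (k * (r * c + n))

# ORL: n = 400 images of size 92 x 112, d = 10 (so k = d^2 = 100 for SVD).
orl_glram = glram_compression_ratio(400, 92, 112, 10, 10)
orl_svd = svd_compression_ratio(400, 92, 112, 100)

# USPS: n = 3000 images of size 16 x 16, d = 5 (so k = d^2 = 25 for SVD).
usps_glram = glram_compression_ratio(3000, 16, 16, 5, 5)
usps_svd = svd_compression_ratio(3000, 16, 16, 25)

print(round(orl_glram, 1), round(orl_svd, 2))    # 98.0 3.85
print(round(usps_glram, 1), round(usps_svd, 1))  # 10.2 9.4
```

Under this accounting the SVD cost is dominated by the $k \cdot rc$ term when $rc$ is large relative to $n$ (ORL), and by the $k \cdot n$ term when $n \gg rc$ (USPS), which is consistent with the different behavior of the two datasets noted above.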
Figure 2: First row: raw images from ORL dataset. Second row: images compressed by GLRAM. Third row: images compressed by SVD. Fourth row: images compressed by 2DPCA. Note that the compression ratio of SVD (3.85) is much smaller than that of GLRAM (98.0).
Figure 3: First row: raw images from USPS dataset. Second row: images compressed by GLRAM. Third row: images compressed by SVD. Fourth row: images compressed by 2DPCA.