Dimensionality Reduction Techniques
Dimitrios Gunopulos, UCR
Retrieval techniques for high-dimensional datasets
- The retrieval problem:
- Given a set of objects $\mathcal{S}$ and a query object S,
- find the objects in $\mathcal{S}$ that are most similar to S.
- Applications:
- financial, voice, marketing, medicine, video
Examples
- Find companies with similar stock prices over a time interval
- Find products with similar sell cycles
- Cluster users with similar credit card utilization
- Cluster products
Indexing when the triangle inequality holds
- Typical distance metric: the $L_p$ norm.
- We use $L_2$ as an example throughout:
- $D(S,T) = \left( \sum_{i=1}^{n} (S[i] - T[i])^2 \right)^{1/2}$
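The L2 distance above can be sketched in a few lines of Python. The function name `l2_distance` is an illustrative choice, not something from the slides.

```python
import math

def l2_distance(s, t):
    """Euclidean (L2) distance between two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```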
R-trees and variants
[Guttman, 1984], [Sellis et al, 1987], [Beckmann et al, 1990]
- k-dim extension of B-trees
- Balanced tree
- Intermediate nodes are rectangles that cover lower levels
- Rectangles may be overlapping or not depending on variant (R-trees, R+-trees, R*-trees)
- Can index rectangles as well as points
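As a rough illustration of how covering rectangles support search, the sketch below computes the bounding rectangle of a group of points and the smallest possible L2 distance from a query point to that rectangle. If that lower bound already exceeds the search radius, the whole subtree can be skipped. This is only the pruning test, not an R-tree implementation, and the names `bounding_rectangle` and `min_distance` are hypothetical.

```python
import math

def bounding_rectangle(points):
    """Smallest axis-aligned rectangle covering the points, as (low, high) corners."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high

def min_distance(query, rect):
    """Smallest L2 distance from the query point to any point of the rectangle."""
    low, high = rect
    total = 0.0
    for q, lo, hi in zip(query, low, high):
        # The per-dimension gap is zero when q lies within [lo, hi].
        gap = max(lo - q, 0.0, q - hi)
        total += gap * gap
    return math.sqrt(total)

rect = bounding_rectangle([(1, 1), (2, 3), (4, 2)])
print(rect)                        # ((1, 1), (4, 3))
print(min_distance((7, 7), rect))  # 5.0
```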
kd-trees
- Based on binary trees
- Different attribute is used for partitioning at different levels
- Efficient for indexing points
- External memory extensions: hBΠ-tree
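A minimal in-memory sketch of the idea, assuming points are tuples and that the splitting attribute simply cycles with the depth of the node. The dictionary layout and the names `build_kdtree` and `nearest` are illustrative assumptions.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree; the splitting attribute cycles with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the nearest neighbor of query."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2))[1])  # (8, 1)
```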
Grid Files
- Use a regular grid to partition the space
- Points in each cell go to one disk page
- Can only handle points
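The partitioning step can be sketched as follows, assuming a uniform cell size in every dimension and using an in-memory list per cell in place of a disk page. Actual grid files also maintain a directory and per-dimension scales, which this sketch leaves out; `cell_of` and `build_grid` are hypothetical names.

```python
import math
from collections import defaultdict

def cell_of(point, cell_size):
    """Grid cell index of a point (one integer per dimension)."""
    return tuple(math.floor(x / cell_size) for x in point)

def build_grid(points, cell_size):
    """Map each cell to its bucket of points; a bucket stands in for a disk page."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, cell_size)].append(p)
    return grid

grid = build_grid([(0.5, 0.5), (0.7, 0.2), (2.1, 3.9)], cell_size=1.0)
print(grid[(0, 0)])  # [(0.5, 0.5), (0.7, 0.2)]
print(grid[(2, 3)])  # [(2.1, 3.9)]
```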
vp-trees and pyramid trees
[Ullmann], [Berchtold et al, 1998], [Bozkaya et al, 1997], ...
- Basic idea: partition the dataset, rather than the space
- vp-trees: At each level, partition the points based on the distance from a center
- Others: mvp-, TV-, S-, Pyramid-trees
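The sketch below illustrates partitioning by distance from a center, under simplifying assumptions: each node has two children rather than three, the first point of each subset serves as the center (vantage point), and the split radius is the median distance. The range search uses the triangle inequality to decide which children can contain matches. All names are hypothetical.

```python
import math

def build_vptree(points):
    """Binary vp-tree: split the points by their distance to a vantage point."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]  # first point as vantage (an assumption)
    if not rest:
        return {"vp": vantage, "radius": 0.0, "inside": None, "outside": None}
    dists = sorted(math.dist(vantage, p) for p in rest)
    radius = dists[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p in rest if math.dist(vantage, p) < radius]
    outside = [p for p in rest if math.dist(vantage, p) >= radius]
    return {"vp": vantage, "radius": radius,
            "inside": build_vptree(inside), "outside": build_vptree(outside)}

def range_search(node, query, r, out):
    """Append to out every indexed point within distance r of query."""
    if node is None:
        return out
    d = math.dist(node["vp"], query)
    if d <= r:
        out.append(node["vp"])
    # Triangle inequality: any match p satisfies d - r <= D(vp, p) <= d + r.
    if d - r < node["radius"]:
        range_search(node["inside"], query, r, out)
    if d + r >= node["radius"]:
        range_search(node["outside"], query, r, out)
    return out

tree = build_vptree([(0, 0), (1, 1), (5, 5), (6, 5)])
print(range_search(tree, (5, 5), 1.0, []))  # [(5, 5), (6, 5)]
```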
Figure: The root level of a vp-tree with 3 children
High-dimensional index structures
- All require the triangle inequality to hold
- All partition either
- the space or
- the dataset into regions
- The objective is to:
- search only those regions that could potentially contain good matches
- avoid everything else
The naïve approach: Problems
- High-dimensionality:
- decreases index structure performance (the curse of dimensionality)
- slows down the distance computation
- Inefficiency
Dimensionality reduction
- The main idea: reduce the dimensionality of the space.
- Project the n-dimensional tuples that represent the time series into a k-dimensional space so that:
- k << n
- distances are preserved as well as possible
Dimensionality Reduction
- Use an indexing technique on the new space.
- GEMINI ([Faloutsos et al]):
- Map the query S to the new space
- Find nearest neighbors to S in the new space
- Compute the actual distances and keep the closest
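The three steps can be sketched as a filter-and-refine search. The sketch assumes that the feature map never overestimates the true L2 distance, replaces the index with a sorted linear scan, and uses a hypothetical feature map `first_two` that keeps only the first two coordinates. None of these choices come from the slides.

```python
import math

def nearest_neighbor(dataset, query, feature):
    """Filter-and-refine 1-NN search; feature must lower-bound the true L2 distance."""
    fq = feature(query)
    # Filter: rank objects by the cheap distance in the reduced space.
    # A sorted linear scan stands in for the index here.
    ranked = sorted(dataset, key=lambda obj: math.dist(feature(obj), fq))
    best, best_dist = None, math.inf
    for obj in ranked:
        # Lower bound: nothing further down the ranking can beat the current best.
        if math.dist(feature(obj), fq) > best_dist:
            break
        d = math.dist(obj, query)  # refine with the actual distance
        if d < best_dist:
            best, best_dist = obj, d
    return best, best_dist

def first_two(series):
    """Hypothetical feature map: keep the first two coordinates."""
    return series[:2]

data = [(0, 0, 9), (1, 1, 0), (3, 3, 3)]
# The closest object in the reduced space, (0, 0, 9), is not the true nearest
# neighbor; the refinement step finds (1, 1, 0) at distance sqrt(2).
print(nearest_neighbor(data, (0, 0, 0), first_two))
```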
Dimensionality Reduction
- To guarantee no false dismissals we must be able to prove that:
- $D(F(S),F(T)) \le a \, D(S,T)$
- for some constant a
- a small rate of false positives is desirable, but not essential
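A short derivation shows why this condition rules out false dismissals for a range query. It assumes a query radius $r$ in the original space and a constant $a > 0$; the symbol $r$ is introduced here for illustration only.

```latex
D(S,T) \le r
\;\Longrightarrow\;
D(F(S),F(T)) \;\le\; a\,D(S,T) \;\le\; a\,r .
```

Every object within distance $r$ of the query in the original space is therefore within distance $a\,r$ of the mapped query in the reduced space, so a range query of radius $a\,r$ in the new space retrieves all true matches. Objects retrieved this way that are not true matches are the false positives, which the final distance computation removes.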
What we achieve
- Indexing structures work much better in lower dimensionality spaces
- The distance computations run faster
- The size of the dataset is reduced, improving performance.
Dimensionality Reduction Techniques
- We will review a number of dimensionality reduction techniques that can be applied in this context:
- SVD decomposition
- Discrete Fourier transform and Discrete Cosine transform
- Wavelets
- Partitioning in the time domain
- Random Projections
- Multidimensional scaling
- FastMap and its variants
SVD decomposition - the Karhunen-Loeve transform
- Intuition: find the axis that shows the greatest variation, and project all points onto this axis
- [Faloutsos, 1996]
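To make the intuition concrete, the sketch below finds the axis of greatest variation with power iteration on the covariance matrix and projects points onto it. Power iteration is a stand-in chosen to keep the example in pure Python; it recovers only the first axis and is not a full SVD. The function names and the iteration count are assumptions.

```python
import math

def principal_axis(points, iterations=200):
    """Mean and unit vector along the direction of greatest variance."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    # Covariance matrix of the centered points (dim x dim).
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim  # starting vector; assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    return mean, v

def project_onto_axis(point, mean, axis):
    """Coordinate of the point along the axis (a 1-dimensional representation)."""
    return sum((x - m) * a for x, m, a in zip(point, mean, axis))

# Points on the line y = 2x: the recovered axis is proportional to (1, 2).
mean, axis = principal_axis([(0, 0), (1, 2), (2, 4), (3, 6)])
print(axis)
print(project_onto_axis((3, 6), mean, axis))
```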
SVD Cont’d
- To approximate the time series, we use only the k eigenvectors of C with the largest eigenvalues.
- A’ = U x Lk
- A’ is an M x k matrix
Figure: a time series X and its approximation X', with eigenwaves 0 through 7 (x-axis from 0 to 140)
SVD Cont’d
- Advantages:
- Optimal dimensionality reduction (for linear projections)
- Disadvantages:
- Computationally expensive, especially if the time series are very long.
- Does not work for subsequence indexing
SVD Extensions
- On-line approximation algorithm
- Local dimensionality reduction:
- Cluster the time series, solve for each cluster
- [Chakrabarti and Mehrotra, 2000], [Thomasian et al]
Discrete Fourier Transform
- Analyze the frequency spectrum of a one-dimensional signal
- For $S = (S_0, \ldots, S_{n-1})$, the DFT is:
- $S_f = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} S_i \, e^{-j 2 \pi f i / n}$, for $f = 0, 1, \ldots, n-1$, with $j^2 = -1$
- An efficient O(nlogn) algorithm makes DFT a practical method
- [Agrawal et al, 1993], [Rafiei and Mendelzon, 1998]
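The definition can be evaluated directly, as in the sketch below. This is the plain O(n^2) sum, not the O(n log n) algorithm, and the helper `feature_distance` is a hypothetical name. With the 1/sqrt(n) normalization the transform preserves L2 distances, so a distance computed from only the first k coefficients can never exceed the true distance.

```python
import cmath
import math

def dft(series):
    """Normalized DFT: S_f = (1/sqrt(n)) * sum_i S_i * exp(-j*2*pi*f*i/n)."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * i / n)
                for i, x in enumerate(series)) / math.sqrt(n)
            for f in range(n)]

def feature_distance(s, t, k):
    """L2 distance computed from the first k DFT coefficients only."""
    return math.sqrt(sum(abs(a - b) ** 2
                         for a, b in zip(dft(s)[:k], dft(t)[:k])))

s, t = [2, 2, 7, 9], [1, 3, 5, 8]
print(math.dist(s, t))            # true distance in the time domain
print(feature_distance(s, t, 4))  # all coefficients: matches the true distance
print(feature_distance(s, t, 2))  # two coefficients: a lower bound
```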
Discrete Fourier Transform
- Advantages:
- Efficient, concentrates the energy
- Disadvantages:
- To project the n-dimensional time series into a k-dimensional space, the same k Fourier coefficients must be stored for all series
- This is not optimal for all series
- To find the k optimal coefficients for M time series, compute the average energy for each coefficient
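The selection step might look like the sketch below, assuming each series has already been transformed into a list of (possibly complex) coefficients and that the energy of a coefficient is its squared magnitude. The function name is hypothetical.

```python
def top_k_by_average_energy(spectra, k):
    """Indices of the k coefficients with the highest average energy |S_f|^2."""
    m = len(spectra)
    n = len(spectra[0])
    avg_energy = [sum(abs(spec[f]) ** 2 for spec in spectra) / m
                  for f in range(n)]
    ranked = sorted(range(n), key=lambda f: avg_energy[f], reverse=True)
    return sorted(ranked[:k])

# Two toy spectra: coefficients 1 and 3 carry the most energy on average.
print(top_k_by_average_energy([[1, 5, 0, 2], [1, 4, 0, 3]], k=2))  # [1, 3]
```

The same index set is then used for every series, which keeps all reduced vectors in one common k-dimensional space.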
Wavelets
- Like the DFT, represent the time series as a sum of prototype functions
- Typical basis used: Haar wavelets
- Difference from DFT: localization in time
- Can be extended to 2 dimensions
- [Chan and Fu, 1999]
- Has been very useful in graphics, approximation techniques
Wavelets
- An example (using the Haar wavelet basis)
- S ≡ (2, 2, 7, 9) : original time series
- S’ ≡ (5, 6, 0, 2) : wavelet decomp.
- S[0] = S’[0] - S’[1]/2 - S’[2]/2
- S[1] = S’[0] - S’[1]/2 + S’[2]/2
- S[2] = S’[0] + S’[1]/2 - S’[3]/2
- S[3] = S’[0] + S’[1]/2 + S’[3]/2
- Efficient O(n) algorithm to find the coefficients
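The example can be reproduced with the sketch below. It follows the convention implied by the numbers above: the first coefficient is the overall average, and each detail coefficient is the right-half average minus the left-half average. The series length is assumed to be a power of two, and the function names are illustrative.

```python
def haar_decompose(series):
    """Haar coefficients: overall average, then details from coarse to fine."""
    averages = list(series)
    levels = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        levels.append([right - left for left, right in pairs])
        averages = [(left + right) / 2 for left, right in pairs]
    # Coarsest details first, finest details last.
    return averages + [d for level in reversed(levels) for d in level]

def haar_reconstruct(coeffs):
    """Invert haar_decompose: each average splits into avg - d/2 and avg + d/2."""
    averages = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(averages)]
        pos += len(averages)
        averages = [value for avg, d in zip(averages, details)
                    for value in (avg - d / 2, avg + d / 2)]
    return averages

print(haar_decompose([2, 2, 7, 9]))             # [5.0, 6.0, 0, 2]
print(haar_reconstruct([5.0, 6.0, 0, 2]))       # [2.0, 2.0, 7.0, 9.0]
```

Each pass halves the number of averages, so the total work is proportional to n.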
Using wavelets for approximation
- Keep only k coefficients, approximate the rest with 0
- Keeping the first k coefficients:
- equivalent to low pass filtering
- Keeping the largest k coefficients:
- more accurate representation, but not useful for indexing
Figure: a time series X and its approximation X', with Haar basis functions 0 through 7 (x-axis from 0 to 140)
Temporal Partitioning
- Very efficient technique (O(n) time algorithm)
- Can be extended to address the subsequence matching problem
- Equivalent to wavelets (when $k = 2^i$ and the mean is used)
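A minimal sketch, assuming the technique splits each series into k equal-length segments and keeps the mean of each segment, as the remark about the mean suggests. The function name and the requirement that k divide the series length are simplifications made here.

```python
def piecewise_means(series, k):
    """Approximate a series by the means of k equal-length segments."""
    n = len(series)
    if n % k != 0:
        raise ValueError("this sketch assumes k divides the series length")
    width = n // k
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]

print(piecewise_means([2, 2, 7, 9], 2))  # [2.0, 8.0]
```

For the series (2, 2, 7, 9), whose first two Haar coefficients are 5 and 6, the two segment means are 5 - 6/2 = 2 and 5 + 6/2 = 8, so both representations carry the same information when k is a power of two.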
Figure: a time series X and its approximation X', with segments x0 through x7 (x-axis from 0 to 140)
Random projection
- Based on the Johnson-Lindenstrauss lemma:
- For:
- $0 < e < 1/2$,
- any (sufficiently large) set $\mathcal{S}$ of $M$ points in $\mathbb{R}^n$
- $k = O(e^{-2} \ln M)$
- There exists a linear map $f: \mathcal{S} \to \mathbb{R}^k$, such that
- $(1-e) \, D(S,T) < D(f(S),f(T)) < (1+e) \, D(S,T)$ for all $S, T$ in $\mathcal{S}$
- Random projection is good with constant probability
- [Indyk, 2000]
Random Projection: Application
- Set $k = O(e^{-2} \ln M)$
- Select k random n-dimensional vectors
- Project the time series onto the k vectors.
- The resulting k-dimensional space approximately preserves the distances with high probability
- Monte Carlo algorithm: we do not know whether a particular projection preserves the distances
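The steps above can be sketched as follows, assuming the random vectors have independent Gaussian entries scaled by 1/sqrt(k), which is one common choice and is not specified in the slides. The names, the seed, and the sizes n = 500 and k = 200 are illustrative.

```python
import math
import random

def random_projection_matrix(n, k, seed=0):
    """k random n-dimensional vectors with Gaussian entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(n)]
            for _ in range(k)]

def random_project(series, matrix):
    """Coordinates of the series along each of the k random vectors."""
    return [sum(r * x for r, x in zip(row, series)) for row in matrix]

n, k = 500, 200
matrix = random_projection_matrix(n, k)
rng = random.Random(1)
s = [rng.random() for _ in range(n)]
t = [rng.random() for _ in range(n)]
original = math.dist(s, t)
reduced = math.dist(random_project(s, matrix), random_project(t, matrix))
print(original, reduced)  # the two distances should be close, not identical
```

Because the result holds only with some probability, a search built on such a projection still needs the final check against actual distances.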
Random Projection
- A very useful technique,
- Especially when used in conjunction with another technique (for example SVD)
- Use random projection to reduce the dimensionality from thousands to hundreds, then apply SVD to reduce the dimensionality further