High-Dimensional Indexing: R-trees, kd-trees, vp-trees, and Dimension Reduction, Study notes of Computer Science

These notes cover various high-dimensional indexing techniques, including R-trees, kd-trees, vp-trees, and sequential scan, as well as dimensionality reduction methods such as singular value decomposition (SVD), the discrete Fourier transform, wavelets, and line segment approximations. They discuss the advantages and disadvantages of each method and provide examples.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

Dimensionality Reduction Techniques

Dimitrios Gunopulos, UCR

Retrieval techniques for high-dimensional datasets

  • The retrieval problem:
    • Given a set of objects and a query object S,
    • find the objects in the set that are most similar to S.
  • Applications:
    • financial, voice, marketing, medicine, video

Examples

  • Find companies with similar stock prices over a time interval
  • Find products with similar sell cycles
  • Cluster users with similar credit card utilization
  • Cluster products

Indexing when the triangle inequality holds

  • Typical distance metric: $L_p$ norm.
  • We use $L_2$ as an example throughout:
    • $D(S,T) = \left( \sum_{i=1}^{n} (S[i] - T[i])^2 \right)^{1/2}$
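
As a minimal sketch of the distance above, the following Python function computes the L_p distance between two equal-length sequences, with p = 2 giving D(S,T). The function name `lp_distance` and the sample values are illustrative assumptions, not part of the original notes.

```python
def lp_distance(s, t, p=2.0):
    """L_p distance between two equal-length numeric sequences (p >= 1)."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(s, t)) ** (1.0 / p)


# Illustrative values only (hypothetical data).
a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)
d_ab = lp_distance(a, b)  # L_2 distance between a and b
```

For any p >= 1 the triangle inequality holds, so `lp_distance(a, c) <= lp_distance(a, b) + lp_distance(b, c)`; this is the property the index structures in these notes rely on.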

R-trees and variants

[Guttman, 1984], [Sellis et al, 1987], [Beckmann et al, 1990]

  • k-dim extension of B-trees
  • Balanced tree
  • Intermediate nodes are rectangles that cover lower levels
  • Rectangles may be overlapping or not depending on variant (R-trees, R+-trees, R*-trees)
  • Can index rectangles as well as points

kd-trees

  • Based on binary trees
  • Different attribute is used for partitioning at different levels
  • Efficient for indexing points
  • External memory extensions: hB^Π-tree
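
The partitioning rule above (a different attribute at each level of a binary tree) can be sketched in Python as follows. This is an in-memory illustration under assumptions of my own: the splitting attribute cycles through the coordinates, the median point is stored at each node, and the names `KDNode`, `build_kdtree`, and `nearest` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class KDNode:
    point: Tuple[float, ...]
    axis: int                       # attribute used to split at this level
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None


def build_kdtree(points, depth=0):
    """Split on attribute (depth mod dimensions), storing the median point."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def nearest(node, query, best=None):
    """Nearest neighbor under L_2, skipping subtrees that cannot contain it."""
    if node is None:
        return best
    if best is None or math.dist(query, node.point) < math.dist(query, best):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # The far side can only hold a closer point if the splitting plane
    # is closer to the query than the best match found so far.
    if abs(diff) < math.dist(query, best):
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
found = nearest(tree, (9, 2))
```

Storing a point at every node is only one possible layout; disk-based variants such as the one named above organize nodes into pages instead.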

Grid Files

  • Use a regular grid to partition the space
  • Points in each cell go to one disk page
  • Can only handle points

vp-trees and pyramid trees

[Ullmann], [Berchtold et al, 1998], [Bozkaya et al, 1997], ...

  • Basic idea: partition the dataset, rather than the space
  • vp-trees: At each level, partition the points based on the distance from a center
  • Others: mvp-, TV-, S-, Pyramid-trees
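
The vp-tree idea (partition the points by their distance from a center) can be sketched in Python as below. The sketch assumes a binary split at the median distance and a randomly chosen center, which are common choices but are assumptions here; the names `VPNode`, `build_vptree`, and `range_search` are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class VPNode:
    center: tuple
    radius: float                   # median distance from the center
    inside: Optional["VPNode"]      # points with distance <= radius
    outside: Optional["VPNode"]     # points with distance > radius


def build_vptree(points, rng=None):
    rng = rng or random.Random(0)
    if not points:
        return None
    points = list(points)
    center = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(center, 0.0, None, None)
    dists = sorted(math.dist(center, p) for p in points)
    radius = dists[len(dists) // 2]
    inside = [p for p in points if math.dist(center, p) <= radius]
    outside = [p for p in points if math.dist(center, p) > radius]
    return VPNode(center, radius,
                  build_vptree(inside, rng), build_vptree(outside, rng))


def range_search(node, query, eps, out=None):
    """Collect all points within eps of the query."""
    out = [] if out is None else out
    if node is None:
        return out
    d = math.dist(query, node.center)
    if d <= eps:
        out.append(node.center)
    # Triangle inequality: a match inside the ball forces d - eps <= radius,
    # and a match outside the ball forces d + eps > radius.
    if d - eps <= node.radius:
        range_search(node.inside, query, eps, out)
    if d + eps > node.radius:
        range_search(node.outside, query, eps, out)
    return out


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
tree = build_vptree(points)
hits = range_search(tree, (0.5, 0.5), 0.1)
```

Only the distance function is used, never the coordinates themselves, which is why this style of index works for any metric that satisfies the triangle inequality.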

[Figure: the root level of a vp-tree with 3 children]


High-dimensional index structures

  • All require the triangle inequality to hold
  • All partition either
    • the space or
    • the dataset into regions
  • The objective is to:
    • search only those regions that could potentially contain good matches
    • avoid everything else

The naïve approach: Problems

  • High-dimensionality:
    • decreases index structure performance (the curse of dimensionality)
    • slows down the distance computation
  • Inefficiency

Dimensionality reduction

  • The main idea: reduce the dimensionality of the space.
  • Project the n-dimensional tuples that represent the time series into a k-dimensional space so that:
    • k << n
    • distances are preserved as well as possible

Dimensionality Reduction

  • Use an indexing technique on the new space.
  • GEMINI ([Faloutsos et al]):
    • Map the query S to the new space
    • Find nearest neighbors to S in the new space
    • Compute the actual distances and keep the closest
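
The three GEMINI steps can be sketched as a filter-and-refine search. In this Python sketch a linear scan over the reduced vectors stands in for the index, and the feature map simply keeps the first few coordinates, which never increases the L_2 distance; both are assumptions for illustration, as are the names `gemini_nearest` and `feature_map`.

```python
import math
import random


def gemini_nearest(data, query, feature_map):
    """Exact nearest neighbor, provided the reduced distance never
    exceeds the true distance (the lower-bounding condition)."""
    q_red = feature_map(query)
    # Step 1: distances in the reduced space (an index would go here).
    red_d = [math.dist(feature_map(x), q_red) for x in data]
    # Step 2: the nearest neighbor in the reduced space gives a first candidate.
    first = min(range(len(data)), key=red_d.__getitem__)
    best, best_d = first, math.dist(data[first], query)
    bound = best_d
    # Step 3: only objects whose reduced distance is within the bound can
    # beat the candidate; compute actual distances for those and keep the closest.
    for i, rd in enumerate(red_d):
        if i != first and rd <= bound:
            d = math.dist(data[i], query)
            if d < best_d:
                best, best_d = i, d
    return best, best_d


rng = random.Random(7)
data = [tuple(rng.gauss(0, 1) for _ in range(16)) for _ in range(300)]
query = tuple(rng.gauss(0, 1) for _ in range(16))
index, distance = gemini_nearest(data, query, feature_map=lambda x: x[:4])
```

Objects whose reduced distance exceeds the bound are discarded without ever computing their full distance, which is where the savings come from.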

Dimensionality Reduction

  • To guarantee no false dismissals we must be able to prove that:
    • D(F(S),F(T)) ≤ a D(S,T) for some constant a
  • A small rate of false positives is desirable, but not essential

What we achieve

  • Indexing structures work much better in lower dimensionality spaces
  • The distance computations run faster
  • The size of the dataset is reduced, improving performance.

Dimensionality Reduction Techniques

  • We will review a number of dimensionality reduction techniques that can be applied in this context:
    • SVD decomposition
    • Discrete Fourier transform and Discrete Cosine transform
    • Wavelets
    • Partitioning in the time domain
    • Random projections
    • Multidimensional scaling
    • FastMap and its variants

SVD decomposition - the Karhunen-Loeve transform

  • Intuition: find the axis that shows the greatest variation, and project all points onto this axis
  • [Faloutsos, 1996]

SVD Cont’d

  • To approximate the time series, we use only the eigenvectors of C with the k largest eigenvalues.
  • A’ = U × L_k
  • A’ is an M × k matrix
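
The surviving text does not define U, L_k, or C, so the following NumPy sketch assumes the usual reading: the M time series are the rows of a matrix A, the columns are centered, A = U L V^T is the singular value decomposition, and L_k keeps the k largest singular values. The function name `svd_reduce` and the synthetic random-walk data are hypothetical.

```python
import numpy as np


def svd_reduce(A, k):
    """Return the M x k reduced representation and the k principal axes."""
    A = np.asarray(A, dtype=float)
    centered = A - A.mean(axis=0)          # assumption: center each column
    U, sing, Vt = np.linalg.svd(centered, full_matrices=False)
    # Singular values come back in decreasing order, so the first k columns
    # correspond to the directions of greatest variation.
    reduced = U[:, :k] * sing[:k]          # plays the role of A' = U x L_k
    return reduced, Vt[:k]


rng = np.random.default_rng(0)
series = rng.normal(size=(50, 64)).cumsum(axis=1)   # 50 synthetic series, n = 64
reduced, axes = svd_reduce(series, k=8)
```

Because the rows of `axes` are orthonormal, the reduced vectors are projections of the centered series, so distances between them never exceed the original distances.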

[Figure: a time series X and its approximation X' built from eigenwaves 0 to 7]

SVD Cont’d

  • Advantages:
    • Optimal dimensionality reduction (for linear projections)
  • Disadvantages:
    • Computationally expensive, especially if the time series are very long.
    • Does not work for subsequence indexing

SVD Extensions

  • On-line approximation algorithm
    • [Ravi Kanth et al, 1998]
  • Local dimensionality reduction:
    • Cluster the time series, solve for each cluster
    • [Chakrabarti and Mehrotra, 2000], [Thomasian et al]

Discrete Fourier Transform

  • Analyze the frequency spectrum of a one-dimensional signal
  • For $S = (S_0, \ldots, S_{n-1})$, the DFT is:
  • $S_f = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} S_i \, e^{-j 2\pi f i / n}$, for $f = 0, 1, \ldots, n-1$, where $j^2 = -1$
  • An efficient O(n log n) algorithm (the FFT) makes the DFT a practical method
  • [Agrawal et al, 1993], [Rafiei and Mendelzon, 1998]
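
A direct Python evaluation of the DFT formula, using the 1/√n normalization, is sketched below. It runs in O(n^2) time rather than using the fast algorithm, and the names `dft` and `coefficient_distance` and the sample series are illustrative assumptions.

```python
import cmath
import math


def dft(series):
    """Direct O(n^2) evaluation of the normalized DFT."""
    n = len(series)
    return [
        sum(s * cmath.exp(-2j * cmath.pi * f * i / n)
            for i, s in enumerate(series)) / math.sqrt(n)
        for f in range(n)
    ]


def coefficient_distance(a, b):
    """L_2 distance between two sequences of complex coefficients."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


s = [2.0, 2.0, 7.0, 9.0]
t = [1.0, 3.0, 5.0, 8.0]
S, T = dft(s), dft(t)
full = coefficient_distance(S, T)              # uses all n coefficients
truncated = coefficient_distance(S[:2], T[:2])  # uses only the first 2
```

With this normalization the transform preserves the L_2 distance, so `full` matches the time-domain distance up to rounding, and dropping coefficients can only make the distance smaller. That is the property that rules out false dismissals when only k coefficients are indexed.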

Discrete Fourier Transform

  • Advantages:
    • Efficient, concentrates the energy
  • Disadvantages:
    • To project the n-dimensional time series into a k-dimensional space, the same k Fourier coefficients must be stored for all series
    • This is not optimal for all series
    • To find the k optimal coefficients for M time series, compute the average energy for each coefficient

Wavelets

  • Represent the time series as a sum of prototype functions, like the DFT
  • Typical basis used: Haar wavelets
  • Difference from DFT: localization in time
  • Can be extended to 2 dimensions
  • [Chan and Fu, 1999]
  • Has been very useful in graphics, approximation techniques

Wavelets

  • An example (using the Haar wavelet basis)
    • S ≡ (2, 2, 7, 9) : original time series
    • S’ ≡ (5, 6, 0, 2) : wavelet decomposition
    • S[0] = S’[0] - S’[1]/2 - S’[2]/2
    • S[1] = S’[0] - S’[1]/2 + S’[2]/2
    • S[2] = S’[0] + S’[1]/2 - S’[3]/2
    • S[3] = S’[0] + S’[1]/2 + S’[3]/2
  • Efficient O(n) algorithm to find the coefficients
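
The example can be reproduced with a short Python sketch of the unnormalized Haar decomposition implied by the reconstruction formulas: each pass replaces adjacent pairs by their average and stores their difference. The function names are hypothetical, and the sketch assumes the series length is a power of two.

```python
def haar_decompose(series):
    """Unnormalized Haar decomposition: overall average first, then
    differences from the coarsest level to the finest."""
    values = [float(v) for v in series]
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    while len(values) > 1:
        pairs = range(0, len(values), 2)
        averages = [(values[i] + values[i + 1]) / 2 for i in pairs]
        differences = [values[i + 1] - values[i] for i in pairs]
        details = differences + details   # coarser levels go in front
        values = averages
    return values + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose by expanding one level at a time."""
    values = [float(coeffs[0])]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(values)]
        pos += len(diffs)
        expanded = []
        for avg, d in zip(values, diffs):
            expanded.extend([avg - d / 2, avg + d / 2])
        values = expanded
    return values


coeffs = haar_decompose([2, 2, 7, 9])
restored = haar_reconstruct(coeffs)
```

Each pass touches half as many values as the previous one, so the total work is proportional to n, in line with the O(n) bound stated above.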

Using wavelets for approximation

  • Keep only k coefficients, approximate the rest with 0
  • Keeping the first k coefficients:
    • equivalent to low pass filtering
  • Keeping the largest k coefficients:
    • More accurate representation, but not useful for indexing, since different series would keep different coefficients

[Figure: a time series X and its approximation X' built from Haar wavelets 0 to 7]

Temporal Partitioning

  • Very efficient technique (O(n) time algorithm)
  • Can be extended to address the subsequence matching problem
  • Equivalent to wavelets (when k = 2^i and the mean is used)
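
A minimal Python sketch of temporal partitioning follows, assuming the series is cut into k equal-length segments and each segment is replaced by its mean. The segment layout, the scaled distance, and the names `segment_means` and `reduced_distance` are assumptions made for illustration.

```python
import math


def segment_means(series, k):
    """Replace each of k equal-length segments by its mean (one O(n) pass)."""
    n = len(series)
    if k <= 0 or n % k:
        raise ValueError("k must be positive and divide the series length")
    width = n // k
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def reduced_distance(a_means, b_means, width):
    """Distance between two reduced series, scaled by sqrt(segment width)
    so that it never exceeds the L_2 distance between the originals."""
    return math.sqrt(width) * math.dist(a_means, b_means)


s = [2.0, 2.0, 7.0, 9.0]
t = [1.0, 3.0, 5.0, 8.0]
s_red = segment_means(s, 2)
t_red = segment_means(t, 2)
lower_bound = reduced_distance(s_red, t_red, width=2)
```

For the series (2, 2, 7, 9) with k = 2 the segment means are the same pairwise averages that the first pass of a Haar decomposition produces.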

[Figure: a time series X and its approximation X' built from segments x0 to x7]

Random projection

  • Based on the Johnson-Lindenstrauss lemma:
  • For:
    • $0 < \varepsilon < 1/2$,
    • any (sufficiently large) set S of M points in $R^n$,
    • $k = O(\varepsilon^{-2} \ln M)$
  • There exists a linear map $f: S \to R^k$ such that
    • $(1-\varepsilon)\, D(S,T) < D(f(S),f(T)) < (1+\varepsilon)\, D(S,T)$ for all points S, T in the set
  • Random projection is good with constant probability
  • [Indyk, 2000]

Random Projection: Application

  • Set $k = O(\varepsilon^{-2} \ln M)$
  • Select k random n-dimensional vectors
  • Project the time series onto the k vectors.
  • The resulting k-dimensional space approximately preserves the distances with high probability
  • Monte Carlo algorithm: we cannot tell whether a particular projection is a good one
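
The steps above can be sketched in Python as follows. The sketch assumes the random vectors have independent Gaussian entries scaled by 1/√k, which is one common construction but is not stated in the notes; the sizes n = 500 and k = 200 are arbitrary illustrative values, not derived from the bound on k, and the function names are hypothetical.

```python
import math
import random


def random_projection_matrix(n, k, seed=0):
    """k random n-dimensional vectors with Gaussian entries, scaled by
    1/sqrt(k) so that squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]


def project(series, matrix):
    """Inner product of the series with each of the k random vectors."""
    return [sum(r * x for r, x in zip(row, series)) for row in matrix]


rng = random.Random(42)
points = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(10)]
matrix = random_projection_matrix(n=500, k=200, seed=1)
projected = [project(p, matrix) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    math.dist(projected[a], projected[b]) / math.dist(points[a], points[b])
    for a in range(len(points)) for b in range(a + 1, len(points))
]
```

The ratios cluster around 1, but nothing in the output certifies that every pair falls inside the (1 ± ε) band; checking that requires comparing against the original distances, as done here.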

Random Projection

  • A very useful technique,
  • Especially when used in conjunction with another technique (for example SVD)
  • Use random projection to reduce the dimensionality from thousands to hundreds, then apply SVD to reduce the dimensionality further