High-Dimensional Indexing: R-trees, kd-trees, vp-trees, and Dimension Reduction, Study notes of Computer Science

University of California-Riverside Computer Science

These notes survey high-dimensional indexing techniques, including R-trees, kd-trees, vp-trees, and sequential scan. They also cover dimensionality reduction methods such as singular value decomposition (SVD), the discrete Fourier transform, wavelets, and line segment approximations, and discuss the advantages and disadvantages of each method with examples.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-pms 🇺🇸


Dimensionality Reduction Techniques

Dimitrios Gunopulos, UCR

Retrieval techniques for high-dimensional datasets

The retrieval problem:
- Given a set of objects $\mathcal{S}$ and a query object $S$,
- find the objects in $\mathcal{S}$ that are most similar to $S$.
Applications:
- financial, voice, marketing, medicine, video


Examples

Find companies with similar stock prices over a time interval
Find products with similar sales cycles
Cluster users with similar credit card utilization
Cluster products

Indexing when the triangle inequality holds

Typical distance metric: $L_p$ norm.
We use $L_2$ as an example throughout:
- $D(S,T) = \left( \sum_{i=1}^{n} (S[i] - T[i])^2 \right)^{1/2}$
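
As a minimal sketch of the distance above, the following Python function computes the L_p distance between two equal-length sequences. The name `lp_distance` and the default `p=2` are illustrative choices, not part of the notes.

```python
def lp_distance(s, t, p=2):
    """Return the L_p distance between equal-length sequences s and t."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(s, t)) ** (1.0 / p)
```

With `p=2` this is the Euclidean distance used in the rest of the notes; `p=1` gives the Manhattan distance.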

R-trees and variants

[Guttman, 1984], [Sellis et al, 1987], [Beckmann et al, 1990]

k-dim extension of B-trees
Balanced tree
Intermediate nodes are rectangles that cover lower levels
Rectangles may be overlapping or not depending on variant (R-trees, R+-trees, R*-trees)
Can index rectangles as well as points


kd-trees

Based on binary trees
A different attribute is used for partitioning at each level
Efficient for indexing points
External memory extensions: hBΠ-tree
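
The partitioning rule can be illustrated with a small in-memory sketch. It assumes a median split and a split attribute that cycles with depth, which is one common choice rather than the only one; the names `build_kdtree` and `nearest` are hypothetical.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree as nested dicts; the split attribute cycles with depth."""
    if not points:
        return None
    axis = depth % len(points[0])          # a different attribute at each level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median split keeps the tree balanced
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (squared distance, point) for the nearest neighbor of query."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    dist = sum((a - b) ** 2 for a, b in zip(point, query))
    if best is None or dist < best[0]:
        best = (dist, point)
    diff = query[axis] - point[axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if diff ** 2 < best[0]:                # the far side may still hold a closer point
        best = nearest(far, query, best)
    return best
```

The far subtree is visited only when the splitting plane is closer to the query than the best match found so far, which is where the savings over a sequential scan come from.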

Grid Files

Use a regular grid to partition the space
Points in each cell go to one disk page
Can only handle points
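
A simplified sketch of the idea, assuming a fixed cell size and a dictionary standing in for disk pages (real grid files adapt the grid and keep a directory); `build_grid` and `cell_size` are illustrative names.

```python
from collections import defaultdict


def build_grid(points, cell_size):
    """Assign each point to the regular grid cell that contains it."""
    grid = defaultdict(list)               # cell coordinates -> points ("page")
    for p in points:
        cell = tuple(int(c // cell_size) for c in p)
        grid[cell].append(p)
    return grid
```

A query then only needs to read the cells that overlap the search region.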

vp-trees and pyramid trees

[Ullmann], [Berchtold et al, 1998], [Bozkaya et al, 1997], ...

Basic idea: partition the dataset, rather than the space
vp-trees: At each level, partition the points based on the distance from a center
Others: mvp-, TV-, S-, Pyramid-trees

[Figure: the root level of a vp-tree with 3 children]
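
The following sketch shows distance-based partitioning in a binary vp-tree (two children per node rather than the three in the figure), split at the median distance from a randomly chosen vantage point. The function names and the dictionary layout are assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_vptree(points, dist=euclidean, rng=None):
    """Binary vp-tree: split the points by their distance from a vantage point."""
    rng = rng or random.Random(0)
    if not points:
        return None
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))    # vantage point (the "center")
    if not points:
        return {"vp": vp, "radius": 0.0, "inside": None, "outside": None}
    dists = sorted(dist(vp, p) for p in points)
    radius = dists[len(dists) // 2]                # median distance from vp
    inside = [p for p in points if dist(vp, p) < radius]
    outside = [p for p in points if dist(vp, p) >= radius]
    return {
        "vp": vp,
        "radius": radius,
        "inside": build_vptree(inside, dist, rng),
        "outside": build_vptree(outside, dist, rng),
    }


def range_search(node, query, eps, dist=euclidean, out=None):
    """Collect all points within distance eps of query."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node["vp"])
    if d <= eps:
        out.append(node["vp"])
    # Triangle inequality: visit a subtree only if it could contain a match.
    if d - eps < node["radius"]:
        range_search(node["inside"], query, eps, dist, out)
    if d + eps >= node["radius"]:
        range_search(node["outside"], query, eps, dist, out)
    return out
```

Only a distance function is needed, so the same structure works for any metric that satisfies the triangle inequality.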

High-dimensional index structures

All require the triangle inequality to hold
All partition either
- the space or
- the dataset into regions
The objective is to:
- search only those regions that could potentially contain good matches
- avoid everything else

The naïve approach: Problems

High dimensionality:
- decreases index structure performance (the curse of dimensionality)
- slows down the distance computation
Inefficiency

Dimensionality reduction

The main idea: reduce the dimensionality of the space.
Project the n-dimensional tuples that represent the time series into a k-dimensional space so that:
- k << n
- distances are preserved as well as possible

Dimensionality Reduction

Use an indexing technique on the new space.
GEMINI ([Faloutsos et al]):
- Map the query S to the new space
- Find nearest neighbors to S in the new space
- Compute the actual distances and keep the closest
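
The three steps can be sketched as a filter-and-refine search. This is an illustration under stated assumptions, not the GEMINI implementation: the feature map `reduce_dims` (keep the first k coordinates) is hypothetical and was chosen because it never increases Euclidean distance, and a linear scan stands in for the index.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_dims(series, k):
    """Hypothetical feature map F: keep the first k coordinates.

    Dropping coordinates can only shrink the Euclidean distance, so
    D(F(S), F(T)) <= D(S, T) holds for this map.
    """
    return series[:k]


def gemini_nearest(database, query, k):
    """Filter in the reduced space, then refine with actual distances."""
    reduced = [reduce_dims(t, k) for t in database]   # would be stored in an index
    q_red = reduce_dims(query, k)                     # step 1: map the query
    # Step 2: nearest neighbor in the reduced space (a linear scan stands in
    # for the index lookup); its actual distance becomes the search radius.
    first = min(range(len(database)), key=lambda i: euclidean(reduced[i], q_red))
    radius = euclidean(database[first], query)
    # Step 3: every object whose reduced distance is within the radius is a
    # candidate; compute the actual distances and keep the closest.
    candidates = [i for i in range(len(database))
                  if euclidean(reduced[i], q_red) <= radius]
    best = min(candidates, key=lambda i: euclidean(database[i], query))
    return database[best], len(candidates)
```

The second return value is the number of candidates that needed an actual distance computation; the better the reduced space preserves distances, the smaller it gets.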

Dimensionality Reduction

To guarantee no false dismissals we must be able to prove that:
- $D(F(S),F(T)) \le a \, D(S,T)$ for some constant $a$
A small rate of false positives is desirable, but not essential
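
A short worked step shows why this condition is enough for range queries. Here $\varepsilon$ is the query radius in the original space, a symbol introduced for this illustration, and the bound is written with $\le$ so that it also covers $S = T$.

```latex
D(S,T) \le \varepsilon
\;\Longrightarrow\;
D(F(S),F(T)) \le a \, D(S,T) \le a \, \varepsilon
```

Every true match therefore lies within radius $a\,\varepsilon$ of the query in the reduced space, so searching the index with that radius cannot dismiss it. Objects that fall inside the reduced radius but have $D(S,T) > \varepsilon$ are the false positives, and they are discarded when the actual distances are computed.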

What we achieve

Indexing structures work much better in lower dimensionality spaces
The distance computations run faster
The size of the dataset is reduced, improving performance.

Dimensionality Reduction Techniques

We will review a number of dimensionality reduction techniques that can be applied in this context:
- SVD decomposition
- Discrete Fourier transform and Discrete Cosine transform
- Wavelets
- Partitioning in the time domain
- Random Projections
- Multidimensional scaling
- FastMap and its variants

SVD decomposition - the Karhunen-Loeve transform

Intuition: find the axis that shows the greatest variation, and project all points onto this axis
[Faloutsos, 1996]
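
The intuition can be sketched without a linear algebra library by running power iteration on the covariance matrix, which converges to the axis of greatest variance. This is a stand-in for a full SVD, and the function names, the start vector, and the iteration count are all assumptions made for illustration.

```python
import math


def principal_axis(points, iterations=200):
    """Estimate the direction of greatest variance by power iteration."""
    m, n = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / m for j in range(n)]
    centered = [[p[j] - mean[j] for j in range(n)] for p in points]
    # Covariance matrix (n x n) of the centered points.
    cov = [[sum(row[a] * row[b] for row in centered) / m for b in range(n)]
           for a in range(n)]
    # Start vector; assumed not to be orthogonal to the principal axis.
    v = [1.0 / (j + 1) for j in range(n)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                    # degenerate input, nothing to refine
            break
        v = [x / norm for x in w]
    return mean, v


def project(point, mean, axis):
    """Coordinate of a point along the axis (a 1-dimensional representation)."""
    return sum((x - m) * a for x, m, a in zip(point, mean, axis))
```

Projecting onto one axis gives k = 1; repeating the procedure on the residuals would yield further axes, which is what a full SVD computes directly.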


SVD Cont’d

To approximate the time series, we use only the k eigenvectors of C with the largest eigenvalues.
$A' = U \times L_k$
$A'$ is an $M \times k$ matrix

[Figure: a time series X and its approximation X' built from eigenwaves 0 through 7]

SVD Cont’d

Advantages:
- Optimal dimensionality reduction (for linear projections)
Disadvantages:
- Computationally hard, especially if the time series are very long.
- Does not work for subsequence indexing

SVD Extensions

On-line approximation algorithm
- [Ravi Kanth et al, 1998]
Local dimensionality reduction:
- Cluster the time series, solve for each cluster
- [Chakrabarti and Mehrotra, 2000], [Thomasian et al]

Discrete Fourier Transform

Analyze the frequency spectrum of a one-dimensional signal
For $S = (S_0, \ldots, S_{n-1})$, the DFT is:
$S_f = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} S_i \, e^{-j 2\pi f i / n}, \quad f = 0, 1, \ldots, n-1, \quad j^2 = -1$
An efficient $O(n \log n)$ algorithm makes DFT a practical method
[Agrawal et al, 1993], [Rafiei and Mendelzon, 1998]
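
The definition above can be checked with a direct implementation. Note that this sketch is the naive O(n^2) sum, written for clarity, and not the O(n log n) fast algorithm; the function names are illustrative.

```python
import cmath
import math


def dft(series):
    """Naive O(n^2) DFT with the 1/sqrt(n) normalization used above."""
    n = len(series)
    return [sum(s * cmath.exp(-2j * cmath.pi * f * i / n)
                for i, s in enumerate(series)) / math.sqrt(n)
            for f in range(n)]


def distance(a, b):
    """Euclidean distance; works for real or complex coordinates."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))
```

With the 1/sqrt(n) factor the transform preserves Euclidean distances (Parseval's theorem), so the distance computed on any subset of the coefficients is never larger than the distance between the original series. That is the property needed to avoid false dismissals.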

Discrete Fourier Transform

Advantages:
- Efficient, concentrates the energy
Disadvantages:
- To project the n-dimensional time series into a k-dimensional space, the same k Fourier coefficients must be stored for all series
- This is not optimal for all series
- To find the k optimal coefficients for M time series, compute the average energy for each coefficient
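
The selection rule in the last bullet can be sketched as follows, assuming each series has already been transformed into a list of coefficients of the same length; `best_coefficients` is a hypothetical helper name.

```python
def best_coefficients(spectra, k):
    """Indices of the k coefficients with the highest average energy.

    spectra: one list of (possibly complex) coefficients per time series.
    """
    m, n = len(spectra), len(spectra[0])
    # Energy of a coefficient is its squared magnitude, averaged over all series.
    avg_energy = [sum(abs(spec[f]) ** 2 for spec in spectra) / m
                  for f in range(n)]
    ranked = sorted(range(n), key=lambda f: avg_energy[f], reverse=True)
    return sorted(ranked[:k])
```

The returned indices are then used for every series, so all reduced vectors live in the same k-dimensional space.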

Wavelets

Represent the time series as a sum of prototype functions, as the DFT does
Typical basis used: Haar wavelets
Difference from DFT: localization in time
Can be extended to 2 dimensions
[Chan and Fu, 1999]
Has been very useful in graphics, approximation techniques

Wavelets

An example (using the Haar wavelet basis)
- $S \equiv (2, 2, 7, 9)$: original time series
- $S' \equiv (5, 6, 0, 2)$: wavelet decomposition
- $S[0] = S'[0] - S'[1]/2 - S'[2]/2$
- $S[1] = S'[0] - S'[1]/2 + S'[2]/2$
- $S[2] = S'[0] + S'[1]/2 - S'[3]/2$
- $S[3] = S'[0] + S'[1]/2 + S'[3]/2$
Efficient O(n) algorithm to find the coefficients
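
A sketch of the linear-time procedure, using the same unnormalized convention as the example: each level stores pairwise averages and the difference (right minus left) between neighbors. The function names are illustrative, and the input length is assumed to be a power of two.

```python
def haar_decompose(series):
    """Haar decomposition by repeated pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details], where
    each detail is (right average - left average).
    """
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(series)
    levels = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        levels.append([b - a for a, b in pairs])       # details at this level
        averages = [(a + b) / 2 for a, b in pairs]     # input for the next level
    # Coarsest details come first, finest last.
    return averages + [d for level in reversed(levels) for d in level]


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    averages = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(averages)]
        averages = [x for avg, d in zip(averages, level)
                    for x in (avg - d / 2, avg + d / 2)]
        pos += len(level)
    return averages
```

Each level halves the amount of data, so the total work is n/2 + n/4 + ... and stays linear in n.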

Using wavelets for approximation

Keep only k coefficients, approximate the rest with 0
Keeping the first k coefficients:
- equivalent to low pass filtering
Keeping the largest k coefficients:
- More accurate representation, but not useful for indexing

[Figure: a time series X and its approximation X' built from Haar wavelets 0 through 7]

Temporal Partitioning

Very efficient technique (O(n) time algorithm)
Can be extended to address the subsequence matching problem
Equivalent to wavelets (when $k = 2^i$ and the mean is used)

[Figure: a time series X and its piecewise approximation X' with segments x0 through x7]
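
A minimal sketch, assuming that temporal partitioning here means cutting the series into k equal-length segments and keeping the mean of each (the "mean" variant mentioned above). The function names are hypothetical, and k is assumed to divide the series length.

```python
import math


def segment_means(series, k):
    """Approximate a series by the means of k equal-length segments."""
    n = len(series)
    if k <= 0 or n % k:
        raise ValueError("k must divide the series length")
    width = n // k
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def lower_bound_distance(means_a, means_b, width):
    """Distance on segment means, scaled by the segment width.

    It never exceeds the Euclidean distance between the original series.
    """
    return math.sqrt(width * sum((a - b) ** 2 for a, b in zip(means_a, means_b)))
```

For the series (2, 2, 7, 9) and k = 2 the segment means are (2, 8), which are the same level-1 averages that the Haar decomposition produces.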

Random projection

Based on the Johnson-Lindenstrauss lemma:
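For:
- $0 < \varepsilon < 1/2$,
- any (sufficiently large) set $\mathcal{S}$ of $M$ points in $\mathbb{R}^n$,
- $k = O(\varepsilon^{-2} \ln M)$
there exists a linear map $f: \mathcal{S} \to \mathbb{R}^k$ such that
- $(1-\varepsilon) \, D(S,T) < D(f(S),f(T)) < (1+\varepsilon) \, D(S,T)$ for all $S, T \in \mathcal{S}$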
Random projection is good with constant probability
[Indyk, 2000]

Random Projection: Application

Set $k = O(\varepsilon^{-2} \ln M)$
Select k random n-dimensional vectors
Project the time series onto the k vectors.
The resulting k-dimensional space approximately preserves the distances with high probability
Monte Carlo algorithm: we cannot tell whether a particular projection is correct
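
The steps above can be sketched as follows. The notes do not say how the random vectors are drawn; this sketch assumes independent standard Gaussian entries and a 1/sqrt(k) scaling, a common choice that keeps squared distances unchanged in expectation. The function name and the `seed` parameter are illustrative.

```python
import math
import random


def random_projection(series_list, k, seed=0):
    """Project n-dimensional series onto k random Gaussian vectors."""
    rng = random.Random(seed)
    n = len(series_list[0])
    # k random n-dimensional vectors with N(0, 1) entries (an assumption).
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    # Each output coordinate is the scaled dot product with one random vector.
    return [[scale * sum(v_i * s_i for v_i, s_i in zip(v, s)) for v in vectors]
            for s in series_list]
```

Unlike the DFT or segment means, the projected distance can be larger or smaller than the original one, so this map does not give a strict lower bound.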

Random Projection

A very useful technique, especially when used in conjunction with another technique (for example SVD)
Use random projection to reduce the dimensionality from thousands to hundreds, then apply SVD to reduce the dimensionality further
