High-Dimensional Data Indexing and Similarity Measures, Study notes of Computer Science

These notes cover the challenges and methods for indexing and searching high-dimensional data, including similarity measures such as Euclidean distance, dynamic time warping, and wavelets. They also cover index structures such as R-trees, R*-trees, and the M-tree, and touch on the curse of dimensionality and its impact on high-dimensional data.

Uploaded on 08/17/2009

koofers-user-okt
High-Dimensional Data

Topics

- Motivation
- Similarity Measures
- Index Structures


R trees, redux

- We want to minimize coverage and overlap
- We descend both branches to search for ...

[Figure: tree with nodes A and B over entries c, d, e, f, g]
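To make coverage and overlap concrete, here is a minimal Python sketch. It assumes 2-D axis-aligned bounding rectangles, takes coverage to mean the total area of the node rectangles and overlap to mean the area they share; the `Rect` type, the helper names, and the coordinates of A and B are illustrative and do not come from the slides.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(r: Rect) -> float:
    # An "inverted" rectangle (xmin > xmax or ymin > ymax) has zero area.
    return max(0.0, r.xmax - r.xmin) * max(0.0, r.ymax - r.ymin)


def intersection(a: Rect, b: Rect) -> Rect:
    # The result is inverted when a and b are disjoint; area() then gives 0.
    return Rect(max(a.xmin, b.xmin), max(a.ymin, b.ymin),
                min(a.xmax, b.xmax), min(a.ymax, b.ymax))


def intersects(a: Rect, b: Rect) -> bool:
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def branches_to_descend(query: Rect, children: dict) -> list:
    # A search must visit every child whose rectangle intersects the query.
    return [name for name, mbr in children.items() if intersects(query, mbr)]


# Made-up coordinates for two overlapping nodes A and B.
nodes = {"A": Rect(0, 0, 6, 4), "B": Rect(4, 2, 10, 8)}
coverage = area(nodes["A"]) + area(nodes["B"])
overlap = area(intersection(nodes["A"], nodes["B"]))

# A point query inside the overlap region forces a descent into both branches.
in_overlap = branches_to_descend(Rect(5, 3, 5, 3), nodes)
only_a = branches_to_descend(Rect(1, 1, 1, 1), nodes)
```

With these made-up rectangles, a point query that falls in the shared region must visit both A and B, which is the cost that minimizing overlap tries to avoid.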

R+ Trees

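- Store d in both A and B
- Like splitting d into two pieces

[Figure: rectangles c, d, e, f, g with bounding boxes A and B, and the tree with nodes A and B over entries c, d, e, d, f, g, in which d appears twice]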

R* trees

- Reinsert c instead of splitting the node

[Figure: rectangles c, d, e, f, g, x with bounding boxes A and B, and the tree with nodes A and B over entries x, d, e, f, g, c]

Curse of Dimensionality

Coverage and overlap as a function of dimension?

[Figure: panels labeled by dimension d, beginning with d=1]

Curse of Dimensionality

- Generally: exponential growth of the hypervolume as a function of dimension
- Other manifestations:
  - number of samples required to maintain the same accuracy
  - number of nodes in a neural network required to "monitor" the input space
  - lots more
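A small numeric sketch can make the exponential growth tangible. It assumes data spread uniformly over the unit hypercube; the function names and the choice of 10 samples per axis and a 1% query are illustrative, not taken from the slides.

```python
def samples_for_grid(points_per_axis: int, d: int) -> int:
    """Samples needed to keep the same per-axis resolution in d dimensions."""
    return points_per_axis ** d


def edge_length_for_fraction(fraction: float, d: int) -> float:
    """Side of a sub-cube of the unit cube that holds `fraction` of its volume."""
    return fraction ** (1.0 / d)


# How the sample count and the side of a "1% of the data" query cube
# change as the dimension grows (uniform data assumed).
for d in (1, 2, 3, 10, 100):
    print(d, samples_for_grid(10, d), edge_length_for_fraction(0.01, d))
```

Under this assumption, a query box meant to capture 1% of the data spans most of every axis once the dimension is large, so bounding rectangles in an index inevitably cover and overlap a great deal of space.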

High-dimensional data

- Finance
- Multimedia
  - Sound
  - Music ("Query by humming")
  - Images
  - Video
- Document Retrieval
- Biology/Medicine
  - DNA sequence matching
  - Medical imagery
- Moving Objects
  - [(t0,x0,y0), (t1,x1,y1), …]
- High-Energy Physics

Similarity Measure

Define a function s : V × V → Real

What properties should s have?

- Reflexive: s(x,x) = 0 (or infinity)
- Symmetric: s(x,y) = s(y,x)
- Triangle Inequality: s(x,y) + s(y,z) >= s(x,z)
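The three properties can be spot-checked on a handful of points. The sketch below is only illustrative: `check_properties` and the sample points are hypothetical, and a check on a finite sample can refute a property but never prove it. It compares Euclidean distance with squared Euclidean distance, which keeps the first two properties but breaks the triangle inequality.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def check_properties(s, points, tol=1e-9):
    """Test the three properties of s on every pair/triple of sample points."""
    return {
        "reflexive": all(abs(s(x, x)) <= tol for x in points),
        "symmetric": all(abs(s(x, y) - s(y, x)) <= tol
                         for x, y in itertools.combinations(points, 2)),
        "triangle": all(s(x, y) + s(y, z) + tol >= s(x, z)
                        for x, y, z in itertools.permutations(points, 3)),
    }


# Made-up sample points; three of them are collinear on purpose.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
euclid_report = check_properties(euclidean, points)
squared_report = check_properties(squared_euclidean, points)
```

For the collinear points (0,0), (1,0), (2,0), squared distance gives 1 + 1 < 4, so the triangle check fails for it while Euclidean distance passes all three checks on this sample.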

Timeseries Indexing

[Figure: a query series Q and two candidate series A and B]

Timeseries Indexing

[Figure: series A, B, C, D and the query Q]

Timeseries Indexing

- Euclidean distance
- Dynamic Time Warping
  - Jagadish, Faloutsos 1998, Keogh 2002
- Wavelets
  - Miller 2003
- LCSS (longest common subsequence)
  - Vlachos, Kollios, Gunopulos 2002
- EDR (edit distance on real sequences)
  - Chen, Ozsu, Oria 2005

Dynamic Time Warping

Dynamic Time Warping (2)

Dynamic Time Warping (3)

Dynamic Time Warping (4)

- Drawbacks:
  - Sensitive to noise
  - Expensive to compute
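For reference, a minimal sketch of the textbook dynamic-programming formulation of dynamic time warping follows. The slides' own definition is not preserved in the text, so the squared-difference local cost, the absence of a warping window, and the function names are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """DTW with squared local cost and no warping window (assumed variant)."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeats b[j-1]
                                     cost[i][j - 1],      # b[j-1] repeats a[i-1]
                                     cost[i - 1][j - 1])  # both advance
    return math.sqrt(cost[n][m])


# The same bump, shifted by one time step (made-up data).
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]
shifted_euclidean = euclidean(a, b)
shifted_dtw = dtw_distance(a, b)
```

On this pair the Euclidean distance is 2 while the warped distance is 0, because DTW may stretch the time axis to line the bumps up. The nested loops fill an n-by-m table, which is where the "expensive to compute" drawback comes from.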

Wavelets (3)

Haar wavelet transform

Pairwise sums and differences of neighboring samples: $s_i + s_{i+1}$ and $s_i - s_{i+1}$

Hierarchical decomposition allows fine-tuning
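The hierarchical decomposition can be sketched in a few lines of Python. This version assumes the averaging convention, dividing each sum and difference by 2, and an input whose length is a power of two; other normalizations, such as dividing by the square root of 2, are equally common, and the function names are illustrative.

```python
def haar_step(signal):
    """One level: pairwise averages (coarse) and half-differences (detail)."""
    avgs = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    diffs = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avgs, diffs


def haar_transform(signal):
    """Full decomposition: [overall average, coarsest detail, ..., finest details]."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx, details = list(signal), []
    while len(approx) > 1:
        approx, d = haar_step(approx)
        details = d + details  # coarser details go in front of finer ones
    return approx + details


coeffs = haar_transform([9, 7, 3, 5])  # [6.0, 2.0, 1.0, -1.0]
```

Keeping only the leading coefficients gives a coarse view of the series, and each further coefficient adds finer detail, which is the sense in which the decomposition allows fine-tuning.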

Wavelets (4)

- After one horizontal filtering

Wavelets (5)

- After two filterings: vertical and horizontal

Wavelets (6)

- Wavelets can reduce dimensionality, like
  - Principal Component Analysis (PCA),
  - Singular Value Decomposition (SVD),
  - others
- Indexing in the reduced feature space
  - False positives are OK; false negatives aren't
  - Use a more refined similarity measure to eliminate false positives
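The filter-and-refine idea can be sketched as follows. The sketch assumes an orthonormal Haar transform (each level scaled by the square root of 2), so that the distance over the first k coefficients never exceeds the true Euclidean distance; that lower bound is what rules out false negatives. The names, the linear scan standing in for a real index, and the sample data are all illustrative, and series lengths are assumed to be equal powers of two.

```python
import math


def orthonormal_haar(signal):
    """Haar transform scaled by 1/sqrt(2) per level; preserves Euclidean distance."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def range_query(query, database, eps, k):
    """Return (candidates from the reduced space, answers after refinement)."""
    q_reduced = orthonormal_haar(query)[:k]
    # Filter: the reduced distance is a lower bound, so no true answer is lost.
    # A real system would precompute these features and store them in an index.
    candidates = [name for name, series in database.items()
                  if euclidean(q_reduced, orthonormal_haar(series)[:k]) <= eps]
    # Refine: the full distance removes the false positives.
    answers = [name for name in candidates
               if euclidean(query, database[name]) <= eps]
    return candidates, answers


# Made-up data: B has the same mean as the query but a very different shape.
database = {"A": [1, 1, 1, 1.5], "B": [0, 2, 0, 2], "C": [5, 5, 5, 5]}
candidates, answers = range_query([1, 1, 1, 1], database, eps=1.0, k=1)
```

With a single coefficient, B survives the filter because it has the same mean as the query, and the full distance then discards it; C is pruned without ever computing its full distance.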

M-Tree

Telescoping Vector Tree (TV)

- node = (center, radius)
- dim(center) >= # of "active dimensions"