ADR and DataCutter: Parallel Processing of Multi-dimensional Data - Prof. Alan L. Sussman, Study notes of Computer Science

These notes cover the Active Data Repository (ADR) and DataCutter systems for processing large, irregular multi-dimensional datasets. ADR is used for building parallel databases and integrates storage, retrieval, and processing of multi-dimensional data sets. DataCutter is an indexing system that works with multi-dimensional data and uses a filter-based programming model. Topics include application properties, the processing loop, architecture, services, loading data, query planning, query execution, processing strategies, and results.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

ADR and DataCutter
Sergey Koren
CMSC818S
Thursday March 4th, 2004



Active Data Repository

- Used for building parallel databases from multi-dimensional data sets
- Integrates storage, retrieval, and processing of multi-dimensional data sets
- Customization for application-specific processing (Initialize, Map, Aggregate, Output)
- Supports common operations like memory management, data retrieval, and scheduling

Application Processing

- Retrieve data that matches the range query
- Map input to output
- An accumulator holds the intermediate result
  - Aggregate each input value with the intermediate result
  - The aggregate operation is usually commutative and associative, so inputs can be aggregated in any order
- The output is usually much smaller than the input, so the processing steps are called a reduction
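The loop described above can be sketched in a few lines of Python. This is only an illustration of the retrieve / map / aggregate pattern, not ADR's actual interface: the names `run_reduction`, `in_range`, and `to_cell`, and the sample data, are all hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

Point = Tuple[float, ...]


def in_range(point: Point, lo: Point, hi: Point) -> bool:
    """True if the point lies inside the axis-aligned query box [lo, hi]."""
    return all(l <= x <= h for x, l, h in zip(point, lo, hi))


def run_reduction(
    items: Iterable[Tuple[Point, float]],
    lo: Point,
    hi: Point,
    map_fn: Callable[[Point], Hashable],
    aggregate: Callable[[float, float], float],
    initial: float,
) -> Dict[Hashable, float]:
    """Retrieve items in the range, map each to an output key, aggregate."""
    accumulator: Dict[Hashable, float] = {}
    for point, value in items:
        if not in_range(point, lo, hi):
            continue  # item does not match the range query
        key = map_fn(point)  # map input to output
        accumulator[key] = aggregate(accumulator.get(key, initial), value)
    return accumulator


def to_cell(point: Point) -> Tuple[int, int]:
    """Hypothetical map function: project a 2-D point onto a unit grid cell."""
    return (int(point[0]), int(point[1]))


items = [((0.5, 0.5), 3), ((0.7, 0.2), 5), ((1.5, 0.5), 4), ((9.0, 9.0), 100)]
query_lo, query_hi = (0.0, 0.0), (2.0, 2.0)

result = run_reduction(items, query_lo, query_hi, to_cell, max, float("-inf"))
# max is commutative and associative, so input order does not matter.
reordered = run_reduction(reversed(items), query_lo, query_hi, to_cell, max, float("-inf"))

print(result)  # {(0, 0): 5, (1, 0): 4}
```

Four input items reduce to two output cells here, and processing the items in reverse order gives the same accumulator contents.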

Processing Loop

Services

- Attribute Space Service
  - Manages use of attribute spaces and user map functions
- Dataset Service
  - Manages datasets stored on the ADR back-end
- Indexing Service
  - Manages indexes (user-defined and ADR)
- Data Aggregation Service
  - Manages user-provided functions for aggregation
- Query Planning Service
  - Determines a query plan to process a set of queries
- Query Execution Service
  - Manages resources and carries out the query plan

ADR Application Suite

Query Planning

Tiling

- Used when the output is too large to fit entirely into memory
- Each tile is a subset of output chunks
- Results in an implicit tiling of the input chunks
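A minimal sketch of the tiling idea, assuming a simple greedy packing of output chunks in their given order. The slides do not say how ADR actually chooses tiles, so the packing rule, the function names `make_tiles` and `implied_input_tiles`, and the chunk names and sizes are assumptions for illustration.

```python
from typing import Dict, List


def make_tiles(output_chunk_sizes: Dict[str, int], memory_limit: int) -> List[List[str]]:
    """Greedily pack output chunks, in order, into tiles that fit in memory."""
    tiles: List[List[str]] = []
    current: List[str] = []
    used = 0
    for chunk_id, size in output_chunk_sizes.items():
        if size > memory_limit:
            raise ValueError(f"chunk {chunk_id} alone exceeds the memory limit")
        if current and used + size > memory_limit:
            tiles.append(current)  # close the current tile and start a new one
            current, used = [], 0
        current.append(chunk_id)
        used += size
    if current:
        tiles.append(current)
    return tiles


def implied_input_tiles(
    tiles: List[List[str]], input_to_outputs: Dict[str, List[str]]
) -> List[List[str]]:
    """For each output tile, list the input chunks that map into it."""
    result = []
    for tile in tiles:
        members = set(tile)
        result.append(
            sorted(i for i, outs in input_to_outputs.items() if members & set(outs))
        )
    return result


sizes = {"o0": 40, "o1": 40, "o2": 40, "o3": 40}  # hypothetical sizes
tiles = make_tiles(sizes, memory_limit=100)

input_to_outputs = {"i0": ["o0"], "i1": ["o1", "o2"], "i2": ["o3"]}
inputs_per_tile = implied_input_tiles(tiles, input_to_outputs)

print(tiles)            # [['o0', 'o1'], ['o2', 'o3']]
print(inputs_per_tile)  # [['i0', 'i1'], ['i1', 'i2']]
```

In this sketch input chunk `i1` maps to output chunks in both tiles, so it appears in both implied input tiles.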

Workload Partitioning

- Each processor gets responsibility for processing a subset of input/accumulator chunks

Query Execution

- Processing operations on storage manager to avoid copying
- Four steps
  1. Initialization
     - Accumulator elements for the current tile are allocated
  2. Local Reduction
     - Processors retrieve chunks on local disk and aggregate them into the accumulator
  3. Global Combine
     - Partial results from step 2 are combined
  4. Output Handling
     - Output is computed from the accumulator
- Overlap disk, network, and processing operations
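The four steps can be sketched as a single-process simulation. This is an illustrative model only: processors are plain Python lists, every simulated processor holds a full copy of the tile's accumulator, and the same `aggregate` function is reused for the global combine. All of these are simplifying assumptions, and `execute_tile` is a hypothetical name, not an ADR call.

```python
from functools import reduce
from operator import add
from typing import Callable, Dict, List, Sequence, Tuple


def execute_tile(
    local_items_per_proc: Sequence[Sequence[Tuple[str, float]]],
    tile_keys: Sequence[str],
    aggregate: Callable[[float, float], float],
    identity: float,
    output_fn: Callable[[float], float],
) -> Dict[str, float]:
    # 1. Initialization: allocate accumulator elements for the current tile.
    accumulators: List[Dict[str, float]] = [
        {key: identity for key in tile_keys} for _ in local_items_per_proc
    ]

    # 2. Local reduction: each processor aggregates its local items.
    for acc, items in zip(accumulators, local_items_per_proc):
        for key, value in items:
            if key in acc:  # items for other tiles are skipped
                acc[key] = aggregate(acc[key], value)

    # 3. Global combine: merge the partial results from step 2.
    combined = {
        key: reduce(aggregate, (acc[key] for acc in accumulators), identity)
        for key in tile_keys
    }

    # 4. Output handling: compute the output from the accumulator.
    return {key: output_fn(value) for key, value in combined.items()}


# Three simulated processors, each with (output key, value) pairs on "local disk".
local_items = [
    [("a", 1), ("b", 2)],
    [("a", 10)],
    [("b", 5), ("c", 7)],  # "c" belongs to a different tile
]
output = execute_tile(local_items, ["a", "b"], add, 0, lambda total: total)
print(output)  # {'a': 11, 'b': 7}
```

The sketch runs the steps strictly in sequence; it does not attempt to model the overlap of disk, network, and processing operations.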

Processing Strategies (2)

SRA (Sparsely Replicated Accumulator)

- FRA is wasteful of memory because it replicates all accumulator chunks, even those that have no input chunks mapping to them
- This adds overhead in the Initialization phase, and communication and computation overhead in the Global Combine phase
- SRA is the same as FRA, except that a ghost chunk is allocated only if the processor has a local input chunk that projects to the accumulator chunk
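To make the memory difference concrete, here is a small sketch that counts accumulator chunk copies under the two rules stated above. It ignores the distinction between an owned chunk and a ghost chunk and simply counts every copy a processor allocates. The function names and the sample mapping are hypothetical.

```python
from typing import Dict, List, Set


def fra_allocation(
    local_inputs: Dict[int, List[str]], all_acc_chunks: List[str]
) -> Dict[int, Set[str]]:
    """FRA-style rule: every processor allocates every accumulator chunk."""
    return {proc: set(all_acc_chunks) for proc in local_inputs}


def sra_allocation(
    local_inputs: Dict[int, List[str]], projects_to: Dict[str, List[str]]
) -> Dict[int, Set[str]]:
    """SRA-style rule: allocate a chunk only if a local input projects to it."""
    return {
        proc: {acc for chunk in inputs for acc in projects_to[chunk]}
        for proc, inputs in local_inputs.items()
    }


# Hypothetical layout: processor id -> input chunks on its local disk.
local_inputs = {0: ["i0", "i1"], 1: ["i2"]}
# Hypothetical projection: input chunk -> accumulator chunks it maps to.
projects_to = {"i0": ["a0"], "i1": ["a0", "a1"], "i2": ["a3"]}
all_acc_chunks = ["a0", "a1", "a2", "a3"]

fra = fra_allocation(local_inputs, all_acc_chunks)
sra = sra_allocation(local_inputs, projects_to)

fra_total = sum(len(chunks) for chunks in fra.values())
sra_total = sum(len(chunks) for chunks in sra.values())
print(fra_total, sra_total)  # 8 3
```

Chunk `a2` has no input mapping to it, so under the SRA-style rule no processor allocates it, while the FRA-style rule allocates it on both.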

Processing Strategies (3)

DA (Distributed Accumulator)

- Each processor is given responsibility to carry out the work associated with a set of output chunks
- Accumulator chunks are not replicated to remote processes
- Remote input chunks that map to local output chunks must be forwarded to the processor that owns those output chunks
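The forwarding rule can be sketched as a function that lists which input chunks must move between processors. The data layout and the name `da_transfers` are assumptions made for illustration; the slides do not describe ADR's actual data structures.

```python
from typing import Dict, List, Tuple


def da_transfers(
    input_location: Dict[str, int],
    projects_to: Dict[str, List[str]],
    output_owner: Dict[str, int],
) -> List[Tuple[str, int, int]]:
    """Return (input chunk, source processor, destination processor) triples.

    An input chunk is forwarded to every processor that owns an output chunk
    it maps to, except the processor that already stores it locally.
    """
    transfers = []
    for chunk, source in input_location.items():
        destinations = {output_owner[out] for out in projects_to[chunk]} - {source}
        for destination in sorted(destinations):
            transfers.append((chunk, source, destination))
    return transfers


# Hypothetical layout: where each input chunk is stored.
input_location = {"i0": 0, "i1": 0, "i2": 1}
# Hypothetical projection: input chunk -> output chunks it maps to.
projects_to = {"i0": ["o0"], "i1": ["o0", "o1"], "i2": ["o1"]}
# Each output chunk has exactly one owner; no replication.
output_owner = {"o0": 0, "o1": 1}

transfers = da_transfers(input_location, projects_to, output_owner)
print(transfers)  # [('i1', 0, 1)]
```

Only `i1` is forwarded: it is stored on processor 0 but also maps to `o1`, which processor 1 owns. The number of transfers grows with the number of input chunks and with how many output chunks each one maps to.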

Results

- Communication volume and computation time per processor for DA decrease as the number of processors increases
- Initialization and global combine phases for FRA and SRA remain constant
- DA has higher communication volume and more load imbalance when the number of processors and the data size increase
- DA communication volume is proportional to the number of input chunks and the fan-out
- FRA communication volume is proportional to the number of output chunks

Current Version

http://www.cs.umd.edu/projects/hpsl/chaos/ResearchAreas/adr/

Version 1.0 – Updated in December 2000

Indexing

Data and index files

Segments contain some data tied to the multi-dimensional space

Each segment has an MBR (Minimum Bounding Rectangle) that encompasses the data points in the segment

Segments are units of access
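A short sketch of how an MBR can be computed for a segment and used to decide which segments a range query needs to read. The function names and the sample segments are hypothetical, and the linear scan over segments stands in for a real index.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)


def mbr(points: Sequence[Point]) -> Box:
    """Smallest axis-aligned box that contains all the points."""
    lo = tuple(min(coords) for coords in zip(*points))
    hi = tuple(max(coords) for coords in zip(*points))
    return lo, hi


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes overlap in every dimension."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(
        al <= bh and bl <= ah for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi)
    )


def select_segments(segments: Dict[str, List[Point]], query: Box) -> List[str]:
    """Return the segments whose MBR overlaps the query box."""
    return [name for name, pts in segments.items() if intersects(mbr(pts), query)]


segments = {
    "s0": [(0, 0), (2, 1)],
    "s1": [(5, 5), (6, 8)],
    "s2": [(1, 4), (3, 6)],
}
query = ((1.5, 0.5), (4, 4.5))

selected = select_segments(segments, query)
print(mbr(segments["s0"]))  # ((0, 0), (2, 1))
print(selected)             # ['s0', 's2']
```

Because segments are the unit of access, a segment whose MBR overlaps the query is read as a whole, even if only some of its points fall inside the query box.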

Indexing (cont…)

Use R-Trees, but a lot of data makes the index large

Use both summary index files and detailed index files

  - A summary index file is associated with multiple segments and detailed index files
  - A detailed index file specifies segments

Can be organized into groups
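The two-level lookup can be sketched as follows. Flat Python lists stand in for the R-Trees mentioned above, and the layout (one summary entry per detailed index file) is an assumption made for illustration; `query_two_level` and the file and segment names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes overlap in every dimension."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(
        al <= bh and bl <= ah for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi)
    )


def query_two_level(
    summary_index: List[Tuple[Box, str]],
    detailed_indexes: Dict[str, List[Tuple[Box, str]]],
    query: Box,
) -> Tuple[List[str], List[str]]:
    """Return (matching segment ids, detailed index files that were opened)."""
    hits: List[str] = []
    opened: List[str] = []
    for summary_box, detail_name in summary_index:
        if not intersects(summary_box, query):
            continue  # skip the whole detailed index file
        opened.append(detail_name)
        for segment_box, segment_id in detailed_indexes[detail_name]:
            if intersects(segment_box, query):
                hits.append(segment_id)
    return hits, opened


summary_index = [
    (((0, 0), (10, 10)), "d0"),
    (((20, 20), (30, 30)), "d1"),
]
detailed_indexes = {
    "d0": [(((0, 0), (4, 4)), "seg0"), (((6, 6), (10, 10)), "seg1")],
    "d1": [(((20, 20), (30, 30)), "seg2")],
}

hits, opened = query_two_level(summary_index, detailed_indexes, ((3, 3), (5, 5)))
print(hits)    # ['seg0']
print(opened)  # ['d0']
```

The summary level lets the query skip detailed index file `d1` entirely, which is the point of keeping a small summary in front of the larger detailed index files.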