































These notes cover the Active Data Repository (ADR) and DataCutter systems for processing large, irregular multi-dimensional datasets. ADR is used for building parallel databases and integrates storage, retrieval, and processing of multi-dimensional data sets. DataCutter is an indexing system that works with multi-dimensional data and uses a filter-based programming model. Topics include application properties, the processing loop, architecture, services, loading data, query planning, query execution, processing strategies, and results.
Typology: Study notes
































ADR is used for building parallel databases from multi-dimensional data sets. It integrates storage, retrieval, and processing of multi-dimensional data sets. It allows customization for application-specific processing (Initialize, Map, Aggregate, Output) and supports common operations like memory management, data retrieval, and scheduling.
The processing loop retrieves the data that matches the range query and maps each input item to the output. An accumulator holds the intermediate result, and each input value is aggregated with that intermediate result. The aggregate operation is usually commutative and associative, so inputs can be aggregated in any order. The output is usually much smaller than the input, so the processing steps are called a reduction.
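The loop described above can be sketched in a few lines of Python. This is only an illustration, not ADR's actual interface: the function `run_range_query` and its parameter names are hypothetical, chosen to mirror the Initialize, Map, Aggregate, and Output customization points.

```python
# Hypothetical sketch of a range-query reduction loop (not the ADR API).
def run_range_query(items, in_range, map_fn, init, aggregate, output):
    """items: iterable of (coords, value) pairs from the input dataset."""
    acc = {}  # accumulator: output key -> intermediate result
    for coords, value in items:
        if not in_range(coords):      # retrieve only data matching the range query
            continue
        key = map_fn(coords)          # Map: input item -> output element
        if key not in acc:
            acc[key] = init()         # Initialize the accumulator element
        acc[key] = aggregate(acc[key], value)  # Aggregate with intermediate result
    return {k: output(v) for k, v in acc.items()}  # Output: finalize each element


# Toy data: 2-D points with a value; each output cell covers a 2x2 block.
points = [((0, 0), 3.0), ((1, 1), 5.0), ((2, 3), 4.0), ((9, 9), 7.0)]
query = dict(
    in_range=lambda c: 0 <= c[0] < 4 and 0 <= c[1] < 4,
    map_fn=lambda c: (c[0] // 2, c[1] // 2),
    init=lambda: float("-inf"),
    aggregate=max,
    output=lambda a: a,
)
result = run_range_query(points, **query)
result_reversed = run_range_query(list(reversed(points)), **query)
```

Because `max` is commutative and associative, processing the points in reverse order yields the same result, which is the property that lets the aggregation run in any order.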
Attribute Space Service: manages use of attribute spaces and user map functions.
Dataset Service: manages datasets stored on the ADR back-end.
Indexing Service: manages indexes (user-defined and ADR).
Data Aggregation Service: manages user-provided functions for aggregation.
Query Planning Service: determines a query plan to process a set of queries.
Query Execution Service: manages resources and carries out the query plan.
If the output is too large to fit entirely into memory, it is processed in tiles. Each tile is a subset of output chunks. This results in implicit tiling on input chunks.
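A minimal sketch of tiling, assuming a simple greedy grouping of output chunks by size. The greedy policy, the function names, and the chunk identifiers are assumptions made for illustration; the notes do not say how ADR selects tiles.

```python
# Hypothetical tiling sketch; the greedy policy is an assumption, not ADR's planner.
def make_tiles(output_chunk_sizes, memory_limit):
    """Group output chunks into tiles whose total size fits in memory_limit."""
    tiles, current, used = [], [], 0
    for chunk_id, size in output_chunk_sizes.items():
        if size > memory_limit:
            raise ValueError(f"chunk {chunk_id} alone exceeds the memory limit")
        if current and used + size > memory_limit:
            tiles.append(current)      # close the current tile and start a new one
            current, used = [], 0
        current.append(chunk_id)
        used += size
    if current:
        tiles.append(current)
    return tiles


def input_chunks_per_tile(tiles, mapping):
    """mapping: input chunk id -> set of output chunk ids it maps to.

    The tiling of the output implies a tiling of the input: a tile needs
    every input chunk that maps to at least one of its output chunks.
    """
    return [sorted(i for i, outs in mapping.items() if outs & set(tile))
            for tile in tiles]


sizes = {"o0": 4, "o1": 4, "o2": 4}
tiles = make_tiles(sizes, memory_limit=8)
mapping = {"i0": {"o0"}, "i1": {"o1", "o2"}, "i2": {"o2"}}
inputs_by_tile = input_chunks_per_tile(tiles, mapping)
```

In this toy example the input chunk `i1` maps to output chunks in both tiles, so it appears in the input set of each tile.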
Each processor gets responsibility for processing a subset of input/accumulator chunks.
Processing operations are performed on the storage manager to avoid copying. Query execution has four steps.
SRA (Sparsely Replicated Accumulator): FRA (Fully Replicated Accumulator) is wasteful of memory because it replicates all accumulator chunks, even those that have no input chunks mapping to them. This causes initialization overhead in the Initialization phase, and communication and computation overhead in the Global Combine phase. SRA is the same as FRA, except that a ghost chunk is allocated only if the processor has a local input chunk that projects to the accumulator chunk.
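The difference in ghost chunk allocation can be sketched as follows. The data layout (`owner`, `local_inputs`, `mapping`) and the function names are hypothetical, and the sketch assumes a ghost chunk is a replica of an accumulator chunk owned by another processor.

```python
# Hypothetical sketch comparing ghost chunk allocation under FRA and SRA.
def fra_ghosts(owner, num_procs):
    """FRA: every processor replicates every accumulator chunk it does not own."""
    all_chunks = set(owner)
    return {p: all_chunks - {o for o, q in owner.items() if q == p}
            for p in range(num_procs)}


def sra_ghosts(owner, local_inputs, mapping):
    """SRA: a ghost chunk exists only if a local input chunk projects to it."""
    ghosts = {}
    for p, inputs in local_inputs.items():
        touched = set()
        for i in inputs:
            touched |= mapping[i]     # accumulator chunks this input projects to
        ghosts[p] = {o for o in touched if owner[o] != p}
    return ghosts


# Toy layout: two processors, four accumulator chunks, three input chunks.
owner = {"o0": 0, "o1": 0, "o2": 1, "o3": 1}        # accumulator chunk -> owner
local_inputs = {0: ["i0", "i1"], 1: ["i2"]}         # processor -> local inputs
mapping = {"i0": {"o0"}, "i1": {"o1", "o2"}, "i2": {"o3"}}

fra = fra_ghosts(owner, num_procs=2)
sra = sra_ghosts(owner, local_inputs, mapping)
```

Here FRA allocates two ghost chunks on each processor, while SRA allocates one on processor 0 and none on processor 1, which is the memory saving the text describes.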
DA (Distributed Accumulator): each processor is given responsibility to carry out the work associated with a set of output chunks. Accumulator chunks are not replicated to remote processes. Remote input chunks that map to local output chunks must be forwarded to the processor that owns those output chunks.
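A rough sketch of which input chunks would have to be forwarded under this strategy. The function `forwarding_messages` and the data layout are hypothetical and only count messages; they do not model how ADR schedules or overlaps communication.

```python
# Hypothetical sketch: input chunks forwarded when accumulators are not replicated.
def forwarding_messages(owner, local_inputs, mapping):
    """Return (input_chunk, source_proc, dest_proc) for every forwarded chunk.

    owner: output chunk -> processor responsible for it
    local_inputs: processor -> input chunks stored locally
    mapping: input chunk -> set of output chunks it maps to
    """
    messages = []
    for src, inputs in local_inputs.items():
        for i in inputs:
            # Every remote owner of an output chunk that i maps to needs a copy.
            dests = {owner[o] for o in mapping[i]} - {src}
            for dst in sorted(dests):
                messages.append((i, src, dst))
    return messages


owner = {"o0": 0, "o1": 0, "o2": 1, "o3": 1}
local_inputs = {0: ["i0", "i1"], 1: ["i2"]}
mapping = {"i0": {"o0"}, "i1": {"o1", "o2"}, "i2": {"o3"}}

messages = forwarding_messages(owner, local_inputs, mapping)
```

Only `i1` is forwarded in this toy layout, because it is stored on processor 0 but maps to `o2`, which processor 1 owns.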
Communication volume and computation time per processor for DA decrease as the number of processors increases. The initialization and global combine phases for FRA and SRA remain constant. DA has higher communication volume and more load imbalance when the number of processors and the data size increase. DA communication volume is proportional to the number of input chunks and the fan-out, while FRA communication volume is proportional to the number of output chunks.
A summary index is associated with multiple segments and detailed index files. A detailed index specifies segments.