Parallel Database Systems: Understanding Parallelism and Its Impact on DBMS, Study notes of Principles of Database Management

An overview of parallel database systems: the concept of parallelism, its benefits, and the different types of parallelism in a DBMS. It also covers the obstacles to good speed-up and scale-up, architecture issues, the advantages of shared-nothing systems, data partitioning, and parallel query processing.

Parallel DBMS
CMPSCI 645
Slide content due to Ramakrishnan, Gehrke, Hellerstein, Gray.
Figures taken from: DeWitt and Gray. Parallel Database Systems: The Future of High Performance Database Systems. CACM 1992.

Parallel vs. Distributed DBs

 Parallel database systems

  • Improve performance through parallelizing various operations: loading data, indexing, query evaluation. Data may be distributed, but purely for performance reasons.

 Distributed database systems

  • Data is physically stored across multiple sites, each of which runs a DBMS and can function independently. Data distribution is determined by local ownership and availability, in addition to performance.

Parallel DBMS: Intro

 Parallelism is natural to DBMS processing

  • Pipeline parallelism: many machines each doing one step in a multi-step process.
  • Partition parallelism: many machines doing the same thing to different pieces of data.
  • Both are natural in DBMS!
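
The two forms of parallelism can be sketched in a few lines of Python. This is only an illustration, not code from the slides: the stage names (`scan`, `select_even`, `square`) and the round-robin split are assumptions, and the generator chain merely models the dataflow between stages rather than running them on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def scan(rows):
    # Pipeline stage 1: produce rows one at a time.
    for row in rows:
        yield row


def select_even(rows):
    # Pipeline stage 2: filter rows as they arrive from the previous stage.
    for row in rows:
        if row % 2 == 0:
            yield row


def square(rows):
    # Pipeline stage 3: transform each surviving row.
    for row in rows:
        yield row * row


def run_pipeline(rows):
    # Pipeline parallelism (modelled): each stage consumes the stream
    # produced by the stage before it.
    return list(square(select_even(scan(rows))))


def run_partitioned(rows, n_workers=4):
    # Partition parallelism: split the input round-robin, run the same
    # sequential program on every piece, then merge the results.
    partitions = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pieces = list(pool.map(run_pipeline, partitions))
    return sorted(x for piece in pieces for x in piece)
```

Note that `run_partitioned` reuses the unchanged sequential program `run_pipeline` on each partition, which is why both kinds of parallelism come naturally to a DBMS.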

[Figure: pipeline vs. partition parallelism, each composed of ordinary sequential programs.]

DBMS: The || Success Story

 DBMSs are the most (only?) successful application of parallelism.

  • Teradata, Tandem vs. Thinking Machines, KSR.
  • Every major DBMS vendor has some || server.

 Reasons for success:

  • Bulk-processing (= partition ||-ism).
  • Natural pipelining.
  • Inexpensive hardware can do the trick!
  • Users/app-programmers don’t need to think in ||

Enemies of good speed-up / scale-up

 Start up work

  • If thousands of processes must be started, this can dominate actual computation time

 Interference

  • The slowdown each new process imposes on all others when accessing shared resources

 Skew

  • Variance in the size of jobs for each process. The service time for the whole job is the service time of its slowest step.
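
These three effects can be put into a toy cost model. The sketch below is only an illustration under assumed simplifications: processes are started one after another, interference is a linear slowdown per additional process, and work is proportional to partition size. The function and parameter names are hypothetical and not from the slides. Speed-up is taken as serial elapsed time divided by parallel elapsed time on the same problem.

```python
def elapsed_time(partition_sizes, startup_per_process=0.0,
                 interference=0.0, time_per_row=1.0):
    """Toy model of the elapsed time of one parallel job."""
    n = len(partition_sizes)
    # Start-up: assumed to be paid once per process, serially.
    startup = startup_per_process * n
    # Interference: assumed linear slowdown imposed by every other process.
    slowdown = 1.0 + interference * (n - 1)
    # Skew: the job finishes only when its largest partition is done.
    slowest = max(partition_sizes) * time_per_row * slowdown
    return startup + slowest


def speed_up(serial_time, parallel_time):
    # Same problem size, more hardware: ideal value equals the processor count.
    return serial_time / parallel_time
```

With 1000 rows, four even partitions give `speed_up(elapsed_time([1000]), elapsed_time([250] * 4))` of 4.0, while the skewed split `[700, 100, 100, 100]` finishes no sooner than its 700-row partition, so the speed-up falls below 1.5 on the same four processors.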

Architecture Issue: Shared What?

 Alternative architectures:

  • Shared memory: all processors share a common global memory and have access to all disks.
  • Shared disk: each processor has private memory, but direct access to all disks.
  • Shared nothing: each memory and disk is owned by one processor, which acts as the server for that data.

[Figure: shared-memory, shared-disk, and shared-nothing architectures.]

Different Types of DBMS ||-ism

 Intra-operator parallelism

  • get all machines working to compute a given operation (scan, sort, join)

 Inter-operator parallelism

  • each operator may run concurrently on a different site (exploits pipelining)

 Inter-query parallelism

  • different queries run on different sites

 We’ll focus on intra-operator ||-ism

Limits of pipelined parallelism in DBMS

 Relational pipelines are usually not very long

 Some relational operators block (e.g. sorting, aggregation)

 Execution cost of one operator may be much higher than another (an example of skew)

 As a result, partitioned parallelism is key to achieving speed-up and scale-up

Parallel query processing

[Figure: two relational scans consuming two input relations, A and B, and feeding their outputs to a join operator that in turn produces a data stream C.]

Parallel Scans

 Scan in parallel, and merge.

 Selection may not require all sites for range or hash partitioning.

 Indexes can be built at each partition.
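
Why a selection may skip some sites can be shown with a small sketch. It assumes integer keys, a list of range boundaries, and a modulo hash function; all names are hypothetical. The loop in `parallel_scan` visits sites one after another only for simplicity, whereas a real system would scan them concurrently.

```python
import bisect


def range_site(key, boundaries):
    # boundaries[i] is the smallest key stored at site i + 1,
    # so [100, 200, 300] describes four sites.
    return bisect.bisect_right(boundaries, key)


def hash_site(key, n_sites):
    # Assumed hash function for integer keys.
    return key % n_sites


def sites_for_range_query(lo, hi, boundaries):
    # Range partitioning: only sites whose key range overlaps [lo, hi] scan.
    return list(range(range_site(lo, boundaries),
                      range_site(hi, boundaries) + 1))


def sites_for_equality_query(key, n_sites):
    # Hash partitioning: an equality selection touches exactly one site.
    return [hash_site(key, n_sites)]


def parallel_scan(partitions, predicate, sites):
    # Scan only the relevant sites and merge their outputs.
    result = []
    for site in sites:
        result.extend(row for row in partitions[site] if predicate(row))
    return result
```

A range selection over hash-partitioned data gets no such pruning, since matching keys may live at any site.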

Dataflow Network for || Join

 Good use of split/merge makes it easier to build parallel versions of sequential join code.
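
A minimal sketch of this idea, assuming rows are Python dicts and the join is an equi-join on one key: a `split` operator routes rows by hashing the join key, an unmodified sequential hash join runs on each pair of partitions, and a merge step concatenates the outputs. The function names and the use of a thread pool are illustrative assumptions, not the design shown in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, key, n):
    # Split operator: route each row to a partition by hashing its join key,
    # so matching rows of both inputs land in the same partition.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def sequential_hash_join(a_rows, b_rows, key):
    # Ordinary sequential join code: build a hash table on A, probe with B.
    table = {}
    for a in a_rows:
        table.setdefault(a[key], []).append(a)
    return [{**a, **b} for b in b_rows for a in table.get(b[key], [])]


def parallel_join(a_rows, b_rows, key, n=4):
    a_parts = split(a_rows, key, n)
    b_parts = split(b_rows, key, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each worker runs the sequential join on one pair of partitions.
        pieces = list(pool.map(sequential_hash_join,
                               a_parts, b_parts, [key] * n))
    # Merge operator: concatenate the per-partition output streams.
    return [row for piece in pieces for row in piece]
```

The join code itself never learns that it is running in parallel; all of the parallelism lives in the split and merge steps around it.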

Complex Parallel Query Plans

 Complex Queries: Inter-Operator parallelism

  • Pipelining between operators: note that sort and phase 1 of hash-join block the pipeline!
  • Bushy trees (figure: relations A, B, R, S assigned to Sites 1-4, Sites 5-, Sites 1-)

 Best serial plan != Best || plan!

 Trivial counter-example:

  • Table partitioned with local secondary index at two nodes
  • Range query: all of node 1 and 1% of node 2.
  • Node 1 should do a scan of its partition.
  • Node 2 should use secondary index.

 SELECT * FROM telephone_book WHERE name < 'NoGood';
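
The counter-example amounts to letting every node pick its own access path from its local selectivity. The sketch below assumes a simple rule of thumb with a hypothetical 10% threshold; real optimizers use cost models, and none of these names come from the slides.

```python
def choose_access_path(selected_fraction, index_threshold=0.1):
    # Assumed rule of thumb: a secondary index pays off only when a small
    # fraction of the local partition qualifies. The threshold is hypothetical.
    if selected_fraction < index_threshold:
        return "index scan"
    return "table scan"


def plan_per_node(selected_fractions):
    # Each node chooses independently, based on its own partition.
    return {node: choose_access_path(fraction)
            for node, fraction in selected_fractions.items()}
```

For the range query above, `plan_per_node({"node 1": 1.0, "node 2": 0.01})` assigns a table scan to node 1 and an index scan to node 2, a mix that no single serial plan for the whole table would contain.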

Sequential vs. Parallel Optimization

[Figure: partitions A..M and N..Z; one node is read with a table scan, the other with an index scan.]

Parallel DBMS Summary

 ||-ism natural to query processing:

  • Both pipeline and partition ||-ism!

 Shared-Nothing vs. Shared-Mem

  • Shared-disk too, but less standard
  • Shared-mem easy, costly. Doesn’t scale up.
  • Shared-nothing cheap, scales well, harder to implement.

 Intra-op, inter-op, and inter-query ||-ism are all possible.