Parallel Database Management Systems: Leveraging Parallelism for Efficient Data Processing, Slides of Database Management Systems (DBMS)

An introduction to parallel database management systems (dbms), explaining why parallel access to data is essential, the concepts of pipeline and partition parallelism, and the success stories of parallel dbmss such as teradata and tandem. It also covers different types of dbms parallelism, automatic data partitioning, and parallel query processing techniques like parallel scans, sorting, and joins.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

dhanapati
dhanapati 🇮🇳

4.1

(24)

123 documents

1 / 21

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Parallel DBMS
Docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15

Partial preview of the text

Download Parallel Database Management Systems: Leveraging Parallelism for Efficient Data Processing and more Slides Database Management Systems (DBMS) in PDF only on Docsity!

Parallel DBMS

Why Parallel Access To Data?

1 Terabyte

10 MB/s

At 10 MB/s

1.2 days to scan

1 Terabyte

1,000 x parallel

1.5 minute to scan.

Parallelism: divide a big problem into many smaller ones to be solved in parallel. Docsity.com

DBMS: The || Success Story

  • DBMSs are among most (only?) successful application

of parallelism.

  • Teradata, Tandem vs. Thinking Machines,
  • Every major DBMS vendor has some || server
  • Workstation manufacturers depend on || DB server sales.
  • Reasons for success:
  • Bulk-processing (= partitioned ||-ism).
  • Natural pipelining.
  • Inexpensive hardware can do the trick!
  • Users/app-prog. don’t need to think in ||

Some || Terminology

  • Speed-Up
    • More resources means

proportionally less time

for given amount of data.

  • Scale-Up
    • If resources increased in

proportion to increase in

data size, time is constant.

degree of ||-ism

Xact/sec. (throughput)

(^) Ideal

degree of ||-ism

sec./Xact (response time)

Ideal

Different Types of DBMS ||-ism

  • Intra-operator parallelism
    • get all machines working to compute a given operation (scan, sort, join)
  • Inter-operator parallelism
    • each operator may run concurrently on a different site (exploits pipelining)
  • Inter-query parallelism
    • different queries run on different site

We’ll focus on intra-operator ||-ism

Automatic Data Partitioning

Partitioning a table:

Range Hash Round Robin

Shared-disk and -memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning!

A...E F...J^ K...N^ O...S^ T...Z^ A...E F...J^ K...N^ O...S^ T...Z^ A...E^ F...J^ K...N^ O...S T...Z

Good for group-by, range queries, and also equip-join

Good for equijoins (^) Good to spread load; Most flexible - but

Parallel Sorting

  • New records again and again:
    • 8.5 Gb/minute, shared-nothing; Datamation benchmark

in 2.41 secs (UCB students)

  • Idea:
    • Scan in parallel, and range-partition as you go.
    • As tuples come in, begin “local” sorting on each
    • Resulting data is sorted, and range-partitioned.
    • Problem: skew!
    • Solution: “sample” data at start to determine partition

points.

Parallel Joins

  • Nested loop:
    • Each outer tuple must be compared with each

inner tuple that might join.

  • Easy for range partitioning on join cols, hard

otherwise!

  • Sort-Merge (or plain Merge-Join):
    • Sorting gives range-partitioning.
    • Merging partitioned tables is local.

Dataflow Network for || Join

  • Good use of split/merge makes it easier to build

parallel versions of sequential join code.

Complex Parallel Query Plans

  • Complex Queries: Inter-Operator parallelism
    • Pipelining between operators:
      • sort and phase 1 of hash-join block the pipeline!!
    • Bushy Trees
A B R S

Sites 1-4 Sites 5-

Sites 1-

Observations

  • It is relatively easy to build a fast parallel

query executor

  • It is hard to write a robust high-quality parallel

query optimizer:

  • There are many tricks.
  • One quickly hits the complexity barrier.

Parallel Query Optimization

  • Common approach: 2 phases
    • Pick best sequential plan (System R algo)
    • Pick degree of parallelism based on current system

parameters.

  • “Bind” operators to processors
    • Take query tree, “decorate” with assigned

processor

Parallel DBMS Summary

  • ||-ism natural to query processing:
    • Both pipeline and partition ||-ism!
  • Shared-Nothing vs. Shared-Mem
    • Shared-disk too, but less standard
    • Shared-memory easy, costly.
      • But doesn’t scaleup.
    • Shared-nothing cheap, scales well.
      • But harder to implement.
  • Intra-op, Inter-op & Inter-query ||-ism possible.

|| DBMS Summary, cont.

  • Data layout choices important!
  • Most DB operations can be done partition-||
    • Sort-merge join, hash-join.
  • Complex plans.
    • Allow for pipeline-||ism, but sorts, hashes block the

pipeline.

  • Partition ||-ism achieved via bushy trees.