Advanced Database Systems-Lecture 11 Slides-Computer Science, Slides of Database Management Systems (DBMS)

Query Processing, Notation, Table Scan, Nested-loop Join, External Merge Sort, Performance of External Merge Sort, Tricks for Sorting, Double Buffering, Blocked I/O, Internal Sort Algorithm, Quicksort, Replacement Selection, Sort-merge Join, Optimization of SMJ, Performance of two-pass SMJ, Hash join, Other Sort-based Algorithms, Partitioning Phase, Probing phase, Hash join Tricks, Hybrid Hash Join, Hash Join versus SMJ, Duality of Sort and Hash, I/O Patterns

Typology: Slides

2011/2012

Uploaded on 01/28/2012

arold
arold 🇺🇸

4.7

(24)

372 documents

1 / 9

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Query Processing
CPS 216
Advanced Database Systems
2
Announcements (February 22)
Reading assignment for this week
Variant indexes (due next Monday)
Homework #2 due in 1½ weeks (March 3)
Course project proposal due in 2 weeks
Midterm in 2½ weeks
3
Overview
Many different ways of processing the same query
Scan? Sort? Hash? Use an index?
All with different performance characteristics
Best choice depends on the situation
Implement all alternatives
Let the query optimizer choose at run-time
pf3
pf4
pf5
pf8
pf9

Partial preview of the text

Download Advanced Database Systems-Lecture 11 Slides-Computer Science and more Slides Database Management Systems (DBMS) in PDF only on Docsity!

Query Processing

CPS 216

Advanced Database Systems

2

Announcements (February 22)

™ Reading assignment for this week

ƒ Variant indexes (due next Monday)

™ Homework #2 due in 1½ weeks (March 3)

™ Course project proposal due in 2 weeks

™ Midterm in 2½ weeks

3

Overview

™ Many different ways of processing the same query

ƒ Scan? Sort? Hash? Use an index? ƒ All with different performance characteristics

™ Best choice depends on the situation

ƒ Implement all alternatives ƒ Let the query optimizer choose at run-time

Notation

™ Relations: R , S

™ Tuples: r , s

™ Number of tuples: | R |, | S |

™ Number of disk blocks: B ( R ), B ( S )

™ Number of memory blocks available: M

™ Cost metric

ƒ Number of I/O’s ƒ Memory requirement

5

Table scan

™ Scan table R and process the query

ƒ Selection over R ƒ Projection of R without duplicate elimination

™ I/O’s: B ( R )

ƒ Trick for selection: stop early if it is a lookup by key

™ Memory requirement: 2 (double buffering)

™ Not counting the cost of writing the result out

ƒ Same for any algorithm! ƒ Maybe not needed—results may be pipelined directly into another operator

6

Nested-loop join

™ R  p S ™ For each block of R , and for each r in the block: For each block of S , and for each s in the block: Output rs if p evaluates to true over r and s ƒ R is called the outer table; S is called the inner table ™ I/O’s: B ( R ) + | R | ⋅ B ( S ) ™ Memory requirement: 4 (double buffering)

™ Improvement: block-based nested-loop join ƒ For each block of R , and for each block of S : For each r in the R block, and for each s in the S block: … ƒ I/O’s: B ( R ) + B ( R ) ⋅ B ( S ) ƒ Memory requirement: same as before

Performance of external merge sort

™ Number of passes: d log M – 1 d B ( R ) / M e e + 1

™ I/O’s

ƒ Multiply by 2 ⋅ B ( R ): each pass reads the entire relation once and writes it once ƒ Subtract B ( R ) for the final pass ƒ Roughly, this is O ( B ( R ) ⋅ log (^) M B ( R ) )

™ Memory requirement: M (as much as possible)

11

Some tricks for sorting

™ Double buffering

ƒ Allocate an additional block for each run

™ Blocked I/O

ƒ Instead of reading/writing one disk block at time, read/write a bunch (“cluster”)

™ Dealing with input whose size

is not an exact power of fan-in

12

Internal sort algorithm

™ Quicksort

)Fast

™ Replacement selection

ƒ One block for input, one for output, rest for a heap ƒ Fill the heap with input records ƒ Find the smallest record in the heap that is no less than the largest record in the current run

  • If that exists, move it to the output buffer, and move a new record from input buffer into the heap
  • If that does not exist, flush output and start a new run )Slower than quicksort, but produces longer runs (twice the size of memory if records are in random order)

Sort-merge join

™ R  R. A = S. B S

™ Sort R and S by their join attributes, and then merge

r , s = the first tuples in sorted R and S

Repeat until one of R and S is exhausted:

If r. A > s. B then s = next tuple in S

else if r. A < s. B then r = next tuple in R

else output all matching tuples, and

r , s = next in R and S

™ I/O’s: sorting + 2 B ( R ) + 2 B ( S )

ƒ In most cases (e.g., join of key and foreign key) ƒ Worst case is B ( R ) ⋅ B ( S ): everything joins

14

Example

R : S : R  R. A = S. B S :

r 1. A = 1 s 1. B = 1

r 2. A = 3 s 2. B = 2

r 3. A = 3 s 3. B = 3

r 4. A = 5 s 4. B = 3

r 5. A = 7 s 5. B = 8

r 6. A = 7

r 7. A = 8

r 1 s 1

r 2 s 3

r 2 s 4

r 3 s 3

r 3 s 4

r 7 s 5

15

Optimization of SMJ

™ Idea: combine join with the merge phase of merge sort

™ Sort: produce sorted runs of size M for R and S

™ Merge and join: merge the runs of R , merge the runs of S , and merge-join the result streams as they are generated!

Merge

Merge Sorted runs

R

S

Disk Memory

Join

Partitioning phase

™ Partition R and S according to the same hash

function on their join attributes

M – 1 partitions of R

Memory Disk

R

Same for S

… …

20

Probing phase

™ Read in each partition of R , stream in the

corresponding partition of S , join

ƒ Typically build a hash table for the partition of R

  • Not the same hash function used for partition, of course! Disk Memory

R partitions

S partitions

load

stream For each S tuple, probe and join

21

Performance of hash join

™ I/O’s: 3 ⋅ ( B ( R ) + B ( S ))

™ Memory requirement:

ƒ In the probing phase, we should have enough memory to fit one partition of R : M – 1 ≥ B ( R ) / ( M – 1) ƒ M > sqrt( B ( R )) ƒ We can always pick R to be the smaller relation, so: M > sqrt(min( B ( R ), B ( S ))

Hash join tricks

™ What if a partition is too large for memory?

ƒ Read it back in and partition it further!

  • See the duality in multi-pass merge sort here?

23

Hybrid hash join

™ What if there is extra memory available?

ƒ Use it to avoid writing/re-reading partitions

  • Of both R and S! Memory Disk

R … …

A generalization of the idea is described in the survey paper by Graefe

24

Hash join versus SMJ

(Assuming two-pass)

™ I/O’s: same

™ Memory requirement: hash join is lower ƒ sqrt(min( B ( R ), B ( S )) < sqrt( B ( R ) + B ( S ))

™ Other factors ƒ Hash join performance depends on the quality of the hash

  • Might not get evenly sized buckets