



Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
An overview of various query processing techniques used in database systems, including table scans, nested-loop joins, external merge sort, and hash join. The techniques are introduced with their notation, assumptions, and performance characteristics. Table scans involve processing the entire table and performing selection and projection operations. Nested-loop joins use iterative methods to join two tables. External merge sort is used for sorting large data that doesn't fit in memory. Hash join uses hashing to partition and join tables. The document also discusses improvements, tricks, and comparisons between these techniques.
Typology: Slides
1 / 6
This page cannot be seen from the preview
Don't miss anything!




May schedule a make-up lecture later if necessary
3
Scan? Sort? Hash? Use an index? All have different performance characteristics and/or make different assumptions about data
Implement all alternatives Let the query optimizer choose at run-time
4
Number of I/O’s Memory requirement
5
Selection over R Projection of R without duplicate elimination
Trick for selection: stop early if it is a lookup by key
Same for any algorithm! Maybe not needed—results may be pipelined into another operator
6
R p S For each block of R , and for each r in the block: For each block of S , and for each s in the block: Output rs if p evaluates to true over r and s R is called the outer table; S is called the inner table I/O’s: B ( R ) + | R | ⋅ B ( S ) Memory requirement: 3 (double buffering) Improvement: block-based nested-loop join For each block of R , and for each block of S : For each r in the R block, and for each s in the S block: … I/O’s: B ( R ) + B ( R ) ⋅ B ( S ) Memory requirement: same as before
If the key of the inner table is being matched May reduce half of the I/O’s
Stuff memory with as much of R as possible, stream S by, and join every S tuple with all R tuples in memory I/O’s: B ( R ) + d B ( R ) / ( M – 2 ) e ⋅ B ( S )
There are d B ( R ) / M e level-0 sorted runs
( M – 1) memory blocks for input, 1 to buffer output # of level- i runs = d # of level-( i –1) runs / ( M – 1) e
9
10
Multiply by 2 ⋅ B ( R ): each pass reads the entire relation once and writes it once Subtract B ( R ) for the final pass Roughly, this is O ( B ( R ) ⋅ log (^) M B ( R ) )
11
Allocate an additional block for each run Trade-off: smaller fan-in (more passes)
Instead of reading/writing one disk block at time, read/write a bunch (“cluster”) More sequential I/O’s Trade-off: larger cluster → smaller fan-in (more passes)
12
In most cases (e.g., join of key and foreign key) Worst case is B ( R ) ⋅ B ( S ): everything joins
Typically build a hash table for the partition of R
R partitions
S partitions
…
…
load …
stream For each S tuple, probe and join
In the probing phase, we should have enough memory to fit one partition of R : M – 1 ≥ B ( R ) / ( M – 1) M > sqrt( B ( R )) We can always pick R to be the smaller relation, so: M > sqrt(min( B ( R ), B ( S ))
21
Read it back in and partition it again!
22
(Assuming two-pass) I/O’s: same Memory requirement: hash join is lower sqrt(min( B ( R ), B ( S )) < sqrt( B ( R ) + B ( S )) Hash join wins when two relations have very different sizes Other factors Hash join performance depends on the quality of the hash
23
Example: non-equality joins that are not very selective
Example: … WHERE user_defined_pred ( R. A , S. B )
24
More or less like hash join
Check for duplicates within each partition/bucket
Apply the hash functions to GROUP BY attributes Tuples in the same group must end up in the same partition/bucket Keep a running aggregate value for each group
Sorting: physical division, logical combination Hashing: logical division, physical combination
Sorting: multi-level merge Hashing: recursive partitioning
Sorting: sequential write, random read (merge) Hashing: random write, sequential read (partition)
Use an ISAM, B+^ -tree, or hash index on R ( A )
Use an ordered index (e.g., ISAM or B+^ -tree) on R ( A ) Hash index is not applicable
Example: B+^ -tree index on R ( A , B ) How about B+^ -tree index on R ( B , A )?
27
Example: π A (σ A > v ( R ))
One lookup leads to all result tuples in their entirety
28
Need to follow pointers to get the actual result tuples Say that 20% of R satisfies A > v
29
Typically, the cost of an index lookup is 2-4 I/O’s Beats other join methods if | R | is not too big Better pick R to be the smaller relation
30
Idea: use the ordering provided by the indexes on R ( A ) and S ( B ) to eliminate the sorting step of sort-merge join Trick: use the larger key to probe the other index Possibly skipping many keys that don’t match
B +-tree on R ( A )
B +-tree on S ( B )
1 2 3 4 7 9 18
1 7 9 11 12 17 19