Database Systems: Query Processing Techniques: Scans, Joins, Sort, and Hash, Slides of Introduction to Database Management Systems

Duke University Introduction to Database Management Systems

An overview of various query processing techniques used in database systems, including table scans, nested-loop joins, external merge sort, and hash join. The techniques are introduced with their notation, assumptions, and performance characteristics. Table scans involve processing the entire table and performing selection and projection operations. Nested-loop joins use iterative methods to join two tables. External merge sort is used for sorting large data that doesn't fit in memory. Hash join uses hashing to partition and join tables. The document also discusses improvements, tricks, and comparisons between these techniques.

Typology: Slides

2011/2012

Uploaded on 01/29/2012

arold 🇺🇸

Query Processing

CPS 116

Introduction to Database Systems

Announcements

Homework #3 sample solution available today (Nov. 9)

Course project milestone #2 due this Thursday

No class or office hours next Tuesday (Nov. 16): I am out of town

May schedule a make-up lecture later if necessary

Overview

Many different ways of processing the same query

Scan? Sort? Hash? Use an index?

All have different performance characteristics and/or make different assumptions about data

Best choice depends on the situation

Implement all alternatives

Let the query optimizer choose at run-time

Notation

Relations: R, S

Tuples: r, s

Number of tuples: |R|, |S|

Number of disk blocks: B(R), B(S)

Number of memory blocks available: M

Cost metric

Number of I/O’s

Memory requirement

Table scan

Scan table R and process the query

Selection over R

Projection of R without duplicate elimination

I/O’s: B(R)

Trick for selection: stop early if it is a lookup by key

Memory requirement: 2 (double buffering)

Not counting the cost of writing the result out

Same for any algorithm!

Maybe not needed: results may be pipelined into another operator

Nested-loop join

RpS

For each block of R, and for each rin the block:

For each block of S, and for each sin the block:

Output rs if pevaluates to true over rand s

R is called the outer table; S is called the inner table

I/O’s: B(R) + |R| ⋅B(S)

Memory requirement: 3 (double buffering)

Improvement: block-based nested-loop join

For each block of R, and for each block of S:

For each rin the Rblock, and for each sin the Sblock: …

I/O’s: B(R) + B(R) ⋅B(S)

Memory requirement: same as before
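The two loop orders above can be compared with a small sketch. It is only an illustration: a "disk block" is modeled as a Python list of tuples, every pass over a block counts as one I/O, and the names `tuple_nested_loop_join`, `block_nested_loop_join`, and `same_key` are made up for this example.

```python
from typing import Callable, List

Block = List[tuple]  # assumption: a disk block is modeled as a list of tuples


def tuple_nested_loop_join(R: List[Block], S: List[Block],
                           p: Callable[[tuple, tuple], bool]):
    """Tuple-based variant: rescan all of S once per tuple of R."""
    out, ios = [], 0
    for r_block in R:
        ios += 1                      # read one block of R
        for r in r_block:
            for s_block in S:
                ios += 1              # read one block of S (again)
                for s in s_block:
                    if p(r, s):
                        out.append(r + s)
    return out, ios


def block_nested_loop_join(R: List[Block], S: List[Block],
                           p: Callable[[tuple, tuple], bool]):
    """Block-based variant: rescan all of S once per block of R."""
    out, ios = [], 0
    for r_block in R:
        ios += 1                      # read one block of R
        for s_block in S:
            ios += 1                  # read one block of S
            for r in r_block:
                for s in s_block:
                    if p(r, s):
                        out.append(r + s)
    return out, ios


# Toy data: B(R) = 2, |R| = 4, B(S) = 3
R = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
S = [[(2, "x"), (3, "y")], [(4, "z"), (5, "w")], [(6, "v")]]
same_key = lambda r, s: r[0] == s[0]

out_t, ios_t = tuple_nested_loop_join(R, S, same_key)   # 2 + 4 * 3 I/O's
out_b, ios_b = block_nested_loop_join(R, S, same_key)   # 2 + 2 * 3 I/O's
```

Both variants produce the same join result; only the number of times S is rescanned differs.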

More improvements of nested-loop join

Stop early

If the key of the inner table is being matched

May reduce the I/O’s by half

Make use of available memory

Stuff memory with as much of R as possible, stream S by, and join every S tuple with all R tuples in memory

I/O’s: B(R) + ⌈B(R) / (M – 2)⌉ ⋅ B(S)

Or, roughly: B(R) ⋅ B(S) / M

Memory requirement: M (as much as possible)

Which table would you pick as the outer?
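One way to explore the question is to plug numbers into the I/O formula for both choices of outer table. The sketch below is only arithmetic; the function name `nlj_io_cost` and the table sizes are hypothetical.

```python
from math import ceil


def nlj_io_cost(b_outer: int, b_inner: int, m: int) -> int:
    """B(outer) + ceil(B(outer) / (M - 2)) * B(inner)."""
    if m < 3:
        raise ValueError("need at least 3 memory blocks")
    return b_outer + ceil(b_outer / (m - 2)) * b_inner


# Hypothetical sizes: B(R) = 1000, B(S) = 100, M = 52
cost_r_outer = nlj_io_cost(1000, 100, 52)   # 1000 + 20 * 100
cost_s_outer = nlj_io_cost(100, 1000, 52)   # 100 + 2 * 1000
```

For these made-up sizes, taking the smaller table as the outer gives the lower cost.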

External merge sort

Problem: sort R, but R does not fit in memory

Pass 0: read M blocks of R at a time, sort them, and write out a level-0 run

There are ⌈B(R) / M⌉ level-0 sorted runs

Pass i: merge (M – 1) level-(i – 1) runs at a time, and write out a level-i run

(M – 1) memory blocks for input, 1 to buffer output

# of level-i runs = ⌈# of level-(i – 1) runs / (M – 1)⌉

Final pass produces 1 sorted run

Example of external merge sort

Input: 1, 7, 4, 5, 2, 8, 3, 6, 9

Pass 0

Pass 1

Pass 2 (final)
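The contents of each pass did not survive in this text, so the following sketch recomputes them under two assumptions that are not stated here: M = 3 memory blocks, and one value per block. The function name `external_merge_sort` is made up for the illustration.

```python
import heapq
from typing import List


def external_merge_sort(values: List[int], m: int) -> List[List[List[int]]]:
    """Return the sorted runs after each pass (assumes one value per block)."""
    if m < 3:
        raise ValueError("need at least 3 memory blocks to merge")
    # Pass 0: read M blocks at a time, sort them, write out a level-0 run.
    runs = [sorted(values[i:i + m]) for i in range(0, len(values), m)]
    passes = [runs]
    # Pass i: merge (M - 1) runs at a time until a single run remains.
    fan_in = m - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes.append(runs)
    return passes


passes = external_merge_sort([1, 7, 4, 5, 2, 8, 3, 6, 9], m=3)
for level, runs in enumerate(passes):
    print(f"Pass {level}: {runs}")
# Pass 0: [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
# Pass 1: [[1, 2, 4, 5, 7, 8], [3, 6, 9]]
# Pass 2: [[1, 2, 3, 4, 5, 6, 7, 8, 9]]
```

With these assumptions the sort takes three passes, matching the Pass 0 / Pass 1 / Pass 2 labels above.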

Performance of external merge sort

Number of passes: ⌈log_{M – 1} ⌈B(R) / M⌉⌉ + 1

I/O’s

Multiply by 2 ⋅ B(R): each pass reads the entire relation once and writes it once

Subtract B(R) for the final pass

Roughly, this is O(B(R) ⋅ log_M B(R))

Memory requirement: M (as much as possible)
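The pass count and I/O cost can be checked numerically. This sketch counts passes by repeatedly dividing the number of runs by the fan-in instead of evaluating the logarithm, which avoids floating-point rounding; the function names and the sample sizes are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def num_passes(b_r: int, m: int) -> int:
    """Pass 0 plus as many (M - 1)-way merge passes as needed."""
    if m < 3:
        raise ValueError("need at least 3 memory blocks to merge")
    runs = ceil_div(b_r, m)          # number of level-0 runs
    passes = 1
    while runs > 1:
        runs = ceil_div(runs, m - 1)
        passes += 1
    return passes


def sort_io_cost(b_r: int, m: int) -> int:
    """2 * B(R) per pass, minus B(R) because the final write is not counted."""
    return 2 * b_r * num_passes(b_r, m) - b_r


passes_small = num_passes(9, 3)        # 3 passes
cost_small = sort_io_cost(9, 3)        # 2 * 9 * 3 - 9
cost_large = sort_io_cost(1000, 11)    # 3 passes: 2 * 1000 * 3 - 1000
```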

Some tricks for sorting

Double buffering

Allocate an additional block for each run

Trade-off: smaller fan-in (more passes)

Blocked I/O

Instead of reading/writing one disk block at a time, read/write a bunch (“cluster”)

More sequential I/O’s

Trade-off: larger cluster → smaller fan-in (more passes)

Sort-merge join

R ⋈_{R.A = S.B} S

Sort R and S by their join attributes, and then merge

r, s = the first tuples in sorted R and S

Repeat until one of R and S is exhausted:

If r.A > s.B then s = next tuple in S

else if r.A < s.B then r = next tuple in R

else output all matching tuples, and

r, s = next in R and S

I/O’s: sorting + 2 B(R) + 2 B(S)

In most cases (e.g., join of key and foreign key)

Worst case is B(R) ⋅ B(S): everything joins
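The merge loop can be sketched in memory as follows. This is an illustration only: Python's `sorted` stands in for the external sort, the join attribute is assumed to be the first field of each tuple, and "output all matching tuples" is implemented by pairing the two groups that share the current key.

```python
def sort_merge_join(R, S):
    """Equijoin on the first field of each tuple (R.A = S.B)."""
    R = sorted(R, key=lambda r: r[0])
    S = sorted(S, key=lambda s: s[0])
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] > S[j][0]:
            j += 1                    # s = next tuple in S
        elif R[i][0] < S[j][0]:
            i += 1                    # r = next tuple in R
        else:
            key = R[i][0]
            # Find the end of the group with this key on each side.
            i_end = i
            while i_end < len(R) and R[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(S) and S[j_end][0] == key:
                j_end += 1
            # Output all matching pairs, then move past both groups.
            for r in R[i:i_end]:
                for s in S[j:j_end]:
                    out.append(r + s)
            i, j = i_end, j_end
    return out


R = [(3, "r3"), (1, "r1"), (2, "r2a"), (2, "r2b")]
S = [(2, "s2a"), (4, "s4"), (2, "s2b"), (3, "s3")]
result = sort_merge_join(R, S)
```

When many tuples share a key on both sides, the inner pairing loop is where the cost grows toward the worst case.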

Probing phase

Read in each partition of R, stream in the corresponding partition of S, join

Typically build a hash table for the partition of R

Not the same hash function used for partition, of course!

[Figure: R partitions and S partitions on disk; each R partition is loaded into memory while the corresponding S partition is streamed through. For each S tuple, probe and join]
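A minimal in-memory sketch of the two phases follows. The partitioning step here is an assumption based on the usual hash join description (hash each tuple's join key into one of k partitions); the names `partition` and `hash_join`, the choice k = 4, and the use of the first tuple field as the join key are all hypothetical. A Python dict stands in for the in-memory hash table built on each R partition.

```python
from collections import defaultdict


def partition(tuples, k, key=lambda t: t[0]):
    """Partitioning phase: split tuples into k partitions by hash(key) mod k."""
    parts = [[] for _ in range(k)]
    for t in tuples:
        parts[hash(key(t)) % k].append(t)
    return parts


def hash_join(R, S, k=4):
    out = []
    # Matching keys always land in partitions with the same index.
    for r_part, s_part in zip(partition(R, k), partition(S, k)):
        # Probing phase: load one partition of R into a hash table.
        table = defaultdict(list)
        for r in r_part:
            table[r[0]].append(r)
        # Stream the corresponding partition of S; probe and join.
        for s in s_part:
            for r in table.get(s[0], []):
                out.append(r + s)
    return out


R = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
S = [(2, "x"), (5, "y"), (7, "z")]
joined = sorted(hash_join(R, S))
```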

Performance of hash join

I/O’s: 3 ⋅ (B(R) + B(S))

Memory requirement:

In the probing phase, we should have enough memory to fit one partition of R: M – 1 ≥ B(R) / (M – 1)

M > sqrt(B(R))

We can always pick R to be the smaller relation, so: M > sqrt(min(B(R), B(S)))
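The step from the first inequality to the second can be spelled out. The derivation assumes that partitioning splits R evenly into M – 1 partitions, so each one occupies about B(R) / (M – 1) blocks. The factor of 3 in the I/O count comes from reading and writing both relations once during partitioning, then reading both once more during probing.

```latex
\begin{align*}
M - 1 \ge \frac{B(R)}{M - 1}
  &\iff (M - 1)^2 \ge B(R) \\
  &\iff M \ge \sqrt{B(R)} + 1 \\
  &\implies M > \sqrt{B(R)}
\end{align*}
```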

Hash join tricks

What if a partition is too large for memory?

Read it back in and partition it again!

See the duality in multi-pass merge sort here?

Hash join versus SMJ

(Assuming two-pass)

I/O’s: same

Memory requirement: hash join is lower

sqrt(min(B(R), B(S))) < sqrt(B(R) + B(S))

Hash join wins when two relations have very different sizes

Other factors

Hash join performance depends on the quality of the hash

Might not get evenly sized buckets

SMJ can be adapted for inequality join predicates

SMJ wins if R and/or S are already sorted

SMJ wins if the result needs to be in sorted order

What about nested-loop join?

May be best if many tuples join

Example: non-equality joins that are not very selective

Necessary for black-box predicates

Example: … WHERE user_defined_pred(R.A, S.B)

Other hash-based algorithms

Union (set), difference, intersection

More or less like hash join

Duplicate elimination

Check for duplicates within each partition/bucket

GROUP BY and aggregation

Apply the hash functions to GROUP BY attributes

Tuples in the same group must end up in the same partition/bucket

Keep a running aggregate value for each group
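As a rough illustration of hash-based aggregation, the sketch below computes a SUM per group. The function name `hash_group_sum`, the (group, value) row layout, and k = 4 partitions are assumptions made for the example.

```python
from collections import defaultdict


def hash_group_sum(rows, k=4):
    """SUM(v) ... GROUP BY g, computed one hash partition at a time."""
    # Partition on the GROUP BY attribute: a whole group lands in one partition.
    parts = [[] for _ in range(k)]
    for g, v in rows:
        parts[hash(g) % k].append((g, v))
    result = {}
    for part in parts:
        running = defaultdict(int)    # running aggregate value per group
        for g, v in part:
            running[g] += v
        result.update(running)        # groups never span partitions
    return result


rows = [(1, 10), (2, 5), (1, 7), (3, 1), (2, 2)]
totals = hash_group_sum(rows)
```

Because every tuple of a group hashes to the same partition, each partition can be aggregated independently and the partial results simply combined.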

Duality of sort and hash

Divide-and-conquer paradigm

Sorting: physical division, logical combination

Hashing: logical division, physical combination

Handling very large inputs

Sorting: multi-level merge

Hashing: recursive partitioning

I/O patterns

Sorting: sequential write, random read (merge)

Hashing: random write, sequential read (partition)

Selection using index

Equality predicate: σ_{A = v}(R)

Use an ISAM, B+-tree, or hash index on R(A)

Range predicate: σ_{A > v}(R)

Use an ordered index (e.g., ISAM or B+-tree) on R(A)

Hash index is not applicable

Indexes other than those on R(A) may be useful

Example: B+-tree index on R(A, B)

How about B+-tree index on R(B, A)?

Index versus table scan

Situations where index clearly wins:

Index-only queries which do not require retrieving actual tuples

Example: π_A(σ_{A > v}(R))

Primary index clustered according to search key

One lookup leads to all result tuples in their entirety

Index versus table scan (cont’d)

BUT(!):

Consider σ_{A > v}(R) and a secondary, non-clustered index on R(A)

Need to follow pointers to get the actual result tuples

Say that 20% of R satisfies A > v

Could happen even for equality predicates

I/O’s for index-based selection: lookup + 20% |R|

I/O’s for scan-based selection: B(R)

Table scan wins if a block contains more than 5 tuples
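The break-even point can be checked with a little arithmetic. In this sketch the table size, the 3-I/O lookup cost, and the function names are hypothetical; each qualifying tuple is charged one I/O, as in the comparison above.

```python
from math import ceil


def index_selection_ios(num_tuples: int, selectivity: float,
                        lookup_ios: int = 3) -> int:
    """Lookup cost plus one I/O per qualifying tuple (non-clustered index)."""
    return lookup_ios + round(selectivity * num_tuples)


def scan_selection_ios(num_tuples: int, tuples_per_block: int) -> int:
    """A table scan reads every block once: B(R)."""
    return ceil(num_tuples / tuples_per_block)


n = 100_000                                   # hypothetical |R|
index_cost = index_selection_ios(n, 0.20)     # 3 + 20000
scan_dense = scan_selection_ios(n, 20)        # 20 tuples per block
scan_sparse = scan_selection_ios(n, 4)        # 4 tuples per block
```

With 20 tuples per block the scan is far cheaper; with only 4 tuples per block the index comes out ahead, consistent with the 5-tuples-per-block threshold.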

Index nested-loop join

R ⋈_{R.A = S.B} S

Idea: use the value of R.A to probe the index on S(B)

For each block of R, and for each r in the block:

Use the index on S(B) to retrieve s with s.B = r.A

Output rs

I/O’s: B(R) + |R| · (index lookup)

Typically, the cost of an index lookup is 2-4 I/O’s

Beats other join methods if |R| is not too big

Better pick R to be the smaller relation

Memory requirement: 2
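A small sketch of the loop, with a Python dict standing in for the index on S(B). That substitution is an assumption for illustration (a real system would use an on-disk B+-tree or hash index), as are the names `build_index` and `index_nested_loop_join` and the choice of the first tuple field as the join attribute.

```python
from collections import defaultdict


def build_index(S, key=lambda s: s[0]):
    """Stand-in for an index on S(B): maps each key to its S tuples."""
    index = defaultdict(list)
    for s in S:
        index[key(s)].append(s)
    return index


def index_nested_loop_join(R_blocks, s_index):
    out = []
    for block in R_blocks:            # for each block of R
        for r in block:               # for each r in the block
            # Probe the index with r.A to retrieve s with s.B = r.A.
            for s in s_index.get(r[0], []):
                out.append(r + s)
    return out


R_blocks = [[(1, "a"), (7, "b")], [(9, "c"), (10, "d")]]
S = [(7, "x"), (9, "y"), (9, "z"), (11, "w")]
joined = index_nested_loop_join(R_blocks, build_index(S))
```

Only S tuples with a matching key are ever touched, which is why the cost scales with |R| rather than with B(S).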

Zig-zag join using ordered indexes

R ⋈_{R.A = S.B} S

Idea: use the ordering provided by the indexes on R(A) and S(B) to eliminate the sorting step of sort-merge join

Trick: use the larger key to probe the other index

Possibly skipping many keys that don’t match

B+-tree on R(A): 1 2 3 4 7 9 18

B+-tree on S(B): 1 7 9 11 12 17 19
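The zig-zag idea can be sketched on the two key lists above, assuming the first list holds the keys of R(A) and the second those of S(B). Sorted Python lists stand in for the B+-tree leaf levels and `bisect_left` stands in for an index probe; both are simplifications, and the sketch assumes keys are unique within each list.

```python
from bisect import bisect_left


def zigzag_join(a_keys, b_keys):
    """Return the keys present in both sorted lists and the number of probes."""
    matches, probes = [], 0
    i = j = 0
    while i < len(a_keys) and j < len(b_keys):
        if a_keys[i] == b_keys[j]:
            matches.append(a_keys[i])
            i += 1
            j += 1
        elif a_keys[i] < b_keys[j]:
            # Use the larger key to probe the other index, skipping
            # every smaller key in between.
            i = bisect_left(a_keys, b_keys[j], i + 1)
            probes += 1
        else:
            j = bisect_left(b_keys, a_keys[i], j + 1)
            probes += 1
    return matches, probes


r_keys = [1, 2, 3, 4, 7, 9, 18]
s_keys = [1, 7, 9, 11, 12, 17, 19]
matches, probes = zigzag_join(r_keys, s_keys)
```

On this data the first probe jumps from key 2 straight to key 7 in the first list, skipping 3 and 4 without comparing them one by one.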
