CS 245: Database Systems - Query Optimization Notes, Slides of Principles of Database Management

Notes on query optimization in database systems, covering topics such as plan generation and comparison, estimating ios, join algorithms, and index usage. It also discusses factors affecting performance and various optimization techniques.

Typology: Slides

2011/2012

Uploaded on 07/15/2012

saibal
saibal 🇮🇳

32 documents

1 / 13

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
CS 245 Notes 7 1
CS 245: Database System
Principles
Notes 7: Query Optimization
Hector Garcia-Molina
CS 245 Notes 7 2
--> Generating and comparing plans
Query
Generate Plans
Pruning x x
Estimate Cost
Cost
Select
Query Optimization
Pick Min
CS 245 Notes 7 3
To generate plans consider:
Transforming relational algebra expression
(e.g. order of joins)
Use of existing indexes
Building indexes or sorting on the fly
CS 245 Notes 7 4
Implementation details:
e.g. - Join algorithm
- Memory management
- Parallel processing
CS 245 Notes 7 5
Estimating IOs:
Count # of disk blocks that must be
read (or written) to execute query plan
CS 245 Notes 7 6
To estimate costs, we may have
additional parameters:
B(R) = # of blocks containing R tuples
f(R) = max # of tuples of R per block
M = # memory blocks available
docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd

Partial preview of the text

Download CS 245: Database Systems - Query Optimization Notes and more Slides Principles of Database Management in PDF only on Docsity!

CS 245 Notes 7 1

CS 245: Database System

Principles

Notes 7: Query Optimization

Hector Garcia-Molina

CS 245 Notes 7 2

--> Generating and comparing plans Query Generate Plans Pruning x x Estimate Cost Cost Select

Query Optimization

Pick Min

CS 245 Notes 7 3

To generate plans consider:

  • Transforming relational algebra expression (e.g. order of joins)
  • Use of existing indexes
  • Building indexes or sorting on the fly

CS 245 Notes 7 4

  • Implementation details: e.g. - Join algorithm - Memory management - Parallel processing

CS 245 Notes 7 5

Estimating IOs:

  • Count # of disk blocks that must be read (or written) to execute query plan

CS 245 Notes 7 6

To estimate costs, we may have

additional parameters:

B(R) = # of blocks containing R tuples f(R) = max # of tuples of R per block M = # memory blocks available

2

CS 245 Notes 7 7

To estimate costs, we may have

additional parameters:

B(R) = # of blocks containing R tuples f(R) = max # of tuples of R per block M = # memory blocks available

HT(i) = # levels in index i LB(i) = # of leaf blocks in index i

CS 245 Notes 7 8

Clustering index

Index that allows tuples to be read in an order that corresponds to physical order A

indexA

10 15 17 19 35 37

CS 245 Notes 7 9

Notions of clustering

  • Clustered file organization …..
  • Clustered relation …..
  • Clustering index

R1 R2 S1 S2 R3 R4 S3 S

R1 R2 R3 R4 R5 R5 R7 R

CS 245 Notes 7 10

Example R1 R2 over common attribute C

T(R1) = 10,

T(R2) = 5,

S(R1) = S(R2) = 1/10 block Memory available = 101 blocks

CS 245 Notes 7 11

Example R1 R2 over common attribute C

T(R1) = 10,

T(R2) = 5,

S(R1) = S(R2) = 1/10 block Memory available = 101 blocks

 Metric: # of IOs (ignoring writing of result) CS 245 Notes 7 12

Caution!

This may not be the best way to compare

  • ignoring CPU costs
  • ignoring timing
  • ignoring double buffering requirements

4

CS 245 Notes 7 19

  • Hash join (conceptual)
    • Hash function h, range 0  k
    • Buckets for R1: G0, G1, ... Gk
    • Buckets for R2: H0, H1, ... Hk

CS 245 Notes 7 20

  • Hash join (conceptual)
    • Hash function h, range 0  k
    • Buckets for R1: G0, G1, ... Gk
    • Buckets for R2: H0, H1, ... Hk

Algorithm (1) Hash R1 tuples into G buckets (2) Hash R2 tuples into H buckets (3) For i = 0 to k do match tuples in Gi, Hi buckets

CS 245 Notes 7 21

Simple example hash: even/odd

R1 R2 Buckets 2 5 Even 4 4 R1 R 3 12 Odd: 5 3 8 13 9 8 11 14

2 4 8 4 12 8 14

3 5 9 5 3 13 11

CS 245 Notes 7 22

Factors that affect performance

(1) Tuples of relation stored physically together?

(2) Relations sorted by join attribute?

(3) Indexes exist?

CS 245 Notes 7 23

Example 1(a) Iteration Join R1 R

  • Relations not contiguous
  • Recall T(R1) = 10,000 T(R2) = 5, S(R1) = S(R2) =1/10 block MEM=101 blocks

CS 245 Notes 7 24

Example 1(a) Iteration Join R1 R

  • Relations not contiguous
  • Recall T(R1) = 10,000 T(R2) = 5, S(R1) = S(R2) =1/10 block MEM=101 blocks Cost: for each R1 tuple: [Read tuple + Read R2] Total =10,000 [1+5000]=50,010,000 IOs

CS 245 Notes 7 25

• Can we do better?

CS 245 Notes 7 26

• Can we do better?

Use our memory (1) Read 100 blocks of R (2) Read all of R2 (using 1 block) + join (3) Repeat until done

CS 245 Notes 7 27

Cost: for each R1 chunk: Read chunk: 1000 IOs Read R2: 5000 IOs 6000

CS 245 Notes 7 28

Cost: for each R1 chunk: Read chunk: 1000 IOs Read R2: 5000 IOs 6000

Total = 10,000 x 6000 = 60,000 IOs 1,

CS 245 Notes 7 29

• Can we do better?

CS 245 Notes 7 30

• Can we do better?

 Reverse join order: R2 R

Total = 5000 x (1000 + 10,000) = 1000 5 x 11,000 = 55,000 IOs

7

CS 245 Notes 7 37

(ii) Read all chunks + merge + write out

Sorted file Memory Sorted Chunks ... ...

CS 245 Notes 7 38

Cost: Sort Each tuple is read,written, read, written so... Sort cost R1: 4 x 1,000 = 4, Sort cost R2: 4 x 500 = 2,

CS 245 Notes 7 39

Example 1(d) Merge Join (continued)

R1,R2 contiguous, but unordered

Total cost = sort cost + join cost = 6,000 + 1,500 = 7,500 IOs

CS 245 Notes 7 40

Example 1(d) Merge Join (continued)

R1,R2 contiguous, but unordered

Total cost = sort cost + join cost = 6,000 + 1,500 = 7,500 IOs

But: Iteration cost = 5, so merge joint does not pay off!

CS 245 Notes 7 41

But say R1 = 10,000 blocks contiguous R2 = 5,000 blocks not ordered

Iterate: 5000 x (100+10,000) = 50 x 10, 100 = 505,000 IOs

Merge join: 5(10,000+5,000) = 75,000 IOs

Merge Join (with sort) WINS! CS 245 Notes 7 42

How much memory do we need for

merge sort?

E.g: Say I have 10 memory blocks 10

...

100 chunks  to merge, need R1 100 blocks!

8

CS 245 Notes 7 43

In general:

Say k blocks in memory x blocks for relation sort

chunks = (x/k) size of chunk = k

CS 245 Notes 7 44

In general:

Say k blocks in memory x blocks for relation sort

chunks = (x/k) size of chunk = k

chunks < buffers available for merge

CS 245 Notes 7 45

In general:

Say k blocks in memory x blocks for relation sort

chunks = (x/k) size of chunk = k

chunks < buffers available for merge

so... (x/k)  k or k^2  x or k  x

CS 245 Notes 7 46

In our example

R1 is 1000 blocks, k  31. R2 is 500 blocks, k  22.

Need at least 32 buffers

CS 245 Notes 7 47

Can we improve on merge join?

Hint: do we really need the fully sorted files? R

R

Join?

sorted runs

CS 245 Notes 7 48

Cost of improved merge join:

C = Read R1 + write R1 into runs

  • read R2 + write R2 into runs
  • join = 2000 + 1000 + 1500 = 4500

--> Memory requirement?

10

CS 245 Notes 7 55

Total cost (including probes)

= 500+5000 [Probe + get records] = 500+5000 [0.5+2] uniform assumption = 500+12,500 = 13,000 (case b)

CS 245 Notes 7 56

Total cost (including probes)

= 500+5000 [Probe + get records] = 500+5000 [0.5+2] uniform assumption = 500+12,500 = 13,000 (case b)

For case (c): = 500+5000[0.5  1 + (1/100)  1] = 500+2500+50 = 3050 IOs

CS 245 Notes 7 57

So far

Iterate R2 R1 55,000 (best) Merge Join _______ Sort+ Merge Join _______ R1.C Index _______ R2.C Index _______ Iterate R2 R1 5500 Merge join 1500 Sort+Merge Join 7500  4500 R1.C Index 5500  3050  550 contiguousR2.C Index ________

not contiguous

CS 245 Notes 7 58

  • R1, R2 contiguous (un-ordered)  Use 100 buckets  Read R1, hash, + write buckets

R1 

Example 1(f) Hash Join

... ... 10 blocks

100

CS 245 Notes 7 59

-> Same for R -> Read one R1 bucket; build memory hash table -> Read corresponding R2 bucket + hash probe

R

R

R ... ... memory

 Then repeat for all buckets CS 245 Notes 7 60

Cost:

“Bucketize:” Read R1 + write Read R2 + write Join: Read R1, R

Total cost = 3 x [1000+500] = 4500

11

CS 245 Notes 7 61

Cost:

“Bucketize:” Read R1 + write Read R2 + write Join: Read R1, R

Total cost = 3 x [1000+500] = 4500 Note: this is an approximation since buckets will vary in size and we have to round up to blocks CS 245 Notes 7 62

Minimum memory requirements:

Size of R1 bucket = (x/k) k = number of memory buffers x = number of R1 blocks So... (x/k) < k

k > x need: k+1 total memory buffers

CS 245 Notes 7 63

Trick: keep some buckets in memory

E.g., k’=33 R1 buckets = 31 blocks keep 2 in memory memory G 0 G 1

in

...

31

33-2=

R

called hybrid hash-join CS 245 Notes 7 64

Trick: keep some buckets in memory

E.g., k’=33 R1 buckets = 31 blocks keep 2 in memory memory G 0 G 1

in

...

31

33-2=

R

Memory use: G0 31 buffers G1 31 buffers Output 33-2 buffers R1 input 1 Total 94 buffers 6 buffers to spare!!

called hybrid hash-join

CS 245 Notes 7 65

Next: Bucketize R

  • R2 buckets =500/33= 16 blocks
  • Two of the R2 buckets joined immediately with G0,G memory G 0 G 1

in

...

16

33-2=

R

...

31

33-2=

R2 buckets (^) R1 buckets

CS 245 Notes 7 66

Finally: Join remaining buckets

  • for each bucket pair:
    • read one of the buckets into memory
    • join with second bucket memory

Gi

out

...

16

33-2=

ans

...

31

33-2=

one full R2bucket R2 buckets (^) R1 buckets

one R1buffer

13

CS 245 Notes 7 73

So far:

Iterate 5500 Merge join 1500 Sort+merge joint 7500 R1.C index 5500  550 R2.C index _____ Build R.C index _____ Build S.C index _____ Hash join 4500+ with trick,R1 first 4414 with trick,R2 first _____ Hash join, pointers 1600

contiguous

CS 245 Notes 7 74

Summary

  • Iteration ok for “small” relations (relative to memory size)
  • For equi-join, where relations not sorted and no indexes exist, hash join usually best

CS 245 Notes 7 75

  • Sort + merge join good for non-equi-join (e.g., R1.C > R2.C)
  • If relations already sorted, use merge join
  • If index exists, it could be useful (depends on expected result size)

CS 245 Notes 7 76

Join strategies for parallel processors

Later on….

CS 245 Notes 7 77

Chapter 16 [16] summary

  • Relational algebra level
  • Detailed query plan level
    • Estimate costs
    • Generate plans
      • Join algorithms
    • Compare costs