Query Optimization: Cost Estimation and Histograms in Advanced Database Systems, Slides of Database Management Systems (DBMS)

A part of the advanced database systems course (cps 216) and covers query optimization techniques, specifically cost estimation and the use of histograms. Topics include the estimation of costs for different operators, the importance of statistics, and the use of histograms for range queries. The document also discusses the construction and maintenance of histograms, as well as the use of equi-width and equi-height histograms.

Typology: Slides

2011/2012

Uploaded on 01/29/2012

arold
arold 🇺🇸

4.7

(24)

372 documents

1 / 9

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Query Optimization
Part II
CPS 216
Advanced Database Systems
2
Announcements (April 19)
Homework #4 (last one; short) will be assigned this
Thursday
Homework #3 graded; grades posted
Project demo period April 28 – May 1
Please email me to sign up for a 30-minute slot
Final exam on May 2 (Monday 2-5pm)
3
Review of the bigger picture
Query optimization
Consider a space of possible plans (April 7)
Rewrite logical plan to combine “blocks” as much as
possible
Each block will then be optimized separately
Fewer blocks larger plan space
Estimate costs of plans in the search space (today)
Search through the space for the “best” plan (next
lecture)
pf3
pf4
pf5
pf8
pf9

Partial preview of the text

Download Query Optimization: Cost Estimation and Histograms in Advanced Database Systems and more Slides Database Management Systems (DBMS) in PDF only on Docsity!

Query Optimization

Part II

CPS 216

Advanced Database Systems

2

Announcements (April 19)

™ Homework #4 (last one; short) will be assigned this

Thursday

™ Homework #3 graded; grades posted

™ Project demo period April 28 – May 1

ƒ Please email me to sign up for a 30-minute slot

™ Final exam on May 2 (Monday 2-5pm)

3

Review of the bigger picture

Query optimization

™ Consider a space of possible plans (April 7)

ƒ Rewrite logical plan to combine “blocks” as much as possible ƒ Each block will then be optimized separately ƒ Fewer blocks → larger plan space

™ Estimate costs of plans in the search space (today)

™ Search through the space for the “best” plan (next

lecture)

Cost estimation

™ We have: cost estimation for each operator

ƒ Example: SORT( CID ) takes 2 × B ( input )

  • But what is B ( input )?

™ We need: size of intermediate results

PROJECT ( title ) MERGE-JOIN ( CID ) SORT ( CID )SCAN (Course) MERGE-JOIN ( SID )

SCAN ( Enroll )

SORT ( SID ) SCAN ( Student )

FILTER ( name = “Bart”)

Physical plan example:

Input to SORT( CID ):

5

Simple statistics

™ Suppose DBMS collects the following statistics for

each table R

ƒ Size of R : | R | ƒ For each column A in R , the number of distinct A values: |π A R | ƒ Assumption: R. A values are uniformly distributed over π A R (i.e., all values have the same count in R )

) Statistics are traditionally re-computed periodically;

accurate statistics are not required for estimation

6

Selections with equality predicates

™ Q : σ A = v R

™ Additional assumption: v does appear in R

™ | Q | ≈ d | R | ⁄ |π A R | e

ƒ 1 ⁄ |π A R | is the selectivity factor of predicate ( A = v ) )This predicate reduces the size of input table by the selectivity factor

Two-way equi-join

™ Q : R ( A , B )  S ( B , C )

™ Additional assumption: containment of value sets

ƒ Every row in the “smaller” table (one with fewer distinct values for the join column) joins with some row in the other table ƒ That is, if |π B R | · |π B S | then π B R ⊆ π B S ƒ Certainly not true in general

™ | Q | ≈ d | R | · | S | ⁄ max(|π B R |, |π B S |) e

ƒ Selectivity factor of R. B = S. B is 1 ⁄ max(|π B R |, |π B S |)

11

Multi-table equi-join

™ Q : R ( A , B )  S ( B , C )  T ( C , D )

™ What is the number of distinct C values in the join

of R and S?

™ Additional assumption: preservation of value sets

ƒ A non-join attribute does not lose values from its set of possible values ƒ That is, if A is in R but not S , then π A ( R  S ) = π A R ƒ Certainly not true in general

12

Multi-table equi-join (cont’d)

™ Q : R ( A , B )  S ( B , C )  T ( C , D )

™ Start with the product of relation sizes

ƒ | R | · | S | · | T |

™ Reduce the total size by the selectivity factor of each

join predicate

ƒ R. B = S. B : 1 ⁄ max(|π B R |, |π B S |) ƒ S. C = T. C : 1 ⁄ max(|π C S |, |π C T |) ƒ | Q | ≈ d (| R | · | S | · | T |) ⁄ (max(|π B R |, |π B S |) · max(|π C S |, |π C T |)) e

Recap: cost estimation with simple stats

™ Using similar ideas, we can estimate the size of projection, duplicate elimination, union, difference, aggregation (with grouping) ™ Lots of assumptions and very rough estimation ƒ Accurate estimate is not needed ƒ Maybe okay if we overestimate or underestimate consistently ƒ May lead to very nasty optimizer “hints” SELECT * FROM Student WHERE GPA > 3.9; SELECT * FROM Student WHERE GPA > 3.9 AND GPA > 3.9; ™ Next: better estimation using more information (histograms)

14

Histograms

™ Motivation

ƒ | R |, |π A R |, high( R. A ), low( R. A )

  • Too little information ƒ Actual distribution of R. A : ( v 1 , f 1 ), ( v 2 , f 2 ), …, ( vn , f (^) n )
  • fi is frequency of vi , or the number of times vi appears as R. A
  • Too much information

™ Anything in between?

ƒ Partition the domain of R. A into buckets ƒ Store a small summary of the distribution within each bucket ƒ Number of buckets is the “knob” that controls the resolution

15

Equi-width histogram

™ Divide the domain into B buckets of equal width

™ Store the bucket boundaries and the sum of

frequencies of the values within each bucket

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

1 2 2 0 1

(^43)

(^8 8 ) (^6 ) (^533 )

19

27

13

Construction and maintenance

™ Construction

ƒ Sort all R. A values, and then take equally spaced splits

  • Example: 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 … ƒ Sampling also works

™ Maintenance

ƒ Incremental maintenance

  • Merge adjacent buckets with small counts
  • Split any bucket with a large count
    • Select the median value to split
    • Need a sample of the values within this bucket to work well ƒ Periodic recomputation also works

20

Using an equi-height histogram

™ Q : σ A = 5 R

ƒ 5 is in bucket [1, 7] (16) ƒ Assume uniform distribution within the bucket ƒ | Q | ≈ 16 ⁄ 7 ≈ 2 (| Q | = 1, actually)

™ Q : σ A ≥ 7 and A · 16 R

ƒ [7, 16] covers [8, 9], [10, 11], [12, 16] (all with 16) ƒ [7, 16] partially covers [1, 7] (16) ƒ | Q | ≈ 16 ⁄ 7 + 16 + 16 + 16 ≈ 50 (| Q | = 52, actually)

™ Join similar to equi-width histogram

21

Histogram tricks

™ Store the number of distinct values in each bucket ƒ To remove the effects of the values with 0 frequency

  • These values tend to cause underestimation ƒ Assume uniform spread (the difference between this value and the next value with non-zero frequency) ™ Compressed histogram ƒ Store ( vi , f (^) i ) pairs explicitly if fi is high ƒ For other values, use an equi-width or equi-height histogram ™ Self-tuning ƒ Analyze feedback from query execution engine to refine histograms ƒ Aboulnaga and Chaudhuri, SIGMOD 1999

More histograms

) More in Poosala et al., SIGMOD 1996

™ V-optimal( V , F ) histogram ƒ Avoid putting very different frequencies into the same bucket ƒ Partition in a way to minimize ∑ i VAR (^) i overall, where VARi is the frequency variance within bucket i ™ MaxDiff( V , A ) histogram ƒ Define area to be the product of the frequency of a value and its spread ƒ Insert bucket boundaries where two adjacent areas differ by large amounts ƒ A bit easier to construct than V-optimal; comparable performance

23

Wavelets

™ Mathematical tool for hierarchical decomposition of functions and signals ™ Haar wavelets: recursive pair-wise averaging and differencing at different resolutions ƒ Simplest wavelet basis, easy to implement Resolution Averages Detail coefficients 3 [2, 2, 0, 2, 3, 5, 4, 4] 2 [2, 1, 4, 4] [0, –1, –1, 0] 1 [1.5, 4] [0.5, 0] 0 [2.75] [–1.25] Haar wavelet decomposition: [2.75, –1.25, 0.5, 0, 0, –1, –1, 0]

24

Haar wavelet coefficients

™ Hierarchical decomposition structure

–1.

0.5 0

0 –1 –1 0

+ –

+

+ + + +

**- –

  • – – –**

+

2 2 0 2 3 5 4 4 Original data