Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Query Optimization: Cost Estimation and Histograms in Advanced Database Systems, Slides of Database Management Systems (DBMS)

Duke University Database Management Systems (DBMS)

A part of the advanced database systems course (cps 216) and covers query optimization techniques, specifically cost estimation and the use of histograms. Topics include the estimation of costs for different operators, the importance of statistics, and the use of histograms for range queries. The document also discusses the construction and maintenance of histograms, as well as the use of equi-width and equi-height histograms.

Typology: Slides

2011/2012

Uploaded on 01/29/2012

arold 🇺🇸

4.7

(24)

372 documents

1 / 9

This page cannot be seen from the preview

Don't miss anything!

Query Optimization

Part II

CPS 216

Advanced Database Systems

Announcements (April 19)

Homework #4 (last one; short) will be assigned this

Thursday

Homework #3 graded; grades posted

Project demo period April 28 – May 1

Please email me to sign up for a 30-minute slot

Final exam on May 2 (Monday 2-5pm)

Review of the bigger picture

Query optimization

Consider a space of possible plans (April 7)

Rewrite logical plan to combine “blocks” as much as

possible

Each block will then be optimized separately

Fewer blocks →larger plan space

Estimate costs of plans in the search space (today)

Search through the space for the “best” plan (next

lecture)

Discover Slides of Database Management Systems (DBMS) Duke University

Partial preview of the text

Download Query Optimization: Cost Estimation and Histograms in Advanced Database Systems and more Slides Database Management Systems (DBMS) in PDF only on Docsity!

Query Optimization

Part II

CPS 216

Advanced Database Systems

Announcements (April 19)

Homework #4 (last one; short) will be assigned this

Thursday

Homework #3 graded; grades posted

Project demo period April 28 – May 1

Please email me to sign up for a 30-minute slot

Final exam on May 2 (Monday 2-5pm)

Review of the bigger picture

Query optimization

Consider a space of possible plans (April 7)

Rewrite logical plan to combine “blocks” as much as possible Each block will then be optimized separately Fewer blocks → larger plan space

Estimate costs of plans in the search space (today)

Search through the space for the “best” plan (next

lecture)

Cost estimation

We have: cost estimation for each operator

Example: SORT( CID ) takes 2 × B ( input )

But what is B ( input )?

We need: size of intermediate results

PROJECT ( title ) MERGE-JOIN ( CID ) SORT ( CID )SCAN (Course) MERGE-JOIN ( SID )

SCAN ( Enroll )

SORT ( SID ) SCAN ( Student )

FILTER ( name = “Bart”)

Physical plan example:

Input to SORT( CID ):

Simple statistics

Suppose DBMS collects the following statistics for

each table R

Size of R : | R | For each column A in R , the number of distinct A values: |π A R | Assumption: R. A values are uniformly distributed over π A R (i.e., all values have the same count in R )

) Statistics are traditionally re-computed periodically;

accurate statistics are not required for estimation

Selections with equality predicates

Q : σ A = v R

Additional assumption: v does appear in R

| Q | ≈ d | R | ⁄ |π A R | e

1 ⁄ |π A R | is the selectivity factor of predicate ( A = v ) )This predicate reduces the size of input table by the selectivity factor

Two-way equi-join

Q : R ( A , B ) S ( B , C )

Additional assumption: containment of value sets

Every row in the “smaller” table (one with fewer distinct values for the join column) joins with some row in the other table That is, if |π B R | · |π B S | then π B R ⊆ π B S Certainly not true in general

| Q | ≈ d | R | · | S | ⁄ max(|π B R |, |π B S |) e

Selectivity factor of R. B = S. B is 1 ⁄ max(|π B R |, |π B S |)

Multi-table equi-join

Q : R ( A , B ) S ( B , C ) T ( C , D )

What is the number of distinct C values in the join

of R and S?

Additional assumption: preservation of value sets

A non-join attribute does not lose values from its set of possible values That is, if A is in R but not S , then π A ( R S ) = π A R Certainly not true in general

Multi-table equi-join (cont’d)

Q : R ( A , B ) S ( B , C ) T ( C , D )

Start with the product of relation sizes

| R | · | S | · | T |

Reduce the total size by the selectivity factor of each

join predicate

R. B = S. B : 1 ⁄ max(|π B R |, |π B S |) S. C = T. C : 1 ⁄ max(|π C S |, |π C T |) | Q | ≈ d (| R | · | S | · | T |) ⁄ (max(|π B R |, |π B S |) · max(|π C S |, |π C T |)) e

Recap: cost estimation with simple stats

Using similar ideas, we can estimate the size of projection, duplicate elimination, union, difference, aggregation (with grouping) Lots of assumptions and very rough estimation Accurate estimate is not needed Maybe okay if we overestimate or underestimate consistently May lead to very nasty optimizer “hints” SELECT * FROM Student WHERE GPA > 3.9; SELECT * FROM Student WHERE GPA > 3.9 AND GPA > 3.9; Next: better estimation using more information (histograms)

Histograms

Motivation

| R |, |π A R |, high( R. A ), low( R. A )

Too little information Actual distribution of R. A : ( v 1 , f 1 ), ( v 2 , f 2 ), …, ( vn , f (^) n )
fi is frequency of vi , or the number of times vi appears as R. A
Too much information

Anything in between?

Partition the domain of R. A into buckets Store a small summary of the distribution within each bucket Number of buckets is the “knob” that controls the resolution

Equi-width histogram

Divide the domain into B buckets of equal width

Store the bucket boundaries and the sum of

frequencies of the values within each bucket

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

1 2 2 0 1

(^43)

(^8 8 ) (^6 ) (^533 )

Construction and maintenance

Construction

Sort all R. A values, and then take equally spaced splits

Example: 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 … Sampling also works

Maintenance

Incremental maintenance

Merge adjacent buckets with small counts
Split any bucket with a large count
- Select the median value to split
- Need a sample of the values within this bucket to work well Periodic recomputation also works

Using an equi-height histogram

Q : σ A = 5 R

5 is in bucket [1, 7] (16) Assume uniform distribution within the bucket | Q | ≈ 16 ⁄ 7 ≈ 2 (| Q | = 1, actually)

Q : σ A ≥ 7 and A · 16 R

[7, 16] covers [8, 9], [10, 11], [12, 16] (all with 16) [7, 16] partially covers [1, 7] (16) | Q | ≈ 16 ⁄ 7 + 16 + 16 + 16 ≈ 50 (| Q | = 52, actually)

Join similar to equi-width histogram

Histogram tricks

Store the number of distinct values in each bucket To remove the effects of the values with 0 frequency

These values tend to cause underestimation Assume uniform spread (the difference between this value and the next value with non-zero frequency) Compressed histogram Store ( vi , f (^) i ) pairs explicitly if fi is high For other values, use an equi-width or equi-height histogram Self-tuning Analyze feedback from query execution engine to refine histograms Aboulnaga and Chaudhuri, SIGMOD 1999

More histograms

) More in Poosala et al., SIGMOD 1996

V-optimal( V , F ) histogram Avoid putting very different frequencies into the same bucket Partition in a way to minimize ∑ i VAR (^) i overall, where VARi is the frequency variance within bucket i MaxDiff( V , A ) histogram Define area to be the product of the frequency of a value and its spread Insert bucket boundaries where two adjacent areas differ by large amounts A bit easier to construct than V-optimal; comparable performance

Wavelets

Mathematical tool for hierarchical decomposition of functions and signals Haar wavelets: recursive pair-wise averaging and differencing at different resolutions Simplest wavelet basis, easy to implement Resolution Averages Detail coefficients 3 [2, 2, 0, 2, 3, 5, 4, 4] 2 [2, 1, 4, 4] [0, –1, –1, 0] 1 [1.5, 4] [0.5, 0] 0 [2.75] [–1.25] Haar wavelet decomposition: [2.75, –1.25, 0.5, 0, 0, –1, –1, 0]

Haar wavelet coefficients

Hierarchical decomposition structure

–1.

0.5 0

0 –1 –1 0

+ –

+ + + +

**- –

– – –**

2 2 0 2 3 5 4 4 Original data

Query Optimization: Cost Estimation and Histograms in Advanced Database Systems, Slides of Database Management Systems (DBMS)

Related documents

Partial preview of the text

Download Query Optimization: Cost Estimation and Histograms in Advanced Database Systems and more Slides Database Management Systems (DBMS) in PDF only on Docsity!

Query Optimization

Part II

CPS 216

Advanced Database Systems

Announcements (April 19)

 Homework #4 (last one; short) will be assigned this

Thursday

 Homework #3 graded; grades posted

 Project demo period April 28 – May 1

 Final exam on May 2 (Monday 2-5pm)

Review of the bigger picture

Query optimization

 Consider a space of possible plans (April 7)

 Estimate costs of plans in the search space (today)

 Search through the space for the “best” plan (next

lecture)

Cost estimation

 We have: cost estimation for each operator

 We need: size of intermediate results

Physical plan example:

Simple statistics

 Suppose DBMS collects the following statistics for

each table R

) Statistics are traditionally re-computed periodically;

accurate statistics are not required for estimation

Selections with equality predicates

 Q : σ A = v R

 Additional assumption: v does appear in R

 | Q | ≈ d | R | ⁄ |π A R | e

Two-way equi-join

 Q : R ( A , B ) S ( B , C )

 Additional assumption: containment of value sets

 | Q | ≈ d | R | · | S | ⁄ max(|π B R |, |π B S |) e

Multi-table equi-join

 Q : R ( A , B ) S ( B , C ) T ( C , D )

 What is the number of distinct C values in the join

of R and S?

 Additional assumption: preservation of value sets

Multi-table equi-join (cont’d)

 Q : R ( A , B ) S ( B , C ) T ( C , D )

 Start with the product of relation sizes

 | R | · | S | · | T |

 Reduce the total size by the selectivity factor of each

join predicate

Recap: cost estimation with simple stats

Histograms

 Motivation

 Anything in between?

Equi-width histogram

 Divide the domain into B buckets of equal width

 Store the bucket boundaries and the sum of

frequencies of the values within each bucket

Construction and maintenance

 Construction

 Maintenance

Using an equi-height histogram

 Q : σ A = 5 R

 Q : σ A ≥ 7 and A · 16 R

 Join similar to equi-width histogram

Histogram tricks

More histograms

Wavelets

Haar wavelet coefficients

 Hierarchical decomposition structure

Homework #4 (last one; short) will be assigned this

Homework #3 graded; grades posted

Project demo period April 28 – May 1

Final exam on May 2 (Monday 2-5pm)

Consider a space of possible plans (April 7)

Estimate costs of plans in the search space (today)

Search through the space for the “best” plan (next

We have: cost estimation for each operator

We need: size of intermediate results

Suppose DBMS collects the following statistics for

Q : σ A = v R

Additional assumption: v does appear in R

| Q | ≈ d | R | ⁄ |π A R | e

Q : R ( A , B ) S ( B , C )

Additional assumption: containment of value sets

| Q | ≈ d | R | · | S | ⁄ max(|π B R |, |π B S |) e

Q : R ( A , B ) S ( B , C ) T ( C , D )

What is the number of distinct C values in the join

Additional assumption: preservation of value sets

Q : R ( A , B ) S ( B , C ) T ( C , D )

Start with the product of relation sizes

| R | · | S | · | T |

Reduce the total size by the selectivity factor of each

Motivation

Anything in between?

Divide the domain into B buckets of equal width

Store the bucket boundaries and the sum of

Construction

Maintenance

Q : σ A = 5 R

Q : σ A ≥ 7 and A · 16 R

Join similar to equi-width histogram

Hierarchical decomposition structure