Download Database Query Optimization: Techniques and Algorithms - Prof. Amol V. Deshpande and more Study notes Principles of Database Management in PDF only on Docsity!
CMSC424: Database
Design
Instructor: Amol Deshpande
Data Models
Conceptual representa1on of the data
Data Retrieval
How to ask ques1ons of the database
How to answer those ques1ons
Data Storage
How/where to store data, how to access it
Data Integrity
Manage crashes, concurrency
Manage seman1c inconsistencies
Databases
Query Optimization
Introduction
Example of a Simple Type of Query
Transformation of Relational Expressions
Statistics Estimation
Optimization Algorithms
Query Optimization
Why?
Many different ways of executing a given query
Huge differences in cost
Example:
select * from person where ssn = “123”
Size of person = 1GB
Sequential Scan:
Takes 1GB / (20MB/s) = 50s
Use an index on SSN (assuming one exists):
Approx 4 Random I/Os = 40ms
Query Optimization
Execution plans
Evaluation expressions annotated with the methods used
Query Optimization
Steps:
Generate all possible execution plans for the query
Figure out the cost for each of them
Choose the best
Not done exactly as listed above
Too many different execution plans for that
Typically interleave all of these into a single efficient search
algorithm
Query Optimization
Steps:
Generate all possible execution plans for the query
First generate all equivalent expressions
Then consider all annotations for the operations
Figure out the cost for each of them
Compute cost for each operation
Using the formulas discussed before One problem: How do we know the number of result tuples for, say,
Add them!
Choose the best
Query Optimization
Introduction
Example of a Simple Type of Query
Transformation of Relational Expressions
Statistics Estimation
Optimization Algorithms
A Simple Case
Compute cost of each possibility
Say, substr() zipcode date-of-birth
Need some more information
selectivity: fraction of tuples expected to pass the predicates
Let selectivity(substr predicate) = 3/
Let selectivity(zipcode predicate) = 1/
And, selectivity(date-of-birth predicate) = 1/
How are selectivities computed?
Must keep track of some additional information about the
relations
A Simple Case
Compute cost of each possibility
Say, substr() zipcode date-of-birth
Given that:
Cost of the above plan =
1,000,000 * 100ns + 1,000,000 * 3/26 * 1ns + 1,000,000 * 1/26 * 1/100 * 1000ns = approx 100.5 ms
Cost of the plan: zipcode substr() date-of-birth:
Approx 12.92 ms About a factor of 10 better.
A Simple Case
Compute cost of each possibility
Say, substr() zipcode date-of-birth
Cost of the plan: zipcode substr() date-of-birth:
Approx 12.92 ms About a factor of 10 better.
General algorithm:
Don’t need to check all n! Possibilities
Sort the predicates in the decreasing order by rank:
1 – selectivity(predicate)
cost of the predicate
Query Optimization
General case:
Need:
A way to enumerate all plans
A way to find the cost of each plan
Sub problem: Estimating the selectivities of various operations
A way to search through the plans efficiently
Equivalence Rules
Examples:
E 1 θ E 2 = E 2 θ E 1
7(a). If θ 0 only involves attributes from E 1
σθ 0 (E 1 θ E 2 ) = (σθ 0 (E 1 )) θ E 2
And so on…
Many rules of this type
Pictorial Depiction
Example
Find the names of all customers with an account at a Brooklyn branch
whose account balance is over $1000.
Π customer_name (σ branch_city = “ Brooklyn” ∧ balance > 1000
( branch (account depositor )))
Apply the rules one by one
Π customer_name ((σ branch_city = “ Brooklyn” ∧ balance > 1000
( branch account )) depositor )
Π customer_name (((σ branch_city = “ Brooklyn” (branch)) ( σ balance > 1000
(account) )) depositor)
Example
Evaluation Plans
We still need to choose the join methods etc..
Option 1: Choose for each operation separately
Usually okay, but sometimes the operators interact
Consider joining three relations on the same attribute:
R1 a (R2 a R3)
Best option for R2 join R3 might be hash-join
But if R1 is sorted on a, then sort-merge join is preferable Because it produces the result in sorted order by a Also, we need to decide whether to use pipelining or materialization Such issues are typically taken into account when doing the optimization
Query Optimization
Introduction
Example of a Simple Type of Query
Transformation of Relational Expressions
Optimization Algorithms
Statistics Estimation
Optimization Algorithms
Two types:
Exhaustive: That attempt to find the best plan
Heuristical: That are simpler, but are not guaranteed to find
the optimal plan
Consider a simple case
Join of the relations R1, …, Rn
No selections, no projections
Still very large plan space
Searching for the best plan
Option 1:
Enumerate all equivalent expressions for the original query
expression
Using the rules outlined earlier
Estimate cost for each and choose the lowest
Too expensive!
Consider finding the best join-order for r 1 r 2... rn.
There are (2( n – 1))!/( n – 1)! different join orders for above
expression. With n = 7, the number is 665280, with n = 10,
the number is greater than 176 billion!
R1 ⨝ R cost: 100 plan: HJ R1 ⨝ R cost: 300 plan: SMJ R1 ⨝ R …. R4 ⨝ R cost: 300 plan: HJ R1 ⨝ R2 ⨝ R cost: 400 plan: SMJ(R1R2, R3) …. …. R1 ⨝ R2 ⨝ R3 ⨝ R4 ⨝ R cost: 1200 plan: HJ(R1R2R3, R4R5) R1 ⨝ R2 ⨝ R3 ⨝ R cost: 700 plan: HJ(R1R2R3, R4) …. R1 R2 R3 R4 R ⨝ ⨝ ⨝ ⨝ R3 R4 R R1 R HJ HJ HJ SMJ
Left Deep Join Trees
In left-deep join trees, the right-hand-side input for each join
is a relation, not the result of an intermediate join
Early systems only considered these types of plans
Easier to pipeline
Heuristic Optimization
Dynamic programming is expensive
Use heuristics to reduce the number of choices
Typically rule-based:
Perform selection early (reduces the number of tuples)
Perform projection early (reduces the number of attributes)
Perform most restrictive selection and join operations before other
similar operations.
Some systems use only heuristics, others combine heuristics
with partial cost-based optimization. Steps in Typical Heuristic Optimization
1. Deconstruct conjunctive selections into a sequence of single
selection operations (Equiv. rule 1.).
2. Move selection operations down the query tree for the
earliest possible execution (Equiv. rules 2, 7a, 7b, 11).
3. Execute first those selection and join operations that will
produce the smallest relations (Equiv. rule 6).
4. Replace Cartesian product operations that are followed by a
selection condition by join operations (Equiv. rule 4a).
5. Deconstruct and move as far down the tree as possible lists
of projection attributes, creating new projections where
needed (Equiv. rules 3, 8a, 8b, 12).
6. Identify those subtrees whose operations can be pipelined,
and execute them using pipelining).
Cost estimation
Some information is static and is maintained in the metadata
Primary key?
Sorted or not, which attribute
So we can decide whether need to sort again
How many tuples in the relation, how many blocks?
RAID ?? Which one?
Read/write costs are quite different
Typically kept in some tables in the database
“all_tab_columns” in Oracle
Most systems have commands for updating them
Cost estimation
However, others need to be estimated somehow
How many tuples match a predicate like “age > 40”? E.g. Need to know how many index pages need to be read
Intermediate result sizes
The problem variously called:
“intermediate result size estimation” “selectivity estimation”
Very important to estimate reasonably well
e.g. consider “select * from R where zipcode = 20742” We estimate that there are 10 matches, and choose to use a secondary index (remember: random I/Os) Turns out there are 10000 matches Using a secondary index very bad idea Optimizer also often choose Nested-loop joins if one relation very small… underestimation can result in very bad
Selectivity Estimation
Basic idea:
Maintain some information about the tables
More information more accurate estimation More information higher storage cost, higher update cost
Make uniformity and randomness assumptions to fill in the gaps
Example:
For a relation “people”, we keep:
Total number of tuples = 100, Distinct “zipcode” values that appear in it = 100
Given a query: “zipcode = 20742”
We estimated the number of matching tuples as: 100,000/100 = 1000
What if I wanted more accurate information?
Keep histograms…
Histograms
A condensed, approximate version of the “frequency distribution”
Divide the range of the attribute value in “buckets” For each bucket, keep the total count Assume uniformity within a bucket 20000- 20200- 20400- 20600- 20800- 20199 20399 20599 20799 20999 50, 40, 30, 20, 10,