Download Query Optimization: From Logical Plan to Physical Execution and more Slides Database Management Systems (DBMS) in PDF only on Docsity!
Query Optimization
Part I
CPS 216
Advanced Database Systems
2
A query’s trip through the DBMS
Parser
Validator
Query optimizer
Query execution engine
Result
SQL query SELECT title, SID
FROM Enroll, Course WHERE Enroll.CID = Course.CID;
Parse tree
h SFW i h select-list i h from-list i
h where-list i
h table i h table i
h Query i
AND
Enroll Course
Logical plan
π title , SID
σ Enroll. CID = Course. CID
×
Enroll Course
Physical plan
PROJECT ( title , SID )
MERGE-JOIN ( CID )
SCAN ( Enroll )
SCAN ( Course )
SORT ( CID )
3
Parsing & validation
Parser: SQL → parse tree
Good old lex & yacc
Detect and reject syntax errors
Validator: parse tree → logical plan
Detect and reject semantic errors
- Nonexistent tables/views/columns
- Type mismatches (e.g., AVG(name), name + GPA, Student
UNION Enroll)
Wildcard (SELECT *) and view expansion
Use information stored in system catalog tables (contains
all metadata/schema information)
Logical plan
A tree whose nodes are logical operators
Often a tree of relational algebra operators
DB2 uses QGM (Query Graph Model)
There are many equivalent logical plans
π title
σ Student. name =“Bart” ∧ Student. SID = Enroll. SID ∧ Enroll. CID = Course.CID
×
Enroll
× Course
Student
An equivalent plan:
π title
Enroll. CID = Course. CID
Enroll
Course
Student
Student. SID = Enroll. SID
σ name = “Bart”
SELECT Course.title FROM Student, Enroll, Course WHERE Student.name = ‘Bart’ AND Student.SID = Enroll.SID AND Enroll.CID = Course.CID;
5
Query optimization and execution
Recall that a physical plan tells the DBMS query execution
engine how to execute the query
One logical plan can have many possible physical plans (with
equivalent results, but different costs and assumptions)
Query optimizer: one logical plan → “best” physical plan
Query execution engine: physical plan → results
PROJECT ( title )
INDEX-NESTED-LOOP-JOIN ( CID )
Index on Enroll( SID )
Index on Course ( CID )
Index on Student ( name )
INDEX-SCAN ( name = “Bart”)
INDEX-NESTED-LOOP-JOIN ( SID )
PROJECT ( title )
MERGE-JOIN ( CID )
SORT ( CID ) SCAN (Course) MERGE-JOIN ( SID )
SCAN ( Enroll )
SORT ( SID )
SCAN ( Student )
FILTER ( name = “Bart”)
6
Query optimization
Conceptually
Consider a space of possible plans (next)
Estimate costs of plans in the search space (next Tuesday)
Search through the space for the “best” plan (next Thursday)
Often the goal is not picking the absolute optimum, but
instead avoiding the horrible ones
1 second 1 minute 1 hour
Any of these will do
Heuristics-based query optimization
Start with a logical plan
Push selections/projections down as much as possible
Why?
Why not?
Join smaller relations first, and avoid cross product
Why?
Why not?
Convert the transformed logical plan to a physical
plan (by choosing appropriate physical operators)
11
SQL query rewrite
More complicated—subqueries and views divide a
query into nested “blocks”
Processing each block separately forces particular join
methods and join order
Even if the plan is optimal for each block, it may not be
optimal for the entire query
Unnest query: convert subqueries/views to joins
) Then we just deal with select-project-join queries
Where the clean rules of relational algebra apply
12
DB2’s QGM
Leung et al. “Query Rewrite Optimization Rules in IBM DB2 Universal
Database.”
Query Graph Model: DB2’s logical plan language
More high-level than relational algebra
A graph of boxes
Leaf boxes are tables
The standard box is the SELECT box (actually a select-
project-join query block with optional duplicate
elimination)
Other types include GROUPBY (aggregation), UNION,
INTERSECT, EXCEPT
Can always add new types (e.g., OUTERJOIN)
More on QGM boxes
Head: declarative description of the output
Schema: list of output columns
Property: Are output tuples DISTINCT?
Body: how to compute the output
Quantifiers: tuple variables that range over other boxes
- F: regular tuple variable, e.g., FROM R AS r
- E: existential quantifier, e.g., IN ( subquery ), or = ANY( subquery )
- A: universal quantifier, e.g., > ALL( subquery )
- S: scalar subquery, e.g., = ( subquery )
Quantifiers are connected a hypergraph
- Hyperedges are predicates
Enforce DISTINCT, preserve duplicates, or permit duplicates?
- For the output of this box, and for each quantifier
14
QGM example
SELECT DISTINCT q1.partno, q1.descr, q2.suppno FROM inventory q1, quotations q WHERE q1.partno = q2.partno AND q1.descr = ‘engine’ AND q2.price <= ALL (SELECT q3.price FROM quotations q WHERE q2.partno = q3.partno);
15
Query rewrite in DB
Goal: make the logical plan as general as possible,
i.e., merge boxes
Rule-based transformations on QGM
Merge subqueries in FROM
Convert E to F (e.g., IN/ANY subqueries to joins)
Convert intersection to join
Convert S to F (i.e., scalar subqueries to joins)
Convert outerjoin to join
Magic (i.e., correlated subqueries to joins)
Another E to F trick
Sometimes an ANY subquery can be turned into an
aggregate subquery without ANY, to improve performance
further
SELECT * FROM Student s
WHERE GPA > ANY
(SELECT GPA FROM Student s
WHERE s2.name = ‘Bart’);
SELECT * FROM Student s
WHERE GPA >
(SELECT MIN(GPA) FROM Student s
WHERE s2.name = ‘Bart’);
20
Does the same trick apply to ALL?
SELECT * FROM Student s
WHERE GPA > ALL
(SELECT GPA FROM Student s
WHERE s2.name = ‘Bart’);
SELECT * FROM Student s
WHERE GPA >
(SELECT MAX(GPA) FROM Student s
WHERE s2.name = ‘Bart’);
21
Correlated subqueries
SELECT CID FROM Course
WHERE title LIKE ’CPS%’
AND min_enroll >
(SELECT COUNT(*) FROM Enroll
WHERE Enroll.CID = Course.CID);
Executing correlated subquery is expensive
The subquery is evaluated once for every CPS course
) Decorrelate!
COUNT bug
SELECT CID FROM Course
WHERE title LIKE ’CPS%’
AND min_enroll > (SELECT COUNT(*) FROM Enroll
WHERE Enroll.CID = Course.CID);
SELECT CID
FROM Course, (SELECT CID, COUNT(*) AS cnt
FROM Enroll GROUP BY CID) t
WHERE t.CID = Course.CID AND min_enroll > t.cnt
AND title LIKE ’CPS%’;
First compute the enrollment for all(?) courses
23
Magic decorrelation
Simple idea
Process the outer query using other predicates
- To collect bindings for correlated variables in the subquery
Evaluate the subquery using the bindings collected
- It is a join
- Once for the entire set of bindings
- Compared to once per binding in the naïve approach
Use the result of the subquery to refine the outer query
Name “magic” comes from a technique in recursive
processing of Datalog queries
24
Magic decorrelation example
SELECT CID FROM Course WHERE title LIKE ’CPS%’ AND min_enroll > (SELECT COUNT(*) FROM Enroll WHERE Enroll.CID = Course.CID); CREATE VIEW Supp_Course AS SELECT * FROM Course WHERE title LIKE ’CPS%’;
CREATE VIEW Magic AS SELECT DISTINCT CID FROM Supp_Course;
CREATE VIEW DS AS (SELECT Enroll.CID, COUNT(*) AS cnt FROM Magic, Enroll WHERE Magic.CID = Enroll.CID GROUP BY Enroll.CID) UNION (SELECT Magic.CID, 0 AS cnt FROM Magic WHERE Magic.CID NOT IN (SELECT CID FROM Enroll);
SELECT Supp_Course.CID FROM Supp_Course, DS WHERE Supp_Course.CID = DS.CID AND min_enroll > DS.cnt;
Process the outer query without the subquery
Collect bindings
Evaluate the subquery with bindings
Finally, refine the outer query