Query Optimization: From Logical Plan to Physical Execution, Slides of Database Management Systems (DBMS)

An in-depth exploration of query optimization in advanced database systems. It covers the process of a sql query's journey through the dbms, including parsing, validation, query optimization, and execution. The text also discusses various techniques for query optimization, such as relational algebra equivalences, query rewrite, and heuristics-based optimization.

Typology: Slides

2011/2012

Uploaded on 01/29/2012

arold
arold 🇺🇸

4.7

(24)

372 documents

1 / 9

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Query Optimization
Part I
CPS 216
Advanced Database Systems
2
A query’s trip through the DBMS
Parser
Validator
Query optimizer
Query execution engine
Result
SQL query SELECT title, SID
FROM Enroll, Course
WHERE Enroll.CID =
Course.CID;
Parse tree
hSFWi
hselect-listi
hfrom-listi
hwhere-listi
htableihtablei
hQueryi
AND
Enroll Course
Logical plan
πtitle, SID
σEnroll.CID = Course.CID
×
Enroll Course
Physical plan
PROJECT (title, SID)
MERGE-JOIN (CID)
SCAN (Enroll)
SCAN (Course)
SORT (CID)
3
Parsing & validation
Parser: SQL parse tree
Good old lex & yacc
Detect and reject syntax errors
Validator: parse tree logical plan
Detect and reject semantic errors
Nonexistent tables/views/columns
Type mismatches (e.g., AVG(name), name + GPA, Student
UNION Enroll)
Wildcard (SELECT *) and view expansion
Use information stored in system catalog tables (contains
all metadata/schema information)
pf3
pf4
pf5
pf8
pf9

Partial preview of the text

Download Query Optimization: From Logical Plan to Physical Execution and more Slides Database Management Systems (DBMS) in PDF only on Docsity!

Query Optimization

Part I

CPS 216

Advanced Database Systems

2

A query’s trip through the DBMS

Parser

Validator

Query optimizer

Query execution engine

Result

SQL query SELECT title, SID

FROM Enroll, Course WHERE Enroll.CID = Course.CID;

Parse tree

h SFW i h select-list i h from-list i

h where-list i

h table i h table i

h Query i

AND

Enroll Course

Logical plan

π title , SID

σ Enroll. CID = Course. CID

×

Enroll Course

Physical plan

PROJECT ( title , SID )

MERGE-JOIN ( CID )

SCAN ( Enroll )

SCAN ( Course )

SORT ( CID )

3

Parsing & validation

™ Parser: SQL → parse tree

ƒ Good old lex & yacc

ƒ Detect and reject syntax errors

™ Validator: parse tree → logical plan

ƒ Detect and reject semantic errors

  • Nonexistent tables/views/columns
  • Type mismatches (e.g., AVG(name), name + GPA, Student

UNION Enroll)

ƒ Wildcard (SELECT *) and view expansion

ƒ Use information stored in system catalog tables (contains

all metadata/schema information)

Logical plan

™ A tree whose nodes are logical operators

ƒ Often a tree of relational algebra operators

ƒ DB2 uses QGM (Query Graph Model)

™ There are many equivalent logical plans

π title

σ Student. name =“Bart” ∧ Student. SID = Enroll. SID ∧ Enroll. CID = Course.CID

×

Enroll

× Course

Student

An equivalent plan:

π title

 Enroll. CID = Course. CID

Enroll

Course

Student

 Student. SID = Enroll. SID

σ name = “Bart”

SELECT Course.title FROM Student, Enroll, Course WHERE Student.name = ‘Bart’ AND Student.SID = Enroll.SID AND Enroll.CID = Course.CID;

5

Query optimization and execution

™ Recall that a physical plan tells the DBMS query execution

engine how to execute the query

ƒ One logical plan can have many possible physical plans (with

equivalent results, but different costs and assumptions)

™ Query optimizer: one logical plan → “best” physical plan

™ Query execution engine: physical plan → results

PROJECT ( title )

INDEX-NESTED-LOOP-JOIN ( CID )

Index on Enroll( SID )

Index on Course ( CID )

Index on Student ( name )

INDEX-SCAN ( name = “Bart”)

INDEX-NESTED-LOOP-JOIN ( SID )

PROJECT ( title )

MERGE-JOIN ( CID )

SORT ( CID ) SCAN (Course) MERGE-JOIN ( SID )

SCAN ( Enroll )

SORT ( SID )

SCAN ( Student )

FILTER ( name = “Bart”)

6

Query optimization

™ Conceptually

ƒ Consider a space of possible plans (next)

ƒ Estimate costs of plans in the search space (next Tuesday)

ƒ Search through the space for the “best” plan (next Thursday)

™ Often the goal is not picking the absolute optimum, but

instead avoiding the horrible ones

1 second 1 minute 1 hour

Any of these will do

Heuristics-based query optimization

™ Start with a logical plan

™ Push selections/projections down as much as possible

ƒ Why?

ƒ Why not?

™ Join smaller relations first, and avoid cross product

ƒ Why?

ƒ Why not?

™ Convert the transformed logical plan to a physical

plan (by choosing appropriate physical operators)

11

SQL query rewrite

™ More complicated—subqueries and views divide a

query into nested “blocks”

ƒ Processing each block separately forces particular join

methods and join order

ƒ Even if the plan is optimal for each block, it may not be

optimal for the entire query

™ Unnest query: convert subqueries/views to joins

) Then we just deal with select-project-join queries

ƒ Where the clean rules of relational algebra apply

12

DB2’s QGM

Leung et al. “Query Rewrite Optimization Rules in IBM DB2 Universal

Database.”

™ Query Graph Model: DB2’s logical plan language

ƒ More high-level than relational algebra

™ A graph of boxes

ƒ Leaf boxes are tables

ƒ The standard box is the SELECT box (actually a select-

project-join query block with optional duplicate

elimination)

ƒ Other types include GROUPBY (aggregation), UNION,

INTERSECT, EXCEPT

ƒ Can always add new types (e.g., OUTERJOIN)

More on QGM boxes

™ Head: declarative description of the output

ƒ Schema: list of output columns

ƒ Property: Are output tuples DISTINCT?

™ Body: how to compute the output

ƒ Quantifiers: tuple variables that range over other boxes

  • F: regular tuple variable, e.g., FROM R AS r
  • E: existential quantifier, e.g., IN ( subquery ), or = ANY( subquery )
  • A: universal quantifier, e.g., > ALL( subquery )
  • S: scalar subquery, e.g., = ( subquery )

ƒ Quantifiers are connected a hypergraph

  • Hyperedges are predicates

ƒ Enforce DISTINCT, preserve duplicates, or permit duplicates?

  • For the output of this box, and for each quantifier

14

QGM example

SELECT DISTINCT q1.partno, q1.descr, q2.suppno FROM inventory q1, quotations q WHERE q1.partno = q2.partno AND q1.descr = ‘engine’ AND q2.price <= ALL (SELECT q3.price FROM quotations q WHERE q2.partno = q3.partno);

15

Query rewrite in DB

™ Goal: make the logical plan as general as possible,

i.e., merge boxes

™ Rule-based transformations on QGM

ƒ Merge subqueries in FROM

ƒ Convert E to F (e.g., IN/ANY subqueries to joins)

ƒ Convert intersection to join

ƒ Convert S to F (i.e., scalar subqueries to joins)

ƒ Convert outerjoin to join

ƒ Magic (i.e., correlated subqueries to joins)

Another E to F trick

™ Sometimes an ANY subquery can be turned into an

aggregate subquery without ANY, to improve performance

further

™ SELECT * FROM Student s

WHERE GPA > ANY

(SELECT GPA FROM Student s

WHERE s2.name = ‘Bart’);

™ SELECT * FROM Student s

WHERE GPA >

(SELECT MIN(GPA) FROM Student s

WHERE s2.name = ‘Bart’);

20

Does the same trick apply to ALL?

™ SELECT * FROM Student s

WHERE GPA > ALL

(SELECT GPA FROM Student s

WHERE s2.name = ‘Bart’);

™ SELECT * FROM Student s

WHERE GPA >

(SELECT MAX(GPA) FROM Student s

WHERE s2.name = ‘Bart’);

21

Correlated subqueries

™ SELECT CID FROM Course

WHERE title LIKE ’CPS%’

AND min_enroll >

(SELECT COUNT(*) FROM Enroll

WHERE Enroll.CID = Course.CID);

™ Executing correlated subquery is expensive

ƒ The subquery is evaluated once for every CPS course

) Decorrelate!

COUNT bug

™ SELECT CID FROM Course

WHERE title LIKE ’CPS%’

AND min_enroll > (SELECT COUNT(*) FROM Enroll

WHERE Enroll.CID = Course.CID);

™ SELECT CID

FROM Course, (SELECT CID, COUNT(*) AS cnt

FROM Enroll GROUP BY CID) t

WHERE t.CID = Course.CID AND min_enroll > t.cnt

AND title LIKE ’CPS%’;

First compute the enrollment for all(?) courses

23

Magic decorrelation

™ Simple idea

ƒ Process the outer query using other predicates

  • To collect bindings for correlated variables in the subquery

ƒ Evaluate the subquery using the bindings collected

  • It is a join
  • Once for the entire set of bindings
    • Compared to once per binding in the naïve approach

ƒ Use the result of the subquery to refine the outer query

  • Another join

™ Name “magic” comes from a technique in recursive

processing of Datalog queries

24

Magic decorrelation example

™ SELECT CID FROM Course WHERE title LIKE ’CPS%’ AND min_enroll > (SELECT COUNT(*) FROM Enroll WHERE Enroll.CID = Course.CID); ™ CREATE VIEW Supp_Course AS SELECT * FROM Course WHERE title LIKE ’CPS%’;

CREATE VIEW Magic AS SELECT DISTINCT CID FROM Supp_Course;

CREATE VIEW DS AS (SELECT Enroll.CID, COUNT(*) AS cnt FROM Magic, Enroll WHERE Magic.CID = Enroll.CID GROUP BY Enroll.CID) UNION (SELECT Magic.CID, 0 AS cnt FROM Magic WHERE Magic.CID NOT IN (SELECT CID FROM Enroll);

SELECT Supp_Course.CID FROM Supp_Course, DS WHERE Supp_Course.CID = DS.CID AND min_enroll > DS.cnt;

Process the outer query without the subquery

Collect bindings

Evaluate the subquery with bindings

Finally, refine the outer query