Download Optimizing Database Applications: Techniques for Query Efficiency and more Summaries Algebra in PDF only on Docsity!
Optimizing Data Access in Database Applications
using Static Analysis
Thesis
Submitted in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
by
K. Venkatesh Emani
Roll No: 134058001
Advsior
Prof. S. Sudarshan
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai
Declaration
I declare that this written submission represents my ideas in my own words and where others’ ideas or words have been included, I have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.
(Signature)
(Name of the student)
(Roll No.)
Date:
Abstract
Database applications are typically written using a mixture of imperative languages and embed- ded queries (expressed using SQL/other frameworks) for data access. Traditionally, the imper- ative and declarative parts of these applications have been optimized separately. Techniques for optimization that span across the declarative and imperative parts in database applications are called holistic optimization techniques. In this thesis, we present novel holistic optimization techniques for optimizing data access in applications that access data using database abstractions. In such applications, data processing logic often gets distributed across the declarative and imperative parts of a program. Conse- quently, data access from the application is often inefficient due to iterative queries, transfer of unused data, over-specification of queries, and other reasons. We propose techniques based on static program analysis and program transformations to automatically rewrite database ap- plications for efficient data access. When more than one transformations are applicable on a particular program, our techniques can systematically generate all equivalent alternatives using the given set of program transformations, represent these alternatives efficiently, and choose the best rewrite based on a cost model. We also improve upon existing techniques for optimizing the evaluation of user defined functions in databases, which, similar to database applications, contain a mix of imperative code and declarative SQL queries. We demonstrate that the static analysis techniques we develop can be applied to other optimizations as well as test data gen- eration for queries in database applications. Our experiments on real world and benchmark applications show that our techniques have wide applicability and provide significant perfor- mance benefits.
i
List of Figures
- 1 Introduction List of Figures vii
- 1.1 Problem Overview and Motivation
- 1.2 Summary of Contributions
- 1.3 Organization of the Thesis
- 2 Literature Survey
- 2.1 Early Approaches
- 2.2 Recent Approaches
- 2.2.1 Batching of Query Results
- 2.2.2 Prefetching Query Results
- 2.2.3 Prefetching Query Results at the Earliest Program Point
- 2.2.4 Pushing Computation to the Database
- 2.2.5 Optimizing Transactions in Database Applications
- 2.2.6 Optimizing User Defined Functions in Databases
- 2.2.7 Optimizations for ORM Applications
- 2.3 Summary
- 3 Translating Imperative Code to SQL
- 3.1 Introduction
- 3.2 Overview
- 3.3 DAG Based IR
- 3.3.1 Background
- 3.3.2 D-IR
- 3.3.3 Algorithm for D-IR Construction
- 3.4 F-IR Representation
- 3.4.1 Fold
- 3.4.2 Converting Loops to Fold
- 3.5 F-IR Transformations
- 3.5.1 Transformation Rules
- 3.5.2 Generating and Using Equivalent SQL
- 3.5.3 Application of Transformation Rules
- 3.5.4 Extensions
- 3.5.5 Limitations
- 3.6 Related Work
- 3.7 Experimental Evaluation
- 3.7.1 Applicability
- 3.7.2 Performance Impact
- 3.8 Summary
- 4 Cobra: A Framework for Cost-based Rewriting of Database Applications
- 4.1 Introduction
- 4.2 Motivating Example
- 4.3 Background
- 4.3.1 Volcano/Cascades AND-OR DAG
- 4.3.2 Program regions
- 4.4 AND-OR DAG Representation of Programs
- 4.4.1 Region as a State Transition
- 4.4.2 Region AND-OR DAG
- 4.5 Transformations using IR
- 4.5.1 F-IR Recap
- 4.5.2 Integration into Region DAG
- 4.5.3 Transformations
- 4.6 Cost Model
- 4.7 Related Work
- 4.8 Experimental Evaluation
- 4.9 Summary
- 5 Froid: Optimization of Imperative Programs in a Relational Database
- 5.1 Introduction
- 5.2 Background
- 5.2.1 Scalar UDF Example
- 5.2.2 UDF Evaluation in SQL Server
- 5.2.3 Drawbacks in UDF Evaluation
- 5.2.4 Prior Approaches
- 5.3 The Froid Framework
- 5.3.1 Intuition
- 5.3.2 The APPLY operator
- 5.3.3 Overview of Approach
- 5.3.4 Supported UDFs and queries
- 5.4 UDF Algebrization
- 5.4.1 Construction of Regions
- 5.4.2 Relational Expressions for Regions
- 5.4.3 Combining expressions using APPLY
- 5.4.4 Correctness and Semantics Preservation
- 5.5 Substitution and Optimization
- 5.6 Compiler Optimizations
- 5.6.1 Dynamic Slicing
- 5.6.2 Constant Folding and Propagation
- 5.6.3 Dead Code Elimination
- 5.7 Design and Implementation
- 5.7.1 Cost-based Substitution
- 5.7.2 Imposing Constraints
- 5.7.3 Supporting additional languages
- 5.7.4 Implementation Details
- 5.8 Evaluation
- 5.8.1 Applicability of Froid
- 5.8.2 Performance improvements
- 5.8.3 Resource consumption
- 5.9 Real-World UDF Examples
- 5.10 Natively compiled UDFs
- 5.11 TPC-H Queries with UDFs
- 5.11.1 Scalar UDF Definitions
- 5.11.2 TPC-H Queries Rewritten using UDFs
- 5.12 Related Work
- 5.13 Summary
- 6 Other Applications of Static Analysis
- 6.1 Rewriting ORDER BY Queries
- 6.1.1 Introduction
- 6.1.2 Background: Data Flow Analysis
- 6.1.3 Data Flow Equations for Live Order Analysis
- 6.1.4 Algorithm for Program Rewriting
- 6.1.5 Summary
- 6.2 Test Data Generation for Database Applications
- 6.2.1 Introduction
- 6.2.2 Query Extraction
- 6.2.3 Test Data and Unit Test Generation
- 6.2.4 Related Work
- 6.2.5 Experiments
- 6.2.6 Summary
- 7 Conclusions and Future Work
- 7.1 Conclusions
- 7.2 Future Work
- 7.2.1 Program transformations for ORM Applications
- 7.2.2 Cost-based program transformations
- 7.2.3 Optimization of User Defined Functions
- 7.2.4 Compiling Over-specified Data Structures
- 7.2.5 Optimizing Interactions between Web Services and Clients
- A Translating Imperative Code to SQL
- A.1 Proof Sketch for Loop to Fold Translation
- A.2 D-IR Construction
- A.2.1 Simple Statement
- A.2.2 Basic Block
- A.2.3 Sequential Region
- A.2.4 Conditional Region
- A.2.5 Loop Region
- A.2.6 Functions
- tion (image source: [83]) 1.1 Interaction between the application and the database in a typical JDBC applica-
- 1.2 Interaction between the application and the database in a typical ORM application
- 1.3 Distribution of data processing logic in ORM applications
- 1.4 Distribution of data processing logic in programs using database abstractions
- 2.1 Code with opportunities for loop fission transformation (source: [85])
- 2.2 DBridge Architecture (image source: [26])
- from [85]. 2.3 Transformed program for Figure 2.1 after loop fission [56]. Figure reproduced
- 2.4 Imperative program implementing join (source: [33])
- 2.5 Program from Figure 2.4 rewritten using SQL query (source: [33])
- 2.6 A program accessing the database (source: [31])
- 2.7 Partition of program from Figure 2.6 (source: [31])
- 2.8 Query with a scalar UDF (source: [99])
- 2.9 Decorrelated form of query from Figure 2.8 (source: [99])
- 2.10 Summary of Prior Work
- 3.1 System Overview
- 3.2 Code for highest score calculation
- 3.3 Walk-through of equivalent SQL derivation
- 3.4 Types of regions
- 3.5 D-IR construction for a simple code fragment
- 3.6 Demonstration of preconditions for translation into F-IR
- 3.7 Cursor loop with nested scalar queries
- 3.8 Optimized query for data access in Figure 3.7
- 3.9 Selection
- 3.10 Join
- 3.11 Aggregation
- 3.12 Comparison With Existing Techniques
- 4.1 COBRA Illustration
- 4.2 Hibernate object-relation mapping specification
- 4.3 Alternative implementations of the same program
- 4.4 Representing alternative query rewrites using the AND-OR DAG
- 4.5 Program regions for program P 0 from Figure 4.3a
- 4.6 Representing alternative programs using the Region DAG
- 4.7 Program M 0 : Aggregations inside a loop
- 4.8 F-IR representation for the loop in Figure 4.7
- 4.9 Region DAG for Figure 4.7 after transforming to F-IR
- 4.10 F-IR Transformation Rules (T1 to T5 are from Chapter 3)
- 4.11 Cost parameters
- work, varying Orders 4.12 Performance of alternative implementations of Figure 4.3a – Slow remote net-
- varying Orders 4.13 Performance of alternative implementations of Figure 4.3a – Fast local network,
- work, varying Customers 4.14 Performance of alternative implementations of Figure 4.3a – Slow remote net-
- description, number of cases) 4.15 Cases for cost based based optimization in real world application (pattern id,
- 4.16 Performance benefits due to COBRA
- 4.17 Code fragments for cost based rewriting
- 5.1 Example T-SQL User defined functions
- 5.2 Query plan for the query in Section 5.2.1
- 5.3 Simple UDF that reads a variable multiple times
- 5.4 Overview of the Froid framework
- 5.5 Relational expression for UDF total price
- 5.6 Plan for inlined UDF total price of Figure 5.1
- 5.7 Compiler optimizations as relational transformations
- 5.8 Varying the number of UDF invocations
- 5.9 Elapsed time for Compilation and execution (using cold plan cache)
- 5.10 TPC-H queries using UDFs
- 5.11 Improvement for UDFs in workload W1
- 5.12 Improvement for UDFs in workload W2
- 5.13 Example for Section 5.8.2
- 5.14 CPU time comparison
- 5.15 Example for I/O measurements
- 6.1 Custom sorting of ordered query results
- 6.2 CFG for the function getCustomerPayments from Figure 6.1
- 6.3 Rewritten program after removing unused ordering from Figure 6.1
- 6.4 Motivating example for automatic test data generation
- 6.5 Test data generation architecture
- 6.6 Intermediate representation for extracting query information
- 6.7 Walk-through of IR construction
- 6.8 User interaction on test results
- 6.9 Characteristics of synthetic programs
Figure 1.1: Interaction between the application and the database in a typical JDBC application (image source: [83])
impedance mismatch. Object-relational mapping (ORM) frameworks address this impedance mismatch by allowing developers to access data stored in the database using object accesses, without writing explicit SQL queries. The interaction between applications using such abstrac- tions and the database is shown in Figure 1.2. Developers using these abstractions access data simply using object accesses; the framework takes care of generating the required SQL queries, and translating the results back into objects for use in the application. Examples of popular ORM frameworks include Hibernate [60] for Java, Entity Framework [12] for .NET, Django ORM [11] for Python, Active Record [13] for Ruby on Rails, and others. An alternative approach to addressing the impedance mismatch is to make database relations as first class citizens inside a programming language. The language provides a model of the database, and custom operations that can be used to construct queries. The language compiler takes care of translating the models and operations into their SQL counterparts. Examples of such languages/libraries include SAP ABAP [14], Microsoft LINQ [15], etc. In essence, the frameworks, languages and libraries discussed above provide an abstraction of the relational
Figure 1.2: Interaction between the application and the database in a typical ORM application
List getUnfinishedProjects() { List unfinishedP = new ArrayList(); List projects = loadAll(Project.class); //loadAll() internally uses SQL to fetch all rows of the table Project for (Project project : projects){ if (!(project.getIsFinished())) unfinishedP.add(project); } return unfinishedP; } (a) Program using Hibernate ORM
List getUnfinishedProjects() { List unfinishedP = executeQuery( "SELECT * FROM Project WHERE isFinished <> 1"); return unfinishedP; } (b) Program from Figure 1.3a rewritten to use SQL
Figure 1.3: Distribution of data processing logic in ORM applications
model of the database. The use of database abstraction frameworks in database applications has been increasing. Chen et al. [28] note that in a recent survey, 67.5% of Java developers use the Hibernate ORM framework to access the database.
In applications using ORM frameworks (or other abstractions), data processing gets dis- tributed across the database and the application program. Developers of ORM applications tend to express complex queries using simple queries (generated by ORM) coupled with imperative code. For example, consider Figures 1.3 and 1.4. Figures 1.3a and 1.4a both use constructs provided by the database abstraction framework to read rows of a table into an intermediate collection and filter using an if inside a loop. Further, Figure 1.4a performs aggregation after filtering. These frameworks enable modularity and code reuse, and help developers without
SELECT * INTO TABLE gt doc FROM ekko. LOOP AT gt doc INTO lwa vbfa IF lwa vbfa-matnr EQ 5. lwa qty = lwa qty + lwa vbfa-rfmng. ENDIF. ENDLOOP. (a) An ABAP program
SELECT SUM(rfmng) FROM gt doc WHERE matnr = 5 (b) SQL query for Figure 1.4a
Figure 1.4: Distribution of data processing logic in programs using database abstractions
- We present the COBRA framework for cost-based rewriting of database applications. Our framework addresses gaps in existing approaches, which are based on heuristics, which in turn can lead to sub-optimal rewrites. Given a program containing database accesses, and a set of transformations on the program, our framework systematically explores the space of possible rewrites and chooses the best rewrite. Our techniques use the notion of program regions [74] to represent a program alge- braically, and extend the Volcano/Cascades [52, 54] framework for cost-based transfor- mations of algebraic expressions, to rewrite programs. Our techniques bring the bene- fits of Volcano/Cascades such as handling cycles in transformations, representing a large number of alternative rewrites efficiently, etc. to program transformations. The cost based choice is driven by a proposed cost model that estimates the cost of a program containing imperative statements and queries; our cost model is able to consult the database opti- mizer for estimates related to costs and cardinalities of queries.
- We develop the Froid framework for optimization of imperative code in a relational database, by inlining user defined function (UDF) invocations in queries. Such inlining provides up to order of magnitude performance improvements by enabling set oriented execution of UDFs, which are otherwise evaluated iteratively once per each row in the calling query. To this end, we propose techniques for constructing algebraic representations of scalar T- SQL user defined functions in the Microsoft SQL Server database, building on our work from translating imperative code to SQL. We evaluate alternative strategies for UDF in- lining at different stages of query optimization and present our learnings. Our techniques are also able to provide some compiler optimizations such as dead code elimination and dynamic slicing to UDFs in relational databases.
- The static analysis techniques that we developed for optimizing database applications and UDFs can also be used for extracting useful information about queries embedded in imperative code. We have extended our techniques to perform other optimizations as well as test data generation for embedded queries in database applications, as outlined below: - Developers of database applications often use ORDER BY queries, which are ex- pensive. We propose techniques based on data flow analysis to identify whether the order of query results is not used in the program, and develop transformations to rewrite queries and collections to remove the unnecessary ordering, for improved performance. - We propose techniques to generate test data for imperative programs containing SQL queries. The XData [24] system developed at IIT Bombay generates test data for automatic grading of SQL queries. Our techniques use static analysis to extract queries and relevant conditions from database applications containing embedded SQL queries, and use XData to generate test data for queries, as well as unit tests for functions containing queries, to assess the correctness of these queries.
1.3 Organization of the Thesis
The thesis is organized as follows. In Chapter 2, we survey existing approaches for holistic optimization of database applications. Techniques for translating imperative code to SQL are detailed in Chapter 3. We discuss the COBRA framework for cost-based transformations in
5
Chapter 4. Chapter 5 elaborates on the Froid framework for optimizing UDFs. In Chapter 6, we discuss other applications for static analysis of database applications. We conclude the thesis in Chapter 7 with a few directions for future work.