Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Optimizing Database Applications: Techniques for Query Efficiency, Summaries of Algebra

Murdoch University (MU)Algebra

Various techniques for optimizing database applications, focusing on query efficiency. It covers batching query results, loop fission, query optimization, and handling output ordering. The document also proposes methods for static analysis to identify relational operations in imperative code and rewrite them as SQL queries. Additionally, it presents a cost model for database application programs and discusses the benefits of using UDFs in database applications.

Typology: Summaries

2021/2022

Uploaded on 07/05/2022

lee_95 🇦🇺

4.6

(59)

999 documents

1 / 159

This page cannot be seen from the preview

Don't miss anything!

Optimizing Data Access in Database Applications

using Static Analysis

Thesis

Submitted in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

K. Venkatesh Emani

Roll No: 134058001

Advsior

Prof. S. Sudarshan

Department of Computer Science and Engineering

Indian Institute of Technology, Bombay

Mumbai

2019

Discover Summaries of Algebra Murdoch University (MU)

Partial preview of the text

Download Optimizing Database Applications: Techniques for Query Efficiency and more Summaries Algebra in PDF only on Docsity!

Optimizing Data Access in Database Applications

using Static Analysis

Thesis

Submitted in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

by

K. Venkatesh Emani

Roll No: 134058001

Advsior

Prof. S. Sudarshan

Department of Computer Science and Engineering

Indian Institute of Technology, Bombay

Mumbai

Declaration

I declare that this written submission represents my ideas in my own words and where others’ ideas or words have been included, I have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.

(Signature)

(Name of the student)

(Roll No.)

Date:

Abstract

Database applications are typically written using a mixture of imperative languages and embed- ded queries (expressed using SQL/other frameworks) for data access. Traditionally, the imper- ative and declarative parts of these applications have been optimized separately. Techniques for optimization that span across the declarative and imperative parts in database applications are called holistic optimization techniques. In this thesis, we present novel holistic optimization techniques for optimizing data access in applications that access data using database abstractions. In such applications, data processing logic often gets distributed across the declarative and imperative parts of a program. Conse- quently, data access from the application is often inefficient due to iterative queries, transfer of unused data, over-specification of queries, and other reasons. We propose techniques based on static program analysis and program transformations to automatically rewrite database ap- plications for efficient data access. When more than one transformations are applicable on a particular program, our techniques can systematically generate all equivalent alternatives using the given set of program transformations, represent these alternatives efficiently, and choose the best rewrite based on a cost model. We also improve upon existing techniques for optimizing the evaluation of user defined functions in databases, which, similar to database applications, contain a mix of imperative code and declarative SQL queries. We demonstrate that the static analysis techniques we develop can be applied to other optimizations as well as test data gen- eration for queries in database applications. Our experiments on real world and benchmark applications show that our techniques have wide applicability and provide significant perfor- mance benefits.

List of Figures

1 Introduction List of Figures vii
- 1.1 Problem Overview and Motivation
- 1.2 Summary of Contributions
- 1.3 Organization of the Thesis
2 Literature Survey
- 2.1 Early Approaches
- 2.2 Recent Approaches
  - 2.2.1 Batching of Query Results
  - 2.2.2 Prefetching Query Results
  - 2.2.3 Prefetching Query Results at the Earliest Program Point
  - 2.2.4 Pushing Computation to the Database
  - 2.2.5 Optimizing Transactions in Database Applications
  - 2.2.6 Optimizing User Defined Functions in Databases
  - 2.2.7 Optimizations for ORM Applications
- 2.3 Summary
3 Translating Imperative Code to SQL
- 3.1 Introduction
- 3.2 Overview
- 3.3 DAG Based IR
  - 3.3.1 Background
  - 3.3.2 D-IR
  - 3.3.3 Algorithm for D-IR Construction
- 3.4 F-IR Representation
  - 3.4.1 Fold
  - 3.4.2 Converting Loops to Fold
- 3.5 F-IR Transformations
  - 3.5.1 Transformation Rules
  - 3.5.2 Generating and Using Equivalent SQL
  - 3.5.3 Application of Transformation Rules
  - 3.5.4 Extensions
  - 3.5.5 Limitations
- 3.6 Related Work
- 3.7 Experimental Evaluation
  - 3.7.1 Applicability
  - 3.7.2 Performance Impact
- 3.8 Summary
4 Cobra: A Framework for Cost-based Rewriting of Database Applications
- 4.1 Introduction
- 4.2 Motivating Example
- 4.3 Background
  - 4.3.1 Volcano/Cascades AND-OR DAG
  - 4.3.2 Program regions
- 4.4 AND-OR DAG Representation of Programs
  - 4.4.1 Region as a State Transition
  - 4.4.2 Region AND-OR DAG
- 4.5 Transformations using IR
  - 4.5.1 F-IR Recap
  - 4.5.2 Integration into Region DAG
  - 4.5.3 Transformations
- 4.6 Cost Model
- 4.7 Related Work
- 4.8 Experimental Evaluation
- 4.9 Summary
5 Froid: Optimization of Imperative Programs in a Relational Database
- 5.1 Introduction
- 5.2 Background
  - 5.2.1 Scalar UDF Example
  - 5.2.2 UDF Evaluation in SQL Server
  - 5.2.3 Drawbacks in UDF Evaluation
  - 5.2.4 Prior Approaches
- 5.3 The Froid Framework
  - 5.3.1 Intuition
  - 5.3.2 The APPLY operator
  - 5.3.3 Overview of Approach
  - 5.3.4 Supported UDFs and queries
- 5.4 UDF Algebrization
  - 5.4.1 Construction of Regions
  - 5.4.2 Relational Expressions for Regions
  - 5.4.3 Combining expressions using APPLY
  - 5.4.4 Correctness and Semantics Preservation
- 5.5 Substitution and Optimization
- 5.6 Compiler Optimizations
  - 5.6.1 Dynamic Slicing
  - 5.6.2 Constant Folding and Propagation
  - 5.6.3 Dead Code Elimination
- 5.7 Design and Implementation
  - 5.7.1 Cost-based Substitution
  - 5.7.2 Imposing Constraints
  - 5.7.3 Supporting additional languages
  - 5.7.4 Implementation Details
- 5.8 Evaluation
  - 5.8.1 Applicability of Froid
  - 5.8.2 Performance improvements
  - 5.8.3 Resource consumption
- 5.9 Real-World UDF Examples
- 5.10 Natively compiled UDFs
- 5.11 TPC-H Queries with UDFs
  - 5.11.1 Scalar UDF Definitions
  - 5.11.2 TPC-H Queries Rewritten using UDFs
- 5.12 Related Work
- 5.13 Summary
6 Other Applications of Static Analysis
- 6.1 Rewriting ORDER BY Queries
  - 6.1.1 Introduction
  - 6.1.2 Background: Data Flow Analysis
  - 6.1.3 Data Flow Equations for Live Order Analysis
  - 6.1.4 Algorithm for Program Rewriting
  - 6.1.5 Summary
- 6.2 Test Data Generation for Database Applications
  - 6.2.1 Introduction
  - 6.2.2 Query Extraction
  - 6.2.3 Test Data and Unit Test Generation
  - 6.2.4 Related Work
  - 6.2.5 Experiments
  - 6.2.6 Summary
7 Conclusions and Future Work
- 7.1 Conclusions
- 7.2 Future Work
  - 7.2.1 Program transformations for ORM Applications
  - 7.2.2 Cost-based program transformations
  - 7.2.3 Optimization of User Defined Functions
  - 7.2.4 Compiling Over-specified Data Structures
  - 7.2.5 Optimizing Interactions between Web Services and Clients
A Translating Imperative Code to SQL
- A.1 Proof Sketch for Loop to Fold Translation
- A.2 D-IR Construction
  - A.2.1 Simple Statement
  - A.2.2 Basic Block
  - A.2.3 Sequential Region
  - A.2.4 Conditional Region
  - A.2.5 Loop Region
  - A.2.6 Functions
- tion (image source: [83]) 1.1 Interaction between the application and the database in a typical JDBC applica-
1.2 Interaction between the application and the database in a typical ORM application
1.3 Distribution of data processing logic in ORM applications
1.4 Distribution of data processing logic in programs using database abstractions
2.1 Code with opportunities for loop fission transformation (source: [85])
2.2 DBridge Architecture (image source: [26])
- from [85]. 2.3 Transformed program for Figure 2.1 after loop fission [56]. Figure reproduced
2.4 Imperative program implementing join (source: [33])
2.5 Program from Figure 2.4 rewritten using SQL query (source: [33])
2.6 A program accessing the database (source: [31])
2.7 Partition of program from Figure 2.6 (source: [31])
2.8 Query with a scalar UDF (source: [99])
2.9 Decorrelated form of query from Figure 2.8 (source: [99])
2.10 Summary of Prior Work
3.1 System Overview
3.2 Code for highest score calculation
3.3 Walk-through of equivalent SQL derivation
3.4 Types of regions
3.5 D-IR construction for a simple code fragment
3.6 Demonstration of preconditions for translation into F-IR
3.7 Cursor loop with nested scalar queries
3.8 Optimized query for data access in Figure 3.7
3.9 Selection
3.10 Join
3.11 Aggregation
3.12 Comparison With Existing Techniques
4.1 COBRA Illustration
4.2 Hibernate object-relation mapping specification
4.3 Alternative implementations of the same program
4.4 Representing alternative query rewrites using the AND-OR DAG
4.5 Program regions for program P 0 from Figure 4.3a
4.6 Representing alternative programs using the Region DAG
4.7 Program M 0 : Aggregations inside a loop
4.8 F-IR representation for the loop in Figure 4.7
4.9 Region DAG for Figure 4.7 after transforming to F-IR
4.10 F-IR Transformation Rules (T1 to T5 are from Chapter 3)
4.11 Cost parameters
- work, varying Orders 4.12 Performance of alternative implementations of Figure 4.3a – Slow remote net-
- varying Orders 4.13 Performance of alternative implementations of Figure 4.3a – Fast local network,
- work, varying Customers 4.14 Performance of alternative implementations of Figure 4.3a – Slow remote net-
- description, number of cases) 4.15 Cases for cost based based optimization in real world application (pattern id,
4.16 Performance benefits due to COBRA
4.17 Code fragments for cost based rewriting
5.1 Example T-SQL User defined functions
5.2 Query plan for the query in Section 5.2.1
5.3 Simple UDF that reads a variable multiple times
5.4 Overview of the Froid framework
5.5 Relational expression for UDF total price
5.6 Plan for inlined UDF total price of Figure 5.1
5.7 Compiler optimizations as relational transformations
5.8 Varying the number of UDF invocations
5.9 Elapsed time for Compilation and execution (using cold plan cache)
5.10 TPC-H queries using UDFs
5.11 Improvement for UDFs in workload W1
5.12 Improvement for UDFs in workload W2
5.13 Example for Section 5.8.2
5.14 CPU time comparison
5.15 Example for I/O measurements
6.1 Custom sorting of ordered query results
6.2 CFG for the function getCustomerPayments from Figure 6.1
6.3 Rewritten program after removing unused ordering from Figure 6.1
6.4 Motivating example for automatic test data generation
6.5 Test data generation architecture
6.6 Intermediate representation for extracting query information
6.7 Walk-through of IR construction
6.8 User interaction on test results
6.9 Characteristics of synthetic programs

Figure 1.1: Interaction between the application and the database in a typical JDBC application (image source: [83])

impedance mismatch. Object-relational mapping (ORM) frameworks address this impedance mismatch by allowing developers to access data stored in the database using object accesses, without writing explicit SQL queries. The interaction between applications using such abstrac- tions and the database is shown in Figure 1.2. Developers using these abstractions access data simply using object accesses; the framework takes care of generating the required SQL queries, and translating the results back into objects for use in the application. Examples of popular ORM frameworks include Hibernate [60] for Java, Entity Framework [12] for .NET, Django ORM [11] for Python, Active Record [13] for Ruby on Rails, and others. An alternative approach to addressing the impedance mismatch is to make database relations as first class citizens inside a programming language. The language provides a model of the database, and custom operations that can be used to construct queries. The language compiler takes care of translating the models and operations into their SQL counterparts. Examples of such languages/libraries include SAP ABAP [14], Microsoft LINQ [15], etc. In essence, the frameworks, languages and libraries discussed above provide an abstraction of the relational

Figure 1.2: Interaction between the application and the database in a typical ORM application

List getUnfinishedProjects() { List unfinishedP = new ArrayList(); List projects = loadAll(Project.class); //loadAll() internally uses SQL to fetch all rows of the table Project for (Project project : projects){ if (!(project.getIsFinished())) unfinishedP.add(project); } return unfinishedP; } (a) Program using Hibernate ORM

List getUnfinishedProjects() { List unfinishedP = executeQuery( "SELECT * FROM Project WHERE isFinished <> 1"); return unfinishedP; } (b) Program from Figure 1.3a rewritten to use SQL

Figure 1.3: Distribution of data processing logic in ORM applications

model of the database. The use of database abstraction frameworks in database applications has been increasing. Chen et al. [28] note that in a recent survey, 67.5% of Java developers use the Hibernate ORM framework to access the database.

In applications using ORM frameworks (or other abstractions), data processing gets dis- tributed across the database and the application program. Developers of ORM applications tend to express complex queries using simple queries (generated by ORM) coupled with imperative code. For example, consider Figures 1.3 and 1.4. Figures 1.3a and 1.4a both use constructs provided by the database abstraction framework to read rows of a table into an intermediate collection and filter using an if inside a loop. Further, Figure 1.4a performs aggregation after filtering. These frameworks enable modularity and code reuse, and help developers without

SELECT * INTO TABLE gt doc FROM ekko. LOOP AT gt doc INTO lwa vbfa IF lwa vbfa-matnr EQ 5. lwa qty = lwa qty + lwa vbfa-rfmng. ENDIF. ENDLOOP. (a) An ABAP program

SELECT SUM(rfmng) FROM gt doc WHERE matnr = 5 (b) SQL query for Figure 1.4a

Figure 1.4: Distribution of data processing logic in programs using database abstractions

We present the COBRA framework for cost-based rewriting of database applications. Our framework addresses gaps in existing approaches, which are based on heuristics, which in turn can lead to sub-optimal rewrites. Given a program containing database accesses, and a set of transformations on the program, our framework systematically explores the space of possible rewrites and chooses the best rewrite. Our techniques use the notion of program regions [74] to represent a program alge- braically, and extend the Volcano/Cascades [52, 54] framework for cost-based transfor- mations of algebraic expressions, to rewrite programs. Our techniques bring the bene- fits of Volcano/Cascades such as handling cycles in transformations, representing a large number of alternative rewrites efficiently, etc. to program transformations. The cost based choice is driven by a proposed cost model that estimates the cost of a program containing imperative statements and queries; our cost model is able to consult the database opti- mizer for estimates related to costs and cardinalities of queries.
We develop the Froid framework for optimization of imperative code in a relational database, by inlining user defined function (UDF) invocations in queries. Such inlining provides up to order of magnitude performance improvements by enabling set oriented execution of UDFs, which are otherwise evaluated iteratively once per each row in the calling query. To this end, we propose techniques for constructing algebraic representations of scalar T- SQL user defined functions in the Microsoft SQL Server database, building on our work from translating imperative code to SQL. We evaluate alternative strategies for UDF in- lining at different stages of query optimization and present our learnings. Our techniques are also able to provide some compiler optimizations such as dead code elimination and dynamic slicing to UDFs in relational databases.
The static analysis techniques that we developed for optimizing database applications and UDFs can also be used for extracting useful information about queries embedded in imperative code. We have extended our techniques to perform other optimizations as well as test data generation for embedded queries in database applications, as outlined below: - Developers of database applications often use ORDER BY queries, which are ex- pensive. We propose techniques based on data flow analysis to identify whether the order of query results is not used in the program, and develop transformations to rewrite queries and collections to remove the unnecessary ordering, for improved performance. - We propose techniques to generate test data for imperative programs containing SQL queries. The XData [24] system developed at IIT Bombay generates test data for automatic grading of SQL queries. Our techniques use static analysis to extract queries and relevant conditions from database applications containing embedded SQL queries, and use XData to generate test data for queries, as well as unit tests for functions containing queries, to assess the correctness of these queries.

1.3 Organization of the Thesis

The thesis is organized as follows. In Chapter 2, we survey existing approaches for holistic optimization of database applications. Techniques for translating imperative code to SQL are detailed in Chapter 3. We discuss the COBRA framework for cost-based transformations in

Chapter 4. Chapter 5 elaborates on the Froid framework for optimizing UDFs. In Chapter 6, we discuss other applications for static analysis of database applications. We conclude the thesis in Chapter 7 with a few directions for future work.

Optimizing Database Applications: Techniques for Query Efficiency, Summaries of Algebra

Related documents

Partial preview of the text

Download Optimizing Database Applications: Techniques for Query Efficiency and more Summaries Algebra in PDF only on Docsity!

Optimizing Data Access in Database Applications

using Static Analysis

Thesis

Submitted in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

by

K. Venkatesh Emani

Roll No: 134058001

Advsior

Prof. S. Sudarshan

Department of Computer Science and Engineering

Indian Institute of Technology, Bombay

Mumbai

Declaration

(Signature)

(Name of the student)

(Roll No.)

Date:

Abstract

List of Figures