Query Optimization and Join Algorithms in Relational Databases, Assignments of Introduction to Database Management Systems

The importance of query optimization in reducing system resources and increasing user experience. It covers various query optimization strategies such as combining selections and cross-products into joins, pushing selections and projections ahead of joins, and replacing subqueries with joins. The document also explores the concept of blocking I/O in relational databases and the use of hash functions and hash join algorithms for executing joins. Additionally, it compares different join algorithms including sort-merge join, block nested loop join, and nested loop join.

Typology: Assignments

2020/2021

Uploaded on 06/01/2022

shadrack-kiprotich-1
shadrack-kiprotich-1 ๐Ÿ‡ฐ๐Ÿ‡ช

5 documents

1 / 12

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Assignment 2
Part 1. Concepts and principles
Question 1
The ultimate goal of a query optimizer is to reduce as much as possible system resources that are
required in fulfilling a query while providing a user with the intended results as fast as possible.
Some of the importance of query optimization are highlighted below;
a. Results into faster results thus increasing user experience.
b. It leads to more query processing at a given time because of reduced time for processing
each query.
c. Reduces wear and tear on the hardware components e.g., CPU and Drives.
d. Increases server efficacy which results in a reduction in power consumption.
In any query to be optimized, equivalent expressions for a query are identified by a relational
query optimizer using relational algebra equivalences. Several alternative query evaluation plans
are generated by considering all the available implementation techniques for relational operators
that are involved for every equivalent version of the query. Plan having the lowest estimated
cost is chosen by the optimizer by performing an estimate on the cost of each plan.
Strategies used to generate good plans are as enlisted below;
๏‚ทCombining selections and cross-products into joins.
๏‚ทExtensively re-ordering joins.
๏‚ทโ€œpushingโ€ ahead of joins selections and projections, which results in a reduction of the
size of the input.
๏‚ทPerforming queries during off-peak times.
๏‚ทReplacing subqueries with joins.
๏‚ทUsing WILDCATS at the end of the phrases.
๏‚ทDefining SELECT instead of SELECT*
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Query Optimization and Join Algorithms in Relational Databases and more Assignments Introduction to Database Management Systems in PDF only on Docsity!

Assignment 2 Part 1. Concepts and principles Question 1 The ultimate goal of a query optimizer is to reduce as much as possible system resources that are required in fulfilling a query while providing a user with the intended results as fast as possible. Some of the importance of query optimization are highlighted below; a. Results into faster results thus increasing user experience. b. It leads to more query processing at a given time because of reduced time for processing each query. c. Reduces wear and tear on the hardware components e.g., CPU and Drives. d. Increases server efficacy which results in a reduction in power consumption. In any query to be optimized, equivalent expressions for a query are identified by a relational query optimizer using relational algebra equivalences. Several alternative query evaluation plans are generated by considering all the available implementation techniques for relational operators that are involved for every equivalent version of the query. Plan having the lowest estimated cost is chosen by the optimizer by performing an estimate on the cost of each plan. Strategies used to generate good plans are as enlisted below; ๏‚ท Combining selections and cross-products into joins. ๏‚ท Extensively re-ordering joins. ๏‚ท โ€œpushingโ€ ahead of joins selections and projections, which results in a reduction of the size of the input. ๏‚ท Performing queries during off-peak times. ๏‚ท Replacing subqueries with joins. ๏‚ท Using WILDCATS at the end of the phrases. ๏‚ท Defining SELECT instead of SELECT*

Question 2 Concept of blocked I/O. In computing, a block I/O is given unit of storage that represents a fixed number of bytes known as blocks for transmitting between two communicating devices. In finer terms, a block is the smallest unit to be processed. So, in an application and including databases, we read/write into blocks not bytes. So, if particular information (byte) you want to read is inside the block, you ask the block to read the byte if the information you want. Data that has been structured this way is called blocked. Blocking on the other hand is the process of interacting with the blocked data. Relational databases and Blocking I/O Relational databases have implemented the concept of blocking. Block I/O improves the performance and also recoverability as compared to other requests. In blocking I/O, units in reading/write are performed in units i.e., Buffer Blocks of say for b pages. Per unit run, one buffer block is being set aside. Also, one buffer block [ ( B โˆ’ b ) b ] in each pass can be run. An example is, say, for every 10 pages buffer, 9 merges are run each time with only 1-page input/output buffer versus merging 4 runs at one given time with 2 [age input/output buffer block. The reason why Blocked I/O is cheaper as compared to other methods. It is much cheaper to submit a single request to perform a read/write on several subsequent pages than performing a read/write on the same pages using independent I/O requests. Question 3. Hash functions are algorithms that take in large sets of variable-length input data, then map them into shorter, fixed-length data sets. Keys are larger fixed lengths and values are the shorter length.

Hash joins are not advisable to be used in production databases that have very larger tables. The reason is, when the hash join is invoked, RAM (memory) regions that control hash joins are triggered. For large tables, a lot of memory is required. The overall cost of performing has joined is low when the hash table fits perfectly into the memory. On the other hand, the cost would rise if the hash table must be written to the disk. Pros of hash join. Searching for data is rapid. Especially when the hit count is low for the inner table and higher for the outer table. Cons of hash join. Memory usage becomes high when the hit count for the inner table is high. The performance also drops because the hits are saved t the fie first due to memory unavailability. Sort โ€“ Merge join. Sort โ€“ Merger join joins two different and independent sources of data in a table. Joining is achieved through two phases i.e., sort operation and merge operation. It is faster if a table is already sorted. Sort โ€“ merge join is used when joined tables are sorted on the joined columns. Sort-merge join is not advised when dealing with unsorted tables. This is because sorting is expensive when it comes to memory. An external sorting algorithm is required in instances where the size of the relation is larger than the memory available. Pros of sort-merge join. Sort-merge joins are faster as compared to other joins i.e., hash join. Sort -merge joins use less memory as compared to other joins for example hash join.

Cons of sort-merge join. Is it not fast when the number of comparisons is larger. As compared to nested loop join, sort-merge join is less good especially with a smaller amount of data. Block nested loop join. Block nested loop transverses block instead of tuples. It aims at reducing the number of times tables in the inner loops must be read. it achieves this by using buffering of rows read in the outer loops to reduce the number of times that tables in the inner loops must be read. In summary, in block nested-loop; inner relation blocks are paired to the outer relations. Block nested loop joins are used when the buffer size is small enough to hold the entire relationship in the memory. This is because it saves a lot of time. This is archived by processing relation block-wise (per block) instead of processing per tuple. Pros of nested loop join. It saves time when performing block access. It also reduces access costs. Cons of nested loop join. Performance drops when the hit count of outer tables is high. The reason is index is used to search the inner table each time there is fetching of a row from the outer table. Question 5. Concept of reduction factor. The reduction factor refers to the fraction of tuples in a table that satisfy a given conjunct.

Reduction factor (RF) of each term. Given the index I on the column, assuming a uniform distribution. 1/N keys(I). Otherwise, a fixed reduction factor of 1/10. ii. Column1 = column Considering indexes i1 and i2 on column_1 and column_2, assuming each key value in i (smaller one) has a matching value in i2. Part 2: Design considerations and calculation. Question 1. It is better to retrieve records in order using indexes as compared to sorting. This is because the inbuilt column is sorted in an order and they also contain unique values. Indexes are applied only to primary keys whose values have been assigned to them. Records are contained in leaves in a B+ tree. Also, a pointer interlinks noes together. 0 (log(n) is the height f a B+ Tree. When retrieving records in order, the cost of reaching the leaf node is 0(log(n)), traversing all the leaf nodes is 0(n). The overall time required is 0(n).

On the other hand, retrieving records and sorting them later, 0(n) is the time required to retrieve records in disk block and bring them to the memory. Sorting records takes 0(nlogn) when using an external merge sort. The overall time required in this approach is 0(nlogn). Question 2 Using an external merge-sort algorithm and available space for storing records on each page is: A page stores up to 10 records of 48 bytes each.4500 records are stored in 450 pages, when one record stores one page. With 4 buffer pages available, there will be: sorted runs/sub-files with four pages each. An exception is in the last run, which goes to only 2 pages long. Question 3.

This algorithm does not use multiple perโ€“relation buffers. Thus, the minimum requirement is one input buffer page for any relation and also one output buffer pager (3 buffer pages).

  1. Block nested-loops join works by reading the outer relation in blocks, and, for each block, perform a page-by-page scan of the inner relation. Another scan is permed on the outer relation while the inner relation receives one scan for each of the outer block. When outer relation is used, the total cost is minimized. The cost will increase if: B is less than 52, then B-2 < 50. Thus, 52 buffer pages are required for this cost.

Question 5. a. Cost = 10000. This is because, due to the WHERE clause which filters on the title, thus index on the name will not be used. File scan is the best resort. b. A best option for a plan is using B+ tree to find the first title index in a B+ tree search such that: However, due to the clustered index, indexโ€™s reference can be used to scan relation pages; ๐‘๐‘œ๐‘ ๐‘ก = (10000 x 10%) +( 2500 โˆ— 10% (๐‘†๐‘๐‘Ž๐‘›๐‘›๐‘–๐‘›๐‘” ๐‘กโ„Ž๐‘’ ๐‘–๐‘›๐‘‘๐‘’๐‘ฅ)) = 1000 + 250 + 2 = 1252 ( ๐‘ก๐‘œ๐‘ก๐‘Ž๐‘™ ๐‘๐‘œ๐‘ ๐‘ก ). c. A clustered index on dname works functionally the same as above where: E.dname = โ€œ Financeโ€ cost = 1002 + 250 = 1252. Relational data pages have ename field, and it will be retrieved there. d. Cost drops because the index combines all 3 indexes. These indexes are needed for an index-only scan. Cost = 2 (๐‘™๐‘œ๐‘œ๐‘˜ ๐‘ข๐‘) + 10000 ร— 5% ร— .75 (๐‘ ๐‘š๐‘Ž๐‘™๐‘™๐‘’๐‘Ÿ ๐‘ ๐‘–๐‘ง๐‘’) = 402