Advanced Database Management Systems - Assignment Problems Solved | CS 511, Assignments of Deductive Database Systems

Material Type: Assignment; Class: Advanced Data Management; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2006;

Typology: Assignments

Pre 2010

Uploaded on 03/16/2009

koofers-user-7ks
koofers-user-7ks 🇺🇸

4.3

(3)

9 documents

1 / 7

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CS511: Advanced Database Management Systems Fall 2006
Assignment #3
Due: 2:00pm CST on October 6, 2006
Out: September 20, 2006
NOTE: Please submit a hard copy of your homework. Bring it to the lecture table at the
beginning of the lecture on 6th October, 2006.
The hard copy should be as clearly readable as possible. You may be subtracted points for
unreadability and ugly presentation.
I2CS Students: You should email your solutions to the TA([email protected]) in the pdf
or the word format. Please send the file as attachment with your email by 2PM UIUC time
(CST). I2CS students in other time zones should note that the deadline is according to CST.
Problem 1 (30 pts)
Suppose you have 2 relations, R(A, B, C) and S(B, C, D, E). You have a clustered unique
(no duplicate keys) B+-tree index on attribute A for relation R. Assume this index is kept
entirely in memory (i.e., you do not need to read it from disk).
For relation S, you have two indexes: (i) a non-clustered non-unique B+-tree index for
attribute B, and (ii) a clustered non-unique B+-tree index for attribute C. Assume that
these two indexes are also kept in memory. Also, assume that all of the tuples of S that
agree on attribute C are stored in sequentially adjacent blocks on disk (that is, if more than
one block is needed to store all of the tuples with some value of C, then these blocks will be
sequentially located on the disk).
Other relevant data:
2000 tuples of R are stored per block on disk.
N(R) = 2,000,000 (number of tuples of R)
200 tuples of S are stored per block on disk.
N(S) = 200,000 (number of tuples of S)
V(B,S)=40,000 (image size of attribute B in S)
V(C,S)=800 (image size of attribute C in S)
1
pf3
pf4
pf5

Partial preview of the text

Download Advanced Database Management Systems - Assignment Problems Solved | CS 511 and more Assignments Deductive Database Systems in PDF only on Docsity!

CS511: Advanced Database Management Systems Fall 2006

Assignment

Due: 2:00pm CST on October 6, 2006 Out: September 20, 2006

NOTE: Please submit a hard copy of your homework. Bring it to the lecture table at the beginning of the lecture on 6th October, 2006.

The hard copy should be as clearly readable as possible. You may be subtracted points for unreadability and ugly presentation.

I2CS Students: You should email your solutions to the TA([email protected]) in the pdf or the word format. Please send the file as attachment with your email by 2PM UIUC time (CST). I2CS students in other time zones should note that the deadline is according to CST.

Problem 1 (30 pts)

Suppose you have 2 relations, R(A, B, C) and S(B, C, D, E). You have a clustered unique (no duplicate keys) B+-tree index on attribute A for relation R. Assume this index is kept entirely in memory (i.e., you do not need to read it from disk).

For relation S, you have two indexes: (i) a non-clustered non-unique B+-tree index for attribute B, and (ii) a clustered non-unique B+-tree index for attribute C. Assume that these two indexes are also kept in memory. Also, assume that all of the tuples of S that agree on attribute C are stored in sequentially adjacent blocks on disk (that is, if more than one block is needed to store all of the tuples with some value of C, then these blocks will be sequentially located on the disk).

Other relevant data:

  • 2000 tuples of R are stored per block on disk.
  • N(R) = 2,000,000 (number of tuples of R)
  • 200 tuples of S are stored per block on disk.
  • N(S) = 200,000 (number of tuples of S)
  • V(B,S)=40,000 (image size of attribute B in S)
  • V(C,S)=800 (image size of attribute C in S)

You want to execute the following query:^2

SELECT *

FROM R, S

WHERE (R.B=S.B) AND (R.C=S.C)

We present you with two query plans:

Plan 1:

For every block BL of R, retrieved using the clustered index on A for R For every tuple r of BL Use the index on B for S to retrieve all of the tuples s of S such that s.B=r.B For each of these tuples s, if s.C=r.C, output r.A, r.B, r.C, s.B, s.C, s.D, s.E

Plan 2:

For every block BL of R, retrieved using the clustered index on A for R For every tuple r of BL Use the index on C for S to retrieve all of the tuples s of S such that s.C=r.C For each of these tuples s, if s.B=r.B, output r.A, r.B, r.C, s.B, s.C, s.D, s.E

a) Analyze the above two plans carefully in terms of their behavior regarding accesses to disk, and explain which of the plans is therefore better. Be sure to include in your analysis which accesses to disk are sequential accesses and which ones are random accesses.

b) Can you think of an alternative plan that is better than the above two in terms of disk access? You do not need to submit detail computations for this question.

Solution

a) For Plan 1, we need 1000 sequential block access for reading the tuples of relation R. For each tuple of relation R, we need to access relation S by using the B attribute. However, we only have a non-clustered non-unique B+-tree index for attribute B.

Solution^4

Yes, we can pick k = 5, thus each bucket of the smaller relation S has about 200/5 or 40 pages of tuples. We still have 52 - 40 = 12 pages left, thus we can accommodate k - 1 = 4 pages for each of the other buckets. The cost is (3 - 2 * B/N)(M + N) = 2976, where B=52, M=10000/10=1000, N=2000/10=200.

Problem 3 (20 pts)

Query optimization has been considered as a key technique for the realization of the relational model.

a) Why does the relational DBMS (in particular) need query optimization? Why this was not an issue for earlier DBMS (e.g., of the network model).

b) System R has established the Selinger-style query optimization. What are the main techniques in this framework?

c) Suppose we wish to compute the following:

τb(R(a, b) on S(b, c) on T (c, d))

That is, we join the three relations and produce the result sorted on attribute b. Assume that we do NOT join R and T first, as that is a Cartesian product. What are all the subexpressions and interesting orders that System R optimizer would consider?

Solution

a) Relational query languages provide a high-level “declarative” interface to access data stored in relational database. The query does not give the access paths; there can be many possible access paths. Thus, we need to choose the efficient access path, which is the task of the query optimizer. However, earlier DBMSs use the navigational model. The query itself explicitly states the access path it would take. Thus, the query optimization was not an issue.

b) These are some techniques in this framework:

  • Space of query plans: To reduce the space, it pushes down projections and selec- tions. Then, it only considers left-deep joins
  • Cost estimation: Assuming that attributes and predicates are independent, the^5 cost is the weighted sum of the I/O cost and CPU cost. Cost of an operator depends on input data size, data distribution, and physical layout. Thus, the optimizer uses statistics about the relations to estimate the cost. Also, it uses statistics on base relations and intermediate results.
  • Search algorithm: It uses the (bottom-up) dynamic programming.

c) First, we can only join R on S or S on T first, since R on T, has been ruled out. System R optimizer considers only “left-deep” joins. Thus, there are four possible join orders. If R and S are joined first, then we must consider both the output in which the result is unsorted, and the output sorted on b, because that order is certainly “interesting”; the entire expression winds up sorted by b. We could also consider the “interesting” order in which the result is sorted by c, since there is a join still to be taken with c as the join attribute. That might be advantageous if S were initially sorted by c (e.g., there was a B-tree index on c), or if R on S were much bigger than S, and therefore sorting S before the join would be an efficient way to perform a sort-join of R on S with T. If we compute S on T first, then we should consider only the unsorted order for the result and the interesting order in which the result is sorted by b. The latter attribute is an interesting sort, because it is the join attribute of a subsequent join.

Problem 4 (15 pts)

Consider a database DB. DB has two relations R1 and R2. The relation R1 contains tuples t1 and t2, while R2 contains tuples t3, t4, and t5. Assume that the database DB, relations, and tuples form a hierarchy of lockable database elements.

Tell the sequence of lock requests and the response of the locking scheduler, which is described in the paper “Granularity of Locks and Degrees of Consistency in a Shared Data Base”, to the following sequence of request. You may assume all requests occur just before they are needed, and all unlocks occur at the end of the transaction.

r1(t1); w2(t2); r2(t3); w1(t4)

[w2(t2) represents the creation of t2 by transaction T 2.]

Solution

At the first step, T 1 puts a IS lock on the DB and on R1, and an S lock on t1.

e) When all transactions run in the above schedule, identify the transactions with degree^7 3 consistency (using Definition 1 in the paper, “Granularity of Locks and Degrees of Consistency in a Shared Data Base”). Answer the same question when transaction T did not run at all. How is the degree of consistency relevant to serializability?

Solution

a) X X T1 --> T3 --> T ^ | | | +-------------+ Y

b) None. The graph has a cycle.

c) X T3 --> T

d) The given schedule above is equivalent to a serial order of running T3 then T2.

e) None is degree 3 consistency. T2 and T3 are not as they dirty T1’s data, while T1 is not as its data is dirtied by other transactions. However, if T1 does not run, both T2 and T3 are degree 3 consistency. Degree 3 consistency holding long write and read locks is essentially strict 2PL. Conse- quently, when all transactions are degree 3 consistency, the schedule is guaranteed to be both serializable and recoverable. However, serializability alone cannot (necessarily) guarantee degree 3 consistency.