Homework Solutions - Database Systems | CS 411, Assignments of Deductive Database Systems

Material Type: Assignment; Class: Database Systems; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2004;

Typology: Assignments

Pre 2010

Uploaded on 03/10/2009

koofers-user-ctn-1
koofers-user-ctn-1 🇺🇸

9 documents

1 / 6

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CS411 Database Systems
Fall 2004
HW#4 Solution
Problem 1 Index-Based Join and Sort-Merge Join Algorithms
(a) 4, 7, 5, 2, 2, 8
(b) 2, 2, 4, 5, 7, 8 or 8, 7, 5, 4, 2, 2
Problem 2 Hybrid Hash Join Algorithm
To simplify the problem, we make the following assumptions:
1. Two-pass hash-join and Hybrid hash-join:
1.1 Among the two requirements, ”use as few buckets as possible” and ”read and write as many
blocks as we can to consecutive positions on disk”, satisfy ”use as few buckets as possible” first.
1.2 In the first pass there is only one buffer block for each bucket (therefore the blocks are written
to disks individually), while multiple blocks can be assigned for input buffer (therefore the blocks
can be read from tables consecutively). Note that in Hybrid Hash-join, you always keep one whole
bucket in memory and one buffer block for each of the other buckets.
2. Sort-based join
2.1 Use sort-merge-join in Section 15.4.7, i.e., the second phase of the sorting is combined with the
join itself.
2.2 We only write sorted sublists consecutively. We do not read sorted sublists consecutively.
3. All
No need to consider clever scheme such as dynamic expanding/shrinking of input buffer and hash
bucket buffer.
You can make your own assumptions as long as your assumptions are consistent with the conditions
in problem (e.g., you can assume multiple blocks are used for each bucket in the first-phase, you
can use the sort-based join algorithm in Section 15.4.5, you can read sorted sublists consecutively
as well), and your answer is conistent with your assumptions.
(a) In order to use as few buckets as possible and still be able to fit one whole bucket of the smaller
relation (S) into memory, the number of buckets should be 5. Therefore each buckts contain 200
blocks (R) and 100 blocks (S).
Therefore in the first pass, we can read 96 consecutive blocks from disk into memory.
Reading S while hashing into buckets: (100 + 96/2) 5 + 100 + 20/2 = 850. (Explanation: It can
read 96 consecutive blocks from S for 5 times, then finally 20 consecutive blocks from S to finish
it.)
Reading R while hashing into buckets: (100 + 96/2) 10 + 100 + 40/2 = 1600. (Explanation: It
can reads 96 consecutive blocks from R for 10 times, then finally 40 consecutive blocks from R to
1
pf3
pf4
pf5

Partial preview of the text

Download Homework Solutions - Database Systems | CS 411 and more Assignments Deductive Database Systems in PDF only on Docsity!

CS411 Database Systems

Fall 2004

HW#4 Solution

Problem 1 Index-Based Join and Sort-Merge Join Algorithms

(a) 4, 7, 5, 2, 2, 8

(b) 2, 2, 4, 5, 7, 8 or 8, 7, 5, 4, 2, 2

Problem 2 Hybrid Hash Join Algorithm

To simplify the problem, we make the following assumptions:

  1. Two-pass hash-join and Hybrid hash-join:

1.1 Among the two requirements, ”use as few buckets as possible” and ”read and write as many blocks as we can to consecutive positions on disk”, satisfy ”use as few buckets as possible” first.

1.2 In the first pass there is only one buffer block for each bucket (therefore the blocks are written to disks individually), while multiple blocks can be assigned for input buffer (therefore the blocks can be read from tables consecutively). Note that in Hybrid Hash-join, you always keep one whole bucket in memory and one buffer block for each of the other buckets.

  1. Sort-based join

2.1 Use sort-merge-join in Section 15.4.7, i.e., the second phase of the sorting is combined with the join itself.

2.2 We only write sorted sublists consecutively. We do not read sorted sublists consecutively.

  1. All

No need to consider clever scheme such as dynamic expanding/shrinking of input buffer and hash bucket buffer.

You can make your own assumptions as long as your assumptions are consistent with the conditions in problem (e.g., you can assume multiple blocks are used for each bucket in the first-phase, you can use the sort-based join algorithm in Section 15.4.5, you can read sorted sublists consecutively as well), and your answer is conistent with your assumptions.

(a) In order to use as few buckets as possible and still be able to fit one whole bucket of the smaller relation (S) into memory, the number of buckets should be 5. Therefore each buckts contain 200 blocks (R) and 100 blocks (S).

Therefore in the first pass, we can read 96 consecutive blocks from disk into memory.

Reading S while hashing into buckets: (100 + 96/2) ∗ 5 + 100 + 20/2 = 850. (Explanation: It can read 96 consecutive blocks from S for 5 times, then finally 20 consecutive blocks from S to finish it.)

Reading R while hashing into buckets: (100 + 96/2) ∗ 10 + 100 + 40/2 = 1600. (Explanation: It can reads 96 consecutive blocks from R for 10 times, then finally 40 consecutive blocks from R to

finish it.)

Writing hash buckets of S into disk: 100. 5 ∗ 500 = 50250.

Writing hash buckets of R into disk: 100. 5 ∗ 1000 = 100500.

In the second pass, join is performed bucket by bucket. For each bucket, all blocks of S are read into buffers and in-memory search structure is built. Then corresponding blocks of R are read block by block.

Reading hash buckets of S into memory while joining corresponding buckets: (100+100/2)∗5 = 750.

Reading hash buckets of R into memory while joining corresponding buckets: 100. 5 ∗ 100 = 100500.

Therefore the total time is 254450 (milliseconds).

(b)

As analyzed in Example 15.10, k = 6. When hashing S in the first pass, we have 5 buffers for 5 buckets, and 500/6=84 (83 is also fine answer) buffers for the 1 in-memory bucket. We have other 12 buffers for holding the input blocks when reading the relations. When hashing R, the 84 buffers for the in-memory bucket of S are always kept, and 5 buffers are for other 5 buckets. Therefore each time we can also read 12 blocks consectuvely.

During the second pass, we can throw the in-memory buckets. Therefore we can read each rest bucket of S consecutively. When reading corresponding bucket from R, we can consecutively read 101-84=17 blocks each time.

Reading S while hashing into buckets: (100 ∗ 12 /2) ∗ 41 + 100 + 8/2 = 4450.

Reading R while hashing into buckets: (100 + 12/2) ∗ 83 + 100 + 4/2 = 8900.

Writing 5 of the 6 hash buckets of S into disk: 100. 5 ∗ 416 = 41808.

Writing 5 of the 6 hash buckets of R into disk: 100. 5 ∗ 833 = 83716.5.

Reading hash buckets of S into memory while joining corresponding buckets: (100+84/2)∗5 = 705.

Reading hash buckets of R into memory while joining corresponding buckets: (100 + 17/2) ∗ 49 = 5316 .5.

Therefore the total time is 144896 (milliseconds).

(c)

  1. In the first phase of two-phase multiway merge sort, we can sort 100 (or 101) blocks each time, therefore there are 10 sublists of R and 5 sublists of S.

reading R: (100 + 100/2) ∗ 10 = 1500.

reading S: (100 + 100/2) ∗ 5 = 750.

writing sorted sublists of R: (100 + 100/2) ∗ 10 = 1500.

writing sorted sublists of S: (100 + 100/2) ∗ 5 = 750.

  1. The combined step of join and the second phase of sorting.

reading R: 100. 5 ∗ 1000 = 100500.

reading S: 100. 5 ∗ 500 = 50250.

(c) {R} {S} {T} {U} size 250 250 2000 2000 cost 0 0 0 0 best plan R S T U {R,S} {R,T} {R,U} {S,T} {S,U} {T,U} size 3125 500000 4000/3 4000/3 500000 32000/ cost 0 0 0 0 0 0 best plan R ./ S or S ./ R R ./ T R ./ U S ./ T S ./ U T ./ U or U ./ T {R,S,T} {R,S,U} {R,T,U} {S,T,U} size 50000/3 50000/3 64000/9 64000/ cost 4000/3 4000/3 4000/3 4000/ best plan (S ./ T ) ./ R (R ./ U ) ./ S (R ./ U ) ./ T (S ./ T ) ./ U {R,S,T,U} size 6400/ cost 4000/3 + 64000/ best plan ((S ./ T ) ./ U ) ./ R or((R ./ U ) ./ T ) ./ S

The costs of plans for 4-relation joins are:

Grouping Cost ((S ./ T ) ./ U ) ./ R 4000/3 + 64000/ ((R ./ U ) ./ T ) ./ S 4000/3 + 64000/ ((R ./ U ) ./ S) ./ T 4000/3 + 50000/ ((S ./ T ) ./ R) ./ U 4000/3 + 50000/

Problem 5 Cost Estimation

R has 25 different b values, therefore the 35 tuples of R with unknown b-values are divided among 20 values, for an average of 1.75 tuples each. Similarly the 40 tuples of S with unknown b-values are divided among 20 values, for an average of 2 tuples each.

According to containment of value sets, R and S contain the same set of b values since V (R, b) = V (S, b) = 25.

Number of join tuples with b = 0 : 4 × 9 = 36 Number of join tuples with b = 1 : 7 × 7 = 49 Number of join tuples with b = 2 : 6 × 8 = 48 Number of join tuples with b = 3 : 5 × 5 = 25 Number of join tuples with b = 4 : 3 × 2 = 6 Number of join tuples with b = 5 : 1. 75 × 6 = 10. 5 Number of join tuples with other 19 unknown b-values: 19 × 1. 75 × 2 = 66. 5

Therefore, totally the estimated join size is 36 + 49 + 48 + 25 + 6 + 10.5 + 66.5 = 241

(2)Assuming that all 25 values are equally likely to occur, the estimated join size is T (R)T (S)/max(V (R, b), V (S, b)) = 60 × 75 /25 = 180.

Problem 6 Transactions a)

Action t1 t2 Mem A Mem B Disk A Disk B READ(A,t1) 5 5 5 10 READ(B,t2) 5 10 5 10 5 10 t1=t1+t2 15 10 5 10 5 10 WRITE(A,t1) 15 10 15 10 5 10 READ(A,t1) 15 10 15 10 5 10 READ(B,t2) 15 10 15 10 5 10 t2=t1+t2 15 25 15 10 5 10 WRITE(B,t2) 15 25 15 25 5 10 OUTPUT(B) 15 25 15 25 5 25 OUTPUT(A) 15 25 15 25 15 25

Yes, the consistency can be preserved by scheduling OUTPUT(B) before OUTPUT(A).

b) Action t1 t2 Mem A Mem B Disk A Disk B READ(A,t1) 5 5 5 10 READ(B,t2) 5 10 5 10 5 10 t2=t1+t2 5 15 5 10 5 10 WRITE(B,t2) 5 15 5 15 5 10 READ(A,t1) 5 15 5 15 5 10 READ(B,t2) 5 15 5 15 5 10 t1=t1+t2 20 15 5 15 5 10 WRITE(A,t1) 20 15 20 15 5 10 OUTPUT(B) 20 15 20 15 5 15 OUTPUT(A) 20 15 20 15 20 15

No, it is impossible to preserve the consistency since A >= B if transactions finish successfully.

c) Action t Mem A Mem B Disk A Disk B READ(B,t) 10 10 5 10 t=t+1 11 10 5 10 WRITE(A,t) 11 11 10 5 10 READ(A,t) 11 11 10 5 10 t=t+1 12 11 10 5 10 WRITE(B,t) 12 11 12 5 10 OUTPUT(B) 12 11 12 5 12 OUTPUT(A) 12 11 12 11 12

Yes, the consistency can be preserved by scheduling OUTPUT(B) before OUTPUT(A).

Problem 7 Failure Recovery

(a) A=21, B=15, C=25, D=

(b) A=22, B=16 or 17, C=25 or 26 or 27 or 28, D= 43 or 44 or 45

The only values that couldn’t appear over here are A=21, B=15, because the values A=22, B= were flushed to disk before the end checkpoint in line 12 was placed onto the log.

(c) 3, 〈T1 start〉