



Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Material Type: Assignment; Class: Database Systems; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2004;
Typology: Assignments
1 / 6
This page cannot be seen from the preview
Don't miss anything!




(a) 4, 7, 5, 2, 2, 8
(b) 2, 2, 4, 5, 7, 8 or 8, 7, 5, 4, 2, 2
To simplify the problem, we make the following assumptions:
1.1 Among the two requirements, ”use as few buckets as possible” and ”read and write as many blocks as we can to consecutive positions on disk”, satisfy ”use as few buckets as possible” first.
1.2 In the first pass there is only one buffer block for each bucket (therefore the blocks are written to disks individually), while multiple blocks can be assigned for input buffer (therefore the blocks can be read from tables consecutively). Note that in Hybrid Hash-join, you always keep one whole bucket in memory and one buffer block for each of the other buckets.
2.1 Use sort-merge-join in Section 15.4.7, i.e., the second phase of the sorting is combined with the join itself.
2.2 We only write sorted sublists consecutively. We do not read sorted sublists consecutively.
No need to consider clever scheme such as dynamic expanding/shrinking of input buffer and hash bucket buffer.
You can make your own assumptions as long as your assumptions are consistent with the conditions in problem (e.g., you can assume multiple blocks are used for each bucket in the first-phase, you can use the sort-based join algorithm in Section 15.4.5, you can read sorted sublists consecutively as well), and your answer is conistent with your assumptions.
(a) In order to use as few buckets as possible and still be able to fit one whole bucket of the smaller relation (S) into memory, the number of buckets should be 5. Therefore each buckts contain 200 blocks (R) and 100 blocks (S).
Therefore in the first pass, we can read 96 consecutive blocks from disk into memory.
Reading S while hashing into buckets: (100 + 96/2) ∗ 5 + 100 + 20/2 = 850. (Explanation: It can read 96 consecutive blocks from S for 5 times, then finally 20 consecutive blocks from S to finish it.)
Reading R while hashing into buckets: (100 + 96/2) ∗ 10 + 100 + 40/2 = 1600. (Explanation: It can reads 96 consecutive blocks from R for 10 times, then finally 40 consecutive blocks from R to
finish it.)
Writing hash buckets of S into disk: 100. 5 ∗ 500 = 50250.
Writing hash buckets of R into disk: 100. 5 ∗ 1000 = 100500.
In the second pass, join is performed bucket by bucket. For each bucket, all blocks of S are read into buffers and in-memory search structure is built. Then corresponding blocks of R are read block by block.
Reading hash buckets of S into memory while joining corresponding buckets: (100+100/2)∗5 = 750.
Reading hash buckets of R into memory while joining corresponding buckets: 100. 5 ∗ 100 = 100500.
Therefore the total time is 254450 (milliseconds).
(b)
As analyzed in Example 15.10, k = 6. When hashing S in the first pass, we have 5 buffers for 5 buckets, and 500/6=84 (83 is also fine answer) buffers for the 1 in-memory bucket. We have other 12 buffers for holding the input blocks when reading the relations. When hashing R, the 84 buffers for the in-memory bucket of S are always kept, and 5 buffers are for other 5 buckets. Therefore each time we can also read 12 blocks consectuvely.
During the second pass, we can throw the in-memory buckets. Therefore we can read each rest bucket of S consecutively. When reading corresponding bucket from R, we can consecutively read 101-84=17 blocks each time.
Reading S while hashing into buckets: (100 ∗ 12 /2) ∗ 41 + 100 + 8/2 = 4450.
Reading R while hashing into buckets: (100 + 12/2) ∗ 83 + 100 + 4/2 = 8900.
Writing 5 of the 6 hash buckets of S into disk: 100. 5 ∗ 416 = 41808.
Writing 5 of the 6 hash buckets of R into disk: 100. 5 ∗ 833 = 83716.5.
Reading hash buckets of S into memory while joining corresponding buckets: (100+84/2)∗5 = 705.
Reading hash buckets of R into memory while joining corresponding buckets: (100 + 17/2) ∗ 49 = 5316 .5.
Therefore the total time is 144896 (milliseconds).
(c)
reading R: (100 + 100/2) ∗ 10 = 1500.
reading S: (100 + 100/2) ∗ 5 = 750.
writing sorted sublists of R: (100 + 100/2) ∗ 10 = 1500.
writing sorted sublists of S: (100 + 100/2) ∗ 5 = 750.
reading R: 100. 5 ∗ 1000 = 100500.
reading S: 100. 5 ∗ 500 = 50250.
(c) {R} {S} {T} {U} size 250 250 2000 2000 cost 0 0 0 0 best plan R S T U {R,S} {R,T} {R,U} {S,T} {S,U} {T,U} size 3125 500000 4000/3 4000/3 500000 32000/ cost 0 0 0 0 0 0 best plan R ./ S or S ./ R R ./ T R ./ U S ./ T S ./ U T ./ U or U ./ T {R,S,T} {R,S,U} {R,T,U} {S,T,U} size 50000/3 50000/3 64000/9 64000/ cost 4000/3 4000/3 4000/3 4000/ best plan (S ./ T ) ./ R (R ./ U ) ./ S (R ./ U ) ./ T (S ./ T ) ./ U {R,S,T,U} size 6400/ cost 4000/3 + 64000/ best plan ((S ./ T ) ./ U ) ./ R or((R ./ U ) ./ T ) ./ S
The costs of plans for 4-relation joins are:
Grouping Cost ((S ./ T ) ./ U ) ./ R 4000/3 + 64000/ ((R ./ U ) ./ T ) ./ S 4000/3 + 64000/ ((R ./ U ) ./ S) ./ T 4000/3 + 50000/ ((S ./ T ) ./ R) ./ U 4000/3 + 50000/
R has 25 different b values, therefore the 35 tuples of R with unknown b-values are divided among 20 values, for an average of 1.75 tuples each. Similarly the 40 tuples of S with unknown b-values are divided among 20 values, for an average of 2 tuples each.
According to containment of value sets, R and S contain the same set of b values since V (R, b) = V (S, b) = 25.
Number of join tuples with b = 0 : 4 × 9 = 36 Number of join tuples with b = 1 : 7 × 7 = 49 Number of join tuples with b = 2 : 6 × 8 = 48 Number of join tuples with b = 3 : 5 × 5 = 25 Number of join tuples with b = 4 : 3 × 2 = 6 Number of join tuples with b = 5 : 1. 75 × 6 = 10. 5 Number of join tuples with other 19 unknown b-values: 19 × 1. 75 × 2 = 66. 5
Therefore, totally the estimated join size is 36 + 49 + 48 + 25 + 6 + 10.5 + 66.5 = 241
(2)Assuming that all 25 values are equally likely to occur, the estimated join size is T (R)T (S)/max(V (R, b), V (S, b)) = 60 × 75 /25 = 180.
Action t1 t2 Mem A Mem B Disk A Disk B READ(A,t1) 5 5 5 10 READ(B,t2) 5 10 5 10 5 10 t1=t1+t2 15 10 5 10 5 10 WRITE(A,t1) 15 10 15 10 5 10 READ(A,t1) 15 10 15 10 5 10 READ(B,t2) 15 10 15 10 5 10 t2=t1+t2 15 25 15 10 5 10 WRITE(B,t2) 15 25 15 25 5 10 OUTPUT(B) 15 25 15 25 5 25 OUTPUT(A) 15 25 15 25 15 25
Yes, the consistency can be preserved by scheduling OUTPUT(B) before OUTPUT(A).
b) Action t1 t2 Mem A Mem B Disk A Disk B READ(A,t1) 5 5 5 10 READ(B,t2) 5 10 5 10 5 10 t2=t1+t2 5 15 5 10 5 10 WRITE(B,t2) 5 15 5 15 5 10 READ(A,t1) 5 15 5 15 5 10 READ(B,t2) 5 15 5 15 5 10 t1=t1+t2 20 15 5 15 5 10 WRITE(A,t1) 20 15 20 15 5 10 OUTPUT(B) 20 15 20 15 5 15 OUTPUT(A) 20 15 20 15 20 15
No, it is impossible to preserve the consistency since A >= B if transactions finish successfully.
c) Action t Mem A Mem B Disk A Disk B READ(B,t) 10 10 5 10 t=t+1 11 10 5 10 WRITE(A,t) 11 11 10 5 10 READ(A,t) 11 11 10 5 10 t=t+1 12 11 10 5 10 WRITE(B,t) 12 11 12 5 10 OUTPUT(B) 12 11 12 5 12 OUTPUT(A) 12 11 12 11 12
Yes, the consistency can be preserved by scheduling OUTPUT(B) before OUTPUT(A).
(a) A=21, B=15, C=25, D=
(b) A=22, B=16 or 17, C=25 or 26 or 27 or 28, D= 43 or 44 or 45
The only values that couldn’t appear over here are A=21, B=15, because the values A=22, B= were flushed to disk before the end checkpoint in line 12 was placed onto the log.
(c) 3, 〈T1 start〉