

Material Type: Assignment; Professor: Gehringer; Class: Architecture Of Parallel Computers; Subject: Electrical and Computer Engineering; University: North Carolina State University; Term: Unknown 1989;


[May 11 for off-campus students]
Problems 1, 4, and 5 will be graded. There are 65 points on these problems. Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on the graded problems.
Problem 1. (25 points) [CS&G 6.1] Consider two machines M1 and M2. M1 is a four-processor shared-L1-cache machine, whereas M2 is a four-processor bus-based snooping-cache machine. M1 has a single shared 1-MB two-way set-associative cache with 64-byte blocks, whereas each processor in M2 has a 256-KB direct-mapped cache with 64-byte blocks. M2 uses the Illinois MESI coherence protocol. Consider the following piece of code:
    double A[1024, 1024];   /* row-major; 8-byte elements */
    double C[4096];
    double B[1024, 1024];

    for (i=0; i<1024; i+=1)               /* loop-1 */
        for (j=myPID; j<1024; j+=numPEs) {
            B[i, j] = (A[i+1, j] + A[i-1, j] + A[i, j+1] + A[i, j-1]) / 4.0;
        }

    for (i=0; i<1024; i+=numPEs)          /* loop-2 */
        for (j=0; j<1024; j+=1) {
            A[i, j] = (B[i+1, j] + B[i-1, j] + B[i, j+1] + B[i, j-1]) / 4.0;
        }
(a) Assume that array A starts at hexadecimal address 0x0 (i.e., hexadecimal address 0), array C at 0x800000, and array B at 0x808000. All caches are initially empty. Each processor executes the preceding code, and myPID varies from 0 to 3 for the four processors. Compute misses for M1 separately for the two loop nests. Do the same for M2, stating any assumptions that you make.
On either machine, the arrays are too large to fit in the cache, so both loop nests behave as if the cache were empty at the start. The cache is large enough to hold the working set of three rows of the source matrix and one row of the target matrix. Because C (32 KB, the size of four 8-KB rows) sits between A and B, row A[i, *] conflicts with row B[i-4, *] on M2. (They map to the same set on M1.) Thus, cold misses on A will knock out old rows of B. A cache block holds 8 elements; however, work is assigned cyclically within each row (that is, columns are distributed cyclically among the processors), so a given processor computes two elements per block.
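The mapping claim above can be checked with a short calculation. The following is a minimal sketch, assuming the base addresses from part (a) and the cache parameters from the problem statement; the helper names `set_index` and `row_addr` and the sample row `i = 10` are illustrative choices, not part of the assignment.

```python
BLOCK = 64                      # bytes per cache block
ELEM = 8                        # bytes per double
ROW_BYTES = 1024 * ELEM         # one row of A or B
A_BASE, C_BASE, B_BASE = 0x0, 0x800000, 0x808000

def set_index(addr, cache_bytes, ways):
    """Set index of a byte address in a cache with 64-byte blocks."""
    num_sets = cache_bytes // (BLOCK * ways)
    return (addr // BLOCK) % num_sets

def row_addr(base, i):
    """Byte address of the first element of row i (row-major layout)."""
    return base + i * ROW_BYTES

M1 = (1024 * 1024, 2)   # 1 MB, two-way set-associative, shared
M2 = (256 * 1024, 1)    # 256 KB, direct-mapped, per processor

i = 10                  # arbitrary sample row (assumption for illustration)
a = row_addr(A_BASE, i)
b = row_addr(B_BASE, i - 4)

same_set_m2 = set_index(a, *M2) == set_index(b, *M2)
same_set_m1 = set_index(a, *M1) == set_index(b, *M1)

elements_per_block = BLOCK // ELEM
elements_per_pe_per_block = elements_per_block // 4   # four processors, cyclic columns

print(same_set_m2, same_set_m1, elements_per_block, elements_per_pe_per_block)
```

Changing `C_BASE` and `B_BASE` to model a layout without C is one way to start exploring part (b).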
(b) Briefly comment on how your answer to part (a) would change if array C were not present. State any other assumptions that you make.
(c) What can be learned about the advantages and disadvantages of a shared-cache architecture from these exercises?
Problem 2. (25 points) [CS&G 7.4, 7.7] In order to solve the following problems, you need to carefully study Example 7.1 (p. 460 of Culler, Singh, & Gupta) first.
3 n?
(b) Reconsider Example 7.1 where the network is a simple ring. The average distance between two nodes on a ring of n nodes is
n
Problem 3. (10 points) [CS&G 8.10] A pattern of sharing that might be detected dynamically is a producer-consumer pattern, in which one processor repeatedly writes (produces) a variable and another processor repeatedly reads (consumes) it. Is the standard MESI invalidation-based protocol well suited to this? Why or why not? What enhancements or protocol might be better, and what are the savings in latency or traffic? How would you dynamically detect and employ the changes?
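As a starting point for reasoning about the traffic, the following is a minimal sketch that counts bus transactions for one cache block under a simplified MESI model. It assumes exactly two caches, strict alternation of one producer write followed by one consumer read, and that an upgrade (BusUpgr) costs one transaction just like BusRdX and BusRd; the function name `alternate_write_read` is hypothetical.

```python
def alternate_write_read(rounds):
    """Count bus transactions when the producer writes and the consumer
    then reads the same block, repeated `rounds` times."""
    state = {"producer": "I", "consumer": "I"}
    bus = 0
    for _ in range(rounds):
        # Producer write: exclusive ownership is required.
        if state["producer"] in ("S", "I"):
            bus += 1                    # BusUpgr from S, BusRdX from I
            state["consumer"] = "I"     # the consumer's copy is invalidated
        state["producer"] = "M"
        # Consumer read: a miss whenever its copy was invalidated.
        if state["consumer"] == "I":
            bus += 1                    # BusRd; the producer supplies the block
            state["producer"] = "S"
            state["consumer"] = "S"
    return bus

print(alternate_write_read(10))
```

The same loop can be adapted to model an update-based alternative and compare the counts.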
Problem 4. (15 points) (a) Like the perfect shuffle, the butterfly permutation by itself does not give a completely connected network. Show how a butterfly interconnection partitions 4- and 8-node networks.
(b) Since the perfect shuffle did not provide a complete connection, the exchange was added to the network, as discussed on page 10 of Lecture 23. By looking at the bit representations of the source and destination nodes, what “property” does the perfect shuffle permutation have which prevents complete connection (as an obvious observation)? Does this property hold for the butterfly permutation as well, and if so, how? Will the introduction of an exchange permutation into the butterfly interconnection network provide a complete connection, as it did with the perfect shuffle? If so, prove it. If not, would it work for any special cases (e.g., networks of a particular size N)?
(c) Would adding only the super-butterfly permutations be a way to provide a complete connection with a maximum path length of O(log N)?
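To experiment with these permutations, the following is a minimal sketch that assumes the usual textbook definitions: the perfect shuffle rotates the address bits left by one position, and the butterfly swaps the most and least significant address bits. If Lecture 23 defines them differently, the functions need to be adjusted. The names `perfect_shuffle`, `butterfly`, and `orbits` are illustrative.

```python
def perfect_shuffle(x, bits):
    """Rotate the bits-wide address x left by one position."""
    msb = (x >> (bits - 1)) & 1
    return ((x << 1) & ((1 << bits) - 1)) | msb

def butterfly(x, bits):
    """Swap the most and least significant bits of the address x."""
    msb = (x >> (bits - 1)) & 1
    lsb = x & 1
    if msb == lsb:
        return x
    return x ^ ((1 << (bits - 1)) | 1)   # flip both end bits

def orbits(perm, bits):
    """Group the 2**bits nodes into the cycles of the permutation."""
    seen, groups = set(), []
    for start in range(1 << bits):
        if start in seen:
            continue
        cycle, node = [], start
        while node not in seen:
            seen.add(node)
            cycle.append(node)
            node = perm(node, bits)
        groups.append(cycle)
    return groups

for bits in (2, 3):
    print(bits, orbits(butterfly, bits), orbits(perfect_shuffle, bits))
```

Each inner list is a set of nodes that can reach one another using that permutation alone.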
[Figure for Problem 5: the network for n = 3, with labels a through f.]
Problem 5. (25 points) This problem deals with a sample interconnection network, similar to the mesh network on pp. 5–6 of Lecture 23. Consider a unidirectional mesh interconnection network of size n × n. The routing functions are:
R_{+1}(i) ≡ (i + 1) mod N, where N = n^2
R_{+n}(i) ≡ (i + n) mod N.
For example, if n = 3, the network would be as shown at the right.
(a) What is the greatest distance between processors if n = 3?
(b) What is the average distance between any two processors?
(c) Repeat parts (a) and (b) with n = 4.
(d) For networks of different sizes, give a general formula for the maximum and average distances between two processors in terms of n.
(e) List some advantage(s) and disadvantage(s) of this interconnection network vs. a standard mesh interconnection network.
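Hand-derived answers to Problem 5 can be checked by brute force. The following is a minimal sketch that builds the network from the two routing functions, (i + 1) mod N and (i + n) mod N, and measures hop counts with breadth-first search. It assumes that "average distance" means the mean over all ordered pairs of distinct processors; the function names are illustrative.

```python
from collections import deque

def distances_from(src, n):
    """Hop counts from src to every node, following the two
    unidirectional links i -> (i + 1) mod N and i -> (i + n) mod N."""
    N = n * n
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in ((node + 1) % N, (node + n) % N):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def max_and_average_distance(n):
    """Maximum and mean hop count over all ordered pairs of distinct nodes."""
    N = n * n
    hops = [d
            for src in range(N)
            for dst, d in distances_from(src, n).items()
            if dst != src]
    return max(hops), sum(hops) / len(hops)

for n in (3, 4):
    print(n, max_and_average_distance(n))
```

Running the loop for larger n is one way to test a candidate general formula for part (d).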