Problem Set 4 - Architecture Of Parallel Computers | ECE 506, Assignments of Electrical and Electronics Engineering

Material Type: Assignment; Professor: Gehringer; Class: Architecture Of Parallel Computers; Subject: Electrical and Computer Engineering; University: North Carolina State University; Term: Unknown 1989;

Uploaded on 03/10/2009


CSC/ECE 506: Architecture of Parallel Computers

Problem Set 4

Due May 4, 2001

[May 11 for off-campus students]

Problems 1, 4, and 5 will be graded. There are 65 points on these problems.

Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on the graded problems.

Problem 1. (25 points) [CS&G 6.1] Consider two machines M1 and M2. M1 is a four-processor shared-L1-cache machine, whereas M2 is a four-processor bus-based snooping-cache machine. M1 has a single shared 1-MB two-way set-associative cache with 64-byte blocks, whereas each processor in M2 has a 256-KB direct-mapped cache with 64-byte blocks. M2 uses the Illinois MESI coherence protocol. Consider the following piece of code:

    double A[1024, 1024];   /* row-major; 8-byte elements */
    double C[4096];
    double B[1024, 1024];

    for (i=0; i<1024; i+=1)   /* loop-1 */
        for (j=myPID; j<1024; j+=numPEs)
        {
            B[i, j] = (A[i+1, j] + A[i-1, j] + A[i, j+1] + A[i, j-1]) / 4.0;
        }

    for (i=0; i<1024; i+=numPEs)   /* loop-2 */
        for (j=0; j<1024; j+=1)
        {
            A[i, j] = (B[i+1, j] + B[i-1, j] + B[i, j+1] + B[i, j-1]) / 4.0;
        }

(a) Assume that the array A starts at hexadecimal address 0x0, array C at 0x800000, and array B at 0x808000. All caches are initially empty. Each processor executes the preceding code, and myPID varies from 0 to 3 for the four processors. Compute misses for M1 separately for the two loop nests. Do the same for M2, stating any assumptions that you make.

On either machine, the arrays are too large to fit in the cache, so both loops behave as if the cache is empty. The cache is large enough to hold the working set of three rows of the source matrix and one row of the target matrix. Given the presence of C, row A[i] conflicts with row B[i-4, *] on M2. (They map to the same sets on M1.) Thus, cold misses on A will knock out old rows of B. A cache block holds 8 elements; however, in loop-1 the work within each row is assigned cyclically (that is, columns are distributed cyclically among the processors), so a given processor computes two elements per block.

(b) Briefly comment on how your answer to part (a) would change if array C were not present. State any other assumptions that you make.

(c) What can be learned about advantages and disadvantages of shared-cache architecture from these exercises?
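
The set-mapping claim in the note to part (a) can be checked mechanically. The following is a minimal sketch, not part of the assignment: it assumes the base addresses given in part (a), 8-byte elements, and 64-byte blocks, and the helper names (row_addr, set_index, owned_per_block) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Layout from part (a): 1024 doubles (8 bytes each) per row, 64-byte blocks. */
enum { ELEM_BYTES = 8, ROW_BYTES = 1024 * ELEM_BYTES, BLOCK_BYTES = 64 };

static const uint32_t A_BASE = 0x0;       /* array A starts at 0x0      */
static const uint32_t B_BASE = 0x808000;  /* array B starts at 0x808000 */

/* Byte address of the first element of row `row` in the array at `base`. */
static uint32_t row_addr(uint32_t base, uint32_t row)
{
    return base + row * ROW_BYTES;
}

/* Set index of `addr` in a cache with `cache_bytes` capacity and `ways` ways. */
static uint32_t set_index(uint32_t addr, uint32_t cache_bytes, uint32_t ways)
{
    assert(ways > 0 && cache_bytes % (ways * BLOCK_BYTES) == 0);
    uint32_t num_sets = cache_bytes / (ways * BLOCK_BYTES);
    return (addr / BLOCK_BYTES) % num_sets;
}

/* Number of elements in one cache block that processor `pid` computes when
   columns are dealt out cyclically (j = pid, pid + num_pes, ...), as in loop-1. */
static int owned_per_block(int pid, int num_pes)
{
    int count = 0;
    for (int j = 0; j < BLOCK_BYTES / ELEM_BYTES; j++)
        if (j % num_pes == pid)
            count++;
    return count;
}
```

For example, set_index(row_addr(A_BASE, i), 256 * 1024, 1) and set_index(row_addr(B_BASE, i - 4), 256 * 1024, 1) can be compared for M2, and the same pair with a capacity of 1024 * 1024 and 2 ways for M1.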

Problem 2. (25 points) [CS&G 7.4, 7.7] In order to solve the following problems, you need to carefully study Example 7.1 (p. 460 of Culler, Singh, & Gupta) first.

(a) Reconsider Example 7.1 where the number of hops for an n-node configuration is $\sqrt{n}$. How does the average transfer time increase with the number of nodes? What about $\sqrt[3]{n}$?

(b) Reconsider Example 7.1 where the network is a simple ring. The average distance between two nodes on a ring of n nodes is $n/2$. How does the average transfer time increase with the number of nodes? Assuming each link can be occupied by at most one transfer at a time, how many such transfers can take place simultaneously?

Problem 3. (10 points) [CS&G 8.10] A pattern of sharing that might be detected dynamically is a producer-consumer pattern, in which one processor repeatedly writes (produces) a variable and another processor repeatedly reads (consumes) it. Is the standard MESI invalidation-based protocol well suited to this? Why or why not? What enhancements or protocol might be better, and what are the savings in latency or traffic? How would you dynamically detect and employ the changes?

Problem 4. (15 points) (a) Like the perfect shuffle, the butterfly permutation by itself does not give a completely connected network. Show how a butterfly interconnection partitions 4- and 8-node networks.

(b) Since the perfect shuffle did not provide a complete connection, the exchange was added to the network, as discussed on page 10 of Lecture 23. By looking at the bit-representations of the source and destination nodes, what “property” does the perfect shuffle permutation have which prevents complete connection (as an obvious observation)? Does this property hold for the butterfly permutation as well, and if so, how? Will the introduction of an exchange permutation into the butterfly interconnection network provide a complete connection, as it did with the perfect shuffle? If so, prove it. If not, would it work for any special cases (e.g., networks of a particular size N)?

(c) Would adding only the super-butterfly permutations be a way to provide a complete connection with a maximum path length of O(log N)?
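
The permutations named in this problem are easiest to experiment with as operations on the bits of a node number. The sketch below assumes the usual textbook definitions, which may differ in detail from those in Lecture 23: the perfect shuffle rotates the k-bit node number left by one bit, the butterfly swaps the most and least significant bits, and the exchange complements the least significant bit. The function names are hypothetical.

```c
#include <assert.h>

/* Perfect shuffle (assumed definition): rotate the k-bit number x left by one. */
static unsigned shuffle(unsigned x, unsigned k)
{
    assert(k >= 1 && k < 32 && x < (1u << k));
    unsigned msb = (x >> (k - 1)) & 1u;
    return ((x << 1) & ((1u << k) - 1u)) | msb;
}

/* Butterfly (assumed definition): swap the most and least significant bits. */
static unsigned butterfly(unsigned x, unsigned k)
{
    assert(k >= 1 && k < 32 && x < (1u << k));
    unsigned msb = (x >> (k - 1)) & 1u;
    unsigned lsb = x & 1u;
    if (msb == lsb)
        return x;                        /* swapping equal bits changes nothing */
    return x ^ ((1u << (k - 1)) | 1u);   /* flip both end bits */
}

/* Exchange (assumed definition): complement the least significant bit. */
static unsigned exchange(unsigned x)
{
    return x ^ 1u;
}
```

For an 8-node network, k = 3; applying shuffle or butterfly repeatedly from each node 0 through 7 and recording which nodes are reached is one way to draw the partitions asked for in part (a).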

[Figure: the n = 3 network referred to in Problem 5, with labels a through f]

Problem 5. (25 points) This problem deals with a sample interconnection network, similar to the mesh network on pp. 5–6 of Lecture 23. Consider a unidirectional mesh interconnection network of size n × n. The routing functions are:

$R_{+1}(i) \equiv (i + 1) \bmod N$, where $N = n^2$

$R_{+n}(i) \equiv (i + n) \bmod N$.

For example, if n = 3, the network would be as shown at the right.

(a) What is the greatest distance between processors if n = 3?

(b) What is the average distance between any two processors?

(c) Repeat parts (a) and (b) with n = 4.

(d) For networks of different sizes, give a general formula for the maximum and average distances between two processors in terms of n.

(e) List some advantage(s) and disadvantage(s) of this interconnection network vs. a standard mesh interconnection network.
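
The two routing functions of Problem 5 translate directly into code, and hop counts can then be measured with a breadth-first search. This is a rough sketch for checking hand calculations, not a prescribed method: the names route_plus_1, route_plus_n, and distance, and the MAX_NODES limit, are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_NODES 1024   /* assumed upper bound on N = n * n for this sketch */

/* R+1(i) = (i + 1) mod N, where N = n * n. */
static int route_plus_1(int i, int n)
{
    return (i + 1) % (n * n);
}

/* R+n(i) = (i + n) mod N. */
static int route_plus_n(int i, int n)
{
    return (i + n) % (n * n);
}

/* Fewest hops from src to dst when each node has only its +1 and +n links.
   Uses breadth-first search; returns -1 if dst is never reached. */
static int distance(int src, int dst, int n)
{
    int N = n * n;
    assert(n >= 1 && N <= MAX_NODES);
    assert(src >= 0 && src < N && dst >= 0 && dst < N);

    int dist[MAX_NODES];
    int queue[MAX_NODES];
    for (int v = 0; v < N; v++)
        dist[v] = -1;                    /* -1 marks "not yet visited" */

    int head = 0, tail = 0;
    dist[src] = 0;
    queue[tail++] = src;

    while (head < tail) {
        int u = queue[head++];
        if (u == dst)
            return dist[u];
        int next[2] = { route_plus_1(u, n), route_plus_n(u, n) };
        for (int k = 0; k < 2; k++) {
            if (dist[next[k]] < 0) {
                dist[next[k]] = dist[u] + 1;
                queue[tail++] = next[k];
            }
        }
    }
    return -1;
}
```

Looping distance(s, d, n) over all ordered pairs of distinct nodes and tracking the maximum and the mean gives values to compare against the answers worked out by hand for n = 3 and n = 4.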