
CSC/ECE 506: Architecture of Parallel Computers

Problem Set 4

Due May 4, 2001

[May 11 for off-campus students]

Problems 1, 4, and 5 will be graded. There are 65 points on these problems.

Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on the graded problems.

Problem 1. (25 points) [CS&G 6.1]

Consider two machines, M1 and M2. M1 is a four-processor shared-L1-cache machine, whereas M2 is a four-processor bus-based snooping-cache machine. M1 has a single shared 1-MB two-way set-associative cache with 64-byte blocks, whereas each processor in M2 has a 256-KB direct-mapped cache with 64-byte blocks. M2 uses the Illinois MESI coherence protocol. Consider the following piece of code:

double A[1024, 1024];   /* row-major; 8-byte elements */
double C[4096];
double B[1024, 1024];

for (i=0; i<1024; i+=1)   /* loop-1 */
    for (j=myPID; j<1024; j+=numPEs)
    {
        B[i, j] = (A[i+1, j] + A[i-1, j] + A[i, j+1] + A[i, j-1]) / 4.0;
    }

for (i=0; i<1024; i+=numPEs)   /* loop-2 */
    for (j=0; j<1024; j+=1)
    {
        A[i, j] = (B[i+1, j] + B[i-1, j] + B[i, j+1] + B[i, j-1]) / 4.0;
    }

(a) Assume that the array A starts at hexadecimal address 0x0 (i.e., hexadecimal address 0), array C at 0x800000, and array B at 0x808000. All caches are initially empty. Each processor executes the preceding code, and myPID varies from 0 to 3 for the four processors. Compute misses for M1, separately for the two loop nests. Do the same for M2, stating any assumptions that you make.
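The base addresses in part (a) are what you get if the three arrays are laid out back to back in declaration order (A, then C, then B). The sketch below checks that arithmetic. Contiguous placement is an assumption, and the names DIM, C_LEN, ELEM_BYTES, base_C, base_B, and elem_addr are illustrative helpers, not part of the problem statement.

```c
#include <assert.h>
#include <stdint.h>

enum { DIM = 1024, C_LEN = 4096, ELEM_BYTES = 8 };

/* Assumption: A starts at 0x0 and the arrays are contiguous in memory. */
uint64_t base_A(void) { return 0x0; }

/* C follows A, which occupies DIM * DIM elements of 8 bytes each. */
uint64_t base_C(void) { return base_A() + (uint64_t)DIM * DIM * ELEM_BYTES; }

/* B follows C, which occupies C_LEN elements of 8 bytes each. */
uint64_t base_B(void) { return base_C() + (uint64_t)C_LEN * ELEM_BYTES; }

/* Byte address of element [row, col] of a row-major DIM x DIM matrix. */
uint64_t elem_addr(uint64_t base, uint64_t row, uint64_t col) {
    assert(row < DIM && col < DIM);
    return base + (row * DIM + col) * ELEM_BYTES;
}
```

Under these assumptions base_C() is 0x800000 and base_B() is 0x808000, matching the addresses given above, and consecutive rows of A or B are 0x2000 bytes (8 KB) apart.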

On either machine, the arrays are too large to fit in the cache, so both loops behave as if the cache is empty. The cache is large enough to hold the working set of three rows of the source matrix and one row of the target matrix. Given the presence of C, row A[i] conflicts with B[i-4,*] on M2. (They map to the same set on M1.) Thus, cold misses on A will knock out old rows of B. A cache block holds 8 elements; however, in loop-1 work is assigned cyclically within each row (that is, columns are distributed cyclically across the processors), so a given processor computes two elements per block.
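The mapping claims above can be checked with a few lines of index arithmetic. This is a minimal sketch, assuming the base addresses from part (a) and the usual placement rule (block number modulo number of sets). The helpers set_index, row_addr, and elems_per_block are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { BLOCK_BYTES = 64, ELEM_SIZE = 8, ROW_BYTES = 1024 * ELEM_SIZE };

/* Set index of a byte address, assuming (block number) mod (number of sets). */
uint64_t set_index(uint64_t addr, uint64_t cache_bytes, uint64_t ways) {
    assert(ways > 0);
    uint64_t num_sets = cache_bytes / (BLOCK_BYTES * ways);
    return (addr / BLOCK_BYTES) % num_sets;
}

/* Address of the first element of a given row of a 1024 x 1024 matrix. */
uint64_t row_addr(uint64_t base, uint64_t row) {
    return base + row * ROW_BYTES;
}

/* How many elements of one cache block loop-1 assigns to processor pid
   when columns are dealt out cyclically (j = pid, pid + num_pes, ...). */
int elems_per_block(int block, int pid, int num_pes) {
    int elems = BLOCK_BYTES / ELEM_SIZE;   /* 8 elements per block */
    int count = 0;
    for (int j = block * elems; j < (block + 1) * elems; j++)
        if (j % num_pes == pid)
            count++;
    return count;
}
```

For example, with A at 0x0 and B at 0x808000, set_index(row_addr(0x0, 4), 256 * 1024, 1) and set_index(row_addr(0x808000, 0), 256 * 1024, 1) give the same value, which is the A[i] versus B[i-4] conflict on M2. Passing 1024 * 1024 and 2 ways instead models M1, where the two rows land in the same set. With four processors, elems_per_block returns 2 for every block and processor.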

(b) Briefly comment on how your answer to part (a) would change if array C were not present. State any other assumptions that you make.

these exercises?

Problem 2. (25 points) [CS&G 7.4, 7.7]

In order to solve the following problems, you need to carefully study Example 7.1 (p. 460 of Culler, Singh, & Gupta) first.

(a) Reconsider Example 7.1 where the number of hops for an n-node configuration is √n. How does the average transfer time increase with the number of nodes? What about √
