Assignment 3 for Architecture of Parallel Computers | ECE 506, Assignments of Electrical and Electronics Engineering

2001 Summer Material Type: Assignment; Professor: Gehringer; Class: Architecture Of Parallel Computers; Subject: Electrical and Computer Engineering; University: North Carolina State University; Term: Unknown 1989;

CSC/ECE 506: Architecture of Parallel Computers

Problem Set 3

Due July 20, 2001

Problems 1 and 4 will be graded. There are 45 points on these problems.

Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on the graded problems.
Problem 1. (25 points) Consider a multiprocessor with a sequentially consistent memory system. Each processor has a cache implementing a basic 3-state write-invalidate protocol (similar to the one in Figure 5.13, p. 295 of Culler, Singh & Gupta), and a one-entry write buffer between the CPU and the cache, so that stores need not block the processor.

Upon a store, if the referenced cache block isn’t in the exclusive state, then the register value is transferred into the write buffer, and the necessary protocol action is launched to obtain ownership. Once the processor obtains ownership of the block, the value is transferred from the write buffer to the cache, and the entry is removed from the write buffer. Load instructions stall the processor (upon a cache miss); store instructions don’t stall the processor so long as the SC memory model can be obeyed.
Assume the following conditions:

  • cache block size is 4 words;
  • a read/write instruction that hits in the cache takes 1 cycle;
  • a write instruction that incurs a privilege miss (e.g., the block is in the cache in Shared state) takes 3 cycles;
  • a read/write instruction that incurs a miss (the block has to be fetched from memory) takes 5 cycles;
  • all arithmetic instructions take 1 cycle; and
  • there is no resource or network contention.
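The cycle counts above can be sketched as a small lookup, which may help when filling in the "miss type/hit" column. This is only an illustration under stated assumptions: the names `State` and `access_cycles` are hypothetical, the three states are modeled as Invalid/Shared/Exclusive, and the sketch deliberately ignores the write buffer and any stalls required by sequential consistency.

```python
from enum import Enum


class State(Enum):
    """Hypothetical encoding of the three cache-block states."""
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"  # the owned, writable state (the problem's "exclusive")


def access_cycles(op, state=None):
    """Return (cycles, kind) for a single instruction in isolation.

    op is "read", "write", or "arith". Stalls imposed by the memory
    model and write-buffer occupancy are NOT modeled here.
    """
    if op == "arith":
        return 1, "arithmetic"
    if state is State.INVALID:
        return 5, "miss"            # block must be fetched from memory
    if op == "write" and state is State.SHARED:
        return 3, "privilege miss"  # block present, ownership missing
    return 1, "hit"


print(access_cycles("read", State.SHARED))
print(access_cycles("write", State.SHARED))
print(access_cycles("write", State.INVALID))
```

The lookup gives only the latency of one instruction on its own; when an instruction may start still depends on the ordering rules of the memory model.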
The trace below gives the interleaved order in which the instructions from the two processors (P1 and P2) were executed. Assume that I0 starts at time 0 and successive instructions start execution on consecutive cycles as long as there is no memory model-induced stalling. U and V are in different cache blocks. Initially both processors have U and V in the Shared state in their respective caches.

I0: P1: LOAD R5, U
I1: P2: STORE U, R1
I2: P2: ADD R1, R2
I3: P2: STORE V, R1
I4: P1: LOAD R4, U
I5: P1: STORE V, R1
I6: P1: ADD R5, R4
I7: P1: MULT R5, R4
I8: P2: LOAD R4, U
I9: P1: STORE V, R5

You should give a timing diagram (x-axis is time, y-axis is instructions I0--I9), showing when an instruction starts execution and when it finishes execution. Also give a table that has the following fields:

instruction | consistency actions | miss type/hit | reason for stall (if any)
Problem 2. (20 points) This problem should be solved using the MESI protocol for a bus-based shared-memory multiprocessor. Assume the following:

  • Direct-mapped cache organization
  • P1 and P2 each have exactly 2 cache lines
  • Cache-block size: 4 words
  • Cache-to-cache block transfer takes 4 cycles
  • Read/write hit (when no bus action is needed) takes 1 cycle
  • Invalidation takes 2 cycles
  • Memory-to-cache block transfer takes 8 cycles

B1 and B2 are two memory blocks that map to the same cache line. They contain the data items W, X, Y, Z and P, Q, R, S, respectively, as shown below. Each data item is one word.

B1:  W  X  Y  Z

B2:  P  Q  R  S
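Why two blocks can collide in a direct-mapped cache can be sketched as follows. The word addresses chosen for B1 and B2 are hypothetical (the assignment does not give addresses); they are picked only so that both blocks fall on the same line of a 2-line cache with 4-word blocks.

```python
WORDS_PER_BLOCK = 4  # from the problem statement
NUM_LINES = 2        # each cache has exactly 2 lines


def line_index(word_addr):
    """Direct-mapped placement: block number modulo number of lines."""
    block = word_addr // WORDS_PER_BLOCK
    return block % NUM_LINES


def tag(word_addr):
    """Remaining high-order bits, used to tell colliding blocks apart."""
    return (word_addr // WORDS_PER_BLOCK) // NUM_LINES


# Hypothetical word addresses: B1 is block 0, B2 is block 2.
B1 = {"W": 0, "X": 1, "Y": 2, "Z": 3}
B2 = {"P": 8, "Q": 9, "R": 10, "S": 11}

for name, addr in {**B1, **B2}.items():
    print(name, "line", line_index(addr), "tag", tag(addr))
```

Because both blocks share a line index but differ in tag, bringing one into a cache evicts the other from that same cache.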

You are given the following trace of memory accesses from two processors, P1 and P2. Assume that the accesses occur strictly sequentially in the textual order shown below. Also assume that whenever a bus action is required, the time for the bus action is in addition to the time needed to satisfy the processor request (read/write).

Determine the total time needed to execute the given memory access sequence. Assume that initially the caches are empty. Clearly show the state transition for the affected cache blocks in the caches of P1 and P2 after each access. Whenever there is a cache miss, indicate it as one of cold miss, coherence miss, capacity miss, or conflict miss. For coherence misses, indicate the ones that are due to false sharing and those due to true sharing. Assume that Q and Z are private variables for P1.

P1: READ Z

P2: READ W

P1: READ W

P1: WRITE Z

P2: READ W

P1: READ P

P2: WRITE P

P1: READ P

P1: READ Q

P2: READ W

P1: WRITE Q

P2: READ X

P1: READ R
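For tracking the per-access state changes, a minimal MESI sketch for a single block may be useful. It rests on several assumptions: the function name `mesi_access` and the bus-action labels are illustrative, evictions (and therefore conflict misses between B1 and B2) are not modeled, no cycle costs are computed, and the short trace at the end is a made-up example rather than the assignment's trace.

```python
def mesi_access(states, proc, op):
    """Apply one access to ONE block and return the bus action.

    states maps processor name -> 'M', 'E', 'S', or 'I' and is
    updated in place. op is "read" or "write".
    """
    others = [p for p in states if p != proc]
    mine = states[proc]

    if op == "read":
        if mine != "I":
            return "none"                     # read hit in M, E, or S
        shared = any(states[p] != "I" for p in others)
        for p in others:
            if states[p] in ("M", "E"):
                states[p] = "S"               # an M copy is also flushed
        states[proc] = "S" if shared else "E"
        return "BusRd"

    # op == "write"
    if mine == "M":
        return "none"
    if mine == "E":
        states[proc] = "M"                    # silent upgrade, no bus action
        return "none"
    for p in others:
        states[p] = "I"                       # invalidate all other copies
    states[proc] = "M"
    # Some descriptions use BusRdX for both cases.
    return "BusUpgr" if mine == "S" else "BusRdX"


# Hypothetical four-access example on one block.
states = {"P1": "I", "P2": "I"}
for proc, op in [("P1", "read"), ("P2", "read"), ("P1", "write"), ("P2", "read")]:
    action = mesi_access(states, proc, op)
    print(proc, op, action, dict(states))
```

Extending this to the problem would require one state per cache line per processor plus the tag of the block currently held, so that replacing B1 by B2 (or the reverse) is visible.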

Problem 3. (15 points) [CS&G 5.9] Consider the following conditions proposed as sufficient conditions for SC:

  • Every process issues memory requests in the order specified by the program.
  • After a read or write operation is issued, the issuing process waits for the operation to complete before issuing its next operation.
  • Before a processor Pj can return a value written by another processor Pi, all operations that were performed with respect to Pi before it issued the store must also be performed with respect to Pj.

Are these conditions indeed sufficient to guarantee SC executions? If so, say why. If not, construct a counterexample, and say why the conditions that were listed in the chapter are indeed sufficient in that case. Hint: Think about how these conditions are different from the ones in the chapter.
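When testing a candidate counterexample, it can help to enumerate what sequential consistency itself permits. The sketch below brute-forces every interleaving of a few short threads that preserves each thread's program order and collects the resulting register values. The function name `sc_outcomes` and the two-thread litmus test are illustrative assumptions, not part of the assignment, and the sketch says nothing about whether the three proposed conditions are sufficient.

```python
def sc_outcomes(threads):
    """Return the set of final register values reachable under SC.

    threads is a list of operation lists. Each operation is either
    ("w", variable, value) or ("r", variable, register_name).
    Memory locations start at 0.
    """
    outcomes = set()

    def run(pcs, mem, regs):
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            outcomes.add(tuple(sorted(regs.items())))
            return
        for i, t in enumerate(threads):
            if pcs[i] < len(t):
                kind, var, x = t[pcs[i]]
                new_mem, new_regs = dict(mem), dict(regs)
                if kind == "w":
                    new_mem[var] = x                # write becomes visible to all
                else:
                    new_regs[x] = mem.get(var, 0)   # read sees the latest write
                run(pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:], new_mem, new_regs)

    run(tuple(0 for _ in threads), {}, {})
    return outcomes


# Illustrative litmus test: each thread writes one flag, then reads the other.
t1 = [("w", "A", 1), ("r", "B", "r1")]
t2 = [("w", "B", 1), ("r", "A", "r2")]
results = sc_outcomes([t1, t2])
for outcome in sorted(results):
    print(outcome)
```

Any outcome that a proposed execution produces but that is absent from this set cannot be explained by an SC interleaving; in the illustrative test, r1 = 0 together with r2 = 0 is such an outcome.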

Problem 4. (20 points) In a 4-node CC-NUMA DSM machine using a memory-based directory protocol with an average network transaction time between nodes of 20 μsec:

(a) Compute the remote memory access time and draw a network transaction diagram for a write miss on node 1 to a remote memory block on node 2 that is dirty on node 3 for the following directory protocol optimizations:

  • Strict request-response
  • Intervention forwarding
  • Reply forwarding

(b) Compare the performance of the above three protocol optimizations.

(c) Discuss methods for achieving coherence by serialization to a memory location in this system.
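The arithmetic for part (a) can be sketched as a count of network transactions on the critical path multiplied by the 20 μsec transaction time. The counts below are assumptions based on the usual textbook message sequences (requester to home, home to owner, and so on); they ignore directory lookup time, memory access time, and any revision message to the home that overlaps with the reply. Check them against your own transaction diagrams before relying on them.

```python
NETWORK_TRANSACTION_US = 20  # from the problem statement

# ASSUMED number of serialized network transactions on the critical path
# for a miss to a block that is dirty in a third node.
CRITICAL_PATH_TRANSACTIONS = {
    "strict request-response": 4,   # req, response, intervention, response
    "intervention forwarding": 4,   # req, intervention, response, response
    "reply forwarding": 3,          # req, intervention, response
}


def remote_access_time_us(protocol):
    """Critical-path latency in microseconds, network time only."""
    return CRITICAL_PATH_TRANSACTIONS[protocol] * NETWORK_TRANSACTION_US


for name in CRITICAL_PATH_TRANSACTIONS:
    print(name, remote_access_time_us(name), "usec")
```

Under these assumptions the latency is determined by the transaction count alone; differences in where outstanding requests are tracked (requester, home, or owner) do not show up in this number and belong in the comparison for part (b).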

Problem 5. (20 points) When blocks in the cache are tagged by virtual addresses, an inverse TLB may be used to determine whether a given physical address is found in the cache.

Draw a diagram including the TLB, cache, inverse TLB, and main memory. Annotate that diagram as follows: Assume that process P1 has been referencing page frame 6 using (virtual) page number 27. Then a process switch occurs to process P2, which uses the same page (not just the same page frame!) but calls it page 15. Call your first diagram “before,” and then draw an “after” diagram, showing any changes in the cache, TLB, inverse TLB, and physical memory. Describe the order in which they changed: which changed first, which changed second, etc., and why.
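The role of the inverse TLB can be sketched with two dictionaries holding only the "before" state for P1. The page size and the function name `virtual_address_for` are hypothetical, and the sketch does not attempt to show the "after" state or the order of updates that the problem asks you to work out.

```python
PAGE_SIZE = 4096  # hypothetical page size in bytes

# "Before" state for process P1, from the problem statement.
tlb = {27: 6}          # virtual page number -> page frame
inverse_tlb = {6: 27}  # page frame -> virtual page number used in cache tags


def virtual_address_for(physical_addr):
    """Translate a physical address into the virtual address the cache uses.

    Returns None when the frame has no inverse-TLB entry, in which case
    this sketch assumes no block of that frame is in the cache.
    """
    frame, offset = divmod(physical_addr, PAGE_SIZE)
    vpage = inverse_tlb.get(frame)
    if vpage is None:
        return None
    return vpage * PAGE_SIZE + offset


# A physical address inside frame 6 maps back to virtual page 27.
print(virtual_address_for(6 * PAGE_SIZE + 16))
print(virtual_address_for(2 * PAGE_SIZE))
```

A lookup of this kind is what lets a request that carries only a physical address be checked against a cache whose blocks are tagged by virtual address.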