Computer science HW2 Solution, Assignments of Computer Science

Advanced Multi-core system Super scalar techniques and cache coherence

Typology: Assignments

2020/2021

Uploaded on 04/18/2021

yuri-lawson
yuri-lawson 🇺🇸

5

(1)

1 document

1 / 34

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CS316 Autumn 2014 2015
C. Kozyrakis
1
CS316 Advanced Multi-Core Systems
HW2 Superscalar Techniques and Cache Coherence
Due Monday 11/10/14 at 5PM
(Through online submission portal as a single PDF)
Notes
Collaboration on this assignment is encouraged (groups of 3
students)
This HW set provides practice for the exam. Make sure you work on
or review ALL problems in this assignment.
Group: Member1 - _______________ Member2 - ________________
Member3 - ________________
Problem 1: Branch Prediction [9 points]
The figure below shows the control flow of a simple program. The CFG is annotated
with three different execution trace paths. For each execution trace, circle which
branch predictor (bimodal, local, or gshare) will
best
predict the branching behavior
of the given trace. More than one predictor may perform equally well on a particular
trace. However, you are to use each of the three predictors
exactly once
in choosing
the best predictors for the three traces. Assume each trace is executed many times
and every node in the CFG is a conditional branch. The branch history register for the
local and gshare predictors is limited to 4 bits. Bimodal is a common name for a
simple branch history table (BHT). Provide a brief explanation for your answer.
Circle one:
Bimodal
Local
gshare
Identical global history at
b13 and b15, so the PC
is needed to differentiate
them.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22

Partial preview of the text

Download Computer science HW2 Solution and more Assignments Computer Science in PDF only on Docsity!

CS316 Advanced Multi-Core Systems

HW2 – Superscalar Techniques and Cache Coherence

Due Monday 11 / 10 / 14 at 5PM

(Through online submission portal as a single PDF)

Notes

  • Collaboration on this assignment is encouraged (groups of 3 students)
  • This HW set provides practice for the exam. Make sure you work on or review ALL problems in this assignment.

Group: Member1 - _______________ Member2 - ________________ Member3 - ________________

Problem 1: Branch Prediction [9 points]

The figure below shows the control flow of a simple program. The CFG is annotated with three different execution trace paths. For each execution trace, circle which branch predictor (bimodal, local, or gshare) will best predict the branching behavior of the given trace. More than one predictor may perform equally well on a particular trace. However, you are to use each of the three predictors exactly once in choosing the best predictors for the three traces. Assume each trace is executed many times and every node in the CFG is a conditional branch. The branch history register for the local and gshare predictors is limited to 4 bits. Bimodal is a common name for a simple branch history table (BHT). Provide a brief explanation for your answer. Circle one:

Bimodal

Local

gshare

Identical global history at b13 and b15, so the PC is needed to differentiate them.

Circle one:

Bimodal

Local

gshare

Identical global history at b1, so global history doesn’t work. The local history of b1 shows it alternates between taken and not taken.

Circle one:

Bimodal

Local

gshare

All the branches in this trace have a constant behavior, so bimodal predicts well.

Part B: [3 points] Part A explored simple register renaming: when the hardware register renamer sees a source register, it substitutes the destination T register of the last instruction to have targeted that source register. When the rename table sees a destination register, it substitutes the next available T for it. But superscalar designs need to handle multiple instructions per clock cycle at every stage in the machine, including the register renaming. A simple scalar processor would therefore look up both src register mappings for each instruction, and allocate a new destination mapping per clock cycle. Superscalar processors must be able to do that as well, but they must also ensure that any dest-to-src relationships between the two concurrent instructions are handled correctly. Consider the following sample code sequence:

I0: MULTD F5, F2, F

I1: MULTD F6, F3, F

I2: ADDD F2, F5, F

I3: DIVD F5, F2, F

Assume that we would like to simultaneously rename the first two instructions (2- way superscalar). Further assume that the next two available T registers to be used are known at the beginning of the clock cycle in which these two instructions are being renamed. Conceptually, what we want is for the first instruction to do its rename table lookups, and then update the table per its destination’s T register. Then the second instruction would do exactly the same thing, and any inter-instruction dependency would thereby be handled correctly. But there’s not enough time to write the T register designation into the renaming table and then look it up again for the second instruction, all in the same clock cycle. That register substitution must instead be done live (in parallel with the register rename table update). Figure 2. shows a circuit diagram, using multiplexers and comparators, that will accomplish the necessary on-the-fly register renaming. Your task is to show the cycle-by-cycle state of the rename table & destination / sources register mappings for every instruction of the code. Assume the table starts out with every entry equal to its index (T0 = 0; T1 = 1….). Moreover, assume that 9 is the next available T register.

Figure 2-1. Rename table and on-the-fly register substitution logic for superscalar machines. (Note: “src” is source, “dst” is destination.)

You only need to fill in mappings for registers that have been renamed from their starting values (e.g. no need to write in F60=T60, but if F60=T3 that needs to be filled in). Not all fields may be used.

Cycle 0:

Architectural Machine Instruction I

dst = T

src1 = T 2

src2 = T

F 5 T 9

F 6 T 10

F T

F T Instruction I

dst = T

src1 = T 3

src2 = T 3

F T

F T

F T

Cycle 1 :

Architectural Machine Instruction I

dst = T

src1 = T

src2 = T 10

F 2 T 11

F 5 T 12

F 6 T 10

F T Instruction I

dst = T

src1 = T1 1

F T

F T

Problem 3: Coherence [30 points] Part A: Single processor coherence [5 points] In lecture we discussed extending the standard MSI protocol to MESI by adding an Exclusive (E) state. Explain how this extra state can improve performance and provide an example program snippet (in MIPS-style assembly) to illustrate the benefit.

The exclusive (E) state benefits the common code sequences that read a value from memory, update the value, and write the value back to memory because it can reduce the number of coherence protocol messages that have to be sent. Consider the following code snippet that increments a memory value:

LD F1, 0(RX) ADDD F3, F2, F SD F3, 0(RX)

Consider the case where we start in state I and there are no sharers…

With MSI 1.) LD: I - > S, bus traffic to get value 2.) ADDD: no change 3.) SD: S - > M, bus upgrade sent to shoot down remote copies in S state even if there are none

With MESI 1.) LD: I - > E, bus traffic to get value 2.) ADDD: no change 3.) SD: E-> M, no bus upgrade needed because of exclusive copy

Part B: MOESI cache coherence protocol [10 points] Many modern systems use cache-to-cache transfer as a way to avoid penalties of going off-chip for a memory access. The MOESI cache coherency protocol, used in many AMD processors, extends from the MESI protocol by adding an additional O state. The O state indicates that the line is shared-dirty: i.e., multiple copies may exist, but the other copies are in the S state, and the cache that has the line in the O state is responsible for writing the line back if it is evicted. Fill in the table below for actions (both state transitions and bus traffic) on every event trigger. If nothing needs to be done, write in “Do nothing.” If an event is invalid for a given state, write in “Error.”

Current State s

Event and Local Coherence Controller Responses and Actions (s’ refers to next state) Local Read Local Write

Local Eviction

Bus Read Bus Write Bus Upgrade Invalid (I) Issue bus read. If no sharers, then s’=E Else s’=S

Issue bus write. s’=M

Error Do nothing

Do nothing

Do nothing

Shared (S)

Do nothing Issue bus upgrade. s'=M

s'=I Respond shared.

s'=I s'=I

Exclusive (E)

Do nothing s'=M s'=I Respond shared. Supply data to requestor s’=S

s'=I Error

Owned (O)

Do nothing Issue bus upgrade. s'=M

Write data back s'=I

Respond dirty/shar ed Supply data to requestor

Respond dirty Supply data to requestor s'=I

s'=I

Modified (M)

Do nothing Do nothing

Write data back s'=I

Respond dirty/shar ed Supply data to requestor s'=O

Respond dirty Supply data to requestor s'=I

Error

Part D: Memory Consistency [10 points] Consider a simple multicore processor using a snoopy MSI cache coherence protocol. Each processor has a single, private cache that is direct-mapped with four blocks each holding two words. The initial cache state of the system is shown in the figure below. To simplify the illustration, the cache-address tag contains the full address.

P Coherenc y State

Addres s tag B 0

S 100

B

S 108

B

M 110

B

I 118

P

Coherenc y State

Addres s tag B 0

I 100

B

M 128

B

I 110

B

S 118

P

Coherenc y State

Addres s tag B 0

S 100

B

S 108

B

I 110

B

I 118

Reads and writes will experience stall cycles depending on the state of the cache line:

  • CPU read and write hits generate no stall cycles
  • CPU read and write misses generate Nmemory stall cycles to retrieve the data from memory
  • CPU write hits that generate an invalidate incur Ninvalidate stall cycles
  • A write-back of a block, due to either a conflict or another processor’s request to an exclusive block, incurs an additional Nwriteback stall cycles The exact cycle count for each event is given in the table below: Parameter Cycles Nmemory 100 Ninvalidate 15 Nwriteback 10

Sequential consistency (SC) requires that all reads and writes appear to have executed in some total order. This may require the processor to stall in certain cases before committing a read or write instruction. Consider the following code sequence: write A read B

where the write A results in a cache miss and the read B results in a cache hit. Under SC, the processor must stall read B until after it can order (and thus perform) write A. Simple implementations of SC will stall the processor until the cache receives the data and can perform the write. Weaker consistency models relax the ordering constraints on reads and writes, reducing the cases that the processor must stall. The Total Store Order (TSO, or Processor Order) consistency model requires that all writes appear to occur in a total order but allows a processor’s reads to pass its own writes. This allows processor to implement write buffers that hold committed writes that have not yet been ordered with respect to other processors’ writes. Reads are allowed to pass (and potentially bypass) the write buffer in TSO (which they could

not do under SC). Assume that one memory operation can be performed per cycle and that operations that hit in the cache or that can be satisfied by the write buffer introduce no stall cycles. Operations that miss incur the latencies stated above. How many stall cycles occur prior to each operation for both the SC and TSO consistency models for the cases listed below? Show your work; a correct answer without any work shown will receive no credit.

Problem 4: Instruction Flow and Branch Prediction [30 points] This problem investigates the effects of branches and control flow changes on program performance for a scalar pipeline (to keep the focus on branch prediction). Branch penalties increase as the number of pipeline stages increases between instruction fetch and branch resolution (or condition and target resolution). This effect of pipelined execution drives the need for branch prediction. This problem explores both static branch prediction in Part C and dynamic branch prediction in Part D. For this problem the base machine is a 5-Stage pipeline.

The 5-Stage Pipeline without Dynamic Branch Prediction

Part A: Branch Penalties. [2 points]

What are the branch penalties for unconditional and conditional branches?

Unconditional ______ 1 ________ Conditional _______ 2 _______

Execution Assumptions:

  • unconditional branches execute in the decode stage
  • conditional branches execute in the execute stage
  • Effective address calculation is performed in the execute stage
  • All memory access is performed in the memory access stage
  • All necessary forwarding paths exist
  • The register file is read after write

The fetch address is a choice between the sequential address generation logic and the branch correction logic. If a mispredicted branch is being corrected the correction address is chosen over the sequential address for the next fetch address.

Instruction Fetch

Instruction Decode

Write Back

Execute

Memory Access

Instruction Cache

Instructions

Fetch Sequential^ Addr. Addr. Calc.

Br. Corr. Address

Part B: No Branch Prediction. [4 points]

This problem will use the insertion sort program. An execution trace, or a sequence of executed basic blocks, is provided for this problem. A basic block is a group of consecutive instructions that are always executed together in sequence.

Example Code: Insertion Sort

BB Line# Label Assembly_Instruction Comment

1 1 main: addi r2, r0, ListArray r2 <- ListArray 2 addi r3, r0, ListLength r3 <- ListLength 3 add r4, r0, r0 i = 0;

2 4 loop1: bge r4, r3, end while (i < Length) { 3 5 add r5, r4, r0 j = i ; 4 6 loop2: ble r5, r0, cont while (j > 0 ) { 5 7 addi r6, r5, - 1 k=j-1; 8 lw r7, r5(r2) temp1 = ListArray[j]; 9 lw r8, r6(r2) temp2 = ListArray[k]; 10 bge r7, r8, cont if (temp 1 >= temp2) break;

6 11 sw r8, r5(r2) ListArray[j] <- temp 2 ; 12 sw r7, r6(r2) ListArray[k] <- temp1; 13 addi r5, r5, - 1 j--; 14 ba loop2 }

7 15 cont: addi r4, r4, 1 i++; 16 ba loop1 }

8 17 end: lw r1, (sp) r1 <- Return Pointer 18 ba r 1

Execution Trace: Sequence of Basic Blocks Executed: 1 2 3 4 5 7 2 3 4 5 6 4 5 6 4 7 2 3 4 5 6 4 5 7 2 3 4 5 6 4 5 6 4 5 7 2 8

[Hint: An alternate way to represent the same execution trace above is to use the sequence of branch instructions, both conditional and unconditional (i.e. ba), executed.]

Part C: Static Branch Prediction. [8 points]

Static branch prediction is a compile-time technique of influencing branch execution in order to reduce control dependency stalls. Branch opcodes are supplemented with a static prediction bit that indicates a likely direction during execution of the branch. This is done based on profiling information, ala that in Part B. For this part of Problem 4, new branch opcodes are introduced:

bget - branch greater than or equal with static predict taken bgen - branch greater than or equal with static predict not-taken blet - branch less than or equal with static predict taken blen - branch less than or equal with static predict not-taken

Static branch prediction information is processed in the decode stage of the 5-stage pipeline. When a branch instruction with static predict taken (i.e. bget) is decoded the machine predicts taken. Conversely, when a branch instruction with static predict not-taken (i.e. bgen) is decoded the machine predicts not-taken.

  1. [6 points] Pretend you are the compiler, rewrite each conditional branch instruction in the original code sequence using the new conditional branch instructions with static branch prediction encoded.

Line# Instruction 4 bgen 6 blen 10 bgen

  1. [2 points] Assuming the same execution trace, what is the new total cycle count of the modified code sequence incorporating static branch prediction instructions. Indicate the resultant IPC.

The static branch prediction will help us predict 16 of the dynamic conditional branches correctly. However, even with the static prediction, we still lose 1 cycle in decoding the branch instruction. Therefore, for conditional branches with the correct prediction, we only lose 1 cycle. Thus, with the new static branch prediction, we save 16 cycles in conditional branch penalty.

Total cycle = old cycle count – 16 = 147 – 16 = 131 cycles IPC = 83 instructions / 131 cycles = 0.63 inst/cycle

  1. [6 points] Assume each entry of the BTB employs a 2-bit saturating up/down counter (initialized to the state 00) that maintains branch history. The BTB uses the following prediction algorithm: 00 - Not taken, 01 - Not taken, 10 - Taken, 11 - Taken. Fill in the table below with the state of the BTB after each dynamic branch instruction is executed. Use I[2:1] as the address of an instruction to determine which BTB entry is referenced, where I is the instruction number in its binary form. (2nd^ and 3rd^ bits resp.)

Dynamic Branches Executed

BTB Entry #0 BTB Entry #1 BTB Entry #2 BTB Entry #

Stat. Br #

Tar. Instr#

Hist. bits

Stat. Br #

Tar. Instr#

Hist. bits

Stat. Br #

Tar. Instr#

Hist. bits

Stat. Br #

Tar. Instr#

Hist. bits 1 4 -- 00 2 6 -- 00 3 10 16 01 4 16 4 01 5 4 -- 00 6 6 -- 00 7 10 16 00 8 14 6 01 9 6 -- 00 10 10 16 00 11 14 6 01 12 6 16 01 13 16 4 10 14 4 -- 00 15 6 16 00 16 10 16 00 17 14 6 01 18 6 -- 00 19 10 16 01 20 16 4 11 21 4 -- 00 22 6 -- 00 23 10 16 00 24 14 6 01 25 6 -- 00 26 10 16 00 27 14 6 01

28 6 -- 00

29 10 16 01

30 16 4 11

31 4 18 01

32 18 RET 01