Branch Prediction and Speculation: Designs, Performance Analysis, and Buffer Usage, Assignments of Computer Architecture and Organization

This assignment compares two designs for alleviating the effect of branches in computer architecture: one uses compile-time scheduling with delay slots and no branch prediction, and the other employs branch prediction without delay slots. It gives equations to calculate the increase in CPI for each design based on the probability of misprediction. It also discusses the use of a branch prediction buffer and the impact of global branch prediction on performance.


Uploaded on 08/05/2009

koofers-user-hv0

Sample Problems:

Branch Prediction and Speculation

  1. Consider the following two designs for alleviating the effect of branches.

a. The first design defines a branch with two delay slots and does not use branch prediction. Instead, compile-time scheduling is used to fill the delay slots with useful instructions where possible. Suppose that for 30% of the branch instructions the compiler can fill both branch delay slots, and for 60% of the branch instructions it can fill only one delay slot.

b. The second design employs branch prediction and does not use delay slots. The branch always costs one cycle, and if mispredicted, it costs an additional three cycles (the misprediction penalty).

What prediction accuracy is required in the second design to achieve the same performance as the first design?

From a. we know that 10% of the branch instructions result in two pipeline bubbles (neither slot is filled), while 60% result in a one-cycle bubble. We can compute the increase in CPI for each design. The probability that an instruction is a branch is the same in both cases, so it cancels when the two are equated.

The increase in CPI due to a. is prob_instr_is_a_branch * (0.6 * 1 + 0.1 * 2) = prob_instr_is_a_branch * 0.8

The increase in CPI due to b. is prob_instr_is_a_branch * (p * 3) = prob_instr_is_a_branch * 3p, where p is the probability of misprediction.

Equating the two gives 3p = 0.8, so the critical value is p = 0.8 / 3 ≈ 0.267. The required prediction accuracy is (1 - p) ≈ 73.3%.
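The break-even point can also be checked numerically. The following is a minimal sketch that uses only the per-branch penalties stated in the problem; the function and variable names are illustrative and do not come from the original assignment.

```python
def delay_slot_penalty(frac_one_unfilled, frac_two_unfilled):
    """Average bubble cycles per branch for design a (two delay slots)."""
    return frac_one_unfilled * 1 + frac_two_unfilled * 2


def break_even_accuracy(target_penalty, mispredict_cycles):
    """Prediction accuracy at which design b matches target_penalty.

    Design b loses mispredict_cycles * p cycles per branch, where p is the
    misprediction probability, so p = target_penalty / mispredict_cycles.
    """
    p = target_penalty / mispredict_cycles
    return 1.0 - p


# 60% of branches leave one slot unfilled, 10% leave both unfilled.
penalty_a = delay_slot_penalty(0.6, 0.1)
accuracy_b = break_even_accuracy(penalty_a, 3)

print(f"design a penalty per branch: {penalty_a:.3f} cycles")
print(f"required accuracy for design b: {accuracy_b:.1%}")
```

Changing the fill fractions or the misprediction penalty shows how sensitive the break-even accuracy is to each parameter.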

  2. Consider the use of a branch prediction buffer using n-bit saturating counters for the code sequence shown below. The memory addresses for the instructions are shown in hexadecimal notation. Assume the following loop code has been executed 12 times. The branch at location 0x0044 has been taken 50% of the time, and the branch at location 0x0050 has also been taken 50% of the time. Consider the point in time at the start of the execution of the 13th iteration.

    Address   Instruction
    0x0038    L3: ..
    0x003C        ..
    0x0040        DSUBUI R3 R1 #
    0x0044        BNEZ R3 L
    0x0048        DADD R1 R0 R
    0x004C    L1: DSUBUI R3 R2 #
    0x0050        BNEZ R3 L
    0x0054        DADD R2 R0 R
    0x0058    L2: DSUBUI R3 R1 R
    0x005C        BEQZ R3 L

a. Considering only the preceding code, how many entries should the branch prediction buffer have to avoid the possibility of aliasing of branch addresses?

The minimum number of least significant address bits that ensures no aliasing for these addresses is 4, hence the branch prediction buffer would need 2^4 = 16 entries.

b. If all prediction buffer entries were initialized to 0, what can be the value of the counters in the prediction buffer corresponding to these two branch instructions?

0x0044 BNEZ R3 L1: 0 ≤ value ≤ 6

0x0050 BNEZ R3 L2: 0 ≤ value ≤ 6
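Both answers can be checked by brute force. The sketch below assumes that the buffer is indexed directly by the low-order bits of the byte address (as the answer to part a does), and that each counter increments on a taken branch and decrements on a not-taken branch. The function names are illustrative.

```python
from itertools import combinations

# The three branch addresses that appear in the listing.
BRANCH_ADDRESSES = [0x0044, 0x0050, 0x005C]


def min_index_bits(addresses, max_bits=16):
    """Smallest k such that the k low-order address bits are all distinct."""
    for k in range(1, max_bits + 1):
        mask = (1 << k) - 1
        if len({a & mask for a in addresses}) == len(addresses):
            return k
    raise ValueError("addresses alias for every index width tried")


def counter_range(n_bits, executions=12, taken=6):
    """Min and max final value of an n-bit saturating counter that starts
    at 0, over every ordering of `taken` taken outcomes in `executions`."""
    top = (1 << n_bits) - 1
    finals = set()
    for taken_slots in combinations(range(executions), taken):
        slots = set(taken_slots)
        value = 0
        for i in range(executions):
            if i in slots:
                value = min(value + 1, top)   # taken: count up, saturate
            else:
                value = max(value - 1, 0)     # not taken: count down, saturate
        finals.add(value)
    return min(finals), max(finals)


bits = min_index_bits(BRANCH_ADDRESSES)
print(f"index bits: {bits}, entries: {2 ** bits}")
print(f"3-bit counter range: {counter_range(3)}")
```

The upper bound of 6 requires counters of at least 3 bits; for narrower counters the bound is the saturation value 2^n - 1 instead.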

c. Now consider the case where we use a global branch predictor with a 3-bit global history. Execution of the 13th iteration is about to start. Provide an example of i) a feasible 3-bit global branch history, and ii) an infeasible global branch history. Ensure you clearly identify the entries in the branch history with the branch instructions in the code sequence.

The first two branches test two numbers, N1 and N2, for equality with the number 2. The last branch tests whether N1 = N2. If the first branch is not taken (N1 = 2) and the second branch is not taken (N2 = 2), then both values are reset by the fall-through DADD instructions, so N1 = N2 and the last branch must be taken. Hence an infeasible global history is 000. A feasible global history is 111 (N1 ≠ 2, N2 ≠ 2, and N1 = N2). The first branch in the program corresponds to the most significant bit, and 1 denotes a taken branch.
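The feasible histories can be enumerated by trying small values of N1 and N2. This sketch rests on assumptions that the truncated listing does not confirm: each fall-through DADD resets its register to 0, the final branch is taken when the two values are equal, and a history bit of 1 means taken. The function name is illustrative.

```python
from itertools import product


def history_for(n1, n2):
    """Outcome bits (1 = taken) of the three branches for one pass."""
    b1 = n1 != 2      # BNEZ at 0x0044: taken when N1 != 2
    if not b1:
        n1 = 0        # assumed effect of the DADD at 0x0048
    b2 = n2 != 2      # BNEZ at 0x0050: taken when N2 != 2
    if not b2:
        n2 = 0        # assumed effect of the DADD at 0x0054
    b3 = n1 == n2     # BEQZ at 0x005C: taken when N1 == N2
    return f"{int(b1)}{int(b2)}{int(b3)}"


# Values 0..3 cover "equal to 2", "not equal to 2", and equal/unequal pairs.
feasible = {history_for(n1, n2) for n1, n2 in product(range(4), repeat=2)}
all_histories = {f"{h:03b}" for h in range(8)}
infeasible = all_histories - feasible

print("feasible:  ", sorted(feasible))
print("infeasible:", sorted(infeasible))
```

Under these assumptions every history except 000 can be produced by some pair of inputs.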
  3. Consider the 5-stage integer pipeline with forwarding. Assume the branch penalty is 1 cycle (branch condition computed in ID). Now assume we have pipelined the memory system to three stages (rather than 1 stage) for both instruction fetch and data fetch. Branches are resolved at the end of the EX stage. We use a static branch-not-taken prediction strategy, i.e., if branches are taken we incur the branch penalty. Assume conditional branches occur with a frequency of 14%.

a. If branches are taken 62% of the time, what is the increase in CPI due to this prediction strategy?

The branch penalty is 3 cycles and is incurred only when the branch is taken. Increase in CPI = 0.62 * 0.14 * 3 = 0.2604

b. Alternatively, if we modify the pipeline and implement a delayed branch with a single delay slot, and we are able to successfully fill 65% of the slots, what is the increase in CPI?

35% of the time we are unable to fill the slot, with a penalty of 1 cycle. Hence the increase in CPI is 0.35 * 0.14 * 1 = 0.049

c. Now consider the occurrence of load delay slots, where loads occur with a probability of 24%, and 40% of these fetch data used by the immediately following instruction. If we perform no instruction scheduling to fill delay slots, what is the increase in CPI compared to the original pipeline (i.e., without pipelining the memory system)?

The load stalls are now three cycles rather than 1.

With no instruction scheduling we have 0.24 * 0.4 * 3 = 0.288
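The three CPI figures follow the same pattern: event frequency times stall cycles. The sketch below recomputes them from the numbers in the problem statement. The helper name is illustrative, and the final difference against a one-cycle load stall is only one possible reading of "compared to the original pipeline", so treat it as an assumption.

```python
def cpi_increase(event_frequency, stall_cycles):
    """Extra cycles per instruction caused by one kind of stall event."""
    return event_frequency * stall_cycles


BRANCH_FREQ = 0.14

# a. Predict not taken: 3-cycle penalty on the 62% of branches that are taken.
part_a = cpi_increase(BRANCH_FREQ * 0.62, 3)

# b. Single delay slot: 1-cycle penalty on the 35% of slots left unfilled.
part_b = cpi_increase(BRANCH_FREQ * 0.35, 1)

# c. Load-use stalls: 24% loads, 40% of them feed the next instruction.
load_use_freq = 0.24 * 0.40
part_c = cpi_increase(load_use_freq, 3)
# Assumption: the original pipeline has a 1-cycle load-use stall.
part_c_original = cpi_increase(load_use_freq, 1)

print(f"a: {part_a:.4f}")
print(f"b: {part_b:.4f}")
print(f"c: {part_c:.4f} (difference from 1-cycle stall: "
      f"{part_c - part_c_original:.4f})")
```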

  4. Consider the dynamically scheduled execution of the following code sequence, where a reorder buffer (ROB) is used. Assume register F10 is initialized with the value 1. and memory locations 0(R1) and 0(R2) are initialized with 6 and 7 respectively. All other registers are initialized to 0. Consider the first iteration through the loop.

    1.  LOOP: L.D    F2, 0(R1)
    2.        L.D    F4, 0(R2)
    3.        MUL.D  F6, F2, F
    4.        SUB.D  F4, F6, F
    5.        DIV.D  F6, F12, F
    6.        S.D    F6, 0(R1)
    7.        ADDD   F8, F8, F
    8.        DADDIU R1, R1, #-
    9.        DADDIU R2, R2, #-
    10.       BNE    R1, R4 LOOP

a. Show a valid state of a 4-entry ROB when instruction 7 is issued. Identify the head and tail of the ROB.

             Destination   Value      Status
             F6            NO VALUE   PENDING
             0(R1)         NO VALUE   PENDING
    TAIL ->  F8            NO VALUE   PENDING
    HEAD ->  F4            41         COMPLETED

The head is the oldest entry (instruction 4, which writes F4) and the tail is the most recently allocated entry (instruction 7, which writes F8).

b. Register re-mapping is employed, where architectural registers are remapped to physical registers (PR). F6 in instruction 3 is remapped on issue to PR 9. When the DIV instruction reaches the head of the ROB, can PR 9 be freed? Justify your answer.

Yes. When the DIV.D is at the head, all instructions prior to it have committed, so every instruction that used the physical register mapped to F6 has completed. Therefore, PR 9 can be freed.
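The ROB state for part a can be reproduced with a toy model. This is a minimal sketch, not a description of any particular processor: the class and its method names are hypothetical, entries are kept in logical (program) order rather than in physical slot order, and the MUL.D result of 42 is an assumption because its second source operand is truncated in the listing.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: allocate in program order, commit in program order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()  # left end = head (oldest), right end = tail

    def issue(self, dest):
        """Allocate a new entry at the tail; fail if the ROB is full."""
        if len(self.entries) == self.size:
            raise RuntimeError("ROB full: issue must stall")
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    @staticmethod
    def complete(entry, value):
        """Record the result of an instruction that finished executing."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire the head entry if it has finished executing."""
        if self.entries and self.entries[0]["done"]:
            return self.entries.popleft()
        return None


rob = ReorderBuffer(4)
i1 = rob.issue("F2")        # 1. L.D
i2 = rob.issue("F4")        # 2. L.D
i3 = rob.issue("F6")        # 3. MUL.D
i4 = rob.issue("F4")        # 4. SUB.D
rob.complete(i1, 6.0)
rob.complete(i2, 7.0)
rob.complete(i3, 42.0)      # assumed product, see lead-in
for _ in range(3):          # instructions 1-3 commit, freeing three entries
    rob.commit()
rob.complete(i4, 41.0)      # completed but not yet committed
i5 = rob.issue("F6")        # 5. DIV.D
i6 = rob.issue("0(R1)")     # 6. S.D
i7 = rob.issue("F8")        # 7. ADDD

snapshot = [(e["dest"], e["value"], e["done"]) for e in rob.entries]
head, tail = snapshot[0], snapshot[-1]
print("head:", head)
print("tail:", tail)
```

Other snapshots are equally valid, for example one in which the SUB.D has not yet completed; the problem only asks for one consistent state.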

  5. Consider the dynamically scheduled execution of the following code sequence. The first time through the loop an exception occurs on the DIV instruction. Distinguish between how a precise exception will be handled using a ROB and a history buffer. How does register renaming affect or not affect the handling of exceptions?

    1.  LOOP: L.D    F2, 0(R1)
    2.        L.D    F4, 0(R2)
    3.        MUL.D  F6, F2, F
    4.        SUB.D  F4, F6, F
    5.        DIV.D  F6, F12, F
    6.        S.D    F6, 0(R1)
    7.        ADDD   F8, F8, F
    8.        DADDIU R1, R1, #-
    9.        DADDIU R2, R2, #-
    10.       BNE    R1, R4 LOOP

With an ROB: Exceptions for an executing instruction are flagged in its ROB entry, but not raised. The processor raises an exception associated with an instruction when that instruction reaches the head of the ROB. Since instructions in the ROB are allotted entries in program order, they are committed in program order. Instructions fetched speculatively on a mispredicted branch are never committed. Therefore, all exceptions are precise.

With an ROB, register renaming does not affect handling of precise exceptions. This is because register renaming does not affect how the instructions commit (always in program order) and exceptions can be raised only at commit time.

In the case of a history buffer, instructions are allocated history buffer entries that contain the old value (history) of the register being written. If an exception occurs, the corresponding history buffer entry is labeled. When the excepting instruction reaches the head of the history buffer, the history buffer is scanned from head to tail and all old values are restored using those in the history buffer. This is needed because instructions write directly to the register file.
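The difference between the two schemes can be sketched in a few lines. Everything in this example is hypothetical: the register values are made up, stores are left out, and the history-buffer rollback uses a first-wins rule (only the oldest saved value of each register is written back) so that a head-to-tail scan gives the right state even when a register was written more than once.

```python
def commit_from_rob(registers, rob):
    """ROB scheme: results reach the register file only at commit.

    Entries are visited head to tail (program order). The first entry
    flagged with an exception stops commit; it and all younger entries
    are discarded without touching `registers`.
    """
    for entry in rob:
        if entry["exception"]:
            return entry["op"]
        registers[entry["dest"]] = entry["value"]
    return None


def roll_back_history(registers, history):
    """History-buffer scheme: registers were already overwritten.

    Scans head to tail and writes back the oldest saved value of each
    register (first-wins rule, an assumption of this sketch).
    """
    restored = set()
    for dest, old_value in history:
        if dest not in restored:
            registers[dest] = old_value
            restored.add(dest)


# Hypothetical register state just before the DIV.D executes.
before_div = {"F6": 42.0, "F8": 0.0, "R1": 64}

# ROB: DIV.D is flagged; the younger ADDD and DADDIU wait behind it.
rob_regs = dict(before_div)
rob = [
    {"op": "DIV.D", "dest": "F6", "value": None, "exception": True},
    {"op": "ADDD", "dest": "F8", "value": 1.0, "exception": False},
    {"op": "DADDIU", "dest": "R1", "value": 56, "exception": False},
]
faulting_op = commit_from_rob(rob_regs, rob)

# History buffer: the same instructions already wrote the register file.
# F8 appears twice to show why the oldest saved value must win.
hb_regs = {"F6": -1.0, "F8": 2.0, "R1": 56}
history = [("F6", 42.0), ("F8", 0.0), ("R1", 64), ("F8", 1.0)]
roll_back_history(hb_regs, history)

print(faulting_op, rob_regs == before_div, hb_regs == before_div)
```

In both cases the register file ends up in the state it had before the DIV.D, which is what makes the exception precise; the ROB gets there by never writing early, the history buffer by undoing the early writes.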

  6. We have a machine capable of retiring up to 4 instructions per cycle from the ROB. Explain the conditions under which more than one instruction can indeed be retired in a single cycle.

Multiple consecutive instructions starting at the current head of the ROB must have completed execution and writeback of results (to their respective entries in the ROB). All these instructions can be committed in a single cycle. However, structural limitations (such as the number of write ports to the register file and the bandwidth to memory for committing stores) put hard limits on how many of the instructions at the head of the ROB may commit simultaneously.

Note that if multiple consecutive instructions write to the same destination register or memory location, the commit hardware can still commit them in the same cycle by ensuring that the value written to the destination register or memory location comes from the result of the last committed instruction that wrote it. Finally, all instructions are logically committed in program order.
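The retirement conditions can be written as a short selection loop. This is a sketch under stated assumptions: the port counts (`reg_write_ports`, `store_ports`) are hypothetical parameters, each register-writing instruction is charged one write port, and the same-destination merging described above is not modeled.

```python
def retire(rob, width=4, reg_write_ports=4, store_ports=1):
    """Return the ops that can commit this cycle, in program order.

    `rob` is a list with the head at index 0. Retirement stops at the
    first entry that has not completed, when `width` is reached, or
    when a structural resource (write port, store port) runs out.
    """
    retired = []
    reg_writes = 0
    stores = 0
    for entry in rob:
        if len(retired) == width or not entry["done"]:
            break
        if entry["is_store"]:
            if stores == store_ports:
                break
            stores += 1
        else:
            if reg_writes == reg_write_ports:
                break
            reg_writes += 1
        retired.append(entry["op"])
    return retired


rob = [
    {"op": "L.D", "done": True, "is_store": False},
    {"op": "MUL.D", "done": True, "is_store": False},
    {"op": "DIV.D", "done": False, "is_store": False},  # blocks younger ops
    {"op": "S.D", "done": True, "is_store": True},
]
print(retire(rob))
```

The completed S.D behind the unfinished DIV.D does not retire, which illustrates that completion alone is not enough: the completed instructions must be consecutive from the head.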