CS433 Midterm Exam for CS Class with Instructions on Pipelining and Tomasulo's Algorithm, Exams of Computer Architecture and Organization

A midterm exam for CS433, a computer science course. The exam covers pipelining, control hazards, exceptions, Tomasulo's algorithm, and speculative execution. Students answer questions about instruction execution, functional unit usage, and cycle occupancy in the IS, EX, WR, and CMT stages.

Typology: Exams

Pre 2010

Uploaded on 03/16/2009

CS433 Midterm
Prof Josep Torrellas
October 17, 2006
Time: 1 hour + 15 minutes
Name:
Alias:
Instructions:
1. This is a closed-book, closed-notes examination.
2. The exam has 3 questions. Please budget your time.
3. Calculators are allowed.
4. Please write your answers neatly. Good luck!
Problem No.    Max Points    Points Scored
1              50
2              60
3              40
Total          150

I Pipelining [50 points]

A. Control Hazards [25 points]

Suppose we have a MIPS processor with one delay slot for branches. Consider codes (a) through (c), where [ ] marks the delay slot to be filled:

       (a)                         (b)                         (c)
       ADD R1,R2,R3                ADD R1,R2,R3                ADD R1,R2,R
       NOP                         NOP                         NOP
       BEQZ R4 label               BEQZ R1 label               BEQZ R1 label
       [ ]                         [ ]                         [ ]
       ADD R10,R10,R10             ADD R7,R9,R10               ADD R13,R9,R
       JMP end                     JMP end                     JMP end
       NOP                         NOP                         NOP
label: ADD R6,R6,R6         label: ADD R7,R11,R12       label: ADD R14,R9,R
end:                        end:                        end:

a. [2 points] What is the best instruction to put in the delay slot in code (a)? Explain why.

ADD R1,R2,R3

b. [2 points] What is the best instruction to put in the delay slot in code (b)? Explain why.

NOP. No suitable instruction can be found.

c. [7 points] What is the best instruction to put in the delay slot in code (c) if R2+R3=0 60% of the time? Show the resulting code. In this case, what are the instructions executed when R2+R3=0, and what are the instructions executed when R2+R3!=0?

       ADD R1,R2,R
       NOP
       BEQZ R1 end
       ADD R14,R9,R
       ADD R13,R9,R
       JMP end
       NOP
label: ADD R14,R9,R
end:

Case 1 (R2+R3=0): instructions 1, 2, 3, 4.
Case 2 (R2+R3!=0): instructions 1, 2, 3, 4, 5, 6, 7.
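
As a rough check on the answer to part c, the average number of instructions executed per pass can be computed from the two cases. The sketch below is in Python and is not part of the exam. The function name `expected_instructions` is made up for illustration, and it assumes the taken path (R2+R3=0) executes 4 instructions and the not-taken path executes 7, as listed in the answer.

```python
from fractions import Fraction

def expected_instructions(p_taken, taken_count, not_taken_count):
    """Average instructions executed, weighting each path by its probability."""
    return p_taken * taken_count + (1 - p_taken) * not_taken_count

# Branch taken (R2+R3=0) 60% of the time: 4 instructions; otherwise 7.
avg = expected_instructions(Fraction(3, 5), 4, 7)
print(avg)  # 26/5, i.e. 5.2 instructions on average
```

Exact fractions are used instead of floats so the result is not blurred by rounding. The same function can be reused for part d by changing the probability and the two path lengths.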

d. [7 points] Repeat the whole question c if R2+R3=0 40% of the time.

B. Exceptions [25 points]

a. [5 points] What does it mean that a pipeline supports precise exceptions?

b. [4 points] How does a pipeline support precise exceptions?

c. [4 points] List one good thing and one bad thing about precise exceptions.

d. [4 points] What is a statically scheduled machine, and what is a dynamically scheduled machine?

e. [4 points] What are superscalar and VLIW machines? Are they static or dynamic machines?

f. [4 points] How is the reorder buffer related to exception handling?

II Tomasulo’s Algorithm and Speculative Execution [60 points]

Consider the following code fragment:

      DADDI R1, R0, #2
LOOP: L.D   F0, 0(R2)
      MUL.D F2, F0, F2
      DADDI R2, R2, #32
      S.D   F2, 0(R2)
      ADD.D F4, F2, F4
      DSUBI R1, R1, #1
      BNEZ  R1, LOOP

It runs on a system with the following specifications. The assumptions are the same as in the homework except for those in bold typeface.

  • Assume a single-issue machine with unlimited reservation stations and the pipeline functional units described by Table 1.

Functional Unit    Cycles in EX    # Functional Units
Integer            1               1
FP add             3               1
FP multiply        8               1

Table 1: Functional Unit Specification

  • Functional units are not pipelined.
  • All stages except EX take one cycle to complete.
  • There is no forwarding between functional units. Both integer and floating point results are communicated through the CDB.
  • Memory accesses and branches use the integer functional unit to perform address calculations. All loads and stores access memory during the EX stage and take one cycle to execute.
  • There are unlimited load/store buffers and an infinite instruction queue.
  • If an instruction is in the WR stage in cycle x, then an instruction that is waiting on the same functional unit (due to a structural hazard) can begin execution in cycle x.
  • Only one instruction can write to the CDB in a clock cycle.
  • Branches and stores do not need the CDB. When executing with speculative hardware, branches and stores can commit in the cycle that follows EX if there is no other constraint that prevents it.
  • Whenever there is a conflict for a functional unit or the CDB, assume program order.
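
The occupancy rules above can be sketched in a few lines of Python (not part of the exam). The names `EX_CYCLES`, `ex_span`, and `earliest_next_start` are hypothetical helpers. The sketch assumes the latencies of Table 1, non-pipelined units, and no CDB conflict, so that WR falls in the cycle right after the last EX cycle.

```python
# Cycles spent in EX per functional unit, taken from Table 1.
EX_CYCLES = {"Integer": 1, "FP add": 3, "FP multiply": 8}

def ex_span(start_cycle, unit):
    """Return (first, last) EX cycle for an instruction that starts EX at start_cycle."""
    return (start_cycle, start_cycle + EX_CYCLES[unit] - 1)

def earliest_next_start(start_cycle, unit):
    """Earliest cycle a waiting instruction can start EX on the same unit.

    Assumes the previous instruction writes back in the cycle after its
    last EX cycle (no CDB conflict); the rules let the waiting
    instruction begin EX in that same cycle.
    """
    _, last = ex_span(start_cycle, unit)
    return last + 1

print(ex_span(5, "FP multiply"))              # (5, 12)
print(earliest_next_start(5, "FP multiply"))  # 13
```

Because the units are not pipelined, a second multiply cannot overlap the first; it has to wait until the unit is released.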

B. Tomasulo’s Algorithm with Speculative Execution [30 points]

Now, assume the architecture above, except with hardware speculation. Assume that the reorder buffer has four entries, named 0, 1, 2, and 3. Only one ROB entry can commit per cycle. Complete Table 3, including:

  • The ROB entry used by each instruction.
  • The cycles that each instruction occupies in the IS, EX, WR, and CMT stages.
  • Comments to justify your answer such as type of hazards/stalls and the registers/ROB entries involved.

Instruction          ROB   IS   EX      WR   CMT   Comments (if appropriate)
DADDI R1, R0, #2     0     1    2       3    4
L.D F0, 0(R2)        1     2    3       4    5
MUL.D F2, F0, F2     2     3    5-12    13   14    RAW (L.D F0)
DADDI R2, R2, #32    3     4    5       6    15    in-order CMT
S.D F2, 0(R2)        0     5    14      -    16    RAW (MUL.D F2), in-order CMT
ADD.D F4, F2, F4     1     6    14-16   17   18    RAW (MUL.D F2)
DSUBI R1, R1, #1     2     15   16-17   18   19    No ROB until 15, CDB conflict
BNEZ R1, LOOP        3     16   19      -    20    RAW (DSUBI R1)
L.D F0, 0(R2)        0     17   18      19   21    in-order CMT
MUL.D F2, F0, F2     1     19   20-27   28   29    No ROB until 19
DADDI R2, R2, #32    2     20   21      22   30    in-order CMT

Table 3: Tomasulo’s Algorithm with Speculative Execution
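
A filled-in schedule like Table 3 can be sanity-checked mechanically. The Python sketch below (not part of the exam) encodes the rows of Table 3 and tests a few of the stated rules: single issue, in-order commit with one commit per cycle, one CDB write per cycle, and a four-entry ROB. The function `check_schedule` is a hypothetical helper. It assumes ROB entries are allocated round-robin and that an entry can be reused only in a cycle after its previous owner commits, which is how the table reads. It does not check functional unit latencies or data dependences.

```python
# (instruction, ROB entry, IS, first EX, last EX, WR or None, CMT), copied from Table 3.
TABLE3 = [
    ("DADDI R1, R0, #2",  0, 1,  2,  2,  3,    4),
    ("L.D F0, 0(R2)",     1, 2,  3,  3,  4,    5),
    ("MUL.D F2, F0, F2",  2, 3,  5,  12, 13,   14),
    ("DADDI R2, R2, #32", 3, 4,  5,  5,  6,    15),
    ("S.D F2, 0(R2)",     0, 5,  14, 14, None, 16),
    ("ADD.D F4, F2, F4",  1, 6,  14, 16, 17,   18),
    ("DSUBI R1, R1, #1",  2, 15, 16, 17, 18,   19),
    ("BNEZ R1, LOOP",     3, 16, 19, 19, None, 20),
    ("L.D F0, 0(R2)",     0, 17, 18, 18, 19,   21),
    ("MUL.D F2, F0, F2",  1, 19, 20, 27, 28,   29),
    ("DADDI R2, R2, #32", 2, 20, 21, 21, 22,   30),
]

def check_schedule(rows, rob_size=4):
    """Return a list of rule violations found in the schedule (empty if none)."""
    problems = []
    for i, (name, rob, issue, ex_first, ex_last, wr, cmt) in enumerate(rows):
        if rob != i % rob_size:  # assumed round-robin ROB allocation
            problems.append(f"{i}: {name}: unexpected ROB entry {rob}")
        if not (issue < ex_first <= ex_last):
            problems.append(f"{i}: {name}: stages out of order")
        if wr is not None and wr <= ex_last:
            problems.append(f"{i}: {name}: WR before EX finishes")
        last_stage = wr if wr is not None else ex_last
        if cmt <= last_stage:
            problems.append(f"{i}: {name}: CMT too early")
        if i > 0:
            if issue <= rows[i - 1][2]:
                problems.append(f"{i}: {name}: more than one issue per cycle")
            if cmt <= rows[i - 1][6]:
                problems.append(f"{i}: {name}: commit not in order")
        if i >= rob_size and issue <= rows[i - rob_size][6]:
            problems.append(f"{i}: {name}: issued before its ROB entry was free")
    wr_cycles = [row[5] for row in rows if row[5] is not None]
    if len(wr_cycles) != len(set(wr_cycles)):
        problems.append("two instructions write the CDB in the same cycle")
    return problems

print(check_schedule(TABLE3))  # [] means none of the checked rules is violated
```

An empty list only says that the checked rules hold; it does not prove the schedule is the earliest possible one.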

III Software ILP [40 points]

Consider the following machine.

  • There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, logic operations, and branch operations.
  • There is 1 FP/integer multiplier, taking 7 cycles to perform any multiply. It is pipelined.
  • There is 1 FP adder, taking 4 cycles to perform FP additions and subtractions. It is pipelined.
  • Branches are resolved in the ID stage.
  • There is one branch delay slot.
  • There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores.
  • Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the effective address calculation.
  • There are as many registers, both FP and integer, as you need.
  • While the hardware has full forwarding/bypassing, it is the responsibility of the compiler to schedule such that the operands of each instruction are available when needed by each instruction.
  • If unspecified, the machine's properties are like those of the MIPS pipeline we studied in class.

Now consider this code fragment:

loop: L.D    F2, 0(R1)
      L.D    F4, 0(R2)
      DADDUI R1, R1, #
      DADDUI R2, R2, #
      ADD.D  F6, F2, F2
      MUL.D  F8, F4, F4
      ADD.D  F10, F6, F8
      S.D    F10, 0(R3)
      DADDUI R3, R3, #
      DSUBUI R4, R4, #
      BNEZ   R4, loop

B. Software Pipelining [12 points] Software pipeline the loop and reorder the instructions to reduce stalls. Don’t write the startup or cleanup code.

loop: S.D    F10, -24(R3)   //x-
      ADD.D  F10, F6, F8    //x-
      ADD.D  F6, F2, F2     //x-
      MUL.D  F8, F4, F4     //x-
      L.D    F2, 0(R1)      //x
      L.D    F4, 0(R2)      //x
      DADDUI R1, R1, #
      DSUBUI R3, R3, #
      DADDUI R2, R2, #
      BNEZ   R4, loop
      DADDUI R4, R4, #
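
The idea behind the software-pipelined loop can be illustrated in Python (not part of the exam): each pass of the kernel works on several different iterations at once, and the results match those of the plain loop. The function names are made up for illustration. The stage distances (store three iterations behind the loads, final add two behind, first add and multiply one behind) are an assumption read off the instruction order and the -24(R3) offset, and the `if` guards stand in for the startup and cleanup code that the exam asks you to omit.

```python
def plain_loop(a, b):
    """Reference loop: each iteration loads, adds, multiplies, adds, and stores."""
    out = []
    for x, y in zip(a, b):
        f6 = x + x           # ADD.D F6, F2, F2
        f8 = y * y           # MUL.D F8, F4, F4
        out.append(f6 + f8)  # ADD.D F10, F6, F8 followed by S.D
    return out

def software_pipelined(a, b):
    """Same computation, with each kernel pass touching four different iterations."""
    n = len(a)
    out = [None] * n
    f2 = f4 = f6 = f8 = f10 = None
    for t in range(n + 3):       # 3 extra passes drain the pipeline
        if t >= 3:
            out[t - 3] = f10     # S.D for iteration t-3
        if 0 <= t - 2 < n:
            f10 = f6 + f8        # final ADD.D for iteration t-2
        if 0 <= t - 1 < n:
            f6 = f2 + f2         # ADD.D for iteration t-1
            f8 = f4 * f4         # MUL.D for iteration t-1
        if t < n:
            f2, f4 = a[t], b[t]  # loads for iteration t
    return out

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [0.5, 1.5, 2.5, 3.5, 4.5]
print(software_pipelined(a, b) == plain_loop(a, b))  # True
```

Within one kernel pass, every value is consumed before the statement that overwrites it, which is why the store comes first and the loads come last.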

C. Short Answer [8 points]

a. [4 points] Name one advantage and one disadvantage of loop unrolling.

Advantages: more ILP, fewer overhead instructions.
Disadvantages: code size increases, register pressure increases, problem becomes worse in multiple-issue processors.

b. [4 points] Name one advantage of using VLIW and one problem with the original VLIW model.

Advantages: keeps more FUs busy by issuing multiple instructions, simpler hardware.
Problems: code size increase, limitations of lockstep operation, binary compatibility, finding parallelism.