CS433ug Midterm Exam Questions and Solutions for MIPS Pipeline and Tomasulo's Algorithm

The CS433ug midterm exam for computer organization, with questions and solutions on the modified and original MIPS pipelines, Tomasulo's algorithm, and software ILP. Students can use this document as a study resource for these topics.

Uploaded on 03/16/2009

CS433ug Midterm
Prof Josep Torrellas
March 8, 2007
Time: 1 hour + 15 minutes
Name:
Alias:
Instructions:
1. This is a closed-book, closed-notes examination.
2. The Exam has 3 Questions. Please budget your time.
3. Calculators are allowed.
4. Please write your answers neatly. Good luck!
Problem No.   Max Points   Points Scored
1             40
2             40
3             40
Total         120


I MIPS Pipeline [40 points]

A. Modified MIPS Pipeline [20 points]

We change the MIPS pipeline to have the following structure.

  • IF: instruction fetch.
  • ID: register read and instruction decode.
  • ALU1: first stage of execution. Branch condition is completed in this cycle, as is the address of the branch target; ALU operands are needed at the beginning of this cycle, as is the base register for loads and stores.
  • ALU2: second stage of execution; ALU results available in this cycle; effective address for loads and stores is available.
  • MEM1: first cycle of data memory access; only address is needed.
  • MEM2: second cycle of data memory access; store data needed at the beginning of the cycle, load data available in this cycle.
  • WB: write back of results.

Assume the register file operates on split cycles as discussed in Appendix A, so as to minimize bypassing requirements.

a) How many branch delay slots are there? Why?

Solution: 2 delay slots. The branch target is not available until the end of the ALU1 stage (which is 2 cycles after the fetch).

b) Assuming all possible forwarding is supported, how many stall cycles do we have in the following case? (Draw a pipeline picture.)

ADD R1 R2 R
ADD R7 R1 R

Solution: 1 stall.

IF    ID    ALU1  ALU2  MEM1  MEM2  WB
      IF    ID    Stall ALU1  ALU2  MEM1  MEM2  WB

c) Repeat b) for:

LOAD R1, 10(R5)
ADD R7 R1 R

Solution: 3 stalls.

IF    ID    ALU1  ALU2  MEM1  MEM2  WB
      IF    ID    Stall Stall Stall ALU1  ALU2  MEM1  MEM2  WB
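The stall counts in parts b) and c), and the delay-slot count in part a), follow from comparing the stage in which a result becomes available with the stage in which it is needed. The following is a minimal Python sketch of that arithmetic, assuming full forwarding and a consumer that immediately follows its producer. The names `STAGES`, `stall_cycles`, and `branch_delay_slots` are illustrative and do not come from the exam.

```python
# Stage order of the modified pipeline described in part A.
STAGES = ["IF", "ID", "ALU1", "ALU2", "MEM1", "MEM2", "WB"]


def stall_cycles(produced_in, needed_in, distance=1):
    """Stalls for a consumer issued `distance` cycles after its producer.

    produced_in: stage at whose end the producer's result is available.
    needed_in:   stage at whose beginning the consumer needs the operand.
    Assumes full forwarding (an assumption matching part b).
    """
    produced = STAGES.index(produced_in)
    needed = STAGES.index(needed_in)
    # The consumer may enter `needed_in` no earlier than the cycle after
    # the producer finishes `produced_in`.
    return max(0, produced + 1 - (needed + distance))


def branch_delay_slots(resolved_in):
    """Instructions fetched before the target is known at the end of `resolved_in`."""
    return STAGES.index(resolved_in)


print(stall_cycles("ALU2", "ALU1"))  # ADD followed by dependent ADD -> 1
print(stall_cycles("MEM2", "ALU1"))  # LOAD followed by dependent ADD -> 3
print(branch_delay_slots("ALU1"))    # branch resolved in ALU1 -> 2
```

The same function can be used to check other producer/consumer pairs, for example a dependent instruction two slots later with `distance=2`.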

II Tomasulo’s Algorithm and Speculative Execution [40 points]

Consider the following code fragment

LOOP:   L.D F0, 0(R1)
        L.D F2, 8(R1)
        MUL.D F4, F2, F0
        ADD.D F4, F4, F0
        S.D F4, 0(R2)
        DADDI R2, R2, #8
        DSUBI R2, R1, #16
        BNEZ R1, LOOP

running on a system with the following specifications, noting that the assumptions are the same as in the homework except for those in bold typeface.

  • Assume a single-issue machine with unlimited reservation stations and the pipeline functional units described by Table 1.

Functional Unit   Cycles in EX   # Functional Units
Integer           1              1
FP Add            3              1
FP Multiply       8              1

Table 1: Functional Unit Specification

  • Functional units are not pipelined.
  • All stages except EX take one cycle to complete.
  • All forwarding occurs through the CDB in the WR stage.
  • Loads and stores take one cycle to execute. During that cycle, they use the integer functional unit to perform effective address calculation and, in addition, they access memory.
  • There are unlimited load/store buffers and an infinite instruction queue.
  • Branches are resolved in the EX stage.
  • If an instruction is in the WR stage in cycle x, then an instruction that is waiting on the same functional unit (due to a structural hazard) can begin execution in cycle x + 1.
  • Only one instruction can write to the CDB in a clock cycle.
  • Branches and stores do not need the CDB.
  • Whenever there is a conflict for a functional unit or the CDB, assume program order.
  • When an instruction is done executing in its functional unit and is waiting for the CDB, it is still occupying the functional unit and its reservation station (meaning no other instruction may enter).
  • Treat the BNEZ instruction as an Integer instruction. Assume the L.D instruction after the BNEZ can be issued the cycle after the BNEZ instruction is issued, due to branch prediction.

A. Tomasulo’s Algorithm [20 points]

Complete Table 2 using Tomasulo’s algorithm for the given code fragment with no hardware speculation for branches. Include:

  • The functional unit used by each instruction. For structural hazards, assume program order.
  • The cycles that each instruction occupies in the IS, EX, and WR stages.
  • Comments to justify your answer such as type of hazards and the registers involved in the hazard.

Instruction          Funct. Unit   IS   EX      WR       Comments (if appropriate)
L.D F0, 0(R1)        Integer       1    2       3
L.D F2, 8(R1)        Integer       2    3       4
MUL.D F4, F2, F0     FP Mul        3    5-12    13       RAW F
ADD.D F4, F4, F0     FP Add        4    14-16   17       RAW F
S.D F4, 0(R2)        Integer       5    18      19 (-)   RAW F
DADDI R2, R2, #8     Integer       6    7       8
DSUBI R2, R1, #16    Integer       7    8       9
BNEZ R1, LOOP        Integer       8    9       10 (-)
L.D F0, 0(R1)        Integer       9    10      11

Table 2: Execution profile using Tomasulo’s Algorithm
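The IS/EX/WR columns of Table 2 can be cross-checked with a small dependence-only timing model. The Python sketch below is a simplification and not a full Tomasulo simulator: it assumes one instruction issues per cycle in program order, an operand written in cycle x is usable from cycle x + 1, and it ignores structural hazards on functional units and CDB arbitration. The names `LATENCY`, `PROGRAM`, and `schedule` are illustrative; the latencies are taken from Table 1 and the instructions are transcribed from the rows of Table 2. For stores and branches the computed WR value is only a placeholder, since they do not use the CDB.

```python
# Cycles spent in EX per functional unit, from Table 1.
LATENCY = {"Integer": 1, "FP Add": 3, "FP Mul": 8}

# (instruction, functional unit, destination register or None, source registers),
# transcribed from the rows of Table 2.
PROGRAM = [
    ("L.D F0, 0(R1)", "Integer", "F0", ["R1"]),
    ("L.D F2, 8(R1)", "Integer", "F2", ["R1"]),
    ("MUL.D F4, F2, F0", "FP Mul", "F4", ["F2", "F0"]),
    ("ADD.D F4, F4, F0", "FP Add", "F4", ["F4", "F0"]),
    ("S.D F4, 0(R2)", "Integer", None, ["F4", "R2"]),
    ("DADDI R2, R2, #8", "Integer", "R2", ["R2"]),
    ("DSUBI R2, R1, #16", "Integer", "R2", ["R1"]),
    ("BNEZ R1, LOOP", "Integer", None, ["R1"]),
    ("L.D F0, 0(R1)", "Integer", "F0", ["R1"]),
]


def schedule(program):
    """Return (instruction, IS, EX start, EX end, WR) using RAW dependences only."""
    written_in = {}  # register -> WR cycle of its most recent producer
    rows = []
    for issue, (text, unit, dest, sources) in enumerate(program, start=1):
        # An operand broadcast on the CDB in cycle x can be used from cycle x + 1.
        operands_ready = max((written_in.get(reg, 0) for reg in sources), default=0)
        ex_start = max(issue, operands_ready) + 1
        ex_end = ex_start + LATENCY[unit] - 1
        wr = ex_end + 1
        if dest is not None:
            written_in[dest] = wr
        rows.append((text, issue, ex_start, ex_end, wr))
    return rows


for row in schedule(PROGRAM):
    print(row)
```

For this fragment the printed cycles agree with the IS, EX, and WR columns of Table 2, for example `('MUL.D F4, F2, F0', 3, 5, 12, 13)`. A different instruction mix could expose functional-unit or CDB conflicts that this sketch does not model.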

III Software ILP [40 points]

Consider the following machine.

  • There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, logic operations, and branch operations.
  • There is 1 FP/integer multiplier, taking 8 cycles to perform any multiply. It is pipelined.
  • There is 1 FP adder, taking 3 cycles to perform FP additions and subtractions. It is pipelined.
  • Branches are resolved in the ID stage.
  • There is one branch delay slot.
  • There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores.
  • Loads and stores spend 1 cycle in the MEM stage after the effective address calculation.
  • There are as many registers, both FP and Integer, as you need.
  • While the hardware has full forwarding/bypassing, it is the responsibility of the compiler to schedule such that the operands of each instruction are available when needed by each instruction.
  • Any unspecified properties are like those of the MIPS pipeline we studied in class.

Now consider this code fragment:

loop:   L.D F0, 0(R1)
        L.D F2, 8(R1)
        ADD.D F4, F2, F
        MUL.D F6, F0, F
        ADD.D F6, F6, F
        S.D F6, 0(R2)
        DADDUI R1, R1, #
        DADDUI R2, R2, #
        DSUBUI R3, R3, #
        BNEZ R3, loop

A. Loop Unrolling [30 points]

a. [15 points] Reschedule the code to minimize stalls. How many stalls are there? Please show the resulting code. Solution:

loop:   L.D F0, 0(R1)
        L.D F2, 8(R1)
        MUL.D F6, F0, F
        ADD.D F4, F2, F
        DADDUI R1, R1, #
        DADDUI R2, R2, #
        DSUBUI R3, R3, #
        3 STALLS
        ADD.D F6, F6, F
        BNEZ R3, loop
        S.D F6, -8(R2)

Answer: 3 stalls

B. Short Answer [10 points]

a. [5 points] How does loop unrolling improve performance? What are 2 disadvantages of loop unrolling?

Solution: Increases ILP in each iteration; fewer overhead instructions. Disadvantages: code size increases, register pressure increases.

b. [5 points] What are 2 differences between dynamically scheduled superscalar and VLIW processors?

Solution:

  • Superscalar: issues multiple arbitrary instructions; instructions are dynamically scheduled; if an instruction cannot be issued, it is not issued.
  • VLIW: issues a fixed number of different types of instructions; instructions are packaged together at compile time; if parallel instructions cannot be found, a NOP is put in the slot.
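The "fewer overhead instructions" point in short answer a. can be made concrete with the loop from Problem III, which has 6 body instructions (two loads, three FP operations, one store) and 4 overhead instructions (two DADDUI, one DSUBUI, one BNEZ). The Python sketch below is only an illustration and assumes the pointer and counter updates of the unrolled copies are merged, so the overhead is paid once per unrolled body; the names `BODY`, `OVERHEAD`, and `instructions_per_iteration` are hypothetical.

```python
from fractions import Fraction

BODY = 6      # L.D, L.D, ADD.D, MUL.D, ADD.D, S.D
OVERHEAD = 4  # DADDUI, DADDUI, DSUBUI, BNEZ


def instructions_per_iteration(unroll_factor):
    """Instructions executed per original iteration after unrolling.

    Assumes the pointer and counter updates of the unrolled copies are
    merged, so the overhead is paid once per unrolled loop body.
    """
    return Fraction(BODY * unroll_factor + OVERHEAD, unroll_factor)


for k in (1, 2, 4):
    print(k, instructions_per_iteration(k))
# prints:
# 1 10
# 2 8
# 4 7
```

The static loop body grows from 10 to 6k + 4 instructions for unroll factor k, which is the code size cost named in the solution.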