CS433ug Comp Sys Org Spring 2007 HW3: Loop Unrolling & VLIW Scheduling, Assignments of Computer Architecture and Organization

Instructions for Homework 3 of the CS433ug Computer Systems Organization course, Spring 2007. The homework covers loop unrolling and VLIW scheduling for a given pipeline architecture. Students write out the instruction issued in each cycle, account for stalls, and minimize the number of cycles before the second iteration begins.

Uploaded on 03/09/2009

koofers-user-bpr
CS433ug: Computer Systems Organization Spring 2007
Homework 3
Assigned: 2/20
Due in class 3/6
Instructions:
Please write your name, NetID and an alias on your homework submissions for posting grades (If you don’t
want your grades posted, then don’t write an alias). We will use this alias throughout the semester.
Homeworks are due in class on the date posted.
1. Loop Unrolling [20 points]
In this problem, we'll use the pipeline shown in Figure A.31 on page A.50 of your book. Its characteristics
are:
- If unspecified, its properties are like those in the MIPS pipeline we studied in class
- There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, logic operations, and branch operations
- There is 1 FP/integer multiplier, taking 7 cycles to perform any multiply. It is pipelined
- There is 1 FP adder, taking 3 cycles to perform FP additions and subtractions. It is pipelined
- There is 1 FP/integer divider, taking 20 cycles. It is not pipelined
- There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores
- Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the effective address calculation
- There are as many registers, both FP and integer, as you need
- There is one branch delay slot
- While the hardware has full forwarding/bypassing, it is the responsibility of the compiler to schedule the code so that the operands of each instruction are available when that instruction needs them
Consider the following loop:
loop L.D F2, 0(R1)
ADD.D F2, F2, F2
L.D F4, 0(R2)
MUL.D F4, F4, F2
ADD.D F6, F2, F4
S.D F6, 0(R3)
DADDUI R1, R1, #8
DADDUI R2, R2, #16
DADDUI R3, R3, #16
DSUBUI R5, R5, #2
BNEZ R5, loop
(a) Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row
take a cycle. If an instruction can't issue on a given cycle (because the current instruction has a
dependency that will not be resolved in time), write STALL instead, and move on to the next
cycle to see if it can issue then. Explain all stalls, but don't reorder instructions. How many cycles
elapse before the second iteration begins? Show your work (i.e. write instructions in order as they
would be executed and write STALL in between instructions for each cycle where there would be
a stall). Remember there is 1 branch delay slot. [5 points]
Solution:
loop L.D F2, 0(R1)
STALL RAW F2
ADD.D F2, F2, F2
L.D F4, 0(R2)
STALL RAW F4
MUL.D F4, F4, F2

STALL RAW F4
STALL RAW F4
STALL RAW F4
STALL RAW F4
STALL RAW F4
STALL RAW F4
ADD.D F6, F2, F4
STALL RAW F6
S.D F6, 0(R3)
DADDUI R1, R1, #8
DADDUI R2, R2, #16
DADDUI R3, R3, #16
DSUBUI R5, R5, #2
STALL BR Resolved in ID
BNEZ R5, loop
STALL NOP/BR Delay
Answer: 22 Cycles
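The stall counting in part (a) can be reproduced mechanically. The following is a minimal Python sketch, not part of the original handout: the function name `schedule_in_order`, the instruction tuples, and the latency table are illustrative assumptions. It models a load as a 2-cycle producer (EX plus MEM), lets store data arrive one cycle later because it is forwarded to MEM, and requires branch operands one cycle earlier because the branch is resolved in ID.

```python
# Producer latencies in cycles, read off the problem statement.
# "load" is 2 because a load spends one cycle in EX and one in MEM (assumed model).
LATENCY = {"load": 2, "fadd": 3, "fmul": 7, "int": 1}


def schedule_in_order(program):
    """Return one row per cycle, with "STALL" rows where operands are not ready."""
    ready = {}   # register -> first cycle at which an ordinary consumer may issue
    cycle = 0    # cycle in which the previous instruction issued
    rows = []
    for text, kind, dest, sources in program:
        earliest = cycle + 1
        for position, reg in enumerate(sources):
            needed = ready.get(reg, 0)
            if kind == "store" and position == 0:
                needed -= 1  # store data is forwarded to MEM, one stage after EX
            elif kind == "branch":
                needed += 1  # branch is resolved in ID, one stage before EX
            earliest = max(earliest, needed)
        rows.extend(["STALL"] * (earliest - cycle - 1))
        rows.append(text)
        cycle = earliest
        if dest is not None:
            ready[dest] = cycle + LATENCY[kind]
    return rows


# (text, kind, destination, sources); for stores the first source is the data.
PART_A = [
    ("L.D F2, 0(R1)", "load", "F2", ["R1"]),
    ("ADD.D F2, F2, F2", "fadd", "F2", ["F2", "F2"]),
    ("L.D F4, 0(R2)", "load", "F4", ["R2"]),
    ("MUL.D F4, F4, F2", "fmul", "F4", ["F4", "F2"]),
    ("ADD.D F6, F2, F4", "fadd", "F6", ["F2", "F4"]),
    ("S.D F6, 0(R3)", "store", None, ["F6", "R3"]),
    ("DADDUI R1, R1, #8", "int", "R1", ["R1"]),
    ("DADDUI R2, R2, #16", "int", "R2", ["R2"]),
    ("DADDUI R3, R3, #16", "int", "R3", ["R3"]),
    ("DSUBUI R5, R5, #2", "int", "R5", ["R5"]),
    ("BNEZ R5, loop", "branch", None, ["R5"]),
    ("NOP (branch delay slot)", "int", None, []),
]

rows = schedule_in_order(PART_A)
for number, row in enumerate(rows, start=1):
    print(number, row)
print(len(rows))  # 22
```

Under these assumptions the sketch inserts ten STALL rows and ends with the delay-slot NOP in cycle 22, which agrees with the answer above. Feeding it a reordered instruction list is one way to check a candidate schedule for remaining stalls.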

(b) Now reschedule the loop. You can change immediate values and memory offsets and reorder instructions, but don't change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work. [5 points]
Solution:
There are several solutions; this is just one.
Notes: No stall was accepted between the DSUBUI and the BNEZ here because you could have swapped the DSUBUI with another stall or a DADDUI. I posted on the newsgroup, without thinking, to put a NOP in the branch delay slot for (b) and (c), so if you did that you didn't lose points. If you did use the NOP, there would be a stall after the DSUBUI, and the S.D and BNEZ instructions would be swapped.
loop L.D F2, 0(R1)
L.D F4, 0(R2)
ADD.D F2, F2, F2
DADDUI R1, R1, #8
DADDUI R2, R2, #16
MUL.D F4, F4, F2
DADDUI R3, R3, #16
DSUBUI R5, R5, #2
STALL RAW F4
STALL RAW F4
STALL RAW F4
STALL RAW F4
ADD.D F6, F2, F4
BNEZ R5, loop
S.D F6, -16(R3)

Answer: 15 Cycles

(c) Now unroll and reschedule the loop the minimum number of times needed to eliminate all stalls. You can, and should, remove redundant instructions. How many times did you unroll the loop and how many cycles elapse before the next iteration of the loop begins? Don't worry about clean-up code. Show your work. [10 points]
Solution:
This is one solution. Registers used could be different.

Notes: There's a lot of flexibility in the order of the LD, MUL and the first ADD instruction in the loop.

loop L.D F2, 0(R1)
L.D F8, 8(R1)
L.D F14, 16(R1)
ADD.D F2, F2, F2
ADD.D F8, F8, F8
ADD.D F14, F14, F14
L.D F4, 0(R2)
L.D F10, 16(R2)
L.D F16, 32(R2)
MUL.D F4, F4, F2
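The register renaming and offset pattern visible in the listing above (F2, F8, F14 loaded from 0, 8, 16 off R1; F4, F10, F16 from 0, 16, 32 off R2) can be generated mechanically. The following is a rough Python sketch under stated assumptions: the function name `unroll`, the spacing of 6 between FP register sets, and the merged DADDUI/DSUBUI immediates are illustrative choices that follow the pattern in the listing, not something the handout spells out. The output is unrolled but unscheduled, so it still has to be reordered by hand to remove stalls.

```python
import re

# Loop body without the pointer/counter updates (taken from the original loop).
BODY = [
    "L.D F2, 0(R1)",
    "ADD.D F2, F2, F2",
    "L.D F4, 0(R2)",
    "MUL.D F4, F4, F2",
    "ADD.D F6, F2, F4",
    "S.D F6, 0(R3)",
]
# Per-iteration pointer increments from the original DADDUI instructions.
STRIDE = {"R1": 8, "R2": 16, "R3": 16}
# Assumed spacing between FP register sets: F2/F4/F6, F8/F10/F12, F14/F16/F18.
FP_REGS_PER_COPY = 6


def unroll(copies):
    """Return the unrolled, unscheduled loop body as a list of instruction strings."""
    out = []
    for k in range(copies):
        for line in BODY:
            # Give each copy its own FP registers.
            line = re.sub(
                r"F(\d+)",
                lambda m: f"F{int(m.group(1)) + FP_REGS_PER_COPY * k}",
                line,
            )
            # Shift each memory offset by k times the stride of its base register.
            line = re.sub(
                r"(-?\d+)\((R\d+)\)",
                lambda m: f"{int(m.group(1)) + STRIDE[m.group(2)] * k}({m.group(2)})",
                line,
            )
            out.append(line)
    # The redundant per-copy updates collapse into one update per register.
    for reg, stride in STRIDE.items():
        out.append(f"DADDUI {reg}, {reg}, #{stride * copies}")
    out.append(f"DSUBUI R5, R5, #{2 * copies}")
    out.append("BNEZ R5, loop")
    return out


unrolled = unroll(3)
print("\n".join(unrolled))
```

With three copies this produces 18 body instructions plus five loop-overhead instructions; the scheduling step that interleaves them is left to the reader.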

ROB entry  Instruction         Funct Unit  IS  EX     WR  CMT  Stall Reason
0          L.D F2, 0(R1)       Integer     1   2      3   4
1          ADD.D F2, F2, F2    FP Add/Sub  1   4-6    7   8    RAW on F2
2          L.D F4, 0(R2)       Integer     2   3      4   8    Wait for in-order commit
3          MUL.D F4, F4, F2    FP Mul      2   7-13   14  15   RAW on F2
4          ADD.D F6, F2, F4    FP Add/Sub  3   14-16  17  18   RAW on F4
5          S.D F6, 0(R3)       Integer     3   17     18  19   RAW on F6
6          DADDUI R1, R1, #8   Integer     4   5      6   19   Wait for in-order commit
7          DADDUI R2, R2, #16  Integer     4   6      7   20   Wait for FU, in-order commit
0          DADDUI R3, R3, #16  Integer     5   7      8   20   Wait for FU, in-order commit
1          DSUBUI R5, R5, #2   Integer     9   10     11  21   Wait for ROB, in-order commit
2          BNEZ R5, loop       Integer     9   11     12  21   Wait for ROB, in-order commit

(b) Based on your answer for part (a), what is the bottleneck in issuing instructions? What is a simple way to remove this bottleneck? [2 points]
Solution: The reorder buffer size is the bottleneck when issuing instructions. For example, in the first iteration the last two instructions have to wait for buffer entries to be freed. Simply increasing the reorder buffer size lets us issue more instructions.

(c) Based on your answer for part (a), does this piece of code benefit significantly from being executed on a multiple issue/commit processor? Why or why not? [2 points]
Solution: Answers may vary, but the main point is that it does not benefit much from multiple issue. Some instructions can execute a bit sooner than they would on a single-issue processor, but there is not much ILP to exploit in this code without techniques like loop unrolling or software pipelining, except perhaps in the integer operations. Those operations, however, all compete for the single integer functional unit and execute in a single cycle anyway.
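The reorder-buffer bottleneck from part (b) can be illustrated with a small calculation. This is a hedged Python sketch, not the course's model: the function name `issue_cycles` is hypothetical, the issue width of 2 and the ROB size of 8 are inferred from the part (a) table (two instructions share each IS cycle, and entry numbers wrap after 7), and an entry is assumed to be reusable in the cycle after its previous occupant commits. The commit cycles are copied from the CMT column and are not recomputed.

```python
def issue_cycles(commit, rob_size, width=2):
    """Estimate in-order issue cycles given commit cycles and a ROB size.

    commit[i] is the commit cycle of instruction i; only the entries that
    free a ROB slot for a later instruction are actually read.
    """
    issue = []
    for i in range(len(commit)):
        # In-order issue: never earlier than the previous instruction.
        earliest = 1 if i == 0 else issue[i - 1]
        # At most `width` instructions issue per cycle.
        if i >= width:
            earliest = max(earliest, issue[i - width] + 1)
        # A ROB entry is reusable the cycle after its previous occupant commits.
        if i >= rob_size:
            earliest = max(earliest, commit[i - rob_size] + 1)
        issue.append(earliest)
    return issue


# CMT column of the part (a) table, in program order.
COMMIT = [4, 8, 8, 15, 18, 19, 19, 20, 20, 21, 21]

print(issue_cycles(COMMIT, rob_size=8))   # [1, 1, 2, 2, 3, 3, 4, 4, 5, 9, 9]
print(issue_cycles(COMMIT, rob_size=16))  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
```

With 8 entries the sketch reproduces the IS column of the table, including the jump to cycle 9 for DSUBUI and BNEZ. With 16 entries (an arbitrary larger size chosen for illustration) those two instructions issue in cycles 5 and 6, although they would still commit in order behind the long FP chain.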

(d) What can we do to improve efficiency without modifying our processor? [2 points]

Solution: Loop unrolling to increase ILP. We can then pipeline more instructions for the functional units to keep them working.