CS433: Computer Systems Organization - Loop Unrolling and VLIW Scheduling - Prof. Josep To, Assignments of Computer Architecture and Organization

Information on loop unrolling and VLIW scheduling in computer systems organization. It includes examples of loop unrolling and VLIW scheduling, with solutions and justifications. The document also discusses the benefits of loop unrolling and of interleaving instructions from different iterations of a loop.


Uploaded on 03/10/2009

CS433: Computer Systems Organization Fall 2006
Homework 3
Assigned: 9/28
Due in class 10/12
1. Loop Unrolling [15 Points]
In this problem, we'll use the pipeline shown in Figure A.31 on page A.50 of your book. Its characteristics
are:
  • If unspecified, its properties are like those in the MIPS pipeline we studied in class
  • There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, logic operations, and branch operations
  • There is 1 FP/integer multiplier, taking 8 cycles to perform any multiply. It is pipelined
  • There is 1 FP adder, taking 7 cycles to perform FP additions and subtractions. It is pipelined
  • There is 1 FP/integer divider, taking 24 cycles. It is not pipelined
  • There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores
  • Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the effective address calculation
  • There are as many registers, both FP and integer, as you need
  • There is one branch delay slot
  • While the hardware has full forwarding/bypassing, it is the responsibility of the compiler to schedule such that the operands of each instruction are available when needed by each instruction
Consider the following loop:
loop L.D F2, 0(R1)
MUL.D F2, F2, F2
L.D F4, 0(R2)
ADD.D F4, F4, F4
ADD.D F6, F2, F4
DADDUI R1, R1, #16
DADDUI R2, R2, #16
DADDUI R3, R3, #16
S.D F6, 0(R3)
DSUBUI R5, R5, #8
BNEZ R5, loop
(a) Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row
take a cycle. If an instruction can't issue on a given cycle (because the current instruction has a
dependency that will not be resolved in time), write STALL instead, and move on to the next
cycle to see if it can issue then. Assume that a nop is scheduled in the branch delay slot (effectively
stalling 1 cycle after the branch). Explain all stalls, but don't reorder instructions. How many
cycles elapse before the second iteration begins? Show your work (i.e., write instructions in order
as they would be executed and write STALL in between instructions for each cycle where there
would be a stall).
Solution: In retrospect, some of the assumptions weren't clear, so multiple answers were accepted.
loop L.D F2, 0(R1)
stall RAW F2
MUL.D F2, F2, F2
L.D F4, 0(R2)

stall RAW F4 F
ADD.D F4, F4, F4
stall RAW F4 F
stall RAW F4 F
stall RAW F4 F
stall RAW F4 F
stall RAW F4 F
stall RAW F
ADD.D F6, F2, F4
DADDUI R1, R1, #16
DADDUI R2, R2, #16
DADDUI R3, R3, #16
stall RAW F
stall RAW F
S.D F6, 0(R3)
DSUBUI R5, R5, #8
BNEZ R5, loop
NOP // It's ok if you put stall here

Answer: 22 cycles (23 if you had a stall after the DSUBUI, which is ok if you assumed branches are resolved in ID).
3 points total. 1 point for the correct number of cycles. Take 0.5 points off for every 4 mistakes in the work.

(b) Now reschedule the loop. You can change immediate values and memory offsets and reorder instructions, but don't change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work.
Solution: There are several solutions; this is just one.
Notes: No stall was accepted between the DSUBUI and the BNEZ here because you could have swapped the DSUBUI with another stall or a DADDUI. I posted on the newsgroup to put a NOP in the branch delay slot, without thinking, for (b) and (c), so if you did that then you didn't lose points. If you did use the NOP, there would be a stall after the DSUBUI, and the S.D and BNEZ instructions would be swapped.

loop L.D F2, 0(R1)
L.D F4, 0(R2)
MUL.D F2, F2, F2
ADD.D F4, F4, F4
stall RAW F4 F
stall RAW F4 F
stall RAW F4 F
stall RAW F4 F
stall RAW F4 F
stall RAW F
ADD.D F6, F2, F4
DADDUI R1, R1, #16 // Can swap with any stall slot
DADDUI R2, R2, #16 // Can swap with any stall slot
DADDUI R3, R3, #16 // Can swap with any stall slot
DSUBUI R5, R5, #8 // Can swap with any stall slot
BNEZ R5, loop
S.D F6, 0(R3)

Answer: 17 cycles (or 19 cycles if you scheduled a nop in the branch delay slot).
6 points total. 1 point for the correct number of cycles. Subtract 0.5 points for every error in the work, up to 5 points.
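The cycle counts for parts (a) and (b) can be cross-checked with a small in-order issue model. The sketch below is not part of the original solution. The `LATENCY` table, the `schedule` function, and the tuple encoding of instructions are assumptions introduced here: a consumer is taken to issue no earlier than its producer's issue cycle plus the listed distance (2 for a load, 8 for MUL.D, 7 for ADD.D, 1 for integer operations), and the value stored by S.D is taken to be needed one cycle later because it is forwarded to the MEM stage.

```python
# Assumed producer-to-consumer distances in cycles, read off the pipeline
# description and the worked solution above (an interpretation, not a spec).
LATENCY = {"L.D": 2, "MUL.D": 8, "ADD.D": 7, "DADDUI": 1, "DSUBUI": 1}


def schedule(program):
    """Return the issue cycle of each instruction (in order, one per cycle).

    Each instruction is (opcode, destination or None, source registers).
    For S.D the first source is the stored value; it is forwarded to the
    MEM stage, so it is needed one cycle later than an EX-stage operand.
    """
    ready = {}  # register -> first cycle at which an EX-stage consumer may issue
    cycle = 0
    issue_cycles = []
    for opcode, dest, sources in program:
        earliest = cycle + 1
        for position, reg in enumerate(sources):
            needed = ready.get(reg, 0)
            if opcode == "S.D" and position == 0:
                needed -= 1
            earliest = max(earliest, needed)
        cycle = earliest
        issue_cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + LATENCY[opcode]
    return issue_cycles


# Part (a): original order, NOP in the branch delay slot.
PART_A = [
    ("L.D", "F2", ("R1",)),
    ("MUL.D", "F2", ("F2", "F2")),
    ("L.D", "F4", ("R2",)),
    ("ADD.D", "F4", ("F4", "F4")),
    ("ADD.D", "F6", ("F2", "F4")),
    ("DADDUI", "R1", ("R1",)),
    ("DADDUI", "R2", ("R2",)),
    ("DADDUI", "R3", ("R3",)),
    ("S.D", None, ("F6", "R3")),
    ("DSUBUI", "R5", ("R5",)),
    ("BNEZ", None, ("R5",)),
    ("NOP", None, ()),
]

# Part (b): the rescheduled order, with S.D in the branch delay slot.
PART_B = [
    ("L.D", "F2", ("R1",)),
    ("L.D", "F4", ("R2",)),
    ("MUL.D", "F2", ("F2", "F2")),
    ("ADD.D", "F4", ("F4", "F4")),
    ("ADD.D", "F6", ("F2", "F4")),
    ("DADDUI", "R1", ("R1",)),
    ("DADDUI", "R2", ("R2",)),
    ("DADDUI", "R3", ("R3",)),
    ("DSUBUI", "R5", ("R5",)),
    ("BNEZ", None, ("R5",)),
    ("S.D", None, ("F6", "R3")),
]

print(schedule(PART_A)[-1])  # 22
print(schedule(PART_B)[-1])  # 17
```

Every gap between consecutive issue cycles corresponds to the stall rows in the listings. The model ignores structural hazards and the alternative assumption that branches resolve in ID, which is where the accepted 23-cycle answer comes from.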

(a) Unroll the loop from Question 1 four times, and schedule it for this VLIW to take as few stall cycles as possible. Consult Figure 4.5 on page 318 of your book for an example. How many cycles do the four iterations take to complete? Show your work.
Solution:

MEM1             MEM2             Integer            FP Add/Sub            FP Mul/Div
LD F2, 0(R1)     LD F4, 0(R2)
LD F8, 16(R1)    LD F10, 16(R2)
LD F14, 32(R1)   LD F16, 32(R2)                      ADD.D F2, F2, F2      MUL.D F4, F4, F
LD F20, 48(R1)   LD F22, 48(R2)                      ADD.D F8, F8, F8      MUL.D F10, F10, F
                                                     ADD.D F14, F14, F14   MUL.D F16, F16, F
                                                     ADD.D F20, F20, F20   MUL.D F22, F22, F
                                  DADDUI R3, R3, #
                                  DADDUI R2, R2, #
                                  DADDUI R1, R1, #
                                  DSUBUI R5, R5, #
                                                     ADD.D F6, F2, F
                                                     ADD.D F12, F8, F
                                                     ADD.D F18, F14, F
                                                     ADD.D F24, F20, F
SD F6, 0(R3)
SD F12, 16(R3)
SD F18, 32(R3)
SD F24, 48(R3)
                                  BNEZ R5, loop

20 cycles for 4 iterations.
10 points total. 3 points for the correct number of cycles. 0.5 points for every correct line.

(b) Suppose we want the number of cycles per original iteration to drop to 3 or below. At least how many times do we need to unroll the loop? You don't need to show the scheduled instructions, but do justify your answer.
Solution: Each iteration that we unroll after the 4th adds a few more cycles to the execution time. From the 5th to the 8th iterations, we would only be adding one cycle onto the total run time. So if we unroll 4 more iterations, it becomes 23 cycles for 8 iterations, which is ≤ 3 cycles per iteration.
4 points total for the correct answer.
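The threshold in part (b) comes down to dividing total cycles by the unroll factor. A minimal sketch of that arithmetic, using only the two totals stated in the solution (20 cycles for 4 iterations, 23 cycles for 8); the helper name `cycles_per_iteration` is hypothetical:

```python
def cycles_per_iteration(total_cycles, unroll_factor):
    """Average cycles spent per original loop iteration."""
    return total_cycles / unroll_factor


print(cycles_per_iteration(20, 4))  # 5.0, above the target of 3
print(cycles_per_iteration(23, 8))  # 2.875, at or below the target of 3
```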

3. Software Pipelining **** Graduate Student Problem **** [10 Points]

Consider once more the loop from question 1.

(a) Provide the steady-state code for a software pipelined version of the loop. You can assume the loop will have at least four iterations. Show your work.
Solution:
loop S.D F6, -48(R3) // x - 3
ADD.D F6, F8, F10 // x - 2
MUL.D F8, F2, F2 // x - 1
ADD.D F10, F4, F4 // x - 1
L.D F2, 0(R1) // x
L.D F4, 0(R2) // x
DADDUI R1, R1, #
DADDUI R2, R2, #
DADDUI R3, R3, #
DSUBUI R5, R5, #
BNEZ R5, loop

6 points total. 1 point for the store, 0.5 points for every other line. Registers may not be the same.

(b) Now provide the start-up and finish-up code for the loop you provided in part (a). Don't attempt to schedule it optimally. Show your work.
Solution:
startup code:
L.D F2, 0(R1)
L.D F4, 0(R2)
MUL.D F8, F2, F2
ADD.D F10, F4, F4
L.D F2, 16(R1)
L.D F4, 16(R2)
ADD.D F6, F8, F10
MUL.D F8, F2, F2
ADD.D F10, F4, F4
L.D F2, 32(R1)
L.D F4, 32(R2)
DADDUI R1, R1, #
DADDUI R2, R2, #
DADDUI R3, R3, #
DSUBUI R5, R5, #

cleanup:
loop S.D F6, -48(R3)
ADD.D F6, F8, F10
MUL.D F8, F2, F2
ADD.D F10, F4, F4
S.D F6, -32(R3)
ADD.D F6, F8, F10
S.D F6, -16(R3)

4 points total. Subtract 0.25 points for every error, up to 4 points. Note that instructions can be in any order except the ones dependent on each other.
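To see why the start-up, steady-state, and clean-up pieces together cover every iteration exactly once, the structure can be mirrored in a short Python sketch. This is an illustration under stated assumptions, not the graded solution: memory is modelled as Python lists, the loop is assumed to compute `out[i] = a[i]*a[i] + (b[i] + b[i])`, the variables `f2` through `f10` stand in for the FP registers of the same names, and the function names are hypothetical.

```python
def reference(a, b):
    """The original loop: one full iteration per pass."""
    return [x * x + (y + y) for x, y in zip(a, b)]


def software_pipelined(a, b):
    """Same computation with loads, arithmetic, and stores from different
    iterations overlapped in each pass of the steady-state loop."""
    n = len(a)
    if n < 4:
        raise ValueError("the pipelined version assumes at least four iterations")
    out = [None] * n

    # Start-up: fill the pipeline with iterations 0, 1, and 2.
    f2, f4 = a[0], b[0]            # loads for iteration 0
    f8, f10 = f2 * f2, f4 + f4     # MUL.D / ADD.D for iteration 0
    f2, f4 = a[1], b[1]            # loads for iteration 1
    f6 = f8 + f10                  # final ADD.D for iteration 0
    f8, f10 = f2 * f2, f4 + f4     # MUL.D / ADD.D for iteration 1
    f2, f4 = a[2], b[2]            # loads for iteration 2

    # Steady state: each pass touches four different iterations.
    for x in range(3, n):
        out[x - 3] = f6            # store for iteration x - 3
        f6 = f8 + f10              # final ADD.D for iteration x - 2
        f8, f10 = f2 * f2, f4 + f4 # MUL.D / ADD.D for iteration x - 1
        f2, f4 = a[x], b[x]        # loads for iteration x

    # Clean-up: drain the three iterations still in flight.
    out[n - 3] = f6
    f6 = f8 + f10
    f8, f10 = f2 * f2, f4 + f4
    out[n - 2] = f6
    f6 = f8 + f10
    out[n - 1] = f6
    return out


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [0.5, 1.5, 2.5, 3.5, 4.5]
print(software_pipelined(a, b) == reference(a, b))  # True
```

The order of statements inside the steady-state pass matters: each value is consumed before the statement that overwrites its register, which is the same constraint the assembly version relies on.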

4. Predication [10 Points]

Consider the following format for predicated MIPS instructions:
(pA) DADD R1, R2, R
where the DADD instruction is predicated on the predicate register pA. Assume a set of four 1-bit predicate registers (pA, pB, pC, pD) that are set by a compare instruction of the form:
CMP.ne pA, pB, R8, R0
The above uses a not-equal (.ne) comparison relation to match the code fragment above. The above compare sets the 1-bit predicate registers as follows:
pA = (R8 != R0)
pB = !(R8 != R0)
Assume a CMP.gt instruction also exists, using the greater-than (>) comparison relation. Otherwise, CMP.gt works just like CMP.ne. Consider the following C code.

if (a>=b) {
  if (c>d) {
    x = 2;
    y = x + 1;
  } else {
    x = 3;

5. GCD Test [5 Points]

Consider the following code fragment:

for (i=0; i <= 50; i += 2) {
  A[2*i + 4] = A[100 * i + 203];
}

We want to apply the GCD test to see if there is a dependency.

(a) For the GCD test, we need to "normalize" the loop by making the index begin at 1 and the loop variable increment by 1 on every iteration. Rewrite the code to achieve this.
Solution:
for (i=1; i <= 26; i += 1) {
  A[4*i] = A[200 * i + 3];
}
2 points total. Subtract 0.5 for each mistake, up to 2.

(b) Apply the GCD test. Is there a loop dependency?
Solution: The GCD of 4 and 200 is 4, and 3 − 0 = 3. Since 4 does not divide 3, there is no dependency.
3 points for the correct answer.
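Both parts can be checked mechanically. The sketch below is an illustration, with hypothetical names (`gcd_test`, `original`, `normalized`): it confirms that the normalized loop touches the same array elements as the original, applies the GCD test to a write to `A[a*i + b]` and a read from `A[c*i + d]`, and then brute-forces the index sets as a sanity check. It assumes `a` and `c` are not both zero.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """Write to A[a*i + b], read from A[c*i + d].

    Returns True if a dependence is possible, i.e. gcd(a, c) divides d - b.
    A True result does not prove a dependence; False rules one out.
    """
    return (d - b) % gcd(a, c) == 0


# (write index, read index) pairs for the original and normalized loops.
original = [(2 * i + 4, 100 * i + 203) for i in range(0, 51, 2)]
normalized = [(4 * i, 200 * i + 3) for i in range(1, 27)]
print(original == normalized)  # True

# GCD test on the normalized loop: A[4*i + 0] = A[200*i + 3].
print(gcd_test(4, 0, 200, 3))  # False, so no dependence

# Brute-force confirmation: no element is both written and read.
writes = {w for w, _ in normalized}
reads = {r for _, r in normalized}
print(writes & reads)  # set()
```

The brute-force check agrees with the test here because every written index is a multiple of 4 while every read index is odd.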