



Homework solutions on loop unrolling and VLIW scheduling in computer systems organization. The document includes worked examples of loop unrolling and VLIW scheduling with solutions and justifications, and discusses the benefits of loop unrolling and of interleaving instructions from different iterations of a loop.




CS433: Computer Systems Organization Fall 2006 Homework 3 Assigned: 9/ Due in class 10/
In this problem, we'll use the pipeline shown in Figure A.31 on page A.50 of your book. Its characteristics are:
Consider the following loop:
loop    L.D     F2, 0(R1)
        MUL.D   F2, F2, F
        L.D     F4, 0(R2)
        ADD.D   F4, F4, F
        ADD.D   F6, F2, F
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        DADDUI  R3, R3, #
        S.D     F6, 0(R3)
        DSUBUI  R5, R5, #
        BNEZ    R5, loop
(a) Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row take a cycle. If an instruction can't issue on a given cycle (because the current instruction has a dependency that will not be resolved in time), write STALL instead, and move on to the next cycle to see if it can issue then. Assume that a nop is scheduled in the branch delay slot (effectively stalling 1 cycle after the branch). Explain all stalls, but don't reorder instructions. How many cycles elapse before the second iteration begins? Show your work (i.e., write the instructions in the order in which they would be executed, and write STALL between instructions for each cycle where there would be a stall).
Solution: In retrospect, some of the assumptions weren’t clear, so multiple answers were accepted.
loop    L.D     F2, 0(R1)
        stall   RAW F
        MUL.D   F2, F2, F
        L.D     F4, 0(R2)
        stall   RAW F4 F
        ADD.D   F4, F4, F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F
        ADD.D   F6, F2, F
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        DADDUI  R3, R3, #
        stall   RAW F
        stall   RAW F
        S.D     F6, 0(R3)
        DSUBUI  R5, R5, #
        BNEZ    R5, loop
        NOP                     // It's OK if you put a stall here
Answer: 22 cycles (23 if you had a stall after the DSUBUI, which is OK if you assumed branches are resolved in ID). 3 points total. 1 point for the correct number of cycles. Take 0.5 points off for every 4 mistakes in the work.
(b) Now reschedule the loop. You can change immediate values and memory offsets and reorder instructions, but don't change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work.
Solution: There are several solutions; this is just one.
Notes: No stall was required between the DSUBUI and the BNEZ here, because you could have swapped the DSUBUI with another stall or a DADDUI. I posted on the newsgroup, without thinking, to put a NOP in the branch delay slot for (b) and (c), so if you did that you didn’t lose points. If you did use the NOP, there would be a stall after the DSUBUI, and the S.D and BNEZ instructions would be swapped.
loop    L.D     F2, 0(R1)
        L.D     F4, 0(R2)
        MUL.D   F2, F2, F
        ADD.D   F4, F4, F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F
        ADD.D   F6, F2, F
        DADDUI  R1, R1, #16     // Can swap with any stall slot
        DADDUI  R2, R2, #16     // Can swap with any stall slot
        DADDUI  R3, R3, #16     // Can swap with any stall slot
        DSUBUI  R5, R5, #8      // Can swap with any stall slot
        BNEZ    R5, loop
        S.D     F6, 0(R3)
Answer: 17 cycles (or 19 cycles if you scheduled a nop in the branch delay slot). 6 points total. 1 point for the correct number of cycles. Subtract 0.5 points for every error in the work, up to 5 points.
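The cycle counts in parts (a) and (b) follow directly from counting rows in each schedule. The sketch below is only a bookkeeping aid, not part of the original solution; the function name `cycles_per_iteration` is hypothetical, and the row counts passed to it are tallied from the two listings above (11 loop instructions in both; 10 stall rows plus 1 delay-slot NOP in part (a); 6 stall rows and no NOP in part (b), where the S.D fills the delay slot).

```c
#include <assert.h>

/* Hypothetical helper: with one row per cycle, the cycles spent on one
 * iteration are the issued instructions, plus the stall rows, plus any
 * NOPs placed in the branch delay slot. */
int cycles_per_iteration(int instructions, int stalls, int delay_slot_nops) {
    assert(instructions >= 0 && stalls >= 0 && delay_slot_nops >= 0);
    return instructions + stalls + delay_slot_nops;
}
```

Calling `cycles_per_iteration(11, 10, 1)` reproduces the unscheduled count from part (a), and `cycles_per_iteration(11, 6, 0)` reproduces the rescheduled count from part (b).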
(a) Unroll the loop from Question 1 four times, and schedule it for this VLIW to take as few stall cycles as possible. Consult Figure 4.5 on page 318 of your book for an example. How many cycles do the four iterations take to complete? Show your work.
Solution:
MEM1  MEM2  Integer  FP Add/Sub  FP Mul/Div
LD F2, 0(R1)
LD F4, 0(R2)
LD F8, 16(R1)
LD F10, 16(R2)
LD F14, 32(R1)
LD F16, 32(R2)
ADD.D F2, F2, F2
MUL.D F4, F4, F
LD F20, 48(R1)
LD F22, 48(R2)
ADD.D F8, F8, F8
MUL.D F10, F10, F
ADD.D F14, F14, F14
MUL.D F16, F16, F
ADD.D F20, F20, F20
MUL.D F22, F22, F
DADDUI R3, R3, #
DADDUI R2, R2, #
DADDUI R1, R1, #
DSUBUI R5, R5, #
ADD.D F6, F2, F
ADD.D F12, F8, F
ADD.D F18, F14, F
ADD.D F24, F20, F
SD F6, 0(R3)
SD F12, 16(R3)
SD F18, 32(R3)
SD F24, 48(R3)
BNEZ R5, loop
Answer: 20 cycles for 4 iterations. 10 points total. 3 points for the correct number of cycles. 0.5 points for every correct line.
(b) Suppose we want the number of cycles per original iteration to drop to 3 or below. At least how many times do we need to unroll the loop? You don't need to show the scheduled instructions, but do justify your answer.
Solution: For each iteration that we unroll after the 4th, we add a few more cycles to the execution time. From the 5th to the 8th iteration, we would only be adding one cycle onto the total run time. So if we unroll 4 more iterations, the schedule becomes 23 cycles for 8 iterations, which is ≤ 3 cycles per iteration. 4 points total for the correct answer.
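The comparison in part (b) is a division of the schedule length by the unroll factor. As a minimal sketch (the function name `cycles_per_original_iteration` is hypothetical, and the totals 20 and 23 are taken from the solution text above rather than derived from a schedule):

```c
#include <assert.h>

/* Hypothetical helper: average cycles spent per iteration of the original
 * (non-unrolled) loop, given the length of the unrolled schedule. */
double cycles_per_original_iteration(int total_cycles, int unroll_factor) {
    assert(unroll_factor > 0);
    return (double)total_cycles / unroll_factor;
}
```

With the quoted totals, unrolling four times gives 20 / 4 = 5 cycles per original iteration, while unrolling eight times gives 23 / 8 = 2.875, which meets the target of 3 or below.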
Consider once more the loop from question 1.
(a) Provide the steady-state code for a software pipelined version of the loop. You can assume the loop will have at least four iterations. Show your work.
Solution:
loop    S.D     F6, -48(R3)     // x - 3
        ADD.D   F6, F8, F10     // x - 2
        MUL.D   F8, F2, F2      // x - 1
        ADD.D   F10, F4, F4     // x - 1
        L.D     F2, 0(R1)       // x
        L.D     F4, 0(R2)       // x
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        DADDUI  R3, R3, #
        DSUBUI  R5, R5, #
        BNEZ    R5, loop
6 points total. 1 point for the store, 0.5 points for every other line. Registers may not be the same.
(b) Now provide the start-up and finish-up code for the loop you provided in part (a). Don't attempt to schedule it optimally. Show your work.
Solution:
startup code:
        L.D     F2, 0(R1)
        L.D     F4, 0(R2)
        MUL.D   F8, F2, F
        ADD.D   F10, F4, F
        L.D     F2, 16(R1)
        L.D     F4, 16(R2)
        ADD.D   F6, F8, F
        MUL.D   F8, F2, F
        ADD.D   F10, F4, F
        L.D     F2, 32(R1)
        L.D     F4, 32(R2)
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        DADDUI  R3, R3, #
        DSUBUI  R5, R5, #
cleanup:
loop    S.D     F6, -48(R3)
        ADD.D   F6, F8, F
        MUL.D   F8, F2, F
        ADD.D   F10, F4, F
        S.D     F6, -32(R3)
        ADD.D   F6, F8, F
        S.D     F6, -16(R3)
4 points total. Subtract 0.25 points for every error, up to 4 points. Note that instructions can be in any order, except the ones that depend on each other.
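The structure of the software-pipelined solution (start-up code, a steady-state loop whose body mixes four different iterations, then clean-up code) can be mirrored in C. This is a sketch under stated assumptions, not the graded answer: the function names are hypothetical, the arrays `a`, `b`, and `out` stand in for the memory addressed through R1, R2, and R3, and the arithmetic uses the operands as printed in the steady-state listing (`MUL.D F8, F2, F2` and `ADD.D F10, F4, F4`), which may differ from the truncated operands of the original loop.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: each iteration loads, computes, and stores one element. */
void plain_loop(const double *a, const double *b, double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * a[i] + (b[i] + b[i]);
}

/* Software-pipelined version. In the steady state, one pass of the loop body
 * stores iteration x-3, sums iteration x-2, multiplies/adds iteration x-1,
 * and loads iteration x. */
void pipelined_loop(const double *a, const double *b, double *out, size_t n) {
    assert(n >= 4);                    /* the problem assumes at least four iterations */

    /* Start-up: fill the pipeline with iterations 0, 1, and 2. */
    double la = a[0], lb = b[0];       /* loads for iteration 0 */
    double m = la * la, s = lb + lb;   /* multiply/add for iteration 0 */
    la = a[1]; lb = b[1];              /* loads for iteration 1 */
    double sum = m + s;                /* final add for iteration 0 */
    m = la * la; s = lb + lb;          /* multiply/add for iteration 1 */
    la = a[2]; lb = b[2];              /* loads for iteration 2 */

    /* Steady state. */
    for (size_t x = 3; x < n; x++) {
        out[x - 3] = sum;              /* store,        iteration x-3 */
        sum = m + s;                   /* final add,    iteration x-2 */
        m = la * la; s = lb + lb;      /* multiply/add, iteration x-1 */
        la = a[x]; lb = b[x];          /* loads,        iteration x   */
    }

    /* Clean-up: drain the three iterations still in flight. */
    out[n - 3] = sum;
    sum = m + s;
    m = la * la; s = lb + lb;
    out[n - 2] = sum;
    sum = m + s;
    out[n - 1] = sum;
}
```

Both functions perform the same operations on each element, only in a different order across iterations, so they should produce identical output arrays for any `n` of at least four.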
Consider the following format for predicated MIPS instructions:
        (pA) DADD R1, R2, R
where the DADD instruction is predicated on the predicate register pA. Assume a set of four 1-bit predicate registers (pA, pB, pC, pD) that are set by a compare instruction of the form:
        CMP.ne pA, pB, R8, R
The above uses a not-equal (.ne) comparison relation to match the code fragment above. The compare sets the 1-bit predicate registers as follows:
        pA = (R8 != R0)
        pB = !(R8 != R0)
Assume a CMP.gt instruction also exists, using the greater-than (>) comparison relation. Otherwise, CMP.gt works just like CMP.ne. Consider the following C code.
if (a>=b) { if (c>d) { x = 2; y = x + 1; } else { x = 3;
Consider the following code fragment:
for (i=0; i <= 50; i += 2) { A[2*i + 4] = A[100 * i + 203]; }
We want to apply the GCD test to see if there is a dependency.
(a) For the GCD test, we need to "normalize" the loop by making the index begin at 1 and the loop variable increment by 1 on every iteration. Rewrite the code to achieve this.
Solution:
for (i=1; i <= 26; i += 1) { A[4*i] = A[200 * i + 3]; }
2 points total. Subtract 0.5 for each mistake, up to 2.
(b) Apply the GCD test. Is there a loop dependency?
Solution: The GCD of 4 and 200 is 4, and 3 − 0 = 3. Since 4 does not divide 3, there is no dependency. 3 points for the correct answer.
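Both parts can be checked mechanically. The sketch below is an illustration only; the function names are hypothetical. `gcd_test_may_depend` applies the GCD test to a store to `A[a*i + b]` and a load from `A[c*i + d]` (a dependence is possible only if GCD(c, a) divides d − b), and `normalization_matches` walks the original loop and the normalized loop side by side to confirm that they touch the same array indices.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor by Euclid's algorithm, on absolute values. */
static long gcd(long x, long y) {
    x = labs(x);
    y = labs(y);
    while (y != 0) {
        long t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test for a store to A[a*i + b] and a load from A[c*i + d].
 * Returns 1 if a loop-carried dependence is possible, 0 if it is ruled out. */
int gcd_test_may_depend(long a, long b, long c, long d) {
    long g = gcd(a, c);
    assert(g != 0);                 /* at least one stride must be nonzero */
    return (d - b) % g == 0;
}

/* Confirms that the normalized loop (index j = 1..26, step 1) touches the
 * same store and load indices as the original loop (i = 0..50, step 2). */
int normalization_matches(void) {
    long j = 1;
    for (long i = 0; i <= 50; i += 2, j++) {
        if (2 * i + 4 != 4 * j) return 0;            /* store index */
        if (100 * i + 203 != 200 * j + 3) return 0;  /* load index  */
    }
    return j - 1 == 26;             /* same trip count */
}
```

For the normalized loop, `gcd_test_may_depend(4, 0, 200, 3)` returns 0, matching the conclusion that there is no dependency.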