


Instructions for Homework 3 of the CS433ug Computer Systems Organization course, Spring 2007. The homework involves loop unrolling and VLIW scheduling for a given pipeline architecture. Students are required to write the instruction issued in each cycle, mark stalls, and minimize the number of cycles before the second iteration begins.



CS433ug: Computer Systems Organization
Spring 2007
Homework 3
Assigned: 2/
Due in class 3/
Instructions: Please write your name, NetID and an alias on your homework submissions for posting grades (If you don’t want your grades posted, then don’t write an alias). We will use this alias throughout the semester. Homeworks are due in class on the date posted.
Consider the following loop:
loop:   L.D     F2, 0(R1)
        ADD.D   F2, F2, F
        L.D     F4, 0(R2)
        MUL.D   F4, F4, F
        ADD.D   F6, F2, F
        S.D     F6, 0(R3)
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        DADDUI  R3, R3, #
        DSUBUI  R5, R5, #
        BNEZ    R5, loop
(a) Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row take a cycle. If an instruction can't issue on a given cycle (because the current instruction has a dependency that will not be resolved in time), write STALL instead, and move on to the next cycle to see if it can issue then. Explain all stalls, but don't reorder instructions. How many cycles elapse before the second iteration begins? Show your work (i.e., write instructions in order as they would be executed and write STALL in between instructions for each cycle where there would be a stall). Remember there is 1 branch delay slot. [5 points]

Solution:

loop:   L.D     F2, 0(R1)
        STALL                       RAW F
        ADD.D   F2, F2, F
        L.D     F4, 0(R2)
        STALL                       RAW F
        MUL.D   F4, F4, F
        STALL                       RAW F
        STALL                       RAW F
        STALL                       RAW F
        STALL                       RAW F
        STALL                       RAW F
        STALL                       RAW F
        ADD.D   F6, F2, F
        STALL                       RAW F
        S.D     F6, 0(R3)
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        DADDUI  R3, R3, #
        DSUBUI  R5, R5, #
        STALL                       BR resolved in ID
        BNEZ    R5, loop
        STALL                       NOP/BR delay

Answer: 22 cycles
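The cycle count for part (a) can be cross-checked with a small in-order issue model. This is only a sketch: the stall distances in `STALLS` are assumptions inferred from the stall counts in the solution (the pipeline latency table is not reproduced here), and the full operand names are taken from the reorder-buffer table later in this document. The names `STALLS`, `schedule`, and `PART_A` are hypothetical.

```python
# Stall cycles needed between a producer and a dependent consumer.
# ASSUMPTION: inferred from the stall counts in the solution, not from
# an official latency table.
STALLS = {
    ("load", "fpadd"): 1,
    ("load", "fpmul"): 1,
    ("fpadd", "fpmul"): 2,   # assumed; any value <= 2 gives the same count here
    ("fpmul", "fpadd"): 6,
    ("fpadd", "store"): 1,
    ("int", "branch"): 1,
}

def schedule(program):
    """Return the issue cycle of each instruction (in-order, single issue)."""
    last_write = {}  # register -> (issue cycle, class) of its latest producer
    cycles = []
    cycle = 0
    for kind, dest, srcs in program:
        issue = cycle + 1  # earliest slot after the previous instruction
        for reg in srcs:
            if reg in last_write:
                p_cycle, p_kind = last_write[reg]
                issue = max(issue, p_cycle + 1 + STALLS.get((p_kind, kind), 0))
        if dest is not None:
            last_write[dest] = (issue, kind)
        cycles.append(issue)
        cycle = issue
    return cycles

# (class, destination register, source registers), in program order.
PART_A = [
    ("load",   "F2", ["R1"]),        # L.D    F2, 0(R1)
    ("fpadd",  "F2", ["F2"]),        # ADD.D  F2, F2, F2
    ("load",   "F4", ["R2"]),        # L.D    F4, 0(R2)
    ("fpmul",  "F4", ["F4", "F2"]),  # MUL.D  F4, F4, F2
    ("fpadd",  "F6", ["F2", "F4"]),  # ADD.D  F6, F2, F4
    ("store",  None, ["F6", "R3"]),  # S.D    F6, 0(R3)
    ("int",    "R1", ["R1"]),        # DADDUI R1
    ("int",    "R2", ["R2"]),        # DADDUI R2
    ("int",    "R3", ["R3"]),        # DADDUI R3
    ("int",    "R5", ["R5"]),        # DSUBUI R5
    ("branch", None, ["R5"]),        # BNEZ   R5, loop
    ("nop",    None, []),            # branch delay slot
]

if __name__ == "__main__":
    print(schedule(PART_A)[-1])  # cycle in which the delay slot issues
```

Under these assumed stall distances the delay slot issues in cycle 22, which agrees with the answer above. Passing a reordered instruction list to `schedule` gives a quick way to compare alternative schedules under the same assumptions.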
(b) Now reschedule the loop. You can change immediate values and memory offsets and reorder instructions, but don't change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work. [5 points]

Solution: There are several solutions; this is just one.

Notes: No stall was accepted between the DSUBUI and the BNEZ here, because you could have swapped the DSUBUI with another stall or a DADDUI. I posted on the newsgroup to put a NOP in the branch delay slot without thinking for (b) and (c), so if you did that you didn’t lose points. If you did use the NOP, there would be a stall after the DSUBUI, and the S.D and BNEZ instructions would be swapped.

loop:   L.D     F2, 0(R1)
        L.D     F4, 0(R2)
        ADD.D   F2, F2, F
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        MUL.D   F4, F4, F
        DADDUI  R3, R3, #
        DSUBUI  R5, R5, #
        STALL                       RAW F
        STALL                       RAW F
        STALL                       RAW F
        STALL                       RAW F
        ADD.D   F6, F2, F
        BNEZ    R5, loop
        S.D     F6, -16(R3)

Answer: 15 cycles

(c) Now unroll and reschedule the loop the minimum number of times needed to eliminate all stalls. You can, and should, remove redundant instructions. How many times did you unroll the loop and how many cycles elapse before the next iteration of the loop begins? Don't worry about clean-up code. Show your work. [10 points]

Solution: This is one solution. Registers used could be different.
Notes: There's a lot of flexibility in the order of the LD, MUL, and first ADD instructions in the loop.

loop:   L.D     F2, 0(R1)
        L.D     F8, 8(R1)
        L.D     F14, 16(R1)
        ADD.D   F2, F2, F
        ADD.D   F8, F8, F
        ADD.D   F14, F14, F
        L.D     F4, 0(R2)
        L.D     F10, 16(R2)
        L.D     F16, 32(R2)
        MUL.D   F4, F4, F
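The unrolling transformation itself can be illustrated at a higher level. The sketch below is an assumption-laden model, not the assignment's answer: it takes the loop body to compute `t = a[i] + a[i]` followed by `c[i] = t + b[i] * t` (operands read from the reorder-buffer table later in this document), and it unrolls by three only because the partial listing above shows three load streams. The function names `rolled` and `unrolled3` are hypothetical.

```python
def rolled(a, b):
    """One element per iteration, mirroring the original loop body."""
    c = []
    for i in range(len(a)):
        t = a[i] + a[i]            # ADD.D F2, F2, F2
        c.append(t + b[i] * t)     # MUL.D, ADD.D, then S.D
    return c

def unrolled3(a, b):
    """Three elements per iteration.

    ASSUMPTION: len(a) is a multiple of 3, since the assignment says not
    to worry about clean-up code.
    """
    c = [0.0] * len(a)
    for i in range(0, len(a), 3):
        # Independent work from three iterations is grouped by operation,
        # which is what lets a scheduler fill the latency slots.
        x0, x1, x2 = a[i], a[i + 1], a[i + 2]        # loads from a
        t0, t1, t2 = x0 + x0, x1 + x1, x2 + x2       # first adds
        y0, y1, y2 = b[i], b[i + 1], b[i + 2]        # loads from b
        m0, m1, m2 = y0 * t0, y1 * t1, y2 * t2       # multiplies
        c[i], c[i + 1], c[i + 2] = t0 + m0, t1 + m1, t2 + m2
        # The pointer updates and the branch now run once per three elements.
    return c

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    b = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
    print(rolled(a, b) == unrolled3(a, b))
```

Both versions perform the same floating-point operations per element in the same order, so the results match exactly; only the loop overhead and the amount of independent work visible per iteration differ.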
ROB entry  Instruction          Funct Unit  IS  EX     WR  CMT  Stall Reason
0          L.D F2, 0(R1)        Integer     1   2      3   4
1          ADD.D F2, F2, F2     FP Add/Sub  1   4-6    7   8    RAW on F
2          L.D F4, 0(R2)        Integer     2   3      4   8    Wait for inorder co
3          MUL.D F4, F4, F2     FP Mul      2   7-13   14  15   RAW on F
4          ADD.D F6, F2, F4     FP Add/Sub  3   14-16  17  18   RAW on F
5          S.D F6, 0(R3)        Integer     3   17     18  19   RAW on F
6          DADDUI R1, R1, #8    Integer     4   5      6   19   Wait for inorder co
7          DADDUI R2, R2, #16   Integer     4   6      7   20   Wait for FU, inorder c
0          DADDUI R3, R3, #16   Integer     5   7      8   20   Wait for FU, inorder c
1          DSUBUI R5, R5, #2    Integer     9   10     11  21   Wait for ROB, inorder
2          BNEZ R5, loop        Integer     9   11     12  21   Wait for ROB, inorder
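The table above can be sanity-checked mechanically. The following sketch encodes the IS, WR, and CMT columns and tests a few invariants: commits happen in program order, an instruction commits only after its write-back cycle, a reorder-buffer entry is reused only after its previous occupant has committed, and no more than `width` instructions issue or commit in one cycle. The width of 2 is an assumption read off the table, and `ROWS` and `check` are hypothetical names.

```python
from collections import Counter

# (ROB entry, instruction, IS, WR, CMT), copied from the table.
ROWS = [
    (0, "L.D F2, 0(R1)",      1,  3,  4),
    (1, "ADD.D F2, F2, F2",   1,  7,  8),
    (2, "L.D F4, 0(R2)",      2,  4,  8),
    (3, "MUL.D F4, F4, F2",   2, 14, 15),
    (4, "ADD.D F6, F2, F4",   3, 17, 18),
    (5, "S.D F6, 0(R3)",      3, 18, 19),
    (6, "DADDUI R1, R1, #8",  4,  6, 19),
    (7, "DADDUI R2, R2, #16", 4,  7, 20),
    (0, "DADDUI R3, R3, #16", 5,  8, 20),
    (1, "DSUBUI R5, R5, #2",  9, 11, 21),
    (2, "BNEZ R5, loop",      9, 12, 21),
]

def check(rows, width=2):
    """Return a list of invariant violations (empty if the table is consistent).

    ASSUMPTION: width=2 (dual issue and dual commit) is inferred from the table.
    """
    problems = []
    last_commit = {}  # ROB entry -> commit cycle of its previous occupant
    prev_cmt = 0
    for entry, instr, issue, wr, cmt in rows:
        if cmt < prev_cmt:
            problems.append(f"{instr}: commits out of order")
        if cmt <= wr:
            problems.append(f"{instr}: commits before write-back completes")
        if entry in last_commit and issue <= last_commit[entry]:
            problems.append(f"{instr}: reuses ROB entry {entry} before it is freed")
        last_commit[entry] = cmt
        prev_cmt = cmt
    for name, col in (("issue", 2), ("commit", 4)):
        for cycle, n in Counter(r[col] for r in rows).items():
            if n > width:
                problems.append(f"cycle {cycle}: {n} {name}s exceed width {width}")
    return problems

if __name__ == "__main__":
    print(check(ROWS))
```

The reuse check shows why the last two instructions wait: entries 1 and 2 are not freed until cycle 8, so DSUBUI and BNEZ cannot issue before cycle 9.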
(b) Based on your answer for part a, what is the bottleneck in issuing instructions? What is a simple way to remove this bottleneck? [2 points]
Solution: The reorder buffer size is the bottleneck when issuing instructions. For example, in the 1st iteration the last 2 instructions have to wait for buffer entries to be freed. By simply increasing the reorder buffer size we can issue more instructions.

(c) Based on your answer for part a, does this piece of code benefit significantly from being executed on a multiple issue/commit processor? Why or why not? [2 points]
Solution: Answers may vary, but the main point is that it does not benefit much from multiple issue. Some instructions can execute a bit sooner than they would on a single-issue processor, but there is not much ILP to exploit in this code without techniques like loop unrolling or software pipelining, except perhaps in the integer operations. Those operations, however, all compete for the single integer functional unit and execute in a single cycle anyway.
(d) What can we do to improve efficiency without modifying our processor? [2 points]
Solution: Loop unrolling to increase ILP. We can then pipeline more instructions for the functional units to keep them working.