CS433: Computer Systems Organization - Loop Unrolling and VLIW Scheduling - Prof. Josep To, Assignments of Computer Architecture and Organization

University of Illinois - Urbana-Champaign Computer Architecture and Organization

Prof. Josep Torrellas

Information on loop unrolling and VLIW scheduling in computer systems organization. It includes examples of loop unrolling and VLIW scheduling, with solutions and justifications, and discusses the benefits of loop unrolling and of interleaving instructions from different iterations of a loop.

Typology: Assignments

Pre 2010

Uploaded on 03/10/2009

CS433: Computer Systems Organization Fall 2006

Homework 3

Assigned: 9/28

Due in class 10/12

1. Loop Unrolling [15 Points]

In this problem, we'll use the pipeline shown in Figure A.31 on page A.50 of your book. Its characteristics are:

• If unspecified, its properties are like those in the MIPS pipeline we studied in class
• There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, logic operations, and branch operations
• There is 1 FP/integer multiplier, taking 8 cycles to perform any multiply. It is pipelined
• There is 1 FP adder, taking 7 cycles to perform FP additions and subtractions. It is pipelined
• There is 1 FP/integer divider, taking 24 cycles. It is not pipelined
• There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores
• Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the effective address calculation
• There are as many registers, both FP and integer, as you need
• There is one branch delay slot
• While the hardware has full forwarding/bypassing, it is the responsibility of the compiler to schedule the code so that the operands of each instruction are available when that instruction needs them

Consider the following loop:

loop    L.D     F2, 0(R1)
        MUL.D   F2, F2, F2
        L.D     F4, 0(R2)
        ADD.D   F4, F4, F4
        ADD.D   F6, F2, F4
        DADDUI  R1, R1, #16
        DADDUI  R2, R2, #16
        DADDUI  R3, R3, #16
        S.D     F6, 0(R3)
        DSUBUI  R5, R5, #8
        BNEZ    R5, loop

(a) Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row take a cycle. If an instruction can't issue on a given cycle (because the current instruction has a dependency that will not be resolved in time), write STALL instead, and move on to the next cycle to see if it can issue then. Assume that a nop is scheduled in the branch delay slot (effectively stalling 1 cycle after the branch). Explain all stalls, but don't reorder instructions. How many cycles elapse before the second iteration begins? Show your work (i.e., write instructions in order as they would be executed and write STALL in between instructions for each cycle where there would be a stall).

Solution: In retrospect, some of the assumptions weren’t clear, so multiple answers were accepted.

loop    L.D     F2, 0(R1)
        stall   RAW F2
        MUL.D   F2, F2, F2
        L.D     F4, 0(R2)

        stall   RAW F4 F
        ADD.D   F4, F4, F4
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F
        ADD.D   F6, F2, F4
        DADDUI  R1, R1, #16
        DADDUI  R2, R2, #16
        DADDUI  R3, R3, #16
        stall   RAW F
        stall   RAW F
        S.D     F6, 0(R3)
        DSUBUI  R5, R5, #8
        BNEZ    R5, loop
        NOP                     // It's OK if you put a stall here

Answer: 22 cycles (23 if you had a stall after the DSUBUI, which is OK if you assumed branches are resolved in ID).
3 points total. 1 point for the correct number of cycles. Take 0.5 points off for every 4 mistakes in the work.

(b) Now reschedule the loop. You can change immediate values and memory offsets and reorder instructions, but don't change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work.

Solution: There are several solutions; this is just one.

Notes: no stall was accepted between the DSUBUI and the BNEZ here because you could have swapped the DSUBUI with another stall or a DADDUI. Without thinking, I posted on the newsgroup to put a NOP in the branch delay slot for (b) and (c), so if you did that, you didn’t lose points. If you did use the NOP, there would be a stall after the DSUBUI, and the S.D and BNEZ instructions would be swapped.

loop    L.D     F2, 0(R1)
        L.D     F4, 0(R2)
        MUL.D   F2, F2, F2
        ADD.D   F4, F4, F4
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F4 F
        stall   RAW F
        ADD.D   F6, F2, F4
        DADDUI  R1, R1, #16     // Can swap with any stall slot
        DADDUI  R2, R2, #16     // Can swap with any stall slot
        DADDUI  R3, R3, #16     // Can swap with any stall slot
        DSUBUI  R5, R5, #8      // Can swap with any stall slot
        BNEZ    R5, loop
        S.D     F6, 0(R3)

Answer: 17 cycles (or 19 cycles if you scheduled a nop in the branch delay slot).
6 points total. 1 point for the correct number of cycles. Subtract 0.5 points for every error in the work, up to 5 points.
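The cycle counts in parts (a) and (b) can be cross-checked with a small in-order issue model. The sketch below is only an illustration, not the grader's method: the function name `issue_cycles`, the tuple encoding of instructions, and the issue-to-issue distances are assumptions, with the distances inferred from the stated latencies (a load feeds an FP operation 2 cycles later, the multiplier 8, the FP adder 7, and a store needs its data one cycle later than an FP consumer would).

```python
# Issue-to-issue distances (in cycles) between a producer and a dependent
# consumer. These values are assumptions inferred from the stated latencies.
FP_USE_DISTANCE = {"L.D": 2, "MUL.D": 8, "ADD.D": 7}
# A store reads its data in MEM, one stage later, so it can issue one cycle sooner.
STORE_DATA_DISTANCE = {"L.D": 1, "MUL.D": 7, "ADD.D": 6}
INT_DISTANCE = 1  # integer results forward to the very next instruction


def issue_cycles(program):
    """Return the 1-based issue cycle of each (opcode, dest, sources) entry."""
    last_writer = {}  # register -> (opcode, issue cycle) of its latest producer
    cycles = []
    cycle = 0
    for opcode, dest, sources in program:
        earliest = cycle + 1  # in-order, one instruction per cycle
        for reg in sources:
            if reg not in last_writer:
                continue
            producer, produced_at = last_writer[reg]
            if producer in FP_USE_DISTANCE:
                table = STORE_DATA_DISTANCE if opcode == "S.D" else FP_USE_DISTANCE
                distance = table[producer]
            else:
                distance = INT_DISTANCE
            earliest = max(earliest, produced_at + distance)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (opcode, cycle)
    return cycles


# Part (a): original order, with a NOP in the branch delay slot.
loop_a = [
    ("L.D", "F2", ["R1"]),
    ("MUL.D", "F2", ["F2"]),
    ("L.D", "F4", ["R2"]),
    ("ADD.D", "F4", ["F4"]),
    ("ADD.D", "F6", ["F2", "F4"]),
    ("DADDUI", "R1", ["R1"]),
    ("DADDUI", "R2", ["R2"]),
    ("DADDUI", "R3", ["R3"]),
    ("S.D", None, ["F6", "R3"]),
    ("DSUBUI", "R5", ["R5"]),
    ("BNEZ", None, ["R5"]),
    ("NOP", None, []),
]

# Part (b): the rescheduled order, with the store in the branch delay slot.
loop_b = [
    ("L.D", "F2", ["R1"]),
    ("L.D", "F4", ["R2"]),
    ("MUL.D", "F2", ["F2"]),
    ("ADD.D", "F4", ["F4"]),
    ("ADD.D", "F6", ["F2", "F4"]),
    ("DADDUI", "R1", ["R1"]),
    ("DADDUI", "R2", ["R2"]),
    ("DADDUI", "R3", ["R3"]),
    ("DSUBUI", "R5", ["R5"]),
    ("BNEZ", None, ["R5"]),
    ("S.D", None, ["F6", "R3"]),
]

cycles_a = issue_cycles(loop_a)[-1]  # 22
cycles_b = issue_cycles(loop_b)[-1]  # 17
print(cycles_a, cycles_b)
```

Under these assumed distances the model reproduces the 22-cycle and 17-cycle answers. It tracks only RAW dependences and does not model structural hazards or branch resolution in ID.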

(a) Unroll the loop from Question 1 four times, and schedule it for this VLIW to take as few stall cycles as possible. Consult Figure 4.5 on page 318 of your book for an example. How many cycles do the four iterations take to complete? Show your work.

Solution:

MEM1              MEM2              Integer              FP Add/Sub            FP Mul/Div
LD F2, 0(R1)      LD F4, 0(R2)
LD F8, 16(R1)     LD F10, 16(R2)
LD F14, 32(R1)    LD F16, 32(R2)                         ADD.D F2, F2, F2      MUL.D F4, F4, F
LD F20, 48(R1)    LD F22, 48(R2)                         ADD.D F8, F8, F8      MUL.D F10, F10, F
                                                         ADD.D F14, F14, F14   MUL.D F16, F16, F
                                                         ADD.D F20, F20, F20   MUL.D F22, F22, F
                                    DADDUI R3, R3, #
                                    DADDUI R2, R2, #
                                    DADDUI R1, R1, #
                                    DSUBUI R5, R5, #
                                                         ADD.D F6, F2, F
                                                         ADD.D F12, F8, F
                                                         ADD.D F18, F14, F
                                                         ADD.D F24, F20, F
SD F6, 0(R3)      SD F12, 16(R3)
SD F18, 32(R3)    SD F24, 48(R3)    BNEZ R5, loop

20 cycles for 4 iterations.
10 points total. 3 points for the correct number of cycles. 0.5 points for every correct line.

(b) Suppose we want the number of cycles per original iteration to drop to 3 or below. At least how many times do we need to unroll the loop? You don't need to show the scheduled instructions, but do justify your answer.

Solution: For each iteration that we unroll after the 4th, we add a few more cycles to the execution time. From the 5th to the 8th iteration, we would only be adding one cycle onto the total run time. So notice that if we unroll 4 more iterations, it becomes 23 cycles for 8 iterations, which is ≤ 3 cycles per iteration.
4 points total for the correct answer.
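The arithmetic behind part (b) can be written out explicitly. This is a minimal sketch that uses only the two cycle counts stated in the solution (20 cycles for 4 iterations, 23 cycles for 8); the helper name `cycles_per_iteration` is made up for illustration.

```python
def cycles_per_iteration(total_cycles, iterations):
    """Average number of cycles spent on each original loop iteration."""
    return total_cycles / iterations


unrolled_4 = cycles_per_iteration(20, 4)  # 5.0
unrolled_8 = cycles_per_iteration(23, 8)  # 2.875

print(unrolled_4, unrolled_8)
print(unrolled_8 <= 3)  # True: unrolling 8 times meets the target
```

Unrolling four times gives 5 cycles per original iteration, well above the target, while the stated 23 cycles for eight iterations gives 2.875, which is at or below 3.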

Software Pipelining **** Graduate Student Problem **** [10 Points]

Consider once more the loop from question 1.

(a) Provide the steady-state code for a software pipelined version of the loop. You can assume the loop will have at least four iterations. Show your work.

Solution:

loop    S.D     F6, -48(R3)     // x - 3
        ADD.D   F6, F8, F10     // x - 2
        MUL.D   F8, F2, F2      // x - 1
        ADD.D   F10, F4, F4     // x - 1
        L.D     F2, 0(R1)       // x
        L.D     F4, 0(R2)       // x
        DADDUI  R1, R1, #16
        DADDUI  R2, R2, #16
        DADDUI  R3, R3, #16
        DSUBUI  R5, R5, #8
        BNEZ    R5, loop

6 points total. 1 point for the store, 0.5 points for every other line. Registers may not be the same.

(b) Now provide the start-up and finish-up code for the loop you provided in part (a). Don't attempt to schedule it optimally. Show your work.

Solution:

startup code:
        L.D     F2, 0(R1)
        L.D     F4, 0(R2)
        MUL.D   F8, F2, F2
        ADD.D   F10, F4, F4
        L.D     F2, 16(R1)
        L.D     F4, 16(R2)
        ADD.D   F6, F8, F10
        MUL.D   F8, F2, F2
        ADD.D   F10, F4, F4
        L.D     F2, 32(R1)
        L.D     F4, 32(R2)
        DADDUI  R1, R1, #
        DADDUI  R2, R2, #
        DADDUI  R3, R3, #
        DSUBUI  R5, R5, #

cleanup:
loop    S.D     F6, -48(R3)
        ADD.D   F6, F8, F10
        MUL.D   F8, F2, F2
        ADD.D   F10, F4, F4
        S.D     F6, -32(R3)
        ADD.D   F6, F8, F10
        S.D     F6, -16(R3)

4 points total. Subtract 0.25 points for every error, up to 4 points. Note that instructions can be in any order except the ones dependent on each other.
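To see why the start-up, steady-state, and cleanup pieces together compute the same values as the original loop, the pipelined structure can be mimicked in a high-level language. The following is a sketch under assumptions: Python lists `a`, `b`, and `out` stand in for the memory reached through R1, R2, and R3, the 16-byte stride is ignored, and the variables `f2` ... `f10` merely mirror the register names used above.

```python
def original_loop(a, b):
    """One result per iteration: a[i]*a[i] + (b[i] + b[i])."""
    return [x * x + (y + y) for x, y in zip(a, b)]


def software_pipelined(a, b):
    """Same results, with store, adds, multiply, and loads taken from different iterations."""
    n = len(a)
    if n < 3:
        raise ValueError("this sketch needs at least three iterations")
    out = [None] * n

    # Start-up: fill the pipeline with iterations 0, 1, and 2.
    f2, f4 = a[0], b[0]
    f8, f10 = f2 * f2, f4 + f4
    f2, f4 = a[1], b[1]
    f6 = f8 + f10
    f8, f10 = f2 * f2, f4 + f4
    f2, f4 = a[2], b[2]

    # Steady state: each pass touches iterations x-3, x-2, x-1, and x.
    for x in range(3, n):
        out[x - 3] = f6      # S.D    (iteration x - 3)
        f6 = f8 + f10        # ADD.D  (iteration x - 2)
        f8 = f2 * f2         # MUL.D  (iteration x - 1)
        f10 = f4 + f4        # ADD.D  (iteration x - 1)
        f2, f4 = a[x], b[x]  # L.D    (iteration x)

    # Cleanup: drain the last three iterations.
    out[n - 3] = f6
    f6 = f8 + f10
    f8, f10 = f2 * f2, f4 + f4
    out[n - 2] = f6
    f6 = f8 + f10
    out[n - 1] = f6
    return out


a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
print(software_pipelined(a, b) == original_loop(a, b))  # True
```

Each pass of the steady-state loop stores the result of iteration x - 3 while loading the operands of iteration x, so no instruction waits on a value produced within the same pass.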

Predication [10 Points]

Consider the following format for predicated MIPS instructions:

        (pA) DADD R1, R2, R

where the DADD instruction is predicated on the predicate register pA. Assume a set of four 1-bit predicate registers (pA, pB, pC, pD) that are set by a compare instruction of the form:

        CMP.ne pA, pB, R8, R0

The above uses a not-equal (.ne) comparison relation to match the code fragment above. The compare sets the 1-bit predicate registers as follows:

        pA = (R8 != R0)
        pB = !(R8 != R0)

Assume a CMP.gt instruction also exists, using the greater-than (>) comparison relation. Otherwise, CMP.gt works just like CMP.ne. Consider the following C code.

if (a>=b) {
    if (c>d) {
        x = 2;
        y = x + 1;
    } else {
        x = 3;

GCD Test [5 Points]

Consider the following code fragment:

for (i=0; i <= 50; i += 2) { A[2*i + 4] = A[100 * i + 203]; }

We want to apply the GCD test to see if there is a dependency.

(a) For the GCD test, we need to "normalize" the loop by making the index begin at 1 and the loop variable increment by 1 on every iteration. Rewrite the code to achieve this.

Solution:

for (i=1; i <= 26; i += 1) { A[4*i] = A[200 * i + 3]; }

2 points total. Subtract 0.5 for each mistake, up to 2.

(b) Apply the GCD test. Is there a loop dependency?

Solution: The GCD of 4 and 200 is 4, and 3 − 0 = 3. Since 4 does not divide 3, there is no dependency.
3 points for the correct answer.
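Both the normalization and the GCD test result can be checked mechanically. The sketch below is an illustration only; the function name `gcd_test_allows_dependence` and its argument convention (a write to `A[a*i + b]` and a read from `A[c*i + d]`) are assumptions chosen to match the usual statement of the test.

```python
from math import gcd


def gcd_test_allows_dependence(a, b, c, d):
    """Write to A[a*i + b], read from A[c*i + d].

    Returns True when GCD(a, c) divides (d - b), i.e. when the test
    cannot rule out a dependence. False means no dependence is possible.
    """
    g = gcd(a, c)
    if g == 0:  # both subscripts are constants
        return b == d
    return (d - b) % g == 0


# Normalized loop: A[4*i] = A[200*i + 3] for i = 1..26
possible = gcd_test_allows_dependence(4, 0, 200, 3)
print(possible)  # False: 4 does not divide 3

# Cross-check the normalization against the original loop bounds.
writes_original = {2 * i + 4 for i in range(0, 51, 2)}
writes_normalized = {4 * i for i in range(1, 27)}
reads_original = {100 * i + 203 for i in range(0, 51, 2)}
reads_normalized = {200 * i + 3 for i in range(1, 27)}
print(writes_original == writes_normalized, reads_original == reads_normalized)

# Brute-force confirmation: no element is both written and read.
print(writes_normalized & reads_normalized)  # set()
```

The GCD test is conservative in general: a True result only means a dependence cannot be ruled out, whereas a False result, as here, proves there is none.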
