Question on Tomasulo Algorithms - Assignments | CDA 5155, Assignments of Electrical and Electronics Engineering

Material Type: Assignment; Class: COMPU ARCHITECT PRIN; Subject: COMPUTER DESIGN/ARCHITECTURE; University: University of Florida; Term: Summer 2003;

Uploaded on 03/18/2009

koofers-user-6vc-1
Tomasulo’s Algorithm
question 3.6 in the book
In this exercise, we will look at how variations on Tomasulo’s algorithm perform when running a
common vector loop. The loop is the so-called DAXPY loop (double-precision aX plus Y) and is the
central operation in Gaussian elimination. The following code implements the operation Y = aX + Y for
a vector of length 100. Initially, R1 = 0 and F0 contains a.
foo: L.D F2,0(R1) ;load X(i)
MUL.D F4,F2,F0 ;multiply a*X(i)
L.D F6,0(R2) ;load Y(i)
ADD.D F6,F4,F6 ;add a*X(i) + Y(i)
S.D F6,0(R2) ;store Y(i)
DADDUI R1,R1,#8 ;increment X index
DADDUI R2,R2,#8 ;increment Y index
DSGTUI R3,R1,#800 ;test if done
BEQZ R3,foo ;loop if not done
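For reference, the arithmetic that the loop performs can be sketched in Python. This is only an illustration of Y = aX + Y and is not part of the assignment; the function name `daxpy` and the sample values are assumptions made for the example.

```python
def daxpy(a, x, y):
    """Return Y after the update Y[i] = a * X[i] + Y[i] for every element."""
    for i in range(len(x)):
        # One pass of this loop body corresponds to one iteration of foo:
        # load X(i), multiply by a, load Y(i), add, store Y(i).
        y[i] = a * x[i] + y[i]
    return y


result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

In the assembly version the index variables R1 and R2 advance by 8 bytes per element because each element is a double-precision value.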
The pipeline function units are as described in the following table.
FU type          Cycles in EX   Number of FUs   Number of reservation stations
Integer          1              1               5
FP adder         4              1               3
FP multiplier    15             1               2
Assume the following:
Function units are not pipelined.
There is no forwarding between function units; results are communicated by the CDB.
The execution stage (EX) does both the effective address calculation and the memory access for loads
and stores. Thus the pipeline is IF/ID/IS/EX/WB, so LD/ST can execute in the same cycle as the
address calculation.
Loads take 1 cycle (always a cache hit).
The issue (IS) and write result (WB) stages each take 1 clock cycle.
There are 5 load buffer slots and 5 store buffer slots.
Assume that the BEQZ instruction takes 0 clock cycles. This means that BEQZ must
wait until all data dependences are resolved, after which there is no latency in EX. There is also no
latency between its EX cycle and the issue cycle of the next instruction.
When doing the LD/ST address calculation, the LD/ST is done in the same cycle (MEM and EX are done
at the same time, so there is no need for the MEM column on page 222).
Assume FU is free starting at WB
Assume the reservation station becomes free at the WB stage
Assume BEQZ does not take up a slot in the reservation station

a. For this problem use the single-issue Tomasulo MIPS pipeline of Figure 3.2 with the pipeline latencies from the table above. Show the number of stall cycles for each instruction and the clock cycle in which each instruction begins execution (i.e., enters its first EX cycle) for three iterations of the loop. How many clock cycles does each loop iteration take? Report your answer in the form of a table like that in Figure 3.25.

Assume 1 CDB, so only 1 WB per cycle.

Solution

Here is the first iteration and the beginning of the second iteration (fill out the third iteration in the same manner). (This is with a pipelined FP FU.)

                       |---- cycle 1 ----|---- cycle 2 ----|---- cycle 3 ----|
                       | I   EX     WB   | I   EX     WB   | I   EX     WB   |
                       |            (CDB)|            (CDB)|            (CDB)|
foo: L.D F2,0(R1)      | 1   2      3    | 10  11     12   | 19  20          |
     MUL.D F4,F2,F0    | 2   4-18   19   | 11  13-27  28   |                 |
     L.D F6,0(R2)      | 3   4           |                 |                 |
     ADD.D F6,F4,F6    | 4   20-23  24   |                 |                 |
     S.D F6,0(R2)      | 5   25     -    |                 |                 |
     DADDUI R1,R1,#8   | 6   7           |                 |                 |
     DADDUI R2,R2,#8   | 7   8           |                 |                 |
     DSGTUI R3,R1,#800 | 8   9           |                 |                 |
     BEQZ R3,foo       | 9   10     -    |                 |                 |
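The first-iteration entries of this schedule can be cross-checked with a small dependence calculation. The sketch below is a simplification, not the full algorithm, and rests on stated assumptions: one instruction issues per cycle starting at cycle 1, EX starts in the first cycle that is after both the issue cycle and the WB cycle of the last operand, WB follows the last EX cycle, and BEQZ resolves in the same cycle in which its operand is written back. It ignores structural hazards (function-unit, reservation-station, and CDB conflicts), so it should not be trusted for later iterations. All names (`EX_CYCLES`, `LOOP`, `schedule_first_iteration`) are hypothetical.

```python
# Cycles spent in EX, taken from the FU table and the listed assumptions.
EX_CYCLES = {"L.D": 1, "S.D": 1, "MUL.D": 15, "ADD.D": 4,
             "DADDUI": 1, "DSGTUI": 1}

# (opcode, destination register or None, source registers)
LOOP = [
    ("L.D",    "F2", ["R1"]),
    ("MUL.D",  "F4", ["F2", "F0"]),
    ("L.D",    "F6", ["R2"]),
    ("ADD.D",  "F6", ["F4", "F6"]),
    ("S.D",    None, ["F6", "R2"]),
    ("DADDUI", "R1", ["R1"]),
    ("DADDUI", "R2", ["R2"]),
    ("DSGTUI", "R3", ["R1"]),
    ("BEQZ",   None, ["R3"]),
]


def schedule_first_iteration(loop=LOOP):
    """Return (opcode, issue, ex_start, ex_end, wb) for each instruction."""
    ready = {}  # register -> WB cycle of its most recent producer
    rows = []
    for issue, (op, dest, srcs) in enumerate(loop, start=1):
        # Registers with no producer in this iteration are ready at cycle 0.
        last_wb = max(ready.get(reg, 0) for reg in srcs)
        if op == "BEQZ":
            # Assumed zero-latency branch: resolves when its operand arrives.
            ex_start = max(issue + 1, last_wb)
            ex_end = ex_start
        else:
            ex_start = max(issue + 1, last_wb + 1)
            ex_end = ex_start + EX_CYCLES[op] - 1
        wb = ex_end + 1 if dest is not None else None  # S.D, BEQZ: no WB
        if dest is not None:
            ready[dest] = wb
        rows.append((op, issue, ex_start, ex_end, wb))
    return rows


for row in schedule_first_iteration():
    print(row)
```

Under these assumptions the sketch gives MUL.D in EX during cycles 4-18 with WB at 19, ADD.D in EX during cycles 20-23 with WB at 24, S.D in EX at cycle 25, and BEQZ in EX at cycle 10, which agrees with the solution table.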

Make sure to consider how many functional units are occupied. For instance, here is a table showing the instructions occupying the integer reservation stations for the first few cycles:

Cycle         1  2  3  4  5  6  7  8  9  10  11  ...  N
1st LD        X  X
2nd LD
S.D
1st DADDUI
2nd DADDUI
DSGTUI
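The occupancy of a single reservation station can be sketched from the issue and WB cycles. This assumes, following the stated rule that a reservation station becomes free at the WB stage, that an instruction holds its station from its issue cycle up to the cycle before WB. The function name `occupied_cycles` is hypothetical, and the sketch does not cover instructions without a WB cycle (S.D and BEQZ).

```python
def occupied_cycles(issue, wb):
    """Cycles in which the station is held: issue cycle through WB - 1."""
    return list(range(issue, wb))


# First L.D of part a: issued in cycle 1, WB in cycle 3.
print(occupied_cycles(1, 3))   # [1, 2]
# MUL.D of part a: issued in cycle 2, WB in cycle 19.
print(len(occupied_cycles(2, 19)))
```

For the first L.D this gives cycles 1 and 2, matching the two marks in its row of the table.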

c. Using the MIPS code for DAXPY above, assume Tomasulo’s algorithm with speculation as shown in Figure 3.29. Assume the latencies shown in Figure 3.63. Assume that there are separate integer function units for effective address calculation, for ALU operations, and for branch condition evaluation. Create a table as in Figure 3.34 for the first three iterations of this loop.
a. Assume dual issue.
b. Assume you have 2 CDBs; thus you can commit at most 2 instructions per cycle.
c. You have as many ROB slots as you need.

Solution (This is for a pipelined FP FU.)

                       |------ cycle 1 -------|------ cycle 2 -------|------ cycle 3 -------|
                       | I   EX     WB    C   | I   EX     WB    C   | I   EX     WB    C   |
                       |            (CDB)     |            (CDB)     |            (CDB)     |
foo: L.D F2,0(R1)      | 1   2      3     4   | 6   7      8         |                      |
     MUL.D F4,F2,F0    | 1   4-18   19    20  | 6   9-23   24    28  |                      |
     L.D F6,0(R2)      | 2   3      4         |                      |                      |
     ADD.D F6,F4,F6    | 2   20-23  24    25  |                      |                      |
     S.D F6,0(R2)      | 3   25     -     25  |                      |                      |
     DADDUI R1,R1,#8   | 3   4      5         |                      |                      |
     DADDUI R2,R2,#8   | 4   5      6         |                      |                      |
     DSGTUI R3,R1,#800 | 4   6      7         |                      |                      |
     BEQZ R3,foo       | 5   7      -     27  |                      |                      |

Occupied slots in the reservation stations (the top row gives the cycle numbers):

Cycle         1  2  3  4  5  6  7  8  9  10  11  19  20  21  22  23  24
1st LD        X  X
2nd LD           X
SD
1st DADDUI
2nd DADDUI
DSGTUI
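The commit column of the speculative schedule follows from in-order commit with at most two commits per cycle. The sketch below illustrates that rule under assumptions inferred from the solution table rather than stated in the problem: an instruction with a WB cycle can commit no earlier than the cycle after WB, and S.D and BEQZ can commit as early as their EX cycle. The function name `commit_cycles` and the list `earliest` are hypothetical.

```python
def commit_cycles(earliest, width=2):
    """In-order commit: at most `width` instructions retire per cycle."""
    commits = []
    cycle, used = 0, 0
    for ready in earliest:
        if ready > cycle:
            # The instruction is not ready yet, so commit waits for it.
            cycle, used = ready, 0
        elif used == width:
            # The current commit cycle is full; move to the next one.
            cycle, used = cycle + 1, 0
        commits.append(cycle)
        used += 1
    return commits


# Earliest commit cycle per instruction, in program order, from the table:
# iteration 1 (WB + 1, or the EX cycle for S.D and BEQZ), followed by the
# first two instructions of iteration 2.
earliest = [4, 20, 5, 25, 25, 6, 7, 8, 7, 9, 25]
print(commit_cycles(earliest))
```

With these inputs the sketch reproduces every commit entry given in the table: 4 for the first L.D, 20 for MUL.D, 25 for ADD.D and S.D, 27 for BEQZ, and 28 for the second-iteration MUL.D. The commit cycles it produces for the blank table cells are only what this simplified rule implies.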