



Material Type: Assignment; Class: COMPU ARCHITECT PRIN; Subject: COMPUTER DESIGN/ARCHITECTURE; University: University of Florida; Term: Summer 2003;
Typology: Assignments




Tomasulo’s Algorithm (question 3.6 in the book)

In this exercise, we will look at how variations on Tomasulo’s algorithm perform when running a common vector loop. The loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the operation Y = aX + Y for a vector of length 100. Initially, R1 = 0 and F0 contains a.

foo:  L.D     F2,0(R1)      ;load X(i)
      MUL.D   F4,F2,F0      ;multiply aX(i)
      L.D     F6,0(R2)      ;load Y(i)
      ADD.D   F6,F4,F6      ;add aX(i) + Y(i)
      S.D     F6,0(R2)      ;store Y(i)
      DADDUI  R1,R1,#8      ;increment X index
      DADDUI  R2,R2,#8      ;increment Y index
      DSGTUI  R3,R1,#800    ;test if done
      BEQZ    R3,foo        ;loop if not done

The pipeline function units are as described in the following table.

FU type         Cycles in EX   Number of FUs   Number of reservation stations
Integer         1              1               5
FP adder        4              1               3
FP multiplier   15             1               2

Assume the following:
- Function units are not pipelined.
- There is no forwarding between function units; results are communicated by the CDB.
- The execution stage (EX) does both the effective address calculation and the memory access for loads and stores. Thus the pipeline is IF/ID/IS/EX/WB, so LD/ST can execute in the same cycle as the address calculation.
- Loads take 1 cycle (always a cache hit).
- The issue (IS) and write result (WB) stages each take 1 clock cycle.
- There are 5 load buffer slots and 5 store buffer slots.
- Assume that the BEQZ instruction takes 0 clock cycles. This means that BEQZ must wait until all data dependences are resolved, after which there is no latency in EX. There is also no latency between its EX and the issue cycle of the next instruction.
- The LD/ST address calculation and the memory access are done in the same cycle (MEM and EX are done at the same time, so there is no need for the MEM column on page 222).
- Assume the FU is free starting at WB.
- Assume the reservation station becomes free at the WB stage.
- Assume BEQZ does not take up a slot in the reservation station.
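For reference, the computation performed by the loop can be sketched in Python. This is only an illustration of Y = aX + Y; the function name `daxpy` and the use of Python lists for the vectors are assumptions made for this sketch and are not part of the assignment. The bound #800 in the MIPS code corresponds to 100 double-precision elements of 8 bytes each.

```python
def daxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i], and return y.

    Hypothetical helper for illustration; mirrors one MUL.D and one ADD.D
    per element, as in the MIPS loop.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        ax = a * x[i]        # MUL.D F4,F2,F0
        y[i] = ax + y[i]     # ADD.D F6,F4,F6 followed by S.D
    return y


if __name__ == "__main__":
    # Vector length 100, as in the exercise (100 elements * 8 bytes = 800).
    x = [float(i) for i in range(100)]
    y = [1.0] * 100
    daxpy(2.0, x, y)
    print(y[:3])
```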
a. For this problem use the single-issue Tomasulo MIPS pipeline of Figure 3.2 with the pipeline latencies from the table above. Show the number of stall cycles for each instruction and the clock cycle in which each instruction begins execution (i.e., enters its first EX cycle) for three iterations of the loop. How many clock cycles does each loop iteration take? Report your answer in the form of a table like that in Figure 3.25. Assume 1 CDB, so only 1 WB can occur per cycle.

Solution

Here is the first iteration and the beginning of the second iteration (fill out the third iteration in the same manner). (This is with a pipelined FP FU.)
                            Iteration 1          Iteration 2          Iteration 3
Instruction                 IS   EX      WB      IS   EX      WB      IS   EX
foo: L.D    F2,0(R1)        1    2       3       10   11      12      19   20
     MUL.D  F4,F2,F0        2    4-18    19      11   13-27   28
     L.D    F6,0(R2)        3    4
     ADD.D  F6,F4,F6        4    20-23   24
     S.D    F6,0(R2)        5    25      -
     DADDUI R1,R1,#8        6    7
     DADDUI R2,R2,#8        7    8
     DSGTUI R3,R1,#800      8    9
     BEQZ   R3,foo          9    10      -
[Table: occupied slots in the reservation stations. Rows: 1st L.D (X X), 2nd L.D, S.D, 1st DADDUI, 2nd DADDUI, DSGTUI. The cycle columns are not preserved.]
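The timings in the part a solution can be cross-checked with a small dependency-driven model. The following is a minimal sketch, not the book's method: it assumes pipelined function units (as the solution notes), one issue per cycle, an operand becoming usable in the cycle after its producer's WB, and BEQZ executing in the same cycle its operand is broadcast, with the next instruction issuing in that cycle. It ignores CDB contention and reservation-station limits. The names `Instr`, `LOOP`, and `simulate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    text: str
    dest: Optional[str]        # None for S.D and BEQZ (no WB)
    srcs: Tuple[str, ...]
    latency: int               # cycles in EX; 0 models the zero-latency BEQZ


LOOP = [
    Instr("L.D F2,0(R1)", "F2", ("R1",), 1),
    Instr("MUL.D F4,F2,F0", "F4", ("F2", "F0"), 15),
    Instr("L.D F6,0(R2)", "F6", ("R2",), 1),
    Instr("ADD.D F6,F4,F6", "F6", ("F4", "F6"), 4),
    Instr("S.D F6,0(R2)", None, ("F6", "R2"), 1),
    Instr("DADDUI R1,R1,#8", "R1", ("R1",), 1),
    Instr("DADDUI R2,R2,#8", "R2", ("R2",), 1),
    Instr("DSGTUI R3,R1,#800", "R3", ("R1",), 1),
    Instr("BEQZ R3,foo", None, ("R3",), 0),
]


def simulate(iterations):
    """Return rows (iteration, text, issue, ex_start, ex_end, wb)."""
    ready = {}   # register -> WB cycle of its latest producer (absent: ready at 0)
    rows = []
    issue = 1
    for it in range(1, iterations + 1):
        for ins in LOOP:
            operands = max(ready.get(r, 0) for r in ins.srcs)
            if ins.latency == 0:
                # Assumption: BEQZ resolves in the cycle its operand is broadcast.
                ex_start = max(issue + 1, operands)
                ex_end = ex_start
            else:
                # Operands arrive over the CDB, so EX starts the cycle after WB.
                ex_start = max(issue + 1, operands + 1)
                ex_end = ex_start + ins.latency - 1
            wb = ex_end + 1 if ins.dest is not None else None
            if ins.dest is not None:
                ready[ins.dest] = wb
            rows.append((it, ins.text, issue, ex_start, ex_end, wb))
            # Single issue; after BEQZ the next instruction issues in BEQZ's EX cycle.
            issue = ex_start if ins.latency == 0 else issue + 1
    return rows


if __name__ == "__main__":
    for row in simulate(3):
        print(row)
```

Under these assumptions the model starts the three iterations at cycles 1, 10, and 19, which agrees with the issue cycles recorded for the first L.D in the table. WB values that the table leaves blank (for example, for the second L.D) are outputs of this model only, not part of the original solution.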
c. Using the MIPS code for DAXPY above, assume Tomasulo’s algorithm with speculation as shown in Figure 3.29. Assume the latencies shown in Figure 3.63. Assume that there are separate integer function units for effective address calculation, for ALU operations, and for branch condition evaluation. Create a table as in Figure 3.34 for the first three iterations of this loop.
   a. Assume dual issue.
   b. Assume you have 2 CDBs; thus you can commit at most 2 instructions per cycle.
   c. You have as many ROB slots as you need.

Solution (This is for a pipelined FP FU)

                            Iteration 1                        Iteration 2
Instruction                 I    EX      WB (CDB)   Commit     I    EX      WB (CDB)   Commit
foo: L.D    F2,0(R1)        1    2       3          4          6    7       8
     MUL.D  F4,F2,F0        1    4-18    19         20         6    9-23    24         28
     L.D    F6,0(R2)        2    3       4
     ADD.D  F6,F4,F6        2    20-23   24         25
     S.D    F6,0(R2)        3    25      -          25
     DADDUI R1,R1,#8        3    4       5
     DADDUI R2,R2,#8        4    5       6
     DSGTUI R3,R1,#800      4    6       7
     BEQZ   R3,foo          5    7       -          27

[Table: occupied slots in the reservation stations, with cycle numbers as column headings: 1 2 3 4 5 6 7 8 9 10 11 19 20 21 22 23 24. Rows: 1st L.D (X X), 2nd L.D (X), S.D, 1st DADDUI, 2nd DADDUI, DSGTUI. The column positions of the marks are not preserved.]
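The commit cycles in the part c table follow from in-order commit with at most 2 commits per cycle. The sketch below is one way to model that stage, under assumptions chosen to match the recorded values: an instruction that writes the CDB may commit no earlier than the cycle after its WB, while an instruction with no WB (S.D, BEQZ) may commit as early as its EX cycle. The function name `commit_cycles` and the list `ITER1` are hypothetical; the EX and WB inputs are the first-iteration values from the table.

```python
from collections import Counter


def commit_cycles(instrs, width=2):
    """In-order commit, at most `width` instructions per cycle.

    instrs: list of (text, last_ex_cycle, wb_cycle_or_None) in program order.
    Returns a list of (text, commit_cycle).
    """
    used = Counter()   # cycle -> number of commits already placed in it
    prev = 0           # commit cycle of the previous instruction (in-order rule)
    result = []
    for text, ex_end, wb in instrs:
        # Assumption: commit the cycle after WB, or in the EX cycle if there is no WB.
        earliest = wb + 1 if wb is not None else ex_end
        cycle = max(earliest, prev)
        while used[cycle] >= width:
            cycle += 1
        used[cycle] += 1
        prev = cycle
        result.append((text, cycle))
    return result


# First-iteration EX end and WB cycles, taken from the part c table.
ITER1 = [
    ("L.D F2,0(R1)", 2, 3),
    ("MUL.D F4,F2,F0", 18, 19),
    ("L.D F6,0(R2)", 3, 4),
    ("ADD.D F6,F4,F6", 23, 24),
    ("S.D F6,0(R2)", 25, None),
    ("DADDUI R1,R1,#8", 4, 5),
    ("DADDUI R2,R2,#8", 5, 6),
    ("DSGTUI R3,R1,#800", 6, 7),
    ("BEQZ R3,foo", 7, None),
]

if __name__ == "__main__":
    for text, cycle in commit_cycles(ITER1):
        print(f"{text:<20} commits at {cycle}")
```

With these inputs the model reproduces the commit cycles the table records (4, 20, 25, 25, and 27 for the first L.D, MUL.D, ADD.D, S.D, and BEQZ). The commit cycles it produces for the rows the table leaves blank are model outputs only.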