









This course focuses on quantitative principles of computer design, instruction set architectures, datapath and control, memory hierarchy design, main memory, cache, hard drives, multiprocessor architectures, storage and I/O systems, and computer clusters. This lecture includes: Functional, Execution, Units, Pipelined, Standard, Loop, Latencies, Branch, Clock, Cycle, Iterations
Typology: Slides










Let us assume that the loop is implemented using a standard five-stage pipeline with a branch delay of one clock cycle.
Functional units are fully pipelined.
The functional units have latencies as shown in the table.
Stalls of FP ALU and Load Instruction

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Here, the first column shows the originating instruction type, the second column is the type of the consuming instruction, and the last column is the number of intervening clock cycles needed to avoid a stall.
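The latency table can be read as a lookup: given a producer, a consumer, and the number of independent instructions already placed between them, the difference is the number of stall cycles still required. The following is a minimal sketch of that reading; the dictionary keys and the function name `stalls_needed` are illustrative names, not part of the lecture.

```python
# Latency table from the slide: (producer, consumer) -> number of
# intervening clock cycles needed to avoid a stall.
# The string keys are hypothetical labels chosen for this sketch.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}


def stalls_needed(producer, consumer, intervening):
    """Stall cycles still required when `intervening` independent
    instructions already sit between producer and consumer."""
    required = LATENCY[(producer, consumer)]
    return max(0, required - intervening)


if __name__ == "__main__":
    # A load immediately followed by a dependent FP add.
    print(stalls_needed("load", "fp_alu", 0))
    # An FP add followed by a store with one instruction in between.
    print(stalls_needed("fp_alu", "store", 1))
```

For instance, with one independent instruction between a load and a dependent FP ALU operation, the sketch reports that no stall is needed, which is the situation a scheduler tries to create.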
Loop                          Clock cycle
  L.D    F0, 0(R1)            1
  DADDUI R1, R1, #-8          2
  ADD.D  F4, F0, F2           3
  Stall                       4
  BNE    R1, R2, LOOP         5   (delayed branch)
  S.D    F4, 8(R1)            6   (altered & interchanged with DADDUI)
Explanation
To schedule the delayed branch, the compiler had to determine that it could swap the DADDUI and S.D by changing the destination address of the S.D instruction.
You can see that the address 0(R1) is replaced by 8(R1), as R1 has already been decremented by DADDUI.
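The address adjustment can be checked with simple arithmetic: decrementing R1 by 8 and then storing at offset 8 reaches the same memory location as storing at offset 0 before the decrement. The sketch below illustrates this; the function names and the sample R1 values are assumptions made for illustration only.

```python
ELEMENT_SIZE = 8  # bytes per double, matching the #-8 step in the loop


def store_address_before_swap(r1):
    # S.D F4, 0(R1) executed before DADDUI changes R1
    return r1 + 0


def store_address_after_swap(r1):
    r1 = r1 - ELEMENT_SIZE  # DADDUI R1, R1, #-8 now runs first
    return r1 + 8           # S.D F4, 8(R1) compensates for the decrement


if __name__ == "__main__":
    for r1 in (8, 64, 8000):  # arbitrary sample pointer values
        same = store_address_before_swap(r1) == store_address_after_swap(r1)
        print(r1, same)
```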
In this example, one array element is processed and stored back every 6 clock cycles, but the actual work of operating on the array element takes only 3 clock cycles (load, add, and store).
The remaining 3 clock cycles per iteration are loop overhead (to evaluate the condition, stall, and branch); i.e., the loop overhead is 100% of the useful work in this example.
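The overhead figure follows directly from the cycle counts above. As a small worked check, with overhead expressed relative to the useful work cycles (the function name `loop_overhead` is a hypothetical helper for this sketch):

```python
def loop_overhead(total_cycles, work_cycles):
    """Return (overhead cycles, overhead as a percentage of useful work)."""
    overhead = total_cycles - work_cycles
    return overhead, 100 * overhead / work_cycles


if __name__ == "__main__":
    # 6 cycles per iteration, 3 of them for load, add, and store.
    overhead, percent = loop_overhead(6, 3)
    print(overhead, percent)
```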
To eliminate or reduce the impact of the loop overhead, here 3 clock cycles per iteration, we have to get more operations within the loop relative to the number of overhead instructions.
A simple way to increase the number of instructions per loop is to replicate the loop body a number of times and adjust the loop termination code.
This approach is known as loop unrolling.
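At the source level, the transformation might look like the sketch below. It assumes the loop adds a scalar to every element of an array, which is what the L.D / ADD.D / S.D sequence suggests, and that the array length is a multiple of the unroll factor 4. The function names are hypothetical.

```python
def add_scalar_rolled(x, s):
    """One element per iteration: one index update and one test per element."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Four copies of the body per iteration: one index update and one
    test per four elements. Assumes len(x) is a multiple of 4."""
    assert len(x) % 4 == 0, "this sketch has no clean-up loop for leftovers"
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4  # termination code adjusted for four elements per pass
    return x


if __name__ == "__main__":
    data = [float(v) for v in range(8)]
    print(add_scalar_rolled(list(data), 2.0) == add_scalar_unrolled4(list(data), 2.0))
```

Both versions compute the same result; the unrolled one simply executes the index update and the loop test a quarter as often.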
Example: Loop Unrolling without scheduling

LOOP: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)       ; drop DADDUI & BNE
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)      ; drop DADDUI & BNE
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)    ; drop DADDUI & BNE
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, LOOP
Note that simply replicating the instructions when the loop is unrolled results in the same registers being reused, which could prevent us from scheduling the loop effectively.
Note that here, without scheduling, every operation in the unrolled loop is followed by a dependent operation; e.g., L.D followed by ADD.D has a data dependence. Therefore each L.D has 1 stall; similarly, each ADD.D has 2 stalls, the DADDUI has 1 stall, and the branch has 1 stall.
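The stall counts above can be added up to estimate the running time of the unrolled, unscheduled loop. The sketch below assumes one issue cycle per instruction and that the stall counts apply to every L.D and every ADD.D in the loop body; the function name `unrolled_cycles` is a hypothetical helper.

```python
def unrolled_cycles(copies, load_stalls=1, add_stalls=2,
                    daddui_stalls=1, branch_stalls=1):
    """Return (total cycles, cycles per element) for an unscheduled loop
    unrolled `copies` times, assuming one issue cycle per instruction."""
    # Each copy is L.D + ADD.D + S.D; one DADDUI and one BNE close the loop.
    instructions = 3 * copies + 2
    stalls = copies * (load_stalls + add_stalls) + daddui_stalls + branch_stalls
    total = instructions + stalls
    return total, total / copies


if __name__ == "__main__":
    total, per_element = unrolled_cycles(4)
    print(total, per_element)
```

Under these assumptions the four-way unrolled loop takes 14 issue cycles plus 14 stall cycles, i.e. 28 cycles for four elements or 7 cycles per element, which shows why the unrolled loop still needs to be scheduled.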