Loop execution without Basic Scheduling-Advance Computer Architecture-Lecture Slides, Slides of Advanced Computer Architecture

This course focuses on quantitative principle of computer design, instruction set architectures, datapath and control, memory hierarchy design, main memory, cache, hard drives, multiprocessor architectures, storage and I/O systems, computer clusters. This lecture includes: Functional, Execution, Units, Pipelined, Staandard, Loop, Latencies, Branch, Clock, Cycle, Iterations

Typology: Slides

2011/2012

Uploaded on 08/06/2012

amrusha
amrusha 🇮🇳

4.5

(33)

147 documents

1 / 16

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Loop execution without Basic
Scheduling
Let us assume that the loop is implemented
using standard five stage pipeline with
branch delay of one clock cycle
Functional units are fully pipelined
The functional units have latencies as
shown in the table
docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Loop execution without Basic Scheduling-Advance Computer Architecture-Lecture Slides and more Slides Advanced Computer Architecture in PDF only on Docsity!

Loop execution without Basic

Scheduling

Let us assume that the loop is implemented using standard five stage pipeline with branch delay of one clock cycle Functional units are fully pipelined The functional units have latencies as shown in the table

Stalls of FP ALU and Load Instruction Instruction producing result Instruction using result Latency in clock cycle FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Load double Store double 0 Here, the First column shows originating instruction type Second column is the type of consuming instruction Last column is the number of intervening clock cycles needed to avoid a stall

Single loop execution With Compiler

scheduling

Loop clock cycles L.D F0, 0(R1) 1 DADDUI R1, R1, #-8 2 ADD.D F4, F0, F2 3 Stall 4 BNE R1, R2, LOOP 5 (delayed branch) S.D F4, 8(R1) 6 (altered & interchanged with DADDUI) )

Explanation To schedule the delay branch, complier had to determine that it could swap the DADDUI and S.D by changing the destination address of S.D instruction You can see that the address 0(R1) and is replaced by 8(R1); as R1 has been decremented by DADDUI

Explanation .. Cont’d

In this example, one loop iteration and store back is completed in one array element every 6 clock cycles but the actual work of operating on the array element takes 3 clock cycles ( load, add, and store) The remaining 3 clock cycles per iteration are the loop-overhead (to evaluate the condition, stall and branch); i.e., the loop over-head is 100% in this example

Loop Unrolling

To eliminate or reduce the impact the loop- overhead, here 3 clock cycles per loop, we have to get more operations within the loop, relative to the number of overhead instructions A simple way to increase the number of instructions per loop can be to replicate the loop body for number of iterations and adjusting the loop termination code This approach is known as loop unrolling

LOOP L.D F0, 0(R1) ADD.D F4, F0, F S.D F4, 0(R1) ; drop ADDUI &BNE L.D F6,-8(R1) ADD.D F8, F6, F2 L S.D F8, -8(R1) ; drop ADDUI &BNE L.D F10,-16(R1) ADD.D F12, F10, F S.D F12, -16(R1) ; drop ADDUI &BNE L.D F14,-24(R1) ADD.D F16, F14, S.D F16 , -24(R1) DADDUI R1, R1, #- BNE R1, R2, LOOP Example: Loop Unrolling without scheduling

Loop Unrolling and scheduling

Note that simply replicating the instructions, when the loop is unrolled, results in the use of the same register that could prevent us from effectively scheduling the loop

Loop clock cycles

  • L.D F0, 0(R1)
    • stall
  • ADD.D F4, F0, F2
  • Stall
  • stall
  • S.D F4, 0(R1) ; drop ADDUI &BNE
  • L.D F6,-8(R1)
  • Stall
  • ADD.D F8, F6, F2
  • Stall
  • stall
  • S.D F8, -8(R1) ; drop ADDUI &BNE
  • L.D F10,-16(R1)
  • Stall
  • ADD.D F12, F10, F2
  • Stall
  • stall
  • S.D F12, -16(R1) ; drop ADDUI &BNE
  • L.D F14,-24(R1)
  • Stall
  • ADD.D F16, F14,
  • Stall
  • stall
  • S.D F16 -24(R1)
  • DADDUI R1, R1, #-32
  • Stall
  • BNE R1, R2, LOOP
  • Stall

Note that, here without scheduling, every operation in the unrolled loop is followed by dependent operations e.g., L.D followed by ADD.D has data dependence; therefore L.D has 1 stall similarly, 2 stalls are for ADD.D, 1 stall for DADDUI and 1 stall for branch

Explanation: Example loop unrolling

Now let us see the performance of

unrolled loop with scheduling

As the instructions within a loop

from different iterations, can be

re-ordered, therefore, Loop

Unrolling can also be used to

improve scheduling

Unrolling with scheduling