Loop execution without Basic Scheduling-Advance Computer Architecture-Lecture Slides, Slides of Advanced Computer Architecture

Gujarat University Advanced Computer Architecture

This course focuses on quantitative principles of computer design, instruction set architectures, datapath and control, memory hierarchy design, main memory, cache, hard drives, multiprocessor architectures, storage and I/O systems, and computer clusters. This lecture includes: Functional, Execution, Units, Pipelined, Standard, Loop, Latencies, Branch, Clock, Cycle, Iterations

Typology: Slides

2011/2012

Uploaded on 08/06/2012

amrusha 🇮🇳

Loop execution without Basic Scheduling

Let us assume that the loop is implemented using a standard five-stage pipeline with a branch delay of one clock cycle.

Functional units are fully pipelined.

The functional units have latencies as shown in the table.

Stalls of FP ALU and Load Instruction

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Here, the first column shows the originating instruction type, the second column is the type of the consuming instruction, and the last column is the number of intervening clock cycles needed to avoid a stall.
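The table can be read as a lookup from a (producer, consumer) pair to the number of stall cycles incurred when the consumer issues right after the producer. The following is a minimal Python sketch of that reading, assuming a single-issue, in-order pipeline; the names `LATENCY` and `stalls_needed` are illustrative and do not come from the slides.

```python
# Latencies from the table: (producer, consumer) -> intervening cycles needed.
LATENCY = {
    ("FP ALU op", "FP ALU op"): 3,
    ("FP ALU op", "Store double"): 2,
    ("Load double", "FP ALU op"): 1,
    ("Load double", "Store double"): 0,
}


def stalls_needed(producer, consumer, independent_between=0):
    """Stall cycles left after `independent_between` unrelated instructions
    have been placed between the producer and the consumer."""
    return max(0, LATENCY[(producer, consumer)] - independent_between)


if __name__ == "__main__":
    # ADD.D immediately followed by the S.D of its result.
    print(stalls_needed("FP ALU op", "Store double"))
    # One unrelated instruction in between hides one of those cycles.
    print(stalls_needed("FP ALU op", "Store double", independent_between=1))
```

Placing unrelated instructions between a producer and its consumer is exactly what the scheduling in the rest of the lecture tries to achieve.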

Single loop execution With Compiler scheduling

Loop                      clock cycles
L.D    F0, 0(R1)          1
DADDUI R1, R1, #-8        2
ADD.D  F4, F0, F2         3
Stall                     4
BNE    R1, R2, LOOP       5 (delayed branch)
S.D    F4, 8(R1)          6 (altered & interchanged with DADDUI)

Explanation

To schedule the delayed branch, the compiler had to determine that it could swap the DADDUI and S.D by changing the destination address of the S.D instruction.

You can see that the address 0(R1) is replaced by 8(R1), as R1 has already been decremented by 8 by DADDUI.
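The address adjustment can be checked with a small model. The sketch below assumes the unscheduled loop body is L.D, ADD.D, S.D, DADDUI, BNE (the pattern that also appears in the unrolled listing), uses a Python dict as a hypothetical memory, and uses made-up function names. It only confirms that storing to 8(R1) after the decrement hits the same address as 0(R1) before it.

```python
def run_original(mem, r1, r2, f2):
    """Unscheduled order: store to 0(R1), then decrement R1."""
    mem = dict(mem)
    while True:
        f0 = mem[r1 + 0]        # L.D    F0, 0(R1)
        f4 = f0 + f2            # ADD.D  F4, F0, F2
        mem[r1 + 0] = f4        # S.D    F4, 0(R1)
        r1 -= 8                 # DADDUI R1, R1, #-8
        if r1 == r2:            # BNE    R1, R2, LOOP (falls through)
            return mem


def run_scheduled(mem, r1, r2, f2):
    """Scheduled order: decrement R1 early, store to 8(R1) in the delay slot."""
    mem = dict(mem)
    while True:
        f0 = mem[r1 + 0]        # L.D    F0, 0(R1)
        r1 -= 8                 # DADDUI R1, R1, #-8
        f4 = f0 + f2            # ADD.D  F4, F0, F2
        taken = r1 != r2        # BNE    R1, R2, LOOP (delayed branch)
        mem[r1 + 8] = f4        # S.D    F4, 8(R1) executes in the delay slot
        if not taken:
            return mem


# Four doubles at byte addresses 8..32; the loop stops when R1 reaches R2 = 0.
memory = {8: 1.0, 16: 2.0, 24: 3.0, 32: 4.0}
same = run_original(memory, 32, 0, 10.0) == run_scheduled(memory, 32, 0, 10.0)
print(same)
```

Both orderings update the same four memory locations, which is the property the compiler has to establish before it may perform the swap.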

Explanation .. Cont’d

In this example, one loop iteration, including the store back of one array element, completes every 6 clock cycles, but the actual work of operating on the array element takes only 3 clock cycles (load, add, and store).

The remaining 3 clock cycles per iteration are loop overhead (to evaluate the condition, stall, and branch); i.e., the loop overhead is 100% of the useful work in this example.
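The overhead figure is simple arithmetic and can be written out as follows. This is a small sketch in which the function name `loop_overhead` is illustrative, and the percentage is taken relative to the useful cycles, as in the text.

```python
def loop_overhead(total_cycles, useful_cycles):
    """Return (overhead cycles, overhead as a percentage of useful cycles)."""
    overhead = total_cycles - useful_cycles
    return overhead, 100.0 * overhead / useful_cycles


# 6 cycles per iteration, of which load, add, and store are the useful 3.
overhead, percent = loop_overhead(total_cycles=6, useful_cycles=3)
print(overhead, percent)  # 3 100.0
```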

Loop Unrolling

To eliminate or reduce the impact of the loop overhead (here 3 clock cycles per iteration), we have to get more operations within the loop relative to the number of overhead instructions.

A simple way to increase the number of instructions per loop is to replicate the loop body a number of times and adjust the loop termination code.

This approach is known as loop unrolling.

Example: Loop Unrolling without scheduling

LOOP: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)       ; drop DADDUI & BNE
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)      ; drop DADDUI & BNE
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)    ; drop DADDUI & BNE
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, LOOP
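At source level, the same transformation looks roughly as follows. This is a sketch assuming the loop adds a scalar `s` to every element of an array `x`, walking from the last element to the first, and that the element count is a multiple of 4 to match the four copies in the listing; the function names are illustrative.

```python
def add_scalar_rolled(x, s):
    """One element per iteration: one index update and one branch each."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Four elements per iteration: one index update and one branch per four."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes len(x) is a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s          # copy 1, offset 0 in the listing
        x[i - 1] = x[i - 1] + s  # copy 2, offset -8
        x[i - 2] = x[i - 2] + s  # copy 3, offset -16
        x[i - 3] = x[i - 3] + s  # copy 4, offset -24
        i -= 4                   # corresponds to DADDUI R1, R1, #-32
    return x


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(add_scalar_unrolled4(list(data), 2.0) == add_scalar_rolled(list(data), 2.0))
```

The unrolled version performs the same element updates but pays for the index update and the branch once per four elements instead of once per element. A general version would also need cleanup code for element counts that are not a multiple of 4.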

Loop Unrolling and scheduling

Note that simply replicating the instructions when the loop is unrolled reuses the same registers in every copy of the loop body, which could prevent us from effectively scheduling the loop.
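One way to see the problem is to count name dependences between the unrolled copies, i.e., cases where a later instruction writes a register that an earlier instruction reads (WAR) or writes (WAW). The sketch below assumes a tiny hypothetical encoding of each instruction as (destination, sources) and compares reusing F0 and F4 in every copy against giving the second copy its own registers; none of these names come from the slides.

```python
def name_dependences(instrs):
    """Return (earlier, later, kind) triples for WAR and WAW pairs.

    Each instruction is (dest, sources); dest is None for stores.
    """
    found = []
    for j, (dest_j, _) in enumerate(instrs):
        if dest_j is None:
            continue
        for i in range(j):
            dest_i, srcs_i = instrs[i]
            if dest_i == dest_j:
                found.append((i, j, "WAW"))
            if dest_j in srcs_i:
                found.append((i, j, "WAR"))
    return found


def body(load_reg, sum_reg):
    """One copy of the loop body: L.D, ADD.D, S.D."""
    return [
        (load_reg, ("R1",)),          # L.D   load_reg, offset(R1)
        (sum_reg, (load_reg, "F2")),  # ADD.D sum_reg, load_reg, F2
        (None, (sum_reg, "R1")),      # S.D   sum_reg, offset(R1)
    ]


reused = body("F0", "F4") + body("F0", "F4")   # same registers in both copies
renamed = body("F0", "F4") + body("F6", "F8")  # distinct registers per copy
print(len(name_dependences(reused)), len(name_dependences(renamed)))
```

With reused registers, the second copy cannot be moved ahead of the first without overwriting values that are still needed; with distinct registers, the two copies are independent and can be interleaved freely.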

Loop                                          clock cycles

L.D    F0, 0(R1)                              1
stall                                         2
ADD.D  F4, F0, F2                             3
stall                                         4
stall                                         5
S.D    F4, 0(R1)      ; drop DADDUI & BNE     6
L.D    F6, -8(R1)                             7
stall                                         8
ADD.D  F8, F6, F2                             9
stall                                         10
stall                                         11
S.D    F8, -8(R1)     ; drop DADDUI & BNE     12
L.D    F10, -16(R1)                           13
stall                                         14
ADD.D  F12, F10, F2                           15
stall                                         16
stall                                         17
S.D    F12, -16(R1)   ; drop DADDUI & BNE     18
L.D    F14, -24(R1)                           19
stall                                         20
ADD.D  F16, F14, F2                           21
stall                                         22
stall                                         23
S.D    F16, -24(R1)                           24
DADDUI R1, R1, #-32                           25
stall                                         26
BNE    R1, R2, LOOP                           27
stall                                         28

Note that here, without scheduling, every operation in the unrolled loop is followed by a dependent operation; e.g., L.D followed by ADD.D has a data dependence. Therefore each L.D has 1 stall; similarly, each ADD.D has 2 stalls, DADDUI has 1 stall, and the branch has 1 stall. In total, the 14 instructions and 14 stall cycles take 28 clock cycles, or 7 clock cycles per array element.
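The stall counts in this note can be tallied mechanically. The sketch below assumes exactly the per-instruction stalls stated in the note (1 after each L.D, 2 after each ADD.D, 1 after DADDUI, 1 after the branch); the mapping `STALLS_AFTER` and the function name are illustrative, and the model only holds when every instruction is immediately followed by its dependent instruction, as in this listing.

```python
# Stalls that follow each instruction in the unscheduled unrolled loop.
STALLS_AFTER = {"L.D": 1, "ADD.D": 2, "S.D": 0, "DADDUI": 1, "BNE": 1}


def unscheduled_cycles(unroll_factor):
    """Clock cycles for one pass through the unrolled, unscheduled loop."""
    program = ["L.D", "ADD.D", "S.D"] * unroll_factor + ["DADDUI", "BNE"]
    issue = len(program)                              # one cycle per instruction
    stalls = sum(STALLS_AFTER[op] for op in program)  # plus the stall cycles
    return issue + stalls


cycles = unscheduled_cycles(4)
print(cycles, cycles / 4)  # 28 7.0
```

Unrolling four times spreads the DADDUI and BNE cost over four elements, but the load and add stalls remain until the instructions are also reordered.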

Explanation: Example loop unrolling

Now let us see the performance of the unrolled loop with scheduling.

As the instructions from different iterations within a loop can be re-ordered, loop unrolling can also be used to improve scheduling.

Unrolling with scheduling