Serial Execution - Intro to Computer Architecture - Lecture Notes, Study notes of Computer Architecture and Organization

The main points covered in these intro to computer architecture lecture notes are: Serial Execution, Pipelined Execution, Instruction Per Clock Cycle, Assembly Language, Instruction Pipeline, Program Execution, Instruction Execution, Read Register Operands, Decode Stage, Advanced Architectures

Typology: Study notes

2012/2013

Uploaded on 05/06/2013

anurati 🇮🇳


Serial Execution

[Figure: timeline of serial execution. Instructions 1 through 5 run one after another along the time axis.]

Pipelined Execution - Original RISC goal is to complete one instruction per clock cycle

[Figure: timeline of pipelined execution. Instructions 1 through 5 overlap along the time axis.]

The main RISC philosophy (mid-1980s and after) is to design the assembly language (AL) to optimize the instruction pipeline and so speed up program execution. One possible breakdown of instruction execution into stages would be:

Stage: Actions

Fetch: Read the next instruction into the CPU and increment the PC to point to the next instruction.

Decode: Determine the opcode, and read register operands from the register file.

Execution: Calculate using the register operands read in the Decode stage. The ALU calculation depends on the type of instruction being performed:
memory reference (load/store): calculate the effective memory address of the operand
arithmetic operation (add, sub, etc.) with two register operands
arithmetic operation with a register and an immediate constant

Memory access:
load: read memory from the effective address into a temporary pipeline register
store: write the register value from the Decode stage to memory at the effective address

Write-back: ALU or load instruction: write the result into the register file.
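The gap between the serial and pipelined timelines can be made concrete with a small sketch. It assumes an idealized five-stage pipeline in which every stage takes exactly one clock cycle and nothing ever stalls; the function names are illustrative and do not come from the lecture.

```python
# Stage names follow the Fetch / Decode / Execution / Memory access / Write-back breakdown.
STAGES = ["Fetch", "Decode", "Execution", "Memory access", "Write-back"]


def serial_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles when each instruction finishes every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles for an ideal pipeline: fill it once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_diagram(n_instructions):
    """One text row per instruction; column c is clock cycle c + 1, '.' means idle."""
    initials = "FDEMW"  # first letter of each stage in STAGES
    return [
        "." * i + initials + "." * (n_instructions - 1 - i)
        for i in range(n_instructions)
    ]


if __name__ == "__main__":
    n = 5
    print("serial cycles:   ", serial_cycles(n))
    print("pipelined cycles:", pipelined_cycles(n))
    for row in pipeline_diagram(n):
        print(row)
```

Under these assumptions the pipelined count approaches one instruction per clock cycle as the number of instructions grows, which is the original RISC goal stated in the notes.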

Lecture 7 - 1

Docsity.com


Advanced Architectures - multiple instructions completed per clock cycle

superpipelined (e.g., MIPS R4000) - split each stage into substages to create finer-grain stages

[Figure: timeline of superpipelined execution for Instructions 1 through 5.]

superscalar (e.g., Intel Pentium, AMD Athlon) - multiple instructions in the same stage of execution in duplicate pipeline hardware

[Figure: timeline of superscalar execution for Instructions 1 through 6.]
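The two approaches can be compared with idealized timing formulas. This is a rough sketch, assuming no hazards, a base pipeline of k one-cycle stages, a superpipeline that starts a new instruction every 1/degree of a base cycle, and a superscalar machine that starts `width` instructions per base cycle. Times are measured in base clock cycles; all names are illustrative.

```python
from fractions import Fraction
from math import ceil


def base_pipeline_time(n, k):
    """Ideal k-stage pipeline, n >= 1 instructions, one issued per base cycle."""
    return Fraction(k + n - 1)


def superpipelined_time(n, k, degree):
    """Each stage is split into `degree` substages, so issue happens every 1/degree cycle."""
    return k + Fraction(n - 1, degree)


def superscalar_time(n, k, width):
    """`width` duplicate pipelines start `width` instructions together each base cycle."""
    return Fraction(k + ceil(n / width) - 1)


if __name__ == "__main__":
    n, k = 6, 4
    print("base:          ", base_pipeline_time(n, k))
    print("superpipelined:", superpipelined_time(n, k, degree=2))
    print("superscalar:   ", superscalar_time(n, k, width=2))
```

In this model both designs finish sooner than the base pipeline, and the superscalar machine gets ahead of the superpipelined one at start-up because it issues whole groups at once.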


very-long-instruction-word, VLIW (e.g., Intel Itanium) - compiler encodes multiple operations into a long instruction word so hardware can schedule these operations at run-time on multiple functional units without analysis

Intel Itanium Interesting Features: Uses explicitly parallel instruction computing (EPIC), which derives from the very-long-instruction-word (VLIW) architecture. In EPIC the compiler encodes multiple operations into a long instruction word so hardware can schedule these operations at run-time on multiple functional units without analysis; this is called static multiple-issue. On the Itanium, a three-instruction bundle is read.

template field maps instruction slots to execution types (integer ALU, non-ALU integer, memory, floating-point, branch, and extended)
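As a sketch of how a template field can map instruction slots to execution types, the code below splits a bundle into a template and three slots. The layout used here (128-bit bundle, 5-bit template in the low bits, three 41-bit slots) and the handful of template values are stated as assumptions recalled from the IA-64 architecture, not taken from the lecture, and should be checked against the architecture manual before being relied on. The function names are hypothetical.

```python
TEMPLATE_BITS = 5   # assumed width of the template field
SLOT_BITS = 41      # assumed width of each instruction slot
SLOT_MASK = (1 << SLOT_BITS) - 1

# Illustrative subset only (assumption): M = memory, I = integer, B = branch.
TEMPLATE_UNITS = {
    0x00: ("M", "I", "I"),
    0x08: ("M", "M", "I"),
    0x10: ("M", "I", "B"),
    0x16: ("B", "B", "B"),
}


def encode_bundle(template, slots):
    """Pack a template and three slot values into one 128-bit integer."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Return (template, [slot0, slot1, slot2], unit types or None if not in the subset)."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots, TEMPLATE_UNITS.get(template)


if __name__ == "__main__":
    bundle = encode_bundle(0x10, [1, 2, 3])
    print(decode_bundle(bundle))
```

Because the compiler fixes the template when it builds the bundle, the hardware can route each slot to a functional unit without analyzing the instructions themselves.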

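Processor: Itanium
Max. instr. issue per clock: 6
Functional units: 4 integer, 2 memory, 3 branch, 2 FP
Max. ops. per clock: 9
Max. clock rate: 0.8 GHz
Transistors (millions): 25
SPEC int: 379
SPEC fp: 701

Processor: Itanium 2
Max. instr. issue per clock: 6
Functional units: 6 integer, 4 memory, 3 branch, 2 FP
Max. ops. per clock: 11
Max. clock rate: 1.5 GHz
Transistors (millions): 221
SPEC int: 810
SPEC fp: 1427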

Provides hardware support for efficient procedure calls and returns: a large number of registers (128 general-purpose and 128 floating-point registers) with overlapping register windows

Itanium: first 32 registers for global variables and remaining 96 registers for local variables and parameters.


machine parallelism - the ability of the processor to take advantage of instruction-level parallelism. This is limited by:

number of instructions that can be fetched and executed at the same time (number of parallel pipelines)
ability of the processor to find independent instructions (the processor needs to look ahead of the current point of execution to locate independent instructions that can be brought into the pipeline and executed without hazards)

Limitations of superscalar - how much “instruction-level parallelism” (ILP) exists in the program. Independent instructions in the program can be executed in parallel, but not all can be.

true data dependency - cannot be avoided by rearranging code, e.g.:
SUB R1, R2, R3 ; R1 ← R2 - R3
ADD R4, R1, R1 ; R4 ← R1 + R1
procedural dependency - cannot execute instructions after a branch until the branch executes
resource conflict / structural hazard - several instructions need same piece of hardware at the same time (e.g., memory, caches, buses, register file, functional units)
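The data-dependency case in the list above can be checked mechanically. The sketch below classifies the register dependencies between two instructions given in program order; the tuple representation `(dest, sources)` is an assumption made for illustration. RAW is the true data dependency; WAR and WAW are the other two register hazards, which the notes come back to later.

```python
def classify_dependencies(first, second):
    """Return the register dependencies of `second` on `first` (program order).

    Each instruction is a (dest, sources) tuple, e.g. ("R1", ("R2", "R3")).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 in srcs2:
        found.append("RAW")  # true data dependency: second reads what first writes
    if dest2 in srcs1:
        found.append("WAR")  # second overwrites a register that first still reads
    if dest1 == dest2:
        found.append("WAW")  # both write the same register
    return found


if __name__ == "__main__":
    sub = ("R1", ("R2", "R3"))   # SUB R1, R2, R3
    add = ("R4", ("R1", "R1"))   # ADD R4, R1, R1
    print(classify_dependencies(sub, add))
```

Procedural dependencies and resource conflicts depend on control flow and on the hardware, so they are outside what this register-only check can see.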

Three types of orderings:

order in which instructions are fetched
order in which instructions are executed (called instruction issuing)
order in which instructions update registers and memory

The more sophisticated the processor, the less it is bound by a strict relationship between these orderings. The only real constraint is that the results match those of sequential execution.

Some Categories:

a) In-order issue with in-order completion

b) In-order issue with out-of-order completion

Problem: Output dependency / WAW (Write-After-Write) dependency

I1: R3 ← R3 op R
I2: R4 ← R3 + 1
I3: R3 ← R5 + 1
I4: R7 ← R3 op R4 ; R3 value generated from I3 must be used
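A minimal sketch of why the output dependency matters: if two instructions write the same register and their results are written back in a different order from program order, the register ends up holding the older value. The result values below are hypothetical placeholders rather than values computed from the example; only the destination registers mirror I1 to I3.

```python
def final_registers(writes, completion_order):
    """Apply register writes in the order the instructions complete.

    writes: list of (dest, value) in program order.
    completion_order: instruction indices in the order their results are written.
    """
    registers = {}
    for index in completion_order:
        dest, value = writes[index]
        registers[dest] = value
    return registers


# Hypothetical results for I1 (writes R3), I2 (writes R4), I3 (writes R3).
program = [("R3", 10), ("R4", 11), ("R3", 6)]

in_order = final_registers(program, [0, 1, 2])      # I3's write to R3 lands last
out_of_order = final_registers(program, [2, 1, 0])  # I1's write to R3 lands last

if __name__ == "__main__":
    print("in-order completion:    ", in_order)
    print("out-of-order completion:", out_of_order)
```

With out-of-order completion a later reader of R3, such as I4, would see I1's result instead of I3's, so the processor has to detect or remove the WAW dependency.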


Example using Tomasulo’s Algorithm

[Figure: hardware for Tomasulo's algorithm. Labeled components: FP operations queue (fed from the instruction unit), Load Buffer (fed from memory), FP Registers (each with Busy, Tag, and Data fields), Reservation Stations (a Tag and Data pair for each operand) feeding the FP Adders and FP Multipliers, Store Buffer (Tag and Data fields, to memory), and the Common Data Bus (to all tags). Buffer and station entries are numbered 1 through 13.]

FP operations queue contents in the figure (the queue is labeled with a front and a rear):

LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F
SUBD F8, F6, F
DIVD F10, F0, F
ADDD F6, F8, F

FP register fields:

Busy - indicates whether the current value is in the register: 0 - available in the register, 1 - not available
Tag - reservation station that will supply the register value

Operation:

As instructions are issued, register specifiers for pending operands are renamed to the names of reservation stations. When both operands are available and a functional unit is available, the instruction in the reservation station can be executed. When the result is available, it is put on the Common Data Bus (CDB) together with the tag of the reservation station that produced it. All reservation stations waiting to use that result update their operands simultaneously.
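The tag-and-broadcast mechanism can be sketched in a few lines. This is a simplified model, not the full algorithm: each operand or register is either holding a value or waiting on the tag of the station that will produce it, and a broadcast on the Common Data Bus fills in every waiting slot at once. The class names and the station names "Load2" and "Mult1" are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """A value that is either present or awaited from a tagged reservation station."""
    value: Optional[float] = None
    tag: Optional[str] = None  # None means the value is available (Busy = 0)

    def available(self) -> bool:
        return self.tag is None


@dataclass
class Station:
    name: str
    op: str
    a: Slot
    b: Slot

    def can_execute(self) -> bool:
        """Ready once both operands have arrived (a free functional unit is assumed)."""
        return self.a.available() and self.b.available()


def broadcast(tag, value, stations, registers):
    """Put (tag, value) on the CDB: every slot waiting on `tag` captures the value."""
    for station in stations:
        for slot in (station.a, station.b):
            if slot.tag == tag:
                slot.value, slot.tag = value, None
    for slot in registers.values():
        if slot.tag == tag:
            slot.value, slot.tag = value, None


# F2 is being loaded by the hypothetical station "Load2"; a multiply waits on it.
registers = {"F2": Slot(tag="Load2"), "F4": Slot(value=2.0)}
mult1 = Station("Mult1", "MULTD", Slot(tag="Load2"), Slot(value=2.0))

ready_before = mult1.can_execute()
broadcast("Load2", 3.5, [mult1], registers)
ready_after = mult1.can_execute()
```

Matching on tags rather than register names is what lets the waiting station and the register file pick up the result in the same step.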


Download and Read Appendix I of the Textbook from Stallings web-site:

Tomasulo's Algorithm is an example of dynamic scheduling. In dynamic scheduling the pipeline is split into three stages to allow for out-of-order execution:

Issue - decodes instructions and checks for structural hazards. Instructions are issued in order through a FIFO queue to maintain correct data flow. If there is no free reservation station of the appropriate type, the instruction queue stalls.
Read operands - waits until there are no data hazards, then reads operands
Write result - sends the result to the CDB to be grabbed by any waiting registers or reservation stations

All instructions pass through the issue stage in order, but instructions stalling on operands can be bypassed by later instructions whose operands are available.

RAW hazards are handled by delaying instructions in reservation stations until all their operands are available.

WAR and WAW hazards are handled by renaming registers in instructions by reservation station numbers.
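Renaming can be illustrated with a short sketch. It assumes each instruction is a `(dest, src1, src2)` tuple and gives every result a fresh tag standing in for a reservation station number. The tag format `RS<n>` and the sample instructions are hypothetical.

```python
def rename(instructions):
    """Replace register names by the tag of their most recent producer.

    instructions: list of (dest, src1, src2) in program order.
    Returns a list of (tag, src1, src2) where each source is either a tag
    (value still pending) or the original register name (value read at issue).
    """
    latest = {}   # architectural register -> tag of its most recent producer
    renamed = []
    for i, (dest, src1, src2) in enumerate(instructions):
        sources = tuple(latest.get(src, src) for src in (src1, src2))
        tag = f"RS{i}"
        renamed.append((tag, *sources))
        latest[dest] = tag
    return renamed


program = [
    ("F0", "F2", "F4"),    # I0
    ("F6", "F0", "F8"),    # I1: RAW on F0 (kept as a wait on RS0)
    ("F8", "F10", "F14"),  # I2: WAR on F8 with I1
    ("F6", "F10", "F8"),   # I3: WAW on F6 with I1, RAW on F8 from I2
]

if __name__ == "__main__":
    for entry in rename(program):
        print(entry)
```

After renaming, I2 and I3 write to their own tags, so neither can clobber a value that I1 reads or produces. Only the RAW links (I1 waiting on RS0, I3 waiting on RS2) remain, and those are handled by waiting in the reservation station.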

Load and Store instructions to different memory addresses can be done in any order, but the relative order of a Store and other accesses to the same memory location must be maintained. One way to perform dynamic disambiguation of memory references is to perform the effective address calculations of Loads and Stores in program order in the issue stage.

Before issuing a Load from the instruction queue, make sure that its effective address does not match the address of any Store instruction in the Store buffers. If there is a match, stall the instruction queue until the corresponding Store completes. (Alternatively, the Store could forward the value to the corresponding Load.)

Before issuing a Store from the instruction queue, make sure that its effective address does not match the address of any Store or Load instructions in the Store or Load buffers.
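The two address checks described above can be sketched directly. The buffers are modeled as plain sets of pending effective addresses, which is a simplifying assumption; the function names and addresses are illustrative.

```python
def can_issue_load(address, store_buffer):
    """A Load may issue only if no pending Store targets the same address."""
    return address not in store_buffer


def can_issue_store(address, store_buffer, load_buffer):
    """A Store may issue only if no pending Store or Load uses the same address."""
    return address not in store_buffer and address not in load_buffer


# Hypothetical effective addresses of accesses still waiting in the buffers.
pending_stores = {0x1000, 0x2000}
pending_loads = {0x3000}

if __name__ == "__main__":
    print(can_issue_load(0x1000, pending_stores))                  # conflicts with a Store
    print(can_issue_load(0x4000, pending_stores))                  # no conflict
    print(can_issue_store(0x3000, pending_stores, pending_loads))  # conflicts with a Load
```

When a check fails, the instruction queue stalls until the conflicting access completes, or, for a Load, the pending Store forwards its value.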


Branch prediction - usually used instead of delayed branching, since multiple instructions would need to execute in the delay slot, causing problems related to instruction dependencies

Committing / Retiring Step - needed since instructions may complete out-of-order

Using branch prediction and speculative execution means some instructions' results need to be thrown out

Results are held in some temporary storage, and stores are performed in the order of sequential execution.
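One common form of this temporary storage is a reorder buffer; the name and the structure below are an illustrative choice, not something the notes specify. The sketch retires instructions only from the head of a program-ordered queue and can discard everything younger than a mispredicted branch.

```python
from collections import deque


class ReorderBuffer:
    """Holds issued instructions in program order until they can be committed."""

    def __init__(self):
        self.entries = deque()

    def issue(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire finished instructions from the head; stop at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

    def squash_after(self, name):
        """Discard every instruction younger than `name`, e.g. a mispredicted branch."""
        names = [entry["name"] for entry in self.entries]
        keep = names.index(name) + 1
        kept = list(self.entries)
        self.entries = deque(kept[:keep])
        return [entry["name"] for entry in kept[keep:]]


if __name__ == "__main__":
    rob = ReorderBuffer()
    i1, i2, i3 = rob.issue("I1"), rob.issue("I2"), rob.issue("I3")
    i3["done"] = True
    i2["done"] = True
    print(rob.commit())   # nothing retires while I1 is unfinished
    i1["done"] = True
    print(rob.commit())   # all three retire, in program order
```

Because results leave the buffer only in program order, a speculative instruction that is squashed never updates the registers or memory.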


Pentium 4 Processor

80486 - CISC
Pentium - some superscalar components: two separate integer execution units
Pentium Pro - full-blown superscalar
Subsequent models refine and enhance the superscalar design


a) Fetch 64 bytes of Pentium 4 (CISC) instructions from the L2 cache, decode instruction boundaries, and translate the Pentium 4 (CISC) instructions into micro-ops (RISC).

b) Trace cache (L1 cache) stores recently executed micro-ops. The BTB uses dynamic branch prediction (a BHT) (4 bits used via Yeh's algorithm).

Static prediction is used if the branch is not in the BTB.
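To give a feel for history-based dynamic prediction, here is a simplified two-level adaptive predictor in the spirit of Yeh's algorithm: a 4-bit history of recent outcomes selects one of sixteen 2-bit saturating counters. It is a teaching sketch under those assumptions, not a description of the Pentium 4's actual predictor; the class name and initial counter value are illustrative.

```python
HISTORY_BITS = 4
HISTORY_MASK = (1 << HISTORY_BITS) - 1


class TwoLevelPredictor:
    """4 bits of branch history index a table of 2-bit saturating counters."""

    def __init__(self):
        self.history = 0                             # last 4 outcomes, 1 = taken
        self.counters = [1] * (1 << HISTORY_BITS)    # start at "weakly not taken"

    def predict(self):
        return self.counters[self.history] >= 2      # 2 or 3 means predict taken

    def update(self, taken):
        counter = self.counters[self.history]
        self.counters[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.history = ((self.history << 1) | int(taken)) & HISTORY_MASK


def run(predictor, outcomes):
    """Return a list of booleans: was each prediction correct?"""
    correct = []
    for taken in outcomes:
        correct.append(predictor.predict() == taken)
        predictor.update(taken)
    return correct


if __name__ == "__main__":
    # A loop branch that is taken three times, then falls through, repeatedly.
    pattern = [True, True, True, False] * 50
    results = run(TwoLevelPredictor(), pattern)
    print("mispredictions in the last 100 branches:", results[-100:].count(False))
```

A single 2-bit counter would keep mispredicting the loop exit, whereas the history lets the predictor learn the repeating pattern once it has warmed up.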


c) Pulls micro-ops from the cache (or ROM microprogrammed control unit for very complex instructions) in program sequence order.

d) Drive delivers decoded instructions from the trace cache to the rename/allocate module.


Scheduler retrieves micro-ops from the queues for dispatching/issuing for execution if all operands and the execution unit are available.

Up to 6 micro-ops can be dispatched per cycle.


Execution units retrieve the necessary integer and floating-point registers.

Compute flags - N, Z, C, V to use as input to the branches
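As an illustration of what computing the flags involves, the sketch below derives N (negative), Z (zero), C (carry), and V (signed overflow) for a fixed-width addition. The 32-bit default width and the function name are assumptions made for the example.

```python
def add_with_flags(a, b, bits=32):
    """Add two unsigned `bits`-wide values and return (result, flags)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    full = a + b
    result = full & mask
    flags = {
        "N": bool(result & sign),   # result is negative when read as signed
        "Z": result == 0,           # result is zero
        "C": full > mask,           # carry out of the top bit
        # Overflow: operands share a sign but the result's sign differs.
        "V": bool(~(a ^ b) & (a ^ result) & sign),
    }
    return result, flags


if __name__ == "__main__":
    print(add_with_flags(0x7FFFFFFF, 1))
    print(add_with_flags(0xFFFFFFFF, 1))
```

A conditional branch then only needs to test some combination of these four bits rather than redo the comparison.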
