ECE 4100/6100: IBM 360/91 Microarchitecture and Tomasulo's Algorithm - Prof. Sudhakar Yala, Assignments of Computer Architecture and Organization

An in-depth analysis of the IBM 360/91 microarchitecture, focusing on Tomasulo's algorithm. Topics include functional unit latencies, register renaming, reservation stations, and anti-dependence examples. The document also covers the advantages and disadvantages of the system.

Typology: Assignments

Pre 2010

Uploaded on 08/05/2009

© Sudhakar Yalamanchili, K.V.Palem, and W. F. Wong, Georgia Institute of Technology (except as indicated)

Module: Instruction Level Parallelism and Hardware Scheduling

ECE 4100/6100 (2)

Reading

• Appendix A.7, A.8

• Section 2.4, 2.5

Compiler Hardware Interface

[Figure: compiler flow feeding the hardware pipeline. Stages: Lexical Analyzer → Parser → Semantic Analyzer → Intermediate Code Generator → Optimizer → Code Generator → Postpass Optimizer. Artifacts passed between stages: tokens, parse tree, IR, low level IR, machine code. Hardware pipeline stages shown: IF, ID, MEM, WB.]

L.D F2, 0(R1)

ADD.D F4, F0, F

MUL.D F6, F4, F

S.D F6, 4(R2)

Hardware/Software Interface: The ISA

This module addresses scheduling below the HW/SW interface

Algorithms for Out-of-order Issue

• Scoreboarding

• Tomasulo’s Algorithm

• Others

Key ideas

• Decompose the decode stage into issue and read operand (RO) steps

  • Stall on WAW or structural hazards in issue stage

• Allow bypassing in RO of independent (in terms of dataflow) instructions

  • Localize stalls → stall data dependent instructions

• Enforce WAR during write back

  • Detect and enforce hazards as late as possible

ADD.D F0, F2, F

SUB.D F4, F6, F
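The three hazard types named in the key ideas (RAW, WAR, WAW) can be told apart purely from register names. The sketch below is illustrative only: the `Instr` class, the `hazards` function, and all operand choices are assumptions made for this example and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """A register-register operation: dest <- op(src1, src2)."""
    op: str
    dest: str
    src1: str
    src2: str


def hazards(first, second):
    """Return the set of hazards that `second` has on the earlier `first`."""
    found = set()
    if first.dest in (second.src1, second.src2):
        found.add("RAW")  # second reads a register that first writes
    if second.dest in (first.src1, first.src2):
        found.add("WAR")  # second overwrites a register that first reads
    if second.dest == first.dest:
        found.add("WAW")  # both write the same register
    return found


# Hypothetical operands, chosen only to exercise each case.
mul = Instr("MUL.D", "F0", "F2", "F4")
add = Instr("ADD.D", "F6", "F0", "F8")     # reads F0, which mul writes
sub = Instr("SUB.D", "F2", "F10", "F12")   # writes F2, which mul reads
div = Instr("DIV.D", "F0", "F10", "F12")   # writes F0, which mul also writes

print(hazards(mul, add), hazards(mul, sub), hazards(mul, div))
```

In scoreboarding terms, the first case delays read operand, the second delays write back, and the third stalls issue.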

The Scoreboard

• Consists of:

  • instruction status table

  • functional unit status table

  • result table

Instruction Status Table

• Keeps track of which stage of execution each instruction is currently in.

  • issue? - has the instruction been issued?

  • rdopd? - has it completed reading its operands?

  • exec? - has it completed its execution?

  • wrback? - has it completed its writeback?

[Figure: pipeline with IFU, IDU, MEM, and WB stages; the decode stage is split into issue and read operand steps]

Functional Unit Status Table

• Has an entry for each functional unit, and there are 9 fields for each entry:

  • busy? - indicates if the functional unit is busy;

  • op - the kind of operation being performed;

  • dest - the destination register;

  • src1, src2 - the two source registers;

  • func1 (Qj), func2 (Qk) - the functional units producing the results in the two source registers;

  • ready1?, ready2? - indicate if src1 and src2 are ready;

Name | Busy | Op | Fi | Fj | Fk | Qj | Qk | Rj | Rk

• Fi: dest reg; Fj, Fk: src1, src2

• Qj, Qk: function unit producing value

• Rj, Rk: source registers have value?

Example rows (blank fields omitted):

Add: Yes, Sub, F8, F6, F2, Integer, Yes, No

Integer: Yes, Load, F2, R3, No
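As a rough illustration of the nine fields, one entry of the functional unit status table could be modelled as below. The `FUStatus` class and its method are hypothetical names introduced here. The placement of the Add row's values into columns is an assumed reading of the example row: Integer is taken to be the producer of the second source, since that is the source marked not ready.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    """One functional unit status entry (nine fields plus the unit name)."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    dest: Optional[str] = None    # Fi
    src1: Optional[str] = None    # Fj
    src2: Optional[str] = None    # Fk
    func1: Optional[str] = None   # Qj: unit producing src1, if still pending
    func2: Optional[str] = None   # Qk: unit producing src2, if still pending
    ready1: bool = False          # Rj
    ready2: bool = False          # Rk

    def operands_ready(self):
        """True when the unit holds an instruction whose sources are both ready."""
        return self.busy and self.ready1 and self.ready2


# Assumed column placement for the Add row of the example.
add_unit = FUStatus("Add", True, "Sub", "F8", "F6", "F2", None, "Integer", True, False)
before = add_unit.operands_ready()

# Once the Integer unit writes F2, the pending producer is cleared.
add_unit.func2, add_unit.ready2 = None, True
after = add_unit.operands_ready()
print(before, after)
```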

Scoreboarding - 1

1. When an instruction is fetched, an entry is made in the instruction status table.

2. After the instruction is decoded, the corresponding issue? entry in the instruction status table is marked.

Scoreboarding - 2

3. Select a functional unit. This is done by checking the busy? flag of all the functional units which can execute the current instruction.

4. Enter the relevant information in the corresponding functional unit status table entry. func1 and func2 are obtained from the corresponding entries in the result table. For example, if one of the source registers is R1, then under the entry for R1 in the result table, locate the functional unit responsible for writing to this register. If there is no entry, then mark ready1? as ‘ready’. The same goes for src2. This takes care of flow dependency.

[Figure: functional unit status table (columns Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk; example rows Add and Integer), with the fields used for the RAW check highlighted]
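Steps 3 and 4 can be sketched as two small functions over dictionaries. This is a minimal sketch under assumptions: the function names, the dictionary layout, and the example register and unit names are all invented here for illustration.

```python
def select_unit(fu_table, candidates):
    """Step 3: return the first non-busy unit among those able to run the op."""
    for name in candidates:
        if not fu_table[name]["busy"]:
            return name
    return None  # every capable unit is busy


def fill_entry(fu_table, result_table, unit, op, dest, src1, src2):
    """Step 4: record the instruction and look up pending producers."""
    func1 = result_table.get(src1)  # unit that will write src1, if any
    func2 = result_table.get(src2)  # unit that will write src2, if any
    fu_table[unit] = {
        "busy": True, "op": op, "dest": dest, "src1": src1, "src2": src2,
        "func1": func1, "func2": func2,
        "ready1": func1 is None,  # no pending writer, so the value is ready
        "ready2": func2 is None,
    }


# Hypothetical state: the Integer unit is busy loading F2.
fu_table = {"Add": {"busy": False}, "Integer": {"busy": True}}
result_table = {"F2": "Integer"}

unit = select_unit(fu_table, ["Add"])
fill_entry(fu_table, result_table, unit, "Sub", "F8", "F6", "F2")
print(unit, fu_table[unit])
```

The second source of the subtract is marked not ready because the result table names Integer as the pending writer of F2, which is the flow (RAW) dependence the step describes.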

Scoreboarding - 3

5. An instruction is ready for issue if both the ready1? and ready2? entries are marked ‘ready’, and the corresponding entry in the result table for the destination register is empty.

  • If the latter condition is not fulfilled, instruction issue is stalled. This avoids output dependency (WAW) hazards. On issue, this entry is overwritten with the number of the functional unit that will produce the result.

6. The instruction then proceeds to read its operands and execute, with the corresponding entry in the instruction status table updated accordingly.

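[Figure: result table with one column per register (F0, F2, F4, F6, F8, F10, F12, etc.); each entry names the functional unit (e.g., Mult1, Integer, Add, Divide) that will write that register. Used for the WAW check.]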
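The test in step 5 can be written as a short function. The sketch follows the wording of step 5 as given here rather than any particular textbook formulation; `try_issue` and the example entries are hypothetical.

```python
def try_issue(entry, unit, result_table):
    """Apply the step-5 test; on success, claim the destination register."""
    if not (entry["ready1"] and entry["ready2"]):
        return False  # a source operand is still pending
    if result_table.get(entry["dest"]) is not None:
        return False  # another unit will write dest: output dependency
    result_table[entry["dest"]] = unit  # record this unit as the producer
    return True


# Hypothetical state: Mult1 is already going to write F0.
result_table = {"F0": "Mult1"}
blocked = {"dest": "F0", "ready1": True, "ready2": True}
allowed = {"dest": "F8", "ready1": True, "ready2": True}

r1 = try_issue(blocked, "Add", result_table)
r2 = try_issue(allowed, "Add", result_table)
print(r1, r2, result_table)
```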

Scoreboarding - 4

7. At the completion of the execution stage, the corresponding busy? entry in the functional unit status table is turned to ‘no’.

8. Before write-back, it is necessary to check for anti-dependency. If an anti-dependency exists, the current write-back must be stalled until the hazard is cleared.

9. During write-back, the result is written back to the register, the entry in the result table is turned to ‘empty’, and the instruction's entry in the instruction status table is deleted. At the same time, the functional unit status table is scanned so that the ready? entries can be updated to reflect the fact that the result in this register is now ready.

[Figure: functional unit status table (columns Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk; example rows Add and Integer), with the fields used for the RAW and WAR checks highlighted]
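The write-back steps can be sketched as follows. One assumption goes beyond the text above: a ready flag is taken to be cleared once the unit has read that operand, so "source equals my destination and is still marked ready" means the old value has not been read yet. All names, and the example state, are hypothetical.

```python
def entry(dest, src1, src2, func1=None, func2=None):
    """Build a simplified functional unit status entry."""
    return {"dest": dest, "src1": src1, "src2": src2,
            "func1": func1, "func2": func2,
            "ready1": func1 is None, "ready2": func2 is None}


def has_war(fu_table, unit):
    """True if another unit still has to read the register `unit` will write."""
    dest = fu_table[unit]["dest"]
    return any(
        (e["src1"] == dest and e["ready1"]) or (e["src2"] == dest and e["ready2"])
        for name, e in fu_table.items() if name != unit
    )


def write_back(fu_table, result_table, unit):
    """Stall on an anti-dependency, else free dest and wake up waiting units."""
    if has_war(fu_table, unit):
        return False  # stall until the hazard clears
    result_table.pop(fu_table[unit]["dest"], None)  # entry becomes empty
    for e in fu_table.values():
        if e["func1"] == unit:
            e["func1"], e["ready1"] = None, True
        if e["func2"] == unit:
            e["func2"], e["ready2"] = None, True
    return True


fu_table = {
    "Integer": entry("F2", "R3", None),
    "Add": entry("F8", "F6", "F2", func2="Integer"),
    "Divide": entry("F6", "F0", "F4"),
}
result_table = {"F2": "Integer", "F8": "Add", "F6": "Divide"}

wb_integer = write_back(fu_table, result_table, "Integer")  # wakes up Add
wb_divide_early = write_back(fu_table, result_table, "Divide")  # Add has not read F6
fu_table["Add"]["ready1"] = fu_table["Add"]["ready2"] = False  # Add reads operands
wb_divide_late = write_back(fu_table, result_table, "Divide")
print(wb_integer, wb_divide_early, wb_divide_late)
```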

Summary

• Out-of-order issue and out-of-order completion

• Performance limited by

  • Amount of ILP in the code segments

  • Number of entries in the scoreboard, i.e., amount of look-ahead

  • Number of functional units

• Complexity of scoreboard is on the order of a functional unit

Algorithms for Out-of-order Issue

• Scoreboarding

• Tomasulo’s Algorithm

• Others

In-Order Issue, Out-of-order Execution, Out-of-order Completion

[Figure: I-Fetch feeding an out-of-order (“OOO”) Execution Core, followed by Retire]

Dynamic Scheduling

• Hardware will detect and preserve dependencies (within a limited window of the instruction stream)

• Hardware will check for resource availability

• Independent instructions will be issued to the correct functional units

IBM 360 Instruction Format

• Known as the RX format

• All instructions (except loads and stores) are of the format

SOURCE op SINK → SINK

where SOURCE may be a memory operand or a register, while the SINK must be a register

Tomasulo’s Algorithm

• Credited to R.M. Tomasulo

• Implemented for the floating point unit of the IBM 360/91

IBM 360/91 FPU

[Figure: IBM 360/91 FPU block diagram. Labeled blocks: FP Ops “Stack” (from instruction unit), Decoder, FP Registers, Load Buffer (from memory), Store Buffer (to memory), Reservation Stations, FP Add (2 stage), FP Mul/Div (6 stage). Labeled busses: Operand Busses, Operation Bus, Common Data Bus. Buffer and station slots are numbered.]

IBM 360/91 FPU

[Figure: IBM 360/91 FPU block diagram, repeated with a callout]

• FP operations are sent by the instruction unit to the FPU into a “stack” (IBM terminology; actually a queue!)

IBM 360/91 FPU

[Figure: IBM 360/91 FPU block diagram, repeated with a callout]

• Buffers for load. Each load request that goes out to memory gets a buffer allocated.

IBM 360/91 FPU

[Figure: IBM 360/91 FPU block diagram, repeated with a callout]

• The two floating point functional units.

IBM 360/91 FPU

[Figure: IBM 360/91 FPU block diagram, repeated with a callout]

• Supplies operands to reservation stations. Each operand has a tag.

IBM 360/91 FPU

[Figure: IBM 360/91 FPU block diagram, repeated with a callout]

• Each reservation station holds the two operands of an operation together with their tags, as well as the busy bit (which indicates if the operand is available).

Data Structures

• LD/SD buffers act as reservation stations for memory units

• Instruction execution cannot start until all preceding branches are resolved

[Figure: reservation station fields (Op, Qj, Qk, Vj, Vk, A, Busy) and register file fields (value, Qi)]
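The field names above map naturally onto two record types. The sketch below uses the usual textbook reading of the fields (Qj/Qk hold the tag of the producing station, Vj/Vk hold operand values, A holds an address for load/store buffers, Qi names the station that will write a register). The class names and the example values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    qj: Optional[str] = None    # tag of the station producing operand j
    qk: Optional[str] = None    # tag of the station producing operand k
    vj: Optional[float] = None  # value of operand j, once available
    vk: Optional[float] = None  # value of operand k, once available
    a: Optional[int] = None     # address field for load/store buffers

    def ready(self):
        """An occupied station can execute when no operand tag is pending."""
        return self.busy and self.qj is None and self.qk is None


@dataclass
class Register:
    value: float = 0.0
    qi: Optional[str] = None    # station that will write this register


# Hypothetical state: an add waiting for a multiply's result.
f0 = Register(qi="Mul1")
add1 = ReservationStation("Add1", busy=True, op="ADD", qj="Mul1", vk=2.0)
waiting = add1.ready()

# The multiply result arrives: the tag is replaced by the value.
add1.qj, add1.vj = None, 6.0
f0.value, f0.qi = 6.0, None
runnable = add1.ready()
print(waiting, runnable)
```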

IBM 360/91 FPU

[Figure: IBM 360/91 FPU block diagram, repeated with a callout]

• All operand transport occurs on the common data bus; only one operand may occupy the bus.

Tomasulo’s Algorithm

1. Decode an operation at the head of the floating point operation stack

2. Look for an empty reservation station in the functional unit corresponding to the operation. If none exists, instruction issue stalls until one becomes free

3. Read the source operands from the register file, bringing forward the tags

Tomasulo’s Algorithm - cont’d

4. Mark the busy bit of the SINK in the register file. Also, the tag will be set to point to the selected reservation station

5. When the functional unit completes its execution, it will write its result and the corresponding reservation station number back to the register file via the common data bus
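Steps 2 through 5 can be sketched as an issue function and a common-data-bus broadcast. This is a simplified model under stated assumptions: it uses a three-operand form (dest, src1, src2) rather than the 360's SOURCE/SINK form, a non-empty register tag stands in for the busy bit, and the names `new_station`, `issue`, and `broadcast` plus all register values are hypothetical.

```python
def new_station(name):
    return {"name": name, "busy": False, "op": None,
            "qj": None, "qk": None, "vj": None, "vk": None}


def issue(stations, regs, op, dest, src1, src2):
    """Steps 2-4: claim a free station, read operands or tags, tag the dest."""
    rs = next((s for s in stations if not s["busy"]), None)
    if rs is None:
        return None  # no empty station: issue stalls
    rs["busy"], rs["op"] = True, op
    for slot, src in (("j", src1), ("k", src2)):
        tag = regs[src]["tag"]
        rs["q" + slot] = tag  # pending producer, or None
        rs["v" + slot] = regs[src]["value"] if tag is None else None
    regs[dest]["tag"] = rs["name"]  # dest now points at this station
    return rs["name"]


def broadcast(stations, regs, tag, value):
    """Step 5: put (tag, value) on the common data bus."""
    for r in regs.values():
        if r["tag"] == tag:  # only the latest writer updates the register
            r["value"], r["tag"] = value, None
    for s in stations:
        for slot in ("j", "k"):
            if s["busy"] and s["q" + slot] == tag:
                s["v" + slot], s["q" + slot] = value, None
        if s["name"] == tag:
            s["busy"] = False  # the producing station is released


regs = {r: {"value": v, "tag": None}
        for r, v in [("F0", 1.0), ("F2", 2.0), ("F4", 3.0), ("F6", 0.0)]}
add_rs = [new_station("Add1"), new_station("Add2")]
mul_rs = [new_station("Mul1")]

t1 = issue(mul_rs, regs, "MUL", "F0", "F2", "F4")  # takes Mul1
t2 = issue(add_rs, regs, "ADD", "F6", "F0", "F2")  # waits on Mul1 for F0
t3 = issue(mul_rs, regs, "MUL", "F4", "F2", "F2")  # no free multiplier station

broadcast(add_rs + mul_rs, regs, t1, 2.0 * 3.0)
add1 = add_rs[0]
broadcast(add_rs + mul_rs, regs, t2, add1["vj"] + add1["vk"])
print(t1, t2, t3, regs["F0"], regs["F6"])
```

Because the add captured the tag `Mul1` instead of the stale value of F0, it receives the product directly from the bus, which is the register renaming effect the algorithm relies on.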