ECE 4100/6100: IBM 360/91 Microarchitecture and Tomasulo's Algorithm
© Sudhakar Yalamanchili, K. V. Palem, and W. F. Wong, Georgia Institute of Technology (except as indicated)
Module: Instruction Level Parallelism and Hardware Scheduling
Reading
• Appendix A.7, A.
• Section 2.4, 2.
ECE 4100/6100 (3)
Compiler Hardware Interface
[Figure: compiler phases (Lexical Analyzer → Parser → Semantic Analyzer → Intermediate Code Generator → Optimizer → Code Generator → Postpass Optimizer), exchanging tokens, parse tree, IR, low level IR, and machine code, shown together with the hardware pipeline stages IF, ID, MEM, WB. Example machine code:]
L.D F2, 0(R1)
ADD.D F4, F0, F
MUL.D F6, F4, F
S.D F6, 4(R2)
Hardware/Software Interface: The ISA
This module addresses scheduling below the HW/SW IF
Algorithms for Out-of-order Issue
• Scoreboarding
• Tomasulo’s Algorithm
• Others
Key ideas
• Decompose the decode stage into issue and read operand (RO) steps
- Stall on WAW or structural hazards in the issue stage
• Allow bypassing in RO of independent (in terms of dataflow) instructions
- Localize stalls → stall only data dependent instructions
• Enforce WAR during write back
- Detect and enforce hazards as late as possible
ADD.D F0, F2, F
SUB.D F4, F6, F
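The three hazard classes named in the key ideas can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch, not part of the original slides; the `Instr` type, the `hazards` function, and the example instructions are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: op dest, src, src."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Classify the register hazards that `later` has on `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes (flow dependency)
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites what earlier still reads (anti-dependency)
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register (output dependency)
    return found


# Illustrative instructions (assumed, not taken from the slides).
mul = Instr("MUL.D", "F0", ("F2", "F4"))
add = Instr("ADD.D", "F6", ("F0", "F8"))
sub = Instr("SUB.D", "F2", ("F10", "F12"))
div = Instr("DIV.D", "F0", ("F6", "F8"))

print(hazards(mul, add))
print(hazards(mul, sub))
print(hazards(mul, div))
```

In scoreboard terms, the first kind is resolved in the read operand step, the second at write back, and the third at issue.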
The Scoreboard
• Consists of:
- instruction status table
- functional unit status table
- result table
Instruction Status Table
• Records which step of the execution process each instruction is currently in.
- issue? - has the instruction been issued?
- rdopd? - has it completed reading its operands?
- exec? - has it completed its execution?
- wrback? - has it completed its writeback?
[Figure: pipeline with stages IFU, IDU, MEM, WB; the decode stage is split into issue and read op]
Functional Unit Status Table
• Has an entry for each functional unit and there are 9 fields for each entry:
- busy? - indicates if the functional unit is busy;
- op - the kind of operation being performed;
- dest - the destination register;
- src1, src2 - the two source registers;
- func1 (Qj), func2 (Qk) - the functional units producing the results in the two source registers;
- ready1?, ready2? - indicate if src1 and src2 are ready.
Name     Busy  Op    Fi   Fj   Fk   Qj   Qk       Rj   Rk
Integer  Yes   Load  F2        R3                      No
Add      Yes   Sub   F8   F6   F2        Integer  Yes  No
Fi: dest reg. Fj, Fk: src1, src2. Qj, Qk: function unit producing value. Rj, Rk: source registers have value?
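One way to picture an entry of this table is as a record with the nine fields listed above. This is a minimal sketch under that assumption; the class name `FUStatus` and the method `operands_ready` are hypothetical, and the example reproduces only the Add row of the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    """One functional unit status entry (hypothetical layout)."""
    name: str                    # unit name, not counted among the 9 fields
    busy: bool = False           # busy?
    op: Optional[str] = None     # operation being performed
    fi: Optional[str] = None     # dest
    fj: Optional[str] = None     # src1
    fk: Optional[str] = None     # src2
    qj: Optional[str] = None     # func1: unit producing src1 (None if not pending)
    qk: Optional[str] = None     # func2: unit producing src2 (None if not pending)
    rj: bool = False             # ready1?
    rk: bool = False             # ready2?

    def operands_ready(self) -> bool:
        """True once both source operands are marked ready."""
        return self.busy and self.rj and self.rk


# The Add row: Sub F8 <- F6, F2, where F2 is still being produced by Integer.
add = FUStatus("Add", busy=True, op="Sub", fi="F8", fj="F6", fk="F2",
               qk="Integer", rj=True, rk=False)
waiting = add.operands_ready()

# Once the Integer unit writes F2, the entry is updated and becomes ready.
add.qk, add.rk = None, True
ready = add.operands_ready()
print(waiting, ready)
```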
Scoreboarding - 1
1. When an instruction is fetched, an entry is
made in the instruction status table.
2. After the instruction is decoded, the
corresponding issue? entry in the instruction
status table is marked.
Scoreboarding - 2
3. Select a functional unit. This is obtained by checking the
busy? flag of all the functional units which can execute
the current instruction.
4. Enter the relevant information in the corresponding
functional unit status table. func1 and func2 are
obtained from the corresponding entries in the result
table. For example, if one of the source registers is R1,
then under the entry of R1 in the result table, locate the
functional unit responsible for writing to this register. If
there is no entry, then mark ready1? as ‘ready’. The
same goes for src2. This takes care of flow dependency.
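Steps 3 and 4 can be sketched as two small functions over dictionary-based tables. This is an illustrative sketch only: the function names, the dictionary layout, and the example register and unit names are assumptions, not the original hardware description.

```python
from typing import Dict, List, Optional


def select_fu(fu_table: Dict[str, dict], candidates: List[str]) -> Optional[str]:
    """Step 3: return the first candidate unit whose busy? flag is clear."""
    for name in candidates:
        if not fu_table[name]["busy"]:
            return name
    return None  # every suitable unit is busy: structural hazard


def fill_entry(fu_table, result_table, fu, op, dest, src1, src2):
    """Step 4: record the instruction; look up func1/func2 in the result table."""
    func1 = result_table.get(src1)  # unit that will write src1, if any
    func2 = result_table.get(src2)  # unit that will write src2, if any
    fu_table[fu] = {
        "busy": True, "op": op, "dest": dest, "src1": src1, "src2": src2,
        "func1": func1, "func2": func2,
        "ready1": func1 is None,  # no pending writer: operand is ready
        "ready2": func2 is None,
    }


# Assumed state: the Integer unit is executing a load that will write F2.
fu_table = {"Integer": {"busy": True}, "Add": {"busy": False}}
result_table = {"F2": "Integer"}

fu = select_fu(fu_table, ["Add"])
fill_entry(fu_table, result_table, fu, "Sub", "F8", "F6", "F2")
print(fu, fu_table[fu])
```

Here the second source operand is not ready because the result table names Integer as the pending writer of F2, which is the flow (RAW) dependency the step is meant to capture.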
[Figure: the functional unit status table shown earlier, annotated with RAW]
Scoreboarding - 3
5. An instruction is ready for issue if both ready1? and
ready2? entries are marked ‘ready’, and the
corresponding entry in the result table for the
destination register is empty.
- If the latter condition is not fulfilled, instruction issue is stalled. This avoids output dependency. Otherwise, this entry is overwritten with the number of the functional unit that will produce the result.
6. The instruction then proceeds to read its operands and
executes with the corresponding entry in the instruction
status table updated accordingly.
Result table (the check against this table prevents WAW):
Register  F0     F2       F4   F6   F8   F10     F12  etc
FU        Mult1  Integer            Add  Divide
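The output dependency check in step 5 amounts to a lookup in the result table followed by a claim of the destination entry. A minimal sketch, assuming the result table is a dictionary from register name to functional unit name; `try_claim_dest` is a hypothetical name.

```python
def try_claim_dest(result_table, dest, fu):
    """Return True and record `fu` as the producer of `dest` if no earlier
    instruction still has a pending write to `dest`; otherwise return False,
    meaning issue must stall (WAW hazard)."""
    if result_table.get(dest) is not None:
        return False
    result_table[dest] = fu
    return True


# Assumed state: F2 and F8 have pending writers.
result_table = {"F2": "Integer", "F8": "Add"}

stalled = not try_claim_dest(result_table, "F8", "Mult1")  # F8 already claimed
issued = try_claim_dest(result_table, "F4", "Mult1")       # F4 is free
print(stalled, issued, result_table)
```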
Scoreboarding - 4
- At the completion of the execution stage, the corresponding busy? entry in the functional unit status table is turned to ‘no’.
- Before write-back, it is necessary to check for anti-dependency. If there exists anti-dependency, the current write-back must be stalled until the hazard is cleared.
- During write-back, the result is written back to the register, the entry in the result table is turned to ‘empty’ and the instruction's entry in the instruction status table is deleted. At the same time, the functional unit status table is scanned such that the ready? entries can be updated so as to reflect the fact that the result in this register is now ready.
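The write-back rules above can be sketched as follows. This is an illustrative sketch with several assumptions that are not stated in the slides: as in the usual textbook scoreboard, a ready flag that is still set means the operand has not been read yet; the busy? flag of the writing unit was already cleared at the end of execution; and the instruction status table is not modeled. All function names are hypothetical.

```python
def entry(busy, dest, src1=None, src2=None, func1=None, func2=None,
          ready1=False, ready2=False):
    """Build one functional unit status entry (hypothetical layout)."""
    return dict(busy=busy, dest=dest, src1=src1, src2=src2,
                func1=func1, func2=func2, ready1=ready1, ready2=ready2)


def war_pending(fu_table, writer):
    """True if another busy unit still has to read the old value of the
    writer's destination register (anti-dependency)."""
    dest = fu_table[writer]["dest"]
    for name, e in fu_table.items():
        if name == writer or not e["busy"]:
            continue
        if (e["src1"] == dest and e["ready1"]) or \
           (e["src2"] == dest and e["ready2"]):
            return True
    return False


def write_back(fu_table, result_table, regs, writer, value):
    """Write the result unless a WAR hazard forces a stall."""
    if war_pending(fu_table, writer):
        return False                      # stall until the hazard clears
    dest = fu_table[writer]["dest"]
    regs[dest] = value                    # write the register
    result_table.pop(dest, None)          # result table entry becomes empty
    for e in fu_table.values():           # wake up units waiting on writer
        if e["func1"] == writer:
            e["func1"], e["ready1"] = None, True
        if e["func2"] == writer:
            e["func2"], e["ready2"] = None, True
    return True


# Assumed state: Add (Sub F8 <- F6, F2) waits for F2 from Integer and has
# not yet read F6; Mult1, issued later, wants to overwrite F6.
fu_table = {
    "Integer": entry(busy=False, dest="F2"),
    "Mult1": entry(busy=False, dest="F6", src1="F0", src2="F4"),
    "Add": entry(busy=True, dest="F8", src1="F6", src2="F2",
                 func2="Integer", ready1=True),
}
result_table = {"F2": "Integer", "F6": "Mult1", "F8": "Add"}
regs = {}

ok_load = write_back(fu_table, result_table, regs, "Integer", 3.5)
ok_mult = write_back(fu_table, result_table, regs, "Mult1", 1.0)
print(ok_load, ok_mult)
```

The load's write-back succeeds and marks the second operand of Add as ready, while the write to F6 is held back because Add has not read the old F6 yet.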
[Figure: the functional unit status table shown earlier, annotated with RAW and WAR]
Summary
• Out-of-order issue and out-of-order completion
• Performance limited by
- Amount of ILP in the code segments
- Number of entries in the scoreboard, i.e., amount of look-ahead
- Number of functional units
• Complexity of scoreboard is on the order of a functional unit
In-Order Issue, Out-of-order Execution, Out-of-order Completion
[Figure: block diagram with labels I-Fetch, Execution Core, Retire, and “OOO” Core]
Dynamic Scheduling
• Hardware will detect and preserve
dependencies (within a limited window of the
instruction stream)
• Hardware will check for resource availability
• Independent instructions will be issued to the
correct functional units
IBM 360 Instruction Format
• Known as the RX format
• All instructions (except loads and stores) are of the format
SOURCE op SINK → SINK
where SOURCE may be a memory operand or a register, while the SINK must be a register
Tomasulo’s Algorithm
• Credited to R.M. Tomasulo
• Implemented for the floating point unit of the
IBM 360/
IBM 360/91 FPU
[Figure: IBM 360/91 FPU block diagram. Labeled components: FP Ops “Stack” (from instruction unit), Decoder, FP Registers, Load Buffer (from memory), Store Buffer (to memory), Reservation Stations, FP Add (2 stage), FP Mul/Div (6 stage), Operand Busses, Operation Bus, Common Data Bus]
IBM 360/91 FPU
• FP operations are sent by the instruction unit to the FPU into a “stack” (IBM terminology - actually a queue!)
IBM 360/91 FPU
• Buffers for loads. Each load request that goes out to memory is allocated a buffer.
IBM 360/91 FPU
• The two floating point functional units.
IBM 360/91 FPU
• Supplies operands to reservation stations. Each operand has a tag.
IBM 360/91 FPU
• Each reservation station holds the two operands of an operation together with their tags, as well as the busy bit (which indicates if the operand is available).
Data Structures
• LD/SD buffers act as reservation stations for memory units
• Instruction execution cannot start until all preceding branches are resolved
Reservation station fields: Op, Qj, Qk, Vj, Vk, A, Busy
Register fields: value, Qi
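The field names above can be read as two small record types. The following is a minimal sketch that assumes the usual textbook meaning of the fields (Qj, Qk and Qi hold tags of producing stations, Vj and Vk hold operand values, A holds a memory address for loads and stores); the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                    # station tag, e.g. "Add1"
    busy: bool = False
    op: Optional[str] = None
    qj: Optional[str] = None     # tag producing operand j (None: value in vj)
    qk: Optional[str] = None     # tag producing operand k (None: value in vk)
    vj: Optional[float] = None
    vk: Optional[float] = None
    a: Optional[int] = None      # address field, used by load/store buffers

    def ready(self) -> bool:
        """An occupied station can start once no operand tag is outstanding."""
        return self.busy and self.qj is None and self.qk is None


@dataclass
class RegisterEntry:
    value: float = 0.0
    qi: Optional[str] = None     # tag of the station that will write this register


# A station waiting on one operand from a (hypothetical) station "Mul1".
rs = ReservationStation("Add1", busy=True, op="ADD", vj=1.5, qk="Mul1")
before = rs.ready()
rs.vk, rs.qk = 6.0, None         # operand captured when it is produced
after = rs.ready()
print(before, after)
```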
IBM 360/91 FPU
• All operand transport occurs on the common data bus - only one operand may occupy the bus at a time.
Tomasulo’s Algorithm
1. Decode an operation at the head of the floating
point operation stack
2. Look for an empty reservation station in the
functional unit corresponding to the operation.
If none exists, instruction issue stalls until one
becomes available
3. Read the source operands from the register
file, bringing forward the tags
Tomasulo’s Algorithm - cont’d
4. Mark the busy bit of the SINK in the register
file. Also, the tag will be set to point to the
selected reservation station
5. When the functional unit completes its
execution, it will write its result and the
corresponding reservation station number back
to the register file via the common data bus
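Steps 1 to 5 can be sketched as an issue function and a common data bus broadcast. This is a simplified illustration, not the 360/91 design: both SOURCE and SINK are registers, the register busy bit is folded into "tag is not None", execution timing is not modeled, and the capture of broadcast results by waiting reservation stations is an assumption based on the usual description of the algorithm. All names are hypothetical.

```python
def issue(op, sink, source, stations, regs):
    """Steps 1-4 for `SOURCE op SINK -> SINK`. `stations` holds the
    reservation stations of the functional unit that executes `op`."""
    rs = next((s for s in stations if not s["busy"]), None)
    if rs is None:
        return None                       # step 2: no free station, stall
    rs.update(busy=True, op=op)
    # Step 3: read each operand, or bring forward its tag if still pending.
    for v, q, reg in (("vj", "qj", source), ("vk", "qk", sink)):
        tag = regs[reg]["tag"]
        rs[q] = tag
        rs[v] = None if tag else regs[reg]["value"]
    regs[sink]["tag"] = rs["name"]        # step 4: SINK now waits on this station
    return rs["name"]


def broadcast(tag, value, stations, regs):
    """Step 5: one (tag, value) pair occupies the common data bus."""
    for rs in stations:
        for v, q in (("vj", "qj"), ("vk", "qk")):
            if rs.get(q) == tag:          # waiting station captures the value
                rs[v], rs[q] = value, None
        if rs["name"] == tag:
            rs["busy"] = False            # producing station is released
    for r in regs.values():
        if r["tag"] == tag:               # register still expects this result
            r["value"], r["tag"] = value, None


regs = {name: {"value": val, "tag": None}
        for name, val in (("F0", 2.0), ("F2", 3.0), ("F4", 4.0))}
mul_rs = [{"name": "Mul1", "busy": False}]
add_rs = [{"name": "Add1", "busy": False}]

t1 = issue("MUL", "F0", "F2", mul_rs, regs)   # F2 * F0 -> F0
t2 = issue("ADD", "F4", "F0", add_rs, regs)   # F0 + F4 -> F4, waits on Mul1
broadcast("Mul1", 6.0, mul_rs + add_rs, regs)
print(t1, t2, regs["F0"], add_rs[0])
```

After the broadcast, F0 holds the product and the add station has both operand values, while F4 remains tagged with Add1 until that station broadcasts in turn.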