In-Order & Out-of-Order Execution: Pentium II, Pentium 4, UltraSPARC III, Slides of Information and Computer Technology

An overview of in-order and out-of-order execution in several microarchitectures, including the Pentium II, Pentium 4, UltraSPARC III, and 8051 CPUs. It discusses in-order execution, out-of-order execution, and speculative execution, along with the problems and techniques related to each. The document also covers the architecture of each CPU, such as the fetch/decode unit, dispatch/execute unit, retire unit, and memory subsystem.

Typology: Slides

2012/2013

Uploaded on 04/29/2013

architay

In-Order Execution

  • In-order execution does not always give the best performance on superscalar machines.
  • The following example uses in-order execution and in-order completion.
  • Multiplication takes one more cycle to complete than addition/subtraction.
  • A scoreboard keeps track of register usage.
    • User-visible registers are R0 to R8.
    • Multiple instructions can read a register at the same time, but only one can write a register.
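
The scoreboard rule above (many readers, one writer per register) can be sketched as follows. This is a minimal illustration, not the scoreboard of any particular CPU: the class name `Scoreboard` and its methods `can_issue`, `issue`, and `retire` are hypothetical, and registers are identified by their index (0 for R0 up to 8 for R8).

```python
class Scoreboard:
    """Hypothetical scoreboard: per register, count in-flight readers and note a pending writer."""

    def __init__(self, num_regs=9):  # R0..R8, as on the slide
        self.readers = [0] * num_regs
        self.writer = [False] * num_regs

    def can_issue(self, dest, srcs):
        # A source register with a pending write is not ready to be read.
        if any(self.writer[s] for s in srcs):
            return False
        # The destination must not be in use: no in-flight readers, no pending writer.
        return self.readers[dest] == 0 and not self.writer[dest]

    def issue(self, dest, srcs):
        # Mark the registers as in use by this instruction.
        for s in srcs:
            self.readers[s] += 1
        self.writer[dest] = True

    def retire(self, dest, srcs):
        # Release the registers when the instruction completes.
        for s in srcs:
            self.readers[s] -= 1
        self.writer[dest] = False
```

With this sketch, an instruction computing R3 from R0 and R1 blocks a later instruction that reads R3 or writes R0 until it retires, while another reader of R0 and R1 can still issue.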

In-Order Execution

  • We can notice three kinds of dependencies which can cause problems (instruction stalls):

  • RAW (Read After Write) dependence
  • WAR (Write After Read) dependence
  • WAW (Write After Write) dependence
  • In a WAR dependence, one instruction is trying to overwrite a register that a previous instruction may not yet have finished reading. A WAW dependence is similar.
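
The three kinds of dependence can be checked mechanically for a pair of instructions. The sketch below assumes each instruction is written as a `(dest, srcs)` pair of register names; the function name `hazards` and this representation are illustrative choices, not part of the slides.

```python
def hazards(earlier, later):
    """Return the set of dependencies that `later` has on `earlier`.

    Each instruction is a (dest, srcs) pair, e.g. ("R3", ("R0", "R1")).
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")  # later reads what earlier writes
    if l_dest in e_srcs:
        found.add("WAR")  # later overwrites what earlier still reads
    if l_dest == e_dest:
        found.add("WAW")  # both write the same register
    return found
```

For example, `hazards(("R3", ("R0", "R1")), ("R4", ("R3", "R2")))` reports a RAW dependence, because the second instruction reads R3 before the first is guaranteed to have written it.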

In-Order Execution

  • In-order completion is also important, because it gives the machine precise interrupts.
  • Out-of-order completion leads to imprecise interrupts: at the time of an interrupt, we do not know which instructions have completed.
  • To avoid stalls, let us now permit out-of-order execution and out-of-order retirement.

Out-of-Order Execution

  • The previous example also introduces a new technique called register renaming.

  • The decode unit has changed the use of R1 in I6 and I7 to a secret register, S1, not visible to the programmer.
  • Now I6 can be issued concurrently with I5.
  • Modern CPUs often have dozens of secret registers for use with register renaming.
  • This can often eliminate WAR and WAW dependencies.
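
A rough sketch of register renaming is shown below. It simplifies the hardware in two labeled ways: every destination gets a fresh secret register (the slide's example renames only R1), and the pool of secret registers is unbounded, whereas a real CPU has a fixed number and recycles them. The function name `rename` and the `(dest, srcs)` instruction format are assumptions made for illustration.

```python
from itertools import count


def rename(program):
    """Rename every destination to a fresh secret register S1, S2, ...

    `program` is a list of (dest, srcs) pairs using programmer-visible names.
    Assumption: the supply of secret registers never runs out.
    """
    latest = {}                            # visible register -> secret register holding its newest value
    fresh = (f"S{n}" for n in count(1))
    renamed = []
    for dest, srcs in program:
        # Sources read the most recent version of each register.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = next(fresh)
        latest[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed
```

After renaming, two instructions that both wrote R1 write different secret registers, so the WAW dependence disappears, and a write to R1 no longer conflicts with an earlier read of R1 (WAR). A later instruction that reads R1 is redirected to the newest secret register, so the true RAW dependence is preserved.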

Speculative Execution

  • Computer programs can be broken up into basic blocks, with each basic block consisting of a linear sequence of code with one entry point and one exit.
  • A basic block does not contain any control structures.
    • Therefore its machine language translation does not contain any branches.
  • Basic blocks are connected by control statements. Programs in this form can be represented by directed graphs.
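
Splitting a program into basic blocks can be sketched with the usual "leader" approach: a block starts at the first instruction, at every branch target, and right after every branch. The representation below is an assumption made for illustration: each instruction is a `(text, target)` pair, where `target` is the index of the branch target or `None`. One common convention, used here, keeps the branch that connects two blocks as the final instruction of the block it ends.

```python
def basic_blocks(instrs):
    """Split a list of (text, target) instructions into basic blocks.

    Returns a list of blocks, each a list of instruction indices.
    """
    if not instrs:
        return []
    leaders = {0}                          # the first instruction starts a block
    for i, (_, target) in enumerate(instrs):
        if target is not None:
            leaders.add(target)            # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)         # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]
```

The edges of the directed graph then run from each block to the block at its branch target and, for conditional branches, to the block that follows it.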

Speculative Execution

  • Within each basic block, the reordering techniques seen work well.
  • Unfortunately, most basic blocks are short and there is insufficient parallelism to exploit.
  • The next step is to allow reordering to cross block boundaries.
  • The biggest gains come when a potentially slow operation can be moved upward in the graph to get it going earlier. Moving code upward over a branch is called hoisting.

Speculative Execution

  • Imagine that all of the variables of the previous example except evensum and oddsum are kept in registers.
  • It might make sense to move the LOAD instructions for evensum and oddsum to the top of the loop, before computing k, to get them started early on, so the values will be available when they are needed.
  • Of course only one of them will be needed on each iteration, so the other LOAD will be wasted.

Speculative Execution

  • Another problem arises if a speculatively executed instruction causes an exception.
  • A LOAD instruction may cause a cache miss on a machine with a large cache line and a memory far slower than the CPU and cache.
  • One solution is to have a special SPECULATIVE-LOAD instruction that tries to fetch the word from the cache, but if it is not there, just gives up.
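
The difference between an ordinary LOAD and a SPECULATIVE-LOAD can be modeled in a few lines. This is a toy model under stated assumptions: the class `Memory`, its dictionary-based cache, and the `slow_accesses` counter are hypothetical and stand in for the real cache hierarchy.

```python
class Memory:
    """Toy memory with a cache; counts how often slow main memory is accessed."""

    def __init__(self, main, cache=None):
        self.main = main                   # address -> word (slow)
        self.cache = dict(cache or {})     # address -> word (fast)
        self.slow_accesses = 0

    def load(self, addr):
        # Ordinary LOAD: on a miss, pay for the slow access and fill the cache.
        if addr not in self.cache:
            self.slow_accesses += 1
            self.cache[addr] = self.main[addr]
        return self.cache[addr]

    def speculative_load(self, addr):
        # SPECULATIVE-LOAD: return the word only if it is already cached;
        # on a miss, give up (None) without touching main memory.
        return self.cache.get(addr)
```

A wasted speculative load therefore costs almost nothing: if the word is not in the cache, the program simply issues a normal load later, when it is certain the value is needed.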

Speculative Execution

  • A worse situation happens with the following statement: if (x > 0) z = y/x;
  • Suppose that the variables are all fetched into registers in advance and that the (slow) floating-point division is hoisted above the if test.
    • If x is 0, the resulting divide-by-zero trap terminates the program even though the programmer has put in explicit code to prevent this situation.
    • One solution is to have special versions of instructions that might cause exceptions.
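
One way such a special instruction could behave is to defer the trap: instead of failing immediately, the speculative division records that its result is invalid, and the trap is raised only if the result is actually used. The sketch below assumes this deferral scheme, which the slide does not spell out; the marker `POISON` and the helpers `speculative_div`, `use`, and `guarded_divide` are hypothetical names.

```python
POISON = object()  # hypothetical marker for "this result is invalid"


def speculative_div(y, x):
    """Special version of divide: never traps, returns POISON instead."""
    try:
        return y / x
    except ZeroDivisionError:
        return POISON


def use(value):
    """An ordinary instruction touching a poisoned value raises the deferred trap."""
    if value is POISON:
        raise ZeroDivisionError("deferred trap from speculative division")
    return value


def guarded_divide(x, y, z):
    """Models `if (x > 0) z = y/x;` with the division hoisted above the test."""
    t = speculative_div(y, x)   # started early, before x is tested
    if x > 0:
        z = use(t)              # result consumed only on the guarded path
    return z
```

With x equal to 0 the hoisted division produces a poisoned result that is never used, so the program continues and z keeps its old value, which is what the programmer's test intended.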

Pentium II Microarchitecture

  • There are three primary components of the CPU:
    • Fetch/Decode unit
    • Dispatch/Execute unit
    • Retire unit
  • Together they act as a high-level pipeline.
  • The units communicate through an instruction pool.
    • The ROB (ReOrder Buffer) is a table which stores information about partially completed instructions.

Pentium II Microarchitecture

The Fetch/Decode Unit

  • The Fetch/Decode unit is highly pipelined, with seven stages.
    • Instructions enter the pipeline in stage IFU0, where entire 32-byte lines are loaded from the I-cache.
    • Since the IA-32 instruction set has variable-length instructions with many formats, IFU1 analyzes the byte stream to locate the start of each instruction.
    • IFU2 aligns the instructions so the next stage can decode them easily.
    • Decoding starts in ID0. Each IA-32 instruction is broken up into one or more micro-operations. Simple instructions may require just 1 micro-op.

The Fetch/Decode Unit

  • The micro-operations are queued in stage ID1. This stage also does branch prediction.
  • The static predictor predicts backward branches to be taken and forward ones not to be. After that, the dynamic branch predictor uses a 4-bit history-based algorithm. If the branch is not in the history table, the static prediction is used.
  • To avoid WAR and WAW dependencies, the Pentium II supports register renaming using one of 40 internal scratch registers. This is done in the RAT stage.
  • Finally, the micro-operations are deposited in the ROB at a rate of three per clock cycle. A micro-op is issued when all the resources it requires are ready.
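
The interplay of static and dynamic branch prediction described above can be sketched as follows. The static rule (backward taken, forward not taken) and the fallback to it for unseen branches follow the slide. The dynamic part is a deliberate stand-in: it keeps the last four outcomes per branch and predicts by majority vote, which is not the Pentium II's actual history-based algorithm. The class name `BranchPredictor` and its methods are hypothetical.

```python
class BranchPredictor:
    """Static prediction for unseen branches, a simple 4-outcome history otherwise."""

    def __init__(self):
        self.history = {}  # branch address -> list of the last (up to 4) outcomes

    def predict(self, branch_addr, target_addr):
        past = self.history.get(branch_addr)
        if past is None:
            # Static rule: backward branches (loops) taken, forward ones not.
            return target_addr < branch_addr
        # Stand-in dynamic rule (an assumption): majority of recent outcomes, ties count as taken.
        return sum(past) * 2 >= len(past)

    def update(self, branch_addr, taken):
        past = self.history.setdefault(branch_addr, [])
        past.append(taken)
        del past[:-4]  # keep only the four most recent outcomes
```

A loop-closing branch is thus predicted taken the first time it is seen, and afterwards the prediction follows the branch's recent behavior.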