Instruction Pipelining and CPU Organization, Study notes of Computer Architecture and Organization

These notes introduce instruction pipelining in CPU organization, including the Pentium and PowerPC case studies. They cover the functions performed by the CPU, organizational requirements, register organization, user-visible registers, control and status registers, and the instruction cycle. They also explain how instruction pipelining can improve efficiency by performing tasks concurrently on different sequential instructions.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-6c4

EE 4504 Section 8 1

EE 4504

Computer Organization

Section 8

The CPU Structure

Overview

This section investigates how a typical CPU is organized:

  • Major components (revisited)
  • Register organization
  • The instruction cycle (revisited)
  • Instruction pipelining
  • Pentium and PowerPC case studies

Reading: Text, Chapter 11 (Sections 1-4), Chapter 13 (Sections 1 and 2)

CPU organization

Recall the functions performed by the CPU:

  • Fetch instructions
  • Fetch data
  • Process data
  • Write data

Organizational requirements that are derived from these functions:

  • ALU
  • Control logic
  • Temporary storage
  • Means to move data and instructions in and around the CPU

Figure 11.1 External view of the CPU

User-visible Registers

General categories based on function

  • General purpose
    » Can be assigned a variety of functions
    » Ideally, they are defined orthogonally to the operations within the instructions
  • Data
    » These registers only hold data
  • Address
    » These registers only hold address information
    » Examples: general purpose address registers, segment pointers, stack pointers, index registers
  • Condition codes
    » Visible to the user but values set by the CPU as the result of performing operations
    » Example code bits: zero, positive, overflow
    » Bit values are used as the basis for conditional jump instructions
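
As a rough illustration of how condition codes get set by an operation and then drive a conditional jump, the sketch below adds two 8-bit values and derives zero, positive, and overflow bits. The function names, the 8-bit width, and the extra carry bit are assumptions made for the example; they are not taken from the notes or from any particular processor.

```python
def add8(a, b):
    """Add two 8-bit values; return (result, condition codes)."""
    raw = (a & 0xFF) + (b & 0xFF)
    result = raw & 0xFF
    # Bit 7 is the sign bit when values are read as two's complement.
    sign_a, sign_b, sign_r = a & 0x80, b & 0x80, result & 0x80
    codes = {
        "zero": result == 0,
        "positive": result != 0 and sign_r == 0,
        "carry": raw > 0xFF,  # assumed extra bit, not listed in the notes
        # Signed overflow: operands share a sign but the result's sign differs.
        "overflow": sign_a == sign_b and sign_r != sign_a,
    }
    return result, codes


def next_pc_branch_if_zero(codes, target, fall_through):
    """Conditional jump: go to target only when the zero bit is set."""
    return target if codes["zero"] else fall_through


result, codes = add8(0x7F, 0x01)  # 127 + 1 does not fit in signed 8 bits
print(hex(result), codes)
```

The program never writes the code bits directly; it only tests them, which is the sense in which they are visible to the user but set by the CPU.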

Design trade-off between general purpose and specialized registers

  • General purpose registers maximize flexibility in instruction design
  • Special purpose registers permit implicit register specification in instructions, which reduces the register field size in an instruction
  • No clear “best” design approach

How many registers are enough?

  • More registers permit more operands to be held within the CPU, reducing memory bandwidth requirements to some extent
  • More registers cause an increase in the field sizes needed to specify registers in an instruction word
  • Locality of reference may not support too many registers
  • Most machines use 8-32 registers (this does not include RISC machines with register windowing, which are covered later)
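
The field-size cost can be made concrete with a small calculation. This is a minimal sketch, assuming each register is named by a plain binary number and that every operand of an instruction is a register; the function names are made up for the example.

```python
def register_field_bits(num_registers):
    """Bits needed to name one of num_registers registers."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    return (num_registers - 1).bit_length()


def operand_bits(num_registers, operands_per_instruction):
    """Total instruction bits spent on register fields."""
    return register_field_bits(num_registers) * operands_per_instruction


for n in (8, 16, 32):
    print(n, "registers:", register_field_bits(n), "bits per field,",
          operand_bits(n, 3), "bits for three operands")
```

A single implied register needs no field at all, which is the saving that special purpose registers offer.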

How big (wide)?

  • Address registers should be wide enough to hold the longest address!
  • Data registers should be wide enough to hold most data types
    » Would not want to use 64-bit registers if the vast majority of data operations used 16- and 32-bit operands
    » Related to width of memory data bus
    » Concatenate registers together to store longer formats
      – B-C registers in the 8085
      – AccA-AccB registers in the 68HC
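
Concatenating two narrow registers into one wider value is just a shift and an OR. The sketch below treats an 8-bit register pair the way the 8085 treats B-C, with B as the high-order byte; the function names are hypothetical.

```python
def pair_to_word(high, low):
    """Join two 8-bit registers into one 16-bit value (high byte first)."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


def word_to_pair(word):
    """Split a 16-bit value back into (high, low) 8-bit registers."""
    return (word >> 8) & 0xFF, word & 0xFF


b, c = 0x12, 0x34
bc = pair_to_word(b, c)
print(hex(bc), [hex(x) for x in word_to_pair(bc)])
```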

Control and status registers

These registers are used during the fetching, decoding and execution of instructions

  • Many are not visible to the user/programmer
  • Some are visible but cannot be (easily) modified

Typical registers

  • Program counter
    » Points to the next instruction to be executed
  • Instruction register
    » Contains the instruction being executed
  • Memory address register
  • Memory data/buffer register
  • Program status word(s)
    » Superset of condition code register
    » Interrupt masks, supervisory modes, etc.
    » Status information

Instruction Cycle

Recall the instruction cycle from an earlier chapter:

  • Fetch the instruction
  • Decode it
  • Fetch operands
  • Perform the operation
  • Store results
  • Recognize pending interrupts

Based on the addressing techniques from Chapter 9, we can modify the state diagram for the cycle to explicitly show indirection in addressing

Flow of data and information between registers during the instruction cycle varies from processor to processor
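
One possible register flow can be sketched with a toy accumulator machine. Everything about this machine is an assumption made for illustration: the `(opcode, address)` instruction format, the `LOAD`/`ADD`/`STORE`/`HALT` opcodes, and the `_IND` suffix marking an indirect address. It is not the flow of any real processor.

```python
def run(memory, max_steps=100):
    """Run a toy accumulator machine and return its final register values."""
    reg = {"PC": 0, "IR": None, "MAR": 0, "MBR": 0, "ACC": 0}
    for _ in range(max_steps):
        # Fetch the instruction: PC -> MAR, memory -> MBR -> IR, advance PC.
        reg["MAR"] = reg["PC"]
        reg["MBR"] = memory[reg["MAR"]]
        reg["IR"] = reg["MBR"]
        reg["PC"] += 1

        # Decode it.
        opcode, address = reg["IR"]
        if opcode == "HALT":
            break

        # Indirect cycle: the address field points at the operand's address.
        if opcode.endswith("_IND"):
            opcode = opcode[:-len("_IND")]
            reg["MAR"] = address
            reg["MBR"] = memory[reg["MAR"]]
            address = reg["MBR"]

        # Fetch operands, perform the operation, store results.
        reg["MAR"] = address
        if opcode == "LOAD":
            reg["MBR"] = memory[reg["MAR"]]
            reg["ACC"] = reg["MBR"]
        elif opcode == "ADD":
            reg["MBR"] = memory[reg["MAR"]]
            reg["ACC"] += reg["MBR"]
        elif opcode == "STORE":
            reg["MBR"] = reg["ACC"]
            memory[reg["MAR"]] = reg["MBR"]
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        # A real CPU would check for pending interrupts here.
    return reg


# Cells 0-3 hold the program; cell 4 holds a pointer; cells 5-7 hold data.
memory = [("LOAD", 5), ("ADD_IND", 4), ("STORE", 7), ("HALT", 0), 6, 2, 3, 0]
print(run(memory), memory[7])
```

Each loop iteration finishes one instruction completely before the next fetch begins, so most of the registers sit idle at any given moment.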

Figure 11.7 More complete instruction cycle state diagram

Instruction pipelining

The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction

A “good” design goal of any system is to have all of its components performing useful work all of the time, that is, high efficiency

Following the instruction cycle in a sequential fashion does not permit this level of efficiency

Compare the instruction cycle to an automobile assembly line

  • Perform all tasks concurrently, but on different (sequential) instructions
  • The result is temporal parallelism
  • Result is the instruction pipeline

An ideal pipeline divides a task into k independent sequential subtasks

  • Each subtask requires 1 time unit to complete
  • The task itself then requires k time units to complete

For n iterations of the task, the execution times will be:

  • With no pipelining: nk time units
  • With pipelining: k + (n-1) time units

Speedup of a k-stage pipeline is thus

S = nk / [k + (n-1)], which approaches k for large n
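
The formulas above can be checked numerically. This is a minimal sketch with made-up function names; it uses k = 6 and n = 9, the values consistent with the 14 versus 54 time units quoted for Figure 11.12.

```python
def unpipelined_time(n, k):
    """Time units for n tasks when each runs all k subtasks alone."""
    return n * k


def pipelined_time(n, k):
    """Time units for n tasks in an ideal k-stage pipeline."""
    return k + (n - 1)


def speedup(n, k):
    return unpipelined_time(n, k) / pipelined_time(n, k)


k = 6
for n in (1, 9, 100, 1_000_000):
    print(n, unpipelined_time(n, k), pipelined_time(n, k), round(speedup(n, k), 3))
```

For a single task the speedup is exactly 1, since the pipeline only pays off once it has filled.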

Figure 11.12 Pipelined execution of 9 instructions in 14 time units vs. 54
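
Since the figure itself is not reproduced here, a text version of such a timing diagram can be generated. The sketch assumes an ideal pipeline with no stalls and labels the stages 1 through k rather than guessing at the stage names used in the text; `timing_table` is a hypothetical helper.

```python
def timing_table(n, k):
    """One row per instruction, one column per time unit.

    Each cell holds the stage number occupying that instruction at that
    time, or "." when the instruction is not in the pipeline.
    """
    total = k + (n - 1)
    rows = []
    for i in range(n):
        row = ["."] * total
        for stage in range(k):
            row[i + stage] = str(stage + 1)  # instruction i enters at time i
        rows.append(row)
    return rows


for number, row in enumerate(timing_table(9, 6), start=1):
    print(f"I{number}: " + " ".join(row))
```

Reading down any column shows each stage working on a different instruction, which is the temporal parallelism described earlier.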

Figure 11.13 Impact of a branch after instruction 3 (to instruction 15)

Pipeline Limitations

Pipeline depth

  • If the speedup is based on the number of stages, why not build lots of stages?
  • Each stage uses latches at its input (output) to buffer the next set of inputs
    » If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead
    » Also suffer a time overhead in the propagation time through the latches
      – Limits the rate at which data can be clocked through the pipeline
  • Logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth
  • Data dependencies also factor into the effective length of pipelines

Data dependencies

  • Pipelining, as a form of parallelism, must ensure that computed results are the same as if computation were performed in strict sequential order
  • With multiple stages, two instructions “in execution” in the pipeline may have data dependencies, so the pipeline must be designed to keep these from producing incorrect results
    » Data dependencies limit when an instruction can be input to the pipeline
  • Data dependency examples

A = B + C
D = E + A
C = G x H
A = D / H
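
The dependencies in these four statements can be listed mechanically. The sketch below uses the common RAW (read after write), WAR (write after read), and WAW (write after write) labels, which are standard terminology rather than terms used in these notes. It compares every pair of statements and ignores the effect of intervening writes, so treat it as a rough illustration only.

```python
def find_dependencies(program):
    """Compare every earlier/later statement pair and report shared names.

    program is a list of (destination, sources) tuples in program order.
    Returns (earlier, later, kind, variable) tuples, numbered from 1.
    """
    found = []
    for i, (dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j, srcs_j = program[j]
            if dest_i in srcs_j:   # later reads what earlier wrote
                found.append((i + 1, j + 1, "RAW", dest_i))
            if dest_j in srcs_i:   # later overwrites what earlier read
                found.append((i + 1, j + 1, "WAR", dest_j))
            if dest_i == dest_j:   # both write the same variable
                found.append((i + 1, j + 1, "WAW", dest_i))
    return found


program = [
    ("A", ("B", "C")),  # A = B + C
    ("D", ("E", "A")),  # D = E + A
    ("C", ("G", "H")),  # C = G x H
    ("A", ("D", "H")),  # A = D / H
]
for dep in find_dependencies(program):
    print(dep)
```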

  • Multiple streams
    » Replicate the initial portions of the pipeline and fetch both possible next instructions
    » Increases chance of memory contention
    » Must support multiple streams for each instruction in the pipeline
  • Prefetch branch target
    » When the branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer
    » If the branch is not taken, the sequential instructions are already in the pipe, so there is no loss of performance
    » If the branch is taken, the next instruction has been prefetched, which results in minimal branch penalty (no memory read is needed at the end of the branch to fetch the instruction)

  • Look ahead, look behind buffer (loop buffer)
    » Many conditional branch operations are used for loop control
    » Expand the prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed
    » If the buffer is big enough, an entire loop can be held in it, reducing the branch penalty

[Figure: loop buffer, labeled with the PC, pending instructions, and previous instructions]

  • Branch prediction
    » Make a good guess as to which instruction will be executed next and start that one down the pipeline
    » If the guess turns out to be right, no loss of performance in the pipeline
    » If the guess was wrong, empty the pipeline and restart with the correct instruction, suffering the full branch penalty
    » Static guesses: make the guess without considering the runtime history of the program
      – Branch never taken
      – Branch always taken
      – Predict based on the opcode
    » Dynamic guesses: track the history of conditional branches in the program
      – Taken / not taken switch
      – History table

Figure 11.16 Branch prediction using 2 history bits
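
One widely used way to predict with history bits is a saturating counter per branch. The sketch below is that scheme, offered as an assumption: the state diagram in Figure 11.16 is not reproduced here and its transitions may differ. With `bits=1` the counter behaves like a taken / not taken switch that remembers only the last outcome. The class name, function name, and loop pattern are hypothetical.

```python
class SaturatingCounter:
    """Per-branch history counter; the upper half of its range predicts taken."""

    def __init__(self, bits=2):
        self.max_state = 2 ** bits - 1
        self.threshold = 2 ** (bits - 1)
        self.state = self.max_state  # start out predicting "taken"

    def predict(self):
        return self.state >= self.threshold

    def update(self, taken):
        # Move toward the observed outcome, saturating at both ends.
        if taken:
            self.state = min(self.max_state, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outcomes, bits):
    counter = SaturatingCounter(bits)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop-closing branch: taken four times, then falls through, run three times.
outcomes = ([True] * 4 + [False]) * 3
print("1 bit :", mispredictions(outcomes, bits=1))
print("2 bits:", mispredictions(outcomes, bits=2))
```

With two bits, a single loop exit does not flip the prediction, so re-entering the loop costs no extra miss; the one-bit switch misses both on exit and on re-entry.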

Superscalar

  • Implement the CPU such that more than one instruction can be performed (completed) at a time
  • Involves replication of some or all parts of the CPU/ALU
  • Examples:
    » Fetch multiple instructions at the same time
    » Decode multiple instructions at the same time
    » Perform add and multiply at the same time
    » Perform load/stores while performing ALU operation
  • Degree of parallelism and hence the speedup of the machine goes up as more instructions are executed in parallel

Figure 13.1 Comparison of superscalar and superpipeline operation to “regular” pipelines

Superscalar design limitations

Data dependencies: must ensure computed results are the same as would be computed on a strictly sequential machine

  • Two instructions cannot be executed in parallel if the (data) output of one is the input of the other or if they both write to the same output location
  • Consider:

S1: A = B + C
S2: D = A + 1
S3: B = E + F
S4: A = E + 3

Resource dependencies

  • In the above sequence of instructions, the adder unit gets a real workout!
  • Parallelism is limited by the number of adders in the ALU
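
How the two limits interact can be sketched for S1 to S4. The model is deliberately simple and rests on assumptions not stated in the notes: every instruction is an add that takes one cycle, instructions issue strictly in program order, and a group of instructions shares a cycle only if no pair in it conflicts under the rule above. The function names are hypothetical.

```python
def conflicts(a, b):
    """True if a and b cannot run in parallel under the stated rule."""
    dest_a, srcs_a = a
    dest_b, srcs_b = b
    return dest_a in srcs_b or dest_b in srcs_a or dest_a == dest_b


def issue_in_order(program, num_adders):
    """Group instructions into cycles, issuing greedily in program order."""
    if num_adders < 1:
        raise ValueError("need at least one adder")
    cycles, current = [], []
    for name, dest, srcs in program:
        instr = (dest, srcs)
        blocked = len(current) >= num_adders or any(
            conflicts(instr, other) for _, other in current
        )
        if blocked:
            cycles.append([n for n, _ in current])  # close the current cycle
            current = []
        current.append((name, instr))
    if current:
        cycles.append([n for n, _ in current])
    return cycles


program = [
    ("S1", "A", ("B", "C")),
    ("S2", "D", ("A",)),      # the constant 1 is not a register operand
    ("S3", "B", ("E", "F")),
    ("S4", "A", ("E",)),
]
for adders in (1, 2, 3):
    print(adders, "adder(s):", issue_in_order(program, adders))
```

Going from one adder to two shortens the schedule, but a third adder does not help in this model because the data dependencies already force three cycles.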

Instruction issue policy: in what order are instructions issued to the execution unit and in what order do they finish?

  • In-order issue, in-order completion
    » Simplest method, but severely limits performance
    » Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions
    » “Slow” execution of some instructions delays all subsequent instructions
  • In-order issue, out-of-order completion
    » Any number of instructions can be executed at a time
    » Instruction issue is still limited by resource conflicts or data and procedural dependencies
    » Output dependencies resulting from out-of-order completion must be resolved
    » “Instruction” interrupts can be tricky
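
The effect of one slow instruction under the two completion policies can be sketched with a toy timing model. The assumptions are strong and are not from the notes: one instruction issues per cycle in program order, there are no dependencies or resource conflicts, and the latencies are invented. In-order completion is modeled only by holding back a result until every earlier instruction has completed.

```python
def completion_times(latencies, in_order_completion):
    """Cycle in which each instruction completes, in program order."""
    times = []
    for issue_cycle, latency in enumerate(latencies):
        finish = issue_cycle + latency
        if in_order_completion and times:
            # May not complete before the previous instruction does.
            finish = max(finish, times[-1])
        times.append(finish)
    return times


latencies = [1, 4, 1, 1]  # the second instruction is the "slow" one
print("in-order completion    :", completion_times(latencies, True))
print("out-of-order completion:", completion_times(latencies, False))
```

In this model the two fast instructions behind the slow one finish earlier when out-of-order completion is allowed, at the cost of having to resolve the output dependencies noted above.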

Impact on machine parallelism

  • Adding (ALU) functional units without register renaming support may not be cost-effective
    » Performance is limited by data dependencies
  • Out-of-order issue benefits from large instruction buffer windows
    » Easier for a functional unit to find a pending instruction
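
Why renaming matters can be sketched with the S1 to S4 sequence used earlier in these notes. The idea, shown here in a simplified form that is an assumption rather than the method from the text, is to give every write a fresh name so that only true producer-to-consumer links remain. The helper names and the numeric suffix scheme are hypothetical.

```python
def rename(program):
    """Give each destination a fresh versioned name.

    program is a list of (destination, sources) tuples. A source refers
    to the latest version written so far, or version 0 if never written.
    """
    version = {}
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(f"{s}{version.get(s, 0)}" for s in srcs)
        version[dest] = version.get(dest, 0) + 1
        renamed.append((f"{dest}{version[dest]}", new_srcs))
    return renamed


def name_conflicts(program):
    """Count pairs that clash only over a register name (WAR or WAW)."""
    count = 0
    for i, (dest_i, srcs_i) in enumerate(program):
        for dest_j, srcs_j in program[i + 1:]:
            count += (dest_j in srcs_i) + (dest_i == dest_j)
    return count


program = [
    ("A", ("B", "C")),  # S1: A = B + C
    ("D", ("A",)),      # S2: D = A + 1
    ("B", ("E", "F")),  # S3: B = E + F
    ("A", ("E",)),      # S4: A = E + 3
]
renamed = rename(program)
print(renamed)
print(name_conflicts(program), "name conflicts before,",
      name_conflicts(renamed), "after")
```

After renaming, only the link from S1 to S2 through the first version of A remains, so extra functional units have independent work to pick up.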

Summary

In this section, we have focused on the operation of the CPU

  • Registers and their use
  • Instruction execution

Investigated the implementation of “modern” CPUs

  • Pipelining
    » Basic concepts
    » Limitations to performance
  • Superpipelining
  • Superscalar