EE 4504 Section 8 1
EE 4504
Computer Organization
Section 8
The CPU Structure
Overview
This section investigates how a typical
CPU is organized
- Major components (revisited)
- Register organization
- The instruction cycle (revisited)
- Instruction pipelining
- Pentium and PowerPC case studies
Reading: Text, Chapter 11 (Sections 1 --
4), Chapter 13 (Sections 1 and 2)
CPU organization
Recall the functions performed by the
CPU:
- Fetch instructions
- Fetch data
- Process data
- Write data
Organizational requirements that are
derived from these functions:
- ALU
- Control logic
- Temporary storage
- Means to move data and instructions in and around the CPU
Figure 11.1 External view of the CPU
User-visible Registers
General categories based on function
- General purpose
  » Can be assigned a variety of functions
  » Ideally, they are defined orthogonally to the operations within the instructions
- Data
  » These registers only hold data
- Address
  » These registers only hold address information
  » Examples: general purpose address registers, segment pointers, stack pointers, index registers
- Condition codes
  » Visible to the user but values set by the CPU as the result of performing operations
  » Example code bits: zero, positive, overflow
  » Bit values are used as the basis for conditional jump instructions
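As a rough illustration of how condition codes are set as a side effect of an operation, the sketch below adds two 8-bit two's-complement values and derives the three example bits named above. The 8-bit width and the names `add8` and `signed8` are assumptions made for this example, not part of any particular machine.

```python
def signed8(x):
    """Interpret an 8-bit pattern as a two's-complement integer."""
    return x - 256 if x & 0x80 else x


def add8(a, b):
    """Add two 8-bit values; return (result bit pattern, condition codes)."""
    a &= 0xFF
    b &= 0xFF
    raw = (a + b) & 0xFF          # result truncated to the register width
    sa, sb, sr = signed8(a), signed8(b), signed8(raw)
    flags = {
        "zero": raw == 0,
        "positive": sr > 0,
        # Overflow: operands share a sign but the result's sign differs.
        "overflow": (sa >= 0) == (sb >= 0) and (sr >= 0) != (sa >= 0),
    }
    return raw, flags


result, flags = add8(0x7F, 0x01)   # 127 + 1 does not fit in 8 signed bits
print(hex(result), flags)
```

A conditional jump would then test one of these bits (for example, "jump if zero") rather than recomputing the comparison.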
Design trade-off between general purpose
and specialized registers
- General purpose registers maximize flexibility in instruction design
- Special purpose registers permit implicit register specification in instructions -- reduces register field size in an instruction
- No clear “best” design approach
How many registers are enough?
- More registers permit more operands to be held within the CPU -- reducing memory bandwidth requirements to some extent
- More registers cause an increase in the field sizes needed to specify registers in an instruction word
- Locality of reference may not support too many registers
- Most machines use 8-32 registers (does not include RISC machines with register windowing -- will get to that later!)
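The field-size cost mentioned above can be made concrete with a small calculation. This is only a sketch: the function name `register_field_bits` and the three-operand instruction format are assumptions chosen for illustration.

```python
def register_field_bits(num_registers):
    """Bits needed to name one of num_registers registers (ceil(log2 n))."""
    return (num_registers - 1).bit_length()


# Assumed format: a three-operand instruction names three registers.
for n in (8, 16, 32):
    bits = register_field_bits(n)
    print(f"{n} registers: {bits} bits per field, {3 * bits} bits for three operands")
```

Doubling the register count adds one bit to every register field in the instruction word.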
How big (wide)?
- Address registers should be wide enough to hold the longest address!
- Data registers should be wide enough to hold most data types
  » Would not want to use 64-bit registers if the vast majority of data operations used 16- and 32-bit operands
  » Related to width of memory data bus
  » Concatenate registers together to store longer formats
    B-C registers in the 8085
    AccA-AccB registers in the 68HC
Control and status registers
These registers are used during the
fetching, decoding and execution of
instructions
- Many are not visible to the user/programmer
- Some are visible but cannot be (easily) modified
Typical registers
- Program counter
  » Points to the next instruction to be executed
- Instruction register
  » Contains the instruction being executed
- Memory address register
- Memory data/buffer register
- Program status word(s)
  » Superset of condition code register
  » Interrupt masks, supervisory modes, etc.
  » Status information
Instruction Cycle
Recall the instruction cycle from Chapter
- Fetch the instruction
- Decode it
- Fetch operands
- Perform the operation
- Store results
- Recognize pending interrupts
Based on the addressing techniques from
Chapter 9, we can modify the state
diagram for the cycle to explicitly show
indirection in addressing
Flow of data and information between
registers during the instruction cycle varies
from processor to processor
Figure 11.7 More complete instruction cycle state diagram
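The cycle, including the indirect step, can be sketched as a loop over a toy accumulator machine. Everything here is hypothetical and chosen for illustration: the `(opcode, indirect, address)` instruction tuples, the four opcodes, and the function name `run`. Interrupt recognition is left out.

```python
def run(memory):
    """Execute a toy program held in memory; return the accumulator at HALT."""
    pc, acc = 0, 0
    while True:
        # Fetch: PC -> MAR, memory[MAR] -> MBR -> IR, then advance the PC.
        mar = pc
        mbr = memory[mar]
        ir = mbr
        pc += 1

        # Decode the instruction held in the IR.
        opcode, indirect, address = ir
        if opcode == "HALT":
            return acc

        # Indirect cycle: the addressed word holds the operand's address.
        mar = address
        if indirect:
            mar = memory[mar]

        # Fetch operand / perform the operation / store the result.
        if opcode == "LOAD":
            acc = memory[mar]
        elif opcode == "ADD":
            acc += memory[mar]
        elif opcode == "STORE":
            memory[mar] = acc
        else:
            raise ValueError(f"unknown opcode {opcode!r}")


memory = [
    ("LOAD", False, 5),    # acc = memory[5]
    ("ADD", True, 6),      # acc += memory[memory[6]]  (indirect)
    ("STORE", False, 8),   # memory[8] = acc
    ("HALT", False, 0),
    0,                     # unused
    10,                    # data at address 5
    7,                     # pointer at address 6
    32,                    # data at address 7
    0,                     # result goes to address 8
]
print(run(memory), memory[8])
```

Note that every phase here runs strictly one after the other, so at any moment most of the "hardware" in this model is idle.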
Instruction pipelining
The instruction cycle state diagram clearly
shows the sequence of operations that take
place in order to execute a single
instruction
A “good” design goal of any system is to
have all of its components performing
useful work all of the time -- high
efficiency
Following the instruction cycle in a
sequential fashion does not permit this
level of efficiency
Compare the instruction cycle to an
automobile assembly line
- Perform all tasks concurrently, but on different (sequential) instructions
- The result is temporal parallelism: the instruction pipeline
An ideal pipeline divides a task into k
independent sequential subtasks
- Each subtask requires 1 time unit to complete
- The task itself then requires k time units to complete
For n iterations of the task, the execution
times will be:
- With no pipelining: nk time units
- With pipelining: k + (n-1) time units
Speedup of a k-stage pipeline is thus
S = nk / [k+(n-1)] ==> k (for large n)
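The timing formulas above are easy to check numerically. The sketch below uses hypothetical function names; the case n = 9, k = 6 is chosen because it reproduces the 14 vs. 54 time units quoted for Figure 11.12.

```python
def unpipelined_time(n, k):
    """Time units for n tasks of k subtasks each, executed one after another."""
    return n * k


def pipelined_time(n, k):
    """Time units for n tasks on an ideal k-stage pipeline."""
    return k + (n - 1)


def speedup(n, k):
    """Ideal speedup S = nk / [k + (n - 1)]."""
    return unpipelined_time(n, k) / pipelined_time(n, k)


print(unpipelined_time(9, 6), pipelined_time(9, 6))
for n in (9, 100, 10_000, 1_000_000):
    print(n, round(speedup(n, 6), 4))   # approaches k = 6 as n grows
```

For small n the fill time of the pipeline (the first k time units) dominates, which is why the speedup for 9 instructions is well below 6.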
Figure 11.12 Pipelined execution of 9 instructions in 14 time units (vs. 54 without pipelining)
Figure 11.13 Impact of a branch after instruction 3 (to instruction 15)
Pipeline Limitations
Pipeline depth
- If the speedup is based on the number of stages, why not build lots of stages?
- Each stage uses latches at its input (output) to buffer the next set of inputs
  » If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead
  » Also suffer a time overhead in the propagation time through the latches
    Limits the rate at which data can be clocked through the pipeline
- Logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth
- Data dependencies also factor into the effective length of pipelines
Data dependencies
- Pipelining, as a form of parallelism, must ensure that computed results are the same as if computation were performed in strict sequential order
- With multiple stages, two instructions “in execution” in the pipeline may have data dependencies -- must design the pipeline so that these do not produce incorrect results
  » Data dependencies limit when an instruction can be input to the pipeline
- Data dependency examples
    A = B + C
    D = E + A
    C = G x H
    A = D / H
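The dependencies in this four-instruction example can be listed mechanically. The sketch below is a simplified pairwise check, not a full analysis: it compares every earlier instruction with every later one and ignores any redefinition in between. The labels RAW (read after write), WAR (write after read) and WAW (write after write) are the conventional names; the encoding of each instruction as `(destination, sources)` is an assumption for this example.

```python
def dependencies(program):
    """Return (earlier, later, kind, variable) for each dependent pair."""
    found = []
    for i, (dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j, srcs_j = program[j]
            if dest_i in srcs_j:      # later reads what earlier wrote
                found.append((i + 1, j + 1, "RAW", dest_i))
            if dest_j in srcs_i:      # later overwrites what earlier read
                found.append((i + 1, j + 1, "WAR", dest_j))
            if dest_i == dest_j:      # both write the same location
                found.append((i + 1, j + 1, "WAW", dest_i))
    return found


program = [
    ("A", {"B", "C"}),   # 1: A = B + C
    ("D", {"E", "A"}),   # 2: D = E + A
    ("C", {"G", "H"}),   # 3: C = G x H
    ("A", {"D", "H"}),   # 4: A = D / H
]
for dep in dependencies(program):
    print(dep)
```

Only the RAW cases pass a value from one instruction to the next; the WAR and WAW cases arise from reusing a name and matter once instructions can overlap or complete out of order.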
- Multiple streams
  » Replicate the initial portions of the pipeline and fetch both possible next instructions
  » Increases chance of memory contention
  » Must support multiple streams for each instruction in the pipeline
- Prefetch branch target
  » When the branch instruction is decoded, begin to fetch the branch target instruction and place in a second prefetch buffer
  » If the branch is not taken, the sequential instructions are already in the pipe -- no loss of performance
  » If the branch is taken, the next instruction has been prefetched and results in minimal branch penalty (don’t have to incur a memory read operation at the end of the branch to fetch the instruction)
- Look ahead, look behind buffer (loop buffer)
  » Many conditional branch operations are used for loop control
  » Expand prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed
  » If buffer is big enough, entire loop can be held in it -- reducing branch penalty
[Figure: loop buffer holding previous instructions and pending instructions on either side of the PC]
- Branch prediction
  » Make a good guess as to which instruction will be executed next and start that one down the pipeline
  » If the guess turns out to be right, no loss of performance in the pipeline
  » If the guess was wrong, empty the pipeline and restart with the correct instruction -- suffering the full branch penalty
  » Static guesses: make the guess without considering the runtime history of the program
    Branch never taken
    Branch always taken
    Predict based on the opcode
  » Dynamic guesses: track the history of conditional branches in the program
    Taken / not taken switch
    History table
Figure 11.16 Branch prediction using 2 history bits
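One common way to use two history bits is a saturating counter, sketched below. This is an assumption for illustration: the exact state transitions in Figure 11.16 may differ, and the names `TwoBitPredictor` and `count_correct` are hypothetical. The key property is that a single wrong guess does not flip the prediction.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, predictor):
    """Count correct predictions over a sequence of actual branch outcomes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# A loop-closing branch: taken four times, then not taken at loop exit,
# with the whole loop executed three times.
outcomes = ([True] * 4 + [False]) * 3
print(count_correct(outcomes, TwoBitPredictor()), "of", len(outcomes))
```

With this pattern the predictor misses only at each loop exit; a single taken / not taken switch would also miss on the first iteration of each re-entry.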
Superscalar
- Implement the CPU such that more than one instruction can be performed (completed) at a time
- Involves replication of some or all parts of the CPU/ALU
- Examples:
  » Fetch multiple instructions at the same time
  » Decode multiple instructions at the same time
  » Perform add and multiply at the same time
  » Perform load/stores while performing ALU operation
- Degree of parallelism and hence the speedup of the machine goes up as more instructions are executed in parallel
Figure 13.1 Comparison of superscalar and superpipeline operation to “regular” pipelines
Superscalar design limitations
Data dependencies: must ensure computed
results are the same as would be computed
on a strictly sequential machine
- Two instructions cannot be executed in parallel if the (data) output of one is the input of the other or if they both write to the same output location
- Consider:
    S1: A = B + C
    S2: D = A + 1
    S3: B = E + F
    S4: A = E + 3
Resource dependencies
- In the above sequence of instructions, the adder unit gets a real workout!
- Parallelism is limited by the number of adders in the ALU
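The interaction of the two limits can be sketched with a toy in-order issue model for S1 to S4. The assumptions are loud ones: every instruction is an add that takes one time unit, instructions issue strictly in order, and two instructions may share a time unit only if the rule stated above allows it and an adder is free. The names `conflicts` and `issue_groups` are hypothetical.

```python
def conflicts(a, b):
    """Rule from the notes: one's output is the other's input, or same output."""
    dest_a, srcs_a = a
    dest_b, srcs_b = b
    return dest_a in srcs_b or dest_b in srcs_a or dest_a == dest_b


def issue_groups(program, num_adders):
    """Group instructions, in order, into sets that execute in the same time unit."""
    groups = []
    for name, instr in program:
        current = groups[-1] if groups else None
        if (current is not None
                and len(current) < num_adders
                and not any(conflicts(instr, other) for _, other in current)):
            current.append((name, instr))    # joins the current time unit
        else:
            groups.append([(name, instr)])   # must wait for the next time unit
    return [[name for name, _ in group] for group in groups]


program = [
    ("S1", ("A", {"B", "C"})),   # A = B + C
    ("S2", ("D", {"A"})),        # D = A + 1
    ("S3", ("B", {"E", "F"})),   # B = E + F
    ("S4", ("A", {"E"})),        # A = E + 3
]
for adders in (1, 2, 4):
    print(adders, issue_groups(program, adders))
```

Under these assumptions, going from one adder to two saves a time unit, but going from two to four saves nothing further: the data dependencies, not the adders, become the limit.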
Instruction issue policy: in what order are
instructions issued to the execution unit
and in what order do they finish?
- In-order issue, in-order completion
  » Simplest method, but severely limits performance
  » Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions
  » “Slow” execution of some instructions delays all subsequent instructions
- In-order issue, out-of-order completion
  » Any number of instructions can be executed at a time
  » Instruction issue is still limited by resource conflicts or data and procedural dependencies
  » Output dependencies resulting from out-of-order completion must be resolved
  » “Instruction” interrupts can be tricky
Impact on machine parallelism
- Adding (ALU) functional units without register renaming support may not be cost-effective
  » Performance is limited by data dependencies
- Out-of-order issue benefits from large instruction buffer windows
  » Easier for a functional unit to find a pending instruction
Summary
In this section, we have focused on the
operation of the CPU
- Registers and their use
- Instruction execution
Investigated the implementation of
“modern” CPUs
- Pipelining
  » Basic concepts
  » Limitations to performance
- Superpipelining
- Superscalar