Modern Processor Architecture: Pipelining, Superscalar, & Out-of-Order Execution | Study notes Electrical and Electronics Engineering

©MICRODESIGN RESOURCES JULY 12, 1999 MICROPROCESSOR REPORT

by Keith Diefendorff

Having commandeered nearly all the performance-enhancing techniques used by their mainframe and supercomputer predecessors, the microprocessors in today’s PCs employ a dizzying assemblage of microarchitectural features to achieve extraordinary levels of parallelism and speed. Enabled by astronomical transistor budgets, modern PC processors are superscalar, deeply pipelined, out of order, and they even execute instructions speculatively. In this article, we review the basic techniques used in these processors as well as the tricks they employ to circumvent the two most challenging performance obstacles: memory latency and branches.

Two Paths to Performance

The task normally assigned to chip architects is to design the highest-performance processor possible within a set of cost, power, and size constraints established by market requirements. Within these constraints, application performance is usually the best measure of success, although, sadly, the market often mistakes clock frequency for performance.

Two main avenues are open to designers trying to improve performance: making operations faster or executing more of them in parallel. Operations can be made faster in several ways. More advanced semiconductor processes make transistors switch faster and signals propagate faster. Using more transistors can reduce execution-unit latency (e.g., full vs. partial multiplier arrays). Aggressive design methods can minimize the levels of logic needed to implement a given function (e.g., custom vs. standard-cell design) or increase circuit speed (e.g., dynamic vs. static circuits).

For parallelism, today’s PC processors rely on pipelining and superscalar techniques to exploit instruction-level parallelism (ILP). Pipelined processors overlap instructions in time on common execution resources. Superscalar processors overlap instructions in space on separate resources. Both techniques are used in combination.

Unfortunately, performance gains from parallelism often fail to meet expectations. Although a four-stage pipeline, for example, overlaps the execution of four instructions, as Figure 1 shows, it falls far short of a 4× performance boost. The problem is pipeline stalls. Stalls arise from data hazards (data dependencies), control hazards (changes in program flow), and structural hazards (hardware resource conflicts), all of which sap pipeline efficiency.
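The shortfall can be made concrete with a back-of-the-envelope model. The sketch below is illustrative and not taken from the article: it assumes one cycle per stage, an unpipelined baseline that takes one cycle per stage for every instruction, and stalls that simply add cycles. The function name and the stall count are hypothetical.

```python
def pipeline_speedup(n_instructions: int, stages: int, stall_cycles: int = 0) -> float:
    """Speedup of a pipelined machine over an unpipelined one at the same cycle time."""
    # Unpipelined: each instruction passes through every stage before the next starts.
    unpipelined_cycles = n_instructions * stages
    # Pipelined: fill the pipe once, then complete one instruction per cycle,
    # plus any cycles lost to data, control, or structural hazards.
    pipelined_cycles = stages + (n_instructions - 1) + stall_cycles
    return unpipelined_cycles / pipelined_cycles


ideal = pipeline_speedup(1000, stages=4)
# Assumed stall rate of 0.5 cycles per instruction, chosen only for illustration.
stalled = pipeline_speedup(1000, stages=4, stall_cycles=500)
print(f"no stalls: {ideal:.2f}x, with stalls: {stalled:.2f}x")
```

Even without stalls the speedup stays just under 4× because the pipeline must first fill; every stall cycle pushes it lower.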

Lengthening the pipeline, or superpipelining, divides instruction execution into more stages, each with a shorter cycle time; it does not, in general, shorten the execution time of instructions. In fact, it may increase execution time because stages rarely divide evenly and the frequency is set by the longest stage. In addition, longer pipelines experience a higher percentage of stall cycles from hazards, thereby increasing the average cycles per instruction (CPI). Superscalar techniques suffer from similar inefficiencies.

The throughput gains from a longer pipeline, however, usually outweigh the CPI loss, so performance improves. But lengthening the pipeline has limits. As stages shrink, clock skew and latch overheads (setup and hold times) consume a larger fraction of the cycle, leaving less usable time for logic.
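The tradeoff described above can be sketched numerically. This is a rough model under stated assumptions, not a description of any real processor: the stage delays, the 0.5 ns per-stage overhead, and the CPI values are all hypothetical, and the helper names are ours.

```python
from typing import Sequence


def cycle_time_ns(stage_logic_ns: Sequence[float], overhead_ns: float) -> float:
    """Clock period: the slowest stage's logic delay plus latch and skew overhead."""
    return max(stage_logic_ns) + overhead_ns


def report(name: str, stage_logic_ns: Sequence[float], overhead_ns: float, cpi: float) -> dict:
    cycle = cycle_time_ns(stage_logic_ns, overhead_ns)
    result = {
        "latency_ns": len(stage_logic_ns) * cycle,  # one instruction, start to finish
        "ns_per_instr": cpi * cycle,                # average time per completed instruction
        "overhead_share": overhead_ns / cycle,      # fraction of each cycle not spent on logic
    }
    print(name, {key: round(value, 3) for key, value in result.items()})
    return result


# Hypothetical design: 10 ns of total logic, 0.5 ns of overhead per stage.
# The 8-stage split is deliberately uneven; its 1.5 ns stages set the clock.
shallow = report("4-stage", [2.5, 2.5, 2.5, 2.5], overhead_ns=0.5, cpi=1.3)
deep = report("8-stage", [1.0, 1.5, 1.25, 1.25, 1.0, 1.5, 1.25, 1.25], overhead_ns=0.5, cpi=1.6)
```

With these assumed numbers, the deeper pipeline has a longer per-instruction latency (16 ns vs. 12 ns) and loses a larger share of each cycle to overhead, yet it still averages fewer nanoseconds per instruction despite its higher CPI.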

The challenge is to make the pipeline short enough for good efficiency but not so short that ILP and frequency are left lying on the table, i.e., an underpipelined condition. Today’s PC processors use pipelines of 5 to 12 stages. When making this decision, designers must keep in mind that frequency is often more important in the market than performance.

Prophetic Hardware for Long Pipelines

Branch prediction and speculative execution are techniques used to reduce pipeline stalls on control hazards. In a pipelined processor, conditional branches are often encountered before the data that will determine branch direction is ready. Because instructions are fetched ahead of execution, correctly predicting unresolved branches allows the instruction fetcher to keep the instruction queue filled with instructions that have a high probability of being used.
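A simple CPI model suggests why prediction accuracy matters. The sketch below is a simplification of our own, not a formula from the article: it assumes every branch that is not correctly predicted costs a fixed number of lost cycles, and it treats a machine with no prediction as paying that cost on every branch. The branch fraction, penalty, and accuracy are hypothetical values.

```python
def cpi_with_branches(base_cpi: float, branch_fraction: float,
                      mispredict_rate: float, penalty_cycles: float) -> float:
    """Average CPI once cycles lost to unresolved or mispredicted branches are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% of instructions are conditional branches, 5 cycles lost each time.
no_prediction = cpi_with_branches(1.0, 0.20, mispredict_rate=1.0, penalty_cycles=5)
with_prediction = cpi_with_branches(1.0, 0.20, mispredict_rate=0.10, penalty_cycles=5)
print(f"CPI without prediction: {no_prediction:.2f}")
print(f"CPI with 90% accurate prediction: {with_prediction:.2f}")
```

Because the penalty term scales with the number of cycles lost per branch, the same misprediction rate hurts a longer pipeline more than a shorter one.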

Some processors take the next step, actually executing instructions speculatively past unresolved conditional branches. This technique avoids the control-hazard stall altogether when the branch goes in the predicted direction. On mispredictions, however, the pipeline must be flushed,

PC Processor Microarchitecture

A Concise Review of the Techniques Used in Modern PC Processors

[Diagram: pipeline stages Fetch, Issue, Execute, Write; instruction rows labeled Instr1 to Instr4 and Instr1 to Instr9.]

Figure 1. Pipelines overlap the execution of instructions in time. Lengthening the pipeline increases the number of instructions executed in a given time period. Longer pipelines, however, suffer from a higher percentage of stalls (not shown).
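The overlap that Figure 1 depicts can be approximated as a text diagram. The sketch below is a minimal illustration, assuming a single-issue pipeline with no stalls in which instruction i enters the first stage in cycle i; the stage names come from the figure, while the function name and output layout are our own.

```python
STAGES = ["Fetch", "Issue", "Execute", "Write"]  # stage names as labeled in Figure 1


def pipeline_diagram(n_instructions: int, stages=STAGES) -> list:
    """One row per instruction, one column per cycle; '.' marks an idle slot."""
    total_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        cells = ["."] * total_cycles
        for s, name in enumerate(stages):
            cells[i + s] = name[0]  # instruction i occupies stage s during cycle i + s
        rows.append(f"Instr{i + 1} " + " ".join(cells))
    return rows


for row in pipeline_diagram(4):
    print(row)
```

Reading down any column shows which instructions are in flight during that cycle; once the pipeline is full, all four stages are busy at once.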

