©MICRODESIGN RESOURCES JULY 12, 1999 MICROPROCESSOR REPORT
by Keith Diefendorff
PC Processor Microarchitecture
A Concise Review of the Techniques Used in Modern PC Processors

Having commandeered nearly all the performance-enhancing techniques used by their mainframe and supercomputer predecessors, the microprocessors in today’s PCs employ a dizzying assemblage of microarchitectural features to achieve extraordinary levels of parallelism and speed. Enabled by astronomical transistor budgets, modern PC processors are superscalar, deeply pipelined, out of order, and they even execute instructions speculatively. In this article, we review the basic techniques used in these processors as well as the tricks they employ to circumvent the two most challenging performance obstacles: memory latency and branches.

Two Paths to Performance

The task normally assigned to chip architects is to design the highest-performance processor possible within a set of cost, power, and size constraints established by market requirements. Within these constraints, application performance is usually the best measure of success, although, sadly, the market often mistakes clock frequency for performance.

Two main avenues are open to designers trying to improve performance: making operations faster or executing more of them in parallel. Operations can be made faster in several ways. More advanced semiconductor processes make transistors switch faster and signals propagate faster. Using more transistors can reduce execution-unit latency (e.g., full vs. partial multiplier arrays). Aggressive design methods can minimize the levels of logic needed to implement a given function (e.g., custom vs. standard-cell design) or to increase circuit speed (e.g., dynamic vs. static circuits).

For parallelism, today’s PC processors rely on pipelining and superscalar techniques to exploit instruction-level parallelism (ILP). Pipelined processors overlap instructions in time on common execution resources. Superscalar processors overlap instructions in space on separate resources. Both techniques are used in combination.

Unfortunately, performance gains from parallelism often fail to meet expectations. Although a four-stage pipeline, for example, overlaps the execution of four instructions, as Figure 1 shows, it falls far short of a 4× performance boost. The problem is pipeline stalls. Stalls arise from data hazards (data dependencies), control hazards (changes in program flow), and structural hazards (hardware resource conflicts), all of which sap pipeline efficiency.

Lengthening the pipeline, or superpipelining, divides instruction execution into more stages, each with a shorter cycle time; it does not, in general, shorten the execution time of instructions. In fact, it may increase execution time because stages rarely divide evenly and the frequency is set by the longest stage. In addition, longer pipelines experience a higher percentage of stall cycles from hazards, thereby increasing the average cycles per instruction (CPI). Superscalar techniques suffer from similar inefficiencies.

The throughput gains from a longer pipeline, however, usually outweigh the CPI loss, so performance improves. But lengthening the pipeline has limits. As stages shrink, clock skew and latch overheads (setup and hold times) consume a larger fraction of the cycle, leaving less usable time for logic.

The challenge is to make the pipeline short enough for good efficiency but not so short that ILP and frequency are left lying on the table, i.e., an underpipelined condition. Today’s PC processors use pipelines of 5 to 12 stages. When making this decision, designers must keep in mind that frequency is often more important in the market than performance.
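The tradeoff described above can be made concrete with a toy model. The sketch below is an illustration only, not a model of any real processor: it assumes the logic delay divides evenly among stages, that every stage pays a fixed latch and skew overhead, and that each hazard stalls for a fixed fraction of the pipeline depth. All of the numeric defaults are invented for the example.

```python
def cycle_time_ns(stages, total_logic_ns=10.0, latch_overhead_ns=0.3):
    """Cycle time when the logic is split evenly and each stage pays a fixed overhead."""
    return total_logic_ns / stages + latch_overhead_ns


def cpi(stages, hazards_per_instr=0.1, stall_fraction=0.5):
    """Average cycles per instruction; stalls grow with pipeline depth (assumed linear)."""
    return 1.0 + hazards_per_instr * stall_fraction * (stages - 1)


def time_per_instr_ns(stages):
    """Average time per instruction = CPI x cycle time."""
    return cpi(stages) * cycle_time_ns(stages)


if __name__ == "__main__":
    base = time_per_instr_ns(1)
    for n in (1, 4, 10, 20, 40):
        t = time_per_instr_ns(n)
        print(f"{n:2d} stages: CPI={cpi(n):.2f}  "
              f"cycle={cycle_time_ns(n):.2f} ns  speedup={base / t:.2f}x")
```

Under these assumed parameters a four-stage pipeline yields well under 4×, deeper pipelines keep helping for a while, and very deep ones eventually lose ground as overhead and stalls dominate.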

Prophetic Hardware for Long Pipelines

Branch prediction and speculative execution are techniques used to reduce pipeline stalls on control hazards. In a pipelined processor, conditional branches are often encountered before the data that will determine branch direction is ready. Because instructions are fetched ahead of execution, correctly predicting unresolved branches allows the instruction fetcher to keep the instruction queue filled with instructions that have a high probability of being used.

Some processors take the next step, actually executing instructions speculatively past unresolved conditional branches. This technique avoids the control-hazard stall altogether when the branch goes in the predicted direction. On mispredictions, however, the pipeline must be flushed, instruction fetch redirected, and the pipeline refilled. Statistically, prediction and speculation dramatically reduce stalls. How dramatically depends on prediction accuracy.

Figure 1. Pipelines overlap the execution of instructions in time. Lengthening the pipeline increases the number of instructions executed in a given time period. Longer pipelines, however, suffer from a higher percentage of stalls (not shown).

Branch predictors range in sophistication from simple static predictors (compiler or heuristic driven), which achieve 65–85% accuracy, to complex dynamic predictors that can achieve 98% accuracy or more. Since one in five instructions is typically a conditional branch, high accuracy is essential, especially for machines with long pipelines and, therefore, with large mispredict penalties. As a result, most modern processors employ dynamic predictors.
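To see why accuracy matters so much, a rough estimate of the stall cycles added by mispredictions is sketched below. The one-in-five branch frequency and the accuracy figures come from the text; the 10-cycle mispredict penalty is an assumed value chosen only for illustration.

```python
def mispredict_cpi(branch_fraction, accuracy, penalty_cycles):
    """Extra cycles per instruction caused by branch mispredictions."""
    return branch_fraction * (1.0 - accuracy) * penalty_cycles


if __name__ == "__main__":
    BRANCH_FRACTION = 0.2   # "one in five instructions" (from the text)
    PENALTY = 10            # assumed mispredict penalty in cycles
    for label, acc in (("static, 65%", 0.65), ("static, 85%", 0.85), ("dynamic, 98%", 0.98)):
        extra = mispredict_cpi(BRANCH_FRACTION, acc, PENALTY)
        print(f"{label}: +{extra:.2f} CPI")
```

With these inputs the penalty term scales linearly with pipeline depth, which is why longer pipelines push designers toward the more accurate dynamic schemes.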

The Past Predicts the Future

The simplest dynamic predictor is the branch history table (BHT), a small cache indexed by the address of the branch being predicted. Simple BHTs record one-bit histories of the direction each branch took the last time it executed. More sophisticated BHTs use two-bit histories, which add hysteresis to improve prediction accuracy on loop branches. Even more sophisticated schemes use two-level predictors with longer per-branch histories that index into pattern tables containing two-bit predictors (see MPR 3/27/95, p. 17).

A simplified version of the two-level predictor uses a single global-history register of recent branch directions to index into the BHT. The GShare enhancement (see MPR 11/17/97, p. 22) adds per-branch sensitivity by hashing a few bits of the branch address with the global-history register, as Figure 2 shows. The agrees-mode enhancement encodes the prediction as agreement or disagreement with a static prediction, thereby avoiding excessive mispredictions when multiple active branches map to the same BHT entry. In architectures with no static-prediction opcode bits, such as the x86, the static prediction must be based on branch heuristics (e.g., backward: predict taken).

Some processors predict the target instruction stream as well as the direction. Target predictions are made with a branch target address cache (BTAC), which caches the address to which control was transferred the last time the branch was taken. BTACs are sometimes combined with the BHT into a branch target buffer (BTB). Instead of a BTAC, some processors use a branch target instruction cache (BTIC), which caches the first few instructions down the target path so the pipeline can be primed without an inline fetch cycle. Many processors also include a special-purpose return-address stack to predict the return addresses of subroutines.
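A minimal software sketch of a GShare-style predictor with two-bit saturating counters follows. It is a simplification for illustration: the table size, the XOR hash, the choice of address bits, and the initial counter value are all assumptions, and the agrees-mode encoding is omitted.

```python
class GSharePredictor:
    """Two-bit counters indexed by (branch address XOR global history)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                           # global branch history register
        self.counters = [1] * (1 << index_bits)    # 0..3; start weakly not-taken

    def _index(self, branch_addr):
        # Drop the low two address bits (assumed 4-byte alignment), then hash.
        return ((branch_addr >> 2) ^ self.history) & self.mask

    def predict(self, branch_addr):
        """Return True for predicted taken."""
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        """Train the counter with the actual outcome, then shift it into history."""
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def simulate(predictor, branch_addr, outcomes):
    """Return a list of booleans: was each prediction correct?"""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(branch_addr) == taken)
        predictor.update(branch_addr, taken)
    return results


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through, repeated.
    pattern = ([True] * 9 + [False]) * 100
    correct = simulate(GSharePredictor(), 0x4000, pattern)
    print(f"accuracy: {sum(correct) / len(correct):.3f}")
```

Because the history register here is as long as the loop period, each position in the loop maps to its own counter once the predictor has warmed up, so even the loop-exit branch is predicted correctly.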

Rearranging Instructions Boosts Throughput

Pipeline stalls arising from data and structural hazards can sometimes be avoided by judiciously rearranging instruction execution. Stalls on data hazards, for example, can be avoided by arranging instructions such that they do not depend on the results of preceding instructions that may still be in execution. The extent to which this is possible, without violating the program’s data-flow graph, establishes an upper limit on the ILP the processor can exploit.

Although compilers can statically reschedule instructions, they are hampered by incomplete knowledge of run-time information. Load-use penalties, for example, are resistant to static rescheduling because their length is generally unpredictable at compile time. It is simply impossible to find enough independent instructions to cover the worst-case number of load-delay slots in every load.

Static rescheduling is also constrained by register namespace and by ambiguous dependencies between memory instructions. A large register namespace is required for good register allocation, for freedom in rearranging instructions, and for loop unrolling. Register limitations are especially severe in x86 processors, which have only eight general-purpose registers. In-order processors, which issue, execute, complete, and retire instructions in strict program order, must rely entirely on static rescheduling and can suffer a large number of pipeline stalls.

Therefore, most current PC processors implement dynamic instruction rescheduling to some degree. The simplest out-of-order processors issue instructions in order but allow them to execute and complete out of order. Processors of this type use register scoreboarding to interlock the pipeline, stalling instruction issue when an instruction’s operands aren’t ready. Such processors can achieve somewhat more parallelism than in-order processors by permitting instructions to execute in parallel through execution units with different or variable latencies.

Even simple out-of-order processors require complex hardware to reorder results before the corresponding instructions are retired (removed from the machine). Although strict result ordering is not needed from a data-flow perspective, it is required to maintain precise exceptions (the appearance of in-order execution following an interrupt) and to recover from mispredicted speculative execution.

The most common reordering method is the reorder buffer (ROB), which buffers results until they can be written to the register file in program order. Accessing operands from the reorder buffer, which is needed for reasonable performance, requires an associative lookup to locate the most recent version of the operand.
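The behavior of a reorder buffer can be sketched in a few lines. The class below is a hypothetical software model, not a description of any particular processor: entries are allocated in program order, may complete in any order, and update the register file only when they reach the head of the buffer. The operand lookup searches from newest to oldest, standing in for the associative lookup mentioned above.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: in-order allocate, out-of-order complete, in-order retire."""

    def __init__(self):
        self.entries = deque()   # oldest instruction on the left
        self.regfile = {}        # architectural register file

    def issue(self, tag, dest_reg):
        """Allocate an entry in program order."""
        self.entries.append({"tag": tag, "dest": dest_reg, "value": None, "done": False})

    def complete(self, tag, value):
        """Record a result; may be called in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Write finished results to the register file, oldest first."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.regfile[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired

    def read_operand(self, reg):
        """Return (ready, value) for the most recent version of a register."""
        for entry in reversed(self.entries):      # newest in-flight producer wins
            if entry["dest"] == reg:
                return (entry["done"], entry["value"])
        return (reg in self.regfile, self.regfile.get(reg))
```

In this sketch, an instruction that finishes early simply waits in the buffer until every older instruction has finished, which is what preserves the appearance of in-order execution.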

[Figure 2 diagram labels: Global Branch History Register; Bits From Address of Branch Being Predicted; Hash; Index; Branch History Table; Branch Results; Static Prediction; Agree/Disagree; Predicted Direction]

Figure 2. The GShare algorithm with agrees-mode encoding is used by several PC processors to dynamically predict branches.

Most PC processors utilize separate instruction caches (I-caches) and data caches (D-caches). These caches are usually accessed and tagged with physical memory addresses. Processors translate logical addresses (program addresses) to physical addresses via lookup in a translation-lookaside buffer (TLB). A TLB caches recent translations of virtual-page numbers to physical-page numbers, as well as memory-protection attributes such as write protect. Usually there is an instruction TLB (I-TLB) and a data TLB (D-TLB); these are sometimes backed by a larger unified TLB. TLB misses typically invoke a microcode routine that searches for a translation in the OS-maintained virtual-memory mapping tables. If none is found, a page-fault interrupt transfers control to the OS to resolve the problem.
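The translation path just described can be sketched as follows. This is an illustrative model only: the 4-KB page size, the four-entry capacity, the LRU replacement, and the dictionary standing in for the OS page tables are all assumptions, and the `PageFault` exception stands in for the page-fault interrupt.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when no translation exists (models the page-fault interrupt)."""


class TLB:
    def __init__(self, page_table, entries=4):
        self.page_table = page_table      # vpn -> (ppn, write_protected)
        self.entries = entries
        self.cache = OrderedDict()        # recent translations, LRU order
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Translate a virtual address to a physical address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)           # mark as most recently used
        else:
            self.misses += 1
            if vpn not in self.page_table:        # table walk finds nothing
                raise PageFault(hex(vaddr))
            if len(self.cache) >= self.entries:
                self.cache.popitem(last=False)    # evict least recently used
            self.cache[vpn] = self.page_table[vpn]
        ppn, _write_protected = self.cache[vpn]
        return ppn * PAGE_SIZE + offset
```

A usage example under these assumptions: with `page_table = {0: (5, False)}`, `TLB(page_table).translate(0x10)` misses once, walks the table, and returns the address at offset 0x10 in physical page 5; a second access to the same page hits in the TLB.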

If You Want It Fast, Keep It Close

Caches suffer from three types of misses: compulsory misses, which occur on the initial access to a memory location; capacity misses, which occur when the cache can’t hold all the data being accessed; and conflict misses, which occur when multiple memory blocks map to the same cache line.

Ignoring the program’s memory-access patterns, which can have a large effect on cache behavior, the physical characteristics of the cache determine its miss ratio. Important characteristics include size, associativity, line length, replacement policy, write policy, and allocation policy.

Compulsory misses can be reduced by increasing the line size to take more advantage of spatial locality. Increasing the line size, however, reduces the number of blocks in the cache, thereby increasing capacity misses and conflict misses. Line lengths of 32 and 64 bytes provide a good balance for small caches and are the most commonly used sizes.

Ideally, stalls from compulsory misses could be avoided if the compiler could boost loads so memory access could get started ahead of the time the program will actually need the data. Unfortunately, a load cannot generally be boosted out of its basic block, because the block may not be executed and the load could fault. Furthermore, it is frequently impossible for a compiler to disambiguate memory addresses, forcing it to be overly conservative. As a result, loads cannot usually be boosted far enough to cover much memory latency.

An increasingly popular solution to the problem of compulsory misses is nonbinding prefetch instructions. These instructions are simply hints to the hardware, suggesting it should try to fetch a memory block into the cache. Because prefetch instructions don’t modify machine state (registers) and are nonfaulting, they can be placed arbitrarily far ahead of a load, allowing time for a compulsory miss to be serviced, so the load sees a cache hit.

It Was Here Just a Moment Ago

Capacity misses are mainly a function of cache size. Large caches have lower miss ratios than small caches; as a general rule of thumb, miss ratio improves proportionally to the square root of a cache-size increase: e.g., a 4× larger cache has roughly half the miss ratio. This rule suggests rapidly diminishing returns on cache size. Moreover, access time increases with cache size, thanks to physics. This fact sets up a tradeoff between a small, fast cache and a larger, slower cache.

To reduce thrashing, caches should be larger than the working set of the program. Because working sets are sometimes too large for caches small enough to have one- or two-cycle access times, many processors use a cache hierarchy. Two-level caches, for example, comprise a small, fast level-one cache (L1) backed by a larger, slower level-two cache (L2), as Figure 4 shows.

Early two-level caches consisted of on-chip L1s, with the external L2 connected to the system or frontside bus (FSB). FSBs are not ideal cache interfaces, however. Designed as shared multidrop buses for DRAM, I/O, and multiprocessor (MP) traffic, FSBs are usually slow. A 500-MHz processor requires an average 2.5 CPU cycles to synchronize each memory request to a slow 100-MHz bus, adding to L2 access time. This slow speed also throttles the burst transfers used to fill cache lines. To minimize these effects, processors burst data critical word first, and forward it immediately to the processor so the pipeline can be restarted posthaste.
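Both rules of thumb in this section reduce to one-line formulas. The sketch below encodes the square-root rule as stated, and reproduces the 2.5-cycle figure under the assumption that a request arrives, on average, halfway through a bus clock period; that interpretation is ours and is not spelled out in the text. The 8% baseline miss ratio is an invented example value.

```python
import math


def scaled_miss_ratio(base_miss_ratio, size_factor):
    """Square-root rule of thumb: miss ratio falls as 1/sqrt(size increase)."""
    return base_miss_ratio / math.sqrt(size_factor)


def avg_sync_cycles(cpu_mhz, bus_mhz):
    """Average CPU cycles spent waiting for the next bus clock edge
    (assumes requests arrive uniformly within a bus period)."""
    return (cpu_mhz / bus_mhz) / 2.0


if __name__ == "__main__":
    print(scaled_miss_ratio(0.08, 4))    # 4x larger cache
    print(scaled_miss_ratio(0.08, 16))   # 16x larger cache
    print(avg_sync_cycles(500, 100))     # 500-MHz CPU on a 100-MHz bus
```

The diminishing returns are visible directly: under this rule, each further halving of the miss ratio costs another 4× in cache size.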

Multilevel Caches Move on Chip

To speed L2 accesses, many processors have adopted dual-bus architectures, placing the L2 on a dedicated backside bus (BSB). Because a BSB connects exclusively to the cache, it can be optimized for SRAM transfers and can operate at the full CPU clock rate. Since SRAMs capable of operating at full CPU speeds are expensive, however, most PC processors operate the BSB at half the CPU clock rate. Still, the BSB makes a much faster L2 interface than an FSB. Some processors have taken the additional step of moving the L2-cache tags onto the processor die to speed hit/miss detection and to allow higher set-associativity.

With the advent of 0.25-micron processes, PC processor vendors began bringing the BSB and the L2 on chip. The alternative of increasing the size of the L1s is still favored by some designers, but the two-level approach will become more popular as on-chip cache size grows. The trend toward on-chip L2s will accelerate with 0.18-micron processes, and external L2s may disappear completely by the 0.13-micron generation.

[Figure 4 diagram labels: Processor Core (T_avg); L1 Cache (T_accL1), miss ratio Miss_L1; L2 Cache (T_accL2), miss ratio Miss_L2; DRAM Memory (T_accDRAM)]

$$T_{avg} = T_{accL1} + (Miss_{L1} \times T_{accL2}) + (Miss_{L1} \times Miss_{L2} \times T_{accDRAM})$$

Figure 4. Two-level cache hierarchies are designed to reduce the average memory access time (T_avg) seen by the processor.
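The Figure 4 formula is easy to evaluate directly. The function below is a straightforward transcription of it; the access times and miss ratios in the example are invented for illustration and do not describe any specific processor.

```python
def avg_access_time(t_l1, miss_l1, t_l2, miss_l2, t_dram):
    """T_avg = T_accL1 + Miss_L1 * T_accL2 + Miss_L1 * Miss_L2 * T_accDRAM."""
    return t_l1 + miss_l1 * t_l2 + miss_l1 * miss_l2 * t_dram


if __name__ == "__main__":
    # Assumed values: times in CPU cycles, miss ratios as fractions.
    with_l2 = avg_access_time(t_l1=2, miss_l1=0.05, t_l2=10, miss_l2=0.2, t_dram=100)
    # Without an L2, every L1 miss goes to DRAM (modeled as miss_l2 = 1, t_l2 = 0).
    without_l2 = avg_access_time(t_l1=2, miss_l1=0.05, t_l2=0, miss_l2=1.0, t_dram=100)
    print(with_l2, without_l2)
```

Setting `miss_l2` to 1 and `t_l2` to 0 is a convenient way to model a single-level cache with the same function, which makes the benefit of the second level easy to compare.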

Although on-chip L2s are typically smaller than external L2s, they can also be faster. On chip, the BSB can be very wide and operate at the full CPU clock rate. In addition, the L2 can have higher set-associativity, multiple banks, multiple ports, and other features that are impractical to build off chip with commodity SRAMs. These attributes can increase speed and hit ratios dramatically, offsetting the smaller size. On most PC applications, a full-speed 256K on-chip L2 outperforms a half-speed external 512K L2.

Associativity Avoids Conflicts

Conflict misses can be reduced by associativity. In a nonassociative or direct-mapped cache, each memory block maps to one, and only one, cache line. But because multiple blocks map to each cache line, accesses to different memory addresses can conflict. In a fully associative cache, on the other hand, any memory block can be stored in any cache line, eliminating conflicts. Fully associative caches, however, are expensive and slow, so they are usually approximated by n-way set-associative caches.

As a rule of thumb, a two-way set-associative cache has a miss rate similar to a direct-mapped cache twice the size. Miss-rate improvement, however, diminishes rapidly with increasing associativity. For all practical purposes, an eight-way set-associative cache is just as effective as a fully associative cache. A least-recently used (LRU) replacement algorithm is the one most often used to decide into which way a new line should be allocated (even though LRU is known to be suboptimal in many cases).

Cache performance is also affected by a cache’s write policy. The simplest policy is write through, wherein every store writes data to main memory, updating the cache only if it hits. This policy, however, leads to slow writes and to excessive write traffic to memory. As a result, most PC processors use write-back caches (sometimes called copy-back or store-in caches). Write-back caches write to the cache, not to memory, on a hit. Thus, a cache line can collect multiple store hits, writing to memory only once when the line is replaced. Most write-back caches use a write-allocate policy, which allocates a new line in the cache on a write miss.
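A small simulation makes the conflict-miss effect visible. The model below is a sketch under simple assumptions: it tracks only tags (no data, no write policy), uses LRU replacement within each set, and treats `ways=1` as a direct-mapped cache. The sizes in the usage note are arbitrary.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only cache model with LRU replacement inside each set."""

    def __init__(self, num_lines, ways, line_bytes=32):
        assert num_lines % ways == 0
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = num_lines // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Access one address; return True on a hit."""
        block = addr // self.line_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)     # evict least recently used way
        lines[tag] = True
        self.misses += 1
        return False
```

For example, alternating between addresses 0 and 256 in an eight-line cache with 32-byte lines misses on every access when the cache is direct mapped, because both blocks map to the same line; with two ways, only the first access to each block misses.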

Who Has What?

In systems with caches, there is the nasty problem of cache coherency. If, for example, a processor has a memory block in its cache, and an I/O device writes to an address in that block, then the data in the processor’s cache becomes stale. If the processor has modified the cache line, the data written by the I/O device will be overwritten and lost permanently when the cache line is eventually written back to memory. Avoiding these situations with software is difficult and error prone.

With multiple processors, the problem becomes even more complicated, and software solutions become intractable. Although not a problem for PCs today, multiprocessors will one day become attractive. Thus, to simplify I/O software and enable multiprocessing in the future, PC processors all enforce cache coherence via hardware.

The most popular scheme is the four-state coherence protocol called MESI (modified, exclusive, shared, invalid). In this scheme, MESI status bits are maintained with each cache line. The first time a processor writes to a shared line in its cache, it broadcasts a write-invalidate coherence transaction to other devices. Any device with a copy of the line in its cache invalidates its copy, which, if modified, requires writing the line to memory before allowing the processor to take ownership of it. As an optimization, some processors allow lines to be allocated into a fifth, owned state (MOESI) to improve the efficiency of accessing shared data in symmetric multiprocessor (SMP) systems.

In this coherence scheme, every processor or device with a cache snoops (watches) every memory transaction issued by every device in the coherence domain. If a snooped transaction hits on a line that is held exclusively, its status is changed to shared. If a snooped transaction hits a modified line, the offending transaction is held off until the dirty line can be written to memory. Alternatively, the processor with the hit can intervene to produce the modified data, allowing other processors and memory to snarf it off the bus.
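The state changes described above can be written as a transition table. The sketch below follows the common textbook form of MESI for a single cache line; the event names, the bus-action labels, and the `others_have_copy` flag are hypothetical names introduced for illustration, and real protocols differ in detail (for example, in how intervention is handled).

```python
# (current state, event) -> (next state, bus action or None)
MESI_TABLE = {
    ("M", "local_read"):  ("M", None),
    ("M", "local_write"): ("M", None),
    ("M", "snoop_read"):  ("S", "write_back"),   # dirty line written out first
    ("M", "snoop_write"): ("I", "write_back"),
    ("E", "local_read"):  ("E", None),
    ("E", "local_write"): ("M", None),           # silent upgrade, no broadcast
    ("E", "snoop_read"):  ("S", None),
    ("E", "snoop_write"): ("I", None),
    ("S", "local_read"):  ("S", None),
    ("S", "local_write"): ("M", "invalidate"),   # write-invalidate broadcast
    ("S", "snoop_read"):  ("S", None),
    ("S", "snoop_write"): ("I", None),
    ("I", "local_write"): ("M", "read_for_ownership"),
    ("I", "snoop_read"):  ("I", None),
    ("I", "snoop_write"): ("I", None),
}


def mesi_next(state, event, others_have_copy=False):
    """Return (next_state, bus_action) for one cache line."""
    if (state, event) == ("I", "local_read"):
        # A read miss loads the line Shared if another cache holds it, else Exclusive.
        return ("S" if others_have_copy else "E", "bus_read")
    return MESI_TABLE[(state, event)]
```

Keeping the protocol as data rather than nested conditionals makes it easy to check each transition against the prose, one row at a time.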

Memory Instructions Are Different

Loads and stores have two distinct phases: address generation and memory access. Through the address-generation phase, processors treat loads and stores just like other instructions. To deal with the unique characteristics of memory, however, processors generally decouple the memory-access phase by issuing memory requests along with their resolved addresses to a queue or a buffer in the memory unit, as Figure 5 shows.

One important function performed by the memory unit is load/store reordering. Store instructions are often issued before their store data is ready. If all memory transactions were forced to access memory in program order, subsequent loads would be blocked, unnecessarily stalling the pipeline. To alleviate the blockage, some memory units provide store reservation stations, where stores can wait on data while subsequent loads access memory. To ensure correct program operation, the memory unit must perform

[Figure 5 diagram labels: Source Operand Buses; Address Generation (Load/Store Unit); Load or Store Requests; Load/Store Request Queue (in Memory Unit); Store Reservation Stations; Store Data; Store Data Forwarding; DTLB; Cache Tags; Nonblocking Data Cache (Write Port, Read Port); Load Data; Result Buses; To Memory]

Figure 5. Some processors allow loads to bypass stores that are waiting on data. With a nonblocking cache, load hits can access the cache while a previous miss is waiting on data from memory.

floating-point multiply-add. This FMADD feature chains a multiply and an add together with the latency of a multiply, improving the performance of inner products and other important numerical algorithms. Fusing the multiply-add with a single rounding enables even greater speed.

A relatively new feature in ISAs is single-instruction, multiple-data processing (SIMD). Its popularity is being driven by the increasing demand for digital-signal processing (DSP) and multimedia processing in PC applications. DSP and multimedia algorithms are rich in data-level parallelism. To take advantage of this parallelism, SIMD instructions operate on packed fixed-length vector operands, as Figure 6 shows, rather than on the single-element operands of conventional scalar instructions. Although instruction-level parallel techniques can also exploit data-level parallelism, SIMD units exploit it to a higher degree and with less complexity.

Most SIMD ISA extensions support special features for DSP algorithms that are not found in traditional general-purpose processor ISAs. Saturation arithmetic, for example, clamps overflowed or underflowed elements at their maximum or minimum values, an important feature for processing packed short integers with limited dynamic range.
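The difference between saturating and conventional wraparound arithmetic is easiest to see on concrete values. The sketch below models a packed add on signed 16-bit elements using plain Python lists; it illustrates the arithmetic only and does not correspond to any particular SIMD instruction set.

```python
I16_MIN, I16_MAX = -32768, 32767


def saturating_add_i16(xs, ys):
    """Element-wise add, clamping each result to the signed 16-bit range."""
    return [max(I16_MIN, min(I16_MAX, x + y)) for x, y in zip(xs, ys)]


def wrapping_add_i16(xs, ys):
    """Element-wise add with two's-complement wraparound, for comparison."""
    return [((x + y + 32768) % 65536) - 32768 for x, y in zip(xs, ys)]


if __name__ == "__main__":
    a = [30000, -30000, 100, 0]
    b = [10000, -10000, 200, 0]
    print(saturating_add_i16(a, b))
    print(wrapping_add_i16(a, b))
```

With wraparound, a sum that overflows flips sign, which in audio or image data shows up as a loud click or a bright pixel turning dark; clamping keeps the result at the nearest representable value instead.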

Process Technology: the Silver Bullet

Unfortunately, all of the techniques reviewed so far come at a cost. Mostly they increase complexity, which has negative effects on frequency, die size, and power. Complexity increases nonlinearly with pipeline length and with issue width, and it is exacerbated by the cascade of tricks required to minimize stalls. Complexity adds logic gates, slowing cycle time or adding pipeline stages. It adds transistors, lengthening signal paths and increasing die size and cost. It also increases design time, which, since processor performance progresses at about 60% per year, costs 4% in relative performance every month, a nontrivial amount.

IC process is different. Everything improves: performance, frequency, die size, and power. No known microarchitectural technique comes close to the massive improvements made by a single process generation.

One process generation, defined as a 30% linear shrink of feature sizes, halves die size (or doubles the transistor budget for the same die size). A smaller die size dramatically lowers manufacturing cost because it both increases the gross die per wafer and improves yield (see MPR 8/2/93, p. 12). Yield improves exponentially with decreasing die size because statistically fewer die are lost to defects.
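The arithmetic behind these figures can be checked in a few lines. The first two functions follow directly from the numbers in the text; the yield function uses the simple Poisson defect model, which is one common approximation and is an assumption here, as are the defect density and die areas in the example.

```python
import math


def monthly_growth(annual_growth):
    """Compound monthly rate equivalent to a given annual rate."""
    return (1.0 + annual_growth) ** (1.0 / 12.0) - 1.0


def area_scale(linear_shrink):
    """Relative die area after a linear shrink of feature sizes."""
    return (1.0 - linear_shrink) ** 2


def poisson_yield(defects_per_cm2, die_area_cm2):
    """Fraction of defect-free die under a Poisson defect model (assumed)."""
    return math.exp(-defects_per_cm2 * die_area_cm2)


if __name__ == "__main__":
    print(f"60%/year is about {monthly_growth(0.60):.1%} per month")
    print(f"30% linear shrink leaves {area_scale(0.30):.0%} of the area")
    full = poisson_yield(0.5, 2.0)                        # assumed 2-cm^2 die
    shrunk = poisson_yield(0.5, 2.0 * area_scale(0.30))   # same design, one generation later
    print(f"yield {full:.0%} -> {shrunk:.0%}")
```

Because yield appears in an exponent, halving the die area does more than halve the number of defective die, which is the sense in which yield improves exponentially.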

Defect density is usually somewhat higher at the introduction of a new process, but it improves quickly with experience and volume. One process generation also improves the intrinsic speed (CV/I) of transistors by 30–50%. As if these gains weren’t enough, each generation is typically accompanied by about a 25% reduction in voltage, which, since power is a quadratic function of voltage (P = CV²f), cuts power consumption in half.

The only critical parameter that doesn’t naturally improve with process shrinks is interconnect delay (RC delay). Manufacturers are combating this recalcitrant term by lowering capacitance (C) with low dielectric-constant (low-k) insulators, and by lowering resistance (R) with thicker metal layers or by moving from aluminum to copper metallization (see MPR 8/4/97, p. 14).

Currently, most PC microprocessors are built on a 0.22- to 0.25-micron process with ≈0.18-micron gate lengths (Lgate), five layers of aluminum interconnect, and operating voltages from 1.8 to 2.5 V. Logic densities are roughly 60,000 transistors/mm², and SRAM cells are about 10 μm². Next-generation 0.18-micron processes, which will begin volume production during 2H99 at most companies, will have an Lgate of ≈0.14 microns, six layers of aluminum or copper interconnect, operating voltages of 1.5 V or less, logic densities of 120,000 transistors/mm², and SRAM cells smaller than 5 μm² (see MPR 9/14/98, p. 1; MPR 1/25/99, p. 22).

The package is also a factor in processor performance, cost, and size (see MPR 9/13/93, p. 12). Chips are either wire bonded, which is the cheaper method, or flip-chip mounted with solder bumps, which is the electrically superior method, onto a package substrate. Plastic substrates are the least expensive, but cannot handle as much power as ceramic substrates. Pin-grid-array (PGA) packages are used where socketability is required, but surface-mount ball-grid-arrays (BGAs) are smaller (important for notebooks), cheaper, and electrically superior. For all package types, package costs and test costs are a function of the number of pins (or balls).

Organic BGA packages appear to be the way of the future. They offer a low-k dielectric substrate and a copper lead frame for superior electrical characteristics. They also have low cost and low thermal resistance, since a heat sink can be directly attached to the silicon die.

Disagreement Over Microarchitecture Abounds

Aside from a state-of-the-art semiconductor process, which is the minimum ante to play in the PC processor business, there is considerable disagreement over which is the best collection of microarchitectural features for a PC processor. Some designers, for example, prefer complex wide-issue out-of-order microarchitectures; others believe that simple, fast, in-order pipelines with large caches are better. This and other differences of opinion are evident in current and upcoming PC processors.

In the next installment of this article, we look at how specific PC processors use the techniques reviewed in this article.

[Figure 6 diagram labels: source registers RS holding X3 X2 X1 X0 and Y3 Y2 Y1 Y0; four ƒ units; destination register RD holding ƒ(X3,Y3) ƒ(X2,Y2) ƒ(X1,Y1) ƒ(X0,Y0)]

Figure 6. SIMD instructions perform the same operation on all the elements of short vectors stored in registers.