



This article reviews the advanced features of modern PC processors, including pipelining, superscalar execution, out-of-order processing, and speculative execution. It discusses how these techniques increase instruction throughput and reduce average cycles per instruction (CPI), as well as the challenges and complexities they introduce. It also touches on instruction and data caches, memory access, and the impact of the instruction-set architecture on processor performance.




by Keith Diefendorff
Having commandeered nearly all the performance-enhancing techniques used by their mainframe and supercomputer predecessors, the microprocessors in today’s PCs employ a dizzying assemblage of microarchitectural features to achieve extraordinary levels of parallelism and speed. Enabled by astronomical transistor budgets, modern PC processors are superscalar, deeply pipelined, out of order, and they even execute instructions speculatively. In this article, we review the basic techniques used in these processors as well as the tricks they employ to circumvent the two most challenging performance obstacles: memory latency and branches.
The task normally assigned to chip architects is to design the highest-performance processor possible within a set of cost, power, and size constraints established by market requirements. Within these constraints, application performance is usually the best measure of success, although, sadly, the market often mistakes clock frequency for performance.
Two main avenues are open to designers trying to improve performance: making operations faster or executing more of them in parallel. Operations can be made faster in several ways. More advanced semiconductor processes make transistors switch faster and signals propagate faster. Using more transistors can reduce execution-unit latency (e.g., full vs. partial multiplier arrays). Aggressive design methods can minimize the levels of logic needed to implement a given function (e.g., custom vs. standard-cell design) or increase circuit speed (e.g., dynamic vs. static circuits).
For parallelism, today’s PC processors rely on pipelining and superscalar techniques to exploit instruction-level parallelism (ILP). Pipelined processors overlap instructions in time on common execution resources. Superscalar processors overlap instructions in space on separate resources. Both techniques are used in combination.
Unfortunately, performance gains from parallelism often fail to meet expectations. Although a four-stage pipeline, for example, overlaps the execution of four instructions, as Figure 1 shows, it falls far short of a 4× performance boost. The problem is pipeline stalls. Stalls arise from data hazards (data dependencies), control hazards (changes in program flow), and structural hazards (hardware resource conflicts), all of which sap pipeline efficiency.
Lengthening the pipeline, or superpipelining, divides instruction execution into more stages, each with a shorter cycle time; it does not, in general, shorten the execution time of instructions. In fact, it may increase execution time because stages rarely divide evenly and the frequency is set by the longest stage. In addition, longer pipelines experience a higher percentage of stall cycles from hazards, thereby increasing the average cycles per instruction (CPI). Superscalar techniques suffer from similar inefficiencies.
The throughput gains from a longer pipeline, however, usually outweigh the CPI loss, so performance improves. But lengthening the pipeline has limits. As stages shrink, clock skew and latch overheads (setup and hold times) consume a larger fraction of the cycle, leaving less usable time for logic.
The challenge is to make the pipeline short enough for good efficiency but not so short that ILP and frequency are left lying on the table, i.e., an underpipelined condition. Today’s PC processors use pipelines of 5 to 12 stages. When making this decision, designers must keep in mind that frequency is often more important in the market than performance.
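The tradeoff between pipeline depth, stalls, and latch overhead can be put in rough numbers. The following is a minimal sketch under simplifying assumptions, not a model of any real processor: stages are perfectly balanced, the ideal CPI is 1, all hazards are lumped into one average stall figure, and the function name, parameters, and sample values are invented for illustration.

```python
def pipeline_speedup(stages, stall_cycles_per_instr, latch_overhead=0.0):
    """Speedup of a pipelined machine over an unpipelined one.

    stages: number of pipeline stages (assumed perfectly balanced).
    stall_cycles_per_instr: average stall cycles added to the ideal CPI of 1.
    latch_overhead: clock skew plus latch setup/hold time per stage, as a
        fraction of the unpipelined instruction time (hypothetical parameter).
    """
    unpipelined_time = 1.0                      # normalized time per instruction
    cycle_time = unpipelined_time / stages + latch_overhead
    cpi = 1.0 + stall_cycles_per_instr          # stalls raise the average CPI
    return unpipelined_time / (cycle_time * cpi)


if __name__ == "__main__":
    print(pipeline_speedup(4, 0.0))        # ideal four-stage pipeline
    print(pipeline_speedup(4, 0.5))        # same pipeline with stalls
    print(pipeline_speedup(8, 1.0, 0.02))  # deeper pipe, more stalls, latch overhead
```

With these made-up inputs, the four-stage pipeline delivers its full 4× only when there are no stalls, and the eight-stage pipeline still comes out ahead of the four-stage one despite a higher CPI, which is the pattern the text describes.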
Figure 1. Pipelines overlap the execution of instructions in time. Lengthening the pipeline increases the number of instructions executed in a given time period. Longer pipelines, however, suffer from a higher percentage of stalls (not shown). (Diagram: successive instructions overlapped across the Fetch, Issue, Execute, and Write stages.)
Branch prediction and speculative execution are techniques used to reduce pipeline stalls on control hazards. In a pipelined processor, conditional branches are often encountered before the data that will determine branch direction is ready. Because instructions are fetched ahead of execution, correctly predicting unresolved branches allows the instruction fetcher to keep the instruction queue filled with instructions that have a high probability of being used.
Some processors take the next step, actually executing instructions speculatively past unresolved conditional branches. This technique avoids the control-hazard stall altogether when the branch goes in the predicted direction. On mispredictions, however, the pipeline must be flushed, instruction fetch redirected, and the pipeline refilled. Statistically, prediction and speculation dramatically reduce stalls. How dramatically depends on prediction accuracy.
Branch predictors range in sophistication from simple static predictors (compiler or heuristic driven), which achieve 65–85% accuracy, to complex dynamic predictors that can achieve 98% accuracy or more. Since one in five instructions is typically a conditional branch, high accuracy is essential, especially for machines with long pipelines and, therefore, with large mispredict penalties. As a result, most modern processors employ dynamic predictors.
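To see why prediction accuracy matters so much, the figures above can be plugged into a simple CPI estimate. This is an illustrative sketch only: the base CPI of 1 and the 10-cycle mispredict penalty are assumptions chosen for the example, and the function name is hypothetical.

```python
def cpi_with_mispredicts(base_cpi, branch_fraction, accuracy, mispredict_penalty):
    """Average CPI once branch-mispredict stalls are added.

    branch_fraction: fraction of instructions that are conditional branches.
    accuracy: fraction of those branches predicted correctly.
    mispredict_penalty: cycles lost to flush and refill per mispredict (assumed).
    """
    mispredicts_per_instr = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredicts_per_instr * mispredict_penalty


if __name__ == "__main__":
    # One branch in five instructions, as in the text; penalty is an assumption.
    for accuracy in (0.65, 0.85, 0.98):
        print(accuracy, cpi_with_mispredicts(1.0, 0.2, accuracy, 10))
```

Under these assumptions, moving from an 85%-accurate static predictor to a 98%-accurate dynamic one cuts the mispredict contribution to CPI from about 0.3 to about 0.04, and the gap widens as the penalty grows with pipeline length.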
The simplest dynamic predictor is the branch history table (BHT), a small cache indexed by the address of the branch being predicted. Simple BHTs record one-bit histories of the direction each branch took the last time it executed. More sophisticated BHTs use two-bit histories, which add hysteresis to improve prediction accuracy on loop branches. Even more sophisticated schemes use two-level predictors with longer per-branch histories that index into pattern tables containing two-bit predictors (see MPR 3/27/95, p. 17).
A simplified version of the two-level predictor uses a single global-history register of recent branch directions to index into the BHT. The GShare enhancement (see MPR 11/17/97, p. 22) adds per-branch sensitivity by hashing a few bits of the branch address with the global-history register, as Figure 2 shows. The agrees-mode enhancement encodes the prediction as agreement or disagreement with a static prediction, thereby avoiding excessive mispredictions when multiple active branches map to the same BHT entry. In architectures with no static-prediction opcode bits, such as the x86, the static prediction must be based on branch heuristics (e.g., backward: predict taken).
Some processors predict the target instruction stream as well as the direction. Target predictions are made with a branch target address cache (BTAC), which caches the address to which control was transferred the last time the branch was taken. BTACs are sometimes combined with the BHT into a branch target buffer (BTB). Instead of a BTAC, some processors use a branch target instruction cache (BTIC), which caches the first few instructions down the target path so the pipeline can be primed without an inline fetch cycle. Many processors also include a special-purpose return-address stack to predict the return addresses of subroutines.
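The GShare idea can be sketched in a few lines. The code below is a simplified illustration, not the design of any particular processor: the XOR hash, the table size, and the initial counter value are assumptions, agrees-mode encoding is omitted, and the class and method names are hypothetical.

```python
class GSharePredictor:
    """Global-history predictor with a table of two-bit saturating counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                           # global branch-history register
        self.counters = [1] * (1 << index_bits)    # 0..3; start at weakly not taken

    def _index(self, branch_addr):
        # Hash address bits with the global history (XOR is an assumed hash).
        return (branch_addr ^ self.history) & self.mask

    def predict(self, branch_addr):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    predictor = GSharePredictor(index_bits=4)
    outcomes = [i % 2 == 0 for i in range(60)]     # taken, not taken, taken, ...
    mispredicts = 0
    for n, taken in enumerate(outcomes):
        if n >= 40 and predictor.predict(0x40) != taken:
            mispredicts += 1
        predictor.update(0x40, taken)
    print("mispredicts after warm-up:", mispredicts)
```

The alternating pattern in the usage example would defeat a plain per-branch two-bit counter, but because the history register distinguishes the two phases of the pattern, each phase trains its own counter.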
Pipeline stalls arising from data and structural hazards can sometimes be avoided by judiciously rearranging instruction execution. Stalls on data hazards, for example, can be avoided by arranging instructions such that they do not depend on the results of preceding instructions that may still be in execution. The extent to which this is possible, without violating the program’s data-flow graph, establishes an upper limit on the ILP the processor can exploit.
Although compilers can statically reschedule instructions, they are hampered by incomplete knowledge of run-time information. Load-use penalties, for example, are resistant to static rescheduling because their length is generally unpredictable at compile time. It is simply impossible to find enough independent instructions to cover the worst-case number of load-delay slots in every load.
Static rescheduling is also constrained by register namespace and by ambiguous dependencies between memory instructions. A large register namespace is required for good register allocation, for freedom in rearranging instructions, and for loop unrolling. Register limitations are especially severe in x86 processors, which have only eight general-purpose registers. In-order processors, which issue, execute, complete, and retire instructions in strict program order, must rely entirely on static rescheduling and can suffer a large number of pipeline stalls.
Therefore, most current PC processors implement dynamic instruction rescheduling to some degree. The simplest out-of-order processors issue instructions in order but allow them to execute and complete out of order. Processors of this type use register scoreboarding to interlock the pipeline, stalling instruction issue when an instruction’s operands aren’t ready. Such processors can achieve somewhat more parallelism than in-order processors by permitting instructions to execute in parallel through execution units with different or variable latencies.
Even simple out-of-order processors require complex hardware to reorder results before the corresponding instructions are retired (removed from the machine). Although strict result ordering is not needed from a data-flow perspective, it is required to maintain precise exceptions (the appearance of in-order execution following an interrupt) and to recover from mispredicted speculative execution.
The most common reordering method is the reorder buffer (ROB), which buffers results until they can be written to the register file in program order. Accessing operands from the reorder buffer, which is needed for reasonable performance, requires an associative lookup to locate the most recent version of the operand.
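The behavior of a reorder buffer can be illustrated with a toy model. This is a sketch under heavy simplification (no exceptions, no speculation recovery, one destination register per instruction), and all class, method, and field names are invented for the example.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results complete in any order but retire in program order."""

    def __init__(self):
        self.entries = deque()   # oldest instruction sits at the left
        self.registers = {}      # architectural register file

    def issue(self, dest_reg):
        # Allocate an entry at issue time, in program order.
        entry = {"dest": dest_reg, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # Execution units may call this in any order.
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        # Write results to the register file only from the head of the buffer.
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired

    def read_operand(self, reg):
        # Associative search, newest first, for the most recent version of reg.
        for entry in reversed(self.entries):
            if entry["dest"] == reg:
                return (entry["done"], entry["value"])
        return (True, self.registers.get(reg))


if __name__ == "__main__":
    rob = ReorderBuffer()
    first = rob.issue("r1")
    second = rob.issue("r2")
    rob.complete(second, 7)          # younger instruction finishes first
    print(rob.retire())              # nothing retires: r1 is still pending
    print(rob.read_operand("r2"))    # but r2's value can already be forwarded
    rob.complete(first, 3)
    print(rob.retire())              # both retire, in program order
```

Reading operands straight from the buffer, as `read_operand` does, is what lets dependent instructions proceed before the producer has retired.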
Figure 2. The GShare algorithm with agrees-mode encoding is used by several PC processors to dynamically predict branches. (Diagram: branch results shift into the global branch history register; its contents are hashed with bits from the address of the branch being predicted to form an index into the branch history table; the selected entry is combined with the static prediction as agree/disagree to give the predicted direction.)
Most PC processors utilize separate instruction caches (I-caches) and data caches (D-caches). These caches are usually accessed and tagged with physical memory addresses. Processors translate logical addresses (program addresses) to physical addresses via lookup in a translation-lookaside buffer (TLB). A TLB caches recent translations of virtual-page numbers to physical-page numbers, as well as memory-protection attributes such as write protect. Usually there is an instruction TLB (I-TLB) and a data TLB (D-TLB); these are sometimes backed by a larger unified TLB. TLB misses typically invoke a microcode routine that searches for a translation in the OS-maintained virtual-memory mapping tables. If none is found, a page-fault interrupt transfers control to the OS to resolve the problem.
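The translation path just described can be sketched as follows. This is only an illustration: the 4-KB page size, the dictionary standing in for the OS mapping tables, the FIFO replacement, and all names are assumptions, and protection attributes are left out.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when no translation exists; the OS would take over here."""


class TLB:
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # stand-in for OS tables: virtual page -> physical page
        self.capacity = capacity
        self.entries = {}              # cached translations, in insertion order
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            self.misses += 1
            # On a miss, walk the mapping tables.
            if vpn not in self.page_table:
                raise PageFault(hex(vaddr))
            if len(self.entries) >= self.capacity:
                oldest = next(iter(self.entries))   # FIFO eviction (assumed policy)
                del self.entries[oldest]
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


if __name__ == "__main__":
    tlb = TLB({0: 5, 1: 9})
    print(hex(tlb.translate(0x0010)), tlb.misses)   # first touch of page 0 misses
    print(hex(tlb.translate(0x0020)), tlb.misses)   # same page now hits
```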
Caches suffer from three types of misses: compulsory misses, which occur on the initial access to a memory location; capacity misses, which occur when the cache can’t hold all the data being accessed; and conflict misses, which occur when multiple memory blocks map to the same cache line.
Ignoring the program’s memory-access patterns (which can have a large effect on cache behavior), the physical characteristics of the cache determine its miss ratio. Important characteristics include size, associativity, line length, replacement policy, write policy, and allocation policy.
Compulsory misses can be reduced by increasing the line size to take more advantage of spatial locality. Increasing the line size, however, reduces the number of blocks in the cache, thereby increasing capacity misses and conflict misses. Line lengths of 32 and 64 bytes provide a good balance for small caches and are the most commonly used sizes.
Ideally, stalls from compulsory misses could be avoided if the compiler could boost loads so memory access could get started ahead of the time the program will actually need the data. Unfortunately, a load cannot generally be boosted out of its basic block, because the block may not be executed and the load could fault. Furthermore, it is frequently impossible for a compiler to disambiguate memory addresses, forcing it to be overly conservative. As a result, loads cannot usually be boosted far enough to cover much memory latency.
An increasingly popular solution to the problem of compulsory misses is nonbinding prefetch instructions. These instructions are simply hints to the hardware, suggesting it should try to fetch a memory block into the cache. Because prefetch instructions don’t modify machine state (registers) and are nonfaulting, they can be placed arbitrarily far ahead of a load, allowing time for a compulsory miss to be serviced, so the load sees a cache hit.
Capacity misses are mainly a function of cache size. Large caches have lower miss ratios than small caches; as a general rule of thumb, miss ratio improves proportionally to the square root of a cache-size increase: e.g., a 4× larger cache has roughly half the miss ratio. This rule suggests rapidly diminishing returns on cache size. Moreover, access time increases with cache size, thanks to physics. This fact sets up a tradeoff between a small, fast cache and a larger, slower cache.
To reduce thrashing, caches should be larger than the working set of the program. Because working sets are sometimes too large for caches small enough to have one- or two-cycle access times, many processors use a cache hierarchy. Two-level caches, for example, comprise a small, fast level-one cache (L1) backed by a larger, slower level-two cache (L2), as Figure 4 shows.
Early two-level caches consisted of on-chip L1s, with the external L2 connected to the system or frontside bus (FSB). FSBs are not ideal cache interfaces, however. Designed as shared multidrop buses for DRAM, I/O, and multiprocessor (MP) traffic, FSBs are usually slow. A 500-MHz processor requires an average 2.5 CPU cycles to synchronize each memory request to a slow 100-MHz bus, adding to L2 access time. This slow speed also throttles the burst transfers used to fill cache lines. To minimize these effects, processors burst data critical word first, and forward it immediately to the processor so the pipeline can be restarted posthaste.
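Two of the figures quoted above can be reproduced with a little arithmetic. The sketch below is illustrative: the 8% starting miss ratio is a made-up value, and reading the 2.5-cycle synchronization delay as half of a bus clock period (a request arriving at a random point waits, on average, half a period for the next bus edge) is an interpretation of the text, not something it states.

```python
import math


def scaled_miss_ratio(miss_ratio, size_factor):
    """Square-root rule of thumb: miss ratio falls as 1/sqrt(size increase)."""
    return miss_ratio / math.sqrt(size_factor)


def average_sync_delay(cpu_mhz, bus_mhz):
    """Average CPU cycles spent waiting for the next bus clock edge.

    Assumes a request arrives at a uniformly random point in the bus period.
    """
    cpu_cycles_per_bus_cycle = cpu_mhz / bus_mhz
    return cpu_cycles_per_bus_cycle / 2.0


if __name__ == "__main__":
    print(scaled_miss_ratio(0.08, 4))     # a 4x larger cache: 0.04
    print(scaled_miss_ratio(0.08, 16))    # a 16x larger cache: 0.02
    print(average_sync_delay(500, 100))   # 2.5 CPU cycles
```

The second line of output shows the diminishing returns the rule implies: sixteen times the cache buys only a fourfold reduction in misses.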
To speed L2 accesses, many processors have adopted dual-bus architectures, placing the L2 on a dedicated backside bus (BSB). Because a BSB connects exclusively to the cache, it can be optimized for SRAM transfers and can operate at the full CPU clock rate. Since SRAMs capable of operating at full CPU speeds are expensive, however, most PC processors operate the BSB at half the CPU clock rate. Still, the BSB makes a much faster L2 interface than an FSB. Some processors have taken the additional step of moving the L2-cache tags onto the processor die to speed hit/miss detection and to allow higher set-associativity.
With the advent of 0.25-micron processes, PC processor vendors began bringing the BSB and the L2 on chip. The alternative of increasing the size of the L1s is still favored by some designers, but the two-level approach will become more popular as on-chip cache size grows. The trend toward on-chip L2s will accelerate with 0.18-micron processes, and external L2s may disappear completely by the 0.13-micron generation.
Figure 4. Two-level cache hierarchies are designed to reduce the average memory access time (Tavg) seen by the processor. (Diagram: the processor core is backed in turn by the L1 cache, the L2 cache, and DRAM memory, with access times TaccL1, TaccL2, and TaccDRAM and miss ratios MissL1 and MissL2.)
$T_{avg} = T_{accL1} + (Miss_{L1} \times T_{accL2}) + (Miss_{L1} \times Miss_{L2} \times T_{accDRAM})$
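The Figure 4 formula is easy to evaluate directly. In the sketch below the function name is hypothetical and the latencies and miss ratios are invented round numbers, chosen only to show how the terms combine.

```python
def average_access_time(t_l1, t_l2, t_dram, miss_l1, miss_l2):
    """Tavg = TaccL1 + MissL1 * TaccL2 + MissL1 * MissL2 * TaccDRAM."""
    return t_l1 + miss_l1 * t_l2 + miss_l1 * miss_l2 * t_dram


if __name__ == "__main__":
    # Assumed values: 1-cycle L1, 10-cycle L2, 100-cycle DRAM,
    # 5% of accesses miss the L1, and 20% of those also miss the L2.
    with_l2 = average_access_time(1, 10, 100, 0.05, 0.20)
    # Without an L2, every L1 miss goes to DRAM (modeled as a zero-latency L2 that always misses).
    without_l2 = average_access_time(1, 0, 100, 0.05, 1.0)
    print(with_l2, without_l2)
```

With these assumed numbers the L2 cuts the average access time from about 6 cycles to about 2.5, even though it adds its own latency to every L1 miss.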
Although on-chip L2s are typically smaller than external L2s, they can also be faster. On chip, the BSB can be very wide and operate at the full CPU clock rate. In addition, the L2 can have higher set-associativity, multiple banks, multiple ports, and other features that are impractical to build off chip with commodity SRAMs. These attributes can increase speed and hit ratios dramatically, offsetting the smaller size. On most PC applications, a full-speed 256K on-chip L2 outperforms a half-speed external 512K L2.
Conflict misses can be reduced by associativity. In a nonassociative or direct-mapped cache, each memory block maps to one, and only one, cache line. But because multiple blocks map to each cache line, accesses to different memory addresses can conflict. In a fully associative cache, on the other hand, any memory block can be stored in any cache line, eliminating conflicts. Fully associative caches, however, are expensive and slow, so they are usually approximated by n-way set-associative caches.
As a rule of thumb, a two-way set-associative cache has a miss rate similar to a direct-mapped cache twice the size. Miss-rate improvement, however, diminishes rapidly with increasing associativity. For all practical purposes, an eight-way set-associative cache is just as effective as a fully associative cache. A least-recently used (LRU) replacement algorithm is the one most often used to decide into which way a new line should be allocated (even though LRU is known to be suboptimal in many cases).
Cache performance is also affected by a cache’s write policy. The simplest policy is write through, wherein every store writes data to main memory, updating the cache only if it hits. This policy, however, leads to slow writes and to excessive write traffic to memory. As a result, most PC processors use write-back caches (sometimes called copy-back or store-in caches). Write-back caches write to the cache, not to memory, on a hit. Thus, a cache line can collect multiple store hits, writing to memory only once when the line is replaced. Most write-back caches use a write-allocate policy, which allocates a new line in the cache on a write miss.
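The effect of associativity on conflict misses can be demonstrated with a tiny cache model. This is a sketch for illustration only: it tracks tags but no data, ignores write policy, and the class name, 32-byte line size, and access pattern are assumptions.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only model of an n-way set-associative cache with LRU replacement."""

    def __init__(self, num_lines, ways, line_bytes=32):
        if num_lines % ways != 0:
            raise ValueError("num_lines must be a multiple of ways")
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = num_lines // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_bytes
        cache_set = self.sets[block % self.num_sets]
        tag = block // self.num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)       # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)    # evict the least recently used line
        cache_set[tag] = True
        return False


if __name__ == "__main__":
    direct_mapped = SetAssociativeCache(num_lines=8, ways=1)
    two_way = SetAssociativeCache(num_lines=8, ways=2)
    for _ in range(4):
        for addr in (0, 256):                # two blocks that map to the same set
            direct_mapped.access(addr)
            two_way.access(addr)
    print("direct-mapped misses:", direct_mapped.misses)
    print("two-way misses:", two_way.misses)
```

With `ways=1` the two addresses keep evicting each other, so every access misses; with `ways=2` only the two compulsory misses remain. Setting `ways` equal to `num_lines` gives the fully associative case.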
In systems with caches, there is the nasty problem of cache coherency. If, for example, a processor has a memory block in its cache, and an I/O device writes to an address in that block, then the data in the processor’s cache becomes stale. If the processor has modified the cache line, the data written by the I/O device will be overwritten and lost permanently when the cache line is eventually written back to memory.
Avoiding these situations with software is difficult and error prone. With multiple processors, the problem becomes even more complicated, and software solutions become intractable. Although not a problem for PCs today, multiprocessors will one day become attractive. Thus, to simplify I/O software and enable multiprocessing in the future, PC processors all enforce cache coherence via hardware.
The most popular scheme is the four-state coherence protocol called MESI (modified, exclusive, shared, invalid). In this scheme, MESI status bits are maintained with each cache line. The first time a processor writes to a shared line in its cache, it broadcasts a write-invalidate coherence transaction to other devices. Any device with a copy of the line in its cache invalidates its copy, which, if modified, requires writing the line to memory before allowing the processor to take ownership of it. As an optimization, some processors allow lines to be allocated into a fifth, owned state (MOESI) to improve the efficiency of accessing shared data in symmetric multiprocessor (SMP) systems.
In this coherence scheme, every processor or device with a cache snoops (watches) every memory transaction issued by every device in the coherence domain. If a snooped transaction hits on a line that is held exclusively, its status is changed to shared. If a snooped transaction hits a modified line, the offending transaction is held off until the dirty line can be written to memory. Alternatively, the processor with the hit can intervene to produce the modified data, allowing other processors and memory to snarf it off the bus.
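The MESI transitions described above can be written down as a small state function for a single cache line. This is a simplified sketch, not a complete protocol: the event and action names are invented, a write miss is folded into the same write-invalidate action as a write to a shared line, and intervention and the MOESI owned state are omitted.

```python
M, E, S, I = "modified", "exclusive", "shared", "invalid"


def next_state(state, event, others_have_copy=False):
    """Return (new_state, bus_action) for one cache line.

    event is one of "local_read", "local_write", "snoop_read", "snoop_write".
    bus_action is None when no bus transaction is needed.
    """
    if event == "local_read":
        if state == I:
            # Miss: fetch the line; it is exclusive only if nobody else holds it.
            return (S if others_have_copy else E, "bus_read")
        return (state, None)
    if event == "local_write":
        if state in (M, E):
            return (M, None)                 # already the sole owner
        return (M, "write_invalidate")       # from shared or invalid: others drop copies
    if event == "snoop_read":
        if state == M:
            return (S, "write_back")         # dirty data must reach memory first
        if state == E:
            return (S, None)
        return (state, None)
    if event == "snoop_write":
        if state == M:
            return (I, "write_back")
        return (I, None)
    raise ValueError("unknown event: " + event)


if __name__ == "__main__":
    state = I
    for event in ("local_read", "local_write", "snoop_read", "snoop_write"):
        state, action = next_state(state, event)
        print(event, "->", state, action)
```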
Loads and stores have two distinct phases: address generation and memory access. Through the address-generation phase, processors treat loads and stores just like other instructions. To deal with the unique characteristics of memory, however, processors generally decouple the memory-access phase by issuing memory requests along with their resolved addresses to a queue or a buffer in the memory unit, as Figure 5 shows.
One important function performed by the memory unit is load/store reordering. Store instructions are often issued before their store data is ready. If all memory transactions were forced to access memory in program order, subsequent loads would be blocked, unnecessarily stalling the pipeline. To alleviate the blockage, some memory units provide store reservation stations, where stores can wait on data while subsequent loads access memory. To ensure correct program operation, the memory unit must perform
Figure 5. Some processors allow loads to bypass stores that are waiting on data. With a nonblocking cache, load hits can access the cache while a previous miss is waiting on data from memory. (Diagram: address generation in the load/store unit feeds load or store requests to the load/store request queue in the memory unit; also shown are the store reservation stations, DTLB, cache tags, nonblocking data cache with read and write ports, store-data forwarding, the path to memory, and the result and source-operand buses.)
floating-point multiply-add. This FMADD feature chains a multiply and an add together with the latency of a multiply, improving the performance of inner products and other important numerical algorithms. Fusing the multiply-add with a single rounding enables even greater speed.
A relatively new feature in ISAs is single-instruction, multiple-data processing (SIMD). Its popularity is being driven by the increasing demand for digital-signal processing (DSP) and multimedia processing in PC applications. DSP and multimedia algorithms are rich in data-level parallelism. To take advantage of this parallelism, SIMD instructions operate on packed fixed-length vector operands, as Figure 6 shows, rather than on the single-element operands of conventional scalar instructions. Although instruction-level parallel techniques can also exploit data-level parallelism, SIMD units exploit it to a higher degree and with less complexity.
Most SIMD ISA extensions support special features for DSP algorithms that are not found in traditional general-purpose processor ISAs. Saturation arithmetic, for example, clamps overflowed or underflowed elements at their maximum or minimum values, an important feature for processing packed short integers with limited dynamic range.
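The difference between saturating and ordinary wraparound arithmetic on packed elements can be shown with plain lists standing in for vector registers. This is an illustrative sketch, not any ISA's instruction semantics: the function names and the choice of four signed 16-bit elements are assumptions.

```python
def saturating_add_packed(xs, ys, bits=16):
    """Element-wise signed add that clamps results to the representable range."""
    if len(xs) != len(ys):
        raise ValueError("operands must have the same number of elements")
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, x + y)) for x, y in zip(xs, ys)]


def wrapping_add_packed(xs, ys, bits=16):
    """Element-wise signed add with two's-complement wraparound, for contrast."""
    if len(xs) != len(ys):
        raise ValueError("operands must have the same number of elements")
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    out = []
    for x, y in zip(xs, ys):
        r = (x + y) & mask
        out.append(r - (1 << bits) if r & sign else r)
    return out


if __name__ == "__main__":
    x = [30000, -30000, 100, 0]
    y = [10000, -10000, 23, 0]
    print(saturating_add_packed(x, y))   # [32767, -32768, 123, 0]
    print(wrapping_add_packed(x, y))     # [-25536, 25536, 123, 0]
```

In the wraparound result a large positive sum flips to a negative value, which in an audio or image sample would be a glaring artifact; clamping keeps the result at the nearest representable value instead.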
Unfortunately, all of the techniques reviewed so far come at a cost. Mostly they increase complexity, which has negative effects on frequency, die size, and power. Complexity increases nonlinearly with pipeline length and with issue width, and it is exacerbated by the cascade of tricks required to minimize stalls. Complexity adds logic gates, slowing cycle time or adding pipeline stages. It adds transistors, lengthening signal paths and increasing die size and cost. It also increases design time, which, since processor performance progresses at about 60% per year, costs 4% in relative performance every month, a nontrivial amount.
IC process is different. Everything improves: performance, frequency, die size, and power. No known microarchitectural technique comes close to the massive improvements made by a single process generation.
One process generation, defined as a 30% linear shrink of feature sizes, halves die size (or doubles the transistor budget for the same die size). A smaller die size dramatically lowers manufacturing cost because it both increases the gross die per wafer and improves yield (see MPR 8/2/93, p. 12). Yield improves exponentially with decreasing die size because statistically fewer die are lost to defects.
Defect density is usually somewhat higher at the introduction of a new process, but it improves quickly with experience and volume.
One process generation also improves the intrinsic speed (CV/I) of transistors by 30–50%. As if these gains weren’t enough, each generation is typically accompanied by about a 25% reduction in voltage, which, since power is a quadratic function of voltage (P = CV²f), cuts power consumption in half. The only critical parameter that doesn’t naturally improve with process shrinks is interconnect delay (RC delay). Manufacturers are combating this recalcitrant term by lowering capacitance (C) with low dielectric-constant (low-k) insulators, and by lowering resistance (R) with thicker metal layers or by moving from aluminum to copper metallization (see MPR 8/4/97, p. 14).
Currently, most PC microprocessors are built on a 0.22- to 0.25-micron process with ≈0.18-micron gate lengths (Lgate), five layers of aluminum interconnect, and operating voltages from 1.8 to 2.5 V. Logic densities are roughly 60,000 transistors/mm², and SRAM cells are about 10 μm². Next-generation 0.18-micron processes, which will begin volume production during 2H99 at most companies, will have an Lgate of ≈0.14 microns, six layers of aluminum or copper interconnect, operating voltages of 1.5 V or less, logic densities of 120,000 transistors/mm², and SRAM cells smaller than 5 μm² (see MPR 9/14/98, p. 1; MPR 1/25/99, p. 22).
The package is also a factor in processor performance, cost, and size (see MPR 9/13/93, p. 12). Chips are either wire bonded, which is the cheaper method, or flip-chip mounted with solder bumps, which is the electrically superior method, onto a package substrate. Plastic substrates are the least expensive, but cannot handle as much power as ceramic substrates. Pin-grid-array (PGA) packages are used where socketability is required, but surface-mount ball-grid-arrays (BGAs) are smaller (important for notebooks), cheaper, and electrically superior. For all package types, package costs and test costs are a function of the number of pins (or balls).
Organic BGA packages appear to be the way of the future. They offer a low-k dielectric substrate and a copper lead frame for superior electrical characteristics. They also have low cost and low thermal resistance, since a heat sink can be directly attached to the silicon die.
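The round numbers quoted in the process discussion above follow from simple scaling arithmetic, which the sketch below reproduces. The function names are hypothetical, and the power calculation deliberately holds capacitance and frequency fixed, so it isolates the voltage term only.

```python
def monthly_rate(annual_rate):
    """Monthly growth rate that compounds to the given annual rate."""
    return (1.0 + annual_rate) ** (1.0 / 12.0) - 1.0


def area_factor(linear_shrink):
    """Relative die area after shrinking every linear dimension."""
    return (1.0 - linear_shrink) ** 2


def voltage_power_factor(voltage_reduction):
    """Relative power from the V^2 term of P = C * V^2 * f, with C and f held fixed."""
    return (1.0 - voltage_reduction) ** 2


if __name__ == "__main__":
    print(round(monthly_rate(0.60), 4))          # 0.0399, i.e. about 4% per month
    print(round(area_factor(0.30), 4))           # 0.49, i.e. about half the die size
    print(round(voltage_power_factor(0.25), 4))  # 0.5625 from voltage alone
```

The voltage term by itself accounts for a reduction to roughly 56% of the original power, which is in line with the rounded "half" quoted in the text.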
Aside from a state-of-the-art semiconductor process, which is the minimum ante to play in the PC processor business, there is considerable disagreement over which is the best collection of microarchitectural features for a PC processor. Some designers, for example, prefer complex wide-issue out-of-order microarchitectures; others believe that simple, fast, in-order pipelines with large caches are better. This and other differences of opinion are evident in current and upcoming PC processors.
In the next installment of this article, we look at how specific PC processors use the techniques reviewed in this article.
Figure 6. SIMD instructions perform the same operation on all the elements of short vectors stored in registers. (Diagram: two source registers RS hold elements X3 to X0 and Y3 to Y0; the same function ƒ is applied to each pair of elements, and destination register RD receives ƒ(X3,Y3), ƒ(X2,Y2), ƒ(X1,Y1), and ƒ(X0,Y0).)