Applications of thread level parallelism, Lecture notes of Computer Architecture and Organization

How thread level parallelism can be used to reduce the CPI

Typology: Lecture notes

2018/2019

Uploaded on 04/13/2019

devanshu_17
devanshu_17 🇮🇳

2 documents

1 / 50

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
192 Chapter Three Instruction-Level Parallelism and Its Exploitation
until just before the store commits and can be placed directly into the stores
ROB entry by the sourcing instruction. This is accomplished by having the hard-
ware track when the source value to be stored is available in the store’s ROB
entry and searching the ROB on every instruction completion to look for depen-
dent stores.
This addition is not complicated, but adding it has two effects: We would
need to add a field to the ROB, and Figure 3.14, which is already in a small font,
would be even longer! Although Figure 3.14 makes this simplification, in our
examples, we will allow the store to pass through the write result stage and sim-
ply wait for the value to be ready when it commits.
Like Tomasulo’s algorithm, we must avoid hazards through memory. WAW
and WAR hazards through memory are eliminated with speculation because the
actual updating of memory occurs in order, when a store is at the head of the
ROB, and, hence, no earlier loads or stores can still be pending. RAW hazards
through memory are maintained by two restrictions:
1. Not allowing a load to initiate the second step of its execution if any active
ROB entry occupied by a store has a Destination field that matches the value
of the A field of the load.
2. Maintaining the program order for the computation of an effective address of
a load with respect to all earlier stores.
Together, these two restrictions ensure that any load that accesses a memory loca-
tion written to by an earlier store cannot perform the memory access until the
store has written the data. Some speculative processors will actually bypass the
value from the store to the load directly, when such a RAW hazard occurs.
Another approach is to predict potential collisions using a form of value predic-
tion; we consider this in Section 3.9.
Although this explanation of speculative execution has focused on floating
point, the techniques easily extend to the integer registers and functional units.
Indeed, speculation may be more useful in integer programs, since such programs
tend to have code where the branch behavior is less predictable. Additionally,
these techniques can be extended to work in a multiple-issue processor by allow-
ing multiple instructions to issue and commit every clock. In fact, speculation is
probably most interesting in such processors, since less ambitious techniques can
probably exploit sufficient ILP within basic blocks when assisted by a compiler.
The techniques of the preceding sections can be used to eliminate data, control
stalls, and achieve an ideal CPI of one. To improve performance further we
would like to decrease the CPI to less than one, but the CPI cannot be reduced
below one if we issue only one instruction every clock cycle.
3.7 Exploiting ILP Using Multiple Issue and
Static Scheduling
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32

Partial preview of the text

Download Applications of thread level parallelism and more Lecture notes Computer Architecture and Organization in PDF only on Docsity!

192 ■ Chapter Three Instruction-Level Parallelism and Its Exploitation

until just before the store commits and can be placed directly into the store’s

ROB entry by the sourcing instruction. This is accomplished by having the hard-

ware track when the source value to be stored is available in the store’s ROB

entry and searching the ROB on every instruction completion to look for depen-

dent stores.

This addition is not complicated, but adding it has two effects: We would

need to add a field to the ROB, and Figure 3. 14 , which is already in a small font,

would be even longer! Although Figure 3. 14 makes this simplification, in our

examples, we will allow the store to pass through the write result stage and sim-

ply wait for the value to be ready when it commits.

Like Tomasulo’s algorithm, we must avoid hazards through memory. WAW

and WAR hazards through memory are eliminated with speculation because the

actual updating of memory occurs in order, when a store is at the head of the

ROB, and, hence, no earlier loads or stores can still be pending. RAW hazards

through memory are maintained by two restrictions:

1. Not allowing a load to initiate the second step of its execution if any active

ROB entry occupied by a store has a Destination field that matches the value

of the A field of the load.

2. Maintaining the program order for the computation of an effective address of

a load with respect to all earlier stores.

Together, these two restrictions ensure that any load that accesses a memory loca-

tion written to by an earlier store cannot perform the memory access until the

store has written the data. Some speculative processors will actually bypass the

value from the store to the load directly, when such a RAW hazard occurs.

Another approach is to predict potential collisions using a form of value predic-

tion; we consider this in Section 3. 9.

Although this explanation of speculative execution has focused on floating

point, the techniques easily extend to the integer registers and functional units.

Indeed, speculation may be more useful in integer programs, since such programs

tend to have code where the branch behavior is less predictable. Additionally,

these techniques can be extended to work in a multiple-issue processor by allow-

ing multiple instructions to issue and commit every clock. In fact, speculation is

probably most interesting in such processors, since less ambitious techniques can

probably exploit sufficient ILP within basic blocks when assisted by a compiler.

The techniques of the preceding sections can be used to eliminate data, control

stalls, and achieve an ideal CPI of one. To improve performance further we

would like to decrease the CPI to less than one, but the CPI cannot be reduced

below one if we issue only one instruction every clock cycle.

3.7 Exploiting ILP Using Multiple Issue and

Static Scheduling

3.7 Exploiting ILP Using Multiple Issue and Static Scheduling ■ 193

The goal of the multiple-issue processors , discussed in the next few sections,

is to allow multiple instructions to issue in a clock cycle. Multiple-issue proces-

sors come in three major flavors:

1. Statically scheduled superscalar processors

2. VLIW (very long instruction word) processors

3. Dynamically scheduled superscalar processors

The two types of superscalar processors issue varying numbers of instructions

per clock and use in-order execution if they are statically scheduled or out-of-

order execution if they are dynamically scheduled.

VLIW processors, in contrast, issue a fixed number of instructions formatted

either as one large instruction or as a fixed instruction packet with the parallel-

ism among instructions explicitly indicated by the instruction. VLIW processors

are inherently statically scheduled by the compiler. When Intel and HP created

the IA-64 architecture, described in Appendix H, they also introduced the name

EPIC—explicitly parallel instruction computer—for this architectural style.

Although statically scheduled superscalars issue a varying rather than a fixed

number of instructions per clock, they are actually closer in concept to VLIWs,

since both approaches rely on the compiler to schedule code for the processor.

Because of the diminishing advantages of a statically scheduled superscalar as the

issue width grows, statically scheduled superscalars are used primarily for narrow

issue widths, normally just two instructions. Beyond that width, most designers

choose to implement either a VLIW or a dynamically scheduled superscalar.

Because of the similarities in hardware and required compiler technology, we

focus on VLIWs in this section. The insights of this section are easily extrapolated

to a statically scheduled superscalar.

Figure 3. 15 summarizes the basic approaches to multiple issue and their dis-

tinguishing characteristics and shows processors that use each approach.

The Basic VLIW Approach

VLIWs use multiple, independent functional units. Rather than attempting to

issue multiple, independent instructions to the units, a VLIW packages the multi-

ple operations into one very long instruction, or requires that the instructions in

the issue packet satisfy the same constraints. Since there is no fundamental

difference in the two approaches, we will just assume that multiple operations are

placed in one instruction, as in the original VLIW approach.

Since the advantage of a VLIW increases as the maximum issue rate grows,

we focus on a wider issue processor. Indeed, for simple two-issue processors, the

overhead of a superscalar is probably minimal. Many designers would probably

argue that a four-issue processor has manageable overhead, but as we will see

later in this chapter, the growth in overhead is a major factor limiting wider issue

processors.

3.7 Exploiting ILP Using Multiple Issue and Static Scheduling ■ 195

For now, we will rely on loop unrolling to generate long, straight-line code

sequences, so that we can use local scheduling to build up VLIW instructions and

focus on how well these processors operate.

Example Suppose we have a VLIW that could issue two memory references, two FP oper-

ations, and one integer operation or branch in every clock cycle. Show an

unrolled version of the loop x[i] = x[i] + s (see page 158 for the MIPS code)

for such a processor. Unroll as many times as necessary to eliminate any stalls.

Ignore delayed branches.

Answer Figure 3.16 shows the code. The loop has been unrolled to make seven copies of

the body, which eliminates all stalls (i.e., completely empty issue cycles), and

runs in 9 cycles. This code yields a running rate of seven results in 9 cycles, or

1.29 cycles per result, nearly twice as fast as the two-issue superscalar of Section

3.2 that used unrolled and scheduled code.

For the original VLIW model, there were both technical and logistical prob-

lems that make the approach less efficient. The technical problems are the

increase in code size and the limitations of lockstep operation. Two different

elements combine to increase code size substantially for a VLIW. First, generat-

ing enough operations in a straight-line code fragment requires ambitiously

unrolling loops (as in earlier examples), thereby increasing code size. Second,

whenever instructions are not full, the unused functional units translate to wasted

bits in the instruction encoding. In Appendix H, we examine software scheduling

Memory reference 1 Memory reference 2

FP

operation 1

FP

operation 2 Integer operation/branch L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F ADD.D F20,F18,F2 ADD.D F24,F22,F S.D F4,0(R1) S.D F8,-8(R1) ADD.D F28,F26,F S.D F12,-16(R1) S.D F16,-24(R1) DADDUI R1,R1,#- S.D F20,24(R1) S.D F24,16(R1) S.D F28,8(R1) BNE R1,R2,Loop Figure 3.16 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes 9 cycles assuming no branch delay; normally the branch delay would also need to be scheduled. The issue rate is 23 oper- ations in 9 clock cycles, or 2.5 operations per cycle. The efficiency, the percentage of available slots that contained an operation, is about 60%. To achieve this issue rate requires a larger number of registers than MIPS would normally use in this loop. The VLIW code sequence above requires at least eight FP registers, while the same code sequence for the base MIPS processor can use as few as two FP registers or as many as five when unrolled and scheduled.

196 ■ Chapter Three Instruction-Level Parallelism and Its Exploitation

approaches, such as software pipelining, that can achieve the benefits of unroll-

ing without as much code expansion.

To combat this code size increase, clever encodings are sometimes used. For

example, there may be only one large immediate field for use by any functional

unit. Another technique is to compress the instructions in main memory and

expand them when they are read into the cache or are decoded. In Appendix H,

we show other techniques, as well as document the significant code expansion

seen on IA-64.

Early VLIWs operated in lockstep; there was no hazard-detection hardware at

all. This structure dictated that a stall in any functional unit pipeline must cause

the entire processor to stall, since all the functional units must be kept synchro-

nized. Although a compiler may be able to schedule the deterministic functional

units to prevent stalls, predicting which data accesses will encounter a cache stall

and scheduling them are very difficult. Hence, caches needed to be blocking and

to cause all the functional units to stall. As the issue rate and number of memory

references becomes large, this synchronization restriction becomes unacceptable.

In more recent processors, the functional units operate more independently, and

the compiler is used to avoid hazards at issue time, while hardware checks allow

for unsynchronized execution once instructions are issued.

Binary code compatibility has also been a major logistical problem for

VLIWs. In a strict VLIW approach, the code sequence makes use of both the

instruction set definition and the detailed pipeline structure, including both func-

tional units and their latencies. Thus, different numbers of functional units and

unit latencies require different versions of the code. This requirement makes

migrating between successive implementations, or between implementations

with different issue widths, more difficult than it is for a superscalar design. Of

course, obtaining improved performance from a new superscalar design may

require recompilation. Nonetheless, the ability to run old binary files is a practi-

cal advantage for the superscalar approach.

The EPIC approach, of which the IA-64 architecture is the primary example,

provides solutions to many of the problems encountered in early VLIW designs,

including extensions for more aggressive software speculation and methods to

overcome the limitation of hardware dependence while preserving binary com-

patibility.

The major challenge for all multiple-issue processors is to try to exploit large

amounts of ILP. When the parallelism comes from unrolling simple loops in FP

programs, the original loop probably could have been run efficiently on a vector

processor (described in the next chapter). It is not clear that a multiple-issue pro-

cessor is preferred over a vector processor for such applications; the costs are

similar, and the vector processor is typically the same speed or faster. The poten-

tial advantages of a multiple-issue processor versus a vector processor are their

ability to extract some parallelism from less structured code and their ability to

easily cache all forms of data. For these reasons multiple-issue approaches have

become the primary method for taking advantage of instruction-level parallelism,

and vectors have become primarily an extension to these processors.

198 ■ Chapter Three Instruction-Level Parallelism and Its Exploitation

shows the issue logic for one case: issuing a load followed by a dependent FP

operation. The logic is based on that in Figure 3.14 on page 191, but represents

only one case. In a modern superscalar, every possible combination of dependent

instructions that is allowed to issue in the same clock cycle must be considered.

Since the number of possibilities climbs as the square of the number of instruc-

tions that can be issued in a clock, the issue step is a likely bottleneck for

attempts to go beyond four instructions per clock.

We can generalize the detail of Figure 3.18 to describe the basic strategy for

updating the issue logic and the reservation tables in a dynamically scheduled

superscalar with up to n issues per clock as follows:

Figure 3.17 The basic organization of a multiple issue processor with speculation. In this case, the organization could allow a FP multiply, FP add, integer, and load/store to all issues simultaneously (assuming one issue per clock per functional unit). Note that several datapaths must be widened to support multiple issues: the CDB, the operand buses, and, critically, the instruction issue logic, which is not shown in this figure. The last is a difficult problem, as we discuss in the text. From instruction unit Integer and FP registers Reservation stations FP adders FP multipliers 3 2 1 2 1 Common data bus (CDB) Operation bus Operand Address unit^ buses Load buffers Memory unit Reg # Data Reorder buffer Store data (^) Address Load data Store address Floating-point operations Load/store operations Instruction queue Integer unit 2 1 © Hennessy, John L.; Patterson, David A., Oct 07, 2011, Computer Architecture : A Quantitative Approach Morgan Kaufmann, Burlington, ISBN: 9780123838735

3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation ■ 199

1. Assign a reservation station and a reorder buffer for every instruction that

might be issued in the next issue bundle. This assignment can be done before

the instruction types are known, by simply preallocating the reorder buffer

entries sequentially to the instructions in the packet using n available reorder

buffer entries and by ensuring that enough reservation stations are available

to issue the whole bundle, independent of what it contains. By limiting the

number of instructions of a given class (say, one FP, one integer, one load,

Action or bookkeeping Comments if (RegisterStat[rs1].Busy)/in-flight instr. writes rs/ {h ← RegisterStat[rs1].Reorder; if (ROB[h].Ready)/* Instr completed already / {RS[r1].Vj ← ROB[h].Value; RS[r1].Qj ← 0;} else {RS[r1].Qj ← h;} / wait for instruction / } else {RS[r1].Vj ← Regs[rs]; RS[r1].Qj ← 0;}; RS[r1].Busy ← yes; RS[r1].Dest ← b1; ROB[b1].Instruction ← Load; ROB[b1].Dest ← rd1; ROB[b1].Ready ← no; RS[r].A ← imm1; RegisterStat[rt1].Reorder ← b1; RegisterStat[rt1].Busy ← yes; ROB[b1].Dest ← rt1; Updating the reservation tables for the load instruction, which has a single source operand. Because this is the first instruction in this issue bundle, it looks no different than what would normally happen for a load. RS[r2].Qj ← b1;} / wait for load instruction / Since we know that the first operand of the FP operation is from the load, this step simply updates the reservation station to point to the load. Notice that the dependence must be analyzed on the fly and the ROB entries must be allocated during this issue step so that the reservation tables can be correctly updated. if (RegisterStat[rt2].Busy) /in-flight instr writes rt/ {h ← RegisterStat[rt2].Reorder; if (ROB[h].Ready)/ Instr completed already / {RS[r2].Vk ← ROB[h].Value; RS[r2].Qk ← 0;} else {RS[r2].Qk ← h;} / wait for instruction */ } else {RS[r2].Vk ← Regs[rt2]; RS[r2].Qk ← 0;}; RegisterStat[rd2].Reorder ← b2; RegisterStat[rd2].Busy ← yes; ROB[b2].Dest ← rd2; Since we assumed that the second operand of the FP instruction was from a prior issue bundle, this step looks like it would in the single-issue case. Of course, if this instruction was dependent on something in the same issue bundle the tables would need to be updated using the assigned reservation buffer. RS[r2].Busy ← yes; RS[r2].Dest ← b2; ROB[b2].Instruction ← FP operation; ROB[b2].Dest ← rd2; ROB[b2].Ready ← no; This section simply updates the tables for the FP operation, and is independent of the load. Of course, if further instructions in this issue bundle depended on the FP operation (as could happen with a four-issue superscalar), the updates to the reservation tables for those instructions would be effected by this instruction. Figure 3.18 The issue steps for a pair of dependent instructions (called 1 and 2) where instruction 1 is FP load and instruction 2 is an FP operation whose first operand is the result of the load instruction; r1 and r2 are the assigned reservation stations for the instructions; and b1 and b2 are the assigned reorder buffer entries. For the issuing instructions, rd1 and rd2 are the destinations; rs1, rs2, and rt2 are the sources (the load only has one source); r1 and r2 are the reservation stations allocated; and b1 and b2 are the assigned ROB entries. RS is the res- ervation station data structure. RegisterStat is the register data structure, Regs represents the actual registers, and ROB is the reorder buffer data structure. Notice that we need to have assigned reorder buffer entries for this logic to operate properly and recall that all these updates happen in a single clock cycle in parallel, not sequentially!

3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation ■ 201

the speculative processor executes in clock cycle 13, while it executes in clock

cycle 19 on the nonspeculative pipeline. Because the completion rate on the non-

speculative pipeline is falling behind the issue rate rapidly, the nonspeculative

pipeline will stall when a few more iterations are issued. The performance of the

nonspeculative processor could be improved by allowing load instructions to

complete effective address calculation before a branch is decided, but unless

speculative memory accesses are allowed, this improvement will gain only 1

clock per iteration.

This example clearly shows how speculation can be advantageous when there

are data-dependent branches, which otherwise would limit performance. This

advantage depends, however, on accurate branch prediction. Incorrect specula-

tion does not improve performance; in fact, it typically harms performance and,

as we shall see, dramatically lowers energy efficiency.

Iteration number Instructions Issues at clock cycle number Executes at clock cycle number Memory access at clock cycle number Write CDB at clock cycle number Comment 1 LD R2,0(R1) 1 2 3 4 First issue 1 DADDIU R2,R2,#1 1 5 6 Wait for LW 1 SD R2,0(R1) 2 3 7 Wait for DADDIU 1 DADDIU R1,R1,#8 2 3 4 Execute directly 1 BNE R2,R3,LOOP 3 7 Wait for DADDIU 2 LD R2,0(R1) 4 8 9 10 Wait for BNE 2 DADDIU R2,R2,#1 4 11 12 Wait for LW 2 SD R2,0(R1) 5 9 13 Wait for DADDIU 2 DADDIU R1,R1,#8 5 8 9 Wait for BNE 2 BNE R2,R3,LOOP 6 13 Wait for DADDIU 3 LD R2,0(R1) 7 14 15 16 Wait for BNE 3 DADDIU R2,R2,#1 7 17 18 Wait for LW 3 SD R2,0(R1) 8 15 19 Wait for DADDIU 3 DADDIU R1,R1,#8 8 14 15 Wait for BNE 3 BNE R2,R3,LOOP 9 19 Wait for DADDIU Figure 3.19 The time of issue, execution, and writing result for a dual-issue version of our pipeline without speculation. Note that the LD following the BNE cannot start execution earlier because it must wait until the branch outcome is determined. This type of program, with data-dependent branches that cannot be resolved earlier, shows the strength of speculation. Separate functional units for address calculation, ALU operations, and branch-condition evaluation allow multiple instructions to execute in the same cycle. Figure 3.20 shows this example with speculation.

202 ■ Chapter Three Instruction-Level Parallelism and Its Exploitation

In a high-performance pipeline, especially one with multiple issues, predicting

branches well is not enough; we actually have to be able to deliver a high-

bandwidth instruction stream. In recent multiple-issue processors, this has meant

delivering 4 to 8 instructions every clock cycle. We look at methods for increas-

ing instruction delivery bandwidth first. We then turn to a set of key issues in

implementing advanced speculation techniques, including the use of register

renaming versus reorder buffers, the aggressiveness of speculation, and a tech-

nique called value prediction , which attempts to predict the result of a computa-

tion and which could further enhance ILP.

Increasing Instruction Fetch Bandwidth

A multiple-issue processor will require that the average number of instructions

fetched every clock cycle be at least as large as the average throughput. Of

course, fetching these instructions requires wide enough paths to the instruction

cache, but the most difficult aspect is handling branches. In this section, we look

Iteration number Instructions Issues at clock number Executes at clock number Read access at clock number Write CDB at clock number Commits at clock number Comment 1 LD R2,0(R1) 1 2 3 4 5 First issue 1 DADDIU R2,R2,#1 1 5 6 7 Wait for LW 1 SD R2,0(R1) 2 3 7 Wait for DADDIU 1 DADDIU R1,R1,#8 2 3 4 8 Commit in order 1 BNE R2,R3,LOOP 3 7 8 Wait for DADDIU 2 LD R2,0(R1) 4 5 6 7 9 No execute delay 2 DADDIU R2,R2,#1 4 8 9 10 Wait for LW 2 SD R2,0(R1) 5 6 10 Wait for DADDIU 2 DADDIU R1,R1,#8 5 6 7 11 Commit in order 2 BNE R2,R3,LOOP 6 10 11 Wait for DADDIU 3 LD R2,0(R1) 7 8 9 10 12 Earliest possible 3 DADDIU R2,R2,#1 7 11 12 13 Wait for LW 3 SD R2,0(R1) 8 9 13 Wait for DADDIU 3 DADDIU R1,R1,#8 8 9 10 14 Executes earlier 3 BNE R2,R3,LOOP 9 13 14 Wait for DADDIU Figure 3.20 The time of issue, execution, and writing result for a dual-issue version of our pipeline with specula- tion. Note that the LD following the BNE can start execution early because it is speculative.

3.9 Advanced Techniques for Instruction Delivery and

Speculation

204 ■ Chapter Three Instruction-Level Parallelism and Its Exploitation

If a matching entry is found in the branch-target buffer, fetching begins

immediately at the predicted PC. Note that unlike a branch-prediction buffer, the

predictive entry must be matched to this instruction because the predicted PC

will be sent out before it is known whether this instruction is even a branch. If the

processor did not check whether the entry matched this PC, then the wrong PC

would be sent out for instructions that were not branches, resulting in worse

performance. We only need to store the predicted-taken branches in the branch-

target buffer, since an untaken branch should simply fetch the next sequential

instruction, as if it were not a branch.

Figure 3.22 shows the steps when using a branch-target buffer for a simple

five-stage pipeline. From this figure we can see that there will be no branch delay

Figure 3.22 The steps involved in handling an instruction with a branch-target buffer. IF ID EX Send PC to memory and branch-target buffer Entry found in branch-target buffer? No No Normal instruction execution Yes Send out predicted Is PC instruction a taken branch? Taken branch? Enter branch instruction address and next PC into branch- target buffer Mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer Branch correctly predicted; continue execution with no stalls Yes No Yes

3.9 Advanced Techniques for Instruction Delivery and Speculation ■ 205

if a branch-prediction entry is found in the buffer and the prediction is correct.

Otherwise, there will be a penalty of at least two clock cycles. Dealing with the

mispredictions and misses is a significant challenge, since we typically will have

to halt instruction fetch while we rewrite the buffer entry. Thus, we would like to

make this process fast to minimize the penalty.

To evaluate how well a branch-target buffer works, we first must determine

the penalties in all possible cases. Figure 3.23 contains this information for a sim-

ple five-stage pipeline.

Example Determine the total branch penalty for a branch-target buffer assuming the pen-

alty cycles for individual mispredictions from Figure 3.23. Make the following

assumptions about the prediction accuracy and hit rate:

■ Prediction accuracy is 90% (for instructions in the buffer).

■ Hit rate in the buffer is 90% (for branches predicted taken).

Answer We compute the penalty by looking at the probability of two events: the branch is

predicted taken but ends up being not taken, and the branch is taken but is not

found in the buffer. Both carry a penalty of two cycles.

This penalty compares with a branch penalty for delayed branches, which we

evaluate in Appendix C, of about 0.5 clock cycles per branch. Remember,

though, that the improvement from dynamic branch prediction will grow as the

Instruction in buffer Prediction Actual branch Penalty cycles Yes Taken Taken 0 Yes Taken Not taken 2 No Taken 2 No Not taken 0 Figure 3.23 Penalties for all possible combinations of whether the branch is in the buffer and what it actually does, assuming we store only taken branches in the buffer. There is no branch penalty if everything is correctly predicted and the branch is found in the target buffer. If the branch is not correctly predicted, the penalty is equal to one clock cycle to update the buffer with the correct information (during which an instruction cannot be fetched) and one clock cycle, if needed, to restart fetching the next correct instruction for the branch. If the branch is not found and taken, a two-cycle penalty is encountered, during which time the buffer is updated. Probability (branch in buffer, but actually not taken) = Percent buffer hit rate ×Percent incorrect predictions = 90% × 10%=0. Probability (branch not in buffer, but actually taken) = 10% Branch penalty = ( 0.09 +0.10) × 2 Branch penalty = 0.

3.9 Advanced Techniques for Instruction Delivery and Speculation ■ 207

ILP in Section 3.10. Both the Intel Core processors and the AMD Phenom proces-

sors have return address predictors.

Integrated Instruction Fetch Units

To meet the demands of multiple-issue processors, many recent designers have

chosen to implement an integrated instruction fetch unit as a separate autono-

mous unit that feeds instructions to the rest of the pipeline. Essentially, this

amounts to recognizing that characterizing instruction fetch as a simple single

pipe stage given the complexities of multiple issue is no longer valid.

Instead, recent designs have used an integrated instruction fetch unit that inte-

grates several functions:

1. Integrated branch prediction —The branch predictor becomes part of the

instruction fetch unit and is constantly predicting branches, so as to drive the

fetch pipeline.

Figure 3.24 Prediction accuracy for a return address buffer operated as a stack on a number of SPEC CPU95 benchmarks. The accuracy is the fraction of return addresses predicted correctly. A buffer of 0 entries implies that the standard branch prediction is used. Since call depths are typically not large, with some exceptions, a modest buffer works well. These data come from Skadron et al. [1999] and use a fix-up mechanism to prevent corruption of the cached return addresses. Misprediction frequency 70% 60% 50% 40% 30% 20% 0 10% 0% 1 2 4 Return address buffer entries 8 Go m88ksim cc Compress Xlisp Ijpeg Perl Vortex 16

208 ■ Chapter Three Instruction-Level Parallelism and Its Exploitation

2. Instruction prefetch —To deliver multiple instructions per clock, the instruc-

tion fetch unit will likely need to fetch ahead. The unit autonomously man-

ages the prefetching of instructions (see Chapter 2 for a discussion of

techniques for doing this), integrating it with branch prediction.

3. Instruction memory access and buffering —When fetching multiple instruc-

tions per cycle a variety of complexities are encountered, including the diffi-

culty that fetching multiple instructions may require accessing multiple cache

lines. The instruction fetch unit encapsulates this complexity, using prefetch

to try to hide the cost of crossing cache blocks. The instruction fetch unit also

provides buffering, essentially acting as an on-demand unit to provide

instructions to the issue stage as needed and in the quantity needed.

Virtually all high-end processors now use a separate instruction fetch unit con-

nected to the rest of the pipeline by a buffer containing pending instructions.

Speculation: Implementation Issues and Extensions

In this section we explore four issues that involve the design trade-offs in specu-

lation, starting with the use of register renaming, the approach that is often used

instead of a reorder buffer. We then discuss one important possible extension to

speculation on control flow: an idea called value prediction.

Speculation Support: Register Renaming versus Reorder Buffers

One alternative to the use of a reorder buffer (ROB) is the explicit use of a larger

physical set of registers combined with register renaming. This approach builds

on the concept of renaming used in Tomasulo’s algorithm and extends it. In

Tomasulo’s algorithm, the values of the architecturally visible registers (R0, …,

R31 and F0, …, F31) are contained, at any point in execution, in some combina-

tion of the register set and the reservation stations. With the addition of specula-

tion, register values may also temporarily reside in the ROB. In either case, if the

processor does not issue new instructions for a period of time, all existing

instructions will commit, and the register values will appear in the register file,

which directly corresponds to the architecturally visible registers.

In the register-renaming approach, an extended set of physical registers is

used to hold both the architecturally visible registers as well as temporary values.

Thus, the extended registers replace most of the function of the ROB and the res-

ervation stations; only a queue to ensure that instructions complete in order is

needed. During instruction issue, a renaming process maps the names of architec-

tural registers to physical register numbers in the extended register set, allocating

a new unused register for the destination. WAW and WAR hazards are avoided

by renaming of the destination register, and speculation recovery is handled

because a physical register holding an instruction destination does not become

the architectural register until the instruction commits. The renaming map is a

simple data structure that supplies the physical register number of the register

210 ■ Chapter Three Instruction-Level Parallelism and Its Exploitation

Both register renaming and reorder buffers continue to be used in high-end

processors, which now feature the ability to have as many as 40 or 50 instructions

(including loads and stores waiting on the cache) in flight. Whether renaming or

a reorder buffer is used, the key complexity bottleneck for a dynamically sched-

ule superscalar remains issuing bundles of instructions with dependences within

the bundle. In particular, dependent instructions in an issue bundle must be issued

with the assigned virtual registers of the instructions on which they depend.

A strategy for instruction issue with register renaming similar to that used for

multiple issue with reorder buffers (see page 198) can be deployed, as follows:

1. The issue logic pre-reserves enough physical registers for the entire issue

bundle (say, four registers for a four-instruction bundle with at most one reg-

ister result per instruction).

2. The issue logic determines what dependences exist within the bundle. If a

dependence does not exist within the bundle, the register renaming structure

is used to determine the physical register that holds, or will hold, the result on

which instruction depends. When no dependence exists within the bundle the

result is from an earlier issue bundle, and the register renaming table will

have the correct register number.

3. If an instruction depends on an instruction that is earlier in the bundle, then

the pre-reserved physical register in which the result will be placed is used to

update the information for the issuing instruction.

Note that just as in the reorder buffer case, the issue logic must both determine

dependences within the bundle and update the renaming tables in a single clock,

and, as before, the complexity of doing this for a larger number of instructions

per clock becomes a chief limitation in the issue width.

How Much to Speculate

One of the significant advantages of speculation is its ability to uncover events

that would otherwise stall the pipeline early, such as cache misses. This potential

advantage, however, comes with a significant potential disadvantage. Specula-

tion is not free. It takes time and energy, and the recovery of incorrect speculation

further reduces performance. In addition, to support the higher instruction execu-

tion rate needed to benefit from speculation, the processor must have additional

resources, which take silicon area and power. Finally, if speculation causes an

exceptional event to occur, such as a cache or translation lookaside buffer (TLB)

miss, the potential for significant performance loss increases, if that event would

not have occurred without speculation.

To maintain most of the advantage, while minimizing the disadvantages,

most pipelines with speculation will allow only low-cost exceptional events

(such as a first-level cache miss) to be handled in speculative mode. If an

expensive exceptional event occurs, such as a second-level cache miss or a TLB

miss, the processor will wait until the instruction causing the event is no longer

3.9 Advanced Techniques for Instruction Delivery and Speculation ■ 211

speculative before handling the event. Although this may slightly degrade the

performance of some programs, it avoids significant performance losses in

others, especially those that suffer from a high frequency of such events coupled

with less-than-excellent branch prediction.

In the 1990s, the potential downsides of speculation were less obvious. As

processors have evolved, the real costs of speculation have become more appar-

ent, and the limitations of wider issue and speculation have been obvious. We

return to this issue shortly.

Speculating through Multiple Branches

In the examples we have considered in this chapter, it has been possible to

resolve a branch before having to speculate on another. Three different situations

can benefit from speculating on multiple branches simultaneously: (1) a very

high branch frequency, (2) significant clustering of branches, and (3) long delays

in functional units. In the first two cases, achieving high performance may mean

that multiple branches are speculated, and it may even mean handling more than

one branch per clock. Database programs, and other less structured integer

computations, often exhibit these properties, making speculation on multiple

branches important. Likewise, long delays in functional units can raise the impor-

tance of speculating on multiple branches as a way to avoid stalls from the longer

pipeline delays.

Speculating on multiple branches slightly complicates the process of specula-

tion recovery but is straightforward otherwise. As of 2011, no processor has yet

combined full speculation with resolving multiple branches per cycle, and it is

unlikely that the costs of doing so would be justified in terms of performance ver-

sus complexity and power.

Speculation and the Challenge of Energy Efficiency

What is the impact of speculation on energy efficiency? At first glance, one

might argue that using speculation always decreases energy efficiency, since

whenever speculation is wrong it consumes excess energy in two ways:

1. The instructions that were speculated and whose results were not needed gen-

erated excess work for the processor, wasting energy.

2. Undoing the speculation and restoring the state of the processor to continue

execution at the appropriate address consumes additional energy that would

not be needed without speculation.

Certainly, speculation will raise the power consumption and, if we could control

speculation, it would be possible to measure the cost (or at least the dynamic

power cost). But, if speculation lowers the execution time by more than it

increases the average power consumption, then the total energy consumed may

be less.