Appendix h, Study Guides, Projects, Research of Technical Writing

Computer architecture 5th edition appendix h

Typology: Study Guides, Projects, Research

2015/2016

Uploaded on 11/02/2016

muna.alakhras
H.1 Introduction: Exploiting Instruction-Level Parallelism Statically
H.2 Detecting and Enhancing Loop-Level Parallelism
H.3 Scheduling and Structuring Code for Parallelism
H.4 Hardware Support for Exposing Parallelism: Predicated Instructions
H.5 Hardware Support for Compiler Speculation
H.6 The Intel IA-64 Architecture and Itanium Processor
H.7 Concluding Remarks

H Hardware and Software for VLIW and EPIC

The EPIC approach is based on the application of massive resources. These resources include more load-store, computational, and branch units, as well as larger, lower-latency caches than would be required for a superscalar processor. Thus, IA-64 gambles that, in the future, power will not be the critical limitation, and that massive resources, along with the machinery to exploit them, will not penalize performance with their adverse effect on clock speed, path length, or CPI factors.

M. Hopkins, in a commentary on the EPIC approach and the IA-64 architecture (2000)

H.2 Detecting and Enhancing Loop-Level Parallelism

consider only data dependences, which arise when an operand is written at some point and read at a later point. Name dependences also exist and may be removed by renaming techniques like those we explored in Chapter 3.

The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence. Most of the examples we considered in Section 3.2 have no loop-carried dependences and, thus, are loop-level parallel. To see that a loop is parallel, let us first look at the source representation:

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

In this loop, there is a dependence between the two uses of x[i], but this dependence is within a single iteration and is not loop carried. There is a dependence between successive uses of i in different iterations, which is loop carried, but this dependence involves an induction variable and can be easily recognized and eliminated. We saw examples of how to eliminate dependences involving induction variables during loop unrolling in Section 3.2, and we will look at additional examples later in this section.

Because finding loop-level parallelism involves recognizing structures such as loops, array references, and induction variable computations, the compiler can do this analysis more easily at or near the source level, as opposed to the machine-code level. Let’s look at a more complex example.

Example Consider a loop like this one:

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

Assume that A, B, and C are distinct, nonoverlapping arrays. (In practice, the arrays may sometimes be the same or may overlap. Because the arrays may be passed as parameters to a procedure, which includes this loop, determining whether arrays overlap or are identical often requires sophisticated, interprocedural analysis of the program.) What are the data dependences among the statements S1 and S2 in the loop?

Answer There are two different dependences:

  1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
  2. S2 uses the value, A[i+1], computed by S1 in the same iteration.


These two dependences are different and have different effects. To see how they differ, let’s assume that only one of these dependences exists at a time. Because the dependence of statement S1 is on an earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations of this loop to execute in series. The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if this were the only dependence, multiple iterations of the loop could execute in parallel, as long as each pair of statements in an iteration were kept in order. We saw this type of dependence in an example in Section 3.2, where unrolling was able to expose the parallelism. It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next example shows.

Example Consider a loop like this one:

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

What are the dependences between S1 and S2? Is this loop parallel? If not, show how to make it parallel.

Answer Statement S1 uses the value assigned in the previous iteration by statement S2, so there is a loop-carried dependence between S2 and S1. Despite this loop-carried dependence, this loop can be made parallel. Unlike the earlier loop, this dependence is not circular: Neither statement depends on itself, and, although S1 depends on S2, S2 does not depend on S1. A loop is parallel if it can be written without a cycle in the dependences, since the absence of a cycle means that the dependences give a partial ordering on the statements.

Although there are no circular dependences in the above loop, it must be transformed to conform to the partial ordering and expose the parallelism. Two observations are critical to this transformation:

  1. There is no dependence from S1 to S2. If there were, then there would be a cycle in the dependences and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.
  2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.

These two observations allow us to replace the loop above with the following code sequence:
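One sequence consistent with these two observations can be sketched in C as follows. It is derived from the observations rather than quoted from the source: the first S1 is peeled off before the loop, the two statements are interchanged so that S2 of iteration i sits next to S1 of iteration i+1, and the last S2 is peeled off after the loop. The function names, the constant N, and the helper loops_agree are illustrative assumptions.

```c
#include <assert.h>
#include <string.h>

#define N 100   /* iteration count from the example; arrays use indices 0..N+1 */

/* Original loop: S1 in iteration i+1 reads B[i+1], written by S2 in iteration i. */
void original_loop(double A[N + 2], double B[N + 2],
                   const double C[N + 2], const double D[N + 2]) {
    for (int i = 1; i <= N; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: the S2 -> S1 dependence now stays inside one iteration. */
void transformed_loop(double A[N + 2], double B[N + 2],
                      const double C[N + 2], const double D[N + 2]) {
    A[1] = A[1] + B[1];                    /* first S1: uses B[1] set before the loop */
    for (int i = 1; i <= N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];            /* S2 of original iteration i */
        A[i + 1] = A[i + 1] + B[i + 1];    /* S1 of original iteration i+1 */
    }
    B[N + 1] = C[N] + D[N];                /* last S2 */
}

/* Hypothetical check: returns 1 if both versions produce identical A and B. */
int loops_agree(void) {
    double A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2], C[N + 2], D[N + 2];
    for (int i = 0; i < N + 2; i++) {
        A1[i] = A2[i] = i * 0.5;
        B1[i] = B2[i] = i + 1.0;
        C[i] = 2.0 * i;
        D[i] = 3.0 - i;
    }
    original_loop(A1, B1, C, D);
    transformed_loop(A2, B2, C, D);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(B1, B2, sizeof B1) == 0;
}
```

Each element is computed from the same operands in the same order in both versions, so the results match exactly; the iterations of the transformed loop no longer depend on one another.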


On the iteration i, the loop references element i – 5. The loop is said to have a dependence distance of 5. Many loops with carried dependences have a dependence distance of 1. The larger the distance, the more potential parallelism can be obtained by unrolling the loop. For example, if we unroll the first loop, with a dependence distance of 1, successive statements are dependent on one another; there is still some parallelism among the individual instructions, but not much. If we unroll the loop that has a dependence distance of 5, there is a sequence of five statements that have no dependences, and thus much more ILP. Although many loops with loop-carried dependences have a dependence distance of 1, cases with larger distances do arise, and the longer distance may well provide enough parallelism to keep a processor busy.
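To make the effect of distance concrete, here is a small C sketch. The loop, the array name Y, and the bounds are hypothetical, chosen only so that the dependence distance is 5 and the trip count (100) is a multiple of the unroll factor.

```c
#include <assert.h>

#define LEN 106   /* hypothetical array length: indices 0..105 */

/* Hypothetical loop with a loop-carried dependence of distance 5:
   iteration i reads the element written in iteration i-5. */
void distance5(long Y[LEN]) {
    for (int i = 6; i <= 105; i = i + 1)
        Y[i] = Y[i - 5] + Y[i];
}

/* The same loop unrolled five times. The five statements in the body read
   Y[i-5..i-1] and write Y[i..i+4], so none of them depends on another. */
void distance5_unrolled(long Y[LEN]) {
    for (int i = 6; i <= 105; i = i + 5) {
        Y[i]     = Y[i - 5] + Y[i];
        Y[i + 1] = Y[i - 4] + Y[i + 1];
        Y[i + 2] = Y[i - 3] + Y[i + 2];
        Y[i + 3] = Y[i - 2] + Y[i + 3];
        Y[i + 4] = Y[i - 1] + Y[i + 4];
    }
}
```

With a distance of 1, the analogous unrolled body would form a chain in which each statement needs the result of the one before it.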

Finding Dependences

Finding the dependences in a program is an important part of three tasks: (1) good scheduling of code, (2) determining which loops might contain parallelism, and (3) eliminating name dependences. The complexity of dependence analysis arises because of the presence of arrays and pointers in languages like C or C++, or pass-by-reference parameter passing in FORTRAN. Since scalar variable references explicitly refer to a name, they can usually be analyzed quite easily, with aliasing because of pointers and reference parameters causing some complications and uncertainty in the analysis.

How does the compiler detect dependences in general? Nearly all dependence analysis algorithms work on the assumption that array indices are affine. In simplest terms, a one-dimensional array index is affine if it can be written in the form a × i + b, where a and b are constants and i is the loop index variable. The index of a multidimensional array is affine if the index in each dimension is affine. Sparse array accesses, which typically have the form x[y[i]], are one of the major examples of nonaffine accesses.

Determining whether there is a dependence between two references to the same array in a loop is thus equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop. For example, suppose we have stored to an array element with index value a × i + b and loaded from the same array with index value c × i + d, where i is the for-loop index variable that runs from m to n. A dependence exists if two conditions hold:

  1. There are two iteration indices, j and k, both within the limits of the for loop. That is, m ≤ j ≤ n, m ≤ k ≤ n.
  2. The loop stores into an array element indexed by a × j + b and later fetches from that same array element when it is indexed by c × k + d. That is, a × j + b = c × k + d.
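When a, b, c, d, m, and n are all known, the two conditions can be checked directly by enumeration. The C sketch below does exactly that; the function name dependence_exists is an assumption, and the quadratic search is meant only to spell out the definition, not to suggest how a compiler would test it.

```c
#include <assert.h>

/* Returns 1 if there are iteration indices j and k, with m <= j <= n and
   m <= k <= n, such that the store index a*j + b equals the load index
   c*k + d; returns 0 otherwise. Cost is quadratic in the trip count. */
int dependence_exists(long a, long b, long c, long d, long m, long n) {
    assert(m <= n);
    for (long j = m; j <= n; j++)
        for (long k = m; k <= n; k++)
            if (a * j + b == c * k + d)
                return 1;
    return 0;
}
```

For a store to A[i+1] and a load from A[i] with i from 1 to 100 (a = 1, b = 1, c = 1, d = 0), the function finds j = 1, k = 2 and reports a dependence.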


In general, we cannot determine whether a dependence exists at compile time. For example, the values of a, b, c, and d may not be known (they could be values in other arrays), making it impossible to tell if a dependence exists. In other cases, the dependence testing may be very expensive but decidable at compile time. For example, the accesses may depend on the iteration indices of multiple nested loops. Many programs, however, contain primarily simple indices where a, b, c, and d are all constants. For these cases, it is possible to devise reasonable compile time tests for dependence.

As an example, a simple and sufficient test for the absence of a dependence is the greatest common divisor (GCD) test. It is based on the observation that if a loop-carried dependence exists, then GCD(c, a) must divide (d – b). (Recall that an integer, x, divides another integer, y, if we get an integer quotient when we do the division y/x and there is no remainder.)

Example Use the GCD test to determine whether dependences exist in the following loop:

for (i=1; i<=100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
}

Answer Given the values a = 2, b = 3, c = 2, and d = 0, then GCD(a, c) = 2, and d – b = –3. Since 2 does not divide –3, no dependence is possible.
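The same calculation can be written as a short C sketch. The function names are assumptions, and the handling of the degenerate case a = c = 0 (both indices constant) is an addition not discussed in the text.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| by Euclid's algorithm. */
static long gcd(long x, long y) {
    x = labs(x);
    y = labs(y);
    while (y != 0) {
        long t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test for a store to index a*i + b and a load from index c*i + d.
   Returns 0 when a dependence is ruled out, 1 when one remains possible. */
int gcd_test_may_depend(long a, long b, long c, long d) {
    long g = gcd(c, a);
    if (g == 0)                 /* a == c == 0: both indices are constants */
        return d == b;
    return (d - b) % g == 0;    /* does GCD(c, a) divide d - b? */
}
```

For the loop above, gcd_test_may_depend(2, 3, 2, 0) returns 0. A return value of 1 means only that the test could not rule a dependence out: for a store to X[2*i] and a load from X[2*i+200] with i from 1 to 100, the test returns 1 even though the index ranges 2..200 and 202..400 never meet, because the loop bounds play no part in the test.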

The GCD test is sufficient to guarantee that no dependence exists; however, there are cases where the GCD test succeeds (that is, GCD(c, a) divides d – b) but no dependence exists. This can arise, for example, because the GCD test does not take the loop bounds into account.

In general, determining whether a dependence actually exists is NP-complete. In practice, however, many common cases can be analyzed precisely at low cost. Recently, approaches using a hierarchy of exact tests increasing in generality and cost have been shown to be both accurate and efficient. (A test is exact if it precisely determines whether a dependence exists. Although the general case is NP-complete, there exist exact tests for restricted situations that are much cheaper.)

In addition to detecting the presence of a dependence, a compiler wants to classify the type of dependence. This classification allows a compiler to recognize name dependences and eliminate them at compile time by renaming and copying.

Example The following loop has multiple types of dependences. Find all the true dependences, output dependences, and antidependences, and eliminate the output dependences and antidependences by renaming.


■ When objects are referenced via pointers rather than array indices (but see discussion below)

■ When array indexing is indirect through another array, which happens with many representations of sparse arrays

■ When a dependence may exist for some value of the inputs but does not exist in actuality when the code is run since the inputs never take on those values

■ When an optimization depends on knowing more than just the possibility of a dependence and needs to know which write of a variable a given read of that variable depends on

To deal with the issue of analyzing programs with pointers, another type of analysis, often called points-to analysis, is required (see Wilson and Lam [1995]). The key question that we want answered from dependence analysis of pointers is whether two pointers can designate the same address. In the case of complex dynamic data structures, this problem is extremely difficult. For example, we may want to know whether two pointers can reference the same node in a list at a given point in a program, which in general is undecidable and in practice is extremely difficult to answer. We may, however, be able to answer a simpler question: Can two pointers designate nodes in the same list, even if they may be separate nodes? This more restricted analysis can still be quite useful in scheduling memory accesses performed through pointers.

The basic approach used in points-to analysis relies on information from three major sources:

  1. Type information, which restricts what a pointer can point to.
  2. Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to. For example, if p always points to an object allocated in a given source line and q never points to that object, then p and q can never point to the same object.
  3. Information derived from pointer assignments. For example, if p may be assigned the value of q , then p may point to anything q points to.
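The second and third sources can be illustrated with a very small C sketch. It is not the algorithm of Wilson and Lam or of any particular compiler: it models each pointer's may-point-to set as a bitmask over abstract objects, applies assignments in program order for straight-line code only, and ignores type information. All names are assumptions.

```c
#include <assert.h>

#define MAX_PTRS 8    /* hypothetical limit on tracked pointer variables */
#define MAX_OBJS 32   /* one bit per allocation site or address-taken object */

/* pts[p] has bit k set if pointer p may point to abstract object k. */
typedef struct {
    unsigned pts[MAX_PTRS];
} PointsTo;

/* Models "p = &obj" or "p = <allocation at site obj>". */
void assign_address(PointsTo *s, int p, int obj) {
    assert(p >= 0 && p < MAX_PTRS && obj >= 0 && obj < MAX_OBJS);
    s->pts[p] |= 1u << obj;
}

/* Models "p = q": p may now point to anything q may point to. */
void assign_pointer(PointsTo *s, int p, int q) {
    assert(p >= 0 && p < MAX_PTRS && q >= 0 && q < MAX_PTRS);
    s->pts[p] |= s->pts[q];
}

/* Two pointers may designate the same object only if their sets intersect. */
int may_alias(const PointsTo *s, int p, int q) {
    return (s->pts[p] & s->pts[q]) != 0;
}
```

If p is only ever given the address of one object and q the address of another, may_alias returns 0, matching the example in item 2; after an assignment of q to a third pointer r, may_alias reports that r and q may alias, matching item 3.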

There are several cases where analyzing pointers has been successfully applied and is extremely useful:

■ When pointers are used to pass the address of an object as a parameter, it is possible to use points-to analysis to determine the possible set of objects referenced by a pointer. One important use is to determine if two pointer parameters may designate the same object.

■ When a pointer can point to one of several types, it is sometimes possible to determine the type of the data object that a pointer designates at different parts of the program.

■ It is often possible to separate out pointers that may only point to a local object versus a global one.


There are two different types of limitations that affect our ability to do accurate dependence analysis for large programs. The first type of limitation arises from restrictions in the analysis algorithms. Often, we are limited by the lack of applicability of the analysis rather than a shortcoming in dependence analysis per se. For example, dependence analysis for pointers is essentially impossible for programs that use pointers in arbitrary fashion, such as by doing arithmetic on pointers.

The second limitation is the need to analyze behavior across procedure boundaries to get accurate information. For example, if a procedure accepts two parameters that are pointers, determining whether the values could be the same requires analyzing across procedure boundaries. This type of analysis, called interprocedural analysis, is much more difficult and complex than analysis within a single procedure. Unlike the case of analyzing array indices within a single loop nest, points-to analysis usually requires an interprocedural analysis. The reason for this is simple. Suppose we are analyzing a program segment with two pointers; if the analysis does not know anything about the two pointers at the start of the program segment, it must be conservative and assume the worst case. The worst case is that the two pointers may designate the same object, but they are not guaranteed to designate the same object. This worst case is likely to propagate through the analysis, producing useless information.

In practice, getting fully accurate interprocedural information is usually too expensive for real programs. Instead, compilers usually use approximations in interprocedural analysis. The result is that the information may be too inaccurate to be useful.

Modern programming languages that use strong typing, such as Java, make the analysis of dependences easier. At the same time the extensive use of procedures to structure programs, as well as abstract data types, makes the analysis more difficult. Nonetheless, we expect that continued advances in analysis algorithms, combined with the increasing importance of pointer dependency analysis, will mean that there is continued progress on this important problem.
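A short C sketch shows why two pointer parameters force a conservative assumption. The function add_twice is hypothetical; its result depends on whether the caller passes the same address twice, which the compiler cannot know from the procedure body alone.

```c
#include <assert.h>

/* Adds *q into *p twice. If p and q designate the same object, the second
   statement reads the value just stored by the first, so the loads of *q
   cannot be combined or moved above the first store without knowing the
   callers. */
int add_twice(int *p, int *q) {
    *p = *p + *q;   /* store through p */
    *p = *p + *q;   /* re-reads *q, which may have just changed */
    return *p;
}
```

With distinct objects holding 1 and 2, the call returns 5; when both arguments point to a single object holding 1, it returns 4. In C99 a programmer can declare the parameters with the restrict qualifier to promise that they do not alias, which supplies by hand the fact that interprocedural analysis would otherwise have to derive.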

Eliminating Dependent Computations

Compilers can reduce the impact of dependent computations so as to achieve more instruction-level parallelism (ILP). The key technique is to eliminate or reduce a dependent computation by back substitution, which increases the amount of parallelism and sometimes increases the amount of computation required. These techniques can be applied both within a basic block and within loops, and we describe them differently.

Within a basic block, algebraic simplifications of expressions and an optimization called copy propagation, which eliminates operations that copy values, can be used to simplify sequences like the following:

DADDUI R1,R2,#
DADDUI R1,R1,#


Assume we unroll a loop with this recurrence five times. If we let the value of x on these five iterations be given by x1, x2, x3, x4, and x5, then we can write the value of sum at the end of each unroll as:

sum = sum + x1 + x2 + x3 + x4 + x5;

If unoptimized, this expression requires five dependent operations, but it can be rewritten as:

sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);

which can be evaluated in only three dependent operations.

Recurrences also arise from implicit calculations, such as those associated with array indexing. Each array index translates to an address that is computed based on the loop index variable. Again, with unrolling and algebraic optimization, the dependent computations can be minimized.
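The two forms of the sum can be compared in a small C sketch. The function names and temporaries are assumptions. Integer operands are used so that the two forms agree exactly; with floating-point operands, reassociation can change rounding, so a compiler needs permission to apply it.

```c
#include <assert.h>

/* Unoptimized form: each addition needs the result of the previous one,
   giving a chain of five dependent operations. */
long sum_chain(long sum, long x1, long x2, long x3, long x4, long x5) {
    return ((((sum + x1) + x2) + x3) + x4) + x5;
}

/* Reassociated form: t1, t2, and t3 are independent of one another, so the
   longest chain is three additions (t1 or t2, then t4, then the result). */
long sum_tree(long sum, long x1, long x2, long x3, long x4, long x5) {
    long t1 = sum + x1;
    long t2 = x2 + x3;
    long t3 = x4 + x5;
    long t4 = t1 + t2;
    return t4 + t3;
}
```

Both functions still perform five additions; only the length of the longest dependent chain changes.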

We have already seen that one compiler technique, loop unrolling, is useful to uncover parallelism among instructions by creating longer sequences of straight-line code. There are two other important techniques that have been developed for this purpose: software pipelining and trace scheduling.

Software Pipelining: Symbolic Loop Unrolling

Software pipelining is a technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop. This approach is most easily understood by looking at the scheduled code for the unrolled loop, which appeared in the example on page 78. The scheduler in this example essentially interleaves instructions from different loop iterations, so as to separate the dependent instructions that occur within a single loop iteration. By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body, increasing the possibility that the unrolled loop can be scheduled without stalls.

A software-pipelined loop interleaves instructions from different iterations without unrolling the loop, as illustrated in Figure H.1. This technique is the software counterpart to what Tomasulo’s algorithm does in hardware. The software-pipelined loop for the earlier example would contain one load, one add, and one store, each from a different iteration. There is also some start-up code that is needed before the loop begins as well as code to finish up after the loop is completed. We will ignore these in this discussion, for simplicity.

H.3 Scheduling and Structuring Code for Parallelism


Example Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2:

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-
      BNE    R1,R2,Loop

You may omit the start-up and clean-up code.

Answer Software pipelining symbolically unrolls the loop and then selects instructions from each iteration. Since the unrolling is symbolic, the loop overhead instructions (the DADDUI and BNE) need not be replicated. Here’s the body of the unrolled loop without overhead instructions, highlighting the instructions taken from each iteration:

Iteration i:   L.D   F0,0(R1)
               ADD.D F4,F0,F2
               S.D   F4,0(R1)
Iteration i+1: L.D   F0,0(R1)
               ADD.D F4,F0,F2
               S.D   F4,0(R1)
Iteration i+2: L.D   F0,0(R1)
               ADD.D F4,F0,F2
               S.D   F4,0(R1)
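As a hedged illustration in C rather than MIPS assembly, the sketch below builds a loop body from one store, one add, and one load, each belonging to a different original iteration, and includes the start-up and finish-up code that the example allows us to omit. It is an assumption-laden sketch, not the answer from the source: the function names are invented, the array is walked upward rather than downward, and the locals loaded and summed stand in for registers F0 and F4.

```c
#include <assert.h>

/* Original loop: load, add, and store all belong to the same iteration. */
void add_scalar(double *x, int n, double s) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined form: each pass of the kernel stores the result for
   element i, adds for element i+1, and loads element i+2. */
void add_scalar_pipelined(double *x, int n, double s) {
    if (n < 2) {                   /* too short to pipeline */
        add_scalar(x, n, s);
        return;
    }
    /* Start-up code: fill the pipeline. */
    double loaded = x[0];          /* load for element 0 */
    double summed = loaded + s;    /* add for element 0 */
    loaded = x[1];                 /* load for element 1 */

    /* Kernel: three operations from three different original iterations. */
    for (int i = 0; i < n - 2; i++) {
        x[i] = summed;             /* store for element i */
        summed = loaded + s;       /* add for element i+1 */
        loaded = x[i + 2];         /* load for element i+2 */
    }

    /* Finish-up code: drain the pipeline. */
    x[n - 2] = summed;             /* store for element n-2 */
    summed = loaded + s;           /* add for element n-1 */
    x[n - 1] = summed;             /* store for element n-1 */
}
```

Inside the kernel, the value stored was added one pass earlier and loaded two passes earlier, so the dependent operations of a single original iteration are separated by a full loop body.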

Figure H.1 A software-pipelined loop chooses instructions from different loop iterations, thus separating the dependent instructions within one iteration of the original loop. The start-up and finish-up code will correspond to the portions above and below the software-pipelined iteration.

[Figure labels: Iteration 0 through Iteration 4; software-pipelined iteration.]


tion using software pipelining is quite difficult for several reasons: Many loops require significant transformation before they can be software pipelined, the trade-offs in terms of overhead versus efficiency of the software-pipelined loop are complex, and the issue of register management creates additional complexities. To help deal with the last two of these issues, the IA-64 added extensive hardware support for software pipelining. Although this hardware can make it more efficient to apply software pipelining, it does not eliminate the need for complex compiler support, or the need to make difficult decisions about the best way to compile a loop.

Global Code Scheduling

In Section 3.2 we examined the use of loop unrolling and code scheduling to improve ILP. The techniques in Section 3.2 work well when the loop body is straight-line code, since the resulting unrolled loop looks like a single basic block. Similarly, software pipelining works well when the body is a single basic block, since it is easier to find the repeatable schedule. When the body of an unrolled loop contains internal control flow, however, scheduling the code is much more complex. In general, effective scheduling of a loop body with internal control flow will require moving instructions across branches, which is global code scheduling. In this section, we first examine the challenge and limitations of global code

Figure H.2 The execution pattern for (a) a software-pipelined loop and (b) an unrolled loop. The shaded areas are the times when the loop is not running with maximum overlap or parallelism among instructions. This occurs once at the beginning and once at the end for the software-pipelined loop. For the unrolled loop it occurs m/n times if the loop has a total of m iterations and is unrolled n times. Each block represents an unroll of n iterations. Increasing the number of unrollings will reduce the start-up and clean-up overhead. The overhead of one iteration overlaps with the overhead of the next, thereby reducing the impact. The total area under the polygonal region in each case will be the same, since the total number of operations is just the execution rate multiplied by the time.

[Figure panels: (a) Software pipelining; (b) Loop unrolling. Axes: number of overlapped operations versus time. Labels: start-up code, wind-down code, overlap between unrolled iterations, proportional to number of unrolls.]


scheduling. In Section H.4 we examine hardware support for eliminating control flow within an inner loop, then we examine two compiler techniques that can be used when eliminating the control flow is not a viable approach.

Global code scheduling aims to compact a code fragment with internal control structure into the shortest possible sequence that preserves the data and control dependences. The data dependences force a partial order on operations, while the control dependences dictate instructions across which code cannot be easily moved. Data dependences are overcome by unrolling and, in the case of memory operations, using dependence analysis to determine if two references refer to the same address. Finding the shortest possible sequence for a piece of code means finding the shortest sequence for the critical path, which is the longest sequence of dependent instructions.

Control dependences arising from loop branches are reduced by unrolling. Global code scheduling can reduce the effect of control dependences arising from conditional nonloop branches by moving code. Since moving code across branches will often affect the frequency of execution of such code, effectively using global code motion requires estimates of the relative frequency of different paths. Although global code motion cannot guarantee faster code, if the frequency information is accurate, the compiler can determine whether such code movement is likely to lead to faster code.

Global code motion is important since many inner loops contain conditional statements. Figure H.3 shows a typical code fragment, which may be thought of as an iteration of an unrolled loop, and highlights the more common control flow.

Figure H.3 A code fragment and the common path shaded with gray. Moving the assignments to B or C requires a more complex analysis than for straight-line code. In this section we focus on scheduling this code segment efficiently without hardware assistance. Predication or conditional instructions, which we discuss in the next section, provide another way to schedule this code.

[Flowchart: A(i) = A(i) + B(i), followed by the test A(i) = 0?; the T branch contains B(i) =, the F branch contains X; both branches join at C(i) =.]


they will slow down the program if the trace selected is not optimal and the operations end up requiring additional instructions to execute.

Moving the assignment to C up to before the first branch requires two steps. First, the assignment is moved over the join point of the else part into the portion corresponding to the then part. This movement makes the instructions for C control dependent on the branch and means that they will not execute if the else path, which is the infrequent path, is chosen. Hence, instructions that were data dependent on the assignment to C, and which execute after this code fragment, will be affected. To ensure the correct value is computed for such instructions, a copy is made of the instructions that compute and assign to C on the else path. Second, we can move C from the then part of the branch across the branch condition, if it does not affect any data flow into the branch condition. If C is moved to before the if test, the copy of C in the else branch can usually be eliminated, since it will be redundant.

We can see from this example that global code scheduling is subject to many constraints. This observation is what led designers to provide hardware support to make such code motion easier, and Sections H.4 and H.5 explore such support in detail.

Global code scheduling also requires complex trade-offs to make code motion decisions. For example, assuming that the assignment to B can be moved before the conditional branch (possibly with some compensation code on the alternative branch), will this movement make the code run faster? The answer is, possibly! Similarly, moving the copies of C into the if and else branches makes the code initially bigger! Only if the compiler can successfully move the computation across the if test will there be a likely benefit.

Consider the factors that the compiler would have to consider in moving the computation and assignment of B:

■ What are the relative execution frequencies of the then case and the else case in the branch? If the then case is much more frequent, the code motion may be beneficial. If not, it is less likely, although not impossible, to consider moving the code.

■ What is the cost of executing the computation and assignment to B above the branch? It may be that there are a number of empty instruction issue slots in the code above the branch and that the instructions for B can be placed into these slots that would otherwise go empty. This opportunity makes the computation of B “free” at least to first order.

■ How will the movement of B change the execution time for the then case? If B is at the start of the critical path for the then case, moving it may be highly beneficial.

■ Is B the best code fragment that can be moved above the branch? How does it compare with moving C or other statements within the then case?

■ What is the cost of the compensation code that may be necessary for the else case? How effectively can this code be scheduled, and what is its impact on execution time?
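The mechanics behind these questions can be seen in a small C sketch of a fragment shaped like Figure H.3. Everything concrete in it is an assumption made for illustration: the State struct, the helper compute_b standing in for the unspecified right-hand side of the assignment to B, and the increment standing in for X. The second function moves the assignment to B above the branch and adds compensation code on the else path.

```c
#include <assert.h>

/* Hypothetical scalar stand-ins for A(i), B(i), C(i), and the state X updates. */
typedef struct { int a, b, c, x; } State;

/* Hypothetical right-hand side for the assignment to B. */
static int compute_b(int a) { return 3 * a + 7; }

/* Fragment with the shape of Figure H.3. */
State fragment(State s) {
    s.a = s.a + s.b;
    if (s.a == 0)
        s.b = compute_b(s.a);   /* then case: assumed to be the frequent path */
    else
        s.x = s.x + 1;          /* else case: stands in for X */
    s.c = s.a + s.b;
    return s;
}

/* Same fragment with the assignment to B moved above the branch. */
State fragment_hoisted(State s) {
    s.a = s.a + s.b;
    int old_b = s.b;            /* saved so the else path can undo the move */
    s.b = compute_b(s.a);       /* now executes on both paths */
    if (s.a != 0) {
        s.b = old_b;            /* compensation code on the infrequent path */
        s.x = s.x + 1;
    }
    s.c = s.a + s.b;
    return s;
}
```

The hoisted version does extra work whenever the else case runs, so it pays off only if the then case dominates and the moved computation fills issue slots that would otherwise be empty, which is what the first two questions in the list ask.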


As we can see from this partial list, global code scheduling is an extremely complex problem. The trade-offs depend on many factors, and individual decisions to globally schedule instructions are highly interdependent. Even choosing which instructions to start considering as candidates for global code motion is complex!

To try to simplify this process, several different methods for global code scheduling have been developed. The two methods we briefly explore here rely on a simple principle: Focus the attention of the compiler on a straight-line code segment representing what is estimated to be the most frequently executed code path. Unrolling is used to generate the straight-line code, but, of course, the complexity arises in how conditional branches are handled. In both cases, they are effectively straightened by choosing and scheduling the most frequent path.

Trace Scheduling: Focusing on the Critical Path

Trace scheduling is useful for processors with a large number of issues per clock, where conditional or predicated execution (see Section H.4) is inappropriate or unsupported, and where simple loop unrolling may not be sufficient by itself to uncover enough ILP to keep the processor busy. Trace scheduling is a way to organize the global code motion process, so as to simplify the code scheduling by incurring the costs of possible code motion on the less frequent paths. Because it can generate significant overheads on the designated infrequent path, it is best used where profile information indicates significant differences in frequency between different paths and where the profile information is highly indicative of program behavior independent of the input. Of course, this limits its effective applicability to certain classes of programs.

There are two steps to trace scheduling. The first step, called trace selection, tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions; this sequence is called a trace. Loop unrolling is used to generate long traces, since loop branches are taken with high probability. Additionally, by using static branch prediction, other conditional branches are also chosen as taken or not taken, so that the resultant trace is a straight-line sequence resulting from concatenating many basic blocks. If, for example, the program fragment shown in Figure H.3 corresponds to an inner loop with the highlighted path being much more frequent, and the loop were unwound four times, the primary trace would consist of four copies of the shaded portion of the program, as shown in Figure H.4.

Once a trace is selected, the second process, called trace compaction, tries to squeeze the trace into a small number of wide instructions. Trace compaction is code scheduling; hence, it attempts to move operations as early as it can in a sequence (trace), packing the operations into as few wide instructions (or issue packets) as possible.

The advantage of the trace scheduling approach is that it simplifies the decisions concerning global code motion. In particular, branches are viewed as jumps into or out of the selected trace, which is assumed to be the most probable path.
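Trace selection can be illustrated with a minimal greedy sketch in C. It is only one simple way to pick a likely path and is not presented as the method of any particular compiler: starting from an entry block, it repeatedly follows the more probable successor. The Block layout, the field names, the 0.5 threshold, and the 64-block limit are all assumptions.

```c
#include <assert.h>

#define NO_BLOCK (-1)
#define MAX_BLOCKS 64   /* hypothetical limit on the size of the flow graph */

/* One basic block in a hypothetical control-flow graph. */
typedef struct {
    int taken;         /* successor when the ending branch is taken, or NO_BLOCK */
    int fallthrough;   /* successor when it is not taken, or NO_BLOCK */
    double p_taken;    /* profiled or predicted probability of "taken" */
} Block;

/* Greedy trace selection: follow the more probable successor from `entry`
   until there is no successor, a block would repeat, or the trace is full.
   Writes block numbers into `trace` and returns how many were written. */
int select_trace(const Block *cfg, int nblocks, int entry,
                 int *trace, int max_len) {
    int visited[MAX_BLOCKS] = {0};
    int len = 0;
    int cur = entry;
    assert(nblocks > 0 && nblocks <= MAX_BLOCKS);
    while (cur != NO_BLOCK && len < max_len) {
        assert(cur >= 0 && cur < nblocks);
        if (visited[cur])
            break;                       /* do not wrap around a loop */
        visited[cur] = 1;
        trace[len++] = cur;
        const Block *b = &cfg[cur];
        if (b->taken == NO_BLOCK)
            cur = b->fallthrough;
        else if (b->fallthrough == NO_BLOCK)
            cur = b->taken;
        else
            cur = (b->p_taken >= 0.5) ? b->taken : b->fallthrough;
    }
    return len;
}
```

For a four-block graph shaped like Figure H.3 (block 0 ends in the test, blocks 1 and 2 are the two arms, block 3 is the join) with the branch to block 1 taken 90% of the time, the selected trace is blocks 0, 1, 3; block 2 is left off the trace, and entering or leaving through it is handled as a jump out of or into the trace.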