CS143 Handout 33
Autumn 2007 November 26, 2007
Code Optimization
Handout written by Maggie Johnson.
Optimization is the process of transforming a piece of code to make it more efficient
(either in terms of time or space) without changing its output or side-effects. The only
difference visible to the code’s user should be that it runs faster and/or consumes less
memory. The name is really a misnomer, since it implies you are finding an "optimal"
solution; in truth, optimization aims to improve, not perfect, the result.
Optimization is the field where most compiler research is done today. The tasks of the
front-end (scanning, parsing, semantic analysis) are well understood and unoptimized
code generation is relatively straightforward. Optimization, on the other hand, still
retains a sizable measure of mysticism. High-quality optimization is more of an art
than a science. Compilers for mature languages aren’t judged by how well they parse
or analyze the code—you just expect it to do it right with a minimum of hassle—but
instead by the quality of the object code they produce.
Many optimization problems are NP-complete and thus most optimization algorithms
rely on heuristics and approximations. It may be possible to come up with a case where
a particular algorithm fails to produce better code or perhaps even makes it worse.
However, the algorithms tend to do rather well overall.
It’s worth reiterating here that efficient code starts with intelligent decisions by the
programmer. No one expects a compiler to replace BubbleSort with Quicksort. If a
programmer uses a lousy algorithm, no amount of optimization can make it snappy. In
terms of big-O, a compiler can only make improvements to constant factors. But, all
else being equal, you want an algorithm with low constant factors.
First let me note that you probably shouldn’t try to optimize the way we will discuss
today in your favorite high-level language. Consider the following two code snippets
where each walks through an array and sets every element to one. Which one is faster?
int arr[10000];
void Binky() {
    int i;
    for (i = 0; i < 10000; i++)
        arr[i] = 1;
}

int arr[10000];
void Winky() {
    register int *p;
    for (p = arr; p < arr + 10000; p++)
        *p = 1;
}
You will invariably encounter people who think the second one is faster. And they are
probably right... if using a compiler without optimization. But many modern
compilers emit the same object code for both, by use of clever techniques (in particular,

this one is called "loop-induction variable elimination") that work particularly well on

idiomatic usage. The moral of this story is that most often you should write code that is

easier to understand and let the compiler do the optimization.

Correctness Above All!

It may seem obvious, but it bears repeating that optimization should not change the

correctness of the generated code. Transforming the code to something that runs faster

but incorrectly is of little value. It is expected that the unoptimized and optimized

variants give the same output for all inputs. This may not hold for an incorrectly

written program (e.g., one that uses an uninitialized variable).

When and Where To Optimize

There are a variety of tactics for attacking optimization. Some techniques are applied to

the intermediate code, to streamline, rearrange, compress, etc. in an effort to reduce the

size of the abstract syntax tree or shrink the number of TAC instructions. Others are

applied as part of final code generation—choosing which instructions to emit, how to

allocate registers and when/what to spill, and the like. And still other optimizations

may occur after final code generation, attempting to re-work the assembly code itself

into something more efficient.

Optimization can be very complex and time-consuming; it often involves multiple sub-

phases, some of which are applied more than once. Most compilers allow optimization

to be turned off to speed up compilation (gcc even has specific flags to turn on and off

individual optimizations).

Control-Flow Analysis

Consider all that has happened up to this point in the compiling process—lexical

analysis, syntactic analysis, semantic analysis and finally intermediate-code generation.

The compiler has done an enormous amount of analysis, but it still doesn’t really know

how the program does what it does. In control-flow analysis, the compiler figures out

even more information about how the program does its work, only now it can assume

that there are no syntactic or semantic errors in the code.

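A basic block begins in any of the following ways:

  • the entry point into the function
  • the target of a branch
  • the instruction immediately following a branch or a return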

A basic block ends in any of the following ways:

  • a jump statement
  • a conditional or unconditional branch
  • a return statement

Using the rules above, let's divide the Fibonacci TAC code into basic blocks:

_fib:
    BeginFunc 68 ;
    _tmp0 = 1 ;
    _tmp1 = base < _tmp0 ;
    _tmp2 = base == _tmp0 ;
    _tmp3 = _tmp1 || _tmp2 ;
    IfZ _tmp3 Goto _L0 ;

    result = base ;
    Goto _L1 ;

_L0:
    _tmp4 = 0 ;
    f0 = _tmp4 ;
    _tmp5 = 1 ;
    f1 = _tmp5 ;
    _tmp6 = 2 ;
    i = _tmp6 ;

_L2:
    _tmp7 = i < base ;
    _tmp8 = i == base ;
    _tmp9 = _tmp7 || _tmp8 ;
    IfZ _tmp9 Goto _L3 ;

    _tmp10 = f0 + f1 ;
    result = _tmp10 ;
    f0 = f1 ;
    f1 = result ;
    _tmp11 = 1 ;
    _tmp12 = i + _tmp11 ;
    i = _tmp12 ;
    Goto _L2 ;

_L3:
_L1:
    Return result ;
    EndFunc
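The begin/end rules can be sketched in a few lines of Python. This is only an illustration, not how the Decaf compiler does it: it assumes the TAC arrives as a list of strings, that any line ending in a colon is a label, and that `EndFunc` is a marker rather than an instruction. The name `split_basic_blocks` is made up for the example.

```python
def split_basic_blocks(tac):
    """Partition a list of TAC lines into basic blocks (lists of lines)."""
    blocks, current = [], []
    for line in tac:
        if line == "EndFunc" and blocks and not current:
            blocks[-1].append(line)      # end-of-function marker, not a new block
            continue
        # A label is a branch target, so it starts a new block. Consecutive
        # labels stay together at the head of the same block.
        if line.endswith(":") and any(not l.endswith(":") for l in current):
            blocks.append(current)
            current = []
        current.append(line)
        # A branch or return ends the block; the next line is a new leader.
        if line.startswith(("Goto", "IfZ", "Return")):
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

fib = [
    "_fib:", "BeginFunc 68",
    "_tmp0 = 1", "_tmp1 = base < _tmp0", "_tmp2 = base == _tmp0",
    "_tmp3 = _tmp1 || _tmp2", "IfZ _tmp3 Goto _L0",
    "result = base", "Goto _L1",
    "_L0:", "_tmp4 = 0", "f0 = _tmp4", "_tmp5 = 1", "f1 = _tmp5",
    "_tmp6 = 2", "i = _tmp6",
    "_L2:", "_tmp7 = i < base", "_tmp8 = i == base",
    "_tmp9 = _tmp7 || _tmp8", "IfZ _tmp9 Goto _L3",
    "_tmp10 = f0 + f1", "result = _tmp10", "f0 = f1", "f1 = result",
    "_tmp11 = 1", "_tmp12 = i + _tmp11", "i = _tmp12", "Goto _L2",
    "_L3:", "_L1:", "Return result", "EndFunc",
]
blocks = split_basic_blocks(fib)
```

Run on the Fibonacci code, the sketch yields six blocks, the last one headed by both `_L3:` and `_L1:`.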

Now we can construct the control-flow graph between the blocks. Each basic block is a

node in the graph, and the possible different routes a program might take are the

connections, i.e. if a block ends with a branch, there will be a path leading from that

block to the branch target. The blocks that can follow a block are called its successors.

There may be multiple successors or just one. Similarly the block may have many, one,

or no predecessors.

Connect up the flow graph for Fibonacci basic blocks given above. What does an if-

then-else look like in a flow graph? What about a loop?

You have probably all seen the gcc warning or javac error about: "Unreachable code at

line XXX." How can the compiler tell when code is unreachable?
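One way to explore these questions is to build the successor sets and then walk the graph from the entry block; whatever the walk never visits is unreachable. The sketch below assumes blocks are lists of TAC strings as in this handout, and the small if-then-else example (with a stray statement after a `Goto`) is invented for illustration.

```python
def successors(blocks):
    """Map each block index to the set of block indices that can follow it."""
    label_to_block = {}
    for i, block in enumerate(blocks):
        for line in block:
            if line.endswith(":"):
                label_to_block[line[:-1]] = i
    succ = {}
    for i, block in enumerate(blocks):
        last = block[-1]
        targets = set()
        if "Goto" in last:                       # conditional or unconditional
            targets.add(label_to_block[last.split("Goto")[1].strip()])
        # Control falls through unless the block ends in Goto or Return.
        if not last.startswith(("Goto", "Return")) and i + 1 < len(blocks):
            targets.add(i + 1)
        succ[i] = targets
    return succ

def reachable(succ, entry=0):
    """Depth-first walk of the flow graph starting at the entry block."""
    seen, stack = set(), [entry]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ[node])
    return seen

example = [
    ["t = a < b", "IfZ t Goto _L0"],    # 0: the test
    ["x = 1", "Goto _L1"],              # 1: then-branch
    ["y = 99"],                         # 2: follows a Goto, has no label
    ["_L0:", "x = 2"],                  # 3: else-branch
    ["_L1:", "Return x"],               # 4: join point
]
succ = successors(example)
unreachable = sorted(set(range(len(example))) - reachable(succ))
```

The if-then-else shows up as a diamond: block 0 has two successors, and both branches lead to block 4. Block 2 has no predecessors, so the walk never reaches it.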

Local Optimizations

Optimizations performed exclusively within a basic block are called "local

optimizations". These are typically the easiest to perform since we do not consider any

control flow information, we just work with the statements within the block. Many of

the local optimizations we will discuss have corresponding global optimizations that

operate on the same principle, but require additional analysis to perform. We'll

consider some of the more common local optimizations as examples.

Constant Folding

Constant folding refers to the evaluation at compile-time of expressions whose

operands are known to be constant. In its simplest form, it involves determining that all

of the operands in an expression are constant-valued, performing the evaluation of the

expression at compile-time, and then replacing the expression by its value. If an

expression such as 10 + 2 * 3 is encountered, the compiler can compute the result at

compile-time (16) and emit code as if the input contained the result rather than the

original expression. Similarly, constant conditions, such as a conditional

branch if a < b goto L1 else goto L2 where a and b are constant can be replaced by a

Goto L1 or Goto L2 depending on the truth of the expression evaluated at compile-time.

The constant expression has to be evaluated at least once, but if the compiler does it, it

means you don’t have to do it again as needed during runtime. One thing to be careful

about is that the compiler must obey the grammar and semantic rules from the source

language that apply to expression evaluation, which may not necessarily match the

language you are writing the compiler in. (For example, if you were writing an APL

compiler, you would need to take care that you were respecting its Iversonian

precedence rules). It should also respect the expected treatment of any exceptional

conditions (divide by zero, over/underflow).
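As a rough sketch of the idea, here is a folder over expression trees written as nested `(op, left, right)` tuples. The tuple encoding and the name `fold` are assumptions made for the example. It also illustrates the two cautions above: integer division is truncated toward zero as in C or Java (Python's own `//` rounds down, so using it directly would give the wrong answer for negative operands), and a division by a constant zero is left alone so the fault still happens at run time.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Fold constant subexpressions of a nested (op, left, right) tuple."""
    if not isinstance(expr, tuple):
        return expr                       # an int constant or a variable name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        if op == "/":
            if right == 0:
                return (op, left, right)  # keep the run-time error
            # Truncate toward zero (source-language rule), not Python's floor.
            quotient = abs(left) // abs(right)
            return quotient if (left < 0) == (right < 0) else -quotient
        return OPS[op](left, right)
    return (op, left, right)

folded = fold(("+", 10, ("*", 2, 3)))
partly = fold(("+", "a", ("*", 2, 3)))
```

Here `folded` is the constant 16 from the text, while `partly` keeps the variable and folds only the constant subtree.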

Here is an example of this in action. First, we have the unoptimized version, TAC on

left, MIPS on right:

_tmp0 = 12 ;                 li $t0, 12
_tmp1 = arr + _tmp0 ;        lw $t1, -8($fp)
                             add $t2, $t1, $t0
_tmp2 = *(_tmp1) ;           lw $t3, 0($t2)

A bit of constant propagation and a little rearrangement on the second load instruction

cuts the number of registers needed from 4 to 2 and the number of instructions likewise

in the optimized version:

_tmp0 = *(arr + 12) ;        lw $t0, -8($fp)
                             lw $t1, 12($t0)
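The propagation step in this example can be sketched for a single basic block as follows. The instruction encoding `(dest, op, arg1, arg2)`, with `op` set to `None` for a plain assignment, is an assumption of the sketch and not the Decaf compiler's representation.

```python
def propagate_constants(block):
    """Replace uses of variables known to hold a constant within one block."""
    known = {}                  # variable -> constant it currently holds
    out = []
    for dest, op, a, b in block:
        a = known.get(a, a)     # substitute operands that are known constants
        b = known.get(b, b)
        out.append((dest, op, a, b))
        if op is None and isinstance(a, int):
            known[dest] = a     # dest = constant
        else:
            known.pop(dest, None)   # dest now holds something else
    return out

block = [
    ("_tmp0", None, 12, None),          # _tmp0 = 12
    ("_tmp1", "+", "arr", "_tmp0"),     # _tmp1 = arr + _tmp0
    ("_tmp2", "load", "_tmp1", None),   # _tmp2 = *(_tmp1)
]
optimized = propagate_constants(block)
```

After the pass, `_tmp1 = arr + 12` no longer reads `_tmp0`, so the assignment to `_tmp0` becomes dead and a later pass can delete it.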

Algebraic Simplification And Reassociation

Simplifications use algebraic properties or particular operator-operand combinations to

simplify expressions. Reassociation refers to using properties such as associativity,

commutativity and distributivity to rearrange an expression to enable other

optimizations such as constant-folding or loop-invariant code motion.

The most obvious of these are the optimizations that can remove useless instructions

entirely via algebraic identities. The rules of arithmetic can come in handy when

looking for redundant calculations to eliminate. Consider the examples below, which

allow you to replace an expression on the left with a simpler equivalent on the right:

x+0 = x          b && true = b
0+x = x          b && false = false
x*1 = x          b || true = true
1*x = x          b || false = b
0/x = 0
x-0 = x
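A minimal sketch of how such identities might be applied to one TAC operation follows. Operands are assumed to be variable names (strings), integer literals, or the booleans `True`/`False`; the function name `simplify` is invented. The sketch deliberately skips `0/x = 0`, since that rewrite is only safe when x is known to be nonzero.

```python
def simplify(op, a, b):
    """Return a simpler operand for `a op b`, or the (op, a, b) triple unchanged."""
    if op == "+":
        if a == 0:
            return b
        if b == 0:
            return a
    elif op == "-":
        if b == 0:
            return a
    elif op == "*":
        if a == 1:
            return b
        if b == 1:
            return a
    elif op == "&&":
        if b is True:
            return a
        if b is False:
            return False
    elif op == "||":
        if b is True:
            return True
        if b is False:
            return a
    return (op, a, b)
```

For instance, `simplify("+", "x", 0)` gives back `"x"`, and anything that matches no rule comes back untouched.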

The use of algebraic rearrangement can restructure an expression to enable constant-

folding or common sub-expression elimination and so on. Consider the Decaf code on

the far left, unoptimized TAC in middle, and rearranged and constant-folded TAC on

far right:

b = 5 + a + 10 ;      _tmp0 = 5 ;               _tmp0 = 15 ;
                      _tmp1 = _tmp0 + a ;       _tmp1 = a + _tmp0 ;
                      _tmp2 = _tmp1 + 10 ;      b = _tmp1 ;
                      b = _tmp2 ;
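The rearrangement for a chain of integer additions can be sketched like this: gather the operands, add up the constants, and keep the variables. The helper name `reassociate_sum` and the list encoding are assumptions. Note that this is only safe for integer arithmetic; floating-point addition is not associative, so a compiler cannot freely reorder it.

```python
def reassociate_sum(terms):
    """Combine the integer constants in a chain of additions.

    terms: the operands of the chain, e.g. [5, "a", 10] for 5 + a + 10.
    Returns the variables followed by one folded constant (if needed).
    """
    constant = sum(t for t in terms if isinstance(t, int))
    variables = [t for t in terms if not isinstance(t, int)]
    if constant != 0 or not variables:
        return variables + [constant]
    return variables

rearranged = reassociate_sum([5, "a", 10])
```

The result `["a", 15]` corresponds to the right-hand TAC, `a + 15`.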

Operator Strength Reduction

Operator strength reduction replaces an operator by a "less expensive" one. Given each

group of identities below, which operations are the most and least expensive, assuming

f is a float and i is an int? (Trick question: it all depends on the architectures—you

need to know your target machine to optimize well!)

i*2 = 2*i = i+i = i << 1
i/2 = (int)(i*0.5)
0-i = -i
f*2 = 2.0 * f = f + f
f/2.0 = f*0.5
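Here is a small sketch of one such rewrite, turning an integer multiply by a constant into an add or a shift. Whether the result is actually cheaper depends on the target machine, as noted above; the function name `reduce_multiply` and the string output format are invented for the example.

```python
def reduce_multiply(var, k):
    """Rewrite var * k (k a constant int) into a possibly cheaper expression."""
    if k == 0:
        return "0"
    if k == 1:
        return var
    if k == 2:
        return f"{var} + {var}"
    if k > 0 and (k & (k - 1)) == 0:            # k is a power of two
        return f"{var} << {k.bit_length() - 1}"
    return f"{var} * {k}"                        # no cheaper form known here
```

So `reduce_multiply("i", 8)` produces `i << 3`. The mirror-image rewrite of `i/2` into `i >> 1` needs more care: an arithmetic right shift rounds toward negative infinity (`-7 >> 1` is `-4`), while truncating division gives `-3`, so the shift is only a drop-in replacement when `i` is known to be non-negative.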

Strength reduction is often performed as part of loop-induction variable elimination. An

idiomatic loop to zero all the elements of an array might look like this in Decaf and its

corresponding TAC:

while (i < 100) {         _L0: _tmp2 = i < 100 ;
    arr[i] = 0;                IfZ _tmp2 Goto _L1 ;
    i = i + 1;                 _tmp4 = 4 * i ;
}                              _tmp5 = arr + _tmp4 ;
                               *(_tmp5) = 0 ;
                               i = i + 1 ;
                               Goto _L0 ;
                          _L1:

Each time through the loop, we multiply i by 4 (the element size) and add to the array

base. Instead, we could maintain the address of the current element and just

add 4 each time:

     _tmp4 = arr ;
_L0: _tmp2 = i < 100 ;
     IfZ _tmp2 Goto _L1 ;
     *_tmp4 = 0 ;
     _tmp4 = _tmp4 + 4 ;
     i = i + 1 ;
     Goto _L0 ;
_L1:

This eliminates the multiplication entirely and removes the need for an extra temporary.

By rewriting the loop termination test in terms of the element address (comparing _tmp4

against arr + 400), we could remove the variable i entirely and not bother tracking and

incrementing it at all.
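To convince yourself that the transformation preserves behavior, you can model both loops and compare what they do to memory. The sketch below treats memory as a Python dict keyed by byte address; the base address 1000 and the function names are made up for the illustration.

```python
def zero_with_multiply(mem, arr, n):
    """Original loop: recompute the element address from i each time."""
    i = 0
    while i < n:
        mem[arr + 4 * i] = 0     # _tmp4 = 4 * i ; _tmp5 = arr + _tmp4
        i = i + 1

def zero_with_pointer(mem, arr, n):
    """Transformed loop: carry the address along and drop i entirely."""
    p = arr                      # _tmp4 = arr
    end = arr + 4 * n            # computed once, outside the loop
    while p < end:               # termination test rewritten on the address
        mem[p] = 0
        p = p + 4

mem_a = {1000 + 4 * k: 7 for k in range(100)}
mem_b = dict(mem_a)
zero_with_multiply(mem_a, 1000, 100)
zero_with_pointer(mem_b, 1000, 100)
```

Both versions touch exactly the same 100 addresses and leave identical memory behind.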

Copy Propagation

This optimization is similar to constant propagation, but generalized to non-constant

values. If we have an assignment a = b in our instruction stream, we can replace later

occurrences of a with b (assuming there are no changes to either variable in-between).

Given the way we generate TAC code, this is a particularly valuable optimization since

it is able to eliminate a large number of instructions that only serve to copy values from

one variable to another.
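A local version of this can be sketched in the same style as the other block-level passes. The `(dest, op, arg1, arg2)` encoding, with `op` equal to `None` for a copy, is again an assumption of the example.

```python
def propagate_copies(block):
    """Replace uses of a with b after a copy a = b, within one basic block."""
    copy_of = {}        # a -> b for each copy a = b that is still valid
    out = []
    for dest, op, a, b in block:
        a = copy_of.get(a, a)
        b = copy_of.get(b, b)
        out.append((dest, op, a, b))
        # Assigning dest invalidates copies into dest and copies of dest.
        copy_of = {k: v for k, v in copy_of.items() if k != dest and v != dest}
        if op is None and isinstance(a, str) and a != dest:
            copy_of[dest] = a
    return out

loop_body = [
    ("_tmp10", "+", "f0", "f1"),        # _tmp10 = f0 + f1
    ("result", None, "_tmp10", None),   # result = _tmp10
    ("f0", None, "f1", None),           # f0 = f1
    ("f1", None, "result", None),       # f1 = result
]
propagated = propagate_copies(loop_body)
```

On this fragment of the Fibonacci loop, the last instruction becomes `f1 = _tmp10`, reading the temporary directly instead of going through `result`.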

main()
{
    int x, y, z;
    x = (1+20) * -x;
    y = x*x + (x/y);
    y = z = (x/y) / (x*x);
}

straight translation:

tmp1 = 1 + 20 ;
tmp2 = -x ;
x = tmp1 * tmp2 ;
tmp3 = x * x ;
tmp4 = x / y ;
y = tmp3 + tmp4 ;
tmp5 = x / y ;
tmp6 = x * x ;
z = tmp5 / tmp6 ;
y = z ;

What subexpressions can be eliminated? How can valid common subexpressions (live

ones) be determined? Here is an optimized version, after constant folding and

propagation and elimination of common sub-expressions:

tmp2 = -x ;
x = 21 * tmp2 ;
tmp3 = x * x ;
tmp4 = x / y ;
y = tmp3 + tmp4 ;
tmp5 = x / y ;
z = tmp5 / tmp3 ;
y = z ;
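The bookkeeping behind this can be sketched for one basic block: remember which variable holds each computed expression, and forget an expression as soon as one of its operands (or the variable holding it) is reassigned. The encoding and function name below are assumptions of the sketch, and the input is the TAC above after `1 + 20` has already been folded to 21.

```python
def eliminate_common_subexpressions(block):
    """Reuse earlier results for repeated expressions within one block."""
    available = {}      # (op, arg1, arg2) -> variable holding that value
    out = []
    for dest, op, a, b in block:
        key = (op, a, b)
        if op is not None and key in available:
            out.append((dest, None, available[key], None))   # dest = holder
        else:
            out.append((dest, op, a, b))
        # Assigning dest kills expressions that read dest or live in dest.
        available = {k: v for k, v in available.items()
                     if dest not in k[1:] and v != dest}
        if op is not None and dest not in (a, b):
            available.setdefault(key, dest)
    return out

tac = [
    ("tmp2", "neg", "x", None),
    ("x", "*", 21, "tmp2"),
    ("tmp3", "*", "x", "x"),
    ("tmp4", "/", "x", "y"),
    ("y", "+", "tmp3", "tmp4"),
    ("tmp5", "/", "x", "y"),
    ("tmp6", "*", "x", "x"),
    ("z", "/", "tmp5", "tmp6"),
    ("y", None, "z", None),
]
result = eliminate_common_subexpressions(tac)
```

The second `x * x` turns into the copy `tmp6 = tmp3`, while the second `x / y` is recomputed because `y` was reassigned in between. Copy propagation followed by removal of the now-unused `tmp6` would then give the optimized listing shown above.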

Global Optimizations, Data-Flow Analysis

So far we were only considering making changes within one basic block. With some

additional analysis, we can apply similar optimizations across basic blocks, making

them global optimizations. It’s worth pointing out that global in this case does not mean

across the entire program. We usually only optimize one function at a time. Inter-

procedural analysis is an even larger task, one not even attempted by some compilers.

The additional analysis the optimizer must do to perform optimizations across basic

blocks is called data-flow analysis. Data-flow analysis is much more complicated than

control-flow analysis, and we can only scratch the surface here, but if you take

CS243 (a wonderful class!) you will get to delve much deeper.

Let’s consider a global common subexpression elimination optimization as our

example. Careful analysis across blocks can determine whether an expression is alive

on entry to a block. Such an expression is said to be available at that point. Once the set

of available expressions is known, common sub-expressions can be eliminated on a

global basis.

Each block is a node in the flow graph of a program. The successor set (succ(x)) for a

node x is the set of all nodes that x directly flows into. The predecessor set (pred(x)) for

a node x is the set of all nodes that flow directly into x. An expression is defined at the

point where it is assigned a value and killed when one of its operands is subsequently

assigned a new value. An expression is available at some point p in a flow graph if every

path leading to p contains a prior definition of that expression which is not

subsequently killed.

avail[B] = set of expressions available on entry to block B

exit[B] = set of expressions available on exit from B

avail[B] = ∩ exit[x]: x ∈ pred[B] (i.e. B has available the intersection of the

exit of its predecessors)

killed[B] = set of the expressions killed in B

defined[B] = set of expressions defined in B

exit[B] = avail[B] - killed[B] + defined[B]

avail[B] = ∩ (avail[x] - killed[x] + defined[x]) : x ∈ pred[B]

Here is an algorithm for global common sub-expression elimination:

1) First, compute defined and killed sets for each basic block (this does not involve

any of its predecessors or successors).

2) Iteratively compute the avail and exit sets for each block by running the

following algorithm until you hit a stable fixed point:

a) Identify each statement s of the form a = b op c in some block B such that

b op c is available at the entry to B and neither b nor c is redefined in B

prior to s.

b) Follow flow of control backward in the graph passing back to but not

through each block that defines b op c. The last computation of b op c in

such a block reaches s.

c) After each computation d = b op c identified in step 2a, add

statement t = d to that block where t is a new temp.

d) Replace s by a = t.

Try an example to make things clearer:

main:
    BeginFunc 28 ;
    b = a + 2 ;
    c = 4 * b ;
    tmp1 = b < c ;
    ifNZ tmp1 goto L1 ;
    b = 1 ;
L1:
    d = a + 2 ;
    EndFunc ;
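The set equations can be run on this example with a short fixed-point loop. The sketch below is one possible encoding, not the handout's own: the code is split by hand into three blocks named B0, B1 and B2, each statement is a `(dest, expr)` pair with `expr` either an `(op, a, b)` tuple or `None`, and every non-entry block starts out optimistically with all expressions available.

```python
def local_sets(block, all_exprs):
    """Compute (defined, killed) for one block, in statement order."""
    defined, killed = set(), set()
    for dest, expr in block:
        if expr is not None:
            defined.add(expr)
        dead = {e for e in all_exprs if dest in e[1:]}   # exprs that read dest
        killed |= dead
        defined -= dead
    return defined, killed

def available_expressions(blocks, preds):
    """Iterate avail[B] = intersection of exit[x] for x in pred[B]."""
    all_exprs = {e for blk in blocks.values() for _, e in blk if e is not None}
    local = {name: local_sets(blk, all_exprs) for name, blk in blocks.items()}
    avail = {name: (set(all_exprs) if preds[name] else set()) for name in blocks}
    changed = True
    while changed:
        changed = False
        for name in blocks:
            if not preds[name]:
                continue                     # entry block: nothing available
            exits = []
            for p in preds[name]:
                defined, killed = local[p]
                exits.append((avail[p] - killed) | defined)
            new = set.intersection(*exits)
            if new != avail[name]:
                avail[name], changed = new, True
    return avail

blocks = {
    "B0": [("b", ("+", "a", 2)), ("c", ("*", 4, "b")), ("tmp1", ("<", "b", "c"))],
    "B1": [("b", None)],                     # b = 1
    "B2": [("d", ("+", "a", 2))],            # L1: d = a + 2
}
preds = {"B0": [], "B1": ["B0"], "B2": ["B0", "B1"]}
avail = available_expressions(blocks, preds)
```

Only `a + 2` is available on entry to B2: the assignment `b = 1` in B1 kills `4 * b` and `b < c`, but leaves `a + 2` alone, so `d = a + 2` can reuse the earlier computation.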

For loop L, moving invariant statement s in block B which defines variable v outside the

loop is a safe optimization if:

1. B dominates all exits from L

2. No other statement assigns a value to v

3. All uses of v inside L are from the definition in s.

Loop invariant code can be moved to just above the entry point to the loop.

Machine Optimizations

In final code generation, there is a lot of opportunity for cleverness in generating

efficient target code. In this pass, specific machine features (specialized instructions,

hardware pipeline abilities, register details) are taken into account to produce code

optimized for this particular architecture.

Register Allocation

One machine optimization of particular importance is register allocation, which is

perhaps the single most effective optimization for all architectures. Registers are the

fastest kind of memory available, but as a resource, they can be scarce. The problem is

how to minimize traffic between the registers and what lies beyond them in the

memory hierarchy to eliminate time wasted sending data back and forth across the bus

and the different levels of caches.

Your Decaf back-end uses a very naïve and inefficient means of assigning registers; it

just fills them before performing an operation and spills them right afterwards. A much

more effective strategy would be to consider which variables are more heavily in

demand and keep those in registers and spill those that are no longer needed or won't

be needed until much later.

One common register allocation technique is called "register coloring", after the central

idea to view register allocation as a graph coloring problem. If we have 8 registers, then

we try to color a graph with eight different colors. The graph’s nodes are made of

"webs" and the arcs are determined by calculating interference between the webs. A

web represents a variable’s definitions, places where it is assigned a value (as in x = …),

and the possible different uses of those definitions (as in y = x + 2). This problem, in fact,

can be approached as another graph. The definition and uses of a variable are nodes,

and if a definition reaches a use, there is an arc between the two nodes. If two portions

of a variable’s definition-use graph are unconnected, then we have two separate webs

for a variable. In the interference graph for the routine, each node is a web. We seek to

determine which webs don't interfere with one another, so we know we can use the

same register for those two variables. For example, consider the following code:

i = 10;
j = 20;
x = i + j;
y = j + k;

We say that i interferes with j because at least one pair of i’s definitions and uses is

separated by a definition or use of j; thus, i and j are "alive" at the same time. A

variable is alive between the time it has been defined and that definition’s last use, after

which the variable is dead. If two variables interfere, then we cannot use the same

register for each. But two variables that don't interfere can occupy the same register,

since their live ranges do not overlap.
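For straight-line code like the four statements above, interference can be found with one backward pass that tracks which variables are live. This sketch simplifies things by treating each variable as a single web, and it assumes that x and y are still needed after the last statement; the `(dest, uses)` encoding and the function name are made up for the example.

```python
def interference(block, live_out=()):
    """Return the set of interfering pairs (as frozensets) for one block."""
    live = set(live_out)
    edges = set()
    for dest, uses in reversed(block):
        # dest interferes with everything that is live after this statement.
        for other in live - {dest}:
            edges.add(frozenset((dest, other)))
        live.discard(dest)       # dest is dead before its own definition
        live.update(uses)        # operands are live before the statement
    return edges

code = [
    ("i", []),              # i = 10
    ("j", []),              # j = 20
    ("x", ["i", "j"]),      # x = i + j
    ("y", ["j", "k"]),      # y = j + k
]
edges = interference(code, live_out={"x", "y"})
```

The pass reports that i and j interfere, as described above. It also shows that i and x do not: the last use of i is in the very statement that defines x, so the two could share a register.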

Once we have the interference graph constructed, we r-color it so that no two adjacent

nodes share the same color (r is the number of registers we have, each color represents a

different register). You may recall that graph-coloring is NP-complete, so we employ a

heuristic rather than an optimal algorithm. Here is a simplified version of something

that might be used:

1. Find the node with the fewest neighbors. (Break ties arbitrarily.)

2. Remove it from the interference graph and push it onto a stack

3. Repeat steps 1 and 2 until the graph is empty.

4. Now, rebuild the graph as follows:

a. Take the top node off the stack and reinsert it into the graph

b. Choose a color for it that differs from the colors of all its neighbors presently in

the graph, rotating colors in case there is more than one choice.

c. Repeat a and b until the graph is either completely rebuilt, or there is no

color available to color the node.

If we get stuck, then the graph may not be r-colorable; we could try again with a

different heuristic, say reusing colors as often as possible. If there is no other choice, we have to

spill a variable to memory.
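The steps above can be sketched roughly as follows. Two simplifications are assumed: ties are broken by node name, and the lowest-numbered free color is always chosen instead of rotating among the choices. The example graph is an interference graph for the variables i, j, k, x and y, assuming x and y stay live after the last statement of the earlier four-line snippet.

```python
def color_graph(graph, r):
    """Color an interference graph with r colors.

    graph: {node: set of neighbors}, with symmetric edges.
    Returns {node: color index}, or None if the heuristic gets stuck.
    """
    work = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while work:
        # Steps 1-3: repeatedly remove the node with the fewest neighbors.
        node = min(work, key=lambda n: (len(work[n]), n))
        stack.append(node)
        for neighbor in work.pop(node):
            work[neighbor].discard(node)
    colors = {}
    while stack:
        # Step 4: reinsert nodes and give each a color no neighbor has.
        node = stack.pop()
        used = {colors[n] for n in graph[node] if n in colors}
        free = [c for c in range(r) if c not in used]
        if not free:
            return None      # stuck: a real allocator would choose a spill
        colors[node] = free[0]
    return colors

graph = {
    "i": {"j", "k"},
    "j": {"i", "k", "x"},
    "k": {"i", "j", "x"},
    "x": {"j", "k", "y"},
    "y": {"x"},
}
three = color_graph(graph, 3)
two = color_graph(graph, 2)
```

With three registers the sketch finds a valid assignment. With two it gets stuck, because i, j and k are all live at once and form a triangle in the graph.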

Instruction Scheduling

Another extremely important optimization of the final code generator is instruction

scheduling. Because many machines, including most RISC architectures, have some

sort of pipelining capability, effectively harnessing that capability requires judicious

ordering of instructions.

In MIPS, each instruction is issued in one cycle, but some take multiple cycles to

complete. It takes an additional cycle before the value of a load is available and two

cycles for a branch to reach its destination, but an instruction can be placed in the "delay

slot" after a branch and executed in that slack time. On the left is one arrangement of a

set of instructions that requires 7 cycles. It assumes no hardware interlock and thus

Optimization Soup

You might wonder about the interactions between the various optimization techniques.

Some transformations may expose possibilities for others, and even the reverse is true,

one optimization may obscure or remove possibilities for others. Algebraic

rearrangement may allow for common subexpression elimination or code motion.

Constant folding usually paves the way for constant propagation and then it turns out

to be useful to run another round of constant folding and so on. How do you know you

are done? You don't!

As one compiler textbook author (Pyster) puts it:

"Adding optimizations to a compiler is a lot like eating chicken soup when you have

a cold. Having a bowl full never hurts, but who knows if it really helps. If the

optimizations are structured modularly so that the addition of one does not increase

compiler complexity, the temptation to fold in another is hard to resist. How well

the techniques work together or against each other is hard to determine."

Bibliography

A. Aho, R. Sethi, J.D. Ullman, Compilers: Principles, Techniques, and Tools, Reading, MA:

Addison-Wesley, 1986.

J.P. Bennett, Introduction to Compiling Techniques. Berkshire, England: McGraw-Hill, 1990.

R. Mak, Writing Compilers and Interpreters. New York, NY: Wiley, 1991.

S. Muchnick, Advanced Compiler Design and Implementation. San Francisco, CA: Morgan

Kaufmann, 1997.

A. Pyster, Compiler Design and Construction. Reinhold, 1988.