Code Optimization: Techniques and Strategies for Improving Performance, Slides of Introduction to Computers

This document from a CS 213 class in 1998 covers techniques for optimizing code: basic optimizations (reduction in strength, code motion, and common subexpression sharing) and advanced optimizations (code scheduling, unrolling, and pipelining). It also discusses optimization blockers and gives examples and advice for improving performance without destroying code modularity and generality.

Code Optimization
November 3, 1998
Topics
Basic optimizations
  • Reduction in strength
  • Code motion
  • Common subexpression sharing
Optimization blockers
Advanced optimizations
  • Code scheduling
  • Unrolling & pipelining
Advice
15-213
class21.ppt

CS 213 F’


Great Reality

There’s more to performance than asymptotic complexity

Constant factors matter too!

Easily see 10:1 performance range depending on how code written

Must optimize at multiple levels: algorithm, data representations, procedures, and loops

Must understand system to optimize performance

How programs compiled and executed

How to measure program performance and identify bottlenecks

How to improve performance without destroying code modularity and generality


Limitations of Optimizing Compilers

Work under Tight Restriction

Cannot perform optimization if changes behavior under any realizable circumstance

Even if circumstances seem quite bizarre

Have No Understanding of Application

Limited information about data ranges

Don’t always make best trade-offs

Some Don’t Try Very Hard

Increase cost of compilation

More chances for compiler errors


Basic Optimizations

Reduction in Strength

Replace costly operation with simpler one

Shift, add instead of multiply or divide

  • Integer multiplication requires 8-16 cycles on the Alpha 21164

Procedure with no stack frame

Keep data in registers rather than memory

Pointer arithmetic

Code Motion

Reduce frequency with which computation performed

  • If it will always produce same result
  • Especially moving code out of loop

Share Common Subexpressions

Reuse portions of expressions
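A minimal sketch of sharing common subexpressions follows. The function names, the row-major grid layout, and the neighbor-sum task are illustrative assumptions, not taken from the slides. The second version computes the index i*n + j once and derives the four neighbor indices from it.

```c
#include <assert.h>

/* Hypothetical example: sum the four neighbors of element (i, j) in an
 * n x n grid stored in row-major order. Requires 1 <= i, j <= n - 2. */

/* Direct version: every index repeats a multiplication by n. */
long neighbor_sum_naive(const long *val, long n, long i, long j)
{
    assert(i >= 1 && j >= 1 && i < n - 1 && j < n - 1);
    long up    = val[(i - 1) * n + j];
    long down  = val[(i + 1) * n + j];
    long left  = val[i * n + j - 1];
    long right = val[i * n + j + 1];
    return up + down + left + right;
}

/* Shared version: i*n + j is computed once and reused. */
long neighbor_sum_shared(const long *val, long n, long i, long j)
{
    assert(i >= 1 && j >= 1 && i < n - 1 && j < n - 1);
    long inj   = i * n + j;      /* the common subexpression */
    long up    = val[inj - n];
    long down  = val[inj + n];
    long left  = val[inj - 1];
    long right = val[inj + 1];
    return up + down + left + right;
}
```

Both functions return the same value; the shared version needs one multiplication instead of three or four, depending on what the compiler already folds.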


Multiply / Divide Example

Unsigned integers, power of 2

Most possible optimizations

    void uweight4(unsigned long x, unsigned long *dest)
    {
        dest[0] = 4*x;
        dest[1] = 4*4*x;
        dest[2] = -4*x;
        dest[3] = x / 4;
        dest[4] = x % 4;
    }

Code Sequences

    s4addq    # 4x
    stq       # dest[0] = 4x
    sll       # 16x
    stq       # dest[1] = 16x
    lda
    mulq      # -4x
    stq       # dest[2] = -4x
    srl       # x / 4
    stq       # dest[3] = x / 4
    and       # x % 4
    stq       # dest[4] = x % 4
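The unsigned example above can also be written at the source level with the shifts and masks that the code sequences use. This is only a sketch: the name uweight4_shifts is hypothetical, and a modern compiler would normally make these substitutions itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counterpart to uweight4: same results, written with the
 * shift and mask operations a compiler may substitute for unsigned operands. */
void uweight4_shifts(unsigned long x, unsigned long *dest)
{
    assert(dest != NULL);
    dest[0] = x << 2;            /* 4*x */
    dest[1] = x << 4;            /* 4*4*x = 16*x */
    dest[2] = 0UL - (x << 2);    /* -4*x, wraps modulo 2^N for unsigned */
    dest[3] = x >> 2;            /* x / 4 */
    dest[4] = x & 3;             /* x % 4 */
}
```

For unsigned operands these identities hold for every value of x, which is why the compiler is free to apply them.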


Multiply / Divide Example

Signed integers, power of 2

Multiplication same as for unsigned

Correct rounding of negatives for division

Shift / And combination would produce positive remainder

    void weight4(long int x, long int *dest)
    {
        dest[3] = x / 4;
        dest[4] = x % 4;
    }

Division Code

    addq      # x + 3
    cmovge    # if (x >= 0), x
    sra       # x / 4
    stq       # dest[3] = x / 4
    s4addq    # 4 * (x / 4)
    subq      # x - (4 * (x / 4))
    stq       # dest[4] = x % 4
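The signed case can be sketched in C as follows. The helper names are hypothetical, and the code assumes that >> on a negative long is an arithmetic shift, which is implementation-defined in C but is what mainstream compilers do. Negative values are biased by 3 before shifting so the quotient rounds toward zero, and the remainder is recovered by subtraction, in the same spirit as the addq/cmovge/sra and s4addq/subq steps.

```c
#include <assert.h>

/* Hypothetical helper: x / 4 for signed x, rounding toward zero.
 * Assumes >> on a negative long is an arithmetic shift. */
long div4_shift(long x)
{
    long biased = (x >= 0) ? x : x + 3;   /* bias negative values only */
    return biased >> 2;                   /* arithmetic shift right by 2 */
}

/* x % 4 recovered as x - 4 * (x / 4); the sign follows x. */
long rem4_shift(long x)
{
    return x - 4 * div4_shift(x);
}

/* Compare the shift-based results with the compiler's own / and %. */
void check_div4(long x)
{
    assert(div4_shift(x) == x / 4);
    assert(rem4_shift(x) == x % 4);
}
```

A plain shift and mask would give -2 and 1 for x = -7, while C requires -1 and -3; the bias is what makes up the difference.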


Omitting Stack Frame

Reduces strength of general procedure call

Leaf Procedure

Does not call any other procedures

All Local Variables Can be Held in Registers

Not too many

No local structures or arrays

  • Suppose allocate array int a[6] as registers

  » How would you generate code for a[i]?

No address operations

  • &x cannot be generated if x is in register

Performance Improvements

Minor saving in stack space

Eliminates time to setup and undo stack frame


Keeping Data in Registers

Computing Integer Sum

z = x + y

Integer data stored in registers

    addq $1, $2, $

  • 1 clock cycle

Data addresses stored in registers

    ldq $4, 0($1)
    ldq $5, 0($2)
    addq $4, $5, $6
    stq $6, 0($3)

  • 4 clock cycles

Computing Double Precision Sum

z = x + y

Register data: 4 clock cycles

Memory data: 7 clock cycles


Memory Optimization Example (Cont.)

Procedure product

Compute product of array elements and store at *dest

Accumulate in register

Each iteration takes ~6 cycles (roughly twice as fast)

    # i in $2, prod in $f10; vals and cnt also held in registers
    Loop: s8addq                     # $1 = &vals[i]
          ldt    $f1,0($1)           # $f1 = vals[i]
          mult   $f10,$f1,$f10       # prod *= vals[i]
          addq   $2,1,$2             # i++
          cmplt                      # if (i


Blocker #1: Memory Aliasing

Aliasing

Two different memory references specify single location

Example

double a[3] = {

product1(a, &a[2],

product2(a, &a[2],

Observations

Easy to have happen in C

  • Since allowed to do address arithmetic
  • Direct access to storage structures

Get in habit of introducing local variables

  • Accumulating within loops
  • Your way of telling compiler not to check for aliasing
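To make the aliasing hazard concrete, here is a minimal sketch. The function names and array values are illustrative assumptions; the slides' own product1 and product2 bodies are not reproduced in this text. The first version accumulates directly in *dest, the second in a local variable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical version A: accumulates in *dest, so every iteration
 * reads and writes memory. */
void product_in_memory(double vals[], double *dest, long cnt)
{
    assert(vals != NULL && dest != NULL);
    *dest = 1.0;
    for (long i = 0; i < cnt; i++)
        *dest *= vals[i];
}

/* Hypothetical version B: accumulates in a local variable and stores
 * the result once at the end. */
void product_in_local(double vals[], double *dest, long cnt)
{
    assert(vals != NULL && dest != NULL);
    double prod = 1.0;
    for (long i = 0; i < cnt; i++)
        prod *= vals[i];
    *dest = prod;
}
```

With `double a[3] = {1.0, 3.0, 5.0}` and `dest = &a[2]`, version A overwrites a[2] before reading it and leaves 9.0 there, while version B leaves 15.0. Because the two can differ when dest aliases vals, a compiler cannot turn A into B on its own; introducing the local variable is the programmer's statement that the difference does not matter.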


Pointer Code Example

Procedure product

Compute product of array elements and store at *dest

Each iteration takes ~5 cycles

  • Can’t do much better (in this case), since mult takes 4 cycles

With more functional units or lower mult latency, we could do better:

  • requires loop unrolling or software pipelining (discussed later)

    void product3(double vals[], double *dest, long int cnt)
    {
        double *val_end = vals+cnt;
        double prod = 1.0;
        if (cnt <= 0) {
            *dest = prod;
            return;
        }
        while (vals != val_end)
            prod = prod * *vals++;
        *dest = prod;
    }

    # $16 = vals, $2 = val_end, $f10 = prod
    Loop: ldt    $f1,0($16)          # $f1 = *vals
          mult   $f10,$f1,$f10       # prod *= *vals
          addq   $16,8,$16           # vals++
          subq   $16,$2,$1           # if (vals != val_end)
          bne    $1,Loop             #   continue looping


Code Motion

Move Computation out of Frequently Executed Section

if guaranteed to always give same result

out of loop

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }


Code Motion Examples

Sum Integers from 1 to n!

Bad

    sum = 0;
    for (i = 0; i <= fact(n); i++)
        sum += i;

Better

    sum = 0;
    fn = fact(n);
    for (i = 0; i <= fn; i++)
        sum += i;

Better

    sum = 0;
    for (i = fact(n); i > 0; i--)
        sum += i;

Best

    fn = fact(n);
    sum = fn * (fn + 1) / 2;
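The variants on this slide can be compared in a small, self-contained sketch. The slides do not show the definition of fact, so the one below is a hypothetical stand-in, and the sum_* names are assumptions made for illustration. The closed form relies on the standard arithmetic-series identity 1 + 2 + ... + m = m(m + 1)/2.

```c
#include <assert.h>

/* Hypothetical factorial; n is kept small so the loop versions finish
 * quickly and nothing overflows a long long. */
long long fact(long long n)
{
    assert(n >= 0 && n <= 10);
    long long f = 1;
    for (long long k = 2; k <= n; k++)
        f *= k;
    return f;
}

/* Bad: fact(n) is re-evaluated on every loop test. */
long long sum_bad(long long n)
{
    long long sum = 0;
    for (long long i = 0; i <= fact(n); i++)
        sum += i;
    return sum;
}

/* Better: the loop-invariant call is moved out of the loop by hand. */
long long sum_better(long long n)
{
    long long fn = fact(n);
    long long sum = 0;
    for (long long i = 0; i <= fn; i++)
        sum += i;
    return sum;
}

/* Best: no loop at all, using 1 + 2 + ... + m = m(m + 1)/2. */
long long sum_best(long long n)
{
    long long fn = fact(n);
    return fn * (fn + 1) / 2;
}
```

All three return the same value for a given n; for n = 4, fact(n) is 24 and each returns 300. Only the amount of work differs.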


Blocker #2: Procedure Calls

Why couldn’t the compiler move fact(n) out of the inner loop?

Procedure May Have Side Effects

i.e., alters global state each time called

Function May Not Return Same Value for Given Arguments

Depends on other parts of global state

Why doesn’t compiler look at code for fact(n)?

Linker may overload with different version

  • Unless declared static

Interprocedural optimization is not used extensively due to cost

Warning

Compiler treats procedure call as a black box

Weak optimizations in and around them
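A sketch of why a call cannot always be moved out of a loop: the hypothetical procedure limit below (not from the slides) counts its own invocations in a global variable. Hoisting the call leaves the sum unchanged but changes that global state, so a compiler that treats the call as a black box has to leave it in place.

```c
#include <assert.h>

long calls = 0;    /* global state modified by limit() */

/* Hypothetical procedure with a side effect: it counts its own calls. */
long limit(long n)
{
    assert(n >= 0);
    calls++;
    return n;
}

/* Call stays in the loop test: limit() runs once per test. */
long sum_call_in_loop(long n)
{
    long sum = 0;
    for (long i = 0; i <= limit(n); i++)
        sum += i;
    return sum;
}

/* Call hoisted by hand: limit() runs exactly once. */
long sum_call_hoisted(long n)
{
    long lim = limit(n);
    long sum = 0;
    for (long i = 0; i <= lim; i++)
        sum += i;
    return sum;
}
```

For n = 3 both functions return 6, but the first leaves calls at 5 and the second at 1. Only the programmer knows whether that difference is acceptable, which is why the hoisting is done by hand.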