Program Optimization, Lecture Slide - Computer Science, Slides of Computer System Design and Architecture

Generally Useful Optimizations, Code Motion, Precomputation, Strength Reduction, Sharing of Common Subexpressions, Removing Unnecessary Procedure Calls, Optimization Blockers, Exploiting Instruction-Level Parallelism, Dealing with Conditionals

Carnegie Mellon

Program Optimization

15-213: Introduction to Computer Systems

25th Lecture, Nov. 23, 2010

Instructors: Randy Bryant and Dave O'Hallaron


Today

■ Overview
■ Generally Useful Optimizations
  ▪ Code motion/precomputation
  ▪ Strength reduction
  ▪ Sharing of common subexpressions
  ▪ Removing unnecessary procedure calls
■ Optimization Blockers
  ▪ Procedure calls
  ▪ Memory aliasing
■ Exploiting Instruction-Level Parallelism
■ Dealing with Conditionals

Optimizing Compilers

■ Provide efficient mapping of program to machine
  ▪ register allocation
  ▪ code selection and ordering (scheduling)
  ▪ dead code elimination
  ▪ eliminating minor inefficiencies
■ Don't (usually) improve asymptotic efficiency
  ▪ up to programmer to select best overall algorithm
  ▪ big-O savings are (often) more important than constant factors
  ▪ but constant factors also matter
■ Have difficulty overcoming "optimization blockers"
  ▪ potential memory aliasing
  ▪ potential procedure side-effects

Limitations of Optimizing Compilers

■ Operate under fundamental constraint
  ▪ Must not cause any change in program behavior
  ▪ Often prevents them from making optimizations that would only affect behavior under pathological conditions
■ Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  ▪ e.g., data ranges may be more limited than variable types suggest
■ Most analysis is performed only within procedures
  ▪ Whole-program analysis is too expensive in most cases
■ Most analysis is based only on static information
  ▪ Compiler has difficulty anticipating run-time inputs
■ When in doubt, the compiler must be conservative
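To make the "pathological conditions" point concrete, here is a minimal sketch. The names twiddle1 and twiddle2 are illustrative and do not appear in these slides. The two functions look interchangeable, and the second does less memory traffic, but they give different results when both pointers refer to the same variable. A compiler that cannot rule out that case may not rewrite the first into the second.

```c
/* Adds *yp to *xp twice: reads *yp twice and writes *xp twice. */
void twiddle1(long *xp, long *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* Looks equivalent and is cheaper: one read of *yp, one write of *xp. */
void twiddle2(long *xp, long *yp)
{
    *xp += 2 * *yp;
}
```

With distinct variables both functions add twice the second value to the first. When called as twiddle1(&x, &x), the value of x is quadrupled; twiddle2(&x, &x) only triples it.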

Compiler-Generated Code Motion

Source:

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

Equivalent code after n*i is moved out of the loop:

    long j;
    long ni = n*i;
    double *rowp = a+ni;
    for (j = 0; j < n; j++)
        *rowp++ = b[j];

Generated assembly:

    set_row:
        testq %rcx, %rcx            # Test n
        jle .L4                     # If n <= 0, goto done
        movq %rcx, %rax             # rax = n
        imulq %rdx, %rax            # rax *= i
        leaq (%rdi,%rax,8), %rdx    # rowp = A + n*i*8
        movl $0, %r8d               # j = 0
    .L3:                            # loop:
        movq (%rsi,%r8,8), %rax     # t = b[j]
        movq %rax, (%rdx)           # *rowp = t
        addq $1, %r8                # j++
        addq $8, %rdx               # rowp++
        cmpq %r8, %rcx              # Compare n:j
        jg .L3                      # If >, goto loop
    .L4:                            # done:
        rep ; ret

Where are the FP operations?

Reduction in Strength

▪ Replace costly operation with simpler one
▪ Shift, add instead of multiply or divide

    16*x --> x << 4

  ▪ Utility machine dependent
  ▪ Depends on cost of multiply or divide instruction
    - On Intel Nehalem, integer multiply requires 3 CPU cycles
▪ Recognize sequence of products

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

becomes

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
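The two loop forms above can be checked against each other with a small sketch. The function names fill_mul, fill_add, fills_match and times16_shift are hypothetical helpers introduced here for illustration; the fixed buffer size of 8 x 8 is likewise an assumption to keep the example short.

```c
/* Multiply-based indexing: n*i appears in every inner iteration. */
void fill_mul(double *a, const double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
}

/* Strength-reduced form: the product n*i is replaced by a running sum. */
void fill_add(double *a, const double *b, long n)
{
    long i, j;
    long ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
}

/* Shift form of 16*x for unsigned values. */
unsigned long times16_shift(unsigned long x)
{
    return x << 4;
}

/* Returns 1 if both versions fill an n x n matrix identically (0 <= n <= 8). */
int fills_match(long n)
{
    double b[8], a1[64], a2[64];
    long k;
    if (n < 0 || n > 8)
        return 0;
    for (k = 0; k < n; k++)
        b[k] = (double)(k + 1);
    for (k = 0; k < n * n; k++) {
        a1[k] = -1.0;   /* different fill values so a skipped store is noticed */
        a2[k] = -2.0;
    }
    fill_mul(a1, b, n);
    fill_add(a2, b, n);
    for (k = 0; k < n * n; k++)
        if (a1[k] != a2[k])
            return 0;
    return 1;
}
```

Whether the rewritten form is actually faster depends on the machine, as the slide notes; the sketch only confirms that the results agree.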

Optimization Blocker #1: Procedure Calls

■ Procedure to Convert String to Lower Case
  ▪ Extracted from 213 lab submissions, Fall, 1998

    void lower(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

Lower Case Conversion Performance

▪ Time quadruples when string length doubles
▪ Quadratic performance

[Figure: CPU seconds (axis 0 to 200) versus string length (axis 0 to 500,000) for lower]

Calling Strlen

■ Strlen performance
  ▪ Only way to determine length of string is to scan its entire length, looking for null character
■ Overall performance, string of length N
  ▪ strlen is called on every loop test, about N times
  ▪ Each call scans the whole string from the start, so each takes time proportional to N
  ▪ Overall O(N^2) performance

    /* My version of strlen */
    size_t strlen(const char *s)
    {
        size_t length = 0;
        while (*s != '\0') {
            s++;
            length++;
        }
        return length;
    }
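As a rough cost model, assume one unit of work per character scanned and note that the loop test in lower runs N+1 times on a string of length N, with strlen(s) starting from the beginning of s each time. Under those assumptions, which are a simplification and not a measurement, the quadratic growth and the fourfold slowdown on doubling follow directly:

```latex
T(N) \approx (N+1)\cdot N = N^2 + N,
\qquad
\frac{T(2N)}{T(N)} = \frac{4N^2 + 2N}{N^2 + N} \longrightarrow 4
\quad (N \to \infty)
```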

Improving Performance

▪ Move call to strlen outside of loop
▪ Since result does not change from one iteration to another
▪ Form of code motion

    void lower(char *s)
    {
        int i;
        int len = strlen(s);
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

Optimization Blocker: Procedure Calls

■ Why couldn't compiler move strlen out of inner loop?
  ▪ Procedure may have side effects
    ▪ Alters global state each time called
  ▪ Function may not return same value for given arguments
    ▪ Depends on other parts of global state
    ▪ Procedure lower could interact with strlen
■ Warning:
  ▪ Compiler treats procedure call as a black box
  ▪ Weak optimizations near them
■ Remedies:
  ▪ Use of inline functions
    ▪ GCC does this with -O
    ▪ See web aside ASM:OPT
  ▪ Do your own code motion

A version of strlen with a side effect:

    int lencnt = 0;
    size_t strlen(const char *s)
    {
        size_t length = 0;
        while (*s != '\0') {
            s++; length++;
        }
        lencnt += length;
        return length;
    }
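The side-effecting strlen above shows why the compiler cannot hoist the call on its own: moving it changes the final value of lencnt. The sketch below makes that observable. The names counting_strlen, lower_in_loop and lower_hoisted are hypothetical, chosen here to avoid redefining the library strlen.

```c
#include <stddef.h>

long lencnt = 0;   /* global state updated by every call */

/* Same logic as the slide's strlen, under a different (hypothetical) name. */
size_t counting_strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    lencnt += (long)length;
    return length;
}

/* Call in the loop test: counting_strlen runs once per test. */
void lower_in_loop(char *s)
{
    size_t i;
    for (i = 0; i < counting_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Call hoisted by hand: counting_strlen runs exactly once. */
void lower_hoisted(char *s)
{
    size_t i;
    size_t len = counting_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

Both functions produce the same lowercase string, but for the three-character input "ABC" the first leaves lencnt at 12 (four loop tests, each adding 3) and the second leaves it at 3. Since the two versions differ in observable state, only the programmer, who knows the counter is irrelevant, can safely make the change.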

Memory Matters

    /* Sum rows of n X n matrix a and store in vector b */
    void sum_rows1(double *a, double *b, long n) {
        long i, j;
        for (i = 0; i < n; i++) {
            b[i] = 0;
            for (j = 0; j < n; j++)
                b[i] += a[i*n + j];
        }
    }

sum_rows1 inner loop:

    .L53:
        addsd (%rcx), %xmm0          # FP add
        addq $8, %rcx
        decq %rax
        movsd %xmm0, (%rsi,%r8,8)    # FP store
        jne .L53

▪ Code updates b[i] on every iteration
▪ Why couldn't compiler optimize this away?

Removing Aliasing

    /* Sum rows of n X n matrix a and store in vector b */
    void sum_rows2(double *a, double *b, long n) {
        long i, j;
        for (i = 0; i < n; i++) {
            double val = 0;
            for (j = 0; j < n; j++)
                val += a[i*n + j];
            b[i] = val;
        }
    }

sum_rows2 inner loop:

    .L66:
        addsd (%rcx), %xmm0          # FP add
        addq $8, %rcx
        decq %rax
        jne .L66

▪ No need to store intermediate results

Optimization Blocker: Memory Aliasing

■ Aliasing
  ▪ Two different memory references specify single location
  ▪ Easy to have happen in C
    ▪ Since allowed to do address arithmetic
    ▪ Direct access to storage structures
  ▪ Get in habit of introducing local variables
    ▪ Accumulating within loops
    ▪ Your way of telling compiler not to check for aliasing
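A short sketch of what aliasing does to the row-sum code. The function names sum_rows_mem and sum_rows_local are hypothetical stand-ins for the store-every-iteration and local-accumulator versions, and the 3 x 3 test matrix is an assumed example. The output vector is made to overlap the second row of the input matrix, which is legal C.

```c
/* Stores into b[i] on every inner iteration. */
void sum_rows_mem(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

/* Accumulates in a local variable and stores once per row. */
void sum_rows_local(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```

Take the matrix {0, 1, 2, 4, 8, 16, 32, 64, 128} and pass b = a + 3, so that b overlaps the second row. sum_rows_mem then leaves 3, 22, 224 in b, because the stores to b[1] modify the row while it is being summed. sum_rows_local leaves 3, 27, 224. With a separate output vector both give 3, 28, 224. Because the results differ in the overlapping case, the compiler cannot turn the first version into the second by itself; writing the local accumulator explicitly is how the programmer states that the overlap does not matter.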