Download Program Optimization, Lecture Slide - Computer Science and more Slides Computer System Design and Architecture in PDF only on Docsity!
Program Opmizaon
15-Āā213: Introduc0on to Computer Systems
th
Lecture, Nov. 23, 2010
Instructors:
Randy Bryant and Dave OāHallaron
Today
ļ¢ Overview
ļ¢ Generally Useful Opmizaons
ļ§ Code mo0on/precomputa0on
ļ§ Strength reduc0on
ļ§ Sharing of common subexpressions
ļ§ Removing unnecessary procedure calls
ļ¢ Opmizaon Blockers
ļ§ Procedure calls
ļ§ Memory aliasing
ļ¢ Exploing Instrucon-ĀāLevel Parallelism
ļ¢ Dealing with Condi*onals
Op*mizing Compilers
ļ¢ Provide efficient mapping of program to machine
ļ§ register alloca0on
ļ§ code selec0on and ordering (scheduling)
ļ§ dead code elimina0on
ļ§ elimina0ng minor inefficiencies
ļ¢ Donāt (usually) improve asympto*c efficiency
ļ§ up to programmer to select best overall algorithm
ļ§ big-ĀāO savings are (oWen) more important than constant factors
ļ§ but constant factors also maQer
ļ¢ Have difficulty overcoming āopmizaon blockersā
ļ§ poten0al memory aliasing
ļ§ poten0al procedure side-Āāeffects
Limitaons of Opmizing Compilers
ļ¢ Operate under fundamental constraint
ļ§ Must not cause any change in program behavior
ļ§ OWen prevents it from making op0miza0ons when would only affect behavior
under pathological condi0ons.
ļ¢ Behavior that may be obvious to the programmer can be obfuscated by
languages and coding styles
ļ§ e.g., Data ranges may be more limited than variable types suggest
ļ¢ Most analysis is performed only within procedures
ļ§ Whole-Āāprogram analysis is too expensive in most cases
ļ¢ Most analysis is based only on sta1c informa*on
ļ§ Compiler has difficulty an0cipa0ng run-Āā0me inputs
ļ¢ When in doubt, the compiler must be conserva*ve
Compiler-ĀāGenerated Code Mo*on
set_row: testq %rcx, %rcx # Test n jle .L4 # If 0, goto done movq %rcx, %rax # rax = n imulq %rdx, %rax # rax = i leaq (%rdi,%rax,8), %rdx # rowp = A + ni* movl $0, %r8d # j = 0 .L3: # loop: movq (%rsi,%r8,8), %rax # t = b[j] movq %rax, (%rdx) # rowp = t addq $1, %r8 # j++ addq $8, %rdx # rowp++ cmpq %r8, %rcx # Compare n:j jg .L3 # If >, goto loop .L4: # done: rep ; ret long j; long ni = ni; double *rowp = a+ni; for (j = 0; j < n; j++) *rowp++ = b[j]; void set_row(double *a, double b, long i, long n) { long j; for (j = 0; j < n; j++) a[ni+j] = b[j]; }
Where are the FP operations?
Reduc*on in Strength
ļ§ Replace costly opera0on with simpler one
ļ§ ShiW, add instead of mul0ply or divide
16*x --> x << 4
ļ§ U0lity machine dependent
ļ§ Depends on cost of mul0ply or divide instruc0on
ā On Intel Nehalem, integer mul0ply requires 3 CPU cycles
ļ§ Recognize sequence of products
for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; int ni = 0; for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n; }
void lower(char *s)
int i;
for (i = 0; i < strlen(s); i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
Opmizaon Blocker #1: Procedure Calls
ļ¢ Procedure to Convert String to Lower Case
ļ§ Extracted from 213 lab submissions, Fall, 1998
Lower Case Conversion Performance
ļ§ Time quadruples when double string length
ļ§ Quadra0c performance
0 20 40 60 80 100 120 140 160 180 200 0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000 CPU seconds String length lower
Calling Strlen
ļ¢ Strlen performance
ļ§ Only way to determine length of string is to scan its en0re length, looking for
null character.
ļ¢ Overall performance, string of length N
ļ§ N calls to strlen
ļ§ Require 0mes N, N-Āā1, N-Āā2, ā¦, 1
ļ§ Overall O(N^2 ) performance
/* My version of strlen */
size_t strlen(const char *s)
size_t length = 0;
while (*s != '\0') {
s++;
length++;
return length;
Improving Performance
ļ§ Move call to strlen outside of loop
ļ§ Since result does not change from one itera0on to another
ļ§ Form of code mo0on
void lower(char *s)
int i;
int len = strlen(s);
for (i = 0; i < len; i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
Opmizaon Blocker: Procedure Calls
ļ¢ Why couldnāt compiler move strlen out of inner loop?
ļ§ Procedure may have side effects
ļ§ Alters global state each 0me called
ļ§ Func0on may not return same value for given arguments
ļ§ Depends on other parts of global state
ļ§ Procedure lower could interact with strlen
ļ¢ Warning:
ļ§ Compiler treats procedure call as a black box
ļ§ Weak op0miza0ons near them
ļ¢ Remedies:
ļ§ Use of inline func0ons
ļ§ GCC does this with āO
ļ§ See web aside ASM:OPT
ļ§ Do your own code mo0on
int lencnt = 0;
size_t strlen(const char *s)
size_t length = 0;
while (*s != '\0') {
s++; length++;
lencnt += length;
return length;
Memory MaHers
ļ§ Code updates b[i] on every itera0on
ļ§ Why couldnāt compiler op0mize this away?
sum_rows1 inner loop
.L53: addsd (%rcx), %xmm0 # FP add addq $8, %rcx decq %rax movsd %xmm0, (%rsi,%r8,8) # FP store jne .L /* Sum rows is of n X n matrix a and store in vector b */ void sum_rows1(double *a, double b, long n) { long i, j; for (i = 0; i < n; i++) { b[i] = 0; for (j = 0; j < n; j++) b[i] += a[in + j]; } }
Removing Aliasing
ļ§ No need to store intermediate results
sum_rows2 inner loop
.L66: addsd (%rcx), %xmm0 # FP Add addq $8, %rcx decq %rax jne .L /* Sum rows is of n X n matrix a and store in vector b */ void sum_rows2(double *a, double b, long n) { long i, j; for (i = 0; i < n; i++) { double val = 0; for (j = 0; j < n; j++) val += a[in + j]; b[i] = val; } }
Opmizaon Blocker: Memory Aliasing
ļ¢ Aliasing
ļ§ Two different memory references specify single loca0on
ļ§ Easy to have happen in C
ļ§ Since allowed to do address arithme0c
ļ§ Direct access to storage structures
ļ§ Get in habit of introducing local variables
ļ§ Accumula0ng within loops
ļ§ Your way of telling compiler not to check for aliasing