Optimizing Vector Operations: Understanding Compiler Limitations and Techniques, Lab Reports of Computer Science

Covers the optimization of vector operations, focusing on the limitations of compilers and on techniques for improving performance. Topics include code motion, the vector abstract data type (ADT), optimization examples, time scales, cycles per element, and procedure calls. The document also discusses why it is important to understand what compilers can and cannot do.

Uploaded on 08/19/2009

15-213
“The course that gives CMU its Zip!”
Code Optimization I:
Machine Independent Optimizations
Sept. 26, 2002
Topics
- Machine-Independent Optimizations
  - Code motion
  - Reduction in strength
  - Common subexpression sharing
- Tuning
  - Identifying performance bottlenecks
class10.ppt

Great Reality #4
There's more to performance than asymptotic complexity

Constant factors matter too!
- You can easily see a 10:1 performance range depending on how code is written
- Must optimize at multiple levels: algorithm, data representations, procedures, and loops

Must understand the system to optimize performance
- How programs are compiled and executed
- How to measure program performance and identify bottlenecks
- How to improve performance without destroying code modularity and generality

Limitations of Optimizing Compilers
Operate under a fundamental constraint
- Must not cause any change in program behavior under any possible condition
  - Often prevents optimizations that would change behavior only under pathological conditions
Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- e.g., data ranges may be more limited than variable types suggest
Most analysis is performed only within procedures
- Whole-program analysis is too expensive in most cases
Most analysis is based only on static information
- The compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
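To make the "any possible condition" constraint concrete, the sketch below uses two illustrative functions, twiddle1 and twiddle2 (names chosen here, not taken from these slides). They compute the same result whenever xp and yp point to different locations, but differ when both pointers refer to the same int. Since the compiler generally cannot rule that case out, it may not turn twiddle1 into the cheaper twiddle2.

```c
/* Adds *yp to *xp twice: two loads of *yp, two stores to *xp. */
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* Looks equivalent and needs fewer memory references, but matches
   twiddle1 only when xp and yp point to different locations. */
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
```

With x = 3 and a separate y = 5, both functions leave 13 in x. Passing the same pointer twice for a value of 3 gives 12 from twiddle1 but 9 from twiddle2, which is exactly the kind of pathological case that blocks the optimization.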

Machine-Independent Optimizations
- Optimizations you should do regardless of processor / compiler

Code Motion
- Reduce the frequency with which a computation is performed
  - If it will always produce the same result
  - Especially moving code out of a loop

Original:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After code motion (n*i computed once per outer iteration):

    for (i = 0; i < n; i++) {
        int ni = n*i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }
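A self-contained sketch of the transformation above, with the two loop nests wrapped in hypothetical functions copy_rows and copy_rows_motion so their results can be compared. Whether a given optimizing compiler already hoists n*i on its own depends on the compiler and flags; the rewrite is safe because n*i does not change inside the inner loop.

```c
/* Original: recomputes n*i on every inner-loop iteration. */
void copy_rows(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[n * i + j] = b[j];
}

/* After code motion: n*i is invariant in the inner loop,
   so it is computed once per outer iteration. */
void copy_rows_motion(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++) {
        int ni = n * i;
        for (int j = 0; j < n; j++)
            a[ni + j] = b[j];
    }
}
```

Both functions fill the n-by-n array a with n copies of b; only the number of multiplications differs.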

Reduction in Strength
- Replace a costly operation with a simpler one
- Shift, add instead of multiply or divide
    16*x --> x << 4
  - Utility is machine dependent
  - Depends on the cost of the multiply or divide instruction
  - On Pentium II or III, integer multiply requires only 4 CPU cycles
- Recognize a sequence of products

Original:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After strength reduction (the products n*i replaced by a running sum):

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
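The sketch below shows both forms of strength reduction from this slide: a shift in place of a multiply, and the sequence-of-products rewrite. The function names times16, fill_mul, and fill_add are hypothetical. The shift is written with unsigned operands because left-shifting a negative signed int is undefined behavior in C.

```c
/* Multiply by 16 with a shift. Unsigned operands make x << 4
   well defined and equal to 16*x (both wrap modulo 2^width). */
unsigned times16(unsigned x)
{
    return x << 4;
}

/* Original: computes the product n*i on every inner iteration. */
void fill_mul(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[n * i + j] = b[j];
}

/* Strength-reduced: the products 0, n, 2n, ... are produced
   by adding n once per outer iteration instead of multiplying. */
void fill_add(int *a, const int *b, int n)
{
    int ni = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
}
```

As the slide notes, whether the shift actually wins depends on how expensive the target's multiply instruction is.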

Make Use of Registers
- Reading and writing registers is much faster than reading/writing memory

Limitation
- The compiler is not always able to determine whether a variable can be held in a register
- Possibility of aliasing
- See example later
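A minimal sketch of the aliasing limitation, using hypothetical functions sum_to_dest and sum_local. Accumulating through *dest forces a memory read and write on each iteration, because dest could point into the array being summed. Accumulating in a local variable lets the compiler keep the running sum in a register. The two agree only when dest does not alias an element of a.

```c
/* Accumulates directly through the pointer: each iteration reads and
   writes *dest in memory, since dest might alias an element of a. */
void sum_to_dest(const int *a, int n, int *dest)
{
    *dest = 0;
    for (int i = 0; i < n; i++)
        *dest += a[i];
}

/* Accumulates in a local variable, which the compiler can keep in a
   register, and stores to memory once at the end. */
void sum_local(const int *a, int n, int *dest)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i];
    *dest = acc;
}
```

For a = {1, 2, 5} and dest = &a[2], sum_to_dest leaves 6 in a[2] (the zeroing and partial sums overwrite a[2] before it is read), while sum_local leaves 8.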

Vector ADT
[Figure: a vector record holding length and data; data points to an array of elements indexed 0, 1, 2, ..., length-1]

Procedures
- vec_ptr new_vec(int len)
  - Create vector of specified length
- int get_vec_element(vec_ptr v, int index, int *dest)
  - Retrieve vector element, store at *dest
  - Return 0 if out of bounds, 1 if successful
- int *get_vec_start(vec_ptr v)
  - Return pointer to start of vector data
- Similar to array implementations in Pascal, ML, Java
  - E.g., always do bounds checking
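One possible implementation of these procedures, assuming a record with a length and a pointer to heap-allocated int data (the slides do not show the representation). It also defines vec_length, which the later examples call, and a hypothetical free_vec helper for cleanup. get_vec_element performs the bounds check described above.

```c
#include <stdlib.h>

/* Assumed representation: element count plus pointer to the data. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Create vector of specified length (elements zeroed); NULL on failure. */
vec_ptr new_vec(int len)
{
    if (len < 0)
        return NULL;
    vec_ptr result = malloc(sizeof(vec_rec));
    if (!result)
        return NULL;
    result->len = len;
    result->data = NULL;
    if (len > 0) {
        result->data = calloc((size_t)len, sizeof(int));
        if (!result->data) {
            free(result);
            return NULL;
        }
    }
    return result;
}

/* Bounds-checked read: return 0 if out of bounds, 1 on success. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Number of elements in the vector. */
int vec_length(vec_ptr v)
{
    return v->len;
}

/* Pointer to the first element (NULL for an empty vector here). */
int *get_vec_start(vec_ptr v)
{
    return v->data;
}

/* Hypothetical helper: release the vector and its data. */
void free_vec(vec_ptr v)
{
    if (v) {
        free(v->data);
        free(v);
    }
}
```

An out-of-bounds read returns 0 and leaves *dest unchanged, which is the safety that the later slides pay for in per-element overhead.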

Optimization Example

    void combine1(vec_ptr v, int *dest)
    {
        int i;
        *dest = 0;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }

Procedure
- Compute sum of all elements of vector
- Store result at destination location
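To run combine1 as written, it needs some definition of the vector ADT. The sketch below pairs the slide's combine1 with a minimal stand-in for the ADT (the vec_rec layout is an assumption, not the course's actual code) so the summing behavior can be checked.

```c
#include <stdlib.h>

/* Minimal stand-in for the vector ADT; representation assumed. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

vec_ptr new_vec(int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v)
        return NULL;
    v->len = len;
    /* Allocate at least one int so data is never NULL. */
    v->data = calloc(len > 0 ? (size_t)len : 1, sizeof(int));
    if (!v->data) {
        free(v);
        return NULL;
    }
    return v;
}

int vec_length(vec_ptr v) { return v->len; }

int *get_vec_start(vec_ptr v) { return v->data; }

int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* combine1 from the slide: sum all elements into *dest. */
void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
```

For a vector holding 1 through 5, combine1 stores 15; for an empty vector it stores 0.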

Cycles Per Element
- Convenient way to express the performance of a program that operates on vectors or lists
- Length = n
- $T = \mathrm{CPE} \cdot n + \mathrm{Overhead}$

[Figure: Cycles vs. Elements (0 to 200) for vsum1 (Slope = 4.) and vsum2 (Slope = 3.); the slope of each line is its CPE]
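Since CPE is the slope of the line T = CPE * n + Overhead, one way to estimate it from timing measurements is an ordinary least-squares fit of cycles against n. The helper below (fit_cpe, a hypothetical name) computes that fit; the data in the usage note are synthetic, not measurements from the slides.

```c
/* Least-squares fit of cycles = cpe * n + overhead over `count` samples.
   Returns 0 on success, -1 if the samples do not determine a line
   (fewer than two points, or all n values equal). */
int fit_cpe(const double *n, const double *cycles, int count,
            double *cpe, double *overhead)
{
    double sn = 0.0, st = 0.0, snn = 0.0, snt = 0.0;
    for (int k = 0; k < count; k++) {
        sn  += n[k];
        st  += cycles[k];
        snn += n[k] * n[k];
        snt += n[k] * cycles[k];
    }
    double denom = count * snn - sn * sn;
    if (count < 2 || denom == 0.0)
        return -1;
    *cpe = (count * snt - sn * st) / denom;   /* slope */
    *overhead = (st - *cpe * sn) / count;     /* intercept */
    return 0;
}
```

For points generated from T = 4n + 100 at n = 0, 50, 100, 150, 200, fit_cpe reports a CPE of 4 and an overhead of 100.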

Optimization Example
Procedure
- Compute sum of all elements of integer vector
- Store result at destination location
- Vector data structure and operations defined via abstract data type

Pentium II/III Performance: Clock Cycles / Element
- 42.06 (compiled -g)
- 31.25 (compiled -O2)

Move vec_length Call Out of Loop

Optimization
- Move call to vec_length out of the loop test
  - Value does not change from one iteration to the next
  - Code motion
- CPE: 20.66 (compiled -O2)
  - vec_length requires only constant time, but has significant overhead

    void combine2(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        *dest = 0;
        for (i = 0; i < length; i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest += val;
        }
    }
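The sketch below counts vec_length calls to show what this code motion saves: for a vector of n elements, combine1 evaluates its loop test, and therefore calls vec_length, n+1 times, while combine2 calls it once. The counter vec_length_calls and the vec_rec layout are hypothetical instrumentation, not part of the course's ADT.

```c
#include <stdlib.h>

/* Assumed representation of the vector ADT. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Hypothetical instrumentation: number of vec_length calls so far. */
static long vec_length_calls = 0;

vec_ptr new_vec(int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v)
        return NULL;
    v->len = len;
    v->data = calloc(len > 0 ? (size_t)len : 1, sizeof(int));
    if (!v->data) {
        free(v);
        return NULL;
    }
    return v;
}

int vec_length(vec_ptr v)
{
    vec_length_calls++;
    return v->len;
}

int *get_vec_start(vec_ptr v) { return v->data; }

int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Loop test calls vec_length on every iteration. */
void combine1(vec_ptr v, int *dest)
{
    *dest = 0;
    for (int i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

/* Code motion: vec_length is called once, before the loop. */
void combine2(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    *dest = 0;
    for (int i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
```

Because the instrumented vec_length has a side effect, a compiler cannot hoist it here by itself, which mirrors the slide's point: the programmer knows the length is unchanged, but the compiler has to prove it.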

Code Motion Example #2

    #include <string.h>

    void lower(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

Procedure to Convert String to Lower Case
- Extracted from 213 lab submissions, Fall 1998

Convert Loop To Goto Form

    #include <string.h>

    void lower(char *s)
    {
        int i = 0;
        if (i >= strlen(s))
            goto done;
    loop:
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
        i++;
        if (i < strlen(s))
            goto loop;
    done:
        ;   /* a label must be followed by a statement */
    }

- strlen executed every iteration
- strlen linear in length of string
  - Must scan string until it finds '\0'
- Overall performance is quadratic in the length of the string
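The quadratic cost can be made visible by replacing strlen with a hypothetical instrumented version, counting_strlen, that tallies every character it examines, including the terminating '\0'. For a string of length L, the loop test runs L+1 times and each call examines L+1 characters, so lower1 (the original lower, renamed here) examines (L+1)^2 characters in total. The sketch uses size_t for the index to avoid a signed/unsigned comparison with the length.

```c
#include <stddef.h>

/* Hypothetical instrumentation: total characters examined by
   counting_strlen, including each terminating '\0'. */
static long chars_scanned = 0;

/* Same result as strlen, but counts every character it looks at. */
static size_t counting_strlen(const char *s)
{
    size_t n = 0;
    for (;;) {
        chars_scanned++;            /* examine s[n] */
        if (s[n] == '\0')
            return n;
        n++;
    }
}

/* The original lower(): the length is recomputed in every loop test. */
void lower1(char *s)
{
    for (size_t i = 0; i < counting_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

On strings of 10 and 100 uppercase letters, chars_scanned reaches 121 and 10201 respectively: ten times the length costs about a hundred times the work.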

Improving Performance

    #include <string.h>

    void lower(char *s)
    {
        int i;
        int len = strlen(s);
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

- Move call to strlen outside of loop
  - Since result does not change from one iteration to another
  - Form of code motion
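A self-contained version of the improved routine, here named lower2 (a hypothetical name) and using size_t for the value returned by strlen. It calls strlen exactly once, so its running time is linear in the length of the string.

```c
#include <string.h>

/* Convert ASCII uppercase letters in s to lowercase, in place.
   strlen is called once, before the loop (code motion). */
void lower2(char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');   /* 'A' - 'a' is -32 in ASCII */
}
```

Non-letters such as digits and punctuation are left unchanged, so "Hello, World 2002!" becomes "hello, world 2002!".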