Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Program Optimization, Lecture Slide - Computer Science, Slides of Computer System Design and Architecture

Carnegie Mellon University (CMU)Computer System Design and Architecture

Generally Useful Optimizations, Code Motion ,Precomputation, Strength Reduction, Sharing of Common Subexpression ,Removing Unnecessary Procedure calls, Optimization Blockers , Exploiting Instruction-Level Parallelism, Dealing with Consitionals

Typology: Slides

2010/2011

Uploaded on 10/08/2011

rolla45 🇺🇸

(6)

133 documents

1 / 54

This page cannot be seen from the preview

Don't miss anything!

Carnegie Mellon

Program'Op*miza*on'

15#213:'Introduc0on'to'Computer'Systems'

25th'Lecture,'Nov.'23,'2010'

Instructors:''

Randy'Bryant'and'Dave'O’Hallaron'

Discover Slides of Computer System Design and Architecture Carnegie Mellon University (CMU)

Partial preview of the text

Download Program Optimization, Lecture Slide - Computer Science and more Slides Computer System Design and Architecture in PDF only on Docsity!

Program Opmizaon

15-‐213: Introduc0on to Computer Systems

Lecture, Nov. 23, 2010

Instructors:

Randy Bryant and Dave O’Hallaron

Today

 Overview

 Generally Useful Opmizaons

 Code mo0on/precomputa0on

 Strength reduc0on

 Sharing of common subexpressions

 Removing unnecessary procedure calls

 Opmizaon Blockers

 Procedure calls

 Memory aliasing

 Exploing Instrucon-‐Level Parallelism

 Dealing with Condi*onals

Op*mizing Compilers

 Provide efficient mapping of program to machine

 register alloca0on

 code selec0on and ordering (scheduling)

 dead code elimina0on

 elimina0ng minor inefficiencies

 Don’t (usually) improve asympto*c efficiency

 up to programmer to select best overall algorithm

 big-‐O savings are (oWen) more important than constant factors

 but constant factors also maQer

 Have difficulty overcoming “opmizaon blockers”

 poten0al memory aliasing

 poten0al procedure side-‐effects

Limitaons of Opmizing Compilers

 Operate under fundamental constraint

 Must not cause any change in program behavior

 OWen prevents it from making op0miza0ons when would only affect behavior

under pathological condi0ons.

 Behavior that may be obvious to the programmer can be obfuscated by

languages and coding styles

 e.g., Data ranges may be more limited than variable types suggest

 Most analysis is performed only within procedures

 Whole-‐program analysis is too expensive in most cases

 Most analysis is based only on sta1c informa*on

 Compiler has difficulty an0cipa0ng run-‐0me inputs

 When in doubt, the compiler must be conserva*ve

Compiler-‐Generated Code Mo*on

set_row: testq %rcx, %rcx # Test n jle .L4 # If 0, goto done movq %rcx, %rax # rax = n imulq %rdx, %rax # rax = i leaq (%rdi,%rax,8), %rdx # rowp = A + ni* movl $0, %r8d # j = 0 .L3: # loop: movq (%rsi,%r8,8), %rax # t = b[j] movq %rax, (%rdx) # rowp = t addq $1, %r8 # j++ addq $8, %rdx # rowp++ cmpq %r8, %rcx # Compare n:j jg .L3 # If >, goto loop .L4: # done: rep ; ret long j; long ni = ni; double *rowp = a+ni; for (j = 0; j < n; j++) *rowp++ = b[j]; void set_row(double *a, double b, long i, long n) { long j; for (j = 0; j < n; j++) a[ni+j] = b[j]; }

Where are the FP operations?

Reduc*on in Strength

 Replace costly opera0on with simpler one

 ShiW, add instead of mul0ply or divide

16*x --> x << 4

 U0lity machine dependent

 Depends on cost of mul0ply or divide instruc0on

– On Intel Nehalem, integer mul0ply requires 3 CPU cycles

 Recognize sequence of products

for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; int ni = 0; for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n; }

void lower(char *s)

int i;

for (i = 0; i < strlen(s); i++)

if (s[i] >= 'A' && s[i] <= 'Z')

s[i] -= ('A' - 'a');

Opmizaon Blocker #1: Procedure Calls

 Procedure to Convert String to Lower Case

 Extracted from 213 lab submissions, Fall, 1998

Lower Case Conversion Performance

 Time quadruples when double string length

 Quadra0c performance

0 20 40 60 80 100 120 140 160 180 200 0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000 CPU seconds String length lower

Calling Strlen

 Strlen performance

 Only way to determine length of string is to scan its en0re length, looking for

null character.

 Overall performance, string of length N

 N calls to strlen

 Require 0mes N, N-‐1, N-‐2, …, 1

 Overall O(N^2 ) performance

/* My version of strlen */

size_t strlen(const char *s)

size_t length = 0;

while (*s != '\0') {

s++;

length++;

return length;

Improving Performance

 Move call to strlen outside of loop

 Since result does not change from one itera0on to another

 Form of code mo0on

void lower(char *s)

int i;

int len = strlen(s);

for (i = 0; i < len; i++)

if (s[i] >= 'A' && s[i] <= 'Z')

s[i] -= ('A' - 'a');

Opmizaon Blocker: Procedure Calls

 Why couldn’t compiler move strlen out of inner loop?

 Procedure may have side effects

 Alters global state each 0me called

 Func0on may not return same value for given arguments

 Depends on other parts of global state

 Procedure lower could interact with strlen

 Warning:

 Compiler treats procedure call as a black box

 Weak op0miza0ons near them

 Remedies:

 Use of inline func0ons

 GCC does this with –O

 See web aside ASM:OPT

 Do your own code mo0on

int lencnt = 0;

size_t strlen(const char *s)

size_t length = 0;

while (*s != '\0') {

s++; length++;

lencnt += length;

return length;

Memory MaHers

 Code updates b[i] on every itera0on

 Why couldn’t compiler op0mize this away?

sum_rows1 inner loop

.L53: addsd (%rcx), %xmm0 # FP add addq $8, %rcx decq %rax movsd %xmm0, (%rsi,%r8,8) # FP store jne .L /* Sum rows is of n X n matrix a and store in vector b */ void sum_rows1(double *a, double b, long n) { long i, j; for (i = 0; i < n; i++) { b[i] = 0; for (j = 0; j < n; j++) b[i] += a[in + j]; } }

Removing Aliasing

 No need to store intermediate results

sum_rows2 inner loop

.L66: addsd (%rcx), %xmm0 # FP Add addq $8, %rcx decq %rax jne .L /* Sum rows is of n X n matrix a and store in vector b */ void sum_rows2(double *a, double b, long n) { long i, j; for (i = 0; i < n; i++) { double val = 0; for (j = 0; j < n; j++) val += a[in + j]; b[i] = val; } }

Program Optimization, Lecture Slide - Computer Science, Slides of Computer System Design and Architecture

Related documents

Partial preview of the text

Download Program Optimization, Lecture Slide - Computer Science and more Slides Computer System Design and Architecture in PDF only on Docsity!

Program Opmizaon

15-­‐213: Introduc0on to Computer Systems

Lecture, Nov. 23, 2010

Instructors:

Randy Bryant and Dave O’Hallaron

Today

 Overview

 Generally Useful Opmizaons

 Code mo0on/precomputa0on

 Strength reduc0on

 Sharing of common subexpressions

 Removing unnecessary procedure calls

 Opmizaon Blockers

 Procedure calls

 Memory aliasing

 Exploing Instrucon-­‐Level Parallelism

 Dealing with Condi*onals

Op*mizing Compilers

 Provide efficient mapping of program to machine

 register alloca0on

 code selec0on and ordering (scheduling)

 dead code elimina0on

 elimina0ng minor inefficiencies

 Don’t (usually) improve asympto*c efficiency

 up to programmer to select best overall algorithm

 big-­‐O savings are (oWen) more important than constant factors

 but constant factors also maQer

 Have difficulty overcoming “opmizaon blockers”

 poten0al memory aliasing

 poten0al procedure side-­‐effects

Limitaons of Opmizing Compilers

 Operate under fundamental constraint

 Must not cause any change in program behavior

 OWen prevents it from making op0miza0ons when would only affect behavior

under pathological condi0ons.

 Behavior that may be obvious to the programmer can be obfuscated by

languages and coding styles

 e.g., Data ranges may be more limited than variable types suggest

 Most analysis is performed only within procedures

 Whole-­‐program analysis is too expensive in most cases

 Most analysis is based only on sta1c informa*on

 Compiler has difficulty an0cipa0ng run-­‐0me inputs

 When in doubt, the compiler must be conserva*ve

Compiler-­‐Generated Code Mo*on

Where are the FP operations?

Reduc*on in Strength

 Replace costly opera0on with simpler one

 ShiW, add instead of mul0ply or divide

16*x --> x << 4

 U0lity machine dependent

 Depends on cost of mul0ply or divide instruc0on

– On Intel Nehalem, integer mul0ply requires 3 CPU cycles

 Recognize sequence of products

void lower(char *s)

int i;

for (i = 0; i < strlen(s); i++)

if (s[i] >= 'A' && s[i] <= 'Z')

s[i] -= ('A' - 'a');

Opmizaon Blocker #1: Procedure Calls

 Procedure to Convert String to Lower Case

 Extracted from 213 lab submissions, Fall, 1998

Lower Case Conversion Performance

 Time quadruples when double string length

 Quadra0c performance

Calling Strlen

 Strlen performance

 Only way to determine length of string is to scan its en0re length, looking for

null character.

 Overall performance, string of length N

 N calls to strlen

 Require 0mes N, N-­‐1, N-­‐2, …, 1

 Overall O(N^2 ) performance

/* My version of strlen */

size_t strlen(const char *s)

size_t length = 0;

while (*s != '\0') {

15-‐213: Introduc0on to Computer Systems

 Exploing Instrucon-‐Level Parallelism

 big-‐O savings are (oWen) more important than constant factors

 poten0al procedure side-‐effects

 Whole-‐program analysis is too expensive in most cases

 Compiler has difficulty an0cipa0ng run-‐0me inputs

Compiler-‐Generated Code Mo*on

 Require 0mes N, N-‐1, N-‐2, …, 1