
CPU Performance Optimization: Compilers, Amdahl's Law, and Parallelism, Lecture notes of Computer Science

University of California-Riverside Computer Science

An overview of how compilers affect performance, focusing on loop unrolling and constant propagation. It explains the classical CPU performance equation, highlighting the roles of instruction count (IC), cycles per instruction (CPI), and cycle time. The document also discusses Amdahl's Law and its implications for multicore architectures, emphasizing the importance of optimizing the common case and the limitations of parallelism. Additional topics include GPU performance, memory access, and parallelization overhead, offering insights into achieving speedup in computer systems. The document concludes with announcements regarding assignments and resources for further learning.

Typology: Lecture notes

2023/2024

Uploaded on 05/23/2025

fancycode 🇺🇸


Performance (3):

Do the right thing

Hung-Wei Tseng


Recap: von Neumann architecture

int main(){ printf("Hello, world!\n"); }

[Figure: the program above is compiled into Instructions (machine code bytes such as f30f1efa) and Data (bytes such as 48656c6c6f2c, the ASCII encoding of "Hello,"), which are kept in Storage and loaded into Memory. The Processor fetches instructions at the Program Counter (Instruction Fetch), decodes them (Instruction Decode, e.g. 4883ec → sub $0x8,%rsp), and executes them using the Registers, the Arithmetic Logical Units (ALU), Complex Arithmetic Operations (Mul/Div), Branch/Jump, and Memory Operations.]

By loading different programs into memory, your computer can perform different functions.

If we turn on the “-O3” flag when using gcc to compile both code snippets A and B, how many of the following can we expect?

- Compiler optimizations can reduce IC for both
- Compiler optimizations can make the CPI lower for both
- Compiler optimizations can make the ET lower for both
- Compiler optimizations can transform code B into code A

A. 0

B. 1

C. 2

D. 3

E. 4

How compilers affect performance

Code A:

for(i = 0; i < ARRAY_SIZE; i++) {
  for(j = 0; j < ARRAY_SIZE; j++) {
    c[i][j] = a[i][j]+b[i][j];
  }
}

Code B:

for(j = 0; j < ARRAY_SIZE; j++) {
  for(i = 0; i < ARRAY_SIZE; i++) {
    c[i][j] = a[i][j]+b[i][j];
  }
}

- The compiler can naively apply loop unrolling and constant propagation to reduce IC.
- Reduced IC does not necessarily mean lower CPI: the compiler may pick one longer instruction to replace a few shorter ones.
- The compiler cannot guarantee that the combined effects lead to better performance!
- “Most compilers” will not significantly change the programmer’s code, since the compiler cannot guarantee that doing so would preserve correctness.
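Loop unrolling, one of the transformations named above, can be sketched by hand on code A. This is a minimal illustration, not what gcc -O3 actually emits: the `int` element type, the value of `ARRAY_SIZE`, and the function names `code_a` and `code_a_unrolled4` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: the slides leave ARRAY_SIZE unspecified. */
#define ARRAY_SIZE 64
/* This hand-unrolled sketch has no remainder loop, so it needs a multiple of 4. */
static_assert(ARRAY_SIZE % 4 == 0, "ARRAY_SIZE must be a multiple of 4");

/* Code A as written: every element also pays for one increment,
   one compare, and one branch of inner-loop control. */
void code_a(int c[ARRAY_SIZE][ARRAY_SIZE], int a[ARRAY_SIZE][ARRAY_SIZE],
            int b[ARRAY_SIZE][ARRAY_SIZE]) {
    for (size_t i = 0; i < ARRAY_SIZE; i++)
        for (size_t j = 0; j < ARRAY_SIZE; j++)
            c[i][j] = a[i][j] + b[i][j];
}

/* Code A with the inner loop unrolled by 4: the same additions, but the
   inner-loop control (increment, compare, branch) runs a quarter as often,
   which lowers the dynamic instruction count (IC). */
void code_a_unrolled4(int c[ARRAY_SIZE][ARRAY_SIZE], int a[ARRAY_SIZE][ARRAY_SIZE],
                      int b[ARRAY_SIZE][ARRAY_SIZE]) {
    for (size_t i = 0; i < ARRAY_SIZE; i++)
        for (size_t j = 0; j < ARRAY_SIZE; j += 4) {
            c[i][j]     = a[i][j]     + b[i][j];
            c[i][j + 1] = a[i][j + 1] + b[i][j + 1];
            c[i][j + 2] = a[i][j + 2] + b[i][j + 2];
            c[i][j + 3] = a[i][j + 3] + b[i][j + 3];
        }
}
```

Both functions produce the same `c`; whether the lower IC of the unrolled version turns into a lower execution time still depends on CPI and cycle time.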

Outline

- What does “better” mean?
- Amdahl’s Law and its implications

Quantitative Analysis of “Better”

- The relative performance between two machines, X and Y: “Y is n times faster than X” means

$$n = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y}$$

- The speedup of Y over X:

$$\text{Speedup} = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y}$$

Consider the same program on the following two machines, X and Y. By how much is Y faster than X?

| | Clock Rate | Dynamic Instruction Count | Percentage of Type-A Insts. | CPI of Type-A Insts. | Percentage of Type-B Insts. | CPI of Type-B Insts. | Percentage of Type-C Insts. | CPI of Type-C Insts. |
|---|---|---|---|---|---|---|---|---|
| Machine X | 4 GHz | 5000000000 | 20% | 5 | 20% | 2 | 60% | 1 |
| Machine Y | 6 GHz | 5000000000 | 20% | 7 | 20% | 2 | 60% | 1 |

A. 0.
B. 0.
C. 0.
D. 1.
E. No changes

Speedup of Y over X:

$$ET_X = (5 \times 10^9) \times (20\% \times 5 + 20\% \times 2 + 60\% \times 1) \times \frac{1}{4 \times 10^9}\ \text{sec} = 2.5\ \text{sec}$$

$$ET_Y = (5 \times 10^9) \times (20\% \times 7 + 20\% \times 2 + 60\% \times 1) \times \frac{1}{6 \times 10^9}\ \text{sec} = 2\ \text{sec}$$

$$\text{Speedup} = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y} = \frac{2.5}{2} = 1.25$$
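The arithmetic above is the classical CPU performance equation, ET = IC × CPI × cycle time, with CPI taken as the mix-weighted average over instruction types and cycle time = 1 / clock rate. A minimal C sketch that reproduces it; the function name `execution_time` and its array-based parameters are illustrative assumptions, not code from the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Execution time = IC x average CPI x cycle time, where the average CPI
   is the fraction-weighted sum of each instruction type's CPI and the
   cycle time is 1 / clock_hz. */
double execution_time(double ic, const double fraction[], const double cpi[],
                      size_t ntypes, double clock_hz) {
    double avg_cpi = 0.0, total_fraction = 0.0;
    for (size_t k = 0; k < ntypes; k++) {
        avg_cpi += fraction[k] * cpi[k];
        total_fraction += fraction[k];
    }
    /* The instruction mix should account for every dynamic instruction. */
    assert(total_fraction > 0.999 && total_fraction < 1.001);
    assert(clock_hz > 0.0);
    return ic * avg_cpi / clock_hz;
}
```

With fractions {0.2, 0.2, 0.6}, machine X (CPIs {5, 2, 1} at 4e9 Hz) and machine Y (CPIs {7, 2, 1} at 6e9 Hz) give 2.5 s and 2.0 s, so Y's higher type-A CPI is more than offset by its faster clock.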

Amdahl’s Law and Its Implications in the Multicore Era

Mark D. Hill, University of Wisconsin-Madison

Michael R. Marty, Google

In IEEE Computer, vol. 41, no. 7

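Amdahl’s Law

$$\text{Speedup}_{\text{enhanced}}(f, s) = \frac{1}{(1 - f) + \frac{f}{s}}$$

f: the fraction of time in the original program that we can speed up
s: the speedup we can achieve on f

$$\text{Speedup}_{\text{enhanced}} = \frac{\text{Execution Time}_{\text{baseline}}}{\text{Execution Time}_{\text{enhanced}}}$$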
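A minimal C sketch of the formula above; the function name `amdahl_speedup` is an assumption for illustration.

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction f of the original
   execution time is sped up by a factor s, while the remaining
   (1 - f) of the time is unchanged. */
double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

As s grows without bound, f / s approaches 0, so the overall speedup can never exceed 1 / (1 - f); this is why optimizing the common case (a large f) matters more than making a small part extremely fast.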

https://www.pollev.com/hungweitseng

Practicing Amdahl’s Law

Final Fantasy XV spends a lot of time loading maps, and 95% of that time is spent accessing the H.D.D.; the rest goes to the operating system, the file system, and the I/O protocol. If we replace the H.D.D. with a flash drive that provides 100x faster access time, by how much can we speed up the map loading process?

A. ~7x
B. ~10x
C. ~17x
D. ~29x
E. ~100x

[Figure: Hard Disk Drive latency (us), broken down into File System, Operating System, and HDD]
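Plugging the question's numbers into Amdahl’s Law, assuming the 100x applies to the whole HDD-access fraction (f = 0.95, s = 100) while the software time is unchanged:

$$\text{Speedup}_{\text{enhanced}}(0.95, 100) = \frac{1}{(1 - 0.95) + \frac{0.95}{100}} = \frac{1}{0.05 + 0.0095} = \frac{1}{0.0595} \approx 16.8$$

which is closest to choice C (~17x), far below the raw 100x of the drive itself.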

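Amdahl’s Law on Multiple Optimizations

- We can apply Amdahl’s Law for multiple optimizations
- The fractions these optimizations cover must be disjoint!
  - If optimization #1 and optimization #2 are disjoint:

$$\text{Speedup}_{\text{enhanced}}(f_{\mathrm{Opt1}}, f_{\mathrm{Opt2}}, s_{\mathrm{Opt1}}, s_{\mathrm{Opt2}}) = \frac{1}{(1 - f_{\mathrm{Opt1}} - f_{\mathrm{Opt2}}) + \frac{f_{\mathrm{Opt1}}}{s_{\mathrm{Opt1}}} + \frac{f_{\mathrm{Opt2}}}{s_{\mathrm{Opt2}}}}$$

[Figure: execution time split into f_Opt1, f_Opt2, and 1 - f_Opt1 - f_Opt2]

  - If optimization #1 and optimization #2 are not disjoint, split out the time that both affect as its own fraction:

$$\text{Speedup}_{\text{enhanced}}(f_{\mathrm{OnlyOpt1}}, f_{\mathrm{OnlyOpt2}}, f_{\mathrm{BothOpt1Opt2}}, s_{\mathrm{OnlyOpt1}}, s_{\mathrm{OnlyOpt2}}, s_{\mathrm{BothOpt1Opt2}}) = \frac{1}{(1 - f_{\mathrm{OnlyOpt1}} - f_{\mathrm{OnlyOpt2}} - f_{\mathrm{BothOpt1Opt2}}) + \frac{f_{\mathrm{OnlyOpt1}}}{s_{\mathrm{OnlyOpt1}}} + \frac{f_{\mathrm{OnlyOpt2}}}{s_{\mathrm{OnlyOpt2}}} + \frac{f_{\mathrm{BothOpt1Opt2}}}{s_{\mathrm{BothOpt1Opt2}}}}$$

[Figure: execution time split into f_OnlyOpt1, f_OnlyOpt2, f_BothOpt1Opt2, and 1 - f_OnlyOpt1 - f_OnlyOpt2 - f_BothOpt1Opt2]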
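Both forms of the multiple-optimization formula share one pattern: sum f/s over disjoint fractions and add the untouched remainder. A minimal C sketch of that generalization to n disjoint fractions; the name `amdahl_multi` and the array interface are assumptions, and overlapping time must be passed as its own entry (like f_BothOpt1Opt2) rather than counted twice.

```c
#include <assert.h>
#include <stddef.h>

/* Amdahl's Law for several disjoint parts of the original execution time:
   fraction[k] of the time is sped up by speedup[k]; whatever is left over
   runs unchanged. Time affected by more than one optimization must be its
   own entry with its own combined speedup, never counted twice. */
double amdahl_multi(const double fraction[], const double speedup[], size_t n) {
    double untouched = 1.0;   /* 1 - sum of all fractions */
    double enhanced = 0.0;    /* sum of fraction / speedup */
    for (size_t k = 0; k < n; k++) {
        assert(fraction[k] >= 0.0 && speedup[k] > 0.0);
        untouched -= fraction[k];
        enhanced += fraction[k] / speedup[k];
    }
    /* Disjoint fractions cannot add up to more than 100% of the time. */
    assert(untouched >= -1e-9);
    return 1.0 / (untouched + enhanced);
}
```

For the non-disjoint case, the three entries are f_OnlyOpt1, f_OnlyOpt2, and f_BothOpt1Opt2 with their respective speedups, which expands to exactly the second formula.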


Speedup further!

With the latest flash memory technologies, the system spends 16% of the time accessing the flash, and the software overhead is now 84%. If your company asks you and your team to invent a new memory technology that replaces flash to achieve a 2x speedup on loading maps, how much faster does the new technology need to be?

A. ~5x
B. ~10x
C. ~20x
D. ~100x
E. None of the above

[Figure: Flash SSD latency (us), broken down into File System, Operating System, and Hardware]
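This question runs Amdahl’s Law backwards: given a target overall speedup, solve for the s the replaced component needs. From 1 / ((1 - f) + f / s) = target, we get s = f / (1/target - (1 - f)), which only has a positive solution when 1/target > 1 - f. A minimal C sketch; the function name and the negative-return convention for "unreachable" are assumptions for illustration.

```c
#include <assert.h>

/* Inverts Amdahl's Law: how much faster must the part that takes
   fraction f of the time become for the whole task to reach
   target_speedup? Returns -1.0 when no finite speedup suffices, i.e.
   when the untouched (1 - f) part alone already takes at least
   1 / target_speedup of the original time. */
double required_component_speedup(double f, double target_speedup) {
    assert(f > 0.0 && f <= 1.0);
    assert(target_speedup >= 1.0);
    /* Time left for the improved part after paying for the untouched part. */
    double budget = 1.0 / target_speedup - (1.0 - f);
    if (budget <= 0.0)
        return -1.0;  /* unreachable: overall speedup is capped at 1 / (1 - f) */
    return f / budget;
}
```

With the question's numbers (f = 0.16, target 2x), the budget is 0.5 - 0.84 < 0, so no replacement for flash alone can reach 2x: with 84% of the time untouched, the best possible overall speedup is 1/0.84 ≈ 1.19x.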
