CPU Performance Optimization: Compilers, Amdahl's Law, and Parallelism, Lecture notes of Computer Science

This overview covers how compilers affect performance, focusing on loop unrolling and constant propagation. It explains the classical CPU performance equation, highlighting the roles of instruction count (IC), cycles per instruction (CPI), and cycle time. The document also discusses Amdahl's Law and its implications for multicore architectures, emphasizing the importance of optimizing the common case and the limitations of parallelism. Additional topics include GPU performance, memory access, and parallelization overhead, offering insights into achieving speedup in computer systems. The document concludes with announcements regarding assignments and resources for further learning.

Typology: Lecture notes

2023/2024

Uploaded on 05/23/2025

Performance (3):
Do the right thing
Hung-Wei Tseng


Recap: von Neumann architecture

[Figure: a processor, memory, and storage. The processor contains the program counter, registers, instruction fetch, instruction decode, arithmetic logical units (ALU), complex arithmetic operations (mul/div), branch/jump, and memory operations. Memory holds both the instructions and the data of a compiled program such as int main(){ printf("Hello, world!\n"); }. The bytes shown as 4883ec correspond to the instruction sub $0x8,%rsp.]

By loading different programs into memory, your computer can perform different functions.

  • If we turn on the “-O3” flag when using gcc to compile both code snippets A and B, how many of the following can we expect?

    (1) Compiler optimizations can reduce IC for both

    (2) Compiler optimizations can make the CPI lower for both

    (3) Compiler optimizations can make the ET lower for both

    (4) Compiler optimizations can transform code B into code A

A. 0

B. 1

C. 2

D. 3

E. 4

How compilers affect performance

  • Compilers can naively apply loop unrolling and constant propagation to reduce IC
  • Reduced IC does not necessarily mean lower CPI: the compiler may pick one longer instruction to replace a few shorter ones
  • The compiler cannot guarantee that the combined effects lead to better performance!
  • “Most compilers” will not significantly change the programmer’s code, since the compiler cannot guarantee that doing so preserves correctness

A:

    for(i = 0; i < ARRAY_SIZE; i++) {
        for(j = 0; j < ARRAY_SIZE; j++) {
            c[i][j] = a[i][j]+b[i][j];
        }
    }

B:

    for(j = 0; j < ARRAY_SIZE; j++) {
        for(i = 0; i < ARRAY_SIZE; i++) {
            c[i][j] = a[i][j]+b[i][j];
        }
    }
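As a rough illustration of the points on this slide, the sketch below writes snippets A and B as functions and adds a hand-unrolled variant of A, which is roughly the kind of rewrite loop unrolling performs. The names `N`, `add_a`, `add_b`, and `add_a_unrolled4` are made up for this example; `N` stands in for `ARRAY_SIZE` and is assumed to be a multiple of 4.

```c
#include <stddef.h>

#define N 64  /* hypothetical stand-in for ARRAY_SIZE; assumed to be a multiple of 4 */

_Static_assert(N % 4 == 0, "the unrolled loop below assumes N is a multiple of 4");

/* Snippet A: the inner loop walks along a row, so consecutive
   accesses touch adjacent elements in memory. */
void add_a(double c[N][N], double a[N][N], double b[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
}

/* Snippet B: the inner loop walks down a column, so consecutive
   accesses are N elements apart. */
void add_b(double c[N][N], double a[N][N], double b[N][N]) {
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            c[i][j] = a[i][j] + b[i][j];
}

/* Snippet A with the inner loop unrolled by 4: one compare-and-branch
   per four additions instead of one per addition. */
void add_a_unrolled4(double c[N][N], double a[N][N], double b[N][N]) {
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j += 4) {
            c[i][j]     = a[i][j]     + b[i][j];
            c[i][j + 1] = a[i][j + 1] + b[i][j + 1];
            c[i][j + 2] = a[i][j + 2] + b[i][j + 2];
            c[i][j + 3] = a[i][j + 3] + b[i][j + 3];
        }
    }
}
```

All three functions compute the same matrix; they differ only in traversal order and in how many loop-control instructions execute, which is why IC alone does not settle which one runs fastest.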

Outline

  • What does better mean?
  • Amdahl’s Law and its implications

Quantitative Analysis of “Better”

  • The relative performance between two machines, X and Y: Y is n times faster than X

$$n = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y}$$

  • The speedup of Y over X

$$\text{Speedup} = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y}$$
Speedup of Y over X

  • Consider the same program on the following two machines, X and Y. By how much is Y faster than X?

| | Clock Rate | Dynamic Instruction Count | Percentage of Type-A Insts. | CPI of Type-A Insts. | Percentage of Type-B Insts. | CPI of Type-B Insts. | Percentage of Type-C Insts. | CPI of Type-C Insts. |
|---|---|---|---|---|---|---|---|---|
| Machine X | 4 GHz | 5000000000 | 20% | 5 | 20% | 2 | 60% | 1 |
| Machine Y | 6 GHz | 5000000000 | 20% | 7 | 20% | 2 | 60% | 1 |

A. 0. B. 0. C. 0. D. 1. E. No changes

$$ET_X = (5 \times 10^9) \times (20\% \times 5 + 20\% \times 2 + 60\% \times 1) \times \frac{1}{4 \times 10^9}\ \text{sec} = 2.5\ \text{sec}$$

$$ET_Y = (5 \times 10^9) \times (20\% \times 7 + 20\% \times 2 + 60\% \times 1) \times \frac{1}{6 \times 10^9}\ \text{sec} = 2\ \text{sec}$$

$$\text{Speedup} = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y} = \frac{2.5}{2} = 1.25$$
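The calculation above can be sketched in C as follows. The struct and function names (`inst_class`, `average_cpi`, `execution_time`, `speedup`) are hypothetical and chosen only for this example; the formula is the performance equation used on the slide, ET = IC × CPI × cycle time, with cycle time taken as 1 / clock rate.

```c
#include <assert.h>

/* One instruction class: its share of the dynamic instruction count and its CPI. */
struct inst_class {
    double fraction;
    double cpi;
};

/* Weighted-average CPI over all instruction classes. */
double average_cpi(const struct inst_class *classes, int n) {
    double cpi = 0.0;
    for (int i = 0; i < n; i++)
        cpi += classes[i].fraction * classes[i].cpi;
    return cpi;
}

/* ET = IC x CPI x cycle time, where cycle time = 1 / clock rate (in Hz). */
double execution_time(double ic, double cpi, double clock_hz) {
    assert(clock_hz > 0.0);
    return ic * cpi / clock_hz;
}

/* Speedup of the new configuration over the old one. */
double speedup(double et_old, double et_new) {
    assert(et_new > 0.0);
    return et_old / et_new;
}
```

Plugging in the table's values (IC = 5e9, classes {20%, 5}, {20%, 2}, {60%, 1} at 4e9 Hz for X, and {20%, 7}, {20%, 2}, {60%, 1} at 6e9 Hz for Y) gives about 2.5 s and 2 s, so the speedup of Y over X is about 1.25. The higher clock rate on Y is partly cancelled by the higher CPI of its Type-A instructions.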

Amdahl’s Law and Its Implication in the Multicore Era

Mark D. Hill, University of Wisconsin-Madison

Michael R. Marty, Google

In IEEE Computer, vol. 41, no. 7

Amdahl’s Law

$$\text{Speedup}_{enhanced}(f, s) = \frac{1}{(1 - f) + \frac{f}{s}}$$

  • f: the fraction of time in the original program
  • s: the speedup we can achieve on f

$$\text{Speedup}_{enhanced} = \frac{\text{Execution Time}_{baseline}}{\text{Execution Time}_{enhanced}}$$
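A minimal sketch of the formula in C, assuming f is given as a fraction in [0, 1] and s as a positive factor. The function names `amdahl_speedup` and `amdahl_limit` are made up for this example; `amdahl_limit` is simply the same formula with s taken to be arbitrarily large.

```c
#include <assert.h>

/* Amdahl's Law: f is the fraction of baseline time the enhancement
   applies to, s is the speedup achieved on that fraction. */
double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound as s grows without limit: only the untouched (1 - f) remains. */
double amdahl_limit(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

For instance, `amdahl_speedup(0.5, 2.0)` is 1 / (0.5 + 0.25), about 1.33, and `amdahl_limit(0.5)` is 2: no matter how fast the enhanced half becomes, the other half still takes its original time.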

https://www.pollev.com/hungweitseng

Practicing Amdahl’s Law

  • Final Fantasy XV spends lots of time loading a map. During that period, 95% of the time is spent accessing the H.D.D., and the rest is spent in the operating system, the file system, and the I/O protocol. If we replace the H.D.D. with a flash drive that provides 100x faster access time, by how much can we speed up the map loading process?

A. ~7x

B. ~10x

C. ~17x

D. ~29x

E. ~100x

[Figure: hard disk drive latency (us), broken down into file system, operating system, and HDD.]
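One way to work this out, assuming the 95% H.D.D. share is the fraction f and the 100x faster access time is the speedup s in Amdahl's Law:

```latex
\text{Speedup}_{enhanced}(0.95, 100)
  = \frac{1}{(1 - 0.95) + \frac{0.95}{100}}
  = \frac{1}{0.05 + 0.0095}
  = \frac{1}{0.0595}
  \approx 16.8
```

That is roughly 17x (option C). Even an infinitely fast drive could not exceed 1 / 0.05 = 20x, because the 5% spent in software is untouched.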
Amdahl’s Law on Multiple Optimizations

  • We can apply Amdahl’s law for multiple optimizations
  • These optimizations must be disjoint!
    • If optimization #1 and optimization #2 are disjoint:

      [Diagram: execution time split into f_Opt1, f_Opt2, and 1 - f_Opt1 - f_Opt2]

$$\text{Speedup}_{enhanced}(f_{Opt1}, f_{Opt2}, s_{Opt1}, s_{Opt2}) = \frac{1}{(1 - f_{Opt1} - f_{Opt2}) + \frac{f_{Opt1}}{s_{Opt1}} + \frac{f_{Opt2}}{s_{Opt2}}}$$

    • If optimization #1 and optimization #2 are not disjoint:

      [Diagram: execution time split into f_OnlyOpt1, f_OnlyOpt2, f_BothOpt1Opt2, and 1 - f_OnlyOpt1 - f_OnlyOpt2 - f_BothOpt1Opt2]

$$\text{Speedup}_{enhanced}(f_{OnlyOpt1}, f_{OnlyOpt2}, f_{BothOpt1Opt2}, s_{OnlyOpt1}, s_{OnlyOpt2}, s_{BothOpt1Opt2}) = \frac{1}{(1 - f_{OnlyOpt1} - f_{OnlyOpt2} - f_{BothOpt1Opt2}) + \frac{f_{BothOpt1Opt2}}{s_{BothOpt1Opt2}} + \frac{f_{OnlyOpt1}}{s_{OnlyOpt1}} + \frac{f_{OnlyOpt2}}{s_{OnlyOpt2}}}$$
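Both formulas have the same shape: the execution time is split into regions that do not overlap, each with its own speedup, plus an untouched remainder. A minimal sketch in C, using a hypothetical function `amdahl_multi` that takes the per-region fractions and speedups as arrays:

```c
#include <assert.h>
#include <stddef.h>

/* Amdahl's Law over n non-overlapping regions of the baseline execution time.
   f[i] is the fraction of time in region i, s[i] is the speedup on that region. */
double amdahl_multi(const double *f, const double *s, size_t n) {
    double untouched = 1.0;  /* fraction of time no optimization applies to */
    double enhanced = 0.0;   /* time left in the optimized regions */
    for (size_t i = 0; i < n; i++) {
        assert(f[i] >= 0.0 && s[i] > 0.0);
        untouched -= f[i];
        enhanced += f[i] / s[i];
    }
    assert(untouched >= -1e-12);  /* fractions must not sum to more than 1 */
    return 1.0 / (untouched + enhanced);
}
```

For two disjoint optimizations, pass two regions, for example f = {0.4, 0.3} and s = {2, 3}, which gives 1 / (0.3 + 0.2 + 0.1), about 1.67. For overlapping optimizations, pass three regions (only #1, only #2, both) with their three speedups; the numbers in that case are up to the caller to measure.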


Speedup further!

  • With the latest flash memory technologies, the system spends 16% of the time accessing the flash, and the software overhead is now 84%. If your company asks you and your team to invent a new memory technology that replaces flash to achieve a 2x speedup on loading maps, how much faster does the new technology need to be?

A. ~5x

B. ~10x

C. ~20x

D. ~100x

E. None of the above

[Figure: flash SSD latency (us), broken down into file system, operating system, and hardware.]
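Questions like this one ask for s given a target overall speedup, which means solving Amdahl's Law for s: s = f / (1/target - (1 - f)). The sketch below does that in C; the function name `required_speedup` and the convention of returning -1.0 when the target is unreachable are assumptions made for this example.

```c
#include <assert.h>

/* Returns the speedup s needed on fraction f to reach an overall speedup
   of `target`, or -1.0 when no finite s can reach it. */
double required_speedup(double f, double target) {
    assert(f > 0.0 && f <= 1.0);
    assert(target >= 1.0);
    /* Time budget left for the enhanced part after the untouched (1 - f). */
    double remaining = 1.0 / target - (1.0 - f);
    if (remaining <= 0.0)
        return -1.0;
    return f / remaining;
}
```

With f = 0.16 and a target of 2, the budget is 0.5 - 0.84, which is negative, so the function returns -1.0: the untouched 84% alone already exceeds half of the original time. The best any memory technology could do here is 1 / 0.84, about 1.19x, which points to "None of the above" unless the software overhead is reduced as well.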