Performance Analysis of Parallel Systems: Metrics, Overheads, and Isoefficiency, Slides of Parallel Computing and Programming

An overview of performance metrics for parallel systems, focusing on the tradeoff between granularity and performance, scalability, and the concept of asymptotic isoefficiency. It covers topics such as measuring program performance, asymptotic execution time, speedup, and parallel efficiency. The document also discusses the impact of non-cost optimality and the effect of granularity on performance.

Typology: Slides

2011/2012

Uploaded on 07/19/2012

adnaan

Topic Overview

  • Performance metrics for parallel systems
  • Tradeoff: granularity vs. performance
  • Scalability of parallel systems
  • Introduction to asymptotic isoefficiency

Measuring Program Performance

  • Wall clock time
    - measured from the start of the first process to the end of the last process
    - how does this scale?
      - when the number of processors is changed
      - when the program is ported to another machine
  • Operation counts, e.g. FLOPs
    - are these useful?
  • How much faster is the parallel version?
    - what do we compare with?

Performance Metrics: Execution Time

  • Serial time: $T_S$
    - time elapsed between the start and end of serial execution
  • Parallel time: $T_P$
    - time elapsed between the start of the first process and the end of the last process
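
The two definitions above can be sketched as timer placement. This is a minimal illustration, not a benchmarking method: `wall_clock` and `parallel_sum` are hypothetical helpers, and a thread pool stands in for a real parallel machine. The timer around the parallel run starts before the first worker is launched and stops after the last one has finished.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wall_clock(run):
    """Run a zero-argument callable; return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


def parallel_sum(data, workers):
    """Sum `data` by splitting it into at most `workers` chunks (hypothetical helper)."""
    size = max(1, (len(data) + workers - 1) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool joins all of its workers before the `with` block exits.
        return sum(pool.map(sum, chunks))


data = list(range(1_000_000))
serial_total, t_s = wall_clock(lambda: sum(data))
parallel_total, t_p = wall_clock(lambda: parallel_sum(data, workers=4))
print(f"T_S = {t_s:.4f} s, T_P = {t_p:.4f} s")
```

In CPython, threads do not speed up CPU-bound work, so this sketch should not be expected to show $T_P < T_S$; it only shows where the measurements begin and end.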

Performance Metrics: Total Parallel Overhead

  • $T_{all}$: total time spent collectively by all processors
    - $T_{all} = p\,T_P$, where $p$ is the number of processors
  • Total parallel overhead: $T_o$
    - time wasted by all processors combined
    - $T_o = T_{all} - T_S$
    - $T_o = p\,T_P - T_S$
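
As a small worked instance of the overhead formula, the sketch below evaluates $T_o = p\,T_P - T_S$. The function name and the timings are hypothetical and chosen only for illustration.

```python
def total_overhead(t_serial, t_parallel, p):
    """Total parallel overhead T_o = p * T_P - T_S."""
    t_all = p * t_parallel  # time spent by all p processors together
    return t_all - t_serial


# Hypothetical timings: T_S = 100 s, T_P = 30 s on p = 4 processors.
print(total_overhead(100.0, 30.0, 4))  # 20.0
```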

Performance Metrics: Speedup

  • $S = T_S / T_P$

Example

  • Add $n$ numbers using $n$ processing elements
  • If $n$ is a power of two
    - the sum can be computed in $\log n$ steps on $n$ processors
      - propagate partial sums up a logical binary tree

Performance analysis

  • $T_S = \Theta(n)$
  • Assumptions
    - an addition takes constant time $t_c$
    - communication of a single word takes time $t_s + t_w$
  • $T_P = \Theta(\log n)$
  • Speedup $S = \Theta(n / \log n)$
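
The binary-tree summation can be simulated serially to check the step count. This is a sketch under the slide's assumption that $n$ is a power of two; `tree_sum` is a hypothetical name, and the inner loop stands in for additions that the $n$ processing elements would perform concurrently within one step.

```python
def tree_sum(values):
    """Sum a power-of-two-length sequence by pairwise combination.

    Returns (total, steps), where steps counts the parallel time steps.
    """
    partial = list(values)
    n = len(partial)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    steps = 0
    stride = 1
    while stride < n:
        # On a parallel machine, all additions in this loop happen at once.
        for i in range(0, n, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    return partial[0], steps


print(tree_sum(range(16)))  # (120, 4)
```

For 16 values the simulation takes 4 steps, matching $\log_2 16$.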

A Note About $T_S$

  • Might be many serial algorithms for a problem
  • Different algorithms may have different asymptotic runtimes
  • May be parallelizable to different degrees

Speedup of Odd-Even Parallel Sort

  • Serial time for bubblesort: 150 seconds
  • Odd-even parallel sort: 40 seconds
  • Apparent speedup = 150/40 = 3.75
    - is this a fair assessment?

Should consider the best serial program as the baseline for $T_S$

  • What if serial quicksort only took 30 seconds?
  • Speedup of odd-even sort over quicksort = 30/40 = 0.75
    - fairer assessment
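
The effect of the baseline choice can be reproduced with the timings quoted on this slide. The helper name `speedup` is only illustrative.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_S / T_P."""
    return t_serial / t_parallel


print(speedup(150, 40))  # 3.75, bubblesort as the serial baseline
print(speedup(30, 40))   # 0.75, quicksort as the serial baseline
```

A value below 1 means the parallel program is slower than the best serial one.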

Performance Metrics: Speedup Bounds

  • Parallel program never terminates: speedup = 0
  • Speedup > $p$?
    - theoretically: only if each processor spends less than time $T_S / p$
    - but then a single processor could be time-sliced to finish in less than $T_S$
      - contradicts the assumption that $T_S$ is minimal
    - in practice: yes!

Parallel Efficiency

  • Fraction of time a processor performs useful work
  • $E = S / p = T_S / (p\,T_P)$
  • Bounds
    - theoretically: $0 \le E \le 1$
    - in practice: efficiency can exceed 1 when speedup is superlinear
  • Previous example: add $n$ numbers using $n$ PEs
    - speedup $S = \Theta(n / \log n)$
    - efficiency $E = S / n = \Theta(n / \log n) / n = \Theta(1 / \log n)$
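
To make the $\Theta(1 / \log n)$ result concrete, the sketch below evaluates the efficiency formula for one problem size. It assumes a unit-cost model with all constants dropped ($T_S = n$, $T_P = \log_2 n$), which is an illustrative simplification rather than a measured timing.

```python
import math


def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency E = T_S / (p * T_P)."""
    return t_serial / (p * t_parallel)


# Assumption: unit-cost model, T_S = n and T_P = log2(n), with p = n PEs.
n = 1024
e = efficiency(t_serial=n, t_parallel=math.log2(n), p=n)
print(e)  # 0.1, i.e. 1 / log2(1024)
```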

Example: Edge Detection

  • The operation uses a $3 \times 3$ template to compute each pixel value
  • Serial time for an $n \times n$ image: $T_S = 9 t_c n^2$
  • Possible parallelization
    - partition the image equally into vertical slabs, each with $n^2 / p$ pixels
    - the boundary of each slab is $2n$ pixels
      - this is the number of pixel values that must be communicated
    - communication time $= 2(t_s + t_w n)$
  • Apply the template to all $n^2 / p$ pixels of a slab in time $9 t_c n^2 / p$
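
Combining the per-slab computation time with the communication time gives the parallel time, and from it the speedup and efficiency. The derivation below is a sketch that assumes computation and communication do not overlap, which the slide does not state explicitly.

```latex
\begin{align*}
T_P &= \underbrace{\frac{9\,t_c\,n^2}{p}}_{\text{computation}}
     + \underbrace{2\,(t_s + t_w n)}_{\text{communication}} \\
S &= \frac{T_S}{T_P}
   = \frac{9\,t_c\,n^2}{\dfrac{9\,t_c\,n^2}{p} + 2\,(t_s + t_w n)} \\
E &= \frac{S}{p}
   = \frac{1}{1 + \dfrac{2\,p\,(t_s + t_w n)}{9\,t_c\,n^2}}
\end{align*}
```

Under this model, efficiency approaches 1 as $n$ grows for fixed $p$, because the communication term grows only linearly in $n$.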

Cost Optimality

  • Cost of a parallel system $= p\,T_P$
    - sum of the time each processor spends working
    - also known as work or processor-time product
  • A parallel system is cost-optimal if
    - the cost of solving a problem on the parallel computer grows asymptotically at the same rate as the serial runtime: $p\,T_P = \Theta(T_S)$
  • Since $E = T_S / (p\,T_P)$, cost-optimal systems have $E = \Theta(1)$

Considering Cost Optimality

Problem revisited: add $n$ numbers

  • Is it cost-optimal on a parallel system using $n$ PEs?
  • As before, $T_P = \Theta(\log n)$ for $p = n$
  • Cost of this system $= p\,T_P = \Theta(n \log n)$
  • Serial runtime $= \Theta(n)$
  • The algorithm is not cost-optimal
    - $E = \Theta(n / (n \log n)) = \Theta(1 / \log n)$
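
One way to see why this is not cost-optimal is to watch the ratio of cost to serial runtime as $n$ grows. The sketch assumes the same constant-free model as the slide ($T_P = \log_2 n$, $T_S = n$, $p = n$); `cost_ratio` is a hypothetical helper.

```python
import math


def cost(p, t_parallel):
    """Cost (processor-time product) of a parallel system: p * T_P."""
    return p * t_parallel


def cost_ratio(n):
    """Cost of the n-PE tree sum divided by the serial runtime T_S = n."""
    return cost(p=n, t_parallel=math.log2(n)) / n


for k in (4, 10, 20):
    n = 2 ** k
    print(n, cost_ratio(n))
```

The ratio equals $\log_2 n$ and keeps growing with $n$, so the cost is not within a constant factor of the serial runtime.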

Effect of Granularity on Performance

  • Scaling down a parallel system
    - using fewer processors than the maximum possible
    - usually improves parallel system efficiency
    - naïve scaling down
      - consider each original processor as a virtual processor
      - map virtual processors onto the scaled-down number of processors
  • Impact
    - number of PEs decreases by a factor of $n / p$
    - computation per PE increases by a factor of $n / p$
    - communication cost should decrease as well
      - VPs assigned to the same physical processor might communicate with each other

Building Granularity: Sum Example

  • Add $n$ numbers on $p$ processing elements
    - $p < n$
    - $n$ and $p$ are powers of 2
  • Use the parallel algorithm for $n$ (virtual) processors
    - assign each physical processor $n / p$ virtual processors
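
The assignment of virtual to physical processors can be sketched as follows. The slide does not say which mapping is used, so the contiguous block mapping here is an assumption, and `block_mapping` is a hypothetical helper.

```python
def block_mapping(n, p):
    """Assign n virtual processors to p physical ones in contiguous blocks.

    Assumption: block mapping; a cyclic mapping would be equally valid.
    Returns a list whose entry pe holds the virtual processor ids on PE pe.
    """
    if p <= 0 or n % p:
        raise ValueError("p must be positive and divide n")
    per_pe = n // p  # each physical processor simulates n / p virtual ones
    return [list(range(pe * per_pe, (pe + 1) * per_pe)) for pe in range(p)]


mapping = block_mapping(16, 4)
print(mapping[1])  # [4, 5, 6, 7]
```

Each physical processor receives exactly $n / p$ virtual processors, so its computation grows by that factor.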