Performance Analysis: Execution Time, Speedup, and Approaches in Parallel Computing

An introduction to performance analysis in parallel computing, covering key concepts such as execution time, speedup, and various approaches for measuring and modeling performance. The examples use Unix, MPI, and several parallel architectures.

Uploaded on 08/19/2009

Performance Analysis
Jingke Li
Portland State University
Jingke Li (Portland State University) CS 415/515 Performance Analysis 1 / 36

Performance Analysis

Performance Metrics:

  • Execution Time
    • (+) most direct reflection of performance
    • (−) not always insightful
  • Speedup
    • (+) shows performance relative to sequential performance

Approaches

  • Extrapolation from Observations: run the program and collect performance data. For example, “We implemented the algorithm on parallel computer X and achieved a speedup of 10.8 on 12 processors with problem size N = 100.”
  • Asymptotic Analysis: analyze algorithm performance based on problem size and processor count. For example, “The algorithm requires O(N log N) time on O(N) processors.”

  • Performance Modeling: specify a performance metric as a function of processor count $P$ and problem size $N$. For example:
    (a) $T = N + N^2/P$
    (b) $T = (N + N^2)/P + 100$
    (c) $T = (N + N^2)/P + 0.6P^2$
    These three formulas describe distinct performance patterns for their corresponding algorithms, even though they all achieve the same speedup of about 10.8 when $P = 12$ and $N = 100$, and have the same asymptotic complexity when $P = O(N)$.
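The three models can be compared numerically. The sketch below evaluates each one and the resulting speedup; it assumes a sequential time of Ts = N + N^2 (the notes do not state the baseline explicitly), and the function names are illustrative only.

```c
#include <assert.h>

/* ASSUMPTION: sequential baseline Ts = N + N^2; not stated in the notes. */
double seq_time(double n) { return n + n * n; }

/* Model (a): T = N + N^2 / P */
double model_a(double n, double p) { assert(p > 0); return n + n * n / p; }

/* Model (b): T = (N + N^2) / P + 100 */
double model_b(double n, double p) { assert(p > 0); return (n + n * n) / p + 100.0; }

/* Model (c): T = (N + N^2) / P + 0.6 P^2 */
double model_c(double n, double p) { assert(p > 0); return (n + n * n) / p + 0.6 * p * p; }

/* Speedup S = Ts / Tp */
double speedup(double ts, double tp) { assert(tp > 0); return ts / tp; }
```

Under this assumption, all three models give a speedup close to 10.8 at N = 100, P = 12. Evaluating the same functions at P = 100 separates them: (a) and (b) give speedups near 50, while (c) drops below 2 because of its 0.6P^2 term.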

Measuring Wall-Clock Time in Unix

  • time(): returns the wall-clock time in seconds since the Epoch (defined as 00:00:00 UTC, January 1, 1970).
  • gettimeofday(): returns the wall-clock time at microsecond scale.
  • clock_gettime(): returns the wall-clock time at nanosecond scale. (Works only on systems that implement the POSIX.1b Realtime Extension.)

    #include <time.h>
    time_t time(time_t *tloc);

If tloc is not NULL, the return value is also stored in *tloc.

    #include <sys/time.h>
    int gettimeofday(struct timeval *tp, void *tzp);

The struct timeval structure contains two members, tv_sec and tv_usec (declared as time_t and suseconds_t in POSIX; both are long on many systems), whose values are the seconds and microseconds since the Epoch, respectively. The second parameter tzp must be NULL.

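An Example

    #include <stdio.h>
    #include <sys/time.h>

    #define MILLION 1000000L

    void function_to_time(void);   /* the code being measured, defined elsewhere */

    int main(void)
    {
        struct timeval tpstart, tpend;
        long elapsed;

        gettimeofday(&tpstart, NULL);
        function_to_time();
        gettimeofday(&tpend, NULL);
        /* whole seconds converted to microseconds, plus the microsecond difference */
        elapsed = MILLION * (tpend.tv_sec - tpstart.tv_sec)
                  + tpend.tv_usec - tpstart.tv_usec;
        printf("Elapsed time: %ld microseconds\n", elapsed);
        return 0;
    }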
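The gettimeofday() timing pattern carries over to clock_gettime() at nanosecond resolution. The following is a minimal sketch, not part of the original notes: elapsed_ns and time_call_ns are hypothetical helper names, and CLOCK_MONOTONIC is chosen here (instead of CLOCK_REALTIME) on the assumption that only intervals are of interest.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request clock_gettime() from <time.h> */
#endif
#include <assert.h>
#include <time.h>

/* Nanoseconds elapsed between two timespec readings (end must not precede start). */
long long elapsed_ns(struct timespec start, struct timespec end)
{
    long long ns = (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    assert(ns >= 0);
    return ns;
}

/* Times one call of work() and returns the elapsed nanoseconds.
   work is a hypothetical stand-in for the code being measured. */
long long time_call_ns(void (*work)(void))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1);
}
```

A caller would pass the function to be measured, for example time_call_ns(function_to_time). The monotonic clock is unaffected by adjustments to the system time, which is why it is often preferred for measuring intervals.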

Measuring CPU Time in Unix

  • clock(): returns the amount of CPU time (in terms of number of clock ticks) used by the calling process. (This is an ANSI C routine.)

    #include <time.h>
    clock_t clock(void);

To convert the return value to seconds, divide it by CLOCKS_PER_SEC (typically 1,000,000).
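As a rough illustration of the conversion, the sketch below measures the CPU time of a single call. The helper names cpu_seconds and cpu_time_of are hypothetical, and work stands in for whatever code is being measured.

```c
#include <assert.h>
#include <time.h>

/* Convert the difference of two clock() readings to seconds of CPU time. */
double cpu_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* CPU seconds consumed by one call of work(); work is a hypothetical placeholder. */
double cpu_time_of(void (*work)(void))
{
    clock_t start = clock();
    assert(start != (clock_t)-1);   /* clock() returns -1 if CPU time is unavailable */
    work();
    return cpu_seconds(start, clock());
}
```

Unlike the wall-clock routines, this counts only the time the process spends on the CPU, so time spent blocked or sleeping is not included.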

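Example (MPI)

    double start, finish;
    ...
    MPI_Barrier(comm);      /* synchronize so all processes start timing together */
    start = MPI_Wtime();    /* wall-clock time in seconds */
    ...
    MPI_Barrier(comm);      /* wait until every process has finished */
    finish = MPI_Wtime();
    if (my_rank == 0)
        printf("Elapsed time = %e seconds\n", finish - start);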

Speedup

$$S = \frac{T_s}{T_p}$$

  • $T_s$: execution time using a single processor
  • $T_p$: execution time using $N$ processors

[Figure: speedup versus number of processors, both axes ticked at 1, 10, 100, 1000, comparing perfect speedup with Algorithms 1, 2, and 3.]

A simple observation: $S \le N$

Amdahl’s Law

The speedup of a program is upper bounded by the reciprocal of the sequential fraction of the program. (The bigger the fraction, the smaller the speedup.)

For a given program, let f be the sequential fraction, then

  • $f T_s$ is the time required on the sequential portion
  • $(1 - f) T_s$ is the time required on the parallelizable portion

$$S = \frac{T_s}{T_p} = \frac{T_s}{f T_s + (1 - f) T_s / N} = \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$

Amdahl’s Law Example

For example, if the sequential component of a program is 5% (f = 0.05), then the maximum speedup that can be achieved is 1/f = 20.
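The bound is easy to explore numerically. The sketch below is illustrative only: amdahl_speedup and amdahl_limit are hypothetical names that evaluate S = 1/(f + (1 − f)/N) and its limit 1/f.

```c
#include <assert.h>

/* Amdahl's law: speedup on n processors with sequential fraction f. */
double amdahl_speedup(double f, double n)
{
    assert(f >= 0.0 && f <= 1.0 && n >= 1.0);
    return 1.0 / (f + (1.0 - f) / n);
}

/* Upper bound on speedup as n grows without limit. */
double amdahl_limit(double f)
{
    assert(f > 0.0 && f <= 1.0);
    return 1.0 / f;
}
```

With f = 0.05, this gives a speedup of about 9.1 on 16 processors and about 19.6 on 1024 processors, approaching but never reaching the limit of 20.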

Other Speedup Models

  • Fixed-Memory Speedup (Scaled Speedup): each processor runs a (sub)program of equal size.

    $$S_{FM}(P) = \frac{T_s(P N_0)}{T_p(P N_0)} \quad \text{(where } N/P = N_0 \text{ is a constant)}$$

  • Fixed-Time Speedup: each processor runs for a fixed amount of time.

    $$S_{FT}(P) = \frac{T_s(N_p)}{T_p(N_p)} \quad \text{(where } T_p(N_p) \text{ is a constant)}$$

Example

Given a program with the following features:

Time (μs):
    $T_s(N) = 1{,}000{,}000 + 1{,}000N + 24N^2$
    $T_p(N) = 1{,}500{,}000 + 1{,}050N/P + 24N^2/P$

Space (bytes):
    $S_s(N) = 100{,}000 + 200N$
    $S_p(N) = 125{,}000 + 200N/P$

Consider three cases:

(1) Fixed-size speedup for $N_0 = 1{,}000$
(2) Scaled speedup for $N/P = 1{,}000$
(3) Fixed-time speedup for $T_p = T_s(N_0)$

Results

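$$S_{FS} = \frac{T_s(N_0)}{T_p(N_0)} = \frac{26}{1.5 + 25.05/P} \to \frac{26}{1.5} = 17.33$$

$$S_{FM} = \frac{T_s(P N_0)}{T_p(P N_0)} = \frac{1 + P + 24P^2}{2.55 + 24P} \quad \text{(unbounded!)}$$

$$S_{FT} = \frac{T_s(N_p)}{T_s(N_0)}$$

where $N_p$ is the positive root of $T_p(N_p) = T_s(N_0) = 26{,}000{,}000$, that is, of $\frac{24}{P} N_p^2 + \frac{1050}{P} N_p - 24{,}500{,}000 = 0$:

$$N_p = \frac{-1050/P + \sqrt{(1050/P)^2 + 4 \cdot 24{,}500{,}000 \cdot 24/P}}{2(24/P)}$$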

Results (cont.)

   P    FS Speedup   FM Speedup   FM Time (s)   FT Speedup   FT Storage (KB)
   1       0.98         0.98         26.55         0.98           323
   2       1.85         1.96         50.55         1.92           266
   4       3.35         3.95         98.55         3.80           225
   8       5.61         7.94        194.55         7.57           196
  16       8.48        15.94        386.55        15.11           175
  32      11.39        31.94        770.55        30.18           160
  64      13.75        63.94       1538.55        60.33           150
 128      15.33       127.94       3074.55       120.6            143
 256      16.27       255.94       6146.55       241.2            138
 512      16.78       511.94      12290.55       482.5            134
1024      17.06      1023.94      24578.55       964.9            131
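The entries in this table can be reproduced from the time and space functions of the example. The sketch below is one way to do so; the function names are hypothetical, and the fixed-time problem size is found by bisection (an implementation choice made here) instead of the closed-form quadratic root.

```c
#include <assert.h>

#define N0 1000.0   /* base problem size from the example */

/* Sequential and parallel execution times in microseconds. */
double ts(double n) { return 1000000.0 + 1000.0 * n + 24.0 * n * n; }
double tp(double n, double p)
{
    assert(p >= 1.0);
    return 1500000.0 + 1050.0 * n / p + 24.0 * n * n / p;
}

double fixed_size_speedup(double p)   { return ts(N0) / tp(N0, p); }
double fixed_memory_speedup(double p) { return ts(p * N0) / tp(p * N0, p); }

/* Problem size Np with tp(Np, p) == ts(N0), found by bisection.
   tp grows monotonically with n for n >= 0, so the root is unique. */
double fixed_time_size(double p)
{
    double target = ts(N0), lo = 0.0, hi = 1.0;
    while (tp(hi, p) < target)
        hi *= 2.0;                       /* grow until the root is bracketed */
    for (int i = 0; i < 100; i++) {
        double mid = 0.5 * (lo + hi);
        if (tp(mid, p) < target) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* Fixed-time speedup: Ts(Np) / Ts(N0), since Tp(Np) is held at Ts(N0). */
double fixed_time_speedup(double p) { return ts(fixed_time_size(p)) / ts(N0); }

/* Per-processor storage in bytes for the fixed-time problem size. */
double fixed_time_storage(double p) { return 125000.0 + 200.0 * fixed_time_size(p) / p; }
```

For P = 2, these functions give roughly 1.85, 1.96, and 1.92 for the three speedups and about 266,000 bytes of storage, matching the corresponding row.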

Performance Analysis

Step 1. Developing Performance Models:

  • Computation Time:
    • Analyze algorithm complexity
    • Analyze processor workload
  • Communication Time:
    • Decompose communication components in the algorithm into basic communication patterns.
    • Develop performance models for basic communication patterns.

Step 2. Fitting Data to Models:

  • Given a set of observed performance data points and a (hypothesized) performance function, find the parameters in the function. Example: Least-Squares Fit Method. Find the parameters of function $f$ such that $\sum_i (\mathrm{obs}(i) - f(i))^2$ is minimized.
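For a model that is linear in its parameters, the least-squares fit has a closed form. The sketch below assumes a hypothetical two-parameter model y = a + b·x (for instance, time as a function of message length); fit_line is an illustrative name, not something from the original notes.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares fit of y ~= a + b*x over n data points.
   Returns 0 on success, -1 if the fit is undetermined
   (fewer than two points, or all x values identical). */
int fit_line(const double *x, const double *y, size_t n, double *a, double *b)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    assert(x && y && a && b);
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double denom = (double)n * sxx - sx * sx;
    if (n < 2 || denom == 0.0)
        return -1;
    *b = ((double)n * sxy - sx * sy) / denom;   /* slope */
    *a = (sy - *b * sx) / (double)n;            /* intercept */
    return 0;
}
```

Given measured (x, y) pairs, the returned a and b minimize the sum of squared differences between the observations and a + b·x. Models that are not linear in their parameters generally need an iterative fitting method instead.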

A Simple One-to-One Communication Model

[Figure: time T versus message length L, annotated with ts = startup cost and tw = cost per word.]

$$t_{comm} = t_s + t_w L$$

Machine             ts      tw
IBM SP2             40      0.
Intel DELTA         77      0.
Intel Paragon      121      0.
Meiko CS-2          87      0.
nCUBE-2            154      2.
TMC CM-5            82      0.
PCs on Ethernet   1500      5.
PCs on FDDI       1150      1.

Model Collective Communications

Implementation of collective communications is topology dependent. For example, consider one-to-all broadcast:

  • On a Ring (figure omitted)
  • On a Mesh (figure omitted)
  • On a Tree (figure omitted)
  • On a Hypercube (figure omitted)

Cost Analysis of One-to-All Communication

  • Broadcast and Reduction: the cost model for one-to-all broadcast over $p$ processors is straightforward. It simply involves $\log p$ steps of (concurrent) simple point-to-point message transfers:
    $$t_{bcast} = (t_s + t_w m) \log p$$
    This cost model also works for all-to-one reduction.
  • Scatter and Gather: the abstract communication pattern for a scatter is the same as for a broadcast. The difference is that, in a scatter, the message size starts out at $mp/2$ and is halved in each subsequent step:
    $$t_{scatter} = (t_s + t_w m p/2) + (t_s + t_w m p/4) + \cdots + (t_s + t_w m) = t_s \log p + t_w m (p - 1)$$
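These two cost models can be checked with a short calculation. The sketch below assumes p is a power of two and that logarithms are base 2; the function names and the parameter values in the usage note are hypothetical. The scatter cost is summed step by step so it can be compared against the closed form ts log p + tw m(p − 1).

```c
#include <assert.h>

/* Integer base-2 logarithm; p must be a power of two (an assumption of this sketch). */
static int ilog2(int p)
{
    assert(p >= 1 && (p & (p - 1)) == 0);
    int k = 0;
    while (p > 1) { p >>= 1; k++; }
    return k;
}

/* One-to-all broadcast (or all-to-one reduction): log p steps of size m. */
double bcast_cost(double ts, double tw, double m, int p)
{
    return (ts + tw * m) * ilog2(p);
}

/* Scatter, summed step by step: message sizes m*p/2, m*p/4, ..., m. */
double scatter_cost_stepwise(double ts, double tw, double m, int p)
{
    double total = 0.0, size = m * p / 2.0;
    for (int i = 0; i < ilog2(p); i++) {
        total += ts + tw * size;
        size /= 2.0;
    }
    return total;
}

/* Scatter, closed form: ts log p + tw m (p - 1). */
double scatter_cost_closed(double ts, double tw, double m, int p)
{
    return ts * ilog2(p) + tw * m * (p - 1);
}
```

With illustrative values ts = 40, tw = 0.5, m = 10, and p = 8, the broadcast cost is 135 and both scatter functions return 155.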

All-to-All Communication on a Ring

It involves $p - 1$ steps of shift:

$$T = (t_s + t_w m)(p - 1)$$

All-to-All Communication on a Mesh

It involves $\sqrt{p} - 1$ steps of shift in rows with message size $m$, and $\sqrt{p} - 1$ steps of shift in columns with message size $m\sqrt{p}$:

$$T = (t_s + t_w m)(\sqrt{p} - 1) + (t_s + t_w m \sqrt{p})(\sqrt{p} - 1) = 2 t_s (\sqrt{p} - 1) + t_w m (p - 1)$$

All-to-All Communication on a Hypercube

It involves log p steps, with doubled message size in each step:

$$T = t_s \log p + t_w m (p - 1)$$
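The three all-to-all cost models can be placed side by side. The sketch below is illustrative: the function names are hypothetical, the mesh is assumed square with side q (so p = q·q), and the hypercube version assumes p is a power of two and sums the doubling message sizes step by step.

```c
#include <assert.h>

/* Integer base-2 logarithm; p must be a power of two (an assumption of this sketch). */
static int log2_int(int p)
{
    assert(p >= 1 && (p & (p - 1)) == 0);
    int k = 0;
    while (p > 1) { p >>= 1; k++; }
    return k;
}

/* Ring: p - 1 shift steps, each with message size m. */
double alltoall_ring(double ts, double tw, double m, int p)
{
    assert(p >= 1);
    return (ts + tw * m) * (p - 1);
}

/* Square mesh with side q: q - 1 row shifts of size m,
   then q - 1 column shifts of size m * q. */
double alltoall_mesh(double ts, double tw, double m, int q)
{
    assert(q >= 1);
    double rows = (ts + tw * m) * (q - 1);
    double cols = (ts + tw * m * q) * (q - 1);
    return rows + cols;
}

/* Hypercube: log p steps, with the message size doubling at each step. */
double alltoall_hypercube(double ts, double tw, double m, int p)
{
    double total = 0.0, size = m;
    for (int i = 0; i < log2_int(p); i++) {
        total += ts + tw * size;
        size *= 2.0;
    }
    return total;
}
```

With illustrative values ts = 40, tw = 0.5, m = 10, and p = 16 (q = 4), the costs are 675 on the ring, 315 on the mesh, and 235 on the hypercube. The tw m(p − 1) term is the same in all three; only the number of startups differs.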