










An introduction to performance analysis in parallel computing, covering key concepts such as execution time and speedup, and various approaches to measuring and modeling performance, with examples using UNIX, MPI, and different parallel architectures.











Jingke Li
Portland State University
Jingke Li (Portland State University) CS 415/515 Performance Analysis 1 / 36
Performance Metrics:
#include <time.h>
time_t time(time_t *tloc);
If tloc is not NULL, the return value is also stored in *tloc.
#include <sys/time.h>
int gettimeofday(struct timeval *tp, void *tzp);
The struct timeval structure contains two members, long tv_sec and long tv_usec, whose values are the seconds and microseconds since the Epoch, respectively. The second parameter tzp must be NULL.
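#include <stdio.h>
#include <sys/time.h>

#define MILLION 1000000L

void function_to_time(void);    /* the code being measured, defined elsewhere */

int main(void)
{
    struct timeval tpstart;
    struct timeval tpend;
    long elapsed;               /* elapsed wall-clock time in microseconds */

    gettimeofday(&tpstart, NULL);
    function_to_time();
    gettimeofday(&tpend, NULL);
    elapsed = MILLION * (tpend.tv_sec - tpstart.tv_sec)
              + tpend.tv_usec - tpstart.tv_usec;
    printf("Elapsed time = %ld microseconds\n", elapsed);
    return 0;
}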
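The elapsed-time arithmetic in the example above can be factored into a small helper so that it is written only once. This is a minimal sketch; the function name `elapsed_usec` is an assumption made for illustration and is not part of the original notes.

```c
#include <sys/time.h>

/* Hypothetical helper (not from the original notes): microseconds
 * elapsed between two gettimeofday() samples. The tv_usec difference
 * may be negative; the tv_sec term compensates for it. */
long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return 1000000L * (long)(end->tv_sec - start->tv_sec)
         + (long)(end->tv_usec - start->tv_usec);
}
```

A caller would sample `gettimeofday()` before and after the region of interest and pass the addresses of both samples to the helper.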
#include <time.h>
clock_t clock(void);
To convert the return value to seconds, divide it by CLOCKS_PER_SEC (typically 1,000,000).
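The conversion to seconds can be sketched as follows. Note that `clock()` reports processor time used by the program, not wall-clock time. The names `cpu_seconds` and `busy_work` are hypothetical and serve only as an illustration.

```c
#include <time.h>

/* Hypothetical helper: CPU seconds consumed by one call to fn(),
 * measured with clock() and converted using CLOCKS_PER_SEC. */
double cpu_seconds(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Example workload, for illustration only. The volatile qualifier
 * keeps the compiler from removing the loop. */
void busy_work(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L; i++)
        x += (double)i;
}
```

Calling `cpu_seconds(busy_work)` returns the processor time of the loop; very short workloads may report zero because of the limited resolution of `clock()`.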
double start, finish;
...
MPI_Barrier(comm);      /* line up all processes before starting the timer */
start = MPI_Wtime();
...
MPI_Barrier(comm);      /* wait until every process has finished */
finish = MPI_Wtime();
if (my_rank == 0)
    printf("Elapsed time = %e seconds\n", finish - start);
Speedup: $S = T_s / T_p$
[Figure: speedup versus number of processors, with axis ticks from 1 to 1000, comparing perfect speedup ($S = N$) with Algorithm 1, Algorithm 2, and Algorithm 3.]
A Simple Observation: S ≤ N
The speedup of a program is upper bounded by the reciprocal of the sequential fraction of the program. (The bigger the fraction, the smaller the speedup.)
For a given program, let f be the sequential fraction; then

$$S = \frac{T_s}{T_p} = \frac{T_s}{f T_s + (1 - f) T_s / N} = \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$
E.g. If the sequential component of a program is 5%, then the maximum speedup that can be achieved is 20.
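The bound can be checked numerically. The sketch below evaluates the speedup formula for a sequential fraction f on n processors, together with its limit 1/f; the function names are assumptions introduced for this illustration.

```c
/* Speedup predicted for sequential fraction f (0 < f <= 1)
 * on n processors: 1 / (f + (1 - f) / n). */
double amdahl_speedup(double f, double n)
{
    return 1.0 / (f + (1.0 - f) / n);
}

/* Upper bound on the speedup as n grows without limit. */
double amdahl_limit(double f)
{
    return 1.0 / f;
}
```

With f = 0.05, `amdahl_limit` gives 20, while `amdahl_speedup(0.05, 20)` is only about 10.26, so 20 processors reach roughly half of the bound.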
$$S_{FM}(P) = \frac{T_s(P N_0)}{T_p(P N_0)} \quad (\text{where } N/P = N_0 \text{ is a constant})$$

$$S_{FT}(P) = \frac{T_s(N_p)}{T_p(N_p)} \quad (\text{where } T_p(N_p) \text{ is a constant})$$
Given a program with the following features:
Time:
$T_s(N) = 1{,}000{,}000 + 1{,}000 N + 24 N^2$ (μs)
$T_p(N) = 1{,}500{,}000 + 1{,}050 N/P + 24 N^2/P$ (μs)
Space:
$S_s(N) = 100{,}000 + 200 N$ (bytes)
$S_p(N) = 125{,}000 + 200 N/P$ (bytes)
Consider three cases:
(1) Fixed-size speedup for $N_0 = 1{,}000$
(2) Scaled speedup for $N/P = 1{,}000$
(3) Fixed-time speedup for $T_p = T_s(N_0)$
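(1) Fixed-size speedup (times in seconds):
$$S_{FS} = \frac{T_s(N_0)}{T_p(N_0)} = \frac{26}{1.5 + 25.05/P}$$
(2) Scaled speedup:
$$S_{FM} = \frac{T_s(P N_0)}{T_p(P N_0)} = \frac{1 + P + 24P^2}{2.55 + 24P} \quad (\text{unbounded!})$$
(3) Fixed-time speedup:
$$S_{FT} = \frac{T_s(N_p)}{T_s(N_0)}, \quad \text{where } N_p \text{ satisfies } T_p(N_p) = T_s(N_0)$$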
P       FS speedup   FM speedup   FM time (s)   FT speedup   FT storage (KB)
1          0.98          0.98          26.55         0.98          323
2          1.85          1.96          50.55         1.92          266
4          3.35          3.95          98.55         3.80          225
8          5.61          7.94         194.55         7.57          196
16         8.48         15.94         386.55        15.11          175
32        11.39         31.94         770.55        30.18          160
64        13.75         63.94        1538.55        60.33          150
128       15.33        127.94        3074.55       120.6           143
256       16.27        255.94        6146.55       241.2           138
512       16.78        511.94       12290.55       482.5           134
1024      17.06       1023.94       24578.55       964.9           131
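The three speedup columns can be recomputed from the time models of the example. The sketch below is one way to do it, not the method used in the original notes: all function names are assumptions, and the fixed-time problem size is found by bisection rather than by a closed-form expression.

```c
/* Time models from the example, in microseconds. */
double ts_time(double n)
{
    return 1.0e6 + 1000.0 * n + 24.0 * n * n;
}

double tp_time(double n, double p)
{
    return 1.5e6 + 1050.0 * n / p + 24.0 * n * n / p;
}

/* Case 1: problem size fixed at n0. */
double fixed_size_speedup(double n0, double p)
{
    return ts_time(n0) / tp_time(n0, p);
}

/* Case 2: problem size grows with p so that N / P = n0. */
double scaled_speedup(double n0, double p)
{
    return ts_time(p * n0) / tp_time(p * n0, p);
}

/* Case 3: find the size n with tp_time(n, p) == ts_time(n0).
 * tp_time increases with n, so bisection works. The upper bracket
 * n0 * p is large enough for this example when p >= 1. */
double fixed_time_size(double n0, double p)
{
    double target = ts_time(n0);
    double lo = 0.0, hi = n0 * p;
    for (int i = 0; i < 100; i++) {
        double mid = 0.5 * (lo + hi);
        if (tp_time(mid, p) < target)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}

double fixed_time_speedup(double n0, double p)
{
    return ts_time(fixed_time_size(n0, p)) / ts_time(n0);
}
```

For N0 = 1000 and P = 2 the three functions return about 1.85, 1.96, and 1.92, which matches the P = 2 row of the table.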
Step 1. Developing Performance Models:
Step 2. Fitting Data to Models:
Fit the model $f$ to the observations so that $\sum_i (\mathrm{obs}(i) - f(i))^2$ is minimized.
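As an illustration of the least-squares criterion, the sketch below fits a straight line y = a + b x to a set of observations. The linear model form is an assumption chosen for this example (the notes do not fix the model at this step), and the function name `fit_line` is hypothetical.

```c
#include <stddef.h>

/* Least-squares fit of y = a + b * x over n observations.
 * Returns 0 on success, or -1 if the fit is undefined
 * (fewer than two points, or all x values identical). */
int fit_line(const double *x, const double *y, size_t n,
             double *a, double *b)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;

    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }

    double denom = (double)n * sxx - sx * sx;
    if (n < 2 || denom == 0.0)
        return -1;

    *b = ((double)n * sxy - sx * sy) / denom;   /* slope */
    *a = (sy - *b * sx) / (double)n;            /* intercept */
    return 0;
}
```

Given measured times for several message lengths, the fitted intercept and slope would play the roles of a startup cost and a per-word cost in a linear communication model.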
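Communication cost for a message of $N$ words: $t_{comm} = t_s + t_w N$
where $t_s$ = startup cost and $t_w$ = cost per word.
[Figure: time $T$ plotted against message length $L$, a straight line with intercept $t_s$ and slope $t_w$.]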
Machine             ts      tw
IBM SP2             40      0.
Intel DELTA         77      0.
Intel Paragon       121     0.
Meiko CS-2          87      0.
nCUBE-2             154     2.
TMC CM-5            82      0.
PCs on Ethernet     1500    5.
PCs on FDDI         1150    1.
Implementation of collective communications is topology dependent. For example, consider one-to-all broadcast
It involves $p - 1$ steps of shift: $T = (t_s + t_w m)(p - 1)$
It involves $\sqrt{p} - 1$ steps of shift in rows with message size $m$, and $\sqrt{p} - 1$ steps of shift in columns with message size $m\sqrt{p}$:

$$T = (t_s + t_w m)(\sqrt{p} - 1) + (t_s + t_w m\sqrt{p})(\sqrt{p} - 1) = 2 t_s (\sqrt{p} - 1) + t_w m (p - 1)$$
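The final simplification may be easier to follow when the startup and per-word terms are collected separately. The intermediate line below is a worked step added for clarity; it uses only the symbols already defined.

```latex
\begin{align*}
T &= (t_s + t_w m)(\sqrt{p} - 1) + (t_s + t_w m\sqrt{p})(\sqrt{p} - 1) \\
  &= 2 t_s (\sqrt{p} - 1) + t_w m (\sqrt{p} - 1)(1 + \sqrt{p}) \\
  &= 2 t_s (\sqrt{p} - 1) + t_w m (p - 1)
\end{align*}
```

The per-word term is the same as in the $p - 1$ step shift scheme; only the number of startups drops, from $p - 1$ to $2(\sqrt{p} - 1)$.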
It involves log p steps, with doubled message size in each step:
$$T = t_s \log p + t_w m (p - 1)$$
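The three cost formulas can be compared side by side with a few small functions. This is a sketch under stated assumptions: the function names are hypothetical, the row/column variant takes the side length q (so that p = q * q), and the doubling variant takes d = log p (so that p = 2^d) and sums the per-step costs in a loop instead of using the closed form.

```c
/* p - 1 shift steps, each sending m words. */
double linear_shift_cost(double ts, double tw, double m, int p)
{
    return (ts + tw * m) * (double)(p - 1);
}

/* Row shifts then column shifts on q * q nodes (p = q * q). */
double row_column_cost(double ts, double tw, double m, int q)
{
    int p = q * q;
    return 2.0 * ts * (double)(q - 1) + tw * m * (double)(p - 1);
}

/* d = log p steps; the message size doubles after every step,
 * so the per-word terms add up to tw * m * (2^d - 1). */
double doubling_cost(double ts, double tw, double m, int d)
{
    double total = 0.0;
    double size = m;
    for (int i = 0; i < d; i++) {
        total += ts + tw * size;
        size *= 2.0;
    }
    return total;
}
```

With illustrative values ts = 100, tw = 1, m = 10, and p = 16, the three functions return 1650, 750, and 550. The per-word cost is identical in all three cases, and the differences come entirely from the number of startups.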