Performance Analysis: Execution Time, Speedup, and Approaches in Parallel Computing - Prof, Study notes of Computer Science

Portland State University (PSU)Computer Science

Prof. Jingke Li

An introduction to performance analysis in parallel computing, covering key concepts such as execution time and speedup, along with approaches for measuring and modeling performance. Examples use Unix, MPI, and several parallel architectures.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

Performance Analysis

Jingke Li

Portland State University

Jingke Li (Portland State University) CS 415/515 Performance Analysis 1 / 36

Performance Analysis

Performance Metrics:

• Execution Time
  + most direct reflection of performance
  − not always insightful
• Speedup
  + shows performance relative to sequential performance

Approaches

• Extrapolation from Observations: Run the program and collect performance data. For example, “We implemented the algorithm on parallel computer X and achieved a speedup of 10.8 on 12 processors with problem size N = 100.”
• Asymptotic Analysis: Analyze algorithm performance as a function of problem size and processor count. For example, “The algorithm requires O(N log N) time on O(N) processors.”

Approaches

• Performance Modeling: Specify a performance metric as a function of processor count $P$ and problem size $N$. For example:
  (a) $T = N + N^2/P$
  (b) $T = (N + N^2)/P + 100$
  (c) $T = (N + N^2)/P + 0.6P^2$
  These three formulas describe distinct performance patterns for their corresponding algorithms, even though they all achieve a speedup of about 10.8 when $P = 12$ and $N = 100$, and have the same asymptotic complexity when $P = O(N)$.

Measuring Wall-Clock Time in Unix

• time(): returns the wall-clock time in seconds since the Epoch (defined as 00:00:00 UTC on 1 January 1970).
• gettimeofday(): returns the wall-clock time at microsecond resolution.
• clock_gettime(): returns the wall-clock time at nanosecond resolution. (Works only on systems that implement the POSIX.1b Realtime Extension.)

Measuring Wall-Clock Time in Unix

#include <time.h>
time_t time(time_t *tloc);

If tloc is not NULL, the return value is also stored in *tloc.

#include <sys/time.h>
int gettimeofday(struct timeval *tp, void *tzp);

The struct timeval structure contains two members, tv_sec and tv_usec (shown as long on the slide; POSIX declares them as time_t and suseconds_t), whose values are the seconds and microseconds since the Epoch, respectively. The second parameter tzp should be NULL.

An Example

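#include <stdio.h>
#include <sys/time.h>

#define MILLION 1000000L

void function_to_time(void);   /* the code being measured, defined elsewhere */

int main(void)
{
    struct timeval tpstart;
    struct timeval tpend;
    long elapsed;

    gettimeofday(&tpstart, NULL);
    function_to_time();
    gettimeofday(&tpend, NULL);
    /* difference in seconds, scaled to microseconds, plus difference in microseconds */
    elapsed = MILLION * (tpend.tv_sec - tpstart.tv_sec)
              + tpend.tv_usec - tpstart.tv_usec;
    printf("Elapsed time: %ld microseconds\n", elapsed);
    return 0;
}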

Measuring CPU Time in Unix

• clock(): returns the amount of CPU time (as a number of clock ticks) used by the calling process. (This is an ANSI C routine.)

#include <time.h>
clock_t clock(void);

To convert the return value to seconds, divide it by CLOCKS_PER_SEC (typically 1,000,000).

Example

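double start, finish;
...
MPI_Barrier(comm);      /* synchronize so all processes start timing together */
start = MPI_Wtime();
...                     /* code being timed */
MPI_Barrier(comm);      /* wait until every process has finished */
finish = MPI_Wtime();
if (my_rank == 0)
    printf("Elapsed time = %e seconds\n", finish - start);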

Speedup

$$S = \frac{T_s}{T_p}$$

• $T_s$: execution time using a single processor
• $T_p$: execution time using $N$ processors

[Figure: speedup (vertical axis) versus number of processors (horizontal axis, 1 to 1000), with curves for Perfect Speedup, Algorithm 1, Algorithm 2, and Algorithm 3.]

A Simple Observation: S ≤ N

Amdahl’s Law

The speedup of a program is upper bounded by the reciprocal of the sequential fraction of the program. (The bigger the fraction, the smaller the speedup.)

For a given program, let f be the sequential fraction, then

• $f T_s$ is the time required for the sequential portion
• $(1 - f) T_s$ is the time required for the parallelizable portion

$$S = \frac{T_s}{T_p} = \frac{T_s}{f T_s + (1 - f) T_s / N} = \frac{1}{f + (1 - f)/N}$$

Amdahl’s Law Example

For example, if the sequential fraction of a program is 5% (f = 0.05), the speedup can never exceed 1/f = 20, no matter how many processors are used.

Other Speedup Models

Fixed-Memory Speedup (Scaled Speedup): Each processor runs a (sub)program of equal size.

$$S_{FM}(P) = \frac{T_s(P N_0)}{T_p(P N_0)} \quad \left(\text{where } \frac{N}{P} = N_0 \text{ is a constant}\right)$$

Fixed-Time Speedup: Each processor runs for a fixed amount of time.

$$S_{FT}(P) = \frac{T_s(N_p)}{T_p(N_p)} \quad \left(\text{where } T_p(N_p) \text{ is a constant}\right)$$

Example

Given a program with the following features:

Time:
  $T_s(N) = 1{,}000{,}000 + 1{,}000N + 24N^2$ (μs)
  $T_p(N) = 1{,}500{,}000 + 1{,}050N/P + 24N^2/P$ (μs)

Space:
  $S_s(N) = 100{,}000 + 200N$ (bytes)
  $S_p(N) = 125{,}000 + 200N/P$ (bytes)

Consider three cases:

(1) Fixed-size speedup for $N_0 = 1{,}000$
(2) Scaled speedup for $N/P = 1{,}000$
(3) Fixed-time speedup for $T_p = T_s(N_0)$

Results

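$$S_{FS} = \frac{T_s(N_0)}{T_p(N_0)} = \frac{26}{1.5 + 25.05/P} \to \frac{26}{1.5} = 17.33$$

$$S_{FM} = \frac{T_s(P N_0)}{T_p(P N_0)} = \frac{1 + P + 24P^2}{2.55 + 24P} \quad (\text{unbounded!})$$

$$S_{FT} = \frac{T_s(N_p)}{T_s(N_0)}$$

where $N_p$ is the positive root of $T_p(N_p) = T_s(N_0) = 26{,}000{,}000$, that is, of $(24/P) N_p^2 + (1050/P) N_p - 24{,}500{,}000 = 0$:

$$N_p = \frac{-1050/P + \sqrt{(1050/P)^2 + 4 \cdot 24{,}500{,}000 \cdot 24/P}}{2(24/P)}$$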

Results (cont.)

P       FS Spdup   FM Spdup   Time (sec.)   FT Spdup   Storage (kb)
1       0.98       0.98       26.55         0.98       323
2       1.85       1.96       50.55         1.92       266
4       3.35       3.95       98.55         3.80       225
8       5.61       7.94       194.55        7.57       196
16      8.48       15.94      386.55        15.11      175
32      11.39      31.94      770.55        30.18      160
64      13.75      63.94      1538.55       60.33      150
128     15.33      127.94     3074.55       120.6      143
256     16.27      255.94     6146.55       241.2      138
512     16.78      511.94     12290.55      482.5      134
1024    17.06      1023.94    24578.55      964.9      131

Time (sec.) is the parallel execution time $T_p(P N_0)$ in the fixed-memory case; Storage (kb) is $S_p(N_p)$ in the fixed-time case.

Performance Analysis

Step 1. Developing Performance Models:

• Computation Time:
  - Analyze algorithm complexity
  - Analyze processor workload
• Communication Time:
  - Decompose communication components in the algorithm into basic communication patterns.
  - Develop performance models for basic communication patterns.

Step 2. Fitting Data to Models:

Given a set of observed performance data points and a (hypothesized) performance function, find the parameters of the function. Example: Least-Squares Fit Method. Find the parameters of function $f$ such that $\sum_i (\mathrm{obs}(i) - f(i))^2$ is minimized.

A Simple One-to-One Communication Model

[Figure: time T versus message length L; ts = startup cost, tw = cost per word.]

$$t_{comm} = t_s + t_w N$$

Machine            ts      tw
IBM SP2            40      0.
Intel DELTA        77      0.
Intel Paragon      121     0.
Meiko CS-2         87      0.
nCUBE-2            154     2.
TMC CM-5           82      0.
PCs on Ethernet    1500    5.
PCs on FDDI        1150    1.

Model Collective Communications

Implementation of collective communications is topology dependent. For example, consider one-to-all broadcast

• On a Ring: [Figure: one-to-all broadcast on a ring]
• On a Mesh: [Figure: one-to-all broadcast on a mesh]

Model Collective Communications (cont.)

• One-to-All Broadcast on a Tree: [Figure: one-to-all broadcast on a tree]
• One-to-All Broadcast on a Hypercube: [Figure: one-to-all broadcast on a hypercube]

Cost Analysis of One-to-All Communication

• Broadcast and Reduction: The cost model for one-to-all broadcast over $p$ processors is straightforward. It involves $\log p$ steps of (concurrent) point-to-point message transfers:
  $$t_{bcast} = (t_s + t_w m) \log p$$
  This cost model also works for all-to-one reduction.
• Scatter and Gather: The abstract communication pattern for a scatter is the same as for a broadcast. The difference is that, in a scatter, the message size starts out at $mp/2$ and is halved in each subsequent step:
  $$t_{scatter} = (t_s + t_w m p/2) + (t_s + t_w m p/4) + \cdots + (t_s + t_w m) = t_s \log p + t_w m (p - 1)$$

All-to-All Communication on a Ring

It involves $p - 1$ steps of shift, each with message size $m$:

$$T = (t_s + t_w m)(p - 1)$$

All-to-All Communication on a Mesh

It involves $\sqrt{p} - 1$ steps of shift in rows with message size $m$, and $\sqrt{p} - 1$ steps of shift in columns with message size $m\sqrt{p}$:

$$T = (t_s + t_w m)(\sqrt{p} - 1) + (t_s + t_w m\sqrt{p})(\sqrt{p} - 1) = 2 t_s(\sqrt{p} - 1) + t_w m (p - 1)$$

All-to-All Communication on a Hypercube

It involves $\log p$ steps, with the message size doubling in each step:

$$T = t_s \log p + t_w m (p - 1)$$
