Download Performance Metrics - Parallel Processing - Lecture Slides and more Slides Parallel Computing and Programming in PDF only on Docsity!
Lecture 10: Performance Metrics
Topic Overview
- Sources of Overhead in Parallel Programs
- Performance Metrics for Parallel Systems
- Effect of Granularity on Performance
- Scalability of Parallel Systems
- Minimum Execution Time and Minimum Cost-Optimal Execution Time
- Asymptotic Analysis of Parallel Programs
- Other Scalability Metrics
Analytical Modeling - Basics
- A number of performance measures are intuitive.
- Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed of the program is ported to another machine altogether?
- How much faster is the parallel version? This begs the obvious followup question - whats the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look
- Raw FLOP count - What good are FLOP counts when they dont solve a problem?
Sources of Overhead in Parallel Programs
- If I use two processors, shouldnt my program run twice as fast?
- No - a number of overheads, including wasted computation, communication, idling, and contention cause degradation in performance.
The execution profile of a hypothetical parallel program executing on eight processing elements. Profile indicates times spent performing computation (both essential and excess), communication, and idling.
Performance Metrics for Parallel
Systems: Execution
Time
- Serial runtime of a program is the time
elapsed between the beginning and the end
of its execution on a sequential computer.
- The parallel runtime is the time that elapses
from the moment the first processor starts to
the moment the last processor finishes
execution.
- We denote the serial runtime by and the
parallel runtime by T P.
Performance Metrics for Parallel
Systems: Total Parallel Overhead
- Let Tall be the total time collectively spent by all the processing elements.
- TS is the serial time.
- Observe that Tall - TS is then the total time spend by all processors combined in non-useful work. This is called the total overhead.
- The total time collectively spent by all the processing elements Tall = p TP ( p is the number of processors).
- The overhead function ( To ) is therefore given by
To = p TP - TS (1)
Performance Metrics: Example
- Consider the problem of adding n numbers by
using n processing elements.
- If n is a power of two, we can perform this
operation in log n steps by propagating partial
sums up a logical binary tree of processors.
Performance Metrics: Example
Computing the globalsum of 16 partial sums using
16 processing elements. Σji denotes the sum of
numbers with consecutive labels from i to j.
Performance Metrics: Speedup
- For a given problem, there might be many serial
algorithms available. These algorithms may have
different asymptotic runtimes and may be
parallelizable to different degrees
- For the purpose of computing speedup, we
always consider the best sequential program as
the baseline
- For the purpose of determining how effective our
parallelization technique is, we can determine a
pseudo-speedup w.r.t. the sequential version of
the parallel algorithm Docsity.com
Performance Metrics: Speedup
- Consider the problem of parallel bubble sort.Example
- The serial time for bubblesort is 150 seconds.
- The parallel time for odd-even sort (efficient parallelization of bubble sort) is 40 seconds.
- The speedup would appear to be 150/40 = 3.75. This is actually a pseudo-speedup
- But is this really a fair assessment of the system?
- What if serial quicksort only took 30 seconds? In this case, the speedup is 30/40 = 0.75. This is a more realistic assessment of the system.
Performance Metrics: Superlinear Speedups
One reason for superlinearity is that the parallel
version does less work than corresponding serial algorithm.
Searching an unstructured tree for a node with a given label, `S', on two processing elements using depth-first traversal. The two-processor version with processor 0 searching the left subtree and processor 1 searching the right subtree expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree. It is clear that the serial algorithm does more work than the parallel algorithm.
Performance Metrics: Superlinear Speedups
Performance Metrics: Superlinear Speedups
Example: A processor with 64KB of cache
yields an 80% hit ratio. If two processors are
used, since the problem size/processor is
smaller, the hit ratio goes up to 90%. Of the
remaining 10% access, 8% come from local
memory and 2% from remote memory.
Performance Metrics: Efficiency
- Efficiency is a measure of the fraction of time
for which a processing element is usefully
employed
- Mathematically, it is given by
= (2)
- Following the bounds on speedup, efficiency
can be as low as 0 and as high as 1. Docsity.com