Explaining Speedup & Efficiency in Parallel Algorithms: Fox's Algorithm, Exercises of Computer Engineering and Programming

This document explains the concepts of speedup and efficiency in parallel algorithms, using Fox's algorithm as an example. It covers the definitions of speedup, efficiency, and linear speedup, as well as the importance of measuring speedup and the impact of communications overhead. It also provides a description of Fox's algorithm and its implementation.

Typology: Exercises

2011/2012

Uploaded on 07/17/2012

bandana
Parallel Algorithms
Outline:
A parallel algorithm is an algorithm that has been specifically written for
execution on a computer with two or more processors (i.e. a parallel computer).
Speedup and Efficiency
Outline:
This course first discusses some of the most basic concepts that one must be
aware of when writing parallel algorithms.
Speedup
Outline:
When we write a parallel algorithm we usually want to know its performance
gain over a similar algorithm run on a serial computer.
One way of judging the performance of an algorithm is to measure its speedup.
Speedup is defined as the ratio of the run-time of the fastest serial algorithm run
on a serial computer ($T_s$) to the run-time of a parallel version of the algorithm
($T_p$) run on $N$ processors of a parallel computer. That is, the speedup, $S_N$, is:
$$S_N = \frac{T_s}{T_p}$$
(Note: there is an alternative definition of speedup.)
Example:
If the best known serial algorithm takes 8 seconds (i.e. $T_s = 8$) while a parallel
algorithm takes 2 seconds (i.e. $T_p = 2$) using $N = 5$ processors, then
$$S_N = \frac{T_s}{T_p} = \frac{8}{2} = 4$$
Efficiency
Definition:
When we write a parallel algorithm that is to be run on a parallel computer with $N$
processors we might expect that parallel algorithm to speed up by a factor of $N$ (this
is known as linear speedup). In practice this is not the case (see Factors That
Limit Speedup).
To judge how effective a parallel algorithm is we measure its efficiency.


Efficiency is a measure of the fraction of time that a processor spends performing useful work.

That is, if $S_N$ is the speedup in using a parallel algorithm on a computer composed of $N$ processors, then the efficiency, $E_N$, of the parallel algorithm is given by:

$$E_N = \frac{S_N}{N}$$

Example:

If the best known serial algorithm takes 8 seconds (i.e. $T_s = 8$) while a parallel algorithm takes 2 seconds (i.e. $T_p = 2$) using $N = 5$ processors, then $S_N = S_5 = 4$, and:

$$E_N = E_5 = \frac{S_5}{5} = \frac{4}{5} = 0.8 \quad \text{(i.e. 80\%)}$$
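The two definitions above translate directly into code. The following is a minimal sketch that reproduces the worked example; the function names `speedup` and `efficiency` are illustrative choices, not part of the original notes.

```python
def speedup(t_serial, t_parallel):
    """Speedup S_N = T_s / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processors):
    """Efficiency E_N = S_N / N."""
    return speedup(t_serial, t_parallel) / n_processors


# Worked example from the text: T_s = 8 s, T_p = 2 s, N = 5 processors.
s5 = speedup(8.0, 2.0)
e5 = efficiency(8.0, 2.0, 5)
print(f"S_5 = {s5}, E_5 = {e5:.0%}")
```

An efficiency below 100% here means that, on average, each of the five processors spends part of its time not doing useful work.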

Measuring Speedup

Discussion:

To measure the speedup of a parallel algorithm one must calculate the time of execution, $T_s$, of a serial version of the algorithm in question.

However it is difficult to know what is meant by $T_s$.

To measure $T_s$ we could run the serial algorithm on the fastest serial computer available. Or we could run the serial algorithm on one processor of the target parallel computer.

One could argue that the first method (comparison to the fastest serial computer) is the fairest. However, it is sometimes difficult to get access to the fastest serial computer. Also the fastest serial computer always changes as computer manufacturers build better machines.

Measuring speedup in this way has the effect of reducing its value.

The second method (measuring the time taken by one processor of the computer) is the one most algorithm writers use and leads to an alternative definition of speedup. (See the original definition of speedup for a comparison.)

Alternative Definition:

processor is waiting for another to finish before it can continue. Writing an algorithm that evenly distributes its workload across all the processors is known as Load Balancing.

A special case of the load balancing problem is the extra workload due to the presence of an unparallelisable serial component within the parallel algorithm. Such a serial component allows only one processor to work on it while the other processors remain idle.

Communications Overhead:

Any time that is spent communicating data between processors degrades the speedup (since a processor that is transmitting data is not calculating).

To reduce the amount of communication overhead, parallel algorithm designers make sure that the grain size (i.e. the relative amount of work done between communications) is as large as possible.

Amdahl's Law

Definition:

Amdahl's Law states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially, i.e. the fraction of serial operations.

Discussion:

Let $s$ be the amount of time spent (by one processor) on serial parts of the program and $p$ be the amount of time spent (by one processor) on parts of the program that can be done in parallel, i.e.

$$T_s = s + p$$

$$T_p = s + \frac{p}{N}$$

where $T_s$ and $T_p$ are the times taken to run the algorithm on a serial and parallel computer respectively, and $N$ is the number of processors.

Thus the speedup, $S_N$, of a parallel algorithm is

$$S_N = \frac{T_s}{T_p} = \frac{s + p}{s + (p/N)} \qquad \cdots \text{(Amdahl's Law)}$$

Example:

A program contains 100 operations, each of which takes 1 time unit to complete. If 80 operations can be done in parallel ($p = 80$) then 20 operations must be done sequentially ($s = 20$). The optimal case of using 80 processors (using more than 80 processors would not improve performance since there are only 80 parallel operations) gives a speedup, $S_{80}$, of

$$S_N = S_{80} = \frac{s + p}{s + (p/N)} = \frac{20 + 80}{20 + (80/80)} = \frac{100}{21} \approx 4.76$$

That is, a speedup of less than 5 is all that is possible no matter how many processors are available, since $S_N$ can never exceed $(s + p)/s = 100/20 = 5$.
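To see how quickly the speedup saturates, the formula can be evaluated for several processor counts. This is a minimal sketch using the example's values $s = 20$ and $p = 80$; the function name `amdahl_speedup` is an illustrative choice, and the formula is applied for any $N$ as an idealised model.

```python
def amdahl_speedup(s, p, n):
    """Amdahl's Law: S_N = (s + p) / (s + p / N)."""
    return (s + p) / (s + p / n)


# s = 20 serial time units, p = 80 parallelisable time units.
for n in (1, 2, 10, 80, 1000):
    print(f"N = {n:5d}  S_N = {amdahl_speedup(20, 80, n):.2f}")
```

The printed values approach, but never reach, the bound $(s + p)/s = 5$.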

Serial Fraction and Amdahl's Law

Definition:

The serial fraction, $F$, of a parallel algorithm is defined to be:

$$F = \frac{s}{T_s}$$

where $s$ is the time that must be spent performing serial operations (i.e. operations that cannot be parallelised) and $T_s$ is the total time spent running the whole algorithm on one processor.

Note: If $p$ is the time spent by one processor performing the operations that can be done in parallel then $T_s = s + p$.

Amdahl's Law:

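Since $T_s = s + p$, we have $s = F\,T_s$ and $p = (1 - F)\,T_s$, so Amdahl's Law can be rewritten in terms of the serial fraction, $F$. That is, the speedup, $S_N$, becomes

$$S_N = \frac{T_s}{F\,T_s + (1 - F)\,T_s/N} = \frac{N}{(N - 1)F + 1} \qquad \cdots \text{(Amdahl's Law)}$$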

Discussion:

If we examine this equation we see that for $F = 0$ (i.e. no serial part) the speedup is $S_N = N$, i.e. linear speedup.

Whereas if $F = 1$ (i.e. completely serial) then the speedup is $S_N = 1$, i.e. there will be no speedup no matter how many processors are used.

Example:

Consider the effect of the serial fraction $F$ on the speedup produced for $N = 10$:
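The values that followed here are not included in this preview. As a stand-in, the sketch below tabulates $S_N = N / ((N - 1)F + 1)$ for $N = 10$; the particular values of $F$ are arbitrary illustrative choices, and the function name is hypothetical.

```python
def speedup_from_serial_fraction(f, n):
    """Amdahl's Law in terms of the serial fraction: S_N = N / ((N - 1) F + 1)."""
    return n / ((n - 1) * f + 1)


N = 10
for f in (0.0, 0.01, 0.1, 0.5, 1.0):
    print(f"F = {f:<5}  S_{N} = {speedup_from_serial_fraction(f, N):.2f}")
```

Even a serial fraction of 0.1 roughly halves the speedup obtainable from ten processors.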

systems with large numbers of processors because sufficient speedup will never be produced.

However, most of the important applications that need to be parallelised contain very small serial fractions (i.e. $F < 0.001$).

Using the Serial Fraction to Measure Performance

Discussion:

By running a parallel algorithm on a machine with $N$ processors, we can measure the actual speedup $S_N$ achieved. We can then use this measured value to calculate the serial fraction $F$ by using Amdahl's Law.

That is,

$$F = \frac{1}{N - 1}\left(\frac{N}{S_N} - 1\right) \qquad \cdots \text{(Serial Fraction)}$$

The serial fraction is a better measure of the performance of a parallel algorithm.

Example:

To see why the serial fraction is a better measure of performance than the speedup consider the following results:

N    S_N     E_N    F
2    1.95    97%    0.
3    2.88    96%    0.
4    3.76    94%    0.
8    6.96    87%    0.

where $E_N$ is the efficiency. Without looking at the serial fraction we cannot tell whether these results are good or not.

For example, what is causing the efficiency to decrease?

By examining $F$ we can see that it is almost constant while the efficiency decreases. The falling efficiency is therefore explained by a fixed serial fraction rather than by overhead that grows with $N$. This constant, relatively high value for $F$ is due to the limited parallelism of the program.
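The $F$ column of the results table is truncated in this copy, but it can be recomputed from $N$ and $S_N$ with the serial fraction formula. The sketch below does so from the rounded speedups shown in the table, so the last digit may differ slightly from the values in the original table; the function name `serial_fraction` is an illustrative choice.

```python
def serial_fraction(measured_speedup, n):
    """Serial fraction F = (N / S_N - 1) / (N - 1), from a measured speedup."""
    return (n / measured_speedup - 1) / (n - 1)


# (N, S_N) pairs taken from the results table in the text.
measurements = [(2, 1.95), (3, 2.88), (4, 3.76), (8, 6.96)]

for n, s_n in measurements:
    print(f"N = {n}  S_N = {s_n}  E_N = {s_n / n:.3f}  F = {serial_fraction(s_n, n):.3f}")
```

The recomputed $F$ stays within a narrow band while $E_N$ falls steadily, which is the pattern the discussion above relies on.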

Fox's Algorithm (General Case)

Outline:

This course describes a memory-efficient parallel algorithm for performing the multiplication of two $m \times m$ matrices (i.e. $C = A \times B$ where $A$, $B$, and $C$ are square $m \times m$ matrices) on a parallel computer.

Mathematically this is represented as:

$$c_{ij} = \sum_{k=1}^{m} a_{ik}\, b_{kj} \quad \text{where } i, j = 1, 2, \cdots, m$$

Description of Fox's Algorithm

Introduction:

Fox's Algorithm is a memory-efficient parallel algorithm for performing the multiplication of two $m \times m$ matrices (i.e. $C = A \times B$ where $A$, $B$, and $C$ are square $m \times m$ matrices) on a parallel computer. That is, Fox's Algorithm is designed to perform the calculation shown in Figure 1 as efficiently as possible.

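$$
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mm}
\end{pmatrix}
\times
\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1m} \\
b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
b_{m1} & b_{m2} & \cdots & b_{mm}
\end{pmatrix}
=
\begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1m} \\
c_{21} & c_{22} & \cdots & c_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
c_{m1} & c_{m2} & \cdots & c_{mm}
\end{pmatrix}
$$

Figure 1. The product of two $m \times m$ matrices. The figure shows the elements of the matrices.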

The algorithm consists of two parts:

  1. Initialisation, and
  2. Roll, Broadcast and Multiply.

The initialisation step splits the matrices A and B into smaller, more manageable sub-matrices.

The Roll, Broadcast, Multiply step is the heart of the algorithm and performs the multiplication of the matrices.

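where

$$
A_{11} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad
A_{12} = \begin{pmatrix} a_{13} & a_{14} \\ a_{23} & a_{24} \end{pmatrix}, \quad
A_{21} = \begin{pmatrix} a_{31} & a_{32} \\ a_{41} & a_{42} \end{pmatrix}, \quad
A_{22} = \begin{pmatrix} a_{33} & a_{34} \\ a_{43} & a_{44} \end{pmatrix}
$$

and so on, until,

$$
A_{\frac{m}{2}\frac{m}{2}} = \begin{pmatrix} a_{(m-1)(m-1)} & a_{(m-1)m} \\ a_{m(m-1)} & a_{mm} \end{pmatrix}
$$

and where

$$
B_{11} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \quad
B_{12} = \begin{pmatrix} b_{13} & b_{14} \\ b_{23} & b_{24} \end{pmatrix}, \quad
B_{21} = \begin{pmatrix} b_{31} & b_{32} \\ b_{41} & b_{42} \end{pmatrix}, \quad
B_{22} = \begin{pmatrix} b_{33} & b_{34} \\ b_{43} & b_{44} \end{pmatrix}
$$

and so on until,

$$
B_{\frac{m}{2}\frac{m}{2}} = \begin{pmatrix} b_{(m-1)(m-1)} & b_{(m-1)m} \\ b_{m(m-1)} & b_{mm} \end{pmatrix}
$$

Figure 1. The product of two $m \times m$ matrices. The figure shows the partitioned elements of the matrices. In this illustration we are creating lots of $2 \times 2$ sub-matrices. However, it should be noted that in practice the size of the sub-matrices is chosen such that the number of sub-matrices equals the number of processors.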

We shall call the sub-matrices $A_{ij}$ and $B_{ij}$ where $1 \le i, j \le \sqrt{N}$ (see Figure 1).

(Note: $A_{ij}$ is not the same as $a_{ij}$. $A_{ij}$ is a sub-matrix of $A$, whereas $a_{ij}$ is a single element of $A$.)

To create the sub-matrices $A_{ij}$ (as illustrated in Figure 1) we take the first $m/\sqrt{N}$ rows (i.e. rows 1 to $m/\sqrt{N}$) and take elements from the first $m/\sqrt{N}$ columns (i.e. columns 1 to $m/\sqrt{N}$) to form sub-matrix $A_{11}$.

To form sub-matrix $A_{12}$ we take elements from columns $(m/\sqrt{N} + 1)$ to $2m/\sqrt{N}$ of rows 1 to $m/\sqrt{N}$.

In general, to form $A_{ij}$ we take elements from columns $((j-1)m/\sqrt{N} + 1)$ to $jm/\sqrt{N}$ of rows $((i-1)m/\sqrt{N} + 1)$ to $im/\sqrt{N}$.

That is, to create $A_{ij}$ we use elements $a_{kl}$ where:

$$\frac{(i - 1)m}{\sqrt{N}} + 1 \le k \le \frac{im}{\sqrt{N}}$$

and

$$\frac{(j - 1)m}{\sqrt{N}} + 1 \le l \le \frac{jm}{\sqrt{N}}$$

We follow the same procedure to find the sub-matrix $B_{ij}$. That is, an element $b_{kl}$ of $B$ belongs to $B_{ij}$ if

$$\frac{(i - 1)m}{\sqrt{N}} + 1 \le k \le \frac{im}{\sqrt{N}}$$

and

$$\frac{(j - 1)m}{\sqrt{N}} + 1 \le l \le \frac{jm}{\sqrt{N}}$$
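The index ranges above can be checked with a short sketch. It assumes the matrix is stored as a plain list of rows, that $N$ is a perfect square, and that $\sqrt{N}$ divides $m$; the function name `partition` and the dictionary keyed by 1-based $(i, j)$ are illustrative choices, not part of the original notes.

```python
import math


def partition(matrix, n_processors):
    """Split a square m x m matrix into sqrt(N) x sqrt(N) sub-matrices.

    Returns a dict mapping the 1-based block index (i, j) to the block A_ij.
    """
    m = len(matrix)
    q = math.isqrt(n_processors)          # sqrt(N) blocks per side
    if q * q != n_processors or m % q != 0:
        raise ValueError("N must be a perfect square and sqrt(N) must divide m")
    size = m // q                          # m / sqrt(N) rows and columns per block
    blocks = {}
    for i in range(1, q + 1):
        for j in range(1, q + 1):
            # 1-based rows (i-1)m/sqrt(N)+1 .. im/sqrt(N) become this 0-based range.
            rows = range((i - 1) * size, i * size)
            cols = range((j - 1) * size, j * size)
            blocks[(i, j)] = [[matrix[k][l] for l in cols] for k in rows]
    return blocks


# A 4 x 4 matrix whose entry in row r, column c is the number "rc" (e.g. 13 for a_13).
a = [[10 * r + c for c in range(1, 5)] for r in range(1, 5)]
blocks = partition(a, 4)
print(blocks[(1, 2)])  # [[13, 14], [23, 24]], i.e. A_12
```

Encoding each entry as the number "rc" makes it easy to compare a block against the element lists given in the text.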

Array Initialisation

Initialisation:

The next step is to create a number of arrays on each processor to store the sub-matrices created during partitioning.

We shall label the $N$ processors $P_0, P_1, \cdots, P_{N-1}$. We shall use the notation that a singly subscripted array (e.g. $T_p$) belongs to the processor identified by the same subscript (e.g. $T_p$ belongs to processor $P_p$).

For each processor we must create four $(m/\sqrt{N}) \times (m/\sqrt{N})$ arrays:

  1. $R_p$ is an $(m/\sqrt{N}) \times (m/\sqrt{N})$ array. It is used to store the sub-matrices $A_{ij}$ found during partitioning such that:

$$U_1 = T_1 \times S_1$$

$$\cdots$$

$$U_{(i-1)\sqrt{N}+(j-1)} = T_{(i-1)\sqrt{N}+(j-1)} \times S_{(i-1)\sqrt{N}+(j-1)}$$

$$\cdots$$

$$U_{N-1} = T_{N-1} \times S_{N-1}$$

Once all the initialisation steps have been completed the arrays U p form the answer (see Figure 1 ).

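$$
C = \begin{pmatrix}
U_0 & U_1 & \cdots & U_{\sqrt{N}-1} \\
U_{\sqrt{N}} & U_{\sqrt{N}+1} & \cdots & U_{2\sqrt{N}-1} \\
\vdots & \vdots & \ddots & \vdots \\
U_{N-\sqrt{N}} & U_{N-\sqrt{N}+1} & \cdots & U_{N-1}
\end{pmatrix}
$$

Figure 1. The $m \times m$ matrix $C$ is represented by the sub-matrices $U$. The sub-matrices $U_0, U_1, \cdots, U_{N-1}$ are calculated on processors $P_0, P_1, \cdots, P_{N-1}$ respectively.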

Roll, Broadcast, Multiply

Introduction:

Once the matrices A and B have been partitioned and the arrays initialised the algorithm proceeds as follows:

For $t = 1, 2, \cdots, \sqrt{N} - 1$:

Roll:

On each processor the arrays $S_p$ are updated:

$$S^{t+1}_{(i-1)\sqrt{N} + (j-1)} = S^{t}_{i\sqrt{N} + (j-1) - \delta N}$$

where

$$\delta = \begin{cases} 0 & \text{if } i < \sqrt{N} \\ 1 & \text{if } i = \sqrt{N} \end{cases}$$

Broadcast:

On each processor the arrays $T_p$ are updated:

$$T^{t+1}_{(i-1)\sqrt{N} + (j-1)} = R_{(i-1)\sqrt{N} + (i + t - \mu\sqrt{N} - 1)}$$

where

$$\mu = \begin{cases} 0 & \text{if } t + i - \sqrt{N} \le 0 \\ 1 & \text{if } t + i - \sqrt{N} > 0 \end{cases}$$

Multiply:

Finally, on each processor the arrays $U_p$ are calculated from the arrays just updated by the Roll and Broadcast steps. That is, for $p = 0$ to $N - 1$:

$$U^{t+1}_p = U^{t}_p + T^{t+1}_p \times S^{t+1}_p$$

Notes:

In the above, superscripts of arrays (e.g. $S^{t+1}$) represent the value of an array at that step in the algorithm (e.g. $S^{t+1}$ represents the value of array $S$ at step $t + 1$ in the algorithm).

BEWARE! We must be careful in the Roll part of the algorithm since there is a potential for over-writing an array before it has been passed on to another processor.
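One way to avoid the over-writing hazard is to build the rolled arrays in a fresh buffer rather than updating them in place. The sketch below is a single-process stand-in for real message passing: it assumes the arrays $S_p$ of all processors are held in one Python list indexed by $p = (i-1)\sqrt{N} + (j-1)$, and the function name `roll` is hypothetical.

```python
import math


def roll(s_arrays):
    """One Roll step: processor (i, j) receives S from processor (i + 1, j).

    The bottom row of processors wraps around to the top row. A new list is
    built, so no entry is overwritten before it has been passed on.
    """
    n = len(s_arrays)
    q = math.isqrt(n)                      # sqrt(N) processors per row
    if q * q != n:
        raise ValueError("the number of processors must be a perfect square")
    # Index p + sqrt(N), reduced modulo N, is the processor one row below p.
    return [s_arrays[(p + q) % n] for p in range(n)]


# Labels stand in for the sub-matrices held by N = 4 processors.
s = ["B11", "B12", "B21", "B22"]
print(roll(s))  # ['B21', 'B22', 'B11', 'B12']
```

On a real parallel machine the same effect is usually obtained by sending into a temporary receive buffer, or by completing the send before the local array is overwritten.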

Increment t :

Once the above Roll, Broadcast and Multiply steps have completed, $t$ is increased by 1 and the steps are repeated while $t \le \sqrt{N} - 1$.

Final Result

Final Result:

Once all steps have been completed (i.e. partitioning, array initialisation and Roll, Broadcast, and Multiply) arrays $U_p$ will contain the answer - see Figure 1.

$$
C = \begin{pmatrix}
U_0 & U_1 & \cdots & U_{\sqrt{N}-1} \\
U_{\sqrt{N}} & U_{\sqrt{N}+1} & \cdots & U_{2\sqrt{N}-1} \\
\vdots & \vdots & \ddots & \vdots \\
U_{N-\sqrt{N}} & U_{N-\sqrt{N}+1} & \cdots & U_{N-1}
\end{pmatrix}
$$

Figure 1. The $m \times m$ matrix $C$ is represented by the sub-matrices $U$. The sub-matrices $U_0, U_1, \cdots, U_{N-1}$ are calculated on processors $P_0, P_1, \cdots, P_{N-1}$ respectively. Once the algorithm is complete all the sub-matrices $U$ can be gathered together to form the answer.
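The whole procedure can be simulated in a single process to check that the gathered sub-matrices really equal $A \times B$. The following is a sketch under stated assumptions, not a parallel implementation: the "processors" are positions in Python lists, $N$ is a perfect square with $\sqrt{N}$ dividing $m$, $T_p$ is initialised to the diagonal block $A_{ii}$ of its processor row (as the worked example's initial products suggest), and the loop runs for $\sqrt{N} - 1$ steps. All function names are illustrative.

```python
import math


def mat_mul(x, y):
    """Plain serial product of two matrices stored as lists of rows."""
    return [[sum(x[r][k] * y[k][c] for k in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


def mat_add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def block(matrix, i, j, size):
    """Sub-matrix at 0-based block position (i, j)."""
    return [row[j * size:(j + 1) * size] for row in matrix[i * size:(i + 1) * size]]


def fox_multiply(a, b, n_processors):
    """Simulate Fox's algorithm on N 'processors' and return C = A x B."""
    m, n = len(a), n_processors
    q = math.isqrt(n)
    if q * q != n or m % q != 0:
        raise ValueError("N must be a perfect square and sqrt(N) must divide m")
    size = m // q
    # Initialisation: processor p sits at 0-based grid position (p // q, p % q).
    r = [block(a, p // q, p % q, size) for p in range(n)]
    s = [block(b, p // q, p % q, size) for p in range(n)]
    t = [r[(p // q) * q + (p // q)] for p in range(n)]      # diagonal block of the row
    u = [mat_mul(t[p], s[p]) for p in range(n)]
    for step in range(1, q):
        # Roll: each processor takes S from the processor one row below (wrapping).
        s = [s[(p + q) % n] for p in range(n)]
        # Broadcast: row i receives the block 'step' columns right of its diagonal.
        t = [r[(p // q) * q + (p // q + step) % q] for p in range(n)]
        # Multiply: accumulate the new partial product.
        u = [mat_add(u[p], mat_mul(t[p], s[p])) for p in range(n)]
    # Gather the sub-matrices U_p back into one m x m matrix.
    c = [[0] * m for _ in range(m)]
    for p in range(n):
        i, j = divmod(p, q)
        for row in range(size):
            c[i * size + row][j * size:(j + 1) * size] = u[p][row]
    return c


a = [[r * 6 + c for c in range(6)] for r in range(6)]
b = [[(r + 2 * c) % 5 for c in range(6)] for r in range(6)]
print(fox_multiply(a, b, 9) == mat_mul(a, b))
```

Rebuilding `s` and `t` as new lists at each step sidesteps the over-writing hazard, at the cost of not modelling any communication.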

An Example of Fox's Algorithm

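$$
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & a_{36} \\
a_{41} & a_{42} & a_{43} & a_{44} & a_{45} & a_{46} \\
a_{51} & a_{52} & a_{53} & a_{54} & a_{55} & a_{56} \\
a_{61} & a_{62} & a_{63} & a_{64} & a_{65} & a_{66}
\end{pmatrix}
\times
\begin{pmatrix}
b_{11} & b_{12} & b_{13} & b_{14} & b_{15} & b_{16} \\
b_{21} & b_{22} & b_{23} & b_{24} & b_{25} & b_{26} \\
b_{31} & b_{32} & b_{33} & b_{34} & b_{35} & b_{36} \\
b_{41} & b_{42} & b_{43} & b_{44} & b_{45} & b_{46} \\
b_{51} & b_{52} & b_{53} & b_{54} & b_{55} & b_{56} \\
b_{61} & b_{62} & b_{63} & b_{64} & b_{65} & b_{66}
\end{pmatrix}
$$

Figure 1. The product of two $6 \times 6$ matrices. The figure shows the partitioned elements of the matrices. In this illustration we are creating nine $2 \times 2$ sub-matrices for each matrix.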

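If we call the sub-matrices $A_{11}, A_{12}, \cdots, A_{33}$ and $B_{11}, B_{12}, \cdots, B_{33}$ then we get two sets of nine $2 \times 2$ sub-matrices as illustrated in Figure 2.

$$
\begin{pmatrix}
A_{11} & A_{12} & A_{13} \\
A_{21} & A_{22} & A_{23} \\
A_{31} & A_{32} & A_{33}
\end{pmatrix}
\times
\begin{pmatrix}
B_{11} & B_{12} & B_{13} \\
B_{21} & B_{22} & B_{23} \\
B_{31} & B_{32} & B_{33}
\end{pmatrix}
$$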

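where

$$
A_{11} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad
A_{12} = \begin{pmatrix} a_{13} & a_{14} \\ a_{23} & a_{24} \end{pmatrix}, \quad
A_{13} = \begin{pmatrix} a_{15} & a_{16} \\ a_{25} & a_{26} \end{pmatrix}
$$

$$
A_{21} = \begin{pmatrix} a_{31} & a_{32} \\ a_{41} & a_{42} \end{pmatrix}, \quad
A_{22} = \begin{pmatrix} a_{33} & a_{34} \\ a_{43} & a_{44} \end{pmatrix}, \quad
A_{23} = \begin{pmatrix} a_{35} & a_{36} \\ a_{45} & a_{46} \end{pmatrix}
$$

$$
A_{31} = \begin{pmatrix} a_{51} & a_{52} \\ a_{61} & a_{62} \end{pmatrix}, \quad
A_{32} = \begin{pmatrix} a_{53} & a_{54} \\ a_{63} & a_{64} \end{pmatrix}, \quad
A_{33} = \begin{pmatrix} a_{55} & a_{56} \\ a_{65} & a_{66} \end{pmatrix}
$$

and,

$$
B_{11} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \quad
B_{12} = \begin{pmatrix} b_{13} & b_{14} \\ b_{23} & b_{24} \end{pmatrix}, \quad
B_{13} = \begin{pmatrix} b_{15} & b_{16} \\ b_{25} & b_{26} \end{pmatrix}
$$

$$
B_{21} = \begin{pmatrix} b_{31} & b_{32} \\ b_{41} & b_{42} \end{pmatrix}, \quad
B_{22} = \begin{pmatrix} b_{33} & b_{34} \\ b_{43} & b_{44} \end{pmatrix}, \quad
B_{23} = \begin{pmatrix} b_{35} & b_{36} \\ b_{45} & b_{46} \end{pmatrix}
$$

$$
B_{31} = \begin{pmatrix} b_{51} & b_{52} \\ b_{61} & b_{62} \end{pmatrix}, \quad
B_{32} = \begin{pmatrix} b_{53} & b_{54} \\ b_{63} & b_{64} \end{pmatrix}, \quad
B_{33} = \begin{pmatrix} b_{55} & b_{56} \\ b_{65} & b_{66} \end{pmatrix}
$$

Figure 2. The result of partitioning the two $6 \times 6$ matrices - two lots of nine $2 \times 2$ matrices.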

Array Initialisation

Initialisation:

Let us initialise the arrays $R$, $S$, $T$ and $U$ on all the processors. Extracts from A Description of Fox's Algorithm: Initialisation (greyed in the original) are quoted before each step.

The next step is to create a number of arrays on each processor to store the sub-matrices created during [A Description of Fox's Algorithm: Partitioning].

We shall label the $N$ processors $P_0, P_1, \cdots, P_{N-1}$. We shall use the notation that a singly subscripted array (e.g. $T_p$) belongs to the processor identified by the same subscript (e.g. $T_p$ belongs to processor $P_p$).

For each processor we must create four $(m/\sqrt{N}) \times (m/\sqrt{N})$ arrays:

For this example we must create four $2 \times 2$ arrays on each processor:

  1. $R_p$ is an $(m/\sqrt{N}) \times (m/\sqrt{N})$ array. It is used to store the sub-matrices $A_{ij}$ found during [A Description of Fox's Algorithm: Partitioning] such that:

We create $R_p$ as a $2 \times 2$ array on each processor, and store values of the sub-matrices $A_{ij}$ as shown below:

$$R_0 = A_{11} \quad R_1 = A_{12} \quad R_2 = A_{13}$$

$$R_3 = A_{21} \quad R_4 = A_{22} \quad R_5 = A_{23}$$

$$R_6 = A_{31} \quad R_7 = A_{32} \quad R_8 = A_{33}$$

This array does not change throughout the lifetime of the algorithm.

  2. $S_p$ is an $(m/\sqrt{N}) \times (m/\sqrt{N})$ array. It is used to store the sub-matrices $B_{ij}$ found during [A Description of Fox's Algorithm: Partitioning] such that, initially:

In our case the arrays $S_p$ are $2 \times 2$ arrays and store values of the sub-matrices $B_{ij}$ as shown below:

$$S^0_0 = B_{11} \quad S^0_1 = B_{12} \quad S^0_2 = B_{13}$$

$$S^0_3 = B_{21} \quad S^0_4 = B_{22} \quad S^0_5 = B_{23}$$

$$S^0_6 = B_{31} \quad S^0_7 = B_{32} \quad S^0_8 = B_{33}$$

Figure 1 shows the result once the initialisation step is complete. In this figure we have taken the above values for $T_p$ and $S_p$ and re-written them in terms of the sub-matrices $A_{ij}$ and $B_{ij}$.

$$
C = \begin{pmatrix}
U_0 & U_1 & U_2 \\
U_3 & U_4 & U_5 \\
U_6 & U_7 & U_8
\end{pmatrix}
$$

which equals,

$$
C = \begin{pmatrix}
T_0 \times S_0 & T_1 \times S_1 & T_2 \times S_2 \\
T_3 \times S_3 & T_4 \times S_4 & T_5 \times S_5 \\
T_6 \times S_6 & T_7 \times S_7 & T_8 \times S_8
\end{pmatrix}
$$

Or in terms of the sub-matrices $A_{ij}$ and $B_{ij}$

$$
C = \begin{pmatrix}
A_{11} \times B_{11} & A_{11} \times B_{12} & A_{11} \times B_{13} \\
A_{22} \times B_{21} & A_{22} \times B_{22} & A_{22} \times B_{23} \\
A_{33} \times B_{31} & A_{33} \times B_{32} & A_{33} \times B_{33}
\end{pmatrix}
$$

Figure 1. The $6 \times 6$ matrix $C$ is represented by the sub-matrices $U$. The sub-matrices $U_0, U_1, \cdots, U_8$ are calculated on processors $P_0, P_1, \cdots, P_8$ respectively.
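The pattern in the figure above follows from the processor numbering $p = (i-1)\sqrt{N} + (j-1)$: after initialisation, processor $P_p$ in row $i$, column $j$ multiplies the diagonal block $A_{ii}$ by $B_{ij}$. A minimal sketch that lists these initial products for $N = 9$ is given below; the function name `initial_products` and the string labels are illustrative choices.

```python
import math


def initial_products(n_processors):
    """List (p, T label, S label) for each processor after initialisation."""
    q = math.isqrt(n_processors)
    if q * q != n_processors:
        raise ValueError("the number of processors must be a perfect square")
    products = []
    for p in range(n_processors):
        i, j = divmod(p, q)                # 0-based row and column of processor p
        products.append((p, f"A{i + 1}{i + 1}", f"B{i + 1}{j + 1}"))
    return products


for p, t_label, s_label in initial_products(9):
    print(f"U{p} = {t_label} x {s_label}")
```

Each row of the processor grid shares one block of $A$, which is why that block only needs to be broadcast along the row.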

For N=16 we can have this graphical description,

Then, multiply sub-blocks and add into C,

Roll, Broadcast, Multiply

Introduction:

Now that all the matrices A and B have been partitioned and arrays created and initialised on all the N = 9 processors we can proceed with the next stage of the algorithm - A Description of Fox's Algorithm: Roll, Broadcast, Multiply.