Parallel Reduction Algorithms - Parallel Processing - Lecture Slides, Slides of Parallel Computing and Programming

Some concepts of Parallel Processing are Anatomy, Cache Access Time, Instruction Formats, Multidimensional Meshes, Network Processors, Snooping Protocol. Main points of this lecture are: Parallel Reduction Algorithms, Different Interconnection Topologies, Message-Passing Parallel Program, Reduction Computations and Their Parallelization, Recursive Reduction Approach

Typology: Slides

2012/2013

Uploaded on 04/30/2013

devank

Lecture 2: Parallel Reduction Algorithms & Their
Analysis on Different Interconnection Topologies


An example of an SPMD message-passing parallel program


Reduction Computations & Their Parallelization

  • The prior max computation is a reduction computation, defined as x = f(D), where D is a data set (e.g., a vector) and x is a scalar quantity. In the max computation, f is the max function and D is the set/vector of numbers for which the max is to be computed.

  • Reduction computations that are associative [defined as f(a,b,c) = f(f(a,b), c) = f(a, f(b,c))] can be easily parallelized using a recursive reduction approach (the final value of f(D) needs to be at some processor at the end of the parallel computation):

  • The data set D is evenly distributed among P processors; each processor Pi has a disjoint subset Di of |D|/P data elements.
  • Each processor Pi performs the computation f on its own data set Di.
  • Each processor then engages in (log P) rounds of message passing with other processors. In the k’th round, Pi communicates with a unique partner processor Pj = partner(Pi, k): it either sends its current f computation result to Pj or receives Pj’s current result (depending, say, on whether its ID is greater than or less than Pj’s, respectively).
  • If Pi receives a computation result b from Pj in the k’th round, it computes a = f(a,b), where a is its current result, and participates in the (k+1)’th round of communication. If Pi has sent its result to Pj, it does not participate in any further rounds of communication and computation; it is done with its task.
  • At the end of the (log P) rounds of communication, the processor with the least ID (= 0) will hold f(D).
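The steps above can be sketched as a sequential Python simulation. This is only an illustration under stated assumptions, not the lecture's SPMD program: the function name `recursive_halving_reduce` is hypothetical, P is assumed to be a power of 2, and the partner function is assumed to be partner(Pi, k) = i XOR 2^(k-1), with the higher-ID processor sending.

```python
from functools import reduce

def recursive_halving_reduce(data, P, f):
    """Simulate recursive halving of `data` over P processors (hypothetical helper)."""
    assert P > 0 and P & (P - 1) == 0, "sketch assumes P is a power of 2"
    assert len(data) >= P and len(data) % P == 0, "sketch assumes an even split"
    chunk = len(data) // P
    # Distribute D evenly and let each processor reduce its own subset Di.
    partial = {i: reduce(f, data[i * chunk:(i + 1) * chunk]) for i in range(P)}
    bit = 1
    while bit < P:  # log2(P) rounds in total
        # Active processors whose current bit is set send and then drop out.
        for sender in [p for p in partial if p & bit]:
            receiver = sender ^ bit  # assumed partner function
            # Receiver keeps its own result on the left, preserving data order.
            partial[receiver] = f(partial[receiver], partial.pop(sender))
        bit <<= 1
    return partial[0]  # processor 0 holds f(D)

print(recursive_halving_reduce([3, 9, 1, 7, 4, 8, 2, 6], 4, max))
```

Because the receiver always combines its own result on the left, the sketch also works for associative operations that are not commutative, such as string concatenation.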

Reduction Computations & Their Parallelization (contd.)

  • Assuming (Pi, Pj), where Pj = partner(Pi, k), is a unique send-recv pair in round k, the # of processors holding the required partial computation results halves after each round; hence the term recursive halving for such parallel computations.
  • In general, there are other variations of parallel reduction computations (generally dictated by the interconnection topology) in which the # of processors reduces by some factor other than 2 in each round of communication. The general term for such parallel computations is recursive reduction.
  • A topology-independent recursive-halving communication pattern is shown below. Note also that as the # of processors involved halves, the # of initial data sets that each “active” partial result represents/covers doubles (a recursive doubling of coverage of data sets by each active partial result). The total # of msgs sent is P-1.

[Figure: recursive-halving communication pattern, with four messages in time step 1, two in time step 2, and one in time step 3.]
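The halving of active processors, the doubling of coverage, and the P-1 message count can be checked with a small trace. This is a minimal sketch: the function name `halving_trace` is hypothetical, and it assumes P is a power of 2 and the XOR-based partner choice, neither of which is fixed by the slides.

```python
def halving_trace(P):
    """Return (total messages, per-round trace) for recursive halving on P processors."""
    assert P > 0 and P & (P - 1) == 0, "sketch assumes P is a power of 2"
    # coverage[i] = set of initial data sets represented by processor i's partial result
    coverage = {i: {i} for i in range(P)}
    messages = 0
    trace = []  # entries: (round, active processors, data sets covered by processor 0)
    bit, rnd = 1, 1
    while len(coverage) > 1:
        for sender in [p for p in coverage if p & bit]:
            coverage[sender ^ bit] |= coverage.pop(sender)  # one message per sender
            messages += 1
        trace.append((rnd, len(coverage), len(coverage[0])))
        bit <<= 1
        rnd += 1
    return messages, trace

total, rounds = halving_trace(8)
print(total)   # 7 messages, i.e. P - 1
print(rounds)  # [(1, 4, 2), (2, 2, 4), (3, 1, 8)]
```

The count follows from P/2 + P/4 + ... + 1 = P - 1: every processor except processor 0 sends exactly once.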

Analysis of Parallel Reduction on Different Topologies

  • Recursive-halving-based reduction on a hypercube:
    • Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
    • Communication time = Theta(log P), as there are (log P) message-passing rounds; in each round all msgs are sent in parallel, each msg is a 1-hop msg, and there is no conflict among msgs.
    • Computation time during communication rounds = Theta(log P) [1 reduction operation in each processor in each round].
  • Same computation and communication time for exchange communication on a hypercube.
  • Speedup = S(P) = Seq_time/Parallel_time(P) = Theta(N)/Theta((N/P) + 2*log P) ~ Theta(P) if N >> P (more precisely, when N/P dominates log P, i.e., N = Omega(P log P)).
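The speedup expression above can be evaluated numerically. The following is a rough model, not a measurement: the function name `hypercube_speedup` and the unit costs `t_op` and `t_msg` are assumptions introduced for illustration, with all Theta constants set to 1.

```python
from math import log2

def hypercube_speedup(N, P, t_op=1.0, t_msg=1.0):
    """Model S(P) = N / (N/P + 2*log2(P)) with assumed unit costs."""
    seq_time = N * t_op                   # Theta(N) sequential reduction
    local_time = (N / P) * t_op           # initial local reduction
    round_time = t_msg + t_op             # one 1-hop msg plus one reduction op per round
    par_time = local_time + log2(P) * round_time
    return seq_time / par_time

for N in (64, 1024, 1 << 20):
    print(N, round(hypercube_speedup(N, 16), 2))
```

With P = 16, the modeled speedup approaches 16 only when N/P is much larger than log P; for N = 64 the 2*log P term still dominates the parallel time.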

[Figure: (a) Hypercubes of dimensions 0 to 4. (b) Msg pattern for a reduction computation using recursive halving; processor 000 will hold the final result. (c) Msg pattern for a reduction computation using exchange communication; all processors will hold the final result. Edge labels give the time steps.]
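Exchange communication, in which every processor ends up holding the final result, can be sketched as follows. This is an illustrative simulation under assumptions: the function name `exchange_reduce` is hypothetical, P is a power of 2, and in round k each processor i swaps results with its hypercube neighbor i XOR 2^(k-1).

```python
def exchange_reduce(values, f):
    """Simulate exchange communication; values[i] is processor i's local result."""
    P = len(values)
    assert P > 0 and P & (P - 1) == 0, "sketch assumes P is a power of 2"
    cur = list(values)
    bit = 1
    while bit < P:  # log2(P) rounds; every processor stays active
        nxt = []
        for i in range(P):
            j = i ^ bit  # neighbor along one hypercube dimension
            lo, hi = (i, j) if i < j else (j, i)
            # Both partners combine in the same order, so they agree on the result.
            nxt.append(f(cur[lo], cur[hi]))
        cur = nxt
        bit <<= 1
    return cur  # every entry equals f(D)

print(exchange_reduce([3, 9, 1, 7], max))  # [9, 9, 9, 9]
```

In this sketch each processor sends one message per round, so P*log P messages are sent in total (versus P-1 for recursive halving), while the number of rounds is the same.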

Analysis of Parallel Reduction on Different Topologies (contd.)

  • Recursive reduction on a direct tree:
    • Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
    • Communication time = Theta(log(P/2)), as there are log((P+1)/2) message-passing rounds; in each round all msgs are sent in parallel, each msg is a 1-hop msg, and there is no conflict among msgs.
    • Computation time during communication rounds = Theta(2*log(P/2)) = Theta(2*log P) [2 reduction operations in each “parent” processor in each round].
  • Speedup = S(P) = Seq_time/Parallel_time(P) = Theta(N)/Theta((N/P) + 3*log P) ~ Theta(P) if N >> P
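The round count and the two reduction operations per parent can be illustrated with a small simulation of a direct tree. This is a sketch under assumptions not stated in the slides: the function name `tree_reduce` is hypothetical, the P = 2^d - 1 processors form a complete binary tree stored in heap order (children of node i are 2i+1 and 2i+2), and f is treated as commutative as well as associative, since the combining order does not follow the data order.

```python
def tree_reduce(partials, f):
    """Reduce per-processor partial results up a complete binary tree.

    Returns (result at the root, number of message-passing rounds).
    """
    P = len(partials)
    assert P > 0 and (P + 1) & P == 0, "sketch assumes P = 2**d - 1"
    vals = list(partials)
    depth = (P + 1).bit_length() - 1  # number of tree levels, d
    rounds = 0
    # Work upward one level per round, starting with the parents of the leaves.
    for level in range(depth - 2, -1, -1):
        for parent in range(2 ** level - 1, 2 ** (level + 1) - 1):
            left, right = 2 * parent + 1, 2 * parent + 2
            # Two reduction operations in the parent per round.
            vals[parent] = f(f(vals[parent], vals[left]), vals[right])
        rounds += 1
    return vals[0], rounds

print(tree_reduce([5, 3, 8, 1, 9, 2, 7], max))  # (9, 2)
```

For P = 7 the sketch takes 2 rounds, matching log((P+1)/2) = log 4 = 2.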

[Figure: Recursive reduction in (a) a direct tree network; and (b) an indirect tree network. Labels in the figure give time steps and (round #, hops) pairs.]