Parallel Prefix Algorithms: Coarse-Grained vs. Fine-Grained Approaches for Prefix Sums

These notes cover parallel prefix algorithms, comparing coarse-grained and fine-grained approaches to the prefix sum problem. Most practical parallel algorithms are coarse-grained (n/p >> 1), because adequate speedup requires computation to outweigh communication. A fine-grained algorithm designed for p = n, however, remains efficient for all p < n. The notes also cover the recursive and non-recursive algorithms for parallel prefix sums, their analysis, and their advantages.

Uploaded on 02/24/2010

koofers-user-y31

Parallel Prefix
9/10/09

Coarse vs. fine grained

  • Unlike the problem on the first homework, most parallel algorithms are coarse-grained.
  • i.e., n/p >> 1
  • This is required to achieve adequate speedups on most parallel systems.
  • computation > communication

Algorithm design

  • Many parallel algorithms have sequential and parallel modules.
  • The best approach is to design a fine-grained algorithm for p = n.
    • This guarantees efficiency for all p < n, the more typical situation, because efficiency is preserved (within a constant factor) when an algorithm is scaled down to fewer processors.
  • We will discuss a specific example of this today.

Prefix sums

  • We are given n elements x_0, x_1, …, x_{n-1} and a binary associative operator ⊗.
  • Computing the partial sums s_0, s_1, …, s_{n-1}, where s_i = x_0 ⊗ x_1 ⊗ … ⊗ x_i, is called the prefix sum problem.

  • The serial algorithm takes Θ(n) time, which matches the Ω(n) lower bound for the problem.
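The serial baseline can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `serial_prefix` and the choice to pass the operator as a callable are assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def serial_prefix(xs: Sequence[T], op: Callable[[T, T], T]) -> List[T]:
    """Return [s_0, ..., s_{n-1}] where s_i = x_0 op x_1 op ... op x_i.

    `serial_prefix` is a hypothetical name; `op` must be associative.
    """
    out: List[T] = []
    for x in xs:
        # s_i = s_{i-1} op x_i; the first element starts the running value.
        out.append(x if not out else op(out[-1], x))
    return out
```

The loop applies the operator n - 1 times, and the operands are always combined left to right, so the sketch also works for associative operators that are not commutative (string concatenation, for instance).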

Parallel Prefix

  • We would like to develop a parallel algorithm to compute prefix sums.
  • In addition, it turns out that many different problems can be solved efficiently using this parallel algorithm.
    • e.g., with max or min as the operator
  • Often called a scan or sweep operation.

Example

  • Parallel prefix ([1,2,3,4,5,6,7,8], sum)
    • Returns [1, 3, 6, 10, 15, 21, 28, 36]
  • Although we will focus on sums today, many other operators can be used to solve an assortment of problems.

  • Examples will be in the next homework problem set handed out Tues.
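To see how the same scan behaves with other operators, Python's standard `itertools.accumulate` can stand in for the prefix operation. This is only a serial illustration of the semantics; the second input list is made up for the example.

```python
from itertools import accumulate
import operator

# The example from the slide: prefix sums of 1..8.
prefix_sum = list(accumulate([1, 2, 3, 4, 5, 6, 7, 8], operator.add))

# The same scan with max and min as the associative operator,
# on an arbitrary illustrative input.
xs = [3, 1, 4, 1, 5, 9, 2, 6]
prefix_max = list(accumulate(xs, max))  # running maximum
prefix_min = list(accumulate(xs, min))  # running minimum
```

Here `prefix_sum` reproduces the list shown in the example above, while `prefix_max` and `prefix_min` hold the running maximum and minimum of `xs`.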

Return to parallel sums

  • In class, we discussed an algorithm for sums where one processor received the answer.
  • For your homework, you were asked to develop an alternative where all processors get the result.
    • This can be achieved by exchanging values instead of using unidirectional communication.

One answer

  • For i = 0 to d - 1 do (d = log p, the dimension of the hypercube)
    • Send sum to the processor obtained by inverting the i-th bit of this processor's id
    • Receive sum from the processor obtained by inverting the i-th bit of this processor's id
    • Add received sum to local sum
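The exchange loop above can be simulated serially, with one list entry per processor. This is a sketch under assumptions: p is a power of two, d = log2 p, and the name `hypercube_allreduce` is hypothetical. A real implementation would use message passing rather than list indexing.

```python
from typing import List


def hypercube_allreduce(values: List[int]) -> List[int]:
    """Simulate the all-to-all sum on a hypercube (hypothetical helper).

    values[rank] is the number initially held by processor `rank`.
    Returns the sum held by every processor after d exchange rounds.
    """
    p = len(values)
    d = p.bit_length() - 1
    if p == 0 or (1 << d) != p:
        raise ValueError("p must be a power of two")
    sums = list(values)
    for i in range(d):
        # Each processor receives the current sum of the partner whose id
        # differs in bit i; all exchanges in a round happen simultaneously.
        received = [sums[rank ^ (1 << i)] for rank in range(p)]
        sums = [s + r for s, r in zip(sums, received)]
    return sums
```

After round i, every processor holds the sum over the 2^(i+1) processors that share its id outside the lowest i + 1 bits, so after d rounds all of them hold the full total.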

Illustration

[Figure: input vector, prefix sums of 1st half, prefix sums of 2nd half (requires log p time). Analysis from Aluru, Chapter 1.]

Improving this further

  • Suppose that we calculate both the prefix sums and the total sums on each of the two partitions.
  • This will add slightly more memory but result in a substantial improvement, as we will see.
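One way to picture this idea is a recursive simulation in which every processor carries a (prefix sum, total sum) pair. The sketch below is an illustration under assumptions, not the lecture's code: p is a power of two, each processor holds one element, and `recursive_scan` is a hypothetical name.

```python
from typing import List, Tuple


def recursive_scan(values: List[int]) -> List[Tuple[int, int]]:
    """Return one (prefix_sum, total_sum) pair per simulated processor."""
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("p must be a power of two")
    if p == 1:
        return [(values[0], values[0])]
    half = p // 2
    lo = recursive_scan(values[:half])   # first partition
    hi = recursive_scan(values[half:])   # second partition
    # Single pairwise exchange: processor r in the first half swaps totals
    # with processor r + half. Everyone adds the partner's total to its own
    # total; only the second half also adds it to its prefix sum.
    new_lo = [(pl, tl + th) for (pl, tl), (_, th) in zip(lo, hi)]
    new_hi = [(tl + ph, tl + th) for (_, tl), (ph, th) in zip(lo, hi)]
    return new_lo + new_hi
```

Because each processor already knows the total of its own partition, combining two partitions needs only this one exchange instead of a separate step to distribute the first partition's sum.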

Illustration

[Figure: input vector, prefix sums of 1st half, prefix sums of 2nd half (requires one hypercubic permutation).]

Non-recursive algorithm

  • Set prefix sum to be the element on this processor.
  • Set total sum to be prefix sum.
  • For i = 0 to d - 1
    • Send total sum to the processor obtained by inverting the i-th bit of this processor's id, and receive that processor's total sum back
    • Add received sum to total sum
    • If the partner processor has a smaller id, also add received sum to prefix sum
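The steps above can be sketched as a serial simulation with one list slot per processor. The assumptions are that p is a power of two, that d = log2 p, and that the operator is integer addition; the name `hypercube_prefix` is hypothetical.

```python
from typing import List


def hypercube_prefix(values: List[int]) -> List[int]:
    """Simulate the non-recursive hypercube prefix sum (p = n)."""
    p = len(values)
    d = p.bit_length() - 1
    if p == 0 or (1 << d) != p:
        raise ValueError("p must be a power of two")
    prefix = list(values)  # prefix sum held by each processor
    total = list(values)   # total sum of the subcube seen so far
    for i in range(d):
        # All processors exchange their current totals across dimension i.
        received = [total[rank ^ (1 << i)] for rank in range(p)]
        for rank in range(p):
            partner = rank ^ (1 << i)
            total[rank] += received[rank]
            if partner < rank:
                # The partner's subcube precedes this processor.
                prefix[rank] += received[rank]
    return prefix
```

For an operator that is associative but not commutative, the received value would have to be combined on the left when it comes from a smaller id; plain addition hides that detail.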

Efficiency

  • This algorithm is as efficient as adding n numbers.
    • Only a constant amount of additional work is required during the parallel algorithm.
  • Further, we can compute all prefix sums in the time it takes to compute the last prefix sum s_{n-1}, which is also the total sum.

Details

  • In the non-recursive algorithm, we maintain two variables:
    • Prefix sum
    • Total sum
  • The algorithm contains log p phases, each of which requires O(1) computation and one communication.
    • The worst case is two adds, which happens when the communication is with a processor of lower rank.

Analysis of algorithm

  • Where p = n
    • Computation time = O (log p )
    • Communication time = O((τ + μ) log p )
  • Where p < n
    • Computation time = O( n/p + log p )
    • Communication time = O((τ + μ) log p )
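A common way to reach the O(n/p + log p) computation bound when each processor holds n/p elements is to scan locally, run the hypercube scan on the block totals, and then add each block's offset. The sketch below simulates that under assumptions not stated in the notes: a block distribution, p a power of two, and n a positive multiple of p. The name `blocked_prefix` is hypothetical.

```python
from itertools import accumulate
from typing import List


def blocked_prefix(xs: List[int], p: int) -> List[int]:
    """Simulate a coarse-grained prefix sum on p processors."""
    n = len(xs)
    if p < 1 or p & (p - 1) or n == 0 or n % p:
        raise ValueError("need p a power of two and n a positive multiple of p")
    size = n // p
    # Step 1: local prefix sums on each block, O(n/p) work per processor.
    local = [list(accumulate(xs[r * size:(r + 1) * size])) for r in range(p)]
    # Step 2: hypercube scan over block totals, log p rounds.
    total = [block[-1] for block in local]
    offset = [0] * p  # sum of all blocks that precede this processor
    for i in range(p.bit_length() - 1):
        received = [total[r ^ (1 << i)] for r in range(p)]
        for r in range(p):
            if (r ^ (1 << i)) < r:
                offset[r] += received[r]
            total[r] += received[r]
    # Step 3: add the offset to every local element, O(n/p) work.
    return [v + offset[r] for r in range(p) for v in local[r]]
```

Only step 2 communicates, and it sends a single value per round, which is consistent with the O((τ + μ) log p) communication time listed above.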

Example

  • Suppose we want to calculate the rank of a processor given “marked” and “unmarked” processors.
  • Rank = number of preceding marked processors; it is defined if and only if this processor is itself marked.
  • How would you solve this problem?
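One possible answer is a prefix sum over 0/1 marks: each marked processor contributes 1, and a marked processor's rank is its inclusive prefix sum minus one. The sketch below shows this serially. The name `marked_ranks` and the use of `None` for unmarked processors are assumptions made for the illustration.

```python
from itertools import accumulate
from typing import List, Optional


def marked_ranks(marked: List[bool]) -> List[Optional[int]]:
    """Rank of each marked processor among the marked ones, else None."""
    # Inclusive prefix sum of the marks (1 if marked, 0 otherwise).
    counts = list(accumulate(int(m) for m in marked))
    # Subtract the processor's own mark to count only preceding ones.
    return [c - 1 if m else None for c, m in zip(counts, marked)]
```

In a parallel setting the `accumulate` call would be replaced by the parallel prefix algorithm with addition as the operator.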

Butterfly networks

  • We can retain the connectivity of a hypercube of p processors using p (log p + 1) processors.
  • These are arranged in log p + 1 columns of p processors each.

Details

  • Links connecting processors on a row are called row links; others are appropriately called hypercube links.
  • In practice, dynamic interconnection networks achieve similar results with fewer processors.
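The row links and hypercube links can be enumerated with a short sketch. It assumes p is a power of two, labels each processor as (row, column), and assumes that the links between columns c and c + 1 flip bit c of the row number. That bit ordering is one common convention rather than something the notes specify, and `butterfly_links` is a hypothetical name.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (row, column)


def butterfly_links(p: int) -> Tuple[List[Tuple[Node, Node]], List[Tuple[Node, Node]]]:
    """Return (row_links, hypercube_links) of a butterfly with p rows."""
    d = p.bit_length() - 1
    if p < 1 or (1 << d) != p:
        raise ValueError("p must be a power of two")
    row_links, cube_links = [], []
    for col in range(d):
        for row in range(p):
            # Row link: same row, next column.
            row_links.append(((row, col), (row, col + 1)))
            # Hypercube link: row with bit `col` inverted, next column
            # (assumed convention).
            cube_links.append(((row, col), (row ^ (1 << col), col + 1)))
    return row_links, cube_links
```

For p = 8 this yields 24 links of each kind across 4 columns of 8 processors, which matches the p (log p + 1) processor count given above.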

Dynamic butterfly networks

  • Uses a switch called a crossbar.
  • Processors are connected to a crossbar, crossbars to crossbars.
  • log p columns of p/2 switches each
  • Communication is achieved in log p stages.
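To illustrate why log p stages suffice, the sketch below routes a message by correcting one bit of the row number per stage. It is a simplified model under assumptions: p is a power of two, stage i fixes bit i, and the path is tracked as row numbers rather than as physical p/2-switch columns. The name `butterfly_route` is hypothetical.

```python
from typing import List


def butterfly_route(src: int, dst: int, p: int) -> List[int]:
    """Rows visited when routing from src to dst in log p stages."""
    d = p.bit_length() - 1
    if p < 1 or (1 << d) != p or not (0 <= src < p and 0 <= dst < p):
        raise ValueError("p must be a power of two; src and dst in [0, p)")
    row = src
    path = [row]
    for stage in range(d):
        bit = 1 << stage
        # Go straight if bit `stage` already matches dst, else cross over.
        row = (row & ~bit) | (dst & bit)
        path.append(row)
    return path
```

Each stage settles one bit of the destination address, so after log p stages the message has reached row `dst` regardless of where it started.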