Parallel Reduction - Lecture Slides | CSE 40833, Study notes of Computer Science

Material Type: Notes; Class: Introduction to Parallel Algorithms and Programming; Subject: Computer Science and Engr.; University: Notre Dame; Term: Fall 2008;

Uploaded on 02/24/2010

Extreme Computing: Parallel Prefix Reductions
8/12/08
Parallel Reduction

Common Problem: “Reduction”

a.k.a “Prefix”

  • Common code:

for (i=0; i<n; i++) {sum=sum+x[i];}

  • Generalization:

for (i=0; i<n; i++) {sum=foo(sum,x[i]);}

  • where foo has certain associative-like properties
    • And, Or, *, max, min, ….
  • Even more general problem:

for (i=1; i<n; i++) {out[i]=out[i-1]+x[i];}

for (i=1; i<n; i++) {out[i]=foo(out[i-1],x[i]);}

  • where out[0] is initialized to some value like “0”
  • Problem: how to convert to CUDA code that runs in parallel
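A minimal sequential sketch of the two loop shapes above, in plain C. The names binop, reduce, scan, add_f and max_f are hypothetical and chosen only for illustration. The sketch assumes arrays of n elements, with out[0] set by the caller.

```c
#include <stddef.h>

/* Hypothetical names for illustration only. */
typedef float (*binop)(float, float);

float add_f(float a, float b) { return a + b; }
float max_f(float a, float b) { return a > b ? a : b; }

/* Reduction: fold x[0..n-1] into a single value, starting from init. */
float reduce(binop foo, float init, const float *x, size_t n) {
    float sum = init;
    for (size_t i = 0; i < n; i++)
        sum = foo(sum, x[i]);
    return sum;
}

/* Prefix form: the caller initialises out[0]; every later entry
   depends on the one before it, which is what makes the loop
   hard to run in parallel as written. */
void scan(binop foo, float *out, const float *x, size_t n) {
    for (size_t i = 1; i < n; i++)
        out[i] = foo(out[i - 1], x[i]);
}
```

Passing add_f gives the running-sum versions from the slide; passing max_f gives a running maximum with the same loop structure.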

Prefix Sum

  • for (i=0; i<n; i++) {sum=sum+x[i];}
  • Notes: Addition is both
    – Associative: (a + (b + c)) = ((a + b) + c)
    – and Commutative: a + b = b + a
      • Thus we are free to add up numbers in any order
  • Unsatisfactory CUDA Solutions
    – sum+=x[threadIdx.x]; doesn’t work (why?)
    – atomicAdd(&sum, x[threadIdx.x]); does, but
      • exceedingly slow: no parallelism
      • Doesn’t work for floats (float atomicAdd requires compute capability 2.0 or later)
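One way to see the problem with the unsynchronized update is to emulate a single possible interleaving on the host. The sketch below is plain C, not CUDA, and the name racy_sum_lockstep is hypothetical. It assumes every thread loads sum before any thread stores its result, which is only one of many orderings the hardware might produce.

```c
#include <stddef.h>

/* Emulates one possible interleaving of `sum += x[threadIdx.x]`:
   all threads read the old sum first, then each writes back
   old_sum + x[t]. Hypothetical illustration, not real CUDA behaviour
   in every case. */
float racy_sum_lockstep(const float *x, size_t n_threads) {
    float sum = 0.0f;
    float seen = sum;              /* the value every thread loaded */
    for (size_t t = 0; t < n_threads; t++)
        sum = seen + x[t];         /* each store overwrites the last */
    return sum;                    /* only the final store survives */
}
```

Under this interleaving, the inputs 1, 2, 3, 4 yield 4 rather than 10, because every update but the last is lost.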

Log Sum Reduction = “Parallel Prefix”

Assume P = 2^K processors and N = 2P operands

Step 1: P processors consume 2P operands & create P results

Step 2: P/2 processors consume P operands & create P/2 results

…

Step K: 2 processors consume 4 operands & create 2 results

Step K+1: 1 processor consumes 2 operands & creates 1 result

For N = 2P operands, it takes log2(N) = log2(P) + 1 = K + 1 steps

Sample Code – Small N

(Untested)

__global__ void SumKernel(float* values, unsigned N)
/* assume N is a power of 2 and no more than twice blockDim */
{
    unsigned i = threadIdx.x;
    unsigned Stride = N>>1;
    while (Stride > 0)
    {
        if (i < Stride)   /* lower half only, so no thread reads a slot being written */
            values[i] += values[i + Stride];
        __syncthreads();
        Stride = Stride>>1; /* halve stride */
    }
}   /* values[0] now holds the sum */

Note that data is kept grouped in consecutive locations

Time O(log2(N))
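The stride-halving loop can be checked on the host with a sequential emulation. This is a sketch in plain C under stated assumptions: N is a power of two, the inner loop stands in for the threads of one block, and the end of each inner loop stands in for __syncthreads(). Only indices below the stride are active, so no emulated thread reads a slot that another one writes in the same step. The name sum_reduce_emulated is hypothetical.

```c
#include <assert.h>

/* Host-side emulation of a stride-halving sum reduction.
   Returns the number of steps taken; values[0] holds the sum. */
unsigned sum_reduce_emulated(float *values, unsigned n) {
    assert(n > 0 && (n & (n - 1)) == 0);   /* sketch assumes a power of two */
    unsigned steps = 0;
    for (unsigned stride = n >> 1; stride > 0; stride >>= 1) {
        /* "threads" 0..stride-1 are active in this step */
        for (unsigned i = 0; i < stride; i++)
            values[i] += values[i + stride];
        steps++;                            /* barrier would go here */
    }
    return steps;
}
```

For eight operands the emulation takes three steps, matching the log2(N) step count.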

Larger N – Approach 1

(values must be in global)

  • Move 2*blockDim sized chunks of values to temp0 (in shared)
  • Initiate a grid of calls to prior kernel to reduce to N/(2*blockDim) points
    – Grid of size N/(2*blockDim)
    – Have the sum moved to temp1[blockIdx.x]
  • Repeat above
    – With N = N/(2*blockDim)
    – Use temp1 as source, temp0 as destination
  • Stop when only one result
  • Problem: shared is limited in size!!
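The repeated-pass idea can be sketched on the host as follows. This is an emulation in plain C, not the slide's kernel: CHUNK is a hypothetical stand-in for 2*blockDim, the local array shared stands in for the per-block shared-memory chunk, and temp0 and temp1 are treated as ordinary arrays of block sums that swap roles on each pass. Both buffers are assumed to hold at least ceil(n/CHUNK) entries.

```c
/* Hypothetical stand-in for 2*blockDim; must be a power of two. */
enum { CHUNK = 8 };

/* One pass: "block" b reduces src[b*CHUNK ...] and stores its sum in dst[b].
   Returns the number of results written. */
unsigned reduce_pass(const float *src, unsigned n, float *dst) {
    unsigned blocks = (n + CHUNK - 1) / CHUNK;
    for (unsigned b = 0; b < blocks; b++) {
        float shared[CHUNK];                     /* stands in for shared memory */
        for (unsigned i = 0; i < CHUNK; i++) {   /* load, zero-padding the tail */
            unsigned g = b * CHUNK + i;
            shared[i] = (g < n) ? src[g] : 0.0f;
        }
        for (unsigned stride = CHUNK >> 1; stride > 0; stride >>= 1)
            for (unsigned i = 0; i < stride; i++)
                shared[i] += shared[i + stride];
        dst[b] = shared[0];
    }
    return blocks;
}

/* Repeat passes, swapping source and destination, until one value remains.
   Overwrites temp0 and temp1. */
float reduce_large(const float *values, unsigned n, float *temp0, float *temp1) {
    if (n == 0) return 0.0f;
    unsigned count = reduce_pass(values, n, temp1);
    float *src = temp1, *dst = temp0;
    while (count > 1) {
        count = reduce_pass(src, count, dst);
        float *t = src; src = dst; dst = t;      /* ping-pong the buffers */
    }
    return src[0];
}
```

Each pass shrinks the problem by a factor of CHUNK, so the number of passes grows only logarithmically with N.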

Timing It Out

  • Assume we have N operands to combine, N >= P
  • Assume $P = 2^k$, $N = 2^{mk} = P^m$, $m \gg 1$
  • Last step combines P operands in k steps
    – 2nd to last has P sets of P = $P^2$ operands: $kP$ steps
    – 3rd to last has $P^3$ operands: $kP^2$ steps
    – …
    – First pass has $N = 2^{mk}$ operands: $kP^{m-1}$ steps
  • $T(P) = k + kP + kP^2 + \dots + kP^{m-1} = k(P^m - 1)/(P - 1) = \log_2(P)(N - 1)/(P - 1) \approx \log_2(P)\,N/P$
  • If T(1) = N, then Speedup(P) = $N/(\log_2(P)\,N/P) = P/\log_2(P)$
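The geometric sum can be checked numerically. The sketch below, in plain C with hypothetical function names, adds the per-pass step counts k, kP, ..., kP^(m-1) term by term and compares them with the closed form k(P^m - 1)/(P - 1), assuming P = 2^k with k >= 1 and N = P^m.

```c
/* T(P) summed term by term: k + kP + ... + kP^(m-1), with P = 2^k. */
unsigned long reduction_steps(unsigned k, unsigned m) {
    unsigned long P = 1UL << k, term = k, total = 0;
    for (unsigned j = 0; j < m; j++) {   /* j-th pass counted from the end */
        total += term;
        term *= P;
    }
    return total;
}

/* Closed form k(P^m - 1)/(P - 1); needs k >= 1 so that P > 1. */
unsigned long reduction_steps_closed(unsigned k, unsigned m) {
    unsigned long P = 1UL << k, N = 1;
    for (unsigned j = 0; j < m; j++)
        N *= P;                          /* N = P^m */
    return k * (N - 1) / (P - 1);
}
```

For k = 2 and m = 3 (P = 4, N = 64) both functions give 42 steps, while the approximation log2(P)N/P gives 32; the approximation tightens as P grows, since (N - 1)/(P - 1) approaches N/P.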

Recurrences

A Scan

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008, p.4)

  • Standard sequential code
    – Exactly length adds
  • Simple-minded parallel code: n processors (1 per output)
    – Each of n threads does log2(n) adds, for a total of n log2(n) adds
    – Problem in CUDA: for large n need to sync the updates

Double Buffered Version
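The slide's code did not survive extraction, so the following is only a sketch of the general double-buffering idea, emulated on the host in plain C. Each step reads from one buffer and writes to the other, so no emulated thread can read a value that another has already updated in the same step. It computes an inclusive scan, which is an assumption (the referenced text may use an exclusive variant), and the name scan_double_buffered is hypothetical.

```c
/* Naive parallel scan with two buffers, emulated sequentially.
   buf_a holds the input, buf_b is scratch of the same length.
   Returns whichever buffer holds the inclusive scan at the end. */
float *scan_double_buffered(float *buf_a, float *buf_b, unsigned n) {
    float *in = buf_a, *out = buf_b;
    for (unsigned offset = 1; offset < n; offset <<= 1) {
        for (unsigned i = 0; i < n; i++)          /* one "thread" per output */
            out[i] = (i >= offset) ? in[i] + in[i - offset] : in[i];
        float *t = in; in = out; out = t;         /* swap roles for next step */
    }
    return in;   /* buffer written last, or buf_a if n <= 1 */
}
```

The loop runs about log2(n) steps with up to n adds each, which is where the n log2(n) total comes from.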

A More Efficient Algorithm

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 7)

  • Goal: more efficient algorithm
  • Use balanced trees
    – Binary tree with n leaves,
    – log2(n) levels,
    – 2^d nodes at level d
    – and n-1 total internal nodes
  • If one add per interior node, n-1 adds
  • Two phases
    – up-sweep: go up through tree computing partial sums
    – down-sweep: go down from root, building scan in place

Up-Sweep: Just Like Reduction Sum

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 8)

[Figure: contents of memory at levels d=2, d=1, d=0 and the final contents; slide annotation: “Note error in text”]

CUDA Code: Part 1: Up-Sweep

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 10)

CUDA Code: Part 2: Down-Sweep

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 10)
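The CUDA listings for the two phases are not reproduced in these notes. As a rough guide only, here is a host-side sketch in plain C of a balanced-tree scan with an up-sweep followed by a down-sweep. It assumes n is a power of two, works in place, and produces an exclusive scan; each inner loop stands in for the threads active at one tree level. The name scan_work_efficient is hypothetical, and the sketch is not the code from the referenced text.

```c
#include <assert.h>

/* In-place exclusive scan using an up-sweep and a down-sweep. */
void scan_work_efficient(float *a, unsigned n) {
    assert(n > 0 && (n & (n - 1)) == 0);   /* sketch assumes a power of two */

    /* Up-sweep: build partial sums up the tree (n-1 adds in total). */
    for (unsigned stride = 1; stride < n; stride <<= 1)
        for (unsigned i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];

    /* Down-sweep: clear the root, then push prefixes back down. */
    a[n - 1] = 0.0f;
    for (unsigned stride = n >> 1; stride > 0; stride >>= 1)
        for (unsigned i = 2 * stride - 1; i < n; i += 2 * stride) {
            float left = a[i - stride];
            a[i - stride] = a[i];          /* left child gets parent's prefix */
            a[i] += left;                  /* right child adds left subtree sum */
        }
}
```

For the input 3, 1, 7, 0, 4, 1, 6, 3 the array ends up as 0, 3, 4, 11, 11, 15, 16, 22, using on the order of n adds rather than n log2(n).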