Parallel Reduction - Lecture Slides | CSE 40833, Study notes of Computer Science

University of Notre Dame Computer Science

Material Type: Notes; Class: Introduction to Parallel Algorithms and Programming; Subject: Computer Science and Engr.; University: Notre Dame; Term: Fall 2008;

Typology: Study notes

Pre 2010

Uploaded on 02/24/2010

koofers-user-pft 🇺🇸

Extreme Computing: Parallel Prefix Reductions

8/12/08

Parallel Reduction

Common Problem: “Reduction”

a.k.a. “Prefix”

- Common code:

for (i=0; i<n; i++) {sum=sum+x[i];}

- Generalization:

for (i=0; i<n; i++) {sum=foo(sum,x[i]);}

where foo has certain associative-like properties: And, Or, *, max, min, ….
- Even more general problem:

for (i=1; i<=n; i++) {out[i]=out[i-1]+x[i];}

for (i=1; i<=n; i++) {out[i]=foo(out[i-1],x[i]);}

where out[0] is initialized to some value like “0”
- Problem: how to convert this to CUDA code that runs in parallel
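The two loop shapes above can be written once for any operator. The following is a minimal sequential sketch in C, not the course's code: the names `binop`, `add_op`, `max_op`, `reduce`, and `prefix` are hypothetical, and `int` data is assumed for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operator type standing in for the slide's foo(). */
typedef int (*binop)(int, int);

int add_op(int a, int b) { return a + b; }
int max_op(int a, int b) { return a > b ? a : b; }

/* Reduction: fold x[0..n-1] into one value, starting from `identity`. */
int reduce(binop foo, int identity, const int *x, size_t n) {
    assert(foo != NULL && x != NULL);
    int acc = identity;
    for (size_t i = 0; i < n; i++) acc = foo(acc, x[i]);
    return acc;
}

/* Prefix: out[0] holds the initial value and out[i] = foo(out[i-1], x[i])
   for i = 1..n. Both arrays need n+1 slots; x[0] is unused, which matches
   the slide's 1-based indexing. */
void prefix(binop foo, int init, const int *x, int *out, size_t n) {
    assert(foo != NULL && x != NULL && out != NULL);
    out[0] = init;
    for (size_t i = 1; i <= n; i++) out[i] = foo(out[i - 1], x[i]);
}
```

The reduction keeps only the final accumulator, while the prefix form stores every intermediate value; the last prefix entry equals the reduction of the same inputs.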

Prefix Sum

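for (i=0; i<n; i++) {sum=sum+x[i];}
- Notes: Addition is both
  - Associative: (a + (b + c)) = ((a + b) + c)
  - and Commutative: a + b = b + a
  - Thus we are free to add up #s in any order
- Unsatisfactory CUDA Solutions
  - sum+=x[threadIdx.x]; doesn’t work (why? every thread does an unsynchronized read-modify-write of sum, so updates are lost)
  - atomicAdd(&sum, x[threadIdx.x]); does, but
    - exceedingly slow: the updates are serialized, so no parallelism
    - Doesn’t work for floats (float atomicAdd needs compute capability 2.0 or later, newer than the hardware of these slides)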

Log Sum Reduction = “Parallel Prefix”

Step 1: P processors consume 2P operands & create P results

Step 2: P/2 processors consume P operands & create P/2 results

…

Step K: 2 processors consume 4 operands & create 2 results

Step K+1: 1 processor consumes 2 operands & creates 1 result

Assume $P = 2^K$

For N=2P operands, it takes $\log_2(N) = \log_2(P) + 1 = K + 1$ steps
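The step count can be checked with a small host-side simulation. This is a sketch under the assumption that the operand count is a power of 2; `tree_sum_steps` is a hypothetical name, and the inner loop stands in for the processors that would run at the same time within one step.

```c
#include <assert.h>

/* Sums v[0..n-1] in place by repeated halving and returns the number of
   steps taken; the total ends up in v[0]. Assumes n is a power of 2.
   In each step the lower half of the active range adds in the upper half,
   so no slot is read and written within the same step. */
unsigned tree_sum_steps(float *v, unsigned n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    unsigned steps = 0;
    for (unsigned active = n / 2; active > 0; active /= 2) {
        for (unsigned p = 0; p < active; p++)   /* one iteration per processor */
            v[p] += v[p + active];              /* 2 operands in, 1 result out */
        steps++;
    }
    return steps;
}
```

For 16 operands the first step uses 8 processors and the loop runs 4 times, which is $\log_2(16)$.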

Sample Code – Small N

(Untested)

__global__ void SumKernel(float* values, unsigned N)
/* assume N is a power of 2, initially no more than twice blockDim */
{
  unsigned i = threadIdx.x;
  unsigned Stride = N>>1;
  while (Stride > 0)
  {
    /* only the lower half adds, so no thread reads a slot that another is writing */
    if (i < Stride)
      values[i] += values[i + Stride];
    __syncthreads();
    Stride = Stride>>1; /* halve stride */
  }
}

Note that data is kept grouped in consecutive locations

Time O(log2(N))

Larger N – Approach 1

(values must be in global)

- Move 2*blockDim sized chunks of values to temp0 (in shared)
- Initiate a grid of calls to prior kernel to reduce to N/(2*blockDim) points
  - Grid of size N/(2*blockDim)
  - Have the sum moved to temp1[blockIdx.x]
- Repeat above
  - With N = N/(2*blockDim)
  - Use temp1 as source, temp0 as destination
  - Stop when only one result
- Problem: shared is limited in size!!
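The pass structure above can be sketched on the host as follows. This is an illustration, not the slide's CUDA code: `multipass_sum` and `reduce_chunk` are hypothetical names, `chunk` plays the role of 2*blockDim, and plain heap buffers stand in for temp0 and temp1, so the shared-memory size limit is not modeled.

```c
#include <assert.h>
#include <stdlib.h>

/* Stands in for one thread block running the small-N kernel on one chunk. */
static float reduce_chunk(const float *src, unsigned base, unsigned count) {
    float s = 0.0f;
    for (unsigned i = 0; i < count; i++) s += src[base + i];
    return s;
}

/* Multi-pass sum. Each pass writes one partial sum per chunk into the
   destination buffer, then source and destination swap roles. The number
   of passes is stored in *passes when that pointer is not NULL. */
float multipass_sum(const float *values, unsigned n, unsigned chunk,
                    unsigned *passes) {
    assert(values != NULL && n > 0 && chunk >= 2);
    float *temp0 = malloc(n * sizeof *temp0);
    float *temp1 = malloc(n * sizeof *temp1);
    assert(temp0 != NULL && temp1 != NULL);
    for (unsigned i = 0; i < n; i++) temp0[i] = values[i];

    float *src = temp0, *dst = temp1;
    unsigned count = 0;
    while (n > 1) {
        unsigned blocks = (n + chunk - 1) / chunk;  /* grid size for this pass */
        for (unsigned b = 0; b < blocks; b++) {     /* one iteration per block */
            unsigned base = b * chunk;
            unsigned len = (n - base < chunk) ? n - base : chunk;
            dst[b] = reduce_chunk(src, base, len);
        }
        n = blocks;                                 /* N = N/(2*blockDim) */
        float *t = src; src = dst; dst = t;         /* swap source and destination */
        count++;
    }
    float total = src[0];
    free(temp0);
    free(temp1);
    if (passes != NULL) *passes = count;
    return total;
}
```

With 1000 values and a chunk of 16, the passes shrink the problem from 1000 to 63 to 4 to 1, so three passes are needed.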

Timing It Out

- Assume we have N operands to combine, N>=P
- Assume $P = 2^k$, $N = P^M$, $M \gg 1$
  - Last step combines P operands in k steps
  - 2nd to last has P sets of P = $P^2$ operands: $kP$ steps
  - 3rd to last has $P^3$ operands: $kP^2$ steps
  - …
  - First pass has $N = P^M$ operands: $kP^{M-1}$ steps
- $T(P) = k + kP + kP^2 + \dots + kP^{M-1} = k(P^M - 1)/(P - 1) = \log(P)(N-1)/(P-1) \approx \log(P)\,N/P$
- If T(1)=N, then Speedup(P) = N/(log(P)N/P) = P/log(P)
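The arithmetic in this model is easy to check numerically. The sketch below assumes the reading $P = 2^k$ and $N = P^M$ and takes T(1) = N as the slide does; the function names are hypothetical, and only small k and M are supported because the values are held in `unsigned long`.

```c
#include <assert.h>

/* Step count summed term by term: k + kP + kP^2 + ... + kP^(M-1). */
unsigned long steps_by_sum(unsigned k, unsigned M) {
    assert(k >= 1 && M >= 1);
    unsigned long P = 1UL << k, term = k, total = 0;
    for (unsigned j = 0; j < M; j++) { total += term; term *= P; }
    return total;
}

/* Closed form of the same geometric series: k (P^M - 1) / (P - 1). */
unsigned long steps_closed_form(unsigned k, unsigned M) {
    assert(k >= 1 && M >= 1);
    unsigned long P = 1UL << k, N = 1;
    for (unsigned j = 0; j < M; j++) N *= P;
    return k * (N - 1) / (P - 1);
}

/* Speedup under the slide's assumption T(1) = N. */
double model_speedup(unsigned k, unsigned M) {
    unsigned long P = 1UL << k, N = 1;
    for (unsigned j = 0; j < M; j++) N *= P;
    return (double)N / (double)steps_by_sum(k, M);
}
```

For k = 5 and M = 3 (P = 32, N = 32768) both forms give 5285 steps, a speedup of about 6.2 against the P/log(P) estimate of 32/5 = 6.4.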

Recurrences

A Scan

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008, p.4)

- Standard sequential code: exactly length adds
- Simple-minded parallel code: n processors (1 per output)
  - Each of n threads does log2(n) adds, for a total of n log2(n) adds
  - Problem in CUDA: for large n, need to sync the updates
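The two approaches can be compared by counting additions in a host-side sketch. This is not the code from the referenced document: `scan_sequential` and `scan_naive` are hypothetical names, an inclusive scan is assumed, and the naive version uses a second buffer so that each step reads only values from the previous step.

```c
#include <assert.h>
#include <string.h>

/* Sequential inclusive scan: out[i] = x[0] + ... + x[i]. Returns the number
   of additions, which is exactly n. */
unsigned scan_sequential(const int *x, int *out, unsigned n) {
    assert(x != NULL && out != NULL);
    unsigned adds = 0;
    int sum = 0;
    for (unsigned i = 0; i < n; i++) { sum += x[i]; out[i] = sum; adds++; }
    return adds;
}

/* Naive step-parallel scan: in each step, every output i >= offset adds in
   the element `offset` places to its left, and offset doubles each step.
   `a` holds the input and then the result; `b` is scratch of equal length. */
unsigned scan_naive(int *a, int *b, unsigned n) {
    assert(a != NULL && b != NULL);
    unsigned adds = 0;
    for (unsigned offset = 1; offset < n; offset *= 2) {
        for (unsigned i = 0; i < n; i++) {       /* one iteration per thread */
            if (i >= offset) { b[i] = a[i] + a[i - offset]; adds++; }
            else             { b[i] = a[i]; }
        }
        memcpy(a, b, n * sizeof *a);             /* this step feeds the next */
    }
    return adds;
}
```

For 8 inputs the sequential version performs 8 additions and the naive version 17, which is below the n log2(n) = 24 bound because threads with i < offset sit out a step.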

Double Buffered Version

A More Efficient Algorithm

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 7)

- Goal: more efficient algorithm
- Use balanced trees
  - Binary tree with n leaves,
  - log2(n) levels,
  - $2^d$ nodes at level d
  - and n-1 total internal nodes
- If one add per interior node, n-1 adds
- Two phases
  - up-sweep: go up thru tree computing partial sums
  - down-sweep: go down from root, building scan in place

Up-Sweep: Just Like Reduction Sum

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 8)

[Figure: up-sweep tree at levels d=2, d=1, d=0, ending with the final contents of memory. Note error in text.]

CUDA Code: Part 1: Up-Sweep

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 10)

CUDA Code: Part 2: Down-Sweep

(see Parallel Prefix Sum (Scan) with CUDA, Jan. 2008 p. 10)
