Parallel Program Design: Task Channel Model and Broadcast Algorithms - Prof. Sanjay V. Raj, Study notes of Computer Science

Colorado State University (CSU), Computer Science

Prof. Sanjay V. Rajopadhye

A chapter from 'Parallel Program Design' by Sanjay Rajopadhye, focusing on the task-channel model for parallel computing and on efficient broadcast algorithms. The chapter covers task partitioning, communication, and the analysis of broadcast algorithms in a parallel computing context. It also includes case studies on adding numbers on a grid, the n-body problem, and matrix multiplication.

Typology: Study notes

Pre 2010

Uploaded on 03/18/2009

koofers-user-6yf 🇺🇸


Parallel Program Design

(Chapter 3)

Sanjay Rajopadhye

Colorado State University

Fall 2008 Week 2

Outline

Exercise (conclusion)

Parallel program design

Task channel model

Partitioning

Communication

Agglomeration

Mapping

Case Studies:

Adding numbers on a grid + Broadcast

n-body

matrix multiplication

Exercise (conclusion)

Precise model

- “Octopus” accountant: simultaneously send/receive from any neighbor (unit cost)
- Each communication “event” may have an arbitrarily large stack of cards (cost independent of volume).

Alternate model:

- Cost is directly proportional to volume

Data Distribution

Efficiently broadcast the numbers to processors

- Row wise or column wise:
  - Each processor along the edge gets a stack
  - Keeps whatever is needed for its row/column (40 or 25)
  - Forwards the rest to the processor to south (or east)
  - Initiates an independent broadcast along its row/column
- Analysis
  - Total number of steps: 65


Computation/Accumulation

Efficiently compute and collect answers.

- Each processor computes the sum of 2 numbers
- Communicate backward to the corner processor
- Intersperse with one addition operation

Analysis

- Number of steps: 2 + 2*65
- Total execution time: 197
- Can we do better?

Tradeoff

Use fewer processors (only 500 in a 25x20 grid)

- Local data = 4
- Communication distance = 45
- Total time = 139
- Optimal?
  - 10x20 grid: 95
  - 10x10 grid: 70
  - 5x10 grid: 65
  - 5x5 grid: 70

New twist: what if the problem size is different?

Tradeoff (analytical)

Assumptions: P processors, arranged in a square grid (to minimize perimeter); N numbers to add; no overlap of communication and computation.

- Total execution time: $T(P) = N/P + 3\sqrt{P}$
- One term increases with P (as a square root, a polynomial of degree 0.5) and the other decreases (hyperbolically).
- Note that we are not interested in asymptotic behavior here, but rather in a tradeoff: what value of P minimizes the time?
- Set the derivative to zero and solve: $dT/dP = -N/P^2 + 3/(2\sqrt{P}) = 0$, so $P^{3/2} = 2N/3$ and $\hat{P} = \sqrt[3]{4N^2/9}$

Task-Channel Model

- Task: program with local memory plus “I/O ports”

[Figure: a task shown as a short program listing; the listing text is not recoverable]
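Returning to the analytical tradeoff, with total time N/P + 3√P: the sketch below is a rough numerical check, not part of the original slides. It evaluates the model, computes the stationary point P = (2N/3)^(2/3), and brute-forces the best square grid. The function names and the problem size N = 1000 are illustrative assumptions.

```python
import math


def total_time(n, p):
    """Cost model from the slide: N/P for computation plus 3*sqrt(P) for communication."""
    return n / p + 3 * math.sqrt(p)


def optimal_p(n):
    """Stationary point of total_time in P: (2N/3)^(2/3), i.e. the cube root of 4N^2/9."""
    return (2 * n / 3) ** (2 / 3)


def best_square_side(n, max_side=100):
    """Hypothetical helper: side s of the s x s grid that minimizes the model."""
    return min(range(1, max_side + 1), key=lambda s: total_time(n, s * s))


if __name__ == "__main__":
    n = 1000  # illustrative problem size, an assumption
    p_hat = optimal_p(n)
    print(f"continuous optimum: P = {p_hat:.1f}, T = {total_time(n, p_hat):.2f}")
    side = best_square_side(n)
    print(f"best square grid: {side} x {side}, T = {total_time(n, side * side):.2f}")
```

The continuous optimum is rarely a perfect square, which is why the brute-force search over grid sides is included alongside the closed form.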

Communication

Insert channels for tasks that need to communicate

- Local communication: each task needs to communicate with a few tasks
- Global communication: many tasks involved in a single “communication event”
  - Clutters the graph (avoid, but keep track)
  - Drawback of the methodology: “collective communications” are a powerful design paradigm (e.g., matrix multiplication later).

Communication issues

- Communication should be balanced
- Favor local communication
- Parallelism in communication: tasks should perform communications independently
- Parallelism in the computation: tasks should be able to perform computations concurrently

Agglomeration

Now group tasks into “clusters” (throttle parallelism)

- Improve locality
- Achieve load balance
- Simplify programming effort
- Achieve scalability

Agglomeration issues

- Agglomeration should increase locality(?)
- Replicated tasks
  - take less time than avoided communication
  - allow scaling
- Clusters have similar size (computation + communication)
- Number of clusters is
  - an increasing function of problem size
  - as small as possible (but no less than the number of processors)
- Modifications to existing code are reasonable (ideally parameterizable)
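One way to see how agglomeration can improve locality is to count the channels that still cross cluster boundaries. The sketch below is an illustration only, under assumptions not taken from the slides: an n x n grid of primitive tasks, each with a channel to its four nearest neighbors, grouped into b x b clusters. The name `cut_channels` is hypothetical.

```python
def cut_channels(n, b):
    """Count neighbor channels in an n x n task grid that cross b x b cluster boundaries."""

    def cluster(r, c):
        # Cluster coordinates of the task at row r, column c.
        return (r // b, c // b)

    cut = 0
    for r in range(n):
        for c in range(n):
            # Channel to the east neighbor.
            if c + 1 < n and cluster(r, c) != cluster(r, c + 1):
                cut += 1
            # Channel to the south neighbor.
            if r + 1 < n and cluster(r, c) != cluster(r + 1, c):
                cut += 1
    return cut


if __name__ == "__main__":
    n = 8  # illustrative grid size, an assumption
    for b in (1, 2, 4, 8):
        print(f"cluster side {b}: {cut_channels(n, b)} channels cross cluster boundaries")
```

Under these assumptions, larger clusters turn more channels into local memory accesses, at the price of less parallelism.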

Mapping

Assign tasks to processors, and orchestrate a schedule. Main goal: processor utilization/load balance

- Static
  - Structured communication
    - constant computation per task: minimize communication, one task per processor
    - variable computation per task: cyclic mapping for load balance
  - Unstructured: static load balancing algorithm
- Dynamic
  - Frequent communication: dynamic load balancing algorithm
  - Many short-lived tasks (no inter-task communications): run-time schedule

Case studies: broadcast

- One processor out of P has a single value, and we want every processor to have a copy of the same value. Devise efficient algorithms to achieve this.
- Simplified use of Foster’s methodology: no agglomeration (#tasks = #processors) and no mapping.
- Special conditions arise when we take processor architecture into account
  - “Octopus accountant” grid
  - Ideal (fully connected) machine where any pair of processors can communicate (but a processor cannot simultaneously communicate with multiple other processors)

Broadcast (grid lower bound)

- Time for a processor to receive the value must be at least equal to its distance (number of hops in the grid graph) from the corner (source)
- This is on the order of the perimeter of the grid, $P^{0.5}$.
- The “data distribution” phase of the “add-numbers” exercise can be adapted to achieve this.
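The hop-distance bound can be checked with a small simulation. This is a minimal sketch, assuming the “octopus” model, where a processor forwards the value to all of its grid neighbors in one unit-cost step, and a source in corner (0, 0). The function name is hypothetical.

```python
def grid_broadcast_steps(rows, cols):
    """Flood a value from corner (0, 0) over a rows x cols grid; return the number of steps."""
    has_value = {(0, 0)}
    steps = 0
    while len(has_value) < rows * cols:
        newly_reached = set()
        for r, c in has_value:
            # Each holder sends to all four neighbors in the same step.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in has_value:
                    newly_reached.add((nr, nc))
        has_value |= newly_reached
        steps += 1
    return steps


if __name__ == "__main__":
    # The farthest processor is the opposite corner, (rows - 1) + (cols - 1) hops away.
    print(grid_broadcast_steps(25, 40))
```

The simulated step count equals the hop distance to the opposite corner, so in this model flooding meets the lower bound exactly.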

Broadcast (ideal lower bound)

- At any time, let’s say that k processors have a copy.
- Then, by our rules, no more than 2k processors can have a copy (only pairwise communications are allowed) at the next step.
- Initially, only one processor has a copy.
- This successive doubling implies that at least lg P steps are needed.
- If the model is changed so that a processor may communicate simultaneously with b others (for some constant b), there is “only a constant factor” improvement.
- Not significant in asymptotic analysis -- crucial in real life.
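The doubling argument also suggests a schedule that meets the bound. The sketch below is one possible schedule, assuming a fully connected machine with processors numbered 0 to P-1 and processor 0 as the source. The numbering and the function name are assumptions made for illustration.

```python
def doubling_broadcast(p):
    """Return one list of (sender, receiver) pairs per step for a broadcast from processor 0."""
    schedule = []
    have = 1  # processors 0 .. have-1 currently hold the value
    while have < p:
        # Every holder sends to exactly one processor that does not yet hold the value.
        pairs = [(src, src + have) for src in range(have) if src + have < p]
        schedule.append(pairs)
        have = min(2 * have, p)
    return schedule


if __name__ == "__main__":
    for step, pairs in enumerate(doubling_broadcast(8), start=1):
        print(f"step {step}: {pairs}")
```

Each processor takes part in at most one communication per step, so the schedule respects the pairwise rule and finishes in ceil(lg P) steps.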

n-Body Problem

- Partitioning: one task per body
- Communication: complete graph
  - main “collective communication event” is that everyone must send its single, unique data item to everyone else, i.e., a simultaneous broadcast
  - also called “all-gather”
- Brute force: n steps, each processor receives data from every other
  - coordination is a detail that needs to be resolved
- Better way: divide and conquer
  - If each half of the nodes has already achieved an all-gather amongst themselves
  - One more step to complete all-gather for all n nodes.

All-gather Analysis

- Naïve strategy: n λ
- Divide-and-conquer:
  - lg n steps
  - T = λ lg n
  - Message size grows at each step
- Realistic communication model (affine cost function)
  - Time to transmit a message of volume v: $t = \lambda + v/\beta$
- All-gather under affine communication model (geometric series; the message volumes 1, 2, 4, ..., n/2 sum to n − 1, approximately n): $T_{\text{gather}} = \lambda \lg n + n/\beta$

n-Body (concl.)

- Many more particles than processors
  - One task per processor and agglomerate n/p particles into each cluster.
  - Modify the all-gather
- Alternate methodology:
  - Tasks for each step
  - Agglomeration first projects along the time iteration
- Analysis: $T_{\text{iter}} = \lambda \lg p + \frac{n(p-1)}{\beta p} + \chi \frac{n}{p}$

Input/Output

- Task-channel model does not deal explicitly with I/O
- Solution: add auxiliary tasks: time (also an affine function, but with different latency and bandwidth) = $\lambda' + n/\beta'$
- Distributing the inputs (scatter)
  - Adding accountants example
  - Divide & conquer scatter (in affine model): $T_{\text{scatter}} = \lambda \lg p + \frac{n(p-1)}{\beta p}$
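The closed form for the divide-and-conquer scatter can be checked by summing the affine cost step by step. This is a minimal sketch, assuming p is a power of two, that all transfers within a step run concurrently, and that the volume per message halves each step (n/2, n/4, ..., n/p). The function names and the sample values are illustrative.

```python
import math


def transfer_time(volume, latency, bandwidth):
    """Affine cost of one message: latency + volume / bandwidth."""
    return latency + volume / bandwidth


def scatter_time(n, p, latency, bandwidth):
    """Sum the per-step cost of a divide-and-conquer scatter of n items over p processors."""
    if p < 1 or p & (p - 1) != 0:
        raise ValueError("this sketch assumes p is a power of two")
    total = 0.0
    volume = n / 2  # the source first hands half of the data to a partner
    holders = 1
    while holders < p:
        total += transfer_time(volume, latency, bandwidth)
        volume /= 2
        holders *= 2
    return total


def scatter_closed_form(n, p, latency, bandwidth):
    """Closed form: latency * lg p + n (p - 1) / (bandwidth * p)."""
    return latency * math.log2(p) + n * (p - 1) / (bandwidth * p)


if __name__ == "__main__":
    # Illustrative values only.
    print(scatter_time(1024, 8, 2.0, 4.0), scatter_closed_form(1024, 8, 2.0, 4.0))
```

An all-gather with n/p items per processor moves the same volumes in the opposite order (n/p, 2n/p, ..., n/2), so under the same assumptions its communication time has the same closed form.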

PRAM Model

- (Yet) Another model for parallel programming
- Ideal SIMD machine
- Completely ignore communication
- Assume a shared memory, globally accessible by all processors in constant time
- What if processors access the same location?

PRAM Model (variants)

- Conflict resolution (for both reads and writes):
  - Disallow (weaker machine) but probably more realistic
  - Allow but provide a resolution mechanism
- EREW: Exclusive Read, Exclusive Write (no conflicts allowed)
- CREW: Concurrent Read, Exclusive Write
- ERCW (rarely considered)
- CRCW: Write conflicts resolved by
  - arbitrary: any of the writers succeeds (non-deterministically)
  - priority: the writer with the highest priority succeeds
  - common: the algorithm must ensure that all writers write the same value
  - combining (most powerful): the values are combined using an associative operator (e.g., sum, max, etc.)

PRAM Matrix Multiplication
