Parallel Program Design: Task Channel Model and Broadcast Algorithms - Prof. Sanjay V. Raj, Study notes of Computer Science

A chapter from the book 'Parallel Program Design' by Sanjay Rajopadhye, focusing on the task channel model for parallel computing and efficient broadcast algorithms. The chapter covers topics such as task partitioning, communication, and analysis of broadcast algorithms in a parallel computing context. It also includes case studies on adding numbers on a grid, the n-body problem, and matrix multiplication.

Typology: Study notes

Pre 2010

Uploaded on 03/18/2009


Parallel Program Design

(Chapter 3)

Sanjay Rajopadhye

Colorado State University

Fall 2008 Week 2

Outline
• Exercise (conclusion)
• Parallel program design
  • Task channel model
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping
• Case Studies:
  • Adding numbers on a grid + Broadcast
  • n-body
  • matrix multiplication

Exercise (conclusion)
Precise model
• “Octopus” accountant: simultaneously send/receive from any neighbor (unit cost)
• Each communication “event” may have an arbitrarily large stack of cards (cost independent of volume).
Alternate model:
• Cost is directly proportional to volume

Data Distribution
Efficiently broadcast the numbers to processors
• Row wise or column wise:
  • Each processor along the edge gets a stack
  • Keeps whatever is needed for its row/column (40 or 25)
  • Forwards the rest to the processor to south (or east)
  • Initiates an independent broadcast along its row/column
• Analysis
  • Total number of steps: 65

Computation/Accumulation
Efficiently compute and collect answers.
• Each processor computes the sum of 2 numbers
• Communicate backward to the corner processor
• Intersperse with one addition operation

Analysis
• Number of steps: 2 + 2*
• Total execution time: 197
• Can we do better?
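As a rough check on the arithmetic, the sketch below encodes one possible reading of the step counts for a 25x40 grid holding 2 numbers per processor: distribution takes rows + cols steps, each processor then spends one step per local number, and collection back to the corner costs two steps per hop (one transfer plus one addition). The function name `total_time` and this accounting are assumptions, chosen because they reproduce the 197-step total; they are not stated in the slides.

```python
def total_time(rows, cols, n_numbers):
    """Hypothetical accounting: distribute, add locally, then collect."""
    distance = rows + cols                   # assumed distribution steps
    local = n_numbers // (rows * cols)       # numbers held per processor
    collect = 2 * distance                   # assumed: one send + one add per hop
    return distance + local + collect

print(total_time(25, 40, 2000))  # 65 + 2 + 130 = 197
```

Under this reading the communication distance dominates, which is what motivates asking whether fewer processors could do better.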

Tradeoff
Use fewer processors (only 500 in a 25x20 grid)
• Local data = 4
• Communication distance = 45
• Total time = 139
• Optimal?
  • 10x20 grid: 95
  • 10x10 grid: 70
  • 5x10 grid: 65
  • 5x5 grid: 70

New twist: what if the problem size is different?
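The analytical model on the next slide gives T(P) = N/P + 3√P. As a small numerical companion, the sketch below evaluates that expression for a few problem sizes, together with the stationary point obtained by setting dT/dP = 0, which works out to P = (2N/3)^(2/3). The helper names are illustrative, and P is treated as a continuous quantity, so the result is only a guide when choosing an actual grid.

```python
import math

def exec_time(n, p):
    """T(P) = N/P + 3*sqrt(P), the model stated in the notes."""
    return n / p + 3 * math.sqrt(p)

def optimal_p(n):
    """Continuous minimizer: dT/dP = -N/P**2 + 3/(2*sqrt(P)) = 0."""
    return (2 * n / 3) ** (2 / 3)

for n in (1000, 2000, 10000):
    p = optimal_p(n)
    print(f"N={n}: P is about {p:.1f}, T is about {exec_time(n, p):.1f}")
```

Because the minimizer grows only as N^(2/3), a larger problem justifies more processors, but less than proportionally.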

Tradeoff (analytical)
Assumptions:
• P processors, arranged in a square grid (to minimize perimeter)
• N numbers to add, no overlap of communication & computation
• Total execution time: $T(P) = N/P + 3\sqrt{P}$
• One term increases with P (as a square root, i.e., a polynomial of degree 0.5) and the other decreases (hyperbolically).
• Note that we are not interested in asymptotic behavior here, but rather in a tradeoff: what value of P minimizes the time?
• Set the derivative to zero and solve: $-N/P^2 + \frac{3}{2\sqrt{P}} = 0$, so $\hat{P} = (2N/3)^{2/3} = \sqrt[3]{4N^2/9}$

Task-Channel Model
• Task: program with local memory plus “I/O ports”

Communication
Insert channels for tasks that need to communicate
• Local communication: each task needs to communicate with a few tasks
• Global communication: many tasks involved in a single “communication event”
  • Clutters the graph (avoid, but keep track)
  • Drawback of the methodology: “collective communications” are a powerful design paradigm (e.g., matrix multiplication later).

Communication issues
• Communication should be balanced
• Favor local communication
• Parallelism in communication: tasks should perform communications independently
• Parallelism in the computation: tasks should be able to perform computations concurrently

Agglomeration

Now group tasks into “clusters” (throttle parallelism)
• Improve locality
• Achieve load balance
• Simplify programming effort
• Achieve scalability

Agglomeration issues
• Agglomeration should increase locality(?)
• Replicated tasks
  • take less time than avoided communication
  • allow scaling
• Clusters have similar size (computation + communication)
• Number of clusters is
  • increasing function of problem size
  • as small as possible (but no less than number of processors)
• Modifications to existing code are reasonable (ideally parameterizable)
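To illustrate the "clusters have similar size" guideline, here is a minimal sketch that groups n primitive tasks into k contiguous clusters whose sizes differ by at most one. The function `block_clusters` is a hypothetical helper, and contiguous blocks are only one possible agglomeration; it also assumes every task costs about the same.

```python
def block_clusters(n_tasks, n_clusters):
    """Split task ids 0..n_tasks-1 into contiguous, near-equal clusters."""
    clusters = []
    for c in range(n_clusters):
        start = c * n_tasks // n_clusters        # first task id in cluster c
        end = (c + 1) * n_tasks // n_clusters    # one past the last task id
        clusters.append(list(range(start, end)))
    return clusters

sizes = [len(c) for c in block_clusters(10, 4)]
print(sizes)  # [2, 3, 2, 3]
```

Since the cluster count is a parameter, the same code covers the case where the number of clusters grows with the problem size.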

Mapping
Assign tasks to processors, and orchestrate a schedule. Main goal: processor utilization/load balance
• Static
  • Structured communication
    • constant computation per task: minimize communication, one task per processor
    • variable computation per task: cyclic mapping for load balance
  • Unstructured: static load balancing algorithm
• Dynamic
  • Frequent communication: dynamic load balancing algorithm
  • Many short-lived tasks (no inter-task communications): run-time schedule

Case studies: broadcast

• One processor out of P has a single value, and we want every processor to have a copy of the same value. Devise efficient algorithms to achieve this.
• Simplified use of Foster’s methodology: no agglomeration (#tasks = #processors) and no mapping.
• Special conditions arise when we take processor architecture into account
  • “Octopus accountant” grid
  • Ideal (fully connected) machine where any pair of processors can communicate (but a processor cannot simultaneously communicate with multiple other processors)

Broadcast (grid lower bound)

• Time for a processor to receive the value must be at least equal to its distance (number of hops in the grid graph) from the corner (source)
• This is on the order of the perimeter of the grid, $P^{0.5}$.
• The “data distribution” phase of the “add-numbers” exercise can be adapted to achieve this.
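The hop-count bound can be checked with a small simulation. The sketch below floods a value outward from the corner of a rows x cols grid under the "octopus" assumption (a processor can send to all of its neighbors in one step) and counts steps until every processor holds the value. The function name is illustrative, and the count is in pure hops, so it may differ by a small constant from the step counts used in the add-numbers exercise.

```python
def grid_broadcast_steps(rows, cols):
    """Flood a value from corner (0, 0); return steps until all hold it."""
    have = {(0, 0)}
    steps = 0
    while len(have) < rows * cols:
        new = set()
        for (r, c) in have:
            # "Octopus" assumption: send to every neighbor in the same step.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    new.add((nr, nc))
        have |= new
        steps += 1
    return steps

print(grid_broadcast_steps(4, 4))  # 6 hops to reach the far corner
```

The result equals the distance to the farthest corner, (rows - 1) + (cols - 1), so the flood meets the lower bound in this model.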

Broadcast (ideal lower bound)
• At any time, let’s say that k processors have a copy.
• Then, by our rules, no more than 2k processors can have a copy at the next step (only pairwise communications are allowed).
• Initially, only one processor has a copy.
• This successive doubling implies that at least lg P steps are needed.
• If the model is changed so that a processor may communicate simultaneously with b others (for some constant b), there is “only a constant factor” improvement.
• Not significant in asymptotic analysis, but crucial in real life.
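The doubling argument also suggests a schedule that meets the bound. As a minimal sketch (the processor numbering and function name are assumptions for illustration), at a step where processors 0..k-1 hold the value, processor i sends to processor i + k, so the number of holders doubles until all P have a copy.

```python
def doubling_broadcast(p):
    """Return one list of (sender, receiver) pairs per step."""
    k = 1                      # processors 0..k-1 currently hold the value
    schedule = []
    while k < p:
        # Each holder talks to exactly one partner (pairwise rule).
        schedule.append([(i, i + k) for i in range(k) if i + k < p])
        k *= 2
    return schedule

for step, pairs in enumerate(doubling_broadcast(8), start=1):
    print(step, pairs)
```

For P = 8 this prints three steps, matching lg 8 = 3; when P is not a power of two the last step is only partly filled.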

n-Body Problem
• Partitioning: one task per body
• Communication: complete graph
  • the main “collective communication event” is that every task must send its single, unique datum to everyone else, i.e., a simultaneous broadcast
  • also called “all-gather”
• Brute force: n steps, each processor receives data from every other
  • coordination is a detail that needs to be resolved
• Better way: divide and conquer
  • If half the nodes have already achieved an all-gather amongst themselves
  • One more step to complete the all-gather for all n nodes.

All-gather Analysis
• Naïve strategy: $n\lambda$
• Divide-and-conquer:
  • lg n steps
  • $T = \lambda \lg n$
• Message size grows at each step
• Realistic communication model (affine cost function)
  • Time to transmit a message of volume v: $t = \lambda + v/\beta$
• All-gather under affine communication model (geometric series): $T_{gather} = \lambda \lg n + n/\beta$ (the message volumes $1, 2, 4, \ldots, n/2$ sum to $n - 1 \approx n$)

n-Body (concl.)

• Many more particles than processors
  • One task per processor and agglomerate n/p particles into each cluster.
  • Modify the all-gather
• Alternate methodology:
  • Tasks for each step
  • Agglomeration first projects along the time iteration
• Analysis: $T_{iter} = \lambda \lg p + \frac{n(p-1)}{\beta p} + \chi \frac{n}{p}$

Input/Output

• Task-channel model does not deal explicitly with I/O
• Solution: add auxiliary tasks: time (also an affine function, but with different latency and bandwidth) = $\lambda' + n/\beta'$
• Distributing the inputs (scatter)
  • Adding accountants example
  • Divide & conquer scatter (in affine model): $T_{iter} = \lambda \lg p + \frac{n(p-1)}{\beta p}$
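The closed form for the divide-and-conquer scatter can be checked by adding up message costs step by step. The sketch below assumes p is a power of two and that n divides evenly; at each step every task that holds data passes half of its items to a new partner, and each message of volume v costs λ + v/β. The function names and the sample values of `lam` and `beta` are illustrative only.

```python
import math

def scatter_time(n, p, lam, beta):
    """Sum affine message costs for a halving scatter (p a power of two)."""
    total = 0.0
    holding = n            # items held by each task that currently has data
    holders = 1
    while holders < p:
        holding /= 2       # each holder passes on half of what it holds
        total += lam + holding / beta   # affine cost of one message
        holders *= 2       # all holders send in parallel
    return total

def closed_form(n, p, lam, beta):
    """lam * lg p + n (p - 1) / (beta p), as in the notes."""
    return lam * math.log2(p) + n * (p - 1) / (beta * p)

print(scatter_time(64, 8, 2.0, 4.0), closed_form(64, 8, 2.0, 4.0))  # 20.0 20.0
```

The same message volumes occur in reverse order in a divide-and-conquer all-gather of n items over p tasks, which is why that cost has the same form.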

PRAM Model
• (Yet) Another model for parallel programming
• Ideal SIMD machine
• Completely ignore communication
• Assume a shared memory, globally accessible by all processors in constant time
• What if processors access the same location?

PRAM Model (variants)
• Conflict resolution (for both reads and writes):
  • Disallow (weaker machine) but probably more realistic
  • Allow but provide a resolution mechanism
• EREW: Exclusive Read, Exclusive Write (no conflicts allowed)
• CREW: Concurrent Read, Exclusive Write
• ERCW (rarely considered)
• CRCW: Write conflicts resolved by
  • arbitrary: any of the writers succeeds (non-deterministically)
  • priority: the writer with the highest priority succeeds
  • common: the algorithm must ensure that all writers write the same value
  • combining (most powerful): the values are combined using an associative operator (e.g., sum, max, etc.)

PRAM Matrix Multiplication

• Work out on your sheets

Why PRAM
• Provides lower bounds (the best you can do in an ideal situation)
  • plus indication of how much you’d lose moving to a real machine
• Good starting point for task-channel model
• Evolution of the matrix multiplication example
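As one possible starting point for the matrix multiplication example, the sketch below mimics a CREW PRAM with n³ virtual processors: every product a[i][k] * b[k][j] is formed in one parallel step, and the n products for each c[i][j] are then summed by a pairwise tree reduction, which takes ⌈lg n⌉ further steps with no concurrent writes. The sequential loops only simulate the parallel steps; the function name and the step accounting are assumptions for illustration, not the worked solution from the lecture.

```python
def pram_matmul(a, b):
    """Simulate an n^3-processor PRAM product; return (c, parallel_steps)."""
    n = len(a)
    # Parallel step 1: processor (i, j, k) computes a single product.
    prod = [[[a[i][k] * b[k][j] for k in range(n)]
             for j in range(n)] for i in range(n)]
    steps = 1
    # Tree reduction over k: each round roughly halves the partial sums.
    width = n
    while width > 1:
        half = (width + 1) // 2
        for i in range(n):
            for j in range(n):
                for k in range(width // 2):
                    # Slot k is written by one processor only (exclusive write).
                    prod[i][j][k] += prod[i][j][k + half]
        width = half
        steps += 1
    c = [[prod[i][j][0] for j in range(n)] for i in range(n)]
    return c, steps

print(pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# ([[19, 22], [43, 50]], 2)
```

On a combining CRCW PRAM the reduction could instead be a single step in which all n products for c[i][j] are written at once and combined with a sum, which is one way to see what is lost when moving to a weaker machine.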