Basic Communication Operation - Lecture Slides | CS 575, Study notes of Computer Science

Material Type: Notes; Professor: Bohm; Class: Parallel Processing; Subject: Computer Science; University: Colorado State University; Term: Spring 2010;


Uploaded on 11/08/2009

CS575 Parallel Processing
Lecture four:
Basic Communication Operations
Wim Bohm, Colorado State University
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.


Assumptions

- Basic communication
  - Routing / switching techniques, both
    - Store and Forward: $t_s + t_w \cdot m \cdot l$, which is $O(m \cdot l)$
    - Cut Through: $t_s + t_w \cdot m + t_h \cdot l$, which is $O(m + l)$
  - Number of links $l$ on a path:
    - Ring: $l \le p/2$
    - Wraparound square 2D mesh: $l \le \sqrt{p}$
    - Hypercube: $l \le \log(p)$
- Bi-directional communication links
  - Two directly connected PEs can send messages of size $m$ to each other simultaneously in $t_s + t_w \cdot m$ time
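The two cost models above can be written down directly. The following Python sketch is only an illustration: the function names are hypothetical, and the symbols follow the slide (`ts` startup time, `tw` per-word transfer time, `th` per-hop time, `m` message size, `l` number of links).

```python
import math


def store_and_forward_time(ts, tw, m, l):
    """Store and forward: the whole message is transferred on each of l links."""
    return ts + tw * m * l


def cut_through_time(ts, tw, th, m, l):
    """Cut through: the message is pipelined, each hop adds only th."""
    return ts + tw * m + th * l


def max_links(topology, p):
    """Upper bound on l from the slide (p assumed a square / power of two)."""
    if topology == "ring":
        return p // 2
    if topology == "mesh":          # wraparound square 2D mesh
        return math.isqrt(p)
    if topology == "hypercube":
        return p.bit_length() - 1   # log2(p) for p a power of two
    raise ValueError(f"unknown topology: {topology}")
```

For long messages and long paths the difference is large: with `m = 100` and `l = 4`, store and forward pays for the message four times, cut through only once.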

Broadcast, Store&Forward, Ring

- Naïve: p-1 messages sent sequentially
  - WASTEFUL
  - $2 \cdot (1 + 2 + \dots + p/2)$ hops: $O(p^2)$ hops
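A small Python sketch can make the quadratic hop count concrete. It assumes node 0 is the source and that every message takes the shortest way around the ring; both function names are hypothetical. The slide's expression counts the farthest node in both directions, so it is a slight overestimate of the exact sum.

```python
def naive_ring_broadcast_hops(p):
    """Exact hop total when node 0 sends a separate message to each other node."""
    return sum(min(k, p - k) for k in range(1, p))


def slide_hop_bound(p):
    """The slide's expression 2 * (1 + 2 + .. + p/2)."""
    return 2 * sum(range(1, p // 2 + 1))
```

Doubling `p` roughly quadruples both values, which is the $O(p^2)$ behaviour.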

Avoiding redundant transmission

- Nodes store a copy of the message and forward it
- Initiating node (say node 0) sends two messages
  - one in clockwise direction to node p/2
  - one in counter-clockwise direction to node p/2+1
  - the messages travel overlapping in time
  - #steps <= p/2+1: $(t_s + t_w \cdot m) \cdot (p/2 + 1)$
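The schedule can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's own code: node 0 is the initiator, it can start only one send per step (so the counter-clockwise message leaves one step after the clockwise one), and every other node forwards the message in the step after receiving it. Function names are hypothetical.

```python
def ring_broadcast_schedule(p):
    """Map each node to the step in which it receives the message from node 0."""
    arrival = {0: 0}
    half = p // 2
    # Clockwise message: node k receives it in step k, up to node p/2.
    for k in range(1, half + 1):
        arrival[k] = k
    # Counter-clockwise message starts one step later and stops at node p/2+1.
    for k in range(1, p - half):
        arrival[p - k] = k + 1
    return arrival


def ring_broadcast_time(p, ts, tw, m):
    """Store-and-forward time: one (ts + tw*m) per step of the schedule."""
    steps = max(ring_broadcast_schedule(p).values())
    return (ts + tw * m) * steps
```

Under these assumptions the number of steps stays within the slide's bound of p/2+1.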

Broadcast, Store&Forward, Mesh

- Each row and column of the mesh is a ring!
- Two phases
  - Ring broadcast in row of initiating node
  - Ring broadcast in all columns from row of initiating node
- Time steps: $(t_s + t_w \cdot m) \cdot \sqrt{p}$
- Similar (n phase) procedure works on nD mesh
- Matrix Vector product
  - n*n matrix, n*1 vector, n*n mesh
  - DISCUSS
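The two phases can be sketched in Python as below. The sketch assumes an n x n wraparound mesh with n = sqrt(p), and it simplifies each ring broadcast to "a node at ring distance k receives the message k steps after the phase starts", ignoring the extra step a single-port initiator would need. The function names are hypothetical.

```python
def ring_distance(a, b, n):
    """Shortest distance between positions a and b on a ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def mesh_broadcast_steps(n, src_row, src_col):
    """Map (row, col) to the step in which that node receives the message."""
    # Phase one: ring broadcast along the initiator's row.
    phase_one = n // 2  # steps until the whole row has the message
    arrival = {}
    for col in range(n):
        arrival[(src_row, col)] = ring_distance(col, src_col, n)
    # Phase two: every column runs a ring broadcast starting from that row.
    for row in range(n):
        if row == src_row:
            continue
        for col in range(n):
            arrival[(row, col)] = phase_one + ring_distance(row, src_row, n)
    return arrival
```

For a 4 x 4 mesh this gives 4 steps in total, matching the sqrt(p) factor on the slide.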

Hypercube sources, destinations

- Phase i, initiator = 0…0:
  - Send along dimension related to bit i (0, 1, .. left to right)
  - Sender if bits i and on are 0
  - Receiver: flip bit i
- Phase i, general initiator $x_1 \dots x_n$
  - Need a re-labeling mechanism: My-ID XOR Sender
  - This maps all bit patterns onto all bit patterns
    - Why? Why is that relevant?
  - Nearest neighbors in the original labeling are nearest neighbors in the new labeling.
    - Why? Why is that relevant?
  - Phase i still sends along dimension related to bit i
- Discuss algorithm
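One possible reading of the algorithm, as a Python sketch. It assumes a d-dimensional cube with p = 2^d nodes, bits numbered 0 .. d-1 from left to right as on the slide, and takes "Sender" in the re-labeling to mean the initiating node. The function name and the returned schedule format are hypothetical.

```python
def hypercube_broadcast(d, source):
    """Simulate a one-to-all broadcast on a d-cube; return (schedule, informed)."""
    p = 1 << d
    informed = {source}
    schedule = []  # schedule[i] = list of (sender, receiver) pairs in phase i
    for i in range(d):
        bit_i = 1 << (d - 1 - i)            # bit i, counted from the left
        bits_i_and_on = (bit_i << 1) - 1    # mask for bits i, i+1, .., d-1
        sends = []
        for node in range(p):
            relabeled = node ^ source       # My-ID XOR initiator
            if relabeled & bits_i_and_on == 0:
                # Sender in phase i; the receiver differs only in bit i.
                sends.append((node, node ^ bit_i))
        informed.update(receiver for _, receiver in sends)
        schedule.append(sends)
    return schedule, informed
```

XOR with a fixed value is a bijection on bit patterns, so every node gets exactly one new label, and it flips the same bits in both endpoints of a link, so neighbors stay neighbors. That is why the schedule for initiator 0…0 carries over unchanged.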

Recap: Broadcast Store and Forward

- Ring: $(t_s + t_w \cdot m) \cdot (p/2 + 1)$
- Mesh: $(t_s + t_w \cdot m) \cdot \sqrt{p}$
- Cube: $(t_s + t_w \cdot m) \cdot \log(p)$
- Q: Is there a faster than $O(\log(p))$ algorithm?
  - Assumptions: only one send per time step possible, any topology allowed
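One way to approach the question is a counting argument: if every node that already holds the message can send it to at most one other node per step, the number of informed nodes can at most double per step. The Python sketch below counts the steps this best case needs; the function name is hypothetical and the sketch illustrates the argument rather than any particular network.

```python
def min_broadcast_steps(p):
    """Fewest steps to inform p nodes when each informed node sends once per step."""
    informed, steps = 1, 0
    while informed < p:
        informed *= 2   # best case: every informed node reaches a new node
        steps += 1
    return steps
```

The result is ceil(log2(p)), so under these assumptions no topology beats the hypercube's log(p) steps.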

Multi node broadcast

Also known as all to all broadcast

[Figure: data held by PEs 0 .. p-1. Single-node broadcast (BC): M0 is copied to every PE. Multi-node broadcast: every PE ends up with M0, M1, .. Mp-1.]

MultiNodeBC, SF, Ring, Mesh

- Ring: All channels can be kept busy all the time
  - Step one: node i sends its data to node (i+1) mod p
  - Step 2 .. (p-1): node i sends the data it received in the previous step to node (i+1) mod p
  - $T \sim (t_s + t_w \cdot m) \cdot p$
- Mesh: rows and cols are rings
  - Phase one: broadcast in row: message size $m$
  - Phase two: broadcast in col: message size $m \cdot \sqrt{p}$
  - $T \sim t_s \cdot \sqrt{p} + t_w \cdot m \cdot p$
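The ring scheme can be simulated in a few lines of Python. The sketch assumes all p sends of a step happen simultaneously and uses a hypothetical function name; `data[i]` stands for the message that node i starts with.

```python
def ring_all_to_all(data):
    """Simulate all-to-all broadcast on a ring; return what each node collects."""
    p = len(data)
    collected = [[item] for item in data]   # node i starts with its own data
    outgoing = list(data)                   # what node i sends in the next step
    for _ in range(p - 1):
        # Node i receives from node (i-1) mod p; all links are busy at once.
        incoming = [outgoing[(i - 1) % p] for i in range(p)]
        for i in range(p):
            collected[i].append(incoming[i])
        # In the next step every node forwards what it just received.
        outgoing = incoming
    return collected
```

After p-1 steps every node holds all p messages, which is where the factor of roughly p in the ring time comes from.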

Reduction on hyper cube

- Associative op (+, union, synchronize, ..)
- All to all
  - Pre: $PE_i$ has $X_i$
  - Post: $PE_i$ has Reduce(op, $X_0$ .. $X_{p-1}$)
  - Why associative?
- Use all to all broadcast scheme
- Here message size does not increase: $T_{red} = (t_s + t_w) \cdot \log(p)$
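A minimal Python sketch of an all-to-all reduction on the cube, assuming p is a power of two and that partners exchange along one dimension per step, starting with the lowest bit. The function name is hypothetical. The node with the smaller ID always goes on the left of `op`, so the sketch needs only associativity, not commutativity.

```python
def hypercube_allreduce(values, op):
    """Every PE ends up with op applied over all values, in PE order."""
    p = len(values)
    d = p.bit_length() - 1
    if 1 << d != p:
        raise ValueError("p must be a power of two")
    result = list(values)
    for i in range(d):
        bit = 1 << i
        # Partners differ in bit i; both combine (lower block, upper block).
        # The list comprehension reads the old values, so the exchange is
        # simultaneous for all pairs.
        result = [op(result[node & ~bit], result[node | bit])
                  for node in range(p)]
    return result
```

Each step combines two already-reduced blocks, so the values are grouped as a tree rather than left to right. Associativity is what guarantees that this grouping gives the same answer as the sequential reduction.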

Parallel Prefix on the cube

- Associative op (+, union, synchronize, ..)
- Pre: $PE_i$ has $X_i$, i = 0 .. p-1
- Post: $PE_i$ has Reduce(op, $X_0$ .. $X_i$)
- Naïve: all to all broadcast, then add $X_0$ .. $X_i$
- Better: Adapt accumulation scheme
  - Every PE has two variables
    - my-result: add incoming value from $PE_i$ if i < my-id
    - total: add incoming value
  - exchange total
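The accumulation scheme with the two variables can be sketched in Python as follows. The sketch assumes p is a power of two and that PEs exchange `total` along one dimension per step, lowest bit first; the function name is hypothetical, while `my_result` and `total` mirror the slide's variables.

```python
def hypercube_prefix(values, op):
    """PE i ends up with op applied over values[0..i]."""
    p = len(values)
    d = p.bit_length() - 1
    if 1 << d != p:
        raise ValueError("p must be a power of two")
    my_result = list(values)
    total = list(values)
    for i in range(d):
        bit = 1 << i
        # All partners exchange their current total simultaneously.
        incoming = [total[node ^ bit] for node in range(p)]
        for node in range(p):
            partner = node ^ bit
            if partner < node:
                # Value comes from lower-numbered PEs: it belongs in my prefix.
                my_result[node] = op(incoming[node], my_result[node])
                total[node] = op(incoming[node], total[node])
            else:
                # Value comes from higher-numbered PEs: only the total grows.
                total[node] = op(total[node], incoming[node])
    return my_result
```

Like the reduction, this takes log(p) steps with constant message size, instead of the naive scheme's all-to-all broadcast followed by local work.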

… and on and on …

- Circular shift
  - (SF, CT) * (ring, mesh, hyper cube)
- Broadcasting a message in parts
  - Along different paths
- All port communication
  - Simultaneous communication over all ports of a node
- Special hardware
  - E.g. for reduction (switches combine messages)