

























































































This lecture was delivered by Dr. Hanif Durad at the Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad, for the Parallel Computing course. It includes: Basic, Communication, Operations, Personalized, Speed, Scatter, Gather, Broadcast, Algorithm, Design
4.1 One-to-All Broadcast and All-to-One Reduction
4.2 All-to-All Broadcast and Reduction
4.3 All-Reduce and Prefix-Sum Operations
4.4 Scatter and Gather
4.5 All-to-All Personalized Communication
4.6 Circular Shift
4.7 Improving the Speed of Some Communication Operations
4.8 Summary
Many interactions in practical parallel programs occur in well-defined patterns involving groups of processors.
Efficient implementations of these operations can improve performance, reduce development effort and cost, and improve software quality.
Efficient implementations must leverage the underlying architecture. For this reason, we refer to specific architectures here.
We select a descriptive set of architectures to illustrate the process of algorithm design.
processors.
Use a logical binary tree for broadcasting.
Make sure to minimize congestion.
Complexity: $(t_s + t_w m)\log p$
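To make the complexity expression concrete, here is a minimal sketch of the cost model. It assumes the usual reading of the symbols (t_s is the message startup time, t_w the per-word transfer time, m the message length in words, p the number of processors). The function names, the rounding of log p up to a whole number of steps, and the comparison against p - 1 sequential sends are illustrative additions, not part of the slides.

```python
import math


def tree_broadcast_time(t_s, t_w, m, p):
    """Estimated time of a log-step broadcast: (t_s + t_w * m) * log2(p).

    Assumption: for p that is not a power of two, the number of steps
    is rounded up to the next integer.
    """
    steps = math.ceil(math.log2(p))
    return (t_s + t_w * m) * steps


def sequential_broadcast_time(t_s, t_w, m, p):
    """Baseline for comparison: the source sends p - 1 separate messages."""
    return (t_s + t_w * m) * (p - 1)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    print(tree_broadcast_time(10, 1, 100, 8))
    print(sequential_broadcast_time(10, 1, 100, 8))
```

With these hypothetical numbers the tree-based broadcast needs 3 message steps instead of 7, which is the reason a logical tree is preferred over repeated sends from the source.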
[Figure: two diagrams, each over nodes p0 to p7. Which is better? Ref: CommImpl1.ppt]
One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.
Reduction on an eight-node ring with node 0 as the destination of the reduction.
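The two captions above can be illustrated with a small simulation. This is a sketch under stated assumptions: the number of nodes is a power of two, the broadcast uses recursive doubling (the sending distance halves each step), and the reduction simply replays the broadcast steps in reverse order with the message direction inverted. The function names `ring_broadcast_schedule` and `ring_reduce` are hypothetical.

```python
import operator


def ring_broadcast_schedule(p, source=0):
    """Return the broadcast as a list of steps of (sender, receiver) pairs."""
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    have = [source]          # nodes that already hold the message
    schedule = []
    dist = p // 2
    while dist >= 1:
        # Every node that holds the message forwards it `dist` hops ahead.
        step = [(s, (s + dist) % p) for s in have]
        have += [r for _, r in step]
        schedule.append(step)
        dist //= 2
    return schedule


def ring_reduce(values, op, dest=0):
    """All-to-one reduction: the broadcast schedule run backwards."""
    partial = list(values)
    for step in reversed(ring_broadcast_schedule(len(values), dest)):
        for parent, child in step:
            # The child sends its partial result to the parent, which combines it.
            partial[parent] = op(partial[parent], partial[child])
    return partial[dest]


if __name__ == "__main__":
    for t, step in enumerate(ring_broadcast_schedule(8), start=1):
        print("step", t, sorted(step))
    print(ring_reduce(list(range(8)), operator.add))
```

For eight nodes the schedule has three steps: 0 to 4, then 0 to 2 and 4 to 6, then each even node to its odd neighbour.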
Consider the problem of multiplying a matrix with a vector.
The n x n matrix is assigned to an n x n (virtual) processor grid. The vector is assumed to be on the first row of processors.
The first step of the product requires a one-to-all broadcast of the vector element along the corresponding column of processors. This can be done concurrently for all n columns.
The processors compute the local product of the vector element and the local matrix entry.
In the final step, the results of these products are accumulated to the first column using n concurrent all-to-one reduction operations along the rows (using the sum operation), since each entry of the product vector is the sum of the local products in one row of the grid.
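The three steps can be traced with a sequential sketch in which entry (i, j) of a nested list stands for processor (i, j) of the virtual grid. The function name `grid_matvec` is hypothetical, and the concurrent broadcasts and reductions are simulated with ordinary loops. Note that the sketch sums across each row of the grid, because element i of the product is the sum over j of A[i][j] * x[j].

```python
def grid_matvec(A, x):
    """Simulate y = A x on an n x n virtual processor grid."""
    n = len(A)
    assert all(len(row) == n for row in A) and len(x) == n

    # Step 1: one-to-all broadcast of x[j] down column j
    # (all n columns would do this concurrently).
    x_local = [[x[j] for j in range(n)] for _ in range(n)]

    # Step 2: each processor multiplies its matrix entry by its vector element.
    prod = [[A[i][j] * x_local[i][j] for j in range(n)] for i in range(n)]

    # Step 3: all-to-one sum reduction, one per row of the grid
    # (again n concurrent operations).
    return [sum(prod[i]) for i in range(n)]


if __name__ == "__main__":
    print(grid_matvec([[1, 2], [3, 4]], [5, 6]))
```

Each of steps 1 and 3 is an instance of the broadcast and reduction operations described earlier, which is why efficient versions of those two primitives matter for this computation.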
Ring broadcast in the row of the initiating node.
Ring broadcast in all columns, starting from the row of the initiating node.
n x n matrix, n x 1 vector, n x n mesh
One-to-all broadcast on a 16-node mesh.
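The two-phase mesh broadcast can be sketched as follows. The sketch assumes a square mesh whose side is a power of two and reuses recursive doubling within each row and column. The names `line_schedule` and `mesh_broadcast_schedule` are hypothetical, and nodes are labelled by (row, column) pairs rather than by the numbering used in the figure.

```python
def line_schedule(k, source):
    """Recursive-doubling broadcast over k nodes; steps of (sender, receiver)."""
    assert k > 0 and k & (k - 1) == 0, "sketch assumes k is a power of two"
    have, steps, dist = [source], [], k // 2
    while dist >= 1:
        step = [(s, (s + dist) % k) for s in have]
        have += [r for _, r in step]
        steps.append(step)
        dist //= 2
    return steps


def mesh_broadcast_schedule(side, source=(0, 0)):
    """One-to-all broadcast on a side x side mesh in two phases."""
    r0, c0 = source
    steps = []
    # Phase 1: broadcast along the row of the initiating node.
    for step in line_schedule(side, c0):
        steps.append([((r0, a), (r0, b)) for a, b in step])
    # Phase 2: every node of that row broadcasts along its own column,
    # all columns proceeding concurrently.
    for step in line_schedule(side, r0):
        steps.append([((a, c), (b, c)) for c in range(side) for a, b in step])
    return steps


if __name__ == "__main__":
    for t, step in enumerate(mesh_broadcast_schedule(4), start=1):
        print("step", t, len(step), "messages")
```

On a 4 x 4 mesh this takes two steps for the row phase and two for the column phase, four steps in total for 16 nodes.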
One-to-all broadcast on a three-dimensional hypercube. The binary representations of node labels are shown in parentheses.
One-to-all broadcast of a message X from source on a hypercube.
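A hypercube broadcast of this kind is commonly written with bit masks: in each step the nodes that already hold X send it across one dimension, from the highest dimension down to the lowest. The sketch below simulates that scheme sequentially. The function name `hypercube_broadcast`, the return values, and the XOR relabelling used to support an arbitrary source are assumptions made for illustration and may differ from the algorithm shown on the slide.

```python
def hypercube_broadcast(d, source, message):
    """Simulate a one-to-all broadcast on a d-dimensional hypercube.

    Returns (data, log): data[node] is the message held by each node at
    the end, and log is a list of steps of (sender, receiver) pairs.
    """
    p = 1 << d
    data = [None] * p
    data[source] = message
    log = []
    mask = p - 1                      # all d bits set
    for i in range(d - 1, -1, -1):    # highest dimension first
        mask ^= 1 << i                # clear bit i of the mask
        step = []
        for node in range(p):
            virtual = node ^ source   # relabel so the source acts as node 0
            # Active senders have zeros in bit i and in all lower bits.
            if (virtual & mask) == 0 and (virtual & (1 << i)) == 0:
                step.append((node, node ^ (1 << i)))
        for sender, receiver in step:
            data[receiver] = data[sender]
        log.append(step)
    return data, log


if __name__ == "__main__":
    data, log = hypercube_broadcast(3, 0, "X")
    for t, step in enumerate(log, start=1):
        print("step", t, step)
```

For d = 3 and source 0 the simulation takes three steps, one per dimension, and the number of senders doubles in each step.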