Group Communication - Parallel Processing - Lecture Slides, Slides of Parallel Computing and Programming

Some concepts of Parallel Processing are Anatomy, Cache Access Time, Instruction Formats, Multidimensional Meshes, Network Processors, and Snooping Protocol. Main points of this lecture are: Group Communication Operations, One-To-All Broadcast, Broadcast and Reduction, Prefix-Sum Operations, Scatter and Gather, Personalized Communication, Circular Shift, and Speed of Some Communication Operations.

Typology: Slides

2012/2013

Uploaded on 04/30/2013

devank

Lecture 9: Group Communication Operations

Topic Overview

  • One-to-All Broadcast and All-to-One Reduction
  • All-to-All Broadcast and Reduction
  • All-Reduce and Prefix-Sum Operations
  • Scatter and Gather
  • All-to-All Personalized Communication
  • Circular Shift
  • Improving the Speed of Some Communication Operations

Basic Communication Operations: Introduction

  • Group communication operations are built using point-to-point messaging primitives.
  • Recall from our discussion of architectures that communicating a message of size $m$ over an uncongested network takes time $t_s + m t_w$.
  • We use this as the basis for our analyses. Where necessary, we take congestion into account explicitly by scaling the $t_w$ term.
  • We assume that the network is bidirectional and that communication is single-ported.
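
The cost model above can be written as a one-line helper. This is only a minimal sketch: the function name `message_time`, the `congestion` scale factor, and the sample values are illustrative assumptions, not part of the lecture.

```python
def message_time(m, t_s, t_w, congestion=1.0):
    """Time to send an m-word message: t_s + m * t_w.

    `congestion` is a hypothetical scale factor applied to the t_w term,
    mirroring how the analysis accounts for congested links.
    """
    return t_s + m * t_w * congestion


# Illustrative values only: startup cost 10, per-word cost 0.5.
uncongested = message_time(100, t_s=10, t_w=0.5)
congested = message_time(100, t_s=10, t_w=0.5, congestion=2)
print(uncongested, congested)
```

With these made-up numbers the startup term dominates for short messages and the per-word term for long ones, which is why both terms are kept in the analyses.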

One-to-All Broadcast and All-to-One Reduction

  • One processor has a piece of data (of size m) that it needs to send to everyone.
  • The dual of one-to-all broadcast is all-to-one reduction.
  • In all-to-one reduction, each processor has m units of data. These data items must be combined piece-wise (using some associative operator, such as addition or min), and the result made available at a target processor.
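
What "combined piece-wise" means can be shown with a sequential sketch. The helper name `all_to_one_reduce` and the sample data are assumptions made for illustration; the code models only the result of the reduction, not the communication pattern.

```python
from functools import reduce
import operator


def all_to_one_reduce(buffers, op=operator.add):
    """Combine p equal-length buffers element-wise with an associative op.

    buffers[k] is the m-word buffer held by processor k; the returned list
    is what the target processor would hold after the reduction.
    """
    return [reduce(op, column) for column in zip(*buffers)]


# Four processors, m = 3 words each (made-up data).
buffers = [[1, 5, 2], [4, 0, 7], [3, 3, 3], [2, 8, 1]]
total = all_to_one_reduce(buffers)          # element-wise sum
smallest = all_to_one_reduce(buffers, min)  # element-wise min
print(total, smallest)
```

Because the operator is associative, the partial results can be combined in any grouping, which is what lets a parallel algorithm reduce in a tree-like order.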

One-to-All Broadcast and All-to-One Reduction on Rings

  • The simplest way is to send p-1 messages from the source to the other p-1 processors; this is not very efficient.
  • Use recursive doubling: the source sends the message to a selected processor. We now have two independent problems defined over the two halves of the machine.
  • Reduction can be performed in an identical fashion by inverting the process.
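
The recursive-doubling idea can be sketched as a schedule generator. This is a hedged illustration that assumes p is a power of two and that the "selected processor" is the one p/2, then p/4, and so on positions away around the ring; the function name `recursive_doubling_schedule` is hypothetical.

```python
def recursive_doubling_schedule(p, source=0):
    """Return a list of steps; each step is a list of (sender, receiver) pairs.

    Assumes p is a power of two and nodes are numbered 0..p-1 around the ring.
    """
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    have_data = [source]
    steps = []
    distance = p // 2
    while distance >= 1:
        # Every node that already holds the message forwards it `distance` away.
        transfers = [(n, (n + distance) % p) for n in have_data]
        steps.append(transfers)
        have_data += [dst for _, dst in transfers]
        distance //= 2
    return steps


for step, transfers in enumerate(recursive_doubling_schedule(8), start=1):
    print(step, transfers)
```

Reversing the list of steps and swapping sender and receiver in each pair gives a matching schedule for the all-to-one reduction.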

One-to-All Broadcast

  • One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.
  • Algorithmic approach: recursive doubling w/ recursive splitting. The number of processors having the data doubles every iteration/round, and the processor getting the data from another processor is the “mirror processor” in the other half of the current processor space (thus splitting this space in two). The current processor space halves every round.
  • Time = $\Theta(P-1)$
  • An easier algorithm: send to a neighbor, and then the neighbor and the source each take care of one half of the ring (now a linear array), either w/ the above algorithm or by just sequentially sending the data along the linear connections. Time = $\Theta(1 + P/2 - 1) = \Theta(P/2)$
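
The two time expressions above can be reproduced by counting link hops. This sketch rests on an assumption the slide does not state explicitly: time is measured as the number of ring links a message crosses, with the rounds happening one after another. Both function names are hypothetical.

```python
def doubling_hops(p):
    """Hops for recursive doubling: p/2 + p/4 + ... + 1 (p a power of two)."""
    hops, distance = 0, p // 2
    while distance >= 1:
        hops += distance  # one round covers `distance` links
        distance //= 2
    return hops


def neighbor_split_hops(p):
    """One hop to the neighbor, then each half is walked link by link."""
    return 1 + (p // 2 - 1)


for p in (8, 16, 64):
    print(p, doubling_hops(p), neighbor_split_hops(p))
```

Under this hop-counting assumption the first count equals P-1 and the second equals P/2, matching the two expressions in the list.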

Broadcast and Reduction: Example

Consider the problem of multiplying a matrix with a vector.

  • The n x n matrix is assigned to an n x n (virtual) processor grid. The vector is assumed to be on the first row of processors.
  • The first step of the product requires a one-to-all broadcast of the vector element along the corresponding column of processors. This can be done concurrently for all n columns.
  • The processors compute local product of the vector element and the local matrix entry.
  • In the final step, the results of these products are accumulated to the first column of processors (one result per row) using n concurrent all-to-one reduction operations along the rows (using the sum operation).
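
The three phases can be traced with a sequential sketch of the virtual n x n grid. The function name `matvec_on_grid` and the sample matrix are illustrative assumptions; the nested lists stand in for per-processor storage and no real message passing takes place.

```python
def matvec_on_grid(A, x):
    """Simulate y = A x on an n x n virtual processor grid.

    Processor (i, j) holds A[i][j]; x[j] starts on processor (0, j).
    """
    n = len(A)
    # Step 1: one-to-all broadcast of x[j] down column j (all columns at once).
    x_local = [[x[j] for j in range(n)] for _ in range(n)]
    # Step 2: each processor multiplies its matrix entry by its vector element.
    prod = [[A[i][j] * x_local[i][j] for j in range(n)] for i in range(n)]
    # Step 3: all-to-one sum reduction along each row (all rows at once).
    return [sum(prod[i]) for i in range(n)]


A = [[1, 2], [3, 4]]
x = [5, 6]
y = matvec_on_grid(A, x)
print(y)
```

Each of the n broadcasts and n reductions involves only the n processors of one column or row, so they can proceed concurrently without sharing links.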

Broadcast and Reduction: Matrix-Vector Multiplication Example

One-to-all broadcast and all-to-one reduction in the multiplication of a 4 x 4 matrix with a 4 x 1 vector.

Broadcast and Reduction on a Mesh: Example

One-to-all broadcast on a 16-node mesh.

Broadcast and Reduction on a Hypercube

  • A hypercube with $2^d$ nodes can be regarded as a $d$-dimensional mesh with two nodes in each dimension.
  • The mesh algorithm can be generalized to a hypercube, and the operation is carried out in $d$ ($= \log p$) steps.
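
A minimal sketch of the d-step hypercube broadcast follows. It assumes nodes are labeled with d-bit integers so that neighbors differ in exactly one bit, and that dimensions are processed from the highest bit down; the function name `hypercube_broadcast_schedule` is hypothetical.

```python
def hypercube_broadcast_schedule(d, source=0):
    """Return d steps of (sender, receiver) pairs on a 2**d-node hypercube."""
    steps = []
    have_data = {source}
    for i in reversed(range(d)):  # highest dimension first
        mask = 1 << i
        # Each holder sends across dimension i to the node differing in bit i.
        transfers = [(n, n ^ mask) for n in sorted(have_data)]
        steps.append(transfers)
        have_data |= {dst for _, dst in transfers}
    return steps


for step, transfers in enumerate(hypercube_broadcast_schedule(3), start=1):
    print(step, transfers)
```

Every transfer crosses a single hypercube link, so each of the d steps costs one message time and the number of holders doubles per step.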

Broadcast and Reduction on a Balanced Binary Tree

  • Consider a binary tree in which processors are (logically) at the leaves and internal nodes are routing nodes.
  • Assume that source processor is the root of this tree. In the first step, the source sends the data to the right child (assuming the source is also the left child). The problem has now been decomposed into two problems with half the number of processors.

Broadcast and Reduction on a Balanced Indirect Binary Tree

  • One-to-all broadcast on an eight-node indirect tree
  • Algorithm: Recursive doubling w/ recursive splitting
  • Time = $\sum_{i=1}^{\log P} \frac{2\log P}{2^{i-1}} = 2\log P \cdot \frac{1 - (1/2)^{\log P}}{1 - 1/2} = 4\log P \cdot \frac{P-1}{P} = \Theta(4\log P)$.
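
The algebra of the geometric sum above can be checked numerically. This sketch verifies only that the sum and the closed form agree as written on the slide, for P a power of two; it does not model the tree itself, and both function names are hypothetical.

```python
import math


def tree_time_sum(p):
    """Term-by-term sum: sum over i = 1..log P of 2 log P / 2**(i - 1)."""
    log_p = int(math.log2(p))
    return sum(2 * log_p / 2 ** (i - 1) for i in range(1, log_p + 1))


def tree_time_closed_form(p):
    """Closed form from the slide: 4 log P (P - 1) / P."""
    return 4 * math.log2(p) * (p - 1) / p


for p in (8, 16, 1024):
    print(p, tree_time_sum(p), tree_time_closed_form(p))
```

The factor (P-1)/P approaches 1 as P grows, which is where the 4 log P bound comes from.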

Broadcast and Reduction Algorithms

One-to-all broadcast of a message X from source on a hypercube.

/* I am or will be the source in my current proc. space */

Broadcast and Reduction Algorithms

Single-node accumulation on a d-dimensional hypercube. Each node contributes a message X containing m words, and node 0 is the destination.
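
Since the algorithm listing itself did not survive in this preview, the following is a hedged sketch of how such an accumulation could proceed, not a reconstruction of the slide's pseudocode. It assumes the combining operator is addition and that dimensions are processed from the lowest bit up; the function name `hypercube_reduce` is hypothetical.

```python
def hypercube_reduce(values, d):
    """Sum the m-word messages of all 2**d nodes into node 0.

    values[n] is the message X contributed by node n.
    """
    p = 1 << d
    assert len(values) == p
    partial = [list(v) for v in values]  # each node's running partial sum
    for i in range(d):
        mask = 1 << i
        for n in range(p):
            # Only nodes whose lower i bits are zero are still active;
            # among those, nodes with bit i set send across dimension i.
            if (n & (mask - 1)) == 0 and (n & mask):
                dst = n ^ mask
                partial[dst] = [a + b for a, b in zip(partial[dst], partial[n])]
    return partial[0]


values = [[1, 10], [2, 20], [3, 30], [4, 40]]  # made-up data, d = 2, m = 2
result = hypercube_reduce(values, 2)
print(result)
```

In each step the senders and receivers are disjoint sets of nodes, so the transfers within a step could run concurrently; the loop here serializes them only for simulation.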