Group Communication - Parallel Processing - Lecture Slides, Slides of Parallel Computing and Programming

Aliah University Parallel Computing and Programming

Some concepts of Parallel Processing are Anatomy, Cache Access Time, Instruction Formats, Multidimensional Meshes, Network Processors, Snooping Protocol. Main points of this lecture are: Group Communication Operations, One-To-All Broadcast, Broadcast and Reduction, Prefix-Sum Operations, Scatter and Gather, Personalized Communication, Circular Shift, Speed of Some Communication Operations

Typology: Slides

2012/2013

Uploaded on 04/30/2013

devank 🇮🇳

Lecture 9: Group Communication Operations

Topic Overview

One-to-All Broadcast and All-to-One Reduction
All-to-All Broadcast and Reduction
All-Reduce and Prefix-Sum Operations
Scatter and Gather
All-to-All Personalized Communication
Circular Shift
Improving the Speed of Some Communication Operations

Basic Communication Operations: Introduction

Group communication operations are built using point-to-point messaging primitives.
Recall from our discussion of architectures that communicating a message of size m over an uncongested network takes time $t_s + m t_w$.
We use this as the basis for our analyses. Where necessary, we take congestion into account explicitly by scaling the $t_w$ term.
We assume that the network is bidirectional and that communication is single-ported.
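
The cost model above can be sketched in a few lines of Python. This is only an illustration: the function name `transfer_time`, the `congestion` parameter (modelled here as a multiplicative factor on the $t_w$ term), and the numeric values are assumptions, not part of the lecture.

```python
def transfer_time(m, t_s, t_w, congestion=1.0):
    """Time to send an m-word message: t_s + m * t_w.

    `congestion` is an assumed multiplicative factor that scales
    the t_w term, as the slides suggest doing when links are shared.
    """
    return t_s + congestion * t_w * m


# Hypothetical parameter values, chosen only for illustration.
print(transfer_time(m=100, t_s=10.0, t_w=0.5))                  # 60.0
print(transfer_time(m=100, t_s=10.0, t_w=0.5, congestion=2.0))  # 110.0
```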

One-to-All Broadcast and All-to-One Reduction

One processor has a piece of data (of size m) that it needs to send to everyone.
The dual of one-to-all broadcast is all-to-one reduction.
In all-to-one reduction, each processor has m units of data. These data items must be combined piece-wise (using some associative operator, such as addition or min), and the result made available at a target processor.
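
What "combined piece-wise" means can be shown with a small sequential sketch. It only models the result that ends up at the target processor, not the message passing itself; the function name `all_to_one_reduce` and the sample data are assumptions made for illustration.

```python
from functools import reduce
import operator


def all_to_one_reduce(buffers, op=operator.add):
    """Combine p buffers of m items each, element by element.

    buffers[i] is the data held by processor i. The returned list of
    m items is what the target processor holds after the reduction.
    `op` must be associative (for example addition or min).
    """
    m = len(buffers[0])
    if any(len(b) != m for b in buffers):
        raise ValueError("every processor must contribute m items")
    return [reduce(op, (b[j] for b in buffers)) for j in range(m)]


# Three processors, m = 2 items each (hypothetical data).
data = [[1, 2], [3, 4], [5, 6]]
print(all_to_one_reduce(data))       # [9, 12]
print(all_to_one_reduce(data, min))  # [1, 2]
```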

One-to-All Broadcast and All-to-One Reduction on Rings

The simplest way is to send p-1 messages from the source to the other p-1 processors; this is not very efficient.
Use recursive doubling: the source sends a message to a selected processor. We now have two independent problems defined over the two halves of the machine.
Reduction can be performed in an identical fashion by inverting the process.
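
As a rough sketch of recursive doubling on a ring, the function below lists which node sends to which in each step. It assumes p is a power of two and that the "selected processor" is the one halfway around the current half of the ring; the name `ring_broadcast_schedule` is made up for this example.

```python
def ring_broadcast_schedule(p, source=0):
    """Return one list of (sender, receiver) pairs per step.

    Assumes p is a power of two. In each step every node that already
    holds the message sends it `dist` positions around the ring, and
    `dist` halves from p/2 down to 1.
    """
    if p <= 0 or p & (p - 1):
        raise ValueError("p must be a power of two")
    have = [source]
    steps = []
    dist = p // 2
    while dist >= 1:
        transfers = [(s, (s + dist) % p) for s in have]
        have += [r for _, r in transfers]
        steps.append(transfers)
        dist //= 2
    return steps


for step, transfers in enumerate(ring_broadcast_schedule(8), start=1):
    print(step, transfers)
```

Reversing the list of steps and swapping sender and receiver in each pair gives a matching schedule for the all-to-one reduction.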

One-to-All Broadcast

One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.
Algorithmic approach: recursive doubling with recursive splitting. The number of processors having the data doubles every iteration/round, and the processor getting the data from another processor is the "mirror processor" in the other half of the current processor space (thus splitting this space in two). The current processor space halves every round.
Time = $\Theta(P-1)$, if each step is charged the distance its message travels: $P/2 + P/4 + \dots + 1 = P - 1$ hops.
An easier algorithm: send to a neighbor, and then the neighbor and the source each take care of one half of the ring (now a linear array), using either the above algorithm or just sequentially sending the data along the linear connections. Time = $\Theta(1 + P/2 - 1) = \Theta(P/2)$

Broadcast and Reduction: Example

Consider the problem of multiplying a matrix with a vector.

The n x n matrix is assigned to an n x n (virtual) processor grid. The vector is assumed to be on the first row of processors.
The first step of the product requires a one-to-all broadcast of the vector element along the corresponding column of processors. This can be done concurrently for all n columns.
The processors compute the local product of the vector element and the local matrix entry.
In the final step, the results of these products are accumulated to the first column using n concurrent all-to-one reduction operations along the rows (using the sum operation).
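
The three phases can be mimicked sequentially as follows. This is a sketch of the data movement on the virtual n x n grid, not a parallel implementation; the function name `matvec_on_grid` and the sample matrix are assumptions for illustration.

```python
def matvec_on_grid(A, x):
    """Simulate y = A x on an n x n virtual processor grid.

    Processor (i, j) holds the matrix entry A[i][j].
    """
    n = len(A)
    # Phase 1: one-to-all broadcast of x[j] along column j
    # (conceptually all n columns at once).
    x_local = [[x[j] for j in range(n)] for _ in range(n)]
    # Phase 2: each processor multiplies its matrix entry by the
    # vector element it received.
    partial = [[A[i][j] * x_local[i][j] for j in range(n)]
               for i in range(n)]
    # Phase 3: all-to-one sum reduction along each row, giving one
    # result element per row.
    return [sum(partial[i]) for i in range(n)]


print(matvec_on_grid([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```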

Broadcast and Reduction: Matrix-Vector Multiplication Example

One-to-all broadcast and all-to-one reduction in the multiplication of a 4 x 4 matrix with a 4 x 1 vector.

Broadcast and Reduction on a Mesh: Example

One-to-all broadcast on a 16-node mesh.

Broadcast and Reduction on a Hypercube

A hypercube with $2^d$ nodes can be regarded as a d-dimensional mesh with two nodes in each dimension.
The mesh algorithm can be generalized to a hypercube and the operation is carried out in d ( = log p ) steps.
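
One possible way to picture the d-step hypercube broadcast is to list the transfers per step. The sketch assumes node labels are d-bit integers and that dimensions are processed from the highest bit to the lowest; `hypercube_broadcast_schedule` is a name invented for this example.

```python
def hypercube_broadcast_schedule(d, source=0):
    """Return one sorted list of (sender, receiver) pairs per step.

    Nodes are labelled 0 .. 2**d - 1. In the step for dimension i,
    every node that already holds the message sends it to the
    neighbor whose label differs in bit i.
    """
    have = {source}
    steps = []
    for i in reversed(range(d)):  # highest dimension first (assumed order)
        bit = 1 << i
        transfers = sorted((s, s ^ bit) for s in have)
        have |= {r for _, r in transfers}
        steps.append(transfers)
    return steps


for step, transfers in enumerate(hypercube_broadcast_schedule(3), start=1):
    print(step, transfers)
```

With d = 3 the schedule has three steps, and the number of nodes holding the message doubles in each one.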

Broadcast and Reduction on a Balanced Binary Tree

Consider a binary tree in which processors are (logically) at the leaves and internal nodes are routing nodes.
Assume that the source processor is the root of this tree. In the first step, the source sends the data to the right child (assuming the source is also the left child). The problem has now been decomposed into two problems with half the number of processors.

Broadcast and Reduction on a Balanced Indirect Binary Tree

One-to-all broadcast on an eight-node indirect tree
Algorithm: Recursive doubling with recursive splitting
Time = $\sum_{i=1}^{\log P} \frac{2\log P}{2^{i-1}} = 2\log P \cdot \frac{1 - (1/2)^{\log P}}{1 - 1/2} = 4\log P \cdot \frac{P-1}{P} = \Theta(4\log P)$.
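
The geometric sum above can be checked numerically. The sketch takes the per-step cost 2 log P / 2^(i-1) from the slide at face value and assumes P is a power of two; both function names are made up for this check.

```python
import math


def tree_broadcast_time(p):
    """Sum the per-step costs 2*log2(p) / 2**(i-1) for i = 1 .. log2(p)."""
    log_p = p.bit_length() - 1  # exact log2 for a power of two
    return sum(2 * log_p / 2 ** (i - 1) for i in range(1, log_p + 1))


def closed_form(p):
    """Closed form from the slide: 4 * log2(p) * (p - 1) / p."""
    return 4 * math.log2(p) * (p - 1) / p


for p in (2, 8, 1024):
    print(p, tree_broadcast_time(p), closed_form(p))
```

For P = 8 both expressions give 10.5, and as P grows the value approaches 4 log P.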

Broadcast and Reduction Algorithms

One-to-all broadcast of a message X from source on a hypercube.

/* I am or will be the source in my current proc. space */

Broadcast and Reduction Algorithms

Single-node accumulation on a d-dimensional hypercube. Each node contributes a message X containing m words, and node 0 is the destination.
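
Since the algorithm figure did not survive in this preview, here is a minimal sequential sketch of single-node accumulation on a hypercube. It is an assumed reconstruction of the general idea (broadcast run in reverse, lowest dimension first), not the slide's own pseudocode, and it uses one value per node instead of an m-word message for brevity.

```python
import operator


def hypercube_reduce(values, d, op=operator.add):
    """Accumulate values[i] from all 2**d nodes at node 0.

    In step i, nodes whose lower i bits are zero take part: those
    with bit i set send their partial result to the neighbor across
    dimension i, which combines it with its own.
    """
    if len(values) != 1 << d:
        raise ValueError("need exactly 2**d values")
    data = list(values)
    for i in range(d):  # lowest dimension first
        bit = 1 << i
        low_mask = bit - 1
        for node in range(1 << d):
            if node & low_mask == 0 and node & bit:
                partner = node ^ bit
                data[partner] = op(data[partner], data[node])
    return data[0]


print(hypercube_reduce(list(range(8)), 3))  # 28
```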
