MPI Tutorial: Understanding Message Passing Interface for Parallel Computing - Prof. Purus, Study notes of Computer Science

University of Alabama - Birmingham Computer Science

Prof. Purushotham V. Bangalore

An overview of the Message Passing Interface (MPI) for parallel computing, including its salient features, datatypes, and communication functions. It covers point-to-point communication, collective communication, user-defined datatypes, and virtual topologies. The tutorial also includes examples and explanations of various MPI concepts.

Typology: Study notes

Pre 2010

Uploaded on 04/12/2010

koofers-user-ocq 🇺🇸


MPI Tutorial

Purushotham Bangalore, Ph.D.

Anthony Skjellum, Ph.D.

Department of Computer and

Information Sciences

University of Alabama at Birmingham

Overview

• High Performance Computing
– Introduction
– Hardware models
– Software models
– Parallelization strategies
• Message Passing Interface - MPI
– Point-to-point communication
– Collective communication
– Communicators
– Datatypes
– Topologies
– Inter-communicators
– Profiling

Introduction to High Performance Computing

What is High Performance Computing?

• Computing that stretches the ability of whatever class of machine (or conglomerate of machines) in terms of
– floating-point operations
– number of instructions per second
– memory
– input/output
– algorithms and data structures
– system software
– software development environment
• What is considered high performance changes over time, as systems get faster, more capable, and cheaper.


Why High Performance Computing?

• Need to solve a problem faster
• Need to solve a previously intractable problem
• Need to solve a “larger” problem
– More refined
– Higher dimensionality
– More complex geometry or physics
– Larger problem domain
• Growing complexity of industrial systems
• Growing demand for high utilization of resources and low pollution
• Greater demand for high economic agility and cost-effectiveness

Why Not High Performance Computing?

• HPC software is more expensive to develop, maintain, and use
• HPC software for your problem is not available or not validated
• Legacy solutions are still valuable even if old: they solve the problem, albeit slowly
• A “solution” in a field, whether slow or fast or possibly even correct, produces the “accepted answer”
• The coordination of resources is easy to manage
• The demanding aspect of the application is the human-intensive part, not the computational part, once the problem is set up

How to Get High Performance

• Wait for faster CPU, memory, storage, network
– Limit: physics, economics
• Use multiple CPUs on single problem
– Limit: Amdahl’s law
• Choose appropriate algorithms and data structures
• Choose appropriate programming model
• Tune software to maximize efficiency
– sequentially
– in parallel

Amdahl’s Law

• A metric that states that sequential bottlenecks pose fundamental limitations on speedup, stated as follows
• If S is the fraction of an algorithm that is serial and 1-S the fraction that can be parallelized, then the speedup that can be achieved using P processors is 1/(S + (1-S)/P), which has a limiting value of 1/S for an infinite number of processors
• What actually limits speedup below the upper limit governed by the sequential fraction
– sequential fraction may be a function of P
– communication in the system has finite, often significant cost
– load imbalance
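The Amdahl's law formula can be checked numerically with a short C sketch. The function names `amdahl_speedup` and `amdahl_limit` are illustrative choices, not part of the slides; the arithmetic is the 1/(S + (1-S)/P) expression and its 1/S limit.

```c
#include <assert.h>

/* Amdahl's law: speedup on p processors when a fraction s of the
 * work is strictly serial. Computes 1 / (s + (1 - s) / p). */
double amdahl_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0);
    assert(p >= 1);
    return 1.0 / (s + (1.0 - s) / (double)p);
}

/* Upper bound on speedup as p grows without limit: 1 / s.
 * Requires s > 0; a fully parallel program has no finite bound. */
double amdahl_limit(double s)
{
    assert(s > 0.0 && s <= 1.0);
    return 1.0 / s;
}
```

For example, with a 10% serial fraction, `amdahl_speedup(0.1, 10)` is about 5.26, and no processor count can push the speedup past `amdahl_limit(0.1)`, which is 10.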

Terminology, IV.

• Sustained Performance
– the actual performance in flops that an application/benchmark achieves
• Turn-around time (Time-to-Solution)
– a measure of the latency of an entire application once started
– a summation of the CPU, System, and I/O time of an application
• Capability
– A mode of operation in which a single hard problem is solved
• Capacity, Throughput
– A mode of operation in which the number of problems solved per time is optimized

Terminology, V.

• Process
– Independent image: instruction stream, stack, data
– Communication between processes is explicit (e.g., sockets, IPC, shmem)
• Thread
– “Lightweight process” (LWP)
– Multiple instruction streams, independent stack
– Shared data
– Communication can be implicit (e.g., shared variables)

Hardware Models

Multicomputer

• A networked set of computers (each with 1 or more CPUs), each with its own memory
• A Massively Parallel Processor (MPP)
• Typically does not support a single system image
• Also known as a loosely coupled system
• E.g., Intel Paragon, IBM SP, etc.

Multiprocessor

• A machine with processors connected to memories through a fabric or other network
• Typically supports a single system image
• A logical component of a modern multicomputer
• Also known as a tightly coupled system
• E.g., SGI Power Challenge, Sun HPC 10000

Cluster/Network of Workstations

• A collection of capable multiprocessors or uniprocessors (nodes)
– homogeneous environment - collection of similar nodes
– heterogeneous environment - collection of different nodes (differences in computer hardware, network, operating system)
• A capable network connecting these nodes
• System software infrastructure to provide emulation of a multicomputer
• Provides affordable high performance computing using available resources

Hardware Taxonomy

• Flynn's taxonomy of parallel hardware
– SISD - Single Instruction Single Data
– SIMD - Single Instruction Multiple Data
– MISD - Multiple Instruction Single Data
– MIMD - Multiple Instruction Multiple Data
• Shared memory (MIMD)
– e.g., SGI Power Challenge, Sun HPC 10000
• Distributed memory (MIMD)
– e.g., Intel Paragon, IBM SP, Network of Workstations
• Distributed shared memory (MIMD)
– e.g., HP/Convex Exemplar, SGI Origin 2000
– memory is physically distributed but logically shared
• Vector SMP (SIMD/MIMD)
– e.g., Cray C90, NEC SX4, NEC Earth Simulator

Five Levels of Hardware Parallelism

• Job Level
• Task Level
• Stream Level (Concurrent threads hiding latency)
• Pipeline (Instruction Concurrency and Reordering)
• Instruction level (multi-issue, VLIW, multiply/add/load/store)

Shared Memory

Also known as Uniform Memory Access machines (UMAs) or tightly coupled systems

[Figure: several processors connected to memory over a shared bus]

Shared Memory Issues

• Synchronization is explicit
• Data transfer is implicit
• To reduce access time and memory bandwidth demand, multilevel caches are used
• But this introduces the cache-coherence problem, which requires special cache-coherence protocols
• Not scalable
• Bus contention and bandwidth limitation

Distributed Memory

Also known as Message Passing Systems

[Figure: processor and memory pairs connected by a network]

Distributed Memory Issues

• Synchronization is implicit
• Data transfer is explicit
• Scalable
• Provides a cost-effective model to increase memory bandwidth and reduce memory latency since most of the memory access is local
• Introduces an additional overhead because of inter-processor communication
• Low latency and high bandwidth for inter-processor communication is the key to higher performance

Distributed Shared Memory

• Hybrid of shared and distributed memory models
• Memory is physically separated but address space is logically shared, meaning
– any processor can access any memory location
– same physical address on two processors refers to the same memory location
• Access time to a memory location is not uniform, hence they are also known as Non-Uniform Memory Access machines (NUMAs)
• Hardware support is required to maintain cache coherency (ccNUMA)

Networks

Interconnects / Fabrics

• Topologies
– Bus
– Switch
– Ring
– Hierarchical Star
• Fabric
– Ethernet (10 Mbit/s, 100 Mbit/s, 1 Gbit/s)
– Fiber Distributed Data Interface (FDDI)
– Asynchronous Transfer Mode (ATM)
– Gigabit (Myrinet, Giganet, etc.)

Classes of Networks, I.

• WAN: Wide Area Networks
• MAN: Metropolitan Area Networks
• LAN: Local Area Networks
• SWAN: Small Wide-Area Networks (connected SANs)
• SAN: System Area Networks

Developing HPC Applications

[Figure: routes from a sequential program to a parallel application. Labels: Sequential Program, Parallel Compiler, Preprocessor, Serial Compiler, Add function calls, Explicit message-passing]

Choice of Language and Notation

• C/C++/Fortran with threads and runtime support
– compiler switches and directives (OpenMP)
– good for local SMP programming
• High Performance Fortran (HPF)
– good for data parallel models with regular data relationships
– your tests show efficacy of HPF on algorithms
– specific array syntax is closely relevant to algorithms
• C/C++/Fortran plus MPI
– for irregular or dynamic data relationships
– each node programmed sequentially
– no parallel compiler looks over whole code (separate compilation)

Choices Facing HPC Programmer

• Choose programming model
– hardware
– application
– performance requirements
– portability requirements
• Shared memory
– thread libraries
– OpenMP or other compiler directives
• Distributed memory
– MPI libraries
– MPI

Scalability

• The ability of a hardware system to be increased in concurrency and/or memory capacity, within the same architecture
• The ability of an application or algorithm to adapt to different problem sizes and/or concurrency
• The ability of a set of related algorithms to solve a problem over a range of concurrencies and problem sizes
• Distributed memory is more scalable than shared memory

Poly-algorithm

• A collection of related algorithms that solve the same problem
• Each member of the collection is fastest for a subset of the problem domain
• The problem domain is described by
– concurrency
– problem size
– memory requirements
– emphasis on space or speed
• Relative speed changes when the poly-algorithm is ported
• Discovering which to use in a given situation a priori is the current research challenge

Load Balancing

• MIMD machines with data-dependent workloads lead in many situations to unbalanced loads, even for “regular algorithms”
• Static load balancing
– choice of data distributions
• Dynamic load balancing
– reorganization of data distributions
– retasking of processing units when they finish early
• Task Migration
– moving both code and data when appropriate across a system
– seek to use unused cycles
– seek to escape from busy machines

Overlapping Communication and Computation

• Communication is a dead loss associated with parallelism
• Hiding communication “behind” computation is important
• If sufficient processor memory bandwidth exists, this can be done, provided the communication network allows asynchronous transfers while other computation is on-going
• The maximum improvement of performance is a factor of 2, when the original I/O and computation times were exactly equal.
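The factor-of-2 bound on overlap can be seen from a simple timing model, sketched here in C. The model is an assumption made for illustration: without overlap a step costs the sum of its computation and communication times, and with perfect overlap it costs the larger of the two. The function names are hypothetical.

```c
#include <assert.h>

/* Step time when communication and computation run back to back. */
double time_sequential(double t_comp, double t_comm)
{
    return t_comp + t_comm;
}

/* Step time under idealized, perfect overlap: the longer activity
 * hides the shorter one completely. */
double time_overlapped(double t_comp, double t_comm)
{
    return t_comp > t_comm ? t_comp : t_comm;
}

/* Improvement factor from overlapping. Because
 * t_comp + t_comm <= 2 * max(t_comp, t_comm), this never exceeds 2,
 * and it reaches 2 only when the two times are equal. */
double overlap_gain(double t_comp, double t_comm)
{
    assert(t_comp >= 0.0 && t_comm >= 0.0);
    assert(t_comp + t_comm > 0.0);
    return time_sequential(t_comp, t_comm) / time_overlapped(t_comp, t_comm);
}
```

With equal times, `overlap_gain(4.0, 4.0)` gives the maximum of 2; with 9 time units of computation and 1 of communication the gain drops to about 1.11.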

Performance Metrics, I.

• Speedup
– A measure of the time to solution on two systems: a sequential system time divided by a parallel system time.
• Efficiency
– A measure of the fraction of speedup achieved, compared to the ideal (Speedup / number of processors)
• Scaled speedup
– A metric in which the problem size increases as the concurrency increases, in order to skirt Amdahl’s law, and measures constant problem size per processor, or else constant memory use
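The speedup and efficiency definitions translate directly into code. The following C sketch uses hypothetical function names and assumes the two timings are measured for the same problem.

```c
#include <assert.h>

/* Speedup: sequential time divided by parallel time. */
double speedup(double t_seq, double t_par)
{
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Efficiency: speedup divided by the number of processors,
 * so 1.0 corresponds to ideal (linear) speedup. */
double efficiency(double t_seq, double t_par, int p)
{
    assert(p >= 1);
    return speedup(t_seq, t_par) / (double)p;
}
```

For instance, a run that takes 100 s sequentially and 12.5 s on 16 processors has a speedup of 8 and an efficiency of 0.5.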

Partitioning

• Both computations and the data on which computations are performed can be partitioned
• Domain decomposition
– data is partitioned into smaller pieces
– computations are performed on the smaller datasets
• Functional decomposition
– computation is partitioned instead of data
– for a good functional decomposition, datasets should be disjoint

Domain decomposition

• Most common approach to obtain parallelism
• Each process works on a local subset of the global data
• Several types of decompositions are possible depending on the application
• Decomposition can be done manually or using tools depending on the complexity of the dataset
• Good load balancing key to high performance
– idle time must be minimized in each process
– each process must communicate equal amounts of data

Decomposition for Structured Grids

[Figure: structured-grid decompositions. Labels: 1-D, 2-D, 3-D, blocked, block-scattered]
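For a 1-D structured grid, the blocked and block-scattered layouts reduce to a little index arithmetic. The C sketch below is one common convention, not necessarily the one the slides had in mind: in the blocked layout the first `n % p` processes receive one extra item, and in the block-scattered layout blocks of length `b` are dealt out round-robin. All function names are hypothetical.

```c
#include <assert.h>

/* Blocked layout: number of items owned by `rank` when n items are
 * split over p processes as evenly as possible. */
int block_size(int n, int p, int rank)
{
    assert(n >= 0 && p >= 1 && rank >= 0 && rank < p);
    return n / p + (rank < n % p ? 1 : 0);
}

/* Blocked layout: global index of the first item owned by `rank`. */
int block_start(int n, int p, int rank)
{
    int base = n / p;
    int rem = n % p;
    assert(n >= 0 && p >= 1 && rank >= 0 && rank < p);
    return rank * base + (rank < rem ? rank : rem);
}

/* Block-scattered layout: owner of global index i when blocks of
 * length b are assigned to the p processes in round-robin order. */
int block_cyclic_owner(int i, int b, int p)
{
    assert(i >= 0 && b >= 1 && p >= 1);
    return (i / b) % p;
}
```

With 10 items on 4 processes, the blocked layout gives sizes 3, 3, 2, 2 starting at indices 0, 3, 6, 8. With blocks of 2 on 3 processes, index 5 belongs to process 2 and index 6 wraps around to process 0.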

Decomposition for Unstructured Grids

• Lack a straightforward, algebraic way to partition
• Resort to geometric or graph-theoretical methods, e.g.
– Recursive coordinate bisection (RCB)
– RCB works poorly for highly stretched grids, so...

[Figure: example partitions. Labels: PE 0, PE 1 (two parts); PE 0, PE 1, PE 2, PE 3 (four parts)]
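Recursive coordinate bisection can be sketched in a few lines of C for points in the plane. This is a minimal illustration under stated assumptions, not a production partitioner: it cuts at the median along whichever axis has the larger extent (a common variant), produces 2^levels parts, and reorders the input array in place. The `Point` type and the `rcb` function are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double x, y; int part; } Point;

static int cmp_x(const void *a, const void *b)
{
    double d = ((const Point *)a)->x - ((const Point *)b)->x;
    return (d > 0) - (d < 0);
}

static int cmp_y(const void *a, const void *b)
{
    double d = ((const Point *)a)->y - ((const Point *)b)->y;
    return (d > 0) - (d < 0);
}

/* Split pts[0..n) into 2^levels parts labelled first_part,
 * first_part + 1, ... by repeated median cuts. */
void rcb(Point *pts, int n, int levels, int first_part)
{
    double xmin, xmax, ymin, ymax;
    int i, half;

    assert(n >= 0 && levels >= 0);
    if (n == 0)
        return;
    if (levels == 0) {
        for (i = 0; i < n; i++)
            pts[i].part = first_part;
        return;
    }

    /* Find the bounding box to pick the axis with the larger extent. */
    xmin = xmax = pts[0].x;
    ymin = ymax = pts[0].y;
    for (i = 1; i < n; i++) {
        if (pts[i].x < xmin) xmin = pts[i].x;
        if (pts[i].x > xmax) xmax = pts[i].x;
        if (pts[i].y < ymin) ymin = pts[i].y;
        if (pts[i].y > ymax) ymax = pts[i].y;
    }

    /* Sort along that axis and cut at the median. */
    qsort(pts, (size_t)n, sizeof(Point),
          (xmax - xmin >= ymax - ymin) ? cmp_x : cmp_y);
    half = n / 2;
    rcb(pts, half, levels - 1, first_part);
    rcb(pts + half, n - half, levels - 1,
        first_part + (1 << (levels - 1)));
}
```

Calling `rcb(pts, n, 2, 0)` labels each point with a part number from 0 to 3, analogous to the PE 0 through PE 3 regions in the figure.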

Decomposition for Unstructured Grids

• Recursive graph bisection (RGB)
– A mesh can be viewed as an undirected graph
– Instead of Euclidean distance, metric is graph distance
– Do recursive bisection on the graph, not the coordinates

Decomposition for Unstructured Grids

• Recursive spectral bisection (RSB)
– solve an eigenproblem for metric
– very high quality partitions, but expensive
• Multilevel variants of the above
– very efficient, gives excellent partitions
• See Metis, Chaco
– http://www-users.cs.umn.edu/~karypis/metis
– http://www.cs.sandia.gov/CRF/chac_p2.html

Establishing Communication

• After partitioning, the next step is to set up communication between the different partitions
• Based on the application, communication can be
– structured / unstructured
– synchronous / asynchronous
– static / dynamic
– nearest neighbor / collective
• Key to high performance
– avoid excessive communication
– reduce communication with replicated computation
– overlap computation and communication
– send few large messages instead of many small messages

MPI

Message Passing Interface (MPI)

• A message-passing library specification
– Message-passing model
– Not a compiler specification
– Not a specific product
• For parallel computers, clusters, and heterogeneous networks
• Designed to aid the development of portable parallel software libraries
• Designed to provide access to advanced parallel hardware for
– End users
– Library writers
– Tool developers

Message Passing Interface - MPI

• MPI-1 standard widely accepted by vendors and programmers
– MPI implementations available on most modern platforms
• Huge number of MPI applications deployed
– Several tools exist to trace and tune MPI applications
• MPI provides rich set of functionality to support library writers, tools developers and application programmers

MPI Salient Features

• Point-to-point communication
• Collective communication on process groups
• Communicators and groups for safe communication
• User defined datatypes
• Virtual topologies
• Support for profiling

A First MPI Program

C:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);      /* start the MPI environment */
    printf("Hello World!\n");
    MPI_Finalize();              /* shut the MPI environment down */
    return 0;
}

Fortran:

      program main
      include 'mpif.h'
      integer ierr
      call MPI_INIT(ierr)
      print *, 'Hello world!'
      call MPI_FINALIZE(ierr)
      end

Starting the MPI Environment

• MPI_INIT ( )
– Initializes MPI environment. This function must be called and must be the first MPI function called in a program (exception: MPI_INITIALIZED)
• Syntax
– C: int MPI_Init ( int *argc, char ***argv )
– Fortran: MPI_INIT ( IERROR )
  INTEGER IERROR

Exiting the MPI Environment

• MPI_FINALIZE ( )
– Cleans up all MPI state. Once this routine has been called, no MPI routine (even MPI_INIT) may be called
• Syntax
– C: int MPI_Finalize ( );
– Fortran: MPI_FINALIZE ( IERROR )
  INTEGER IERROR

C and Fortran Language Considerations, I.

• MPI_INIT: The C version accepts the argc and argv variables that are provided as arguments to main ( )
• Error codes: Almost all MPI Fortran subroutines have an integer return code as their last argument. Almost all C functions return an integer error code
• Types: Opaque objects are given type names in C. Opaque objects are usually of type INTEGER in Fortran (exception: binary-valued variables are of type LOGICAL)
• Inter-language interoperability is not guaranteed

Communicator

• Communication in MPI takes place with respect to communicators
• MPI_COMM_WORLD is one such predefined communicator (something of type “MPI_COMM”, spelled MPI_Comm in C) and contains group and context information
• MPI_COMM_RANK and MPI_COMM_SIZE return information based on the communicator passed in as the first argument
• Processes may belong to many different communicators

Environment Setup

Using the CIS Cluster

• machine, otherwise you have to log in to moat.cis.uab.edu first)
• Compile
– mpicc -o program program.c
• Submit
– qsub myscript.sge
• Monitor
– qstat -u
• See User Guide for more details
– http://www.cis.uab.edu/cs541/homework/instructions.pdf

Point-to-Point Communications

Sending and Receiving Messages

• Basic message passing process

[Figure: message passing between Process 0 and Process 1. Labels: Send, Receive, B]

• Questions
– To whom is data sent?
– Where is the data?
– What type of data is sent?
– How much data is sent?
– How does the receiver identify it?

Message Organization in MPI

• Message is divided into data and envelope
• data
– buffer
– count
– datatype
• envelope
– process identifier (source/destination rank)
– message tag
– communicator
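The role of the envelope can be illustrated with a toy model in C of how a posted receive is compared with an incoming message. This is only a sketch of the matching idea, not how any MPI library is implemented: the `Envelope` struct, the integer communicator id, and the `ANY_SOURCE` / `ANY_TAG` constants are hypothetical stand-ins (real MPI offers the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG on the receive side and also guarantees message ordering, which the sketch ignores).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for MPI's receive-side wildcards. */
enum { ANY_SOURCE = -1, ANY_TAG = -1 };

/* Toy envelope: the three fields listed on the slide. */
typedef struct {
    int source;  /* rank of the sending process            */
    int tag;     /* user-chosen message tag                */
    int comm;    /* communicator, modelled as a plain id   */
} Envelope;

/* A posted receive matches an incoming message when the
 * communicators agree and the source and tag either agree or are
 * wildcards on the receive side. The data part (buffer, count,
 * datatype) plays no role in matching. */
bool envelope_matches(Envelope posted, Envelope incoming)
{
    if (posted.comm != incoming.comm)
        return false;
    if (posted.source != ANY_SOURCE && posted.source != incoming.source)
        return false;
    if (posted.tag != ANY_TAG && posted.tag != incoming.tag)
        return false;
    return true;
}
```

A receive posted with `ANY_SOURCE` and tag 7 accepts a tag-7 message from any rank in the same communicator, but rejects a tag-8 message or one sent on a different communicator.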

Generalizing the Buffer Description

• Specified in MPI by starting address, count, and datatype, where datatype is as follows:
– Elementary (all C and Fortran datatypes)
– Contiguous array of datatypes
– Strided blocks of datatypes
– Indexed array of blocks of datatypes
– General structure
• Datatypes are constructed recursively
• Specifying application-oriented layout of data allows maximal use of special hardware
• Elimination of length in favor of count is clearer
– Traditional: send 20 bytes
– MPI: send 5 integers
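To make the "strided blocks" layout concrete, the C sketch below gathers such a layout into a contiguous buffer by hand. The helper `pack_strided` is hypothetical and written for doubles only; its `count`, `blocklen`, and `stride` parameters are meant to mirror the arguments of MPI_Type_vector, which lets MPI describe the same layout without a manual copy.

```c
#include <stddef.h>
#include <string.h>

/* Copy `count` blocks of `blocklen` doubles each into `out`.
 * Consecutive blocks start `stride` doubles apart in `src`.
 * Returns the number of doubles written to `out`. */
size_t pack_strided(const double *src, size_t count, size_t blocklen,
                    size_t stride, double *out)
{
    size_t i;
    for (i = 0; i < count; i++) {
        memcpy(out + i * blocklen, src + i * stride,
               blocklen * sizeof(double));
    }
    return count * blocklen;
}
```

For a 3 x 4 row-major matrix `m`, the call `pack_strided(m + 1, 3, 1, 4, col)` extracts column 1: three blocks of one element each, four elements apart.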

MPI C Datatypes

MPI datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED_LONG     unsigned long int
MPI_UNSIGNED          unsigned int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE
MPI_PACKED
