MPI Tutorial: Understanding Message Passing Interface for Parallel Computing - Prof. Purus, Study notes of Computer Science

An overview of the Message Passing Interface (MPI) for parallel computing, including its salient features, data types, and communication functions. It covers point-to-point communication, collective communication, user-defined datatypes, and virtual topologies, with examples and explanations of various MPI concepts.

Typology: Study notes

Pre 2010

Uploaded on 04/12/2010

MPI Tutorial
Purushotham Bangalore, Ph.D.
Anthony Skjellum, Ph.D.
Department of Computer and Information Sciences
University of Alabama at Birmingham

Overview
• High Performance Computing
  – Introduction
  – Hardware models
  – Software models
  – Parallelization strategies
• Message Passing Interface - MPI
  – Point-to-point communication
  – Collective communication
  – Communicators
  – Datatypes
  – Topologies
  – Inter-communicators
  – Profiling

Introduction to High Performance Computing

What is High Performance Computing?
• Computing that stretches the ability of whatever class of machine (or conglomerate of machines) in terms of
  – floating-point operations
  – number of instructions per second
  – memory
  – input/output
  – algorithms and data structures
  – system software
  – software development environment
• What is considered high performance changes over time, as systems get faster, more capable, and cheaper.


Why High Performance Computing?
• Need to solve a problem faster
• Need to solve a previously intractable problem
• Need to solve a “larger” problem
  – More refined
  – Higher dimensionality
  – More complex geometry or physics
  – Larger problem domain
• Growing complexity of industrial systems
• Growing demand for high utilization of resources and low pollution
• Greater demand for high economic agility and cost-effectiveness

Why Not High Performance Computing?
• HPC software is more expensive to develop, maintain, and use
• HPC software for your problem is not available or not validated
• Legacy solutions are still valuable even if old; they solve the problem, albeit slowly
• A “solution” in a field, whether slow or fast or possibly even correct, produces the “accepted answer”
• The coordination of resources is easy to manage
• The demanding aspect of the application is the human-intensive part, not the computational part, once the problem is set up

How to Get High Performance
• Wait for faster CPU, memory, storage, network
  – Limit: physics, economics
• Use multiple CPUs on a single problem
  – Limit: Amdahl’s law
• Choose appropriate algorithms and data structures
• Choose appropriate programming model
• Tune software to maximize efficiency
  – sequentially
  – in parallel

Amdahl’s Law
• A metric that states that sequential bottlenecks pose fundamental limitations on speedup, stated as:
  If S is the fraction of an algorithm that is serial and 1-S the fraction that can be parallelized, then the speedup that can be achieved using P processors is 1/(S + (1-S)/P), which has a limiting value of 1/S for an infinite number of processors.
• What actually limits speedup below the upper limit governed by the sequential fraction
  – sequential fraction may be a function of P
  – communication in the system has finite, often significant cost
  – load imbalance
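The formula above can be evaluated directly. The following is a minimal sketch in C; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and not part of the tutorial.

```c
#include <assert.h>

/* Amdahl's law: speedup on p processors when a fraction s of the
 * work is serial and the remaining 1-s parallelizes perfectly. */
double amdahl_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0 && p >= 1);
    return 1.0 / (s + (1.0 - s) / p);
}

/* Limiting speedup for an unbounded number of processors: 1/s.
 * Only defined when some part of the work is serial (s > 0). */
double amdahl_limit(double s)
{
    assert(s > 0.0 && s <= 1.0);
    return 1.0 / s;
}
```

For example, with S = 0.1 and P = 10 the speedup is 1/0.19, roughly 5.26, while the limit for any number of processors is 10.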

Terminology, IV.
• Sustained Performance
  – the actual performance in flops that an application/benchmark achieves
• Turn-around time (Time-to-Solution)
  – a measure of the latency of an entire application once started
  – a summation of the CPU, System, and I/O time of an application
• Capability
  – A mode of operation in which a single hard problem is solved
• Capacity, Throughput
  – A mode of operation in which the number of problems solved per time is optimized

Terminology, V.
• Process
  – Independent image: instruction stream, stack, data
  – Communication between processes is explicit (e.g., sockets, IPC, shmem)
• Thread
  – “Lightweight process” (LWP)
  – Multiple instruction streams, independent stack
  – Shared data
  – Communication can be implicit (e.g., shared variables)

Hardware Models

Multicomputer
• A networked set of computers (1 or more CPUs), each with its own memory
• A Massively Parallel Processor (MPP)
• Typically does not support a single system image
• Also known as a loosely coupled system
• E.g., Intel Paragon, IBM SP

Multiprocessor
• A machine with processors connected to memories through a fabric or other network
• Typically supports a single system image
• A logical component of a modern multicomputer
• Also known as a tightly coupled system
• E.g., SGI Power Challenge, Sun HPC 10000

Cluster/Network of Workstations
• A collection of capable multiprocessors or uniprocessors (nodes)
  – homogeneous environment: collection of similar nodes
  – heterogeneous environment: collection of different nodes (differences in computer hardware, network, operating system)
• A capable network connecting these nodes
• System software infrastructure to provide emulation of a multicomputer
• Provides affordable high performance computing using available resources

Hardware Taxonomy
• Flynn's taxonomy of parallel hardware
  – SISD - Single Instruction Single Data
  – SIMD - Single Instruction Multiple Data
  – MISD - Multiple Instruction Single Data
  – MIMD - Multiple Instruction Multiple Data
• Shared memory (MIMD)
  – e.g., SGI Power Challenge, Sun HPC 10000
• Distributed memory (MIMD)
  – e.g., Intel Paragon, IBM SP, Network of Workstations
• Distributed shared memory (MIMD)
  – e.g., HP/Convex Exemplar, SGI Origin 2000
  – memory is physically distributed but logically shared
• Vector SMP (SIMD/MIMD)
  – e.g., Cray C90, NEC SX-4, NEC Earth Simulator

Five Levels of Hardware Parallelism
• Job Level
• Task Level
• Stream Level (Concurrent threads hiding latency)
• Pipeline (Instruction Concurrency and Reordering)
• Instruction level (multi-issue, VLIW, multiply/add/load/store)

Shared Memory
• Also known as Uniform Memory Access machines (UMAs) or tightly coupled systems
[Figure: several processors connected through a shared bus to a common memory]

Shared Memory Issues
• Synchronization is explicit
• Data transfer is implicit
• To reduce access time and memory bandwidth demand, multilevel caches are used
• But caches introduce a cache-coherence problem, which requires special cache-coherence protocols
• Not scalable
• Bus contention and bandwidth limitation

Distributed Memory
• Also known as Message Passing Systems
[Figure: processor/memory pairs, each connected to a common network]

Distributed Memory Issues
• Synchronization is implicit
• Data transfer is explicit
• Scalable
• Provides a cost-effective model to increase memory bandwidth and reduce memory latency, since most of the memory access is local
• Introduces an additional overhead because of inter-processor communication
• Low latency and high bandwidth for inter-processor communication is the key to higher performance

Distributed Shared Memory
• Hybrid of shared and distributed memory models
• Memory is physically separated but address space is logically shared, meaning
  – any processor can access any memory location
  – same physical address on two processors refers to the same memory location
• Access time to a memory location is not uniform, hence they are also known as Non-Uniform Memory Access machines (NUMAs)
• Hardware support is required to maintain cache coherency (ccNUMA)

Networks

Interconnects / Fabrics
• Topologies
  – Bus
  – Switch
  – Ring
  – Hierarchical Star
• Fabric
  – Ethernet (10 Mbit/s, 100 Mbit/s, 1 Gbit/s)
  – Fiber Distributed Data Interface (FDDI)
  – Asynchronous Transfer Mode (ATM)
  – Gigabit (Myrinet, Giganet, etc.)

Classes of Networks, I.
• WAN: Wide Area Networks
• MAN: Metropolitan Area Networks
• LAN: Local Area Networks
• SWAN: Small Wide-Area Networks (connected SANs)
• SAN: System Area Networks

Developing HPC Applications
[Figure labels: Sequential Program, Parallel Compiler, Serial Compiler, Preprocessor, Add function calls, Explicit message-passing]

Choice of Language and Notation
• C/C++/Fortran with threads and runtime support
  – compiler switches and directives (OpenMP)
  – good for local SMP programming
• High Performance Fortran (HPF)
  – good for data parallel models with regular data relationships
  – your tests show efficacy of HPF on algorithms
  – specific array syntax is closely relevant to algorithms
• C/C++/Fortran plus MPI
  – for irregular or dynamic data relationships
  – each node programmed sequentially
  – no parallel compiler looks over whole code (separate compilation)

Choices Facing HPC Programmer
• Choose programming model
  – hardware
  – application
  – performance requirements
  – portability requirements
• Shared memory
  – thread libraries
  – OpenMP or other compiler directives
• Distributed memory
  – MPI libraries
  – MPI

Scalability
• The ability of a hardware system to be increased in concurrency and/or memory capacity, within the same architecture
• The ability of an application or algorithm to adapt to different problem sizes and/or concurrency
• The ability of a set of related algorithms to solve a problem over a range of concurrencies and problem sizes
• Distributed memory is more scalable than shared memory

Poly-algorithm
• A collection of related algorithms that solve the same problem
• Each member of the collection is fastest for a subset of the problem domain
• The problem domain is described by
  – concurrency
  – problem size
  – memory requirements
  – emphasis on space or speed
• Relative speed changes when the poly-algorithm is ported
• Discovering which to use in a given situation a priori is the current research challenge

Load Balancing
• MIMD machines with data-dependent workloads lead in many situations to unbalanced loads, even for “regular algorithms”
• Static load balancing
  – choice of data distributions
• Dynamic load balancing
  – reorganization of data distributions
  – retasking of processing units when they finish early
• Task Migration
  – moving both code and data when appropriate across a system
  – seek to use unused cycles
  – seek to escape from busy machines

Overlapping Communication and Computation
• Communication is a dead loss associated with parallelism
• Hiding communication “behind” computation is important
• If sufficient processor memory bandwidth exists, this can be done, provided the communication network allows asynchronous transfers while other computation is ongoing
• The maximum improvement of performance is a factor of 2, when the original I/O and computation times were exactly equal.
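The factor-of-2 bound follows from a simple timing model, sketched below in C. The model is an assumption made for illustration: without overlap a step costs the sum of the two times, and with ideal overlap it costs only the longer of the two. The function names are hypothetical, and reading the slide's "I/O" time as the communication time is also an assumption.

```c
#include <assert.h>

/* Step time when computation and communication run back to back. */
double time_serialized(double t_comp, double t_comm)
{
    return t_comp + t_comm;
}

/* Step time under ideal overlap: the shorter phase is hidden
 * completely behind the longer one. */
double time_overlapped(double t_comp, double t_comm)
{
    return t_comp > t_comm ? t_comp : t_comm;
}

/* Improvement factor from overlapping; never exceeds 2 in this model. */
double overlap_gain(double t_comp, double t_comm)
{
    assert(t_comp >= 0.0 && t_comm >= 0.0 && t_comp + t_comm > 0.0);
    return time_serialized(t_comp, t_comm) / time_overlapped(t_comp, t_comm);
}
```

In this model the gain is (a + b) / max(a, b), which reaches 2 only when the two times are equal and falls toward 1 as one of them dominates.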

Performance Metrics, I.
• Speedup
  – A measure of the time to solution on two systems: a sequential system time divided by a parallel system time.
• Efficiency
  – A measure of the fraction of speedup achieved, compared to the ideal (Speedup / number of processors)
• Scaled speedup
  – A metric in which the problem size increases as the concurrency increases, in order to skirt Amdahl’s law, and measures constant problem size per processor, or else constant memory use
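The first two metrics translate directly into arithmetic on measured run times. A minimal sketch in C follows; the function names are illustrative, and the timings are assumed to be wall-clock measurements of the same problem.

```c
#include <assert.h>

/* Speedup: sequential time divided by parallel time. */
double speedup(double t_seq, double t_par)
{
    assert(t_seq >= 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Efficiency: achieved speedup as a fraction of the ideal speedup p. */
double efficiency(double t_seq, double t_par, int p)
{
    assert(p >= 1);
    return speedup(t_seq, t_par) / p;
}
```

For example, a run that takes 100 s sequentially and 25 s on 8 processors has a speedup of 4 and an efficiency of 0.5.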

Partitioning
• Both computations and the data on which computations are performed can be partitioned
• Domain decomposition
  – data is partitioned into smaller pieces
  – computations are performed on the smaller datasets
• Functional decomposition
  – computation is partitioned instead of data
  – for a good functional decomposition, datasets should be disjoint

Domain decomposition
• Most common approach to obtain parallelism
• Each process works on a local subset of the global data
• Several types of decompositions are possible depending on the application
• Decomposition can be done manually or using tools, depending on the complexity of the dataset
• Good load balancing is key to high performance
  – idle time must be minimized in each process
  – each process must communicate equal amounts of data

Decomposition for Structured Grids
[Figure: 1-D, 2-D, and 3-D decompositions; blocked and block-scattered distributions]
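For a 1-D grid, the blocked and block-scattered distributions named above can be written as index arithmetic. The sketch below is in C and makes two assumptions that the slide does not state: a blocked distribution hands the first `n % p` processes one extra element, and a block-scattered distribution deals fixed-size blocks to processes round-robin. Function names are hypothetical.

```c
#include <assert.h>

/* Blocked: process `rank` of `p` owns the contiguous half-open index
 * range [*lo, *hi) of an n-element grid; sizes differ by at most one. */
void block_range(int n, int p, int rank, int *lo, int *hi)
{
    assert(n >= 0 && p >= 1 && rank >= 0 && rank < p);
    int base = n / p;   /* elements every process receives */
    int rem = n % p;    /* leftover elements, one each to the first ranks */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

/* Block-scattered: blocks of b consecutive elements are dealt to the
 * p processes in round-robin order; returns the owner of element i. */
int block_scattered_owner(int i, int b, int p)
{
    assert(i >= 0 && b >= 1 && p >= 1);
    return (i / b) % p;
}
```

With 10 elements on 3 processes, the blocked ranges are [0,4), [4,7), and [7,10). With blocks of 2 on 3 processes, elements 0-1 go to process 0, 2-3 to process 1, 4-5 to process 2, and 6-7 back to process 0.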

Decomposition for Unstructured Grids
• Lack a straightforward, algebraic way to partition
• Resort to geometric or graph-theoretical methods, e.g.
  – Recursive coordinate bisection (RCB)
  – RCB works poorly for highly stretched grids, so...
[Figure: recursive bisection of a grid among processing elements PE 0 to PE 3]
  – Recursive graph bisection (RGB)
    • A mesh can be viewed as an undirected graph
    • Instead of Euclidean distance, metric is graph distance
    • Do recursive bisection on the graph, not the coordinates
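Recursive coordinate bisection can be sketched in a few lines of C for 2-D points. This is one common formulation, assumed here rather than taken from the tutorial: sort the points along the longer side of their bounding box, cut at the position that splits the point count in proportion to the number of parts on each side, and recurse. The `Point` type and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double x, y; int part; } Point;

static int cmp_x(const void *a, const void *b)
{
    double d = ((const Point *)a)->x - ((const Point *)b)->x;
    return (d > 0) - (d < 0);
}

static int cmp_y(const void *a, const void *b)
{
    double d = ((const Point *)a)->y - ((const Point *)b)->y;
    return (d > 0) - (d < 0);
}

/* Assigns pts[0..n-1] to parts first_part .. first_part+nparts-1.
 * The array is reordered in place as a side effect of sorting. */
void rcb(Point *pts, int n, int first_part, int nparts)
{
    assert(n >= 0 && nparts >= 1);
    if (nparts == 1 || n == 0) {
        for (int i = 0; i < n; i++)
            pts[i].part = first_part;
        return;
    }
    /* Bounding box decides which coordinate to bisect. */
    double xmin = pts[0].x, xmax = pts[0].x;
    double ymin = pts[0].y, ymax = pts[0].y;
    for (int i = 1; i < n; i++) {
        if (pts[i].x < xmin) xmin = pts[i].x;
        if (pts[i].x > xmax) xmax = pts[i].x;
        if (pts[i].y < ymin) ymin = pts[i].y;
        if (pts[i].y > ymax) ymax = pts[i].y;
    }
    qsort(pts, (size_t)n, sizeof *pts,
          (xmax - xmin >= ymax - ymin) ? cmp_x : cmp_y);

    /* Split the points in proportion to the parts on each side. */
    int left_parts = nparts / 2;
    int left_n = (int)((long long)n * left_parts / nparts);
    rcb(pts, left_n, first_part, left_parts);
    rcb(pts + left_n, n - left_n, first_part + left_parts,
        nparts - left_parts);
}
```

Because the cut is purely geometric, mesh connectivity is ignored, which is consistent with the remark that RCB behaves poorly on highly stretched grids. Production partitioners such as those cited later in the tutorial use more refined methods.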

Decomposition for Unstructured Grids
• Recursive spectral bisection (RSB)
  – solve an eigenproblem for metric
  – very high quality partitions, but expen$ive
• Multilevel variants of the above
  – very efficient, gives excellent partitions
• See Metis, Chaco
  – http://www-users.cs.umn.edu/~karypis/metis
  – http://www.cs.sandia.gov/CRF/chac_p2.html

Establishing Communication
• After partitioning, the next step is to set up communication between the different partitions
• Based on the application, communication can be
  – structured / unstructured
  – synchronous / asynchronous
  – static / dynamic
  – nearest neighbor / collective
• Key to high performance
  – avoid excessive communication
  – reduce communication with replicated computation
  – overlap computation and communication
  – send few large messages instead of many small messages

MPI

Message Passing Interface (MPI)
• A message-passing library specification
  – Message-passing model
  – Not a compiler specification
  – Not a specific product
• For parallel computers, clusters, and heterogeneous networks
• Designed to aid the development of portable parallel software libraries
• Designed to provide access to advanced parallel hardware for
  – End users
  – Library writers
  – Tool developers

Message Passing Interface - MPI
• MPI-1 standard widely accepted by vendors and programmers
  – MPI implementations available on most modern platforms
  – Huge number of MPI applications deployed
  – Several tools exist to trace and tune MPI applications
• MPI provides a rich set of functionality to support library writers, tool developers, and application programmers

MPI Salient Features
• Point-to-point communication
• Collective communication on process groups
• Communicators and groups for safe communication
• User-defined datatypes
• Virtual topologies
• Support for profiling

A First MPI Program

C:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);     /* start the MPI environment */
        printf("Hello World!\n");
        MPI_Finalize();             /* shut the MPI environment down */
        return 0;
    }

Fortran:

          program main
          include 'mpif.h'
          integer ierr
          call MPI_INIT(ierr)
          print *, 'Hello world!'
          call MPI_FINALIZE(ierr)
          end

Starting the MPI Environment
• MPI_INIT ( )
  Initializes the MPI environment. This function must be called and must be the first MPI function called in a program (exception: MPI_INITIALIZED).
Syntax
    int MPI_Init ( int *argc, char ***argv )

    MPI_INIT ( IERROR )
    INTEGER IERROR

Exiting the MPI Environment
• MPI_FINALIZE ( )
  Cleans up all MPI state. Once this routine has been called, no MPI routine (even MPI_INIT) may be called.
Syntax
    int MPI_Finalize ( );

    MPI_FINALIZE ( IERROR )
    INTEGER IERROR

C and Fortran Language Considerations, I.
• MPI_INIT: The C version accepts the argc and argv variables that are provided as arguments to main ( )
• Error codes: Almost all MPI Fortran subroutines have an integer return code as their last argument. Almost all C functions return an integer error code
• Types: Opaque objects are given type names in C. Opaque objects are usually of type INTEGER in Fortran (exception: binary-valued variables are of type LOGICAL)
• Inter-language interoperability is not guaranteed

Communicator
• Communication in MPI takes place with respect to communicators
• MPI_COMM_WORLD is one such predefined communicator (something of type “MPI_COMM”; the C type is MPI_Comm) and contains group and context information
• MPI_COMM_RANK and MPI_COMM_SIZE return information based on the communicator passed in as the first argument
• Processes may belong to many different communicators

Environment Setup

Using the CIS Cluster
• Login
  – ssh everest00.cis.uab.edu (you must be logged into a CIS machine; otherwise you have to log in to moat.cis.uab.edu first)
• Compile
  – mpicc -o program program.c
• Submit
  – qsub myscript.sge
• Monitor
  – qstat -u
• See User Guide for more details
  – http://www.cis.uab.edu/cs541/homework/instructions.pdf

Point-to-Point Communications

Sending and Receiving Messages
• Basic message passing process
[Figure: Process 0 sends from buffer A; Process 1 receives into buffer B]
• Questions
  – To whom is data sent?
  – Where is the data?
  – What type of data is sent?
  – How much data is sent?
  – How does the receiver identify it?

Message Organization in MPI
• Message is divided into data and envelope
• data
  – buffer
  – count
  – datatype
• envelope
  – process identifier (source/destination rank)
  – message tag
  – communicator

Generalizing the Buffer Description
• Specified in MPI by starting address, count, and datatype, where datatype is as follows:
  – Elementary (all C and Fortran datatypes)
  – Contiguous array of datatypes
  – Strided blocks of datatypes
  – Indexed array of blocks of datatypes
  – General structure
• Datatypes are constructed recursively
• Specifying application-oriented layout of data allows maximal use of special hardware
• Elimination of length in favor of count is clearer
  – Traditional: send 20 bytes
  – MPI: send 5 integers

MPI C Datatypes

MPI datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED_LONG     unsigned long int
MPI_UNSIGNED          unsigned int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              (none)
MPI_PACKED            (none)