Lecture Notes on Message Passing Interface - Parallel Computing (CS 432)
MPI Tutorial
Purushotham Bangalore, Ph.D.
Anthony Skjellum, Ph.D.
Department of Computer and
Information Sciences
University of Alabama at Birmingham
Overview
Message Passing Interface - MPI
– Point-to-point communication
– Collective communication
– Communicators
– Datatypes
– Topologies
– Inter-communicators
– Profiling
Message Passing Interface (MPI)
A message-passing library specification
– Message-passing model
– Not a compiler specification
– Not a specific product
For parallel computers, clusters, and heterogeneous networks
Designed to aid the development of portable parallel software libraries
Designed to provide access to advanced parallel hardware for
– End users
– Library writers
– Tool developers
Message Passing Interface - MPI
MPI-1 standard widely accepted by vendors and programmers
– MPI implementations available on most modern platforms
– Huge number of MPI applications deployed
– Several tools exist to trace and tune MPI applications
MPI provides a rich set of functionality to support library writers, tool developers, and application programmers
MPI Salient Features
Point-to-point communication
Collective communication on process groups
Communicators and groups for safe communication
User-defined datatypes
Virtual topologies
Support for profiling
A First MPI Program
C:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("Hello World!\n");
    MPI_Finalize();
    return 0;
}

Fortran:
      program main
      include 'mpif.h'
      integer ierr
      call MPI_INIT(ierr)
      print *, 'Hello world!'
      call MPI_FINALIZE(ierr)
      end
Starting the MPI Environment
MPI_INIT ( )
Initializes the MPI environment. This function must be called, and it must be the first MPI function called in a program (exception: MPI_INITIALIZED)
Syntax
int MPI_Init ( int *argc, char ***argv )
MPI_INIT ( IERROR )
INTEGER IERROR
Exiting the MPI Environment
MPI_FINALIZE ( )
Cleans up all MPI state. Once this routine has been called, no MPI routine (not even MPI_INIT) may be called
Syntax
int MPI_Finalize ( );
MPI_FINALIZE ( IERROR )
INTEGER IERROR
Example 1 (C)
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Process %d of %d is alive\n", rank, size);
    MPI_Finalize();
    return 0;
}
Communicator
Communication in MPI takes place with respect to communicators
MPI_COMM_WORLD is one such predefined communicator (something of type "MPI_Comm" in C) and contains group and context information
MPI_COMM_RANK and MPI_COMM_SIZE return information based on the communicator passed in as the first argument
Processes may belong to many different communicators
Environment Setup
Using the CIS Cluster
Login
– ssh everest00.cis.uab.edu
– Add the following lines at the end of your .bashrc file:
export PATH=/opt/mpipro/bin:${PATH}
export LD_LIBRARY_PATH=/opt/mpipro/lib64:${LD_LIBRARY_PATH}
– Log out and log in again
Compile
– mpicc -o program program.c
– mpic++ -o program program.cc
Submit
Monitor
See User Guide for more details
http://www.cis.uab.edu/ccl/resources/everest/EverestGridNodeUserGuide.php
Sample SGE script
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
#$ -pe mpi 4
MPI_DIR=/opt/mpipro/bin
EXE="/home/puri/examples/psum 1000"
$MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines $EXE
Point-to-Point Communications
Sending and Receiving Messages
Basic message passing process
[Figure: Process 0 sends from buffer A; Process 1 receives into buffer B]
Questions
– To whom is data sent?
– Where is the data?
– What type of data is sent?
– How much data is sent?
– How does the receiver identify it?
Message Organization in MPI
Message is divided into data and envelope
data
– buffer
– count
– datatype
envelope
– process identifier (source/destination rank)
– message tag
– communicator
Message Tag
Tags allow programmers to deal with the arrival of messages in an orderly manner
MPI guarantees that tag values from 0 to 32767 are valid; an implementation may support larger values
The actual upper bound on tag values is provided by the attribute MPI_TAG_UB
MPI_ANY_TAG can be used as a wildcard value in a receive
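The envelope and wildcard rules can be sketched as a small matching predicate. This is a plain C model with no MPI calls, so it runs without an MPI installation. The names envelope, envelope_matches, ANY_SOURCE_MODEL, and ANY_TAG_MODEL are hypothetical stand-ins, and the wildcard values are arbitrary; they are not the values an MPI implementation uses for MPI_ANY_SOURCE or MPI_ANY_TAG. The communicator is modeled as an integer context id, which is also an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG.
   The numeric values are arbitrary and chosen only for this model. */
#define ANY_SOURCE_MODEL (-1)
#define ANY_TAG_MODEL    (-2)

/* Envelope of a message: sending rank, tag, and the communicator
   (modeled here as an integer context id). */
typedef struct {
    int source;
    int tag;
    int context;
} envelope;

/* Returns true when a posted receive (want_source, want_tag, want_context)
   would match an incoming message with envelope msg.
   Source and tag may be wildcards; the context never is. */
bool envelope_matches(envelope msg, int want_source, int want_tag,
                      int want_context)
{
    /* A message that was actually sent carries no wildcards. */
    assert(msg.source >= 0 && msg.tag >= 0);

    if (msg.context != want_context)
        return false;
    if (want_source != ANY_SOURCE_MODEL && want_source != msg.source)
        return false;
    if (want_tag != ANY_TAG_MODEL && want_tag != msg.tag)
        return false;
    return true;
}
```

In this model a receive posted with both wildcards accepts any message in the same context, which is why the status fields are needed afterwards to learn the real source and tag.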
MPI Basic Send/Receive
Thus the basic (blocking) send has become:
MPI_Send ( start, count, datatype, dest, tag, comm )
And the receive has become:
MPI_Recv ( start, count, datatype, source, tag, comm, status )
The source, tag, and count of the message actually received can be retrieved from status
Bindings for Send and Receive
int MPI_Send ( void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm )

MPI_SEND ( BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR )
<type> BUF( * )
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERR

int MPI_Recv ( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )

MPI_RECV ( BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERR )
<type> BUF( * )
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS( MPI_STATUS_SIZE ), IERR
Getting Information About a Message
The following can be used to get information about a message
MPI_Status status;
MPI_Recv( ..., &status );
tag_of_received_message = status.MPI_TAG;
src_of_received_message = status.MPI_SOURCE;
MPI_Get_count( &status, datatype, &count );
MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive
The function MPI_GET_COUNT may be used to determine how much data of a particular type was received
Getting Information About a Message (Fortran)
The following can be used to get information about a message
      INTEGER status(MPI_STATUS_SIZE)
      call MPI_Recv( ..., status, ierr )
      tag_of_received_message = status(MPI_TAG)
      src_of_received_message = status(MPI_SOURCE)
      call MPI_Get_count(status, datatype, count, ierr)
MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive
The function MPI_GET_COUNT may be used to determine how much data of a particular type was received
Example-2 (Fortran)
      program main
      include 'mpif.h'
      integer rank, size, to, from, tag, count, i, ierr, src, dest
      integer st_source, st_tag, st_count, status(MPI_STATUS_SIZE)
      double precision data(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src = 0
C
      if (rank .eq. src) then
         to    = dest
         count = 100
C        The tag value is missing in the source; 2001 is an assumed
C        placeholder. Any valid tag works because the receiver uses
C        MPI_ANY_TAG.
         tag   = 2001
         do 10 i = 1, 100
 10         data(i) = i
         call MPI_SEND(data, count, MPI_DOUBLE_PRECISION, to,
     +                 tag, MPI_COMM_WORLD, ierr)
      else if (rank .eq. dest) then
         tag   = MPI_ANY_TAG
         count = 100
         from  = MPI_ANY_SOURCE
         call MPI_RECV(data, count, MPI_DOUBLE_PRECISION, from,
     +                 tag, MPI_COMM_WORLD, status, ierr)
         call MPI_GET_COUNT(status, MPI_DOUBLE_PRECISION,
     +                      st_count, ierr)
         st_source = status(MPI_SOURCE)
         st_tag    = status(MPI_TAG)
C
         print *, 'Status info: source = ', st_source,
     +            ' tag = ', st_tag, ' count = ', st_count
         print *, rank, ' received', (data(i),i=1,10)
      endif
      call MPI_FINALIZE(ierr)
      stop
      end
Example-2 (C), I.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank, size, dest;
    int to, src, from, count, tag;
    int st_count, st_source, st_tag;
    double data[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Process %d of %d is alive\n", rank, size);
    dest = size - 1;
    src = 0;
Non-Blocking Communication
Non-blocking operations return (immediately) "request handles" that can be waited on and queried
MPI_ISEND( start, count, datatype, dest, tag, comm, request )
MPI_IRECV( start, count, datatype, src, tag, comm, request )
MPI_WAIT( request, status )
Non-blocking operations allow overlapping computation and communication
One can also test without waiting using MPI_TEST
MPI_TEST( request, flag, status )
Anywhere you use MPI_Send or MPI_Recv, you can use the pair of MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait
Non-Blocking Send-Receive
[Figure: timeline of a non-blocking send (send side) and a non-blocking receive (receive side)]
– T0: MPI_Irecv (receive side); once receive is called @ T0, buffer unavailable to user
– T1: Returns
– T2: MPI_Isend (send side)
– T3: sender returns; buffer unavailable
– T4: MPI_Wait called
– T5: sender completes; buffer available after MPI_Wait
– T6: MPI_Wait
– T7: transfer finishes
– T8: MPI_Wait returns; receive buffer filled
– T9: Wait returns
Internal completion is soon followed by return of MPI_Wait
High-performance implementations offer low overhead for non-blocking calls
Multiple Completions
It is often desirable to wait on multiple requests
An example is a worker/manager program, where the manager waits for one or more workers to send it a message
MPI_WAITALL( count, array_of_requests, array_of_statuses )
MPI_WAITANY( count, array_of_requests, index, status )
MPI_WAITSOME( incount, array_of_requests, outcount, array_of_indices, array_of_statuses )
There are corresponding test versions of each of these: MPI_Testall, MPI_Testany, and MPI_Testsome
Probing the Network for Messages
MPI_PROBE and MPI_IPROBE allow the user to check for incoming messages without actually receiving them
MPI_IPROBE returns "flag == TRUE" if there is a matching message available. MPI_PROBE will not return until there is a matching message available
MPI_IPROBE ( source, tag, communicator, flag, status )
MPI_PROBE ( source, tag, communicator, status )
Message Completion and Buffering
A send has completed when the user-supplied buffer can be reused
*buf = 3;
MPI_Send( buf, 1, MPI_INT, ... );
*buf = 4;    /* OK, receiver will always receive 3 */

*buf = 3;
MPI_Isend( buf, 1, MPI_INT, ... );
*buf = 4;    /* Undefined whether the receiver will get 3 or 4 */
MPI_Wait( ... );
The send mode used (standard, ready, synchronous, buffered) may provide additional information
Just because the send completes does not mean that the receive has completed
– Message may be buffered by the system
– Message may still be in transit
Example-3 (Fortran)
      program main
      include 'mpif.h'
      integer ierr, rank, size, tag, num, next, from
      integer stat1(MPI_STATUS_SIZE), stat2(MPI_STATUS_SIZE)
      integer req1, req2
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      tag = 201
      next = mod(rank + 1, size)
      from = mod(rank + size - 1, size)
      if (rank .EQ. 0) then
         print *, "Enter the number of times around the ring"
         read *, num
         print *, "Process 0 sends", num, " to 1"
         call MPI_ISEND(num, 1, MPI_INTEGER, next, tag,
     +                  MPI_COMM_WORLD, req1, ierr)
         call MPI_WAIT(req1, stat1, ierr)
      endif
 10   continue
      call MPI_IRECV(num, 1, MPI_INTEGER, from, tag,
     +               MPI_COMM_WORLD, req2, ierr)
      call MPI_WAIT(req2, stat2, ierr)
      print *, "Process ", rank, " received ", num, " from ", from
      if (rank .EQ. 0) then
         num = num - 1
         print *, "Process 0 decremented num"
      endif
      print *, "Process", rank, " sending", num, " to", next
      call MPI_ISEND(num, 1, MPI_INTEGER, next, tag,
     +               MPI_COMM_WORLD, req1, ierr)
      call MPI_WAIT(req1, stat1, ierr)
      if (num .EQ. 0) then
         print *, "Process", rank, " exiting"
         goto 20
      endif
      goto 10
 20   if (rank .EQ. 0) then
         call MPI_IRECV(num, 1, MPI_INTEGER, from, tag,
     +                  MPI_COMM_WORLD, req2, ierr)
         call MPI_WAIT(req2, stat2, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end
Example-3 (C), I.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int num, rank, size, tag, next, from;
    MPI_Status status1, status2;
    MPI_Request req1, req2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    tag = 201;
    next = (rank + 1) % size;
    from = (rank + size - 1) % size;
    if (rank == 0) {
        printf("Enter the number of times around the ring: ");
        scanf("%d", &num);
        printf("Process %d sending %d to %d\n", rank, num, next);
        MPI_Isend(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD, &req1);
        MPI_Wait(&req1, &status1);
    }
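The neighbor arithmetic that Example 3 uses for next and from can be checked in isolation. The sketch below is plain C with no MPI calls; the function names ring_next and ring_prev are hypothetical and do not appear in the original example.

```c
#include <assert.h>

/* Rank that receives from `rank` in a ring of `size` processes. */
int ring_next(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Rank that sends to `rank`. Adding `size` before taking the
   remainder keeps the left operand non-negative when rank is 0. */
int ring_prev(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}
```

With four processes, rank 3 sends to rank 0 and rank 0 receives from rank 3, which is what closes the ring.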
Buffered Send-Receive
[Figure: timeline of a buffered send (send side) and a blocking receive (receive side)]
– T0: MPI_Bsend (send side); data is copied from the user buffer to the attached buffer
– T1: Sender returns; buffer can be reused
– T2: MPI_Recv (receive side); once receive is called @ T2, buffer unavailable to user
– T3: Transfer starts
– T4: Transfer complete; receiver returns @ T4, buffer filled
Internal completion is soon followed by return of MPI_Recv
Ready Send-Receive
[Figure: timeline of a ready send (send side) and a blocking receive (receive side)]
– T0: MPI_Recv (receive side); once receive is called @ T0, buffer unavailable to user
– T1: MPI_Rsend (send side)
– T2: Sender returns; buffer can be reused
– T3: Transfer starts
– T4: Transfer complete; receiver returns @ T4, buffer filled
Internal completion is soon followed by return of MPI_Recv
Other Point to Point Features
Persistent communication requests
– Saves arguments of a communication call and reduces the overhead from subsequent calls
– The INIT call takes the original argument list of a send or receive call and creates a corresponding communication request (e.g., MPI_SEND_INIT, MPI_RECV_INIT)
– The START call uses the communication request to start the corresponding operation (e.g., MPI_START, MPI_STARTALL)
– The REQUEST_FREE call frees the persistent communication request (MPI_REQUEST_FREE)
Send-Receive operations
– MPI_SENDRECV, MPI_SENDRECV_REPLACE
Canceling pending communication
– MPI_CANCEL
Persistent Communication Example: Example 4
Example 3 using persistent communication requests
MPI_Recv_init(&num, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &req2);
MPI_Send_init(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD, &req1);
do {
    MPI_Start(&req2);
    MPI_Wait(&req2, &status2);
    printf("Process %d received %d from process %d\n", rank, num, from);
    if (rank == 0) {
        num--;
        printf("Process 0 decremented number\n");
    }
    printf("Process %d sending %d to %d\n", rank, num, next);
    MPI_Start(&req1);
    MPI_Wait(&req1, &status1);
} while (num != 0);
Lab 2
Objective: To write a function to send a message from process 0 to all other processes.
You should assume that all processes in the communicator will call your function "at the same time."
The function should look something like:
user_function(void *buffer, int count, MPI_Datatype datatype, MPI_Comm comm)
Process 0 should use a loop
for dest = 1 ... size-1
    MPI_Isend(buffer, count, datatype, dest, tag, comm, &req[i]);
MPI_WAITALL should be used to wait for the completion of all the sends.
Processes 1 through size-1 should use MPI_IRECV and MPI_WAIT to receive the message.
Lab 2 - Driver Program (Fortran)
      program main
      include 'mpif.h'
      integer ierr, rank, size, number
      number = 0
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
      if (rank .eq. 0) then
         print *, 'Enter the number to broadcast: '
         read *, number
      endif
      call user_broadcast(number, 1, MPI_INTEGER, MPI_COMM_WORLD)
      print *, 'In Process ', rank, ' the number is ', number
      call MPI_Finalize(ierr)
      end
Lab 2 - Driver Program (C)
#include <stdio.h>
#include <mpi.h>

/* Function to be written in this lab; the void return type is an assumption. */
void user_broadcast(void *buffer, int count, MPI_Datatype datatype, MPI_Comm comm);

int main(int argc, char **argv)
{
    int rank, size;
    int number = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        printf("[%d] Enter the number to broadcast: ", rank);
        scanf("%d", &number);
    }
    user_broadcast(&number, 1, MPI_INT, MPI_COMM_WORLD);
    printf("In Process %d the number is %d\n", rank, number);
    MPI_Finalize();
    return 0;
}
Collective Communications
Allreduce and Allgather
Allgather
Process rank   Send buffer   Receive buffer
0              A             A B C D
1              B             A B C D
2              C             A B C D
3              D             A B C D
Allreduce
Process rank   Send buffer   Receive buffer
0              A             X
1              B             X
2              C             X
3              D             X
X = A op B op C op D
Alltoall and Scan
Scan
Process rank   Send buffer   Receive buffer
0              A             W = A
1              B             X = A op B
2              C             Y = A op B op C
3              D             Z = A op B op C op D
Alltoall
Process rank   Send buffer    Receive buffer
0              A0 A1 A2 A3    A0 B0 C0 D0
1              B0 B1 B2 B3    A1 B1 C1 D1
2              C0 C1 C2 C3    A2 B2 C2 D2
3              D0 D1 D2 D3    A3 B3 C3 D3
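The data movement in the Allreduce, Scan, and Alltoall figures can be mimicked serially. The sketch below is a plain C model with no MPI calls: array index r stands for process rank r, the operation op is fixed to integer addition, and the function names model_scan_sum, model_allreduce_sum, and model_alltoall are hypothetical. It only illustrates what each rank ends up holding, not how an MPI library implements the collectives.

```c
#include <assert.h>

/* Inclusive scan with op = +: recv[r] = send[0] + ... + send[r]. */
void model_scan_sum(const int *send, int *recv, int nprocs)
{
    assert(nprocs > 0);
    int running = 0;
    for (int r = 0; r < nprocs; r++) {
        running += send[r];
        recv[r] = running;
    }
}

/* Allreduce with op = +: every rank receives the same total. */
void model_allreduce_sum(const int *send, int *recv, int nprocs)
{
    assert(nprocs > 0);
    int total = 0;
    for (int r = 0; r < nprocs; r++)
        total += send[r];
    for (int r = 0; r < nprocs; r++)
        recv[r] = total;
}

/* Alltoall with one item per block. send and recv are nprocs x nprocs
   arrays stored row by row, one row per rank. Item j of rank i's send
   row becomes item i of rank j's receive row (a transpose). */
void model_alltoall(const int *send, int *recv, int nprocs)
{
    assert(nprocs > 0);
    for (int i = 0; i < nprocs; i++)
        for (int j = 0; j < nprocs; j++)
            recv[j * nprocs + i] = send[i * nprocs + j];
}
```

For send values 1, 2, 3, 4 the scan model yields 1, 3, 6, 10, while the allreduce model gives every rank 10.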
MPI Collective Routines
Several routines:
MPI_ALLGATHER        MPI_ALLGATHERV   MPI_BCAST
MPI_ALLTOALL         MPI_ALLTOALLV    MPI_REDUCE
MPI_GATHER           MPI_GATHERV      MPI_SCATTER
MPI_REDUCE_SCATTER   MPI_SCAN         MPI_SCATTERV
MPI_ALLREDUCE
"All" versions deliver results to all participating processes
"V" versions allow the chunks to have different sizes
MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combination functions
Built-In Collective Computation Operations
MPI Name      Operation
MPI_MAX       Maximum
MPI_MIN       Minimum
MPI_PROD      Product
MPI_SUM       Sum
MPI_LAND      Logical and
MPI_LOR       Logical or
MPI_LXOR      Logical exclusive or (xor)
MPI_BAND      Bitwise and
MPI_BOR       Bitwise or
MPI_BXOR      Bitwise xor
MPI_MAXLOC    Maximum value and location
MPI_MINLOC    Minimum value and location
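The "value and location" operations combine pairs rather than single numbers. The sketch below models the combining rule for a maximum-with-location reduction in plain C, with no MPI calls. The type value_loc and the functions maxloc_combine and maxloc_reduce are hypothetical names for this illustration; ties are resolved toward the smaller index, which is the rule MPI_MAXLOC uses.

```c
#include <assert.h>

/* A value together with the index (for example, the rank) it came from. */
typedef struct {
    double value;
    int    index;
} value_loc;

/* Keep the pair with the larger value; on a tie keep the smaller index. */
value_loc maxloc_combine(value_loc a, value_loc b)
{
    if (a.value > b.value) return a;
    if (b.value > a.value) return b;
    return (a.index < b.index) ? a : b;
}

/* Serial reduction over one contribution per modeled process. */
value_loc maxloc_reduce(const value_loc *contrib, int nprocs)
{
    assert(nprocs > 0);
    value_loc best = contrib[0];
    for (int r = 1; r < nprocs; r++)
        best = maxloc_combine(best, contrib[r]);
    return best;
}
```

A minimum-with-location rule would follow the same shape with the comparisons on value reversed.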
User-Defined Collective Computation Operations
MPI_OP_CREATE( user_function, commute_flag, user_op )
MPI_OP_FREE( user_op )
The user_function should look like this:
user_function( invec, inoutvec, len, datatype )
The user_function should perform the following:
C:
for (i = 0; i < len; i++)
    inoutvec[i] = invec[i] op inoutvec[i];
Fortran:
do i = 1, len
   inoutvec(i) = invec(i) op inoutvec(i)
end do
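As a concrete instance of the element-wise loop, the sketch below implements a hypothetical "largest magnitude" operation on integers. It is plain C with typed arguments so that it compiles without MPI; the name absmax_kernel is an assumption of this sketch. In an MPI program the same loop would sit inside a function with the user-function signature (void *invec, void *inoutvec, int *len, MPI_Datatype *datatype) and be registered through MPI_Op_create.

```c
#include <assert.h>
#include <stdlib.h>

/* Element-wise combine step: inoutvec[i] = invec[i] op inoutvec[i],
   where op keeps the larger absolute value. The result is written
   back into inoutvec, as the user_function contract requires. */
void absmax_kernel(const int *invec, int *inoutvec, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++) {
        int a = abs(invec[i]);
        int b = abs(inoutvec[i]);
        inoutvec[i] = (a > b) ? a : b;
    }
}
```

Because this operation gives the same result regardless of argument order, it would be registered with the commute flag set to true.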
Synchronization
MPI_BARRIER ( comm )
Function blocks until all processes in "comm" call it
Often not needed at all in many message-passing codes
When needed, mostly for highly asynchronous programs or ones with speculative execution
Example 5 (Fortran)
      program main
      include 'mpif.h'
      integer iwidth, iheight, numpixels, i, val, my_count, ierr
      integer rank, comm_size, sum, my_sum
      real rms
      character recvbuf(65536), pixels(65536)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, comm_size, ierr)
      if (rank.eq.0) then
         iheight = 256
         iwidth = 256
         numpixels = iwidth * iheight
C        Read the image (mod keeps the argument of char in 0..255)
         do i = 1, numpixels
            pixels(i) = char(mod(i, 256))
         enddo
C        Calculate the number of pixels in each sub image
         my_count = numpixels / comm_size
      endif
C     Broadcasts my_count to all the processes
      call MPI_BCAST(my_count, 1, MPI_INTEGER, 0,
     $     MPI_COMM_WORLD, ierr)
C     Scatter the image
      call MPI_SCATTER(pixels, my_count, MPI_CHARACTER, recvbuf,
     $     my_count, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)
C     Take the sum of the squares of the partial image
      my_sum = 0
      do i = 1, my_count
         my_sum = my_sum + ichar(recvbuf(i))*ichar(recvbuf(i))
      enddo
C     Find the global sum of the squares
      call MPI_REDUCE(my_sum, sum, 1, MPI_INTEGER, MPI_SUM, 0,
     $     MPI_COMM_WORLD, ierr)
C     rank 0 calculates the root mean square
      if (rank.eq.0) then
         rms = sqrt(real(sum)/real(numpixels))
         print *, 'RMS = ', rms
      endif
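The arithmetic of Example 5 can be checked without MPI by letting a loop stand in for the scatter, the per-process work, and the sum reduction. The sketch below is plain C; the function names partial_sum_of_squares and modeled_global_sum are hypothetical, and a 64-bit accumulator is used here as a precaution that the original example does not take.

```c
#include <assert.h>

/* Sum of squares over one rank's chunk (the per-process loop). */
long long partial_sum_of_squares(const unsigned char *chunk, int count)
{
    long long s = 0;
    for (int i = 0; i < count; i++)
        s += (long long)chunk[i] * chunk[i];
    return s;
}

/* Serial stand-in for scatter + local work + sum reduction.
   Modeled rank r works on pixels[r*my_count .. (r+1)*my_count - 1]. */
long long modeled_global_sum(const unsigned char *pixels, int numpixels,
                             int nprocs)
{
    assert(nprocs > 0 && numpixels >= 0);
    int my_count = numpixels / nprocs;   /* same integer division as Example 5 */
    long long sum = 0;                   /* plays the role of the reduce result */
    for (int r = 0; r < nprocs; r++)
        sum += partial_sum_of_squares(pixels + r * my_count, my_count);
    return sum;
}
```

When numpixels is not a multiple of the process count, the trailing numpixels mod nprocs pixels are never assigned to any rank in this scheme, so the result differs from the full sum; an equal-count scatter has exactly this limitation.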
Lab 3
Modify example 5 to handle vectors of uneven sizes by using MPI_SCATTERV and MPI_GATHERV operations instead of MPI_SCATTER and MPI_GATHER operations respectively.
Hints:
– Refer to the MPI function index for the description of MPI_SCATTERV and MPI_GATHERV
– Instead of my_count, broadcast numpixels
– Compute the counts array and the displacements array in each process (print the arrays while debugging)
– Replace the MPI_SCATTER and MPI_GATHER functions with MPI_SCATTERV and MPI_GATHERV functions respectively.
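One possible way to build the counts and displacements arrays is sketched below in plain C, with no MPI calls. The function name block_partition and the policy of giving the first (total mod nprocs) ranks one extra item are assumptions of this sketch; other splits are equally valid as long as the counts add up to the total.

```c
#include <assert.h>

/* Fill counts[] and displs[] for splitting `total` items over `nprocs`
   ranks as evenly as possible. Both arrays must hold nprocs entries.
   displs[r] is the offset of rank r's first item in the full array. */
void block_partition(int total, int nprocs, int *counts, int *displs)
{
    assert(nprocs > 0 && total >= 0);
    int base   = total / nprocs;   /* items every rank gets */
    int extra  = total % nprocs;   /* leftover items, one each to the first ranks */
    int offset = 0;
    for (int r = 0; r < nprocs; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

For 10 items on 3 processes this produces counts 4, 3, 3 and displacements 0, 4, 7; arrays of this form are what the "V" collectives take in place of a single count.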
Communicators
Communicators
All MPI communication is based on a communicator, which contains a context and a group
Contexts define a safe communication space for message passing
Contexts can be viewed as system-managed tags
Contexts allow different libraries to co-exist
The group is just a set of processes
Processes are always referred to by a unique rank within the group
Pre-Defined Communicators
MPI-1 supports three pre-defined communicators:
– MPI_COMM_WORLD
– MPI_COMM_NULL
– MPI_COMM_SELF
Only MPI_COMM_WORLD is used for communication
Predefined communicators are needed to "get things going" in MPI
Uses of MPI_COMM_WORLD
Contains all processes available at the time the program was started
Provides initial safe communication space
Simple programs communicate with MPI_COMM_WORLD
Complex programs duplicate and subdivide copies of MPI_COMM_WORLD
MPI_COMM_WORLD provides the basic unit of MIMD concurrency and execution lifetime for MPI-
Uses of MPI_COMM_NULL
An invalid communicator
Cannot be used as input to any operations that expect a communicator
Used as an initial value of communicators to be defined
Returned as a result in certain cases
Value that communicator handles are set to when freed
Uses of MPI_COMM_SELF
Contains only the local process
Not normally used for communication (since a process could only communicate with itself)
Holds certain information:
– hanging cached attributes appropriate to the process
– providing a singleton entry for certain calls (especially MPI-2)
Duplicating a Communicator: MPI_COMM_DUP
It is a collective operation. All processes in the original communicator must call this function
Duplicates the communicator group, allocates a new context, and selectively duplicates cached attributes
The resulting communicator is not an exact duplicate. It is a whole new separate communication universe with similar structure
int MPI_Comm_dup( MPI_Comm comm, MPI_Comm *newcomm )
MPI_COMM_DUP( COMM, NEWCOMM, IERR )
INTEGER COMM, NEWCOMM, IERR