www.cineca.it
The basics of parallel programming
Carlo Cavazzoni, HPC department, CINECA
Message Passing
Cooperation between two processes
[Figure: process 0 asks process 1 "May I send?", process 1 answers "Yes" while listening, and the data are then transferred from process 0 to process 1 over time.]
Two fundamental functions: send(message) and receive(message)
MPI
Message Passing Interface
Goals of the MPI standard
To provide source-code portability
To allow efficient implementation
Domain decomposition
Data are divided into pieces of approximately the same size and mapped to different processors. Each processor works only on its local data. The resulting code has a single flow.
Functional decomposition
The problem is decomposed into a large number of smaller tasks, and the tasks are assigned to processors as they become available (Client-Server / Master-Slave paradigm).
two basic models
Basic Features of MPI Programs
An MPI program consists of multiple instances of a serial program that communicate by library calls.
Calls may be roughly divided into four classes:
Calls used to initialize, manage, and terminate communications
Calls used to communicate between pairs of processors. (Pair communication)
Calls used to communicate among groups of processors. (Collective communication)
Calls to create data types.
A First Program: Hello World!
Fortran
PROGRAM hello
  INCLUDE 'mpif.h'
  INTEGER err
  CALL MPI_INIT(err)
  PRINT *, 'hello world!'
  CALL MPI_FINALIZE(err)
END
C
#include <stdio.h>
#include <mpi.h>
int main (int argc, char * argv[]) {
int err;
err = MPI_Init(&argc, &argv);
printf("Hello world!\n");
err = MPI_Finalize();
}
Header files
Every subprogram that contains calls to MPI subroutines must include the MPI header file
C:
#include<mpi.h>
Fortran:
include 'mpif.h'
The header file contains definitions of MPI constants, MPI
types and functions
MPI Communicator
In MPI it is possible to divide the total number of processes into groups, called communicators.
The Communicator is a variable identifying a group of
processes that are allowed to communicate with each other.
The communicator that includes all processes is called MPI_COMM_WORLD.
MPI_COMM_WORLD is the default communicator (automatically defined).
All MPI communication subroutines have a communicator argument.
The programmer can define many communicators at the same time
[Figure: MPI_COMM_WORLD, a communicator containing processes with ranks 0-7.]
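As an illustration of user-defined communicators (not part of the original slides), the following minimal C sketch splits MPI_COMM_WORLD into two sub-communicators with MPI_Comm_split; even-ranked and odd-ranked processes end up in separate groups, and the color/key values chosen here are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int world_rank, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color 0 for even ranks, 1 for odd ranks: two sub-communicators */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    MPI_Comm_rank(sub_comm, &sub_rank);
    printf("world rank %d has rank %d in sub-communicator %d\n",
           world_rank, sub_rank, color);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}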
MPI function format
C:
Error = MPI_Xxxxx(parameter,...);
MPI_Xxxxx(parameter,...);
Fortran:
CALL MPI_XXXXX(parameter, IERROR)
Initialising MPI
C:
int MPI_Init(int *argc, char ***argv)
Fortran:
INTEGER IERROR
CALL MPI_INIT(IERROR)
Must be the first MPI call; initializes the message passing routines
Communicator Size
How many processors are associated with a communicator?
C:
int MPI_Comm_size(MPI_Comm comm, int *size)
Fortran:
INTEGER COMM, SIZE, IERR
CALL MPI_COMM_SIZE(COMM, SIZE, IERR)
OUTPUT: SIZE
Process Rank
How do you identify different processes?
What is the ID of a processor in a group?
C:
int MPI_Comm_rank(MPI_Comm comm, int *rank)
Fortran:
INTEGER COMM, RANK, IERR
CALL MPI_COMM_RANK(COMM, RANK, IERR)
OUTPUT: RANK
rank is an integer that identifies the Process inside the communicator comm
MPI_COMM_RANK is used to find the rank (the name or identifier) of the Process
Communicator Size and Process Rank / 1
[Figure: eight processes P0-P7 in a communicator of SIZE = 8; the highlighted process has RANK = 2.]
SIZE is the number of processes associated with the communicator; it answers the question "how many processes are contained within a communicator?"
RANK is the index of the process within the group associated with the communicator (rank = 0, 1, ..., SIZE-1). The rank is used to identify the source and destination of a message.
Initializing and Exiting MPI
Initializing the MPI environment
C:
int MPI_Init(int *argc, char ***argv);
Fortran:
INTEGER IERR
CALL MPI_INIT(IERR)
Finalizing the MPI environment
C:
int MPI_Finalize();
Fortran:
INTEGER IERR
CALL MPI_FINALIZE(IERR)
These two subprograms must be called by every process; no other MPI calls can appear before MPI_INIT or after MPI_FINALIZE.
A Template for Fortran MPI Programs
PROGRAM template
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
!!! INSERT YOUR PARALLEL CODE HERE !!!
  CALL MPI_FINALIZE(ierr)
END
A Template for C MPI programs
#include <stdio.h>
#include <mpi.h>
int main (int argc, char * argv[]) {
int err, nproc, myid;
err = MPI_Init(&argc, &argv);
err = MPI_Comm_size(MPI_COMM_WORLD, &nproc);
err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
/*** INSERT YOUR PARALLEL CODE HERE ***/
err = MPI_Finalize();
}
Example
PROGRAM hello
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER :: myPE, totPEs, i, ierr
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK( MPI_COMM_WORLD, myPE, ierr )
  CALL MPI_COMM_SIZE( MPI_COMM_WORLD, totPEs, ierr )
  PRINT *, 'myPE is ', myPE, ' of total ', totPEs, ' PEs'
  CALL MPI_FINALIZE(ierr)
END PROGRAM hello
MyPE is 1 of total 4 PEs
MyPE is 0 of total 4 PEs
MyPE is 3 of total 4 PEs
MyPE is 2 of total 4 PEs
Output (4 Procs)
Point to Point Communication
This is the fundamental communication facility provided by the MPI library: communication between 2 processes.
It is conceptually simple: source process A sends a message to destination process B, and B receives the message from A.
Communication takes place within a communicator.
Source and destination are identified by their rank in the communicator.
Communicator
[Figure: a communicator with ranks 0-7; one process is the source and another the destination of a message.]
The Message
MPI Data types
Basic types
Derived types
Derived types can be built up from basic types
User-defined data types allow MPI to automatically scatter and gather data to and from non-contiguous buffers
Message Structure
Messages are identified by their envelope; a message can be received only if the receiver specifies the correct envelope.
envelope: source, destination, communicator, tag
body: buffer, count, datatype
Fortran - MPI Basic Datatypes
Fortran data type         MPI data type
INTEGER                   MPI_INTEGER
REAL                      MPI_REAL
DOUBLE PRECISION          MPI_DOUBLE_PRECISION
COMPLEX                   MPI_COMPLEX
DOUBLE COMPLEX            MPI_DOUBLE_COMPLEX
LOGICAL                   MPI_LOGICAL
CHARACTER(1)              MPI_CHARACTER
                          MPI_BYTE
                          MPI_PACKED
C - MPI Basic Datatypes
C data type               MPI data type
signed char               MPI_CHAR
signed short int          MPI_SHORT
signed int                MPI_INT
signed long int           MPI_LONG
unsigned char             MPI_UNSIGNED_CHAR
unsigned short int        MPI_UNSIGNED_SHORT
unsigned int              MPI_UNSIGNED
unsigned long int         MPI_UNSIGNED_LONG
float                     MPI_FLOAT
double                    MPI_DOUBLE
long double               MPI_LONG_DOUBLE
                          MPI_BYTE
Blocking and non-Blocking
“Completion” of the communication means that memory locations used in the message transfer can be safely accessed
Send: variable sent can be reused after completion
Receive: variable received can now be used
MPI communication modes differ in what conditions are needed for completion
Communication modes can be blocking or non-blocking
Blocking: return from routine implies completion
Non-blocking: routine returns immediately, user must
test for completion
Communication Modes
Five different communication modes are allowed:
· Synchronous Send
· Buffered Send
· Standard Send
· Ready Send
· Receive
All of them can be blocking or non-blocking:
– A blocking send does not return until the message data has been sent out or buffered in a system buffer, and the array containing the message is ready for re-use.
– A non-blocking send returns immediately; the program must then test whether the send has completed successfully.
Communication Modes
Mode              Completion condition                                          Blocking    Non-blocking
Standard send     Message sent (receive state unknown)                          MPI_SEND    MPI_ISEND
Receive           Completes when a matching message has arrived                 MPI_RECV    MPI_IRECV
Synchronous send  Only completes after a matching receive is posted and the     MPI_SSEND   MPI_ISSEND
                  receive operation is started
Buffered send     Always completes, irrespective of the receiver; guarantees    MPI_BSEND   MPI_IBSEND
                  the message is buffered
Ready send        Always completes, irrespective of whether the receive has     MPI_RSEND   MPI_IRSEND
                  completed
Standard Send and Receive
Basic blocking point-to-point communication routines in MPI.
Fortran:
MPI_SEND(buf, count, type, dest, tag, comm, ierr)
MPI_RECV(buf, count, type, source, tag, comm, status, ierr)
buf      array of type type (see table)
count    (INTEGER) number of elements of buf to be sent
type     (INTEGER) MPI type of buf
dest     (INTEGER) rank of the destination process
source   (INTEGER) rank of the source process
tag      (INTEGER) number identifying the message
comm     (INTEGER) communicator of the sender and receiver
status   (INTEGER) array of size MPI_STATUS_SIZE containing communication status information
ierr     (INTEGER) error code (if ierr=0 no error occurs)
The buf, count and type arguments form the message body; dest/source, tag and comm form the envelope.
Standard Send and Receive
C:
int MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status *status);
Handles
MPI controls its own internal data structures
MPI releases ‘handles’ to allow programmers to refer to these
C handles are defined by typedefs
Fortran handles are INTEGERs
Wildcards
Both in Fortran and C, MPI_RECV accepts wildcards:
To receive from any source: MPI_ANY_SOURCE
To receive with any tag: MPI_ANY_TAG
Actual source and tag are returned in the receiver’s status parameter.
For a communication to succeed
· Sender must specify a valid destination rank.
· Receiver must specify a valid source rank.
· The communicator must be the same.
· Tags must match.
· Message types must match.
· Receiver's buffer must be large enough: at least as large as the message being sent.
Communication Envelope
Information includes, source, tag and count:
C:
status.MPI_SOURCE
status.MPI_TAG
MPI_Get_count()
Fortran:
status(MPI_SOURCE)
status(MPI_TAG)
MPI_GET_COUNT
C:
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
Fortran:
MPI_GET_COUNT(status, datatype, count, ierror)
INTEGER status(MPI_STATUS_SIZE), datatype, count, ierror
Envelope information is returned from MPI_RECV as status
This information is useful because:
• receive tag can be MPI_ANY_TAG
• source can be MPI_ANY_SOURCE
• count can be larger than the real message size
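To make the use of the envelope information concrete, here is a small C sketch (not from the original slides): every non-root process sends a short message whose tag is its own rank, and rank 0 receives them with MPI_ANY_SOURCE / MPI_ANY_TAG and then inspects status and MPI_Get_count.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, nproc, buf[4] = {1, 2, 3, 4};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (rank != 0) {
        /* every non-root process sends, using its own rank as the tag */
        MPI_Send(buf, 4, MPI_INT, 0, rank, MPI_COMM_WORLD);
    } else {
        for (int i = 1; i < nproc; i++) {
            int recvbuf[4], count;
            /* accept a message from any source, with any tag */
            MPI_Recv(recvbuf, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("got %d ints from rank %d (tag %d)\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    MPI_Finalize();
    return 0;
}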
example
PROGRAM send_recv
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    A(1) = 3.0
    A(2) = 5.0
    CALL MPI_SEND(A, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    CALL MPI_RECV(A, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
    WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  END IF
  CALL MPI_FINALIZE(ierr)
END
example
#include <stdio.h>
#include <mpi.h>
int main (int argc, char * argv[]) {
int err, nproc, myid;
MPI_Status status;
float a[2];
err = MPI_Init(&argc, &argv);
err = MPI_Comm_size(MPI_COMM_WORLD, &nproc);
err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
if( myid == 0 ) {
a[0] = 3.0, a[1] = 5.0;
MPI_Send(a, 2, MPI_FLOAT, 1, 10, MPI_COMM_WORLD);
} else if( myid == 1 ) {
MPI_Recv(a, 2, MPI_FLOAT, 0, 10, MPI_COMM_WORLD, &status);
printf("%d: a[0]=%f a[1]=%f\n", myid, a[0], a[1]);
}
err = MPI_Finalize();
}
Again about completion
Standard MPI_RECV and MPI_SEND block the calling process until completion.
For MPI_RECV, completion means that the message has arrived and the process can proceed using the received data.
For MPI_SEND, completion means that the process can proceed and the data can be overwritten without interfering with the message. This does not mean that the message has already been sent: in many MPI implementations, depending on the message size, the data being sent is copied to MPI internal buffers.
If the message is not buffered, a call to MPI_SEND implies a process synchronization; if the message is buffered, it does not.
Don't make any assumptions (it is implementation dependent)!
DEADLOCK
Deadlock occurs when 2 (or more) processes are blocked and each is waiting for the other to make progress.
[Figure: after init and compute, process 0 waits for action B from process 1 before taking action A, while process 1 waits for action A from process 0 before taking action B; neither can proceed or terminate.]
Simple DEADLOCK
PROGRAM deadlock
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_RECV(b, 2, MPI_REAL, 1, 11, MPI_COMM_WORLD, status, ierr)
    CALL MPI_SEND(a, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_RECV(b, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
    CALL MPI_SEND(a, 2, MPI_REAL, 0, 11, MPI_COMM_WORLD, ierr)
  END IF
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END
Avoiding DEADLOCK
[Figure: after init and compute, process 1 takes action B first and then waits for action A, while process 0 waits for action B and then takes action A; both proceed and terminate.]
Avoiding DEADLOCK
PROGRAM avoid_lock
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_RECV(b, 2, MPI_REAL, 1, 11, MPI_COMM_WORLD, status, ierr)
    CALL MPI_SEND(a, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_SEND(a, 2, MPI_REAL, 0, 11, MPI_COMM_WORLD, ierr)
    CALL MPI_RECV(b, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
  END IF
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END
DEADLOCK: the most common error
PROGRAM error_lock
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_SEND(a, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
    CALL MPI_RECV(b, 2, MPI_REAL, 1, 11, MPI_COMM_WORLD, status, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_SEND(a, 2, MPI_REAL, 0, 11, MPI_COMM_WORLD, ierr)
    CALL MPI_RECV(b, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
  END IF
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END
Non-Blocking Send and Receive
Non-blocking communications allow the separation between the initiation of the communication and its completion.
Advantages: between the initiation and completion the program could do other useful computation (latency hiding).
Disadvantages: the programmer has to insert code to check for
completion.
Non-Blocking Send and Receive
Fortran:
MPI_ISEND(buf, count, type, dest, tag, comm, req, ierr)
MPI_IRECV(buf, count, type, source, tag, comm, req, ierr)
buf    array of type type (see table)
count  (INTEGER) number of elements of buf to be sent
type   (INTEGER) MPI type of buf
dest   (INTEGER) rank of the destination process (source, for MPI_IRECV)
tag    (INTEGER) number identifying the message
comm   (INTEGER) communicator of the sender and receiver
req    (INTEGER) output, identifier of the communication handle
ierr   (INTEGER) output, error code (if ierr=0 no error occurs)
Non-Blocking Send and Receive
C:
int MPI_Isend(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Request *req);
Waiting and Testing for Completion
Fortran:
MPI_WAIT(req, status, ierr)
A call to this subroutine causes the calling process to wait until the communication identified by req is complete.
req (INTEGER) input/output, identifier associated with a communication event (initiated by MPI_ISEND or MPI_IRECV).
status (INTEGER) array of size MPI_STATUS_SIZE; if req was associated with a call to MPI_IRECV, status contains information on the received message, otherwise status could contain an error code.
ierr (INTEGER) output, error code (if ierr=0 no error occurs).
C:
int MPI_Wait(MPI_Request *req, MPI_Status *status);
Waiting and Testing for Completion
Fortran:
MPI_TEST(req, flag, status, ierr)
A call to this subroutine sets flag to .true. if the communication identified by req is complete, and sets flag to .false. otherwise.
req (INTEGER) input/output, identifier associated with a communication event (initiated by MPI_ISEND or MPI_IRECV).
flag (LOGICAL) output, .true. if the communication req has completed, .false. otherwise.
status (INTEGER) array of size MPI_STATUS_SIZE; if req was associated with a call to MPI_IRECV, status contains information on the received message, otherwise status could contain an error code.
ierr (INTEGER) output, error code (if ierr=0 no error occurs).
C:
int MPI_Test(MPI_Request *req, int *flag, MPI_Status *status);
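The slides give only the signatures; the following minimal C sketch (an addition, assuming exactly two processes) shows the usual non-blocking pattern: post MPI_Irecv and MPI_Isend, optionally overlap some computation, then call MPI_Wait before touching the buffers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, other;
    float a[2], b[2];
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                      /* assumes exactly 2 processes */
    a[0] = 2.0f * rank;
    a[1] = 4.0f + rank;

    /* post both operations first: no deadlock regardless of ordering */
    MPI_Irecv(b, 2, MPI_FLOAT, other, 20, MPI_COMM_WORLD, &rreq);
    MPI_Isend(a, 2, MPI_FLOAT, other, 20, MPI_COMM_WORLD, &sreq);

    /* ... useful computation not touching a or b could go here ... */

    MPI_Wait(&rreq, &status);              /* b is now safe to read */
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);    /* a is now safe to overwrite */
    printf("%d: b[0]=%f b[1]=%f\n", rank, b[0], b[1]);

    MPI_Finalize();
    return 0;
}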
Send and Receive, the easy way.
The easiest way to send and receive data without worrying about deadlocks
Fortran:
CALL MPI_SENDRECV(sndbuf, snd_size, snd_type, dest_id, tag, rcvbuf, rcv_size, rcv_type, sour_id, tag, comm, status, ierr)
[Figure: the sender side and the receiver side each issue a single MPI_SENDRECV call; sour_id and dest_id identify the partner processes.]
Send and Receive, the easy way.
PROGRAM send_recv
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_SENDRECV(a, 2, MPI_REAL, 1, 10, b, 2, MPI_REAL, 1, 11,
     &                MPI_COMM_WORLD, status, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_SENDRECV(a, 2, MPI_REAL, 0, 11, b, 2, MPI_REAL, 0, 10,
     &                MPI_COMM_WORLD, status, ierr)
  END IF
  WRITE(6,*) myid, ': b(1)=', b(1), ' b(2)=', b(2)
  CALL MPI_FINALIZE(ierr)
END
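For reference, a C counterpart is sketched here (not in the original slides, and it assumes exactly two processes); the tags are chosen so that each send matches the partner's receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myid, other;
    float a[2], b[2];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    other = 1 - myid;                      /* assumes exactly 2 processes */
    a[0] = 2.0f + myid;
    a[1] = 4.0f + myid;

    /* the send and the receive are handled as a single operation,
       so neither process can block the other */
    MPI_Sendrecv(a, 2, MPI_FLOAT, other, 10 + myid,
                 b, 2, MPI_FLOAT, other, 10 + other,
                 MPI_COMM_WORLD, &status);

    printf("%d: b[0]=%f b[1]=%f\n", myid, b[0], b[1]);
    MPI_Finalize();
    return 0;
}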
Domain decomposition and MPI
MPI is particularly suited for a Domain decomposition approach, where there is a single program flow.
Parallel computation consists of a number of processes, each working on some local data.
Each process has purely local variables (no access to remote memory).
Sharing of data takes place by message passing, by explicitly sending and receiving data
between processes.
Collective Communications
Barrier Synchronization
Broadcast
Gather/Scatter
Reduction (sum, max, prod, … )
- Communications involving a group of processes
- Called by all processes in a communicator
Characteristics
Collective communications do not interfere with point-to-point communication and vice-versa
All processes must call the collective routine
No non-blocking collective communication
No tags
Receive buffers must be exactly the right size
Safest communication mode
MPI_Barrier
Stops processes until all processes within the communicator reach the barrier
Fortran:
CALL MPI_BARRIER(comm, ierr)
C:
int MPI_Barrier(MPI_Comm comm)
Barrier
[Figure: processes P0-P4 reach the barrier at different times t1, t2, t3 and all continue together only after the last one has arrived.]
Broadcast (MPI_BCAST)
One-to-all communication: same data sent from root process to all others in the communicator
Fortran:
INTEGER count, type, root, comm, ierr
CALL MPI_BCAST(buf, count, type, root, comm, ierr)
buf   array of type type
C:
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
All processes must specify the same root and comm
Broadcast
[Figure: before the broadcast only P0 holds a1; afterwards P0, P1, P2 and P3 all hold a1.]
PROGRAM broad_cast
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
  END IF
  CALL MPI_BCAST(a, 2, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END
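A C counterpart of the broadcast example, added here as a sketch rather than taken from the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myid, root = 0;
    float a[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == root) { a[0] = 2.0f; a[1] = 4.0f; }

    /* every process passes the same root and communicator */
    MPI_Bcast(a, 2, MPI_FLOAT, root, MPI_COMM_WORLD);

    printf("%d: a[0]=%f a[1]=%f\n", myid, a[0], a[1]);
    MPI_Finalize();
    return 0;
}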
Scatter / Gather
[Figure: Scatter - the root's sndbuf elements a1, a2, a3, a4 are distributed one per process into the rcvbuf of P0, P1, P2, P3. Gather - the element held in each process's sndbuf is collected, in rank order, into the root's rcvbuf.]
MPI_Scatter
One-to-all communication: different data sent from root process to all others in the communicator
Fortran:
CALL MPI_SCATTER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)
Argument definitions are analogous to other MPI subroutines
sndcount is the number of elements sent to each process, not the size of sndbuf, which should be sndcount times the number of processes in the communicator
The sender arguments are meaningful only on the root process
MPI_SCATTERV
Usage
int MPI_Scatterv( void* sendbuf, /* in */
int* sendcounts, /* in */
int* displs, /* in */
MPI_Datatype sendtype, /* in */
void* recvbuf, /* out */
int recvcount, /* in */
MPI_Datatype recvtype, /* in */
int root, /* in */
MPI_Comm comm); /* in */
Description
Distributes individual messages from root to each process in communicator
Messages can have different sizes and displacements
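Since no worked example of MPI_Scatterv appears in the slides, here is a tentative C sketch in which process i receives i+1 elements; the counts and displs arrays are built accordingly, and all variable names are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int myid, nproc, root = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* process i receives i+1 elements: build counts and displacements */
    int *counts = malloc(nproc * sizeof(int));
    int *displs = malloc(nproc * sizeof(int));
    int total = 0;
    for (int i = 0; i < nproc; i++) {
        counts[i] = i + 1;
        displs[i] = total;
        total += counts[i];
    }

    float *sendbuf = NULL;
    if (myid == root) {                 /* send buffer is needed only on root */
        sendbuf = malloc(total * sizeof(float));
        for (int i = 0; i < total; i++) sendbuf[i] = (float)i;
    }

    int mycount = myid + 1;
    float *recvbuf = malloc(mycount * sizeof(float));
    MPI_Scatterv(sendbuf, counts, displs, MPI_FLOAT,
                 recvbuf, mycount, MPI_FLOAT, root, MPI_COMM_WORLD);

    printf("%d: received %d elements, first = %f\n", myid, mycount, recvbuf[0]);

    free(counts); free(displs); free(recvbuf);
    if (myid == root) free(sendbuf);
    MPI_Finalize();
    return 0;
}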
MPI_Gather
One-to-all communication: different data are collected by the root process from all other processes in the communicator. It is the opposite of Scatter.
Fortran:
CALL MPI_GATHER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)
Argument definitions are analogous to other MPI subroutines
rcvcount is the number of elements collected from each process, not the size of rcvbuf, which should be rcvcount times the number of processes in the communicator
The receiver arguments are meaningful only on the root process
MPI_GATHERV
Usage
int MPI_Gatherv( void* sendbuf, /* in */
int sendcount, /* in */
MPI_Datatype sendtype, /* in */
void* recvbuf, /* out */
int* recvcounts, /* in */
int* displs, /* in */
MPI_Datatype recvtype, /* in */
int root, /* in */
MPI_Comm comm ); /* in */
Description
Collects individual messages from each process in the communicator to the root process and stores them in rank order
Messages can have different sizes and displacements
Scatter example
PROGRAM scatter
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, nsnd, i, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(16), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  IF( myid .eq. root ) THEN
    DO i = 1, 16
      a(i) = REAL(i)
    END DO
  END IF
  nsnd = 2
  CALL MPI_SCATTER(a, nsnd, MPI_REAL, b, nsnd,
     &             MPI_REAL, root, MPI_COMM_WORLD, ierr)
  WRITE(6,*) myid, ': b(1)=', b(1), ' b(2)=', b(2)
  CALL MPI_FINALIZE(ierr)
END
Gather example
PROGRAM gather
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, nsnd, i, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(16), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  b(1) = REAL( myid )
  b(2) = REAL( myid )
  nsnd = 2
  CALL MPI_GATHER(b, nsnd, MPI_REAL, a, nsnd,
     &            MPI_REAL, root, MPI_COMM_WORLD, ierr)
  IF( myid .eq. root ) THEN
    DO i = 1, (nsnd*nproc)
      WRITE(6,*) myid, ': a(i)=', a(i)
    END DO
  END IF
  CALL MPI_FINALIZE(ierr)
END
MPI_Alltoall
[Figure: in MPI_ALLTOALL each process i places element j of its sndbuf into element i of process j's rcvbuf, i.e. the data are transposed across processes.]
Fortran:
CALL MPI_ALLTOALL(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, comm, ierr)
Very useful to implement data transposition
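As a sketch (not from the original slides) of how MPI_Alltoall transposes data, the following C program has every process send one distinct integer to every other process:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int myid, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* each process sends one distinct element to every other process */
    int *sndbuf = malloc(nproc * sizeof(int));
    int *rcvbuf = malloc(nproc * sizeof(int));
    for (int i = 0; i < nproc; i++)
        sndbuf[i] = 100 * myid + i;        /* element destined for rank i */

    MPI_Alltoall(sndbuf, 1, MPI_INT, rcvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    /* rcvbuf[i] now holds the element that rank i addressed to this rank */
    for (int i = 0; i < nproc; i++)
        printf("%d: rcvbuf[%d] = %d\n", myid, i, rcvbuf[i]);

    free(sndbuf); free(rcvbuf);
    MPI_Finalize();
    return 0;
}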
Reduction
The reduction operation allows you to:
• Collect data from each process
• Reduce the data to a single value
• Store the result on the root process (MPI_REDUCE), or
• Store the result on all processes (MPI_ALLREDUCE)
• Overlap communication and computation
Reduce, Parallel Sum
[Figure: P0-P3 each hold local values a_i and b_i; the sums Sa = a1+a2+a3+a4 and Sb = b1+b2+b3+b4 are computed and stored on P0 (reduce) or on every process (allreduce).]
Reduction functions work with arrays
Other operations: product, min, max, and, ...
MPI_REDUCE and MPI_ALLREDUCE
Fortran:
MPI_REDUCE(snd_buf, rcv_buf, count, type, op, root, comm, ierr)
snd_buf  input array of type type containing local values
rcv_buf  output array of type type containing global results
count    (INTEGER) number of elements of snd_buf and rcv_buf
type     (INTEGER) MPI type of snd_buf and rcv_buf
op       (INTEGER) parallel operation to be performed
root     (INTEGER) MPI id of the process storing the result
comm     (INTEGER) communicator of processes involved in the operation
ierr     (INTEGER) output, error code (if ierr=0 no error occurs)
MPI_ALLREDUCE( snd_buf, rcv_buf, count, type, op, comm, ierr)
Predefined Reduction Operations
MPI op Function
MPI_MAX Maximum
MPI_MIN Minimum
MPI_SUM Sum
MPI_PROD Product
MPI_LAND Logical AND
MPI_BAND Bitwise AND
MPI_LOR Logical OR
MPI_BOR Bitwise OR
MPI_LXOR Logical exclusive OR
MPI_BXOR Bitwise exclusive OR
MPI_MAXLOC Maximum and location
MPI_MINLOC Minimum and location
Reduce / 1
C:
int MPI_Reduce(void *snd_buf, void *rcv_buf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm)
int MPI_Allreduce(void *snd_buf, void *rcv_buf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm)
Reduce, example
PROGRAM reduce
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), res(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  a(1) = 2.0*myid
  a(2) = 4.0+myid
  CALL MPI_REDUCE(a, res, 2, MPI_REAL, MPI_SUM, root,
     &            MPI_COMM_WORLD, ierr)
  IF( myid .EQ. 0 ) THEN
    WRITE(6,*) myid, ': res(1)=', res(1), ' res(2)=', res(2)
  END IF
  CALL MPI_FINALIZE(ierr)
END
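A C sketch of MPI_ALLREDUCE, added for completeness (the slides only show the Fortran MPI_REDUCE example); every process ends up with the global sums, so no root argument is needed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myid;
    float a[2], res[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    a[0] = 2.0f * myid;
    a[1] = 4.0f + myid;

    /* every process obtains the global sums, no root argument needed */
    MPI_Allreduce(a, res, 2, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    printf("%d: res[0]=%f res[1]=%f\n", myid, res[0], res[1]);
    MPI_Finalize();
    return 0;
}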
Shared Memory System
• Shared memory refers to a large block of RAM that can be accessed by several different CPUs in a multiple-processor computer system.
• Usually the system is a Symmetric MultiProcessor (SMP). SMP
involves a multiprocessor computer hardware architecture where
two or more identical processors are connected to a single shared
main memory.
• The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all
architectures, including Unix platforms and Windows NT platforms.
• OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel
applications for platforms ranging from the desktop to the supercomputer.
Pros of OpenMP
• easier to program and debug;
• directives can be added incrementally;
• gradual parallelization;
• can still run the program as a serial code;
• serial code statements usually don't need modification.
Cons of OpenMP
• can only be run in shared memory computers;
• Traffic between CPU and memory increases with the number of CPUs;
• The OpenMP API uses the fork-join model of parallel execution.
• An OpenMP program begins as a single thread of execution, called the initial thread. The initial thread executes sequentially until it encounters a parallel construct.
• The initial thread creates a team of threads and becomes the master of the new team. Beyond the end of the parallel construct, only the master thread resumes execution.
OpenMP consists of a set of:
• Compiler directives
• Runtime library routines
• Environment variables
OpenMP directives for C/C++ are specified with the pragma preprocessing directive.
The syntax of an OpenMP directive is formally specified as follows:
• C/C++:
#pragma omp directive-name [clause[[,]clause]...]
• Fortran:
!$omp directive-name [clause[[,]clause]...]
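To tie these ingredients together, here is a small illustrative C program (not from the slides) that combines a compiler directive with two runtime library routines; the thread count is normally controlled by the OMP_NUM_THREADS environment variable, and is overridden here from the program.

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* the thread count usually comes from the OMP_NUM_THREADS
       environment variable; it can also be set from the program */
    omp_set_num_threads(4);

    #pragma omp parallel                    /* compiler directive */
    {
        int id = omp_get_thread_num();      /* runtime library routine */
        int nt = omp_get_num_threads();     /* runtime library routine */
        printf("thread %d of %d\n", id, nt);
    }
    return 0;
}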
Parallel construct
• Start parallel execution;
• A team of threads is created to execute the parallel region;
• The thread that encountered the parallel construct becomes the master thread of the new team, with thread number zero.
• There is an implicit barrier at the end of the construct;
• Any number of parallel constructs can be specified in a single
program;
A first program in Fortran:
PROGRAM HELLO
INTEGER VAR1, VAR2, VAR3
!Serial code
!Beginning of parallel region.
!Fork a team of threads.
!Specify variable scoping.
!$OMP PARALLEL
Print *, 'Hello World!!!'
!$OMP END PARALLEL
!Resume serial code
END
A first program in C:
#include <omp.h>
int main () {
int var1, var2, var3;
/* Serial code */
/* Beginning of parallel region. Fork a team of threads. */
/* Specify variable scoping. */
#pragma omp parallel
{
  /* Parallel section executed by all threads */
}
/* All threads join the master thread and disband */
/* Resume serial code */
return 0;
}
Worksharing construct
• A worksharing construct distributes the execution of the associated region among the members of the team that encounters it.
• A worksharing region has no barrier on entry; however, an implied barrier exists at the end of the worksharing region.
• If a nowait clause is present, an implementation may omit the barrier at the end of the worksharing region.
→ Loop , Sections, Single, Workshare
Loop construct (DO/for)
The loop construct specifies that the iterations of one or more
associated loops will be executed in parallel by threads in the team
in the context of their implicit tasks. The iterations are distributed
across threads that already exist in the team executing the parallel
region to which the loop region binds.
Loop construct (DO/for)
Fortran:
integer :: i, n=200
real :: a(n), b(n), c(n)

do i = 1, n
  a(i) = b(i) + c(i)
enddo
Loop construct (DO/for)
Fortran:
integer :: i, n=200
real :: a(n), b(n), c(n)

!$OMP PARALLEL
!$OMP DO
do i = 1, n
  a(i) = b(i) + c(i)
enddo
!$OMP END DO
!$OMP END PARALLEL
Loop construct (DO/for)
C/C++:
#pragma omp parallel
{
  …
  #pragma omp for
  for (i=0; i<n; ++i)
    a[i] = b[i] + c[i];
  …
}
Scheduling
Specifies how iterations of the associated loops are divided into contiguous non-empty subsets, called chunks, and how these chunks are distributed among threads of the team.
• Static: iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.
• Dynamic: iterations are distributed to threads in the team in chunks as the threads
request them. Each thread executes a chunk of iterations, then requests another
chunk, until no chunks remain to be distributed.
Scheduling
• Guided: the iterations are assigned to threads in the team in chunks as the executing threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. The chunk size decreases over time.
• Runtime: the decision regarding scheduling is deferred until run time.
• Auto: the decision regarding scheduling is delegated to the compiler and/or runtime
system.
Schedule clauses
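The table of schedule clauses did not survive extraction; as a tentative substitute, this C sketch shows the clause syntax with a dynamic schedule and a chunk size of 16 (both values are arbitrary):

#include <stdio.h>

#define N 1000

int main(void) {
    static float x[N], y[N];
    int i;
    for (i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* dynamic scheduling with chunks of 16 iterations: each thread grabs
       a new chunk as soon as it finishes the previous one (useful when
       the cost per iteration is irregular) */
    #pragma omp parallel for schedule(dynamic, 16)
    for (i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}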
Sections construct
The sections construct is a noniterative worksharing construct that
contains a set of structured blocks that are to be distributed among
and executed by the threads in a team.
Workshare: Sections construct
Fortran:
!$OMP PARALLEL
…
!$OMP SECTIONS
!$OMP SECTION
  call subr_A(c,d)
!$OMP SECTION
  call subr_B(e,f)
!$OMP SECTION
  call subr_c(g,h,i)
!$OMP END SECTIONS
…
!$OMP END PARALLEL
Workshare: Sections construct
C/C++:
#pragma omp parallel
{
  …
  #pragma omp sections
  {
    #pragma omp section
      A = subr_A(c,d);
    #pragma omp section
      B = subr_B(e,f);
    #pragma omp section
      C = subr_c(g,h,i);
  }
  …
}
Workshare: Single construct
The single construct specifies that the associated structured block is executed by only one of the threads in the team (not necessarily the master thread).
The other threads in the team, which do not execute the block, wait at an implicit barrier at the end of the single construct unless a nowait clause is specified.
Workshare: Single construct
Fortran:
!$OMP PARALLEL
…
!$OMP SINGLE
  read *, a
!$OMP END SINGLE
…
!$OMP END PARALLEL
Workshare: Single construct
C/C++:
#pragma omp parallel
{
  …
  #pragma omp single
    printf("Beginning work");
  …
}
Workshare
Parallel construct combined with worksharing constructs
!$omp parallel do ...
!$omp end parallel do
!$omp parallel workshare ...
!$omp end parallel workshare
!$omp parallel sections ...
!$omp end parallel sections
Master & Synchronization constructs
• Master
• Critical
• Atomic
• Barrier
• Ordered
Master construct
The master construct specifies a structured block that is executed by the master thread of the team.
There is no implied barrier either on entry to, or exit from, the master construct.
Fortran:
!$OMP PARALLEL
…
!$OMP MASTER
  read *, a
!$OMP END MASTER
…
!$OMP END PARALLEL
Master construct
C/C++:
#pragma omp parallel
{
  …
  #pragma omp master
    printf("Beginning work");
  …
}
Critical construct
The critical construct restricts execution of the associated structured block to a single thread at a time. An optional name may be used to identify the critical construct.
Fortran:
!$OMP PARALLEL
…
!$OMP CRITICAL [NAME]
X=FUNC_A(X)
!$OMP END CRITICAL
…
!$OMP END PARALLEL
Critical construct
C/C++:
#pragma omp parallel
{
  …
  #pragma omp critical [name]
    x = subr_A(x);
  …
}
Barrier construct
The barrier construct specifies an explicit barrier at the point at which the construct appears.
Fortran:
!$OMP PARALLEL
…
X=FUNC_A(X)
!$OMP BARRIER
…
!$OMP END PARALLEL
Barrier construct
The barrier construct specifies an explicit barrier at the point at which the construct appears.
C/C++:
#pragma omp parallel
{
  …
  x = subr_A(x);
  #pragma omp barrier
  …
}
Atomic construct
The atomic construct ensures that a specific storage location is updated atomically, rather than exposing it to the possibility of multiple,
simultaneous writing threads.
Fortran:
!$OMP PARALLEL
…
!$OMP ATOMIC
  X = X + 1
…
!$OMP END PARALLEL
Atomic construct
The atomic construct ensures that a specific storage location is updated atomically, rather than exposing it to the possibility of multiple,
simultaneous writing threads.
C/C++:
#pragma omp parallel
{
  …
  #pragma omp atomic
  x++;
  …
}
Ordered construct
The ordered construct specifies a structured block in a loop region that will be executed in the order of the loop iterations. This sequentializes and orders the code within an ordered region while allowing code
outside the region to run in parallel.
Ordered construct
Fortran:
!$OMP PARALLEL
!$OMP DO ORDERED
  DO i=1,N
    A(i)=...
!$OMP ORDERED
    PRINT *, a(i)
!$OMP END ORDERED
  ENDDO
!$OMP END DO
!$OMP END PARALLEL
Ordered construct
C/C++:
#pragma omp parallel
{
  …
  #pragma omp for ordered
  for (i=0; i<n; ++i)
  {
    a[i] = b[i] + 1.0;
    #pragma omp ordered
    {
      printf("%f\n", a[i]);
    }
  }
}
OpenMP Memory Model
Fortran:
integer :: i=5, n=200
real :: tmp=7

!$OMP PARALLEL
!$OMP DO
do i=1, n
  tmp = func(b(i))
  a(i) = b(i) + tmp
enddo
!$OMP END DO
!$OMP END PARALLEL
OpenMP Memory Model
Fortran:
integer :: i=5, n=200
real :: tmp=7

!$OMP PARALLEL
!$OMP DO PRIVATE(tmp)
do i=1, n
  tmp = func(b(i))
  a(i) = b(i) + tmp
enddo
!$OMP END DO
!$OMP END PARALLEL
OpenMP Memory Model
OpenMP provides a consistent shared-memory model. All threads have access to the main memory to retrieve shared variables.
Each thread also has access to another type of memory that cannot be accessed by other threads, called thread private memory.
A directive that accepts data-sharing attribute clauses determines two kinds of access to variables used in the directive’s associated structured block:
shared and private.
Data-Sharing Attribute Clauses
• Shared: declares a list of one or more items to be shared by threads generated by a parallel construct.
• Private: declares one or more list items to be private to a task.
No other thread can access this data.
Changes are visible only to the thread owning the data.
• Firstprivate: declares one or more list items to be private to a task, and
initializes each of them with the value that the corresponding original
item has when the construct is encountered.
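A small illustrative C example of firstprivate (added, not from the slides): every thread gets its own copy of offset that is already initialized with the value it had before the construct.

#include <stdio.h>

int main(void) {
    int i, offset = 10;     /* initialized before the parallel region */
    int a[8];

    /* firstprivate: each thread gets a private copy of offset, already
       initialized to 10; a plain private copy would start undefined */
    #pragma omp parallel for firstprivate(offset)
    for (i = 0; i < 8; i++)
        a[i] = i + offset;

    for (i = 0; i < 8; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}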
Data-Sharing Attribute Clauses
Lastprivate: declares one or more list items to be private to an implicit
task, and causes the corresponding original list item to be updated after the end of the region.
!$omp do lastprivate(i)
do i = 1, n-1
  a(i) = b(i+1)
enddo
!$omp end do
a(i) = b(0)
Data-Sharing Attribute Clauses
do i = 1, n
  x = x + a(i)
enddo
Data-Sharing Attribute Clauses
Reduction: The reduction clause specifies an operator and one or more list items. For each list item, a private copy is created in each implicit task, and is initialized appropriately for the operator. After the end of the
region, the original list item is updated with the values of the private copies using the specified operator.
!$omp do reduction (+:x)
do i = 1, n
  x = x + a(i)
enddo
!$omp end do
Support for most arithmetic and logical operators: +, *, -, MAX, MIN, .AND., .OR., ...
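For comparison with the Fortran fragment above, a minimal C sketch of the reduction clause (an addition, with illustrative names):

#include <stdio.h>

#define N 100

int main(void) {
    float a[N], x = 0.0f;
    int i;
    for (i = 0; i < N; i++) a[i] = 1.0f;

    /* each thread accumulates into a private copy of x, initialized to 0;
       the private copies are combined with + at the end of the loop */
    #pragma omp parallel for reduction(+:x)
    for (i = 0; i < N; i++)
        x = x + a[i];

    printf("x = %f\n", x);   /* 100.0 */
    return 0;
}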