
(1)

www.cineca.it

The basics of parallel programming

Carlo Cavazzoni, HPC department, CINECA

(2)

Message Passing

Cooperation between two processes.

[Diagram: Process 0 asks Process 1 "May I send?"; Process 1, which is listening, answers "Yes", and the data are then transferred over time.]

Two fundamental functions: send(message) and receive(message)

(3)


MPI

Message Passing Interface

(4)

Goals of the MPI standard

To provide source-code portability

To allow efficient implementation

(5)

Two basic models

Domain decomposition

Data are divided into pieces of approximately the same size and mapped to different processors. Each processor works only on its local data. The resulting code has a single flow.

Functional decomposition

The problem is decomposed into a large number of smaller tasks, and the tasks are assigned to processors as they become available (Client-Server / Master-Slave paradigm).

(6)

Basic Features of MPI Programs

An MPI program consists of multiple instances of a serial program that communicate by library calls.

Calls may be roughly divided into four classes:

Calls used to initialize, manage, and terminate communications

Calls used to communicate between pairs of processors. (Pair communication)

Calls used to communicate among groups of processors. (Collective communication)

Calls to create data types.

(7)

A First Program: Hello World!

Fortran

PROGRAM hello
  INCLUDE 'mpif.h'
  INTEGER err
  CALL MPI_INIT(err)
  PRINT *, 'hello world!'
  CALL MPI_FINALIZE(err)
END

C

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int err;
  err = MPI_Init(&argc, &argv);
  printf("Hello world!\n");
  err = MPI_Finalize();
}

(8)

Header files

All subprograms that contain calls to MPI subroutines must include the MPI header file.

C:

#include <mpi.h>

Fortran:

include 'mpif.h'

The header file contains definitions of MPI constants, MPI types and functions.

(9)

MPI Communicator

In MPI it is possible to divide the total number of processes into groups, called communicators.

A communicator is a variable identifying a group of processes that are allowed to communicate with each other.

The communicator that includes all processes is called MPI_COMM_WORLD. MPI_COMM_WORLD is the default communicator (automatically defined):

 All MPI communication subroutines have a communicator argument.

 The programmer can define many communicators at the same time.

[Diagram: processes with ranks 0-7 grouped inside MPI_COMM_WORLD]

(10)

MPI function format

C:

Error = MPI_Xxxxx(parameter,...);

MPI_Xxxxx(parameter,...);

Fortran:

CALL MPI_XXXXX(parameter, IERROR)

(11)

Initialising MPI

C:

int MPI_Init(int *argc, char ***argv)

Fortran:

INTEGER IERROR
CALL MPI_INIT(IERROR)

Must be the first MPI call; it initializes the message passing routines.

(12)

Communicator Size

How many processors are associated with a communicator?

C:

MPI_Comm_size(MPI_Comm comm, int *size)

Fortran:

INTEGER COMM, SIZE, IERR

CALL MPI_COMM_SIZE(COMM, SIZE, IERR)

OUTPUT: SIZE

(13)

Process Rank

How do you identify different processes?

What is the ID of a processor in a group?

C:

MPI_Comm_rank(MPI_Comm comm, int *rank)

Fortran:

INTEGER COMM, RANK, IERR

CALL MPI_COMM_RANK(COMM, RANK, IERR)

OUTPUT: RANK

rank is an integer that identifies the Process inside the communicator comm

MPI_COMM_RANK is used to find the rank (the name or identifier) of the Process

(14)

Communicator Size and Process Rank / 1

[Diagram: eight processes P0-P7 in a communicator; the highlighted process has RANK = 2, SIZE = 8]

SIZE is the number of processes associated with the communicator; it answers the question "how many processes are contained within a communicator?"

RANK is the index of a process within the group associated with the communicator (rank = 0, 1, ..., SIZE-1). The rank is used to identify the source and destination of a message.

(15)

Initializing and Exiting MPI

Initializing the MPI environment

C:

int MPI_Init(int *argc, char ***argv);

Fortran:

INTEGER IERR
CALL MPI_INIT(IERR)

Finalizing the MPI environment

C:

int MPI_Finalize();

Fortran:

INTEGER IERR
CALL MPI_FINALIZE(IERR)

These two subprograms should be called by all processes, and no other MPI calls may come before MPI_Init or after MPI_Finalize.

(16)

A Template for Fortran MPI Programs

PROGRAM template
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

  !!! INSERT YOUR PARALLEL CODE HERE !!!

  CALL MPI_FINALIZE(ierr)
END PROGRAM template

(17)

A Template for C MPI programs

#include <stdio.h>

#include <mpi.h>

int main(int argc, char *argv[]) {

int err, nproc, myid;

err = MPI_Init(&argc, &argv);

err = MPI_Comm_size(MPI_COMM_WORLD, &nproc);

err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);

/*** INSERT YOUR PARALLEL CODE HERE ***/

err = MPI_Finalize();

}

(18)

Example

PROGRAM hello
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER :: myPE, totPEs, i, ierr
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK( MPI_COMM_WORLD, myPE, ierr )
  CALL MPI_COMM_SIZE( MPI_COMM_WORLD, totPEs, ierr )
  PRINT *, 'myPE is ', myPE, ' of total ', totPEs, ' PEs'
  CALL MPI_FINALIZE(ierr)
END PROGRAM hello

Output (4 procs):

MyPE is 1 of total 4 PEs
MyPE is 0 of total 4 PEs
MyPE is 3 of total 4 PEs
MyPE is 2 of total 4 PEs

(19)

Point to Point Communication

It is the fundamental communication facility provided by the MPI library: communication between two processes.

It is conceptually simple: source process A sends a message to destination process B, and B receives the message from A.

Communication takes place within a communicator.

Source and destination are identified by their rank in the communicator.

[Diagram: source and destination processes, identified by their ranks, inside a communicator]

(20)

The Message

MPI Data types

 Basic types

 Derived types

Derived types can be built up from basic types.

User-defined data types allow MPI to automatically scatter and gather data to and from non-contiguous buffers.

Messages are identified by their envelopes:

 a message can be received only if the receiver specifies the correct envelope.

Message structure:

 envelope: source, destination, communicator, tag
 body: buffer, count, datatype

(21)

Fortran - MPI Basic Datatypes

Fortran data type      MPI data type
INTEGER                MPI_INTEGER
REAL                   MPI_REAL
DOUBLE PRECISION       MPI_DOUBLE_PRECISION
COMPLEX                MPI_COMPLEX
DOUBLE COMPLEX         MPI_DOUBLE_COMPLEX
LOGICAL                MPI_LOGICAL
CHARACTER(1)           MPI_CHARACTER
                       MPI_BYTE
                       MPI_PACKED

(22)

C - MPI Basic Datatypes

C data type            MPI data type
signed char            MPI_CHAR
signed short int       MPI_SHORT
signed int             MPI_INT
signed long int        MPI_LONG
unsigned char          MPI_UNSIGNED_CHAR
unsigned short int     MPI_UNSIGNED_SHORT
unsigned int           MPI_UNSIGNED
unsigned long int      MPI_UNSIGNED_LONG
float                  MPI_FLOAT
double                 MPI_DOUBLE
long double            MPI_LONG_DOUBLE
                       MPI_BYTE

(23)

Blocking and non-Blocking

“Completion” of the communication means that memory locations used in the message transfer can be safely accessed

 Send: variable sent can be reused after completion

 Receive: variable received can now be used

MPI communication modes differ in what conditions are needed for completion

Communication modes can be blocking or non-blocking

 Blocking: return from routine implies completion

 Non-blocking: routine returns immediately, user must test for completion

(24)

Communication Modes

Five different communication modes are allowed:

· Synchronous Send

· Buffered Send

· Standard Send

· Ready Send

· Receive

All of them can be Blocking or Non-Blocking

– Blocking send will not return until the message data has been sent out or buffered in a system buffer and the array containing the message is ready for re-use.

– Non-blocking send returns immediately; the user must then test whether the send has completed successfully.

(25)

Communication Modes

Mode, blocking/non-blocking subroutines, and completion condition:

· Standard send: MPI_SEND (blocking), MPI_ISEND (non-blocking). Completion: message sent (receive state unknown).

· Receive: MPI_RECV (blocking), MPI_IRECV (non-blocking). Completion: a matching message has arrived.

· Synchronous send: MPI_SSEND (blocking), MPI_ISSEND (non-blocking). Completion: only after a matching receive is posted and the receive operation has started.

· Buffered send: MPI_BSEND (blocking), MPI_IBSEND (non-blocking). Completion: always completes, irrespective of the receiver; guarantees the message is buffered.

· Ready send: MPI_RSEND (blocking), MPI_IRSEND (non-blocking). Completion: always completes, irrespective of whether the receive has completed.

(26)

Standard Send and Receive

Basic blocking point-to-point communication routines in MPI.

Fortran:

MPI_SEND(buf, count, type, dest, tag, comm, ierr)
MPI_RECV(buf, count, type, source, tag, comm, status, ierr)

buf      array of type type (see table); the message body
count    (INTEGER) number of elements of buf to be sent
type     (INTEGER) MPI type of buf
dest     (INTEGER) rank of the destination process
source   (INTEGER) rank of the source process
tag      (INTEGER) number identifying the message
comm     (INTEGER) communicator of the sender and receiver
status   (INTEGER) array of size MPI_STATUS_SIZE containing communication status information
ierr     (INTEGER) error code (if ierr=0 no error occurred)

buf, count and type form the message body; dest/source, tag and comm form the envelope.

(27)

Standard Send and Receive

C:

int MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm);

int MPI_Recv(void *buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status *status);

(28)

Handles

MPI controls its own internal data structures

MPI releases ‘handles’ to allow programmers to refer to these

C handles are defined as typedefs

Fortran handles are INTEGER

(29)

Wildcards

Both in Fortran and C, MPI_RECV accepts wildcards:

 To receive from any source: MPI_ANY_SOURCE
 To receive with any tag: MPI_ANY_TAG

The actual source and tag are returned in the receiver's status parameter.

(30)

For a communication to succeed

· Sender must specify a valid destination rank.

· Receiver must specify a valid source rank.

· The communicator must be the same.

· Tags must match.

· Message types must match.

· Receiver's buffer must be large enough: at least as large as the sender's.

(31)

Communication Envelope

Envelope information is returned from MPI_RECV in status. It includes source, tag and count:

C:

status.MPI_SOURCE
status.MPI_TAG
MPI_Get_count()

Fortran:

status(MPI_SOURCE)
status(MPI_TAG)
MPI_GET_COUNT

C:

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

Fortran:

MPI_GET_COUNT(status, datatype, count, ierror)
INTEGER status(MPI_STATUS_SIZE), datatype, count, ierror

This information is useful because:

• the receive tag can be MPI_ANY_TAG
• the source can be MPI_ANY_SOURCE
• count can be larger than the real message size
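As an illustration (not part of the original slides), here is a minimal C sketch in which rank 1 accepts a message from any source with any tag and then inspects the envelope with status and MPI_Get_count; the buffer size, element count and tag value are arbitrary assumptions.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int myid, buf[100], count;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  if (myid == 0) {
    int data[3] = {1, 2, 3};
    MPI_Send(data, 3, MPI_INT, 1, 42, MPI_COMM_WORLD);   /* 3 ints, tag 42 */
  } else if (myid == 1) {
    /* accept any sender and any tag; the envelope is recovered from status */
    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);              /* actual number of elements */
    printf("received %d ints from rank %d, tag %d\n",
           count, status.MPI_SOURCE, status.MPI_TAG);
  }

  MPI_Finalize();
  return 0;
}

Run with at least two processes; note that the receive buffer (100 ints) may be larger than the actual message, which is exactly why MPI_Get_count is needed.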

(32)

example

PROGRAM send_recv
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    A(1) = 3.0
    A(2) = 5.0
    CALL MPI_SEND(A, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    CALL MPI_RECV(A, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
    WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  END IF
  CALL MPI_FINALIZE(ierr)
END

(33)

example

#include <stdio.h>

#include <mpi.h>

int main(int argc, char *argv[]) {

int err, nproc, myid;

MPI_Status status;

float a[2];

err = MPI_Init(&argc, &argv);

err = MPI_Comm_size(MPI_COMM_WORLD, &nproc);

err = MPI_Comm_rank(MPI_COMM_WORLD, &myid);

if( myid == 0 ) {

a[0] = 3.0, a[1] = 5.0;

MPI_Send(a, 2, MPI_FLOAT, 1, 10, MPI_COMM_WORLD);

} else if( myid == 1 ) {

MPI_Recv(a, 2, MPI_FLOAT, 0, 10, MPI_COMM_WORLD, &status);

printf("%d: a[0]=%f a[1]=%f\n", myid, a[0], a[1]);

}

err = MPI_Finalize();

}

(34)

Again about completion

Standard MPI_RECV and MPI_SEND block the calling process until completion.

For MPI_RECV, completion means: the message has arrived and the process can proceed using the received data.

For MPI_SEND, completion means: the process can proceed and the data can be overwritten without interfering with the message. But this does not mean that the message has already been sent. In many MPI implementations, depending on the message size, the data to be sent are copied to MPI internal buffers.

If the message is not buffered, a call to MPI_SEND implies a process synchronization; on the contrary, this is not true if the message is buffered.

Don't make any assumptions (implementation dependent).

(35)

DEADLOCK

Deadlock occurs when 2 (or more) processes are blocked and each is waiting for the other to make progress.

[Diagram: processes 0 and 1 each init and compute; process 0 waits to proceed until 1 has taken action B, while process 1 waits until 0 has taken action A; each is waiting for the other, so neither terminates.]

(36)

Simple DEADLOCK

PROGRAM deadlock
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_RECV(b, 2, MPI_REAL, 1, 11, MPI_COMM_WORLD, status, ierr)
    CALL MPI_SEND(a, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_RECV(b, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
    CALL MPI_SEND(a, 2, MPI_REAL, 0, 11, MPI_COMM_WORLD, ierr)
  END IF
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END

(37)

Avoiding DEADLOCK

[Diagram: processes 0 and 1 each init and compute; actions A and B are ordered so that each process's wait condition is satisfied by the other, and both processes terminate.]

(38)

Avoiding DEADLOCK

PROGRAM avoid_lock
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_RECV(b, 2, MPI_REAL, 1, 11, MPI_COMM_WORLD, status, ierr)
    CALL MPI_SEND(a, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_SEND(a, 2, MPI_REAL, 0, 11, MPI_COMM_WORLD, ierr)
    CALL MPI_RECV(b, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
  END IF
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END

(39)

DEADLOCK: the most common error

PROGRAM error_lock
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_SEND(a, 2, MPI_REAL, 1, 10, MPI_COMM_WORLD, ierr)
    CALL MPI_RECV(b, 2, MPI_REAL, 1, 11, MPI_COMM_WORLD, status, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_SEND(a, 2, MPI_REAL, 0, 11, MPI_COMM_WORLD, ierr)
    CALL MPI_RECV(b, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
  END IF
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END

(40)

Non-Blocking Send and Receive

Non-blocking communications allow the separation of the initiation of a communication from its completion.

Advantages: between initiation and completion the program can do other useful computation (latency hiding).

Disadvantages: the programmer has to insert code to check for completion.

(41)

Non-Blocking Send and Receive

Fortran:

MPI_ISEND(buf, count, type, dest, tag, comm, req, ierr)
MPI_IRECV(buf, count, type, source, tag, comm, req, ierr)

buf     array of type type (see table)
count   (INTEGER) number of elements of buf to be sent
type    (INTEGER) MPI type of buf
dest    (INTEGER) rank of the destination process (for MPI_IRECV, source is the rank of the source process)
tag     (INTEGER) number identifying the message
comm    (INTEGER) communicator of the sender and receiver
req     (INTEGER) output, identifier of the communication handle
ierr    (INTEGER) output, error code (if ierr=0 no error occurs)

(42)

Non-Blocking Send and Receive

C:

int MPI_Isend(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request *req);

int MPI_Irecv(void *buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Request *req);

(43)

Waiting and Testing for Completion

Fortran:

MPI_WAIT(req, status, ierr)

A call to this subroutine causes the code to wait until the communication pointed to by req is complete.

req (INTEGER) input/output, identifier associated with a communication event (initiated by MPI_ISEND or MPI_IRECV).

status (INTEGER) array of size MPI_STATUS_SIZE; if req was associated with a call to MPI_IRECV, status contains information on the received message, otherwise status may contain an error code.

ierr (INTEGER) output, error code (if ierr=0 no error occurs).

C:

int MPI_Wait(MPI_Request *req, MPI_Status *status);
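To make the latency-hiding idea concrete, here is a small C sketch (not from the original slides) in which rank 0 posts a non-blocking send, does some unrelated local work, and only then waits for completion; the array size and the dummy work are illustrative assumptions.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int myid, i;
  double a[1000], local = 0.0;
  MPI_Request req;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  for (i = 0; i < 1000; ++i) a[i] = (double)i;

  if (myid == 0) {
    /* start the send, overlap it with work that does not touch the buffer, then wait */
    MPI_Isend(a, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    for (i = 0; i < 1000; ++i) local += (double)i * i;
    MPI_Wait(&req, &status);          /* after this, a may be safely overwritten */
  } else if (myid == 1) {
    MPI_Irecv(a, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    /* ... other work not touching a ... */
    MPI_Wait(&req, &status);          /* after this, the received data may be used */
    printf("rank 1 received, a[999] = %f\n", a[999]);
  }

  MPI_Finalize();
  return 0;
}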

(44)

Waiting and Testing for Completion

Fortran:

MPI_TEST(req, flag, status, ierr)

A call to this subroutine sets flag to .TRUE. if the communication pointed to by req is complete, and sets flag to .FALSE. otherwise.

req (INTEGER) input/output, identifier associated with a communication event (initiated by MPI_ISEND or MPI_IRECV).

flag (LOGICAL) output, .TRUE. if communication req has completed, .FALSE. otherwise.

status (INTEGER) array of size MPI_STATUS_SIZE; if req was associated with a call to MPI_IRECV, status contains information on the received message, otherwise status may contain an error code.

ierr (INTEGER) output, error code (if ierr=0 no error occurs).

C:

int MPI_Test(MPI_Request *req, int *flag, MPI_Status *status);

(45)

Send and Receive, the easy way.

The easiest way to send and receive data without worrying about deadlocks.

Fortran:

CALL MPI_SENDRECV(sndbuf, snd_size, snd_type, dest_id, tag, rcvbuf, rcv_size, rcv_type, sour_id, tag, comm, status, ierr)

The first group of arguments (sndbuf, snd_size, snd_type, dest_id, tag) describes the sender side; the second group (rcvbuf, rcv_size, rcv_type, sour_id, tag) describes the receiver side.

(46)

Send and Receive, the easy way.

PROGRAM send_recv
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
    CALL MPI_SENDRECV(a, 2, MPI_REAL, 1, 10, b, 2, MPI_REAL, 1, 11, MPI_COMM_WORLD, status, ierr)
  ELSE IF( myid .EQ. 1 ) THEN
    a(1) = 3.0
    a(2) = 5.0
    CALL MPI_SENDRECV(a, 2, MPI_REAL, 0, 11, b, 2, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, ierr)
  END IF
  WRITE(6,*) myid, ': b(1)=', b(1), ' b(2)=', b(2)
  CALL MPI_FINALIZE(ierr)
END

(47)

Domain decomposition and MPI

MPI is particularly suited to a domain decomposition approach, where there is a single program flow.

The parallel computation consists of a number of processes, each working on some local data. Each process has purely local variables (no access to remote memory).

Sharing of data takes place by message passing, i.e. by explicitly sending and receiving data between processes.

(48)

Collective Communications

Communications involving a group of processes, called by all processes in a communicator:

 Barrier synchronization
 Broadcast
 Gather/Scatter
 Reduction (sum, max, prod, ...)

(49)

Characteristics

Collective communications do not interfere with point-to-point communication and vice-versa.

All processes must call the collective routine.

No non-blocking collective communication.

No tags.

Receive buffers must be exactly the right size.

It is the safest communication mode.

(50)

MPI_Barrier

Stops processes until all processes within a communicator reach the barrier.

Fortran:

CALL MPI_BARRIER(comm, ierr)

C:

int MPI_Barrier(MPI_Comm comm)

(51)

Barrier

[Diagram: processes P0-P4 reach the barrier at different times t1, t2, t3; all of them continue only after the last one has arrived at the barrier.]

(52)

Broadcast (MPI_BCAST)

One-to-all communication: same data sent from root process to all others in the communicator

Fortran:

INTEGER count, type, root, comm, ierr
CALL MPI_BCAST(buf, count, type, root, comm, ierr)

buf: array of type type

C:

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

All processes must specify the same root and comm.

(53)

Broadcast

[Diagram: before the broadcast only P0 holds a1; afterwards every process P0-P3 holds a1.]

PROGRAM broad_cast
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  IF( myid .EQ. 0 ) THEN
    a(1) = 2.0
    a(2) = 4.0
  END IF
  CALL MPI_BCAST(a, 2, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  WRITE(6,*) myid, ': a(1)=', a(1), ' a(2)=', a(2)
  CALL MPI_FINALIZE(ierr)
END

(54)

Scatter / Gather

[Diagram: Scatter distributes the elements a1, a2, a3, a4 of P0's sndbuf to the rcvbuf of P0, P1, P2, P3 respectively; Gather performs the inverse operation, collecting one element from each process into P0's buffer.]

(55)

MPI_Scatter

One-to-all communication: different data sent from root process to all others in the communicator

Fortran:

CALL MPI_SCATTER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)

The arguments are defined as in the other MPI subroutines.

sndcount is the number of elements sent to each process, not the size of sndbuf, which should be sndcount times the number of processes in the communicator.

The sender arguments are meaningful only on root.

(56)
(57)

MPI_SCATTERV

Usage

int MPI_Scatterv(void *sendbuf,           /* in  */
                 int *sendcounts,         /* in  */
                 int *displs,             /* in  */
                 MPI_Datatype sendtype,   /* in  */
                 void *recvbuf,           /* out */
                 int recvcount,           /* in  */
                 MPI_Datatype recvtype,   /* in  */
                 int root,                /* in  */
                 MPI_Comm comm);          /* in  */

Description

Distributes individual messages from root to each process in the communicator

Messages can have different sizes and displacements
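A minimal C sketch (not part of the original slides) of how the counts and displacements might be set up: here root 0 sends i+1 elements to rank i; the sizes and values are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int myid, nproc, i;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  /* rank i receives i+1 elements (illustrative choice) */
  int *sendcounts = malloc(nproc * sizeof(int));
  int *displs     = malloc(nproc * sizeof(int));
  int total = 0;
  for (i = 0; i < nproc; ++i) {
    sendcounts[i] = i + 1;
    displs[i] = total;            /* each chunk starts where the previous one ended */
    total += sendcounts[i];
  }

  double *sendbuf = NULL;
  if (myid == 0) {                /* the send buffer is meaningful only on root */
    sendbuf = malloc(total * sizeof(double));
    for (i = 0; i < total; ++i) sendbuf[i] = (double)i;
  }

  double *recvbuf = malloc((myid + 1) * sizeof(double));
  MPI_Scatterv(sendbuf, sendcounts, displs, MPI_DOUBLE,
               recvbuf, myid + 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  printf("rank %d received %d elements, first = %f\n", myid, myid + 1, recvbuf[0]);

  free(sendcounts); free(displs); free(recvbuf);
  if (myid == 0) free(sendbuf);
  MPI_Finalize();
  return 0;
}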

(58)
(59)

MPI_Gather

All-to-one communication: different data collected by the root process from all the other processes in the communicator. It is the opposite of Scatter.

Fortran:

CALL MPI_GATHER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)

The arguments are defined as in the other MPI subroutines.

rcvcount is the number of elements collected from each process, not the size of rcvbuf, which should be rcvcount times the number of processes in the communicator.

The receiver arguments are meaningful only on root.

(60)
(61)

MPI_GATHERV

Usage

int MPI_Gatherv(void *sendbuf,            /* in  */
                int sendcount,            /* in  */
                MPI_Datatype sendtype,    /* in  */
                void *recvbuf,            /* out */
                int *recvcounts,          /* in  */
                int *displs,              /* in  */
                MPI_Datatype recvtype,    /* in  */
                int root,                 /* in  */
                MPI_Comm comm);           /* in  */

Description

Collects individual messages from each process in the communicator to the root process and stores them in rank order

Messages can have different sizes and displacements

(62)
(63)

Scatter example

PROGRAM scatter
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, nsnd, i, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(16), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  IF( myid .eq. root ) THEN
    DO i = 1, 16
      a(i) = REAL(i)
    END DO
  END IF
  nsnd = 2
  CALL MPI_SCATTER(a, nsnd, MPI_REAL, b, nsnd,
 &                 MPI_REAL, root, MPI_COMM_WORLD, ierr)
  WRITE(6,*) myid, ': b(1)=', b(1), ' b(2)=', b(2)
  CALL MPI_FINALIZE(ierr)
END

(64)

Gather example

PROGRAM gather
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, nsnd, i, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(16), B(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  b(1) = REAL( myid )
  b(2) = REAL( myid )
  nsnd = 2
  CALL MPI_GATHER(b, nsnd, MPI_REAL, a, nsnd,
 &                MPI_REAL, root, MPI_COMM_WORLD, ierr)
  IF( myid .eq. root ) THEN
    DO i = 1, (nsnd*nproc)
      WRITE(6,*) myid, ': a(i)=', a(i)
    END DO
  END IF
  CALL MPI_FINALIZE(ierr)
END

(65)

MPI_Alltoall

[Diagram: MPI_Alltoall acts as a transposition. Before the call, process Pi holds the row (ai, bi, ci, di) in sndbuf; after the call, P0 holds (a1, a2, a3, a4), P1 holds (b1, b2, b3, b4), and so on in rcvbuf.]

Fortran:

CALL MPI_ALLTOALL(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, comm, ierr)

Very useful to implement data transposition.
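As a sketch of this use (not from the original slides), each rank below sends one integer to every other rank; the value encodes the sender so the transposition is visible in the output. The encoding is an illustrative assumption.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int myid, nproc, i;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  int *sndbuf = malloc(nproc * sizeof(int));
  int *rcvbuf = malloc(nproc * sizeof(int));
  for (i = 0; i < nproc; ++i)
    sndbuf[i] = 100 * myid + i;        /* element destined for rank i */

  /* one element from every rank to every rank: a distributed transposition */
  MPI_Alltoall(sndbuf, 1, MPI_INT, rcvbuf, 1, MPI_INT, MPI_COMM_WORLD);

  for (i = 0; i < nproc; ++i)
    printf("rank %d received %d from rank %d\n", myid, rcvbuf[i], i);

  free(sndbuf); free(rcvbuf);
  MPI_Finalize();
  return 0;
}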

(66)

Reduction

The reduction operation allows you to:

• collect data from each process
• reduce the data to a single value
• store the result on the root process
• or store the result on all processes
• overlap communication and computing

(67)

Reduce, Parallel Sum

[Diagram: each process Pi holds values ai and bi; the parallel sum computes Sa = a1+a2+a3+a4 and Sb = b1+b2+b3+b4, and every process P0-P3 ends up holding Sa and Sb.]

Reduction functions work with arrays.

Other operations: product, min, max, and, ...

(68)

MPI_REDUCE and MPI_ALLREDUCE

Fortran:

MPI_REDUCE(snd_buf, rcv_buf, count, type, op, root, comm, ierr)

snd_buf  input array of type type containing local values
rcv_buf  output array of type type containing global results
count    (INTEGER) number of elements of snd_buf and rcv_buf
type     (INTEGER) MPI type of snd_buf and rcv_buf
op       (INTEGER) parallel operation to be performed
root     (INTEGER) MPI id of the process storing the result
comm     (INTEGER) communicator of the processes involved in the operation
ierr     (INTEGER) output, error code (if ierr=0 no error occurs)

MPI_ALLREDUCE(snd_buf, rcv_buf, count, type, op, comm, ierr)

(69)

Predefined Reduction Operations

MPI op Function

MPI_MAX Maximum

MPI_MIN Minimum

MPI_SUM Sum

MPI_PROD Product

MPI_LAND Logical AND

MPI_BAND Bitwise AND

MPI_LOR Logical OR

MPI_BOR Bitwise OR

MPI_LXOR Logical exclusive OR

MPI_BXOR Bitwise exclusive OR

MPI_MAXLOC Maximum and location

MPI_MINLOC Minimum and location

(70)

Reduce / 1

C:

int MPI_Reduce(void * snd_buf, void * rcv_buf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm)

int MPI_Allreduce(void *snd_buf, void *rcv_buf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm)
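Since the example on the next slide uses MPI_REDUCE in Fortran, here is a small C sketch (not in the original slides) of MPI_Allreduce computing a global sum that every rank receives; the local value is an arbitrary choice.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int myid;
  double local, global;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  local = 1.0 + myid;                 /* each rank contributes a local value */

  /* sum the local values; every rank gets the result (no root argument) */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d: global sum = %f\n", myid, global);

  MPI_Finalize();
  return 0;
}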

(71)

Reduce, example

PROGRAM reduce
  INCLUDE 'mpif.h'
  INTEGER ierr, myid, nproc, root
  INTEGER status(MPI_STATUS_SIZE)
  REAL A(2), res(2)
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  root = 0
  a(1) = 2.0*myid
  a(2) = 4.0+myid
  CALL MPI_REDUCE(a, res, 2, MPI_REAL, MPI_SUM, root,
 &                MPI_COMM_WORLD, ierr)
  IF( myid .EQ. 0 ) THEN
    WRITE(6,*) myid, ': res(1)=', res(1), ' res(2)=', res(2)
  END IF
  CALL MPI_FINALIZE(ierr)
END

(72)

Shared Memory System

Shared memory refers to a large block of RAM that can be accessed by several different CPUs in a multiple-processor computer system.

Usually the system is a Symmetric MultiProcessor (SMP). SMP involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory.

(73)

• The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms.

• OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

(74)

Pros of OpenMP

• easier to program and debug;

• directives can be added incrementally;

• gradual parallelization;

• can still run the program as a serial code;

• serial code statements usually don't need modification.

Cons of OpenMP

• can only be run in shared memory computers;

• Traffic between CPU and memory increases with the number of CPUs;

(75)

• The OpenMP API uses the fork-join model of parallel execution.

• An OpenMP program begins as a single thread of execution, called the initial thread. The initial thread executes sequentially until it encounters a parallel construct.

• The initial thread then creates a team of threads and becomes the master of the new team. Beyond the end of the parallel construct, only the master thread resumes execution.

(76)

OpenMP consists of a set of:

• Compiler directives

• Runtime library routines

• Environment variables

(77)

OpenMP directives for C/C++ are specified with the pragma preprocessing directive.

The syntax of an OpenMP directive is formally specified as follows:

• C/C++:

#pragma omp directive-name [clause[[,]clause]...]

• Fortran:

!$omp directive-name [clause[[,]clause]...]

(78)

Parallel construct

• Starts parallel execution.

• A team of threads is created to execute the parallel region.

• The thread that encountered the parallel construct becomes the master thread of the new team, with thread number zero.

• There is an implicit barrier at the end of the construct.

• Any number of parallel constructs can be specified in a single program.

(79)

A first program in Fortran:

PROGRAM HELLO
  INTEGER VAR1, VAR2, VAR3
  ! Serial code

  ! Beginning of parallel region.
  ! Fork a team of threads.
  ! Specify variable scoping.
!$OMP PARALLEL
  PRINT *, 'Hello World!!!'
!$OMP END PARALLEL
  ! Resume serial code
END

(80)

A first program in C:

#include <omp.h>

int main () {
  int var1, var2, var3;

  /* Serial code */

  /* Beginning of parallel region. Fork a team of threads. */
  /* Specify variable scoping. */
  #pragma omp parallel
  {
    /* Parallel section executed by all threads */
  }
  /* All threads join the master thread and disband */

  /* Resume serial code */
  return 0;
}

(81)

Worksharing construct

• A worksharing construct distributes the execution of the associated region among the members of the team that encounters it.

• A worksharing region has no barrier on entry; however, an implied barrier exists at the end of the worksharing region.

• If a nowait clause is present, an implementation may omit the barrier at the end of the worksharing region.

→ Loop , Sections, Single, Workshare

(82)

Loop construct (DO/for)

The loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team in the context of their implicit tasks. The iterations are distributed across threads that already exist in the team executing the parallel region to which the loop region binds.

(83)

Loop construct (DO/for)

Fortran:

integer :: i
integer, parameter :: n = 200
real :: a(n), b(n), c(n)

do i = 1, n
  a(i) = b(i) + c(i)
enddo

(84)

Loop construct (DO/for)

Fortran:

integer :: i
integer, parameter :: n = 200
real :: a(n), b(n), c(n)

!$OMP PARALLEL
!$OMP DO
do i = 1, n
  a(i) = b(i) + c(i)
enddo
!$OMP END DO
!$OMP END PARALLEL

(85)

Loop construct (DO/for)

C/C++:

#pragma omp parallel
{
  #pragma omp for
  for (i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}

(86)

Scheduling

Specifies how iterations of the associated loops are divided into contiguous non-empty subsets, called chunks, and how these chunks are distributed among the threads of the team.

• Static: iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in round-robin fashion in order of thread number.

• Dynamic: iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed.

(87)

Scheduling

• Guided: iterations are assigned to threads in the team in chunks as the executing threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. The chunk size decreases over time.

• Runtime: the decision regarding scheduling is deferred until run time.

• Auto: the decision regarding scheduling is delegated to the compiler and/or runtime system.

(88)

Schedule clauses
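As a sketch of the schedule clause syntax (not part of the original slides), the C loop below uses a dynamic schedule with a chunk size of 4; the chunk size and loop bounds are illustrative assumptions.

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i;
  double a[1000];

  /* chunks of 4 iterations are handed out to threads as they become free */
  #pragma omp parallel for schedule(dynamic, 4)
  for (i = 0; i < 1000; ++i)
    a[i] = 2.0 * i;

  /* schedule(static, 4), schedule(guided, 4) or schedule(runtime)
     can be substituted above to compare the behaviours */
  printf("a[999] = %f\n", a[999]);
  return 0;
}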

(89)

Sections construct

The sections construct is a non-iterative worksharing construct that contains a set of structured blocks that are to be distributed among, and executed by, the threads in a team.

(90)

Workshare: Sections construct

Fortran:

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
  call subr_A(c,d)
!$OMP SECTION
  call subr_B(e,f)
!$OMP SECTION
  call subr_c(g,h,i)
!$OMP END SECTIONS
!$OMP END PARALLEL

(91)

Workshare: Sections construct

C/C++:

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    A = subr_A(c,d);
    #pragma omp section
    B = subr_B(e,f);
    #pragma omp section
    C = subr_c(g,h,i);
  }
}

(92)

Workshare: Single construct

The single construct specifies that the associated structured block is executed by only one of the threads in the team (not necessarily the master thread).

The other threads in the team, which do not execute the block, wait at an implicit barrier at the end of the single construct unless a nowait clause is specified.

(93)

Workshare: Single construct

Fortran:

!$OMP PARALLEL
!$OMP SINGLE
  read *, a
!$OMP END SINGLE
!$OMP END PARALLEL

(94)

Workshare: Single construct

C/C++:

#pragma omp parallel
{
  #pragma omp single
  printf("Beginning work\n");
}

(95)

Workshare

The parallel construct combined with worksharing constructs:

!$omp parallel do
  ...
!$omp end parallel do

!$omp parallel workshare
  ...
!$omp end parallel workshare

!$omp parallel sections
  ...
!$omp end parallel sections

(96)


(97)

Master & Synchronization constructs

• Master

• Critical

• Atomic

• Barrier

• Ordered

(98)

Master construct

The master construct specifies a structured block that is executed by the master thread of the team.

There is no implied barrier either on entry to, or exit from, the master construct.

Fortran:

!$OMP PARALLEL
!$OMP MASTER
  read *, a
!$OMP END MASTER
!$OMP END PARALLEL

(99)

Master construct

C/C++:

#pragma omp parallel
{
  #pragma omp master
  printf("Beginning work\n");
}

(100)

Critical construct

The critical construct restricts execution of the associated structured block to a single thread at a time. An optional name may be used to identify the critical construct.

Fortran:

!$OMP PARALLEL

!$OMP CRITICAL [NAME]

X=FUNC_A(X)

!$OMP END CRITICAL

!$OMP END PARALLEL

(101)

Critical construct

C/C++:

#pragma omp parallel
{
  #pragma omp critical [name]
  x = subr_A(x);
}

(102)

Barrier construct

The barrier construct specifies an explicit barrier at the point at which the construct appears.

Fortran:

!$OMP PARALLEL

X=FUNC_A(X)

!$OMP BARRIER

!$OMP END PARALLEL

(103)

Barrier construct

The barrier construct specifies an explicit barrier at the point at which the construct appears.

C/C++:

#pragma omp parallel
{
  x = subr_A(x);
  #pragma omp barrier
}

(104)

Atomic construct

The atomic construct ensures that a specific storage location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writing threads.

Fortran:

!$OMP PARALLEL
!$OMP ATOMIC
  X = X + 1
!$OMP END PARALLEL

(105)

Atomic construct

The atomic construct ensures that a specific storage location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writing threads.

C/C++:

#pragma omp parallel
{
  #pragma omp atomic
  x++;
}

(106)

Ordered construct

The ordered construct specifies a structured block in a loop region that will be executed in the order of the loop iterations. This sequentializes and orders the code within an ordered region while allowing code outside the region to run in parallel.

(107)

Ordered construct

Fortran:

!$OMP PARALLEL
!$OMP DO ORDERED
DO i = 1, N
  A(i) = ...
!$OMP ORDERED
  PRINT *, a(i)
!$OMP END ORDERED
ENDDO
!$OMP END DO
!$OMP END PARALLEL

(108)

Ordered construct

C/C++:

#pragma omp parallel
{
  #pragma omp for ordered
  for (i = 0; i < n; ++i)
  {
    a[i] = b[i] + 1.0;
    #pragma omp ordered
    {
      printf("%f\n", a[i]);
    }
  }
}

(109)

OpenMP Memory Model

Fortran:

integer :: i = 5, n = 200
real :: tmp = 7

!$OMP PARALLEL
!$OMP DO
do i = 1, n
  tmp = func(b(i))
  a(i) = b(i) + tmp
enddo
!$OMP END DO
!$OMP END PARALLEL

Here tmp is shared by all threads, so the updates race with each other.

(110)

OpenMP Memory Model

Fortran:

integer :: i = 5, n = 200
real :: tmp = 7

!$OMP PARALLEL
!$OMP DO PRIVATE(tmp)
do i = 1, n
  tmp = func(b(i))
  a(i) = b(i) + tmp
enddo
!$OMP END DO
!$OMP END PARALLEL

With the PRIVATE(tmp) clause each thread works on its own copy of tmp.

(111)

OpenMP Memory Model

OpenMP provides a consistent shared-memory model. All threads have access to the main memory to retrieve shared variables.

Each thread also has access to another type of memory that cannot be accessed by other threads, called threadprivate memory.

A directive that accepts data-sharing attribute clauses determines two kinds of access to variables used in the directive's associated structured block: shared and private.

(112)

Data-Sharing Attribute Clauses

• Shared: declares a list of one or more items to be shared by the threads generated by a parallel construct.

• Private: declares one or more list items to be private to a task. No other thread can access this data; changes are visible only to the thread owning the data.

• Firstprivate: declares one or more list items to be private to a task, and initializes each of them with the value that the corresponding original item has when the construct is encountered.
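A small C sketch (not in the original slides) contrasting the clauses; the variable names and values are illustrative.

#include <stdio.h>
#include <omp.h>

int main(void) {
  int shared_sum = 0;   /* shared: a single copy visible to all threads          */
  int scratch;          /* private: each thread gets its own, undefined, copy    */
  int base = 10;        /* firstprivate: each thread's copy is initialized to 10 */

  #pragma omp parallel shared(shared_sum) private(scratch) firstprivate(base)
  {
    scratch = omp_get_thread_num();   /* a private copy must be set before use */
    #pragma omp atomic
    shared_sum += base + scratch;     /* every thread sees base == 10 */
  }

  printf("shared_sum = %d, base after the region = %d\n", shared_sum, base);
  return 0;
}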

(113)

Data-Sharing Attribute Clauses

Lastprivate: declares one or more list items to be private to an implicit task, and causes the corresponding original list item to be updated after the end of the region with the value from the sequentially last iteration.

!$omp do lastprivate(i)
do i = 1, n-1
  a(i) = b(i+1)
enddo
!$omp end do
a(i) = b(0)

(114)

Data-Sharing Attribute Clauses

do i = 1, n
  x = x + a(i)
enddo

In this loop every thread would update the same shared variable x, so the result is wrong unless the update is protected; the reduction clause below handles exactly this case.

(115)

Data-Sharing Attribute Clauses

Reduction: the reduction clause specifies an operator and one or more list items. For each list item, a private copy is created in each implicit task and is initialized appropriately for the operator. After the end of the region, the original list item is updated with the values of the private copies using the specified operator.

!$omp do reduction(+:x)
do i = 1, n
  x = x + a(i)
enddo
!$omp end do

Support for most arithmetic and logical operators: +, *, -, MIN, MAX, .AND., .OR., ...

(116)

Runtime Libraries Routines

OpenMP provides several user-callable functions to control and query the parallel environment.

• The runtime library routines take precedence over the corresponding environment variables;
• recommended use is under control of conditional compilation (#ifdef _OPENMP);
• C/C++ programs need to include <omp.h>;
• Fortran programs may want to use "USE OMP_LIB" or include "omp_lib.h".

(117)

Runtime Libraries Routines

• num_threads = omp_get_num_threads(): gets the number of threads in the team;

• thread_id = omp_get_thread_num(): gets the thread ID;

• time = omp_get_wtime(): returns elapsed wall clock time in seconds.

START = omp_get_wtime()
... work to be timed ...
END = omp_get_wtime()
PRINT *, 'Work took', END - START, ' seconds'
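A small C sketch (not in the original slides) combining the three routines; timing a trivial parallel region is just for illustration.

#include <stdio.h>
#include <omp.h>

int main(void) {
  double start = omp_get_wtime();          /* wall-clock time in seconds */

  #pragma omp parallel
  {
    int tid  = omp_get_thread_num();       /* ID of the calling thread    */
    int ntot = omp_get_num_threads();      /* threads in the current team */
    printf("Hello from thread %d of %d\n", tid, ntot);
  }

  double end = omp_get_wtime();
  printf("The parallel region took %f seconds\n", end - start);
  return 0;
}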

(118)

Conditional Compilation

With OpenMP compilation, the _OPENMP macro is defined.

C:

#ifdef _OPENMP
  printf("Compiled with OpenMP support\n");
#else
  printf("Compiled for serial execution.\n");
#endif

Fortran:

!$ print *, 'Compiled with OpenMP support'

(119)

Environment Variables

• OMP_NUM_THREADS: sets the number of threads to use for parallel regions;

• OMP_SCHEDULE: controls the schedule type and chunk size of all loop directives that have the schedule type runtime.

• OMP_STACKSIZE: specifies the size of the stack for threads created by the OpenMP implementation.

csh:

% setenv OMP_NUM_THREADS 8

% setenv OMP_SCHEDULE "guided,4"

sh:

$ export OMP_NUM_THREADS=8

$ export OMP_SCHEDULE="guided,4"

(120)

OpenMP Compiler

GNU (version >= 4.3.2): compile with -fopenmp (Linux, Solaris, AIX, MacOSX, Windows).

IBM: compile with -qsmp=omp (Windows, AIX and Linux).

Sun Microsystems: compile with -xopenmp (Solaris and Linux).

Intel: compile with -Qopenmp on Windows, or just -openmp on Linux or Mac; -openmp-report2 emits useful information to stderr.

Portland Group compilers: compile with -mp; -Minfo=mp emits useful information to stderr.

(121)

Exercises

Write a serial “Hello World”.

Add OpenMP directives to make each thread writing the same message.

Add a new message stating whether the code has been compiled in serial or parallel mode.

Using the environment variable OMP_NUM_THREADS, try to execute the code with different numbers of threads.

Implement a routine that provides the “thread_id” and the number of threads that are currently executing the code without using OpenMP functions. The message has to look like

“Hello World from thread_id 1 of 4 thread_tot”.

Afterwards, print this message according to the id order:

“Hello World from thread_id 0 of 4 thread_tot”.

“Hello World from thread_id 1 of 4 thread_tot”.

“Hello World from thread_id 2 of 4 thread_tot”.

“Hello World from thread_id 3 of 4 thread_tot”.

(122)

Exercises

optional:

Compute the area below the plot of a function using a computational method for integration, such as Simpson's rule.

For simplicity you can choose a linear function whose plot defines a right triangle.

Split the computational kernel among the threads.

(123)

Hybrid programming MPI+OpenMP

(124)

The hybrid model

 Multiple SMP (Symmetric Multiprocessor) nodes connected by an interconnection network.

 Each node is mapped to (at least) one MPI process plus several OpenMP threads.
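As a minimal illustration of the model (not in the original slides), one MPI process can spawn an OpenMP team of threads; requesting MPI_THREAD_FUNNELED is a typical choice when only the master thread makes MPI calls.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
  int rank, provided;

  /* ask for a thread level where only the master thread makes MPI calls */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  #pragma omp parallel
  {
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}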

(125)

MPI vs. OpenMP

Pure MPI Pro:

 High scalability

 High portability

 No false sharing

 Scalability out-of-node

Pure MPI Con:

 Hard to develop and debug.

 Explicit communications

 Coarse granularity

 Hard to ensure load balancing

Pure OpenMP Pro:

 Easy to deploy (often)
 Low latency
 Implicit communications
 Coarse and fine granularity
 Dynamic load balancing

Pure OpenMP Con:

 Only on shared memory machines
 Intranode scalability only
 Possible long waits for unlocking data
 No specific thread order

(126)

Why hybrid?

 The MPI+OpenMP hybrid paradigm is the trend for clusters with SMP architecture.

 Elegant in concept: use OpenMP within the node and MPI between nodes, in order to make good use of shared resources.

 Avoids additional communication within the node.

 OpenMP introduces fine granularity.

 Two-level parallelism introduces other problems.

 Some problems can be reduced by lowering the number of MPI processes.

 If the problem is suitable, the hybrid approach can perform better than pure MPI or pure OpenMP codes.

(127)

Why mixing MPI and OpenMP code can be slower?

 OpenMP has lower scalability because of the locking of resources, while MPI has no such potential scalability limits.

 All threads but one are idle during an MPI communication: computation and communication need to be overlapped to improve performance.

 Critical sections for shared variables.

 Overhead of thread creation.

 Cache coherency and false sharing.

 Pure OpenMP code is generally slower than pure MPI code: fewer optimizations by OpenMP compilers compared to MPI.
