Introduction to the Message Passing Interface (MPI) Andrea Clematis IMATI CNR


1 Introduction to the Message Passing Interface (MPI) Andrea Clematis IMATI CNR

2 Acknowledgements & references An Introduction to MPI: Parallel Programming with the Message Passing Interface, William Gropp, Ewing Lusk, Argonne National Laboratory. Online examples are available at ftp://ftp.mcs.anl.gov/mpi/mpiexmpl.tar.gz; the archive contains source code and run scripts that allow you to evaluate your own MPI implementation.

3 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

4 Collective communication All or None: Collective communication must involve all processes in the scope of a communicator. All processes are, by default, members of the communicator MPI_COMM_WORLD. It is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operation.

5 Collective communication Types of Collective Operations: Synchronization - processes wait until all members of the group have reached the synchronization point. Data Movement - broadcast, scatter/gather, all to all. Collective Computation (reductions) - one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.

6 Collective communication Programming Considerations and Restrictions: Collective operations are blocking. Collective communication routines do not take message tag arguments. Collective operations within subsets of processes are accomplished by first partitioning the processes into new groups and then attaching the new groups to new communicators. Collective operations can be used with MPI predefined datatypes as well as with MPI derived datatypes, provided the type signatures match as described below.

7 Collective communication MPI_Barrier Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. MPI_Barrier (comm) MPI_BARRIER (comm,ierr) C synopsis #include <mpi.h> int MPI_Barrier(MPI_Comm comm); Parameters: comm is a communicator (handle) (IN) IERROR is the FORTRAN return code. It is always the last argument.
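A minimal sketch of a common MPI_Barrier use (my example, not from the original slides): separating timing phases, since MPI_Wtime is a per-process clock.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* every process starts together */
    double t0 = MPI_Wtime();
    /* ... some computation or communication phase ... */
    MPI_Barrier(MPI_COMM_WORLD);          /* every process has finished */
    if (rank == 0)
        printf("phase took %f s\n", MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}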

8 Collective communication MPI_Bcast Broadcasts (sends) a message from the process with rank "root" to all other processes in the group. MPI_Bcast (&buffer,count,datatype,root,comm) MPI_BCAST (buffer,count,datatype,root,comm,ierr) C synopsis #include <mpi.h> int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm); Parameters buffer is the starting address of the buffer (choice) (INOUT) count is the number of elements in the buffer (integer) (IN) datatype is the datatype of the buffer elements (handle) (IN) root is the rank of the root task (integer) (IN) comm is the communicator (handle) (IN) IERROR is the FORTRAN return code. It is always the last argument.
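A minimal MPI_Bcast sketch (my example, not from the slides; the value 100 is arbitrary): the root reads or computes a parameter, and every process, root included, calls MPI_Bcast with the same root argument.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, n = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 100;                             /* only the root knows the value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d has n = %d\n", rank, n); /* every rank now prints 100 */

    MPI_Finalize();
    return 0;
}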

9 Collective communication [figure not reproduced in this transcription]

10 MPI_Scatter Collective communication Distributes distinct messages from a single source task to each task in the group. MPI_Scatter (&sendbuf,sendcnt,sendtype,&recvbuf,... recvcnt,recvtype,root,comm) MPI_SCATTER (sendbuf,sendcnt,sendtype,recvbuf,... recvcnt,recvtype,root,comm,ierr) C synopsis #include <mpi.h> int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

11 MPI_Scatter Collective communication MPI_SCATTER distributes individual messages from root to each task in comm. This subroutine is the inverse operation to MPI_GATHER. The type signature associated with sendcount, sendtype at the root must be equal to the type signature associated with recvcount, recvtype at all tasks. (Type maps can be different.) This means the amount of data sent must be equal to the amount of data received, pairwise between each task and the root. Distinct type maps between sender and receiver are allowed. The following is information regarding MPI_SCATTER arguments and tasks: On the task root, all arguments to the function are significant. On other tasks, only the arguments recvbuf, recvcount, recvtype, root, and comm are significant. The argument root must be the same on all tasks. A call where the specification of counts and types causes any location on the root to be read more than once is erroneous.
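A minimal scatter/gather round trip (my sketch, not from the slides): the root distributes equal chunks, each process works on its chunk, and MPI_Gather collects the results. It assumes the element count divides evenly among the processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 4;                  /* elements per process */
    double *sendbuf = NULL;
    if (rank == 0) {                      /* send buffer significant only at root */
        sendbuf = malloc(size * chunk * sizeof(double));
        for (int i = 0; i < size * chunk; i++) sendbuf[i] = i;
    }
    double recvbuf[4];                    /* chunk elements */
    MPI_Scatter(sendbuf, chunk, MPI_DOUBLE,
                recvbuf, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++) recvbuf[i] *= 2.0;   /* local work */
    MPI_Gather(recvbuf, chunk, MPI_DOUBLE,
               sendbuf, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("last element = %g\n", sendbuf[size * chunk - 1]);
        free(sendbuf);
    }
    MPI_Finalize();
    return 0;
}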

12 Collective communication [figure not reproduced in this transcription]

13 int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

14 15 [slides not reproduced in this transcription]

16 Collective communication MPI_Gather Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter. MPI_Gather (&sendbuf,sendcnt,sendtype,&recvbuf,... recvcount,recvtype,root,comm) MPI_GATHER (sendbuf,sendcnt,sendtype,recvbuf,... recvcount,recvtype,root,comm,ierr) C synopsis #include <mpi.h> int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

17 Collective communication The type signature of sendcount, sendtype on task i must be equal to the type signature of recvcount, recvtype at the root. This means the amount of data sent must be equal to the amount of data received, pairwise between each task and the root. Distinct type maps between sender and receiver are allowed. The following is information regarding MPI_GATHER arguments and tasks: On the task root, all arguments to the function are significant. On other tasks, only the arguments sendbuf, sendcount, sendtype, root, and comm are significant. The argument root must be the same on all tasks. Note that the argument recvcount at the root indicates the number of items it receives from each task, not the total number of items received. A call where the specification of counts and types causes any location on the root to be written more than once is erroneous.

18 Collective communication Parameters:
sendbuf is the starting address of the send buffer (choice) (IN)
sendcount is the number of elements in the send buffer (integer) (IN)
sendtype is the datatype of the send buffer elements (handle) (IN)
recvbuf is the address of the receive buffer (choice, significant only at root) (OUT)
recvcount is the number of elements for any single receive (integer, significant only at root) (IN)
recvtype is the datatype of the receive buffer elements (handle, significant only at root) (IN)
root is the rank of the receiving task (integer) (IN)
comm is the communicator (handle) (IN)
IERROR is the FORTRAN return code. It is always the last argument.

19 Collective communication [figure not reproduced in this transcription]

20 MPI_Gather (&sendbuf,sendcnt,sendtype,&recvbuf,... recvcount,recvtype,root,comm)

21 [slide not reproduced in this transcription]

22 MPI_Scatterv Scatters a buffer in parts to all tasks in a group. Synopsis #include "mpi.h" int MPI_Scatterv ( void *sendbuf, int *sendcounts, int *displs, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm )
Input Parameters:
sendbuf address of send buffer (choice, significant only at root)
sendcounts integer array (of length group size) specifying the number of elements to send to each processor
displs integer array (of length group size); entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i
sendtype data type of send buffer elements (handle)
recvcount number of elements in receive buffer (integer)
recvtype data type of receive buffer elements (handle)
root rank of sending process (integer)
comm communicator (handle)
Output Parameter:
recvbuf address of receive buffer (choice)
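A hedged MPI_Scatterv sketch (my example, not from the slides): process i receives i+1 elements, so sendcounts and displs differ per rank; the fixed receive buffer assumes at most 64 processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendcounts = malloc(size * sizeof(int));
    int *displs     = malloc(size * sizeof(int));
    int total = 0;
    for (int i = 0; i < size; i++) {
        sendcounts[i] = i + 1;            /* rank i gets i+1 elements */
        displs[i]     = total;            /* chunk i starts here in sendbuf */
        total        += sendcounts[i];
    }
    int *sendbuf = NULL;
    if (rank == 0) {
        sendbuf = malloc(total * sizeof(int));
        for (int i = 0; i < total; i++) sendbuf[i] = i;
    }
    int recvbuf[64];                      /* assumes at most 64 processes */
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, sendcounts[rank], MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d elements, first = %d\n",
           rank, sendcounts[rank], recvbuf[0]);

    free(sendcounts); free(displs);
    if (rank == 0) free(sendbuf);
    MPI_Finalize();
    return 0;
}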

23 MPI_Reduce Collective communication Applies a reduction operation to all tasks in the group and places the result in one task. MPI_Reduce (&sendbuf,&recvbuf,count,datatype,op,root,comm) MPI_REDUCE (sendbuf,recvbuf,count,datatype,op,root,comm,ierr) C synopsis #include <mpi.h> int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

24 Collective communication This subroutine applies a reduction operation to the vector sendbuf over the set of tasks specified by comm and places the result in recvbuf on root. The input buffer and the output buffer have the same number of elements with the same type. The arguments sendbuf, count, and datatype define the send or input buffer. The arguments recvbuf, count and datatype define the output buffer. MPI_REDUCE is called by all group members using the same arguments for count, datatype, op, and root. If a sequence of elements is provided to a task, the reduction operation is executed element-wise on each entry of the sequence. Here's an example. If the operation is MPI_MAX and the send buffer contains two elements that are floating point numbers (count = 2 and datatype = MPI_FLOAT), recvbuf(1) = global max(sendbuf(1)) and recvbuf(2) = global max(sendbuf(2)). Users can define their own operations or use the predefined operations provided by MPI. User-defined operations can be overloaded to operate on several datatypes, either basic or derived. The argument datatype of MPI_REDUCE must be compatible with op.

25 Collective communication Parameters:
sendbuf is the address of the send buffer (choice) (IN)
recvbuf is the address of the receive buffer (choice, significant only at root) (OUT)
count is the number of elements in the send buffer (integer) (IN)
datatype is the datatype of elements of the send buffer (handle) (IN)
op is the reduction operation (handle) (IN)
root is the rank of the root task (integer) (IN)
comm is the communicator (handle) (IN)
IERROR is the FORTRAN return code. It is always the last argument.

26 Collective communication The predefined MPI reduction operations are MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, and MPI_MINLOC (the table on the original slide is not reproduced here). Users can also define their own reduction functions by using the MPI_Op_create routine.
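A minimal sketch matching the MPI_MAX, count = 2, MPI_FLOAT example described above (my code, not from the slides): the reduction is applied element-wise across all ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float sendbuf[2] = { (float)rank, (float)(10 - rank) };
    float recvbuf[2];
    MPI_Reduce(sendbuf, recvbuf, 2, MPI_FLOAT, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)    /* recvbuf[i] = max over all ranks of sendbuf[i] */
        printf("global max: %g %g\n", recvbuf[0], recvbuf[1]);

    MPI_Finalize();
    return 0;
}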

27 MPI_Scatter Collective communication Parameters:
sendbuf is the address of the send buffer (choice, significant only at root) (IN)
sendcount is the number of elements to be sent to each task (integer, significant only at root) (IN)
sendtype is the datatype of the send buffer elements (handle, significant only at root) (IN)
recvbuf is the address of the receive buffer (choice) (OUT)
recvcount is the number of elements in the receive buffer (integer) (IN)
recvtype is the datatype of the receive buffer elements (handle) (IN)
root is the rank of the sending task (integer) (IN)
comm is the communicator (handle) (IN)
IERROR is the FORTRAN return code. It is always the last argument.

28 Collective communication [figure not reproduced in this transcription]

29 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

30 Array Processing This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent of the other array elements. The serial program calculates one element at a time in sequential order. Serial code could be of the form:
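The slide's listing (probably Fortran) was not reproduced; a C sketch of the kind of serial loop meant here, with fcn standing in for the per-element work, might be:

#include <stdio.h>
#define N 8
static double fcn(int i, int j) { return i + 0.1 * j; }  /* placeholder work */

int main(void) {
    static double a[N][N];
    for (int j = 0; j < N; j++)           /* one element at a time, */
        for (int i = 0; i < N; i++)       /* in sequential order */
            a[i][j] = fcn(i, j);
    printf("a[2][3] = %g\n", a[2][3]);
    return 0;
}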

31 Array Processing The calculation of elements is independent of one another, leading to an embarrassingly parallel situation. The problem should be computationally intensive.

32 Array Processing Parallel Solution 1 Array elements are distributed so that each processor owns a portion of the array (a subarray). Independent calculation of array elements ensures there is no need for communication between tasks. The distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays, which maximizes cache/memory usage. Since it is desirable to have unit stride through the subarrays, the choice of a distribution scheme depends on the programming language (block or cyclic distributions; the diagram of the options is not reproduced here).

33 Array Processing Parallel Solution 1 After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example, with Fortran block distribution (listing not reproduced): notice that only the outer loop variables are different from the serial solution. What would this look like in C?
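A hedged answer to the question above: C stores arrays row-major, so a block distribution with unit stride through the subarray assigns contiguous rows to each task (in column-major Fortran it would be columns). A minimal sketch, reusing the fcn placeholder and assuming the row count divides evenly:

#include <mpi.h>
#include <stdio.h>
#define N 8
static double fcn(int i, int j) { return i + 0.1 * j; }  /* placeholder work */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    double a[N][N];
    int rows = N / ntasks;                /* assumes N divides evenly */
    int ilo = rank * rows, ihi = ilo + rows;
    for (int i = ilo; i < ihi; i++)       /* only the outer loop bounds */
        for (int j = 0; j < N; j++)       /* differ from the serial code */
            a[i][j] = fcn(i, j);

    printf("rank %d computed rows %d..%d (a[ilo][0] = %g)\n",
           rank, ilo, ihi - 1, a[ilo][0]);
    MPI_Finalize();
    return 0;
}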

34 One Possible Solution: Implement as an SPMD model. The master process initializes the array, sends info to the worker processes, and receives results. Each worker process receives info, performs its share of the computation, and sends the results to the master. Using the Fortran storage scheme, perform a block distribution of the array. Pseudo code solution (in the original slide, red highlights the changes for parallelism):

35 Implementation in C

36 40 [C implementation listing not reproduced in this transcription]
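Since the listing on slides 36 40 was not reproduced, here is a hedged C sketch of the SPMD master/worker scheme of slide 34 (my reconstruction, not the original code; it assumes at least 2 processes and that N divides evenly among the workers):

#include <mpi.h>
#include <stdio.h>
#define N 8
static double fcn(int i, int j) { return i + 0.1 * j; }  /* placeholder work */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    int rows = N / (ntasks - 1);          /* rows per worker; rank 0 is master */

    if (rank == 0) {                      /* ---- master ---- */
        double a[N][N];
        for (int w = 1; w < ntasks; w++) {   /* tell each worker its first row */
            int first = (w - 1) * rows;
            MPI_Send(&first, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        for (int w = 1; w < ntasks; w++)     /* collect the computed blocks */
            MPI_Recv(&a[(w - 1) * rows][0], rows * N, MPI_DOUBLE,
                     w, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("a[2][3] = %g\n", a[2][3]);
    } else {                              /* ---- worker ---- */
        int first;
        double block[N][N];               /* scratch space for my rows */
        MPI_Recv(&first, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++)
                block[i][j] = fcn(first + i, j);
        MPI_Send(&block[0][0], rows * N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}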

41 Array Processing Parallel Solution 2: Pool of Tasks The previous array solution demonstrated static load balancing: each task has a fixed amount of work to do, and there may be significant idle time for faster or more lightly loaded processors; the slowest task determines the overall performance. Static load balancing is not usually a major concern if all tasks are performing the same amount of work on identical machines. If you have a load balance problem (some tasks work faster than others), you may benefit from using a "pool of tasks" scheme.

42 Pool of Tasks Scheme: Two types of processes are employed.
Master process: holds the pool of tasks for worker processes to do; sends a worker a task when requested; collects results from the workers.
Worker process: repeatedly gets a task from the master process, performs the computation, and sends the results to the master.
Worker processes do not know before runtime which portion of the array they will handle or how many tasks they will perform. Dynamic load balancing occurs at run time: the faster tasks will get more work to do.

43 Pseudo code solution (in the original slide, red highlights the changes for parallelism):
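The highlighted pseudo code was not reproduced; a hedged C sketch of the pool-of-tasks protocol follows (my reconstruction, not the original: tag 0 carries requests and assignments, tag 1 finished rows, tag 2 the stop signal; run with at least 2 processes).

#include <mpi.h>
#include <stdio.h>
#define N 8
static double fcn(int i, int j) { return i + 0.1 * j; }  /* placeholder work */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    if (rank == 0) {                      /* ---- master: holds the task pool ---- */
        double a[N][N], row[N + 1];       /* row[0] carries the row index */
        int next = 0, done = 0, stopped = 0;
        while (done < N || stopped < ntasks - 1) {
            MPI_Status st;
            MPI_Recv(row, N + 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 1) {        /* a finished row came back */
                int i = (int)row[0];
                for (int j = 0; j < N; j++) a[i][j] = row[j + 1];
                done++;
            }
            if (next < N) {               /* hand out the next task... */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
                next++;
            } else {                      /* ...or tell this worker to stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 2, MPI_COMM_WORLD);
                stopped++;
            }
        }
        printf("a[2][3] = %g\n", a[2][3]);
    } else {                              /* ---- worker: ask, compute, repeat ---- */
        double row[N + 1];
        row[0] = -1;                      /* first message is just a request */
        MPI_Send(row, N + 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        for (;;) {
            MPI_Status st;
            int i;
            MPI_Recv(&i, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 2) break;   /* no more work */
            row[0] = i;
            for (int j = 0; j < N; j++) row[j + 1] = fcn(i, j);
            MPI_Send(row, N + 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}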

44 Discussion In the above pool of tasks example, each task calculated an individual array element as a job, so the computation to communication ratio is finely granular. Finely granular solutions incur more communication overhead in order to reduce task idle time. A better solution might be to distribute more work with each job; the "right" amount of work is problem dependent.

45 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

46 PI Calculation: an embarrassingly parallel algorithm

47 [slide not reproduced in this transcription]

48 From sequential to parallel

49 From sequential to parallel

50 51 [slides not reproduced in this transcription]
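The listings on slides 48 51 were not reproduced. A hedged reconstruction in the style of the classic Gropp & Lusk pi example (each process integrates 4/(1+x^2) over a cyclic subset of the intervals, and MPI_Reduce sums the partial results on rank 0):

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, n = 100000;           /* number of intervals */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root could read n at run time */
    double h = 1.0 / n, sum = 0.0;
    for (int i = rank + 1; i <= n; i += size) {    /* cyclic distribution of intervals */
        double x = h * (i - 0.5);                  /* midpoint rule */
        sum += 4.0 / (1.0 + x * x);
    }
    double mypi = h * sum, pi = 0.0;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.16f, error %.2e\n",
               pi, fabs(pi - 3.141592653589793));

    MPI_Finalize();
    return 0;
}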

52 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

53 Simple Heat Equation Most problems in parallel computing require communication among the tasks. A number of common problems require communication with "neighbor" tasks. The heat equation describes the temperature change over time, given initial temperature distribution and boundary conditions. A finite differencing scheme is employed to solve the heat equation numerically on a square region.

54 Simple Heat Equation The initial temperature is zero on the boundaries and high in the middle. The boundary temperature is held at zero. For the fully explicit problem, a time stepping algorithm is used. The elements of a 2-dimensional array represent the temperature at points on the square. The calculation of an element is dependent upon neighbor element values.

55 Simple Heat Equation A serial program would contain code like:
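The serial listing was not reproduced; a C sketch of the kind of explicit update loop meant here (the grid size, coefficients, and initial condition are arbitrary choices of mine): each interior point is updated from its four neighbors while the boundary stays at zero.

#include <stdio.h>
#define NX 10
#define NY 10

int main(void) {
    static double u1[NX][NY], u2[NX][NY]; /* zero everywhere... */
    double cx = 0.1, cy = 0.1;
    u1[NX / 2][NY / 2] = 100.0;           /* ...except hot in the middle */

    for (int step = 0; step < 100; step++) {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)      /* boundaries stay at 0 */
                u2[i][j] = u1[i][j]
                         + cx * (u1[i + 1][j] + u1[i - 1][j] - 2.0 * u1[i][j])
                         + cy * (u1[i][j + 1] + u1[i][j - 1] - 2.0 * u1[i][j]);
        for (int i = 1; i < NX - 1; i++)          /* copy back for the next step */
            for (int j = 1; j < NY - 1; j++)
                u1[i][j] = u2[i][j];
    }
    printf("u(center) = %g\n", u1[NX / 2][NY / 2]);
    return 0;
}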

56 Simple Heat Equation Parallel Solution 1 Implement as an SPMD model. The entire array is partitioned and distributed as subarrays to all tasks; each task owns a portion of the total array. Determine the data dependencies: interior elements belonging to a task are independent of other tasks, while border elements are dependent upon a neighbor task's data, necessitating communication.

57 Simple Heat Equation Parallel Solution 1 The master process sends initial info to the workers, checks for convergence, and collects results. Each worker process calculates its part of the solution, communicating as necessary with neighbor processes. Pseudo code solution (in the original slide, red highlights the changes for parallelism):

58 Simple Heat Equation Parallel Solution 2: Overlapping Communication and Computation In the previous solution, it was assumed that blocking communications were used by the worker tasks. Blocking communications wait for the communication to complete before continuing to the next program instruction. In the previous solution, neighbor tasks communicated border data, then each process updated its portion of the array. Computing times can often be reduced by using non-blocking communication, which allows work to be performed while communication is in progress. Each task could update the interior of its part of the solution array while the communication of border data is occurring, and update its border after communication has completed. Pseudo code for the second solution (in the original slide, red highlights the changes for non-blocking communications):

59 [pseudo code slide not reproduced in this transcription]
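The pseudo code on slide 59 was not reproduced; a hedged C sketch of the overlap idea (one Jacobi-style sweep with a 1-D row decomposition; my reconstruction): post the ghost-row exchanges, update the interior while they are in flight, then wait and update the border rows.

#include <mpi.h>
#include <stdio.h>
#define NLOC 4                            /* local rows, plus 2 ghost rows */
#define NY   8

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    double u[NLOC + 2][NY], unew[NLOC + 2][NY];
    for (int i = 0; i < NLOC + 2; i++)
        for (int j = 0; j < NY; j++) u[i][j] = rank;  /* arbitrary test data */

    MPI_Request req[4];                   /* exchange ghost rows with neighbors */
    MPI_Irecv(u[0],        NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(u[NLOC + 1], NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(u[1],        NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(u[NLOC],     NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

    for (int i = 2; i < NLOC; i++)        /* interior rows: no ghost data needed */
        for (int j = 1; j < NY - 1; j++)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    for (int i = 1; i <= NLOC; i += NLOC - 1)     /* border rows 1 and NLOC */
        for (int j = 1; j < NY - 1; j++)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);

    printf("rank %d: unew[1][1] = %g\n", rank, unew[1][1]);
    MPI_Finalize();
    return 0;
}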

60 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

61 Types of Point-to-Point Operations: There are different types of send and receive routines, used for different purposes. For example:
Synchronous send
Blocking send / blocking receive
Non-blocking send / non-blocking receive
Buffered send
Combined send/receive
"Ready" send
Any type of send routine can be paired with any type of receive routine. MPI also provides several routines associated with send-receive operations, such as those used to wait for a message's arrival or to probe whether a message has arrived. A small example of the combined send/receive appears below.
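A minimal sketch of the combined send/receive mentioned in the list above (my example, not from the slides): a ring shift in a single deadlock-free call.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, recvd;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int dest = (rank + 1) % size;         /* send to the right neighbor */
    int src  = (rank + size - 1) % size;  /* receive from the left neighbor */

    MPI_Sendrecv(&rank,  1, MPI_INT, dest, 0,
                 &recvd, 1, MPI_INT, src,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d received %d\n", rank, recvd);

    MPI_Finalize();
    return 0;
}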

62 Data Movement in point to point communication

63 Data Movement in point to point communication

64 Buffering In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case: the MPI implementation must somehow be able to store data when the two tasks are out of sync. Consider the following two cases: A send operation occurs 5 seconds before the receive is ready - where is the message while the receive is pending? Multiple sends arrive at the same receiving task, which can only accept one send at a time - what happens to the messages that are "backing up"? The MPI implementation (not the MPI standard) decides what happens to data in these cases. Typically, a system buffer area is reserved to hold data in transit (illustration not reproduced in this transcription).

65 Buffering System buffer space is:
Opaque to the programmer and managed entirely by the MPI library
A finite resource that can be easy to exhaust
Often mysterious and not well documented
Able to exist on the sending side, the receiving side, or both
Something that may improve program performance, because it allows send-receive operations to be asynchronous
User managed address space (i.e. your program variables) is called the application buffer. MPI also provides for a user managed send buffer.

66 Blocking vs. Non-blocking Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode. Blocking: A blocking send routine will only "return" after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the receive task. Safe does not imply that the data was actually received - it may very well be sitting in a system buffer. A blocking send can be synchronous, which means there is handshaking with the receive task to confirm a safe send. A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receiver. A blocking receive only "returns" after the data has arrived and is ready for use by the program.

67 Blocking vs. Non-blocking Non-blocking: Non-blocking send and receive routines behave similarly - they return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message. Non-blocking operations simply "request" that the MPI library perform the operation when it is able; the user cannot predict when that will happen. It is unsafe to modify the application buffer (your variable space) until you know for a fact that the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this. Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
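A minimal sketch of the wait pattern just described (my example, not from the slides; run with at least 2 processes): the send buffer may be reused only after MPI_Wait reports completion.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, x = 42, y = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Request req;

    if (rank == 0) {
        MPI_Isend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... computation that does NOT touch x can overlap here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* now x may be modified again */
    } else if (rank == 1) {
        MPI_Irecv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* now y is valid */
        printf("got %d\n", y);
    }
    MPI_Finalize();
    return 0;
}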

68 Order and Fairness Order: MPI guarantees that messages will not overtake each other. If a sender sends two messages (Message 1 and Message 2) in succession to the same destination, and both match the same receive, the receive operation will receive Message 1 before Message 2. If a receiver posts two receives (Receive 1 and Receive 2), in succession, and both are looking for the same message, Receive 1 will receive the message before Receive 2. Order rules do not apply if there are multiple threads participating in the communication operations. Fairness: MPI does not guarantee fairness - it's up to the programmer to prevent "operation starvation". Example: task 0 sends a message to task 2. However, task 1 sends a competing message that matches task 2's receive. Only one of the sends will complete.

69 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

70 Derived Data Type Routines MPI_Type_contiguous The simplest constructor. Produces a new datatype by making count copies of an existing datatype. MPI_Type_contiguous (count,oldtype,&newtype) MPI_TYPE_CONTIGUOUS (count,oldtype,newtype,ierr) C synopsis #include <mpi.h> int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype);

71 [slide not reproduced in this transcription]

72 MPI_Type_commit Commits new datatype to the system. Required for all user constructed (derived) datatypes. Makes a datatype ready for use in communication. MPI_Type_commit (&datatype) MPI_TYPE_COMMIT (datatype,ierr) C synopsis #include <mpi.h> int MPI_Type_commit(MPI_Datatype *datatype);
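A hedged reconstruction of the program whose output appears on the next slide, modeled on the common textbook example (a 4x4 float matrix on rank 0, a contiguous "row" type of 4 floats, one row sent to each of 4 tasks; run with exactly 4 processes):

#include <mpi.h>
#include <stdio.h>
#define SIZE 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, numtasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    float a[SIZE][SIZE] = {{ 1,  2,  3,  4}, { 5,  6,  7,  8},
                           { 9, 10, 11, 12}, {13, 14, 15, 16}};
    float b[SIZE];
    MPI_Datatype rowtype;
    MPI_Type_contiguous(SIZE, MPI_FLOAT, &rowtype);   /* 4 copies of MPI_FLOAT */
    MPI_Type_commit(&rowtype);

    if (numtasks == SIZE) {
        if (rank == 0)
            for (int i = 0; i < numtasks; i++)        /* one row per task */
                MPI_Send(&a[i][0], 1, rowtype, i, 0, MPI_COMM_WORLD);
        MPI_Recv(b, SIZE, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank= %d b= %.1f %.1f %.1f %.1f\n", rank, b[0], b[1], b[2], b[3]);
    }
    MPI_Type_free(&rowtype);
    MPI_Finalize();
    return 0;
}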

73 Sample program output (the numeric values were lost in this transcription): rank= 0 b= rank= 1 b= rank= 2 b= rank= 3 b=

74 MPI_Type_vector Similar to contiguous, but allows for regular gaps (stride) in the displacements. MPI_Type_vector (count,blocklength,stride,oldtype,&newtype) MPI_TYPE_VECTOR (count,blocklength,stride,oldtype,newtype,ierr) C synopsis #include <mpi.h> int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype);

75 76 [slides not reproduced in this transcription]
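Since slides 75 76 were not reproduced, here is a hedged companion example (mine, not the original): MPI_Type_vector describing one column of a 4x4 float matrix (count 4, blocklength 1, stride 4), with one column sent to each of 4 tasks; run with exactly 4 processes.

#include <mpi.h>
#include <stdio.h>
#define SIZE 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, numtasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    float a[SIZE][SIZE] = {{ 1,  2,  3,  4}, { 5,  6,  7,  8},
                           { 9, 10, 11, 12}, {13, 14, 15, 16}};
    float b[SIZE];
    MPI_Datatype columntype;
    MPI_Type_vector(SIZE, 1, SIZE, MPI_FLOAT, &columntype);
    MPI_Type_commit(&columntype);

    if (numtasks == SIZE) {
        if (rank == 0)
            for (int i = 0; i < numtasks; i++)   /* column i starts at a[0][i] */
                MPI_Send(&a[0][i], 1, columntype, i, 0, MPI_COMM_WORLD);
        MPI_Recv(b, SIZE, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank= %d b= %.1f %.1f %.1f %.1f\n", rank, b[0], b[1], b[2], b[3]);
    }
    MPI_Type_free(&columntype);
    MPI_Finalize();
    return 0;
}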

77 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

78 Group and Communicator Management Routines Groups vs. Communicators: A group is an ordered set of processes. Each process in a group is associated with a unique integer rank. Rank values start at zero and go to N-1, where N is the number of processes in the group. In MPI, a group is represented within system memory as an object. It is accessible to the programmer only by a "handle". A group is always associated with a communicator object. A communicator encompasses a group of processes that may communicate with each other. All MPI messages must specify a communicator. In the simplest sense, the communicator is an extra "tag" that must be included with MPI calls. Like groups, communicators are represented within system memory as objects and are accessible to the programmer only by "handles". For example, the handle for the communicator that comprises all tasks is MPI_COMM_WORLD. From the programmer's perspective, a group and a communicator are one. The group routines are primarily used to specify which processes should be used to construct a communicator.

79 Group and Communicator Management Routines Primary Purposes of Group and Communicator Objects: 1. Allow you to organize tasks, based upon function, into task groups. 2. Enable collective communication operations across a subset of related tasks. 3. Provide a basis for implementing user defined virtual topologies. 4. Provide for safe communications.

80 Group and Communicator Management Routines Programming Considerations and Restrictions: Groups/communicators are dynamic - they can be created and destroyed during program execution. Processes may be in more than one group/communicator. They will have a unique rank within each group/communicator. MPI provides over 40 routines related to groups, communicators, and virtual topologies.

81 Typical usage:
1. Extract the handle of the global group from MPI_COMM_WORLD using MPI_Comm_group
2. Form a new group as a subset of the global group using MPI_Group_incl
3. Create a new communicator for the new group using MPI_Comm_create
4. Determine the new rank in the new communicator using MPI_Comm_rank
5. Conduct communications using any MPI message passing routine
6. When finished, free up the new communicator and group (optional) using MPI_Comm_free and MPI_Group_free

82 83 [example listing not reproduced in this transcription]
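The listing on slides 82 83 was not reproduced; a hedged sketch of the six usage steps above, modeled on the common textbook example (8 processes split into two groups of 4, each group doing its own all-reduce; run with exactly 8 processes):

#include <mpi.h>
#include <stdio.h>
#define NPROCS 8

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != NPROCS) {                 /* this sketch assumes 8 processes */
        if (rank == 0) printf("run with %d processes\n", NPROCS);
        MPI_Finalize();
        return 1;
    }
    int ranks1[4] = {0, 1, 2, 3}, ranks2[4] = {4, 5, 6, 7};

    MPI_Group world_group, new_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);            /* step 1 */
    if (rank < NPROCS / 2)                                   /* step 2 */
        MPI_Group_incl(world_group, 4, ranks1, &new_group);
    else
        MPI_Group_incl(world_group, 4, ranks2, &new_group);

    MPI_Comm new_comm;
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);   /* step 3 */

    int new_rank, sum;
    MPI_Comm_rank(new_comm, &new_rank);                      /* step 4 */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, new_comm);  /* step 5 */
    printf("rank= %d new rank= %d group sum= %d\n", rank, new_rank, sum);

    MPI_Comm_free(&new_comm);                                /* step 6 */
    MPI_Group_free(&new_group);
    MPI_Finalize();
    return 0;
}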

84 Outline Collective communication Considerations on array processing Computing pi Heat equation Types of point-to-point communication Derived data types Groups and communicators Virtual topologies

85 Virtual Topologies What Are They? In terms of MPI, a virtual topology describes a mapping/ordering of MPI processes into a geometric "shape". The two main types of topologies supported by MPI are Cartesian (grid) and Graph. MPI topologies are virtual - there may be no relation between the physical structure of the parallel machine and the process topology. Virtual topologies are built upon MPI communicators and groups. Must be "programmed" by the application developer

86 Virtual Topologies Why Use Them? Convenience Virtual topologies may be useful for applications with specific communication patterns - patterns that match an MPI topology structure. For example, a Cartesian topology might prove convenient for an application that requires 4-way nearest neighbor communications for grid based data. Communication Efficiency Some hardware architectures may impose penalties for communications between successively distant "nodes". A particular implementation may optimize process mapping based upon the physical characteristics of a given parallel machine. The mapping of processes into an MPI virtual topology is dependent upon the MPI implementation, and may be totally ignored.

87 Virtual Topologies Example: A simplified mapping of processes into a Cartesian virtual topology appears in the diagram (not reproduced in this transcription).

88 int MPI_Cart_create( MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart); Makes a new communicator to which topology information has been attached. Parameters:
comm_old [in] input communicator (handle)
ndims [in] number of dimensions of the Cartesian grid (integer)
dims [in] integer array of size ndims specifying the number of processes in each dimension
periods [in] logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each dimension
reorder [in] ranking may be reordered (true) or not (false) (logical)
comm_cart [out] communicator with the new Cartesian topology (handle)

89 int MPI_Cart_coords( MPI_Comm comm, int rank, int maxdims, int *coords); Determines the coordinates of a process in a Cartesian topology, given its rank in the group. Parameters:
comm [in] communicator with Cartesian structure (handle)
rank [in] rank of a process within the group of comm (integer)
maxdims [in] length of the vector coords in the calling program (integer)
coords [out] integer array (of size ndims) containing the Cartesian coordinates of the specified process

90 int MPI_Cart_shift(MPI_Comm comm, int direction, int displ, int *source, int *dest) Loosely speaking, MPI_Cart_shift is used to find two "nearby" neighbors of the calling process along a specific direction of an N-dimensional cartesian topology. This direction is specified by the input argument, direction, to MPI_Cart_shift. The two neighbors are called source and destination ranks and the proximity of these two neighbors to the calling process is determined by the input parameter displ. If displ = 1, the neighbors are the two adjoining processes along the specified direction and the source is the process with the lower rank number while the destination rank is the process with the higher rank. On the other hand, if displ = -1, the reverse is true.

91 Create a 4 x 4 Cartesian topology from 16 processes and have each process exchange its rank with its four neighbors.
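A possible solution sketch (my code, not the original; run with exactly 16 processes): MPI_Cart_create builds the grid, MPI_Cart_shift finds the four neighbors (MPI_PROC_NULL at the edges, since the grid is not periodic), and non-blocking point-to-point calls do the exchange.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int dims[2] = {4, 4}, periods[2] = {0, 0}, coords[2];
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    int crank;
    MPI_Comm_rank(cart, &crank);          /* ranks may have been reordered */
    MPI_Cart_coords(cart, crank, 2, coords);

    int up, down, left, right;
    MPI_Cart_shift(cart, 0, 1, &up, &down);      /* neighbors along dimension 0 */
    MPI_Cart_shift(cart, 1, 1, &left, &right);   /* neighbors along dimension 1 */

    int nbrs[4] = {up, down, left, right};
    int got[4] = {-1, -1, -1, -1};        /* stays -1 where there is no neighbor */
    MPI_Request req[8];
    for (int i = 0; i < 4; i++) {         /* edge processes talk to MPI_PROC_NULL */
        MPI_Irecv(&got[i], 1, MPI_INT, nbrs[i], 0, cart, &req[i]);
        MPI_Isend(&crank,  1, MPI_INT, nbrs[i], 0, cart, &req[i + 4]);
    }
    MPI_Waitall(8, req, MPI_STATUSES_IGNORE);
    printf("rank %d coords (%d,%d) neighbors: up=%d down=%d left=%d right=%d\n",
           crank, coords[0], coords[1], got[0], got[1], got[2], got[3]);

    MPI_Finalize();
    return 0;
}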
