Introduction to MPI-2 (Message-Passing Interface)
What are the major new features in MPI-2?
Parallel I/O, Remote Memory Operations, Dynamic Process Management, and Support for Multithreading.
Parallel I/O
Includes basic operations similar to the standard UNIX open, close, seek, read, and write operations. But the power comes from advanced features such as noncontiguous access in both memory and file, collective I/O operations, use of explicit offsets to avoid separate seeks, both individual and shared file pointers, nonblocking I/O, portable and customized data representations, and hints for the implementation and file system.
Remote Memory Operations
The API provides elements of the shared-memory model in an MPI environment. These are known as MPI one-sided or remote memory operations. The design is based on the idea of remote memory access windows: portions of each process's address space that it explicitly exposes to remote memory operations by the other processes defined by an MPI communicator. The one-sided put, get, and accumulate operations can store into, load from, and update the windows exposed by other processes. All remote memory operations are nonblocking, and synchronization operations are necessary to ensure their completion.
Dynamic Process Management
The ability of an MPI process to participate in the creation of new MPI processes or to establish communication with MPI processes that have been started separately. These operations are collective, and the resulting sets of processes are represented as an intercommunicator. Spawning creates a new set of processes and returns an intercommunicator connecting it to the existing one. Connecting establishes communication with a pre-existing MPI program.
Support for Multithreading
MPI-1 was designed to be thread-safe. In MPI-2, threads are recognized as a potential part of the environment. Users can inquire what level of thread safety is provided. If multiple levels of thread safety are supported, users can choose the level that meets the application's needs while still allowing the highest possible performance.
Support for Multithreading (contd)
int MPI_Init_thread(int *argc, char ***argv, int required, int *provided);
int MPI_Query_thread(int *provided);
int MPI_Is_thread_main(int *flag);
MPI_THREAD_SINGLE - Only one thread will execute.
MPI_THREAD_FUNNELED - The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are funneled to the main thread).
MPI_THREAD_SERIALIZED - The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are serialized).
MPI_THREAD_MULTIPLE - Multiple threads may call MPI, with no restrictions.
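A minimal sketch of requesting and checking a thread level at startup (the fallback behavior shown here is only an illustration, not part of the lab code):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided;

    /* ask for the highest level; the implementation reports what it grants */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        printf("full thread support unavailable (got level %d)\n", provided);

    MPI_Finalize();
    return 0;
}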
Parallel I/O
Three basic approaches:
1. All MPI processes send the data to be written to process 0, which then writes it to a file using standard library calls. This is the simplest approach, but it is also the least scalable.
2. Each MPI process writes data to its own local file using standard library calls. After the application finishes, all the separate files must somehow be combined. This is more scalable but can also be complex.
3. All MPI processes share a single file while still retaining the advantages of parallelism. The processes use MPI I/O calls instead of standard library calls.
Parallel I/O: Example 1
/* lab/mpi/parallel-io/io1.c: example of sequential write into a common file */
#include <stdio.h>
#include <mpi.h>
#define BUFSIZE 1024*1024

int main(int argc, char *argv[])
{
    int i, myrank, numprocs, buf[BUFSIZE];
    MPI_Status status;
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    if (myrank != 0)
        MPI_Send(buf, BUFSIZE, MPI_INT, 0, 99, MPI_COMM_WORLD);
    else {
        myfile = fopen("testfile", "w");
        fwrite(buf, sizeof(int), BUFSIZE, myfile);
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(buf, BUFSIZE, MPI_INT, i, 99, MPI_COMM_WORLD, &status);
            fwrite(buf, sizeof(int), BUFSIZE, myfile);
        }
        fclose(myfile);
    }
    MPI_Finalize();
    return 0;
}
Parallel I/O: Example 2
/* lab/mpi/parallel-io/io2.c: parallel MPI write into separate files */
#include <stdio.h>
#include <mpi.h>
#define BUFSIZE 1024*1024

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    MPI_File myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &myfile);
    MPI_File_write(myfile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&myfile);
    MPI_Finalize();
    return 0;
}
Parallel I/O: Example 3
/* lab/mpi/parallel-io/io3.c: parallel MPI write into a single file */
#include <stdio.h>
#include <mpi.h>
#define BUFSIZE 1024*1024

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    MPI_File thefile;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    sprintf(filename, "testfile");
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    offset = (MPI_Offset)myrank * BUFSIZE;
    MPI_File_write_at(thefile, offset, buf, BUFSIZE, MPI_INT,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}
Summary of basic MPI I/O Functions
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh);
int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info);
int MPI_File_write(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status);
int MPI_File_read(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status);
int MPI_File_get_size(MPI_File fh, MPI_Offset *size);
int MPI_File_close(MPI_File *fh);
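As a small illustration of these calls (not one of the lab examples), the sketch below reads back the common file written in Example 3; each process sets its view to start at its own block and reads BUFSIZE ints:

#include <stdio.h>
#include <mpi.h>
#define BUFSIZE 1024*1024

int main(int argc, char *argv[])
{
    int myrank, buf[BUFSIZE];
    MPI_File thefile;
    MPI_Offset filesize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &thefile);
    MPI_File_get_size(thefile, &filesize);          /* size in bytes */
    /* the displacement (in bytes) skips the blocks of lower-ranked processes */
    MPI_File_set_view(thefile, (MPI_Offset)myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    if (myrank == 0)
        printf("file size = %lld bytes, buf[0] = %d\n",
               (long long)filesize, buf[0]);
    MPI_File_close(&thefile);
    MPI_Finalize();
    return 0;
}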
More on Parallel I/O
MPI_File_seek allows multiple processes to position themselves at a specific offset in a file before reading or writing. MPI_File_read_at and MPI_File_write_at combine the read/write with the seek in one call. The shared file pointer is shared amongst all processes in the same communicator. Functions such as MPI_File_write_shared write data and update the shared pointer seen by all processes. This is handy for writing to a common log file from multiple processes.
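The log-file use case might look like the following hedged sketch (the file name log.txt is made up; each process appends one line at the current shared pointer):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank;
    char line[64];
    MPI_File logfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_File_open(MPI_COMM_WORLD, "log.txt",
                  MPI_MODE_WRONLY | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &logfile);
    sprintf(line, "process %d reached the checkpoint\n", myrank);
    /* each call writes at the shared pointer and advances it, so lines do not
       overlap; the ordering of lines across processes is not deterministic */
    MPI_File_write_shared(logfile, line, (int)strlen(line), MPI_CHAR,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&logfile);
    MPI_Finalize();
    return 0;
}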
Remote Memory Access
MPI does not provide a real shared-memory model. However, the remote memory operations of MPI provide much of the flexibility of shared memory. Data movement can be initiated entirely by one process (a one-sided operation), and the synchronization needed to ensure that the data movement is complete is decoupled from the initiation of the operation. Each process can designate portions of its address space as available for other processes to read and write; such a portion is known as a window. A window object consists of the collection of windows, one per process, exposed by a collective window-creation operation. A collection of processes may have several window objects.
Remote Memory Functions
Window objects are represented by variables of type MPI_Win in C. A window object is built from variables of a single datatype, so one window object is needed for each type of variable to be shared. MPI_Win_create is a collective operation, so all processes must call it even if only one contributes memory to the window. The communicator used specifies which processes will have access to the window.
MPI_Win nwin;
/* on process 0: */
MPI_Win_create(&n, sizeof(int), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
/* on the other processes: */
MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &nwin);
The first argument is the address of the exposed memory and the second its length in bytes. The third argument is the displacement unit used to specify offsets into the window memory. The fourth argument is an MPI_Info argument, which can be used to optimize the performance of remote memory operations. The next argument is the communicator, and the last argument is the window object that is returned.
More Remote Memory Functions
Any ordinary variable can be shared via the remote memory operations get, put, and accumulate. Special memory can also be allocated for this purpose via the MPI_Alloc_mem function. Before other processes can access remote memory, we need to synchronize; MPI provides three synchronization mechanisms. The simplest is the fence operation, which starts an RMA access epoch; the MPI call used is MPI_Win_fence. MPI_Win_fence takes two arguments: the first is an assertion argument permitting certain optimizations (a value of 0 is always valid), and the second is the window the fence operation is being performed on.
MPI_Win_fence(0, nwin);
MPI_Get(&n, 1, MPI_INT, 0, 0, 1, MPI_INT, nwin);
MPI_Win_fence(0, nwin);
The arguments to MPI_Get are the local receive address, count, and datatype; the rank of the remote process; the displacement into its memory window; the count and datatype on the target side; and the window object.
MPI Remote Memory Operations
int MPI_Win_create(void *base, MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, MPI_Win *win);
int MPI_Win_fence(int assert, MPI_Win win);
int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win);
int MPI_Put(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win);
int MPI_Accumulate(void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Op op, MPI_Win win);
int MPI_Win_free(MPI_Win *win);
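MPI_Put does not appear in the example that follows, so here is a minimal hedged sketch of it (the variable names are made up): rank 1 stores a value into the one-int window exposed by rank 0, bracketed by fences.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, target = 0, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0)
        MPI_Win_create(&target, sizeof(int), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);
    else
        MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       /* open the access epoch */
    if (myrank == 1)
        MPI_Put(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                       /* the put is now complete */

    if (myrank == 0)
        printf("rank 0 sees target = %d\n", target);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}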
Remote Memory Access Example
/* lab/mpi/remote-memory/cpi-rma.c */
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    MPI_Win nwin, piwin;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        MPI_Win_create(&n, sizeof(int), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &nwin);
        MPI_Win_create(&pi, sizeof(double), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &piwin);
    }
    else {
        MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &nwin);
        MPI_Win_create(MPI_BOTTOM, 0, 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &piwin);
    }
Remote Memory Access Example (contd.)
    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
            pi = 0.0;
        }
        MPI_Win_fence(0, nwin);
        if (myid != 0)
            MPI_Get(&n, 1, MPI_INT, 0, 0, 1, MPI_INT, nwin);
        MPI_Win_fence(0, nwin);
        if (n == 0)
            break;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += (4.0 / (1.0 + x*x));
            }
            mypi = h * sum;
            MPI_Win_fence(0, piwin);
            MPI_Accumulate(&mypi, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE,
                           MPI_SUM, piwin);
            MPI_Win_fence(0, piwin);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
    }
    MPI_Win_free(&nwin);
    MPI_Win_free(&piwin);
    MPI_Finalize();
    return 0;
}
Dynamic Process Management
Spawning is a collective operation over the spawning processes (the parents) and the newly created processes (the children, which join via MPI_Init). It returns an intercommunicator in which, from the point of view of the parents, the local group contains the parents and the remote group contains the children. The function MPI_Comm_get_parent, called from the children, returns an intercommunicator in which the local group contains the children and the remote group contains the parents.
Dynamic Process Management Functions
int MPI_Comm_spawn(char *command, char *argv[], int maxprocs, MPI_Info info, int root, MPI_Comm comm, MPI_Comm *intercomm, int array_of_errcodes[]);
int MPI_Comm_get_parent(MPI_Comm *parent);
int MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm);
See example: lab/mpi/spawn-ex1/
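The spawn-ex1 sources are not reproduced here; the following is a hedged sketch of the same idea, in which one executable (the name spawn_demo is an assumption) acts as either parent or child depending on whether it was spawned:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, nchildren = 2;
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* no parent: we are the original processes, so spawn the children */
        MPI_Comm_spawn("spawn_demo", MPI_ARGV_NULL, nchildren, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        if (myrank == 0)
            printf("parent: spawned %d children\n", nchildren);
    } else {
        /* spawned copy: the parents form the remote group of 'parent' */
        printf("child %d: started by MPI_Comm_spawn\n", myrank);
    }

    MPI_Finalize();
    return 0;
}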