
1 ASTROPHYSIKALISCHES INSTITUT POTSDAM AIP Helmholtz school Introduction to MPI Stefan Gottlöber

2 Topics
Basics of parallel programming
Calculation of π (an example for the basic structure of MPI programs and the possible combination with OpenMP)
Direct integration (an example for message passing in MPI programs and for the scaling of MPI programs; developing this MPI program could be your exercise during the first week)
ART-MPI (an example for some more elaborate programs)

3 Modern methods in science
Numerical simulations are used to prove or disprove observations, because experiments
are impossible (astrophysics)
are too expensive
are too time consuming
...

4 Methods of parallelization
OpenMP: needs a computer with shared memory (JUMP at NIC, 107 Gb; NASA's COLUMBIA)
MPI: works on distributed memory (in general more memory available)

5 What is OpenMP?
If you don't know it already, you will learn about OpenMP tomorrow during Anatoly's lecture.

6 What is MPI?
Message Passing Interface: libraries designed to be a standard for parallel computing on distributed memory. Goal: to be practical, portable, efficient, and flexible.
MPI history:
1980s - early 1990s: distributed memory parallel computing develops, the need for a standard arose
April 1992: Workshop on Standards for Message Passing in a Distributed Memory Environment
November 1992: meeting in Minneapolis, MPI draft proposal (MPI1)
November 1993: Supercomputing 93, draft MPI standard
1995: MPI1 standard
1997: MPI2 standard

7 OpenMP vs MPI: Overview

8 How to install MPI?
The MPI home page is maintained at Argonne National Laboratory. Standards, archives, documentation and links to implementations are available.
MPI is a library of
subroutines for Fortran
functions for C
classes and methods for C++

9 How to install MPI?
User programs are compiled as usual and then linked with the appropriate MPI libraries. Implementations are:
MPICH ( is available from Argonne National Laboratory. It is free, easily downloaded and can be installed at the user level (i.e., without superuser privileges). Subroutines are provided for Fortran 90, C and C++. The CH in MPICH stands for Chameleon, symbol of adaptability to one's environment and thus of portability. Chameleons are fast, and from the beginning a secondary goal was to give up as little efficiency as possible for the portability.
LAM/MPI ( is available from Indiana University. LAM stands for Local Area Multicomputer.
WMPI II ( is a commercial (but free to academics) implementation for Windows.

10 Which tasks can be parallelized by MPI?
Trivial parallel programs
parameter studies
analysis of many time steps
image processing
Independent tasks
N-body interaction
halo finding and treatment
density field (smoothed)
Problems in both cases
Are the tasks scalable over many CPUs? Over how many?
Load balance (do all CPUs work, or do many lie idle?)

11 Examples: Trivial parallel programs
calculation of π by different methods (one CPU rivals the others: which method is faster or more accurate?); this little task is an excellent exercise for the combination of OpenMP and MPI
Evolution of many clusters of galaxies (Hitachi project: 8 nodes with 8 processors on each node, 8 MPI processes, each with 8 OpenMP threads)

12 Calculation of π
      IMPLICIT REAL*8 (A-H,O-Z)
      IMPLICIT INTEGER*4 (I-N)
      include 'mpif.h'
      N   =
      pii =                     ! Num. Rec. p. 914
      CALL mpi_init(ierr)
      CALL mpi_comm_size(mpi_comm_world, msize, ierr)
      CALL mpi_comm_rank(mpi_comm_world, mrank, ierr)
      CALL Calc_PI(pi, N, mrank)
      ...
      CALL mpi_finalize(ierr)
      end

      SUBROUTINE Calc_PI(pi, N, mrank)
      ...

13 on mpif.h
/* -*- Mode: Fortran; -*- */
!
!  (C) 2001 by Argonne National Laboratory.
!  See COPYRIGHT in top-level directory.
!
!  DO NOT EDIT
!  This file created by buildiface
!
      INTEGER MPI_SOURCE, MPI_TAG, MPI_ERROR
      PARAMETER (MPI_SOURCE=3, MPI_TAG=4, MPI_ERROR=5)
      ...
Your compiler will see that file if you have the right environment:
source /opt/env/pgi-mpich sh

14 mpi_init
CALL mpi_init(ierr)
Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI function, and must be called only once in an MPI program.

15 mpi_comm_size
CALL mpi_comm_size(mpi_comm_world, msize, ierr)
Determines the number of processes msize in the group associated with a communicator. Generally used within the communicator MPI_COMM_WORLD to determine the number of processes being used by your application.

16 What is MPI_COMM_WORLD?
MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument. MPI_COMM_WORLD is the predefined communicator which includes all of your MPI processes.

17 MPI_COMM_WORLD extension

18 mpi_comm_rank
CALL mpi_comm_rank(mpi_comm_world, mrank, ierr)
Determines the rank mrank of the calling process within the communicator. Initially, each process will be assigned a unique integer rank between 0 and number of processes - 1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.
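As a quick illustration, here is a minimal, self-contained program of my own (not part of the lecture material) that combines mpi_init, mpi_comm_size, mpi_comm_rank and mpi_finalize; the program name and the print statement are arbitrary.
      PROGRAM hello_rank
      IMPLICIT NONE
      include 'mpif.h'
      INTEGER ierr, msize, mrank
c     set up the MPI environment
      CALL mpi_init(ierr)
c     how many processes are there, and which one am I?
      CALL mpi_comm_size(MPI_COMM_WORLD, msize, ierr)
      CALL mpi_comm_rank(MPI_COMM_WORLD, mrank, ierr)
      write(*,*) 'Hello from rank', mrank, 'of', msize
c     clean shutdown of MPI
      CALL mpi_finalize(ierr)
      END
Compile it with the MPI wrapper of your Fortran compiler (e.g. mpif77 or mpif90) and start it with mpirun.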

19 mpi_finalize
CALL mpi_finalize(ierr)
Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program; no other MPI routines may be called after it.

20 Coming back to the calculation of π
      include 'mpif.h'
      N   =
      pii =                     ! Num. Rec. p. 914
      CALL mpi_init(ierr)
      CALL mpi_comm_size(mpi_comm_world, msize, ierr)
      CALL mpi_comm_rank(mpi_comm_world, mrank, ierr)
      CALL Calc_PI(pi, N, mrank)
      ...
      CALL mpi_finalize(ierr)
      end

      SUBROUTINE Calc_PI(pi, N, mrank)
Now you can use different series on different processors to calculate π and check speed, convergence, ...

21 Important note
Don't use within MPI programs commands like the following:
      IF (error .gt. error_max) STOP 'increase accuracy'
One node would stop. During parallelization of your serial program you should replace such lines by something similar to
      IF (error .gt. error_max) THEN
         write(*,*) 'increase accuracy'
         call mpi_abort(mpi_comm_world, ierr1, ierr2)
         STOP
      ENDIF
which terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes, regardless of the communicator specified.

22 Summary: structure of MPI programs

23 first ART MPI code: Evolution of many clusters of galaxies
c     ====================================================
c
c         Adaptive Refinement Tree (ART) N-body solver
c
c         Version 3 - February 1997
c
c         Andrey Kravtsov, Anatoly Klypin, Alexei Khokhlov
c
c     ====================================================
c
c     this is a simple test version for MPI
c     changes only in ART_Main.f and ART_IO.f
c
      program ART
      include 'mpif.h'

24
      ... some more initialisation for ART
      CALL mpi_init(ierr)
      CALL mpi_comm_size(mpi_comm_world, isize, ierr)
      CALL mpi_comm_rank(mpi_comm_world, irank, ierr)
      ... the main ART program
      ... read data
      ... do loop n steps
      ...    integrate one step
      ...    decide whether results should be written to disk
      ... enddo
      CALL mpi_finalize(ierr)
      STOP
      END

25
c
      SUBROUTINE construct_name(name, in1, jn1)
c
c     purpose: construct file names for the output from different nodes
c              into different directories
c
      include 'mpif.h'
      CHARACTER*120 name, tmp3
      CHARACTER*5   tmp1
      CHARACTER*1   tmp2
      tmp1 = 'node_'
      tmp2 = '/'
      CALL mpi_comm_rank(mpi_comm_world, i_node, ierr)
      tmp3 = name
      CALL get_name(tmp3, in1, jn1)
      write(name, '(a,i1,a,a)') tmp1, i_node, tmp2, tmp3(in1:jn1)
      CALL get_name(name, in1, jn1)
      write(*,*) name(in1:jn1)

26    END
That's all the changes! Each MPI process reads its own data (one cluster of galaxies) and integrates it completely independently of the other tasks. No communication.
You will run into problems if there is any STOP in the code.
Also, nothing is done here concerning load balance: less massive clusters will finish earlier than more massive ones.

27 Examples: Independent tasks
N-body code: the interaction between any two particles does not depend on all the other particles.
Straightforward parallelization (for example the direct integration code): more communication
Parallelization of tasks in different sub-volumes (for example the MPI version of ART): less communication, but problems with load balance

28 Direct integration
N particles, position x, velocity v
move all particles to a new position after Δt
use the leap-frog scheme
calculate the movement of a subset of N_p ≈ N/N_CPU particles on each of the N_CPU processors
simple to parallelize; however, all nodes need to know all positions and velocities (not really a disadvantage on present-day computers with large memory)

29 Leap frog scheme
Define positions x and forces at time t, time step n. Define velocities v at time t + Δt/2, time step n + 1/2. Then we have for particle i
x_i^{n+1} = x_i^n + v_i^{n+1/2} \, \Delta t   (1)
v_i^{n+1/2} = v_i^{n-1/2} + F_i(x_i^n) \, \Delta t / m   (2)
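Written as code, one leap-frog step looks like the following; this is my own sketch (not the lecture code), and the array names as well as the assumption that the forces F have already been evaluated at x^n are illustrative.
c     one leap-frog step for Np particles in one dimension:
c     on entry x holds x^n, v holds v^{n-1/2}, F holds F(x^n)
      SUBROUTINE leapfrog_step(Np, x, v, F, m, dt)
      IMPLICIT NONE
      INTEGER Np, i
      REAL*8  x(Np), v(Np), F(Np), m, dt
      DO i = 1, Np
c        kick: v^{n-1/2} -> v^{n+1/2}, eq. (2)
         v(i) = v(i) + F(i)*dt/m
c        drift: x^n -> x^{n+1} with the half-step velocity, eq. (1)
         x(i) = x(i) + v(i)*dt
      ENDDO
      RETURN
      END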

30 Initial conditions
To start the integration we need the initial positions of all particles x and their velocities v at two separate times: x(t_0) and v(t_0 - Δt/2).
See Anatoly's lecture about initial conditions (PMstartM.f).

31 Accuracy of the leap frog scheme
x_i^{n+1} = x_i^n + v_i^{n+1/2} \, \Delta t   (3)
v_i^{n+1/2} = v_i^{n-1/2} + F_i(x_i^n) \, \Delta t / m   (4)
Substitute v_i^{n-1/2} in the second equation using the first:
v_i^{n+1/2} = ( x_i^n - x_i^{n-1} ) / \Delta t + F_i(x_i^n) \, \Delta t / m   (5)
Substitute back into the first equation,
x_i^{n+1} = x_i^n + ( x_i^n - x_i^{n-1} ) + F_i(x_i^n) \, (\Delta t)^2 / m   (6)
and we get the central difference formula for F = ma:
( x_i^{n+1} - 2 x_i^n + x_i^{n-1} ) / \Delta t^2 = F_i(x_i^n) / m   (7)

32 Accuracy of the leap frog scheme
Let us assume that X is the true solution:
( X_i^{n+1} - 2 X_i^n + X_i^{n-1} ) / \Delta t^2 = F_i(X_i^n) / m + \delta   (8)
Insert the Taylor expansion for X_i^{n+1} and X_i^{n-1}, thus
X_i^{n+1} - 2 X_i^n + X_i^{n-1} = \Delta t^2 \, \frac{d^2 X}{dt^2} + \frac{\Delta t^4}{12} \, \frac{d^4 X}{dt^4} + ...   (9)
Substitute back and get the truncation error O(\Delta t^2),
\delta = \frac{\Delta t^2}{12} \, \frac{d^4 X}{dt^4} + ...   (10)

33 Consistency of the leap frog scheme
As \Delta t \to 0 the difference equation converges to the differential equation
\frac{d^2 X}{dt^2} = F(x)/m   (11)
and it is also a symplectic method (time symmetric): the scheme has the same accuracy for negative \Delta t.

34 Truncation error vs. round-off error
Truncation error
can be reduced by a smaller step \Delta t
can be reduced by a higher-order algorithm
is not related to round-off error
Round-off error
representation of real numbers with a finite number of bits
can be reduced by higher precision (64 bit, REAL*8)
can be reduced also by careful ordering of operations

35 nbody_par.f
Reading by root
Distribution of tasks
Load balance: N_p particles per processor out of N particles
Broadcast to all processors
move particles on each processor
distribute moved particles to all processors
root writes to disk

36 nbody_par.f
      INTEGER Np_on_rank(maxrank+1)
      ...
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, mrank, ierr )
      mroot = 0
      IF (mrank .eq. mroot) THEN
         ... read the data
      ENDIF
      Np_per_process = N/msize
Write the first and the last particle number for each processor into the integer array Np_on_rank(maxrank+1). Note that in this construction Np_per_process * msize is not necessarily equal to N, thus the last CPU may get (much) more particles than the others = bad load balance (a more even split is sketched below).
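A possible remedy, sketched here as my own example rather than the lecture code: spread the remainder N - (N/msize)*msize over the first ranks, so that no processor gets more than one extra particle. Only the array name Np_on_rank is taken from the slide.
      SUBROUTINE fill_np_on_rank(N, msize, maxrank, Np_on_rank)
      IMPLICIT NONE
      INTEGER N, msize, maxrank, Np_on_rank(maxrank+1)
      INTEGER i, nadd, Np_per_process, nrest
c     rank i (counting from 0) handles particles
c     Np_on_rank(i+1)+1 ... Np_on_rank(i+2)
      Np_per_process = N/msize
      nrest          = N - Np_per_process*msize
      Np_on_rank(1)  = 0
      DO i = 1, msize
         nadd = Np_per_process
c        the first nrest ranks get one extra particle
         IF (i .le. nrest) nadd = nadd + 1
         Np_on_rank(i+1) = Np_on_rank(i) + nadd
      ENDDO
      RETURN
      END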

37 nbody_par.f
Now the root process has all the necessary information. This information has to be distributed to all the other processors. Root has to tell them which processor has which tasks.
= Message passing in systems with distributed memory

38 Message passing
Every processor has its own local memory which can be accessed directly only by its own CPU. We have to distribute data from root to all processors over the network.

39 Message passing
A synchronous send operation will complete only after acknowledgment that the message was safely received by the receiving process. Asynchronous send operations may complete even though the receiving process has not actually received the message.

40 Point to Point Communication
MPI_SEND (buf,count,datatype,dest,tag,comm,ierr)
The basic blocking send operation returns only after the application buffer in the sending task is free for reuse. Note that this routine may be implemented differently on different systems. The MPI standard permits the use of a system buffer but does not require it. Some implementations may actually use a synchronous send (block longer, until the destination process has started to receive the message) to implement the basic blocking send.

41 Using a system buffer

42 Point to Point Communication
MPI_SEND (buf,count,datatype,dest,tag,comm,ierr)
Buffer: address space which references the data that is to be sent or received = variable name that is to be sent/received
Count: number of data elements of the particular type to be sent
Data Type: MPI data type (next slide)
Destination: this argument indicates the process where the message should be delivered (rank of the receiving process).

43 Tag: arbitrary non-negative integer ( ) assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags.
Communicator: the predefined communicator MPI_COMM_WORLD is usually used

44 Message passing - MPI data types
MPI data types           Fortran data types
MPI_INTEGER              INTEGER
MPI_REAL                 REAL
MPI_DOUBLE_PRECISION     DOUBLE PRECISION
MPI_COMPLEX              COMPLEX
MPI_LOGICAL              LOGICAL
MPI_CHARACTER            CHARACTER(1)
MPI_BYTE                 8 binary digits
MPI_PACKED               data (un)packed with MPI_Pack (MPI_Unpack)

45 Point to Point Communication
MPI_RECV (buf,count,datatype,source,tag,comm,status,ierr)
Source: this argument indicates the originating process of the message (rank of the sending process). This may be set to the wild card MPI_ANY_SOURCE to receive a message from any task.
Status: for a receive operation, indicates the source of the message and the tag of the message. In Fortran it is an integer array of size MPI_STATUS_SIZE.
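A minimal point-to-point example of my own (not from the slides): rank 0 sends one double precision value to rank 1; the tag 17 and the variable names are arbitrary.
      PROGRAM send_recv_demo
      IMPLICIT NONE
      include 'mpif.h'
      INTEGER ierr, mrank, msize, istatus(MPI_STATUS_SIZE)
      REAL*8  val
      CALL mpi_init(ierr)
      CALL mpi_comm_size(MPI_COMM_WORLD, msize, ierr)
      CALL mpi_comm_rank(MPI_COMM_WORLD, mrank, ierr)
      IF (mrank .eq. 0) THEN
c        rank 0 sends one number to rank 1
         val = 3.14159d0
         CALL mpi_send(val, 1, MPI_DOUBLE_PRECISION, 1, 17,
     +                 MPI_COMM_WORLD, ierr)
      ELSE IF (mrank .eq. 1) THEN
c        rank 1 receives it; istatus holds source and tag
         CALL mpi_recv(val, 1, MPI_DOUBLE_PRECISION, 0, 17,
     +                 MPI_COMM_WORLD, istatus, ierr)
         write(*,*) 'rank 1 received', val
      ENDIF
      CALL mpi_finalize(ierr)
      END
Run it with at least two processes; all other ranks simply pass through to mpi_finalize.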

46 nbody_par.f
Distribute data from root to all processors
      ...
      CALL MPI_Bcast(Np, 1, MPI_INTEGER, mroot, MPI_COMM_WORLD, ierr)
      CALL MPI_Bcast(dt, 1, MPI_DOUBLE_PRECISION,
     +               mroot, MPI_COMM_WORLD, ierr)
      nsend = 10*Nmax
      CALL MPI_Bcast(Coords, nsend, MPI_DOUBLE_PRECISION,
     +               mroot, MPI_COMM_WORLD, ierr)
      CALL MPI_Bcast(Np_on_rank, maxrank,
     +               MPI_INTEGER, mroot, MPI_COMM_WORLD, ierr)
      ...
where we have defined in the original serial program
      PARAMETER (Nmax = 50000)    ! maximum number of particles
      REAL*8 Coords
      COMMON /MAINDATA/ Coords(10,Nmax)

47 MPI_Bcast
MPI_BCAST (buffer,count,datatype,root,comm,ierr)
Broadcasts (sends) a message from the process with rank root to all other processes in the group.
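Since MPI_BCAST is a collective call, every rank of the communicator must execute it (compare the deadlock slide near the end). A small sketch of my own, with Nsteps and the read statement as placeholders (mrank, mroot and ierr as on the previous slides):
      INTEGER Nsteps
c     only root reads the input value ...
      IF (mrank .eq. mroot) read(*,*) Nsteps
c     ... but all ranks, including root, take part in the broadcast
      CALL MPI_Bcast(Nsteps, 1, MPI_INTEGER, mroot,
     +               MPI_COMM_WORLD, ierr)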

48 nbody_par.f
      Do i = 1, nsteps                 ! main loop
         Call GetAccelerations_NP
         Call MoveParticles
         time  = time + dt
         istep = istep + 1
c        distribute particles
         CALL Send_Receive()
Distribute new positions and velocities after each time step to all processors. Each processor has to send data to and receive data from all other processors.

49 MPI_ALLGATHER
Collect data from all tasks and distribute them to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
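A minimal MPI_ALLGATHER sketch of my own (not from the lecture): every rank contributes its rank number, and afterwards every rank holds the complete list; maxrank and the array names are illustrative.
      PROGRAM allgather_demo
      IMPLICIT NONE
      include 'mpif.h'
      INTEGER maxrank
      PARAMETER (maxrank = 128)
      INTEGER ierr, mrank, msize, i, myval, allvals(maxrank)
      CALL mpi_init(ierr)
      CALL mpi_comm_size(MPI_COMM_WORLD, msize, ierr)
      CALL mpi_comm_rank(MPI_COMM_WORLD, mrank, ierr)
c     each rank sends one integer and receives one integer per rank
      myval = mrank
      CALL mpi_allgather(myval,   1, MPI_INTEGER,
     +                   allvals, 1, MPI_INTEGER,
     +                   MPI_COMM_WORLD, ierr)
      write(*,*) 'rank', mrank, 'sees', (allvals(i), i=1,msize)
      CALL mpi_finalize(ierr)
      END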

50 MPI_ALLGATHER
MPI_ALLGATHER (sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,comm,ierr)
sendbuf: starting address of the send buffer (Fortran variable)
sendcount: number of data elements in the send buffer (integer)
sendtype: MPI data type
recvbuf: address of the receive buffer (Fortran variable)
recvcount: number of elements received from any process (integer)

51 recvtype: MPI data type ( = sendtype)

52 MPI_ALLGATHERV
MPI_ALLGATHERV extends the functionality of MPI_ALLGATHER by allowing a varying count of data to be sent from each process.
MPI_ALLGATHERV (sendbuf,sendcount,sendtype,recvbuf, recvcounts,displs,recvtype,comm,ierr)
recvcounts: integer array of length group size (msize) containing the number of elements that are received from each process
displs: integer array of length group size (msize). The entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i.

53 subroutine Send_Receive()
c
c     distribute particles
c
      SUBROUTINE Send_Receive()
c
      INCLUDE 'nbody_par.h'
      INTEGER sendcount, recvcount(msize), rdispl(msize)
      REAL*8  send(np), receive(np)
      Do i = 1, msize
         rdispl(i)    = Np_on_rank(i)
         recvcount(i) = Np_on_rank(i+1) - Np_on_rank(i)
      ENDDO
      istart    = Np_on_rank(mrank+1) + 1
      iend      = Np_on_rank(mrank+2)
      sendcount = Np_on_rank(mrank+2) - Np_on_rank(mrank+1)

54 Note: In our example all processes already know how many particles are handled by the different processors (the information is stored in the array Np_on_rank). Thus each processor can calculate the amount of data which it receives and the corresponding displacement. In a more general case this information must be distributed before calling MPI_ALLGATHERV.

55
      DO k = 1, 10
         DO i = 1, sendcount
            ii      = istart - 1 + i
            send(i) = Coords(k,ii)
         ENDDO
         CALL mpi_allgatherv(send, sendcount, MPI_REAL8,
     +        receive, recvcount, rdispl, MPI_REAL8,
     +        MPI_COMM_WORLD, ierr)
         CALL MPI_BARRIER(MPI_COMM_WORLD, ierrbar)
         DO i = 1, Np
            Coords(k,i) = receive(i)
         ENDDO
      ENDDO
      RETURN
      End

56 MPI_BARRIER
CALL MPI_BARRIER(MPI_COMM_WORLD, ierrbar)
Creates a barrier synchronization in a group. It blocks the calling process until all group members have called it; i.e. the call returns at any process only after all group members have entered the call.
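One common use, shown here as my own example rather than lecture material, is to synchronize all ranks before timing a code section with MPI_WTIME, so that every process starts the clock at the same point (mrank, mroot and ierr as above):
      REAL*8 t0, t1
c     make sure everybody has finished the previous phase
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t0 = MPI_WTIME()
c     ... the work to be timed, e.g. one integration step ...
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t1 = MPI_WTIME()
      IF (mrank .eq. mroot) write(*,*) 'step took', t1-t0, 'seconds'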

57 nbody_par.f
      Do i = 1, nsteps             ! main loop
         ...
         if (mod(istep,1000) .eq. 0) then
            Open(ifile, file=log_file, position='append')
            ...
            close(ifile)
         endif
         ...
It is useful to write information (for example timing) for each processor into separate log files constructed like:
      WRITE(a6, '(I6)') mrank
      log_file = 'DATA/timing_' // a6(5:6) // '.log'
      ifile    = 100 + mrank

58 Scaling behavior on octopus
Computation (black) and communication (gray) times. This is only an example!
200 particles, ... steps:
1 CPU: ... s;  2 CPUs: ... s;  4 CPUs: ... s;  8 CPUs: ... s;  16 CPUs: ... s (200/16 = 12.5, i.e. ...)
too simple distribution of tasks

59 Scaling behavior on octopus
2000 particles, time measured for ... integration steps
processors | time | speedup | efficiency | particles per CPU (table values not recovered)
speedup = \frac{\text{sequential execution time}}{\text{parallel execution time}}   (12)
efficiency = \frac{\text{sequential execution time}}{\text{processors used} \times \text{parallel execution time}}   (13)
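As a quick illustration with invented numbers: a serial run of 100 s and an 8-processor run of 16 s give
speedup = \frac{100\ \text{s}}{16\ \text{s}} = 6.25, \qquad efficiency = \frac{6.25}{8} \approx 0.78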

60 test it!
Parallelization and testing of the direct integration N-body code will be your homework during the first week.
A serial version of the code is available at:

61 Performance analysis
Speedup ψ(n, p) for a problem of size n on p processors. We have three categories of operations:
Computations that must be performed sequentially: σ(n)
Computations that can be performed in parallel: ϕ(n)
Parallel overhead (communication operations, redundant computations, load balance): κ(n, p)
Then the speedup ψ(n, p) is
\psi(n,p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}   (14)
and the efficiency 0 \le \epsilon(n,p) \le 1 is
\epsilon(n,p) \le \frac{\sigma(n) + \varphi(n)}{p\,\sigma(n) + \varphi(n) + p\,\kappa(n,p)}   (15)

62 Amdahl's law
Let us neglect the overhead κ(n, p) and define the inherently sequential portion
f = \frac{\sigma(n)}{\sigma(n) + \varphi(n)}   (16)
of the computation. Then the speedup on a parallel computer with p processors is (Amdahl's law)
\psi \le \frac{1}{f + (1-f)/p}   (17)
This is particularly interesting for estimating the maximum speedup as p \to \infty.
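A worked example (the sequential fraction is chosen only for illustration): with f = 0.05, Amdahl's law gives
\psi \le \frac{1}{0.05 + 0.95/16} \approx 9.1 \quad \text{for } p = 16, \qquad \psi \le \frac{1}{0.05} = 20 \quad \text{for } p \to \infty
so even a small sequential portion limits the achievable speedup severely.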

63 Amdahl's law
(figure: speedup versus number of processors; parallel fraction = 1 - f)

64 ART MPI
Basic concept: to run the simulation using N_MPI MPI processes we divide the box into N_MPI sub-boxes in such a way that all sub-boxes will need approximately the same amount of computational time for one integration step. Each MPI process uses N_OMP CPUs within OpenMP, thus N_CPU = N_MPI * N_OMP. After each basic integration step the box is divided again into sub-boxes according to the best forecast of load balance.
Input/output via parallel reading/writing of N_MPI processors on N_MPI files. The files contain for each primary particle 9 variables (3 coordinates, 3 velocities, mass, individual time step, particle id).
Finding of sub-boxes is
easy for the initial conditions, where matter is distributed almost homogeneously
almost impossible after structures have developed
even more complicated for multi-mass realizations in the original box

65 an artist's view of the ART MPI simulation box

66 ART MPI
Example for the load balance in the WMAP run (80 h^-1 Mpc box size, ... particles, 64 MPI processes, 512 CPUs, done on COLUMBIA)

67 ART MPI
each sub-box is surrounded by a thin shell with primary particles m_p
more shells contain particles with increasing mass m_2 > m_1 > m_p
the rest of the box is filled with the most massive particles m_b > m_2

68 ART MPI
periodicity of the box is taken into account
each sub-box runs one integration step of the multi-mass version of ART (tidal fields represented by the more massive particles)
after each integration step new sub-boxes are determined

69 ART MPI
Main tasks for parallelization:
determine on each node which particles have to be sent to which nodes (Fortran)
construct the corresponding massive particles from the primary ones (Fortran)
each node has to inform all others about the particles it wishes to send (MPI_allgather)
all nodes send to all nodes their particles (MPI_alltoallv)
Advantages:
less communication
communication only after each basic integration step

70 ART MPI
Example:
5 sends only massive box particles m_b to ...
... sends only massive box particles m_b to 4
10 sends primary particles m_p to 11, as well as massive ones m_1, m_2, m_b
5 sends primary particles m_p to 4, as well as massive ones m_1, m_2, m_b

71 ART MPI - the main program
c     ART_MPI_Main.f
      ...
      CALL mpi_init(ierr)
      CALL mpi_comm_size(mpi_comm_world, mpisize, ierr)
      CALL mpi_comm_rank(mpi_comm_world, irank, ierr)
      IF (mpisize .NE. n_nodes) THEN
         write(*,*) 'mpisize .ne. n_nodes', mpisize, n_nodes
         call mpi_abort(mpi_comm_world, ierr1, ierr2)
         STOP
      ENDIF
      ...

72
      CALL Read_ART_MPI_Inp ()
C$OMP PARALLEL DO DEFAULT(SHARED)
C$OMP+PRIVATE ( ic1)
      do ic1 = 1, mcell
         var(1,ic1) = -1.0
         var(2,ic1) = -1.0
         ref(ic1)   = zero
         pot(ic1)   = zero
      enddo
      ...
      call Read_Control()
      call Read_Particles()
      ...

73
c     ... main loop over mstep (read from input) integration steps
      DO ijkl = 1, mstep
         ...
         CALL Send_Small ()
         CALL Send_Large ()
c        integrate one time step
         ...
         If (aexpn .lt. 0.6) Then
            call LoadBalance2
         Else
            call LoadBalance1
         EndIf
c        redistribution of primary particles
         call Redistribute_Primaries()

74
c        write output, if necessary
         call Save_Check ()
         ...
      ENDDO
 999  Continue
      CALL Save(0)
      CALL mpi_finalize(ierr)
      END

75 ART MPI - send small particles
c
      SUBROUTINE Send_Small()
c
c     purpose: gathers and sends small particles
c     input:   iN0(3) - coordinates of sending node
c     output:  sends particles, sets n_refin = number of particles
      ...
      node = irank + 1
      CALL Node_to_IJK(node, iN0)
      ...
      Do kn = 1, n_divz          ! Loop over other nodes
       Do jn = 1, n_divy
        Do in = 1, n_divx

76
c         find boundaries of two nodes in 3D
          CALL BoundNode(iN0, iN1, Nbound)
c         find primary particles which node iN0 will send to node iN1
          CALL Find_Small(iN0, Nbound, np_node, nn, ncount)
          ...
        EndDo
       EndDo
      EndDo
      CALL Send_Receive()
      ...
      RETURN
      End
An analogous routine exists for sending large particles.

77 ART MPI - send small particles
c
      SUBROUTINE Send_Receive()
c
c     purpose: sends and receives data for particles
      INTEGER sendcount(n_nodes), recvcount(n_nodes)
      INTEGER sendcount_all(n_nodes*n_nodes)
      ...
c     send integers with the lengths of all arrays which will be
c     sent (sendcount) and received (recvcount) to/from all nodes
      CALL mpi_allgather(sendcount, n_nodes, MPI_INTEGER,
     +                   sendcount_all, n_nodes, MPI_INTEGER,
     +                   MPI_COMM_WORLD, ierr)

78
      rdispl_new = 0
      DO i = 1, n_nodes
         rdispl(i)    = rdispl_new
         recvcount(i) = sendcount_all(irank+1+n_nodes*(i-1))
         rdispl_new   = rdispl(i) + recvcount(i)
      ENDDO
      CALL mpi_alltoallv(x_se, sendcount, sdispl, MPI_REAL8,
     +     x_re, recvcount, rdispl, MPI_REAL8,
     +     MPI_COMM_WORLD, ierr)
      CALL mpi_alltoallv(y_se, sendcount, sdispl, MPI_REAL8,
     +     y_re, recvcount, rdispl, MPI_REAL8,
     +     MPI_COMM_WORLD, ierr)
      CALL mpi_alltoallv(z_se, sendcount, sdispl, MPI_REAL8,
     +     z_re, recvcount, rdispl, MPI_REAL8,
     +     MPI_COMM_WORLD, ierr)
      CALL mpi_alltoallv(vx_se, sendcount, sdispl, MPI_REAL8,
     +     vx_re, recvcount, rdispl, MPI_REAL8,
     +     MPI_COMM_WORLD, ierr)

79
      CALL mpi_alltoallv(vy_se, sendcount, sdispl, MPI_REAL8,
     +     vy_re, recvcount, rdispl, MPI_REAL8,
     +     MPI_COMM_WORLD, ierr)
      CALL mpi_alltoallv(vz_se, sendcount, sdispl, MPI_REAL8,
     +     vz_re, recvcount, rdispl, MPI_REAL8,
     +     MPI_COMM_WORLD, ierr)
      CALL mpi_alltoallv(pt_se, sendcount, sdispl, MPI_REAL,
     +     pt_re, recvcount, rdispl, MPI_REAL,
     +     MPI_COMM_WORLD, ierr)
      CALL mpi_alltoallv(wpar_se, sendcount, sdispl, MPI_REAL,
     +     wpar_re, recvcount, rdispl, MPI_REAL,
     +     MPI_COMM_WORLD, ierr)
      CALL mpi_alltoallv(ip_se, sendcount, sdispl, MPI_INTEGER,
     +     ip_re, recvcount, rdispl, MPI_INTEGER,
     +     MPI_COMM_WORLD, ierr)
      RETURN
      END

80 MPI_ALLGATHER
Has been used to distribute particles in the direct integration example. Here we distribute information about how many particles each node is going to send to the other nodes (sendcount(n_nodes)), so that all nodes know how many particles arrive from the others (sendcount_all(n_nodes*n_nodes)). Having this information, each processor can calculate where it has to put the arriving recvcount particles, i.e. rdispl.

81 MPI_ALLTOALL
Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.

82 MPI_ALLTOALL
MPI_ALLTOALL (sendbuf,sendcount,sendtype,recvbuf, recvcnt,recvtype,comm,ierr)
sendbuf: starting address of the send buffer (Fortran variable)
sendcount: number of data elements in the send buffer (integer)
sendtype: MPI data type
recvbuf: address of the receive buffer (Fortran variable)

83 recvcount: number of elements received from any process (integer)
recvtype: MPI data type ( = sendtype)
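A small MPI_ALLTOALL sketch of my own (not from the lecture): each rank sends one integer to every other rank, so with msize processes both buffers hold msize elements; the encoding 100*mrank + destination is arbitrary.
      PROGRAM alltoall_demo
      IMPLICIT NONE
      include 'mpif.h'
      INTEGER maxrank
      PARAMETER (maxrank = 128)
      INTEGER ierr, mrank, msize, i
      INTEGER sendbuf(maxrank), recvbuf(maxrank)
      CALL mpi_init(ierr)
      CALL mpi_comm_size(MPI_COMM_WORLD, msize, ierr)
      CALL mpi_comm_rank(MPI_COMM_WORLD, mrank, ierr)
c     element i of sendbuf goes to rank i-1
      DO i = 1, msize
         sendbuf(i) = 100*mrank + (i-1)
      ENDDO
      CALL mpi_alltoall(sendbuf, 1, MPI_INTEGER,
     +                  recvbuf, 1, MPI_INTEGER,
     +                  MPI_COMM_WORLD, ierr)
c     element i of recvbuf arrived from rank i-1
      write(*,*) 'rank', mrank, ':', (recvbuf(i), i=1,msize)
      CALL mpi_finalize(ierr)
      END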

84 MPI_ALLTOALLV
MPI_ALLTOALLV adds flexibility to MPI_ALLTOALL in that the location of the data for the send is specified by sdispls and the location of data on the receive side is specified by rdispls.
MPI_ALLTOALLV (sendbuf,sendcounts,sdispls,sendtype,recvbuf, recvcnts,rdispls,recvtype,comm,ierr)
sendcounts: now an integer array of length msize: the number of data elements to send to each processor
recvcnts: now an integer array of length msize: the number of data elements which can be received from each processor

85 sdispls: new: integer array of length msize specifying the displacement relative to sendbuf from which the data destined for process j has to be taken
rdispls: new: integer array of length msize specifying the displacement relative to recvbuf at which to place the incoming data from process i

86 After running MPI ART = analyze the data
an MPI version of the BDM halo finder exists (Arman Khalatyan)
an MPI version of the minimum spanning tree and friends-of-friends halo finder exists (Victor Turchaninov)

87 Bugs leading to a deadlock
A single process calls a collective function. Example: only root calls MPI_Bcast. Prevention: do not put collective communications inside conditionally executed parts of the code.
Two or more processes are trying to exchange data, but all call a blocking receive function (MPI_Recv) before sending. Prevention: you could use MPI_Sendrecv.
A process tries to receive data from a process that never will send it. Prevention: use collective communications whenever it is possible; if using point-to-point communication, use simple communication patterns.
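To illustrate the second point, a deadlock-free ring exchange with MPI_SENDRECV; this is my own sketch, not part of the lecture code: every rank sends one value to its right neighbour and receives one from its left neighbour in a single combined call.
      PROGRAM ring_demo
      IMPLICIT NONE
      include 'mpif.h'
      INTEGER ierr, mrank, msize, iright, ileft
      INTEGER istatus(MPI_STATUS_SIZE)
      REAL*8  sendval, recvval
      CALL mpi_init(ierr)
      CALL mpi_comm_size(MPI_COMM_WORLD, msize, ierr)
      CALL mpi_comm_rank(MPI_COMM_WORLD, mrank, ierr)
c     periodic neighbours in a ring of msize ranks
      iright  = mod(mrank+1, msize)
      ileft   = mod(mrank-1+msize, msize)
      sendval = dble(mrank)
c     the combined send+receive cannot deadlock, unlike two
c     blocking MPI_RECVs posted before the matching sends
      CALL mpi_sendrecv(sendval, 1, MPI_DOUBLE_PRECISION, iright, 0,
     +                  recvval, 1, MPI_DOUBLE_PRECISION, ileft,  0,
     +                  MPI_COMM_WORLD, istatus, ierr)
      write(*,*) 'rank', mrank, 'received', recvval, 'from', ileft
      CALL mpi_finalize(ierr)
      END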

88 Web pages about MPI
Writing Message-Passing Parallel Programs with MPI
SP Parallel Programming Workshop
The Message Passing Interface (MPI) standard
MPI: A Message-Passing Interface Standard
MPI-2: Extensions to the Message-Passing Interface

89 Books about MPI
Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004, ISBN
Peter S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, 1997, ISBN
MPI: The Complete Reference (The MIT Press, ISBN )

More information

CS 6230: High-Performance Computing and Parallelization Introduction to MPI

CS 6230: High-Performance Computing and Parallelization Introduction to MPI CS 6230: High-Performance Computing and Parallelization Introduction to MPI Dr. Mike Kirby School of Computing and Scientific Computing and Imaging Institute University of Utah Salt Lake City, UT, USA

More information

Lecture 7: More about MPI programming. Lecture 7: More about MPI programming p. 1

Lecture 7: More about MPI programming. Lecture 7: More about MPI programming p. 1 Lecture 7: More about MPI programming Lecture 7: More about MPI programming p. 1 Some recaps (1) One way of categorizing parallel computers is by looking at the memory configuration: In shared-memory systems

More information

Scientific Computing

Scientific Computing Lecture on Scientific Computing Dr. Kersten Schmidt Lecture 21 Technische Universität Berlin Institut für Mathematik Wintersemester 2014/2015 Syllabus Linear Regression, Fast Fourier transform Modelling

More information

A short overview of parallel paradigms. Fabio Affinito, SCAI

A short overview of parallel paradigms. Fabio Affinito, SCAI A short overview of parallel paradigms Fabio Affinito, SCAI Why parallel? In principle, if you have more than one computing processing unit you can exploit that to: -Decrease the time to solution - Increase

More information

Introduction to Parallel Programming with MPI

Introduction to Parallel Programming with MPI Introduction to Parallel Programming with MPI PICASso Tutorial October 25-26, 2006 Stéphane Ethier (ethier@pppl.gov) Computational Plasma Physics Group Princeton Plasma Physics Lab Why Parallel Computing?

More information

CINES MPI. Johanne Charpentier & Gabriel Hautreux

CINES MPI. Johanne Charpentier & Gabriel Hautreux Training @ CINES MPI Johanne Charpentier & Gabriel Hautreux charpentier@cines.fr hautreux@cines.fr Clusters Architecture OpenMP MPI Hybrid MPI+OpenMP MPI Message Passing Interface 1. Introduction 2. MPI

More information

Programming with MPI

Programming with MPI Programming with MPI p. 1/?? Programming with MPI More on Datatypes and Collectives Nick Maclaren nmm1@cam.ac.uk May 2008 Programming with MPI p. 2/?? Less Basic Collective Use A few important facilities

More information

Optimization of MPI Applications Rolf Rabenseifner

Optimization of MPI Applications Rolf Rabenseifner Optimization of MPI Applications Rolf Rabenseifner University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Optimization of MPI Applications Slide 1 Optimization and Standardization

More information

Decomposing onto different processors

Decomposing onto different processors N-Body II: MPI Decomposing onto different processors Direct summation (N 2 ) - each particle needs to know about all other particles No locality possible Inherently a difficult problem to parallelize in

More information

Advanced Parallel Programming

Advanced Parallel Programming Advanced Parallel Programming Networks and All-to-All communication David Henty, Joachim Hein EPCC The University of Edinburgh Overview of this Lecture All-to-All communications MPI_Alltoall MPI_Alltoallv

More information

Lecture 9: MPI continued

Lecture 9: MPI continued Lecture 9: MPI continued David Bindel 27 Sep 2011 Logistics Matrix multiply is done! Still have to run. Small HW 2 will be up before lecture on Thursday, due next Tuesday. Project 2 will be posted next

More information

MPI MESSAGE PASSING INTERFACE

MPI MESSAGE PASSING INTERFACE MPI MESSAGE PASSING INTERFACE David COLIGNON, ULiège CÉCI - Consortium des Équipements de Calcul Intensif http://www.ceci-hpc.be Outline Introduction From serial source code to parallel execution MPI functions

More information

a. Assuming a perfect balance of FMUL and FADD instructions and no pipeline stalls, what would be the FLOPS rate of the FPU?

a. Assuming a perfect balance of FMUL and FADD instructions and no pipeline stalls, what would be the FLOPS rate of the FPU? CPS 540 Fall 204 Shirley Moore, Instructor Test November 9, 204 Answers Please show all your work.. Draw a sketch of the extended von Neumann architecture for a 4-core multicore processor with three levels

More information