Bryan Carpenter, School of Computing

2 Plan
- Brief reprise of parallel computers and programming models
- Overview of MPI, with illustrative example
- Brief comparison with OpenMP
- Use of MPI in GADGET 2

3 What is MPI?
- MPI is a programming interface for sending and receiving messages in C or Fortran programs.
- Beyond this, it effectively provides a general framework for Single Program Multiple Data (SPMD) computing on distributed memory parallel computers.

4 Parallel Computers
Traditional dichotomy: shared vs distributed memory.
[Figure: shared memory and distributed memory machines; labels: processors, memory, fast interconnect]

5 Programming Frameworks
Shared memory (cooperating threads):
- OpenMP
- POSIX Threads
- Cilk
- Threading Building Blocks
- etc, etc
Distributed memory (cooperating processes):
- MPI (PVM, etc)
- Co-array Fortran, UPC, etc
- Global Array Toolkit (etc)
- Adlib and HPspmd?!
- etc, etc

6 Real World Hybrids
Clusters of shared memory nodes are typical of large modern systems, e.g. SCIAMA.
[Figure: node 0 and node 1, each with its own memory, joined by a fast interconnect; "... repeat 42 times"]

7 Programming Hybrid Machines
- Can be programmed using a combination of a shared memory framework (e.g. OpenMP) within nodes, and a distributed memory framework (e.g. MPI) between nodes.
- But it is easier to adopt the lowest common denominator: use MPI (say) across all cores.

8 General MIMD Programming
Multiple Instruction, Multiple Data: a program contains different types of interacting process, each coded separately, e.g.:
Routine for process A:
    ... some computation ...
    sendto(b, values) ;
    ... etc ...
Routine for process B:
    recvfrom(a, values) ;
    ... process values ...
    ... etc ...

9 Single Program, Multiple Data
- General MIMD style is not usually appropriate for programming massively parallel systems (perhaps OK for distributed systems).
- The mindset, rather, is that all cores work collectively on a single task.
- The same program is executed by all cores, though a different subset of the data is processed by each: hence SPMD.

10 SPMD For Two Processes
Contrived example (see subsequent discussion):
    me = getid();   // numeric id of this process
    if(me == 0) {
        ... some computation ...
        sendto(1, values) ;
    }
    else if (me == 1) {
        recvfrom(0, values) ;
        ... process values ...
    }
    ... etc ...

11 Comments
- Illustrates a principle, but real SPMD programs are almost never like this.
- Instead all processes do similar processing (on a different subset of data), then they all exchange portions of their data with peers.
- Behaviour is conditioned by the me value, but all processes are doing similar kinds of things most of the time.
- The exception is often node 0, which may at least part of the time be engaged in unique I/O or coordination roles.
- Here node = process: note the overloaded terminology!

12 Collective Behaviour
- The program as a whole behaves collectively.
- Non-trivial parallel tasks can also be subdivided ("horizontally") into individual phases that are collective, e.g. a single iteration of a relaxation solver.
- Each iteration can be further broken down into smaller collective phases:
  1. Update locally held elements
  2. Exchange edge elements with neighbours

13 Collective Behaviour 2
- Some data is global: the same for all processes.
- A trivial example is the overall iteration count for our relaxation solver. Every process holds a copy of the iteration number, but it is updated identically by all processes as iterations unfold.
- Contrast with shared memory programming, e.g. OpenMP, where typically this count would be maintained just by the master thread.
- In (distributed memory) SPMD, copies of global information are often maintained and updated redundantly by all processes. And this is the right way to do things!

15 Features of MPI
- First MPI standard circa [...].
- It introduced the important abstraction of a communicator, which is an object something like an N-way communication channel, connecting all members of a group of cooperating processes. This was partly to support using multiple parallel libraries without interference.
- It also introduced a novel concept of datatypes, used to describe the contents of communication buffers, partly to support zero-copy message transfer.

16 Minimal MPI Program (C language)
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>
    int main(int argc, char* argv []) {
        int me ;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        if(me == 1) {
            char* message = "Hello from process 1\n";
            /* + 1 so that the terminating '\0' is sent as well */
            MPI_Send(message, (int) strlen(message) + 1, MPI_BYTE, 0, 111, MPI_COMM_WORLD);
        }
        else if(me == 0) {
            char buffer [100];
            MPI_Status status ;
            MPI_Recv(buffer, 100, MPI_BYTE, 1, 111, MPI_COMM_WORLD, &status);
            printf("%s", buffer);
        }
        MPI_Finalize();
        return 0;
    }

17 Running on SCIAMA
Setting up environment and compiling on login node:
    $ module load compilers/intel/intel-64 mpi/intel/openmpi/1.4.3
    $ mpicc example1.c
Following job script in (say) jobscript1.sh:
    #!/bin/bash
    #PBS -l nodes=2:ppn=1
    #PBS -l walltime=00:01:00
    #PBS -d /users/dbc/creche/examples
    . /etc/profile.d/modules.sh
    module load compilers/intel/intel-64 mpi/intel/openmpi/1.4.3
    mpirun a.out

18 Running on SCIAMA 2
    $ qsub ./jobscript1.sh
    9470.headnode1.sciama.icg.port.ac.uk
    $ ls
    [...] jobscript1.sh.e9470 jobscript1.sh.o9470
    $ cat jobscript1.sh.o9470
    [...]
    =======================================================
    Job Output Follows:
    =======================================================
    Hello from process 1
    =================================================================
    Torque job completed on Sun Feb 27 14:49:02 GMT 2011 with exit status of 0
    =================================================================

19 The Same in Fortran
    PROGRAM main
    USE mpi
    INTEGER me, ierr
    INTEGER stat(MPI_STATUS_SIZE)
    CHARACTER*20 message, buffer
    CALL MPI_INIT( ierr )
    CALL MPI_COMM_RANK( MPI_COMM_WORLD, me, ierr )
    IF (me .eq. 1) THEN
        message = 'Hello from process 1'
        CALL MPI_SEND(message, 20, MPI_CHARACTER, 0, 111, MPI_COMM_WORLD, ierr)
    ELSE IF (me .eq. 0) THEN
        CALL MPI_RECV(buffer, 20, MPI_CHARACTER, 1, 111, MPI_COMM_WORLD, stat, ierr)
        PRINT *, buffer
    ENDIF
    CALL MPI_FINALIZE(ierr)
    STOP
    END

20 MPI Program Structure
- An MPI program is an ordinary C application; the code that runs in every process starts in the main() function.
- Functions, types and constants of MPI are declared in mpi.h, which should be included.
- MPI is initialized by calling MPI_Init(). You should forward the parameters of main() to MPI_Init().
- Call MPI_Finalize() to shut down MPI before main() returns. Failing to do this may cause your executable to not terminate properly.

21 Simple send and receive
Basic send and receive:
    int MPI_Send(void* buf, int count, MPI_Datatype type, int dst, int tag, MPI_Comm comm)
    int MPI_Recv(void* buf, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status *status)
- The parameters buf, count, type describe the data buffer (the storage of the data that is sent or received); see next slide.
- dst is the rank of the destination process relative to communicator comm. Similarly in MPI_Recv(), src is the rank of the source process.
- An arbitrarily chosen tag value can be used in MPI_Recv() to select between several incoming messages: the call will wait until a message sent with a matching tag value arrives.
- MPI_Recv() fills in an MPI_Status value through its status argument, discussed later.

22 Communication Buffers
- Most of the communication operations take a sequence of parameters like
      void* buf, int count, MPI_Datatype type
- In the actual arguments passed to these functions, buf should be interpreted as an array of elements of a type consistent with the type parameter.
- MPI_Datatype can describe many primitive and user-defined types that occur in C or Fortran.
- count is the number of items to send.

23 Communicators
- The MPI_Comm type represents an MPI communicator. All communication operations logically go through communicators.
- A communicator spans a group of processes: the participants in some kind of parallel task or subtask.
- In MPI, process ids are called ranks, and they are always relative to a particular process group.
- Many programmers only ever use the predefined, global communicator MPI_COMM_WORLD!
- Ranks relative to MPI_COMM_WORLD are the obvious process ids between 0 and P-1.

24 Rank and Size Example
    #include <mpi.h>
    #include <stdio.h>
    int main(int argc, char* argv []) {
        int me, p ;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        printf("hello from process %d of %d\n", me, p);
        MPI_Finalize();
        return 0;
    }

25 Discrete Poisson Solver
- Solves $a_{i,j-1} + a_{i,j+1} + a_{i-1,j} + a_{i+1,j} - 4a_{i,j} = f_{i,j}$ by red-black relaxation.
- Consider an N by N grid with periodic BCs, for simplicity.
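Rearranging the five-point equation for the centre value gives the update that each relaxation step applies. This is only a restatement of the slide's equation, a sketch assuming indices are taken modulo N to express the periodic boundary conditions:

```latex
\begin{aligned}
a_{i,j-1} + a_{i,j+1} + a_{i-1,j} + a_{i+1,j} - 4a_{i,j} &= f_{i,j} \\
\Longrightarrow\quad a_{i,j} &= \tfrac{1}{4}\left(a_{i-1,j} + a_{i+1,j} + a_{i,j-1} + a_{i,j+1} - f_{i,j}\right)
\end{aligned}
```

In red-black ordering, points with $i + j$ even are updated in one half-sweep and points with $i + j$ odd in the next. All four neighbours of a point have the opposite parity (for even N, including across the periodic wrap), so the updates within one half-sweep do not depend on each other.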

26 Data Decomposition
Divide the a and f arrays over P processes, each with B = N/P rows.
[Figure: array divided into blocks of rows held by process 0, process 1, ..., process P-1]

27 Local Update
Initially ignoring edges; assume B is even:
    for (iter = 0 ; iter < NITER ; iter++) {
        for(i = 0 ; i < B ; i++)
            for(j = (iter + i) % 2 ; j < N ; j += 2) {
                a [i][j] = 0.25 * (a [i - 1][j] + a [i + 1][j] +
                                   a [i][j - 1] + a [i][j + 1] - f [i][j]);
            }
    }

28 Ghost Regions
- The biggest problem is with access to elements a[i-1][j] and a[i+1][j], which may fall in a segment of the array held by another process.
- Deal with this by declaring local arrays with ghost extensions: extra rows in this example.

29 Data with Ghost Extensions
[Figure: blocks for process 0, process 1, ..., process P-1, each extended with ghost regions]

30 Possible C Declarations
    #define P 4       // number of processors
    #define N 8       // array size (multiple of P)
    #define B (N/P)   // local block size
    float ag [B + 2][N] ;      // local block plus ghost regions
    float f [B][N] ;           // local block
    float (*a)[N] = ag + 1 ;   // just local block
                               // (pointer magic to skip lower ghost row)

31 Edge Swap
[Figure: edge swap between process 0, process 1, ..., process P-1]

32 MPI Code
    void edgeswap() {
        int prev, next ;
        MPI_Status status ;
        prev = (me + P - 1) % P ;
        next = (me + 1) % P ;
        // First row to high ghost row on previous processor
        MPI_Sendrecv(a [0], N, MPI_FLOAT, prev, 111,
                     ag [B + 1], N, MPI_FLOAT, next, 111,
                     MPI_COMM_WORLD, &status) ;
        // Last row to low ghost row on next processor
        MPI_Sendrecv(a [B - 1], N, MPI_FLOAT, next, 111,
                     ag [0], N, MPI_FLOAT, prev, 111,
                     MPI_COMM_WORLD, &status) ;
    }

33 Send-receive
Combines a basic send and a basic receive in a single operation:
    int MPI_Sendrecv(
        void* sendbuf, int sendcount, MPI_Datatype sendtype, int dst, int sendtag,
        void* recvbuf, int recvcount, MPI_Datatype recvtype, int src, int recvtag,
        MPI_Comm comm, MPI_Status *status)
- Can be more efficient.
- More importantly, avoids a potential deadlock in this example: if every process first called a blocking MPI_Send() to its neighbour, all of them could block waiting for a matching receive.

34 Final Main Loop
    for (iter = 0 ; iter < NITER ; iter++) {
        edgeswap() ;
        for(i = 0 ; i < B ; i++)
            for(j = (iter + i) % 2 ; j < N ; j += 2) {
                a [i][j] = 0.25 * (a [i - 1][j] + a [i + 1][j] +
                                   a [i][(j + N - 1) % N] + a [i][(j + 1) % N] -
                                   f [i][j]);
            }
    }
    ... Optional debugging output ...

35 Initialization
    /* Point source and initial approximation to solution */
    for(i = 0 ; i < B ; i++)
        for(j = 0 ; j < N ; j++) {
            int x = me * B + i ;   /* global indices from local */
            int y = j ;
            if (x == N / 2 && y == N / 2) {
                f [i][j] = 1 ;
            }
            else {
                f [i][j] = 0 ;
            }
            a [i][j] = 0 ;
        }

36 Printing a Distributed Array
Simplified: assumes enough memory to hold the whole array on the root process:
    float a0 [N][N] ;
    int i, j ;
    MPI_Gather(a, N * B, MPI_FLOAT, a0, N * B, MPI_FLOAT, 0, MPI_COMM_WORLD) ;
    if(me == 0) {
        for(i = 0 ; i < N ; i++) {
            for(j = 0 ; j < N ; j++) {
                printf("%8.3f", a0 [i] [j]) ;
            }
            printf("\n");
        }
    }

37 Gather: a Collective Operation
Collectives must be called by all processes (spanned by communicator):
    int MPI_Gather(
        void* sendbuf, int sendcount, MPI_Datatype sendtype,
        void* recvbuf, int recvcount, MPI_Datatype recvtype,
        int root, MPI_Comm comm)
[Figure: data from process 0, ..., process P-1 collected on process root]

38 Other Important Collectives
- MPI_Bcast: broadcast from some root process to all processes
- MPI_Scatter: the opposite of gather
- MPI_Reduce: sum, product, etc of values from all processes
- etc: a handful of others.

40 OpenMP
- OpenMP (which rather misleadingly styles itself an "API") is a set of compiler directives and supporting libraries for exploiting shared memory computers.
- Typically more concise than parallelization with MPI, but there are still pitfalls, e.g. race conditions.
- Works within a node (at most 12 cores on SCIAMA, for example).

41 Poisson Main Loop with OMP
    for (iter = 0 ; iter < NITER ; iter++) {
    #pragma omp parallel for private (j)
        for(i = 0 ; i < N ; i++)
            for(j = (iter + i) % 2 ; j < N ; j += 2) {
                a [i][j] = 0.25 * (a [(i + N - 1) % N][j] + a [(i + 1) % N][j] +
                                   a [i][(j + N - 1) % N] + a [i][(j + 1) % N] -
                                   f [i][j]);
            }
    }
    ... Optional debugging output ...

42 Running on SCIAMA
Setting up environment and compiling on login node:
    $ module load compilers/intel/intel-64 mpi/intel/openmpi/1.4.3
    $ gcc -fopenmp omp_poisson.c
Submit using e.g. the following job script:
    #!/bin/bash
    #PBS -l nodes=1:ppn=12
    #PBS -l walltime=00:01:00
    #PBS -d /users/dbc/creche/examples
    . /etc/profile.d/modules.sh
    module load compilers/intel/intel-64 mpi/intel/openmpi/1.4.3
    export OMP_NUM_THREADS=12
    ./a.out

43 Code Complexity
- For what it's worth, my MPI version of the Poisson solver is 99 lines of code; my OpenMP version is 62 lines.
- Note the private (j) clause in the OMP directive: try running without it!

45 Gadget 2
- Gadget 2 is a free-software production code for cosmological N-body (and hydrodynamic) computations.
- Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching.
- It is written in the C language and already parallelized using MPI.

46 Gadget Main Loop
Simplified view of the Gadget code:
    Initialize
    while (not done) {
        move_particles();               // update positions
        domain_decomposition();
        compute_accelerations();
        advance_and_find_timesteps();   // update velocities
    }
Most of the interesting work happens in compute_accelerations and domain_decomposition.

47 Domain Decomposition
- Need to divide space and/or the particle set into domains, each domain handled by a single processor.
- Can't just divide space evenly, because some regions will have many more particles than others: poor load balancing.
- Can't just divide particles evenly, because particles move throughout space, and we want to keep physically close particles on the same processor as far as practical: communication problem.

48 Peano-Hilbert Curve
Warren and Salmon originally suggested using a space-filling curve:
[Figure: picture borrowed from ...]

49 Peano-Hilbert Key
- Gadget applies the recursion 20 times, logically dividing space into up to $2^{20} \times 2^{20} \times 2^{20}$ cells on the Peano-Hilbert curve.
- Then can label each cell by its location along the Peano-Hilbert curve: $2^{60}$ possible locations comfortably fit into a 64-bit word.

50 Distribution of BH Tree in Gadget
[Figure. Ibid.]

51 Distributed Representation of Tree
- Every processor holds a copy of the root node, and a copy of all child nodes down to the point where all particles of a node are held on a single remote processor (cf. the discussion of global data on the last slide of the intro).
- Remotely held nodes are called pseudo-particles.
- To compute the force on a single local target particle, traverse the tree from the root as usual, and accumulate contributions from locally held particles.
- Build an export list containing the target particle and the hosts of pseudo-particles encountered in the walk.

52 Communication
1. After local computation for all target particles, process the export list and send the list of local target particles to all hosts that own pseudo-particle nodes needed for those particles.
2. All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles.
3. These contributions are returned to the original processor, and added into the accelerations for the target particles.

53 Use of MPI in GADGET 2
- Uses collective communications wherever possible throughout the code.
- But some sections have more intricate patterns of communication, not supported by MPI collectives.
- These sections are generally harder to understand.

54 The Hard Parts
- Export of particles to other nodes, for calculation of remote contribution to force, density, etc, and retrieval of results (see above).
- A partial distributed sort of particles, according to Peano-Hilbert key: this implements domain decomposition.
- For TreePM: projection of particle density to a regular grid for calculation of the long range force; scatter results back to irregularly distributed particles.

55 Use of Point-to-point
- In these difficult sections of code it is notable that there is extensive use of low-level MPI point-to-point communication (i.e. send/recv, not collectives).
- The code is difficult to understand, and presumably difficult to maintain.

56 A Program of Development
Devise higher-level libraries of operations akin to MPI collectives, but more application-specific, e.g. for
1. Collective Asynchronous Remote Invocation.
