MPI and CUDA. Filippo Spiga, HPCS, University of Cambridge.

Size: px

Start display at page:

Download "MPI and CUDA. Filippo Spiga, HPCS, University of Cambridge."

Stephanie Alexander
6 years ago
Views:

1 MPI and CUDA Filippo Spiga, HPCS, University of Cambridge

2 Outline Basic principle of MPI Mixing MPI and CUDA 1 st example : parallel GPU detect 2 nd example: heat2d CUDA- aware MPI, how and why 3 rd example: Jacobi GPU Direct over RDMA Examples are located here /scratch/id004914/cuda- mpi See CREDITS.txt for proper acknowledgments 2

3 MPI Overview MPI = Message Passing Interface A specification of a set of functions with prescribed behavior (MPI 1.0, 1.1, 2.0, 2.1, 3.0, ) Point- to- point, collective, one- sided, RDMA, Not a library there are multiple competing implementations of the specification Two popular open- source implementations are OpenMPI (1.8) and MVAPICH2 (2.1) Most MPI implementations from vendors are customized versions of these (e.g. Intel MPI) Almost all open- source MPI implementation support Ethernet and high- speed interconnect (e.g. Infiniband) 3

4 MPI Programming model An MPI program consists of several processes Each process can execute different instructions Each process has its own memory space Processes can only communicate by sending messages to each other COMPUTE COMPUTE COMPUTE COMMUNICATION COMMUNICATION COMMUNICATION COMPUTE COMPUTE COMPUTE 4

5 Very common MPI routines You can do (almost) everything with these MPI routines MPI_Init MPI_Comm_Size MPI_Comm_Rank MPI_Send MPI_Recv MPI_Barrier MPI_Finalize 5

6 An simple MPI program int main(int { argc, char *argv[]) int myid, numprocs; int buffer[100]; int tag=1234; MPI_Status status; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); if (myid == 0){ MPI_Send(&buffer,100,MPI_INT,1,tag,MPI_COMM_WORLD); } if (myid == 1){ MPI_Recv(&buffer,100,MPI_INT,0,tag,MPI_COMM_WORLD,&status); } } MPI_Finalize(); 6

7 A slightly complex MPI program int main(int argc, char *argv[]) { MPI_Request req_in, req_out; MPI_Status stat_in, stat_out; float a[10], b[10]; int mpi_rank, mpi_size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank); MPI_Comm_size(MPI_COMM_WORLD, &mpi_size); if (mpi_rank == 0) { MPI_Irecv(b, 10, MPI_FLOAT, 1, 200, MPI_COMM_WORLD, &req_in); MPI_Isend(a, 10, MPI_FLOAT, 1, 300, MPI_COMM_WORLD, &req_out); } if (mpi_rank == 1) { MPI_Irecv(b, 10, MPI_FLOAT, 0, 300, MPI_COMM_WORLD, &req_in); MPI_Isend(a, 10, MPI_FLOAT, 0, 200, MPI_COMM_WORLD, &req_out); } } MPI_Waitall(1, &req_in, &stat_in); MPI_Waitall(1, &req_out, &stat_out); MPI_Finalize(); 7

8 A closer look int MPI_Init ( int *argc, char ***argv ) Initialises the MPI execution environment int MPI_Comm_size ( MPI_Comm comm, int *size ) Determines the size of the group associated with a communicator int MPI_Comm_rank ( MPI_Comm comm, int *rank ) Determines the rank of the calling process in the communicator int MPI_Finalize () Terminates MPI execution environment 8

9 A closer look int MPI_Irecv ( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request*request ) buf: memory location for message count: number of elements in message datatype: type of elements in message (e.g. MPI_FLOAT) source: rank of source tag: message tag comm: communicator request: communication request (used for checking message status) 9

10 Compile and run MPI program MPI implementations provide wrappers for popular compilers These are normally named mpicc/mpicxx/mpif77 etc. Wrappers call normal compilers and add automatically everything is needed to link MPI library Running an MPI program normally through mpirun or mpiexec Commands slightly change parameters based of MPI library Always be aware and careful about process binding and task affinity MPI runtimes manage the utilization of the best interconnect available on the node 10

11 MPI and GPU what s the catch? MPI processes have their own separate memory space CPU and GPU have their own memory space I distribute data across MPI processes (coarse- grain parallelism) I can move some (or all?) local computation from CPU to GPU to accelerate it (fine- grain parallelism) MPI communication occur on the HOST sides GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network Card Network Card Network Card Node 0 Node 1 Node n-1 11

12 MPI and GPU what s the catch? GPU CPU CPU GPU COMPUTE COMPUTE COMMUNICATION COMMUNICATION COMPUTE COMPUTE 12

13 MPI and GPU MPI Rank 0 MPI Rank 1 GPU Host cudamemcpy(s_buf_h,s_buf_d,size,cudamemcpydevicetohost); MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD); MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat); cudamemcpy(r_buf_d,r_buf_h,size,cudamemcpyhosttodevice); 13

14 Example 1: query GPU (BASIC) For each MPI process (order by rank): Check how many GPU are visible by each MPI process For each visible GPU: Query GPU information and print them Questions: How can I print output ordered by rank? How many GPU each MPI see? How can I assign a specific GPU to a specific MPI? 14

15 Binding modes 3 scenarios More GPU than MPI As many GPU as MPI More MPI than GPU IDEAL NVIDIA MPS 15

16 Assign GPUs to MPI processes How can I assign GPU to MPI? Internally in the application using cudasetdevice() You need to know what you are doing Externally from a scheduler point of view by change environment variable CUDA_VISIBLE DEVICES based on local MPI rank on a node MV2_COMM_WORLD_LOCAL_RANK for MVAPICH2 OMPI_COMM_WORLD_LOCAL_RANK for OpenMPI Externally from a system perspective by changing COMPUTE mode of a GPU to exclusive This force 1:1 MPI- GPU binding 16

17 Example 2: heat2d (COMPLEX) In 2D: T t = α 2 T x + 2 T 2 y 2 For which a possible finite difference approximation is: ΔT Δt = α T i+1, j 2T i, j + T i 1, j Δx 2 + T i, j +1 2T i, j + T i, j 1 Δy 2 where ΔT is the temperature change over a time Δt and i,j are where the temperature change over a time interval and i,j are indices into a uniform structured grid 17

18 Example 2: heat2d (COMPLEX) Stencil operation (radius 1) d point using data from blue points (and red point) 18

19 Example 2: heat2d (COMPLEX) Domain decomposition and halos 19

20 Example 2: heat2d (COMPLEX) Domain decomposition and halos 20

21 Example 2: heat2d (COMPLEX) Domain decomposition and halos 21

22 Example 2: heat2d (COMPLEX) Communication pattern: The left- most rank sends data to the right The inner ranks send data to both the left and the right The right- most rank sends data to the left MPI can read and write directly from 2D arrays using an advanced feature called datatypes (but this is complicated for GPUs). Message- passing strategy: Fill outgoing buffers (2D - > 1D) Send from outgoing buffers, receive into incoming buffers Wait Fill arrays from incoming buffers (1D - > 2D) 22

23 Example 2: heat2d (COMPLEX) Message- passing strategy with GPUs: Fill outgoing buffers on GPU using a kernel (2D - > 1D) Copy buffers to CPU - cudamemcpy(devicetohost) Send from outgoing buffers, receive into incoming buffers Wait Copy buffers to GPU - cudamemcpy(hosttodevice) Fill arrays from incoming buffers on GPU using a (1D - > 2D) PLUS (not implemented) Pinned memory and asynchronous data- transfer 23

24 Scaling and GPU When benchmarking MPI applications, we look at two issues: Strong scaling how well does the application scale with multiple processors for a fixed problem size? Weak scaling how well does the application scale with multiple processors for a fixed problem size per processor? Achieving good scaling is more difficult with GPUs for mainly two reasons: There is an extra memory copy (H2D/D2H) involved for every message If CUDA kernels are much faster then the MPI communication becomes a larger fraction of the overall runtime 24

25 MPI and GPU We learnt that we need to copy back/forward from/to GPU before communicate data to other MPI processes memcpy D->H MPI_Sendrecv memcpy H->D Time Can I do better? Can this be simpler? 25

26 CUDA- aware MPI MPI library that accepts GPU pointers currently Open MPI, MVAPICH2 and CRAY MPI support this 26

27 CUDA- aware MPI under the hood data transfer H2D/D2H are hidden and managed by the MPI library Automatic selection of best strategy (e.g. chunk size) Automatic selection and best path 27

28 Example 3: Jacobi (ADVANCED) It solves the Poisson equation on a rectangle with Dirichlet boundary conditions. 2D data decomposition halo exchange What to do Play with/without CUDA- aware Play with local/global domain sizes check performance and scalability 28

Can I do even better? No matter if you do explicitly H2D/D2H copies or you let MPI do it for you you always need to reallocate some memory on the HOST! Waste of resources!

29 Can I do even better? No matter if you do explicitly H2D/D2H copies or you let MPI do it for you you always need to reallocate some memory on the HOST! Waste of resources! GPUDirect technology, introduced with CUDA 3.1. This feature allows the network fabric driver and the CUDA driver to share a common pinned buffer in order to avoids an unnecessary memcpy within host 29

GPU direct over RDMA Buffers can be directly sent from the GPU memory to a network adapter without staging through host memory. Available from CUDA 5.

30 GPU direct over RDMA Buffers can be directly sent from the GPU memory to a network adapter without staging through host memory. Available from CUDA 5.0 You need Mellanox IB cards (FDR or better) You need GPU and IB card to seat in the same PCI bus so they can talk directly It works better for small data transfer and point- to- point 30

31 CUDA- aware MPI conclusions CUDA- aware MPI works thank to UVA - - Unified Virtual Addressing (introduced since CUDA 4.0) host memory and the memory of all GPUs in a system (a single node) are combined into one large (virtual) address space. Easy programmability of MPI+CUDA applications Small (manageable) constrain in portability MPI libraries need to support such functionality Better performance all operations that are required to carry out the message transfer can be pipelined In principle always true but highly dependent by the quality of MPI implementation Acceleration technologies like GPU Direct over RDMA can be utilized by the MPI library transparently to the user. 31

32 Resources online Top 3 must- read links: n- cuda- aware- mpi/ ing- cuda- aware- mpi/ ing- gpudirect- rdma- on- modern- server- platforms/ 32

CS 179: GPU Programming. Lecture 14: Inter-process Communication

CS 179: GPU Programming. Lecture 14: Inter-process Communication CS 179: GPU Programming Lecture 14: Inter-process Communication The Problem What if we want to use GPUs across a distributed system? GPU cluster, CSIRO Distributed System A collection of computers Each