Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition

Size: px

Start display at page:

Download "Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition"

Hugh Dickerson
5 years ago
Views:

1 Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition CUG 2018, Stockholm Andreas Jocksch, Matthias Kraushaar (CSCS), David Daverio (University of Cambridge, Centre for Theoretical Cosmology) 24th May, 2018

2 Optimised all-to-all communication 2

Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition Basic algorithm Generalisation Bruck

3 Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition Basic algorithm Generalisation Bruck Blocking / non-blocking communication GPU-direct Applications programming interface Benchmarks Applications Optimised all-to-all communication 3

4 Motivation Modern supercomputers equipped with multicore CPUs Hybridly OpenMP + MPI or pure MPI programmed, focus on MPI only If GPUs are present often MPI + CUDA / OpenACC Collective communication operations provide fast message transfers for certain communication patterns All-to-all personalised communication critical for many applications Assumption of a fully connected network, described by latency and bandwidth Optimised all-to-all communication 4

5 Motivation Scheduling scheme performance critical For small messages store and forward algorithms, e.g., Bruck s algorithm MPICH, OpenMPI and MVAPICH apply the algorithm with all cores equally connected, the reason should be the current MPI standard For allreduce, reduce, broadcast, scatter and gather shared memory parallelism Generalised case of multiple all-to-all communication for pencil decomposed FFTs Library which exploits shared memory, Li and Laizet CUG 2010 More efficient communication algorithm for all-to-all / all-to-allv / FFTs introduced here plan routines, GPU direct supported Optimised all-to-all communication 5

6 Basic all-to-all algorithm Communication on the node using shared memory Communication between nodes done by multiple (e.g. all) cores Optimised all-to-all communication 6

7 Basic all-to-all algorithm 1. MPI Irecv 2. memcpy sendbuffer to shared sendbuffer 3. node barrier 4. MPI Isend 5. memcpy shared sendbuffer to shared receivebuffer, for data which remains on the node 6. node barrier 7. MPI Waitall 8. memcpy shared receivebuffer to receivebuffer Optimised all-to-all communication 7

8 Generalisation Bruck s algorithm A B C D E F G A B C D E F G A B C D E F G A B C D E F G A A B A C A D A E A F A G A A A B B C C D D E E F F G G A A B B C C D D E E F F G G A A B B C C D D E E F F G G A B B B C B D B E B F B G B A B B C C D D E E F F G G A G A A B B C C D D E E F F G G A A B B C C D D E E F F G A C B C C C D C E C F C G C A C B D C E D F E G F A G B A C B D C E D F E G F A G B F A G B A C B D C E D F E G A D B D C D D D E D F D G D A D B E C F D G E A F B G C A D B E C F D G E A F B G C A D B E C F D G E A F B G C A E B E C E D E E E F E G E A E B F C G D A E B F C G D G D A E B F C G D A E B F C G D A E B F C G D A E B F C A F B F C F D F E F F F G F A F B G C A D B E C F D G E A F B G C A D B E C F D G E F D G E A F B G C A D B E C A G B G C G D G E G F G G G A G B A C B D C E D F E G F A G B A C B D C E D F E G F A G B A C B D C E D F E G F (a) Initial data (b) After local rotation (c) After substep 1 of communication step 1 (d) After substep 2 of communication step 1 A B C D E F G A B C D E F G A B C D E F G A A B B C C D D E E F F G G A A B B C C D D E E F F G G A A A B A C A D A E A F A G G A A B B C C D D E E F F G G A A B B C C D D E E F F G B A B B B C B D B E B F B G F A G B A C B D C E D F E G F A G B A C B D C E D F E G C A C B C C C D C E C F C G E A F B G C A D B E C F D G E A F B G C A D B E C F D G D A D B D C D D D E D F D G D A E B F C G D A E B F C G D A E B F C G D A E B F C G E A E B E C E D E E E F E G C A D B E C F D G E A F B G C A D B E C F D G E A F B G F A F B F C F D F E F F F G A G B A C B D C E D F E G F B A C B D C E D F E G F A G G A G B G C G D G E G F G G (e) After substep 1 of communication step 2 (f) After substep 2 of communication step 2 (g) After local inverse rotation Bruck et al Optimised all-to-all communication 8

9 Multiple all-to-all algorithm For FFTs with pencil decomposition tasks communicate in multiple groups independent from each other Our routine includes this case Optimised all-to-all communication 9

10 Blocking communication Communication algorithm message size dependent Very small messages forwarded between nodes Small messages sent directly using the shared memory segment Large messages are sent with the original MPI implementation Cyclic scheduling is applied Throtteling is applied as done in the MPICH reference implementation Optimised all-to-all communication 10

11 Blocking communication A plan routine is used and a code generator encodes the algorithm The execution routine decodes the data MPI Isend, MPI Irecv, MPI Wait are used Enrichment of the set of opcodes possible All-to-allv is implemented without any performance penalty Optimised all-to-all communication 11

12 Non-blocking communication For overlapping of communication and computation as done for FFTs For no message forwarding and no throtteling one MPI Waitall present Split of the execution in two parts, before and after the waitall, for start and end of the communication Optimised all-to-all communication 12

13 GPU-direct Multiple MPI tasks might use one GPU Two implementations: small message copied between GPU and CPU and forwarded between nodes Large messages are bundled on the GPU with interprocess communication CUDA kernel which copies from many sourche buffers to many destination buffers Optimised all-to-all communication 13

14 Application programming interface Routines are written in ANSI C with different interfaces Native one which calles our algorithm with all its parameters Automatic one which choses the algorithm and its parameters Both options are available in C and Fortran C/C++ : C native interface int EXT_MPI_Alltoall_init_native ( void * sendbuf, int sendcount, MPI_Datatype sendtype, void * recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm_row, int cores_per_node_row, MPI_Comm comm_column, int cores_per_node_column, int num_ports, int num_active_ports, int chunks_throttle ); int EXT_MPI_Alltoall_exec_native ( int handle ); int EXT_MPI_Alltoall_done_native ( int handle ); Optimised all-to-all communication 14

15 Benchmarks communication time / s bytes optimised 10 bytes MPICH 100 bytes optimised 100 bytes MPICH 1000 bytes optimised 1000 bytes MPICH number of cores communication time, / s MPICH 12 cores no throttle 6 cores no throttle 6 cores throttle 4 cores no throttle 4 cores throttle 3 3 cores no throttle 3 cores throttle 2 3 cores throttle 4 2 cores no throttle 2 cores throttle 2 2 cores throttle 3 2 cores throttle message size, bytes standard all-to-all, 12 nodes, 12 cores per node standard all-to-all, variable tile size, 12 nodes, 12 cores per node Cray XC50, 12 CPU cores per node, 1 NVIDIA Tesla P100 Cray XC 40 KNL, 12 cores used per KNL Optimised all-to-all communication 15

16 Benchmarks communication time, / s communication time, / s MVAPICH optimised message size, bytes Cray MPICH optimised message size, bytes all-to-all on Piz Kesch, 5 nodes, 12 cores per node GPU-direct, 12 nodes, 12 cores per node Infiniband cluster Piz Kesch (Meteoswiss) GPU-direct on Piz Daint Optimised all-to-all communication 16

17 Applications: LATfield2 Spectral code with real-to-complex 3D FFTs and complex-to-real 3D FFTs High memory requirement small messages originating from FFTs relative speedup number of cores relative speedup for FFTs Optimised all-to-all communication 17

18 Applications: ORB5 Particle in cell code with toroidal domain MPI parallelisation in subdomains and clones Random variable message sizes for particle exchange between all subdomains (all-to-all communication) Typically all clones of a specific subdomain on a node application for our algorithm Optimised all-to-all communication 18

19 Discussion Implementation of the single all-to-all algorithm in the current MPI standard possible but would slowdown the communicator creation Multiple all-to-all possible with a graph communicator, also computationally expensive Suggestion: extension of the MPI standard with plan routines Optimised all-to-all communication 19

20 Conclusions Message passing algorithm for all-to-all communication for networks with shared memory nodes Cray MPICH outperformed by a factor of 2.9 for 10 bytes messages On infiniband cluster standard MVAPICH outperformed for small messages GPU implemenation advantegous for small and medium message size Application to FFTs (LATfield2) demonstrates its strength Optimised all-to-all communication 20

21 Thank you for your attention.

22 Benchmarks communication time, / s MPICH 4 12 active cores 6 active cores 4 active cores 2 3 active cores 2 active cores 1 active core message size, bytes communication time, / s MPICH 12 active cores no throttle 12 active cores throttle 50 6 active cores throttle 4 active cores throttle 3 active cores throttle 2 active cores throttle 1 active core throttle message size, bytes standard all-to-all, variation of active number of cores, 12 nodes, 12 cores per node standard all-to-all, 156 nodes, 12 cores per node Optimised all-to-all communication 21

23 Benchmarks 140 communication time, / s MPICH active cores 6 active cores 4 active cores 20 3 active cores 2 active cores 1 active core number of cores standard all-to-all, 1024 bytes, 12 cores per node, throttle Optimised all-to-all communication 22

24 Benchmarks communication time / s 1 MPICH (1000B) 12 cores active (1000B) 6 cores active (1000B) 6 cores active, throttling=2 (1000B) MPICH (100B) 12 cores active (100B) 6 cores active (100B) 6 cores active, throttling=2 (100B) MPICH (10B) 12 cores active (10B) 6 cores active (10B) 6 cores active, throttling=2 (10B) communication time / s bytes MPICH (1x12 tiles) 100 bytes optimised (1x12 tiles) 10 bytes MPICH (1x12 tiles) 10 bytes optimised (1x12 tiles) number of cores number of cores multiple all-to-all 3 4 cores multiple all-to-all 1 12 cores Optimised all-to-all communication 23