Understanding Communication and MPI on Cray XC40
- Kerry Jones
- 5 years ago
1 Understanding Communication and MPI on Cray XC40
2 Features of the Cray MPI library Cray MPI uses the MPICH3 distribution from Argonne, which provides a good, robust and feature-rich MPI with well-tested code for high-level features like MPI derived types. Cray provides enhancements on top of this: low-level communication libraries, point-to-point tuning and collective tuning. The shared memory device is built on top of Cray XPMEM. Cray MPI uses the NEMESIS module for SMP-aware communication: ranks on the same node communicate via shared memory rather than the Aries network, which gives higher performance. Supported under all Cray PrgEnv; the compiler/linker wrappers take care of including the correct headers and libraries. 2
3 Basics about communication 3
4 Costs of communication All parallel applications communicate data between individual processes unless they're embarrassingly parallel The cost of any communication is usually defined by two properties of the underlying network (or memory system) 1. Latency 2. Bandwidth 4
5 Costs of communication 1. Latency: the time from a message being sent to it reaching its destination. Dominates the performance of small messages. A combination of factors: constant software and hardware overheads, plus the physical and topological distance between the nodes (hops). 2. Bandwidth: the maximum rate at which data can flow over the network. Dominates the performance of larger messages. Bandwidth between nodes generally depends upon the number of possible paths between nodes on the network (topology). Can usually be tuned with a large enough budget. 5
6 How message size affects communication performance (As with all things) the decisions made by the application developer can affect the overall performance of the application. The size of messages sent between processes affects how important latency and bandwidth costs become. When a message is small the network latency is dominant, so it is advisable to bundle multiple small messages into fewer, larger messages to reduce the number of latency penalties. This is true for all closely coupled communication over any protocol, e.g. MPI, SHMEM, UPC, TCP/IP. 6
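The latency/bandwidth trade-off above can be sketched with a simple linear cost model, t = latency + bytes / bandwidth. The numbers used below are illustrative assumptions, not measured XC40 values:

```c
#include <assert.h>

/* Simple linear transfer cost model: t = latency + bytes / bandwidth.
 * Bandwidth is expressed in bytes per microsecond
 * (e.g. 10 GB/s == 10000 bytes/us). */
static double transfer_time_us(double n_bytes, double latency_us,
                               double bytes_per_us) {
    return latency_us + n_bytes / bytes_per_us;
}

/* Cost of `count` separate messages of `n_bytes` each:
 * the latency penalty is paid once per message. */
static double separate_us(int count, double n_bytes,
                          double latency_us, double bytes_per_us) {
    return count * transfer_time_us(n_bytes, latency_us, bytes_per_us);
}

/* Cost of one bundled message of count * n_bytes:
 * the latency penalty is paid only once. */
static double bundled_us(int count, double n_bytes,
                         double latency_us, double bytes_per_us) {
    return transfer_time_us(count * n_bytes, latency_us, bytes_per_us);
}
```

With 100 messages of 8 bytes each, an assumed 1.5 µs latency and 10 GB/s bandwidth, the separate sends cost about 150 µs while the single bundled send costs under 2 µs: the latency term dominates completely for small messages.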
7 Understanding inter- and intra-node performance The rise of multi-core has led to fat nodes being common. Five years ago there may have been one or two CPU cores per node; now nodes routinely have many more, and this will only increase in the future (e.g. Intel Xeon Phi). Codes usually have multiple MPI ranks per node: many (even most) codes are flat MPI rather than hybrid with, for instance, OpenMP threads, and even hybrid codes usually have more than one rank per node as threading does not usually scale well across NUMA regions (e.g. sockets). Latency and bandwidth are different for on- and off-node messages: messages between PEs on the same node (intra-node) will be faster; messages between PEs on different nodes (inter-node) will be slower. We can optimise application performance by maximising communication between processes on the same node. 7
8 [Chart: intra-node MPI ping-ping message performance; time (µs) and bandwidth (MB/s) against message size (bytes)] 8
9 [Chart: single-rank MPI inter-node message performance; time (µs) and bandwidth (MB/s) against message size (bytes)] 9
10 Some information about what the Aries HW can do and when it is not used 10
11 Inter-node transfer protocols Building ever-larger flat SMPs is expensive and doesn't scale. An XC system has a hybrid architecture: a set of SMP machines (nodes) linked into an MPP by the Cray Aries network. Messages to another node are sent via Aries by one of two possible methods; the choice depends on message size. Developers do not target these directly, but knowing about them helps you understand how to use MPI successfully. Possible Aries protocols: 1. Fast Memory Access (FMA) 2. Block Transfer Engine (BTE) 11
12 Comparing FMA and BTE Fast Memory Access (FMA) is used for small messages. Good features: lowest latency; more than one transfer can be active at the same time (multi-core). Bad features: synchronous, so the CPU is involved in the transfer. The Block Transfer Engine (BTE) is used for large messages. Good features: achieves the better point-to-point bandwidth in more cases; asynchronous, so the transfer is done by Aries independently of the CPU. Bad features: higher latency; transfers are queued if the BTE is busy. 12
13 Cray Aries and XPMEM RDMA capabilities Cray Aries (and Gemini) NICs can do RDMA operations. Remote Direct Memory Access can directly read from or write into another node's memory space (with certain limitations); the remote CPU does not have to be involved in the transfer, and the sending CPU only needs to initialise the transfer if the sending Aries uses its Block Transfer Engine (BTE). XPMEM provides a similar capability within a node: it is a feature of the Cray Linux Environment in which each PE shares part of its address space with other PEs on the same node. RDMA operations offer good potential for overlapping communication with computation. 13
14 Overlapping communication 14
15 Overlapping communication with computation The holy grail: do communication "in the background" while each PE does (separate) computation. The cost of communication is then almost nothing, save the overhead of initiating transfers and synchronisation. This relies on: having enough (independent) computation to hide the comms time; having the correct code structure to make this possible; using non-blocking communication calls, e.g. via RDMA. 15
16 Using RDMA to overlap communication and computation to hide costs Rather than sitting waiting for a communication operation to complete, applications can use asynchronous RDMA operations instead, e.g. putting some data into a remote PE's memory. The application can then continue with other useful computation until the checkpoint where the data is required, with the data transferred in the background. [Diagram: PE A and PE B each issue an RDMA put, continue computing, then meet at a checkpoint once the data has arrived.] 16
17 A short note on PGAS programming models To exploit RDMA directly, an application needs a programming model that exposes parts of its address space to other PEs (on or off the node): the Partitioned Global Address Space (PGAS) programming models. All memory (across all the PEs) is addressable, but it is not evenly accessible, and the addressing method recognises this: PE number plus local address. Languages: e.g. Fortran coarrays (CAF), Unified Parallel C (UPC), Coarray C++, Chapel. APIs: e.g. OpenSHMEM, GASPI; also MPI-3 RMA single-sided. Symmetric allocation: users declare portions of each PE's memory space symmetrically, e.g. (usually) using exactly the same addresses on each PE. This is done either automatically, using special language features (e.g. coarrays), or using collective allocation API calls (e.g. shmalloc). Get/put operations are only allowed between memory on the symmetric heap. 17
18 PGAS on Cray Aries The Cray Aries network's RDMA capabilities are highly suited to one-sided communications: low latency, high bandwidth, ideal for small-message transfers. PGAS and other single-sided models fit this hardware model very closely and have the potential to give the highest bandwidths and lowest latencies. Nonetheless, two-sided protocols, specifically MPI, are more popular: MPI has been around for longer as a de facto standard and is traditionally more portable. 18
19 Two-sided protocols Typically two-sided protocols like MPI are easier to use. The implicit synchronisation between PEs makes it easier to write programs that are not as vulnerable to race conditions. They allow data to be sent or received into or from any part of the PE's address space, and messages can be matched (or not) via tags or by the source PE (MPI_ANY_SOURCE, MPI_ANY_TAG). However, this additional flexibility requires the cooperation of the Intel Xeon CPU to perform many of these tasks. This means communication may wait until the CPU enters an MPI call. Overheads caused by the MPI standard may increase latency and reduce effective bandwidth! 19
20 Overlapping communication and computation The MPI API provides many functions that allow point-to-point messages (and, with MPI-3, collectives) to be performed asynchronously. Ideally applications would be able to overlap communication and computation, hiding all data transfer behind useful computation: each rank posts MPI_Irecv and MPI_Isend, computes while data is transferred in the background, then calls MPI_Waitall. Unfortunately this is not always possible at the application level and not always possible at the implementation level. 20
21 What prevents overlap? Overlapping computation and communication is not always possible, even if the library has asynchronous API calls and the application has enough computation to allow overlap. The usual reason: the sending PE does not know where to put messages on the destination; this is part of the MPI_Recv, not the MPI_Send. Also, on Gemini and Aries the host CPU is required: complex tasks, e.g. matching message tags between sender and receiver, are performed on the CPU. This allows the NICs to have a higher clockspeed (better latency and bandwidth), but messages can only "progress" when the program is in MPI, i.e. within an MPI library function or subroutine. 21
22 MPI messaging protocols To understand when overlap is or isn't possible we need to understand how MPI actually sends messages. There are two different protocols; the choice depends on message size. 1. Eager messaging: used for small messages; offers good potential for overlap. 2. Rendezvous messaging: used for large messages; does not usually overlap (without a progress engine). These map (loosely) onto the Cray Aries FMA and BTE methods, though the finer details can be quite complicated. 22
23 EAGER messaging: buffering small messages Smaller messages can avoid this problem using the eager protocol. If the sender does not know where the message will end up, the data is pushed into pre-allocated buffers on the receiver, where it waits until the receiver is ready to take it. When MPI_Recv is called the library fetches the message data from the buffer into the appropriate location (or potentially a local buffer). The sender can proceed as soon as the data has been copied to the buffer, but will block if there are no free buffers. 23
24 EAGER potentially allows overlapping Data is pushed into empty buffer(s) on the remote processor, then copied from the buffer into the real receive destination when the wait or waitall is called. This involves an extra memcpy, but gives a much greater opportunity for overlap of computation and communication. [Diagram: ranks A and B each post MPI_Irecv and MPI_Isend, compute, then call MPI_Waitall.] 24
25 RENDEZVOUS messaging: larger messages Larger messages (that are too big to fit in the buffers) are sent via the rendezvous protocol. Messages cannot begin to transfer until MPI_Recv is called by the receiver; the data is then pulled from the sender by the receiver. The sender must wait for the data to be copied to the receiver before continuing; sender and receiver block until communication is finished. 25
26 RENDEZVOUS does not usually overlap With rendezvous, data transfer often only occurs during the Wait or Waitall statement. When the message arrives at the destination, the host CPU is busy doing computation, so it is unable to do any message matching. Control only returns to the library when MPI_Waitall occurs, and does not return until all data is transferred. There has been no overlap of computation and communication. 26
27 Making more messages EAGER One way to improve performance is to send more messages via the eager protocol, giving potentially more overlap. Do this by raising the value of the eager threshold: set an environment variable in the jobscript, export MPICH_GNI_MAX_EAGER_MSG_SIZE=<value>. The value is in bytes; the default is 8192 bytes and the maximum is 128 KB. When might this help? If MPI takes a significant time in the profile and you have a lot of messages between 8 kB and 128 kB; CrayPAT MPI tracing can tell you this. Also try to post the MPI_Irecv call before the MPI_Isend call, which can avoid unnecessary buffer copies. 27
28 Consequences of more EAGER messages This places more demands on the buffers on the receiver. If the buffers are full, the transfer will wait until space is available, or until the MPI_Wait. The number of buffers can be increased at runtime: export MPICH_GNI_NUM_BUFS=<number>; the default number is 64, and buffers are 32 kB each (a total of 2 MB). Buffer memory space competes with application memory, so we recommend only moderate increases. 28
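The buffer-memory arithmetic above is easy to check. A minimal sketch (the function name is ours for illustration, not a Cray API):

```c
#include <assert.h>

/* Each internal eager buffer is 32 KiB; MPICH_GNI_NUM_BUFS controls
 * how many are allocated per rank (default 64, i.e. 2 MiB). */
static long eager_buffer_bytes(int num_bufs) {
    return (long)num_bufs * 32 * 1024;
}
```

So the default of 64 buffers claims 2 MiB per rank, and doubling MPICH_GNI_NUM_BUFS to 128 would claim 4 MiB per rank: memory that is no longer available to the application, which is why only moderate increases are recommended.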
29 The MPI progress engine 29
30 Progress threads help overlap Cray's MPT library can spawn additional threads that allow messages to progress while computation occurs in the background. The thread performs message matching and initiates the transfer, so the data has already arrived by the time Waitall is called, giving overlap between compute and communication. [Diagram: ranks A and B post MPI_Irecv and MPI_Isend; helper threads move the data while the ranks compute, before MPI_Waitall.] 30
31 MPI async progress engine support Used to improve communication/computation overlap. Each MPI rank starts a helper thread during MPI_Init; the helper threads progress the MPI engine in the background while the application computes. This only works for messages using the BTE: inter-node messages using the rendezvous path (i.e. large messages). 31
32 Using the progress engine To enable it on XC, set two environment variables in the jobscript: export MPICH_NEMESIS_ASYNC_PROGRESS=1 and export MPICH_MAX_THREAD_SAFETY=multiple. You also need somewhere for the progress engine threads to run. If you are running with aprun -j1, all the (second) hyperthreads are spare and the progress engine threads will use these by default, so you don't need to do anything else. If you are running with aprun -j2 you need to reserve one (or more) hyperthreads for progress using the -r core specialisation flag, e.g. aprun -n XX -N63 -r1 ./a.out or aprun -n XX -N62 -r2 ./a.out. 32
33 Will the progress engine help? For codes that spend a lot of time on large-message transfers and use non-blocking MPI calls: yes, it can help; performance improvements of 10% or more have been seen with some apps. Why might it not help (even with slow, large-message transfers and non-blocking MPI)? Possible reasons include: MPICH_MAX_THREAD_SAFETY=multiple has performance implications, since thread safety means more locking in the library; core specialisation means fewer user processes per node, so less computational power per node and a reduced amount of intra-node MPI messaging. 33
34 Improving performance of MPI collectives 34
35 Using optimised MPI collectives on a Cray DMAPP is the low-level communication API on Cray systems, used "under the hood" by MPI, SHMEM, CAF, UPC... Some MPI collectives have optimised DMAPP versions (MPI_Allreduce, MPI_Barrier, MPI_Alltoall) which are not used by default. To use DMAPP collectives, users must manually add the library to the link line using -ldmapp and set an environment variable in the jobscript: export MPICH_USE_DMAPP_COLL=1. Cray Aries also offers some accelerated collective ops (barriers, single-word Allreduce calls), enabled using an environment variable in the jobscript: export MPICH_DMAPP_HW_CE=1. 35
36 Other techniques for collectives (1) Cray MPICH uses optimised versions of MPI collectives; in most cases these are the correct thing to use. If they are not suitable for a particular application, you can switch them off using environment variables in the jobscript: export MPICH_COLL_OPT_OFF=<collective name>, e.g. MPICH_COLL_OPT_OFF=mpi_allgather. When would you try this? If MPI collectives take a significant part of your profile, and if a particular collective takes a "surprising" amount of time, e.g. compared to runs on a different architecture. 36
37 Other techniques for collectives (2) Cray MPICH uses various algorithms for all-to-all routines: allgather(v), alltoall(v). Library-internal decisions on which to use are based on the number of ranks on the calling communicator and the message sizes. You can change this decision using environment variables in the jobscript: MPICH_XXXX_VSHORT_MSG, where XXXX is a collective name, e.g. ALLGATHER. When might you try this? E.g. if ALLGATHER suddenly becomes very important for a small change in problem size. 37
38 Issue: expensive collectives The implementation of collectives based on the DMAPP layer will (usually significantly) improve the performance of the expensive collectives. It is enabled by the variable MPICH_USE_DMAPP_COLL, which can also be used selectively, e.g. export MPICH_USE_DMAPP_COLL=mpi_allreduce; consult man mpi. XC Hardware Collective Engine (CE): XC supports hardware offload of Barrier and Allreduce collectives. Invoke these via the MPICH_USE_DMAPP_COLL and MPICH_DMAPP_HW_CE environment variables; you must also link libdmapp into your application (see man mpi). 38
39 [Chart: CE test, 8-byte MPI_Allreduce latency in microseconds (16 processes/node) against number of MPI processes, comparing MPICH2 software, DMAPP software and the Aries Collective Engine (DMAPP hardware CE); Cray XC system (CSCS daint)] 39
40 Huge pages The Aries NIC performs better with huge pages than with 4K pages: the Aries can map more pages using fewer resources, meaning communications may be faster. The cray-mpich library will map its buffers into huge pages by default. The size of the huge pages used is controlled by export MPICH_GNI_HUGEPAGE_SIZE=<size>, where <size> can be 2M, 4M, 16M, 32M, 64M, 128M or 256M; the default is 2M. 40
41 Miscellaneous useful flags Performance enhancements: export MPICH_COLL_SYNC=1 adds a barrier before collectives; use this if CrayPAT makes your code run faster. Reporting: export MPICH_CPUMASK_DISPLAY=1 shows the binding of each MPI rank by core and hostname. export MPICH_ENV_DISPLAY=1 prints the value of all MPI environment variables at runtime (STDERR). export MPICH_MPIIO_STATS=1 prints some MPI-IO stats useful for optimisation (STDERR). export MPICH_RANK_REORDER_DISPLAY=1 prints the node that each rank is residing on, useful for checking MPICH_RANK_REORDER_METHOD results. export MPICH_VERSION_DISPLAY=1 displays library version and build information. For more information: man mpi. 41
42 How can I make my MPI faster? Some hints. Runtime options: try to maximise on-node transfers (rank reordering); try using optimised collectives, or DMAPP collectives (relink needed). Help the MPI library get better overlap: use non-blocking MPI calls (MPI_Isend, MPI_Irecv, MPI_Iallgather...). Small messages use the EAGER protocol, with good overlap potential; try to send more data using the small-message EAGER method and consider raising the EAGER threshold. Larger messages use the RENDEZVOUS protocol; post non-blocking receives as early as possible and consider using the asynchronous progress thread. Try to reorder code to give more potential for overlap: local computation (or I/O) that can be done while messages transfer. Perhaps consider adding some PGAS. 42
43 Additional, detailed slides
44 Day in the life of an MPI message There are four main pathways through the MPICH2 GNI NetMod. Two EAGER paths (E0 and E1): E0 for a message that can fit in a GNI SMSG mailbox; E1 for a message that can't fit into a mailbox but is less than MPICH_GNI_MAX_EAGER_MSG_SIZE in length. Two RENDEZVOUS (aka LMT) paths: R0 (RDMA get) and R1 (RDMA put). The selected pathway is based on message size: E0 up to MPICH_GNI_MAX_VSHORT_MSG_SIZE, E1 up to MPICH_GNI_MAX_EAGER_MSG_SIZE, R0 up to MPICH_GNI_NDREG_MAXSIZE, and R1 above that. 44
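The size-based selection can be summarised as a small decision function. This is a hedged sketch: the thresholds are passed in explicitly because the real values vary with job size and with the MPICH_GNI_* environment variables named on the slide.

```c
#include <assert.h>
#include <string.h>

/* Choose the GNI NetMod pathway for a message of `bytes` bytes,
 * given the three size thresholds described on the slide. */
static const char *gni_pathway(long bytes,
                               long max_vshort, /* MPICH_GNI_MAX_VSHORT_MSG_SIZE */
                               long max_eager,  /* MPICH_GNI_MAX_EAGER_MSG_SIZE  */
                               long ndreg_max)  /* MPICH_GNI_NDREG_MAXSIZE       */
{
    if (bytes <= max_vshort) return "E0"; /* fits in an SMSG mailbox    */
    if (bytes <= max_eager)  return "E1"; /* copied via MPI buffers     */
    if (bytes <= ndreg_max)  return "R0"; /* rendezvous, RDMA get       */
    return "R1";                          /* rendezvous, chunked put    */
}
```

For example, with illustrative defaults of an 8152-byte mailbox, an 8192-byte eager threshold and a 4 MB NDREG limit, a 512-byte message takes E0, a 16 kB message takes R0, and an 8 MB message takes R1.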
45 Day in the life of message type E0: EAGER messages that fit in the GNI SMSG mailbox. 1. The sender does a GNI SMSG send (MPI header + user data) into the receiver's SMSG mailbox. 2. The receiver memcpys the data out. The GNI SMSG mailbox size changes with the number of ranks in the job. If the user data is 16 bytes or less, it is copied into the MPI header. 45
46 Day in the life of message type E1: EAGER messages that don't fit in the GNI SMSG mailbox. 1. The sender memcpys the data into pre-allocated MPI buffers. 2. The sender does a GNI SMSG send of the MPI header. 4. The receiver does a GNI SMSG send (recv done). 5. Memcpy into the receive destination. User data is copied into internal MPI buffers on both the send and receive side. MPICH_GNI_NUM_BUFS: default 64 buffers, each 32K. 46
47 EAGER message protocol This is the protocol for messages that can fit into a GNI SMSG mailbox. The default mailbox size varies with the number of ranks in the job, although this can be tuned by the user to some extent.

Ranks in job              Max user data (MPT 5.3)   MPT 5.4 and later
<= 512 ranks              984 bytes                 8152 bytes
> 512 and <= 1024 ranks   -                         2008 bytes
> 1024 ranks              -                         472 bytes
largest jobs              216 bytes                 216 bytes

47
48 Day in the life of message type R0: rendezvous messages using RDMA get. 1. The sender registers the application send buffer. 2. The sender does a GNI SMSG send of the MPI header. 3. The receiver registers the application receive buffer. 5. The receiver does a GNI SMSG send (recv done). There are no extra data copies, giving the best chance of overlapping communication with computation. 48
49 Day in the life of message type R1: rendezvous messages using RDMA put. 1. The sender does a GNI SMSG send of the MPI header. 2. The receiver registers a chunk of the application receive buffer. 3. The receiver does a GNI SMSG send (CTS msg). 4. The sender registers a chunk of the application send buffer. 5. RDMA PUT. 6. GNI SMSG send (send done). Steps 2-6 are repeated until all the sender's data is transferred. The chunksize is MPI_GNI_MAX_NDREG_SIZE (default of 4MB). 49
PGAS languages The facts, the myths and the requirements Dr Michèle Weiland m.weiland@epcc.ed.ac.uk What is PGAS? a model, not a language! based on principle of partitioned global address space many different
More informationThe Cray Programming Environment. An Introduction
The Cray Programming Environment An Introduction Vision Cray systems are designed to be High Productivity as well as High Performance Computers The Cray Programming Environment (PE) provides a simple consistent
More informationLS-DYNA Scalability Analysis on Cray Supercomputers
13 th International LS-DYNA Users Conference Session: Computing Technology LS-DYNA Scalability Analysis on Cray Supercomputers Ting-Ting Zhu Cray Inc. Jason Wang LSTC Abstract For the automotive industry,
More informationProgramming Environment 4/11/2015
Programming Environment 4/11/2015 1 Vision Cray systems are designed to be High Productivity as well as High Performance Computers The Cray Programming Environment (PE) provides a simple consistent interface
More informationManaging Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications
Managing Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications Geir Johansen, Cray Inc. ABSTRACT: The Cray XT implementation of MPI provides configurable runtime environment variables
More informationCSE 160 Lecture 15. Message Passing
CSE 160 Lecture 15 Message Passing Announcements 2013 Scott B. Baden / CSE 160 / Fall 2013 2 Message passing Today s lecture The Message Passing Interface - MPI A first MPI Application The Trapezoidal
More informationHow to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationManaging Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications Geir Johansen
Managing Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications Geir Johansen May 5, 2008 Cray Inc. Proprietary Slide 1 Goals of the Presentation Provide users an overview of the
More informationCompiling applications for the Cray XC
Compiling applications for the Cray XC Compiler Driver Wrappers (1) All applications that will run in parallel on the Cray XC should be compiled with the standard language wrappers. The compiler drivers
More informationARCHER Single Node Optimisation
ARCHER Single Node Optimisation Profiling Slides contributed by Cray and EPCC What is profiling? Analysing your code to find out the proportion of execution time spent in different routines. Essential
More informationHow to Use MPI on the Cray XT. Jason Beech-Brandt Kevin Roy Cray UK
How to Use MPI on the Cray XT Jason Beech-Brandt Kevin Roy Cray UK Outline XT MPI implementation overview Using MPI on the XT Recently added performance improvements Additional Documentation 20/09/07 HECToR
More informationProgramming with MPI
Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren nmm1@cam.ac.uk March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous set of practical points Over--simplifies
More informationMPI - Today and Tomorrow
MPI - Today and Tomorrow ScicomP 9 - Bologna, Italy Dick Treumann - MPI Development The material presented represents a mix of experimentation, prototyping and development. While topics discussed may appear
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationIntroduction to parallel computing concepts and technics
Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing
More informationPROGRAMMING MODEL EXAMPLES
( Cray Inc 2015) PROGRAMMING MODEL EXAMPLES DEMONSTRATION EXAMPLES OF VARIOUS PROGRAMMING MODELS OVERVIEW Building an application to use multiple processors (cores, cpus, nodes) can be done in various
More informationLiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster
LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. W. Jin, S. Sur, L. Chai, and D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering
More informationThe Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing
The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task
More informationMPI Message Passing Interface
MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information
More informationParallel Programming with Coarray Fortran
Parallel Programming with Coarray Fortran SC10 Tutorial, November 15 th 2010 David Henty, Alan Simpson (EPCC) Harvey Richardson, Bill Long, Nathan Wichmann (Cray) Tutorial Overview The Fortran Programming
More informationLecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality)
COMP 322: Fundamentals of Parallel Programming Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) Mack Joyner and Zoran Budimlić {mjoyner,
More informationPortable, MPI-Interoperable! Coarray Fortran
Portable, MPI-Interoperable! Coarray Fortran Chaoran Yang, 1 Wesley Bland, 2! John Mellor-Crummey, 1 Pavan Balaji 2 1 Department of Computer Science! Rice University! Houston, TX 2 Mathematics and Computer
More informationEnabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters
Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationIntroduction to MPI. Branislav Jansík
Introduction to MPI Branislav Jansík Resources https://computing.llnl.gov/tutorials/mpi/ http://www.mpi-forum.org/ https://www.open-mpi.org/doc/ Serial What is parallel computing Parallel What is MPI?
More informationADVANCED PGAS CENTRIC USAGE OF THE OPENFABRICS INTERFACE
13 th ANNUAL WORKSHOP 2017 ADVANCED PGAS CENTRIC USAGE OF THE OPENFABRICS INTERFACE Erik Paulson, Kayla Seager, Sayantan Sur, James Dinan, Dave Ozog: Intel Corporation Collaborators: Howard Pritchard:
More informationUsing Lamport s Logical Clocks
Fast Classification of MPI Applications Using Lamport s Logical Clocks Zhou Tong, Scott Pakin, Michael Lang, Xin Yuan Florida State University Los Alamos National Laboratory 1 Motivation Conventional trace-based
More informationOne-Sided Append: A New Communication Paradigm For PGAS Models
One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class
More informationRDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits
RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation
More informationAdvanced Computer Networks. End Host Optimization
Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationHIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS
HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access
More informationDistributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello
Distributed recovery for senddeterministic HPC applications Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello 1 Fault-tolerance in HPC applications Number of cores on one CPU and
More informationPerformance Evaluation of MPI on Cray XC40 Xeon Phi Systems
Performance Evaluation of MPI on Cray XC0 Xeon Phi Systems ABSTRACT Scott Parker Argonne National Laboratory sparker@anl.gov Kevin Harms Argonne National Laboratory harms@anl.gov The scale and complexity
More informationSayantan Sur, Intel. ExaComm Workshop held in conjunction with ISC 2018
Sayantan Sur, Intel ExaComm Workshop held in conjunction with ISC 2018 Legal Disclaimer & Optimization Notice Software and workloads used in performance tests may have been optimized for performance only
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationIntel Xeon Phi Coprocessor
Intel Xeon Phi Coprocessor A guide to using it on the Cray XC40 Terminology Warning: may also be referred to as MIC or KNC in what follows! What are Intel Xeon Phi Coprocessors? Hardware designed to accelerate
More informationUCX: An Open Source Framework for HPC Network APIs and Beyond
UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation
More informationEnabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER MPI-3.0 REMOTE MEMORY ACCESS MPI-3.0 supports RMA ( MPI One Sided ) Designed
More informationLeveraging the Cray Linux Environment Core Specialization Feature to Realize MPI Asynchronous Progress on Cray XE Systems
PROCEEDINGS OF THE CRAY USER GROUP, 2012 1 Leveraging the Cray Linux Environment Core Specialization Feature to Realize MPI Asynchronous Progress on Cray XE Systems Howard Pritchard, Duncan Roweth, David
More informationDocument Classification
Document Classification Introduction Search engine on web Search directories, subdirectories for documents Search for documents with extensions.html,.txt, and.tex Using a dictionary of key words, create
More informationShared Memory & Message Passing Programming on SCI-Connected Clusters
Shared Memory & Message Passing Programming on SCI-Connected Clusters Joachim Worringen, RWTH Aachen SCI Summer School 2000 Trinitiy College Dublin Agenda How to utilize SCI-Connected Clusters SMI Library
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationParallel and High Performance Computing CSE 745
Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel
More informationMyths and reality of communication/computation overlap in MPI applications
Myths and reality of communication/computation overlap in MPI applications Alessandro Fanfarillo National Center for Atmospheric Research Boulder, Colorado, USA elfanfa@ucar.edu Oct 12th, 2017 (elfanfa@ucar.edu)
More informationAssessment of LS-DYNA Scalability Performance on Cray XD1
5 th European LS-DYNA Users Conference Computing Technology (2) Assessment of LS-DYNA Scalability Performance on Cray Author: Ting-Ting Zhu, Cray Inc. Correspondence: Telephone: 651-65-987 Fax: 651-65-9123
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationHybrid MPI - A Case Study on the Xeon Phi Platform
Hybrid MPI - A Case Study on the Xeon Phi Platform Udayanga Wickramasinghe Center for Research on Extreme Scale Technologies (CREST) Indiana University Greg Bronevetsky Lawrence Livermore National Laboratory
More informationNoise Injection Techniques to Expose Subtle and Unintended Message Races
Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau
More informationLAPI on HPS Evaluating Federation
LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationMOVING FORWARD WITH FABRIC INTERFACES
14th ANNUAL WORKSHOP 2018 MOVING FORWARD WITH FABRIC INTERFACES Sean Hefty, OFIWG co-chair Intel Corporation April, 2018 USING THE PAST TO PREDICT THE FUTURE OFI Provider Infrastructure OFI API Exploration
More information