Optimizing communications on clusters of multicores
1 Optimizing communications on clusters of multicores Alexandre DENIS with contributions from: Elisabeth Brunet, Brice Goglin, Guillaume Mercier, François Trahay Runtime Project-team INRIA Bordeaux - Sud-Ouest INRIA-Illinois workshop Paris, June 2009
2 The RUNTIME Team Mid-size research group headed by Raymond Namyst: 6 permanent researchers, 2 engineers, 6 PhD students. Part of the INRIA Bordeaux - Sud-Ouest Research Center and of the Computer Science Lab of University of Bordeaux 1 (LaBRI).
3 Current trends The world is going multicore. Currently: cores per node; tomorrow: hundreds of cores per node; sometimes: multiple NICs per node. New programming models: pure-MPI approach (one MPI process per core) and hybrid approach (MPI + threads/OpenMP, one MPI process per node/socket).
4 A thread-aware approach for communications Impact of multi-core on communications: the communication library must support multi-threading, since multiple flows from multiple threads share a NIC. Decoupling MPI_Send/Recv from NIC send/recv opens new opportunities for optimization, e.g. aggregating packets from different threads. The best optimization strategy depends on the capabilities and performance of the underlying network, the host architecture, and application requirements. Opportunistically use idle cores; decide optimization at run time.
5 Implementing data transfers efficiently [Plot: transfer time vs. message size for three transfer methods: PIO + copy, DMA + copy, DMA + rendezvous]
6 Implementing data transfers efficiently [Same plot as slide 5]
7 Implementing data transfers efficiently Sending the 3 chunks sequentially costs t1 + t2 + t3 (if there are no pipelining effects) [Plot: chunks 1, 2 and 3 shown as three separate transfers of times t1, t2, t3 on the transfer-time curve]
8 Implementing data transfers efficiently This second strategy is better if t1 + t3 > t4 + k * (sizeof(chunk 1) + sizeof(chunk 3)) [Plot: chunks 1 and 3 aggregated into a single transfer of time t4, chunk 2 sent separately]
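To make the trade-off concrete, here is a minimal sketch in C of the decision the scheduler has to make: aggregate chunks 1 and 3 only when the packing copy costs less than the transfer it saves. All numbers are illustrative assumptions, not measurements from the talk.

/* Minimal sketch of the aggregation trade-off; numbers are illustrative. */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    double t1 = 4.0, t3 = 5.0;   /* microseconds to send chunks 1 and 3 separately */
    double t4 = 6.0;             /* microseconds to send the aggregated chunk 1+3 */
    double k  = 0.0005;          /* microseconds per byte for the packing copy */
    size_t size1 = 2048, size3 = 4096;

    double sequential = t1 + t3;
    double aggregated = t4 + k * (double)(size1 + size3);

    /* Aggregate only when t1 + t3 > t4 + k * (sizeof(chunk 1) + sizeof(chunk 3)) */
    printf("sequential = %.2f us, aggregated = %.2f us -> %s\n",
           sequential, aggregated,
           aggregated < sequential ? "aggregate chunks 1 and 3" : "send chunks separately");
    return 0;
}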
9 NewMadeleine architecture: the interface layer The application submits requests as non-blocking operations; meta-data (reordering constraints: tag, sequence number, etc.) are associated to the data. User communication interfaces: a send-receive interface, an incremental message building (pack/send) interface, and a simple MPI implementation, Mad-MPI. [Diagram: the interfaces feed requests to the NewMadeleine optimizing scheduler and its policy, which drives the NICs]
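To illustrate what an incremental message-building interface looks like from the application side, here is a hypothetical sketch. The nm_example_* names are invented for illustration and are not the actual NewMadeleine API; the stubs only print what a real implementation would hand to the optimizing scheduler.

/* Hypothetical sketch of an incremental message-building interface. */
#include <stdio.h>
#include <stddef.h>

typedef struct { int dest; int tag; int nchunks; } nm_example_cnx_t;

static int nm_example_begin_pack(nm_example_cnx_t *cnx, int dest, int tag)
{
    cnx->dest = dest; cnx->tag = tag; cnx->nchunks = 0;
    return 0;
}

static int nm_example_pack(nm_example_cnx_t *cnx, const void *data, size_t len)
{
    (void)data;
    printf("chunk %d: %zu bytes queued for dest %d, tag %d\n",
           ++cnx->nchunks, len, cnx->dest, cnx->tag);
    return 0;   /* a real library would hand the chunk to the scheduler here */
}

static int nm_example_end_pack(nm_example_cnx_t *cnx)
{
    printf("message of %d chunks submitted (non-blocking)\n", cnx->nchunks);
    return 0;
}

int main(void)
{
    char header[64] = {0}, payload[4096] = {0};
    nm_example_cnx_t cnx;

    /* Build one logical message incrementally from two buffers; the
     * scheduler may aggregate or split the resulting network requests
     * as long as tag/sequence-number ordering is respected. */
    nm_example_begin_pack(&cnx, /*dest=*/1, /*tag=*/42);
    nm_example_pack(&cnx, header, sizeof header);
    nm_example_pack(&cnx, payload, sizeof payload);
    nm_example_end_pack(&cnx);
    return 0;
}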
10 NewMadeleine architecture: the network layer NewMadeleine activity is triggered by the NIC: while the NIC is busy, application requests are gathered; when the NIC becomes idle, the scheduler is invoked. The scheduler analyses the user request backlog, applies a strategy, and synthesizes a network request. Available drivers: Myrinet MX, InfiniBand verbs, Quadrics Elan, SCI SISCI, TCP sockets. [Diagram: pack/send requests flow through the NewMadeleine optimizing scheduler and its policy down to the NICs]
11 NewMadeleine architecture: strategies A strategy is invoked when a NIC becomes idle and can be combined with any network(s). Currently available: default (raw transmission), aggregation (aggressive aggregation of small packets), multirail (split packets over multiple networks), QoS (priority-based scheduling). [Diagram: the core scheduler applies the selected strategy and its tactics to pack/send requests before handing them to the NICs]
12 Mixing network and threads Multiple levels of thread support in the communication subsystem. Level 1: thread safety, allowing simultaneous access to the library. Level 2: background progression of communication, making communications (rendezvous, non-blocking operations) progress in the background. Level 3: parallel processing, using several cores to process communication.
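For reference, the thread-support level an application needs is negotiated at MPI startup. A minimal example using only standard MPI calls (nothing NewMadeleine-specific):

/* Minimal example of requesting full thread support from an MPI implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for MPI_THREAD_MULTIPLE: several threads may call MPI concurrently.
     * The library must at least serialize such calls safely (level 1 above). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: only thread level %d provided\n", provided);

    /* ... application threads may now issue MPI calls ... */

    MPI_Finalize();
    return 0;
}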
13 Ensuring thread-safety Level 1: thread safety, coarse grain. A library-wide mutex avoids simultaneous access to the library; overhead = 140 ns (negligible). [Diagram: MPI_Irecv and MPI_Test issued concurrently from two CPUs]
14 Ensuring thread-safety Level 1: thread safety, coarse grain vs. fine grain. Coarse grain: a library-wide mutex avoids simultaneous access to the library; overhead = 140 ns. Fine grain: action-wide mutexes provide local thread-safety and allow simultaneous access to the library; overhead = 230 ns. [Diagram: MPI_Irecv and MPI_Test issued concurrently from two CPUs]
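The two granularities can be pictured with plain pthreads. This is a generic sketch of the idea, not the library's actual locking code; the 140 ns / 230 ns figures are the overheads quoted on the slide.

/* Generic sketch of coarse-grain vs. fine-grain locking with pthreads. */
#include <pthread.h>

/* Coarse grain: one library-wide mutex serializes every entry point. */
static pthread_mutex_t library_lock = PTHREAD_MUTEX_INITIALIZER;

void lib_call_coarse(void (*action)(void *), void *arg)
{
    pthread_mutex_lock(&library_lock);   /* ~140 ns of overhead per call */
    action(arg);
    pthread_mutex_unlock(&library_lock);
}

/* Fine grain: each request carries its own mutex, so independent requests
 * (e.g. MPI_Irecv on one core, MPI_Test on another) can enter the library
 * simultaneously. */
struct request {
    pthread_mutex_t lock;                /* action-wide mutex (~230 ns overhead);
                                            initialize with pthread_mutex_init() */
    int completed;
};

void request_test(struct request *req)
{
    pthread_mutex_lock(&req->lock);
    /* ... poll the NIC, update req->completed ... */
    pthread_mutex_unlock(&req->lock);
}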
15 Background processing Level 2: background processing. A progression thread per NIC handles rendezvous handshake progression (as in MX/Myrinet and Open MPI/TCP). Drawback: priority issues on overloaded systems. Overhead: 400 ns (inter-core, same-chip synchronization) to 2-3 µs (inter-chip synchronization), depending on thread placement. [Diagram: MPI_Irecv and MPI_Test issued from two CPUs]
16 Parallel processing of communication flows Level 3: benefit from multi-core with the PIOMan communication manager. Communication processing is seen as a sequence of operations (tasklets); operations may be scheduled on any core through hooks in the thread scheduler (idle, timer, context switch). Load balancing of processing: idle cores 'help' working cores, and asynchronous operations are offloaded. No priority issue, reduced overhead (400 ns), controlled placement, fewer context switches.
17 Offloading small messages Sending small messages consumes CPU: a memcopy or PIO may monopolize a CPU for dozens of µs. Even an MPI_Isend can be split into basic operations: (a) register the MPI request, (b) submit the packet to the NIC. These operations are spread over the cores while the caller computes until MPI_Wait (a sketch of this splitting follows below).
18 Offloading small messages [Same content as slide 17, animation step showing the operations spread over core 1 and core 2]
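A generic sketch of the splitting idea with pthreads: step (a) registers the request, step (b) is handed to a worker running on another core so the caller can return to computation. This only illustrates the principle; it is not PIOMan's implementation.

/* Sketch: offloading the "submit to NIC" half of a non-blocking send. */
#include <pthread.h>
#include <stdio.h>

struct send_op { const void *buf; int len; int submitted; };

static struct send_op pending;             /* one-slot queue, for illustration */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static int q_full = 0;

static void *idle_core_worker(void *arg)   /* runs on an otherwise idle core */
{
    (void)arg;
    pthread_mutex_lock(&q_lock);
    while (!q_full)
        pthread_cond_wait(&q_cond, &q_lock);
    /* Step (b): submit the packet to the NIC (memcopy/PIO would happen here). */
    pending.submitted = 1;
    printf("worker: submitted %d bytes to the NIC\n", pending.len);
    pthread_mutex_unlock(&q_lock);
    return NULL;
}

static void my_isend(const void *buf, int len)
{
    /* Step (a): register the request, then hand step (b) to the worker
     * so this core can go back to computation immediately. */
    pthread_mutex_lock(&q_lock);
    pending.buf = buf; pending.len = len; pending.submitted = 0;
    q_full = 1;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

int main(void)
{
    pthread_t worker;
    char msg[256] = {0};

    pthread_create(&worker, NULL, idle_core_worker, NULL);
    my_isend(msg, sizeof msg);
    /* ... computation overlaps with the submission ... */
    pthread_join(worker, NULL);
    return 0;
}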
19 The PIOMan communication engine PIOMan, the PM2 I/O event manager: a thread-aware I/O event manager that offloads message submissions onto idle cores when available, uses interrupts and/or polling transparently depending on system load, and makes communication progress asynchronously. No thread is needed in the application, and there is no priority issue even on overloaded systems. Well integrated with the Marcel thread scheduler. [Diagram: the MPI interface and NewMadeleine call test_event()/wait_event() on the PIOMan I/O manager, which calls NIC_test()/NIC_wait() on the NIC and cooperates with the Marcel thread scheduler]
20 Bridging the gap with standard APIs Mad-MPI: a light MPI implementation on top of NewMadeleine, bringing the performance of NewMadeleine to simple MPI applications. NEMESIS/NewMadeleine: a new architecture for MPICH2, towards a powerful optimization engine for MPI. [Diagram: the MPICH2 stack — MPI, ADI3 interface, devices (CH3, BG/L, Quadrics), CH3 channels (sock, shm, Nemesis), with Nemesis handling intra-node communication and NewMad handling inter-node communication]
21 MPICH2 over NewMadeleine The preliminary version was straightforward to implement: MPI_xxx point-to-point communications are (almost) directly mapped onto nmad_xxx, and NEMESIS is used for intra-node (shared memory) communication (joint work with Argonne NL). Published at IPDPS 2009. Performance is good: low latency overhead, and bandwidth is the same. Scientific challenges: optimization of MPI datatype management, support for collective operations within NewMadeleine, full support of dynamicity over heterogeneous (or multi-rail) configurations.
22 Experimental testbed For point-to-point experiments: two dual quad-core nodes with Intel Xeon 3.16 GHz, 4 GB of memory per node, one Myrinet 10G NIC and one ConnectX InfiniBand HCA. For the NAS parallel benchmarks: ten quad dual-core nodes with AMD Opteron 2.6 GHz, 32 GB of memory per node, one InfiniBand 10G HCA.
23 PIOMan's impact on latency performance [Two plots: latency (microseconds) vs. MPI message size (bytes); left panel: shared memory, comparing MPICH2-Nemesis, MPICH2-Nemesis/PIOMan and Open MPI; right panel: Myrinet 10G, comparing MPICH2-NewMad, MPICH2-NewMad (PIOMan), Open MPI (BTL) and Open MPI (PML)]
24 Multi-threaded latency over Infiniband OMB multi-threaded latency benchmark: 1 sender thread, N receiver threads, 4-byte messages. MVAPICH2 1.2-p1: active waiting, concurrent polling. Open MPI: segmentation fault. NewMadeleine: fixed-spin waiting, no concurrent polling, no overloaded CPU. [Plot: latency (µs) vs. number of receiver threads for MVAPICH2 and MPICH2-NewMadeleine]
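A rough sketch of the benchmark pattern using standard MPI and pthreads: rank 0 runs a single sender, rank 1 runs several receiver threads, each pair exchanging 4-byte ping-pongs on its own tag. The thread count, iteration count and timing are simplified relative to the real OMB benchmark.

/* Simplified multi-threaded latency test: 1 sender, N_THREADS receivers. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define ITERS     1000

static void *receiver(void *arg)
{
    int tag = *(int *)arg;
    char buf[4];
    for (int i = 0; i < ITERS; i++) {
        MPI_Recv(buf, 4, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, 4, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                       /* N concurrent receiver threads */
        pthread_t th[N_THREADS];
        int tags[N_THREADS];
        for (int t = 0; t < N_THREADS; t++) {
            tags[t] = t;
            pthread_create(&th[t], NULL, receiver, &tags[t]);
        }
        for (int t = 0; t < N_THREADS; t++)
            pthread_join(th[t], NULL);
    } else if (rank == 0) {                /* single sender thread */
        char buf[4] = {0};
        double t0 = MPI_Wtime();
        for (int t = 0; t < N_THREADS; t++)
            for (int i = 0; i < ITERS; i++) {
                MPI_Send(buf, 4, MPI_CHAR, 1, t, MPI_COMM_WORLD);
                MPI_Recv(buf, 4, MPI_CHAR, 1, t, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        printf("average round-trip: %.2f us\n",
               (MPI_Wtime() - t0) * 1e6 / (N_THREADS * ITERS));
    }
    MPI_Finalize();
    return 0;
}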
25 MPI overlap benchmark The measured time covers MPI_Isend(dest, &sreq); Computation(); MPI_Wait(&sreq); MPI_Recv(dest). Computation time: 20 µs for small messages (eager protocol), 400 µs for large messages (rendezvous protocol).
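The pseudo-code above can be fleshed out as follows. A minimal two-rank sketch; the Computation() placeholder and the 64 KB message size are illustrative choices, not the exact benchmark parameters.

/* Minimal sketch of the overlap benchmark: rank 0 posts a non-blocking
 * send, computes, then waits; rank 1 receives. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

static void computation(volatile double *x, int loops)
{
    for (int i = 0; i < loops; i++)        /* stand-in for real work */
        *x = *x * 1.0000001 + 1.0;
}

int main(int argc, char **argv)
{
    int rank;
    const int len = 64 * 1024;             /* large enough to use the rendezvous path */
    static char buf[64 * 1024];
    volatile double x = 1.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    if (rank == 0) {
        MPI_Request sreq;
        double t0 = MPI_Wtime();
        MPI_Isend(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &sreq);
        computation(&x, 100000);           /* ideally overlapped with the transfer */
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        printf("send + computation + wait: %.1f us\n", (MPI_Wtime() - t0) * 1e6);
    } else if (rank == 1) {
        MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}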
26 Communication progress with PIOMan [Two plots of sending time (microseconds) vs. MPI message size (bytes); left panel: overlapping eager messages with MX, comparing a reference (no computation), Open MPI (BTL), Open MPI (PML), MPICH2-NewMad and MPICH2-NewMad (PIOMan); right panel: rendezvous progress with InfiniBand, comparing a reference, MVAPICH2, Open MPI, MPICH2-NewMad and MPICH2-NewMad (PIOMan)]
27 NAS Parallel Benchmarks [Two bar charts: execution times (seconds) of the Class C NAS kernels (BT, CG, EP, FT, SP, MG, LU) with 8/9 and 16 processes, comparing MVAPICH2, Open MPI, MPICH2-NewMad and MPICH2-NewMad (PIOMan)]
28 NAS Parallel Benchmarks (contd) [Two bar charts: execution times (seconds) of the Class C NAS kernels (BT, CG, EP, FT, SP, MG, LU) with 32/36 and 64 processes, comparing MVAPICH2, Open MPI, MPICH2-NewMad and, for 32/36 processes, MPICH2-NewMad (PIOMan)]
29 Heterogeneous multi-rail Long fragments: heterogeneous splitting, with the split ratio computed according to the performance of each network and the estimated availability of the NICs; network sampling is necessary. Short fragments: aggregation over the fastest NIC. [Diagram: traffic spread over Myri-10G NICs and a Quadrics NIC]
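A toy example of how such a split ratio could be derived from sampled bandwidths. The numbers are illustrative, and the real strategy also factors in the estimated availability of each NIC.

/* Toy computation of a split ratio for a large fragment across two rails,
 * proportional to their sampled bandwidths. */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    double bw_myri = 1200.0;   /* MB/s, sampled Myri-10G bandwidth (illustrative) */
    double bw_quad = 900.0;    /* MB/s, sampled Quadrics bandwidth (illustrative) */
    size_t fragment = 4 * 1024 * 1024;

    double ratio = bw_myri / (bw_myri + bw_quad);
    size_t on_myri = (size_t)(ratio * (double)fragment);
    size_t on_quad = fragment - on_myri;

    printf("split ratio %.2f: %zu bytes on Myri-10G, %zu bytes on Quadrics\n",
           ratio, on_myri, on_quad);
    return 0;
}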
30 Aggregating bandwidth through adaptive multi-rail NewMadeleine adaptive multi-rail: Myri-10G + QsNet. [Plot: bandwidth (MB/s) vs. data size (bytes), comparing adaptive multi-rail, iso-split Myri + QsNet, Myri-10G alone and Quadrics QsNet alone]
31 Point-to-point performance MPICH2-NewMad, point-to-point, Myrinet 10G NIC + ConnectX Infiniband HCA. [Two plots: latency (microseconds) and bandwidth (Mbit/s) vs. MPI message size (bytes), comparing MPICH2-NewMad over Myrinet alone, Infiniband alone, and adaptive multirail over Myrinet + Infiniband]
32 High-throughput intra-node communications The increasing number of cores requires efficient intra-node communication. Existing strategy: double buffering, which consumes CPU cycles and pollutes the cache. KNEM offers a cheap alternative for large messages: Linux kernel-assisted memory data transfers, with support for non-contiguous/asynchronous transfers and for I/OAT copy offload. A new backend in MPICH2-Nemesis improves point-to-point and collective performance significantly, especially when no cache is shared. Developed in collaboration with ANL; results to appear at ICPP (Vienna, Sep 2009).
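A conceptual sketch of why a kernel-assisted copy helps: double buffering moves every byte twice through a bounce buffer, while the kernel-assisted path copies once between the two processes' buffers. kernel_assisted_copy() below is a hypothetical placeholder, not the actual KNEM API, and the whole example runs in a single address space purely for illustration.

/* Conceptual comparison of double buffering vs. a single assisted copy. */
#include <string.h>
#include <stddef.h>

#define BOUNCE_SIZE (16 * 1024)
static char bounce[BOUNCE_SIZE];           /* stands in for a shared-memory bounce buffer */

/* Double buffering: sender copies into the bounce buffer, receiver copies
 * out of it; twice the traffic and cache pollution on both cores. */
static void double_buffered_copy(char *dst, const char *src, size_t len)
{
    for (size_t off = 0; off < len; off += BOUNCE_SIZE) {
        size_t chunk = (len - off < BOUNCE_SIZE) ? len - off : BOUNCE_SIZE;
        memcpy(bounce, src + off, chunk);   /* copy 1: sender side */
        memcpy(dst + off, bounce, chunk);   /* copy 2: receiver side */
    }
}

/* Kernel-assisted path: a single copy performed directly between the two
 * buffers (as KNEM does between address spaces for large messages). */
static void kernel_assisted_copy(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);                  /* placeholder for the real mechanism */
}

int main(void)
{
    static char src[64 * 1024], dst[64 * 1024];
    double_buffered_copy(dst, src, sizeof src);
    kernel_assisted_copy(dst, src, sizeof src);
    return 0;
}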
33 High-throughput intra-node communications
34 Conclusion The world is going multicore! Massively multicore clusters may arrive sooner than expected, and this has an impact on the communication subsystem: mixing MPI and threads does not work out of the box. Multicore is also an opportunity to optimize communications: optimization strategies over multiple flows, and idle cores used for communication progression. We need new communication engines, designed from the ground up with multicore/multithread in mind: fully parallel, NUMA-aware, with machine-wide optimization of traffic.
35 Thank you! More information: Software available on INRIA Gforge: