Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access
S. Moreaud, B. Goglin and R. Namyst
INRIA Runtime team-project, University of Bordeaux, France
Context
- Multicore architectures are everywhere in HPC: increasing core counts, increasing complexity, hierarchical designs, multiplication of shared resources
- NUMA architectures (AMD HyperTransport, Intel QPI)
- Tasks and data must be placed according to their affinities
Non-Uniform I/O Access (NUIOA)
- On NUMA architectures, each NIC is directly connected to a single NUMA node
- Network micro-benchmarks rely on NUMA-aware manual binding
- NUIOA is not taken into account by the communication strategies of MPI implementations
- Does that matter?
NUIOA effects
- RDMA single-rail micro-benchmark over InfiniBand
- [Plot: up to 23% throughput degradation depending on process placement]
NUIOA effects
- Slight impact on latency (< 100 ns)
- High impact on bandwidth, relative to the network bandwidth: up to 40% degradation with multirail applications
- No variation when increasing the NUMA distance
- Not restricted to NICs: 42% DMA throughput degradation on NVIDIA GPU access
- How can MPI communication strategies adapt to these constraints?
NUIOA-aware communications
- Adapt process placement to NIC locations? This would give privileged network access to communication-intensive processes, but detecting such processes is tricky, it is meaningless for uniform communication patterns, and it conflicts with other placement strategies
- Instead: adapt the MPI implementation to NIC and process locations, starting with multirail communication
Experimentation platform
- Quad-socket dual-core Opteron 8218 (2.6 GHz), 4 NUMA nodes
- 2 I/O chipsets, connected to NUMA nodes #0 and #1
- Multiple NIC configurations: 2 x Myri-10G; 2 x InfiniBand ConnectX DDR; Myri-10G + InfiniBand ConnectX DDR
NUIOA effects on the platform
- [Plot: single-rail IMB ping-pong throughput]
And now?
- NUIOA effects observed on our testbed: important with the InfiniBand NIC, minor with the Myrinet NIC
- This is an old platform; we have seen even worse effects on recent ones
- How to optimize multirail transfers given NUIOA effects? Starting point: Open MPI 1.4.1, which provides multirail transfers
Multirail communication in Open MPI 1.4.1
- Byte Transfer Layer (BTL): one module per NIC (BTL 1 over NIC1 at bw1, BTL 2 over NIC2 at bw2)
- BTL Management Layer (BML): assigns a weight to each BTL
- A send buffer is split across the BTLs according to a splitting ratio derived from the weights
- Example: NIC1 at 2 GB/s and NIC2 at 1 GB/s yield a 67% / 33% split
- Identical NICs (bw1 = bw2): 50% / 50% "isosplit" strategy
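The bandwidth-proportional split described above can be sketched as follows. This is a minimal illustration of the idea, not Open MPI's actual code; `split_message` is a hypothetical helper name.

```python
def split_message(size, bandwidths):
    """Split a message of `size` bytes across rails proportionally
    to their bandwidths (the BML weighting described above)."""
    total = sum(bandwidths)
    chunks = [size * bw // total for bw in bandwidths]
    # Give any integer-rounding remainder to the first rail.
    chunks[0] += size - sum(chunks)
    return chunks

# NIC1 at 2 GB/s, NIC2 at 1 GB/s -> 67% / 33% split, as on the slide.
print(split_message(3000, [2, 1]))   # -> [2000, 1000]
# Identical NICs -> isosplit (50% / 50%).
print(split_message(3000, [1, 1]))   # -> [1500, 1500]
```

With equal weights this degenerates into the isosplit strategy, which is what Open MPI 1.4.1 applies to identical NICs regardless of their locality.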
NUIOA-aware multirail implementation in Open MPI
- Adapt the data ratio according to locality
- Modification of the BML component: bandwidths adjusted with regard to process and BTL locations, obtained with the hwloc (Hardware Locality) library
- Splitting ratio specific to each process, computed after processes are bound; it defines the amount of data sent on each NIC
Point-to-point multirail
- [Plot: 1 MB messages, InfiniBand NICs]
InfiniBand splitting ratio
- Optimal ratio: 58%, a +15% overall bandwidth improvement
- Single-rail throughputs: 1460 MB/s through the local NIC, 1140 MB/s through the distant NIC, 2600 MB/s combined
- Ratio between single-rail throughputs: 56.6%
InfiniBand splitting ratio
- Far from both NICs: identical single-rail throughputs, a 50% ratio between them, and a 50% optimal multirail ratio
- The multirail splitting ratio can therefore be approximated from the single-rail bandwidths
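As a worked example, the approximation above applied to the measured single-rail throughputs (the deck's unrounded data gives 56.6%, close to the 58% measured optimum):

```python
# Single-rail throughputs measured on the testbed (MB/s).
local_bw, distant_bw = 1460, 1140

# Point-to-point splitting ratio approximated from single-rail bandwidths:
# the local NIC's share of the combined throughput.
ratio = local_bw / (local_bw + distant_bw)
print(f"local NIC share: {ratio:.1%}")
```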
Privilege the local NIC
- Point-to-point multirail splitting ratio: significantly skewed toward the local NIC with InfiniBand cards (58%), slightly with Myri-10G cards (51%)
- Far from both NICs: isosplit is optimal (50%)
- Ratios can be derived from single-rail bandwidths
- What about contention?
Effects of contention
- Hypothesis: the more contention occurs, the more the local NIC should be privileged
- Experiment: adding contention on the path to a distant NIC degrades throughput, but the optimal splitting ratio does not change
- Conclusion: contention seems to reduce the overall memory bandwidth rather than the bandwidth of each link independently
Effects of contention
- What about collective operations, with communication from all NUMA nodes simultaneously?
- Which splitting ratio for each running process?
All-to-all splitting ratio
- 1 MB messages, InfiniBand NICs
- [Plot: optimal per-process ratios reach 100% for processes near a NIC, 50% for processes far from both]
All-to-all splitting ratio
- If a local NIC is available, it should be used exclusively; if not, isosplit is optimal
- 5% improvement on all-to-all collectives with the double-InfiniBand configuration
- Other collective operations: improvement for very intensive communication patterns, insignificant impact otherwise
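The two regimes above can be combined into one per-process policy. This is a hypothetical sketch of the rule stated on the slides, not Open MPI's API; the function name and signature are illustrative.

```python
def splitting_ratio(local_bw, distant_bw, intensive):
    """Per-process share of data sent through the closest NIC."""
    if intensive:
        # All-to-all regime: use the local NIC exclusively, unless the
        # process is far from both NICs (then isosplit is optimal).
        return 1.0 if local_bw > distant_bw else 0.5
    # Point-to-point regime: ratio derived from single-rail bandwidths.
    return local_bw / (local_bw + distant_bw)

# Process with a local InfiniBand NIC (1460 vs 1140 MB/s single-rail):
print(splitting_ratio(1460, 1140, intensive=True))   # -> 1.0
# Process far from both NICs (both rails look identical):
print(splitting_ratio(1140, 1140, intensive=True))   # -> 0.5
```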
Conclusion (1/2)
- Multirail in MPI implementations: blindly splitting messages in halves over 2 rails is not optimal
- NUIOA effects should be taken into account in splitting strategies
- The amount of data sent on each NIC is adapted according to locality
- Splitting ratios are determined from NIC/process affinities using hwloc
Conclusion (2/2)
- Efficient multirail point-to-point communication: splitting ratio derived from single-rail bandwidths, 15% performance improvement over the default strategy
- Communication-intensive patterns: exclusive use of the local NIC for processes close to one, half-splitting over both NICs for processes close to neither; 5% performance improvement on all-to-all
Future work
- hwloc should soon replace the paffinity component in Open MPI, properly used both for process binding and for detecting process and NIC affinities
- Dynamically compute per-core splitting ratios using sampling or auto-tuning
- Integrate knowledge of NIC affinities into collective algorithms, e.g. define local leader(s) according to process and NIC affinities
Questions?
stephanie.moreaud@labri.fr
http://www.open-mpi.org/
http://www.open-mpi.org/projects/hwloc/