Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access

Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access
S. Moreaud, B. Goglin and R. Namyst
INRIA Runtime team-project, University of Bordeaux, France

Context
- Multicore architectures everywhere in HPC: increasing number of cores, increasing complexity.
- Hierarchical aspects, multiplication of shared resources.
- NUMA architectures: AMD HyperTransport, Intel QPI.
- Need to place tasks and data according to affinities.

Non-Uniform I/O Access (NUIOA)
- On NUMA architectures, a NIC is directly connected to a single NUMA node.
- Network micro-benchmarks use NUMA-aware manual binding (a sketch of such binding with hwloc follows).
- This locality is not included in the communication strategies of MPI implementations.
- Does that matter?
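A minimal sketch of such NUMA-aware manual binding with hwloc (assuming hwloc >= 1.11 for the HWLOC_OBJ_NUMANODE constant; the choice of NUMA node #0 is purely illustrative, a real benchmark would pick the node close to, or far from, the NIC under test):

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);

        /* Pick the first NUMA node; a real benchmark would choose the node
         * close to (or far from) the NIC under test. */
        hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, 0);
        if (node && hwloc_set_cpubind(topology, node->cpuset, HWLOC_CPUBIND_PROCESS) == 0)
            printf("process bound near NUMA node #%u\n", node->os_index);

        /* ... run the network micro-benchmark here ... */

        hwloc_topology_destroy(topology);
        return 0;
    }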

NUIOA effects
[Figure: RDMA single-rail micro-benchmark over InfiniBand; throughput drops by about 23% when the process runs on a NUMA node far from the NIC.]

NUIOA effects
- Slight impact on latency (< 100 ns).
- High impact on bandwidth, relative to the network bandwidth: up to 40% degradation with multirail applications.
- No variation when increasing the NUMA distance.
- Not restricted to NICs: 42% DMA throughput degradation on NVIDIA GPU access.
- How can MPI communication strategies be adapted to these constraints?

NUIOA-aware communications
- Adapt process placement to NIC locations? This privileges network access for communication-intensive processes, but:
  - detecting communication-intensive tasks is tricky,
  - it is meaningless for uniform communication patterns,
  - it conflicts with other placement strategies.
- Instead, adapt the MPI implementation to the NIC and process locations: multirail communication.

Experimentation platform
- Multiple network configurations: 2 x Myri-10G, 2 x InfiniBand ConnectX DDR, or Myri-10G + InfiniBand ConnectX DDR.
- Quad-socket dual-core Opteron 8218 (2.6 GHz): 4 NUMA nodes, 2 I/O chipsets connected to NUMA nodes #0 and #1.

NUIOA effects on the platform
[Figure: single-rail IMB ping-pong throughput.]

And now?
- Observed NUIOA effects on our testbed: important with the InfiniBand NICs, minor with the Myrinet NIC.
- It is an old platform... and we have seen even worse effects on recent ones, so don't worry!
- How do we optimize multirail transfers considering NUIOA effects?
- We work on the Open MPI 1.4.1 implementation, which provides multirail transfers.

Multirail communication in Open MPI 1.4.1
[Figure: progressive build-up of the multirail software stack.]
- Each NIC is driven by a BTL (Byte Transfer Layer) component, characterized by its bandwidth (bw1, bw2).
- The BML (BTL Management Layer) assigns a weight to each BTL.
- Large send buffers are split between the BTLs according to a splitting ratio derived from these weights.
- Example: a 2 GB/s NIC and a 1 GB/s NIC lead to a 67% / 33% split.
- With identical NICs (bw1 = bw2), the weights are equal and the buffer is split in halves: the "isosplit" strategy.
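A minimal sketch of this weight-based split (the 2 GB/s and 1 GB/s figures come from the example above; the function name is illustrative, not Open MPI's actual BML code):

    #include <stdio.h>
    #include <stddef.h>

    /* Split a message of 'len' bytes between two BTLs proportionally to
     * their bandwidths, which is what the BML weights express. */
    static void split_message(size_t len, double bw1, double bw2,
                              size_t *part1, size_t *part2)
    {
        double w1 = bw1 / (bw1 + bw2);        /* weight of BTL 1 */
        *part1 = (size_t)((double)len * w1);
        *part2 = len - *part1;                /* the rest goes to BTL 2 */
    }

    int main(void)
    {
        size_t p1, p2;
        /* 1 MB message over a 2 GB/s NIC and a 1 GB/s NIC, as in the example. */
        split_message(1 << 20, 2.0, 1.0, &p1, &p2);
        printf("BTL 1: %zu bytes (~67%%), BTL 2: %zu bytes (~33%%)\n", p1, p2);
        return 0;
    }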

NUIOA-aware multirail: implementation in Open MPI
- Adapt the data ratio according to locality.
- Modification of the BML component: the bandwidth of each BTL is adjusted with regard to the process and BTL locations, obtained with the hwloc (Hardware Locality) library.
- The splitting ratio is specific to each process and computed after processes have been bound; it defines the amount of data sent on each NIC (see the locality-detection sketch below).
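A minimal sketch of this locality detection, not the actual Open MPI BML patch (assumes hwloc >= 2.0 for the I/O type filter; the device name "mlx4_0" and the function name are illustrative, and the process is assumed to be already bound by the MPI launcher):

    #include <hwloc.h>
    #include <stdio.h>
    #include <string.h>

    /* Return 1 if the current process binding intersects the locality of the
     * given OS device, 0 if it does not, -1 if the device is not found. */
    static int process_is_local_to_nic(hwloc_topology_t topo, const char *osdevname)
    {
        hwloc_obj_t dev = NULL;
        while ((dev = hwloc_get_next_osdev(topo, dev)) != NULL)
            if (dev->name && !strcmp(dev->name, osdevname))
                break;
        if (!dev)
            return -1;

        /* I/O objects have no cpuset; climb to the first non-I/O ancestor. */
        hwloc_obj_t ancestor = hwloc_get_non_io_ancestor_obj(topo, dev);

        hwloc_cpuset_t binding = hwloc_bitmap_alloc();
        hwloc_get_cpubind(topo, binding, HWLOC_CPUBIND_PROCESS);
        int local = hwloc_bitmap_intersects(binding, ancestor->cpuset);
        hwloc_bitmap_free(binding);
        return local;
    }

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        /* Keep I/O devices (NICs) in the topology. */
        hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_ALL);
        hwloc_topology_load(topo);

        int local = process_is_local_to_nic(topo, "mlx4_0");
        /* A NUIOA-aware BML would then weight this BTL with the measured
         * local or distant single-rail bandwidth. */
        printf("mlx4_0 is %s\n",
               local == 1 ? "local" : local == 0 ? "distant" : "not found");

        hwloc_topology_destroy(topo);
        return 0;
    }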

Point-to-point multirail
[Figure: 1 MB messages, InfiniBand NICs.]

InfiniBand splitting ratio
- Optimal ratio: 58% (+15% overall bandwidth improvement).
- Single-rail throughputs: 1460 MB/s on the local NIC, 1140 MB/s on the distant NIC, 2600 MB/s combined.
- Ratio between the single-rail throughputs: 56.6%.
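Written out, the ratio derived from the single-rail throughputs is

    \mathrm{ratio}_{\mathrm{local}}
      = \frac{bw_{\mathrm{local}}}{bw_{\mathrm{local}} + bw_{\mathrm{distant}}}
      = \frac{1460}{1460 + 1140} \approx 56\%

which is close to the empirically optimal 58% (the 56.6% quoted above presumably comes from the unrounded throughput measurements).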

InfiniBand splitting ratio
- When the process is far from both NICs: identical single-rail throughputs, and the optimal multirail ratio is 50%.
- Takeaway: the multirail splitting ratio can be approximated from the single-rail bandwidths.

Point-to-point multirail splitting ratio
- Privilege the local NIC: significantly with the InfiniBand card (58%), slightly with the Myri-10G card (51%).
- Far from both NICs, isosplit is optimal (50%).
- The ratio is derived from the single-rail bandwidths.
- What about contention?

Effects of contention
- Hypothesis: the more contention occurs, the more the local NIC should be privileged.
- Experiment: add contention on the path to a distant NIC. Throughput degrades, but the optimal splitting ratio does not change.
- Conclusion: contention seems to reduce the overall memory bandwidth rather than the bandwidth of each link independently.

Effects of contention
- What about collective operations, where all NUMA nodes communicate simultaneously?
- Which splitting ratio should each running process use?

All-to-all splitting ratio
[Figure: 1 MB messages, InfiniBand NICs; optimal ratios are 50% for processes without a local NIC and 100% (local NIC only) for processes with one.]

All-to-all splitting ratio
- If the process has a local NIC, it should be used exclusively; if it has no local NIC, isosplit is optimal (see the sketch below).
- 5% improvement on all-to-all collectives with the double-InfiniBand configuration.
- Other collective operations: improvement on very intensive communication patterns, insignificant impact otherwise.
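A minimal sketch of this per-process policy for all-to-all-like patterns (function and parameter names are illustrative, not Open MPI's):

    #include <stdio.h>

    /* Fraction of each message sent on NIC 1 (the rest goes on NIC 2). */
    static double alltoall_split_ratio(int local_to_nic1, int local_to_nic2)
    {
        if (local_to_nic1 && !local_to_nic2)
            return 1.0;   /* the local NIC is used exclusively */
        if (local_to_nic2 && !local_to_nic1)
            return 0.0;   /* same, when NIC 2 is the local one */
        return 0.5;       /* no local NIC (or both local): isosplit */
    }

    int main(void)
    {
        /* A process bound next to NIC 1, a process bound next to NIC 2,
         * and a process far from both NICs. */
        printf("%.0f%% %.0f%% %.0f%%\n",
               100 * alltoall_split_ratio(1, 0),
               100 * alltoall_split_ratio(0, 1),
               100 * alltoall_split_ratio(0, 0));
        return 0;
    }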

Conclusion (1/2)
- Multirail in MPI implementations: blindly splitting messages in halves over 2 rails is not optimal.
- NUIOA effects should be taken into account in splitting strategies, by adapting the amount of data sent on each NIC according to locality.
- The splitting ratio is determined from the NIC/process affinities using hwloc.

Conclusion (2/2)
- Efficient multirail point-to-point communications: splitting ratio derived from the single-rail bandwidths, 15% performance improvement over the default strategy.
- Communication-intensive patterns: exclusive use of the local NIC for processes close to one, half splitting over the NICs for processes not close to any, 5% performance improvement for all-to-all.

Future work
- hwloc should soon replace the paffinity component in Open MPI, so that it is properly used both for process binding and for detecting process and NIC affinities.
- Use sampling or auto-tuning to dynamically compute per-core splitting ratios (a sketch of the idea follows).
- Integrate knowledge of NIC affinity into collective algorithms: define local leader(s) according to process and NIC affinities.
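A minimal sketch of the sampling/auto-tuning idea: probe a few candidate splitting ratios with a timed transfer and keep the best one. Both function names and the throughput model are illustrative assumptions, not measurements or code from the talk:

    #include <stdio.h>

    /* Toy throughput model for demonstration only: completion is bounded by
     * the slower of the two rails (bandwidths in MB/s are made up to match
     * the earlier point-to-point example). */
    static double toy_throughput(double ratio)
    {
        const double bw_local = 1460.0, bw_distant = 1140.0;
        double t_local = ratio / bw_local;
        double t_distant = (1.0 - ratio) / bw_distant;
        double t = t_local > t_distant ? t_local : t_distant;
        return 1.0 / t;   /* MB transferred per unit time */
    }

    /* measure_throughput() would be a timed multirail transfer in practice. */
    static double autotune_split_ratio(double (*measure_throughput)(double))
    {
        double best_ratio = 0.5, best_bw = 0.0;
        int i;
        for (i = 0; i <= 20; i++) {           /* probe ratios in 5% steps */
            double r = i / 20.0;
            double bw = measure_throughput(r);
            if (bw > best_bw) {
                best_bw = bw;
                best_ratio = r;
            }
        }
        return best_ratio;
    }

    int main(void)
    {
        printf("best ratio: %.0f%%\n", 100.0 * autotune_split_ratio(toy_throughput));
        return 0;
    }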

Questions?
stephanie.moreaud@labri.fr
http://www.open-mpi.org/
http://www.open-mpi.org/projects/hwloc/