Optimization and Tuning of Hybrid, Multirail, 3D Torus Support and QoS in MVAPICH2

Optimization and Tuning of Hybrid, Multirail, 3D Torus Support and QoS in MVAPICH2
MVAPICH2 User Group (MUG) Meeting
by Hari Subramoni, The Ohio State University
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~subramon

Outline
- Memory overheads in large-scale systems
- Optimizations in MVAPICH2 to address overheads
- Multirail Clusters
- 3D Torus Support
- Quality of Service

Memory overheads in large-scale systems
- Reliable Connection (RC) is the most common transport protocol in IB
  - Connections need to be established between every pair of processes
  - Each connection requires a certain amount of memory for handling related data structures
  - Memory required for all connections can increase with system size
- Buffers need to be posted at each receiver to receive messages from any sender
  - Buffer requirements can increase with system size
- Both issues have become critical as large-scale IB deployments have taken place
- Being addressed by both the IB specification and upper-level middleware
  - IB offers alternate mechanisms and transport protocols for scalability (XRC, UD, SRQ)

Outline
- Memory overheads in large-scale systems
- Optimizations in MVAPICH2 to address overheads
  - Shared Receive Queue (SRQ)
  - Hybrid Communication Channels
    - eXtended Reliable Connection (XRC)
    - Unreliable Datagram (UD) Transport
    - Integrated Hybrid UD-RC/XRC Design
- Multirail Clusters
- 3D Torus Support
- Quality of Service

Optimizations in MVAPICH2 to address overheads
- MVAPICH2 has optimizations to address these overheads
- Support for Shared Receive Queue (SRQ)
  - Enables use of a common buffer pool for receiving data
- eXtended Reliable Connection (XRC) transport protocol
  - Reduces the number of connections
- UD transport protocol
  - Reduces QP cache thrashing
  - Only requires one connection
- Hybrid of RC, XRC and UD gives the best of everything

Shared Receive Queue (SRQ)
- With one RQ per connection, a process posts receive buffers on each of its (M*N)-1 connections
- With one SRQ for all connections, a process posts Q buffers in total, where 0 < Q << P*((M*N)-1) and P is the number of buffers posted per RQ
- SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
- Introduced in IB specification v1.2
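To get a sense of the scale (a rough illustration only; reading P as the number of buffers pre-posted per RQ is an assumption taken from the figure), take P = 256 (the default MV2_SRQ_SIZE on the next slide) on a system with M = 8 processes per node and N = 512 nodes:

\[
P \times \bigl( (M \times N) - 1 \bigr) = 256 \times (8 \times 512 - 1) \approx 1.05 \times 10^{6}
\]

pre-posted buffers per process, versus a single SRQ holding Q buffers on the order of only a few hundred to a few thousand.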

Using Shared Receive Queues with MVAPICH2
[Charts: MPI_Init memory utilization (MB) for MVAPICH2-NO-SRQ vs MVAPICH2-SRQ from 8 to 512 processes, and an analytical model of memory used (GB) vs number of processes]
- SRQ reduces the memory used by 1/6th at 64,000 processes

Parameter | Significance | Default | Notes
MV2_USE_SRQ | Enable / Disable use of SRQ in MVAPICH2 | Enabled | Always Enable
MV2_SRQ_MAX_SIZE | Limits the maximum size of the SRQ; places an upper bound on the amount of memory used for the SRQ | 4096 | Increase to 8192 for large-scale runs
MV2_SRQ_SIZE | Number of buffers posted to the SRQ; automatically doubled by MVAPICH2 on receiving an SRQ LIMIT EVENT from the IB HCA | 256 | Upper bound: MV2_SRQ_MAX_SIZE

Refer to the Shared Receive Queue (SRQ) Tuning section of the MVAPICH2 user guide for more information: http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0a.html#x1-1010008.5
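SRQ tuning is done entirely through environment variables at job launch. A minimal sketch with mpirun_rsh (the hostfile, process count, application name and the SRQ sizes are illustrative placeholders; the variables can also be exported through the environment for other launchers):

    # SRQ is enabled by default; raise its limits for a large-scale run,
    # as suggested in the table above
    mpirun_rsh -np 4096 -hostfile ./hosts \
        MV2_USE_SRQ=1 \
        MV2_SRQ_SIZE=1024 \
        MV2_SRQ_MAX_SIZE=8192 \
        ./my_mpi_app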

eXtended Reliable Connection (XRC)
- M = # of processes/node, N = # of nodes
- RC connections: M^2 x (N-1) connections/node; XRC connections: M x (N-1) connections/node
- Each QP takes at least one page of memory
- Connections between all processes are very costly for RC
- New IB transport added: eXtended Reliable Connection
  - Allows connections between nodes instead of processes
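As a quick sanity check of the two formulas, take the 32K-process, 8-cores-per-node configuration cited on the next slide (M = 8, N = 4096); the numbers are illustrative only:

\[
\text{RC: } M^2 (N-1) = 8^2 \times 4095 \approx 2.6 \times 10^{5}
\qquad
\text{XRC: } M (N-1) = 8 \times 4095 \approx 3.3 \times 10^{4}
\quad \text{connections per node.}
\]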

Using eXtended Reliable Connection (XRC) in MVAPICH2
[Charts: memory usage (MB/process) for MVAPICH2-RC vs MVAPICH2-XRC as the number of connections grows from 1 to 16K, and normalized NAMD time (1024 cores) on the apoa1, er-gre, f1atpase and jac datasets]
- Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
- NAMD performance improves when there is frequent communication to many peers
- Enabled by setting MV2_USE_XRC to 1 (Default: Disabled)
- Requires OFED version > 1.3
  - Unsupported in earlier versions (< 1.3), OFED-3.x and MLNX_OFED-2.0
  - MVAPICH2 build process will automatically disable XRC if unsupported by OFED
- Automatically enables SRQ and ON-DEMAND connection establishment

Refer to the eXtended Reliable Connection (XRC) section of the MVAPICH2 user guide for more information: http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0a.html#x1-1020008.6
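A minimal launch sketch for enabling XRC (hostfile, process count and application are placeholders; the build must be against an OFED release that supports XRC, as noted above):

    # Enable XRC; SRQ and on-demand connection establishment come along automatically
    mpirun_rsh -np 1024 -hostfile ./hosts \
        MV2_USE_XRC=1 \
        ./my_mpi_app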

Unreliable Datagram (UD) Transport
- Connectionless, unreliable communication
- Avoids QP thrashing
- Lightweight reliability layer in the library
- Zero-copy large message transfer

Using UD Transport with MVAPICH2

Memory Footprint of MVAPICH2
Number of Processes | RC (MVAPICH2 2.0a): Conn. / Buffers / Struct / Total | UD (MVAPICH2 2.0a): Buffers / Struct / Total
512 | 22.9 / 24 / 0.3 / 47.2 | 24 / 0.2 / 24.2
1024 | 29.5 / 24 / 0.6 / 54.1 | 24 / 0.4 / 24.4
2048 | 42.4 / 24 / 1.2 / 67.6 | 24 / 0.9 / 24.9

[Chart: normalized SMG2000 time with RC vs UD from 128 to 2048 processes]
- Can use UD transport by configuring MVAPICH2 with --enable-hybrid
- Reduces QP cache thrashing and memory footprint at large scale

Parameter | Significance | Default | Notes
MV2_USE_ONLY_UD | Enable only UD transport in hybrid configuration mode | Disabled | RC/XRC not used
MV2_USE_UD_ZCOPY | Enables zero-copy transfers for large messages on UD | Enabled | Always Enable when UD enabled
MV2_UD_RETRY_TIMEOUT | Time (in usec) after which an unacknowledged message will be retried | 500000 | Increase appropriately on large / congested systems
MV2_UD_RETRY_COUNT | Number of retries before job is aborted | 1000 | Increase appropriately on large / congested systems

Refer to the Running with scalable UD transport section of the MVAPICH2 user guide for more information: http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0a.html#x1-640006.11
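A sketch of building with hybrid support and then forcing UD-only transport at run time (the install path, hostfile, application and the relaxed retry settings are illustrative placeholders):

    # Build MVAPICH2 with the hybrid (UD) channel compiled in
    ./configure --enable-hybrid && make && make install

    # Run using only UD, with retry settings relaxed for a large or congested fabric
    mpirun_rsh -np 2048 -hostfile ./hosts \
        MV2_USE_ONLY_UD=1 \
        MV2_USE_UD_ZCOPY=1 \
        MV2_UD_RETRY_TIMEOUT=1000000 \
        MV2_UD_RETRY_COUNT=2000 \
        ./my_mpi_app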

Hybrid (UD/RC/XRC) Mode in MVAPICH2
[Chart: HPCC Random Ring latency (us) with UD, Hybrid and RC at 128, 256, 512 and 1024 processes; annotated improvements of 26%, 40%, 38% and 30%]
- Both UD and RC/XRC have benefits; Hybrid gives the best of both
- Enabled by configuring MVAPICH2 with --enable-hybrid
- Available since MVAPICH2 1.7 as an integrated interface

Parameter | Significance | Default | Notes
MV2_USE_UD_HYBRID | Enable / Disable use of UD transport in Hybrid mode | Enabled | Always Enable
MV2_HYBRID_ENABLE_THRESHOLD_SIZE | Job size in number of processes beyond which hybrid mode will be enabled | 1024 | Uses RC/XRC connections until job size < threshold
MV2_HYBRID_MAX_RC_CONN | Maximum number of RC or XRC connections created per process; limits the amount of connection memory | 64 | Prevents HCA QP cache thrashing

Refer to the Running with Hybrid UD-RC/XRC section of the MVAPICH2 user guide for more information: http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0a.html#x1-650006.12
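A hybrid-mode launch sketch against an --enable-hybrid build; the threshold and RC-connection cap are illustrative values chosen for the example, not recommendations from the slide:

    # Enable hybrid mode even below the default 1024-process threshold and
    # cap the RC/XRC connections each process may create
    mpirun_rsh -np 512 -hostfile ./hosts \
        MV2_USE_UD_HYBRID=1 \
        MV2_HYBRID_ENABLE_THRESHOLD_SIZE=512 \
        MV2_HYBRID_MAX_RC_CONN=32 \
        ./my_mpi_app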

Outline
- Memory overheads in large-scale systems
- Optimizations in MVAPICH2 to address overheads
- Multirail Clusters
- 3D Torus Support
- Quality of Service

MVAPICH2 Multi-Rail Design
- What is a rail? HCA, Port, Queue Pair
- Automatically detects and uses all active HCAs in a system
  - Automatically handles heterogeneity
- Supports multiple rail usage policies
  - Rail Sharing: processes share all available rails
  - Rail Binding: specific processes are bound to specific rails

Performance Tuning on Multi-Rail Clusters
[Charts: impact of default message striping on bandwidth and of default rail binding on message rate (single-rail vs dual-rail, 98% and 130% improvements), and impact of advanced multi-rail tuning on message rate (Use First, Round Robin, Scatter, Bunch; 7% improvement), all plotted against message size]
- Two 24-core Magny-Cours nodes with two Mellanox ConnectX QDR adapters
- Six pairs with the OSU Multi-Pair bandwidth and messaging rate benchmark

Parameter | Significance | Default | Notes
MV2_IBA_HCA | Manually set the HCA to be used | Unset | To get names of HCAs: ibstat | grep ^CA
MV2_DEFAULT_PORT | Select the port to use on an active multi-port HCA | 0 | Set to use a different port
MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD | Threshold beyond which striping will take place | 16 Kbyte |
MV2_RAIL_SHARING_POLICY | Choose multi-rail rail sharing / binding policy | Rail Binding in Round Robin mode | For Rail Sharing set to USE_FIRST or ROUND_ROBIN; set to FIXED_MAPPING for advanced rail binding options; advanced tuning can result in better performance
MV2_PROCESS_TO_RAIL_MAPPING | Determines how HCAs will be mapped to the rails | BUNCH | Options: SCATTER and custom list

Refer to the Enhanced design for Multiple-Rail section of the MVAPICH2 user guide for more information: http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0a.html#x1-670006.14
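Two illustrative launches on a dual-rail node like the one above (hostfile, process counts and the striping threshold are placeholders; pairing MV2_PROCESS_TO_RAIL_MAPPING with the FIXED_MAPPING policy follows the table's description of advanced rail binding):

    # Rail sharing: spread traffic round-robin over all rails and
    # lower the striping threshold from the 16 KB default
    mpirun_rsh -np 48 -hostfile ./hosts \
        MV2_RAIL_SHARING_POLICY=ROUND_ROBIN \
        MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8192 \
        ./my_mpi_app

    # Rail binding: use a fixed process-to-rail mapping, scattering processes across HCAs
    mpirun_rsh -np 48 -hostfile ./hosts \
        MV2_RAIL_SHARING_POLICY=FIXED_MAPPING \
        MV2_PROCESS_TO_RAIL_MAPPING=SCATTER \
        ./my_mpi_app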

Outline
- Memory overheads in large-scale systems
- Optimizations in MVAPICH2 to address overheads
- Multirail Clusters
- 3D Torus Support
- Quality of Service

Support for 3D Torus Networks in MVAPICH2
- Deadlocks possible with common routing algorithms in 3D Torus InfiniBand networks
  - Need special routing algorithm for OpenSM
- Users need to interact with OpenSM
  - Use appropriate SL to prevent deadlock
- MVAPICH2 supports 3D Torus topology
  - Queries OpenSM at runtime to obtain appropriate SL
- Usage
  - Enabled at configure time: --enable-3dtorus-support
  - MV2_NUM_SA_QUERY_RETRIES: control number of retries if PathRecord query fails
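A sketch of the two usage steps above (install prefix, hostfile, process count, application and the retry count are illustrative placeholders):

    # Build with 3D Torus support compiled in
    ./configure --enable-3dtorus-support && make && make install

    # Retry the PathRecord/SL query to OpenSM a few extra times on a busy subnet
    mpirun_rsh -np 1024 -hostfile ./hosts \
        MV2_NUM_SA_QUERY_RETRIES=20 \
        ./my_mpi_app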

Outline
- Memory overheads in large-scale systems
- Optimizations in MVAPICH2 to address overheads
- Multirail Clusters
- 3D Torus Support
- Quality of Service

Exploiting QoS Support in MVAPICH2
[Charts: intra-job QoS through load balancing over different VLs (point-to-point bandwidth with 1 VL vs 8 VLs, 1K to 64K byte messages) and inter-job QoS through traffic segregation over different VLs (Alltoall latency for 2 Alltoalls without QoS, 2 Alltoalls with QoS, and 1 Alltoall, 1K to 16K byte messages)]
- IB is capable of providing network-level differentiated service: QoS
  - Uses Service Levels (SL) and Virtual Lanes (VL) to classify traffic
- Enabled at configure time using the CFLAG ENABLE_QOS_SUPPORT
  - Check with your system administrator before enabling; can affect performance of other jobs in the system

Parameter | Significance | Default | Notes
MV2_USE_QOS | Enable / Disable use of QoS | Disabled | Check with system administrator
MV2_NUM_SLS | Number of Service Levels requested by the user | 8 | Use to see benefits of Intra-Job QoS
MV2_DEFAULT_SERVICE_LEVEL | Indicates the default Service Level to be used by a job | 0 | Set to different values for different jobs to enable Inter-Job QoS

How can QoS be used to isolate checkpoint/restart traffic from application traffic? See "Fault-Tolerance Support (CR, SCR and Migration) in MVAPICH2"; 11:50-12:20, Tuesday, August 27th
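A sketch of enabling QoS as described above. Passing ENABLE_QOS_SUPPORT through CFLAGS, the hostfiles, job binaries and the specific SL values are assumptions for illustration; confirm the exact build flag in the user guide and check with the system administrator first:

    # Build with QoS support compiled in (assumed flag-passing mechanism)
    ./configure CFLAGS=-DENABLE_QOS_SUPPORT && make && make install

    # Inter-job QoS: map two jobs onto different Service Levels
    mpirun_rsh -np 256 -hostfile ./hosts_a MV2_USE_QOS=1 MV2_DEFAULT_SERVICE_LEVEL=1 ./job_a
    mpirun_rsh -np 256 -hostfile ./hosts_b MV2_USE_QOS=1 MV2_DEFAULT_SERVICE_LEVEL=2 ./job_b

    # Intra-job QoS: load-balance one job's traffic over 8 Service Levels / VLs
    mpirun_rsh -np 256 -hostfile ./hosts_a MV2_USE_QOS=1 MV2_NUM_SLS=8 ./job_a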

Web Pointers
- NOWLAB Web Page: http://nowlab.cse.ohio-state.edu
- MVAPICH Web Page: http://mvapich.cse.ohio-state.edu