OPEN MPI WITH RDMA SUPPORT AND CUDA. Rolf vandeVaart, NVIDIA

OVERVIEW
What is CUDA-aware
History of CUDA-aware support in Open MPI
GPU Direct RDMA support
Tuning parameters
Application example
Future work

CUDA-AWARE DEFINITION
Regular MPI:
    // MPI rank 0
    cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf_h, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_h, size, ...);
    cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);
CUDA-aware MPI:
    // MPI rank 0
    MPI_Send(s_buf_d, size, ...);
    // MPI rank n-1
    MPI_Recv(r_buf_d, size, ...);
CUDA-aware MPI makes MPI+CUDA easier.
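A minimal, self-contained sketch of both variants (run with at least two ranks; buffer names, message size, and tags are illustrative, and the second half assumes the MPI library is CUDA-aware):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int size = 1 << 20;
        char *buf_d, *buf_h;
        cudaMalloc((void **)&buf_d, size);        /* device buffer */
        cudaMallocHost((void **)&buf_h, size);    /* pinned host buffer */

        /* Regular MPI: stage the device buffer through host memory. */
        if (rank == 0) {
            cudaMemcpy(buf_h, buf_d, size, cudaMemcpyDeviceToHost);
            MPI_Send(buf_h, size, MPI_BYTE, nprocs - 1, 0, MPI_COMM_WORLD);
        } else if (rank == nprocs - 1) {
            MPI_Recv(buf_h, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(buf_d, buf_h, size, cudaMemcpyHostToDevice);
        }

        /* CUDA-aware MPI: pass the device pointer directly to MPI. */
        if (rank == 0) {
            MPI_Send(buf_d, size, MPI_BYTE, nprocs - 1, 1, MPI_COMM_WORLD);
        } else if (rank == nprocs - 1) {
            MPI_Recv(buf_d, size, MPI_BYTE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFreeHost(buf_h);
        cudaFree(buf_d);
        MPI_Finalize();
        return 0;
    }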

CUDA-AWARE MPI IMPLEMENTATIONS
Open MPI 1.7.4: http://www.open-mpi.org
MVAPICH2 2.0: http://mvapich.cse.ohio-state.edu/overview/mvapich2
IBM Platform MPI 9.1.2: http://www.ibm.com/systems/technicalcomputing/platformcomputing/products/mpi
Cray MPT

HISTORY OF CUDA-AWARE OPEN MPI
Open MPI 1.7.0 released in April 2013
Pipelining with host memory staging for large messages (utilizing asynchronous copies) over the verbs layer
Dynamic CUDA IPC support added
GPU Direct RDMA support
Use Open MPI 1.7.4 to get all the latest features

MPI API SUPPORT
Supported (Yes): all send and receive types; all non-arithmetic collectives
Not supported (No): reduction-type operations (MPI_Reduce, MPI_Allreduce, MPI_Scan); non-blocking collectives; MPI-2 and MPI-3 one-sided (RMA)
The FAQ will be updated as support changes.
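Because reduction operations are not CUDA-aware in this release, device buffers have to be staged through host memory before calls such as MPI_Allreduce. A minimal sketch, with illustrative names:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Stage a device buffer through host memory for a sum reduction,
     * since passing device pointers to MPI_Allreduce is not supported. */
    void allreduce_device_sum(const double *d_in, double *d_out, int n, MPI_Comm comm)
    {
        double *h_in  = (double *)malloc(n * sizeof(double));
        double *h_out = (double *)malloc(n * sizeof(double));
        cudaMemcpy(h_in, d_in, n * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Allreduce(h_in, h_out, n, MPI_DOUBLE, MPI_SUM, comm);
        cudaMemcpy(d_out, h_out, n * sizeof(double), cudaMemcpyHostToDevice);
        free(h_in);
        free(h_out);
    }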

CUDA IPC SUPPORT IN OPEN MPI
[Diagram: two nodes, each with two Intel CPUs connected by QPI and GPUs attached to each socket. CUDA IPC is used between GPUs under the same CPU/PCIe root; it is not used between GPUs on opposite sides of the QPI link.]

CUDA IPC SUPPORT IN OPEN MPI
Open MPI dynamically detects whether CUDA IPC is supported between GPUs within the same node.
Enabled by default; to disable: --mca btl_smcuda_use_cuda_ipc 0
To see whether it is being used between two processes: --mca btl_smcuda_cuda_ipc_verbose 100
CUDA 6.0 has performance fixes.
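Under the hood, same-node GPU-to-GPU transfers rely on the CUDA IPC API. A rough sketch of those primitives between two ranks on the same node (illustrative only, not Open MPI's actual code path):

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Rank 0 exports a device allocation; rank 1 maps it into its own
     * address space and can then copy from it directly on the GPU. */
    void cuda_ipc_sketch(int rank, MPI_Comm comm)
    {
        if (rank == 0) {
            void *d_buf;
            cudaMalloc(&d_buf, 1 << 20);
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, d_buf);        /* export the allocation */
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, comm);
        } else if (rank == 1) {
            cudaIpcMemHandle_t handle;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
            void *d_peer;
            cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
            /* ... cudaMemcpy from d_peer, device to device ... */
            cudaIpcCloseMemHandle(d_peer);
        }
    }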

CUDA IPC SUPPORT IN OPEN MPI

GPU DIRECT RDMA SUPPORT
Kepler-class GPUs (K10, K20, K40)
Mellanox ConnectX-3, ConnectX-3 Pro, Connect-IB
CUDA 6.0 (EA, RC, Final), Open MPI 1.7.4, and Mellanox OFED 2.1 drivers
GPU Direct RDMA enabling software: http://www.mellanox.com/downloads/ofed/nvidia_peer_memory-1.0-0.tar.gz

GPU DIRECT RDMA SUPPORT
Implemented with an RDMA-type protocol: register the send and receive buffers and have the NIC transfer the data directly.
Memory registration is not cheap, so a registration cache is needed (see the sketch below).
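To illustrate why a registration cache matters, here is a minimal sketch of one keyed on the buffer address, assuming an existing verbs protection domain pd (Open MPI's real cache handles address ranges, eviction, and thread safety):

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    struct reg_entry {
        void *addr;
        size_t len;
        struct ibv_mr *mr;
        struct reg_entry *next;
    };

    static struct reg_entry *reg_cache = NULL;

    /* Return a memory registration for (addr, len), reusing a cached one
     * if the buffer was registered before; ibv_reg_mr is the slow path. */
    struct ibv_mr *get_mr(struct ibv_pd *pd, void *addr, size_t len)
    {
        for (struct reg_entry *e = reg_cache; e != NULL; e = e->next)
            if (e->addr == addr && e->len >= len)
                return e->mr;                               /* cache hit */

        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (mr == NULL)
            return NULL;

        struct reg_entry *e = (struct reg_entry *)malloc(sizeof(*e));
        e->addr = addr;
        e->len = len;
        e->mr = mr;
        e->next = reg_cache;
        reg_cache = e;
        return mr;
    }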

GPU DIRECT RDMA SUPPORT
Chipset implementations limit bandwidth at larger message sizes, so pipelining with host memory staging (utilizing asynchronous copies) is still used for large messages. The final implementation is a hybrid of both protocols (see the pipelining sketch below).
[Diagram: reads and writes between the network card and GPU memory cross the chipset. Ivy Bridge peer-to-peer bandwidth: P2P READ 3.5 Gbyte/sec, P2P WRITE 6.4 Gbyte/sec.]
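A simplified sketch of the pipelined host-staging idea, double-buffering asynchronous device-to-host copies against MPI sends (chunk size, buffer names, and tag are illustrative; the receiving side would mirror this with host-to-device copies):

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)                         /* illustrative pipeline chunk size */

    /* Send a large device buffer by staging it through pinned host memory in
     * chunks, overlapping the next device-to-host copy with the current send. */
    static void pipelined_send(const char *d_buf, size_t size, int dst, MPI_Comm comm)
    {
        char *h_stage[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; i++) {
            cudaMallocHost((void **)&h_stage[i], CHUNK);
            cudaStreamCreate(&stream[i]);
        }

        size_t off = 0;
        int b = 0;
        size_t n = size < CHUNK ? size : CHUNK;
        cudaMemcpyAsync(h_stage[0], d_buf, n, cudaMemcpyDeviceToHost, stream[0]);

        while (off < size) {
            n = (size - off) < CHUNK ? (size - off) : CHUNK;
            cudaStreamSynchronize(stream[b]);       /* chunk b is now in host memory */

            size_t next = off + n;
            if (next < size) {                      /* start copying the next chunk */
                size_t nn = (size - next) < CHUNK ? (size - next) : CHUNK;
                cudaMemcpyAsync(h_stage[b ^ 1], d_buf + next, nn,
                                cudaMemcpyDeviceToHost, stream[b ^ 1]);
            }
            MPI_Send(h_stage[b], (int)n, MPI_BYTE, dst, 0, comm);  /* overlaps the copy */
            off = next;
            b ^= 1;
        }

        for (int i = 0; i < 2; i++) {
            cudaStreamDestroy(stream[i]);
            cudaFreeHost(h_stage[i]);
        }
    }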

GPU DIRECT RDMA SUPPORT - PERFORMANCE

GPU DIRECT RDMA SUPPORT - PERFORMANCE

GPU DIRECT RDMA SUPPORT - CONFIGURE
Nothing different needs to be done at configure time:
    > configure --with-cuda
The support is configured in if the CUDA 6.0 cuda.h header file is detected. To check:
    > ompi_info --all | grep btl_openib_have_cuda_gdr
    MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)
    > ompi_info --all | grep btl_openib_have_driver_gdr
    MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

GPU DIRECT RDMA SUPPORT - TUNING PARAMETERS
Runtime parameters:
Enable GPU Direct RDMA usage (off by default):
    --mca btl_openib_want_cuda_gdr 1
Adjust when to switch to pipelined transfers through host memory (the current default is 30,000 bytes):
    --mca btl_openib_cuda_rdma_limit 60000
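For example, an illustrative launch line combining both parameters (the application name ./app is a placeholder) might look like:
    mpirun -np 2 --mca btl_openib_want_cuda_gdr 1 --mca btl_openib_cuda_rdma_limit 60000 ./app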

GPU DIRECT RDMA SUPPORT - NUMA ISSUES
Configure the system so that the GPU and the NIC are close to each other (attached to the same CPU socket).
[Diagram: two dual-socket layouts joined by QPI; placing the NIC and the GPU under the same CPU is better than placing them on opposite sides of the QPI link.]

GPU DIRECT RDMA SUPPORT - NUMA ISSUES
Multi-NIC / multi-GPU nodes: use hwloc to select the GPU near the NIC (see the sketch below).
[Diagram: dual-socket node with a NIC and a GPU per socket; rank 0 uses the NIC and GPU on one socket, rank 1 uses those on the other.]
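One possible way to implement such a selection, sketched with hwloc's CUDA runtime helpers (assumes an hwloc build with hwloc/cudart.h support; error handling omitted, and a real setup would also account for NIC placement and ranks sharing a socket):

    #include <hwloc.h>
    #include <hwloc/cudart.h>
    #include <cuda_runtime.h>

    /* Select the CUDA device whose PCIe locality overlaps this process's
     * CPU binding, so the GPU sits near the cores (and NIC) we run on. */
    int select_local_gpu(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        hwloc_bitmap_t binding = hwloc_bitmap_alloc();
        hwloc_bitmap_t gpuset  = hwloc_bitmap_alloc();
        hwloc_get_cpubind(topo, binding, HWLOC_CPUBIND_PROCESS);

        int ndev = 0, chosen = 0;
        cudaGetDeviceCount(&ndev);
        for (int i = 0; i < ndev; i++) {
            if (hwloc_cudart_get_device_cpuset(topo, i, gpuset) == 0 &&
                hwloc_bitmap_intersects(binding, gpuset)) {
                chosen = i;                 /* GPU attached near our cores */
                break;
            }
        }
        cudaSetDevice(chosen);

        hwloc_bitmap_free(binding);
        hwloc_bitmap_free(gpuset);
        hwloc_topology_destroy(topo);
        return chosen;
    }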

LATENCY COMPARISON CLOSE VS FAR GPU AND NIC

CUDA-AWARE OPEN MPI AND UNIFIED MEMORY
CUDA 6 Unified Memory:
    cudaMallocManaged(&buf, BUFSIZE, cudaMemAttachGlobal);
Unified Memory may not work correctly with CUDA-aware Open MPI; this will be fixed in a future release of Open MPI.

HOOMD-BLUE PERFORMANCE
Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition
Performs general-purpose particle dynamics simulations
Takes advantage of NVIDIA GPUs
Simulations are configured and run using simple Python scripts
The development effort is led by the Glotzer group at the University of Michigan
Source: http://www.hpcadvisorycouncil.com/pdf/hoomdblue_analysis_and_profiling.pdf

HOOMD CLUSTER 1
Dell PowerEdge R720xd/R720 cluster
Dual-socket octa-core Intel E5-2680 v2 @ 2.80 GHz CPUs (Static Max Perf in BIOS); Memory: 64 GB DDR3 1600 MHz dual-rank memory modules; OS: RHEL 6.2; MLNX_OFED 2.1-1.0.0 InfiniBand SW stack
Mellanox Connect-IB FDR InfiniBand, Mellanox SwitchX SX6036 InfiniBand VPI switch
NVIDIA Tesla K40 GPUs (1 GPU per node), NVIDIA CUDA 5.5 Development Tools and Display Driver 331.20
Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)
Application: HOOMD-blue (git master 28Jan14)
Benchmark datasets: Lennard-Jones Liquid Benchmarks (16K, 64K particles)
Source: http://www.hpcadvisorycouncil.com/pdf/hoomdblue_analysis_and_profiling.pdf

GPU DIRECT RDMA SUPPORT - APPLICATION PERFORMANCE
[Chart: HOOMD-blue application performance; higher is better.]
Source: http://www.hpcadvisorycouncil.com/pdf/hoomdblue_analysis_and_profiling.pdf

HOOMD CLUSTER 2
Dell PowerEdge T620 128-node (1536-core) Wilkes cluster at the University of Cambridge
Dual-socket hexa-core Intel E5-2630 v2 @ 2.60 GHz CPUs; Memory: 64 GB DDR3 1600 MHz; OS: Scientific Linux release 6.4 (Carbon); MLNX_OFED 2.1-1.0.0 InfiniBand SW stack
Mellanox Connect-IB FDR InfiniBand adapters, Mellanox SwitchX SX6036 InfiniBand VPI switch
NVIDIA Tesla K20 GPUs (2 GPUs per node), NVIDIA CUDA 5.5 Development Tools and Display Driver 331.20
Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)
Application: HOOMD-blue (git master 28Jan14)
Benchmark datasets: Lennard-Jones Liquid Benchmarks (512K particles)
Source: http://www.hpcadvisorycouncil.com/pdf/hoomdblue_analysis_and_profiling.pdf

GPU DIRECT RDMA SUPPORT - APPLICATION PERFORMANCE
[Chart: HOOMD-blue application performance on the Wilkes cluster; higher is better. Annotated gains of 102% and 20%.]
Source: http://www.hpcadvisorycouncil.com/pdf/hoomdblue_analysis_and_profiling.pdf

OPEN MPI CUDA-AWARE
Work continues to improve performance:
Better interaction with streams
Better small-message performance (eager protocol)
CUDA-aware support for the new RMA
Support for reduction operations
Support for non-blocking collectives
Try it! There is lots of information in the FAQ at the Open MPI web site.
Questions?