Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters


Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters
Ching-Hsiang Chu (1), Khaled Hamidouche (1), Hari Subramoni (1), Akshay Venkatesh (1), Bracy Elton (2), and Dhabaleswar K. (DK) Panda (1)
(1) Department of Computer Science and Engineering, The Ohio State University; (2) Engility Corporation
DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. 88ABW-2016-5574

Outline
Introduction
Proposed Designs
Performance Evaluation
Conclusion and Future Work

Drivers of Modern HPC Cluster Architectures
Multi-core processors are ubiquitous.
High-performance interconnects: InfiniBand offers <1 µs latency and >100 Gbps bandwidth and is very popular in HPC clusters.
Accelerators/coprocessors provide high compute density and high performance per watt (>1 Tflop/s double precision on a chip) and are becoming common in high-end systems.
Together they push the envelope towards exascale computing (e.g., Tianhe-2, Titan, Stampede, Tianhe-1A).

IB and GPU in HPC Systems
Growth of IB and GPU clusters in the last 3 years: IB is the major commodity network adapter used, and NVIDIA GPUs boost 18% of the top 50 of the Top 500 systems as of June 2016.
[Chart: system share in the Top 500 (%), June 2013 through June 2016, for GPU clusters and InfiniBand clusters. Data from the Top500 list (http://www.top500.org).]

Motivation
Streaming applications on HPC systems combine (1) communication (MPI broadcast-type operations) and (2) computation (CUDA), using HPC resources for real-time analytics.
A data source streams data in real time to a sender, which performs data streaming-like broadcast operations to multiple worker nodes.
Examples: deep learning frameworks, proton computed tomography (pCT).
[Figure: a real-time data source feeding a sender node that broadcasts to multiple worker nodes.]

Motivation
Streaming applications on HPC systems require heterogeneous broadcast-type communication: data usually come from a live source and are stored in host memory, but need to be sent to remote GPU memories for computing.
This requires data movement from host memory to remote GPU memories, i.e., a host-to-device (H-D) heterogeneous broadcast, which becomes the performance bottleneck.

Motivation
Requirements for streaming applications on HPC systems: low latency, high throughput, and scalability, while freeing up Peripheral Component Interconnect Express (PCIe) bandwidth for application needs.
[Figure: data streaming-like broadcast operations to multiple worker nodes.]

Motivation
Technologies we have:
NVIDIA GPUDirect [1]: remote direct memory access (RDMA) transfers between GPUs and other PCIe devices (GPUDirect RDMA, GDR), peer-to-peer transfers between GPUs, and more (see the registration sketch after this slide).
InfiniBand (IB) hardware multicast (IB MCAST) [2]: enables efficient designs of homogeneous broadcast operations, Host-to-Host [3] and GPU-to-GPU [4].
[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec 2014.
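As an illustration of how GPUDirect RDMA is typically exposed to a communication library, the minimal sketch below (not from the paper) registers a CUDA device buffer directly with an InfiniBand HCA so the NIC can DMA to and from GPU memory without staging through the host. It assumes a GDR-capable driver stack (e.g., the nv_peer_mem kernel module) and an already-opened protection domain `pd`; the helper name `register_gpu_buffer` is hypothetical.

    /* Minimal sketch: registering GPU memory for GPUDirect RDMA.
     * Assumes a GDR-capable stack and an existing protection domain `pd`. */
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes, void **gpu_buf)
    {
        /* Allocate the destination buffer directly in GPU memory. */
        if (cudaMalloc(gpu_buf, bytes) != cudaSuccess)
            return NULL;

        /* With GPUDirect RDMA, the device pointer can be registered like host
         * memory; the HCA then reads/writes GPU memory over PCIe peer-to-peer,
         * avoiding extra copies through host memory. */
        return ibv_reg_mr(pd, *gpu_buf, bytes,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    }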

Problem Statement
Can we design a high-performance heterogeneous broadcast for streaming applications that supports host-to-device broadcast operations?
Can we also design an efficient broadcast for multi-GPU systems?
Can we combine GPUDirect and IB technologies to avoid extra data movements for better performance and to increase the host-device (H-D) PCIe bandwidth available for application use?

Outline
Introduction
Proposed Designs: Heterogeneous Broadcast with GPUDirect RDMA (GDR) and InfiniBand (IB) Hardware Multicast; Intra-node Topology-Aware Broadcast for Multi-GPU Systems
Performance Evaluation
Conclusion and Future Work

Proposed Heterogeneous Broadcast
Key requirement of IB MCAST: the control header needs to be stored in host memory.
SL-based approach: combine CUDA GDR and IB MCAST features, and take advantage of the IB Scatter-Gather List (SGL) feature to multicast two separate addresses (control header on the host + data on the GPU) in a single IB message (a send-side sketch follows this slide).
The HCA reads/writes directly from/to GPU memory using the GDR feature, enabling low-latency, zero-copy schemes.
Avoiding the extra copy between host and GPU frees up PCIe bandwidth for application needs; employing the IB MCAST feature increases scalability.
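To make the SGL idea concrete, here is a minimal send-side sketch (an illustration under assumptions, not the paper's implementation) of posting one UD multicast message whose gather list combines a control header in host memory with a payload chunk in GDR-registered GPU memory. The UD QP, multicast address handle, Q_Key, and memory registrations are assumed to have been set up elsewhere, the struct bcast_ctrl_hdr layout is hypothetical, and a UD message must fit in the path MTU, so large broadcasts would be chunked.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Hypothetical control header carried alongside each data chunk. */
    struct bcast_ctrl_hdr { uint64_t offset; uint32_t len; uint32_t flags; };

    /* Post one UD multicast send that gathers a host-resident control header
     * and a GDR-registered GPU data chunk into a single IB message. */
    int post_sgl_mcast(struct ibv_qp *ud_qp, struct ibv_ah *mcast_ah,
                       uint32_t mcast_qkey, struct bcast_ctrl_hdr *hdr,
                       struct ibv_mr *host_mr, void *gpu_chunk,
                       uint32_t chunk_len, struct ibv_mr *gpu_mr)
    {
        struct ibv_sge sge[2];
        sge[0].addr   = (uintptr_t)hdr;        /* control header in host memory  */
        sge[0].length = sizeof(*hdr);
        sge[0].lkey   = host_mr->lkey;
        sge[1].addr   = (uintptr_t)gpu_chunk;  /* data chunk in GPU memory (GDR) */
        sge[1].length = chunk_len;             /* UD: must fit in the path MTU   */
        sge[1].lkey   = gpu_mr->lkey;

        struct ibv_send_wr wr = {0}, *bad_wr = NULL;
        wr.sg_list           = sge;
        wr.num_sge           = 2;              /* one message, two source buffers */
        wr.opcode            = IBV_WR_SEND;
        wr.send_flags        = IBV_SEND_SIGNALED;
        wr.wr.ud.ah          = mcast_ah;       /* address handle of the MCAST group */
        wr.wr.ud.remote_qpn  = 0xFFFFFF;       /* multicast destination QPN         */
        wr.wr.ud.remote_qkey = mcast_qkey;

        return ibv_post_send(ud_qp, &wr, &bad_wr);
    }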

Proposed Heterogeneous Broadcast
Overview of the SL-based approach.
[Figure: in one SL step, the source HCA gathers the control header from host memory and the data from GPU memory and multicasts them through the IB switch; on each of nodes 1..N, the receiving HCA scatters the control header to the host and the data directly into GPU memory.]

Broadcast on Multi-GPU Systems
Existing two-level approach: inter-node, the proposed SL-based design can be applied; intra-node, host-based shared memory is used.
Issues of the host-to-device cudaMemcpy: (1) it is expensive, and (2) it consumes PCIe bandwidth between the host and the GPUs.
[Figure: source multicasting over the IB switch; within each node, data are copied from the host to GPUs 0..N with cudaMemcpy (Host to Device).]

Broadcast on Multi-GPU Systems
Proposed Intra-node Topology-Aware Broadcast: use CUDA Inter-Process Communication (IPC).
[Figure: source multicasting over the IB switch; within each node, data are copied between GPUs 0..N with cudaMemcpy (Device to Device) via CUDA IPC.]

Broadcast on Multi-GPU Systems
Proposed Intra-node Topology-Aware Broadcast: the leader GPU keeps a copy of the data; synchronization between GPUs uses a one-byte flag in shared memory on the host; non-leader GPUs copy the data using CUDA IPC, which frees up host-device PCIe bandwidth (a minimal IPC sketch follows this slide).
Other topology-aware designs (ring, k-nomial, etc.) are available and selected by dynamic tuning.
[Figure: GPU 0 holds RecvBuf and CopyBuf; GPUs 1..N copy into their RecvBuf via IPC, coordinated through a shared-memory region in host memory.]
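A minimal sketch of the intra-node scheme, assuming one process per GPU, a pre-created host shared-memory region for exchanging the IPC handle and the one-byte ready flag, and omitted error handling; the names ipc_shm, leader_publish, and nonleader_copy are hypothetical illustrations, not the library's API.

    #include <cuda_runtime.h>

    /* Hypothetical host shared-memory layout used for intra-node coordination. */
    struct ipc_shm {
        cudaIpcMemHandle_t handle;  /* exported handle of the leader's copy buffer */
        volatile char      ready;   /* one-byte flag: data is ready to be copied   */
    };

    /* Leader GPU process: expose its device copy buffer and signal readiness. */
    void leader_publish(struct ipc_shm *shm, void *copy_buf)
    {
        cudaIpcGetMemHandle(&shm->handle, copy_buf);
        __sync_synchronize();       /* make the handle visible before the flag */
        shm->ready = 1;
    }

    /* Non-leader GPU process: map the leader's buffer, copy device-to-device. */
    void nonleader_copy(struct ipc_shm *shm, void *recv_buf, size_t bytes)
    {
        while (!shm->ready)         /* wait on the one-byte flag in host memory */
            ;
        void *leader_buf = NULL;
        cudaIpcOpenMemHandle(&leader_buf, shm->handle,
                             cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(recv_buf, leader_buf, bytes, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(leader_buf);
    }

The device-to-device copy keeps the intra-node traffic off the host-device PCIe path, which is the resource the scheme aims to preserve for the application.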

Outline
Introduction
Proposed Designs
Performance Evaluation: OSU Micro-Benchmark (OMB) level evaluation; streaming benchmark level evaluation
Conclusion and Future Work

Experimental Environments
1. Wilkes cluster @ University of Cambridge (http://www.hpc.cam.ac.uk/services/wilkes): 2 NVIDIA K20c GPUs per node; used up to 32 nodes.
2. CSCS cluster @ Swiss National Supercomputing Centre (http://www.cscs.ch/computers/kesch_escha/index.html): Cray CS-Storm system with 8 NVIDIA K80 cards per node (= 16 NVIDIA Kepler GK210 chips per node); used up to 88 NVIDIA K80 cards (176 chips) over 11 nodes.
Benchmarks: a modified Ohio State University (OSU) Micro-Benchmark (OMB) suite (http://mvapich.cse.ohio-state.edu/benchmarks/), using the osu_bcast MPI_Bcast latency test modified to support heterogeneous broadcast; and a streaming benchmark that mimics real streaming applications, continuously broadcasting data from a source to GPU-based compute nodes and including a computation phase with host-to-device and device-to-host copies.

Overview of the MVAPICH2 Project
High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE).
MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002; MVAPICH2-X (MPI + PGAS), available since 2011; support for GPUs (MVAPICH2-GDR), available since 2014; support for MIC (MVAPICH2-MIC), available since 2014; support for virtualization (MVAPICH2-Virt), available since 2015; support for energy-awareness (MVAPICH2-EA), available since 2015.
Used by more than 2,675 organizations in 83 countries; more than 391,000 (>0.39 million) downloads directly from the OSU site.
Empowering many TOP500 clusters (June 2016 ranking): 12th-ranked 462,462-core cluster (Stampede) at TACC; 15th-ranked 185,344-core cluster (Pleiades) at NASA; 31st-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology.
Available with the software stacks of many vendors and Linux distros (RedHat and SuSE). http://mvapich.cse.ohio-state.edu
Empowering Top500 systems for over a decade: from System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 Tflop/s) to Stampede at TACC (12th in June 2016, 462,462 cores, 5.168 Pflop/s).

OMB Heterogeneous Inter-node Broadcast @ Wilkes
Compared the proposed SL-based design (SL-MCAST) to homogeneous broadcast designs with explicit data transfers (GPU-MCAST, Host-MCAST).
Reduces latency by up to 56% for small messages and 39% for large messages, with no extra data transfers between host and GPU memories.
[Charts: broadcast latency (μs) vs. message size, 1 B to 4 MB, for SL-MCAST, GPU-MCAST, and Host-MCAST.]

OMB SL-based Approach: Inter-node Broadcast Scalability on Wilkes
IB hardware multicast provides good scalability.
[Chart: broadcast latency (μs) vs. system size (2 to 32 nodes) for SL-MCAST, GPU-MCAST, and Host-MCAST.]

OMB Inter- and Intra-node Broadcast @ CSCS
SL-based inter-node broadcast combined with topology-aware intra-node broadcast (IPC SL-MCAST) vs. host-based shared memory (SHMEM SL-MCAST).
Up to 58% latency reduction for small messages and 79% for large messages, with no extra data transfers between host and GPU memories.
[Charts: broadcast latency (μs) vs. message size, 1 B to 4 MB, for IPC SL-MCAST and SHMEM SL-MCAST.]

Streaming Benchmark Execution Time @ CSCS
Utilizes IPC-based device-to-device data transfers for streaming applications on multi-GPU systems.
Up to 26% improvement in execution time for small messages and 67% for large messages.
[Charts: execution time (s) vs. message size, 1 B to 4 MB, for IPC SL-MCAST and SHMEM SL-MCAST.]

Streaming Benchmark Throughput @ CSCS
Increases the availability of PCIe host-device resources: IPC-based device-to-device data transfers free up PCIe bandwidth between host and devices for applications, with up to 3.2X higher available PCIe throughput.
[Chart: PCIe throughput (GB/s) vs. message size, 1 B to 4 MB, for IPC SL-MCAST and SHMEM SL-MCAST.]

Outline
Introduction
Proposed Designs
Performance Evaluation
Conclusion and Future Work

Conclusion
Combines NVIDIA GPUDirect technology and InfiniBand (IB) hardware multicast for GPU-enabled streaming applications, and further proposes an intra-node topology-aware scheme that exploits CUDA IPC for multi-GPU systems.
Achieves a 2X improvement over state-of-the-art schemes with the Ohio State University (OSU) Micro-Benchmarks (OMB).
Achieves up to a 67% improvement in execution time and 3.5X higher throughput in a synthetic streaming benchmark.
Indicates that applying this approach to a streaming application, such as proton computed tomography (pCT) or a deep learning framework, is promising.

Future Work
Include the designs in future releases of the MVAPICH2-GDR library and improve reliability.
Evaluate effectiveness with streaming applications, such as proton computed tomography (pCT) and deep learning frameworks.
Extend the designs to other collective operations (Allreduce, Gather, etc.) as well as non-blocking operations.

Thank You!
Ching-Hsiang Chu, chu.368@osu.edu
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017, Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.

Streaming Benchmark
Mimics the behavior of a streaming application: continuously broadcasts data from a source to GPU-based compute nodes and includes a computation phase that involves host-to-device and device-to-host copies.

    /* h_buf and d_buf: buffers in host and GPU memory, respectively. */
    for (iter = 0; iter < max_iter; iter++) {
        /* Computation-phase copy that competes for H-D PCIe bandwidth. */
        cudaMemcpyAsync(..., cudaMemcpyHostToDevice, cpy_stream);
        if (rank == root)
            MPI_Bcast(h_buf, ...);          /* root broadcasts from host memory */
        else
            MPI_Bcast(d_buf, ...);          /* workers receive into GPU memory  */
        dummy_kernel<<<...>>>(d_buf, ...);  /* computation on the GPU           */
        /* Computation-phase copy back to the host. */
        cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, cpy_stream);
    }