High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University

Outline: Introduction; Proposed Designs in MVAPICH2-GDR; Performance Evaluation; Concluding Remarks

Trends in Modern HPC Architecture: multi-core/many-core technologies; high-performance interconnects such as InfiniBand (IB) and Omni-Path (< 1 μsec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, > 1 Tflop/s double-precision on a chip) becoming common in high-end systems; and high-performance storage and compute devices (SSD, NVMe-SSD, NVRAM). Representative systems: Sunway TaihuLight, K Computer, Tianhe-2, Titan.

Architectures for Deep Learning (DL). Past and current trend: multi-core CPUs within a node; multi-core CPUs across nodes; multi-core CPUs + single GPU across nodes. Near future: multi-core CPUs + multi-GPU within a node (e.g., NVIDIA DGX-1 systems); multi-core CPUs + multi-GPU across nodes.

Streaming Applications. Streaming applications on HPC systems use HPC resources for real-time analytics: a source streams data in real time to a sender, which performs streaming-like broadcast operations to multiple nodes acting as workers. Each iteration involves 1. communication (MPI broadcast-type operations) and 2. computation (CUDA), as sketched below.
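
A minimal sketch of this streaming pattern, assuming a CUDA-aware MPI (such as MVAPICH2-GDR) that accepts GPU device pointers in MPI_Bcast; FRAME_BYTES, acquire_frame(), and process_frame_on_gpu() are illustrative placeholders, not names from the slides or the library:

```c
/*
 * Streaming broadcast loop: the source ingests a frame into GPU memory,
 * broadcasts it to all worker ranks, and each worker runs its CUDA step.
 */
#include <mpi.h>

#define FRAME_BYTES (4 * 1024 * 1024)            /* illustrative frame size */

extern void acquire_frame(void *d_frame);        /* source: fill the GPU buffer */
extern void process_frame_on_gpu(void *d_frame); /* workers: CUDA computation   */

void stream_frames(void *d_frame, int num_frames, int source, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    for (int i = 0; i < num_frames; i++) {
        if (rank == source)
            acquire_frame(d_frame);              /* ingest the next frame */

        /* 1. Communication: broadcast the GPU-resident frame to all workers. */
        MPI_Bcast(d_frame, FRAME_BYTES, MPI_BYTE, source, comm);

        /* 2. Computation: each worker analyzes the frame on its GPU. */
        if (rank != source)
            process_frame_on_gpu(d_frame);
    }
}
```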

High-Performance Deep Learning: computation runs on GPUs, communication uses MPI. Partial gradients are exchanged after each minibatch, producing all-to-all (multi-source) communication across nodes, e.g., an MPI_Bcast rooted at every node (see the sketch below). Challenges: high computation-communication overlap, good scalability for upcoming large-scale clusters, and no application-level modification.
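
A minimal sketch of the multi-source broadcast pattern described above, assuming a CUDA-aware MPI; grad_bufs[] and num_elems are illustrative names, not from CA-CNTK or MVAPICH2-GDR:

```c
/*
 * After each minibatch, every rank broadcasts its partial gradients (held in
 * GPU memory) with itself as root, so all ranks see all partial gradients.
 */
#include <mpi.h>

void exchange_gradients(float *grad_bufs[], int num_elems, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    /* grad_bufs[r] is a GPU buffer: rank r's own slot holds its local partial
     * gradients; the other slots are filled by the corresponding broadcasts. */
    for (int root = 0; root < size; root++)
        MPI_Bcast(grad_bufs[root], num_elems, MPI_FLOAT, root, comm);
}
```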

Outline: Introduction; Proposed Designs in MVAPICH2-GDR; Performance Evaluation; Concluding Remarks

Hardware Multicast-based Broadcast. For GPU-resident data, combine GPUDirect RDMA (GDR) with InfiniBand hardware multicast (IB-MCAST). Overheads: the UD MTU limit and the GDR read limit. The broadcast proceeds in three steps: 1. gather + GDR read at the source HCA; 2. hardware multicast through the switch; 3. scatter + GDR write at each destination (Destination 1 ... Destination N). [A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec. 2014.]

Hardware Multicast-based Broadcast (cont'd). Heterogeneous broadcast for streaming applications delivers data to both host and GPU memory and frees up PCIe resources: the source's HCA performs the multicast steps through the switch, followed by an SL step within each node (Node 1 ... Node N). [C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.]

Optimized Broadcast Send. Prepare an intermediate buffer (im_buf): a page-locked (pinned) host buffer for fast device-to-host data movement, allocated during the initialization phase for low overhead. Data is streamed through the host using fine-tuned chunk sizes and asynchronous copy operations, forming a three-stage pipeline. At the source, MPI_Bcast(d_out, ...) proceeds as: 1. preparation (copy from d_out into im_buf); 2. gather at the HCA; 3. hardware multicast through the switch. A sketch of the chunked, asynchronous staging step follows. [C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.]
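
A minimal sketch of chunked, asynchronous device-to-host staging through a pre-allocated pinned host buffer, assuming a send_chunk() hand-off to the multicast path; CHUNK_SIZE and send_chunk() are illustrative, and this is not the MVAPICH2-GDR implementation:

```c
#include <cuda_runtime.h>
#include <stdlib.h>

#define CHUNK_SIZE (512 * 1024)                      /* illustrative tuning knob */

extern void send_chunk(const void *buf, size_t len); /* e.g., hand off to IB-MCAST */

void staged_bcast_send(const void *d_out, size_t total, void *im_buf /* pinned */)
{
    size_t n = (total + CHUNK_SIZE - 1) / CHUNK_SIZE;
    cudaEvent_t *ready = malloc(n * sizeof *ready);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Issue all device-to-host copies asynchronously, recording an event per
     * chunk so we know when that chunk has landed in the pinned buffer. */
    for (size_t i = 0; i < n; i++) {
        size_t off = i * CHUNK_SIZE;
        size_t len = (total - off < CHUNK_SIZE) ? (total - off) : CHUNK_SIZE;
        cudaEventCreateWithFlags(&ready[i], cudaEventDisableTiming);
        cudaMemcpyAsync((char *)im_buf + off, (const char *)d_out + off,
                        len, cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(ready[i], stream);
    }

    /* Send each chunk as soon as its copy completes; later copies keep
     * progressing on the stream while earlier chunks are on the wire. */
    for (size_t i = 0; i < n; i++) {
        size_t off = i * CHUNK_SIZE;
        size_t len = (total - off < CHUNK_SIZE) ? (total - off) : CHUNK_SIZE;
        cudaEventSynchronize(ready[i]);
        send_chunk((char *)im_buf + off, len);
        cudaEventDestroy(ready[i]);
    }

    cudaStreamDestroy(&stream);
    free(ready);
}
```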

Optimized Broadcast Receive. Zero-copy broadcast receive into a pre-posted user buffer (d_in): this avoids additional data movement and leverages the IB scatter and GDR-write features, giving low latency and freeing up PCIe resources for applications. At each destination (Destination 1 ... Destination N), MPI_Bcast(d_in, ...) lets the HCA scatter the hardware-multicast data directly into d_in via GDR write. [C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.]

Broadcast on Multi-GPU Systems. Proposed intra-node topology-aware broadcast using CUDA Inter-Process Communication (IPC): the multicast steps deliver data across nodes (Node 1 ... Node N), and within each node the data is forwarded among GPUs 0 ... N with cudaMemcpy (device to device) over IPC, as sketched below. [C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.]
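
A minimal sketch of intra-node GPU-to-GPU forwarding with CUDA IPC, the mechanism named on this slide; the handle exchange between processes (e.g., via MPI) and error handling are omitted, and the function names are illustrative rather than the MVAPICH2-GDR code:

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Exporting process: create a shareable handle for its GPU buffer d_buf. */
cudaIpcMemHandle_t export_gpu_buffer(void *d_buf)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);   /* handle is then sent to the peer */
    return handle;
}

/* Importing process (another rank on the same node): map the peer's buffer
 * and copy from it into a local GPU buffer with a device-to-device copy. */
void import_and_copy(cudaIpcMemHandle_t handle, void *d_local, size_t bytes)
{
    void *d_peer = NULL;
    cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_local, d_peer, bytes, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_peer);
}
```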

Efficient Reliability Support for IB-MCAST. When a receiver experiences a timeout (a lost MCAST packet), it performs an RMA Get operation on the sender's backup buffer to retrieve the lost MCAST packets; the sender is not interrupted. A sketch of this receiver-driven recovery follows. [C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.]
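
A minimal sketch of the receiver-driven recovery idea using MPI-3 RMA: the broadcast sender exposes a backup buffer in an MPI window, and a receiver that times out on a multicast packet fetches the missed chunk with MPI_Get without interrupting the sender. CHUNK and the function names are illustrative assumptions, not the MVAPICH2-GDR protocol:

```c
#include <mpi.h>
#include <stddef.h>

#define CHUNK 8192                        /* illustrative multicast chunk size */

/* Called collectively by all ranks; only the sender exposes a non-empty
 * backup region, receivers can pass bytes == 0. */
MPI_Win expose_backup(void *backup_buf, MPI_Aint bytes, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(backup_buf, bytes, 1 /* displacement unit */,
                   MPI_INFO_NULL, comm, &win);
    return win;
}

/* Receiver side: on a timeout for chunk_id, pull that chunk from the sender's
 * backup buffer with passive-target synchronization. */
void recover_lost_chunk(MPI_Win win, int sender, int chunk_id, void *dst)
{
    MPI_Win_lock(MPI_LOCK_SHARED, sender, 0, win);
    MPI_Get(dst, CHUNK, MPI_BYTE,
            sender, (MPI_Aint)chunk_id * CHUNK, CHUNK, MPI_BYTE, win);
    MPI_Win_unlock(sender, win);          /* completes the get */
}
```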

Outline: Introduction; Proposed Designs in MVAPICH2-GDR; Performance Evaluation; Concluding Remarks

Experimental Environments. OSU Micro-Benchmarks (OMB), http://mvapich.cse.ohio-state.edu/benchmarks/: osu_bcast (MPI_Bcast latency test) and osu_bcast_streaming (MPI_Bcast streaming test). Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*, with AlexNet and VGG models on the ImageNet dataset. *D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU-Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016, pp. 144-151.

Benchmark Evaluation @ RI2 cluster, 16 GPUs, 1 GPU/node. [Figure: MPI_Bcast latency (μs, lower is better) vs. message size (4 KB-16 MB) and vs. number of nodes (2-16, 2 MB message) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; annotations mark where the GDR read limit is hit for large messages and the near-constant latency of the 2 MB scaling test.] The proposed design provides near-constant latency over the system sizes and reduces latency by up to 65% for large messages. [C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.]

Streaming Benchmark @ CSCS (88 GPUs). [Figure: streaming broadcast latency (μs) vs. message size (1 B-4 MB) for MCAST-GDR vs. MCAST-GDR-OPT, split into small- and large-message plots.] The IB-MCAST + GDR + topology-aware IPC-based schemes achieve up to 58% and 79% latency reduction for small and large messages, respectively. [C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.]

Deep Learning Frameworks @ RI2 cluster, 16 GPUs, 1 GPU/node: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without application modification. [Figure: training time (s, lower is better) on 8 and 16 nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt.] The proposed design reduces training time by up to 24% for AlexNet and 15% for VGG; higher improvement is expected for larger system sizes.

Concluding Remarks. High-performance broadcast schemes that leverage GDR and IB-MCAST features for streaming and deep learning applications; an optimized streaming design for large message transfers; and high-performance reliability support for IB-MCAST. These features are included in MVAPICH2-GDR 2.3a: http://mvapich.cse.ohio-state.edu/ and http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3a/

Thank You! Ching-Hsiang Chu, chu.368@osu.edu. The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/ Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
[1] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
[2] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
[3] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.
[4] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," submitted to IEEE TPDS (under review).

Thank You! Join us for more tech talks from the MVAPICH2 team: http://mvapich.cse.ohio-state.edu/conference/677/talks/ The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/ Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

Evaluation Parameters:
  n       - number of processes
  m       - number of broadcast sources
  t_s     - setup time for sending data (sec)
  t_o(n)  - overhead for issuing an IB-MCAST packet (sec)
  M       - original message size (bytes)
  C       - size of a data chunk (bytes)
  U       - Maximum Transmission Unit for IB-MCAST, provided by the hardware manufacturer (bytes)
  B_H     - bandwidth of reading host memory (bytes/sec)
  B_G     - bandwidth of reading GPU memory via NVIDIA GPUDirect RDMA (bytes/sec)
  B_PCIe  - PCIe bandwidth between host and GPU memory (bytes/sec)

Ring-based Broadcast. Cost model (using the parameters above):
  Direct:   $(n-1)\,(t_s + M/B_G)$
  Pipeline: $(M/C + n - 2)\,(t_s + C/B_G)$
  Staging:  $M/B_{PCIe} + (n-1)\,(t_s + M/B_H)$
The source performs a GDR read, data is forwarded hop by hop over the network to Destinations 1, 2, 3, ..., and each destination performs a GDR write; the cost grows linearly with n, giving poor scalability.

K-nomial-based Broadcast. Cost model:
  Direct:   $\lceil \log_k n \rceil\,(t_s + M/B_G)$
  Pipeline: $(M/C)\,\lceil \log_k n \rceil\,(t_s + C/B_G)$
  Staging:  $M/B_{PCIe} + \lceil \log_k n \rceil\,(t_s + M/B_H)$
The source performs a GDR read and forwards over the network to Destinations 1-3, each performing a GDR write; scalability improves over the ring but remains non-optimized, since the cost still grows with $\log_k n$.
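
To make the scalability gap concrete, here is a small worked comparison under the reconstructed cost models above (an illustration, not a figure from the slides), for the direct variant with n = 16 nodes and k = 2:

```latex
% Direct (non-pipelined) broadcast cost for n = 16, per the models above.
\begin{align*}
T_{\text{ring}}     &= (n-1)\left(t_s + \tfrac{M}{B_G}\right)
                     = 15\left(t_s + \tfrac{M}{B_G}\right), \\
T_{\text{k-nomial}} &= \lceil \log_k n \rceil \left(t_s + \tfrac{M}{B_G}\right)
                     = 4\left(t_s + \tfrac{M}{B_G}\right) \quad (k = 2).
\end{align*}
% Both costs grow with the node count (linearly vs. logarithmically), which is
% why the hardware-multicast designs evaluated earlier can deliver the
% near-constant broadcast latency shown in the benchmark results.
```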

Overlap Opportunities. [Timeline figure; legend: cudaMemcpyAsync, hardware multicast, cudaStreamSynchronize, GDR write.] Within a node, the asynchronous copy and synchronization steps overlap with the multicast and GDR-write steps; across nodes, broadcasts originating from Node A, Node B, and Node C proceed concurrently.

MCAST-based Broadcast. NVIDIA GPUDirect [1]: remote direct memory access (RDMA) transfers between GPUs and other PCIe devices (GDR and more). InfiniBand (IB) hardware multicast (IB-MCAST) [2] enables efficient designs of broadcast operations: host-based [3] and GPU-based [4].
[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Ch. 42, pp. 617-632, Jun. 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," IPDPS 2004, Apr. 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec. 2014.

Future Work. Extend the design to other broadcast-based collective algorithms as well as non-blocking operations: Allreduce, Allgather, and so on.