High-Performance Broadcast for Streaming and Deep Learning
Ching-Hsiang Chu (chu.368@osu.edu)
Department of Computer Science and Engineering, The Ohio State University
OSU Booth, SC17
Outline
- Introduction
- Proposed Designs in MVAPICH2-GDR
- Performance Evaluation
- Concluding Remarks
Trends in Modern HPC Architecture
- Multi-core/many-core technologies
- High-performance interconnects: InfiniBand (IB), Omni-Path; < 1 μs latency, 100 Gbps bandwidth
- Accelerators/coprocessors: high compute density, high performance/watt, > 1 TFLOP/s double precision on a chip; becoming common in high-end systems
- High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM
- Examples: Sunway TaihuLight, K Computer, Tianhe-2, Titan
Architectures for Deep Learning (DL)
- Past and current trends:
  - Multi-core CPUs within a node
  - Multi-core CPUs across nodes, connected by networks
  - Multi-core CPUs + multi-GPU within a node (e.g., NVIDIA DGX-1 systems)
  - Multi-core CPUs + single GPU across nodes
- Near future:
  - Multi-core CPUs + multi-GPU across nodes
Streaming Applications
- Streaming applications on HPC systems involve two phases:
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA)
- Using HPC resources for real-time analytics: a sender distributes real-time data to multiple worker nodes through streaming-like broadcast operations
High-Performance Deep Learning
- Computation using GPUs; communication using MPI
- Partial gradients are exchanged after each minibatch: all-to-all (multi-source) communication, e.g., MPI_Bcast rooted at every node in turn (see the sketch below)
- Challenges:
  - High computation-communication overlap
  - Good scalability for upcoming large-scale clusters
  - No application-level modification
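A minimal sketch of the multi-source gradient exchange described above, not MVAPICH2-GDR internals: every rank broadcasts its own partial-gradient buffer. The buffer layout and function name are illustrative assumptions; passing device pointers straight to MPI_Bcast requires a CUDA-aware MPI such as MVAPICH2-GDR.

```c
#include <mpi.h>

/* d_grads[r] is a device buffer holding rank r's partial gradients;
 * every rank keeps a copy of all of them (sizes are illustrative). */
void exchange_gradients(float **d_grads, int count, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    /* One broadcast per source: rank `root` sends its own gradients,
     * everyone else receives into their copy of d_grads[root]. */
    for (int root = 0; root < size; root++)
        MPI_Bcast(d_grads[root], count, MPI_FLOAT, root, comm);
}
```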
Hardware Multicast-Based Broadcast
- For GPU-resident data, combines:
  - GPUDirect RDMA (GDR)
  - InfiniBand hardware multicast (IB-MCAST)
- Three steps: 1. gather + GDR read at the source; 2. hardware multicast through the switch; 3. scatter + GDR write at each destination
- Overheads: the IB UD packet-size limit and the GDR read limit
- A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec. 2014.
Hardware Multicast-Based Broadcast (cont'd)
- Heterogeneous broadcast for streaming applications: the multicast steps deliver data to the remote hosts, followed by a final step to the GPUs within each node
- Frees up PCIe resources
- C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
Optimized Broadcast Send
- Prepare an intermediate buffer (im_buf):
  - A page-locked (pinned) host buffer, for fast device-to-host data movement
  - Allocated in the initialization phase, for low overhead
- Stream data through the host:
  - Fine-tuned data chunks and asynchronous copy operations
  - Three-stage pipeline: 1. preparation, 2. gather, 3. hardware multicast (see the sketch after this list)
- Flow: MPI_Bcast(d_out, ...) at the source copies chunks from d_out into im_buf, then multicasts them through the HCA and switch
- C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
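A minimal sketch of the chunked staging described above, assuming im_buf is a pinned buffer from cudaMallocHost at least as large as the message; send_mcast_chunk() is a hypothetical stand-in for the IB-MCAST send, which in reality is internal to the MPI library, and the chunk size is a tuning assumption.

```c
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stddef.h>

#define CHUNK (512 * 1024)  /* illustrative chunk size; tuned in practice */

void send_mcast_chunk(const void *buf, size_t len);  /* hypothetical */

void bcast_send_pipelined(const char *d_out, size_t bytes,
                          char *im_buf /* pinned, >= bytes */,
                          cudaStream_t stream)
{
    size_t nchunks = (bytes + CHUNK - 1) / CHUNK;
    cudaEvent_t *done = (cudaEvent_t *)malloc(nchunks * sizeof(cudaEvent_t));

    /* Stages 1-2: asynchronously gather chunks into the pinned host buffer. */
    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * CHUNK;
        size_t len = bytes - off < CHUNK ? bytes - off : CHUNK;
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
        cudaMemcpyAsync(im_buf + off, d_out + off, len,
                        cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(done[i], stream);
    }
    /* Stage 3: multicast each chunk as soon as its copy completes. */
    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * CHUNK;
        size_t len = bytes - off < CHUNK ? bytes - off : CHUNK;
        cudaEventSynchronize(done[i]);
        send_mcast_chunk(im_buf + off, len);
        cudaEventDestroy(done[i]);
    }
    free(done);
}
```

Recording one event per chunk lets the multicast of chunk i overlap the device-to-host copy of chunk i+1, which is the point of the three-stage pipeline.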
Optimized Broadcast Receive
- Zero-copy broadcast receive:
  - The user buffer (d_in) passed to MPI_Bcast(d_in, ...) is pre-posted, avoiding additional data movement
  - Leverages the IB scatter and GDR write features: multicast packets land directly in d_in at each destination
- Low latency; frees up PCIe resources for applications
- C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
Broadcast on Multi-GPU Systems
- Proposed intra-node topology-aware broadcast
- Uses CUDA Inter-Process Communication (IPC): after the inter-node multicast steps, data moves between GPUs 0..N within a node via cudaMemcpy (device to device); see the sketch below
- C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
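A minimal sketch of intra-node GPU-to-GPU sharing with CUDA IPC, assuming the node leader owns the received broadcast data in a device buffer and shares a handle with peer processes on the same node; the handle exchange via MPI is illustrative, and in MVAPICH2-GDR this is handled inside the library.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stddef.h>

/* Leader (rank 0 of node_comm): export a handle to its device buffer. */
void leader_export(void *d_buf, MPI_Comm node_comm)
{
    cudaIpcMemHandle_t h;
    cudaIpcGetMemHandle(&h, d_buf);
    MPI_Bcast(&h, sizeof(h), MPI_BYTE, 0, node_comm);
}

/* Peer: map the leader's buffer, then copy device-to-device. */
void peer_import(void *d_mine, size_t bytes, MPI_Comm node_comm)
{
    cudaIpcMemHandle_t h;
    MPI_Bcast(&h, sizeof(h), MPI_BYTE, 0, node_comm);

    void *d_leader = NULL;
    cudaIpcOpenMemHandle(&d_leader, h, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_mine, d_leader, bytes, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_leader);
}
```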
Efficient Reliability Support for IB-MCAST
- IB-MCAST uses the unreliable datagram (UD) transport, so multicast packets can be lost
- When a receiver experiences a timeout (a lost MCAST packet), it performs an RMA Get operation on the sender's backup buffer to retrieve the lost packets (see the sketch below)
- The broadcast sender is not interrupted
- C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.
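A conceptual sketch of the receiver-driven recovery, assuming the sender exposes its backup buffer through an MPI window (win) and that the chunk bookkeeping (seq, chunk_bytes) is illustrative; the real protocol lives inside the MPI library, not at the application level.

```c
#include <mpi.h>
#include <stddef.h>

void recover_lost_chunk(void *local_buf, int seq, size_t chunk_bytes,
                        int sender_rank, MPI_Win win)
{
    /* Pull the missed chunk directly from the sender's backup buffer;
     * one-sided, so the sender's broadcast path is not interrupted. */
    MPI_Win_lock(MPI_LOCK_SHARED, sender_rank, 0, win);
    MPI_Get(local_buf, (int)chunk_bytes, MPI_BYTE, sender_rank,
            (MPI_Aint)seq * (MPI_Aint)chunk_bytes,
            (int)chunk_bytes, MPI_BYTE, win);
    MPI_Win_unlock(sender_rank, win);
}
```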
Experimental Environments
- OSU Micro-Benchmarks (OMB): http://mvapich.cse.ohio-state.edu/benchmarks/
  - osu_bcast: MPI_Bcast latency test
  - osu_bcast_streaming: MPI_Bcast streaming test
  - (A simplified sketch of such a latency loop follows this list)
- Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*
  - AlexNet and VGG models with the ImageNet dataset
* D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016, pp. 144-151.
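A minimal sketch of a CUDA-aware MPI_Bcast latency measurement in the spirit of osu_bcast, not the actual OMB source; the iteration counts and the 2 MB device allocation are illustrative assumptions.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t bytes = 2 * 1024 * 1024;   /* e.g., the 2 MB case shown later */
    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    int iters = 100, warmup = 10;
    double t0 = 0.0;
    for (int i = 0; i < iters + warmup; i++) {
        if (i == warmup) {            /* start timing after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
        }
        /* Device pointer passed straight to MPI: needs CUDA-aware MPI. */
        MPI_Bcast(d_buf, (int)bytes, MPI_BYTE, 0, MPI_COMM_WORLD);
    }
    double avg_us = (MPI_Wtime() - t0) / iters * 1e6;
    if (rank == 0)
        printf("avg bcast latency: %.2f us\n", avg_us);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```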
Benchmark Evaluation @ RI2 Cluster (16 GPUs, 1 GPU/node)
[Figure: MPI_Bcast latency (μs, log scale) vs. message size, 4 KB to 16 MB, for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; lower is better. MCAST-GDR hits the GDR read limit at large messages.]
[Figure: latency vs. number of nodes (2-16) for a 2 MB message; MCAST-GDR-Opt stays near-constant.]
- Provides near-constant latency across system sizes
- Reduces latency by up to 65% for large messages
- C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
Streaming Benchmark @ CSCS Cluster (88 GPUs)
[Figures: streaming broadcast latency (μs) vs. message size, 1 B to 4 MB, for MCAST-GDR and MCAST-GDR-Opt; lower is better.]
- IB-MCAST + GDR + topology-aware IPC-based schemes
- Up to 58% latency reduction for small messages and 79% for large messages
- C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
Deep Learning Frameworks @ RI2 Cluster (16 GPUs, 1 GPU/node)
- CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK), with no application-level modification
[Figures: training time (s) vs. number of nodes (8, 16) for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; lower is better.]
- Reduces training time by up to 24% and 15% for the AlexNet and VGG models, respectively
- Higher improvement is expected at larger system sizes
Concluding Remarks
- High-performance broadcast schemes that leverage GDR and IB-MCAST for streaming and deep learning applications:
  - Optimized streaming design for large-message transfers
  - High-performance reliability support for IB-MCAST
- These features are included in MVAPICH2-GDR 2.3a:
  - http://mvapich.cse.ohio-state.edu/
  - http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3a/
Thank You!
Ching-Hsiang Chu (chu.368@osu.edu)
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

[1] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
[2] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
[3] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.
[4] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," submitted to IEEE TPDS (under review).
Join us for more tech talks from the MVAPICH2 team: http://mvapich.cse.ohio-state.edu/conference/677/talks/
Evaluation Parameters

Notation | Meaning                                                       | Unit
n        | Number of processes                                           | N/A
m        | Number of broadcast sources                                   | N/A
t_s      | Set-up time for sending data                                  | sec
t_o(n)   | Overhead for issuing an IB-MCAST packet                       | sec
M        | Original message size                                         | bytes
C        | Size of a data chunk                                          | bytes
U        | Maximum transmission unit for IB-MCAST (set by the hardware)  | bytes
B_H      | Bandwidth of reading host memory                              | bytes/sec
B_G      | PCIe bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA)  | bytes/sec
B_PCIe   | PCIe bandwidth between host and GPU memory                    | bytes/sec
Ring-Based Broadcast
- Cost models (notation from the Evaluation Parameters table):
  - Direct:   $(n-1)\left(t_s + \frac{M}{B_G}\right)$
  - Pipeline: $\left(\frac{M}{C} + n - 2\right)\left(t_s + \frac{C}{B_G}\right)$
  - Staging:  $\frac{M}{B_{PCIe}} + (n-1)\left(t_s + \frac{M}{B_H}\right)$
[Figure: the source GPU forwards data to Destinations 1, 2, and 3 in a chain of GDR reads, network transfers, and GDR writes.]
- Poor scalability: latency grows linearly with the number of processes n
K-nomial-Based Broadcast
- Cost models:
  - Direct:   $\log_k n \left(t_s + \frac{M}{B_G}\right)$
  - Pipeline: $\frac{M}{C}\log_k n \left(t_s + \frac{C}{B_G}\right)$
  - Staging:  $\frac{M}{B_{PCIe}} + \log_k n \left(t_s + \frac{M}{B_H}\right)$
[Figure: the source GPU sends to destinations along a k-nomial tree of GDR reads, network transfers, and GDR writes.]
- Non-optimized scalability: latency still grows with $\log_k n$ (a worked comparison follows)
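To make the scalability gap concrete, here is a worked instance of the Direct terms above for $n = 16$ nodes and $k = 2$, with the message and bandwidth terms left symbolic:

$$\text{Ring: } (16-1)\left(t_s + \frac{M}{B_G}\right) = 15\left(t_s + \frac{M}{B_G}\right), \qquad \text{K-nomial: } \log_2 16 \cdot \left(t_s + \frac{M}{B_G}\right) = 4\left(t_s + \frac{M}{B_G}\right)$$

Both costs grow with $n$, which is what motivates IB-MCAST: the switch replicates each packet, so the sender-side cost no longer scales with the number of destinations (the near-constant latency seen in the RI2 evaluation).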
Overlap Opportunities
- Within a node: cudaMemcpyAsync staging copies, hardware multicast, cudaStreamSynchronize, and GDR writes can be interleaved
- Across nodes: broadcasts rooted at Node A, Node B, and Node C proceed concurrently (see the sketch below)
[Figure: timeline showing the broadcasts from Nodes A, B, and C overlapping across their HCAs.]
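A minimal sketch of the within-node overlap, assuming each in-flight broadcast gets its own CUDA stream so the staging copy of one broadcast overlaps the multicast send of another; send_mcast_chunk() is the same hypothetical stand-in used earlier, not a library API.

```c
#include <cuda_runtime.h>
#include <stddef.h>

void send_mcast_chunk(const void *buf, size_t len);   /* hypothetical */

/* Two in-flight broadcasts: while stream s[0] stages data for broadcast 0,
 * the CPU pushes the already-staged data of broadcast 1 to the HCA, and
 * vice versa. */
void overlapped_bcasts(const char *d_src[2], char *im_buf[2] /* pinned */,
                       size_t len, cudaStream_t s[2])
{
    for (int b = 0; b < 2; b++)
        cudaMemcpyAsync(im_buf[b], d_src[b], len,
                        cudaMemcpyDeviceToHost, s[b]);
    for (int b = 0; b < 2; b++) {
        cudaStreamSynchronize(s[b]);       /* wait only for broadcast b */
        send_mcast_chunk(im_buf[b], len);  /* overlaps the other copy */
    }
}
```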
MCAST-Based Broadcast: Background
- NVIDIA GPUDirect [1]: remote direct memory access (RDMA) transfers between GPUs and other PCIe devices (GDR), among other features
- InfiniBand (IB) hardware multicast (IB-MCAST) [2]: enables efficient designs of broadcast operations
  - Host-based [3]
  - GPU-based [4]
[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun. 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," IPDPS 2004, Apr. 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec. 2014.
Future Work
- Extend the designs to other broadcast-based collective algorithms and to non-blocking operations: Allreduce, Allgather, and so on