High-Performance Broadcast for Streaming and Deep Learning

1 High-Performance Broadcast for Streaming and Deep Learning
Ching-Hsiang Chu
Department of Computer Science and Engineering, The Ohio State University

2 Outline (OSU Booth - SC17)
- Introduction
- Proposed Designs in MVAPICH2-GDR
- Performance Evaluation
- Concluding Remarks

3 Trends in Modern HPC Architecture
- Multi-core/many-core processors
- High-performance interconnects: InfiniBand (IB), Omni-Path (< 1 μsec latency, 100 Gbps bandwidth)
- Accelerators/coprocessors: high compute density, high performance/watt, > 1 Tflop/s DP on a chip
- High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM
- Accelerators/coprocessors are becoming common in high-end systems
- Example systems: Sunway TaihuLight, K Computer, Tianhe-2, Titan

4 Architectures for Deep Learning (DL)
From past and current trends to the near future:
- Multi-core CPUs within a node
- Multi-core CPUs across nodes (IB networks)
- Multi-core CPUs + single GPU across nodes (IB networks)
- Multi-core CPUs + multi-GPU within a node (e.g., NVIDIA DGX-1 systems)
- Multi-core CPUs + multi-GPU across nodes (IB networks)

5 Streaming Applications
- Streaming applications on HPC systems
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA): multiple GPU nodes as workers
- HPC resources for real-time analytics
[Figure: a data source streams in real time to a sender, which issues streaming-like broadcast operations to a set of workers]

6 High-performance Deep Learning
- Computation using GPUs
- Communication using MPI
  - Exchanging partial gradients after each minibatch
  - All-to-all (multi-source) communications, e.g., MPI_Bcast
- Challenges
  - High computation-communication overlap
  - Good scalability for upcoming large-scale GPU clusters
  - No application-level modification
[Figure: four GPU nodes (Node 1 to Node 4) exchanging data with one another]
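The multi-source pattern on this slide can be illustrated with a small simulation. This is only a sketch: plain Python lists stand in for GPU buffers, the helper `bcast` stands in for MPI_Bcast, and the function names and gradient values are hypothetical. It shows that when every rank takes a turn as the broadcast root, each rank ends up with the same summed gradient.

```python
# Illustrative simulation only: plain Python lists stand in for GPU buffers,
# and bcast() stands in for MPI_Bcast. No MPI library is used here.

def bcast(buffers, root):
    """Copy the root's buffer to every rank, like a broadcast from `root`."""
    data = list(buffers[root])
    return [list(data) for _ in buffers]


def exchange_gradients(partial_grads):
    """Every rank takes a turn as broadcast root (multi-source broadcast).

    partial_grads[r] is the partial gradient computed by rank r.
    Returns, for each rank, the element-wise sum of all partial gradients.
    """
    n = len(partial_grads)
    width = len(partial_grads[0])
    totals = [[0.0] * width for _ in range(n)]
    for root in range(n):                      # one broadcast per source
        received = bcast(partial_grads, root)  # what each rank receives
        for rank in range(n):
            for i in range(width):
                totals[rank][i] += received[rank][i]
    return totals


# Hypothetical partial gradients from 4 ranks after one minibatch.
grads = [[1.0, 2.0], [0.5, 0.5], [2.0, 1.0], [0.5, 0.5]]
summed = exchange_gradients(grads)
print(summed[0])
```

With n ranks this issues n broadcasts per exchange, which is why broadcast latency and its scaling with n matter for training time.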

7 Outline
- Introduction
- Proposed Designs in MVAPICH2-GDR
- Performance Evaluation
- Concluding Remarks

8 Hardware Multicast-based Broadcast
- For GPU-resident data, using
  - GPUDirect RDMA (GDR)
  - InfiniBand Hardware Multicast (IB-MCAST)
- Overhead
  - IB UD limit
  - GDR limit
- Steps
  1. Gather + GDR Read
  2. Hardware Multicast
  3. Scatter + GDR Write
[Figure: the source (header and data) sends through its HCA and the switch to the HCAs of Destination 1 through Destination N, each receiving header and data]
Reference: A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec.
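One reading of the "UD limit" overhead, assuming it refers to a per-packet size cap such as the MTU (listed as U among the evaluation parameters), is that a large message has to be cut into many packets before it can be multicast. The sketch below shows only that splitting step. The function name, the 4096-byte MTU, and the 10,000-byte message are assumptions for illustration, and packet headers are ignored.

```python
def split_into_packets(message: bytes, mtu: int):
    """Split a message into consecutive packets of at most `mtu` bytes each."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


MTU = 4096                 # assumed per-packet payload limit, in bytes
message = bytes(10_000)    # hypothetical 10,000-byte message
packets = split_into_packets(message, MTU)
print(len(packets), [len(p) for p in packets])
```

The packet count is ceil(M / U), so a per-packet overhead is paid that many times for a message of M bytes.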

9 Hardware Multicast-based Broadcast (cont'd)
- Heterogeneous broadcast for streaming applications
  -> Frees up PCIe resources
[Figure: the source sends data through its HCA and the switch (multicast steps) to the HCAs of Node 1 through Node N, followed by an SL step on each node]
Reference: C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct.

10 Optimized Broadcast Send
- Preparing an intermediate buffer (im_buf)
  - Page-locked (pinned) host buffer -> fast device-host data movement
  - Allocated at the initialization phase -> low overhead
- Streaming data through the host
  - Fine-tuned chunked data
  - Asynchronous copy operations
  -> Three-stage pipeline
[Figure: MPI_Bcast(d_out, ...) at the source: 1. Preparation (d_out to im_buf), 2. Gather (header and im_buf at the HCA), 3. Hardware Multicast to the switch]
Reference: C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17.
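The benefit of a three-stage pipeline over chunked data can be sketched with a simple timing model. This is not the MVAPICH2-GDR implementation. The stage costs and chunk count are hypothetical, and the model assumes each stage handles one chunk at a time, in order.

```python
def pipelined_time(num_chunks, stage_times):
    """Finish time of the last chunk in an in-order pipeline.

    A chunk enters a stage once it has left the previous stage and the
    stage has finished the previous chunk.
    """
    finish = [0.0] * len(stage_times)  # finish[s]: when stage s is next free
    for _ in range(num_chunks):
        ready = 0.0                    # when this chunk leaves the prior stage
        for s, t in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + t
            ready = finish[s]
    return finish[-1]


def serial_time(num_chunks, stage_times):
    """Time when every chunk passes all stages before the next chunk starts."""
    return num_chunks * sum(stage_times)


# Hypothetical per-chunk costs in microseconds for the three stages:
# preparation (device-to-host copy), gather, hardware multicast.
stages = [4.0, 1.0, 3.0]
chunks = 8
print(serial_time(chunks, stages), pipelined_time(chunks, stages))
```

In this model the pipelined time is the cost of one chunk through all stages plus (chunks - 1) times the slowest stage, which is why the chunk size needs tuning: smaller chunks shorten the fill time but pay the fixed per-chunk cost more often.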

11 Optimized Broadcast Receive
- Zero-copy broadcast receive
  - Pre-posted user buffer (d_in)
  - Avoids additional data movement
  - Leverages Scatter and GDR features
  -> Low latency
  -> Frees up PCIe resources for applications
[Figure: MPI_Bcast(d_in, ...) at Destination 1 through Destination N: Hardware Multicast from the switch, then Scatter (GDR Write) from each HCA into the header and d_in]
Reference: C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17.

12 Broadcast on Multi-GPU Systems
- Proposed intra-node topology-aware broadcast
  - CUDA InterProcess Communication (IPC)
[Figure: the source sends through the switch (multicast steps) to Node 1 through Node N; within each node, data moves between GPU 0, GPU 1, ..., GPU N with cudaMemcpy (device to device)]
Reference: C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct.
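The two-level idea on this slide (multicast across nodes, then device-to-device copies inside each node) can be sketched as a transfer plan. This is an assumed simplification: the function name is hypothetical, GPU 0 is arbitrarily chosen as the leader on every node, and real topology awareness would pick leaders and copy order based on the node's PCIe layout.

```python
def two_level_broadcast_plan(num_nodes, gpus_per_node):
    """Return (network_deliveries, local_copies) for a two-level broadcast.

    GPUs are named (node, gpu). The source is GPU 0 on node 0. On every
    node, GPU 0 acts as the leader (an assumption of this sketch).
    """
    source = (0, 0)
    # Level 1: one multicast delivery to the leader GPU of each remote node.
    network = [(source, (node, 0)) for node in range(1, num_nodes)]
    # Level 2: each leader copies to the other GPUs on its own node.
    local = [((node, 0), (node, gpu))
             for node in range(num_nodes)
             for gpu in range(1, gpus_per_node)]
    return network, local


network, local = two_level_broadcast_plan(num_nodes=4, gpus_per_node=2)
print(len(network), len(local))
```

Only one delivery per remote node crosses the network; the remaining GPUs are reached by intra-node copies.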

13 Efficient Reliability Support for IB-MCAST
- When a receiver experiences a timeout (lost MCAST packet), it performs an RMA Get operation on the sender's backup buffer to retrieve the lost MCAST packets
- The sender is not interrupted
[Figure: timeline between the broadcast sender (MPI, HCA) and the broadcast receiver (HCA, MPI), with a timeout on the receiver side]
Reference: C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop.
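The receiver-driven recovery described here can be sketched as a toy simulation. Everything in it is a stand-in: a Python dict plays the sender's backup buffer, reading from that dict models the one-sided RMA Get, and the class and function names are hypothetical. No sender-side code runs during recovery.

```python
class Receiver:
    """Tracks which sequence numbers arrived and recovers the missing ones."""

    def __init__(self, name):
        self.name = name
        self.received = {}

    def deliver(self, seq, payload):
        self.received[seq] = payload

    def on_timeout(self, expected_seqs, sender_backup):
        """Fetch every missing packet directly from the sender's backup."""
        missing = [s for s in expected_seqs if s not in self.received]
        for seq in missing:
            # Models a one-sided read: no sender code runs here.
            self.received[seq] = sender_backup[seq]
        return missing


def multicast(backup, receivers, seq, payload, lost_at=()):
    """Keep a backup copy, then deliver to all receivers not in `lost_at`."""
    backup[seq] = payload
    for r in receivers:
        if r.name not in lost_at:
            r.deliver(seq, payload)


backup = {}
receivers = [Receiver("r0"), Receiver("r1")]
multicast(backup, receivers, 0, b"chunk-0")
multicast(backup, receivers, 1, b"chunk-1", lost_at=("r1",))  # simulated loss
multicast(backup, receivers, 2, b"chunk-2")
recovered = receivers[1].on_timeout(range(3), backup)
print(recovered)
```

Because only the receiver that timed out acts, the common no-loss path adds nothing beyond keeping the backup copy.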

14 Outline
- Introduction
- Proposed Designs in MVAPICH2-GDR
- Performance Evaluation
- Concluding Remarks

15 Experimental Environments
- Ohio State University (OSU) Micro-Benchmarks (OMB)
  - osu_bcast: MPI_Bcast latency test
  - osu_bcast_streaming: MPI_Bcast streaming test
- Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*
  - AlexNet and VGG models with the ImageNet dataset
* D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016.

16 Benchmark Evaluation @ RI2 cluster, 16 GPUs, 1 GPU/node
- Provides near-constant latency over the system sizes
- Reduces latency by up to 65% for large messages
[Figure, left: latency (μs, lower is better) vs. message size (4K to 16M bytes) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; annotations: "Hit GDR read limit" and "65%"]
[Figure, right: latency (μs) vs. number of nodes (2, 4, 8, 16) for a 2 MB message, same four schemes; annotation: "Near-Constant"]
Reference: C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17.

17 Streaming Benchmark @ CSCS (88 GPUs)
- IB-MCAST + GDR + topology-aware IPC-based schemes
- Up to 58% and 79% latency reduction for small and large messages, respectively
[Figure: two panels of latency (μs) vs. message size (bytes), one for small and one for large messages (up to 4M), comparing MCAST-GDR-OPT and MCAST-GDR; annotations: "58%" and "79%"]
Reference: C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct.

18 Deep Learning Frameworks: Training Time @ RI2 cluster, 16 GPUs, 1 GPU/node
- CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK), without modification
- Reduces latency by up to 24% and 15% for the AlexNet and VGG models, respectively
- Higher improvement is expected for larger system sizes
[Figure: two panels (AlexNet model, VGG model) of training time (s, lower is better) vs. number of nodes (8, 16) for MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; annotations: "24%", "6%", and "15%"]

19 Concluding Remarks
- High-performance broadcast schemes that leverage GDR and IB-MCAST features for streaming and deep learning applications
  - Optimized streaming design for large message transfers
  - High-performance reliability support for IB-MCAST
  -> These features are included in MVAPICH2-GDR 2.3a

20 Thank You!
Ching-Hsiang Chu
chu.368@osu.edu
The MVAPICH2 Project, Network-Based Computing Laboratory
[1] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct.
[2] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17.
[3] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop.
[4] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," submitted to IEEE TPDS (under review).

21 Thank You!
Join us for more tech talks from the MVAPICH2 team.
The MVAPICH2 Project, Network-Based Computing Laboratory

22 Evaluation Parameters

Notation   Meaning                                                                       Unit
n          Number of processes                                                           N/A
m          Number of broadcast sources                                                   N/A
t_s        Set-up time for sending data                                                  sec
t_o(n)     Overhead for issuing an IB-MCAST packet                                       sec
M          Original message size                                                         bytes
C          Size of a data chunk                                                          bytes
U          Maximum Transmission Unit for IB-MCAST, provided by hardware manufacturer     bytes
B_H        Bandwidth of reading host memory                                              bytes/sec
B_G        Bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA)                       bytes/sec
B_PCIe     PCIe bandwidth between host and GPU memory                                    bytes/sec

[Figure: diagram relating the message M, chunk size C, and MTU U to the bandwidths B_PCIe, B_H, and B_G around the HCA]

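23 Ring-based Broadcast
- Direct: $(n-1) \times \left(t_s + \frac{M}{B_G}\right)$
- Pipeline: $\left(\frac{M}{C} + (n-2)\right) \times \left(t_s + \frac{C}{B_G}\right)$
- Staging: $\frac{M}{B_{PCIe}} + (n-1) \times \left(t_s + \frac{M}{B_H}\right)$
- Poor scalability
[Figure: the source forwards data to Destination 1, Destination 2, and Destination 3 in turn through their HCAs; legend: GDR Read, GDR Write, Network Transfer]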
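The ring cost expressions can be evaluated numerically to see how they scale with the number of processes. The sketch below assumes the three expressions read as Direct = (n-1)(t_s + M/B_G), Pipeline = (M/C + n-2)(t_s + C/B_G), and Staging = M/B_PCIe + (n-1)(t_s + M/B_H), using the symbols from the evaluation parameters. All numeric values are hypothetical and are not measurements from the talk.

```python
def ring_direct(n, M, t_s, B_G):
    """Each of the n-1 hops forwards the whole message from GPU memory."""
    return (n - 1) * (t_s + M / B_G)


def ring_pipeline(n, M, C, t_s, B_G):
    """The message is cut into M/C chunks that flow along the ring."""
    return (M / C + (n - 2)) * (t_s + C / B_G)


def ring_staging(n, M, t_s, B_H, B_PCIe):
    """One GPU-to-host copy, then n-1 hops that read from host memory."""
    return M / B_PCIe + (n - 1) * (t_s + M / B_H)


# Hypothetical parameters: 2 MiB message, 512 KiB chunks, 1 us set-up time,
# and made-up bandwidths in bytes/sec.
M = 2 * 1024 * 1024
C = 512 * 1024
t_s = 1e-6
B_G, B_H, B_PCIe = 3e9, 6e9, 12e9

for n in (2, 4, 8, 16):
    print(n,
          ring_direct(n, M, t_s, B_G),
          ring_pipeline(n, M, C, t_s, B_G),
          ring_staging(n, M, t_s, B_H, B_PCIe))
```

The direct and staging costs grow linearly in n, which matches the "poor scalability" note on the slide.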

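24 K-nomial-based Broadcast
- Direct: $\lceil \log_k n \rceil \times \left(t_s + \frac{M}{B_G}\right)$
- Pipeline: $\frac{M}{C} \times \lceil \log_k n \rceil \times \left(t_s + \frac{C}{B_G}\right)$
- Staging: $\frac{M}{B_{PCIe}} + \lceil \log_k n \rceil \times \left(t_s + \frac{M}{B_H}\right)$
- Non-optimized scalability
[Figure: the source sends through its HCA to Destination 1, Destination 2, and Destination 3 along a tree; legend: GDR Read, GDR Write, Network Transfer]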
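The k-nomial cost expressions can be evaluated the same way. The sketch below assumes the step count is the ceiling of log_k n and that the three expressions read as Direct = steps (t_s + M/B_G), Pipeline = (M/C) steps (t_s + C/B_G), and Staging = M/B_PCIe + steps (t_s + M/B_H). The function names and all numeric values are hypothetical.

```python
def tree_steps(n, k):
    """Smallest s with k**s >= n, i.e. ceil(log_k n), using integers only."""
    steps, reach = 0, 1
    while reach < n:
        reach *= k
        steps += 1
    return steps


def knomial_direct(n, k, M, t_s, B_G):
    return tree_steps(n, k) * (t_s + M / B_G)


def knomial_pipeline(n, k, M, C, t_s, B_G):
    return (M / C) * tree_steps(n, k) * (t_s + C / B_G)


def knomial_staging(n, k, M, t_s, B_H, B_PCIe):
    return M / B_PCIe + tree_steps(n, k) * (t_s + M / B_H)


# Hypothetical parameters: 2 MiB message, 512 KiB chunks, 1 us set-up time,
# binomial tree (k = 2), and made-up bandwidths in bytes/sec.
M = 2 * 1024 * 1024
C = 512 * 1024
t_s = 1e-6
B_G, B_H, B_PCIe = 3e9, 6e9, 12e9
k = 2

for n in (2, 4, 8, 16):
    print(n,
          knomial_direct(n, k, M, t_s, B_G),
          knomial_pipeline(n, k, M, C, t_s, B_G),
          knomial_staging(n, k, M, t_s, B_H, B_PCIe))
```

The cost grows with log n rather than n, so it scales better than the ring, but it still increases with system size, unlike a hardware multicast that reaches all destinations in one step.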

25 Overlap Opportunities
- Overlap within a node
- Overlap across nodes
[Figure: timeline of broadcasts from Node A, Node B, and Node C, each with its HCA; legend: cudaMemcpyAsync, cudaStreamSynchronize, Hardware Multicast, GDR Write]

26 MCAST-based Broadcast
- NVIDIA GPUDirect [1]
  - Remote direct memory access (RDMA) transfers between GPUs and other PCIe devices (GDR)
  - and more
- InfiniBand (IB) hardware multicast (IB MCAST) [2]
  - Enables efficient designs of broadcast operations
    - Host-based [3]
    - GPU-based [4]
[1]
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, Jun.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec.

27 Future Work
- Extend the design to other broadcast-based collective algorithms as well as non-blocking operations
  - Allreduce, Allgather, ..., and so on
