Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Size: px

Start display at page:

Download "Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters"

Magnus Fowler
5 years ago
Views:

1 Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu Department of Computer Science and Engineering The Ohio State University

2 OSU Booth - SC18 2 Outline Introduction Advanced Designs in MVAPICH2-GDR CUDA-Aware MPI_Bcast CUDA-Aware MPI_Allreduce / MPI_Reduce Concluding Remarks

Processors High Performance Interconnects

SSD, NVMe-SSD, NVRAM Remote Direct Memory

(InfiniBand and RoCE) Solid State Drives

Storage Clusters Accelerators (NVIDIA GPs

3 Drivers of Modern HPC Cluster Architectures - Hardware Multi-/Many-core Processors High Performance Interconnects InfiniBand (with SR-IOV) <1usec latency, 200Gbps Bandwidth> Multi-core/many-core technologies Accelerators / Coprocessors high compute density, high performance/watt >1 TFlop DP on a chip SSD, NVMe-SSD, NVRAM Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters Accelerators (NVIDIA GPs and Intel Xeon Phi) Sierra@LLNL Stampede2@TACC Comet@SDSC OSU Booth - SC18 3

4 Architectures for Deep Learning (DL) Multi-core s within a node Multi-core s + Multi- within a node Multi-core s + Multi- across nodes (E.g., Sierra/Summit) Multi-core s across nodes Multi-core s + Single across nodes Networks Networks Networks OSU Booth - SC18 4

5 Streaming-like Applications Streaming-like applications on HPC systems Data Source Real-time streaming 1. Communication (MPI) Broadcast HPC resources for real-time analytics Sender Allreduce/Reduce 2. Computation (CUDA) Multiple nodes as workers Worker Data streaming-like broadcast operations Worker Worker Worker Worker OSU Booth - SC18 5

OSU Booth - SC18 6 High-performance Deep Learning Computation using Communication using MPI Node 1 Node 3 Exchanging partial gradients after each minibatch All-to-all (Multi-Source)

6 OSU Booth - SC18 6 High-performance Deep Learning Computation using Communication using MPI Node 1 Node 3 Exchanging partial gradients after each minibatch All-to-all (Multi-Source) communications Ø E.g., MPI_Bcast, MPI_Allreduce Challenges Node 2 Node 4 High computation-communication overlap Good scalability for upcoming large-scale clusters No application-level modification

7 OSU Booth - SC18 7 Outline Introduction Advanced Designs in MVAPICH2-GDR CUDA-Aware MPI_Bcast CUDA-Aware MPI_Allreduce / MPI_Reduce Concluding Remarks

8 Hardware Multicast-based Broadcast For -resident data, using Direct RDMA (GDR) InfiniBand Hardware Multicast (-MCAST) Overhead UD limit GDR limit A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and Direct RDMA for Streaming Applications on InfiniBand Clusters, in HiPC 2014, Dec Source Header Data HCA Switch 1. Gather + GDR Read 2. Hardware Multicast 3. Scatter + GDR Write Destination 1 HCA HCA Header Data Destination N Header Data OSU Booth - SC18 8

9 Hardware Multicast-based Broadcast (con t) Heterogeneous Broadcast for streaming applications Ø Free-up PCIe resources C Source Data C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters, " SBAC-PAD'16, Oct , HCA Switch Multicast steps SL step 2 2 HCA HCA Node 1 C Data Node N OSU Booth - SC Data C

10 Optimized Broadcast Send Preparing Intermediate buffer (im_buf) Page-locked (pinned) host buffer Ø Fast Device-Host data movement Allocated at initialization phase Ø Low overhead, one time effort Streaming data through host Fine-tuned chunked data Asynchronous copy operations Ø Three-stage fine-tuned pipeline MPI_Bcast(d_out, ) Source Header im_buf d_out HCA 1. Data Preparation 2. Gather 3. Hardware Multicast Switch C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda., "Efficient and Scalable Multi-Source Streaming Broadcast on Clusters for Deep Learning, " ICPP 2017, Aug 14-17, OSU Booth - SC18 10

11 Optimized Broadcast Receive Zero-copy broadcast receive Pre-posted user buffer (d_in) Avoids additional data movement Leverages Scatter and GDR features Ø Low-latency Ø Free-up PCIe resources for applications C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, D. K. Panda, "Exploiting Hardware Multicast and Direct RDMA for Efficient Broadcast," to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS). Switch Hardware Multicast Scatter (GDR Write) MPI_Bcast(d_in, ) Destination 1 HCA HCA Header d_in Destination N Header d_in OSU Booth - SC18 11

12 Broadcast on Multi- systems Proposed Intra-node Topology-Aware Broadcast CUDA InterProcess Communication (IPC) Source 1 Multicast steps Switch 2 2 Node 1 Node N C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters, " SBAC-PAD'16, Oct , cudamemcpy (Device Device) 0 1 N OSU Booth - SC18 12

Benchmark Evaluation Latency (μs) @ RI2 cluster, 16 s, 1 /node 10000 1000 100 10 1 4K MV2-GDR-Knomial MCAST-GDR 8K 16K 32K 64K 128K Hit GDR read limit 256K 512K 1M Message Size (bytes) MV2-GDR-Ring

13 Benchmark Evaluation Latency RI2 cluster, 16 s, 1 /node K MV2-GDR-Knomial MCAST-GDR 8K 16K 32K 64K 128K Hit GDR read limit 256K 512K 1M Message Size (bytes) MV2-GDR-Ring MCAST-GDR-Opt 2M 65% 4M 8M 16M Latency (μs) Provide near-constant latency over the system sizes Reduces up to 65% of latency for large messages MB Message Near-Constant MV2-GDR-Knomial MCAST-GDR Lower is better MV2-GDR-Ring MCAST-GDR-Opt Number of nodes C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda., "Efficient and Scalable Multi-Source Streaming Broadcast on Clusters for Deep Learning, " ICPP 2017, Aug 14-17, OSU Booth - SC18 13

14 Throughput (GB/s) Streaming RI2 (16 s) & CSCS (88 s) Peak Streaming Peak Knomial-GDR Ring-GDR-Pipeline TA-Zcpy-MCAST-GDR-Pipeline Streaming Peak Streaming 4 Nodes 8 s Nodes 16 Nodes -MCAST + GDR + IPC-based MPI_Bcast schemes Stable high throughput compared to existing schemes Peak Streaming Peak Streaming C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, D. K. Panda, "Exploiting Hardware Multicast and Direct RDMA for Efficient Broadcast," to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS). OSU Booth - SC18 14 Peak Streaming Peak Streaming 11 s 22 s 44 s 88 s

Performance Benefits with CNTK Deep Learning Framework @ RI2 cluster, 16 s CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without modification 1.

15 Performance Benefits with CNTK Deep Learning RI2 cluster, 16 s CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without modification 1.5 CA-CNTK - Image Classification Knomial-GDR Ring-GDR-Pipeling Zcpy-MCAST-GDR-Pipeline Speedup % 24% Reduces up to 24%, 15%, 18% of latency for AlexNet, VGG, and ResNet-50 models Higher improvement is expected for larger system sizes AlexNet VGG ResNet-50 Scale (Number of nodes) C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, D. K. Panda, "Exploiting Hardware Multicast and Direct RDMA for Efficient Broadcast," to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS). OSU Booth - SC18 15

Explicit copy the data from host to memory Node A Proposed designs 1.

16 CUDA-Aware MPI_Allreduce Existing designs 1. Explicit copy the data from to host memory 2. Host-to-Host communication to remote processes 3. Perform computation on 4. Explicit copy the data from host to memory Node A Proposed designs 1. -to- communication NVIDIA DirectRDMA (GDR) Pipeline through host for large msg 2. Perform computation on Efficient CUDA kernels 1 Host Memory PCIe Adapter Expensive! Fast Good for small data Relative slow for large data Expensive! Adapter Host Memory PCIe OSU Booth - SC Node B 4

17 Benchmark RI2 cluster, 16 s Latency (ms) Default RD-DD BRB-DD 64K 128K 256K 512K 1M 2M 4M Message Size (Bytes) Latency (ms) MPI NCCL2 MPI-Opt K 8K 32K 128K 512K 2M 8M 32M 128M Message Size (Bytes) [1] C. Chu, K. Hamidouche, A. Venkatesh, A. A. Awan and D. K. Panda, "CUDA Kernel Based Collective Reduction Operations on Large-scale Clusters," in CCGrid 16, Cartagena, 2016, pp [2] Awan AA, Bedorf J, Chu CH, Subramoni H, Panda DK. Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation. arxiv preprint arxiv: Oct 25. OSU Booth - SC18 17

18 OSU Booth - SC18 18 Outline Introduction Advanced Designs in MVAPICH2-GDR CUDA-Aware MPI_Bcast CUDA-Aware MPI_Allreduce / MPI_Reduce Concluding Remarks

19 OSU Booth - SC18 19 Concluding Remarks High-performance broadcast schemes to leverage GDR and - MCAST features for streaming and deep learning applications Optimized streaming design for large messages transfers High-performance reliability support for -MCAST High-performance CUDA-Aware Allreduce for deep learning Efficient reduction kernel on s Ø These features are included in MVAPICH2-GDR 2.3 Ø Ø

20 OSU Booth - SC18 20 Thank You! Join us for more tech talks from MVAPICH2 team MVAPICH The MVAPICH2 Project Network-Based Computing Laboratory

High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction