Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters
1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar K. (DK) Panda
1 Department of Computer Science and Engineering, The Ohio State University
2 Engility Corporation
DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. 88ABW-2016-5574
Outline
- Introduction
- Proposed Designs
- Performance Evaluation
- Conclusion and Future Work
Drivers of Modern HPC Cluster Architectures
- Multi-core processors are ubiquitous
- High performance interconnects: InfiniBand (<1 µs latency, >100 Gbps bandwidth) is very popular in HPC clusters
- Accelerators/Coprocessors (e.g., GPUs): high compute density, high performance/watt, >1 Tflop/s double-precision on a chip; becoming common in high-end systems
- Together, these are pushing the envelope towards Exascale computing
[System images: Tianhe-2, Titan, Stampede, Tianhe-1A]
IB and GPU in HPC Systems
- Growth of IB and GPU clusters in the last 3 years
  - IB is the major commodity network adapter used
  - NVIDIA GPUs boost 18% of the top 50 of the Top 500 systems as of June 2016
[Chart: system share in the Top 500 (%), June 2013 to June 2016, for InfiniBand clusters and GPU clusters; data from the Top500 list (http://www.top500.org)]
Motivation
- Streaming applications on HPC systems
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA)
- HPC resources for real-time analytics
- Data streaming-like broadcast operations: a sender receives real-time streaming data from a data source and broadcasts it to multiple nodes acting as workers
- Examples: deep learning frameworks, proton computed tomography (pCT)
[Figure: data source feeding a sender, which broadcasts to multiple worker nodes]
Motivation
- Streaming applications on HPC systems
  1. Communication: heterogeneous broadcast-type operations
- Data usually come from a live source and are stored in Host memory
- Data need to be sent to remote GPU memories for computing
- Data streaming-like broadcast operations therefore require data movement from Host memory to remote GPU memories, i.e., a host-to-device (H-D) heterogeneous broadcast
- This is a performance bottleneck
Motivation
- Requirements for streaming applications on HPC systems
  - Low latency, high throughput, and scalability
  - Free up Peripheral Component Interconnect Express (PCIe) bandwidth for application needs
[Figure: data streaming-like broadcast operations to multiple worker nodes]
Motivation
- Technologies we have
  - NVIDIA GPUDirect [1]
    - Remote direct memory access (RDMA) transfers between GPUs and other PCIe devices (GPUDirect RDMA, GDR)
    - Peer-to-peer transfers between GPUs, and more
  - InfiniBand (IB) hardware multicast (IB MCAST) [2]
    - Enables efficient designs of homogeneous broadcast operations: Host-to-Host [3] and GPU-to-GPU [4]
[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec 2014.
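The two building blocks above can be illustrated at the verbs level. The following is a minimal sketch, not the authors' implementation: it assumes a GPUDirect-RDMA-capable stack (e.g., the nv_peer_mem kernel module loaded), an already-created protection domain and UD queue pair, and a multicast GID/LID obtained out of band from the subnet manager. The helper names register_gpu_buffer and join_multicast_group are illustrative, and error handling is omitted.

/* Hedged sketch: GPUDirect RDMA registration of GPU memory and joining an
 * IB hardware multicast group with a UD QP. Not the library's actual code. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* GPUDirect RDMA: register device (GPU) memory so the HCA can DMA it directly. */
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes, void **d_buf)
{
    cudaMalloc(d_buf, bytes);                       /* buffer lives in GPU memory */
    return ibv_reg_mr(pd, *d_buf, bytes,            /* peer-to-peer registration  */
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

/* IB hardware multicast: attach a UD QP to a multicast group so that one send
 * is replicated by the switch to every attached receiver. */
int join_multicast_group(struct ibv_qp *qp, const union ibv_gid *mgid, uint16_t mlid)
{
    return ibv_attach_mcast(qp, mgid, mlid);
}

On systems without GPUDirect RDMA support, ibv_reg_mr on a device pointer fails and a host staging buffer would be required instead.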
Problem Statement
- Can we design a high-performance heterogeneous broadcast for streaming applications?
  - Supports Host-to-Device broadcast operations
- Can we also design an efficient broadcast for multi-GPU systems?
- Can we combine GPUDirect and IB technologies to
  - Avoid extra data movements to achieve better performance
  - Increase available Host-Device (H-D) PCIe bandwidth for application use
Outline
- Introduction
- Proposed Designs
  - Heterogeneous Broadcast with GPUDirect RDMA (GDR) and InfiniBand (IB) Hardware Multicast
  - Intra-node Topology-Aware Broadcast for Multi-GPU Systems
- Performance Evaluation
- Conclusion and Future Work
Proposed Heterogeneous Broadcast
- Key requirement of IB MCAST: the control header needs to be stored in host memory
- SL-based approach: combine CUDA GDR and IB MCAST features
  - Take advantage of the IB Scatter-Gather List (SGL) feature: multicast two separate addresses (control header on the host + data on the GPU) in a single IB message
  - Directly IB read/write from/to GPU memory using the GDR feature: low-latency, zero-copy based schemes
- Avoiding the extra copy between Host and GPU frees up PCIe bandwidth for application needs
- Employing the IB MCAST feature increases scalability
Proposed Heterogeneous Broadcast
- Overview of the SL-based approach (a minimal verbs-level sketch follows below)
[Figure: the source node posts one IB SL step that multicasts the control header (host memory) and the data (GPU memory) through the IB HCA and IB switch to Node 1 ... Node N, where the control header lands in host memory and the data lands directly in GPU memory]
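To make the "two addresses in one IB message" idea concrete, here is a hedged sketch of the multicast send, again not the library's actual code. It assumes the UD queue pair is already attached to the multicast group, mcast_ah is an address handle built for the multicast LID/GID, ctrl_mr and data_mr are the registered host and GPU memory regions, and the Q_Key value is only an example that must match the receivers. Real designs also chunk large payloads to the UD MTU and add reliability on top of unreliable multicast, which this sketch omits.

/* Hedged sketch: one work request whose scatter-gather list carries the
 * control header from host memory and the payload from GPU memory (GDR). */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_sl_mcast(struct ibv_qp *qp, struct ibv_ah *mcast_ah,
                  struct ibv_mr *ctrl_mr, void *ctrl_hdr, size_t ctrl_len,
                  struct ibv_mr *data_mr, void *d_data, size_t data_len)
{
    /* Two scatter-gather entries in ONE work request: the HCA gathers both,
     * and the switch replicates the single message to every receiver. */
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)ctrl_hdr, .length = (uint32_t)ctrl_len, .lkey = ctrl_mr->lkey },
        { .addr = (uintptr_t)d_data,   .length = (uint32_t)data_len, .lkey = data_mr->lkey },
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode            = IBV_WR_SEND;
    wr.send_flags        = IBV_SEND_SIGNALED;
    wr.sg_list           = sge;
    wr.num_sge           = 2;
    wr.wr.ud.ah          = mcast_ah;
    wr.wr.ud.remote_qpn  = 0xFFFFFF;   /* the IB multicast QPN */
    wr.wr.ud.remote_qkey = 0x11111111; /* example Q_Key; must match receivers */

    return ibv_post_send(qp, &wr, &bad_wr);
}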
Broadcast on Multi-GPU Systems
- Existing two-level approach
  - Inter-node: can apply the proposed SL-based design
  - Intra-node: use host-based shared memory
- Issues of the H-D cudaMemcpy:
  1. Expensive
  2. Consumes PCIe bandwidth between the Host and the GPUs!
[Figure: the source multicasts to Node 1 ... Node N; within each node, data is copied from host shared memory to GPU 0 ... GPU N via cudaMemcpy (Host to Device)]
Broadcast on Multi-GPU Systems
- Proposed Intra-node Topology-Aware Broadcast
  - CUDA Inter-Process Communication (IPC)
[Figure: the source multicasts to Node 1 ... Node N; within each node, data is copied from the leader GPU 0 to GPUs 1 ... N via cudaMemcpy (Device to Device)]
Broadcast on Multi-GPU Systems
- Proposed Intra-node Topology-Aware Broadcast
  - The leader GPU keeps a copy of the data
  - Synchronization between GPUs uses a one-byte flag in shared memory on the host
  - Non-leader GPUs copy the data using CUDA IPC, which frees up PCIe Host-Device bandwidth (a minimal CUDA IPC sketch follows below)
- Other topology-aware designs: ring, k-nomial, etc., with dynamic tuning selection
[Figure: GPU 0 holds RecvBuf and CopyBuf; GPUs 1 ... N copy into their RecvBufs via IPC; a shared memory region in host memory coordinates]
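The following is a minimal sketch of the intra-node CUDA IPC step, under the assumption of one MPI process per GPU and an intra-node communicator with the leader at rank 0. It is illustrative rather than the MVAPICH2-GDR implementation, and it swaps the one-byte host shared-memory flag for MPI calls to keep the example short.

/* Hedged sketch: leader GPU exposes its device buffer via a CUDA IPC handle;
 * non-leaders open the handle and perform a device-to-device copy. */
#include <cuda_runtime.h>
#include <mpi.h>

void intra_node_bcast(void *d_recvbuf, size_t bytes, MPI_Comm node_comm)
{
    int rank;
    MPI_Comm_rank(node_comm, &rank);

    cudaIpcMemHandle_t handle;
    if (rank == 0)                              /* leader GPU already holds the data */
        cudaIpcGetMemHandle(&handle, d_recvbuf);

    /* Distribute the IPC handle of the leader's device buffer. */
    MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, node_comm);

    if (rank != 0) {
        void *d_leader = NULL;
        cudaIpcOpenMemHandle(&d_leader, handle, cudaIpcMemLazyEnablePeerAccess);
        /* Device-to-device copy over PCIe peer-to-peer paths; no host staging,
         * so Host-Device PCIe bandwidth stays available to the application. */
        cudaMemcpy(d_recvbuf, d_leader, bytes, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(d_leader);
    }
    MPI_Barrier(node_comm);                     /* stand-in for the one-byte flag sync */
}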
Outline
- Introduction
- Proposed Designs
- Performance Evaluation
  - OSU Micro-Benchmark (OMB) level evaluation
  - Streaming benchmark level evaluation
- Conclusion and Future Work
Experimental Environments
1. Wilkes cluster @ University of Cambridge (http://www.hpc.cam.ac.uk/services/wilkes)
   - 2 NVIDIA K20c GPUs per node
   - Used up to 32 nodes
2. CSCS cluster @ Swiss National Supercomputing Centre (http://www.cscs.ch/computers/kesch_escha/index.html)
   - Cray CS-Storm system
   - 8 NVIDIA K80 GPU cards per node (= 16 NVIDIA Kepler GK210 GPU chips per node)
   - Used up to 88 NVIDIA K80 cards (176 GPU chips) over 11 nodes
- Modified Ohio State University (OSU) Micro-Benchmark (OMB) (http://mvapich.cse.ohio-state.edu/benchmarks/)
  - osu_bcast - MPI_Bcast Latency Test, modified to support heterogeneous broadcast
- Streaming benchmark
  - Mimics real streaming applications: continuously broadcasts data from a source to GPU-based compute nodes
  - Includes a computation phase that involves host-to-device and device-to-host copies
Overview of the MVAPICH2 Project
- High performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI + PGAS), available since 2011
  - Support for GPGPUs (MVAPICH2-GDR), available since 2014
  - Support for MIC (MVAPICH2-MIC), available since 2014
  - Support for Virtualization (MVAPICH2-Virt), available since 2015
  - Support for Energy-Awareness (MVAPICH2-EA), available since 2015
- Used by more than 2,675 organizations in 83 countries
- More than 391,000 (> 0.39 million) downloads from the OSU site directly
- Empowering many TOP500 clusters (June '16 ranking)
  - 12th ranked 462,462-core cluster (Stampede) at TACC
  - 15th ranked 185,344-core cluster (Pleiades) at NASA
  - 31st ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
- Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
- http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
  - System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 Tflop/s)
  - Stampede at TACC (12th in June 2016, 462,462 cores, 5.168 Pflop/s)
OMB Heterogeneous Inter-node Broadcast @ Wilkes
[Charts: broadcast latency (μs) vs. message size (1 B - 4 MB) for SL-MCAST, GPU-MCAST, and Host-MCAST; SL-MCAST is up to 56% lower for small messages and 39% lower for large messages]
- Compared the proposed SL-based design to homogeneous broadcast designs with explicit data transfers
- Reduces latency by up to 56% and 39% for small and large messages, respectively
- No extra data transfers between Host and GPU memories
OMB SL-based Approach Inter-node Broadcast on Wilkes
- IB hardware multicast provides good scalability
[Chart: broadcast latency (μs) vs. system size (2 to 32 nodes) for SL-MCAST, GPU-MCAST, and Host-MCAST]
OMB Inter- and Intra-node Broadcast @ CSCS
[Charts: broadcast latency (μs) vs. message size (1 B - 4 MB) for IPC SL-MCAST and SHMEM SL-MCAST; IPC SL-MCAST is up to 58% lower for small messages and 79% lower for large messages]
- SL-based inter-node + topology-aware intra-node broadcast on CSCS
- Up to 58% and 79% latency reduction for small and large messages, respectively
- No extra data transfers between Host and GPU memories
Streaming Benchmark Execution Time @ CSCS
[Charts: execution time (s) vs. message size (1 B - 4 MB) for IPC SL-MCAST and SHMEM SL-MCAST; IPC SL-MCAST is up to 26% faster for small messages and 67% faster for large messages]
- Utilizes IPC-based Device-to-Device data transfers for streaming applications on multi-GPU systems
- Up to 26% and 67% improvement for small and large messages, respectively
Streaming Benchmark Throughput @ CSCS
[Chart: available PCIe Host-Device throughput (GB/s) vs. message size (1 B - 4 MB) for IPC SL-MCAST and SHMEM SL-MCAST; IPC SL-MCAST provides up to 3.2X higher available throughput]
- Increases availability of PCIe Host-Device resources
- Utilizes IPC-based Device-to-Device data transfers
- Frees up PCIe bandwidth between Host and Devices for applications
Outline
- Introduction
- Proposed Designs
- Performance Evaluation
- Conclusion and Future Work
Conclusion
- Combines NVIDIA GPUDirect technology and InfiniBand (IB) hardware multicast for GPU-enabled streaming applications
- Further proposes an intra-node topology-aware scheme that exploits CUDA IPC for multi-GPU systems
- Achieves 2X improvement over state-of-the-art schemes with the Ohio State University (OSU) Micro-Benchmarks (OMB)
- Achieves up to a 67% improvement in execution time and 3.5X the throughput in a synthetic streaming benchmark
- Indicates that applying this approach to a streaming application, such as proton computed tomography (pCT) or a deep learning framework, is promising
Future Work
- Include in future releases of the MVAPICH2-GDR library
- Improve reliability
- Evaluate effectiveness with streaming applications, such as proton computed tomography (pCT) and deep learning frameworks
- Extend the designs to other collective operations as well as non-blocking operations: Allreduce, Gather, etc.
Thank You!
Ching-Hsiang Chu, chu.368@osu.edu
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017, Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the authors.
Streaming Benchmark
- Mimics the behavior of a streaming application
  - Continuously broadcasts data from a source to GPU-based compute nodes
  - Includes a computation phase that involves host-to-device and device-to-host copies
/* h_buf and d_buf: buffers in Host and GPU memory, respectively. */
for iter = 0 to max_iter do
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, cpy_stream);
    if rank == root then
        MPI_Bcast(h_buf, ...);   /* root broadcasts from Host memory  */
    else
        MPI_Bcast(d_buf, ...);   /* workers receive into GPU memory   */
    end if
    dummy_kernel<<<...>>>(d_buf, ...);
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, cpy_stream);
end for