Designing HPC, Big Data, and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities

1 Designing HPC, Big Data, and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities. DHABALESWAR K. PANDA. DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group, are currently being used by more than 2,675 organizations worldwide (in 81 countries). More than 400,000 downloads of this software have taken place from the project's site. The RDMA packages for Apache Spark, Apache Hadoop and Memcached, together with the OSU HiBD benchmarks from his group, are also publicly available. These libraries are currently being used by more than 190 organizations in 26 countries. More than 18,000 downloads of these libraries have taken place. He is an IEEE Fellow.

2 Designing HPC, Big Data and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities Talk at HPC Connection Workshop (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University

3 Drivers of Modern HPC Cluster Architectures: multi-core/many-core processors; high-performance interconnects (InfiniBand: <1 usec latency, 100 Gbps bandwidth); Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE); Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), and NVMe-SSDs; and accelerators/coprocessors (NVIDIA GPGPUs and Intel Xeon Phi) with high compute density and high performance/watt (>1 TFlop DP on a chip). Representative systems: Tianhe-2, Titan, Stampede, Tianhe-1A.

4 Three Major Computing Categories. Scientific Computing: Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model; many discussions towards Partitioned Global Address Space (PGAS) models (UPC, OpenSHMEM, CAF, UPC++, etc.); hybrid programming: MPI + PGAS (OpenSHMEM, UPC). Deep Learning: Caffe, CNTK, TensorFlow, and many more. Big Data/Enterprise/Commercial Computing: focuses on large data and data analysis; Spark and Hadoop (HDFS, HBase, MapReduce); Memcached is also used for Web 2.0.

5 Designing Communication Middleware for Multi-Petaflop and Exaflop Systems: Challenges. The middleware (a communication library or runtime for the programming models) sits between application kernels/applications and the programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.), and must map them onto the networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (NVIDIA and MIC). Key middleware components include point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance. Co-design opportunities and challenges exist across these layers for performance, scalability, and fault-resilience.

6 Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); scalable job start-up; scalable collective communication (offload, non-blocking, topology-aware); balancing intra-node and inter-node communication for next-generation many-core nodes; multiple end-points per node; support for efficient multi-threading; integrated support for GPGPUs and accelerators; fault-tolerance/resiliency; QoS support for communication and I/O; support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, ...); virtualization; and energy-awareness. A sketch of one of these ideas, non-blocking collectives, follows.
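
As a small illustration of the non-blocking collective item above, here is a minimal sketch (plain MPI-3, not MVAPICH2-specific) of overlapping application computation with an MPI_Iallreduce; the buffer size is arbitrary:

```c
/* Minimal sketch: overlapping computation with a non-blocking collective.
 * Plain MPI-3 calls only; not MVAPICH2-specific.
 * Compile: mpicc -o iallreduce_overlap iallreduce_overlap.c */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    double local[N], global[N];
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++)
        local[i] = 1.0;                 /* stand-in for locally computed data */

    /* Start the reduction and return immediately. */
    MPI_Iallreduce(local, global, N, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation can proceed here while the collective
     * progresses (possibly offloaded to the network) ... */

    /* Block only when the reduced result is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("global[0] = %f\n", global[0]);

    MPI_Finalize();
    return 0;
}
```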

7 Additional Challenges for Designing Exascale Software Libraries. Extreme low memory footprint: memory per core continues to decrease. A D-L-A (Discover, Learn, Adapt) framework: Discover the overall network topology (fat-tree, 3D, ...), the network topology of the processes of a given job, the node architecture, and the health of the network and nodes; Learn their impact on performance and scalability and the potential for failure; Adapt the internal protocols and algorithms, the process mapping, and the fault-tolerance solutions. All of this must use low-overhead techniques while delivering performance, scalability, and fault-tolerance.

8 Overview of the MVAPICH2 Project. High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE). MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0): started in 2001, first version available in 2002. MVAPICH2-X (MPI + PGAS), available since 2011. Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014. Support for virtualization (MVAPICH2-Virt), available since 2015. Support for energy-awareness (MVAPICH2-EA), available since 2015. Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015. Used by more than 2,690 organizations in 83 countries. More than 420,000 (>0.4 million) downloads from the OSU site directly. Empowering many TOP500 clusters (Nov '16 ranking): the 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China; the 13th-ranked 241,108-core cluster (Pleiades) at NASA; the 17th-ranked 519,640-core cluster (Stampede) at TACC; the 40th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology; and many others. Available with the software stacks of many vendors and Linux distros (RedHat and SuSE). Empowering TOP500 systems for over a decade: from System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Sunway TaihuLight at NSC, Wuxi, China (1st in Nov '16, 10,649,640 cores, 93 PFlops).

9 MVAPICH/MVAPICH2 Release Timeline and Downloads: a chart of the cumulative number of downloads (y-axis) versus time from Sep 2004 through Oct 2016 (x-axis), annotated with release points from MV 0.9.4 and MV2 0.9.0 through MV2 2.2, MV2-X 2.2, MV2-GDR 2.2, MV2-Virt 2.1rc2, and MV2-MIC 2.0.

10 Architecture of the MVAPICH2 Software Family. High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and hybrid MPI + X (MPI + PGAS + OpenMP/Cilk). These sit on a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis. Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL), NVIDIA GPGPU). Transport protocols: RC, XRC, UD, DC; modern network features: UMR, ODP, SR-IOV, multi-rail; transport mechanisms: shared memory, CMA, IVSHMEM; modern memory/interconnect features: MCDRAM*, NVLink*, CAPI* (* upcoming).

11 MVAPICH2 Software Family. High-performance parallel programming libraries: MVAPICH2 (support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE); MVAPICH2-X (advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime); MVAPICH2-GDR (optimized MPI for clusters with NVIDIA GPUs); MVAPICH2-Virt (high-performance and scalable MPI for hypervisor- and container-based HPC clouds); MVAPICH2-EA (energy-aware and high-performance MPI); MVAPICH2-MIC (optimized MPI for clusters with Intel KNC). Microbenchmarks: OMB (microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs). Tools: OSU INAM (network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration); OEMT (utility to measure the energy consumption of MPI applications).

12 Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication; dynamic and adaptive tag matching; a unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...); integrated support for GPGPUs; accelerating graph processing (Mizan) with MPI-3 RMA; and optimized MVAPICH2 for KNL and Omni-Path.

13 One-way Latency: MPI over IB with MVAPICH2. Charts of small-message and large-message one-way latency (us) versus message size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-4-EDR, and Omni-Path. Test beds: deca-core Intel Haswell nodes (IvyBridge for ConnectX-3-FDR) with PCIe Gen3 and the corresponding IB or Omni-Path switch.
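
The latency curves above come from ping-pong tests in the style of the OSU micro-benchmarks. The following is a simplified sketch of such a measurement loop (the message size, warm-up count, and iteration count are illustrative, not the OMB defaults):

```c
/* Minimal ping-pong latency sketch in the style of osu_latency (simplified). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int size = 8, skip = 100, iters = 10000;
    char buf[8];
    int rank;
    double t_start = 0.0, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a', size);

    for (int i = 0; i < skip + iters; i++) {
        if (i == skip) {                   /* discard warm-up iterations */
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    /* One-way latency = half of the average round-trip time. */
    if (rank == 0)
        printf("%d bytes: %.2f us\n", size,
               (t_end - t_start) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```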

14 Bandwidth: MPI over IB with MVAPICH2. Charts of unidirectional and bidirectional bandwidth (MBytes/sec) versus message size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-4-EDR, and Omni-Path; peak bidirectional bandwidths across the five adapters range from 6,228 to 24,136 MBytes/sec (with 12,161, 21,227, and 21,983 in between). Test beds: deca-core Intel Haswell nodes (IvyBridge for ConnectX-3-FDR) with PCIe Gen3 and the corresponding IB or Omni-Path switch.
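
The bandwidth curves are measured differently from latency: the sender keeps a window of non-blocking sends in flight so the link stays saturated. A simplified sketch (window size, iteration count, and message size are illustrative, not the OMB defaults):

```c
/* Minimal uni-directional bandwidth sketch in the style of osu_bw (simplified). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int size = 1 << 20, window = 64, iters = 100;
    char *sbuf = malloc(size), *rbuf = malloc(size);
    MPI_Request reqs[64];
    int rank;
    double t_start, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            /* Keep a window of messages in flight to saturate the link. */
            for (int w = 0; w < window; w++)
                MPI_Isend(sbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
            /* Wait for a small ack so timing covers delivery, not just injection. */
            MPI_Recv(rbuf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < window; w++)
                MPI_Irecv(rbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(sbuf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f MB/s\n", size,
               ((double)size * window * iters) / 1e6 / (t_end - t_start));

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```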

15 MVAPICH2 Two-Sided Intra-Node Performance (shared-memory and kernel-based zero-copy support, LiMIC and CMA). On Intel Ivy Bridge, the latest MVAPICH2 delivers 0.18 us intra-socket small-message latency; intra-socket bandwidth reaches 14,250 MB/s and inter-socket bandwidth 13,749 MB/s (curves shown for CMA, shared memory, and LiMIC in both the intra-socket and inter-socket cases).

16 On-Demand Paging (ODP). Applications no longer need to pin down the underlying physical pages: Memory Regions (MRs) are never pinned by the OS; pages are paged in by the HCA when needed and paged out by the OS when reclaimed. ODP can be divided into two classes. Explicit ODP: applications still register memory buffers for communication, but the registration defines access control for I/O rather than pinning the pages. Implicit ODP: applications are provided with a special memory key that represents their complete address space and do not need to register any virtual address range. Advantages: simplified programming, unlimited MR sizes, and physical memory optimization (charts compare application execution time with pin-down versus ODP at 64 processes). M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and D. K. Panda, Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits, SC 2016 (Wednesday 11:00-11:30 AM in Room 355-D).
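
At the verbs level, ODP is requested through a registration flag. The sketch below is illustrative only; it assumes a libibverbs build with ODP support and is not the MVAPICH2 design from the cited paper:

```c
/* Minimal sketch: registering memory with On-Demand Paging at the verbs level.
 * Assumes a libibverbs with ODP support; illustrative only, not MVAPICH2 code. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0])
        return 1;
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1UL << 30;               /* 1 GB region, never pinned up front */
    void *buf = malloc(len);

    /* Explicit ODP: register a specific range, but with IBV_ACCESS_ON_DEMAND
     * the HCA faults pages in as needed instead of pinning them.
     * Implicit ODP would instead register the whole address space:
     *   ibv_reg_mr(pd, NULL, SIZE_MAX, IBV_ACCESS_ON_DEMAND | ...);         */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_ON_DEMAND);
    if (!mr) {
        perror("ibv_reg_mr (ODP not supported?)");
        return 1;
    }
    printf("ODP MR registered: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```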

17 Dynamic and Adaptive Tag Matching. Challenge: tag matching is a significant overhead for receivers; existing solutions are static, do not adapt dynamically to the communication pattern, and do not consider memory overhead. Solution: a new tag-matching design that dynamically adapts to communication patterns, uses different strategies for different ranks, and bases its decisions on the number of request objects that must be traversed before hitting the required one. Results: better performance than other state-of-the-art tag-matching schemes with minimum memory consumption (charts show normalized total tag-matching time and normalized memory overhead per process at 512 processes, relative to the default design, lower is better); will be available in future MVAPICH2 releases. Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016 [Best Paper Nominee]. A minimal sketch of the underlying bookkeeping follows.
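
To make the "number of request objects traversed" criterion concrete, here is a minimal sketch of the kind of bookkeeping an adaptive matcher can keep: a posted-receive list with a traversal counter, plus a heuristic for when a rank should be given its own bin. This illustrates the idea only; it is not the MVAPICH2 design, and the threshold is arbitrary.

```c
/* Minimal sketch of adaptive tag-matching bookkeeping: count how many posted
 * receives are traversed before a match, and use that signal to decide when a
 * source rank should switch to its own bin. Illustration only. */
#include <stddef.h>

struct posted_recv {
    int src, tag;
    void *buf;
    struct posted_recv *next;
};

struct match_list {
    struct posted_recv *head;
    long traversals;          /* entries inspected before hits (running total) */
    long matches;
};

/* Find and unlink the first posted receive matching (src, tag). */
struct posted_recv *match(struct match_list *ml, int src, int tag)
{
    struct posted_recv **pp = &ml->head;
    for (; *pp; pp = &(*pp)->next) {
        ml->traversals++;
        if (((*pp)->src == src || (*pp)->src == -1 /* ANY_SOURCE */) &&
            ((*pp)->tag == tag || (*pp)->tag == -1 /* ANY_TAG */)) {
            struct posted_recv *hit = *pp;
            *pp = hit->next;
            ml->matches++;
            return hit;
        }
    }
    return NULL;              /* unexpected message: queued elsewhere */
}

/* Heuristic: if the average traversal length grows, per-source bins pay off;
 * if it stays near 1, a single list costs less memory. Threshold is arbitrary. */
int should_use_per_source_bins(const struct match_list *ml)
{
    return ml->matches > 0 && ml->traversals / ml->matches > 4;
}
```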

18 Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale: scalability for million to billion processors; a unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...); integrated support for GPGPUs; accelerating graph processing (Mizan) with MPI-3 RMA; and optimized MVAPICH2 for KNL and Omni-Path.

19 MVAPICH2-X for Hybrid MPI + PGAS Applications. Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI, with possible deadlock if both runtimes are not progressed and higher network resource consumption. MVAPICH2-X instead provides a unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF, available since 2012 (starting with MVAPICH2-X 1.9). A minimal hybrid example follows.
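
A minimal sketch of the kind of hybrid MPI + OpenSHMEM program a unified runtime is meant to support; it assumes the runtime allows both models to be initialized in one process (as MVAPICH2-X is designed to do) and would be built with the corresponding compiler wrappers:

```c
/* Minimal sketch of a hybrid MPI + OpenSHMEM program. Illustrative only. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                          /* both models share one runtime */

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric heap allocation, reachable by one-sided puts from other PEs. */
    long *counter = shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* One-sided OpenSHMEM: every PE atomically increments PE 0's counter. */
    shmem_long_inc(counter, 0);
    shmem_barrier_all();

    /* Two-sided MPI on the same processes: reduce each PE's local value. */
    long local = *counter, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (me == 0)
        printf("PE 0 counter = %ld, sum over %d PEs = %ld\n",
               *counter, npes, total);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```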

20 UPC++ Support in MVAPICH2-X. A hybrid MPI + UPC++ application runs the UPC++ runtime over the GASNet interfaces and MPI over the MPI interfaces, with both mapped onto the MVAPICH2-X Unified Communication Runtime (UCR). Full and native support for hybrid MPI + UPC++ applications; better performance compared to the GASNet IBV and MPI conduits (inter-node broadcast on 64 nodes shows up to 14x improvement for MV2-X over GASNet_MPI/GASNET_IBV across message sizes from 1K to 1M bytes); OSU Micro-benchmarks (OMB) support for UPC++. Available since MVAPICH2-X 2.2rc1.

21 Application-Level Performance with Graph500 and Sort. Graph500 execution time (MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid MPI+OpenSHMEM at 4K, 8K, and 16K processes): the hybrid (MPI+OpenSHMEM) Graph500 design gives a 2.4X improvement over MPI-CSR and a 7.6X improvement over MPI-Simple at 8,192 processes, and a 1.5X improvement over MPI-CSR and a 13X improvement over MPI-Simple at 16,384 processes. Sort execution time (MPI vs. Hybrid for increasing input-data/process combinations up to 4TB-4K): with 4,096 processes and a 4 TB input, MPI takes 2408 sec (0.16 TB/min) while the hybrid (MPI+OpenSHMEM) design takes 1172 sec (0.36 TB/min), a 51% improvement over the MPI design. J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013. J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012. J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.

22 Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale: scalability for million to billion processors; a unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...); integrated support for GPGPUs (CUDA-aware MPI, GPUDirect RDMA (GDR) support, and control-flow decoupling through GPUDirect Async); accelerating graph processing (Mizan) with MPI-3 RMA; and optimized MVAPICH2 for KNL and Omni-Path.

23 GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU. Standard MPI interfaces are used for unified data movement: at the sender, MPI_Send(s_devbuf, size, ...); at the receiver, MPI_Recv(r_devbuf, size, ...). The library takes advantage of Unified Virtual Addressing (CUDA 4.0 and later) and overlaps data movement from the GPU with RDMA transfers inside MVAPICH2, giving high performance and high productivity.
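
Expanding the fragments above into a complete, hedged example: device buffers allocated with cudaMalloc are handed directly to MPI, and the staging or GPUDirect path is chosen inside the CUDA-aware library. The code assumes one GPU per process and a CUDA-aware MPI such as MVAPICH2-GDR (run with MV2_USE_CUDA=1):

```c
/* Minimal CUDA-aware MPI sketch: device buffers passed directly to MPI_Send /
 * MPI_Recv. Requires a CUDA-aware MPI library (e.g., MVAPICH2-GDR). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 4 << 20;                 /* 4 MB message */
    char *devbuf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);                      /* one GPU per process assumed */
    cudaMalloc((void **)&devbuf, n);
    cudaMemset(devbuf, rank, n);

    if (rank == 0) {
        /* Device pointer goes straight into MPI; no explicit cudaMemcpy. */
        MPI_Send(devbuf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        char first;
        cudaMemcpy(&first, devbuf, 1, cudaMemcpyDeviceToHost);
        printf("rank 1 received data, first byte = %d\n", first);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```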

24 CUDA-Aware MPI: MVAPICH2-GDR Releases. Support for MPI communication from NVIDIA GPU device memory; high-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU); high-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host, and Host-GPU), taking advantage of CUDA IPC (available since CUDA 4.1); optimized and tuned collectives for GPU device buffers; MPI datatype support for point-to-point and collective communication from GPU device buffers; and unified memory support.

25 Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR). GPU-GPU inter-node latency drops to 2.18 us with MV2-GDR 2.2, roughly 3X better than MV2-GDR 2.0b and about 10x better than MV2 without GDR; GPU-GPU inter-node bandwidth and bi-bandwidth improve by about 11x compared to MV2 without GDR (about 2X over MV2-GDR 2.0b). Charts show latency, bandwidth, and bi-bandwidth versus message size for MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 w/o GDR. Test bed: MVAPICH2-GDR 2.2 on an Intel Ivy Bridge (E5-2680 v2) node with 20 cores, an NVIDIA Tesla K40c GPU, a Mellanox ConnectX-4 EDR HCA, CUDA 8.0, and Mellanox OFED 3.0 with GPU-Direct RDMA.

26 Application-Level Evaluation (HOOMD-blue). Average time steps per second (TPS) for 64K-particle and 256K-particle runs as the number of processes increases: MV2+GDR achieves about 2X higher TPS than MV2. Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB), HOOMD-blue version 1.0.5, GDRCOPY enabled, with MV2_USE_CUDA=1, MV2_IBA_HCA=mlx5_0, MV2_IBA_EAGER_THRESHOLD=32768, MV2_VBUF_TOTAL_SIZE=32768, MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768, MV2_USE_GPUDIRECT_GDRCOPY=1, and MV2_USE_GPUDIRECT_GDRCOPY_LIMIT set.

27 Application-Level Evaluation (COSMO) and Weather Forecasting in Switzerland. Normalized execution time with the default, callback-based, and event-based designs on the Wilkes GPU cluster and the CSCS GPU cluster: 2X improvement on 32 GPU nodes and 30% improvement on 96 GPU nodes (8 GPUs/node). COSMO model: /tasks/operational/meteoswiss/. On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the COSMO application. C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16.

28 MVAPICH2-GDS: Preliminary Results. Latency-oriented (Send+kernel and Recv+kernel): the enhanced GDS design is able to hide the kernel launch overhead, with 25% improvement at 256 bytes compared to the default behavior. Throughput-oriented (back-to-back): communication and computation tasks are offloaded asynchronously to the queue, giving 14% improvement at 1KB message size. Test bed: Intel Sandy Bridge, NVIDIA K20, and Mellanox FDR HCA. Will be available in a public release soon.

29 Overview of a Few Challenges being Addressed by the MVAPICH2 Project for Exascale: scalability for million to billion processors; a unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ...); integrated support for GPGPUs; accelerating graph processing (Mizan) with MPI-3 RMA; and optimized MVAPICH2 for KNL and Omni-Path.

30 Mizan-RMA: Accelerating the Mizan Graph Processing Framework. Can communication be accelerated with the MPI one-sided programming model (RMA) and overlapped with computation? PageRank with the Arabic dataset (default vs. optimized, for varying numbers of cores) shows a 2.5X improvement on 4 nodes. The Mizan framework is available from: com/the-graph-blog/mizan/. Will be released soon. M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda, Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA, Int'l Conference on High Performance Computing (HiPC '16), to be presented. A sketch of the MPI-3 RMA style follows.
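
The MPI-3 RMA style referred to above can be sketched with generic MPI-3 calls (this is not the Mizan-RMA code itself): each process exposes a window, peers push updates with MPI_Put inside a passive-target epoch, and local computation can proceed before the puts are flushed.

```c
/* Minimal MPI-3 RMA sketch: passive-target one-sided puts into a window.
 * Generic MPI-3 code, not the Mizan-RMA implementation. */
#include <mpi.h>
#include <stdio.h>

#define SLOT 4                 /* ints reserved per origin rank */

int main(int argc, char **argv)
{
    int rank, size;
    int *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process exposes SLOT ints per peer through the window. */
    MPI_Win_allocate(size * SLOT * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    for (int i = 0; i < size * SLOT; i++)
        base[i] = -1;
    MPI_Barrier(MPI_COMM_WORLD);           /* window initialized everywhere */

    MPI_Win_lock_all(0, win);              /* passive-target epoch */

    int msg[SLOT] = { rank, rank, rank, rank };
    int target = (rank + 1) % size;
    /* Push data into the neighbor's window; no receive call on the target. */
    MPI_Put(msg, SLOT, MPI_INT, target, rank * SLOT, SLOT, MPI_INT, win);
    /* ... independent local computation can overlap here ... */
    MPI_Win_flush(target, win);            /* complete the put at the target */

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);           /* all puts visible before reading */

    int from = (rank + size - 1) % size;
    printf("rank %d received %d from rank %d\n", rank, base[from * SLOT], from);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```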

31 MVAPICH2 for Omni-Path and KNL. MVAPICH2 has been supporting QLogic/PSM for many years on all different compute platforms. The latest version (MVAPICH2 2.2 GA) supports Omni-Path (a derivative of QLogic IB) on the Xeon family and on KNL.

32 Enhanced Designs for KNL: the MVAPICH2 Approach. New designs take advantage of the idle cores (dynamically configurable), of the highly multithreaded cores, and of the MCDRAM of KNL processors. They are applicable to other programming models such as PGAS, task-based models, etc., and provide portability, performance, and applicability to the runtime as well as applications in a transparent manner. A sketch of MCDRAM placement from user code follows.
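
One way user-level code (or a runtime) can place selected buffers in KNL's MCDRAM when the node is configured in flat mode is through the memkind library; the slide does not say how MVAPICH2 does this internally, so the sketch below only illustrates the general mechanism (link with -lmemkind):

```c
/* Minimal sketch: placing a large communication buffer in KNL MCDRAM (flat
 * mode) with the memkind library, falling back to DDR when MCDRAM is absent.
 * memkind is an assumption for illustration; not the MVAPICH2 internals. */
#include <memkind.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 64UL << 20;                     /* 64 MB staging buffer */
    memkind_t kind = MEMKIND_HBW;                /* high-bandwidth (MCDRAM) */
    void *buf = memkind_malloc(kind, len);

    if (!buf) {                                  /* no MCDRAM: fall back to DDR */
        fprintf(stderr, "MCDRAM not available, using DDR\n");
        kind = MEMKIND_DEFAULT;
        buf = memkind_malloc(kind, len);
    }
    if (!buf)
        return 1;

    memset(buf, 0, len);                         /* touch pages so they are placed */
    printf("allocated %zu MB\n", len >> 20);

    memkind_free(kind, buf);
    return 0;
}
```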

33 Performance Benefits of the Enhanced Designs (KNL). Intra-node broadcast with a 64MB message: 27% lower latency than stock MVAPICH2 across process counts. 16-process intra-node all-to-all: 17.2% improvement for 1M-32M message sizes. Very-large-message bi-directional bandwidth: 52% improvement for 2M-64M message sizes. The new designs exploit the high concurrency and MCDRAM of KNL; significant improvements are seen for large message sizes, across varying message sizes as well as varying numbers of MPI processes.

34 Performance Benefits of the Enhanced Designs. CNTK MLP training time on MNIST (batch size 64): about 15% reduction with the optimized design (MV2_Opt_DRAM vs. MV2_Def_DRAM) across several MPI-processes : OMP-threads combinations. Multi-pair bandwidth with 32 MPI processes: about 30% improvement for 1M-64M messages when comparing the optimized and default designs on DRAM and MCDRAM. The training-time benefits are observed for a multi-layer perceptron (MLP) model on the MNIST dataset using the CNTK deep learning framework. The enhanced designs will be available in upcoming MVAPICH2 releases.

35 MVAPICH2 Plans for Exascale: performance and memory scalability toward 1M cores; hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + UPC++, MPI + CAF, ...) and MPI + Task*; enhanced optimization for GPU support and accelerators; taking advantage of advanced features of the Mellanox InfiniBand Switch-IB2 (SHArP*, GID-based support*); enhanced communication schemes for upcoming architectures (Knights Landing with MCDRAM*, NVLINK*, CAPI*); extended topology-aware collectives; extended energy-aware designs and virtualization support; extended support for the MPI Tools Interface (as in MPI 3.1); and extended checkpoint-restart and migration support with SCR. (Features marked * will be available in future MVAPICH2 releases.)

36 Three Major Computing Categories. Scientific Computing: Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model; many discussions towards Partitioned Global Address Space (PGAS) models (UPC, OpenSHMEM, CAF, UPC++, etc.); hybrid programming: MPI + PGAS (OpenSHMEM, UPC, UPC++). Deep Learning: Caffe, CNTK, TensorFlow, and many more. Big Data/Enterprise/Commercial Computing: focuses on large data and data analysis; Spark and Hadoop (HDFS, HBase, MapReduce); Memcached is also used for Web 2.0.

37 Deep Learning: New Challenges for MPI Runtimes. Deep learning frameworks are a different game altogether: unusually large message sizes (on the order of megabytes) and most communication based on GPU buffers. How to address these newer requirements? GPU-specific communication libraries: NVIDIA's NCCL provides inter-GPU communication within a node (ring-based, routed across the PLX switches and CPU). CUDA-Aware MPI (MVAPICH2-GDR) provides support for GPU-based communication, with k-nomial inter-node communication. Can we exploit CUDA-Aware MPI and NCCL together to support deep learning applications through hierarchical communication (k-nomial inter-node + NCCL ring intra-node)?

38 Efficient Broadcast: MVAPICH2-GDR and NCCL. NCCL has some limitations: it only works within a single node (no scale-out across multiple nodes) and degrades across the IOH (socket) for scale-up within a node. We propose an optimized MPI_Bcast for communication of very large GPU buffers (on the order of megabytes) that scales out on large numbers of dense multi-GPU nodes, using hierarchical communication that efficiently exploits (1) the NCCL broadcast primitive within a node and (2) CUDA-Aware MPI_Bcast in MV2-GDR across nodes. Performance benefits: about 25% average improvement for the Microsoft CNTK DL framework, and roughly an order of magnitude on OSU micro-benchmarks (MV2-GDR-Opt vs. MV2-GDR, message sizes from 32K to 128M, increasing GPU counts). Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI '16), Sep 2016 [Best Paper Runner-Up]. A sketch of the hierarchical idea follows.
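
A hedged sketch of the hierarchical idea, using only public MPI and NCCL calls; this is not the MVAPICH2-GDR implementation, and it assumes one GPU per MPI process, a CUDA-aware MPI library, and that the global root is a node leader:

```c
/* Minimal sketch of a hierarchical broadcast: CUDA-aware MPI_Bcast among one
 * leader per node, then an NCCL broadcast inside each node. Illustrative only. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int count = 32 << 20;            /* 32M floats (~128 MB) */
    int rank, node_rank, node_size;
    float *devbuf;
    MPI_Comm node_comm, leader_comm;
    ncclComm_t nccl_comm;
    ncclUniqueId id;
    cudaStream_t stream;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split ranks by node, and build a communicator of node leaders. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    cudaSetDevice(node_rank);              /* one GPU per process on the node */
    cudaStreamCreate(&stream);
    cudaMalloc((void **)&devbuf, count * sizeof(float));

    /* One NCCL communicator per node, bootstrapped over the node communicator. */
    if (node_rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, node_comm);
    ncclCommInitRank(&nccl_comm, node_size, id, node_rank);

    /* Step 1: inter-node broadcast among leaders (CUDA-aware MPI on devbuf). */
    if (node_rank == 0)
        MPI_Bcast(devbuf, count, MPI_FLOAT, 0, leader_comm);

    /* Step 2: intra-node broadcast from each leader over NCCL's ring. */
    ncclBcast(devbuf, count, ncclFloat, 0, nccl_comm, stream);
    cudaStreamSynchronize(stream);

    ncclCommDestroy(nccl_comm);
    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```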

39 Efficient Reduce: MVAPICH2-GDR. Can we optimize MVAPICH2-GDR to efficiently support DL frameworks? We need to design large-scale reductions using CUDA-awareness: the GPU performs the reduction using kernels, computation is overlapped with communication, and the designs are hierarchical. The proposed designs achieve a 2.5x speedup over MVAPICH2-GDR and 133x over OpenMPI for optimized large-size reductions (latency for large messages up to 128 MB, log scale).

40 Large Message Optimized Collectives for Deep Learning. MV2-GDR provides optimized collectives for large message sizes (optimized Reduce, Allreduce, and Bcast) with good scaling to large numbers of GPUs: charts show latency versus message size (MB) for Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs, and latency versus number of GPUs for 64 MB Reduce, large-message Allreduce, and 128 MB Bcast. Available in MVAPICH2-GDR 2.2 GA.

41 OSU-Caffe: Scalable Deep Learning. Caffe is a flexible and layered deep learning framework; its benefits and weaknesses: multi-GPU training within a single node, performance degradation for GPUs across different sockets, and limited scale-out. OSU-Caffe provides MPI-based parallel training, enabling scale-up (within a node) and scale-out (across multi-GPU nodes): scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset, and on 128 GPUs for training the GoogLeNet network on the ImageNet dataset. The chart shows GoogLeNet (ImageNet) training time on up to 128 GPUs for Caffe (an invalid use case at that scale), OSU-Caffe (batch size 1024), and OSU-Caffe (batch size 2048). OSU-Caffe is publicly available.

42 Three Major Computing Categories. Scientific Computing: Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model; many discussions towards Partitioned Global Address Space (PGAS) models (UPC, OpenSHMEM, CAF, etc.); hybrid programming: MPI + PGAS (OpenSHMEM, UPC, UPC++). Deep Learning: Caffe, CNTK, TensorFlow, and many more. Big Data/Enterprise/Commercial Computing: focuses on large data and data analysis; Spark and Hadoop (HDFS, HBase, MapReduce); Memcached is also used for Web 2.0.

43 How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications? What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)? Can those bottlenecks be alleviated with new designs that take advantage of HPC technologies? Can RDMA-enabled high-performance interconnects benefit Big Data processing? Can HPC clusters with high-performance storage systems (e.g., SSDs, parallel file systems) benefit Big Data applications? How much performance benefit can be achieved through enhanced designs? How should benchmarks be designed for evaluating the performance of Big Data middleware on HPC clusters? The goal is to bring HPC and Big Data processing onto a convergent trajectory!

44 Designing Communication and I/O Libraries for Big Data Systems: Challenges. Applications and benchmarks run over Big Data middleware (HDFS, MapReduce, HBase, Spark, and Memcached) on top of today's programming models (sockets); whether other protocols and upper-level changes are needed is an open question. The communication and I/O library must address point-to-point communication, threaded models and synchronization, I/O and file systems, QoS, virtualization, and fault-tolerance, while exploiting networking technologies (InfiniBand, 1/10/40/100 GigE, and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD, SSD, and NVMe-SSD).

45 The High-Performance Big Data (HiBD) Project: RDMA for Apache Spark; RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), with plugins for the Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions; RDMA for Apache HBase; RDMA for Memcached (RDMA-Memcached); RDMA for Apache Hadoop 1.x (RDMA-Hadoop); OSU HiBD-Benchmarks (OHB), with HDFS, Memcached, and HBase micro-benchmarks; and RDMA for Impala (upcoming). Available for InfiniBand and RoCE. User base: 195 organizations from 27 countries; more than 18,600 downloads from the project site.

46 Different Modes of RDMA for Apache Hadoop 2.x. HHH: heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation, for better fault-tolerance as well as performance; this mode is enabled by default in the package. HHH-M: a high-performance in-memory based setup that performs all I/O operations in memory to obtain as much performance benefit as possible. HHH-L: with parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster. HHH-L-BB: this mode deploys a Memcached-based burst-buffer system to reduce the bandwidth bottleneck of shared file system access; the burst buffer is hosted by Memcached servers, each with a local SSD. MapReduce over Lustre, with/without local disks: besides the HDFS-based solutions, the package also supports running MapReduce jobs on top of Lustre alone, in two modes (with local disks and without local disks). Running with Slurm and PBS: supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

47 Acceleration Case Studies and Performance Evaluation: RDMA-based designs and performance evaluation for HDFS, MapReduce, and Spark.

48 Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H). Design features: three modes (default HHH, in-memory HHH-M, and Lustre-integrated HHH-L); data placement policies, hybrid replication, and eviction/promotion across heterogeneous storage (RAM disk, SSD, HDD, Lustre); policies to efficiently utilize the heterogeneous storage devices based on data usage patterns; and Lustre-based fault-tolerance in the Lustre-integrated mode. N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015.

49 Design Overview of MapReduce with RDMA. The Hadoop MapReduce components (JobTracker, TaskTracker, Map, Reduce) use the Java socket interface over 1/10/40/100 GigE or IPoIB networks, while the OSU design plugs in through the Java Native Interface (JNI) and uses verbs over RDMA-capable networks (IB, iWARP, RoCE, ...). Design features: RDMA-based shuffle; prefetching and caching of map output; efficient shuffle algorithms; in-memory merge; on-demand shuffle adjustment; advanced overlapping of map, shuffle, and merge as well as shuffle, merge, and reduce; on-demand connection setup; and InfiniBand/RoCE support. This enables high-performance RDMA communication while still supporting the traditional socket interface; the JNI layer bridges the Java-based MapReduce with a communication library written in native code, as sketched below. M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.
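
The JNI bridge mentioned above can be sketched generically. All names below (the org.example.rdma.NativeShuffle class and the rdma_engine_send function) are hypothetical stand-ins for illustration; this is not the RDMA-for-Apache-Hadoop source:

```c
/* Hypothetical JNI bridge sketch: how a Java shuffle engine could hand a
 * direct ByteBuffer to a native RDMA-capable communication library. */
#include <jni.h>
#include <stddef.h>

/* Stand-in for the native communication engine such a package would ship. */
extern int rdma_engine_send(int peer_id, const void *buf, size_t len);

/*
 * Java side (for reference):
 *   package org.example.rdma;
 *   public class NativeShuffle {
 *       public static native int send(int peerId, java.nio.ByteBuffer data, int len);
 *   }
 */
JNIEXPORT jint JNICALL
Java_org_example_rdma_NativeShuffle_send(JNIEnv *env, jclass cls,
                                         jint peer_id, jobject data, jint len)
{
    /* Direct ByteBuffers expose their memory without a copy, which is what
     * lets the native engine post the buffer for RDMA directly. */
    void *buf = (*env)->GetDirectBufferAddress(env, data);
    if (buf == NULL)
        return -1;    /* not a direct buffer: caller must copy or fail */

    return (jint) rdma_engine_send((int) peer_id, buf, (size_t) len);
}
```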

50 Performance Benefits: RandomWriter and TeraGen on TACC Stampede (cluster with 32 nodes and a total of 128 maps, OSU-IB vs. IPoIB over FDR). RandomWriter: 3-4x improvement over IPoIB for 80-120 GB file sizes (execution time reduced by about 3x). TeraGen: 4-5x improvement over IPoIB for 80-120 GB file sizes (execution time reduced by about 4x).

51 Performance Benefits: Sort and TeraSort on TACC Stampede (OSU-IB (FDR) vs. IPoIB (FDR)). Sort with a single HDD per node (32 nodes, 128 maps and 57 reduces): 40-52% improvement over IPoIB for 80-120 GB data (execution time reduced by up to 52%). TeraSort with a single HDD per node (32 nodes, 128 maps and 64 reduces): 42-44% improvement over IPoIB for 80-120 GB data (reduced by up to 44%).

52 Acceleration Case Studies and Performance Evaluation: RDMA-based designs and performance evaluation for HDFS, MapReduce, and Spark.

53 Design Overview of Spark with RDMA. In Apache Spark, benchmarks/applications/libraries/frameworks run over Spark Core, whose shuffle manager (Sort, Hash, Tungsten-Sort) uses a block transfer service (Netty, NIO, or the RDMA plugin); the Netty and NIO servers and clients use the Java socket interface over 1/10/40/100 GigE or IPoIB networks, while the RDMA server and client go through the Java Native Interface (JNI) to a native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE, ...). Design features: an RDMA-based shuffle plugin, a SEDA-based architecture, dynamic connection management and sharing, non-blocking data transfer, off-JVM-heap buffer management, and InfiniBand/RoCE support. This enables high-performance RDMA communication while supporting the traditional socket interface; the JNI layer bridges the Scala-based Spark with a communication library written in native code. X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014. X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016.

54 Performance Evaluation on SDSC Comet: SortBy/GroupBy. RDMA-based design for Spark, RDMA vs. IPoIB (56 Gbps), with 1536 concurrent tasks and a single SSD per node on 64 worker nodes (1536 cores, 1536 maps and 1536 reduces), InfiniBand FDR. SortBy: total time reduced by up to 80% over IPoIB. GroupBy: total time reduced by up to 74% over IPoIB.

55 Performance Evaluation on SDSC Comet: HiBench PageRank. RDMA-based design for Spark, RDMA vs. IPoIB (56 Gbps), with 768/1536 concurrent tasks and a single SSD per node on 32/64 worker nodes (768/1536 cores), InfiniBand FDR, for the Huge, BigData, and Gigantic data sizes. 32 nodes / 768 cores: total PageRank time reduced by 37% over IPoIB. 64 nodes / 1536 cores: total PageRank time reduced by 43% over IPoIB.

56 Performance Evaluation on SDSC Comet: Astronomy Application. The Kira Toolkit [1] is a distributed astronomy image-processing toolkit implemented using Apache Spark. The source-extractor application uses a 65 GB dataset from the SDSS DR2 survey comprising 11,150 image files; RDMA Spark is compared with the standard Apache implementation using IPoIB, showing a 21% reduction in execution time for the Kira SE benchmark (65 GB dataset, 48 cores). [1] Z. Zhang, K. Barbary, F. A. Nothaft, E. R. Sparks, M. J. Franklin, D. A. Patterson, S. Perlmutter, Scientific Computing Meets Big Data Technology: An Astronomy Use Case, CoRR, Aug 2015. M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE '16, July 2016.

57 Looking into the Future. Exascale systems will be constrained by power, memory per core, data movement cost, and faults. Programming models, runtimes, and middleware need to be designed for scalability, performance, fault-resilience, energy-awareness, programmability, and productivity. This talk highlighted some of the issues and challenges; continuous innovation is needed on all of these fronts.

58 Funding Acknowledgments: funding support by and equipment support by the organizations whose logos appear on the original slide.

59 HPC-Connection (SC 16) 59 Personnel Acknowledgments Current Students A. Awan (Ph.D.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) J. Hashmi (Ph.D.) N. Islam (Ph.D.) M. Li (Ph.D.) M. Rahman (Ph.D.) D. Shankar (Ph.D.) A. Venkatesh (Ph.D.) J. Zhang (Ph.D.) Current Research Scientists K. Hamidouche X. Lu H. Subramoni Current Research Specialist J. Smith Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) W. Huang (Ph.D.) W. Jiang (M.S.) J. Jose (Ph.D.) S. Kini (M.S.) M. Koop (Ph.D.) K. Kulkarni (M.S.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) K. Kandalla (Ph.D.) P. Lai (M.S.) J. Liu (Ph.D.) M. Luo (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) V. Meshram (M.S.) A. Moody (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) X. Ouyang (Ph.D.) S. Pai (M.S.) S. Potluri (Ph.D.) R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Research Scientist S. Sur Past Programmers D. Bureddy M. Arnold J. Perkins Past Post-Docs D. Banerjee X. Besseron H.-W. Jin J. Lin M. Luo E. Mancini S. Marcarelli J. Vienne H. Wang

60 Thank You! panda@cse.ohio-state.edu. Network-Based Computing Laboratory; The MVAPICH2 Project; The High-Performance Big Data (HiBD) Project.


More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Jie Zhang Dr. Dhabaleswar K. Panda (Advisor) Department of Computer Science & Engineering The

More information

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies GPU Technology Conference (GTC 218) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15)

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer

More information

Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks

Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks Panel PresentaAon at HPBDC 17 by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

How to Tune and Extract Higher Performance with MVAPICH2 Libraries?

How to Tune and Extract Higher Performance with MVAPICH2 Libraries? How to Tune and Extract Higher Performance with MVAPICH2 Libraries? An XSEDE ECSS webinar by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Big Data Analytics with the OSU HiBD Stack at SDSC. Mahidhar Tatineni OSU Booth Talk, SC18, Dallas

Big Data Analytics with the OSU HiBD Stack at SDSC. Mahidhar Tatineni OSU Booth Talk, SC18, Dallas Big Data Analytics with the OSU HiBD Stack at SDSC Mahidhar Tatineni OSU Booth Talk, SC18, Dallas Comet HPC for the long tail of science iphone panorama photograph of 1 of 2 server rows Comet: System Characteristics

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Talk at Storage Developer Conference SNIA 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Talk at HPCAC-Switzerland (Mar 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at OSU Booth (SC 18) by Hari Subramoni The Ohio State University E-mail: subramon@cse.ohio-state.edu http://www.cse.ohio-state.edu/~subramon

More information

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015)

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015) High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 214 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries

Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries Talk at Overlapping Communication with Computation Symposium (April 18) by Dhabaleswar K. (DK) Panda The Ohio

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Unifying UPC and MPI Runtimes: Experience with MVAPICH

Unifying UPC and MPI Runtimes: Experience with MVAPICH Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators

More information

MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand

MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 213 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Acceleration for Big Data, Hadoop and Memcached

Acceleration for Big Data, Hadoop and Memcached Acceleration for Big Data, Hadoop and Memcached A Presentation at HPC Advisory Council Workshop, Lugano 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Programming Models for Exascale Systems

Programming Models for Exascale Systems Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries A Tutorial at MVAPICH User Group 217 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Job Startup at Exascale:

Job Startup at Exascale: Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core

More information

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Keynote Talk at HPSR 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Software Libraries and Middleware for Exascale Systems

Software Libraries and Middleware for Exascale Systems Software Libraries and Middleware for Exascale Systems Talk at HPC Advisory Council China Workshop (212) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information