Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities

1 Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities Keynote Talk at SEA Symposium (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University

2 SEA Symposium (April 18) 2 High-End Computing (HEC): Towards Exascale [chart: growth of high-end computing from 1 PFlops towards 1 EFlops; an ExaFlop system is expected in the coming years!]

3 SEA Symposium (April 18) 3 Increasing Usage of HPC, Big Data and Deep Learning HPC (MPI, RDMA, Lustre, etc.) Big Data (Hadoop, Spark, HBase, Memcached, etc.) Deep Learning (Caffe, TensorFlow, BigDL, etc.) Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!!

4 SEA Symposium (April 18) 4 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure? Physical Compute

7 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure? SEA Symposium (April 18) 7 Hadoop Job Deep Learning Job Spark Job

8 SEA Symposium (April 18) 8 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

9 Parallel Programming Models Overview [diagram: Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI), and Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, ...)] Programming models provide abstract machine models Models can be mapped on different types of systems, e.g. Distributed Shared Memory (DSM), MPI within a node, etc. PGAS models and Hybrid MPI+PGAS models are gradually gaining importance SEA Symposium (April 18) 9

10 SEA Symposium (April 18) 10 Partitioned Global Address Space (PGAS) Models Key features - Simple shared memory abstractions - Lightweight one-sided communication - Easier to express irregular communication Different approaches to PGAS - Languages (Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel) - Libraries (OpenSHMEM, UPC++, Global Arrays)
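
To make the one-sided PGAS style concrete, here is a minimal OpenSHMEM sketch in C (a hedged illustration, not taken from the talk): each PE writes directly into a symmetric variable on its neighbor with a put, and no matching receive is posted on the target.

    #include <shmem.h>
    #include <stdio.h>

    long src[1], dst[1];              /* symmetric arrays: remotely accessible on every PE */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        src[0] = me;
        dst[0] = -1;

        /* One-sided put: deposit my rank into dst[0] on my right neighbor.
           The target PE does not post a matching receive. */
        shmem_long_put(dst, src, 1, (me + 1) % npes);

        shmem_barrier_all();          /* complete outstanding puts and synchronize */
        printf("PE %d received %ld\n", me, dst[0]);

        shmem_finalize();
        return 0;
    }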

11 SEA Symposium (April 18) 11 Hybrid (MPI+PGAS) Programming Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics Benefits: Best of Distributed Computing Model Best of Shared Memory Computing Model HPC Application Kernel 1 MPI Kernel 2 PGAS MPI Kernel 3 MPI Kernel N PGAS MPI
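
As a hedged sketch of the hybrid model (assuming a unified runtime such as MVAPICH2-X, where MPI and OpenSHMEM can coexist in one job), an irregular kernel can use one-sided OpenSHMEM puts while a regular kernel uses an MPI collective:

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    long cell = 0;                     /* symmetric variable for one-sided access */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);        /* initialization order may vary by implementation */
        shmem_init();

        int me = shmem_my_pe(), npes = shmem_n_pes();

        /* Kernel 1 (irregular communication): one-sided put to the right neighbor */
        long val = 10L * me;
        shmem_long_put(&cell, &val, 1, (me + 1) % npes);
        shmem_barrier_all();

        /* Kernel 2 (regular communication): collective reduction with MPI */
        long sum = 0;
        MPI_Allreduce(&cell, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (me == 0) printf("global sum = %ld\n", sum);

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }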

12 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges Application Kernels/Applications Point-to-point Communication Programming Models MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc. Networking Technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path) Middleware Communication Library or Runtime for Programming Models Collective Communication Energy-Awareness Synchronization and Locks Multi-/Many-core Architectures I/O and File Systems Fault Tolerance Accelerators (GPU and FPGA) Co-Design Opportunities and Challenges across Various Layers Performance Scalability Resilience SEA Symposium (April 18) 12

13 Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale Scalability for million to billion processors Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) Scalable job start-up Scalable Collective communication Offload Non-blocking Topology-aware Balancing intra-node and inter-node communication for next generation nodes ( cores) Multiple end-points per node Support for efficient multi-threading Integrated Support for GPGPUs and Accelerators Fault-tolerance/resiliency QoS support for communication and I/O Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, ) Virtualization Energy-Awareness SEA Symposium (April 18) 13
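
For the non-blocking collective item above, the basic pattern looks roughly like the following C sketch: the reduction is started with MPI_Iallreduce, independent computation proceeds, and the result is collected with MPI_Wait (the progress may be offloaded to the network, depending on the MPI library and fabric).

    #include <mpi.h>

    /* Hedged sketch: overlap independent work with a non-blocking Allreduce. */
    void overlapped_allreduce(double *sendbuf, double *recvbuf, int count,
                              void (*independent_work)(void))
    {
        MPI_Request req;
        MPI_Iallreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        independent_work();                  /* computation that does not need the result */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result is now valid in recvbuf */
    }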

14 SEA Symposium (April 18) 14 Additional Challenges for Designing Exascale Software Libraries Extreme Low Memory Footprint Memory per core continues to decrease D-L-A Framework Discover Overall network topology (fat-tree, 3D, ), Network topology for processes for a given job Node architecture, Health of network and node Learn Impact on performance and scalability Potential for failure Adapt Internal protocols and algorithms Process mapping Fault-tolerance solutions Low overhead techniques while delivering performance, scalability and fault-tolerance

15 SEA Symposium (April 18) 15 Overview of the MVAPICH2 Project High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE) MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 MVAPICH2-X (MPI + PGAS), Available since 2011 Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 Support for Virtualization (MVAPICH2-Virt), Available since 2015 Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015 Used by more than 2,875 organizations in 86 countries More than 460,000 (> 0.46 million) downloads from the OSU site directly Empowering many TOP500 clusters (Nov '17 ranking) 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China 9th, 556,104 cores (Oakforest-PACS) in Japan 12th, 368,928-core (Stampede2) at TACC 17th, 241,108-core (Pleiades) at NASA 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology Available with software stacks of many vendors and Linux Distros (RedHat and SuSE) Empowering TOP500 systems for over a decade

16 MVAPICH2 Release Timeline and Downloads [chart: cumulative number of downloads from Sep 2004 to Jan 2018, annotated with release points including MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV 1.1, MV2 1.4 through MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, and OSU INAM] SEA Symposium (April 18) 16

17 Architecture of MVAPICH2 Software Family High Performance Parallel Programming Models Message Passing Interface (MPI) PGAS (UPC, OpenSHMEM, CAF, UPC++) Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk) High Performance and Scalable Communication Runtime Diverse APIs and Mechanisms Point-to-point Primitives Collectives Algorithms Job Startup Energy-Awareness Remote Memory Access I/O and File Systems Fault Tolerance Virtualization Active Messages Introspection & Analysis Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU) Transport Protocols: RC, XRC, UD, DC, UMR, ODP Modern Features: SR-IOV, Multi-Rail Shared Memory Transport Mechanisms: CMA, IVSHMEM, XPMEM* Modern Features: MCDRAM*, NVLink*, CAPI* (* Upcoming) SEA Symposium (April 18) 17

18 SEA Symposium (April 18) 18 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Support for highly-efficient inter-node and intra-node communication Scalable Start-up Optimized Collectives using SHArP and Multi-Leaders Optimized CMA-based Collectives Upcoming Optimized XPMEM-based Collectives Performance Engineering with MPI-T Support Integrated Network Analysis and Monitoring Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

19 One-way Latency: MPI over IB with MVAPICH2 [charts: small- and large-message one-way latency (us) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path] Platforms: Deca-core Intel nodes (Haswell for TrueScale-QDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path; IvyBridge for ConnectX-3-FDR), PCI Gen3, with the corresponding IB or Omni-Path switch SEA Symposium (April 18) 19
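
These latency numbers come from osu_latency-style ping-pong measurements; a minimal sketch of that measurement loop in C (hedged, not the actual OSU benchmark source) is:

    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000
    #define SKIP  1000
    #define SIZE  8                          /* message size in bytes */

    int main(int argc, char **argv) {
        char buf[SIZE];
        int rank;
        double t0 = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < ITERS + SKIP; i++) {
            if (i == SKIP) t0 = MPI_Wtime();             /* start timing after warm-up */
            if (rank == 0) {
                MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            double one_way_us = (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS);
            printf("%d bytes: %.2f us one-way latency\n", SIZE, one_way_us);
        }
        MPI_Finalize();
        return 0;
    }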

20 Bandwidth: MPI over IB with MVAPICH2 [charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path; peak bidirectional bandwidths of 24,136, 22,564, 21,983, 12,161, and 6,228 MB/s across the adapters] Platforms: Deca-core Intel (Haswell/IvyBridge) nodes, PCI Gen3, with the corresponding IB or Omni-Path switch SEA Symposium (April 18) 20

21 Startup Performance on KNL + Omni-Path [chart: MPI_Init and Hello World time (seconds) vs. number of processes, 1K to 64K, on Oakforest-PACS with MVAPICH2-2.3a] MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2 full scale) 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC) At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS) All numbers reported with 64 processes per node New designs available since MVAPICH2-2.3a and as patch for SLURM 15, 16, and 17 SEA Symposium (April 18) 21

22 Advanced Allreduce Collective Designs Using SHArP [charts: Avg DDOT Allreduce time of HPCG and Mesh Refinement Time of MiniAMR vs. (Number of Nodes, PPN) of (4,28), (8,28), and (16,28); MVAPICH2-SHArP improves over MVAPICH2 by up to 12-13%] SHArP support is available since MVAPICH2 2.3a M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. SEA Symposium (April 18) 22

23 Performance of MPI_Allreduce On Stampede2 (10,240 Processes) [charts: MPI_Allreduce latency (us) vs. message size, up to 256K, for MVAPICH2, MVAPICH2-OPT, and IMPI; OSU Micro Benchmark, 64 PPN] For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available in MVAPICH2-X 2.3b SEA Symposium (April 18) 23
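
The OSU micro-benchmark measurement behind these Allreduce numbers is essentially a timed loop over MPI_Allreduce; a hedged sketch in C:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        const int bytes = 32 * 1024;                 /* 32K-byte message, as on the slide */
        const int count = bytes / sizeof(float);
        const int iters = 1000, skip = 100;

        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *sbuf = malloc(bytes), *rbuf = malloc(bytes);
        for (int i = 0; i < count; i++) sbuf[i] = 1.0f;

        double t0 = 0.0;
        for (int i = 0; i < iters + skip; i++) {
            if (i == skip) {                          /* warm-up done: sync and start timer */
                MPI_Barrier(MPI_COMM_WORLD);
                t0 = MPI_Wtime();
            }
            MPI_Allreduce(sbuf, rbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("avg MPI_Allreduce latency: %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6 / iters);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }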

24 Optimized CMA-based Collectives for Large Messages [charts: MPI_Gather latency on 2, 4, and 8 KNL nodes (128, 256, and 512 procs, 64 PPN), message sizes 1K-4M; Tuned CMA is roughly 2.5x to 17x better than MVAPICH2-2.3a, Intel MPI 2017, and OpenMPI 2.1.0 depending on scale] Significant improvement over existing implementations for Scatter/Gather with 1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPower) New two-level algorithms for better scalability Improved performance for other collectives (Bcast, Allgather, and Alltoall) S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster 17, BEST Paper Finalist Available in MVAPICH2-X 2.3b SEA Symposium (April 18) 24

25 Shared Address Space (XPMEM)-based Collectives Design [charts: OSU_Allreduce and OSU_Reduce latency (us) on Broadwell with 256 procs, message sizes 16K-4M, for MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-Opt] Shared Address Space-based true zero-copy Reduction collective designs in MVAPICH2 Offloaded computation/communication to peer ranks in reduction collective operation Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Will be available in a future release SEA Symposium (April 18) 25

26 Performance Engineering Applications using MVAPICH2 and TAU Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU Control the runtime's behavior via MPI Control Variables (CVARs) Introduced support for new MPI_T based CVARs to MVAPICH2 MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications VBUF usage without CVAR based tuning as displayed by ParaProf VBUF usage with CVAR based tuning as displayed by ParaProf S. Ramesh, A. Maheo, S. Shende, A. Malony, H. Subramoni, and D. K. Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, EuroMPI, 2017 [Best Paper Award] SEA Symposium (April 18) 26
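
To illustrate the MPI_T usage pattern, here is a hedged C sketch using the standard MPI-3 tool interface; the CVAR name below is one of those listed on the slide, and its availability, scope, and legal values depend on the MVAPICH2 build.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Look up a control variable by name via the standard MPI_T interface. */
    static int find_cvar_index(const char *wanted) {
        int num;
        MPI_T_cvar_get_num(&num);
        for (int i = 0; i < num; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype, &enumtype,
                                desc, &desc_len, &bind, &scope);
            if (strcmp(name, wanted) == 0) return i;
        }
        return -1;
    }

    int main(int argc, char **argv) {
        int provided, count;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        /* Set the CVAR before MPI_Init, the way a tool would for an
           uninstrumented application; some CVARs are read-only afterwards. */
        int idx = find_cvar_index("MPIR_CVAR_VBUF_POOL_SIZE");
        if (idx >= 0) {
            MPI_T_cvar_handle handle;
            int new_size = 1024;                 /* illustrative value only */
            MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);
            MPI_T_cvar_write(handle, &new_size);
            MPI_T_cvar_handle_free(&handle);
        } else {
            fprintf(stderr, "CVAR not exposed by this MPI library\n");
        }

        MPI_Init(&argc, &argv);
        /* ... application communication ... */
        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }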

27 SEA Symposium (April 18) 27 Overview of OSU INAM A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes Capability to analyze and profile node-level, job-level and process-level activities for MPI communication Point-to-Point, Collectives and RMA Ability to filter data based on type of counters using drop down list Remotely monitor various metrics of MPI processes at user specified granularity "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X Visualize the data transfer happening in a live or historical fashion for entire network, job or set of nodes OSU INAM v0.9.3 released on 3/16/2018 Enhance INAMD to query end nodes based on command line option Add a web page to display size of the database in real-time Enhance interaction between the web application and SLURM job launcher for increased portability Improve packaging of web application and daemon to ease installation

28 OSU INAM Features --- Clustered View Finding Routes Between Nodes (1,879 nodes, 212 switches, 4,377 network links) Show network topology of large clusters Visualize traffic pattern on different links Quickly identify congested links/links in error state See the history unfold play back historical state of the network SEA Symposium (April 18) 28

29 OSU INAM Features (Cont.) Visualizing a Job (5 Nodes) Job level view Show different network metrics (load, error, etc.) for any live job Play back historical data for completed jobs to identify bottlenecks Node level view - details per process or per node CPU utilization for each rank/node Bytes sent/received for MPI operations (pt-to-pt, collective, RMA) Network metrics (e.g. XmitDiscard, RcvError) per rank/node Estimated Process Level Link Utilization Estimated Link Utilization view Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1 Job level and Process level SEA Symposium (April 18) 29

30 SEA Symposium (April 18) 3 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

31 MVAPICH2-X for Hybrid MPI + PGAS Applications Current Model Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI Possible deadlock if both runtimes are not progressed Consumes more network resources Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF Available since 2012 (starting with MVAPICH2-X 1.9) SEA Symposium (April 18) 31

32 Application Level Performance with Graph500 and Sort [chart: Graph500 execution time for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) at 4K, 8K, and 16K processes] Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design 8,192 processes - 2.4X improvement over MPI-CSR - 7.6X improvement over MPI-Simple 16,384 processes - 1.5X improvement over MPI-CSR - 13X improvement over MPI-Simple [chart: Sort execution time for MPI and Hybrid with input data/process counts of 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K] Performance of Hybrid (MPI+OpenSHMEM) Sort Application 4,096 processes, 4 TB Input Size - MPI 2408 sec; 0.16 TB/min - Hybrid 1172 sec; 0.36 TB/min - 51% improvement over MPI-design J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC 13), June 2013 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012 J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013. SEA Symposium (April 18) 32

33 SEA Symposium (April 18) 33 Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation Time (s) Heat-2D Kernel using Jacobi method KNL (Default) KNL (AVX-512) KNL (AVX-512+MCDRAM) Broadwell Time (s) Heat Image Kernel KNL (Default) KNL (AVX-512) KNL (AVX-512+MCDRAM) Broadwell No. of processes No. of processes On heat diffusion based kernels AVX-512 vectorization showed better performance MCDRAM showed significant benefits on Heat-Image kernel for all process counts. Combined with AVX-512 vectorization, it showed up to 4X improved performance

34 SEA Symposium (April 18) 34 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs CUDA-aware MPI GPUDirect RDMA (GDR) Support Support for Streaming Applications Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

35 GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU Standard MPI interfaces used for unified data movement Takes advantage of Unified Virtual Addressing (>= CUDA 4.) Overlaps data movement from GPU with RDMA transfers At Sender: MPI_Send(s_devbuf, size, ); inside MVAPICH2 At Receiver: MPI_Recv(r_devbuf, size, ); High Performance and High Productivity SEA Symposium (April 18) 35
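
A hedged, self-contained version of the pattern sketched above (device pointers passed straight to MPI_Send/MPI_Recv; assumes a CUDA-aware MPI such as MVAPICH2-GPU/MVAPICH2-GDR, and buffer sizes are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;               /* 1M floats */
        float *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(float));

        /* Device pointers go directly into MPI calls; no explicit
           cudaMemcpy staging through host memory is needed. */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);       /* s_devbuf */
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                /* r_devbuf */

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }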

36 CUDA-Aware MPI: MVAPICH2-GDR Releases Support for MPI communication from NVIDIA GPU device memory High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) High performance intra-node point-to-point communication for multi-gpu adapters/node (GPU-GPU, GPU-Host and Host-GPU) Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node Optimized and tuned collectives for GPU device buffers MPI datatype support for point-to-point and collective communication from GPU device buffers Unified memory SEA Symposium (April 18) 36
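
For the datatype-support item above, a hedged sketch of a non-contiguous (strided) transfer from GPU memory using an MPI derived datatype; it assumes a CUDA-aware MPI with datatype support, and the matrix dimensions are illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int rows = 1024, cols = 1024;
        float *d_matrix;
        cudaMalloc((void **)&d_matrix, rows * cols * sizeof(float));

        /* One column of a row-major matrix: 'rows' blocks of 1 float, stride 'cols' */
        MPI_Datatype column;
        MPI_Type_vector(rows, 1, cols, MPI_FLOAT, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_matrix);
        MPI_Finalize();
        return 0;
    }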

37 SEA Symposium (April 18) 37 Optimized MVAPICH2-GDR Design [charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-(NO-GDR) and MV2-GDR-2.3a; 1.88 us small-message latency, with roughly 10x latency, 9x bandwidth, and 11X bi-bandwidth improvements] Platform: MVAPICH2-GDR-2.3a, Intel Haswell (3.1 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

38 Application-Level Evaluation (HOOMD-blue) [charts: average time steps per second (TPS) vs. number of processes for 64K-particle and 256K-particle runs, MV2 vs. MV2+GDR; up to 2X improvement] Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB) HoomdBlue Version 1.0.5 GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT= SEA Symposium (April 18) 38

39 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland Normalized Execution Time Wilkes GPU Cluster Default Callback-based Event-based Number of GPUs CSCS GPU cluster Default Callback-based Event-based Normalized Execution Time Number of GPUs 2X improvement on 32 GPUs nodes 3% improvement on 96 GPU nodes (8 GPUs/node) Cosmo model: /tasks/operational/meteoswiss/ On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS 16 SEA Symposium (April 18) 39

40 Streaming Applications Streaming applications on HPC systems 1. Communication (MPI) Broadcast-type operations 2. Computation (CUDA) Multiple GPU nodes as workers HPC resources for real-time analytics Data Source Real-time streaming Sender Data streaming-like broadcast operations Worker CPU GPU Worker CPU GPU Worker CPU GPU Worker CPU GPU Worker CPU GPU GPU GPU GPU GPU GPU SEA Symposium (April 18) 4

41 Hardware Multicast-based Broadcast For GPU-resident data, using GPUDirect RDMA (GDR) InfiniBand Hardware Multicast (IB-MCAST) Overhead IB UD limit GDR limit A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, in HiPC 214, Dec 214. CPU GPU Source Header Data IB HCA IB Switch 1. IB Gather + GDR Read 2. IB Hardware Multicast 3. IB Scatter + GDR Write Destination 1 IB HCA IB HCA Header CPU GPU Data Destination N Header CPU GPU Data SEA Symposium (April 18) 41

42 Streaming CSCS (88 GPUs) MCAST-GDR-OPT MCAST-GDR MCAST-GDR-OPT MCAST-GDR 6 12 Latency (μs) % Latency (μs) % K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M Message Size (Bytes) Message Size (Bytes) IB-MCAST + GDR + Topology-aware IPC-based schemes Up to 58% and 79% reduction for small and large messages C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct , 216. SEA Symposium (April 18) 42

43 SEA Symposium (April 18) 43 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

44 Intra-node Point-to-Point Performance on OpenPower Bandwidth (MB/s) Latency (us) Intra-Socket Small Message Latency.3us MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 2K Intra-Socket Bandwidth MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M Intra-Socket Large Message Latency Intra-Socket Bi-directional Bandwidth Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA SEA Symposium (April 18) 44 Latency (us) Bandwidth (MB/s) MVAPICH2-2.3rc1 SpectrumMPI OpenMPI-3.. 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M

45 Inter-node Point-to-Point Performance on OpenPower Bandwidth (MB/s) Latency (us) Small Message Latency MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 2K Bandwidth MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M Large Message Latency Bi-directional Bandwidth Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA SEA Symposium (April 18) 45 Latency (us) Bandwidth (MB/s) MVAPICH2-2.3rc1 SpectrumMPI OpenMPI-3.. 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M

46 MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal) [charts: intra-node and inter-node latency (small and large messages) and intra-/inter-node bandwidth vs. message size, for INTRA-SOCKET (NVLINK) and INTER-SOCKET paths] Intra-node Latency: 13.8 us (without GPUDirectRDMA) Intra-node Bandwidth: 33.2 GB/sec (NVLINK) Inter-node Latency: 23 us (without GPUDirectRDMA) Inter-node Bandwidth: 6 GB/sec (FDR) Available in MVAPICH2-GDR 2.3a Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and 4X-FDR InfiniBand Inter-connect SEA Symposium (April 18) 46

47 (Nodes=1, PPN=2) (Nodes=1, PPN=2) Scalable Host-based Collectives with CMA on OpenPOWER (Intra-node Reduce & AlltoAll) Reduce Latency (us) Latency (us) MVAPICH2-GDR-Next SpectrumMPI OpenMPI K 2K 4K MVAPICH2-GDR-Next SpectrumMPI OpenMPI X 3.2X 3.3X 5.2X Alltoall K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Up to 5X and 3x performance improvement by MVAPICH2 for small and large messages respectively SEA Symposium (April 18) 47 Latency (us) Latency (us) MVAPICH2-GDR-Next SpectrumMPI OpenMPI-3.. 8K 16K 32K 64K 128K 256K 512K 1M MVAPICH2-GDR-Next SpectrumMPI OpenMPI X 3.2X 1.3X 1.2X (Nodes=1, PPN=2) (Nodes=1, PPN=2)

48 Optimized All-Reduce with XPMEM on OpenPOWER (Nodes=1, PPN=2) Latency (us) (Nodes=2, PPN=2) Latency (us) MVAPICH2-GDR-Next Latency (us) 2 MVAPICH2-GDR-Next SpectrumMPI-1.1. SpectrumMPI-1.1. OpenMPI OpenMPI-3.. 2X 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M MVAPICH2-GDR-Next 3.3X 4 MVAPICH2-GDR-Next SpectrumMPI SpectrumMPI-1.1. OpenMPI OpenMPI-3.. 3X 1 2X 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M Message Size Message Size Optimized MPI All-Reduce Design in MVAPICH2 Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node 4X Latency (us) 34% 48% 2X Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch SEA Symposium (April 18) 48

49 Intra-node Point-to-point Performance on ARMv8 Small Message Latency Large Message Latency 3 6 MVAPICH2 MVAPICH2 Latency (us) Bandwidth (MB/s) microsec (4 bytes) K 2K 4K Bandwidth MVAPICH2 Latency (us) Bidirectional Bandwidth Available since MVAPICH2 2.3a 8K 16K 32K 64K 128K 256K 512K 1M 2M Bi-directional Bandwidth MVAPICH2 Platform: ARMv8 (aarch64) MIPS processor with 96 cores dual-socket CPU. Each socket contains 48 cores. SEA Symposium (April 18) 49

50 SEA Symposium (April 18) 5 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

51 Performance of SPEC MPI 2007 Benchmarks (KNL + Omni-Path) [chart: execution time (s) of milc, leslie3d, pop2, lammps, wrf2, tera_tf, and lu for Intel MPI 18.0.0 and MVAPICH2 2.3rc1; 448 processes on 7 KNL nodes of TACC Stampede2 (64 ppn)] MVAPICH2 outperforms Intel MPI by up to 10% SEA Symposium (April 18) 51

52 Performance of SPEC MPI 2007 Benchmarks (Skylake + Omni-Path) [chart: execution time (s) of milc, leslie3d, pop2, lammps, wrf2, GaP, tera_tf, and lu for Intel MPI and MVAPICH2 2.3rc1; 480 processes on 10 Skylake nodes of TACC Stampede2 (48 ppn)] MVAPICH2 outperforms Intel MPI by up to 38% SEA Symposium (April 18) 52

53 SEA Symposium (April 18) 53 Compilation of Best Practices MPI runtime has many parameters Tuning a set of parameters can help you to extract higher performance Compiled a list of such contributions through the MVAPICH Website Initial list of applications Amber HoomDBlue HPCG Lulesh MILC Neuron SMG2000 Soliciting additional contributions, send your results to mvapich-help at cse.ohio-state.edu. We will link these results with credits to you.

54 SEA Symposium (April 18) 54 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

55 SEA Symposium (April 18) 55 How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications? Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies? What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)? Can RDMA-enabled high-performance interconnects benefit Big Data processing? Can HPC Clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications? How much performance benefits can be achieved through enhanced designs? How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters? Bring HPC and Big Data processing into a convergent trajectory!

56 SEA Symposium (April 18) 56 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Benchmarks RDMA? Upper level Changes? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization (SR-IOV) I/O and File Systems QoS & Fault Tolerance Performance Tuning Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)

57 The High-Performance Big Data (HiBD) Project RDMA for Apache Spark RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions RDMA for Apache HBase RDMA for Memcached (RDMA-Memcached) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) OSU HiBD-Benchmarks (OHB) HDFS, Memcached, HBase, and Spark Micro-benchmarks Available for InfiniBand and RoCE Also run on Ethernet Available for x86 and OpenPOWER Users Base: 275 organizations from 34 countries More than 25,550 downloads from the project site SEA Symposium (April 18) 57

58 HiBD Release Timeline and Downloads [chart: cumulative number of downloads over time, annotated with releases including RDMA-Hadoop 1.x 0.9.0/0.9.8/0.9.9, RDMA-Memcached 0.9.1 & OHB 0.7.1, RDMA-Hadoop 2.x 0.9.1/0.9.6/0.9.7, RDMA-Memcached 0.9.4, RDMA-Spark 0.9.1, RDMA-HBase 0.9.1, RDMA-Hadoop 2.x 1.0.0, RDMA-Spark 0.9.4, RDMA-Hadoop 2.x 1.3.0, RDMA-Memcached 0.9.6 & OHB 0.9.3, and RDMA-Spark 0.9.5] SEA Symposium (April 18) 58

59 Different Modes of RDMA for Apache Hadoop 2.x HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package. HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible. HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster. HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD. MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks. Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre). SEA Symposium (April 18) 59

60 Design Overview of Spark with RDMA Apache Spark Benchmarks/Applications/Libraries/Frameworks Spark Core Shuffle Manager (Sort, Hash, Tungsten-Sort) Block Transfer Service (Netty, NIO, RDMA-Plugin) Netty Server, NIO Server, RDMA Server Netty Client, NIO Client, RDMA Client Java Socket Interface Java Native Interface (JNI) Native RDMA-based Comm. Engine 1/10/40/100 GigE, IPoIB Network RDMA Capable Networks (IB, iWARP, RoCE, ...) Design Features RDMA based shuffle plugin SEDA-based architecture Dynamic connection management and sharing Non-blocking data transfer Off-JVM-heap buffer management InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData 16, Dec. 2016 SEA Symposium (April 18) 60

61 SEA Symposium (April 18) 61 Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

63 SEA Symposium (April 18) 63 Performance Numbers of RDMA for Apache Hadoop 2.x RandomWriter & TeraGen in OSU-RI2 (EDR) Execution Time (s) IPoIB (EDR) Reduced by 3x IPoIB (EDR) Reduced by 4x 7 OSU-IB (EDR) 6 OSU-IB (EDR) Data Size (GB) Data Size (GB) RandomWriter TeraGen Cluster with 8 Nodes with a total of 64 maps Execution Time (s) RandomWriter 3x improvement over IPoIB for 8-16 GB file size TeraGen 4x improvement over IPoIB for 8-24 GB file size

64 SEA Symposium (April 18) 64 Performance Evaluation of RDMA-Spark on SDSC Comet HiBench PageRank Time (sec) IPoIB 37% RDMA Huge BigData Gigantic Data Size (GB) 32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R) RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node. 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps) 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps) Time (sec) IPoIB 43% RDMA Huge BigData Gigantic Data Size (GB)

65 Performance Evaluation on SDSC Comet: Astronomy Application Kira Toolkit 1: Distributed astronomy image processing toolkit implemented using Apache Spark. Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files. Compare RDMA Spark performance with the standard Apache implementation using IPoIB. 1. Z. Zhang, K. Barbary, F. A. Nothaft, E.R. Sparks, M.J. Franklin, D.A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. CoRR, vol: abs/ , Aug 2015. [chart: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores, RDMA Spark vs. Apache Spark (IPoIB)] M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE 16, July 2016 SEA Symposium (April 18) 65

66 SEA Symposium (April 18) 66 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

67 Deep Learning: New Challenges for MPI Runtimes Deep Learning frameworks are a different game altogether Unusually large message sizes (order of megabytes) Most communication based on GPU buffers Existing State-of-the-art cudnn, cublas, NCCL --> scale-up performance NCCL2, CUDA-Aware MPI --> scale-out performance For small and medium message sizes only! Proposed: Can we co-design the MPI runtime (MVAPICH2- GDR) and the DL framework (Caffe) to achieve both? Scale-up Performance NCCL cudnn cublas NCCL2 grpc Proposed Co- Designs MPI Efficient Overlap of Computation and Communication Efficient Large-Message Communication (Reductions) Hadoop What application co-designs are needed to exploit communication-runtime co-designs? Scale-out Performance A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17) SEA Symposium (April 18) 67

68 MVAPICH2: Allreduce Comparison with Baidu and OpenMPI 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0 [charts: Allreduce latency vs. message size for MVAPICH2, BAIDU, and OPENMPI; ~3X better for small messages, ~10X and ~4X better in the 512K-4M range; for the largest messages, OpenMPI is ~5X slower than Baidu while MV2 is ~2X better than Baidu] *Available with MVAPICH2-GDR 2.3a SEA Symposium (April 18) 68
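
The deep-learning use case behind this comparison is gradient averaging across data-parallel workers; with a CUDA-aware MPI the whole exchange can be a single Allreduce on the device buffer. A hedged C sketch (buffer size is illustrative, and the final division by the number of ranks is left to a GPU kernel):

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 64 * 1024 * 1024 / sizeof(float);   /* e.g. a 64 MB gradient buffer */
        float *d_grad;
        cudaMalloc((void **)&d_grad, n * sizeof(float));
        cudaMemset(d_grad, 0, n * sizeof(float));

        /* Sum gradients across all ranks, in place, directly on GPU memory.
           Dividing by 'size' to get the average would be done by a small CUDA
           kernel or a cuBLAS scale on d_grad (omitted here). */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("allreduced %d floats across %d ranks\n", n, size);

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }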

69 MVAPICH2-GDR vs. NCCL2 Broadcast Operation Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs [charts: latency vs. message size for 1 GPU/node and 2 GPUs/node; ~10X and ~4X better than NCCL2] *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand Inter-connect SEA Symposium (April 18) 69

70 MVAPICH2-GDR vs. NCCL2 Reduce Operation Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs [charts: latency vs. message size for small/medium and large ranges; ~2.5X and ~5X better than NCCL2] *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU per node, and EDR InfiniBand Inter-connect SEA Symposium (April 18) 70

71 MVAPICH2-GDR vs. NCCL2 Allreduce Operation Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllReduce (NCCL2) on 16 GPUs [charts: latency vs. message size for small/medium and large ranges; ~3X and ~1.2X better than NCCL2] *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU per node, and EDR InfiniBand Inter-connect SEA Symposium (April 18) 71

72 OSU-Caffe: Scalable Deep Learning Caffe: A flexible and layered Deep Learning framework. Benefits and Weaknesses Multi-GPU Training within a single node Performance degradation for GPUs across different sockets Limited Scale-out OSU-Caffe: MPI-based Parallel Training Enable Scale-up (within a node) and Scale-out (across multi-GPU nodes) Scale-out on 64 GPUs for training CIFAR-10 network on CIFAR-10 dataset Scale-out on 128 GPUs for training GoogLeNet network on ImageNet dataset OSU-Caffe is publicly available [chart: GoogLeNet (ImageNet) training time (seconds) on up to 128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); multi-node runs are an invalid use case for stock Caffe] Support on OPENPOWER will be available soon SEA Symposium (April 18) 72

73 Performance Benefits for RDMA-gRPC with Micro-Benchmark Latency (us) Default grpc OSU RDMA grpc Latency (us) Default grpc OSU RDMA grpc Latency (us) Default grpc OSU RDMA grpc K 8K payload (Bytes) 16K 32K 64K 128K 256K 512K Payload (Bytes) 1 1M 2M 4M 8M Payload (Bytes) RDMA-gRPC RPC Latency grpc-rdma Latency on SDSC-Comet-FDR Up to 2.7x performance speedup over IPoIB for Latency for small messages Up to 2.8x performance speedup over IPoIB for Latency for medium messages Up to 2.5x performance speedup over IPoIB for Latency for large messages R. Biswas, X. Lu, and D. K. Panda, Accelerating grpc and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, Under Review. SEA Symposium (April 18) 73

74 SEA Symposium (April 18) 74 Performance Benefits for RDMA-gRPC with TensorFlow Communication Mimic Benchmark Default grpc OSU RDMA grpc Default grpc Latency (us) 18 OSU RDMA grpc 9 Calls/ Second M 4M 8M 2M 4M 8M Payload (Bytes) Payload (Bytes) TensorFlow communication Pattern mimic benchmark TensorFlow communication pattern mimic on SDSC-Comet-FDR Single process spawns both grpc server and grpc client Up to 3.6x performance speedup over IPoIB for Latency Up to 3.7x throughput improvement over IPoIB

75 High-Performance Deep Learning over Big Data (DLoBD) Stacks Benefits of Deep Learning over Big Data (DLoBD) Easily integrate deep learning components into Big Data processing workflow Easily access the stored data in Big Data systems No need to set up new dedicated deep learning clusters; Reuse existing big data analytics clusters Challenges Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs? What are the performance characteristics of representative DLoBD stacks on RDMA networks? Characterization on DLoBD Stacks CaffeOnSpark, TensorFlowOnSpark, and BigDL IPoIB vs. RDMA; In-band communication vs. Out-of-band communication; CPU vs. GPU; etc. Performance, accuracy, scalability, and resource utilization RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB based scheme, while maintaining similar accuracy [chart: epoch time (secs) and accuracy (%) vs. epoch number for IPoIB and RDMA; 2.6X] X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017. SEA Symposium (April 18) 75

76 SEA Symposium (April 18) 76 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

77 Can HPC and Virtualization be Combined? Virtualization has many benefits Fault-tolerance Job migration Compaction Have not been very popular in HPC due to overhead associated with Virtualization New SR-IOV (Single Root IO Virtualization) support available with Mellanox InfiniBand adapters changes the field Enhanced MVAPICH2 support for SR-IOV MVAPICH2-Virt 2.2 supports: OpenStack, Docker, and Singularity J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14 J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC 14 J. Zhang, X.Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid 15 SEA Symposium (April 18) 77

78 Application-Level Performance on Chameleon [charts: Graph500 execution time (ms) for problem sizes (Scale, Edgefactor) of (22,2), (24,1), (24,16), (24,2), (26,1), and (26,16), and SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, each for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; 32 VMs, 6 Core/VM] Compared to Native, 2-5% overhead for Graph500 with 128 Procs Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs SEA Symposium (April 18) 78

79 Application-Level Performance on Singularity with MVAPICH2 [charts: NPB Class D execution time (s) for CG, EP, FT, IS, LU, and MG with 512 processes across 32 nodes, and Graph500 BFS execution time (ms) for problem sizes (Scale, Edgefactor) of (22,16), (22,20), (24,16), (24,20), (26,16), and (26,20), each for Singularity vs. Native] Less than 7% and 6% overhead for NPB and Graph500, respectively J. Zhang, X.Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC 17, Best Student Paper Award SEA Symposium (April 18) 79

80 SEA Symposium (April 18) 80 MVAPICH2-GDR on Container with Negligible Overhead [charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size, Docker vs. Native] Platform: MVAPICH2-GDR-2.3a, Intel Haswell (3.1 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

81 SEA Symposium (April 18) 81 Concluding Remarks Next generation HPC systems need to be designed with a holistic view of Big Data, Deep Learning and Cloud Presented some of the approaches and results along these directions Enable Big Data, Deep Learning and Cloud community to take advantage of modern HPC technologies Many other open issues need to be solved Next generation HPC professionals need to be trained along these directions

82 Additional Presentation and Tutorials Talk: Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries 4/4/18 at 9:00 am at Overlapping Communication with Computation Symposium Tutorial: How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries 4/5/18, 9:00 am-12:00 noon Tutorial: High Performance Distributed Deep Learning: A Beginner's Guide 4/5/18, 1:00 pm-4:00 pm SEA Symposium (April 18) 82

83 SEA Symposium (April 18) 83 Funding Acknowledgments Funding Support by Equipment Support by

84 Personnel Acknowledgments Current Students (Graduate) A. Awan (Ph.D.) R. Biswas (M.S.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) Current Students (Undergraduate) J. Hashmi (Ph.D.) N. Sarkauskas (B.S.) H. Javed (Ph.D.) P. Kousha (Ph.D.) D. Shankar (Ph.D.) H. Shi (Ph.D.) J. Zhang (Ph.D.) W. Huang (Ph.D.) J. Liu (Ph.D.) W. Jiang (M.S.) M. Luo (Ph.D.) J. Jose (Ph.D.) A. Mamidala (Ph.D.) S. Kini (M.S.) G. Marsh (M.S.) M. Koop (Ph.D.) V. Meshram (M.S.) K. Kulkarni (M.S.) A. Moody (M.S.) R. Kumar (M.S.) S. Naravula (Ph.D.) S. Krishnamoorthy (M.S.) R. Noronha (Ph.D.) K. Kandalla (Ph.D.) X. Ouyang (Ph.D.) M. Li (Ph.D.) S. Pai (M.S.) P. Lai (M.S.) S. Potluri (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela K. Manian R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Current Research Specialist J. Smith M. Arnold Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee X. Besseron H.-W. Jin J. Lin M. Luo E. Mancini S. Marcarelli J. Vienne H. Wang SEA Symposium (April 18) 84

85 SEA Symposium (April 18) 85 Thank You! Network-Based Computing Laboratory The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project


More information

High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters

Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters Presentation at IXPUG Meeting, July 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X IXPUG 7 PresentaNon J. Hashmi, M. Li, H. Subramoni and DK Panda The Ohio State University E-mail: {hashmi.29,li.292,subramoni.,panda.2}@osu.edu

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Keynote Talk at Swiss Conference (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Talk at OSC theater (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning GPU Technology Conference GTC 217 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Talk at Intel HPC Developer Conference 2017 (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Keynote Talk at BPOE-4, in conjunction with ASPLOS 14 (March 2014) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters High Performance File System and I/O Middleware Design for Big Data on HPC Clusters by Nusrat Sharmin Islam Advisor: Dhabaleswar K. (DK) Panda Department of Computer Science and Engineering The Ohio State

More information

High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters

High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters Presentation at Mellanox Theatre (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Jie Zhang Dr. Dhabaleswar K. Panda (Advisor) Department of Computer Science & Engineering The

More information

Scaling with PGAS Languages

Scaling with PGAS Languages Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

MVAPICH2 MPI Libraries to Exploit Latest Networking and Accelerator Technologies

MVAPICH2 MPI Libraries to Exploit Latest Networking and Accelerator Technologies MVAPICH2 MPI Libraries to Exploit Latest Networking and Accelerator Technologies Talk at NRL booth (SC 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks Presented By : Esthela Gallardo Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan Perkins, Hari Subramoni,

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K. In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department

More information

Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries

Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries Talk at Overlapping Communication with Computation Symposium (April 18) by Dhabaleswar K. (DK) Panda The Ohio

More information

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Keynote Talk at HPSR 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

HPC Meets Big Data: Accelerating Hadoop, Spark, and Memcached with HPC Technologies

HPC Meets Big Data: Accelerating Hadoop, Spark, and Memcached with HPC Technologies HPC Meets Big Data: Accelerating Hadoop, Spark, and Memcached with HPC Technologies Talk at OpenFabrics Alliance Workshop (OFAW 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries A Tutorial at MVAPICH User Group 217 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Talk at HPCAC-Switzerland (Mar 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC-Switzerland (April 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

How to Tune and Extract Higher Performance with MVAPICH2 Libraries?

How to Tune and Extract Higher Performance with MVAPICH2 Libraries? How to Tune and Extract Higher Performance with MVAPICH2 Libraries? An XSEDE ECSS webinar by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Job Startup at Exascale:

Job Startup at Exascale: Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core

More information

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI INAM 2 : InfiniBand Network Analysis and Monitoring with MPI H. Subramoni, A. A. Mathews, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda Department of Computer Science and Engineering The

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer

More information

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) MeeKng by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15)

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Big Data Analytics with the OSU HiBD Stack at SDSC. Mahidhar Tatineni OSU Booth Talk, SC18, Dallas

Big Data Analytics with the OSU HiBD Stack at SDSC. Mahidhar Tatineni OSU Booth Talk, SC18, Dallas Big Data Analytics with the OSU HiBD Stack at SDSC Mahidhar Tatineni OSU Booth Talk, SC18, Dallas Comet HPC for the long tail of science iphone panorama photograph of 1 of 2 server rows Comet: System Characteristics

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators

More information

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase 2 IEEE 8th International Conference on Cloud Computing Technology and Science Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase Xiaoyi Lu, Dipti Shankar, Shashank Gugnani,

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015)

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015) High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani The Ohio State University E-mail: gugnani.2@osu.edu http://web.cse.ohio-state.edu/~gugnani/ Network Based Computing Laboratory SC 17 2 Neuroscience:

More information

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA- enabled Clusters

DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA- enabled Clusters DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA- enabled Clusters Talk at OFA Workshop 2018 by Xiaoyi Lu The Ohio State University E- mail: luxi@cse.ohio- state.edu h=p://www.cse.ohio-

More information

Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks

Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks Panel PresentaAon at HPBDC 17 by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

Unifying UPC and MPI Runtimes: Experience with MVAPICH

Unifying UPC and MPI Runtimes: Experience with MVAPICH Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

An Assessment of MPI 3.x (Part I)

An Assessment of MPI 3.x (Part I) An Assessment of MPI 3.x (Part I) High Performance MPI Support for Clouds with IB and SR-IOV (Part II) Talk at HP-CAST 25 (Nov 2015) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Talk at Storage Developer Conference SNIA 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 214 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based

More information

Acceleration for Big Data, Hadoop and Memcached

Acceleration for Big Data, Hadoop and Memcached Acceleration for Big Data, Hadoop and Memcached A Presentation at HPC Advisory Council Workshop, Lugano 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information