High-Performance MPI and Deep Learning on OpenPOWER Platform


1 High-Performance MPI and Deep Learning on OpenPOWER Platform. Talk at OpenPOWER Summit (Mar 18) by Dhabaleswar K. (DK) Panda (The Ohio State University) and Hari Subramoni (The Ohio State University).

2 Increasing Usage of HPC, Big Data and Deep Learning: HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.). Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!! OpenPOWER Summit (Mar 18) 2

3 Parallel Programming Models Overview. Three broad models: Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI - Message Passing Interface), and Partitioned Global Address Space (PGAS) with logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, ...). Programming models provide abstract machine models; models can be mapped on different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc. PGAS models and Hybrid MPI+PGAS models are gradually receiving importance. OpenPOWER Summit (Mar 18) 3

4 Partitioned Global Address Space (PGAS) Models. Key features - simple shared memory abstractions, lightweight one-sided communication, easier to express irregular communication. Different approaches to PGAS - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel; Libraries: OpenSHMEM, UPC++, Global Arrays. OpenPOWER Summit (Mar 18) 4

5 Hybrid (MPI+PGAS) Programming. Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics: an HPC application may run Kernel 1 in MPI, Kernel 2 in PGAS alongside MPI, Kernel 3 in MPI, ..., Kernel N in PGAS alongside MPI. Benefits: best of the Distributed Computing Model and best of the Shared Memory Computing Model. A minimal sketch of such a hybrid program appears below. OpenPOWER Summit (Mar 18) 5
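To make the hybrid model concrete, the following is a minimal sketch of an MPI+OpenSHMEM program of the kind that a unified runtime such as MVAPICH2-X targets; the kernel bodies, variable names, and the initialization/finalization order shown are illustrative assumptions, not code from the talk.

    /* Minimal hybrid MPI+OpenSHMEM sketch (illustrative only). */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Initialization order is one common pattern; it may vary by runtime. */
        MPI_Init(&argc, &argv);
        shmem_init();

        int rank, npes = shmem_n_pes();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Kernel written in MPI: regular, collective communication */
        int local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel written in PGAS: irregular, one-sided communication */
        static int neighbor_val = -1;                      /* symmetric object */
        shmem_int_p(&neighbor_val, rank, (rank + 1) % npes);
        shmem_barrier_all();

        printf("PE %d: allreduce sum=%d, value from left neighbor=%d\n",
               rank, sum, neighbor_val);

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }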

6 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges. Application kernels/applications sit on programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.), which rely on middleware - a communication library or runtime for the programming models providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance - running over networking technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA). Co-design opportunities and challenges across various layers: performance, scalability, resilience. OpenPOWER Summit (Mar 18) 6

7 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning Support for Big Data (Hadoop and Spark) on OpenPOWER (Talk at 2:15 pm) OpenPOWER Summit (Mar 18) 7

8 OpenPOWER Summit (Mar 18) 8 Overview of the MVAPICH2 Project. High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE). MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002. MVAPICH2-X (MPI + PGAS), available since 2011. Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014. Support for Virtualization (MVAPICH2-Virt), available since 2015. Support for Energy-Awareness (MVAPICH2-EA), available since 2015. Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015. Used by more than 2,875 organizations in 85 countries. More than 455,000 (> 0.45 million) downloads from the OSU site directly. Empowering many TOP500 clusters (Nov 17 ranking): 1st, 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China; 9th, 556,104-core Oakforest-PACS in Japan; 12th, 368,928-core Stampede2 at TACC; 17th, 241,108-core Pleiades at NASA; 48th, 76,032-core Tsubame 2.5 at Tokyo Institute of Technology. Available with software stacks of many vendors, Linux distros (RedHat and SuSE), and OpenHPC. Empowering TOP500 systems for over a decade.

9 OpenPOWER Summit (Mar 18) 9 MVAPICH2 Software Family. High-Performance Parallel Programming Libraries: MVAPICH2 - support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE; MVAPICH2-X - advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime; MVAPICH2-GDR - optimized MPI for clusters with NVIDIA GPUs; MVAPICH2-Virt - high-performance and scalable MPI for hypervisor- and container-based HPC cloud; MVAPICH2-EA - energy-aware and high-performance MPI; MVAPICH2-MIC - optimized MPI for clusters with Intel KNC (being updated with KNL-based designs). Microbenchmarks: OMB - microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs. Tools: OSU INAM - network monitoring, profiling, and analysis for clusters with MPI and scheduler integration; OEMT - utility to measure the energy consumption of MPI applications.

10 MVAPICH2 Release Timeline and Downloads: number of downloads over time, Sep 2004 through Jan 2018. Releases marked along the timeline include MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, and OSU INAM. OpenPOWER Summit (Mar 18) 10

11 Architecture of MVAPICH2 Software Family. High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk). High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis. Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path): transport protocols (RC, XRC, UD, DC) and modern features (UMR, ODP, SR-IOV, multi-rail). Support for modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU): shared-memory transport mechanisms (CMA, IVSHMEM, XPMEM*) and modern features (MCDRAM*, NVLink, CAPI*). (* Upcoming) OpenPOWER Summit (Mar 18) 11

12 OpenPOWER Summit (Mar 18) 12 OpenPOWER Platform Support in MVAPICH2 Libraries. MVAPICH2: basic MPI support since MVAPICH2 2.2rc1 (March 2016). MVAPICH2-X: PGAS (OpenSHMEM and UPC) and hybrid MPI+PGAS support since MVAPICH2-X 2.2b (March 2016); advanced collective support with CMA since MVAPICH2-X 2.3b (Oct 2017). MVAPICH2-GDR: NVIDIA GPGPU support with NVLink (CORAL-like systems) since MVAPICH2-GDR 2.3a (Nov 2017).

13 OpenPOWER Summit (Mar 18) 13 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Point-to-point Inter-node and Intra-node CMA-based Collectives Upcoming XPMEM Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

14 Intra-node Point-to-Point Performance on OpenPOWER. Plots compare MVAPICH2-2.3rc1, Spectrum MPI, and OpenMPI-3.0.0 for intra-socket small-message latency (MVAPICH2 achieves about 0.3 us), intra-socket large-message latency, intra-socket bandwidth, and intra-socket bi-directional bandwidth across message sizes up to 2M. Platform: two nodes of OpenPOWER (Power8-ppc64le) CPUs using a Mellanox EDR (MT4115) HCA. OpenPOWER Summit (Mar 18) 14

15 Inter-node Point-to-Point Performance on OpenPOWER. Plots compare MVAPICH2-2.3rc1, Spectrum MPI, and OpenMPI-3.0.0 for inter-node small-message latency, large-message latency, bandwidth, and bi-directional bandwidth across message sizes up to 2M. Platform: two nodes of OpenPOWER (Power8-ppc64le) CPUs using a Mellanox EDR (MT4115) HCA. OpenPOWER Summit (Mar 18) 15

16 Multi-Pair Latency Comparison. Basic point-to-point multi-pair latency comparison for intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) cases, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM across message sizes up to 4M. Up to 48% and 45% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 16

17 Multi-Pair Bandwidth Comparison. Basic point-to-point multi-pair bandwidth comparison for intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) cases, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM across message sizes up to 4M. Up to 83% and 62% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 17

18 OpenPOWER Summit (Mar 18) 18 Process Mapping Support in MVAPICH2 for Multi-Core Platforms. Process-mapping support in MVAPICH2 (available since v1.4): preset binding policies with Policy = bunch (default), scatter, or hybrid and Granularity = core (default), socket, or numanode for MPI rank-to-core binding; user-defined binding is also supported. MVAPICH2 detects the processor architecture at job launch.

19 Process and Thread binding policies in Hybrid MPI+Threads A new process binding policy hybrid MV2_CPU_BINDING_POLICY = hybrid A new environment variable for co-locating Threads with MPI Processes MV2_THREADS_PER_PROCESS = k Automatically set to OMP_NUM_THREADS if OpenMP is being used Provides a hint to the MPI runtime to spare resources for application threads. New variable for threads bindings with respect to parent process and architecture MV2_THREADS_BINDING_POLICY = {linear compact spread} Linear binds MPI ranks and OpenMP threads sequentially (one after the other) Recommended to be used on non-hyperthreaded systems with MPI+OpenMP Compact binds MPI rank to physical-core and locates respective OpenMP threads on hardware threads Recommended to be used on multi-/many-cores e.g., KNL, POWER8, and hyper-threaded Xeon OpenPOWER Summit (Mar 18) 19
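As an illustration of these knobs (not from the slides; the process count, thread count, and hostfile name are assumptions), a hybrid MPI+OpenMP job might be launched as:

    $ mpirun_rsh -np 8 -hostfile hosts OMP_NUM_THREADS=5 MV2_CPU_BINDING_POLICY=hybrid MV2_THREADS_PER_PROCESS=5 MV2_THREADS_BINDING_POLICY=compact ./a.out

Here MV2_THREADS_PER_PROCESS matches OMP_NUM_THREADS so that the runtime reserves five cores/hardware threads per MPI rank for its OpenMP threads.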

20 User-Defined Process Mapping. The user has complete control over process mapping. To run 4 processes on cores 0, 1, 4, 5: $ mpirun_rsh -np 4 -hostfile hosts MV2_CPU_MAPPING=0:1:4:5 ./a.out. Use ',' or '-' to bind to a set of cores: $ mpirun_rsh -np 64 -hostfile hosts MV2_CPU_MAPPING=0,2-4:1:5:6 ./a.out. Is process binding working as expected? MV2_SHOW_CPU_BINDING=1 displays CPU binding information and is launcher independent. Example with MV2_SHOW_CPU_BINDING=1 MV2_CPU_BINDING_POLICY=scatter: CPU AFFINITY RANK:0 CPU_SET: 0, RANK:1 CPU_SET: 8. Refer to the "Running with Efficient CPU (Core) Mapping" section of the MVAPICH2 user guide for more information. OpenPOWER Summit (Mar 18) 20

21 OpenPOWER Summit (Mar 18) 21 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Point-to-point Inter-node and Intra-node CMA-based Collectives Upcoming XPMEM Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

22 Scalable Collectives with CMA (Intra-node, Reduce & Alltoall), (Nodes=1, PPN=20). Plots compare MVAPICH2-X-2.3b, Spectrum MPI, and OpenMPI-3.0.0 for MPI_Reduce and MPI_Alltoall, split into small (up to 4K) and large (8K to 1M) message ranges. Up to 5X and 3X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively. OpenPOWER Summit (Mar 18) 22

23 Scalable Collectives with CMA (Intra-node, Gather & Scatter), (Nodes=1, PPN=20). Plots compare MVAPICH2-X-2.3b, Spectrum MPI, and OpenMPI-3.0.0 for MPI_Gather and MPI_Scatter across message sizes from 1K to 1M. Up to 24X and 15X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively. OpenPOWER Summit (Mar 18) 23

24 Scalable Collectives with CMA (Multi-node, Reduce & Alltoall), (Nodes=4, PPN=20). Plots compare MVAPICH2-X-2.3b and OpenMPI-3.0.0 for MPI_Reduce and MPI_Alltoall, split into small and large message ranges up to 1M. Up to 12.4X and 8.5X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively. OpenPOWER Summit (Mar 18) 24

25 Scalable Collectives with CMA (Multi-node, Gather & Scatter), (Nodes=4, PPN=20). Plots compare MVAPICH2-X-2.3b and OpenMPI-3.0.0 for MPI_Gather and MPI_Scatter across message sizes from 1K to 1M. Up to 17.8X and 3.9X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively. OpenPOWER Summit (Mar 18) 25

26 OpenPOWER Summit (Mar 18) 26 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Point-to-point Inter-node and Intra-node CMA-based Collectives Upcoming XPMEM Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

27 Optimized MVAPICH2 All-Reduce with XPMEM, (Nodes=1, PPN=20) and (Nodes=2, PPN=20). Optimized MPI All-Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 2M. Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 27

28 Optimized MVAPICH2 All-Reduce with XPMEM, (Nodes=3, PPN=20) and (Nodes=4, PPN=20). Optimized MPI All-Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 2M. Up to 2X performance improvement over OpenMPI for inter-node (Spectrum MPI didn't run for >2 processes). Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 28

29 Optimized MVAPICH2 Reduce with XPMEM, (Nodes=1, PPN=20) and (Nodes=2, PPN=20). Optimized MPI Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 16M. Up to 3.1X performance improvement over OpenMPI at scale and up to 37% over Spectrum MPI on intra-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 29

30 Optimized MVAPICH2 Reduce with XPMEM, (Nodes=3, PPN=20) and (Nodes=4, PPN=20). Optimized MPI Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 16M. Up to 7X performance improvement over OpenMPI at scale and up to 25% over MVAPICH2-2.3rc1. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 30

31 MiniAMR Performance using Optimized XPMEM-based Collectives. MiniAMR application execution time (seconds) comparing MVAPICH2-2.3rc1 and the optimized XPMEM-based All-Reduce design over increasing MPI process counts (proc-per-socket=10, proc-per-node=20). Up to 45% improvement over MVAPICH2-2.3rc1 in the mesh-refinement time of the MiniAMR application for a weak-scaling workload on up to four POWER8 nodes (36-45% at the measured scales). Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=scatter. OpenPOWER Summit (Mar 18) 31

32 OpenPOWER Summit (Mar 18) 32 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

33 MVAPICH2-X for Hybrid MPI + PGAS Applications. Current model - separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI: possible deadlock if both runtimes are not progressed, and consumes more network resources. MVAPICH2-X provides a unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF - available since 2012 (starting with MVAPICH2-X 1.9). OpenPOWER Summit (Mar 18) 33

34 OpenPOWER Summit (Mar 18) 34 OpenSHMEM PUT/GET using MVAPICH2-X on OpenPOWER. OpenSHMEM one-sided PUT/GET (shmem_put and shmem_get) latency benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster, for intra-node (PPN=2) and inter-node (PPN=1) cases across small and large message sizes. Near-native performance benefits are observed on OpenPOWER. A sketch of the put/get pattern being measured appears below.
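For reference, the one-sided pattern measured by these benchmarks looks roughly like the sketch below; this is a minimal illustration under an assumed message size and peer selection, not the OSU benchmark source.

    /* Illustrative OpenSHMEM one-sided PUT/GET between neighboring PEs. */
    #include <shmem.h>
    #include <string.h>

    #define MSG_SIZE 4096

    int main(void)
    {
        static char sym_buf[MSG_SIZE];   /* symmetric: remotely accessible */
        char local_buf[MSG_SIZE];

        shmem_init();
        int me   = shmem_my_pe();
        int peer = (me + 1) % shmem_n_pes();

        memset(local_buf, (char)me, MSG_SIZE);

        shmem_putmem(sym_buf, local_buf, MSG_SIZE, peer);   /* one-sided PUT */
        shmem_quiet();                                      /* complete remote writes */
        shmem_getmem(local_buf, sym_buf, MSG_SIZE, peer);   /* one-sided GET */

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }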

35 OpenPOWER Summit (Mar 18) 35 UPC PUT/GET benchmarks using MVAPICH2-X on OpenPOWER. UPC one-sided memput/memget (upc_putmem and upc_getmem) benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster, for intra-node (PPN=2) and inter-node (PPN=1) cases across message sizes up to 4M. Near-native performance benefits are observed on OpenPOWER.

36 OpenPOWER Summit (Mar 18) 36 UPC++ PUT/GET benchmarks using MVAPICH2-X on OpenPOWER. UPC++ one-sided async_copy put/get (upcxx_async_put and upcxx_async_get) benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster, for intra-node (PPN=2) and inter-node (PPN=1) cases across message sizes up to 4M. Near-native performance benefits are observed on OpenPOWER.

37 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) CUDA-aware MPI Optimized GPUDirect RDMA (GDR) Support Optimized MVAPICH2 for OpenPower Support for Deep Learning OpenPOWER Summit (Mar 18) 37

38 GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU. Standard MPI interfaces are used for unified data movement; the library takes advantage of Unified Virtual Addressing (>= CUDA 4.0) and overlaps data movement from the GPU with RDMA transfers. At the sender: MPI_Send(s_devbuf, size, ...); at the receiver: MPI_Recv(r_devbuf, size, ...); everything else happens inside MVAPICH2. High performance and high productivity. A complete two-rank sketch follows. OpenPOWER Summit (Mar 18) 38
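Expanding the fragment above into a complete two-rank program gives the following hedged sketch; the buffer name, the 1 MB message size, and the omission of error checking are assumptions made for brevity.

    /* CUDA-aware point-to-point sketch: device pointers passed directly to MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t size = 1 << 20;          /* 1 MB message */
        char *devbuf;
        cudaMalloc((void **)&devbuf, size);   /* GPU device buffer */

        /* The device pointer goes straight into MPI; the CUDA-aware runtime
           handles the data movement (GDR, pipelining, etc.) internally. */
        if (rank == 0)
            MPI_Send(devbuf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }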

39 CUDA-Aware MPI: MVAPICH2-GDR Releases. Support for MPI communication from NVIDIA GPU device memory. High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU). High-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host and Host-GPU), taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node. Optimized and tuned collectives for GPU device buffers. MPI datatype support for point-to-point and collective communication from GPU device buffers (an illustrative datatype sketch follows). Unified memory support. OpenPOWER Summit (Mar 18) 39
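As an illustration of the datatype support listed above (a sketch under assumed matrix dimensions and a two-rank exchange, not MVAPICH2-specific code), a strided column of a GPU-resident matrix can be communicated directly from device memory:

    /* Send one column of a 1024x1024 row-major float matrix stored on the GPU. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *d_mat;
        cudaMalloc((void **)&d_mat, 1024 * 1024 * sizeof(float));

        MPI_Datatype column;                                  /* strided column */
        MPI_Type_vector(1024, 1, 1024, MPI_FLOAT, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_mat, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_mat, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_mat);
        MPI_Finalize();
        return 0;
    }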

40 Optimized MVAPICH2-GDR Design. GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth comparing MV2 (no GDR) and MV2-GDR-2.3a: 1.88 us small-message latency (about 10x better), about 9x higher bandwidth, and about 11X higher bi-directional bandwidth for small and medium messages. Platform: MVAPICH2-GDR-2.3a on an Intel Haswell (3.1 GHz) node with 20 cores, an NVIDIA Volta V100 GPU, a Mellanox Connect-X4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPU-Direct-RDMA. OpenPOWER Summit (Mar 18) 40

41 Application-Level Evaluation (HOOMD-blue). Average time steps per second (TPS) for 64K-particle and 256K-particle workloads over increasing numbers of processes, comparing MV2 and MV2+GDR: roughly 2X improvement with GDR in both cases. Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB), HOOMD-blue version 1.0.5. GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=. OpenPOWER Summit (Mar 18) 41

42 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland. Normalized execution time on the Wilkes GPU cluster and the CSCS GPU cluster, comparing Default, Callback-based, and Event-based designs over increasing numbers of GPUs: 2X improvement on 32 GPU nodes and 30% improvement on 96 GPU nodes (8 GPUs/node). Cosmo model: /tasks/operational/meteoswiss/. On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application. C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS 16. OpenPOWER Summit (Mar 18) 42

43 MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal). Intra-node and inter-node latency (small and large messages) and bandwidth, comparing the intra-socket (NVLink) and inter-socket paths across message sizes up to 4M. Intra-node latency: 14.6 us (without GPUDirect RDMA); intra-node bandwidth: 33.9 GB/sec (NVLink). Inter-node latency: 23.8 us (without GPUDirect RDMA); inter-node bandwidth: 11.9 GB/sec (EDR). Available in MVAPICH2-GDR 2.3a. Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and EDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 43

44 OpenPOWER Summit (Mar 18) 44 Compilation of Best Practices. The MPI runtime has many parameters, and tuning a set of parameters can help you extract higher performance. We have compiled a list of such contributions through the MVAPICH website. List of applications (16 contributions so far): Amber, Cloverleaf, GAPgeofem, HoomDBlue, HPCG, LAMPS, LU, Lulesh, MILC, Neuro, POP2, SMG2000, SPEC MPI, TERA_TF, UMR, and WRF2. Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link these results with credits to you.

45 OpenPOWER Summit (Mar 18) 45 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning New challenges for MPI Run-time Optimizing Large-message Collectives Co-designing DL Frameworks

46 Deep Learning: New Challenges for MPI Runtimes. Deep Learning frameworks are a different game altogether: unusually large message sizes (on the order of megabytes) and most communication based on GPU buffers. Existing state-of-the-art: cuDNN, cuBLAS, and NCCL give scale-up performance; CUDA-Aware MPI gives scale-out performance, but only for small and medium message sizes! Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both? Efficient overlap of computation and communication, efficient large-message communication (reductions), and the question of what application co-designs are needed to exploit communication-runtime co-designs. A sketch of the underlying gradient-reduction pattern follows. A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). OpenPOWER Summit (Mar 18) 46
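The gradient-reduction pattern behind these large-message collectives can be written as a single CUDA-aware MPI_Allreduce on a GPU-resident buffer. The sketch below is illustrative only (the gradient size and the cudaMemset stand-in for backpropagation are assumptions), not OSU-Caffe code.

    /* Data-parallel gradient aggregation with a large CUDA-aware Allreduce. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = 64 * 1024 * 1024;              /* 64M floats (~256 MB) */
        float *d_grads;
        cudaMalloc((void **)&d_grads, count * sizeof(float));
        cudaMemset(d_grads, 0, count * sizeof(float));   /* stand-in for backprop output */

        /* One large in-place reduction per training iteration; large-message
           collective designs in the runtime target exactly this pattern. */
        MPI_Allreduce(MPI_IN_PLACE, d_grads, count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_grads);
        MPI_Finalize();
        return 0;
    }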

47 Large Message Optimized Collectives for Deep Learning. MV2-GDR provides optimized collectives for large message sizes: optimized Reduce, Allreduce, and Bcast with good scaling to large numbers of GPUs, available since MVAPICH2-GDR 2.2GA. Plots show latency vs. message size (MB) for Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs, and latency vs. number of GPUs for Reduce at 64 MB, Allreduce, and Bcast at 128 MB. OpenPOWER Summit (Mar 18) 47

48 MVAPICH2: Allreduce Comparison with Baidu and OpenMPI. Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*. 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI across message sizes up to 512M. MVAPICH2 is ~10X better than OpenMPI for small messages and ~3.5X better at medium sizes; at the largest sizes, OpenMPI is ~4X slower than Baidu while MV2 is ~20% better than Baidu. *Available with MVAPICH2-GDR 2.3a and higher. Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 48

49 MVAPICH2: Allreduce Comparison with Baidu and OpenMPI. 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI across message sizes up to 512M. MVAPICH2 is ~30X better than OpenMPI for small messages, ~10X better at medium sizes, and ~4X better for large messages; at the largest sizes, OpenMPI is ~5X slower than Baidu while MV2 is ~2X better than Baidu. *Available with MVAPICH2-GDR 2.3a and higher. Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 49

50 MVAPICH2-GDR vs. NCCL2. Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases. MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs, for 1 GPU/node and 2 GPUs/node configurations across message sizes up to 64M: up to ~10X better in one configuration and ~4X better in the other. *Will be available with the upcoming MVAPICH2-GDR 2.3b. Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 50

51 OSU-Caffe: Scalable Deep Learning. Caffe: a flexible and layered Deep Learning framework, with benefits and weaknesses - multi-GPU training within a single node, but performance degradation for GPUs across different sockets and limited scale-out. OSU-Caffe: MPI-based parallel training that enables scale-up (within a node) and scale-out (across multi-GPU nodes); scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset, and on 128 GPUs for training the GoogLeNet network on the ImageNet dataset. Training time (seconds) for GoogLeNet (ImageNet) on 128 GPUs compares Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); plain Caffe is an invalid use case at this scale. OSU-Caffe is publicly available. Support on OpenPOWER will be available soon. OpenPOWER Summit (Mar 18) 51

52 OpenPOWER Summit (Mar 18) 52 Concluding Remarks Next generation HPC systems need to be designed with a holistic view of Big Data and Deep Learning OpenPOWER platform is emerging with many novel features Presented some of the approaches and results along these directions from the MVAPICH2 and HiDL Projects Solutions Enable High-Performance and Scalable HPC and Deep Learning on OpenPOWER Platforms

53 OpenPOWER Summit (Mar 18) 53 High-Performance Hadoop and Spark on OpenPOWER Platform Talk at 2:15 pm

54 OpenPOWER Summit (Mar 18) 54 Funding Acknowledgments Funding Support by Equipment Support by

55 Personnel Acknowledgments Current Students (Graduate) A. Awan (Ph.D.) R. Biswas (M.S.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) Current Students (Undergraduate) J. Hashmi (Ph.D.) N. Sarkauskas (B.S.) H. Javed (Ph.D.) P. Kousha (Ph.D.) D. Shankar (Ph.D.) H. Shi (Ph.D.) J. Zhang (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela Current Research Specialist J. Smith M. Arnold Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) W. Huang (Ph.D.) W. Jiang (M.S.) J. Jose (Ph.D.) S. Kini (M.S.) M. Koop (Ph.D.) K. Kulkarni (M.S.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) K. Kandalla (Ph.D.) M. Li (Ph.D.) P. Lai (M.S.) J. Liu (Ph.D.) M. Luo (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) V. Meshram (M.S.) A. Moody (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) X. Ouyang (Ph.D.) S. Pai (M.S.) S. Potluri (Ph.D.) R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee J. Lin S. Marcarelli X. Besseron M. Luo J. Vienne H.-W. Jin E. Mancini H. Wang OpenPOWER Summit (Mar 18) 55

56 OpenPOWER Summit (Mar 18) 56 Thank You! Network-Based Computing Laboratory The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project


More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience

Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience Talk at OpenSHMEM Workshop (August 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015)

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015) High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Keynote Talk at Swiss Conference (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

An Assessment of MPI 3.x (Part I)

An Assessment of MPI 3.x (Part I) An Assessment of MPI 3.x (Part I) High Performance MPI Support for Clouds with IB and SR-IOV (Part II) Talk at HP-CAST 25 (Nov 2015) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI INAM 2 : InfiniBand Network Analysis and Monitoring with MPI H. Subramoni, A. A. Mathews, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda Department of Computer Science and Engineering The

More information

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Keynote Talk at HPSR 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters 2015 IEEE International Conference on Cluster Computing Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan,

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09 RDMA over Ethernet - A Preliminary Study Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar. K. Panda Computer Science & Engineering Department The Ohio State University Introduction Problem Statement

More information

Unified Communication X (UCX)

Unified Communication X (UCX) Unified Communication X (UCX) Pavel Shamis / Pasha ARM Research SC 18 UCF Consortium Mission: Collaboration between industry, laboratories, and academia to create production grade communication frameworks

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

UCX: An Open Source Framework for HPC Network APIs and Beyond

UCX: An Open Source Framework for HPC Network APIs and Beyond UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation

More information