High-Performance MPI and Deep Learning on OpenPOWER Platform


1 High-Performance MPI and Deep Learning on OpenPOWER Platform. Talk at OpenPOWER Summit (Mar 18) by Dhabaleswar K. (DK) Panda (The Ohio State University) and Hari Subramoni (The Ohio State University).

2 Increasing Usage of HPC, Big Data and Deep Learning: HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.). Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!! OpenPOWER Summit (Mar 18) 2

3 Parallel Programming Models Overview. Three broad models: Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI - Message Passing Interface), and Partitioned Global Address Space (PGAS) with logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, ...). Programming models provide abstract machine models; models can be mapped on different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc. PGAS models and Hybrid MPI+PGAS models are gradually receiving importance. OpenPOWER Summit (Mar 18) 3

4 Partitioned Global Address Space (PGAS) Models. Key features - simple shared memory abstractions, lightweight one-sided communication, easier to express irregular communication. Different approaches to PGAS - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel; Libraries: OpenSHMEM, UPC++, Global Arrays. OpenPOWER Summit (Mar 18) 4

5 Hybrid (MPI+PGAS) Programming. Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics: an HPC application may run Kernel 1 in MPI, Kernel 2 in PGAS alongside MPI, Kernel 3 in MPI, ..., Kernel N in PGAS alongside MPI. Benefits: best of the Distributed Computing Model and best of the Shared Memory Computing Model. A minimal sketch of such a hybrid program appears below. OpenPOWER Summit (Mar 18) 5
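To make the hybrid model concrete, the following is a minimal sketch of an MPI+OpenSHMEM program of the kind that a unified runtime such as MVAPICH2-X targets; the kernel bodies, variable names, and the initialization/finalization order shown are illustrative assumptions, not code from the talk.

    /* Minimal hybrid MPI+OpenSHMEM sketch (illustrative only). */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Initialization order is one common pattern; it may vary by runtime. */
        MPI_Init(&argc, &argv);
        shmem_init();

        int rank, npes = shmem_n_pes();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Kernel written in MPI: regular, collective communication */
        int local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel written in PGAS: irregular, one-sided communication */
        static int neighbor_val = -1;                      /* symmetric object */
        shmem_int_p(&neighbor_val, rank, (rank + 1) % npes);
        shmem_barrier_all();

        printf("PE %d: allreduce sum=%d, value from left neighbor=%d\n",
               rank, sum, neighbor_val);

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }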

6 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges. Application kernels/applications sit on programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.), which rely on middleware - a communication library or runtime for the programming models providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance - running over networking technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA). Co-design opportunities and challenges across various layers: performance, scalability, resilience. OpenPOWER Summit (Mar 18) 6

7 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning Support for Big Data (Hadoop and Spark) on OpenPOWER (Talk at 2:15 pm) OpenPOWER Summit (Mar 18) 7

8 OpenPOWER Summit (Mar 18) 8 Overview of the MVAPICH2 Project. High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE). MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002. MVAPICH2-X (MPI + PGAS), available since 2011. Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014. Support for Virtualization (MVAPICH2-Virt), available since 2015. Support for Energy-Awareness (MVAPICH2-EA), available since 2015. Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015. Used by more than 2,875 organizations in 85 countries. More than 455,000 (> 0.45 million) downloads from the OSU site directly. Empowering many TOP500 clusters (Nov 17 ranking): 1st, 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China; 9th, 556,104-core Oakforest-PACS in Japan; 12th, 368,928-core Stampede2 at TACC; 17th, 241,108-core Pleiades at NASA; 48th, 76,032-core Tsubame 2.5 at Tokyo Institute of Technology. Available with software stacks of many vendors, Linux distros (RedHat and SuSE), and OpenHPC. Empowering TOP500 systems for over a decade.

9 OpenPOWER Summit (Mar 18) 9 MVAPICH2 Software Family. High-Performance Parallel Programming Libraries: MVAPICH2 - support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE; MVAPICH2-X - advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime; MVAPICH2-GDR - optimized MPI for clusters with NVIDIA GPUs; MVAPICH2-Virt - high-performance and scalable MPI for hypervisor- and container-based HPC cloud; MVAPICH2-EA - energy-aware and high-performance MPI; MVAPICH2-MIC - optimized MPI for clusters with Intel KNC (being updated with KNL-based designs). Microbenchmarks: OMB - microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs. Tools: OSU INAM - network monitoring, profiling, and analysis for clusters with MPI and scheduler integration; OEMT - utility to measure the energy consumption of MPI applications.

10 MVAPICH2 Release Timeline and Downloads: number of downloads over time, Sep 2004 through Jan 2018. Releases marked along the timeline include MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, and OSU INAM. OpenPOWER Summit (Mar 18) 10

11 Architecture of MVAPICH2 Software Family. High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk). High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis. Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path): transport protocols (RC, XRC, UD, DC) and modern features (UMR, ODP, SR-IOV, multi-rail). Support for modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU): shared-memory transport mechanisms (CMA, IVSHMEM, XPMEM*) and modern features (MCDRAM*, NVLink, CAPI*). (* Upcoming) OpenPOWER Summit (Mar 18) 11

12 OpenPOWER Summit (Mar 18) 12 OpenPOWER Platform Support in MVAPICH2 Libraries. MVAPICH2: basic MPI support since MVAPICH2 2.2rc1 (March 2016). MVAPICH2-X: PGAS (OpenSHMEM and UPC) and hybrid MPI+PGAS support since MVAPICH2-X 2.2b (March 2016); advanced collective support with CMA since MVAPICH2-X 2.3b (Oct 2017). MVAPICH2-GDR: NVIDIA GPGPU support with NVLink (CORAL-like systems) since MVAPICH2-GDR 2.3a (Nov 2017).

13 OpenPOWER Summit (Mar 18) 13 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Point-to-point Inter-node and Intra-node CMA-based Collectives Upcoming XPMEM Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

14 Intra-node Point-to-Point Performance on OpenPOWER. Plots compare MVAPICH2-2.3rc1, Spectrum MPI, and OpenMPI-3.0.0 for intra-socket small-message latency (MVAPICH2 achieves about 0.3 us), intra-socket large-message latency, intra-socket bandwidth, and intra-socket bi-directional bandwidth across message sizes up to 2M. Platform: two nodes of OpenPOWER (Power8-ppc64le) CPUs using a Mellanox EDR (MT4115) HCA. OpenPOWER Summit (Mar 18) 14

15 Inter-node Point-to-Point Performance on OpenPOWER. Plots compare MVAPICH2-2.3rc1, Spectrum MPI, and OpenMPI-3.0.0 for inter-node small-message latency, large-message latency, bandwidth, and bi-directional bandwidth across message sizes up to 2M. Platform: two nodes of OpenPOWER (Power8-ppc64le) CPUs using a Mellanox EDR (MT4115) HCA. OpenPOWER Summit (Mar 18) 15

16 Multi-Pair Latency Comparison. Basic point-to-point multi-pair latency comparison for intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) cases, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM across message sizes up to 4M. Up to 48% and 45% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 16

17 Multi-Pair Bandwidth Comparison. Basic point-to-point multi-pair bandwidth comparison for intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) cases, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM across message sizes up to 4M. Up to 83% and 62% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 17

18 OpenPOWER Summit (Mar 18) 18 Process Mapping Support in MVAPICH2 for Multi-Core Platforms. Process-mapping support in MVAPICH2 (available since v1.4): preset binding policies with Policy = bunch (default), scatter, or hybrid and Granularity = core (default), socket, or numanode for MPI rank-to-core binding; user-defined binding is also supported. MVAPICH2 detects the processor architecture at job launch.

19 Process and Thread binding policies in Hybrid MPI+Threads A new process binding policy hybrid MV2_CPU_BINDING_POLICY = hybrid A new environment variable for co-locating Threads with MPI Processes MV2_THREADS_PER_PROCESS = k Automatically set to OMP_NUM_THREADS if OpenMP is being used Provides a hint to the MPI runtime to spare resources for application threads. New variable for threads bindings with respect to parent process and architecture MV2_THREADS_BINDING_POLICY = {linear compact spread} Linear binds MPI ranks and OpenMP threads sequentially (one after the other) Recommended to be used on non-hyperthreaded systems with MPI+OpenMP Compact binds MPI rank to physical-core and locates respective OpenMP threads on hardware threads Recommended to be used on multi-/many-cores e.g., KNL, POWER8, and hyper-threaded Xeon OpenPOWER Summit (Mar 18) 19
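As an illustration of these knobs (not from the slides; the process count, thread count, and hostfile name are assumptions), a hybrid MPI+OpenMP job might be launched as:

    $ mpirun_rsh -np 8 -hostfile hosts OMP_NUM_THREADS=5 MV2_CPU_BINDING_POLICY=hybrid MV2_THREADS_PER_PROCESS=5 MV2_THREADS_BINDING_POLICY=compact ./a.out

Here MV2_THREADS_PER_PROCESS matches OMP_NUM_THREADS so that the runtime reserves five cores/hardware threads per MPI rank for its OpenMP threads.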

20 User-Defined Process Mapping. The user has complete control over process mapping. To run 4 processes on cores 0, 1, 4, 5: $ mpirun_rsh -np 4 -hostfile hosts MV2_CPU_MAPPING=0:1:4:5 ./a.out. Use ',' or '-' to bind to a set of cores: $ mpirun_rsh -np 64 -hostfile hosts MV2_CPU_MAPPING=0,2-4:1:5:6 ./a.out. Is process binding working as expected? MV2_SHOW_CPU_BINDING=1 displays CPU binding information and is launcher independent. Example with MV2_SHOW_CPU_BINDING=1 MV2_CPU_BINDING_POLICY=scatter: CPU AFFINITY RANK:0 CPU_SET: 0, RANK:1 CPU_SET: 8. Refer to the "Running with Efficient CPU (Core) Mapping" section of the MVAPICH2 user guide for more information. OpenPOWER Summit (Mar 18) 20

21 OpenPOWER Summit (Mar 18) 21 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Point-to-point Inter-node and Intra-node CMA-based Collectives Upcoming XPMEM Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

22 Scalable Collectives with CMA (Intra-node, Reduce & Alltoall), (Nodes=1, PPN=20). Plots compare MVAPICH2-X-2.3b, Spectrum MPI, and OpenMPI-3.0.0 for MPI_Reduce and MPI_Alltoall, split into small (up to 4K) and large (8K to 1M) message ranges. Up to 5X and 3X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively. OpenPOWER Summit (Mar 18) 22

23 Scalable Collectives with CMA (Intra-node, Gather & Scatter), (Nodes=1, PPN=20). Plots compare MVAPICH2-X-2.3b, Spectrum MPI, and OpenMPI-3.0.0 for MPI_Gather and MPI_Scatter across message sizes from 1K to 1M. Up to 24X and 15X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively. OpenPOWER Summit (Mar 18) 23

24 Scalable Collectives with CMA (Multi-node, Reduce & Alltoall), (Nodes=4, PPN=20). Plots compare MVAPICH2-X-2.3b and OpenMPI-3.0.0 for MPI_Reduce and MPI_Alltoall, split into small and large message ranges up to 1M. Up to 12.4X and 8.5X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively. OpenPOWER Summit (Mar 18) 24

25 Scalable Collectives with CMA (Multi-node, Gather & Scatter), (Nodes=4, PPN=20). Plots compare MVAPICH2-X-2.3b and OpenMPI-3.0.0 for MPI_Gather and MPI_Scatter across message sizes from 1K to 1M. Up to 17.8X and 3.9X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively. OpenPOWER Summit (Mar 18) 25

26 OpenPOWER Summit (Mar 18) 26 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Point-to-point Inter-node and Intra-node CMA-based Collectives Upcoming XPMEM Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

27 Optimized MVAPICH2 All-Reduce with XPMEM, (Nodes=1, PPN=20) and (Nodes=2, PPN=20). Optimized MPI All-Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 2M. Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 27

28 Optimized MVAPICH2 All-Reduce with XPMEM, (Nodes=3, PPN=20) and (Nodes=4, PPN=20). Optimized MPI All-Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 2M. Up to 2X performance improvement over OpenMPI for inter-node (Spectrum MPI didn't run for >2 processes). Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 28

29 Optimized MVAPICH2 Reduce with XPMEM, (Nodes=1, PPN=20) and (Nodes=2, PPN=20). Optimized MPI Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 16M. Up to 3.1X performance improvement over OpenMPI at scale and up to 37% over Spectrum MPI on intra-node. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 29

30 Optimized MVAPICH2 Reduce with XPMEM, (Nodes=3, PPN=20) and (Nodes=4, PPN=20). Optimized MPI Reduce design in MVAPICH2, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM for message sizes from 16K to 16M. Up to 7X performance improvement over OpenMPI at scale and up to 25% over MVAPICH2-2.3rc1. Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch. OpenPOWER Summit (Mar 18) 30

31 MiniAMR Performance using Optimized XPMEM-based Collectives. MiniAMR application execution time (seconds) comparing MVAPICH2-2.3rc1 and the optimized XPMEM-based All-Reduce design over increasing MPI process counts (proc-per-socket=10, proc-per-node=20). Up to 45% improvement over MVAPICH2-2.3rc1 in the mesh-refinement time of the MiniAMR application for a weak-scaling workload on up to four POWER8 nodes (36-45% at the measured scales). Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=scatter. OpenPOWER Summit (Mar 18) 31

32 OpenPOWER Summit (Mar 18) 32 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning

33 MVAPICH2-X for Hybrid MPI + PGAS Applications. Current model - separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI: possible deadlock if both runtimes are not progressed, and consumes more network resources. MVAPICH2-X provides a unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF - available since 2012 (starting with MVAPICH2-X 1.9). OpenPOWER Summit (Mar 18) 33

34 OpenPOWER Summit (Mar 18) 34 OpenSHMEM PUT/GET using MVAPICH2-X on OpenPOWER. OpenSHMEM one-sided PUT/GET (shmem_put and shmem_get) latency benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster, for intra-node (PPN=2) and inter-node (PPN=1) cases across small and large message sizes. Near-native performance benefits are observed on OpenPOWER. A sketch of the put/get pattern being measured appears below.
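For reference, the one-sided pattern measured by these benchmarks looks roughly like the sketch below; this is a minimal illustration under an assumed message size and peer selection, not the OSU benchmark source.

    /* Illustrative OpenSHMEM one-sided PUT/GET between neighboring PEs. */
    #include <shmem.h>
    #include <string.h>

    #define MSG_SIZE 4096

    int main(void)
    {
        static char sym_buf[MSG_SIZE];   /* symmetric: remotely accessible */
        char local_buf[MSG_SIZE];

        shmem_init();
        int me   = shmem_my_pe();
        int peer = (me + 1) % shmem_n_pes();

        memset(local_buf, (char)me, MSG_SIZE);

        shmem_putmem(sym_buf, local_buf, MSG_SIZE, peer);   /* one-sided PUT */
        shmem_quiet();                                      /* complete remote writes */
        shmem_getmem(local_buf, sym_buf, MSG_SIZE, peer);   /* one-sided GET */

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }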

35 OpenPOWER Summit (Mar 18) 35 UPC PUT/GET benchmarks using MVAPICH2-X on OpenPOWER. UPC one-sided memput/memget (upc_putmem and upc_getmem) benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster, for intra-node (PPN=2) and inter-node (PPN=1) cases across message sizes up to 4M. Near-native performance benefits are observed on OpenPOWER.

36 OpenPOWER Summit (Mar 18) 36 UPC++ PUT/GET benchmarks using MVAPICH2-X on OpenPOWER. UPC++ one-sided async_copy put/get (upcxx_async_put and upcxx_async_get) benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster, for intra-node (PPN=2) and inter-node (PPN=1) cases across message sizes up to 4M. Near-native performance benefits are observed on OpenPOWER.

37 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) CUDA-aware MPI Optimized GPUDirect RDMA (GDR) Support Optimized MVAPICH2 for OpenPower Support for Deep Learning OpenPOWER Summit (Mar 18) 37

38 GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU. Standard MPI interfaces are used for unified data movement; the library takes advantage of Unified Virtual Addressing (>= CUDA 4.0) and overlaps data movement from the GPU with RDMA transfers. At the sender: MPI_Send(s_devbuf, size, ...); at the receiver: MPI_Recv(r_devbuf, size, ...); everything else happens inside MVAPICH2. High performance and high productivity. A complete two-rank sketch follows. OpenPOWER Summit (Mar 18) 38
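Expanding the fragment above into a complete two-rank program gives the following hedged sketch; the buffer name, the 1 MB message size, and the omission of error checking are assumptions made for brevity.

    /* CUDA-aware point-to-point sketch: device pointers passed directly to MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t size = 1 << 20;          /* 1 MB message */
        char *devbuf;
        cudaMalloc((void **)&devbuf, size);   /* GPU device buffer */

        /* The device pointer goes straight into MPI; the CUDA-aware runtime
           handles the data movement (GDR, pipelining, etc.) internally. */
        if (rank == 0)
            MPI_Send(devbuf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }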

39 CUDA-Aware MPI: MVAPICH2-GDR Releases. Support for MPI communication from NVIDIA GPU device memory. High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU). High-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host and Host-GPU), taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node. Optimized and tuned collectives for GPU device buffers. MPI datatype support for point-to-point and collective communication from GPU device buffers (an illustrative datatype sketch follows). Unified memory support. OpenPOWER Summit (Mar 18) 39
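As an illustration of the datatype support listed above (a sketch under assumed matrix dimensions and a two-rank exchange, not MVAPICH2-specific code), a strided column of a GPU-resident matrix can be communicated directly from device memory:

    /* Send one column of a 1024x1024 row-major float matrix stored on the GPU. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *d_mat;
        cudaMalloc((void **)&d_mat, 1024 * 1024 * sizeof(float));

        MPI_Datatype column;                                  /* strided column */
        MPI_Type_vector(1024, 1, 1024, MPI_FLOAT, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_mat, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_mat, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_mat);
        MPI_Finalize();
        return 0;
    }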

40 Optimized MVAPICH2-GDR Design. GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth comparing MV2 (no GDR) and MV2-GDR-2.3a: 1.88 us small-message latency (about 10x better), about 9x higher bandwidth, and about 11X higher bi-directional bandwidth for small and medium messages. Platform: MVAPICH2-GDR-2.3a on an Intel Haswell (3.1 GHz) node with 20 cores, an NVIDIA Volta V100 GPU, a Mellanox Connect-X4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPU-Direct-RDMA. OpenPOWER Summit (Mar 18) 40

41 Application-Level Evaluation (HOOMD-blue). Average time steps per second (TPS) for 64K-particle and 256K-particle workloads over increasing numbers of processes, comparing MV2 and MV2+GDR: roughly 2X improvement with GDR in both cases. Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB), HOOMD-blue version 1.0.5. GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=. OpenPOWER Summit (Mar 18) 41

42 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland. Normalized execution time on the Wilkes GPU cluster and the CSCS GPU cluster, comparing Default, Callback-based, and Event-based designs over increasing numbers of GPUs: 2X improvement on 32 GPU nodes and 30% improvement on 96 GPU nodes (8 GPUs/node). Cosmo model: /tasks/operational/meteoswiss/. On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application. C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS 16. OpenPOWER Summit (Mar 18) 42

43 MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal). Intra-node and inter-node latency (small and large messages) and bandwidth, comparing the intra-socket (NVLink) and inter-socket paths across message sizes up to 4M. Intra-node latency: 14.6 us (without GPUDirect RDMA); intra-node bandwidth: 33.9 GB/sec (NVLink). Inter-node latency: 23.8 us (without GPUDirect RDMA); inter-node bandwidth: 11.9 GB/sec (EDR). Available in MVAPICH2-GDR 2.3a. Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and EDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 43

44 OpenPOWER Summit (Mar 18) 44 Compilation of Best Practices. The MPI runtime has many parameters, and tuning a set of parameters can help you extract higher performance. We have compiled a list of such contributions through the MVAPICH website. List of applications (16 contributions so far): Amber, Cloverleaf, GAPgeofem, HoomDBlue, HPCG, LAMPS, LU, Lulesh, MILC, Neuro, POP2, SMG2000, SPEC MPI, TERA_TF, UMR, and WRF2. Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link these results with credits to you.

45 OpenPOWER Summit (Mar 18) 45 MPI, PGAS and Deep Learning Support for OpenPOWER Message Passing Interface (MPI) Support Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators (NVIDIA GPGPUs) Support for Deep Learning New challenges for MPI Run-time Optimizing Large-message Collectives Co-designing DL Frameworks

46 Deep Learning: New Challenges for MPI Runtimes. Deep Learning frameworks are a different game altogether: unusually large message sizes (on the order of megabytes) and most communication based on GPU buffers. Existing state-of-the-art: cuDNN, cuBLAS, and NCCL give scale-up performance; CUDA-Aware MPI gives scale-out performance, but only for small and medium message sizes! Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both? Efficient overlap of computation and communication, efficient large-message communication (reductions), and the question of what application co-designs are needed to exploit communication-runtime co-designs. A sketch of the underlying gradient-reduction pattern follows. A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). OpenPOWER Summit (Mar 18) 46
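The gradient-reduction pattern behind these large-message collectives can be written as a single CUDA-aware MPI_Allreduce on a GPU-resident buffer. The sketch below is illustrative only (the gradient size and the cudaMemset stand-in for backpropagation are assumptions), not OSU-Caffe code.

    /* Data-parallel gradient aggregation with a large CUDA-aware Allreduce. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = 64 * 1024 * 1024;              /* 64M floats (~256 MB) */
        float *d_grads;
        cudaMalloc((void **)&d_grads, count * sizeof(float));
        cudaMemset(d_grads, 0, count * sizeof(float));   /* stand-in for backprop output */

        /* One large in-place reduction per training iteration; large-message
           collective designs in the runtime target exactly this pattern. */
        MPI_Allreduce(MPI_IN_PLACE, d_grads, count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_grads);
        MPI_Finalize();
        return 0;
    }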

47 Large Message Optimized Collectives for Deep Learning. MV2-GDR provides optimized collectives for large message sizes: optimized Reduce, Allreduce, and Bcast with good scaling to large numbers of GPUs, available since MVAPICH2-GDR 2.2GA. Plots show latency vs. message size (MB) for Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs, and latency vs. number of GPUs for Reduce at 64 MB, Allreduce, and Bcast at 128 MB. OpenPOWER Summit (Mar 18) 47

48 MVAPICH2: Allreduce Comparison with Baidu and OpenMPI. Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*. 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI across message sizes up to 512M. MVAPICH2 is ~10X better than OpenMPI for small messages and ~3.5X better at medium sizes; at the largest sizes, OpenMPI is ~4X slower than Baidu while MV2 is ~20% better than Baidu. *Available with MVAPICH2-GDR 2.3a and higher. Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 48

49 MVAPICH2: Allreduce Comparison with Baidu and OpenMPI. 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI across message sizes up to 512M. MVAPICH2 is ~30X better than OpenMPI for small messages, ~10X better at medium sizes, and ~4X better for large messages; at the largest sizes, OpenMPI is ~5X slower than Baidu while MV2 is ~2X better than Baidu. *Available with MVAPICH2-GDR 2.3a and higher. Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 49

50 MVAPICH2-GDR vs. NCCL2. Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases. MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs, for 1 GPU/node and 2 GPUs/node configurations across message sizes up to 64M: up to ~10X better in one configuration and ~4X better in the other. *Will be available with the upcoming MVAPICH2-GDR 2.3b. Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand interconnect. OpenPOWER Summit (Mar 18) 50

51 OSU-Caffe: Scalable Deep Learning. Caffe: a flexible and layered Deep Learning framework, with benefits and weaknesses - multi-GPU training within a single node, but performance degradation for GPUs across different sockets and limited scale-out. OSU-Caffe: MPI-based parallel training that enables scale-up (within a node) and scale-out (across multi-GPU nodes); scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset, and on 128 GPUs for training the GoogLeNet network on the ImageNet dataset. Training time (seconds) for GoogLeNet (ImageNet) on 128 GPUs compares Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); plain Caffe is an invalid use case at this scale. OSU-Caffe is publicly available. Support on OpenPOWER will be available soon. OpenPOWER Summit (Mar 18) 51

52 OpenPOWER Summit (Mar 18) 52 Concluding Remarks Next generation HPC systems need to be designed with a holistic view of Big Data and Deep Learning OpenPOWER platform is emerging with many novel features Presented some of the approaches and results along these directions from the MVAPICH2 and HiDL Projects Solutions Enable High-Performance and Scalable HPC and Deep Learning on OpenPOWER Platforms

53 OpenPOWER Summit (Mar 18) 53 High-Performance Hadoop and Spark on OpenPOWER Platform Talk at 2:15 pm

54 OpenPOWER Summit (Mar 18) 54 Funding Acknowledgments Funding Support by Equipment Support by

55 Personnel Acknowledgments Current Students (Graduate) A. Awan (Ph.D.) R. Biswas (M.S.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) Current Students (Undergraduate) J. Hashmi (Ph.D.) N. Sarkauskas (B.S.) H. Javed (Ph.D.) P. Kousha (Ph.D.) D. Shankar (Ph.D.) H. Shi (Ph.D.) J. Zhang (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela Current Research Specialist J. Smith M. Arnold Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) W. Huang (Ph.D.) W. Jiang (M.S.) J. Jose (Ph.D.) S. Kini (M.S.) M. Koop (Ph.D.) K. Kulkarni (M.S.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) K. Kandalla (Ph.D.) M. Li (Ph.D.) P. Lai (M.S.) J. Liu (Ph.D.) M. Luo (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) V. Meshram (M.S.) A. Moody (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) X. Ouyang (Ph.D.) S. Pai (M.S.) S. Potluri (Ph.D.) R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee J. Lin S. Marcarelli X. Besseron M. Luo J. Vienne H.-W. Jin E. Mancini H. Wang OpenPOWER Summit (Mar 18) 55

56 OpenPOWER Summit (Mar 18) 56 Thank You! Network-Based Computing Laboratory The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project


More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience

Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience Talk at OpenSHMEM Workshop (August 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015)

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015) High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Keynote Talk at Swiss Conference (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

An Assessment of MPI 3.x (Part I)

An Assessment of MPI 3.x (Part I) An Assessment of MPI 3.x (Part I) High Performance MPI Support for Clouds with IB and SR-IOV (Part II) Talk at HP-CAST 25 (Nov 2015) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI INAM 2 : InfiniBand Network Analysis and Monitoring with MPI H. Subramoni, A. A. Mathews, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda Department of Computer Science and Engineering The

More information

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Keynote Talk at HPSR 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters 2015 IEEE International Conference on Cluster Computing Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan,

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09 RDMA over Ethernet - A Preliminary Study Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar. K. Panda Computer Science & Engineering Department The Ohio State University Introduction Problem Statement

More information

Unified Communication X (UCX)

Unified Communication X (UCX) Unified Communication X (UCX) Pavel Shamis / Pasha ARM Research SC 18 UCF Consortium Mission: Collaboration between industry, laboratories, and academia to create production grade communication frameworks

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

UCX: An Open Source Framework for HPC Network APIs and Beyond

UCX: An Open Source Framework for HPC Network APIs and Beyond UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation

More information