Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities

1 Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities Keynote Talk at SEA Symposium (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University

2 SEA Symposium (April 18) 2 High-End Computing (HEC): Towards Exascale [chart: growth of high-end computing from 1 PFlops towards 1 EFlops; an ExaFlop system is expected in the coming years!]

3 SEA Symposium (April 18) 3 Increasing Usage of HPC, Big Data and Deep Learning HPC (MPI, RDMA, Lustre, etc.) Big Data (Hadoop, Spark, HBase, Memcached, etc.) Deep Learning (Caffe, TensorFlow, BigDL, etc.) Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!!

4 SEA Symposium (April 18) 4 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure? Physical Compute

7 Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure? SEA Symposium (April 18) 7 Hadoop Job Deep Learning Job Spark Job

8 SEA Symposium (April 18) 8 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

9 Parallel Programming Models Overview [diagram: Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI), and Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, ...)] Programming models provide abstract machine models Models can be mapped on different types of systems, e.g. Distributed Shared Memory (DSM), MPI within a node, etc. PGAS models and Hybrid MPI+PGAS models are gradually gaining importance SEA Symposium (April 18) 9

10 SEA Symposium (April 18) 10 Partitioned Global Address Space (PGAS) Models Key features - Simple shared memory abstractions - Lightweight one-sided communication - Easier to express irregular communication Different approaches to PGAS - Languages (Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel) - Libraries (OpenSHMEM, UPC++, Global Arrays)
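
To make the one-sided PGAS style concrete, here is a minimal OpenSHMEM sketch in C (a hedged illustration, not taken from the talk): each PE writes directly into a symmetric variable on its neighbor with a put, and no matching receive is posted on the target.

    #include <shmem.h>
    #include <stdio.h>

    long src[1], dst[1];              /* symmetric arrays: remotely accessible on every PE */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        src[0] = me;
        dst[0] = -1;

        /* One-sided put: deposit my rank into dst[0] on my right neighbor.
           The target PE does not post a matching receive. */
        shmem_long_put(dst, src, 1, (me + 1) % npes);

        shmem_barrier_all();          /* complete outstanding puts and synchronize */
        printf("PE %d received %ld\n", me, dst[0]);

        shmem_finalize();
        return 0;
    }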

11 SEA Symposium (April 18) 11 Hybrid (MPI+PGAS) Programming Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics Benefits: Best of Distributed Computing Model Best of Shared Memory Computing Model HPC Application Kernel 1 MPI Kernel 2 PGAS MPI Kernel 3 MPI Kernel N PGAS MPI
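
As a hedged sketch of the hybrid model (assuming a unified runtime such as MVAPICH2-X, where MPI and OpenSHMEM can coexist in one job), an irregular kernel can use one-sided OpenSHMEM puts while a regular kernel uses an MPI collective:

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    long cell = 0;                     /* symmetric variable for one-sided access */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);        /* initialization order may vary by implementation */
        shmem_init();

        int me = shmem_my_pe(), npes = shmem_n_pes();

        /* Kernel 1 (irregular communication): one-sided put to the right neighbor */
        long val = 10L * me;
        shmem_long_put(&cell, &val, 1, (me + 1) % npes);
        shmem_barrier_all();

        /* Kernel 2 (regular communication): collective reduction with MPI */
        long sum = 0;
        MPI_Allreduce(&cell, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (me == 0) printf("global sum = %ld\n", sum);

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }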

12 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges Application Kernels/Applications Point-to-point Communication Programming Models MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc. Networking Technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path) Middleware Communication Library or Runtime for Programming Models Collective Communication Energy-Awareness Synchronization and Locks Multi-/Many-core Architectures I/O and File Systems Fault Tolerance Accelerators (GPU and FPGA) Co-Design Opportunities and Challenges across Various Layers Performance Scalability Resilience SEA Symposium (April 18) 12

13 Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale Scalability for million to billion processors Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) Scalable job start-up Scalable Collective communication Offload Non-blocking Topology-aware Balancing intra-node and inter-node communication for next generation nodes ( cores) Multiple end-points per node Support for efficient multi-threading Integrated Support for GPGPUs and Accelerators Fault-tolerance/resiliency QoS support for communication and I/O Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, ) Virtualization Energy-Awareness SEA Symposium (April 18) 13
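
For the non-blocking collective item above, the basic pattern looks roughly like the following C sketch: the reduction is started with MPI_Iallreduce, independent computation proceeds, and the result is collected with MPI_Wait (the progress may be offloaded to the network, depending on the MPI library and fabric).

    #include <mpi.h>

    /* Hedged sketch: overlap independent work with a non-blocking Allreduce. */
    void overlapped_allreduce(double *sendbuf, double *recvbuf, int count,
                              void (*independent_work)(void))
    {
        MPI_Request req;
        MPI_Iallreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        independent_work();                  /* computation that does not need the result */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result is now valid in recvbuf */
    }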

14 SEA Symposium (April 18) 14 Additional Challenges for Designing Exascale Software Libraries Extreme Low Memory Footprint Memory per core continues to decrease D-L-A Framework Discover Overall network topology (fat-tree, 3D, ), Network topology for processes for a given job Node architecture, Health of network and node Learn Impact on performance and scalability Potential for failure Adapt Internal protocols and algorithms Process mapping Fault-tolerance solutions Low overhead techniques while delivering performance, scalability and fault-tolerance

15 SEA Symposium (April 18) 15 Overview of the MVAPICH2 Project High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE) MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 MVAPICH2-X (MPI + PGAS), Available since 2011 Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 Support for Virtualization (MVAPICH2-Virt), Available since 2015 Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015 Used by more than 2,875 organizations in 86 countries More than 460,000 (> 0.46 million) downloads from the OSU site directly Empowering many TOP500 clusters (Nov '17 ranking) 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China 9th, 556,104 cores (Oakforest-PACS) in Japan 12th, 368,928-core (Stampede2) at TACC 17th, 241,108-core (Pleiades) at NASA 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology Available with software stacks of many vendors and Linux Distros (RedHat and SuSE) Empowering TOP500 systems for over a decade

16 MVAPICH2 Release Timeline and Downloads [chart: cumulative number of downloads from Sep 2004 to Jan 2018, annotated with release points including MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV 1.1, MV2 1.4 through MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, and OSU INAM] SEA Symposium (April 18) 16

17 Architecture of MVAPICH2 Software Family High Performance Parallel Programming Models Message Passing Interface (MPI) PGAS (UPC, OpenSHMEM, CAF, UPC++) Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk) High Performance and Scalable Communication Runtime Diverse APIs and Mechanisms Point-to-point Primitives Collectives Algorithms Job Startup Energy-Awareness Remote Memory Access I/O and File Systems Fault Tolerance Virtualization Active Messages Introspection & Analysis Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU) Transport Protocols: RC, XRC, UD, DC, UMR, ODP Modern Features: SR-IOV, Multi-Rail Shared Memory Transport Mechanisms: CMA, IVSHMEM, XPMEM* Modern Features: MCDRAM*, NVLink*, CAPI* (* Upcoming) SEA Symposium (April 18) 17

18 SEA Symposium (April 18) 18 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Support for highly-efficient inter-node and intra-node communication Scalable Start-up Optimized Collectives using SHArP and Multi-Leaders Optimized CMA-based Collectives Upcoming Optimized XPMEM-based Collectives Performance Engineering with MPI-T Support Integrated Network Analysis and Monitoring Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

19 One-way Latency: MPI over IB with MVAPICH2 [charts: small- and large-message one-way latency (us) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path] Platforms: Deca-core Intel nodes (Haswell for TrueScale-QDR, ConnectIB-Dual FDR, ConnectX-5-EDR, and Omni-Path; IvyBridge for ConnectX-3-FDR), PCI Gen3, with the corresponding IB or Omni-Path switch SEA Symposium (April 18) 19
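
These latency numbers come from osu_latency-style ping-pong measurements; a minimal sketch of that measurement loop in C (hedged, not the actual OSU benchmark source) is:

    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000
    #define SKIP  1000
    #define SIZE  8                          /* message size in bytes */

    int main(int argc, char **argv) {
        char buf[SIZE];
        int rank;
        double t0 = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < ITERS + SKIP; i++) {
            if (i == SKIP) t0 = MPI_Wtime();             /* start timing after warm-up */
            if (rank == 0) {
                MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            double one_way_us = (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS);
            printf("%d bytes: %.2f us one-way latency\n", SIZE, one_way_us);
        }
        MPI_Finalize();
        return 0;
    }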

20 Bandwidth: MPI over IB with MVAPICH2 [charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-5-EDR, and Omni-Path; peak bidirectional bandwidths of 24,136, 22,564, 21,983, 12,161, and 6,228 MB/s across the adapters] Platforms: Deca-core Intel (Haswell/IvyBridge) nodes, PCI Gen3, with the corresponding IB or Omni-Path switch SEA Symposium (April 18) 20

21 Startup Performance on KNL + Omni-Path [chart: MPI_Init and Hello World time (seconds) vs. number of processes, 1K to 64K, on Oakforest-PACS with MVAPICH2-2.3a] MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2 full scale) 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC) At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS) All numbers reported with 64 processes per node New designs available since MVAPICH2-2.3a and as patch for SLURM 15, 16, and 17 SEA Symposium (April 18) 21

22 Advanced Allreduce Collective Designs Using SHArP [charts: Avg DDOT Allreduce time of HPCG and Mesh Refinement Time of MiniAMR vs. (Number of Nodes, PPN) of (4,28), (8,28), and (16,28); MVAPICH2-SHArP improves over MVAPICH2 by up to 12-13%] SHArP support is available since MVAPICH2 2.3a M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. SEA Symposium (April 18) 22

23 Performance of MPI_Allreduce On Stampede2 (10,240 Processes) [charts: MPI_Allreduce latency (us) vs. message size, up to 256K, for MVAPICH2, MVAPICH2-OPT, and IMPI; OSU Micro Benchmark, 64 PPN] For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17. Available in MVAPICH2-X 2.3b SEA Symposium (April 18) 23
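
The OSU micro-benchmark measurement behind these Allreduce numbers is essentially a timed loop over MPI_Allreduce; a hedged sketch in C:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        const int bytes = 32 * 1024;                 /* 32K-byte message, as on the slide */
        const int count = bytes / sizeof(float);
        const int iters = 1000, skip = 100;

        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *sbuf = malloc(bytes), *rbuf = malloc(bytes);
        for (int i = 0; i < count; i++) sbuf[i] = 1.0f;

        double t0 = 0.0;
        for (int i = 0; i < iters + skip; i++) {
            if (i == skip) {                          /* warm-up done: sync and start timer */
                MPI_Barrier(MPI_COMM_WORLD);
                t0 = MPI_Wtime();
            }
            MPI_Allreduce(sbuf, rbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("avg MPI_Allreduce latency: %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6 / iters);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }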

24 Optimized CMA-based Collectives for Large Messages [charts: MPI_Gather latency on 2, 4, and 8 KNL nodes (128, 256, and 512 procs, 64 PPN), message sizes 1K-4M; Tuned CMA is roughly 2.5x to 17x better than MVAPICH2-2.3a, Intel MPI 2017, and OpenMPI 2.1.0 depending on scale] Significant improvement over existing implementations for Scatter/Gather with 1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPower) New two-level algorithms for better scalability Improved performance for other collectives (Bcast, Allgather, and Alltoall) S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster 17, BEST Paper Finalist Available in MVAPICH2-X 2.3b SEA Symposium (April 18) 24

25 Shared Address Space (XPMEM)-based Collectives Design [charts: OSU_Allreduce and OSU_Reduce latency (us) on Broadwell with 256 procs, message sizes 16K-4M, for MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-Opt] Shared Address Space-based true zero-copy Reduction collective designs in MVAPICH2 Offloaded computation/communication to peer ranks in reduction collective operation Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Will be available in a future release SEA Symposium (April 18) 25

26 Performance Engineering Applications using MVAPICH2 and TAU Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU Control the runtime's behavior via MPI Control Variables (CVARs) Introduced support for new MPI_T based CVARs to MVAPICH2 MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications VBUF usage without CVAR based tuning as displayed by ParaProf VBUF usage with CVAR based tuning as displayed by ParaProf S. Ramesh, A. Maheo, S. Shende, A. Malony, H. Subramoni, and D. K. Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, EuroMPI, 2017 [Best Paper Award] SEA Symposium (April 18) 26
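
To illustrate the MPI_T usage pattern, here is a hedged C sketch using the standard MPI-3 tool interface; the CVAR name below is one of those listed on the slide, and its availability, scope, and legal values depend on the MVAPICH2 build.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Look up a control variable by name via the standard MPI_T interface. */
    static int find_cvar_index(const char *wanted) {
        int num;
        MPI_T_cvar_get_num(&num);
        for (int i = 0; i < num; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype, &enumtype,
                                desc, &desc_len, &bind, &scope);
            if (strcmp(name, wanted) == 0) return i;
        }
        return -1;
    }

    int main(int argc, char **argv) {
        int provided, count;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        /* Set the CVAR before MPI_Init, the way a tool would for an
           uninstrumented application; some CVARs are read-only afterwards. */
        int idx = find_cvar_index("MPIR_CVAR_VBUF_POOL_SIZE");
        if (idx >= 0) {
            MPI_T_cvar_handle handle;
            int new_size = 1024;                 /* illustrative value only */
            MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);
            MPI_T_cvar_write(handle, &new_size);
            MPI_T_cvar_handle_free(&handle);
        } else {
            fprintf(stderr, "CVAR not exposed by this MPI library\n");
        }

        MPI_Init(&argc, &argv);
        /* ... application communication ... */
        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }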

27 SEA Symposium (April 18) 27 Overview of OSU INAM A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes Capability to analyze and profile node-level, job-level and process-level activities for MPI communication Point-to-Point, Collectives and RMA Ability to filter data based on type of counters using drop down list Remotely monitor various metrics of MPI processes at user specified granularity "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X Visualize the data transfer happening in a live or historical fashion for entire network, job or set of nodes OSU INAM v0.9.3 released on 3/16/2018 Enhance INAMD to query end nodes based on command line option Add a web page to display size of the database in real-time Enhance interaction between the web application and SLURM job launcher for increased portability Improve packaging of web application and daemon to ease installation

28 OSU INAM Features --- Clustered View Finding Routes Between Nodes (1,879 nodes, 212 switches, 4,377 network links) Show network topology of large clusters Visualize traffic pattern on different links Quickly identify congested links/links in error state See the history unfold play back historical state of the network SEA Symposium (April 18) 28

29 OSU INAM Features (Cont.) Visualizing a Job (5 Nodes) Job level view Show different network metrics (load, error, etc.) for any live job Play back historical data for completed jobs to identify bottlenecks Node level view - details per process or per node CPU utilization for each rank/node Bytes sent/received for MPI operations (pt-to-pt, collective, RMA) Network metrics (e.g. XmitDiscard, RcvError) per rank/node Estimated Process Level Link Utilization Estimated Link Utilization view Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1 Job level and Process level SEA Symposium (April 18) 29

30 SEA Symposium (April 18) 3 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

31 MVAPICH2-X for Hybrid MPI + PGAS Applications Current Model Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI Possible deadlock if both runtimes are not progressed Consumes more network resources Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF Available since 2012 (starting with MVAPICH2-X 1.9) SEA Symposium (April 18) 31

32 Application Level Performance with Graph500 and Sort [chart: Graph500 execution time for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) at 4K, 8K, and 16K processes] Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design 8,192 processes - 2.4X improvement over MPI-CSR - 7.6X improvement over MPI-Simple 16,384 processes - 1.5X improvement over MPI-CSR - 13X improvement over MPI-Simple [chart: Sort execution time for MPI and Hybrid with input data/process counts of 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K] Performance of Hybrid (MPI+OpenSHMEM) Sort Application 4,096 processes, 4 TB Input Size - MPI 2408 sec; 0.16 TB/min - Hybrid 1172 sec; 0.36 TB/min - 51% improvement over MPI-design J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC 13), June 2013 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012 J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013. SEA Symposium (April 18) 32

33 SEA Symposium (April 18) 33 Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation Time (s) Heat-2D Kernel using Jacobi method KNL (Default) KNL (AVX-512) KNL (AVX-512+MCDRAM) Broadwell Time (s) Heat Image Kernel KNL (Default) KNL (AVX-512) KNL (AVX-512+MCDRAM) Broadwell No. of processes No. of processes On heat diffusion based kernels AVX-512 vectorization showed better performance MCDRAM showed significant benefits on Heat-Image kernel for all process counts. Combined with AVX-512 vectorization, it showed up to 4X improved performance

34 SEA Symposium (April 18) 34 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs CUDA-aware MPI GPUDirect RDMA (GDR) Support Support for Streaming Applications Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

35 GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU Standard MPI interfaces used for unified data movement Takes advantage of Unified Virtual Addressing (>= CUDA 4.) Overlaps data movement from GPU with RDMA transfers At Sender: MPI_Send(s_devbuf, size, ); inside MVAPICH2 At Receiver: MPI_Recv(r_devbuf, size, ); High Performance and High Productivity SEA Symposium (April 18) 35
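
A hedged, self-contained version of the pattern sketched above (device pointers passed straight to MPI_Send/MPI_Recv; assumes a CUDA-aware MPI such as MVAPICH2-GPU/MVAPICH2-GDR, and buffer sizes are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;               /* 1M floats */
        float *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(float));

        /* Device pointers go directly into MPI calls; no explicit
           cudaMemcpy staging through host memory is needed. */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);       /* s_devbuf */
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                /* r_devbuf */

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }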

36 CUDA-Aware MPI: MVAPICH2-GDR Releases Support for MPI communication from NVIDIA GPU device memory High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) High performance intra-node point-to-point communication for multi-gpu adapters/node (GPU-GPU, GPU-Host and Host-GPU) Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node Optimized and tuned collectives for GPU device buffers MPI datatype support for point-to-point and collective communication from GPU device buffers Unified memory SEA Symposium (April 18) 36
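
For the datatype-support item above, a hedged sketch of a non-contiguous (strided) transfer from GPU memory using an MPI derived datatype; it assumes a CUDA-aware MPI with datatype support, and the matrix dimensions are illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int rows = 1024, cols = 1024;
        float *d_matrix;
        cudaMalloc((void **)&d_matrix, rows * cols * sizeof(float));

        /* One column of a row-major matrix: 'rows' blocks of 1 float, stride 'cols' */
        MPI_Datatype column;
        MPI_Type_vector(rows, 1, cols, MPI_FLOAT, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_matrix);
        MPI_Finalize();
        return 0;
    }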

37 SEA Symposium (April 18) 37 Optimized MVAPICH2-GDR Design [charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-(NO-GDR) and MV2-GDR-2.3a; 1.88 us small-message latency, with roughly 10x latency, 9x bandwidth, and 11X bi-bandwidth improvements] Platform: MVAPICH2-GDR-2.3a, Intel Haswell (3.1 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

38 Application-Level Evaluation (HOOMD-blue) [charts: average time steps per second (TPS) vs. number of processes for 64K-particle and 256K-particle runs, MV2 vs. MV2+GDR; up to 2X improvement] Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB) HoomdBlue Version 1.0.5 GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT= SEA Symposium (April 18) 38

39 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland Normalized Execution Time Wilkes GPU Cluster Default Callback-based Event-based Number of GPUs CSCS GPU cluster Default Callback-based Event-based Normalized Execution Time Number of GPUs 2X improvement on 32 GPUs nodes 3% improvement on 96 GPU nodes (8 GPUs/node) Cosmo model: /tasks/operational/meteoswiss/ On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS 16 SEA Symposium (April 18) 39

40 Streaming Applications Streaming applications on HPC systems 1. Communication (MPI) Broadcast-type operations 2. Computation (CUDA) Multiple GPU nodes as workers HPC resources for real-time analytics Data Source Real-time streaming Sender Data streaming-like broadcast operations Worker CPU GPU Worker CPU GPU Worker CPU GPU Worker CPU GPU Worker CPU GPU GPU GPU GPU GPU GPU SEA Symposium (April 18) 4

41 Hardware Multicast-based Broadcast For GPU-resident data, using GPUDirect RDMA (GDR) InfiniBand Hardware Multicast (IB-MCAST) Overhead IB UD limit GDR limit A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, in HiPC 214, Dec 214. CPU GPU Source Header Data IB HCA IB Switch 1. IB Gather + GDR Read 2. IB Hardware Multicast 3. IB Scatter + GDR Write Destination 1 IB HCA IB HCA Header CPU GPU Data Destination N Header CPU GPU Data SEA Symposium (April 18) 41

42 Streaming CSCS (88 GPUs) MCAST-GDR-OPT MCAST-GDR MCAST-GDR-OPT MCAST-GDR 6 12 Latency (μs) % Latency (μs) % K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M Message Size (Bytes) Message Size (Bytes) IB-MCAST + GDR + Topology-aware IPC-based schemes Up to 58% and 79% reduction for small and large messages C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct , 216. SEA Symposium (April 18) 42

43 SEA Symposium (April 18) 43 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

44 Intra-node Point-to-Point Performance on OpenPower Bandwidth (MB/s) Latency (us) Intra-Socket Small Message Latency.3us MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 2K Intra-Socket Bandwidth MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M Intra-Socket Large Message Latency Intra-Socket Bi-directional Bandwidth Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA SEA Symposium (April 18) 44 Latency (us) Bandwidth (MB/s) MVAPICH2-2.3rc1 SpectrumMPI OpenMPI-3.. 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M

45 Inter-node Point-to-Point Performance on OpenPower Bandwidth (MB/s) Latency (us) Small Message Latency MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 2K Bandwidth MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M Large Message Latency Bi-directional Bandwidth Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA SEA Symposium (April 18) 45 Latency (us) Bandwidth (MB/s) MVAPICH2-2.3rc1 SpectrumMPI OpenMPI-3.. 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M MVAPICH2-2.3rc1 SpectrumMPI OpenMPI K 32K 256K 2M

46 MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal) [charts: intra-node and inter-node latency (small and large messages) and intra-/inter-node bandwidth vs. message size, for INTRA-SOCKET (NVLINK) and INTER-SOCKET paths] Intra-node Latency: 13.8 us (without GPUDirectRDMA) Intra-node Bandwidth: 33.2 GB/sec (NVLINK) Inter-node Latency: 23 us (without GPUDirectRDMA) Inter-node Bandwidth: 6 GB/sec (FDR) Available in MVAPICH2-GDR 2.3a Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and 4X-FDR InfiniBand Inter-connect SEA Symposium (April 18) 46

47 (Nodes=1, PPN=2) (Nodes=1, PPN=2) Scalable Host-based Collectives with CMA on OpenPOWER (Intra-node Reduce & AlltoAll) Reduce Latency (us) Latency (us) MVAPICH2-GDR-Next SpectrumMPI OpenMPI K 2K 4K MVAPICH2-GDR-Next SpectrumMPI OpenMPI X 3.2X 3.3X 5.2X Alltoall K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Up to 5X and 3x performance improvement by MVAPICH2 for small and large messages respectively SEA Symposium (April 18) 47 Latency (us) Latency (us) MVAPICH2-GDR-Next SpectrumMPI OpenMPI-3.. 8K 16K 32K 64K 128K 256K 512K 1M MVAPICH2-GDR-Next SpectrumMPI OpenMPI X 3.2X 1.3X 1.2X (Nodes=1, PPN=2) (Nodes=1, PPN=2)

48 Optimized All-Reduce with XPMEM on OpenPOWER (Nodes=1, PPN=2) Latency (us) (Nodes=2, PPN=2) Latency (us) MVAPICH2-GDR-Next Latency (us) 2 MVAPICH2-GDR-Next SpectrumMPI-1.1. SpectrumMPI-1.1. OpenMPI OpenMPI-3.. 2X 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M MVAPICH2-GDR-Next 3.3X 4 MVAPICH2-GDR-Next SpectrumMPI SpectrumMPI-1.1. OpenMPI OpenMPI-3.. 3X 1 2X 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M Message Size Message Size Optimized MPI All-Reduce Design in MVAPICH2 Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node 4X Latency (us) 34% 48% 2X Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch SEA Symposium (April 18) 48

49 Intra-node Point-to-point Performance on ARMv8 Small Message Latency Large Message Latency 3 6 MVAPICH2 MVAPICH2 Latency (us) Bandwidth (MB/s) microsec (4 bytes) K 2K 4K Bandwidth MVAPICH2 Latency (us) Bidirectional Bandwidth Available since MVAPICH2 2.3a 8K 16K 32K 64K 128K 256K 512K 1M 2M Bi-directional Bandwidth MVAPICH2 Platform: ARMv8 (aarch64) MIPS processor with 96 cores dual-socket CPU. Each socket contains 48 cores. SEA Symposium (April 18) 49

50 SEA Symposium (April 18) 5 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale Scalability for million to billion processors Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, ) Integrated Support for GPGPUs Optimized MVAPICH2 for OpenPower (with/ NVLink) and ARM Application Scalability and Best Practices

51 Performance of SPEC MPI 2007 Benchmarks (KNL + Omni-Path) [chart: execution time (s) of milc, leslie3d, pop2, lammps, wrf2, tera_tf, and lu for Intel MPI 18.0.0 and MVAPICH2 2.3rc1; 448 processes on 7 KNL nodes of TACC Stampede2 (64 ppn)] MVAPICH2 outperforms Intel MPI by up to 10% SEA Symposium (April 18) 51

52 Performance of SPEC MPI 2007 Benchmarks (Skylake + Omni-Path) [chart: execution time (s) of milc, leslie3d, pop2, lammps, wrf2, GaP, tera_tf, and lu for Intel MPI and MVAPICH2 2.3rc1; 480 processes on 10 Skylake nodes of TACC Stampede2 (48 ppn)] MVAPICH2 outperforms Intel MPI by up to 38% SEA Symposium (April 18) 52

53 SEA Symposium (April 18) 53 Compilation of Best Practices MPI runtime has many parameters Tuning a set of parameters can help you to extract higher performance Compiled a list of such contributions through the MVAPICH Website Initial list of applications Amber HoomDBlue HPCG Lulesh MILC Neuron SMG2000 Soliciting additional contributions, send your results to mvapich-help at cse.ohio-state.edu. We will link these results with credits to you.

54 SEA Symposium (April 18) 54 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

55 SEA Symposium (April 18) 55 How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications? Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies? What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)? Can RDMA-enabled high-performance interconnects benefit Big Data processing? Can HPC Clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications? How much performance benefits can be achieved through enhanced designs? How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters? Bring HPC and Big Data processing into a convergent trajectory!

56 SEA Symposium (April 18) 56 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Benchmarks RDMA? Upper level Changes? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization (SR-IOV) I/O and File Systems QoS & Fault Tolerance Performance Tuning Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)

57 The High-Performance Big Data (HiBD) Project RDMA for Apache Spark RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions RDMA for Apache HBase RDMA for Memcached (RDMA-Memcached) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) OSU HiBD-Benchmarks (OHB) HDFS, Memcached, HBase, and Spark Micro-benchmarks Available for InfiniBand and RoCE Also run on Ethernet Available for x86 and OpenPOWER Users Base: 275 organizations from 34 countries More than 25,550 downloads from the project site SEA Symposium (April 18) 57

58 HiBD Release Timeline and Downloads [chart: cumulative number of downloads over time, annotated with releases including RDMA-Hadoop 1.x 0.9.0/0.9.8/0.9.9, RDMA-Memcached 0.9.1 & OHB 0.7.1, RDMA-Hadoop 2.x 0.9.1/0.9.6/0.9.7, RDMA-Memcached 0.9.4, RDMA-Spark 0.9.1, RDMA-HBase 0.9.1, RDMA-Hadoop 2.x 1.0.0, RDMA-Spark 0.9.4, RDMA-Hadoop 2.x 1.3.0, RDMA-Memcached 0.9.6 & OHB 0.9.3, and RDMA-Spark 0.9.5] SEA Symposium (April 18) 58

59 Different Modes of RDMA for Apache Hadoop 2.x HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package. HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible. HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster. HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD. MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks. Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre). SEA Symposium (April 18) 59

60 Design Overview of Spark with RDMA Apache Spark Benchmarks/Applications/Libraries/Frameworks Spark Core Shuffle Manager (Sort, Hash, Tungsten-Sort) Block Transfer Service (Netty, NIO, RDMA-Plugin) Netty Server, NIO Server, RDMA Server Netty Client, NIO Client, RDMA Client Java Socket Interface Java Native Interface (JNI) Native RDMA-based Comm. Engine 1/10/40/100 GigE, IPoIB Network RDMA Capable Networks (IB, iWARP, RoCE, ...) Design Features RDMA based shuffle plugin SEDA-based architecture Dynamic connection management and sharing Non-blocking data transfer Off-JVM-heap buffer management InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData 16, Dec. 2016 SEA Symposium (April 18) 60

61 SEA Symposium (April 18) 61 Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

63 SEA Symposium (April 18) 63 Performance Numbers of RDMA for Apache Hadoop 2.x RandomWriter & TeraGen in OSU-RI2 (EDR) Execution Time (s) IPoIB (EDR) Reduced by 3x IPoIB (EDR) Reduced by 4x 7 OSU-IB (EDR) 6 OSU-IB (EDR) Data Size (GB) Data Size (GB) RandomWriter TeraGen Cluster with 8 Nodes with a total of 64 maps Execution Time (s) RandomWriter 3x improvement over IPoIB for 8-16 GB file size TeraGen 4x improvement over IPoIB for 8-24 GB file size

64 SEA Symposium (April 18) 64 Performance Evaluation of RDMA-Spark on SDSC Comet HiBench PageRank Time (sec) IPoIB 37% RDMA Huge BigData Gigantic Data Size (GB) 32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R) RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node. 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps) 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps) Time (sec) IPoIB 43% RDMA Huge BigData Gigantic Data Size (GB)

65 Performance Evaluation on SDSC Comet: Astronomy Application Kira Toolkit 1: Distributed astronomy image processing toolkit implemented using Apache Spark. Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files. Compare RDMA Spark performance with the standard Apache implementation using IPoIB. 1. Z. Zhang, K. Barbary, F. A. Nothaft, E.R. Sparks, M.J. Franklin, D.A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. CoRR, vol: abs/ , Aug 2015. [chart: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores, RDMA Spark vs. Apache Spark (IPoIB)] M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE 16, July 2016 SEA Symposium (April 18) 65

66 SEA Symposium (April 18) 66 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

67 Deep Learning: New Challenges for MPI Runtimes Deep Learning frameworks are a different game altogether Unusually large message sizes (order of megabytes) Most communication based on GPU buffers Existing State-of-the-art cudnn, cublas, NCCL --> scale-up performance NCCL2, CUDA-Aware MPI --> scale-out performance For small and medium message sizes only! Proposed: Can we co-design the MPI runtime (MVAPICH2- GDR) and the DL framework (Caffe) to achieve both? Scale-up Performance NCCL cudnn cublas NCCL2 grpc Proposed Co- Designs MPI Efficient Overlap of Computation and Communication Efficient Large-Message Communication (Reductions) Hadoop What application co-designs are needed to exploit communication-runtime co-designs? Scale-out Performance A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17) SEA Symposium (April 18) 67

68 MVAPICH2: Allreduce Comparison with Baidu and OpenMPI 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0 [charts: Allreduce latency vs. message size for MVAPICH2, BAIDU, and OPENMPI; ~3X better for small messages, ~10X and ~4X better in the 512K-4M range; for the largest messages, OpenMPI is ~5X slower than Baidu while MV2 is ~2X better than Baidu] *Available with MVAPICH2-GDR 2.3a SEA Symposium (April 18) 68
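
The deep-learning use case behind this comparison is gradient averaging across data-parallel workers; with a CUDA-aware MPI the whole exchange can be a single Allreduce on the device buffer. A hedged C sketch (buffer size is illustrative, and the final division by the number of ranks is left to a GPU kernel):

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 64 * 1024 * 1024 / sizeof(float);   /* e.g. a 64 MB gradient buffer */
        float *d_grad;
        cudaMalloc((void **)&d_grad, n * sizeof(float));
        cudaMemset(d_grad, 0, n * sizeof(float));

        /* Sum gradients across all ranks, in place, directly on GPU memory.
           Dividing by 'size' to get the average would be done by a small CUDA
           kernel or a cuBLAS scale on d_grad (omitted here). */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("allreduced %d floats across %d ranks\n", n, size);

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }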

69 MVAPICH2-GDR vs. NCCL2 Broadcast Operation Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs [charts: latency vs. message size for 1 GPU/node and 2 GPUs/node; ~10X and ~4X better than NCCL2] *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand Inter-connect SEA Symposium (April 18) 69

70 MVAPICH2-GDR vs. NCCL2 Reduce Operation Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs [charts: latency vs. message size for small/medium and large ranges; ~2.5X and ~5X better than NCCL2] *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU per node, and EDR InfiniBand Inter-connect SEA Symposium (April 18) 70

71 MVAPICH2-GDR vs. NCCL2 Allreduce Operation Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllReduce (NCCL2) on 16 GPUs [charts: latency vs. message size for small/medium and large ranges; ~3X and ~1.2X better than NCCL2] *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU per node, and EDR InfiniBand Inter-connect SEA Symposium (April 18) 71

72 OSU-Caffe: Scalable Deep Learning Caffe: A flexible and layered Deep Learning framework. Benefits and Weaknesses Multi-GPU Training within a single node Performance degradation for GPUs across different sockets Limited Scale-out OSU-Caffe: MPI-based Parallel Training Enable Scale-up (within a node) and Scale-out (across multi-GPU nodes) Scale-out on 64 GPUs for training CIFAR-10 network on CIFAR-10 dataset Scale-out on 128 GPUs for training GoogLeNet network on ImageNet dataset OSU-Caffe is publicly available [chart: GoogLeNet (ImageNet) training time (seconds) on up to 128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); multi-node runs are an invalid use case for stock Caffe] Support on OPENPOWER will be available soon SEA Symposium (April 18) 72

73 Performance Benefits for RDMA-gRPC with Micro-Benchmark Latency (us) Default grpc OSU RDMA grpc Latency (us) Default grpc OSU RDMA grpc Latency (us) Default grpc OSU RDMA grpc K 8K payload (Bytes) 16K 32K 64K 128K 256K 512K Payload (Bytes) 1 1M 2M 4M 8M Payload (Bytes) RDMA-gRPC RPC Latency grpc-rdma Latency on SDSC-Comet-FDR Up to 2.7x performance speedup over IPoIB for Latency for small messages Up to 2.8x performance speedup over IPoIB for Latency for medium messages Up to 2.5x performance speedup over IPoIB for Latency for large messages R. Biswas, X. Lu, and D. K. Panda, Accelerating grpc and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, Under Review. SEA Symposium (April 18) 73

74 SEA Symposium (April 18) 74 Performance Benefits for RDMA-gRPC with TensorFlow Communication Mimic Benchmark Default grpc OSU RDMA grpc Default grpc Latency (us) 18 OSU RDMA grpc 9 Calls/ Second M 4M 8M 2M 4M 8M Payload (Bytes) Payload (Bytes) TensorFlow communication Pattern mimic benchmark TensorFlow communication pattern mimic on SDSC-Comet-FDR Single process spawns both grpc server and grpc client Up to 3.6x performance speedup over IPoIB for Latency Up to 3.7x throughput improvement over IPoIB

75 High-Performance Deep Learning over Big Data (DLoBD) Stacks Benefits of Deep Learning over Big Data (DLoBD) Easily integrate deep learning components into Big Data processing workflow Easily access the stored data in Big Data systems No need to set up new dedicated deep learning clusters; Reuse existing big data analytics clusters Challenges Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs? What are the performance characteristics of representative DLoBD stacks on RDMA networks? Characterization on DLoBD Stacks CaffeOnSpark, TensorFlowOnSpark, and BigDL IPoIB vs. RDMA; In-band communication vs. Out-of-band communication; CPU vs. GPU; etc. Performance, accuracy, scalability, and resource utilization RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB based scheme, while maintaining similar accuracy [chart: epoch time (secs) and accuracy (%) vs. epoch number for IPoIB and RDMA; 2.6X] X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017. SEA Symposium (April 18) 75

76 SEA Symposium (April 18) 76 HPC, Big Data, Deep Learning, and Cloud Traditional HPC Message Passing Interface (MPI), including MPI + OpenMP Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) Exploiting Accelerators Big Data/Enterprise/Commercial Computing Spark and Hadoop (HDFS, HBase, MapReduce) Memcached is also used for Web 2.0 Deep Learning Caffe, CNTK, TensorFlow, and many more Cloud for HPC and BigData Virtualization with SR-IOV and Containers

77 Can HPC and Virtualization be Combined? Virtualization has many benefits Fault-tolerance Job migration Compaction Have not been very popular in HPC due to overhead associated with Virtualization New SR-IOV (Single Root IO Virtualization) support available with Mellanox InfiniBand adapters changes the field Enhanced MVAPICH2 support for SR-IOV MVAPICH2-Virt 2.2 supports: OpenStack, Docker, and Singularity J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14 J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC 14 J. Zhang, X.Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid 15 SEA Symposium (April 18) 77

78 Application-Level Performance on Chameleon [charts: Graph500 execution time (ms) for problem sizes (Scale, Edgefactor) of (22,2), (24,1), (24,16), (24,2), (26,1), and (26,16), and SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, each for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; 32 VMs, 6 Core/VM] Compared to Native, 2-5% overhead for Graph500 with 128 Procs Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs SEA Symposium (April 18) 78

79 Application-Level Performance on Singularity with MVAPICH2 [charts: NPB Class D execution time (s) for CG, EP, FT, IS, LU, and MG with 512 processes across 32 nodes, and Graph500 BFS execution time (ms) for problem sizes (Scale, Edgefactor) of (22,16), (22,20), (24,16), (24,20), (26,16), and (26,20), each for Singularity vs. Native] Less than 7% and 6% overhead for NPB and Graph500, respectively J. Zhang, X.Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC 17, Best Student Paper Award SEA Symposium (April 18) 79

80 SEA Symposium (April 18) 80 MVAPICH2-GDR on Container with Negligible Overhead [charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size, Docker vs. Native] Platform: MVAPICH2-GDR-2.3a, Intel Haswell (3.1 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

81 SEA Symposium (April 18) 81 Concluding Remarks Next generation HPC systems need to be designed with a holistic view of Big Data, Deep Learning and Cloud Presented some of the approaches and results along these directions Enable Big Data, Deep Learning and Cloud community to take advantage of modern HPC technologies Many other open issues need to be solved Next generation HPC professionals need to be trained along these directions

82 Additional Presentation and Tutorials Talk: Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries 4/4/18 at 9:00 am at Overlapping Communication with Computation Symposium Tutorial: How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries 4/5/18, 9:00 am-12:00 noon Tutorial: High Performance Distributed Deep Learning: A Beginner's Guide 4/5/18, 1:00 pm-4:00 pm SEA Symposium (April 18) 82

83 SEA Symposium (April 18) 83 Funding Acknowledgments Funding Support by Equipment Support by

84 Personnel Acknowledgments Current Students (Graduate) A. Awan (Ph.D.) R. Biswas (M.S.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) Current Students (Undergraduate) J. Hashmi (Ph.D.) N. Sarkauskas (B.S.) H. Javed (Ph.D.) P. Kousha (Ph.D.) D. Shankar (Ph.D.) H. Shi (Ph.D.) J. Zhang (Ph.D.) W. Huang (Ph.D.) J. Liu (Ph.D.) W. Jiang (M.S.) M. Luo (Ph.D.) J. Jose (Ph.D.) A. Mamidala (Ph.D.) S. Kini (M.S.) G. Marsh (M.S.) M. Koop (Ph.D.) V. Meshram (M.S.) K. Kulkarni (M.S.) A. Moody (M.S.) R. Kumar (M.S.) S. Naravula (Ph.D.) S. Krishnamoorthy (M.S.) R. Noronha (Ph.D.) K. Kandalla (Ph.D.) X. Ouyang (Ph.D.) M. Li (Ph.D.) S. Pai (M.S.) P. Lai (M.S.) S. Potluri (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela K. Manian R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Current Research Specialist J. Smith M. Arnold Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee X. Besseron H.-W. Jin J. Lin M. Luo E. Mancini S. Marcarelli J. Vienne H. Wang SEA Symposium (April 18) 84

85 SEA Symposium (April 18) 85 Thank You! Network-Based Computing Laboratory The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project


More information

High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters

Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters Presentation at IXPUG Meeting, July 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X IXPUG 7 PresentaNon J. Hashmi, M. Li, H. Subramoni and DK Panda The Ohio State University E-mail: {hashmi.29,li.292,subramoni.,panda.2}@osu.edu

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning

Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning Keynote Talk at Swiss Conference (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Talk at OSC theater (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning GPU Technology Conference GTC 217 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Talk at Intel HPC Developer Conference 2017 (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Keynote Talk at BPOE-4, in conjunction with ASPLOS 14 (March 2014) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters High Performance File System and I/O Middleware Design for Big Data on HPC Clusters by Nusrat Sharmin Islam Advisor: Dhabaleswar K. (DK) Panda Department of Computer Science and Engineering The Ohio State

More information

High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters

High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters Presentation at Mellanox Theatre (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Jie Zhang Dr. Dhabaleswar K. Panda (Advisor) Department of Computer Science & Engineering The

More information

Scaling with PGAS Languages

Scaling with PGAS Languages Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

MVAPICH2 MPI Libraries to Exploit Latest Networking and Accelerator Technologies

MVAPICH2 MPI Libraries to Exploit Latest Networking and Accelerator Technologies MVAPICH2 MPI Libraries to Exploit Latest Networking and Accelerator Technologies Talk at NRL booth (SC 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks Presented By : Esthela Gallardo Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan Perkins, Hari Subramoni,

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K. In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department

More information

Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries

Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries Talk at Overlapping Communication with Computation Symposium (April 18) by Dhabaleswar K. (DK) Panda The Ohio

More information

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities

Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Networking Technologies and Middleware for Next-Generation Clusters and Data Centers: Challenges and Opportunities Keynote Talk at HPSR 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

HPC Meets Big Data: Accelerating Hadoop, Spark, and Memcached with HPC Technologies

HPC Meets Big Data: Accelerating Hadoop, Spark, and Memcached with HPC Technologies HPC Meets Big Data: Accelerating Hadoop, Spark, and Memcached with HPC Technologies Talk at OpenFabrics Alliance Workshop (OFAW 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries A Tutorial at MVAPICH User Group 217 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Talk at HPCAC-Switzerland (Mar 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda Speaker: Sourav Chakraborty

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Talk at HPCAC-Switzerland (April 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

How to Tune and Extract Higher Performance with MVAPICH2 Libraries?

How to Tune and Extract Higher Performance with MVAPICH2 Libraries? How to Tune and Extract Higher Performance with MVAPICH2 Libraries? An XSEDE ECSS webinar by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Job Startup at Exascale:

Job Startup at Exascale: Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core

More information

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI INAM 2 : InfiniBand Network Analysis and Monitoring with MPI H. Subramoni, A. A. Mathews, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda Department of Computer Science and Engineering The

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer

More information

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) MeeKng by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating and Benchmarking Big Data Processing on Modern Clusters

Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15)

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Big Data Analytics with the OSU HiBD Stack at SDSC. Mahidhar Tatineni OSU Booth Talk, SC18, Dallas

Big Data Analytics with the OSU HiBD Stack at SDSC. Mahidhar Tatineni OSU Booth Talk, SC18, Dallas Big Data Analytics with the OSU HiBD Stack at SDSC Mahidhar Tatineni OSU Booth Talk, SC18, Dallas Comet HPC for the long tail of science iphone panorama photograph of 1 of 2 server rows Comet: System Characteristics

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators

More information

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase 2 IEEE 8th International Conference on Cloud Computing Technology and Science Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase Xiaoyi Lu, Dipti Shankar, Shashank Gugnani,

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015)

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015) High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani

Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani Bridging Neuroscience and HPC with MPI-LiFE Shashank Gugnani The Ohio State University E-mail: gugnani.2@osu.edu http://web.cse.ohio-state.edu/~gugnani/ Network Based Computing Laboratory SC 17 2 Neuroscience:

More information

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA- enabled Clusters

DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA- enabled Clusters DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA- enabled Clusters Talk at OFA Workshop 2018 by Xiaoyi Lu The Ohio State University E- mail: luxi@cse.ohio- state.edu h=p://www.cse.ohio-

More information

Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks

Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks Sunrise or Sunset: Exploring the Design Space of Big Data So7ware Stacks Panel PresentaAon at HPBDC 17 by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

Unifying UPC and MPI Runtimes: Experience with MVAPICH

Unifying UPC and MPI Runtimes: Experience with MVAPICH Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

An Assessment of MPI 3.x (Part I)

An Assessment of MPI 3.x (Part I) An Assessment of MPI 3.x (Part I) High Performance MPI Support for Clouds with IB and SR-IOV (Part II) Talk at HP-CAST 25 (Nov 2015) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing Talk at Storage Developer Conference SNIA 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 214 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based

More information

Acceleration for Big Data, Hadoop and Memcached

Acceleration for Big Data, Hadoop and Memcached Acceleration for Big Data, Hadoop and Memcached A Presentation at HPC Advisory Council Workshop, Lugano 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information