Software Libraries and Middleware for Exascale Systems


1 Software Libraries and Middleware for Exascale Systems Talk at HPC Advisory Council China Workshop (2012) by Dhabaleswar K. (DK) Panda The Ohio State University

2 Welcome 欢迎 2

3 Current and Next Generation HPC Systems and Applications: growth of High Performance Computing (HPC) is driven by growth in processor performance (chip density doubles every 18 months) and growth in commodity networking (increasing speed and features at reducing cost).

4 Two Major Categories of Applications. Scientific Computing: Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model; many discussions are moving towards Partitioned Global Address Space (PGAS) models such as UPC, OpenSHMEM, and CAF; hybrid programming: MPI + PGAS (OpenSHMEM, UPC). Big Data/Enterprise/Commercial Computing: focuses on large data and data analysis; the Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum; Memcached is also widely used for Web 2.0.

5 High-End Computing (HEC): PetaFlop to ExaFlop. 1 PFlops was reached in 2008, and 1 EFlops is expected around 2019; an ExaFlop system is expected by the end of the decade!

6 Trends for Commodity Computing Clusters in the Top500 List (chart: number of clusters and percentage of clusters over time).

7 Large-scale InfiniBand Installations: 209 IB clusters (41.8%) in the June 2012 Top500 list. Installations in the Top 40 (13 systems): 147,456 cores (SuperMUC) in Germany (4th); 77,184 cores (Curie thin nodes) at France/CEA (9th); 120,640 cores (Nebulae) at China/NSCS (10th); 125,980 cores (Pleiades) at NASA/Ames (11th); 70,560 cores (Helios) at Japan/IFERC (12th); 73,278 cores (Tsubame 2.0) at Japan/GSIC (14th); 138,368 cores (Tera-100) at France/CEA (17th); 122,400 cores (Roadrunner) at LANL (19th); 78,660 cores (Lomonosov) in Russia (22nd); 137,200 cores (Sunway BlueLight) in China (26th); 46,208 cores (Zin) at LLNL (27th); 29,440 cores (Mole-8.5) in China (37th); 42,440 cores (Red Sky) at Sandia (39th). More are getting installed!

8 Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges. The stack spans application kernels/applications, middleware, and programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.). Co-design opportunities and challenges exist across the layers: the communication library or runtime for the programming models (point-to-point communication, two-sided and one-sided; collective communication; synchronization and locks; I/O and file systems; fault tolerance), networking technologies (InfiniBand, 10/40GigE, Gemini, BlueGene), multi-/many-core architectures, and accelerators (NVIDIA and MIC).

9 Two Major Categories of Applications. Scientific Computing: Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model; many discussions are moving towards Partitioned Global Address Space (PGAS) models such as UPC, OpenSHMEM, and CAF; hybrid programming: MPI + PGAS (OpenSHMEM, UPC). Big Data/Enterprise/Commercial Computing: focuses on large data and data analysis; the Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum; Memcached is also widely used for Web 2.0.

10 Designing (MPI+X) at Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); extremely low memory footprint; hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...); balancing intra-node and inter-node communication for next-generation multi-core nodes; multiple end-points per node; support for efficient multi-threading; support for GPGPUs and accelerators; scalable collective communication (offload, non-blocking, topology-aware, power-aware); fault-tolerance/resiliency; QoS support for communication and I/O.
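For concreteness, here is a minimal MPI + OpenMP hybrid skeleton of the kind referred to on this slide. It is an illustrative sketch, not MVAPICH2-specific code; the thread-support level requested and the work done inside the parallel region are placeholder choices.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* Request a thread level compatible with OpenMP regions */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel   /* intra-node parallelism via OpenMP threads */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        MPI_Barrier(MPI_COMM_WORLD);   /* inter-node coordination via MPI */
        MPI_Finalize();
        return 0;
    }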

11 MVAPICH2/MVAPICH2-X Software: high-performance open-source MPI library for InfiniBand, 10Gig/iWARP and RDMA over Converged Enhanced Ethernet (RoCE). MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002; MVAPICH2-X (MPI + PGAS), available since 2012. Used by more than 1,975 organizations (HPC centers, industry and universities) in 70 countries, with more than 135,000 downloads from the OSU site directly. Empowering many Top500 clusters: the 11th-ranked 125,980-core cluster (Pleiades) at NASA, the 14th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology, the 40th-ranked 62,976-core cluster (Ranger) at TACC, and many others. Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE). Partner in the upcoming U.S. NSF-TACC Stampede (9 PFlop) system.

12 Challenges being Addressed by MVAPICH2 for Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); extremely low memory footprint; hybrid programming (MPI + OpenSHMEM, MPI + UPC, ...); support for GPGPUs and co-processors (Intel MIC); scalable collective communication (multicore-aware and hardware-multicast-based, offload and non-blocking, topology-aware, power-aware); QoS support for communication and I/O.

13 One-way Latency: MPI over IB (charts: small-message and large-message latency vs. message size for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR, and MVAPICH2-Mellanox-ConnectIB-DualFDR). Platforms: DDR and QDR on quad-core (Westmere) Intel with PCIe Gen2 and an IB switch; FDR on octa-core (Sandy Bridge) Intel with PCIe Gen3 and an IB switch; ConnectIB dual FDR on octa-core (Sandy Bridge) Intel with PCIe Gen3 and an IB switch.
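Latency results like these are typically measured with a ping-pong micro-benchmark; a minimal sketch of that measurement loop follows. Message size, iteration count and the absence of warm-up iterations are simplifying assumptions, and this is not the actual OSU benchmark source.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define ITERS    1000
    #define MSG_SIZE 8            /* bytes; a placeholder small-message size */

    int main(int argc, char **argv) {
        int rank;
        char buf[MSG_SIZE];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, MSG_SIZE);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {      /* ping */
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {  /* pong */
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)   /* half the round-trip time is the one-way latency */
            printf("latency: %.2f us\n", (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS));
        MPI_Finalize();
        return 0;
    }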

14 Bandwidth: MPI over IB (charts: unidirectional and bidirectional bandwidth vs. message size for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR, and MVAPICH2-Mellanox-ConnectIB-DualFDR, on the same Westmere/Sandy Bridge platforms as the latency results).

15 MVAPICH2 Two-Sided Intra-Node Performance (shared memory and kernel-based zero-copy support with LiMIC). Latency: about 0.19 us intra-socket and 0.45 us inter-socket. Bandwidth: about 10,000 MB/s unidirectional and 18,500 MB/s bidirectional (intra-socket and inter-socket, LiMIC and shared-memory designs). Latest MVAPICH2 1.9a on Intel Westmere.

16 Hybrid Transport Design (UD/RC/XRC): both UD and RC/XRC have benefits. Evaluate the characteristics of each and use two sets of transports in the same application to get the best of both. Available in MVAPICH (since 1.1) as a separate interface, and since MVAPICH2 1.7 as an integrated interface. M. Koop, T. Jones and D. K. Panda, MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand, IPDPS '08.

17 Challenges being Addressed by MVAPICH2 for Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); extremely low memory footprint; hybrid programming (MPI + OpenSHMEM, MPI + UPC, ...); support for GPGPUs and co-processors (Intel MIC); scalable collective communication (multicore-aware and hardware-multicast-based, offload and non-blocking, topology-aware, power-aware); QoS support for communication and I/O.

18 Different Approaches for Supporting PGAS Models: library-based (Global Arrays, OpenSHMEM); compiler-based (Unified Parallel C (UPC), Co-Array Fortran (CAF)); HPCS language-based (X10, Chapel, Fortress).

19 Scalable OpenSHMEM and Hybrid (MPI and OpenSHMEM) Designs. Existing design: based on the OpenSHMEM reference implementation, which provides a design over GASNet and does not take advantage of all OFED features. Goal: design scalable, high-performance OpenSHMEM directly over OFED. Designing a hybrid MPI + OpenSHMEM model: in the current model, separate runtimes for OpenSHMEM and MPI can deadlock if both runtimes are not progressed and consume more network resources. Our approach uses a single runtime for MPI and OpenSHMEM: a hybrid (OpenSHMEM+MPI) application issues OpenSHMEM calls through the OpenSHMEM interface and MPI calls through the MPI interface, both mapped onto the OSU design over the InfiniBand network. Hybrid MPI+OpenSHMEM is available in MVAPICH2-X 1.9a.
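To illustrate the hybrid model, here is a minimal MPI + OpenSHMEM sketch that mixes an OpenSHMEM one-sided put with an MPI collective in one program. It assumes an OpenSHMEM 1.0-era API (start_pes, _my_pe, _num_pes) and a runtime that allows both models to be used together, as MVAPICH2-X is described to do; the exact initialization requirements depend on the library.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    static int src, dst;            /* global/static variables are symmetric */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        start_pes(0);               /* OpenSHMEM initialization (pre-1.2 API) */
        int me = _my_pe(), npes = _num_pes();

        src = me;
        shmem_int_put(&dst, &src, 1, (me + 1) % npes);   /* one-sided put to the next PE */
        shmem_barrier_all();

        int sum = 0;                /* MPI collective over the same set of processes */
        MPI_Allreduce(&dst, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (me == 0) printf("sum of received ranks = %d\n", sum);

        MPI_Finalize();
        return 0;
    }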

20 Micro-Benchmark Performance (OpenSHMEM): charts comparing OpenSHMEM-GASNet and OpenSHMEM-OSU for atomic operations (fadd, finc, cswap, swap, add, inc), Collect (256 bytes), Broadcast (256 bytes) and Reduce (256 bytes) across process counts; the OSU design improves these operations by roughly 16-41%.

21 Performance of Hybrid (OpenSHMEM+MPI) Applications: 2D Heat Transfer Modeling and Graph500 (charts of execution time and network resource usage for Hybrid-GASNet vs. Hybrid-OSU). Improved performance for hybrid applications: 34% improvement for 2D Heat Transfer Modeling with 512 processes and 45% improvement for Graph500 with 256 processes; our single-runtime approach consumes 27% less network resources. J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

22 Scalable UPC and Hybrid (MPI and UPC) Designs: based on Berkeley UPC. Goal: design scalable, high-performance UPC over OFED. Designing a hybrid MPI + UPC model: in the current model, separate runtimes for UPC and MPI can deadlock if both runtimes are not progressed and consume more network resources. Our approach uses a single runtime for MPI and UPC: a hybrid (MPI+UPC) application issues UPC calls through the GASNet interface and MPI calls through the MPI interface, both mapped onto the OSU design over the InfiniBand network. Hybrid MPI+UPC will be available in the next release of MVAPICH2-X (1.9).
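A corresponding minimal hybrid MPI + UPC sketch is shown here. It assumes a Berkeley-UPC-style toolchain whose runtime can interoperate with MPI, as in the unified-runtime design above, and it guards MPI initialization because the UPC runtime may already have initialized MPI; whether that is the case depends on the toolchain.

    /* hybrid.upc: compile with a UPC compiler (e.g., upcc) linked against MPI */
    #include <upc.h>
    #include <mpi.h>
    #include <stdio.h>

    shared int counts[THREADS];     /* one element of the shared array per UPC thread */

    int main(int argc, char **argv) {
        int initialized;
        MPI_Initialized(&initialized);
        if (!initialized) MPI_Init(&argc, &argv);

        counts[MYTHREAD] = MYTHREAD;       /* PGAS-style write into the shared array */
        upc_barrier;

        int local = counts[MYTHREAD], sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* MPI collective */
        if (MYTHREAD == 0) printf("sum = %d\n", sum);

        upc_barrier;
        return 0;
    }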

23 Evaluation using UPC NAS Benchmarks (charts: time vs. number of processes for CG, MG and FT with UPC-GASNet and UPC-OSU). Evaluations use the UPC NAS Benchmarks v2.4 (Class C). UPC-OSU performs equal to or better than UPC-GASNet: 11% improvement for CG (256 processes) and 1% improvement for MG (256 processes).

24 Evaluation of Hybrid MPI+UPC NAS-FT (chart: time for UPC-GASNet, UPC-OSU and Hybrid-OSU across problem/system sizes B-64, C-64, B-128, C-128). The NAS FT UPC all-to-all pattern was modified to use MPI_Alltoall, making it a truly hybrid program. For FT (Class C, 128 processes), the hybrid version gives 34% improvement over UPC-GASNet and 3% improvement over UPC-OSU. Hybrid MPI + UPC support will be available in an upcoming MVAPICH2-X release. J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010.

25 Challenges being Addressed by MVAPICH2 for Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); extremely low memory footprint; hybrid programming (MPI + OpenSHMEM, MPI + UPC, ...); support for GPGPUs and co-processors (Intel MIC); scalable collective communication (multicore-aware and hardware-multicast-based, offload and non-blocking, topology-aware, power-aware); QoS support for communication and I/O.

26 Sample Code - Without MPI Integration: a naive implementation with standard MPI and CUDA stages data through host memory. At the sender: cudaMemcpy(sbuf, sdev, ...); MPI_Send(sbuf, size, ...); At the receiver: MPI_Recv(rbuf, size, ...); cudaMemcpy(rdev, rbuf, ...); (data path: GPU, PCIe, CPU, NIC, switch). High productivity but poor performance.

27 Sample Code - User-Optimized Code: pipelining at the user level with non-blocking MPI and CUDA interfaces. Code at the sender side (and repeated at the receiver side):
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, ..., stream[j]);
    for (j = 0; j < pipeline_len; j++) {
        while (cudaStreamQuery(stream[j]) != cudaSuccess) {
            if (j > 0) MPI_Test(...);   /* progress earlier sends while waiting for the copy */
        }
        MPI_Isend(sbuf + j * blksz, blksz, ...);
    }
    MPI_Waitall(...);
User-level copying may not match the internal MPI design. High performance but poor productivity.

28 Sample Code - MVAPICH2-GPU: standard MPI interfaces are used. Takes advantage of Unified Virtual Addressing (CUDA 4.0 or later) and overlaps data movement from the GPU with RDMA transfers. At the sender: MPI_Send(s_device, size, ...); At the receiver: MPI_Recv(r_device, size, ...); the pipelining is handled inside MVAPICH2. High performance and high productivity.
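Putting the pieces together, a complete minimal program for GPU-to-GPU transfers with a CUDA-aware MPI such as MVAPICH2 1.8+ might look as follows. The buffer size is a placeholder, and MVAPICH2 typically also requires CUDA support to be enabled at run time (e.g., MV2_USE_CUDA=1).

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        const int n = 1 << 20;                 /* number of floats (placeholder) */
        int rank;
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaMemset(d_buf, 0, n * sizeof(float));

        /* Device pointers are passed directly to MPI; staging and pipelining
           are handled by the library, as described on the slide above. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }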

29 MPI-Level Two-Sided Communication (chart: time vs. message size from 32K to 4M bytes for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU). MVAPICH2-GPU shows a substantial improvement over a naive user-level implementation (Memcpy+Send) and a 24% improvement over an advanced user-level implementation (MemcpyAsync+Isend) for 4 MB messages. H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11.

30 Other MPI Operations and Optimizations for GPU Buffers: overlap optimizations for one-sided communication, collectives, and communication with datatypes; optimized designs for multiple GPUs per node using CUDA IPC (in CUDA 4.1) to avoid copies through host memory. H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011. A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011. S. Potluri et al., Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems (ASHES), held in conjunction with IPDPS 2012, May 2012.

31 MVAPICH2 1.8 and 1.9a Releases: support for MPI communication from NVIDIA GPU device memory; high-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU); high-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host and Host-GPU), taking advantage of CUDA IPC (available since CUDA 4.1) for multiple GPU adapters per node; optimized and tuned collectives for GPU device buffers; MPI datatype support for point-to-point and collective communication from GPU device buffers.

32 MVAPICH2 vs. OpenMPI (Device-Device, Inter-node): charts of latency, bandwidth and bi-directional bandwidth vs. message size (up to 1M bytes). MVAPICH2 1.9a and OpenMPI (trunk nightly snapshot of 13 Sep 2012) on Westmere with a ConnectX-2 QDR HCA, an NVIDIA Tesla C2075 GPU and the CUDA Toolkit.

33 Application-Level Evaluation. LBM-CUDA (courtesy Carlos Rosales, TACC): Lattice Boltzmann Method for multiphase flows with large density ratios, one process/GPU per node on 16 nodes; MPI-GPU improves total execution time over MPI by 13.7%, 12.0%, 11.8% and 9.4% across domain sizes (X*Y*Z) from 256x256x256 up to 512x512x512. AWP-ODC (courtesy Yifeng Cui, SDSC): a seismic modeling code and Gordon Bell Prize finalist at SC 2010 (x256x512 data grid per process, 8 nodes); MPI-GPU improves step time by 7.9% with 1 GPU/process per node and 11.1% with 2 GPUs/processes per node. Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node. Work on the upcoming GPU-Direct RDMA is in progress.

34 Many Integrated Core (MIC) Architecture: Intel's MIC architecture is geared for HPC and is a critical part of their solution for exascale computing (Knights Ferry (KNF) and Knights Corner (KNC)). Many low-power x86 processor cores with hardware threads and wider vector units; applications and libraries can run with minor or no modifications. MVAPICH2 is used for communication within a KNF coprocessor in three configurations: out-of-the-box (Default), shared-memory designs tuned for KNF (Optimized V1), and designs enhanced using a lower-level API (Optimized V2).

35 Point-to-Point Communication on KNF (charts: normalized latency and bandwidth vs. message size for Default, Optimized V1 and Optimized V2). The optimized designs reduce latency by up to 17% and improve bandwidth by up to 9.5X. Platform: Knights Ferry 1.2 GHz card with 32 cores, Alpha 9 MIC software stack with an additional pre-alpha patch.

36 Collective Communication on KNF (charts: normalized latency of Broadcast and Scatter vs. message size for Default, Optimized V1 and Optimized V2). The optimized designs improve Broadcast by up to 26% and Scatter by up to 72%. S. Potluri, K. Tomko, D. Bureddy and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium, April 2012 (Best Student Paper Award).

37 Challenges being Addressed by MVAPICH2 for Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); extremely low memory footprint; hybrid programming (MPI + OpenSHMEM, MPI + UPC, ...); support for GPGPUs and co-processors (Intel MIC); scalable collective communication (multicore-aware and hardware-multicast-based, offload and non-blocking, topology-aware, power-aware); QoS support for communication and I/O.

38 Shared-Memory-Aware Collectives. MVAPICH2 Reduce/Allreduce with 4K cores on TACC Ranger (AMD Barcelona, SDR IB): charts of MPI_Reduce and MPI_Allreduce latency (4096 cores) for the original and shared-memory designs, controlled by MV2_USE_SHMEM_REDUCE=0/1 and MV2_USE_SHMEM_ALLREDUCE=0/1. MVAPICH2 Barrier with 1K Intel Westmere cores and QDR IB: point-to-point vs. shared-memory barrier, controlled by MV2_USE_SHMEM_BARRIER=0/1.
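For reference, these shared-memory designs are toggled through the run-time parameters named on the slide; a typical mpirun_rsh invocation might look like the following (the host file name, process count and executable are placeholders).

    mpirun_rsh -np 4096 -hostfile hosts \
        MV2_USE_SHMEM_REDUCE=1 MV2_USE_SHMEM_ALLREDUCE=1 MV2_USE_SHMEM_BARRIER=1 \
        ./osu_allreduce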

39 Hardware-Multicast-Aware MPI_Bcast (charts: Default vs. Multicast latency for small and large messages at 1,024 cores, and for 16-byte and 64 KB MPI_Bcast across process counts). Available in MVAPICH2 1.9a.

40 Non-contiguous Allocation of Jobs (diagram: spine switches and line-card switches connecting racks of nodes; legend: busy core, idle core, new job). Supercomputing systems are organized as racks of nodes interconnected using complex network architectures, and job schedulers are used to allocate compute nodes to the various jobs.

41 Non-contiguous Allocation of Jobs (continued). Supercomputing systems are organized as racks of nodes interconnected using complex network architectures; job schedulers are used to allocate compute nodes to the various jobs, so individual processes belonging to one job can get scattered, since the scheduler's primary responsibility is to keep system throughput high.

42 Impact of Network-Topology-Aware Algorithms on Broadcast Performance (charts: latency vs. message size from 128K to 1M bytes for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT, and normalized latency vs. job size). The impact of topology-aware schemes becomes more pronounced as the message size increases and the system size scales, with up to 14% improvement in performance at scale. H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11.

43 Network-Topology-Aware Placement of Processes. Can we design a highly scalable network topology detection service for IB? How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service? What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications? Results (overall performance and split-up of physical communication for MILC on Ranger, default vs. topology-aware for a 2048-core run, plus performance for varying system sizes): network topology discovery time is reduced from O(N^2) to O(N) in the number of hosts, MILC execution improves by 15% at 2048 cores, and Hypre execution improves by 15% at 1024 cores. H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12 (Best Paper and Best Student Paper Finalist).

44 Collective Offload Support in ConnectX-2 (Recv followed by Multi-Send). The sender creates a task list consisting only of send and wait WQEs: one send WQE is created for each registered receiver and appended to the rear of a singly linked task list, and a wait WQE is added to make the ConnectX-2 HCA wait for the ACK packet from the receiver (diagram: application task list feeding the MQ, send queue, send CQ, receive queue and receive CQ inside the InfiniBand HCA, over the physical link).

45 Application Benefits with Non-Blocking Collectives based on ConnectX-2 Collective Offload. Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes); modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes, across HPL problem sizes as a percentage of total memory); modified Pre-Conjugate Gradient solver with Offload-Allreduce does up to 21.8% better than the default version. K. Kandalla et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011. K. Kandalla et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011. K. Kandalla et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12.
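The usage pattern these offloaded non-blocking collectives target is sketched below with an MPI-3-style non-blocking all-to-all; the compute function is a placeholder for application work that is independent of the exchanged data.

    #include <mpi.h>
    #include <stdlib.h>

    static void compute_independent_work(void) {
        /* placeholder for application computation overlapped with communication */
    }

    int main(int argc, char **argv) {
        int size;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = i;

        MPI_Ialltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT,
                      MPI_COMM_WORLD, &req);     /* communication starts (offloadable) */
        compute_independent_work();              /* overlapped computation */
        MPI_Wait(&req, MPI_STATUS_IGNORE);       /* completion */

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }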

46 Non-blocking Neighborhood Collectives to Minimize Communication Overheads of Irregular Graph Algorithms. It is critical to reduce the communication overheads of irregular graph algorithms to cater to the graph-theoretic demands of emerging big-data applications. What are the challenges involved in re-designing irregular graph algorithms, such as 2D BFS, to achieve fine-grained computation/communication overlap? Can we design network-offload-based non-blocking neighborhood collectives in a high-performance, scalable manner? (Charts: CombBLAS (2D BFS) application run time on Hyperion (Scale=29) and communication overheads in CombBLAS with 1936 processes on Hyperion.) Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS workshop (Cluster 2012).

47 Power and Energy Savings with Power-Aware Collectives (charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes, and CPMD application performance and power savings). K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters, ICPP '10.

48 Challenges being Addressed by MVAPICH2 for Exascale: scalability for million to billion processors; support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided); extremely low memory footprint; hybrid programming (MPI + OpenSHMEM, MPI + UPC, ...); support for GPGPUs and co-processors (Intel MIC); scalable collective communication (offload, non-blocking, topology-aware, power-aware); QoS support for communication and I/O.

49 Minimizing Network Contention with QoS-Aware Data-Staging. Asynchronous I/O introduces contention for network resources. How should data be orchestrated in a data-staging architecture to eliminate such contention? Can the QoS capabilities provided by cutting-edge interconnect technologies be leveraged by parallel file systems to minimize network contention? Results (normalized runtime with MPI message latency and bandwidth under I/O noise): for Anelastic Wave Propagation (64 MPI processes) the runtime overhead is reduced from 17.9% with I/O noise to 8% with the noise isolated, and for the NAS Parallel Benchmark Conjugate Gradient Class D (64 MPI processes) from 32.8% to 9.31%. R. Rajachandrasekar, J. Jaswani, H. Subramoni and D. K. Panda, Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework, IEEE Cluster, Sept. 2012.

50 MVAPICH2 Plans for Exascale: performance and memory scalability toward 500K-1M cores; Dynamically Connected Transport (DCT) service with Connect-IB; hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...); enhanced optimization for GPU support and accelerators, extending the GPGPU support (GPU-Direct RDMA) and adding support for Intel MIC; taking advantage of the collective offload framework, including support for non-blocking collectives (MPI 3.0); RMA support (as in MPI 3.0); extended topology-aware collectives; power-aware collectives; support for the MPI Tools interface (as in MPI 3.0); checkpoint-restart and migration support with in-memory checkpointing.
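As an illustration of the MPI 3.0 RMA support mentioned in these plans, a minimal one-sided sketch with a fence epoch is shown below; the window size and the put pattern are arbitrary choices for the example.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, *win_buf;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process exposes one integer through an RMA window */
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &win_buf, &win);
        *win_buf = -1;

        MPI_Win_fence(0, win);
        if (rank == 0)            /* rank 0 writes its rank into every other window */
            for (int p = 1; p < size; p++)
                MPI_Put(&rank, 1, MPI_INT, p, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        printf("rank %d sees %d\n", rank, *win_buf);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }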

51 Two Major Categories of Applications. Scientific Computing: Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model; many discussions are moving towards Partitioned Global Address Space (PGAS) models such as UPC, OpenSHMEM, and CAF; hybrid programming: MPI + PGAS (OpenSHMEM, UPC). Big Data/Enterprise/Commercial Computing: focuses on large data and data analysis; the Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum; Memcached is also widely used for Web 2.0.

52 Can High-Performance Interconnects Benefit Big Data Processing? The idea is beginning to draw interest from the enterprise domain: Oracle, IBM and Google are working along these directions. Performance in the enterprise domain remains a concern: where do the bottlenecks lie, and can these bottlenecks be alleviated with new designs?

53 Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols? Current design: application over sockets over a 1/10 GigE network. Enhanced designs: application over accelerated sockets (verbs / hardware offload) over 10 GigE or InfiniBand. Our approach: application over a verbs interface with the OSU design over 10 GigE or InfiniBand. Sockets were not designed for high performance: stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop), and zero-copy is not available for non-blocking sockets.

54 Big Data Processing with Hadoop: the open-source implementation of the MapReduce programming model for Big Data analytics. The framework includes MapReduce, HDFS, and HBase; the underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase. The model scales, but there is a high amount of communication during the intermediate phases.

55 Big Data Processing with Hadoop: HDFS. Fault tolerance is achieved by replicating data blocks. NameNode: stores information on data blocks. DataNodes: store blocks and host MapReduce computation. JobTracker: tracks jobs and detects failures. TaskTracker: manages tasks running in different JVMs and reports to the JobTracker. There is a high amount of communication during data replication.

56 Big Data Processing with Hadoop: HBase. An open-source database project based on the Hadoop framework for hosting very large tables. Major components: HBaseMaster, HRegionServer and HBaseClient. HBase and HDFS are deployed in the same cluster to get better data locality.

57 RDMA-based Design for Native HDFS over IB (HDFS-IB): Gain in Communication Time in HDFS (chart: communication time for GigE, IPoIB (QDR) and OSU-IB (QDR) across 1-5 GB file sizes). On a cluster with 32 HDD DataNodes: 30% improvement in communication time over IPoIB (32Gbps) and 87% improvement over 1GigE; similar improvements are obtained with SSD DataNodes. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov. 2012.

58 Evaluations using TestDFSIO (charts: throughput for GigE, IPoIB (QDR) and OSU-IB (QDR) across file sizes, on a cluster with 32 HDD nodes and a cluster with 4 SSD nodes). Cluster with 32 HDD DataNodes: 12% improvement over IPoIB (32Gbps) for a 5 GB file size. Cluster with 4 SSD DataNodes: 1% improvement over IPoIB (32Gbps) for a 5 GB file size.

59 Evaluations using YCSB (Single Region Server, 100% Update) (charts: average Put latency and Put throughput for GigE, IPoIB (QDR) and OSU-IB (QDR) across 120K-480K records of 1 KB each). HBase uses TCP/IP, running over HDFS-IB. HBase Put latency for 360K records: 204 us for the OSU design vs. 252 us for IPoIB (32Gbps). HBase Put throughput for 360K records: 4.42 Kops/sec for the OSU design vs. 3.63 Kops/sec for IPoIB (32Gbps). About 20% improvement in both average latency and throughput.

60 Evaluations using YCSB (32 Region Servers, 100% Update) (charts: average Put latency and throughput for GigE, IPoIB (QDR) and OSU-IB (QDR) across 120K-480K records of 1 KB each). HBase uses TCP/IP, running over HDFS-IB. HBase Put latency for 480K records: 201 us for the OSU design vs. 272 us for IPoIB (32Gbps). HBase Put throughput for 480K records: 4.42 Kops/sec for the OSU design vs. 3.63 Kops/sec for IPoIB (32Gbps). 26% improvement in average latency and 24% improvement in throughput.

61 Hadoop MapReduce (TeraSort) (chart: execution time for GigE, IPoIB (QDR) and OSU-IB (QDR)). Native IB design for MapReduce (OSU-IB). TeraSort evaluation with 12 DataNodes (12 GB RAM and a single 16 GB HDD per node), 8 MapTasks and 4 ReduceTasks per node: 37% improvement over IPoIB (32Gbps) for a 1 GB sort.

62 HBase Put/Get Detailed Analysis (charts: time breakdown into communication, communication preparation, server processing, server serialization, client processing and client serialization for 1GigE, IPoIB, 10GigE and OSU-IB, for 1 KB Put and 1 KB Get). HBase 1 KB Put communication time: 8.9 us, a factor of 6X improvement in communication time over 1GigE; HBase 1 KB Get communication time: 8.9 us, also a factor of 6X improvement over 1GigE. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS '12.

63 HBase Single-Server, Multi-Client Results (charts: latency and throughput vs. number of clients for IPoIB, OSU-IB, 1GigE and 10GigE). HBase Get latency: 14.5 us with 4 clients. HBase Get throughput: 37.1 Kops/sec with 4 clients and 53.4 Kops/sec with 16 clients; 27% improvement in throughput for 16 clients over 1GigE.

64 HBase YCSB Read-Write Workload (charts: read and write latency vs. number of clients for IPoIB, OSU-IB, 1GigE and 10GigE). HBase Get latency (Yahoo! Cloud Serving Benchmark): 2.0 ms for 64 clients and 3.5 ms for 128 clients, a 42% improvement over IPoIB for 128 clients. HBase Put latency: 1.9 ms for 64 clients and 3.5 ms for 128 clients, a 40% improvement over IPoIB for 128 clients. J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12.

65 HDFS and HBase Integration over IB (OSU-IB) (charts: Put latency and throughput across 120K-480K records of 1 KB each for 1GigE, IPoIB (QDR), OSU-IB (HDFS) with IPoIB (HBase), and OSU-IB (HDFS) with OSU-IB (HBase)). YCSB evaluation with 4 RegionServers (100% update): HBase Put latency and throughput for 360K records improve by 37% over IPoIB (32Gbps) and by 18% over the OSU-IB HDFS-only configuration.

66 Memcached Architecture: web frontend servers (Memcached clients) access a distributed caching layer (Memcached servers) sitting in front of the database servers, connected over the Internet and high-performance networks. The caching layer aggregates spare memory from multiple nodes and is general purpose, typically used to cache database queries and results of API calls. The model is scalable, but typical usage is very network intensive. Native IB-verbs-level design and evaluation with the Memcached server and the Memcached client (libmemcached).
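On the client side, the libmemcached API referenced above is used roughly as follows. This is a generic sketch over the standard client library (server name, port, key and value are placeholders), independent of the verbs-level transport work described in these slides.

    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        memcached_return_t rc;
        memcached_st *memc = memcached_create(NULL);
        memcached_server_st *servers =
            memcached_server_list_append(NULL, "cache-server1", 11211, &rc);
        memcached_server_push(memc, servers);

        const char *key = "query:42", *value = "cached-result";
        rc = memcached_set(memc, key, strlen(key), value, strlen(value),
                           (time_t)0, (uint32_t)0);
        if (rc != MEMCACHED_SUCCESS)
            fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

        size_t len; uint32_t flags;
        char *result = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        if (result) { printf("get: %.*s\n", (int)len, result); free(result); }

        memcached_server_list_free(servers);
        memcached_free(memc);
        return 0;
    }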

67 Memcached Get Latency, Small Messages (charts: latency vs. message size up to 2K bytes for SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB on an Intel Clovertown cluster (IB: DDR) and an Intel Westmere cluster (IB: QDR)). Memcached Get latency for 4 bytes: 6.82/7.55 us (RC/UD) on DDR and 4.28/4.86 us on QDR; for 2K bytes: 12.31/12.78 us on DDR and 8.19/8.46 us on QDR. Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster.

68 Memcached Get Transactions Per Second, 4-byte (charts: thousands of transactions per second vs. number of clients for SDP, IPoIB, 1GigE, OSU-RC-IB and OSU-UD-IB). Memcached Get transactions per second for 4 bytes on IB QDR: 1.4M/s (RC) and 1.3M/s (UD) for 8 clients; a significant improvement with native IB QDR compared to SDP and IPoIB.

69 Application-Level Evaluation: Olio Benchmark (charts: time vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB). Olio Benchmark: RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1024 clients; 4X better than IPoIB for 8 clients; the hybrid design achieves performance comparable to the pure RC design.

70 Application-Level Evaluation: Real Application Workloads (charts: time vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB). Real application workload: RC 32 ms, UD 318 ms, Hybrid 314 ms for 1024 clients; 12X better than IPoIB for 8 clients; the hybrid design achieves performance comparable to the pure RC design. J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11. J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters using Hybrid Transport, CCGrid '12.

71 Concluding Remarks: InfiniBand with the RDMA feature is gaining momentum in HPC and Big Data systems, delivering the best performance and seeing growing usage. As the HPC community moves to exascale, new solutions are needed in the MPI stack to support PGAS, GPUs, collectives (topology-aware, power-aware), one-sided communication, QoS, etc. We demonstrated how such solutions can be designed with MVAPICH2 and showed their performance benefits. New solutions are also needed to re-design software libraries for Big Data environments to take advantage of IB and RDMA. Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems.

72 Funding Acknowledgments Funding Support by Equipment Support by 72

73 Personnel Acknowledgments Current Students Past Students J. Chen (Ph.D.) P. Balaji (Ph.D.) N. Islam (Ph.D.) D. Buntinas (Ph.D.) J. Jose (Ph.D.) S. Bhagvat (M.S.) K. Kandalla (Ph.D.) L. Chai (Ph.D.) M. Li (Ph.D.) B. Chandrasekharan (M.S.) M. Luo (Ph.D.) N. Dandapanthula (M.S.) S. Potluri (Ph.D.) V. Dhanraj (M.S.) R. Rajachandrasekhar (Ph.D.) T. Gangadharappa (M.S.) M. Rahman (Ph.D.) K. Gopalakrishnan (M.S.) H. Subramoni (Ph.D.) W. Huang (Ph.D.) A. Venkatesh (Ph.D.) W. Jiang (M.S.) S. Kini (M.S.) M. Koop (Ph.D.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) Current Post-Docs Current Programmers P. Lai (M.S.) Past Post-Docs X. Lu M. Arnold X. Besseron H. Wang D. Bureddy H.-W. Jin J. Perkins E. Mancini S. Marcarelli J. Vienne J. Liu (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) V. Meshram (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) X. Ouyang (Ph.D.) S. Pai (M.S.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Research Scientist S. Sur 73

74 Web Pointers: MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/

75 Thanks, Q&A? 谢谢, 请提问? 75

MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

High Performance MPI Support in MVAPICH2 for InfiniBand Clusters High Performance MPI Support in MVAPICH2 for InfiniBand Clusters A Talk at NVIDIA Booth (SC 11) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand

MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 213 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K. In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department

More information

Acceleration for Big Data, Hadoop and Memcached

Acceleration for Big Data, Hadoop and Memcached Acceleration for Big Data, Hadoop and Memcached A Presentation at HPC Advisory Council Workshop, Lugano 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Intel Nerve Center (SC 7) Presentation Dhabaleswar K (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu Parallel

More information

Scaling with PGAS Languages

Scaling with PGAS Languages Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Keynote Talk at BPOE-4, in conjunction with ASPLOS 14 (March 2014) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached Talk at HPC Advisory Council Lugano Conference (213) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters

Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters Presentation at IXPUG Meeting, July 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication

More information

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based

More information

Programming Models for Exascale Systems

Programming Models for Exascale Systems Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

Unifying UPC and MPI Runtimes: Experience with MVAPICH

Unifying UPC and MPI Runtimes: Experience with MVAPICH Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems

Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems Advanced Topics in InfiniBand and High-speed Ethernet for Designing HEC Systems A Tutorial at SC 13 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects Keynote Talk at ADMS 214 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach Talk at OpenFabrics Workshop (April 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management SigHPC BigData BoF (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) MeeKng by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach Talk at OSC theater (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning 5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:

More information

Scalability and Performance of MVAPICH2 on OakForest-PACS

Scalability and Performance of MVAPICH2 on OakForest-PACS Scalability and Performance of MVAPICH2 on OakForest-PACS Talk at JCAHPC Booth (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based

More information

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning Efficient and Scalable Multi-Source Streaming Broadcast on Clusters for Deep Learning Ching-Hsiang Chu 1, Xiaoyi Lu 1, Ammar A. Awan 1, Hari Subramoni 1, Jahanzeb Hashmi 1, Bracy Elton 2 and Dhabaleswar

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan

More information

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09

Study. Dhabaleswar. K. Panda. The Ohio State University HPIDC '09 RDMA over Ethernet - A Preliminary Study Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar. K. Panda Computer Science & Engineering Department The Ohio State University Introduction Problem Statement

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters

High Performance File System and I/O Middleware Design for Big Data on HPC Clusters High Performance File System and I/O Middleware Design for Big Data on HPC Clusters by Nusrat Sharmin Islam Advisor: Dhabaleswar K. (DK) Panda Department of Computer Science and Engineering The Ohio State

More information

High-Performance Training for Deep Learning and Computer Vision HPC

High-Performance Training for Deep Learning and Computer Vision HPC High-Performance Training for Deep Learning and Computer Vision HPC Panel at CVPR-ECV 18 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15)

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks

GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks GPU- Aware Design, Implementation, and Evaluation of Non- blocking Collective Benchmarks Presented By : Esthela Gallardo Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan Perkins, Hari Subramoni,

More information

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X

Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X IXPUG 7 PresentaNon J. Hashmi, M. Li, H. Subramoni and DK Panda The Ohio State University E-mail: {hashmi.29,li.292,subramoni.,panda.2}@osu.edu

More information

High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters Designing High Performance Heterogeneous Broadcast for Streaming Applications on Clusters 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms. Sayantan Sur, Matt Koop, Lei Chai and Dhabaleswar K. Panda, Network-Based Computing Lab, The Ohio State University.

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda, {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

High-Performance and Scalable Designs of Programming Models for Exascale Systems. Talk at HPC Advisory Council Stanford Conference (2015) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications. GPU Technology Conference (GTC 2016) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu

Designing Shared Address Space MPI libraries in the Many-core Era. Jahanzeb Hashmi (hashmi.29@osu.edu), Network-Based Computing Laboratory (NBCL), The Ohio State University.

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters. Ching-Hsiang Chu (chu.368@osu.edu), Department of Computer Science and Engineering, The Ohio State University. OSU Booth.

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3DFFT. Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky, et al.
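As a brief aside on the non-blocking all-to-all pattern referenced in the entry above, the following is a minimal, hedged sketch in C of overlapping independent computation with MPI_Ialltoall. The buffer sizes and the do_independent_compute helper are illustrative placeholders, not taken from the cited study; how much overlap is actually achieved depends on how the MPI library progresses the collective (for example, via network offload), which is exactly what such studies evaluate.

/* Hedged sketch: overlap useful work with a non-blocking all-to-all. */
#include <stdlib.h>
#include <mpi.h>

/* Hypothetical placeholder for work that does not depend on the exchange. */
static void do_independent_compute(double *local, int n)
{
    for (int i = 0; i < n; i++)
        local[i] *= 2.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 1024;   /* elements exchanged with each rank; illustrative */
    double *sendbuf = malloc((size_t)nprocs * chunk * sizeof(double));
    double *recvbuf = malloc((size_t)nprocs * chunk * sizeof(double));
    double *local   = malloc((size_t)chunk * sizeof(double));
    for (int i = 0; i < nprocs * chunk; i++) sendbuf[i] = (double)i;
    for (int i = 0; i < chunk; i++)          local[i]   = (double)i;

    /* Start the all-to-all without blocking... */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, chunk, MPI_DOUBLE,
                  recvbuf, chunk, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* ...and perform computation that does not need the exchanged data. */
    do_independent_compute(local, chunk);

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the exchange is complete only here */

    free(sendbuf); free(recvbuf); free(local);
    MPI_Finalize();
    return 0;
}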

Accelerating and Benchmarking Big Data Processing on Modern Clusters. Open RG Big Data Webinar (Sept 15) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Accelerating and Benchmarking Big Data Processing on Modern Clusters. Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

MPI: Overview, Performance Optimizations and Tuning. A tutorial at the HPC Advisory Council Conference, Lugano, 2012, by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Using MVAPICH2-X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming. MVAPICH2 User Group (MUG) Meeting, by Jithin Jose, The Ohio State University. E-mail: jose@cse.ohio-state.edu
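To illustrate what hybrid MPI + PGAS programming of the kind referenced above looks like at the source level, here is a minimal, hedged sketch in C that mixes MPI collectives with OpenSHMEM one-sided operations in a single program. It assumes a unified runtime (such as MVAPICH2-X) and a compiler/launcher setup that supports both models; the calls themselves are generic OpenSHMEM 1.4 API (older specifications use shmalloc/shmem_long_inc instead of shmem_malloc/shmem_long_atomic_inc), not anything MVAPICH2-X-specific.

/* Hedged sketch: MPI and OpenSHMEM in one program, assuming a unified runtime. */
#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* MPI side of the hybrid program */
    shmem_init();             /* OpenSHMEM side; shares the same job launch */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric-heap counter, remotely addressable by other PEs. */
    long *counter = (long *) shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* PGAS-style one-sided update: every PE atomically increments PE 0. */
    shmem_long_atomic_inc(counter, 0);
    shmem_barrier_all();

    /* MPI-style collective in the same program. */
    int sum = 0;
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("shmem counter on PE 0 = %ld, MPI rank sum = %d\n", *counter, sum);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}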

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters. Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda. Speaker: Sourav Chakraborty.

High-Performance MPI Library with SR-IOV and SLURM for Virtualized InfiniBand Clusters. Talk at OpenFabrics Workshop (April 2016) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu

Communication Frameworks for HPC and Big Data. Talk at HPC Advisory Council Spain Conference (2015) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart. Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang and Dhabaleswar K. Panda, Department of Computer Science and Engineering, The Ohio State University.

Interconnect Your Future. Gilad Shainer, 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014.

Enhancing Checkpoint Performance with Staging IO & SSD. Xiangyong Ouyang, Sonya Marcarelli and Dhabaleswar K. Panda, Department of Computer Science & Engineering, The Ohio State University.

Job Startup at Exascale: Challenges and Solutions. Hari Subramoni, The Ohio State University. http://nowlab.cse.ohio-state.edu/

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters. Matthew Koop (1), Miao Luo and D. K. Panda; matthew.koop@nasa.gov, {luom, panda}@cse.ohio-state.edu; (1) NASA Center for Computational Sciences.

High Performance Migration Framework for MPI Applications on HPC Cloud. Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda ({zhanjie, luxi, panda}@cse.ohio-state.edu), Computer Science & Engineering Department, The Ohio State University.

NVIDIA GPUDirect Technology. NVIDIA Corporation, 2011. GPUDirect targets eliminating CPU overhead: accelerated communication with network and storage devices, peer-to-peer communication between GPUs, and direct access to CUDA memory.
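For context on how GPUDirect-style support typically surfaces to applications through a CUDA-aware MPI library (for example, MVAPICH2-GDR), the hedged sketch below passes cudaMalloc'd device pointers directly to MPI_Send/MPI_Recv. Whether the transfer actually uses GPUDirect RDMA or peer-to-peer copies underneath depends on the library build and the hardware; the message size is arbitrary and error handling is omitted.

/* Hedged sketch: device buffers handed directly to MPI, assuming a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;            /* illustrative message size (floats) */
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, n * sizeof(float));
        /* The device pointer goes straight to MPI_Send; the CUDA-aware runtime
         * moves the data (ideally via GPUDirect) without a manual cudaMemcpy
         * staging step in application code. */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}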

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase. Xiaoyi Lu, Dipti Shankar, Shashank Gugnani, et al. IEEE 8th International Conference on Cloud Computing Technology and Science.

High-Performance MPI and Deep Learning on OpenPOWER Platform. Talk at OpenPOWER SUMMIT (Mar 18) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Accelerating MPI Message Matching and Reduction Collectives for Multi-/Many-core Architectures. Mohammadreza Bayatpour, Hari Subramoni and D. K. Panda, Department of Computer Science and Engineering, The Ohio State University.

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters. Jie Zhang (Advisor: Dr. Dhabaleswar K. Panda), Department of Computer Science & Engineering, The Ohio State University.

An Assessment of MPI 3.x (Part I) and High Performance MPI Support for Clouds with IB and SR-IOV (Part II). Talk at HP-CAST 25 (Nov 2015) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies. Talk at Intel HPC Developer Conference 2017 (SC 17) by Dhabaleswar K. (DK) Panda, The Ohio State University.

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand. Matthew Koop (1,2), Terry Jones (2) and D. K. Panda (1); {koop, panda}@cse.ohio-state.edu, trj@llnl.gov; (1) Network-Based Computing Lab, The Ohio State University.

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand. Qi Gao, Weikuan Yu, Wei Huang and Dhabaleswar K. Panda, Network-Based Computing Laboratory, Department of Computer Science & Engineering, The Ohio State University.

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics. Mingzhe Li, Sreeram Potluri, Khaled Hamidouche, Jithin Jose and Dhabaleswar K. Panda, Network-Based Computing Laboratory, The Ohio State University.
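The entry above concerns passive-target synchronization in MPI-3 RMA. As a generic illustration (standard MPI-3 API only, not the paper's InfiniBand-atomics design), the hedged sketch below has every rank atomically increment a counter exposed by rank 0 using MPI_Win_lock, MPI_Accumulate and MPI_Win_flush, with no matching MPI call required on the target process.

/* Hedged sketch: MPI-3 passive-target (truly one-sided) synchronization. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    long counter = 0;
    MPI_Win win;
    MPI_Win_create(&counter, sizeof(long), sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Every rank atomically adds 1 to the counter exposed by rank 0. */
    long one = 1;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Accumulate(&one, 1, MPI_LONG, 0, 0, 1, MPI_LONG, MPI_SUM, win);
    MPI_Win_flush(0, win);        /* completes the operation at the target */
    MPI_Win_unlock(0, win);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);   /* consistent local read */
        printf("counter = %ld (expected %d)\n", counter, nprocs);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}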

Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters. G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala and D. K. Panda.

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services. P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda, Network Based Computing Laboratory, The Ohio State University.

Multi-Threaded UPC Runtime for GPU to GPU Communication over InfiniBand. Miao Luo, Hao Wang and D. K. Panda, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University.

Mellanox Technologies: Maximize Cluster Performance and Productivity. Gilad Shainer (shainer@mellanox.com), October 2007.

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters. Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, et al. 2015 IEEE International Conference on Cluster Computing.

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications. K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda, Network Based Computing Laboratory, The Ohio State University.

The Road to ExaScale: Advances in High-Performance Interconnect Infrastructure. September 2011. diego@mellanox.com

A Portable InfiniBand Module for MPICH2/Nemesis: Design and Evaluation. Miao Luo, Ping Lai, Sreeram Potluri, Emilio P. Mancini, Hari Subramoni, Krishna Kandalla and Dhabaleswar K. Panda, Department of Computer Science and Engineering, The Ohio State University.

Introduction to Xeon Phi. Bill Barth, January 11, 2013. What is it? A co-processor on a PCI Express card running a stripped-down Linux operating system; a dense, simplified processor with many power-hungry operations removed.

Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach. Talk at OpenFabrics Workshop (March 2017) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu

The Future of Interconnect Technology. Michael Kagan, CTO, Mellanox Technologies. HPC Advisory Council, Stanford, 2014.

Solutions for Scalable HPC. Scot Schultz, Director, HPC/Technical Computing. HPC Advisory Council Stanford Conference, Feb 2014.

Interconnect Your Future: Smart Interconnect for Next Generation HPC Platforms. Gilad Shainer, 4th Annual MVAPICH User Group (MUG) Meeting, August 2016.

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning. Talk at Mellanox Theater (SC 16) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Designing HPC, Big Data, and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities. Dhabaleswar K. (DK) Panda, Professor and University Distinguished Scholar of Computer Science and Engineering, The Ohio State University.

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand. Presented at GTC 15 by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Designing High-Performance Non-Volatile Memory-aware RDMA Communication Protocols for Big Data Processing. Talk at Storage Developer Conference (SNIA) 2018 by Xiaoyi Lu, The Ohio State University. E-mail: luxi@cse.ohio-state.edu

Optimizing & Tuning Techniques for Running MVAPICH2 over IB. Talk at 2nd Annual IBUG (InfiniBand User's Group) Workshop (2014) by Hari Subramoni, The Ohio State University. E-mail: subramon@cse.ohio-state.edu
