The Future of Supercomputer Software Libraries


1 The Future of Supercomputer Software Libraries. Talk at the HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda, The Ohio State University

2 High-End Computing (HEC): PetaFlop to ExaFlop. Systems of 2-3 PFlops are deployed today, larger multi-PFlop systems are expected by 2016, and an ExaFlop system is expected around 2019!

3 Trends for Commodity Computing Clusters in the Top 500 List

Nov. 1996: 0/500 (0.0%)      Jun. 2004: 291/500 (58.2%)
Jun. 1997: 1/500 (0.2%)      Nov. 2004: 294/500 (58.8%)
Nov. 1997: 1/500 (0.2%)      Jun. 2005: 304/500 (60.8%)
Jun. 1998: 1/500 (0.2%)      Nov. 2005: 360/500 (72.0%)
Nov. 1998: 2/500 (0.4%)      Jun. 2006: 364/500 (72.8%)
Jun. 1999: 6/500 (1.2%)      Nov. 2006: 361/500 (72.2%)
Nov. 1999: 7/500 (1.4%)      Jun. 2007: 373/500 (74.6%)
Jun. 2000: 11/500 (2.2%)     Nov. 2007: 406/500 (81.2%)
Nov. 2000: 28/500 (5.6%)     Jun. 2008: 400/500 (80.0%)
Jun. 2001: 33/500 (6.6%)     Nov. 2008: 410/500 (82.0%)
Nov. 2001: 43/500 (8.6%)     Jun. 2009: 410/500 (82.0%)
Jun. 2002: 80/500 (16.0%)    Nov. 2009: 417/500 (83.4%)
Nov. 2002: 93/500 (18.6%)    Jun. 2010: 424/500 (84.8%)
Jun. 2003: 149/500 (29.8%)   Nov. 2010: 415/500 (83.0%)
Nov. 2003: 208/500 (41.6%)   Jun. 2011: 411/500 (82.2%)
                             Nov. 2011: 410/500 (82.0%)

4 Large-scale InfiniBand Installations
209 IB clusters (41.8%) in the November 2011 Top500 list. Installations in the Top 30 (13 systems):
120,640 cores (Nebulae) in China (4th)
73,278 cores (Tsubame 2.0) in Japan (5th)
111,104 cores (Pleiades) at NASA Ames (7th)
138,368 cores (Tera-100) in France (9th)
122,400 cores (RoadRunner) at LANL (10th)
137,200 cores (Sunway Blue Light) in China (14th)
46,208 cores (Zin) at LLNL (15th)
33,072 cores (Lomonosov) in Russia (18th)
29,440 cores (Mole-8.5) in China (21st)
42,440 cores (Red Sky) at Sandia (24th)
62,976 cores (Ranger) at TACC (25th)
20,480 cores (Bull Benchmarks) in France (27th)
20,480 cores (Helios) in Japan (28th)
More are getting installed!

5 Two Major Categories of Applications Scientific Computing Message Passing Interface (MPI) is the Dominant Programming Model Many discussions towards Partitioned Global Address Space (PGAS) Enterprise/Commercial Computing Focuses on large data and data analysis Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum Memcached is also used 5

6 Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges
Applications/Libraries
Programming Models: Message Passing Interface (MPI), Sockets, PGAS (UPC, Global Arrays), Hadoop and MapReduce
Library or Runtime for Programming Models: Point-to-point Communication, Collective Communication, Synchronization & Locks, I/O & File Systems, QoS, Fault Tolerance
Networking Technologies (InfiniBand, 1/10/40GigE, RNICs & Intelligent NICs)
Commodity Computing System Architectures (single, dual, quad, ...); Multi/Many-core architectures and Accelerators

7 Designing (MPI+X) at Exascale
Scalability for million to billion processors
Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
Extremely small memory footprint
Hybrid programming (MPI + OpenMP, MPI + UPC, ...)
Balancing intra-node and inter-node communication for next-generation multi-core nodes
Multiple end-points per node
Support for efficient multi-threading
Support for GPGPUs and Accelerators
Scalable collective communication: Offload, Non-blocking, Topology-aware, Power-aware
Fault-tolerance/resiliency
QoS support for communication and I/O

8 MVAPICH/MVAPICH2 Software
High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2), available since 2002
Used by more than 1,840 organizations (HPC centers, industry and universities) in 65 countries
More than 95,000 downloads from the OSU site directly
Empowering many TOP500 clusters:
5th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
7th-ranked 111,104-core cluster (Pleiades) at NASA
25th-ranked 62,976-core cluster (Ranger) at TACC
and many others
Available with the software stacks of many InfiniBand, High-speed Ethernet and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distros (RedHat and SuSE)
Partner in the upcoming U.S. NSF-TACC Stampede (10-15 PFlop) system

9 Approaches being used in MVAPICH2 for Exascale
High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport
Kernel-based zero-copy intra-node communication
Collective Communication: Multi-core Aware, Topology-aware and Power-aware; Exploiting Collective Offload
Support for GPGPUs and Accelerators
PGAS Support (Hybrid MPI + UPC)

10 One-way Latency: MPI over IB [Figure: small-message and large-message latency for MVAPICH with Qlogic-DDR, Qlogic-QDR, ConnectX-DDR and ConnectX-QDR HCAs, across message sizes; 2.4 GHz quad-core Intel (Westmere) nodes with an IB switch]

11 Bandwidth: MPI over IB [Figure: unidirectional and bidirectional bandwidth for MVAPICH with Qlogic-DDR, Qlogic-QDR, ConnectX-DDR and ConnectX-QDR HCAs, across message sizes; 2.4 GHz quad-core Intel (Westmere) nodes with an IB switch]

12 IB Transport Services

Service Type                 Connection Oriented   Acknowledged   Transport
Reliable Connection (RC)     Yes                   Yes            IBA
Unreliable Connection (UC)   Yes                   No             IBA
Reliable Datagram (RD)       No                    Yes            IBA
Unreliable Datagram (UD)     No                    No             IBA
Raw Datagram                 No                    No             Raw

eXtended Reliable Connection (XRC) was added later for multi-core platforms

13 UD vs. RC: Performance and Scalability (SMG2000 Application) [Figure: normalized SMG2000 time and memory usage (MB/process) for RC (MVAPICH 0.9.8) vs. the UD design; the UD design keeps connection, buffer and structure memory nearly flat as the process count grows] M. Koop, S. Sur, Q. Gao and D. K. Panda, High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters, ICS '07
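The scalability argument behind the UD design can be seen with a toy memory model (illustrative Python, not MVAPICH code; the per-QP and buffer costs below are invented for illustration): RC needs a dedicated queue pair per peer, so its footprint grows linearly with job size, while UD uses one endpoint and a shared buffer pool regardless of peer count.

```python
# Illustrative per-process connection-memory model (invented constants).

def rc_memory_mb(num_peers, qp_cost_kb=88.0, buf_per_qp_kb=64.0):
    """RC: one queue pair plus dedicated buffers per peer -> grows with N."""
    return num_peers * (qp_cost_kb + buf_per_qp_kb) / 1024.0

def ud_memory_mb(num_peers, endpoint_cost_kb=88.0, shared_pool_kb=4096.0):
    """UD: one endpoint plus a shared buffer pool -> flat in N."""
    return (endpoint_cost_kb + shared_pool_kb) / 1024.0

# RC grows linearly with peer count; UD stays flat.
assert rc_memory_mb(2048) == 2 * rc_memory_mb(1024)
assert ud_memory_mb(1024) == ud_memory_mb(65536)
```

This is the trade the hybrid UD-RC/XRC design exploits: keep RC only where its latency matters and fall back to UD at scale.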

14 Hybrid Transport Design (UD-RC/XRC)
Both UD and RC/XRC have benefits
User-transparent: automatically adjusts between UD and RC/XRC
Runtime options for advanced users to extract the highest performance and scalability
Available in MVAPICH 1.1 and 1.2 for some time (as a separate interface)
Available since MVAPICH2 1.7 in an integrated manner with the Gen2 interface
M. Koop, T. Jones and D. K. Panda, MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand, IPDPS '08

15 MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy (LiMIC) Support) [Figure: intra-socket and inter-socket latency, bandwidth and bi-directional bandwidth for LiMIC vs. shared-memory channels; latest MVAPICH2 1.8a2 on Intel Westmere] Intra-socket latency: 0.19 microseconds for 4 bytes; inter-socket: 0.45 microseconds for 4 bytes; bi-directional bandwidth up to 18,500 MB/s

16 MPI Design Challenges
High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport
Kernel-based zero-copy intra-node communication
Collective Communication: Multi-core Aware, Topology-aware and Power-aware; Exploiting Collective Offload
Support for GPGPUs and Accelerators
PGAS Support (Hybrid MPI + UPC)

17 Shared-memory Aware Collectives (4K cores on TACC Ranger with MVAPICH2) [Figure: MPI_Reduce and MPI_Allreduce latency at 4,096 cores, original vs. shared-memory-aware designs, across message sizes]

18 Non-contiguous Allocation of Jobs [Figure: a new job placed across racks connected by line-card and spine switches; busy cores, idle cores and the new job] Supercomputing systems are organized as racks of nodes interconnected using complex network architectures. Job schedulers are used to allocate compute nodes to the various jobs.

19 Non-contiguous Allocation of Jobs [Figure: the same racks with the new job's processes scattered across them] Supercomputing systems are organized as racks of nodes interconnected using complex network architectures. Job schedulers are used to allocate compute nodes to the various jobs. Individual processes belonging to one job can get scattered, because the primary responsibility of the scheduler is to keep system throughput high.
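Why jobs end up non-contiguous can be shown with a toy greedy allocator (illustrative Python; rack sizes and the allocation policy are invented, real schedulers are far more sophisticated): a throughput-oriented scheduler simply grabs idle cores wherever they are, so one job spans several racks.

```python
# Toy scheduler: take idle cores rack by rack until the job is satisfied.

def allocate(job_cores, free_cores_per_rack):
    """Greedy allocation; returns {rack_index: cores_taken}."""
    placement = {}
    need = job_cores
    for rack, free in enumerate(free_cores_per_rack):
        take = min(free, need)
        if take:
            placement[rack] = take
            need -= take
        if need == 0:
            break
    return placement

# Earlier jobs left fragmented idle cores, so a 48-core job spans 3 racks.
print(allocate(48, [16, 8, 32, 64]))  # {0: 16, 1: 8, 2: 24}
```

Every inter-rack boundary the placement crosses becomes extra switch hops for the job's collectives, which motivates the topology-aware algorithms on the next slides.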

20 Topology-Aware Collectives [Figure: latency of default (binomial) vs. topology-aware Gather and Scatter algorithms across message sizes, showing up to 54% improvement for Gather and 22% for Scatter; estimated small-message latency of default and topology-aware algorithms as the system grows from 2 to 32 racks] K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather, CAC '10
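The benefit of topology awareness can be sketched by counting rack crossings (illustrative Python, not the paper's algorithm; the rank-to-rack layout is invented): a binomial tree that ignores placement sends many messages across racks, while a leader-based scheme crosses racks only once per rack.

```python
# Count inter-rack messages in a binomial-tree gather vs. a rack-leader scheme.

def binomial_inter_rack_msgs(ranks_to_racks):
    """Binomial gather toward rank 0: rank r sends to r minus its lowest
    set bit; count the sends whose endpoints sit in different racks."""
    crossings = 0
    for r in range(1, len(ranks_to_racks)):
        parent = r - (r & -r)   # clear the lowest set bit
        if ranks_to_racks[r] != ranks_to_racks[parent]:
            crossings += 1
    return crossings

def topo_aware_inter_rack_msgs(num_racks):
    """Topology-aware: gather inside each rack, then one leader per rack
    forwards to the root's rack."""
    return num_racks - 1

racks = [r % 4 for r in range(64)]      # 64 ranks placed round-robin on 4 racks
print(binomial_inter_rack_msgs(racks))  # 48 crossings when topology is ignored
print(topo_aware_inter_rack_msgs(4))    # 3 crossings with rack leaders
```

The same intuition drives the DFT/BFT broadcast schemes on the next slide.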

21 Impact of Network-Topology-Aware Algorithms on Broadcast Performance [Figure: broadcast latency for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT schemes at 128K-1M message sizes, and normalized latency across job sizes] The impact of topology-aware schemes becomes more pronounced as message size increases and system size scales; up to 14% improvement in performance at scale. H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters, Cluster '11

22 Power-Aware Collectives [Figure: latency and power for MPI_Alltoall with 64 processes on 8 nodes (Default, DVFS, Proposed), and estimated energy consumption during an MPI_Alltoall with 128K message size as the system size grows from 8K to 256K processes; up to 32% estimated energy savings with about 5% performance overhead] K. Kandalla, E. Mancini, S. Sur and D. K. Panda, Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters, ICPP '10
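The DVFS idea behind power-aware collectives reduces to simple arithmetic (illustrative Python; the wattages and phase durations are invented, not measurements from the paper): during network-dominated phases of a collective the CPU can be dropped to a low-power state, so energy falls while latency barely moves.

```python
# Back-of-envelope energy model for a collective with DVFS.

def energy_kj(phases, p_high_w, p_low_w):
    """phases: list of (duration_s, cpu_busy). Returns energy in kilojoules."""
    return sum(t * (p_high_w if busy else p_low_w) for t, busy in phases) / 1000.0

# An Alltoall: short compute phases around a long network-dominated phase.
phases = [(0.2, True), (1.0, False), (0.2, True)]
default = energy_kj([(t, True) for t, _ in phases], p_high_w=200, p_low_w=120)
dvfs    = energy_kj(phases, p_high_w=200, p_low_w=120)
print(f"savings: {100 * (default - dvfs) / default:.0f}%")
```

The larger the network-dominated fraction of the collective (as at large message sizes and scale), the bigger the savings, matching the trend on the slide.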

23 Collective Offload Support in ConnectX-2 (Recv followed by Multi-Send)
The sender creates a task list consisting of only send and wait WQEs. One send WQE is created for each registered receiver and is appended to the rear of a singly-linked task list. A wait WQE is added to make the ConnectX-2 HCA wait for the ACK packet from the receiver. [Figure: the application's task list posted to the HCA's management queue (MQ); the HCA's send queue, receive queue and completion queues drive the data over the physical link without CPU involvement]
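The task-list mechanism can be sketched in a few lines (illustrative Python; the WQE layout is schematic and a plain loop stands in for the ConnectX-2 hardware, which in reality walks the list autonomously): the host builds the list once, and progress then depends only on ACK arrival, not on the CPU.

```python
# Schematic recv-followed-by-multi-send task list.

def build_task_list(receivers):
    """One send WQE per registered receiver, each followed by a wait WQE
    that blocks until that receiver's ACK arrives (schematic variant)."""
    tasks = []
    for peer in receivers:
        tasks.append(("SEND", peer))
        tasks.append(("WAIT", peer))
    return tasks

def hca_progress(tasks, acks):
    """Walk the task list; a WAIT completes only if the ACK has arrived."""
    sent = []
    for op, peer in tasks:
        if op == "SEND":
            sent.append(peer)
        elif op == "WAIT" and peer not in acks:
            break   # the hardware stalls here until the ACK shows up
    return sent

tasks = build_task_list(["p1", "p2", "p3"])
print(hca_progress(tasks, acks={"p1"}))  # progresses past p1, then stalls
```

Because the list executes on the HCA, the host CPU is free during the collective, which is what the non-blocking Alltoall, Broadcast and Allreduce results on the following slides exploit.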

24 P3DFFT Application Performance with Non-Blocking Alltoall based on CX-2 Collective Offload [Figure: application run-time comparison of the blocking, host-test and offload versions across data sizes at 128 processes] The overlap version with Offload-Alltoall performs up to 17% better than the default blocking version. K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, Int'l Supercomputing Conference (ISC), June 2011

25 Non-Blocking Broadcast with Collective Offload and Impact on HPL Performance [Figure: normalized HPL performance for HPL-Offload, HPL-1ring and HPL-Host as the problem size (N) grows as a percentage of total memory; 4.5% improvement with 512 processes] K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, Hot Interconnects '11, Aug. 2011

26 Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload [Figure: run time of PCG-Default vs. Modified-PCG-Offload across process counts; 21.8% improvement] Experimental setup: 64 nodes, 8-core Intel Xeon (2.53 GHz) with 12 MB L3 cache and 12 GB memory per node; MT26428 QDR ConnectX-2 HCAs with PCI-Ex interfaces; 171-port Mellanox switch; 64,000 unknowns per process. Modified PCG with Offload-Allreduce has a run time about 21% lower than default PCG. K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, accepted for publication at IPDPS '12

27 MPI Design Challenges
High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport
Kernel-based zero-copy intra-node communication
Collective Communication: Multi-core Aware, Topology-aware and Power-aware; Exploiting Collective Offload
Support for GPGPUs and Accelerators
PGAS Support (Hybrid MPI + UPC)

28 Data Movement in GPU+IB Clusters
Many systems today combine GPUs and high-speed networks such as InfiniBand. Steps in data movement in InfiniBand clusters with GPUs:
1. From GPU device memory to main memory at the source process, using CUDA
2. From source to destination process, using MPI
3. From main memory to GPU device memory at the destination process, using CUDA
Earlier, the GPU device and the InfiniBand device required separate memory registration. GPU-Direct (a collaboration between NVIDIA and Mellanox) enabled a common registration between these devices. However, GPU-GPU communication is still costly and programming is harder. [Figure: CPU, GPU and NIC connected over PCIe to a switch]
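The cost of the three-step path, and the win from pipelining it, can be modeled with simple bandwidth arithmetic (illustrative Python; the PCIe and network bandwidths are invented round numbers, not measurements): done sequentially the copies add up, while chunked and overlapped transfers run at roughly the speed of the slowest stage.

```python
# Time model: sequential D2H -> network -> H2D vs. a pipelined transfer.

def staged_time(msg_mb, pcie_gbs=5.0, net_gbs=3.0):
    """Sequential: device-to-host copy, then network send, then host-to-device."""
    gb = msg_mb / 1000.0
    return gb / pcie_gbs + gb / net_gbs + gb / pcie_gbs

def pipelined_time(msg_mb, chunks=16, pcie_gbs=5.0, net_gbs=3.0):
    """Perfect pipelining of chunks: fill/drain plus the bottleneck stage."""
    gb = msg_mb / chunks / 1000.0
    stages = [gb / pcie_gbs, gb / net_gbs, gb / pcie_gbs]
    return sum(stages) + (chunks - 1) * max(stages)

assert pipelined_time(64) < staged_time(64)   # overlap hides most copy time
```

This is exactly the overlap the user-level pipelining code on slide 30 tries to achieve by hand, and that MVAPICH2-GPU performs inside the library.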

29 Sample Code - Without MPI Integration
Naïve implementation with standard MPI and CUDA:

At Sender:
  cudaMemcpy(sbuf, sdev, ...);
  MPI_Send(sbuf, size, ...);

At Receiver:
  MPI_Recv(rbuf, size, ...);
  cudaMemcpy(rdev, rbuf, ...);

High productivity but poor performance. [Figure: CPU, GPU and NIC over PCIe, connected to a switch]

30 Sample Code - User-Optimized Code
Pipelining at user level with non-blocking MPI and CUDA interfaces. Code at the sender side (mirrored at the receiver side):

  for (j = 0; j < pipeline_len; j++)
      cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, ...);
  for (j = 0; j < pipeline_len; j++) {
      while (result != cudaSuccess) {
          result = cudaStreamQuery(...);
          if (j > 0) MPI_Test(...);
      }
      MPI_Isend(sbuf + j * blksz, blksz, ...);
  }
  MPI_Waitall(...);

User-level copying may not match the internal MPI design. High performance but poor productivity.

31 Can this be done within the MPI Library?
Support GPU-to-GPU communication through standard MPI interfaces, e.g. enable MPI_Send and MPI_Recv from/to GPU memory
Provide high performance without exposing low-level details to the programmer
Pipelined data transfer automatically provides optimizations inside the MPI library, without user tuning
A new design incorporated in MVAPICH2 supports this functionality

32 Sample Code - MVAPICH2-GPU
MVAPICH2-GPU: standard MPI interfaces are used.

At Sender:
  MPI_Send(s_device, size, ...);

At Receiver:
  MPI_Recv(r_device, size, ...);

Device pointers are detected and the transfer is pipelined inside MVAPICH2. High performance and high productivity.

33 Design Considerations
Memory detection: CUDA 4.0 introduces Unified Virtual Addressing (UVA), so the MPI library can differentiate between device memory and host memory without any hints from the user
Overlap of CUDA copy and RDMA transfer: data movement from the GPU and the RDMA transfer are both DMA operations; allow for asynchronous progress
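The memory-detection idea can be mocked up in a few lines (illustrative Python only: a dict stands in for the driver query that UVA enables, e.g. CUDA's cudaPointerGetAttributes; the path names are invented): the library classifies the buffer itself and picks a transfer path, so the user calls plain MPI_Send either way.

```python
# Mock of UVA-based buffer classification inside an MPI library.

_registry = {}   # addr -> "device" | "host", filled by our fake allocators

def fake_cuda_malloc(addr):
    _registry[addr] = "device"   # stands in for cudaMalloc

def fake_malloc(addr):
    _registry[addr] = "host"     # stands in for malloc

def mpi_send_path(addr):
    """Pick the transfer path without any user hint, as UVA allows."""
    kind = _registry.get(addr, "host")
    return "pipelined-cuda-rdma" if kind == "device" else "rdma"

fake_cuda_malloc(0x7F00)
fake_malloc(0x9000)
assert mpi_send_path(0x7F00) == "pipelined-cuda-rdma"
assert mpi_send_path(0x9000) == "rdma"
```

In the real library the "pipelined-cuda-rdma" path would chunk the message and overlap cudaMemcpyAsync with RDMA, as modeled after slide 28.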

34 MPI-Level Two-Sided Communication [Figure: latency of Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU, with and without GPU-Direct, for 32K-4M message sizes] 45% and 38% improvement compared to Memcpy+Send, with and without GPUDirect respectively, for 4 MB messages; 24% and 33% improvement compared to MemcpyAsync+Isend, with and without GPUDirect respectively, for 4 MB messages. H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11

35 Other MPI Operations from GPU Buffers
Similar approaches can be used for: One-sided communication, Collectives, Communication with Datatypes. The designs can also be extended for multiple GPUs per node.
H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011
A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011
S. Potluri et al., Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems (ASHES), to be held in conjunction with IPDPS '12

36 MVAPICH2 1.8a2 Release
Supports point-to-point and collective communication
Supports communication between GPU devices, and between GPU device and host
Supports communication using contiguous and non-contiguous MPI Datatypes
Supports GPU-Direct through CUDA support (from CUDA 4.0)
Takes advantage of CUDA IPC for intra-node (intra-I/O-hub) communication (from CUDA 4.1)
Provides flexibility in tuning the performance of both RDMA and shared-memory based designs, based on the predominant message sizes in applications

37 OSU MPI Micro-Benchmarks (OMB) Release
OSU MPI Micro-Benchmarks provide a comprehensive suite of benchmarks to compare the performance of different MPI stacks and networks
Enhancements done for three benchmarks: Latency, Bandwidth, Bi-directional Bandwidth
Flexibility for using buffers in NVIDIA GPU device memory (D) and host memory (H)
Flexibility for selecting data movement between D->D, D->H and H->D
Available from the OSU site, and in an integrated manner with the MVAPICH2 stack

38 MVAPICH2 vs. OpenMPI (Device-Device, Inter-node) [Figure: latency, bandwidth and bi-directional bandwidth for MVAPICH2 vs. OpenMPI across 1K-1M message sizes; MVAPICH2 performs better throughout] MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot of Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

39 MVAPICH2 vs. OpenMPI (Device-Host, Inter-node) [Figure: latency, bandwidth and bi-directional bandwidth for MVAPICH2 vs. OpenMPI across 1K-1M message sizes; Host-Device performance is similar] MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot of Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

40 MVAPICH2 vs. OpenMPI (Device-Device, Intra-node, Multi-GPU) [Figure: latency, bandwidth and bi-directional bandwidth for MVAPICH2 vs. OpenMPI across 1K-1M message sizes] MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot of Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

41 Application-Level Evaluation (Lattice Boltzmann Method (LBM)) [Figure: time per LB step for MVAPICH2 1.7 vs. MVAPICH2 1.8a2 across matrix sizes 128x512x64, 256x512x64, 512x512x64 and 1024x512x64; 23.5-24.2% improvement] LBM-CUDA (courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios. NVIDIA Tesla C2050, Mellanox QDR InfiniBand HCA MT26428, Intel Westmere processor with 12 GB main memory; CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2. One process per node, one GPU per process (8-node cluster).

42 Application-Level Evaluation (AWP-ODC) [Figure: total execution time for MVAPICH2 1.7 vs. MVAPICH2 1.8a2 at 4 and 8 processes; 12.5% and 13.0% improvement] AWP-ODC simulates the dynamic rupture and wave propagation that occur during an earthquake; a Gordon Bell Prize finalist at SC 2010. Originally a Fortran code; a new version is being written in C and CUDA. NVIDIA Tesla C2050, Mellanox QDR IB, Intel Westmere processor with 12 GB main memory; CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2. One process per node with one GPU; 128x128x1024 data grid per process/GPU.

43 MPI Design Challenges
High Performance and Scalable Inter-node Communication with Hybrid UD-RC/XRC transport
Kernel-based zero-copy intra-node communication
Collective Communication: Multi-core Aware, Topology-aware and Power-aware; Exploiting Collective Offload
Support for GPGPUs and Accelerators
PGAS Support (Hybrid MPI + UPC)

44 PGAS Models
Partitioned Global Address Space (PGAS) models provide a different, complementary interface compared to message passing. The idea is to decouple data movement from process synchronization: processes have asynchronous access to globally distributed data. This is well suited for irregular applications and kernels that require dynamic access to different data. Different libraries and compilers provide this model: Global Arrays (library), UPC (compiler), CAF (compiler); HPCS languages: X10, Chapel, Fortress.
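The core PGAS abstraction fits in a short sketch (illustrative Python, not UPC or Global Arrays code; a real runtime would implement get/put as one-sided RDMA over the network): an array is partitioned across processes, and any process can read or write any element with no matching receive on the owner's side.

```python
# Minimal partitioned-global-array sketch with one-sided get/put.

class GlobalArray:
    def __init__(self, size, nprocs):
        self.chunk = size // nprocs
        # One partition of the "global" array lives on each process.
        self.parts = [[0] * self.chunk for _ in range(nprocs)]

    def owner(self, i):
        return i // self.chunk            # address translation: index -> process

    def put(self, i, value):              # one-sided write into remote memory
        self.parts[self.owner(i)][i % self.chunk] = value

    def get(self, i):                     # one-sided read; owner not involved
        return self.parts[self.owner(i)][i % self.chunk]

ga = GlobalArray(size=16, nprocs=4)
ga.put(10, 7)                             # lands in process 2's partition
assert ga.get(10) == 7 and ga.owner(10) == 2
```

The asynchronous, receiver-oblivious access pattern is why irregular applications map well to PGAS, and why a unified runtime (next slide) can serve both UPC's get/put and MPI's sends from the same buffers and queue pairs.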

45 Unifying UPC and MPI Runtimes: Experience with MVAPICH2 [Figure: today, a UPC compiler sits on separate MPI and GASNet interfaces, each with its own runtime, buffers, queue pairs and other resources over the network interface; in our approach, both interfaces share a unified MVAPICH + GASNet runtime] Currently UPC and MPI do not share runtimes, which duplicates the lower-level communication mechanisms, and GASNet is unable to leverage the advanced buffering mechanisms developed for MVAPICH2. Our novel approach is to enable a truly unified communication library.

46 UPC Micro-benchmark Performance [Figure: UPC memput latency, memput bandwidth and memory footprint for GASNet-UCR, GASNet-IBV and GASNet-MPI] BUPC micro-benchmarks from the latest release. UPC performance is identical with both the native IBV layer and the new UCR layer. Performance of the GASNet-MPI conduit is not very good, due to a mismatch between the MPI specification and Active Messages. GASNet-UCR is more scalable than the native IBV conduit. J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, International Conference on Partitioned Global Address Space Programming Models (PGAS), 2010

47 Evaluation using UPC NAS Benchmarks [Figure: run time of MG, FT and CG (Classes B and C, 64 and 128 processes) for GASNet-UCR, GASNet-IBV and GASNet-MPI] GASNet-UCR performs equal to or better than GASNet-IBV: 10% improvement for CG (B, 128) and 23% improvement for MG (B, 128).

48 Evaluation of Hybrid MPI+UPC NAS-FT [Figure: run time for GASNet-UCR, GASNet-IBV, GASNet-MPI and the Hybrid version (Classes B and C, 64 and 128 processes)] Modified NAS FT implements the UPC all-to-all pattern using MPI_Alltoall, making it a truly hybrid program; 34% improvement for FT (C, 128).

49 Graph500 Results with the New UPC Queue Design [Figure: traversal performance of the base and queue-based designs] Workload scale: 24, edge factor: 16 (16 million vertices, 256 million edges). 44% improvement over the base version for 512 UPC threads; 30% improvement over the base version for 1,024 UPC threads. J. Jose, S. Potluri, M. Luo, S. Sur and D. K. Panda, UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters, Fifth Conference on Partitioned Global Address Space Programming Models (PGAS '11), Oct. 2011

50 MVAPICH2 Future Plans
Performance and memory scalability toward 500K-1M cores
Unified support for PGAS models and languages (UPC, OpenSHMEM, etc.)
Support for hybrid programming models (being discussed for MPI-3.0)
Enhanced optimization for GPU support and accelerators: extending the GPGPU support and adding Intel MIC support
Taking advantage of the Collective Offload framework in ConnectX-2, including support for non-blocking collectives (MPI-3.0)
Extended topology-aware collectives
Power-aware collectives
Enhanced multi-rail designs
Automatic optimization of collectives: LiMIC2, XRC, Hybrid (UD-RC/XRC) and multi-rail
Checkpoint-restart and migration support with incremental checkpointing
Fault tolerance with run-through stabilization (being discussed for MPI-3.0)
QoS-aware I/O and checkpointing

51 Two Major Categories of Applications Scientific Computing Message Passing Interface (MPI) is the Dominant Programming Model Many discussions towards Partitioned Global Address Space (PGAS) Enterprise/Commercial Computing Focuses on large data and data analysis Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum Memcached is also used 51

52 Can High-Performance Interconnects Benefit Enterprise Computing? Beginning to draw interest from the enterprise domain Oracle, IBM, Google are working along these directions Performance in the enterprise domain remains a concern Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs? 52

53 Common Protocols using OpenFabrics [Figure: protocol stack showing the application over sockets or verbs interfaces; protocol implementations in kernel space (Ethernet driver with TCP/IP, IPoIB, hardware-offloaded TCP/IP, SDP, iWARP) and user space (RDMA verbs); and the corresponding Ethernet/InfiniBand adapters and switches for 1/10/40 GigE, IPoIB, 10/40 GigE-TOE, SDP, iWARP, RoCE and IB Verbs]

54 Can New Data Analysis and Management Systems be Designed with High-Performance Networks and Protocols? [Figure: current design (application over sockets on a 1/10 GigE network), enhanced designs (application over accelerated sockets with verbs/hardware offload on 10 GigE or InfiniBand), and our approach (application over an OSU design on the verbs interface, on 10 GigE or InfiniBand)] Sockets were not designed for high performance: stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop), and zero-copy is not available for non-blocking sockets.

55 Memcached Architecture [Figure: web frontend servers (Memcached clients) connected over the Internet and high-performance networks to a distributed caching layer (Memcached servers) in front of database servers, each node with main memory, CPUs, SSD and HDD] Memcached is a distributed caching layer that allows spare memory from multiple nodes to be aggregated. It is general purpose, typically used to cache database queries and results of API calls. The model is scalable, but typical usage is very network-intensive. We present a native IB-verbs-level design and evaluation with the Memcached server and client (libmemcached).
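The aggregation idea can be sketched with a tiny client (illustrative Python; plain dicts stand in for the networked memcached servers, and the simple modulo hashing here is cruder than libmemcached's real distribution strategies): clients hash each key to one server, so the cache capacity is the sum of all servers' spare memory.

```python
import hashlib

class TinyMemcachedClient:
    """Key-hashing client over a pool of in-process 'servers' (dicts)."""

    def __init__(self, num_servers):
        self.servers = [{} for _ in range(num_servers)]

    def _pick(self, key):
        # Hash the key to deterministically pick one server.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def set(self, key, value):
        self._pick(key)[key] = value

    def get(self, key):
        return self._pick(key).get(key)

c = TinyMemcachedClient(num_servers=4)
c.set("query:users:42", {"name": "x"})       # e.g. a cached DB query result
assert c.get("query:users:42") == {"name": "x"}
assert sum(len(s) for s in c.servers) == 1   # the key lives on exactly one server
```

Every get/set is a small network round trip, which is why the verbs-level latency improvements on the following slides translate directly into application-level gains.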

56 Memcached Get Latency (Small Messages) [Figure: latency for SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB across message sizes up to 2K, on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)] Memcached Get latency for 4 bytes, RC/UD: DDR 6.82/7.55 us; QDR 4.28/4.86 us. For 2K bytes, RC/UD: DDR 12.31/12.78 us; QDR 8.19/8.46 us. Almost a factor-of-four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster.

57 Memcached Set Latency (Large Messages) [Figure: latency for SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB across 2K-512K message sizes, on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)] Memcached Set latency for 8K bytes, RC/UD: DDR 19.8/21.2 us; QDR 11.7/12.7 us. For 512K bytes, RC/UD: DDR 366/413 us; QDR 181/206 us. Almost a factor-of-two improvement over 10GigE (TOE) for 512K bytes on the DDR cluster.

58 Memcached Get Transactions Per Second (4 bytes) [Figure: thousands of transactions per second vs. number of clients for SDP, IPoIB, 10GigE, OSU-RC-IB and OSU-UD-IB] On IB QDR: 1.4M TPS (RC) and 1.3M TPS (UD) for 8 clients. Significant improvement with native IB QDR compared to SDP and IPoIB.

59 Memcached - Memory Scalability [Figure: memory footprint vs. number of clients (up to 4K) for SDP, IPoIB, 10GigE, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB] The UD design keeps a steady memory footprint (~200 MB), while the RC memory footprint increases with the number of clients (~500 MB for 4K clients).

60 Application-Level Evaluation - Olio Benchmark [Figure: time vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB] Olio benchmark: RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1,024 clients; 4X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to the pure RC design.

61 Application-Level Evaluation - Real Application Workloads [Figure: time vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB] Real application workload: RC 302 ms, UD 318 ms, Hybrid 314 ms for 1,024 clients; 12X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to the pure RC design.
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, CCGrid '12

62 Overview of HBase Architecture
An open-source database project based on the Hadoop framework for hosting very large tables. Major components: HBaseMaster, HRegionServer and HBaseClient. HBase and HDFS are deployed in the same cluster to get better data locality.

63 HBase Put/Get Detailed Analysis [Figure: time breakdown (communication, communication preparation, server processing, server serialization, client processing, client serialization) for 1GigE, 10GigE, IPoIB and OSU-IB, for 1 KB Put and 1 KB Get] HBase 1 KB Put communication time: 8.9 us, a factor of 6X improvement over 10GigE. HBase 1 KB Get communication time: 8.9 us, a factor of 6X improvement over 10GigE. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS '12

64 HBase Single-Server, Multi-Client Results [Figure: latency and throughput vs. number of clients for IPoIB, 1GigE, 10GigE and OSU-IB] HBase Get latency: 104.5 us for 4 clients. HBase Get throughput: 37.1 Kops/sec for 4 clients; 53.4 Kops/sec for 16 clients; 27% improvement in throughput for 16 clients over 10GigE.

65 HBase YCSB Read-Write Workload [Figure: read and write latency vs. number of clients for IPoIB, 10GigE and OSU-IB] HBase Get latency (Yahoo! Cloud Serving Benchmark): 2.0 ms for 64 clients, 3.5 ms for 128 clients; 42% improvement over IPoIB for 128 clients. HBase Put latency: 1.9 ms for 64 clients, 3.5 ms for 128 clients; 40% improvement over IPoIB for 128 clients. J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12

66 Hadoop Architecture
Underlying Hadoop Distributed File System (HDFS), with fault tolerance by replicating data blocks
NameNode: stores information on data blocks
DataNodes: store blocks and host MapReduce computation
JobTracker: tracks jobs and detects failure
The model scales, but there is a high amount of communication during the intermediate phases
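The replication scheme above can be sketched in miniature (illustrative Python; the round-robin placement is a stand-in for HDFS's real rack-aware policy, and the node names are invented): the NameNode records, per block, which DataNodes hold a replica, so losing one node still leaves every block readable.

```python
# Toy NameNode block table with 3-way replication.

def place_blocks(num_blocks, datanodes, replication=3):
    """Return {block_id: [datanodes holding a replica]}, round-robin."""
    table = {}
    for b in range(num_blocks):
        table[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return table

table = place_blocks(num_blocks=4, datanodes=["dn1", "dn2", "dn3", "dn4"])
print(table[0])  # ['dn1', 'dn2', 'dn3']

# Every block survives the loss of dn1: at least one replica remains.
assert all(set(nodes) - {"dn1"} for nodes in table.values())
```

Writing a block means streaming it to all replicas, and the shuffle between map and reduce moves data between many nodes at once; both are the network-intensive phases that the RDMA-based HDFS design on the next slide targets.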

67 RDMA-based Design for Native HDFS - Preliminary Results [Figure: file write time for 1GigE, 10GigE, IPoIB and OSU-IB-DDR across file sizes; HDFS file write experiment using one DataNode on the IB-DDR cluster] HDFS file write time: 64 MB in 344 ms, 128 MB in 669 ms, 256 MB in 1.3 s, 512 MB in 2.7 s, 1 GB in 6.7 s; 20% improvement over IPoIB and 13% over 10GigE for a 1 GB file.

68 Concluding Remarks
InfiniBand with its RDMA feature is gaining momentum in HPC systems, delivering the best performance and seeing growing usage
As the HPC community moves to Exascale, new solutions are needed in the MPI stack to support PGAS, GPUs, collectives (topology-aware, power-aware), one-sided communication, QoS, etc.
Demonstrated how such solutions can be designed with MVAPICH2, together with their performance benefits
New solutions are also needed to re-design software libraries for enterprise environments to take advantage of modern networks
These allow application scientists and engineers to take advantage of modern supercomputers

69 Funding Acknowledgments
[Slide: funding support and equipment support sponsor logos] 69

70 Personnel Acknowledgments
Current Students: V. Dhanraj (M.S.), N. Islam (Ph.D.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), M. Luo (Ph.D.), X. Ouyang (Ph.D.), S. Potluri (Ph.D.), R. Rajachandrasekhar (Ph.D.), M. Rahman (Ph.D.), A. Singh (Ph.D.), H. Subramoni (Ph.D.)
Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (Ph.D.), J. Liu (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), S. Pai (M.S.), G. Santhanaraman (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Current Post-Docs: J. Vienne, H. Wang
Current Programmers: M. Arnold, D. Bureddy, J. Perkins, D. Sharma
Past Post-Docs: X. Besseron, H.-W. Jin, E. Mancini, S. Marcarelli
Past Research Scientist: S. Sur 70

71 Web Pointers MVAPICH Web Page 71


Accelerating and Benchmarking Big Data Processing on Modern Clusters Accelerating and Benchmarking Big Data Processing on Modern Clusters Keynote Talk at BPOE-6 (Sept 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming

Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming Using MVAPICH2- X for Hybrid MPI + PGAS (OpenSHMEM and UPC) Programming MVAPICH2 User Group (MUG) MeeFng by Jithin Jose The Ohio State University E- mail: jose@cse.ohio- state.edu h

More information

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters 2015 IEEE International Conference on Cluster Computing Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan,

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning Talk at Mellanox Theater (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

The Exascale Architecture

The Exascale Architecture The Exascale Architecture Richard Graham HPC Advisory Council China 2013 Overview Programming-model challenges for Exascale Challenges for scaling MPI to Exascale InfiniBand enhancements Dynamically Connected

More information

Job Startup at Exascale:

Job Startup at Exascale: Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems

Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems Lei Chai Ping Lai Hyun-Wook Jin Dhabaleswar K. Panda Department of Computer Science

More information

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI

INAM 2 : InfiniBand Network Analysis and Monitoring with MPI INAM 2 : InfiniBand Network Analysis and Monitoring with MPI H. Subramoni, A. A. Mathews, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and D. K. Panda Department of Computer Science and Engineering The

More information

Optimizing & Tuning Techniques for Running MVAPICH2 over IB

Optimizing & Tuning Techniques for Running MVAPICH2 over IB Optimizing & Tuning Techniques for Running MVAPICH2 over IB Talk at 2nd Annual IBUG (InfiniBand User's Group) Workshop (214) by Hari Subramoni The Ohio State University E-mail: subramon@cse.ohio-state.edu

More information

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Ahmad Awan, Khaled Hamidouche, Akshay Venkatesh, and Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

High Performance Computing

High Performance Computing High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and

More information