Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization

1 Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization
Talk at HPCAC-Switzerland (Mar 16) by Dhabaleswar K. (DK) Panda, The Ohio State University

2 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
- Scalability for million to billion processors
- Collective communication
- Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
- InfiniBand Network Analysis and Monitoring (INAM)
- Integrated Support for GPGPUs
- Integrated Support for MICs
- Virtualization (SR-IOV and Containers)
- Energy-Awareness
- Best Practice: Set of Tunings for Common Applications

3 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
- Integrated Support for GPGPUs
  - CUDA-Aware MPI
  - GPUDirect RDMA (GDR) Support
  - CUDA-aware Non-blocking Collectives
  - Support for Managed Memory
  - Efficient datatype Processing
  - Supporting Streaming applications with GDR
  - Efficient Deep Learning with MVAPICH2-GDR
- Integrated Support for MICs
- Virtualization (SR-IOV and Containers)
- Energy-Awareness
- Best Practice: Set of Tunings for Common Applications

4 MPI + CUDA - Naive
Data movement in applications with standard MPI and CUDA interfaces
At Sender:
    cudaMemcpy(s_hostbuf, s_devbuf, ...);
    MPI_Send(s_hostbuf, size, ...);
At Receiver:
    MPI_Recv(r_hostbuf, size, ...);
    cudaMemcpy(r_devbuf, r_hostbuf, ...);
High Productivity and Low Performance
Diagram labels: CPU, PCIe, GPU, NIC, Switch

5 MPI + CUDA - Advanced
Pipelining at user level with non-blocking MPI and CUDA interfaces
At Sender:
    /* Stage every block from device to host asynchronously. */
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, ...);
    for (j = 0; j < pipeline_len; j++) {
        /* Poll until the copy of block j is done; keep earlier sends progressing. */
        do {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        } while (result != cudaSuccess);
        MPI_Isend(s_hostbuf + j * blksz, blksz, ...);
    }
    MPI_Waitall(...);
<<Similar at receiver>>
Low Productivity and High Performance
Diagram labels: CPU, PCIe, GPU, NIC, Switch

6 GPU-Aware MPI Library: MVAPICH2-GPU
- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from GPU with RDMA transfers
At Sender:
    MPI_Send(s_devbuf, size, ...);
At Receiver:
    MPI_Recv(r_devbuf, size, ...);
The data movement is handled inside MVAPICH2.
High Performance and High Productivity

7 GPU-Direct RDMA (GDR) with CUDA
- OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
- OSU has a design of MVAPICH2 using GPUDirect RDMA
  - Hybrid design using GPUDirect RDMA
    - GPUDirect RDMA and host-based pipelining
    - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  - Support for communication using multi-rail
  - Support for Mellanox Connect-IB and ConnectX VPI adapters
  - Support for RoCE with Mellanox ConnectX VPI adapters
P2P bandwidth by CPU:
  SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
  IVB E5-2680 V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s
Diagram labels: IB Adapter, CPU, Chipset, System Memory, GPU, GPU Memory

8 CUDA-Aware MPI: MVAPICH2-GDR Releases
- Support for MPI communication from NVIDIA GPU device memory
- High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
- High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers

9 Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
[Figure: three plots against Message Size (bytes), each comparing MV2-GDR2.2b, MV2-GDR2.0b, and MV2 w/o GDR: GPU-GPU internode latency (us), GPU-GPU internode bandwidth (MB/s), and GPU-GPU internode bi-bandwidth (MB/s). Annotations on the plots: 2.18 us, 11X, 2X.]
Test setup: MVAPICH2-GDR-2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct-RDMA

10 Application-Level Evaluation (HOOMD-blue)
[Figure: two plots of Average Time Steps per second (TPS) against Number of Processes, comparing MV2 and MV2+GDR, one panel per particle count (the first is 64K Particles). Annotation on each plot: 2X.]
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HoomdBlue Version 1.0.5
- GDRCOPY enabled, with these settings:
    MV2_USE_CUDA=1
    MV2_IBA_HCA=mlx5_0
    MV2_IBA_EAGER_THRESHOLD=32768
    MV2_VBUF_TOTAL_SIZE=32768
    MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768
    MV2_USE_GPUDIRECT_GDRCOPY=1
    MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=

11 CUDA-Aware Non-Blocking Collectives
[Figure: two plots of Overlap (%) against Message Size (Bytes), 4K to 1M, both titled Medium/Large Message Overlap (64 GPU nodes). One compares Ialltoall (1 process/node) with Ialltoall (2 processes/node; 1 process/GPU); the other compares Igather (1 process/node) with Igather (2 processes/node; 1 process/GPU).]
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- Available since MVAPICH2-GDR 2.2a
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC, 2015

12 Communication Runtime with GPU Managed Memory
- With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
- Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
- Extended MVAPICH2 to perform communications directly from managed buffers (available in MVAPICH2-GDR 2.2b)
- OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communications using managed buffers
  - Available in OMB 5.2
[Figure: Latency (us) and Bandwidth (MB/s) against Message Size (Bytes) for the buffer combinations H-H, D-D, MH-MH, and MD-MD.]
D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP 2016, Barcelona, Spain

13 MPI Datatype Processing (Communication Optimization)
Common Scenario:
    MPI_Isend(A, ..., Datatype, ...);
    MPI_Isend(B, ..., Datatype, ...);
    MPI_Isend(C, ..., Datatype, ...);
    MPI_Isend(D, ..., Datatype, ...);
    MPI_Waitall(...);
  *Buf1, Buf2 contain noncontiguous MPI Datatype
  Waste of computing resources on CPU and GPU
Existing Design: each Isend is handled to completion on the CPU before the next one begins. The CPU initiates a kernel, waits for the kernel (WFK) while it runs on a GPU stream, and then starts the send. The three Isends are therefore serialized, followed by Wait and Progress.
Proposed Design: Isend(1), Isend(2), and Isend(3) each initiate their kernel and return at once. The kernels run on GPU streams while the CPU moves on; WFK, Start Send, and Progress follow as each kernel finishes.
Expected Benefits: from the same start time, the proposed design finishes earlier than the existing design.

14 Application-Level Evaluation (HaloExchange - Cosmo)
[Figure: two plots of Normalized Execution Time against Number of GPUs, one for the Wilkes GPU Cluster and one for the CSCS GPU cluster, each comparing Default, Callback-based, and Event-based designs.]
- 2X improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

15 Nature of Streaming Applications
- Pipelined data parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
- Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
- Broadcast operation is a key dictator of throughput of streaming applications
- Current Broadcast operation on GPU clusters does not take advantage of
  - IB Hardware MCAST
  - GPU Direct RDMA
Courtesy: Agarwalla, Bikash, et al. "Streamline: A scheduling heuristic for streaming applications on the grid." Electronic Imaging 2006

16 SGL-based design for Efficient Broadcast Operation on GPU Systems
- Current design is limited by the expensive copies from/to GPUs
- Proposed several alternative designs to avoid the overhead of the copy
  - Loopback, GDRCOPY and hybrid
  - High performance and scalability
  - Still uses PCI resources for Host-GPU copies
- Proposed SGL-based design
  - Combines IB MCAST and GPUDirect RDMA features
  - High performance and scalability for D-D broadcast
  - Direct code path between HCA and GPU
  - Free PCI resources
- 3X improvement in latency
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE Int'l Conf. on High Performance Computing (HiPC '14)

17 Accelerating Deep Learning with MVAPICH2-GDR
- Caffe: A flexible and layered Deep Learning framework
- Benefits and Weaknesses
  - Multi-GPU Training within a single node
  - Performance degradation for GPUs across different sockets
- Can we enhance Caffe with MVAPICH2-GDR?
  - Caffe-Enhanced: A CUDA-Aware MPI version
  - Enables Scale-up (within a node) and Scale-out (across multi-GPU nodes)
  - Initial evaluation suggests up to 8X reduction in training time on the CIFAR-10 dataset

18 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
- Integrated Support for GPGPUs
- Integrated Support for MICs
- Virtualization (SR-IOV and Containers)
- Energy-Awareness
- Best Practice: Set of Tunings for Common Applications

19 MPI Applications on MIC Clusters
Flexibility in launching MPI jobs on clusters with Xeon Phi
[Diagram: launch modes arranged from Multi-core Centric (Xeon) to Many-core Centric (Xeon Phi): Host-only, Offload (/reverse Offload), Symmetric, and Coprocessor-only. Each mode shows where the MPI program runs; in Offload mode the other side runs the offloaded computation.]

20 MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
- Offload Mode
- Intranode Communication
  - Coprocessor-only and Symmetric Mode
- Internode Communication
  - Coprocessors-only and Symmetric Mode
- Multi-MIC Node Configurations
- Running on three major systems
  - Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

21 MIC-Remote-MIC P2P Communication with Proxy-based Communication
[Figure: four plots against Message Size (Bytes). For Intra-socket P2P and for Inter-socket P2P: Latency (usec) for large messages, where lower is better, and Bandwidth (MB/sec), where higher is better.]

22 Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
[Figure: four plots comparing MV2-MIC and MV2-MIC-Opt.
 Node-Allgather (16H + 16 M) Small Message Latency (usecs) against Message Size (Bytes); annotation 76%.
 Node-Allgather (8H + 8 M) Large Message Latency (usecs) against Message Size (Bytes); annotation 58%.
 32-Node-Alltoall (8H + 8 M) Large Message Latency (usecs) against Message Size (Bytes); annotation 55%.
 P3DFFT Performance: Execution Time (secs) split into Communication and Computation; 32 Nodes (8H + 8M), Size = 2K*2K*1K.]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters, IPDPS '14, May 2014

23 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
- Integrated Support for GPGPUs
- Integrated Support for MICs
- Virtualization (SR-IOV and Containers)
- Energy-Awareness
- Best Practice: Set of Tunings for Common Applications

24 Can HPC and Virtualization be Combined?
- Virtualization has many benefits
  - Fault-tolerance
  - Job migration
  - Compaction
- It has not been very popular in HPC due to the overhead associated with virtualization
- New SR-IOV (Single Root IO Virtualization) support available with Mellanox InfiniBand adapters changes the field
- Enhanced MVAPICH2 support for SR-IOV
- MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
- How about the Containers support?
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, Euro-Par '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15

25 Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM
- Redesign MVAPICH2 to make it virtual machine aware
  - SR-IOV shows near-native performance for inter-node point-to-point communication
  - IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
  - Locality Detector: maintains the locality information of co-resident virtual machines
  - Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively
[Diagram: two guests (Guest 1, Guest 2), each with an MPI process in user space, a VF driver in kernel space, and PCI devices. In the host environment, the hypervisor backs IV-SHM with /dev/shm/ (IV-Shmem channel), and the PF driver exposes the InfiniBand adapter's physical function as virtual functions (SR-IOV channel).]
J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014.
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014.

26 MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack
- OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
- Deployment with OpenStack
  - Supporting SR-IOV configuration
  - Supporting IVSHMEM configuration
  - Virtual Machine aware design of MVAPICH2 with SR-IOV
- An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack
J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015.

27 Application-Level Performance on Chameleon
[Figure: two bar charts comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native. One shows Graph500 against Problem Size (Scale, Edgefactor); the other shows SPEC MPI2007 for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu. Axes: Execution Time (ms) and Execution Time (s).]
- 32 VMs, 6 Core/VM
- Compared to Native, 2-5% overhead for Graph500 with 128 Procs
- Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs

28 NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
- Large-scale instrument
  - Targeting Big Data, Big Compute, Big Instrument research
  - ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
- Reconfigurable instrument
  - Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use
- Connected instrument
  - Workload and Trace Archive
  - Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  - Partnerships with users
- Complementary instrument
  - Complementing GENI, Grid'5000, and other testbeds
- Sustainable instrument
  - Industry connections

29 Containers Support: MVAPICH2 Intra-node Point-to-Point Performance on Chameleon
[Figure: Latency (us) and Bandwidth (MBps) against Message Size (Bytes), up to 64k, for Intra-Node Inter-Container communication, comparing Container-Def, Container-Opt, and Native. Annotations: 81% on the latency plot, 191% on the bandwidth plot.]
- Compared to Container-Def, up to 81% and 191% improvement on Latency and BW
- Compared to Native, minor overhead on Latency and BW

30 Containers Support: Application-Level Performance on Chameleon
[Figure: two bar charts comparing Container-Def, Container-Opt, and Native. One shows the NAS benchmarks MG.D, FT.D, EP.D, LU.D, and CG.D; the other shows Graph 500 against Problem Size (Scale, Edgefactor). Axes: Execution Time (s) and Execution Time (ms).]
- 64 Containers across 16 nodes, pinning 4 Cores per Container
- Compared to Container-Def, up to 11% and 16% execution time reduction for NAS and Graph 500
- Compared to Native, less than 9% and 4% overhead for NAS and Graph 500
- Optimized Container support will be available with the next release of MVAPICH2-Virt

31 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
- Integrated Support for GPGPUs
- Integrated Support for MICs
- Virtualization (SR-IOV and Containers)
- Energy-Awareness
- Best Practice: Set of Tunings for Common Applications

32 Designing Energy-Aware (EA) MPI Runtime
Overall application Energy Expenditure
- Energy Spent in Communication Routines
  - Point-to-point Routines
  - Collective Routines
  - RMA Routines
- Energy Spent in Computation Routines
MVAPICH2-EA Designs impact:
- MPI Two-sided and collectives (ex: MVAPICH2)
- MPI-3 RMA Implementations (ex: MVAPICH2)
- One-sided runtimes (ex: ComEx)
- Other PGAS Implementations (ex: OSHMPI)

33 Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
- MVAPICH2-EA 2.1 (Energy-Aware)
  - A white-box approach
  - New Energy-Efficient communication protocols for pt-pt and collective operations
  - Intelligently apply the appropriate Energy saving techniques
  - Application oblivious energy saving
- OEMT
  - A library utility to measure energy consumption for MPI applications
  - Works with all MPI runtimes
  - PRELOAD option for precompiled applications
  - Does not require ROOT permission: a safe kernel module to read only a subset of MSRs

34 MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)
- An energy efficient runtime that provides energy savings without application knowledge
- Automatically and transparently uses the best energy lever
- Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
- Pessimistic MPI applies the energy reduction lever to each MPI call
A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

35 MPI-3 RMA Energy Savings with Proxy-Applications
[Figure: Graph500 (Energy Usage) in Joules and Graph500 (Execution Time) in Seconds, both against #Processes, comparing optimistic, pessimistic, and EAM-RMA. Annotation on the energy plot: 46%.]
- MPI_Win_fence dominates application execution time in Graph500
- Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time in comparison with the default optimistic MPI runtime

36 MPI-3 RMA Energy Savings with Proxy-Applications
[Figure: SCF (Energy Usage) in Joules and SCF (Execution Time) in Seconds, both against #Processes, comparing optimistic, pessimistic, and EAM-RMA.]
- SCF (self-consistent field) calculation spends nearly 75% of total time in the MPI_Win_unlock call
- With 256 and 512 processes, EAM-RMA yields 42% and 36% savings at 11% degradation (close to the permitted degradation ρ = 10%)
- 128 processes is an exception due to 2-sided and 1-sided interaction
- MPI-3 RMA Energy-efficient support will be available in upcoming MVAPICH2-EA release

37 Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
- Integrated Support for GPGPUs
- Integrated Support for MICs
- Virtualization (SR-IOV and Containers)
- Energy-Awareness
- Best Practice: Set of Tunings for Common Applications

38 Applications-Level Tuning: Compilation of Best Practices
- MPI runtime has many parameters
- Tuning a set of parameters can help you to extract higher performance
- Compiled a list of such contributions through the MVAPICH Website
- Initial list of applications
  - Amber
  - HoomdBlue
  - HPCG
  - Lulesh
  - MILC
  - MiniAMR
  - Neuron
  - SMG2000
- Soliciting additional contributions: send your results to mvapich-help at cse.ohio-state.edu. We will link these results with credits to you.

39 MVAPICH2 Plans for Exascale
- Performance and Memory scalability toward 1M cores
- Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  - Support for task-based parallelism (UPC++)*
- Enhanced Optimization for GPU Support and Accelerators
- Taking advantage of advanced features of Mellanox InfiniBand
  - On-Demand Paging (ODP)
  - Switch-IB2 SHArP
  - GID-based support
- Enhanced Inter-node and Intra-node communication schemes for upcoming architectures
  - OpenPower*
  - OmniPath-PSM2*
  - Knights Landing
- Extended topology-aware collectives
- Extended Energy-aware designs and Virtualization Support
- Extended Support for MPI Tools Interface (as in MPI 3.0)
- Extended Checkpoint-Restart and migration support with SCR
- Support for * features will be available in MVAPICH2-2.2 RC1

40 Looking into the Future...
- Exascale systems will be constrained by
  - Power
  - Memory per core
  - Data movement cost
  - Faults
- Programming Models and Runtimes for HPC need to be designed for
  - Scalability
  - Performance
  - Fault-resilience
  - Energy-awareness
  - Programmability
  - Productivity
- Highlighted some of the issues and challenges
- Need continuous innovation on all these fronts

41 Funding Acknowledgments
- Funding Support by
- Equipment Support by

42 Personnel Acknowledgments
Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Current Post-Doc: J. Lin, D. Banerjee
Current Programmer: J. Perkins
Current Research Specialist: M. Arnold
Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientist: S. Sur
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Past Programmers: D. Bureddy

43 International Workshop on Communication Architectures at Extreme Scale (ExaComm)
- ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015
  - One Keynote Talk: John M. Shalf, CTO, LBL/NERSC
  - Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)
  - Panel: Ron Brightwell (Sandia)
  - Two Research Papers
- ExaComm 2016 will be held in conjunction with ISC '16
  - Technical Paper Submission Deadline: Friday, April 15, 2016

44 Thank You!
- Network-Based Computing Laboratory
- The MVAPICH2 Project
- The High-Performance Big Data Project
