Unifying UPC and MPI Runtimes: Experience with MVAPICH


Jithin Jose, Miao Luo, Sayantan Sur, and D. K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering, The Ohio State University, USA

Introduction
- UPC and PGAS concepts are gaining interest
- The exascale programming model roadmap is MPI + X. Is X == UPC? Maybe!
- MPI has been around for many years, and hundreds of man-years have been invested in scientific software; we cannot afford to re-implement all of it in PGAS
- InfiniBand is an open standard, fast and scalable; MPI (MVAPICH, MVAPICH2) is optimized to the hilt, and it is not productive to re-implement this for PGAS
- We must allow applications to be incrementally optimized with UPC; a unified runtime is a first step in this direction

The Need for a Unified Runtime
[Figure: P0 issues upc_memget for data on P1 and then calls MPI_Barrier; P1 does local computation and then calls MPI_Barrier. The memget request sits in the UPC runtime while both processes wait inside the MPI runtime.]
- Deadlock can occur when a message is sitting in one runtime while the application calls into the other runtime
- The current prescription to avoid this is to barrier in one model (either UPC or MPI) before entering the other
- This leads to bad performance!
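To make the hazard concrete, here is a minimal hybrid sketch of the pattern in the figure, assuming one UPC thread per MPI rank; the array and buffer names are illustrative and not taken from the paper.

    /* Minimal sketch, assuming one UPC thread per MPI rank; names are illustrative. */
    #include <upc.h>
    #include <mpi.h>

    #define N 4096
    shared [N] char remote[THREADS][N];   /* one block with affinity to each thread */
    char buf[N];

    void fetch_then_sync(void)
    {
        if (MYTHREAD == 0)
            upc_memget(buf, &remote[1][0], N);   /* pull data owned by thread 1 */

        /* If the memget request is still being serviced inside the UPC runtime
         * while every process blocks inside the MPI runtime here, neither
         * runtime makes progress for the other: with two separate runtimes
         * this can deadlock. The usual workaround is an explicit upc_barrier
         * before the MPI_Barrier. */
        MPI_Barrier(MPI_COMM_WORLD);
    }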

Coercing UPC over MPI is not Optimal
- MPI does not provide Active Messages (AMs); AMs are critical to UPC compilation and performance, simulating AMs over MPI leads to performance loss (see the sketch below), and AMs are not going to be included in MPI-3
- The MPI RMA model targets non-cache-coherent machines and penalizes the vast majority of machines, which are cache coherent; MPI-3 is considering a proposal to support both cache-coherent and non-cache-coherent machines, but this will take time
- MPI will not support instant teams: communicators in MPI require group communication
- The path forward is to unify the runtimes, not the programming models
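As a rough illustration of the message-queue processing overhead mentioned above, the following sketch shows how an Active Message might be emulated over two-sided MPI: the target must poll for incoming messages and dispatch handlers itself. The handler table, tag, and encoding are assumptions for illustration, not GASNet or MVAPICH code.

    /* Toy emulation of an Active Message over two-sided MPI (an assumption for
     * illustration, not GASNet-over-MPI source). The target must poll and
     * dispatch handlers from a message queue, which adds the overhead noted above. */
    #include <mpi.h>

    #define AM_TAG 77
    #define AM_MAX 4096

    typedef void (*am_handler_t)(void *payload, int len, int src);
    extern am_handler_t am_table[256];   /* handler id -> function (assumed) */

    /* Called periodically by the runtime to drain pending "AMs". */
    void am_poll(void)
    {
        int flag;
        MPI_Status st;
        char buf[AM_MAX];

        MPI_Iprobe(MPI_ANY_SOURCE, AM_TAG, MPI_COMM_WORLD, &flag, &st);
        while (flag) {
            int len;
            MPI_Recv(buf, AM_MAX, MPI_BYTE, st.MPI_SOURCE, AM_TAG,
                     MPI_COMM_WORLD, &st);
            MPI_Get_count(&st, MPI_BYTE, &len);
            /* first byte carries the handler id in this toy encoding */
            am_table[(unsigned char)buf[0]](buf + 1, len - 1, st.MPI_SOURCE);
            MPI_Iprobe(MPI_ANY_SOURCE, AM_TAG, MPI_COMM_WORLD, &flag, &st);
        }
    }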

Outline
- Introduction
- Problem Statement
- Proposed Design
- Experimental Results & Analysis
- Conclusions & Future Work

Problem Statement
- Can we design a communication library for UPC that is scalable on large InfiniBand clusters and provides equal or better performance than the existing runtime?
- Can this library support both MPI and UPC: each individually with great performance, and both simultaneously with great performance and less memory?

Overall Approach
[Figure: today, the UPC compiler runs over GASNet and MPI runs over its own runtime, each with separate buffers and QPs on the network interface; in the proposed design, both share a unified MPI + GASNet runtime over a single network interface.]
- The unified runtime provides APIs for both MPI and GASNet
- We call it INCR (Integrated Communication Runtime)

The INCR Interface
- Different AM APIs based on message size, for optimization (see the sketch below):
  - Short AM (no data payload, sent without arguments)
  - Medium AM (bounce buffers using the RDMA fast path)
  - Large AM (RDMA Put, with on-demand connections)
- GASNet Extended interface for efficient RMA:
  - Inline put
  - Put (may be internally buffered)
  - Put bulk (send buffer will not be touched; no buffering)
  - Get (RDMA Read)
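The talk does not list the actual INCR prototypes, so the declarations below are purely an assumed sketch of an interface with the three AM sizes and the GASNet-style RMA calls above; every name and signature here is hypothetical.

    /* Hypothetical sketch only: the actual INCR API is not shown in the talk,
     * so every name and signature below is an assumption mirroring the bullets above. */
    #include <stddef.h>

    /* Active Messages, split by payload size */
    int incr_am_short (int dest, int handler_id);                            /* no payload */
    int incr_am_medium(int dest, int handler_id, const void *buf, size_t n); /* bounce buffer via RDMA fast path */
    int incr_am_large (int dest, int handler_id, const void *buf, size_t n); /* RDMA Put, on-demand connection */

    /* GASNet-extended-style RMA */
    int incr_put_inline(int dest, void *raddr, const void *buf, size_t n);
    int incr_put       (int dest, void *raddr, const void *buf, size_t n);   /* may be internally buffered */
    int incr_put_bulk  (int dest, void *raddr, const void *buf, size_t n);   /* source buffer not touched, no buffering */
    int incr_get       (void *laddr, int dest, const void *raddr, size_t n); /* RDMA Read */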

Unified Implementation
[Figure: MPI and UPC messages flow through the same MPI library on the sender and receiver over shared connections and HCAs; the receiver dispatches UPC traffic through an AM handler table and MPI traffic through the MPI message queue.]
- All resources are shared between MPI and UPC: connections, buffers, and memory registrations
- Shared schemes for establishing connections (fixed, on-demand)
- RDMA is used for large AMs and for Put/Get

Various Configurations for Running UPC and MPI Applications
[Figure: the four software stacks compared in the evaluation, all running over an InfiniBand network.]
- MPI: pure MPI applications over the MVAPICH MPI standard interface on the MVAPICH-Aptus runtime
- GASNet-IBV: pure UPC applications; the UPC compiler and GASNet interface/UPC runtime over the GASNet IBVerbs runtime
- GASNet-MPI: pure UPC or UPC + MPI applications; the GASNet MPI runtime layered on the MVAPICH MPI standard interface and MVAPICH-Aptus runtime
- GASNet-INCR: pure UPC or UPC + MPI applications; the GASNet INCR runtime over the MVAPICH INCR implementation (our contribution) on the MVAPICH-Aptus runtime

MVAPICH and MVAPICH2 Software
- High-performance, scalable, and fault-tolerant MPI libraries for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
- Developed by the Network-Based Computing Laboratory, OSU
- 45,000 direct downloads from the OSU site; included in InfiniBand OFED, RedHat, SuSE, etc.
- Used by more than 1,275 organizations worldwide, including many of the Top500 supercomputers (June 2010 ranking):
  - 6th-ranked 81,920-core Pleiades at NASA
  - 7th-ranked 71,680-core Tianhe-1 at NUDT, China
  - 11th-ranked 62,976-core Ranger at TACC
  - 34th-ranked 18,224-core Juno at LLNL
- MVAPICH-Aptus runtime: designed as a hybrid of Unreliable Datagram, Shared Receive Queues, eXtended Reliable Connection (XRC), and the RDMA Fast Path (M. Koop, T. Jones, and D. K. Panda, "MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand," IPDPS '08)
- The proposed designs will be integrated into MVAPICH2 for public release

Experimental Setup
- MVAPICH version 1.1, extended to support INCR
- Berkeley GASNet version 2.10.2 (--enable-pshm)
- Experimental testbed:
  - Cluster 1: Intel Nehalem (dual-socket, quad-core Xeon 5500, 2.4 GHz) with ConnectX QDR InfiniBand
  - Cluster 2: Intel Clovertown (dual-socket, quad-core Xeon, 2.33 GHz) with ConnectX DDR InfiniBand
  - Cluster 3: AMD Barcelona (quad-socket, quad-core Opteron 8530) with ConnectX DDR InfiniBand

Microbenchmark: upc_memput
[Figure: small-message latency (us) and bandwidth (MB/s) versus message size for GASNet-INCR, GASNet-IBV, and GASNet-MPI; Cluster 1 used for these experiments.]
- GASNet-INCR performs identically to GASNet-IBV
- GASNet-MPI performs much worse, due to the mismatch of Active Message semantics and message queue processing overheads
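For reference, a minimal upc_memput latency loop in the spirit of this microbenchmark might look as follows; this is a sketch assuming at least two UPC threads, not the benchmark code used for the reported numbers.

    /* Sketch of a upc_memput latency loop; assumes THREADS >= 2. */
    #include <upc.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define MAX_MSG (256 * 1024)
    #define ITERS   1000

    shared [MAX_MSG] char dst[THREADS][MAX_MSG];  /* one block per UPC thread */
    char src[MAX_MSG];

    int main(void)
    {
        for (int size = 1; size <= MAX_MSG; size *= 2) {
            upc_barrier;
            if (MYTHREAD == 0 && THREADS > 1) {
                struct timeval t0, t1;
                gettimeofday(&t0, NULL);
                for (int i = 0; i < ITERS; i++)
                    upc_memput(&dst[1][0], src, size);  /* put into thread 1's block */
                upc_fence;                              /* complete outstanding puts */
                gettimeofday(&t1, NULL);
                double us = ((t1.tv_sec - t0.tv_sec) * 1e6 +
                             (t1.tv_usec - t0.tv_usec)) / ITERS;
                printf("%7d bytes: %8.2f us\n", size, us);
            }
            upc_barrier;
        }
        return 0;
    }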

Microbenchmark: upc_memget
[Figure: small-message latency (us) and bandwidth (MB/s) versus message size for GASNet-INCR, GASNet-IBV, and GASNet-MPI.]
- GASNet-INCR performs identically to GASNet-IBV
- The mismatch of AM semantics with MPI again leads to worse performance for GASNet-MPI

Memory Scalability
[Figure: memory consumption (MB) of a UPC hello-world program at 16 to 256 processes for GASNet-INCR, GASNet-IBV, and GASNet-MPI; Cluster 2 used for this experiment.]
- GASNet-IBV establishes all-to-all reliable connections, which is not scalable (this may be improved in a future release)
- GASNet-INCR shows the best scalability, due to the inherent Aptus design

Evaluation using UPC NAS Benchmarks
[Figure: execution time (sec) of MG, FT, and CG, Classes B and C, at 64 and 128 processes, for GASNet-INCR, GASNet-IBV, and GASNet-MPI; Cluster 3 used for these experiments.]
- GASNet-INCR performs equal to or better than GASNet-IBV
- 10% improvement for CG (Class B, 128 processes)
- 23% improvement for MG (Class B, 128 processes)

Evaluation using Hybrid NAS-FT
[Figure: execution time (sec) of the modified NAS FT, Classes B and C, at 64 and 128 processes, for GASNet-INCR, GASNet-IBV, and GASNet-MPI (Hybrid); Cluster 3 used for this experiment.]
- NAS FT was modified so that the UPC all-to-all pattern uses MPI_Alltoall, making it a truly hybrid program (see the sketch below)
- 34% improvement for FT (Class C, 128 processes)
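A hedged sketch of the hybrid pattern: the UPC transpose exchange is replaced by a direct call to MPI_Alltoall, assuming one MPI rank per UPC thread. The buffer names and chunk parameter are illustrative, not NAS FT code.

    /* Sketch of the hybrid pattern: an MPI collective called from UPC code,
     * assuming one MPI rank per UPC thread. Names are illustrative. */
    #include <upc.h>
    #include <mpi.h>

    void transpose_exchange(double *sendbuf, double *recvbuf, int chunk)
    {
        /* With the unified GASNet-INCR/MVAPICH runtime, this MPI collective can
         * be issued directly from UPC code without the usual upc_barrier /
         * MPI_Barrier hand-off between the two runtimes. */
        MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                     recvbuf, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }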

Conclusions and Future Work
- The Integrated Communication Runtime (INCR) supports MPI and UPC simultaneously
- The results are promising: MPI communication is not harmed and UPC communication is not penalized
- The programmer no longer needs to barrier between UPC and MPI modes, as is current practice
- Pure UPC NAS: 10% improvement for CG (B, 128) and 23% improvement for MG (B, 128)
- MPI+UPC FT: 34% improvement (C, 128)
- A public release with MVAPICH2 is coming soon

Thank You!
{jose, luom, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/