Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters


Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters. Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Accelerator Era. Accelerators are becoming common in high-end system architectures: in the Top 100 systems (Nov 2013), 25% use accelerators, and 72% of those use GPUs. An increasing number of workloads are being ported to take advantage of GPUs. As applications scale to large GPU clusters with high compute density, synchronization and communication overheads grow, and so does the penalty they impose. It is critical to minimize these overheads to achieve maximum performance.

Parallel Programming Models Overview. Three broad classes of models (diagram omitted): the Shared Memory model (e.g., DSM over a logical shared memory), the Distributed Memory model (e.g., MPI, the Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (e.g., Global Arrays, UPC, Chapel, X10, CAF). Programming models provide abstract machine models, and these models can be mapped onto different types of systems, e.g., DSM, or MPI within a node. Each model has strengths and drawbacks and suits different problems or applications.

Outline: Overview of PGAS models (UPC and OpenSHMEM); Limitations in PGAS models for GPU computing; Proposed Designs and Alternatives; Performance Evaluation; Exploiting GPUDirect RDMA.

Partitioned Global Address Space (PGAS) Models. PGAS models are an attractive alternative to traditional message passing: they offer simple shared-memory abstractions, lightweight one-sided communication, and flexible synchronization. There are different approaches to PGAS: libraries, such as OpenSHMEM and Global Arrays, and languages, such as Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, and Chapel.

OpenSHMEM Memory Model. OpenSHMEM defines symmetric data objects that are globally addressable. They are allocated using the collective shmalloc routine and have the same type, size, and offset address at all processes/processing elements (PEs), so the address of a remote object can be calculated from the information of the local object. Example: PE 0 fetches the symmetric object b from PE 1.

PE 0:
    int main (int c, char *v[]) {
        int *b;
        start_pes (0);
        b = (int *) shmalloc (sizeof(int));
        shmem_int_get (b, b, 1, 1);   /* arguments: (dst, src, count, pe) */
    }

PE 1:
    int main (int c, char *v[]) {
        int *b;
        start_pes (0);
        b = (int *) shmalloc (sizeof(int));
    }
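For reference, here is a complete, self-contained OpenSHMEM program in the same style (a minimal sketch, assuming an OpenSHMEM 1.0 toolchain such as the one in MVAPICH2-X; the neighbor-put pattern and the barriers are illustrative additions, not taken from the slide):

    #include <stdio.h>
    #include <shmem.h>

    int main (int argc, char *argv[]) {
        start_pes (0);                         /* initialize the OpenSHMEM runtime */
        int me   = _my_pe ();                  /* this PE's rank */
        int npes = _num_pes ();                /* total number of PEs */

        /* symmetric object: same type, size and offset on every PE */
        int *b = (int *) shmalloc (sizeof(int));
        *b = -1;

        shmem_barrier_all ();                  /* make sure every PE has allocated b */

        /* one-sided put: each PE writes its rank into b on its right neighbor */
        shmem_int_put (b, &me, 1, (me + 1) % npes);

        shmem_barrier_all ();                  /* make the remote puts visible */
        printf ("PE %d received %d\n", me, *b);

        shfree (b);
        return 0;
    }

Because b is symmetric, each PE can name the remote copy using its local pointer, which is what makes the one-sided put possible without exchanging addresses.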

UPC Memory Model. Each thread has a partition of the Global Shared Space, which can be accessed by all threads, and a Private Space, which holds all the normal variables and can only be accessed by the local thread. Example: a shared array A1[THREADS] with one element per thread (A1[0]..A1[3] for four threads) and a private variable y per thread.

    shared int A1[THREADS];      /* shared variable */
    int main() {
        int y;                   /* private variable */
        A1[0] = 0;               /* local access (from thread 0) */
        A1[1] = 1;               /* remote access (from thread 0) */
    }

MPI+PGAS for Exascale Architectures and Applications. Hybrid MPI+PGAS is gaining attention in efforts towards exascale computing. Hierarchical architectures with multiple address spaces suit a hybrid model: MPI across address spaces and PGAS within an address space. MPI is good at moving data between address spaces, and within an address space MPI can interoperate with other shared-memory programming models. Re-writing complete applications can be a huge effort; instead, critical kernels can be ported to the desired model.

Hybrid (MPI+PGAS) Programming. Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics, combining the best of the distributed-memory computing model with the best of the shared-memory computing model. In an HPC application, for example, Kernel 1 and Kernel 3 may use MPI while Kernel 2 and Kernel N use PGAS over MPI; a minimal sketch of this style follows the citation below. The Exascale Roadmap* identifies hybrid programming as a practical way to program exascale systems. * The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420.
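A minimal hybrid MPI + OpenSHMEM sketch (illustrative only; it assumes a unified runtime such as MVAPICH2-X in which MPI and OpenSHMEM can be initialized and used in the same program, and the two "kernel" phases are stand-ins):

    #include <stdio.h>
    #include <mpi.h>
    #include <shmem.h>

    int main (int argc, char *argv[]) {
        MPI_Init (&argc, &argv);
        start_pes (0);                      /* OpenSHMEM on top of the same unified runtime */

        int rank;
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);

        /* "Kernel 1"-style phase: bulk exchange with an MPI collective */
        double local = rank, sum = 0.0;
        MPI_Allreduce (&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* "Kernel 2"-style phase: irregular one-sided updates with OpenSHMEM */
        long *counter = (long *) shmalloc (sizeof(long));
        *counter = 0;
        shmem_barrier_all ();
        shmem_long_add (counter, 1, 0);     /* every PE atomically increments PE 0's counter */
        shmem_barrier_all ();

        if (rank == 0)
            printf ("sum = %f, counter = %ld\n", sum, *counter);

        shfree (counter);
        MPI_Finalize ();
        return 0;
    }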

MVAPICH2-X for Hybrid MPI + PGAS Applications. MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications issue OpenSHMEM calls, UPC calls, and MPI calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP. This unified communication runtime for MPI, UPC, and OpenSHMEM has been available from MVAPICH2-X 1.9 (2012) onwards: http://mvapich.cse.ohio-state.edu. Feature highlights: supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC; MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant; scalable inter-node and intra-node communication, both point-to-point and collectives. An effort is underway to support NVIDIA GPU clusters.

Outline: Overview of PGAS models (UPC and OpenSHMEM); Limitations in PGAS models for GPU computing; Proposed Designs and Alternatives; Performance Evaluation; Exploiting GPUDirect RDMA.

Limitations of PGAS Models for GPU Computing. PGAS memory models do not support disjoint memory address spaces, which is the case with GPU clusters. OpenSHMEM case: with the existing OpenSHMEM model and CUDA, GPU-to-GPU data movement must be staged through host buffers.

PE 0:
    host_buf = shmalloc (...);
    cudaMemcpy (host_buf, dev_buf, size, ...);      /* device-to-host staging copy */
    shmem_putmem (host_buf, host_buf, size, pe);
    shmem_barrier (...);

PE 1:
    host_buf = shmalloc (...);
    shmem_barrier (...);
    cudaMemcpy (dev_buf, host_buf, size, ...);      /* host-to-device copy */

The extra copies severely limit performance, and the synchronization negates the benefits of one-sided communication. UPC has similar limitations.
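Spelled out as a complete host-staged program (a sketch, assuming a CUDA host toolchain plus an OpenSHMEM library; the buffer names and the 1 MB message size are illustrative):

    #include <shmem.h>
    #include <cuda_runtime.h>

    #define NBYTES (1 << 20)     /* 1 MB, illustrative */

    int main (int argc, char *argv[]) {
        start_pes (0);
        int me = _my_pe ();

        char *dev_buf;
        cudaMalloc ((void **) &dev_buf, NBYTES);

        /* symmetric host buffer used only for staging */
        char *host_buf = (char *) shmalloc (NBYTES);

        if (me == 0) {
            cudaMemset (dev_buf, 1, NBYTES);
            /* step 1: device -> host staging copy */
            cudaMemcpy (host_buf, dev_buf, NBYTES, cudaMemcpyDeviceToHost);
            /* step 2: one-sided put of the host copy to PE 1 */
            shmem_putmem (host_buf, host_buf, NBYTES, 1);
        }

        /* step 3: synchronize so PE 1 knows the data has arrived */
        shmem_barrier_all ();

        if (me == 1) {
            /* step 4: host -> device copy on the target */
            cudaMemcpy (dev_buf, host_buf, NBYTES, cudaMemcpyHostToDevice);
        }

        shfree (host_buf);
        cudaFree (dev_buf);
        return 0;
    }

The two cudaMemcpy calls and the barrier are exactly the overheads the rest of the talk sets out to eliminate.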

Outline: Overview of PGAS models (UPC and OpenSHMEM); Limitations in PGAS models for GPU computing; Proposed Designs and Alternatives; Performance Evaluation; Exploiting GPUDirect RDMA.

Global Address Space with Host and Device Memory. The global shared space is extended across host memory and device memory: each process contributes a shared segment on host memory and a shared segment on device memory, alongside its private host and device memory. Extended APIs, heap_on_device()/heap_on_host(), indicate the location of the heap for subsequent allocations:

    heap_on_device ();                         /* allocated on device */
    dev_buf = shmalloc (sizeof(int));

    heap_on_host ();                           /* allocated on host */
    host_buf = shmalloc (sizeof(int));

A similar approach can be used for dynamically allocated memory in UPC.
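With the extended heap placement calls, a GPU-to-GPU transfer then looks like an ordinary OpenSHMEM put (a conceptual sketch of the proposed usage; heap_on_device()/heap_on_host() are the talk's proposed extensions rather than standard OpenSHMEM, and the prototypes below are assumptions):

    #include <shmem.h>

    /* assumed prototypes for the proposed heap-placement extensions */
    extern void heap_on_device (void);
    extern void heap_on_host (void);

    int main (int argc, char *argv[]) {
        start_pes (0);
        int me = _my_pe ();

        heap_on_device ();                                   /* subsequent shmalloc allocates on the GPU */
        int *dev_buf = (int *) shmalloc (sizeof(int));       /* symmetric object in device memory */

        if (me == 0)
            shmem_int_put (dev_buf, dev_buf, 1, 1);          /* runtime handles the GPU-to-GPU movement */

        shmem_barrier_all ();                                /* complete the one-sided transfer */

        shfree (dev_buf);
        return 0;
    }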

CUDA-aware OpenSHMEM and UPC Runtimes. Once device memory becomes part of the global shared space, it is accessible through standard UPC/OpenSHMEM communication APIs, data movement is handled transparently by the runtime, and one-sided semantics are preserved at the application level. Efficient designs handle the communication: inter-node transfers use host-staged copies with pipelining, intra-node transfers use CUDA IPC, and a service thread provides asynchronous, one-sided progress. Goal: enable high-performance one-sided communication with GPU devices.
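The pipelining behind the inter-node path can be sketched as follows (illustrative only, not MVAPICH2-X source; the chunk size, the two pre-allocated pinned staging buffers, and the send_chunk_to_peer() helper standing in for the runtime's RDMA send are all assumptions):

    #include <cuda_runtime.h>
    #include <stddef.h>

    #define CHUNK (256 * 1024)                 /* staging chunk size, illustrative */

    /* stand-in for the runtime's internal network send of one staged chunk */
    void send_chunk_to_peer (const void *host_chunk, size_t len, int peer);

    /* Pipeline device->host staging copies with network sends so the copy of
     * chunk i+1 overlaps the send of chunk i. */
    void pipelined_put (const char *dev_src, size_t size, int peer,
                        char *host_stage[2], cudaStream_t stream[2])
    {
        size_t off = 0, pending_len = 0;
        int slot = 0, have_pending = 0;

        while (off < size || have_pending) {
            size_t len = 0;
            if (off < size) {
                /* start staging the next chunk into the free slot */
                len = (size - off < CHUNK) ? (size - off) : CHUNK;
                cudaMemcpyAsync (host_stage[slot], dev_src + off, len,
                                 cudaMemcpyDeviceToHost, stream[slot]);
            }

            /* while that copy runs, send the previously staged chunk */
            if (have_pending)
                send_chunk_to_peer (host_stage[slot ^ 1], pending_len, peer);

            if (off < size) {
                cudaStreamSynchronize (stream[slot]);   /* staged chunk is now ready */
                pending_len = len;
                off += len;
                have_pending = 1;
                slot ^= 1;                              /* alternate staging buffers */
            } else {
                have_pending = 0;
            }
        }
    }

Overlapping the device-to-host copy of chunk i+1 with the network send of chunk i is what hides the PCIe staging cost for large messages.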

Outline: Overview of PGAS models (UPC and OpenSHMEM); Limitations in PGAS models for GPU computing; Proposed Designs and Alternatives; Performance Evaluation; Exploiting GPUDirect RDMA.

shmem_putmem Inter-node Communication (plots of latency vs. message size for small and large messages, and of latency vs. remote compute skew for one-sided progress, omitted). Small messages benefit from selective CUDA registration: 22% improvement for 4-byte messages. Large messages benefit from pipelined overlap: 28% improvement for 4 MB messages. The service thread enables one-sided progress (evaluated by varying remote compute skew). S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU Computing, Int'l Parallel and Distributed Processing Symposium (IPDPS '13).

shmem_putmem Intra-node Communication (plots of latency vs. message size for small and large messages omitted). Using CUDA IPC for intra-node communication yields 73% improvement for 4-byte messages and 85% improvement for 4 MB messages. Setup: based on MVAPICH2-X 2.0b + extensions; Intel Westmere-EP node with 8 cores, 2 NVIDIA Tesla K20c GPUs, Mellanox QDR HCA, CUDA 6.0RC1.

Application Kernel Evaluation: Stencil2D (plots of total execution time vs. problem size per GPU on 192 GPUs, and vs. number of GPUs for a 4K x 4K problem per GPU, omitted). The SHOC Stencil2D kernel was modified to use OpenSHMEM for cluster-level parallelism. The enhanced version shows a 65% improvement on 192 GPUs with a 4K x 4K problem size per GPU. Using OpenSHMEM for GPU-GPU communication allows the runtime to optimize non-contiguous transfers.

Application Kernel Evaluation: BFS (plot of total execution time vs. number of GPUs, with 1 million vertices per GPU and degree 32, omitted). The SHOC BFS kernel was extended to run on a GPU cluster using a level-synchronized algorithm and OpenSHMEM. The enhanced version shows up to 12% improvement on 96 GPUs, with a consistent improvement in performance as we scale from 24 to 96 GPUs.

Outline: Overview of PGAS models (UPC and OpenSHMEM); Limitations in PGAS models for GPU computing; Proposed Designs and Alternatives; Performance Evaluation; Exploiting GPUDirect RDMA.

Exploiting GPUDirect RDMA in OpenSHMEM (preliminary results; plots of latency vs. message size for the Naive, Enhanced, GDR, and Hybrid designs omitted). GPUDirect RDMA (GDR) is used for small message sizes, while host staging is used for large messages to avoid PCIe bottlenecks; the hybrid design brings the best of both, achieving 3.3 us latency for 4 bytes, down from 16.2 us. These GPU features will be available in future releases of MVAPICH2-X. Setup: based on MVAPICH2-X 2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.1 with the GPUDirect RDMA plugin.
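The hybrid selection described above boils down to a message-size threshold (a conceptual sketch, not MVAPICH2-X code; the threshold value and both helper functions are placeholders):

    #include <stddef.h>

    /* placeholders for the two transfer paths described in the talk */
    void put_via_gdr (void *dst, const void *src, size_t nbytes, int pe);          /* GPUDirect RDMA */
    void put_via_host_staging (void *dst, const void *src, size_t nbytes, int pe); /* pipelined host staging */

    /* illustrative crossover point; the real runtime would tune this per platform */
    #define GDR_THRESHOLD (16 * 1024)

    void hybrid_putmem (void *dst, const void *src, size_t nbytes, int pe)
    {
        if (nbytes <= GDR_THRESHOLD)
            put_via_gdr (dst, src, nbytes, pe);          /* low latency for small messages */
        else
            put_via_host_staging (dst, src, nbytes, pe); /* avoids the PCIe bottleneck for large messages */
    }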

Hangout with the Speaker. Come learn about and discuss how we make it easier to use MPI and PGAS models on NVIDIA GPU clusters. S4951 Hangout: GTC Speakers, Tuesday 03/25, 13:00-14:00 (now), Concourse Pod B.

Talks on advances in CUDA-aware MPI and hybrid HPL: 1) S4517 - Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand, Tuesday 03/25 (today), 15:00-15:25, Room LL21A; 2) S4535 - Accelerating HPL on Heterogeneous Clusters with NVIDIA GPUs, Tuesday 03/25 (today), 17:00-17:25, Room LL21A.

Thank You! panda@cse.ohio-state.edu. Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/ MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/