GPU-centric communication for improved efficiency


GPU-centric communication for improved efficiency
Benjamin Klenk*, Lena Oden†, Holger Fröning*
* Heidelberg University, Germany
† Fraunhofer Institute for Industrial Mathematics, Germany
GPCDP Workshop @ Green Computing Conference 2014
November 3, 2014, Dallas, TX, USA

Green500 (June 2014)

Rank  GFLOPS/W  TFLOPS    System                                                       Power (kW)  TOP500 Rank
1     4.39      151.79    TSUBAME-KFC 1U-4GPU Cluster (Intel CPU & NVIDIA K20x, IB)    34.58       437
2     3.63      191.10    Wilkes (Dell, Intel CPU & NVIDIA K20, IB)                    52.62       201
3     3.52      277.10    HA-PACS (Cray, Intel CPU & NVIDIA K20x, IB)                  78.77       165
4     3.46      253.60    Cartesius Accelerator Island (Intel CPU & NVIDIA K40m, IB)   44.40       421
5     1.75      5,587.00  Piz Daint (Cray, Intel CPU & NVIDIA K20x, Aries)             1,753.66    6

Motivation
- There is a tradeoff between performance and energy efficiency (the Top500 and Green500 rankings differ).
- The most energy-efficient systems use NVIDIA GPUs as accelerators.
- Today's HPC systems are cluster systems.
- According to the Exascale Computing Study, most energy is spent on communication.
Sources: Top500, www.top500.org, June 2014; Green500, www.green500.org, June 2014; Exascale Computing Study: Technology Challenges in Achieving Exascale Systems, September 28, 2008.

Motivation
Accelerators (e.g., GPUs) are widely used, but they:
- only excel in performance when data can be held in on-chip memory (a scarce resource),
- are deployed in cluster systems, requiring communication to work on large amounts of data,
- cannot control communication themselves and must return control flow to the CPU,
- are limited in performance by PCIe data copies.
What we need is a direct communication path between distributed GPUs, getting rid of PCIe copies and context switches.

Outline
- Background
- GPU-centric communication
- Energy analysis
- Conclusion & Future Work

Background: Why GPUs?
- Massively parallel (demanded by many applications)
- Power efficient:
  Intel Xeon E5-2687W: 1.44 GFLOPS/W
  NVIDIA K20 GPU: 14.08 GFLOPS/W
  NVIDIA Tegra K1: 32.60 GFLOPS/W (GPU only)
Drawbacks:
- Memory is a scarce resource
- PCIe copies
- Communication is still CPU controlled

GPU Computing: architecture (NVIDIA K20)
- 13 SMs with 192 cores each (2,496 cores total)
- Per SM: 65K-entry register file, 16 KB/48 KB configurable L1 cache / shared memory
- Shared L2 cache; 5 GB global memory at 208 GB/s
- Connected to host memory (~TBs) over PCIe 2.0 at ~6 GB/s
(Architecture block diagram omitted.)

GPU Computing: execution model
- CUDA (Compute Unified Device Architecture)
- Kernels define instructions that are executed by every thread (SIMT: Single Instruction, Multiple Threads)
- Threads are organized in blocks, which form a grid
- Blocks are tightly bound to SMs (about 2 per SM run concurrently)
- 32 threads (a warp) are executed in parallel
- Memory access latencies are tolerated by scheduling thousands of threads (hardware multi-threading)
- Synchronization is only possible within blocks (see the sketch below)
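As an illustration only (not taken from the slides), a minimal CUDA kernel showing the grid/block organization and the per-block synchronization just described:

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; threads are grouped into blocks,
// blocks form the grid, and the hardware issues them in warps of 32.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
    __syncthreads();  // synchronization is only possible within a block
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // 256 threads per block (8 warps); enough blocks to cover n elements.
    // The SM scheduler hides memory latency by switching between warps.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```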

How do we measure power/energy?
- Vendors provide software APIs to obtain power information:
  Intel: CPU and DRAM power provided by RAPL
  NVIDIA: the NVML library provides power information
- Network power can be assumed to be static (no dynamic switching of links on and off, embedded clock)
- We implemented a Python framework that starts the application, polls RAPL and NVML during the application's run time, and writes the results to disk and creates graphs (see the sketch below)
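The authors' measurement framework is a Python script; purely as a sketch of the same polling idea on the C/CUDA side, assuming NVML is available and RAPL is exposed through the Linux powercap sysfs node (both assumptions about the measurement host), it might look like this:

```cuda
#include <nvml.h>
#include <cstdio>
#include <fstream>
#include <unistd.h>

// Poll GPU power via NVML and the CPU package energy counter via RAPL
// once per interval; the authors' actual framework does the equivalent
// in Python and writes the samples to disk for plotting.
int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int sample = 0; sample < 100; ++sample) {
        unsigned int gpu_mw = 0;                 // GPU board power in milliwatts
        nvmlDeviceGetPowerUsage(dev, &gpu_mw);

        unsigned long long pkg_uj = 0;           // CPU package energy in microjoules
        std::ifstream rapl("/sys/class/powercap/intel-rapl:0/energy_uj");
        rapl >> pkg_uj;                          // cumulative counter; diff successive reads

        printf("%d,%u,%llu\n", sample, gpu_mw, pkg_uj);
        usleep(100000);                          // 100 ms sampling interval
    }

    nvmlShutdown();
    return 0;
}
```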

GPU-centric communication

State of the art
- Communication is tailored for CPUs: the Message Passing Interface (MPI) is the standard for cluster systems
- But there is no MPI on GPUs: data is copied over PCIe with cudaMemcpy to the host, sent with MPI_Send over the network, and copied back to the GPU on the target node (a sketch of this staged path follows below)
- Optimizations: GPUDirect RDMA (in theory), host memory staging copies, pipelining of cudaMemcpy and MPI_Send
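A minimal sketch of the host-staged CUDA+MPI path described above (our own illustration, not the authors' code; GPUDirect RDMA and pipelining refine this same pattern):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdlib>

// CPU-controlled transfer of a GPU buffer: stage through host memory,
// then hand the data to MPI; the receiver mirrors these steps.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t bytes = 1 << 20;
    void *d_buf, *h_staging = malloc(bytes);
    cudaMalloc(&d_buf, bytes);

    if (rank == 0) {
        // 1. PCIe copy from GPU memory into a host staging buffer
        cudaMemcpy(h_staging, d_buf, bytes, cudaMemcpyDeviceToHost);
        // 2. CPU posts the network transfer
        MPI_Send(h_staging, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_staging, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // 3. PCIe copy from the host staging buffer into GPU memory
        cudaMemcpy(d_buf, h_staging, bytes, cudaMemcpyHostToDevice);
    }

    cudaFree(d_buf);
    free(h_staging);
    MPI_Finalize();
    return 0;
}
```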

Direct GPU communication path
- Bypass the CPU and communicate directly: the GPU controls the network hardware
- No context switches are necessary, reducing latency and improving energy efficiency
- Examples (classified on the slide as CPU-controlled vs. GPU-controlled): GGAS, GPU RMA, DCGN, MVAPICH-GPU
(Diagram comparing the CPU-controlled path over PCIe with the GPU-controlled path directly to the network.)

GGAS
- Idea: a global address space for GPUs; memory can be accessed by every other GPU
- Communication is inline with the execution model: thread-collaborative, and control flow remains on the GPU
- A global barrier enables synchronization
(Diagram: GPU0..GPUn sharing a Global Address Space (GAS) accessed through plain memory accesses; a hypothetical put sketch follows below.)
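The slides do not show the GGAS API; the following is a purely hypothetical sketch of what a thread-collaborative put into a mapped GAS window could look like (all names are illustrative, not the real interface):

```cuda
// Hypothetical GGAS-style put: remote_window is a device pointer that the
// runtime has mapped onto another node's GPU memory via the SMFU, so a
// plain store from any thread becomes a remote write. All names here are
// illustrative; the real GGAS interface differs.
__global__ void ggas_put(float *remote_window, const float *local, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        remote_window[i] = local[i];   // forwarded by the NIC to the target GPU
    }
    // A global barrier provided by the runtime would follow to signal
    // completion to the remote side; omitted here.
}
```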

GGAS data path
(Diagram: a store to the GAS on the source GPU travels over PCIe to the NIC, crosses the network as a packet, and the remote NIC writes the data directly into the target GPU's memory; annotations contrast the CUDA- and MPI-stack traversals of the conventional path and indicate possible overlap of communication with computation.)

GGAS hardware: the EXTOLL interconnect
- A research project at Heidelberg University; a company has been founded
- Still FPGA based (157 MHz); an ASIC (~800 MHz) is under test
- A functional unit spans global address spaces (SMFU): a PCIe BAR is mapped to user space, and incoming memory transactions are forwarded and completed on the target side

Energy analysis

Benchmarks
- We implemented and measured basic microbenchmarks: bandwidth, latency, and barrier (synchronization costs)
- Each benchmark is implemented with CPU-tailored communication (MPI+CUDA) and with direct GPU communication (GGAS)
- We introduce a new metric: Joules per word of data transferred; the whole system is considered, not only the network link energy consumption (see the formula below)
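A plausible formalization of the Joules-per-word metric as we read the slide (the notation -- power terms, benchmark interval [t0, t1], transferred bytes B, word size w -- is ours, not the authors'):

```latex
E_{\text{word}} = \frac{\int_{t_0}^{t_1} \left( P_{\text{CPU}}(t) + P_{\text{GPU}}(t) + P_{\text{net}} \right) dt}{B / w}
```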

Bandwidth
- Simple streaming bandwidth test
- GGAS: launch a CUDA kernel; within the kernel, write the data to the remote GPU; finish
- CUDA+MPI: copy the data from GPU to CPU memory, then MPI send (with the respective receive on the target side); on the target side, copy the data from CPU to GPU memory

Bandwidth results
(Plot: energy per word [J/word, 1e-8 to 1e-3] and bandwidth [MB/s, 0 to 1000] versus message size [128 B to 4 MB] for GGAS and MPI; annotations mark 6x and 3x gaps between the two. The slide attributes the savings to the CPU being idle during GGAS transfers.)

Latency
- Ping-pong test to determine the half-round-trip latency
- GGAS: launch a CUDA kernel; ping: write data to the remote GPU and wait for the response; pong: wait for the data and write the response to the remote GPU; finish
- CUDA+MPI: ping: memory copy to the CPU, then MPI send, followed by a receive and a memory copy back to the GPU; pong: MPI receive, memory copy to the GPU and back, MPI send for the response

Latency results
(Plot: energy per word [J/word, 1e-5 to 1e0] and half-round-trip latency [us, 1 to 30] versus message size [8 B to 1024 B] for GGAS and MPI.)

Barrier
- The barrier is called several times
- GGAS: launch a CUDA kernel and call the barrier many times; finish
- CUDA+MPI: launch a dummy kernel (to account for context switches), call the MPI barrier, repeat (a sketch follows below)
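A sketch of the CUDA+MPI barrier loop as we read the slide (our own illustration; the dummy kernel accounts for the GPU-to-CPU context switch that GGAS avoids):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Empty kernel: stands in for handing control back and forth between
// GPU and CPU, which GGAS avoids.
__global__ void dummy() {}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    const int iterations = 1000;
    for (int i = 0; i < iterations; ++i) {
        dummy<<<1, 1>>>();
        cudaDeviceSynchronize();       // control returns to the CPU (context switch)
        MPI_Barrier(MPI_COMM_WORLD);   // host-side synchronization across nodes
    }
    MPI_Finalize();
    return 0;
}
```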

Barrier results
(Plot: energy per barrier call [1e-4 to 1e-2] and barrier latency [us, 0 to 40] versus number of nodes [2 to 12] for GGAS and MPI.)

Power consumption
- Numbers refer to the average power consumption over all measured message sizes
- CPU power is always lower with GGAS; GPU power is about the same
- Total power savings range from 10 W to 15 W

Conclusion & Future Work

Conclusion
- We implemented and analyzed three microbenchmarks with respect to energy per word of data transferred
- GPU-centric communication shows improved energy efficiency:
  bandwidth: 50.67% energy savings, 2.42x performance
  latency: 85.01% energy savings, 6.96x performance
  barrier: 67.31% energy savings, 2.28x performance
- This is due to two aspects:
  performance: context switches increase latency
  power consumption: context switches prevent the CPU from entering power-saving states

Insights
- GPU-centric communication is superior in performance and energy efficiency
- It is more energy efficient to communicate larger messages than very small ones (aggregating messages can be an option to improve energy efficiency)
- Synchronization is more energy efficient when it does not require context switches
- The energy/word metric characterizes overall communication efficiency (including the source and sink processors), not only the physical link implementation (as an energy/bit metric does)

Future Work
- Analysis of the impact of communication methods on application performance and energy consumption
- A communication library for GPUs, including thread-collaborative communication, one-sided communication (offloading to the NIC), and collective operations (reduce, all-to-all, scatter, gather)
- Exploration of further optimizations, including simulation
- Power and performance models including CPU, GPU, and network

Thank you for your attention! Q&A

References
[1] Top500, www.top500.org, June 2014.
[2] Green500, www.green500.org, June 2014.
[3] Exascale Computing Study: Technology Challenges in Achieving Exascale Systems, September 28, 2008.
[4] L. Oden and H. Fröning, "GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters," IEEE International Conference on Cluster Computing (CLUSTER), 2013.
[5] J. A. Stuart and J. D. Owens, "Message Passing on Data-Parallel Architectures," IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1-12, May 23-29, 2009.
[6] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and D. K. Panda, "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs," International Conference on Parallel Processing (ICPP '13), October 2013.
[7] H. Fröning and H. Litz, "Efficient Hardware Support for the Partitioned Global Address Space," 10th Workshop on Communication Architecture for Clusters (CAC 2010), co-located with IPDPS 2010, April 19, 2010, Atlanta, Georgia.
[8] B. Klenk, L. Oden, and H. Fröning, "Analyzing Put/Get APIs for Thread-Collaborative Processors," Workshop on Heterogeneous and Unconventional Cluster Architectures and Applications (HUCAA), co-located with ICPP 2014, September 12, 2014, Minneapolis, Minnesota.