Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices


Jonas Hahnfeld (1), Christian Terboven (1), James Price (2), Hans Joachim Pflug (1), Matthias S. Müller (1)
1: RWTH Aachen University, Germany, email: {hahnfeld,terboven,pflug,mueller}@itc.rwth-aachen.de
2: University of Bristol, UK, email: j.price@bristol.ac.uk

WACCPD 2017: Fourth Workshop on Accelerator Programming Using Directives, Nov. 13th, 2017

Motivation & Agenda

Asynchronous offloading: are high-level models inferior to low-level APIs?
- CUDA, OpenCL: APIs, low-level, full control
- OpenMP, OpenACC: based on pragmas, ease of use, some abstractions

Agenda of this talk
1. Asynchronous Offloading Capabilities of Accelerator Models
2. Kernel used for evaluation: Conjugate Gradient Method
3. Findings on NVIDIA GPU
4. Findings on Intel Xeon Phi Coprocessor
5. Summary

Asynchronous Offloading Capabilities of Accelerator Models

Comparison of CUDA, OpenCL, OpenACC and OpenMP

Asynchronicity
- CUDA: streams; actions in different streams can execute concurrently
- OpenCL: command queues; operations in different queues can execute concurrently
- OpenACC: async clause with an optional argument to select a queue; synchronization via the acc wait construct
- OpenMP: nowait clause, because the target construct is a task; synchronization via the taskwait construct

Unstructured data movement
- CUDA: yes, implicitly
- OpenCL: yes, implicitly
- OpenACC: acc enter/exit data constructs
- OpenMP: target enter/exit data constructs

Asynchronous memory transfer
- CUDA: API: cudaMemcpyAsync
- OpenCL: blocking_write argument
- OpenACC: async clause
- OpenMP: nowait clause

Page-locked memory
- CUDA: API: cudaMallocHost
- OpenCL: no, but shared virtual memory
- OpenACC: -
- OpenMP: -

(A short directive-based sketch of these mechanisms follows below.)
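As an illustration of the directive-based entries in this comparison, here is a minimal sketch (not code from the talk; the array name x and size N are made up) of an asynchronous host-to-device transfer with OpenACC's async/wait and OpenMP 4.5's nowait/taskwait. A compiler that supports only one of the two models will simply ignore the other pragmas.

```c
#include <stdlib.h>

#define N (1 << 20)

int main(void) {
    double *x = malloc(N * sizeof(double));

    /* OpenACC: unstructured data movement on an async queue,
       synchronized via the wait directive. */
    #pragma acc enter data copyin(x[0:N]) async(1)
    /* ... other host work could run here ... */
    #pragma acc wait(1)
    #pragma acc exit data delete(x[0:N])

    /* OpenMP 4.5: the target construct is a task, so nowait defers it;
       taskwait (or a depend clause) synchronizes. */
    #pragma omp target enter data map(to: x[0:N]) nowait
    /* ... other host work could run here ... */
    #pragma omp taskwait
    #pragma omp target exit data map(delete: x[0:N])

    free(x);
    return 0;
}
```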

Offloading to multiple devices (1/2)
Performance Projection without Overlapping

Total runtime consists of computation and communication: t_total = t_comp + t_comm
Communication time, predicted for data volume d with bandwidth B: t_comm = t_overhead + d / B
t_overhead accounts for preparatory tasks
- may be significant for the overall communication time t_comm
- may depend on the data volume d

Hahnfeld, Cramer, Klemm, Terboven, Müller: A Pattern for Overlapping Communication and Computation with OpenMP Target Directives. IWOMP 2017.

Offloading to multiple devices (2/2)
Performance Projection with Pipelining Pattern

Optimal runtime when overlapping computation and communication: t_overlap ≈ max(t_comp, t_comm)
Maximum optimization over the runtime without overlapping: o = 1 - t_overlap / t_total
Approaches a performance increase of o_max = 0.5
- if communication and computation time are perfectly balanced

Hahnfeld, Cramer, Klemm, Terboven, Müller: A Pattern for Overlapping Communication and Computation with OpenMP Target Directives. IWOMP 2017.
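A short worked derivation of the o_max = 0.5 bound, assuming (as in the model above) that the pipelined runtime is well approximated by the maximum of computation and communication time:

```latex
\begin{align*}
t_{\mathrm{total}}   &= t_{\mathrm{comp}} + t_{\mathrm{comm}}, \\
t_{\mathrm{overlap}} &\approx \max(t_{\mathrm{comp}}, t_{\mathrm{comm}}), \\
o &= 1 - \frac{t_{\mathrm{overlap}}}{t_{\mathrm{total}}}
   = 1 - \frac{\max(t_{\mathrm{comp}}, t_{\mathrm{comm}})}
              {t_{\mathrm{comp}} + t_{\mathrm{comm}}}. \\
\intertext{For perfectly balanced phases, $t_{\mathrm{comp}} = t_{\mathrm{comm}} = t$:}
o_{\max} &= 1 - \frac{t}{2t} = 0.5 .
\end{align*}
```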

Pipelining Concept for Overlapping Communication
Implementation with OpenMP 4.5

Using standalone directives from OpenMP 4.5
- omp target enter data
- omp target exit data
Asynchronous tasks (nowait) with explicitly specified dependencies (see the sketch below)
In the dependency graph on the slide:
- black lines: data dependencies
- red and blue lines: mutual exclusion of enter and compute tasks, to avoid oversubscription
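A minimal sketch of this pipelining pattern, assuming the data array is split into NCHUNKS pieces and the per-chunk kernel is a simple scaling; the depend clauses mirror the data dependencies described above, while the mutual-exclusion constraints (red/blue lines) are omitted for brevity. This is not the talk's actual implementation.

```c
#include <stdio.h>
#include <stdlib.h>

#define N       (1 << 22)
#define NCHUNKS 8
#define CHUNK   (N / NCHUNKS)

int main(void) {
    double *x = malloc(N * sizeof(double));
    for (size_t i = 0; i < N; i++) x[i] = 1.0;

    #pragma omp parallel
    #pragma omp single
    {
        for (int c = 0; c < NCHUNKS; c++) {
            double *chunk = x + (size_t)c * CHUNK;

            /* asynchronous host-to-device transfer of this chunk */
            #pragma omp target enter data map(to: chunk[0:CHUNK]) \
                    depend(out: chunk[0]) nowait

            /* computation on the chunk, ordered after its transfer */
            #pragma omp target teams distribute parallel for \
                    map(alloc: chunk[0:CHUNK]) depend(inout: chunk[0]) nowait
            for (size_t i = 0; i < CHUNK; i++)
                chunk[i] *= 2.0;

            /* asynchronous device-to-host transfer of the result */
            #pragma omp target exit data map(from: chunk[0:CHUNK]) \
                    depend(in: chunk[0]) nowait
        }
        #pragma omp taskwait  /* wait for all chunk tasks to finish */
    }

    printf("x[0] = %f\n", x[0]);
    free(x);
    return 0;
}
```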

Kernel used for evaluation: Conjugate Gradient Method

Evaluation kernel: CG Method (1/3)
Conjugate Gradient Method on Multiple Devices

Iterative solver for linear equation systems:
- A * x = k
- widely used for PDE-based problems
Sparse matrix with regular sparsity pattern: Serena from the SuiteSparse Matrix Collection
- matrix is symmetric positive definite (spd)
- about 1.4 million rows and columns
- about 780 MB memory consumption in CRS format on the host and the Xeon Phi
- about 6.14 GB memory consumption in ELLPACK-R format on the GPU
Division of the matrix and each vector into partitions; the matrix-vector multiplication requires data exchange
- start the computation with local data
- apply the pipelining concept for the data transfer (see the sketch below)
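A hedged sketch of the partitioned sparse matrix-vector product in CRS format, splitting each row's work into a part that only needs the locally resident piece of x and a part that needs data from the other device; the structure (row_ptr, col_idx, values) and the local/remote split by column index are illustrative, not the talk's actual data layout.

```c
#include <stddef.h>

/* One device's partition of the matrix in CRS format (illustrative). */
typedef struct {
    size_t rows;             /* number of local rows             */
    size_t local_cols;       /* columns covered by the local x   */
    const size_t *row_ptr;   /* rows + 1 entries                 */
    const size_t *col_idx;   /* global column indices            */
    const double *values;
} crs_partition;

/* y = A_local * x, using only columns that are already resident. */
void spmv_local(const crs_partition *A, const double *x_local, double *y) {
    for (size_t i = 0; i < A->rows; i++) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            if (A->col_idx[k] < A->local_cols)
                sum += A->values[k] * x_local[A->col_idx[k]];
        y[i] = sum;
    }
}

/* y += A_remote * x: contributions from columns owned by the other
   device, added once their chunk of x has arrived (pipelined transfer). */
void spmv_remote(const crs_partition *A, const double *x_full, double *y) {
    for (size_t i = 0; i < A->rows; i++) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            if (A->col_idx[k] >= A->local_cols)
                sum += A->values[k] * x_full[A->col_idx[k]];
        y[i] += sum;
    }
}
```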

Evaluation kernel: CG Method (2/3)
Concept for executing kernels on multiple devices
[Figure: partitioning of the matrix and vectors into local and remote parts per device]

Evaluation kernel: CG Method (3/3)
Dependencies for overlapping the communication with two devices

(1) The ability to express this algorithm with a given offloading programming model gives insight into its expressiveness.
(2) The performance of the result gives insight into the quality of the implementation.

Findings on NVIDIA GPU

Evaluation System
Technical Specification

NEC GPU server system
- 2 NVIDIA Tesla P100, each:
  5.3 TFLOP/s double precision performance
  549 GB/s Triad bandwidth to HBM2, measured with BabelStream
  13.2 GB/s achievable transfer rate to the host via PCIe
- NVLink between the GPUs
  37 GB/s achievable transfer rate between the GPUs
- 2 Intel Broadwell-EP 12-core processors at 2.2 GHz
  120 GB/s Triad bandwidth to DDR4, measured with STREAM

Data Transfer with Host
Basis for the performance model

DMA: Direct Memory Access
- only possible with memory that is page-locked (pinned)
- a necessity for device data transfers
By default, CUDA (and the other models) transparently copy pageable memory through an internal pinned staging buffer
In addition, special allocation methods for pinned memory are available (see the sketch below)
Size of the x vector chunks: 5.6 MB
No explanation for the bandwidth drop after 32 MB
CUDA and OpenACC are on par; pocl is about 10% below
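A minimal CUDA sketch of the difference between pageable and pinned host allocations for asynchronous transfers; the buffer size and the single stream are illustrative, and error checking is omitted.

```c
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 5600000;               /* roughly the 5.6 MB chunks above */
    double *pageable = (double *)malloc(bytes); /* regular, pageable host memory   */
    double *pinned = NULL, *dev = NULL;
    cudaMallocHost((void **)&pinned, bytes);    /* page-locked host memory         */
    cudaMalloc((void **)&dev, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* From pageable memory the runtime stages the data through an internal
       pinned buffer, so this "async" copy cannot fully overlap. */
    cudaMemcpyAsync(dev, pageable, bytes, cudaMemcpyHostToDevice, stream);

    /* From pinned memory the copy is done by DMA and can truly overlap with
       host work and with kernels running in other streams. */
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```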

Results with a single GPU
Basis for the evaluation

Initial evaluation on the host and on a single NVIDIA Tesla P100
- host: Intel C++ 17.0.4 compiler with OpenMP
Offloading to the GPU pays off (timings include data transfer)
The number of iterations varies because of the reduction order in the vector dot product
CUDA is fastest
- the compiler generates better code with an added #pragma unroll 1

CUDA
Options to implement the data exchange between devices

Via host: utilization of PCIe in two successive transfers
- plus a temporary buffer on the host
Between devices: direct communication between the devices
- cudaMemcpyAsync with kind cudaMemcpyDeviceToDevice
- will use PCIe by default
- the runtime may employ double buffering or other optimizations
Peer to peer: utilization of NVLink (see the sketch below)
- cudaDeviceEnablePeerAccess on both devices, since each call only enables unidirectional access
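A hedged sketch of the three exchange variants between two GPUs; d0_buf/d1_buf are assumed to be allocated with cudaMalloc on device 0 and device 1, the streams created on the respective devices, and error checking is omitted. This is illustrative, not the talk's implementation.

```c
#include <stdlib.h>
#include <cuda_runtime.h>

void exchange(void *d1_buf, const void *d0_buf, size_t bytes,
              cudaStream_t s0, cudaStream_t s1) {
    /* Variant 1: via the host, two successive PCIe transfers plus a
       temporary (ideally pinned) host buffer. */
    void *tmp = NULL;
    cudaMallocHost(&tmp, bytes);
    cudaSetDevice(0);
    cudaMemcpyAsync(tmp, d0_buf, bytes, cudaMemcpyDeviceToHost, s0);
    cudaStreamSynchronize(s0);
    cudaSetDevice(1);
    cudaMemcpyAsync(d1_buf, tmp, bytes, cudaMemcpyHostToDevice, s1);
    cudaStreamSynchronize(s1);
    cudaFreeHost(tmp);

    /* Variant 2: direct device-to-device copy; the runtime routes it over
       PCIe by default and may double-buffer internally. */
    cudaMemcpyAsync(d1_buf, d0_buf, bytes, cudaMemcpyDeviceToDevice, s1);
    cudaStreamSynchronize(s1);

    /* Variant 3: peer-to-peer over NVLink; each call to
       cudaDeviceEnablePeerAccess only enables one direction. */
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMemcpyPeerAsync(d1_buf, 1, d0_buf, 0, bytes, s1);
    cudaStreamSynchronize(s1);
}
```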

CUDA Results (matvec only)

[Chart: matvec runtime in seconds for Reference (1 GPU), Naive (2 GPUs): host/device/p2p, and Overlapping (2 GPUs): host/device/p2p]

Performance model prediction: up to 46.91 % optimization
- requires two CUDA streams per device: computation & communication
Remember the performance model: the optimization effect depends on the maximum of t_comp and t_comm
- utilizing NVLink reduces the communication time
Smaller improvement for the whole CG, as the partitioning takes extra time: 5.15 s to 4.26 s in the best case

OpenCL
Quality of the implementation

NVIDIA's own OpenCL
- ... does not support page-locked memory
- ... has a performance bug in the device-to-device copy
Therefore we switched to pocl (an OpenCL 2.0 implementation)
- ... and made some improvements which will become available with the next release

OpenCL Results (matvec only)

[Chart: matvec runtime in seconds for Reference (1 GPU), Naive (2 GPUs): host/device, and Overlapping (2 GPUs): host/device]

Performance model prediction: up to 44.34 % optimization
- requires two OpenCL command queues per device: computation & communication (see the sketch below)
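A hedged sketch of how two command queues per device can separate communication from computation in OpenCL; it assumes a context, device, kernel and device buffer already exist (the names are placeholders) and omits error checking.

```c
#include <CL/cl.h>

/* Illustrative: overlap a non-blocking write with a kernel launch by using
   separate queues for communication and computation on the same device. */
void overlap_step(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                  cl_mem d_x, const double *h_x, size_t bytes, size_t n) {
    cl_command_queue comm_q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue comp_q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Non-blocking host-to-device transfer of the next chunk. */
    cl_event xfer_done;
    clEnqueueWriteBuffer(comm_q, d_x, CL_FALSE /* non-blocking */,
                         0, bytes, h_x, 0, NULL, &xfer_done);

    /* A kernel working on data that is already resident can run
       concurrently; a real implementation would pass an event wait list so
       that the kernel depends on a previous transfer. */
    clEnqueueNDRangeKernel(comp_q, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Synchronize both activities before reusing the buffers. */
    clWaitForEvents(1, &xfer_done);
    clFinish(comp_q);

    clReleaseEvent(xfer_done);
    clReleaseCommandQueue(comm_q);
    clReleaseCommandQueue(comp_q);
}
```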

OpenACC Results (matvec only) (1/2)

[Chart: matvec runtime in seconds for Reference (1 GPU), Naive (2 GPUs), and Overlapping (2 GPUs)]

Performance model prediction: up to 44.05 % optimization
OpenACC currently does not allow transfers between two devices without involving the host (see the sketch below)
Data transfer time increases
- PGI's implementation cannot transfer the matrix from pageable memory asynchronously
- issues with the runtime prevented us from using threads to parallelize the data transfer
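A hedged sketch of what the OpenACC exchange has to look like with current directives: the chunk is staged through the host copy with acc update, switching devices via acc_set_device_num. The chunk size, device numbers and function name are illustrative, the chunk is assumed to be mapped on both devices already, and the overlapping version would additionally use distinct async queues per chunk.

```c
#include <openacc.h>

#define CHUNK 700000

/* Illustrative: move one chunk of x from device 0 to device 1 by staging
   through the host, since OpenACC has no device-to-device transfer.
   Assumes x_chunk was mapped on both devices with acc enter data. */
void exchange_chunk(double *x_chunk) {
    /* Device 0: bring the current chunk back into the host copy. */
    acc_set_device_num(0, acc_device_nvidia);
    #pragma acc update self(x_chunk[0:CHUNK]) async(1)
    #pragma acc wait(1)

    /* Device 1: push the refreshed host copy to the second device. */
    acc_set_device_num(1, acc_device_nvidia);
    #pragma acc update device(x_chunk[0:CHUNK]) async(1)
    #pragma acc wait(1)
}
```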

OpenACC Results (2/2)
Not successful considering the total runtime

Findings on the Intel Xeon Phi Coprocessor

Evaluation System
Technical Specification

Bull server system
- 2 Intel Xeon Phi 5110P coprocessors, each:
  approx. 1 TFLOP/s double precision performance
  117 GB/s Triad bandwidth to GDDR5, measured with STREAM
  6.5 GB/s achievable transfer rate to the host via PCIe (gen2)
- 2 Intel SandyBridge-EP 8-core processors at 2.0 GHz
  65 GB/s Triad bandwidth to DDR3, measured with STREAM
Software
- Intel 17.0.2 compilers (17.0.4 contains a performance bug)
- Intel MPSS 3.8
- Intel OpenCL SDK 14.2

OpenMP Results
Not successful considering the total runtime

Very high overhead for launching the kernels
Intel 16.0.x compilers show better performance
- ... but do not provide asynchronous offloading

Summary

Summary
Evaluation of asynchronous offloading

Asynchronous offloading to multiple devices can deliver the expected performance improvements
- CUDA is perfectly up to the task
- OpenCL 2.0 provides all the necessary ingredients
- OpenACC can be successful
  - currently, device-to-device support is missing
  - issues with the quality of the implementation
- Intel Xeon Phi
  - implementation issues lead to bad results for OpenMP
  - GCC 7, IBM xlc and LLVM/Clang compilers will soon fully support OpenMP on GPUs
Programming is challenging: the asynchronous offload pattern may ease the implementation work
Code is available at: https://rwth-aachen.sciebo.de/index.php/s/edjjkedclhlizye

Thank you for your attention