OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel


www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es

Exascale in BSC

Marenostrum 4 (13.7 Petaflops): a general-purpose cluster (3,400 nodes) with Intel Xeon, plus emerging-technology clusters: 1. IBM Power9 + NVIDIA GPU; 2. Intel Knights Landing (KNL) and Intel Knights Hill (KNH); 3. 64-bit ARMv8 processors from Fujitsu.

Research lines:
- OmpSs parallel programming model: simple data-directionality annotations for tasks; asynchronous data-flow execution that moves the intelligence into the runtime.
- BSC Tools, performance analysis tools: Extrae, Paraver and Dimemas; performance analytics for intelligence and insight.
- CUDA Center of Excellence: PUMPS summer school 2010-2017, courses at BSC and UPC.
- Mont-Blanc: exploring the potential of low-power GPU clusters as high-performance platforms.

www.bsc.es Home of OmpSs Programming Model

OmpSs Programming Model

A parallel programming model: directive-based, so a serial version of the code is kept. It targets SMP, clusters and accelerator devices. The experimental platform consists of the Mercurium source-to-source compiler for Fortran/C/C++, the Nanos runtime, and applications. OmpSs is a forerunner for OpenMP: extending and following OpenMP.

OmpSs Programming Model: Key Concept

Single program, any target: a sequential, task-based program on a single address/name space plus directionality annotations, which happens to execute in parallel thanks to automatic runtime computation of the dependencies between tasks.

What differentiates OmpSs:
- Dependences: tasks can be instantiated but not yet ready. This look-ahead avoids stalling the main control flow when a computation depending on previous tasks is reached, and makes it possible to "see the future", searching for further potential concurrency.
- Dependences are built from the data-access specification, making the model locality-aware without defining new concepts.
- Homogenizing heterogeneity: device-specific tasks, but homogeneous program logic.

Task-based concepts of OmpSs: a minimalist set of concepts. Key: already in OpenMP / influenced OpenMP / being pushed to OpenMP but not yet adopted.

#pragma omp task [in(array_spec, l_values...)] [out(...)] [inout(..., v[neigh[j]], j=0;n)] \
    [concurrent(...)] [commutative(...)] [priority(p)] [label(...)] \
    [shared(...)] [private(...)] [firstprivate(...)] [default(...)] [untied] \
    [final(expr)] [if(expression)] \
    [reduction(identifier : list)] \
    [resources(...)]
{ code block or function }

#pragma omp taskwait [in|out|inout(...)] [noflush]

#pragma omp taskloop [grainsize(...)] [num_tasks(...)] [nogroup] [in(...)] [reduction(identifier : list)]
{ for_loop }

OpenMP compatibility: follow OpenMP syntax for adopted OmpSs features, and adapt the semantics of OpenMP features, to ensure high compatibility.

#pragma omp parallel  // ignored
#pragma omp for [shared(...)] [private(...)] [firstprivate(...)] [schedule_clause]  // treated as taskloop
{ for_loop }
#pragma omp task [depend(type: list)]

OpenMP 4.5 GPU Offload Support: the MACC compiler, an experimental branch, supports OpenMP 4.5 GPU offload by relying on the OmpSs task model and mapping OpenMP 4.5 directives onto OmpSs. It generates CUDA/OpenCL code. Key concepts: it proposes clauses that improve kernel performance, and it requires a change in mentality, since minor details make a difference.

#pragma omp target device(acc)
#pragma omp task
#pragma omp teams distribute parallel for
{ for_loop }

#pragma omp target device(acc)
#pragma omp task
#pragma omp parallel for
{ for_loop }

#pragma omp taskwait [on(...)] [noflush]

OmpSs GPU Support: a single-address-space program executes in several non-coherent address spaces. Copy clauses ensure a sequentially consistent copy is accessible in the address space where the task is going to be executed; this requires a precise specification of the data accessed (e.g. array sections). The runtime offloads data and computation and manages consistency. Kernel-based programming separates the iteration-space identification (ndrange) from the loop body.

#pragma omp target device({smp|opencl|cuda}) \
    [copy_deps|no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
    [implements(function_name)] \
    [shmem(...)] \
    [ndrange(dim, g_array, l_array)]

#pragma omp taskwait [on(...)] [noflush]

GPU Execution Model of OmpSs

#pragma omp target device(cuda) ndrange(1, N, 128)
#pragma omp task in(C) out(D)
__global__ void MyFastKernel(double *C, double *D, int N)
{ /* CUDA kernel code */ }

int main() {
    double A[N], B[N], C[N], D[N];
    for (int j = 0; j < 2; ++j) {
        MyFastKernel(C, D, N);

        #pragma omp target device(acc)
        #pragma omp task in(A) out(B)
        #pragma omp teams distribute parallel for
        for (i = 0; i < N; ++i) { /* sequential code from which CUDA is generated */ }

        #pragma omp target device(acc)
        #pragma omp task inout(A, B)
        #pragma omp teams distribute parallel for
        for (i = 0; i < N; ++i) { /* sequential code from which CUDA is generated */ }

        #pragma omp target device(acc)
        #pragma omp task inout(C, B) in(D)
        #pragma omp teams distribute parallel for
        for (i = 0; i < N; ++i) { /* sequential code from which CUDA is generated */ }

        #pragma omp target device(smp)
        #pragma omp task in(A, C)
        { /* CPU code / print results to file */ }
    }
    #pragma omp taskwait
}

The slide's execution diagram shows the runtime inserting the memory transfers automatically around the task executions: memcpy H2D(C), memcpy H2D(A), memcpy D2D(B), memcpy D2H(C) and memcpy D2H(A).

www.bsc.es Wouldn't it be great to have OpenACC in OmpSs?

Idea & Motivation

Motivation: OpenACC compilers deliver the best performance by generating highly optimized GPU code, while OmpSs has powerful task support that allows maintaining the entire application: a single address space targets any (or multiple) devices, with the potential to run the same tasks on hybrid GPU + CPU configurations. Goal: make use of OpenACC's GPU support within the OmpSs task model.

OpenACC Integration in OmpSs: a new device type, openacc, is added. The key idea: start with OmpSs to manage task dependencies, data and multiple devices, then parallelize with OpenACC.

#pragma omp target device(openacc)
#pragma omp task [in|out|inout(...)]
#pragma acc kernels [clause-list]
{ code block }

#pragma omp target device(openacc)
#pragma omp task [in|out|inout(...)]
#pragma acc parallel [clause-list]
{ code block }

Compilation Workflow: device management is done by OmpSs and passed to OpenACC; streams are managed by OmpSs and passed to OpenACC; each kernel is submitted asynchronously.

Input code:

int main(int argc, char* argv[]) {
    double a[N], b[N], c[N];
    ...
    #pragma omp target device(openacc)
    #pragma omp task in(a[:N], b[:N]) out(c[:N])
    #pragma acc parallel loop deviceptr(a, b, c)
    for (int i = 0; i < N; ++i) {
        c[i] = a[i] + b[i];
    }
    ...
    #pragma omp taskwait
    return 0;
}

The Mercurium compiler (C/C++/Fortran) splits this into host code, compiled by the host backend compiler, and OpenACC code, compiled by the OpenACC compiler; the two are linked into the executable. The generated OpenACC code:

#include <openacc.h>
#include <cuda_runtime.h>

extern "C" {
    extern int nanos_get_device_id_();
    extern cudaStream_t nanos_get_kernel_stream();
    extern unsigned int nanos_get_kernel_stream_id();
}

void oacc_ol_main_0_7_vecadd_unpacked(int* a, int* b, int* c, int N) {
    acc_set_device_num(nanos_get_device_id_(), acc_device_nvidia);
    acc_set_cuda_stream(nanos_get_kernel_stream_id(), nanos_get_kernel_stream());
    #pragma acc parallel loop deviceptr(a, b, c) async(nanos_get_kernel_stream_id())
    for (int i = 0; i < N; ++i) {
        c[i] = a[i] + b[i];
    }
}

Stream Benchmark, 1st Style (single GPU): OmpSs creates a single OpenACC task, which runs on a single GPU.

void triad(T* a, T* b, T* c, T scalar, int N) {
    #pragma omp target device(openacc)
    #pragma omp task in(b[0:N], c[0:N]) out(a[0:N])
    #pragma acc parallel loop deviceptr(a, b, c)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
}

int main(int argc, char const *argv[]) {
    ...
    copy(a, c, size);
    scale(b, c, size);
    add(a, b, c, scalar, size);
    triad(a, b, c, scalar, size);
}

device: only openacc is requested. copy_deps: copies dependencies to the target. Dependencies are specified and OmpSs manages the data; the symbols are passed through the deviceptr clause to inform OpenACC.

Stream Benchmark, 2nd Style (multiple GPUs): OmpSs creates multiple OpenACC tasks, which are run automatically on multiple GPUs.

void triad(T* a, T* b, T* c, T scalar, int N) {
    #pragma omp target device(openacc)
    #pragma omp task in(b[0:N], c[0:N]) out(a[0:N])
    #pragma acc parallel loop deviceptr(a, b, c)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
}

int main(int argc, char const *argv[]) {
    ...
    for (int i = 0; i < N; i += CHUNK) {
        copy(&a[i], &c[i], CHUNK);
        scale(&b[i], &c[i], CHUNK);
        add(&a[i], &b[i], &c[i], scalar, CHUNK);
        triad(&a[i], &b[i], &c[i], scalar, CHUNK);
    }
}

device: openacc is requested. copy_deps: copies dependencies to the target if required. Dependencies are specified and OmpSs manages the data; the symbols are passed through the deviceptr clause to inform OpenACC. Loop blocking splits the work into tasks.

Stream Benchmark, 3rd Style (multiple GPUs + multicore SMP): the OmpSs compiler creates multiple OpenACC and SMP tasks, which are run automatically on multiple GPUs or CPUs.

void triad(T* a, T* b, T* c, T scalar, int N) {
    #pragma omp target device(openacc, smp)
    #pragma omp task in(b[0:N], c[0:N]) out(a[0:N])
    #pragma acc parallel loop deviceptr(a, b, c)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
}

int main(int argc, char const *argv[]) {
    ...
    for (int i = 0; i < N; i += CHUNK) {
        copy(&a[i], &c[i], CHUNK);
        scale(&b[i], &c[i], CHUNK);
        add(&a[i], &b[i], &c[i], scalar, CHUNK);
        triad(&a[i], &b[i], &c[i], scalar, CHUNK);
    }
}

device: openacc and smp are requested. copy_deps: copies dependencies to the target if required. Dependencies are specified and OmpSs manages the data; the symbols are passed through the deviceptr clause to inform OpenACC. Loop blocking splits the work into tasks.

Speedup over Multithreaded OmpSs [CPU]: Stream Benchmark

Configuration: i7-4820K (4 cores), 2x NVIDIA Tesla K40. OpenMP = OmpSs 16.06; OpenACC = PGI OpenACC 16.9. Note: all executables use the same GPU kernel, and OpenACC is scaled manually across multiple GPUs by modifying the code.

[Chart: speedup of the Stream microbenchmark for OmpSs [CPU], OpenACC, OmpSs [OpenACC] and OmpSs [OpenACC + CPU], on 1 GPU and 2 GPUs.]

Future work: bring smoother OpenACC code generation by OmpSs; include further OpenACC directives such as routine, declare, etc.

Conclusion: OpenACC integration is feasible because OmpSs is flexible. Using OpenACC + OmpSs delivers the best performance: two heads are better than one! And OpenACC becomes even easier with OmpSs. Guray Ozen guray.ozen@bsc.es