OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer



1 OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

2 3 WAYS TO ACCELERATE APPLICATIONS
Applications can be accelerated in three ways:
Libraries: easy to use, most performance
Compiler Directives: easy to use, portable code
Programming Languages: most performance, most flexibility

3 Programming GPU-Accelerated Systems
Using OpenACC directives for data management, loop parallelization
GPU Developer View with Separated Memories
[Figure: CPU with System Memory, GPU with GPU Memory]

    #pragma acc data copyin(a,b) copyout(c)
    {
      ...
      #pragma acc parallel
      {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
          c[i] = a[i] + b[i];
          ...
        }
      }
      ...
    }
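The fragment on this slide can be filled out into a complete function. The following is a minimal sketch, not the presenter's code: the function name `vec_add` and its signature are assumptions made for illustration, and the explicit array sections (`a[0:n]` and so on) are added because plain pointers carry no size information. A compiler without OpenACC support ignores the pragmas and runs the loop on the host.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are not from the slides):
   computes c[i] = a[i] + b[i] for i in [0, n). */
void vec_add(const float *a, const float *b, float *c, int n)
{
    /* Inputs are copied to the device on entry; the result is copied
       back to the host when the data region ends. */
    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        #pragma acc parallel
        {
            /* Spread iterations across gangs and vector lanes. */
            #pragma acc loop gang vector
            for (int i = 0; i < n; ++i) {
                c[i] = a[i] + b[i];
            }
        }
    }
}
```

Keeping the data region separate from the parallel region makes it possible to wrap several compute regions in one data region later, so the arrays are transferred once rather than per kernel.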

4 OpenACC Basic Syntactic Concepts
Fortran OpenACC directive syntax:
    !$acc directive [clause]...
C/C++ OpenACC directive syntax:
    #pragma acc directive [clause]... eol


5 OpenACC is for Multicore, Manycore & GPUs

     98 !$ACC PARALLEL
     99 !$ACC LOOP INDEPENDENT
    100 DO k=y_min-depth,y_max+depth
    101 !$ACC LOOP INDEPENDENT
    102   DO j=1,depth
    103     density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
    104   ENDDO
    105 ENDDO
    106 !$ACC END PARALLEL

Multicore CPU:

    % pgfortran -ta=multicore -fast -Minfo=acc -c \
        update_tile_halo_kernel.f90
    ...
    100, Loop is parallelizable
         Generating Multicore code
        100, !$acc loop gang
    102, Loop is parallelizable

Tesla GPU:

    % pgfortran -ta=tesla -fast -Minfo=acc -c \
        update_tile_halo_kernel.f90
    ...
    100, Loop is parallelizable
    102, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
        102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x


6 OpenACC for GPUs in a Nutshell

    ...
    #pragma acc data copy(b[0:n][0:m]) \
                     create(a[0:n][0:m])
    {
      for (iter = 1; iter <= p; ++iter){
        #pragma acc parallel loop
        for (i = 1; i < n-1; ++i){
          for (j = 1; j < m-1; ++j){
            a[i][j]=w0*b[i][j]+
                    w1*(b[i-1][j]+b[i+1][j]+
                        b[i][j-1]+b[i][j+1])+
                    w2*(b[i-1][j-1]+b[i-1][j+1]+
                        b[i+1][j-1]+b[i+1][j+1]);
          }
        }
        #pragma acc parallel loop
        for( i = 1; i < n-1; ++i )
          for( j = 1; j < m-1; ++j )
            b[i][j] = a[i][j];
      }
    }
    ...

[Figure: arrays A and B in System Memory and Accelerator Memory]

7 OpenACC Interoperability
Add CUDA (C or Fortran), OpenCL, or accelerated libraries to an OpenACC application
Write a truly heterogeneous application with MPI, OpenMP and OpenACC
Add OpenACC to an existing accelerated application
Share data between OpenACC and CUDA

8 OpenACC Accelerator Compute Constructs
Parallel Construct
Kernels Construct
Loop Construct
Combined Loop Directives
Other clauses and directives: private data, reductions, collapsing loops, caching

9 OpenACC Compute constructs: parallel, kernels directives
PARALLEL LOOP
  Requires analysis by programmer to ensure safe parallelism
  Straightforward path from OpenMP
KERNELS
  Compiler performs parallel analysis and parallelizes what it believes safe
  Can cover larger area of code with single directive
  Gives compiler additional leeway
Both approaches are equally valid and can perform equally well.
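To make the contrast concrete, here is a small sketch of the same kind of work written both ways. The function names are hypothetical and the data clauses are assumptions added so each function is self-contained; the slide itself shows no code.

```c
/* Hypothetical helper: the programmer asserts that the loop iterations
   are independent, so the compiler parallelizes this loop as directed. */
void scale_parallel(float *restrict x, float s, int n)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}

/* Hypothetical helper: one kernels directive covers two loops; the
   compiler analyzes each loop and parallelizes what it can prove safe. */
void scale_then_shift_kernels(float *restrict x, float s, float t, int n)
{
    #pragma acc kernels copy(x[0:n])
    {
        for (int i = 0; i < n; ++i)
            x[i] *= s;
        for (int i = 0; i < n; ++i)
            x[i] += t;
    }
}
```

The `restrict` qualifier is there to help the compiler's dependence analysis in the kernels version; with `parallel loop` the responsibility for correctness stays with the programmer.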

10 OpenACC Compute constructs: parallel
C/C++ Syntax:
    #pragma acc parallel [clause-list]
      structured block
Fortran Syntax:
    !$acc parallel [clause-list]
      structured block
    !$acc end parallel
Clauses:
    async[(int-expr)]
    wait[(int-expr-list)]
    num_gangs(int-expr)
    num_workers(int-expr)
    vector_length(int-expr)
    device_type(device-type-list)
    if(condition)
    reduction(operator:var-list)
    private(var-list)
    firstprivate(var-list)
    default(none | present)
    data clauses

11 OpenACC Compute constructs: kernels
C/C++ Syntax:
    #pragma acc kernels [clause-list]
      structured block
Fortran Syntax:
    !$acc kernels [clause-list]
      structured block
    !$acc end kernels
Clauses:
    async[(int-expr)]
    wait[(int-expr-list)]
    num_gangs(int-expr)
    num_workers(int-expr)
    vector_length(int-expr)
    device_type(device-type-list)
    if(condition)
    reduction(operator:var-list)
    default(none | present)
    data clauses

12 OpenACC Compute constructs: serial (new in OpenACC 2.6)
C/C++ Syntax:
    #pragma acc serial [clause-list]
      structured block
Fortran Syntax:
    !$acc serial [clause-list]
      structured block
    !$acc end serial
This is equivalent to a parallel construct restricted to one gang, one worker and a vector length of one.
C/C++ Syntax:
    #pragma acc parallel num_gangs(1) \
                num_workers(1) vector_length(1)
      structured block
Fortran Syntax:
    !$acc parallel num_gangs(1) &
    !$acc num_workers(1) vector_length(1)
      structured block
    !$acc end parallel

13 OpenACC 3 Levels of Parallelism
[Figure: Gang, Workers, Vector]
Vector threads work in lockstep (SIMD/SIMT parallelism)
Workers have 1 or more vectors.
Gangs have 1 or more workers and share resources (such as cache, the streaming multiprocessor, etc.)
Multiple gangs work independently of each other

14 OpenACC gang, worker, vector Clauses
gang, worker, and vector can be added as clauses to a loop directive
A parallel region can only specify one of each: gang, worker, vector
Control the size using the following clauses on the parallel region: num_gangs(n), num_workers(n), vector_length(n)

    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i)
      #pragma acc loop worker
      for (int j = 0; j < n; ++j)
        ...

    #pragma acc parallel vector_length(32)
    #pragma acc loop gang
    for (int i = 0; i < n; ++i)
      #pragma acc loop vector
      for (int j = 0; j < n; ++j)
        ...
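The second pattern on this slide (gang on the outer loop, vector on the inner loop, with `vector_length(32)`) can be sketched as a complete function. The name `row_sums`, the row-major layout and the data clauses are assumptions for illustration; the inner reduction is added because the loop bodies on the slide are elided.

```c
/* Hypothetical helper: out[i] = sum of row i of an n x m matrix stored
   row-major in a. Rows are distributed across gangs; the elements of a
   row are summed across vector lanes. */
void row_sums(const float *a, float *out, int n, int m)
{
    #pragma acc parallel vector_length(32) copyin(a[0:n*m]) copyout(out[0:n])
    {
        #pragma acc loop gang
        for (int i = 0; i < n; ++i) {
            float sum = 0.0f;
            /* Each vector lane accumulates a partial sum; the reduction
               clause combines them into one value per row. */
            #pragma acc loop vector reduction(+:sum)
            for (int j = 0; j < m; ++j)
                sum += a[i * m + j];
            out[i] = sum;
        }
    }
}
```

A vector length of 32 matches the warp size on NVIDIA GPUs, which is one reason it is a common starting point when tuning.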

15 OpenACC gang
CUDA thread block
OpenCL workgroup
NVIDIA streaming multiprocessor
AMD Radeon vector unit
Intel Xeon Phi Coprocessor core
multicore core
Coarse grain parallelism across core or hardware unit
No synchronization between gangs

16 OpenACC worker
CUDA/NVIDIA warp
OpenCL subgroup
AMD Radeon wavefront
Intel Xeon Phi Coprocessor thread
multicore core
Fine grain parallelism for latency tolerance, multithreading
You can usually ignore worker parallelism until you are fine-tuning!

17 OpenACC vector (lane)
CUDA thread
OpenCL work item
Intel Xeon Phi Coprocessor vector lane
multicore SSE or AVX lane
Synchronous parallelism, SIMD parallelism, vector parallelism
Compiler can remap parallelism to improve performance

18 OpenACC Data Movement Management
The compiler can discover the need for data movement automatically
The shapes and sizes of arrays or data structures may not be apparent to the compiler
There are definite performance advantages to hoisting data regions outward, which avoids a data transfer at every kernel launch
Scalars are generally handled optimally; loop variables are private by default, scalars are firstprivate
Arrays, even very small ones, are global (shared) by default
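The last point matters when a loop uses a small scratch array: because arrays are shared by default, every iteration would write to the same storage unless the array is explicitly privatized. The sketch below illustrates this with the `private` clause; the function name `smooth3` and the three-point average are assumptions chosen only to give the scratch array something to do.

```c
/* Hypothetical helper: out[i] is the average of in[i] and its two
   neighbours, with the edges clamped. */
void smooth3(const float *in, float *out, int n)
{
    /* Small scratch array. Without private(tmp) it would be shared by
       all iterations, which is a data race once the loop runs in parallel. */
    float tmp[3];

    #pragma acc parallel loop private(tmp) copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        tmp[0] = in[i > 0 ? i - 1 : i];
        tmp[1] = in[i];
        tmp[2] = in[i < n - 1 ? i + 1 : i];
        out[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0f;
    }
}
```

The scalar `n` and the loop index need no clause here, consistent with the defaults listed on the slide.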

19 OpenACC Data constructs: structured data regions
C/C++ Syntax:
    #pragma acc data [clause-list]
      structured block
Fortran Syntax:
    !$acc data [clause-list]
      structured block
    !$acc end data
Clauses:
    if(condition)
    copy(var-list)
    copyin(var-list)
    copyout(var-list)
    create(var-list)
    no_create(var-list)
    present(var-list)
    deviceptr(var-list)
    attach(var-list)

20 OpenACC Data constructs: unstructured data regions
C/C++ Syntax:
    #pragma acc enter data [clause-list]
Fortran Syntax:
    !$acc enter data [clause-list]
Clauses:
    if(condition)
    copy(var-list)
    copyin(var-list)
    create(var-list)
    attach(var-list)
C/C++ Syntax:
    #pragma acc exit data [clause-list]
Fortran Syntax:
    !$acc exit data [clause-list]
Clauses:
    if(condition)
    copy(var-list)
    copyout(var-list)
    delete(var-list)
    detach(var-list)
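Unstructured data regions are useful when device data must outlive a single lexical scope, for example when allocation, use and release happen in different functions. The following is a minimal sketch under that assumption; the three function names are hypothetical and not part of the slides.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates a zeroed host buffer and creates a
   device copy that stays alive after this function returns. */
float *buffer_create(int n)
{
    float *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return NULL;
    for (int i = 0; i < n; ++i)
        buf[i] = 0.0f;
    #pragma acc enter data copyin(buf[0:n])
    return buf;
}

/* Hypothetical helper: works on the device copy made by buffer_create. */
void buffer_fill(float *buf, int n, float value)
{
    #pragma acc parallel loop present(buf[0:n])
    for (int i = 0; i < n; ++i)
        buf[i] = value;
}

/* Hypothetical helper: copies the final values back to the host and
   releases the device copy. The caller still owns the host memory. */
void buffer_release(float *buf, int n)
{
    #pragma acc exit data copyout(buf[0:n])
}
```

A typical call sequence would be `buffer_create`, any number of `buffer_fill` calls, then `buffer_release` followed by `free` on the host pointer.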

21 OpenACC Data management
Modern Data Structures, Modern HPC Node Architectures
[Figure: CPU with Main Memory, GPU with HBM]
Manual deep copy
Deep copy directives
Unified Memory

22 OpenACC 2.6 Manual Deep Copy
Supported Today in PGI Compilers

    typedef struct points {
      float* x; float* y; float* z;
      int n;
      float coef, direction;
    } points;

    void sub ( int n, float* y ) {
      points p;
      #pragma acc data create (p)
      {
        p.n = n;
        p.x = ( float*) malloc ( sizeof ( float )*n );
        p.y = ( float*) malloc ( sizeof ( float )*n );
        p.z = ( float*) malloc ( sizeof ( float )*n );
        #pragma acc update device (p.n)
        #pragma acc data copyin (p.x[0:n], p.y[0:n])
        {
          #pragma acc parallel loop
          for ( int i = 0; i < p.n; ++i )
            p.x[i] += p.y[i];
          ...
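The slide's code is cut off before the regions close. The following is one possible completion, offered as a sketch rather than the presenter's actual code: the function name `deep_copy_demo`, the initial values, the host-side sum and the switch from `copyin` to `copy` for `p.x` (so the updated values return to the host) are all assumptions.

```c
#include <stdlib.h>

typedef struct points {
    float *x; float *y; float *z;
    int n;
    float coef, direction;
} points;

/* Hypothetical demo: fills x[i] = i and y[i] = 1, adds y into x inside
   nested data regions, and returns the sum of x computed on the host.
   Returns -1.0f if n <= 0 or an allocation fails. */
float deep_copy_demo(int n)
{
    points p;
    float total = 0.0f;

    if (n <= 0)
        return -1.0f;
    p.n = n;
    p.x = malloc(sizeof(float) * n);
    p.y = malloc(sizeof(float) * n);
    p.z = NULL;  /* unused in this sketch */
    if (p.x == NULL || p.y == NULL) {
        free(p.x);
        free(p.y);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) {
        p.x[i] = (float)i;
        p.y[i] = 1.0f;
    }

    /* Step 1: allocate the struct itself on the device. */
    #pragma acc data create(p)
    {
        /* Step 2: send the scalar member the loop bound depends on. */
        #pragma acc update device(p.n)
        /* Step 3: move the member arrays; with OpenACC 2.6 the device
           pointers in p are attached to the device copies. */
        #pragma acc data copy(p.x[0:n]) copyin(p.y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < p.n; ++i)
                p.x[i] += p.y[i];
        }
    }

    for (int i = 0; i < n; ++i)
        total += p.x[i];
    free(p.x);
    free(p.y);
    return total;
}
```

Every pointer member needs its own data clause, which is why the line counts for manual deep copy grow quickly with nested types.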

23 Draft OpenACC 3.0 True Deep Copy
Still in definition by the OpenACC Committee

    typedef struct points {
      float* x; float* y; float* z;
      int n;
      float coef, direction;
      #pragma acc policy inout(x[0:n],y[0:n])
    } points;

    void sub ( int n, float* y ) {
      points p;
      p.n = n;
      p.x = ( float*) malloc ( sizeof ( float )*n );
      p.y = ( float*) malloc ( sizeof ( float )*n );
      p.z = ( float*) malloc ( sizeof ( float )*n );
      #pragma acc data copy (p)
      {
        #pragma acc parallel loop
        for ( int i = 0; i < p.n; ++i )
          p.x[i] += p.y[i];
        ...

24 OpenACC 2.6 Manual Deep Copy
A real-world example: managing one aggregate data structure

    !$acc data copyin(array1)
    call my_copyin(array1)

Derived Type 1 (members: 3 dynamic, 1 derived type 2) -> 12 lines of code
Derived Type 2 (members: 21 dynamic, 1 derived type 3, 1 derived type 4) -> 48 lines of code
Derived Type 3 (members: only static)
Derived Type 4 (members: 8 dynamic, 4 derived type 5, 2 derived type 6) -> 26 lines of code
Derived Type 5 (members: 3 dynamic) -> 8 lines of code
Derived Type 6 (members: 8 dynamic) -> 13 lines of code
Total -> 107 lines of code just for COPYIN
Plus additional lines of code for COPYOUT, CREATE, UPDATE

25 OpenACC with CUDA UNIFIED MEMORY
A real-world example: managing one aggregate data structure
Derived Type 1 (members: 3 dynamic, 1 derived type 2)
Derived Type 2 (members: 21 dynamic, 1 derived type 3, 1 derived type 4)
Derived Type 3 (members: only static)
Derived Type 4 (members: 8 dynamic, 4 derived type 5, 2 derived type 6)
Derived Type 5 (members: 3 dynamic)
Derived Type 6 (members: 8 dynamic)
0 lines of code! It just works.


26 Programming GPU-Accelerated Systems
Using OpenACC directives for data management, loop parallelization
GPU Developer View with Separated Memories
[Figure: CPU with System Memory, GPU with GPU Memory]

    #pragma acc data copyin(a,b) copyout(c)
    {
      ...
      #pragma acc parallel
      {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
          c[i] = a[i] + b[i];
          ...
        }
      }
      ...
    }

27 Programming GPU-Accelerated Systems
OpenACC data directives can be ignored on a unified memory system
GPU Developer View With CUDA Unified Memory
[Figure: Unified Memory]

    #pragma acc data copyin(a,b) copyout(c)
    {
      ...
      #pragma acc parallel
      {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
          c[i] = a[i] + b[i];
          ...
        }
      }
      ...
    }

28 Programming GPU-Accelerated Systems
You can even leave them out entirely
GPU Developer View With CUDA Unified Memory
[Figure: Unified Memory]

    ...
    #pragma acc parallel
    {
      #pragma acc loop gang vector
      for (i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
        ...
      }
    }
    ...
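As a sketch of what leaving the data directives out looks like in a full function, the example below allocates its arrays with `malloc` and uses only a compute directive. The function name and test values are assumptions. It relies on the heap allocations being placed in CUDA Unified Memory, which with the PGI compilers is typically requested with `-ta=tesla:managed`; without managed memory the compiler would need data clauses to know the array extents.

```c
#include <stdlib.h>

/* Hypothetical demo: adds two heap-allocated vectors with no OpenACC
   data directives and returns the last element of the result.
   Returns -1.0f if n <= 0 or an allocation fails. */
float add_without_data_directives(int n)
{
    if (n <= 0)
        return -1.0f;

    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    float last = -1.0f;

    if (a != NULL && b != NULL && c != NULL) {
        for (int i = 0; i < n; ++i) {
            a[i] = (float)i;
            b[i] = 2.0f;
        }
        /* No copyin/copyout: with unified memory the same pointers are
           valid on both the host and the device. */
        #pragma acc parallel loop gang vector
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
        last = c[n - 1];
    }

    free(a);
    free(b);
    free(c);
    return last;
}
```

Only dynamically allocated data is covered this way, which matches the later slide's remark about "dynamic data today".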

29 OpenACC with CUDA UNIFIED MEMORY
Porting a production solver
Total number of OpenACC directives required for a real-world solver port:
With Manual Deep Copy: 1227 directives
With Unified Memory: 215 directives

30 OpenACC and CUDA Unified Memory
Simplify GPU acceleration of applications
Focus on exposing and expressing parallelism
Dynamic data today, all data in the future

31 OpenACC The Standard for GPU Directives
Simple: Directives are the easy path to accelerate compute intensive applications
Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
Powerful: GPU Directives allow complete access to the massive parallel power of a GPU

32 OpenACC The Standard for GPU Directives
The current specification is version 2.6, which was finalized in November
What's new since 2.5?
Optional if or if_present clause allowed on the host_data construct
New no_create data clause allowed on compute and data constructs
New attach and detach behavior was added to the data clauses, and new attach and detach clauses were added, along with matching acc_attach and acc_detach API calls
