OpenACC (Open Accelerators - Introduced in 2012)


Virgil Cannon
5 years ago

Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI), introduced in 2012; GNU has an incomplete implementation.

Uses directives in C, C++ and Fortran (as in OpenMP) to specify parallel regions.

Can be used with CPUs and GPU accelerators but, unlike OpenMP, it is meant for offloading computational kernels to GPUs.

Example:

    void saxpy(int n, float a, float *x, float *restrict y)
    {
    #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

The pragma suggests that the compiler generate parallel code for an accelerator. Data is moved to and from the GPU transparently. This data movement is slow and can create a large overhead if it happens often.
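One common way to limit the transfer overhead mentioned above is an explicit OpenACC data region, which keeps arrays on the device across several kernel launches. The following is a minimal sketch, not taken from the slides: the function name saxpy_repeated and the steps parameter are hypothetical, and the assert is only a precondition check. A compiler without OpenACC support ignores the pragmas and runs the loops serially with the same result.

```c
#include <assert.h>

/* Repeated SAXPY updates inside one OpenACC data region (illustrative sketch).
 * saxpy_repeated and steps are hypothetical names, not from the slides. */
void saxpy_repeated(int n, int steps, float a,
                    const float *restrict x, float *restrict y)
{
    assert(n >= 0 && steps >= 0);

    /* x is copied to the device once; y is copied in once and back out once. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            /* Each kernels region reuses the arrays already on the device. */
            #pragma acc kernels
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
    }
}
```

Without the data region, each of the steps kernel launches could trigger its own copy of x and y in both directions.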

OpenACC (Open Accelerators)

Each loop in the kernels block is executed as a parallel function on the GPU.

The keyword restrict is a promise to the compiler that pointer aliasing will not occur; here, pointer y will not be an alias for the data pointed to by x. In other words, it declares that only this pointer will be used to access the data.

PGI compiler:

    pgcc -acc test.c
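To see why the compiler needs the no-aliasing promise, it helps to run the SAXPY loop in two different iteration orders, as a stand-in for the arbitrary order a parallel device may choose. The sketch below is illustrative only: saxpy_ordered and its reverse flag are hypothetical names, and the function deliberately omits restrict so that overlapping arguments are legal.

```c
#include <assert.h>

/* SAXPY without restrict, visiting iterations in ascending (reverse == 0)
 * or descending (reverse != 0) order. Hypothetical helper for illustration. */
void saxpy_ordered(int n, float a, const float *x, float *y, int reverse)
{
    assert(n >= 0);
    for (int k = 0; k < n; ++k) {
        int i = reverse ? n - 1 - k : k;   /* index visited at step k */
        y[i] = a * x[i] + y[i];
    }
}
```

With buf = {1, 1, 1, 1}, a = 1, n = 3, x = buf and y = buf + 1, the ascending order leaves buf as {1, 2, 3, 4} while the descending order leaves {1, 2, 2, 2}. When x and y do not overlap, both orders give the same result, which is the situation restrict guarantees.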

OpenMP and OpenACC are similar

    // OpenMP (CPU)
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) sum += i;

    // OpenACC (GPU)
    #pragma acc kernels
    for (i = 0; i < n; i++) sum += i;
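The two variants can be wrapped into complete functions for a side-by-side check. This is a sketch under assumptions: the function names sum_openmp and sum_openacc are hypothetical, and the OpenACC version uses the parallel loop directive with an explicit reduction clause instead of the kernels directive from the slide, so the reduction does not depend on compiler analysis. Built without OpenMP or OpenACC support, both pragmas are ignored and the loops run serially.

```c
#include <assert.h>

/* Sum of 0 .. n-1 with an OpenMP reduction (CPU threads). */
long sum_openmp(int n)
{
    assert(n >= 0);
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Same sum with an explicit OpenACC reduction (accelerator).
 * Swapped-in directive: parallel loop rather than the slide's kernels. */
long sum_openacc(int n)
{
    assert(n >= 0);
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

Both functions should return n(n-1)/2; for example, n = 100 gives 4950.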

Jacobi Iteration in OpenMP

    // Jacobi iteration in OpenMP (example from Dresden_OpenACC_Intro_1.pdf)
    // fmax and fabs come from <math.h>
    while (err > tol && iter < iter_max) {
        err = 0.0;

        // parallelization using threads on the CPU
        #pragma omp parallel for shared(m, n, Anew, A) reduction(max:err)
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                err = fmax(err, fabs(Anew[j][i] - A[j][i]));
            }
        }

        // copy the new values back for the next sweep
        #pragma omp parallel for shared(m, n, Anew, A)
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++)
                A[j][i] = Anew[j][i];

        iter++;
    }

Jacobi Iteration in OpenACC

    // Jacobi iteration in OpenACC (example from Dresden_OpenACC_Intro_1.pdf)
    // fmax and fabs come from <math.h>
    while (err > tol && iter < iter_max) {
        err = 0.0;

        // parallelization using threads on the GPU
        #pragma acc kernels reduction(max:err)
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                err = fmax(err, fabs(Anew[j][i] - A[j][i]));
            }
        }

        // copy the new values back for the next sweep
        #pragma acc kernels
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++)
                A[j][i] = Anew[j][i];

        iter++;
    }

OpenCL (Open Computing Language - Introduced in 2009)

Framework (language + API, open standard) for parallel CPU/GPU computing, based on C99. Implementations are available from AMD, Apple, ARM, IBM, Nvidia, and others.

Functions executed on an OpenCL device (GPU) are called kernels.

Defines a memory hierarchy: CPU memory (global shared), global read-only, local, private (__global, __constant, __local, __private).

- Model: The host (CPU) has one or more OpenCL devices (GPUs, accelerators).
- Model: An OpenCL device has several (10^1 to 10^2) compute units (CUs).
- Model: A compute unit consists of several (10^2 to 10^3) processing elements (PEs).

    __kernel void matvec(__global const float *A, __global const float *x,
                         uint ncols, __global float *y)
    {
        size_t i = get_global_id(0);            // Global id, used as the row index.
        __global float const *a = &A[i*ncols];  // Pointer to the i-th row.
        float sum = 0.f;                        // Accumulator for dot product.
        for (size_t j = 0; j < ncols; j++)
            sum += a[j] * x[j];
        y[i] = sum;
    }

Source: en.wikipedia.org/wiki/opencl
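The kernel above is written from the point of view of a single work-item; the runtime runs it once for every global id. A plain C stand-in can make this execution model concrete. The sketch below is an emulation under stated assumptions, not OpenCL host code: matvec_item and matvec_emulated are hypothetical names, the gid argument plays the role of get_global_id(0), and the serial loop replaces the parallel launch over nrows work-items.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body for one work-item; gid stands in for get_global_id(0). */
static void matvec_item(size_t gid, const float *A, const float *x,
                        size_t ncols, float *y)
{
    const float *a = &A[gid * ncols];  /* pointer to row gid (row-major A) */
    float sum = 0.f;                   /* accumulator for the dot product */
    for (size_t j = 0; j < ncols; j++)
        sum += a[j] * x[j];
    y[gid] = sum;
}

/* Serial emulation of a launch with nrows work-items, one per matrix row. */
void matvec_emulated(const float *A, const float *x,
                     size_t nrows, size_t ncols, float *y)
{
    assert(A && x && y);
    for (size_t gid = 0; gid < nrows; gid++)
        matvec_item(gid, A, x, ncols, y);
}
```

Because no work-item writes to a location another work-item reads or writes, the iterations of the emulation loop are independent, which is what allows a device to run them in parallel.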

OpenCL Language Restrictions (2010)

- No recursion allowed (still true).
- Pointers to functions are not allowed.
- Pointers to pointers are allowed within a kernel, but not as an argument.
- Bit-fields are not supported.
- C99 variable-length arrays are not allowed.
- Structs not supported.
- No writes through a pointer to a type smaller than 32 bits.

OpenCL C Kernel Example

    __kernel void dp_mul(__global const float *a, __global const float *b,
                         __global float *c, int N)
    {
        int id = get_global_id(0);
        if (id < N)
            c[id] = a[id] * b[id];
    }
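The guard if (id < N) in dp_mul matters because the number of launched work-items is often larger than N: in OpenCL 1.x, when a work-group size is given, the global size has to be a multiple of it, so hosts typically round N up. The following plain C sketch emulates that situation; round_up, dp_mul_item and dp_mul_emulated are hypothetical names, and the serial loop stands in for the parallel launch.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= n. */
size_t round_up(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return (n + local_size - 1) / local_size * local_size;
}

/* dp_mul body for one work-item; id stands in for get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b,
                        float *c, size_t n)
{
    if (id < n)                 /* surplus work-items do nothing */
        c[id] = a[id] * b[id];
}

/* Serial emulation of a launch whose global size is padded to a
 * multiple of the work-group size. */
void dp_mul_emulated(const float *a, const float *b, float *c,
                     size_t n, size_t local_size)
{
    size_t global_size = round_up(n, local_size);
    for (size_t id = 0; id < global_size; id++)
        dp_mul_item(id, a, b, c, n);
}
```

With n = 3 and a work-group size of 4, four work-items run, and the guard keeps the fourth from touching memory past the end of the arrays.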

OpenCL (portable) or CUDA (Nvidia)?

OpenCL will run on a large number of devices (Nvidia, AMD, KNL, x86, Power, ...). This reduces risk in case your favorite architecture dies.

However, for portability, it has to be conservative about supporting features. And just because the code runs does not mean that it will be fast. In general, code can be expected to run fast only when it has been adapted to the hardware.

If you own an Nvidia device, why not use Nvidia's SDK, named CUDA?

To reduce risk, the US is taking two paths in HPC: one is x86 (Intel/AMD), the other is IBM + Nvidia.

If Nvidia has won the accelerator race in the West (have they?), why not use their programming model and SDK? (China now has its own architecture.)

Time to look at the details of GPU hardware and at the CUDA SDK by Nvidia. After that, however, we may appreciate higher-level, portable approaches more.

