Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray


1 Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray

2 Pragma-based GPU programming
  Write programs for GPU processing without (directly) using CUDA/OpenCL
  Place pragmas to drive processing on the accelerator
  Compiler does the work of generating CUDA/OpenCL code
  Pushed by NVIDIA in an effort to make GPU programming more mainstream
    Developed the OpenACC standard with PGI, Cray, and CAPS
    OpenACC will eventually be supported by compilers from multiple vendors
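As a rough illustration of the pragma-based style (OpenACC syntax here, not HMPP), the sketch below places a `parallel loop` directive on a simple vector loop. The function name `scale_add` and its parameters are made up for this example. A compiler without OpenACC support ignores the pragma and runs the loop sequentially on the CPU, so the original source still builds unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i].
 * With an OpenACC compiler the loop is offloaded to the accelerator;
 * otherwise the pragma is ignored and the loop runs on the CPU. */
void scale_add(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The data clauses (`copyin`, `copy`) tell the compiler which arrays to move to and from the accelerator; the loop body itself is untouched.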

3 Pragma-based GPU programming
  NVIDIA offers a performance-improvement guarantee when using pragma-based programming on its GPUs

4 HMPP Workbench
  Developed by CAPS Entreprise
  Directive-based, multi-language, and multi-target hybrid programming model
  Directives similar to OpenMP
  Parallelize sequential code for multiple architectures
  Preserve original source code

5 HMPP Workbench
  Currently supported on NVIDIA and AMD GPUs
  Supports high-level code written in C and Fortran (assuming user has license)
  Can target CUDA and OpenCL environments
  Working on an open standard called OpenHMPP
  Will be supported on additional accelerators in the future

6 Parallel Processing w/ HMPP Workbench
  Sequential code: 2D Convolution
  Code in void function:

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
      int i, j;
      DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

      c11 = +2; c21 = +5; c31 = -8;
      c12 = -3; c22 = +6; c32 = -9;
      c13 = +4; c23 = +7; c33 = +10;

      /* start at 1 so that A[i - 1] and A[...][j - 1] stay in bounds */
      for (i = 1; i < NI - 1; i++)
        for (j = 1; j < NJ - 1; j++)
          B[i][j] = c11*A[i-1][j-1] + c12*A[i+0][j-1] + c13*A[i+1][j-1]
                  + c21*A[i-1][j+0] + c22*A[i+0][j+0] + c23*A[i+1][j+0]
                  + c31*A[i-1][j+1] + c32*A[i+0][j+1] + c33*A[i+1][j+1];
    }

7 Parallel Processing w/ HMPP Workbench
  Parallelize 2D Convolution code w/ HMPP
  Use #pragma before function heading to turn function into a "codelet" targeted for processing on the accelerator
  Use #pragma at the function call to execute the function in parallel on the accelerator

8 Parallel Processing w/ HMPP Workbench
  2D Convolution function w/ HMPP #pragma
  Turns function into codelet for parallel processing

    #pragma hmpp conv codelet, target=opencl, args[A].io=in, args[B].io=inout
    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
      int i, j;
      DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

      c11 = +2; c21 = +5; c31 = -8;
      c12 = -3; c22 = +6; c32 = -9;
      c13 = +4; c23 = +7; c33 = +10;

      /* start at 1 so that A[i - 1] and A[...][j - 1] stay in bounds */
      for (i = 1; i < NI - 1; i++)
        for (j = 1; j < NJ - 1; j++)
          B[i][j] = c11*A[i-1][j-1] + c12*A[i+0][j-1] + c13*A[i+1][j-1]
                  + c21*A[i-1][j+0] + c22*A[i+0][j+0] + c23*A[i+1][j+0]
                  + c31*A[i-1][j+1] + c32*A[i+0][j+1] + c33*A[i+1][j+1];
    }

9 Parallel Processing w/ HMPP Workbench
  Call to function w/ #pragma for parallel processing
  Without #pragma, function will execute on CPU

    int main(int argc, char *argv[])
    {
      DATA_TYPE A[NI][NJ];
      DATA_TYPE B_outputFromGpu[NI][NJ]; // GPU exec results

      // initialize the input array
      init(A);

    #pragma hmpp conv callsite
      conv2d(A, B_outputFromGpu);

      return 0;
    }

10 Tuning GPU Execution in HMPP
  Default config. includes CPU-GPU transfer
  What if you want to keep data on GPU?
    Additional functions use same data
    Benchmarking
  advancedload / delegatedstore #pragmas
    advancedload = pre-load data to accelerator
    delegatedstore = write data from accelerator to main memory
  Use "allocate" and "release" pragmas to manage memory on accelerator
  #pragmas to synchronize kernel execution

11 Tuning GPU Execution in HMPP
  2D convolution with memory loaded before kernel call and written back after

    int main(int argc, char *argv[])
    {
      DATA_TYPE A[NI][NJ];
      DATA_TYPE B_outputFromGpu[NI][NJ]; // GPU exec results

      // initialize the input array
      init(A);

    #pragma hmpp conv allocate
    #pragma hmpp conv advancedload, args[A;B]
    #pragma hmpp conv callsite, args[A;B].advancedload=true, asynchronous
      conv2d(A, B_outputFromGpu);
    #pragma hmpp conv synchronize
    #pragma hmpp conv delegatedstore, args[B]
    #pragma hmpp conv release

      return 0;
    }

12 Parallel Processing w/ HMPP Workbench
  Convolution on GPU and CPU
  Compare output results and runtimes
  Use advancedload/delegatedstore with synchronization to time the kernel only
  Call the function without the callsite #pragma to run it on the CPU
  Another function compares each output value to check that results match (within a certain threshold)
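The result check described above might look like the following sketch. The names `percent_diff` and `count_misses`, the relative-difference formula, and the threshold parameter are assumptions for illustration; the slides only state that outputs are compared within a threshold and report a "Number of misses".

```c
#include <assert.h>
#include <stddef.h>

/* Relative difference between two values, in percent (hypothetical helper). */
static double percent_diff(double a, double b)
{
    double abs_a = a < 0.0 ? -a : a;
    double abs_b = b < 0.0 ? -b : b;
    double denom = abs_a > abs_b ? abs_a : abs_b;
    double diff = a - b;

    if (diff < 0.0) {
        diff = -diff;
    }
    if (denom == 0.0) {
        return 0.0; /* both values are exactly zero */
    }
    return 100.0 * diff / denom;
}

/* Count elements whose CPU and GPU results differ by more than
 * threshold_percent. A return value of 0 means the outputs match. */
size_t count_misses(const double *cpu, const double *gpu, size_t n,
                    double threshold_percent)
{
    size_t misses = 0;

    assert(threshold_percent >= 0.0);
    for (size_t i = 0; i < n; i++) {
        if (percent_diff(cpu[i], gpu[i]) > threshold_percent) {
            misses++;
        }
    }
    return misses;
}
```

A relative rather than absolute threshold is used here because CPU and GPU floating-point results can legitimately differ in the last few bits, and the size of that difference scales with the magnitude of the values.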

13 Parallel Processing w/ HMPP Workbench

    int main(int argc, char *argv[])
    {
      double t_start, t_end;
      DATA_TYPE A[NI][NJ], B[NI][NJ], B_outputFromGpu[NI][NJ];

      init(A);

    #pragma hmpp conv allocate
    #pragma hmpp conv advancedload, args[A;B]
      t_start = rtclock();
    #pragma hmpp conv callsite, args[A;B].advancedload=true, asynchronous
      conv2d(A, B_outputFromGpu); // run 2D convolution on accelerator
    #pragma hmpp conv synchronize
      t_end = rtclock();
      fprintf(stdout, "GPU Runtime: %0.6lf\n", t_end - t_start);
    #pragma hmpp conv delegatedstore, args[B]
    #pragma hmpp conv release

      t_start = rtclock();
      conv2d(A, B); // run 2D convolution on CPU
      t_end = rtclock();
      fprintf(stdout, "CPU Runtime: %0.6lf\n", t_end - t_start);

      // compare output on CPU and GPU to make sure results match
      compareresults(B, B_outputFromGpu);

      return 0;
    }

14 Running HMPP on cuda.acad...
  See README on course website...
    Set environment variables for CUDA and HMPP
    Kernel execution is same as for GPU programs in project 1
    Doesn't work right now due to issue with license; hopefully will be fixed soon...
  2D convolution output on fatalii (GTX 480)
    Using array dimensions of 4096 X 4096 in all experiments
    GPU Runtime:
    CPU Runtime:
    Number of misses: 0

15 HMPP Transformations
  Used to optimize code
  Permutation
  Unroll (`contiguous' and `split' options)
    `Remainder' loop behavior
      Default: allow `remainder loop'
      `Guarded' - if-statement in loop to check if current iteration is in bounds for processing; no remainder loop
  Tiling
  Dimension of thread block / local work-group (default is 32 X 4)
  Specify loop(s) to parallelize in loop nest
    By default the outer two "parallelizable" loops are parallelized
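To make the unroll options concrete, here is a hand-written C sketch of what unrolling a loop by a factor of 2 might produce. It reflects one reading of the option names (`contiguous` keeps neighbouring iterations together, `split` pairs iteration j with iteration j + N/2) and is not output from the HMPP compiler; the function names and the stand-in `body` operation are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in loop body: out[j] = 2 * in[j]. */
static void body(const int *in, int *out, size_t j)
{
    assert(in != NULL && out != NULL);
    out[j] = 2 * in[j];
}

/* 'contiguous' unroll by 2 with a remainder loop (the default behavior). */
void unroll2_contiguous_remainder(const int *in, int *out, size_t n)
{
    size_t j;

    for (j = 0; j + 1 < n; j += 2) {
        body(in, out, j);
        body(in, out, j + 1);
    }
    for (; j < n; j++) { /* remainder loop: at most one iteration */
        body(in, out, j);
    }
}

/* 'contiguous' unroll by 2, 'guarded': an if-statement replaces the
 * remainder loop. */
void unroll2_contiguous_guarded(const int *in, int *out, size_t n)
{
    for (size_t j = 0; j < n; j += 2) {
        body(in, out, j);
        if (j + 1 < n) {
            body(in, out, j + 1);
        }
    }
}

/* 'split' unroll by 2, 'guarded': iteration j is paired with j + half. */
void unroll2_split_guarded(const int *in, int *out, size_t n)
{
    size_t half = (n + 1) / 2; /* ceil(n / 2) */

    for (size_t j = 0; j < half; j++) {
        body(in, out, j);
        if (j + half < n) {
            body(in, out, j + half);
        }
    }
}
```

All three variants compute the same result. When each iteration of the unrolled loop becomes one GPU thread, the `split` form keeps neighbouring threads on neighbouring elements, which is the likely reason it can behave differently from `contiguous` with respect to memory coalescing.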

16 HMPP Transformations Input/output code when using HMPP transformations:

17 Code Transformations: 2D Convolution
  Focus on inner loop
  What happens when unrolling inner loop by a factor of 2?
    Using default `contiguous' loop unroll option...
    "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
      int i, j;
      DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

      c11 = +2; c21 = +5; c31 = -8;
      c12 = -3; c22 = +6; c32 = -9;
      c13 = +4; c23 = +7; c33 = +10;

      /* start at 1 so that A[i - 1] and A[...][j - 1] stay in bounds */
      for (i = 1; i < NI - 1; i++)
      {
    #pragma hmppcg unroll 2, guarded
        for (j = 1; j < NJ - 1; j++)
          B[i][j] = c11*A[i-1][j-1] + c12*A[i+0][j-1] + c13*A[i+1][j-1]
                  + c21*A[i-1][j+0] + c22*A[i+0][j+0] + c23*A[i+1][j+0]
                  + c31*A[i-1][j+1] + c32*A[i+0][j+1] + c33*A[i+1][j+1];
      }
    }

18 Code Transformations: 2D Convolution
  Focus on inner loop
  What happens when unrolling inner loop by a factor of 2?
    Using default `contiguous' loop unroll option...
    "Guarded" remainder loop behavior
  Results on fatalii:
    CPU Runtime:
    Number of misses: 0
    GPU Runtime: (compared to in default config.)

19 Code Transformations: 2D Convolution
  Focus on inner loop
  What happens when unrolling inner loop by a factor of 2?
    Using `split' loop unroll option...
    "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
      int i, j;
      DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

      c11 = +2; c21 = +5; c31 = -8;
      c12 = -3; c22 = +6; c32 = -9;
      c13 = +4; c23 = +7; c33 = +10;

      /* start at 1 so that A[i - 1] and A[...][j - 1] stay in bounds */
      for (i = 1; i < NI - 1; i++)
      {
    #pragma hmppcg unroll 2, split
        for (j = 1; j < NJ - 1; j++)
          B[i][j] = c11*A[i-1][j-1] + c12*A[i+0][j-1] + c13*A[i+1][j-1]
                  + c21*A[i-1][j+0] + c22*A[i+0][j+0] + c23*A[i+1][j+0]
                  + c31*A[i-1][j+1] + c32*A[i+0][j+1] + c33*A[i+1][j+1];
      }
    }

20 Code Transformations: 2D Convolution
  Focus on inner loop
  What happens when unrolling inner loop by a factor of 2?
    Using `split' loop unroll option...
    "Guarded" remainder loop behavior
  Results on fatalii:
    CPU Runtime:
    Number of misses: 0
    GPU Runtime: (compared to in default config.)

21 Code Transformations: 2D Convolution
  Focus on inner loop
  What happens when unrolling inner loop by a factor of 4?
    Using `contiguous' loop unroll option...
    Using `split' loop unroll option...
    "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
      int i, j;
      DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

      c11 = +2; c21 = +5; c31 = -8;
      c12 = -3; c22 = +6; c32 = -9;
      c13 = +4; c23 = +7; c33 = +10;

      /* start at 1 so that A[i - 1] and A[...][j - 1] stay in bounds */
      for (i = 1; i < NI - 1; i++)
      {
    #pragma hmppcg unroll 4(, split)
        for (j = 1; j < NJ - 1; j++)
          B[i][j] = c11*A[i-1][j-1] + c12*A[i+0][j-1] + c13*A[i+1][j-1]
                  + c21*A[i-1][j+0] + c22*A[i+0][j+0] + c23*A[i+1][j+0]
                  + c31*A[i-1][j+1] + c32*A[i+0][j+1] + c33*A[i+1][j+1];
      }
    }

22 Code Transformations: 2D Convolution
  Focus on inner loop
  What happens when unrolling inner loop by a factor of 4?
  Using `contiguous' loop unroll option w/ `guarded' remainder...
    GPU Runtime: (compared to in default config.)
    CPU Runtime:
    Number of misses: 0
  Using `split' loop unroll option w/ `guarded' remainder...
    GPU Runtime: (compared to in default config.)
    CPU Runtime:
    Number of misses: 0

23 3D Convolution
  Code:

    void conv3d(DATA_TYPE A[NI][NJ][NK], DATA_TYPE B[NI][NJ][NK])
    {
      int i, j, k;
      DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

      c11 = +2; c21 = +5; c31 = -8;
      c12 = -3; c22 = +6; c32 = -9;
      c13 = +4; c23 = +7; c33 = +10;

      for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
          for (k = 1; k < NK - 1; ++k) // 2
            B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                       + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                       + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                       + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                       + c32 * A[i + 0][j + 1][k + 0]
                       + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                       + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                       + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];
    }

24 3D Convolution
  Initial results of HMPP parallelization on fatalii:
    Using array dimensions of 256 X 256 X 256 in all experiments
    GPU Runtime: s
    CPU Runtime: s
    Number of misses: 0

27 Code Transformations: 3D Convolution
  3D Convolution Loop:

      for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
          for (k = 1; k < NK - 1; ++k) // 2
            B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                       + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                       + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                       + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                       + c32 * A[i + 0][j + 1][k + 0]
                       + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                       + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                       + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

  What does HMPP parallelize by default?
    First two "parallelizable" loops in loop nest
    Is this desirable for this kernel?

30 Code Transformations: 3D Convolution
  3D Convolution Loop:

      for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
          for (k = 1; k < NK - 1; ++k) // 2
            B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                       + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                       + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                       + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                       + c32 * A[i + 0][j + 1][k + 0]
                       + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                       + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                       + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

  How to adjust which loops are parallelized and in what order?
    Re-order loops (permutation)
    Use noparallel/parallel pragmas to specify which loops to parallelize

32 Code Transformations: 3D Convolution
  3D Convolution Loop:

    #pragma hmppcg permute ??????????
      for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
          for (k = 1; k < NK - 1; ++k) // 2
            B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                       + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                       + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                       + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                       + c32 * A[i + 0][j + 1][k + 0]
                       + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                       + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                       + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

  Reordering loops
    What order(s) would work best for HMPP parallelization (think memory coalescing...)
    Want the 'k' loop to be the `inner' loop parallelized
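The preference for `k` follows from C's row-major layout: in `A[NI][NJ][NK]`, consecutive values of `k` are adjacent in memory, while consecutive values of `j` and `i` are `NK` and `NJ*NK` elements apart. The sketch below computes the linear element offset to show this; the helper name `offset3d` is made up for illustration, and the array sizes are passed as parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Linear element offset of A[i][j][k] in a row-major A[ni][nj][nk] array. */
size_t offset3d(size_t i, size_t j, size_t k, size_t nj, size_t nk)
{
    assert(j < nj && k < nk);
    return (i * nj + j) * nk + k;
}
```

With the 256 X 256 X 256 arrays used in the experiments, stepping `k` by one moves 1 element, stepping `j` moves 256 elements, and stepping `i` moves 65536 elements. If threads with consecutive IDs take consecutive `k` values, they touch adjacent addresses, which is the access pattern a GPU can coalesce.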

33 Code Transformations: 3D Convolution
  3D Convolution Loop: loop order (i, k, j)

    #pragma hmppcg permute i, k, j
      for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
          for (k = 1; k < NK - 1; ++k) // 2
            B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                       + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                       + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                       + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                       + c32 * A[i + 0][j + 1][k + 0]
                       + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                       + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                       + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

  GPU Runtime: s (compared to s in default config.)
  CPU Runtime: s
  Number of misses: 0

34 Code Transformations: 3D Convolution
  3D Convolution Loop: loop order (j, k, i)

    #pragma hmppcg permute j, k, i
      for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
          for (k = 1; k < NK - 1; ++k) // 2
            B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                       + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                       + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                       + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                       + c32 * A[i + 0][j + 1][k + 0]
                       + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                       + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                       + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

  GPU Runtime: s (compared to s in default config.)
  CPU Runtime: s
  Number of misses: 0
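Written out by hand, a permutation to loop order (i, k, j) amounts to the interchange sketched below. This is an assumption about the effect of the permute directive, not HMPP compiler output, and it uses a simplified 4-point stencil and a small hypothetical size `N` rather than the kernel from the slides. The interchange is legal here because every iteration writes a distinct element of `B` and only reads `A`.

```c
#include <assert.h>

#define N 6 /* small hypothetical size, chosen only for this example */

/* Original loop order (i, j, k) over a simplified 4-point stencil. */
void stencil_ijk(double A[N][N][N], double B[N][N][N])
{
    assert(A != B); /* reordering is only safe if input and output differ */
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            for (int k = 1; k < N - 1; ++k)
                B[i][j][k] = A[i - 1][j][k] + A[i + 1][j][k]
                           + A[i][j][k - 1] + A[i][j][k + 1];
}

/* Same computation with the loops permuted to (i, k, j). */
void stencil_ikj(double A[N][N][N], double B[N][N][N])
{
    assert(A != B);
    for (int i = 1; i < N - 1; ++i)
        for (int k = 1; k < N - 1; ++k)
            for (int j = 1; j < N - 1; ++j)
                B[i][j][k] = A[i - 1][j][k] + A[i + 1][j][k]
                           + A[i][j][k - 1] + A[i][j][k + 1];
}
```

If the first two loops of the nest are the ones parallelized, the (i, k, j) order makes `k` the inner parallel loop and leaves `j` as the sequential loop inside each thread, while the results stay identical to the original order.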

35 Code Transformations: 3D Convolution
  3D Convolution Loop: Use noparallel/parallel to parallelize inner loop

    #pragma hmppcg noparallel
      for (i = 1; i < NI - 1; ++i) // 0
    #pragma hmppcg parallel
        for (j = 1; j < NJ - 1; ++j) // 1
    #pragma hmppcg parallel
          for (k = 1; k < NK - 1; ++k) // 2
            B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                       + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                       + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                       + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                       + c32 * A[i + 0][j + 1][k + 0]
                       + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                       + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                       + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

  GPU Runtime: s (compared to s in default config.)
  CPU Runtime: s
  Number of misses: 0

36 Pragma-based GPU programming
  Allows programmer to write code targeted toward GPUs without knowing CUDA / OpenCL
  Provides code transformations to potentially speed up code
    Runtime of 2D and 3D convolution decreased when using specific transformations
  However, still does not give as much control as CUDA/OpenCL (example: shared/local memory usage)
