Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray

1 Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray

2 Pragma-based GPU programming
- Write programs for GPU processing without (directly) using CUDA/OpenCL
- Place pragmas to drive processing on accelerator
- Compiler does the work of generating CUDA/OpenCL code
- Pushed by NVIDIA in effort to make GPU programming more mainstream
- Developed OpenACC standard with PGI, Cray, and CAPS
- OpenACC will eventually be supported by compilers from multiple vendors

3 Pragma-based GPU programming NVIDIA performance improvement guarantee when using pragma-based programming on their GPUs

4 HMPP Workbench
- Developed by CAPS Entreprise
- Directive-based Multi-language and Multi-target Hybrid Programming Model
- Directives similar to OpenMP
- Parallelize sequential code for multiple architectures
- Preserve original source code

5 HMPP Workbench
- Currently supported on NVIDIA and AMD GPUs
- Supports high-level code written in C and Fortran (assuming user has license)
- Can target CUDA and OpenCL environments
- Working on open standard called OpenHMPP
- Will be supported on additional accelerators in future

6 Parallel Processing w/ HMPP Workbench
Sequential code: 2D Convolution
Code in void function:

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        // loops start at 1 so that A[i - 1] and A[j - 1] stay in bounds
        for (i = 1; i < NI - 1; i++)
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }

7 Parallel Processing w/ HMPP Workbench
Parallelize 2D Convolution code w/ HMPP
- Use #pragma before function heading to turn function into "codelet" targeted for processing on accelerator
- Use #pragma in function call to execute function in parallel on accelerator

8 Parallel Processing w/ HMPP Workbench
2D Convolution function w/ HMPP #pragma
- Turns function to codelet for parallel processing

    #pragma hmpp conv codelet, target=opencl, args[A].io=in, args[B].io=inout
    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        // loops start at 1 so that A[i - 1] and A[j - 1] stay in bounds
        for (i = 1; i < NI - 1; i++)
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }

9 Parallel Processing w/ HMPP Workbench
- Call to function w/ #pragma for parallel processing
- Without #pragma, function will execute on CPU

    int main(int argc, char *argv[])
    {
        DATA_TYPE A[NI][NJ];
        DATA_TYPE B_outputFromGpu[NI][NJ]; // GPU exec results

        // initialize the input array
        init(A);

        #pragma hmpp conv callsite
        conv2d(A, B_outputFromGpu);

        return 0;
    }
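The "preserve original source code" property can be tried out without an HMPP license. The sketch below is a self-contained variant of the codelet and callsite shown on the slides, under stated assumptions: `NI`, `NJ` and `DATA_TYPE` are small illustrative choices, and `init` and `run_conv2d` are hypothetical helpers that do not appear in the slides. A C compiler with no HMPP support is expected to skip the unknown `#pragma hmpp` lines (possibly with a warning), so the same source runs sequentially on the CPU.

```c
#include <assert.h>

/* Assumed for illustration; the slides use much larger arrays. */
#define NI 8
#define NJ 8
typedef float DATA_TYPE;

/* Codelet directive as spelled on the slides. Without an HMPP compiler
   this line is an unknown pragma and the function stays plain C. */
#pragma hmpp conv codelet, target=opencl, args[A].io=in, args[B].io=inout
void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
{
    const DATA_TYPE c11 = +2, c21 = +5, c31 = -8;
    const DATA_TYPE c12 = -3, c22 = +6, c32 = -9;
    const DATA_TYPE c13 = +4, c23 = +7, c33 = +10;

    /* Interior points only: every neighbor index stays in bounds. */
    for (int i = 1; i < NI - 1; ++i)
        for (int j = 1; j < NJ - 1; ++j)
            B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                    + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                    + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
}

/* Hypothetical initializer: A[i][j] = i + j. */
void init(DATA_TYPE A[NI][NJ])
{
    for (int i = 0; i < NI; ++i)
        for (int j = 0; j < NJ; ++j)
            A[i][j] = (DATA_TYPE)(i + j);
}

/* Hypothetical wrapper around the callsite directive. */
void run_conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
{
    /* The stencil reads neighbors of A, so input and output must differ. */
    assert((void *)A != (void *)B);
#pragma hmpp conv callsite
    conv2d(A, B);
}
```

With this `init`, the nine coefficients sum to 14 and the offset terms contribute 12, so every interior element should come out as `14 * (i + j) + 12`, which gives a quick sanity check on either the CPU or the accelerator path.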

10 Tuning GPU Execution in HMPP
- Default config. includes CPU-GPU transfer
- What if you want to keep data on GPU?
  - Additional functions use same data
  - Benchmarking
- advancedload / delegatedstore #pragmas
  - advancedload = pre-load data to accelerator
  - delegatedstore = write data from accelerator to main memory
- Use "allocate" and "release" pragmas to manage memory on accelerator
- #pragmas to synchronize kernel execution

11 Tuning GPU Execution in HMPP
2D convolution with memory loaded before kernel call and written back after

    int main(int argc, char *argv[])
    {
        DATA_TYPE A[NI][NJ];
        DATA_TYPE B_outputFromGpu[NI][NJ]; // GPU exec results

        // initialize the input array
        init(A);

        #pragma hmpp conv allocate
        #pragma hmpp conv advancedload, args[A;B]
        #pragma hmpp conv callsite, args[A;B].advancedload=true, asynchronous
        conv2d(A, B_outputFromGpu);
        #pragma hmpp conv synchronize
        #pragma hmpp conv delegatedstore, args[B]
        #pragma hmpp conv release

        return 0;
    }

12 Parallel Processing w/ HMPP Workbench
Convolution on GPU and CPU
- Compare output results and runtimes
- Use advancedload/delegatedstore with synchronization to time the kernel only
- Call the function without the callsite #pragma to run the codelet on the CPU
- Another function compares each output value to check that results match (within a certain threshold)
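The comparison function itself is not shown in the slides. A minimal sketch of one way to write it follows, assuming a relative tolerance expressed in percent; the threshold value, the helper names `percent_diff` and `compare_results`, and the small array sizes are all assumptions made for illustration. It returns the count that the slides report as "Number of misses".

```c
#include <assert.h>
#include <stdio.h>

/* Assumed for illustration. */
#define NI 8
#define NJ 8
typedef float DATA_TYPE;

/* Assumed tolerance in percent; the slides only say "a certain threshold". */
#define PERCENT_DIFF_THRESHOLD 0.05

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Relative difference of two values in percent.
   Two values that are both close to zero are treated as equal. */
static double percent_diff(double a, double b)
{
    double abs_a = abs_d(a), abs_b = abs_d(b);
    if (abs_a < 0.01 && abs_b < 0.01)
        return 0.0;
    return 100.0 * abs_d(a - b) / (abs_a > abs_b ? abs_a : abs_b);
}

/* Count interior elements whose CPU and GPU values differ by more than
   the threshold; border elements are never written by the kernel. */
int compare_results(DATA_TYPE cpu[NI][NJ], DATA_TYPE gpu[NI][NJ])
{
    int misses = 0;
    assert(cpu != NULL && gpu != NULL);
    for (int i = 1; i < NI - 1; ++i)
        for (int j = 1; j < NJ - 1; ++j)
            if (percent_diff(cpu[i][j], gpu[i][j]) > PERCENT_DIFF_THRESHOLD)
                ++misses;
    printf("Number of misses: %d\n", misses);
    return misses;
}
```

A relative rather than absolute tolerance is used here because CPU and GPU floating-point results can differ slightly in the last bits, and the acceptable absolute error grows with the magnitude of the output values.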

13 Parallel Processing w/ HMPP Workbench

    int main(int argc, char *argv[])
    {
        double t_start, t_end;
        DATA_TYPE A[NI][NJ], B[NI][NJ], B_outputFromGpu[NI][NJ];

        init(A);

        #pragma hmpp conv allocate
        #pragma hmpp conv advancedload, args[A;B]

        t_start = rtclock();
        #pragma hmpp conv callsite, args[A;B].advancedload=true, asynchronous
        conv2d(A, B_outputFromGpu); // run 2D convolution on accelerator
        #pragma hmpp conv synchronize
        t_end = rtclock();
        fprintf(stdout, "GPU Runtime: %0.6lf\n", t_end - t_start);

        #pragma hmpp conv delegatedstore, args[B]
        #pragma hmpp conv release

        t_start = rtclock();
        conv2d(A, B); // run 2D convolution on CPU
        t_end = rtclock();
        fprintf(stdout, "CPU Runtime: %0.6lf\n", t_end - t_start);

        // compare output on CPU and GPU to make sure results match
        compareResults(B, B_outputFromGpu);

        return 0;
    }
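The timing code calls `rtclock()`, which the slides do not define. One possible implementation is sketched below, assuming a POSIX system where `gettimeofday` is available; the body is an assumption and the original helper may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, with microsecond resolution.
   Assumed implementation: the slides only show calls to rtclock(). */
double rtclock(void)
{
    struct timeval tp;
    int status = gettimeofday(&tp, NULL);
    assert(status == 0);
    (void)status; /* keeps the variable used when assertions are disabled */
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}
```

Because the callsite is marked `asynchronous`, the second `rtclock()` call has to come after the `synchronize` directive; otherwise the measured interval would cover only the kernel launch and not the kernel execution.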

14 Running HMPP on cuda.acad...
- See README on course website...
  - Set environment variables for CUDA and HMPP
  - Kernel execution is same as for GPU programs in project 1
  - Doesn't work right now due to issue with license; hopefully will be fixed soon...
- 2D convolution output on fatalii (GTX 480)
  - Using array dimensions of 4096 X 4096 in all experiments
  GPU Runtime:
  CPU Runtime:
  Number of misses: 0

15 HMPP Transformations
Used to optimize code
- Permutation
- Unroll (`contiguous' and `split' options)
  - `Remainder' loop behavior
    - Default: allow `remainder loop'
    - `Guarded': if-statement in loop to check if current iteration in bounds for processing; no remainder loop
- Tiling
- Dimension of thread block / local work-group (default is 32 X 4)
- Specify loop(s) to parallelize in loop nest
  - By default outer two "parallelizable" loops are parallelized

16 HMPP Transformations Input/output code when using HMPP transformations:
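To make the unroll options concrete, the sketch below writes out by hand what an unroll by a factor of 2 could look like in its `contiguous' and `split' forms, with either a remainder loop or a guard. This is a hand-written approximation based on the option descriptions, not output from HMPP, and `body` is a hypothetical stand-in for any loop body without cross-iteration dependences.

```c
#include <assert.h>

/* Hypothetical loop body: independent work per iteration. */
static void body(int *out, int j) { out[j] = 3 * j + 1; }

/* Original loop. */
void loop_plain(int *out, int n)
{
    assert(n >= 0);
    for (int j = 0; j < n; ++j)
        body(out, j);
}

/* Unroll 2, contiguous: each trip handles iterations j and j + 1.
   Default remainder behavior: leftovers go to a separate loop. */
void loop_contiguous_remainder(int *out, int n)
{
    int j;
    assert(n >= 0);
    for (j = 0; j + 1 < n; j += 2) {
        body(out, j);
        body(out, j + 1);
    }
    for (; j < n; ++j) /* remainder loop: at most one iteration here */
        body(out, j);
}

/* Unroll 2, contiguous, guarded: no remainder loop; an if-statement
   checks that the second iteration is still in bounds. */
void loop_contiguous_guarded(int *out, int n)
{
    assert(n >= 0);
    for (int j = 0; j < n; j += 2) {
        body(out, j);
        if (j + 1 < n)
            body(out, j + 1);
    }
}

/* Unroll 2, split, guarded: the iteration space is cut into two halves
   and each trip handles iteration j and iteration j + half. */
void loop_split_guarded(int *out, int n)
{
    assert(n >= 0);
    int half = (n + 1) / 2;
    for (int j = 0; j < half; ++j) {
        body(out, j);
        if (j + half < n)
            body(out, j + half);
    }
}
```

When the unrolled loop is one of the parallelized loops, the two forms give each GPU thread a different pair of elements: adjacent ones for `contiguous', widely separated ones for `split'. That changes which addresses neighboring threads touch at the same time, which is one plausible reason the two options can perform differently.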

17 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling inner loop by a factor of 2?
  - Using default `contiguous' loop unroll option...
  - "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        // loops start at 1 so that A[i - 1] and A[j - 1] stay in bounds
        for (i = 1; i < NI - 1; i++)
            #pragma hmppcg unroll 2, guarded
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }

18 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling inner loop by a factor of 2?
  - Using default `contiguous' loop unroll option...
  - "Guarded" remainder loop behavior
Results on fatalii:
  CPU Runtime:
  Number of misses: 0
  GPU Runtime: (compared to in default config.)

19 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling inner loop by a factor of 2?
  - Using `split' loop unroll option...
  - "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        // loops start at 1 so that A[i - 1] and A[j - 1] stay in bounds
        for (i = 1; i < NI - 1; i++)
            #pragma hmppcg unroll 2, split
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }

20 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling inner loop by a factor of 2?
  - Using `split' loop unroll option...
  - "Guarded" remainder loop behavior
Results on fatalii:
  CPU Runtime:
  Number of misses: 0
  GPU Runtime: (compared to in default config.)

21 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling inner loop by a factor of 4?
  - Using `contiguous' loop unroll option...
  - Using `split' loop unroll option...
  - "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        // loops start at 1 so that A[i - 1] and A[j - 1] stay in bounds
        for (i = 1; i < NI - 1; i++)
            #pragma hmppcg unroll 4(, split)
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }

22 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling inner loop by a factor of 4?
Using `contiguous' loop unroll option w/ `guarded' remainder...
  GPU Runtime: (compared to in default config.)
  CPU Runtime:
  Number of misses: 0
Using `split' loop unroll option w/ `guarded' remainder...
  GPU Runtime: (compared to in default config.)
  CPU Runtime:
  Number of misses: 0

23 3D Convolution
Code:

    void conv3d(DATA_TYPE A[NI][NJ][NK], DATA_TYPE B[NI][NJ][NK])
    {
        int i, j, k;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        for (i = 1; i < NI - 1; ++i) // 0
            for (j = 1; j < NJ - 1; ++j) // 1
                for (k = 1; k < NK - 1; ++k) // 2
                    B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                               + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                               + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                               + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                               + c32 * A[i + 0][j + 1][k + 0] + c11 * A[i - 1][j - 1][k + 1]
                               + c13 * A[i + 1][j - 1][k + 1] + c21 * A[i - 1][j + 0][k + 1]
                               + c23 * A[i + 1][j + 0][k + 1] + c31 * A[i - 1][j + 1][k + 1]
                               + c33 * A[i + 1][j + 1][k + 1];
    }

24 3D Convolution
Initial results of HMPP parallelization on fatalii:
- Using array dimensions of 256 X 256 X 256 in all experiments
  GPU Runtime: s
  CPU Runtime: s
  Number of misses: 0

27 Code Transformations: 3D Convolution
3D Convolution Loop:

    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0] + c11 * A[i - 1][j - 1][k + 1]
                           + c13 * A[i + 1][j - 1][k + 1] + c21 * A[i - 1][j + 0][k + 1]
                           + c23 * A[i + 1][j + 0][k + 1] + c31 * A[i - 1][j + 1][k + 1]
                           + c33 * A[i + 1][j + 1][k + 1];

- What does HMPP parallelize by default?
  - First two "parallelizable" loops in loop nest
- Is this desirable for this kernel?

30 Code Transformations: 3D Convolution
3D Convolution Loop:

    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0] + c11 * A[i - 1][j - 1][k + 1]
                           + c13 * A[i + 1][j - 1][k + 1] + c21 * A[i - 1][j + 0][k + 1]
                           + c23 * A[i + 1][j + 0][k + 1] + c31 * A[i - 1][j + 1][k + 1]
                           + c33 * A[i + 1][j + 1][k + 1];

- How to adjust which loops are parallelized and in what order?
  - Re-order loops (permutation)
  - Use noparallel/parallel pragmas to specify which loops to parallelize

32 Code Transformations: 3D Convolution
3D Convolution Loop:

    #pragma hmppcg permute ??????????
    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0] + c11 * A[i - 1][j - 1][k + 1]
                           + c13 * A[i + 1][j - 1][k + 1] + c21 * A[i - 1][j + 0][k + 1]
                           + c23 * A[i + 1][j + 0][k + 1] + c31 * A[i - 1][j + 1][k + 1]
                           + c33 * A[i + 1][j + 1][k + 1];

Reordering loops
- What order(s) would work best for HMPP parallelization (think memory coalescence...)
  - Want 'k' loop to be the `inner' loop parallelized
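The preference for `k` follows from how C lays out a three-dimensional array. The sketch below computes the linear offset of `A[i][j][k]` in a row-major array; the small sizes and the helper name `offset3d` are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed small sizes; the experiments in the slides use 256 X 256 X 256. */
#define NI 4
#define NJ 5
#define NK 6

/* Linear element offset of A[i][j][k] in a row-major array A[NI][NJ][NK]. */
size_t offset3d(size_t i, size_t j, size_t k)
{
    assert(i < NI && j < NJ && k < NK);
    return (i * NJ + j) * NK + k;
}
```

A step of 1 in `k` moves one element in memory, a step in `j` moves `NK` elements, and a step in `i` moves `NJ * NK` elements. If neighboring GPU threads differ in `k`, they read and write adjacent addresses, which is the access pattern that memory coalescing rewards. With the default choice (the first two parallelizable loops, `i` and `j`), neighboring threads are at least `NK` elements apart.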

33 Code Transformations: 3D Convolution
3D Convolution Loop: loop order (i, k, j)

    #pragma hmppcg permute i, k, j
    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0] + c11 * A[i - 1][j - 1][k + 1]
                           + c13 * A[i + 1][j - 1][k + 1] + c21 * A[i - 1][j + 0][k + 1]
                           + c23 * A[i + 1][j + 0][k + 1] + c31 * A[i - 1][j + 1][k + 1]
                           + c33 * A[i + 1][j + 1][k + 1];

  GPU Runtime: s (compared to s in default config.)
  CPU Runtime: s
  Number of misses: 0

34 Code Transformations: 3D Convolution
3D Convolution Loop: loop order (j, k, i)

    #pragma hmppcg permute j, k, i
    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0] + c11 * A[i - 1][j - 1][k + 1]
                           + c13 * A[i + 1][j - 1][k + 1] + c21 * A[i - 1][j + 0][k + 1]
                           + c23 * A[i + 1][j + 0][k + 1] + c31 * A[i - 1][j + 1][k + 1]
                           + c33 * A[i + 1][j + 1][k + 1];

  GPU Runtime: s (compared to s in default config.)
  CPU Runtime: s
  Number of misses: 0
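Permutation is legal for this kernel because each iteration reads only `A` and writes a single distinct element of `B`, so the result does not depend on the order in which iterations run. The sketch below checks that on a hand-permuted loop nest. It is an illustration under assumptions: `point` is a simplified three-term stand-in for the 15-term stencil, `N` is a small assumed size, and the second function is written by hand rather than generated by the permute directive.

```c
#include <assert.h>

/* Assumed small cubic size for illustration. */
#define N 6
typedef float DATA_TYPE;

/* Simplified stand-in for the stencil: reads A only. */
static DATA_TYPE point(DATA_TYPE A[N][N][N], int i, int j, int k)
{
    return 2 * A[i - 1][j - 1][k - 1] - 3 * A[i][j][k] + 4 * A[i + 1][j + 1][k + 1];
}

/* Original loop order (i, j, k). */
void stencil_ijk(DATA_TYPE A[N][N][N], DATA_TYPE B[N][N][N])
{
    assert((void *)A != (void *)B);
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            for (int k = 1; k < N - 1; ++k)
                B[i][j][k] = point(A, i, j, k);
}

/* Loop order (j, k, i), written out by hand. */
void stencil_jki(DATA_TYPE A[N][N][N], DATA_TYPE B[N][N][N])
{
    assert((void *)A != (void *)B);
    for (int j = 1; j < N - 1; ++j)
        for (int k = 1; k < N - 1; ++k)
            for (int i = 1; i < N - 1; ++i)
                B[i][j][k] = point(A, i, j, k);
}
```

In both permuted orders from the slides, (i, k, j) and (j, k, i), the `k` loop ends up as the second of the two outer loops, which is the position that makes it the inner parallelized loop under the default rule.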

35 Code Transformations: 3D Convolution
3D Convolution Loop: Use noparallel/parallel to parallelize inner loop

    #pragma hmppcg noparallel
    for (i = 1; i < NI - 1; ++i) // 0
        #pragma hmppcg parallel
        for (j = 1; j < NJ - 1; ++j) // 1
            #pragma hmppcg parallel
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0] + c11 * A[i - 1][j - 1][k + 1]
                           + c13 * A[i + 1][j - 1][k + 1] + c21 * A[i - 1][j + 0][k + 1]
                           + c23 * A[i + 1][j + 0][k + 1] + c31 * A[i - 1][j + 1][k + 1]
                           + c33 * A[i + 1][j + 1][k + 1];

  GPU Runtime: s (compared to s in default config.)
  CPU Runtime: s
  Number of misses: 0

36 Pragma-based GPU programming
- Allows programmer to write code targeted toward GPUs without knowing CUDA / OpenCL
- Provides code transformations to potentially speed up code
  - Runtime of 2D and 3D convolution decreased when using specific transformations
- However, still does not give as much control as CUDA/OpenCL (example: shared/local memory usage)

Auto-tuning a High-Level Language Targeted to GPU Codes. By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos

Auto-tuning a High-Level Language Targeted to GPU Codes. By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos Auto-tuning a High-Level Language Targeted to GPU Codes By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos GPU Computing Utilization of GPU gives speedup on many algorithms

More information

Accelerating Financial Applications on the GPU

Accelerating Financial Applications on the GPU Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General

More information

Heterogeneous Multicore Parallel Programming

Heterogeneous Multicore Parallel Programming Innovative software for manycore paradigms Heterogeneous Multicore Parallel Programming S. Chauveau & L. Morin & F. Bodin Introduction Numerous legacy applications can benefit from GPU computing Many programming

More information

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011 MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to

More information

HMPP port. G. Colin de Verdière (CEA)

HMPP port. G. Colin de Verdière (CEA) HMPP port G. Colin de Verdière (CEA) Overview.Uchu prototype HMPP MOD2AS MOD2AM HMPP in a real code 2 The UCHU prototype Bull servers 1 login node 4 nodes 2 Haperton, 8GB 2 NVIDIA Tesla S1070 IB DDR Slurm

More information

CAPS Technology. ProHMPT, 2009 March12 th

CAPS Technology. ProHMPT, 2009 March12 th CAPS Technology ProHMPT, 2009 March12 th Overview of the Talk 1. HMPP in a nutshell Directives for Hardware Accelerators (HWA) 2. HMPP Code Generation Capabilities Efficient code generation for CUDA 3.

More information

Addressing Heterogeneity in Manycore Applications

Addressing Heterogeneity in Manycore Applications Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction

More information

Incremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010

Incremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010 Innovative software for manycore paradigms Incremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010 Introduction Many applications can benefit

More information

Code Migration Methodology for Heterogeneous Systems

Code Migration Methodology for Heterogeneous Systems Code Migration Methodology for Heterogeneous Systems Directives based approach using HMPP - OpenAcc F. Bodin, CAPS CTO Introduction Many-core computing power comes from parallelism o Multiple forms of

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Parallel Hybrid Computing F. Bodin, CAPS Entreprise

Parallel Hybrid Computing F. Bodin, CAPS Entreprise Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous

More information

OMP2HMPP: HMPP Source Code Generation from Programs with Pragma Extensions

OMP2HMPP: HMPP Source Code Generation from Programs with Pragma Extensions OMP2HMPP: HMPP Source Code Generation from Programs with Pragma Extensions Albert Saà-Garriga Universitat Autonòma de Barcelona Edifici Q,Campus de la UAB Bellaterra, Spain albert.saa@uab.cat David Castells-Rufas

More information

Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP

Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University

More information

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench

More information

Parallel Hybrid Computing Stéphane Bihan, CAPS

Parallel Hybrid Computing Stéphane Bihan, CAPS Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

How to Write Code that Will Survive the Many-Core Revolution

How to Write Code that Will Survive the Many-Core Revolution How to Write Code that Will Survive the Many-Core Revolution Write Once, Deploy Many(-Cores) Guillaume BARAT, EMEA Sales Manager CAPS worldwide ecosystem Customers Business Partners Involved in many European

More information

JCudaMP: OpenMP/Java on CUDA

JCudaMP: OpenMP/Java on CUDA JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems

More information

Accelerated Test Execution Using GPUs

Accelerated Test Execution Using GPUs Accelerated Test Execution Using GPUs Vanya Yaneva Supervisors: Ajitha Rajan, Christophe Dubach Mathworks May 27, 2016 The Problem Software testing is time consuming Functional testing The Problem Software

More information

OpenACC Fundamentals. Steve Abbott November 15, 2017

OpenACC Fundamentals. Steve Abbott November 15, 2017 OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

S Comparing OpenACC 2.5 and OpenMP 4.5

S Comparing OpenACC 2.5 and OpenMP 4.5 April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical

More information

Programming paradigms for GPU devices

Programming paradigms for GPU devices Programming paradigms for GPU devices OpenAcc Introduction Sergio Orlandini s.orlandini@cineca.it 1 OpenACC introduction express parallelism optimize data movements practical examples 2 3 Ways to Accelerate

More information

An Hybrid Data Transfer Optimization Technique for GPGPU

An Hybrid Data Transfer Optimization Technique for GPGPU An Hybrid Data Transfer Optimization Technique for GPGPU Eric Petit, François Bodin, Romain Dolbeau To cite this version: Eric Petit, François Bodin, Romain Dolbeau. An Hybrid Data Transfer Optimization

More information

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO Foreword How to write code that will survive the many-core revolution? is being setup as a collective

More information

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction

More information

OPTIMIZING THE PERFORMANCE OF DIRECTIVE-BASED PROGRAMMING MODEL FOR GPGPUS

OPTIMIZING THE PERFORMANCE OF DIRECTIVE-BASED PROGRAMMING MODEL FOR GPGPUS OPTIMIZING THE PERFORMANCE OF DIRECTIVE-BASED PROGRAMMING MODEL FOR GPGPUS A Dissertation Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of

More information

Parallel Programming. Libraries and implementations

Parallel Programming. Libraries and implementations Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

PGI Accelerator Programming Model for Fortran & C

PGI Accelerator Programming Model for Fortran & C PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...

More information

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops

Parallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading

More information

Objective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC

Objective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application

More information

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC Jeff Larkin, NVIDIA Developer Technologies AGENDA Accelerated Computing Basics What are Compiler Directives? Accelerating Applications with OpenACC Identifying

More information

COMP Parallel Computing. Programming Accelerators using Directives

COMP Parallel Computing. Programming Accelerators using Directives COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator

More information

Blue Waters Programming Environment

Blue Waters Programming Environment December 3, 2013 Blue Waters Programming Environment Blue Waters User Workshop December 3, 2013 Science and Engineering Applications Support Documentation on Portal 2 All of this information is Available

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 /CPU,a),2,2 2,2 Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 XMP XMP-dev CPU XMP-dev/StarPU XMP-dev XMP CPU StarPU CPU /CPU XMP-dev/StarPU N /CPU CPU. Graphics Processing Unit GP General-Purpose

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES 3 APPROACHES TO GPU PROGRAMMING Applications Libraries Compiler Directives

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC Course Class #1 Q&A Contents OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC/CUDA/OpenMP Q: Is OpenACC an NVIDIA standard or is it accepted

Parallel Hybrid Computing F. Bodin, CAPS Entreprise

Parallel Hybrid Computing F. Bodin, CAPS Entreprise Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016 OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators

OPENACC ONLINE COURSE 2018

OPENACC ONLINE COURSE 2018 OPENACC ONLINE COURSE 2018 Week 3 Loop Optimizations with OpenACC Jeff Larkin, Senior DevTech Software Engineer, NVIDIA ABOUT THIS COURSE 3 Part Introduction to OpenACC Week 1 Introduction to OpenACC Week

Automatic Testing of OpenACC Applications

Automatic Testing of OpenACC Applications Automatic Testing of OpenACC Applications Khalid Ahmad School of Computing/University of Utah Michael Wolfe NVIDIA/PGI November 13th, 2017 Why Test? When optimizing or porting Validate the optimization

Portability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17

Portability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17 Portability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17 11/27/2017 Background Many developers choose OpenMP in hopes of having a single source code that runs effectively anywhere (performance

PathScale ENZO GTC12 S0631 Programming Heterogeneous Many-Cores Using Directives. C. Bergström May 14th, 2012

PathScale ENZO GTC12 S0631 Programming Heterogeneous Many-Cores Using Directives. C. Bergström May 14th, 2012 PathScale ENZO GTC12 S0631 Programming Heterogeneous Many-Cores Using Directives C. Bergström May 14th, 2012 Brief Introduction to ENZO 2 PathScale GTC12 S0631 Tutorial May 14th, 2012 ENZO Overview & Goals

MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA

MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer William Killian Tom Scogland, Adam Kunen John Cavazos Millersville University of Pennsylvania

A Uniform Programming Model for Petascale Computing

A Uniform Programming Model for Petascale Computing A Uniform Programming Model for Petascale Computing Barbara Chapman University of Houston WPSE 2009, Tsukuba March 25, 2009 High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools Agenda

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software GPU Debugging Made Easy David Lecomber CTO, Allinea Software david@allinea.com Allinea Software HPC development tools company Leading in HPC software tools market Wide customer base Blue-chip engineering,

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

PGI Fortran & C Accelerator Programming Model. The Portland Group

PGI Fortran & C Accelerator Programming Model. The Portland Group PGI Fortran & C Accelerator Programming Model The Portland Group Published: v0.72 December 2008 Contents 1. Introduction...3 1.1 Scope...3 1.2 Glossary...3 1.3 Execution Model...4 1.4 Memory Model...5

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences

Little Motivation Outline Introduction OpenMP Architecture Working with OpenMP Future of OpenMP End. OpenMP. Amasis Brauch German University in Cairo

Little Motivation Outline Introduction OpenMP Architecture Working with OpenMP Future of OpenMP End. OpenMP. Amasis Brauch German University in Cairo OpenMP Amasis Brauch German University in Cairo May 4, 2010 Simple Algorithm void incrementer(short array) { long i; for (i = 0; i < 1000000; i++) { array[i]++;

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can't be done statically Understand evolution of architectures Enable new programming models Put intelligence into

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016 Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2016 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait

Performance Diagnosis for Hybrid CPU/GPU Environments

Performance Diagnosis for Hybrid CPU/GPU Environments Performance Diagnosis for Hybrid CPU/GPU Environments Michael M. Smith and Karen L. Karavanic Computer Science Department Portland State University Performance Diagnosis for Hybrid CPU/GPU Environments

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

Parallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian

Parallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian Parallelism III MPI, Vectorization, OpenACC, OpenCL John Cavazos,Tristan Vanderbruggen, and Will Killian Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction MPI

Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA

Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle,

Porting COSMO to Hybrid Architectures

Porting COSMO to Hybrid Architectures Porting COSMO to Hybrid Architectures T. Gysi 1, O. Fuhrer 2, C. Osuna 3, X. Lapillonne 3, T. Diamanti 3, B. Cumming 4, T. Schroeder 5, P. Messmer 5, T. Schulthess 4,6,7 [1] Supercomputing Systems AG,

Parallel Programming. OpenMP Parallel programming for multiprocessors for loops

Parallel Programming. OpenMP Parallel programming for multiprocessors for loops Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Tom Henderson NOAA/OAR/ESRL/GSD/ACE Thomas.B.Henderson@noaa.gov Mark Govett, Jacques Middlecoff Paul Madden,

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

An Introduction to OpenAcc

An Introduction to OpenAcc An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by

Getting Started with Directive-based Acceleration: OpenACC

Getting Started with Directive-based Acceleration: OpenACC Getting Started with Directive-based Acceleration: OpenACC Ahmad Lashgar Member of High-Performance Computing Research Laboratory, School of Computer Science Institute for Research in Fundamental Sciences

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic

Introduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University

Introduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University Introduction to OpenACC Shaohao Chen Research Computing Services Information Services and Technology Boston University Outline Introduction to GPU and OpenACC Basic syntax and the first OpenACC program:

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to accelerate your applications with OpenACC. 2 Today's Objectives Understand what OpenACC is and

OpenACC Course Lecture 1: Introduction to OpenACC September 2015

OpenACC Course Lecture 1: Introduction to OpenACC September 2015 OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView NVIDIA CUDA Intel Phi 2

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin, November 14, 2016 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

A General Discussion on Parallelism

A General Discussion on Parallelism Lecture 2 A General Discussion on Parallelism John Cavazos Dept of Computer & Information Sciences University of Delaware www.cis.udel.edu/~cavazos/cisc879 Lecture 2: Overview Flynn's Taxonomy of

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit

Designing and Optimizing LQCD code using OpenACC

Designing and Optimizing LQCD code using OpenACC Designing and Optimizing LQCD code using OpenACC E Calore, S F Schifano, R Tripiccione Enrico Calore University of Ferrara and INFN-Ferrara, Italy GPU Computing in High Energy Physics Pisa, Sep. 10th,

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)

Parallel Programming Overview

Parallel Programming Overview Parallel Programming Overview Introduction to High Performance Computing 2019 Dr Christian Terboven 1 Agenda n Our Support Offerings n Programming concepts and models for Cluster Node Core Accelerator

Early Experiences With The OpenMP Accelerator Model

Early Experiences With The OpenMP Accelerator Model Early Experiences With The OpenMP Accelerator Model Chunhua Liao 1, Yonghong Yan 2, Bronis R. de Supinski 1, Daniel J. Quinlan 1 and Barbara Chapman 2 1 Center for Applied Scientific Computing, Lawrence

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

GPU Profiling and Optimization. Scott Grauer-Gray

GPU Profiling and Optimization. Scott Grauer-Gray GPU Profiling and Optimization Scott Grauer-Gray Benefits of GPU Programming "Free" speedup with new architectures More cores in new architecture Improved features such as L1 and L2 cache Increased shared/local

OmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel

OmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel OmpSs Fundamentals ISC 2017: OpenSuCo Xavier Teruel Outline OmpSs brief introduction OmpSs overview and influence in OpenMP Execution model and parallelization approaches Memory model and target copies

INTRODUCTION TO OPENACC

INTRODUCTION TO OPENACC INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/

Early Experiences with the OpenMP Accelerator Model

Early Experiences with the OpenMP Accelerator Model Early Experiences with the OpenMP Accelerator Model Canberra, Australia, IWOMP 2013, Sep. 17th * University of Houston LLNL-PRES- 642558 This work was performed under the auspices of the U.S. Department

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région

Compiler Tools for HighLevel Parallel Languages

Compiler Tools for HighLevel Parallel Languages Compiler Tools for HighLevel Parallel Languages Paul Keir Codeplay Software Ltd. LEAP Conference May 21st 2013 Presentation Outline Introduction EU Framework 7 Project: LPGPU Offload C++ for PS3 Memory

Programming NVIDIA GPUs with OpenACC Directives

Programming NVIDIA GPUs with OpenACC Directives Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe michael.wolfe@pgroup.com http://www.pgroup.com/accelerate Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe mwolfe@nvidia.com http://www.pgroup.com/accelerate

Parallel Programming Concepts. What kind of programming model can bridge the gap? Dr. Peter Tröger M.Sc. Frank Feinbube

Parallel Programming Concepts. What kind of programming model can bridge the gap? Dr. Peter Tröger M.Sc. Frank Feinbube Parallel Programming Concepts What kind of programming model can bridge the gap? Dr. Peter Tröger M.Sc. Frank Feinbube 5 Hybrid System GDDR5 Dual Gigabit LAN Dual Gigabit LAN GDDR5 DDR3 Core CPU Core 16x

Thinking Outside of the Tera-Scale Box. Piotr Luszczek

Thinking Outside of the Tera-Scale Box. Piotr Luszczek Thinking Outside of the Tera-Scale Box Piotr Luszczek Brief History of Tera-flop: 1997 1997 ASCI Red Brief History of Tera-flop: 2007 Intel Polaris 2007 1997 ASCI Red Brief History of Tera-flop: GPGPU

Compiling a High-level Directive-Based Programming Model for GPGPUs

Compiling a High-level Directive-Based Programming Model for GPGPUs Compiling a High-level Directive-Based Programming Model for GPGPUs Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman Department of Computer Science, University

INTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016

INTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016 INTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Course Syllabus Oct 26: Analyzing and Parallelizing with OpenACC

OpenACC2 vs. OpenMP4. James Lin 1,2 and Satoshi Matsuoka 2

OpenACC2 vs. OpenMP4. James Lin 1,2 and Satoshi Matsuoka 2 2014@San Jose Shanghai Jiao Tong University Tokyo Institute of Technology OpenACC2 vs. OpenMP4 The Strong, the Weak, and the Missing to Develop Performance Portable Applications on GPU and Xeon Phi James
