Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray
2 Pragma-based GPU programming
- Write programs for GPU processing without (directly) using CUDA/OpenCL
- Place pragmas to drive processing on the accelerator
- The compiler does the work of generating CUDA/OpenCL code
- Pushed by NVIDIA in an effort to make GPU programming more mainstream
  - Developed the OpenACC standard with PGI, Cray, and CAPS
  - OpenACC will eventually be supported by compilers from multiple vendors
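As a rough illustration of the pragma-based style described on this slide, the sketch below annotates an ordinary C loop with an OpenACC directive. The function name `saxpy` and its parameters are illustrative and do not come from the talk. A compiler without OpenACC support ignores the pragma and runs the loop sequentially on the CPU, which is the sense in which the original source is preserved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example (not from the slides): y = a*x + y.
 * An OpenACC compiler may offload this loop to an accelerator and generate
 * the device code itself; other compilers ignore the pragma. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* copyin: x is only read on the device; copy: y is read and written back. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Calling `saxpy(4, 2.0f, x, y)` gives the same result whether or not the directive is honored; only where the loop runs changes.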
3 Pragma-based GPU programming
- NVIDIA performance improvement guarantee when using pragma-based programming on their GPUs
4 HMPP Workbench
- Developed by CAPS Entreprise
- Directive-based, multi-language, and multi-target hybrid programming model
- Directives similar to OpenMP
- Parallelize sequential code for multiple architectures
- Preserve original source code
5 HMPP Workbench
- Currently supported on NVIDIA and AMD GPUs
- Supports high-level code written in C and Fortran (assuming the user has a license)
- Can target CUDA and OpenCL environments
- Working on an open standard called OpenHMPP
- Will be supported on additional accelerators in the future
6 Parallel Processing w/ HMPP Workbench
- Sequential code: 2D convolution
- Code in void function:

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        // start at 1 and stop before the last index so every A[i-1], A[i+1] access stays in bounds
        for (i = 1; i < NI - 1; i++)
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }
7 Parallel Processing w/ HMPP Workbench
- Parallelize the 2D convolution code w/ HMPP
- Use a #pragma before the function heading to turn the function into a "codelet" targeted for processing on the accelerator
- Use a #pragma at the function call to execute the function in parallel on the accelerator
8 Parallel Processing w/ HMPP Workbench
- 2D convolution function w/ HMPP #pragma
- Turns the function into a codelet for parallel processing

    #pragma hmpp conv codelet, target=opencl, args[A].io=in, args[B].io=inout
    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        for (i = 1; i < NI - 1; i++)
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }
9 Parallel Processing w/ HMPP Workbench
- Call to function w/ #pragma for parallel processing
- Without the #pragma, the function will execute on the CPU

    int main(int argc, char *argv[])
    {
        DATA_TYPE A[NI][NJ];
        DATA_TYPE B_outputFromGpu[NI][NJ];  // GPU exec results

        // initialize the input array
        init(A);

        #pragma hmpp conv callsite
        conv2d(A, B_outputFromGpu);

        return 0;
    }
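The codelet and callsite slides can be combined into one small translation unit. The sketch below reuses the two directives exactly as they appear on the slides (with argument names matching `A` and `B`). The array sizes, the `float` element type, and the driver function `conv2d_center_of_ones` are assumptions made for illustration. An ordinary C compiler ignores the `hmpp` pragmas, so this version runs on the CPU, which matches the slide's remark about calls without the pragma.

```c
#include <assert.h>

#define NI 8                /* small sizes chosen for illustration only */
#define NJ 8
#define DATA_TYPE float     /* assumed element type */

/* Codelet directive as shown on the slide; ignored by non-HMPP compilers. */
#pragma hmpp conv codelet, target=opencl, args[A].io=in, args[B].io=inout
void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
{
    int i, j;
    DATA_TYPE c11 = +2, c21 = +5, c31 = -8;
    DATA_TYPE c12 = -3, c22 = +6, c32 = -9;
    DATA_TYPE c13 = +4, c23 = +7, c33 = +10;

    /* Interior points only, so every neighbor access stays in bounds. */
    for (i = 1; i < NI - 1; i++)
        for (j = 1; j < NJ - 1; j++)
            B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                    + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                    + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
}

/* Hypothetical driver: fills A with ones, runs the codelet through a
 * callsite, and returns one interior output value. */
DATA_TYPE conv2d_center_of_ones(void)
{
    static DATA_TYPE A[NI][NJ], B[NI][NJ];
    int i, j;

    assert(NI >= 3 && NJ >= 3);   /* need at least one interior point */
    for (i = 0; i < NI; i++)
        for (j = 0; j < NJ; j++) {
            A[i][j] = 1;
            B[i][j] = 0;
        }

    #pragma hmpp conv callsite
    conv2d(A, B);

    return B[NI / 2][NJ / 2];
}
```

With an all-ones input, each interior output equals the sum of the nine coefficients, which gives a quick sanity check for either execution path.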
10 Tuning GPU Execution in HMPP
- Default config. includes CPU-GPU transfer
- What if you want to keep data on the GPU?
  - Additional functions use the same data
  - Benchmarking
- advancedload / delegatedstore #pragmas
  - advancedload = pre-load data to the accelerator
  - delegatedstore = write data from the accelerator to main memory
- Use "allocate" and "release" pragmas to manage memory on the accelerator
- #pragmas to synchronize kernel execution
11 Tuning GPU Execution in HMPP
- 2D convolution with memory loaded before the kernel call and written back after

    int main(int argc, char *argv[])
    {
        DATA_TYPE A[NI][NJ];
        DATA_TYPE B_outputFromGpu[NI][NJ];  // GPU exec results

        // initialize the input array
        init(A);

        #pragma hmpp conv allocate
        #pragma hmpp conv advancedload, args[A;B]
        #pragma hmpp conv callsite, args[A;B].advancedload=true, asynchronous
        conv2d(A, B_outputFromGpu);
        #pragma hmpp conv synchronize
        #pragma hmpp conv delegatedstore, args[B]
        #pragma hmpp conv release

        return 0;
    }
12 Parallel Processing w/ HMPP Workbench
- Convolution on GPU and CPU
- Compare output results and runtimes
- Use advancedload/delegatedstore with synchronization to time the kernel only
- Call the function without the callsite #pragma to run the codelet on the CPU
- Another function compares each output value to check that results match (within a certain threshold)
13 Parallel Processing w/ HMPP Workbench

    int main(int argc, char *argv[])
    {
        double t_start, t_end;
        DATA_TYPE A[NI][NJ], B[NI][NJ], B_outputFromGpu[NI][NJ];

        init(A);

        #pragma hmpp conv allocate
        #pragma hmpp conv advancedload, args[A;B]

        t_start = rtclock();
        #pragma hmpp conv callsite, args[A;B].advancedload=true, asynchronous
        conv2d(A, B_outputFromGpu);  // run 2D convolution on accelerator
        #pragma hmpp conv synchronize
        t_end = rtclock();
        fprintf(stdout, "GPU Runtime: %0.6lf\n", t_end - t_start);

        #pragma hmpp conv delegatedstore, args[B]
        #pragma hmpp conv release

        t_start = rtclock();
        conv2d(A, B);  // run 2D convolution on CPU
        t_end = rtclock();
        fprintf(stdout, "CPU Runtime: %0.6lf\n", t_end - t_start);

        // compare output on CPU and GPU to make sure results match
        compareresults(B, B_outputFromGpu);

        return 0;
    }
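The slides do not show the bodies of the timing and comparison helpers. The sketch below is one plausible shape for them, not the code used in the talk: `rtclock_sketch` reads wall-clock time through POSIX `gettimeofday`, and `count_misses` counts outputs whose relative difference exceeds a percentage threshold, in the spirit of the "Number of misses" line in the results. All names, the flat-array interface, and the choice of a relative threshold are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds; a stand-in for the rtclock() call on the slide. */
double rtclock_sketch(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}

static double abs_d(double v)
{
    return v < 0.0 ? -v : v;
}

/* Relative difference in percent; two near-zero values count as equal. */
static double percent_diff(double a, double b)
{
    double denom = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    if (denom < 1.0e-12)
        return 0.0;
    return 100.0 * abs_d(a - b) / denom;
}

/* Number of positions where CPU and GPU results disagree by more than
 * threshold_percent. A 2D array can be passed as a pointer to its first element. */
size_t count_misses(const float *cpu, const float *gpu, size_t n,
                    double threshold_percent)
{
    size_t misses = 0;

    assert(cpu != NULL && gpu != NULL);
    for (size_t i = 0; i < n; i++)
        if (percent_diff(cpu[i], gpu[i]) > threshold_percent)
            misses++;
    return misses;
}
```

A threshold is needed because the accelerator and the CPU may round floating-point operations differently, so bit-exact equality is too strict a test.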
14 Running HMPP on cuda.acad...
- See README on course website...
- Set environment variables for CUDA and HMPP
- Kernel execution is the same as for GPU programs in project 1
- Doesn't work right now due to an issue with the license; hopefully will be fixed soon...
- 2D convolution output on fatalii (GTX 480)
- Using array dimensions of 4096 X 4096 in all experiments
    GPU Runtime:
    CPU Runtime:
    Number of misses: 0
15 HMPP Transformations
- Used to optimize code
- Permutation
- Unroll (`contiguous' and `split' options)
- `Remainder' loop behavior
  - Default: allow `remainder loop'
  - `Guarded': if-statement in loop to check if current iteration is in bounds for processing; no remainder loop
- Tiling
- Dimension of thread block / local work-group (default is 32 X 4)
- Specify loop(s) to parallelize in loop nest
  - By default the outer two "parallelizable" loops are parallelized
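The difference between the two remainder behaviors is easier to see on a hand-unrolled loop. The sketch below is written by hand for illustration and is not output produced by HMPP. It unrolls a simple scaling loop by a factor of 2 in the contiguous style, once with a separate remainder loop and once with a guard inside the body. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration. */
void scale_plain(size_t n, const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    for (size_t j = 0; j < n; j++)
        out[j] = 2.0f * in[j];
}

/* Contiguous unroll by 2 with a remainder loop: the main loop handles
 * pairs (j, j+1); a second loop handles the leftover element when n is odd. */
void scale_unroll2_remainder(size_t n, const float *in, float *out)
{
    size_t j;

    for (j = 0; j + 1 < n; j += 2) {
        out[j]     = 2.0f * in[j];
        out[j + 1] = 2.0f * in[j + 1];
    }
    for (; j < n; j++)            /* remainder loop */
        out[j] = 2.0f * in[j];
}

/* Contiguous unroll by 2, guarded: no remainder loop; an if-statement
 * checks that the second copy of the body is still in bounds. */
void scale_unroll2_guarded(size_t n, const float *in, float *out)
{
    for (size_t j = 0; j < n; j += 2) {
        out[j] = 2.0f * in[j];
        if (j + 1 < n)            /* guard */
            out[j + 1] = 2.0f * in[j + 1];
    }
}
```

All three functions produce the same output for any `n`; the guarded form trades an extra branch per iteration for a single loop.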
16 HMPP Transformations Input/output code when using HMPP transformations:
17 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling the inner loop by a factor of 2?
  - Using default `contiguous' loop unroll option...
  - "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        for (i = 1; i < NI - 1; i++)
            #pragma hmppcg unroll 2, guarded
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }
18 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling the inner loop by a factor of 2?
  - Using default `contiguous' loop unroll option...
  - "Guarded" remainder loop behavior
- Results on fatalii:
    CPU Runtime:
    Number of misses: 0
    GPU Runtime: (compared to in default config.)
19 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling the inner loop by a factor of 2?
  - Using `split' loop unroll option...
  - "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        for (i = 1; i < NI - 1; i++)
            #pragma hmppcg unroll 2, split
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }
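The slide does not spell out what the `split' option does. The sketch below assumes the usual reading: instead of pairing neighboring iterations (j, j+1), a split unroll by 2 pairs iterations that lie half the iteration range apart. It is a hand-written illustration on a 1D three-point stencil, not HMPP output, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: three-point sum over the interior points 1 .. n-2. */
void stencil_plain(size_t n, const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    if (n < 3)
        return;
    for (size_t j = 1; j < n - 1; j++)
        out[j] = in[j - 1] + in[j] + in[j + 1];
}

/* Split unroll by 2, guarded (assumed semantics): iteration t handles
 * point j0 in the first half of the range and point j1 = j0 + half in the
 * second half. The guard skips j1 when the range length is odd. */
void stencil_unroll2_split(size_t n, const float *in, float *out)
{
    if (n < 3)
        return;

    size_t lo = 1, hi = n - 1;           /* interior range [lo, hi) */
    size_t half = (hi - lo + 1) / 2;     /* ceil(range length / 2)  */

    for (size_t t = 0; t < half; t++) {
        size_t j0 = lo + t;
        size_t j1 = j0 + half;

        out[j0] = in[j0 - 1] + in[j0] + in[j0 + 1];
        if (j1 < hi)                     /* guard */
            out[j1] = in[j1 - 1] + in[j1] + in[j1 + 1];
    }
}
```

One reason this layout may matter on a GPU: if the iterations `t` are spread across threads, neighboring threads still touch neighboring elements within each copy of the body, whereas a contiguous unroll makes neighboring threads access elements two apart.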
20 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling the inner loop by a factor of 2?
  - Using `split' loop unroll option...
  - "Guarded" remainder loop behavior
- Results on fatalii:
    CPU Runtime:
    Number of misses: 0
    GPU Runtime: (compared to in default config.)
21 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling the inner loop by a factor of 4?
  - Using `contiguous' loop unroll option...
  - Using `split' loop unroll option...
  - "Guarded" remainder loop behavior

    void conv2d(DATA_TYPE A[NI][NJ], DATA_TYPE B[NI][NJ])
    {
        int i, j;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        for (i = 1; i < NI - 1; i++)
            #pragma hmppcg unroll 4(, split)
            for (j = 1; j < NJ - 1; j++)
                B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                        + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                        + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
    }
22 Code Transformations: 2D Convolution
- Focus on inner loop
- What happens when unrolling the inner loop by a factor of 4?
- Using `contiguous' loop unroll option w/ `guarded' remainder...
    GPU Runtime: (compared to in default config.)
    CPU Runtime:
    Number of misses: 0
- Using `split' loop unroll option w/ `guarded' remainder...
    GPU Runtime: (compared to in default config.)
    CPU Runtime:
    Number of misses: 0
23 3D Convolution
- Code:

    void conv3d(DATA_TYPE A[NI][NJ][NK], DATA_TYPE B[NI][NJ][NK])
    {
        int i, j, k;
        DATA_TYPE c11, c12, c13, c21, c22, c23, c31, c32, c33;

        c11 = +2; c21 = +5; c31 = -8;
        c12 = -3; c22 = +6; c32 = -9;
        c13 = +4; c23 = +7; c33 = +10;

        for (i = 1; i < NI - 1; ++i) // 0
            for (j = 1; j < NJ - 1; ++j) // 1
                for (k = 1; k < NK - 1; ++k) // 2
                    B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                               + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                               + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                               + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                               + c32 * A[i + 0][j + 1][k + 0]
                               + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                               + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                               + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];
    }
24 3D Convolution
- Initial results of HMPP parallelization on fatalii:
- Using array dimensions of 256 X 256 X 256 in all experiments
    GPU Runtime: s
    CPU Runtime: s
    Number of misses: 0
25-30 Code Transformations: 3D Convolution
- 3D convolution loop:

    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0]
                           + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                           + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                           + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

- What does HMPP parallelize by default?
  - First two "parallelizable" loops in loop nest
  - Is this desirable for this kernel?
- How to adjust which loops are parallelized and in what order?
  - Re-order loops (permutation)
  - Use noparallel/parallel pragmas to specify which loops to parallelize
31-32 Code Transformations: 3D Convolution
- 3D convolution loop:

    #pragma hmppcg permute ??????????
    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0]
                           + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                           + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                           + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

- Reordering loops
  - What order(s) would work best for HMPP parallelization (think memory coalescing...)?
  - Want the `k' loop to be the `inner' loop parallelized
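The coalescing argument comes down to address arithmetic. C stores `B[NI][NJ][NK]` in row-major order, so the element `B[i][j][k]` sits at linear offset `(i*NJ + j)*NK + k`. The sketch below computes the stride between neighboring iterations of each loop, using the 256 X 256 X 256 dimensions quoted for the experiments. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NI 256
#define NJ 256
#define NK 256

/* Row-major offset (in elements) of B[i][j][k] within B[NI][NJ][NK]. */
size_t linear_offset(size_t i, size_t j, size_t k)
{
    assert(i < NI && j < NJ && k < NK);
    return (i * NJ + j) * NK + k;
}

/* Distance in elements between the accesses of two neighboring iterations
 * of each loop. */
size_t stride_k(void) { return linear_offset(1, 1, 2) - linear_offset(1, 1, 1); }
size_t stride_j(void) { return linear_offset(1, 2, 1) - linear_offset(1, 1, 1); }
size_t stride_i(void) { return linear_offset(2, 1, 1) - linear_offset(1, 1, 1); }
```

Neighboring `k` iterations are 1 element apart, neighboring `j` iterations are NK = 256 elements apart, and neighboring `i` iterations are NJ*NK = 65536 elements apart. Threads that differ only in `k` therefore read and write adjacent memory, which is why the `k` loop is the one to map to the fastest-varying thread index.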
33 Code Transformations: 3D Convolution
- 3D convolution loop: loop order (i, k, j)

    #pragma hmppcg permute i, k, j
    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0]
                           + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                           + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                           + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

    GPU Runtime: s (compared to s in default config.)
    CPU Runtime: s
    Number of misses: 0
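To make the effect of the permutation concrete, the sketch below writes the reordered nest by hand, assuming that `permute i, k, j` lists the new loop order from outermost to innermost. It uses a reduced three-term stencil and a small array so it stays short. The sizes, the function names, and the reduced stencil are assumptions for illustration and this is not code generated by HMPP.

```c
#include <assert.h>

#define N 6                 /* small size chosen for illustration only */
#define DATA_TYPE float     /* assumed element type */

/* Original loop order (i, j, k). */
void stencil_ijk(DATA_TYPE A[N][N][N], DATA_TYPE B[N][N][N])
{
    int i, j, k;

    assert(N >= 3);
    for (i = 1; i < N - 1; ++i)
        for (j = 1; j < N - 1; ++j)
            for (k = 1; k < N - 1; ++k)
                B[i][j][k] = 2 * A[i - 1][j - 1][k - 1]
                           + 6 * A[i][j][k]
                           + 10 * A[i + 1][j + 1][k + 1];
}

/* Permuted loop order (i, k, j): same body, loops j and k exchanged. */
void stencil_ikj(DATA_TYPE A[N][N][N], DATA_TYPE B[N][N][N])
{
    int i, j, k;

    for (i = 1; i < N - 1; ++i)
        for (k = 1; k < N - 1; ++k)
            for (j = 1; j < N - 1; ++j)
                B[i][j][k] = 2 * A[i - 1][j - 1][k - 1]
                           + 6 * A[i][j][k]
                           + 10 * A[i + 1][j + 1][k + 1];
}
```

The permutation is legal here because every iteration writes a distinct element of `B` and only reads `A`, so no iteration depends on another and any loop order yields the same output. What changes is which loops end up in the outer two positions that are parallelized by default.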
34 Code Transformations: 3D Convolution
- 3D convolution loop: loop order (j, k, i)

    #pragma hmppcg permute j, k, i
    for (i = 1; i < NI - 1; ++i) // 0
        for (j = 1; j < NJ - 1; ++j) // 1
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0]
                           + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                           + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                           + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

    GPU Runtime: s (compared to s in default config.)
    CPU Runtime: s
    Number of misses: 0
35 Code Transformations: 3D Convolution
- 3D convolution loop: use noparallel/parallel to parallelize inner loop

    #pragma hmppcg noparallel
    for (i = 1; i < NI - 1; ++i) // 0
        #pragma hmppcg parallel
        for (j = 1; j < NJ - 1; ++j) // 1
            #pragma hmppcg parallel
            for (k = 1; k < NK - 1; ++k) // 2
                B[i][j][k] = c11 * A[i - 1][j - 1][k - 1] + c13 * A[i + 1][j - 1][k - 1]
                           + c21 * A[i - 1][j - 1][k - 1] + c23 * A[i + 1][j - 1][k - 1]
                           + c31 * A[i - 1][j - 1][k - 1] + c33 * A[i + 1][j - 1][k - 1]
                           + c12 * A[i + 0][j - 1][k + 0] + c22 * A[i + 0][j + 0][k + 0]
                           + c32 * A[i + 0][j + 1][k + 0]
                           + c11 * A[i - 1][j - 1][k + 1] + c13 * A[i + 1][j - 1][k + 1]
                           + c21 * A[i - 1][j + 0][k + 1] + c23 * A[i + 1][j + 0][k + 1]
                           + c31 * A[i - 1][j + 1][k + 1] + c33 * A[i + 1][j + 1][k + 1];

    GPU Runtime: s (compared to s in default config.)
    CPU Runtime: s
    Number of misses: 0
36 Pragma-based GPU programming
- Allows the programmer to write code targeted toward GPUs without knowing CUDA / OpenCL
- Provides code transformations to potentially speed up code
  - Runtime of 2D and 3D convolution decreased when using specific transformations
- However, it still does not give as much control as CUDA/OpenCL (example: shared/local memory usage)
Auto-tuning a High-Level Language Targeted to GPU Codes. By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos
Auto-tuning a High-Level Language Targeted to GPU Codes By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos GPU Computing Utilization of GPU gives speedup on many algorithms
More informationAccelerating Financial Applications on the GPU
Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General
More informationHeterogeneous Multicore Parallel Programming
Innovative software for manycore paradigms Heterogeneous Multicore Parallel Programming S. Chauveau & L. Morin & F. Bodin Introduction Numerous legacy applications can benefit from GPU computing Many programming
More informationMIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011
MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to
More informationHMPP port. G. Colin de Verdière (CEA)
HMPP port G. Colin de Verdière (CEA) Overview.Uchu prototype HMPP MOD2AS MOD2AM HMPP in a real code 2 The UCHU prototype Bull servers 1 login node 4 nodes 2 Haperton, 8GB 2 NVIDIA Tesla S1070 IB DDR Slurm
More informationCAPS Technology. ProHMPT, 2009 March12 th
CAPS Technology ProHMPT, 2009 March12 th Overview of the Talk 1. HMPP in a nutshell Directives for Hardware Accelerators (HWA) 2. HMPP Code Generation Capabilities Efficient code generation for CUDA 3.
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More informationIncremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010
Innovative software for manycore paradigms Incremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010 Introduction Many applications can benefit
More informationCode Migration Methodology for Heterogeneous Systems
Code Migration Methodology for Heterogeneous Systems Directives based approach using HMPP - OpenAcc F. Bodin, CAPS CTO Introduction Many-core computing power comes from parallelism o Multiple forms of
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationOMP2HMPP: HMPP Source Code Generation from Programs with Pragma Extensions
OMP2HMPP: HMPP Source Code Generation from Programs with Pragma Extensions Albert Saà-Garriga Universitat Autonòma de Barcelona Edifici Q,Campus de la UAB Bellaterra, Spain albert.saa@uab.cat David Castells-Rufas
More informationAcceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP
Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University
More informationOpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware
OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench
More informationParallel Hybrid Computing Stéphane Bihan, CAPS
Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware
More informationDealing with Heterogeneous Multicores
Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism
More informationParallel Programming Libraries and implementations
Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationHow to Write Code that Will Survive the Many-Core Revolution
How to Write Code that Will Survive the Many-Core Revolution Write Once, Deploy Many(-Cores) Guillaume BARAT, EMEA Sales Manager CAPS worldwide ecosystem Customers Business Partners Involved in many European
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationAccelerated Test Execution Using GPUs
Accelerated Test Execution Using GPUs Vanya Yaneva Supervisors: Ajitha Rajan, Christophe Dubach Mathworks May 27, 2016 The Problem Software testing is time consuming Functional testing The Problem Software
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationS Comparing OpenACC 2.5 and OpenMP 4.5
April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical
More informationProgramming paradigms for GPU devices
Programming paradigms for GPU devices OpenAcc Introduction Sergio Orlandini s.orlandini@cineca.it 1 OpenACC introduction express parallelism optimize data movements practical examples 2 3 Ways to Accelerate
More informationAn Hybrid Data Transfer Optimization Technique for GPGPU
An Hybrid Data Transfer Optimization Technique for GPGPU Eric Petit, François Bodin, Romain Dolbeau To cite this version: Eric Petit, François Bodin, Romain Dolbeau. An Hybrid Data Transfer Optimization
More informationHow to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO
How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO Foreword How to write code that will survive the many-core revolution? is being setup as a collective
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationOPTIMIZING THE PERFORMANCE OF DIRECTIVE-BASED PROGRAMMING MODEL FOR GPGPUS
OPTIMIZING THE PERFORMANCE OF DIRECTIVE-BASED PROGRAMMING MODEL FOR GPGPUS A Dissertation Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of
More informationParallel Programming. Libraries and implementations
Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationOmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel
www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationPGI Accelerator Programming Model for Fortran & C
PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...
More informationParallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops
Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading
More informationObjective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC
GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application
More informationINTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies
INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC Jeff Larkin, NVIDIA Developer Technologies AGENDA Accelerated Computing Basics What are Compiler Directives? Accelerating Applications with OpenACC Identifying
More informationCOMP Parallel Computing. Programming Accelerators using Directives
COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator
More informationBlue Waters Programming Environment
December 3, 2013 Blue Waters Programming Environment Blue Waters User Workshop December 3, 2013 Science and Engineering Applications Support Documentation on Portal 2 All of this information is Available
More informationGPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3
/CPU,a),2,2 2,2 Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 XMP XMP-dev CPU XMP-dev/StarPU XMP-dev XMP CPU StarPU CPU /CPU XMP-dev/StarPU N /CPU CPU. Graphics Processing Unit GP General-Purpose
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid