GPGPU Programming with OpenACC
1 GPGPU Programming with OpenACC Sandra Wienke, M.Sc. IT Center, RWTH Aachen University PPCES, March 18th 2016 IT Center der RWTH Aachen University
2 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 2
3 Motivation Why care about accelerators? Towards exa-flop computing: performance gains, but under power constraints. Accelerators provide a good performance-per-watt ratio (a first step). [Chart: Top500 performance development over time (#1, #500, and Sum, each with a trend line), on an axis from 100 MFlop/s to 10 EFlop/s; the Sum trend reaches the ExaFlop/s range around 2020.] Source: Top500, 11/2015
4 Motivation Accelerators/co-processors: GPGPUs (e.g. NVIDIA, AMD; here, NVIDIA GPUs are the focus), Intel Many Integrated Core (MIC) architecture (Intel Xeon Phi), FPGAs (e.g. Convey), DSPs (e.g. TI), ... The number of accelerator systems is growing. [Pie charts, system share and performance share: None 66.3 %, NVIDIA 16.4 %, Intel 16.7 %, AMD/ATI 0.4 %, PEZY-SC 0.2 %.] Sources: Top500, 11/2015; Top500, 6/2015
5 Motivation Top500 (11/2015): 40% of the first 10 Top500 systems contain accelerators.
Rank | Name | Site | Rmax (GFlop/s) | Accelerator/Co-Processor
1 | Tianhe-2 (MilkyWay-2) | National Super Computer Center in Guangzhou | 33,862,700 | Intel Xeon Phi 31S1P
2 | Titan | DOE/SC/Oak Ridge National Laboratory | 17,590,000 | NVIDIA K20x
3 | Sequoia | DOE/NNSA/LLNL | 17,173,224 | None
4 | K computer | RIKEN Advanced Institute for Computational Science (AICS) | 10,510,000 | None
5 | Mira | DOE/SC/Argonne National Laboratory | 8,586,612 | None
6 | Trinity | DOE/NNSA/LANL/SNL | 8,100,900 | None
7 | Piz Daint | Swiss National Supercomputing Centre (CSCS) | 6,271,000 | NVIDIA K20x
8 | Hazel Hen | HLRS - Höchstleistungsrechenzentrum Stuttgart | 5,640,170 | None
9 | Shaheen II | King Abdullah University of Science and Technology | 5,536,990 | None
10 | Stampede | Texas Advanced Computing Center/Univ. of Texas | 5,168,110 | Intel Xeon Phi SE10P
6 Motivation Green500 (11/2015) Comparison to a server with 2x Intel Sandy Bridge 2 GHz: HPL power ~260 W, peak performance 256 GFLOPS, i.e. 985 MFLOPS/W performance per watt. Source: Green500, 11/2015. * Performance data obtained from publicly available sources including TOP500
7 Overview GPGPUs = General Purpose Graphics Processing Units. History, a very brief overview: 80s-90s: development is mainly driven by games; fixed-function 3D graphics pipeline; graphics APIs like OpenGL, DirectX popular. Since 2001: programmable pixel and vertex shaders in the graphics pipeline (adjustments in OpenGL, DirectX); researchers take notice of the performance growth of GPUs, but tasks must be cast into native graphics operations. Since 2006: vertex/pixel shaders are replaced by a single processor unit; support of the programming language C, synchronization, ... -> general-purpose computing.
8 Comparison CPU GPU: 8-core CPU vs. 2880-core GPU. GPU threads: thousands (vs. a few on the CPU); light-weight, little creation overhead; fast switching. Massively parallel processors, manycore architecture.
9 Comparison CPU GPU: different design. [Diagram: CPU die with large control logic, caches and DRAM vs. GPU die dominated by ALUs.] CPU: optimized for low latencies; huge caches; control logic for out-of-order and speculative execution. GPU: optimized for data-parallel throughput; architecture tolerant of memory latency; more transistors dedicated to computation.
10 Why can accelerators deliver a good performance-per-watt ratio? 1. High (peak) performance: more transistors for computation, no (complex) control logic, small caches. 2. Low power consumption: many low-frequency cores (P ~ V^2 * f). [Chart: power usage for 1 TFlop/s of a usual system, broken down into compute, memory, communication, disk, power supply loss, heat removal.] Source: Andrey Semin (Intel): HPC systems energy efficiency optimization thru hardware-software co-design on Intel technologies, EnaHPC 2011; & S. Borkar, J. Gustafson
11 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 11
12 GPU architecture: Fermi (NVIDIA Corporation 2010). 3 billion transistors; streaming multiprocessors (SM), each comprising 32 cores; SM cores contain i.a. a floating-point & integer unit; memory hierarchy. Peak performance (Tesla M2070): SP 1.03 TFlops, DP 515 GFlops; ECC support. Compute capability: 2.0 (defines features, e.g. double-precision capability, memory access patterns).
13 GPU architecture: Kepler. 7.1 billion transistors; streaming multiprocessors extreme (SMX), each comprising 192 cores; memory hierarchy. Peak performance (K20): SP 3.52 TFlops, DP 1.17 TFlops; ECC support. Compute capability: 3.5, e.g. dynamic parallelism = possibility to dynamically launch new work from the GPU.
14 Comparison CPU GPU: weak memory model. Host + device memory are separate entities; no coherence between host + device; data transfers (over the PCI bus) needed. Host-directed execution model: 1. Copy input data from CPU memory to device memory. 2. Execute the device program. 3. Copy results from device memory to CPU memory.
15 Programming model (NVIDIA Corporation 2010). Definitions: Host: CPU, executes functions. Device: usually GPU, executes kernels. The parallel portion of the application is executed on the device as a kernel. A kernel is executed as an array of threads: all threads execute the same code; threads are identified by IDs, used to select input/output data and for control decisions:
float x = input[threadID];
float y = func(x);
output[threadID] = y;
With OpenACC, you don't have to bother with thread IDs.
16 Based on: NVIDIA Corporation 2010 Programming model Threads are grouped into blocks Blocks are grouped into a grid 16
17 Programming model. A kernel is executed as a grid of blocks of threads. [Diagram: over time, the host launches Kernel 1 as a 1D grid of blocks 0-7 and Kernel 2 as a 2D grid of blocks (0,0)-(1,3); each block is an (up to) 3D arrangement of threads.] Dimensions of blocks and grids: up to 3.
18 Programming model Why blocks? Cooperation of threads within a block possible Synchronization Share data/ results using shared memory Scalability Fast communication between n threads is not feasible when n large No global synchronization on GPU possible (only by completing one kernel and starting another one from the CPU) But: blocks are executed independently Blocks can be distributed across arbitrary number of multiprocessors In any order, concurrently, sequentially 18
19 Scalability. Assume: 10 thread blocks. [Diagram: the same 10 blocks are scheduled onto devices with 2 SM(X), 8 SM(X), and 20 SM(X); on the larger devices some SM(X) remain idle.]
20 Mapping to Execution Model Thread Core Each thread is executed by a core Block Multiprocessor instruction cache registers hardware/ software cache Each block is executed on a SM(X) Several concurrent blocks can reside on a SM(X) depending on shared resources Grid (Kernel) Device Each kernel is executed on a device 20
21 Memory model. Thread: registers (Fermi: 63 per thread, K20: 255 per thread); local memory. Block: shared memory. Fermi: 64 KB configurable, on-chip: 16 KB shared + 48 KB L1 OR 48 KB shared + 16 KB L1. K20: 64 KB configurable, on-chip: 16 KB shared + 48 KB L1 OR 48 KB shared + 16 KB L1 OR 32 KB shared + 32 KB L1. Grid/application: constant memory (read-only, off-chip, cached); global memory (several GB, off-chip); Fermi/K20: L2 cache. Caches: L1: Fermi configurable 16/48 KB, K20 configurable 16/32/48 KB; L2: Fermi 768 KB, K20 1536 KB. [Diagram: device with multiprocessors 1..n, each holding per-core registers, shared memory and L1; a shared L2; global/constant memory; connected to the host and host memory.]
22 Summary: execution model - logical hierarchy - memory model. Core - thread/vector - registers. Streaming multiprocessor (SM) - block of threads/gang (sync possible, shared memory) - registers, shared memory, L1 (instruction cache, hardware/software cache). Device: GPU - grid (kernel) - L2, global memory. Host: CPU (connected via PCIe) - host memory. Vector, worker, gang mapping is compiler dependent.
23 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 23
24 GPGPU Programming (based on: NVIDIA Corporation). Three approaches: Libraries - drop-in acceleration; examples: cuBLAS, cuSPARSE. Directives - high productivity; examples: OpenACC, OpenMP. Programming languages - maximum flexibility; examples: CUDA, OpenCL.
25 GPGPU Programming. CUDA (Compute Unified Device Architecture): CUDA C/C++ (NVIDIA): programming language for NVIDIA GPUs; CUDA Fortran (PGI): NVIDIA's CUDA for Fortran, NVIDIA GPUs. OpenCL C (Khronos Group): open standard, portable, CPU/GPU/... OpenACC C/C++, Fortran (PGI, Cray, CAPS, NVIDIA): directive-based accelerator programming; industry standard published in Nov. 2011. OpenMP C/C++, Fortran: directive-based programming for hosts and accelerators; standard, portable; device constructs released in July 2013, implementations soon.
26 Example SAXPY CPU (SAXPY = Single-precision real Alpha X Plus Y: y = a*x + y)

void saxpyCPU(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    for (int i = 0; i < n; ++i) {
        x[i] = i;
        y[i] = 5.0*i - 1.0;
    }
    // Invoke serial SAXPY kernel
    saxpyCPU(n, a, x, y);
    free(x); free(y);
    return 0;
}
27 Example SAXPY OpenACC

void saxpyOpenACC(int n, float a, float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    for (int i = 0; i < n; ++i) {
        x[i] = i;
        y[i] = 5.0*i - 1.0;
    }
    // Invoke SAXPY kernel (offloaded)
    saxpyOpenACC(n, a, x, y);
    free(x); free(y);
    return 0;
}
28 Example SAXPY CUDA

__global__ void saxpy_parallel(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a*x[i] + y[i];
    }
}

int main(int argc, char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *h_x, *h_y;  // Pointers to CPU memory
    h_x = (float*) malloc(n * sizeof(float));
    h_y = (float*) malloc(n * sizeof(float));
    // Initialize h_x, h_y
    for (int i = 0; i < n; ++i) {
        h_x[i] = i;
        h_y[i] = 5.0*i - 1.0;
    }
    float *d_x, *d_y;  // Pointers to GPU memory
    // 1. Allocate data on GPU + transfer data to GPU
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
    // 2. Launch kernel: invoke parallel SAXPY
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid(n / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);
    // 3. Transfer data to CPU + free data on GPU
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
29 Motivation. Nowadays, GPU APIs (like CUDA, OpenCL) are often used. They may be difficult to program (but offer more flexibility) and are verbose/may complicate software design. A directive-based programming model delegates responsibility for low-level GPU programming tasks to the compiler: data movement, kernel execution, awareness of the particular GPU type. OpenACC or OpenMP 4.x device constructs: many tasks can be done by the compiler/runtime; user-directed programming.
30 Overview OpenACC. Open industry standard for portability; introduced by CAPS, Cray, NVIDIA, PGI (Nov. 2011). Here, PGI's OpenACC compiler is used for examples; details are compiler dependent. Supports C, C++ and Fortran; NVIDIA GPUs, AMD GPUs, x86 multicore, Intel MIC (now/near future). Timeline: Nov 11: Specification 1.0. Jun 13: Specification 2.0. Nov 14: Proposal for complex data management. Nov 15: Specification 2.5.
31 Directives Syntax C #pragma acc directive-name [clauses] Fortran!$acc directive-name [clauses] Iterative development process Compiler feedback helpful Whether an accelerator kernel could be generated Which loop schedule is used Where/which data is copied 31
32 OpenACC - Compiling (PGI)
pgcc -acc -ta=nvidia,cc20,7.0 -Minfo=accel saxpy.c
pgcc - C PGI compiler (pgf90 for Fortran)
-acc - tells the compiler to recognize OpenACC directives
-ta=nvidia - specifies the target architecture, here: NVIDIA GPUs
cc20 - optional; specifies the target compute capability (here 2.0)
7.0 - optional; uses CUDA Toolkit 7.0 for code generation
-Minfo=accel - optional; compiler feedback for accelerator code
The PGI tool pgaccelinfo prints the minimal needed compiler flags for the usage of OpenACC on the current hardware (see Development Tips).
33 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 33
34 Directives (in examples): SAXPY serial (host)

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}
35 Directives (in examples): SAXPY acc kernels

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE ("magic": compiler finds the parallelism; sync between the loops)
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}

PGI Compiler Feedback:
main:
48, Generating copyin(x[:10240])
    Generating copy(y[:10240])
50, Loop is parallelizable (Kernel 1)
    Accelerator kernel generated
    Generating Tesla code
    50, #pragma acc loop gang, vector(128)
56, Loop is parallelizable (Kernel 2)
    Accelerator kernel generated
    Generating Tesla code
    56, #pragma acc loop gang, vector(128)
36 Directives (in examples): SAXPY acc kernels

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc kernels  // sync at end of region
    {
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }
    // Modify y on host, do not touch x on host
    #pragma acc kernels  // sync at end of region
    {
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}
37 Directives (in examples): SAXPY acc parallel, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}
38 Directives (in examples): SAXPY acc kernels, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc kernels copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc kernels copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

kernels != parallel: the compiler can choose different optimizations.
39 Array Shaping. Data clauses can be used on data, kernels or parallel: copy, copyin, copyout, present, create, deviceptr. Since OpenACC 2.5: copy = present_or_copy = pcopy; copyin = present_or_copyin = pcopyin; copyout = present_or_copyout = pcopyout. Array shaping: the compiler sometimes cannot determine the size of arrays; specify it explicitly using data clauses and an array shape.
C/C++ ([lower bound : size]): #pragma acc parallel copyin(a[0:length]) copyout(b[s/2:s/2])
Fortran ([lower bound : upper bound]):!$acc parallel copyin(a(0:length-1)) copyout(b(s/2:s))
40 Directives: Offload regions. An offload region maps to a CUDA kernel function.
parallel - C/C++: #pragma acc parallel [clauses]; Fortran:!$acc parallel [clauses] ...!$acc end parallel. The user is responsible for finding parallelism (loops): acc loop needed for worksharing; no automatic sync between several loops.
kernels - C/C++: #pragma acc kernels [clauses]; Fortran:!$acc kernels [clauses] ...!$acc end kernels. The compiler is responsible for finding parallelism (loops): the acc loop directive is only needed for tuning; automatic sync between loops within a kernels region.
41 Directives: Offload regions. Clauses for compute constructs (parallel, kernels), C/C++ and Fortran:
if(condition) - if condition is true, the accelerated version is executed
async[(int-expr)] - executes asynchronously, see Tuning slides
num_gangs(int-expr) - defines number of parallel gangs
num_workers(int-expr) - defines number of workers within a gang
vector_length(int-expr) - defines length for vector operations
reduction(op:list) - reduction with op at end of region (parallel only)
copy(list) - H2D copy at region start + D2H at region end
copyin(list) - only H2D copy at region start
copyout(list) - only D2H copy at region end
create(list) - allocates data on device, no copy to/from host
present(list) - data is already on device
present_or_*(list) - test whether data is on device; if not, transfer
deviceptr(list) - see Tuning slides
private(list) - copy of each list item for each parallel gang (parallel only)
firstprivate(list) - as private + copy initialization from host (parallel only)
Also: wait, device_type, default(none|present)
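To illustrate the async and wait clauses listed above, here is a minimal sketch (function name and queue number are my own, not from the slides). Compiled without OpenACC, the pragmas are simply ignored and the code still runs correctly on the host:

```c
/* Sketch: two independent kernels are issued into async queue 1; the
   host continues immediately and joins with "acc wait" before the
   results may be used. */
void double_arrays_async(int n, float *x, float *y) {
    #pragma acc parallel loop async(1) copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;   /* first kernel, queue 1 */

    #pragma acc parallel loop async(1) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] *= 2.0f;   /* second kernel, same queue: runs after the first */

    #pragma acc wait(1) /* block the host until queue 1 has drained */
}
```

Using the same queue serializes the two kernels against each other; different queue numbers would allow them to overlap.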
42 Directives: Loops. Share the work of loops: loop work gets distributed among the threads on the GPU (in a certain schedule).
C/C++: #pragma acc loop [clauses]; Fortran:!$acc loop [clauses]
kernels loop defines the loop schedule by an int-expr in gang, worker or vector (similar to num_gangs etc. with parallel).
Loop clauses (C/C++, Fortran):
gang - distributes work into thread blocks
worker - distributes work into warps
vector - distributes work into threads within a warp/thread block
seq - executes the loop sequentially on the device
collapse(n) - collapses n tightly nested loops
independent - declares loop iterations independent (kernels loop only)
reduction(op:list) - reduction with op
private(list) - private copy for each loop iteration
OpenACC 2.0 also: auto, tile, device_type
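To make the collapse clause above concrete, a small sketch (names are illustrative, not from the slides): collapse(2) merges the two tightly nested loops into one iteration space before it is distributed, which helps when the outer trip count alone is too small to fill the GPU.

```c
/* c = a + b for n x m matrices stored row-major in flat arrays.
   collapse(2) flattens both loops into one n*m iteration space
   before the compiler distributes it over gangs and vector lanes. */
void add_matrices(int n, int m, const float *a, const float *b, float *c) {
    #pragma acc parallel loop collapse(2) copyin(a[0:n*m], b[0:n*m]) copyout(c[0:n*m])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            c[i*m + j] = a[i*m + j] + b[i*m + j];
}
```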
43 Misc on Loops
Combined directives:
#pragma acc kernels loop
for (int i = 0; i < n; ++i) { /*...*/ }
#pragma acc parallel loop
for (int i = 0; i < n; ++i) { /*...*/ }
Reductions:
#pragma acc parallel
#pragma acc loop reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += i;
}
Also possible: *, max, min, &, |, ^, &&, ||
The PGI compiler can often recognize reductions on its own. See the compiler feedback, e.g.: "Sum reduction generated for var".
44 Procedure Calls. Procedure calls within a parallel region: the routine directive compiles them for accelerator + host; clauses denote the level of parallelism.

#pragma acc routine seq
float saxpy(float a, float x, float y) {
    return (a*x + y);
}
// [..]
#pragma acc parallel
#pragma acc loop gang vector
for (int i = 0; i < n; ++i) {
    // call on the device
    y[i] = saxpy(a, x[i], y[i]);
}

See the section on Loop Schedules for an explanation of gang, vector, seq. With OpenACC 1.0 (!) routines must be inlined (no routine support).
45 Directives: Procedure Calls. Routine construct: tells the compiler to compile the given procedure for an accelerator & the host. Either use the name or directly put the construct at the procedure declaration.
C/C++: #pragma acc routine clause-list; #pragma acc routine (name) clause-list
Fortran:!$acc routine clause-list;!$acc routine (name) clause-list
A clause must be specified (C/C++, Fortran):
gang - (may) contain/call gang parallelism; executed in gang-redundant mode
worker - (may) contain/call worker parallelism (no gang); executed in worker-single mode
vector - (may) contain/call vector parallelism (no worker/gang); executed in vector-single mode
seq - does not contain/call gang, worker, vector parallelism
bind(name) - specifies a name for compiling/calling (as if specified in the language being compiled)
bind(string) - specifies a name for compiling/calling (string is used for the name)
nohost - no compilation for the host; must be called from within acc compute regions
device_type - device-specific clause tuning
46 Summary. Offloading work: kernels ("magic") vs. parallel region & worksharing construct (loop). Easy reductions via the reduction clause. Procedure calls with the routine construct. Moving data with data clauses (copy, copyin, copyout) and array shaping. Compiler feedback: What is done implicitly? How are explicit directives understood?
47 Development Tips Favorable for parallelization: (nested) loops Large loop counts to compensate (at least) data movement overhead Independent loop iterations Think about data availability on host/ GPU Use data regions to avoid unnecessary data transfers Specify data array shaping (may adjust for alignment) Verifying execution on GPU See PGI compiler feedback. Term Accelerator kernel generated needed. See information on runtime (more details in the hands-on session): export ACC_NOTIFY=3 47
48 Development Tips Conditional compilation for OpenACC Macro _OPENACC Using pointers (C/C++) Problem: compiler can not determine whether loop iterations that access pointers are independent no parallelization possible Solution (if independence is actually there) Use restrict keyword: float *restrict ptr; Use compiler flag: -ansi-alias (PGI) Use OpenACC loop clause: independent Prefer array notation over pointer arithmetic: array[2] instead of *(array+2) 48
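The restrict tip above in code form (a sketch with illustrative names): without restrict the compiler must assume that x and y may alias and cannot prove the loop iterations independent; with it, the loop can be parallelized.

```c
/* restrict promises the compiler that x and y do not alias, so the
   iterations of the loop are independent and may run in parallel. */
void saxpy_restrict(int n, float a, const float *restrict x,
                    float *restrict y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The loop clause independent would achieve the same effect without changing the function signature, but then the programmer (not the type system) vouches for the absence of aliasing.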
49 Development Tips Runtime measurements GPU needs time to initialize (up to a couple of seconds, depending on GPU) Might get included into runtime measurements & influence short runtimes Exclude GPU initialization from measurements by calling acc_init(acc_device_nvidia), #pragma acc init(acc_device_nvidia) Simple profile (that also shows the init time) by Compiler flag: -ta=nvidia,time Environment variable: PGI_ACC_TIME=1 Setup information Information on GPU type, hardware details and configuration NVIDIA: Download CUDA SDK, run devicequery PGI: pgaccelinfo (also specifies needed compiler flags for OpenACC) 49
50 Hands-On. Introduction to the RWTH GPU Cluster environment. Exercises: write your first OpenACC program (Jacobi solver); use parallel and loop directives; follow the TODOs in GPU/exercises/task1. Optional (only if you have completed task 2.2): read slide 104 (pinned memory) and apply this to your program from task 1. Why does this give a performance improvement?
51 The RWTH GPU Cluster. GPU cluster: 57 NVIDIA Quadro 6000 (Fermi). Foto: C. Iwainsky. High utilization of resources: Daytime - VR: new CAVE (49 GPUs); HPC: interactive software development (8 GPUs). Nighttime - HPC: processing of GPGPU compute jobs (55-57 GPUs). aixcave, VR, RWTH Aachen, since June 2012.
52 Hardware & Software Stack 24 rendering nodes 2x NVIDIA Quadro 6000 GPUs (Fermi) 2x Intel Xeon X GHz (Westmere, 12 cores) GPU cluster software environment CUDA Toolkit 7.0 module load cuda directory: $CUDA_ROOT PGI Compiler (OpenACC) TotalView (Debugging) module load pgi (module switch intel pgi) module load totalview 52
53 Lab: Login to GPU machines. User name (see badge): hpclab<xy>. Password (see "Access to Lab Machine"). GPU node (see paper snippet): linuxgpus<ab>. These nodes are available in interactive mode only for the workshop! Jump from the frontend node to the GPU node: ssh -Y linuxgpus<ab>. Remark: GPUs are set to exclusive mode (per process), so only one person can access a GPU; if it is occupied, you get e.g. the message "all CUDA-capable devices are busy or unavailable". Set export CUDA_VISIBLE_DEVICES=<no> to use a certain GPU! See your printout for the correct number (either 0 or 1).
54 See what is running: nvidia-smi
$> nvidia-smi
[Output, Mon Mar 7 2016: for each GPU its ID + type (here 0 and 1, both Quadro 6000), fan, temperature, performance state, memory usage (e.g. 10 MiB / 5375 MiB and 216 MiB / 5375 MiB), GPU utilization, ECC, and compute mode (E. Process); plus a process list, e.g. GPU 0, process ./jacobi, 155 MiB GPU memory usage.]
nvidia-smi -q lists further GPU details.
55 Jacobi Solver. Solving the Laplace equation (2D) by the Jacobi iterative method: nabla^2 A(x,y) = 0. Start with a guess of the objective function A(x,y). Use finite difference discretization (5-point stencil over A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)):
A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4
Iterate over the inner elements of the 2D grid. Repeat until: the maximum number of iterations is reached, or the computed approximation is close to the solution (biggest change on any matrix element < tolerance value).
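The update rule above can be sketched as a single OpenACC sweep in C (a minimal illustration, not the workshop's actual task skeleton; function and variable names are assumptions):

```c
#include <math.h>

/* One Jacobi sweep over the inner elements of an n x m grid
   (5-point stencil, row-major flat arrays). Returns the biggest
   change of any matrix element, to be compared with the tolerance. */
double jacobi_sweep(int n, int m, const double *A, double *Anew) {
    double err = 0.0;
    #pragma acc parallel loop reduction(max:err) \
        copyin(A[0:n*m]) copy(Anew[0:n*m])
    for (int j = 1; j < n - 1; ++j) {
        for (int i = 1; i < m - 1; ++i) {
            Anew[j*m + i] = 0.25 * (A[j*m + i-1] + A[j*m + i+1]
                                  + A[(j-1)*m + i] + A[(j+1)*m + i]);
            double diff = fabs(Anew[j*m + i] - A[j*m + i]);
            if (diff > err) err = diff;
        }
    }
    return err;
}
```

A driver loop would swap A and Anew after each sweep and stop once err falls below the tolerance or the iteration limit is reached; in the exercise, a data region around that loop avoids re-transferring the grid every sweep.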
56 Lab Results. Hardware: 2x Intel 2.67 GHz (Intel Westmere, 12 cores); NVIDIA Quadro 6000 (Fermi, cc 2.0). Version / Runtime [s]: OpenMP - 2.11; OpenACC-Offload; OpenACC-Data; OpenACC-Hetero (1 GPU + 12 OMP threads); OpenACC-MultiGPU (2 GPUs + 12 OMP threads). What went wrong?
57 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 57
58 Directives (in examples): SAXPY acc parallel, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

What went wrong?
59 Analysis with the NVIDIA Visual Profiler. Comes with the NVIDIA Toolkit; call: nvvp. Also includes automatic application analysis (utilization, ...). [Timeline: data movement between host & device dominates, surrounding both the 1st and the 2nd SAXPY execution.]
60 Directives (in examples): SAXPY acc parallel, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n]) // move x, y to device; move y to host
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n]) // move x, y to device again; move y to host
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

The problem: x and y cross the PCI bus between CPU memory and GPU memory for every region, although x never changes.
61 Directives (in examples): SAXPY acc data

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc data copyin(x[0:n])
    {   // x remains on the GPU until the end of the data region
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        // Modify y on host, do not touch x on host
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}

PGI Compiler Feedback:
main:
46, Generating copyin(x[:n])
48, Generating copy(y[:n])
    Generating present(x[:n])
    Accelerator kernel generated
    Generating Tesla code
    50, #pragma acc loop gang, vector(128)
54, Generating copy(y[:n])
    Generating present(x[:n])
    Accelerator kernel generated
    Generating Tesla code
    56, #pragma acc loop gang, vector(128)
62 Directives: Data regions & clauses. Data region: decouples data movement from offload regions.
C/C++: #pragma acc data [clauses]
Fortran:!$acc data [clauses] ...!$acc end data
Data clauses trigger data movement of the denoted arrays (C/C++, Fortran):
if(cond) - if cond is true, move data to the accelerator
copy(list) - H2D copy at region start + D2H at region end
copyin(list) - only H2D copy at region start
copyout(list) - only D2H copy at region end
create(list) - allocates data on device, no copy to/from host
present(list) - data is already on device
present_or_*(list) - test whether data is on device; if not, transfer
deviceptr(list) - see Tuning slides
63 Data Movement. Data clauses can be used on data, kernels or parallel: copy, copyin, copyout, present, create, deviceptr. Since OpenACC 2.5, copy (also copyin, copyout) checks whether the data is on the device and, if not, copies it: copy = present_or_copy = pcopy; copyin = present_or_copyin = pcopyin; copyout = present_or_copyout = pcopyout. create does not copy any data, it just creates it on the device. deviceptr: see section "Interoperability with CUDA & GPU Libraries".
64 Directives (in examples): acc update

#pragma acc data copy(x[0:n])
{
    for (t = 0; t < T; t++) {
        // Modify x on device (e.g. in a subroutine w/o data copying)
        #pragma acc update host(x[0:n])
        // Modify x on host
        #pragma acc update device(x[0:n])
    }
}
65 Directives: Update. Update executable directive: moves data from GPU to host, or host to GPU; used to update existing data after it has changed in its corresponding copy. Data movement can be conditional or asynchronous.
C/C++: #pragma acc update host|device [clauses]
Fortran:!$acc update host|device [clauses]
Update clauses (C/C++, Fortran):
host|self(list) - list variables are copied from accelerator to host
device(list) - list variables are copied from host to accelerator
if(cond) - if cond is true, move the data
if_present - issues no error when data is not present on the device
async[(int-expr)] - executes asynchronously, see Tuning slides
wait[(int-expr)] - data movement waits on actions in the queue to finish
device_type(list) - set a certain device type for the update
66 Unstructured Data Lifetime. Structured data lifetime: applied to a region:

#pragma acc data copyin(x[0:n]) create(y[0:n])
{
    // data lifetime
}

Unstructured data lifetime: enter data allocates (+ copies) data to device memory; exit data deallocates (+ copies) data from device memory. Also possible: copyout.

class Matrix {
public:
    Matrix() {
        v = new double[n];
        #pragma acc enter data copyin(this)
        #pragma acc enter data create(v[0:n])
    }
    ~Matrix() {
        #pragma acc exit data delete(v[0:n], this)
        delete[] v;
    }
private:
    int n;      // size, set elsewhere
    double* v;
};
67 Directives: Unstructured Data Lifetime Enter data construct Allocation (& copy) of scalars and (sub-)arrays in(to) device memory Remain there until end of program or corresponding exit data call C/C++ #pragma acc enter data clause-list Fortran C/C++, Fortran!$acc enter data clause-list copyin(list) Specific clauses for copy or just creation. create(list) Exit data construct (Copies data to host memory &) deletes data from device memory C/C++ #pragma acc exit data clause-list Fortran!$acc exit data clause-list Specific clauses for copy or deletion.... Clauses valid for both: if, async, wait C/C++, Fortran copyout(list) delete(list) finalize 67 background OpenACC
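In plain C, the same unstructured lifetime often appears as paired init/free helpers instead of a constructor/destructor. A sketch (the function and variable names are made up for illustration):

```c
#include <stdlib.h>

// Sketch of an unstructured data lifetime without a C++ class: the device
// copy of `field` outlives any single scope, until field_free is called.
// With a non-OpenACC compiler the pragmas are ignored.
float *field = NULL;
int field_n = 0;

void field_init(int n)
{
    field_n = n;
    field = (float*) malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) field[i] = 0.0f;
    #pragma acc enter data copyin(field[0:n])   // allocate + H2D copy
}

void field_free(void)
{
    #pragma acc exit data delete(field[0:field_n])  // free device memory only
    free(field);
    field = NULL;
}
```

Between the two calls, offload regions can use present(field[0:field_n]) instead of copying.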
68 Deep Copies Support of a flat object model Primitive types Composite types w/o allocatable/pointer members struct { int x[2]; } *A; // static size 2, dynamic size 2 #pragma acc data copy(A[0:2]) Challenges with pointer indirection Non-contiguous transfers Pointer translation struct { int *x; } *A; // dynamic size 2, dynamic size 2 #pragma acc data copy(A[0:2]) [Diagram: shallow copy vs. deep copy of A and its x members across host and device memory] OpenACC proposal for complex data management since 11/2014. Manual deep copy is possible with data API, but usually not practical 68 Courtesy by J.C. Beyer, Cray Inc.
69 Directives: Data API Data API (1/2) d_void* acc_malloc(size_t): allocates device memory void acc_free(d_void*): frees device memory void* acc_copyin(h_void*,size_t): allocates device memory & copies data to device void* acc_present_or_copyin(h_void*,size_t): tests whether data is present on device, if not: allocates & copies data void* acc_create(h_void*,size_t): allocates device memory corresponding to host data void* acc_present_or_create(h_void*,size_t): tests whether data is present on device, if not: allocates device memory corresponding to host data void acc_copyout(h_void*,size_t):copies data to host & deallocates device memory void acc_delete(h_void*,size_t): deallocates device memory corresponding to host data 69 background OpenACC
70 Directives: Data API Data API (2/2) void acc_update_device(h_void*,size_t) and acc_update_self: updates device data copy corresponding to host data void acc_memcpy_to_device(d_void* dest,h_void* src,size_t): copies data from host to device memory void acc_memcpy_from_device(h_void* dest,d_void* src, size_t): copies data from device to host memory 70 int acc_is_present(h_void*,size_t): tests whether data is present on device void acc_map_data(h_void*,d_void*,size_t): maps previously allocated data to the specific host data void acc_unmap_data(h_void*): unmaps device data from the specific host data d_void* acc_deviceptr(h_void*): returns device pointer associated with a specific host address h_void* acc_hostptr(d_void*): returns host pointer associated with a specific device address background OpenACC
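A hedged sketch of the data API in use. The #ifdef guard keeps it compilable without an OpenACC compiler; in that fallback case the "device copy" is just a second host buffer, so only the control flow is illustrated:

```c
#include <stdlib.h>
#include <string.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

// Sketch: mirror a host array on the device via the runtime API instead of
// directives. The function names here are illustrative helpers, not OpenACC.
float *mirror_on_device(float *host, size_t n)
{
#ifdef _OPENACC
    // allocates device memory and copies the host data into it
    return (float*) acc_copyin(host, n * sizeof(float));
#else
    float *copy = (float*) malloc(n * sizeof(float));  // host-side stand-in
    memcpy(copy, host, n * sizeof(float));
    return copy;
#endif
}

void release_mirror(float *host, float *mirror, size_t n)
{
#ifdef _OPENACC
    (void) mirror;
    acc_delete(host, n * sizeof(float));  // frees the device copy
#else
    (void) host; (void) n;
    free(mirror);
#endif
}
```

With a real OpenACC runtime the returned pointer is a device address and must not be dereferenced on the host.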
71 Summary Optimizing code with explicit data transfers Structured: data region Unstructured: enter data, exit data Within data environment: update Clauses: copy, copyin, copyout, Data API for deep copies (and more) Compiler and Runtime feedback Where is the data moved? How much time is spent with data movement? Using NVIDIA Visual Profiler with OpenACC 71
72 Hands-On Data motion optimization of your first OpenACC program Use structured data regions to minimize data movement across the PCI bus Use parallel and loop directives on all loops that read or write the data within the data regions Follow TODOs in GPU/exercises/task2 Exercises
73 Lab Results Hardware Version Runtime [s] 2x Intel 2.67GHz (Intel Westmere, 12 cores) NVIDIA Quadro 6000 (Fermi, cc 2.0) OpenMP 2.11 OpenACC-Offload OpenACC-Data 1.27 OpenACC-Hetero (1 GPU + 12 OMP threads) OpenACC-MultiGPU (2 GPUs + 12 OMP threads) 73
74 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 74
75 Heterogeneous Computing Heterogeneous Computing CPU & GPU are (fully) utilized Challenge: load balancing Domain decomposition If load is known beforehand, static decomposition Exchange data if needed (e.g. halos) matrix vector multiplication GPU CPU = 75
76 Asynchronous operations Definition Synchronous: Control does not return until accelerator action is complete Asynchronous: Control returns immediately Allows heterogeneous computing (CPU + GPU) Default: synchronous operations Asynchronous operations with async clause Kernel execution: kernels, parallel The wait directive can also take an async clause. Data movement: update, enter data, exit data #pragma acc kernels async for (int i=0; i<n; i++) { ... } // do work on host Heterogeneity 76
77 Directives: async clause Async clause Executes parallel kernels update enter data exit data wait operations asynchronously while host process continues with code C/C++ #pragma acc <op-see-above> async[(scalar-int-expr)] Fortran!$acc <op-see-above> async[(scalar-int-expr)] Integer argument (optional) can be seen as CUDA stream number Integer argument can be used in a wait directive Async activities with same argument: executed in order Async activities with diff. argument: executed in any order relative to each other OpenACC 2.0: int arg can be acc_async_sync (sync execution) acc_async_noval (same stream) 77 background OpenACC
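Putting async and wait together for heterogeneous computing: a sketch in which the device handles one half of the array asynchronously while the host handles the other. With a compiler that ignores the pragmas everything runs synchronously on the host, which yields the same result:

```c
// Sketch: offload one half of the work asynchronously, do the other half on
// the host meanwhile, then synchronize before combining the results.
void split_work(int n, float *a, float *b)
{
    #pragma acc parallel loop async(1) copy(a[0:n/2])  // returns immediately
    for (int i = 0; i < n/2; ++i)
        a[i] = 2.0f * a[i];

    for (int i = n/2; i < n; ++i)     // host works concurrently on its half
        a[i] = 2.0f * a[i];

    #pragma acc wait(1)               // block until stream 1 (incl. D2H) done
    for (int i = 0; i < n; ++i)
        b[i] = a[i] + 1.0f;
}
```

The wait is essential: without it the host could read a[0:n/2] before the asynchronous copy-back has completed.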
78 Streams Streams = CUDA concept Stream = sequence of operations that execute in issue-order on GPU Same int-expr in async clause: one stream Different streams/ int-expr: any order relative to each other Tips for debugging - disable async Environment variable ACC_SYNCHRONOUS=1 Use pre-defined int-expr acc_async_sync 78
79 Asynchronous operations & wait Synchronize async operations wait directive Wait for completion of an asynchronous activity (all or certain stream) #pragma acc parallel loop async(1) // kernel A #pragma acc parallel loop async(2) // kernel B #pragma acc wait(1,2) async(3) #pragma acc parallel loop async(3) // wait(1,2) // alternative to wait directive // kernel C #pragma acc parallel loop async(4) wait(3) // kernel D Kernel A Kernel D Kernel C Kernel B Kernel E #pragma acc parallel loop async(5) wait(3) // kernel E #pragma acc wait(1) // kernel F // on host Kernel F 79 Courtesy by J.C. Beyer, Cray Inc.
80 Directives: wait Wait directive Host thread waits for completion of an asynchronous activity If integer expression specified, waits for all async activities with the same value Otherwise, host waits until all async activities have completed C/C++ #pragma acc wait[(scalar-int-expr)] Fortran!$acc wait[(scalar-int-expr)] OpenACC 2.0: wait clause available (parallel, kernels, update, enter data, exit data) Runtime routines int acc_async_test(int): tests for completion of all associated async activities; returns nonzero value/.true. if all have completed int acc_async_test_all(): tests for completion of all async activities int acc_async_wait(int): waits for completion of all associated async activities; routine will not return until the latest async activity has completed int acc_async_wait_all(): waits for completion of all async activities 80 background OpenACC
81 Overlapping data transfers and computations Other feature of streams Allows overlap of tasks on GPU PCIe transfers in both directions #pragma acc data create(a[0:n]) { #pragma acc update \ device(a[0:n/2]) async(1) stream 1 H2D Plus multiple kernels (up to 16 with Kepler) H2D H2D K D2H K stream 2 K without streams D2H D2H H2D H2D K K D2H D2H #pragma acc update \ device(a[n/2:n/2]) async(2) #pragma acc parallel loop async(1) // compute on A[0:n/2] #pragma acc parallel loop async(2) // compute on A[n/2:n/2] potential runtime improvement #pragma acc update \ host(a[0:n/2]) async(1) using async ops #pragma acc update \ host(a[n/2:n/2]) async(2) 81
82 Summary OpenACC default: synchronous operations CPU thread waits until OpenACC kernel/ movement is completed Asynchronous operations enable: Heterogeneous computing (utilizing CPU + GPU) async clause returns control directly to CPU thread Overlapping of kernels (and data transfers) Specify streams by using int-expressions in async clause Synchronization possible by wait directive 82
83 Hands-On Heterogeneous computing on CPU & GPU Decomposition of the matrix (what is the best distribution?) Use async & wait Make sure reduction variables are copied back at certain point Use OpenMP on host Follow TODOs in GPU/exercises/task3 Exercise
84 Heterogeneous Computing Domain decomposition [Diagram: matrix split between a GPU part and a CPU part; halo exchange between the parts each iteration] 84
85 Lab Results Hardware Version Runtime [s] 2x Intel 2.67GHz (Intel Westmere, 12 cores) NVIDIA Quadro 6000 (Fermi, cc 2.0) OpenMP 2.11 OpenACC-Offload OpenACC-Data 1.27 OpenACC-Hetero (1 GPU + 12 OMP threads) OpenACC-MultiGPU (2 GPUs + 12 OMP threads)
86 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 86
87 Based on: NVIDIA Corporation Interoperability with Libraries and CUDA Application Libraries Library OpenACC Directives Low-level language OpenACC Programming Languages Drop-in acceleration High productivity Maximum flexibility CUSPARSE CUBLAS OpenACC CUDA 87
88 Interoperability CUDA and OpenACC both operate on device memory Interoperability with (CUDA) libraries Interoperability with custom (CUDA) C/C++/Fortran code Make device address available on host or in OpenACC code #pragma acc data copy(x[0:n]) { #pragma acc kernels loop for (int i=0; i<n; i++) { // work on x[i] #pragma acc host_data use_device(x) { func(x); // use device pointer x: for library, for custom CUDA code cudaMalloc(&x, sizeof(float)*n); // use device pointer x: // - for library // - for custom CUDA code #pragma acc data deviceptr(x) { #pragma acc kernels loop for (int i=0; i<n; i++) { // work on x[i] assume x is pointer to device memory
89 Directives: deviceptr & host_data Deviceptr clause Declares that pointers in list refer to device pointers that need not be allocated/ moved between host and device C/C++ #pragma acc parallel kernels data deviceptr(list) Fortran!$acc data deviceptr(list) Host_data construct Makes the address of device data in list available on the host Vars in list must be present in device memory C/C++ #pragma acc host_data use_device(list) Fortran!$acc host_data use_device(list)!$acc end host_data only valid clause 89 background OpenACC
90 Summary OpenACC is interoperable with CUDA and GPU Libraries host_data: device address available on host deviceptr: device address available in OpenACC code 90
91 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 91
92 Execution Model Threads execute as groups of 32 threads (warps) Threads in warp share same program counter t0 t1 t31 t32 t63 SIMT architecture SIMT = single instruction, multiple threads warp Possible mapping to CUDA terminology (GPUs) gang = block worker = warp compiler dependent Execution Model Vector Gang Core Multiprocessor instruction cache registers vector = threads Within block (if omitting worker) Within warp (if specifying worker) Grid (Kernel) hardware/ software cache Device 92
93 Loop schedule (in examples) int main(int argc, const char* argv[]) { int n = 10240; float a = 2.0f; float b = 3.0f; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Initialize x, y // Run SAXPY TWICE #pragma acc data copyin(x[0:n]) { #pragma acc kernels copy(y[0:n]) present(x[0:n]) #pragma acc loop gang vector(256) for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; // Modify y on host, do not touch x on host #pragma acc kernels copy(y[0:n]) present(x[0:n]) #pragma acc loop for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; gang(64) vector free(x); free(y); return 0; Without loop schedule 33, Loop is parallelizable Accelerator kernel generated 33, #pragma acc loop gang, vector(128) 39, Loop is parallelizable Accelerator kernel generated 39, #pragma acc loop gang, vector(128) With loop schedule 33, Loop is parallelizable Accelerator kernel generated 33, #pragma acc loop gang, vector(256) 39, Loop is parallelizable Accelerator kernel generated 39, #pragma acc loop gang(64), vector(128) 93
94 Loop schedule (in examples) int main(int argc, const char* argv[]) { int n = 10240; float a = 2.0f; float b = 3.0f; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Initialize x, y // Run SAXPY TWICE #pragma acc data copyin(x[0:n]) { 94 #pragma acc parallel copy(y[0:n]) present(x[0:n]) #pragma acc loop gang vector for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; // Modify y on host, do not touch x on host #pragma acc parallel copy(y[0:n]) present(x[0:n]) #pragma acc loop gang vector for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; free(x); free(y); return 0; vector_length(256) num_gangs(64) Can both be used with parallel or kernels: vector_length: Specifies number of threads in thread block. num_gangs: Specifies number of thread blocks in grid.
95 Loop schedule (in examples) int main(int argc, const char* argv[]) { int n = 256; int blocks = 4; int bsize = 32; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Define scalars a, b & initialize x, y // Run SAXPY TWICE #pragma acc data copyin(x[0:n]) { #pragma acc parallel copy(y[0:n]) present(x[0:n]) \ num_gangs(blocks) vector_length(bsize) all do the same e.g. int tmp = 5; #pragma acc loop gang in parallel on multiprocessors workshare (w/o barrier) #pragma acc loop vector workshare (w/ barrier) for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; // Change y on host & do second SAXPY on device free(x); free(y); return 0; *32=256 Note: Loop schedules are implementation dependent. This representation is one (common) example. 95
97 Loop schedule (in examples) #pragma acc parallel vector_length(256) #pragma acc loop gang for (int i = 0; i < n; ++i){ #pragma acc loop vector for (int j = 0; j < m; ++j){ // do something Distributes outer loop to GPU multiprocessors (block-wise). Distributes inner loop to threads within thread blocks. 256 threads per thread block usually n blocks in grid #pragma acc kernels #pragma acc loop gang(100) vector(8) for (int i = 0; i < n; ++i){ #pragma acc loop gang(200) vector(32) for (int j = 0; j < m; ++j){ // do something Multidimensional partitioning with parallel construct only possible in v2.0. See tile clause With nested loops, specification of multidimensional blocks and grids possible: use same resource for outer and inner loop. 100 blocks in Y-direction (rows); 200 blocks in X-direction (columns) 8 threads in Y-dimension of one block; 32 threads in X-dimension of one block Total: 8*32=256 threads per thread block; 100*200=20,000 blocks in grid
97 Loop schedule (in examples) #pragma acc parallel vector_length(256) #pragma acc loop gang for (int i = 0; i < n; ++i){ #pragma acc loop vector for (int j = 0; j < m; ++j){ // do something Distributes outer loop to GPU multiprocessors (block-wise). Distributes inner loop to threads within thread blocks. 256 threads per thread block usually n blocks in grid #pragma acc kernels #pragma acc loop gang(100) vector(8) for (int i = 0; i < n; ++i){ #pragma acc loop gang(200) vector(32) for (int j = 0; j < m; ++j){ // do something Multidimensional portioning with parallel construct only possible in v2.0. See tile clause With nested loops, specification of multidimensional blocks and grids possible: use same resource for outer and inner loop. 100 blocks in Y-direction (rows); 200 blocks in X-direction (columns) 8 threads in Y-dimension of one block; 32 threads in X-dimension of one block Total: 8*32=256 threads per thread block; 100*200=20,000 blocks in grid 97
98 Launch configuration How many vector/ gangs to launch? OpenACC runtime tries to find good values automatically Performance portability across different device types possible Own tuning may deliver better performance on a certain hardware Hardware operation Instructions are issued in order Thread execution stalls when one of the operands isn t ready Latency is hidden by switching threads GMEM latency: cycles Arithmetic latency: cycles Need enough threads to hide latency 98
99 Launch configuration Hiding arithmetic latency Need ~18 warps (576 threads) per Fermi MP Or, independent instructions from the same warp x = a + b; //~18 cycles y = a + c; //independent //can start any time // stall z = x + d; //dependent //must wait for compl. Source: Volkov2010 Maximizing global memory throughput Gmem throughput depends on the access pattern, and word size 99 Need enough memory transactions in flight to saturate the bus Independent loads & stores from the same thread (mult. elements) Loads and stores from different threads (many threads) Larger word sizes can also help ( vectors ) e.g. float a0= src[id]; // no latency stall float a1= src[id+blockdim.x]; // stall dst[id]= a0; dst[id+blockdim.x]= a1; // Note, threads don t stall // on memory access Source: Volkov2010
100 Launch configuration Maximizing global memory throughput [Chart: achieved bandwidth (GB/s) vs. threads per block, 32-bit words] Example program: increment an array of ~67M elements Impact of launch configuration (32-bit words) NVIDIA Tesla C2050 (Fermi) ECC off Bandwidth: 144 GB/s
101 Launch configuration Vectors per gang: Multiple of warp size (32) Threads are always spanned in warps onto resources Unused threads are marked as inactive Starting point: vectors/gang Launch MANY threads to keep the GPU busy! Blocks per grid heuristics #blocks > #MPs so all MPs have at least one block to execute #blocks / #MPs > 2 MP can concurrently execute up to 8 blocks Blocks that aren't waiting at a barrier keep hardware busy Subject to resource availability (registers, smem) #blocks > 50 to scale to future devices Most obvious: #blocks * #threads = #problem-size Multiple elements per thread may amortize setup costs of simple kernels: #blocks * #threads < #problem-size 101
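One way to express "multiple elements per thread" from the heuristics above: fix the launch configuration smaller than the problem size and let the runtime assign several iterations to each thread. A sketch; the gang/vector numbers are illustrative, not tuned values:

```c
// Sketch: with num_gangs*vector_length (16*256 = 4096) smaller than n, each
// thread processes several loop iterations; the loop body is unchanged.
// A non-OpenACC compiler ignores the pragma and runs the loop sequentially.
void increment_all(int n, float *a)
{
    #pragma acc parallel loop gang vector num_gangs(16) vector_length(256) copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += 1.0f;
}
```

Omitting num_gangs/vector_length lets the runtime choose, which is usually the more portable starting point.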
102 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 102
103 Recap memory hierarchy Local memory/ Registers Shared memory/ L1 Very low latency (~100x lower than gmem) Bandwidth (aggregate): 1+ TB/s (Fermi), 2.5 TB/s (Kepler) L2 Global memory High latency ( cycles) Bandwidth: 144 GB/s (Fermi), 250 GB/s (Kepler) [Diagram: device with multiprocessors, each holding registers, cores and shared mem/L1; shared L2; global/constant memory; host memory on the host side]
104 Pinned Memory CPU default: data allocations are pageable Use of virtual memory: physical memory pages can be swapped out to disk When host needs page, it loads it back in from the disk No direct access to pageable memory by GPU possible 1. CUDA driver allocates a temporary page-locked ("pinned") host array 2. Copies host data to pinned array (PCIe transfers occur only using DMA) 3. Transfer data from pinned array to device 4. Delete pinned memory after transfer completion Pinned memory = page-locked memory Prevents memory from being swapped out to disk Avoids cost of transferring between pageable and pinned host arrays Improves transfer speeds But: reduces amount of physical memory available to OS & other programs pgcc -ta=nvidia:pinned saxpy.c It's just a compiler flag Try it out!
105 gmem access Stores: invalidate L1, write-back for L2 Loads: caching (default on Fermi) vs. non-caching (enabled by compiler options, e.g. PGI's loadcache) Caching load: attempt to hit L1, then L2, then gmem; load granularity: 128-byte line Non-caching load: attempt to hit L2, then gmem (no L1; invalidate the line if it's already in L1); load granularity: 32-byte segment Threads in a warp provide memory addresses Determine which lines/segments are needed Request the needed lines/segments Kepler uses L1 cache only for thread-private data. [Diagram: gmem access per warp (32 threads), threads t0-t31 with 32-bit words]
106 gmem throughput coalescing [Diagram: addresses from a warp mapped onto 128-byte lines — range of accesses, address alignment, caching loads, coalesced access, packing data]
107 Impact of address alignment [Chart: achieved bandwidth (GB/s) vs. offset in 4-byte words, Fermi, cached loads] Example program: Copy ~67 MB of floats NVIDIA Tesla C2050 (Fermi) ECC off 256 threads/block Misaligned accesses can drop memory throughput
108 Data layout matters (improve coalescing) Example: AoS vs. SoA Array of Structures (AoS): struct mystruct_t { float a; float b; int c; }; mystruct_t mydata[N]; Structure of Arrays (SoA): struct { float a[N]; float b[N]; int c[N]; } mydata; #pragma acc kernels for (int i=0; i<n; i++) { mydata[i].a / mydata.a[i] } Address layout: AoS A[0] B[0] C[0] A[1] B[1] C[1] A[2] B[2] C[2] SoA A[0] A[1] A[2] B[0] B[1] B[2] C[0] C[1] C[2] 108 This slide is based on: NVIDIA Corporation
109 Summary Strive for perfect coalescing Warp should access within contiguous region Align starting address (may require padding) Have enough concurrent accesses to saturate the bus Process several elements per thread Launch enough threads to cover access latency 109
110 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 110
111 Shared memory 111 Smem per SM(X) (10s of KB) Inter-thread communication within a block Need synchronization to avoid RAW / WAR / WAW hazards Low-latency, high-throughput memory Cache data in smem to reduce gmem accesses #pragma acc kernels copyout(b[0:n][0:n]) copyin(a[0:n][0:n]) #pragma acc loop gang vector Accelerator kernel generated 65, #pragma acc loop gang, vector(2) for (int i = 1; i < N-1; ++i){ Cached references to size #pragma acc loop gang vector [(y+2)x(x+2)] block of 'a' for (int j = 1; j < N-1; ++j){ 67, #pragma acc loop gang, vector(128) #pragma acc cache(a[i-1:3][j-1:3]) b[i][j] = (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j]) / 4; This slide is based on: NVIDIA Corporation, the Portland Group The tile clause can additionally strip-mine the data space. It is not yet supported by PGI.
112 Directives: Caching & Strip-mining Cache construct Prioritizes data for placement in the highest level of data cache on GPU C/C++ #pragma acc cache(list) Fortran!$acc cache(list) Use at the beginning of the loop or in front of the loop Tile clause Sometimes the PGI compiler ignores the cache construct. See compiler feedback Strip-mines each loop in a loop nest according to the values in expr-list C/C++ #pragma acc loop [<schedule>] tile(expr-list) 1. entry applies to innermost loop Fortran!$acc loop [<schedule>] tile(expr-list) 112
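A sketch of the tile clause in use (OpenACC 2.0), strip-mining a 2D loop nest into 32x32 tiles so neighboring iterations reuse loaded data. The slide notes PGI did not yet support tile at the time; a compiler without OpenACC support simply ignores the pragma and the loops run in their original order, producing the same result:

```c
// Sketch: tile(32,32) strip-mines the i/j nest into 32x32 tiles; the first
// entry applies to the innermost loop. Transpose is a classic beneficiary,
// since reads and writes cannot both be contiguous without tiling.
void transpose(int n, float a[n][n], float b[n][n])
{
    #pragma acc parallel loop tile(32,32) copyin(a[0:n][0:n]) copyout(b[0:n][0:n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            b[j][i] = a[i][j];
}
```

Combined with the cache directive, each tile can then be staged through shared memory.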
113 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 113
114 Multiple GPUs Several GPUs within one compute node Can be used within or across processes/ threads E.g. assign one device to each CPU thread/ process #include <openacc.h> acc_set_device_num(0, acc_device_nvidia); #pragma acc kernels async acc_set_device_num(1, acc_device_nvidia); #pragma acc kernels async Device-specific tuning of clauses possible with device_type kernel clause. 114
115 Multiple GPUs OpenMP + MPI #include <openacc.h> #include <omp.h> OpenMP #pragma omp parallel num_threads(12) { int numdev = acc_get_num_devices(acc_device_nvidia); int devid = omp_get_thread_num() % numdev; acc_set_device_num(devid, acc_device_nvidia); #include <openacc.h> #include <mpi.h> MPI int myrank; MPI_Comm_rank(MPI_COMM_WORLD, &myrank); int numdev = acc_get_num_devices(acc_device_nvidia); int devid = myrank % numdev; acc_set_device_num(devid, acc_device_nvidia); 115
116 Directives: Multiple Devices Multi GPU (or device) setup int acc_get_num_devices(acc_device_t): returns number of devices of type void acc_set_device_num(int,acc_device_t): sets device number of type int acc_get_device_num(acc_device_t): returns device number of type void acc_set_device_type(acc_device_t): type for offload region acc_device_t acc_get_device_type(): returns device type for next offload region Device-specific clause tuning: device_type or dtype Takes accelerator architecture name or an asterisk: tuning for this device Clauses following a dtype belong to it, up to the end or another dtype Can be applied for: parallel, kernels, loop, update, routine C/C++ #pragma acc <op> device_type(device-type list) (clauses) #pragma acc <op> device_type(*) (clauses) Fortran!$acc <op> device_type(device-type list) (clauses)!$acc <op> device_type(*) (clauses) 116
117 Multiple GPUs MPI Multiple GPUs across nodes Use MPI for communication Use update for an up-to-date version on the host before sending over [Diagram: CPU memory and GPU memory on each node, MPI rank 1 ... MPI rank n-1] #pragma acc update host(s_buf[0:size]) MPI_Send(s_buf,size,MPI_CHAR, n-1,tag,MPI_COMM_WORLD); MPI_Recv(r_buf,size,MPI_CHAR, 0,tag,MPI_COMM_WORLD,&stat); #pragma acc update device(r_buf[0:size]) 117
118 Summary OpenACC can use several devices Within one node: acc_set_device_num to handle GPU affinity Across nodes: use MPI 118
119 .. Hands-On Extend the heterogeneous version to use 2 GPUs within one compute node (w/o MPI) Decomposition of the matrix Use acc_set_device_num to differentiate between the GPUs Use enter data and exit data to move initial data to the GPUs Use update for CPU GPU halo exchange Think about correct synchronization GPU 0 GPU 1 CPU 119 Follow TODOs in GPU/exercises/task4 Exercise 2.6
120 Lab Results Hardware Version Runtime [s] 2x Intel 2.67GHz (Intel Westmere, 12 cores) NVIDIA Quadro 6000 (Fermi, cc 2.0) OpenMP 2.11 OpenACC-Offload OpenACC-Data 1.27 OpenACC-Hetero (1 GPU + 12 OMP threads) OpenACC-MultiGPU (2 GPUs + 12 OMP threads)
121 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 121
122 Introduction OpenMP = de-facto standard for shared-memory parallelization From 1997 until now Since 2009: OpenMP Accelerator subcommittee Subgroup wanted faster development: OpenACC Idea: include lessons learned into OpenMP standard 07/2013: OpenMP 4.0 Accelerator support SIMD, Some content of the following slides has been developed by James Beyer (Cray) and Eric Stotzer (TI), the leaders of the OpenMP for Accelerators subcommittee. RWTH Aachen University is a member of the OpenMP Architecture Review Board (ARB) since Main topics: Thread affinity Tasking Tool support Accelerator support 122
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationOpenACC Course Lecture 1: Introduction to OpenACC September 2015
OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:
More informationAn Introduc+on to OpenACC Part II
An Introduc+on to OpenACC Part II Wei Feinstein HPC User Services@LSU LONI Parallel Programming Workshop 2015 Louisiana State University 4 th HPC Parallel Programming Workshop An Introduc+on to OpenACC-
More informationCUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA
CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes
More informationINTRODUCTION TO OPENACC
INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/
More informationAccelerator programming with OpenACC
..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationAn OpenACC construct is an OpenACC directive and, if applicable, the immediately following statement, loop or structured block.
API 2.6 R EF ER ENC E G U I D E The OpenACC API 2.6 The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran
More informationAn Introduction to OpenACC - Part 1
An Introduction to OpenACC - Part 1 Feng Chen HPC User Services LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 01-03, 2015 Outline of today
More informationAn Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel
An Introduction to OpenACC Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel Chapter 1 Introduction OpenACC is a software accelerator that uses the host and the device. It uses compiler
More informationThe GPU-Cluster. Sandra Wienke Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky
The GPU-Cluster Sandra Wienke wienke@rz.rwth-aachen.de Fotos: Christian Iwainsky Rechen- und Kommunikationszentrum (RZ) The GPU-Cluster GPU-Cluster: 57 Nvidia Quadro 6000 (29 nodes) innovative computer
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2016 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationRWTH GPU-Cluster. Sandra Wienke March Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky
RWTH GPU-Cluster Fotos: Christian Iwainsky Sandra Wienke wienke@rz.rwth-aachen.de March 2012 Rechen- und Kommunikationszentrum (RZ) The GPU-Cluster GPU-Cluster: 57 Nvidia Quadro 6000 (29 nodes) innovative
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationGPU Programming Paradigms
GPU Programming with PGI CUDA Fortran and the PGI Accelerator Programming Model Boris Bierbaum, Sandra Wienke (26.3.2010) 1 GPUs@RZ Current: linuxc7: CentOS 5.3, Nvidia GeForce GT 220 hpc-denver: Windows
More informationIntroduction to OpenACC
Introduction to OpenACC Alexander B. Pacheco User Services Consultant LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 10-12, 2013 HPC@LSU
More informationAn Introduction to OpenAcc
An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt
More informationGPU Computing with OpenACC Directives
GPU Computing with OpenACC Directives Alexey Romanenko Based on Jeff Larkin s PPTs 3 Ways to Accelerate Applications Applications Libraries OpenACC Directives Programming Languages Drop-in Acceleration
More informationINTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016
INTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Course Syllabus Oct 26: Analyzing and Parallelizing with OpenACC
More informationOpenACC introduction (part 2)
OpenACC introduction (part 2) Aleksei Ivakhnenko APC Contents Understanding PGI compiler output Compiler flags and environment variables Compiler limitations in dependencies tracking Organizing data persistence
More informationIntroduction to OpenACC. 16 May 2013
Introduction to OpenACC 16 May 2013 GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance Rendering Data Analytics
More informationIntroduction to OpenACC. Peng Wang HPC Developer Technology, NVIDIA
Introduction to OpenACC Peng Wang HPC Developer Technology, NVIDIA penwang@nvidia.com Outline Introduction of directive-based parallel programming Basic parallel construct Data management Controlling parallelism
More informationLecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES. Jeff Larkin, NVIDIA
ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES Jeff Larkin, NVIDIA OUTLINE Compiler Directives Review Asynchronous Execution OpenACC Interoperability OpenACC `routine` Advanced Data Directives
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationIntroduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator
Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationAdvanced OpenACC. Steve Abbott November 17, 2017
Advanced OpenACC Steve Abbott , November 17, 2017 AGENDA Expressive Parallelism Pipelining Routines 2 The loop Directive The loop directive gives the compiler additional information
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationOpenACC and the Cray Compilation Environment James Beyer PhD
OpenACC and the Cray Compilation Environment James Beyer PhD Agenda A brief introduction to OpenACC Cray Programming Environment (PE) Cray Compilation Environment, CCE An in depth look at CCE 8.2 and OpenACC
More informationCode Migration Methodology for Heterogeneous Systems
Code Migration Methodology for Heterogeneous Systems Directives based approach using HMPP - OpenAcc F. Bodin, CAPS CTO Introduction Many-core computing power comes from parallelism o Multiple forms of
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationPGI Accelerator Programming Model for Fortran & C
PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...
More informationThe PGI Fortran and C99 OpenACC Compilers
The PGI Fortran and C99 OpenACC Compilers Brent Leback, Michael Wolfe, and Douglas Miles The Portland Group (PGI) Portland, Oregon, U.S.A brent.leback@pgroup.com Abstract This paper provides an introduction
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2018
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2018 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationOpenACC Fundamentals. Steve Abbott November 13, 2016
OpenACC Fundamentals Steve Abbott , November 13, 2016 Who Am I? 2005 B.S. Physics Beloit College 2007 M.S. Physics University of Florida 2015 Ph.D. Physics University of New Hampshire
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationADVANCED OPENACC PROGRAMMING
ADVANCED OPENACC PROGRAMMING DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES AGENDA Optimizing OpenACC Loops Routines Update Directive Asynchronous Programming Multi-GPU
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationComparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015
Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2017 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationFundamental Optimizations
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More informationGPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:
COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationOptimizing OpenACC Codes. Peter Messmer, NVIDIA
Optimizing OpenACC Codes Peter Messmer, NVIDIA Outline OpenACC in a nutshell Tune an example application Data motion optimization Asynchronous execution Loop scheduling optimizations Interface OpenACC
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationGPU Computing with OpenACC Directives Presented by Bob Crovella For UNC. Authored by Mark Harris NVIDIA Corporation
GPU Computing with OpenACC Directives Presented by Bob Crovella For UNC Authored by Mark Harris NVIDIA Corporation GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More information