GPGPU Programming with OpenACC
1 GPGPU Programming with OpenACC Sandra Wienke, M.Sc. IT Center, RWTH Aachen University PPCES, March 18th 2016 IT Center der RWTH Aachen University
2 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 2
3 Motivation Why care about accelerators? Towards exa-flop computing: performance gains, but under power constraints. Accelerators provide a good performance-per-watt ratio (a first step). [Chart: Top500 performance development over time (#1, #500, and Sum, each with a trend line), on an axis from 100 MFlop/s to 10 EFlop/s; the Sum trend reaches the ExaFlop/s range around 2020.] Source: Top500, 11/2015
4 Motivation Accelerators/co-processors: GPGPUs (e.g. NVIDIA, AMD; here, NVIDIA GPUs are the focus), Intel Many Integrated Core (MIC) architecture (Intel Xeon Phi), FPGAs (e.g. Convey), DSPs (e.g. TI), ... The number of accelerator systems is growing. [Pie charts, system share and performance share: None 66.3 %, NVIDIA 16.4 %, Intel 16.7 %, AMD/ATI 0.4 %, PEZY-SC 0.2 %.] Sources: Top500, 11/2015; Top500, 6/2015
5 Motivation Top500 (11/2015): 40% of the first 10 Top500 systems contain accelerators.
Rank | Name | Site | Rmax (GFlop/s) | Accelerator/Co-Processor
1 | Tianhe-2 (MilkyWay-2) | National Super Computer Center in Guangzhou | 33,862,700 | Intel Xeon Phi 31S1P
2 | Titan | DOE/SC/Oak Ridge National Laboratory | 17,590,000 | NVIDIA K20x
3 | Sequoia | DOE/NNSA/LLNL | 17,173,224 | None
4 | K computer | RIKEN Advanced Institute for Computational Science (AICS) | 10,510,000 | None
5 | Mira | DOE/SC/Argonne National Laboratory | 8,586,612 | None
6 | Trinity | DOE/NNSA/LANL/SNL | 8,100,900 | None
7 | Piz Daint | Swiss National Supercomputing Centre (CSCS) | 6,271,000 | NVIDIA K20x
8 | Hazel Hen | HLRS - Höchstleistungsrechenzentrum Stuttgart | 5,640,170 | None
9 | Shaheen II | King Abdullah University of Science and Technology | 5,536,990 | None
10 | Stampede | Texas Advanced Computing Center/Univ. of Texas | 5,168,110 | Intel Xeon Phi SE10P
6 Motivation Green500 (11/2015) Comparison to a server with 2x Intel Sandy Bridge 2 GHz: HPL power ~260 W, peak performance 256 GFLOPS, i.e. 985 MFLOPS/W performance per watt. Source: Green500, 11/2015. * Performance data obtained from publicly available sources including TOP500
7 Overview GPGPUs = General Purpose Graphics Processing Units. History, a very brief overview: 80s-90s: development is mainly driven by games; fixed-function 3D graphics pipeline; graphics APIs like OpenGL, DirectX popular. Since 2001: programmable pixel and vertex shaders in the graphics pipeline (adjustments in OpenGL, DirectX); researchers take notice of the performance growth of GPUs, but tasks must be cast into native graphics operations. Since 2006: vertex/pixel shaders are replaced by a single processor unit; support of the programming language C, synchronization, ... -> general-purpose computing.
8 Comparison CPU GPU: 8-core CPU vs. 2880-core GPU. GPU threads: thousands (vs. a few on the CPU); light-weight, little creation overhead; fast switching. Massively parallel processors, manycore architecture.
9 Comparison CPU GPU: different design. [Diagram: CPU die with large control logic, caches and DRAM vs. GPU die dominated by ALUs.] CPU: optimized for low latencies; huge caches; control logic for out-of-order and speculative execution. GPU: optimized for data-parallel throughput; architecture tolerant of memory latency; more transistors dedicated to computation.
10 Why can accelerators deliver a good performance-per-watt ratio? 1. High (peak) performance: more transistors for computation, no (complex) control logic, small caches. 2. Low power consumption: many low-frequency cores (P ~ V^2 * f). [Chart: power usage for 1 TFlop/s of a usual system, broken down into compute, memory, communication, disk, power supply loss, heat removal.] Source: Andrey Semin (Intel): HPC systems energy efficiency optimization thru hardware-software co-design on Intel technologies, EnaHPC 2011; & S. Borkar, J. Gustafson
11 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 11
12 GPU architecture: Fermi (NVIDIA Corporation 2010). 3 billion transistors; streaming multiprocessors (SM), each comprising 32 cores; SM cores contain i.a. a floating-point & integer unit; memory hierarchy. Peak performance (Tesla M2070): SP 1.03 TFlops, DP 515 GFlops; ECC support. Compute capability: 2.0 (defines features, e.g. double-precision capability, memory access patterns).
13 GPU architecture: Kepler. 7.1 billion transistors; streaming multiprocessors extreme (SMX), each comprising 192 cores; memory hierarchy. Peak performance (K20): SP 3.52 TFlops, DP 1.17 TFlops; ECC support. Compute capability: 3.5, e.g. dynamic parallelism = possibility to dynamically launch new work from the GPU.
14 Comparison CPU GPU: weak memory model. Host + device memory are separate entities; no coherence between host + device; data transfers (over the PCI bus) needed. Host-directed execution model: 1. Copy input data from CPU memory to device memory. 2. Execute the device program. 3. Copy results from device memory to CPU memory.
15 Programming model (NVIDIA Corporation 2010). Definitions: Host: CPU, executes functions. Device: usually GPU, executes kernels. The parallel portion of the application is executed on the device as a kernel. A kernel is executed as an array of threads: all threads execute the same code; threads are identified by IDs, used to select input/output data and for control decisions:
float x = input[threadID];
float y = func(x);
output[threadID] = y;
With OpenACC, you don't have to bother with thread IDs.
16 Based on: NVIDIA Corporation 2010 Programming model Threads are grouped into blocks Blocks are grouped into a grid 16
17 Programming model. A kernel is executed as a grid of blocks of threads. [Diagram: over time, the host launches Kernel 1 as a 1D grid of blocks 0-7 and Kernel 2 as a 2D grid of blocks (0,0)-(1,3); each block is an (up to) 3D arrangement of threads.] Dimensions of blocks and grids: up to 3.
18 Programming model Why blocks? Cooperation of threads within a block possible Synchronization Share data/ results using shared memory Scalability Fast communication between n threads is not feasible when n large No global synchronization on GPU possible (only by completing one kernel and starting another one from the CPU) But: blocks are executed independently Blocks can be distributed across arbitrary number of multiprocessors In any order, concurrently, sequentially 18
19 Scalability. Assume: 10 thread blocks. [Diagram: the same 10 blocks are scheduled onto devices with 2 SM(X), 8 SM(X), and 20 SM(X); on the larger devices some SM(X) remain idle.]
20 Mapping to Execution Model Thread Core Each thread is executed by a core Block Multiprocessor instruction cache registers hardware/ software cache Each block is executed on a SM(X) Several concurrent blocks can reside on a SM(X) depending on shared resources Grid (Kernel) Device Each kernel is executed on a device 20
21 Memory model. Thread: registers (Fermi: 63 per thread, K20: 255 per thread); local memory. Block: shared memory. Fermi: 64 KB configurable, on-chip: 16 KB shared + 48 KB L1 OR 48 KB shared + 16 KB L1. K20: 64 KB configurable, on-chip: 16 KB shared + 48 KB L1 OR 48 KB shared + 16 KB L1 OR 32 KB shared + 32 KB L1. Grid/application: constant memory (read-only, off-chip, cached); global memory (several GB, off-chip); Fermi/K20: L2 cache. Caches: L1: Fermi configurable 16/48 KB, K20 configurable 16/32/48 KB; L2: Fermi 768 KB, K20 1536 KB. [Diagram: device with multiprocessors 1..n, each holding per-core registers, shared memory and L1; a shared L2; global/constant memory; connected to the host and host memory.]
22 Summary: execution model - logical hierarchy - memory model. Core - thread/vector - registers. Streaming multiprocessor (SM) - block of threads/gang (sync possible, shared memory) - registers, shared memory, L1 (instruction cache, hardware/software cache). Device: GPU - grid (kernel) - L2, global memory. Host: CPU (connected via PCIe) - host memory. Vector, worker, gang mapping is compiler dependent.
23 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 23
24 GPGPU Programming (based on: NVIDIA Corporation). Three approaches: Libraries - drop-in acceleration; examples: cuBLAS, cuSPARSE. Directives - high productivity; examples: OpenACC, OpenMP. Programming languages - maximum flexibility; examples: CUDA, OpenCL.
25 GPGPU Programming. CUDA (Compute Unified Device Architecture): CUDA C/C++ (NVIDIA): programming language for NVIDIA GPUs; CUDA Fortran (PGI): NVIDIA's CUDA for Fortran, NVIDIA GPUs. OpenCL C (Khronos Group): open standard, portable, CPU/GPU/... OpenACC C/C++, Fortran (PGI, Cray, CAPS, NVIDIA): directive-based accelerator programming; industry standard published in Nov. 2011. OpenMP C/C++, Fortran: directive-based programming for hosts and accelerators; standard, portable; device constructs released in July 2013, implementations soon.
26 Example SAXPY CPU (SAXPY = Single-precision real Alpha X Plus Y: y = a*x + y)

void saxpyCPU(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    for (int i = 0; i < n; ++i) {
        x[i] = i;
        y[i] = 5.0*i - 1.0;
    }
    // Invoke serial SAXPY kernel
    saxpyCPU(n, a, x, y);
    free(x); free(y);
    return 0;
}
27 Example SAXPY OpenACC

void saxpyOpenACC(int n, float a, float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    for (int i = 0; i < n; ++i) {
        x[i] = i;
        y[i] = 5.0*i - 1.0;
    }
    // Invoke SAXPY kernel (offloaded)
    saxpyOpenACC(n, a, x, y);
    free(x); free(y);
    return 0;
}
28 Example SAXPY CUDA

__global__ void saxpy_parallel(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a*x[i] + y[i];
    }
}

int main(int argc, char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *h_x, *h_y;  // Pointers to CPU memory
    h_x = (float*) malloc(n * sizeof(float));
    h_y = (float*) malloc(n * sizeof(float));
    // Initialize h_x, h_y
    for (int i = 0; i < n; ++i) {
        h_x[i] = i;
        h_y[i] = 5.0*i - 1.0;
    }
    float *d_x, *d_y;  // Pointers to GPU memory
    // 1. Allocate data on GPU + transfer data to GPU
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
    // 2. Launch kernel: invoke parallel SAXPY
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid(n / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);
    // 3. Transfer data to CPU + free data on GPU
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
29 Motivation. Nowadays, GPU APIs (like CUDA, OpenCL) are often used. They may be difficult to program (but offer more flexibility) and are verbose/may complicate software design. A directive-based programming model delegates responsibility for low-level GPU programming tasks to the compiler: data movement, kernel execution, awareness of the particular GPU type. OpenACC or OpenMP 4.x device constructs: many tasks can be done by the compiler/runtime; user-directed programming.
30 Overview OpenACC. Open industry standard for portability; introduced by CAPS, Cray, NVIDIA, PGI (Nov. 2011). Here, PGI's OpenACC compiler is used for examples; details are compiler dependent. Supports C, C++ and Fortran; NVIDIA GPUs, AMD GPUs, x86 multicore, Intel MIC (now/near future). Timeline: Nov 11: Specification 1.0. Jun 13: Specification 2.0. Nov 14: Proposal for complex data management. Nov 15: Specification 2.5.
31 Directives Syntax C #pragma acc directive-name [clauses] Fortran!$acc directive-name [clauses] Iterative development process Compiler feedback helpful Whether an accelerator kernel could be generated Which loop schedule is used Where/which data is copied 31
32 OpenACC - Compiling (PGI)
pgcc -acc -ta=nvidia,cc20,7.0 -Minfo=accel saxpy.c
pgcc - C PGI compiler (pgf90 for Fortran)
-acc - tells the compiler to recognize OpenACC directives
-ta=nvidia - specifies the target architecture, here: NVIDIA GPUs
cc20 - optional; specifies the target compute capability (here 2.0)
7.0 - optional; uses CUDA Toolkit 7.0 for code generation
-Minfo=accel - optional; compiler feedback for accelerator code
The PGI tool pgaccelinfo prints the minimal needed compiler flags for the usage of OpenACC on the current hardware (see Development Tips).
33 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 33
34 Directives (in examples): SAXPY serial (host)

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}
35 Directives (in examples): SAXPY acc kernels

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE ("magic": compiler finds the parallelism; sync between the loops)
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}

PGI Compiler Feedback:
main:
48, Generating copyin(x[:10240])
    Generating copy(y[:10240])
50, Loop is parallelizable (Kernel 1)
    Accelerator kernel generated
    Generating Tesla code
    50, #pragma acc loop gang, vector(128)
56, Loop is parallelizable (Kernel 2)
    Accelerator kernel generated
    Generating Tesla code
    56, #pragma acc loop gang, vector(128)
36 Directives (in examples): SAXPY acc kernels

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc kernels  // sync at end of region
    {
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }
    // Modify y on host, do not touch x on host
    #pragma acc kernels  // sync at end of region
    {
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}
37 Directives (in examples): SAXPY acc parallel, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}
38 Directives (in examples): SAXPY acc kernels, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc kernels copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc kernels copy(y[0:n]) copyin(x[0:n])  // sync at end of region
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

kernels != parallel: the compiler can choose different optimizations.
39 Array Shaping. Data clauses can be used on data, kernels or parallel: copy, copyin, copyout, present, create, deviceptr. Since OpenACC 2.5: copy = present_or_copy = pcopy; copyin = present_or_copyin = pcopyin; copyout = present_or_copyout = pcopyout. Array shaping: the compiler sometimes cannot determine the size of arrays; specify it explicitly using data clauses and an array shape.
C/C++ ([lower bound : size]): #pragma acc parallel copyin(a[0:length]) copyout(b[s/2:s/2])
Fortran ([lower bound : upper bound]):!$acc parallel copyin(a(0:length-1)) copyout(b(s/2:s))
40 Directives: Offload regions. An offload region maps to a CUDA kernel function.
parallel - C/C++: #pragma acc parallel [clauses]; Fortran:!$acc parallel [clauses] ...!$acc end parallel. The user is responsible for finding parallelism (loops): acc loop needed for worksharing; no automatic sync between several loops.
kernels - C/C++: #pragma acc kernels [clauses]; Fortran:!$acc kernels [clauses] ...!$acc end kernels. The compiler is responsible for finding parallelism (loops): the acc loop directive is only needed for tuning; automatic sync between loops within a kernels region.
41 Directives: Offload regions. Clauses for compute constructs (parallel, kernels), C/C++ and Fortran:
if(condition) - if condition is true, the accelerated version is executed
async[(int-expr)] - executes asynchronously, see Tuning slides
num_gangs(int-expr) - defines number of parallel gangs
num_workers(int-expr) - defines number of workers within a gang
vector_length(int-expr) - defines length for vector operations
reduction(op:list) - reduction with op at end of region (parallel only)
copy(list) - H2D copy at region start + D2H at region end
copyin(list) - only H2D copy at region start
copyout(list) - only D2H copy at region end
create(list) - allocates data on device, no copy to/from host
present(list) - data is already on device
present_or_*(list) - test whether data is on device; if not, transfer
deviceptr(list) - see Tuning slides
private(list) - copy of each list item for each parallel gang (parallel only)
firstprivate(list) - as private + copy initialization from host (parallel only)
Also: wait, device_type, default(none|present)
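To illustrate the async and wait clauses listed above, here is a minimal sketch (function name and queue number are my own, not from the slides). Compiled without OpenACC, the pragmas are simply ignored and the code still runs correctly on the host:

```c
/* Sketch: two independent kernels are issued into async queue 1; the
   host continues immediately and joins with "acc wait" before the
   results may be used. */
void double_arrays_async(int n, float *x, float *y) {
    #pragma acc parallel loop async(1) copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;   /* first kernel, queue 1 */

    #pragma acc parallel loop async(1) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] *= 2.0f;   /* second kernel, same queue: runs after the first */

    #pragma acc wait(1) /* block the host until queue 1 has drained */
}
```

Using the same queue serializes the two kernels against each other; different queue numbers would allow them to overlap.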
42 Directives: Loops. Share the work of loops: loop work gets distributed among the threads on the GPU (in a certain schedule).
C/C++: #pragma acc loop [clauses]; Fortran:!$acc loop [clauses]
kernels loop defines the loop schedule by an int-expr in gang, worker or vector (similar to num_gangs etc. with parallel).
Loop clauses (C/C++, Fortran):
gang - distributes work into thread blocks
worker - distributes work into warps
vector - distributes work into threads within a warp/thread block
seq - executes the loop sequentially on the device
collapse(n) - collapses n tightly nested loops
independent - declares loop iterations independent (kernels loop only)
reduction(op:list) - reduction with op
private(list) - private copy for each loop iteration
OpenACC 2.0 also: auto, tile, device_type
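To make the collapse clause above concrete, a small sketch (names are illustrative, not from the slides): collapse(2) merges the two tightly nested loops into one iteration space before it is distributed, which helps when the outer trip count alone is too small to fill the GPU.

```c
/* c = a + b for n x m matrices stored row-major in flat arrays.
   collapse(2) flattens both loops into one n*m iteration space
   before the compiler distributes it over gangs and vector lanes. */
void add_matrices(int n, int m, const float *a, const float *b, float *c) {
    #pragma acc parallel loop collapse(2) copyin(a[0:n*m], b[0:n*m]) copyout(c[0:n*m])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            c[i*m + j] = a[i*m + j] + b[i*m + j];
}
```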
43 Misc on Loops
Combined directives:
#pragma acc kernels loop
for (int i = 0; i < n; ++i) { /*...*/ }
#pragma acc parallel loop
for (int i = 0; i < n; ++i) { /*...*/ }
Reductions:
#pragma acc parallel
#pragma acc loop reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += i;
}
Also possible: *, max, min, &, |, ^, &&, ||
The PGI compiler can often recognize reductions on its own. See the compiler feedback, e.g.: "Sum reduction generated for var".
44 Procedure Calls. Procedure calls within a parallel region: the routine directive compiles them for accelerator + host; clauses denote the level of parallelism.

#pragma acc routine seq
float saxpy(float a, float x, float y) {
    return (a*x + y);
}
// [..]
#pragma acc parallel
#pragma acc loop gang vector
for (int i = 0; i < n; ++i) {
    // call on the device
    y[i] = saxpy(a, x[i], y[i]);
}

See the section on Loop Schedules for an explanation of gang, vector, seq. With OpenACC 1.0 (!) routines must be inlined (no routine support).
45 Directives: Procedure Calls. Routine construct: tells the compiler to compile the given procedure for an accelerator & the host. Either use the name or directly put the construct at the procedure declaration.
C/C++: #pragma acc routine clause-list; #pragma acc routine (name) clause-list
Fortran:!$acc routine clause-list;!$acc routine (name) clause-list
A clause must be specified (C/C++, Fortran):
gang - (may) contain/call gang parallelism; executed in gang-redundant mode
worker - (may) contain/call worker parallelism (no gang); executed in worker-single mode
vector - (may) contain/call vector parallelism (no worker/gang); executed in vector-single mode
seq - does not contain/call gang, worker, vector parallelism
bind(name) - specifies a name for compiling/calling (as if specified in the language being compiled)
bind(string) - specifies a name for compiling/calling (string is used for the name)
nohost - no compilation for the host; must be called from within acc compute regions
device_type - device-specific clause tuning
46 Summary. Offloading work: kernels ("magic") vs. parallel region & worksharing construct (loop). Easy reductions via the reduction clause. Procedure calls with the routine construct. Moving data with data clauses (copy, copyin, copyout) and array shaping. Compiler feedback: What is done implicitly? How are explicit directives understood?
47 Development Tips Favorable for parallelization: (nested) loops Large loop counts to compensate (at least) data movement overhead Independent loop iterations Think about data availability on host/ GPU Use data regions to avoid unnecessary data transfers Specify data array shaping (may adjust for alignment) Verifying execution on GPU See PGI compiler feedback. Term Accelerator kernel generated needed. See information on runtime (more details in the hands-on session): export ACC_NOTIFY=3 47
48 Development Tips Conditional compilation for OpenACC Macro _OPENACC Using pointers (C/C++) Problem: compiler can not determine whether loop iterations that access pointers are independent no parallelization possible Solution (if independence is actually there) Use restrict keyword: float *restrict ptr; Use compiler flag: -ansi-alias (PGI) Use OpenACC loop clause: independent Prefer array notation over pointer arithmetic: array[2] instead of *(array+2) 48
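The restrict tip above in code form (a sketch with illustrative names): without restrict the compiler must assume that x and y may alias and cannot prove the loop iterations independent; with it, the loop can be parallelized.

```c
/* restrict promises the compiler that x and y do not alias, so the
   iterations of the loop are independent and may run in parallel. */
void saxpy_restrict(int n, float a, const float *restrict x,
                    float *restrict y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The loop clause independent would achieve the same effect without changing the function signature, but then the programmer (not the type system) vouches for the absence of aliasing.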
49 Development Tips Runtime measurements GPU needs time to initialize (up to a couple of seconds, depending on GPU) Might get included into runtime measurements & influence short runtimes Exclude GPU initialization from measurements by calling acc_init(acc_device_nvidia), #pragma acc init(acc_device_nvidia) Simple profile (that also shows the init time) by Compiler flag: -ta=nvidia,time Environment variable: PGI_ACC_TIME=1 Setup information Information on GPU type, hardware details and configuration NVIDIA: Download CUDA SDK, run devicequery PGI: pgaccelinfo (also specifies needed compiler flags for OpenACC) 49
50 Hands-On. Introduction to the RWTH GPU Cluster environment. Exercises: write your first OpenACC program (Jacobi solver); use parallel and loop directives; follow the TODOs in GPU/exercises/task1. Optional (only if you have completed task 2.2): read slide 104 (pinned memory) and apply this to your program from task 1. Why does this give a performance improvement?
51 The RWTH GPU Cluster. GPU cluster: 57 NVIDIA Quadro 6000 (Fermi). Foto: C. Iwainsky. High utilization of resources: Daytime - VR: new CAVE (49 GPUs); HPC: interactive software development (8 GPUs). Nighttime - HPC: processing of GPGPU compute jobs (55-57 GPUs). aixcave, VR, RWTH Aachen, since June 2012.
52 Hardware & Software Stack 24 rendering nodes 2x NVIDIA Quadro 6000 GPUs (Fermi) 2x Intel Xeon X GHz (Westmere, 12 cores) GPU cluster software environment CUDA Toolkit 7.0 module load cuda directory: $CUDA_ROOT PGI Compiler (OpenACC) TotalView (Debugging) module load pgi (module switch intel pgi) module load totalview 52
53 Lab: Login to GPU machines. User name (see badge): hpclab<xy>. Password (see "Access to Lab Machine"). GPU node (see paper snippet): linuxgpus<ab>. These nodes are available in interactive mode only for the workshop! Jump from the frontend node to the GPU node: ssh -Y linuxgpus<ab>. Remark: GPUs are set to exclusive mode (per process), so only one person can access a GPU; if it is occupied, you get e.g. the message "all CUDA-capable devices are busy or unavailable". Set export CUDA_VISIBLE_DEVICES=<no> to use a certain GPU! See your printout for the correct number (either 0 or 1).
54 See what is running: nvidia-smi
$> nvidia-smi
[Output, Mon Mar 7 2016: for each GPU its ID + type (here 0 and 1, both Quadro 6000), fan, temperature, performance state, memory usage (e.g. 10 MiB / 5375 MiB and 216 MiB / 5375 MiB), GPU utilization, ECC, and compute mode (E. Process); plus a process list, e.g. GPU 0, process ./jacobi, 155 MiB GPU memory usage.]
nvidia-smi -q lists further GPU details.
55 Jacobi Solver. Solving the Laplace equation (2D) by the Jacobi iterative method: nabla^2 A(x,y) = 0. Start with a guess of the objective function A(x,y). Use finite difference discretization (5-point stencil over A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)):
A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4
Iterate over the inner elements of the 2D grid. Repeat until: the maximum number of iterations is reached, or the computed approximation is close to the solution (biggest change on any matrix element < tolerance value).
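The update rule above can be sketched as a single OpenACC sweep in C (a minimal illustration, not the workshop's actual task skeleton; function and variable names are assumptions):

```c
#include <math.h>

/* One Jacobi sweep over the inner elements of an n x m grid
   (5-point stencil, row-major flat arrays). Returns the biggest
   change of any matrix element, to be compared with the tolerance. */
double jacobi_sweep(int n, int m, const double *A, double *Anew) {
    double err = 0.0;
    #pragma acc parallel loop reduction(max:err) \
        copyin(A[0:n*m]) copy(Anew[0:n*m])
    for (int j = 1; j < n - 1; ++j) {
        for (int i = 1; i < m - 1; ++i) {
            Anew[j*m + i] = 0.25 * (A[j*m + i-1] + A[j*m + i+1]
                                  + A[(j-1)*m + i] + A[(j+1)*m + i]);
            double diff = fabs(Anew[j*m + i] - A[j*m + i]);
            if (diff > err) err = diff;
        }
    }
    return err;
}
```

A driver loop would swap A and Anew after each sweep and stop once err falls below the tolerance or the iteration limit is reached; in the exercise, a data region around that loop avoids re-transferring the grid every sweep.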
56 Lab Results. Hardware: 2x Intel 2.67 GHz (Intel Westmere, 12 cores); NVIDIA Quadro 6000 (Fermi, cc 2.0). Version / Runtime [s]: OpenMP - 2.11; OpenACC-Offload; OpenACC-Data; OpenACC-Hetero (1 GPU + 12 OMP threads); OpenACC-MultiGPU (2 GPUs + 12 OMP threads). What went wrong?
57 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 57
58 Directives (in examples): SAXPY acc parallel, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

What went wrong?
59 Analysis with the NVIDIA Visual Profiler. Comes with the NVIDIA Toolkit; call: nvvp. Also includes automatic application analysis (utilization, ...). [Timeline: data movement between host & device dominates, surrounding both the 1st and the 2nd SAXPY execution.]
60 Directives (in examples): SAXPY acc parallel, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n]) // move x, y to device; move y to host
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n]) // move x, y to device again; move y to host
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

The problem: x and y cross the PCI bus between CPU memory and GPU memory for every region, although x never changes.
61 Directives (in examples): SAXPY acc data

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc data copyin(x[0:n])
    {   // x remains on the GPU until the end of the data region
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        // Modify y on host, do not touch x on host
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}

PGI Compiler Feedback:
main:
46, Generating copyin(x[:n])
48, Generating copy(y[:n])
    Generating present(x[:n])
    Accelerator kernel generated
    Generating Tesla code
    50, #pragma acc loop gang, vector(128)
54, Generating copy(y[:n])
    Generating present(x[:n])
    Accelerator kernel generated
    Generating Tesla code
    56, #pragma acc loop gang, vector(128)
62 Directives: Data regions & clauses. Data region: decouples data movement from offload regions.
C/C++: #pragma acc data [clauses]
Fortran:!$acc data [clauses] ...!$acc end data
Data clauses trigger data movement of the denoted arrays (C/C++, Fortran):
if(cond) - if cond is true, move data to the accelerator
copy(list) - H2D copy at region start + D2H at region end
copyin(list) - only H2D copy at region start
copyout(list) - only D2H copy at region end
create(list) - allocates data on device, no copy to/from host
present(list) - data is already on device
present_or_*(list) - test whether data is on device; if not, transfer
deviceptr(list) - see Tuning slides
63 Data Movement. Data clauses can be used on data, kernels or parallel: copy, copyin, copyout, present, create, deviceptr. Since OpenACC 2.5, copy (also copyin, copyout) checks whether the data is on the device and, if not, copies it: copy = present_or_copy = pcopy; copyin = present_or_copyin = pcopyin; copyout = present_or_copyout = pcopyout. create does not copy any data, it just creates it on the device. deviceptr: see section "Interoperability with CUDA & GPU Libraries".
64 Directives (in examples): acc update

#pragma acc data copy(x[0:n])
{
    for (t = 0; t < T; t++) {
        // Modify x on device (e.g. in a subroutine w/o data copying)
        #pragma acc update host(x[0:n])
        // Modify x on host
        #pragma acc update device(x[0:n])
    }
}
65 Directives: Update. Update executable directive: moves data from GPU to host, or host to GPU; used to update existing data after it has changed in its corresponding copy. Data movement can be conditional or asynchronous.
C/C++: #pragma acc update host|device [clauses]
Fortran:!$acc update host|device [clauses]
Update clauses (C/C++, Fortran):
host|self(list) - list variables are copied from accelerator to host
device(list) - list variables are copied from host to accelerator
if(cond) - if cond is true, move the data
if_present - issues no error when data is not present on the device
async[(int-expr)] - executes asynchronously, see Tuning slides
wait[(int-expr)] - data movement waits on actions in the queue to finish
device_type(list) - set a certain device type for the update
66 Unstructured Data Lifetime. Structured data lifetime: applied to a region:

#pragma acc data copyin(x[0:n]) create(y[0:n])
{
    // data lifetime
}

Unstructured data lifetime: enter data allocates (+ copies) data to device memory; exit data deallocates (+ copies) data from device memory. Also possible: copyout.

class Matrix {
public:
    Matrix() {
        v = new double[n];
        #pragma acc enter data copyin(this)
        #pragma acc enter data create(v[0:n])
    }
    ~Matrix() {
        #pragma acc exit data delete(v[0:n], this)
        delete[] v;
    }
private:
    int n;      // size, set elsewhere
    double* v;
};
67 Directives: Unstructured Data Lifetime Enter data construct Allocation (& copy) of scalars and (sub-)arrays in(to) device memory Remain there until end of program or corresponding exit data call C/C++ #pragma acc enter data clause-list Fortran C/C++, Fortran!$acc enter data clause-list copyin(list) Specific clauses for copy or just creation. create(list) Exit data construct (Copies data to host memory &) deletes data from device memory C/C++ #pragma acc exit data clause-list Fortran!$acc exit data clause-list Specific clauses for copy or deletion.... Clauses valid for both: if, async, wait C/C++, Fortran copyout(list) delete(list) finalize 67 background OpenACC
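In plain C, the same unstructured lifetime often appears as paired init/free helpers instead of a constructor/destructor. A sketch (the function and variable names are made up for illustration):

```c
#include <stdlib.h>

// Sketch of an unstructured data lifetime without a C++ class: the device
// copy of `field` outlives any single scope, until field_free is called.
// With a non-OpenACC compiler the pragmas are ignored.
float *field = NULL;
int field_n = 0;

void field_init(int n)
{
    field_n = n;
    field = (float*) malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) field[i] = 0.0f;
    #pragma acc enter data copyin(field[0:n])   // allocate + H2D copy
}

void field_free(void)
{
    #pragma acc exit data delete(field[0:field_n])  // free device memory only
    free(field);
    field = NULL;
}
```

Between the two calls, offload regions can use present(field[0:field_n]) instead of copying.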
68 Deep Copies Support of a flat object model Primitive types Composite types w/o allocatable/pointer members struct { int x[2]; } *A; // static size 2, dynamic size 2 #pragma acc data copy(A[0:2]) Challenges with pointer indirection Non-contiguous transfers Pointer translation struct { int *x; } *A; // dynamic size 2, dynamic size 2 #pragma acc data copy(A[0:2]) [Diagram: shallow copy vs. deep copy of A and its x members across host and device memory] OpenACC proposal for complex data management since 11/2014. Manual deep copy is possible with data API, but usually not practical 68 Courtesy by J.C. Beyer, Cray Inc.
69 Directives: Data API Data API (1/2) d_void* acc_malloc(size_t): allocates device memory void acc_free(d_void*): frees device memory void* acc_copyin(h_void*,size_t): allocates device memory & copies data to device void* acc_present_or_copyin(h_void*,size_t): tests whether data is present on device, if not: allocates & copies data void* acc_create(h_void*,size_t): allocates device memory corresponding to host data void* acc_present_or_create(h_void*,size_t): tests whether data is present on device, if not: allocates device memory corresponding to host data void acc_copyout(h_void*,size_t):copies data to host & deallocates device memory void acc_delete(h_void*,size_t): deallocates device memory corresponding to host data 69 background OpenACC
70 Directives: Data API Data API (2/2) void acc_update_device(h_void*,size_t) and acc_update_self: updates device data copy corresponding to host data void acc_memcpy_to_device(d_void* dest,h_void* src,size_t): copies data from host to device memory void acc_memcpy_from_device(h_void* dest,d_void* src, size_t): copies data from device to host memory 70 int acc_is_present(h_void*,size_t): tests whether data is present on device void acc_map_data(h_void*,d_void*,size_t): maps previously allocated data to the specific host data void acc_unmap_data(h_void*): unmaps device data from the specific host data d_void* acc_deviceptr(h_void*): returns device pointer associated with a specific host address h_void* acc_hostptr(d_void*): returns host pointer associated with a specific device address background OpenACC
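A hedged sketch of the data API in use. The #ifdef guard keeps it compilable without an OpenACC compiler; in that fallback case the "device copy" is just a second host buffer, so only the control flow is illustrated:

```c
#include <stdlib.h>
#include <string.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

// Sketch: mirror a host array on the device via the runtime API instead of
// directives. The function names here are illustrative helpers, not OpenACC.
float *mirror_on_device(float *host, size_t n)
{
#ifdef _OPENACC
    // allocates device memory and copies the host data into it
    return (float*) acc_copyin(host, n * sizeof(float));
#else
    float *copy = (float*) malloc(n * sizeof(float));  // host-side stand-in
    memcpy(copy, host, n * sizeof(float));
    return copy;
#endif
}

void release_mirror(float *host, float *mirror, size_t n)
{
#ifdef _OPENACC
    (void) mirror;
    acc_delete(host, n * sizeof(float));  // frees the device copy
#else
    (void) host; (void) n;
    free(mirror);
#endif
}
```

With a real OpenACC runtime the returned pointer is a device address and must not be dereferenced on the host.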
71 Summary Optimizing code with explicit data transfers Structured: data region Unstructured: enter data, exit data Within data environment: update Clauses: copy, copyin, copyout, Data API for deep copies (and more) Compiler and Runtime feedback Where is the data moved? How much time is spent with data movement? Using NVIDIA Visual Profiler with OpenACC 71
72 Hands-On Data motion optimization of your first OpenACC program Use structured data regions to minimize data movement across the PCI bus Use parallel and loop directives on all loops that read or write the data within the data regions Follow TODOs in GPU/exercises/task2 Exercises
73 Lab Results Hardware Version Runtime [s] 2x Intel 2.67GHz (Intel Westmere, 12 cores) NVIDIA Quadro 6000 (Fermi, cc 2.0) OpenMP 2.11 OpenACC-Offload OpenACC-Data 1.27 OpenACC-Hetero (1 GPU + 12 OMP threads) OpenACC-MultiGPU (2 GPUs + 12 OMP threads) 73
74 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 74
75 Heterogeneous Computing Heterogeneous Computing CPU & GPU are (fully) utilized Challenge: load balancing Domain decomposition If load is known beforehand, static decomposition Exchange data if needed (e.g. halos) matrix vector multiplication GPU CPU = 75
76 Asynchronous operations Definition Synchronous: Control does not return until accelerator action is complete Asynchronous: Control returns immediately Allows heterogeneous computing (CPU + GPU) Default: synchronous operations Asynchronous operations with async clause Kernel execution: kernels, parallel The wait directive can also take an async clause. Data movement: update, enter data, exit data #pragma acc kernels async for (int i=0; i<n; i++) { ... } // do work on host Heterogeneity 76
77 Directives: async clause Async clause Executes parallel kernels update enter data exit data wait operations asynchronously while host process continues with code C/C++ #pragma acc <op-see-above> async[(scalar-int-expr)] Fortran!$acc <op-see-above> async[(scalar-int-expr)] Integer argument (optional) can be seen as CUDA stream number Integer argument can be used in a wait directive Async activities with same argument: executed in order Async activities with diff. argument: executed in any order relative to each other OpenACC 2.0: int arg can be acc_async_sync (sync execution) acc_async_noval (same stream) 77 background OpenACC
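Putting async and wait together for heterogeneous computing: a sketch in which the device handles one half of the array asynchronously while the host handles the other. With a compiler that ignores the pragmas everything runs synchronously on the host, which yields the same result:

```c
// Sketch: offload one half of the work asynchronously, do the other half on
// the host meanwhile, then synchronize before combining the results.
void split_work(int n, float *a, float *b)
{
    #pragma acc parallel loop async(1) copy(a[0:n/2])  // returns immediately
    for (int i = 0; i < n/2; ++i)
        a[i] = 2.0f * a[i];

    for (int i = n/2; i < n; ++i)     // host works concurrently on its half
        a[i] = 2.0f * a[i];

    #pragma acc wait(1)               // block until stream 1 (incl. D2H) done
    for (int i = 0; i < n; ++i)
        b[i] = a[i] + 1.0f;
}
```

The wait is essential: without it the host could read a[0:n/2] before the asynchronous copy-back has completed.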
78 Streams Streams = CUDA concept Stream = sequence of operations that execute in issue-order on GPU Same int-expr in async clause: one stream Different streams/ int-expr: any order relative to each other Tips for debugging - disable async Environment variable ACC_SYNCHRONOUS=1 Use pre-defined int-expr acc_async_sync 78
79 Asynchronous operations & wait Synchronize async operations wait directive Wait for completion of an asynchronous activity (all or certain stream) #pragma acc parallel loop async(1) // kernel A #pragma acc parallel loop async(2) // kernel B #pragma acc wait(1,2) async(3) #pragma acc parallel loop async(3) // wait(1,2) // alternative to wait directive // kernel C #pragma acc parallel loop async(4) wait(3) // kernel D Kernel A Kernel D Kernel C Kernel B Kernel E #pragma acc parallel loop async(5) wait(3) // kernel E #pragma acc wait(1) // kernel F // on host Kernel F 79 Courtesy by J.C. Beyer, Cray Inc.
80 Directives: wait Wait directive Host thread waits for completion of an asynchronous activity If integer expression specified, waits for all async activities with the same value Otherwise, host waits until all async activities have completed C/C++ #pragma acc wait[(scalar-int-expr)] Fortran!$acc wait[(scalar-int-expr)] OpenACC 2.0: wait clause available (parallel, kernels, update, enter data, exit data) Runtime routines int acc_async_test(int): tests for completion of all associated async activities; returns nonzero value/.true. if all have completed int acc_async_test_all(): tests for completion of all async activities int acc_async_wait(int): waits for completion of all associated async activities; routine will not return until the latest async activity has completed int acc_async_wait_all(): waits for completion of all async activities 80 background OpenACC
81 Overlapping data transfers and computations Other feature of streams Allows overlap of tasks on GPU PCIe transfers in both directions #pragma acc data create(a[0:n]) { #pragma acc update \ device(a[0:n/2]) async(1) stream 1 H2D Plus multiple kernels (up to 16 with Kepler) H2D H2D K D2H K stream 2 K without streams D2H D2H H2D H2D K K D2H D2H #pragma acc update \ device(a[n/2:n/2]) async(2) #pragma acc parallel loop async(1) // compute on A[0:n/2] #pragma acc parallel loop async(2) // compute on A[n/2:n/2] potential runtime improvement #pragma acc update \ host(a[0:n/2]) async(1) using async ops #pragma acc update \ host(a[n/2:n/2]) async(2) 81
82 Summary OpenACC default: synchronous operations CPU thread waits until OpenACC kernel/ movement is completed Asynchronous operations enable: Heterogeneous computing (utilizing CPU + GPU) async clause returns control directly to CPU thread Overlapping of kernels (and data transfers) Specify streams by using int-expressions in async clause Synchronization possible by wait directive 82
83 Hands-On Heterogeneous computing on CPU & GPU Decomposition of the matrix (what is the best distribution?) Use async & wait Make sure reduction variables are copied back at certain point Use OpenMP on host Follow TODOs in GPU/exercises/task3 Exercise
84 Heterogeneous Computing Domain decomposition [Diagram: matrix split between a GPU part and a CPU part; halo exchange between the parts each iteration] 84
85 Lab Results Hardware Version Runtime [s] 2x Intel 2.67GHz (Intel Westmere, 12 cores) NVIDIA Quadro 6000 (Fermi, cc 2.0) OpenMP 2.11 OpenACC-Offload OpenACC-Data 1.27 OpenACC-Hetero (1 GPU + 12 OMP threads) OpenACC-MultiGPU (2 GPUs + 12 OMP threads)
86 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 86
87 Based on: NVIDIA Corporation Interoperability with Libraries and CUDA Application Libraries Library OpenACC Directives Low-level language OpenACC Programming Languages Drop-in acceleration High productivity Maximum flexibility CUSPARSE CUBLAS OpenACC CUDA 87
88 Interoperability CUDA and OpenACC both operate on device memory Interoperability with (CUDA) libraries Interoperability with custom (CUDA) C/C++/Fortran code Make device address available on host or in OpenACC code #pragma acc data copy(x[0:n]) { #pragma acc kernels loop for (int i=0; i<n; i++) { // work on x[i] #pragma acc host_data use_device(x) { func(x); // use device pointer x: for library, for custom CUDA code cudaMalloc(&x, sizeof(float)*n); // use device pointer x: // - for library // - for custom CUDA code #pragma acc data deviceptr(x) { #pragma acc kernels loop for (int i=0; i<n; i++) { // work on x[i] assume x is pointer to device memory
89 Directives: deviceptr & host_data Deviceptr clause Declares that pointers in list refer to device pointers that need not be allocated/ moved between host and device C/C++ #pragma acc parallel kernels data deviceptr(list) Fortran!$acc data deviceptr(list) Host_data construct Makes the address of device data in list available on the host Vars in list must be present in device memory C/C++ #pragma acc host_data use_device(list) Fortran!$acc host_data use_device(list)!$acc end host_data only valid clause 89 background OpenACC
90 Summary OpenACC is interoperable with CUDA and GPU Libraries host_data: device address available on host deviceptr: device address available in OpenACC code 90
91 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 91
92 Execution Model Threads execute as groups of 32 threads (warps) Threads in warp share same program counter t0 t1 t31 t32 t63 SIMT architecture SIMT = single instruction, multiple threads warp Possible mapping to CUDA terminology (GPUs) gang = block worker = warp compiler dependent Execution Model Vector Gang Core Multiprocessor instruction cache registers vector = threads Within block (if omitting worker) Within warp (if specifying worker) Grid (Kernel) hardware/ software cache Device 92
93 Loop schedule (in examples) int main(int argc, const char* argv[]) { int n = 10240; float a = 2.0f; float b = 3.0f; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Initialize x, y // Run SAXPY TWICE #pragma acc data copyin(x[0:n]) { #pragma acc kernels copy(y[0:n]) present(x[0:n]) #pragma acc loop gang vector(256) for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; // Modify y on host, do not touch x on host #pragma acc kernels copy(y[0:n]) present(x[0:n]) #pragma acc loop for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; gang(64) vector free(x); free(y); return 0; Without loop schedule 33, Loop is parallelizable Accelerator kernel generated 33, #pragma acc loop gang, vector(128) 39, Loop is parallelizable Accelerator kernel generated 39, #pragma acc loop gang, vector(128) With loop schedule 33, Loop is parallelizable Accelerator kernel generated 33, #pragma acc loop gang, vector(256) 39, Loop is parallelizable Accelerator kernel generated 39, #pragma acc loop gang(64), vector(128) 93
94 Loop schedule (in examples) int main(int argc, const char* argv[]) { int n = 10240; float a = 2.0f; float b = 3.0f; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Initialize x, y // Run SAXPY TWICE #pragma acc data copyin(x[0:n]) { 94 #pragma acc parallel copy(y[0:n]) present(x[0:n]) #pragma acc loop gang vector for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; // Modify y on host, do not touch x on host #pragma acc parallel copy(y[0:n]) present(x[0:n]) #pragma acc loop gang vector for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; free(x); free(y); return 0; vector_length(256) num_gangs(64) Can both be used with parallel or kernels: vector_length: Specifies number of threads in thread block. num_gangs: Specifies number of thread blocks in grid.
95 Loop schedule (in examples) int main(int argc, const char* argv[]) { int n = 256; int blocks = 4; int bsize = 32; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Define scalars a, b & initialize x, y // Run SAXPY TWICE #pragma acc data copyin(x[0:n]) { #pragma acc parallel copy(y[0:n]) present(x[0:n]) \ num_gangs(blocks) vector_length(bsize) all do the same e.g. int tmp = 5; #pragma acc loop gang in parallel on multiprocessors workshare (w/o barrier) #pragma acc loop vector workshare (w/ barrier) for (int i = 0; i < n; ++i){ y[i] = a*x[i] + y[i]; // Change y on host & do second SAXPY on device free(x); free(y); return 0; *32=256 Note: Loop schedules are implementation dependent. This representation is one (common) example. 95
97 Loop schedule (in examples) #pragma acc parallel vector_length(256) #pragma acc loop gang for (int i = 0; i < n; ++i){ #pragma acc loop vector for (int j = 0; j < m; ++j){ // do something Distributes outer loop to GPU multiprocessors (block-wise). Distributes inner loop to threads within thread blocks. 256 threads per thread block usually n blocks in grid #pragma acc kernels #pragma acc loop gang(100) vector(8) for (int i = 0; i < n; ++i){ #pragma acc loop gang(200) vector(32) for (int j = 0; j < m; ++j){ // do something Multidimensional partitioning with parallel construct only possible in v2.0. See tile clause With nested loops, specification of multidimensional blocks and grids possible: use same resource for outer and inner loop. 100 blocks in Y-direction (rows); 200 blocks in X-direction (columns) 8 threads in Y-dimension of one block; 32 threads in X-dimension of one block Total: 8*32=256 threads per thread block; 100*200=20,000 blocks in grid
97 Loop schedule (in examples) #pragma acc parallel vector_length(256) #pragma acc loop gang for (int i = 0; i < n; ++i){ #pragma acc loop vector for (int j = 0; j < m; ++j){ // do something Distributes outer loop to GPU multiprocessors (block-wise). Distributes inner loop to threads within thread blocks. 256 threads per thread block usually n blocks in grid #pragma acc kernels #pragma acc loop gang(100) vector(8) for (int i = 0; i < n; ++i){ #pragma acc loop gang(200) vector(32) for (int j = 0; j < m; ++j){ // do something Multidimensional portioning with parallel construct only possible in v2.0. See tile clause With nested loops, specification of multidimensional blocks and grids possible: use same resource for outer and inner loop. 100 blocks in Y-direction (rows); 200 blocks in X-direction (columns) 8 threads in Y-dimension of one block; 32 threads in X-dimension of one block Total: 8*32=256 threads per thread block; 100*200=20,000 blocks in grid 97
98 Launch configuration How many vector/ gangs to launch? OpenACC runtime tries to find good values automatically Performance portability across different device types possible Own tuning may deliver better performance on a certain hardware Hardware operation Instructions are issued in order Thread execution stalls when one of the operands isn t ready Latency is hidden by switching threads GMEM latency: cycles Arithmetic latency: cycles Need enough threads to hide latency 98
99 Launch configuration Hiding arithmetic latency Need ~18 warps (576 threads) per Fermi MP Or, independent instructions from the same warp x = a + b; //~18 cycles y = a + c; //independent //can start any time // stall z = x + d; //dependent //must wait for compl. Source: Volkov2010 Maximizing global memory throughput Gmem throughput depends on the access pattern, and word size 99 Need enough memory transactions in flight to saturate the bus Independent loads & stores from the same thread (mult. elements) Loads and stores from different threads (many threads) Larger word sizes can also help ( vectors ) e.g. float a0= src[id]; // no latency stall float a1= src[id+blockdim.x]; // stall dst[id]= a0; dst[id+blockdim.x]= a1; // Note, threads don t stall // on memory access Source: Volkov2010
100 Launch configuration Maximizing global memory throughput [Chart: achieved bandwidth (GB/s) vs. threads per block, 32-bit words] Example program: increment an array of ~67M elements Impact of launch configuration (32-bit words) NVIDIA Tesla C2050 (Fermi) ECC off Bandwidth: 144 GB/s
101 Launch configuration Vectors per gang: Multiple of warp size (32) Threads are always spanned in warps onto resources Unused threads are marked as inactive Starting point: vectors/gang Launch MANY threads to keep the GPU busy! Blocks per grid heuristics #blocks > #MPs so all MPs have at least one block to execute #blocks / #MPs > 2 MP can concurrently execute up to 8 blocks Blocks that aren't waiting at a barrier keep hardware busy Subject to resource availability (registers, smem) #blocks > 50 to scale to future devices Most obvious: #blocks * #threads = #problem-size Multiple elements per thread may amortize setup costs of simple kernels: #blocks * #threads < #problem-size 101
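One way to express "multiple elements per thread" from the heuristics above: fix the launch configuration smaller than the problem size and let the runtime assign several iterations to each thread. A sketch; the gang/vector numbers are illustrative, not tuned values:

```c
// Sketch: with num_gangs*vector_length (16*256 = 4096) smaller than n, each
// thread processes several loop iterations; the loop body is unchanged.
// A non-OpenACC compiler ignores the pragma and runs the loop sequentially.
void increment_all(int n, float *a)
{
    #pragma acc parallel loop gang vector num_gangs(16) vector_length(256) copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += 1.0f;
}
```

Omitting num_gangs/vector_length lets the runtime choose, which is usually the more portable starting point.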
102 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 102
103 Recap memory hierarchy Local memory/ Registers Shared memory/ L1 Very low latency (~100x lower than gmem) Bandwidth (aggregate): 1+ TB/s (Fermi), 2.5 TB/s (Kepler) L2 Global memory High latency ( cycles) Bandwidth: 144 GB/s (Fermi), 250 GB/s (Kepler) [Diagram: device with multiprocessors, each holding registers, cores and shared mem/L1; shared L2; global/constant memory; host memory on the host side]
104 Pinned Memory CPU default: data allocations are pageable Use of virtual memory: physical memory pages can be swapped out to disk When host needs page, it loads it back in from the disk No direct access to pageable memory by GPU possible 1. CUDA driver allocates a temporary page-locked ("pinned") host array 2. Copies host data to pinned array (PCIe transfers occur only using DMA) 3. Transfer data from pinned array to device 4. Delete pinned memory after transfer completion Pinned memory = page-locked memory Prevents memory from being swapped out to disk Avoids cost of transferring between pageable and pinned host arrays Improves transfer speeds But: reduces amount of physical memory available to OS & other programs pgcc -ta=nvidia:pinned saxpy.c It's just a compiler flag Try it out!
105 gmem access Stores: invalidate L1, write-back for L2 Loads: caching (default on Fermi) vs. non-caching (enabled by compiler options, e.g. PGI's loadcache) Caching load: attempt to hit L1, then L2, then gmem; load granularity: 128-byte line Non-caching load: attempt to hit L2, then gmem (no L1; invalidate the line if it's already in L1); load granularity: 32-byte segment Threads in a warp provide memory addresses Determine which lines/segments are needed Request the needed lines/segments Kepler uses L1 cache only for thread-private data. [Diagram: gmem access per warp (32 threads), threads t0-t31 with 32-bit words]
106 gmem throughput coalescing [Diagram: addresses from a warp mapped onto 128-byte lines — range of accesses, address alignment, caching loads, coalesced access, packing data]
107 Impact of address alignment [Chart: achieved bandwidth (GB/s) vs. offset in 4-byte words, Fermi, cached loads] Example program: Copy ~67 MB of floats NVIDIA Tesla C2050 (Fermi) ECC off 256 threads/block Misaligned accesses can drop memory throughput
108 Data layout matters (improve coalescing) Example: AoS vs. SoA Array of Structures (AoS): struct mystruct_t { float a; float b; int c; }; mystruct_t mydata[N]; Structure of Arrays (SoA): struct { float a[N]; float b[N]; int c[N]; } mydata; #pragma acc kernels for (int i=0; i<n; i++) { mydata[i].a / mydata.a[i] } Address layout: AoS A[0] B[0] C[0] A[1] B[1] C[1] A[2] B[2] C[2] SoA A[0] A[1] A[2] B[0] B[1] B[2] C[0] C[1] C[2] 108 This slide is based on: NVIDIA Corporation
109 Summary Strive for perfect coalescing Warp should access within contiguous region Align starting address (may require padding) Have enough concurrent accesses to saturate the bus Process several elements per thread Launch enough threads to cover access latency 109
110 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 110
111 Shared memory 111 Smem per SM(X) (10s of KB) Inter-thread communication within a block Need synchronization to avoid RAW / WAR / WAW hazards Low-latency, high-throughput memory Cache data in smem to reduce gmem accesses #pragma acc kernels copyout(b[0:n][0:n]) copyin(a[0:n][0:n]) #pragma acc loop gang vector Accelerator kernel generated 65, #pragma acc loop gang, vector(2) for (int i = 1; i < N-1; ++i){ Cached references to size #pragma acc loop gang vector [(y+2)x(x+2)] block of 'a' for (int j = 1; j < N-1; ++j){ 67, #pragma acc loop gang, vector(128) #pragma acc cache(a[i-1:3][j-1:3]) b[i][j] = (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j]) / 4; This slide is based on: NVIDIA Corporation, the Portland Group The tile clause can additionally strip-mine the data space. It is not yet supported by PGI.
112 Directives: Caching & Strip-mining Cache construct Prioritizes data for placement in the highest level of data cache on GPU C/C++ #pragma acc cache(list) Fortran!$acc cache(list) Use at the beginning of the loop or in front of the loop Tile clause Sometimes the PGI compiler ignores the cache construct. See compiler feedback Strip-mines each loop in a loop nest according to the values in expr-list C/C++ #pragma acc loop [<schedule>] tile(expr-list) 1. entry applies to innermost loop Fortran!$acc loop [<schedule>] tile(expr-list) 112
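A sketch of the tile clause in use (OpenACC 2.0), strip-mining a 2D loop nest into 32x32 tiles so neighboring iterations reuse loaded data. The slide notes PGI did not yet support tile at the time; a compiler without OpenACC support simply ignores the pragma and the loops run in their original order, producing the same result:

```c
// Sketch: tile(32,32) strip-mines the i/j nest into 32x32 tiles; the first
// entry applies to the innermost loop. Transpose is a classic beneficiary,
// since reads and writes cannot both be contiguous without tiling.
void transpose(int n, float a[n][n], float b[n][n])
{
    #pragma acc parallel loop tile(32,32) copyin(a[0:n][0:n]) copyout(b[0:n][0:n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            b[j][i] = a[i][j];
}
```

Combined with the cache directive, each tile can then be staged through shared memory.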
113 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 113
114 Multiple GPUs Several GPUs within one compute node Can be used within or across processes/ threads E.g. assign one device to each CPU thread/ process #include <openacc.h> acc_set_device_num(0, acc_device_nvidia); #pragma acc kernels async acc_set_device_num(1, acc_device_nvidia); #pragma acc kernels async Device-specific tuning of clauses possible with device_type kernel clause. 114
115 Multiple GPUs OpenMP + MPI #include <openacc.h> #include <omp.h> OpenMP #pragma omp parallel num_threads(12) { int numdev = acc_get_num_devices(acc_device_nvidia); int devid = omp_get_thread_num() % numdev; acc_set_device_num(devid, acc_device_nvidia); #include <openacc.h> #include <mpi.h> MPI int myrank; MPI_Comm_rank(MPI_COMM_WORLD, &myrank); int numdev = acc_get_num_devices(acc_device_nvidia); int devid = myrank % numdev; acc_set_device_num(devid, acc_device_nvidia); 115
116 Directives: Multiple Devices Multi GPU (or device) setup int acc_get_num_devices(acc_device_t): returns number of devices of type void acc_set_device_num(int,acc_device_t): sets device number of type int acc_get_device_num(acc_device_t): returns device number of type void acc_set_device_type(acc_device_t): type for offload region acc_device_t acc_get_device_type(): returns device type for next offload region Device-specific clause tuning: device_type or dtype Takes accelerator architecture name or an asterisk: tuning for this device Clauses following a dtype belong to it, up to the end or another dtype Can be applied for: parallel, kernels, loop, update, routine C/C++ #pragma acc <op> device_type(device-type list) (clauses) #pragma acc <op> device_type(*) (clauses) Fortran!$acc <op> device_type(device-type list) (clauses)!$acc <op> device_type(*) (clauses) 116
117 Multiple GPUs MPI Multiple GPUs across nodes Use MPI for communication Use update for an up-to-date version on the host before sending over [Diagram: CPU memory and GPU memory on each node, MPI rank 1 ... MPI rank n-1] #pragma acc update host(s_buf[0:size]) MPI_Send(s_buf,size,MPI_CHAR, n-1,tag,MPI_COMM_WORLD); MPI_Recv(r_buf,size,MPI_CHAR, 0,tag,MPI_COMM_WORLD,&stat); #pragma acc update device(r_buf[0:size]) 117
118 Summary OpenACC can use several devices Within one node: acc_set_device_num to handle GPU affinity Across nodes: use MPI 118
119 .. Hands-On Extend the heterogeneous version to use 2 GPUs within one compute node (w/o MPI) Decomposition of the matrix Use acc_set_device_num to differentiate between the GPUs Use enter data and exit data to move initial data to the GPUs Use update for CPU GPU halo exchange Think about correct synchronization GPU 0 GPU 1 CPU 119 Follow TODOs in GPU/exercises/task4 Exercise 2.6
120 Lab Results Hardware Version Runtime [s] 2x Intel 2.67GHz (Intel Westmere, 12 cores) NVIDIA Quadro 6000 (Fermi, cc 2.0) OpenMP 2.11 OpenACC-Offload OpenACC-Data 1.27 OpenACC-Hetero (1 GPU + 12 OMP threads) OpenACC-MultiGPU (2 GPUs + 12 OMP threads)
121 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs] 121
122 Introduction OpenMP = de-facto standard for shared-memory parallelization From 1997 until now Since 2009: OpenMP Accelerator subcommittee Subgroup wanted faster development: OpenACC Idea: include lessons learned into OpenMP standard 07/2013: OpenMP 4.0 Accelerator support SIMD, Some content of the following slides has been developed by James Beyer (Cray) and Eric Stotzer (TI), the leaders of the OpenMP for Accelerators subcommittee. RWTH Aachen University is a member of the OpenMP Architecture Review Board (ARB) since Main topics: Thread affinity Tasking Tool support Accelerator support 122
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationOpenACC Course Lecture 1: Introduction to OpenACC September 2015
OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:
More informationAn Introduc+on to OpenACC Part II
An Introduc+on to OpenACC Part II Wei Feinstein HPC User Services@LSU LONI Parallel Programming Workshop 2015 Louisiana State University 4 th HPC Parallel Programming Workshop An Introduc+on to OpenACC-
More informationCUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA
CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes
More informationINTRODUCTION TO OPENACC
INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/
More informationAccelerator programming with OpenACC
..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationAn OpenACC construct is an OpenACC directive and, if applicable, the immediately following statement, loop or structured block.
API 2.6 R EF ER ENC E G U I D E The OpenACC API 2.6 The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran
More informationAn Introduction to OpenACC - Part 1
An Introduction to OpenACC - Part 1 Feng Chen HPC User Services LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 01-03, 2015 Outline of today
More informationAn Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel
An Introduction to OpenACC Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel Chapter 1 Introduction OpenACC is a software accelerator that uses the host and the device. It uses compiler
More informationThe GPU-Cluster. Sandra Wienke Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky
The GPU-Cluster Sandra Wienke wienke@rz.rwth-aachen.de Fotos: Christian Iwainsky Rechen- und Kommunikationszentrum (RZ) The GPU-Cluster GPU-Cluster: 57 Nvidia Quadro 6000 (29 nodes) innovative computer
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2016 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationRWTH GPU-Cluster. Sandra Wienke March Rechen- und Kommunikationszentrum (RZ) Fotos: Christian Iwainsky
RWTH GPU-Cluster Fotos: Christian Iwainsky Sandra Wienke wienke@rz.rwth-aachen.de March 2012 Rechen- und Kommunikationszentrum (RZ) The GPU-Cluster GPU-Cluster: 57 Nvidia Quadro 6000 (29 nodes) innovative
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationGPU Programming Paradigms
GPU Programming with PGI CUDA Fortran and the PGI Accelerator Programming Model Boris Bierbaum, Sandra Wienke (26.3.2010) 1 GPUs@RZ Current: linuxc7: CentOS 5.3, Nvidia GeForce GT 220 hpc-denver: Windows
More informationIntroduction to OpenACC
Introduction to OpenACC Alexander B. Pacheco User Services Consultant LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 10-12, 2013 HPC@LSU
More informationAn Introduction to OpenAcc
An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt
More informationGPU Computing with OpenACC Directives
GPU Computing with OpenACC Directives Alexey Romanenko Based on Jeff Larkin s PPTs 3 Ways to Accelerate Applications Applications Libraries OpenACC Directives Programming Languages Drop-in Acceleration
More informationINTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016
INTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Course Syllabus Oct 26: Analyzing and Parallelizing with OpenACC
More informationOpenACC introduction (part 2)
OpenACC introduction (part 2) Aleksei Ivakhnenko APC Contents Understanding PGI compiler output Compiler flags and environment variables Compiler limitations in dependencies tracking Organizing data persistence
More informationIntroduction to OpenACC. 16 May 2013
Introduction to OpenACC 16 May 2013 GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance Rendering Data Analytics
More informationIntroduction to OpenACC. Peng Wang HPC Developer Technology, NVIDIA
Introduction to OpenACC Peng Wang HPC Developer Technology, NVIDIA penwang@nvidia.com Outline Introduction of directive-based parallel programming Basic parallel construct Data management Controlling parallelism
More informationLecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES. Jeff Larkin, NVIDIA
ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES Jeff Larkin, NVIDIA OUTLINE Compiler Directives Review Asynchronous Execution OpenACC Interoperability OpenACC `routine` Advanced Data Directives
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationIntroduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator
Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationAdvanced OpenACC. Steve Abbott November 17, 2017
Advanced OpenACC Steve Abbott , November 17, 2017 AGENDA Expressive Parallelism Pipelining Routines 2 The loop Directive The loop directive gives the compiler additional information
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationOpenACC and the Cray Compilation Environment James Beyer PhD
OpenACC and the Cray Compilation Environment James Beyer PhD Agenda A brief introduction to OpenACC Cray Programming Environment (PE) Cray Compilation Environment, CCE An in depth look at CCE 8.2 and OpenACC
More informationCode Migration Methodology for Heterogeneous Systems
Code Migration Methodology for Heterogeneous Systems Directives based approach using HMPP - OpenAcc F. Bodin, CAPS CTO Introduction Many-core computing power comes from parallelism o Multiple forms of
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationPGI Accelerator Programming Model for Fortran & C
PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...
More informationThe PGI Fortran and C99 OpenACC Compilers
The PGI Fortran and C99 OpenACC Compilers Brent Leback, Michael Wolfe, and Douglas Miles The Portland Group (PGI) Portland, Oregon, U.S.A brent.leback@pgroup.com Abstract This paper provides an introduction
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2018
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2018 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationOpenACC Fundamentals. Steve Abbott November 13, 2016
OpenACC Fundamentals Steve Abbott , November 13, 2016 Who Am I? 2005 B.S. Physics Beloit College 2007 M.S. Physics University of Florida 2015 Ph.D. Physics University of New Hampshire
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationADVANCED OPENACC PROGRAMMING
ADVANCED OPENACC PROGRAMMING DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES AGENDA Optimizing OpenACC Loops Routines Update Directive Asynchronous Programming Multi-GPU
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationComparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015
Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2017 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationFundamental Optimizations
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More informationGPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:
COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationOptimizing OpenACC Codes. Peter Messmer, NVIDIA
Optimizing OpenACC Codes Peter Messmer, NVIDIA Outline OpenACC in a nutshell Tune an example application Data motion optimization Asynchronous execution Loop scheduling optimizations Interface OpenACC
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationGPU Computing with OpenACC Directives Presented by Bob Crovella For UNC. Authored by Mark Harris NVIDIA Corporation
GPU Computing with OpenACC Directives Presented by Bob Crovella For UNC Authored by Mark Harris NVIDIA Corporation GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More information