GPGPU Programming with OpenACC


1 GPGPU Programming with OpenACC. Sandra Wienke, M.Sc., IT Center, RWTH Aachen University. PPCES, March 18th, 2016.

2 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

3 Motivation: why care about accelerators? Towards exa-flop computing: performance gain, but power constraints; accelerators provide a good performance per watt ratio (a first step). [Chart: Top500 performance development over time (#1, #500, Sum, with trend lines), ranging from 100 MFlop/s up towards 10 EFlop/s, projected to 2020.] Source: Top500, 11/2015

4 Motivation: accelerators/co-processors. GPGPUs (e.g. NVIDIA, AMD; here, NVIDIA GPUs are the focus); Intel Many Integrated Core (MIC) architecture (Intel Xeon Phi); FPGAs (e.g. Convey); DSPs (e.g. TI). The number of accelerator systems is growing. [Pie charts: system share and performance share of accelerators in the Top500: None 66.3 %, NVIDIA 16.4 %, Intel 16.7 %, AMD/ATI 0.4 %, PEZY-SC 0.2 %.] Sources: Top500, 11/2015 and 6/2015

5 Motivation: Top500 (11/2015). 40% of the first 10 Top500 systems contain accelerators.

    Rank  Name                    Site                                                        Rmax (GFlop/s)  Power (kW)  Mflops/Watt  Accelerator/Co-Processor
    1     Tianhe-2 (MilkyWay-2)   National Super Computer Center in Guangzhou                33,862,700      17,808      ~1,902       Intel Xeon Phi 31S1P
    2     Titan                   DOE/SC/Oak Ridge National Laboratory                       17,590,000       8,209      ~2,143       NVIDIA K20x
    3     Sequoia                 DOE/NNSA/LLNL                                              17,173,224       7,890      ~2,177       None
    4     K computer              RIKEN Advanced Institute for Computational Science (AICS)  10,510,000      12,660        ~830       None
    5     Mira                    DOE/SC/Argonne National Laboratory                          8,586,612       3,945      ~2,177       None
    6     Trinity                 DOE/NNSA/LANL/SNL                                           8,100,900                                None
    7     Piz Daint               Swiss National Supercomputing Centre (CSCS)                 6,271,000       2,325      ~2,697       NVIDIA K20x
    8     Hazel Hen               HLRS - Höchstleistungsrechenzentrum Stuttgart               5,640,170                                None
    9     Shaheen II              King Abdullah University of Science and Technology          5,536,990       2,834      ~1,954       None
    10    Stampede                Texas Advanced Computing Center/Univ. of Texas              5,168,110       4,510      ~1,146       Intel Xeon Phi SE10P

6 Motivation: Green500 (11/2015), performance per watt. Comparison to a server with 2x Intel Sandy Bridge @ 2 GHz: HPL power ~260 W, peak performance 256 GFLOPS, i.e. about 985 MFLOPS/W. [Chart: MFLOPS/W of the Green500 leaders.] Source: Green500, 11/2015. * Performance data obtained from publicly available sources including TOP500.

7 Overview: GPGPUs = General Purpose Graphics Processing Units. History, a very brief overview. 80s-90s: development is mainly driven by games; fixed-function 3D graphics pipeline; graphics APIs like OpenGL and DirectX are popular. Since 2001: programmable pixel and vertex shaders in the graphics pipeline (adjustments in OpenGL, DirectX); researchers take notice of the performance growth of GPUs, but tasks must be cast into native graphics operations. Since 2006: vertex/pixel shaders are replaced by a single processor unit; support of the programming language C, synchronization, etc.: GPUs become usable for general-purpose computing.

8 Comparison CPU GPU: 8-core CPU vs. 2880-core GPU. GPU threads: thousands (vs. few on the CPU); light-weight, little creation overhead; fast switching. GPUs are massively parallel processors (manycore architecture).

9 Comparison CPU GPU: different design. [Diagram: CPU die with large control logic, L2 and few ALUs vs. GPU die dominated by ALUs; each with its own DRAM.] CPU: optimized for low latencies; huge caches; control logic for out-of-order and speculative execution. GPU: optimized for data-parallel throughput; architecture tolerant of memory latency; more transistors dedicated to computation.

10 Why can accelerators deliver a good performance per watt ratio? 1. High (peak) performance: more transistors for computation, no control logic, small caches. 2. Low power consumption: many low-frequency cores (P ~ V^2 * f), no control logic. [Diagram: CPU vs. GPU dies; pie chart: power use for 1 TFlop/s of a usual system, split among compute, memory, communication, disk, power supply loss and heat removal.] Source: Andrey Semin (Intel): HPC systems energy efficiency optimization thru hardware-software co-design on Intel technologies, EnaHPC 2011; & S. Borkar, J. Gustafson

11 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

12 GPU architecture: Fermi. 3 billion transistors; streaming multiprocessors (SM), each comprising 32 cores; SM cores contain i.a. a floating point & integer unit; memory hierarchy. Peak performance (Tesla M2070): SP 1.03 TFlops, DP 515 GFlops; ECC support. Compute capability: 2.0; the compute capability defines features, e.g. double precision capability, memory access patterns. NVIDIA Corporation 2010

13 GPU architecture: Kepler. 7.1 billion transistors; streaming multiprocessors extreme (SMX), each comprising 192 cores; memory hierarchy. Peak performance (K20): SP 3.52 TFlops, DP 1.17 TFlops; ECC support. Compute capability: 3.5, e.g. dynamic parallelism = the possibility to dynamically launch new work from the GPU.

14 Comparison CPU GPU: weak memory model. Host + device memory are separate entities; there is no coherence between host + device; data transfers are needed. Host-directed execution model: 1. copy input data from CPU memory to device memory (via the PCI bus), 2. execute the device program, 3. copy results from device memory to CPU memory.

15 Programming model: definitions. Host: CPU, executes functions. Device: usually the GPU, executes kernels. The parallel portion of the application is executed on the device as a kernel; a kernel is executed as an array of threads; all threads execute the same code; threads are identified by IDs, which select input/output data and steer control decisions: float x = input[threadID]; float y = func(x); output[threadID] = y; With OpenACC, you don't have to bother with thread IDs. NVIDIA Corporation 2010

16 Programming model. Threads are grouped into blocks; blocks are grouped into a grid. Based on: NVIDIA Corporation 2010

17 Programming model: a kernel is executed as a grid of blocks of threads. The host launches kernels in order over time (Kernel 1, Kernel 2, ...) on the device. The dimensions of blocks and grids can be up to 3. [Diagram: grid of blocks, e.g. Block (0,0)..(1,3), each containing a 3D arrangement of threads, e.g. Thread (0,0,0)..(2,1,0).]

18 Programming model: why blocks? Cooperation of threads within a block is possible: synchronization, sharing data/results using shared memory. Scalability: fast communication between n threads is not feasible when n is large; no global synchronization on the GPU is possible (only by completing one kernel and starting another one from the CPU). But: blocks are executed independently; blocks can be distributed across an arbitrary number of multiprocessors, in any order, concurrently or sequentially.

19 Scalability. Assume 10 thread blocks: on a GPU with 2 SM(X), each multiprocessor works through several blocks in turn; on 8 SM(X), the blocks spread out further; on 20 SM(X), half of the multiprocessors stay idle. Because blocks are independent, the same kernel scales automatically across devices of different size.

20 Mapping to the execution model. Thread -> core: each thread is executed by a core. Block -> multiprocessor (with instruction cache, registers, hardware/software cache): each block is executed on a SM(X); several concurrent blocks can reside on a SM(X), depending on shared resources. Grid (kernel) -> device: each kernel is executed on a device.

21 Memory model. Thread: registers (Fermi: 63 per thread, K20: 255 per thread), local memory. Block: shared memory (Fermi: 64KB configurable on-chip as 16KB shared + 48KB L1 or 48KB shared + 16KB L1; K20: 64KB configurable on-chip as 16KB shared + 48KB L1, 48KB shared + 16KB L1, or 32KB shared + 32KB L1). Grid/application: constant memory (read-only, off-chip, cached); global memory (several GB, off-chip); Fermi/K20: L2 cache (Fermi: 768KB, K20: 1536KB). Caches: L1 per multiprocessor (Fermi: configurable 16/48KB; K20: configurable 16/32/48KB). [Diagram: device with multiprocessors 1..n, each with per-core registers, shared memory and L1, sharing an L2 cache and global/constant memory; the host with its host memory attached.]

22 Summary: execution model / logical hierarchy / memory model. Core - thread / vector - registers. Streaming multiprocessor (SM, with instruction cache, registers, hardware/software cache) - block of threads / gang (sync possible, shared memory) - shared memory, L1. Device (GPU) - grid (kernel) - L2, global memory; host (CPU, connected via PCIe) - host memory. The vector, worker, gang mapping is compiler dependent.

23 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

24 GPGPU programming approaches for an application: libraries (drop-in acceleration; examples: CUBLAS, CUSPARSE), directives (high productivity; examples: OpenACC, OpenMP) and programming languages (maximum flexibility; examples: CUDA, OpenCL). Based on: NVIDIA Corporation

25 GPGPU programming. CUDA (Compute Unified Device Architecture): C/C++ (NVIDIA): programming language for NVIDIA GPUs; Fortran (PGI): NVIDIA's CUDA for Fortran, NVIDIA GPUs. OpenCL C (Khronos Group): open standard, portable, CPU/GPU/... OpenACC C/C++, Fortran (PGI, Cray, CAPS, NVIDIA): directive-based accelerator programming, industry standard published in Nov. 2011. OpenMP C/C++, Fortran: directive-based programming for hosts and accelerators, standard, portable; version 4.0 released in July 2013, implementations coming soon.

26 Example SAXPY, CPU version. SAXPY = Single-precision real Alpha X Plus Y: y = a*x + y.

    void saxpyCPU(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    int main(int argc, const char* argv[]) {
        int n = 10240;
        float a = 2.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));
        // Initialize x, y
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 5.0*i - 1.0;
        }
        // Invoke serial SAXPY kernel
        saxpyCPU(n, a, x, y);
        free(x); free(y);
        return 0;
    }

27 Example SAXPY, OpenACC version.

    void saxpyOpenACC(int n, float a, float *x, float *y) {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    int main(int argc, const char* argv[]) {
        int n = 10240;
        float a = 2.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));
        // Initialize x, y
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 5.0*i - 1.0;
        }
        // Invoke SAXPY kernel (offloaded by OpenACC)
        saxpyOpenACC(n, a, x, y);
        free(x); free(y);
        return 0;
    }

28 Example SAXPY, CUDA version.

    __global__ void saxpy_parallel(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

    int main(int argc, char* argv[]) {
        int n = 10240;
        float a = 2.0f;
        float *h_x, *h_y;   // Pointers to CPU memory
        h_x = (float*) malloc(n * sizeof(float));
        h_y = (float*) malloc(n * sizeof(float));
        // Initialize h_x, h_y
        for (int i = 0; i < n; ++i) {
            h_x[i] = i;
            h_y[i] = 5.0*i - 1.0;
        }
        float *d_x, *d_y;   // Pointers to GPU memory
        // 1. Allocate data on GPU + transfer data to GPU
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
        // 2. Launch kernel: invoke parallel SAXPY
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);
        // 3. Transfer data to CPU + free data on GPU
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }

29 Motivation. Nowadays, GPU APIs (like CUDA, OpenCL) are often used; they may be difficult to program (but offer more flexibility) and are verbose/may complicate software design. A directive-based programming model delegates responsibility for low-level GPU programming tasks to the compiler: data movement, kernel execution, awareness of the particular GPU type. OpenACC or OpenMP 4.x device constructs: many tasks can be done by the compiler/runtime; user-directed programming.

30 Overview OpenACC. Open industry standard for portability; introduced by CAPS, Cray, NVIDIA, PGI (Nov. 2011). Supports C, C++ and Fortran; targets NVIDIA GPUs, AMD GPUs, x86 multicore, Intel MIC (now/near future). Timeline: Nov 11: Specification 1.0; Jun 13: Specification 2.0; Nov 14: proposal for complex data management; Nov 15: Specification 2.5. Here, PGI's OpenACC compiler is used for examples; details are compiler dependent.

31 Directives syntax. C: #pragma acc directive-name [clauses]. Fortran: !$acc directive-name [clauses]. Iterative development process; compiler feedback is helpful: whether an accelerator kernel could be generated, which loop schedule is used, where/which data is copied.

32 OpenACC - Compiling (PGI): pgcc -acc -ta=nvidia,cc20,cuda7.0 -Minfo=accel saxpy.c
    pgcc: the PGI C compiler (pgf90 for Fortran).
    -acc: tells the compiler to recognize OpenACC directives.
    -ta=nvidia: specifies the target architecture, here NVIDIA GPUs.
    cc20: optional; specifies the target compute capability 2.0.
    cuda7.0: optional; uses CUDA Toolkit 7.0 for code generation.
    -Minfo=accel: optional; compiler feedback for accelerator code.
The PGI tool pgaccelinfo prints the minimal needed compiler flags for the usage of OpenACC on the current hardware (see Development Tips).

33 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

34 Directives (in examples): SAXPY serial (host).

    int main(int argc, const char* argv[]) {
        int n = 10240;
        float a = 2.0f;
        float b = 3.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));
        // Initialize x, y

        // Run SAXPY TWICE
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }

        free(x); free(y);
        return 0;
    }

35 Directives (in examples): SAXPY with acc kernels ("magic" offload; implicit sync at the end of the region).

    // Run SAXPY TWICE
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }

PGI compiler feedback:

    main:
      48, Generating copyin(x[:10240])
          Generating copy(y[:10240])
      50, Loop is parallelizable          (kernel 1)
          Accelerator kernel generated
          Generating Tesla code
          50, #pragma acc loop gang, vector(128)
      56, Loop is parallelizable          (kernel 2)
          Accelerator kernel generated
          Generating Tesla code
          56, #pragma acc loop gang, vector(128)

36 Directives (in examples): SAXPY with two acc kernels regions (implicit sync after each).

    // Run SAXPY TWICE
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }
    // Modify y on host, do not touch x on host
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }

37 Directives (in examples): SAXPY with acc parallel, acc loop and data clauses (implicit sync after each region).

    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }

38 Directives (in examples): SAXPY with acc kernels, acc loop and data clauses (implicit sync after each region). Note: kernels != parallel; with kernels, the compiler can choose different optimizations.

    // Run SAXPY TWICE
    #pragma acc kernels copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc kernels copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }

39 Array shaping. Data clauses can be used on data, kernels or parallel constructs: copy, copyin, copyout, present, create, deviceptr. Since OpenACC 2.5: copy = present_or_copy = pcopy; copyin = present_or_copyin = pcopyin; copyout = present_or_copyout = pcopyout. The compiler sometimes cannot determine the size of arrays, so specify it explicitly using data clauses and array shapes:
    C/C++ ([lower bound : size]):          #pragma acc parallel copyin(a[0:length]) copyout(b[s/2:s/2])
    Fortran ([lower bound : upper bound]): !$acc parallel copyin(a(0:length-1)) copyout(b(s/2:s))

40 Directives: offload regions. An offload region maps to a CUDA kernel function.
    C/C++: #pragma acc parallel [clauses] / Fortran: !$acc parallel [clauses] ... !$acc end parallel
    The user is responsible for finding parallelism (loops); acc loop is needed for worksharing; no automatic sync between several loops.
    C/C++: #pragma acc kernels [clauses] / Fortran: !$acc kernels [clauses] ... !$acc end kernels
    The compiler is responsible for finding parallelism (loops); the acc loop directive is only needed for tuning; automatic sync between loops within a kernels region.

41 Directives: clauses for compute constructs (parallel, kernels) [C/C++, Fortran]:
    if(condition): if the condition is true, the acc version is executed.
    async[(int-expr)]: executes asynchronously, see the tuning slides.
    num_gangs(int-expr): defines the number of parallel gangs.
    num_workers(int-expr): defines the number of workers within a gang.
    vector_length(int-expr): defines the length for vector operations.
    reduction(op:list): reduction with op at the end of the region (parallel only).
    copy(list): H2D copy at region start + D2H copy at region end.
    copyin(list): only H2D copy at region start.
    copyout(list): only D2H copy at region end.
    create(list): allocates data on the device, no copy to/from the host.
    present(list): data is already on the device.
    present_or_*(list): tests whether data is on the device; if not, transfers it.
    deviceptr(list): see the tuning slides.
    private(list): a copy of each list item for each parallel gang (parallel only).
    firstprivate(list): as private + copy initialization from the host (parallel only).
Also: wait, device_type, default(none|present).

42 Directives: loops. Share the work of loops: loop work gets distributed among threads on the GPU (in a certain schedule).
    C/C++: #pragma acc loop [clauses] / Fortran: !$acc loop [clauses]
kernels loop defines the loop schedule by an int-expr in gang, worker or vector (similar to num_gangs etc. with parallel). Loop clauses [C/C++, Fortran]:
    gang: distributes work into thread blocks.
    worker: distributes work into warps.
    vector: distributes work into threads within a warp/thread block.
    seq: executes the loop sequentially on the device.
    collapse(n): collapses n tightly nested loops.
    independent: asserts independent loop iterations (kernels loop only).
    reduction(op:list): reduction with op.
    private(list): private copy for each loop iteration.
OpenACC 2.0 also: auto, tile, device_type.

43 Misc on loops. Combined directives:

    #pragma acc kernels loop
    for (int i = 0; i < n; ++i) { /*...*/ }

    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) { /*...*/ }

Reductions:

    #pragma acc parallel
    #pragma acc loop reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += i;
    }

Also possible: *, max, min, &, |, ^, &&, ||. The PGI compiler can often recognize reductions on its own; see the compiler feedback, e.g.: Sum reduction generated for var.
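A minimal, self-contained sketch of a reduction in practice, a dot product (an assumed example, not from the slides):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int n = 10240;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // Each gang/vector lane accumulates a partial sum; the runtime
        // combines them at the end of the region.
        #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
        for (int i = 0; i < n; ++i) {
            sum += x[i] * y[i];
        }

        printf("dot = %f\n", sum);  // expected: 20480.0
        free(x); free(y);
        return 0;
    }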

44 Procedure calls. Procedure calls within a parallel region require the routine directive: compile for accelerator + host; clauses denote the level of parallelism.

    #pragma acc routine seq
    float saxpy(float a, float x, float y) {
        return (a*x + y);
    }
    // [..]
    #pragma acc parallel
    #pragma acc loop gang vector
    for (int i = 0; i < n; ++i) {
        // call on the device
        y[i] = saxpy(a, x[i], y[i]);
    }

See the section on loop schedules for an explanation of gang, vector, seq. With OpenACC 1.0 (!) routines must be inlined (no routine support).

45 Directives: procedure calls. The routine construct tells the compiler to compile the given procedure for an accelerator & the host. Either use the name form or put the construct directly at the procedure.
    C/C++: #pragma acc routine clause-list / #pragma acc routine (name) clause-list
    Fortran: !$acc routine clause-list / !$acc routine (name) clause-list
A clause must be specified [C/C++, Fortran]:
    gang: (may) contain/call gang parallelism, executed in gang-redundant mode.
    worker: (may) contain/call worker parallelism (no gang), executed in worker-single mode.
    vector: (may) contain/call vector parallelism (no worker/gang), executed in vector-single mode.
    seq: does not contain/call gang, worker, vector parallelism.
    bind(name): specifies a name for compiling/calling (as if specified in the language being compiled).
    bind(string): specifies a name for compiling/calling (the string is used for the name).
    nohost: no compilation for the host; must be called from within acc compute regions.
    device_type (dtype): device-specific clause tuning.

46 Summary. Offloading work: kernels ("magic") vs. parallel region & worksharing construct (loop). Easy reductions via the reduction clause. Procedure calls with the routine construct. Moving data with data clauses (copy, copyin, copyout) and array shaping. Compiler feedback: what is done implicitly? How are the explicit directives understood?

47 Development tips. Favorable for parallelization: (nested) loops with large loop counts (to compensate at least the data movement overhead) and independent loop iterations. Think about data availability on host/GPU: use data regions to avoid unnecessary data transfers; specify data array shapes (may adjust for alignment). Verify execution on the GPU: see the PGI compiler feedback; the term "Accelerator kernel generated" is needed. See information on the runtime (more details in the hands-on session): export ACC_NOTIFY=3

48 Development tips. Conditional compilation for OpenACC: macro _OPENACC. Using pointers (C/C++): problem: the compiler cannot determine whether loop iterations that access pointers are independent, so no parallelization is possible. Solutions (if the independence is actually there): use the restrict keyword: float *restrict ptr; use a compiler flag: -ansi-alias (PGI); use the OpenACC loop clause independent; prefer array notation over pointer arithmetic: array[2] instead of *(array+2). A sketch combining these tips follows below.
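A minimal sketch (an assumed example) of conditional compilation via _OPENACC and restrict-qualified pointers:

    #include <stdio.h>

    void scale(int n, float a, float *restrict x, float *restrict y) {
        // restrict asserts that x and y do not alias -> loop is parallelizable
        #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }

    int main(void) {
    #ifdef _OPENACC
        printf("Compiled with OpenACC support (version %d)\n", _OPENACC);
    #else
        printf("Compiled without OpenACC support\n");
    #endif
        float x[4] = {1, 2, 3, 4}, y[4];
        scale(4, 2.0f, x, y);
        printf("y[3] = %f\n", y[3]);
        return 0;
    }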

49 Development tips. Runtime measurements: the GPU needs time to initialize (up to a couple of seconds, depending on the GPU); this might get included in runtime measurements & influence short runtimes. Exclude GPU initialization from measurements by calling acc_init(acc_device_nvidia) or using the corresponding #pragma acc init directive (a sketch follows below). Get a simple profile (that also shows the init time) via the compiler flag -ta=nvidia,time or the environment variable PGI_ACC_TIME=1. Setup information (GPU type, hardware details and configuration): NVIDIA: download the CUDA SDK, run devicequery; PGI: pgaccelinfo (also specifies the needed compiler flags for OpenACC).
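A minimal sketch (an assumed example; walltime() is a hypothetical helper) of excluding device initialization from a measurement:

    #include <openacc.h>
    #include <stdio.h>
    #include <sys/time.h>

    static double walltime(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void) {
        acc_init(acc_device_nvidia);   // pay the device init cost up front
        double t0 = walltime();
        float sum = 0.0f;
        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < 1000000; ++i)
            sum += i * 0.5f;
        double t1 = walltime();
        printf("kernel time: %f s (sum = %f)\n", t1 - t0, sum);
        return 0;
    }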

50 Hands-On. Introduction to the RWTH GPU cluster environment. Write your first OpenACC program (Jacobi solver): use the parallel and loop directives; follow the TODOs in GPU/exercises/task1. Optional (only if you have completed task 2.2): read slide 104 (pinned memory) and apply it to your program from task 1. Why does this give a performance improvement?

51 The RWTH GPU Cluster: 57 NVIDIA Quadro 6000 (Fermi) GPUs. High utilization of resources: daytime: VR: new CAVE (49 GPUs), HPC: interactive software development (8 GPUs); nighttime: HPC: processing of GPGPU compute jobs (55-57 GPUs). aixCAVE, VR, RWTH Aachen, since June 2012. Photo: C. Iwainsky

52 Hardware & software stack. 24 rendering nodes, each with 2x NVIDIA Quadro 6000 GPUs (Fermi) and 2x Intel Xeon X5650 @ 2.67 GHz (Westmere, 12 cores per node). GPU cluster software environment: CUDA Toolkit 7.0: module load cuda, directory $CUDA_ROOT; PGI compiler (OpenACC): module load pgi (module switch intel pgi); TotalView (debugging): module load totalview.

53 Lab: login to GPU machines. User name (see badge): hpclab<xy>; password (see "Access to Lab Machine"); GPU node (see paper snippet): linuxgpus<ab>. These nodes are available in interactive mode only for the workshop! Jump from the frontend node to the GPU node: ssh -Y linuxgpus<ab>. Remark: the GPUs are set to exclusive mode (per process), so only one person can access a GPU at a time; if it is occupied, you get e.g. the message "all CUDA-capable devices are busy or unavailable". Set export CUDA_VISIBLE_DEVICES=<no> to use a certain GPU; see your printout for the correct number (either 0 or 1).

54 See what is running: nvidia-smi.

    $> nvidia-smi
    Mon Mar 7 10:29 2016
    +--------------------------------------------------------------------+
    | GPU  Name         Bus-Id        Temp   Memory-Usage      Compute M.|
    |  0   Quadro 6000  0000:02:00.0  63C     10MiB / 5375MiB  E. Process|
    |  1   Quadro 6000  0000:85:00.0  80C    216MiB / 5375MiB  E. Process|
    +--------------------------------------------------------------------+
    | Processes:  GPU  Type  Process name              GPU Memory Usage  |
    |              1    C    ./jacobi                  155MiB            |
    +--------------------------------------------------------------------+

The header shows the GPU ID + type; the process table shows what is running on each GPU (here: ./jacobi on GPU 1). nvidia-smi -q lists further GPU details.

55 Jacobi solver. Solve the Laplace equation (2D), nabla^2 A(x,y) = 0, by the Jacobi iterative method. Start with a guess of the objective function A(x,y). Use a finite difference discretization (5-point stencil over A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)):

    A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4

Iterate over the inner elements of the 2D grid. Repeat until the maximum number of iterations is reached or the computed approximation is close to the solution (biggest change on any matrix element < tolerance value). A minimal OpenACC sketch of this stencil follows below.
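A minimal sketch of the stencil above with OpenACC (assumptions: float arrays A and Anew of size n*n are already allocated and initialized, and a fixed iteration count iter_max replaces the tolerance check):

    // The data region keeps A resident on the GPU across all iterations;
    // Anew only ever lives in device memory.
    #pragma acc data copy(A[0:n*n]) create(Anew[0:n*n])
    for (int iter = 0; iter < iter_max; ++iter) {
        #pragma acc parallel loop
        for (int i = 1; i < n-1; ++i) {
            #pragma acc loop
            for (int j = 1; j < n-1; ++j) {
                // 5-point stencil: average of the four neighbors
                Anew[i*n+j] = 0.25f * (A[(i-1)*n+j] + A[(i+1)*n+j]
                                     + A[i*n+j-1]   + A[i*n+j+1]);
            }
        }
        #pragma acc parallel loop   // copy back into A for the next iteration
        for (int i = 1; i < n-1; ++i)
            for (int j = 1; j < n-1; ++j)
                A[i*n+j] = Anew[i*n+j];
    }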

56 Lab results. Hardware: 2x Intel Xeon @ 2.67 GHz (Westmere, 12 cores); NVIDIA Quadro 6000 (Fermi, cc 2.0).

    Version                                       Runtime [s]
    OpenMP                                        2.11
    OpenACC-Offload
    OpenACC-Data
    OpenACC-Hetero (1 GPU + 12 OMP threads)
    OpenACC-MultiGPU (2 GPUs + 12 OMP threads)

What went wrong?

57 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

58 Directives (in examples): SAXPY with acc parallel, acc loop, data clauses, as on slide 37:

    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }

What went wrong?

59 Analysis with the NVIDIA Visual Profiler. Comes with the NVIDIA Toolkit; call: nvvp; also includes automatic application analysis (utilization, ...). [Timeline screenshot: data movement between host & device dominates; the 1st and 2nd SAXPY executions appear as small kernels between large transfers.]

60 Directives (in examples): SAXPY with acc parallel, acc loop, data clauses, and the resulting transfers over the PCI bus between CPU memory and GPU memory:

    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])   // move x, y to device
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }                                                  // move y to host
    // Modify y on host, do not touch x on host
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])   // move x, y to device (again!)
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }                                                  // move y to host

61 Directives (in examples): SAXPY with acc data; x remains on the GPU until the end of the data region.

    #pragma acc data copyin(x[0:n])
    {
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        // Modify y on host, do not touch x on host
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }

PGI compiler feedback:

    main:
      46, Generating copyin(x[:n])
      48, Generating copy(y[:n])
          Generating present(x[:n])
          Accelerator kernel generated
          Generating Tesla code
          50, #pragma acc loop gang, vector(128)
      54, Generating copy(y[:n])
          Generating present(x[:n])
          Accelerator kernel generated
          Generating Tesla code
          56, #pragma acc loop gang, vector(128)

62 Directives: data regions & clauses. A data region decouples data movement from offload regions.
    C/C++: #pragma acc data [clauses] / Fortran: !$acc data [clauses] ... !$acc end data
Data clauses trigger data movement of the denoted arrays [C/C++, Fortran]:
    if(cond): if cond is true, move data to the accelerator.
    copy(list): H2D copy at region start + D2H copy at region end.
    copyin(list): only H2D copy at region start.
    copyout(list): only D2H copy at region end.
    create(list): allocates data on the device, no copy to/from the host.
    present(list): data is already on the device.
    present_or_*(list): tests whether data is on the device; if not, transfers it.
    deviceptr(list): see the tuning slides.

63 Data movement. Data clauses can be used on data, kernels or parallel: copy, copyin, copyout, present, create, deviceptr. Since OpenACC 2.5: copy = present_or_copy = pcopy; copyin = present_or_copyin = pcopyin; copyout = present_or_copyout = pcopyout. That is, copy (also copyin, copyout) first checks whether the data is already on the device and only copies if it is not. create does not copy any data, it just creates it on the device. deviceptr: see the section on interoperability with CUDA & GPU libraries. A sketch of pcopy across a function boundary follows below.
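A minimal sketch (an assumed example) of present_or_copy across a function boundary:

    #include <stdlib.h>

    void scale(float *x, int n, float a) {
        // pcopy: transfers x only if it is not already present on the device
        #pragma acc parallel loop pcopy(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] *= a;
    }

    int main(void) {
        int n = 10240;
        float *x = (float*) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = i;

        // Standalone call: pcopy copies x in and out.
        scale(x, n, 2.0f);

        // Inside a data region: x stays resident, the calls trigger no transfers.
        #pragma acc data copy(x[0:n])
        {
            scale(x, n, 2.0f);
            scale(x, n, 0.5f);
        }
        free(x);
        return 0;
    }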

64 Directives (in examples): acc update.

    #pragma acc data copy(x[0:n])
    {
        for (t = 0; t < T; t++) {
            // Modify x on device (e.g. in a subroutine, w/o data copying)
            #pragma acc update host(x[0:n])
            // Modify x on host
            #pragma acc update device(x[0:n])
        }
    }

65 Directives: update. The update executable directive moves data from GPU to host, or from host to GPU; it is used to update existing data after it has changed in its corresponding copy.
    C/C++: #pragma acc update host|device [clauses] / Fortran: !$acc update host|device [clauses]
Data movement can be conditional or asynchronous. Update clauses [C/C++, Fortran]:
    host|self(list): list variables are copied from the accelerator to the host.
    device(list): list variables are copied from the host to the accelerator.
    if(cond): if cond is true, move the data.
    if_present: issues no error when data is not present on the device.
    async[(int-expr)]: executes asynchronously, see the tuning slides.
    wait[(int-expr)]: the data movement waits on actions in the queue to finish.
    device_type(list): set a certain device type for the update.

66 Unstructured data lifetime. A structured data lifetime is applied to a region; an unstructured data lifetime uses enter data (allocates (+ copies) data to device memory) and exit data (deallocates (+ copies) data from device memory).

Structured:

    #pragma acc data copyin(x[0:n]) create(y[0:n])
    {
        // data lifetime
    }

Unstructured (also possible: copyout):

    class Matrix {
        Matrix() {
            v = new double[n];
            #pragma acc enter data copyin(this)
            #pragma acc enter data create(v[0:n])
        }
        ~Matrix() {
            #pragma acc exit data delete(v[0:n], this)
            delete[] v;
        }
    private:
        double* v;
    };

67 Directives: unstructured data lifetime. The enter data construct allocates (& copies) scalars and (sub-)arrays in(to) device memory; they remain there until the end of the program or a corresponding exit data call.
    C/C++: #pragma acc enter data clause-list / Fortran: !$acc enter data clause-list
    Specific clauses for copy or just creation: copyin(list), create(list).
The exit data construct (copies data to host memory &) deletes data from device memory.
    C/C++: #pragma acc exit data clause-list / Fortran: !$acc exit data clause-list
    Specific clauses for copy or deletion: copyout(list), delete(list), finalize.
Clauses valid for both: if, async, wait.

68 Deep copies. OpenACC supports a flat object model: primitive types and composite types w/o allocatable/pointer members:

    struct {
        int x[2];  // static size 2
    } *A;          // dynamic size 2
    #pragma acc data copy(A[0:2])

Pointer indirection brings challenges: non-contiguous transfers and pointer translation:

    struct {
        int *x;    // dynamic size 2
    } *A;          // dynamic size 2
    #pragma acc data copy(A[0:2])

[Diagram: a shallow copy transfers A[0], A[1] to device memory, but dA[0].x and dA[1].x still hold host addresses; a deep copy must also transfer the pointed-to arrays and translate the pointers.] A manual deep copy is possible with the data API, but usually not practical. An OpenACC proposal for complex data management exists since 11/2014. Courtesy of J.C. Beyer, Cray Inc.

69 Directives: data API (1/2).
    d_void* acc_malloc(size_t): allocates device memory.
    void acc_free(d_void*): frees device memory.
    void* acc_copyin(h_void*, size_t): allocates device memory & copies data to the device.
    void* acc_present_or_copyin(h_void*, size_t): tests whether data is present on the device; if not, allocates & copies data.
    void* acc_create(h_void*, size_t): allocates device memory corresponding to host data.
    void* acc_present_or_create(h_void*, size_t): tests whether data is present on the device; if not, allocates device memory corresponding to host data.
    void acc_copyout(h_void*, size_t): copies data to the host & deallocates device memory.
    void acc_delete(h_void*, size_t): deallocates device memory corresponding to host data.

70 Directives: data API (2/2).
    void acc_update_device(h_void*, size_t) and acc_update_self: update the device/host data copy corresponding to the other side.
    void acc_memcpy_to_device(d_void* dest, h_void* src, size_t): copies data from host to device memory.
    void acc_memcpy_from_device(h_void* dest, d_void* src, size_t): copies data from device to host memory.
    int acc_is_present(h_void*, size_t): tests whether data is present on the device.
    void acc_map_data(h_void*, d_void*, size_t): maps previously allocated device data to the specified host data.
    void acc_unmap_data(h_void*): unmaps device data from the specified host data.
    d_void* acc_deviceptr(h_void*): returns the device pointer associated with a specific host address.
    h_void* acc_hostptr(d_void*): returns the host pointer associated with a specific device address.
A usage sketch follows below.
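A minimal sketch (an assumed example) of managing device memory with the data API instead of data clauses:

    #include <openacc.h>

    void process(float *x, int n) {
        acc_copyin(x, n * sizeof(float));           // allocate + H2D copy

        #pragma acc parallel loop present(x[0:n])   // runtime finds the device copy
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0f;

        acc_copyout(x, n * sizeof(float));          // D2H copy + deallocate
    }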

71 Summary. Optimize code with explicit data transfers: structured: data region; unstructured: enter data, exit data; within a data environment: update; clauses: copy, copyin, copyout, ... The data API covers deep copies (and more). Compiler and runtime feedback: where is the data moved? How much time is spent on data movement? Use the NVIDIA Visual Profiler with OpenACC.

72 Hands-On. Data motion optimization of your first OpenACC program: use structured data regions to minimize data movement across the PCI bus; use parallel and loop directives on all loops that read or write the data within the data regions. Follow the TODOs in GPU/exercises/task2.

73 Lab results. Hardware: 2x Intel Xeon @ 2.67 GHz (Westmere, 12 cores); NVIDIA Quadro 6000 (Fermi, cc 2.0).

    Version                                       Runtime [s]
    OpenMP                                        2.11
    OpenACC-Offload
    OpenACC-Data                                  1.27
    OpenACC-Hetero (1 GPU + 12 OMP threads)
    OpenACC-MultiGPU (2 GPUs + 12 OMP threads)

74 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

75 Heterogeneous computing: CPU & GPU are (fully) utilized. Challenge: load balancing. Domain decomposition: if the load is known beforehand, use a static decomposition; exchange data if needed (e.g. halos). [Example: a matrix-vector multiplication split row-wise between GPU and CPU.]

76 Asynchronous operations. Definition: synchronous: control does not return until the accelerator action is complete; asynchronous: control returns immediately, which allows heterogeneous computing (CPU + GPU). Default: synchronous operations. Asynchronous operations via the async clause: kernel execution (kernels, parallel) and data movement (update, enter data, exit data). The wait directive can also take an async clause.

    #pragma acc kernels async
    for (int i = 0; i < n; i++) { /* ... offloaded work ... */ }
    // do work on host (overlaps with the GPU kernel)

77 Directives: async clause. Executes parallel, kernels, update, enter data, exit data, wait operations asynchronously while the host process continues with the code.
    C/C++: #pragma acc <op-see-above> async[(scalar-int-expr)]
    Fortran: !$acc <op-see-above> async[(scalar-int-expr)]
The optional integer argument can be seen as a CUDA stream number and can be used in a wait directive. Async activities with the same argument are executed in order; async activities with different arguments are executed in any order relative to each other. OpenACC 2.0: the int argument can be acc_async_sync (synchronous execution) or acc_async_noval (same stream).

78 Streams. Streams are a CUDA concept: a stream is a sequence of operations that execute in issue-order on the GPU. The same int-expr in the async clause means one stream; different streams/int-exprs execute in any order relative to each other. Tips for debugging: disable async via the environment variable ACC_SYNCHRONOUS=1 or use the pre-defined int-expr acc_async_sync.

79 Asynchronous operations & wait. Synchronize async operations with the wait directive: wait for the completion of an asynchronous activity (all or a certain stream).

    #pragma acc parallel loop async(1)
    // kernel A
    #pragma acc parallel loop async(2)
    // kernel B
    #pragma acc wait(1,2) async(3)
    #pragma acc parallel loop async(3)  // wait(1,2) as a clause is an
    // kernel C                         // alternative to the wait directive
    #pragma acc parallel loop async(4) wait(3)
    // kernel D
    #pragma acc parallel loop async(5) wait(3)
    // kernel E
    #pragma acc wait(1)
    // kernel F, on the host

[Dependency graph: A and B run first; C waits for A and B; D and E wait for C; F runs on the host after stream 1 has finished.] Courtesy of J.C. Beyer, Cray Inc.

80 Directives: wait. The host thread waits for the completion of an asynchronous activity. If an integer expression is specified, it waits for all async activities with the same value; otherwise, the host waits until all async activities have completed.
    C/C++: #pragma acc wait[(scalar-int-expr)] / Fortran: !$acc wait[(scalar-int-expr)]
OpenACC 2.0: a wait clause is also available (parallel, kernels, update, enter data, exit data). Runtime routines:
    int acc_async_test(int): tests for completion of all associated async activities; returns a nonzero value/.true. if all have completed.
    int acc_async_test_all(): tests for completion of all async activities.
    void acc_async_wait(int): waits for completion of all associated async activities; the routine does not return until the latest async activity has completed.
    void acc_async_wait_all(): waits for completion of all async activities.

81 Overlapping data transfers and computations. Another feature of streams: they allow the overlap of tasks on the GPU: PCIe transfers in both directions plus multiple kernels (up to 16 with Kepler). [Diagram: without streams, H2D, kernel and D2H run strictly one after another; with two streams, the H2D/kernel/D2H sequences of both halves overlap, giving a potential runtime improvement.]

    #pragma acc data create(a[0:n])
    {
        #pragma acc update device(a[0:n/2])   async(1)
        #pragma acc update device(a[n/2:n/2]) async(2)
        #pragma acc parallel loop async(1)
        // compute on a[0:n/2]
        #pragma acc parallel loop async(2)
        // compute on a[n/2:n/2]
        #pragma acc update host(a[0:n/2])     async(1)
        #pragma acc update host(a[n/2:n/2])   async(2)
    }

82 Summary. OpenACC default: synchronous operations; the CPU thread waits until the OpenACC kernel/data movement is completed. Asynchronous operations enable heterogeneous computing (utilizing CPU + GPU): the async clause returns control directly to the CPU thread; kernels (and data transfers) can overlap. Specify streams by using int expressions in the async clause. Synchronization is possible with the wait directive.

83 Hands-On. Heterogeneous computing on CPU & GPU: decomposition of the matrix (what is the best distribution?); use async & wait; make sure reduction variables are copied back at the right point; use OpenMP on the host. Follow the TODOs in GPU/exercises/task3.

84 Heterogeneous computing: domain decomposition. The matrix rows 0..n are split into a GPU part and a CPU part; a halo exchange between the two parts is needed in each iteration. A minimal sketch follows below.
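A minimal sketch of this split for the Jacobi stencil (assumptions: the first gpu_rows rows are computed on the GPU, the rest on the CPU with OpenMP; A and Anew are resident on the device via an enclosing data region; only the boundary rows at the split are exchanged):

    #pragma acc parallel loop async(1) present(A[0:n*n], Anew[0:n*n])
    for (int i = 1; i < gpu_rows; ++i)            // GPU part, asynchronous
        for (int j = 1; j < n-1; ++j)
            Anew[i*n+j] = 0.25f * (A[(i-1)*n+j] + A[(i+1)*n+j]
                                 + A[i*n+j-1]   + A[i*n+j+1]);

    #pragma omp parallel for                      // CPU part runs concurrently
    for (int i = gpu_rows; i < n-1; ++i)
        for (int j = 1; j < n-1; ++j)
            Anew[i*n+j] = 0.25f * (A[(i-1)*n+j] + A[(i+1)*n+j]
                                 + A[i*n+j-1]   + A[i*n+j+1]);

    #pragma acc wait(1)                           // GPU done before the halo exchange
    #pragma acc update host(Anew[(gpu_rows-1)*n : n])   // GPU boundary row -> CPU
    #pragma acc update device(Anew[gpu_rows*n : n])     // CPU boundary row -> GPU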

85 Lab results. Hardware: 2x Intel Xeon @ 2.67 GHz (Westmere, 12 cores); NVIDIA Quadro 6000 (Fermi, cc 2.0).

    Version                                       Runtime [s]
    OpenMP                                        2.11
    OpenACC-Offload
    OpenACC-Data                                  1.27
    OpenACC-Hetero (1 GPU + 12 OMP threads)
    OpenACC-MultiGPU (2 GPUs + 12 OMP threads)

86 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

87 Interoperability with libraries and CUDA. An application can mix all three levels: libraries (drop-in acceleration, e.g. CUSPARSE, CUBLAS), OpenACC directives (high productivity) and a low-level language (maximum flexibility, e.g. CUDA). Based on: NVIDIA Corporation

88 Interoperability. CUDA and OpenACC both operate on device memory, which enables interoperability with (CUDA) libraries and with custom (CUDA) C/C++/Fortran code.

Make a device address available on the host (for a library or custom CUDA code):

    #pragma acc data copy(x[0:n])
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; i++) {
            // work on x[i]
        }
        #pragma acc host_data use_device(x)
        {
            func(x);  // called with the device pointer of x
        }
    }

Use an existing device pointer in OpenACC code (assume x is a pointer to device memory):

    cudaMalloc(&x, sizeof(float)*n);
    // use device pointer x: for a library, for custom CUDA code
    #pragma acc data deviceptr(x)
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; i++) {
            // work on x[i]
        }
    }
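As a concrete illustration, a minimal sketch (an assumed example, not from the slides) that hands OpenACC-managed device data to the cuBLAS library via host_data use_device:

    #include <cublas_v2.h>

    void saxpy_cublas(int n, float a, float *x, float *y) {
        cublasHandle_t handle;
        cublasCreate(&handle);

        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc host_data use_device(x, y)
            {
                // inside this block, x and y refer to device addresses
                cublasSaxpy(handle, n, &a, x, 1, y, 1);
            }
        }
        cublasDestroy(handle);
    }

Link against cuBLAS (e.g. -lcublas) in addition to the usual OpenACC flags.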

89 Directives: deviceptr & host_data. The deviceptr clause declares that the pointers in list refer to device pointers that need not be allocated/moved between host and device.
    C/C++: #pragma acc parallel|kernels|data deviceptr(list) / Fortran: !$acc data deviceptr(list)
The host_data construct makes the address of device data in list available on the host; the variables in list must be present in device memory; use_device is the only valid clause.
    C/C++: #pragma acc host_data use_device(list)
    Fortran: !$acc host_data use_device(list) ... !$acc end host_data

90 Summary. OpenACC is interoperable with CUDA and GPU libraries: host_data makes a device address available on the host; deviceptr makes a device address usable in OpenACC code.

91 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

92 Execution model. Threads execute as groups of 32 threads (warps); threads in a warp share the same program counter: SIMT architecture (SIMT = single instruction, multiple threads). A possible mapping to CUDA terminology (GPUs), compiler dependent: gang = block (executed on a multiprocessor); worker = warp; vector = threads (within a block if omitting worker, within a warp if specifying worker); grid (kernel) = device.

93 Loop schedule (in examples).

    #pragma acc data copyin(x[0:n])
    {
        #pragma acc kernels copy(y[0:n]) present(x[0:n])
        #pragma acc loop gang vector(256)
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        // Modify y on host, do not touch x on host
        #pragma acc kernels copy(y[0:n]) present(x[0:n])
        #pragma acc loop gang(64) vector
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }

Compiler feedback without a loop schedule:

    33, Loop is parallelizable / Accelerator kernel generated / #pragma acc loop gang, vector(128)
    39, Loop is parallelizable / Accelerator kernel generated / #pragma acc loop gang, vector(128)

With the loop schedule above:

    33, Loop is parallelizable / Accelerator kernel generated / #pragma acc loop gang, vector(256)
    39, Loop is parallelizable / Accelerator kernel generated / #pragma acc loop gang(64), vector(128)

94 Loop schedule (in examples): vector_length and num_gangs can both be used with parallel or kernels. vector_length specifies the number of threads in a thread block; num_gangs specifies the number of thread blocks in the grid.

    #pragma acc data copyin(x[0:n])
    {
        #pragma acc parallel copy(y[0:n]) present(x[0:n]) vector_length(256)
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        // Modify y on host, do not touch x on host
        #pragma acc parallel copy(y[0:n]) present(x[0:n]) num_gangs(64)
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
    }

95 Loop schedule (in examples). Note: loop schedules are implementation dependent; this representation is one (common) example.

    int main(int argc, const char* argv[]) {
        int n = 256;
        int blocks = 4;
        int bsize = 32;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));
        // Define scalars a, b & initialize x, y

        // Run SAXPY TWICE
        #pragma acc data copyin(x[0:n])
        {
            #pragma acc parallel copy(y[0:n]) present(x[0:n]) \
                        num_gangs(blocks) vector_length(bsize)
            {
                // Statements here run gang-redundantly: all gangs do the
                // same, e.g. int tmp = 5;
                #pragma acc loop gang vector
                // gang: workshare across multiprocessors (w/o barrier);
                // vector: workshare across threads (w/ barrier)
                for (int i = 0; i < n; ++i) {
                    y[i] = a*x[i] + y[i];
                }
            }
            // Change y on host & do second SAXPY on device
        }
        free(x); free(y);
        return 0;
    }

96 Loop schedule (in examples).

    #pragma acc parallel vector_length(256)
    #pragma acc loop gang vector
    for (int i = 0; i < n; ++i) {
        // do something
    }

Distributes the loop to n threads on the GPU: 256 threads per thread block, usually ceil(n/256) blocks in the grid.

    #pragma acc parallel vector_length(256)
    #pragma acc loop gang vector
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            // do something
        }
    }

Distributes the outer loop to n threads on the GPU; each thread executes the inner loop sequentially: 256 threads per thread block, usually ceil(n/256) blocks in the grid.

    #pragma acc parallel vector_length(256) num_gangs(16)
    #pragma acc loop gang vector
    for (int i = 0; i < n; ++i) {
        // do something
    }

Distributes the loop to threads on the GPU (see above); if 16*256 < n, each thread gets multiple elements: 256 threads per thread block, 16 blocks in the grid.

97 Loop schedule (in examples).

    #pragma acc parallel vector_length(256)
    #pragma acc loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < m; ++j) {
            // do something
        }
    }

Distributes the outer loop to the GPU multiprocessors (block-wise) and the inner loop to the threads within the thread blocks: 256 threads per thread block, usually n blocks in the grid.

    #pragma acc kernels
    #pragma acc loop gang(100) vector(8)
    for (int i = 0; i < n; ++i) {
        #pragma acc loop gang(200) vector(32)
        for (int j = 0; j < m; ++j) {
            // do something
        }
    }

With nested loops, the specification of multidimensional blocks and grids is possible (the same resources serve the outer and inner loop): 100 blocks in Y-direction (rows), 200 blocks in X-direction (columns); 8 threads in the Y-dimension and 32 threads in the X-dimension of one block. Total: 8*32 = 256 threads per thread block; 100*200 = 20,000 blocks in the grid. Multidimensional partitioning with the parallel construct is only possible in v2.0; see the tile clause.

98 Launch configuration. How many vectors/gangs to launch? The OpenACC runtime tries to find good values automatically, which makes performance portability across different device types possible; your own tuning may deliver better performance on a certain hardware. Hardware operation: instructions are issued in order; thread execution stalls when one of the operands isn't ready; latency is hidden by switching threads. GMEM latency: 400-800 cycles; arithmetic latency: ~18-24 cycles. You need enough threads to hide latency.

99 Launch configuration: hiding arithmetic latency. Need ~18 warps (576 threads) per Fermi multiprocessor, or independent instructions from the same warp:

    x = a + b;  // ~18 cycles
    y = a + c;  // independent, can start any time
    // stall
    z = x + d;  // dependent, must wait for completion

Maximizing global memory throughput: gmem throughput depends on the access pattern and word size. You need enough memory transactions in flight to saturate the bus: independent loads & stores from the same thread (multiple elements), loads and stores from different threads (many threads); larger word sizes can also help ("vectors"). E.g.:

    float a0 = src[id];               // no latency stall
    float a1 = src[id + blockDim.x];  // stall
    dst[id] = a0;
    dst[id + blockDim.x] = a1;
    // Note: threads don't stall on a memory access itself

Source: Volkov 2010

100 Launch configuration: maximizing global memory throughput. Example program: increment an array of ~67M elements. [Chart: achieved GB/s vs. threads/block (32-bit words) on an NVIDIA Tesla C2050 (Fermi), ECC off, peak bandwidth 144 GB/s; the impact of the launch configuration is clearly visible.]

101 Launch configuration. Vectors per gang: use a multiple of the warp size (32); threads are always spanned in warps onto the resources, and unused threads are marked as inactive; starting point: 128-256 vectors/gang. Blocks-per-grid heuristics: launch MANY threads to keep the GPU busy! #blocks > #MPs, so all multiprocessors have at least one block to execute; #blocks / #MPs > 2, since an MP can concurrently execute up to 8 blocks and blocks that aren't waiting at a barrier keep the hardware busy (subject to resource availability: registers, smem); #blocks > 50 to scale to future devices. Most obvious: #blocks * #threads = #problem-size; multiple elements per thread may amortize the setup costs of simple kernels: #blocks * #threads < #problem-size.

102 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

103 Recap: memory hierarchy. Local memory/registers. Shared memory/L1: very low latency (~100x lower than gmem); aggregate bandwidth: 1+ TB/s (Fermi), 2.5 TB/s (Kepler). L2. Global memory: high latency (400-800 cycles); bandwidth: 144 GB/s (Fermi), 250 GB/s (Kepler). [Diagram: device with multiprocessors 1..n (per-core registers, shared memory, L1), shared L2 and global/constant memory; host memory on the host.]

104 Pinned memory. CPU default: data allocations are pageable (use of virtual memory: physical memory pages can be swapped out to disk; when the host needs a page, it loads it back in from disk). No direct access to pageable memory by the GPU is possible, so: 1. the CUDA driver allocates a temporary page-locked ("pinned") host array, 2. copies the host data to the pinned array (PCIe transfers occur only using DMA), 3. transfers the data from the pinned array to the device, 4. deletes the pinned memory after transfer completion. Pinned memory = page-locked memory: prevents the memory from being swapped out to disk; avoids the cost of transferring between pageable and pinned host arrays; improves transfer speeds; but: reduces the amount of physical memory available to the OS & other programs. With PGI it's just a compiler flag, try it out: pgcc -ta=nvidia:pinned saxpy.c

105 Gmem access. Stores: invalidate L1, write-back for L2. Loads: caching (default on Fermi) vs. non-caching (enabled by compiler options, e.g. PGI's loadcache). Caching loads attempt to hit L1, then L2, then gmem, with a load granularity of a 128-byte line; non-caching loads attempt to hit L2, then gmem (no L1; the line is invalidated if it's already in L1), with a 32-byte segment granularity. Threads in a warp provide memory addresses; the hardware determines which lines/segments are needed and requests them. Kepler uses the L1 cache only for thread-private data. [Diagram: gmem access per warp; 32 threads t0..t31 accessing 32-bit words.]

106 Gmem throughput: coalescing. The range of addresses accessed by a warp determines the throughput: access is coalesced when the warp's addresses pack into as few cache lines/segments as possible; address alignment also matters, since a misaligned warp touches an extra line. [Diagram: caching loads; warps whose addresses fall within one aligned 128-byte line are coalesced, scattered or offset addresses require additional transactions.] A small sketch follows below.
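A minimal sketch (an assumed example) contrasting unit-stride (coalesced) and strided access:

    // Coalesced: consecutive threads read consecutive words.
    void copy_unit_stride(int n, float *restrict dst, const float *restrict src) {
        #pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n])
        for (int i = 0; i < n; ++i)
            dst[i] = src[i];
    }

    // Strided (stride s): each warp touches s times as many cache lines.
    void copy_strided(int n, int s, float *restrict dst, const float *restrict src) {
        #pragma acc parallel loop copyin(src[0:n*s]) copyout(dst[0:n])
        for (int i = 0; i < n; ++i)
            dst[i] = src[i*s];
    }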

107 Impact of address alignment. Example program: copy ~67 MB of floats; NVIDIA Tesla C2050 (Fermi), ECC off, 256 threads/block. [Chart: achieved GB/s vs. offset in 4-byte words, Fermi with cached loads.] Misaligned accesses can drop the memory throughput.

108 Data layout matters (improve coalescing). Example: AoS vs. SoA.

    struct mystruct_aos_t {  // Array of Structures (AoS)
        float a;
        float b;
        int   c;
    };
    struct mystruct_aos_t mydata[N];   // access: mydata[i].a

    struct mystruct_soa_t {  // Structure of Arrays (SoA)
        float a[N];
        float b[N];
        int   c[N];
    };
    struct mystruct_soa_t mydata;      // access: mydata.a[i]

    #pragma acc kernels
    for (int i = 0; i < n; i++) {
        // AoS: mydata[i].a has stride sizeof(struct); SoA: mydata.a[i] is unit stride
    }

Memory layout: AoS: A[0] B[0] C[0] A[1] B[1] C[1] A[2] B[2] C[2] ...; SoA: A[0] A[1] A[2] ... B[0] B[1] B[2] ... C[0] C[1] C[2] ... This slide is based on: NVIDIA Corporation

109 Summary. Strive for perfect coalescing: a warp should access a contiguous region; align the starting address (may require padding). Have enough concurrent accesses to saturate the bus: process several elements per thread; launch enough threads to cover the access latency.

110 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing [Interoperability with CUDA & GPU Libraries] [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

111 Shared memory. Smem per SM(X) (10s of KB): inter-thread communication within a block (needs synchronization to avoid RAW/WAR/WAW hazards); low-latency, high-throughput memory. Cache data in smem to reduce gmem accesses:

    #pragma acc kernels copyout(b[0:n][0:n]) copyin(a[0:n][0:n])
    #pragma acc loop gang vector
    for (int i = 1; i < N-1; ++i) {
        #pragma acc loop gang vector
        for (int j = 1; j < N-1; ++j) {
            #pragma acc cache(a[i-1:3][j-1:3])
            b[i][j] = (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j]) / 4;
        }
    }

Compiler feedback:

    65, Accelerator kernel generated
        #pragma acc loop gang, vector(2)
        Cached references to size [(y+2)x(x+2)] block of 'a'
    67, #pragma acc loop gang, vector(128)

The tile clause can additionally strip-mine the data space; it is not yet supported by PGI. This slide is based on: NVIDIA Corporation, The Portland Group

112 Directives: caching & strip-mining. The cache construct prioritizes data for placement in the highest level of the data cache on the GPU.
    C/C++: #pragma acc cache(list) / Fortran: !$acc cache(list)
Use it at the beginning of a loop or in front of the loop. Sometimes the PGI compiler ignores the cache construct; see the compiler feedback. The tile clause strip-mines each loop in a loop nest according to the values in expr-list; the first entry applies to the innermost loop (see the sketch below).
    C/C++: #pragma acc loop [<schedule>] tile(expr-list) / Fortran: !$acc loop [<schedule>] tile(expr-list)
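A minimal sketch (an assumed example; N x N arrays a and b) of the OpenACC 2.0 tile clause, strip-mining the nest into 32x32 tiles (per the slides above, PGI did not support tile at the time):

    #define N 1024
    float a[N][N], b[N][N];

    void stencil_tiled(void) {
        // tile(32, 32): first entry applies to the innermost loop (j)
        #pragma acc parallel loop tile(32, 32) copyin(a) copy(b)
        for (int i = 1; i < N-1; ++i)
            for (int j = 1; j < N-1; ++j)
                b[i][j] = 0.25f * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j]);
    }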

113 Agenda Introduction to GPGPUs Motivation & Overview GPU Architecture OpenACC Basics Motivation & Overview Offload Regions Data Management OpenACC Advanced Heterogeneous Computing Interoperability with CUDA & GPU Libraries [Loop Schedules & Launch Configuration] [Maximize Global Memory Throughput] [Caching & Tiling] [Multiple GPUs ] [Comparison of OpenACC & OpenMP Device Constructs]

114 Multiple GPUs. Several GPUs within one compute node can be used within or across processes/threads, e.g. by assigning one device to each CPU thread/process:

    #include <openacc.h>
    acc_set_device_num(0, acc_device_nvidia);
    #pragma acc kernels async
    // ...
    acc_set_device_num(1, acc_device_nvidia);
    #pragma acc kernels async
    // ...

Device-specific tuning of clauses is possible with the device_type kernel clause.

115 Multiple GPUs: OpenMP + MPI.

    // OpenMP
    #include <openacc.h>
    #include <omp.h>
    #pragma omp parallel num_threads(12)
    {
        int numdev = acc_get_num_devices(acc_device_nvidia);
        int devid = omp_get_thread_num() % numdev;
        acc_set_device_num(devid, acc_device_nvidia);
    }

    // MPI
    #include <openacc.h>
    #include <mpi.h>
    int myrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    int numdev = acc_get_num_devices(acc_device_nvidia);
    int devid = myrank % numdev;
    acc_set_device_num(devid, acc_device_nvidia);

116 Directives: Multiple Devices
 Multi-GPU (or device) setup:
  int  acc_get_num_devices(acc_device_t): returns the number of devices of the given type
  void acc_set_device_num(int, acc_device_t): sets the device number for the given type
  int  acc_get_device_num(acc_device_t): returns the device number of the given type
  void acc_set_device_type(acc_device_t): sets the device type for the next offload region
  acc_device_t acc_get_device_type(): returns the device type for the next offload region
 Device-specific clause tuning: device_type (or dtype)
  Takes an accelerator architecture name, or an asterisk to match any device type not named otherwise
  Clauses following a device_type clause apply to that device type, up to the end of the directive or the next device_type clause
  Can be applied to: parallel, kernels, loop, update, routine
  C/C++:   #pragma acc <op> device_type(device-type-list) <clauses>
           #pragma acc <op> device_type(*) <clauses>
  Fortran: !$acc <op> device_type(device-type-list) <clauses>
           !$acc <op> device_type(*) <clauses>
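A hedged sketch of device_type in use; the loop, the names x, y, a, N, and the chosen vector lengths are assumptions, and the device-type name nvidia is what the PGI compiler accepts (device-type names are implementation-defined):

    /* Sketch (assumed example): tune the vector length per device type.
       NVIDIA devices get vector_length(256); any other device type
       falls back to vector_length(128) via device_type(*). */
    #pragma acc parallel loop device_type(nvidia) vector_length(256) \
                              device_type(*)      vector_length(128)
    for (int i = 0; i < N; ++i)
        y[i] += a * x[i];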

117 Multiple GPUs MPI
 Multiple GPUs across nodes: use MPI for communication
 Use update to get an up-to-date version on the host before sending data, and to push received data back to the device
 (Figure: each node with its CPU memory and GPU memory)

    /* MPI rank 0 (sender) */
    #pragma acc update host(s_buf[0:size])
    MPI_Send(s_buf, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);

    /* MPI rank n-1 (receiver) */
    MPI_Recv(r_buf, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
    #pragma acc update device(r_buf[0:size])
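As an aside, with a CUDA-aware MPI implementation the host staging can be avoided by passing the device address directly; a hedged sketch, assuming the MPI library accepts device buffers and s_buf is already present on the device inside an enclosing data region:

    /* Sketch (assumption: CUDA-aware MPI). host_data exposes the
       device address of s_buf to the host code in the block. */
    #pragma acc host_data use_device(s_buf)
    {
        MPI_Send(s_buf, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);
    }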

118 Summary
 OpenACC can use several devices:
  Within one node: use acc_set_device_num to handle GPU affinity
  Across nodes: use MPI

119 Hands-On
 Extend the heterogeneous version to use 2 GPUs within one compute node (w/o MPI):
  Decompose the matrix across GPU 0, GPU 1 and the CPU
  Use acc_set_device_num to differentiate between the GPUs
  Use enter data and exit data to move the initial data to the GPUs
  Use update for the CPU <-> GPU halo exchange
  Think about correct synchronization (one possible skeleton follows below)
 Follow the TODOs in GPU/exercises/task4 (Exercise 2.6)
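One possible skeleton for the data movement in this exercise; this is a sketch under assumed names (A, N, a row-block decomposition with one halo row per side), not the reference solution:

    /* Sketch (assumptions: fixed N, two NVIDIA GPUs, row-block split).
       Shows enter/exit data and the update-based halo exchange. */
    #include <openacc.h>

    #define N 1024
    #define HALF (N / 2)

    void halo_step(float A[N][N])
    {
        /* each GPU holds its half of the rows plus one halo row */
        acc_set_device_num(0, acc_device_nvidia);
        #pragma acc enter data copyin(A[0:HALF+1][0:N])
        acc_set_device_num(1, acc_device_nvidia);
        #pragma acc enter data copyin(A[HALF-1:HALF+1][0:N])

        /* ... per-device compute kernels would run here ... */

        /* halo exchange: GPU 0's last owned row -> GPU 1's halo row */
        acc_set_device_num(0, acc_device_nvidia);
        #pragma acc update host(A[HALF-1:1][0:N])
        acc_set_device_num(1, acc_device_nvidia);
        #pragma acc update device(A[HALF-1:1][0:N])

        /* and GPU 1's first owned row -> GPU 0's halo row */
        acc_set_device_num(1, acc_device_nvidia);
        #pragma acc update host(A[HALF:1][0:N])
        acc_set_device_num(0, acc_device_nvidia);
        #pragma acc update device(A[HALF:1][0:N])

        acc_set_device_num(0, acc_device_nvidia);
        #pragma acc exit data delete(A[0:HALF+1][0:N])
        acc_set_device_num(1, acc_device_nvidia);
        #pragma acc exit data delete(A[HALF-1:HALF+1][0:N])
    }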

120 Lab Results
 Hardware: 2x Intel 2.67GHz (Intel Westmere, 12 cores), NVIDIA Quadro 6000 (Fermi, cc 2.0)

 Version                                      Runtime [s]
 OpenMP                                       2.11
 OpenACC-Offload                              (missing)
 OpenACC-Data                                 1.27
 OpenACC-Hetero (1 GPU + 12 OMP threads)      (missing)
 OpenACC-MultiGPU (2 GPUs + 12 OMP threads)   (missing)

121 Agenda (section divider; next up: Comparison of OpenACC & OpenMP Device Constructs)

122 Introduction
 OpenMP = de-facto standard for shared-memory parallelization, from 1997 until now
 Since 2009: OpenMP Accelerator subcommittee
  A subgroup wanted faster development: OpenACC
  Idea: include the lessons learned back into the OpenMP standard
 07/2013: OpenMP 4.0
  Main topics: accelerator support, SIMD, thread affinity, tasking, tool support
 Some content of the following slides has been developed by James Beyer (Cray) and Eric Stotzer (TI), the leaders of the OpenMP for Accelerators subcommittee.
 RWTH Aachen University is a member of the OpenMP Architecture Review Board (ARB).
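For orientation before the comparison, a minimal sketch of the same offloaded loop in both models; the saxpy loop and the names x, y, a, n are an assumed example, not from the slides:

    /* Assumed saxpy example: OpenACC (top) vs. the OpenMP 4.0
       target construct (bottom) for the same loop. */
    #pragma acc parallel loop copy(y[0:n]) copyin(x[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];

    #pragma omp target teams distribute parallel for \
                map(tofrom: y[0:n]) map(to: x[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];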
