ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES Jeff Larkin, NVIDIA

OUTLINE
- Compiler Directives Review
- Asynchronous Execution
- OpenACC Interoperability
- OpenACC routine
- Advanced Data Directives
- OpenACC atomics
- Miscellaneous Best Practices
- Next Steps

WHAT ARE COMPILER DIRECTIVES?

Your original Fortran, C, or C++ code:

    ... serial code (CPU) ...
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
    {
        ... parallel code (GPU) ...
    }
    ... serial code (CPU) ...

When a compiler directive is encountered, the compiler/runtime will:
1. Generate parallel code for the GPU
2. Allocate GPU memory and copy input data
3. Execute the parallel code on the GPU
4. Copy output data to the CPU and deallocate GPU memory

WHY USE COMPILER DIRECTIVES?
- Single Source Code: no need to maintain multiple code paths
- High Level: abstract away device details, focus on expressing the parallelism and data locality
- Low Learning Curve: the programmer remains in the same language and adds directives to existing code
- Rapid Development: fewer code changes means faster development

WHY SUPPORT COMPILER DIRECTIVES? Increased Usability, Broader Adoption
[Chart: adoption growing from early adopters in 2004 (research universities, supercomputing centers, oil & gas), in the 100,000s, to the 1,000,000s today across CAE, CFD, finance, rendering, data analytics, life sciences, defense, weather, climate, and plasma physics.]

3 WAYS TO ACCELERATE APPLICATIONS
- Libraries: drop-in acceleration
- Compiler Directives: easily accelerate applications
- Programming Languages: maximum flexibility

OPENACC 2.0 AND OPENMP 4.0

OPENACC 2.0
OpenACC is a specification of high-level compiler directives for expressing parallelism for accelerators. It aims to be performance portable to a wide range of accelerators: multiple vendors, multiple devices, one specification.
- The OpenACC specification was first released in November 2011. Original members: CAPS, Cray, NVIDIA, Portland Group.
- OpenACC 2.0 was released in June 2013, expanding functionality and improving portability.
- At the end of 2013, OpenACC had more than 10 member organizations.

OPENACC DIRECTIVE SYNTAX

C/C++:

    #pragma acc directive [clause [[,] clause]...]

...often followed by a structured code block.

Fortran:

    !$acc directive [clause [[,] clause]...]

...often paired with a matching end directive surrounding a structured code block:

    !$acc end directive

OPENACC EXAMPLE: SAXPY

SAXPY in C:

    void saxpy(int n, float a, float *x, float *restrict y)
    {
      #pragma acc parallel loop
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }

    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

SAXPY in Fortran:

    subroutine saxpy(n, a, x, y)
      real :: x(n), y(n), a
      integer :: n, i
      !$acc parallel loop
      do i=1,n
        y(i) = a*x(i)+y(i)
      enddo
      !$acc end parallel loop
    end subroutine saxpy

    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x, y)
    ...

OPENMP 4.0
OpenMP has existed since 1997 as a specification of compiler directives for shared-memory parallelism. In 2013, OpenMP 4.0 was released, expanding the focus beyond shared-memory parallel computers to include accelerators. The OpenMP 4.0 target construct provides the means to offload data and computation to accelerators. Additional directives were added to support multiple thread teams and simd parallelism. OpenMP continues to improve its support for offloading to accelerators.

OPENMP DIRECTIVE SYNTAX

C/C++:

    #pragma omp target directive [clause [[,] clause]...]

...often followed by a structured code block.

Fortran:

    !$omp target directive [clause [[,] clause]...]

...often paired with a matching end directive surrounding a structured code block:

    !$omp end target directive

OPENMP TARGET EXAMPLE: SAXPY

SAXPY in C:

    void saxpy(int n, float a, float *x, float *restrict y)
    {
      #pragma omp target teams \
          distribute parallel for
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }

    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

SAXPY in Fortran:

    subroutine saxpy(n, a, x, y)
      real :: x(n), y(n), a
      integer :: n, i
      !$omp target teams &
      !$omp& distribute parallel do
      do i=1,n
        y(i) = a*x(i)+y(i)
      enddo
      !$omp end target teams &
      !$omp& distribute parallel do
    end subroutine saxpy

    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x, y)
    ...

ASYNCHRONOUS EXECUTION

OPENACC ASYNC CLAUSE
An async clause may be added to most OpenACC directives, causing the affected section of code to run asynchronously with respect to the host.
- async(handle): adds the work generated by this directive to the asynchronous queue represented by the integer handle.
- Asynchronous activities on the same queue execute in order.
- The user must wait before accessing variables associated with the asynchronous region.

    #pragma acc parallel loop async(1)

OPENACC WAIT DIRECTIVE
The wait directive waits for asynchronous work to complete, e.g. #pragma acc wait(1).
- wait: wait for all asynchronous work to complete.
- wait(handle): wait only for the asynchronous queue represented by handle to complete.
- wait(handle1) async(handle2): the asynchronous queue represented by handle2 must wait for all work in handle1 to complete before proceeding, but control is returned to the CPU immediately.
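As a minimal sketch of these clauses working together (the arrays a, b, and c are illustrative and assumed to already be present on the device), two queues run independently and queue 2 is made to depend on queue 1 without blocking the host:

    // Launch independent work on two queues.
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; i++) a[i] = 2.0f * a[i];

    #pragma acc parallel loop async(2)
    for (int i = 0; i < n; i++) b[i] = 3.0f * b[i];

    // Queue 2 will not start the next kernel until all work in queue 1
    // has finished, but the CPU continues immediately.
    #pragma acc wait(1) async(2)
    #pragma acc parallel loop async(2)
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];

    // Block the host until everything is done.
    #pragma acc wait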

OPENACC PIPELINING

    #pragma acc data
    {
      for (int p = 0; p < nplanes; p++)
      {
        #pragma acc update device(plane[p])
        #pragma acc parallel loop
        for (int i = 0; i < nwork; i++)
        {
          // Do work on plane[p]
        }
        #pragma acc update self(plane[p])
      }
    }

For this example, assume that each plane is completely independent and must be copied to/from the device. As it is currently written, plane[p+1] will not begin copying to the GPU until plane[p] is copied from the GPU.

OPENACC PIPELINING (CONT.)
[Timeline diagram. Top: planes p and p+1 serialize: [p] H2D, [p] kernel, [p] D2H, then [p+1] H2D, [p+1] kernel, [p+1] D2H. Bottom: planes p and p+1 overlap data movement: [p+1] H2D proceeds while the [p] kernel runs, and the [p+1] kernel runs while [p] D2H proceeds. NOTE: In real applications, your boxes will not be so evenly sized.]

OPENACC PIPELINING (CONT.)

    #pragma acc data
    {
      for (int p = 0; p < nplanes; p++)
      {
        #pragma acc update device(plane[p]) async(p)
        #pragma acc parallel loop async(p)
        for (int i = 0; i < nwork; i++)
        {
          // Do work on plane[p]
        }
        #pragma acc update self(plane[p]) async(p)
      }
      #pragma acc wait
    }

Enqueue each plane in its own queue so that its steps execute in order, then wait on all queues.

PIPELINED BOUNDARY EXCHANGE
Some algorithms, such as a boundary (halo) exchange, have a natural pipeline. Each boundary zone is exchanged with one neighbor, so a separate queue may be used for each zone, as in the sketch below.
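A minimal sketch of this pattern, assuming the zone layout and the names sendbuf, recvbuf, zone_start, field, and exchange_with_neighbor, which are illustrative rather than from the slides. Each zone's pack and copy go on its own queue so zones proceed independently:

    // One asynchronous queue per boundary zone.
    for (int z = 0; z < nzones; z++)
    {
      // Pack this zone's boundary into its slice of a contiguous buffer.
      #pragma acc parallel loop async(z)
      for (int i = 0; i < zone_size; i++)
        sendbuf[z*zone_size + i] = field[zone_start[z] + i];

      // Copy only this zone's slice back to the host, on the same queue.
      #pragma acc update self(sendbuf[z*zone_size:zone_size]) async(z)
    }

    for (int z = 0; z < nzones; z++)
    {
      #pragma acc wait(z)   // wait only for this zone's pack and copy
      exchange_with_neighbor(z, &sendbuf[z*zone_size], &recvbuf[z*zone_size]);
      #pragma acc update device(recvbuf[z*zone_size:zone_size]) async(z)
    }
    #pragma acc wait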

ASYNCHRONOUS TIP
Even if you can't achieve overlapping, it may be beneficial to use async to queue up work and reduce launch latency. The Cray compiler can do this automatically with the -h accel_model=auto_async_all flag.

OPENACC INTEROPERABILITY

OPENACC INTEROPERABILITY
OpenACC plays well with others.
- Add CUDA, OpenCL, or accelerated libraries to an OpenACC application
- Add OpenACC to an existing accelerated application
- Share data between OpenACC and CUDA

OPENACC & CUDA STREAMS
OpenACC provides two functions for interoperating with CUDA streams:

    void* acc_get_cuda_stream( int async );
    int   acc_set_cuda_stream( int async, void* stream );
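For example, a hedged sketch (the kernel name my_kernel and the launch configuration are illustrative, and host_data is described on the next slide): a CUDA kernel is launched on the same stream that backs OpenACC async queue 1, so it is ordered after the OpenACC work on that queue.

    #include <openacc.h>
    #include <cuda_runtime.h>

    // Obtain the CUDA stream backing OpenACC async queue 1.
    cudaStream_t stream = (cudaStream_t) acc_get_cuda_stream(1);

    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; i++) x[i] *= 2.0f;

    // Launch a CUDA kernel on that stream; it runs after the loop above.
    #pragma acc host_data use_device(x, y)
    {
      my_kernel<<<(n+255)/256, 256, 0, stream>>>(n, x, y);
    }

    #pragma acc wait(1)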

OPENACC HOST_DATA DIRECTIVE
Exposes the device address of particular objects to the host code.

    #pragma acc data copy(x,y)
    {
      // x and y are host pointers here
      #pragma acc host_data use_device(x,y)
      {
        // x and y are device pointers here
      }
      // x and y are host pointers again
    }

HOST_DATA EXAMPLE

OpenACC Main (Fortran):

    program main
      integer, parameter :: N = 2**20
      real, dimension(N) :: X, Y
      real :: A = 2.0
      !$acc data
      ! Initialize X and Y
      ...
      !$acc host_data use_device(X, Y)
      call saxpy(N, A, X, Y)
      !$acc end host_data
      !$acc end data
    end program

CUDA C Kernel & Wrapper:

    __global__ void saxpy_kernel(int n, float a, float *x, float *y)
    {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];
    }

    void saxpy(int n, float a, float *dx, float *dy)
    {
      // Launch CUDA Kernel
      saxpy_kernel<<<4096,256>>>(n, 2.0, dx, dy);
    }

It's possible to interoperate from C/C++ or Fortran. OpenACC manages the data and passes device pointers to CUDA. The CUDA kernel launch is wrapped in a function expecting device arrays, and the kernel is launched with the arrays passed from OpenACC in main.

CUBLAS LIBRARY & OPENACC
OpenACC can interface with existing GPU-optimized libraries (from C/C++ or Fortran), including CUBLAS, Libsci_acc, CUFFT, MAGMA, CULA, and Thrust.

OpenACC Main Calling CUBLAS:

    int N = 1<<20;
    float *x, *y;
    // Allocate & Initialize X & Y
    ...
    cublasInit();
    #pragma acc data copyin(x[0:N]) copy(y[0:N])
    {
      #pragma acc host_data use_device(x,y)
      {
        // Perform SAXPY on 1M elements
        cublasSaxpy(N, 2.0, x, 1, y, 1);
      }
    }
    cublasShutdown();

OPENACC DEVICEPTR
The deviceptr clause informs the compiler that an object is already on the device, so no translation is necessary. Valid for parallel, kernels, and data.

    cudaMalloc((void**)&x, (size_t)n*sizeof(float));
    cudaMalloc((void**)&y, (size_t)n*sizeof(float));

    #pragma acc parallel loop deviceptr(x,y)
    for (int i = 0; i < n; i++)
      y[i] = a*x[i] + y[i];

Do not translate x and y; they are already on the device.

DEVICEPTR EXAMPLE

OpenACC Kernels:

    void saxpy(int n, float a, float * restrict x, float * restrict y)
    {
      #pragma acc kernels deviceptr(x[0:n],y[0:n])
      for (int i = 0; i < n; i++)
        y[i] += 2.0*x[i];
    }

CUDA C Main:

    int main(int argc, char **argv)
    {
      float *x, *y, tmp;
      int n = 1<<20, i;
      cudaMalloc((void**)&x, (size_t)n*sizeof(float));
      cudaMalloc((void**)&y, (size_t)n*sizeof(float));
      ...
      saxpy(n, 2.0, x, y);
      cudaMemcpy(&tmp, y, (size_t)sizeof(float), cudaMemcpyDeviceToHost);
      return 0;
    }

By passing a device pointer to an OpenACC region, it's possible to add OpenACC to an existing CUDA code. Memory is managed via standard CUDA calls.

OPENACC & THRUST
Thrust (thrust.github.io) is an STL-like library for C++ on accelerators:
- High-level interface
- Host/Device container classes
- Common parallel algorithms

It's possible to cast Thrust vectors to device pointers for use with OpenACC:

    void saxpy(int n, float a, float * restrict x, float * restrict y)
    {
      #pragma acc kernels deviceptr(x[0:n],y[0:n])
      for (int i = 0; i < n; i++)
        y[i] += 2.0*x[i];
    }

    int N = 1<<20;
    thrust::host_vector<float> x(N), y(N);
    for (int i = 0; i < N; i++)
    {
      x[i] = 1.0f;
      y[i] = 0.0f;
    }
    // Copy to Device
    thrust::device_vector<float> d_x = x;
    thrust::device_vector<float> d_y = y;
    saxpy(N, 2.0, d_x.data().get(), d_y.data().get());
    // Copy back to host
    y = d_y;

OPENACC ACC_MAP_DATA FUNCTION
The acc_map_data (acc_unmap_data) function maps (unmaps) an existing device allocation to an OpenACC variable.

    // Allocate device arrays with CUDA and map them to OpenACC host arrays
    cudaMalloc((void**)&x_d, (size_t)n*sizeof(float));
    acc_map_data(x, x_d, n*sizeof(float));
    cudaMalloc((void**)&y_d, (size_t)n*sizeof(float));
    acc_map_data(y, y_d, n*sizeof(float));

    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
      y[i] = a*x[i] + y[i];

Here x and y will reuse the device memory of x_d and y_d.

OPENACC ROUTINE DIRECTIVE

OPENACC ROUTINE DIRECTIVE
The routine directive specifies that the compiler should generate a device copy of the function/subroutine in addition to the host copy.

Clauses:
- gang/worker/vector/seq: specifies the level of parallelism contained in the routine.
- bind: specifies an optional name for the routine, also supplied at the call site.
- no_host: the routine will only be used on the device.
- device_type: specialize this routine for a particular device type.

OPENACC ROUTINE: BLACK SCHOLES

    ///////////////////////////////////////////////////////////////////////////
    // Polynomial approximation of cumulative normal distribution function
    ///////////////////////////////////////////////////////////////////////////
    #pragma acc routine seq
    real CND(real d)
    {
      const real A1 = (real)0.31938153;
      const real A2 = (real)-0.356563782;
      const real A3 = (real)1.781477937;
      const real A4 = (real)-1.821255978;
      const real A5 = (real)1.330274429;
      const real RSQRT2PI = (real)0.39894228040143267793994605993438;

      real K = (real)1.0 / ((real)1.0 + (real)0.2316419 * FABS(d));
      real cnd = RSQRT2PI * EXP(-(real)0.5 * d * d) *
                 (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));

      if (d > 0)
        cnd = (real)1.0 - cnd;

      return cnd;
    }

OPENACC ROUTINE: FORTRAN

    subroutine foo(v, i, n)
      !$acc routine vector
      real :: v(:,:)
      integer :: i, n
      !$acc loop vector
      do j=1,n
        v(i,j) = 1.0/(i*j)
      enddo
    end subroutine

    !$acc parallel loop
    do i=1,n
      call foo(v,i,n)
    enddo
    !$acc end parallel loop

- The routine directive may appear in a Fortran function or subroutine definition, or in an interface block.
- Nested acc routines require the routine directive within each nested routine.
- The save attribute is not supported.

ADVANCED DATA DIRECTIVES

OPENACC DATA REGIONS REVIEW

    #pragma acc data copy(A) create(Anew)
    while ( err > tol && iter < iter_max )
    {
      err = 0.0;

      #pragma acc parallel loop reduction(max:err)
      for (int j = 1; j < n-1; j++)
      {
        for (int i = 1; i < m-1; i++)
        {
          Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                               A[j-1][i] + A[j+1][i]);
          err = max(err, abs(Anew[j][i] - A[j][i]));
        }
      }

      #pragma acc parallel loop
      for (int j = 1; j < n-1; j++)
      {
        for (int i = 1; i < m-1; i++)
        {
          A[j][i] = Anew[j][i];
        }
      }

      iter++;
    }

Copy A to/from the accelerator only when needed. Create Anew as a device temporary. Data is shared within this region.

OPENACC UPDATE DIRECTIVE
The programmer specifies an array (or partial array) that should be refreshed within a data region. The programmer may choose to specify only part of the array to update.

    do_something_on_device()

    !$acc update self(a)      ! Copy a from GPU to CPU

    do_something_on_host()

    !$acc update device(a)    ! Copy a from CPU to GPU
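A minimal sketch of a partial update from C (the array a and the bounds are illustrative; a[0:n] is assumed to already be present on the device inside a data region):

    // Refresh only the first half of the array on the device...
    #pragma acc update device(a[0:n/2])

    // ...and later copy only a small boundary strip back to the host.
    #pragma acc update self(a[0:16])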

UNSTRUCTURED DATA REGIONS
OpenACC 2.0 provides a means for beginning and ending a data region in different program scopes.

Structured data region:

    double a[100];
    #pragma acc data copy(a)
    {
      // OpenACC code
    }

Unstructured data region:

    double a[100];
    #pragma acc enter data copyin(a)
    // OpenACC code
    #pragma acc exit data copyout(a)

UNSTRUCTURED DATA REGIONS: C++ CLASSES

    class Matrix
    {
    public:
      Matrix(int n)
      {
        len = n;
        v = new double[len];
        #pragma acc enter data create(v[0:len])
      }
      ~Matrix()
      {
        #pragma acc exit data delete(v[0:len])
        delete[] v;
      }
    private:
      double* v;
      int len;
    };

Unstructured data regions enable OpenACC to be used in C++ classes. They can be used whenever data is allocated and initialized in a different scope than where it is freed.
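For illustration, a hypothetical member function (not from the slides) showing how the device copy created in the constructor might be used, with an update to bring results back when needed:

    // Hypothetical member function added to the Matrix class above:
    // fill the matrix on the device using the already-present allocation.
    void fill(double val)
    {
      #pragma acc parallel loop present(v[0:len])
      for (int i = 0; i < len; i++)
        v[i] = val;

      // Copy the device data back to the host allocation.
      #pragma acc update self(v[0:len])
    }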

EXTENDED DATA API
Most of the data directives can also be used as function calls from C:
- acc_malloc / acc_free
- acc_update_device / acc_update_self
- acc_create / acc_delete
- acc_memcpy_to_device / acc_memcpy_from_device
- acc_copy* / acc_present_or_*

These are useful for more complex data structures.
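As an illustration of "more complex data structures", a hedged sketch of deep-copying a struct with a dynamically allocated member (the struct layout and function name are hypothetical; acc_deviceptr is another OpenACC API routine not listed on the slide):

    #include <openacc.h>
    #include <stddef.h>

    typedef struct {
      int     n;
      double *data;   // dynamically allocated member
    } vector_t;

    void vector_to_device(vector_t *v)
    {
      // Copy the struct itself, then its payload, to the device.
      acc_copyin(v, sizeof(vector_t));
      acc_copyin(v->data, v->n * sizeof(double));

      // Patch the device struct's data pointer to point at the device payload.
      double *d_data = (double*) acc_deviceptr(v->data);
      acc_memcpy_to_device((char*) acc_deviceptr(v) + offsetof(vector_t, data),
                           &d_data, sizeof(double*));
    }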

OPENACC SLIDING WINDOW
When the GPU does not have sufficient memory for a problem, acc_map_data may be used to implement a sliding window for moving data to the GPU. Multiple windows plus async may hide PCIe costs, but increase code complexity.

    double *bigdataarray, *slidingwindow;
    // Allocate on device
    slidingwindow = (double*)acc_malloc(...);
    for (...)
    {
      // Map window to a slice of data
      acc_map_data(&bigdataarray[x], slidingwindow, ...);
      #pragma acc update device(bigdataarray[x:n])
      #pragma acc kernels
      {
        ...
      }
      #pragma acc update self(bigdataarray[x:n])
      // Unmap slice for next iteration
      acc_unmap_data(&bigdataarray[x]);
    }
    acc_free(slidingwindow);

OPENACC ATOMICS

OPENACC ATOMIC DIRECTIVE
Ensures a variable is accessed atomically, preventing race conditions and inconsistent results. The atomic construct may read, write, update, or capture variables in a section of code.

    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
      if ( x[i] > 0 )
      {
        #pragma acc atomic update
        cnt++;    // cnt can only be accessed by one thread at a time
      }
    }
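For contrast, a minimal sketch of the capture form (the arrays x, out, and the counter cnt are illustrative), which atomically updates a variable and captures its value, for example to hand out unique output slots:

    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
      if ( x[i] > 0 )
      {
        int next;
        // Atomically increment the counter and capture its previous value.
        #pragma acc atomic capture
        next = cnt++;

        out[next] = x[i];   // the captured value gives this element a unique slot
      }
    }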

MISCELLANEOUS BEST PRACTICES

C TIP: THE RESTRICT KEYWORD
- A declaration of intent given by the programmer to the compiler, applied to a pointer, e.g. float *restrict ptr
- Meaning: for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points (http://en.wikipedia.org/wiki/restrict)
- Limits the effects of pointer aliasing
- Compilers often require restrict to determine independence (true for OpenACC, OpenMP, and vectorization); otherwise the compiler can't parallelize loops that access ptr
- Note: if the programmer violates the declaration, behavior is undefined

EFFICIENT LOOP ORDERING
Stride-1 inner loops (each loop iteration accesses the next element in memory) generally work best both for the compiler and the GPU:
- Simpler to vectorize
- Better memory throughput

In C, make the inner loop run over the right-most dimension; in Fortran, over the left-most dimension (see the sketch below).
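A minimal C sketch, assuming a row-major array a[n][m]: with the inner loop over the right-most index, consecutive iterations touch consecutive memory locations.

    // Good: stride-1 inner loop over the right-most (contiguous) dimension.
    #pragma acc parallel loop
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        a[j][i] = 2.0 * a[j][i];

    // Poor: the inner loop strides by m elements between iterations.
    #pragma acc parallel loop
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
        a[j][i] = 2.0 * a[j][i];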

LOOP FUSION/FISSION
It's often beneficial to:
- Fuse several small loops into one larger one. Multiple small loops result in multiple small kernels, which may not run long enough to cover launch latencies. (A fusion sketch follows.)
- Split very large loops into multiple smaller ones. Very large loops consume a lot of resources (registers, shared memory, etc.), which may result in poor GPU utilization.
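A minimal fusion sketch (the arrays a, b, and c are illustrative): two small loops over the same range become one kernel instead of two.

    // Before: two small kernels, two launches.
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) b[i] = 2.0 * a[i];
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) c[i] = b[i] + 1.0;

    // After: one fused kernel, one launch.
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
      b[i] = 2.0 * a[i];
      c[i] = b[i] + 1.0;
    }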

REDUCING DATA MOVEMENT
It's often beneficial to move a loop to the GPU, even if it's not actually faster on the GPU, if doing so prevents copying data to/from the device, as in the sketch below.
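For instance, a hedged sketch (array a and the loop bodies are illustrative): initializing an array on the device is not faster than doing it on the host, but it avoids copying the whole array to the device.

    // copyout allocates a[] on the device without an initial host-to-device
    // copy, and copies the result back once at the end of the region.
    #pragma acc data copyout(a[0:n])
    {
      // Initialize on the device instead of copying an initialized array in.
      #pragma acc parallel loop
      for (int i = 0; i < n; i++)
        a[i] = 0.0;

      #pragma acc parallel loop
      for (int i = 0; i < n; i++)
        a[i] += 0.5 * i;
    }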

NEXT STEPS

NEXT STEPS: TRY OPENACC
- GTC14 has multiple hands-on labs for trying OpenACC:
  - S4803 - Getting Started with OpenACC, Wednesday 5PM
  - S4796 - Parallel Programming: OpenACC Profiling, Thursday 2PM
- Get a PGI trial license for your machine
- NVIDIA Developer Blogs: Parallel Forall Blog, CUDACasts Screencast Series

MORE AT GTC14
- S4514 - Panel on Compiler Directives for Accelerated Computing, Wednesday @ 4PM
- Wednesday afternoon in LL20C has all compiler-directives talks
- Check the sessions agenda for more talks on compiler directives
- Please remember to fill out your surveys.