OpenACC and the Cray Compilation Environment James Beyer PhD
2 Agenda
- A brief introduction to OpenACC
- Cray Programming Environment (PE) and Cray Compilation Environment (CCE)
- An in-depth look at CCE 8.2 and OpenACC
- A selection of insights concerning the use of OpenACC
- Summary
3 OpenACC review
4 Contents
- OpenACC programming model
- What does OpenACC look like?
- How are OpenACC directives used?
5 OpenACC programming model
- Host-directed execution with an attached GPU
  - The main program executes on the host (i.e. CPU) and directs execution on the device (i.e. GPU): memory allocation and transfers, kernel execution, synchronization
- Memory spaces on the host and device are distinct
  - Different locations, different address spaces
  - Data movement is performed by the host, using runtime library calls that explicitly move data between the separate memories
- GPUs have a weak memory model
  - No synchronization is possible between units at the outermost parallel level
- The user is responsible for:
  - Specifying code to run on the device
  - Specifying parallelism
  - Specifying data allocation/movement that spans single kernels
6 Accelerator directives
- Modify the original source code with directives
  - Non-executable statements (comments, pragmas)
  - Can be ignored by a non-accelerating compiler; CCE -hnoacc (or -xacc) also suppresses compilation
- Sentinel: acc
  - C/C++: preceded by #pragma; a structured block {...} avoids the need for end directives
  - Fortran: preceded by !$ (or c$ for FORTRAN77); usually paired with !$acc end *
  - Directives can be capitalised
- Continuation onto extra lines is allowed
  - C/C++: \ at the end of the line to be continued
  - Fortran fixed form: c$acc& or !$acc& on the continuation line
  - Fortran free form: & at the end of the line to be continued; continuation lines can start with either !$acc or !$acc&

  // C/C++ example
  #pragma acc *
  {structured block}

  ! Fortran example
  !$acc *
  <structured block>
  !$acc end *
7 A basic example
- Execute a loop nest on the GPU; the compiler does the work:

  !$acc parallel loop
  DO i = 2,N-1
    c(i) = a(i) + b(i)
  ENDDO
  !$acc end parallel loop

- Data movement
  - Allocates/frees GPU memory at the start/end of the region
  - Moves data to/from the GPU (here a and b are read-only, c is write-only)
- Loop schedule: spreading loop iterations over the PEs of the GPU
  - OpenACC to CUDA mapping: gang = threadblock; worker = warp (group of 32 threads); vector = threads within a warp
  - The compiler takes care of cases where the iteration count doesn't divide the threadblock size
- Caching (explicitly use GPU shared memory for reused data)
  - Automatic caching (e.g. NVIDIA Fermi, Kepler) is important
- Tune the default behavior with optional clauses on the directives
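The same pattern can be written in C/C++; the following is an illustrative sketch (not from the slides), operating on the interior points as in the Fortran DO i = 2,N-1 loop. Because the directives are comments/pragmas, a non-accelerating compiler ignores them and runs the loop serially with identical results.

```c
/* C/C++ version of the basic parallel loop example above (a sketch).
 * With PrgEnv-cray this would be compiled with the "cc" wrapper; a serial
 * compiler simply ignores the pragma. */
void vec_add(int n, const double *a, const double *b, double *c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 1; i < n - 1; i++)   /* interior points only */
        c[i] = a[i] + b[i];
}
```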
8 Cray PE introduction
9 Cray-packaged OpenACC Programming Environments
- Two different OpenACC compilers; you select these by loading a Programming Environment module
  - PrgEnv-cray for CCE (the default)
  - PrgEnv-pgi for PGI
- Once one of these is loaded, you can then select a compiler version
  - CCE: "module avail cce"; PGI: "module avail pgi"
  - Swap to the most up-to-date version in each case, e.g. "module avail cce" to see the versions available, then "module swap cce cce/<whatever>"
- For any GPU programming (CUDA, OpenCL, OpenACC...) make sure you always "module load craype-accel-nvidia35"
  - It is not loaded by default; the sys-admin decides
10 Using the compilers
- You use the compilers via wrapper commands: ftn for Fortran; cc for C; CC for C++
  - It doesn't matter which PrgEnv is loaded (same wrapper names)
  - The wrappers add optimisation options, architecture-specific flags and all the important library paths
  - Make sure the xtpe-<processor type> module is loaded so these are correct
  - In many cases you don't need any other compiler options
  - If you really want unoptimised code, you must use option -O0
- Further information
  - The man pages for the wrapper commands give you general information
  - For more detail, see the compiler-specific man pages (CCE: crayftn, craycc, crayCC; PGI: pgfortran, pgcc)
  - You will need the appropriate PrgEnv module loaded to see these
11 Some Cray Compilation Environment basics
CCE-specific features:
- Optimisation: -O2 is the default and you should usually use this
  - -O3 activates more aggressive options; could be faster or slower
- OpenMP is supported by default; if you don't want it, use the -hnoomp compiler flag
- OpenACC is enabled automatically when the accelerator module is loaded
- CCE only gives minimal information to stderr when compiling (-hmsgs to see more)
- For more information, you should request a compiler listing file
  - Flag -hlist=a for ftn and cc writes a file with extension .lst
  - It contains an annotated source listing, followed by explanatory messages
  - Each message is tagged with an identifier, e.g. ftn-6430; to get more information on this, type: explain <identifier>
- For a full description of the Cray compilers, see the reference manuals
12 Further information: compiling CUDA
- Compilation: module load craype-accel-nvidia35
- Main CPU code is compiled with the PrgEnv "cc" wrapper
  - Either PrgEnv-gnu for gcc, or PrgEnv-cray for craycc
- GPU CUDA-C kernels are compiled with nvcc, e.g. nvcc -O3 -arch=sm_35
- The PrgEnv "cc" wrapper is used for linking
  - The only GPU flag needed is -lcudart; e.g. no CUDA -L flags are needed (the cc wrapper adds them)
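The build steps above might be captured in a makefile along these lines. This is a sketch under assumptions: the file names (main.c, kernels.cu) and the host/kernel split are hypothetical, not from the slides.

```makefile
# Hypothetical layout: main.c = host code, kernels.cu = CUDA-C kernels
all: app

kernels.o: kernels.cu
	nvcc -O3 -arch=sm_35 -c kernels.cu

main.o: main.c
	cc -c main.c                    # PrgEnv "cc" wrapper

app: main.o kernels.o
	cc main.o kernels.o -lcudart    # wrapper supplies the CUDA -L paths
```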
13 CCE 8.2 OpenACC status
14 Contents
- Cray Compilation Environment (CCE)
- What does CCE do with X?
- -hacc_model=
15 OpenACC in CCE
- man intro_openacc
- Which module to use: craype-accel-nvidia20 or craype-accel-nvidia35
  - Forces dynamic linking
- Single object file; whole-program view; messages/list file
- Compiles to PTX, not CUDA
  - Leverages years of vector code generator experience
  - The debugger sees the original program, not a CUDA intermediate
16 OpenACC implementation status
- OpenACC 1.0 features: complete
- _OPENACC change: complete
- default(none): complete
- acc_async_sync and acc_async_noval: complete
- Loop nesting clarification: matches what we have always done
- wait clause on parallel, kernels and update: complete
- async clause on the wait directive: complete
- enter/exit data: complete
- Common block names: deferred
- link clause: complete
- Multidimensional C/C++ array support: complete
- tile clause: complete/deferred
- auto clause: complete
- device_type: complete
- routine directive: complete
- Nested parallelism: deferred
- Atomic constructs: complete
- New APIs: complete
17 What does CCE do with OpenACC constructs (1)
- parallel/kernels
  - Flattens all calls that do not have routine constructs on them
  - Packages code for the kernel
  - Inserts data motion to and from the device (from clauses, plus autodetection)
  - Inserts kernel launch code
  - Automatic vectorization is enabled
  - Inserts joins/events for wait clauses
- kernels
  - Identifies kernels
  - Inserts joins/events for wait clauses
- loop
  - gang: thread block (TB); worker: warp; vector: threads within a warp or TB
  - Automatic vectorization is enabled
  - collapse: will only rediscover indices when required
  - independent: turns off safety/correctness checking for work-sharing of the loop
  - reduction: nontrivial to implement; does not use multiple kernels; all loop directives within a loop nest must list the reduction if applicable
  - tile: similar to collapse
  - auto: treated as a preferred clause for our auto-parallelism feature
18 What does CCE do with OpenACC constructs (2)
- data clause( object list )
  - create: allocate at start, register in the present-table, de-allocate at exit
  - copy, copyin, copyout: create plus a data copy
  - present: abort at runtime if the object is not in the present table
  - present_or_copy, present_or_copyin, present_or_copyout, present_or_create
  - deviceptr: send the address directly to the kernel without translation
- Unstructured data
  - enter data: same as the init part of a data construct
  - exit data: delete the object from the present table; abort at runtime if the object is not on the device
- update
  - Implicit !$acc data present( obj )
  - For known-contiguous memory: transfer (essentially a CUDA memcpy)
  - For non-contiguous memory: pack into a contiguous buffer, transfer the contiguous buffer, unpack from it
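The structured data construct and update directive described above can be sketched in C as follows (an illustrative example, not from the slides): the data region creates device copies that outlive two kernels, and update refreshes the host copy in between. With the pragmas ignored by a serial compiler the result is unchanged.

```c
/* Sketch of data-clause behaviour: copy(a) allocates and copies in at
 * region entry and copies out at exit; present(a) asserts the object is
 * already in the present table; update self(a) refreshes the host copy. */
void scale_and_shift(int n, double *a, double s, double d)
{
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= s;

        #pragma acc update self(a[0:n])   /* host copy now valid mid-region */

        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] += d;
    }
}
```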
19 What does CCE do with OpenACC constructs (3)
- cache
  - Creates shared-memory copies of objects
  - Generates copies into and out of the shared-memory objects
  - Releases the shared memory
- routine construct
  - gang: generates gang-redundant code
  - worker: generates worker-single code
  - vector: generates vector-single code
  - seq: generates per-thread code
  - bind( name ) / bind( string ): an if-block with acc_on_device
  - nohost is ignored
- declare construct (implementation completely reworked for the 8.2 release)
  - link: creates a pointer for the object on the device and replaces all references to the object in kernels with pointer-based references (similar to PIC code); adds fixup code to ensure that device pointers contain the correct address after the object is moved to the device
  - device_resident: places the object on the device; Fortran allocatables not complete
20 What does CCE do with OpenACC constructs (4)
- atomic construct
  - Maps onto our OpenMP translation system
  - CAS loops for unsupported operators
  - The compiler issues an error if the type requires locks, e.g. complex(128)
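A minimal sketch of the atomic construct in use (an illustrative example, not from the slides): a histogram where many loop iterations may update the same bin, so each increment is made atomic. A serial build, with the pragmas ignored, produces the same counts.

```c
/* Histogram with atomic bin updates.  Without "atomic update", concurrent
 * gangs/workers could lose increments when keys collide. */
void histogram(int n, const int *key, int nbins, int *bins)
{
    #pragma acc parallel loop copyin(key[0:n]) copy(bins[0:nbins])
    for (int i = 0; i < n; i++) {
        #pragma acc atomic update
        bins[key[i] % nbins]++;
    }
}
```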
21 Extended OpenACC 2.0 runtime routines

  void cray_acc_update_device_async( void *, size_t, int );
  void cray_acc_update_host_async( void *, size_t, int );
  void *cray_acc_memcpy_to_host_async( void *destination, const void *source, size_t size, int async_id );
  void *cray_acc_memcpy_to_device_async( void *destination, const void *source, size_t size, int async_id );
22 Partitioning clause mappings
Three-level schedule:
  1. !$acc loop gang : across thread blocks
  2. !$acc loop worker : across warps within a thread block
  3. !$acc loop vector : across threads within a warp
Two-level schedules:
  1. !$acc loop gang : across thread blocks
  2. !$acc loop worker vector : across threads within a thread block
or
  1. !$acc loop gang : across thread blocks
  2. !$acc loop vector : same as worker vector
or
  1. !$acc loop gang worker : across thread blocks and the warps within a thread block
  2. !$acc loop vector : across threads within a warp
One-level schedules:
  1. !$acc loop gang vector : across thread blocks and threads within a thread block
  1. !$acc loop gang worker vector : same as gang vector
23 Partitioning clause mappings (cont)
You can also force things to be within a single thread block:
  1. !$acc loop worker : across warps within a single thread block
  2. !$acc loop vector : across threads within a warp
or
  1. !$acc worker vector : across threads within a single thread block
or
  1. !$acc vector : across threads within a single thread block
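A two-level schedule from the mappings above might look like this in C (an illustrative sketch, not from the slides): gang over the rows, vector with a reduction over the dot product. A serial compiler ignores the clauses and computes the same result.

```c
/* Matrix-vector product: outer loop across thread blocks (gang),
 * inner loop across threads within a block (vector), with a per-row
 * reduction.  A is n*n, row-major. */
void matvec(int n, const double *A, const double *x, double *y)
{
    #pragma acc parallel loop gang copyin(A[0:n*n], x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < n; j++)
            sum += A[i*n + j] * x[j];
        y[i] = sum;
    }
}
```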
24 -hacc_model options
- auto_async_(none|kernel|all)
  - The compiler automatically adds some asynchronous behavior
  - Only overlaps host and accelerator; no automatic overlap of different accelerator constructs (single stream)
  - May require some explicit user waits
- host_data
- [no_]fast_addr
  - Uses 32-bit variables/calculations for index expressions
  - Faster address computation; fewer registers
- [no_]deep_copy
  - Enables automatic deep-copy support
25 OpenACC insights
26 parallel vs. kernels
- parallel and kernels regions look very similar
  - Both define a region to be accelerated
  - Different heritage; different levels of obligation for the compiler
- parallel: prescriptive (like the OpenMP programming model)
  - Uses a single accelerator kernel to accelerate the region
  - The compiler will accelerate the region (even if this leads to incorrect results)
- kernels: descriptive (like the PGI Accelerator programming model)
  - Uses one or more accelerator kernels to accelerate the region
  - The compiler may accelerate the region (if it decides the loop iterations are independent)
- Which to use (my opinion)
  - parallel (or parallel loop) offers greater control and fits better with the OpenMP model
  - kernels (or kernels loop) is better for initially exploring parallelism, but not knowing whether a loopnest is accelerated could be a problem
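The contrast can be sketched in C (an illustrative example, not from the slides). Both versions compute the same SAXPY; the difference is in the obligation: with parallel loop the programmer asserts the iterations are independent, while with kernels the compiler must convince itself before generating a kernel.

```c
/* Prescriptive: the compiler WILL parallelize, trusting the programmer. */
void saxpy_parallel(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Descriptive: the compiler MAY parallelize, if it proves independence. */
void saxpy_kernels(int n, float a, const float *x, float *y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```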
27 parallel loop vs. parallel and loop
- A parallel region can span multiple code blocks, i.e. sections of serial code statements and/or loopnests
- Loopnests in a parallel region are not automatically partitioned
  - You need to explicitly use the loop directive for this to happen
- Scalar code (serial code, loopnests without a loop directive) is executed redundantly, i.e. identically by every thread
  - Or maybe just by one thread per block (it's implementation-dependent)
- There is no synchronisation between redundant code or kernels
  - This offers potential for overlap of execution on the GPU
  - It also offers potential (and likelihood) of race conditions and incorrect code
- There is no mechanism for a barrier inside a parallel region
  - After all, CUDA offers no barrier on the GPU across threadblocks
  - To effect a barrier, end the parallel region and start a new one
  - Also use a wait directive outside the parallel region for extra safety
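The "end the region to get a barrier" advice can be sketched as follows (an illustrative C example, not from the slides): the second parallel region cannot start reading a until the first region writing it has completed, because the regions are separate and synchronous.

```c
/* Two-phase computation with an inter-phase "barrier" obtained by ending
 * the first parallel region before starting the second. */
double sum_of_squares(int n, double *a)
{
    double s = 0.0;

    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = a[i] * a[i];

    /* Region boundary: phase 1 is complete before phase 2 begins. */

    #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
    for (int i = 0; i < n; i++)
        s += a[i];

    return s;
}
```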
28 parallel loop vs. parallel and loop
Some advice: don't...
- GPU threads are very lightweight (unlike OpenMP), so don't worry about having extra parallel regions
- Explicit use of the async clause may achieve the same results as using one parallel region, but with greater code clarity and better control over overlap
... but if you feel you must:
- Begin with the composite parallel loop and get correct code
- Separate the directives with care, only as a later performance tuning, when you are sure the kernels are independent and there are no race conditions
29 parallel loop vs. parallel and loop: when you actually might want to
You might split the directive if you have a single loopnest and you need explicit control over the loop scheduling:
- You do this with multiple loop directives inside the parallel region
- Or you could use parallel loop for the outermost loop, and loop for the others
But beware of reduction variables. With separate loop directives, you need a reduction clause on every loop directive that includes a reduction, at least with CCE:

  ! Correct!
  t = 0
  !$acc parallel loop &
  !$acc reduction(+:t)
  DO j = 1,N
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel loop

  ! Wrong!
  t = 0
  !$acc parallel &
  !$acc reduction(+:t)
  !$acc loop
  DO j = 1,N
    !$acc loop
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel

  ! Wrong!
  t = 0
  !$acc parallel
  !$acc loop reduction(+:t)
  DO j = 1,N
    !$acc loop
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel

  ! Correct!
  t = 0
  !$acc parallel
  !$acc loop reduction(+:t)
  DO j = 1,N
    !$acc loop reduction(+:t)
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel
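The last "Correct!" variant transcribed to C, as an illustrative sketch (the slides show it in Fortran): note the reduction clause on both loop directives, as the slide requires for CCE.

```c
/* Nested 2-D sum with the reduction clause repeated on every loop
 * directive that participates in the reduction.  a is n*n, row-major. */
double sum2d(int n, const double *a)
{
    double t = 0.0;
    #pragma acc parallel copyin(a[0:n*n])
    {
        #pragma acc loop reduction(+:t)
        for (int j = 0; j < n; j++) {
            #pragma acc loop reduction(+:t)
            for (int i = 0; i < n; i++)
                t += a[j*n + i];
        }
    }
    return t;
}
```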
30 parallel gotchas
- No loop directive: the code will (or may) run redundantly
  - Every thread does every loop iteration; not usually what we want

  !$acc parallel
  DO i = 1,N
    a(i) = b(i) + c(i)
  ENDDO
  !$acc end parallel

- Serial code in a parallel region: avoids copyin(t), but a good idea? No!
  - Every thread sets t=0
  - Asynchronicity: no guarantee this finishes before the loop kernel starts
  - Race condition, unstable answers

  !$acc parallel
  t = 0
  !$acc loop reduction(+:t)
  DO i = 1,N
    t = t + a(i)
  ENDDO
  !$acc end parallel

- Multiple kernels: again, a potential race condition
  - Treat OpenACC "end loop" like OpenMP "enddo nowait"

  !$acc parallel
  !$acc loop
  DO i = 1,N
    a(i) = 2*a(i)
  ENDDO
  !$acc loop
  DO i = 1,N
    a(i) = a(i) + 1
  ENDDO
  !$acc end parallel
31 Declare link

  int a[100000];
  #pragma acc declare link(a)

  int main() {
    #pragma acc parallel loop
    for( int i = 0; i < 100000; i++ ) {
      ...
    }
  }

  int a[100000];
  #pragma acc declare link(a)

  #pragma acc routine gang
  void foo() {
    #pragma acc loop gang worker vector
    for( int i = 0; i < 100000; i++ ) {
      ...
    }
  }

  int main() {
    #pragma acc parallel copy(a)
    foo();
  }
33 Summary
- The Cray Programming Environment support for OpenACC was introduced
- An in-depth look at OpenACC support in CCE was presented
- A few insights gained while implementing and working with OpenACC were presented
Final thoughts: there is still work to do, in CCE and in OpenACC
34 Upcoming GTC Express Webinars
- November 20: Improving Performance using the CUDA Memory Model and Features of the Kepler Architecture
- November 21: Speeding Up Financial Risk Management Cost Efficiently for Intra-day and Pre-deal CVA Calculations
- December 3: CUDA Tools for Optimal Performance and Productivity
- December 12: GPU-accelerated High Performance Geospatial Line-of-sight Calculations
35 GTC 2014 Call for Posters Open
Posters should describe novel or interesting topics in:
- Science and research
- Professional graphics
- Mobile computing
- Automotive applications
- Game development
- Cloud computing
Submit for a chance to win the Best Poster Award
More informationOpenACC Fundamentals. Steve Abbott November 13, 2016
OpenACC Fundamentals Steve Abbott , November 13, 2016 Who Am I? 2005 B.S. Physics Beloit College 2007 M.S. Physics University of Florida 2015 Ph.D. Physics University of New Hampshire
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationParallel Programming. Libraries and implementations
Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationPortable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.
Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences
More informationModule 10: Open Multi-Processing Lecture 19: What is Parallelization? The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program
The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program Amdahl's Law About Data What is Data Race? Overview to OpenMP Components of OpenMP OpenMP Programming Model OpenMP Directives
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationCS 470 Spring Mike Lam, Professor. Advanced OpenMP
CS 470 Spring 2018 Mike Lam, Professor Advanced OpenMP Atomics OpenMP provides access to highly-efficient hardware synchronization mechanisms Use the atomic pragma to annotate a single statement Statement
More informationAdrian Tate XK6 / openacc workshop Manno, Mar
Adrian Tate XK6 / openacc workshop Manno, Mar6-7 2012 1 Overview & Philosophy Two modes of usage Contents Present contents Upcoming releases Optimization of libsci_acc Autotuning Adaptation Asynchronous
More informationOverview. Lecture 6: odds and ends. Synchronicity. Warnings. synchronicity. multiple streams and devices. multiple GPUs. other odds and ends
Overview Lecture 6: odds and ends Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre synchronicity multiple streams and devices multiple GPUs other
More informationProgramming Environment 4/11/2015
Programming Environment 4/11/2015 1 Vision Cray systems are designed to be High Productivity as well as High Performance Computers The Cray Programming Environment (PE) provides a simple consistent interface
More informationLecture 6: odds and ends
Lecture 6: odds and ends Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 6 p. 1 Overview synchronicity multiple streams and devices
More informationPortability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17
Portability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17 11/27/2017 Background Many developers choose OpenMP in hopes of having a single source code that runs effectively anywhere (performance
More informationEXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March
EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng
More informationOpenACC. Arthur Lei, Michelle Munteanu, Michael Papadopoulos, Philip Smith
OpenACC Arthur Lei, Michelle Munteanu, Michael Papadopoulos, Philip Smith 1 Introduction For this introduction, we are assuming you are familiar with libraries that use a pragma directive based structure,
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationGPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline
More informationOPENACC ONLINE COURSE 2018
OPENACC ONLINE COURSE 2018 Week 3 Loop Optimizations with OpenACC Jeff Larkin, Senior DevTech Software Engineer, NVIDIA ABOUT THIS COURSE 3 Part Introduction to OpenACC Week 1 Introduction to OpenACC Week
More informationIs OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels
National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran
More informationPragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray
Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray Pragma-based GPU programming Write programs for GPU processing without (directly) using CUDA/OpenCL Place pragmas to drive processing on
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationADVANCED OPENACC PROGRAMMING
ADVANCED OPENACC PROGRAMMING DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES AGENDA Optimizing OpenACC Loops Routines Update Directive Asynchronous Programming Multi-GPU
More informationMULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA
MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network
More informationOmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel
www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400
More informationOpenCL TM & OpenMP Offload on Sitara TM AM57x Processors
OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast
More informationAccelerator programming with OpenACC
..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling
More informationNever forget Always use the ftn, cc, and CC wrappers
Using Compilers 2 Never forget Always use the ftn, cc, and CC wrappers The wrappers uses your module environment to get all libraries and include directories for you. You don t have to know their real
More informationDATA-MANAGEMENT DIRECTORY FOR OPENMP 4.0 AND OPENACC
DATA-MANAGEMENT DIRECTORY FOR OPENMP 4.0 AND OPENACC Heteropar 2013 Julien Jaeger, Patrick Carribault, Marc Pérache CEA, DAM, DIF F-91297 ARPAJON, FRANCE 26 AUGUST 2013 24 AOÛT 2013 CEA 26 AUGUST 2013
More informationEE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California
EE/CSCI 451 Introduction to Parallel and Distributed Computation Discussion #4 2/3/2017 University of Southern California 1 USC HPCC Access Compile Submit job OpenMP Today s topic What is OpenMP OpenMP
More informationAutomatic Testing of OpenACC Applications
Automatic Testing of OpenACC Applications Khalid Ahmad School of Computing/University of Utah Michael Wolfe NVIDIA/PGI November 13 th, 2017 Why Test? When optimizing or porting Validate the optimization
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationINTRODUCTION TO OPENACC
INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationPortable and Productive Performance on Hybrid Systems with OpenACC Compilers and Tools
Portable and Productive Performance on Hybrid Systems with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Major Hybrid Multi Petaflop Systems
More informationGPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation
GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation WHAT IS GPU COMPUTING? Add GPUs: Accelerate Science Applications CPU GPU Small Changes, Big Speed-up Application
More informationCompiler Optimizations. Aniello Esposito HPC Saudi, March 15 th 2016
Compiler Optimizations Aniello Esposito HPC Saudi, March 15 th 2016 Using Compiler Feedback Compilers can generate annotated listing of your source code indicating important optimizations. Useful for targeted
More information