Practical: a sample code
Alistair Hart
Cray Exascale Research Initiative Europe

Aims

The aim of this practical is to examine, compile and run a simple, pre-prepared OpenACC code, in order:
- to familiarise you with the system
- to let you explore the compiler and runtime feedback
The system

You are using a small Cray system called "raven". It is a hybrid XE6/XK6 system:
- XE6: each node is dual AMD Interlagos (a total of 32 cores)
- XK6: each node is one AMD Interlagos and one Nvidia Fermi+ X2090 GPU

You log in and compile on a front-end node. You run jobs by submitting a jobscript to the PBS batch system:
- jobs will not run from the front-end command line
- you select the XK6 nodes by submitting to a special PBS queue

There are two filesystems:
- home directories
- the lustre filesystem
You should submit jobs from a directory on the lustre filesystem.

Raven

Raven is part of the Cray Marketing Partner Network:
- to use it you must agree to the Terms and Conditions
- in particular, you cannot publish or show performance figures without Cray's prior approval

Raven is not backed up. At all.
- the lustre filesystem can be purged at any time if it gets too full
- please copy any important (but small) results files back to your home directory
- the home directories are also not backed up; no-one is going to delete them without notice, but hardware can fail

After the course ends:
- the training accounts that you are using are temporary and will be deleted at the end of the course
- it is your responsibility to copy any files that you wish to keep to another, non-Cray system before the course ends
- if you want a more permanent account to continue working on OpenACC, please contact me
Getting started

Cray uses a Linux-based environment on the login nodes:
- you will have a bash login shell by default
- all the usual Linux commands are available

Software versions are loaded and unloaded using the module command (see man module):
- to see which modules are currently loaded, type: module list
- to see which modules are available, type: module avail
- you can wildcard the end of the names, e.g.: module avail PrgEnv*
- for more complicated grepping, you need to redirect stderr to stdout, e.g.: module avail 2>&1 | grep "Env"
- you load a new module by typing: module load <module name>
- some modules (e.g. different compiler versions) conflict, so you should first "module unload" the old version (or use "module swap")

Programming Environments

A number of different compilers are supported. You select these by loading a Programming Environment module:
- PrgEnv-cray for CCE (the default)
- PrgEnv-pgi for PGI
- PrgEnv-gnu for gcc, gfortran

Once one of these is loaded, you can then select a compiler suite:
- CCE: module avail cce; make sure you type: module swap cce cce/81.newest
- PGI: module avail pgi; use the default module pgi/12.7.0
- Gnu: module avail gcc; use the default module gcc/4.7.1

For GPU programming (CUDA, OpenCL, OpenACC...) make sure you: module load craype-accel-nvidia20
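Put together, a typical session to select the Cray environment for GPU work might look like the sketch below. The module versions are those listed above; on a live system, adjust to whatever "module avail" actually reports.

```shell
module list                         # what is loaded now?
module avail cce                    # which CCE versions exist?
module swap cce cce/81.newest       # select the newest CCE
module load craype-accel-nvidia20   # enable GPU (CUDA/OpenCL/OpenACC) support
module avail 2>&1 | grep "Env"      # module writes to stderr, so redirect before grepping
```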
Using the compilers

You use the compilers via wrapper commands:
- ftn for Fortran; cc for C; CC for C++
- it doesn't matter which PrgEnv is loaded
- the wrappers add optimisation options, architecture-specific settings and all the important library paths
- in many cases, you don't need any other compiler options
- if you want unoptimised code, you must use option -O0

Further information:
- the man pages for the wrapper commands give you general information
- for more detail, see the compiler-specific man pages:
  CCE: crayftn, craycc, crayCC
  PGI: pgfortran, pgcc
  GNU: gfortran, gcc
- you will need the appropriate PrgEnv module loaded to see these

Some Cray Compilation Environment basics

CCE-specific features:
- Optimisation: -O2 is the default and you should usually use this; -O3 activates more aggressive options, which could be faster or slower
- OpenMP is supported by default; if you don't want it, use either the -hnoomp or -xomp compiler flags
- CCE only gives minimal information to stderr when compiling; to see more, you should request a compiler listing file:
  - flag -ra for ftn, or -hlist=a for cc
  - writes a file with extension .lst
  - contains an annotated source listing, followed by explanatory messages
  - each message is tagged with an identifier, e.g. ftn-6430; to get more information on this, type: explain <identifier>

For a full description of the Cray compilers, see the reference manuals at http://docs.cray.com.
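As a concrete sketch (using the sample source filename from later in this practical, with PrgEnv-cray loaded), requesting and reading a listing file might look like:

```shell
ftn -ra first_example_fstatic_v00.f90        # compile; writes first_example_fstatic_v00.lst
grep "ftn-" first_example_fstatic_v00.lst    # pick out the tagged compiler messages
explain ftn-6430                             # expand one message identifier in full
```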
Compiling CUDA

Compilation:
- module load craype-accel-nvidia20
- main CPU code compiled with the PrgEnv "cc" wrapper (either PrgEnv-gnu for gcc, or PrgEnv-cray for craycc)
- GPU CUDA-C kernels compiled with nvcc: nvcc -O3 -arch=sm_20 (update the -arch option for Kepler)
- the PrgEnv "cc" wrapper is used for linking
- the only GPU flag needed is -lcudart; e.g. no CUDA -L flags are needed (they are added by the cc wrapper)

Compiling OpenCL

Compilation:
- module load craype-accel-nvidia20
- main CPU code compiled with the PrgEnv "cc" wrapper (either PrgEnv-gnu for gcc, or PrgEnv-cray for craycc)
- GPU OpenCL kernels compiled with nvcc
- the PrgEnv "cc" wrapper is used for linking
- the only GPU flag needed is -lOpenCL

Alternatively:
- use PrgEnv-gnu for all compilation
- you still need -lOpenCL at link time
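The mixed CUDA build described above could look like this sketch; the source filenames (main.c, kernels.cu) and output name are invented placeholders, not part of the tutorial materials.

```shell
module load craype-accel-nvidia20
nvcc -O3 -arch=sm_20 -c kernels.cu     # GPU kernels via nvcc (bump -arch for Kepler)
cc -c main.c                           # host code via the PrgEnv cc wrapper
cc -o app main.o kernels.o -lcudart    # link via the wrapper; only -lcudart is needed
```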
Submitting jobs and the lustre filesystem

You should submit jobs from the lustre filesystem (you can compile there as well if you wish):
- create a unique directory for yourself: mkdir -p /lus/scratch/$USER (and subdirectories if you want)

To submit a job, create a PBS jobscript:
- there is a skeleton script provided as part of the tutorial materials; just rename the executable
- note that the command aprun is used by the jobscript to run the executable
- submit the job using the command: qsub <jobscript name> (other options are specified in the jobscript)
- a job number (ending in .sdb) is returned
- to view the queued and running jobs: qstat
- to stop a queued or running job: qdel <job number>

The sample code

The sample code is designed to demonstrate functionality; we are not interested in performance at this stage. It implements the simple example from the lectures:
- a 3D array a is initialised
- its values are doubled and stored in a new array b
- a checksum is calculated and compared with the expected result
These are implemented as 3 OpenACC kernels.

There are three versions of the code:
- Version 00 has all 3 kernels in the same main program; there is no attempt to keep data on the GPU between the kernels
- Version 01 uses a data region to avoid data sloshing
- Version 02 has a more complicated calltree: it calls a subroutine that contains an OpenACC kernel, and this kernel also contains a function call
Code versions and building them

There are versions for 4 different programming models:
- C or Fortran, with static or dynamic allocation of arrays
- N.B. there is no version00 for dynamic arrays with C (see the note in version01)
- source filenames are based on these, e.g. first_example_fstatic_v00.f90

Get your environment right:
- make sure you have the right PrgEnv loaded (cray or pgi)
- make sure you have loaded the correct compiler version module
- make sure you have loaded module craype-accel-nvidia20

Build the code:
- PrgEnv-cray: ftn -ra <Fortran source file> or cc -hlist=a <C source file>
- PrgEnv-pgi: ftn -Minfo=all <Fortran source file> or cc -Minfo=all <C source file>

Automation

You can do it all by hand if you wish, or use automation; there's nothing magic being done here.

Automated building:
- just type: make VERSION=[00|01|02] [F|C][static|dynamic]
- the Makefile will echo the commands it uses to build the code
- it automatically detects which PrgEnv you are using (via the PE_ENV env. var.)
- remember to type "make clean" if you switch PrgEnv modules

Automated building and job submission:
- type: bash build_submit.bash MYPE TARGET VERSION
  - MYPE should be cray or pgi
  - TARGET should be Fstatic or Fdynamic or Cstatic or Cdynamic
  - VERSION should be 00 or 01 or 02
- this will:
  - load the correct modules using script ../xk_setup.bash
  - build the code using the Makefile
  - create directory: /lus/scratch/$USER/openacc_training/practical1/target_version_date_time
  - write and submit a PBS jobscript
- you can then cd to this directory and look at the output
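The PE_ENV-based detection mentioned above could be sketched in a Makefile like this. The flag values come from the build commands listed in this practical; the variable names (FFLAGS, CFLAGS) and target are invented for illustration and are not necessarily those used by the tutorial's actual Makefile.

```make
# The PrgEnv-* modules set PE_ENV to CRAY, PGI or GNU;
# pick the matching feedback flags for each compiler suite.
ifeq ($(PE_ENV),CRAY)
  FFLAGS = -ra
  CFLAGS = -hlist=a
else ifeq ($(PE_ENV),PGI)
  FFLAGS = -Minfo=all
  CFLAGS = -Minfo=all
endif

first_example: first_example_fstatic_v00.f90
	ftn $(FFLAGS) -o $@ $<
```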
What to check

Check correctness:
- did the code compile correctly?
- did the job execute?
- was the answer correct?

Next, understand what the compiler did:
- examine and understand the compiler feedback (CCE: open the .lst file; PGI: read the output to stdout)
- did it compile for the accelerator?
- what data did it plan to move, and when?
- how were the loop iterations scheduled?

What actually ran?

Did we actually run on the accelerator? We can ask the runtime for some feedback:
- cd to the run directory, edit the jobscript and uncomment the appropriate line
  - CCE: set CRAY_ACC_DEBUG to 1 (least detailed) to 3 (most detailed)
  - PGI: set ACC_NOTIFY
- resubmit the job: qsub <jobscript name>
- examine the commentary (in the log file) and make sure you understand it

Profiling the code:
- a quick way of profiling is to use the Nvidia compute profiler
- CCE and PGI compile to PTX (as does nvcc), so this will work for all
- edit the jobscript and uncomment the profiling line, then resubmit the job
- examine the profile (in file cuda_profile_0.log); you can change the location with env. var. COMPUTE_PROFILE_LOG
- this is a "blow-by-blow" account; larger codes need a more aggregated report
- we will cover profiling in more detail later
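The relevant part of the jobscript might look like the sketch below once the debug and profiling lines are uncommented. The aprun arguments and executable name are placeholders (the skeleton jobscript in the tutorial materials has the real ones), and COMPUTE_PROFILE=1 is the usual way to switch on the Nvidia compute profiler.

```shell
# Runtime feedback: uncomment the line for your compiler
export CRAY_ACC_DEBUG=2       # CCE commentary: 1 (least detailed) to 3 (most)
# export ACC_NOTIFY=1         # PGI equivalent

# Profiling via the Nvidia compute profiler
# export COMPUTE_PROFILE=1
# export COMPUTE_PROFILE_LOG=cuda_profile_0.log   # optional: change the log location

aprun -n 1 ./a.out            # placeholder aprun line; see the skeleton jobscript
```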
Further work

Choose a target and repeat this for all three versions:
- start with the Cray compiler
- then either repeat for a different programming model target, or try the PGI compiler

Getting the examples

On raven:
- change to a directory where you want to work, either in your home directory or under /lus/scratch/$USER
- type: tar zxvf ~tr99/cray_openacc_training.tgz
- this creates a new directory ./cray_openacc_training
- please note the file LICENCE.txt
- the codes for Practical 1 are in: Cray_OpenACC_training/Practical1
- there is a README file that summarises these slides