Accelerator programming with OpenACC

1 Accelerator programming with OpenACC. Colaboratorio Nacional de Computación Avanzada. Jorge Castro, 2018.

2 Agenda
  1. Introduction
  2. OpenACC life cycle
  3. Hands-on session
     - Profiling and parallelizing
     - Optimizing data movement
  4. Best practices

3 Introduction

4 What is OpenACC?
OpenACC is a parallel programming standard that defines a set of compiler directives for C, C++ and Fortran. The directives mark regions of code to be offloaded from a host CPU to an attached accelerator.
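To make the idea of a directive concrete, here is a minimal sketch of an offloaded loop. The function name saxpy and its parameters are chosen for illustration and do not come from the slides. A compiler without OpenACC support ignores the pragma and runs the loop serially on the host, producing the same result.

```c
/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * The directive asks an OpenACC compiler to run the loop on the accelerator:
 * x is copied to the device, y is copied to the device and back. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source stays ordinary C: removing the pragma line leaves a valid serial program, which is what makes the directive approach incremental.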

5 What is an Accelerator?
A dedicated piece of hardware that performs specific functions faster than a CPU.
  - Graphics Processing Unit (GPU): electronic device that runs computer graphics algorithms to render images
  - Coprocessor: electronic device that supplements the functions of the CPU (arithmetic, encryption, error detection)

6 Top 500 green
Source: (June 2018 list)

7 K40 vs P100 vs V100

Accelerator   Cores   Boost clock   Memory BW   DP perf      SP perf
Tesla K40     2880    875 MHz       288 GB/s    1.7 TFLOPS   5.0 TFLOPS
Tesla P100    [...]   [...] MHz     720 GB/s    5.3 TFLOPS   10.6 TFLOPS
Tesla V100    [...]   [...] MHz     900 GB/s    7.8 TFLOPS   15.7 TFLOPS

8 Architecture

9 Heterogeneous computing
Heterogeneous computing combines more than one type of processor, for example CPUs and GPUs, in the same system.

10 CPU vs GPU

Features               CPU     GPU
Main memory            large   small
Memory bandwidth       low     high
Clock frequency        high    low
Performance per watt   low     high
Throughput [1]         low     high

[1] Number of operations per unit of time

11 Why use OpenACC?
  - Simple
  - Portable (Nvidia GPUs, Intel and AMD CPUs)
  - Interoperable (CUDA, MPI, OpenMP)
  - Powerful (about 90% of the performance of CUDA)

12 Why use OpenACC? (2)

13 Motivation

14 Automatic Manatee Count Method

15 Automatic Manatee Count Method

16 Automatic Manatee Count Method

17 Automatic Manatee Count Method

18 Automatic Manatee Count Method

19 Automatic Manatee Count Method

20 Automatic Manatee Count Method

21 Denoising method (panels: Original, Denoised)

22 Motivation
Figure: Cell segmentation and tracking

23 OpenACC life cycle

24 OpenACC life cycle

25 Jacobi iteration

26 Jacobi iteration (2)

27 OpenACC life cycle

28 Identify parallelism

29 Identify parallelism (2)

30 OpenACC life cycle

31 Express parallelism

32 Express parallelism

33 Express parallelism (2)

34 Express parallelism (3)

35 Express parallelism (4)

36 Express parallelism (5)

37 Express parallelism (6)

38 Express parallelism (7)
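The figures for these slides did not survive extraction. As a rough illustration of expressing parallelism on the deck's running example, the sketch below annotates one Jacobi sweep for the 2D Laplace equation. The grid size N, the function name jacobi_sweep and the array names are assumptions made for this sketch, not taken from the slides.

```c
#define N 64  /* hypothetical grid size, chosen for illustration */

/* One Jacobi sweep: each interior point of anew becomes the average of its
 * four neighbours in a. Returns the largest change made to any point. */
double jacobi_sweep(double a[N][N], double anew[N][N])
{
    double err = 0.0;

    /* Every (j, i) iteration is independent except for the shared maximum,
     * which the reduction clause makes safe to compute in parallel. */
    #pragma acc parallel loop reduction(max:err) \
        copyin(a[0:N][0:N]) copy(anew[0:N][0:N])
    for (int j = 1; j < N - 1; ++j) {
        #pragma acc loop reduction(max:err)
        for (int i = 1; i < N - 1; ++i) {
            anew[j][i] = 0.25 * (a[j][i + 1] + a[j][i - 1]
                               + a[j - 1][i] + a[j + 1][i]);
            double diff = anew[j][i] - a[j][i];
            if (diff < 0.0) diff = -diff;
            if (diff > err) err = diff;
        }
    }
    return err;
}
```

With the data clauses placed on the compute construct, both arrays cross the host-device link on every call, which is the cost that the data movement step of the life cycle addresses.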

39 OpenACC life cycle

40 Express data movement

41 Express data movement (2)

42 Express data movement (3)

43 Express data movement (4)

44 Express data movement (4)

45 Express data movement (5)

46 Express data movement (6)

47 Express data movement (7)

48 Express data movement (8)
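The slide figures are not preserved here. One common way to express data movement, sketched under the assumption of a Jacobi solver like the one the deck uses, is a structured data region that keeps the arrays on the device across all iterations. The names jacobi_solve, tol and max_iter and the grid size N are hypothetical.

```c
#define N 16  /* hypothetical grid size, kept small for illustration */

/* Runs Jacobi sweeps on a until the largest per-point change is at most tol
 * or max_iter sweeps have run. Returns the number of sweeps performed.
 * anew is scratch space with the same shape as a. */
int jacobi_solve(double a[N][N], double anew[N][N], double tol, int max_iter)
{
    int iter = 0;
    double err = tol + 1.0;

    /* a is copied to the device once on entry and back once on exit;
     * anew is only allocated on the device and never transferred. */
    #pragma acc data copy(a[0:N][0:N]) create(anew[0:N][0:N])
    {
        while (err > tol && iter < max_iter) {
            err = 0.0;

            #pragma acc parallel loop collapse(2) reduction(max:err)
            for (int j = 1; j < N - 1; ++j) {
                for (int i = 1; i < N - 1; ++i) {
                    anew[j][i] = 0.25 * (a[j][i + 1] + a[j][i - 1]
                                       + a[j - 1][i] + a[j + 1][i]);
                    double diff = anew[j][i] - a[j][i];
                    if (diff < 0.0) diff = -diff;
                    if (diff > err) err = diff;
                }
            }

            /* Copy the interior back so the next sweep reads the new values. */
            #pragma acc parallel loop collapse(2)
            for (int j = 1; j < N - 1; ++j) {
                for (int i = 1; i < N - 1; ++i) {
                    a[j][i] = anew[j][i];
                }
            }
            ++iter;
        }
    }
    return iter;
}
```

Only the scalar err travels between host and device inside the loop; the arrays move twice in total regardless of how many sweeps run.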

49 OpenACC life cycle

50 Optimize loop performance

51 Optimize loop performance (2)

52 Optimize loop performance (3)

53 Optimize loop performance (4)

54 Optimize loop performance (5)

55 Hands-on session

56 OpenACC life cycle

57 Profiling tools
A profiler lets you analyze the behaviour of a program:
  - Duration of function calls
  - Performance optimization
Graphical profiling tools: nvvp, pgprof, vampir, etc.
Command-line profiling tools: nvprof, gprof, etc.
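The duration of a function call, which a profiler reports automatically, can also be measured by hand as a coarse first check. The sketch below uses the standard clock() function; the workload sum_of_squares and the wrapper time_sum_of_squares are hypothetical stand-ins for a real hot spot.

```c
#include <time.h>

/* Hypothetical workload: returns 1*1 + 2*2 + ... + n*n. */
double sum_of_squares(int n)
{
    double sum = 0.0;
    for (int i = 1; i <= n; ++i) {
        sum += (double)i * (double)i;
    }
    return sum;
}

/* Calls sum_of_squares(n) once, stores its value in *result, and returns
 * the CPU time the call took, in seconds. */
double time_sum_of_squares(int n, double *result)
{
    clock_t start = clock();
    *result = sum_of_squares(n);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Manual timing like this only covers the calls that were instrumented, whereas tools such as nvprof or pgprof cover the whole run, including device kernels and transfers.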

58 CUDA Unified Memory

59 Nvidia OpenACC course repository
  - Log into cluster Kabré
  - Pull repository CRHPCS
        cd CRHPCS
        git pull
  - Load CUDA toolkit
        module load cuda/
  - Load PGI compiler
        module load pgi/

60 Profiling and parallelizing
  - Access laboratory #2
        cd openacc/lab2/c99/
  - Open README file in browser
  - Complete steps 0-3 (send jobs to queue: k40)

Action                       Command
Submit job to queue system   qsub [jobname.pbs]
Check job status             watch -n 5 qstat -u USERNAME
Check GPU info               nvidia-smi

61 Compiler flags
PGI C compiler: pgcc
PGI C++ compiler: pgc++

Flag                                Action
-acc                                Enable OpenACC directives
-fast                               Choose optimal flags for target platform
-ta=[tesla:managed,multicore,etc]   Specify accelerator type
-Minfo=[accel,all,etc]              Show compilation information
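As a sketch of how the flags above combine, the comment at the top of this small source file shows two possible pgcc invocations. The file name scale.c and the function scale are hypothetical; the flags themselves are the ones listed in the table.

```c
/* Possible build commands (file name is hypothetical):
 *   pgcc -acc -fast -ta=tesla:managed -Minfo=accel scale.c   GPU, managed memory
 *   pgcc -acc -fast -ta=multicore     -Minfo=accel scale.c   host CPU cores
 * With -Minfo=accel the compiler reports how it handled each annotated loop. */

/* Multiplies every element of v[0..n) by factor, in place. */
void scale(int n, double factor, double *restrict v)
{
    #pragma acc kernels copy(v[0:n])
    for (int i = 0; i < n; ++i) {
        v[i] = v[i] * factor;
    }
}
```

The same source builds for either target; only the -ta value changes.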

62 Optimizing data movement
  - Access laboratory #3
        cd openacc/lab3/c99/
  - Open README file in browser
  - Complete steps 0, 1, 2 and 4 (send jobs to queue: k40)

63 Best practices

64 Optimization tips
  - Use the restrict keyword to avoid false loop dependencies (pointer aliasing)
  - collapse(n), useful when there are:
      - Many nested loops
      - Very small loops
  - tile(n[,m,...]), useful when data locality is high
  - Efficient loop ordering:
      - Innermost loop iterates over the fastest-varying array dimension
      - Improves cache efficiency (accesses consecutive memory addresses)
  - On NVIDIA devices:
      - Vector lengths must be multiples of 32 (up to 1024)
      - (workers x vector) must be less than [...]
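Three of the tips above (restrict, collapse and loop ordering) can be sketched together on a small matrix addition. The function add_matrices and its row-major flat layout are assumptions made for this illustration.

```c
/* Adds two rows x cols matrices stored row-major in flat arrays.
 * restrict promises the compiler that a, b and c do not overlap, so it
 * need not assume a dependency between iterations. */
void add_matrices(int rows, int cols,
                  const float *restrict a,
                  const float *restrict b,
                  float *restrict c)
{
    /* collapse(2) merges both loops into one iteration space, which helps
     * when either loop alone is too short to fill the device. */
    #pragma acc parallel loop collapse(2) \
        copyin(a[0:rows * cols], b[0:rows * cols]) copyout(c[0:rows * cols])
    for (int r = 0; r < rows; ++r) {
        /* The inner loop walks the column index, the fastest-varying
         * dimension in row-major storage, so accesses are consecutive. */
        for (int col = 0; col < cols; ++col) {
            c[r * cols + col] = a[r * cols + col] + b[r * cols + col];
        }
    }
}
```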

65 Current limitations
Shallow copy vs deep copy [2]

[2] Beyer, James, David Oehmke, and Jeff Sandoval. Transferring user-defined types in OpenACC. Proceedings of Cray User Group (2014).
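The difference between the two kinds of copy can be shown in plain host C, as in the sketch below. The type vec_t and both functions are hypothetical. The limitation arises because a data clause applied to such a struct transfers the struct's bytes, pointer value included, which corresponds to the shallow case: the pointed-to buffer is not transferred with it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-defined type with a pointer member. */
typedef struct {
    int n;
    double *data;
} vec_t;

/* Shallow copy: copies n and the pointer value only, so the copy and the
 * original share the same buffer. */
vec_t shallow_copy(const vec_t *src)
{
    return *src;
}

/* Deep copy: allocates a new buffer and copies the elements into it.
 * On allocation failure the result has data == NULL and n == 0. */
vec_t deep_copy(const vec_t *src)
{
    vec_t dst = { 0, NULL };
    dst.data = malloc((size_t)src->n * sizeof *dst.data);
    if (dst.data != NULL) {
        dst.n = src->n;
        memcpy(dst.data, src->data, (size_t)src->n * sizeof *dst.data);
    }
    return dst;
}
```

After a shallow copy, a write through either struct is visible through the other; after a deep copy, the two buffers evolve independently and the copy must be freed separately.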

66 Current limitations (2)
  - Debugging is complicated
  - Unsupported use of print functions
  - Limited use of dynamic memory in accelerated regions
  - Some math library functions are still unsupported
  - OpenACC is still under development (compiler bugs)

67 Summary
  - Minimize data movement
  - Maximize compute intensity
  - The more explicit the mapping of parallelism, the less portable the code
  - Use the device_type clause for architecture-specific optimizations
  - When using OpenACC:
      - Measure sequential performance
      - Understand program structure and data movement
      - Find hot spots (profilers: pgprof, nvvp)
      - Ensure safe parallelism
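The device_type clause mentioned above could be used as in the following sketch, which keeps an NVIDIA-specific tuning value out of the way of other targets. The function vec_add and the vector length of 128 are illustrative choices, not recommendations from the slides.

```c
/* Computes c[i] = a[i] + b[i] for i in [0, n). */
void vec_add(int n, const float *restrict a, const float *restrict b,
             float *restrict c)
{
    /* Clauses before device_type apply to every target. vector_length(128)
     * follows device_type(nvidia), so it applies only when compiling for
     * NVIDIA devices; other targets keep the compiler's default. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n]) \
        device_type(nvidia) vector_length(128)
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

This keeps the explicit mapping confined to one architecture, which limits the portability cost noted in the summary.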

68 OpenACC material

69 Acknowledgements
Thank you!
  - Lecture notes by Jeff Larkin, NVIDIA Developer Technologies
  - Lecture notes by Esteban Meneses, CNCA
