Code optimization. Geert Jan Bex

1 Code optimization
Geert Jan Bex
License: this presentation is released under the Creative Commons.

2 CPU

3 Vectorization
- Arithmetic operations are done on registers
- Vector registers for floating point operands are 256 bits wide
  - 4 double precision values: 4 concurrent operations
  - 8 single precision values: 8 concurrent operations
- Theoretical peak performance, double precision:
  4 dp additions per register x clock frequency x 14 cores x 2 sockets = 269 GFLOPS
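The peak figure is a plain product, so it can be redone in a few lines of C. This is only a sketch of that arithmetic. The function name peak_gflops is made up for the illustration, and the clock frequency of 2.4 GHz is an assumption: the slide text does not state it, but it is the value that reproduces the quoted 269 GFLOPS.

```c
/* Theoretical peak = operations per cycle per core x clock (GHz)
 *                    x cores per socket x sockets.
 * Hypothetical helper; the result is in GFLOPS because the clock is in GHz. */
double peak_gflops(int ops_per_cycle, double clock_ghz,
                   int cores_per_socket, int sockets)
{
    return ops_per_cycle * clock_ghz * cores_per_socket * sockets;
}
```

Under the assumed 2.4 GHz clock, peak_gflops(4, 2.4, 14, 2) evaluates to about 268.8, which rounds to the 269 GFLOPS on the slide.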

4 (Counter) examples
- Can be vectorized: all iterations are independent, so the loop is done in chunks of 4

    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i]*c[i];

- Cannot be vectorized: iteration i depends on iteration i - 1

    double a[N], b[N], c[N];
    for (int i = 1; i < N; i++)
        a[i] = a[i-1] + b[i]*c[i];

5 Compiler flags & directives
- GCC compiler family

    gcc -march=corei7-avx -O3
    gcc -ftree-vectorize -march=corei7-avx -O2

  - for feedback, use -ftree-vectorizer-verbose=2
- Intel compiler family

    icc -xHost -O2

  - for feedback, use -qopt-report-phase=vec -qopt-report=3
- Help the compiler using, e.g., #pragma omp simd
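As an illustration of the directive mentioned last, here is a minimal sketch of the vectorizable loop from the previous slide wrapped in a function. The function name multiply_add is hypothetical. With GCC, the pragma is honoured when compiling with -fopenmp or -fopenmp-simd; without such a flag it is ignored and the loop still gives the same result.

```c
#include <stddef.h>

/* Computes a[i] = a[i] + b[i]*c[i] for i in [0, n).
 * restrict promises the compiler that the three arrays do not overlap,
 * and the pragma asks it to vectorize the loop. */
void multiply_add(double *restrict a, const double *restrict b,
                  const double *restrict c, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i]*c[i];
}
```

The pragma is a request, not a proof of correctness: it should only be placed on loops whose iterations really are independent.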

6 Timings for double precision
[Table: timings by number of operations, for dependent and independent iterations; values not preserved]
- Timings are unit-less relative numbers
- Intel compilers 16.x are very smart

7 AVX2
- Haswell, Broadwell CPUs: AVX2 instruction set
- Fused multiply/add: a*x + b is a single operation
- Streaming: 1 addition and 1 multiplication per cycle!
- Integer vector registers: 256 bits wide
- Extra operations for cryptography
- Worth recompiling!

8 AVX-512
- Skylake CPUs: vector registers for floating point operands are 512 bits wide
  - 8 double precision values: 8 concurrent operations
  - 16 single precision values: 16 concurrent operations
- Even more worth recompiling!

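9 Double promotion
- GCC: gcc/g++ -Wdouble-promotion
- Promoted to double: the constant has no suffix, so it is a double

    float area(float radius) { return <constant>*radius*radius; }

- All float: the constant carries the f suffix

    float area(float radius) { return <constant>f*radius*radius; }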
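The two variants can be sketched as complete functions. The literal 3.14159 is an assumption standing in for the constant, which is not legible in the slide text, and the function names are made up so that both versions can live in one file.

```c
/* 3.14159 has type double, so radius is converted to double, the product
 * is computed in double precision and converted back to float on return.
 * gcc -Wdouble-promotion warns about this implicit conversion. */
float area_promoted(float radius)
{
    return 3.14159*radius*radius;
}

/* 3.14159f has type float, so the whole expression stays in float. */
float area_float(float radius)
{
    return 3.14159f*radius*radius;
}
```

Both functions return nearly the same value; the difference lies in the conversions and in the double precision arithmetic the first one performs along the way.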

10 Note of caution
- Intel compilers: aggressive optimization
  - even at -O2
  - reordering of operations/operands
  - may impact precision
- Verify results with
  - -fp-model precise
  - -fp-model source
- Potentially severe performance impact!

11 Multithreading: false sharing

12 Cache lines, again
[Figure: the cache line a[i] ... a[i+7] is held in the L1 and L2 caches of core 0 and core 1, in the L3 cache and in RAM; the L1 and L2 copies of one core are marked invalid. Cache consistency: MESI protocol. Legend, cache line states: shared, modified, exclusive.]

13 Bad news and good news
- Bad news
  - performance degraded by a factor of 1.5 to 4
  - can be hard to spot, e.g., global variables close in memory
- Good news: compilers detect many cases and make variables implicitly thread-private
  - GCC: -O2 or more
  - Intel: -O1 or more
- However, the compiler won't always remedy this, so avoid false sharing!

14 How to avoid?
- Use thread-local variables/copies when possible
- Align C global variables at cache line boundaries, e.g.,

    int counter_t0 __attribute__((aligned(64)));
    int counter_t1 __attribute__((aligned(64)));

- Align Fortran variables at cache line boundaries (Intel only), e.g.,

    integer :: counter_t0, counter_t1
    !dir$ attributes align:16 :: counter_t0, counter_t1

- Pad C structs to multiples of the cache line length, e.g.,

    struct data {
        double x, y, z;
        double padding[5];
    };
    struct data pnts[20] __attribute__((aligned(64)));

- For Fortran user defined types, either
  - use SEQUENCE and carefully order members, or
  - avoid SEQUENCE, but use compiler flag -align rec16byte
- Cost: larger memory footprint!
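The padding and alignment advice can be combined for per-thread counters. The following is a minimal sketch, assuming 64-byte cache lines (as on the slide) and a compiler that accepts the GCC-style attribute. All names (padded_counter, counters, count_event, total_events) and the thread count are hypothetical, and thread creation is left out.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line length in bytes */
#define N_THREADS 4     /* hypothetical number of threads */

/* One counter per thread, padded so that each fills a whole cache line. */
struct padded_counter {
    long value;
    char padding[CACHE_LINE - sizeof(long)];
};

/* The array starts on a cache line boundary, so counters[t] and
 * counters[t + 1] never share a cache line. */
struct padded_counter counters[N_THREADS] __attribute__((aligned(CACHE_LINE)));

/* Thread thread_id only ever touches its own element. */
void count_event(int thread_id)
{
    counters[thread_id].value++;
}

/* Combine the per-thread counters once the threads are done. */
long total_events(void)
{
    long total = 0;
    for (int t = 0; t < N_THREADS; t++)
        total += counters[t].value;
    return total;
}
```

The cost mentioned on the slide is visible here: each counter occupies 64 bytes instead of the size of a single long.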

15 Feedback-guided optimization

16 Philosophy
- "The proof of the pudding is in the eating"
- Build application with instrumentation
- Run application: this creates a profile
- Rebuild application, using the profile to guide optimizations
- Depends on quality of run: must be representative for general use
  - CPU/memory architecture
  - input data/parameters
- YMMV: expect an improvement of < 10 %

17 GCC compilers
- Build with instrumentation

    gcc -fprofile-generate -o appl.exe

- Run as usual
- Build using profile

    gcc -fprofile-use=appl.exe.gcda -o appl.exe
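To make the idea concrete, here is a sketch of the kind of code where a profile carries information the compiler cannot see in the source. The function sum_clipped is hypothetical and not taken from the presentation. Whether the branch is taken rarely or often depends entirely on the input, and that is what the instrumented run records.

```c
#include <stddef.h>

/* Sums the values, clipping each one at limit. */
long sum_clipped(const int *values, size_t n, int limit)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (values[i] > limit)   /* rare or common? only a run can tell */
            sum += limit;
        else
            sum += values[i];
    }
    return sum;
}
```

If the instrumented run uses input where clipping is rare while production data clips most values, the profile misleads the compiler, which is why the run must be representative.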

18 Intel compilers
- Build with instrumentation

    icc -prof-gen -prof-dir=./profs -o appl.exe

- Run as usual
- Build using profile

    icc -prof-use -prof-dir=./profs -ipo -o appl.exe

19 Example
- Code illustrating
  - cache hierarchy
  - improved pre-fetching
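The presentation's own example code is not part of this text. As a stand-in, the sketch below shows a common way to make cache effects visible: summing the same matrix in two traversal orders. The names and the 512 x 512 size are assumptions chosen for the illustration.

```c
#define ROWS 512   /* hypothetical matrix dimensions */
#define COLS 512

/* Row by row: C stores arrays in row-major order, so consecutive accesses
 * are adjacent in memory. Every cache line that is loaded is fully used,
 * and the access pattern is easy to prefetch. */
double sum_row_major(double m[ROWS][COLS])
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

/* Column by column: consecutive accesses are COLS*sizeof(double) bytes
 * apart, so each access touches a different cache line. */
double sum_col_major(double m[ROWS][COLS])
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions do the same arithmetic on the same elements, so timing them for increasing matrix sizes is one way to see the cache hierarchy at work. How large the difference is depends on the CPU and the compiler.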

20 Conclusion

21 Useful references
- Gallery of processor cache effects
- Avoiding and Identifying False Sharing Among Threads
- Vectorization
  - A guide to vectorization with Intel C++ compilers
  - Auto-vectorization with gcc 4.7
- Introduction to High Performance Computing for Scientists and Engineers, Georg Hager & Gerhard Wellein, Chapman & Hall, 2010
- Why has CPU frequency ceased to grow?

22 Profiling with Allinea MAP
Geert Jan Bex
License: this presentation is released under the Creative Commons.

23 Introduction

24 Introduction
- Allinea Forge
  - Allinea DDT: parallel debugger
  - Allinea MAP: parallel profiler
- Commercial product
  - floating licence, token based
  - 64 tokens, e.g.,
    - 2 MAP sessions of 32 processes
    - 1 MAP + 1 DDT session of 32 processes
    - 1 DDT session of 64 processes
  - analyzing a profile offline: half price

25 Supported programming models
- Serial applications
- Shared memory programming: OpenMP
- GPU programming: CUDA
- Distributed programming: MPI, UPC
- Weapon of choice for MPI (+ OpenMP)
- Debugging/profiling at scale: user interface optimized

26 Alternatives
- Debugging
  - Commercial: RogueWave TotalView
  - Open source: Eclipse PTP
- Profiling
  - Open source: Scalasca, Paraver + Extrae

27 Profiling

28 Workflow
- Concentrate on single node: profile & analyze bottlenecks
  - memory access?
  - cache use?
  - vectorization?
  - branch prediction?
  - OpenMP overhead?
- Inter-node communication: profile & analyze bottlenecks
  - granularity of communication/computation?
  - domain decomposition?
  - suboptimal MPI calls?

29 Methodology
- MAP uses sampling (call stack)
  - no instrumentation
  - simply compile with -g for details
- Overhead is minimal (5-10 % at most)
- Works with many MPI implementations
  - Intel MPI
  - OpenMPI
  - MVAPICH

30 Startup

    $ module load AllineaForge
    $ map

- Start profiling interactively
- Load a profile to analyze
- Start a job to profile interactively

31 Interactive profiling
- Run configuration
- Choose application
- Application arguments
- Working directory
- Configure MPI/OpenMP
- Start run/profiling
- Settings are remembered between runs

32 Results

33 Line breakdown
- Vector floating point
- Memory access 74 %
- No branching

34 Time line
- Axes: processes versus time
- min, max, mean, s.d. available
- Zoom by selecting: all views are updated!
- Can display many metrics, in any combination
  - CPU instructions
  - I/O: disk read/write
  - MPI
    - number of calls, peer-to-peer & collectives/s
    - peer-to-peer & collectives bandwidth
    - send & receive bandwidth
- Very useful to identify run phases

35 Source code view
- Code can be folded
- Line based activity, color coded
  - communication
  - compute
- Easy to navigate through code
  - go to function definitions in any file
  - requires compiling with -g

36 Call stack view
- Ordered by % runtime
- Click to go to code
- Navigate to source code

37 Round tripping
- From within Allinea MAP
  - edit code
  - rebuild
  - profile
  - commit in version control system
- Switch between MAP and DDT

38 Interactive profiling via job
- Job will run on compute nodes

39 Batch profiling via job

    #!/bin/bash -l
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=1:00:00
    map --np 4 --profile --stop-after 3500 \
        diffusion.exe

- Submit job
- When done, open profile with MAP

40 Communication vs. computation

41 Domain decomposition
[Figure: two plots of resources versus walltime, annotated "lots of MPI chatter!" and "load imbalance"]
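One source of load imbalance is an uneven split of the domain. A minimal sketch of an even one-dimensional block decomposition follows. It makes no MPI calls, the function name block_range is hypothetical, and it assumes that every cell costs the same amount of work, which real applications may violate.

```c
/* Assign n cells to n_ranks processes as evenly as possible:
 * the first (n % n_ranks) ranks get one extra cell.
 * On return, rank owns cells [*start, *start + *count). */
void block_range(long n, int n_ranks, int rank, long *start, long *count)
{
    long base = n / n_ranks;
    long extra = n % n_ranks;

    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}
```

For 10 cells on 4 ranks this yields blocks of 3, 3, 2 and 2 cells, so no rank holds more than one cell above any other.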

42 Conclusions

43 Conclusions
- MAP is excellent for applications with many processes/threads
  - easy to get an overview
  - however, works well for serial code too
- Timeline is a valuable tool
- Very easy to use, but correct interpretation requires insight
- Drawback: limited by the number of tokens
  - number of processes
  - concurrent sessions
- As any tool, it is not a Swiss army knife
  - use in combination with other tools, e.g., Intel

44 Appendix

45 Tools
- Use a profiler, it is the law!
- Profilers
  - gprof
  - Scalasca
  - AllineaForge MAP
  - Intel VTune
- Monitoring
  - numastat
  - mpstat
- Hardware information
  - lscpu: CPU information, including cache size and NUMA configuration
  - lstopo-no-graphics: more detailed cache topology
  - Intel mlc: provides memory bandwidth & latency info

46 Latency

Compute time (core)
    erf                                80 cycles
    **                                 40 cycles
    exp                                20 cycles
    / or sqrt                          10 cycles
    + or *                             5 cycles
    streaming DP fused multiply/add    1 cycle

Latency
    disk I/O                           [value missing] cycles
    GPGPU                              [value missing] cycles
    Infiniband                         5000 cycles
    OpenMP barrier                     1500 cycles
    RAM                                150 cycles
    L3 cache                           50 cycles
    L2 cache                           20 cycles
    L1 cache                           5 cycles

47 Bandwidth
- RAM
  - ivybridge (dual socket, 10 core): 93 GB/s
  - haswell (dual socket, 12 core): 110 GB/s
  - broadwell (dual socket, 14 core): 125 GB/s
- QPI
  - ivybridge: 25 GB/s
  - haswell: 30 GB/s
  - broadwell: 30 GB/s
- GPGPU RAM (K40c): [value missing] GB/s
- SATA revision 3: 0.6 GB/s
- SATA revision 3.2: 2.0 GB/s
- SAS 3: 1.2 GB/s
- PCI Express 3.0 (16x): [value missing] GB/s
- Infiniband QDR 4x: 4.0 GB/s
- Infiniband EDR 4x: 12.5 GB/s
- Note: bandwidth depends on message size!
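Why bandwidth depends on message size can be sketched with a simple model in which a transfer takes a fixed latency plus size divided by peak bandwidth. The model, the function name effective_bandwidth_gbs and the 1 microsecond latency used in the usage note are assumptions for illustration, not figures from the slide. The 12.5 GB/s peak is the Infiniband EDR 4x value listed above.

```c
/* Effective bandwidth in GB/s for a message of size_bytes, assuming
 * transfer time = latency + size / peak bandwidth (1 GB/s = 1e9 bytes/s). */
double effective_bandwidth_gbs(double size_bytes, double latency_s,
                               double peak_gbs)
{
    double transfer_time = latency_s + size_bytes / (peak_gbs * 1e9);
    return size_bytes / transfer_time / 1e9;
}
```

Under these assumptions, a 1 kB message reaches less than 1 GB/s because the latency dominates, while a 1 MB message gets above 12 GB/s, close to the peak. This is one reason to prefer a few large messages over many small ones.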
