Performance Engineering
1 Performance Engineering
J. Treibig, Erlangen Regional Computing Center, University of Erlangen-Nuremberg
2 Using the RRZE clusters
Terminal server: cshpc.rrze.uni-erlangen.de
Login nodes: emmy, lima, woody, testfront
Use the batch system to allocate compute resources:
  qsub -I -l nodes=<xxx>:ppn=<xxx>[:xxx] -l walltime=02:00:00
  qsub -I -l nodes=phinally:ppn=32:turbo -l walltime=08:00:00
Module system to control the environment:
  module avail: list available modules
  module load/unload <XXX>: load/unload a specific module
  module list: list loaded modules
  module show <XXX>: show the environment set by a module
3 Best Practices for Benchmarking
Preparation:
  Reliable timing (know the minimum time interval that can be measured)
  Document the code generation (compiler flags, compiler version)
  Get exclusive access to the system
  Document the system state (clock, turbo mode, memory, caches)
  Consider automating runs with a script (shell, Python, Perl)
Doing:
  Affinity control
  Check: Is the result reasonable? Is the result deterministic and reproducible?
  Statistics: mean, best?
Postprocessing:
  Documentation
  Try to understand and explain the result
  Plan variations to gain more information
  Many things can be better understood if you plot them (gnuplot, xmgrace)
4 A(:)=B(:)+C(:)*D(:) on one Sandy Bridge core (3 GHz)
[Figure: performance vs. data set size, with the theoretical limit and the L1D cache (32 KB), L2 cache (256 KB), L3 cache (20 MB), and memory regimes marked]
5 Bandwidth limitations: outer-level cache
[Figure: scalability of the shared data paths in the L3 cache under a varying number of threads]
6 Throughput vector triad on a Sandy Bridge socket (3 GHz)
[Figure: saturation effect in memory vs. scalable bandwidth in the L1, L2, and L3 caches, under varying affinity]
7 Basic process for optimization
Tools:
  Runtime profile: gprof, compiler
  Hardware performance monitoring: likwid-perfctr, perf, PAPI
  Microbenchmarking: STREAM, likwid-bench
  MPI trace tool: Intel ITAC
Process: runtime profiling, algorithm/code analysis, and machine characteristics feed a performance model; a performance pattern is confirmed by kernel benchmarking and metric signatures, and drives the code optimization.
You are an investigator!
8 Using hardware performance metrics likwid-perfctr
9 Hardware performance metrics
  are ubiquitous as a starting point for performance analysis (including automatic analysis)
  are supported by many tools
  are often reduced to cache misses (what could be worse than cache misses?)
Reality: modern parallel computing is plagued by bottlenecks. There are typical performance patterns that cover a large part of the possible performance behaviors.
A performance pattern is identified from HPM signatures, scaling behavior, and other sources of information.
10 Tools for performance engineering
Today: automatic, intelligent tools vs. expert low-level tools (bare-metal approach), which
  enable the user,
  make resource bottlenecks visible,
  and, importantly: don't get in the way!
LIKWID tools: small, flexible and effective tools
  likwid-topology and likwid-pin
  likwid-bench
  likwid-perfctr
  likwid-powermeter
  likwid-mpirun
11 Get to know the machine I: likwid-topology
You need quick and reliable access to all relevant properties of a compute node.
Difficulties:
  Node information is scattered across various places
  The Internet is one source of information, but it may be unreliable
likwid-topology offers:
  All relevant information from one single source
  Reliable data based directly on cpuid
  A quick overview of the thread and memory topology (including Turbo mode steps on Intel CPUs)
12 Get to know the machine I: likwid-topology (cont.)
[Sample output; most numeric fields were lost in transcription:]
  CPU type: Intel Core Westmere processor
  Sockets: 2
  Cores per socket: 6
  Threads per core: …
  HWThread/Thread/Core/Socket mapping for sockets 0 and 1
  Cache topology: 3 levels; L3: 12 MB, unified, 16-way associative, 64-byte cache lines, non-inclusive, shared among 12 threads; cache groups listed per level
  NUMA topology: domains 0 and 1, each with its processor list and free/total memory in MB
13 Controlling affinity of threads: likwid-pin
It is crucial for threaded programs to control thread affinity on today's complex node topologies.
Difficulties:
  Different solutions depending on the threading model or OpenMP implementation
  Either policy-based or using physical processor IDs
  Using environment variables
likwid-pin offers:
  Portable pinning without touching the code
  A simple, accessible command-line interface
  Logical numberings within thread groups
Usage: likwid-pin -c …
14 Controlling affinity of threads: thread groups
Possible unit prefixes for the -c thread list:
  N: node (default if -c is not specified!)
  S: socket
  M: NUMA domain
  C: outer-level cache group
15 Get to know the machine II: likwid-bench
Knowing the performance capabilities of a machine is essential for any optimization effort.
Difficulties:
  Doing time measurements in microbenchmarking is tedious
  Thread and data placement need to be quickly adaptable
  The implementation in a programming language may introduce problems
likwid-bench offers:
  Rapid prototyping of assembly kernels
  Thread management and placement
  Data allocation and NUMA-aware initialization
  Timing and result presentation
  A ready-to-use set of microbenchmarks
16 Get to know the machine II: likwid-bench (cont.)
Benchmarks are simple text files; they are automatically converted, compiled and added to the benchmark application.
  $ likwid-bench -t clcopy -g 1 -i 1000 -w S0:1MB:2
  $ likwid-bench -t copy -g 2 -i 100 -w S1:1GB -w S0:1GB-0:S1,1:S0
Example kernel file (copy; the load/store offsets were lost in transcription):
  STREAMS 2
  TYPE DOUBLE
  FLOPS 0
  BYTES 16
  LOOP 32
  movaps FPR1, [STR0 + GPR1 * 8]
  movaps FPR2, [STR0 + GPR1 * …]
  movaps FPR3, [STR0 + GPR1 * …]
  movaps FPR4, [STR0 + GPR1 * …]
  movaps [STR1 + GPR1 * 8], FPR1
  movaps [STR1 + GPR1 * …], FPR2
  movaps [STR1 + GPR1 * …], FPR3
  movaps [STR1 + GPR1 * …], FPR4
A port to IBM Power is available! Contact me if you are interested.
17 Hardware performance monitoring: likwid-perfctr
Hardware performance monitoring is an indispensable source of information about how your code interacts with the hardware.
Difficulties:
  HPM implementations are volatile and undocumented
  Tools are often vendor-specific
  Events are frequently buggy
  Finding event sets with useful derived metrics is difficult
likwid-perfctr offers:
  Simple end-to-end measurement of hardware performance metrics
  The same tool on any x86 processor
  Portable, ready-to-use, validated performance groups
  Flexible usage modes: wrapper, stethoscope, timeline, marker
  Full coverage of HPM events, including the Uncore
Note: likwid-perfctr supports the energy counters on Sandy Bridge, accessible as a regular event!
18 likwid-perfctr: example usage in wrapper mode
  $ likwid-perfctr -C N:0-3 -g FLOPS_DP ./stream.exe
[Sample output; the counter values were lost in transcription:]
  CPU type: Intel Core Westmere processor
  CPU clock: 2.67 GHz
  (always measured) … YOUR PROGRAM OUTPUT …
  Events (per core 0-3, for the configured group): INSTR_RETIRED_ANY, CPU_CLK_UNHALTED_CORE, CPU_CLK_UNHALTED_REF, FP_COMP_OPS_EXE_SSE_FP_PACKED, FP_COMP_OPS_EXE_SSE_FP_SCALAR, FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
  Derived metrics (per core): Runtime [s], Clock [MHz], CPI, DP MFlops/s, Packed MUOPS/s, Scalar MUOPS/s, DP MUOPS/s
19 likwid-perfctr: group files
Groups are architecture-specific. They are defined in simple text files, and the code for them is generated during the build. likwid-perfctr -a outputs the list of groups; extensive documentation is available for every group.
Example group file:
  SHORT PSTI
  EVENTSET
  FIXC0 INSTR_RETIRED_ANY
  FIXC1 CPU_CLK_UNHALTED_CORE
  FIXC2 CPU_CLK_UNHALTED_REF
  PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED
  PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR
  PMC2 FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
  PMC3 FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
  UPMC0 UNC_QMC_NORMAL_READS_ANY
  UPMC1 UNC_QMC_WRITES_FULL_ANY
  UPMC2 UNC_QHL_REQUESTS_REMOTE_READS
  UPMC3 UNC_QHL_REQUESTS_LOCAL_READS
  METRICS
  Runtime [s] FIXC1*inverseClock
  CPI FIXC1/FIXC0
  Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
  DP MFlops/s (DP assumed) 1.0E-06*(PMC0*2.0+PMC1)/time
  Packed MUOPS/s 1.0E-06*PMC0/time
  Scalar MUOPS/s 1.0E-06*PMC1/time
  SP MUOPS/s 1.0E-06*PMC2/time
  DP MUOPS/s 1.0E-06*PMC3/time
  Memory bandwidth [MBytes/s] 1.0E-06*(UPMC0+UPMC1)*64/time
  Remote Read BW [MBytes/s] 1.0E-06*(UPMC2)*64/time
  LONG
  Formula: DP MFlops/s = (FP_COMP_OPS_EXE_SSE_FP_PACKED*2 + FP_COMP_OPS_EXE_SSE_FP_SCALAR) / runtime
20 Basics for building LIKWID
Download the latest release and read the INSTALL and README files. Also consider a look at the Wiki on the LIKWID website.
LIKWID has no external dependencies and should build on any Linux system with a 2.6 or newer kernel.
Installing with make install is necessary for the pinning functionality and if you want to use the access daemon.
21 Access to the MSR and PCI address spaces
likwid-perfctr and likwid-powermeter require access to MSRs (model-specific registers) and, on Sandy Bridge, to the PCI address space.
MSR registers are accessed on x86 processors via special instructions which can only be executed in kernel space. The Linux kernel allows reading and writing these registers via special device files, which makes it possible to implement LIKWID completely in user space.
The following options are available:
  Direct access to the device files: the user must have read/write access to them.
  Access daemon: the application starts a proxy application for access to the device files (can be enabled in config.mk).
If you want to measure memory bandwidth on Sandy Bridge, you have to use the access daemon (see the Wiki)!
22 Setup for direct access
All modern Linux distributions ship the necessary msr kernel module.
  Check whether the device files exist: ls -l /dev/cpu/0/
  If the msr file is missing, load the module (must be root): modprobe msr
  Allow users access to the msr device files (various solutions are possible; must be root): chmod o+rw /dev/cpu/*/msr
Now you can use likwid-perfctr as a normal user. You can integrate the necessary steps into a startup script or configure udev.
23 likwid-perfctr: marker API
To measure only parts of an application, a marker API is available. The API only turns the counters on and off; the configuration of the counters is still performed by likwid-perfctr. Multiple named regions can be measured, results of multiple calls are accumulated, and inclusive and overlapping regions are possible.
  #include <likwid.h>
  likwid_markerInit();        // must be called from a serial region
  likwid_markerThreadInit();  // only if used in a threaded setting
  likwid_markerStartRegion("Compute");
  ...
  likwid_markerStopRegion("Compute");
  likwid_markerStartRegion("postprocess");
  ...
  likwid_markerStopRegion("postprocess");
  likwid_markerClose();       // must be called from a serial region
24 likwid-perfctr: marker API convenience C preprocessor macros
To enable easy toggling of the instrumentation, a set of macros is provided. Define LIKWID_PERFMON to enable LIKWID instrumentation; if LIKWID_PERFMON is not defined, the instrumentation is not built.
  #define LIKWID_PERFMON  // comment out to disable
  #include <likwid.h>
  LIKWID_MARKER_INIT;
  LIKWID_MARKER_THREADINIT;  // only necessary if measuring threaded code with the access daemon
  LIKWID_MARKER_START("Compute");
  ...
  LIKWID_MARKER_STOP("Compute");
  LIKWID_MARKER_START("postprocess");
  ...
  LIKWID_MARKER_STOP("postprocess");
  LIKWID_MARKER_CLOSE;
25 Using the marker API: tips and pitfalls
It may be convenient to copy the LIKWID header into your project; this makes it possible to build even if LIKWID is not available.
With instrumentation enabled you need to link against liblikwid.[a|so]. If you want to instrument code in shared libraries, you must also build LIKWID as a shared library (can be enabled in config.mk).
The initialization and closing parts of the API must always be in the application part. Use marked regions and start the application with:
  likwid-perfctr -g MEM -C N:0-15 -m ./a.out
If you want to control affinity yourself (-c option), you have to ensure that the number of threads entering each region is equal to the number of cores specified to likwid-perfctr.
26 Measuring energy consumption with LIKWID
27 Measuring energy consumption: likwid-powermeter and likwid-perfctr -g ENERGY
Implements the Intel RAPL interface (Sandy Bridge). RAPL = running average power limit.
[Sample likwid-powermeter output; several numeric fields were lost in transcription:]
  CPU name: Intel Core SandyBridge processor
  CPU clock: 3.49 GHz
  Base clock: … MHz; minimal clock: … MHz
  Turbo Boost steps: … MHz per number of active cores
  Thermal spec power: 95 W; minimum power: 20 W; maximum power: 95 W; maximum time window: … µs
28 Example: a medical image reconstruction code on Sandy Bridge
Sandy Bridge EP (8 cores, 2.7 GHz base frequency). Faster code => less energy!
  Test case               Runtime [s]   Power [W]   Energy [J]
  8 cores, plain C            …            …            …
  … cores, SSE                …            …            …
  … cores (SMT), SSE          …            …            …
  … cores (SMT), AVX          …            …            …
29 Useful performance patterns and metric signatures
J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. 5th Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012, August 28, 2012, Rhodes Island, Greece. Preprint: arXiv:1206.3738
30 Performance patterns (1)
Pattern: Load imbalance
  Behavior: saturating/sub-linear speedup
  Signature: different amount of work on the cores (FLOPS_DP, FLOPS_SP, FLOPS_AVX); note that the instruction count is not reliable!
Pattern: Bandwidth saturation in the outer-level cache
  Behavior: saturating speedup across the cores of an OL cache group
  Signature: OLC bandwidth meets the bandwidth of a suitable streaming benchmark (L3)
Pattern: Memory bandwidth saturation
  Behavior: saturating speedup across the cores on a memory interface
  Signature: memory bandwidth meets the bandwidth of a suitable streaming benchmark (MEM)
Pattern: Strided or erratic data access
  Behavior: a simple bandwidth performance model is much too optimistic
  Signature: low bandwidth utilization / low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM)
31 Performance patterns (2)
Pattern: Bad instruction mix
  Behavior: performance insensitive to the problem size vs. the cache levels
  Signature: large ratio of instructions retired to FP instructions if the useful work is FP / many cycles per instruction (CPI) if the problem is long-latency arithmetic / scalar instructions dominating in data-parallel loops (FLOPS_DP, FLOPS_SP, CPI)
Pattern: Limited instruction throughput
  Behavior: large discrepancy from a simple performance model based on LD/ST and arithmetic throughput
  Signature: low CPI near the theoretical limit if instruction throughput is the problem / static code analysis predicting large pressure on a single execution port / high CPI due to bad pipelining (FLOPS_DP, FLOPS_SP, DATA)
Pattern: Microarchitectural anomalies
  Behavior: large discrepancy from the performance model
  Signature: the relevant events are very hardware-specific, e.g., stalls due to 4k memory aliasing, conflict misses, unaligned vs. aligned LD/ST, requeue events. Code review required, with architectural features in mind.
32 Performance patterns (3)
Pattern: Synchronization overhead
  Behavior: speedup going down as more cores are added / no speedup with small problem sizes / cores busy but low FP performance
  Signature: large non-FP instruction count, growing with the number of cores used / low CPI (FLOPS_DP, FLOPS_SP, CPI)
Pattern: False sharing of cache lines
  Behavior: small speedup or slowdown when adding cores
  Signature: frequent (remote) CL evicts (CACHE)
Pattern: Bad ccNUMA page placement
  Behavior: bad or no scaling across NUMA domains
  Signature: unbalanced bandwidth on the memory interfaces / high remote traffic (MEM)
33 Example 1: abstraction penalties in C++ code
C++ codes which suffer from overhead (inlining problems, complex abstractions) need many more instructions overall relative to the arithmetic instructions:
  Often (but not always) good, i.e., low, CPI → bad instruction mix pattern
  Low-ish bandwidth
  Low number of floating-point instructions vs. other instructions
  High-level optimizations complex or impossible → strided access pattern
Example: matrix-matrix multiply with expression template frameworks on a 2.93 GHz Westmere core.
  Variant       Total retired instructions [10^11]   CPI   Memory bandwidth [MB/s]   MFlops/s
  Classic                    …                        …             …                   …
  Boost uBLAS                …                        …             …                   …
  Eigen                      …                        …             …                   …
  Blaze/DGEMM                …                        …             …                   …
34 Example 2: image reconstruction by backprojection
Simple roofline analysis → memory-bound algorithm → memory bandwidth saturation pattern.
A closer look via the likwid-perfctr MEM group and the IACA tool → limited instruction throughput pattern.
After a work reduction optimization → load imbalance pattern, identified via the likwid-perfctr FLOPS_SP group and corrected by a round-robin schedule.
35 Conclusions
Performance patterns are more than simple numbers: they combine scaling behavior, bottleneck saturation, and HPM signatures.
The set presented here is just a suggestion; it will have to be tested against more codes.
36 Conclusions
There is no alternative to knowing what is going on between your code and the hardware. Without performance modeling, optimizing code is like stumbling in the dark.
Performance x Flexibility = constant, a.k.a. abstraction is the natural enemy of performance.
Hardware Performance Monitoring Unit Working Group Outbrief CScADS Performance Tools for Extreme Scale Computing August 2011 hpctoolkit.org Topics From HW-centric measurements to application understanding
More informationCommon lore: An OpenMP+MPI hybrid code is never faster than a pure MPI code on the same hybrid hardware, except for obvious cases
Hybrid (i.e. MPI+OpenMP) applications (i.e. programming) on modern (i.e. multi- socket multi-numa-domain multi-core multi-cache multi-whatever) architectures: Things to consider Georg Hager Gerhard Wellein
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationCS3350B Computer Architecture CPU Performance and Profiling
CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Multi-core today: Intel Xeon 600v4 (016) Xeon E5-600v4 Broadwell
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationSCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING
2/20/13 CS 594: SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING Heike McCraw mccraw@icl.utk.edu 1. Basic Essentials OUTLINE Abstract architecture model Communication, Computation, and Locality
More informationEE282 Computer Architecture. Lecture 1: What is Computer Architecture?
EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationPerformance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews
Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,
More informationTurbo Boost Up, AVX Clock Down: Complications for Scaling Tests
Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Steve Lantz 12/8/2017 1 What Is CPU Turbo? (Sandy Bridge) = nominal frequency http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/hc23.19.9-desktop-cpus/hc23.19.921.sandybridge_power_10-rotem-intel.pdf
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationCommunication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart
More informationThe Role of Performance
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware
More informationPerformance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30
Performance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30 Georg Hager (a), Jan Treibig (a), and Gerhard Wellein (a,b) (a) HPC Services, Erlangen Regional Computing
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Performance Counters and applying the Roofline Model Slides and lecture by Georg Ofenbeck Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Read
More informationComputer Architecture s Changing Definition
Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction
More informationIntroduction to HPC and Optimization Tutorial VII
Felix Eckhofer Institut fã 1 4r numerische Mathematik und Optimierung Introduction to HPC and Optimization Tutorial VII January 30, 2013 TU Bergakademie Freiberg OpenMP Case study: Sparse matrix-vector
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationPerformance Analysis of Parallel Scientific Applications In Eclipse
Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationKevin O Leary, Intel Technical Consulting Engineer
Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."
More informationMemory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
Center for Information ervices and High Performance Computing (ZIH) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor ystem Parallel Architectures and Compiler Technologies
More informationPerformance-oriented programming on multicore-based clusters with MPI, OpenMP, and hybrid MPI/OpenMP
Performance-oriented programming on multicore-based clusters with MPI, OpenMP, and hybrid MPI/OpenMP Georg Hager (a), Gabriele Jost (b), Rolf Rabenseifner (c), Jan Treibig (a), and Gerhard Wellein (a,d)
More informationCMAQ PARALLEL PERFORMANCE WITH MPI AND OPENMP**
CMAQ 5.2.1 PARALLEL PERFORMANCE WITH MPI AND OPENMP** George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 27514, USA 1. INTRODUCTION This presentation reports on implementation of the
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationGetting Performance from OpenMP Programs on NUMA Architectures
Getting Performance from OpenMP Programs on NUMA Architectures Christian Terboven, RWTH Aachen University terboven@itc.rwth-aachen.de EU H2020 Centre of Excellence (CoE) 1 October 2015 31 March 2018 Grant
More informationPage 1. Multilevel Memories (Improving performance using a little cash )
Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More information