Sequential Optimization. Analysis Tools and Cache Optimization. Lab Course Efficient Programming of Multicore-Systems and Supercomputers

Size: px

Start display at page:

Download "Sequential Optimization. Analysis Tools and Cache Optimization. Lab Course Efficient Programming of Multicore-Systems and Supercomputers"

Preston Murphy
6 years ago
Views:

1 Sequential Optimization Analysis Tools and Cache Optimization Lab Course Efficient Programming of Multicore-Systems and Supercomputers

2 Outline Performance analysis tools/libraries Measurement strategies Examples: gprof, perf, likwid, PAPI, callgrind,... Cache optimization How do caches work? How to improve cache usage?

3 Performance Analysis Why? locate code regions (where most time is spent in) where optimizations are useful check correctness of assumptions on runtime behavior chose best algorithm from alternatives for a problem get knowledge about unknown code

4 Performance Analysis (Cont d) How? at end of (fully tested) Implementation on compiler-optimized release version with typical/representative input data Steps of optimization cycle: Start Measurement Locate Bottleneck Modify Code Finished Yes No Improvement Satisfying? Check for Improvement (Runtime)

5 Performance Analysis (Cont d) bottlenecks (sequential) logical errors: too often called functions algorithms with bad complexity or implementation bad memory access behavior (bad layout, low acess locality) lots of (conditional) jumps, lots of (unnecessary) data dependencies

6 Performance Measurement (cont'd) Hardware Sensors Event Stream Postprocessing (Online) Runs on Instrumented Program Meta- Information Measurement Output Instrumentation Visualization (Online/Offline) Program

7 Measurement - Terms Trace record of time-stamped event stream enter/leave of code region, actions, example: dynamic call tree huge amount of data (linear to runtime) unneeded for sequential performance analysis

8 Measurement Terms (Cont d) Profiling (e.g. of time) record summary over execution exclusive, inclusive cost / time, counters example: DCT DCG (Dynamic Call Graph) amount of data linear to code size

9 Methods Precise Measurements Increment Counter (Array) on Event Attribute Counters to Code (Context, Basic Block Sequence) Data (Address, Access Pattern) Data Reduction Possibilities Selection (Event Type, Code Range, Data) Online Processing (Compression, ) Needs Instrumentation (Measurement Code)

10 Methods - Instrumentation Manual Source Instrumentation Library Version with Instrumentation Compiler Binary Editing Runtime Instrumentation / Compiler Runtime Injection

Methods (Cont d) Statistical Measurement ( Sampling ) TBS (Time Based), EBS (Event Based) Assumption: Event Distribution over Code Approximated by checking

11 Methods (Cont d) Statistical Measurement ( Sampling ) TBS (Time Based), EBS (Event Based) Assumption: Event Distribution over Code Approximated by checking every N-th Event only Similar Way for Iterative Code: Measure only every N-th Iteration Advantage: Data Reduction Tunable Compromise between Quality/Overhead

12 Methods (Cont d) Simulation Events for (not existant) HW Models Results not influenced by Measurement Compromise Quality / Slowdown Rough Model = High Discrepancy to Reality Detailed Model = Best Match to Reality But: Reality (CPU) often unknown Allows for Architecture Parameter Studies

13 Hardware Support Monitor Hardware Event Sensors (in CPU, on Board) Event Processing / Collection / Storing Best: Separate HW Comprimise: Use Same Resources after Data Reduction Most CPUs nowadays include Performance Counters

14 Performance Counters Multiple Event Sensors ALU Utilisation, Branch Prediction, Cache Events (L1/L2/TLB), Bus Utilisation Processing Hardware Filter Techniques (Itanium, Pentium-4) Opcode, Code/Data Range, Threesholds (Latency!) Counter Registers Nehalem/Sandybridge: 7 (3 fixed), Core2Duo: 5 (3 fest), Itanium2: 4, Pentium-4: 18, Opteron: 8 Athlon: 4, Pentium-II/III/M: 2, Alpha 21164: 3

15 Performance Counters (Cont d) Two Uses: Read Get Precise Count of Events in Code Regions by Enter/Leave Instrumentation Interrupt on Overflow Allows Statistical Sampling Handler Gets Process State & Restarts Counter Both can have Overhead Often Difficult to Understand

16 Performance Counters (Cont d) Common Misconceptions Don t take absolute values for granted! Instruction Counts: Including Speculation? Miss Counts: influenced by OOO, including Prefetch Events (?), Nothing about Stall Time

17 Tools - Measurement Read Hardware Performance Counters Specific: PerfCtr (x86), Pfmon (Itanium), perfex (SGI), DCPI (Alpha), perf (Linux) Portable: PAPI, PCL, perf (Linux) Statistical Sampling DCPI, Pfmon (Itanium), OProfile (Linux), perf (Linux), VTune (commercial - Intel), Prof/GProf (TBS) Instrumentation GProf, Pixie (HP/SGI), Atom (Alpha), VTune (Intel) DynaProf (Using DynInst), Valgrind/PIN (Runtime Instr.)

18 Tools Example (1) GProf (Compiler generated Instr.): Function Entries increment Call Counter for (caller, called)-tupel Combined with Time Based Sampling Compile with gcc pg..., icc -p Run creates gmon.out Analysis with gprof... Overhead sometimes around 100%! Available with GCC on UNIX

Tools Examples (2) Perf access performance counter: sampling/reading perf stat <program> / perf record <program> / perf report on SuperMUC: old binary Intel AmplifierXE (was VTune) profiling

19 Tools Examples (2) Perf access performance counter: sampling/reading perf stat <program> / perf record <program> / perf report on SuperMUC: old binary Intel AmplifierXE (was VTune) profiling tool from Intel with GUI module load amplifier_xe/2013 amplxe-gui only does time sampling (performance counter needs kernel module...) Likwid module load likwid likwid-perfctr -g L3 c 0 <program>

20 Tools Example (3) Callgrind (Linux/x86), GPL cache simulator using Valgrind builds up dynamic call graph runtime instrumentation module load valgrind valgrind --tool=callgrind --cache-sim=yes <program> callgrind_annotate callgrind.out.xxx (GUI: kcachegrind)

21 The Memory Wall CPU Peak Performance (clock & cores) + 40% / year 1000 Growing Gap 100 Main Memory Performance +7% / year Access latency to main memory today up to 300 cycles Assume 2 Flops/clock tick 600 Flops wasted while waiting for one main memory access! 21/83

22 The Memory Hierarchy motivation: technology & economy fast memory is expensive & small SRAM: 6 transistors per cell parallel access & wide access paths needs lot of power & die space access time related to size: smaller = faster slow memory possible at much higher density DRAM: 1 transistor + 1 capacitor per cell serialized access discrepancy in design goals processor should be fast memory should be large wish: system with fast processor & large memory solution: hierarchy of buffers 22/83

23 Solution: The Memory Hierarchy upper level LLC: last level cache Size Latency Bandwidth lower level L1 Registers Fast Buffer 300 B 1 32 kb GB/s on-chip LLC Slower Buffer 4 MB GB/s off-chip CPU-local Main Memory 4 GB GB/s Remote Main Memory (attached to other CPUs) Even more remote Memory (on I/O devices,...) 4 GB GB/s 1 TB > ,2 GB/s 23/83

24 Namespaces in Memory Hierarchy easiest for programming? every memory hierarchy level with separate namespace explicit copy operations to move data between levels one large address namespace for all levels all but bottom/largest level only can contain subsets transparent use of higher levels: work as cache addresses are keys into associative data array cost: time to find data + storage for keys (= tags) cached entries can become invalid: coherency protocol best use of smaller buffers? eviction/replacement policy mix registers: we need very fast access no transparency needed: allocated by compiler L1 (apart from registers: smallest / fastest buffer) LLC transparent, use same namespace as memory 24/83

25 Driving Goal: Performance How to get good performance? fast processing high clock rate parallelism fast data access: processing happens on data memory performance: latency & bandwidth we want locality: data stored where processing done easy to have fast access & high bandwidth: cheap! depends on algorithm: better to not move data 25/83

26 Importance of fast Memory Accesses applications very often do memory accesses exploitation of processor caches is crucial example: 3 GHz clock rate, 2 sockets NUMA latency to remote socket ~ 100 ns: 300 clock ticks bandwidth (1 core) ~ 15 GB/s compare to L1 access: latency 2-3 ticks, bandwidth ~150GB/s bad memory performance easily can dominate performance (better memory performance also will speed up parallel code) further issue: contention if served from private levels (e.g. L1): perf. scales memory modules: shared by all cores & sockets M M 26/83

27 Processor Caches Why are Caches effective? Principle of Locality repeated accesses to same memory cells temporal locality good to keep recently accessed data in cache accesses to memory cells near recent accesses spatial locality good to work on blocks of nearside data (cache line) locality depends on the application 27/83

28 Cache Utilization effective only when data is found (cache hit) otherwise fetch from next level (cache miss) hit ratio h: probability of a cache hit example (1 cache level + memory) t eff : average resulting access time (= access latency) t cache : access time cache t mem : access time memory t h * t (1 h ) eff cache t mem 28/83

29 Design Options how to find cached data fast? full associativity: copies can reside in every cache line lots of comparisons needed very expensive (power/space/time) used for very small caches, e.g. TLB direct mapped: one possible place for a memory block address bits index into cache lines no eviction policy needed very simple/power efficient/fast ( microcontrollers) high conflict potential: reduces effectiveness m-way set-associativity: m possible places per address tradeoff: reduced conflicts, less expensive often: L1 4/8-way, L3: -24 way (less latency sensitive) 29/83

30 Full Associative Cache Word offset within line Byte offset in Word Address Tag Tag V = Tag V = Tag V = Tag V = Multiplexer Example: 64 byte line, 32-bit architecture Data bus 2 bit byte offset, 4 bit word offset, 26 bit tags 30/83

31 Decoder Direct Mapped Cache Index: use lower bits! Word offset within line Byte offset in Word Address Tag Index Tag Tag Tag Tag V V V V Example: 64 byte line, 256 lines (16k) = & Hit Multiplexer Data bus 4+2 bit offset, 8 bit index, 18 bit tags 31/83

32 Decoder m-way Associativity Word offset within line Byte offset in Word Address Tag Index Tag V Cache Set with m lines Tag V = = = Example: 64 byte line, 384 lines (24k), 6-way 64 sets, 4+2 bit offset, 6 bit index, 20 bit tags 32/83

33 Cache Hierarchy: Design Alternatives inclusive cache hierarchy all data in L1 also in L2 (also in L3) automatically ensured if (assume same line sizes) L n assoc * (#L n caches) L n+1 assoc easier management: to invalidate from outside only check larger level Intel (L2 & L3 usually without strict inclusion why?) exclusive cache hierarchy same data only once in hierarchy memory L1, if L1 full, evict data to L2 (L2 to L3) (victim cache: temporarily take cache data on overflow) AMD on-chip: wider paths easy, high BW off-chip: allows larger caches sometimes done, tags usually on-chip (for fast checks) 33/83

34 Cache Hierarchy: Design Alternatives Write Policies on hit: also update lower levels? write through: yes, directly on update write back: mark as modified (dirty), write on eviction (reduces write bandwidth needs!) on miss: allocate a cache line? write-allocate policy no: directly write to memory (high bandwidth!) yes: read memory block first (more data to transfer!) Write Buffers queuing of writes to lower levels, in background increases parallelism within cache hierarchy loads check write buffers, allow forwarding 34/83

35 Eviction/Replacement Policies New line to be allocated in set, but set full which existing line to evict to make room? Best strategy? OPT: we know the future! evict the line which is accessed again farthest in future upper bound to evaluate other policies also known as Belady (IBM, 60-ties) 35/83

36 Eviction Policies LRU: evict least recently used line approximates OPT if locality of future accesses is same as locality of recent accesses LFU: evict least frequently used line captures more history behavior than only recent access (as LRU) experiments show that LRU is better Others RANDOM: evict random line FIFO: evict in order of allocation 36/83

37 Techniques for Processor Caches Most important non-blocking (asynchronous) cache allow overlapping of multiple accesses non-blocking only for hits: lockup-free cache also for misses: table with outstanding loads, needs non-blocking support also from lower levels pipelining of cache accesses multi-level (hierarchy) smaller caches are faster, use less power if inclusive: lower caches should be significantly larger lower caches are accessed less frequently, thus can be slower (higher latency, lower bandwidth) hardware prefetcher: predict future accesses 37/83

38 Hardware Prefetching: Motivation Example sequential memory accesses in loop (8 byte type) cache accepts multiple outstanding loads out-of-order pipeline allows 64 outstanding loads 64 byte cache lines: always waiting for 8 lines enough to utilize memory performance? Memory performance 70ns latency, 3GHz: 210 cycles 2x DDR3-1600: ~ 25GB/s 8 cache lines every 200 cycles 512 bytes / 70 ns = 512GB /70s = 7,3 GB/s this is an ideal case: no dependencies! pointer chasing, linked list: 1 cache line/200 cycles: < 1GB/s we need prefetching (load before use) to exploit memory 38/83

39 Hardware Prefetching: Motivation in software: by programmer, by compiler regular loads (waste register) / separate instructions very often trigger much more prefetching than needed estimate what already is in cache (assume LRU ) prefetching only once per cache line complicates code often not possible: prefetch per data item prefetches consume space in instruction sequence larger code, higher IF/ID bandwidth needed slowdown best prefetch distance? in hardware caches observe memory accesses behavior predict future memory accesses, trigger prefetching 39/83

40 Hardware Prefetching What is a good prefetcher? accuracy: is all data prefetched / is prefetched data used? timeliness: early enough / not overwriting data still needed? Types prefetch adjacent memory block illusion of doubled cache line size for loads stream prefetchers: sequential access up/down stride prefetchers: series with fix addr. difference stream prefetching is a special case address correlation prefetchers remember arbitrary miss sequences: pointer chasing! with large tables very effective, only in research papers 40/83

41 The Principle of Locality is not enough... Reasons for Performance Loss for SPEC2000 [Beyls/Hollander, ICCS 2004] 41/83

42 Cache Optimizations use fastest algorithm (bad complexity kills) use existing implementations (Intel MKL, ) compiler options (-O3, -march=native), reduce aliasing problems (restrict keyword) Only optimize after profiling useless waste of time? changes for increased performance usually increase complexity, reduce maintainability / portability hide optimization behind interfaces Tips on following slides: best with C / C++ / Fortran / may also work with Java / JavaScript data layout cannot be influenced in Java (in JavaScript: use typed arrays ) 42/83

43 Cache Optimization: Reduce Accesses use large data types (done by compiler?) vectors instead of bytes 1 cache line = 1 access: use full cache lines hash table entries can be 64 bytes (if not fitting in L1/L2) hierarchical data structures alignment: crossing cache line gives two accesses (redundant) calculation instead of memory access avoid unneeded writes check if a variable already has given value before writing writes result in higher bandwidth needs 43/83

44 Cache Optimization: Reorder Accesses if possible, do sequential accesses exploit full cache line, trigger hardware prefetcher best up to page size, but stay in L1 small sequential accesses: reduce accuracy of prefetcher blocking: reuse data as much as possible instead of multiple large sweeps over large buffer, split up into multiple small sweeps over buffer parts recursion: block bisection / space filling curve best use of multiple cache levels without complexity cache-oblivious 44/83

45 Cache Optimization: Reorder Accesses Blocking: make arrays fit into a cache Recursion: Morton (Peano, Hilbert, ) [ ] 45/83

46 Cache Optimization: Improve Layout group data with same access frequency and access type (read vs. write) writes always trigger a read before avoid holes : if possible, use every byte of a fetched cache line (unused data is wasted space + bandwidth) reorder data in memory according to traversal order in program may also be useful for trees, linked lists incremental, in small steps, while accessing the data go from array of structs to structs of arrays AoS-to-SoA conversion to avoid power-of-2 strides (easy: conflict misses) use padding (artificial holes in data structure) No need to complicate code: interface layout 46/83

47 Cache Optimization

48 Basic efficiency guidelines Choose the best algorithm Use efficient libraries Find good compiler, options, give hints -O3, -fno-alias, restrict C99 keyword... Use suitable data layout Restructure code

49 Use suitable data layout Sequential access is fastest! E.g. in C/C++, for a 2D matrix double a[n][m]; the loops should be such that for (i...) for (j...) a[i][j]... Apply loop interchange, if necessary, see below

50 Data layout optimizations: Array padding Idea: Allocate arrays larger than necessary Change relative memory distances Avoid severe cache thrashing effects Example: Replace double u[1024][1024] by double u[1024][1024+pad] How to choose pad?

51 Loop unrolling Loop fusion Loop split Loop blocking Etc. Loop optimizations

Data access optimizations: Loop blocking Loop blocking = loop tiling Divide the data set into subsets (blocks) which are small enough to fit in cache

52 Data access optimizations: Loop blocking Loop blocking = loop tiling Divide the data set into subsets (blocks) which are small enough to fit in cache Perform as much work as possible on the data in cache before moving to the next block This is not always easy to accomplish because of data dependencies

53 Data access optimizations: Interleaving of iterations Start next iteration while the previous is still calculated Aim: Reuse data in cache as often as possible Also known as time skewing

Example: Stencil, 2D Jacobi iteratively solve PDE (partial differential equation) discretize regularily in 2D, double precision for(iter=0; iter<itermax; iter++) for(i=0; i<n; i++) for(j=0; j<n; j++)

54 Example: Stencil, 2D Jacobi iteratively solve PDE (partial differential equation) discretize regularily in 2D, double precision for(iter=0; iter<itermax; iter++) for(i=0; i<n; i++) for(j=0; j<n; j++) anew[i,j] = (a[i-1,j]+ +a[j,i+1])/4; update: average of 4 neighbors, 4 Flops/update bandwidth requirement to memory: 16 bytes/update 2 lines fit into cache whole matrix does not fit into cache one read + one write, everything else in cache memory-bound: 2 cores already occupy bus 54/83

55 Example: Stencil, 2D Jacobi base parallel version geometric 2D decomposition synchronization needed for accessing partition boundaries (1 layer) Core Core Core Cache Memory memory-bound: 2 cores already occupy bus on SuperMUC: 7 GFlop/s per node (2 sockets) 55/83

56 2D Jacobi: Cache Optimizations Spatial blocking of n iterations: n layers to neighbor Core Core Core Cache Memory recursive splitting of blocks for one core until blocks fit into cache update of inner borders reduced BW to memory better scalability 3 56/83

57 2D Jacobi: Cache Optimizations Wavefront: similar to blocking, use shared $ C1 C2 C1 C3 C2 C1 C1 C2 C3 Memory matrix gets streamed trough cores within multicore, may be combined with blocking allows larger blocks, less border updates on SuperMUC > 10x improvement (!!) for details: lab course in SS14 57/83

Performance Optimization: Simulation and Real Measurement

Performance Optimization: Simulation and Real Measurement KDE Developer Conference, Introduction Agenda Performance Analysis Profiling Tools: Examples & Demo KCachegrind: Visualizing Results What s to