Center for Information Services and High Performance Computing (ZIH) Detecting Memory-Boundedness with Hardware Performance Counters ICPE, Apr 24th 2017 Daniel Molka (daniel.molka@tu-dresden.de) Robert Schöne (robert.schoene@tu-dresden.de) Daniel Hackenberg (daniel.hackenberg@tu-dresden.de) Wolfgang E. Nagel (wolfgang.nagel@tu-dresden.de)
Outline Motivation Benchmarks for memory subsystem performance Identification of meaningful hardware performance counters Summary 2
Scaling of Parallel Applications Core 0 Core 1... Core n Last Level Cache Memory Controller Point-to-Point Interconnect RAM... RAM I/O Other Processors Multiple levels of cache to bridge the processor-DRAM performance gap Last Level Cache (LLC) and memory controller usually shared between cores 3
Scaling of Parallel Applications Core 0 Core 1... Core n Last Level Cache Memory Controller Point-to-Point Interconnect RAM... RAM I/O Other Processors Typically integrated memory controllers in every processor Performance of memory accesses depends on the distance to the data 4
Potential Bottlenecks Local memory hierarchy Limited data path widths Access latencies Remote memory accesses Additional latency Multicore Processor Multicore Processor Core Core Core Core Core Core...... Last Level Cache Last Level Cache Interconnect bandwidth Memory Memory Performance of on-chip transfers and remote cache accesses Saturation of shared resources Need to understand which hardware characteristics determine the application performance Requires knowledge about: Peak achievable performance of individual components Component utilization at application runtime 6
Outline Motivation Benchmarks for memory subsystem performance Identification of meaningful hardware performance counters Summary 7
Common Memory Benchmarks Multicore Processor Multicore Processor Core Core Core Core Core Core...... Last Level Cache Last Level Cache Memory Memory Local memory hierarchy Bandwidth: STREAM Latency: Lmbench Remote memory accesses STREAM and Lmbench plus numactl On-chip transfers and remote cache accesses Not covered by common tools Not easily extendable Saturation of shared resources Also covered by STREAM 8
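To make the measured access patterns concrete, here is a minimal sketch (not the actual STREAM or Lmbench code) of the two kernels these tools are built around: a streaming triad loop whose runtime is bandwidth-limited, and a pointer-chasing loop whose runtime is latency-limited. The array size and compiler flags are illustrative assumptions.

```c
/* Minimal sketch of the two access patterns measured by STREAM and Lmbench.
 * Illustrative only, not the original benchmark code.
 * Compile e.g. with: gcc -O2 -fopenmp sketch.c -c */
#include <stddef.h>
#include <stdint.h>

#define N (8u * 1024u * 1024u)   /* assumption: arrays larger than the caches */

static double a[N], b[N], c[N];

/* Bandwidth pattern (STREAM triad): independent streaming accesses,
 * limited by the sustainable memory bandwidth. */
void triad(double scalar)
{
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Latency pattern (Lmbench-style pointer chasing): every load depends on
 * the previous one, so the runtime is dominated by the access latency. */
size_t chase(size_t *next, size_t steps)
{
    size_t p = 0;
    for (size_t s = 0; s < steps; s++)
        p = next[p];             /* serialized, latency-bound loads */
    return p;                    /* returning p prevents dead-code elimination */
}
```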
ZIH Development BenchIT Includes measurement of core-to-core transfers Example: Opteron 6176 memory latency [Figure: latency plot with regions for L3, main memory (NUMA), transfers within one processor, and remote cache accesses, which are not covered by other benchmark suites] Sophisticated data placement enables performance measurements for individual components in the memory hierarchy Also considers state transitions of coherence protocols 12
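The data-placement idea can be illustrated with a simplified sketch (not BenchIT code; the core numbers, buffer size, and timing method are assumptions): one thread first writes a buffer so that the cache lines reside in its core's cache in modified state, then a thread pinned to another core reads them and times the accesses.

```c
/* Simplified sketch of a core-to-core transfer measurement (not BenchIT code).
 * Thread placement via sched_setaffinity, timing via clock_gettime.
 * Compile e.g. with: gcc -O2 -pthread c2c.c -o c2c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define LINES 4096                        /* illustrative buffer size (cache lines) */
static volatile uint64_t buf[LINES * 8];  /* 64-byte stride: one element per line */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

static void *owner(void *arg)
{
    pin_to_core(0);
    /* Write the buffer so the lines end up in core 0's cache in modified state. */
    for (int i = 0; i < LINES; i++)
        buf[i * 8] = i;
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, owner, NULL);
    pthread_join(&t, NULL);

    pin_to_core(1);                       /* reader runs on another core */
    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < LINES; i++)
        sum += buf[i * 8];                /* reads served from the remote cache */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg per line: %.1f ns (checksum %llu)\n", ns / LINES,
           (unsigned long long)sum);
    return 0;
}
```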
Scaling of Shared Resources in Multi-core Processors [Figure: bandwidth in GB/s over the number of cores (1-12) for the last level cache and for main memory, measured on Xeon E5-2680 v3, Xeon E5-2670, Xeon X5670, Opteron 6274, and Opteron 2435] On some processors, the bandwidth of the last level cache scales linearly with the number of cores that access it concurrently The DRAM bandwidth can typically be saturated without using all cores 13
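A scaling curve like the one above can in principle be reproduced by sweeping the number of threads and recording the aggregate throughput; the following is a rough sketch under the assumption of a DRAM-sized working set, not the benchmark used for the figure.

```c
/* Sketch: aggregate read bandwidth over the number of threads (illustrative).
 * Compile e.g. with: gcc -O2 -fopenmp scaling.c -o scaling */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (512u * 1024u * 1024u / sizeof(double))  /* ~512 MiB, assumption: larger than the LLC */

int main(void)
{
    double *buf = malloc(N * sizeof(double));
    if (!buf) return 1;
    int max_threads = omp_get_max_threads();

    #pragma omp parallel for                /* parallel first-touch initialization */
    for (size_t i = 0; i < N; i++)
        buf[i] = 1.0;

    for (int t = 1; t <= max_threads; t++) {
        double sum = 0.0;
        omp_set_num_threads(t);
        double start = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (size_t i = 0; i < N; i++)
            sum += buf[i];                  /* streaming reads, split across t threads */
        double gbs = N * sizeof(double) / (omp_get_wtime() - start) / 1e9;
        printf("%2d threads: %6.1f GB/s (checksum %g)\n", t, gbs, sum);
    }
    free(buf);
    return 0;
}
```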
Saturation of Shared Resources Core 0 Core 1... Core n Last Level Cache Memory Controller Point-to-Point Interconnect RAM... RAM I/O Other Processors Multiple levels of cache to bridge the processor-DRAM performance gap Last Level Cache (LLC) and memory controller usually shared between cores 14
Outline Motivation Benchmarks for memory subsystem performance Identification of meaningful hardware performance counters Summary 15
Hardware Performance Counters RAM RAM Core 0 Core 1 Last Level Cache Memory Controller...... Core n Point-to-Point Interconnect I/O Other Processors Per-core counters Record events that occur within the individual cores, e.g., pipeline stalls, misses in the local L1 and L2 caches Uncore counters Monitor shared resources Events cannot be attributed to a certain core Accessible via PAPI 16
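A minimal PAPI sketch for reading such counters around a region of interest is shown below; the native event name is only an example and differs between processor generations, and uncore events additionally require the matching PAPI component.

```c
/* Minimal PAPI sketch: count one core event around a region of interest.
 * The event name is an example; check `papi_native_avail` on your system.
 * Link with -lpapi. */
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long count;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&evset);
    /* example native event; uncore events need the corresponding PAPI component */
    if (PAPI_add_named_event(evset, "CYCLE_ACTIVITY:STALLS_L2_PENDING") != PAPI_OK)
        exit(1);

    PAPI_start(evset);
    /* ... region of interest, e.g. a memory-bound kernel ... */
    PAPI_stop(evset, &count);

    printf("stall cycles with pending L2 miss: %lld\n", count);
    return 0;
}
```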
Properties of Hardware Performance Counters Not designed with performance analysis in mind Included for verification purposes, not guaranteed to work Some events are poorly documented Some countable events can have different origins E.g., stalls in execution can happen because of long latency operations as well as memory accesses Unclear whether counters are good indicators for capacity utilization Are 10 million cache misses per second too much? Not stable between processor generations => Methodology to identify meaningful events is needed 17
Identification of Meaningful Performance Counter Events Component Utilization: Use micro-benchmarks to stress individual components Identify performance monitoring events that correlate with the component utilization Determine peak event ratios Estimate Performance Impact: Extended latency benchmark to determine which stall counters best represent delays that are caused by memory accesses Search for events that represent stalls caused by limited bandwidth 18
Measuring Component Utilization Required are performance counter events that: Generate one event for a certain amount of transferred data (e.g. for every load and store or per cache line) Clearly separate the levels in the memory hierarchy 20
Measuring Component Utilization Example: Events that count L3 accesses (per cache line) Good counters available for all levels in the memory hierarchy, except: L1 accesses only counted per load resp. store instruction (different widths) Writes to DRAM only counted per package 24
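Given such per-cache-line events, the utilization estimate boils down to comparing the observed event rate with the peak rate obtained from the micro-benchmarks; a small worked example follows (all numbers are placeholders, the peak rate has to be measured on the target system).

```c
/* Sketch: derive component utilization from an event rate.
 * L3 accesses are counted per 64-byte cache line, so
 *   bandwidth ~ events/s * 64 bytes,
 * and utilization = measured rate / peak rate from the micro-benchmark. */
#include <stdio.h>

int main(void)
{
    const double line_size = 64.0;    /* bytes per L3 access event */
    const double peak_rate = 3.0e9;   /* peak L3 events/s, placeholder: measure it */
    double       events    = 1.2e10;  /* counted L3 access events (example) */
    double       seconds   = 5.0;     /* measurement interval (example) */

    double rate        = events / seconds;
    double bandwidth   = rate * line_size / 1e9;    /* GB/s */
    double utilization = rate / peak_rate * 100.0;  /* percent of measured peak */

    printf("L3: %.1f GB/s, %.0f %% of peak\n", bandwidth, utilization);
    return 0;
}
```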
Estimating Performance Impact of Memory Accesses A high component utilization indicates a potential performance problem But, the actual effect on the performance cannot easily be quantified Modified latency benchmarks to check which stall counters provide good estimates for the delays caused by memory accesses: additional multiplications between loads Two versions: independent operations that can overlap with memory accesses (reported number of stalls should decrease accordingly) multiplications that are part of the dependency chain (ideal counter reports the same results as for the latency benchmark) 25
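A simplified sketch of the two benchmark variants is shown below (the actual kernels are low-level routines with a controlled instruction mix; the C version only illustrates the idea and may be reordered or simplified by the compiler).

```c
/* Sketch of the two modified latency-benchmark variants (simplified).
 * 'next' is a pointer-chasing ring sized to miss in the targeted cache level. */
#include <stddef.h>
#include <stdint.h>

/* Variant 1: independent multiplications that can overlap with the load miss.
 * An ideal stall counter reports fewer stalls than the pure latency loop. */
uint64_t chase_independent(size_t *next, size_t steps)
{
    size_t p = 0;
    uint64_t x = 1;
    for (size_t s = 0; s < steps; s++) {
        p = next[p];            /* latency-bound load */
        x *= 3;                 /* independent of p, executes during the miss */
        x *= 5;
    }
    return x + p;
}

/* Variant 2: multiplications are part of the dependency chain.
 * An ideal counter reports the same number of stalls as the latency benchmark,
 * because the extra work cannot be hidden behind the memory access. */
size_t chase_dependent(size_t *next, size_t steps, size_t m /* pass 1 */)
{
    size_t p = 0;
    for (size_t s = 0; s < steps; s++) {
        p = next[p];
        p = p * m;              /* m == 1 at runtime keeps the index valid,      */
        p = p * m;              /* but the multiplications delay the next load   */
    }
    return p;
}
```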
Haswell Stall Counters 26
Estimating Performance Impact of Memory Accesses Also need to consider stall cycles in bandwidth-bound scenarios Best reflected by events that indicate full request queues, but far from an optimal correlation Events for loads and stores can overlap, but do not have to On Haswell the following events can be used to categorize stall cycles, but accuracy is limited: CPU_CLK_UNHALTED: active cycles CPU_CLK_UNHALTED minus CYCLE_ACTIVITY:CYCLES_NO_EXECUTE: productive cycles CYCLE_ACTIVITY:CYCLES_NO_EXECUTE: stall cycles, split into memory bound and other stall reasons memory bound: max(RESOURCE_STALLS:SB, CYCLE_ACTIVITY:STALLS_L1D_PENDING), split into bandwidth bound and latency bound bandwidth bound: max(RESOURCE_STALLS:SB, L1D_PEND_MISS:FB_FULL + OFFCORE_REQUESTS_BUFFER:SQ_FULL) 28
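Written out as a computation over raw counter values, one possible reading of this breakdown looks as follows (a sketch with example numbers; the events can overlap, so the categories are only approximate).

```c
/* Sketch: categorize Haswell cycles from raw counter values per the scheme above.
 * The numbers are examples; in practice the counts are read e.g. via PAPI. */
#include <stdio.h>

static long long max_ll(long long a, long long b) { return a > b ? a : b; }

int main(void)
{
    long long active      = 1000000000LL; /* CPU_CLK_UNHALTED */
    long long no_execute  =  600000000LL; /* CYCLE_ACTIVITY:CYCLES_NO_EXECUTE */
    long long sb_full     =  200000000LL; /* RESOURCE_STALLS:SB */
    long long fb_full     =  150000000LL; /* L1D_PEND_MISS:FB_FULL */
    long long sq_full     =  100000000LL; /* OFFCORE_REQUESTS_BUFFER:SQ_FULL */
    long long l1d_pending =  450000000LL; /* CYCLE_ACTIVITY:STALLS_L1D_PENDING */

    long long productive = active - no_execute;
    long long bw_bound   = max_ll(sb_full, fb_full + sq_full);
    long long mem_bound  = max_ll(sb_full, l1d_pending);
    long long lat_bound  = mem_bound - bw_bound;
    long long other      = no_execute - mem_bound;

    printf("productive: %lld\nbandwidth bound: %lld\nlatency bound: %lld\n"
           "other stall reasons: %lld\n", productive, bw_bound, lat_bound, other);
    return 0;
}
```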
Outline Motivation Benchmarks for memory subsystem performance Identification of meaningful hardware performance counters Summary 29
Summary Raw performance counter data typically difficult to interpret Selecting the (most) relevant events is not a trivial task Some events do not show the expected behavior E.g., the LDM_PENDING event Verification needed before relying on the reported event rates The presented micro-benchmark based approach can be used to tackle these challenges Acknowledgment: This work has been funded in part by the European Union's Horizon 2020 program in the READEX project and by the Bundesministerium für Bildung und Forschung via the research project Score-E 30
Thank You For Your Attention 31