Profiling: Understand Your Application

1 Profiling: Understand Your Application
Michal Merta, 1st of March 2018

2 Agenda
- Hardware events based sampling
- Some fundamental bottlenecks
- Overview of profiling tools
- perf tools
- Intel VTune Amplifier XE
- Hands-on

3 Hardware events based sampling

4 Hardware events based sampling
- Profiling using Performance Monitoring Units (PMUs).
- The data collector periodically interrupts the program and collects data from the PMUs.
- The average overhead of event-based sampling is about 2% at a 1 ms sampling interval.
- The number of hardware events (Performance Monitor Counters) that can be collected simultaneously is limited by the CPU's capabilities (number of PMUs).
- To collect more events than that, use multiple runs or multiplexing.
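With multiplexing, an event holds a hardware counter only part of the time, so its raw count has to be extrapolated to the whole run. A minimal sketch of the scaling that perf applies to multiplexed events (raw count times enabled time divided by running time); the function name and the numbers are hypothetical:

```python
def scale_multiplexed_count(raw_count, time_enabled_ns, time_running_ns):
    """Estimate the full-run count of an event that was counted only part of the time."""
    if time_running_ns == 0:
        return None  # the event never held a counter, so no estimate is possible
    return raw_count * time_enabled_ns / time_running_ns


# Hypothetical numbers: the event held a counter for 25% of a 2 s run.
estimate = scale_multiplexed_count(1_000_000, 2_000_000_000, 500_000_000)
print(estimate)  # 4000000.0
```

The result is an estimate: it assumes the event rate was the same while the counter was not assigned, which is why short or bursty runs are better measured in multiple runs.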

5 Some fundamental bottlenecks

6 Some fundamental bottlenecks
- Clockticks/Cycles
- Instructions retired: how many instructions were completely executed.
- Cycles per Instruction (CPI): how many cycles a single instruction took to execute on average (the lower the better).
- Instructions per Cycle (IPC): 1/CPI.

7 Some fundamental bottlenecks
Attention! The CPI/IPC metrics treat all instructions the same, so SIMD instructions are wrongly penalized: a SIMD instruction may need more cycles, but it computes several elements concurrently. Always consider the total number of instructions retired and the total number of cycles when interpreting CPI/IPC.

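The SIMD caveat can be illustrated with a small worked example. All counts below are hypothetical (1024 element-wise additions, an assumed vector width of 4 elements, and assumed per-instruction costs of 1 and 2 cycles); they are chosen only to show how CPI and run time can point in opposite directions:

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions


# Scalar version: one instruction per element, assumed 1 cycle each.
scalar_instructions, scalar_cycles = 1024, 1024
# SIMD version: assumed 4 elements per instruction, assumed 2 cycles each.
simd_instructions, simd_cycles = 256, 512

scalar_cpi = cpi(scalar_cycles, scalar_instructions)
simd_cpi = cpi(simd_cycles, simd_instructions)
speedup = scalar_cycles / simd_cycles

print(scalar_cpi)  # 1.0
print(simd_cpi)    # 2.0 (looks worse)
print(speedup)     # 2.0 (but the SIMD version needs half the cycles)
```

The SIMD variant has the higher CPI yet finishes in half the cycles, because it retires a quarter of the instructions.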

9 Pipeline slots
Processor front-end vs. back-end:
- Front-end: fetches instructions, decodes them into uops, and forwards them to the scheduler.
- Back-end: uops from the scheduler are assigned to execution units for execution, and memory changes are committed.

10 Pipeline slots
- Modern CPUs can allocate n uops per cycle and retire n uops per cycle, which gives the abstract concept of n pipeline slots.
- Each slot can be classified by its state during allocation. This classification is used, e.g., in VTune.
- Usual values for an HPC application: retiring 30-70%, bad speculation 1-5%, front-end bound 5-10%, back-end bound 20-40%.
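The slot classification can be sketched as simple arithmetic. The function below is an illustration only: the slot counts are hypothetical inputs, n = 4 slots per cycle is an assumption, and treating back-end bound as the remainder of the other three categories is an assumption of this sketch, not a description of how VTune computes it.

```python
def slot_breakdown(cycles, slots_per_cycle, retiring, bad_speculation, front_end_bound):
    """Return the share (in percent) of pipeline slots in each category."""
    total_slots = cycles * slots_per_cycle
    # Assumption of this sketch: every slot not otherwise classified is back-end bound.
    back_end_bound = total_slots - retiring - bad_speculation - front_end_bound
    counts = {
        "retiring": retiring,
        "bad speculation": bad_speculation,
        "front-end bound": front_end_bound,
        "back-end bound": back_end_bound,
    }
    return {name: 100 * count / total_slots for name, count in counts.items()}


# Hypothetical run: 1000 cycles with 4 slots per cycle.
breakdown = slot_breakdown(1000, 4, retiring=2000, bad_speculation=120, front_end_bound=320)
print(breakdown)
```

With these inputs the shares are 50% retiring, 3% bad speculation, 8% front-end bound, and 39% back-end bound, which falls inside the usual HPC ranges quoted above.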

11 Some fundamental bottlenecks: Front-end bound
- Issues that arise when the processor front-end undersupplies the back-end.
- ICache misses: fetched instructions are not present in the L1I (instruction) cache.
- ITLB overhead: the translation look-aside buffer stores recent translations of virtual to physical memory addresses. This metric estimates the performance penalty of ITLB (instruction TLB) misses.
- Branch resteers: fraction of cycles the CPU stalled while fetching the corrected path after a branch misprediction.
- Tip: use Profile-Guided Optimization (PGO) to improve these metrics.

12 Some fundamental bottlenecks: Back-end bound
Issues that arise when the processor back-end is not ready to handle the uops provided by the front-end. They can be divided into two further categories:
- Memory bound: execution of instructions is stalled while memory dependencies wait to be resolved.
- Core bound: no more instructions can be executed because there is no free execution unit (so-called port).
Most common causes include:
- Cache misses
- Remote memory accesses
- Data sharing
- 4K-Aliasing
- DTLB misses
- Data dependencies between instructions

13 Some fundamental bottlenecks: Back-end bound, cache misses
- Computations are performed on data in the L1 cache.
- The smallest unit of data loaded from memory towards the compute units is a cache line (64 bytes).
- Cache miss: the cache line is not in a cache and needs to be loaded (opposite: cache hit).
- The pre-configured CPU metrics are the ratio of cycles spent on misses to overall clock cycles.
- Cache misses will always be present in reality; only a too high rate is a problem.
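Because data always move in whole cache lines, the access pattern decides how many lines must be loaded for the same number of elements. A minimal sketch, assuming 64-byte lines, 8-byte elements, and an array that starts on a line boundary (the function name is hypothetical):

```python
CACHE_LINE_BYTES = 64  # cache line size stated on the slide


def cache_lines_touched(n_elements, element_bytes, stride_elements):
    """Count distinct cache lines touched when reading n_elements with a fixed stride."""
    lines = {
        (i * stride_elements * element_bytes) // CACHE_LINE_BYTES
        for i in range(n_elements)
    }
    return len(lines)


contiguous = cache_lines_touched(1024, 8, 1)  # unit stride over doubles
strided = cache_lines_touched(1024, 8, 8)     # every 8th double
print(contiguous)  # 128
print(strided)     # 1024
```

Reading the same 1024 values with a stride of 8 touches eight times as many cache lines, so only one eighth of every loaded line is actually used.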

14 Some fundamental bottlenecks: Back-end bound, remote memory access
- Any access to remote memory (memory of another socket, reached via QPI) is slower than an access to local memory.
- An ideal NUMA-optimized application should use local memory only.
- The pre-configured metric Remote DRAM counts the cycles in which remote memory is accessed. Compared to overall clock cycles, it should be very low.

15 Some fundamental bottlenecks: Back-end bound, data sharing
Data can be shared between cores (and sockets) as:
- True sharing: cores share the exact same data and at least one of them updates/writes it. Requires synchronization with other cores.
- False sharing: different data items on the same cache line are touched by different cores.
The pre-computed metrics Contested Accesses and Data Sharing relate the cycles the LLC needs for synchronization to overall clock cycles.
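Whether two per-thread variables can suffer from false sharing depends only on whether they land on the same cache line. A small sketch, assuming 64-byte lines and byte offsets measured from a line-aligned base address (the helper name and the offsets are hypothetical):

```python
CACHE_LINE_BYTES = 64  # assumed cache line size


def share_cache_line(offset_a, offset_b):
    """True if two byte offsets from a line-aligned base fall on the same cache line."""
    return offset_a // CACHE_LINE_BYTES == offset_b // CACHE_LINE_BYTES


# Two 8-byte per-thread counters packed next to each other.
packed = share_cache_line(0, 8)
# The same counters, each padded to occupy a full cache line.
padded = share_cache_line(0, 64)
print(packed)  # True  -> candidates for false sharing
print(padded)  # False -> each thread owns its own line
```

Padding or aligning per-thread data to the line size trades a little memory for the removal of the cache-line ping-pong between cores.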

16 Some fundamental bottlenecks: Back-end bound, 4K-Aliasing
- Store-forwarding (using load/store buffers): an optimization that avoids a pipeline stall when a load from a memory location follows shortly after a store to that location. Data are not written directly to memory (cache) but are held in a store buffer.
- The address check can only distinguish the lower 12 bits of the address (4096 bytes).
- If a load uses an address whose lower 12 bits are the same as those of a previous store, the processor wrongly assumes the data are in the load/store buffer.
- The real data then need to be loaded later (a stall of 5+ cycles).
- The pre-computed metric 4K-Aliasing counts the stall cycles caused by 4K-aliasing.
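The aliasing condition is a comparison of the lower 12 address bits. The check below is a simplified sketch: it ignores access sizes and partial overlaps, and the addresses are hypothetical.

```python
LOW_12_BITS = 0xFFF  # mask selecting the offset within a 4096-byte range


def may_4k_alias(store_addr, load_addr):
    """True if a load to a different address matches a store in its lower 12 bits."""
    same_low_bits = (store_addr & LOW_12_BITS) == (load_addr & LOW_12_BITS)
    return store_addr != load_addr and same_low_bits


aliasing = may_4k_alias(0x10000, 0x11000)      # exactly 4096 bytes apart
not_aliasing = may_4k_alias(0x10000, 0x11040)  # lower 12 bits differ
same_address = may_4k_alias(0x10000, 0x10000)  # genuine store-to-load forwarding
print(aliasing, not_aliasing, same_address)  # True False False
```

A common trigger is a loop that writes one array and reads another whose base address differs by a multiple of 4096 bytes; shifting one of the arrays by a small offset changes the lower 12 bits and avoids the match.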

17 Some fundamental bottlenecks: Back-end bound, DTLB misses
- TLBs are small buffers that help translate logical to physical addresses at page granularity.
- If data in a page are accessed and the page's address is not in the TLB, the address needs to be translated, which incurs a penalty.
- In the worst case, random page accesses cause a huge number of TLB misses.
- The pre-computed metric DTLB Overhead counts the stall cycles caused by DTLB misses.
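The pressure on the TLB grows with the number of distinct pages an access pattern touches. A minimal sketch, assuming 4 KiB pages and hypothetical address sequences:

```python
PAGE_BYTES = 4096  # assumed page size (4 KiB)


def pages_touched(addresses):
    """Count the distinct pages covered by a sequence of byte addresses."""
    return len({addr // PAGE_BYTES for addr in addresses})


# 4096 accesses to consecutive 8-byte elements (32 KiB in total).
sequential = [i * 8 for i in range(4096)]
# 4096 accesses that each land on a different page.
scattered = [i * PAGE_BYTES for i in range(4096)]

sequential_pages = pages_touched(sequential)
scattered_pages = pages_touched(scattered)
print(sequential_pages)  # 8
print(scattered_pages)   # 4096
```

Both patterns perform the same number of accesses, but the scattered one needs a translation for 4096 pages instead of 8, far more than a small TLB can hold at once.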

18 Overview of available tools
- gprof, oprofile, Vampir, Scalasca, Valgrind, Intel Advisor, Score-P, PAPI, Likwid, ...
- perf tools: Linux performance analysis tool available from kernel version
- Intel VTune Amplifier XE: powerful profiler provided as a part of Intel Parallel Studio or as a stand-alone product
- Allinea MAP: profiler for OpenMP and MPI applications that analyzes instructions, memory usage, I/O, communication, etc.

19 perf tools

20 perf tools
- Performance Counters for Linux (PCL)
- performance measurement tool integrated into Linux since 2009
- statistical profiling of the whole system
- supports hardware performance counters, software performance counters, tracepoints, and dynamic probes

21 perf tools commands

    $ perf

     usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

     The most commonly used perf commands are:
       annotate        Read perf.data (created by perf record) and display annotated code
       archive         Create archive with object files with build-ids found in perf.data file
       bench           General framework for benchmark suites
       buildid-cache   Manage build-id cache.
       buildid-list    List the buildids in a perf.data file
       data            Data file related processing
       diff            Read perf.data files and display the differential profile
       evlist          List the event names in a perf.data file
       inject          Filter to augment the events stream with additional information
       kmem            Tool to trace/measure kernel memory properties
       kvm             Tool to trace/measure kvm guest os
       list            List all symbolic event types
       lock            Analyze lock events
       mem             Profile memory accesses
       record          Run a command and record its profile into perf.data
       report          Read perf.data (created by perf record) and display the profile
       sched           Tool to trace/measure scheduler properties (latencies)
       script          Read perf.data (created by perf record) and display trace output
       stat            Run a command and gather performance counter statistics
       test            Runs sanity tests.
       timechart       Tool to visualize total system behavior during a workload
       top             System profiling tool.
       probe           Define new dynamic tracepoints
       trace           strace inspired tool

22 perf tools events

    Category               Description                                   Example
    Hardware events        Basic CPU events, measured by the CPU's PMU   cpu-cycles, branch-misses
    Hardware cache events  Data- and instruction-cache hardware events   L1-dcache-load-misses, LLC-store-misses
    Software events        Measurable by kernel counters                 cpu-clock, context-switches

+ tracepoint and probe events
+ raw hardware event descriptors (for Intel see: Intel(R) 64 and IA-32 Architectures Software Developer's Manual)

23 perf tools events
To get a list of available events, use perf list:

    $ perf list

    List of pre-defined events (to be used in -e):

      branch-instructions OR branches                    [Hardware event]
      branch-misses                                      [Hardware event]
      bus-cycles                                         [Hardware event]
      cache-misses                                       [Hardware event]
      ...
      alignment-faults                                   [Software event]
      context-switches OR cs                             [Software event]
      cpu-clock                                          [Software event]
      ...
      L1-dcache-load-misses                              [Hardware cache event]
      L1-dcache-loads                                    [Hardware cache event]
      L1-dcache-stores                                   [Hardware cache event]
      ...
      branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
      branch-misses OR cpu/branch-misses/                [Kernel PMU event]
      bus-cycles OR cpu/bus-cycles/                      [Kernel PMU event]
      ...
      rNNN                                               [Raw hardware event descriptor]
       (see man perf-list on how to encode it)

24 Counting events with perf stat

    $ perf stat ./myprogram

     Performance counter stats for ./myprogram:

                          task-clock (msec)   #        CPUs utilized
                   4,591  context-switches    #        K/sec
                      44  cpu-migrations      #        K/sec
                 202,626  page-faults         #        M/sec
         116,275,734,384  cycles              #        GHz
         167,335,603,761  instructions        #   1.44 insns per cycle
          14,612,431,103  branches            #        M/sec
              16,714,213  branch-misses       #  0.11% of all branches

                          seconds time elapsed

Some important switches:
- -e: event selection, e.g. perf stat -e cycles,cache-misses ./myprogram
- -p/-t: stat events on an existing process/thread id, e.g. perf stat -p <pid>
- -I <n>: prints counts at regular intervals of n ms
- -r <n>: repeats the measurement n times and prints the average and standard deviation
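The derived values in the right-hand column follow directly from the raw counts. As a worked check, the snippet below recomputes IPC, CPI, and the branch miss rate from the cycles, instructions, branches, and branch-misses counts shown in the output above (variable names are chosen for this illustration only):

```python
# Raw counts taken from the perf stat output above.
cycles = 116_275_734_384
instructions = 167_335_603_761
branches = 14_612_431_103
branch_misses = 16_714_213

ipc = instructions / cycles
cpi = cycles / instructions
branch_miss_rate = 100 * branch_misses / branches

print(f"IPC: {ipc:.2f}")                            # IPC: 1.44
print(f"CPI: {cpi:.2f}")                            # CPI: 0.69
print(f"branch miss rate: {branch_miss_rate:.2f}%")  # branch miss rate: 0.11%
```

The results match the "insns per cycle" and "of all branches" annotations that perf stat prints itself.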

25 Sampling with perf report
- Collect samples with perf record, which stores them in the perf.data file (can be analyzed on a different machine).
- Analyze them using perf report.
- By default, cycle counts are collected.
- -g will record a call graph.

    $ perf record -e branch-misses ./myprogram
    $ perf report

    Samples: 20K of event branch-misses, Event count (approx.):
    Overhead  Command    Shared Object   Symbol
      21.71%  myprogram  libmkl_avx2.so  [.] mkl_blas_avx2_xzgemv
       9.99%  myprogram  myprogram       [.] computeelementmatrix
       8.89%  myprogram  myprogram       [.] apply
       8.57%  myprogram  myprogram       [.] collect
       8.49%  myprogram  libc so         [.] _int_malloc
    ...
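The Overhead column can be read as each symbol's share of the collected samples. The sketch below illustrates that idea with a hypothetical list of sampled symbol names; it assumes every sample carries the same weight and is not how perf itself is implemented.

```python
from collections import Counter

# Hypothetical samples: the symbol that was executing at each sampling interrupt.
samples = ["gemv"] * 5 + ["apply"] * 3 + ["collect"] * 2

counts = Counter(samples)
total = sum(counts.values())
# Share of all samples attributed to each symbol, in percent.
overhead = {symbol: 100 * n / total for symbol, n in counts.items()}

for symbol, share in sorted(overhead.items(), key=lambda item: -item[1]):
    print(f"{share:6.2f}%  {symbol}")
```

Since the profile is statistical, symbols with only a handful of samples have a large relative error; longer runs or a higher sampling frequency tighten the estimate.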

26 Sampling with perf report
- perf annotate maps the recorded profile information to the actual functions and instructions in the code.
- Pressing 'a' on any symbol in perf report displays the assembly instructions of the function together with the source code.

27 Intel VTune Amplifier XE

28 Intel VTune Amplifier XE
- Powerful performance analysis tool providing in-depth metrics about the profiled application.
- Enables, e.g., finding hot spots in the application, measuring memory and QPI bandwidth, and profiling threading performance.

29 Hot Spots
Get started by identifying what is worth optimizing. Two analysis types:
- Basic Hotspots: simple instrumentation that does not require any drivers or perf, but only delivers execution times.
- Advanced Hotspots: sampling with basic event counters, which allows system-wide profiling; requires drivers or perf and delivers instruction information.

30 Concurrency/Locks and Waits
Understand threading. Two analysis types:
- Concurrency: provides information about how many threads are running at the same time.
- Locks and Waits: identifies concurrency bottlenecks where threads are blocked due to locks/synchronization.
If non-standard synchronization constructs are used, consider the User-Defined Synchronization API to make this information available.

31 Memory/QPI Bandwidth
- Select the analysis type Microarchitecture Analysis/Memory Access.
- The Summary tab gives a first overview (incl. latency).
- See bandwidth details under the Platform tab.

32 General Exploration Use General Exploration Analysis for a comprehensive overview of available metrics

34 Which Performance Metrics can be collected?
Depending on the analysis type, Intel VTune Amplifier XE shows two types of performance metrics:
- 140 predefined CPU metrics (CPU Metrics Reference): these are pre-configured from basic event counters and are also highlighted in the GUI (e.g. red or grayed out).
- Raw event counters from the Performance Monitoring Unit (PMU) (Intel(R) Processor Event Reference): these are used by the predefined CPU metrics in more or less complex formulas. They are highly dependent on the target architecture!

35 Tuning Guides and Performance Analysis Papers

36 Summary
- Hardware events based sampling has minimal overhead.
- Issues can be either front-end or back-end bound.
- There is a wide selection of profiling tools, free or paid, e.g., perf tools and Intel VTune Amplifier.

37 Hands-on