Performance analysis basics



1 Performance analysis basics
Christian Iwainsky

2 Overview
1. Motivation
2. Performance analysis basics
3. Measurement Techniques

3 Why bother with performance analysis
Moore's Law still in charge, so what?
- Increasingly difficult to get close to peak performance
  - for sequential computation: memory wall, optimal pipelining, ...
  - for parallel interaction: Amdahl's law, synchronization with single late-comer, ...
- Efficiency is important because of limited resources
- Scalability is important to cope with next bigger machine or data set

4 Amdahl's law
A parallel application cannot run faster than the sum of its sequential parts!

5 Amdahl's law
Assume a sequential program with serial execution time $T_s$.
Parallelization ideally yields a total runtime $T_p$, with a part $t_s$ that is not parallelized and a part $t_p$ that can be executed in parallel:
$T_p = t_s + t_p$
Adding additional CPUs/cores to the application will ideally speed $t_p$ up:
$t_p^n = t_p / n$, where $n$ = number of CPUs/cores

6 Amdahl's law
Substitute $t_p^n$ for the parallel part:
$T_p = t_s + t_p / n$
For $n \to \infty$:
$T_p = t_s$
A parallel application cannot run faster than the sum of its sequential parts!
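The limit above can be checked numerically. The following is a minimal sketch, assuming a program with 10 s of serial work and 90 s of parallelizable work; the function names and the numbers are illustrative and not part of the original slides.

```python
def parallel_runtime(t_s: float, t_p: float, n: int) -> float:
    """Ideal runtime on n cores: the serial part stays, the parallel part shrinks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return t_s + t_p / n


def speedup(t_s: float, t_p: float, n: int) -> float:
    """Speedup relative to the one-core runtime t_s + t_p."""
    return (t_s + t_p) / parallel_runtime(t_s, t_p, n)


if __name__ == "__main__":
    t_s, t_p = 10.0, 90.0  # assumed values: 10 s serial, 90 s parallelizable
    for n in (1, 2, 8, 64, 1024):
        print(f"n={n:5d}  runtime={parallel_runtime(t_s, t_p, n):7.2f} s  "
              f"speedup={speedup(t_s, t_p, n):5.2f}")
```

With these assumed values the runtime approaches 10 s and the speedup stays below (t_s + t_p) / t_s = 10, no matter how many cores are added.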


8 The Basics
Successful tuning is a combination of
- right algorithms and libraries
- compiler flags and directives and how to use them
- knowing the runtime behavior of your application
- thinking!
Measurement is better than guessing:
- to validate program behaviour
- to determine performance problems
- to validate tuning decisions and optimization
Measurement should be repeated after each significant code modification and optimization

9 The Basics
It is easier to optimize a slow correct program than to debug a fast incorrect one
- Debugging before tuning
- Nobody really cares how fast you can compute the wrong answer!
The 80/20 rule
- Program spends 80% of its time in 20% of the code
- Programmer spends 20% of the effort to get 80% of the total speedup possible in the code
- Know when to stop!

10 The Basics
Don't optimize what doesn't matter
Make the common case fast!

11 The Basics
Do I have a performance problem at all?
- Time measurements
- Speedup and scalability measurements
What is the main bottleneck (computation/communication...)?
- Flat profiling
Where is the main bottleneck?
- Call graph profiling
- Detailed (basic block) profiling
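The first question, time plus speedup and scalability measurements, can be sketched as follows. This is a minimal illustration, not a tool from the slides: `best_wall_time` and `scaling_table` are hypothetical helper names, and the runtimes passed to `scaling_table` in the example are made-up numbers.

```python
import time


def best_wall_time(func, repeats=5):
    """Run func() several times and return the smallest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def scaling_table(times_by_cores):
    """Map {cores: runtime} to {cores: (speedup, efficiency)}.

    Both values are relative to the run with the smallest core count.
    """
    base_cores = min(times_by_cores)
    base_time = times_by_cores[base_cores]
    table = {}
    for cores, runtime in sorted(times_by_cores.items()):
        s = base_time / runtime
        table[cores] = (s, s / (cores / base_cores))
    return table


if __name__ == "__main__":
    print(f"best of 5: {best_wall_time(lambda: sum(range(100_000))):.6f} s")
    # Hypothetical measured runtimes in seconds for 1, 2 and 4 cores.
    for cores, (s, e) in scaling_table({1: 100.0, 2: 55.0, 4: 32.5}).items():
        print(f"{cores} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

Taking the minimum over several repetitions is one common way to reduce noise from other activity on the machine; a falling efficiency with growing core count hints at a scalability problem worth profiling.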

12 The Basics
Why is it there?
- Hardware counter analysis
- Trace your application
Does my code have scalability problems?
- Profile code for typical small and large processor counts
- Compare profiles function by function

13 What kind of information can be obtained?
Typical examples of performance information:
- enter/leave of function/routine/region
  - number of calls
  - duration
- send/receive of P2P messages (MPI)
  - time stamp, sender, receiver, length, tag, communicator
- communication cost
- synchronisation cost
(continued on next slide)

14 What kind of information can be obtained?
Typical examples of performance information (continued):
- hardware performance counter values
  - time stamp, value, process, counter ID
- everything you are interested in
  - self defined

15 Optimization Cycle
- Instrumentation: Insertion of extra code (probes, hooks) into application
- Measurement: Collection of data relevant to performance analysis
- Analysis: Calculation of metrics, identification of performance problems
- Presentation: Transformation of the results into a representation that can be easily understood by a human user
- Optimization: Elimination of performance problems


17 Measurement Basics
Two independent decisions:
1. When performance measurement is triggered
   - Sampling
     - Triggered by timer interrupt or by hardware counter overflow
     - Can measure unmodified executables, potentially low overhead
   - Code instrumentation
     - Triggered by instrumentation hooks inserted into the code
     - Insertion can be done manually or automatically
2. How performance data is recorded
   - Profile ::= summarization of events over time
     - Run-time summarization (functions, call sites, loops, ...)
   - Trace file ::= sequence of events over time

18 A study of a short example program
To explain the differences:
    main(..) {
      for (i=1..3) {
        foo(i)
      }
    }
    foo(i) {
      if (i>1) foo(i-1)
    }
[Figure: timeline of main, foo(1), foo(2), foo(3) over time]

19 Example: Sampling
t0: begin
t1: main
t2: foo(1)
t3: main
t4: foo(2)
t5: foo(1)
t6: foo(3)
t7: foo(2)
t8: foo(1)
t9: main
t10: end
[Figure: timeline of main, foo(1), foo(2), foo(3) over time, with the sample points t1 to t9 marked]

20 Summary: Sampling
Sampling: The application is probed at specific times and a set of interesting metrics is gathered
Advantages:
- Low perturbation of the application
- Application does not have to be recompiled
- Works well with large and long-running applications
Disadvantages:
- Not very detailed information on highly volatile metrics
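The trade-off in this summary can be illustrated with a small simulation. This is only a sketch: instead of a real timer interrupt it probes a hypothetical, hand-written timeline (times in milliseconds) at fixed intervals, and `TIMELINE`, `function_at` and `sample_profile` are illustrative names.

```python
from collections import Counter

# Hypothetical timeline: (start, end, function executing), times in milliseconds.
TIMELINE = [
    (0, 4, "main"),
    (4, 10, "foo(1)"),
    (10, 12, "main"),
    (12, 15, "foo(2)"),
    (15, 21, "foo(1)"),
    (21, 24, "foo(2)"),
    (24, 25, "main"),
]


def function_at(t, timeline):
    """Return the function executing at time t, or None outside the timeline."""
    for start, end, name in timeline:
        if start <= t < end:
            return name
    return None


def sample_profile(timeline, interval):
    """Probe the timeline every `interval` ms and estimate the time per function."""
    hits = Counter()
    for t in range(0, timeline[-1][1], interval):
        name = function_at(t, timeline)
        if name is not None:
            hits[name] += 1
    # Each hit is taken to stand for one full sampling interval.
    return {name: count * interval for name, count in hits.items()}


if __name__ == "__main__":
    print("interval 1 ms:", sample_profile(TIMELINE, 1))
    print("interval 5 ms:", sample_profile(TIMELINE, 5))
```

With a 1 ms interval the estimate matches the timeline exactly (main 7 ms, foo(1) 12 ms, foo(2) 6 ms). With a 5 ms interval the short foo(2) phases fall between samples and vanish from the profile, which is the loss of detail listed under disadvantages.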

21 Example: Instrumentation
t00: begin
t01: main
t02: foo(1)
t03: main
t04: foo(2)
t05: foo(1)
t06: foo(2)
t07: main
t08: foo(3)
t09: foo(2)
t10: foo(1)
t11: foo(2)
t12: foo(3)
t13: main
t14: end
[Figure: timeline of main, foo(1), foo(2), foo(3) over time, with the measurement points marked]

22 Summary: Instrumentation
Idea: The code is instrumented such that every interesting event is recorded as it occurs.
Advantages:
- Every event of interest can be captured
- Much more detailed information possible
Disadvantages:
- Preprocessing of the program necessary
- Probably expensive at runtime
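As an illustration of manual instrumentation, the example program can be wrapped so that every function entry and exit is recorded. This is a minimal sketch, not the mechanism of any particular tool: the `instrument` decorator and the `TRACE` list are hypothetical names, and the recursion is assumed to stop at foo(1), as in the timelines shown earlier.

```python
import functools
import time

TRACE = []  # event records: (time stamp, event type, region name)


def instrument(func):
    """Wrap func so that an event is recorded on every entry and exit."""
    @functools.wraps(func)
    def wrapper(*args):
        region = f"{func.__name__}({', '.join(map(str, args))})"
        TRACE.append((time.perf_counter(), "enter", region))
        try:
            return func(*args)
        finally:
            TRACE.append((time.perf_counter(), "leave", region))
    return wrapper


@instrument
def foo(i):
    if i > 1:  # assumption: recursion stops at foo(1)
        foo(i - 1)


@instrument
def main():
    for i in range(1, 4):
        foo(i)


if __name__ == "__main__":
    main()
    for stamp, kind, region in TRACE:
        print(f"{stamp:.6f}  {kind:5s}  {region}")
```

Every call, including each recursive one, produces two event records, so nothing is missed; the price is that the wrapper runs on every call, which is the runtime cost named under disadvantages.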

23 Critical issues
Accuracy
- Perturbation: measurement alters program behaviour
- Intrusion overhead: measurement itself needs time and thus lowers performance
- Accuracy of timers, counters
Granularity
- How many measurements
- How much information / work during each measurement
Trade-off: accuracy vs. expressiveness of data

24 Instrumentation techniques
Instrumentation: Process of modifying programs to detect and report events
There are various ways of instrumentation:
- Manual instrumentation
- Automatic instrumentation (using compilers or specific instrumentation tools)
- Binary instrumentation

25 Automatic instrumentation
Calls to instrumentation functions are automatically inserted
- Via source-to-source transformation
  - Detailed access to the instrumentation mechanisms
  - Instrumented source code is available for post-processing
  - Examples: Program Database Toolkit (PDT), OpenMP Pragma And Region Instrumentor (Opari)
- Compiler-based instrumentation
  - Only functions are automatically instrumented
  - No source modification
  - Supported by many compilers: GCC, Intel, IBM, PGI, NEC, Hitachi, Sun Fortran, ...

26 How does one perform instrumentation
WARNING: Compiler-based instrumentation
- has no standardized interface
- may instrument before optimization
- may deactivate optimization
- may instrument after optimization
- implementation may vary from vendor to vendor

27 Profiling
Recording of aggregated information
- Time
- Counts
  - Function/method invocations
  - Hardware counter values
about program and system entities
- Functions, call sites, loops, basic blocks, ...
- Processes, threads
Collects only process-local information
Methods to create a profile
- Interval timer / program counter sampling (statistical approach)
- Direct measurement (deterministic approach)

28 Tracing
Recording information about significant points (events) during execution of the program
- Entering/leaving of a code region (function, loop, ...)
- Sending/receiving a message
- ...
Save information in event record
- Time stamp, location ID, event type
- Plus event-specific information
Event trace := stream of event records sorted by time
Can be used to reconstruct the dynamic behaviour
- Abstract execution model on level of defined events

29 Tracing vs. Profiling
Tracing advantages
- Event traces preserve the temporal and spatial relationships among individual events (context!)
- Allows reconstruction of dynamic behaviour of application on any required abstraction level
- Automatic analysis
- Visualization
- Most general measurement technique
  - Profile data can be constructed from event traces
Tracing disadvantages
- Traces can become very large
- Flushing of trace buffers to file at runtime can cause perturbation
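The point that profile data can be constructed from event traces can be sketched as follows. The trace below is hypothetical (integer time stamps, made-up region names), and `profile_from_trace` is an illustrative helper, not part of any tracing tool.

```python
from collections import defaultdict

# Hypothetical event trace: (time stamp, event type, region), sorted by time.
TRACE = [
    (0, "enter", "main"),
    (1, "enter", "foo"),
    (4, "leave", "foo"),
    (5, "enter", "foo"),
    (6, "enter", "bar"),
    (9, "leave", "bar"),
    (10, "leave", "foo"),
    (12, "leave", "main"),
]


def profile_from_trace(trace):
    """Summarize a trace into call counts and inclusive/exclusive time per region."""
    calls = defaultdict(int)
    inclusive = defaultdict(int)
    exclusive = defaultdict(int)
    stack = []  # entries: [region, enter time, time spent in callees]
    for stamp, kind, region in trace:
        if kind == "enter":
            calls[region] += 1
            stack.append([region, stamp, 0])
        elif kind == "leave":
            name, entered, child_time = stack.pop()
            if name != region:
                raise ValueError(f"unbalanced trace: left {region} inside {name}")
            duration = stamp - entered
            inclusive[region] += duration
            exclusive[region] += duration - child_time
            if stack:
                stack[-1][2] += duration  # charge this call to the caller's callee time
    return {r: {"calls": calls[r], "inclusive": inclusive[r], "exclusive": exclusive[r]}
            for r in calls}


if __name__ == "__main__":
    for region, row in profile_from_trace(TRACE).items():
        print(region, row)
```

The reverse direction does not work: once the events are summed up per region, their order and timing (the context) cannot be recovered from the profile.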

30 What is a hardware performance counter
Hardware Performance Counter (Wikipedia): a set of (programmable) special-purpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems
What do they do?
Example: MEGAFLOPS
- MEGA = million ($10^6$)
- FLOPS = FLoating point Operations Per Second
- Tell the CPU to count the number of floating point operations
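The MEGAFLOPS arithmetic can be sketched as below. Note the assumption: instead of reading a hardware counter, this example counts the floating point operations of a simple dot product by hand (one multiply and one add per element) and divides by the measured wall-clock time. The function names are illustrative.

```python
import time


def mflops(flop_count, seconds):
    """Millions of floating point operations per second."""
    return flop_count / seconds / 1e6


def dot(x, y):
    """Dot product: one multiply and one add per element pair."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


if __name__ == "__main__":
    n = 1_000_000
    x, y = [1.0] * n, [2.0] * n
    start = time.perf_counter()
    result = dot(x, y)
    elapsed = time.perf_counter() - start
    # 2 * n operations counted by hand; a hardware counter would supply this number.
    print(f"result={result}, {mflops(2 * n, elapsed):.1f} MFLOPS")
```

A hardware counter replaces the hand count with the number of floating point operations the CPU actually executed, which matters once the code is too complex to count by inspection.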

31 Excerpt of available counters (Nehalem X5570)
- Level 1 data cache misses
- Level 1 instruction cache misses
- Level 2 data cache misses
- Level 2 instruction cache misses
- Level 3 data cache misses
- Level 3 instruction cache misses
- Level 1 cache misses
- Level 2 cache misses
- Level 3 cache misses
- Requests for a snoop
- Requests for exclusive access to shared cache line
- Requests for exclusive access to clean cache line
- Requests for cache line invalidation
- Requests for cache line intervention
- Level 3 load misses
- Level 3 store misses
- Cycles branch units are idle
- Cycles integer units are idle
- Cycles floating point units are idle
- Cycles load/store units are idle
- Data translation lookaside buffer misses
- Instruction translation lookaside buffer misses
- Total translation lookaside buffer misses
- Level 1 load misses
- Level 1 store misses
- Level 2 load misses
- Level 2 store misses
- Branch target address cache misses
- Data prefetch cache misses
- Level 3 data cache hits
- Translation lookaside buffer shootdowns
- Failed store conditional instructions
- Successful store conditional instructions
- Total store conditional instructions
- Cycles stalled waiting for memory accesses
- Cycles stalled waiting for memory reads
- Cycles stalled waiting for memory writes
- Cycles with no instruction issue
- Cycles with maximum instruction issue
- Cycles with no instructions completed
- Cycles with maximum instructions completed
- Hardware interrupts
- Unconditional branch instructions
- Conditional branch instructions
- Conditional branch instructions taken
- Conditional branch instructions not taken
- Conditional branch instructions mispredicted
- Conditional branch instructions correctly predicted
- FMA instructions completed
- Instructions issued

32 Excerpt of available counters (Nehalem X5570)
- Instructions completed
- Integer instructions
- Floating point instructions
- Load instructions
- Store instructions
- Branch instructions
- Vector/SIMD instructions (could include integer)
- Cycles stalled on any resource
- Cycles the FP unit(s) are stalled
- Total cycles
- Load/store instructions completed
- Synchronization instructions completed
- Level 1 data cache hits
- Level 2 data cache hits
- Level 1 data cache accesses
- Level 2 data cache accesses
- Level 3 data cache accesses
- Level 1 data cache reads
- Level 2 data cache reads
- Level 3 data cache reads
- Level 1 data cache writes
- Level 2 data cache writes
- Level 3 data cache writes
- Level 1 instruction cache hits
- Level 2 instruction cache hits
- Level 3 instruction cache hits
- Level 1 instruction cache accesses
- Level 2 instruction cache accesses
- Level 3 instruction cache accesses
- Level 1 instruction cache reads
- Level 2 instruction cache reads
- Level 3 instruction cache reads
- Level 1 instruction cache writes
- Level 2 instruction cache writes
- Level 3 instruction cache writes
- Level 1 total cache hits
- Level 2 total cache hits
- Level 3 total cache hits
- Level 1 total cache accesses
- Level 2 total cache accesses
- Level 3 total cache accesses
- Level 1 total cache reads
- Level 2 total cache reads
- Level 3 total cache reads
- Level 1 total cache writes
- Level 2 total cache writes
- Level 3 total cache writes
- Floating point multiply instructions
- Floating point add instructions
- Floating point divide instructions

33 Excerpt of available counters (Nehalem X5570)
- Floating point square root instructions
- Floating point inverse instructions
- Floating point operations
- Floating point operations; optimized to count scaled single precision vector operations
- Floating point operations; optimized to count scaled double precision vector operations
- Single precision vector/SIMD instructions
- Double precision vector/SIMD instructions

34 Things to know about Hardware Performance Counters
Very limited resource
- Nehalem X5570: number of hardware counters per core: 7
- Xeon E5450: number of hardware counters per core: 5
However: one can only choose specific combinations of counters
Example: if one counts
- the floating point operations: only 33 different additional counters are available
- the level 3 cache misses: no further counters are available

35 Wrap up
Questions
