Profiling and Workflow


Preben N. Olsen
University of Oslo and Simula Research Laboratory
September 13

Agenda
1. Introduction: What? Why? How?
2. Profiling: Tracing, Performance Counters
3. Workflow

Introduction

Introduction
- What is profiling?
  - Software optimization
  - Measuring execution
- Why do we do profiling?
- How do we profile?

Introduction
- What is profiling?
- Why do we do profiling?
  - Identify bottlenecks
  - Limited resources
- How do we profile?

Introduction
- What is profiling?
- Why do we do profiling?
- How do we profile?
  - Instrumentation
  - Sampling
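The instrumentation approach can be sketched in a few lines of Python. This is only an illustration of the idea, not how gprof or any other tool mentioned here is implemented: the `instrument` decorator, the `stats` dictionary, and the stand-in `dct_quantize` workload are hypothetical names chosen for the example.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function call counts and accumulated wall-clock time (hypothetical store).
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def instrument(func):
    """Instrumentation: wrap func so that every call is counted and timed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@instrument
def dct_quantize(block):
    # Stand-in workload; not the real encoder routine.
    return [x * x for x in block]


for _ in range(1000):
    dct_quantize(range(64))

print(stats["dct_quantize"]["calls"], "calls")
print(f'{stats["dct_quantize"]["seconds"]:.6f} s total')
```

A sampling profiler would instead interrupt the program at regular intervals and record which function is currently executing. That trades exact call counts for lower overhead, because no code is added to every call.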

Introduction

    $ time ./mjpeg_encoder -w 352 -h 288 -o output [...]
    Limited to 10 frames.
    [...]
    real 0m7.407s
    user 0m7.290s
    sys  0m0.120s

    $ gprof mjpeg_encoder gmon.out
    [...]
    % time   calls   self ms/call   total ms/call   name
                                                    dct_quantize
                                                    write_block
                                                    encode
                                                    write_dht

Profiling

Profiling: gprof
- Enabled with a flag (-pg) to GCC
- Uses both instrumentation and sampling
- Fast, no simulation
- Userspace only, no kernel mode

Profiling: gprof on mjpeg
- The application must also be linked with -pg
- The application must exit correctly for the output to be created

Profiling: llvm-prof
- Analogous to gprof, but with a different approach
- Uses the LLVM JIT capabilities
- Use utils/profile.pl in the LLVM source tree

Profiling: cachegrind
- A tool in the Valgrind suite
- Simulates and analyzes CPU cache usage
- It is really slow: slower than gprof and llvm-prof
- Has a GUI frontend, kcachegrind

Profiling: cachegrind
Average memory access time ($t_{avg}$) is:

    $t_{avg} = t_{hit} \cdot p + t_{miss} \cdot (1 - p)$

- $t_{hit}$ is the access time when the data is in cache
- $p$ is the probability of having the data in cache
- $t_{miss}$ is the access time when the data is not in cache
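To get a feel for the formula, it can be evaluated for a few hit probabilities. The latencies below (1 cycle for a hit, 100 cycles for a miss) are assumed values for illustration only and do not come from the slides.

```python
def average_access_time(t_hit, t_miss, p):
    """Return t_avg = t_hit * p + t_miss * (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return t_hit * p + t_miss * (1.0 - p)


# Assumed latencies in cycles (hypothetical): 1 for a hit, 100 for a miss.
T_HIT, T_MISS = 1, 100

for hit_probability in (0.93, 0.99):
    t_avg = average_access_time(T_HIT, T_MISS, hit_probability)
    print(f"p = {hit_probability:.2f}: t_avg = {t_avg:.2f} cycles")
```

With these assumed latencies, raising the hit probability from 0.93 to 0.99 lowers the average access time from about 7.9 to about 2.0 cycles, which is why a few percentage points of miss rate can matter so much.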

Profiling: cachegrind on matrix multiplication
- original.c: ca. 6.30 s (7% miss)
- transpose.c: ca. 3.57 s (1% miss)
- unroll.c: ca. 3.45 s (2% miss)
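The source files are not shown in the slides. The sketch below therefore only assumes that transpose.c transposes the second operand, so that the inner loop reads both matrices row by row. It shows the two access patterns in Python; the function names are hypothetical, and in CPython the timing difference is not representative of the C versions measured above.

```python
def matmul_naive(a, b):
    """C = A * B; the inner loop walks down a column of B (strided access)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_transposed(a, b):
    """Same product, but B is transposed first so the inner loop walks a row."""
    bt = [list(column) for column in zip(*b)]
    n, p = len(a), len(bt)
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            row_a, row_bt = a[i], bt[j]
            c[i][j] = sum(x * y for x, y in zip(row_a, row_bt))
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_naive(a, b) == matmul_transposed(a, b))
```

In a row-major language such as C, the transposed variant touches consecutive addresses in both operands, which is one plausible explanation for a lower miss rate.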

Profiling
There are many other tools available, e.g.,
- gperftools, a sampling profiler from Google
- oprofile, for everything, including the kernel
- VTune from Intel, DTrace from Sun, etc.

Tracing

Tracing
Profiling gives you a snapshot, but tracing adds a time dimension.
- Execution timeline
- Increases memory demands
- Discover new aspects

Tracing
Figure: Threading trace

Tracing
- Profiling: find out where the hotspot is in the code
- Tracing: find out when the hotspot occurs in time
- Tracing is important for understanding parallel execution

Performance Counters

Performance Counters
Can provide detailed information about:
- Cache usage: prefetching, references, hits, or misses
- Branching: total branches, prediction hits or misses
- Cycles: e.g., cycles wasted by stalling in the CPU front-end

Performance Counters
- Accurate and very detailed information
- Low overhead compared to software events

Performance Counters: cachegrind vs perf on matrix multiplication
- original.c: ca. 6.30 s (0.10% miss)
- transpose.c: ca. 3.57 s (1.0% miss)
- unroll.c: ca. 3.45 s (0.35% miss)

Performance Counters
Examples of Intel's metrics or ratios of hardware events:
- Cycles per Instruction (CPI)
- Core bound and memory bound
- Threads contested memory access
- Instruction starvation, vectorization usage
Google "Hardware Event-based Metrics"
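Ratios like these are simple arithmetic on raw counter values. The following is a minimal sketch in which the counter readings are invented for illustration; the keys follow common perf event names, but the dictionary layout itself is an assumption.

```python
def cycles_per_instruction(cycles, instructions):
    """CPI: average number of cycles spent per retired instruction."""
    return cycles / instructions


def miss_ratio(misses, references):
    """Fraction of cache references that missed."""
    return misses / references


# Hypothetical counter readings from one run (not measured data).
counters = {
    "cycles": 12_000_000,
    "instructions": 8_000_000,
    "cache-references": 200_000,
    "cache-misses": 2_000,
}

cpi = cycles_per_instruction(counters["cycles"], counters["instructions"])
ratio = miss_ratio(counters["cache-misses"], counters["cache-references"])
print(f"CPI: {cpi:.2f}")
print(f"cache miss ratio: {ratio:.2%}")
```

Tracking such ratios rather than raw counts makes runs with different input sizes easier to compare.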

Workflow

Workflow
The performance analysis workflow is iterative.
1. Do (advanced) analysis
2. Interpret the results
3. Optimize your code
4. Recompile
5. Run the analysis again
6. Compare the two results
7. Keep or discard the optimization
8. Do it all over again
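The compare-and-decide step of this loop can be made mechanical. The sketch below assumes several timed runs before and after a change and keeps the optimization only if the median improves by a minimum margin; the function name, the 2% threshold, and the sample timings are all hypothetical.

```python
import statistics


def keep_optimization(baseline_runs, candidate_runs, min_gain=0.02):
    """Return True if the candidate's median runtime beats the baseline's
    median by at least min_gain (expressed as a fraction of the baseline)."""
    baseline = statistics.median(baseline_runs)
    candidate = statistics.median(candidate_runs)
    return (baseline - candidate) / baseline >= min_gain


# Hypothetical timings in seconds for three runs each.
before = [6.30, 6.35, 6.28]
after = [3.57, 3.60, 3.55]

if keep_optimization(before, after):
    print("keep the optimization")
else:
    print("discard the optimization")
```

Using the median of several runs rather than a single measurement reduces the chance of keeping or discarding a change because of noise.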

Workflow
- When a part of the code uses 99.9% of the execution time, you should definitely focus on it.
  - However, once you reach something close to optimal, be sure to move on.
- Your task is to exploit data parallelism using SIMD, meaning you need to identify opportunities in the time-consuming code.
  - Sometimes larger rewrites are needed.
- You are not allowed to use algorithms other than the ones specified, but there can be alternative ways of executing them.
  - That is, reduce computational complexity (big-O).
- Keep in mind the most efficient use of the processing unit: is there anything you can do to help the CPU?
  - Data alignment, SW prefetching, etc.

Scripting
"I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it." (Bill Gates)
- Automate the building and performance testing
- Your choice of analysis tool should be fast
- You should be able to evaluate results quickly
- Less time on bookkeeping, more time on coding

Scripting
- Use regexps on the output and save the performance numbers
- Script the comparison of the current run with the previous one
- You can use gnuplot for creating plots, e.g., bar charts
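As one possible way to do this, the sketch below extracts the real, user, and sys lines printed by the shell's `time` (in the format shown in the earlier encoder example), serializes them as JSON, and compares two runs. The function names and the "previous" numbers are hypothetical.

```python
import json
import re

# Matches lines such as "real 0m7.407s" (whitespace may be a tab).
TIME_LINE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$", re.MULTILINE)


def parse_time_output(text):
    """Return {'real': seconds, 'user': seconds, 'sys': seconds}."""
    return {
        kind: int(minutes) * 60 + float(seconds)
        for kind, minutes, seconds in TIME_LINE.findall(text)
    }


def compare(previous, current):
    """Relative change per metric; negative means the current run is faster."""
    return {
        key: (current[key] - previous[key]) / previous[key]
        for key in current
        if key in previous and previous[key] > 0
    }


sample = "Limited to 10 frames.\nreal 0m7.407s\nuser 0m7.290s\nsys 0m0.120s\n"
current = parse_time_output(sample)
print(json.dumps(current))  # the same string could be written to a results file

previous = {"real": 9.0, "user": 8.8, "sys": 0.12}  # hypothetical earlier run
for key, change in compare(previous, current).items():
    print(f"{key}: {change:+.1%}")
```

Storing one small JSON record per run keeps the history easy to diff and easy to feed into a plotting tool later.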

Version Control
- A nice way of remembering every optimization
- Commit optimizations one by one, not bundled
- Explain the rationale and the optimization in the commit message
- Remember the optimizations that failed

Version Control
For example...
- Create a list of the optimizations (read: commits) you want to include in the final report
- Write a script that iterates over the list, checks out each commit, does performance testing, and stores the result together with your well-written, informative commit message
- Create a nice plot comparing all of the results; use the plot and the commit messages in your report
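Such a script could look roughly like the sketch below. It assumes it is run inside a git working tree and that `measure` is a callable you supply that builds the program and returns a runtime in seconds; `benchmark_commits`, `run`, and `measure` are hypothetical names. The command runner is passed in as a parameter so the loop can be tried out without touching a real repository.

```python
import subprocess


def run(cmd):
    """Run a command, raise on failure, and return its stdout as text."""
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout


def benchmark_commits(commits, measure, run_cmd=run):
    """Check out each commit, measure it, and pair the result with its message.

    commits: list of commit hashes or other revisions
    measure: callable that builds and times the program, returning seconds
    run_cmd: callable that executes a command list and returns its stdout
    """
    results = []
    for commit in commits:
        run_cmd(["git", "checkout", "--quiet", commit])
        message = run_cmd(["git", "log", "-1", "--format=%B", commit]).strip()
        results.append(
            {"commit": commit, "message": message, "seconds": measure()}
        )
    return results


# Example call (not executed here; the hashes and build_and_time are placeholders):
# results = benchmark_commits(["1a2b3c4", "5d6e7f8"], measure=build_and_time)
```

The resulting list of records can be written out and plotted, with the commit messages doubling as labels or as raw material for the report.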

Report
- Start on the report before you start optimizing
- Write an outline according to the exam guidelines
- Working continuously on the report is smart
- Using LaTeX, you can include the report in your repository (collaboration)
- latexmk -pvc report.tex gives a continuous preview

The End
