Performance analysis in large C++ projects

Florent D'Halluin <d-halluin@lrde.epita.fr>
EPITA Research and Development Laboratory
CSI Seminar, May 13th, 2009

Section 1: Introduction to performance analysis
Topics: definitions, performance indicators, popular open-source profilers.

Definitions I
Profiling: investigation of a program's behavior. Done as the program executes. Used to locate bottlenecks.
Profilers: link functions to their resource usage. Work by instrumentation or by sampling. Output a flat profile or a call graph.

Definitions II
Optimization: can be done at several levels (design, source code, compile, run-time). Should be selective: 10% of the code is responsible for 90% of the execution time.
Donald Knuth: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Definitions III
Overhead: cost of the measurement process. Low for sampling profilers. Higher for instrumenting profilers. Its impact on the measures can be estimated.

Performance indicators
Wall time: time as perceived by a human, i.e. measured on a wall clock.
CPU time: time spent in CPU instructions. Divided into user time and system time.
Virtual memory: amount of virtual memory allocated to a program.

Popular open-source profilers: TIME
Widely available shell command. Shows CPU time and wall time for a program execution.

Popular open-source profilers: GPROF
Written in [...]. Part of the GNU Binutils. Uses both instrumentation and sampling. Low overhead. Results are not always accurate.

Popular open-source profilers: CALLGRIND
CALLGRIND: runtime-instrumentation profiler. Uses the VALGRIND framework. High overhead. Fairly complex output.
KCACHEGRIND: powerful GUI. Displays flat profile and call graph from CALLGRIND results. Can display data from other sources.

Section 2
Topics: features, structure, output, under the hood.

Features
CBS: C++ Benchmarking Suite for VAUCANSON. Manual instrumentation profiler. Low overhead. Adaptable output.

Structure
LIBBENCH: profiling library. TIMER measures time and generates the call graph. MEMPLOT measures memory usage.
XML archiving: own (simple) XML format. Stores benchmarks and statistics to be processed.
Plot helpers: convenient scripts for GNUPLOT.

Output
3 output elements: benchmark summary, flat profile and call graph, memory consumption.
3 output formats: XML, DOT, text.
Several verbosity levels: adaptable to one's needs.

Output: summary (text)
Listing: Summary text output.

    Profiler comparison

    [Description:]
    A simple program that consumes time and memory
    The parameter n defines the program complexity, i.e. the time and memory taken.

    [Infos:]
    Date: Mon May 4 14:43:

    [Parameters:]
    n: 20

    [Results:]
    memory peak:
    relative memory usage:
    time:
    time (system):
    time (user):
    time (wall):

Output: flat profile (text)
Listing: Flat profile text output.

    [Task list:]
    Charge  id: <name>           total   self    calls  self avg.  total avg
    [...]%  0: _program          [...]s  [...]s  [...]  [...]s     [...]s
    30.7%   3: file_io()         (C: 0)  [...]s  [...]  [...]ms    (C: 0)
    25.5%   2: new_integer()     (C: 0)  17.21s  [...]  [...]ms    (C: 0)
    23.5%   4: list_push_back()  15.84s  15.84s  [...]  [...]ms    37.73ms
    20.3%   1: parent()          67.44s  13.71s  [...]  [...]s     67.44s

Output: call graph (DOT)
Figure: DOT output example. Node labels in the graph:

    _program: Calls: 1, Self time: 67.44s
    parent(): Calls: 1, Self time: 13.71s
    new_integer(): Calls: 420, Self time: 17.21s
    (C: 0): Incoming calls: 20, Internal calls: 820, Outgoing calls: 0, Self time: 37.88s
    list_push_back(): Calls: 420, Self time: 15.84s
    file_io(): Calls: 420, Self time: 20.68s

Output: summary (XML)
Listing: Main section of the XML output.

    <?xml version="1.0" encoding="UTF-8"?>
    <bench>
      <name>Profiler comparison</name>
      <date>mon May 4 14:43:</date>
      <time></time>
      <description>A simple program that consumes time and memory
    The parameter n defines the program complexity, i.e. the time and memory taken.</description>
      <parameters>
        <parameter name="n" value="20"/>
      </parameters>
      <results>
        <result name="memory peak" value=""/>
        <result name="relative memory usage" value=""/>
        <result name="time" value=""/>
        <result name="time (system)" value=""/>
        <result name="time (user)" value=""/>
        <result name="time (wall)" value=""/>
      </results>
    </bench>

Statistics
3000 lines of C++ code. 600 lines of demo scripts. 400 lines of documentation. 200 lines of Perl.

System calls
Time: CPU time from getrusage(). Wall time from gettimeofday().
Memory: virtual memory obtained by parsing /proc/self/stat. Also: sbrk(0).

Call graph
Directed graph that represents calling relationships between tasks. Cycles (strongly connected components) show recursive tasks. Built in two steps: data collection and call graph computation. Implemented using boost::adjacency_list.

Section 3
Topics: performance trace, demo, optimizing algorithms.

VAUCANSON's performance trace
Goals: give a performance overview across versions. Compare VAUCANSON to OPENFST. Compare implementations (LISTG and BMIG). Measure the impact of structural changes in the library.
Implementation: benchmarking suite (make bench). 15 of the most used algorithms on common test cases. Output: CBS XML files.

Some results
Figure: Graph implementation influence on determinize() (2.66GHz Intel Core 2 Quad Q9400, 4GB RAM). Time (ms) against automaton complexity; series: CPU time, user and system, for bmig and listg.

Some results
Figure: Graph implementation influence on minimization-2n-moore() (2.66GHz Intel Core 2 Quad Q9400, 4GB RAM). Time (ms) against automaton complexity; series: CPU time, user and system, for bmig and listg.

Optimizing algorithms: quotient()
One of the most important algorithms in VAUCANSON. CBS showed a 30-line bottleneck in a 900-line file. Selective optimization led to a 40% performance boost.
Figure: Base automaton for quotient() benchmark.

Quotient code before optimization
Listing: quotient() code excerpt, before optimization.

    bool compute_going_in_states(partition_t& p, letter_t a)
    {
      for_all_(going_in_t, s, going_in_)
        s = false;

      for_all_(partition_t, s, p)
      {
        for (rdelta_iterator t(input_.value(), s); !t.done(); t.next())
        {
          // Some code optimization is possible here
          monoid_elt_t w(input_.series_of(t).structure().monoid(), a);
          if (input_.series_of(t).get(w) != input_.series().semiring().wzero_)
          {
            // [...]
          }
        }
      }
      return !met_class_.empty();
    }

Quotient code after optimization
Listing: quotient() code excerpt, after optimization.

    bool compute_going_in_states(partition_t& p, letter_t a)
    {
      for_all_(going_in_t, s, going_in_)
        s = false;

      // Compute as much as possible outside loops.
      semiring_elt_t weight_zero = input_.series().semiring().wzero_;
      monoid_elt_t w(input_.series().monoid(), a);

      for_all_(partition_t, s, p)
      {
        for (rdelta_iterator t(input_.value(), s); !t.done(); t.next())
        {
          // Optimization led to a 40% performance boost on test cases.
          if (input_.series_of(t).get(w) != weight_zero)
          {
            // [...]
          }
        }
      }
      return !met_class_.empty();
    }

Quotient benchmark results
Figure: Code optimization influence on quotient() (2GHz Intel Centrino, 1GB RAM). Time (ms) against product automaton complexity; series: CPU time, user and system, before and after optimization.

Future works
VAUCANSON: optimize algorithms. Expand benchmarking suite. Add OPENFST benchmarks.
CBS: add a script front-end. Turn into a separate open-source project. Add support for Mac & Windows.

Questions
Figure: A dog. Picture from FreeDigitalPhotos.net.
