Performance analysis in large C++ projects
Transcription
Florent D'HALLUIN <d-halluin@lrde.epita.fr>
EPITA Research and Development Laboratory
CSI Seminar, May 13th, 2009

Outline
1 Introduction to performance analysis

1 Introduction to performance analysis
Definitions. Performance indicators. Popular open-source profilers.

Definitions I
Profiling: Investigation of a program's behavior. Done as the program executes. Used to locate bottlenecks.
Profilers: Link functions to their resource usage. Work by instrumentation or by sampling. Output a flat profile or a call graph.

Definitions II
Optimization: Can be done at several levels (design, source code, compile, run-time). Should be selective: 10% of the code is responsible for 90% of the execution time.
Donald Knuth: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Definitions III
Overhead: Cost of the measurement process. Low for sampling profilers. Higher for instrumenting profilers. Its impact on the measures can be estimated.
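
One way to estimate the impact of overhead is to time the measurement call itself and subtract its cost from raw results. The sketch below is only an illustration of that idea: the helper names (estimate_clock_overhead_ns, correct_for_overhead_ns) are hypothetical, and std::chrono::steady_clock is an assumption, not something the slides mention.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average cost, in nanoseconds, of one clock read,
// estimated by reading the clock `samples` times in a tight loop.
double estimate_clock_overhead_ns(int samples)
{
  assert(samples > 0);
  using clock = std::chrono::steady_clock;
  const clock::time_point start = clock::now();
  clock::time_point last = start;
  for (int i = 0; i < samples; ++i)
    last = clock::now();  // the measurement call whose cost is estimated
  const std::chrono::duration<double, std::nano> elapsed = last - start;
  return elapsed.count() / samples;
}

// Hypothetical helper: remove the estimated cost of `probes` measurement
// calls from a raw measurement, clamping the result at zero.
double correct_for_overhead_ns(double measured_ns, long probes,
                               double overhead_ns)
{
  const double corrected = measured_ns - probes * overhead_ns;
  return corrected > 0.0 ? corrected : 0.0;
}
```

The estimate is an average over a tight loop, so it is likely to be a lower bound on the cost observed inside a real program, where caches and branch predictors are less favorable.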

Performance indicators
Wall time: Time as perceived by a human, i.e. measured on a wall clock.
CPU time: Time spent in CPU instructions. Divided into user time and system time.
Virtual memory: Amount of virtual memory allocated to a program.

Popular open-source profilers: TIME
Widely available shell command. Shows CPU time and wall time for a program execution.

Popular open-source profilers: GPROF
Written in [...]. Part of the GNU Binutils. Uses both instrumentation and sampling. Low overhead. Results are not always accurate.

Popular open-source profilers: CALLGRIND
CALLGRIND: Runtime-instrumentation profiler. Uses the VALGRIND framework. High overhead. Fairly complex output.
KCACHEGRIND: Powerful GUI. Displays flat profile and call graph from CALLGRIND results. Can display data from other sources.

Outline
1 Introduction to performance analysis
2 Features. Structure. Output. Under the hood.

Features
CBS: C++ Benchmarking Suite for VAUCANSON. Manual instrumentation profiler. Low overhead. Adaptable output.
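
Manual instrumentation means the programmer marks the tasks to measure directly in the source. The sketch below shows one common way to do this in C++, with a scope guard that records a call count and elapsed time when it is destroyed. All names here (TaskStats, Registry, ScopedTask, sum_to) are hypothetical and chosen for illustration; this is not the CBS interface, which the slides do not show.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-task counters (not the CBS data model).
struct TaskStats
{
  long calls = 0;
  double total_seconds = 0.0;
};

// Hypothetical registry mapping task names to their counters.
using Registry = std::map<std::string, TaskStats>;

// Scope guard: starts timing on construction, records on destruction.
class ScopedTask
{
public:
  ScopedTask(Registry& registry, const std::string& name)
    : stats_(registry[name]), start_(std::chrono::steady_clock::now())
  {}

  ~ScopedTask()
  {
    const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start_;
    ++stats_.calls;
    stats_.total_seconds += elapsed.count();
  }

  ScopedTask(const ScopedTask&) = delete;
  ScopedTask& operator=(const ScopedTask&) = delete;

private:
  TaskStats& stats_;  // std::map references stay valid across insertions
  std::chrono::steady_clock::time_point start_;
};

// Example of a manually instrumented function.
long sum_to(Registry& registry, long n)
{
  assert(n >= 0);
  ScopedTask task(registry, "sum_to");  // the only line added for profiling
  long sum = 0;
  for (long i = 1; i <= n; ++i)
    sum += i;
  return sum;
}
```

Because only the marked scopes pay for measurement, the overhead stays proportional to the number of instrumented calls rather than to the whole program.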

Structure
LIBBENCH: Profiling library. TIMER measures time and generates the call graph. MEMPLOT measures memory usage.
XML archiving: Own (simple) XML format. Stores benchmarks and statistics to be processed.
Plot helpers: Convenient scripts for GNUPLOT.

Output
3 output elements: Benchmark summary. Flat profile and call graph. Memory consumption.
3 output formats: XML. DOT. Text.
Several verbosity levels: Adaptable to one's needs.

Output: Summary (text)
Listing: Summary text output.
  Profiler comparison
  [Description:]
    A simple program that consumes time and memory
    The parameter n defines the program complexity, i.e. the time and memory taken.
  [Infos:]
    Date: Mon May 4 14:43:
  [Parameters:]
    n: 20
  [Results:]
    memory peak:
    relative memory usage:
    time:
    time (system):
    time (user):
    time (wall):

Output: Flat profile (text)
Listing: Flat profile text output.
  [Task list:]
  Charge  id: <name>            total    self     calls  self avg.  total avg.
          0: _program           s        s               s          s
  30.7%   3: file_io()          (C: 0)   s               ms         (C: 0)
  25.5%   2: new_integer()      (C: 0)   17.21s          ms         (C: 0)
  23.5%   4: list_push_back()   15.84s   15.84s          ms         37.73ms
  20.3%   1: parent()           67.44s   13.71s          s          67.44s

Output: Call graph (DOT)
Figure: DOT output example. Node labels:
  _program: Calls: 1. Self time: 67.44s.
  parent(): Calls: 1. Self time: 13.71s.
  new_integer(): Calls: 420. Self time: 17.21s.
  file_io(): Calls: 420. Self time: 20.68s.
  list_push_back(): Calls: 420. Self time: 15.84s.
  (C: 0): Incoming calls: 20. Internal calls: 820. Outgoing calls: 0. Self time: 37.88s.

Output: Summary (XML)
Listing: Main section of the XML output.
  <?xml version="1.0" encoding="UTF-8"?>
  <bench>
    <name>Profiler comparison</name>
    <date>Mon May 4 14:43:</date>
    <time></time>
    <description>A simple program that consumes time and memory
    The parameter n defines the program complexity, i.e. the time and memory taken.</description>
    <parameters>
      <parameter name="n" value="20"/>
    </parameters>
    <results>
      <result name="memory peak" value=""/>
      <result name="relative memory usage" value=""/>
      <result name="time" value=""/>
      <result name="time (system)" value=""/>
      <result name="time (user)" value=""/>
      <result name="time (wall)" value=""/>
    </results>
  </bench>

Statistics
3000 lines of C++ code. 600 lines of demo scripts. 400 lines of documentation. 200 lines of Perl.

System calls
Time: CPU time: getrusage(). Wall time: gettimeofday().
Memory: Virtual memory: parse /proc/self/stat. Also: sbrk(0).
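
The three measurements named on this slide can be sketched as follows. This is a minimal illustration rather than the CBS code: the function and type names (CpuTimes, cpu_times, wall_seconds, virtual_memory_bytes) are hypothetical, and the /proc/self/stat parsing assumes a Linux system, where the virtual memory size is field 23 (vsize, in bytes) according to the proc(5) manual page.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <sys/resource.h>
#include <sys/time.h>

// Hypothetical result type: CPU time split into user and system parts.
struct CpuTimes
{
  double user_seconds;
  double system_seconds;
};

static double to_seconds(const timeval& tv)
{
  return tv.tv_sec + tv.tv_usec / 1e6;
}

// CPU time of the current process, via getrusage().
CpuTimes cpu_times()
{
  rusage usage{};
  const int status = getrusage(RUSAGE_SELF, &usage);
  assert(status == 0);
  (void) status;
  return {to_seconds(usage.ru_utime), to_seconds(usage.ru_stime)};
}

// Wall time in seconds since the Epoch, via gettimeofday().
double wall_seconds()
{
  timeval now{};
  gettimeofday(&now, nullptr);
  return to_seconds(now);
}

// Virtual memory size in bytes, read from /proc/self/stat (Linux only).
// Returns 0 when the file is missing or cannot be parsed.
unsigned long virtual_memory_bytes()
{
  std::ifstream stat("/proc/self/stat");
  std::string line;
  if (!std::getline(stat, line))
    return 0;
  // Field 2 (the command name) is parenthesized and may contain spaces,
  // so parsing resumes after the last ')'.
  const std::string::size_type close = line.rfind(')');
  if (close == std::string::npos)
    return 0;
  std::istringstream rest(line.substr(close + 1));
  std::string field;
  // The first token after the command name is field 3; stop at field 23.
  for (int i = 3; i <= 23; ++i)
    if (!(rest >> field))
      return 0;
  return std::stoul(field);
}
```

Calling cpu_times() and wall_seconds() before and after a task and subtracting gives the task's CPU and wall time; sbrk(0), also listed on the slide, is not covered by this sketch.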

Call graph
Directed graph that represents calling relationships between tasks. Cycles (strongly connected components) show recursive tasks. Built in two steps: data collection and call graph computation. Implemented using boost::adjacency_list.
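
The two steps can be sketched as below: calls are first recorded as directed edges, then strongly connected components are computed so that mutually recursive tasks end up in the same component. To stay free of dependencies, this sketch swaps boost::adjacency_list for plain std::vector adjacency lists and uses Tarjan's algorithm; the class name CallGraph and its members are hypothetical, not taken from CBS.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical call graph over tasks numbered 0 .. tasks-1.
class CallGraph
{
public:
  explicit CallGraph(int tasks) : successors_(tasks) {}

  int size() const { return static_cast<int>(successors_.size()); }

  // Step 1, data collection: record that `caller` called `callee`.
  void add_call(int caller, int callee)
  {
    assert(caller >= 0 && caller < size());
    assert(callee >= 0 && callee < size());
    successors_[caller].push_back(callee);
  }

  // Step 2, computation: component id of every task (Tarjan's algorithm).
  // Two tasks share an id exactly when each can reach the other.
  std::vector<int> components() const
  {
    State s;
    s.index.assign(size(), -1);
    s.low.assign(size(), 0);
    s.component.assign(size(), -1);
    s.on_stack.assign(size(), false);
    for (int v = 0; v < size(); ++v)
      if (s.index[v] == -1)
        visit(v, s);
    return s.component;
  }

private:
  struct State
  {
    std::vector<int> index, low, component, stack;
    std::vector<bool> on_stack;
    int next_index = 0;
    int next_component = 0;
  };

  void visit(int v, State& s) const
  {
    s.index[v] = s.low[v] = s.next_index++;
    s.stack.push_back(v);
    s.on_stack[v] = true;
    for (int w : successors_[v])
    {
      if (s.index[w] == -1)
      {
        visit(w, s);
        s.low[v] = std::min(s.low[v], s.low[w]);
      }
      else if (s.on_stack[w])
        s.low[v] = std::min(s.low[v], s.index[w]);
    }
    if (s.low[v] == s.index[v])  // v is the root of a component
    {
      int w;
      do
      {
        w = s.stack.back();
        s.stack.pop_back();
        s.on_stack[w] = false;
        s.component[w] = s.next_component;
      } while (w != v);
      ++s.next_component;
    }
  }

  std::vector<std::vector<int>> successors_;
};
```

A component with more than one task, or a task with an edge to itself, corresponds to recursion; a profiler can then report times for the component as a whole instead of for each task inside the cycle.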

Outline
1 Introduction to performance analysis
3 Performance trace. Demo. Optimizing algorithms.

VAUCANSON's performance trace
Goals: Give a performance overview across versions. Compare VAUCANSON to OPENFST. Compare implementations (LISTG and BMIG). Measure the impact of structural changes in the library.
Implementation: Benchmarking suite (make bench). 15 of the most used algorithms on common test cases. Output CBS XML files.

Some results
Figure: Graph implementation influence on determinize() (2.66GHz Intel Core 2 Quad Q9400, 4GB RAM). Time (ms) against automaton complexity. Series: CPU time, user time and system time, for bmig and for listg.
Figure: Graph implementation influence on minimization-2n-moore() (2.66GHz Intel Core 2 Quad Q9400, 4GB RAM). Time (ms) against automaton complexity. Series: CPU time, user time and system time, for bmig and for listg.

Optimizing algorithms
quotient(): One of the most important algorithms in VAUCANSON. CBS showed a 30-line bottleneck in a 900-line file. Selective optimization led to a 40% performance boost.
Figure: Base automaton for quotient() benchmark.

Quotient code before optimization
Listing: quotient() code excerpt, before optimization.
  bool compute_going_in_states(partition_t& p, letter_t a)
  {
    for_all_(going_in_t, s, going_in_)
      s = false;

    for_all_(partition_t, s, p)
    {
      for (rdelta_iterator t(input_.value(), s); !t.done(); t.next())
      {
        // Some code optimization is possible here
        monoid_elt_t w(input_.series_of(t).structure().monoid(), a);
        if (input_.series_of(t).get(w) != input_.series().semiring().wzero_)
        {
          // [...]
        }
      }
    }
    return !met_class_.empty();
  }

Quotient code after optimization
Listing: quotient() code excerpt, after optimization.
  bool compute_going_in_states(partition_t& p, letter_t a)
  {
    for_all_(going_in_t, s, going_in_)
      s = false;

    // Compute as much as possible outside loops.
    semiring_elt_t weight_zero = input_.series().semiring().wzero_;
    monoid_elt_t w(input_.series().monoid(), a);

    for_all_(partition_t, s, p)
    {
      for (rdelta_iterator t(input_.value(), s); !t.done(); t.next())
      {
        // Optimization led to a 40% performance boost on test cases.
        if (input_.series_of(t).get(w) != weight_zero)
        {
          // [...]
        }
      }
    }
    return !met_class_.empty();
  }

Quotient benchmark results
Figure: Code optimization influence on quotient() (2GHz Intel Centrino, 1GB RAM). Time (ms) against product automaton complexity. Series: CPU time, user time and system time, before and after optimization.

Future works
VAUCANSON: Optimize algorithms. Expand benchmarking suite. Add OPENFST benchmarks.
CBS: Add a script front-end. Turn into a separate open-source project. Add support for Mac & Windows.

Questions
Figure: A dog. Picture from FreeDigitalPhotos.net.
Cache Performance Analysis with Callgrind and KCachegrind VI-HPS Tuning Workshop 8 September 2011, Aachen Josef Weidendorfer Computer Architecture I-10, Department of Informatics Technische Universität