Testing AutoFDO for Geant4

Size: px
Start display at page:

Download "Testing AutoFDO for Geant4"

Transcription

1 Testing for Geant4 Nathalie Rauschmayr IT-CF-FPP With help from Benedikt Hegner and Shahzad Malik Muzaffar 1/29 Testing for Geant4 Nathalie Rauschmayr

2 Introduction Idea: Autotuning Compile 2/29 Testing for Geant4 Nathalie Rauschmayr

3 Introduction Idea: Autotuning Run Compile 3/29 Testing for Geant4 Nathalie Rauschmayr

4 Introduction Idea: Autotuning Feedback Run Compile 4/29 Testing for Geant4 Nathalie Rauschmayr

5 Introduction Idea: Autotuning Feedback Run Compile Concept exists already for some time: Profile Guided Optimization 4/29 Testing for Geant4 Nathalie Rauschmayr

6 Introduction Profile Guided Optimization is useful for: Code that contains a lot of branches that are difficult to predict at compile time Performance sensitive code When running the same code over and over again 5/29 Testing for Geant4 Nathalie Rauschmayr

7 Introduction Profile Guided Optimization: Uses profiling to improve runtime performance Analyses code sections that are frequently executed Based on profiles the compiler might change: Inlining Virtual Call Speculation Register allocation Basic Block Optimization Function Layout Conditional Branch Optimization Dead Code Separation 6/29 Testing for Geant4 Nathalie Rauschmayr

8 Introduction Two approaches for Profile Guided Optimization (PGO): Modify binary (instrumentation) Monitor unaltered binary (sampling with perf) transforms perf-profiles into the format that can be used by gcc/clang for Feedback Directed Optimization (FDO) Developed by Google 7/29 Testing for Geant4 Nathalie Rauschmayr

9 Difference between sampling and instrumentation Instrumentation based PGO: gcc -fprofile-use test.c -o test gcc -fprofile-generate test.c -o test test.gcno test.gcda Instrumentation Run Recompile Production Environment 8/29 Testing for Geant4 Nathalie Rauschmayr

10 Difference between sampling and instrumentation Instrumentation based PGO: gcc -fprofile-use test.c -o test gcc -fprofile-generate test.c -o test test.gcno test.gcda Instrumentation Run Recompile Disadvantages: Production Environment Tedious dual-compilation Produces a lot of small output files (in case of Geant4: 1698 files, each smaller than 100KB) Cannot run easily in production environment Instrumented binary might be significantly slower 8/29 Testing for Geant4 Nathalie Rauschmayr

11 Difference between sampling and instrumentation Sampling Based FDO (): Run production binary with perf Create production binary Convert perf-profile Recompile with converted perf-profile 9/29 Testing for Geant4 Nathalie Rauschmayr

12 Difference between sampling and instrumentation Sampling Based FDO (): gcc -O3 -ggdb -frecord-compilation-info-in-elf -D DEBUG test.c -o test Create production binary perf record -b -e cpu/event=0xc4,umask=0x20, name=br inst retired near taken, period= /pp./test Run production binary with perf create gcov --binary=./test --profile=perf.data --gcov=binary.gcov -gcov version=1 Convert perf-profile gcc -O3 -fauto-profile=test.gcov test.c -o test Recompile with converted perf-profile 10/29 Testing for Geant4 Nathalie Rauschmayr

13 Difference between sampling and instrumentation compared to instrumentation based PGO: Profile data can be obtained in production environment Works on optimized builds It provides a tool to merge profiles from multiple runs Only one output file per run 11/29 Testing for Geant4 Nathalie Rauschmayr

14 General Caveats The sample needs to be representative for the typical usage scenarios Otherwise: PGO could possible slow down the performance Need many profiles and runs Unbiased branches 12/29 Testing for Geant4 Nathalie Rauschmayr

15 Testcases Applications: CMS Detector Simulation (FullCMS) Simulation step of CMSSW using static build of Geant4 (cmsrun) Input data/workflow needs to be representative: How many events needed as training data? What if job configuration changes? What if job type changes? 13/29 Testing for Geant4 Nathalie Rauschmayr

16 Testcases Training data Run Number of Events FullCMS run FullCMS run 100,500,1k cmsrun config1 cmsrun config1 20, 50, 100 cmsrun config1 cmsrun config2 20, 50, 100 cmsrun config2 cmsrun config2 20, 50, 100 FullCMS run cmsrun config2 1k FullCMS: Geant4 example with particle gun cmsrun config1: TTbar event generation and simulation (CMSSW 7 3 1) cmsrun config2: Wjets event generation and simulation (CMSSW 7 3 1) 14/29 Testing for Geant4 Nathalie Rauschmayr

17 CMS Full Detector Simulation Training data Run Number of Events FullCMS run 100 events FullCMS 100, 500, 1k FullCMS run 500 events FullCMS 100, 500, 1k FullCMS run 1k events FullCMS 100, 500, 1k Processing 100 events Processing 500 events Runtime in [s] % -10.4% -11.5% Runtime in [s] % -9.5% -10.2% Normal 100 events 500 events 1000 events Normal 100 events 500 events 1000 Events 15/29 Testing for Geant4 Nathalie Rauschmayr

18 CMS Full Detector Simulation 1,400 Processing 1000 events 1,350 Runtime in [s] 1,300 1, % -10.7% -11.4% 1,200 1,150 Normal 100 events 500 events 1000 events 16/29 Testing for Geant4 Nathalie Rauschmayr

19 Simulation step of CMSSW using BigProducts Used CMSSW 7 3 1: SLC6, kernel 3.16 gcc 4.8 It uses BigProducts by default (developed by Shazhad) pluginsimulation.so: linked against static Geant4 libraries Obtain perf-profile for cmsrun, but then optimize only pluginsimulation.so Testcase: TTbar Step 1: Event generation and simulation 20, 50, 100 events 17/29 Testing for Geant4 Nathalie Rauschmayr

20 Simulation step of CMSSW using BigProducts Training data Run Number of Events cmsrun 20 events config1 cmsrun config1 20, 50, 100 cmsrun 50 events config1 cmsrun config1 20, 50, 100 cmsrun 100 events config1 cmsrun config1 20, 50, 100 Processing 20 events Processing 50 events 580 1,350 Runtime in [s] % Runtime in [s] 1, % -8.4% -6.1% 1, % -7.0% Normal 20 events 50 events 100 events Normal 20 events 50 events 100 Events 18/29 Testing for Geant4 Nathalie Rauschmayr

21 Simulation step of CMSSW using BigProducts Processing 100 events 2,700 Runtime in [s] 2, % -6.5% -7.4% 2,500 Normal 20 events 50 events 100 events 19/29 Testing for Geant4 Nathalie Rauschmayr

22 Simulation step of CMSSW using BigProducts cmsrun config2: took Pythia configurations from Wjet Pt TeV cfi.py in CMSSW 8 1 X Training data Run Number of Events cmsrun 100 events config1 cmsrun config2 20, 50, 100 Processing 20 events Processing 50 events Processing 100 events 1,850 4,400 8,500 Runtime in [s] 1,800 1,750 1,700 1, % Runtime in [s] 4,200 4,000 3,800 Runtime in [s] 8, % -11.9% 7,500 1,600 Normal 100 events Normal 100 events Normal 100 events 20/29 Testing for Geant4 Nathalie Rauschmayr

23 Simulation step of CMSSW using BigProducts Training data Run Number of Events cmsrun 100 events config1 cmsrun config2 20, 50, 100 cmsrun 100 events config2 cmsrun config2 20, 50, 100 Processing 20 events Processing 50 events Processing 100 events 1,850 4,400 8,500 Runtime in [s] 1,800 1,750 1,700 1, % -7.3% Runtime in [s] 4,200 4,000 3, % Runtime in [s] 8, % -8.5% 7, % 1,600 Normal 100 events 100 events Normal 100 events 100 events Normal 100 events 100 events 21/29 Testing for Geant4 Nathalie Rauschmayr

24 Simulation step of CMSSW using BigProducts Training data Run Number of Events fullcms 100 events cmsrun job config2 20, 50, 100 Processing 20 events 1,380 Processing 50 events Processing 100 events Runtime in [s] Normal -3.8% 100 events Runtime in [s] 1,360 1,340 1,320 1,300 1,280 1,260 Normal -4.8% 100 events Runtime in [s] 2,750 2,700 2,650 2,600 2,550 Normal -5.1% 100 events 22/29 Testing for Geant4 Nathalie Rauschmayr

25 gives useful insight Google-gcc provides the flag -fcheck-branch-annotation: objcopy -O binary --set-section-flags.gnu.switches.text.branch.annotation=alloc -j.gnu.switches.text.branch.annotation libg4processes.so libannotated Example Output: G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d ec56d66a G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d fc7d9710c G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb dc9 23/29 Testing for Geant4 Nathalie Rauschmayr

26 gives useful insight Google-gcc provides the flag -fcheck-branch-annotation: objcopy -O binary --set-section-flags.gnu.switches.text.branch.annotation=alloc -j.gnu.switches.text.branch.annotation libg4processes.so libannotated Example Output: G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d ec56d66a G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d fc7d9710c G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb dc9 File Line Basic block count Annotated Measured branch probability Assumed branch probability Hash value 23/29 Testing for Geant4 Nathalie Rauschmayr

27 gives useful insight events 50 events 100 events Basic block counts G4PhotoNuclearCrossSection.cc: G4NucleiModel.cc:1332 stl uninitialized.h:74 G4hBremsstrahlungModel.cc:87 G4CookPairingCorrections.hh:56 vector.tcc:184 G4MuBremsstrahlungModel.cc:300 G4InuclParticle.hh:83 G4ElectroNuclearCrossSection.cc:2327 locale facets.h:867 G4EmCorrections.cc:391 G4VEnergyLossProcess.cc:1099 G4FastVector.hh:68 G4VEnergyLossProcess.cc:1412 G4VEnergyLossProcess.cc:1143 G4VEnergyLossProcess.cc:1142 G4UniversalFluctuation.cc:224 G4UniversalFluctuation.cc:213 G4Poisson.hh:57 G4ProcessManager.cc:273 G4Transportation.cc:731 24/29 Testing for Geant4 Nathalie Rauschmayr

28 Compiler heuristics are not always accurate for branch probabilities events 50 events events without profile Branch probability G4PhotoNuclearCrossSection.cc: G4NucleiModel.cc:1332 stl uninitialized.h:74 G4hBremsstrahlungModel.cc:87 G4CookPairingCorrections.hh:56 vector.tcc:184 G4MuBremsstrahlungModel.cc:300 G4InuclParticle.hh:83 G4ElectroNuclearCrossSection.cc:2327 locale facets.h:867 G4EmCorrections.cc:391 G4VEnergyLossProcess.cc:1099 G4FastVector.hh:68 G4VEnergyLossProcess.cc:1412 G4VEnergyLossProcess.cc:1143 G4UniversalFluctuation.cc:224 G4VEnergyLossProcess.cc:1142 G4UniversalFluctuation.cc:213 G4Poisson.hh:57 G4ProcessManager.cc:273 G4Transportation.cc:731 25/29 Testing for Geant4 Nathalie Rauschmayr

29 Problems encountered Google gcc 4.8 has been merged with gcc main branch, but normal gcc is missing the flag -frecord-compilation-info-in-elf Flag creates a new section: Object files may have distinct compile options. The new section preserves compile options for every object file. >>>readelf -S -W libg4tracking.so less There are 37 section headers, starting at offset 0x615a8: Section Headers: [Nr] Name Type Address Off Size ES Flg Lk Inf Al [ 0] NULL [ 1].hash HASH ec 04 A [...] [26].gnu.switches.text.quote_paths PROGBITS bb [27].gnu.switches.text.bracket_paths PROGBITS afb 007c [28].gnu.switches.text.system_paths PROGBITS c [29].gnu.switches.text.cpp_defines PROGBITS ca9c 00117e [30].gnu.switches.text.cpp_includes PROGBITS dc1a 0006bb [31].gnu.switches.text.cl_args PROGBITS e2d5 0029b [32].gnu.switches.text.lipo_info PROGBITS c f /29 Testing for Geant4 Nathalie Rauschmayr

30 Problems encountered The information in.gnu.switches.text is used to build the module map create gcov dumps the module map to a file (ending with.imports) >>>head output.gcov.imports /data/geant p03/source/event/src/g4eventmanager.cc: /data/geant p03/ /data/geant p03/source/event/src/g4smarttrackstack.cc: /data/geant p0 /data/geant p03/source/event/src/g4stackmanager.cc: /data/geant p03/ /data/geant p03/source/externals/clhep/src/evaluator.cc: /data/geant /data/geant p03/source/externals/clhep/src/lorentzrotation.cc /data/geant p03/source/externals/clhep/src/lorentzvector.cc /data/geant p03/source/externals/clhep/src/lorentzvectorl.cc 27/29 Testing for Geant4 Nathalie Rauschmayr

31 Problems encountered Apart from the module map creates a symbol map Symbol map does not contain symbols coming from shared libraries It limits the usability: Statically linked libraries required Optimize only the library causing the largest hotspots icc-files are not recognized (fixed) recent versions of perf could be problematic because data format is different works best with sampling the Last Branch Records (LBR) LBR: Collection of register pairs that store source and destination addresses of recently executed branches (currently only Intel CPUs) 28/29 Testing for Geant4 Nathalie Rauschmayr

32 Summary Quite decent speedups: 5-13% Some fixes needed (shared libraries, gcc-flag) Stable against change of simulation scenario Easy deployment 1 Start perf together with the job 2 Gather profiles 3 Convert and merge profiles 4 Add compiler flag in CMake scripts 29/29 Testing for Geant4 Nathalie Rauschmayr

FDO: Magic Make My Program Faster compilation option?

FDO: Magic Make My Program Faster compilation option? FDO: Magic Make My Program Faster compilation option? Paweł Moll Embedded Linux Conference Europe, Berlin, October 2016 Agenda FDO Basics Instrumentation based FDO Sample based ( Auto ) FDO Deployments

More information

Operating Systems. 09. Memory Management Part 1. Paul Krzyzanowski. Rutgers University. Spring 2015

Operating Systems. 09. Memory Management Part 1. Paul Krzyzanowski. Rutgers University. Spring 2015 Operating Systems 09. Memory Management Part 1 Paul Krzyzanowski Rutgers University Spring 2015 March 9, 2015 2014-2015 Paul Krzyzanowski 1 CPU Access to Memory The CPU reads instructions and reads/write

More information

Split debug symbols for pkgsrc builds

Split debug symbols for pkgsrc builds Split debug symbols for pkgsrc builds Short report after Google Summer of Code 2016 Leonardo Taccari leot@netbsd.org EuroBSDcon 2016 NetBSD Summit 1 / 23 What will we see in this presentation? ELF, DWARF

More information

Stack frame unwinding on ARM

Stack frame unwinding on ARM Stack frame unwinding on ARM Ken Werner LDS, Budapest 2011 http:/www.linaro.org why? Who needs to unwind the stack? C++ exceptions GDB anyone who wants to display the call chain Unwinding in General How

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition

Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition Aaron Birkland Consultant Cornell Center for Advanced Computing December 11, 2012 1 Simple Vectorization This lab serves as an

More information

Scientific Programming in C IX. Debugging

Scientific Programming in C IX. Debugging Scientific Programming in C IX. Debugging Susi Lehtola 13 November 2012 Debugging Quite often you spend an hour to write a code, and then two hours debugging why it doesn t work properly. Scientific Programming

More information

Link 3. Symbols. Young W. Lim Mon. Young W. Lim Link 3. Symbols Mon 1 / 42

Link 3. Symbols. Young W. Lim Mon. Young W. Lim Link 3. Symbols Mon 1 / 42 Link 3. Symbols Young W. Lim 2017-09-11 Mon Young W. Lim Link 3. Symbols 2017-09-11 Mon 1 / 42 Outline 1 Linking - 3. Symbols Based on Symbols Symbol Tables Symbol Table Examples main.o s symbol table

More information

EE458 - Embedded Systems Lecture 4 Embedded Devel.

EE458 - Embedded Systems Lecture 4 Embedded Devel. EE458 - Embedded Lecture 4 Embedded Devel. Outline C File Streams References RTC: Chapter 2 File Streams man pages 1 Cross-platform Development Environment 2 Software available on the host system typically

More information

Geant4 MT Performance. Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept , 2016

Geant4 MT Performance. Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept , 2016 Geant4 MT Performance Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept. 12-16, 2016 Introduction Geant4 multi-threading (Geant4 MT) capabilities Event-level parallelism

More information

Benchmarking AMD64 and EMT64

Benchmarking AMD64 and EMT64 Benchmarking AMD64 and EMT64 Hans Wenzel, Oliver Gutsche, FNAL, Batavia, IL 60510, USA Mako Furukawa, University of Nebraska, Lincoln, USA Abstract We have benchmarked various single and dual core 64 Bit

More information

Progress Report Toward a Thread-Parallel Geant4

Progress Report Toward a Thread-Parallel Geant4 Progress Report Toward a Thread-Parallel Geant4 Gene Cooperman and Xin Dong High Performance Computing Lab College of Computer and Information Science Northeastern University Boston, Massachusetts 02115

More information

Evaluating Performance Via Profiling

Evaluating Performance Via Profiling Performance Engineering of Software Systems September 21, 2010 Massachusetts Institute of Technology 6.172 Professors Saman Amarasinghe and Charles E. Leiserson Handout 6 Profiling Project 2-1 Evaluating

More information

Reducing CPU Consumption of Geant4 Simulation in ATLAS

Reducing CPU Consumption of Geant4 Simulation in ATLAS Reducing CPU Consumption of Geant4 Simulation in ATLAS John Chapman (University of Cambridge) on behalf of the ATLAS Simulation Group Joint WLCG & HSF Workshop 2018 Napoli, Italy - 28th March 2018 Current

More information

Building Binary Optimizer with LLVM

Building Binary Optimizer with LLVM Building Binary Optimizer with LLVM Maksim Panchenko maks@fb.com BOLT Binary Optimization and Layout Tool Built in less than 6 months x64 Linux ELF Runs on large binary (HHVM, non-jitted part) Improves

More information

Systems software design. Software build configurations; Debugging, profiling & Quality Assurance tools

Systems software design. Software build configurations; Debugging, profiling & Quality Assurance tools Systems software design Software build configurations; Debugging, profiling & Quality Assurance tools Who are we? Krzysztof Kąkol Software Developer Jarosław Świniarski Software Developer Presentation

More information

OpenMP Example. $ ssh # You might have to type yes if this is the first time you log on

OpenMP Example. $ ssh # You might have to type yes if this is the first time you log on OpenMP Example Day 1, afternoon session 1: We examine a serial and parallel implementation of a code solving an N-body problem with a star, a planet and many small particles near the planet s radius. The

More information

Profiling and Workflow

Profiling and Workflow Profiling and Workflow Preben N. Olsen University of Oslo and Simula Research Laboratory preben@simula.no September 13, 2013 1 / 34 Agenda 1 Introduction What? Why? How? 2 Profiling Tracing Performance

More information

Binghamton University. CS-220 Spring Cached Memory. Computer Systems Chapter

Binghamton University. CS-220 Spring Cached Memory. Computer Systems Chapter Cached Memory Computer Systems Chapter 6.2-6.5 Cost Speed The Memory Hierarchy Capacity The Cache Concept CPU Registers Addresses Data Memory ALU Instructions The Cache Concept Memory CPU Registers Addresses

More information

Process. One or more threads of execution Resources required for execution

Process. One or more threads of execution Resources required for execution Memory Management 1 Learning Outcomes Appreciate the need for memory management in operating systems, understand the limits of fixed memory allocation schemes. Understand fragmentation in dynamic memory

More information

Frequently asked software questions for EM 8-bit Microcontrollers CoolRISC core architecture

Frequently asked software questions for EM 8-bit Microcontrollers CoolRISC core architecture EM MICROELECTRONIC - MARIN SA AppNote 60 Title: Product Family: Application Note 60 Frequently asked software questions for EM 8-bit Microcontrollers CoolRISC core architecture Part Number: EM6812, EM9550,

More information

An Assembler for the MSSP Distiller Eric Zimmerman University of Illinois, Urbana Champaign

An Assembler for the MSSP Distiller Eric Zimmerman University of Illinois, Urbana Champaign An Assembler for the MSSP Distiller Eric Zimmerman University of Illinois, Urbana Champaign Abstract It is important to have a means of manually testing a potential optimization before laboring to fully

More information

ECE 498 Linux Assembly Language Lecture 1

ECE 498 Linux Assembly Language Lecture 1 ECE 498 Linux Assembly Language Lecture 1 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 13 November 2012 Assembly Language: What s it good for? Understanding at a low-level what

More information

Sista: Improving Cog s JIT performance. Clément Béra

Sista: Improving Cog s JIT performance. Clément Béra Sista: Improving Cog s JIT performance Clément Béra Main people involved in Sista Eliot Miranda Over 30 years experience in Smalltalk VM Clément Béra 2 years engineer in the Pharo team Phd student starting

More information

Improving Linux development with better tools

Improving Linux development with better tools Improving Linux development with better tools Andi Kleen Oct 2013 Intel Corporation ak@linux.intel.com Linux complexity growing Source lines in Linux kernel All source code 16.5 16 15.5 M-LOC 15 14.5 14

More information

Process. Memory Management

Process. Memory Management Process Memory Management One or more threads of execution Resources required for execution Memory (RAM) Program code ( text ) Data (initialised, uninitialised, stack) Buffers held in the kernel on behalf

More information

Process Environment. Pradipta De

Process Environment. Pradipta De Process Environment Pradipta De pradipta.de@sunykorea.ac.kr Today s Topic Program to process How is a program loaded by the kernel How does kernel set up the process Outline Review of linking and loading

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Libabigail & ABI compatibility

Libabigail & ABI compatibility Libabigail & ABI compatibility Taming the runtime linking problem Ben Woodard Consulting Engineer Dodji Seketeli Tools Engineer Aug 2017 Problems due to ABI Compatibility There are several problems cause

More information

Process. One or more threads of execution Resources required for execution. Memory (RAM) Others

Process. One or more threads of execution Resources required for execution. Memory (RAM) Others Memory Management 1 Process One or more threads of execution Resources required for execution Memory (RAM) Program code ( text ) Data (initialised, uninitialised, stack) Buffers held in the kernel on behalf

More information

A program execution is memory safe so long as memory access errors never occur:

A program execution is memory safe so long as memory access errors never occur: A program execution is memory safe so long as memory access errors never occur: Buffer overflows, null pointer dereference, use after free, use of uninitialized memory, illegal free Memory safety categories

More information

Process. One or more threads of execution Resources required for execution. Memory (RAM) Others

Process. One or more threads of execution Resources required for execution. Memory (RAM) Others Memory Management 1 Learning Outcomes Appreciate the need for memory management in operating systems, understand the limits of fixed memory allocation schemes. Understand fragmentation in dynamic memory

More information

Intro to Segmentation Fault Handling in Linux. By Khanh Ngo-Duy

Intro to Segmentation Fault Handling in Linux. By Khanh Ngo-Duy Intro to Segmentation Fault Handling in Linux By Khanh Ngo-Duy Khanhnd@elarion.com Seminar What is Segmentation Fault (Segfault) Examples and Screenshots Tips to get Segfault information What is Segmentation

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Code-Agnostic Performance Characterisation and Enhancement

Code-Agnostic Performance Characterisation and Enhancement Code-Agnostic Performance Characterisation and Enhancement Ben Menadue Academic Consultant @NCInews Who Uses the NCI? NCI has a large user base 1000s of users across 100s of projects These projects encompass

More information

Computer Organization & Assembly Language Programming

Computer Organization & Assembly Language Programming Computer Organization & Assembly Language Programming CSE 2312 Lecture 11 Introduction of Assembly Language 1 Assembly Language Translation The Assembly Language layer is implemented by translation rather

More information

Embedded Systems Programming

Embedded Systems Programming Embedded Systems Programming OS Linux - Toolchain Iwona Kochańska Gdansk University of Technology Embedded software Toolchain compiler and tools for hardwaredependent software developement Bootloader initializes

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Runtime Defenses against Memory Corruption

Runtime Defenses against Memory Corruption CS 380S Runtime Defenses against Memory Corruption Vitaly Shmatikov slide 1 Reading Assignment Cowan et al. Buffer overflows: Attacks and defenses for the vulnerability of the decade (DISCEX 2000). Avijit,

More information

Improving Linux Development with better tools. Andi Kleen. Oct 2013 Intel Corporation

Improving Linux Development with better tools. Andi Kleen. Oct 2013 Intel Corporation Improving Linux Development with better tools Andi Kleen Oct 2013 Intel Corporation ak@linux.intel.com Linux complexity growing Source lines in Linux kernel All source code 16.5 16 15.5 M-LOC 15 14.5 14

More information

Initial Explorations of ARM Processors for Scientific Computing

Initial Explorations of ARM Processors for Scientific Computing Initial Explorations of ARM Processors for Scientific Computing Peter Elmer - Princeton University David Abdurachmanov - Vilnius University Giulio Eulisse, Shahzad Muzaffar - FNAL Power limitations for

More information

Performance Profiling

Performance Profiling Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Outline History Understanding Profiling Understanding Performance Understanding Performance

More information

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization Requirements Relocation Memory management ability to change process image position Protection ability to avoid unwanted memory accesses Sharing ability to share memory portions among processes Logical

More information

CERN IT Technical Forum

CERN IT Technical Forum Evaluating program correctness and performance with new software tools from Intel Andrzej Nowak, CERN openlab March 18 th 2011 CERN IT Technical Forum > An introduction to the new generation of software

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

System Programming And C Language

System Programming And C Language System Programming And C Language Prof. Jin-soo Kim. (jinsookim@skku.edu) Pintos TA Jin-yeong, Bak. (dongdm@gmail.com) Kyung-min, Go. (gkm2164@gmail.com) 2010.09.28 1 Contents Important thing in system

More information

Operating Systems CMPSC 473. Process Management January 29, Lecture 4 Instructor: Trent Jaeger

Operating Systems CMPSC 473. Process Management January 29, Lecture 4 Instructor: Trent Jaeger Operating Systems CMPSC 473 Process Management January 29, 2008 - Lecture 4 Instructor: Trent Jaeger Last class: Operating system structure and basics Today: Process Management Why Processes? We have programs,

More information

Continuous integration a walk-through. Dr Alin Marin Elena

Continuous integration a walk-through. Dr Alin Marin Elena Continuous integration a walk-through Dr Alin Marin Elena Daresbury, 2017 Elena et al. Daresbury, vember 2017 1 Theory 2 Theory Elena et al. Daresbury, vember 2017 1 Theory 2 Elena et al. Daresbury, vember

More information

ECE 471 Embedded Systems Lecture 4

ECE 471 Embedded Systems Lecture 4 ECE 471 Embedded Systems Lecture 4 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 12 September 2013 Announcements HW#1 will be posted later today For next class, at least skim

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

How to run stable benchmarks

How to run stable benchmarks How to run stable benchmarks FOSDEM 2017, Brussels Victor Stinner victor.stinner@gmail.com BINARY_ADD optim In 2014, int+int optimization proposed: 14 patches, many authors Is is faster? Is it worth it?

More information

ECE 598 Advanced Operating Systems Lecture 10

ECE 598 Advanced Operating Systems Lecture 10 ECE 598 Advanced Operating Systems Lecture 10 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 22 February 2018 Announcements Homework #5 will be posted 1 Blocking vs Nonblocking

More information

Language Translation. Compilation vs. interpretation. Compilation diagram. Step 1: compile. Step 2: run. compiler. Compiled program. program.

Language Translation. Compilation vs. interpretation. Compilation diagram. Step 1: compile. Step 2: run. compiler. Compiled program. program. Language Translation Compilation vs. interpretation Compilation diagram Step 1: compile program compiler Compiled program Step 2: run input Compiled program output Language Translation compilation is translation

More information

The Processor Memory Hierarchy

The Processor Memory Hierarchy Corrected COMP 506 Rice University Spring 2018 The Processor Memory Hierarchy source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved.

More information

Mike Martell - Staff Software Engineer David Mackay - Applications Analysis Leader Software Performance Lab Intel Corporation

Mike Martell - Staff Software Engineer David Mackay - Applications Analysis Leader Software Performance Lab Intel Corporation Mike Martell - Staff Software Engineer David Mackay - Applications Analysis Leader Software Performance Lab Corporation February 15-17, 17, 2000 Agenda: Key IA-64 Software Issues l Do I Need IA-64? l Coding

More information

Author: Steve Gorman Title: Programming with the Intel architecture in the flat memory model

Author: Steve Gorman Title: Programming with the Intel architecture in the flat memory model Author: Steve Gorman Title: Programming with the Intel architecture in the flat memory model Abstract: As the Intel architecture moves off the desktop into a variety of other computing applications, developers

More information

ECE 598 Advanced Operating Systems Lecture 4

ECE 598 Advanced Operating Systems Lecture 4 ECE 598 Advanced Operating Systems Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Announcements HW#1 was due HW#2 was posted, will be tricky Let me know

More information

Updating the Compiler?

Updating the Compiler? Updating the Compiler? Take Advantage of The New Development Toolchain Andreas Jaeger Product Manager aj@suse.com Programming Languages C C++ Fortran And Go 2 Why new compiler? Faster applications Support

More information

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3

More information

Introduction to Performance Tuning & Optimization Tools

Introduction to Performance Tuning & Optimization Tools Introduction to Performance Tuning & Optimization Tools a[i] a[i+1] + a[i+2] a[i+3] b[i] b[i+1] b[i+2] b[i+3] = a[i]+b[i] a[i+1]+b[i+1] a[i+2]+b[i+2] a[i+3]+b[i+3] Ian A. Cosden, Ph.D. Manager, HPC Software

More information

Servicing HEP experiments with a complete set of ready integreated and configured common software components

Servicing HEP experiments with a complete set of ready integreated and configured common software components Journal of Physics: Conference Series Servicing HEP experiments with a complete set of ready integreated and configured common software components To cite this article: Stefan Roiser et al 2010 J. Phys.:

More information

Debugging and Profiling

Debugging and Profiling Debugging and Profiling Dr. Axel Kohlmeyer Senior Scientific Computing Expert Information and Telecommunication Section The Abdus Salam International Centre for Theoretical Physics http://sites.google.com/site/akohlmey/

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Compiler Options. Linux/x86 Performance Practical,

Compiler Options. Linux/x86 Performance Practical, Center for Information Services and High Performance Computing (ZIH) Compiler Options Linux/x86 Performance Practical, 17.06.2009 Zellescher Weg 12 Willers-Bau A106 Tel. +49 351-463 - 31945 Ulf Markwardt

More information

Quality in the Data Center: Data Collection and Analysis

Quality in the Data Center: Data Collection and Analysis Quality in the Data Center: Data Collection and Analysis Kingsum Chow, Chief Scientist Alibaba Systems Software Hardware Co-Optimization PNSQC 2017.10.07 3:45pm-5:20pm Acknowledged: Chengdong Li and Wanyi

More information

Main Memory CHAPTER. Exercises. 7.9 Explain the difference between internal and external fragmentation. Answer:

Main Memory CHAPTER. Exercises. 7.9 Explain the difference between internal and external fragmentation. Answer: 7 CHAPTER Main Memory Exercises 7.9 Explain the difference between internal and external fragmentation. a. Internal fragmentation is the area in a region or a page that is not used by the job occupying

More information

Dynamically-typed Languages. David Miller

Dynamically-typed Languages. David Miller Dynamically-typed Languages David Miller Dynamically-typed Language Everything is a value No type declarations Examples of dynamically-typed languages APL, Io, JavaScript, Lisp, Lua, Objective-C, Perl,

More information

Xcode Release Notes. Apple offers a number of resources where you can get Xcode development support:

Xcode Release Notes. Apple offers a number of resources where you can get Xcode development support: Xcode Release Notes This document contains release notes for Xcode 5 developer preview 5. It discusses new features and issues present in Xcode 5 developer preview 5 and issues resolved from earlier Xcode

More information

Lecture 9 Dynamic Compilation

Lecture 9 Dynamic Compilation Lecture 9 Dynamic Compilation I. Motivation & Background II. Overview III. Compilation Policy IV. Partial Method Compilation V. Partial Dead Code Elimination VI. Escape Analysis VII. Results Partial Method

More information

Automatic Tuning of HPC Applications with Periscope. Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München

Automatic Tuning of HPC Applications with Periscope. Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München Automatic Tuning of HPC Applications with Periscope Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München Agenda 15:00 15:30 Introduction to the Periscope Tuning Framework (PTF)

More information

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358 Memory Management Reading: Silberschatz chapter 9 Reading: Stallings chapter 7 1 Outline Background Issues in Memory Management Logical Vs Physical address, MMU Dynamic Loading Memory Partitioning Placement

More information

Experience with Data-flow, DQM and Analysis of TIF Data

Experience with Data-flow, DQM and Analysis of TIF Data Experience with Data-flow, DQM and Analysis of TIF Data G. Bagliesi, R.J. Bainbridge, T. Boccali, A. Bocci, V. Ciulli, N. De Filippis, M. De Mattia, S. Dutta, D. Giordano, L. Mirabito, C. Noeding, F. Palla,

More information

CSE 501: Compiler Construction. Course outline. Goals for language implementation. Why study compilers? Models of compilation

CSE 501: Compiler Construction. Course outline. Goals for language implementation. Why study compilers? Models of compilation CSE 501: Compiler Construction Course outline Main focus: program analysis and transformation how to represent programs? how to analyze programs? what to analyze? how to transform programs? what transformations

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

C6000 Compiler Roadmap

C6000 Compiler Roadmap C6000 Compiler Roadmap CGT v7.4 CGT v7.3 CGT v7. CGT v8.0 CGT C6x v8. CGT Longer Term In Development Production Early Adopter Future CGT v7.2 reactive Current 3H2 4H 4H2 H H2 Future CGT C6x v7.3 Control

More information

Intel VTune Performance Analyzer 9.1 for Windows* In-Depth

Intel VTune Performance Analyzer 9.1 for Windows* In-Depth Intel VTune Performance Analyzer 9.1 for Windows* In-Depth Contents Deliver Faster Code...................................... 3 Optimize Multicore Performance...3 Highlights...............................................

More information

Dynamic Race Detection with LLVM Compiler

Dynamic Race Detection with LLVM Compiler Dynamic Race Detection with LLVM Compiler Compile-time instrumentation for ThreadSanitizer Konstantin Serebryany, Alexander Potapenko, Timur Iskhodzhanov, and Dmitriy Vyukov OOO Google, 7 Balchug st.,

More information

Efficient and Large Scale Program Flow Tracing in Linux. Alexander Shishkin, Intel

Efficient and Large Scale Program Flow Tracing in Linux. Alexander Shishkin, Intel Efficient and Large Scale Program Flow Tracing in Linux Alexander Shishkin, Intel 16.09.2013 Overview Program flow tracing - What is it? - What is it good for? Intel Processor Trace - Features / capabilities

More information

12: Memory Management

12: Memory Management 12: Memory Management Mark Handley Address Binding Program goes through multiple steps from compilation to execution. At some stage, addresses in the program must be bound to physical memory addresses:

More information

Improving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester

Improving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester Improving Error Checking and Unsafe Optimizations using Software Speculation Kirk Kelsey and Chen Ding University of Rochester Outline Motivation Brief problem statement How speculation can help Our software

More information

ThinLTO. A Fine-Grained Demand-Driven Infrastructure. Teresa Johnson, Xinliang David Li

ThinLTO. A Fine-Grained Demand-Driven Infrastructure. Teresa Johnson, Xinliang David Li ThinLTO A Fine-Grained Demand-Driven Infrastructure Teresa Johnson, Xinliang David Li tejohnson,davidxl@google.com Outline CMO Background ThinLTO Motivation and Overview ThinLTO Details Build System Integration

More information

TK2123: COMPUTER ORGANISATION & ARCHITECTURE. CPU and Memory (2)

TK2123: COMPUTER ORGANISATION & ARCHITECTURE. CPU and Memory (2) TK2123: COMPUTER ORGANISATION & ARCHITECTURE CPU and Memory (2) 1 Contents This lecture will discuss: Cache. Error Correcting Codes. 2 The Memory Hierarchy Trade-off: cost, capacity and access time. Faster

More information

The SECURE Project and GCC

The SECURE Project and GCC The SECURE Project and GCC Security Enhancing Compilers for Use in Real-world Environments Speaker: Graham Markall graham.markall@embecosm.com Contributors: Jeremy Bennett, Craig Blackmore, Simon Cook,

More information

Get an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*

Get an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows* Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling

More information

Weaving Source Code into ThreadScope

Weaving Source Code into ThreadScope Weaving Source Code into ThreadScope Peter Wortmann scpmw@leeds.ac.uk University of Leeds Visualization and Virtual Reality Group sponsorship by Microsoft Research Haskell Implementors Workshop 2011 Peter

More information

Cache Performance Analysis with Callgrind and KCachegrind

Cache Performance Analysis with Callgrind and KCachegrind Cache Performance Analysis with Callgrind and KCachegrind VI-HPS Tuning Workshop 8 September 2011, Aachen Josef Weidendorfer Computer Architecture I-10, Department of Informatics Technische Universität

More information

Today. More C/compiling/Makefiles Working with multiple files Pointer exercises

Today. More C/compiling/Makefiles Working with multiple files Pointer exercises Algorithms for GIS Today More C/compiling/Makefiles Working with multiple files Pointer exercises Announcements: Assignment 1 due on Tuesday Assignment 2 posted, due on 9/22 Josh office hours: Mon 8pm

More information

Xcode Release Notes. Apple offers a number of resources where you can get Xcode development support:

Xcode Release Notes. Apple offers a number of resources where you can get Xcode development support: Xcode Release Notes This document contains release notes for Xcode 5 developer preview 4. It discusses new features and issues present in Xcode 5 developer preview 4 and issues resolved from earlier Xcode

More information

Kernel perf tool user guide

Kernel perf tool user guide Kernel perf tool user guide 2017-10-16 Reversion Record Date Rev Change Description Author 2017-10-16 V0.1 Inital Zhang Yongchang 1 / 10 catalog 1 PURPOSE...4 2 TERMINOLOGY...4 3 ENVIRONMENT...4 3.1 HARDWARE

More information

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium Main Memory Outlook Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium 2 Backgound Background So far we considered how to share

More information

Linking. Explain what ELF format is. Explain what an executable is and how it got that way. With huge thanks to Steve Chong for his notes from CS61.

Linking. Explain what ELF format is. Explain what an executable is and how it got that way. With huge thanks to Steve Chong for his notes from CS61. Linking Topics How do you transform a collection of object files into an executable? How is an executable structured? Why is an executable structured as it is? Learning Objectives: Explain what ELF format

More information

Link 8. Dynamic Linking

Link 8. Dynamic Linking Link 8. Dynamic Linking Young W. Lim 2018-12-27 Thr Young W. Lim Link 8. Dynamic Linking 2018-12-27 Thr 1 / 66 Outline 1 Linking - 8. Dynamic Linking Based on Dynamic linking with a shared library example

More information

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)

More information

Debugging of CPython processes with gdb

Debugging of CPython processes with gdb Debugging of CPython processes with gdb KharkivPy January 28th, 2017 by Roman Podoliaka, Development Manager at Mirantis twitter: @rpodoliaka blog: http://podoliaka.org slides: http://podoliaka.org/talks/

More information

Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications

Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark Motivation Hardware Trends Put more cores

More information

Managed runtimes & garbage collection

Managed runtimes & garbage collection Managed runtimes Advantages? Managed runtimes & garbage collection CSE 631 Some slides by Kathryn McKinley Disadvantages? 1 2 Managed runtimes Portability (& performance) Advantages? Reliability Security

More information

Control Flow Integrity for COTS Binaries Report

Control Flow Integrity for COTS Binaries Report Control Flow Integrity for COTS Binaries Report Zhang and Sekar (2013) January 2, 2015 Partners: Instructor: Evangelos Ladakis Michalis Diamantaris Giorgos Tsirantonakis Dimitris Kiosterakis Elias Athanasopoulos

More information

On the Use of Large Intel Xeon Phi Clusters for GEANT4-Based Simulations

On the Use of Large Intel Xeon Phi Clusters for GEANT4-Based Simulations BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 5 Special issue with selected papers from the workshop Two Years Avitohol: Advanced High Performance Computing Applications

More information

Study and Analysis of ELF Vulnerabilities in Linux

Study and Analysis of ELF Vulnerabilities in Linux Study and Analysis of ELF Vulnerabilities in Linux Biswajit Sarma Assistant professor, Department of Computer Science and Engineering, Jorhat Engineering College, Srishti Dasgupta Final year student, Department

More information