Testing AutoFDO for Geant4
|
|
- Tracy Harmon
- 6 years ago
- Views:
Transcription
1 Testing for Geant4 Nathalie Rauschmayr IT-CF-FPP With help from Benedikt Hegner and Shahzad Malik Muzaffar 1/29 Testing for Geant4 Nathalie Rauschmayr
2 Introduction Idea: Autotuning Compile 2/29 Testing for Geant4 Nathalie Rauschmayr
3 Introduction Idea: Autotuning Run Compile 3/29 Testing for Geant4 Nathalie Rauschmayr
4 Introduction Idea: Autotuning Feedback Run Compile 4/29 Testing for Geant4 Nathalie Rauschmayr
5 Introduction Idea: Autotuning Feedback Run Compile Concept exists already for some time: Profile Guided Optimization 4/29 Testing for Geant4 Nathalie Rauschmayr
6 Introduction Profile Guided Optimization is useful for: Code that contains a lot of branches that are difficult to predict at compile time Performance sensitive code When running the same code over and over again 5/29 Testing for Geant4 Nathalie Rauschmayr
7 Introduction Profile Guided Optimization: Uses profiling to improve runtime performance Analyses code sections that are frequently executed Based on profiles the compiler might change: Inlining Virtual Call Speculation Register allocation Basic Block Optimization Function Layout Conditional Branch Optimization Dead Code Separation 6/29 Testing for Geant4 Nathalie Rauschmayr
8 Introduction Two approaches for Profile Guided Optimization (PGO): Modify binary (instrumentation) Monitor unaltered binary (sampling with perf) transforms perf-profiles into the format that can be used by gcc/clang for Feedback Directed Optimization (FDO) Developed by Google 7/29 Testing for Geant4 Nathalie Rauschmayr
9 Difference between sampling and instrumentation Instrumentation based PGO: gcc -fprofile-use test.c -o test gcc -fprofile-generate test.c -o test test.gcno test.gcda Instrumentation Run Recompile Production Environment 8/29 Testing for Geant4 Nathalie Rauschmayr
10 Difference between sampling and instrumentation Instrumentation based PGO: gcc -fprofile-use test.c -o test gcc -fprofile-generate test.c -o test test.gcno test.gcda Instrumentation Run Recompile Disadvantages: Production Environment Tedious dual-compilation Produces a lot of small output files (in case of Geant4: 1698 files, each smaller than 100KB) Cannot run easily in production environment Instrumented binary might be significantly slower 8/29 Testing for Geant4 Nathalie Rauschmayr
11 Difference between sampling and instrumentation Sampling Based FDO (): Run production binary with perf Create production binary Convert perf-profile Recompile with converted perf-profile 9/29 Testing for Geant4 Nathalie Rauschmayr
12 Difference between sampling and instrumentation Sampling Based FDO (): gcc -O3 -ggdb -frecord-compilation-info-in-elf -D DEBUG test.c -o test Create production binary perf record -b -e cpu/event=0xc4,umask=0x20, name=br inst retired near taken, period= /pp./test Run production binary with perf create gcov --binary=./test --profile=perf.data --gcov=binary.gcov -gcov version=1 Convert perf-profile gcc -O3 -fauto-profile=test.gcov test.c -o test Recompile with converted perf-profile 10/29 Testing for Geant4 Nathalie Rauschmayr
13 Difference between sampling and instrumentation compared to instrumentation based PGO: Profile data can be obtained in production environment Works on optimized builds It provides a tool to merge profiles from multiple runs Only one output file per run 11/29 Testing for Geant4 Nathalie Rauschmayr
14 General Caveats The sample needs to be representative for the typical usage scenarios Otherwise: PGO could possible slow down the performance Need many profiles and runs Unbiased branches 12/29 Testing for Geant4 Nathalie Rauschmayr
15 Testcases Applications: CMS Detector Simulation (FullCMS) Simulation step of CMSSW using static build of Geant4 (cmsrun) Input data/workflow needs to be representative: How many events needed as training data? What if job configuration changes? What if job type changes? 13/29 Testing for Geant4 Nathalie Rauschmayr
16 Testcases Training data Run Number of Events FullCMS run FullCMS run 100,500,1k cmsrun config1 cmsrun config1 20, 50, 100 cmsrun config1 cmsrun config2 20, 50, 100 cmsrun config2 cmsrun config2 20, 50, 100 FullCMS run cmsrun config2 1k FullCMS: Geant4 example with particle gun cmsrun config1: TTbar event generation and simulation (CMSSW 7 3 1) cmsrun config2: Wjets event generation and simulation (CMSSW 7 3 1) 14/29 Testing for Geant4 Nathalie Rauschmayr
17 CMS Full Detector Simulation Training data Run Number of Events FullCMS run 100 events FullCMS 100, 500, 1k FullCMS run 500 events FullCMS 100, 500, 1k FullCMS run 1k events FullCMS 100, 500, 1k Processing 100 events Processing 500 events Runtime in [s] % -10.4% -11.5% Runtime in [s] % -9.5% -10.2% Normal 100 events 500 events 1000 events Normal 100 events 500 events 1000 Events 15/29 Testing for Geant4 Nathalie Rauschmayr
18 CMS Full Detector Simulation 1,400 Processing 1000 events 1,350 Runtime in [s] 1,300 1, % -10.7% -11.4% 1,200 1,150 Normal 100 events 500 events 1000 events 16/29 Testing for Geant4 Nathalie Rauschmayr
19 Simulation step of CMSSW using BigProducts Used CMSSW 7 3 1: SLC6, kernel 3.16 gcc 4.8 It uses BigProducts by default (developed by Shazhad) pluginsimulation.so: linked against static Geant4 libraries Obtain perf-profile for cmsrun, but then optimize only pluginsimulation.so Testcase: TTbar Step 1: Event generation and simulation 20, 50, 100 events 17/29 Testing for Geant4 Nathalie Rauschmayr
20 Simulation step of CMSSW using BigProducts Training data Run Number of Events cmsrun 20 events config1 cmsrun config1 20, 50, 100 cmsrun 50 events config1 cmsrun config1 20, 50, 100 cmsrun 100 events config1 cmsrun config1 20, 50, 100 Processing 20 events Processing 50 events 580 1,350 Runtime in [s] % Runtime in [s] 1, % -8.4% -6.1% 1, % -7.0% Normal 20 events 50 events 100 events Normal 20 events 50 events 100 Events 18/29 Testing for Geant4 Nathalie Rauschmayr
21 Simulation step of CMSSW using BigProducts Processing 100 events 2,700 Runtime in [s] 2, % -6.5% -7.4% 2,500 Normal 20 events 50 events 100 events 19/29 Testing for Geant4 Nathalie Rauschmayr
22 Simulation step of CMSSW using BigProducts cmsrun config2: took Pythia configurations from Wjet Pt TeV cfi.py in CMSSW 8 1 X Training data Run Number of Events cmsrun 100 events config1 cmsrun config2 20, 50, 100 Processing 20 events Processing 50 events Processing 100 events 1,850 4,400 8,500 Runtime in [s] 1,800 1,750 1,700 1, % Runtime in [s] 4,200 4,000 3,800 Runtime in [s] 8, % -11.9% 7,500 1,600 Normal 100 events Normal 100 events Normal 100 events 20/29 Testing for Geant4 Nathalie Rauschmayr
23 Simulation step of CMSSW using BigProducts Training data Run Number of Events cmsrun 100 events config1 cmsrun config2 20, 50, 100 cmsrun 100 events config2 cmsrun config2 20, 50, 100 Processing 20 events Processing 50 events Processing 100 events 1,850 4,400 8,500 Runtime in [s] 1,800 1,750 1,700 1, % -7.3% Runtime in [s] 4,200 4,000 3, % Runtime in [s] 8, % -8.5% 7, % 1,600 Normal 100 events 100 events Normal 100 events 100 events Normal 100 events 100 events 21/29 Testing for Geant4 Nathalie Rauschmayr
24 Simulation step of CMSSW using BigProducts Training data Run Number of Events fullcms 100 events cmsrun job config2 20, 50, 100 Processing 20 events 1,380 Processing 50 events Processing 100 events Runtime in [s] Normal -3.8% 100 events Runtime in [s] 1,360 1,340 1,320 1,300 1,280 1,260 Normal -4.8% 100 events Runtime in [s] 2,750 2,700 2,650 2,600 2,550 Normal -5.1% 100 events 22/29 Testing for Geant4 Nathalie Rauschmayr
25 gives useful insight Google-gcc provides the flag -fcheck-branch-annotation: objcopy -O binary --set-section-flags.gnu.switches.text.branch.annotation=alloc -j.gnu.switches.text.branch.annotation libg4processes.so libannotated Example Output: G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d ec56d66a G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d fc7d9710c G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb dc9 23/29 Testing for Geant4 Nathalie Rauschmayr
26 gives useful insight Google-gcc provides the flag -fcheck-branch-annotation: objcopy -O binary --set-section-flags.gnu.switches.text.branch.annotation=alloc -j.gnu.switches.text.branch.annotation libg4processes.so libannotated Example Output: G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d ec56d66a G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d fc7d9710c G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb dc9 File Line Basic block count Annotated Measured branch probability Assumed branch probability Hash value 23/29 Testing for Geant4 Nathalie Rauschmayr
27 gives useful insight events 50 events 100 events Basic block counts G4PhotoNuclearCrossSection.cc: G4NucleiModel.cc:1332 stl uninitialized.h:74 G4hBremsstrahlungModel.cc:87 G4CookPairingCorrections.hh:56 vector.tcc:184 G4MuBremsstrahlungModel.cc:300 G4InuclParticle.hh:83 G4ElectroNuclearCrossSection.cc:2327 locale facets.h:867 G4EmCorrections.cc:391 G4VEnergyLossProcess.cc:1099 G4FastVector.hh:68 G4VEnergyLossProcess.cc:1412 G4VEnergyLossProcess.cc:1143 G4VEnergyLossProcess.cc:1142 G4UniversalFluctuation.cc:224 G4UniversalFluctuation.cc:213 G4Poisson.hh:57 G4ProcessManager.cc:273 G4Transportation.cc:731 24/29 Testing for Geant4 Nathalie Rauschmayr
28 Compiler heuristics are not always accurate for branch probabilities events 50 events events without profile Branch probability G4PhotoNuclearCrossSection.cc: G4NucleiModel.cc:1332 stl uninitialized.h:74 G4hBremsstrahlungModel.cc:87 G4CookPairingCorrections.hh:56 vector.tcc:184 G4MuBremsstrahlungModel.cc:300 G4InuclParticle.hh:83 G4ElectroNuclearCrossSection.cc:2327 locale facets.h:867 G4EmCorrections.cc:391 G4VEnergyLossProcess.cc:1099 G4FastVector.hh:68 G4VEnergyLossProcess.cc:1412 G4VEnergyLossProcess.cc:1143 G4UniversalFluctuation.cc:224 G4VEnergyLossProcess.cc:1142 G4UniversalFluctuation.cc:213 G4Poisson.hh:57 G4ProcessManager.cc:273 G4Transportation.cc:731 25/29 Testing for Geant4 Nathalie Rauschmayr
29 Problems encountered Google gcc 4.8 has been merged with gcc main branch, but normal gcc is missing the flag -frecord-compilation-info-in-elf Flag creates a new section: Object files may have distinct compile options. The new section preserves compile options for every object file. >>>readelf -S -W libg4tracking.so less There are 37 section headers, starting at offset 0x615a8: Section Headers: [Nr] Name Type Address Off Size ES Flg Lk Inf Al [ 0] NULL [ 1].hash HASH ec 04 A [...] [26].gnu.switches.text.quote_paths PROGBITS bb [27].gnu.switches.text.bracket_paths PROGBITS afb 007c [28].gnu.switches.text.system_paths PROGBITS c [29].gnu.switches.text.cpp_defines PROGBITS ca9c 00117e [30].gnu.switches.text.cpp_includes PROGBITS dc1a 0006bb [31].gnu.switches.text.cl_args PROGBITS e2d5 0029b [32].gnu.switches.text.lipo_info PROGBITS c f /29 Testing for Geant4 Nathalie Rauschmayr
30 Problems encountered The information in.gnu.switches.text is used to build the module map create gcov dumps the module map to a file (ending with.imports) >>>head output.gcov.imports /data/geant p03/source/event/src/g4eventmanager.cc: /data/geant p03/ /data/geant p03/source/event/src/g4smarttrackstack.cc: /data/geant p0 /data/geant p03/source/event/src/g4stackmanager.cc: /data/geant p03/ /data/geant p03/source/externals/clhep/src/evaluator.cc: /data/geant /data/geant p03/source/externals/clhep/src/lorentzrotation.cc /data/geant p03/source/externals/clhep/src/lorentzvector.cc /data/geant p03/source/externals/clhep/src/lorentzvectorl.cc 27/29 Testing for Geant4 Nathalie Rauschmayr
31 Problems encountered Apart from the module map creates a symbol map Symbol map does not contain symbols coming from shared libraries It limits the usability: Statically linked libraries required Optimize only the library causing the largest hotspots icc-files are not recognized (fixed) recent versions of perf could be problematic because data format is different works best with sampling the Last Branch Records (LBR) LBR: Collection of register pairs that store source and destination addresses of recently executed branches (currently only Intel CPUs) 28/29 Testing for Geant4 Nathalie Rauschmayr
32 Summary Quite decent speedups: 5-13% Some fixes needed (shared libraries, gcc-flag) Stable against change of simulation scenario Easy deployment 1 Start perf together with the job 2 Gather profiles 3 Convert and merge profiles 4 Add compiler flag in CMake scripts 29/29 Testing for Geant4 Nathalie Rauschmayr
FDO: Magic Make My Program Faster compilation option?
FDO: Magic Make My Program Faster compilation option? Paweł Moll Embedded Linux Conference Europe, Berlin, October 2016 Agenda FDO Basics Instrumentation based FDO Sample based ( Auto ) FDO Deployments
More informationOperating Systems. 09. Memory Management Part 1. Paul Krzyzanowski. Rutgers University. Spring 2015
Operating Systems 09. Memory Management Part 1 Paul Krzyzanowski Rutgers University Spring 2015 March 9, 2015 2014-2015 Paul Krzyzanowski 1 CPU Access to Memory The CPU reads instructions and reads/write
More informationSplit debug symbols for pkgsrc builds
Split debug symbols for pkgsrc builds Short report after Google Summer of Code 2016 Leonardo Taccari leot@netbsd.org EuroBSDcon 2016 NetBSD Summit 1 / 23 What will we see in this presentation? ELF, DWARF
More informationStack frame unwinding on ARM
Stack frame unwinding on ARM Ken Werner LDS, Budapest 2011 http:/www.linaro.org why? Who needs to unwind the stack? C++ exceptions GDB anyone who wants to display the call chain Unwinding in General How
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More informationVectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition
Vectorization Lab Parallel Computing at TACC: Ranger to Stampede Transition Aaron Birkland Consultant Cornell Center for Advanced Computing December 11, 2012 1 Simple Vectorization This lab serves as an
More informationScientific Programming in C IX. Debugging
Scientific Programming in C IX. Debugging Susi Lehtola 13 November 2012 Debugging Quite often you spend an hour to write a code, and then two hours debugging why it doesn t work properly. Scientific Programming
More informationLink 3. Symbols. Young W. Lim Mon. Young W. Lim Link 3. Symbols Mon 1 / 42
Link 3. Symbols Young W. Lim 2017-09-11 Mon Young W. Lim Link 3. Symbols 2017-09-11 Mon 1 / 42 Outline 1 Linking - 3. Symbols Based on Symbols Symbol Tables Symbol Table Examples main.o s symbol table
More informationEE458 - Embedded Systems Lecture 4 Embedded Devel.
EE458 - Embedded Lecture 4 Embedded Devel. Outline C File Streams References RTC: Chapter 2 File Streams man pages 1 Cross-platform Development Environment 2 Software available on the host system typically
More informationGeant4 MT Performance. Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept , 2016
Geant4 MT Performance Soon Yung Jun (Fermilab) 21 st Geant4 Collaboration Meeting, Ferrara, Italy Sept. 12-16, 2016 Introduction Geant4 multi-threading (Geant4 MT) capabilities Event-level parallelism
More informationBenchmarking AMD64 and EMT64
Benchmarking AMD64 and EMT64 Hans Wenzel, Oliver Gutsche, FNAL, Batavia, IL 60510, USA Mako Furukawa, University of Nebraska, Lincoln, USA Abstract We have benchmarked various single and dual core 64 Bit
More informationProgress Report Toward a Thread-Parallel Geant4
Progress Report Toward a Thread-Parallel Geant4 Gene Cooperman and Xin Dong High Performance Computing Lab College of Computer and Information Science Northeastern University Boston, Massachusetts 02115
More informationEvaluating Performance Via Profiling
Performance Engineering of Software Systems September 21, 2010 Massachusetts Institute of Technology 6.172 Professors Saman Amarasinghe and Charles E. Leiserson Handout 6 Profiling Project 2-1 Evaluating
More informationReducing CPU Consumption of Geant4 Simulation in ATLAS
Reducing CPU Consumption of Geant4 Simulation in ATLAS John Chapman (University of Cambridge) on behalf of the ATLAS Simulation Group Joint WLCG & HSF Workshop 2018 Napoli, Italy - 28th March 2018 Current
More informationBuilding Binary Optimizer with LLVM
Building Binary Optimizer with LLVM Maksim Panchenko maks@fb.com BOLT Binary Optimization and Layout Tool Built in less than 6 months x64 Linux ELF Runs on large binary (HHVM, non-jitted part) Improves
More informationSystems software design. Software build configurations; Debugging, profiling & Quality Assurance tools
Systems software design Software build configurations; Debugging, profiling & Quality Assurance tools Who are we? Krzysztof Kąkol Software Developer Jarosław Świniarski Software Developer Presentation
More informationOpenMP Example. $ ssh # You might have to type yes if this is the first time you log on
OpenMP Example Day 1, afternoon session 1: We examine a serial and parallel implementation of a code solving an N-body problem with a star, a planet and many small particles near the planet s radius. The
More informationProfiling and Workflow
Profiling and Workflow Preben N. Olsen University of Oslo and Simula Research Laboratory preben@simula.no September 13, 2013 1 / 34 Agenda 1 Introduction What? Why? How? 2 Profiling Tracing Performance
More informationBinghamton University. CS-220 Spring Cached Memory. Computer Systems Chapter
Cached Memory Computer Systems Chapter 6.2-6.5 Cost Speed The Memory Hierarchy Capacity The Cache Concept CPU Registers Addresses Data Memory ALU Instructions The Cache Concept Memory CPU Registers Addresses
More informationProcess. One or more threads of execution Resources required for execution
Memory Management 1 Learning Outcomes Appreciate the need for memory management in operating systems, understand the limits of fixed memory allocation schemes. Understand fragmentation in dynamic memory
More informationFrequently asked software questions for EM 8-bit Microcontrollers CoolRISC core architecture
EM MICROELECTRONIC - MARIN SA AppNote 60 Title: Product Family: Application Note 60 Frequently asked software questions for EM 8-bit Microcontrollers CoolRISC core architecture Part Number: EM6812, EM9550,
More informationAn Assembler for the MSSP Distiller Eric Zimmerman University of Illinois, Urbana Champaign
An Assembler for the MSSP Distiller Eric Zimmerman University of Illinois, Urbana Champaign Abstract It is important to have a means of manually testing a potential optimization before laboring to fully
More informationECE 498 Linux Assembly Language Lecture 1
ECE 498 Linux Assembly Language Lecture 1 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 13 November 2012 Assembly Language: What s it good for? Understanding at a low-level what
More informationSista: Improving Cog s JIT performance. Clément Béra
Sista: Improving Cog s JIT performance Clément Béra Main people involved in Sista Eliot Miranda Over 30 years experience in Smalltalk VM Clément Béra 2 years engineer in the Pharo team Phd student starting
More informationImproving Linux development with better tools
Improving Linux development with better tools Andi Kleen Oct 2013 Intel Corporation ak@linux.intel.com Linux complexity growing Source lines in Linux kernel All source code 16.5 16 15.5 M-LOC 15 14.5 14
More informationProcess. Memory Management
Process Memory Management One or more threads of execution Resources required for execution Memory (RAM) Program code ( text ) Data (initialised, uninitialised, stack) Buffers held in the kernel on behalf
More informationProcess Environment. Pradipta De
Process Environment Pradipta De pradipta.de@sunykorea.ac.kr Today s Topic Program to process How is a program loaded by the kernel How does kernel set up the process Outline Review of linking and loading
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationLibabigail & ABI compatibility
Libabigail & ABI compatibility Taming the runtime linking problem Ben Woodard Consulting Engineer Dodji Seketeli Tools Engineer Aug 2017 Problems due to ABI Compatibility There are several problems cause
More informationProcess. One or more threads of execution Resources required for execution. Memory (RAM) Others
Memory Management 1 Process One or more threads of execution Resources required for execution Memory (RAM) Program code ( text ) Data (initialised, uninitialised, stack) Buffers held in the kernel on behalf
More informationA program execution is memory safe so long as memory access errors never occur:
A program execution is memory safe so long as memory access errors never occur: Buffer overflows, null pointer dereference, use after free, use of uninitialized memory, illegal free Memory safety categories
More informationProcess. One or more threads of execution Resources required for execution. Memory (RAM) Others
Memory Management 1 Learning Outcomes Appreciate the need for memory management in operating systems, understand the limits of fixed memory allocation schemes. Understand fragmentation in dynamic memory
More informationIntro to Segmentation Fault Handling in Linux. By Khanh Ngo-Duy
Intro to Segmentation Fault Handling in Linux By Khanh Ngo-Duy Khanhnd@elarion.com Seminar What is Segmentation Fault (Segfault) Examples and Screenshots Tips to get Segfault information What is Segmentation
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationCode-Agnostic Performance Characterisation and Enhancement
Code-Agnostic Performance Characterisation and Enhancement Ben Menadue Academic Consultant @NCInews Who Uses the NCI? NCI has a large user base 1000s of users across 100s of projects These projects encompass
More informationComputer Organization & Assembly Language Programming
Computer Organization & Assembly Language Programming CSE 2312 Lecture 11 Introduction of Assembly Language 1 Assembly Language Translation The Assembly Language layer is implemented by translation rather
More informationEmbedded Systems Programming
Embedded Systems Programming OS Linux - Toolchain Iwona Kochańska Gdansk University of Technology Embedded software Toolchain compiler and tools for hardwaredependent software developement Bootloader initializes
More information2 TEST: A Tracer for Extracting Speculative Threads
EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath
More informationRuntime Defenses against Memory Corruption
CS 380S Runtime Defenses against Memory Corruption Vitaly Shmatikov slide 1 Reading Assignment Cowan et al. Buffer overflows: Attacks and defenses for the vulnerability of the decade (DISCEX 2000). Avijit,
More informationImproving Linux Development with better tools. Andi Kleen. Oct 2013 Intel Corporation
Improving Linux Development with better tools Andi Kleen Oct 2013 Intel Corporation ak@linux.intel.com Linux complexity growing Source lines in Linux kernel All source code 16.5 16 15.5 M-LOC 15 14.5 14
More informationInitial Explorations of ARM Processors for Scientific Computing
Initial Explorations of ARM Processors for Scientific Computing Peter Elmer - Princeton University David Abdurachmanov - Vilnius University Giulio Eulisse, Shahzad Muzaffar - FNAL Power limitations for
More informationPerformance Profiling
Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Outline History Understanding Profiling Understanding Performance Understanding Performance
More informationMemory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization
Requirements Relocation Memory management ability to change process image position Protection ability to avoid unwanted memory accesses Sharing ability to share memory portions among processes Logical
More informationCERN IT Technical Forum
Evaluating program correctness and performance with new software tools from Intel Andrzej Nowak, CERN openlab March 18 th 2011 CERN IT Technical Forum > An introduction to the new generation of software
More informationProcessors, Performance, and Profiling
Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode
More informationSystem Programming And C Language
System Programming And C Language Prof. Jin-soo Kim. (jinsookim@skku.edu) Pintos TA Jin-yeong, Bak. (dongdm@gmail.com) Kyung-min, Go. (gkm2164@gmail.com) 2010.09.28 1 Contents Important thing in system
More informationOperating Systems CMPSC 473. Process Management January 29, Lecture 4 Instructor: Trent Jaeger
Operating Systems CMPSC 473 Process Management January 29, 2008 - Lecture 4 Instructor: Trent Jaeger Last class: Operating system structure and basics Today: Process Management Why Processes? We have programs,
More informationContinuous integration a walk-through. Dr Alin Marin Elena
Continuous integration a walk-through Dr Alin Marin Elena Daresbury, 2017 Elena et al. Daresbury, vember 2017 1 Theory 2 Theory Elena et al. Daresbury, vember 2017 1 Theory 2 Elena et al. Daresbury, vember
More informationECE 471 Embedded Systems Lecture 4
ECE 471 Embedded Systems Lecture 4 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 12 September 2013 Announcements HW#1 will be posted later today For next class, at least skim
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationHow to run stable benchmarks
How to run stable benchmarks FOSDEM 2017, Brussels Victor Stinner victor.stinner@gmail.com BINARY_ADD optim In 2014, int+int optimization proposed: 14 patches, many authors Is is faster? Is it worth it?
More informationECE 598 Advanced Operating Systems Lecture 10
ECE 598 Advanced Operating Systems Lecture 10 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 22 February 2018 Announcements Homework #5 will be posted 1 Blocking vs Nonblocking
More informationLanguage Translation. Compilation vs. interpretation. Compilation diagram. Step 1: compile. Step 2: run. compiler. Compiled program. program.
Language Translation Compilation vs. interpretation Compilation diagram Step 1: compile program compiler Compiled program Step 2: run input Compiled program output Language Translation compilation is translation
More informationThe Processor Memory Hierarchy
Corrected COMP 506 Rice University Spring 2018 The Processor Memory Hierarchy source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved.
More informationMike Martell - Staff Software Engineer David Mackay - Applications Analysis Leader Software Performance Lab Intel Corporation
Mike Martell - Staff Software Engineer David Mackay - Applications Analysis Leader Software Performance Lab Corporation February 15-17, 17, 2000 Agenda: Key IA-64 Software Issues l Do I Need IA-64? l Coding
More informationAuthor: Steve Gorman Title: Programming with the Intel architecture in the flat memory model
Author: Steve Gorman Title: Programming with the Intel architecture in the flat memory model Abstract: As the Intel architecture moves off the desktop into a variety of other computing applications, developers
More informationECE 598 Advanced Operating Systems Lecture 4
ECE 598 Advanced Operating Systems Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Announcements HW#1 was due HW#2 was posted, will be tricky Let me know
More informationUpdating the Compiler?
Updating the Compiler? Take Advantage of The New Development Toolchain Andreas Jaeger Product Manager aj@suse.com Programming Languages C C++ Fortran And Go 2 Why new compiler? Faster applications Support
More informationIntel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth
Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3
More informationIntroduction to Performance Tuning & Optimization Tools
Introduction to Performance Tuning & Optimization Tools a[i] a[i+1] + a[i+2] a[i+3] b[i] b[i+1] b[i+2] b[i+3] = a[i]+b[i] a[i+1]+b[i+1] a[i+2]+b[i+2] a[i+3]+b[i+3] Ian A. Cosden, Ph.D. Manager, HPC Software
More informationServicing HEP experiments with a complete set of ready integreated and configured common software components
Journal of Physics: Conference Series Servicing HEP experiments with a complete set of ready integreated and configured common software components To cite this article: Stefan Roiser et al 2010 J. Phys.:
More informationDebugging and Profiling
Debugging and Profiling Dr. Axel Kohlmeyer Senior Scientific Computing Expert Information and Telecommunication Section The Abdus Salam International Centre for Theoretical Physics http://sites.google.com/site/akohlmey/
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationCompiler Options. Linux/x86 Performance Practical,
Center for Information Services and High Performance Computing (ZIH) Compiler Options Linux/x86 Performance Practical, 17.06.2009 Zellescher Weg 12 Willers-Bau A106 Tel. +49 351-463 - 31945 Ulf Markwardt
More informationQuality in the Data Center: Data Collection and Analysis
Quality in the Data Center: Data Collection and Analysis Kingsum Chow, Chief Scientist Alibaba Systems Software Hardware Co-Optimization PNSQC 2017.10.07 3:45pm-5:20pm Acknowledged: Chengdong Li and Wanyi
More informationMain Memory CHAPTER. Exercises. 7.9 Explain the difference between internal and external fragmentation. Answer:
7 CHAPTER Main Memory Exercises 7.9 Explain the difference between internal and external fragmentation. a. Internal fragmentation is the area in a region or a page that is not used by the job occupying
More informationDynamically-typed Languages. David Miller
Dynamically-typed Languages David Miller Dynamically-typed Language Everything is a value No type declarations Examples of dynamically-typed languages APL, Io, JavaScript, Lisp, Lua, Objective-C, Perl,
More informationXcode Release Notes. Apple offers a number of resources where you can get Xcode development support:
Xcode Release Notes This document contains release notes for Xcode 5 developer preview 5. It discusses new features and issues present in Xcode 5 developer preview 5 and issues resolved from earlier Xcode
More informationLecture 9 Dynamic Compilation
Lecture 9 Dynamic Compilation I. Motivation & Background II. Overview III. Compilation Policy IV. Partial Method Compilation V. Partial Dead Code Elimination VI. Escape Analysis VII. Results Partial Method
More informationAutomatic Tuning of HPC Applications with Periscope. Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München
Automatic Tuning of HPC Applications with Periscope Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München Agenda 15:00 15:30 Introduction to the Periscope Tuning Framework (PTF)
More informationMemory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358
Memory Management Reading: Silberschatz chapter 9 Reading: Stallings chapter 7 1 Outline Background Issues in Memory Management Logical Vs Physical address, MMU Dynamic Loading Memory Partitioning Placement
More informationExperience with Data-flow, DQM and Analysis of TIF Data
Experience with Data-flow, DQM and Analysis of TIF Data G. Bagliesi, R.J. Bainbridge, T. Boccali, A. Bocci, V. Ciulli, N. De Filippis, M. De Mattia, S. Dutta, D. Giordano, L. Mirabito, C. Noeding, F. Palla,
More informationCSE 501: Compiler Construction. Course outline. Goals for language implementation. Why study compilers? Models of compilation
CSE 501: Compiler Construction Course outline Main focus: program analysis and transformation how to represent programs? how to analyze programs? what to analyze? how to transform programs? what transformations
More informationComputer Caches. Lab 1. Caching
Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main
More informationC6000 Compiler Roadmap
C6000 Compiler Roadmap CGT v7.4 CGT v7.3 CGT v7. CGT v8.0 CGT C6x v8. CGT Longer Term In Development Production Early Adopter Future CGT v7.2 reactive Current 3H2 4H 4H2 H H2 Future CGT C6x v7.3 Control
More informationIntel VTune Performance Analyzer 9.1 for Windows* In-Depth
Intel VTune Performance Analyzer 9.1 for Windows* In-Depth Contents Deliver Faster Code...................................... 3 Optimize Multicore Performance...3 Highlights...............................................
More informationDynamic Race Detection with LLVM Compiler
Dynamic Race Detection with LLVM Compiler Compile-time instrumentation for ThreadSanitizer Konstantin Serebryany, Alexander Potapenko, Timur Iskhodzhanov, and Dmitriy Vyukov OOO Google, 7 Balchug st.,
More informationEfficient and Large Scale Program Flow Tracing in Linux. Alexander Shishkin, Intel
Efficient and Large Scale Program Flow Tracing in Linux Alexander Shishkin, Intel 16.09.2013 Overview Program flow tracing - What is it? - What is it good for? Intel Processor Trace - Features / capabilities
More information12: Memory Management
12: Memory Management Mark Handley Address Binding Program goes through multiple steps from compilation to execution. At some stage, addresses in the program must be bound to physical memory addresses:
More informationImproving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester
Improving Error Checking and Unsafe Optimizations using Software Speculation Kirk Kelsey and Chen Ding University of Rochester Outline Motivation Brief problem statement How speculation can help Our software
More informationThinLTO. A Fine-Grained Demand-Driven Infrastructure. Teresa Johnson, Xinliang David Li
ThinLTO A Fine-Grained Demand-Driven Infrastructure Teresa Johnson, Xinliang David Li tejohnson,davidxl@google.com Outline CMO Background ThinLTO Motivation and Overview ThinLTO Details Build System Integration
More informationTK2123: COMPUTER ORGANISATION & ARCHITECTURE. CPU and Memory (2)
TK2123: COMPUTER ORGANISATION & ARCHITECTURE CPU and Memory (2) 1 Contents This lecture will discuss: Cache. Error Correcting Codes. 2 The Memory Hierarchy Trade-off: cost, capacity and access time. Faster
More informationThe SECURE Project and GCC
The SECURE Project and GCC Security Enhancing Compilers for Use in Real-world Environments Speaker: Graham Markall graham.markall@embecosm.com Contributors: Jeremy Bennett, Craig Blackmore, Simon Cook,
More informationGet an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*
Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling
More informationWeaving Source Code into ThreadScope
Weaving Source Code into ThreadScope Peter Wortmann scpmw@leeds.ac.uk University of Leeds Visualization and Virtual Reality Group sponsorship by Microsoft Research Haskell Implementors Workshop 2011 Peter
More informationCache Performance Analysis with Callgrind and KCachegrind
Cache Performance Analysis with Callgrind and KCachegrind VI-HPS Tuning Workshop 8 September 2011, Aachen Josef Weidendorfer Computer Architecture I-10, Department of Informatics Technische Universität
More informationToday. More C/compiling/Makefiles Working with multiple files Pointer exercises
Algorithms for GIS Today More C/compiling/Makefiles Working with multiple files Pointer exercises Announcements: Assignment 1 due on Tuesday Assignment 2 posted, due on 9/22 Josh office hours: Mon 8pm
More informationXcode Release Notes. Apple offers a number of resources where you can get Xcode development support:
Xcode Release Notes This document contains release notes for Xcode 5 developer preview 4. It discusses new features and issues present in Xcode 5 developer preview 4 and issues resolved from earlier Xcode
More informationKernel perf tool user guide
Kernel perf tool user guide 2017-10-16 Reversion Record Date Rev Change Description Author 2017-10-16 V0.1 Inital Zhang Yongchang 1 / 10 catalog 1 PURPOSE...4 2 TERMINOLOGY...4 3 ENVIRONMENT...4 3.1 HARDWARE
More informationOutlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium
Main Memory Outlook Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium 2 Backgound Background So far we considered how to share
More informationLinking. Explain what ELF format is. Explain what an executable is and how it got that way. With huge thanks to Steve Chong for his notes from CS61.
Linking Topics How do you transform a collection of object files into an executable? How is an executable structured? Why is an executable structured as it is? Learning Objectives: Explain what ELF format
More informationLink 8. Dynamic Linking
Link 8. Dynamic Linking Young W. Lim 2018-12-27 Thr Young W. Lim Link 8. Dynamic Linking 2018-12-27 Thr 1 / 66 Outline 1 Linking - 8. Dynamic Linking Based on Dynamic linking with a shared library example
More informationShengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota
Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)
More informationDebugging of CPython processes with gdb
Debugging of CPython processes with gdb KharkivPy January 28th, 2017 by Roman Podoliaka, Development Manager at Mirantis twitter: @rpodoliaka blog: http://podoliaka.org slides: http://podoliaka.org/talks/
More informationThread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark Motivation Hardware Trends Put more cores
More informationManaged runtimes & garbage collection
Managed runtimes Advantages? Managed runtimes & garbage collection CSE 631 Some slides by Kathryn McKinley Disadvantages? 1 2 Managed runtimes Portability (& performance) Advantages? Reliability Security
More informationControl Flow Integrity for COTS Binaries Report
Control Flow Integrity for COTS Binaries Report Zhang and Sekar (2013) January 2, 2015 Partners: Instructor: Evangelos Ladakis Michalis Diamantaris Giorgos Tsirantonakis Dimitris Kiosterakis Elias Athanasopoulos
More informationOn the Use of Large Intel Xeon Phi Clusters for GEANT4-Based Simulations
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 5 Special issue with selected papers from the workshop Two Years Avitohol: Advanced High Performance Computing Applications
More informationStudy and Analysis of ELF Vulnerabilities in Linux
Study and Analysis of ELF Vulnerabilities in Linux Biswajit Sarma Assistant professor, Department of Computer Science and Engineering, Jorhat Engineering College, Srishti Dasgupta Final year student, Department
More information