Call Paths for Pin Tools

Milind Chabbi, Xu Liu, and John Mellor-Crummey, Department of Computer Science, Rice University. CGO'14, Orlando, FL, February 17, 2014

What is a Call Path?

main() -> A() -> B() -> Foo() { x = *ptr; }

The chain of function calls that led to the current point in the program (a.k.a. calling context / call stack / backtrace / activation record). Call paths are used by performance analysis tools and by debuggers.

Need: Ubiquitous Call Paths
Fine-grained monitoring tools need call paths everywhere:
- Correctness: data race detection, taint analysis, array out-of-bounds detection
- Performance: reuse-distance analysis, cache simulation, false sharing detection, redundancy detection (e.g. dead writes)
- Other tools: debugging, testing, resiliency, replay, etc.

Need: Ubiquitous Call Paths
Example: data race detection attributes each conflicting access (Thread 1 vs. Thread 2) to its full call path.

Need: Ubiquitous Call Paths
Example: reuse-distance analysis [Liu et al., ISPASS'13] attributes the distance between a use and its reuse to references in their full calling contexts.


State of the Art in Collecting Ubiquitous Call Paths
"It will slow down execution by a factor of several thousand compared to native execution -- I'd guess -- so you'll wind up with something that is unusably slow on anything except the smallest problems."
"If you tried to invoke Thread::getCallStack on every memory access there would be very serious performance problems; your program would probably never reach main."
No existing support for collecting calling contexts, so we built one ourselves: CCTLib.

Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

Top Three Challenges
1. Overhead (space)
2. Overhead (time)
3. Overhead (parallel scaling)

Store History of Contexts Compactly
Problem: a deluge of call paths (instruction stream: A A A A B B C C D E F G).
Solution: call paths share common prefixes, so store them as a calling context tree (CCT), with one CCT per thread.
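The prefix-sharing idea can be sketched in a few lines: inserting each call path into a trie-like tree means frames shared with earlier paths reuse existing nodes. This is an illustrative sketch (the names `CCTNode` and `insert_path` are ours, not CCTLib's API):

```python
class CCTNode:
    def __init__(self):
        self.children = {}   # callee name -> CCTNode

def insert_path(root, path):
    """Insert one call path; frames shared with earlier paths reuse nodes."""
    node = root
    for frame in path:
        node = node.children.setdefault(frame, CCTNode())
    return node

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

root = CCTNode()
paths = [
    ["main", "A", "B", "D"],
    ["main", "A", "B", "E"],
    ["main", "A", "C", "F"],
    ["main", "A", "C", "G"],
]
for p in paths:
    insert_path(root, p)
# 16 frames appear across the four paths, but shared prefixes
# ("main", "main/A", ...) collapse them into 8 tree nodes plus the root.
print(count_nodes(root))   # 9
```

Each distinct call path corresponds to exactly one leaf-to-root chain, so a tool can store a single node pointer instead of a full backtrace.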

Shadow Stack to Avoid Unwinding Overhead
Problem: unwinding the stack (main() -> P() -> Foo() { *ptr = 100; x = 42; }) at every query is expensive.
Solution: reverse the process. Eagerly build a replica (shadow) stack on the fly: on each call, CTXT descends to the callee's node; on each return, CTXT pops back to the caller. Tools can obtain a pointer to the current context via CTXT in constant time.
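The eager shadow-stack scheme amounts to keeping a current-node pointer into the CCT and updating it at every call and return, so a tool's context query is just a pointer read. A minimal sketch, with illustrative names (`ShadowStack`, `on_call`, `expand`) rather than CCTLib's actual API:

```python
class CCTNode:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = {}   # callee name -> CCTNode

class ShadowStack:
    """CTXT is simply `self.ctxt`; no stack unwinding is ever needed."""
    def __init__(self):
        self.root = CCTNode("ROOT")
        self.ctxt = self.root

    def on_call(self, callee):      # instrumented at each call
        self.ctxt = self.ctxt.children.setdefault(
            callee, CCTNode(callee, self.ctxt))

    def on_return(self):            # instrumented at each return
        self.ctxt = self.ctxt.parent

    def get_context(self):          # O(1): hand out the CTXT handle itself
        return self.ctxt

def expand(node):
    """Only for display: walk parent pointers to recover the full path."""
    frames = []
    while node.parent:
        frames.append(node.name)
        node = node.parent
    return frames[::-1]

ss = ShadowStack()
ss.on_call("main"); ss.on_call("P"); ss.on_call("Foo")
print(expand(ss.get_context()))   # ['main', 'P', 'Foo']
ss.on_return()                    # Foo returns; CTXT now points at P
```

The handle returned by `get_context` is a stable node pointer, so a client can record it per memory access and expand it to a readable backtrace only when reporting.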

Maintaining CTXT
Return to the caller is a constant-time update: follow the node's parent pointer.
Finding a callee from its caller, however, involves a lookup among the caller's callees (e.g. W(), X(), Y(), Z() under P()).

Accelerate Lookup with Splay Trees
Organize each node's callees (e.g. W(), X(), Y(), Z() under P()) as a splay tree ["Self-adjusting binary search trees", Sleator and Tarjan, 1985]. Splaying ensures that frequently called functions stay near the root of the tree.

Context Should Incorporate the Instruction Pointer
main() -> P() -> Foo() { *ptr = 100; x = 42; }
The two statements in Foo() need distinct contexts: CTXT = Foo:INS1 for *ptr = 100, and CTXT = Foo:INS2 for x = 42.

Attributing to Instructions
A CCT node represents a Pin trace; CCTLib maintains the node-to-trace mapping, and each slot in a node represents one instruction in that trace (e.g. slots 1-6).
Problem: mapping an IP to its slot at runtime is hard, because of variable-size x86 instructions and non-sequential control flow.
Solution: use Pin's trace instrumentation to hardwire the slot number as an argument to the context query routine for each IP (e.g. GetContext(2), GetContext(5)).
Result: constant time to query CTXT.
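The slot trick can be sketched as follows: when a trace is instrumented, each instruction receives a small integer slot, and that constant is baked into the analysis call, so the runtime query is an array index rather than a search over variable-size IPs. The names here (`slot_of`, `get_context`, `CCTNode`) are illustrative, not CCTLib's API:

```python
# Instrumentation time (once per Pin trace): number the instructions.
trace = ["push rbp", "mov rbp,rsp", "mov rax,[ptr]", "mov [x],42", "pop rbp", "ret"]
slot_of = {ip: slot for slot, ip in enumerate(trace)}

class CCTNode:
    """A CCT node representing this trace executing in some calling context."""
    def __init__(self, trace_len):
        self.slots = [0] * trace_len   # e.g. per-instruction metric counters

cur_node = CCTNode(len(trace))         # stands in for the CTXT pointer

def get_context(node, slot):
    """Runtime query: `slot` arrives as a hardwired constant, so this is a
    constant-time array index; no IP decoding or lookup happens here."""
    node.slots[slot] += 1              # attribute one event to <node, slot>
    return (node, slot)

# The instrumented code for "mov rax,[ptr]" effectively calls get_context(cur_node, 2):
ctxt = get_context(cur_node, slot_of["mov rax,[ptr]"])
print(ctxt[1], cur_node.slots[2])   # 2 1
```

Since the slot constant is chosen while Pin is already decoding the trace, the instruction-size and control-flow problems are paid once at instrumentation time instead of on every access.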

Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

Data-Centric Attribution in CCTLib

int MyArray[SZ];
int * Create() { return malloc( ); }
void Update(int * ptr) { for ( ) ptr[i]++; }
int main() {
    int * p;
    if ( ) p = Create(); else p = MyArray;
    Update(p);
}

Associate each data access with its data object:
- Dynamic allocation: identified by the call path of its allocation site (main() -> Create() -> malloc()).
- Static objects: identified by variable name (MyArray).

Data-Centric Attribution: How?
Record all <AddressRange, VariableName> tuples in a map. Instrument all allocation/free routines and maintain <AddressRange, CallPath> tuples in the map. At each memory access, search the map for the address.
Problems: searching the map on every access is expensive, and the map must be concurrent for threaded programs.

Data-Centric Attribution Using a Balanced Tree
Observation: updates to the map are infrequent; lookups are frequent.
Solution #1: a sorted map. Keep <AddressRange, Object> in a balanced binary tree: low memory cost, O(N), and moderate lookup cost, O(log N). Concurrent access is handled by a novel replicated tree data structure.
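The sorted-map idea can be sketched with a binary search over range start addresses. This stand-in uses sorted arrays with `bisect` for O(log N) lookup (a production version, like CCTLib's, would use a balanced tree so updates are also O(log N)); the class and labels are ours:

```python
import bisect

class IntervalMap:
    """Sorted map from [start, end) address ranges to data-object labels."""
    def __init__(self):
        self.starts, self.ends, self.objs = [], [], []

    def insert(self, start, end, obj):   # from malloc wrappers / static symbols
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.ends.insert(i, end)
        self.objs.insert(i, obj)

    def remove(self, start):             # e.g. on free()
        i = bisect.bisect_left(self.starts, start)
        if i < len(self.starts) and self.starts[i] == start:
            del self.starts[i]; del self.ends[i]; del self.objs[i]

    def lookup(self, addr):              # on every memory access: O(log N)
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0 and addr < self.ends[i]:
            return self.objs[i]
        return None                      # address not in any tracked object

m = IntervalMap()
m.insert(0x1000, 0x1400, "call path: main->Create->malloc")
m.insert(0x8000, 0x8100, "static: MyArray")
print(m.lookup(0x1234))   # call path: main->Create->malloc
print(m.lookup(0x7fff))   # None
```

Memory cost stays proportional to the number of live objects, which is why this variant trades slower lookups for a much smaller footprint than shadow memory.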

Data-Centric Attribution Using Shadow Memory
Solution #2: shadow memory. For each memory cell the application uses (e.g. cells of ObjA, ObjB, ObjC), a shadow cell in CCTLib holds a handle for that cell's data object. Low lookup cost, O(1), but high memory cost. Shadow memory supports concurrent access.
CCTLib supports both solutions; clients can choose.
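The shadow-memory approach can be sketched as a lazily populated two-level table: the shadow for a page is allocated only when the application first touches it, and each access then costs one index computation. Real shadow memory implementations operate on raw pages with address arithmetic; this dict-based Python sketch (our names throughout) only illustrates the structure:

```python
PAGE_BITS = 16
PAGE_SIZE = 1 << PAGE_BITS

class ShadowMemory:
    """Two-level table: shadow pages are allocated lazily, so only
    addresses the application actually touches cost shadow space."""
    def __init__(self):
        self.pages = {}   # page number -> list of shadow cells

    def _page(self, addr):
        pg = addr >> PAGE_BITS
        if pg not in self.pages:
            self.pages[pg] = [None] * PAGE_SIZE
        return self.pages[pg]

    def set_object(self, start, size, handle):   # on allocation
        for a in range(start, start + size):
            self._page(a)[a & (PAGE_SIZE - 1)] = handle

    def get_object(self, addr):                  # O(1) per memory access
        pg = self.pages.get(addr >> PAGE_BITS)
        return pg[addr & (PAGE_SIZE - 1)] if pg else None

shadow = ShadowMemory()
# The handle would be e.g. the allocation site's call path; "ObjA" stands in.
shadow.set_object(0x1000, 0x400, "ObjA")
print(shadow.get_object(0x1234))   # ObjA
print(shadow.get_object(0x9999))   # None
```

The O(1) lookup is what makes per-access attribution affordable; the price is shadow space proportional to the touched address range rather than to the number of objects.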

Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

Evaluation
Experimental setup: 2.2 GHz Intel Sandy Bridge, 128 GB DDR3, GNU 4.4.6 tool chain.

Program       Running time (s)
astar         361
bzip2         161
gcc           70
h264ref       618
hmmer         446
libquantum    462
mcf           320
omnetpp       352
Xalan         295
ROSE          24
LAMMPS        99
LULESH        67

The first nine programs are SPEC CINT2006 reference benchmarks. ROSE is a source-to-source compiler from LLNL (3M LOC, compiling a 70K LOC input; deep call chains). LAMMPS is a molecular dynamics code (500K LOC; deep call chains; multithreaded). LULESH is a hydrodynamics mini-app from LLNL (frequent data allocation and deallocation; memory bound; multithreaded; poor scaling).

Overhead Analysis

                                           Call path    Data-centric attribution
                                           collection   Balanced tree   Shadow memory
Time overhead vs. simple instruction-
counting Pin tool                          1.7x         4.5x            2.3x
Memory overhead vs. original program       1.8x         2.0x            11x

Relative to the original program (null Pin tool baseline), call path collection alone is about 30x.

CCTLib Scales to Multiple Threads
CCTLib overhead for n threads: Overhead(n) = Emon(n) / Eorig(n)
CCTLib scalability for N threads: Scalability(N) = Overhead(1) / Overhead(N)
Higher scalability is better; 1.0 is ideal.
[Chart: CCTLib scalability on LAMMPS for 1, 2, 4, 8, 16, and 32 threads, comparing call path collection, data-centric attribution via sorted maps, and data-centric attribution via shadow memory.]
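With hypothetical numbers (not measurements from this work), the two formulas compose as follows: if monitored and original execution times grow proportionally as threads are added, scalability stays at the ideal 1.0.

```python
def overhead(e_mon, e_orig):
    """Overhead(n) = Emon(n) / Eorig(n): monitored vs. original time."""
    return e_mon / e_orig

def scalability(overhead_1, overhead_n):
    """Scalability(N) = Overhead(1) / Overhead(N).
    Higher is better; 1.0 means overhead does not grow with thread count."""
    return overhead_1 / overhead_n

# Hypothetical timings in seconds, chosen only to illustrate the arithmetic:
o1  = overhead(300.0, 10.0)    # 1 thread:  30x overhead
o16 = overhead(37.5, 1.25)     # 16 threads: still 30x overhead
print(scalability(o1, o16))    # 1.0
```

If the monitored run had slowed disproportionately at 16 threads (say 50x overhead), scalability would drop to 30/50 = 0.6, flagging a parallel-scaling problem in the tool itself.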

Conclusions
- Many tools can benefit from attributing metrics to full calling contexts and/or data objects.
- Ubiquitous calling-context collection was previously considered prohibitively expensive.
- Fine-grained attribution of metrics to calling contexts and data objects is practical: full-precision call path collection and data-centric attribution require only modest space and time overhead.
- The choice of algorithms and data structures was a key to success.
http://code.google.com/p/cctlib/

Other Complications in Real Programs
- Complex control flow: signal handling, setjmp/longjmp, C++ exceptions (try/catch)
- Thread creation and destruction: maintaining parent-child relationships between threads; scalability to a large number of threads