Call Paths for Pin Tools

Size: px

Start display at page:

Download "Call Paths for Pin Tools"

Bathsheba Walsh
5 years ago
Views:

1 , Xu Liu, and John Mellor-Crummey Department of Computer Science Rice University CGO'14, Orlando, FL February 17, 2014

2 What is a Call Path? main() A() B() Foo() { x = *ptr;} Chain of function calls that led to the current point in the program. (a.k.a Calling Context / Call Stack / Backtrace / Activation Record)

3 What is a Call Path? Performance Analysis Tools main() A() Debuggers B() Foo() { x = *ptr;} Chain of function calls that led to the current point in the program. (a.k.a Calling Context / Call Stack / Backtrace / Activation Record)

4 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

5 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Attribute each conflicting access to its

6 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Attribute each conflicting access to its full call path Thread 1 Thread 2 Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

7 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

8 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools [Liu et al. ISPASS 13] Debugging, testing, resiliency, replay, etc. Attribute distance between use and reuse to references in full context

9 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools [Liu et al. ISPASS 13] Debugging, testing, resiliency, replay, etc. Attribute distance between use and reuse to references in full context

10 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

11 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

12 State-of-the-art in Collecting Ubiquitous Call Paths It will slow down execution by a factor of several thousand compared to native execution -- I'd guess -so you'll wind up with something that is unusably slow on anything except the smallest problems.! If you tried to invoke Thread::getCallStack on every memory access there would be very serious performance problems your program would probably never reach main. No support for collecting calling contexts

you'll wind up with something that is unusably slow on anything except the

! If you tried to invoke Thread::getCallStack on every memory access there would

13 State-of-the-art in Collecting Ubiquitous Call Paths It will slow down execution by a factor of several thousand compared to native execution -- I'd guess -so you'll wind up with something that is unusably slow on anything except the smallest problems.! If you tried to invoke Thread::getCallStack on every memory access there would be very serious performance problems your program would probably never reach main. No support for collecting calling contexts We built one ourselves CCTLib

14 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

15 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

16 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

17 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

18 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

19 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

20 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

21 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

22 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

23 Store History of Contexts Compactly Problem: Deluge of call paths

24 Store History of Contexts Compactly Problem: Deluge of call paths A A A A B B C C D E F G Instruction stream

25 Store History of Contexts Compactly Problem: Deluge of call paths A A A A B B C C D E F G Instruction stream

26 Store History of Contexts Compactly Problem: Deluge of call paths A A A A B B C C D E F G Instruction stream

27 Store History of Contexts Compactly Problem: Deluge of call paths Solution Call paths share common prefix Store call paths as a A A A A B B C C D E F G calling context tree (CCT) One CCT per thread A B C D E F G Instruction stream Instruction stream

28 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

29 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

30 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

31 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() CTXT Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

32 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() CTXT Main() call P() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

33 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() CTXT P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

34 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() call P() Tools can obtain! pointer to the! current context! via CTXT! in constant time Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } CTXT Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

35 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() CTXT P() return Tools can obtain! pointer to the! current context! via CTXT! in constant time Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

36 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() P() Tools can obtain! pointer to the! current context! via CTXT! in constant time call Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } CTXT Z()

37 Maintaining CONTE CTXT Main() P() W() CTXT Z()

38 Maintaining CONTE CTXT Main() CTXT P() Return to caller: Constant time update W() Z()

39 Maintaining CONTE CTXT Main() CTXT P() W() Z()

40 Maintaining CONTE CTXT Main() CTXT P() CTXT P() W() X() Y() Z() W() Z() Finding a callee from its caller involves a lookup

41 Maintaining CONTE CTXT Main() CTXT P() CTXT P() W() X() Y() Z() W() Z() Finding a callee from its caller involves a lookup

42 Maintaining CONTE CTXT Main() P() CTXT P() W() X() Y() CTXT Z() W() Z() Finding a callee from its caller involves a lookup

43 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

44 Accelerate Lookup with Splay Trees P() W() X() Y() CTXT Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

45 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

46 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

47 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

48 Accelerate Lookup with Splay Trees P() CTXT W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

49 Context Should Incorporate Instruction Pointer main() P() Foo(){ *ptr = 100; x = 42; }

50 Context Should Incorporate Instruction Pointer main() P() Foo(){ *ptr = 100; x = 42; } CTXT = Foo: INS 1

51 Context Should Incorporate Instruction Pointer main() P() Foo(){ *ptr = 100; x = 42; } CTXT = Foo: INS 1 CTXT = Foo: INS 2

52 Attributing to Instructions main() Foo() P() CCTNode A CCT node represents a Pin trace CCTLib maintains node Pin trace mapping Each slot in a node represents an instruction in a Pin trace Instructions CTXT

53 Attributing to Instructions main() P() Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: CCTNode Foo() Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT Instructions

54 Attributing to Instructions main() P() CCTNode Foo() Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

55 Attributing to Instructions main() P() Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow CCTNode Foo() GetContext(2) Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query 2 CTXT Instructions CTXT

56 Attributing to Instructions main() P() Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow CCTNode Foo() Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query 2 CTXT Instructions CTXT

57 Attributing to Instructions main() P() CCTNode Foo() Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

58 Attributing to Instructions main() Foo() P() CCTNode GetContext(5) Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

59 Attributing to Instructions main() P() CCTNode Foo() Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

60 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

61 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

62 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

63 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

64 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

65 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

66 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

67 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

68 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

69 Data-Centric Attribution How? Record all <AddressRange, VariableName> tuples in a map Instrument all allocation/free routines and maintain <AddressRange, CallPath> tuples in the map At each memory access: search the map for the address! Problems Searching the map on each access is expensive Map needs to be concurrent for threaded programs

70 Data-Centric Attribution using a Balanced Tree Observation: Updates to the map are infrequent Lookups in the maps are frequent! Solution #1: sorted map Keep <AddressRange, Object> in a balanced binary tree Low memory cost O(N) Moderate lookup cost O(log N) Concurrent access is handled by a novel replicated tree data structure

71 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory Application CCTLib

72 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib

73 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application ObjA ObjA ObjA ObjA ObjA ObjA ObjA ObjA CCTLib

74 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib ObjA ObjA ObjA ObjA ObjB ObjC ObjA ObjA ObjA ObjA ObjB ObjB ObjB ObjB ObjC ObjC ObjC ObjC

75 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib ObjA ObjA ObjA ObjA ObjB ObjC ObjA ObjA ObjA ObjA ObjB ObjB ObjB ObjB ObjC ObjC ObjC ObjC

76 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib ObjA ObjA ObjA ObjA ObjB ObjC ObjA ObjA ObjA ObjA ObjB ObjB ObjB ObjB ObjC ObjC ObjC ObjC For each memory cell, a shadow cell holds a handle for the memory cell s data object Low lookup cost O(1), high memory cost Shadow memory supports concurrent access CCTLib supports both solutions, clients can choose

77 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

78 Evaluation Program Running time! in sec Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67

79 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67

80 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67 Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain Source-to-source compiler from LLNL! 3M LOC compiling 70K LOC Deep call chains

81 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67 Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain Source-to-source compiler from LLNL! 3M LOC compiling 70K LOC Deep call chains Molecular dynamics code! 500K LOC Deep call chains Multithreaded

82 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67 Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain Source-to-source compiler from LLNL! 3M LOC compiling 70K LOC Deep call chains Molecular dynamics code! 500K LOC Deep call chains Multithreaded Hydrodynamics mini-app from LLNL! Frequent data allocation and de-allocations Memory bound Multithreaded, Poor scaling

83 Overhead Analysis Call path collection Data-centric! att Time overhead relative to original program (Null Pin tool) Time overhead relative to simple instruction counting Pin tool Memory overhead relative to original program Balanced Tree 30x 4.5x 1.7x 2.0x 1.8x Sh

84 Overhead Analysis Call path collection Data-centric! att Time overhead relative to original program (Null Pin tool) Time overhead relative to simple instruction counting Pin tool Memory overhead relative to original program Balanced Tree 30x 4.5x 1.7x 2.0x 1.8x Sh

85 Overhead Analysis Call path collection Data-centric attribution! Balanced Tree Shadow Memory Time overhead relative to simple instruction counting Pin tool 1.7x 4.5x 2.3x Memory overhead relative to original program 1.8x 2.0x 11x

86 Overhead Analysis Call path collection Data-centric attribution! Balanced Tree Shadow Memory Time overhead relative to simple instruction counting Pin tool 1.7x 4.5x 2.3x Memory overhead relative to original program 1.8x 2.0x 11x

87 CCTLib Scales to Multiple Threads CCTLib overhead of N threads: CCTLib scalability of N threads: Emon(n) Eorig(n) Overhead(1) Overhead(N) Higher scalability is better, 1.0 is ideal CCTLib scalability on LAMMPS Scalability Call path collection Data-centric attribution via Sorted maps Number of threads Data-centric attribution via Shadow memoory

88 Conclusions Many tools can benefit from attributing metrics to full calling contexts and/or data objects Ubiquitous calling context collection was previously considered prohibitively expensive Fine-grain attribution of metrics to calling contexts and data objects is practical Full-precision call path collection and data-centric attribution require only modest space and time overhead Choice of algorithms and data structures was a key to success

89 Conclusions Many tools can benefit from attributing metrics to full calling contexts and/or data objects Ubiquitous calling context collection was previously considered prohibitively expensive Fine-grain attribution of metrics to calling contexts and data objects is practical Full-precision call path collection and data-centric attribution require only modest space and time overhead Choice of algorithms and data structures was a key to success

Conclusions Many tools can benefit from attributing metrics to full calling contexts and/or data objects Ubiquitous calling context collection was previously considered prohibitively expensive

90 Conclusions Many tools can benefit from attributing metrics to full calling contexts and/or data objects Ubiquitous calling context collection was previously considered prohibitively expensive Fine-grain attribution of metrics to calling contexts and data objects is practical Full-precision call path collection and data-centric attribution require only modest space and time overhead Choice of algorithms and data structures was a key to success

91 Other Complications in Real Programs Complex control flow Signal handling Setjmp-Longjmp C++ exceptions (try-catch)! Thread creation and destruction Maintaining parent-child relationships between threads Scalability to large number of threads

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.