Call Paths for Pin Tools

Size: px
Start display at page:

Download "Call Paths for Pin Tools"

Transcription

1 , Xu Liu, and John Mellor-Crummey Department of Computer Science Rice University CGO'14, Orlando, FL February 17, 2014

2 What is a Call Path? main() A() B() Foo() { x = *ptr;} Chain of function calls that led to the current point in the program. (a.k.a Calling Context / Call Stack / Backtrace / Activation Record)

3 What is a Call Path? Performance Analysis Tools main() A() Debuggers B() Foo() { x = *ptr;} Chain of function calls that led to the current point in the program. (a.k.a Calling Context / Call Stack / Backtrace / Activation Record)

4 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

5 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

6 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Attribute each conflicting access to its full call path Thread 1 Thread 2 Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

7 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

8 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools [Liu et al. ISPASS 13] Debugging, testing, resiliency, replay, etc. Attribute distance between use and reuse to references in full context

9 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools [Liu et al. ISPASS 13] Debugging, testing, resiliency, replay, etc. Attribute distance between use and reuse to references in full context

10 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

11 Need: Ubiquitous Call Paths Fine-grained monitoring tools Correctness Data race detection Taint analysis Array out of bound detection Performance Reuse-distance analysis Cache simulation False sharing detection Redundancy detection (e.g. dead writes) Other tools Debugging, testing, resiliency, replay, etc.

12 State-of-the-art in Collecting Ubiquitous Call Paths It will slow down execution by a factor of several thousand compared to native execution -- I'd guess -so you'll wind up with something that is unusably slow on anything except the smallest problems.! If you tried to invoke Thread::getCallStack on every memory access there would be very serious performance problems your program would probably never reach main. No support for collecting calling contexts

13 State-of-the-art in Collecting Ubiquitous Call Paths It will slow down execution by a factor of several thousand compared to native execution -- I'd guess -so you'll wind up with something that is unusably slow on anything except the smallest problems.! If you tried to invoke Thread::getCallStack on every memory access there would be very serious performance problems your program would probably never reach main. No support for collecting calling contexts We built one ourselves CCTLib

14 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

15 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

16 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

17 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

18 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

19 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

20 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

21 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

22 Top Three Challenges 1 Overhead (Space) 2 Overhead (Time) 3 Overhead (Parallel scaling)

23 Store History of Contexts Compactly Problem: Deluge of call paths

24 Store History of Contexts Compactly Problem: Deluge of call paths A A A A B B C C D E F G Instruction stream

25 Store History of Contexts Compactly Problem: Deluge of call paths A A A A B B C C D E F G Instruction stream

26 Store History of Contexts Compactly Problem: Deluge of call paths A A A A B B C C D E F G Instruction stream

27 Store History of Contexts Compactly Problem: Deluge of call paths Solution Call paths share common prefix Store call paths as a A A A A B B C C D E F G calling context tree (CCT) One CCT per thread A B C D E F G Instruction stream Instruction stream

28 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

29 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

30 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

31 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() CTXT Main() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

32 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() CTXT Main() call P() P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

33 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() CTXT P() Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

34 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() call P() Tools can obtain! pointer to the! current context! via CTXT! in constant time Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } CTXT Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

35 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() CTXT P() return Tools can obtain! pointer to the! current context! via CTXT! in constant time Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} }

36 Shadow Stack to Avoid Unwinding Overhead Problem: Unwinding overhead Main() Solution: Reverse the process. Eagerly build a replica/shadow stack on-the-fly. Main() P() P() Tools can obtain! pointer to the! current context! via CTXT! in constant time call Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } Foo() { *ptr = *ptr 100; = 100; x x = = 42; 42;} } CTXT Z()

37 Maintaining CONTE CTXT Main() P() W() CTXT Z()

38 Maintaining CONTE CTXT Main() CTXT P() Return to caller: Constant time update W() Z()

39 Maintaining CONTE CTXT Main() CTXT P() W() Z()

40 Maintaining CONTE CTXT Main() CTXT P() CTXT P() W() X() Y() Z() W() Z() Finding a callee from its caller involves a lookup

41 Maintaining CONTE CTXT Main() CTXT P() CTXT P() W() X() Y() Z() W() Z() Finding a callee from its caller involves a lookup

42 Maintaining CONTE CTXT Main() P() CTXT P() W() X() Y() CTXT Z() W() Z() Finding a callee from its caller involves a lookup

43 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

44 Accelerate Lookup with Splay Trees P() W() X() Y() CTXT Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

45 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

46 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

47 Accelerate Lookup with Splay Trees CTXT P() W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

48 Accelerate Lookup with Splay Trees P() CTXT W() X() Y() Z() Splay tree [ Self-adjusting binary search trees by Sleator et al. 1985] ensures frequently called functions are near the root of the tree

49 Context Should Incorporate Instruction Pointer main() P() Foo(){ *ptr = 100; x = 42; }

50 Context Should Incorporate Instruction Pointer main() P() Foo(){ *ptr = 100; x = 42; } CTXT = Foo: INS 1

51 Context Should Incorporate Instruction Pointer main() P() Foo(){ *ptr = 100; x = 42; } CTXT = Foo: INS 1 CTXT = Foo: INS 2

52 Attributing to Instructions main() Foo() P() CCTNode A CCT node represents a Pin trace CCTLib maintains node Pin trace mapping Each slot in a node represents an instruction in a Pin trace Instructions CTXT

53 Attributing to Instructions main() P() Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: CCTNode Foo() Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT Instructions

54 Attributing to Instructions main() P() CCTNode Foo() Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

55 Attributing to Instructions main() P() Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow CCTNode Foo() GetContext(2) Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query 2 CTXT Instructions CTXT

56 Attributing to Instructions main() P() Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow CCTNode Foo() Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query 2 CTXT Instructions CTXT

57 Attributing to Instructions main() P() CCTNode Foo() Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

58 Attributing to Instructions main() Foo() P() CCTNode GetContext(5) Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

59 Attributing to Instructions main() P() CCTNode Foo() Instructions CTXT Problem: Mapping IP to Slot at runtime Variable size x86 instructions Non-sequential control flow Solution: Pin s trace-instrumentation to hardwire Slot# as argument to context query routine for an IP Result: Constant time to query CTXT

60 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

61 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

62 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

63 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

64 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

65 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

66 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

67 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

68 Data-Centric Attribution in CCTLib int MyArray[SZ];! int * Create(){ return malloc( ); }! void Update(int * ptr) { for( ) ptr[i]++; }! int main(){ int * p; if ( ) p = Create(); else p = MyArray; Update(p); } Main() Create() Update() malloc() Associate each data access to its data object Data object Dynamic allocation: Call path of allocation site Static objects: Variable name

69 Data-Centric Attribution How? Record all <AddressRange, VariableName> tuples in a map Instrument all allocation/free routines and maintain <AddressRange, CallPath> tuples in the map At each memory access: search the map for the address! Problems Searching the map on each access is expensive Map needs to be concurrent for threaded programs

70 Data-Centric Attribution using a Balanced Tree Observation: Updates to the map are infrequent Lookups in the maps are frequent! Solution #1: sorted map Keep <AddressRange, Object> in a balanced binary tree Low memory cost O(N) Moderate lookup cost O(log N) Concurrent access is handled by a novel replicated tree data structure

71 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory Application CCTLib

72 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib

73 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application ObjA ObjA ObjA ObjA ObjA ObjA ObjA ObjA CCTLib

74 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib ObjA ObjA ObjA ObjA ObjB ObjC ObjA ObjA ObjA ObjA ObjB ObjB ObjB ObjB ObjC ObjC ObjC ObjC

75 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib ObjA ObjA ObjA ObjA ObjB ObjC ObjA ObjA ObjA ObjA ObjB ObjB ObjB ObjB ObjC ObjC ObjC ObjC

76 Data-Centric Attribution using Shadow Memory Solution #2: shadow memory ObjA Application CCTLib ObjA ObjA ObjA ObjA ObjB ObjC ObjA ObjA ObjA ObjA ObjB ObjB ObjB ObjB ObjC ObjC ObjC ObjC For each memory cell, a shadow cell holds a handle for the memory cell s data object Low lookup cost O(1), high memory cost Shadow memory supports concurrent access CCTLib supports both solutions, clients can choose

77 Roadmap CCTLib Ubiquitous call path collection Attributing costs to data objects Evaluation Conclusions

78 Evaluation Program Running time! in sec Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67

79 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67

80 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67 Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain Source-to-source compiler from LLNL! 3M LOC compiling 70K LOC Deep call chains

81 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67 Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain Source-to-source compiler from LLNL! 3M LOC compiling 70K LOC Deep call chains Molecular dynamics code! 500K LOC Deep call chains Multithreaded

82 Evaluation Spec Int 2006 reference benchmark Program Running time! in sec astr 361 bzip2 161 gcc 70 h264ref 618 hmmer 446 libquantum 462 mcf 320 omnetpp 352 Xalan 295 ROSE 24 LAMMPS 99 LULESH 67 Experimental setup: 2.2GHz Intel Sandy Bridge 128GB DDR3 GNU tool chain Source-to-source compiler from LLNL! 3M LOC compiling 70K LOC Deep call chains Molecular dynamics code! 500K LOC Deep call chains Multithreaded Hydrodynamics mini-app from LLNL! Frequent data allocation and de-allocations Memory bound Multithreaded, Poor scaling

83 Overhead Analysis Call path collection Data-centric! att Time overhead relative to original program (Null Pin tool) Time overhead relative to simple instruction counting Pin tool Memory overhead relative to original program Balanced Tree 30x 4.5x 1.7x 2.0x 1.8x Sh

84 Overhead Analysis Call path collection Data-centric! att Time overhead relative to original program (Null Pin tool) Time overhead relative to simple instruction counting Pin tool Memory overhead relative to original program Balanced Tree 30x 4.5x 1.7x 2.0x 1.8x Sh

85 Overhead Analysis Call path collection Data-centric attribution! Balanced Tree Shadow Memory Time overhead relative to simple instruction counting Pin tool 1.7x 4.5x 2.3x Memory overhead relative to original program 1.8x 2.0x 11x

86 Overhead Analysis Call path collection Data-centric attribution! Balanced Tree Shadow Memory Time overhead relative to simple instruction counting Pin tool 1.7x 4.5x 2.3x Memory overhead relative to original program 1.8x 2.0x 11x

87 CCTLib Scales to Multiple Threads CCTLib overhead of N threads: CCTLib scalability of N threads: Emon(n) Eorig(n) Overhead(1) Overhead(N) Higher scalability is better, 1.0 is ideal CCTLib scalability on LAMMPS Scalability Call path collection Data-centric attribution via Sorted maps Number of threads Data-centric attribution via Shadow memoory

88 Conclusions Many tools can benefit from attributing metrics to full calling contexts and/or data objects Ubiquitous calling context collection was previously considered prohibitively expensive Fine-grain attribution of metrics to calling contexts and data objects is practical Full-precision call path collection and data-centric attribution require only modest space and time overhead Choice of algorithms and data structures was a key to success

89 Conclusions Many tools can benefit from attributing metrics to full calling contexts and/or data objects Ubiquitous calling context collection was previously considered prohibitively expensive Fine-grain attribution of metrics to calling contexts and data objects is practical Full-precision call path collection and data-centric attribution require only modest space and time overhead Choice of algorithms and data structures was a key to success

90 Conclusions Many tools can benefit from attributing metrics to full calling contexts and/or data objects Ubiquitous calling context collection was previously considered prohibitively expensive Fine-grain attribution of metrics to calling contexts and data objects is practical Full-precision call path collection and data-centric attribution require only modest space and time overhead Choice of algorithms and data structures was a key to success

91 Other Complications in Real Programs Complex control flow Signal handling Setjmp-Longjmp C++ exceptions (try-catch)! Thread creation and destruction Maintaining parent-child relationships between threads Scalability to large number of threads

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.

More information

Effective Performance Measurement and Analysis of Multithreaded Applications

Effective Performance Measurement and Analysis of Multithreaded Applications Effective Performance Measurement and Analysis of Multithreaded Applications Nathan Tallent John Mellor-Crummey Rice University CSCaDS hpctoolkit.org Wanted: Multicore Programming Models Simple well-defined

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for

More information

Project 0: Implementing a Hash Table

Project 0: Implementing a Hash Table Project : Implementing a Hash Table CS, Big Data Systems, Spring Goal and Motivation. The goal of Project is to help you refresh basic skills at designing and implementing data structures and algorithms.

More information

A program execution is memory safe so long as memory access errors never occur:

A program execution is memory safe so long as memory access errors never occur: A program execution is memory safe so long as memory access errors never occur: Buffer overflows, null pointer dereference, use after free, use of uninitialized memory, illegal free Memory safety categories

More information

Process a program in execution; process execution must progress in sequential fashion. Operating Systems

Process a program in execution; process execution must progress in sequential fashion. Operating Systems Process Concept An operating system executes a variety of programs: Batch system jobs Time-shared systems user programs or tasks 1 Textbook uses the terms job and process almost interchangeably Process

More information

Pinpointing Data Locality Problems Using Data-centric Analysis

Pinpointing Data Locality Problems Using Data-centric Analysis Center for Scalable Application Development Software Pinpointing Data Locality Problems Using Data-centric Analysis Xu Liu XL10@rice.edu Department of Computer Science Rice University Outline Introduction

More information

A Practical Fine-grained Profiler

A Practical Fine-grained Profiler A Practical Fine-grained Profiler Shasha Wen, Xu Liu! College of William and Mary! {swen, xl10}@cs.wm.edu Milind Chabbi! HPE Lab! milind.m.chabbi@hpe.com Motivation: Computation Redundancy a = m; b = sqrt(a);

More information

Yukinori Sato (JAIST, JST CREST) Yasushi Inoguchi (JAIST) Tadao Nakamura (Keio University)

Yukinori Sato (JAIST, JST CREST) Yasushi Inoguchi (JAIST) Tadao Nakamura (Keio University) Whole Program Data Dependence Profiling to Unveil Parallel Regions in the Dynamic Execution Yukinori Sato (JAIST, JST CREST) Yasushi Inoguchi (JAIST) Tadao Nakamura (Keio University) 2012 IEEE International

More information

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University   Scalable Tools Workshop 7 August 2017 Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University http://hpctoolkit.org Scalable Tools Workshop 7 August 2017 HPCToolkit 1 HPCToolkit Workflow source code compile &

More information

CS527 Software Security

CS527 Software Security Security Policies Purdue University, Spring 2018 Security Policies A policy is a deliberate system of principles to guide decisions and achieve rational outcomes. A policy is a statement of intent, and

More information

Trees. (Trees) Data Structures and Programming Spring / 28

Trees. (Trees) Data Structures and Programming Spring / 28 Trees (Trees) Data Structures and Programming Spring 2018 1 / 28 Trees A tree is a collection of nodes, which can be empty (recursive definition) If not empty, a tree consists of a distinguished node r

More information

Dynamic Access Binary Search Trees

Dynamic Access Binary Search Trees Dynamic Access Binary Search Trees 1 * are self-adjusting binary search trees in which the shape of the tree is changed based upon the accesses performed upon the elements. When an element of a splay tree

More information

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 11

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 11 CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 11 CS 536 Spring 2015 1 Handling Overloaded Declarations Two approaches are popular: 1. Create a single symbol table

More information

Dynamic Access Binary Search Trees

Dynamic Access Binary Search Trees Dynamic Access Binary Search Trees 1 * are self-adjusting binary search trees in which the shape of the tree is changed based upon the accesses performed upon the elements. When an element of a splay tree

More information

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder JAVA PERFORMANCE PR SW2 S18 Dr. Prähofer DI Leopoldseder OUTLINE 1. What is performance? 1. Benchmarking 2. What is Java performance? 1. Interpreter vs JIT 3. Tools to measure performance 4. Memory Performance

More information

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential

More information

Acknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides

Acknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides Garbage Collection Last time Compiling Object-Oriented Languages Today Motivation behind garbage collection Garbage collection basics Garbage collection performance Specific example of using GC in C++

More information

Energy-centric DVFS Controlling Method for Multi-core Platforms

Energy-centric DVFS Controlling Method for Multi-core Platforms Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Q1: /8 Q2: /30 Q3: /30 Q4: /32. Total: /100

Q1: /8 Q2: /30 Q3: /30 Q4: /32. Total: /100 ECE 2035(A) Programming for Hardware/Software Systems Fall 2013 Exam Three November 20 th 2013 Name: Q1: /8 Q2: /30 Q3: /30 Q4: /32 Total: /100 1/10 For functional call related questions, let s assume

More information

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime Runtime The optimized program is ready to run What sorts of facilities are available at runtime Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis token stream Syntactic

More information

Honours/Master/PhD Thesis Projects Supervised by Dr. Yulei Sui

Honours/Master/PhD Thesis Projects Supervised by Dr. Yulei Sui Honours/Master/PhD Thesis Projects Supervised by Dr. Yulei Sui Projects 1 Information flow analysis for mobile applications 2 2 Machine-learning-guide typestate analysis for UAF vulnerabilities 3 3 Preventing

More information

Programming Studio #9 ECE 190

Programming Studio #9 ECE 190 Programming Studio #9 ECE 190 Programming Studio #9 Concepts: Functions review 2D Arrays GDB Announcements EXAM 3 CONFLICT REQUESTS, ON COMPASS, DUE THIS MONDAY 5PM. NO EXTENSIONS, NO EXCEPTIONS. Functions

More information

Dynamic Trace Analysis with Zero-Suppressed BDDs

Dynamic Trace Analysis with Zero-Suppressed BDDs University of Colorado, Boulder CU Scholar Electrical, Computer & Energy Engineering Graduate Theses & Dissertations Electrical, Computer & Energy Engineering Spring 4-1-2011 Dynamic Trace Analysis with

More information

Improving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester

Improving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester Improving Error Checking and Unsafe Optimizations using Software Speculation Kirk Kelsey and Chen Ding University of Rochester Outline Motivation Brief problem statement How speculation can help Our software

More information

ROSE-CIRM Detecting C-Style Errors in UPC Code

ROSE-CIRM Detecting C-Style Errors in UPC Code ROSE-CIRM Detecting C-Style Errors in UPC Code Peter Pirkelbauer 1 Chunhuah Liao 1 Thomas Panas 2 Daniel Quinlan 1 1 2 Microsoft Parallel Data Warehouse This work was funded by the Department of Defense

More information

Fast, precise dynamic checking of types and bounds in C

Fast, precise dynamic checking of types and bounds in C Fast, precise dynamic checking of types and bounds in C Stephen Kell stephen.kell@cl.cam.ac.uk Computer Laboratory University of Cambridge p.1 Tool wanted if (obj >type == OBJ COMMIT) { if (process commit(walker,

More information

Kruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring

Kruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring NDSS 2012 Kruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring Donghai Tian 1,2, Qiang Zeng 2, Dinghao Wu 2, Peng Liu 2 and Changzhen Hu 1 1 Beijing Institute of Technology

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Section 7: Wait/Exit, Address Translation

Section 7: Wait/Exit, Address Translation William Liu October 15, 2014 Contents 1 Wait and Exit 2 1.1 Thinking about what you need to do.............................. 2 1.2 Code................................................ 2 2 Vocabulary 4

More information

CS2141 Software Development using C/C++ Debugging

CS2141 Software Development using C/C++ Debugging CS2141 Software Development using C/C++ Debugging Debugging Tips Examine the most recent change Error likely in, or exposed by, code most recently added Developing code incrementally and testing along

More information

Cpt S 122 Data Structures. Data Structures Trees

Cpt S 122 Data Structures. Data Structures Trees Cpt S 122 Data Structures Data Structures Trees Nirmalya Roy School of Electrical Engineering and Computer Science Washington State University Motivation Trees are one of the most important and extensively

More information

Dynamic Dispatch and Duck Typing. L25: Modern Compiler Design

Dynamic Dispatch and Duck Typing. L25: Modern Compiler Design Dynamic Dispatch and Duck Typing L25: Modern Compiler Design Late Binding Static dispatch (e.g. C function calls) are jumps to specific addresses Object-oriented languages decouple method name from method

More information

My malloc: mylloc and mhysa. Johan Montelius HT2016

My malloc: mylloc and mhysa. Johan Montelius HT2016 1 Introduction My malloc: mylloc and mhysa Johan Montelius HT2016 So this is an experiment where we will implement our own malloc. We will not implement the world s fastest allocator, but it will work

More information

CSC 2400: Computer Systems. Using the Stack for Function Calls

CSC 2400: Computer Systems. Using the Stack for Function Calls CSC 24: Computer Systems Using the Stack for Function Calls Lecture Goals Challenges of supporting functions! Providing information for the called function Function arguments and local variables! Allowing

More information

Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory

Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory Rei Odaira Takuya Nakaike IBM Research Tokyo Thread-Level Speculation (TLS) [Franklin et al., 92] or Speculative Multithreading (SpMT)

More information

Project 0: Implementing a Hash Table

Project 0: Implementing a Hash Table CS: DATA SYSTEMS Project : Implementing a Hash Table CS, Data Systems, Fall Goal and Motivation. The goal of Project is to help you develop (or refresh) basic skills at designing and implementing data

More information

Using Cache Models and Empirical Search in Automatic Tuning of Applications. Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX

Using Cache Models and Empirical Search in Automatic Tuning of Applications. Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Outline Overview of Framework Fine grain control of transformations

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 530A. B+ Trees. Washington University Fall 2013 CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key

More information

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems : A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University

More information

Run-Time Monitoring with Adjustable Overhead Using Dataflow-Guided Filtering

Run-Time Monitoring with Adjustable Overhead Using Dataflow-Guided Filtering Run-Time Monitoring with Adjustable Overhead Using Dataflow-Guided Filtering Daniel Lo, Tao Chen, Mohamed Ismail, and G. Edward Suh Cornell University Ithaca, NY 14850, USA {dl575, tc466, mii5, gs272}@cornell.edu

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Concurrency, Thread. Dongkun Shin, SKKU

Concurrency, Thread. Dongkun Shin, SKKU Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point

More information

Overview. Constructors and destructors Virtual functions Single inheritance Multiple inheritance RTTI Templates Exceptions Operator Overloading

Overview. Constructors and destructors Virtual functions Single inheritance Multiple inheritance RTTI Templates Exceptions Operator Overloading How C++ Works 1 Overview Constructors and destructors Virtual functions Single inheritance Multiple inheritance RTTI Templates Exceptions Operator Overloading Motivation There are lot of myths about C++

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what is your preferred platform for building HFT applications? Why do you build low-latency applications on a GC ed platform? Agenda

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

CIRM - Dynamic Error Detection

CIRM - Dynamic Error Detection CIRM - Dynamic Error Detection Peter Pirkelbauer Center for Applied Scientific Computing (CASC) Lawrence Livermore National Laboratory This work was funded by the Department of Defense and used elements

More information

Running Valgrind on multiple processors: a prototype. Philippe Waroquiers FOSDEM 2015 valgrind devroom

Running Valgrind on multiple processors: a prototype. Philippe Waroquiers FOSDEM 2015 valgrind devroom Running Valgrind on multiple processors: a prototype Philippe Waroquiers FOSDEM 2015 valgrind devroom 1 Valgrind and threads Valgrind runs properly multi-threaded applications But (mostly) runs them using

More information

G Programming Languages - Fall 2012

G Programming Languages - Fall 2012 G22.2110-003 Programming Languages - Fall 2012 Lecture 4 Thomas Wies New York University Review Last week Control Structures Selection Loops Adding Invariants Outline Subprograms Calling Sequences Parameter

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Efficient Memory Shadowing for 64-bit Architectures

Efficient Memory Shadowing for 64-bit Architectures Efficient Memory Shadowing for 64-bit Architectures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Qin Zhao, Derek Bruening,

More information

Process Concepts. CSC400 - Operating Systems. 3. Process Concepts. J. Sumey

Process Concepts. CSC400 - Operating Systems. 3. Process Concepts. J. Sumey CSC400 - Operating Systems 3. Process Concepts J. Sumey Overview Concurrency Processes & Process States Process Accounting Interrupts & Interrupt Processing Interprocess Communication CSC400 - Process

More information

Program Profiling. Xiangyu Zhang

Program Profiling. Xiangyu Zhang Program Profiling Xiangyu Zhang Outline What is profiling. Why profiling. Gprof. Efficient path profiling. Object equality profiling 2 What is Profiling Tracing is lossless, recording every detail of a

More information

Run-time Environments - 3

Run-time Environments - 3 Run-time Environments - 3 Y.N. Srikant Computer Science and Automation Indian Institute of Science Bangalore 560 012 NPTEL Course on Principles of Compiler Design Outline of the Lecture n What is run-time

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Multi-threaded Queries. Intra-Query Parallelism in LLVM

Multi-threaded Queries. Intra-Query Parallelism in LLVM Multi-threaded Queries Intra-Query Parallelism in LLVM Multithreaded Queries Intra-Query Parallelism in LLVM Yang Liu Tianqi Wu Hao Li Interpreted vs Compiled (LLVM) Interpreted vs Compiled (LLVM) Interpreted

More information

Run-time Environments

Run-time Environments Run-time Environments Status We have so far covered the front-end phases Lexical analysis Parsing Semantic analysis Next come the back-end phases Code generation Optimization Register allocation Instruction

More information

ELP. Effektive Laufzeitunterstützung für zukünftige Programmierstandards. Speaker: Tim Cramer, RWTH Aachen University

ELP. Effektive Laufzeitunterstützung für zukünftige Programmierstandards. Speaker: Tim Cramer, RWTH Aachen University ELP Effektive Laufzeitunterstützung für zukünftige Programmierstandards Agenda ELP Project Goals ELP Achievements Remaining Steps ELP Project Goals Goals of ELP: Improve programmer productivity By influencing

More information

Run-time Environments

Run-time Environments Run-time Environments Status We have so far covered the front-end phases Lexical analysis Parsing Semantic analysis Next come the back-end phases Code generation Optimization Register allocation Instruction

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

Fast dynamic program analysis Race detection. Konstantin Serebryany May

Fast dynamic program analysis Race detection. Konstantin Serebryany May Fast dynamic program analysis Race detection Konstantin Serebryany May 20 2011 Agenda Dynamic program analysis Race detection: theory ThreadSanitizer: race detector Making ThreadSanitizer

More information

Profiling & Optimization

Profiling & Optimization Lecture 18 Sources of Game Performance Issues? 2 Avoid Premature Optimization Novice developers rely on ad hoc optimization Make private data public Force function inlining Decrease code modularity removes

More information

Run-Time Environments/Garbage Collection

Run-Time Environments/Garbage Collection Run-Time Environments/Garbage Collection Department of Computer Science, Faculty of ICT January 5, 2014 Introduction Compilers need to be aware of the run-time environment in which their compiled programs

More information

Chapter 3: Processes. Operating System Concepts 8th Edition,

Chapter 3: Processes. Operating System Concepts 8th Edition, Chapter 3: Processes, Administrivia Friday: lab day. For Monday: Read Chapter 4. Written assignment due Wednesday, Feb. 25 see web site. 3.2 Outline What is a process? How is a process represented? Process

More information

DataCollider: Effective Data-Race Detection for the Kernel

DataCollider: Effective Data-Race Detection for the Kernel DataCollider: Effective Data-Race Detection for the Kernel John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk Microsoft Windows and Microsoft Research {jerick, madanm, sburckha, kirko}@microsoft.com

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

DEBUGGING: DYNAMIC PROGRAM ANALYSIS

DEBUGGING: DYNAMIC PROGRAM ANALYSIS DEBUGGING: DYNAMIC PROGRAM ANALYSIS WS 2017/2018 Martina Seidl Institute for Formal Models and Verification System Invariants properties of a program must hold over the entire run: integrity of data no

More information

CS 137 Part 8. Merge Sort, Quick Sort, Binary Search. November 20th, 2017

CS 137 Part 8. Merge Sort, Quick Sort, Binary Search. November 20th, 2017 CS 137 Part 8 Merge Sort, Quick Sort, Binary Search November 20th, 2017 This Week We re going to see two more complicated sorting algorithms that will be our first introduction to O(n log n) sorting algorithms.

More information

Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing

Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Shigeki Akiyama, Kenjiro Taura The University of Tokyo June 17, 2015 HPDC 15 Lightweight Threads Lightweight threads enable

More information

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana Introduction to Indexing 2 Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana Indexed Sequential Access Method We have seen that too small or too large an index (in other words too few or too

More information

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale* jliffl2@illinois.edu,

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Chapter 12: Indexing and Hashing (Cnt(

Chapter 12: Indexing and Hashing (Cnt( Chapter 12: Indexing and Hashing (Cnt( Cnt.) Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition

More information

Introducing the Cray XMT. Petr Konecny May 4 th 2007

Introducing the Cray XMT. Petr Konecny May 4 th 2007 Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions

More information

Efficient Data Race Detection for Unified Parallel C

Efficient Data Race Detection for Unified Parallel C P A R A L L E L C O M P U T I N G L A B O R A T O R Y Efficient Data Race Detection for Unified Parallel C ParLab Winter Retreat 1/14/2011" Costin Iancu, LBL" Nick Jalbert, UC Berkeley" Chang-Seo Park,

More information

NDSeq: Runtime Checking for Nondeterministic Sequential Specs of Parallel Correctness

NDSeq: Runtime Checking for Nondeterministic Sequential Specs of Parallel Correctness EECS Electrical Engineering and Computer Sciences P A R A L L E L C O M P U T I N G L A B O R A T O R Y NDSeq: Runtime Checking for Nondeterministic Sequential Specs of Parallel Correctness Jacob Burnim,

More information

Dynamic Race Detection with LLVM Compiler

Dynamic Race Detection with LLVM Compiler Dynamic Race Detection with LLVM Compiler Compile-time instrumentation for ThreadSanitizer Konstantin Serebryany, Alexander Potapenko, Timur Iskhodzhanov, and Dmitriy Vyukov OOO Google, 7 Balchug st.,

More information

Chapter 3: Process-Concept. Operating System Concepts 8 th Edition,

Chapter 3: Process-Concept. Operating System Concepts 8 th Edition, Chapter 3: Process-Concept, Silberschatz, Galvin and Gagne 2009 Chapter 3: Process-Concept Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Silberschatz, Galvin

More information

Custom Memory Allocation For Free

Custom Memory Allocation For Free Custom Memory Allocation For Free Alin Jula and Lawrence Rauchwerger Texas A&M University November 4, 2006 Motivation Memory Allocation Matters Goals Speed Data locality Fragmentation Current State of

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Compilers. 8. Run-time Support. Laszlo Böszörmenyi Compilers Run-time - 1

Compilers. 8. Run-time Support. Laszlo Böszörmenyi Compilers Run-time - 1 Compilers 8. Run-time Support Laszlo Böszörmenyi Compilers Run-time - 1 Run-Time Environment A compiler needs an abstract model of the runtime environment of the compiled code It must generate code for

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

CSE409, Rob Johnson, Alin Tomescu, November 11 th, 2011 Buffer overflow defenses

CSE409, Rob Johnson,   Alin Tomescu, November 11 th, 2011 Buffer overflow defenses Buffer overflow defenses There are two categories of buffer-overflow defenses: - Make it hard for the attacker to exploit buffer overflow o Address space layout randomization o Model checking to catch

More information

Mobilizing the Micro-Ops: Exploiting Context Sensitive Decoding for Security and Energy Efficiency

Mobilizing the Micro-Ops: Exploiting Context Sensitive Decoding for Security and Energy Efficiency Mobilizing the Micro-Ops: Exploiting Context Sensitive Decoding for Security and Energy Efficiency Mohammadkazem Taram, Ashish Venkat, Dean Tullsen University of California, San Diego The Tension between

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what s your preferred platform for building HFT applications? Why would you build low-latency applications on a GC ed platform? Some

More information

Dynamic Binary Instrumentation: Introduction to Pin

Dynamic Binary Instrumentation: Introduction to Pin Dynamic Binary Instrumentation: Introduction to Pin Instrumentation A technique that injects instrumentation code into a binary to collect run-time information 2 Instrumentation A technique that injects

More information

Garbage Collection. CS 351: Systems Programming Michael Saelee

Garbage Collection. CS 351: Systems Programming Michael Saelee Garbage Collection CS 351: Systems Programming Michael Saelee = automatic deallocation i.e., malloc, but no free! system must track status of allocated blocks free (and potentially reuse)

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer ETH Zurich Enrico Kravina ETH Zurich Thomas R. Gross ETH Zurich Abstract Memory tracing (executing additional code for every memory access of a program) is a powerful

More information

Stream Computing using Brook+

Stream Computing using Brook+ Stream Computing using Brook+ School of Electrical Engineering and Computer Science University of Central Florida Slides courtesy of P. Bhaniramka Outline Overview of Brook+ Brook+ Software Architecture

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Runtime Defenses against Memory Corruption

Runtime Defenses against Memory Corruption CS 380S Runtime Defenses against Memory Corruption Vitaly Shmatikov slide 1 Reading Assignment Cowan et al. Buffer overflows: Attacks and defenses for the vulnerability of the decade (DISCEX 2000). Avijit,

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c/su06 CS61C : Machine Structures Lecture #6: Memory Management CS 61C L06 Memory Management (1) 2006-07-05 Andy Carle Memory Management (1/2) Variable declaration allocates

More information

Lecture 03 Bits, Bytes and Data Types

Lecture 03 Bits, Bytes and Data Types Lecture 03 Bits, Bytes and Data Types Computer Languages A computer language is a language that is used to communicate with a machine. Like all languages, computer languages have syntax (form) and semantics

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1 Memory Management Disclaimer: some slides are adopted from book authors slides with permission 1 CPU management Roadmap Process, thread, synchronization, scheduling Memory management Virtual memory Disk

More information

Tree-Structured Indexes. Chapter 10

Tree-Structured Indexes. Chapter 10 Tree-Structured Indexes Chapter 10 1 Introduction As for any index, 3 alternatives for data entries k*: Data record with key value k 25, [n1,v1,k1,25] 25,

More information