DBT Tool. DBT Framework

Similar documents
System Challenges and Opportunities for Transactional Memory

Thread-Safe Dynamic Binary Translation using Transactional Memory

The Common Case Transactional Behavior of Multithreaded Programs

Tradeoffs in Transactional Memory Virtualization

Improving the Practicality of Transactional Memory

Chí Cao Minh 28 May 2008

SYSTEM CHALLENGES AND OPPORTUNITIES FOR TRANSACTIONAL MEMORY

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

6 Transactional Memory. Robert Mullins

The Common Case Transactional Behavior of Multithreaded Programs

Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor

Parallelizing SPECjbb2000 with Transactional Memory

Parallelizing SPECjbb2000 with Transactional Memory

Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor

Summary: Open Questions:

A Dynamic Instrumentation Approach to Software Transactional Memory. Marek Olszewski

An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees

Transactional Memory. Lecture 19: Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

1 Publishable Summary

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

LogTM: Log-Based Transactional Memory

Architectural Semantics for Practical Transactional Memory

Lock vs. Lock-free Memory Project proposal

EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics

Deterministic Replay and Data Race Detection for Multithreaded Programs

Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory

Scalable Software Transactional Memory for Chapel High-Productivity Language

Scheduling Transactions in Replicated Distributed Transactional Memory

NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

Transactional Memory. Lecture 18: Parallel Computer Architecture and Programming CMU /15-618, Spring 2017

Implementing and Evaluating Nested Parallel Transactions in STM. Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun Stanford University

Transactional Memory

Reminder from last time

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Log-Based Transactional Memory

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Potential violations of Serializability: Example 1

Transactional Memory. Yaohua Li and Siming Chen. Yaohua Li and Siming Chen Transactional Memory 1 / 41

DESIGNING AN EFFECTIVE HYBRID TRANSACTIONAL MEMORY SYSTEM

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Transaction Memory for Existing Programs Michael M. Swift Haris Volos, Andres Tack, Shan Lu, Adam Welc * University of Wisconsin-Madison, *Intel

Atomic Transac1ons. Atomic Transactions. Q1: What if network fails before deposit? Q2: What if sequence is interrupted by another sequence?

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically

Outline. Database Tuning. Ideal Transaction. Concurrency Tuning Goals. Concurrency Tuning. Nikolaus Augsten. Lock Tuning. Unit 8 WS 2013/2014

Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

Real-World Buffer Overflow Protection in User & Kernel Space

Hardware Transactional Memory Architecture and Emulation

Light64: Ligh support for data ra. Darko Marinov, Josep Torrellas. a.cs.uiuc.edu

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

Transactional Memory Coherence and Consistency

Work Report: Lessons learned on RTM

XXXX Leveraging Transactional Execution for Memory Consistency Model Emulation

Performance Improvement via Always-Abort HTM

An Update on Haskell H/STM 1

Implementing Rollback: Copy. 2PL: Rollback. Implementing Rollback: Undo. 8L for Part IB Handout 4. Concurrent Systems. Dr.

Development of Technique for Healing Data Races based on Software Transactional Memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Graph-based protocols are an alternative to two-phase locking Impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items.

Using Software Transactional Memory In Interrupt-Driven Systems

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Do you have to reproduce the bug on the first replay attempt?

EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION

Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation

A Scalable, Non-blocking Approach to Transactional Memory

New Programming Abstractions for Concurrency in GCC 4.7. Torvald Riegel Red Hat 12/04/05

Embedded System Programming

Early Release: Friend or Foe?

VMM Emulation of Intel Hardware Transactional Memory

DMP Deterministic Shared Memory Multiprocessing

Advances in Data Management Transaction Management A.Poulovassilis

HydraVM: Mohamed M. Saad Mohamed Mohamedin, and Binoy Ravindran. Hot Topics in Parallelism (HotPar '12), Berkeley, CA

Chimera: Hybrid Program Analysis for Determinism

Intel Thread Building Blocks, Part IV

Runtime Monitoring on Multicores via OASES

Yuxi Chen, Shu Wang, Shan Lu, and Karthikeyan Sankaralingam *

6.852: Distributed Algorithms Fall, Class 20

Speculative Synchronization

Program Vulnerability Analysis Using DBI

Databases - Transactions II. (GF Royle, N Spadaccini ) Databases - Transactions II 1 / 22

Hardware Transactional Memory. Daniel Schwartz-Narbonne

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

OS Support for Virtualizing Hardware Transactional Memory

Weak Levels of Consistency

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Flexible Architecture Research Machine (FARM)

Conflict Detection and Validation Strategies for Software Transactional Memory

Heuristics for Profile-driven Method- level Speculative Parallelization

Massimiliano Ghilardi

Transcription:

Thread-Safe Dynamic Binary Translation using Transactional Memory JaeWoong Chung,, Michael Dalton, Hari Kannan, Christos Kozyrakis Computer Systems Laboratory Stanford University http://csl.stanford.edu

Dynamic Binary Translation (DBT) DBT Short code sequence is translated in run-time PIN, Valgrind, DynamoRIO, StarDBT, etc Original Binary DBT Tool DBT Framework Translated Binary DBT use cases Translation on new target architecture JIT optimizations in virtual machines Binary instrumentation Profiling, security, debugging, 2

Example: Dynamic Information Flow Tracking (DIFT) t = XX ; // untrusted data from network taint(t) = 1;. swap t, u1; swap taint(t), taint(u1); u2 = u1; taint(u2) = taint(u1); Variables t u1 u2 XX XX Taint bits 1 1 Untrusted data are tracked throughout execution A taint bit per memory byte is used to track untrusted data. Security policy uses the taint bit. E.g. untrusted data should not be used as syscall argument. Dynamic instrumentation to propagate and check taint bit. 3

DBT & Multithreading Multithreaded executables as input Challenges Atomicity of target instructions e.g. compare-and-exchange Atomicity of additional instrumentation Races in accesses to application data & DBT metadata Easy but unsatisfactory solutions Do not allow multithreaded programs (StarDBT) Serialize multithreaded execution (Valgrind) 4

DIFT Example: MetaData Race Security Breach User code uses atomic instructions. After instrumentation, there are races on taint bits. Thread 1 Thread2 swap t, u1; swap taint(t), taint(u1); u2 = u1; taint(u2) = taint(u1); t u1 u2 Variables Taint bits XX 1 XX 5

Idea Can We Fix It with Locks? Enclose access to data and associated metadata, within a locked region. Problems Coarse-grained locks performance degradation Fine-grained locks locking overhead, convoying, limited scope of DBT optimizations Lock nesting between app & DBT locks potential deadlock Tool developers should be a feature + multithreading experts. 6

Transactional Memory Atomic and isolated execution of a group of instructions All or no instructions are executed. Intermediate results are not seen by other transactions. Programmer A transaction encloses a group of instructions. The transaction is executed sequentially with the other transactions and non-transactional instructions. TM system Parallel transaction execution. Register checkpoint, data versioning, conflict detection, rollback. Hardware, software, or hybrid TM implementation. 7

Idea Transaction for DBT DBT instruments a transaction to enclose accesses to (data, metadata) within the transaction boundary. Thread 1 TX_Begin swap t, u1; swap taint(t), taint(u1); TX_End Thread2 TX_Begin u2 = u1; taint(u2) = taint(u1); TX_End Advantages Atomic execution High performance through optimistic concurrency Support for nested transactions 8

Yes, it fixes the problem. But DBT transaction per instruction is heavy. User locks are nested with DBT transactions. User transactions overlap partially with DBT transactions. There will be I/O operations within DBT transactions. User-coded conditional synchronization may be tricky. Transactions are not free. 9

Granularity of Transaction Instrumentation Per instruction High overhead of executing TX_Begin and TX_End Limited scope for DBT optimizations Per basic block Amortizing the TX_Begin and TX_End overhead Easy to match TX_Begin and TX_End Per trace Further amortization of the overhead Potentially high transaction conflict Profile-based sizing Optimize transaction size based on transaction abort ratio 10

Interaction with Application Code (1) User lock s semantics should be preserved regardless of DBT transactions. If transaction locked region, fine. To TM, lock variables are just shared variables. If transaction locked region, fine. Transactions are executed in critical sections protected with locks. If partially overlapped, split the DBT transaction. User transactions may partially overlap with DBT transactions. If fully nested, fine. Either true transaction nesting or subsumption. If partially overlapping, split the DBT transaction. 11

Interaction with Application Code (2) I/O operations are not rolled back. Terminate the DBT transaction. Typically, they work as barriers in DBT s optimization. Conditional synchronization may cause live-lock. lock. Re-optimize the code to have a transaction per basic block. Initially, done1 = done2 = false Thread 1 Thread 2 2 3 TX_Begin while(!done2); done1 = true; TX_Begin done2 = true; while(!done1); 1 4 TX_End TX_End 12

Evaluation Environment DBT framework PIN v2.0 with multithreading support DIFT as a PIN tool example Execution environment x86 server with 4 dual-core processors Software TM system Multithreaded applications 6 from SPLASH 3 from SPECOmp 13

Baseline Performance Results ized Overhead (%) Normali 60% 50% 40% 30% 20% 10% 0% Barnes Equake Fmm Radiosity Radix Swim Tomcatv Water Waterspatial 41 % overhead on the average Transaction at the DBT trace granularity 14

Transaction Overheads Transaction begin/end Register checkpoint Initializing and cleaning TM metadata Per memory access Tracking the read-set & write-set Detecting conflicts Data versioning Transaction abort Applying logs and restarting the transaction In our tests, 0.03% of transactions abort 15

Transaction Begin/End Overhead alized Overhead (%) Norm 100% 90% 80% 70% 60% 50% 40% 30% 1 2 4 8 Barnes Equake Fmm Radix Number of basic blocks per transaction Tomcatv Transaction Sizing Longer TX amortizes the TX_Begin/End overhead. 16

Per Memory Access Overhead Instrumentation of software TM barrier TX_Begin read_barrier(t); write_barrier(u); u = t; read_barrier(taint(t)); write_barrier(taint(u)); taint(u) = taint(t); TX_end What happens in the barrier? Conflict detection by recording addresses Observation 1 : needed only for shared variables Data versioning by logging old values Observation 2 : not needed for stack variables 17

Software Transaction Optimization (1) Categorization of memory access types Data Versioning Y In stack? N Conflict Detection Y After TX_Begin? N Y Private? N IDEMPOTENT STACK STACK PRIVATE Y W/ lock? N BENIGN RACE SHARED 18

Software Transaction Optimization (2) lized Overhead (%) 60% 50% 40% 30% 20% Baseline STACK Norma 10% 0% barnes equake fmm radiosity radix swim tomcatv water waterspatial BENIGN RACE Average runtime overhead 36% with STACK 34% with BENIGN RACE 19

Hardware Acceleration (1) With software optimization, the overhead is about 35%. Emulating 3 types of hardware acceleration Cycles spent in transaction violation is under 0.03%. STM+ HybridTM HTM Register Conflict Data Checkpointing Detection Versioning HW (single-cycle) HW (single-cycle) HW (single-cycle) SW read-set / write-set HW Signatures HW read-set / write-set SW Undo-log SW Undo-log HW Undo-log 20

Hardware Acceleration (2) Norm mailized Overhead (%) 60% 50% 40% 30% 20% 10% 0% Barnes Equake FMM Radiosity Radix Swim Tomcatv Water Waterspatial STM STM+ HyTM HTM Overhead reduction 28% with STM+, 12% with HybridTM, and 6% with HTM Notice the diminishing return 21

Conclusion Multi-threaded threaded executables are a challenge for DBT in the era of multi-core. Races on metadata access Use transactions for multi-threaded threaded translated code. 41% overhead on average With software optimization, the overhead is around 35%. Further software optimization may be possible. Hardware acceleration reduces the overhead down to 6%. Remember the diminishing return. 22