Implementing and Evaluating Nested Parallel Transactions in STM. Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun Stanford University

Similar documents
Improving the Practicality of Transactional Memory

Potential violations of Serializability: Example 1

Flexible Architecture Research Machine (FARM)

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Distributed Systems (ICE 601) Transactions & Concurrency Control - Part1

EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics

Scalable Software Transactional Memory for Chapel High-Productivity Language

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems

Scheduling Transactions in Replicated Distributed Transactional Memory

Architectural Semantics for Practical Transactional Memory

FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures

6.852: Distributed Algorithms Fall, Class 20

Chí Cao Minh 28 May 2008

Synchronization Part 2. REK s adaptation of Claypool s adaptation oftanenbaum s Distributed Systems Chapter 5 and Silberschatz Chapter 17

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Multi-Core Computing with Transactional Memory. Johannes Schneider and Prof. Roger Wattenhofer

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

CS 347 Parallel and Distributed Data Processing

The Common Case Transactional Behavior of Multithreaded Programs

DBT Tool. DBT Framework

Tradeoffs in Transactional Memory Virtualization

An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees

ABORTING CONFLICTING TRANSACTIONS IN AN STM

Lecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)

Summary: Open Questions:

CS 347 Parallel and Distributed Data Processing

Goal of Concurrency Control. Concurrency Control. Example. Solution 1. Solution 2. Solution 3

Transactional Memory

Early Release: Friend or Foe?

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION

Control. CS432: Distributed Systems Spring 2017

Performance Evaluation of Adaptivity in STM. Mathias Payer and Thomas R. Gross Department of Computer Science, ETH Zürich

Conflict Detection and Validation Strategies for Software Transactional Memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Open Nesting in Software Transactional Memory. Paper by: Ni et al. Presented by: Matthew McGill

6 Transactional Memory. Robert Mullins

Software Speculative Multithreading for Java

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei

An Update on Haskell H/STM 1

DESIGNING AN EFFECTIVE HYBRID TRANSACTIONAL MEMORY SYSTEM

Parallelizing SPECjbb2000 with Transactional Memory

Concurrency Control. Chapter 17. Comp 521 Files and Databases Spring

Lecture 21 Concurrency Control Part 1

Concurrency control CS 417. Distributed Systems CS 417

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Intro to Transactions

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lowering the Overhead of Nonblocking Software Transactional Memory

LogTM: Log-Based Transactional Memory

Lecture: Transactional Memory. Topics: TM implementations

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Concurrency Control. Chapter 17. Comp 521 Files and Databases Fall

Parallelizing SPECjbb2000 with Transactional Memory

The Runtime Abort Graph and its Application to Software Transactional Memory Optimization

Work Report: Lessons learned on RTM

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

AN EXECUTION MODEL FOR FINE-GRAIN NESTED SPECULATIVE PARALLELISM

Atomicity via Source-to-Source Translation

DYNAMO: AMAZON S HIGHLY AVAILABLE KEY-VALUE STORE. Presented by Byungjin Jun

Transactional Memory. Concurrency unlocked Programming. Bingsheng Wang TM Operating Systems

Summary: Issues / Open Questions:

Distributed Systems. Lec 11: Consistency Models. Slide acks: Jinyang Li, Robert Morris

Nested Parallelism in Transactional Memory

<Insert Picture Here> Revisting Condition Variables and Transactions

Software transactional memory

Mohamed M. Saad & Binoy Ravindran

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Advance Operating Systems (CS202) Locks Discussion

Testing Patterns for Software Transactional Memory Engines

EE/CSCI 451: Parallel and Distributed Computation

On Improving Transactional Memory: Optimistic Transactional Boosting, Remote Execution, and Hybrid Transactions

Performance Improvement via Always-Abort HTM

3. Concurrency Control for Transactions Part One

Transaction Management and Concurrency Control. Chapter 16, 17

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Thread-Safe Dynamic Binary Translation using Transactional Memory

Consistency. CS 475, Spring 2018 Concurrent & Distributed Systems

On Open Nesting in Distributed Transactional Memory

INTRODUCTION. Hybrid Transactional Memory. Transactional Memory. Problems with Transactional Memory Problems

Irrevocable Transactions and their Applications

Rethinking Serializable Multi-version Concurrency Control. Jose Faleiro and Daniel Abadi Yale University

Transactional Interference-less Balanced Tree

Transactional Memory. Yaohua Li and Siming Chen. Yaohua Li and Siming Chen Transactional Memory 1 / 41

Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation

The masses will have to learn concurrent programming. But using locks and shared memory is hard and messy

Heckaton. SQL Server's Memory Optimized OLTP Engine

Implementing caches. Example. Client. N. America. Client System + Caches. Asia. Client. Africa. Client. Client. Client. Client. Client.

Chapter 13 : Concurrency Control

! Part I: Introduction. ! Part II: Obstruction-free STMs. ! DSTM: an obstruction-free STM design. ! FSTM: a lock-free STM design

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 28 November 2014

Advanced Database Systems

Serializability of Transactions in Software Transactional Memory

Monitors; Software Transactional Memory

Synchronization Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

CS 475. Process = Address space + one thread of control Concurrent program = multiple threads of control

Remote Transaction Commit: Centralizing Software Transactional Memory Commits

Transcription:

Implementing and Evaluating Nested Parallel Transactions in STM Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun Stanford University

Introduction // Parallelize the outer loop for(i=0;i<numcustomer;i++){ atomic{ // Can we parallelize the inner loop? for(j=0;j<numorders;j++) processorder(i,j, ); }} Transactional Memory (TM) simplifies parallel programming Atomic and isolated execution of transactions Current practice: Most TMs do not support nested parallelism Nested parallelism in TM is becoming more important To fully utilize the increasing number of cores To integrate well with programming models (e.g., OpenMP)

Previous Work: NP in STM [ECOOP 09] NePaLTM with practical support for nested parallelism Serialize nested transactions [PPoPP 08] CWSTM that supports nested parallel transactions With the lowest upper bound of time complexity of TM barriers No (actual) implementation / (quantitative) evaluation [PPoPP 10] a practical, concrete implementation of CWSTM With depth-independent time complexity of TM barriers Use rather complicated data structures such as concurrent stack Remaining question: Extend a timestamp-based, eager-versioning STM To support nested parallel transactions

Contributions Propose NesTM with support for nested parallel transactions Extend a timestamp-based, eager-versioning STM Discuss complications of concurrent nesting Describe subtle correctness issues Motivate further research on proving / verifying nested STMs Quantify NesTM across different use scenarios Admittedly, substantial runtime overheads to nested transactions E.g., Repeated read-set validation Motivate further research on performance optimizations

Outline Introduction Background NesTM Algorithm Complications of Nesting Evaluation Conclusions

Background: Semantics of Nesting Definitions Transactional hierarchy has a tree structure Ancestors(T) = Parent(T) Ancestors(Parent(T)) Readers(o): a set of active transactions that read o Writers(o): a set of active transactions that wrote to o Conflicts T reads from o : R/W conflict If there exists T such that T writers(o), T T, and T ancestors(t) T writes to o : R/W or W/W conflict If there exists T such that T readers(o) writers(o), T T, and T ancestors(t)

Background: Example of Nesting T1.1 ld A T1.1 ld A T1 ld B T1.2 st A T2 st A T1 and T2 are top-level T1.1, T1.2: T1 s children T=6: R/W conflict T2 writes to A T1.1 Readers(A) T1.1 Ances(T2) T=8: No conflict T1.2 writes to A Readers(A)=Writers(A)= Serialization order T2 T1

NesTM Overview Extend an eager data-versioning STM In-place update No need to look up parent s write buffer Useful property: Once acquire ownership, keep it until commit / abort Global data structures A global version clock (GC) A set of version-owner locks (volocks): T LSBs: Owner s TID / Remaining bits: Version Number Transaction descriptor Read-version (RV): GC value sampled when the txn starts R/W sets: Implemented using a doubly linked list Pointer to parent s transaction descriptor Commit-lock: to synchronize concurrent commits of children

TxLoad TxLoad(Self,addr){ vl=getvolock(addr); owner=getowner(vl); if(owner==self){ // Read data } } else if(isances(self,owner)){ cv=getts(vl); if(cv>self.rv){ // Abort } else{ // Read data } } else{ // Abort }} If the owner (of the memory object) is the transaction itself Read the memory value Else if the owner is an ancestor of the transaction If the version number is newer than the transaction s RV Abort Else Read the memory value Else Abort

TxStore TxStore(Self,addr,val){ owner=getowner(addr); if(owner==self){ // Write data } else if(isances(self,owner)){ if(atomicacqownership(self,owner,addr)==success){ if(validatereaders(self,owner,addr)==success){ // Write data } else{ // Abort } } else { // Abort }} else { // Abort }} If the owner is the transaction itself Write Else if the owner is an ancestor of the transaction If the atomic acquisition of the ownership is successful If the validation of all the readers in the hierarchy is successful Write Else Abort Else Abort Else Abort

TxCommit TxCommit(Self){ wv=incrementgc(); for each e in Self.RS { // Perform the same check in TxLoad // If fails, the transaction aborts } mergerwsetstoparent(self); for each e in Self.WS { // Increment version number using wv and // transfer ownership to parent } } Validate every memory object in RS Using the same conditions checked in TxLoad If fails, abort Merge R/W sets to the parent Linking the pointers Loss of temporal locality on these entries Validation / Merging is protected by parent s commit-lock To address the issue with non-atomic commit (See the paper) Increment version number / transfer ownership for the objects in WS

TxAbort TxAbort(Self){ for each e in Self.WS { // Restore the memory value to the previous value } for each e in Self.WS { // Restore the volock value to the previous value } // Retry the transaction } For every memory object in WS Restore the memory value to the previous value For every memory object in WS Restore the volock value to the previous value Refer to the paper for the invalid read problem Retry the transaction

Outline Introduction Background NesTM Algorithm Complications of Nesting Evaluation Conclusions

Complications of Nesting Subtle correctness issues discovered while developing NesTM Invalid read, non-atomic commit, zombie transactions Current status: No hand proof of correctness/liveness of NesTM Model checking: ChkTM [ICECCS 10] Checked correctness with a very small configuration Thread configuration: [1, 2, 1.1, 1.2] / Two memory op s per txn Failed to check with larger configurations due to large state space Motivate reduction theorem / partial order reduction techniques Random tests: Using the implemented NesTM code Tested with larger configurations (e.g., nesting depth of 3)

Evaluating NesTM Q1: Runtime overhead for top-level parallelism Used STAMP applications (Baseline STM vs. NesTM) Maximum performance difference is ~25% Due to the extra code in NesTM barriers Q2: Performance of nested transactions More in the following slides Q3: Using nested parallelism to improve performance Used a u-benchmark based on two-level hash tables If single-level parallelism is limited (e.g., frequent conflicts) Exploiting nested parallelism can be beneficial

Q2: Performance of Nested Txns Flat version // Parallelize this loop for(i=0;i<numops;i+=c){ atomic{ for(j=0;j<c;j++){ accessht(i,j, );} } } Nested version (N1) atomic{ // Parallelize this loop for(i=0;i<numops;i+=c){ atomic{ for(j=0;j<c;j++){ accessht(i,j, );} }} } hashtable: perform operations on a concurrent hash table Two types of operations: Look-up (reads) / Insert (reads/writes) Subsumed: Sequentially perform all the operations in a single txn Emulate an STM that flattens and serializes nested transactions Flat: Concurrently perform operations using top-level txns Nested: Repeatedly add outer-level transactions N1, N2, and N3 versions

Q2: Performance of Nested Txns Scale up to 16 threads (N1 with 16 threads 3x faster) Performance issues Non-parallelizable, linearly-increasing overheads E.g., Repeated read-set validation More expensive read/write barriers (loss of temporal locality) Contention on commit-lock (Many nested txns simultaneously commit)

Conclusion Propose NesTM with support for nested parallel transactions Extend a timestamp-based, eager-versioning STM Discuss complications of concurrent nesting Describe subtle correctness issues Motivate further research on proving / verifying nested STMs Quantify NesTM across different use scenarios Admittedly, substantial runtime overheads to nested transactions E.g., Repeated read-set validation Motivate further research on performance optimizations Software: more efficient algorithm / implementation Hardware: cost-effective hardware acceleration [ICS 10]