Scheduling Transactions in Replicated Distributed Transactional Memory

Scheduling Transactions in Replicated Distributed Transactional Memory Junwhan Kim and Binoy Ravindran Virginia Tech USA {junwhan,binoy}@vt.edu CCGrid 2013

Concurrency control on chip multiprocessors significantly affects performance (and programmability). Performance is improved by exposing greater concurrency. Amdahl's law: the relationship between sequential execution time and the loss of speedup is not linear. [Plot: speedup on an 8-core Sun T2000 Niagara.]

Lock-based concurrency control has serious drawbacks. Coarse-grained locking: simple, but no concurrency.

Fine-grained locking is better: excellent performance, but poor programmability. Lock problems don't go away! Deadlocks, livelocks, lock convoying, priority inversion, ... The most significant difficulty: composition.

Lock-free synchronization overcomes some of these difficulties, but requires intricate lock-free retry-loop code.

Transactional memory: like database transactions, with ACI properties (no D). Easier to program. Composable. First HTM, then STM, later HyTM. M. Herlihy and J. E. B. Moss (1993). Transactional memory: Architectural support for lock-free data structures. ISCA, pp. 289-300. N. Shavit and D. Touitou (1995). Software Transactional Memory. PODC, pp. 204-213.

Optimistic execution yields performance gains with the simplicity of coarse-grained locking, but it is no silver bullet. [Chart: execution time vs. number of threads for coarse-grained locking, fine-grained locking, and STM.] Remaining problem cases: high data dependencies, irrevocable operations, interaction between transactions and non-transactions, conditional waiting. E.g., Intel's C/C++ run-time-system STM (B. Saha et al. (2006). McRT-STM: A High Performance Software Transactional Memory. ACM PPoPP).

Three key mechanisms are needed to create the illusion of atomicity; the first two are versioning and conflict detection. Example: T0 runs atomic{ x = x + y; } while T1 runs atomic{ x = x + y; } and atomic{ x = x / 25; }. Versioning: where is the new x stored until commit? Eager: store the new x in memory and the old value in an undo log. Lazy: store the new x in a write buffer. Conflict detection: how are conflicts between T0 and T1 detected? Record memory locations read in a read set and memory locations written in a write set; a conflict exists if one transaction's read or write set intersects the other's write set.
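To make the read/write-set bookkeeping concrete, here is a minimal Java sketch of lazy versioning with a write buffer and set-intersection conflict detection; all class and method names are illustrative assumptions, not taken from any particular STM.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of lazy versioning plus read/write-set conflict detection.
// Names are illustrative; real STMs add versioned locks, validation, etc.
class TxContext {
    final Set<String> readSet = new HashSet<>();              // locations read
    final Map<String, Integer> writeBuffer = new HashMap<>(); // lazy versioning: new values buffered until commit

    int read(String addr, Map<String, Integer> memory) {
        readSet.add(addr);
        // Read-your-own-writes from the buffer, otherwise from shared memory.
        return writeBuffer.containsKey(addr) ? writeBuffer.get(addr) : memory.getOrDefault(addr, 0);
    }

    void write(String addr, int value) {
        writeBuffer.put(addr, value);                          // old value stays in memory until commit
    }

    // Conflict if this transaction's read or write set intersects the other's write set.
    boolean conflictsWith(TxContext other) {
        for (String addr : other.writeBuffer.keySet()) {
            if (readSet.contains(addr) || writeBuffer.containsKey(addr)) return true;
        }
        return false;
    }

    void commit(Map<String, Integer> memory) {
        memory.putAll(writeBuffer);                            // publish buffered writes (single-threaded sketch)
    }
}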

The third mechanism is contention management. Example: T0 runs x = x + y; concurrently with T1's x = x / 25;. Which transaction should be aborted? Greedy: favor the transaction with the earlier start time. Karma: favor the transaction that has invested more work, keeping its priority across aborts.
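A small Java sketch of the Greedy policy just mentioned (on a conflict, the transaction with the later start time is aborted); the Transaction and GreedyContentionManager names are hypothetical.

// Sketch of a Greedy contention manager: on conflict, the transaction with the
// earlier start timestamp wins and the other is aborted. Types are illustrative.
class Transaction {
    final long startTime;   // e.g., taken from a shared clock when the transaction begins
    Transaction(long startTime) { this.startTime = startTime; }
}

class GreedyContentionManager {
    /** Returns the transaction that should be aborted to resolve the conflict. */
    Transaction resolve(Transaction attacker, Transaction victimCandidate) {
        // Greedy: the older transaction (earlier start time) has priority.
        return (attacker.startTime < victimCandidate.startTime) ? victimCandidate : attacker;
    }
}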

A transactional scheduler is not strictly necessary, but it can boost performance. A contention manager alone can cause too many aborts, e.g., when a long-running transaction conflicts with shorter transactions, and an aborted transaction may wait too long. The transactional scheduler's goal: minimize conflicts (e.g., avoid repeated aborts). W. Maldonado et al. (2010). Scheduling support for transactional memory contention management. PPoPP, pp. 79-90.

Distributed TM (or DTM) extends TM to distributed systems: nodes interconnected by message-passing links. Execution and network models. Execution models: data flow DTM (DISC 05), where transactions are immobile and objects migrate to the invoking transactions; control flow DTM (USENIX 12), where objects are immobile and transactions move from node to node. Herlihy's metric-space network model (DISC 05): a communication delay between every pair of nodes, with the delay depending on the node-to-node distance (e.g., 1st hop: 1.499 ms, 2nd hop: 9.095 ms, 3rd hop: 16.613 ms, 4th hop: 13.709 ms, 5th hop: 15.016 ms).
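The two execution models can be contrasted with a minimal, purely illustrative Java sketch; these interfaces are assumptions for exposition and are not HyFlow's API.

// Illustrative-only interfaces contrasting the two DTM execution models.
interface DataflowDTM {
    // Data flow: the transaction stays put; the object (or its ownership) migrates to the caller.
    <T> T acquireObject(String objectId);          // pulls the current copy to the local node
}

interface ControlFlowDTM {
    // Control flow: objects stay put; the transaction ships the operation to the object's node.
    <R> R invokeAt(String objectId, java.util.function.Function<Object, R> operation);
}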

Past research has developed several transactional schedulers. Multi-core systems: BiModal transactional scheduler (OPODIS 09), Proactive transactional scheduler (MICRO 09), Adaptive transactional scheduler (SPAA 08), Steal-On-Abort (HiPEAC 09), CAR-STM (PODC 08). Distributed systems: Bi-interval transactional scheduler (SSS 10), single-copy; Reactive transactional scheduler (IPDPS 12), single-copy (and closed-nested transactions).

Replication models in (dataflow) DTM. No replication: non-fault-tolerant; only one copy of each object; single points of failure. Full replication: fault-tolerant, but non-scalable; all objects are replicated on all nodes; atomic broadcasting of updates does not scale. Partial replication: fault-tolerant and scalable; each object is replicated only at a subset of nodes; updates are atomically broadcast only to that subset. N. Schiper, P. Sutra, and F. Pedone (2010). P-store: Genuine partial replication in wide area networks. SRDS, pp. 214-224.

The paper's contributions: scheduling in partially replicated DTM; extending TFA for partial replication; the Cluster-based Transactional Scheduler (CTS); competitive ratio analysis; implementation and experimental studies; comparisons with state-of-the-art DTMs. M. Saad and B. Ravindran (2011). HyFlow: A high performance distributed software transactional memory framework. HPDC, pp. 265-266.

Atomicity, consistency, and isolation in data-flow DTM: the Transactional Forwarding Algorithm (TFA) provides early validation of remote objects and atomicity for object operations in the presence of asynchronous clocks. Example: T1, T2, and T3 request o1; T4 requests o1 and aborts. T1's node is N1, T2's node is N2, T3's node is N3, T4's node is N4, and object o1's owner node is N0. At t1, the owner's local clock (LC) is 14. At t2, T2 sends its validate request. At t3, o1 is updated at 30; T2 commits and becomes o1's owner. At t4 and t5, T1 and T3 send their validate requests, but they abort because LC is now 30, not 14. M. Saad and B. Ravindran (2011). HyFlow: A high performance distributed software transactional memory framework. HPDC, pp. 265-266.
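A minimal Java sketch of the early-validation idea illustrated above: record the owner's logical clock when a remote object is opened, and abort at validation time if the owner's clock for that object has advanced. The class and method names are assumptions, and the real TFA additionally forwards the transaction's starting clock.

import java.util.HashMap;
import java.util.Map;

// Sketch of TFA-style early validation with asynchronous per-node clocks.
// When a remote object is opened, the owner's local clock (LC) is recorded;
// at validation time the owner is asked again, and if the object's version
// has advanced past the recorded value, the transaction aborts.
class TfaTransaction {
    private final Map<String, Long> observedVersions = new HashMap<>(); // objectId -> owner LC at open time

    void openRemote(String objectId, ObjectOwner owner) {
        observedVersions.put(objectId, owner.currentClock(objectId));
    }

    boolean validate(Map<String, ObjectOwner> owners) {
        for (Map.Entry<String, Long> e : observedVersions.entrySet()) {
            long nowAtOwner = owners.get(e.getKey()).currentClock(e.getKey());
            if (nowAtOwner > e.getValue()) {
                return false;   // object was updated since it was read (e.g., LC went 14 -> 30): abort
            }
        }
        return true;            // all opened objects still at the observed versions: safe to commit
    }
}

interface ObjectOwner {
    long currentClock(String objectId); // owner-side logical clock / version for the object
}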

[Diagram: motivating example for CTS. Transactions T1, T2, and T3 access objects o1-o4 and conflict on o1 and o3; under TFA the conflicts lead to aborts, while under CTS the same transactions are scheduled so that more of them commit.]

Logical partitioning for partial replication. [Diagram: a metric-space graph over nodes 1-6 partitioned into Cluster 1, Cluster 2, and Cluster 3; each edge is a link whose length is the inter-node distance.] The purpose of partitioning is to enhance locality. Partitioning is done with METIS (SIAM 98). Each cluster holds an object replica, and each cluster has one object owner that owns all of the cluster's replicas. A transaction on node 2 requests objects from the object owner in node 2's cluster. G. Karypis and V. Kumar (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM-JSC, pp. 359-392.
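A hedged Java sketch of the owner lookup implied by this partitioning, assuming hypothetical nodeToCluster and clusterToOwner maps produced by a METIS-style partitioner.

import java.util.Map;

// Sketch: after partitioning the metric-space graph into clusters, every cluster
// holds a replica of each object and designates one node as the object owner.
// A transaction always contacts the owner inside the requester's own cluster.
class ClusterDirectory {
    private final Map<Integer, Integer> nodeToCluster;   // nodeId -> clusterId (from the partitioner)
    private final Map<Integer, Integer> clusterToOwner;  // clusterId -> owner nodeId for that cluster's replicas

    ClusterDirectory(Map<Integer, Integer> nodeToCluster, Map<Integer, Integer> clusterToOwner) {
        this.nodeToCluster = nodeToCluster;
        this.clusterToOwner = clusterToOwner;
    }

    /** Owner node that a transaction running on requesterNode should contact for any object. */
    int ownerFor(int requesterNode) {
        int cluster = nodeToCluster.get(requesterNode);
        return clusterToOwner.get(cluster);              // e.g., node 2 asks the owner in node 2's cluster
    }
}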

Scheduler design (1/2). T denotes a requesting transaction and O a requested object. CTS keeps two tables: a TxTable mapping each transaction to the objects it uses, and an ObjectTable mapping each object to the transactions using it. Example state: TxTable has Ti -> {Ox, Oy}, Tj -> {Oy}, Tk -> {O1}; ObjectTable has Ox -> {Ti} (no conflict), Oy -> {Ti, Tj} (conflict 1), O1 -> {Tk} (conflict 2). Over time, Tj requests Oy and then requests O1. Decision rule: if T has both conflicts 1 and 2, abort T; otherwise allow T to use O.

Scheduler design (2/2). Continuing the example: TxTable has Ti -> {Ox, Oy}, Tj -> {O1, Oy}, Tk -> {O1}; ObjectTable has Ox -> {Ti} (no conflict), Oy -> {Ti, Tj} (conflict 1), and O1 -> {Tj}. Over time, Tj requests O1 and then Tk requests O1 (conflict 2). Decision rules: for each transaction T already using O, if T has a conflict, abort T, otherwise allow T to use O; and if a requesting transaction T has both conflicts 1 and 2, abort T, otherwise allow T to use O.
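A simplified Java sketch of the bookkeeping and decision rule described on these two slides; CtsScheduler and its rule are a rough approximation of CTS under assumed names, not the exact algorithm.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the CTS tables and decision rule.
// TxTable: transaction -> objects it uses; ObjectTable: object -> transactions using it.
class CtsScheduler {
    private final Map<String, Set<String>> txTable = new HashMap<>();     // txId -> objectIds
    private final Map<String, Set<String>> objectTable = new HashMap<>(); // objectId -> txIds

    /** Handle a request by transaction txId for object objId; returns true if the request is granted. */
    boolean request(String txId, String objId) {
        Set<String> holders = objectTable.getOrDefault(objId, new HashSet<>());
        Set<String> conflicting = new HashSet<>(holders);
        conflicting.remove(txId);

        // Count how many of the requester's current objects are also used by other transactions.
        int priorConflicts = 0;
        for (String o : txTable.getOrDefault(txId, new HashSet<>())) {
            if (objectTable.getOrDefault(o, new HashSet<>()).size() > 1) priorConflicts++;
        }

        if (!conflicting.isEmpty() && priorConflicts >= 1) {
            abort(txId);        // requester would hold two conflicts (conflicts 1 and 2): abort it
            return false;
        }
        // Otherwise grant the object and record the usage in both tables.
        txTable.computeIfAbsent(txId, k -> new HashSet<>()).add(objId);
        objectTable.computeIfAbsent(objId, k -> new HashSet<>()).add(txId);
        return true;
    }

    private void abort(String txId) {
        for (String o : txTable.getOrDefault(txId, new HashSet<>())) {
            objectTable.getOrDefault(o, new HashSet<>()).remove(txId);
        }
        txTable.remove(txId);
    }
}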

Competitive ratio analysis. makespan_A: the time that scheme A needs to complete N transactions. Replication models: FR (full replication), PR (partial replication), NR (no replication). Theorem 1: makespan(FR) < makespan(PR) < makespan(NR). Theorem 2: makespan_CTS(PR) < makespan(FR), where N > 3. [Timeline: T requests O; T aborts due to a conflict; after a random back-off time, O is updated and sent to the aborted transactions; T restarts without re-requesting O.] PR incurs requesting and object-retrieval times for transactions, but aborted transactions are re-sent the updated objects. CTS's back-off generally allows the update to be received before the transaction restarts, resulting in less overall time than FR's broadcasting time.
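Restated compactly in LaTeX (a transcription of the slide's claims, with makespan_A(N) denoting the time scheme A needs to complete N transactions):

% FR = full replication, PR = partial replication, NR = no replication
\begin{align*}
\textbf{Theorem 1:}\quad & \mathrm{makespan}_{\mathrm{FR}}(N) < \mathrm{makespan}_{\mathrm{PR}}(N) < \mathrm{makespan}_{\mathrm{NR}}(N) \\
\textbf{Theorem 2:}\quad & \mathrm{makespan}_{\mathrm{CTS}}(\mathrm{PR}, N) < \mathrm{makespan}_{\mathrm{FR}}(N) \quad \text{for } N > 3
\end{align*}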

Implementation and experimental setup. CTS was implemented in the HyFlow DTM framework, a second-generation DTM framework for the JVM (Java, Scala), open-source at hyflow.org. Testbed: 24 nodes, each a 2 GHz AMD Opteron. Benchmarks: a distributed version of STAMP Vacation, two monetary applications, and distributed data structures (Counter, Red/Black Tree, DHT). CTS(30) and CTS(60): CTS where 30% and 60% of the nodes, respectively, are object owners. M. Saad and B. Ravindran (2011). HyFlow: A high performance distributed software transactional memory framework. HPDC, pp. 265-266. C. Minh et al. (2008). STAMP: Stanford Transactional Applications for Multi-Processing. IISWC, pp. 200-208.

Evaluation: throughput with no node failures. [Plots: throughput of CTS(90), CTS(60), CTS(30), CTS(0), and TFA under low contention (90% read transactions) and high contention (10% read transactions).] CTS(0): TFA + CTS, but with no replication and no fault tolerance. CTS(90): high communication overhead. TFA: no CTS.

Evaluation: throughput under node failures. [Plots: throughput speedup under 20% and 50% node failures.] Nodes failed randomly during each experiment. GenRSTM and DecentSTM use the full replication model. Throughput speedup of CTS(60) over GenRSTM and DecentSTM: 1.51x to 2.3x under low contention and 1.3x to 1.7x under high contention. CTS retains reasonable performance at 50% failures (GenRSTM and DecentSTM suffer high communication-delay overheads).

Conclusions. A DTM transactional scheduler for the partial replication model: it uses multiple clusters to support partial replication for fault tolerance, builds clusters with small inter-node communication, and identifies transactions to abort in order to enhance concurrency. It enhances transactional throughput: 1.5x over baseline TFA, and 1.55x and 1.73x over the others. The tradeoff between locality, communication cost, and fault tolerance can be effectively exploited in DTM. Open questions: adaptive partial replication? An adaptive back-off scheme?