Dependence-Aware Transactional Memory for Increased Concurrency. Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel University of Texas, Austin

Similar documents
ABORTING CONFLICTING TRANSACTIONS IN AN STM

Chris Rossbach, Owen Hofmann, Don Porter, Hany Ramadan, Aditya Bhandari, Emmett Witchel University of Texas at Austin

Dependence-Aware Transactional Memory for Increased Concurrency

METATM/TXLINUX: TRANSACTIONAL MEMORY FOR AN OPERATING SYSTEM

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat

Scheduling Transactions in Replicated Distributed Transactional Memory

Atomic Transac1ons. Atomic Transactions. Q1: What if network fails before deposit? Q2: What if sequence is interrupted by another sequence?

Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory

Chí Cao Minh 28 May 2008

Tradeoffs in Transactional Memory Virtualization

EazyHTM: Eager-Lazy Hardware Transactional Memory

Hardware Transactional Memory Architecture and Emulation

TxLinux and MetaTM: Transactional Memory and the Operating System

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

doi: /

Transactional Memory

Emmett Witchel The University of Texas At Austin

6 Transactional Memory. Robert Mullins

MetaTM/TxLinux: Transactional Memory For An Operating System

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Speculative Synchronization

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Enhancing Concurrency in Distributed Transactional Memory through Commutativity

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

INTRODUCTION. Hybrid Transactional Memory. Transactional Memory. Problems with Transactional Memory Problems

Conventional processor designs run out of steam Complexity (verification) Power (thermal) Physics (CMOS scaling)

Log-Based Transactional Memory

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Speculative Locks. Dept. of Computer Science

Work Report: Lessons learned on RTM

TxLinux: Using and Managing Hardware Transactional Memory in an Operating System

ByteSTM: Java Software Transactional Memory at the Virtual Machine Level

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Improving the Practicality of Transactional Memory

The Common Case Transactional Behavior of Multithreaded Programs

FlexTM. Flexible Decoupled Transactional Memory Support. Arrvindh Shriraman Sandhya Dwarkadas Michael L. Scott Department of Computer Science

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing

LogTM: Log-Based Transactional Memory

Hardware Support for NVM Programming

CSC 261/461 Database Systems Lecture 21 and 22. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun

Portland State University ECE 588/688. Transactional Memory

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

OPERATING SYSTEM TRANSACTIONS

A Case for Using Value Prediction to Improve Performance of Transactional Memory

Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION

Invyswell: A HyTM for Haswell RTM. Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, Maurice Herlihy

Potential violations of Serializability: Example 1

Lecture 12. Lecture 12: The IO Model & External Sorting

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Unbounded Transactional Memory

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Transactional Memory. Lecture 19: Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Concurrency Control. Transaction Management. Lost Update Problem. Need for Concurrency Control. Concurrency control

Chapter 22. Transaction Management

TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions

Topic 18: Virtual Memory

Pathological Interaction of Locks with Transactional Memory

Some Examples of Conflicts. Transactional Concurrency Control. Serializable Schedules. Transactions: ACID Properties. Isolation and Serializability

Transactional Memory

An Integrated Pseudo-Associativity and Relaxed-Order Approach to Hardware Transactional Memory

HTM in the wild. Konrad Lai June 2015

COMP3151/9151 Foundations of Concurrency Lecture 8

Partition-Based Hardware Transactional Memory for Many-Core Processors

6.852: Distributed Algorithms Fall, Class 20

Performance Evaluation of Adaptivity in STM. Mathias Payer and Thomas R. Gross Department of Computer Science, ETH Zürich

TRANSACTION MEMORY. Presented by Hussain Sattuwala Ramya Somuri

DATABASE TRANSACTIONS. CS121: Relational Databases Fall 2017 Lecture 25

Implementing Rollback: Copy. 2PL: Rollback. Implementing Rollback: Undo. 8L for Part IB Handout 4. Concurrent Systems. Dr.

Software-Controlled Multithreading Using Informing Memory Operations

Multi-Core Computing with Transactional Memory. Johannes Schneider and Prof. Roger Wattenhofer

Serializability of Transactions in Software Transactional Memory

Lecture 21. Lecture 21: Concurrency & Locking

Understanding Hardware Transactional Memory

Chapter 15 : Concurrency Control

Transactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

CS Reading Packet: "Transaction management, part 2"

Commit Algorithms for Scalable Hardware Transactional Memory. Abstract

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93

AN EXECUTION MODEL FOR FINE-GRAIN NESTED SPECULATIVE PARALLELISM

Implementing Isolation

transaction - (another def) - the execution of a program that accesses or changes the contents of the database

CSC 261/461 Database Systems Lecture 24

Hybrid Transactional Memory

Transcription:

Dependence-Aware Transactional Memory for Increased Concurrency Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel University of Texas, Austin

Concurrency Conundrum Challenge: CMP ubiquity Parallel programming with locks and threads is difficult deadlock, livelock, convoys lock ordering, poor composability performance-complexity tradeoff Transactional Memory (HTM) simpler programming model removes pitfalls of locking coarse-grain transactions can perform well under lowcontention 2 cores 16 cores 80 cores (neat!) 2

High contention: optimism warranted? TM performs poorly with write-shared data Increasing core counts make this worse Write sharing is common Statistics, reference counters, lists Solutions complicate programming model open-nesting, boosting, early release, ANTs Read- Sharing Write- Sharing Locks Transactions TM s need not Dependence-Awareness always serialize can help the programmer write-shared (transparently!) Transactions! 3

Outline Motivation Dependence-Aware Transactions Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 4

time Two threads sharing a counter Initially: count == 0 load count,r inc r store r,count Thread A load count,r inc r store r,count Result: count == 2 Schedule: dynamic instruction sequence models concurrency by interleaving instructions from multiple threads Thread B memory count: 012 5

TM shared counter Initially: count == 0 xbegin Tx A xbegin Tx B load count,r time inc r store r,count load count,r memory count: 01 inc r Conflict! store r,count xend xend Conflict: Current HTM designs cannot accept such schedules, both transactions access a datum, at least one is a write despite the fact that they yield the correct result! intersection between read and write sets of transactions 6

Does this really matter? xbegin load X,r0 <lots of work> inc r0 store r0,x load count,r inc r store r,count xend xend Critsec A xbegin <lots of work> load X,r0 inc r0 load count,r store inc r r0,x store r,count xend xend Critsec B Common Pattern! Statistics Linked lists Garbage collectors DATM: use dependences to commit conflicting transactions 7

DATM shared counter time xbegin load count,r inc r store r,count xend Initially: count == 0 T1 xbegin load count,r inc r store r,count xend T1 must commit before T2 If T1 aborts, T2 must also abort If T1 overwrites count, T2 must abort T2 memory count: 012 Forward speculative data from T1 to satisfy T2 s load. Model these constraints as dependences 8

Conflicts become Dependences Tx A read X write Y Tx B write X write Y Read-Write: A commits before B Write-Write: A commits before B Tx A Write-Read: forward data overwrite abort A commits before B R W W W Tx B write Z read Z W R Dependence Graph Transactions: nodes dependences: edges edges dictate commit order no cycles conflict serializable 9

Enforcing ordering using Dependence Graphs T1 xbegin load X,r0 store r0,x T2 xbegin xend load X,r0 store r0,x xend Wait for T1 xend T1 W R T2 T1 must serialize before T2 Outstanding Dependences! 10

Enforcing Consistency using Dependence Graphs T1 xbegin T2 xbegin load X, r0 load X,r0 store r0,x store r0,x T1 R W T2 xend xend W W CYCLE! (lost update) cycle not conflict serializable restart some tx to break cycle invoke contention manager 11

Theoretical Foundation Correctness and Serializability: optimality results of concurrent Proof: See interleaving equivalent [PPoPP to serial 09] interleaving Conflict: Conflict-equivalent: same operations, order of conflicting operations same Conflict-serializability: conflictequivalent to a serial interleaving 2-phase locking: conservative implementation of CS (LogTM, TCC, RTM, MetaTM, OneTM.) Serializable Conflict Serializable 2PL 12

Outline Motivation/Background Dependence-Aware Model Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 13

DATM Requirements Mechanisms: Maintain global dependence graph Conservative approximation Detect cycles, enforce ordering constraints Create dependences Forwarding/receiving Buffer speculative data Implementation: coherence, L1, per-core HW structures 14

Dependence-Aware Microarchitecture Order vector: dependence graph TXID, access bits: versioning, detection TXSW: transaction status Frc-ND + ND bits: disable dependences These are not programmer-visible! 15

Dependence-Aware Microarchitecture Order vector: dependence graph topological sort conservative approx. These are not programmer-visible! 16

Dependence-Aware Microarchitecture Order vector: dependence graph TXID, access bits: versioning, detection TXSW: transaction status Frc-ND + ND bits: disable dependences These are not programmer-visible! 17

Dependence-Aware Microarchitecture Order vector: dependence graph TXID, access bits: versioning, detection TXSW: transaction status Frc-ND + ND bits: disable dependences, no active dependences These are not programmer-visible! 18

FRMSI protocol: forwarding and receiving MSI states TX states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xabt xcmt 19

FRMSI protocol: forwarding and receiving MSI states TX states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xabt xcmt 20

FRMSI protocol: forwarding and receiving MSI states TxMSI states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xabt xcmt 21

FRMSI protocol: forwarding and receiving MSI states TX states (T*) Forwarded: (T*F) Received: (T*R*) Committing (CTM) Bus messages: TXOVW xabt xcmt 22

Converting Conflicts to Dependences L1 cnt core_0 core_1 core 0 1 xbegin 1 xbegin core 1 PC 12345 PC 12345 2 ld R, cnt 2 ld R, cnt R 101 100 3 inc cnt 3 inc cnt R 102 101 TXID TxA 4 st cnt, R 4 st cnt, R TXID TxB 5 xend N xend (stall) L1 OVec [A,B] OVec [A,B] S TXID data Outstanding TMM CTM TMF TS 100 101 cnt TMR TR TxB Dependence 102 TMFTxA xcmt(a) 101 S TXID data Main Memory cnt 100 23

Outline Motivation/Background Dependence-Aware Model Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 24

Experimental Setup Implemented DATM by extending MetaTM Compared against MetaTM: [Ramadan 2007] Simulation environment Simics 3.0.30 machine simulator x86 ISA w/htm instructions 32k 4-way tx L1 cache; 4MB 4-way L2; 1GB RAM 1 cycle/inst, 24 cyc/l2 hit, 350 cyc/main memory 8, 16 processors Benchmarks Micro-benchmarks STAMP [Minh ISCA 2007] TxLinux: Transactional OS [Rossbach SOSP 2007] 25

DATM speedup, 16 CPUs 50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% 15x 39% 22% 3% 2% pmake labyrinth vacation bayes counter 26

Eliminating wasted work 80% 70% 60% 50% 40% 30% 20% 10% 0% Speedup: restarts/tx avg bkcyc/tx 16 CPUs Lower is better pmake labyrinth vacation bayes counter 2% 3% 22% 39% 15x 27

Dependence Aware Contention Timestamp contention management Management T7 T6 5 transactions must restart! Cascaded abort? T3 T4 T2 T1 Dependence Aware Order Vector T2, T3, T7, T6, T4, T1 28

Contention Management 120% 100% 80% 60% 40% 20% 0% -20% -40% -60% bayes 101% Polka Eruption Dependence-Aware vacation 14% 29

Outline Motivation/Background Dependence-Aware Model Dependence-Aware Hardware Experimental Methodology/Evaluation Conclusion 30

Related Work HTM TCC [Hammond 04], LogTM[-SE] [Moore 06], VTM [Rajwar 05], MetaTM [Ramadan 07, Rossbach 07], HASTM, PTM, HyTM, RTM/FlexTM [Shriraman 07,08] TLS Hydra [Olukotun 99], Stampede [Steffan 98,00], Multiscalar [Sohi 95], [Garzaran 05], [Renau 05] TM Model extensions Privatization [Spear 07], early release [Skare 06], escape actions [Zilles 06], open/closed nesting [Moss 85, Menon 07], Galois [Kulkarni 07], boosting [Herlihy 08], ANTs [Harris 07] TM + Conflict Serializability Adaptive TSTM [Aydonat 08] 31

Conclusion DATM can commit conflicting transactions Improves write-sharing performance Transparent to the programmer DATM prototype demonstrates performance benefits Source code available! www.metatm.net 32

Broadcast and Scalability Yes. We broadcast. But it s less than you might think Because each node maintains all deps, this design uses broadcast for: Commit Abort TXOVW (broadcast writes) Forward restarts New Dependences These sources of broadcast could be avoided in a directory protocol: keep only the relevant subset of the dependence graph at each node 33

FRMSI Transient states 19 transient states: sources: forwarding overwrite of forwarded lines (TXOVW) transitions from MSI T* states Ameliorating factors best-effort: local evictions abort, reduces WB states T*R* states are isolated transitions back to MSI (abort, commit) require no transient states Signatures for forward/receive sets could eliminate five states. 34

Why isn t this TLS? Forward data between threads Detect when reads occur too early Discard speculative state after violations Retire speculative Writes in order Memory renaming Multi-threaded workloads Programmer/Compiler transparency TLS DATM TM 1. Txns can commit in arbitrary order. TLS has the daunting task of maintaining program order for epochs 2. TLS forwards values when they are the globally committed value (i.e. through memory) 3. No need to detect early reads (before a forwardable value is produced) 4. Notion of violation is different we must roll back when CS is violated, TLS, when epoch ordering 5. Retiring writes in order: similar issue we use CTM state, don t need speculation write buffers. 35 6. Memory renaming to avoid older threads seeing new values we have no such issue

Doesn t this lead to cascading aborts? YES Rare in most workloads 36

Inconsistent Reads OS modifications: Page fault handler Signal handler 37

Increasing Concurrency time t c Lazy/Lazy (e.g. TCC) T1 T2 xbegin xbegin conflict arbitrate xend xbegin serialized xend Eager/Eager (MetaTM,LogTM) T1 T2 xbegin xbegin arbitrate conflict xbegin xend overlapped retry xend DATM T1 T2 xbegin xbegin conflict xend xend no retry! (Eager/Eager: (Lazy/Lazy: Updates buffered, in-place, conflict detection at commit time of reference) time) 38

Design space highlights Cache granularity: eliminate per word access bits Timestamp-based dependences: eliminate Order Vector Dependence-Aware contention management 39

Design Points 2.5 2 1.5 bayes vacation counter counter-tt 1 0.5 0 Cache-gran Timestamp-fwd DATM 40

Hardware TM Primer Key Ideas: Critical sections execute concurrently Conflicts are detected dynamically If conflict serializability is violated, rollback Key Abstractions: Primitives xbegin, xend, xretry Conflict Contention Manager Need flexible policy Conventional Wisdom Transactionalization : Replace locks with transactions 41

Hardware TM basics: example cpu 0 cpu 1 PC: 1 36 70 2 PC: 01 23 47 8 Working Set R{ R{ R{ A,B,C R{} A } }} W{} 0: xbegin 1: read A 2: read B 3: if(cpu % 2) 4: write C 5: else 6: read C 7: 8: xend CONFLICT: 0: xbegin; 1: read A 2: read B 3: if(cpu % 2) 4: write C 5: else 6: read C 7: 8: xend Assume contention manager decides cpu1 C is in the read set of wins: cpu0, and in the write set cpu0 of rolls cpu1back cpu1 commits Working Set R{ R{ R{} A,B A } } W{} C} 42