Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Similar documents
Lecture: Transactional Memory. Topics: TM implementations

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)

Lecture: Consistency Models, TM

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Lecture 6: TM Eager Implementations. Topics: Eager conflict detection (LogTM), TM pathologies

Lecture 10: TM Implementations. Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation

Lecture 8: Eager Transactional Memory. Topics: implementation details of eager TM, various TM pathologies

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Lecture 17: Transactional Memories I

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Potential violations of Serializability: Example 1

6 Transactional Memory. Robert Mullins

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Transactional Memory. Lecture 19: Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

6.852: Distributed Algorithms Fall, Class 20

Transactional Memory. Lecture 18: Parallel Computer Architecture and Programming CMU /15-618, Spring 2017

Lecture 25: Multiprocessors

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Advanced Topic: Efficient Synchronization

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

DHTM: Durable Hardware Transactional Memory

The Problems with OOO

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

Scheduling Transactions in Replicated Distributed Transactional Memory

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Concurrent Preliminaries

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

A Basic Snooping-Based Multi-processor

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

A Basic Snooping-Based Multi-Processor Implementation

A Basic Snooping-Based Multi-Processor Implementation

Computer Science 146. Computer Architecture

Transactional Memory

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Concurrency control CS 417. Distributed Systems CS 417

Conflict Detection and Validation Strategies for Software Transactional Memory

A More Sophisticated Snooping-Based Multi-Processor

Synchronization Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Implementing caches. Example. Client. N. America. Client System + Caches. Asia. Client. Africa. Client. Client. Client. Client. Client.

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Improving the Practicality of Transactional Memory

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:

Chapter 15 : Concurrency Control

Distributed Shared Memory and Memory Consistency Models

TRANSACTION PROCESSING PROPERTIES OF A TRANSACTION TRANSACTION PROCESSING PROPERTIES OF A TRANSACTION 4/3/2014

Lecture 7: Implementing Cache Coherence. Topics: implementation details

LogTM: Log-Based Transactional Memory

Multiprocessor Cache Coherency. What is Cache Coherence?

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

CS377P Programming for Performance Multicore Performance Synchronization

Multiprocessors and Locking

Transactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

EECS 570 Final Exam - SOLUTIONS Winter 2015

EazyHTM: Eager-Lazy Hardware Transactional Memory

Synchronization. Chapter 5

Transactional Lock-Free Execution of Lock-Based Programs

Part 1: Concepts and Hardware- Based Approaches

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:

ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Envionment

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University

Advanced Memory Management

Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation

Distributed Computing Group

6.830 Lecture Recovery 10/30/2017

Lecture 12 Transactional Memory

The transaction. Defining properties of transactions. Failures in complex systems propagate. Concurrency Control, Locking, and Recovery

Synchronization. CSCI 5103 Operating Systems. Semaphore Usage. Bounded-Buffer Problem With Semaphores. Monitors Other Approaches

Module 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors

Consistency in Distributed Systems

Chí Cao Minh 28 May 2008

Lecture 24: Virtual Memory, Multiprocessors

ECE 485/585 Microprocessor System Design

Hardware Transactional Memory on Haswell

The Issue. Implementing Isolation. Transaction Schedule. Isolation. Schedule. Schedule

Multiprocessor Synchronization

Chapter 13 : Concurrency Control

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

Speculative Locks. Dept. of Computer Science

CS 654 Computer Architecture Summary. Peter Kemper

Final Review. May 9, 2018 May 11, 2018

Final Review. May 9, 2017

Memory Consistency Models

Transcription:

Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1

Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking, transactions execute speculatively in the hope that there will be no conflicts Can yield better performance; Eliminates deadlocks Programmer can freely encapsulate code sections within transactions and not worry about the impact on performance and correctness (for the most part) Programmer specifies the code sections they d like to see execute atomically the hardware takes care of the rest (provides illusion of atomicity) 2

Transactions Transactional semantics: when a transaction executes, it is as if the rest of the system is suspended and the transaction is in isolation the reads and writes of a transaction happen as if they are all a single atomic operation if the above conditions are not met, the transaction fails to commit (abort) and tries again transaction begin read shared variables arithmetic write shared variables transaction end 3

Example Producer-consumer relationships producers place tasks at the tail of a work-queue and consumers pull tasks out of the head Enqueue transaction begin if (tail == NULL) update head and tail else update tail transaction end Dequeue transaction begin if (head->next == NULL) update head and tail else update head transaction end With locks, neither thread can proceed in parallel since head/tail may be updated with transactions, enqueue and dequeue can proceed in parallel transactions will be aborted only if the queue is nearly empty 4

Example Hash table implementation transaction begin index = hash(key); head = bucket[index]; traverse linked list until key matches perform operations transaction end Most operations will likely not conflict transactions proceed in parallel Coarse-grain lock serialize all operations Fine-grained locks (one for each bucket) more complexity, more storage, concurrent reads not allowed, concurrent writes to different elements not allowed 5

TM Implementation Core Core Cache Cache Caches track read-sets and write-sets Writes are made visible only at the end of the transaction At transaction commit, make your writes visible; others may abort 6

Detecting Conflicts Basic Implementation Writes can be cached (can t be written to memory) if the block needs to be evicted, flag an overflow (abort transaction for now) on an abort, invalidate the written cache lines Keep track of read-set and write-set (bits in the cache) for each transaction When another transaction commits, compare its write set with your own read set a match causes an abort At transaction end, express intent to commit, broadcast write-set (transactions can commit in parallel if their write-sets do not intersect) 7

Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock 8

Design Space Data Versioning Eager: based on an undo log Lazy: based on a write buffer Conflict Detection Optimistic detection: check for conflicts at commit time (proceed optimistically thru transaction) Pessimistic detection: every read/write checks for conflicts (reduces work during commit) 9

Lazy Implementation An implementation for a small-scale multiprocessor with a snooping-based protocol Lazy versioning and lazy conflict detection Does not allow transactions to commit in parallel 10

Lazy Implementation When a transaction issues a read, fetch the block in read-only mode (if not already in cache) and set the rd-bit for that cache line When a transaction issues a write, fetch that block in read-only mode (if not already in cache), set the wr-bit for that cache line and make changes in cache If a line with wr-bit set is evicted, the transaction must be aborted (or must rely on some software mechanism to handle saving overflowed data) 11

Lazy Implementation When a transaction reaches its end, it must now make its writes permanent A central arbiter is contacted (easy on a bus-based system), the winning transaction holds on to the bus until all written cache line addresses are broadcasted (this is the commit) (need not do a writeback until the line is evicted must simply invalidate other readers of these cache lines) When another transaction (that has not yet begun to commit) sees an invalidation for a line in its rd-set, it realizes its lack of atomicity and aborts (clears its rd- and wr-bits and re-starts) 12

Lazy Implementation Lazy versioning: changes are made locally the master copy is updated only at the end of the transaction Lazy conflict detection: we are checking for conflicts only when one of the transactions reaches its end Aborts are quick (must just clear bits in cache, flush pipeline and reinstate a register checkpoint) Commit is slow (must check for conflicts, all the coherence operations for writes are deferred until transaction end) No fear of deadlock/livelock the first transaction to acquire the bus will commit successfully Starvation is possible need additional mechanisms 13

Lazy Implementation Parallel Commits Writes cannot be rolled back hence, before allowing two transactions to commit in parallel, we must ensure that they do not conflict with each other One possible implementation: the central arbiter can collect signatures from each committing transaction (a compressed representation of all touched addresses) Arbiter does not grant commit permissions if it detects a possible conflict with the rd-wr-sets of transactions that are in the process of committing The lazy design can also work with directory protocols 14

Eager Implementation A write is made permanent immediately (we do not wait until the end of the transaction) This means that if some other transaction attempts a read, the latest value is returned and the memory may also be updated with this latest value Can t lose the old value (in case this transaction is aborted) hence, before the write, we copy the old value into a log (the log is some space in virtual memory -- the log itself may be in cache, so not too expensive) This is eager versioning 15

Eager Implementation Since Transaction-A s writes are made permanent rightaway, it is possible that another Transaction-B s rd/wr miss is re-directed to Tr-A At this point, we detect a conflict (neither transaction has reached its end, hence, eager conflict detection): two transactions handling the same cache line and at least one of them does a write One solution: requester stalls: Tr-A sends a NACK to Tr-B; Tr-B waits and re-tries again; hopefully, Tr-A has committed and can hand off the latest cache line to B neither transaction needs to abort 16

Eager Implementation Can lead to deadlocks: each transaction is waiting for the other to finish Need a separate (hw/sw) contention manager to detect such deadlocks and force one of them to abort Tr-A write X read Y Tr-B write Y read X 17

Eager Implementation Note that if Tr-B is doing a write, it may be forced to stall because Tr-A may have done a read and does not want to invalidate its cache line just yet If new reading transactions keep emerging, Tr-B may be starved again, need other sw/hw mechanisms to handle starvation Since logs are stored in virtual memory, there is no cache overflow problem and transactions can be large Commits are inexpensive (no additional step required); Aborts are expensive, but rare (must reinstate data from logs) 18

Other Issues Nesting: when one transaction calls another flat nesting: collapse all nested transactions into one large transaction closed nesting: inner transaction s rd-wr set are included in outer transaction s rd-wr set on inner commit; on an inner conflict, only the inner transaction is re-started open nesting: on inner commit, its writes are committed and not merged with outer transaction s commit set What if a transaction performs I/O? What if a transaction overflows out of cache? 19

Useful Rules of Thumb Transactions are often short more than 95% of them will fit in cache Transactions often commit successfully less than 10% are aborted 99.9% of transactions don t perform I/O Transaction nesting is not common Amdahl s Law again: optimize the common case! 20

Discussion Eager optimizes the common case and does not waste energy when there s a potential conflict TM implementations require relatively low hardware support Multiple commercial examples: Sun Rock, AMD ASF, IBM BG/Q, Intel Haswell 21

Title Bullet 22