Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech


Transactional Memory
Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering, Georgia Tech
(Adapted from the Stanford TCC group and the MIT SuperTech group)

Motivation
Uniprocessor systems are hitting limits:
- Frequency and power consumption
- Wire delay limits scalability
- Design complexity vs. verification effort
- Where is the remaining ILP?
Support for multiprocessor or multicore systems:
- Replicate small, simple cores; the design is scalable
- Faster design turnaround time, shorter time to market
- Exploit TLP, in addition to ILP within each core
- But now we have new problems

Parallel Software Problems
Parallel systems are often programmed with:
- Synchronization through barriers
- Shared-object access control through locks
Lock granularity and organization must balance performance and correctness:
- Coarse-grain locking: lock contention
- Fine-grain locking: extra overhead
- Must be careful to avoid deadlocks and data races
- Must be careful not to leave anything unprotected, for correctness
Performance tuning is not intuitive:
- Performance bottlenecks are tied to low-level events, e.g. false sharing and coherence misses
- Feedback is often indirect (cache lines, rather than variables)
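The coarse- vs. fine-grain tradeoff above can be sketched in a few lines. This is purely illustrative (the table layout and function names are made up, not from the slides): one lock over a whole structure serializes every access, while per-bucket locks reduce contention at the cost of extra locks and more ways to get the locking discipline wrong.

```python
import threading

# Coarse-grain: a single lock guards the whole table (high contention).
# Fine-grain: one lock per bucket (less contention, more bookkeeping).
table = [0] * 8

coarse = threading.Lock()
fine = [threading.Lock() for _ in table]

def incr_coarse(i):
    with coarse:          # every caller contends on the same lock
        table[i] += 1

def incr_fine(i):
    with fine[i]:         # callers on different buckets proceed in parallel
        table[i] += 1

incr_coarse(0)
incr_fine(1)
assert table[0] == 1 and table[1] == 1
```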

Parallel Hardware Complexity (TCC's view)
Cache coherence protocols are complex:
- Must track ownership of cache lines
- Difficult to implement and verify all corner cases
Consistency protocols are complex:
- Must provide rules to correctly order individual loads/stores
- Difficult for both hardware and software
Current protocols rely on low latency, not bandwidth:
- Critical short control messages on ownership transfers
- Latency of short messages is unlikely to scale well in the future
- Bandwidth is likely to scale much better: high-speed interchip connections, and multicore (CMP) means on-chip bandwidth

What Do We Want?
A shared memory system with:
- A simple, easy programming model (unlike message passing)
- A simple, low-complexity hardware implementation (unlike conventional cache-coherent shared memory)
- Good performance

Lock Freedom
Why are locks bad? Common problems with conventional locking mechanisms in concurrent systems:
- Priority inversion: a low-priority process is preempted while holding a lock needed by a high-priority process
- Convoying: a process holding a lock is de-scheduled (e.g. page fault, end of quantum), and no forward progress is possible for other processes capable of running
- Deadlock (or livelock): processes attempt to lock the same set of objects in different orders (possibly programmer bugs)
- Error-prone in general

Using Transactions
What is a transaction? A sequence of instructions that is guaranteed to execute and complete only as an atomic unit:

Begin Transaction
  Inst #1
  Inst #2
  Inst #3
End Transaction

Transactions satisfy the following properties:
- Serializability: transactions appear to execute serially
- Atomicity (or failure-atomicity): a transaction either commits its changes when complete, making them visible to all, or aborts, discarding its changes (and may retry)
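These two properties can be modeled in a toy software sketch. Everything here (the `Transaction` class, `commit`, `abort`) is an illustrative assumption for exposition, not an API from the lecture: writes go to a private buffer and reach shared memory only at commit, all at once.

```python
# Toy model of failure-atomicity: a transaction buffers its writes
# privately and publishes them to shared memory only on commit; an
# abort discards them so the transaction can retry.
shared_mem = {"x": 0, "y": 0}

class Transaction:
    def __init__(self):
        self.write_buffer = {}           # speculative updates, private

    def write(self, addr, value):
        self.write_buffer[addr] = value  # not yet visible to others

    def read(self, addr):
        # local stores supply data first (see the WAR discussion later)
        return self.write_buffer.get(addr, shared_mem[addr])

    def commit(self):
        shared_mem.update(self.write_buffer)  # all at once, visible to all

    def abort(self):
        self.write_buffer.clear()        # discard; caller may retry

t = Transaction()
t.write("x", 1)
t.write("y", 2)
assert shared_mem["x"] == 0              # nothing visible before commit
t.commit()
assert shared_mem == {"x": 1, "y": 2}    # both updates appear atomically
```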

TCC (Stanford) [ISCA 2004]
Transactional Coherence and Consistency:
- Programmer-defined groups of instructions within a program:

Begin Transaction    <- start buffering results
  Inst #1
  Inst #2
  Inst #3
End Transaction      <- commit results now

- Machine state is committed only at the end of each transaction; each commit must update machine state atomically, all at once
- To other processors, all instructions within one transaction appear to execute only when the transaction commits
- These commits impose an order on how processors may modify machine state

Transaction Code Example
MIT LTM instruction set:

xstart:   XBEGIN on_abort
          lw   r1, 0(r2)
          addi r1, r1, 1
          ...
on_abort: ...              // back off
          j    xstart      // retry

Transactional Memory
Transactions appear to execute in commit order. A flow (RAW) dependence causes a transaction violation and restart. In the slide's timeline:
- Transaction A: ld 0xdddd ... st 0xbeef; arbitrates and commits 0xbeef
- Transaction B: ld 0xdddd ... ld 0xbbbb; it never reads 0xbeef, so A's commit does not affect it
- Transaction C: ld 0xbeef before A commits; when A's commit of 0xbeef is observed, C takes a violation and re-executes with the new data
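The violation check in that timeline amounts to intersecting a committer's write-set with each running transaction's read-set. A minimal sketch (class and method names are illustrative, not hardware structures):

```python
# Flow (RAW) violation detection: when a transaction commits, its
# write-set is broadcast and compared against the read-sets of
# still-running transactions; any overlap forces a violation and restart.
class Txn:
    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_set = set()
        self.violated = False

    def load(self, addr):
        self.read_set.add(addr)

    def store(self, addr):
        self.write_set.add(addr)

def commit(committer, others):
    for t in others:
        if committer.write_set & t.read_set:  # snoop hit -> RAW violation
            t.violated = True                 # t must re-execute

a, b, c = Txn("A"), Txn("B"), Txn("C")
a.load(0xDDDD); a.store(0xBEEF)
b.load(0xDDDD); b.load(0xBBBB)
c.load(0xBEEF)                 # C read 0xBEEF before A's commit

commit(a, [b, c])
assert not b.violated          # B never read 0xBEEF
assert c.violated              # C saw stale 0xBEEF and must restart
```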

Transactional Memory
Output (WAW) and anti- (WAR) dependences are handled automatically:
- WAW: write buffers drain to shared memory only in commit order (think of sequential consistency), so the later commit's value wins
- WAR: all writes are kept private until commit, and a transaction's own loads are satisfied from its local buffer

Example: Transaction A performs ST X = 1 then LD X (sees 1); Transaction B performs ST X = 3 then LD X (sees 3, not A's value). Local stores supply data to local loads, and after both commit in order, shared memory holds X = 3.
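The example above can be replayed concretely. A hedged sketch of lazy versioning with plain dictionaries (the buffer names are made up): each transaction's reads hit its private buffer first, and buffers drain to shared memory strictly in commit order.

```python
# WAW and WAR need no tracking under lazy versioning: each transaction
# writes only its private buffer, buffers drain in commit order (last
# committer wins, handling WAW), and a transaction's reads are served
# from its own buffer (so it never observes another's uncommitted store).
shared = {"X": 0}

buf_a = {"X": 1}     # Transaction A: ST X = 1
buf_b = {"X": 3}     # Transaction B: ST X = 3

# Reads inside each transaction are satisfied from the local buffer:
assert buf_a.get("X", shared["X"]) == 1   # A's LD X sees 1
assert buf_b.get("X", shared["X"]) == 3   # B's LD X sees 3, not A's 1

# Commit in order A then B; writes reach memory in commit order (WAW):
shared.update(buf_a)
shared.update(buf_b)
assert shared["X"] == 3
```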

TCC System
Similar to prior thread-level speculation (TLS) techniques:
- CMU Stampede
- Stanford Hydra
- Wisconsin Multiscalar
- UIUC speculative multithreading CMP
A loosely coupled TLS system:
- Completely eliminates conventional cache coherence and consistency models (no MESI-style cache coherence protocol)
- But requires new hardware support

The TCC Cycle
Transactions run in a cycle:
- Speculatively execute code and buffer the results
- Wait for commit permission; this phase provides synchronization, if necessary
- Arbitrate with other processors
- Commit stores together (as one packet): provides a well-defined write ordering, can invalidate or update other caches, and the large packet utilizes bandwidth effectively
- And repeat
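The cycle reads naturally as a control loop. The sketch below is an illustrative pseudostructure, not the hardware state machine: a lock stands in for commit arbitration and a shared list stands in for the broadcast bus, both hypothetical names.

```python
import threading

# The TCC cycle as a loop: speculate and buffer, arbitrate for commit
# permission, broadcast the buffered stores as one packet, and repeat
# with the next transaction.
commit_lock = threading.Lock()   # stands in for commit arbitration
commit_log = []                  # stands in for the broadcast bus

def run_transactions(cpu_id, transactions):
    for txn in transactions:
        write_buffer = txn()             # speculatively execute and buffer
        with commit_lock:                # arbitrate: one committer at a time
            commit_log.append((cpu_id, write_buffer))  # commit as one packet

run_transactions(0, [lambda: {"a": 1}, lambda: {"b": 2}])
assert commit_log == [(0, {"a": 1}), (0, {"b": 2})]
```

Because every processor's stores reach the log through the same arbitration point, all processors observe commits in a single order, which is exactly what gives TCC its simple consistency story.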

Advantages of TCC
Trades bandwidth for simplicity and latency tolerance:
- Easier to build
- Not dependent on the timing/latency of loads and stores
Transactions eliminate locks:
- Transactions are inherently atomic
- Catches most common parallel programming errors
Shared memory consistency is simplified:
- The conventional model sequences individual loads and stores; now hardware only has to sequence transaction commits
Shared memory coherence is simplified:
- Processors may have copies of cache lines in any state (no MESI!)
- Commit order implies an ownership sequence

How to Use TCC
Divide code into potentially parallel tasks:
- Usually loop iterations
- For the initial division, tasks = transactions
- But transactions can be subdivided or grouped to match hardware limits (buffering)
Similar to threading in conventional parallel programming, but:
- We do not have to verify parallelism in advance
- Locking is handled automatically
- Easier to get parallel programs running correctly
The programmer then orders transactions as necessary:
- Ordering techniques are implemented using phase numbers
- Deadlock-free (at least one transaction is always the oldest)
- Livelock-free (watchdog hardware can easily insert barriers anywhere)

How to Use TCC
Three common ordering scenarios:
- Unordered, for purely parallel tasks
- Fully ordered, to specify sequential tasks (algorithm level)
- Partially ordered, to insert synchronization such as barriers
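Phase numbers capture all three scenarios in one rule. The helper below is an illustrative model, not hardware: transactions commit in ascending phase order, transactions sharing a phase are mutually unordered, and bumping the phase acts like a barrier.

```python
# Commit ordering by phase number: lower phases commit first; ties
# (same phase) keep arrival order, i.e. they are unordered among
# themselves. Purely a software model of the slide's rule.
def commit_order(txns):
    # txns: list of (name, phase) pairs
    return [name for name, _ in sorted(txns, key=lambda t: t[1])]

# two parallel tasks in phase 0, one post-barrier task in phase 1
order = commit_order([("A", 0), ("B", 0), ("C", 1)])
assert order == ["A", "B", "C"]   # C commits only after the phase-0 group
```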

Basic TCC Transaction Control Bits
In each local cache:
- Read bits (per cache line, or per word to eliminate false sharing): set on speculative loads; snooped by a committing transaction (writes by another CPU)
- Modified bits (per cache line): set on speculative stores; indicate what to roll back if a violation is detected; different from the dirty bit
- Renamed bits (optional, at word or byte granularity): indicate local updates (WAR) that do not cause a violation; subsequent reads of lines with these bits set do NOT set read bits, because a local WAR is not considered a violation
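How the three bits interact can be shown in a small sketch (the `CacheLine` fields and function names are illustrative, not the actual hardware encoding):

```python
# Per-line control bits: speculative loads set the read bit, speculative
# stores set the modified and renamed bits, and a remote commit's write
# addresses are snooped against the read bits to detect violations.
class CacheLine:
    def __init__(self):
        self.read = False      # set on speculative load
        self.modified = False  # set on speculative store (what to roll back)
        self.renamed = False   # written locally first; masks local WAR

lines = {}

def spec_load(addr):
    line = lines.setdefault(addr, CacheLine())
    if not line.renamed:       # a renamed line was written locally first,
        line.read = True       # so reading it cannot expose a remote RAW

def spec_store(addr):
    line = lines.setdefault(addr, CacheLine())
    line.modified = True
    line.renamed = True

def snoop_commit(write_addrs):
    # True if a remote commit to these addresses violates this transaction
    return any(addr in lines and lines[addr].read for addr in write_addrs)

spec_store(100)     # local write first ...
spec_load(100)      # ... then read: renamed, so the read bit stays clear
spec_load(200)      # ordinary speculative read: read bit set

assert snoop_commit([100]) is False   # local WAR is not a violation
assert snoop_commit([200]) is True    # remote write to a read line violates
```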

During a Transaction Commit
Need to collect all of the modified cache lines into a commit packet. Potential solutions:
- A separate write buffer, or
- An address buffer maintaining a list of the line tags to be committed (of what size?)
Then broadcast all writes as one single (large) packet to the rest of the system.

Re-executing a Transaction
Rollback is needed when a transaction cannot commit, so checkpoints must be taken prior to each transaction:
- Checkpoint memory: use the local cache; overflow is an issue, since conflict or capacity misses require all victim lines to be kept somewhere (e.g. a victim cache)
- Checkpoint register state: hardware approach, flash-copy the rename table / architectural register file; software approach, extra instruction overhead
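The register-checkpoint half can be sketched directly (register names and values below are made up): snapshot the architectural state at transaction begin, and restore it on a violation.

```python
import copy

# Flash-copy the architectural register state before the transaction;
# rollback restores the snapshot after a violation.
arch_regs = {"r1": 10, "r2": 20}
checkpoint = copy.copy(arch_regs)   # taken at transaction begin

arch_regs["r1"] = 99                # speculative execution modifies state

def rollback():
    arch_regs.clear()
    arch_regs.update(checkpoint)    # restore pre-transaction register state

rollback()
assert arch_regs == {"r1": 10, "r2": 20}
```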

Sample TCC Hardware
- Write buffers and L1 transaction control bits; the write buffer sits in the processor, before broadcast
- A broadcast bus or network to distribute commit packets: all processors see the commits in a single order, and snooping on broadcasts triggers violations, if necessary
- Commit arbitration/sequencing logic

Ideal Speedups with TCC
- equake_l: long transactions
- equake_s: short transactions

Speculative Write Buffer Needs
- Only a few KB of write buffering is needed, set by the natural transaction sizes in applications
- A small write buffer can capture 90% of the modified state
- Infrequent overflow can always be handled by committing early

Broadcast Bandwidth
- Broadcast is bursty
- Average bandwidth: needs ~16 bytes/cycle at 32 processors with whole modified lines, or ~8 bytes/cycle at 32 processors with dirty data only
- High, but feasible on-chip

TCC vs. MESI [PACT 2005]
Comparison across application, protocol, and processor count.

Implementation of MIT's LTM [HPCA 2005]
Transactional memory should support transactions of arbitrary size and duration. LTM (Large Transactional Memory):
- No change to the cache coherence protocol
- Abort when a memory conflict is detected: use the coherence protocol to check conflicts, and abort the (younger) transaction during conflict resolution to guarantee forward progress
- For potential rollback: checkpoint the rename table and physical registers
- Use the local cache for all speculative memory operations
- Use the shared L2 (or lower-level memory) for non-speculative data storage

Multiple In-Flight Transactions
Code: XBEGIN L1; ADD R1, R1, R1; ST 1000, R1; XBEGIN L2; ADD R1, R1, R1; ST 2000, R1
Rename table: R1 -> P1, ...   Saved set: {P1, ...}
During instruction decode:
- Maintain the rename table and saved bits in physical registers
- Saved bits track registers mentioned in the current rename table
- Constant number of set bits: every time a register is added to the saved set, one is also removed

Multiple In-Flight Transactions (cont.)
Rename table: R1 -> P1, then R1 -> P2, ...   Saved sets: {P1, ...}, {P2, ...}
When an XBEGIN is decoded:
- Snapshots are taken of the current rename table and S bits
- The snapshot is not active until the XBEGIN retires


Multiple In-Flight Transactions (cont.)
When the first XBEGIN retires:
- The snapshot taken at decode becomes active, which prevents P1 from being reused
- The first transaction is queued to become active in memory
- To abort, we just restore the active snapshot's rename table

Multiple In-Flight Transactions (cont.)
- Only registers in the active snapshot's saved set are reserved
- This implies that exactly as many physical registers as architectural registers are saved
- This number is strictly limited, even as we speculatively execute through multiple transactions

Multiple In-Flight Transactions (cont.)
- Normally, P1 would be freed here
- Since it is in the active snapshot's saved set, it is placed on the register-reserved list instead

Multiple In-Flight Transactions (cont.)
When the first transaction's end retires:
- Reserved physical registers (e.g. P1) are freed, and the active snapshot is cleared
- The store queue is empty

Multiple In-Flight Transactions (cont.)
The second transaction becomes active in memory.
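The rename-snapshot walkthrough above can be condensed into a small model. All structures here are simplified illustrations (a real rename table covers all architectural registers, and the saved-bit bookkeeping is more involved): XBEGIN flash-copies the mapping, snapshotted physical registers go to a reserved list instead of the free list, and abort restores the snapshot.

```python
# Sketch of LTM's rename-table snapshots: reserving the snapshotted
# physical registers keeps the pre-transaction values alive so abort
# can restore the mapping.
rename_table = {"R1": "P1"}
free_list = ["P2", "P3"]
reserved = []                 # physical regs pinned by the active snapshot
snapshot = None

def xbegin():
    global snapshot
    snapshot = dict(rename_table)       # flash-copy at XBEGIN

def rename_dest(arch_reg):
    old = rename_table.get(arch_reg)
    rename_table[arch_reg] = free_list.pop(0)
    if old is not None:
        # normally `old` would be freed; inside a transaction it is
        # reserved if the snapshot still maps to it
        (reserved if old in snapshot.values() else free_list).append(old)

def abort():
    global snapshot
    rename_table.clear()
    rename_table.update(snapshot)       # restore pre-transaction mapping
    free_list.extend(reserved)          # reserved regs become free again
    reserved.clear()
    snapshot = None

xbegin()
rename_dest("R1")                       # ADD R1, R1, R1 -> R1 now maps to P2
assert rename_table["R1"] == "P2" and reserved == ["P1"]
abort()
assert rename_table["R1"] == "P1"       # snapshot restored
```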

Cache Overflow Mechanism
Cache layout: per set, an O (overflow) bit; per line in each way, a T (transactional) bit, tag, and data; plus an overflow hashtable of key/data pairs.
Example code: ST 1000, 55; XBEGIN L1; LD R1, 1000; ST 2000, 66; ST 3000, 77; LD R1, 1000
Need to keep both current (speculative) values and rollback values:
- The common case is commit, so keep current values in the cache
- Problem: uncommitted current values may not fit in the local cache
- Solution: an overflow hashtable as an extension of the cache

Cache Overflow Mechanism (cont.)
- T bit per cache line: set if the line is accessed during a transaction
- O bit per cache set: indicates the set has overflowed
- Overflow storage lives in physical DRAM, allocated and resized by the OS
- On a miss, search the overflow table (complexity similar to a page-table walk); if the line is found, it is swapped with a line in the set

Cache Overflow Mechanism (cont.)
Start with non-transactional data in the cache: line (tag 1000, data 55), T = 0.

Cache Overflow Mechanism (cont.)
A transactional read sets the T bit: line (tag 1000, data 55), T = 1.

Cache Overflow Mechanism (cont.)
Most transactional writes are expected to fit in the cache: lines (1000, 55) and (2000, 66), both with T = 1.

Cache Overflow Mechanism (cont.)
A conflict miss (ST 3000, 77):
- The overflow sets the O bit
- Replacement takes place (LRU)
- The old data (1000, 55) is spilled to DRAM (the hashtable)

Cache Overflow Mechanism (cont.)
A miss to an overflowed line (LD R1, 1000) checks the overflow table:
- If found, swap it in (like a victim cache)
- Else, proceed as a normal miss

Cache Overflow Mechanism (cont.)
Abort:
- Invalidate all lines with T set (assume L2 or lower-level memory contains the original values)
- Discard the overflow hashtable
- Clear the O and T bits
Commit:
- Write back the hashtable, NACKing interventions during the write-back
- Clear the O and T bits in the cache
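The overflow walkthrough above can be replayed as a small model. This is a deliberately simplified sketch (a single 2-way set, FIFO standing in for LRU, and made-up function names), not the LTM hardware:

```python
# One 2-way cache set with T (transactional) bits, an O (overflow) bit,
# and a hashtable that absorbs lines evicted mid-transaction; a miss to
# an overflowed set searches the table and swaps the line back in,
# like a victim cache.
WAYS = 2
cache = []          # list of [tag, data, T] entries, at most WAYS long
overflow = {}       # tag -> (data, T): lines spilled to DRAM
o_bit = False

def access(tag, data=None, transactional=True):
    global o_bit
    for entry in cache:
        if entry[0] == tag:                # hit in the set
            if data is not None:
                entry[1] = data
            entry[2] |= transactional      # transactional access sets T
            return entry[1]
    if tag in overflow:                    # hit in overflow table: swap in
        spilled_data, t = overflow.pop(tag)
        entry = [tag, data if data is not None else spilled_data,
                 t or transactional]
    else:                                  # ordinary miss
        entry = [tag, data, transactional]
    if len(cache) == WAYS:                 # conflict miss: spill a victim
        o_bit = True
        victim = cache.pop(0)              # oldest entry stands in for LRU
        overflow[victim[0]] = (victim[1], victim[2])
    cache.append(entry)
    return entry[1]

access(1000, 55)        # ST 1000, 55
access(2000, 66)        # ST 2000, 66
access(3000, 77)        # ST 3000, 77 -> spills tag 1000, sets the O bit
assert o_bit and 1000 in overflow
assert access(1000) == 55   # LD 1000: found in the hashtable, swapped back
```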

LTM vs. Lock-Based Performance

Further Readings
- M. Herlihy and J. E. B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," ISCA 1993.
- R. Rajwar and J. R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," MICRO 2001.
- R. Rajwar and J. R. Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," ASPLOS 2002.
- J. F. Martinez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," ASPLOS 2002.
- L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, "Transactional Memory Coherence and Consistency," ISCA 2004.
- C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, "Unbounded Transactional Memory," HPCA 2005.
- A. McDonald, J. Chung, H. Chafi, C. C. Minh, B. D. Carlstrom, L. Hammond, C. Kozyrakis, and K. Olukotun, "Characterization of TCC on a Chip-Multiprocessor," PACT 2005.