Lecture: Consistency Models, TM


Topics: consistency models, TM intro (Section 5.6)
No class on Monday (please watch the TM videos)
Wednesday: TM wrap-up, interconnection networks

Coherence vs. Consistency

Recall that coherence guarantees (i) that a write will eventually be seen by other processors, and (ii) write serialization (all processors see writes to the same location in the same order).
The consistency model defines the ordering of writes and reads to different memory locations: the hardware guarantees a certain consistency model, and the programmer writes programs that are correct under those assumptions.

Example Programs

Example 1: Initially, A = B = 0
    P1                    P2
    A = 1                 B = 1
    if (B == 0)           if (A == 0)
        critical section      critical section

Example 2: Initially, Head = Data = 0
    P1                    P2
    Data = 2000           while (Head == 0) { }
    Head = 1              ... = Data

Example 3: Initially, A = B = 0
    P1           P2               P3
    A = 1        if (A == 1)
                     B = 1        if (B == 1)
                                      register = A

Sequential Consistency

    P1: Instr-a  Instr-b  Instr-c  Instr-d
    P2: Instr-A  Instr-B  Instr-C  Instr-D

We assume:
Within a program, program order is preserved
Each instruction executes atomically
Instructions from different threads can be interleaved arbitrarily

Valid executions: abAcBdCD or ABCDabcd or aAbBcCdD or abcdABCD or ...
(any interleaving that keeps a-b-c-d and A-B-C-D each in program order)

Problem 1

What are the possible outputs for the program below?  Assume x = y = 0 at the start of the program.

    Thread 1        Thread 2
    x = 10          y = 20
    y = x + y       x = y + x
    Print y

Problem 1 (solution)

What are the possible outputs for the program below?  Assume x = y = 0 at the start of the program.

    Thread 1           Thread 2
    A: x = 10          a: y = 20
    B: y = x + y       b: x = y + x
    C: Print y

Possible scenarios: choose the 2 slots for a and b out of 5, i.e. C(5,2) = 10 interleavings

    ABCab  ABaCb  ABabC  AaBCb  AaBbC
      10     20     20     30     30
    AabBC  aABCb  aABbC  aAbBC  abABC
      50     30     30     50     30
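
As a sanity check, here is a minimal C sketch (my own, not from the slides) that enumerates every interleaving of the two threads that preserves program order and reports what step C prints; it reproduces the ten outcomes above.

    #include <stdio.h>

    /* Thread 1 = steps A, B, C; Thread 2 = steps a, b.
       Explore every order that keeps each thread's steps in program order. */
    static void explore(int i1, int i2, int x, int y, int printed,
                        char *trace, int depth)
    {
        if (i1 == 3 && i2 == 2) {                 /* both threads done */
            trace[depth] = '\0';
            printf("%-5s prints %d\n", trace, printed);
            return;
        }
        if (i1 < 3) {                             /* next step of thread 1 */
            int nx = x, ny = y, np = printed;
            char c;
            if (i1 == 0)      { nx = 10;      c = 'A'; }   /* x = 10     */
            else if (i1 == 1) { ny = x + y;   c = 'B'; }   /* y = x + y  */
            else              { np = y;       c = 'C'; }   /* Print y    */
            trace[depth] = c;
            explore(i1 + 1, i2, nx, ny, np, trace, depth + 1);
        }
        if (i2 < 2) {                             /* next step of thread 2 */
            int nx = x, ny = y;
            char c;
            if (i2 == 0) { ny = 20;    c = 'a'; }          /* y = 20     */
            else         { nx = y + x; c = 'b'; }          /* x = y + x  */
            trace[depth] = c;
            explore(i1, i2 + 1, nx, ny, printed, trace, depth + 1);
        }
    }

    int main(void)
    {
        char trace[6];
        explore(0, 0, 0, 0, -1, trace, 0);        /* x = y = 0 initially */
        return 0;
    }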

Sequential Consistency

Programmers assume SC; it makes it much easier to reason about program behavior.
Hardware innovations can disrupt the SC model.  For example, if we add write buffers or out-of-order execution, or if we drop ACKs in the coherence protocol, the previous programs yield unexpected outputs.

Consistency Example - I

An out-of-order core sees no dependence between the instructions dealing with A and the instructions dealing with B, so those operations can be re-ordered; this is fine for a single thread, but not for multiple threads.

    Initially, A = B = 0
    P1                  P2
    A = 1               B = 1
    if (B == 0)         if (A == 0)
        Crit. Section       Crit. Section

The consistency model lets the programmer know what assumptions they can make about the hardware's reordering capabilities.

Consistency Example - 2

    Initially, A = B = 0
    P1           P2               P3
    A = 1        if (A == 1)
                     B = 1        if (B == 1)
                                      register = A

If coherence invalidations did not require ACKs, we could not confirm that everyone has seen the value of A.

Sequential Consistency

A multiprocessor is sequentially consistent if the result of any execution is achievable by maintaining program order within each processor and interleaving the accesses of different processors in an arbitrary fashion.
Sequential consistency can be implemented by requiring the following: program order, write serialization, and that everyone has seen an update before the value is read.
This is very intuitive for the programmer, but extremely slow.  Alternatives:
Add optimizations to the hardware
Offer a relaxed memory consistency model and fences

Relaxed Consistency Models

We want an intuitive programming model (such as sequential consistency) and we want high performance.
We care about data races and re-ordering constraints for some parts of the program and not for others; hence, we relax the sequential-consistency constraints for most of the program, but enforce them for specific portions of the code.
Fence instructions are special instructions that require all previous memory accesses to complete before proceeding (sequential consistency at that point).

Fences

    P1                               P2
    {                                {
      Region of code with no races     Region of code with no races
    }                                }
    Fence                            Fence
    Acquire_lock                     Acquire_lock
    Fence                            Fence
    {                                {
      Racy code                        Racy code
    }                                }
    Fence                            Fence
    Release_lock                     Release_lock
    Fence                            Fence
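
For concreteness, here is a minimal C11 sketch of the same pattern (my example, not from the slides): the lock acquire and release are bracketed by full fences so that the race-free region cannot be reordered past the racy region.  The lock variable (my_lock) and the shared counter are hypothetical names introduced just for this illustration.

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool my_lock = false;      /* hypothetical lock variable */
    static int shared_counter = 0;           /* the "racy" data            */

    void update_shared(void)
    {
        /* Race-free region: thread-private work can be freely reordered
           by the compiler/hardware up to the fence below. */

        atomic_thread_fence(memory_order_seq_cst);   /* Fence */

        /* Acquire_lock: spin until we flip the flag from false to true. */
        while (atomic_exchange(&my_lock, true)) { /* spin */ }

        atomic_thread_fence(memory_order_seq_cst);   /* Fence */

        shared_counter++;                            /* racy code, now protected */

        atomic_thread_fence(memory_order_seq_cst);   /* Fence */

        atomic_store(&my_lock, false);               /* Release_lock */

        atomic_thread_fence(memory_order_seq_cst);   /* Fence */
    }

In practice the acquire/release semantics of the atomic operations themselves usually provide the needed ordering; the explicit fences here simply mirror the structure on the slide.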

Lock vs. Optimistic Concurrency

Lock-based version:
    lockit:  LL     R2, 0(R1)
             BNEZ   R2, lockit
             DADDUI R2, R0, #1
             SC     R2, 0(R1)
             BEQZ   R2, lockit
             Critical Section
             ST     0(R1), #0

LL-SC is being used to figure out if we were able to acquire the lock without anyone interfering; we then enter the critical section.

Optimistic version:
    tryagain:  LL     R2, 0(R1)
               DADDUI R2, R2, R3
               SC     R2, 0(R1)
               BEQZ   R2, tryagain

If the critical section only involves one memory location, the critical section can be captured within the LL-SC; instead of spinning on the lock acquire, you may now be spinning trying to atomically execute the CS.
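
The same contrast in C11 atomics (a sketch of my own translation; a compare-and-swap retry loop stands in for LL-SC):

    #include <stdatomic.h>

    static atomic_int lock = 0;      /* 0 = free, 1 = held        */
    static atomic_int counter = 0;   /* the single shared location */

    /* Lock-based update: spin on the lock, then touch the data. */
    void add_locked(int r3)
    {
        while (atomic_exchange(&lock, 1)) { /* spin until acquired */ }
        atomic_store(&counter, atomic_load(&counter) + r3);
        atomic_store(&lock, 0);
    }

    /* Optimistic update: retry the whole read-modify-write atomically,
       the way the LL-SC loop on the slide does. */
    void add_optimistic(int r3)
    {
        int old = atomic_load(&counter);
        while (!atomic_compare_exchange_weak(&counter, &old, old + r3)) {
            /* 'old' was refreshed by the failed CAS; just retry */
        }
    }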

Transactions

New paradigm to simplify programming: instead of lock-unlock, use transaction begin-end.
Locks are blocking; transactions execute speculatively in the hope that there will be no conflicts.
Can yield better performance; eliminates deadlocks.
The programmer can freely encapsulate code sections within transactions and (for the most part) not worry about the impact on performance and correctness.
The programmer specifies the code sections they'd like to see execute atomically; the hardware takes care of the rest (provides the illusion of atomicity).

TM Overview

Transactions

Transactional semantics:
When a transaction executes, it is as if the rest of the system is suspended and the transaction runs in isolation.
The reads and writes of a transaction happen as if they are all a single atomic operation.
If the above conditions are not met, the transaction fails to commit (aborts) and tries again.

    transaction begin
      read shared variables
      arithmetic
      write shared variables
    transaction end

Example

Producer-consumer relationships: producers place tasks at the tail of a work-queue and consumers pull tasks out of the head.

    Enqueue                          Dequeue
    transaction begin                transaction begin
      if (tail == NULL)                if (head->next == NULL)
        update head and tail             update head and tail
      else                             else
        update tail                      update head
    transaction end                  transaction end

With locks, neither thread can proceed in parallel since head/tail may be updated; with transactions, enqueue and dequeue can proceed in parallel, and transactions will be aborted only if the queue is nearly empty.

Example

Hash table implementation:

    transaction begin
      index = hash(key);
      head = bucket[index];
      traverse linked list until key matches
      perform operations
    transaction end

Most operations will likely not conflict, so transactions proceed in parallel.
Coarse-grain lock: serializes all operations.
Fine-grained locks (one for each bucket): more complexity, more storage; concurrent reads within a bucket not allowed, concurrent writes to different elements of a bucket not allowed.
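
A sketch of what the lookup above might look like in C, assuming hypothetical tx_begin()/tx_end() primitives exposed by the TM system (a compiler extension such as GCC's __transaction_atomic block would play a similar role); the hash function and node layout are my own illustration:

    #include <string.h>

    #define NBUCKETS 256

    struct node { const char *key; int value; struct node *next; };
    static struct node *bucket[NBUCKETS];

    /* Hypothetical primitives provided by the TM runtime/hardware. */
    void tx_begin(void);
    void tx_end(void);

    static unsigned hash(const char *key)
    {
        unsigned h = 5381;                       /* simple string hash */
        while (*key) h = h * 33 + (unsigned char)*key++;
        return h % NBUCKETS;
    }

    /* Look up a key; the whole traversal is one atomic transaction. */
    int lookup(const char *key, int *value_out)
    {
        int found = 0;
        tx_begin();
        for (struct node *n = bucket[hash(key)]; n != NULL; n = n->next) {
            if (strcmp(n->key, key) == 0) {
                *value_out = n->value;
                found = 1;
                break;
            }
        }
        tx_end();
        return found;
    }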

TM Implementation

[Figure: two cores, each with its own cache]

Caches track read-sets and write-sets.
Writes are made visible only at the end of the transaction.
At transaction commit, make your writes visible; others may abort.

Detecting Conflicts - Basic Implementation

Writes can be cached (can't be written to memory); if the block needs to be evicted, flag an overflow (abort the transaction for now); on an abort, invalidate the written cache lines.
Keep track of the read-set and write-set (bits in the cache) for each transaction.
When another transaction commits, compare its write-set with your own read-set; a match causes an abort.
At transaction end, express intent to commit and broadcast the write-set (transactions can commit in parallel if their write-sets do not intersect).
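
The conflict check itself is just a set intersection over cache-line addresses; a minimal sketch (my own illustration, with plain arrays standing in for the per-line rd/wr bits in the cache):

    #include <stdbool.h>
    #include <stddef.h>

    /* A transaction's read-set and write-set, kept as lists of cache-line
       addresses (in hardware these would be rd/wr bits on the cache lines). */
    struct tx_sets {
        const void **read_set;   size_t n_reads;
        const void **write_set;  size_t n_writes;
    };

    /* Returns true if the committing transaction's write-set intersects
       this transaction's read-set, i.e. this transaction must abort. */
    bool must_abort(const struct tx_sets *mine, const struct tx_sets *committer)
    {
        for (size_t i = 0; i < committer->n_writes; i++)
            for (size_t j = 0; j < mine->n_reads; j++)
                if (committer->write_set[i] == mine->read_set[j])
                    return true;
        return false;
    }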

Summary of TM Benefits

As easy to program as coarse-grain locks
Performance similar to fine-grain locks
Speculative parallelization
Avoids deadlock
Resilient to faults

Design Space

Data versioning:
Eager: based on an undo log
Lazy: based on a write buffer

Conflict detection:
Optimistic detection: check for conflicts at commit time (proceed optimistically through the transaction)
Pessimistic detection: every read/write checks for conflicts (reduces work during commit)

Lazy Implementation

An implementation for a small-scale multiprocessor with a snooping-based protocol
Lazy versioning and lazy conflict detection
Does not allow transactions to commit in parallel

Lazy Implementation

When a transaction issues a read, fetch the block in read-only mode (if not already in cache) and set the rd-bit for that cache line.
When a transaction issues a write, fetch that block in read-only mode (if not already in cache), set the wr-bit for that cache line, and make the changes in cache.
If a line with the wr-bit set is evicted, the transaction must be aborted (or must rely on some software mechanism to handle saving the overflowed data).

Lazy Implementation

When a transaction reaches its end, it must now make its writes permanent.
A central arbiter is contacted (easy on a bus-based system); the winning transaction holds on to the bus until all written cache-line addresses have been broadcast (this is the commit).  A writeback is not needed until the line is evicted; we must simply invalidate other readers of these cache lines.
When another transaction (that has not yet begun to commit) sees an invalidation for a line in its rd-set, it realizes its lack of atomicity and aborts (clears its rd- and wr-bits and re-starts).

Lazy Implementation

Lazy versioning: changes are made locally; the master copy is updated only at the end of the transaction.
Lazy conflict detection: conflicts are checked only when one of the transactions reaches its end.
Aborts are quick (just clear the bits in the cache, flush the pipeline, and reinstate a register checkpoint).
Commits are slow (must check for conflicts; all the coherence operations for writes are deferred until transaction end).
No fear of deadlock/livelock: the first transaction to acquire the bus will commit successfully.
Starvation is possible; additional mechanisms are needed.

Lazy Implementation - Parallel Commits

Writes cannot be rolled back; hence, before allowing two transactions to commit in parallel, we must ensure that they do not conflict with each other.
One possible implementation: the central arbiter collects a signature from each committing transaction (a compressed representation of all touched addresses).
The arbiter does not grant commit permission if it detects a possible conflict with the rd/wr-sets of transactions that are in the process of committing.
The lazy design can also work with directory protocols.
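
One common way to build such a signature is a Bloom-filter-like bit vector over hashed cache-line addresses; a minimal sketch (my own illustration, not the specific scheme from the lecture):

    #include <stdbool.h>
    #include <stdint.h>

    #define SIG_BITS 256                      /* signature size in bits */

    typedef struct { uint64_t bits[SIG_BITS / 64]; } signature_t;

    /* Two simple address hashes; a real design would choose these carefully. */
    static unsigned h1(uintptr_t a) { return (unsigned)((a >> 6) % SIG_BITS); }
    static unsigned h2(uintptr_t a) { return (unsigned)(((a >> 6) * 2654435761u) % SIG_BITS); }

    static void sig_set(signature_t *s, unsigned bit)
    {
        s->bits[bit / 64] |= 1ull << (bit % 64);
    }

    /* Add a touched cache-line address to the transaction's signature. */
    void sig_add(signature_t *s, uintptr_t addr)
    {
        sig_set(s, h1(addr));
        sig_set(s, h2(addr));
    }

    /* Conservative conflict test: true if the signatures MAY share an address.
       False positives only force an unnecessary serialization; false negatives
       cannot occur, which is what correctness requires. */
    bool sig_may_conflict(const signature_t *a, const signature_t *b)
    {
        for (int i = 0; i < SIG_BITS / 64; i++)
            if (a->bits[i] & b->bits[i])
                return true;
        return false;
    }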
