Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations


Paper on TM Pathologies (ISCA 08)
- LL: lazy versioning, lazy conflict detection, committing transaction wins conflicts
- EL: lazy versioning, eager conflict detection, requester succeeds and others abort
- EE: eager versioning, eager conflict detection, requester stalls

Pathology 1: Friendly Fire
- Two conflicting transactions keep aborting each other
- Exponential back-off can handle the resulting livelock
- VM (version management): any; CD (conflict detection): eager; CR (conflict resolution): requester wins
- Fixable by using requester stalls?
- Also fixable by letting the requester win only if it is the older transaction
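The livelock and the timestamp fix can be illustrated with a toy model. This is only a sketch under assumptions that are not in the slides: two transactions alternate turns, each needs two of its own turns holding block X to commit, a younger requester stalls under the hypothetical `older_wins` policy, and the names `resolve` and `simulate` are invented for illustration.

```python
def resolve(policy, requester_ts, holder_ts):
    """Return 'holder' if the holder must abort, or 'stall' if the requester waits.

    Timestamps are start times: a smaller value means an older transaction.
    """
    if policy == "requester_wins":
        return "holder"                      # the holder always aborts
    if policy == "requester_stalls":
        return "stall"                       # the requester waits for the holder
    if policy == "older_wins":               # assumed form of the timestamp fix
        return "holder" if requester_ts < holder_ts else "stall"
    raise ValueError(policy)


def simulate(policy, turns=20):
    """Count commits when T0 and T1 alternate turns and both need block X."""
    ts = {0: 0, 1: 1}            # start timestamps; T0 is older
    holder = None                # transaction currently holding X
    progress = {0: 0, 1: 0}      # useful turns since the last (re)start
    commits = 0
    for turn in range(turns):
        t = turn % 2
        other = 1 - t
        if holder == other:
            if resolve(policy, ts[t], ts[other]) == "holder":
                progress[other] = 0          # holder aborts and loses its work
            else:
                continue                     # requester stalls this turn
        holder = t
        progress[t] += 1
        if progress[t] == 2:                 # commit and release X
            commits += 1
            progress[t] = 0
            holder = None
            ts[t] = max(ts.values()) + 1     # next transaction is the youngest
    return commits
```

In this model `simulate("requester_wins")` never commits, because each transaction aborts the other on every turn, while `simulate("older_wins")` makes steady progress.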

Pathology 2: Starving Writer
- A writer has to wait for the reader to finish, but if more readers keep showing up, the writer is starved (note that the directory allows new readers to proceed by just adding them to the list of sharers)
- VM: any; CD: eager; CR: requester stalls
- Fixable by forcing the directory to override requester-stalls on a starvation alarm
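A rough cycle-by-cycle sketch of this starvation, and of the alarm-based fix, is given below. The arrival pattern (a new reader every `gap` cycles, each running for `duration` cycles), the `alarm_after` threshold, and the function name `writer_wait` are all assumptions made for illustration.

```python
def writer_wait(gap, duration, horizon=1000, alarm_after=None):
    """Return the cycle at which the stalled writer proceeds, or None if it starves.

    The writer requests the block at cycle 0 while one reader already holds it.
    """
    active = [duration]                      # finish times of current sharers
    for cycle in range(1, horizon):
        active = [end for end in active if end > cycle]   # finished readers leave
        if alarm_after is not None and cycle >= alarm_after:
            return cycle                     # alarm: directory overrides requester-stalls
        if not active:
            return cycle                     # no sharers left, the writer proceeds
        if cycle % gap == 0:
            active.append(cycle + duration)  # directory admits another reader
    return None
```

When readers arrive faster than they finish (for example `gap=2`, `duration=5`), the sharer list never empties and the writer only gets through once an alarm threshold is set.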

Pathology 3: Serialized Commit
- If there's a single commit token, transaction commit is serialized
- VM: lazy; CD: lazy; CR: any
- There are ways to alleviate this problem (discussed in the last class)

Pathology 4: Futile Stall
- A transaction is stalling on another transaction that ultimately aborts and takes a while to reinstate old values; there is no good workaround
- VM: any; CD: eager; CR: requester stalls

Pathology 5: Starving Elder
- Small successful transactions can keep aborting a large transaction
- VM: lazy; CD: lazy; CR: committer wins
- The large transaction can eventually grab the token and not release it until after it commits

Pathology 6: Restart Convoy
- A number of similar (conflicting) transactions execute together: one wins and the others all abort; shortly afterwards, these transactions all return and repeat the process
- Use exponential back-off
- VM: lazy; CD: lazy; CR: committer wins
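Exponential back-off spreads the restarts of the aborted transactions over a window that doubles with each retry, so they are less likely to collide again as a group. The sketch below is one plausible form of it; the `base` and `cap` values, the cycle units, and the function names are assumptions, not details from the slides.

```python
import random


def backoff_window(attempt, base=8, cap=4096):
    """Upper bound (in cycles) of the random wait before retry number `attempt`."""
    return min(cap, base * 2 ** attempt)


def restart_delays(num_aborted, attempt, seed=0):
    """Draw one random restart delay per aborted transaction."""
    rng = random.Random(seed)                # seeded only to make the sketch repeatable
    window = backoff_window(attempt)
    return [rng.randrange(window) for _ in range(num_aborted)]
```

Capping the window keeps a transaction that has aborted many times from waiting unboundedly long.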

Pathology 7: Dueling Upgrades
- If two transactions both read the same object and then both decide to write it, a deadlock is created
- VM: eager; CD: eager; CR: requester stalls
- Exacerbated by the Futile Stall pathology
- Solution?
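The deadlock shows up as a cycle in a waits-for graph: under requester stalls, each reader that asks for write permission waits on every other reader of the object. The following sketch builds that graph and checks it for a cycle; the helper names and the graph representation are illustrative assumptions.

```python
def waits_for_after_upgrades(readers):
    """Each reader requests write permission and stalls on every other reader."""
    return {t: {u for u in readers if u != t} for t in readers}


def has_cycle(waits_for):
    """Depth-first search for a cycle in the waits-for graph."""
    WHITE, GREY, BLACK = 0, 1, 2             # unvisited, on the current path, done
    colour = {t: WHITE for t in waits_for}

    def visit(t):
        colour[t] = GREY
        for u in waits_for.get(t, ()):
            if colour.get(u, WHITE) == GREY:
                return True                  # back edge: a cycle, hence deadlock
            if colour.get(u, WHITE) == WHITE and visit(u):
                return True
        colour[t] = BLACK
        return False

    return any(colour[t] == WHITE and visit(t) for t in waits_for)
```

With two readers, `has_cycle(waits_for_after_upgrades(["T1", "T2"]))` reports a cycle, whereas a single writer waiting on a single reader does not.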

Four Extensions
- Predictor: predict if the read will soon be followed by a write and acquire write permissions aggressively
- Hybrid: if a transaction believes it is a Starving Writer, it can force other readers to abort; for everything else, use requester stalls
- Timestamp: in the EL case, requester wins only if it is the older transaction (handles the Friendly Fire pathology)
- Backoff: in the LL case, aborting transactions invoke exponential back-off to prevent convoy formation

Coherence vs. Consistency
- Recall that coherence guarantees (i) that a write will eventually be seen by other processors, and (ii) write serialization (all processors see writes to the same location in the same order)
- The consistency model defines the ordering of writes and reads to different memory locations; the hardware guarantees a certain consistency model and the programmer attempts to write correct programs with those assumptions

Example Programs

Initially, A = B = 0
    P1                     P2
    A = 1                  B = 1
    if (B == 0)            if (A == 0)
      critical section       critical section

    P1                     P2
    Data = 2000            while (Head == 0) { }
    Head = 1               ... = Data

Initially, A = B = 0
    P1           P2               P3
    A = 1        if (A == 1)      if (B == 1)
                   B = 1            register = A

Sequential Consistency
    P1: Instr-a, Instr-b, Instr-c, Instr-d, ...
    P2: Instr-A, Instr-B, Instr-C, Instr-D, ...
We assume:
- Within a program, program order is preserved
- Each instruction executes atomically
- Instructions from different threads can be interleaved arbitrarily
Valid executions: abAcBCDdeE or ABCDEFabGc or abcAdBe or aAbBcCdDeE or ...
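Whether an execution is valid under these assumptions comes down to checking that it is a merge of the per-thread instruction streams with no thread reordered. A minimal checker, assuming each instruction has a unique single-character label as in the slide (the function names are illustrative):

```python
from math import comb


def preserves_program_order(trace, programs):
    """True if `trace` interleaves prefixes of `programs` without reordering any thread."""
    positions = [0] * len(programs)          # next expected instruction per thread
    for instr in trace:
        for i, prog in enumerate(programs):
            if positions[i] < len(prog) and prog[positions[i]] == instr:
                positions[i] += 1
                break
        else:
            return False                     # no thread could issue `instr` next
    return True


def count_interleavings(n, m):
    """Number of complete interleavings of two threads with n and m instructions."""
    return comb(n + m, n)
```

For example, `preserves_program_order("abAcBCDdeE", ["abcde", "ABCDE"])` holds, while a trace starting with `ba` is rejected because it reorders P1.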

Sequential Consistency
- Programmers assume SC; it makes it much easier to reason about program behavior
- Hardware innovations can disrupt the SC model
- For example, if we assume write buffers, or out-of-order execution, or if we drop ACKs in the coherence protocol, the previous programs yield unexpected outputs

Consistency Example - 1
- Consider a multiprocessor with bus-based snooping cache coherence and a write buffer between the CPU and the cache

Initially A = B = 0
    P1                 P2
    A ← 1              B ← 1
    if (B == 0)        if (A == 0)
      Crit.Section       Crit.Section

- The programmer expected the above code to implement a lock; because of write buffering, both processors can enter the critical section
- The consistency model lets the programmer know what assumptions they can make about the hardware's reordering capabilities
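One schedule that lets both processors into the critical section can be traced by hand. The sketch below replays it with an explicit per-processor write buffer; the dictionary-based memory and the function name are modelling assumptions, and reads are assumed to bypass buffered writes to other addresses.

```python
def run_with_write_buffers():
    memory = {"A": 0, "B": 0}
    buffer_p1, buffer_p2 = [], []            # per-processor FIFO write buffers

    buffer_p1.append(("A", 1))               # P1: A = 1 (buffered, not yet visible)
    buffer_p2.append(("B", 1))               # P2: B = 1 (buffered, not yet visible)

    p1_sees_b = memory["B"]                  # P1: if (B == 0) reads the stale 0
    p2_sees_a = memory["A"]                  # P2: if (A == 0) reads the stale 0

    for addr, value in buffer_p1 + buffer_p2:
        memory[addr] = value                 # the buffers drain only afterwards

    # Each flag says whether that processor entered the critical section.
    return p1_sees_b == 0, p2_sees_a == 0, memory
```

Both flags come back true even though memory ends with A = B = 1, which is the outcome the programmer assumed was impossible.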

Consistency Example - 2
    P1                 P2
    Data = 2000        while (Head == 0) { }
    Head = 1           ... = Data

- Sequential consistency requires program order:
    the write to Data has to complete before the write to Head can begin
    the read of Head has to complete before the read of Data can begin

Consistency Example - 3
Initially, A = B = 0
    P1           P2               P3
    A = 1        if (A == 1)      if (B == 1)
                   B = 1            register = A

- Sequential consistency can be had if a process makes sure that everyone has seen an update before that value is read; otherwise, write atomicity is violated
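Under sequential consistency, P3 can never load the old value of A once it has seen B == 1. The sketch below checks this by brute force, enumerating every interleaving of the three threads with each memory access treated as atomic. The tiny state-machine encoding (`p1`, `p2`, `p3`, `THREADS`) is an assumption made for illustration.

```python
def p1(pc, mem, local):
    mem["A"] = 1                             # A = 1
    return None                              # None means the thread has finished


def p2(pc, mem, local):
    if pc == 0:
        return 1 if mem["A"] == 1 else None  # if (A == 1)
    mem["B"] = 1                             # B = 1
    return None


def p3(pc, mem, local):
    if pc == 0:
        return 1 if mem["B"] == 1 else None  # if (B == 1)
    local["register"] = mem["A"]             # register = A
    return None


THREADS = [p1, p2, p3]


def register_values_under_sc():
    """Return every value P3's register can hold (None if the load never runs)."""
    results = set()

    def run(mem, pcs, locals_):
        runnable = [i for i, pc in enumerate(pcs) if pc is not None]
        if not runnable:
            results.add(locals_[2].get("register"))
            return
        for i in runnable:                   # try every thread as the next to step
            new_mem = dict(mem)
            new_locals = [dict(l) for l in locals_]
            next_pc = THREADS[i](pcs[i], new_mem, new_locals[i])
            run(new_mem, pcs[:i] + (next_pc,) + pcs[i + 1:], new_locals)

    run({"A": 0, "B": 0}, (0, 0, 0), [{}, {}, {}])
    return results
```

The only reachable results are "load not executed" and 1. A machine that lets P2 read A = 1 before P3 can see that write could additionally produce 0.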

Sequential Consistency
- A multiprocessor is sequentially consistent if the result of the execution is achievable by maintaining program order within a processor and interleaving accesses by different processors in an arbitrary fashion
- The multiprocessors in the previous examples are not sequentially consistent
- Can implement sequential consistency by requiring the following: program order, write serialization, everyone has seen an update before a value is read; this is very intuitive for the programmer, but extremely slow

HW Performance Optimizations
- Program order is a major constraint; the following try to get around this constraint without violating sequential consistency:
    if a write has been stalled, prefetch the block in exclusive state to reduce traffic when the write happens
    allow out-of-order reads, with the facility to roll back if the ROB detects a violation (detected by re-executing the read later)

Relaxed Consistency Models (HW/SW)
- We want an intuitive programming model (such as sequential consistency) and we want high performance
- We care about data races and re-ordering constraints for some parts of the program and not for others; hence, we will relax some of the constraints for sequential consistency for most of the program, but enforce them for specific portions of the code
- Fence instructions are special instructions that require all previous memory accesses to complete before proceeding (sequential consistency)

Fences
    P1                                  P2
    { Region of code with no races }    { Region of code with no races }
    Fence                               Fence
    Acquire_lock                        Acquire_lock
    Fence                               Fence
    { Racy code }                       { Racy code }
    Fence                               Fence
    Release_lock                        Release_lock
    Fence                               Fence
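The effect of a fence can be checked on the earlier A/B flag example with a small exhaustive model of a machine with per-processor FIFO write buffers. This is a sketch under stated assumptions: the instruction encoding (`"st"`, `"ld"`, `"fence"`), one result register per thread, and the function name `outcomes` are all invented for illustration. A fence here may execute only once the issuing processor's buffer is empty.

```python
def outcomes(programs, initial=(("A", 0), ("B", 0))):
    """Explore every schedule; return the set of final per-thread register tuples."""
    results, seen = set(), set()
    n = len(programs)

    def replace(tup, i, value):
        return tup[:i] + (value,) + tup[i + 1:]

    def run(mem, pcs, bufs, regs):
        state = (mem, pcs, bufs, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pcs[i] == len(programs[i]) for i in range(n)):
            results.add(regs)
            return
        for i in range(n):
            if bufs[i]:                      # drain the oldest buffered write
                (var, val), rest = bufs[i][0], bufs[i][1:]
                new_mem = tuple(sorted({**dict(mem), var: val}.items()))
                run(new_mem, pcs, replace(bufs, i, rest), regs)
            if pcs[i] == len(programs[i]):
                continue
            op = programs[i][pcs[i]]
            next_pcs = replace(pcs, i, pcs[i] + 1)
            if op[0] == "st":                # a write enters the buffer
                entry = ((op[1], op[2]),)
                run(mem, next_pcs, replace(bufs, i, bufs[i] + entry), regs)
            elif op[0] == "ld":              # newest own buffered value, else memory
                own = [val for var, val in bufs[i] if var == op[1]]
                value = own[-1] if own else dict(mem)[op[1]]
                run(mem, next_pcs, bufs, replace(regs, i, value))
            elif op[0] == "fence" and not bufs[i]:
                run(mem, next_pcs, bufs, regs)   # fence passes only with an empty buffer

    run(tuple(sorted(initial)), (0,) * n, ((),) * n, (None,) * n)
    return results


unfenced = [[("st", "A", 1), ("ld", "B")], [("st", "B", 1), ("ld", "A")]]
fenced = [[("st", "A", 1), ("fence",), ("ld", "B")],
          [("st", "B", 1), ("fence",), ("ld", "A")]]
```

In this model `outcomes(unfenced)` contains `(0, 0)`, meaning both processors read 0 and enter the critical section, while `outcomes(fenced)` does not.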

Potential Relaxations
- Program Order (all refer to different memory locations):
    Write to Read program order
    Write to Write program order
    Read to Read and Read to Write program orders
- Write Atomicity (refers to same memory location):
    Read others' write early
- Write Atomicity and Program Order:
    Read own write early

Relaxations (X = the model allows this relaxation)

    Model     W→R Order   W→W Order   R→RW Order   Rd others' Wr early   Rd own Wr early
    IBM 370   X           -           -            -                     -
    TSO       X           -           -            -                     X
    PC        X           -           -            X                     X
    SC        -           -           -            -                     X

- IBM 370: a read can complete before an earlier write to a different address, but a read cannot return the value of a write unless all processors have seen the write
- SPARC V8 Total Store Ordering (TSO): a read can complete before an earlier write to a different address, but a read cannot return the value of a write by another processor unless all processors have seen the write (it returns the value of its own write before others see it)
- Processor Consistency (PC): a read can complete before an earlier write (by any processor to any memory location) has been made visible to all

Performance Comparison
- Taken from Gharachorloo, Gupta, Hennessy, ASPLOS 91
- Studies three benchmark programs and three different architectures
- Benchmarks:
    MP3D: 3-D particle simulator
    LU: LU-decomposition for dense matrices
    PTHOR: logic simulator
- Architectures:
    LFC: aggressive; lockup-free caches, write buffer with bypassing
    RDBYP: only write buffer with bypassing
    BASIC: no write buffer, no lockup-free caches


Summary
- Sequential consistency restricts performance (even more when memory and network latencies increase relative to processor speeds)
- Relaxed memory models relax different combinations of the five constraints for SC
- Most commercial systems are not sequentially consistent and rely on the programmer to insert appropriate fence instructions to provide the illusion of SC
