Module 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models"

Memory Consistency Models

The Lecture Contains:
- Memory consistency
- SC
- SC in MIPS R10000
- Relaxed models
- Total store ordering
- PC and PSO
- TSO, PC, PSO: hardware support
- Weak ordering (WO)

[From Chapters 9 and 11 of Culler, Singh, Gupta]
[Additional reading: Adve and Gharachorloo, WRL Tech Report, 1995]

Memory consistency
- A coherence protocol is not enough to completely specify the output(s) of a parallel program
- The coherence protocol only provides the foundation to reason about legal outcomes of accesses to the same memory location
- The consistency model tells us the possible outcomes arising from legal orderings of accesses to all memory locations
- A shared memory machine advertises the consistency model it supports; it is a contract with the writers of parallel software and the writers of parallelizing compilers
- Implementing a memory consistency model is really a hardware-software trade-off: a strict model such as sequential consistency (SC) offers executions that are intuitive but may suffer in performance; relaxed models (e.g., RC) make program reasoning harder but may offer better performance

SC
- Recall that an execution is SC if the memory operations form a valid total order, i.e., the order is an interleaving of the per-process program orders
- Sufficient conditions require that a new memory operation cannot issue until the previous one has completed
- This is too restrictive and essentially disallows any reordering of instructions by the compiler as well as the hardware
- No microprocessor that supports SC implements the sufficient conditions; instead, all out-of-order execution is allowed and a proper recovery mechanism is implemented in case of a memory order violation
- Let's discuss the MIPS R10000 implementation

SC in MIPS R10000
- Issues instructions out of program order, but commits in order
- The problem is with speculatively executed loads: a load may execute and use a value long before it finally commits
- In the meantime, some other processor may modify that value through a store, and the store may commit (i.e., become globally visible) before the load commits: this may violate SC (why?)
- How do you detect such a violation? How do you recover and guarantee an SC execution?
- Any special consideration for prefetches? Binding vs. non-binding prefetches
- In the MIPS R10000 a store remains at the head of the active list until it has completed in the cache
- Can we just remove it as soon as it issues and let the other instructions commit (the store can complete from the store buffer at a later point)? How far can we go and still guarantee SC?
- The Stanford DASH multiprocessor, on receiving a read reply that is already invalidated, forces the processor to retry that load
- Why can't it use the value in the cache line and then discard the line? Does the cache controller need to take any special action when a line is replaced from the cache?
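To make the SC definition concrete, here is a small litmus test as a hedged C++ sketch (this example is not part of the original slides; the names X, Y, r1 and r2 are purely illustrative). The default seq_cst ordering of std::atomic requests SC behavior for these accesses, so the only legal results are interleavings of the two program orders: (r1, r2) may be (0, 1), (1, 0) or (1, 1), but never (0, 0).

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> X{0}, Y{0};   // shared locations (illustrative names)
    int r1, r2;                    // per-thread results, read after join

    void p0() { X.store(1); r1 = Y.load(); }  // P0: X = 1; r1 = Y;
    void p1() { Y.store(1); r2 = X.load(); }  // P1: Y = 1; r2 = X;

    int main() {
        std::thread t0(p0), t1(p1);
        t0.join(); t1.join();
        // Under SC (default seq_cst), (r1, r2) == (0, 0) is impossible:
        // every interleaving executes at least one store before both loads.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }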

Relaxed models
- Implementing SC requires complex hardware
- Is there an example that clearly shows the disaster of not implementing all of this? Observe that the cache coherence protocol is orthogonal
- But such violations are rare: does it make sense to invest so much time (for verification) and hardware (associative lookup logic in the load queue)?
- Many processors today relax the consistency model to get rid of the complex hardware and gain some extra performance, at the cost of making program reasoning harder

  P0: A=1; B=1; flag=1;
  P1: while (!flag); print A; print B;

- SC is too restrictive; relaxing it does not always violate the programmer's intuition
- Three attributes of a relaxed model:
  - System specification: which orders are preserved and which are not; if not all program orders are preserved, what support (software and hardware) is provided to enforce a particular order that the programmer wishes
  - Programmer's interface: a set of rules which, if followed, lead to an execution as expected by the programmer; normally specified in terms of high-level language annotations and labels
  - Translation mechanism: how to translate the programmer's annotations into hardware actions
- Let's take a look at a few relaxed models: TSO, PSO, PC, WO/WC, RC, DC

Total store ordering
- Allows a read to bypass (i.e., commit before) an earlier incomplete write
- This essentially means a blocked store at the head of the ROB can be removed (it remains in the write buffer) and subsequent instructions are allowed to commit, bypassing the blocked store
- Can hide the latency of write operations
- Note that this is the only reordering allowed
- The programmer's intuition is preserved in most cases, but not always:

  P0: A=1; flag=1;     P1: while (!flag); print A;   [same as SC]
  P0: A=1; B=1;        P1: print B; print A;         [same as SC]
  P0: A=1; print B;    P1: B=1; print A;             [violates SC]

- Implemented in many Sun UltraSPARC microprocessors
- How do I enforce SC in the last example if I really care? This may be needed when porting the program from the R10000 to an UltraSPARC
- Must ensure that a read cannot bypass earlier writes
- Microprocessors provide fence instructions for this purpose; the SPARC v9 specification provides the MEMBAR (memory barrier) instruction in several flavors
- Here we only need one of these flavors, namely the write-to-read fence, placed just before the load instruction
- This fence does not allow the load to graduate until all stores before it have graduated
- If a fence instruction is not available, substituting the read with a read-modify-write (e.g., ldstub in SPARC) also works
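The effect of the write-to-read fence can be sketched with portable C++ atomics (an illustrative analogue, not the SPARC MEMBAR itself; the names A, B, r1 and r2 follow the last litmus test above). A seq_cst fence between each thread's store and its following load plays the role of the write-to-read barrier: with the fences the SC-forbidden outcome r1 == r2 == 0 cannot occur, while without them a store-buffered (TSO-like) machine may legally produce it.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> A{0}, B{0};
    int r1, r2;

    void p0() {
        A.store(1, std::memory_order_relaxed);                // A = 1
        std::atomic_thread_fence(std::memory_order_seq_cst);  // write-to-read fence
        r1 = B.load(std::memory_order_relaxed);               // print B
    }

    void p1() {
        B.store(1, std::memory_order_relaxed);                // B = 1
        std::atomic_thread_fence(std::memory_order_seq_cst);  // write-to-read fence
        r2 = A.load(std::memory_order_relaxed);               // print A
    }

    int main() {
        std::thread t0(p0), t1(p1);
        t0.join(); t1.join();
        // With the fences, r1 == 0 && r2 == 0 is impossible, matching SC.
        // Without them, both the C++ model and a TSO machine allow (0, 0).
        std::printf("r1=%d r2=%d\n", r1, r2);
    }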

PC and PSO
- Processor consistency (PC) is the same as TSO, but does not guarantee write atomicity: the writes may become visible to different processors in different orders

  P0: A=1;
  P1: while (!A); B=1;
  P2: while (!B); print A;

- By the way, how can a system not guarantee write atomicity? What hardware logic do you get rid of by not implementing write atomicity?
- Implemented in Intel processors
- Partial store ordering (PSO) allows reads as well as writes to bypass writes
- Still implements write atomicity, like TSO
- Further overlaps write latency
- Deviates a lot from SC (flag spinning no longer works); a store barrier instruction is needed to emulate SC
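The write-atomicity question can also be phrased at the language level; the following is only an analogy, not a description of any particular processor, and the thread bodies mirror the P0/P1/P2 example above. With relaxed atomics the C++ model, like a machine without write atomicity, allows P2 to print 0; labeling the flag accesses acquire/release restores the causal chain A=1 -> B=1 -> print A and forces P2 to print 1.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> A{0}, B{0};   // illustrative names matching the slide example

    void p0() {
        A.store(1, std::memory_order_release);              // P0: A = 1
    }

    void p1() {
        while (A.load(std::memory_order_acquire) == 0) {}   // P1: while (!A);
        B.store(1, std::memory_order_release);              //     B = 1
    }

    void p2() {
        while (B.load(std::memory_order_acquire) == 0) {}   // P2: while (!B);
        // With acquire/release the chain A=1 -> B=1 -> here is preserved, so
        // this must print 1.  With memory_order_relaxed everywhere, the C++
        // model (like a non-write-atomic machine) would also allow 0.
        std::printf("A=%d\n", A.load(std::memory_order_relaxed));
    }

    int main() {
        std::thread t0(p0), t1(p1), t2(p2);
        t0.join(); t1.join(); t2.join();
    }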

TSO, PC, PSO: hardware support
- TSO and PC still need to worry about the load-invalidate squash (why?)
- In TSO and PC the store queue entries are freed in order, because stores are not allowed to bypass stores
- A processor supporting PSO can retire stores in any order (however, a store cannot bypass a load)
- The store barrier instruction (stbar in SPARC) goes through the normal store queue and controls the issuing of subsequent store instructions
- Note that in all three models memory operations to the same or overlapping addresses must issue in program order
- All three models may benefit from bigger write/store buffers

Weak ordering (WO)
- The first seminal relaxed consistency model; also known as weak consistency
- An important observation is that a parallel programmer does not really care about the ordering among memory operations between synchronizations, as long as conventional data and control dependencies are preserved within each process
- For example, within a critical section the exact order in which independent reads/writes execute is not important
- Weak ordering makes use of this property and relaxes all memory orders between synchronization operations, e.g., within and outside critical sections, between consecutive barriers, and before and after flag synchronization
- A few constraints must be followed (a code sketch of the flag case appears at the end of this section):
  - Code from outside a critical section cannot be reordered with code inside the critical section, i.e., all code before a critical section must commit before the lock is acquired, all code within a critical section must commit before the lock is released, and code after a critical section must not issue before the lock is released
  - Code before a barrier must commit before entering the barrier, and code after a barrier must not issue before leaving the barrier (the second condition should hold naturally if the barrier implementation itself obeys the same constraints)
  - Code before a flag wait must commit before waiting on the flag, code after a flag wait must not issue before the flag is set by the producer, code before the setting of a flag must commit before the flag is set, and code after the setting of a flag must not issue before the flag is set
- Perfectly suits modern microprocessors and aggressive compiler optimizations
- Either the hardware must be able to recognize synchronization operations and stall until everything before them has graduated, or the compiler must insert proper memory barrier instructions
- The MIPS R10000 provides a sync instruction and a fence count register for this purpose; the fence count register is incremented whenever an L2 miss leaves the chip and decremented on a reply; a sync instruction disables issuing from the address queue until the fence count register is zero and all outstanding memory operations have committed
- A processor supporting WO does not need the load-invalidate squash used in the MIPS R10000
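As a rough illustration of the flag-synchronization constraints, here is a minimal C++ sketch (not from the slides; it uses acquire/release labels on the flag rather than the conservative full fences a pure WO implementation would place around every synchronization operation, and the variable names are illustrative). The release store corresponds to "all code before the setting of a flag must commit before the flag is set", and the acquire load corresponds to "code after a flag wait must not issue before the flag is set".

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int A = 0, B = 0;            // ordinary data, ordered only via the flag
    std::atomic<int> flag{0};    // synchronization variable

    void producer() {
        A = 1;
        B = 1;
        // Release: all earlier writes become visible before the flag is seen set.
        flag.store(1, std::memory_order_release);
    }

    void consumer() {
        // Acquire: no later reads may be satisfied before the flag is observed set.
        while (flag.load(std::memory_order_acquire) == 0) { /* spin */ }
        std::printf("A=%d B=%d\n", A, B);   // guaranteed to print A=1 B=1
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
    }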