Shared Memory Consistency Models: A Tutorial


Shared Memory Consistency Models: A Tutorial
by Sarita Adve and Kourosh Gharachorloo
WRL Research Report, 1995
Presentation: Vince Schuster

Contents
- Overview
- Uniprocessor Review
- Sequential Consistency
- Relaxed Memory Models
- Program Abstractions
- Conclusions

Overview

Correct and efficient shared-memory programs require a precise notion of behavior with respect to the read (R) and write (W) operations between processor memories.

Example 1, Figure 1. Initially, all pointers = NULL and all ints = 0.

P1:
  while (there are more tasks) {
    Task = GetFromFreeList();
    Task->Data = ...;
    insert Task in task queue
  }
  Head = head of task queue;

P2, P3, ..., Pn:
  while (MyTask == NULL) {
    Begin Critical Section
    if (Head != NULL) {
      MyTask = Head;
      Head = Head->Next;
    }
    End Critical Section
  }
  ... = MyTask->Data;

Q: What will Data be?
A: It could be the old Data.

Definitions

Memory Consistency Model: a formal specification of memory-system behavior to the programmer.

Program Order: the order in which memory operations appear in the program.

Sequential Consistency (SC): a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each processor appear in this sequence in the order specified by its program. (Lamport [16])

Relaxed Memory Consistency Models (RxM): models less restrictive than SC; valuable for efficient shared memory.
- System-centric: the HW/SW mechanisms that enable the memory model.
- Programmer-centric: the memory model from the programmer's viewpoint, in terms of observable program behavior.

Cache coherence:
1. A write is eventually made visible to all processors.
2. Writes to the same location appear serialized (in the same order) to all processors.
NOTE: cache coherence is not equivalent to sequential consistency (SC).

Uniprocessor Review

A uniprocessor only needs to maintain control and data dependences, so the compiler can perform aggressive optimizations: register allocation, code motion, value propagation, loop transformations, vectorization, software pipelining, prefetching, and so on.

A multi-threaded program looks like threads T1, T2, T3, ..., Tn attached to one memory. Conceptually, this single-memory picture is what SC wants: one program memory with a switch that connects the processors to memory, plus program order on a per-processor basis.

In a uniprocessor system, all of memory appears to have the same values to all threads. You still have to deal with the normal multi-threaded problems on one processor, but you do not have to deal with issues such as write-buffer behavior or cache coherence.

Sequential Consistency Examples

Dekker's algorithm (init: all = 0):

P1:
  Flag1 = 1
  if (Flag2 == 0)
    critical section

P2:
  Flag2 = 1
  if (Flag1 == 0)
    critical section

What if Flag1 is set to 1, then Flag2 is set to 1, and only then do the ifs execute? Or what if the read of Flag2 bypasses the write of Flag1?
A: Sequential consistency (program order within each processor, plus a single interleaved sequence across processors) allows at most one processor into the critical section.

P1:
  A = 1

P2:
  if (A == 1)
    B = 1

P3:
  if (B == 1)
    reg1 = A

What if P2 gets the new value of A but P3 still gets the old value of A?
A: Atomicity of memory operations (all processors see an instantaneous and identical view of each memory operation) rules this out.

NOTE: a uniprocessor system does not have to deal with stale values or R/W bypasses.

Architectures

We will visit:

Architectures without caches:
- write buffers with bypass capability
- overlapping write operations
- non-blocking read operations

Architectures with caches:
- cache coherence and SC
- detecting completion of write operations
- the illusion of write atomicity

Write Buffer with Bypass Capability

Bus-based memory system without caches (Flag1 = 0, Flag2 = 0 in memory): P1 and P2 each buffer their writes (Write Flag1 at t3, Write Flag2 at t4) while their reads (Read Flag2 at t1, Read Flag1 at t2) go straight over the shared bus.

P1 (init: all = 0):
  Flag1 = 1
  if (Flag2 == 0)
    critical section

P2 (init: all = 0):
  Flag2 = 1
  if (Flag1 == 0)
    critical section

Q: What happens if the reads of Flag1 and Flag2 bypass the buffered writes?
A: Both processors enter the critical section.

The bypass hides write latency, but it violates sequential consistency.
NOTE: a write buffer is not a problem for uniprocessor programs.

Overlapping Writes

An interconnection network alleviates the serialization bottleneck of a bus-based design, and writes can be coalesced; as a result, writes from one processor may reach memory out of program order. In the figure, P1's Write Head arrives at t1 and its Write Data not until t4, while P2's Read Head (t2) and Read Data (t3) fall in between.

P1 (init: all = 0):
  Data = 2000
  Head = 1

P2 (init: all = 0):
  while (Head == 0) ;
  ... = Data

Q: What happens if the write of Head bypasses the write of Data?
A: The read of Data returns 0.

Non-Blocking Reads

Non-blocking reads are enabled by non-blocking caches, speculative execution, and dynamic scheduling. In the figure, P2's Read Data issues at t1, before its Read Head at t4, while P1's Write Data (t2) and Write Head (t3) land in between.

P1 (init: all = 0):
  Data = 2000
  Head = 1

P2 (init: all = 0):
  while (Head == 0) ;
  ... = Data

Q: What happens if the read of Data bypasses the read of Head?
A: The read of Data returns 0.

Cache Coherence and SC

A write buffer without a cache behaves much like a write-through cache: reads can proceed before a write completes (i.e., before it becomes visible on the other processors).

Cache coherence is not equivalent to sequential consistency (SC):
1. A write is eventually made visible to all processors.
2. Writes to the same location appear serialized (in the same order) to all processors.
3. A value is propagated by invalidating or updating the cached copy(ies).

Detecting completion of a write operation: what if the reading processor gets the new Head but the old Data? (In the figure, with write-through caches, the Write Head reaches memory at t1 and the Write Data not until t4, while the reads of Head (t2) and Data (t3) fall in between.) This is avoided if the invalidate/update completes before the second write issues, which requires a write acknowledgment, or at least an invalidate acknowledgment.

Illusion of Write Atomicity

Cache-coherence problems:
1. The cache-coherence (cc) protocol must propagate a value to all copies.
2. Detecting write completion takes multiple operations when there are multiple replicas.
3. It is hard to create the illusion of atomicity with non-atomic writes.

Example (init: A = B = C = 0):

P1:
  A = 1
  B = 1

P2:
  A = 2
  C = 1

P3:
  while (B != 1) ;
  while (C != 1) ;
  Reg1 = A

P4:
  while (B != 1) ;
  while (C != 1) ;
  Reg2 = A

Q: What if P1's and P2's updates reach P3 and P4 in different orders?
A: Reg1 and Reg2 might end up with different values, which violates SC.

Solution: serialize writes to the same location. Alternative: delay an update until the previous update to the same location has been acknowledged. This alone is still not equivalent to sequential consistency.

Example 2: Illusion of Write Atomicity

P1:
  A = 1

P2:
  if (A == 1)
    B = 1

P3:
  if (B == 1)
    reg1 = A

Q: What if P2 reads the new A before P3 has been updated with A, AND P2's update of B reaches P3 before P1's update of A, so that P3 reads the new B and the old A?
A: Prohibit any read of a new value until all processors have acknowledged it.

Update protocol (two-phase scheme):
1. Send the update; receive an ACK from each processor.
2. The updated processors then receive an ACK of all the ACKs.
(Note: the writing processor can consider the write complete after step 1.)

Compilers

Compilers perform many optimizations that reorder memory operations: common subexpression elimination, code motion, register allocation, software pipelining, vectorization temporaries, constant propagation, and more. All are done from a uniprocessor perspective, and they can violate shared-memory SC; for example, register-allocating a shared flag means we would never exit many of the while loops above.

The compiler needs to know the shared-memory objects and/or the synchronization points, or it must forgo many of these optimizations.

Sequential Consistency Summary

SC imposes many hardware and compiler constraints. Requirements:
1. Complete each memory operation before the next one begins (or preserve the illusion thereof).
2. Writes to the same location must be serialized (in cache-based systems).
3. Write atomicity (or the illusion thereof).

Hardware techniques useful for both SC and efficiency:
- Prefetching exclusive ownership for writes (hides delays due to program order; overlaps coherence invalidations).
- Read rollbacks (to recover from speculative execution or dynamic scheduling).
- Global shared-memory data-dependence analysis (Shasha & Snir).

Relaxed memory models (RxM) are next.

Relaxed Memory Models

Characterization (3 categories of relaxation, 5 specific types):
1a. Relax write-to-read program order (PO).
1b. Relax write-to-write PO.
1c. Relax read-to-read and read-to-write PO.
(All of the above assume operations to different locations.)
2. Read another processor's write early (cache-based systems only).
3. Read your own write early (most models allow this, and it is usually safe; but what if there are two writers to the same location?).

Some relaxations can be detected by the programmer, others cannot. Various systems use different fence techniques to provide safety nets: AlphaServer, Cray T3D, SparcCenter, Sequent, IBM 370, PowerPC.

Relaxing Write-to-Read Program Order

Relax the constraint between a write and a subsequent read to a different location; reads are reordered with respect to previous writes using memory disambiguation. Three models (IBM 370, TSO, PC) handle this differently; all do it to hide write latency.

- Only the IBM 370 provides a serialization instruction as a safety net between the write and the read.
- TSO can use a read-modify-write (RMW) of either the read or the write.
- PC must use an RMW of the read, since it has less stringent RMW requirements.

Example (init: all = 0):

P1:
  F1 = 1
  A = 1
  Rg1 = A
  Rg2 = F2

P2:
  F2 = 1
  A = 2
  Rg3 = A
  Rg4 = F1

Result: Rg1 = 1, Rg3 = 2, Rg2 = Rg4 = 0. Allowed by TSO and PC, since they let each processor's read of the other flag complete before its own flag write.

Example:

P1:
  A = 1

P2:
  if (A == 1)
    B = 1

P3:
  if (B == 1)
    Rg1 = A

Result: Rg1 = 0, B = 1. Allowed by PC, since it allows P2 to read the new A while P3 still reads the old A.

Relaxing W→R and W→W Program Order

SPARC Partial Store Order (PSO): writes to different locations from the same processor can be pipelined or overlapped, and are allowed to reach memory or other caches out of program order. PSO matches TSO in letting a processor read its own write early while prohibiting other processors from reading the value of another processor's write before it has been acknowledged by all.

W→R: as with TSO, PSO still allows non-SC results for Dekker's algorithm:

P1 (init: all = 0):
  Flag1 = 1
  if (Flag2 == 0)
    critical section

P2 (init: all = 0):
  Flag2 = 1
  if (Flag1 == 0)
    critical section

W→W: PSO's safety net is the STBAR (store barrier) instruction:

P1 (init: all = 0):
  Data = 2000
  STBAR  // write barrier
  Head = 1

P2 (init: all = 0):
  while (Head == 0) ;
  ... = Data

Relaxing All Program Order

A read or write may be reordered with respect to a later read or write to a different location. This hides read latency under either static or dynamic (out-of-order) scheduling, and it permits speculative execution and non-blocking caches. Examples: Alpha, SPARC V9 RMO, PowerPC. SPARC RMO and PowerPC even allow reordering of reads to the same location. All of these can violate SC for the earlier code examples (now things get dangerous!).

All of them allow a processor to read its own write early, but RCpc and PowerPC also allow a read to return the value of another processor's write early (which is complex to reason about).

Two categories of parallel-program semantics:
1. Distinguish memory operations by type (WO, RCsc, RCpc).
2. Provide explicit fence capabilities (Alpha, RMO, PowerPC).

Weak Ordering (WO)

Two categories of memory operations: (1) data operations and (2) synchronization operations. Program order is enforced between the two categories, and the programmer must identify the synchronization operations; that identification is the safety net. A counter is utilized (incremented/decremented) to track outstanding operations. Data regions between synchronization operations can be reordered and optimized. Writes appear atomic to the programmer.

Release Consistency (RCsc/RCpc)

RCsc maintains SC among special operations. Constraints: acquire → all; all → release; special → special.

RCpc maintains processor consistency among special operations; the write → read program order among special operations is eliminated. Constraints: acquire → all; all → release; special → special, except special write → special read.

A shared RMW may be needed to maintain program order; there are subtle details.

Operation hierarchy: operations are special or ordinary; special operations are sync or nsync; sync operations are acquire or release.

Alpha, RMO, PowerPC

Alpha provides two fences:
- MB (memory barrier)
- WMB (write memory barrier)
No write-atomicity fence is needed.

SPARC V9 RMO: the MEMBAR fence takes bits selecting any of the R→W, W→W, R→R, and W→R orderings. No RMW is needed, and no write-atomicity fence is needed.

PowerPC provides the SYNC fence, similar to MB except that:
- R→R to the same location can still execute out of order (use an RMW).
- Write atomicity may require an RMW.
- A write may be seen early by another processor's read.

Abstractions

Generally, compiler optimizations can go full bore between sync/special/fence points or sync IDs, and some optimizations can still be applied to global shared-memory objects. The safety nets are programmer-supplied and standardized. Where the compiler doesn't know, it must assume the worst; over-marking SYNCs as a starting method is overly conservative.

Programming-model support:
- doall loops: no dependences between iterations (HPF/F95 forall, where)
- SIMD (CUDA): implied multithreaded access without sync or an IF condition
- Data types: volatile in C/C++
- Directives, e.g. OpenMP: #pragma omp parallel (sync region); #pragma omp shared(a) (data type)
- Libraries (e.g., MPI, OpenCL, CUDA)

Using the Hardware, and Conclusions

Compilers can protect memory ranges [low, high]:
- assign data segments to protected page locations
- use high-order bits of virtual-memory addresses
- use extra opcodes (e.g., GPU sync)
- modify internal memory-disambiguation methods
- perform interprocedural optimization for shared-memory accesses

Relaxed memory consistency models:
+ Put more performance and power into the hands of programmers and compilers.
- Put more responsibility into the hands of programmers and compilers.

The End