Beyond Sequential Consistency: Relaxed Memory Models


Computer Science and Artificial Intelligence Lab, M.I.T.
Based on the material prepared by and Krste Asanovic


Sequential Consistency (6.823 L20-3)

    Processor 1              Processor 2
    Store(a,10);             L: r1 = Load(flag);
    Store(flag,1);              Jz(r1,L);
                                r2 = Load(a);

Initially flag = 0.
SC requires in-order instruction execution and atomic loads and stores.
SC is easy to understand, but architects and compiler writers want to violate it for performance.

Memory Model Issues

Architectural optimizations that are correct for uniprocessors often violate sequential consistency and result in a new memory model for multiprocessors.

Example 1: Store Buffers

    Process 1                Process 2
    Store(flag1,1);          Store(flag2,1);
    r1 := Load(flag2);       r2 := Load(flag1);

Initially, all memory locations contain zeros.
Question: Is it possible that r1 = 0 and r2 = 0?
Sequential consistency: No
If loads can bypass stores in the store buffer: Yes
Total Store Order (TSO): IBM 370, SPARC's TSO memory model
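The two answers to Example 1 can be checked by brute force. The Python sketch below is not part of the lecture. It assumes a toy machine in which each processor has a FIFO store buffer, a load checks its own buffer before memory, and buffered stores drain to memory at arbitrary times. The names `outcomes` and `store_buffers` are illustrative only.

```python
def outcomes(store_buffers):
    """Return every (r1, r2) pair Example 1 can produce on the toy machine."""
    programs = (
        [("store", "flag1", 1), ("load", "flag2", "r1")],   # Process 1
        [("store", "flag2", 1), ("load", "flag1", "r2")],   # Process 2
    )
    results = set()

    def replace(items, i, value):
        # Return a copy of the tuple with position i replaced.
        return items[:i] + (value,) + items[i + 1:]

    def explore(pcs, buffers, memory, regs):
        finished = all(pc == len(p) for pc, p in zip(pcs, programs))
        if finished and not any(buffers):
            results.add((regs["r1"], regs["r2"]))
            return
        for i, prog in enumerate(programs):
            # Choice 1: processor i executes its next instruction.
            if pcs[i] < len(prog):
                op, addr, arg = prog[pcs[i]]
                next_pcs = replace(pcs, i, pcs[i] + 1)
                if op == "store" and store_buffers:
                    queued = buffers[i] + ((addr, arg),)
                    explore(next_pcs, replace(buffers, i, queued), memory, regs)
                elif op == "store":
                    explore(next_pcs, buffers, {**memory, addr: arg}, regs)
                else:
                    # A load takes the newest matching entry in its own
                    # buffer, otherwise the value in memory.
                    value = memory[addr]
                    for buffered_addr, buffered_value in buffers[i]:
                        if buffered_addr == addr:
                            value = buffered_value
                    explore(next_pcs, buffers, memory, {**regs, arg: value})
            # Choice 2: the oldest buffered store of processor i reaches memory.
            if buffers[i]:
                addr, value = buffers[i][0]
                drained = replace(buffers, i, buffers[i][1:])
                explore(pcs, drained, {**memory, addr: value}, regs)

    explore((0, 0), ((), ()), {"flag1": 0, "flag2": 0}, {})
    return results


print(sorted(outcomes(store_buffers=False)))  # stores write memory directly
print(sorted(outcomes(store_buffers=True)))   # loads may bypass buffered stores
```

Without store buffers the search never produces (0, 0). With them it does, because both loads can execute while both stores are still waiting in their buffers.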

Example 2: Short-circuiting

    Process 1                Process 2
    Store(flag1,1);          Store(flag2,1);
    r3 := Load(flag1);       r4 := Load(flag2);
    r1 := Load(flag2);       r2 := Load(flag1);

Question: Do the extra Loads have any effect?
Sequential consistency: No
Suppose Load-Store short-circuiting is permitted in the store buffer:
    No effect in SPARC's TSO model
    A Load acts as a barrier on other loads in IBM 370

Example 3: Non-FIFO Store Buffers

    Process 1                Process 2
    Store(a,1);              r1 := Load(flag);
    Store(flag,1);           r2 := Load(a);

Question: Is it possible that r1 = 1 but r2 = 0?
Sequential consistency: No
With non-FIFO store buffers: Yes
SPARC's PSO memory model

Example 4: Non-Blocking Caches

    Process 1                Process 2
    Store(a,1);              r1 := Load(flag);
    Store(flag,1);           r2 := Load(a);

Question: Is it possible that r1 = 1 but r2 = 0?
Sequential consistency: No
Assuming stores are ordered: Yes, because Loads can be reordered
SPARC's RMO, PowerPC's WO, Alpha
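Examples 3 and 4 ask the same question about the same program and differ only in which program-order edge the hardware drops. The Python sketch below is an illustration, not lecture material. It assumes a simplified model in which memory is atomic and a relaxation simply lets two accesses of one process appear in either order in the global sequence. The function `outcomes` and its two keyword arguments are hypothetical names.

```python
from itertools import permutations

# Message-passing program shared by Examples 3 and 4: (process, kind, address).
OPS = [
    ("P1", "store", "a"),
    ("P1", "store", "flag"),
    ("P2", "load", "flag"),   # result goes to r1
    ("P2", "load", "a"),      # result goes to r2
]


def outcomes(keep_store_store=True, keep_load_load=True):
    """Return all (r1, r2) results over the global orders the model allows."""
    results = set()
    for order in permutations(range(len(OPS))):
        position = {op: n for n, op in enumerate(order)}
        # Drop orders that break a program-order edge the model preserves.
        if keep_store_store and position[0] > position[1]:
            continue
        if keep_load_load and position[2] > position[3]:
            continue
        memory = {"a": 0, "flag": 0}
        loaded = {}
        for index in order:
            _, kind, addr = OPS[index]
            if kind == "store":
                memory[addr] = 1
            else:
                loaded[addr] = memory[addr]
        results.add((loaded["flag"], loaded["a"]))
    return results


print(sorted(outcomes()))                        # both edges kept, as in SC
print(sorted(outcomes(keep_store_store=False)))  # Example 3 style relaxation
print(sorted(outcomes(keep_load_load=False)))    # Example 4 style relaxation
```

With both edges kept, (1, 0) never appears. Dropping either the store-store edge or the load-load edge is enough to make it reachable, which is why the two examples give the same answer for different reasons.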

Example 5: Register Renaming

    Process 1                Process 2
    Store(flag1,r1);         Store(flag2,r2);
    r1 := Load(flag2);       r2 := Load(flag1);

Initially both r1 and r2 contain 1.
Within each process the Load overwrites the register that the preceding Store reads. Register renaming will eliminate this edge.
Question: Is it possible that r1 = 0 and r2 = 0?
Sequential consistency: No
Register renaming: Yes, because it removes anti-dependencies

Example 6: Speculative Execution

    Process 1                Process 2
    Store(a,1);              L: r1 := Load(flag);
    Store(flag,1);              Jz(r1,L);
                                r2 := Load(a);

Question: Is it possible that r1 = 1 but r2 = 0?
Sequential consistency: No
With speculative loads: Yes, even if the stores are ordered

Example 7: Store Atomicity

    Process 1       Process 2       Process 3           Process 4
    Store(a,1);     Store(a,2);     r1 := Load(a);      r3 := Load(a);
                                    r2 := Load(a);      r4 := Load(a);

Question: Is it possible that r1 = 1 and r2 = 2 but r3 = 2 and r4 = 1?
Sequential consistency: No
Even if Loads on a processor are ordered, different orderings of the stores can be observed if the Store operation is not atomic.
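The effect of non-atomic stores in Example 7 can be illustrated with a small enumeration. This Python sketch is an assumption-laden model rather than anything from the lecture: each observing process sees the two stores arrive in some order and performs its two loads, in program order, at arbitrary points in that arrival sequence. With atomic stores both observers share one arrival order; without atomicity each observer may have its own. The names `observe` and `outcomes` are illustrative.

```python
from itertools import permutations, product

STORES = (1, 2)   # values written by Store(a,1) and Store(a,2)


def observe(arrival_order):
    """Return the (first_load, second_load) pairs possible for one arrival order."""
    history = (0,) + tuple(arrival_order)   # successive values of a at the observer
    seen = set()
    for i in range(len(history)):
        for j in range(i, len(history)):    # the second load is never earlier
            seen.add((history[i], history[j]))
    return seen


def outcomes(atomic_stores):
    """Return all (r1, r2, r3, r4) results for Processes 3 and 4."""
    results = set()
    orders = list(permutations(STORES))
    for order_p3, order_p4 in product(orders, repeat=2):
        if atomic_stores and order_p3 != order_p4:
            continue   # atomic stores: one order seen by everyone
        for (r1, r2), (r3, r4) in product(observe(order_p3), observe(order_p4)):
            results.add((r1, r2, r3, r4))
    return results


print((1, 2, 2, 1) in outcomes(atomic_stores=True))
print((1, 2, 2, 1) in outcomes(atomic_stores=False))
```

The outcome from the question is absent when stores are atomic and present when each observer may see its own arrival order, even though each observer's loads stay in program order.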

Example 8: Causality

    Process 1           Process 2               Process 3
    Store(flag1,1);     r1 := Load(flag1);      r2 := Load(flag2);
                        Store(flag2,1);         r3 := Load(flag1);

Question: Is it possible that r1 = 1 and r2 = 1 but r3 = 0?
Sequential consistency: No


Weaker Memory Models & Memory Fence Instructions

Architectures with weaker memory models provide memory fence instructions to prevent the permitted reorderings of loads and stores.

    Store(a1, v);
    Fence_wr;
    Load(a2);

The Load and Store can be reordered if a1 ≠ a2. Insertion of Fence_wr will disallow this reordering.
Similarly: Fence_rr; Fence_rw; Fence_ww
Sun's SPARC: MEMBAR; MEMBAR_RR; MEMBAR_RW; MEMBAR_WR; MEMBAR_WW
PowerPC: Sync; EIEIO

Enforcing SC using Fences

    Processor 1              Processor 2
    Store(a,10);             L: r1 = Load(flag);
    Store(flag,1);              Jz(r1,L);
                                r2 = Load(a);

With fences (weak ordering):

    Processor 1              Processor 2
    Store(a,10);             L: r1 = Load(flag);
    Fence_ww;                   Jz(r1,L);
    Store(flag,1);              Fence_rr;
                                r2 = Load(a);
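Why both fences are needed can be seen with a small search. The Python sketch below is illustrative and makes several assumptions that are not in the lecture: memory is atomic, a processor's accesses may appear in any order unless a fence of the matching kind sits between them, and the spin loop is replaced by a single Load(flag) so that the outcome r1 = 1 stands for "the loop exited". The tuple encoding and the names `ordered_pairs` and `outcomes` are hypothetical.

```python
from itertools import permutations

KIND = {"store": "w", "load": "r"}


def ordered_pairs(program):
    """Index pairs (i, j) that a fence between them keeps in program order."""
    pairs = set()
    for f, op in enumerate(program):
        if op[0] != "fence":
            continue
        for i in range(f):
            for j in range(f + 1, len(program)):
                before, after = program[i], program[j]
                if "fence" in (before[0], after[0]):
                    continue
                # Fence "ww" orders earlier stores before later stores, etc.
                if KIND[before[0]] + KIND[after[0]] == op[1]:
                    pairs.add((i, j))
    return pairs


def outcomes(p1, p2):
    """Return all (r1, r2) results under the assumed weakly ordered model."""
    programs = {"P1": p1, "P2": p2}
    edges = [(p, i, j) for p, prog in programs.items() for i, j in ordered_pairs(prog)]
    mem_ops = [(p, i) for p, prog in programs.items()
               for i, op in enumerate(prog) if op[0] != "fence"]
    results = set()
    for order in permutations(mem_ops):
        position = {op: n for n, op in enumerate(order)}
        if any(position[(p, i)] > position[(p, j)] for p, i, j in edges):
            continue   # this order violates a fence
        memory, regs = {"a": 0, "flag": 0}, {}
        for p, i in order:
            kind, addr, arg = programs[p][i]
            if kind == "store":
                memory[addr] = arg
            else:
                regs[arg] = memory[addr]
        results.add((regs["r1"], regs["r2"]))
    return results


P1 = [("store", "a", 10), ("store", "flag", 1)]
P2 = [("load", "flag", "r1"), ("load", "a", "r2")]
P1_FENCED = [("store", "a", 10), ("fence", "ww"), ("store", "flag", 1)]
P2_FENCED = [("load", "flag", "r1"), ("fence", "rr"), ("load", "a", "r2")]

print(sorted(outcomes(P1, P2)))
print(sorted(outcomes(P1_FENCED, P2)))
print(sorted(outcomes(P1_FENCED, P2_FENCED)))
```

In this model the stale result (1, 0) remains reachable with no fences or with only the Fence_ww on Processor 1. It disappears only when Processor 2 also has its Fence_rr.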

Weaker (Relaxed) Memory Models

Architectures: Alpha, SPARC, PowerPC, ...
Terminology: "store is globally performed"; TSO, PSO, RMO, ...; write buffers; RMO = WO?; SMP, DSM
Hard to understand and remember
Unstable: "model of the year"

Backlash in the architecture community

Abandon weaker memory models in favor of SC by employing aggressive speculative execution tricks:
- All modern microprocessors have some ability to execute instructions speculatively, i.e., the ability to kill instructions if something goes wrong (e.g., a branch misprediction).
- Treat all loads and stores that are executed out of order as speculative, and kill them if a signal is received from some other processor indicating that SC is about to be violated.

Aggressive SC Implementations

Loads can go out of order:

    Processor 1                      Processor 2
    r1 = Load(flag);   (miss)        Store(a,10);
    r2 = Load(a);      (hit)

Kill Load(a) and the subsequent instructions if Store(a,10) happens before Load(flag) completes.
Still not as efficient as weaker memory models.
Scalable for Distributed Shared Memory systems?

Properly Synchronized Programs

Very few programmers write code that relies on SC; instead, higher-level synchronization primitives are used: locks, semaphores, monitors, atomic transactions.
A properly synchronized program is one where each shared writable variable is protected (say, by a lock) so that there is no race in updating the variable.
- There is still a race to acquire the lock.
- There is no way to check whether a program is properly synchronized.
For properly synchronized programs, instruction reordering does not matter as long as updated values are committed before leaving a locked region.

Release Consistency

Only care about inter-processor memory ordering at thread synchronization points, not in between.
Can treat all synchronization instructions as the only ordering points.

    Acquire(lock)    // All following loads get most recent written values
    ... read and write shared data ...
    Release(lock)    // All preceding writes are globally visible before
                     // the lock is freed
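The Acquire/Release pattern corresponds to ordinary lock-protected code. The Python sketch below shows only the programming discipline, using `threading.Lock` from the standard library. The Python runtime hides the hardware memory model, so this does not demonstrate reordering itself. The function `run` and its parameters are illustrative names.

```python
import threading


def run(num_threads=4, increments=1000):
    """Increment a shared counter from several threads, always under a lock."""
    lock = threading.Lock()
    shared = {"counter": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()               # Acquire: later reads see earlier writes
            try:
                shared["counter"] += 1   # read and write shared data
            finally:
                lock.release()           # Release: writes are visible to the next holder

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["counter"]


print(run())
```

Because every update happens between an acquire and a release, no increment is lost regardless of how the threads interleave. The only remaining race is the one to get the lock.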