Relaxed Memory-Consistency Models


Relaxed Memory-Consistency Models [§ 9.1]

In Lecture 13, we saw a number of relaxed memory-consistency models. In this lecture, we will cover some of them in more detail.

Why isn't sequential consistency good enough?

Compilers

Basically, what we want to do is develop models where one operation doesn't have to complete before another one is issued, but the system assures that operations don't become visible out of program order.

SC requires that all memory operations be ordered. Let's see how this prevents performance optimizations that are common in compilers. Compilers allocate variables to registers for better performance, but this is not possible if sequential consistency is to be preserved.

    Without register allocation        After register allocation
    P1          P2                     P1          P2
    B = 0       A = 0                  r1 = 0      r2 = 0
    A = 1       B = 1                  A = 1       B = 1
    u = B       v = A                  u = r1      v = r2
                                       B = r1      A = r2

Notice that the compiler has reordered some memory operations. Which ones?

After executing the code without register allocation, what final values are possible for u and v?

Lecture 21 Architecture of Parallel Computers 1

So mere usage of processor registers causes a result that is not possible with sequential consistency. Clearly, we must circumvent sequential consistency if we want our programs to perform acceptably. There are three options.

• We can preserve sequential consistency, but overlap operations as much as possible. E.g., we can prefetch data, or use more multithreading. But sequential consistency is preserved, so the compiler cannot reorder operations.

• We can allow the compiler to reorder operations, as long as we are sure that sequential consistency will not be violated in the results. Memory operations can be executed out of order, as long as they become visible to other processors in program order. For example, we can execute instructions speculatively, and simply not commit the results if those instructions should not have been executed.

• We can relax the consistency model, abandoning the guarantee of sequential consistency, but still retaining semantics that are intuitive.

The reason that relaxed consistency models are useful is that most of the sequential constraints in a program are not really necessary for it to give intuitively correct results.

Consider this code fragment:

    P1                      P2
    1a. A = 1               2a. while (flag == 0);
    1b. B = 1               2b. u = A
    1c. flag = 1            2c. v = B

If the code were to execute sequentially, statement 1a would have to precede 1b, which would have to precede 1c; ditto for 2a–2c. What precedence relationships among the statements are really necessary?

CS&G discuss relaxed consistency models from two standpoints:

• The system specification, which tells how a consistency model works and what guarantees of ordering it provides.
• The programmer's interface, which tells what code needs to be written to invoke operations provided by the system specification.

The system specification [§ 9.1.1]

CS&G divide relaxed consistency models into three classes, based on the kinds of reorderings they allow.

• Write→read reordering. These models only allow a read to bypass (complete before) an incomplete earlier write. Examples: total store ordering, processor consistency.
• Write→write reordering. These models also allow writes to bypass previous writes. Example: partial store ordering.
• All reorderings. These models also allow reads and writes to bypass previous reads. Examples: weak ordering, release consistency.

Write→read reordering

Models that allow reads to bypass pending writes are helpful because they hide the latency of write operations.

While a write miss is still in the write buffer, and not yet visible to other processors, the processor can perform reads that hit in its cache, or a single read that misses in its cache.

These models require minimal changes to programs. For example, no change is needed to this code for spinning on a flag:

    P1                      P2
    A = 1;                  while (flag == 0);
    flag = 1;               print A;

(In this and later code fragments, we assume all variables are initialized to 0.)

This is because the model does not permit writes to be reordered. In particular, the write to A will not be allowed to complete out of order with respect to the write to flag.

What would write→read reordering models guarantee about this fragment?

    P1                      P2
    A = 1;                  print B;
    B = 1;                  print A;

Recall from Lecture 13 that processor consistency does not require that all processors see writes in the same order if those writes come from different processors. Under processor consistency (PC), what value will be printed by the following code?

    P1                      P2                      P3
    A = 1;                  while (A == 0);         while (B == 0);
                            B = 1;                  print A;

However, this problem does not occur with total store ordering (TSO).

TSO provides store atomicity: there is a global order defined over all writes.

• A processor's own reads and writes are ordered with respect to itself: X = 3; print X will print 3.
• However, the processor issuing a write may observe it sooner than the other processors do.
• Special membar instructions are used to impose a global write→read ordering.

Now consider this code fragment.

    P1                      P2
    1a. A = 1;              2a. B = 1;
    1b. print B;            2b. print A;

With sequential consistency, what will be printed? How do we know this? Might something else happen under PC? What about under TSO?

To guarantee SC semantics when desired, we need to use the membar (or "fence") instruction. This instruction prevents a following read from completing before previous writes have completed. If an architecture doesn't have a membar instruction, an atomic read-modify-write operation can be substituted for a read. Why do we know this will work? Such an instruction is provided on the Sun Sparc V9.

Write→write reordering

What is the advantage of allowing write→write reordering? The write buffer can merge or retire writes before previous writes in program order complete. What would it mean to merge two writes? Multiple write misses (to different locations) can thus become fully overlapped, and visible out of program order.

What happens to our first code fragment under this model?

    P1                      P2
    1a. A = 1               2a. while (flag == 0);
    1b. B = 1               2b. u = A
    1c. flag = 1            2c. v = B

To make the model work in general, we need an instruction that enforces write-to-write ordering when necessary. On the Sun Sparc V9, the membar instruction has a write-to-write flavor that can be turned on to provide this. Where would we need to insert this instruction in the code fragment above?

Relaxing all program orders

Models in this class don't guarantee that anything becomes visible to other processors in program order.

This allows multiple read and write requests to be outstanding at the same time, and to complete out of order. Read latency can be hidden as well as write latency. (In the earlier models, writes couldn't bypass earlier reads, so everything in a process had to wait for a pending read to finish.) These models are the only ones that allow optimizing compilers to use many key reorderings and to eliminate unnecessary accesses.

Two models, and three implementations, fall into this category.

• Weak ordering
• Release consistency
• Digital Alpha implementation
• Sparc V9 Relaxed Memory Ordering (RMO)
• IBM PowerPC implementation

We have already seen the two models in Lecture 13. To get a better appreciation for the differences between the two, let's consider the two examples from CS&G.

Here is a piece of code that links a new task onto the head of a doubly linked list. It may be executed in parallel by all processors, so some sort of concurrency control is needed.

    lock(taskq);
    newtask->next = head;
    if (head != NULL)
        head->prev = newtask;
    head = newtask;
    unlock(taskq);

In this code, taskq serves as a synchronization variable. If accesses to it are kept in order, and accesses by a single processor are kept in program order, then it will serve to prevent two processes from manipulating the queue simultaneously.

The next example uses flag variables to implement a producer/consumer interaction. (For the loops to start, flag2 must be initialized to 1 so that P1 runs first.)

    P1                          P2
    TOP: while (flag2 == 0);    TOP: while (flag1 == 0);
         A = 1;                      x = A;
         u = B;                      y = D;
         v = C;                      B = 3;
         D = B * C;                  C = D / B;
         flag2 = 0;                  flag1 = 0;
         flag1 = 1;                  flag2 = 1;
         goto TOP;                   goto TOP;

Which shared variables are produced by P1? Which shared variables are produced by P2? What is the synchronization variable here?

Again, as long as one access to it completes throughout the system before the next one starts, other reads and writes can complete out of order, and the program will still work.

The diagram below (similar to the one at the end of Lecture 13) illustrates the difference between weak ordering and release consistency. Each block of reads and writes represents a run of non-synchronization operations from a single processor.

With weak ordering, before a synchronization operation, the processor waits for all previous reads and writes to complete. Also, a synchronization operation has to complete before later reads and writes can be issued.

Release consistency distinguishes between an acquire, performed to gain access to a critical section, and a release, performed to leave a critical section. An acquire can be issued before previous reads (in block 1) complete. After a release is issued, instructions in block 3 can be issued without waiting for it to complete.

[Figure: three blocks (1, 2, 3) of reads and writes per processor. Under weak ordering, a sync operation fully separates block 1 from block 2, and block 2 from block 3. Under release consistency, block 2 is preceded by an acquire (a read) and followed by a release (a write), each ordered in only one direction.]

Modern processors provide instructions that can be used to implement these two models. The Alpha architecture supports two fence instructions.

• The memory barrier (MB) operation is like a sync operation in WO.
• The write memory barrier (WMB) operation imposes program order only between writes.

A read issued after a WMB can bypass (complete before) a write issued before the WMB.

The Sparc V9 relaxed memory order (RMO) provides a membar (fence) instruction with four flavor bits. Each flavor bit enforces ordering between a certain combination of previous and following memory operations:

• read→read
• read→write
• write→read
• write→write

The PowerPC provides only a single sync (fence) instruction; it can only implement WO.

The programmer's interface

When writing code for a system that uses a relaxed memory-consistency model, the programmer must label synchronization operations. This can be done using system-specified programming primitives such as locks and barriers.

For example, how are lock operations implemented in release consistency? A lock operation translates to an acquire, and an unlock operation translates to a release. Arrival at a barrier indicates that previous operations have completed, so it is a release. Leaving a barrier indicates that new accesses may begin, so it is an acquire.

What about accesses to ordinary variables, like the flag variables earlier in this lecture?

Sometimes the programmer will just know how they should be labeled. But if not, here's how to determine it.

1. Decide which operations are conflicting. Two memory operations from different processes conflict if they access the same memory location and at least one of them is a write.

2. Decide which conflicting operations are competing. Two conflicting operations from different processes compete if they could appear next to each other in a sequentially consistent total order. This means that one could immediately follow the other with no intervening memory operations on shared data.

3. Label all competing memory operations as synchronization operations of the appropriate type.

Notice that we can decide where synchronization operations are needed by assuming SC, even if we are not using an SC memory model.

To see how this is done, consider our first code fragment again.

    P1                      P2
    1a. A = 1               2a. while (flag == 0);
    1b. B = 1               2b. u = A
    1c. flag = 1            2c. v = B

Which operations are conflicting? Which operations are competing? Which operations need to be labeled as synchronization operations? How should these operations be labeled in weak ordering? How should these operations be labeled in release consistency?