Foundations of the C++ Concurrency Memory Model


John Mellor-Crummey and Karthik Murthy
Department of Computer Science, Rice University
johnmc@rice.edu
COMP 522, 27 September 2016

Before the C++ Memory Model

Prior practice: threaded programming within a shared address space, e.g., Pthreads
- implementations prohibit reordering memory operations with respect to synchronization operations
- synchronization operations are treated as opaque procedure calls

Problem: C and C++ are single-threaded languages used with thread libraries
- the languages are unaware of threads, and compilers are thread-unaware: they optimize programs for a single thread
- compilers may therefore perform optimizations that are valid for single-threaded programs but violate the intended meaning of multithreaded programs
- prior informal specifications don't precisely define data races or the semantics of a program without data races

A Familiar Example

Can the surprising outcome occur? Without a precise memory model: yes! T1 and T2 respectively speculate the values of X and Y as 1, validating each other's speculation.
Does this program have a data race? Yes, for the aforementioned execution.
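The slide's code is not reproduced in the transcription. The example used at this point is, to the best of my reconstruction, along the following lines (variable names mine); a minimal C++ sketch:

    #include <thread>

    // Initially x == y == 0; both are ordinary (non-atomic) variables.
    int x = 0, y = 0;

    void t1() { if (x == 1) y = 1; }   // T1 reads x, conditionally writes y
    void t2() { if (y == 1) x = 1; }   // T2 reads y, conditionally writes x

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // Intuitively x == y == 1 looks impossible: neither branch should
        // fire first.  Without a precise memory model, however, a compiler
        // or processor that speculates the stores can let each write
        // "justify" the other, making the outcome appear.
    }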

Structure Fields and Races
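The slide's figure is not included in the transcription. As an illustration only (my reconstruction, not taken from the slide), the kind of problem this example is about: a thread-unaware compiler may implement a store to one struct field as a wider read-modify-write that also rewrites an adjacent field, manufacturing a race.

    #include <thread>

    struct S { char a; char b; } s;   // two adjacent non-bitfield fields

    void t1() { s.a = 1; }
    void t2() { s.b = 1; }

    // A thread-unaware compiler might implement "s.a = 1" as a wider
    // read-modify-write of the containing word, conceptually:
    //
    //     tmp = load 16 bits at &s;     // reads both a and b
    //     update the byte for a in tmp;
    //     store 16 bits at &s;          // rewrites b, possibly overwriting
    //                                   // t2's concurrent store to b
    //
    // Under the C++11 memory model, s.a and s.b are distinct memory
    // locations, so t1 and t2 do not race and this transformation is
    // not allowed.

    int main() {
        std::thread x(t1), y(t2);
        x.join(); y.join();
    }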

Register Promotion

Is the register promotion in the optimized code a problem?
Yes: it introduces reads and writes of x outside the critical section.
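The slide's code is not reproduced in the transcription; a sketch in the spirit of the usual register-promotion example (names and loop body are mine):

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int  x;     // shared; protected by m whenever the program is multithreaded
    bool mt;    // true when other threads exist

    // Original loop: every access to x is inside the critical section
    // whenever mt is true.
    void original(int n) {
        for (int i = 0; i < n; ++i) {
            if (mt) pthread_mutex_lock(&m);
            x = x + 1;
            if (mt) pthread_mutex_unlock(&m);
        }
    }

    // After promoting x into the register r -- a transformation that is
    // valid for a single-threaded program -- reads and writes of x now
    // occur outside the critical section, creating a race when mt is true.
    void promoted(int n) {
        int r = x;                                     // unprotected read
        for (int i = 0; i < n; ++i) {
            if (mt) { x = r; pthread_mutex_lock(&m); r = x; }
            r = r + 1;
            if (mt) { x = r; pthread_mutex_unlock(&m); r = x; }
        }
        x = r;                                         // unprotected write
    }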

Why a C++ Memory Model?

Problem: it is hard for programmers to reason about correctness.
- without precise semantics, it is hard to tell whether the compiler will violate the intended semantics
- compiler transformations could introduce data races without violating the language specification, e.g., the previous register-promotion and field-update examples

Trylock and Ordering

Problem: undesirably strong semantics for programs using non-blocking calls to acquire a lock, e.g., pthread_mutex_trylock().
What's odd about this code? T2 waits for T1 to acquire the lock instead of waiting for the lock to be released.
Why would we want the assert to succeed?
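The code on the slide is not reproduced in the transcription; the standard trylock example has roughly this shape (my reconstruction):

    #include <pthread.h>
    #include <cassert>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int x = 0;   // ordinary (non-atomic) shared variable

    void* t1(void*) {
        x = 42;                       // sequenced before the lock acquisition
        pthread_mutex_lock(&m);
        /* ... */
        return nullptr;
    }

    void* t2(void*) {
        // Spin until trylock FAILS, i.e., until T1 holds the lock:
        // T2 uses the lock "backwards", waiting for T1 to acquire it.
        while (pthread_mutex_trylock(&m) == 0)
            pthread_mutex_unlock(&m);
        // Under sequential consistency, T1 holding the lock implies that
        // the assignment x = 42 has already happened.
        assert(x == 42);
        return nullptr;
    }

    int main() {
        pthread_t a, b;
        pthread_create(&a, nullptr, t1, nullptr);
        pthread_create(&b, nullptr, t2, nullptr);
        pthread_join(a, nullptr);
        pthread_join(b, nullptr);
    }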

The Problem with Trylock

Problem: in a sequentially consistent execution, the example is data-race free by the current semantics, so the assertion can't fail.
- the assertion can fail if the compiler or hardware reorders the statements executed by T1
- prohibiting such reordering requires a memory fence before the lock, and the fence doubles the cost of the lock acquisition
- for well-structured uses of locks and unlocks, reordering T1's statements is safe and no fence is necessary

C++ Memory Model Goals
- sequential consistency for race-free programs
- no semantics for programs with data races
- weakened semantics for trylock

Why Undefined Data Race Semantics?
- there are no benign data races
- it is effectively the status quo: little to gain by allowing races, other than allowing code to be obfuscated
- giving Java-like semantics to data races may greatly increase the cost of some C++ constructs
- compilers often assume that objects do not change unless there is an intervening assignment through a potential alias

Possible Memory Models

Sequential consistency
- intuitive, but restricts optimizations

Relaxed memory models
- allow hardware optimizations
- specified at a low level, which makes it hard for programmers to reason about correctness
- can limit compiler optimizations, e.g., at least one relaxed model disallows global analysis or RRE

Data-race-free model properties
- guarantees sequential consistency for data-race-free programs
- no guarantee for programs with races
- a simple model with high performance

Why Data-Race-Free Models?
- the simple programmability of sequential consistency: any program without data races is guaranteed to execute with sequential consistency
- the implementation flexibility of relaxed models
- different data-race-free models define the notion of a race so as to provide increasing flexibility, e.g., data-race-free-0: no concurrent conflicting accesses (as for Java)

Definitions - I

Memory location
- each scalar value occupies a separate memory location, except that a contiguous sequence of bitfields inside the same innermost struct or class shares one

Memory action, which consists of
- the type of action: a data operation (load, store) or a synchronization operation used for communication (lock/unlock/trylock, atomic load/store, atomic RMW)
- a label identifying the program point
- the values to be read and written
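A minimal C++11 illustration of the memory-location rule (the struct and field names are mine, not from the slides):

    #include <thread>

    struct S {
        char c;        // a scalar field: its own memory location
        int  bf1 : 4;  // bf1 and bf2 are a contiguous sequence of bitfields,
        int  bf2 : 4;  //   so they share a single memory location
    } s;

    void t1() { s.c = 1; }     // does not conflict with t2: separate locations
    void t2() { s.bf1 = 1; }   // would race with t3: bf1 and bf2 share a location
    void t3() { s.bf2 = 1; }

    int main() {
        std::thread a(t1), b(t2);   // a and b are race free
        a.join(); b.join();
        // Running t2 and t3 concurrently instead would be a data race.
    }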

Definitions - II

Thread execution
- a set of memory actions
- a partial order corresponding to the sequenced-before ordering (sequenced before relates memory operations of the same thread)

Sequentially consistent execution
- a set of thread executions
- a total order <T on all the memory actions that satisfies the following constraints:
  - each thread is internally consistent
  - <T is consistent with the sequenced-before orders
  - each load, lock, and read-modify-write operation reads the value of the last preceding write to the same location according to <T
- effectively requires a total order that is an interleaving of the individual threads' actions

Definitions - III

Two memory operations conflict if
- they access the same memory location, and
- at least one of them is a store, atomic store, or atomic read-modify-write

Type 1 data race
- in a sequentially consistent execution, two memory operations from different threads form a type 1 data race if
  - they conflict
  - at least one is a data operation
  - they are adjacent in <T
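For concreteness, a minimal sketch of a type 1 data race (not from the slides): an unsynchronized store and load of the same ordinary variable from different threads.

    #include <thread>

    int x = 0;   // ordinary (non-atomic) shared variable

    void writer() { x = 1; }               // data operation: store to x
    void reader() { int r = x; (void)r; }  // data operation: load of x

    int main() {
        std::thread a(writer), b(reader);
        a.join(); b.join();
        // The store and load conflict (same location, one is a store), both
        // are data operations, and there is a sequentially consistent
        // execution in which they are adjacent in <T: a type 1 data race,
        // so the program's behavior is undefined.
    }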

C++ Memory Model
- if a program (on a given input) has a sequentially consistent execution with a type 1 data race, its behavior is undefined
- otherwise, the program behaves according to one of its sequentially consistent executions

Legal Reorderings

Hardware and compilers may freely reorder memory operation M1 sequenced before memory operation M2 if the reordering is allowed by intra-thread semantics and one of the following holds:
1. M1 is a data operation and M2 is a read synchronization operation
2. M1 is a write synchronization operation and M2 is a data operation
3. M1 and M2 are both data operations with no synchronization operation sequenced between them

Legal Lock Optimizations

When lock and unlock are used in well-structured ways, the following reorderings between M1 sequenced before M2 are also safe:
- M1 is a data operation and M2 is the write of a lock operation
- M1 is an unlock and M2 is either a read or a write of a lock

Note: data writes and writes from well-structured locks and unlocks can be executed non-atomically.

C++ Memory Model Solution for Trylock

Modify the specification of trylock so that it is not guaranteed to succeed even if the lock is available.
- a failed trylock doesn't tell you anything reliable: you can't infer that another thread holds the lock

New semantics
- a successful trylock is treated as lock()
- an unsuccessful trylock() is treated by the memory model as a no-op

Why is this useful? It lets the model promise sequential consistency with an intuitive race definition, even for programs that use trylock.

Sequential Consistency vs. Write Atomicity

Independent read, independent write: sequential consistency is not guaranteed if writes don't execute atomically.
- T1 and T2 are on the same node with a shared write-through cache; T3 and T4 are on separate nodes

Execution
- T2 reads T1's value x=1 early, executes a fence, and reads the old value of Y
- T3 writes y=1, executes a fence, writes x=2; all of these writes commit
- T4 reads x=2, executes a fence; only now does T1's write x=1 commit

This is a violation of sequential consistency, caused by the non-atomic write x=1.
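For reference, the classic "independent reads of independent writes" (IRIW) litmus test rendered with C++ sequentially consistent atomics (this is the generic test, not the exact hardware scenario on the slide): because seq_cst writes behave atomically, the two readers cannot observe the two writes in opposite orders.

    #include <atomic>
    #include <thread>
    #include <cassert>

    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    void w1()  { x.store(1); }                     // independent write to x
    void w2()  { y.store(1); }                     // independent write to y
    void rd1() { r1 = x.load(); r2 = y.load(); }
    void rd2() { r3 = y.load(); r4 = x.load(); }

    int main() {
        std::thread a(w1), b(w2), c(rd1), d(rd2);
        a.join(); b.join(); c.join(); d.join();
        // With the default memory_order_seq_cst, the outcome
        // r1==1 && r2==0 && r3==1 && r4==0 is forbidden: the readers
        // would have to see the writes to x and y in opposite orders.
        assert(!(r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0));
    }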

Making the C++ Memory Model Usable
- Problem: a departure from sequential consistency for synchronization operations can lead to non-intuitive behavior
- Approach: retain sequential consistency for default atomics and synchronization operations

C++ Atomics

A C++ qualifier denotes variables with atomic operations; the operations need to be executed atomically by the hardware.

Sequentially consistent atomics: the data-race-free model requires that all operations on C++ atomics appear sequentially consistent.

Initially opposed by many hardware and software developers
- sufficiently expensive on some processors that an "experts only" alternative was deemed necessary
- existing code (e.g., the Linux kernel) assumes weak ordering; it is easier to migrate such code with primitives closer to the assumed semantics

Why was it resolved this way? The models were too hard to formalize and use otherwise.
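A small illustration of default (sequentially consistent) atomics, using the Dekker-style store-buffering pattern (not from the slides):

    #include <atomic>
    #include <thread>
    #include <cassert>

    // Operations on std::atomic default to memory_order_seq_cst, so this
    // test cannot end with both threads having read 0.
    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() { x.store(1); r1 = y.load(); }
    void t2() { y.store(1); r2 = x.load(); }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        assert(r1 == 1 || r2 == 1);   // r1 == 0 && r2 == 0 is impossible
    }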

Some Problems

Significant restrictions on synchronization operations
- synchronization operations must appear sequentially consistent with respect to each other

Performance problem in practice
- only the Intel Itanium processor distinguished between data and synchronization operations
- other processors enforce ordering through fence or memory barrier instructions

Atomic operations must execute in sequenced-before order, and atomic writes must execute atomically
- a read can't return a new value for a location until all older copies of the location are invalid
- with caches, atomic writes are easier to implement with invalidation protocols

Implications for Current Processors

Guaranteeing sequentially consistent atomics
- atomic writes need to be mapped to xchg on AMD64 and Intel64
- the hardware then only needs to make sure that xchg writes are atomic
- xchg implicitly provides the semantics of a store-load fence

Why is this OK?
- better to pay a penalty on stores than on loads, since loads are more frequent than stores
- a store-load fence replaced by a read-modify-write is just as expensive on many processors today
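As a rough illustration (compiler behavior varies; this is an observation, not part of the slides), the cost asymmetry shows up in how mainstream x86-64 compilers typically translate atomic operations:

    #include <atomic>

    std::atomic<int> a{0};

    // Sequentially consistent store: commonly compiled to an xchg
    // (or a mov followed by mfence), paying the store-load ordering
    // cost at the store.
    void sc_store()      { a.store(1, std::memory_order_seq_cst); }

    // Sequentially consistent load: typically a plain mov, since x86-TSO
    // already orders loads strongly enough.
    int  sc_load()       { return a.load(std::memory_order_seq_cst); }

    // Relaxed store: a plain mov, with no implied fence.
    void relaxed_store() { a.store(1, std::memory_order_relaxed); }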

Problematic Examples

It is common to use counters that are frequently incremented but only read after all threads complete.
- problem: this requires that all memory updates performed prior to a counter update become visible after any later counter update in another thread

An atomic store then requires two fences, one before and one after the store.
- one can imagine examples where memory updates before or after the store could safely be reordered
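For such counters, the low-level atomics introduced on the following slides let the programmer drop the unneeded ordering. A minimal sketch (not from the slides): the counter is read only after all workers have been joined, so relaxed increments suffice.

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    std::atomic<long> hits{0};

    void worker() {
        for (int i = 0; i < 1000000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed);  // no fences needed
    }

    int main() {
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
        for (auto& t : threads) t.join();   // joining orders the final read
        std::printf("%ld\n", hits.load(std::memory_order_relaxed));
    }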

The Case for Low-Level Atomics
- the current model provides too much safety for some use cases
- expert programmers need a way to relax the model when their code does not require as much safety, in order to maximize performance
- we don't want to make the memory model hard to reason about for the rest of us

Low-Level Atomics

Why? To enable expert programmers to maximize performance.

What? An operation on an atomic variable can be explicitly parameterized with its memory ordering constraints.
- e.g., x.load(memory_order_relaxed) allows the instruction to be reordered with other memory operations; the load is never an acquire operation, hence does not contribute to the synchronizes-with ordering
- for read-modify-write operations, the programmer can specify whether the operation acts as an acquire, a release, neither, or both
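A small sketch of the interface (not from the slides) showing the orderings a program can request:

    #include <atomic>

    std::atomic<int> x{0};

    void examples() {
        // Relaxed load: may be reordered with other memory operations and
        // never acts as an acquire, so it creates no synchronizes-with edges.
        int r = x.load(std::memory_order_relaxed);

        // For read-modify-write operations the programmer chooses whether
        // the operation acts as an acquire, a release, neither, or both.
        x.fetch_add(1, std::memory_order_relaxed);   // neither
        x.fetch_add(1, std::memory_order_acquire);   // acquire only
        x.fetch_add(1, std::memory_order_release);   // release only
        x.fetch_add(1, std::memory_order_acq_rel);   // both
        (void)r;
    }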

Additional Definitions

Happens-before (HB)
- if a is sequenced before b, then a happens before b
- if a synchronizes with b, then a happens before b
- if a happens before b and b happens before c, then a happens before c

Type 2 data race
- two conflicting data accesses to the same memory location that are unordered by happens-before
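A standard release/acquire message-passing sketch (not from the slides) showing how a synchronizes-with edge creates the happens-before order that makes the data access race free:

    #include <atomic>
    #include <thread>
    #include <cassert>

    int data = 0;                       // ordinary variable
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;                                      // (a) sequenced before (b)
        ready.store(true, std::memory_order_release);   // (b)
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {}  // (c) synchronizes with (b)
        assert(data == 42);   // (a) happens before this read: no type 2 data race
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join(); c.join();
    }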

C++ Memory Model for Low-Level Atomics
- if a program (on a given input) has a consistent execution with a type 2 data race, then its behavior is undefined
- otherwise, the program (on the same input) behaves according to one of its sequentially consistent executions
- a theorem shows that this is equivalent to the previous model phrased in terms of type 1 data races

Java/C++ Comparison

Java's primary goal is safety and security
- a large effort is made to give semantics even to code with data races

C++'s focus is performance
- no such safety concerns: the semantics of code with data races is ignored (left undefined)
- low-level atomics enable performance fine-tuning

Summary
- multithreaded programs are needed to harness the power of modern multicore architectures, so a semantics for multithreaded execution is essential; previously, the C++ memory model was ambiguous
- the C++ memory model provides well-defined semantics for data-race-free programs
  - high performance, by leaving programs with data races undefined
  - low-level atomics for expert programmers to squeeze out maximum performance
  - compilers may assume that ordinary variables don't change asynchronously
- remaining problems: adjacent bitfields