Using Weakly Ordered C++ Atomics Correctly. Hans-J. Boehm

Similar documents
Can Seqlocks Get Along with Programming Language Memory Models?

Foundations of the C++ Concurrency Memory Model

Nondeterminism is Unavoidable, but Data Races are Pure Evil

CS510 Advanced Topics in Concurrency. Jonathan Walpole

The C/C++ Memory Model: Overview and Formalization

C++ Memory Model. Don t believe everything you read (from shared memory)

Other consistency models

Chenyu Zheng. CSCI 5828 Spring 2010 Prof. Kenneth M. Anderson University of Colorado at Boulder

C++ 11 Memory Consistency Model. Sebastian Gerstenberg NUMA Seminar

The New Java Technology Memory Model

<atomic.h> weapons. Paolo Bonzini Red Hat, Inc. KVM Forum 2016

The Java Memory Model

Synchronization COMPSCI 386

RCU in the Linux Kernel: One Decade Later

What is uop Cracking?

Engineering Robust Server Software

Shared Memory Programming with OpenMP. Lecture 8: Memory model, flush and atomics

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Memory Consistency Models

Relaxed Memory-Consistency Models

Cache Coherence and Atomic Operations in Hardware

The Java Memory Model

G52CON: Concepts of Concurrency

New Programming Abstractions for Concurrency in GCC 4.7. Torvald Riegel Red Hat 12/04/05

Module 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency

CS 241 Honors Concurrent Data Structures

P1202R1: Asymmetric Fences

CS5460: Operating Systems

CSE 160 Lecture 7. C++11 threads C++11 memory model

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue

An introduction to weak memory consistency and the out-of-thin-air problem

Multiprocessors and Locking

Programmazione di sistemi multicore

Prof. Dr. Hasan Hüseyin

Dealing with Issues for Interprocess Communication

MULTITHREADING AND SYNCHRONIZATION. CS124 Operating Systems Fall , Lecture 10

Review of last lecture. Goals of this lecture. DPHPC Overview. Lock-based queue. Lock-based queue

New Programming Abstractions for Concurrency. Torvald Riegel Red Hat 12/04/05

M4 Parallelism. Implementation of Locks Cache Coherence

The C++ Memory Model. Rainer Grimm Training, Coaching and Technology Consulting

Unit 12: Memory Consistency Models. Includes slides originally developed by Prof. Amir Roth

Programming Language Memory Models: What do Shared Variables Mean?

Symmetric Multiprocessors: Synchronization and Sequential Consistency

CS 31: Introduction to Computer Systems : Threads & Synchronization April 16-18, 2019

CS533 Concepts of Operating Systems. Jonathan Walpole

Potential violations of Serializability: Example 1

Sequential Consistency & TSO. Subtitle

Using Relaxed Consistency Models

G52CON: Concepts of Concurrency

Threads Cannot Be Implemented As a Library

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 4 Shared-Memory Concurrency & Mutual Exclusion

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

Overview: Memory Consistency

Synchronization. CS61, Lecture 18. Prof. Stephen Chong November 3, 2011

Audience. Revising the Java Thread/Memory Model. Java Thread Specification. Revising the Thread Spec. Proposed Changes. When s the JSR?

CSE 332: Data Structures & Parallelism Lecture 17: Shared-Memory Concurrency & Mutual Exclusion. Ruth Anderson Winter 2019

Threading and Synchronization. Fahd Albinali

Agenda. Highlight issues with multi threaded programming Introduce thread synchronization primitives Introduce thread safe collections

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Threads and Java Memory Model

Distributed Systems. Distributed Shared Memory. Paul Krzyzanowski

Shared Memory Consistency Models: A Tutorial

Interprocess Communication By: Kaushik Vaghani

C++ Memory Model. Martin Kempf December 26, Abstract. 1. Introduction What is a Memory Model

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

C11 Compiler Mappings: Exploration, Verification, and Counterexamples

Coherence and Consistency

CMSC 433 Programming Language Technologies and Paradigms. Synchronization

Advanced OpenMP. Memory model, flush and atomics

Models of concurrency & synchronization algorithms

A Simple Example. The Synchronous Language Esterel. A First Try: An FSM. The Esterel Version. The Esterel Version. The Esterel Version

CS4961 Parallel Programming. Lecture 2: Introduction to Parallel Algorithms 8/31/10. Mary Hall August 26, Homework 1, cont.

10/17/ Gribble, Lazowska, Levy, Zahorjan 2. 10/17/ Gribble, Lazowska, Levy, Zahorjan 4

[537] Locks. Tyler Harter

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis

SPIN, PETERSON AND BAKERY LOCKS

CSE332: Data Abstractions Lecture 23: Programming with Locks and Critical Sections. Tyler Robison Summer 2010

The Java Memory Model

Safety and liveness for critical sections

CS-537: Midterm Exam (Spring 2001)

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Reasoning about the C/C++ weak memory model

Synchronization Principles

Multithreading done right? Rainer Grimm: Multithreading done right?

G Programming Languages Spring 2010 Lecture 13. Robert Grimm, New York University

Lowering C11 Atomics for ARM in LLVM

Thread-Local. Lecture 27: Concurrency 3. Dealing with the Rest. Immutable. Whenever possible, don t share resources

Memory barriers in C

CSE 374 Programming Concepts & Tools

Beyond Sequential Consistency: Relaxed Memory Models

SYNCHRONIZATION M O D E R N O P E R A T I N G S Y S T E M S R E A D 2. 3 E X C E P T A N D S P R I N G 2018

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

Shared Memory Consistency Models: A Tutorial

Synchronization in Java

Concurrency. Glossary

Synchronization SPL/2010 SPL/20 1

Background. Old Producer Process Code. Improving the Bounded Buffer. Old Consumer Process Code

Lecture 13 The C++ Memory model Synchronization variables Implementing synchronization

Transcription:

Using Weakly Ordered C++ Atomics Correctly Hans-J. Boehm 1

Why atomics? Programs usually ensure that memory locations cannot be accessed by one thread while being written by another. No data races. Typically we acquire locks when accessing shared variables: void inc_shared() { lock_guard<mutex> _(mtx); x++; } +: Operates at different granularities: Can correctly handle complex operations. -: Lock ordering/deadlock concerns.¹ Doesn t work in signal/interrupt handlers. Performance may be suboptimal. Preempted thread can block others. ¹ See also transactional memory specification. 2

Why atomics? (2) Pre-C++11 experience: If all accesses need to be protected by locks, programmers invent ways to evade the rules: Usually dubious ones: volatile + assembly code Sometimes terrible ones: Racing accesses that invalidate compiler assumptions. Gains are sometimes imagined, sometimes real. (See Fedor Pikus talk.) 3

C++ atomics design Declaring a location as atomic<t> allows concurrent access. Individual operations on atomic<t> appear indivisible: Another thread sees them as completed or not done at all. By default, you get interleaving sequentially consistent semantics so long as there are no data races (no non-atomic location is accessed while being modified)¹ It appears that at each time step an arbitrary thread performs next operation. ¹ And you avoid some really silly things. Like implementing Dekker s mutual exclusion algorithm using std::get_terminate() and std::set_terminate(). 4

C++ sequentially consistent atomics Behave as though a mutex were protecting a single operation. As easy to use as a mutex, but only if all protected critical sections naturally contain a single operation. A bit faster than a mutex in these easy cases. On the other hand: There is much research on lock-free algorithms. Its goal is to decompose more complex atomic operations into such single variable atomic operations, preserving appearance of atomicity. This is hard. Many research publications have been buggy. 5

Requirements on sequentially consistent atomics Very roughly, sequentially consistent atomics need to ensure that: I: Operation is indivisible. Usually free for loads and stores of small, well-aligned, operands. S: Stores become visible to other threads after prior memory operations. Basically free on x86. Requires memory fence a.k.a. memory barrier on ARMv7. L: Loads must complete before subsequent memory accesses take effect. Basically free on x86. Requires memory fence on ARMv7. SL: atomic stores are not reordered with subsequent atomic loads. Requires memory fence on ARMv7 and x86. Universally expensive. 6

Sequentially consistent atomics, MessagePassing example int x; atomic<bool> x_init(false); Thread 1: Thread 2: x = 17; x_init = true; if (x_init) { assert(x == 17); } Ensures that x = 17 is visible to assert call. Prevents reordering of expressions in either thread. Needs S & L (& I), not SL 7

Sequentially consistent atomics, Dekker s example atomic<bool> x(false), y(false); Thread 1: Thread 2: x.store(true); if (!y.load()) turn_ew_lights_green(); y.store(true); if (!x.load()) turn_ns_lights_green(); Ensures that one of the store calls goes first, and is visible to other thread. Prevents store load reordering in either thread. Needs SL. 8

The cost of sequentially consistent atomics On simple microbenchmarks, a fence typically costs 2 to 200 cycles. 20-30 typical on modern processors? Cheap compared to cache miss, e.g. memory contention between threads. Expensive compared to most instructions. Dominates atomics cost in the absence of contention. Significant factor in mutex cost. Can easily make clever lock-free code slower than locking. On x86, a fence is only needed for stores. Loads are sufficiently ordered anyway. Atomic RMW operations already include the fence. On ARMv7, Load: one fence, store or RMW: two fences. ARMv8 essentially has instructions for SC atomics. 9

Important note on sequential consistency costs In my experience: Extra fences usually add thread-local overhead. Overhead tends to be hidden by any other contention. Scalability may improve ; performance will decrease. ACM RACES 12 10

Reducing the cost: Weakly ordered atomics Cost of sequentially ordered atomics is not always necessary. C++ provides weakly ordered atomics, that sacrifice ordering guarantees. memory_order_acquire, memory_order_release, memory_order_acq_rel: Sacrifice SL ordering. Avoids fence after store, allows some fences to be weakened. memory_order_relaxed: Sacrifice SL, S, L ordering (leaving only I). Accesses to the same location can still not be reordered. Saves fences before store, after load, on ARM. 11

The ugly side of weakly ordered atomics Extreme complexity. The rules are not obvious. They re often downright surprising. And not even well understood. The committee still hasn t figured out how to define memory_order_relaxed. and I m not even going to talk about memory_order_consume. And use of weakly ordered atomics in a library is often visible to clients. It doesn t hide well. 12

My advice If possible, use mutexes (or possible transactional memory). (Relatively) simple atomicity at any granularity. If all critical sections accessing a variable involve a single access: Use sequentially consistent atomics. If you need variable access from signal handlers: Very carefully use sequentially consistent atomics. If you measured and the result is really too slow: Consider more complicated lock-free algorithms. Remembering that this may not help. Use weakly ordered atomics. Remembering that they re a bug magnet. 13

Rest of talk 1. Weakly-ordered atomics pitfalls 2. Weakly-ordered atomics recipes 14

Atomics: Pitfall 1 atomic<int> x; x = x + 1; is not the same as x++;! Only individual accesses are atomic! The former is not a single access! Only individual atomic<t> member function calls count as single accesses. Writing code so that multiple atomic accesses appear indivisible is hard! 15

Weakly ordered atomics: Pitfall 1 Ordering constraints are subtle! See reference counting example later. In my experience, this is a VERY common source of errors. 16

Weakly ordered atomics: Pitfall 2 Subtle interaction with other atomics: Thread 1: Thread 2: x.store(true, memory_order_release); if (!y.load()) turn_ew_lights_green(); y.store(true, memory_order_release); if (!x.load()) turn_ns_lights_green(); release stores can be reordered with seq_cst loads! SL reordering guarantee only applies to seq_cst ops! 17

Weakly ordered atomics: Pitfall 3 Subtle interaction with locks: Thread 1: Thread 2: x.store(memory_order_relaxed); { lock_guard<mutex> _(m1); } if (!y.load(memory_order_relaxed)) turn_ew_lights_green(); y.store(memory_order_relaxed); { lock_guard<mutex> _(m2); } if (!x.load(memory_order_relaxed)) turn_ns_lights_green(); Critical sections do not act as a memory fence! Memory model allows movement into critical sections. (Does work if both both mutexes are the same.) 18

Pitfalls 2 and 3 are getting bigger: These have traditionally worked because: Compilers did not reorder across synchronization. Synchronization operations where usually implemented as hardware fences. Both are now false! Compilers always get more aggressive. ARMv8 provides non-fence-like ordering primitives. Your phone probably uses them. 19

Weakly ordered atomics: Pitfall 4 Dependencies do not enforce ordering w.r.t. threads: ptr = x.load(memory_order_relaxed); if (ptr == y) { // hardware predicts condition true; result = ptr->field.load(); // compiler transforms to y->field // Both loads issued concurrently; second may complete first. } 20

General principles for weakly ordered atomics Acquire release: Ensure that memory operations preceding a release store are visible to (happen before) code reading the stored value with an acquire load. Other reasoning about operation ordering remains invalid. Relaxed: Only ensures ordering for operations on that one memory location. Cannot assume anything about the rest of state from loaded value. (Currently makes code resistent to formal verification.) Consume: Avoid (for now). 21

Some correct use recipes General use of weakly ordered atomics is hard. You need to fully understand the memory model! But here are some common cases. 22

Single word data structures If the contents of a single-word data structure are not relied upon for other computations while it is being modified, it s OK to use memory_order_relaxed. Example: 1. Counter that s only read after threads join. Use counter.fetch_add(1, memory_order_relaxed); 2. Accumulate information in a short bit vector. E.g. bit_set.fetch_or(1 << new_element, memory_order_relaxed); Warning sign: Caring about the result. 23

Computing a guess for compare_exchange Sometimes the value of a load just doesn t affect correctness: old = x.load(memory_order_relaxed); while (x.compare_exchange_weak(old, foo(old)) {} This would remain correct if we replaced the first line by old = 42; We re clearly not relying on the load value to tell us anything about the state. 24

Non-racing accesses Some accesses to atomic<t> just don t need to be atomic. Example: Double-checked locking: atomic<boolean> x_init(false); if (!x_init) { lock_guard<mutex> _(x_init_mtx); if (!x_init.load(memory_order_relaxed)) { initialize x; x_init = true; } } 25

Simple one-way communication int x; atomic<bool> x_init(false); Thread 1: Thread 2: x = 17; x_init.store(true, memory_order_release); if (x_init.load(memory_order_acquire) { assert(x == 17); } Requires only that x = 17, from before the store, is visible after the load. No presumption that store must happen before something else. 26

Double-checked locking again atomic<boolean> x_init(false); if (!x_init.load(memory_order_acquire)) { lock_guard<mutex> _(x_init_mtx); if (!x_init.load(memory_order_relaxed)) { initialize x; x_init.store(true, memory_order_release); } } Requires only that the correct value of x be seen after an execution that sees x_init == true. 27

Implementing simple locks Use standard lock if available! If lock entry and exit require a single atomic operation, then it suffices to prevent critical section operations from moving out of the critical section: (Warning: world s dumbest spin-lock!) Lock acquisition: while (lock.exchange(true, memory_order_acquire)) {} Lock release: lock.store(false, memory_order_release); 28

Some complex examples The next two require more complex reasoning, but are also fairly common. Deriving solutions like this requires thorough memory model understanding. Copying them not so much. 29

Reference counting You should probably use shared_ptr, but Increment: count.fetch_add(1, memory_order_relaxed); Decrement: if (count.fetch_sub(1, memory_order_acq_rel) == 1) deallocate; unique() : memory_order_acquire? Client ensures all objects are visible before final decrement. Each decrement ensures that prior operations are visible to next decrement. All operations happen before the final deallocation. 30

Note on reference counting The usual recommendation is to instead use: if (count.fetch_sub(1, memory_order_release) == 1) { atomic_thread_fence(memory_order_acquire); deallocate; } This is correct for similar, but even more subtle reasons. A bit faster on ARMv7 and Power. Roughly the same on x86 and ARMv8. I just dislike fences. 31

Reading with a version counter (seqlocks) Scenario: Data is written very rarely, but read frequently. Avoid locking for readers. atomic<unsigned int> version(0); Writer: { lock_guard _(m); version++; write data; version++; } Unoptimized reader: do { unsigned int v1 = version; read data; unsigned int v2 = version; } while ((v1 & 1)!= 0 v1!= v2); 32

Seqlocks continued Problem: Data accesses race! Read values are discarded when they do. But still introduces undefined behavior. Data must also be atomic! Optimized reader: do { unsigned int v1 = version.load(memory_order_acquire); foo_1 my_data_1 = data_1.load(memory_order_relaxed); foo_n my_data_n = data_n.load(memory_order_relaxed); atomic_thread_fence(memory_order_acquire); unsigned int v2 = version.load(memory_order_relaxed); // corrected after talk. } while ((v1 & 1)!= 0 v1!= v2); 33

Conclusions Weakly ordered atomics are hard! Avoid when possible! A small number of recipes seem to cover a significant fraction of use cases. For anything else, you need to fully understand the memory model! References: Fairly quick and slightly dated overview: WG14/N1479. Thorough mathematical discussion: Chapter 3 in Mark Batty s thesis. And there s always the standard itself. But it s not clear this is best expressed in standardese. 34

35