Implementing the C11 memory model for ARM processors. Will Deacon February 2015

Will Deacon <will.deacon@arm.com>

Introduction

- ARM ships intellectual property, specialising in RISC microprocessors
- Over 50 billion chips shipped, around 2.5 billion per quarter
- Upstream kernel developer at ARM, Cambridge (UK!)
- Enable new architectural features in Linux before silicon
- Influence future hardware designs with feedback and prototypes

I'm going to talk about memory models, which form a crucial part of low-level system architecture and are needed to ensure portability of high-level multi-threaded user code.

What is memory ordering? (1)

We expect a single CPU, executing a single thread of execution, to operate in program order. This is easy to reason about but terribly slow!
- Prohibits common compiler transformations (e.g. hoisting)
- Forbids common hardware optimisations (e.g. store buffers and caches)
- Increases memory subsystem bottleneck

Instead, allow the program to run out-of-order as long as the programmer can't tell.

What is memory ordering? (2)

We can't have our cake and eat it. With multiple CPUs, we can observe many of the tricks being played on us!

SB (Dekker's) - Initially: A = B = 0

    p0              p1
    a: A = 1;       c: B = 1;
    b: C = B;       d: D = A;

Results:
- (C, D) == (1, 1)
- (C, D) == (0, 1)
- (C, D) == (1, 0)
- (C, D) == (0, 0)

Question: What can cause this apparent reordering in practice?

Store buffering

Buffering allows a variable to be live in multiple locations at once.

The memory model defines the set of permitted behaviours that may be observed in a system. Interesting cases are expressed as litmus tests.

Sequential Consistency

Sequential consistency is easy to reason about, as there is a single global ordering.

Sequential Consistency (SC): "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." Leslie Lamport (1979)

Question: How does this constrain SB?

SB+SC

No valid interleaving gives (C, D) == (0, 0), so that result is forbidden by SC.

    p0              p1
    a: A = 1;       c: B = 1;
    b: C = B;       d: D = A;

Interleavings that respect program order, with results (C, D):
- (a,b,c,d) : (0,1)
- (c,d,a,b) : (1,0)
- (a,c,b,d) : (1,1)
- ...

Orderings that would give (0,0). Each one places b before a or d before c, so it violates program order:
- (b,c,d,a), (b,d,c,a), (b,d,a,c), (d,b,c,a), (d,b,a,c), (d,a,b,c)
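The enumeration above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it encodes the operations a, b, c, d as 0..3, walks all 24 orderings, keeps only those that respect program order, and counts the (C, D) results. The names sb_run, sb_respects_program_order and sb_enumerate are hypothetical helpers invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* seen[C][D] counts the program-order-respecting interleavings giving (C, D). */
struct sb_outcomes {
    int seen[2][2];
};

/* Execute a, b, c, d (encoded as 0..3) in the given order on fresh memory. */
void sb_run(const int order[4], int *C, int *D)
{
    int A = 0, B = 0;

    *C = 0;
    *D = 0;
    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case 0: A = 1;  break;  /* a: A = 1 */
        case 1: *C = B; break;  /* b: C = B */
        case 2: B = 1;  break;  /* c: B = 1 */
        case 3: *D = A; break;  /* d: D = A */
        }
    }
}

/* Program order requires a before b (p0) and c before d (p1). */
bool sb_respects_program_order(const int order[4])
{
    int pos[4] = { 0, 0, 0, 0 };

    for (int i = 0; i < 4; i++)
        pos[order[i]] = i;
    return pos[0] < pos[1] && pos[2] < pos[3];
}

/* Walk all 4^4 index tuples, keep the permutations, then keep the SC ones. */
struct sb_outcomes sb_enumerate(void)
{
    struct sb_outcomes out = { { { 0, 0 }, { 0, 0 } } };
    int o[4];

    for (o[0] = 0; o[0] < 4; o[0]++)
    for (o[1] = 0; o[1] < 4; o[1]++)
    for (o[2] = 0; o[2] < 4; o[2]++)
    for (o[3] = 0; o[3] < 4; o[3]++) {
        unsigned mask = (1u << o[0]) | (1u << o[1]) | (1u << o[2]) | (1u << o[3]);
        int C, D;

        if (mask != 0xFu)
            continue;           /* not a permutation of a, b, c, d */
        if (!sb_respects_program_order(o))
            continue;           /* not a valid SC interleaving */
        sb_run(o, &C, &D);
        assert(C >= 0 && C <= 1 && D >= 0 && D <= 1);
        out.seen[C][D]++;
    }
    return out;
}
```

Of the six interleavings that survive the program-order filter, one gives (0,1), one gives (1,0), four give (1,1) and none gives (0,0), which matches the claim that SC forbids that result.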

C11

"Threads cannot be implemented as a library." Hans Boehm [1]

The C11 standard introduced native threads and atomic operations:
- Atomic types (atomic_int) defined in <stdatomic.h>
- Atomic operations such as atomic_compare_exchange_strong, atomic_load
- Formal relations to describe ordering and data races

This necessitates a memory model, the default behaviour being sequential consistency for data-race-free programs (SC-DRF).

memory_order_*

Maintaining SC is expensive! Atomic operations are parameterised by enum memory_order for finer-grained control:
- memory_order_seq_cst: sequential consistency (default)
- memory_order_acq_rel: RCpc [2] LOAD-acquire/STORE-release semantics
- memory_order_consume: data dependent acquisition [3]
- memory_order_relaxed: no inter-thread synchronisation

This provides a portable mechanism to expose weakly-ordered hardware to applications for optimal performance.

    atomic_int foo;
    ...
    return atomic_load_explicit(&foo, memory_order_acquire);
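As a rough illustration of how the memory_order parameter is used in practice, here is a small sketch built only on standard <stdatomic.h> calls. The counter hits and the function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical event counter. Static storage, so it starts at zero. */
atomic_int hits;

/* Relaxed: the increment is atomic but implies no inter-thread ordering. */
int count_hit(void)
{
    int previous = atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);

    assert(previous >= 0);
    return previous + 1;
}

/* The non-_explicit operations default to memory_order_seq_cst. */
int read_hits_sc(void)
{
    return atomic_load(&hits);
}

/* Acquire: later accesses in this thread are ordered after this load. */
int read_hits_acquire(void)
{
    return atomic_load_explicit(&hits, memory_order_acquire);
}
```

A relaxed counter like this is a reasonable fit when only the final total matters. Anything that publishes other data through the atomic needs at least release/acquire.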

Relations

The C11 memory model can be described by a series of relations:
- sequenced-before (sb)
- reads-from (rf)
- synchronizes-with (sw)
- happens-before (hb)
- and some complications thanks to consume (cad, dob, ithb)

All writes to an atomic variable form part of a total modification order mo for that location, consistent with hb.

We'll focus on SC and acquire/release operations, ignoring consume.

sequenced-before (sb)

Describes intra-thread evaluation order and applies to operations on arbitrary types.

    static int x;
    static int y;

    /* The store to x is sequenced-before the store to y */
    int main(void)
    {
        x = 1;
        y = 2;
        return 0;
    }

Matches the single-threaded intuition already present in the language.

reads-from (rf)

An operation reads a value written by another.
- Not strictly defined by the standard, but useful as a building block
- Can be applied to arbitrary types
- Can be applied between threads for atomic types
- No ordering implications on its own

    static atomic_int x;

    /* T1's load of x reads-from T0's store iff y == 1 */
    void t0(void) { x = 1; }
    void t1(void) { int y = x; ... }

synchronizes-with (sw)

Applies only to operations on atomic types and is defined differently for each family of memory orderings. If A and B are atomic operations of the specified memory order, then:
- SC: A sw B if A rf B
- acquire/release: A sw B if A rf B and A is a release and B is an acquire

or

$sw = (W_{sc} \xrightarrow{rf} R_{sc}) \cup (W_{rel} \xrightarrow{rf} R_{acq})$

Relaxed atomics do not participate in sw!

happens-before (hb)

An operation A happens-before B if A sb B or A sw B (consume adds complications). The relation is transitive, meaning that atomic variables can be used to stitch together thread-local code:

$hb = (sb \cup sw)^+$

A data race exists if a program contains two actions, at least one of which is a write and one of which is not atomic, in different threads on the same memory location, neither of which happens-before the other.

A program exhibiting such a race has undefined behaviour (SC-DRF).

modification-order (mo)

The modification-order of an atomic variable indicates the sequence of visible side-effects (i.e. writes) to that variable. It is a single total order consistent with hb and is strictly per-location.

    static atomic_int x;

    /* MO of x is either {0, 1, 2} or {0, 2, 1} */
    void t0(void) { x = 1; }
    void t1(void) { x = 2; }

Additionally, there is a single total order sc on all SC operations that is consistent with hb and mo.
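The two-writer example can be run as a real program. This is a minimal sketch assuming POSIX threads are available; mo_x, mo_t0, mo_t1 and mo_final_value are hypothetical names. The value left in the variable after both threads finish is the last entry in its modification order for that run, so it is 1 or 2 but never anything else.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

atomic_int mo_x;   /* x from the slide. Static storage, so it starts at zero. */

void *mo_t0(void *arg) { (void)arg; atomic_store(&mo_x, 1); return NULL; }
void *mo_t1(void *arg) { (void)arg; atomic_store(&mo_x, 2); return NULL; }

/* Run both writers once and return the last value in x's modification order. */
int mo_final_value(void)
{
    pthread_t a, b;
    int final;

    atomic_store(&mo_x, 0);
    if (pthread_create(&a, NULL, mo_t0, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, mo_t1, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    final = atomic_load(&mo_x);
    assert(final == 1 || final == 2);   /* MO is {0, 1, 2} or {0, 2, 1} */
    return final;
}
```

Which of the two values wins depends on scheduling, so repeated runs may differ.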

Formal tools

Recall the SB example:

    int main() {
      atomic_int x=0; atomic_int y=0;
      {{{ { y.store(1,memory_order_seq_cst);
            r1=x.load(memory_order_seq_cst); }
      ||| { x.store(1,memory_order_seq_cst);
            r2=y.load(memory_order_seq_cst); }
      }}}
      return 0;
    }

We can feed this to a formal model and visualise the set of consistent executions.
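The CppMem input above uses the tool's own parallel-composition syntax. As a complement, the same SB test can be written as ordinary C11 and run on real hardware. The sketch below assumes POSIX threads, and the names sb_state, sb_sc_thread0, sb_sc_thread1 and sb_sc_count_both_zero are hypothetical. A run like this can only ever sample executions; it does not replace the formal model, which enumerates all of them.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct sb_state {
    atomic_int x, y;
    int r1, r2;     /* written by one thread each, read after pthread_join */
};

void *sb_sc_thread0(void *arg)
{
    struct sb_state *s = arg;

    atomic_store_explicit(&s->y, 1, memory_order_seq_cst);
    s->r1 = atomic_load_explicit(&s->x, memory_order_seq_cst);
    return NULL;
}

void *sb_sc_thread1(void *arg)
{
    struct sb_state *s = arg;

    atomic_store_explicit(&s->x, 1, memory_order_seq_cst);
    s->r2 = atomic_load_explicit(&s->y, memory_order_seq_cst);
    return NULL;
}

/* Run SB `iters` times and count how often r1 == r2 == 0 was observed. */
int sb_sc_count_both_zero(int iters)
{
    int both_zero = 0;

    for (int i = 0; i < iters; i++) {
        struct sb_state s;
        pthread_t t0, t1;

        atomic_init(&s.x, 0);
        atomic_init(&s.y, 0);
        s.r1 = s.r2 = -1;
        if (pthread_create(&t0, NULL, sb_sc_thread0, &s) != 0)
            return -1;
        if (pthread_create(&t1, NULL, sb_sc_thread1, &s) != 0) {
            pthread_join(t0, NULL);
            return -1;
        }
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        assert(s.r1 != -1 && s.r2 != -1);   /* both threads ran to completion */
        if (s.r1 == 0 && s.r2 == 0)
            both_zero++;
    }
    return both_zero;
}
```

With memory_order_seq_cst on every access the count must stay at zero, because no single total order over the four operations yields r1 == r2 == 0.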

CppMem

[Figure: CppMem execution graph for SB with seq_cst. Events: a:Wna x=0, b:Wna y=0, c:Wsc y=1, d:Rsc x=0, e:Wsc x=1, f:Rsc y=1. Edges are labelled sb, hb, sw, rf, mo and sc.]

http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/index.html

SB+acq+rel

We can modify SB to use acquire/release:

    int main() {
      atomic_int x=0; atomic_int y=0;
      {{{ { y.store(1,memory_order_release);
            r1=x.load(memory_order_acquire); }
      ||| { x.store(1,memory_order_release);
            r2=y.load(memory_order_acquire); }
      }}}
      return 0;
    }

r1 == r2 == 0 is now permitted.

Acquire/release

[Figure: CppMem execution graph for SB with acquire/release. Events: a:Wna x=0, b:Wna y=0, c:Wrel y=1, d:Racq x=0, e:Wrel x=1, f:Racq y=0. Edges are labelled sb, hb, sw, rf and mo.]

Acquire/release in C11 is not SC!

Message Passing (1)

Passing messages between a producer and a consumer thread is ideally suited to acquire/release:

MP

    int main() {
      int x=0; atomic_int y=0;
      {{{ { x=1;
            y.store(1,memory_order_release); }
      ||| { r1=y.load(memory_order_acquire).readsvalue(1);  /* If we read y == 1 */
            r2=x; }                                         /* Then we must read x == 1 */
      }}}
      return 0;
    }

y is an atomic flag indicating the validity of the data x.

Message Passing (2)

Only one consistent execution:

[Figure: CppMem execution graph for MP. Events: a:Wna x=0, b:Wna y=0, c:Wna x=1, d:Wrel y=1, e:Racq y=1, f:Rna x=1. Edges are labelled sb, hb, sw, rf and mo.]

Question: What happens if e reads y == 0 instead?

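Message Passing (3)

Data race on x between c and f!

[Figure: CppMem execution graph for MP where the flag is read as 0. Events: a:Wna x=0, b:Wna y=0, c:Wna x=1, d:Wrel y=1, e:Racq y=0, f:Rna x=0. Edges are labelled sb, hb, sw, rf and mo, with a dr (data race) edge between c and f.]

Intuition: don't read data if !valid.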
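The "don't read data if !valid" rule translates directly into C11. Below is a minimal sketch assuming POSIX threads; mp_data plays the role of x, mp_valid the role of y, and all function names are hypothetical. The consumer touches the payload only after its acquire load has read 1, so the non-atomic read is ordered after the producer's write by hb and the race from the previous slide cannot occur.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

int mp_data;            /* plain, non-atomic payload (x in the slides) */
atomic_int mp_valid;    /* atomic flag guarding the payload (y in the slides) */

void *mp_producer(void *arg)
{
    (void)arg;
    mp_data = 42;                                               /* c: Wna x  */
    atomic_store_explicit(&mp_valid, 1, memory_order_release);  /* d: Wrel y */
    return NULL;
}

void *mp_consumer(void *arg)
{
    int *out = arg;

    /* e: spin until the acquire load reads-from the release store. */
    while (atomic_load_explicit(&mp_valid, memory_order_acquire) == 0)
        ;
    *out = mp_data;     /* f: safe, the producer's write happens-before this */
    return NULL;
}

/* Run one producer/consumer exchange and return what the consumer saw. */
int mp_run(void)
{
    pthread_t prod, cons;
    int seen = -1;

    mp_data = 0;
    atomic_store(&mp_valid, 0);
    if (pthread_create(&cons, NULL, mp_consumer, &seen) != 0)
        return -1;
    if (pthread_create(&prod, NULL, mp_producer, NULL) != 0) {
        atomic_store(&mp_valid, 1);     /* release the spinning consumer */
        pthread_join(cons, NULL);
        return -1;
    }
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    assert(seen == 42);
    return seen;
}
```

Reading mp_data before checking the flag, or checking it with memory_order_relaxed and no fence, would reintroduce the data race and with it undefined behaviour.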

Acquire/release instructions

ARMv8 introduced native LDAR and STLR instructions:
- LDAR <Xt>, [<Xn>]: ordered against subsequent accesses in program order
- STLR <Xt>, [<Xn>]: ordered against prior accesses in program order and any prior observed writes

A variation on MP, memory initialised to zero:

    LDR  X0, [Xa, #4]           1: LDAR X0, [Xa]
    ADD  X0, X0, #1                CBZ  X0, 1b
    STR  X0, [Xa, #4]              LDR  X1, [Xa, #4]
    STLR #1, [Xa]

Result: X1 == 1

Looks an awful lot like hb!

SC acquire/release

The fun doesn't stop here:
- An STLR followed by an LDAR is globally observed in program order
- STLR is multi-copy atomic when observed by LDAR

A variation on SB, memory initialised to zero:

    STLR #1, [Xa]               STLR #1, [Xb]
    LDAR X0, [Xb]               LDAR X1, [Xa]

Result: X0 == X1 == 0 forbidden

Unlike C11 acquire/release, LDAR and STLR provide SC when paired and map directly onto memory_order_seq_cst.

             Rsc     Wsc     Racq    Wrel
    ARMv8    LDAR    STLR    LDAR    STLR

Conclusion

We've only scratched the surface of the C11 and ARM memory models:
- Compound atomic operations (cmpxchg)
- memory_order_consume
- Explicit fences

However, there is a deliberate mapping from ARMv8 to C11 SC and formal tools for the ARM model are under active development.

Thank You

[1] Hans-J. Boehm. Threads Cannot be Implemented as a Library. 2004.
[2] K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. 1995.
[3] Paul McKenney et al. N4215: Towards Implementation and Use of memory_order_consume. 2014.

http://www.arm.com/careers/

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

Sequential Consistency (2)

[Figure: programs A, B and C running in parallel on p0, p1 and p2, interleaved into a single sequential order.]

SC example

SC is easy to reason about, as there is a single global ordering. This can be demonstrated by the IRIW litmus test:

X = Y = 0

    T0          T1          T2            T3
    X = 1       Y = 1       X2 = X        Y3 = Y
                            Y2 = Y        X3 = X

- (X2, Y2) = (1, 0), (X3, Y3) = (0, 1): not permitted
- (X2, Y2) = (0, 1), (X3, Y3) = (1, 0): not permitted

SC actually forbids any reordering of reads and writes.

The ARM weak memory model

The ARM architecture features a relatively weak memory model:
- No multi-copy atomicity requirement (unlike TSO)
- Arbitrary reordering of independent reads and writes
- Explicit barriers and instructions to enforce ordering
- System-level ordering (e.g. MMIO, cache/TLB maintenance, ...)

Like many (all?) other architectures, the memory model is not formally defined and is driven by pragmatism rather than pure mathematics.

Observability

Ordering is defined in terms of observability by memory masters (observers).

Writes: A write to a location in memory is said to be observed by an observer when:
(1) A subsequent read of the location by the same observer will return the value written by the observed write, or written by a write to that location by any observer that is sequenced in the coherence order of the location after the observed write, and
(2) A subsequent write of the location by the same observer will be sequenced in the coherence order of the location after the observed write.

This is actually pretty intuitive...

Observability (2)

...but reads are observable too!

Reads: A read of a location in memory is said to be observed by an observer when a subsequent write to the location by the same observer will have no effect on the value returned by the read.

These definitions clearly have relations with rf and mo.

Global Observability and Completion

- A normal memory access is globally observed for a shareability domain when it is observed by all observers in that domain.
- A table walk is complete for a shareability domain when its accesses are globally observed in that domain and the TLB is updated.
- An access is complete for a shareability domain when it is globally observed in that domain and any table walks associated with it have completed in the same domain.

Much more difficult to correlate with C11, which cares only about application-level ordering.

Explicit barriers

The ARM architecture defines three barrier instructions:
- ISB: pipeline flush and context synchronisation
- DMB <option>: ensure ordering of memory accesses
- DSB <option>: ensure completion of memory accesses

The <option> argument specifies the required shareability domain (NSH, ISH, OSH, SY) and access type (ST). It defaults to full system, all access types, if omitted.

Userspace runs in the same inner-shareable domain.
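From C11, the portable way to request a barrier is atomic_thread_fence. The sketch below shows message passing built from relaxed atomics plus explicit fences. It assumes, without the slides saying so, that an AArch64 compiler will typically lower these fences to DMB instructions in the inner-shareable domain; the exact instruction chosen is up to the compiler. The names payload, ready, publish and try_consume are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

int payload;        /* plain data being published */
atomic_int ready;   /* flag, accessed with relaxed operations only */

void publish(int value)
{
    payload = value;
    /* Release fence: orders the payload write before the flag store.
     * Assumption: typically becomes a DMB on ARM. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

bool try_consume(int *out)
{
    assert(out != 0);
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        return false;   /* not valid yet: do not touch the payload */
    /* Acquire fence: orders the flag load before the payload read. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

In a real program publish and try_consume would run on different threads. The fences give the same ordering as a store-release paired with a load-acquire on the flag, at the cost of ordering more accesses than strictly needed.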

Dependencies

In the absence of explicit barriers, dependencies define observation order of normal memory accesses.
- Address: value returned by a read is used to compute the address of a subsequent access.
- Control: value returned by a read is used to determine the condition flags and the flags are used in the condition code checking that determines the address of a subsequent access.
- Data: value returned by a read is used as data written by a subsequent write.

There are also a few other rules (RaR, store speculation).

Dependency Examples

    (address)
    ldr   r1, [r0, #4]
    and   r1, #0xfff
    ldr   r3, [r2, r1]

    (control)
    ldr   r1, [r0, #4]
    cmp   r1, #1
    addeq r2, #4
    ldr   r3, [r2]

    (data)
    ldr   r1, [r0, #4]
    add   r1, #5
    str   r1, [r2]

Question: Which dependencies enforce ordering of observability?

Mapping to C11

Typically, architectures provide either stronger (x86) or weaker (PowerPC) guarantees than those required by the C11 relaxed memory models:

Architecture SC Acq/rel Relaxed x86 ARMv7 = PowerPC = ia64 = ARMv8 = =

Explicit fences are used to convert into.

             Rsc         Wsc              Racq        Wrel
    ARMv7    LDR; DMB    DMB; STR; DMB    LDR; DMB    DMB; STR
    ARMv8    LDAR        STLR             LDAR        STLR
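To connect the instruction table with source code, the four columns correspond to the following C11 accesses. This is a sketch only: the wrapper names are hypothetical, and the instruction sequences in the comments are copied from the table rather than guaranteed for any particular compiler or version.

```c
#include <assert.h>
#include <stdatomic.h>

/* Rsc.  Table: ARMv7 LDR; DMB        ARMv8 LDAR */
int load_sc(atomic_int *p)
{
    return atomic_load_explicit(p, memory_order_seq_cst);
}

/* Wsc.  Table: ARMv7 DMB; STR; DMB   ARMv8 STLR */
void store_sc(atomic_int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_seq_cst);
}

/* Racq. Table: ARMv7 LDR; DMB        ARMv8 LDAR */
int load_acq(atomic_int *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}

/* Wrel. Table: ARMv7 DMB; STR        ARMv8 STLR */
void store_rel(atomic_int *p, int v)
{
    assert(p != 0);
    atomic_store_explicit(p, v, memory_order_release);
}
```

Compiling these wrappers for an ARMv8 target and inspecting the generated assembly is a quick way to check which mapping a given toolchain actually uses.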