
Implementing the C11 memory model for ARM processors
Will Deacon <will.deacon@arm.com>
February 2015

Introduction
- ARM ships intellectual property, specialising in RISC microprocessors
- Over 50 billion chips shipped, around 2.5 billion per quarter
- Upstream kernel developer at ARM, Cambridge (UK!)
- Enable new architectural features in Linux before silicon
- Influence future hardware designs with feedback and prototypes
I'm going to talk about memory models, which form a crucial part of low-level system architecture and are needed to ensure portability of high-level multi-threaded user code.

What is memory ordering? (1)
We expect a single CPU, executing a single thread of execution, to operate in program order. Easy to reason about but terribly slow!
- Prohibits common compiler transformations (e.g. hoisting)
- Forbids common hardware optimisations (e.g. store buffers and caches)
- Increases memory subsystem bottleneck
Instead, allow the program to run out-of-order as long as the programmer can't tell.

What is memory ordering? (2)
We can't have our cake and eat it. With multiple CPUs, we can observe many of the tricks being played on us!
SB (Dekker's) - Initially: A = B = 0
  p0            p1
  a: A = 1;     c: B = 1;
  b: C = B;     d: D = A;
Results:
  (C, D) == (1, 1)
  (C, D) == (0, 1)
  (C, D) == (1, 0)
  (C, D) == (0, 0)
Question: What can cause this apparent reordering in practice?

Store buffering
Buffering allows a variable to be live in multiple locations at once.
The memory model defines the set of permitted behaviours that may be observed in a system. Interesting cases are expressed as litmus tests.
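
The (0, 0) outcome of SB can be reproduced with a toy model of store buffering. The sketch below is a single-threaded simulation, not a description of any real core: it assumes a one-entry store buffer per CPU, and the names struct cpu, cpu_store, cpu_load, cpu_drain and sb_with_store_buffers are illustrative only.

```c
#include <assert.h>

enum { ADDR_A, ADDR_B, NADDRS };

/* Toy CPU with a one-entry store buffer (an assumption made for brevity). */
struct cpu {
    int pending;   /* non-zero while a store sits in the buffer */
    int addr;
    int val;
};

static int mem[NADDRS];   /* shared memory, visible to every CPU */

/* A store is parked in the local buffer; other CPUs cannot see it yet. */
void cpu_store(struct cpu *c, int addr, int val)
{
    assert(!c->pending);   /* the toy buffer holds a single store */
    c->pending = 1;
    c->addr = addr;
    c->val = val;
}

/* A load forwards from the local buffer first, then falls back to memory. */
int cpu_load(const struct cpu *c, int addr)
{
    if (c->pending && c->addr == addr)
        return c->val;
    return mem[addr];
}

/* Draining makes the buffered store visible in shared memory. */
void cpu_drain(struct cpu *c)
{
    if (c->pending) {
        mem[c->addr] = c->val;
        c->pending = 0;
    }
}

/* Run SB with both stores still buffered when the loads execute. */
void sb_with_store_buffers(int *C, int *D)
{
    struct cpu p0 = {0}, p1 = {0};

    mem[ADDR_A] = mem[ADDR_B] = 0;
    cpu_store(&p0, ADDR_A, 1);     /* a: A = 1 (buffered on p0) */
    cpu_store(&p1, ADDR_B, 1);     /* c: B = 1 (buffered on p1) */
    *C = cpu_load(&p0, ADDR_B);    /* b: C = B, memory still holds 0 */
    *D = cpu_load(&p1, ADDR_A);    /* d: D = A, memory still holds 0 */
    cpu_drain(&p0);
    cpu_drain(&p1);
}
```

Each CPU sees its own store early and the other CPU's store late, so both loads return 0 even though every CPU ran its own instructions in program order.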

Sequential Consistency
Sequential consistency is easy to reason about, as there is a single global ordering.
Sequential Consistency (SC): "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." Leslie Lamport (1979)
Question: How does this constrain SB?

SB+SC
No valid interleaving for (C, D) == (0, 0), therefore forbidden by SC.
  p0            p1
  a: A = 1;     c: B = 1;
  b: C = B;     d: D = A;
Interleaving : Result (C, D)
  (a,b,c,d) : (0,1)
  (c,d,a,b) : (1,0)
  (a,c,b,d) : (1,1)
  ...
Every ordering that would yield (0,0) places b before a or d before c, so it violates program order:
  (b,c,d,a) : (0,0)
  (b,d,c,a) : (0,0)
  (b,d,a,c) : (0,0)
  (b,a,d,c) : (0,0)
  (d,c,b,a) : (0,0)
  (d,b,c,a) : (0,0)
  (d,b,a,c) : (0,0)
  (d,a,b,c) : (0,0)
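
The claim that SC forbids (0, 0) can be checked mechanically by enumerating every interleaving that respects program order. The following is a minimal sketch of that enumeration; the function name sb_enumerate_sc and the counts table are illustrative, not part of any tool mentioned in the talk.

```c
#include <assert.h>

/*
 * Enumerate all SC interleavings of SB and tally the outcomes.
 * counts[C][D] is the number of interleavings that end with those values.
 * Bit i of mask set means step i runs p0's next operation, otherwise p1's.
 */
void sb_enumerate_sc(int counts[2][2])
{
    counts[0][0] = counts[0][1] = counts[1][0] = counts[1][1] = 0;

    for (unsigned mask = 0; mask < 16; mask++) {
        int bits = 0;
        for (int i = 0; i < 4; i++)
            bits += (mask >> i) & 1;
        if (bits != 2)
            continue;   /* each thread must run exactly two operations */

        int A = 0, B = 0, C = 0, D = 0;
        int pc0 = 0, pc1 = 0;   /* per-thread program counters */

        for (int step = 0; step < 4; step++) {
            if ((mask >> step) & 1) {
                if (pc0++ == 0)
                    A = 1;      /* a: A = 1 */
                else
                    C = B;      /* b: C = B */
            } else {
                if (pc1++ == 0)
                    B = 1;      /* c: B = 1 */
                else
                    D = A;      /* d: D = A */
            }
        }
        assert(pc0 == 2 && pc1 == 2);
        counts[C][D]++;
    }
}
```

Of the six legal interleavings, one gives (0, 1), one gives (1, 0), four give (1, 1), and none gives (0, 0).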

C11
"Threads cannot be implemented as a library." Hans Boehm [1]
The C11 standard introduced native threads and atomic operations:
- Atomic types (atomic_int) defined in <stdatomic.h>
- Atomic operations such as atomic_compare_exchange_strong, atomic_load
- Formal relations to describe ordering and data races
This necessitates a memory model, the default behaviour being SC-DRF (sequential consistency for data-race-free programs).

memory_order_*
Maintaining SC is expensive! Atomic operations are parameterised by enum memory_order for finer-grained control:
- memory_order_seq_cst: sequential consistency (default)
- memory_order_acq_rel: RCpc [2] LOAD-acquire/STORE-release semantics
- memory_order_consume: data dependent acquisition [3]
- memory_order_relaxed: no inter-thread synchronisation
This provides a portable mechanism to expose weakly-ordered hardware to applications for optimal performance.
  atomic_int foo;
  ...
  return atomic_load_explicit(&foo, memory_order_acquire);
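
As a small illustration of how the memory_order argument is passed, the sketch below wraps a few <stdatomic.h> calls around the foo variable from the slide. The wrapper names (read_foo_sc, read_foo_acquire, publish_foo, bump_foo_relaxed) are made up for this example; only the atomic_* calls come from the standard.

```c
#include <stdatomic.h>

static atomic_int foo;   /* static storage, so it starts at zero */

/* No explicit order: atomic_load() defaults to memory_order_seq_cst. */
int read_foo_sc(void)
{
    return atomic_load(&foo);
}

/* Load-acquire: later accesses in this thread are ordered after the load. */
int read_foo_acquire(void)
{
    return atomic_load_explicit(&foo, memory_order_acquire);
}

/* Store-release: earlier accesses in this thread are ordered before the store. */
void publish_foo(int v)
{
    atomic_store_explicit(&foo, v, memory_order_release);
}

/* Relaxed read-modify-write: atomic, but no inter-thread ordering.
 * Returns the value held before the increment. */
int bump_foo_relaxed(void)
{
    return atomic_fetch_add_explicit(&foo, 1, memory_order_relaxed);
}
```

Run from a single thread, every variant behaves identically; the orderings only differ in what other threads are allowed to observe.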

Relations
The C11 memory model can be described by a series of relations:
- sequenced-before (sb)
- reads-from (rf)
- synchronizes-with (sw)
- happens-before (hb)
- and some complications thanks to consume (cad, dob, ithb)
All writes to an atomic variable form part of a total modification order (mo) for that location, consistent with hb.
We'll focus on SC and acquire/release operations, ignoring consume.

sequenced-before (sb)
Describes intra-thread evaluation order and applies to operations on arbitrary types.
  static int x;
  static int y;
  /* The store to x is sequenced-before the store to y */
  int main(void)
  {
      x = 1;
      y = 2;
      return 0;
  }
Matches the single-threaded intuition already present in the language.

reads-from (rf)
An operation reads a value written by another.
- Not strictly defined by the standard, but useful as a building block
- Can be applied to arbitrary types
- Can be applied between threads for atomic types
- No ordering implications on its own
  static atomic_int x;
  /* T1's load of x reads-from T0's store iff y == 1 */
  void t0(void) { x = 1; }
  void t1(void) { int y = x; ... }

synchronizes-with (sw)
Applies only to operations on atomic types and is defined differently for each family of memory orderings. If A and B are atomic operations of the specified memory order, then:
- SC: A sw B if A rf B
- acquire/release: A sw B if A rf B and A is a release and B is an acquire
or:
  sw = (W_{sc} \xrightarrow{rf} R_{sc}) \cup (W_{rel} \xrightarrow{rf} R_{acq})
Relaxed atomics do not participate in sw!

happens-before (hb)
An operation A happens-before B if A sb B or A sw B (consume adds complications). The relation is transitive, meaning that atomic variables can be used to stitch together thread-local code:
  hb = (sb \cup sw)^+
A data race exists if a program contains two actions, at least one of which is a write and one of which is not atomic, in different threads on the same memory location, neither of which happens-before the other. A program exhibiting such a race has undefined behaviour (SC-DRF).

modification-order (mo)
The modification-order of an atomic variable indicates the sequence of visible side-effects (i.e. writes) to that variable. It is a single total order consistent with hb and is strictly per-location.
  static atomic_int x;
  /* MO of x is either {0, 1, 2} or {0, 2, 1} */
  void t0(void) { x = 1; }
  void t1(void) { x = 2; }
Additionally, there is a single total order sc on all SC operations that is consistent with hb and mo.

Formal tools
Recall the SB example, written in CppMem's input syntax:
  int main() {
    atomic_int x=0; atomic_int y=0;
    {{{ { y.store(1,memory_order_seq_cst);
          r1=x.load(memory_order_seq_cst); }
    ||| { x.store(1,memory_order_seq_cst);
          r2=y.load(memory_order_seq_cst); }
    }}}
    return 0;
  }
We can feed this to a formal model and visualise the set of consistent executions.

CppMem
[Figure: CppMem execution graph for SB with SC atomics. Events: a:wna x=0, b:wna y=0, c:wsc y=1, d:rsc x=0, e:wsc x=1, f:rsc y=1; edges labelled sb, rf, mo, sw, sc, hb.]
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/index.html

SB+acq+rel
We can modify SB to use acquire/release:
  int main() {
    atomic_int x=0; atomic_int y=0;
    {{{ { y.store(1,memory_order_release);
          r1=x.load(memory_order_acquire); }
    ||| { x.store(1,memory_order_release);
          r2=y.load(memory_order_acquire); }
    }}}
    return 0;
  }
r1 == r2 == 0 is now permitted.
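
The CppMem input above is not compilable C. One possible C11 rendering of the two SB thread bodies is sketched below, with the memory orders passed in so that both the seq_cst and the release/acquire variants can be expressed. The function names are invented for this sketch, and thread creation is deliberately left out (it could be done with thrd_create from <threads.h> or with pthreads).

```c
#include <stdatomic.h>

static atomic_int x, y;   /* static storage: both start at zero */

/* First SB thread: store to y, then load from x. Returns r1. */
int sb_thread0(memory_order store_order, memory_order load_order)
{
    atomic_store_explicit(&y, 1, store_order);
    return atomic_load_explicit(&x, load_order);
}

/* Second SB thread: store to x, then load from y. Returns r2. */
int sb_thread1(memory_order store_order, memory_order load_order)
{
    atomic_store_explicit(&x, 1, store_order);
    return atomic_load_explicit(&y, load_order);
}

/* Put the shared variables back to their initial state between runs. */
void sb_reset(void)
{
    atomic_store(&x, 0);
    atomic_store(&y, 0);
}
```

Calling the two functions one after the other from a single thread only exercises one SC interleaving, so it can never show r1 == r2 == 0; observing that outcome requires running them concurrently on weakly-ordered hardware, and even then it is permitted rather than guaranteed.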

Acquire/release
[Figure: CppMem execution graph for SB with acquire/release. Events: a:wna x=0, b:wna y=0, c:wrel y=1, d:racq x=0, e:wrel x=1, f:racq y=0; edges labelled sb, rf, mo, sw, hb.]
Acquire/release in C11 is not SC!

Message Passing (1)
Passing messages between a producer and a consumer thread is ideally suited to acquire/release:
MP
  int main() {
    int x=0; atomic_int y=0;
    {{{ { x=1;
          y.store(1,memory_order_release); }
    ||| { r1=y.load(memory_order_acquire).readsvalue(1);  /* If we read y == 1 */
          r2=x; }                                          /* Then we must read x == 1 */
    }}}
    return 0;
  }
y is an atomic flag indicating the validity of the data x.

Message Passing (2)
Only one consistent execution:
[Figure: execution graph with events a:wna x=0, b:wna y=0, c:wna x=1, d:wrel y=1, e:racq y=1, f:rna x=1; edges labelled sb, rf, mo, sw, hb.]
Question: What happens if e reads y == 0 instead?

Message Passing (3)
Data race on x between c and f!
[Figure: execution graph with events a:wna x=0, b:wna y=0, c:wna x=1, d:wrel y=1, e:racq y=0, f:rna x=0; edges labelled sb, rf, mo, sw, hb, plus a dr (data race) edge between c and f.]
Intuition: don't read data if !valid.
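
The two MP executions can be checked against the definitions of hb and data race with a few lines of code. The sketch below hard-codes the six events a to f, builds hb as the transitive closure of sb and sw, and counts conflicting pairs that are unordered by hb. It is a simplification that ignores consume and treats thread creation as an sw edge from b, as the graphs do; every name in it is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

enum { EV_A, EV_B, EV_C, EV_D, EV_E, EV_F, NEVENTS };

struct event {
    int thread;       /* 0 = main (initialisation), 1 = producer, 2 = consumer */
    char loc;         /* memory location accessed */
    bool is_write;
    bool is_atomic;
};

static const struct event ev[NEVENTS] = {
    [EV_A] = { 0, 'x', true,  false },   /* a: wna  x=0 */
    [EV_B] = { 0, 'y', true,  false },   /* b: wna  y=0 */
    [EV_C] = { 1, 'x', true,  false },   /* c: wna  x=1 */
    [EV_D] = { 1, 'y', true,  true  },   /* d: wrel y=1 */
    [EV_E] = { 2, 'y', false, true  },   /* e: racq y   */
    [EV_F] = { 2, 'x', false, false },   /* f: rna  x   */
};

/* Warshall's algorithm: hb becomes the transitive closure of its edges. */
static void hb_closure(bool hb[NEVENTS][NEVENTS])
{
    for (int k = 0; k < NEVENTS; k++)
        for (int i = 0; i < NEVENTS; i++)
            for (int j = 0; j < NEVENTS; j++)
                if (hb[i][k] && hb[k][j])
                    hb[i][j] = true;
}

/* Count data races in MP, depending on whether e reads y == 1 from d. */
int mp_count_races(bool e_reads_from_d)
{
    bool hb[NEVENTS][NEVENTS] = {{ false }};
    int races = 0;

    /* sb edges within each thread */
    hb[EV_A][EV_B] = true;
    hb[EV_C][EV_D] = true;
    hb[EV_E][EV_F] = true;
    /* sw edges from thread creation */
    hb[EV_B][EV_C] = true;
    hb[EV_B][EV_E] = true;
    /* release/acquire sw edge exists only if e reads from d */
    if (e_reads_from_d)
        hb[EV_D][EV_E] = true;

    hb_closure(hb);

    for (int i = 0; i < NEVENTS; i++) {
        for (int j = i + 1; j < NEVENTS; j++) {
            bool conflict = ev[i].thread != ev[j].thread &&
                            ev[i].loc == ev[j].loc &&
                            (ev[i].is_write || ev[j].is_write) &&
                            !(ev[i].is_atomic && ev[j].is_atomic);
            if (conflict && !hb[i][j] && !hb[j][i])
                races++;
        }
    }
    return races;
}
```

With the d to e edge present, c happens-before f and no race is reported; without it, the only unordered conflicting pair is c and f.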

Acquire/release instructions
ARMv8 introduced native LDAR and STLR instructions:
- LDAR <Xt>, [<Xn>]: ordered against subsequent accesses in program order
- STLR <Xt>, [<Xn>]: ordered against prior accesses in program order and any prior observed writes
Looks an awful lot like hb!
A variation on MP, memory initialised to zero (STLR #1 is shorthand for storing a register that holds 1; the instruction takes a register operand):
  p0                       p1
  LDR  X0, [Xa, #4]        1: LDAR X0, [Xa]
  ADD  X0, X0, #1             CBZ  X0, 1b
  STR  X0, [Xa, #4]           LDR  X1, [Xa, #4]
  STLR #1, [Xa]
Result: X1 == 1
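
In C11 terms, the MP variation above corresponds roughly to a store-release on a flag paired with a load-acquire. The sketch below is one way to write it; the struct and function names are invented here, the comments only indicate which instruction each line loosely corresponds to, and the actual code generated depends on the compiler. Unlike the assembly, the consumer does not spin but returns false when the flag is not yet set.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Flag and payload side by side, loosely mirroring [Xa] and [Xa, #4]. */
struct message {
    atomic_int valid;
    int data;
};

/* Producer: update the payload, then publish it with a store-release. */
void producer(struct message *m)
{
    m->data = m->data + 1;                                        /* LDR, ADD, STR */
    atomic_store_explicit(&m->valid, 1, memory_order_release);    /* STLR */
}

/* Consumer: read the payload only after the flag has been observed. */
bool consumer_try_read(struct message *m, int *out)
{
    if (!atomic_load_explicit(&m->valid, memory_order_acquire))   /* LDAR */
        return false;        /* don't read data if !valid */
    *out = m->data;          /* LDR, ordered after the acquire */
    return true;
}
```

Returning early when the flag is clear is what keeps the plain read of data free of races: the read only happens once the acquire load has synchronised with the release store.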

SC acquire/release
The fun doesn't stop here:
- STLR followed by LDAR: globally observed in program order
- STLR is multi-copy atomic when observed by LDAR
A variation on SB, memory initialised to zero:
  p0                    p1
  STLR #1, [Xa]         STLR #1, [Xb]
  LDAR X0, [Xb]         LDAR X1, [Xa]
X0 == X1 == 0 forbidden
Unlike C11 acquire/release, LDAR and STLR provide SC when paired and map directly onto memory_order_seq_cst.
          Rsc    Wsc    Racq   Wrel
  ARMv8   LDAR   STLR   LDAR   STLR

Conclusion
We've only scratched the surface of the C11 and ARM memory models:
- Compound atomic operations (cmpxchg)
- memory_order_consume
- Explicit fences
However, there is a deliberate mapping from ARMv8 to C11 SC and formal tools for the ARM model are under active development.

Thank You
[1] Hans-J. Boehm. Threads Cannot be Implemented as a Library. 2004.
[2] K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. 1995.
[3] Paul McKenney et al. N4215: Towards Implementation and Use of memory_order_consume. 2014.
http://www.arm.com/careers/
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

Sequential Consistency (2)
[Figure: operations A, B and C shown in three panels labelled Program, Parallel (A on p0, B on p1, C on p2) and Interleaving.]

SC example
SC is easy to reason about, as there is a single global ordering. This can be demonstrated by the IRIW litmus test:
Initially: X = Y = 0
  T0        T1        T2         T3
  X = 1     Y = 1     X2 = X     Y3 = Y
                      Y2 = Y     X3 = X
(X2, Y2) = (1, 0), (X3, Y3) = (0, 1): not permitted
(X2, Y2) = (0, 1), (X3, Y3) = (1, 0): not permitted
SC actually forbids any reordering of reads and writes.
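
The first IRIW outcome listed above can be checked by brute force: explore every interleaving of the four threads that respects program order and count how often T2 sees (X2, Y2) = (1, 0) while T3 sees (X3, Y3) = (0, 1). The sketch below does this with a small recursive search; the data layout and the names explore and iriw_count are illustrative, and only that first outcome is tested.

```c
#include <assert.h>

enum { VAR_X, VAR_Y, NVARS };
enum { NTHREADS = 4, MAXOPS = 2 };

struct op { int is_store; int var; };   /* a store always writes 1 */

/* T0: X = 1   T1: Y = 1   T2: X2 = X; Y2 = Y   T3: Y3 = Y; X3 = X */
static const struct op prog[NTHREADS][MAXOPS] = {
    { { 1, VAR_X } },
    { { 1, VAR_Y } },
    { { 0, VAR_X }, { 0, VAR_Y } },
    { { 0, VAR_Y }, { 0, VAR_X } },
};
static const int prog_len[NTHREADS] = { 1, 1, 2, 2 };

struct state {
    int mem[NVARS];
    int pc[NTHREADS];
    int seen[NTHREADS][MAXOPS];   /* value returned by each load */
};

/* Depth-first search over every thread that can still take a step. */
static void explore(struct state s, long *total, long *forbidden)
{
    int finished = 1;

    for (int t = 0; t < NTHREADS; t++) {
        if (s.pc[t] == prog_len[t])
            continue;
        finished = 0;

        struct state next = s;
        struct op o = prog[t][next.pc[t]];
        if (o.is_store)
            next.mem[o.var] = 1;
        else
            next.seen[t][next.pc[t]] = next.mem[o.var];
        next.pc[t]++;
        explore(next, total, forbidden);
    }
    if (!finished)
        return;

    (*total)++;
    /* T2 saw X == 1 then Y == 0, while T3 saw Y == 1 then X == 0 */
    if (s.seen[2][0] == 1 && s.seen[2][1] == 0 &&
        s.seen[3][0] == 1 && s.seen[3][1] == 0)
        (*forbidden)++;
}

/* Returns the number of interleavings; *forbidden counts the bad outcome. */
long iriw_count(long *forbidden)
{
    struct state s = { { 0 }, { 0 }, { { 0 } } };
    long total = 0;

    *forbidden = 0;
    explore(s, &total, forbidden);
    return total;
}
```

There are 180 interleavings of the six operations and none of them produces that outcome, since T2 and T3 would have to disagree on the order of the two writes.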

The ARM weak memory model
The ARM architecture features a relatively weak memory model:
- No multi-copy atomicity requirement (unlike TSO)
- Arbitrary reordering of independent reads and writes
- Explicit barriers and instructions to enforce ordering
- System-level ordering (e.g. MMIO, Cache/TLB maintenance, ...)
Like many (all?) other architectures, the memory model is not formally defined and is driven by pragmatism rather than pure mathematics.

Observability
Ordering is defined in terms of observability by memory masters (observers).
Writes: A write to a location in memory is said to be observed by an observer when:
(1) A subsequent read of the location by the same observer will return the value written by the observed write, or written by a write to that location by any observer that is sequenced in the coherence order of the location after the observed write, and
(2) A subsequent write of the location by the same observer will be sequenced in the coherence order of the location after the observed write.
This is actually pretty intuitive...

Observability (2)
...but reads are observable too!
Reads: A read of a location in memory is said to be observed by an observer when a subsequent write to the location by the same observer will have no effect on the value returned by the read.
These definitions clearly have relations with rf and mo.

Global Observability and Completion
- A normal memory access is globally observed for a shareability domain when it is observed by all observers in that domain.
- A table walk is complete for a shareability domain when its accesses are globally observed in that domain and the TLB is updated.
- An access is complete for a shareability domain when it is globally observed in that domain and any table walks associated with it have completed in the same domain.
Much more difficult to correlate with C11, which cares only about application-level ordering.

Explicit barriers
The ARM architecture defines three barrier instructions:
- ISB: Pipeline flush and context synchronisation
- DMB <option>: Ensure ordering of memory accesses
- DSB <option>: Ensure completion of memory accesses
The <option> argument specifies the required shareability domain (NSH, ISH, OSH, SY) and access type (ST). Defaults to full system, all access types if omitted. Userspace runs in the same inner-shareable domain.
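
From C11, ordering barriers are reached portably through atomic_thread_fence rather than by naming DMB directly. The sketch below shows message passing built from relaxed atomics plus explicit fences. The names payload, ready, publish and try_consume are illustrative, and the remark that ARM compilers commonly lower these fences to a DMB is an assumption about typical toolchains, not something the standard promises.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain, non-atomic data */
static atomic_int ready;   /* flag, starts at zero */

/* Writer: the release fence orders the payload write before the flag write. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);   /* commonly a DMB on ARM */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: the acquire fence orders the flag read before the payload read. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;      /* nothing published yet */
    atomic_thread_fence(memory_order_acquire);   /* commonly a DMB on ARM */
    *out = payload;
    return true;
}
```

The relaxed flag accesses alone would not synchronise; the pair of fences supplies the ordering, so the payload read is only reached after a flag value of 1 has been seen.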

Dependencies
In the absence of explicit barriers, dependencies define observation order of normal memory accesses.
- Address: value returned by a read is used to compute the address of a subsequent access.
- Control: value returned by a read is used to determine the condition flags and the flags are used in the condition code checking that determines the address of a subsequent access.
- Data: value returned by a read is used as data written by a subsequent write.
There are also a few other rules (RaR, store speculation).

Dependency Examples
(address)
  ldr   r1, [r0, #4]
  and   r1, #0xfff
  ldr   r3, [r2, r1]
(control)
  ldr   r1, [r0, #4]
  cmp   r1, #1
  addeq r2, #4
  ldr   r3, [r2]
(data)
  ldr   r1, [r0, #4]
  add   r1, #5
  str   r1, [r2]
Question: Which dependencies enforce ordering of observability?

Mapping to C11
Typically, architectures provide either stronger (x86) or weaker (PowerPC) guarantees than those required by the C11 relaxed memory models:
  Architecture SC Acq/rel Relaxed
  x86 ARMv7 = PowerPC = ia64 = ARMv8 = =
Explicit fences are used to convert into.
          Rsc        Wsc              Racq       Wrel
  ARMv7   LDR; DMB   DMB; STR; DMB    LDR; DMB   DMB; STR
  ARMv8   LDAR       STLR             LDAR       STLR
