The Semantics of x86-cc Multiprocessor Machine Code

1 The Semantics of x86-cc Multiprocessor Machine Code
Susmit Sarkar, Computer Laboratory, University of Cambridge
Joint work with: Peter Sewell, Scott Owens, Tom Ridge, Magnus Myreen (U. Cambridge); Francesco Zappa Nardelli, Jade Alglave, Thomas Braibant (INRIA)
ARG lunch, November 2008

2 Shared Memory
Multiprocessors are now everywhere.
Programmer model: many processors operating on (the illusion of) a single shared memory.
Also known as: sequential consistency.
Traditional concurrency semantics presupposes sequential consistency, for parallel languages or process calculi:
$\langle P_0 \parallel P'_0, M_0 \rangle \to \langle P_1 \parallel P'_1, M_1 \rangle \to \langle P_2 \parallel P'_2, M_2 \rangle \to \dots$

3 Shared Memory
Programmer model: many processors operating on (the illusion of) a single shared memory.
But: for typical real shared-memory multiprocessors, the illusion of a single shared memory is not very good. For performance reasons, they only have approximately consistent views of that memory, aka weak memory models, aka relaxed memory models.
They are not sequentially consistent. Different processors can observe actions in different orders.
We can't think about these systems in terms of global time.

4 Approximately Consistent Memory: One Intel/AMD Example
Initial shared memory values: x = 0, y = 0
Per-processor registers: rA, rB

Processor A      Processor B
store x := 1     store y := 1
load rA := y     load rB := x

Processor A      Processor B
MOV [x] $1       MOV [y] $1
MOV EAX [y]      MOV EBX [x]

Final register values: rA = ?  rB = ?

5 Approximately Consistent Memory: One Intel/AMD Example
Initial shared memory values: x = 0, y = 0
Per-processor registers: rA, rB

Processor A      Processor B
store x := 1     store y := 1
load rA := y     load rB := x

Processor A      Processor B
MOV [x] $1       MOV [y] $1
MOV EAX [y]      MOV EBX [x]

Final register values: rA = 0 and rB = 0 is possible.
Each processor can do its own store action before the store of the other processor.
Makes it hard to understand what your programs are doing!
Already a real problem for OS, compiler, and library authors.
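One way to make the rA = 0, rB = 0 outcome concrete is to enumerate every interleaving of a small idealised machine. The sketch below assumes a per-processor FIFO store buffer; that machine, the `explore` function, and the instruction encoding are illustrative assumptions, not the model defined in this talk. With `buffered=False` stores reach memory immediately, which gives sequential consistency for comparison.

```python
def explore(programs, buffered=True):
    """Return every reachable final register valuation.

    programs[p] is a list of ("store", loc, value) or ("load", reg, loc).
    buffered=True: hypothetical machine with one FIFO store buffer per processor.
    buffered=False: stores hit memory at once (sequential consistency).
    """
    n = len(programs)
    locs = sorted({i[1] for prog in programs for i in prog if i[0] == "store"} |
                  {i[2] for prog in programs for i in prog if i[0] == "load"})
    # State: program counters, store buffers, memory, registers (all hashable).
    init = ((0,) * n, ((),) * n, tuple((l, 0) for l in locs), ())
    seen, stack, outcomes = set(), [init], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pcs[p] == len(programs[p]) and not bufs[p] for p in range(n)):
            outcomes.add(regs)
            continue
        for p in range(n):
            if pcs[p] < len(programs[p]):
                op, a, b = programs[p][pcs[p]]
                new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
                if op == "store" and buffered:
                    # Queue the store; other processors cannot see it yet.
                    new_bufs = bufs[:p] + (bufs[p] + ((a, b),),) + bufs[p + 1:]
                    stack.append((new_pcs, new_bufs, mem, regs))
                elif op == "store":
                    new_mem = tuple(sorted({**dict(mem), a: b}.items()))
                    stack.append((new_pcs, bufs, new_mem, regs))
                else:
                    # Load: newest matching entry in own buffer wins, else memory.
                    val = dict(mem)[b]
                    for loc, v in bufs[p]:
                        if loc == b:
                            val = v
                    new_regs = tuple(sorted({**dict(regs), a: val}.items()))
                    stack.append((new_pcs, bufs, mem, new_regs))
            if bufs[p]:
                # Flush the oldest buffered store of processor p to memory.
                loc, v = bufs[p][0]
                new_mem = tuple(sorted({**dict(mem), loc: v}.items()))
                new_bufs = bufs[:p] + (bufs[p][1:],) + bufs[p + 1:]
                stack.append((pcs, new_bufs, new_mem, regs))
    return outcomes


SB = [[("store", "x", 1), ("load", "rA", "y")],
      [("store", "y", 1), ("load", "rB", "x")]]
weak = explore(SB)                 # with store buffers
sc = explore(SB, buffered=False)   # sequentially consistent
both_zero = (("rA", 0), ("rB", 0))
```

Under these assumptions `both_zero` is among the buffered outcomes and absent from the sequentially consistent ones.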

6 Problems
Most real multiprocessors (x86, PPC, SPARC, ARM, ...) provide non-sequentially-consistent, or weak, or relaxed memory.
To write efficient low-level concurrent code you have to understand exactly what guarantees are provided.
But:
  ... the guarantees are subtle, and differ between architectures
  ... the processor documentation is typically very ambiguous, hard to understand, and sometimes incomplete and unsound
  ... (almost) none of the last 40 years of research on verifying concurrent algorithms deals with these real weak memory models
  ... (almost all) previous WMM work doesn't cover x86, and isn't integrated with instruction semantics

7 Plan
1. Find out what the architecture and processors say and do.
   Aim: the model should be sound w.r.t. the architecture (and hence w.r.t. current and future processors) and strong enough for reasoning about (racy) code, but may be looser than the behaviour of any particular processor.
2. Express it in nice clear unambiguous mathematics.
3. Test that the mathematics and hardware correspond.
4. Prove metatheory (e.g. that for well-synchronized programs you don't need to think about this stuff).

9 Sources
Intel 64 and IA-32 Architectures Software Developer's Manual, vols 1, 2A, 2B, 3A, 3B (Rev 28, July 2008)
  "In multiprocessor systems, maintenance of cache consistency may, in rare circumstances, require intervention by system software." [Vol 3A 10-5]
AMD64 Architecture Programmer's Manual, vols 1, 2, 3 (September 2007)
Personal communication with a couple of Intel experts.
You?

10 Timeline of memory model descriptions
Pre-IWP
  Nov 2006               Intel manuals, rev 22
IWP/Rev 28
  Aug 2007               Intel white paper v1.0
  Sep 2007               AMD manual, rev 3.14
  Jul 2008               Intel manuals, rev 28
Rev 29
  Nov 2008 (last week)   Intel manual, rev 29

11 Not all of x86
For now, only the basic user-code scenario:
  coherent write-back memory
  no misaligned accesses, exceptions, or non-temporal operations
  no self-modifying code
  no page-table changes
Sufficient for user-space code and most kernel code.

12 Semantics of Memory Model
Two styles of semantics in the WMM literature:
  Operational: idealised machines, with buffers, etc.
  Non-operational or axiomatic: constraints on ordering relations.
Ideally: both, with a correspondence theorem.
First: axiomatic. A view order per processor, with constraints on how they relate to each other.

13 Instructions and Events
Program is instructions, but reordering is over read/write events:

proc:0       proc:1
INC [100]    INC [100]

[Figure: inc-inc (event structure 4). Each INC [100] yields a read event followed, via an iico edge, by a write event:
  eiid:1 (of INC [100]) iiid: proc:0;po:0  R [100]=0  --iico-->  eiid:3 (of INC [100]) iiid: proc:0;po:0  W [100]=1
  eiid:5 (of INC [100]) iiid: proc:1;po:0  R [100]=0  --iico-->  eiid:7 (of INC [100]) iiid: proc:1;po:0  W [100]=1]

For program reasoning we need both (unlike most of the literature).
Instructions are non-atomic. Record iico (intra-instruction causality).

14 Locked Instructions

proc:0             proc:1
LOCK; INC [100]    LOCK; INC [100]

Event Structures
event_structure = [ procs : proc set;
                    events : event set;
                    intra_causality : event reln;
                    atomicity : event set set ]

15 View Orders
A collection of view orders vo gives, for each processor p, a linear order vo p of the relevant events. The relevant events are:
  all the events of processor p, and
  all the memory write events of other processors.
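The set of relevant events can be computed directly from that definition. The sketch below uses a hypothetical `Event` record (not the HOL type from the formalisation) in which memory locations are integers and registers are strings, and applies it to the four inc-inc events.

```python
from collections import namedtuple

# Hypothetical event record; memory locations are ints, registers are strings.
Event = namedtuple("Event", "eiid proc kind loc value")


def is_memory(loc):
    return isinstance(loc, int)


def relevant_events(events, p):
    """Events that processor p's view order must linearise:
    all of p's own events plus every memory write of the other processors."""
    return {e for e in events
            if e.proc == p or (e.kind == "W" and is_memory(e.loc))}


# The inc-inc example: each INC [100] is a read followed by a write.
INC_INC = [
    Event(1, 0, "R", 100, 0), Event(3, 0, "W", 100, 1),
    Event(5, 1, "R", 100, 0), Event(7, 1, "W", 100, 1),
]
relevant_for_0 = {e.eiid for e in relevant_events(INC_INC, 0)}
relevant_for_1 = {e.eiid for e in relevant_events(INC_INC, 1)}
```

Processor 0 never has to order processor 1's read (eiid:5), only its write (eiid:7), and symmetrically for processor 1.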

16 [Figure: view orders for the inc-inc example (proc:0 and proc:1 each executing INC [100]). The diagram shows the four events
  eiid:1 (of INC [100]) iiid: proc:0;po:0  R [100]=0
  eiid:3 (of INC [100]) iiid: proc:0;po:0  W [100]=1
  eiid:5 (of INC [100]) iiid: proc:1;po:0  R [100]=0
  eiid:7 (of INC [100]) iiid: proc:1;po:0  W [100]=1
connected by iico edges, by vo:0 and vo:1 view-order edges, and by an edge labelled P6.]

17 Preserved Program Order
5 of the 8 Intel WP principles are straightforward.
P1. LOADS ARE NOT REORDERED WITH OTHER LOADS.
P2. STORES ARE NOT REORDERED WITH OTHER STORES.

iwp2.1/amd1   proc:0          proc:1
po:0          MOV [100] $1    MOV EAX [200]
po:1          MOV [200] $1    MOV EBX [100]
Required: (1:EAX=1) $\Rightarrow$ (1:EBX=1)

P3. STORES ARE NOT REORDERED WITH OLDER LOADS.
P4. LOADS MAY BE REORDERED WITH OLDER STORES TO DIFFERENT LOCATIONS BUT NOT WITH OLDER STORES TO THE SAME LOCATION.
P8. LOADS AND STORES ARE NOT REORDERED WITH LOCKED INSTRUCTIONS.

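18 Preserved Program Order, Formalised
\[
\begin{aligned}
&\mathit{preserved\_program\_order}\ E = \{(e_1, e_2) \mid (e_1, e_2) \in (\mathit{po\_strict}\ E)\ \wedge \\
&\quad ((\exists p\ r.\ (\mathit{loc}\ e_1 = \mathit{loc}\ e_2) \wedge (\mathit{loc}\ e_1 = \mathrm{SOME}\ (\mathrm{LOCATION\_REG}\ p\ r)))\ \vee \\
&\quad\ (\mathit{mem\_load}\ e_1 \wedge \mathit{mem\_load}\ e_2)\ \vee \\
&\quad\ (\mathit{mem\_store}\ e_1 \wedge \mathit{mem\_store}\ e_2)\ \vee \\
&\quad\ (\mathit{mem\_load}\ e_1 \wedge \mathit{mem\_store}\ e_2)\ \vee \\
&\quad\ (\mathit{mem\_store}\ e_1 \wedge \mathit{mem\_load}\ e_2 \wedge (\mathit{loc}\ e_1 = \mathit{loc}\ e_2))\ \vee \\
&\quad\ ((\mathit{mem\_load}\ e_1 \vee \mathit{mem\_store}\ e_1) \wedge \mathit{locked}\ E\ e_2)\ \vee \\
&\quad\ (\mathit{locked}\ E\ e_1 \wedge (\mathit{mem\_load}\ e_2 \vee \mathit{mem\_store}\ e_2)))\}
\end{aligned}
\]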
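The definition reads naturally as a predicate on pairs of events. The following is a sketch only: the `Event` record, the int/str convention for memory versus register locations, and the per-event `locked` flag are assumptions made for illustration, standing in for the HOL definitions of `loc`, `mem_load`, `mem_store` and `locked E`.

```python
from collections import namedtuple

# Hypothetical event record: proc and po identify the instruction,
# kind is "R" or "W", loc is an int (memory) or a str (register),
# locked says whether the event belongs to a LOCK'd instruction.
Event = namedtuple("Event", "proc po kind loc locked")


def mem_load(e):
    return e.kind == "R" and isinstance(e.loc, int)


def mem_store(e):
    return e.kind == "W" and isinstance(e.loc, int)


def preserved_program_order(e1, e2):
    """True if (e1, e2) is in preserved program order."""
    # po_strict: same processor, e1's instruction strictly earlier.
    if not (e1.proc == e2.proc and e1.po < e2.po):
        return False
    same_register = e1.loc == e2.loc and isinstance(e1.loc, str)
    return (same_register
            or (mem_load(e1) and mem_load(e2))                          # P1
            or (mem_store(e1) and mem_store(e2))                        # P2
            or (mem_load(e1) and mem_store(e2))                         # P3
            or (mem_store(e1) and mem_load(e2) and e1.loc == e2.loc)    # P4
            or ((mem_load(e1) or mem_store(e1)) and e2.locked)          # P8
            or (e1.locked and (mem_load(e2) or mem_store(e2))))         # P8


store_100 = Event(0, 0, "W", 100, False)
load_200 = Event(0, 1, "R", 200, False)
load_100 = Event(0, 1, "R", 100, False)
locked_load_200 = Event(0, 1, "R", 200, True)

store_then_other_load = preserved_program_order(store_100, load_200)
store_then_same_load = preserved_program_order(store_100, load_100)
store_then_locked = preserved_program_order(store_100, locked_load_200)
```

The only pair left unordered is a store followed by a load from a different location, which is exactly the relaxation that P4 permits.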

19 Total order on stores to each location
P6. IN A MULTIPROCESSOR SYSTEM, STORES TO THE SAME LOCATION HAVE A TOTAL ORDER.
write_serialization_candidates E = ... the set of all relations which are the union, for each location, of a linear order over all the store events to that location in E.

iwp2.6   proc:0          proc:1          proc:2           proc:3
po:0     MOV [100] $1    MOV [100] $2    MOV EAX [100]    MOV ECX [100]
po:1                                     MOV EBX [100]    MOV EDX [100]
Forbidden: 2:EAX=1 $\wedge$ 2:EBX=2 $\wedge$ 3:ECX=2 $\wedge$ 3:EDX=1
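The candidate set can be enumerated by picking one permutation of the stores per location and taking the union. This is a minimal sketch; the `(eiid, loc)` encoding of store events is an assumption made for illustration.

```python
from itertools import permutations, product


def write_serialization_candidates(stores):
    """stores: list of (eiid, loc) pairs, one per memory store event.

    Yields each candidate as a frozenset of (earlier, later) pairs:
    the union, over locations, of one linear order on that location's stores.
    """
    by_loc = {}
    for eiid, loc in stores:
        by_loc.setdefault(loc, []).append(eiid)
    per_loc_orders = [list(permutations(ws)) for ws in by_loc.values()]
    for choice in product(*per_loc_orders):
        relation = set()
        for order in choice:
            # Strict linear order: every earlier store paired with every later one.
            relation |= {(order[i], order[j])
                         for i in range(len(order))
                         for j in range(i + 1, len(order))}
        yield frozenset(relation)


# iwp2.6 has two stores to [100], called "a" and "b" here.
iwp26_candidates = set(write_serialization_candidates([("a", 100), ("b", 100)]))
```

For iwp2.6 there are two candidates, a before b or b before a; the forbidden outcome would need processors 2 and 3 to each agree with a different one.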

20 Total order on locked instructions
P7. IN A MULTIPROCESSOR SYSTEM, LOCKED INSTRUCTIONS HAVE A TOTAL ORDER.
lock_serialization_candidates E = ... similar, but on instructions

iwp2.7/amd7   proc:0            proc:1            proc:2           proc:3
po:0          XCHG [100] EAX    XCHG [200] EBX    MOV ECX [100]    MOV ESI [200]
po:1                                              MOV EDX [200]    MOV EDI [100]
Initial state: 0:EAX=1 1:EBX=1 (elsewhere 0)
Forbidden: 2:ECX=1 $\wedge$ 2:EDX=0 $\wedge$ 3:ESI=1 $\wedge$ 3:EDI=0

21 Transitive visibility
Key question: how to capture condition P5
  "Intel 64 memory ordering ensures transitive visibility of stores, i.e. stores that are causally related appear to execute in an order consistent with the causal relation"
Transitivity from reads-from to preserved-program-order:

proc:0          proc:1           proc:2
MOV [100] $1    MOV EAX [100]    MOV EBX [200]
                MOV [200] $1     MOV ECX [100]
Required: (1:EAX=1 $\wedge$ 2:EBX=1) $\Rightarrow$ (2:ECX=1)

22 Reads-from maps A reads-from map for an event structure is a set of pairs (ew, er) identifying, for some of its read events, a write event to the same location with the same value. Other read events are presumed to read from the initial state.
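Candidate reads-from maps can be enumerated by giving each read either a matching write or no write at all. The sketch below assumes a simple `(eiid, loc, value)` encoding of events; it does not check that a read with no write matches the initial state, which is left to the validity conditions.

```python
from itertools import product


def rfmap_candidates(reads, writes):
    """reads, writes: lists of (eiid, loc, value).

    Yields each candidate reads-from map as a frozenset of (ew, er) pairs.
    A read with no pair is presumed to read from the initial state.
    """
    options = []
    for er, loc, value in reads:
        sources = [(ew, er) for ew, wloc, wvalue in writes
                   if wloc == loc and wvalue == value]
        options.append([None] + sources)  # None: read from the initial state
    for choice in product(*options):
        yield frozenset(pair for pair in choice if pair is not None)


# iwp2.1: proc 0 writes a = W [100]=1 and b = W [200]=1;
# proc 1 reads c = R [200]=1 and d = R [100]=0.
writes = [("a", 100, 1), ("b", 200, 1)]
reads = [("c", 200, 1), ("d", 100, 0)]
iwp21_rfmaps = set(rfmap_candidates(reads, writes))
```

Here read d has no write with value 0 to pair with, so it can only come from the initial state, while read c may or may not be paired with write b.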

23 Causality
We believe visibility is also transitive through the write- and lock-serialization orders, and through intra-instruction causality.
Interpret "causally" with:
\[
\mathit{happens\_before}\ E\ X = E.\mathit{intra\_causality} \cup (\mathit{preserved\_program\_order}\ E) \cup X.\mathit{write\_serialization} \cup X.\mathit{lock\_serialization} \cup X.\mathit{rfmap}
\]
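As a rough illustration, happens-before is a plain union of relations, and what matters for the model is whether it stays acyclic once view-order edges are added. The sketch below represents every relation as a set of pairs of event names; the names a to d and the helper functions are assumptions made for this example.

```python
def happens_before(intra_causality, ppo, write_ser, lock_ser, rfmap):
    """Union of the five component relations, each a set of (e1, e2) pairs."""
    return (set(intra_causality) | set(ppo) | set(write_ser)
            | set(lock_ser) | set(rfmap))


def is_acyclic(relation):
    """Depth-first search for a cycle in a relation given as a set of pairs."""
    succ = {}
    for a, b in relation:
        succ.setdefault(a, set()).add(b)
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in succ.get(node, ()):
            c = colour.get(nxt, WHITE)
            if c == GREY or (c == WHITE and not visit(nxt)):
                return False  # reached a node that is still on the DFS path
        colour[node] = BLACK
        return True

    return all(colour.get(n, WHITE) != WHITE or visit(n) for n in list(succ))


# iwp2.1 with 1:EAX=1: a = W [100]=1, b = W [200]=1 on proc 0,
# c = R [200]=1, d = R [100] on proc 1, and c reads from b.
hb = happens_before(set(), {("a", "b"), ("c", "d")}, set(), set(), {("b", "c")})
hb_alone_ok = is_acyclic(hb)
# If d read 0, proc 1's view order would have to place d before a.
with_view_edge_ok = is_acyclic(hb | {("d", "a")})
```

The extra edge closes the cycle a, b, c, d, a, which is one way to see why 1:EAX=1 with 1:EBX=0 is excluded in iwp2.1.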

24 In full, an execution witness X, for an event structure E, comprises:
  an initial state initial_state,
  a family of view orders (one for each processor) vo,
  a per-location global order on memory writes write_serialization,
  a global order on locked instructions lock_serialization,
  a reads-from map rfmap,
together satisfying the valid execution predicate below.

25 (then the final state is determined by the initial state overridden by the last memory and register writes)

Valid Executions
For each processor p:
(a) p's view order is consistent with happens-before ($\mathit{strict}(\mathit{vo}\ p) \cup \mathit{happens\_before}$ is acyclic)
(b) the reads-from map is satisfied by the view orders (for any write ew and read er in rfmap and in the relevant view-order events for p, ew precedes er in vo p and there is no other intervening write to the same location)
(c) the initial state constraint is satisfied by the rfmap and view orders (for each read er that does not have a corresponding write in rfmap, the initial state contains the read value and there is no other write ew to that location preceding er in the view order)
(d) the atomicity conditions are satisfied by each view order (for any two events in the same atomicity equivalence class, there is no third event e that occurs between them that isn't in that class)
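Conditions (b) and (c) can be checked for one processor by walking its view order. The sketch below assumes the view order is a list of event names and that `events` maps each name to a `(kind, loc)` pair; it leaves out the part of (c) that compares the read value against the initial state, as well as conditions (a) and (d).

```python
def reads_consistent(view_order, events, rfmap):
    """Check conditions (b) and (c) for a single processor's view order.

    view_order: list of event names, earliest first.
    events: name -> (kind, loc), with kind "R" or "W".
    rfmap: set of (ew, er) pairs.
    """
    position = {e: i for i, e in enumerate(view_order)}
    source = {er: ew for ew, er in rfmap}
    for er in view_order:
        kind, loc = events[er]
        if kind != "R":
            continue
        earlier_writes = [e for e in view_order[:position[er]]
                          if events[e] == ("W", loc)]
        ew = source.get(er)
        if ew is None:
            # (c): a read from the initial state has no earlier write to loc.
            if earlier_writes:
                return False
        elif ew in position:
            # (b): the latest earlier write to loc must be exactly ew.
            if not earlier_writes or earlier_writes[-1] != ew:
                return False
    return True


# Processor A of the store-buffering example, reading y = 0 from the initial state.
events = {"Wx": ("W", "x"), "Ry": ("R", "y"), "Wy": ("W", "y")}
own_store_first = reads_consistent(["Wx", "Ry", "Wy"], events, set())
other_store_first = reads_consistent(["Wx", "Wy", "Ry"], events, set())
```

Processor A can only read y = 0 if its view order places its load before processor B's store, which is the first ordering.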

26 Example valid execution

iwp2.4/amd9   proc:0           proc:1
po:0          MOV [100] $1     MOV [200] $1
po:1          MOV EAX [100]    MOV ECX [200]
po:2          MOV EBX [200]    MOV EDX [100]
Allowed: 0:EBX=0 $\wedge$ 1:EDX=0

An execution in which processor 0 sees its write before that of processor 1 whereas processor 1 sees them in the opposite order.

[Figure: iwp2.4/amd9: Litmus Test (event structure 6). Events, connected in the diagram by vo:0, vo:1, rf, and iico edges, some labelled P1 or P4:
  eiid:0   (of MOV [100] $1)    iiid: proc:0;po:0   W [100]=1
  eiid:1   (of MOV EAX [100])   iiid: proc:0;po:1   R [100]=1
  eiid:3   (of MOV EAX [100])   iiid: proc:0;po:1   W 0:EAX=1
  eiid:6   (of MOV EBX [200])   iiid: proc:0;po:2   R [200]=0
  eiid:8   (of MOV EBX [200])   iiid: proc:0;po:2   W 0:EBX=0
  eiid:9   (of MOV [200] $1)    iiid: proc:1;po:0   W [200]=1
  eiid:10  (of MOV ECX [200])   iiid: proc:1;po:1   R [200]=1
  eiid:12  (of MOV ECX [200])   iiid: proc:1;po:1   W 1:ECX=1
  eiid:15  (of MOV EDX [100])   iiid: proc:1;po:2   R [100]=0
  eiid:17  (of MOV EDX [100])   iiid: proc:1;po:2   W 1:EDX=0]

27 Instruction Semantics
Decoding:
  " 8B /r      MOV r32, r/m32 ";
  " B8+rd id   MOV r32, imm32 ";
Microcode combinators:
  seqm : 'a M → ('a → 'b M) → 'b M
  parm : 'a M → 'b M → ('a × 'b) M
  read_reg : iiid → Xreg → word32 M
  ...
x86_exec ii (XBINOP binop_name ds) len =
  parm_unit (seqm (read_eip ii) (λx. write_eip ii (x + len)))
            (seqm (parm (read_src_ea ii ds) (read_dest_ea ii ds))
                  (λ((ea_src, val_src), (ea_dest, val_dest)).
                     write_binop ii binop_name val_dest val_src ea_dest))

28 Validating the semantics
Too complex to work with by hand! (both combinatorially and twistily)
  Write executable version, in OCaml
  Formalise semantics, in HOL
  Test behaviour of real processors:
    the instruction semantics (directly against HOL)
    the memory model
  Prove metatheory

29 Testing the instruction semantics
Generate 6000 conjectures like this (for MOV EAX EBX) from a real processor:
  (XREAD_REG EBX s = 0x6F5BE65Bw) ==>
  (XREAD_EIP s = 0x804848Bw) ==>
  (XREAD_MEM 0x804848Bw s = SOME 0x89w) ==>
  (XREAD_MEM 0x804848Cw s = SOME 0xD8w) ==>
  (XREAD_REG EAX (the (X86_NEXT s)) = 0x6F5BE65Bw) /\
  (XREAD_REG EBX (the (X86_NEXT s)) = 0x6F5BE65Bw) /\
  (XREAD_EIP (the (X86_NEXT s)) = 0x804848Dw)
Prove in HOL (automatically...)
32-bit MOV, CMOVE, CMOVNE, XADD, XCHG, CMPXCHG; ADD, AND, CMP, OR, SUB, TEST, XOR; INC, DEC, NOT, NEG; POP, PUSH; JUMP, CALL, RET, LOOP.

30 Testing the memory model
(* Test iwp2.4/amd9 : Intra-processor forwarding is allowed *)
{x = 0; y = 0};
exists (%r2 = 0 /\ %r4 = 0);
  P0             P1            ;
  mov [x], 1     mov [y], 1    ;
  mov %r1, [x]   mov %r3, [y]  ;
  mov %r2, [y]   mov %r4, [x]

We found a witness for the case : exists r2 = 0 /\ r4 = 0
Histogram of results
  (x,1) (y,1) (%r1,1) (%r2,0) (%r3,1) (%r4,0)   412
  (x,1) (y,1) (%r1,1) (%r2,1) (%r3,1) (%r4,0)
  (x,1) (y,1) (%r1,1) (%r2,0) (%r3,1) (%r4,1)
  (x,1) (y,1) (%r1,1) (%r2,1) (%r3,1) (%r4,1)   23

31 Metatheory 1: Nice executions
The model follows the statements in the manual... and so is (superficially) quite weak, e.g. accesses to different registers need not follow program order.
Theorem: All valid executions are equivalent to nice valid executions.
Nice: register and memory reads are in program order (but memory writes can be arbitrarily delayed).
Proved in HOL [Tom Ridge]

32 Metatheory 2: Data Race Freedom
We would like to program as if memory is sequentially consistent (for well-behaved programs).
Theorem: ... for race-free event structures, all valid executions are equivalent to valid sequential executions.
Race free (intensional definition): no pair of events, one a memory read and another a memory write, to the same location, unrelated by happens-before.
Proved in HOL [Scott Owens]
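The race-freedom definition can be turned into a direct search for offending pairs. This is a sketch under assumed encodings: `events` maps names of memory events to `(kind, loc)` pairs, happens-before is a set of pairs, and "unrelated" is read as unrelated in either direction by its transitive closure.

```python
def transitive_closure(relation):
    """Naive transitive closure of a set of (a, b) pairs."""
    closure = set(relation)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def races(events, happens_before):
    """Return the (read, write) pairs on the same memory location
    that happens-before leaves unordered in both directions."""
    hb = transitive_closure(happens_before)
    found = set()
    for r, (r_kind, r_loc) in events.items():
        for w, (w_kind, w_loc) in events.items():
            if r_kind == "R" and w_kind == "W" and r_loc == w_loc:
                if (r, w) not in hb and (w, r) not in hb:
                    found.add((r, w))
    return found


# Store-buffering example: nothing orders a load against the other
# processor's store, so both read/write pairs race.
events = {"Wx": ("W", "x"), "Ry": ("R", "y"),
          "Wy": ("W", "y"), "Rx": ("R", "x")}
sb_races = races(events, set())
```

With an empty happens-before relation the example has two races, so the sequential-consistency theorem does not apply to it.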

33 Metatheory 3: Operational model
The axiomatic model is good for proofs, but is not suited for calculation.
Theorem: ... the axiomatic model is equivalent to a deadlock-free operational semantics model [hand proof, without lock prefix]
The operational semantics, due to the niceness property, need only delay visibility of memory writes.

34 Summary of X86 models
Pre-IWP (pre-Aug 2007) (Intel/AMD)
  Extremely vague
IWP/Rev 28 (Intel/AMD, formalized by X86-CC)
  Moderately clear, except for causality (interpreted in X86-CC)
  Unsound with hardware
  Too weak for programmers (?) (IRIW, MFENCEs do not lead to sequential consistency)

35 Rev 28 of Intel manual: The Rev 28/X86-CC model is not sound
The Rev 28 Manual / X86-CC is in some cases stronger than hardware.
P4. READS MAY BE REORDERED WITH OLDER WRITES TO DIFFERENT LOCATIONS BUT NOT WITH OLDER WRITES TO THE SAME LOCATION.
P6. WRITES TO THE SAME LOCATION HAVE A TOTAL ORDER.

n6      proc:0         proc:1
poi:0   MOV [x] $1     MOV [y] $2
poi:1   MOV EAX [x]    MOV [x] $2
poi:2   MOV EBX [y]
Forbidden: 0:EAX=1 $\wedge$ 0:EBX=0 $\wedge$ x=1

Observed (rarely, but reproducibly) on real hardware (Core 2), and allowed in Rev 29.

36 What do programmers on X86 use?
Generally assumed: somewhat like Total Store Order on SPARC.

x86/iriw   proc:0        proc:1        proc:2         proc:3
poi:0      MOV [x] $1    MOV [y] $1    MOV EAX [x]    MOV ECX [y]
poi:1                                  MOV EBX [y]    MOV EDX [x]
Forbidden: 2:EAX=1 $\wedge$ 2:EBX=0 $\wedge$ 3:ECX=1 $\wedge$ 3:EDX=0

Allowed in X86-CC/Rev 28, explicitly allowed by AMD (X86-CC is in this respect weaker than TSO).
Forbidden in Rev 29 (this is a weakness, not an unsoundness).

37 Rev 29 of Intel manual: Revised model is (probably) too weak
P6. ANY TWO STORES ARE SEEN IN A CONSISTENT ORDER BY PROCESSORS OTHER THAN THOSE PERFORMING THE STORES
... and delete the P6. WRITES TO THE SAME LOCATION HAVE A TOTAL ORDER

x86/n5   proc:0         proc:1
poi:0    MOV [x] $1     MOV [x] $2
poi:1    MOV EAX [x]    MOV EBX [x]
Forbidden: 0:EAX=2 $\wedge$ 1:EBX=1

Forbidden in Rev 28/X86-CC.
This would be allowed (as far as we can tell) in Rev 29, and would be very strange for programmers.

38 Comparison of X86 models
Pre-IWP (pre-Aug 2007) (Intel/AMD)
  Extremely vague
IWP/Rev 28 (Intel/AMD, formalized by X86-CC)
  Moderately clear, except for causality (interpreted in X86-CC)
  Unsound with hardware
  Too weak for programmers (?) (IRIW, MFENCEs do not lead to sequential consistency)
Rev 29 (Intel, AMD in progress)
  Moderately clear, except for causality (old interpretation does not work, not clear what does)
  Sound (as far as we know) with hardware
  Too weak for programmers (n5)
X86-TSO (Us, in progress)
  Clear
  Sound (as far as we know) with hardware
  Strong enough for programmers (?) (experience of TSO programmers)
