X-86 Memory Consistency


Andreas Betz
University of Kaiserslautern
a betz12@cs.uni-kl.de

Abstract

In recent years multiprocessors have become ubiquitous, and with them the need for concurrent programming. However, concurrent programming, which is always challenging, is made much more so by two problems. First, multiprocessors do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, and different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architecture specifications, which define what programmers can rely on, are often written in ambiguous informal prose, leading to widespread confusion. In my work I focus on x86 processors. I will present some of the recent Intel and AMD specifications, showing that all contain certain ambiguities, some are arguably too weak to program above, and some are simply unsound with respect to actual hardware. I will also present the x86-tso programmer's model by the group of Sewell, Sarkar and Owens. Their model is mathematically precise and can be presented both as an intuitive abstract machine and as an axiomatic memory model. This should put x86 multiprocessor system building on a more solid foundation.

1 Introduction

Multiprocessor machines, with many processors working on a shared memory, have been developed since the 1960s; today they are ubiquitous. The difficulty of concurrent programming motivated extensive research and resulted in techniques like semaphores, monitors and software model checking. It was almost always assumed that concurrent threads share a single sequentially consistent memory. In reality, to achieve high performance, multiprocessors use sophisticated techniques like store buffers, hierarchies of local caches and speculative execution.
In sequential code such optimisations are not observable, but in a multithreaded system different threads may have different views of memory.

Table 1: Example 1

  Proc 0         Proc 1
  MOV [x] 1      MOV [y] 1
  MOV EAX [y]    MOV EBX [x]

  Allowed Final State: Proc 0:EAX=0 and Proc 1:EBX=0

Example 1 can be seen as a visible consequence of store buffering: if each processor has a FIFO buffer for pending memory writes, then the reads from x and y can occur before the stores have been propagated to main memory. So systems programmers cannot reason, at the level of abstraction of memory reads and writes, in terms of a global time. To make things worse, despite the extensive research on relaxed memory models, some vendors' architectural specifications do not clearly define what they guarantee.

The rest of the paper is structured as follows: in Section 2 I give an overview of related work and existing solutions for x86 memory consistency. Section 3 presents some of the Intel and AMD specifications and shows why they are ambiguous and why some are unsound with respect to actual hardware. That section also gives a mathematical definition of the x86-tso memory model, which is, in contrast to the vendor specifications, unambiguous. It is also accessible, being presented both in an operational abstract-machine style and as an axiomatic memory model. The section further contains the relevant vendor litmus tests, shows which behaviour is permitted by x86-tso, and closes with an implementation of a Linux spinlock under the x86-tso memory model. Section 4 presents the results of my work on x86 memory consistency, discusses future improvements, and compares other possible approaches to memory consistency. The final section lists the references I used for my research on x86 memory consistency.

2 Related Work

There is an extensive literature on relaxed memory models, but most of it does not address x86. I will cover some of the most closely related work. The most relevant work for x86 was published by Owens et al. [?, ?]. They discuss the ambiguities of the vendor specifications and their unsoundness, mathematically define an x86-tso model, use it for a number of litmus tests, and show which behaviour should be permitted by the actual architecture.
They also use their memory model to implement a Linux x86 spinlock and give a definition of data-race freedom for x86 programs with respect to their model. A Primer on Memory Consistency and Cache Coherence by Hill et al. [?] introduces a wide variety of memory consistency models and cache coherence protocols. In Chapter 4 of their book they focus on x86 and also give a formal definition of x86-tso. They further show how systems implement x86-tso and discuss how atomic instructions, and instructions that enforce ordering between instructions, can be implemented. Lastly, they compare the x86-tso model to sequential consistency. The Intel 64 architecture memory ordering white paper [?] gives an overview of the memory reordering observable by software. It gives a number of examples of which reorderings are allowed by the architecture, but also states that any reordering at the hardware level is allowed as long as it does not violate the visibility rules of the architecture. It also gives 8 principles that Intel 64 memory ordering obeys; they are as follows:

1. Loads are not reordered with other loads.
2. Stores are not reordered with other stores.
3. Stores are not reordered with older loads.
4. Loads may be reordered with older stores to different locations but not with older stores to the same location.

5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
6. In a multiprocessor system, stores to the same location have a total order.
7. In a multiprocessor system, locked instructions have a total order.
8. Loads and stores are not reordered with locked instructions.

In Multiprocessor Memory Model Verification, Loewenstein et al. [?] describe a method that uses the system architects' specified memory ordering, as a function of execution, for verifying multiprocessing systems: they monitor not only that the result of an execution conforms to the required memory model, but that the result is exactly what the system architects intended. In typical simulations implementation errors do not propagate to memory model violations, so their approach uses architectural and mathematical insight to amplify the coverage of the analysis. Higham et al. give an overview of different memory models in their Weak memory consistency models part I: Definitions and comparisons [?]. First, they introduce a formal framework which is used to define the different memory models. By using a unifying formal framework they are able to reveal substantial differences between models. They also present the relationships between models when no explicit synchronisation primitives are used and compare the different models to each other. In Owens et al.'s previous work, The semantics of x86-cc multiprocessor machine code [?], they developed a rigorous and accurate semantics for x86 multiprocessor programs. They tested the semantics against actual processors and the vendor litmus-test examples, and gave an equivalent abstract-machine characterisation of their axiomatic memory model. For programs that are data-race free, they also proved in HOL that the behaviour is sequentially consistent. Finally, they compared the x86 model with aspects of the POWER and ARM behaviour.
In Saraswat et al.'s A theory of memory models [?], they present a simple mathematical framework for relaxed memory models for programming languages. They establish that all models in the framework satisfy the fundamental property of relaxed memory models: programs whose sequentially consistent (SC) executions have no races must have only SC executions. They also show how to define synchronisation constructs in their framework and discuss the causality test cases from the Java Memory Model. Boudol and Petri propose, in their work Relaxed memory models: an operational approach [?], a new approach to formalizing a memory model in which the model itself is part of a weak operational semantics for a programming language. They formalize in this way write operations to the store that may be buffered. They derive the ordering constraints from the weak semantics of programs and prove, at the programming language level, that the weak semantics implements the usual interleaving semantics for data-race-free programs. Burckhardt and Musuvathi propose a new verification technique for the most common relaxation, store buffers, in their work Effective program verification for relaxed memory models [?]. They first present a monitor algorithm that can detect the presence of program executions that are not sequentially consistent due to store buffers while only exploring sequentially consistent executions. Then, they combine this monitor with a stateless model checker that verifies that every sequentially consistent execution is correct. They have implemented this algorithm in a prototype tool called Sober and present experiments that demonstrate the precision and scalability of their method. In their other work, Verifying compiler transformations for concurrent programs [?], Burckhardt and Musuvathi present a novel proof methodology for proving the soundness of compiler transformations for concurrent programs.
Their methodology is based on a new formalization of memory models as dynamic rewrite rules on event streams. They implement their proof methodology in a first-of-its-kind

semi-automated tool called Traver to verify or falsify compiler transformations. Using Traver, they prove or refute the soundness of several commonly used compiler transformations for various memory models. Adve and Gharachorloo describe in their paper Shared memory consistency models: A tutorial [?] issues related to memory consistency models in a way that is understandable to most computer professionals. They focus on consistency models proposed for hardware-based shared-memory systems. Many of these models were originally specified with an emphasis on the system optimizations they allow. The authors retain the system-centric emphasis but use uniform and simple terminology to describe the different models. They also briefly discuss an alternative programmer-centric view that describes the models in terms of program behaviour rather than specific system optimizations. In A Unified Formalization of Four Shared-Memory Models [?], Adve and Hill present a shared-memory model, data-race-free-1, that unifies four earlier models: weak ordering, release consistency, the VAX memory model, and data-race-free-0. The most intuitive and commonly assumed shared-memory model, sequential consistency, limits performance. The four models are based on the common intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance. However, each model formalizes this intuition differently and has different advantages and disadvantages with respect to the others. Data-race-free-1 unifies these models by formalizing the above intuition in a manner that retains the advantages of each of the four models. The next section introduces some of the vendor architecture specifications and then the x86-tso model. After the definition I will give some code examples that show which behaviour is permitted by this model, and at the end of the next section I will present a Linux spinlock implementation.
3 The Solution

3.1 Vendor Specifications

First, I will give a brief overview of different vendor specifications, comparing them and explaining which behaviour they allow. After that I will introduce the x86-tso memory model. Then I will show some assembly code examples and explain which memory reorderings are allowed. Processor vendors document their architectures so that programmers can rely on them. For some architectures the memory-model aspects are expressed in precise mathematics. For x86, however, these specifications are informal prose documents. For loose specifications of subtle properties, informal prose is a poor medium, because such documents are almost inevitably ambiguous and sometimes wrong. Moreover, one cannot test programs above such a vague specification, and one cannot use it as a criterion for testing processor implementations. I will now review some of the informal-prose Intel and AMD x86 specifications, including the Intel 64 and IA-32 Architectures Software Developer's Manual (SDM) and the AMD64 Architecture Programmer's Manual (APM). Before August 2007, early revisions of the Intel SDM gave an informal-prose model called processor ordering, unsupported by any examples. It is hard to see precisely what this prose means, especially without additional knowledge or assumptions about the microarchitecture of particular implementations. The Intel White Paper (IWP), published in August 2007, gave a somewhat more precise model, with 8 informal-prose principles P1-P8 supported by 10 examples. This was incorporated into later revisions of the Intel SDM, and AMD gave similar but not identical prose in newer revisions of their manual (AMD 3.14). These are essentially causal-consistency models, and they allow different processors to see writes to independent locations in different orders.

Table 2: Independent Reads, Independent Writes (IRIW)

  Proc 0        Proc 1        Proc 2        Proc 3
  MOV [x] 1     MOV [y] 1     MOV EAX [x]   MOV ECX [y]
                              MOV EBX [y]   MOV EDX [x]

  Initial State: all entries are zero
  Forbidden Final State: Proc 2:EAX=1, Proc 2:EBX=0, Proc 3:ECX=1, Proc 3:EDX=0

AMD 3.14 allows the final state of Table 2 explicitly, while IWP allows it implicitly, because IRIW is not ruled out by the stated principles. From a microarchitectural point of view this can arise from store buffers that are shared between some but not all processors. However, both require that, in some sense, causality is respected, as in the IWP principle P5: In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility). Owens et al. used this informal specification as the basis for a formal model, x86-cc, for which a key issue was giving a reasonable interpretation to this causality, which is not defined in IWP or AMD3.14. But these informal specifications turned out to have two serious flaws. First, they are arguably rather weak for programmers. In particular, they admit the IRIW behaviour above but, under reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding MFENCEs would not suffice to recover sequential consistency. Second, and more seriously, x86-cc and IWP are unsound with respect to current processors. The following example, n6, shows a behaviour that is observable but that is disallowed by x86-cc and by any interpretation that can be made of IWP principles P1, P2, P4 and P6.

Table 3: n6

  Proc 0        Proc 1
  MOV [x] 1     MOV [y] 2
  MOV EAX [x]   MOV [x] 2
  MOV EBX [y]

  Allowed Final State: Proc 0:EAX=1, Proc 0:EBX=0, [x]=1

To see why this final state
could be allowed by multiprocessors with FIFO store buffers, suppose that first the Proc 1 write of [y]=2 is buffered, then Proc 0 buffers its write of [x]=1, reads [x]=1 from its own store buffer, and reads [y]=0 from main memory; then Proc 1 buffers its [x]=2 write and flushes its buffered [y]=2 and [x]=2 writes to memory; then finally Proc 0 flushes its [x]=1 write to memory. An important change to the Intel memory-model specification was made in rev. 29 of the Intel SDM. First, the IRIW final state above is forbidden, and the previous coherence condition, P6 (In a multiprocessor system, stores to the same location have a total order), has been replaced by: Any two stores are seen in a consistent order by processors other than those performing the stores (I label this P9). Second, the memory barrier instructions are now included. It is stated that reads and writes cannot pass MFENCE instructions, together with more refined properties for SFENCE and LFENCE. Third, same-processor writes are now explicitly ordered: Writes by a single processor are observed in the same order by all processors (P10); this was regarded as implicit in the IWP P2 (Stores are not reordered with other stores). This revision appears to deal with the unsoundness, admitting the n6 behaviour above, but, unfortunately,

it is still problematic. The first issue is, again, how to interpret causality as used in P5. The second issue is one of weakness: the new P9 says nothing about observations of two stores by those two processors themselves (or by one of those processors and one other). The following examples (which I call n5 and n4b) illustrate potentially surprising behaviour that arguably violates coherence.

Table 4: n5

  Proc 0        Proc 1
  MOV [x] 1     MOV [x] 2
  MOV EAX [x]   MOV EBX [x]

  Forbidden Final State: Proc 0:EAX=2, Proc 1:EBX=1

Table 5: n4b

  Proc 0        Proc 1
  MOV EAX [x]   MOV ECX [x]
  MOV [x] 1     MOV [x] 2

  Forbidden Final State: Proc 0:EAX=2, Proc 1:ECX=1

These final states are not allowed in x86-cc, are not allowed in a pure store-buffer implementation or in x86-tso, and they could not be observed on actual processors. However, the principles stated in revisions 29-34 of the Intel SDM appear, presumably unintentionally, to allow them. The AMD3.14 Vol. 2, 7.2 text taken alone would allow them, but the implied coherence from elsewhere in the AMD manual would forbid them. In November 2009, AMD produced a new revision, 3.15, of their manuals. The main difference in the memory-model specification is that IRIW is now explicitly forbidden.

3.2 The x86-tso Memory Model

In the following part I will explain why a new memory model is needed and give the formal definition of the x86-tso memory model by Owens et al. Given these problems with the informal specifications, it is impossible to produce a useful rigorous model by formalising the principles they contain. Instead, they had to build a reasonable model that is consistent with the given litmus tests, with observed processor behaviour, and with what is known of the needs of programmers, the vendors' intentions, and the folklore in the area.
They emphasise that their aim is a programmer's model of the allowable behaviours of x86 processors as observed by assembly programs, not of the internal structure of processor implementations, nor of what could be observed on hardware interfaces. They present the model in an abstract-machine style to make it accessible, but are concerned only with its external behaviour; its buffers and locks are highly abstracted from the microarchitecture of processor implementations. They have designed a TSO-like model for x86, called x86-tso. It is defined mathematically in two styles: an abstract machine with explicit store buffers, and an axiomatic model that defines valid executions in terms of memory orders; both are formalised in HOL4 and proved equivalent. The abstract machine conveys the programmer-level operational intuition behind x86-tso; I describe it informally in the next subsection.

The x86-tso Abstract Machine Memory Model

Figure 1: x86-tso block diagram

The programmer's model of a multiprocessor x86 system is illustrated in Figure 1. At the top of the figure are a number of hardware threads, each corresponding to a single in-order stream of instruction execution. They interact with a storage subsystem, drawn as the dotted box. The state of the storage subsystem comprises a shared memory that maps addresses to values, a global lock to indicate when a particular hardware thread has exclusive access to memory, and one store buffer per hardware thread. The behaviour of the storage subsystem is described in more detail below, but the main points are:

- The store buffers are FIFO, and a reading thread must read its most recent buffered write to that address, if there is one; otherwise reads are satisfied from shared memory.
- An MFENCE instruction flushes the store buffer of that thread.
- To execute a LOCK'd instruction, a thread must first obtain the global lock. At the end of the instruction, it flushes its store buffer and relinquishes the lock. While the lock is held by one thread, no other thread can read.

More precisely, the possible interactions between the threads and the storage subsystem are described by the following events:

- Wp[a]=v, for a write of value v to address a by thread p
- Rp[a]=v, for a read of v from a by thread p
- Fp, for an MFENCE memory barrier by thread p
- Lp, at the start of a LOCK'd instruction by thread p
- Up, at the end of a LOCK'd instruction by thread p
- Tp, for an internal action of the storage subsystem, propagating a write from p's store buffer to the shared memory

As an example, suppose a particular hardware thread p has come to the instruction INC [56], and p's store buffer contains a single write to 56 with the value 0. In one execution we might see read and write events Rp[56]=0 and Wp[56]=1, followed by two Tp events as the two writes propagate to shared memory. Another execution might start with the write of 0 propagating to shared memory, where it could be overwritten by another thread. Executions of LOCK;INC [56] would be similar but bracketed by Lp and Up events. The behaviour of the storage subsystem is specified by the following rules, where we define a hardware thread to be blocked if the storage subsystem lock is taken by another hardware thread, i.e., while another hardware thread is executing a LOCK'd instruction.

1. Rp[a]=v: p can read v from memory at address a if p is not blocked, there are no writes to a in p's store buffer, and the memory does contain v at a;
2. Rp[a]=v: p can read v from its store buffer for address a if p is not blocked and has v as the newest write to a in its buffer;
3. Wp[a]=v: p can write v to its store buffer for address a at any time;
4. Tp: if p is not blocked, it can silently dequeue the oldest write from its store buffer and place the value in memory at the given address, without coordinating with any hardware thread;
5. Fp: if p's store buffer is empty, it can execute an MFENCE (note that if a hardware thread encounters an MFENCE instruction when its store buffer is not empty, it can take one or more Tp steps to empty the buffer and proceed, and similarly in rule 7 below);
6. Lp: if the lock is not held, it can begin a LOCK'd instruction;
7. Up: if p holds the lock and its store buffer is empty, it can end a LOCK'd instruction.
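Rules 1-7 are concrete enough to be run directly. The following Python sketch is my own rendering of the storage subsystem, not the HOL formalisation: class and method names are mine, and the thread-side instruction semantics are left out. As a usage example, it replays the n6 trace from Section 3.1.

```python
class TSOMachine:
    """Sketch of the x86-tso storage subsystem (rules 1-7 above)."""

    def __init__(self, threads):
        self.mem = {}                        # shared memory, default value 0
        self.buf = {p: [] for p in threads}  # one FIFO store buffer per thread
        self.lock = None                     # thread currently holding the global lock

    def blocked(self, p):
        return self.lock is not None and self.lock != p

    def read(self, p, a):                    # rules 1 and 2
        assert not self.blocked(p)
        for loc, v in reversed(self.buf[p]): # newest buffered write to a wins
            if loc == a:
                return v
        return self.mem.get(a, 0)            # otherwise read shared memory

    def write(self, p, a, v):                # rule 3: buffer at any time
        self.buf[p].append((a, v))

    def dequeue(self, p):                    # rule 4: the internal Tp event
        assert not self.blocked(p)
        a, v = self.buf[p].pop(0)
        self.mem[a] = v

    def mfence(self, p):                     # rule 5: drain the buffer, then proceed
        while self.buf[p]:
            self.dequeue(p)

    def lock_begin(self, p):                 # rule 6
        assert self.lock is None
        self.lock = p

    def lock_end(self, p):                   # rule 7
        assert self.lock == p and not self.buf[p]
        self.lock = None

# Replaying the n6 trace (Table 3):
m = TSOMachine([0, 1])
m.write(1, "y", 2)    # Proc 1 buffers [y]=2
m.write(0, "x", 1)    # Proc 0 buffers [x]=1
eax = m.read(0, "x")  # Proc 0 reads 1 from its own store buffer
ebx = m.read(0, "y")  # Proc 0 reads 0 from shared memory
m.write(1, "x", 2)    # Proc 1 buffers [x]=2
m.dequeue(1)          # [y]=2 reaches memory
m.dequeue(1)          # [x]=2 reaches memory
m.dequeue(0)          # finally [x]=1 overwrites it
print(eax, ebx, m.mem["x"])  # 1 0 1, the final state of Table 3
```

The FIFO discipline of `dequeue` is what makes the machine TSO rather than something weaker: writes leave each buffer in the order they entered it.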
Technically, the formal versions of these rules define a labelled transition system (with the events as labels) for the storage subsystem, and the behaviour of the whole system is defined as a parallel composition of that and transition systems for each thread, synchronising on all labels except the internal Tp actions. Additionally, they tentatively impose a progress condition: each memory write is eventually propagated from the relevant store buffer to the shared memory. This is not stated in the documentation and is hard to test; they were assured that it holds at least for AMD processors. For write-back cacheable memory, and the fragment of the instruction set considered, LFENCE and SFENCE are treated semantically as no-ops. This follows the Intel and AMD documentation, both of which imply that these fences do not order store/load pairs, which are the only reorderings allowed in x86-tso. Note, though, that elsewhere it is stated that the Intel SFENCE flushes the store buffer.

The x86-tso Axiomatic Memory Model

In this subsection I will describe the x86-tso axiomatic memory model. Any particular execution of a program is abstracted into a set of events with additional data, called an event structure. An event represents a read or write of a particular value to a memory address or to a register, or the execution of a fence. Given an event structure, the memory model (here x86-tso) defines what a valid execution is. In more detail, each machine-code instruction may have multiple events associated with it: events are indexed by an instruction ID iiid that identifies which processor the event occurred on and the position in the instruction stream of the instruction it comes from (the program order index, or poi). Events also have an event ID eiid to identify them within an instruction (to permit multiple, otherwise identical, events).
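This indexing scheme can be illustrated with a small Python sketch. The field names iiid, poi and eiid come from the model; the dataclasses and the tuple encoding of actions are my own simplification of the HOL types defined below.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Iiid:
    proc: int  # which hardware thread the instruction ran on
    poi: int   # program order index of the instruction

@dataclass(frozen=True)
class Event:
    eiid: int      # event id, unique within its instruction
    iiid: Iiid     # instruction instance the event belongs to
    action: tuple  # e.g. ("W", "x", 1), ("R", "x", 2) or ("MFENCE",)

def po_strict(e1, e2):
    """Strict program order: same processor, earlier instruction first."""
    return e1.iiid.proc == e2.iiid.proc and e1.iiid.poi < e2.iiid.poi

# Events of Proc 0 running MOV [x] $1 and then MOV EAX [x]:
w  = Event(0, Iiid(0, 0), ("W", "x", 1))    # write of [x]
r  = Event(0, Iiid(0, 1), ("R", "x", 2))    # read of [x] ...
wr = Event(1, Iiid(0, 1), ("W", "EAX", 2))  # ... and the dependent register write
print(po_strict(w, r), po_strict(r, wr))    # True False (same instruction)
```

Note that `r` and `wr` share an iiid but have distinct eiids: program order does not relate them, only the intra-instruction causality relation introduced next.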
An event structure indicates when one of an instruction's events has a dependency on another event of the same instruction with an intra-instruction causality relation, a partial order over the events

of each instruction. Expressing this in HOL, processors are indexed by a type proc = num, the types address and value are both taken to be 32-bit words, and a location is either a memory address or a register of a particular processor:

  location = Location_reg of proc reg
           | Location_mem of address

The model is parameterised by a type reg of x86 registers, which one should think of as an enumeration of the names of the ordinary registers EAX, EBX, etc., the instruction pointer EIP, and the status flags. To identify an instance of an instruction in an execution, its processor and its program order index are specified:

  iiid = <[ proc : proc; poi : num ]>

This introduces a type of records with two fields, a proc of type proc and a program order index poi of type num. An action is either a read or write of a value at some location, or a barrier:

  dirn = R | W
  barrier = LFENCE | SFENCE | MFENCE
  action = Access of dirn (reg location) value
         | Barrier of barrier

Finally, an event has an instruction instance id, an event id (of type eiid = num, unique per iiid), and an action:

  event = <[ eiid : eiid; iiid : iiid; action : action ]>

and an event structure E comprises a set of processors, a set of events, an intra-instruction causality relation, and a partial equivalence relation (PER) capturing sets of events which must occur atomically, all subject to some well-formedness conditions, omitted here:

  event_structure = <[ procs : proc set;
                       events : (reg event) set;
                       intra_causality : (reg event) reln;
                       atomicity : (reg event) set set ]>

I show a very simple event structure below, for the program:

Table 6: Event structure example

         Proc 0         Proc 1
  poi: 0 MOV [x] $1     MOV [x] $2
  poi: 1 MOV EAX [x]

There are four events, the inner (blue) boxes. The event ids are pretty-printed alphabetically, as

a, b, c, d, etc. We also show the assembly instruction that gave rise to each event, e.g. MOV [x] $1, though that is not formally part of the event structure.

Figure 2: Event structure for the example above

Note that events contain concrete values: in this particular event structure, there are two writes of x, with values 1 and 2, a read of [x] with value 2, and a write of Proc 0's EAX register with value 2. Later we show two valid executions for this program, one for this event structure and one for another (note also that some event structures may not have any valid executions). In the diagram, the instructions of each processor are clustered together into the outermost (magenta) boxes, with program order (po) edges between them, and the events of each instruction are clustered together into the intermediate (green) boxes, with intra-causality edges as appropriate: here, in the MOV EAX [x], the write of EAX is dependent on the read of x. This x86-tso axiomatic memory model is based on the SPARCv8 memory model specification, adapted to x86. Compared with the SPARCv8 TSO specification, instruction fetches (IF), instruction loads (IL), flushes (F) and stbars (S) are omitted. The first three deal exclusively with instruction memory, which is not modelled, and the last is useful only under the SPARC PSO memory model. To adapt it to x86 programs, registers and fence events are added, instructions are generalized to give rise to many events (partially ordered by an intra-instruction causality relation), and atomic load/store pairs are generalized to locked instructions. An execution is permitted by this memory model if there exists an execution witness X for its event structure E that is a valid execution. An execution witness contains a memory order, an rfmap, and an initial state; the rest of this section defines when these are valid.
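Before the formal definitions, the flavour of the first validity condition can be sketched in Python. The requirement on the memory order (a strict partial order on all memory accesses, total on the writes) translates directly into checks over a finite relation; the encoding of events as integer ids and orders as sets of pairs is my own simplification.

```python
from itertools import combinations

def is_transitive(order):
    return all((a, c) in order
               for a, b in order for b2, c in order if b == b2)

def is_irreflexive(order):
    return all(a != b for a, b in order)

def valid_memory_order(order, accesses, writes):
    """order is a set of (earlier, later) event-id pairs; it must be a strict
    partial order on the accesses and total when restricted to the writes."""
    total_on_writes = all((w1, w2) in order or (w2, w1) in order
                          for w1, w2 in combinations(writes, 2))
    return is_transitive(order) and is_irreflexive(order) and total_on_writes

# Two writes (ids 0, 1) and one read (id 2):
mo = {(0, 1), (0, 2), (1, 2)}
print(valid_memory_order(mo, {0, 1, 2}, {0, 1}))        # True
print(valid_memory_order({(0, 2)}, {0, 1, 2}, {0, 1}))  # False: writes unordered
```

The read (id 2) may be left unordered with respect to some events, which is exactly why the memory order is only required to be partial on the accesses.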

  execution_witness = <[ memory_order : (reg event) reln;
                         rfmap : (reg event) reln;
                         initial_state : reg location → value option ]>

The memory order is a partial order that records the global ordering of memory events. It must be a total order on memory writes, and corresponds to the SPARCv8 memory order relation ≤, as constrained by the SPARCv8 Order condition:

  partial_order (<_X.memory_order) (mem_accesses E)
  linear_order ((<_X.memory_order) restricted to (mem_writes E)) (mem_writes E)

The initial state is a partial function from locations to values. Each read event's value must come either from the initial state or from a write event: the rfmap (reads-from map) records which, containing (ew, er) pairs where the read er reads from the write ew. The reads_from_map_candidates predicate below ensures that the rfmap only relates such pairs with the same address and value:

  reads_from_map_candidates E rfmap =
    ∀(ew, er) ∈ rfmap.
      (er ∈ reads E) ∧ (ew ∈ writes E) ∧
      (loc ew = loc er) ∧ (value_of ew = value_of er)

Program order is lifted from instructions to a relation po_iico E over events, taking the union of the program order of instructions and intra-instruction causality. This corresponds roughly to the ; relation in SPARCv8. However, intra_causality might not relate some pairs of events in an instruction, so po_iico E will not generally be a total order over the events of a processor:

  po_strict E =
    {(e1, e2) | (e1.iiid.proc = e2.iiid.proc) ∧ e1.iiid.poi < e2.iiid.poi ∧
                e1 ∈ E.events ∧ e2 ∈ E.events}

  <_(po_iico E) = po_strict E ∪ E.intra_causality

The check_rfmap_written predicate below ensures that the rfmap relates a read to the most recent preceding write. For a register read, this is the most recent write in program order. For a memory read, this is the most recent write in memory order among those that precede the read in either memory order or program order (intuitively, the first case is a read of a committed write and the second is a read from the local write buffer).
The check_rfmap_written and reads_from_map_candidates predicates implement the SPARCv8 Value axiom above the rfmap witness data. The check_rfmap_initial predicate extends this to handle the initial state, ensuring that any read not in the rfmap takes its value from the initial state, and that such a read is not preceded by a write in memory order or program order.
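The most-recent-preceding-write condition can be mirrored in a toy Python check. This is a simplification of check_rfmap_written for memory events only (no registers, no initial state); the tuple encoding of events and the function names are my own.

```python
# Events are (id, kind, loc, value) tuples; orders are sets of (id, id) pairs.
def previous_writes(events, er, order):
    """Writes to er's location that precede er in the given order."""
    return {ew for ew in events
            if ew[1] == "W" and (ew[0], er[0]) in order and ew[2] == er[2]}

def maximal_elements(s, order):
    """Elements of s that no other element of s follows in the order."""
    return {e for e in s
            if not any((e[0], e2[0]) in order for e2 in s if e2 != e)}

def check_rfmap_pair(events, ew, er, memory_order, po):
    """May the read er legitimately read from the write ew?"""
    candidates = (previous_writes(events, er, memory_order)
                  | previous_writes(events, er, po))
    return ew in maximal_elements(candidates, memory_order)

w1 = (0, "W", "x", 1)
w2 = (1, "W", "x", 2)
r  = (2, "R", "x", 2)
mo = {(0, 1), (0, 2), (1, 2)}                           # w1, then w2, then r
print(check_rfmap_pair({w1, w2, r}, w2, r, mo, set()))  # True
print(check_rfmap_pair({w1, w2, r}, w1, r, mo, set()))  # False: w2 is more recent
```

Taking the union of the memory-order and program-order predecessors is what lets a read observe its own processor's still-buffered write, exactly as in rule 2 of the abstract machine.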

  previous_writes E er <_order =
    {ew | ew ∈ writes E ∧ ew <_order er ∧ (loc ew = loc er)}

  check_rfmap_written E X =
    ∀(ew, er) ∈ X.rfmap.
      if ew ∈ mem_accesses E then
        ew ∈ maximal_elements
               (previous_writes E er (<_X.memory_order) ∪
                previous_writes E er (<_(po_iico E)))
               (<_X.memory_order)
      else (ew ∈ reg_accesses E ∧
            ew ∈ maximal_elements (previous_writes E er (<_(po_iico E)))
                                  (<_(po_iico E)))

  check_rfmap_initial E X =
    ∀er ∈ (reads E ∖ range X.rfmap).
      (∃l. (loc er = Some l) ∧ (value_of er = X.initial_state l)) ∧
      (previous_writes E er (<_X.memory_order) ∪
       previous_writes E er (<_(po_iico E)) = ∅)

Now the memory order is further constrained, to ensure that it respects the relevant parts of program order, and that the memory accesses of a LOCK'd instruction do occur atomically. Program order is included in memory order, for a memory read before a memory access:

  ∀er ∈ (mem_reads E). ∀e ∈ (mem_accesses E).
    er <_(po_iico E) e ⟹ er <_X.memory_order e

Program order is included in memory order, for a memory write before a memory write:

  ∀ew1, ew2 ∈ (mem_writes E).
    ew1 <_(po_iico E) ew2 ⟹ ew1 <_X.memory_order ew2

Program order is included in memory order, for a memory write before a memory read, if there is an MFENCE between them:

  ∀ew ∈ (mem_writes E). ∀er ∈ (mem_reads E). ∀ef ∈ (mfences E).
    (ew <_(po_iico E) ef ∧ ef <_(po_iico E) er) ⟹ ew <_X.memory_order er

Program order is included in memory order, for any two memory accesses where at least one is from a LOCK'd instruction:

  ∀e1, e2 ∈ (mem_accesses E). ∀es ∈ E.atomicity.
    ((e1 ∈ es ∨ e2 ∈ es) ∧ e1 <_(po_iico E) e2) ⟹ e1 <_X.memory_order e2

The memory accesses of a LOCK'd instruction occur atomically in memory order, i.e., there must be no intervening memory events. Further, all program order relationships between the locked memory accesses and other memory accesses are included in the memory order:

    ∀es ∈ E.atomicity. ∀e ∈ (mem_accesses E \ es).
        (∀e' ∈ (es ∩ mem_accesses E). e <_X.memory_order e') ∨
        (∀e' ∈ (es ∩ mem_accesses E). e' <_X.memory_order e)

To deal properly with infinite executions, it is also required that the prefixes of the memory order are all finite, ensuring that there are no limit points, and, to ensure that each write eventually takes effect globally, there must not be an infinite set of reads unrelated to any particular write, all on the same memory location:

    finite_prefixes (<_X.memory_order) (mem_accesses E)

    ∀ew ∈ (mem_writes E).
        finite { er | er ∈ E.events ∧ (loc er = loc ew) ∧
                      ¬(er <_X.memory_order ew) ∧ ¬(ew <_X.memory_order er) }

A final state of a valid execution takes the last write in memory order for each memory location, together with a maximal write in program order for each register (or the initial state, if there is no such write). This is uniquely defined assuming that no instruction has multiple unrelated writes to the same register, a reasonable property for x86 instructions. The definition of valid_execution E X comprising the above conditions is equivalent to one in which <_X.memory_order is required to be a linear order, not just a partial order.

3.3 Litmus Tests

In this part I give some of the vendor litmus tests to illustrate which behaviour is permitted by x86-TSO. I go through Examples 8-1 to 8-10 from rev. 34 of the Intel SDM, and the three other tests from AMD3.15, and explain the x86-TSO behaviour in each case.

Example 8-1. Stores Are Not Reordered with Other Stores.

Table 7: Example 8-1.
    Proc 0           Proc 1
    MOV [x]←1        MOV EAX←[y]
    MOV [y]←1        MOV EBX←[x]
    Forbidden Final State: Proc 1:EAX=1 ∧ Proc 1:EBX=0

This test implies that the writes by Proc 0 are seen in order by Proc 1's reads, which also execute in order. x86-TSO forbids the final state because Proc 0's store buffer is FIFO, and Proc 0 communicates with Proc 1 only through shared memory.

Example 8-2. Stores Are Not Reordered with Older Loads.

Table 8: Example 8-2.

    Proc 0           Proc 1
    MOV EAX←[x]      MOV EBX←[y]
    MOV [y]←1        MOV [x]←1
    Forbidden Final State: Proc 0:EAX=1 ∧ Proc 1:EBX=1

x86-TSO forbids the final state because reads are never delayed.

Example 8-3. Loads May Be Reordered with Older Stores. This test is just the SB example from Section 1, which x86-TSO permits. The third AMD test (amd3) is similar but with additional writes inserted in the middle of each thread, of 2 to x and y respectively.

Example 8-4. Loads Are Not Reordered with Older Stores to the Same Location.

Table 9: Example 8-4.

    Proc 0
    MOV [x]←1
    MOV EAX←[x]
    Required Final State: Proc 0:EAX=1

x86-TSO requires the specified result because reads must check the local store buffer.

Example 8-5. Intra-Processor Forwarding is Allowed. This test is similar to Example 8-3.

Example 8-6. Stores Are Transitively Visible.

Table 10: Example 8-6.

    Proc 0        Proc 1        Proc 2
    MOV [x]←1     MOV EAX←[x]   MOV EBX←[y]
                  MOV [y]←1     MOV ECX←[x]
    Forbidden Final State: Proc 1:EAX=1 ∧ Proc 2:EBX=1 ∧ Proc 2:ECX=0

x86-TSO forbids the given final state because otherwise the Proc 2 constraints would imply that y was written to shared memory before x. Hence the write to x must still be in Proc 0's store buffer (or the instruction must not yet have executed) when the write to y is initiated. Note that this test contains the only mention of transitive visibility in the Intel SDM, leaving its meaning unclear.

Example 8-7. Stores Are Seen in a Consistent Order by Other Processors. This test rules out the IRIW behaviour described in Section 2.2. x86-TSO forbids the given final state because the Proc 2 constraints imply that x was written to shared memory before y, whereas the

Proc 3 constraints imply that y was written to shared memory before x.

Example 8-8. Locked Instructions Have a Total Order. This is the same as the IRIW Example 8-7, but with LOCK'd instructions for the writes; x86-TSO forbids the final state for the same reason as above.

Example 8-9. Loads Are Not Reordered with Locks.

Table 11: Example 8-9.

    Proc 0           Proc 1
    XCHG [x],EAX     XCHG [y],ECX
    MOV EBX←[y]      MOV EDX←[x]
    Initial state: Proc 0:EAX=1 ∧ Proc 1:ECX=1 (elsewhere 0)
    Forbidden Final State: Proc 0:EBX=0 ∧ Proc 1:EDX=0

This test indicates that locking both writes in Example 8-3 would forbid the non-sequentially-consistent result. x86-TSO forbids the final state because LOCK'd instructions flush the local store buffer. If only one write were LOCK'd (say the write to x), the Example 8-3 final state would be allowed as follows: on Proc 1, buffer the write to y and execute the read of x; then on Proc 0, write to x in shared memory and read from y.

Example 8-10. Stores Are Not Reordered with Locks.

Table 12: Example 8-10.

    Proc 0           Proc 1
    XCHG [x],EAX     MOV EBX←[y]
    MOV [y]←1        MOV ECX←[x]
    Initial state: Proc 0:EAX=1 (elsewhere 0)
    Forbidden Final State: Proc 1:EBX=1 ∧ Proc 1:ECX=0

This is implied by Example 8-1, as we treat the memory writes of LOCK'd instructions as stores.

Test amd5.

Table 13: Test amd5.

    Proc 0          Proc 1
    MOV [x]←1       MOV [y]←1
    MFENCE          MFENCE
    MOV EAX←[y]     MOV EBX←[x]
    Forbidden Final State: Proc 0:EAX=0 ∧ Proc 1:EBX=0

For x86-TSO, this test has the same force as Example 8-8, but using MFENCE instructions to flush the buffers instead of LOCK'd instructions. The tenth AMD test is similar. None of the Intel litmus tests include fence instructions. In x86-TSO, adding an MFENCE between every instruction would clearly

suffice to regain sequential consistency (though obviously in practice one would insert fewer barriers), in contrast to IWP/x86-CC/AMD.

3.4 Linux Spinlock Implementation

In this subsection I show a correct Linux spinlock that relies on x86-TSO as the underlying memory model. I present a spinlock from the Linux kernel as an example of a small but nontrivial concurrent programming idiom, and show how one can reason about this code using the x86-TSO programmer's model, explaining in terms of the model why it works and why the optimisation is sound, thus making clear what the developers' informal reasoning depended on. For accessibility I do this in prose, but the argument could easily be formalised as a proof. The implementation comprises code to acquire and release a spinlock. It is assumed that these are properly bracketed around critical sections and that spinlocks are not mutated by any other code.

Table 14: Linux Spinlock.

    ; On entry the address of the spinlock is in register EAX,
    ; and the spinlock is unlocked iff its value is 1.
    acquire: LOCK DEC [EAX]   ; LOCK'd decrement of [EAX]
             JNS enter        ; branch if [EAX] was >= 1
    spin:    CMP [EAX],0      ; test [EAX]
             JLE spin         ; branch if [EAX] was <= 0
             JMP acquire      ; try again
    enter:                    ; the critical section starts here
             ...
    release: MOV [EAX]←1

A spinlock is represented by a signed integer which is 1 if the lock is free and 0 or less if the lock is held. To acquire a lock, a thread atomically decrements the integer (which will not wrap around, assuming there are fewer than 2^31 hardware threads). If the lock was free, it is now held and the thread can proceed to the critical section. If the lock was held, the thread loops, waiting for it to become free. Because there might be multiple threads waiting for the lock, once it is freed each waiting thread must again attempt to enter through the LOCK'd decrement. To release the lock, a thread simply sets its value to 1.
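For readability, the control flow of Table 14 can be transliterated into Python. This is a hypothetical rendering of my own: a plain mutex stands in for the LOCK'd decrement, and store buffering is not modelled, so it shows the algorithm rather than the TSO subtleties discussed below.

```python
import threading

class Spinlock:
    """Sketch of the Table 14 spinlock: 1 = free, 0 or less = held."""

    def __init__(self):
        self._v = 1                  # the spinlock word, [EAX] in Table 14
        self._m = threading.Lock()   # models the atomic LOCK'd DEC

    def _lock_dec(self):
        with self._m:                # atomic read-modify-write
            self._v -= 1
            return self._v

    def acquire(self):
        while True:
            if self._lock_dec() >= 0:    # JNS enter: [EAX] was >= 1
                return
            while self._v <= 0:          # spin: plain reads, no LOCK
                pass
            # the lock looked free again: JMP acquire and retry

    def release(self):
        self._v = 1                  # plain MOV, not LOCK'd
```

Two threads incrementing a shared counter under this lock never interleave their critical sections. Under CPython the plain read in the spin loop and the plain store in release are unproblematic; on real x86 hardware it is exactly the argument below that justifies the un-LOCK'd release.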
The optimisation in question made the releasing MOV instruction not LOCK'd (removing a LOCK prefix and hence letting the releasing thread proceed without flushing its buffer). For example, consider a spinlock at address x and let y be another shared memory address. Suppose that several threads want to access y, and that they use spinlocks to ensure mutual exclusion. Initially, no one has the lock and [x] = 1. The first thread t to try to acquire the lock atomically decrements x by 1 (using a LOCK prefix); it then jumps into the critical section. Because a store buffer flush is part of LOCK'd instructions, [x] will be 0 in shared memory after the decrement. Now if another thread attempts to acquire the lock, it will not jump into the critical section after performing the atomic decrement, since x was not 1. It will thus enter the spin loop. In this loop, the waiting thread continually reads the value of x until it gets a positive result. Returning to the original thread t, it can read and write y inside of its critical section while the others are spinning. These writes are initially placed in t's store buffer, and some may be propagated to shared memory. However, it does not matter how many (if any) are written to main memory, because (by assumption) no other thread is attempting to read (or write) y. When t is ready to exit the critical section, it releases the lock by writing the value 1 to x; this write is put in t's store buffer. It can now continue

after the critical section (in the text below, we assume it does not try to re-acquire the lock). If the releasing MOV had the LOCK prefix, then all of the buffered writes to y would be sent to main memory, as would the write of 1 to x, and another thread could then acquire the spinlock. However, since it does not, the other threads continue to spin until the write setting x to 1 is removed from t's write buffer and sent to shared memory at some point in the future. At that point, the spinning threads will read 1 and restart the acquisition with atomic decrements, and another thread can enter its critical section. Crucially, because t's write buffer is emptied in FIFO order, any writes to y from within t's critical section must have been propagated to shared memory (in order) before the write to x. Thus, the next thread to enter a critical section will not be able to see y in an inconsistent state.

4 Conclusions, Results, Discussion

I presented x86-TSO, a memory model for x86 processors that does not suffer from the ambiguities, weaknesses, or unsoundnesses of earlier models. Its abstract-machine definition should be intuitive for programmers, whereas its equivalent axiomatic definition supports the memevents exhaustive search and permits an easy comparison with related models; the similarity with SPARCv8 suggests x86-TSO is strong enough to program above.

Here follows a small comparison between TSO (total store ordering) and SC (sequential consistency). x86-TSO is more relaxed (weaker) than SC, but less relaxed than many other memory models. SC is the most intuitive memory model, and TSO comes close because it behaves like SC for common programming idioms; nevertheless, subtle non-SC executions can bite programmers and tool authors. For simple cores, TSO can offer better performance than SC, but the difference can be made small with speculation. SC is widely understood, while TSO is widely adopted. Both SC and TSO are formally defined.
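To make this comparison concrete, the abstract-machine reading of x86-TSO can be prototyped in a few dozen lines. The following enumerator is a toy sketch of my own (the instruction encoding and function name are invented; it is not the memevents tool): each processor gets a FIFO store buffer with buffer forwarding, and a search visits every reachable final state. The SB test then exhibits the non-SC outcome EAX = EBX = 0, while the outcome forbidden by Example 8-1 never appears.

```python
from collections import deque

def tso_final_states(progs):
    """Exhaustively enumerate the final register states of a toy
    x86-TSO machine: one FIFO store buffer per processor; a load takes
    the youngest buffered value for its address, else the shared-memory
    value (locations start at 0). Instructions are ('W', loc, val),
    ('R', loc, reg) and ('F',) for MFENCE."""
    n = len(progs)
    init = ((0,) * n, ((),) * n, (), ())   # pcs, buffers, memory, registers
    seen, finals, work = {init}, set(), deque([init])
    while work:
        pcs, bufs, mem, regs = work.popleft()
        succs = []
        for i in range(n):
            if bufs[i]:                    # flush the oldest buffered write
                loc, val = bufs[i][0]
                m = dict(mem); m[loc] = val
                succs.append((pcs, bufs[:i] + (bufs[i][1:],) + bufs[i + 1:],
                              tuple(sorted(m.items())), regs))
            if pcs[i] < len(progs[i]):
                ins = progs[i][pcs[i]]
                pcs2 = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if ins[0] == 'W':          # store: enqueue in own buffer
                    b = bufs[i] + ((ins[1], ins[2]),)
                    succs.append((pcs2, bufs[:i] + (b,) + bufs[i + 1:],
                                  mem, regs))
                elif ins[0] == 'R':        # load: forward from buffer, else memory
                    hits = [v for l, v in bufs[i] if l == ins[1]]
                    val = hits[-1] if hits else dict(mem).get(ins[1], 0)
                    r = dict(regs); r[ins[2]] = val
                    succs.append((pcs2, bufs, mem, tuple(sorted(r.items()))))
                elif ins[0] == 'F' and not bufs[i]:
                    succs.append((pcs2, bufs, mem, regs))  # MFENCE: buffer drained
        if not succs:                      # every program done, buffers empty
            finals.add(regs)
        for s in succs:
            if s not in seen:
                seen.add(s); work.append(s)
    return finals

# SB (Example 8-3): the non-SC outcome is reachable under TSO ...
sb = tso_final_states([[('W', 'x', 1), ('R', 'y', 'EAX')],
                       [('W', 'y', 1), ('R', 'x', 'EBX')]])
assert {'EAX': 0, 'EBX': 0} in map(dict, sb)

# ... but Example 8-1's forbidden outcome never appears.
ex81 = tso_final_states([[('W', 'x', 1), ('W', 'y', 1)],
                         [('R', 'y', 'EAX'), ('R', 'x', 'EBX')]])
assert not any(d['EAX'] == 1 and d['EBX'] == 0 for d in map(dict, ex81))
```

Dropping the buffers (writing directly to memory) turns the same enumerator into an SC machine, which is exactly the sense in which TSO and SC differ only on tests like SB.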
The bottom line is that SC and TSO are pretty close, especially compared with the more complex and more relaxed memory consistency models. 5 Bibliography

Scott Owens, Susmit Sarkar, Peter Sewell. A better x86 memory model: x86-TSO (extended version). University of Cambridge, March 2009.

Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, Magnus O. Myreen. x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors.

Susmit Sarkar, Peter Sewell, Scott Owens, Tom Ridge, Magnus Myreen, Francesco Zappa Nardelli. The Semantics of x86-cc Multiprocessor Machine Code.

Scott Owens. Reasoning about the Implementation of Concurrency Abstractions on x86-TSO. University of Cambridge.

More information

Java Memory Model. Jian Cao. Department of Electrical and Computer Engineering Rice University. Sep 22, 2016

Java Memory Model. Jian Cao. Department of Electrical and Computer Engineering Rice University. Sep 22, 2016 Java Memory Model Jian Cao Department of Electrical and Computer Engineering Rice University Sep 22, 2016 Content Introduction Java synchronization mechanism Double-checked locking Out-of-Thin-Air violation

More information

Systèmes d Exploitation Avancés

Systèmes d Exploitation Avancés Systèmes d Exploitation Avancés Instructor: Pablo Oliveira ISTY Instructor: Pablo Oliveira (ISTY) Systèmes d Exploitation Avancés 1 / 32 Review : Thread package API tid thread create (void (*fn) (void

More information

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L17: Memory Model Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements HW4 / Lab4 1 Overview Symmetric Multi-Processors (SMPs) MIMD processing cores

More information

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models Lecture 13: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models 1 Coherence Vs. Consistency Recall that coherence guarantees

More information

COMP Parallel Computing. CC-NUMA (2) Memory Consistency

COMP Parallel Computing. CC-NUMA (2) Memory Consistency COMP 633 - Parallel Computing Lecture 11 September 26, 2017 Memory Consistency Reading Patterson & Hennesey, Computer Architecture (2 nd Ed.) secn 8.6 a condensed treatment of consistency models Coherence

More information

C++ Concurrency - Formalised

C++ Concurrency - Formalised C++ Concurrency - Formalised Salomon Sickert Technische Universität München 26 th April 2013 Mutex Algorithms At most one thread is in the critical section at any time. 2 / 35 Dekker s Mutex Algorithm

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The

More information

Portland State University ECE 588/688. Memory Consistency Models

Portland State University ECE 588/688. Memory Consistency Models Portland State University ECE 588/688 Memory Consistency Models Copyright by Alaa Alameldeen 2018 Memory Consistency Models Formal specification of how the memory system will appear to the programmer Places

More information

The New Java Technology Memory Model

The New Java Technology Memory Model The New Java Technology Memory Model java.sun.com/javaone/sf Jeremy Manson and William Pugh http://www.cs.umd.edu/~pugh 1 Audience Assume you are familiar with basics of Java technology-based threads (

More information

Models of concurrency & synchronization algorithms

Models of concurrency & synchronization algorithms Models of concurrency & synchronization algorithms Lecture 3 of TDA383/DIT390 (Concurrent Programming) Carlo A. Furia Chalmers University of Technology University of Gothenburg SP3 2016/2017 Today s menu

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

殷亚凤. Consistency and Replication. Distributed Systems [7]

殷亚凤. Consistency and Replication. Distributed Systems [7] Consistency and Replication Distributed Systems [7] 殷亚凤 Email: yafeng@nju.edu.cn Homepage: http://cs.nju.edu.cn/yafeng/ Room 301, Building of Computer Science and Technology Review Clock synchronization

More information

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue Review of last lecture Design of Parallel and High-Performance Computing Fall 2016 Lecture: Linearizability Motivational video: https://www.youtube.com/watch?v=qx2driqxnbs Instructor: Torsten Hoefler &

More information

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13 Hardware models: inventing a usable abstraction for Power/ARM 1 Hardware models: inventing a usable abstraction for Power/ARM Disclaimer: 1. ARM MM is analogous to Power MM all this is your next phone!

More information

Summary: Issues / Open Questions:

Summary: Issues / Open Questions: Summary: The paper introduces Transitional Locking II (TL2), a Software Transactional Memory (STM) algorithm, which tries to overcomes most of the safety and performance issues of former STM implementations.

More information

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve Designing Memory Consistency Models for Shared-Memory Multiprocessors Sarita V. Adve Computer Sciences Department University of Wisconsin-Madison The Big Picture Assumptions Parallel processing important

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Advanced Operating Systems (CS 202)

Advanced Operating Systems (CS 202) Advanced Operating Systems (CS 202) Memory Consistency, Cache Coherence and Synchronization (Part II) Jan, 30, 2017 (some cache coherence slides adapted from Ian Watson; some memory consistency slides

More information

Memory Consistency Models. CSE 451 James Bornholt

Memory Consistency Models. CSE 451 James Bornholt Memory Consistency Models CSE 451 James Bornholt Memory consistency models The short version: Multiprocessors reorder memory operations in unintuitive, scary ways This behavior is necessary for performance

More information

Concurrent & Distributed Systems Supervision Exercises

Concurrent & Distributed Systems Supervision Exercises Concurrent & Distributed Systems Supervision Exercises Stephen Kell Stephen.Kell@cl.cam.ac.uk November 9, 2009 These exercises are intended to cover all the main points of understanding in the lecture

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Concurrency. Glossary

Concurrency. Glossary Glossary atomic Executing as a single unit or block of computation. An atomic section of code is said to have transactional semantics. No intermediate state for the code unit is visible outside of the

More information

An introduction to weak memory consistency and the out-of-thin-air problem

An introduction to weak memory consistency and the out-of-thin-air problem An introduction to weak memory consistency and the out-of-thin-air problem Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) CONCUR, 7 September 2017 Sequential consistency 2 Sequential

More information

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < > Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

What is uop Cracking?

What is uop Cracking? Nehalem - Part 1 What is uop Cracking? uops are components of larger macro ops. uop cracking is taking CISC like instructions to RISC like instructions it would be good to crack CISC ops in parallel

More information

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Objects Companion slides for The by Maurice Herlihy & Nir Shavit Concurrent Computation memory object object 2 Objectivism What is a concurrent object? How do we describe one? How do we implement

More information

The Java Memory Model

The Java Memory Model Jeremy Manson 1, William Pugh 1, and Sarita Adve 2 1 University of Maryland 2 University of Illinois at Urbana-Champaign Presented by John Fisher-Ogden November 22, 2005 Outline Introduction Sequential

More information

Reasoning about the C/C++ weak memory model

Reasoning about the C/C++ weak memory model Reasoning about the C/C++ weak memory model Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) 13 October 2014 Talk outline I. Introduction Weak memory models The C11 concurrency model

More information

Declarative semantics for concurrency. 28 August 2017

Declarative semantics for concurrency. 28 August 2017 Declarative semantics for concurrency Ori Lahav Viktor Vafeiadis 28 August 2017 An alternative way of defining the semantics 2 Declarative/axiomatic concurrency semantics Define the notion of a program

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information