X-86 Memory Consistency


Andreas Betz
University of Kaiserslautern
a betz12@cs.uni-kl.de

Abstract

In recent years multiprocessors have become ubiquitous, and with them the need for concurrent programming. However, concurrent programming, which is always challenging, is made much more so by two problems. First, multiprocessors do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, and different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architecture specifications, which define what programmers can rely on, are often written in ambiguous informal prose, leading to widespread confusion. In my work I focus on x86 processors. I will present some of the recent Intel and AMD specifications, showing that all contain certain ambiguities, some are arguably too weak to program above, and some are simply unsound with respect to actual hardware. I will also present the x86-tso programmer's model by the group of Sewell, Sarkar and Owens. Their model is mathematically precise and can be presented both as an intuitive abstract machine and as an axiomatic memory model. This should put x86 multiprocessor system building on a more solid foundation.

1 Introduction

Multiprocessor machines, with many processors working on a shared memory, have been developed since the 1960s; today they are ubiquitous. The difficulty of concurrent programming motivated extensive research and resulted in techniques like semaphores, monitors and software model checking. It was almost always assumed that concurrent threads share a single sequentially consistent memory. In reality, to achieve high performance, multiprocessors use sophisticated techniques like store buffers, hierarchies of local caches and speculative execution.
In sequential code such optimisations are not observable, but in a multithreaded system different threads may have different views of memory.

Table 1: Example 1

  Proc 0         Proc 1
  MOV [x] 1      MOV [y] 1
  MOV EAX [y]    MOV EBX [x]

  Allowed Final State: Proc 0:EAX=0 and Proc 1:EBX=0

Example 1 can be seen as a visible consequence of store buffering: if each processor has a FIFO buffer for pending memory writes, then the reads from x and y can occur before the stores have been propagated to main memory. So systems programmers cannot reason, at the level of abstraction of memory reads and writes, in terms of a global time. To make things worse, despite the extensive research on relaxed memory models, some vendors' architectural specifications do not clearly define what they guarantee.

The rest of the paper is structured as follows: in Section 2 I give an overview of related work and existing solutions for x86 memory consistency. Section 3 presents some of the Intel and AMD specifications and shows why they are ambiguous and why some are unsound with respect to actual hardware. That section also gives a mathematical definition of the x86-tso memory model, which is, in contrast to the vendor specifications, unambiguous. It is also accessible, being presented both in an operational abstract-machine style and as an axiomatic memory model. The section further contains the relevant vendor litmus tests, shows which behaviour is permitted by x86-tso, and closes with an implementation of a Linux spinlock under the x86-tso memory model. Section 4 presents the results of my work on x86 memory consistency, discusses future improvements, and compares other possible approaches to memory consistency. The final section lists the references I used for my research on x86 memory consistency.

2 Related Work

There is an extensive literature on relaxed memory models, but most of it does not address x86. I will cover some of the most closely related work. The most relevant work for x86 was published by Owens et al. [?, ?]. They discuss the ambiguities of the vendor specifications and their unsoundness, mathematically define an x86-tso model, use it for a number of litmus tests, and show which behaviour should be permitted by the actual architecture.
They also use their memory model to implement a Linux x86 spinlock and give a definition of data-race freedom for x86 programs with respect to their model. A Primer on Memory Consistency and Cache Coherence by Hill et al. [?] introduces a wide variety of memory consistency models and cache coherence protocols. In Chapter 4 of their book they focus on x86 and also give a formal definition of x86-tso. They further show how systems implement x86-tso and discuss how atomic instructions, and instructions that enforce ordering between instructions, can be implemented. Lastly, they compare the x86-tso model to sequential consistency. The Intel 64 architecture memory ordering white paper [?] gives an overview of the memory reordering observable by software. It gives a number of examples of which reorderings are allowed by the architecture, but also states that any reordering at the hardware level is allowed as long as it does not violate the visibility rules of the architecture. It also gives 8 principles that Intel 64 memory ordering obeys; they are as follows:

1. Loads are not reordered with other loads.
2. Stores are not reordered with other stores.
3. Stores are not reordered with older loads.
4. Loads may be reordered with older stores to different locations but not with older stores to the same location.

5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
6. In a multiprocessor system, stores to the same location have a total order.
7. In a multiprocessor system, locked instructions have a total order.
8. Loads and stores are not reordered with locked instructions.

In Multiprocessor Memory Model Verification, Loewenstein et al. [?] describe a method that uses the system architects' specified memory ordering, as a function of execution, for verifying multiprocessing systems: they monitor not only that the result of an execution conforms to the required memory model, but that the result is exactly what the system architects intended. In typical simulations implementation errors do not propagate to memory model violations, so their approach uses architectural and mathematical insight to amplify the coverage of the analysis. Higham et al. give an overview of different memory models in their Weak memory consistency models part I: Definitions and comparisons [?]. First, they introduce a formal framework which is used to define the different memory models. By using a unifying formal framework they are able to reveal substantial differences between models. They also present the relationships between models when no explicit synchronisation primitives are used and compare the different models to each other. In Owens et al.'s previous work, The semantics of x86-cc multiprocessor machine code [?], they developed a rigorous and accurate semantics for x86 multiprocessor programs. They tested the semantics against actual processors and the vendor litmus-test examples, and gave an equivalent abstract-machine characterisation of their axiomatic memory model. For programs that are data-race free, they also proved in HOL that the behaviour is sequentially consistent. Finally, they compared the x86 model with aspects of the POWER and ARM behaviour.
In Saraswat et al.'s A theory of memory models [?], they present a simple mathematical framework for relaxed memory models for programming languages. They establish that all models in the framework satisfy the fundamental property of relaxed memory models: programs whose sequentially consistent (SC) executions have no races must have only SC executions. They also show how to define synchronisation constructs in their framework and discuss the causality test cases from the Java Memory Model. Boudol and Petri propose, in their work Relaxed memory models: an operational approach [?], a new approach to formalizing a memory model in which the model itself is part of a weak operational semantics for a programming language. They formalize in this way write operations to the store that may be buffered. They derive the ordering constraints from the weak semantics of programs and prove, at the programming language level, that the weak semantics implements the usual interleaving semantics for data-race-free programs. Burckhardt and Musuvathi propose a new verification technique for the most common relaxation, store buffers, in their work Effective program verification for relaxed memory models [?]. They first present a monitor algorithm that can detect the presence of program executions that are not sequentially consistent due to store buffers while only exploring sequentially consistent executions. Then, they combine this monitor with a stateless model checker that verifies that every sequentially consistent execution is correct. They have implemented this algorithm in a prototype tool called Sober and present experiments that demonstrate the precision and scalability of their method. In their other work, Verifying compiler transformations for concurrent programs [?], Burckhardt and Musuvathi present a novel proof methodology for proving the soundness of compiler transformations for concurrent programs.
Their methodology is based on a new formalization of memory models as dynamic rewrite rules on event streams. They implement their proof methodology in a first-of-its-kind

semi-automated tool called Traver to verify or falsify compiler transformations. Using Traver, they prove or refute the soundness of several commonly used compiler transformations for various memory models. Adve and Gharachorloo describe in their paper Shared memory consistency models: A tutorial [?] issues related to memory consistency models in a way that is understandable to most computer professionals. They focus on consistency models proposed for hardware-based shared-memory systems. Many of these models were originally specified with an emphasis on the system optimizations they allow. The authors retain the system-centric emphasis but use uniform and simple terminology to describe the different models. They also briefly discuss an alternative programmer-centric view that describes the models in terms of program behaviour rather than specific system optimizations. In A Unified Formalization of Four Shared-Memory Models [?], Adve and Hill present a shared-memory model, data-race-free-1, that unifies four earlier models: weak ordering, release consistency, the VAX memory model, and data-race-free-0. The most intuitive and commonly assumed shared-memory model, sequential consistency, limits performance. The four models are based on the common intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance. However, each model formalizes this intuition differently and has different advantages and disadvantages with respect to the others. Data-race-free-1 unifies these models by formalizing the above intuition in a manner that retains the advantages of each of the four models. The next section introduces some of the vendor architecture specifications and then the x86-tso model. After the definition I will give some code examples that show which behaviour is permitted by this model, and at the end of the next section I will present a Linux spinlock implementation.
3 The Solution

3.1 Vendor Specifications

First, I will give a brief overview of different vendor specifications, comparing them and explaining which behaviour they allow. After that I will introduce the x86-tso memory model. Then I will show some assembly code examples and explain which memory reorderings are allowed. Processor vendors document their architectures so that programmers can rely on them. For some architectures the memory-model aspects are expressed in precise mathematics. For x86, however, these specifications are informal prose documents. For loose specifications of subtle properties, informal prose is a poor medium, because such documents are almost inevitably ambiguous and sometimes wrong. Moreover, one cannot test programs above such a vague specification, and one cannot use it as a criterion for testing processor implementations. I will now review some of the informal-prose Intel and AMD x86 specifications, including the Intel 64 and IA-32 Architectures Software Developer's Manual (SDM) and the AMD64 Architecture Programmer's Manual (APM). Before August 2007, early revisions of the Intel SDM gave an informal-prose model called processor ordering, unsupported by any examples. It is hard to see precisely what this prose means, especially without additional knowledge or assumptions about the microarchitecture of particular implementations. The Intel White Paper (IWP), published in August 2007, gave a somewhat more precise model, with 8 informal-prose principles P1-P8 supported by 10 examples. This was incorporated into later revisions of the Intel SDM, and AMD gave similar but not identical prose in newer revisions of their manual (AMD 3.14). These are essentially causal-consistency models, and they allow different processors to see writes to independent locations in different orders.

Table 2: Independent Reads, Independent Writes (IRIW)

  Proc 0        Proc 1        Proc 2        Proc 3
  MOV [x] 1     MOV [y] 1     MOV EAX [x]   MOV ECX [y]
                              MOV EBX [y]   MOV EDX [x]

  Initial State: all entries are zero
  Forbidden Final State: Proc 2:EAX=1, Proc 2:EBX=0, Proc 3:ECX=1, Proc 3:EDX=0

AMD 3.14 allows the final state of Table 2 explicitly, while IWP allows it implicitly, because IRIW is not ruled out by the stated principles. From a microarchitectural point of view this can arise from store buffers that are shared between some but not all processors. However, both require that, in some sense, causality is respected, as in the IWP principle P5: In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility). Owens et al. used this informal specification as the basis for a formal model, x86-cc, for which a key issue was giving a reasonable interpretation to this causality, which is not defined in IWP or AMD3.14. But these informal specifications turned out to have two serious flaws. First, they are arguably rather weak for programmers. In particular, they admit the IRIW behaviour above but, under reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding MFENCEs would not suffice to recover sequential consistency. Second, and more seriously, x86-cc and IWP are unsound with respect to current processors. The following example, n6, shows a behaviour that is observable but that is disallowed by x86-cc and by any interpretation that can be made of IWP principles P1, P2, P4 and P6.

Table 3: n6

  Proc 0        Proc 1
  MOV [x] 1     MOV [y] 2
  MOV EAX [x]   MOV [x] 2
  MOV EBX [y]

  Allowed Final State: Proc 0:EAX=1, Proc 0:EBX=0, [x]=1

To see why this final state
could be allowed by multiprocessors with FIFO store buffers, suppose that first the Proc 1 write of [y]=2 is buffered, then Proc 0 buffers its write of [x]=1, reads [x]=1 from its own store buffer, and reads [y]=0 from main memory; then Proc 1 buffers its [x]=2 write and flushes its buffered [y]=2 and [x]=2 writes to memory; then finally Proc 0 flushes its [x]=1 write to memory. An important change to the Intel memory-model specification was made in rev. 29 of the Intel SDM. First, the IRIW final state above is forbidden, and the previous coherence condition, P6 (In a multiprocessor system, stores to the same location have a total order), has been replaced by: Any two stores are seen in a consistent order by processors other than those performing the stores (I label this P9). Second, the memory barrier instructions are now included. It is stated that reads and writes cannot pass MFENCE instructions, together with more refined properties for SFENCE and LFENCE. Third, same-processor writes are now explicitly ordered: Writes by a single processor are observed in the same order by all processors (P10); this was regarded as implicit in the IWP P2 (Stores are not reordered with other stores). This revision appears to deal with the unsoundness, admitting the n6 behaviour above, but, unfortunately,

it is still problematic. The first issue is, again, how to interpret causality as used in P5. The second issue is one of weakness: the new P9 says nothing about observations of two stores by those two processors themselves (or by one of those processors and one other). The following examples (which I call n5 and n4b) illustrate potentially surprising behaviour that arguably violates coherence.

Table 4: n5

  Proc 0        Proc 1
  MOV [x] 1     MOV [x] 2
  MOV EAX [x]   MOV EBX [x]

  Forbidden Final State: Proc 0:EAX=2, Proc 1:EBX=1

Table 5: n4b

  Proc 0        Proc 1
  MOV EAX [x]   MOV ECX [x]
  MOV [x] 1     MOV [x] 2

  Forbidden Final State: Proc 0:EAX=2, Proc 1:ECX=1

These final states are not allowed in x86-cc, are not allowed in a pure store-buffer implementation or in x86-tso, and they could not be observed on actual processors. However, the principles stated in revisions 29-34 of the Intel SDM appear, presumably unintentionally, to allow them. The AMD3.14 Vol. 2, 7.2 text taken alone would allow them, but the implied coherence from elsewhere in the AMD manual would forbid them. In November 2009, AMD produced a new revision, 3.15, of their manuals. The main difference in the memory-model specification is that IRIW is now explicitly forbidden.

3.2 The x86-tso Memory Model

In the following part I will explain why a new memory model is needed and give the formal definition of the x86-tso memory model by Owens et al. Given these problems with the informal specifications, it is impossible to produce a useful rigorous model by formalising the principles they contain. Instead, they had to build a reasonable model that is consistent with the given litmus tests, with observed processor behaviour, and with what is known of the needs of programmers, the vendors' intentions, and the folklore in the area.
They emphasise that their aim is a programmer's model of the allowable behaviours of x86 processors as observed by assembly programs, not of the internal structure of processor implementations, nor of what could be observed on hardware interfaces. They present the model in an abstract-machine style to make it accessible, but are concerned only with its external behaviour; its buffers and locks are highly abstracted from the microarchitecture of processor implementations. They have designed a TSO-like model for x86, called x86-tso. It is defined mathematically in two styles: an abstract machine with explicit store buffers, and an axiomatic model that defines valid executions in terms of memory orders; both are formalised in HOL4 and proved equivalent. The abstract machine conveys the programmer-level operational intuition behind x86-tso; I describe it informally in the next subsection.

The x86-tso Abstract Machine Memory Model

Figure 1: x86-tso block diagram

The programmer's model of a multiprocessor x86 system is illustrated in Figure 1. At the top of the figure are a number of hardware threads, each corresponding to a single in-order stream of instruction execution. They interact with a storage subsystem, drawn as the dotted box. The state of the storage subsystem comprises a shared memory that maps addresses to values, a global lock to indicate when a particular hardware thread has exclusive access to memory, and one store buffer per hardware thread. The behaviour of the storage subsystem is described in more detail below, but the main points are:

- The store buffers are FIFO, and a reading thread must read its most recent buffered write to that address, if there is one; otherwise reads are satisfied from shared memory.
- An MFENCE instruction flushes the store buffer of that thread.
- To execute a LOCK'd instruction, a thread must first obtain the global lock. At the end of the instruction, it flushes its store buffer and relinquishes the lock. While the lock is held by one thread, no other thread can read.

More precisely, the possible interactions between the threads and the storage subsystem are described by the following events:

- Wp[a]=v, for a write of value v to address a by thread p
- Rp[a]=v, for a read of v from a by thread p
- Fp, for an MFENCE memory barrier by thread p
- Lp, at the start of a LOCK'd instruction by thread p
- Up, at the end of a LOCK'd instruction by thread p
- Tp, for an internal action of the storage subsystem, propagating a write from p's store buffer to the shared memory

As an example, suppose a particular hardware thread p has come to the instruction INC [56], and p's store buffer contains a single write to 56 with the value 0. In one execution we might see read and write events Rp[56]=0 and Wp[56]=1, followed by two Tp events as the two writes propagate to shared memory. Another execution might start with the write of 0 propagating to shared memory, where it could be overwritten by another thread. Executions of LOCK;INC [56] would be similar but bracketed by Lp and Up events. The behaviour of the storage subsystem is specified by the following rules, where we define a hardware thread to be blocked if the storage subsystem lock is taken by another hardware thread, i.e., while another hardware thread is executing a LOCK'd instruction.

1. Rp[a]=v: p can read v from memory at address a if p is not blocked, there are no writes to a in p's store buffer, and the memory does contain v at a;
2. Rp[a]=v: p can read v from its store buffer for address a if p is not blocked and has v as the newest write to a in its buffer;
3. Wp[a]=v: p can write v to its store buffer for address a at any time;
4. Tp: if p is not blocked, it can silently dequeue the oldest write from its store buffer and place the value in memory at the given address, without coordinating with any hardware thread;
5. Fp: if p's store buffer is empty, it can execute an MFENCE (note that if a hardware thread encounters an MFENCE instruction when its store buffer is not empty, it can take one or more Tp steps to empty the buffer and proceed, and similarly in rule 7 below);
6. Lp: if the lock is not held, it can begin a LOCK'd instruction;
7. Up: if p holds the lock and its store buffer is empty, it can end a LOCK'd instruction.
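Rules 1-7 are concrete enough to be run directly. The following Python sketch is my own rendering of the storage subsystem, not the HOL formalisation: class and method names are mine, and the thread-side instruction semantics are left out. As a usage example, it replays the n6 trace from Section 3.1.

```python
class TSOMachine:
    """Sketch of the x86-tso storage subsystem (rules 1-7 above)."""

    def __init__(self, threads):
        self.mem = {}                        # shared memory, default value 0
        self.buf = {p: [] for p in threads}  # one FIFO store buffer per thread
        self.lock = None                     # thread currently holding the global lock

    def blocked(self, p):
        return self.lock is not None and self.lock != p

    def read(self, p, a):                    # rules 1 and 2
        assert not self.blocked(p)
        for loc, v in reversed(self.buf[p]): # newest buffered write to a wins
            if loc == a:
                return v
        return self.mem.get(a, 0)            # otherwise read shared memory

    def write(self, p, a, v):                # rule 3: buffer at any time
        self.buf[p].append((a, v))

    def dequeue(self, p):                    # rule 4: the internal Tp event
        assert not self.blocked(p)
        a, v = self.buf[p].pop(0)
        self.mem[a] = v

    def mfence(self, p):                     # rule 5: drain the buffer, then proceed
        while self.buf[p]:
            self.dequeue(p)

    def lock_begin(self, p):                 # rule 6
        assert self.lock is None
        self.lock = p

    def lock_end(self, p):                   # rule 7
        assert self.lock == p and not self.buf[p]
        self.lock = None

# Replaying the n6 trace (Table 3):
m = TSOMachine([0, 1])
m.write(1, "y", 2)    # Proc 1 buffers [y]=2
m.write(0, "x", 1)    # Proc 0 buffers [x]=1
eax = m.read(0, "x")  # Proc 0 reads 1 from its own store buffer
ebx = m.read(0, "y")  # Proc 0 reads 0 from shared memory
m.write(1, "x", 2)    # Proc 1 buffers [x]=2
m.dequeue(1)          # [y]=2 reaches memory
m.dequeue(1)          # [x]=2 reaches memory
m.dequeue(0)          # finally [x]=1 overwrites it
print(eax, ebx, m.mem["x"])  # 1 0 1, the final state of Table 3
```

The FIFO discipline of `dequeue` is what makes the machine TSO rather than something weaker: writes leave each buffer in the order they entered it.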
Technically, the formal versions of these rules define a labelled transition system (with the events as labels) for the storage subsystem, and the behaviour of the whole system is defined as a parallel composition of that and transition systems for each thread, synchronising on all labels except the internal Tp actions. Additionally, they tentatively impose a progress condition: each memory write is eventually propagated from the relevant store buffer to the shared memory. This is not stated in the documentation and is hard to test; they were assured that it holds at least for AMD processors. For write-back cacheable memory, and the fragment of the instruction set considered, LFENCE and SFENCE are treated semantically as no-ops. This follows the Intel and AMD documentation, both of which imply that these fences do not order store/load pairs, which are the only reorderings allowed in x86-tso. Note, though, that elsewhere it is stated that the Intel SFENCE flushes the store buffer.

The x86-tso Axiomatic Memory Model

In this subsection I will describe the x86-tso axiomatic memory model. Any particular execution of a program is abstracted into a set of events with additional data, called an event structure. An event represents a read or write of a particular value to a memory address or to a register, or the execution of a fence. Given an event structure, the memory model (here x86-tso) defines what a valid execution is. In more detail, each machine-code instruction may have multiple events associated with it: events are indexed by an instruction ID iiid that identifies which processor the event occurred on and the position in the instruction stream of the instruction it comes from (the program order index, or poi). Events also have an event ID eiid to identify them within an instruction (to permit multiple, otherwise identical, events).
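This indexing scheme can be illustrated with a small Python sketch. The field names iiid, poi and eiid come from the model; the dataclasses and the tuple encoding of actions are my own simplification of the HOL types defined below.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Iiid:
    proc: int  # which hardware thread the instruction ran on
    poi: int   # program order index of the instruction

@dataclass(frozen=True)
class Event:
    eiid: int      # event id, unique within its instruction
    iiid: Iiid     # instruction instance the event belongs to
    action: tuple  # e.g. ("W", "x", 1), ("R", "x", 2) or ("MFENCE",)

def po_strict(e1, e2):
    """Strict program order: same processor, earlier instruction first."""
    return e1.iiid.proc == e2.iiid.proc and e1.iiid.poi < e2.iiid.poi

# Events of Proc 0 running MOV [x] $1 and then MOV EAX [x]:
w  = Event(0, Iiid(0, 0), ("W", "x", 1))    # write of [x]
r  = Event(0, Iiid(0, 1), ("R", "x", 2))    # read of [x] ...
wr = Event(1, Iiid(0, 1), ("W", "EAX", 2))  # ... and the dependent register write
print(po_strict(w, r), po_strict(r, wr))    # True False (same instruction)
```

Note that `r` and `wr` share an iiid but have distinct eiids: program order does not relate them, only the intra-instruction causality relation introduced next.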
An event structure indicates when one of an instruction's events has a dependency on another event of the same instruction with an intra-instruction causality relation, a partial order over the events

of each instruction. Expressing this in HOL, processors are indexed by a type proc = num, the types address and value are both taken to be 32-bit words, and a location is either a memory address or a register of a particular processor:

  location = Location_reg of proc reg
           | Location_mem of address

The model is parameterised by a type reg of x86 registers, which one should think of as an enumeration of the names of the ordinary registers EAX, EBX, etc., the instruction pointer EIP, and the status flags. To identify an instance of an instruction in an execution, its processor and its program order index are specified:

  iiid = <[ proc : proc; poi : num ]>

This introduces a type of records with two fields, a proc of type proc and a program order index poi of type num. An action is either a read or write of a value at some location, or a barrier:

  dirn = R | W
  barrier = LFENCE | SFENCE | MFENCE
  action = Access of dirn (reg location) value
         | Barrier of barrier

Finally, an event has an instruction instance id, an event id (of type eiid = num, unique per iiid), and an action:

  event = <[ eiid : eiid; iiid : iiid; action : action ]>

and an event structure E comprises a set of processors, a set of events, an intra-instruction causality relation, and a partial equivalence relation (PER) capturing sets of events which must occur atomically, all subject to some well-formedness conditions, omitted here:

  event_structure = <[ procs : proc set;
                       events : (reg event) set;
                       intra_causality : (reg event) reln;
                       atomicity : (reg event) set set ]>

I show a very simple event structure below, for the program:

Table 6: Event structure example

         Proc 0         Proc 1
  poi: 0 MOV [x] $1     MOV [x] $2
  poi: 1 MOV EAX [x]

There are four events, the inner (blue) boxes. The event ids are pretty-printed alphabetically, as

a, b, c, d, etc. We also show the assembly instruction that gave rise to each event, e.g. MOV [x] $1, though that is not formally part of the event structure.

Figure 2: Event structure for the example above

Note that events contain concrete values: in this particular event structure, there are two writes of x, with values 1 and 2, a read of [x] with value 2, and a write of Proc 0's EAX register with value 2. Later we show two valid executions for this program, one for this event structure and one for another (note also that some event structures may not have any valid executions). In the diagram, the instructions of each processor are clustered together into the outermost (magenta) boxes, with program order (po) edges between them, and the events of each instruction are clustered together into the intermediate (green) boxes, with intra-causality edges as appropriate: here, in the MOV EAX [x], the write of EAX is dependent on the read of x. This x86-tso axiomatic memory model is based on the SPARCv8 memory model specification, adapted to x86. Compared with the SPARCv8 TSO specification, instruction fetches (IF), instruction loads (IL), flushes (F) and stbars (S) are omitted. The first three deal exclusively with instruction memory, which is not modelled, and the last is useful only under the SPARC PSO memory model. To adapt it to x86 programs, registers and fence events are added, instructions are generalized to give rise to many events (partially ordered by an intra-instruction causality relation), and atomic load/store pairs are generalized to locked instructions. An execution is permitted by this memory model if there exists an execution witness X for its event structure E that is a valid execution. An execution witness contains a memory order, an rfmap, and an initial state; the rest of this section defines when these are valid.
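Before the formal definitions, the flavour of the first validity condition can be sketched in Python. The requirement on the memory order (a strict partial order on all memory accesses, total on the writes) translates directly into checks over a finite relation; the encoding of events as integer ids and orders as sets of pairs is my own simplification.

```python
from itertools import combinations

def is_transitive(order):
    return all((a, c) in order
               for a, b in order for b2, c in order if b == b2)

def is_irreflexive(order):
    return all(a != b for a, b in order)

def valid_memory_order(order, accesses, writes):
    """order is a set of (earlier, later) event-id pairs; it must be a strict
    partial order on the accesses and total when restricted to the writes."""
    total_on_writes = all((w1, w2) in order or (w2, w1) in order
                          for w1, w2 in combinations(writes, 2))
    return is_transitive(order) and is_irreflexive(order) and total_on_writes

# Two writes (ids 0, 1) and one read (id 2):
mo = {(0, 1), (0, 2), (1, 2)}
print(valid_memory_order(mo, {0, 1, 2}, {0, 1}))        # True
print(valid_memory_order({(0, 2)}, {0, 1, 2}, {0, 1}))  # False: writes unordered
```

The read (id 2) may be left unordered with respect to some events, which is exactly why the memory order is only required to be partial on the accesses.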

  execution_witness = <[ memory_order : (reg event) reln;
                         rfmap : (reg event) reln;
                         initial_state : reg location → value option ]>

The memory order is a partial order that records the global ordering of memory events. It must be a total order on memory writes, and corresponds to the SPARCv8 memory order relation ≤, as constrained by the SPARCv8 Order condition:

  partial_order (<_X.memory_order) (mem_accesses E)
  linear_order ((<_X.memory_order) restricted to (mem_writes E)) (mem_writes E)

The initial state is a partial function from locations to values. Each read event's value must come either from the initial state or from a write event: the rfmap (reads-from map) records which, containing (ew, er) pairs where the read er reads from the write ew. The reads_from_map_candidates predicate below ensures that the rfmap only relates such pairs with the same address and value:

  reads_from_map_candidates E rfmap =
    ∀(ew, er) ∈ rfmap.
      (er ∈ reads E) ∧ (ew ∈ writes E) ∧
      (loc ew = loc er) ∧ (value_of ew = value_of er)

Program order is lifted from instructions to a relation po_iico E over events, taking the union of the program order of instructions and intra-instruction causality. This corresponds roughly to the ; relation in SPARCv8. However, intra_causality might not relate some pairs of events in an instruction, so po_iico E will not generally be a total order over the events of a processor:

  po_strict E =
    {(e1, e2) | (e1.iiid.proc = e2.iiid.proc) ∧ e1.iiid.poi < e2.iiid.poi ∧
                e1 ∈ E.events ∧ e2 ∈ E.events}

  <_(po_iico E) = po_strict E ∪ E.intra_causality

The check_rfmap_written predicate below ensures that the rfmap relates a read to the most recent preceding write. For a register read, this is the most recent write in program order. For a memory read, this is the most recent write in memory order among those that precede the read in either memory order or program order (intuitively, the first case is a read of a committed write and the second is a read from the local write buffer).
The check_rfmap_written and reads_from_map_candidates predicates implement the SPARCv8 Value axiom above the rfmap witness data. The check_rfmap_initial predicate extends this to handle the initial state, ensuring that any read not in the rfmap takes its value from the initial state, and that such a read is not preceded by a write in memory order or program order.
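The most-recent-preceding-write condition can be mirrored in a toy Python check. This is a simplification of check_rfmap_written for memory events only (no registers, no initial state); the tuple encoding of events and the function names are my own.

```python
# Events are (id, kind, loc, value) tuples; orders are sets of (id, id) pairs.
def previous_writes(events, er, order):
    """Writes to er's location that precede er in the given order."""
    return {ew for ew in events
            if ew[1] == "W" and (ew[0], er[0]) in order and ew[2] == er[2]}

def maximal_elements(s, order):
    """Elements of s that no other element of s follows in the order."""
    return {e for e in s
            if not any((e[0], e2[0]) in order for e2 in s if e2 != e)}

def check_rfmap_pair(events, ew, er, memory_order, po):
    """May the read er legitimately read from the write ew?"""
    candidates = (previous_writes(events, er, memory_order)
                  | previous_writes(events, er, po))
    return ew in maximal_elements(candidates, memory_order)

w1 = (0, "W", "x", 1)
w2 = (1, "W", "x", 2)
r  = (2, "R", "x", 2)
mo = {(0, 1), (0, 2), (1, 2)}                           # w1, then w2, then r
print(check_rfmap_pair({w1, w2, r}, w2, r, mo, set()))  # True
print(check_rfmap_pair({w1, w2, r}, w1, r, mo, set()))  # False: w2 is more recent
```

Taking the union of the memory-order and program-order predecessors is what lets a read observe its own processor's still-buffered write, exactly as in rule 2 of the abstract machine.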

  previous_writes E er <_order =
    {ew | ew ∈ writes E ∧ ew <_order er ∧ (loc ew = loc er)}

  check_rfmap_written E X =
    ∀(ew, er) ∈ X.rfmap.
      if ew ∈ mem_accesses E then
        ew ∈ maximal_elements
               (previous_writes E er (<_X.memory_order) ∪
                previous_writes E er (<_(po_iico E)))
               (<_X.memory_order)
      else (ew ∈ reg_accesses E ∧
            ew ∈ maximal_elements (previous_writes E er (<_(po_iico E)))
                                  (<_(po_iico E)))

  check_rfmap_initial E X =
    ∀er ∈ (reads E ∖ range X.rfmap).
      (∃l. (loc er = Some l) ∧ (value_of er = X.initial_state l)) ∧
      (previous_writes E er (<_X.memory_order) ∪
       previous_writes E er (<_(po_iico E)) = ∅)

Now the memory order is further constrained, to ensure that it respects the relevant parts of program order, and that the memory accesses of a LOCK'd instruction do occur atomically. Program order is included in memory order, for a memory read before a memory access:

  ∀er ∈ (mem_reads E). ∀e ∈ (mem_accesses E).
    er <_(po_iico E) e ⟹ er <_X.memory_order e

Program order is included in memory order, for a memory write before a memory write:

  ∀ew1, ew2 ∈ (mem_writes E).
    ew1 <_(po_iico E) ew2 ⟹ ew1 <_X.memory_order ew2

Program order is included in memory order, for a memory write before a memory read, if there is an MFENCE between them:

  ∀ew ∈ (mem_writes E). ∀er ∈ (mem_reads E). ∀ef ∈ (mfences E).
    (ew <_(po_iico E) ef ∧ ef <_(po_iico E) er) ⟹ ew <_X.memory_order er

Program order is included in memory order, for any two memory accesses where at least one is from a LOCK'd instruction:

  ∀e1, e2 ∈ (mem_accesses E). ∀es ∈ E.atomicity.
    ((e1 ∈ es ∨ e2 ∈ es) ∧ e1 <_(po_iico E) e2) ⟹ e1 <_X.memory_order e2

The memory accesses of a LOCK'd instruction occur atomically in memory order, i.e., there must be no intervening memory events. Further, all program order relationships between the locked memory accesses and other memory accesses are included in the memory order:

    ∀es ∈ E.atomicity. ∀e ∈ (mem_accesses E \ es).
        (∀e' ∈ (es ∩ mem_accesses E). e <_X.memory_order e') ∨
        (∀e' ∈ (es ∩ mem_accesses E). e' <_X.memory_order e)

To deal properly with infinite executions, it is also required that the prefixes of the memory order are all finite, ensuring that there are no limit points, and, to ensure that each write eventually takes effect globally, there must not be an infinite set of reads unrelated to any particular write, all on the same memory location:

    finite_prefixes (<_X.memory_order) (mem_accesses E)

    ∀ew ∈ (mem_writes E).
        finite { er | er ∈ E.events ∧ (loc er = loc ew) ∧
                      ¬(er <_X.memory_order ew) ∧ ¬(ew <_X.memory_order er) }

A final state of a valid execution takes the last write in memory order for each memory location, together with a maximal write in program order for each register (or the initial state, if there is no such write). This is uniquely defined assuming that no instruction has multiple unrelated writes to the same register, a reasonable property for x86 instructions. The definition of valid_execution E X comprising the above conditions is equivalent to one in which <_X.memory_order is required to be a linear order, not just a partial order.

3.3 Litmus Tests

In this part I give some of the vendor litmus tests to illustrate which behaviour is permitted by x86-TSO. I go through Examples 8-1 to 8-10 from rev. 34 of the Intel SDM, and the three other tests from AMD3.15, and explain the x86-TSO behaviour in each case.

Example 8-1. Stores Are Not Reordered with Other Stores.

Table 7: Example 8-1.
    Proc 0           Proc 1
    MOV [x]←1        MOV EAX←[y]
    MOV [y]←1        MOV EBX←[x]
    Forbidden Final State: Proc 1:EAX=1 ∧ Proc 1:EBX=0

This test implies that the writes by Proc 0 are seen in order by Proc 1's reads, which also execute in order. x86-TSO forbids the final state because Proc 0's store buffer is FIFO, and Proc 0 communicates with Proc 1 only through shared memory.

Example 8-2. Stores Are Not Reordered with Older Loads.

Table 8: Example 8-2.

    Proc 0           Proc 1
    MOV EAX←[x]      MOV EBX←[y]
    MOV [y]←1        MOV [x]←1
    Forbidden Final State: Proc 0:EAX=1 ∧ Proc 1:EBX=1

x86-TSO forbids the final state because reads are never delayed.

Example 8-3. Loads May Be Reordered with Older Stores. This test is just the SB example from Section 1, which x86-TSO permits. The third AMD test (amd3) is similar but with additional writes inserted in the middle of each thread, of 2 to x and y respectively.

Example 8-4. Loads Are Not Reordered with Older Stores to the Same Location.

Table 9: Example 8-4.

    Proc 0
    MOV [x]←1
    MOV EAX←[x]
    Required Final State: Proc 0:EAX=1

x86-TSO requires the specified result because reads must check the local store buffer.

Example 8-5. Intra-Processor Forwarding is Allowed. This test is similar to Example 8-3.

Example 8-6. Stores Are Transitively Visible.

Table 10: Example 8-6.

    Proc 0        Proc 1        Proc 2
    MOV [x]←1     MOV EAX←[x]   MOV EBX←[y]
                  MOV [y]←1     MOV ECX←[x]
    Forbidden Final State: Proc 1:EAX=1 ∧ Proc 2:EBX=1 ∧ Proc 2:ECX=0

x86-TSO forbids the given final state because otherwise the Proc 2 constraints would imply that y was written to shared memory before x. Hence the write to x must still be in Proc 0's store buffer (or the instruction must not yet have executed) when the write to y is initiated. Note that this test contains the only mention of transitive visibility in the Intel SDM, leaving its meaning unclear.

Example 8-7. Stores Are Seen in a Consistent Order by Other Processors. This test rules out the IRIW behaviour described in Section 2.2. x86-TSO forbids the given final state because the Proc 2 constraints imply that x was written to shared memory before y, whereas the

Proc 3 constraints imply that y was written to shared memory before x.

Example 8-8. Locked Instructions Have a Total Order. This is the same as the IRIW Example 8-7, but with LOCK'd instructions for the writes; x86-TSO forbids the final state for the same reason as above.

Example 8-9. Loads Are Not Reordered with Locks.

Table 11: Example 8-9.

    Proc 0           Proc 1
    XCHG [x],EAX     XCHG [y],ECX
    MOV EBX←[y]      MOV EDX←[x]
    Initial state: Proc 0:EAX=1 ∧ Proc 1:ECX=1 (elsewhere 0)
    Forbidden Final State: Proc 0:EBX=0 ∧ Proc 1:EDX=0

This test indicates that locking both writes in Example 8-3 would forbid the non-sequentially-consistent result. x86-TSO forbids the final state because LOCK'd instructions flush the local store buffer. If only one write were LOCK'd (say the write to x), the Example 8-3 final state would be allowed as follows: on Proc 1, buffer the write to y and execute the read of x; then on Proc 0, write to x in shared memory and read from y.

Example 8-10. Stores Are Not Reordered with Locks.

Table 12: Example 8-10.

    Proc 0           Proc 1
    XCHG [x],EAX     MOV EBX←[y]
    MOV [y]←1        MOV ECX←[x]
    Initial state: Proc 0:EAX=1 (elsewhere 0)
    Forbidden Final State: Proc 1:EBX=1 ∧ Proc 1:ECX=0

This is implied by Example 8-1, as we treat the memory writes of LOCK'd instructions as stores.

Test amd5.

Table 13: Test amd5.

    Proc 0          Proc 1
    MOV [x]←1       MOV [y]←1
    MFENCE          MFENCE
    MOV EAX←[y]     MOV EBX←[x]
    Forbidden Final State: Proc 0:EAX=0 ∧ Proc 1:EBX=0

For x86-TSO, this test has the same force as Example 8-8, but using MFENCE instructions to flush the buffers instead of LOCK'd instructions. The tenth AMD test is similar. None of the Intel litmus tests include fence instructions. In x86-TSO, adding an MFENCE between every instruction would clearly

suffice to regain sequential consistency (though obviously in practice one would insert fewer barriers), in contrast to IWP/x86-CC/AMD.

3.4 Linux Spinlock Implementation

In this subsection I show a correct Linux spinlock that relies on x86-TSO as the underlying memory model. I present a spinlock from the Linux kernel as an example of a small but nontrivial concurrent programming idiom, and show how one can reason about this code using the x86-TSO programmer's model, explaining in terms of the model why it works and why the optimisation is sound, thus making clear what the developers' informal reasoning depended on. For accessibility I do this in prose, but the argument could easily be formalised as a proof. The implementation comprises code to acquire and release a spinlock. It is assumed that these are properly bracketed around critical sections and that spinlocks are not mutated by any other code.

Table 14: Linux Spinlock.

    ; On entry the address of the spinlock is in register EAX,
    ; and the spinlock is unlocked iff its value is 1.
    acquire: LOCK DEC [EAX]   ; LOCK'd decrement of [EAX]
             JNS enter        ; branch if [EAX] was >= 1
    spin:    CMP [EAX],0      ; test [EAX]
             JLE spin         ; branch if [EAX] was <= 0
             JMP acquire      ; try again
    enter:                    ; the critical section starts here
             ...
    release: MOV [EAX]←1

A spinlock is represented by a signed integer which is 1 if the lock is free and 0 or less if the lock is held. To acquire a lock, a thread atomically decrements the integer (which will not wrap around, assuming there are fewer than 2^31 hardware threads). If the lock was free, it is now held and the thread can proceed to the critical section. If the lock was held, the thread loops, waiting for it to become free. Because there might be multiple threads waiting for the lock, once it is freed each waiting thread must again attempt to enter through the LOCK'd decrement. To release the lock, a thread simply sets its value to 1.
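For readability, the control flow of Table 14 can be transliterated into Python. This is a hypothetical rendering of my own: a plain mutex stands in for the LOCK'd decrement, and store buffering is not modelled, so it shows the algorithm rather than the TSO subtleties discussed below.

```python
import threading

class Spinlock:
    """Sketch of the Table 14 spinlock: 1 = free, 0 or less = held."""

    def __init__(self):
        self._v = 1                  # the spinlock word, [EAX] in Table 14
        self._m = threading.Lock()   # models the atomic LOCK'd DEC

    def _lock_dec(self):
        with self._m:                # atomic read-modify-write
            self._v -= 1
            return self._v

    def acquire(self):
        while True:
            if self._lock_dec() >= 0:    # JNS enter: [EAX] was >= 1
                return
            while self._v <= 0:          # spin: plain reads, no LOCK
                pass
            # the lock looked free again: JMP acquire and retry

    def release(self):
        self._v = 1                  # plain MOV, not LOCK'd
```

Two threads incrementing a shared counter under this lock never interleave their critical sections. Under CPython the plain read in the spin loop and the plain store in release are unproblematic; on real x86 hardware it is exactly the argument below that justifies the un-LOCK'd release.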
The optimisation in question made the releasing MOV instruction not LOCK'd (removing a LOCK prefix and hence letting the releasing thread proceed without flushing its buffer). For example, consider a spinlock at address x and let y be another shared memory address. Suppose that several threads want to access y, and that they use spinlocks to ensure mutual exclusion. Initially, no one has the lock and [x] = 1. The first thread t to try to acquire the lock atomically decrements x by 1 (using a LOCK prefix); it then jumps into the critical section. Because a store buffer flush is part of LOCK'd instructions, [x] will be 0 in shared memory after the decrement. Now if another thread attempts to acquire the lock, it will not jump into the critical section after performing the atomic decrement, since x was not 1. It will thus enter the spin loop. In this loop, the waiting thread continually reads the value of x until it gets a positive result. Returning to the original thread t, it can read and write y inside of its critical section while the others are spinning. These writes are initially placed in t's store buffer, and some may be propagated to shared memory. However, it does not matter how many (if any) are written to main memory, because (by assumption) no other thread is attempting to read (or write) y. When t is ready to exit the critical section, it releases the lock by writing the value 1 to x; this write is put in t's store buffer. It can now continue

after the critical section (in the text below, we assume it does not try to re-acquire the lock). If the releasing MOV had the LOCK prefix, then all of the buffered writes to y would be sent to main memory, as would the write of 1 to x, and another thread could then acquire the spinlock. However, since it does not, the other threads continue to spin until the write setting x to 1 is removed from t's write buffer and sent to shared memory at some point in the future. At that point, the spinning threads will read 1 and restart the acquisition with atomic decrements, and another thread can enter its critical section. Crucially, because t's write buffer is emptied in FIFO order, any writes to y from within t's critical section must have been propagated to shared memory (in order) before the write to x. Thus, the next thread to enter a critical section will not be able to see y in an inconsistent state.

4 Conclusions, Results, Discussion

I presented x86-TSO, a memory model for x86 processors that does not suffer from the ambiguities, weaknesses, or unsoundnesses of earlier models. Its abstract-machine definition should be intuitive for programmers, whereas its equivalent axiomatic definition supports the memevents exhaustive search and permits an easy comparison with related models; the similarity with SPARCv8 suggests x86-TSO is strong enough to program above.

Here follows a small comparison between TSO (total store ordering) and SC (sequential consistency). x86-TSO is more relaxed (weaker) than SC, but less relaxed than many other memory models. SC is the most intuitive memory model, and TSO comes close because it behaves like SC for common programming idioms; nevertheless, subtle non-SC executions can bite programmers and tool authors. For simple cores, TSO can offer better performance than SC, but the difference can be made small with speculation. SC is widely understood, while TSO is widely adopted. Both SC and TSO are formally defined.
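To make this comparison concrete, the abstract-machine reading of x86-TSO can be prototyped in a few dozen lines. The following enumerator is a toy sketch of my own (the instruction encoding and function name are invented; it is not the memevents tool): each processor gets a FIFO store buffer with buffer forwarding, and a search visits every reachable final state. The SB test then exhibits the non-SC outcome EAX = EBX = 0, while the outcome forbidden by Example 8-1 never appears.

```python
from collections import deque

def tso_final_states(progs):
    """Exhaustively enumerate the final register states of a toy
    x86-TSO machine: one FIFO store buffer per processor; a load takes
    the youngest buffered value for its address, else the shared-memory
    value (locations start at 0). Instructions are ('W', loc, val),
    ('R', loc, reg) and ('F',) for MFENCE."""
    n = len(progs)
    init = ((0,) * n, ((),) * n, (), ())   # pcs, buffers, memory, registers
    seen, finals, work = {init}, set(), deque([init])
    while work:
        pcs, bufs, mem, regs = work.popleft()
        succs = []
        for i in range(n):
            if bufs[i]:                    # flush the oldest buffered write
                loc, val = bufs[i][0]
                m = dict(mem); m[loc] = val
                succs.append((pcs, bufs[:i] + (bufs[i][1:],) + bufs[i + 1:],
                              tuple(sorted(m.items())), regs))
            if pcs[i] < len(progs[i]):
                ins = progs[i][pcs[i]]
                pcs2 = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if ins[0] == 'W':          # store: enqueue in own buffer
                    b = bufs[i] + ((ins[1], ins[2]),)
                    succs.append((pcs2, bufs[:i] + (b,) + bufs[i + 1:],
                                  mem, regs))
                elif ins[0] == 'R':        # load: forward from buffer, else memory
                    hits = [v for l, v in bufs[i] if l == ins[1]]
                    val = hits[-1] if hits else dict(mem).get(ins[1], 0)
                    r = dict(regs); r[ins[2]] = val
                    succs.append((pcs2, bufs, mem, tuple(sorted(r.items()))))
                elif ins[0] == 'F' and not bufs[i]:
                    succs.append((pcs2, bufs, mem, regs))  # MFENCE: buffer drained
        if not succs:                      # every program done, buffers empty
            finals.add(regs)
        for s in succs:
            if s not in seen:
                seen.add(s); work.append(s)
    return finals

# SB (Example 8-3): the non-SC outcome is reachable under TSO ...
sb = tso_final_states([[('W', 'x', 1), ('R', 'y', 'EAX')],
                       [('W', 'y', 1), ('R', 'x', 'EBX')]])
assert {'EAX': 0, 'EBX': 0} in map(dict, sb)

# ... but Example 8-1's forbidden outcome never appears.
ex81 = tso_final_states([[('W', 'x', 1), ('W', 'y', 1)],
                         [('R', 'y', 'EAX'), ('R', 'x', 'EBX')]])
assert not any(d['EAX'] == 1 and d['EBX'] == 0 for d in map(dict, ex81))
```

Dropping the buffers (writing directly to memory) turns the same enumerator into an SC machine, which is exactly the sense in which TSO and SC differ only on tests like SB.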
The bottom line is that SC and TSO are pretty close, especially compared with the more complex and more relaxed memory consistency models. 5 Bibliography

Scott Owens, Susmit Sarkar, Peter Sewell. A better x86 memory model: x86-TSO (extended version). University of Cambridge, March 2009.

Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, Magnus O. Myreen. x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors.

Susmit Sarkar, Peter Sewell, Scott Owens, Tom Ridge, Magnus Myreen, Francesco Zappa Nardelli. The Semantics of x86-cc Multiprocessor Machine Code.

Scott Owens. Reasoning about the Implementation of Concurrency Abstractions on x86-TSO. University of Cambridge.

More information

Java Memory Model. Jian Cao. Department of Electrical and Computer Engineering Rice University. Sep 22, 2016

Java Memory Model. Jian Cao. Department of Electrical and Computer Engineering Rice University. Sep 22, 2016 Java Memory Model Jian Cao Department of Electrical and Computer Engineering Rice University Sep 22, 2016 Content Introduction Java synchronization mechanism Double-checked locking Out-of-Thin-Air violation

More information

Systèmes d Exploitation Avancés

Systèmes d Exploitation Avancés Systèmes d Exploitation Avancés Instructor: Pablo Oliveira ISTY Instructor: Pablo Oliveira (ISTY) Systèmes d Exploitation Avancés 1 / 32 Review : Thread package API tid thread create (void (*fn) (void

More information

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L17: Memory Model Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements HW4 / Lab4 1 Overview Symmetric Multi-Processors (SMPs) MIMD processing cores

More information

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models Lecture 13: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models 1 Coherence Vs. Consistency Recall that coherence guarantees

More information

COMP Parallel Computing. CC-NUMA (2) Memory Consistency

COMP Parallel Computing. CC-NUMA (2) Memory Consistency COMP 633 - Parallel Computing Lecture 11 September 26, 2017 Memory Consistency Reading Patterson & Hennesey, Computer Architecture (2 nd Ed.) secn 8.6 a condensed treatment of consistency models Coherence

More information

C++ Concurrency - Formalised

C++ Concurrency - Formalised C++ Concurrency - Formalised Salomon Sickert Technische Universität München 26 th April 2013 Mutex Algorithms At most one thread is in the critical section at any time. 2 / 35 Dekker s Mutex Algorithm

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The

More information

Portland State University ECE 588/688. Memory Consistency Models

Portland State University ECE 588/688. Memory Consistency Models Portland State University ECE 588/688 Memory Consistency Models Copyright by Alaa Alameldeen 2018 Memory Consistency Models Formal specification of how the memory system will appear to the programmer Places

More information

The New Java Technology Memory Model

The New Java Technology Memory Model The New Java Technology Memory Model java.sun.com/javaone/sf Jeremy Manson and William Pugh http://www.cs.umd.edu/~pugh 1 Audience Assume you are familiar with basics of Java technology-based threads (

More information

Models of concurrency & synchronization algorithms

Models of concurrency & synchronization algorithms Models of concurrency & synchronization algorithms Lecture 3 of TDA383/DIT390 (Concurrent Programming) Carlo A. Furia Chalmers University of Technology University of Gothenburg SP3 2016/2017 Today s menu

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

殷亚凤. Consistency and Replication. Distributed Systems [7]

殷亚凤. Consistency and Replication. Distributed Systems [7] Consistency and Replication Distributed Systems [7] 殷亚凤 Email: yafeng@nju.edu.cn Homepage: http://cs.nju.edu.cn/yafeng/ Room 301, Building of Computer Science and Technology Review Clock synchronization

More information

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue Review of last lecture Design of Parallel and High-Performance Computing Fall 2016 Lecture: Linearizability Motivational video: https://www.youtube.com/watch?v=qx2driqxnbs Instructor: Torsten Hoefler &

More information

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13 Hardware models: inventing a usable abstraction for Power/ARM 1 Hardware models: inventing a usable abstraction for Power/ARM Disclaimer: 1. ARM MM is analogous to Power MM all this is your next phone!

More information

Summary: Issues / Open Questions:

Summary: Issues / Open Questions: Summary: The paper introduces Transitional Locking II (TL2), a Software Transactional Memory (STM) algorithm, which tries to overcomes most of the safety and performance issues of former STM implementations.

More information

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve Designing Memory Consistency Models for Shared-Memory Multiprocessors Sarita V. Adve Computer Sciences Department University of Wisconsin-Madison The Big Picture Assumptions Parallel processing important

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Advanced Operating Systems (CS 202)

Advanced Operating Systems (CS 202) Advanced Operating Systems (CS 202) Memory Consistency, Cache Coherence and Synchronization (Part II) Jan, 30, 2017 (some cache coherence slides adapted from Ian Watson; some memory consistency slides

More information

Memory Consistency Models. CSE 451 James Bornholt

Memory Consistency Models. CSE 451 James Bornholt Memory Consistency Models CSE 451 James Bornholt Memory consistency models The short version: Multiprocessors reorder memory operations in unintuitive, scary ways This behavior is necessary for performance

More information

Concurrent & Distributed Systems Supervision Exercises

Concurrent & Distributed Systems Supervision Exercises Concurrent & Distributed Systems Supervision Exercises Stephen Kell Stephen.Kell@cl.cam.ac.uk November 9, 2009 These exercises are intended to cover all the main points of understanding in the lecture

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Concurrency. Glossary

Concurrency. Glossary Glossary atomic Executing as a single unit or block of computation. An atomic section of code is said to have transactional semantics. No intermediate state for the code unit is visible outside of the

More information

An introduction to weak memory consistency and the out-of-thin-air problem

An introduction to weak memory consistency and the out-of-thin-air problem An introduction to weak memory consistency and the out-of-thin-air problem Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) CONCUR, 7 September 2017 Sequential consistency 2 Sequential

More information

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < > Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

What is uop Cracking?

What is uop Cracking? Nehalem - Part 1 What is uop Cracking? uops are components of larger macro ops. uop cracking is taking CISC like instructions to RISC like instructions it would be good to crack CISC ops in parallel

More information

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Objects Companion slides for The by Maurice Herlihy & Nir Shavit Concurrent Computation memory object object 2 Objectivism What is a concurrent object? How do we describe one? How do we implement

More information

The Java Memory Model

The Java Memory Model Jeremy Manson 1, William Pugh 1, and Sarita Adve 2 1 University of Maryland 2 University of Illinois at Urbana-Champaign Presented by John Fisher-Ogden November 22, 2005 Outline Introduction Sequential

More information

Reasoning about the C/C++ weak memory model

Reasoning about the C/C++ weak memory model Reasoning about the C/C++ weak memory model Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) 13 October 2014 Talk outline I. Introduction Weak memory models The C11 concurrency model

More information

Declarative semantics for concurrency. 28 August 2017

Declarative semantics for concurrency. 28 August 2017 Declarative semantics for concurrency Ori Lahav Viktor Vafeiadis 28 August 2017 An alternative way of defining the semantics 2 Declarative/axiomatic concurrency semantics Define the notion of a program

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information