Brookes is Relaxed, Almost!


Radha Jagadeesan, Gustavo Petri, and James Riely
School of Computing, DePaul University

Abstract. We revisit the Brookes [1996] semantics for a shared-variable parallel programming language in the context of the Total Store Ordering (TSO) relaxed memory model. We describe a denotational semantics that is fully abstract for Brookes's language and also sound for the new commands that are specific to TSO. Our description supports the folklore sentiment about the simplicity of the TSO memory model.

1 Introduction

Sequential Consistency (SC), defined by Lamport [1979], enforces a total order on memory operations (the reads and writes to memory) that respects the program order of each individual thread in the program. Operationally, SC is realized by the traditional interleaving semantics, where shared memory is represented as a map from locations to values. For such an operational semantics, Brookes [1996] describes a fully abstract denotational view that identifies a process with its transition traces. This technique supports several approaches to program logics for shared-memory concurrent programs based on separation logic (see Reynolds [2002] for an early survey). For example, O'Hearn [2007] and Brookes [2007] develop the semantics of Concurrent Separation Logic (CSL), an adaptation of separation logic to reason about concurrent threads operating on shared memory. CSL has been used to prove correctness of several concurrent data structures; see, for example, [Parkinson et al., 2007] and [Vafeiadis and Parkinson, 2007]. Similarly, Brookes [1996] gives the foundation for refinement approaches to prove the correctness of concurrent data structures, such as [Turon and Wand, 2011].

There are at least two motivations to consider memory models that are weaker, or more relaxed, than SC. First, modern multicore architectures permit executions that are not sequentially consistent. Second, SC disables some common compiler optimizations for sequential programs, such as the reordering of independent statements. This has led to a large body of work on relaxed memory models; Adve and Gharachorloo [1996] and Adve and Boehm [2010] provide a tutorial introduction with a detailed bibliography on architectures and their impact on language design. The operational semantics of programming languages in the presence of such relaxed memory models has now been explored. For example, Boudol and Petri [2009] explore the operational semantics of a process language with write buffers; Sevcík et al. [2011] explore the operational semantics of CLight executing under the TSO memory model; and Jagadeesan et al. [2010] describe the operational semantics of an object language under the Java Memory Model (JMM) of Manson et al. [2005].

Research supported by NSF.

However, what has not been investigated in the literature is the denotational semantics of a language with a relaxed memory execution model. We solve this open problem in this paper.

1.1 Overview of the Paper

Our investigations are carried out in the context of the TSO memory model described in SPARC [1994], recently proposed as the model of x86 architectures by Sewell et al. [2010]. In TSO, each sequential thread carries its own write buffer that serves as the initial target of the writes executed by the thread. Thus, TSO permits executions that are not possible with SC. To illustrate this relaxed behavior, let us consider the canonical example depicted in (1) below. We have two sequential threads running in parallel. The left thread, with code (x := 1; y), sets x to 1 and then reads y, returning the value read. The thread on the right, with code (y := 1; x), sets y to 1 and then reads and returns x. We consider that the initial state has x = y = 0. In the SC model, the execution where both threads read 0 is impermissible. It is, however, achieved by the following TSO execution with write buffers. Below we depict the initial configuration, where both threads have empty buffers (indicated by ∅) and the memory state is denoted by {x := 0, y := 0}.

  ( {x := 0, y := 0},  ⟨∅, x := 1; y⟩ ∥ ⟨∅, y := 1; x⟩ )        (1)

The writes performed by a thread go into its write buffer (rather than the shared memory). Thus, the above process configuration can evolve to

  ( {x := 0, y := 0},  ⟨[x := 1], y⟩ ∥ ⟨[y := 1], x⟩ )

where ⟨[x := 1], y⟩ stands for the thread with local buffer containing the assignment of 1 to x, which is not visible to the other thread, and similarly for ⟨[y := 1], x⟩. Now, both reads can return the value from the shared store, which is 0.

Of course, the usual SC executions are also available in a TSO model; we demonstrate this with an execution where both reads yield 1. From the intermediate configuration above, both buffer updates can nondeterministically move into memory before the reads execute. Then we get

  ( {x := 1, y := 1},  ⟨∅, y⟩ ∥ ⟨∅, x⟩ )

leading to an execution where both reads yield 1.

We provide a precise formalization of the denotational semantics for the language of Brookes [1996] in the context of the TSO memory model. Our model includes the characteristic mfence instruction of TSO, which terminates only when the local buffer of the thread executing the instruction is empty. Our formalization satisfies the Data Race Free (DRF) property of Adve and Hill [1990]. Informally, a program is DRF if no SC execution of the program leads to a state in which a write happens concurrently with another operation on the same location. A DRF model requires that the programmer's view of computation coincides with SC for programs that satisfy the DRF property.
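To make the buffered execution of example (1) concrete, the following Python sketch models per-thread write buffers. It is our own illustration rather than the paper's formal semantics, and the action encoding and the helper name run are assumptions made only for this example. A read consults the reading thread's own buffer before the shared memory, which is the read-own-write behaviour revisited later in Figure 4a.

# A minimal TSO write-buffer sketch (illustration only, not the paper's semantics).
# Actions: ('buffer', tid, var, val)  -- a thread writes into its own buffer
#          ('commit', tid)            -- the oldest buffered write reaches memory
#          ('read', tid, var)         -- a thread reads: own buffer first, then memory
def run(schedule):
    memory = {'x': 0, 'y': 0}
    buffers = {0: [], 1: []}
    reads = {}
    for action in schedule:
        if action[0] == 'buffer':
            _, tid, var, val = action
            buffers[tid].append((var, val))
        elif action[0] == 'commit':
            _, tid = action
            var, val = buffers[tid].pop(0)
            memory[var] = val
        else:  # 'read': the newest buffered write to var wins, else shared memory
            _, tid, var = action
            own = [val for (v, val) in buffers[tid] if v == var]
            reads[tid] = own[-1] if own else memory[var]
    return reads[0], reads[1]

# Writes stay buffered, both reads happen, commits come last: both reads return 0.
relaxed = [('buffer', 0, 'x', 1), ('buffer', 1, 'y', 1),
           ('read', 0, 'y'), ('read', 1, 'x'),
           ('commit', 0), ('commit', 1)]
print(run(relaxed))   # (0, 0) -- impossible under SC

# Committing both buffers before the reads recovers the SC outcome.
sc_like = [('buffer', 0, 'x', 1), ('commit', 0),
           ('buffer', 1, 'y', 1), ('commit', 1),
           ('read', 0, 'y'), ('read', 1, 'x')]
print(run(sc_like))   # (1, 1)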

Let us review [Brookes, 1996] before adapting it to a TSO setting. We use the metavariable s to stand for a shared memory, that is, a partial map from variables to values, and C for commands (possibly partially executed). Brookes [1996] views the denotation of a command, T⟦C⟧, as a set of completed transition traces, ranged over by the metavariable α and of the form α = (s_0, s_0')(s_1, s_1')…(s_n, s_n'). These traces describe the interaction between a system and its environment, where the following conditions hold.

- The execution starts with the command under consideration, so C_0 = C.
- Transitions from s_k to s_k' model a system step, i.e. for all k in [0, n], ⟨s_k, C_k⟩ →* ⟨s_k', C_{k+1}⟩.
- Transitions from s_k' to s_{k+1} model an environment step.
- The transition trace represents a terminated execution, so the final command is skip.

As in any sensible semantics, skip must be a unit for sequential composition:

  skip; C; skip ≡ C        (2)

This equation motivates the stuttering and mumbling closure properties. Closure by stuttering accommodates the case when the system does not move at all but just observes the current state, i.e. s_i' = s_i. Closure by mumbling permits the combination of system steps that have no intervening environment step.

We can now describe the model for TSO. The type of command denotations, T⟦C⟧, changes to a function that takes an input buffer b and yields a set of pairs of the form ⟨α, b'⟩, where α is a transition trace as before and b' is the resulting buffer. The pair ⟨α, b'⟩ is to be understood as follows, where we use P as a metavariable for threads, and letting α = (s_0, s_0')(s_1, s_1')…(s_n, s_n').

- The execution of the command starts with the input buffer b, so P_0 = ⟨b, C⟩.
- The state pairs still represent system steps, i.e. for all k in [0, n], ⟨s_k, P_k⟩ →* ⟨s_k', P_{k+1}⟩.
- The change from s_k' to s_{k+1} still represents an environment step.
- The transition trace represents a terminated execution leaving b' as the resulting buffer, so the final process is ⟨b', skip⟩.

Thus, the pending updates in the resulting buffer b' are yet to reach the shared memory even though there is no command left to be executed. Our TSO semantics has analogues of the stuttering and mumbling properties, for the same reasons as discussed above. In addition, it has two buffer closure properties.

Buffer update closure. Consider the program skip. Executions in T⟦skip⟧(b) can result in a smaller buffer b', because buffer updates can propagate into the shared memory. Furthermore, the change from b to b' can be done piecemeal, one buffer update at a time. Thus, skip should permit any execution of upd(b), defined as the stuttering and mumbling closure of the set

  { ⟨(s_0, s_0')…(s_n, s_n'), b'⟩ | b = [x_0 := v_0, …, x_n := v_n] ++ b'  &  ∀i ∈ [0, n]. s_i' = s_i[x_i := v_i] }

Each step in the above trace corresponds to the addition of one buffer update into memory. Mumbling closure introduces the possibility of multiple buffer updates in one atomic step. To validate equation (2), buffer-update closure permits data to potentially move from the buffers to shared state before and after any command executes:

  ⟨α_1, b_1⟩ ∈ upd(b), ⟨α_2, b_2⟩ ∈ T⟦C⟧(b_1), ⟨α_3, b'⟩ ∈ upd(b_2)  implies  ⟨α_1 α_2 α_3, b'⟩ ∈ T⟦C⟧(b)
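As a concrete reading of upd(b), the sketch below (our illustration; the function name and data representation are assumptions) enumerates generating traces in which a prefix of the buffer is committed, one update per system step, starting from a fixed memory. The full set upd(b) additionally allows the environment to change the state between steps and is closed under stuttering and mumbling.

def upd_generators(b, s0):
    """Commit a prefix of buffer b from memory s0, one update per system step.
    Returns (trace, remaining_buffer) pairs, where a trace is a list of
    (pre_state, post_state) pairs and the remaining buffer is the suffix b'."""
    results = []
    for split in range(len(b) + 1):            # commit b[:split], keep b[split:]
        trace, state = [], dict(s0)
        for (x, v) in b[:split]:
            pre = dict(state)
            state = {**state, x: v}            # s_i' = s_i[x_i := v_i]
            trace.append((pre, state))
        results.append((trace, b[split:]))
    return results

for trace, rest in upd_generators([('x', 1), ('y', 2)], {'x': 0, 'y': 0}):
    print(trace, '| remaining buffer:', rest)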

Buffer reduction closure. The program (x := 1; x := 1) simulates the program (x := 1) (taking the two steps uninterruptedly), whereas the converse is not true. In buffer terms, this motivates the idea that two identical contiguous writes can be replaced by one copy of the write without leading to any new behaviors. We formalize this notion of buffer simulation as a binary relation b_1 ↪ b' and demand:

  ⟨α, b_1⟩ ∈ T⟦C⟧(b), b_1 ↪ b'  implies  ⟨α, b'⟩ ∈ T⟦C⟧(b)

Results. We present the following results.

- We describe operational and denotational semantics for the language that accommodate the extra executions permitted by TSO.
- We prove that our denotational semantics is fully abstract when we observe termination of programs.
- We use the model to identify some equational principles that hold for parallel programs under the TSO memory model.

Our results provide some formal validation for the folklore sentiment about the simplicity of the TSO memory model.

Organization of the paper. We eschew a separate related-work section since we cite the related work in context. In Section 2 we discuss the transition system for the programming language. We develop the model theory in Section 3, and prove the correspondence between operational and denotational semantics in Section 4. In Section 5, we illustrate the differences from Brookes [1996] by describing some laws that hold for programs. More detailed proof sketches are found in a fuller version of the paper.

2 Operational Semantics

We assume disjoint sets of variables, x, y, and values, v. The only values we consider are natural numbers. In conditionals, we interpret non-zero (resp. zero) integers as true (resp. false). As usual, we denote by FV(C) the set of free variables of command C.

  E ::= x | v | E_1 + E_2 | …                                                         (Expression)
  C, D ::= skip | x := E | C; D | C ∥ D | if E then C else D
         | while E do C | local x in C | await E then C | mfence                      (Command)
  P, Q ::= ⟨b, C⟩ | P; D | P ∥ Q | new x := v in P                                     (Process)
  ℙ ::= [ ] | ℙ; D | ℙ ∥ Q | P ∥ ℙ | new x := v in ℙ                                   (Process context)

A buffer, b ∈ Buff, is a list of variable/value pairs, with Buff the domain of all buffers. If b = [x_1 := v_1, …, x_n := v_n], then dom(b) = {x_1, …, x_n}. We write ++ for concatenation, ∅ for the empty buffer, and b − x for the buffer that results from removing x from b. We consider buffer rewrites (↪ ⊆ Buff × Buff) that can merge contiguous identical writes, e.g. [x_1 := v_1, …, x_n := v_n, x_n := v_n] ↪ [x_1 := v_1, …, x_n := v_n].

Definition 1. The relation ↪ ⊆ Buff × Buff is defined inductively as follows.

  ∀x, v. [x := v, x := v] ↪ [x := v]
  b ↪ b
  b ↪ b_1 and b_1 ↪ b'  implies  b ↪ b'
  b_1 ↪ b_1' and b_2 ↪ b_2'  implies  b_1 ++ b_2 ↪ b_1' ++ b_2'

A memory, s ∈ Σ, is a partial map from variables to values, where Σ is the domain of all memories. We adopt several notational conventions for partial maps: if s = {x_1 := v_1, …, x_n := v_n}, then dom(s) = {x_1, …, x_n}. We write s[x := v] for the memory s with the value of reference x replaced by v, and s[b] for the memory which results from applying the updates contained in b from left to right. As usual, we suppose a semantic function which maps expressions to functions from memories to values (in notation ⟦E⟧(s) = v). In the forthcoming transition rules, the memory passed to this function is already updated with (any) relevant buffer. The function is defined by induction on E:

  s(x) = v  implies  ⟦x⟧(s) = v
  ⟦E_1⟧(s) = v_1 and ⟦E_2⟧(s) = v_2  implies  ⟦E_1 + E_2⟧(s) = v_1 + v_2
  …

In this paper, we consider that expressions evaluate atomically, following the first language considered in Brookes [1996]. There are two standard approaches to formalizing finer-grain semantics: either (1) a compilation of complex expressions to a sequence of simpler commands that only perform a single read or add local variables, or (2) a direct formalization in terms of a transition system, as done in the later sections of Brookes [1996]. Our presentation can accommodate either of these changes. We elide details in the interest of space.

Each sequential thread has its own buffer. Processes are parallel compositions of commands. A configuration is a pair of a memory and a process. In Figure 1 we define the evaluation relation s, P → s', P', where →* is the reflexive and transitive closure of the relation, and C{[y/x]} denotes the command derived from C by replacing every occurrence of x with y.

The buffers grow larger in ASSIGN, which adds a new update to the buffer, and become smaller in COMMIT, which moves thread-local buffer updates into the shared memory. CTXT-BUF allows contiguous and identical updates in the buffer to be collapsed. The command skip captures our notion of termination. For example, in SKIP-SEQ, the succeeding command moves into the evaluation context when the preceding process evaluates to skip. When a process terminates, its associated buffer is not necessarily empty; e.g. when x := E terminates, the update to x might still be in the buffer and not yet reflected in the shared memory. The rule FENCE implements mfence as an assertion that can terminate only when the thread's buffer is empty; e.g. x := E; mfence terminates only when the update to x has been moved to the shared memory, thus making it visible to every other parallel thread. The rule PAR-CMD enables the initiation of a parallel composition only when the buffer is empty. This restriction is in conformance with Appendix J of SPARC [1994], to ensure that the newly created threads can be scheduled on different processors. For similar reasons, SKIP-PAR ensures that a parallel composition terminates only when the buffers of both parallel processes are empty.
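The buffer operations above have a direct executable reading. The following sketch assumes, purely as an illustration, that a buffer is a Python list of (variable, value) pairs and a memory is a dictionary; it shows dom, the update s[b], removal of a variable, and a normal form of the rewrite relation that merges contiguous identical writes.

def dom(b):
    """dom(b): the variables with a pending update in buffer b."""
    return {x for (x, _) in b}

def apply_buffer(s, b):
    """s[b]: apply the buffered updates to memory s from left (oldest) to right."""
    s = dict(s)
    for (x, v) in b:
        s[x] = v
    return s

def remove_var(b, x):
    """b - x: drop every pending update to x (used when a local variable dies)."""
    return [(y, v) for (y, v) in b if y != x]

def reduce_buffer(b):
    """A normal form of the rewrite relation: merge contiguous identical writes."""
    out = []
    for entry in b:
        if not out or out[-1] != entry:
            out.append(entry)
    return out

b = [('x', 1), ('x', 1), ('y', 2)]
print(apply_buffer({'x': 0, 'y': 0}, b))   # {'x': 1, 'y': 2}
print(reduce_buffer(b))                    # [('x', 1), ('y', 2)]
print(remove_var(b, 'x'))                  # [('y', 2)]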

  s, ⟨b, while E do C⟩ → s, ⟨b, if E then (C; while E do C) else skip⟩   (WHILE)
  ⟦E⟧(s[b]) ≠ 0  implies  s, ⟨b, if E then C else D⟩ → s, ⟨b, C⟩   (THEN)
  ⟦E⟧(s[b]) = 0  implies  s, ⟨b, if E then C else D⟩ → s, ⟨b, D⟩   (ELSE)
  y ∉ dom(b) ∪ FV(C)  implies  s, ⟨b, local x in C⟩ → s, new y := 0 in ⟨b, C{[y/x]}⟩   (LOCAL)
  ⟦E⟧(s) ≠ 0 and s, ⟨∅, C⟩ →* s', ⟨∅, skip⟩  implies  s, ⟨∅, await E then C⟩ → s', ⟨∅, skip⟩   (AWAIT)
  ⟦E⟧(s[b]) = v  implies  s, ⟨b, x := E⟩ → s, ⟨b ++ [x := v], skip⟩   (ASSIGN)
  s, ⟨[x := v] ++ b, C⟩ → s[x := v], ⟨b, C⟩   (COMMIT)
  s, ⟨∅, mfence⟩ → s, ⟨∅, skip⟩   (FENCE)
  s, ⟨∅, (C ∥ D)⟩ → s, ⟨∅, C⟩ ∥ ⟨∅, D⟩   (PAR-CMD)
  s, ⟨∅, skip⟩ ∥ ⟨∅, skip⟩ → s, ⟨∅, skip⟩   (SKIP-PAR)
  s, P → s', P'  implies  s, P ∥ Q → s', P' ∥ Q   (CTXT-LEFT)
  s, Q → s', Q'  implies  s, P ∥ Q → s', P ∥ Q'   (CTXT-RIGHT)
  s, new y := v in ⟨b, skip⟩ → s, ⟨b − y, skip⟩   (SKIP-NEW)
  s, ⟨∅, skip⟩; D → s, ⟨∅, D⟩   (SKIP-SEQ)
  b ↪ b'  implies  s, ⟨b, C⟩ → s, ⟨b', C⟩   (CTXT-BUF)
  s, ⟨b, C⟩ → s', ⟨b', C'⟩  implies  s, ⟨b, C; D⟩ → s', ⟨b', C'; D⟩   (CTXT-CMD)
  s, P → s', P'  implies  s, P; D → s', P'; D   (CTXT-SEQ)
  s[y := v], P → s', P' and s'(y) = v'  implies  s, new y := v in P → s'[y := s(y)], new y := v' in P'   (CTXT-NEW)

  Fig. 1: Evaluation: s, P → s', P'

Our sole use of the local construct is to provide a model of thread-local registers in the special case when C is a sequential thread. However, our more general formalization permits the description of state that is shared among parallel processes. The process context new y := v in P carries the shared state of this variable. The hypothesis on the initial buffer in LOCAL ensures that any mfence in C does not affect the global x. The renaming ensures that the updates of CTXT-NEW do not affect the global x. SKIP-NEW discards any remaining updates to the local y. The rules for if and while are standard.

The AWAIT construct from Brookes [1996] is a conditional critical region. It provides atomic protection to the entire command C, which in our use will generally be a series of assignments. The compare-and-set instruction of TSO architectures is programmable as follows:

  cas(x, v, w) = await 1 then if x = v then x := w else x := v

and similarly for the other atomic instructions of TSO. Following the semantics of cas in x86-TSO given by Owens et al. [2009], AWAIT ensures that the buffers are empty before and after the command executes and prevents buffer updates from other threads (cf. the LOCK'd modifier of Owens et al. [2009]). While TSO does not directly support such multi-instruction atomic conditional critical regions, our semantics continues to be sound for a traditional TSO programming model providing only the simpler cas and the single-word atomics alluded to above. We use this construct to permit a direct comparison with Brookes [1996] and use it (as in that work) to construct discriminating contexts in the proof of full abstraction.
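For intuition about the rules ASSIGN, COMMIT, FENCE, SKIP-SEQ and CTXT-CMD of Figure 1, the following sketch enumerates the successors of a single-thread configuration (s, b, C). It is an approximation made for illustration, not the paper's transition system: commands are encoded as tuples and assignments write literal values, eliding expression evaluation over s[b].

def cmd_steps(s, b, cmd):
    """Command steps only (ASSIGN, FENCE, SKIP-SEQ, CTXT-CMD); COMMIT is separate.
    Commands: ('assign', x, v), ('mfence',), ('seq', C, D), ('skip',)."""
    kind = cmd[0]
    if kind == 'assign':                        # ASSIGN: the write goes to the buffer
        _, x, v = cmd
        return [(s, b + [(x, v)], ('skip',))]
    if kind == 'mfence':                        # FENCE: enabled only on an empty buffer
        return [(s, b, ('skip',))] if not b else []
    if kind == 'seq':
        _, c, d = cmd
        if c == ('skip',):                      # SKIP-SEQ
            return [(s, b, d)]
        return [(s2, b2, ('seq', c2, d))        # CTXT-CMD
                for (s2, b2, c2) in cmd_steps(s, b, c)]
    return []                                   # skip: no command step

def steps(s, b, cmd):
    """All successors: any command step, plus COMMIT of the oldest buffered write."""
    succs = cmd_steps(s, b, cmd)
    if b:
        (x, v), rest = b[0], b[1:]
        succs.append(({**s, x: v}, rest, cmd))  # COMMIT
    return succs

prog = ('seq', ('assign', 'x', 1), ('mfence',))
print(steps({'x': 0}, [], prog))
# The only step buffers the write; mfence then blocks until COMMIT has fired.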

  (flag_0 := 1; if flag_1 = 0 then CS_0)  ∥  (flag_1 := 1; if flag_0 = 0 then CS_1)
  (a) Dekker Mutual Exclusion

  (data := 1; flag := 1)  ∥  (local r in if flag ≠ 0 then r := data)
  (b) Safe Publication

  Fig. 2: Examples of TSO Programs

Let us review some examples of TSO programs in Figure 2. Dekker's mutual exclusion algorithm (Figure 2a) fails under TSO. In initial memories that contain 0 for flag_0 and flag_1, the initial write of each thread can be put in its internal buffer, remaining inaccessible to the other thread, while the reads can proceed before the updates are performed. Thus, both threads can get value 0 for their respective reads and execute their critical sections concurrently. On the other hand, the standard safe publication idiom of Figure 2b is safe under TSO, since the updates of flag and data will proceed in order. Thus, if flag is seen to have value 1 in the thread on the right, the update of 1 to data has also propagated to the memory.

We end this section by remarking that our programming language satisfies the standard DRF guarantee, following traditional proofs; see, e.g., Adve and Gharachorloo [1996], Boudol and Petri [2009], Owens et al. [2009].

3 Denotational Semantics

We use α, β, etc. for elements of (Σ × Σ)*, the sequences of state pairs, and ε for the empty trace. We will consider P((Σ × Σ)*), the powerset of sequences of state pairs, with the subset ordering. Similar assumptions are made for P((Σ × Σ)* × Buff), ranged over by the metavariable U. Commands yield functions in Buff → P((Σ × Σ)* × Buff).

Definition 2. For any b ∈ Buff, define T⟦C⟧(b) ∈ P((Σ × Σ)* × Buff) as follows.

  T⟦C⟧(b) = { ⟨(s_0, s_0')…(s_n, s_n'), b'⟩ | ∀k ∈ [0, n]. s_k, P_k →* s_k', P_{k+1}  &  P_0 = ⟨b, C⟩  &  P_{n+1} = ⟨b', skip⟩ }

Thus, we only consider transition traces where the residual left over from the command is skip, albeit with potentially unfinished buffer updates. As in [Brookes, 1996], the transition traces are closed under stuttering and mumbling, to capture the reflexivity and transitivity of the operational transition relation.

  ⟨α β, b⟩ ∈ U  implies  ⟨α (s, s) β, b⟩ ∈ U   (STUTTERING)
  ⟨α (s, s') (s', s'') β, b⟩ ∈ U  implies  ⟨α (s, s'') β, b⟩ ∈ U   (MUMBLING)

Let U ∈ P((Σ × Σ)* × Buff). We define U† to be the smallest set containing U that is stuttering and mumbling closed.

Definition 3. Define upd(b) to be the stuttering and mumbling closure of

  { ⟨(s_0, s_0')…(s_n, s_n'), b'⟩ | b = [x_0 := v_0, …, x_n := v_n] ++ b'  &  ∀k ∈ [0, n]. s_k' = s_k[x_k := v_k] }

We can then deduce the inclusion: ∀b ∈ Buff. upd(b) ⊆ T⟦skip⟧(b).

We now let f : Buff → P((Σ × Σ)* × Buff), and consider the following closure properties.

  ⟨α_1, b_1⟩ ∈ upd(b), ⟨α_2, b_2⟩ ∈ f(b_1), ⟨α_3, b'⟩ ∈ upd(b_2)  implies  ⟨α_1 α_2 α_3, b'⟩ ∈ f(b)   (BUFF-UPD)
  ⟨α, b_1⟩ ∈ f(b), b_1 ↪ b'  implies  ⟨α, b'⟩ ∈ f(b)   (BUFF-RED)

Definition 4. Let f : Buff → P((Σ × Σ)* × Buff). Then f† is the smallest function above f (in the pointwise order) such that: 1. for all b, f†(b) is stuttering and mumbling closed; 2. f† is buffer-update and buffer-reduction closed. If f = f†, we say f is closed.

Any command yields a closed function.

Lemma 5. For every command C, (T⟦C⟧)† = T⟦C⟧.

The following auxiliary definitions enable us to describe the equations satisfied by the transition trace semantics. Let h be a partial function from buffers to sets of transition traces such that ∀b ∈ Buff. ∃b_1 ∈ dom(h). ∃b'' ∈ Buff. b = b'' ++ b_1; then there is a unique closed function that contains h. Formally, we overload the closure notation and write:

  h† = λb. { ⟨α β, b'⟩ | ⟨α, b_1⟩ ∈ upd(b), ⟨β, b'⟩ ∈ h(b_1) }

We define the operator ⊗ : (Σ × Σ)* × (Σ × Σ)* → P⁺((Σ × Σ)*) that yields the set of all interleavings of its arguments. We write it infix and define it inductively.

  α ⊗ ε = {α}
  β ∈ α_1 ⊗ α_2  implies  β ∈ α_2 ⊗ α_1
  β ∈ α_1 ⊗ α_2  implies  (s_0, s_0') β ∈ ((s_0, s_0') α_1) ⊗ α_2

We say that the system does not alter x in (s_0, s_0')…(s_n, s_n') if ∀k ∈ [1, n]. s_k(x) = s_k'(x), and we use (Σ × Σ)*_{x+} for the set of such transition sequences. We say that the environment does not alter x in (s_0, s_0')…(s_n, s_n') if ∀k ∈ [1, n−1]. s_k'(x) = s_{k+1}(x), and we use (Σ × Σ)*_{x} for the set of such transition sequences. We write α ≈_x β if traces α and β are identical except for the values of reference x. We write Buff_x for the set of buffers that do not have x in their domain. We let ⟦E⟧_{=0} = λb. { ⟨(s, s), b⟩ | ⟦E⟧(s[b]) = 0 } and similarly for ⟦E⟧_{≠0}.

The transition trace semantics of Definition 2 satisfies the equations of Figure 3.

Lemma 6. For every command C, ⟦C⟧ = T⟦C⟧.

The proof is a straightforward structural induction on the command, and we elide it in the interest of space. In this light, we are able to freely interchange ⟦C⟧ and T⟦C⟧ in the rest of the paper.
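The interleaving operator defined above has a direct executable counterpart. The sketch below (our illustration; traces are plain Python lists) returns every interleaving of two sequences.

def interleave(a, b):
    """All interleavings of the sequences a and b, preserving the order within each."""
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    return ([[a[0]] + rest for rest in interleave(a[1:], b)] +
            [[b[0]] + rest for rest in interleave(a, b[1:])])

print(interleave(['a1', 'a2'], ['b1']))
# [['a1', 'a2', 'b1'], ['a1', 'b1', 'a2'], ['b1', 'a1', 'a2']]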

  ⟦skip⟧ = (λb. { ⟨ε, b⟩ })†
  ⟦C; D⟧ = (λb. { ⟨α β, b'⟩ | ∃b_1 ∈ Buff. ⟨α, b_1⟩ ∈ ⟦C⟧(b), ⟨β, b'⟩ ∈ ⟦D⟧(b_1) })†
  ⟦mfence⟧ = (λb. { ⟨α, ∅⟩ | ⟨α, ∅⟩ ∈ ⟦skip⟧(b) })†
  ⟦x := E⟧ = (λb. { ⟨(s, s), b ++ [x := v]⟩ | ⟦E⟧(s[b]) = v })†
  ⟦if E then C else D⟧ = (⟦E⟧_{=0} ; ⟦D⟧) ∪ (⟦E⟧_{≠0} ; ⟦C⟧)
  ⟦while E do C⟧ = (⟦E⟧_{≠0} ; ⟦C⟧)* ; ⟦E⟧_{=0}
  ⟦await E then C⟧ = (λb ∈ {∅}. { ⟨(s, s'), ∅⟩ | ⟦E⟧(s) ≠ 0, ⟨(s, s'), ∅⟩ ∈ ⟦C⟧(∅) })†
  ⟦C_1 ∥ C_2⟧ = (λb ∈ {∅}. { ⟨β, ∅⟩ | β ∈ β_1 ⊗ β_2, ∀i ∈ [1, 2]. ⟨β_i, ∅⟩ ∈ ⟦C_i⟧(∅) })†
  ⟦local x in C⟧ = (λb ∈ Buff_x. { ⟨β, b' − x⟩ | β ∈ (Σ × Σ)*_{x+}, ∃⟨β_1, b'⟩ ∈ ⟦C⟧(b). β_1 ∈ (Σ × Σ)*_{x} & β ≈_x β_1 })†

  Fig. 3: Denotational semantics of TSO + await

4 Full Abstraction

In this section we follow Brookes [1996] as closely as possible in order to highlight the differences caused by TSO. The input-output relation of a program is defined using only the shared memory, i.e. the program is started with an empty buffer and the output state is observed when the buffer is empty.

Definition 7 (IO). For every command C, IO⟦C⟧ = { (s, s') | ⟨(s, s'), ∅⟩ ∈ T⟦C⟧(∅) }.

Definition 8. The trace α = (s_0, s_0')…(s_n, s_n') is Interference Free (IF) if and only if for all i ∈ [0, n−1] we have s_i' = s_{i+1}.

Notice that every (s, s') ∈ IO⟦C⟧ arises from the mumbling closure of IF traces. We add the following notations for technical convenience:

  IO⟦C⟧ / s = { (s, s') | (s, s') ∈ IO⟦C⟧ }
  ⟦C⟧(b) / s = { ⟨α, b'⟩ | ⟨α, b'⟩ ∈ ⟦C⟧(b) & α = (s, s') α' }

Definition 9. The operational ordering compares the IO relation of commands in all possible command contexts ℂ[ ]:

  C ⊑_IO D  ⟺  ∀ℂ[ ], s. FV(ℂ[C]) ∪ FV(ℂ[D]) ⊆ dom(s) implies IO⟦ℂ[C]⟧ / s ⊆ IO⟦ℂ[D]⟧ / s

Definition 10. There is a natural ordering induced by the denotational semantics:

  C ⊑ D  ⟺  ∀b, s. FV(C) ∪ FV(D) ∪ dom(b) ⊆ dom(s) implies ⟦C⟧(b) / s ⊆ ⟦D⟧(b) / s

In the rest of this section, we prove that ⊑ and ⊑_IO coincide. From Figure 3, it is evident that all the program combinators are monotone with respect to set inclusion. Thus, we deduce the following lemma.

Lemma 11 (Compositional Monotonicity). For all commands C and D,

  C ⊑ D  implies  ∀ℂ[ ]. ℂ[C] ⊑ ℂ[D]
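A small executable reading of Definitions 7 and 8 (names and representations are ours): an interference-free trace, one in which each post-state equals the next pre-state, mumbles down to a single (initial, final) pair, and such pairs are how elements of the IO relation arise from traces.

def is_interference_free(trace):
    """Definition 8: every post-state equals the following pre-state."""
    return all(post == trace[i + 1][0] for i, (_, post) in enumerate(trace[:-1]))

def io_pair(trace):
    """Mumble an interference-free trace to a single (initial, final) pair."""
    assert is_interference_free(trace)
    return (trace[0][0], trace[-1][1])

t = [({'x': 0}, {'x': 1}), ({'x': 1}, {'x': 1, 'y': 2})]
print(is_interference_free(t), io_pair(t))
# True ({'x': 0}, {'x': 1, 'y': 2})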

Since ⟦·⟧ and T⟦·⟧ coincide by Lemma 6, we obtain:

Corollary 12 (Adequacy). For all commands C and D, we have C ⊑ D implies C ⊑_IO D.

We now introduce some macros that we will use in the following developments. For all memories s, s' and buffers b, there is evidently an expression IS_s such that

  ⟦IS_s⟧(s'[b]) ≠ 0  ⟺  dom(s) = dom(s') & (∀x ∈ dom(s). s(x) = s'[b](x))

Moreover, for all memories s, s' and buffers b, there is evidently a program consisting of a sequence of assignments MAKE_s such that

  s', ⟨b, MAKE_s⟩ →* s, ⟨∅, skip⟩

Finally, for each buffer b, there is evidently a program consisting of a sequence of assignments MAKE_b such that for any s, b'

  s, ⟨b', MAKE_b⟩ →* s[b'], ⟨b, skip⟩

The program MAKE_b can be used to encode input buffers as a command context.

Lemma 13. For any command C and buffers b and b', ⟦MAKE_{b'}; C⟧(b) = ⟦C⟧(b ++ b').

Proof (Sketch). By induction on the length of b'. The base case is immediate and the inductive case follows from the definition of sequential composition.

Corollary 14. For all C_1 and C_2, C_1 ⊑ C_2 ⟺ ∀C, s. ⟦C; C_1⟧(∅)/s ⊆ ⟦C; C_2⟧(∅)/s.

Proof. If ⟦C_1⟧(b)/s ⊈ ⟦C_2⟧(b)/s, choose C = MAKE_b.

For the proof of our main result we will need to encode a context that simulates the environment of an arbitrary trace α. To that end we define the following program.

Definition 15. Given α = (s_0, s_0')…(s_n, s_n'), define the command SIMULATE_α as

  SIMULATE_α = await IS_{s_0} then skip;
               await IS_{s_0'} then MAKE_{s_1};
               await IS_{s_1'} then MAKE_{s_2};
               …
               await IS_{s_{n−1}'} then MAKE_{s_n}

Intuitively, SIMULATE_α is given by the closure of the single trace that is complementary to α. Formally,

  ⟦SIMULATE_α⟧ = (λb ∈ {∅}. { ⟨(s_0', s_1)(s_1', s_2)…(s_{n−1}', s_n), ∅⟩ })†

Lemma 16. Given α as in Definition 15, let {flag, finish} be disjoint from FV(C) ∪ dom(b) ∪ ⋃_i (dom(s_i) ∪ dom(s_i')), and consider the command context

  ℂ[ ] = flag := 0; finish := 0; ((MAKE_{b_0}; [ ]) ∥ SIMULATE_α)

We obtain ⟨α, b⟩ ∈ ⟦MAKE_{b_0}; C⟧(∅) if and only if ⟨α_0 (s_0, s_n') α_1, ∅⟩ ∈ ⟦ℂ[C]⟧(∅), where ⟨α', ∅⟩ ∈ ⟦SIMULATE_α⟧(∅), (s_0, s_n') ∈ (α ⊗ α'), ⟨α_0, ∅⟩ ∈ upd([flag := 0, finish := 0]), and ⟨α_1, ∅⟩ ∈ upd(b).
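To see the shape of the tester of Definition 15, the following sketch prints the text of SIMULATE_α for a given trace; the state-naming scheme is an assumption made only for the printout, and IS and MAKE are the macros introduced above.

def state_name(s):
    """A readable tag for a memory, e.g. {'x': 1} -> 'x1' (illustrative only)."""
    return "_".join("{}{}".format(x, v) for x, v in sorted(s.items()))

def simulate(alpha):
    """Emit the command text of SIMULATE_alpha for alpha = (s_0,s_0')...(s_n,s_n')."""
    lines = ["await IS_{} then skip".format(state_name(alpha[0][0]))]
    for (_, post), (nxt, _) in zip(alpha, alpha[1:]):
        lines.append("await IS_{} then MAKE_{}".format(state_name(post), state_name(nxt)))
    return ";\n".join(lines)

alpha = [({'x': 0}, {'x': 1}), ({'x': 1}, {'x': 2})]
print(simulate(alpha))
# await IS_x0 then skip;
# await IS_x1 then MAKE_x1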

This lemma characterizes the IF traces where the first state is s_0, the final state before flushing the final buffer b is s_n', and the trace terminates by flushing the buffer b. The variables flag and finish play essentially no role in this lemma and are included only to accommodate the use case later.

The proof follows Brookes [1996]. For the forward direction, if ⟨α, b⟩ ∈ ⟦C⟧(∅), the IF trace

  ⟨(s_0, s_0')(s_0', s_1)(s_1, s_1')(s_1', s_2)…(s_{n−1}', s_n)(s_n, s_n'), b⟩

is in ⟦ℂ[C]⟧(∅) by interleaving. Thus, by mumbling closure, ⟨(s_0, s_n'), b⟩ ∈ ⟦ℂ[C]⟧(∅). Conversely, ⟨(s_0, s_n'), b⟩ ∈ ⟦ℂ[C]⟧(∅) for some b only if there is some β that can be interleaved with (s_0', s_1)(s_1', s_2)…(s_{n−1}', s_n) to fill up the gaps between s_i and s_i' for all i. Such a trace yields α by stuttering and mumbling.

A significant difference from Brookes [1996] is that we need to check that the final buffers coincide, since they are part of the trace semantics. To that end, we define the following program CHECK, which observes all the updates of a buffer b as they are performed one by one into the memory.

Definition 17. For any buffer b = [x_1 := v_1, …, x_n := v_n] and memories s and ŝ, define:

  CHECK_{b,s,ŝ} = await IS_s then MAKE_ŝ;
                  await IS_{ŝ[x_1 := v_1]} then MAKE_ŝ;
                  …
                  await IS_{ŝ[x_n := v_n]} then MAKE_ŝ

Informally, the program CHECK_{b,s,ŝ} starts by replacing the state s with a state ŝ. In our use case, ŝ maps every variable to values that do not appear in the trace generating the state s. The await commands are intended to observe each update from the buffer b of another thread. Upon observing each update in state ŝ, that state is reinitialized to observe the following buffer update.

Lemma 18. Let {flag, finish} be disjoint from FV(C) ∪ dom(b) ∪ ⋃_i (dom(s_i) ∪ dom(s_i')). Let ŝ be any memory such that the range of ŝ is disjoint from the ranges of s and b. Consider the command context

  ℂ[ ] = flag := 0; finish := 0;
         ( (MAKE_{b_0}; [ ]; if flag then finish := 1)
         ∥ (D; await 1 then flag := 1; await 1 then flag := 0; CHECK_{b,s,ŝ}; await IS_{ŝ[finish := 1]} then skip) )

There exist α_0 and α_1 such that ⟨α_0, b⟩ ∈ ⟦C⟧(b_0) and ⟨α_1, ∅⟩ ∈ ⟦D⟧(∅) with (s_0, s) ∈ (α_0 ⊗ α_1) if and only if IO⟦ℂ[C]⟧ ≠ ∅.

Proof (Sketch). Let ⟨α_0, b⟩ ∈ ⟦C⟧(b_0) and ⟨α_1, ∅⟩ ∈ ⟦D⟧ such that (s_0, s) ∈ (α_0 ⊗ α_1). Consider the execution given by the following interleaving:

- we start by executing the initial assignments of flag and finish, which are updated before spawning the new threads,
- C and D execute with an appropriate interleaving to yield the shared memory s, a buffer b' for the thread on the left of the parallel composition and an empty buffer for the thread on the right, where b' ↪ b,
- we then execute the first await on the right-hand side of the parallel composition to set flag in shared memory,
- the left thread of the parallel composition observes the update on flag and sets finish; this update is added to the buffer of the left-hand thread,
- the second await on the right-hand thread executes, unsetting flag in shared memory,
- CHECK_{b,s,ŝ} terminates successfully, since the individual awaits can be interleaved with the propagation of the buffer updates from b into the shared memory,
- the update to finish moves into shared memory from the buffer of the left thread; since b was exhausted in the previous step, there is no change in shared memory on dom(ŝ),
- the final await in the right thread terminates successfully because finish is set and the state remains at ŝ[finish := 1].

Lemma 19. C_1 ⊑_IO C_2 implies C_1 ⊑ C_2.

Proof (Sketch). We have to construct a command context that distinguishes the IO behavior of C_1 and C_2 whenever C_1 ⋢ C_2. By Corollary 14, we can assume that ⟦MAKE_{b_0}; C_1⟧(∅) ⊈ ⟦MAKE_{b_0}; C_2⟧(∅). Now let ⟨α, b⟩ ∈ ⟦C_1⟧(b_0) \ ⟦C_2⟧(b_0). Consider the program context

  ℂ[ ] = flag := 0; finish := 0;
         ( (MAKE_{b_0}; [ ]; if flag then finish := 1)
         ∥ (SIMULATE_α; await 1 then flag := 1; await 1 then flag := 0; CHECK_{b,s,ŝ}; await IS_{ŝ[finish := 1]} then skip) )

where flag, finish, ŝ, s and b satisfy the naming constraints of Lemmas 16 and 18. Since ⟨α, b⟩ ∈ ⟦C_1⟧(b_0), we use the forward direction of Lemmas 16 and 18 to deduce that IO⟦ℂ[C_1]⟧ ≠ ∅. Suppose IO⟦ℂ[C_2]⟧ ≠ ∅. Then there are α_0 and α' with ⟨α_0, b⟩ ∈ ⟦C_2⟧(b_0) and ⟨α', ∅⟩ ∈ ⟦SIMULATE_α⟧ such that (s_0, s) ∈ (α_0 ⊗ α'). So by Lemma 18, ⟨α, b⟩ ∈ ⟦C_2⟧(b_0), which is a contradiction.

Combining Corollary 12 and Lemma 19, we deduce that the denotational semantics is inequationally fully abstract.

Theorem 20 (Full Abstraction). For any commands C and D we have

  C ⊑ D  ⟺  C ⊑_IO D

A simple corollary of the proof of Lemma 19 is that it suffices to consider simple sequential contexts to prove inter-substitutivity of programs. For a given sequential command D and a given b, consider:

  ℂ^b_D[ ] = flag := 0; finish := 0; ((MAKE_b; [ ]; if flag then finish := 1) ∥ D)

where flag and finish satisfy the naming constraints of Lemmas 16 and 18. Then:

  C_1 ⊑ C_2  ⟺  ∀D, b. (IO⟦ℂ^b_D[C_1]⟧ ≠ ∅ implies IO⟦ℂ^b_D[C_2]⟧ ≠ ∅)

This validates the folklore analysis of TSO programs using only sequential testers in parallel.

5 Examples & Laws

We examine some laws of parallel programming under a TSO memory model, and consider some standard TSO examples from the perspective of the denotational semantics introduced in Section 3.

Laws of parallel programming. Most of the laws inherited from Brookes [1996] hold in our setting.

  skip; C ≡ C ≡ C; skip                                        (1)
  (C_1; C_2); C_3 ≡ C_1; (C_2; C_3)                            (2)
  C_1 ∥ C_2 ≡ C_2 ∥ C_1                                        (3)
  (C_1 ∥ C_2) ∥ C_3 ≡ C_1 ∥ (C_2 ∥ C_3)                        (4)
  (if E then C_0 else C_1); C ≡ if E then C_0; C else C_1; C   (5)
  while E do C ≡ if E then (C; while E do C) else skip         (6)

In (1) and (2) we see that sequential composition is associative with unit skip. Laws (3) and (4) say that parallel composition is commutative and associative. However, skip is not a unit for parallel composition in general, since parallel composition requires flushing the buffers before spawning the threads and when synchronizing them at the end. Instead, what holds is:

  skip ∥ C ≡ mfence; C; mfence

Law (5) implies that sequential composition distributes into conditionals, and finally law (6) is the usual unrolling law for while loops. Also, the usual laws for local variables hold. If x is not free in C, then:

  local x in C ≡ C
  local x in (C; D) ≡ C; local x in D
  local x in (C ∥ D) ≡ C ∥ local x in D

Thread inlining. Thread inlining is always sound in Brookes [1996], where, for example, the following law holds:

  x := y; C ⊑ x := y ∥ C

In our setting, however, this law holds only if C does not read reference x. In the case where C reads x, the C on the left-hand side can potentially access newer local updates that are not yet available globally. In that case, an mfence is needed to validate the law:

  x := y; mfence; C ⊑ x := y ∥ C

  (local r_0, r_1 in x := 1; r_0 := x; r_1 := y)  ∥  (local r_2, r_3 in y := 1; r_2 := y; r_3 := x)
  Possible: r_0 = r_2 = 1 & r_1 = r_3 = 0
  (a) Buffer Forwarding

  (x := 1)  ∥  (y := 1)  ∥  (local r_0, r_1 in r_0 := x; r_1 := y)  ∥  (local r_2, r_3 in r_2 := y; r_3 := x)
  Impossible: r_0 = r_2 = 1 & r_1 = r_3 = 0
  (b) IRIW

  Fig. 4: TSO Examples

Commutation of independent statements. The TSO memory model permits reads to move ahead of previous writes on independent references. This is generally seen with the example below. Using the denotational semantics, we are able to prove the inequality, and moreover the denotations imply the existence of counterexamples to show that the inequality cannot be strengthened to an equality. Thus we get:

  local r in (r := y; x := 1; z := r)  ⊑  local r in (x := 1; r := y; z := r)
  local r in (x := 1; r := y; z := r)  ⋢  local r in (r := y; x := 1; z := r)

In general, TSO does not permit writes of independent references, or reads of independent references, to commute. However, a special case of this latter class of transformation can be modeled by the capability of reading one thread's own writes (as shown in the example of Figure 4a). Notice in particular that the example in Figure 4a is a case of inlining of the standard IRIW example (shown in Figure 4b), which provides evidence for our previous claim that inlining is not a legal TSO transformation in general. Our denotational semantics is able to explain this relaxed behavior by means of the inequalities below. In particular, the second inequality can be proved using the inequality discussed above and the first one.

  local r in (x := 1; r := 1; z := r)  ⊑  local r in (x := 1; r := x; z := r)

  local r_1, r_2 in (r_2 := y; x := 1; r_1 := 1; z_0 := r_1; z_1 := r_2)  ⊑  local r_1, r_2 in (x := 1; r_1 := x; r_2 := y; z_0 := r_1; z_1 := r_2)

6 Conclusion

We describe how to modify the Brookes semantics for a shared-variable parallel programming language [Brookes, 1996] to address the TSO relaxed memory model. We view our results as the foundation for two developments: (a) separation logics for relaxed memory models, and (b) a refinement theory for relaxed memory models.

References

S. V. Adve and H.-J. Boehm. Memory models: a case for rethinking parallel languages and hardware. Commun. ACM, 53(8):90-101, 2010.

S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. Computer, 29(12):66-76, 1996.
S. V. Adve and M. D. Hill. Weak ordering - a new definition. In ISCA, pages 2-14, 1990.
G. Boudol and G. Petri. Relaxed memory models: an operational approach. In POPL, 2009.
S. Brookes. A semantics for concurrent separation logic. Theor. Comput. Sci., 375(1-3), 2007.
S. D. Brookes. Full abstraction for a shared-variable parallel language. Inf. Comput., 127(2), 1996.
R. Jagadeesan, C. Pitcher, and J. Riely. Generative operational semantics for relaxed memory models. In ESOP, 2010.
L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9), 1979.
J. Manson, W. Pugh, and S. V. Adve. The Java memory model. In POPL, 2005.
P. W. O'Hearn. Resources, concurrency, and local reasoning. Theor. Comput. Sci., 375(1-3), 2007.
S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In TPHOLs, 2009.
M. J. Parkinson, R. Bornat, and P. W. O'Hearn. Modular verification of a non-blocking stack. In POPL, 2007.
J. C. Reynolds. Separation logic: A logic for shared mutable data structures. In LICS, pages 55-74, 2002.
J. Sevcík, V. Vafeiadis, F. Z. Nardelli, S. Jagannathan, and P. Sewell. Relaxed-memory concurrency and verified compilation. In POPL, pages 43-54, 2011.
P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen. x86-TSO: a rigorous and usable programmer's model for x86 multiprocessors. Commun. ACM, 53(7):89-97, 2010.
SPARC International, Inc. The SPARC Architecture Manual (Version 9). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.
A. J. Turon and M. Wand. A separation logic for refining concurrent objects. SIGPLAN Not., 46(1), January 2011.
V. Vafeiadis and M. J. Parkinson. A marriage of rely/guarantee and separation logic. In CONCUR, 2007.


More information

Matching Theory. Figure 1: Is this graph bipartite?

Matching Theory. Figure 1: Is this graph bipartite? Matching Theory 1 Introduction A matching M of a graph is a subset of E such that no two edges in M share a vertex; edges which have this property are called independent edges. A matching M is said to

More information

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue Review of last lecture Design of Parallel and High-Performance Computing Fall 2016 Lecture: Linearizability Motivational video: https://www.youtube.com/watch?v=qx2driqxnbs Instructor: Torsten Hoefler &

More information

How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor

How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor Leslie Lamport 1 Digital Equipment Corporation February 14, 1993 Minor revisions January 18, 1996 and September 14, 1996

More information

Introduction to Denotational Semantics. Class Likes/Dislikes Survey. Dueling Semantics. Denotational Semantics Learning Goals. You re On Jeopardy!

Introduction to Denotational Semantics. Class Likes/Dislikes Survey. Dueling Semantics. Denotational Semantics Learning Goals. You re On Jeopardy! Introduction to Denotational Semantics Class Likes/Dislikes Survey would change [the bijection question] to be one that still tested students' recollection of set theory but that didn't take as much time

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models Review. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed? The safety net between operations whose order needs

More information

Module 11. Directed Graphs. Contents

Module 11. Directed Graphs. Contents Module 11 Directed Graphs Contents 11.1 Basic concepts......................... 256 Underlying graph of a digraph................ 257 Out-degrees and in-degrees.................. 258 Isomorphism..........................

More information

The Java Memory Model: a Formal Explanation 1

The Java Memory Model: a Formal Explanation 1 The Java Memory Model: a Formal Explanation 1 M. Huisman 2 G. Petri 3 INRIA Sophia Antipolis, France Abstract This paper discusses the new Java Memory Model (JMM), introduced for Java 1.5. The JMM specifies

More information

Introduction to Denotational Semantics. Brutus Is An Honorable Man. Class Likes/Dislikes Survey. Dueling Semantics

Introduction to Denotational Semantics. Brutus Is An Honorable Man. Class Likes/Dislikes Survey. Dueling Semantics Brutus Is An Honorable Man HW2 will not be due today. Homework X+1 will never be due until after I have returned Homework X to you. Normally this is never an issue, but I was sick yesterday and was hosting

More information

Mathematically Rigorous Software Design Review of mathematical prerequisites

Mathematically Rigorous Software Design Review of mathematical prerequisites Mathematically Rigorous Software Design 2002 September 27 Part 1: Boolean algebra 1. Define the Boolean functions and, or, not, implication ( ), equivalence ( ) and equals (=) by truth tables. 2. In an

More information

Monotone Paths in Geometric Triangulations

Monotone Paths in Geometric Triangulations Monotone Paths in Geometric Triangulations Adrian Dumitrescu Ritankar Mandal Csaba D. Tóth November 19, 2017 Abstract (I) We prove that the (maximum) number of monotone paths in a geometric triangulation

More information

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY ALGEBRAIC METHODS IN LOGIC AND IN COMPUTER SCIENCE BANACH CENTER PUBLICATIONS, VOLUME 28 INSTITUTE OF MATHEMATICS POLISH ACADEMY OF SCIENCES WARSZAWA 1993 ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING

More information

Edge Colorings of Complete Multipartite Graphs Forbidding Rainbow Cycles

Edge Colorings of Complete Multipartite Graphs Forbidding Rainbow Cycles Theory and Applications of Graphs Volume 4 Issue 2 Article 2 November 2017 Edge Colorings of Complete Multipartite Graphs Forbidding Rainbow Cycles Peter Johnson johnspd@auburn.edu Andrew Owens Auburn

More information

Foundations of the C++ Concurrency Memory Model

Foundations of the C++ Concurrency Memory Model Foundations of the C++ Concurrency Memory Model John Mellor-Crummey and Karthik Murthy Department of Computer Science Rice University johnmc@rice.edu COMP 522 27 September 2016 Before C++ Memory Model

More information

CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Sections p.

CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Sections p. CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Sections 10.1-10.3 p. 1/106 CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer

More information

Implementing Sequential Consistency In Cache-Based Systems

Implementing Sequential Consistency In Cache-Based Systems To appear in the Proceedings of the 1990 International Conference on Parallel Processing Implementing Sequential Consistency In Cache-Based Systems Sarita V. Adve Mark D. Hill Computer Sciences Department

More information

Topology - I. Michael Shulman WOMP 2004

Topology - I. Michael Shulman WOMP 2004 Topology - I Michael Shulman WOMP 2004 1 Topological Spaces There are many different ways to define a topological space; the most common one is as follows: Definition 1.1 A topological space (often just

More information

A Reduction of Conway s Thrackle Conjecture

A Reduction of Conway s Thrackle Conjecture A Reduction of Conway s Thrackle Conjecture Wei Li, Karen Daniels, and Konstantin Rybnikov Department of Computer Science and Department of Mathematical Sciences University of Massachusetts, Lowell 01854

More information

Proving Theorems with Athena

Proving Theorems with Athena Proving Theorems with Athena David R. Musser Aytekin Vargun August 28, 2003, revised January 26, 2005 Contents 1 Introduction 1 2 Proofs about order relations 2 3 Proofs about natural numbers 7 3.1 Term

More information

The University of Nottingham SCHOOL OF COMPUTER SCIENCE A LEVEL 4 MODULE, SPRING SEMESTER MATHEMATICAL FOUNDATIONS OF PROGRAMMING ANSWERS

The University of Nottingham SCHOOL OF COMPUTER SCIENCE A LEVEL 4 MODULE, SPRING SEMESTER MATHEMATICAL FOUNDATIONS OF PROGRAMMING ANSWERS The University of Nottingham SCHOOL OF COMPUTER SCIENCE A LEVEL 4 MODULE, SPRING SEMESTER 2012 2013 MATHEMATICAL FOUNDATIONS OF PROGRAMMING ANSWERS Time allowed TWO hours Candidates may complete the front

More information

MPRI course 2-4 Functional programming languages Exercises

MPRI course 2-4 Functional programming languages Exercises MPRI course 2-4 Functional programming languages Exercises Xavier Leroy October 13, 2016 Part I: Interpreters and operational semantics Exercise I.1 (**) Prove theorem 2 (the unique decomposition theorem).

More information

High-level languages

High-level languages High-level languages High-level languages are not immune to these problems. Actually, the situation is even worse: the source language typically operates over mixed-size values (multi-word and bitfield);

More information

Substitution in Structural Operational Semantics and value-passing process calculi

Substitution in Structural Operational Semantics and value-passing process calculi Substitution in Structural Operational Semantics and value-passing process calculi Sam Staton Computer Laboratory University of Cambridge Abstract Consider a process calculus that allows agents to communicate

More information

On the Adequacy of Program Dependence Graphs for Representing Programs

On the Adequacy of Program Dependence Graphs for Representing Programs - 1 - On the Adequacy of Program Dependence Graphs for Representing Programs Susan Horwitz, Jan Prins, and Thomas Reps University of Wisconsin Madison Abstract Program dependence graphs were introduced

More information

A Compositional Semantics for Verified Separate Compilation and Linking

A Compositional Semantics for Verified Separate Compilation and Linking A Compositional Semantics for Verified Separate Compilation and Linking Tahina Ramananandro Zhong Shao Shu-Chun Weng Jérémie Koenig Yuchen Fu 1 Yale University 1 Massachusetts Institute of Technology Abstract

More information

Acyclic fuzzy preferences and the Orlovsky choice function: A note. Denis BOUYSSOU

Acyclic fuzzy preferences and the Orlovsky choice function: A note. Denis BOUYSSOU Acyclic fuzzy preferences and the Orlovsky choice function: A note Denis BOUYSSOU Abstract This note corrects and extends a recent axiomatic characterization of the Orlovsky choice function for a particular

More information

Verification of a Concurrent Deque Implementation

Verification of a Concurrent Deque Implementation Verification of a Concurrent Deque Implementation Robert D. Blumofe C. Greg Plaxton Sandip Ray Department of Computer Science, University of Texas at Austin rdb, plaxton, sandip @cs.utexas.edu June 1999

More information

Modal Logic: Implications for Design of a Language for Distributed Computation p.1/53

Modal Logic: Implications for Design of a Language for Distributed Computation p.1/53 Modal Logic: Implications for Design of a Language for Distributed Computation Jonathan Moody (with Frank Pfenning) Department of Computer Science Carnegie Mellon University Modal Logic: Implications for

More information

Disjoint Support Decompositions

Disjoint Support Decompositions Chapter 4 Disjoint Support Decompositions We introduce now a new property of logic functions which will be useful to further improve the quality of parameterizations in symbolic simulation. In informal

More information