C++ Memory Model

Martin Kempf

December 26, 2012
Abstract

Multi-threaded programming is increasingly important. We need parallel programs to take advantage of multi-core processors, and those are likely to be the main source of improved performance, since the number of cores in processors will grow instead of the clock rate. Prior to C++11, multi-threading in C++ was supported only by libraries and the language was specified as a single-threaded language. With the memory model introduced in C++11, threading support is integrated into the language specification, which exactly defines the behavior of multi-threaded applications. Furthermore, easy-to-use atomic operations are introduced to allow the implementation of lock-free algorithms and therefore a more performance-conscious way of programming. This paper gives an introduction on why a memory model was needed. The model is first explained with all atomic operations executing sequentially consistently, followed by the enhanced model with atomic operations that relax the sequential consistency guarantee. Examples are provided that illustrate the different relaxation options.

1. Introduction

Many parallel programs are written using threads and shared variables, and it is likely that this parallel programming model will stay popular for several reasons. Direct hardware support for shared memory is a performance advantage. For example, values read mostly by different instances are implicitly shared in memory and available to all instances without the need for replication. And in case of recognized bottlenecks in an application, parallelism with threads can be introduced without a complete redesign of data structures [3]. An important part of shared-memory parallelism is the memory model, or memory consistency model. Prior to C++11, multi-threaded applications in C++ were written using libraries for threading support; the language itself was specified as a single-threaded language.
The execution of multi-threaded applications, programmed with a single-threaded language and a library for threading support, was based on an agreement between compiler and hardware. This also affected portability. With the new C++11 standard, threading support is part of the language specification and a memory model is defined. This gives guaranteed behavior in multi-threaded programs and better portability.

1.1 What is a Memory Model

When multiple threads can access the same memory location in parallel, it must be specified which set of values a read can return. A memory model specifies concurrency semantics in shared-memory programs. The necessity can be shown by considering a quad-core processor where the caches are not shared between the cores. Running a multi-threaded application on this hardware, the following instruction ordering is possible: a thread 1 stores a value to a certain memory location. Afterwards, another thread 2 loads a value from the same memory location where the store was

performed previously by thread 1. Additionally, thread 1 is executed by core A and thread 2 by core B. The scenario is shown in Figure 1. Which value is read (seen) by thread 2? Can it read the value stored by thread 1 although the cores do not share the cache? The question is whether there is enough synchronization to ensure that one thread's write will occur before another's read. The compiler as well as the processor can reorder instructions in order to increase performance. These reorderings can also unintentionally influence the behavior of a program, especially in multi-threaded programs (see the example in Section 1.2).

Figure 1: Load and store of a shared variable x, performed by different threads on cores that do not share the cache.

A memory model can therefore also be seen as a contract between the program and any hardware or software that transforms the program. It restricts reordering, and the compiler as well as the processor must agree on this. The design of a memory model involves a tension between performance and usability. A strong memory model, such as sequential consistency (see Section 1.2), restricts many reorderings and therefore prevents hardware and software optimizations, which reduces performance. But it also simplifies reasoning about programs.

1.2 Sequential Consistency

Sequential consistency is the most restrictive memory model but also the most intuitive one. It was introduced by Lamport [14] as follows: Hardware is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. For an application of this definition we consider the example in Figure 2. Additionally, we interpret the term operation as memory operation and the term result as the set of values read and the final value of the execution [2].
By some sequential order we understand all interleavings of memory operations that are possible combinations of instructions 1 to 4. Additionally, in these different possible executions, every read operation returns the previously written value. The latter criterion enforces that the memory operations appear to execute atomically, hence without interleaving with respect to other memory operations. Under these considerations, the value of Flag2 in instruction 2 and the value of Flag1 in instruction 4 cannot both be 0.

Figure 2: For sequential consistency, all memory accesses appear to execute atomically in some total order and program order is maintained among operations of each processor [1].

Figure 3: A store buffer violates sequential consistency. t1, t2, ... indicate the order in which the corresponding memory operations execute at memory [1].
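Figure 2 is not reproduced here, but from the text it is the classic Dekker-style flag pattern on Flag1 and Flag2. A minimal C++11 reconstruction is sketched below (the helper name run_dekker_attempt is ours, not from the paper); with default, i.e. sequentially consistent, atomics the outcome where both threads read 0 is forbidden:

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> Flag1{0}, Flag2{0};

// Runs the two threads of Figure 2 once and returns the values each
// thread read from the other's flag. Default atomic operations are
// memory_order_seq_cst, so (0, 0) is not a possible result.
std::pair<int, int> run_dekker_attempt() {
    Flag1 = 0;
    Flag2 = 0;
    int r1 = -1, r2 = -1;
    std::thread p1([&] {
        Flag1.store(1);     // instruction 1
        r1 = Flag2.load();  // instruction 2
    });
    std::thread p2([&] {
        Flag2.store(1);     // instruction 3
        r2 = Flag1.load();  // instruction 4
    });
    p1.join();
    p2.join();
    return {r1, r2};
}
```

With non-atomic variables the same pattern would be a data race; with the relaxed orderings introduced later in the paper, the both-zero outcome becomes possible again.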

Most modern processors use a store buffer to avoid waiting for stores to complete [3]. Such a scenario is shown in Figure 3. It shows the execution of the example in Figure 2 using a store buffer. Each processor can buffer its write and allow the subsequent read to bypass it. Therefore the read operations are performed before the writes are applied to memory, allowing both reads to return 0. This violates sequential consistency.

1.3 Software and Hardware Memory Models

Both hardware and software that transform a program must agree on the memory model. Hardware models are often more relaxed than software models. It is therefore not sufficient to prevent the compiler from any reordering and make it transform every read and write of variables into simple load/store hardware instructions. A weaker hardware model would then allow reordering of these loads and stores, as seen in Figure 3, and violate the software memory model. The solution to this problem is to let the compiler insert memory fences that prevent the hardware from performing violating reorderings. A full fence ensures that all read and write operations executed before the fence are synchronized before the memory operations executed after the fence. The hardware memory model then restricts reordering when fences are involved. Sometimes it is inappropriate to insert a full fence, for example if only two threads need to be synchronized. In this case, not all threads need to be updated with respect to the synchronizing variable. Mapping a software memory model to a hardware memory model with optimal performance is a difficult issue [11] [3].

1.4 About the Report

To explain the details of the C++ memory model, the paper is structured as follows. In Section 2, the motivating reasons for introducing the C++ memory model are stated, followed by the explanation of the C++ memory model.
In Section 3.5 the memory model is explained for sequentially consistent execution. It is a simpler explanation of the equivalent model described in the standard [7] for sequentially consistent execution. Section 3.6 explains the model of the standard that allows relaxing the sequential consistency guarantee.

2. Motivation for the C++ Memory Model

C++ was designed to leave concurrency support to additional libraries. This design decision was motivated by the wish to support different concurrency models (multi-threading, multi-process, distributed), not just one [16]. A library that adds threading support and synchronization primitives is Pthread. It has been shown that extending C++ with threading support by a library works in most cases, but not in every case, and also affects portability [8]. In this section we explain why the library approach almost worked, but also where it fails and thus motivated a formalized memory model in the language specification.

2.1 The Library Approach with Pthread

The rule for writing multi-threaded applications using the Pthread library is as follows: Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads... Two of the functions mentioned in this quote are pthread_mutex_lock and pthread_mutex_unlock. Disciplined use of these synchronization operations is required to prevent one thread from modifying a memory location while another thread modifies the same memory location. Implementations of the Pthread standard proceed as follows [8]:

1. The implementation of the synchronization functions such as pthread_mutex_lock must guarantee that the memory is synchronized. This is realized by adding memory fences that are specific to the underlying hardware. The fences therefore preclude a hardware reordering of memory operations around calls to synchronization operations of Pthread. This is shown in Figure 4.

2. The compiler must also prevent the reordering seen in Figure 4. The compiler therefore treats calls to functions such as pthread_mutex_lock as calls to opaque functions. This means the compiler has no information about the function and assumes that it may read and write any global variable. This assumption prevents the compiler from simply moving memory operations around the calls.

Figure 4: Implementations of Pthread must prevent the reordering of memory operations around synchronization operations such as pthread_mutex_lock and pthread_mutex_unlock [10].

With this rule and these implementation considerations we can state the following:

1. The semantics of programs with races is undefined. (The definition of a race is not made in Pthread; see Section 2.2.1.)

2. Synchronization-free code can be optimized as though it were single-threaded.

3. It is easy to write wrong code, for example by an incorrect ordering of pthread_mutex_lock and pthread_mutex_unlock.

2.2 Possible Implications of the Library Approach

In [8], three issues are listed where the library approach as realized in Pthread can fail. We discuss one in detail, from which it becomes clear that a memory model must be defined in a language specification, and also mention the other points.

2.2.1 Concurrent Modification

According to the Pthread rule "...no thread of control can read or modify a memory location while another thread of control may be modifying it...", we need to know when a concurrent modification (data race) can happen in order to decide where to put synchronization operations.
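The Pthread discipline described above carries over directly to C++11, whose std::mutex is typically implemented on top of pthread_mutex_lock/pthread_mutex_unlock on POSIX systems. A minimal sketch of the rule (the counter example and helper name are ours) is:

```cpp
#include <mutex>
#include <thread>

int counter = 0;           // shared memory location
std::mutex counter_mutex;  // plays the role of a pthread_mutex_t

// Each thread holds the lock while modifying the shared location, as the
// Pthread rule demands: no thread may read or modify the location while
// another thread may be modifying it.
int run_counter_demo(int increments_per_thread) {
    counter = 0;
    auto work = [&] {
        for (int i = 0; i < increments_per_thread; ++i) {
            std::lock_guard<std::mutex> guard(counter_mutex);  // lock()/unlock()
            ++counter;
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

Removing the lock_guard would make the two increments a concurrent modification in the sense of the quoted rule, with undefined outcome.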
Data races can only be detected if the semantics of the programming language is defined, which in turn is only possible with a properly defined memory model. There is a circularity in the definition [15]. Consider the example in Figure 5. Each thread executes the listed statements. Under sequentially consistent execution, there is no race; hence the outcome x==1 and y==1 is not possible, as neither variable can ever become non-zero. But according to the Pthread approach, a compiler may freely reorder memory operations that are free of synchronization operations.
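Figures 5 and 6 are not reproduced here; the following is a plausible reconstruction of the pattern the text describes (conditional cross-stores, then a speculative transformation), with function names of our choosing:

```cpp
// Shared variables; both start at 0.
int x = 0, y = 0;

// Our guess at Figure 5: under sequentially consistent execution neither
// condition can ever hold, so neither variable ever becomes non-zero and
// there is no concurrent modification.
void thread1_original() { if (x == 1) y = 1; }
void thread2_original() { if (y == 1) x = 1; }

// A transformation in the spirit of Figure 6: the compiler speculates that
// the condition holds, stores early, and undoes the store otherwise. This
// is valid for single-threaded code, but when two threads run the
// transformed functions, the speculative store of one thread can be
// observed by the other, so x == 1 && y == 1 becomes reachable: the
// transformation has introduced a race.
void thread1_transformed() {
    int old = y;
    y = 1;
    if (x != 1) y = old;
}
```

In a thread-aware language definition this transformation is forbidden for the original code, because it introduces a store that the source program never performs.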

Figure 5: Is there a concurrent modification? Do we need synchronization operations? [8].

Figure 6: Possible reordering of the code in Figure 5 by the compiler, caused by speculative execution [8].

This can result in code as shown in Figure 6. This transformation is based on speculative execution, assuming that x==1 and y==1 mostly hold in the conditional statements. Considering this transformed code again under sequential consistency, the outcome x==1 and y==1 is possible and would imply a race. It is therefore shown that the Pthread approach together with a thread-unaware language definition can lead to transformations that introduce a race. This problem can be solved by a programming-language-defined and compiler-respected memory model that ensures that the user and the compiler agree on where there is a data race.

2.2.2 Memory Location and Register Promotion

The two additional reasons for a thread-aware language definition are both founded on thread-unaware transformations by the compiler. There must be a clear definition of what a memory location is. Based on this definition, compiler optimization rules can be derived that prevent the compiler from introducing implicit writes to adjacent memory locations that may cause a race. The analogous can be said of register promotion. In register promotion, values are kept in registers instead of being reloaded many times from memory. This may introduce additional reads and writes outside of locks and may introduce a race, the outcome of which is undefined. Both of the issues mentioned here have been observed in practice [10].

2.3 Conclusion

The possible issues that may arise from language semantics that are not properly defined with respect to threads were the motivating reasons to introduce a well-defined memory model for C++. The usage of libraries such as Pthread works mostly well, but is based on conventions and relies heavily on the used compiler and target hardware.
This approach cannot guarantee continued correctness and portability of the application as the compiler evolves and more aggressive optimizations are added, or as the application is moved to a different compiler [13]. Another motivating point was performance. In [8] it is shown that lock-free code (code that makes no use of lock and unlock calls) that uses atomic operations instead leads to a significant performance increase. Although not every part of a multi-threaded application can be written without locking, it can affect the overall performance, and making atomic operations easily available in C++ was therefore kept as a goal.

3. The C++ Memory Model

The standardized memory model is a data-race-free model. This means that data races still imply undefined behavior. Further, C++ provides simple use of atomic operations through atomics. The atomics can be tagged with different ordering options, resulting in different memory-ordering models. Atomics tagged for sequentially consistent execution belong to the sequentially consistent model. Atomics with ordering options other than sequentially consistent belong to more relaxed models and improve performance. Atomics with ordering options for relaxed models are referred to as low-level atomics. Section 3.1 gives an overview of the ordering options and models. The model with sequentially consistent atomics is explained in Section 3.5, the enhanced model supporting the low-level atomics in Section 3.6.

3.1 Model Overview

Figure 7 shows the different ordering models available in the C++ memory model together with the ordering options that define the used model.

Figure 7: Possible ordering models in C++ with the ordering options that lead to the used model.

The distinct memory models can have varying costs on different CPU architectures [17]. Consider an architecture with fine-grained control over the visibility of operations to processors other than the one that made the change. On such an architecture, additional synchronization might be needed for sequentially consistent ordering compared to acquire-release or relaxed ordering, and for acquire-release ordering compared to relaxed ordering. The impact of additional synchronization instructions on the overall performance is bigger on systems that have many processors [17].

3.2 Basics

The basics of the C++ memory model are explained in this section. They are valid for sequentially consistent atomics and for low-level atomics.
These basics introduce how objects and values are organized in memory, as well as the concept of modification order.

3.3 Objects and Memory Location

A memory model comprises two aspects: the structural and the concurrency aspect. The structural aspect is about how objects and values are organized in memory. It is the basis for deciding whether there is potential for concurrent access. During the execution of a program, memory is accessed. These memory operations access a certain memory location. Figure 8 shows the organization of a struct with its subobjects and simple fields in memory. Each subobject is a data member of the struct and occupies one or several memory locations. The bit fields bf1 and bf2 share a memory location and the std::string object s consists of several memory locations. The other members of scalar type occupy their own

memory location. Using a zero-length bit field, the sequence of bit fields is separated and each bit field gets its own memory location. This can be seen with bf3 and bf4.

Figure 8: The division of a struct into objects and memory locations [17].

3.4 Modification Order

Every object in C++ has a defined modification order that includes all writes to the object from all threads in the program. The modification order may vary between executions, but in a specific execution all threads agree on the modification order of each variable. This is what the programmer must ensure, with more or less synchronization effort depending on the chosen ordering model from Section 3.1. While the modification order of each variable is agreed on by every thread, it cannot be said for every ordering model that program statement number 7 from Figure 9 reads the value stored by program statement number 2.

Figure 9: There is one specific modification order for each variable in the execution of a program.

3.5 The C++ Memory Model Without Low-Level Atomics

In this section we provide an explanation of the C++ memory model with sequentially consistent semantics. This is the default ordering model. If the term atomic is used in this section, we always refer to sequentially consistent atomics. The focus in this section lies on the definition of a sequentially consistent execution order, the definition of a data race, and the behavior of a C++

program in case of a data race. The allowed compiler optimizations are mentioned and there is also a short look at implementation issues of this model.

3.5.1 Sequentially Consistent Execution

A sequentially consistent execution of a program is possible even if the chosen programming language does not implement the sequential consistency memory model. This section explains the constraints on the order of memory actions in a multi-threaded program that yield a sequentially consistent execution. These constraints leave room for some compiler optimizations, which are explained in Section 3.5.4. The execution order in a single thread is defined by the sequenced-before relation. Referring to Listing 1, the evaluation of the arguments to subtract on line 14 is sequenced before the expression in the body of the called subtract function. The argument evaluation shall therefore precede the execution of subtract. In general, a full statement is sequenced before the next statement, while the operand evaluations of an operator are unsequenced, meaning neither operand is sequenced before the other. Exceptions are the built-in comma operator ',' and the logical operators '&&' and '||', where the operand evaluation is sequenced [7].

 1 #include <iostream>
 2
 3 int subtract(int a, int b) {
 4     return a - b;
 5 }
 6
 7 int get_num() {
 8     static int i = 0;
 9     i++;  // side effect; which of the two calls in main runs first is unspecified
10     return i;
11 }
12
13 int main() {
14     int sub = subtract(get_num(), get_num());
15     std::cout << sub;
16 }

Listing 1: The evaluation of the arguments to subtract is sequenced before the execution of the function body of subtract on line 4, but the evaluation order of the two arguments to subtract is unspecified. The two calls to get_num are indeterminately sequenced, so sub is either 1 or -1; had the increments appeared directly in the argument expressions (e.g. subtract(i++, i++)), the behavior would be undefined [17].
If a side effect on a scalar object is unsequenced relative to either another side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined [7]. The evaluation order of a and b in the body of subtract in Listing 1 is also not defined, but there is no undefined behavior, since the evaluation of a and b involves no side effect on the same object. To conclude, the sequenced-before relation partially orders the execution of memory actions in a single-threaded program and therefore also determines which value is read from a variable. Memory actions can be divided into two categories:

As synchronization operations we define the actions lock, unlock, atomic load, atomic store, and atomic read-modify-write. All these actions can be used to communicate between threads, hence the category name.

As data operations, sometimes also called ordinary data operations, we define the non-atomic actions store and load.

The sequentially consistent execution of a multi-threaded program can now be defined using the following constraints regarding the thread-internal order of actions and a total order <T [7]:

1. The execution of each thread must be internally consistent. This means that reorderings of actions are allowed as long as they still maintain a correct sequential execution with respect to the values read and with respect to the sequenced-before ordering. For example, optimizations that are inconsistent with the sequenced-before ordering are not allowed.

2. The total order is consistent with the sequenced-before ordering, i.e. if a is sequenced before b, then a <T b.

3. Each load, lock, and read-modify-write operation reads the value of the last preceding write to the same location according to the total order. The last operation on a given lock preceding an unlock must be a lock operation performed by the same thread.

This effectively leads to a total order in which the actions of the different threads are just interleaved.

3.5.2 Data Race

Two operations conflict if they access the same memory location and at least one of them is a store, atomic store, or atomic read-modify-write operation. A data race can further be defined as follows: two memory operations from different threads form a data race if they conflict, at least one of them is a data operation, and the two operations are adjacent in the total order [11]. Based on this definition and considering the example in Figure 2, the store action Flag1 = 1 of P1 and the load action Flag1 == 0 of P2 form a data race.

3.5.3 Data Race Free Model

The C++ memory model is a data-race-free model. This means that in case of a data race as defined in the previous Section 3.5.2, the behavior of the program is undefined. Otherwise, the program (on the same input) behaves according to one of its sequentially consistent executions [11]. This allows for some important hardware and compiler optimizations, but also restricts reordering around synchronization operations.
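The distinction between data operations and synchronization operations can be made concrete with a small sketch (the variable and function names are ours). Two unsynchronized non-atomic writes to the same location form a data race and hence undefined behavior; replacing the location with an atomic turns the accesses into synchronization operations and removes the race:

```cpp
#include <atomic>
#include <thread>

int plain = 0;                // non-atomic shared location
std::atomic<int> guarded{0};  // atomic shared location

// Running this from two threads at once would be a data race: two
// conflicting data operations, not ordered by any synchronization.
// Under the data-race-free model the behavior would be undefined.
void racy_writer() { plain = plain + 1; }

// fetch_add is an atomic read-modify-write, i.e. a synchronization
// operation, so two threads executing it concurrently form no data race
// and no increment is lost.
int run_race_free(int increments_per_thread) {
    guarded = 0;
    auto work = [&] {
        for (int i = 0; i < increments_per_thread; ++i)
            guarded.fetch_add(1);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return guarded.load();
}
```

Note that the race-free program is guaranteed to behave like one of its sequentially consistent executions, which is what makes the result predictable.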
The allowed optimizations are explained in the next section.

3.5.4 Optimizations Allowed by the Model

To understand the possible optimizations, some important terms are clarified first. Figure 10 shows the terms applied.

Figure 10: Lock operations are operations on the lock object. The synchronization operations lock() and unlock() use them to prevent multiple threads from being in a critical section.

The term synchronization operation is refined into read synchronization operations, consisting of lock() and atomic reads, and write synchronization operations, consisting of unlock() and atomic writes. A lock() and an unlock() are implemented using a lock object. The operations on this lock object are called lock operations and can be either writes or reads. These lock operations effectively prevent multiple threads from accessing a critical section. As long as the intra-thread semantics is maintained, two operations M1 and M2, where M1 is sequenced before M2, can be freely reordered by compiler and hardware if [4] [2] [12] [9]:

1. M1 is a data operation and M2 is a read synchronization operation, or

2. M1 is a write synchronization operation and M2 is a data operation, or

3. M1 and M2 are both data operations with no synchronization operation sequence-ordered between them.

Figure 11 shows examples of possible reorderings around synchronization operations.

Figure 11: Allowed reorderings around synchronization operations in the described model.

Additionally, reorderings around lock operations are allowed if lock() and unlock() are used in well-structured ways, i.e. when they prevent data races and do not form a deadlock. The reordering of M1 sequenced before M2 is safe (assuming the intra-thread semantics is not affected) [4] if:

1. M1 is a data operation and M2 is the write of a lock operation, or

2. M1 is an unlock and M2 is either a read or a write of a lock.

Figure 12 shows examples of the possible reorderings around lock operations.

Figure 12: Allowed reorderings around lock operations in the described model.

3.5.5 The Way to Sequentially Consistent Atomics

The model requires that synchronization operations appear sequentially consistent with respect to each other [11]. This means that no synchronization operation can be reordered with another synchronization operation and that atomic operations must be executed atomically. In particular, atomic writes must be performed atomically. We can see this in the example of Figure 2.
To prevent a data race, the variables Flag1 and Flag2 must be atomic. Imagine the write Flag1 = 1 in P1 is not executed atomically. Then the read Flag1 == 0 could still return 0, and if the analogous happens for Flag2, we have two threads in the critical section. Executing atomic writes is a performance issue. On multi-core processors the written value must

be propagated to the other cores' caches to keep the cached values consistent. This is realized by a cache coherence protocol. Another issue is how a compiler needs to convert the atomic stores in a program into hardware instructions to prevent the hardware from irregular reordering. Since most instruction sets do not distinguish synchronization operations from other operations, additional fences need to be placed. A relaxation of atomic writes could improve performance. Relaxing atomic writes means allowing one thread to read another thread's write earlier than other threads can. This is possible in multi-core processors where cores share a cache. There have been attempts to relax atomic writes, but it was difficult to formalize and would lead to a more complex interface for the programmer [11]. These attempts, however, pushed processor vendors to provide write instructions that execute atomically.

3.6 Introducing Low-Level Atomics to the C++ Memory Model

In C++11, there are ways to explicitly relax the sequential consistency guarantee. This is achieved by using low-level atomic operations. In code, low-level atomics are recognized when one of the ordering options for the relaxed models from Figure 7, such as memory_order_acquire, is specified for an atomic operation. This section explains how low-level atomics are introduced to the memory model. First, the equivalent of the model introduced in the previous section for sequentially consistent ordering is explained with the relations as defined in the standard. Using these relations, a new definition of a data race is made. Further, the different orderings are explained by stating how the used ordering options contribute to the relations that define a data race.

3.6.1 Ordering Options

The following list explains the meaning of the ordering options for the different operations [7]:

memory_order_relaxed: no operation orders memory.
memory_order_release, memory_order_acq_rel, and memory_order_seq_cst: a store operation performs a release operation on the affected memory location.

memory_order_consume: a load operation performs a consume operation on the affected memory location.

memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst: a load operation performs an acquire operation on the affected memory location.

3.6.2 Sequentially Consistent Ordering

Figure 13 shows an example equivalent to the one provided in Figure 2. Each thread reads the other thread's write.

Figure 13: Sequentially consistent ordering. Each thread reads the other's write atomically [5].

Figure 14: Happens-before relations between operations in the example of Figure 13.

The code in Figure 13 is data race free. This can be shown with the happens-before relation. Thread-internally, happens-before is the same as the sequenced-before relation introduced in Section 3.5.1. The store on x is sequenced before the load of y and therefore also happens before it.

Considering the threads together, the write on x synchronizes with the load on x: the load reads the written value. By the simplified definition, a release operation W on a location synchronizes with an acquire operation that reads the value written by W. Further, it is defined that the synchronizes-with relation contributes to the inter-thread-happens-before relation, which in turn contributes to happens-before [7]. These considerations lead to the happens-before relations shown in Figure 14. Inter-thread-happens-before also combines with the sequenced-before relation: if operation A is sequenced before operation B, and operation B inter-thread happens before operation C, then A inter-thread happens before C.

3.6.3 New Definition of a Data Race

A data race can now be redefined using the happens-before relation as follows: two actions at the same location, on different threads, not related by happens-before, at least one of which is a write [6]. Based on this definition it becomes clear that the example in Figure 13 does not contain a race.

3.6.4 Relaxed Ordering

Relaxed ordering is the weakest ordering model in C++. It only guarantees that all threads agree on the modification order of each variable. The example in Figure 15 shows the same program as in Figure 13, but with relaxed atomic operations. It is possible that the read of x returns 0 although the read of y returns 0 and the write of 1 to y is performed too. This result is possible since there is no imposed ordering between the read and write on the shared variable x. It is important to mention that there is no race in this example, and therefore no undefined behavior, because relaxed atomic operations cannot contribute to data races [17].

Figure 15: Relaxed ordering. Each thread reads the other's write atomically.

Figure 16: Possible outcome for each operation in the program of Figure 15.
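The two faces of relaxed ordering can be sketched as follows (our reconstruction of the Figure 15 pattern; helper names are ours). The first function shows the pattern in which both reads may return 0, yet without any data race; the second shows what relaxed ordering still does guarantee, namely a single agreed modification order per variable:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

// The pattern of Figure 15: with memory_order_relaxed there is no ordering
// between operations on different variables, so r1 == 0 && r2 == 0 is a
// permitted outcome. There is still no data race, because every access is
// atomic.
void relaxed_demo(int& r1, int& r2) {
    x = 0;
    y = 0;
    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
}

// What relaxed ordering guarantees: one modification order per variable.
// Relaxed read-modify-writes on a single counter therefore never lose
// increments, even though they impose no ordering on other variables.
int relaxed_counter(int n) {
    std::atomic<int> c{0};
    auto work = [&] {
        for (int i = 0; i < n; ++i)
            c.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return c.load();
}
```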
As relaxed atomics do not contribute to synchronization, there is no happens-before relation between variables that are shared by threads.

3.6.5 Acquire-Release Ordering

Acquire-release ordering is more relaxed than sequential consistency. There is no guarantee anymore that a read of a location returns the last previously written value. This is achieved by a refinement of synchronizes-with using a release sequence. Additionally, the possible values read by the acquire operation that participates in synchronizes-with are defined by the visible sequence of side effects. Figure 17 shows the pairwise synchronization of acquire-release. One thread writes some data x and then sets a flag y, while the other spins until the flag is set and then reads the data. It should be guaranteed that the receiver sees the data writes of the sender and not any values that precede them in the modification order. With the reads and writes of y annotated as in Figure 17, this is guaranteed. The relations from this example can be seen in Figure 18 (1).
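The sender/receiver pattern of Figure 17 can be sketched as follows (the function name and the value 42 are ours; data and flag correspond to x and y in the figure):

```cpp
#include <atomic>
#include <thread>

int data = 0;              // ordinary, non-atomic payload ("x" in Figure 17)
std::atomic<int> flag{0};  // the flag ("y" in Figure 17)

// The release store on the flag synchronizes-with the acquire load that
// reads it. The sender's plain write to data therefore happens before the
// receiver's read, so the receiver is guaranteed to see data == 42 --
// without a lock and without full sequential consistency.
int run_message_passing() {
    data = 0;
    flag = 0;
    int received = -1;
    std::thread sender([&] {
        data = 42;                                 // plain data write
        flag.store(1, std::memory_order_release);  // publish
    });
    std::thread receiver([&] {
        while (flag.load(std::memory_order_acquire) == 0) {
            // spin until the flag is set
        }
        received = data;  // ordered after the sender's data write
    });
    sender.join();
    receiver.join();
    return received;
}
```

Had both operations on flag been tagged memory_order_relaxed, no synchronizes-with edge would arise and the read of data would be a data race.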

Figure 17: The release operation of the sender on y synchronizes-with the acquire operation on y by the receiver [5]. Figure 18: (1) shows the synchronizes-with relation. (2) shows the synchronizes-with relation with a release sequence [6].

The synchronizes-with edge arises because the acquire operation (c) reads from the release operation (b). Note also that (a) happens-before (d). The release sequence defines that an acquire operation can synchronize with a release (to the same location) that precedes the write it reads from. This is shown in Figure 18 (2). The release (b) synchronizes-with the acquire (d) despite the fact that (d) reads from another write to the same location (c). This is possible since [(b),(c)] is the release sequence of (b) that participates in synchronizes-with. A release sequence is defined as a contiguous sub-sequence of the modification order on the location of the release. The release sequence is headed by the release (e.g. (b)) and can be followed by writes from the same thread (e.g. (c)) or read-modify-writes from any thread (not shown). A more elaborate example that clarifies the read-modify-write addition to the release sequence is shown in the queue example below. Until now we have only refined the synchronizes-with relation to define when two operations synchronize in acquire-release ordering. Which values can be read by the acquire participating in synchronizes-with is defined by the visible sequence of side effects. Consider Figure 19; it is a fragment of the program executed in Figure 18 (2), and it shows the happens-before relations in this program fragment.

Figure 19: Happens-before relations in the program fragment of Figure 18 (2) [6].

A visible sequence of side effects of a read is a contiguous sub-sequence of the modification order, headed by a visible side effect of the read, where the read does not happen before any member of the sequence [7]. In Figure 19 the visible side effect of the acquire operation (d) is (b).
The relaxed write (c) is in the visible sequence of side effects because of its later position in the modification order of y. Also, the relaxed write (c), as a member of the visible sequence of side effects, does not happen before the acquire (d). Therefore the acquire operation (d) reads either the value written in (b) or the one written in (c).

Data Dependency in Acquire-Release Ordering

On multiprocessors with weak memory orders, the release/acquire pairs of acquire-release ordering are cheaper to implement than sequentially consistent atomics [6], but they are still more expensive than plain stores and loads. Multiprocessors such as Power guarantee that certain data dependencies between instructions are respected [6]. This makes an additional synchronization instruction to prevent reordering unnecessary and results in better performance. This is the reason for memory_order_consume in the acquire-release ordering. If the programmer knows that the targeted hardware respects a certain order because of data dependencies, he can introduce an atomic load with memory_order_consume, with less synchronization cost but still enough for the required ordering. An example using memory_order_consume is shown in Figure 20.

Figure 20: Write data and store to a shared atomic pointer p by the sender. The receiver consumes the shared pointer and dereferences it [5]. Figure 21: (1) shows the synchronizes-with relation in release/acquire pairs. (2) shows the dependency-ordered-before (dob) relation between the release operation (b) and the consume operation (c), as well as to (d), because (c) carries-a-dependency-to (d) [6].

The sender writes some data and stores the address of that data to a shared atomic pointer p. The other thread reads the shared pointer, dereferences it and reads the data. The relations that arise in this example are shown in Figure 21. As a comparison, the relations that would arise if memory_order_acquire were used for the read of p (c) are shown in (1). New relations are introduced in (2). The read of p (c) carries-a-dependency-to the read of the data (d). The carries-a-dependency-to relation applies only within a single thread and models data dependency between operations. If the result of an operation A is used as an operand for an operation B, then A carries-a-dependency-to B. If the result of operation A is a value of a scalar type such as an int, then the relationship still applies if the result of A is stored in a variable, and that variable is then used as an operand for operation B [17]. The latter statement holds for the relation between the operations (c) and (d). Carries-a-dependency-to is a transitive relation. The other introduced relation is dependency-ordered-before. Dependency-ordered-before is the release/consume analogue of the release/acquire synchronizes-with. It also involves the release sequence, which is only [(b)] in this example. The possible values read by the consume operation (c) are again defined by the visible sequence of side effects.
The dependency-ordered-before relation also contributes to inter-thread-happens-before and therefore to happens-before, which in turn states that operation (a) happens-before (d).

Example: Reading Values From a Queue

As defined above, a release can synchronize with an acquire even though the acquire does not read from the release, as long as the value read belongs to the release sequence. Other threads can participate in the release sequence if read-modify-write operations are executed between the release and the acquire of the synchronizes-with relation. An example with read-modify-write operations in the release sequence is shown in Listing 2. One thread populates the queue, whereupon several threads can read items from it. The atomic variable count holds the number of items in the shared queue. On line 8 the initial store of the item count is done, to let the other threads know that data is available. The consumer threads that access the shared queue claim one item each (line 16). This is done with the read-modify-write operation fetch_sub with acquire semantics, before the shared queue is accessed. fetch_sub atomically reads count, subtracts one, and writes the modified value back to count; the returned value is the value read before the modification. A returned value of 0 or less forces the consumer to wait for new items (line 18); otherwise item_index is used to get the item from the queue (line 21).

 1  std::vector<int> queue_data;
 2  std::atomic<int> count;
 3
 4  void populate_queue()
 5  {
 6      unsigned const number_of_items = 20;
 7      // fill the queue with numbers from 0 to 19
 8      count.store(number_of_items, std::memory_order_release);
 9  }
10
11  void consume_queue_items()
12  {
13      while (true)
14      {
15          int item_index;
16          if ((item_index = count.fetch_sub(1, std::memory_order_acquire)) <= 0)
17          {
18              wait_for_more_items();
19              continue;
20          }
21          process(queue_data[item_index - 1]);
22      }
23  }

Listing 2: Reading from a queue with atomic operations [17]. The full example can be seen in Listing 3 in the appendix.

Figure 22 shows the relations between the operations in an execution with one thread populating the queue and two threads consuming queue items. The dotted lines show the release sequence and the solid lines show the happens-before relationships. The first fetch_sub participates in the release sequence, and therefore the release synchronizes-with the second fetch_sub. Note also that the second fetch_sub must return the value written by the first fetch_sub, although both the release and the first acquire are in the visible sequence of side effects. This is defined by write-read coherence, which prevents a read from returning a value that is hidden by a happens-before-later write in the modification order [6]. In our case, the release (write) is hidden from the second acquire, since the release happens before the second acquire and the first acquire occurs later in the modification order of count.

Figure 22: The release sequence for the queue operations from Listing 2 [17].

4. Conclusion

When writing multi-threaded applications it is crucial to prevent data races. Without ordering options, sequentially consistent execution is guaranteed by the memory model defined in C++11. The programmer does not need to care about possible compiler optimizations or the hardware model of the targeted system. Data races can be detected simply by thinking of an interleaved execution of the threads' instructions and analyzing whether two conflicting operations are adjacent. If sequentially consistent execution with locks and sequentially consistent atomics cannot satisfy the performance requirements, one can adapt the code and use the low-level atomics. This is considered an expert-only feature, as low-level atomics are hard to use correctly. Nevertheless, non-experts can profit indirectly by using libraries written with these facilities [11].

A. Reading from a Queue with Atomic Operations

#include <atomic>
#include <thread>
#include <vector>

std::vector<int> queue_data;
std::atomic<int> count;

// Stubs so the example is self-contained; a real program would block or
// yield in wait_for_more_items() and do actual work in process().
void wait_for_more_items() {}
void process(int item) {}

void populate_queue()
{
    unsigned const number_of_items = 20;
    queue_data.clear();
    for (unsigned i = 0; i < number_of_items; ++i)
    {
        queue_data.push_back(i);
    }
    count.store(number_of_items, std::memory_order_release);
}

void consume_queue_items()
{
    while (true)
    {
        int item_index;
        if ((item_index = count.fetch_sub(1, std::memory_order_acquire)) <= 0)
        {
            wait_for_more_items();
            continue;
        }
        process(queue_data[item_index - 1]);
    }
}

int main()
{
    std::thread a(populate_queue);
    std::thread b(consume_queue_items);
    std::thread c(consume_queue_items);
    a.join();
    b.join();
    c.join();
}

Listing 3: Reading from a queue with atomic operations [17].

References

[1] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29:66-76, 1996.
[2] Sarita V. Adve and Mark D. Hill. Weak ordering - a new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA '90, pages 2-14, New York, NY, USA, 1990. ACM.
[3] Sarita V. Adve and Hans-J. Boehm. Memory models: A case for rethinking parallel languages and hardware. Communications of the ACM, 53(8):90-101, 2010.
[4] Sarita Vikram Adve. Designing memory consistency models for shared-memory multiprocessors. PhD thesis, University of Wisconsin at Madison, Madison, WI, USA. UMI Order No. GAX
[5] Mark Batty. Mathematizing C++ concurrency. batty.pdf
[6] Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. Mathematizing C++ concurrency. SIGPLAN Not., 46(1):55-66, January 2011.
[7] Pete Becker. C++ standard working draft. /n3242.pdf, 2011.
[8] Hans-J. Boehm. Threads cannot be implemented as a library. In PLDI '05. ACM Press, 2005.
[9] Hans-J. Boehm. Reordering constraints for pthread-style locks. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '07, New York, NY, USA, 2007. ACM.
[10] Hans-J. Boehm. Towards a memory model for C++. Hans_Boehm/misc_slides/boehm-accu.pdf
[11] Hans-J. Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model. In PLDI '08, 2008.
[12] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. SIGARCH Comput. Archit. News, 18(3a):15-26, May 1990.
[13] Hans-J. Boehm and Paul McKenney. Programming with threads: Questions frequently asked by C and C++ programmers. html
[14] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690-691, September 1979.
[15] Jeremy Manson, William Pugh, and Sarita V. Adve. The Java memory model. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '05, New York, NY, USA, 2005. ACM.
[16] Bjarne Stroustrup. The Design and Evolution of C++. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1994.
[17] A. Williams. C++ Concurrency in Action: Practical Multithreading. Manning, 2012.


More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models Review. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed? The safety net between operations whose order needs

More information

Algorithmic "imperative" language

Algorithmic imperative language Algorithmic "imperative" language Undergraduate years Epita November 2014 The aim of this document is to introduce breiy the "imperative algorithmic" language used in the courses and tutorials during the

More information

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department

More information

Scalable Correct Memory Ordering via Relativistic Programming

Scalable Correct Memory Ordering via Relativistic Programming Scalable Correct Memory Ordering via Relativistic Programming Josh Triplett Portland State University josh@joshtriplett.org Philip W. Howard Portland State University pwh@cs.pdx.edu Paul E. McKenney IBM

More information

P1202R1: Asymmetric Fences

P1202R1: Asymmetric Fences Document number: P1202R1 Date: 2018-01-20 (pre-kona) Reply-to: David Goldblatt Audience: SG1 P1202R1: Asymmetric Fences Overview Some types of concurrent algorithms can be split

More information

Lowering C11 Atomics for ARM in LLVM

Lowering C11 Atomics for ARM in LLVM 1 Lowering C11 Atomics for ARM in LLVM Reinoud Elhorst Abstract This report explores the way LLVM generates the memory barriers needed to support the C11/C++11 atomics for ARM. I measure the influence

More information

7/6/2015. Motivation & examples Threads, shared memory, & synchronization. Imperative programs

7/6/2015. Motivation & examples Threads, shared memory, & synchronization. Imperative programs Motivation & examples Threads, shared memory, & synchronization How do locks work? Data races (a lower level property) How do data race detectors work? Atomicity (a higher level property) Concurrency exceptions

More information

Coherence and Consistency

Coherence and Consistency Coherence and Consistency 30 The Meaning of Programs An ISA is a programming language To be useful, programs written in it must have meaning or semantics Any sequence of instructions must have a meaning.

More information

C++ Memory Model. Don t believe everything you read (from shared memory)

C++ Memory Model. Don t believe everything you read (from shared memory) C++ Memory Model Don t believe everything you read (from shared memory) The Plan Why multithreading is hard Warm-up example Sequential Consistency Races and fences The happens-before relation The DRF guarantee

More information

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t Data Reduction - an Adaptation Technique for Mobile Environments A. Heuer, A. Lubinski Computer Science Dept., University of Rostock, Germany Keywords. Reduction. Mobile Database Systems, Data Abstract.

More information

Motivation & examples Threads, shared memory, & synchronization

Motivation & examples Threads, shared memory, & synchronization 1 Motivation & examples Threads, shared memory, & synchronization How do locks work? Data races (a lower level property) How do data race detectors work? Atomicity (a higher level property) Concurrency

More information

Threads and Locks. Chapter Introduction Locks

Threads and Locks. Chapter Introduction Locks Chapter 1 Threads and Locks 1.1 Introduction Java virtual machines support multiple threads of execution. Threads are represented in Java by the Thread class. The only way for a user to create a thread

More information

Sequential Consistency & TSO. Subtitle

Sequential Consistency & TSO. Subtitle Sequential Consistency & TSO Subtitle Core C1 Core C2 data = 0, 1lag SET S1: store data = NEW S2: store 1lag = SET L1: load r1 = 1lag B1: if (r1 SET) goto L1 L2: load r2 = data; Will r2 always be set to

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Concurrent Objects and Linearizability

Concurrent Objects and Linearizability Chapter 3 Concurrent Objects and Linearizability 3.1 Specifying Objects An object in languages such as Java and C++ is a container for data. Each object provides a set of methods that are the only way

More information

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4 Algorithms Implementing Distributed Shared Memory Michael Stumm and Songnian Zhou University of Toronto Toronto, Canada M5S 1A4 Email: stumm@csri.toronto.edu Abstract A critical issue in the design of

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Introduction to Parallel Programming Part 4 Confronting Race Conditions

Introduction to Parallel Programming Part 4 Confronting Race Conditions Introduction to Parallel Programming Part 4 Confronting Race Conditions Intel Software College Objectives At the end of this module you should be able to: Give practical examples of ways that threads may

More information

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

Safe Non-blocking Synchronization in Ada 202x

Safe Non-blocking Synchronization in Ada 202x Safe Non-blocking Synchronization in Ada 202x Johann Blieberger 1 and Bernd Burgstaller 2 1 Institute of Computer Engineering, Automation Systems Group, TU Wien, Austria 2 Department of Computer Science,

More information

Portland State University ECE 588/688. Memory Consistency Models

Portland State University ECE 588/688. Memory Consistency Models Portland State University ECE 588/688 Memory Consistency Models Copyright by Alaa Alameldeen 2018 Memory Consistency Models Formal specification of how the memory system will appear to the programmer Places

More information

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Global Scheduler. Global Issue. Global Retire

Global Scheduler. Global Issue. Global Retire The Delft-Java Engine: An Introduction C. John Glossner 1;2 and Stamatis Vassiliadis 2 1 Lucent / Bell Labs, Allentown, Pa. 2 Delft University oftechnology, Department of Electrical Engineering Delft,

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

task object task queue

task object task queue Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu

More information

CS 5220: Shared memory programming. David Bindel

CS 5220: Shared memory programming. David Bindel CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy

More information

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995 ICC++ Language Denition Andrew A. Chien and Uday S. Reddy 1 May 25, 1995 Preface ICC++ is a new dialect of C++ designed to support the writing of both sequential and parallel programs. Because of the signicant

More information

The Java Memory Model

The Java Memory Model The Java Memory Model The meaning of concurrency in Java Bartosz Milewski Plan of the talk Motivating example Sequential consistency Data races The DRF guarantee Causality Out-of-thin-air guarantee Implementation

More information

A Case for System Support for Concurrency Exceptions

A Case for System Support for Concurrency Exceptions A Case for System Support for Concurrency Exceptions Luis Ceze, Joseph Devietti, Brandon Lucia and Shaz Qadeer University of Washington {luisceze, devietti, blucia0a}@cs.washington.edu Microsoft Research

More information

An introduction to weak memory consistency and the out-of-thin-air problem

An introduction to weak memory consistency and the out-of-thin-air problem An introduction to weak memory consistency and the out-of-thin-air problem Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) CONCUR, 7 September 2017 Sequential consistency 2 Sequential

More information

Adding std::*_semaphore to the atomics clause.

Adding std::*_semaphore to the atomics clause. Doc number: P0514R1 Revises: P0514R0 Date: 2017-6-14 Project: Programming Language C++, Concurrency Working Group Reply-to: Olivier Giroux Adding std::*_semaphore to the atomics clause.

More information

Memory model for multithreaded C++: Issues

Memory model for multithreaded C++: Issues Document Number: WG21/N1777=J16/05-0037 Date: 2005-03-04 Reply to: Hans Boehm Hans.Boehm@hp.com 1501 Page Mill Rd., MS 1138 Palo Alto CA 94304 USA Memory model for multithreaded C++: Issues Andrei Alexandrescu

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster Contents Overview Uniprocessor Review Sequential Consistency Relaxed

More information

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section

More information

3. Memory Persistency Goals. 2. Background

3. Memory Persistency Goals. 2. Background Memory Persistency Steven Pelley Peter M. Chen Thomas F. Wenisch University of Michigan {spelley,pmchen,twenisch}@umich.edu Abstract Emerging nonvolatile memory technologies (NVRAM) promise the performance

More information

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this

More information

Implementing the C11 memory model for ARM processors. Will Deacon February 2015

Implementing the C11 memory model for ARM processors. Will Deacon February 2015 1 Implementing the C11 memory model for ARM processors Will Deacon February 2015 Introduction 2 ARM ships intellectual property, specialising in RISC microprocessors Over 50 billion

More information