C++ Memory Model

Martin Kempf

December 26, 2012
Abstract

Multi-threaded programming is increasingly important. We need parallel programs to take advantage of multi-core processors, and those are likely to be the main source of improved performance, since the number of cores in processors will grow instead of the clock rate. Prior to C++11, multi-threading in C++ was supported only by libraries and the language was specified as a single-threaded language. With the memory model introduced in C++11, threading support is integrated into the language specification, which exactly defines the behavior of multi-threaded applications. Furthermore, easy-to-use atomic operations are introduced to allow the implementation of lock-free algorithms and therefore a more performance-conscious way of programming. This paper gives an introduction on why a memory model was needed. The model is first explained with all atomic operations executing sequentially consistently, followed by the enhanced model with atomic operations that relax the sequential consistency guarantee. Examples are provided that illustrate the different relaxation options.

1. Introduction

Many parallel programs are written using threads and shared variables, and it is likely that this parallel programming model will stay popular for several reasons. Direct hardware support for shared memory is a performance advantage. For example, values read mostly by different instances are implicitly shared in memory and available to all instances without the need for replication. And in case of recognized bottlenecks in an application, parallelism with threads can be introduced without a complete redesign of data structures [3]. An important part of shared-memory parallelism is the memory model, or memory consistency model. Prior to C++11, multi-threaded applications in C++ were written using libraries for threading support; the language itself was specified as a single-threaded language.
The execution of multi-threaded applications, programmed with a single-threaded language and a library for threading support, was based on an agreement between compiler and hardware. This also affected portability. With the new C++11 standard, threading support is part of the language specification and a memory model is defined. This gives guaranteed behavior in multi-threaded programs and better portability.

1.1 What is a Memory Model

When multiple threads can access the same memory location in parallel, it must be specified which set of values a read can return. A memory model specifies concurrency semantics in shared-memory programs. The necessity can be shown by considering a quad-core processor where the caches are not shared between the cores. Running a multi-threaded application on this hardware, the following instruction ordering is possible: a thread 1 stores a value to a certain memory location. Afterwards, another thread 2 loads a value from the same memory location where the store was

performed previously by thread 1. Additionally, thread 1 is executed by core A and thread 2 by core B. The scenario is shown in Figure 1. Which value is read (seen) by thread 2? Can it read the value stored by thread 1 although the cores do not share the cache? The question is whether there is enough synchronization to ensure that one thread's write will occur before another's read. The compiler as well as the processor can reorder instructions in order to increase performance. These reorderings can also unintentionally influence the behavior of a program, especially in multi-threaded programs (see the example in Section 1.2).

Figure 1: Load and store of a shared variable x, performed by different threads on cores that do not share the cache.

A memory model can therefore also be seen as a contract between the program and any hardware or software that transforms the program. It restricts reordering, and the compiler as well as the processor must agree on this. The design of a memory model involves a tension between performance and usability. A strong memory model, such as sequential consistency (see Section 1.2), restricts many reorderings and therefore prevents hardware and software optimizations, which reduces performance. But it also simplifies reasoning about programs.

1.2 Sequential Consistency

Sequential consistency is the most restrictive memory model but also the most intuitive one. It was introduced by Lamport [14] as follows: Hardware is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. For an application of this definition we consider the example in Figure 2. Additionally, we interpret the term operation as memory operation and the term result as the set of values read and the final value of the execution [2].
By some sequential order we understand all interleavings of memory operations that are possible combinations of instructions 1 to 4. Additionally, in these different possible executions, every read operation returns the previously written value. The latter criterion enforces that the memory operations appear to execute atomically, hence without interleaving with respect to other memory operations. Under these considerations, the value of Flag2 in instruction 2 and the value of Flag1 in instruction 4 cannot both be 0.

Figure 2: For sequential consistency, all memory accesses appear to execute atomically in some total order and program order is maintained among operations of each processor [1].

Figure 3: A store buffer violates sequential consistency. t1, t2, ... indicate the order in which the corresponding memory operations execute at memory [1].
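Figure 2 is not reproduced here, but from the text it is the classic Dekker-style flag pattern on Flag1 and Flag2. A minimal C++11 reconstruction is sketched below (the helper name run_dekker_attempt is ours, not from the paper); with default, i.e. sequentially consistent, atomics the outcome where both threads read 0 is forbidden:

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> Flag1{0}, Flag2{0};

// Runs the two threads of Figure 2 once and returns the values each
// thread read from the other's flag. Default atomic operations are
// memory_order_seq_cst, so (0, 0) is not a possible result.
std::pair<int, int> run_dekker_attempt() {
    Flag1 = 0;
    Flag2 = 0;
    int r1 = -1, r2 = -1;
    std::thread p1([&] {
        Flag1.store(1);     // instruction 1
        r1 = Flag2.load();  // instruction 2
    });
    std::thread p2([&] {
        Flag2.store(1);     // instruction 3
        r2 = Flag1.load();  // instruction 4
    });
    p1.join();
    p2.join();
    return {r1, r2};
}
```

With non-atomic variables the same pattern would be a data race; with the relaxed orderings introduced later in the paper, the both-zero outcome becomes possible again.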

Most modern processors use a store buffer to avoid waiting for stores to complete [3]. Such a scenario is shown in Figure 3. It shows the execution of the example in Figure 2 using a store buffer. Each processor can buffer its write and allow the subsequent read to bypass it. Therefore the read operations are performed before the writes are applied to memory, allowing both reads to return 0. This violates sequential consistency.

1.3 Software and Hardware Memory Models

Both hardware and software that transform a program must agree on the memory model. Hardware models are often more relaxed than software models. It is therefore not sufficient to prevent the compiler from any reordering and make it transform every read and write of variables into simple load/store hardware instructions. A weaker hardware model would then allow reordering of these loads and stores, as seen in Figure 3, and violate the software memory model. The solution to this problem is to let the compiler insert memory fences that prevent the hardware from performing violating reorderings. A full fence ensures that all read and write operations executed before the fence are synchronized before the memory operations executed after the fence. The hardware memory model then restricts reordering when fences are involved. Sometimes it is inappropriate to insert a full fence, for example if only two threads need to be synchronized. In this case, not all threads need to be updated with respect to the synchronizing variable. Mapping a software memory model to a hardware memory model with optimal performance is a difficult issue [11] [3].

1.4 About the Report

To explain the details of the C++ memory model, the paper is structured as follows. In Section 2, the motivating reasons for introducing the C++ memory model are stated, followed by the explanation of the C++ memory model.
In Section 3.5 the memory model is explained for sequentially consistent execution. It is a simpler explanation of the equivalent model described in the standard [7] for sequentially consistent execution. Section 3.6 explains the model of the standard that allows relaxing the sequential consistency guarantee.

2. Motivation for the C++ Memory Model

C++ was designed to leave concurrency support to additional libraries. This design decision was motivated by the wish to support different concurrency models (multi-threading, multi-process, distributed), not just one [16]. A library that adds threading support and synchronization primitives is Pthread. It has been shown that extending C++ with threading support by a library works in most cases, but not in every case, and also affects portability [8]. In this section we explain why the library approach almost worked, but also where it fails and thus motivated a formalized memory model in the language specification.

2.1 The Library Approach with Pthread

The rule for writing multi-threaded applications using the Pthread library is as follows: Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads... Two of the functions mentioned in this quote are pthread_mutex_lock and pthread_mutex_unlock. Disciplined use of these synchronization operations is required to prevent one thread from modifying a memory location while another thread modifies the same memory location. Implementations of the Pthread standard proceed as follows [8]:

1. The implementation of the synchronization functions such as pthread_mutex_lock must guarantee that the memory is synchronized. This is realized by adding memory fences that are specific to the underlying hardware. The fences therefore preclude a hardware reordering of memory operations around calls to synchronization operations of Pthread. This is shown in Figure 4.

2. The compiler must also prevent the reordering seen in Figure 4. The compiler therefore treats calls to functions such as pthread_mutex_lock as calls to opaque functions. This means the compiler has no information about the function and assumes that it may read and write any global variable. This assumption prevents the compiler from simply moving memory operations around the calls.

Figure 4: Implementations of Pthread must prevent the reordering of memory operations around synchronization operations such as pthread_mutex_lock and pthread_mutex_unlock [10].

With this rule and these implementation considerations we can state the following:

1. The semantics of programs with races is undefined. (The definition of a race is not made in Pthread; see Section 2.2.1.)

2. Synchronization-free code can be optimized as though it were single-threaded.

3. It is easy to write wrong code, for example by an incorrect ordering of pthread_mutex_lock and pthread_mutex_unlock.

2.2 Possible Implications of the Library Approach

In [8], three issues are listed where the library approach as realized in Pthread can fail. We discuss one in detail, from which it becomes clear that a memory model must be defined in a language specification, and also mention the other points.

2.2.1 Concurrent Modification

According to the Pthread rule "...no thread of control can read or modify a memory location while another thread of control may be modifying it...", we need to know when a concurrent modification (data race) can happen in order to decide where to put synchronization operations.
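The Pthread discipline described above carries over directly to C++11, whose std::mutex is typically implemented on top of pthread_mutex_lock/pthread_mutex_unlock on POSIX systems. A minimal sketch of the rule (the counter example and helper name are ours) is:

```cpp
#include <mutex>
#include <thread>

int counter = 0;           // shared memory location
std::mutex counter_mutex;  // plays the role of a pthread_mutex_t

// Each thread holds the lock while modifying the shared location, as the
// Pthread rule demands: no thread may read or modify the location while
// another thread may be modifying it.
int run_counter_demo(int increments_per_thread) {
    counter = 0;
    auto work = [&] {
        for (int i = 0; i < increments_per_thread; ++i) {
            std::lock_guard<std::mutex> guard(counter_mutex);  // lock()/unlock()
            ++counter;
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

Removing the lock_guard would make the two increments a concurrent modification in the sense of the quoted rule, with undefined outcome.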
Data races can only be detected if the semantics of the programming language is defined, which in turn is only possible with a properly defined memory model. There is a circularity in the definition [15]. Consider the example in Figure 5. Each thread executes the listed statements. Under sequentially consistent execution, there is no race; hence the outcome x==1 and y==1 is not possible, as neither variable can ever become non-zero. But according to the Pthread approach, a compiler may freely reorder memory operations that are free of synchronization operations.
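Figures 5 and 6 are not reproduced here; the following is a plausible reconstruction of the pattern the text describes (conditional cross-stores, then a speculative transformation), with function names of our choosing:

```cpp
// Shared variables; both start at 0.
int x = 0, y = 0;

// Our guess at Figure 5: under sequentially consistent execution neither
// condition can ever hold, so neither variable ever becomes non-zero and
// there is no concurrent modification.
void thread1_original() { if (x == 1) y = 1; }
void thread2_original() { if (y == 1) x = 1; }

// A transformation in the spirit of Figure 6: the compiler speculates that
// the condition holds, stores early, and undoes the store otherwise. This
// is valid for single-threaded code, but when two threads run the
// transformed functions, the speculative store of one thread can be
// observed by the other, so x == 1 && y == 1 becomes reachable: the
// transformation has introduced a race.
void thread1_transformed() {
    int old = y;
    y = 1;
    if (x != 1) y = old;
}
```

In a thread-aware language definition this transformation is forbidden for the original code, because it introduces a store that the source program never performs.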

Figure 5: Is there a concurrent modification? Do we need synchronization operations? [8].

Figure 6: Possible reordering of the code in Figure 5 by the compiler, caused by speculative execution [8].

This can result in code as shown in Figure 6. This transformation is based on speculative execution, assuming that x==1 and y==1 mostly hold in the conditional statements. Considering this transformed code again under sequential consistency, the outcome x==1 and y==1 is possible and would imply a race. It is therefore shown that the Pthread approach together with a thread-unaware language definition can lead to transformations that introduce a race. This problem can be solved by a programming-language-defined and compiler-respected memory model that ensures that the user and the compiler agree on where there is a data race.

2.2.2 Memory Location and Register Promotion

The two additional reasons for a thread-aware language definition are both founded on thread-unaware transformations by the compiler. There must be a clear definition of what a memory location is. Based on this definition, compiler optimization rules can be derived that prevent the compiler from introducing implicit writes to adjacent memory locations that may cause a race. The analogous can be said of register promotion. In register promotion, values are kept in registers instead of being reloaded many times from memory. This may introduce additional reads and writes outside of locks and may introduce a race, the outcome of which is undefined. Both of the issues mentioned here have been observed in practice [10].

2.3 Conclusion

The possible issues that may arise from language semantics that are not properly defined with respect to threads were the motivating reasons to introduce a well-defined memory model for C++. The usage of libraries such as Pthread works mostly well, but is based on conventions and relies heavily on the used compiler and target hardware.
This approach cannot guarantee continued correctness and portability of the application as the compiler evolves and more aggressive optimizations are added, or as the application is moved to a different compiler [13]. Another motivating point was performance. In [8] it is shown that lock-free code (code that makes no use of lock and unlock calls) that uses atomic operations instead leads to a significant performance increase. Although not every part of a multi-threaded application can be written without locking, it can affect the overall performance, and making atomic operations easily available in C++ was therefore kept as a goal.

3. The C++ Memory Model

The standardized memory model is a data-race-free model. This means that data races still imply undefined behavior. Further, C++ provides simple use of atomic operations through atomics. The atomics can be tagged with different ordering options, resulting in different memory-ordering models. Atomics tagged for sequentially consistent execution belong to the sequentially consistent model. Atomics with ordering options other than sequentially consistent belong to more relaxed models and improve performance. Atomics with ordering options for relaxed models are referred to as low-level atomics. Section 3.1 gives an overview of the ordering options and models. The model with sequentially consistent atomics is explained in Section 3.5, the enhanced model supporting the low-level atomics in Section 3.6.

3.1 Model Overview

Figure 7 shows the different ordering models available in the C++ memory model together with the ordering options that define the used model.

Figure 7: Possible ordering models in C++ with the ordering options that lead to the used model.

The distinct memory models can have varying costs on different CPU architectures [17]. Consider an architecture with fine-grained control over the visibility of operations to processors other than the one that made the change. On such an architecture, additional synchronization might be needed for sequentially consistent ordering compared to acquire-release or relaxed ordering, and for acquire-release ordering compared to relaxed ordering. The impact of additional synchronization instructions on the overall performance is bigger on systems that have many processors [17].

3.2 Basics

The basics of the C++ memory model are explained in this section. They are valid for sequentially consistent atomics and for low-level atomics.
These basics introduce how objects and values are organized in memory, as well as the concept of modification order.

3.3 Objects and Memory Location

A memory model comprises two aspects: the structural and the concurrency aspect. The structural aspect is about how objects and values are organized in memory. It is the basis for deciding whether there is potential for concurrent access. During the execution of a program, memory is accessed. These memory operations access a certain memory location. Figure 8 shows the organization of a struct with its subobjects and simple fields in memory. Each subobject is a data member of the struct and occupies one or several memory locations. The bit fields bf1 and bf2 share a memory location and the std::string object s consists of several memory locations. The other members of scalar type occupy their own

memory location. Using a zero-length bit field, the sequence of bit fields is separated and each bit field gets its own memory location. This can be seen with bf3 and bf4.

Figure 8: The division of a struct into objects and memory locations [17].

3.4 Modification Order

Every object in C++ has a defined modification order that includes all writes to the object from all threads in the program. The modification order may vary between executions, but in a specific execution all threads agree on the modification order of each variable. This is what the programmer must ensure, with more or less synchronization effort depending on the chosen ordering model from Section 3.1. While the modification order of each variable is agreed on by every thread, it cannot be said for every ordering model that program statement number 7 from Figure 9 reads the value stored by program statement number 2.

Figure 9: There is one specific modification order for each variable in the execution of a program.

3.5 The C++ Memory Model Without Low-Level Atomics

In this section we provide an explanation of the C++ memory model with sequentially consistent semantics. This is the default ordering model. If the term atomic is used in this section, we always refer to sequentially consistent atomics. The focus in this section lies on the definition of a sequentially consistent execution order, the definition of a data race, and the behavior of a C++

program in case of a data race. The allowed compiler optimizations are mentioned and there is also a short look at implementation issues of this model.

3.5.1 Sequentially Consistent Execution

A sequentially consistent execution of a program is possible even if the chosen programming language does not implement the sequential consistency memory model. This section explains the constraints on the order of memory actions in a multi-threaded program that yield a sequentially consistent execution. These constraints leave room for some compiler optimizations, which are explained in Section 3.5.4. The execution order in a single thread is defined by the sequenced-before relation. Referring to Listing 1, the evaluation of the arguments to subtract on line 14 is sequenced before the expression in the body of the called subtract function. The argument evaluation shall therefore precede the execution of subtract. In general, a full statement is sequenced before the next statement, while the operand evaluations of an operator are unsequenced, meaning neither operand is sequenced before the other. Exceptions are the built-in comma operator ',' and the logical operators '&&' and '||', where the operand evaluation is sequenced [7].

 1 #include <iostream>
 2
 3 int subtract(int a, int b) {
 4     return a - b;
 5 }
 6
 7 int get_num() {
 8     static int i = 0;
 9     i++;  // side effect; which of the two calls in main runs first is unspecified
10     return i;
11 }
12
13 int main() {
14     int sub = subtract(get_num(), get_num());
15     std::cout << sub;
16 }

Listing 1: The evaluation of the arguments to subtract is sequenced before the execution of the function body of subtract on line 4, but the evaluation order of the two arguments to subtract is unspecified. The two calls to get_num are indeterminately sequenced, so sub is either 1 or -1; had the increments appeared directly in the argument expressions (e.g. subtract(i++, i++)), the behavior would be undefined [17].
If a side effect on a scalar object is unsequenced relative to either another side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined [7]. The evaluation order of a and b in the body of subtract in Listing 1 is also not defined, but there is no undefined behavior, since the evaluation of a and b involves no side effect on the same object. To conclude, the sequenced-before relation partially orders the execution of memory actions in a single-threaded program and therefore also determines which value is read from a variable. Memory actions can be divided into two categories:

As synchronization operations we define the actions lock, unlock, atomic load, atomic store, and atomic read-modify-write. All these actions can be used to communicate between threads, hence the category name.

As data operations, sometimes also called ordinary data operations, we define the non-atomic actions store and load.

The sequentially consistent execution of a multi-threaded program can now be defined using the following constraints regarding the thread-internal order of actions and a total order <T [7]:

1. The execution of each thread must be internally consistent. This means that reorderings of actions are allowed as long as they still maintain a correct sequential execution with respect to the values read and with respect to the sequenced-before ordering. For example, optimizations that are inconsistent with the sequenced-before ordering are not allowed.

2. The total order is consistent with the sequenced-before ordering, i.e. if a is sequenced before b, then a <T b.

3. Each load, lock, and read-modify-write operation reads the value of the last preceding write to the same location according to the total order. The last operation on a given lock preceding an unlock must be a lock operation performed by the same thread.

This effectively leads to a total order in which the actions of the different threads are just interleaved.

3.5.2 Data Race

Two operations conflict if they access the same memory location and at least one of them is a store, atomic store, or atomic read-modify-write operation. A data race can further be defined as follows: two memory operations from different threads form a data race if they conflict, at least one of them is a data operation, and the two operations are adjacent in the total order [11]. Based on this definition and considering the example in Figure 2, the store action Flag1 = 1 of P1 and the load action Flag1 == 0 of P2 form a data race.

3.5.3 Data Race Free Model

The C++ memory model is a data-race-free model. This means that in case of a data race as defined in the previous Section 3.5.2, the behavior of the program is undefined. Otherwise, the program (on the same input) behaves according to one of its sequentially consistent executions [11]. This allows for some important hardware and compiler optimizations, but also restricts reordering around synchronization operations.
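The distinction between data operations and synchronization operations can be made concrete with a small sketch (the variable and function names are ours). Two unsynchronized non-atomic writes to the same location form a data race and hence undefined behavior; replacing the location with an atomic turns the accesses into synchronization operations and removes the race:

```cpp
#include <atomic>
#include <thread>

int plain = 0;                // non-atomic shared location
std::atomic<int> guarded{0};  // atomic shared location

// Running this from two threads at once would be a data race: two
// conflicting data operations, not ordered by any synchronization.
// Under the data-race-free model the behavior would be undefined.
void racy_writer() { plain = plain + 1; }

// fetch_add is an atomic read-modify-write, i.e. a synchronization
// operation, so two threads executing it concurrently form no data race
// and no increment is lost.
int run_race_free(int increments_per_thread) {
    guarded = 0;
    auto work = [&] {
        for (int i = 0; i < increments_per_thread; ++i)
            guarded.fetch_add(1);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return guarded.load();
}
```

Note that the race-free program is guaranteed to behave like one of its sequentially consistent executions, which is what makes the result predictable.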
The allowed optimizations are explained in the next section.

3.5.4 Optimizations Allowed by the Model

To understand the possible optimizations, some important terms are clarified first. Figure 10 shows the terms applied.

Figure 10: Lock operations are operations on the lock object. The synchronization operations lock() and unlock() use them to prevent multiple threads from being in a critical section.

The term synchronization operation is refined into read synchronization operations, consisting of lock() and atomic reads, and write synchronization operations, consisting of unlock() and atomic writes. A lock() and an unlock() are implemented using a lock object. The operations on this lock object are called lock operations and can be either writes or reads. These lock operations effectively prevent multiple threads from accessing a critical section. As long as the intra-thread semantics is maintained, two operations M1 and M2, where M1 is sequenced before M2, can be freely reordered by compiler and hardware if [4] [2] [12] [9]:

1. M1 is a data operation and M2 is a read synchronization operation, or

2. M1 is a write synchronization operation and M2 is a data operation, or

3. M1 and M2 are both data operations with no synchronization operation sequence-ordered between them.

Figure 11 shows examples of possible reorderings around synchronization operations.

Figure 11: Allowed reorderings around synchronization operations in the described model.

Additionally, reorderings around lock operations are allowed if lock() and unlock() are used in well-structured ways, i.e. when they prevent data races and do not form a deadlock. The reordering of M1 sequenced before M2 is safe (assuming the intra-thread semantics is not affected) [4] if:

1. M1 is a data operation and M2 is the write of a lock operation, or

2. M1 is an unlock and M2 is either a read or a write of a lock.

Figure 12 shows examples of the possible reorderings around lock operations.

Figure 12: Allowed reorderings around lock operations in the described model.

3.5.5 The Way to Sequentially Consistent Atomics

The model requires that synchronization operations appear sequentially consistent with respect to each other [11]. This means that no synchronization operation can be reordered with another synchronization operation and that atomic operations must be executed atomically. In particular, atomic writes must be performed atomically. We can see this in the example of Figure 2.
To prevent a data race, the variables Flag1 and Flag2 must be atomic. Imagine the write Flag1 = 1 in P1 is not executed atomically. Then the read Flag1 == 0 could still return 0, and if the analogous happens for Flag2, we have two threads in the critical section. Executing atomic writes is a performance issue. On multi-core processors the written value must

be propagated to the other cores' caches to keep the cached values consistent. This is realized by a cache coherence protocol. Another issue is how a compiler needs to convert the atomic stores in a program into hardware instructions to prevent the hardware from irregular reordering. Since most instruction sets do not distinguish synchronization operations from other operations, additional fences need to be placed. A relaxation of atomic writes could improve performance. Relaxing atomic writes means allowing one thread to read another thread's write earlier than other threads can. This is possible in multi-core processors where cores share a cache. There have been attempts to relax atomic writes, but it was difficult to formalize and would lead to a more complex interface for the programmer [11]. These attempts, however, pushed processor vendors to provide write instructions that execute atomically.

3.6 Introducing Low-Level Atomics to the C++ Memory Model

In C++11, there are ways to explicitly relax the sequential consistency guarantee. This is achieved by using low-level atomic operations. In code, low-level atomics are recognized when one of the ordering options for the relaxed models from Figure 7, such as memory_order_acquire, is specified for an atomic operation. This section explains how low-level atomics are introduced to the memory model. First, the equivalent of the model introduced in the previous section for sequentially consistent ordering is explained with the relations as defined in the standard. Using these relations, a new definition of a data race is made. Further, the different orderings are explained by stating how the used ordering options contribute to the relations that define a data race.

3.6.1 Ordering Options

The following list explains the meaning of the ordering options for the different operations [7]:

memory_order_relaxed: no operation orders memory.
memory_order_release, memory_order_acq_rel, and memory_order_seq_cst: a store operation performs a release operation on the affected memory location.

memory_order_consume: a load operation performs a consume operation on the affected memory location.

memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst: a load operation performs an acquire operation on the affected memory location.

3.6.2 Sequentially Consistent Ordering

Figure 13 shows an example equivalent to the one provided in Figure 2. Each thread reads the other thread's write.

Figure 13: Sequentially consistent ordering. Each thread reads the other's write atomically [5].

Figure 14: Happens-before relations between operations in the example of Figure 13.

The code in Figure 13 is data race free. This can be shown with the happens-before relation. Thread-internally, happens-before is the same as the sequenced-before relation introduced in Section 3.5.1. The store on x is sequenced before the load of y and therefore also happens before it.

Considering the threads together, the write on x synchronizes with the load on x: the load reads the written value. By the simplified definition, a release operation W on a location synchronizes with an acquire operation that reads the value written by W. Further, it is defined that the synchronizes-with relation contributes to the inter-thread-happens-before relation, which in turn contributes to happens-before [7]. These considerations lead to the happens-before relations shown in Figure 14. Inter-thread-happens-before also combines with the sequenced-before relation: if operation A is sequenced before operation B, and operation B inter-thread happens before operation C, then A inter-thread happens before C.

3.6.3 New Definition of a Data Race

A data race can now be redefined using the happens-before relation as follows: two actions at the same location, on different threads, not related by happens-before, at least one of which is a write [6]. Based on this definition it becomes clear that the example in Figure 13 does not contain a race.

3.6.4 Relaxed Ordering

Relaxed ordering is the weakest ordering model in C++. It only guarantees that all threads agree on the modification order of each variable. The example in Figure 15 shows the same program as in Figure 13, but with relaxed atomic operations. It is possible that the read of x returns 0 although the read of y returns 0 and the write of 1 to y is performed too. This result is possible since there is no imposed ordering between the read and write on the shared variable x. It is important to mention that there is no race in this example, and therefore no undefined behavior, because relaxed atomic operations cannot contribute to data races [17].

Figure 15: Relaxed ordering. Each thread reads the other's write atomically.

Figure 16: Possible outcome for each operation in the program of Figure 15.
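The two faces of relaxed ordering can be sketched as follows (our reconstruction of the Figure 15 pattern; helper names are ours). The first function shows the pattern in which both reads may return 0, yet without any data race; the second shows what relaxed ordering still does guarantee, namely a single agreed modification order per variable:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

// The pattern of Figure 15: with memory_order_relaxed there is no ordering
// between operations on different variables, so r1 == 0 && r2 == 0 is a
// permitted outcome. There is still no data race, because every access is
// atomic.
void relaxed_demo(int& r1, int& r2) {
    x = 0;
    y = 0;
    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
}

// What relaxed ordering guarantees: one modification order per variable.
// Relaxed read-modify-writes on a single counter therefore never lose
// increments, even though they impose no ordering on other variables.
int relaxed_counter(int n) {
    std::atomic<int> c{0};
    auto work = [&] {
        for (int i = 0; i < n; ++i)
            c.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return c.load();
}
```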
As relaxed atomics do not contribute to synchronization, there is no happens-before relation between variables that are shared by threads.

3.6.5 Acquire-Release Ordering

Acquire-release ordering is more relaxed than sequential consistency. There is no guarantee anymore that a read of a location returns the last previously written value. This is achieved by a refinement of synchronizes-with using a release sequence. Additionally, the possible values read by the acquire operation that participates in synchronizes-with are defined by the visible sequence of side effects. Figure 17 shows the pairwise synchronization of acquire-release. One thread writes some data x and then sets a flag y, while the other spins until the flag is set and then reads the data. It should be guaranteed that the receiver sees the data writes of the sender and not any values that precede them in the modification order. With the reads and writes of y annotated as in Figure 17, this is guaranteed. The relations from this example can be seen in Figure 18 (1).
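The sender/receiver pattern of Figure 17 can be sketched as follows (the function name and the value 42 are ours; data and flag correspond to x and y in the figure):

```cpp
#include <atomic>
#include <thread>

int data = 0;              // ordinary, non-atomic payload ("x" in Figure 17)
std::atomic<int> flag{0};  // the flag ("y" in Figure 17)

// The release store on the flag synchronizes-with the acquire load that
// reads it. The sender's plain write to data therefore happens before the
// receiver's read, so the receiver is guaranteed to see data == 42 --
// without a lock and without full sequential consistency.
int run_message_passing() {
    data = 0;
    flag = 0;
    int received = -1;
    std::thread sender([&] {
        data = 42;                                 // plain data write
        flag.store(1, std::memory_order_release);  // publish
    });
    std::thread receiver([&] {
        while (flag.load(std::memory_order_acquire) == 0) {
            // spin until the flag is set
        }
        received = data;  // ordered after the sender's data write
    });
    sender.join();
    receiver.join();
    return received;
}
```

Had both operations on flag been tagged memory_order_relaxed, no synchronizes-with edge would arise and the read of data would be a data race.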

Figure 17: The release operation of the sender on y synchronizes-with the acquire operation on y by the receiver [5]. Figure 18: (1) shows the synchronizes-with relation. (2) shows the synchronizes-with relation with a release sequence [6].

The synchronizes-with edge arises because the acquire operation (c) reads from the release operation (b). Note also that (a) happens-before (d). The release sequence defines that an acquire operation can synchronize with a release (to the same location) that precedes the write it reads from. This is shown in Figure 18 (2). The release (b) synchronizes-with the acquire (d) despite the fact that (d) reads from another write to the same location (c). This is possible since [(b),(c)] is the release sequence of (b) that participates in synchronizes-with. A release sequence is defined as a contiguous sub-sequence of the modification order on the location of the release. The release sequence is headed by the release (e.g. (b)) and can be followed by writes from the same thread (e.g. (c)) or read-modify-writes from any thread (not shown). A more elaborate example that clarifies the read-modify-write addition to the release sequence is shown in the queue example below. Until now we have only refined the synchronizes-with relation to define when two operations synchronize in acquire-release ordering. Which values can be read by the acquire participating in synchronizes-with is defined by the visible sequence of side effects. Consider Figure 19; it is a fragment of the program executed in Figure 18 (2), and it shows the happens-before relations in this program fragment.

Figure 19: Happens-before relations in the program fragment of Figure 18 (2) [6].

A visible sequence of side effects of a read is a contiguous sub-sequence of the modification order, headed by a visible side effect of the read, where the read does not happen before any member of the sequence [7]. In Figure 19 the visible side effect of the acquire operation (d) is (b).
The relaxed write (c) is in the visible sequence of side effects because of its later position in the modification order of y. Also, the relaxed write (c), as a member of the visible sequence of side effects, does not happen before the acquire (d). Therefore the acquire operation (d) reads either the value written in (b) or the one written in (c).

Data Dependency in Acquire-Release Ordering

On multiprocessors with weak memory orders, the release/acquire pairs of acquire-release ordering are cheaper to implement than sequentially consistent atomics [6], but they are still more expensive than plain stores and loads. Multiprocessors such as Power guarantee that certain data dependencies between instructions are respected [6]. This makes an additional synchronization instruction to prevent reordering unnecessary and results in better performance. This is the reason for memory_order_consume in the acquire-release ordering. If the programmer knows that the targeted hardware respects a certain order because of data dependencies, he can introduce an atomic load with memory_order_consume, with less synchronization cost but still enough for the required ordering. An example using memory_order_consume is shown in Figure 20.

Figure 20: Write data and store to a shared atomic pointer p by the sender. The receiver consumes the shared pointer and dereferences it [5]. Figure 21: (1) shows the synchronizes-with relation in release/acquire pairs. (2) shows the dependency-ordered-before (dob) relation between the release operation (b) and the consume operation (c), as well as to (d), because (c) carries-a-dependency-to (d) [6].

The sender writes some data and stores the address of that data to a shared atomic pointer p. The other thread reads the shared pointer, dereferences it and reads the data. The relations that arise in this example are shown in Figure 21. As a comparison, the relations that would arise if memory_order_acquire were used for the read of p (c) are shown in (1). New relations are introduced in (2). The read of p (c) carries-a-dependency-to the read of the data (d). The carries-a-dependency-to relation applies only within a single thread and models data dependency between operations. If the result of an operation A is used as an operand for an operation B, then A carries-a-dependency-to B. If the result of operation A is a value of a scalar type such as an int, then the relationship still applies if the result of A is stored in a variable, and that variable is then used as an operand for operation B [17]. The latter statement holds for the relation between the operations (c) and (d). Carries-a-dependency-to is a transitive relation. The other introduced relation is dependency-ordered-before. Dependency-ordered-before is the release/consume analogue of the release/acquire synchronizes-with. It also involves the release sequence, which is only [(b)] in this example. The possible values read by the consume operation (c) are again defined by the visible sequence of side effects.
The dependency-ordered-before relation also contributes to inter-thread-happens-before and therefore to happens-before, which in turn states that operation (a) happens-before (d).

Example: Reading Values From a Queue

As defined above, a release can synchronize with an acquire even though the acquire does not read from the release, as long as the value read belongs to the release sequence. Other threads can participate in the release sequence if read-modify-write operations are executed between the release and the acquire of the synchronizes-with relation. An example with read-modify-write operations in the release sequence is shown in Listing 2. One thread populates the queue, whereupon several threads can read items from it. The atomic variable count holds the number of items in the shared queue. On line 8 the initial store of the item count is done, to let the other threads know that data is available. The consumer threads that access the shared queue claim one item each (line 16). This is done with the read-modify-write operation fetch_sub with acquire semantics, before the shared queue is accessed. fetch_sub atomically reads count, subtracts one, and writes the modified value back to count; the returned value is the value read before the modification. A returned value of 0 or less forces the consumer to wait for new items (line 18); otherwise item_index is used to get the item from the queue (line 21).

 1  std::vector<int> queue_data;
 2  std::atomic<int> count;
 3
 4  void populate_queue()
 5  {
 6      unsigned const number_of_items = 20;
 7      // fill the queue with numbers from 0 to 19
 8      count.store(number_of_items, std::memory_order_release);
 9  }
10
11  void consume_queue_items()
12  {
13      while (true)
14      {
15          int item_index;
16          if ((item_index = count.fetch_sub(1, std::memory_order_acquire)) <= 0)
17          {
18              wait_for_more_items();
19              continue;
20          }
21          process(queue_data[item_index - 1]);
22      }
23  }

Listing 2: Reading from a queue with atomic operations [17]. The full example can be seen in Listing 3 in the appendix.

Figure 22 shows the relations between the operations in an execution with one thread populating the queue and two threads consuming queue items. The dotted lines show the release sequence and the solid lines show the happens-before relationships. The first fetch_sub participates in the release sequence, and therefore the release synchronizes-with the second fetch_sub. Note also that the second fetch_sub must return the value written by the first fetch_sub, although both the release and the first acquire are in the visible sequence of side effects. This is defined by write-read coherence, which prevents a read from returning a value that is hidden by a happens-before-later write in the modification order [6]. In our case, the release (write) is hidden from the second acquire, since the release happens before the second acquire and the first acquire occurs later in the modification order of count.

Figure 22: The release sequence for the queue operations from Listing 2 [17].

4. Conclusion

When writing multi-threaded applications it is crucial to prevent data races. Without ordering options, sequentially consistent execution is guaranteed by the memory model defined in C++11. The programmer does not need to care about possible compiler optimizations or the hardware model of the targeted system. Data races can be detected simply by thinking of an interleaved execution of the threads' instructions and analyzing whether two conflicting operations are adjacent. If sequentially consistent execution with locks and sequentially consistent atomics cannot satisfy the performance requirements, one can adapt the code and use the low-level atomics. This is considered an expert-only feature, as low-level atomics are hard to use correctly. Nevertheless, non-experts can profit indirectly by using libraries written with these facilities [11].

A. Reading from a Queue with Atomic Operations

#include <atomic>
#include <thread>
#include <vector>

std::vector<int> queue_data;
std::atomic<int> count;

// Stubs so the example is self-contained; a real program would block or
// yield in wait_for_more_items() and do actual work in process().
void wait_for_more_items() {}
void process(int item) {}

void populate_queue()
{
    unsigned const number_of_items = 20;
    queue_data.clear();
    for (unsigned i = 0; i < number_of_items; ++i)
    {
        queue_data.push_back(i);
    }
    count.store(number_of_items, std::memory_order_release);
}

void consume_queue_items()
{
    while (true)
    {
        int item_index;
        if ((item_index = count.fetch_sub(1, std::memory_order_acquire)) <= 0)
        {
            wait_for_more_items();
            continue;
        }
        process(queue_data[item_index - 1]);
    }
}

int main()
{
    std::thread a(populate_queue);
    std::thread b(consume_queue_items);
    std::thread c(consume_queue_items);
    a.join();
    b.join();
    c.join();
}

Listing 3: Reading from a queue with atomic operations [17].

References

[1] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29:66-76, 1996.
[2] Sarita V. Adve and Mark D. Hill. Weak ordering - a new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA '90, pages 2-14, New York, NY, USA, 1990. ACM.
[3] Sarita V. Adve and Hans-J. Boehm. Memory models: A case for rethinking parallel languages and hardware. Communications of the ACM, 53(8):90-101, 2010.
[4] Sarita Vikram Adve. Designing memory consistency models for shared-memory multiprocessors. PhD thesis, University of Wisconsin at Madison, Madison, WI, USA. UMI Order No. GAX
[5] Mark Batty. Mathematizing C++ concurrency. batty.pdf
[6] Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. Mathematizing C++ concurrency. SIGPLAN Not., 46(1):55-66, January 2011.
[7] Pete Becker. C++ standard working draft. /n3242.pdf, 2011.
[8] Hans-J. Boehm. Threads cannot be implemented as a library. In PLDI '05. ACM Press, 2005.
[9] Hans-J. Boehm. Reordering constraints for pthread-style locks. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '07, New York, NY, USA, 2007. ACM.
[10] Hans-J. Boehm. Towards a memory model for C++. Hans_Boehm/misc_slides/boehm-accu.pdf
[11] Hans-J. Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model. In PLDI '08, 2008.
[12] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. SIGARCH Comput. Archit. News, 18(3a):15-26, May 1990.
[13] Hans-J. Boehm and Paul McKenney. Programming with threads: Questions frequently asked by C and C++ programmers. html
[14] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690-691, September 1979.
[15] Jeremy Manson, William Pugh, and Sarita V. Adve. The Java memory model. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '05, New York, NY, USA, 2005. ACM.
[16] Bjarne Stroustrup. The Design and Evolution of C++. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1994.
[17] A. Williams. C++ Concurrency in Action: Practical Multithreading. Manning, 2012.


More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models Review. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed? The safety net between operations whose order needs

More information

Algorithmic "imperative" language

Algorithmic imperative language Algorithmic "imperative" language Undergraduate years Epita November 2014 The aim of this document is to introduce breiy the "imperative algorithmic" language used in the courses and tutorials during the

More information

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department

More information

Scalable Correct Memory Ordering via Relativistic Programming

Scalable Correct Memory Ordering via Relativistic Programming Scalable Correct Memory Ordering via Relativistic Programming Josh Triplett Portland State University josh@joshtriplett.org Philip W. Howard Portland State University pwh@cs.pdx.edu Paul E. McKenney IBM

More information

P1202R1: Asymmetric Fences

P1202R1: Asymmetric Fences Document number: P1202R1 Date: 2018-01-20 (pre-kona) Reply-to: David Goldblatt Audience: SG1 P1202R1: Asymmetric Fences Overview Some types of concurrent algorithms can be split

More information

Lowering C11 Atomics for ARM in LLVM

Lowering C11 Atomics for ARM in LLVM 1 Lowering C11 Atomics for ARM in LLVM Reinoud Elhorst Abstract This report explores the way LLVM generates the memory barriers needed to support the C11/C++11 atomics for ARM. I measure the influence

More information

7/6/2015. Motivation & examples Threads, shared memory, & synchronization. Imperative programs

7/6/2015. Motivation & examples Threads, shared memory, & synchronization. Imperative programs Motivation & examples Threads, shared memory, & synchronization How do locks work? Data races (a lower level property) How do data race detectors work? Atomicity (a higher level property) Concurrency exceptions

More information

Coherence and Consistency

Coherence and Consistency Coherence and Consistency 30 The Meaning of Programs An ISA is a programming language To be useful, programs written in it must have meaning or semantics Any sequence of instructions must have a meaning.

More information

C++ Memory Model. Don t believe everything you read (from shared memory)

C++ Memory Model. Don t believe everything you read (from shared memory) C++ Memory Model Don t believe everything you read (from shared memory) The Plan Why multithreading is hard Warm-up example Sequential Consistency Races and fences The happens-before relation The DRF guarantee

More information

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t

2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t Data Reduction - an Adaptation Technique for Mobile Environments A. Heuer, A. Lubinski Computer Science Dept., University of Rostock, Germany Keywords. Reduction. Mobile Database Systems, Data Abstract.

More information

Motivation & examples Threads, shared memory, & synchronization

Motivation & examples Threads, shared memory, & synchronization 1 Motivation & examples Threads, shared memory, & synchronization How do locks work? Data races (a lower level property) How do data race detectors work? Atomicity (a higher level property) Concurrency

More information

Threads and Locks. Chapter Introduction Locks

Threads and Locks. Chapter Introduction Locks Chapter 1 Threads and Locks 1.1 Introduction Java virtual machines support multiple threads of execution. Threads are represented in Java by the Thread class. The only way for a user to create a thread

More information

Sequential Consistency & TSO. Subtitle

Sequential Consistency & TSO. Subtitle Sequential Consistency & TSO Subtitle Core C1 Core C2 data = 0, 1lag SET S1: store data = NEW S2: store 1lag = SET L1: load r1 = 1lag B1: if (r1 SET) goto L1 L2: load r2 = data; Will r2 always be set to

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Concurrent Objects and Linearizability

Concurrent Objects and Linearizability Chapter 3 Concurrent Objects and Linearizability 3.1 Specifying Objects An object in languages such as Java and C++ is a container for data. Each object provides a set of methods that are the only way

More information

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4 Algorithms Implementing Distributed Shared Memory Michael Stumm and Songnian Zhou University of Toronto Toronto, Canada M5S 1A4 Email: stumm@csri.toronto.edu Abstract A critical issue in the design of

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Introduction to Parallel Programming Part 4 Confronting Race Conditions

Introduction to Parallel Programming Part 4 Confronting Race Conditions Introduction to Parallel Programming Part 4 Confronting Race Conditions Intel Software College Objectives At the end of this module you should be able to: Give practical examples of ways that threads may

More information

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

Safe Non-blocking Synchronization in Ada 202x

Safe Non-blocking Synchronization in Ada 202x Safe Non-blocking Synchronization in Ada 202x Johann Blieberger 1 and Bernd Burgstaller 2 1 Institute of Computer Engineering, Automation Systems Group, TU Wien, Austria 2 Department of Computer Science,

More information

Portland State University ECE 588/688. Memory Consistency Models

Portland State University ECE 588/688. Memory Consistency Models Portland State University ECE 588/688 Memory Consistency Models Copyright by Alaa Alameldeen 2018 Memory Consistency Models Formal specification of how the memory system will appear to the programmer Places

More information

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Global Scheduler. Global Issue. Global Retire

Global Scheduler. Global Issue. Global Retire The Delft-Java Engine: An Introduction C. John Glossner 1;2 and Stamatis Vassiliadis 2 1 Lucent / Bell Labs, Allentown, Pa. 2 Delft University oftechnology, Department of Electrical Engineering Delft,

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

task object task queue

task object task queue Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu

More information

CS 5220: Shared memory programming. David Bindel

CS 5220: Shared memory programming. David Bindel CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy

More information

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995

ICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995 ICC++ Language Denition Andrew A. Chien and Uday S. Reddy 1 May 25, 1995 Preface ICC++ is a new dialect of C++ designed to support the writing of both sequential and parallel programs. Because of the signicant

More information

The Java Memory Model

The Java Memory Model The Java Memory Model The meaning of concurrency in Java Bartosz Milewski Plan of the talk Motivating example Sequential consistency Data races The DRF guarantee Causality Out-of-thin-air guarantee Implementation

More information

A Case for System Support for Concurrency Exceptions

A Case for System Support for Concurrency Exceptions A Case for System Support for Concurrency Exceptions Luis Ceze, Joseph Devietti, Brandon Lucia and Shaz Qadeer University of Washington {luisceze, devietti, blucia0a}@cs.washington.edu Microsoft Research

More information

An introduction to weak memory consistency and the out-of-thin-air problem

An introduction to weak memory consistency and the out-of-thin-air problem An introduction to weak memory consistency and the out-of-thin-air problem Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) CONCUR, 7 September 2017 Sequential consistency 2 Sequential

More information

Adding std::*_semaphore to the atomics clause.

Adding std::*_semaphore to the atomics clause. Doc number: P0514R1 Revises: P0514R0 Date: 2017-6-14 Project: Programming Language C++, Concurrency Working Group Reply-to: Olivier Giroux Adding std::*_semaphore to the atomics clause.

More information

Memory model for multithreaded C++: Issues

Memory model for multithreaded C++: Issues Document Number: WG21/N1777=J16/05-0037 Date: 2005-03-04 Reply to: Hans Boehm Hans.Boehm@hp.com 1501 Page Mill Rd., MS 1138 Palo Alto CA 94304 USA Memory model for multithreaded C++: Issues Andrei Alexandrescu

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster Contents Overview Uniprocessor Review Sequential Consistency Relaxed

More information

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section

More information

3. Memory Persistency Goals. 2. Background

3. Memory Persistency Goals. 2. Background Memory Persistency Steven Pelley Peter M. Chen Thomas F. Wenisch University of Michigan {spelley,pmchen,twenisch}@umich.edu Abstract Emerging nonvolatile memory technologies (NVRAM) promise the performance

More information

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this

More information

Implementing the C11 memory model for ARM processors. Will Deacon February 2015

Implementing the C11 memory model for ARM processors. Will Deacon February 2015 1 Implementing the C11 memory model for ARM processors Will Deacon February 2015 Introduction 2 ARM ships intellectual property, specialising in RISC microprocessors Over 50 billion

More information