Light-Weight Execution Agents


Revision 5
Document number: P0072R0
Revises: N4439 (see Section 7 for revision history)
Date:
Author: Torvald Riegel
Reply-to: Torvald Riegel <triegel@redhat.com>

1 Introduction

N4016 makes a case for adding execution modes for threads of execution that are less capable than std::thread; doing so either allows for avoiding runtime overheads of std::thread or enables semantically different ways to execute, such as SIMD loops. This paper defines the semantics of these lighter-weight modes of execution by defining certain kinds of execution agents (EAs). An EA is a mechanism that executes a particular thread of execution.[1] "Execution context" was suggested as a different name instead of EA; this might make it clearer that the intent is to model execution properties and not primarily to describe the hardware or software resources used for execution. To provide some consistency in the revisions of this paper, I will stick to using the term EA and will happily delegate the choice of a final name to the committee.

The standard seems not quite clear regarding whether a thread of execution is a static entity (e.g., a function) or a dynamic entity (i.e., an instance of an execution of a single flow of control); in this paper, I will assume the latter. Given that, there is a one-to-one correspondence between an EA and a thread of execution. std::thread is a simple way to create and manage a certain type of EA. The entity that is a running thread (i.e., what was created by std::thread) is an EA. See N4321 for a more detailed explanation of this terminology (and of the existing terminology in the standard).

The standard also seems not to explicitly state which execution agent executes main. I will assume that main executes on an execution agent that is created implicitly by the implementation and has equivalent semantics to execution agents created by std::thread.

[1] A thread of execution is defined in the standard as a single flow of control within a program (see 1.10p1). sequenced-before is defined exactly for a thread of execution (see 1.9p13).

At the Rapperswil meeting, there was very strong consensus that forward progress requirements are essential to specifying EAs. Therefore, I will first present refined progress requirements (Section 2), then discuss std::thread-specific features such as thread-local storage (Section 3), sketch how to represent EA requirements in the type system (Section 4), and propose wording for version 2 of the Parallelism TS (Section 5; covering just forward progress and adapted to the current content of the TS).

2 Forward progress requirements

The requirements described in what follows are requirements for an implementation of the particular kinds of execution agents; in turn, programmers can treat them as the guarantees the implementation gives to the EAs they use in a program. The requirements are deliberately specified on an abstract level of making progress and not by explaining them based on current implementation techniques such as mappings to OS threads. This helps separate implementation artifacts from the requirements we really need to specify in a standard, and thus should help avoid misunderstandings; for example, OS threads on a host CPU are often quite different from threads running on a GPU. There certainly is a performance aspect of EAs, but this should be separate from the semantics I am focusing on here (e.g., exposed through Executors).

2.1 Basic forward progress definitions

Currently, the standard requires that implementations should ensure that all unblocked threads eventually make progress (1.10p2). While progress is not explicitly defined, it likely means events of observable behavior of the abstract machine, or termination of the abstract machine. Such behavior will be produced by the implementation taking steps that eventually lead to observable behavior (and thus correspond to steps of execution of the abstract machine). The term "unblocked" is also not defined, and is the trickier issue because there are different causes that prevent progress: (1) no execution of implementation steps vs. (2) a thread that is blocked due to program logic. The latter can happen both when the implementation does not execute steps (e.g., when a blocking file I/O function has been called, or a lock uses a blocking facility of the OS) and when it does execute steps (e.g., busy-waiting using atomics until another thread produces a value).

I propose to replace this with the following requirements. First, we define what progress of a thread of execution is:

    The execution of threads of execution consists of steps: A step ends with either (1) termination of the thread of execution, (2) access to or modification of a volatile object, or (3) a synchronization or atomic operation; the remainder of a step consists of other things executed by the implementation that are not in the previous list. [Note: A thread of execution makes progress when its steps are executed. 1.10p27 requires that individual steps are finite (see below for how to handle library I/O functions). end note]
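To illustrate the definition, consider the following minimal sketch (all names are invented for this illustration); the comments mark where steps end, assuming worker is the initial function of a thread of execution:

    #include <atomic>

    std::atomic<int> flag{0};
    volatile int sensor = 0;

    void worker() {
        int x = sensor;   // access to a volatile object: ends a step
        x = x * 2 + 1;    // plain computation: part of the following step
        flag.store(x, std::memory_order_release);  // atomic operation:
                                                   // ends that step
    }   // returning eventually leads to termination of the thread of
        // execution, which ends its final step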

These steps could also be summarized as observable actions in the sense of being either observable by other threads or observable in terms of observable behavior of the abstract machine.

Next, we need to cover calls to library I/O functions, which may block in an OS kernel when waiting for data to arrive, for example. This blocking is internal to the application, and as I mentioned previously, we need to distinguish between blocking due to, for example, lack of input data and the implementation not allowing the thread of execution to make progress. To solve this, we can model such blocking functions as conceptual busy-waiting loops:

    Calls to library I/O functions and other functions that block (i.e., that cannot return to the caller unless a certain, potentially external condition is met) can be conceptually expressed using steps as defined previously; each blocking operation can be replaced by a busy-waiting loop that polls for the condition to be met (e.g., using observable behavior of the abstract machine).

Note that this is purely a model to enable the specification of progress; implementations are neither required to busy-wait nor forbidden from doing so. It allows us to separate the progress allowed by the implementation (i.e., whether a thread of execution is allowed to make steps) from whether particular blocking operations such as I/O make progress in the sense of achieving what the program was meant to do. The implementation has control over the former, but not necessarily over the latter. If blocking operations want to specify which condition they may wait for, then they can express this with busy-waiting loops.[2] The progress requirements defined below do not require such specifications but only rely on the fact that such a specification would be possible. Therefore, we do not need to add these specifications to the standard before being able to define progress as described here.

[2] For example, a non-lock-free atomic operation can be expressed as performing the (sequential code for the) operation in a critical section protected by a global lock; in turn, lock acquisition can be expressed as a spinlock.
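As an illustration of this modeling (a minimal sketch; the flag and all names are invented for the example, and real implementations are free to block in the kernel instead), a blocking read could conceptually be replaced by the following polling loop:

    #include <atomic>

    // Stands in for an external condition such as "input has arrived",
    // set asynchronously by something outside this thread of execution.
    std::atomic<bool> data_ready{false};

    void conceptual_blocking_read() {
        // The single blocking call is modeled as a polling loop; each
        // atomic load ends a step, so the thread of execution keeps
        // making steps while it waits for the external condition.
        while (!data_ready.load(std::memory_order_acquire)) {
        }
        // ... hand the received data to the caller ...
    }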

2.2 Execution agent progress requirements

Next, I will discuss three classes of forward progress requirements that EAs can fall into.

Concurrent execution

This class provides the same progress guarantees as std::thread: EAs in this class will eventually be allowed by the scheduler to make all steps they need to execute, independently of what other EAs (including those created by std::thread) are doing. Two EAs in this class will run concurrently with respect to each other, and can depend on the other EA to make implementation steps concurrently, even if they block on the other EA:

    The implementation should ensure that a concurrent execution agent will eventually be allowed to execute all steps its thread of execution consists of, independently of which steps other execution agents might or might not execute. The execution agents created by std::thread and the implementation-created execution agent that executes main are concurrent execution agents. Note: To eventually fulfill this requirement means that this will happen in an unspecified but finite amount of time.

Thus, this can effectively replace the existing progress requirement (1.10p2); it clarifies that the requirement is about providing each concurrent execution agent (e.g., one created using std::thread) with the compute resources it needs. In other words, what the implementation should provide for std::thread is a thread scheduler that never starves any thread, and is thus fair (at least to some degree). The progress requirement is specified as "should" instead of "shall" (as in 1.10) because there might be implementations based on OS schedulers that cannot give these properties (e.g., in a hard-real-time environment). Nonetheless, general-purpose implementations should strive to fulfill this requirement.

Parallel execution

Parallel execution is weaker than concurrent execution in terms of forward progress. In particular, we need to capture the notion that one would like to let programs define lots of parallel tasks, yet use a bounded set of resources (e.g., CPU cores) to execute those tasks:

    For a parallel EA, an implementation is not required to execute steps if it has not yet executed any step; once it has executed a step, the implementation should ensure that it will eventually execute all of the steps its thread of execution consists of, independently of which steps of other execution agents are or are not executed. Note: This effectively makes the same progress requirements as for a concurrent execution agent once the parallel execution agent is started, but does not specify a requirement for when to start the parallel execution agent; the latter will typically be specified by the entity that creates the parallel execution agent.

In other words, a parallel EA will behave like a std::thread in terms of progress, but only after performing its first step. Note that though having no requirement on when to start a parallel EA might seem odd at first, this is completely taken care of by the notion of boost-blocking (see Section 2.3). This progress requirement is stronger than for other forms of parallel execution (see below); the main advantage of it is that it allows typical uses of critical sections because, as soon as a parallel EA starts and might acquire a mutex, it is guaranteed to eventually be able to finish execution, and thus it will not block other EAs indefinitely.[3]

[3] This still does not allow other uses of mutexes, for example an EA using a mutex to wait for the finished execution of another EA that already owned the mutex before it got started.
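To illustrate the critical-section argument (a minimal sketch; the spawning machinery is omitted and all names are invented), a task intended to run as a parallel EA can safely use a mutex to combine results, because once the EA has executed its first step it is guaranteed to eventually finish and release the lock:

    #include <mutex>
    #include <vector>

    std::mutex m;
    long sum = 0;

    // Intended to run as a parallel EA; the mechanism that spawns it is
    // not shown. Holding the mutex here cannot block other EAs
    // indefinitely, because a started parallel EA eventually finishes.
    void parallel_task(const std::vector<long>& chunk) {
        long local = 0;
        for (long v : chunk)
            local += v;                    // independent work, no blocking
        std::lock_guard<std::mutex> g(m);  // typical short critical section
        sum += local;
    }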

Weakly parallel execution

Weakly parallel EAs cannot be expected to make steps concurrently with other EAs. Thus, they get weaker progress guarantees than parallel EAs in that there is no upgrade to concurrent execution once a weakly parallel EA has started executing:

    For a weakly parallel EA, an implementation is not required to execute steps; such EAs cannot be expected to execute steps independently of which steps other threads of execution might or might not execute.

As a result, unlike for parallel EAs, typical uses of critical sections are not possible in weakly parallel EAs because then an EA might wait for another EA that is not guaranteed to be able to make implementation steps concurrently.[4] However, using nonblocking synchronization (e.g., using lock-free atomics) is possible because it does not need any specific guarantees from the scheduler (i.e., it just finishes a step).[5] Also note that, like for parallel EAs, boost-blocking takes care of what might appear to be a too-weak progress requirement (see Section 2.3).

[4] Specifically, consider cases like when parallel EAs use the same mutex to protect critical sections. Other cases would still be allowed, such as when an EA uses a critical section that it will never block on (e.g., because nobody else uses the same mutex).

[5] Note that while the success of obstruction-free synchronization depends on whether operations are scheduled in such a way that there is eventually no obstruction, this can be emulated by the code executing the obstruction-free operations (e.g., using randomized exponential back-off). Blocking synchronization is different in that it cannot tolerate another EA not making any progress.

Implementation examples

There are many possible implementations that would satisfy the progress requirements for the three classes of EAs; I will highlight just a few to illustrate the possibilities. Concurrent EAs can be easily implemented by mapping each EA to one OS thread and using a round-robin OS thread scheduler. An unbounded thread pool that eventually adds a new OS thread to the pool if some EAs did not run yet is also a valid implementation. However, a bounded thread pool is not a correct implementation because it does not guarantee concurrent execution if there are more concurrent EAs than OS threads in the pool (e.g., the pool's threads might all be taken by consumers waiting for a producer that hasn't been started because all of the pool's threads are in use; see the sketch below). Nonetheless, if an implementation that uses a non-preemptive thread scheduler is aware of all points of blocking (more precisely, points where one EA depends on another EA's progress) and it either preempts or starts new threads in case of blocking, then this can be a valid implementation of concurrent EAs. However, this can be nontrivial to implement; the compiler, for example, could have to instrument all loops containing atomic loads or read-modify-write operations and volatile loads (because these end steps), as well as calls to (OS) library functions that might block.
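As a concrete illustration of the bounded-pool problem mentioned above, here is a deliberately broken sketch (invented for this argument; it deadlocks by design) of a "pool" with a single worker thread:

    #include <functional>
    #include <future>
    #include <queue>
    #include <thread>

    // The consumer is enqueued first, occupies the only worker, and waits
    // for a producer task that is never started: the program deadlocks.
    int main() {
        std::queue<std::function<void()>> tasks;
        std::promise<int> p;
        tasks.push([&] { (void)p.get_future().get(); }); // consumer: blocks
        tasks.push([&] { p.set_value(42); });            // producer: never runs
        std::thread worker([&] {
            while (!tasks.empty()) {
                tasks.front()();   // blocks forever inside the consumer
                tasks.pop();
            }
        });
        worker.join();             // never returns
    }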

Parallel EAs, on the other hand, can be implemented based on a bounded thread pool; because the progress requirement does not require a certain amount of resources (e.g., OS threads) to be used, the implementation has full flexibility regarding resource usage. This is easy to implement because we just need to execute all parallel EAs eventually, in some order and interleaving. However, it cannot be implemented with, for example, certain kinds of work-stealing schedulers if work-stealing is allowed to happen during critical sections.[6]

[6] Consider a scheduler that immediately executes a spawned parallel task instead of finishing the spawning task (and keeps using a single OS thread): If the former blocks on a mutex acquired by the latter, then the OS thread used for the two EAs will be deadlocked; if the scheduler is neither aware of all blocking relationships nor promotes parallel EAs to concurrent EAs after a while (i.e., adds OS threads to the pool), a deadlock will arise.

Weakly parallel EAs can be implemented in several ways. They can be run concurrently on threads from a thread pool, or in some interleaving on a single thread. Even implementations that use SIMD instructions fall into the latter category (i.e., using a mix of SIMD-parallel execution and serialized execution of code that cannot be vectorized). Work-stealing implementations are also able to provide weakly parallel EAs.

The progress requirements for parallel and weakly parallel EAs match the progress guarantees that are essentially in effect when using parallel_execution_policy and parallel_vector_execution_policy, respectively, of the Parallelism TS (N4104).

2.3 Boosting progress

While concurrent EAs are guaranteed to just execute eventually, the requirements for parallel and weakly parallel EAs do not include having to actually start execution of such EAs. As an example, consider a simple program that contains a parallel loop; the calling EA of such a loop spawns a number of parallel EAs and waits until all of them have finished. The whole program starts executing as one std::thread, so the EA that calls the parallel loop construct is a concurrent EA. Thus, the program is already guaranteed to execute eventually up to the invocation of the parallel loop. However, then we have a bootstrap problem because the concurrent EA waits for the completion of parallel EAs, which have weaker progress requirements. What we need is a way to specify that the stronger progress requirements for the concurrent EA should eventually lead to finishing execution of the parallel EAs spawned by the parallel loop. To do that, let us define a notion of boost-blocking:

    If an execution agent P uses boost-blocking to block on the completion of a set S of execution agents, and if P is subject to a stronger forward progress requirement than at least one execution agent in S, then throughout the whole time of P being boost-blocked on S, P will boost the progress requirement of at least one execution agent in S to P's stronger requirement. Specifically, P is free to select which execution agent in S to boost and for which number of steps (i.e., the boost is not permanent and in place for the rest of the lifetime of the boosted execution agent); as long as P is boost-blocked, it has to eventually boost an execution agent in S. Once an execution agent in S finishes execution, it is removed from S. Once S is empty, P stops being blocked.

Note: An execution agent thus can have an effectively stronger forward progress requirement for a certain amount of time, due to a second agent being boost-blocked on it. In turn, this may allow this agent to itself boost the progress of a third agent that this agent is boost-blocked on.

Note: If all execution agents in S finish execution (e.g., they do not use blocking synchronization incorrectly), then P's progress requirement will not be weakened by executing the boost-blocking operation.

Note: This does not remove any constraints regarding blocking synchronization for parallel or weakly parallel execution agents because P is not guaranteed to boost the particular execution agent whose too-weak progress requirement is preventing overall progress.

In the parallel loop example, this means that the calling EA's progress just helps one of the parallel EAs to make progress (e.g., to get a parallel EA started). If all of them terminate, they will eventually all be allowed by the implementation to execute all their steps. Note, however, that this does not mean that the concurrent EA will make all the parallel EAs start concurrently; the concurrent EA is free to boost just one parallel EA until this EA has terminated, and then boost the next one. This way, if none of the parallel EAs blocks for another parallel EA to be started, then the whole loop is guaranteed to be executed eventually. This is straightforward to implement when concurrent EAs are based on OS threads: The calling EA just participates in executing iterations of the loop that are not processed yet; any OS threads available in a program-wide thread pool can be used to execute other iterations of the loop.

When a group of weakly parallel EAs is boosted by a concurrent or parallel EA, this works similarly to the previous example in that one of the weakly parallel EAs will always eventually make steps. This allows vectorized execution, for example: The concurrent or parallel EA can use SIMD operations to execute steps from all weakly parallel EAs in a lockstep execution fashion.

Boost-blocking is a property that operations such as the parallel loop from our example would explicitly guarantee, including specifying the group of EAs that the boost-blocking applies to. In the Parallelism TS, for example, all algorithms executed with a parallel or parallel-vector execution policy would guarantee boost-blocking for the group of parallel or weakly parallel EAs that an algorithm spawns. Boost-blocking is not intended to be the default form of blocking; for example, std::mutex would not guarantee boost-blocking as this would increase the implementation complexity and performance cost.
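To illustrate the OS-thread-based implementation sketched above, in which the calling EA participates in executing unprocessed loop iterations, here is a minimal sketch (all names invented; a real implementation would additionally wait for iterations claimed by pool threads to finish):

    #include <atomic>
    #include <cstddef>

    // The calling EA boost-blocks on the loop's parallel EAs simply by
    // executing not-yet-claimed iterations itself, so the loop completes
    // even if no pool thread ever helps.
    template<class F>
    void parallel_for_boost_blocking(std::size_t n, F f) {
        std::atomic<std::size_t> next{0};
        auto run_iterations = [&] {
            for (std::size_t i = next.fetch_add(1); i < n;
                 i = next.fetch_add(1))
                f(i);   // each iteration corresponds to one parallel EA
        };
        // ... offer run_iterations to threads of a program-wide pool ...
        run_iterations();   // the caller participates until all iterations
                            // have been claimed
    }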

Examples

Let us look at a few more complex examples. First, if we have two concurrent EAs that both execute one parallel loop spawning parallel EAs, then each concurrent EA boost-blocks on its own group of parallel EAs. If one of the parallel EAs of the first concurrent EA produces an intermediate value that a parallel EA of the second concurrent EA needs, then the second parallel EA can block on the first EA without needing boost-blocking to ensure progress (e.g., it can use a condition variable). This works because boost-blocking reduces this case to essentially the second concurrent EA waiting for the first concurrent EA.

As a second example, consider an implementation that schedules several threads of execution nonpreemptively on fewer OS threads than there are threads of execution. This means having weakly parallel EAs because one of the EAs cannot expect to make steps without another EA potentially having to yield voluntarily (i.e., making the OS thread it runs on available). We can envision a facility that spawns an entire network of such EAs that do message passing between each other (e.g., each EA represents one actor in an actor-based model). After being spawned, the network may make progress, but this is not guaranteed because these are all weakly parallel EAs. To get an actual output from the network, we can use boost-blocking in two ways: (1) Let a concurrent or started parallel EA boost-block on the whole network (i.e., the group of all weakly parallel EAs it is comprised of). (2) If the network uses pull-based communication, let a concurrent or started parallel EA boost-block on exactly the EAs that produce the final output of the network, and employ pull or receive-message operations that use boost-blocking. This will establish transitive boost-blocking towards all the EAs whose input is needed to produce the output.

Third, consider a generator coroutine: This is, logically, an EA that only executes whenever it is requested to generate a new value. From a forward progress perspective, this can be expressed by specifying that the generator is a weakly parallel EA, and that another thread requesting a new value from the generator is an operation that boost-blocks on the generator EA. There is typically an additional safety guarantee that states that the generator and the other thread never run at the same time; however, there are cases in which allowing speculative execution of the generator in parallel (i.e., without giving the safety guarantee) can improve performance.

3 std::thread-specific state and features

Besides forward progress, we also need to consider how light-weight EAs relate to features of EAs created by std::thread, notably thread-local storage (TLS) and the value returned by this_thread::get_id. While programmers often will not need these features, we need to at least define the level of compatibility with existing code based on std::thread. There are several choices for whether, and how, an EA can relate to these features. We can express these by specifying to which extent a particular EA can be conceptually considered to be running on top of another EA created by std::thread (in what follows, this is abbreviated as being "associated with a std::thread"). Ranging from the weakest requirement to the strongest, these are:

(1) Not associated with a std::thread. In an EA for which only this requirement applies, accesses to TLS or calls to this_thread::get_id result in undefined behavior.

(2) Always associated with some std::thread. There is an associated std::thread, but which instance this is can change at any time during the execution of the EA; however, there will be no change during a single access to TLS, a call to this_thread::get_id, or the execution of external, non-C++ code.[7] This requires TLS to be used as if accesses were relaxed atomic accesses with respect to the change of the associated thread; it would be safe to make a plain load or store to TLS, but trying to apply a plain read-modify-write operation (e.g., threadlocal++;) is not guaranteed to work (a sketch below illustrates this).

(3) Association with a std::thread only changes in certain cases. Compared to (2), this restricts the places where the associated std::thread can change to explicitly specified cases, for example calls to certain std:: functions or execution of language constructs. Depending on what these cases are, only a few or many std:: functions might have to be included in the set (e.g., when the associated std::thread can change in every parallelism construct, then every potentially parallelized std:: function is in the set).

(4) Associated std::thread is stable. The associated std::thread will not change during the lifetime of the EA.

(5) Associated std::thread is stable and has the same lifetime as the EA. This is exactly what code using std::thread would get. The difference from (4) is that constructors and destructors of TLS are guaranteed to be executed at the start and termination of the EA.

Additionally, we can distinguish whether TLS is potentially shared between several EAs, which can be useful if TLS is used as a mechanism to maintain likely-local state (in the sense of likely being accessed by just a few EAs), for example for caching. Thus, if we consider the above requirements to include that any associated std::thread is only associated with one EA at a particular point in time, then we can add two additional requirements:

(3s) Association with a std::thread only changes in certain cases; shared. Like (3) except that an associated std::thread can be associated with other EAs concurrently.

(4s) Associated std::thread is stable; shared. Like (4) except that an associated std::thread can be associated with other EAs concurrently.

Note that for (5), a shared TLS implementation is probably not useful because it would require the affected EAs to have the exact same lifetimes. For (2), a shared TLS implementation does not make much difference in practice because the associated TLS can change anytime, so any value or pointer obtained through it could be accessible to another EA at the next moment anyway.

[7] We could also say that changes can happen in all std:: code and certain language constructs; however, this might not make it easier to use because, for example, calls to customized operators can be easy to overlook.
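To illustrate the hazard under option (2) (a minimal sketch, using an invented variable):

    // Under option (2), the std::thread associated with an EA can change
    // between any two TLS accesses, but not within a single access.
    thread_local int counter = 0;

    void ea_body() {
        int v = counter;   // safe: a single load from the currently
                           // associated thread's TLS
        counter = v + 1;   // safe as a single store, but NOT as an
                           // increment: the association may have changed
                           // since the load, so the load and the store
                           // may touch two different threads' TLS
    }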

Consequences for implementations

Option (1) is the easiest to implement. In contrast, option (5) restricts the implementation the most in that each EA really has to have its own separate TLS space, and constructors and destructors have to run. Because TLS is currently program-wide (although a compiler may be able to optimize this), short-lived EAs might face additional construction overhead. Option (4) is somewhat better in that constructors/destructors do not need to be run on EA start/finish, so an implementation can implement such EAs on top of a std::thread thread pool, for example. Nonetheless, TLS and this_thread::get_id must not change across the EA's lifetime, so an implementation has to either use exactly one underlying std::thread for such an EA (which can conflict with certain implementations of work stealing), change memory mappings, or introduce an indirection, and thus runtime overhead, for TLS (depending on the actual TLS implementation).

Options (2) and (3) should give implementations enough leeway in terms of which underlying std::thread is used while still allowing them to run external or legacy code that might use TLS with little impact (unless, for example, there are several callbacks into the code and it expects to be executed by a single std::thread). However, code generation by the compiler is probably affected because accesses to TLS do not have exactly the same semantics as in sequential code or as TLS with options (4) and (5): TLS can change concurrently, so the compiler may have to prevent reloading of TLS values or may have to reload in other cases. Option (3) is a little less constraining for the compiler if all standard library functions, or at least those that can actually change the underlying std::thread in the particular implementation, are opaque to the compiler. Options (3s) and (4s) are probably a little easier to implement than (3) and (4) because whether an EA has its own isolated TLS becomes basically a quality-of-implementation issue.

Storing a value for this_thread::get_id can be implemented using TLS; when TLS is shared (options (3s) and (4s)), one might still want to have distinct IDs for each EA (this_thread::get_id is currently specified to return an ID for a thread of execution, not just for a std::thread[8]). If so, we need to create unique IDs; we discussed this in Issaquah, and while some people felt that having a unique ID would be valuable and creating them would not cause a lot of runtime overhead, others thought that creating unique IDs could be costly in terms of performance (e.g., on highly parallel hardware with a large number of hardware threads).

[8] The standard does not provide a way to look up a std::thread object based on an ID, so we do not need to create a std::thread.

It might also be interesting to reconsider whether, for an associated std::thread, there is still a relation to an underlying OS thread. Not establishing such a relation could make an implementation of light-weight EAs simpler because then it can implement the standard's TLS as required yet does not have to make the OS thread's TLS mechanism work.

Potential replacements for TLS

I think that it would be valuable to investigate abstractions that can replace TLS usage, especially to make options (1) to (3) more widely applicable.

Also, this could provide abstractions that are better suited to the particular use case than current std::thread-specific TLS; often, TLS is used as a mechanism but does not necessarily fulfill the intent perfectly.

When considering just a single EA, TLS is basically a special allocator combined with a very efficient way to obtain an EA-specific root pointer. Where TLS differs from that is that allocating a TLS data item (e.g., by adding a thread_local variable) automatically allocates one EA-specific data item for every past and future EA in the program; furthermore, in each EA one can access this data item through a shared key (e.g., the thread_local variable). I do not think that programmers need exactly that in the majority of use cases; usually, you need a TLS data item just for a subset of all EAs (e.g., all the EAs a subsequent parallel loop will spawn). Considering programs with several parallelized parts and systems that can potentially run thousands of EAs, this can make a difference.

Two major TLS use cases seem to be (a) global state that is accessed by just one thread and (b) locality of access. If TLS is used for the former, then allocating the data in the particular EA is fine if it is possible to pass the pointer to the data along with other arguments to functions called by this EA, for example. If this data is expected to have the same lifetime as the EA, the implementation could even use a very efficient, local allocator for that. If TLS is used for locality of access, then the motivation is primarily to try to reduce interference from other EAs, for example to reduce cache misses or to access local memory. As an example, consider caches for sets of worker EAs on a NUMA system: A thread-safe cache could be costly if accessed from all EAs in the program due to lack of locality of memory accesses, but if all EAs running on one NUMA node shared a cache, the synchronization overheads would be small; giving each EA its own cache would remove the synchronization overheads completely, but could increase memory usage. In such a situation, it might be best to let an EA access the cache that is closest to the CPU core it is actually executing on; the instance of the cache might even change when the EA migrates to executing on another CPU core. This could be provided through a special allocator, for example one that allocates from per-workgroup memory on a GPU.

To summarize, I think that while TLS is important for compatibility with legacy code, specialized mechanisms should be better suited to address the actual intent behind using the TLS mechanism.
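For use case (a), here is a minimal sketch (all names invented) of per-EA state that is allocated just for the EAs of one computation and passed explicitly instead of being found through TLS:

    #include <cstddef>
    #include <vector>

    // Stand-in for whatever facility spawns the EAs; a real spawner could
    // run one EA per id instead of this sequential loop.
    template<class F>
    void for_each_ea(std::size_t num_eas, F f) {
        for (std::size_t id = 0; id != num_eas; ++id)
            f(id);
    }

    void compute(std::size_t num_eas) {
        // One buffer per EA of this computation only, not one per thread
        // of the whole program; lifetime and allocator are under the
        // caller's control.
        std::vector<std::vector<double>> scratch(num_eas);
        for_each_ea(num_eas, [&](std::size_t id) {
            scratch[id].push_back(1.0);   // this EA's private state
        });
    }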

Lock ownership

The standard already specifies lock acquisition semantics in terms of lock ownership of EAs, and notes that other EAs than those created by std::thread may exist (see ). Thus, we do not need to define any association to any existing threads as for TLS. However, implementations might have to be changed if they rely on having OS threads or std::thread as the only possible EA (e.g., if a mutex stores an OS thread ID to designate the lock owner). For options (4) and (5) above, implementations might be able to use existing thread-ID-based lock implementations. Implementations of weakly parallel execution such as SIMD execution could not use those, but weakly parallel EAs do not support blocking synchronization in general either; thus, perhaps the actual impact on implementations would be small.

4 A simple interface to schedule execution agents

So far, we have discussed EAs as purely a concept used to specify execution properties. Beyond that, one could add ways to directly create instances of EAs. Here is a sketch of how to make the requirements part of the type system, defining the different requirements as tag types:

    // Forward progress requirements, ordered by strength
    struct progress_weakly_parallel_tag {};
    struct progress_parallel_tag : public progress_weakly_parallel_tag {};
    struct progress_concurrent_tag : public progress_parallel_tag {};

    // TLS requirements, ordered by strength
    // (Please ignore for a second that the following would only work with
    // virtual inheritance, whose overhead we want to avoid though.)
    // Options (1) to (4):
    struct tls_none_tag {};
    struct tls_exists_tag : public tls_none_tag {};
    struct tls_associated_tag : public tls_exists_tag {};
    struct tls_stable_tag : public tls_associated_tag {};
    // Options (3s) and (4s):
    struct tls_associated_shared_tag : public tls_exists_tag {};
    struct tls_stable_shared_tag : public tls_associated_shared_tag {};
    // Option (5):
    struct tls_thread_tag : public tls_stable_tag,
                            public tls_stable_shared_tag {};

Next, we can define concrete EA types and traits. In this example, there is a parallel EA and an EA that is like those created by std::thread:

    // Properties of an EA
    template<class EA> struct ea_traits;

    // A parallel EA type without TLS support
    struct ea_lean_parallel {};
    template<> struct ea_traits<ea_lean_parallel> {
        typedef progress_parallel_tag progress;
        typedef tls_none_tag tls;
    };

    // An EA type that has the properties of EAs created by std::thread
    struct ea_stdthread {};
    template<> struct ea_traits<ea_stdthread> {
        typedef progress_concurrent_tag progress;
        typedef tls_thread_tag tls;
    };
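The inheritance relationships encode which requirements satisfy which others; this can be checked directly (a small supplementary example, assuming the tag definitions above):

    #include <type_traits>

    // A stronger tag converts to every weaker tag, which is what the tag
    // dispatch shown below relies on.
    static_assert(std::is_base_of<progress_weakly_parallel_tag,
                                  progress_concurrent_tag>::value,
                  "concurrent progress satisfies weakly parallel requirements");
    static_assert(!std::is_base_of<progress_concurrent_tag,
                                   progress_parallel_tag>::value,
                  "parallel progress does not satisfy concurrent requirements");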

If we have interfaces that spawn EAs (e.g., something similar to executors), then we can use the traits to select the most efficient implementation at compile time. We can also use them to disallow trying to create kinds of EAs that are not supported by the particular interface. For example, a strictly bounded thread pool cannot guarantee concurrent execution of all EAs in general, so it might just refuse spawning concurrent EAs altogether:

    // Executor-like example: bounded thread pool can't launch concurrent EAs
    struct bounded_threadpool {
        template<class Function>
        void spawn(Function f, progress_parallel_tag) { /* Can do it. */ }

        // Can't create concurrent EAs:
        template<class Function>
        void spawn(Function f, progress_concurrent_tag) = delete;

        template<class EA, class Function>
        void spawn(EA&, Function f) {
            spawn(std::forward<Function>(f),
                  typename ea_traits<EA>::progress{});
        }
    };
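A hypothetical usage sketch of the above (assuming the declarations just given):

    void example() {
        bounded_threadpool pool;
        ea_lean_parallel lean;
        pool.spawn(lean, []{ /* task */ });    // OK: parallel progress suffices

        // ea_stdthread conc;
        // pool.spawn(conc, []{ /* task */ }); // ill-formed: would select the
        //                                     // deleted concurrent overload
    }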

Considering executors in general, I think it would be beneficial to parametrize them or their member functions by an EA type. Although one could argue that, for example, a client of a bounded thread pool should be well aware which properties EAs created by it have, making the requirement explicit helps catch mistakes and is less error-prone in generic code. More importantly, it separates semantic constraints for the execution of a particular piece of code from all the other reasons the programmer might have to use a specific executor, in particular performance reasons. For example, a programmer might want to use a thread pool with a fixed number of OS threads primarily because those threads may have been bound to the cores on a particular CPU socket, not because the programmer was thinking about bounded vs. not bounded and the consequences for progress. The thread pool implementation could even add additional OS threads to the pool if the program requests it to schedule more concurrent EAs. Another example would be a GPU executor: Its purpose would be to run EAs on the GPU, not just to run a particular type of EA. We could specify one concrete executor for each combination of execution requirements (e.g., concurrent-GPU, parallel-GPU, parallel-no-TLS-GPU, ... executors), but I do not think this would be easier to consume for readers of the standard. Also, if we want to combine, chain, or merge executors, then having a larger number of concrete executors makes the implementation more complex too.

As a last example for simple interfaces, let us consider a facility to upgrade the currently executing EA to stronger EA requirements:

    // Spawns and boost-blocks on just one EA
    template<class EA, class Function>
    void upgrade_ea(EA& ea, Function f);

    // Just need stable TLS, but no constraints on progress
    struct ea_stable_tls {};
    template<> struct ea_traits<ea_stable_tls> {
        typedef progress_weakly_parallel_tag progress;
        typedef tls_stable_tag tls;
    };

    // ... code that does not use TLS ...
    ea_stable_tls ea;
    upgrade_ea(ea, legacy_code_that_needs_tls);
    // ... code that does not use TLS ...

upgrade_ea spawns a new EA and waits for its termination. It will give the new EA requirements that are not weaker than either those of the current EA or those of the supplied EA type. Note that for the progress requirement, this could as well be achieved by specifying that upgrade_ea boost-blocks on the completion of the spawned EA.

5 Suggested wording for version 2 of the Parallelism TS

There was consensus in Urbana that the next step regarding the forward progress definitions would be to apply them to other proposals, notably the Parallelism TS. Thus, in what follows, I will propose wording for the next version of this TS, relative to N4352. To avoid having to use and define execution agents, the proposed wording simply associates forward progress requirements directly with threads of execution. I still think that using EAs as a concept makes sense, but it's not strictly necessary for the uses in the current Parallelism TS.

First, we need to add the base definitions around progress. Those are not specific to the Parallelism TS but would rather replace the existing requirements in the standard; nonetheless, we need to refer to them, so just add a subsection titled "Forward progress" before Section 4.1.2, with the following content. The definitions are essentially the same as discussed previously in this paper except for associating progress with threads of execution directly. Also, a few explanatory notes have been added:

    [Note: The definitions and requirements in this section are supposed to replace the forward progress requirements in the current C++ standard. end note]

    The execution of threads of execution consists of steps: A step ends with either (1) termination of the thread of execution, (2) access to or modification of a volatile object, or (3) a synchronization or atomic operation; the remainder of a step consists of other things executed by the implementation that are not in the previous list. [Note: A thread of execution makes progress when its steps are executed. 1.10p27 requires that individual steps are finite (see below for how to handle library I/O functions). end note]

    Calls to library I/O functions and other functions that block (i.e., that cannot return to the caller unless a certain, potentially external condition is met) can be conceptually expressed using steps as defined previously; each blocking operation can be replaced by a busy-waiting loop that polls for the condition to be met (e.g., using observable behavior of the abstract machine).

    [Note: It is not necessary to provide such specifications of blocking. The previous paragraph only states that it is always possible to do so, which allows separating blocking due to program logic from blocking due to how an implementation executes the abstract machine. end note]

    [Note: Next, three kinds of forward progress guarantees of varying strength are defined, which specify when implementations have to execute steps and thus have to allow for a program to make progress. end note]

    The implementation should ensure that the steps of a thread of execution providing concurrent forward progress guarantees will eventually be allowed to execute, independently of which steps of other threads of execution might or might not execute. [Note: To eventually fulfill this requirement means that this will happen in an unspecified but finite amount of time. end note] [Note: The threads of execution created by std::thread and the implementation-created thread of execution that executes main provide concurrent forward progress guarantees. end note]

    For a thread of execution providing parallel forward progress guarantees, an implementation is not required to execute steps if it has not yet executed any step; once it has executed a step, the implementation should ensure that it will eventually execute all of the thread of execution's steps, independently of which steps of other threads of execution are or are not executed. [Note: This effectively makes the same progress requirements as for concurrent forward progress guarantees once the parallel-forward-progress thread of execution is started, but does not specify a requirement for when to start this thread of execution; the latter will typically be specified by the entity that creates this thread of execution. For example, executing tasks from a set of tasks in an arbitrary order, one after the other, satisfies the requirements of parallel forward progress for these tasks. end note]

    For a thread of execution providing weakly parallel forward progress guarantees, an implementation is not required to execute steps; however, boost-blocking, as defined below, can be used to ensure that the steps of such threads of execution are executed eventually. [Note: Such threads of execution cannot be expected to execute steps independently of which steps other threads of execution might or might not execute. end note]

    [Note: Concurrent forward progress guarantees are stronger than parallel guarantees, which in turn are stronger than weakly parallel guarantees. For example, some kinds of synchronization between threads of execution may only make forward progress if the respective threads of execution provide parallel forward progress guarantees, but fail to make progress under weakly parallel guarantees. end note]

16 ward progress guarantees, but fail to make progress under weakly parallel guarantees. end note] If a thread of execution P is specified to use boost-blocking to block on the completion of a set S of threads of execution, and if P is providing a stronger forward progress guarantee than at least one thread of execution in S, then throughout the whole time of P being boost-blocked on S, P will boost the progress requirement of at least one thread of execution in S to P s stronger guarantee. Specifically, P is free to select which thread of execution in S to boost and for which number of steps (i.e., the boost is not permanent and in place for the rest of the lifetime of the boosted thread of execution); as long as P is boost-blocked, it has to eventually boost a thread of execution in S. Once a thread of execution in S finishes execution, it is removed from S. Once S is empty, P stops being blocked. [Note: A thread of execution B thus can temporarily provide an effectively stronger forward progress guarantee for a certain amount of time, due to a second thread of execution A being boost-blocked on it. In turn, this may allow B to itself boost the progress of a third thread of execution C by boost-blocking on C. end note] [Note: If all threads of execution in S finish executing (e.g., they terminate and do not use blocking synchronization incorrectly), then P s progress guarantee will not be weakened by executing the boost-blocking operation. end note] [Note: This does not remove any constraints regarding blocking synchronization for threads of execution providing parallel or weakly parallel forward progress guarantees because P is not required to boost a particular thread of execution whose too-weak progress guarantee is preventing overall progress. end note] With these foundations in place, we can then adapt Section as follows. Generally, to avoid misinterpretations, we should replace all occurrences of thread with thread of execution (see N4231 for background). In the third paragraph, specify that threads of execution supplied by the library provide parallel forward progress guarantees: The invocations of element access functions in parallel algorithms invoked with an execution policy object of type parallel execution policy are permitted to execute in an unordered fashion in either the invoking thread of execution or in a thread of execution implicitly created by the library to support parallel algorithm execution; the latter will provide parallel forward progress guarantees. Any such invocations executing in the same thread of execution are indeterminately sequenced with respect to each other. [Note: It is the caller s responsibility to ensure correctness, for example that the invocation does not introduce data races or deadlocks. end note] P0072R0 16

Similarly, specify weakly parallel guarantees in the fourth paragraph. Also, clarify that the affected threads of execution are the invoking thread or managed by the library, and that blocking synchronization is the problematic case, not all kinds of synchronization in general:

    The invocations of element access functions in parallel algorithms invoked with an execution policy of type parallel_vector_execution_policy are permitted to execute in an unordered fashion in unspecified threads of execution, and unsequenced with respect to one another within each thread of execution. These threads of execution are either the invoking thread of execution or implicitly created by the library; the latter will provide weakly parallel forward progress guarantees. [Note: This means that multiple function object invocations may be interleaved on a single thread of execution. end note] [Note: This overrides the usual guarantee from the C++ standard, Section 1.9 [intro.execution], that function executions do not interleave with one another. end note]

    Since parallel_vector_execution_policy allows the execution of element access functions to be interleaved on a single thread of execution, blocking synchronization, including the use of mutexes, risks deadlock. Thus, code with parallel_vector_execution_policy is restricted as follows: A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions may not be invoked by user code called from parallel_vector_execution_policy algorithms.

    [Note: Implementations must ensure that internal synchronization inside standard library routines does not prevent forward progress when those routines are executed by threads of execution with weakly parallel forward progress guarantees. This can be achieved by either avoiding synchronization incompatible with those guarantees, or by executing these routines differently. end note]

I would have liked to make the two paragraphs before the last note more precise; however, we need to use some categorization of synchronization that the current standard gives us, and it doesn't make a distinction between blocking and general synchronization. The changes to the last note attempt to clarify what I understand as the actual takeaway: The deadlock mentioned is a clash between blocking synchronization and weak progress, and implementations can either avoid such blocking or make sure that progress is stronger (e.g., by not vectorizing calls to such routines).
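To make the deadlock scenario concrete, here is an illustrative sketch (invented for this discussion, not part of the proposed wording):

    #include <mutex>

    std::mutex m;

    // Invented element access function. Under
    // parallel_vector_execution_policy, two invocations may be interleaved
    // on one thread of execution:
    //   invocation A: m.lock();
    //   invocation B: m.lock();   // same thread, lock held by A: deadlock
    void element_access(int& x) {
        std::lock_guard<std::mutex> g(m);
        ++x;
    }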

Finally, we need to ensure that threads of execution implicitly created by the library do not weaken the forward progress of the threads of execution invoking the library. Add a new paragraph before paragraph 6:

    If an invocation of a parallel algorithm uses threads of execution implicitly created by the library, then the invoking thread of execution will boost-block on the completion of these library-managed threads of execution. [Note: In boost-blocking in this context, a thread of execution created by the library is considered to have finished execution as soon as it has finished the execution of the particular element access function that the invoking thread of execution logically depends on. end note]

5.1 What if we consider just forward progress?

It may seem surprising that the changes proposed previously do not significantly change the wording regarding which threads can be used to execute a parallel algorithm. This is not because the forward progress definitions would be insufficient, but because I did not want to change the semantics of TLS and lock ownership. If we do not rule out TLS accesses and want to allow typical implementations, we need to cover executions in which element access functions access both the invoking thread's TLS and the TLS of threads managed by the library. Thus, this is a TLS visibility/lifetime issue, not a limitation of the forward progress model.

Allowing functions to be executed in an indeterminately ordered fashion on the same thread is not just needed to allow implementations to do exactly that, but also to indirectly create a forward progress constraint: If the order is not specified, we cannot expect another element access function to start executing eventually (which covers the weaker guarantee of parallel EAs compared to concurrent EAs). The same holds in the specification of the parallel-vector policy.

Basing the specification solely on which threads of execution are used binds thread identity and TLS to progress guarantees: We cannot specify a parallel-vector policy that provides weakly parallel progress for the element access functions but gives each concurrent invocation of such a function its own, non-shared TLS. This is because if this were the case (i.e., visible to the program through different thread identities), a program could infer that there are separate threads, and thus also assume the default progress guarantees for threads of execution specified by the current standard (i.e., concurrent progress).

In contrast, if we were to ignore or disallow TLS and lock ownership in the specification, or were to specify these aspects by assigning TLS support levels to execution agents (see Section 3), we could simply let each element access function run on one conceptual execution agent managed by the implementation. In such a scenario (and note that I'm not proposing this wording for now), we would just need to say the following for the third paragraph (now using execution agents as discussed in the rest of this paper):

    The invocations of element access functions in parallel algorithms invoked with an execution policy object of type parallel_execution_policy will be executed using parallel execution agents. [Note: It is the caller's responsibility to ensure correctness, for example that the invocation does not introduce data races or deadlocks. end note]


More information

Efficient concurrent waiting for C++20

Efficient concurrent waiting for C++20 Doc number: P0514R3 Revises: P0514R2, P0126R2, N4195 Date: 2018-02-10 Project: Programming Language C++, Concurrency Working Group Reply-to: Olivier Giroux Efficient concurrent waiting

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

QUIZ. What is wrong with this code that uses default arguments?

QUIZ. What is wrong with this code that uses default arguments? QUIZ What is wrong with this code that uses default arguments? Solution The value of the default argument should be placed in either declaration or definition, not both! QUIZ What is wrong with this code

More information

Hazard Pointers. Safe Resource Reclamation for Optimistic Concurrency

Hazard Pointers. Safe Resource Reclamation for Optimistic Concurrency Document Number: P0233R3 Date: 2017-02-06 Reply-to: maged.michael@acm.org, michael@codeplay.com Authors: Maged M. Michael, Michael Wong, Paul McKenney, Arthur O'Dwyer, David Hollman Project: Programming

More information

CS5412: TRANSACTIONS (I)

CS5412: TRANSACTIONS (I) 1 CS5412: TRANSACTIONS (I) Lecture XVII Ken Birman Transactions 2 A widely used reliability technology, despite the BASE methodology we use in the first tier Goal for this week: in-depth examination of

More information

OPERATING SYSTEMS. Prescribed Text Book. Operating System Principles, Seventh Edition. Abraham Silberschatz, Peter Baer Galvin and Greg Gagne

OPERATING SYSTEMS. Prescribed Text Book. Operating System Principles, Seventh Edition. Abraham Silberschatz, Peter Baer Galvin and Greg Gagne OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne 1 DEADLOCKS In a multi programming environment, several processes

More information

CS510 Advanced Topics in Concurrency. Jonathan Walpole

CS510 Advanced Topics in Concurrency. Jonathan Walpole CS510 Advanced Topics in Concurrency Jonathan Walpole Threads Cannot Be Implemented as a Library Reasoning About Programs What are the valid outcomes for this program? Is it valid for both r1 and r2 to

More information

Kakadu and Java. David Taubman, UNSW June 3, 2003

Kakadu and Java. David Taubman, UNSW June 3, 2003 Kakadu and Java David Taubman, UNSW June 3, 2003 1 Brief Summary The Kakadu software framework is implemented in C++ using a fairly rigorous object oriented design strategy. All classes which are intended

More information

The Java Memory Model

The Java Memory Model Jeremy Manson 1, William Pugh 1, and Sarita Adve 2 1 University of Maryland 2 University of Illinois at Urbana-Champaign Presented by John Fisher-Ogden November 22, 2005 Outline Introduction Sequential

More information

Safety SPL/2010 SPL/20 1

Safety SPL/2010 SPL/20 1 Safety 1 system designing for concurrent execution environments system: collection of objects and their interactions system properties: Safety - nothing bad ever happens Liveness - anything ever happens

More information

UNIT:2. Process Management

UNIT:2. Process Management 1 UNIT:2 Process Management SYLLABUS 2.1 Process and Process management i. Process model overview ii. Programmers view of process iii. Process states 2.2 Process and Processor Scheduling i Scheduling Criteria

More information

Process Management And Synchronization

Process Management And Synchronization Process Management And Synchronization In a single processor multiprogramming system the processor switches between the various jobs until to finish the execution of all jobs. These jobs will share the

More information

Memory Consistency Models

Memory Consistency Models Memory Consistency Models Contents of Lecture 3 The need for memory consistency models The uniprocessor model Sequential consistency Relaxed memory models Weak ordering Release consistency Jonas Skeppstedt

More information

Parallel Processing with the SMP Framework in VTK. Berk Geveci Kitware, Inc.

Parallel Processing with the SMP Framework in VTK. Berk Geveci Kitware, Inc. Parallel Processing with the SMP Framework in VTK Berk Geveci Kitware, Inc. November 3, 2013 Introduction The main objective of the SMP (symmetric multiprocessing) framework is to provide an infrastructure

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models Review. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed? The safety net between operations whose order needs

More information

Asynchronous I/O: A Case Study in Python

Asynchronous I/O: A Case Study in Python Asynchronous I/O: A Case Study in Python SALEIL BHAT A library for performing await -style asynchronous socket I/O was written in Python. It provides an event loop, as well as a set of asynchronous functions

More information

6.001 Notes: Section 8.1

6.001 Notes: Section 8.1 6.001 Notes: Section 8.1 Slide 8.1.1 In this lecture we are going to introduce a new data type, specifically to deal with symbols. This may sound a bit odd, but if you step back, you may realize that everything

More information

Multitasking / Multithreading system Supports multiple tasks

Multitasking / Multithreading system Supports multiple tasks Tasks and Intertask Communication Introduction Multitasking / Multithreading system Supports multiple tasks As we ve noted Important job in multitasking system Exchanging data between tasks Synchronizing

More information

AST: scalable synchronization Supervisors guide 2002

AST: scalable synchronization Supervisors guide 2002 AST: scalable synchronization Supervisors guide 00 tim.harris@cl.cam.ac.uk These are some notes about the topics that I intended the questions to draw on. Do let me know if you find the questions unclear

More information

Persistent Oberon Language Specification

Persistent Oberon Language Specification Persistent Oberon Language Specification Luc Bläser Institute of Computer Systems ETH Zurich, Switzerland blaeser@inf.ethz.ch The programming language Persistent Oberon is an extension of Active Oberon

More information

1 The comparison of QC, SC and LIN

1 The comparison of QC, SC and LIN Com S 611 Spring Semester 2009 Algorithms for Multiprocessor Synchronization Lecture 5: Tuesday, 3rd February 2009 Instructor: Soma Chaudhuri Scribe: Jianrong Dong 1 The comparison of QC, SC and LIN Pi

More information

Type Checking and Type Equality

Type Checking and Type Equality Type Checking and Type Equality Type systems are the biggest point of variation across programming languages. Even languages that look similar are often greatly different when it comes to their type systems.

More information

CS 167 Final Exam Solutions

CS 167 Final Exam Solutions CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In small multiprocessors, sequential consistency can be implemented relatively easily. However, this is not true for large multiprocessors. Why? This is not the

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

Thread Safety. Review. Today o Confinement o Threadsafe datatypes Required reading. Concurrency Wrapper Collections

Thread Safety. Review. Today o Confinement o Threadsafe datatypes Required reading. Concurrency Wrapper Collections Thread Safety Today o Confinement o Threadsafe datatypes Required reading Concurrency Wrapper Collections Optional reading The material in this lecture and the next lecture is inspired by an excellent

More information

Thirty one Problems in the Semantics of UML 1.3 Dynamics

Thirty one Problems in the Semantics of UML 1.3 Dynamics Thirty one Problems in the Semantics of UML 1.3 Dynamics G. Reggio R.J. Wieringa September 14, 1999 1 Introduction In this discussion paper we list a number of problems we found with the current dynamic

More information

1.1 Jadex - Engineering Goal-Oriented Agents

1.1 Jadex - Engineering Goal-Oriented Agents 1.1 Jadex - Engineering Goal-Oriented Agents In previous sections of the book agents have been considered as software artifacts that differ from objects mainly in their capability to autonomously execute

More information

Operating Systems. Synchronization

Operating Systems. Synchronization Operating Systems Fall 2014 Synchronization Myungjin Lee myungjin.lee@ed.ac.uk 1 Temporal relations Instructions executed by a single thread are totally ordered A < B < C < Absent synchronization, instructions

More information

Operating Systems. Designed and Presented by Dr. Ayman Elshenawy Elsefy

Operating Systems. Designed and Presented by Dr. Ayman Elshenawy Elsefy Operating Systems Designed and Presented by Dr. Ayman Elshenawy Elsefy Dept. of Systems & Computer Eng.. AL-AZHAR University Website : eaymanelshenawy.wordpress.com Email : eaymanelshenawy@yahoo.com Reference

More information

Introduction to Asynchronous Programming Fall 2014

Introduction to Asynchronous Programming Fall 2014 CS168 Computer Networks Fonseca Introduction to Asynchronous Programming Fall 2014 Contents 1 Introduction 1 2 The Models 1 3 The Motivation 3 4 Event-Driven Programming 4 5 select() to the rescue 5 1

More information

Course: Operating Systems Instructor: M Umair. M Umair

Course: Operating Systems Instructor: M Umair. M Umair Course: Operating Systems Instructor: M Umair Process The Process A process is a program in execution. A program is a passive entity, such as a file containing a list of instructions stored on disk (often

More information

Operating Systems Antonio Vivace revision 4 Licensed under GPLv3

Operating Systems Antonio Vivace revision 4 Licensed under GPLv3 Operating Systems Antonio Vivace - 2016 revision 4 Licensed under GPLv3 Process Synchronization Background A cooperating process can share directly a logical address space (code, data) or share data through

More information

Lambda Correctness and Usability Issues

Lambda Correctness and Usability Issues Doc No: WG21 N3424 =.16 12-0114 Date: 2012-09-23 Reply to: Herb Sutter (hsutter@microsoft.com) Subgroup: EWG Evolution Lambda Correctness and Usability Issues Herb Sutter Lambda functions are a hit they

More information

Chapter 2 Overview of the Design Methodology

Chapter 2 Overview of the Design Methodology Chapter 2 Overview of the Design Methodology This chapter presents an overview of the design methodology which is developed in this thesis, by identifying global abstraction levels at which a distributed

More information

Stitched Together: Transitioning CMS to a Hierarchical Threaded Framework

Stitched Together: Transitioning CMS to a Hierarchical Threaded Framework Stitched Together: Transitioning CMS to a Hierarchical Threaded Framework CD Jones and E Sexton-Kennedy Fermilab, P.O.Box 500, Batavia, IL 60510-5011, USA E-mail: cdj@fnal.gov, sexton@fnal.gov Abstract.

More information

Execution Architecture

Execution Architecture Execution Architecture Software Architecture VO (706.706) Roman Kern Institute for Interactive Systems and Data Science, TU Graz 2018-11-07 Roman Kern (ISDS, TU Graz) Execution Architecture 2018-11-07

More information

Efficient waiting for concurrent programs

Efficient waiting for concurrent programs Doc number: P0514R2 Revises: P0514R0-1, P0126R0-2, and N4195 Date: 2017-10-09 Project: Programming Language C++, Concurrency Working Group Reply-to: Olivier Giroux Efficient waiting

More information

CPS221 Lecture: Threads

CPS221 Lecture: Threads Objectives CPS221 Lecture: Threads 1. To introduce threads in the context of processes 2. To introduce UML Activity Diagrams last revised 9/5/12 Materials: 1. Diagram showing state of memory for a process

More information

Operating Systems: Quiz2 December 15, Class: No. Name:

Operating Systems: Quiz2 December 15, Class: No. Name: Operating Systems: Quiz2 December 15, 2006 Class: No. Name: Part I (30%) Multiple Choice Each of the following questions has only one correct answer. Fill the correct one in the blank in front of each

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

CS 241 Honors Concurrent Data Structures

CS 241 Honors Concurrent Data Structures CS 241 Honors Concurrent Data Structures Bhuvan Venkatesh University of Illinois Urbana Champaign March 27, 2018 CS 241 Course Staff (UIUC) Lock Free Data Structures March 27, 2018 1 / 43 What to go over

More information

Implementation of Process Networks in Java

Implementation of Process Networks in Java Implementation of Process Networks in Java Richard S, Stevens 1, Marlene Wan, Peggy Laramie, Thomas M. Parks, Edward A. Lee DRAFT: 10 July 1997 Abstract A process network, as described by G. Kahn, is a

More information

CS201 - Introduction to Programming Glossary By

CS201 - Introduction to Programming Glossary By CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with

More information

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25 PROJECT 6: PINTOS FILE SYSTEM CS124 Operating Systems Winter 2015-2016, Lecture 25 2 Project 6: Pintos File System Last project is to improve the Pintos file system Note: Please ask before using late tokens

More information

Hazard Pointers. Safe Resource Reclamation for Optimistic Concurrency

Hazard Pointers. Safe Resource Reclamation for Optimistic Concurrency Document Number: P0233R1 Date: 2016 05 29 Reply to: maged.michael@acm.org, michael@codeplay.com Authors: Maged M. Michael, Michael Wong Project: Programming Language C++, SG14/SG1 Concurrency, LEWG Hazard

More information

The Dynamic Typing Interlude

The Dynamic Typing Interlude CHAPTER 6 The Dynamic Typing Interlude In the prior chapter, we began exploring Python s core object types in depth with a look at Python numbers. We ll resume our object type tour in the next chapter,

More information

Virtual Machine Design

Virtual Machine Design Virtual Machine Design Lecture 4: Multithreading and Synchronization Antero Taivalsaari September 2003 Session #2026: J2MEPlatform, Connected Limited Device Configuration (CLDC) Lecture Goals Give an overview

More information

Python in the Cling World

Python in the Cling World Journal of Physics: Conference Series PAPER OPEN ACCESS Python in the Cling World To cite this article: W Lavrijsen 2015 J. Phys.: Conf. Ser. 664 062029 Recent citations - Giving pandas ROOT to chew on:

More information

IT 540 Operating Systems ECE519 Advanced Operating Systems

IT 540 Operating Systems ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (5 th Week) (Advanced) Operating Systems 5. Concurrency: Mutual Exclusion and Synchronization 5. Outline Principles

More information

Enhancing std::atomic_flag for waiting.

Enhancing std::atomic_flag for waiting. Doc number: P0514R0 Revises: None. Date: 2016-11-15 Project: Programming Language C++, Concurrency Working Group Reply-to: Olivier Giroux Enhancing std::atomic_flag for waiting. TL;DR

More information

Doc number: P0126R2 Revises: P0126R1, N4195 Date: Project: Reply-to: Thanks-to: A simplifying abstraction

Doc number: P0126R2 Revises: P0126R1, N4195 Date: Project: Reply-to: Thanks-to: A simplifying abstraction Doc number: P0126R2 Revises: P0126R1, N4195 Date: 2016-03-13 Project: Programming Language C++, Concurrency Working Group Reply-to: Olivier Giroux ogiroux@nvidia.com, Torvald Riegel triegel@redhat.com

More information

The Deadlock Lecture

The Deadlock Lecture Concurrent systems Lecture 4: Deadlock, Livelock, and Priority Inversion DrRobert N. M. Watson The Deadlock Lecture 1 Reminder from last time Multi-Reader Single-Writer (MRSW) locks Alternatives to semaphores/locks:

More information

Multiple Inheritance. Computer object can be viewed as

Multiple Inheritance. Computer object can be viewed as Multiple Inheritance We have seen that a class may be derived from a given parent class. It is sometimes useful to allow a class to be derived from more than one parent, inheriting members of all parents.

More information

CMSC421: Principles of Operating Systems

CMSC421: Principles of Operating Systems CMSC421: Principles of Operating Systems Nilanjan Banerjee Assistant Professor, University of Maryland Baltimore County nilanb@umbc.edu http://www.csee.umbc.edu/~nilanb/teaching/421/ Principles of Operating

More information

Operating Systems, Assignment 2 Threads and Synchronization

Operating Systems, Assignment 2 Threads and Synchronization Operating Systems, Assignment 2 Threads and Synchronization Responsible TA's: Zohar and Matan Assignment overview The assignment consists of the following parts: 1) Kernel-level threads package 2) Synchronization

More information

Chapter 6 - Deadlocks

Chapter 6 - Deadlocks Chapter 6 - Deadlocks Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 6 - Deadlocks 1 / 100 1 Motivation 2 Resources Preemptable and Nonpreemptable Resources Resource Acquisition

More information

Reintroduction to Concurrency

Reintroduction to Concurrency Reintroduction to Concurrency The execution of a concurrent program consists of multiple processes active at the same time. 9/25/14 7 Dining philosophers problem Each philosopher spends some time thinking

More information

1 Dynamic Memory continued: Memory Leaks

1 Dynamic Memory continued: Memory Leaks CS104: Data Structures and Object-Oriented Design (Fall 2013) September 3, 2013: Dynamic Memory, continued; A Refresher on Recursion Scribes: CS 104 Teaching Team Lecture Summary In this lecture, we continue

More information

RCU and C++ Paul E. McKenney, IBM Distinguished Engineer, Linux Technology Center Member, IBM Academy of Technology CPPCON, September 23, 2016

RCU and C++ Paul E. McKenney, IBM Distinguished Engineer, Linux Technology Center Member, IBM Academy of Technology CPPCON, September 23, 2016 Paul E. McKenney, IBM Distinguished Engineer, Linux Technology Center Member, IBM Academy of Technology CPPCON, September 23, 2016 RCU and C++ What Is RCU, Really? Publishing of new data: rcu_assign_pointer()

More information

Object Oriented Paradigm

Object Oriented Paradigm Object Oriented Paradigm Ming-Hwa Wang, Ph.D. Department of Computer Engineering Santa Clara University Object Oriented Paradigm/Programming (OOP) similar to Lego, which kids build new toys from assembling

More information

N1793: Stability of indeterminate values in C11

N1793: Stability of indeterminate values in C11 N1793: Stability of indeterminate values in C11 Robbert Krebbers and Freek Wiedijk Radboud University Nijmegen, The Netherlands Abstract. This paper document N1793 of WG 14 proposes and argues for a specific

More information

* What are the different states for a task in an OS?

* What are the different states for a task in an OS? * Kernel, Services, Libraries, Application: define the 4 terms, and their roles. The kernel is a computer program that manages input/output requests from software, and translates them into data processing

More information

Heterogeneous-Race-Free Memory Models

Heterogeneous-Race-Free Memory Models Heterogeneous-Race-Free Memory Models Jyh-Jing (JJ) Hwang, Yiren (Max) Lu 02/28/2017 1 Outline 1. Background 2. HRF-direct 3. HRF-indirect 4. Experiments 2 Data Race Condition op1 op2 write read 3 Sequential

More information

Adding std::*_semaphore to the atomics clause.

Adding std::*_semaphore to the atomics clause. Doc number: P0514R1 Revises: P0514R0 Date: 2017-6-14 Project: Programming Language C++, Concurrency Working Group Reply-to: Olivier Giroux Adding std::*_semaphore to the atomics clause.

More information

The goal of the Pangaea project, as we stated it in the introduction, was to show that

The goal of the Pangaea project, as we stated it in the introduction, was to show that Chapter 5 Conclusions This chapter serves two purposes. We will summarize and critically evaluate the achievements of the Pangaea project in section 5.1. Based on this, we will then open up our perspective

More information

CC and CEM addenda. Exact Conformance, Selection-Based SFRs, Optional SFRs. May Version 0.5. CCDB xxx

CC and CEM addenda. Exact Conformance, Selection-Based SFRs, Optional SFRs. May Version 0.5. CCDB xxx CC and CEM addenda Exact Conformance, Selection-Based SFRs, Optional SFRs May 2017 Version 0.5 CCDB-2017-05-xxx Foreword This is a DRAFT addenda to the Common Criteria version 3.1 and the associated Common

More information

Synchronization. CS61, Lecture 18. Prof. Stephen Chong November 3, 2011

Synchronization. CS61, Lecture 18. Prof. Stephen Chong November 3, 2011 Synchronization CS61, Lecture 18 Prof. Stephen Chong November 3, 2011 Announcements Assignment 5 Tell us your group by Sunday Nov 6 Due Thursday Nov 17 Talks of interest in next two days Towards Predictable,

More information

CHAPTER 6: PROCESS SYNCHRONIZATION

CHAPTER 6: PROCESS SYNCHRONIZATION CHAPTER 6: PROCESS SYNCHRONIZATION The slides do not contain all the information and cannot be treated as a study material for Operating System. Please refer the text book for exams. TOPICS Background

More information

Operating Systems Design Fall 2010 Exam 1 Review. Paul Krzyzanowski

Operating Systems Design Fall 2010 Exam 1 Review. Paul Krzyzanowski Operating Systems Design Fall 2010 Exam 1 Review Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 To a programmer, a system call looks just like a function call. Explain the difference in the underlying

More information

Fixing Atomic Initialization, Rev1

Fixing Atomic Initialization, Rev1 Project: ISO JTC1/SC22/WG21: Programming Language C++ Doc No: WG21 P0883R1 Date: 2018-06-05 Reply to: Nicolai Josuttis (nico@josuttis.de) Audience: SG1, LEWG, LWG Prev. Version: P0883R0 Fixing Atomic Initialization,

More information

Using a waiting protocol to separate concerns in the mutual exclusion problem

Using a waiting protocol to separate concerns in the mutual exclusion problem Using a waiting protocol to separate concerns in the mutual exclusion problem Frode V. Fjeld frodef@cs.uit.no Department of Computer Science, University of Tromsø Technical Report 2003-46 November 21,

More information

Threads Chapter 5 1 Chapter 5

Threads Chapter 5 1 Chapter 5 Threads Chapter 5 1 Chapter 5 Process Characteristics Concept of Process has two facets. A Process is: A Unit of resource ownership: a virtual address space for the process image control of some resources

More information

The Bizarre Truth! Automating the Automation. Complicated & Confusing taxonomy of Model Based Testing approach A CONFORMIQ WHITEPAPER

The Bizarre Truth! Automating the Automation. Complicated & Confusing taxonomy of Model Based Testing approach A CONFORMIQ WHITEPAPER The Bizarre Truth! Complicated & Confusing taxonomy of Model Based Testing approach A CONFORMIQ WHITEPAPER By Kimmo Nupponen 1 TABLE OF CONTENTS 1. The context Introduction 2. The approach Know the difference

More information

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Multiple Processes OS design is concerned with the management of processes and threads: Multiprogramming Multiprocessing Distributed processing

More information

Unit 3 : Process Management

Unit 3 : Process Management Unit : Process Management Processes are the most widely used units of computation in programming and systems, although object and threads are becoming more prominent in contemporary systems. Process management

More information