Scalable Flat-Combining Based Synchronous Queues

Danny Hendler 1, Itai Incze 2, Nir Shavit 2,3 and Moran Tzafrir 2
1 Ben-Gurion University  2 Tel-Aviv University  3 Sun Labs at Oracle

Abstract. In a synchronous queue, producers and consumers handshake to exchange data. Recently, new scalable unfair synchronous queues were added to the Java JDK 6.0 to support high-performance thread pools. This paper applies flat combining to the problem of designing a synchronous queue algorithm. We first use the original flat-combining algorithm, in which a single combiner thread acquires a global lock and services the other threads' combined requests with very low synchronization overheads. As we show, this single-combiner approach delivers superior performance up to a certain level of concurrency, but unfortunately does not continue to scale beyond that point. In order to continue to deliver scalable performance as concurrency increases, we introduce a new parallel flat-combining algorithm. The new algorithm dynamically adds additional concurrently executing flat combiners that coordinate their work. It enjoys the low coordination overheads of sequential flat combining, with the added scalability that comes with parallelism. Our novel unfair synchronous queue using parallel flat combining exhibits scalability far beyond that of the JDK 6.0 algorithm: it matches it in the case of a single producer and consumer, and is superior throughout the concurrency range, delivering up to 11 (eleven) times the throughput at high concurrency.

1 Introduction

This paper presents a new highly scalable design of an unfair synchronous queue, a fundamental concurrent data structure used in a variety of concurrent programming settings. In many applications, one or more producer threads produce items to be consumed by one or more consumer threads. These items may be jobs to perform, keystrokes to interpret, purchase orders to execute, or packets to decode. As noted in [7], many applications require poll and offer operations, which take an item only if a producer is already present, or put an item only if a consumer is already waiting (otherwise, these operations return an error). The synchronous queue provides such a pairing up of items, without buffering; it is entirely symmetric: producers and consumers wait for one another, rendezvous,

and leave in pairs. The term unfair refers to the fact that the queue is actually a pool [5]: it does not impose an order on the servicing of requests, and permits starvation. Previous synchronous queue algorithms were presented by Hanson [3], by Scherer, Lea and Scott [8, 7, 9], and by Afek et al. [1]. A survey of past work on synchronous queues can be found in [7]. New scalable implementations of synchronous queues were recently introduced by Scherer, Lea, and Scott [7] into the Java 6.0 JDK, available on more than 10 million desktops. They showed that the unfair version of a synchronous queue delivers scalable performance, both in general and when used to implement the JDK's thread pools.

In a recent paper [4], we introduced a new synchronization paradigm called flat combining (FC). At the core of flat combining is a low-cost way to allow a single combiner thread at a time to acquire a global lock on the data structure, learn about all concurrent access requests by other threads, and then perform their combined requests on the data structure. This technique has the dual benefit of reducing the synchronization overhead on hot shared locations, and at the same time reducing the overall cache invalidation traffic on the structure. The effect of these reductions is so dramatic that, in a kind of anti-Amdahl's-law effect, they outweigh the loss of parallelism caused by allowing only one combiner thread at a time to manipulate the structure.

This paper applies the flat-combining technique to the synchronous queue implementation problem. We begin by presenting a scalable flat-combining implementation of an unfair synchronous queue using the technique suggested in [4]. As we show, this implementation outperforms the new Java 6.0 JDK implementation at all concurrency levels (by a factor of up to 3 on a Sun Niagara 64-way multicore). However, it does not continue to scale beyond some point, because in the end it is based on a single sequentially executing combiner thread that executes the operations of all others.

Our next implementation, and the core result in this paper, is a synchronous queue based on parallel flat-combining, a new flat combining algorithm in which multiple instances of flat combining are executed in parallel in a coordinated fashion. The parallel flat-combining algorithm spawns new instances of flat-combining dynamically as concurrency increases, and folds them as it decreases. The key problem one faces in such a scheme is how to deal with imbalances: in a synchronous queue one must pair up requests, without buffering the imbalances that might occur. Our solution is a dynamic two-level exchange mechanism that manages to take care of imbalances with very little overhead, a crucial property for making the algorithm work at both high and low load, and at both even and uneven distributions. We note that a synchronous queue, in particular a parallel one, requires a higher level of coordination from a combiner than that of the queues, stacks, or priority queues implemented in our previous paper [4], which introduced flat combining. This is because the lack of buffering means there is no "slack": the combiner must actually match threads up before releasing them.

As we show, our parallel flat-combining implementation of an unfair synchronous queue outperforms the single combiner, continuing to improve with

the level of concurrency. On a Sun Niagara 64-way multicore, it reaches up to 11 times the throughput of the JDK implementation at 64 threads.

Fig. 1. A synchronized-queue using a single-combiner flat-combining structure. Each record in the publication list is local to a given thread. The thread writes and spins on the request field in its record. Records not recently used are once in a while removed by a combiner, forcing such threads to re-insert a record into the publication list when they next access the structure. (The figure's callouts: (1) a thread writes its push or pop request and spins on its local record; (2) a thread acquires the lock, becomes the combiner, and updates the count; (3) the combiner traverses the publication list, collecting requests into its private stack and matching them to other requests along the list; (4) infrequently, new records are CASed by threads to the head of the list, and old ones are removed by the combiner.)

The rest of this paper is organized as follows. We outline the basic sequential flat-combining algorithm and describe how it can be used to implement a synchronous queue in Section 2. In Section 3, we describe our parallel FC synchronous queue implementation. Section 4 reports on our experimental evaluation. We conclude the paper in Section 5 with a short discussion of our results.

2 A Synchronous Queue Using Single-Combiner Flat-Combining

In a previous paper [4], we showed how, given a sequential data structure D, one can design a (single-combiner) flat combining (henceforth FC) concurrent implementation of the structure. For presentation completeness, we outline this basic sequential flat combining algorithm in this section and describe how we use it to implement a synchronous queue. Then, in the next section, we present our parallel FC algorithm, which is based on running multiple dynamically maintained instances of this basic algorithm in a two-level hierarchy.

As depicted in Figure 1, to implement a single instance of FC, a few structures are added to a sequential structure D: a global lock, a count of the number of combining passes, and a pointer to the head of a publication list. The publication list is a list of thread-local records of a size proportional to the number of threads that are concurrently accessing the shared object. Though one could implement the list in an array, the dynamic publication list using thread-local pointers is

necessary for a practical solution: because the number of potential threads is unknown and typically much greater than the array size, using an array one would have had to solve a renaming problem [2] among the threads accessing it. This would imply a CAS per location, which would give us little advantage over existing techniques.

Each thread t accessing the structure to perform an invocation of some method m on the shared object executes the following sequence of steps. We describe only the ones important for coordination, so as to keep the presentation as simple as possible. The following, then, is the single-combiner algorithm for a given thread executing a method m:

1. Write the invocation opcode and parameters (if any) of the method m to be applied sequentially to the shared object in the request field of your thread-local publication record (no need to use a load-store memory barrier). The request field will later be used to receive the response. If your thread-local publication record is marked as active, continue to step 2; otherwise continue to step 5.

2. Check if the global lock is taken. If so (another thread is an active combiner), spin on the request field waiting for a response to the invocation (one can add a yield at this point to allow other threads on the same core to run). Once in a while, while spinning, check if the lock is still taken and that your record is active. If your record is inactive, proceed to step 5. Once the response is available, reset the request field to null and return the response.

3. If the lock is not taken, attempt to acquire it and become a combiner. If you fail, return to spinning in step 2.

4. Otherwise, you hold the lock and are a combiner. Increment the combining pass count by one. Traverse the publication list (our algorithm guarantees that this is done in a wait-free manner) from the publication list head, combining all nonnull method call invocations, setting the age of each of these records to the current count, applying the combined method calls to the structure D, and returning responses to all the invocations. If the count is such that a cleanup needs to be performed, traverse the publication list from the head. Starting from the second item (as we explain below, we always leave the item pointed to by the head in the list), remove from the publication list all records whose age is much smaller than the current count. This is done by removing the record and marking it as inactive. Release the lock.

5. If you have no thread-local publication record, allocate one, marked as active. If you already have one marked as inactive, mark it as active. Execute a store-load memory barrier. Proceed to insert the record into the list by repeatedly attempting to perform a successful CAS to the head. If and when you succeed, proceed to step 1.
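To make these steps concrete, the following Java sketch shows one possible shape of the publication record and the thread-side protocol. All names (PubRecord, FlatCombiner, apply, enlist) are ours, chosen for illustration; this is a minimal sketch of steps 1-5 above under those naming assumptions, not the actual implementation.

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicReference;

    // Thread-local publication record (hypothetical field names).
    final class PubRecord {
        static final int PUSH = 1, POP = 2;
        volatile int op;          // pending opcode; 0 means "no pending request"
        volatile Object param;    // the pushed item, if any
        volatile Object response; // written by the combiner, reset by the owner
        volatile boolean active;  // cleared when a cleanup pass evicts the record
        volatile int age;         // combining-pass count of the last service
        PubRecord next;           // list link; only the lock holder unlinks records
    }

    final class FlatCombiner {
        final AtomicReference<PubRecord> head = new AtomicReference<>();
        final AtomicBoolean lock = new AtomicBoolean(false);
        int passCount; // protected by 'lock'

        // Step 5: (re)insert a record by repeatedly CASing it onto the list head.
        void enlist(PubRecord rec) {
            rec.active = true;
            PubRecord h;
            do {
                h = head.get();
                rec.next = h;
            } while (!head.compareAndSet(h, rec));
        }

        // Steps 1-4: publish a request, then spin or combine until it is answered.
        Object apply(PubRecord rec, int op, Object param) {
            rec.param = param;
            rec.op = op;                                  // step 1: publish the request
            while (true) {
                if (!rec.active) enlist(rec);             // evicted: back via step 5
                Object r = rec.response;
                if (r != null) {                          // step 2: a combiner answered
                    rec.response = null;
                    return r;
                }
                if (!lock.get() && lock.compareAndSet(false, true)) {
                    try {
                        combinePass();                    // steps 3-4: we are the combiner
                    } finally {
                        lock.set(false);
                    }
                }
            }
        }

        void combinePass() { /* the matching pass; sketched later in this section */ }
    }

The combiner's matching pass (combinePass) is specific to the synchronous queue and is sketched at the end of this section.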

Records are added using a CAS only to the head of the list, and so a simple wait-free traversal by the combiner is trivial to implement [5]. Thus, removal will not require any synchronization as long as it is not performed on the record pointed to from the head: the continuation of the list past this first record is only ever changed by the thread holding the global lock. Note that the first item is not an anchor or dummy record; we are simply not removing it. Once a new record is inserted, if it is unused it will be removed, and even if no new records are added, leaving it in the list will not affect performance.

The common case for a thread is that its record is active and some other thread is the combiner, so it completes in step 2 after having performed only a store and a sequence of loads ending with a single cache miss. This is supported by the empirical data presented later.

Our implementation of the FC mechanism allows us to provide the same clean concurrent object-oriented interface as used in the Java concurrency package [6] and similar C++ libraries [13], and the same consistency guarantees. We note that the Java concurrency package supports a time-out capability that allows operations awaiting a response to give up after a certain elapsed time. It is straightforward to modify the push and pop operations we support in our implementation into dual operations and to add a time-out capability. However, for the sake of brevity, we do not describe the implementation of these features in this extended abstract.

To access the synchronous queue, a thread t posts the respective request pair <PUSH,v> or <POP,0> to its publication record and follows the FC algorithm. As seen in Figure 1, to implement the synchronous queue, the combiner keeps a private stack in which it records push and pop requests (and, for each, also the publication record of the thread that requested it). As the combiner traverses the publication list, it compares each requested operation to the top operation in the private stack. If the operations are complementary, the combiner provides the requestor and the thread with the complementary operation with their appropriate responses, and releases them both. It then pops the operation from the top of the stack, and continues to the next record in the publication list. The stack can thus alternately hold a sequence of pushes or a sequence of pops, but never a mix of both. In short, during a single pass over the publication list, the combiner matches up as best as possible all the push and pop pairs it encountered (a sketch of this matching pass appears at the end of this section). The operations remaining in the stack upon completion of the traversal are, in a sense, the overflow requests of a certain type that were not serviced during the current combining round and will remain for the next.

In Section 4, a single instance of the single-combiner FC algorithm is shown to provide a synchronous queue with performance superior to that of the JDK 6.0 implementation, but it does not continue to scale beyond a certain number of threads. To overcome this limitation, we now describe how to implement a highly scalable generalization of the FC paradigm using multiple concurrent instances of the single-combiner algorithm.
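To make the matching pass concrete, here is a possible body for the combinePass method of the FlatCombiner sketch above (again our own names, not the paper's code). It pairs complementary requests along the publication list using the private stack that, as noted, only ever holds requests of a single type.

    import java.util.ArrayDeque;

    // The body of combinePass() for the FlatCombiner sketch above.
    void combinePass() {
        passCount++;
        ArrayDeque<PubRecord> stack = new ArrayDeque<>(); // the combiner's private stack
        for (PubRecord r = head.get(); r != null; r = r.next) {
            if (r.op == 0) continue;                      // no pending request here
            PubRecord top = stack.peek();
            if (top == null || top.op == r.op) {
                stack.push(r);                            // same type: hold it for later
                continue;
            }
            // Complementary pair: hand the item across and release both threads.
            stack.pop();
            PubRecord push = (r.op == PubRecord.PUSH) ? r : top;
            PubRecord pop = (r.op == PubRecord.POP) ? r : top;
            r.age = passCount;
            top.age = passCount;
            Object item = push.param;
            push.op = 0;
            pop.op = 0;
            push.response = Boolean.TRUE; // the push rendezvoused successfully
            pop.response = item;          // the pop receives the pushed item
        }
        // Requests left in 'stack' are this round's overflow of a single type;
        // they stay published and are examined again on the next combining pass.
    }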

Fig. 2. A synchronized-queue based on parallel flat combining. There are two main interconnected elements: the dynamic FC structure and the exchange FC structure. As can be seen, in this example there are three combiner sublists with approximately 4 records per sublist. Each of the combiners also has a record in the exchanger FC structure. (The figure shows: the head of the dynamic FC publication list; three combiner nodes, each with a count and a timestamp, heading the sublists of threads A-G; lists of extra requests attached to the combiners; and the exchange FC structure with its own lock, count, and publication list.)

3 Parallel Flat Combining

In this section we provide a description of the parallel flat combining algorithm. We extend the single-combiner algorithm to multiple parallel combiners in the following way. We use two types of flat-combining coordination structures, depicted in Figure 2. The first is a dynamic FC structure that has the ability to split the publication list into shorter sublists when its length passes a certain threshold, and to collapse publication sublists if their lengths go below a certain threshold. The second is an exchange single-combiner FC structure that implements a synchronous queue in which each request can involve a collection of several push or pop requests. Each of the multiple combiners that operate on sublists of the dynamic FC structure may fail to match all its requests and be left with an overflow of operations of the same type. The exchange FC structure is used for trying to match these overflows. The key technical challenge of the new algorithm is to allow coordination between the two levels of structures: multiple parallel dynamically-created single-combiner sublists, and the exchange structure that deals with their overflows. Any significant overhead in this mechanism will result in a performance deterioration that will make the scheme as a whole work poorly.
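The following sketch shows one way the two-level layout can be represented, reusing the PubRecord and FlatCombiner types from the Section 2 sketch. The field names are ours and hypothetical; the actual implementation differs.

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of the two-level layout (hypothetical names).
    final class CombinerNode {
        final AtomicBoolean lock = new AtomicBoolean(false);
        final AtomicReference<PubRecord> subHead = new AtomicReference<>(); // this sublist
        volatile int count;         // approximate number of records in the sublist
        volatile long tstamp;       // last combining activity; used when folding sublists
        volatile CombinerNode next; // the previously created combiner node
    }

    final class ParallelFC {
        // Head of the chain of combiner nodes; new nodes are CASed on as load grows.
        final AtomicReference<CombinerNode> combiners =
                new AtomicReference<>(new CombinerNode());
        // The exchange is a single-combiner FC synchronous queue whose requests
        // are whole lists of leftover push requests or leftover pop requests.
        final FlatCombiner exchange = new FlatCombiner();
    }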

Each record in the dynamic flat-combining publication list is augmented with a pointer (not shown in Figure 2) to a special combiner node that contains the lock and other accounting data associated with the sublist currently associated with the record; the request itself is now also a separate structure pointed to by the publication record (this structural change is not depicted in Figure 2). Initially there is a single combiner node, and the initial publication records are added to the list starting with this node and point to it. Each thread t performing an invocation of a push or a pop starts by accessing the head of the dynamic FC publication list and executes the following sequence of steps:

1. Write the invocation opcode of the operation and its parameters (if any) to a newly created request structure. If your thread-local publication record is marked as active, continue to step 3; otherwise mark it as active and continue to step 2.

2. Your publication record is not in the list: count the number of records in the sublist pointed to by the currently first combining node in the dynamic FC structure. If less than the threshold (in our case 8, chosen statically based on empirical testing), set the combiner pointer of your publication record to this combining node, and try to CAS yourself into this sublist. Otherwise, create a new combiner node pointing to the currently first combiner node, and try to CAS the head pointer to point to the new combiner node. Repeat the above steps until your record is in the list, and then proceed to step 3. (A sketch of this step appears after this list.)

3. Check if the lock associated with the combiner node pointed at by your publication record is taken. If so, proceed similarly to step 2 of the single-FC algorithm: spin on your request structure waiting for a response and, once in a while, check if the lock is still taken and that your publication record is active. If your response is available, reset the field to null and return the response. If your record is inactive, mark it as active and proceed to step 2; if the lock is not taken, try to capture it and, if you succeed, proceed to step 4.

4. You are now a combiner in your sublist: run the combiner algorithm using a local stack, matching pairs of requests in your sublist. If, after a few rounds of combining, there are leftover requests in the stack that cannot be matched, access the exchanger FC structure, creating a record pointing to a list of the excess request structures, and add it to the exchanger's publication list using the single-FC algorithm. The excess request structures are no longer pointed at from the corresponding records of the dynamic FC list.

5. If you become a combiner in the exchanger FC structure, traverse the exchanger publication list using the single-combiner algorithm. However, in this single-combiner algorithm, each record points to a list of overflow requests placed by a combiner of some dynamic list, and so you must either match (in case of having previously pushed counter-requests) or push (in other cases) all items in each list before signaling that the request is complete. This task is simplified by the fact that the requests will always be all pushes or all pops (since otherwise they would have been matched in the dynamic list and never posted to the exchange).
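Step 2 above, joining the first sublist or splitting off a new one, might look as follows under the same naming assumptions. We additionally assume that PubRecord carries a combiner field pointing to its sublist's node; this is our illustration of the step, not the paper's code.

    // Sketch of step 2 (hypothetical names; threshold 8 as in the paper).
    static final int SPLIT_THRESHOLD = 8;

    static void enlist(ParallelFC fc, PubRecord rec) {
        while (true) {
            CombinerNode c = fc.combiners.get();
            if (c.count < SPLIT_THRESHOLD) {
                rec.combiner = c;                     // remember which sublist we joined
                PubRecord h = c.subHead.get();
                rec.next = h;
                if (c.subHead.compareAndSet(h, rec)) {
                    c.count++;                        // racy, but only a heuristic
                    return;                           // in the list: proceed to step 3
                }
            } else {
                CombinerNode fresh = new CombinerNode();
                fresh.next = c;                       // link to the current first node
                fc.combiners.compareAndSet(c, fresh); // on failure, re-read the head
            }
        }
    }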

In our implementation we chose to split lists so that they contain approximately 8 threads each (in Figure 2 the threshold is 4). Making the lengths of the sublists and the thresholds vary dynamically in a reactive manner is a subject for further research. For lack of space, detailed pseudo-code of our algorithms is not presented in this extended abstract and will appear in the full paper.

3.1 Correctness

Though there is no obvious way to specify the rendezvous property of synchronous queues using a sequential specification, it is straightforward to see that our implementation is linearizable to the next closest thing: an object whose histories consist of a sequence of pairs, each consisting of a push followed by a pop of the matching value (i.e., push, pop, push, pop, ...). This follows immediately because each thread only leaves the structure after the flat combiner has matched it to a complementary concurrent operation, and we can linearize the operations at the point of the release of the first of them by the flat combiner. In terms of robustness, our flat combining implementation is as robust as any global-lock-based data structure: in both cases, a thread holding the lock could be preempted, causing all others to wait. (4)

4 Performance Evaluation

For our experiments we used two machines. The first is an Oracle 64-way Niagara II multicore machine with 8 SPARC cores that multiplex 8 hardware threads each and share an on-chip L2 cache. The second is an Intel Nehalem 8-way machine, with 4 cores that each multiplex 2 hardware threads. We ran the benchmarks in Java using the Java 6.0 JDK. In the figures we refer to these two architectures as SPARC and INTEL, respectively.

Our empirical evaluation is based on comparing the relative performance of our new flat-combining implementations to the most efficient known synchronous queue implementations: the current Java 6.0 JDK java.util.concurrent implementation of the unfair synchronous queue, and the recently introduced elimination-diffraction trees of Afek et al. [1]. The JDK algorithm, due to Scherer, Lea, and Scott [7], was recently added to Java 6.0 and was reported to provide a threefold improvement over the Java 5.0 unfair synchronous queue implementation. The JDK implementation uses a lock-free linked-list-based stack in the style of Treiber [12] in order to queue either producer requests or consumer requests, but never both at the same time.

(4) On modern operating systems such as Solaris(TM), one can use mechanisms such as schedctl to control the quantum allocated to a combining thread, significantly reducing the chances of it being preempted.

Fig. 3. A Synchronous Queue Benchmark with N consumers and N producers. The graphs show throughput, average CAS failures, average CAS successes, and L2 cache misses (all but the throughput are logarithmic and per operation). We show the throughput graphs for the JDK 6.0 algorithm with parks, to make it clear that removing the parks helps performance in this benchmark.

Whenever a request appears, the queue is examined: if it is empty, or holds nodes with the same type of requested operation, the request is enqueued using a CAS to the top of the list. Otherwise, the requested operation at the top of the stack is popped using a CAS operation on the head end of the list. This JDK version must provide the option to park a thread (i.e., context-switch it out) while it waits for its chance to rendezvous in the queue. A park operation is costly and involves a system call. This, as our graphs show, hurts the JDK algorithm's performance. To make it clear that the performance improvement we obtain relative to the JDK synchronous queue is not a result of the park calls, we implemented a version of the JDK algorithm with the parks neutralized, and use it in our comparison. (A simplified sketch of this dual-stack behavior appears below.)

The elimination-diffracting tree [1] (henceforth called ED-Tree), recently introduced by Afek et al., is a distributed data structure that can be viewed as a combination of an elimination tree [10] and a diffracting tree [11]. ED-Trees are randomized data structures that distribute concurrent thread requests onto the nodes of a binary tree consisting of balancers (see [11]).
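For illustration, the dual-stack behavior described above can be rendered as the following simplified sketch. This is our own rendition of that behavior, not the actual java.util.concurrent code, which is considerably more involved (it also handles timeouts, cancellation, and parking). Items must be non-null here, since null serves as the "no match yet" sentinel.

    import java.util.concurrent.atomic.AtomicReference;

    // Simplified rendition of the Treiber-style dual stack behavior.
    final class DualStack {
        static final class Node {
            final boolean isProducer;
            final Object item;          // null for consumers
            volatile Object match;      // set when a partner rendezvouses with us
            final Node below;
            Node(boolean isProducer, Object item, Node below) {
                this.isProducer = isProducer; this.item = item; this.below = below;
            }
        }

        final AtomicReference<Node> top = new AtomicReference<>();

        Object rendezvous(boolean producer, Object item) {
            while (true) {
                Node t = top.get();
                if (t == null || t.isProducer == producer) {
                    // Empty stack, or same-type requests queued: push and wait.
                    Node n = new Node(producer, item, t);
                    if (top.compareAndSet(t, n)) {
                        while (n.match == null) { } // spin; the JDK parks instead
                        return n.match;
                    }
                } else if (top.compareAndSet(t, t.below)) {
                    // Complementary request on top: pop it and exchange values.
                    t.match = producer ? item : Boolean.TRUE;
                    return producer ? Boolean.TRUE : t.item;
                }
            }
        }
    }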

Fig. 4. Concurrent synchronous queue implementations with N consumers and 1 producer: throughput, average CAS failures, average CAS successes, and L2 cache misses (all but throughput are logarithmic and per operation). Again, we show the throughput graphs for the JDK 6.0 algorithm with parks, to make it clear that removing the parks helps performance in this benchmark.

We compared the JDK and an ED-Tree-based implementation of a synchronous queue to two versions of flat-combining-based synchronous queues: an FC synchronous queue using a single FC mechanism (denoted in the graphs as FC single), and our dynamic parallel version, denoted as FC parallel. The FC parallel threshold was set to spawn a new combiner sublist for every 8 new threads.

4.1 Producer-Consumer Benchmarks

Our first benchmark, whose results are presented in Figure 3, is similar to the one in [7]. Producer and consumer threads continuously push and pop items from the queues. In the throughput graph in the upper left-hand corner, one can clearly see that both FC implementations outperform the JDK's algorithm even with the park operations removed. Thus, in the remaining sections, we no longer compare against the inferior JDK 6.0 algorithm with parks. As can be seen, the throughput of the FC single version exceeds the JDK's throughput by almost 80% at 16 threads, and remains better as concurrency levels grow

up to 64. However, because there is only a single combiner, the FC single algorithm does not continue to scale beyond 16 threads. The FC parallel version, on the other hand, performs the same as the JDK up to 8 threads, and from 16 threads onwards it continues to scale, reaching peak performance at 48 threads. At 64 threads, the FC parallel algorithm is about 11 times faster than the JDK.

The explanation for this performance becomes clear when one examines the other three graphs in Figure 3. The numbers of both successful and failed CAS operations in the FC algorithms are orders of magnitude lower than in the JDK (the scale is logarithmic), as is the number of L2 cache misses. The gap between FC parallel and both FC single and the JDK continues to grow: its overall cache-miss levels are consistently lower than theirs, and its CAS levels remain an order of magnitude smaller than theirs as parallelism increases. The low cache-miss rates of FC parallel when compared to FC single can be attributed to the fact that the combining list is divided and thus shorter, having a better chance of staying in cache. This explains why the FC parallel algorithm continues to improve while the others slowly deteriorate as concurrency increases. At the highest concurrency levels, however, FC parallel's throughput also starts to decline, since the cost incurred by a combiner thread that accesses the exchange increases as more combiner threads contend for it.

Since the ED-Tree algorithm is based on a static tree (that is, a tree whose size is proportional to the number of threads sharing the implementation rather than to the number of threads that actually participate in the execution), it incurs significant overheads and has the worst performance among all evaluated algorithms at low concurrency levels. However, for larger numbers of threads, the high parallelism and low contention provided by the ED-Tree allow it to scale significantly up to 16 threads, and to sustain (and even slightly improve) its peak performance at higher concurrency levels. From 16 threads onwards, the performance of the ED-Tree exceeds that of the JDK, and it is almost 5-fold faster than the JDK at 64 threads. Both FC algorithms are superior to the ED-Tree at all concurrency levels. Since the ED-Tree is highly parallel, the gap between its performance and that of FC single decreases as concurrency increases, and at 64 threads their performance is almost the same. The FC parallel implementation, on the other hand, outperforms the ED-Tree implementation by a wide margin at all concurrency levels, providing almost three times its throughput at 48 threads.

Here too, the performance differences become clearer when we examine the CAS and cache-miss statistics (see Figure 3). Similarly to the JDK, the ED-Tree algorithm performs a relatively high number of successful CAS operations but, since its operations are spatially spread across the nodes of the tree, it incurs a much smaller number of failed CAS operations. The number of cache misses it incurs is close to that of the FC single implementation, yet is about 10-fold higher than that of the FC parallel implementation at high concurrency levels.

Figure 5-(a) shows the throughput on the Intel Nehalem architecture.

Fig. 5. (a) Throughput on the Intel architecture; (b) decreasing arrival rate on SPARC; (c) decreasing arrival rate on Intel; (d) burst test throughput on SPARC; (e) burst test throughput on Intel; (f) worst-case vs. optimal distribution of producers and consumers.

As can be seen, the behavior is similar to that on SPARC up to 8 threads (recall that the Nehalem has 8 hardware threads): the FC single algorithm performs better than FC parallel, and both FC algorithms significantly outperform the JDK and ED-Tree algorithms. The cache-miss and CAS-rate graphs, not shown for lack of space, paint a similar picture.

Our next benchmark, shown in Figure 4, has a single producer and multiple consumers. This is not a typical use of a synchronous queue, since there is much waiting and little possibility of parallelism. However, this benchmark, introduced in [7], is a good stress test. In the throughput graph in Figure 4, one can see that the imbalance in the single-producer test stresses the FC algorithms, making the performance of the FC parallel algorithm more or less the same as that of the JDK. However, we find it encouraging that an algorithm that can deliver up to 11 times the performance of the JDK in the balanced case delivers comparable performance when there is a great imbalance between producers and consumers.

What is the explanation for this behavior? In this benchmark there is a fundamental lack of parallelism, even as the number of threads grows: in all of the algorithms, all of the threads but two (the producer and its previously matched consumer) cannot make progress. Recall that FC wins by having a single low-overhead pass over the list service multiple threads. With this in mind, consider that in the single-FC case, for every time the lock is acquired, about two requests are answered, and yet all the threads are continuously polling the lock. This explains the high cache-invalidation rates, which, together with a longer publication list being traversed each time, explains why the FC single throughput deteriorates. For the FC parallel algorithm, we notice that its failed-CAS and cache-miss rates are quite similar to those of the JDK. The FC parallel algorithm keeps cache misses and failed-CAS rates lower than FC single because threads are distributed over multiple locks, and after failing as a combiner a thread goes to the exchange. In most lists no combining takes place, and requests are matched at the exchange level (an indication of this is the successful-CAS rate, which is close to 1), not in the lists. Combiners accessing the exchange take longer to release their list-ownership locks, and therefore cause other threads fewer cache misses and failed CAS operations. The exchange itself is again a single-combiner situation (accessed by only a fraction of the participating threads), and thus has less overhead. The result is performance very similar to that of the JDK.

4.2 Performance as Arrival Rates Change

In the earlier benchmarks, the data structures were tested at very high arrival rates. These rates are common to some uses of concurrent data structures, but not to all. Figures 5-(b) and 5-(c) show the change in throughput of the various algorithms as the method-call arrival rates change, when running on 64 threads on SPARC or on 8 threads on Intel, respectively. In this benchmark, we inject a work period between the calls a thread makes to the queue. The work consists of a loop that is measured to take a specific amount of time.
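For concreteness, a work-injection loop of this kind can be sketched as follows. This is a hypothetical calibration loop of our own, not the authors' benchmark harness; names such as injectWork and workNanos are ours.

    // Hypothetical sketch of the injected work period between queue operations:
    // a busy loop that runs for roughly 'workNanos' nanoseconds.
    final class Work {
        static volatile long sink; // keeps the JIT from eliding the loop

        static void injectWork(long workNanos) {
            long deadline = System.nanoTime() + workNanos;
            long x = 0;
            while (System.nanoTime() < deadline) x++;
            sink = x;
        }
        // A producer thread's loop then alternates queue calls and work:
        //   while (running) { queue.put(item); injectWork(workNanos); }
    }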

On SPARC, at all work levels, both FC implementations perform better than or the same as the JDK; on the Nehalem, where the cost of a CAS operation is lower, they converge to a point, with the JDK winning slightly over the FC parallel algorithm. The ED-Tree is the worst performer on the Nehalem. On SPARC, on the other hand, the ED-Tree is consistently better than the JDK and FC single, and its performance surpasses that of FC parallel once the amount of work added between operations exceeds 500 nanoseconds.

In Figures 5-(d) and 5-(e) we stress the FC implementations further. We show a burst test in which a longer work period is injected frequently, after every 50 operations. This causes the nodes on the combining lists to be removed frequently, thus putting more stress on the combining allocation algorithm. Again, this slows down the FC algorithms but, as can be seen, they still perform well. (5)

4.3 The Pitfalls of the Parallel Flat Combining Algorithm

The algorithmic process used to create the parallel combining lists is another issue that needs further inspection. Since the exchange shared by all sublists is, at its core, a simple flat combining algorithm, its performance relies on the fact that, on average, not many of the requests are directed to it because of an imbalance in the types of operations on the various sublists. This raises the question of what happens in the best case, when every combiner enjoys an equal number of producers and consumers, and in the worst case, in which each combiner is unfortunate enough to continuously have requests of the same type. Figure 5-(f) compares runs in which the combining lists are prearranged for the worst and best cases prior to execution. As can be seen, the worst-case performance is considerably poor compared to the average and optimal ones. In cases where the number of combiners is even (16, 32, 48, and 64 threads), performance relies solely on the exchange, and at one point it is worse than the JDK; this is most likely due to the overhead introduced by the parallel FC algorithm prior to the exchange algorithm. When the number of combiners is odd, there is a combiner that has both consumers and producers, which explains the gain in performance at 8, 24, 40, and 56 threads when compared to their successors. This yields the saw-like pattern seen in the graph. Unsurprisingly, the regular run (denoted as average) is much closer to the optimum.

In summary, our benchmarks show that the parallel flat-combining synchronous queue algorithm has the potential to deliver, in the most common cases, scalability beyond that achievable using the fastest prior algorithms, and in the exceptional worst cases, under stress, it continues to deliver comparable performance.

(5) Due to the different speeds of the SPARC and INTEL machines, different work periods were required in the tests on the two machines in order to demonstrate the effect of bursty workloads.

5 Discussion

We presented a new parallel flat combining algorithm and used it to implement synchronous queues. The full code of our Java-based implementation is publicly available. We believe that, apart from providing a highly scalable implementation of a fundamental data structure, our new parallel flat combining algorithm is an example of the potential for using multiple instances of flat combining in a data structure to allow continued scaling of the overhead-reducing properties provided by the flat combining technique. Applying the parallel flat combining paradigm to additional key data structures is an interesting avenue for future research.

Acknowledgments

We thank Doug Lea for allowing us to use his Sun Niagara 2 multicore machine. We also thank the anonymous reviewers for their many helpful comments.

References

1. Y. Afek, G. Korland, M. Natanzon, and N. Shavit. Scalable producer-consumer pools based on elimination-diffraction trees. In Euro-Par 2010, to appear, June 2010.
2. H. Attiya, A. Bar-Noy, D. Dolev, D. Peleg, and R. Reischuk. Renaming in an asynchronous environment. J. ACM, 37(3):524-548, 1990.
3. D. R. Hanson. C Interfaces and Implementations: Techniques for Creating Reusable Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1996.
4. D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In SPAA '10: Proceedings of the Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 355-364, 2010.
5. M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, NY, USA, 2008.
6. D. Lea. util.concurrent.ConcurrentHashMap in java.util.concurrent, the Java Concurrency Package. src/main/java/util/concurrent/.
7. W. N. Scherer III, D. Lea, and M. L. Scott. Scalable synchronous queues. Commun. ACM, 52(5):100-111, 2009.
8. W. N. Scherer III. Synchronization and concurrency in user-level software systems. PhD thesis, University of Rochester, Rochester, NY, USA, 2006. Adviser: Michael L. Scott.
9. W. N. Scherer III and M. L. Scott. Nonblocking concurrent data structures with condition synchronization. In DISC, pages 174-187, 2004.
10. N. Shavit and D. Touitou. Elimination trees and the construction of pools and stacks. Theory of Computing Systems, 30:645-670, 1997.
11. N. Shavit and A. Zemach. Diffracting trees. ACM Trans. Comput. Syst., 14(4):385-428, 1996.
12. R. K. Treiber. Systems programming: Coping with parallelism. Technical Report RJ 5118, IBM Almaden Research Center, April 1986.
13. M. Tzafrir. C++ multi-platform memory-model solution with Java orientation.


Claude TADONKI. MINES ParisTech PSL Research University Centre de Recherche Informatique Got 2 seconds Sequential 84 seconds Expected 84/84 = 1 second!?! Got 25 seconds MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Séminaire MATHEMATIQUES

More information

Design Patterns for Real-Time Computer Music Systems

Design Patterns for Real-Time Computer Music Systems Design Patterns for Real-Time Computer Music Systems Roger B. Dannenberg and Ross Bencina 4 September 2005 This document contains a set of design patterns for real time systems, particularly for computer

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Lock vs. Lock-free Memory Project proposal

Lock vs. Lock-free Memory Project proposal Lock vs. Lock-free Memory Project proposal Fahad Alduraibi Aws Ahmad Eman Elrifaei Electrical and Computer Engineering Southern Illinois University 1. Introduction The CPU performance development history

More information

Lecture 2 Process Management

Lecture 2 Process Management Lecture 2 Process Management Process Concept An operating system executes a variety of programs: Batch system jobs Time-shared systems user programs or tasks The terms job and process may be interchangeable

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

Scheduling II. Today. Next Time. ! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling. !

Scheduling II. Today. Next Time. ! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling. ! Scheduling II Today! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling Next Time! Memory management Scheduling with multiple goals! What if you want both good turnaround

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

Process- Concept &Process Scheduling OPERATING SYSTEMS

Process- Concept &Process Scheduling OPERATING SYSTEMS OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne PROCESS MANAGEMENT Current day computer systems allow multiple

More information

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

More information

Chapter 4: Multithreaded

Chapter 4: Multithreaded Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Overview Multithreading Models Thread Libraries Threading Issues Operating-System Examples 2009/10/19 2 4.1 Overview A thread is

More information

SMD149 - Operating Systems

SMD149 - Operating Systems SMD149 - Operating Systems Roland Parviainen November 3, 2005 1 / 45 Outline Overview 2 / 45 Process (tasks) are necessary for concurrency Instance of a program in execution Next invocation of the program

More information

Chapter 4: Threads. Operating System Concepts 9 th Edition

Chapter 4: Threads. Operating System Concepts 9 th Edition Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Linked Lists: The Role of Locking. Erez Petrank Technion

Linked Lists: The Role of Locking. Erez Petrank Technion Linked Lists: The Role of Locking Erez Petrank Technion Why Data Structures? Concurrent Data Structures are building blocks Used as libraries Construction principles apply broadly This Lecture Designing

More information

Building Efficient Concurrent Graph Object through Composition of List-based Set

Building Efficient Concurrent Graph Object through Composition of List-based Set Building Efficient Concurrent Graph Object through Composition of List-based Set Sathya Peri Muktikanta Sa Nandini Singhal Department of Computer Science & Engineering Indian Institute of Technology Hyderabad

More information

CHAPTER 2: PROCESS MANAGEMENT

CHAPTER 2: PROCESS MANAGEMENT 1 CHAPTER 2: PROCESS MANAGEMENT Slides by: Ms. Shree Jaswal TOPICS TO BE COVERED Process description: Process, Process States, Process Control Block (PCB), Threads, Thread management. Process Scheduling:

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2

Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process

More information

Lecture Topics. Announcements. Today: Uniprocessor Scheduling (Stallings, chapter ) Next: Advanced Scheduling (Stallings, chapter

Lecture Topics. Announcements. Today: Uniprocessor Scheduling (Stallings, chapter ) Next: Advanced Scheduling (Stallings, chapter Lecture Topics Today: Uniprocessor Scheduling (Stallings, chapter 9.1-9.3) Next: Advanced Scheduling (Stallings, chapter 10.1-10.4) 1 Announcements Self-Study Exercise #10 Project #8 (due 11/16) Project

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

CSE544 Database Architecture

CSE544 Database Architecture CSE544 Database Architecture Tuesday, February 1 st, 2011 Slides courtesy of Magda Balazinska 1 Where We Are What we have already seen Overview of the relational model Motivation and where model came from

More information

IN the cloud, the number of virtual CPUs (VCPUs) in a virtual

IN the cloud, the number of virtual CPUs (VCPUs) in a virtual IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 7, JULY 2017 1811 APPLES: Efficiently Handling Spin-lock Synchronization on Virtualized Platforms Jianchen Shan, Xiaoning Ding, and Narain

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Anders Gidenstam Håkan Sundell Philippas Tsigas School of business and informatics University of Borås Distributed

More information

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Håkan Sundell Philippas Tsigas OPODIS 2004: The 8th International Conference on Principles of Distributed Systems

More information

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4 Motivation Threads Chapter 4 Most modern applications are multithreaded Threads run within application Multiple tasks with the application can be implemented by separate Update display Fetch data Spell

More information

Skewed-Associative Caches: CS752 Final Project

Skewed-Associative Caches: CS752 Final Project Skewed-Associative Caches: CS752 Final Project Professor Sohi Corey Halpin Scot Kronenfeld Johannes Zeppenfeld 13 December 2002 Abstract As the gap between microprocessor performance and memory performance

More information

Relative Performance of Preemption-Safe Locking and Non-Blocking Synchronization on Multiprogrammed Shared Memory Multiprocessors

Relative Performance of Preemption-Safe Locking and Non-Blocking Synchronization on Multiprogrammed Shared Memory Multiprocessors Relative Performance of Preemption-Safe Locking and Non-Blocking Synchronization on Multiprogrammed Shared Memory Multiprocessors Maged M. Michael Michael L. Scott University of Rochester Department of

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

Optimizing Replication, Communication, and Capacity Allocation in CMPs

Optimizing Replication, Communication, and Capacity Allocation in CMPs Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important

More information

Wait-Free Multi-Word Compare-And-Swap using Greedy Helping and Grabbing

Wait-Free Multi-Word Compare-And-Swap using Greedy Helping and Grabbing Wait-Free Multi-Word Compare-And-Swap using Greedy Helping and Grabbing H. Sundell 1 1 School of Business and Informatics, University of Borås, Borås, Sweden Abstract We present a new algorithm for implementing

More information

Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming

Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming Parallel Programming in Distributed Systems Or Distributed Systems in Parallel Programming Philippas Tsigas Chalmers University of Technology Computer Science and Engineering Department Philippas Tsigas

More information

Scalable Synchronous Queues By William N. Scherer III, Doug Lea, and Michael L. Scott

Scalable Synchronous Queues By William N. Scherer III, Doug Lea, and Michael L. Scott DOI:1.1145/15649.156431 Scalable Synchronous Queues By William N. Scherer III, Doug Lea, and Michael L. Scott Abstract In a thread-safe concurrent queue, consumers typically wait for producers to make

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 3 Nov 2017 Lecture 1/3 Introduction Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading without locking Flat

More information