Fast and Scalable Rendezvousing


Fast and Scalable Rendezvousing

by Michael Hakimi

A thesis submitted for the degree of Master of Computer Science

Supervised by Professor Yehuda Afek

School of Computer Science, Tel Aviv University

September 2011

CONTENTS

1. Introduction
2. Preliminaries
   2.1 Asymmetric rendezvous
   2.2 Progress
3. Related Work
   3.1 Synchronous queues
   3.2 Elimination
4. AdaptiveAR Algorithm
   4.1 Algorithm Description
   4.2 Nonadaptive Algorithm
   4.3 Adding Adaptivity
   4.4 Pragmatic Extensions and Considerations
   4.5 Correctness
5. TripleAdp Algorithm Description
   5.1 Algorithm Description
   5.2 Correctness
6. Performance Evaluation
   6.1 Experimental setup
   6.2 AdaptiveAR Performance Evaluation
       6.2.1 Producer/consumer throughput
       6.2.2 Adaptivity and peeking impact
       6.2.3 NUMA Architecture
       6.2.4 Asymmetric workloads
       6.2.5 Bursts
       6.2.6 Varying arrival rate
       6.2.7 Work uniformity
   6.3 TripleAdp Performance Evaluation
       6.3.1 Symmetric Producer/Consumer Workload
       6.3.2 Asymmetric Producer/Consumer Workloads
       6.3.3 Memory Consumption
       6.3.4 Adaptiveness
7. Concurrent Stack
8. Conclusion

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Prof. Yehuda Afek, and my partner, Adam Morrison, for their guidance and support during my research.

ABSTRACT

In an asymmetric rendezvous system, such as an unfair synchronous queue or an elimination array, threads of two types, consumers and producers, show up and are matched, each with a unique thread of the other type. Here we present a new highly scalable, high-throughput asymmetric rendezvous system that outperforms prior synchronous queue and elimination array implementations under both symmetric and asymmetric workloads (more operations of one type than the other). It is, in essence, a fast matching machine. As a consequence, we also obtain a highly scalable and competitive stack implementation. This thesis is based on the paper "Fast and Scalable Rendezvousing" by Afek, Hakimi, and Morrison, presented at DISC 2011 [1].

1. INTRODUCTION

A common abstraction in concurrent programming is that of an asymmetric rendezvous mechanism. In this mechanism there are two types of threads, e.g., producers and consumers, that show up. The goal is to match pairs of threads, one of each type, and send them away. Usually the purpose of the pairing is for a producer to hand over a data item (such as a task to perform) to a consumer. The asymmetric rendezvous abstraction encompasses both unfair synchronous queues (or pools) [13], which are a key building block in Java's thread pool implementation and other message-passing and hand-off designs [3, 13], and the elimination technique [14], which is used to scale concurrent stacks and queues [8, 12]. To date, considerable work has been invested in asymmetric rendezvous objects.

In this thesis, we present a highly scalable asymmetric rendezvous algorithm that improves the state of the art in both unfair synchronous queue and elimination algorithms. It is based on a distributed scalable ring structure, unlike Java's synchronous queue, which relies on a non-scalable centralized structure. It is nonblocking, in the following sense: if both producers and consumers keep taking steps, some rendezvous operation is guaranteed to complete. (This is similar to the lock-freedom property [9], while taking into account the fact that it takes two to tango, i.e., both types of threads must take steps to successfully rendezvous.) It is also uniform, in that no thread has to perform work on behalf of other threads. This is in contrast to the flat combining (FC) based synchronous queues of Hendler et al. [7], which are blocking and non-uniform. Most importantly, our algorithm performs extremely well in practice, outperforming Java's synchronous queue by up to 60x on an UltraSPARC T2 Plus multicore machine, and the FC synchronous queue by up to a factor of six. Using our algorithm as the elimination layer of a concurrent stack yields a factor of three improvement over Hendler et al.'s FC stack [6].

Our first algorithm is based on a simple idea that turns out to be remarkably effective in practice: the algorithm itself is asymmetric. While a consumer captures a node in the shared ring structure and waits there for a producer, a producer actively seeks out waiting consumers on the ring. We present two new techniques that, combined with this asymmetric methodology on the ring, lead to extremely high performance and scalability in practice. The first is a new ring adaptivity scheme that dynamically adjusts the ring's size, leaving enough room to fit all the consumers while minimizing the number of empty nodes that producers must needlessly search. With an adaptive ring size we can expect the nodes to be mostly occupied, which leads us to the next idea: if a producer starts to scan the ring and finds its first node empty, it is likely that a consumer will arrive there shortly. Yet simply waiting at this node, hoping that this will occur, makes the algorithm prone to timeouts and impedes progress. We solve this problem by employing a peeking technique that lets the producer have the best of both worlds: as the producer traverses the ring, it continues to peek at its initial node; should a consumer arrive there, the producer immediately tries to partner with it, thereby minimizing the amount of wasted work.

Our algorithm avoids two problems found in prior elimination algorithms that did not exploit asymmetry. In these works [2, 8, 14], both types of threads pick a random node in the hope of meeting the right kind of partner. Thus these works suffer from false matches, when two threads of the same type meet, and from timeouts, when a producer and a consumer pick distinct nodes and futilely wait for a partner to arrive. We refer to this version of our algorithm as AdaptiveAR.

We tested our algorithm on different workloads. In the scenario where consumers outnumber producers,

we successfully maintain the peak throughput as the number of consumers grows. Unfortunately, in the scenario where producers outnumber consumers, the producers collide with each other when competing for the few consumers in the ring, causing AdaptiveAR's throughput to degrade as the number of producers grows.

We present a second algorithm that addresses this limitation; we refer to it as TripleAdp. The main observation is that the mirror image of AdaptiveAR, in which a consumer seeks a producer that waits on the ring, does well in the workload where AdaptiveAR fails to maintain peak throughput. Therefore, the TripleAdp algorithm adapts to the access pattern and switches between (essentially) AdaptiveAR and its mirror image, thus enjoying the best of both worlds. The TripleAdp algorithm is also adaptive to the total ring size: it allocates and deallocates nodes as needed, unlike AdaptiveAR, which uses a statically allocated ring. TripleAdp still maintains the ability to adapt the effective ring size (this avoids frequent memory allocation and moving threads between old and new rings). The TripleAdp algorithm is therefore triple adaptive: it adapts to the effective ring size, to the access pattern, and to the total ring size.

We discuss related work in Section 3. The algorithms are presented in Sections 4 and 5, after formally defining the asymmetric rendezvous problem and our progress property in Section 2. The empirical evaluation is presented in Section 6. In Section 7 we evaluate an implementation of a stack in which our AdaptiveAR algorithm is used as an elimination layer. Section 8 concludes.

2. PRELIMINARIES

2.1 Asymmetric rendezvous

In the asymmetric rendezvous problem there are threads of two types, producers and consumers. Producers perform put(x) operations, which return either OK or a reserved value ⊥ (explained shortly). Consumers perform get() operations, which return either some item x handed off by a producer or the reserved value ⊥. Producers and consumers show up and must be matched with a unique thread of the other type, such that a producer invoking put(x) and a consumer whose get() returns x must be concurrent at some point. Since this synchronous behavior inherently requires waiting, we provide threads with the option to give up: the special return value ⊥ is used to indicate such a timeout. We say that an operation completes successfully if it returns a value other than ⊥.

2.2 Progress

To reason about progress while taking into account that rendezvous inherently requires waiting, we consider the combined behavior of both types of operations. An algorithm A that implements asymmetric rendezvous is nonblocking if some operation completes successfully after both put() and get() operations take enough steps concurrently. Note that, as with the definition of the lock-freedom property [9], there is no fixed a priori bound on the number of steps after which some operation must complete successfully. Rather, we rule out implementations that make no progress at all, i.e., those that admit executions in which both types of threads take steps infinitely often and yet no operation successfully completes. By counting only successful operations as progress, we also rule out the trivial implementation that always returns ⊥ (i.e., times out immediately).
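To make the interface concrete, the following Java sketch captures the semantics above. It is illustrative only; the names Rendezvous and TIMEOUT are ours, with TIMEOUT standing in for the reserved value ⊥.

    import java.util.concurrent.TimeUnit;

    // Minimal sketch of the asymmetric rendezvous interface (names ours).
    interface Rendezvous<T> {
        Object TIMEOUT = new Object(); // stands in for the reserved value

        // Hands x to a unique concurrent get(); returns OK (here: null) on
        // success, or TIMEOUT if the thread gives up waiting.
        Object put(T x, long timeout, TimeUnit unit);

        // Returns the item handed off by a unique concurrent put(x),
        // or TIMEOUT if no producer shows up in time.
        Object get(long timeout, TimeUnit unit);
    }

A put(x) and the get() returning x must overlap in time; an operation completes successfully exactly when it returns something other than TIMEOUT.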

3. RELATED WORK

3.1 Synchronous queues

A synchronous queue using three semaphores was described by Hanson [5]. Java 5 includes a coarse-grained locking synchronous queue, which was superseded in Java 6 by Scherer, Lea and Scott's algorithm [13]. Their algorithm is based on a Treiber-style nonblocking stack [17] that at all times contains rendezvous requests by either producers or consumers. A producer finding the stack empty or containing producers pushes itself on the stack and waits, but if it finds the stack holding consumers, it attempts to partner with the consumer at the top of the stack (consumers behave symmetrically). This creates a sequential bottleneck. Motivated by this, Afek, Korland, Natanzon, and Shavit described elimination-diffracting (ED) trees [2], a randomized distributed data structure in which arriving threads follow a path through a binary tree whose internal nodes are balancer objects [15] and whose leaves are Java synchronous queues. In each internal node a thread accesses an elimination array in an attempt to avoid descending down the tree. Recently, Hendler et al. applied the flat combining paradigm [6] to the synchronous queue problem [7], describing single-combiner and parallel versions. In a single-combiner FC pool, a thread attempts to become a combiner by acquiring a global lock on the queue. Threads that fail to grab the lock instead post their requests and wait for them to be fulfilled by the combiner, which matches up the participating threads. In the parallel version there are multiple combiners that each handle a subset of the participating threads and then try to satisfy unmatched requests in their subset by entering an exchange FC synchronous queue.

3.2 Elimination

The elimination technique is due to Shavit and Touitou [14]. Hendler, Shavit and Yerushalmi used elimination with an adaptive scheme inspired by Shavit and Zemach [16] to obtain a scalable linearizable stack [8]. They used a collision array in which each thread picks a slot, trying to collide with a partner. In their adaptive scheme threads adapt locally: each thread picks a slot to collide in from a sub-range of the collision array centered around the middle of the array. If no partner arrives, the thread eventually shrinks the range. Alternatively, if the thread sees a waiting partner but fails to collide due to contention, it increases the range. In our adaptivity technique, described in Sect. 4, threads also make local decisions, but with global impact: the ring is resized. Moir et al. used elimination to scale a FIFO queue [12]. In their algorithm an enqueuer picks a random slot in an elimination array and waits there for a dequeuer; a dequeuer picks a random slot, giving up immediately if that slot is empty, rather than seeking out a waiting enqueuer. Scherer, Lea and Scott applied elimination in their symmetric rendezvous system [10], where there is only one type of thread and so the pairing is between any two threads that show up. They also do not discuss adaptivity, though a later version of their symmetric exchanger channel, which is part of Java [11], includes a scheme that resizes the elimination array. Here, however, we are interested in the more difficult asymmetric rendezvous problem, where not all pairings are allowed.
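To illustrate the collision-array idea underlying these elimination schemes, here is a minimal Java sketch of a single slot in which one producer and one consumer can exchange an item. It is our own illustration, not code from any of the cited papers.

    import java.util.concurrent.atomic.AtomicReference;

    // One slot of a collision array. A producer installs its item and waits
    // briefly; a consumer claims an installed item by CASing in TAKEN.
    class EliminationSlot<T> {
        static final Object EMPTY = new Object(), TAKEN = new Object();
        final AtomicReference<Object> slot = new AtomicReference<>(EMPTY);

        boolean tryProduce(T item, int spins) {
            if (!slot.compareAndSet(EMPTY, item))
                return false;                    // slot busy; caller retries elsewhere
            while (spins-- > 0)
                if (slot.get() == TAKEN) {
                    slot.set(EMPTY);
                    return true;                 // a consumer took the item
                }
            if (slot.compareAndSet(item, EMPTY)) // timed out: withdraw the item,
                return false;                    // unless a consumer just took it
            slot.set(EMPTY);
            return true;
        }

        @SuppressWarnings("unchecked")
        T tryConsume() {
            Object v = slot.get();
            if (v == EMPTY || v == TAKEN)
                return null;                     // nothing to take
            return slot.compareAndSet(v, TAKEN) ? (T) v : null;
        }
    }

The false-match and timeout problems discussed in the Introduction are visible here: two producers (or two consumers) landing on the same slot accomplish nothing, and a lone producer wastes its spins.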

4. AdaptiveAR ALGORITHM

4.1 Algorithm Description

The pseudo code of the AdaptiveAR algorithm is provided in Fig. 4.2. The main data structure (Fig. 4.2a) is a ring of nodes. The ring is accessed through a central array ring, where ring[i] points to the i-th node in the ring (Fig. 4.1).¹ A consumer attempts to capture a node in the ring: it first scans the ring searching for an empty node and, once a node is captured, waits there for a producer. A producer scans the ring, seeking a waiting consumer; once a consumer is found, it tries to match with it.

For simplicity, in this section we assume that the number of threads in the system, T, is known in advance, and we pre-allocate a ring of size T. However, the effective size of the ring is adaptive and is dynamically adjusted as the concurrency level changes (we expand on this in Sect. 4.3). In Section 5 we handle the case in which the maximum number of threads that can show up is not known in advance; there, nodes are allocated and deallocated dynamically, i.e., the algorithm is adaptive also to the number of threads.

Conceptually, each ring node need only contain an item pointer that encodes the node's state:

1. Free: item points to a global reserved object, FREE, that is distinct from any object a producer may enqueue. Initially all nodes are free.
2. Captured by a consumer: item is NULL.
3. Holding data (of a producer): item points to the data.

Fig. 4.1: The ring data structure with the array of pointers. Node i points to node i - 1. The ring's maximum size is T; the ring is resized by changing the head's prev pointer.

In practice, ring traversal is more efficient when following a pointer from one node to the next rather than traversing the array. Array traversal suffers from two problems.

¹ This reflects Java semantics, where arrays are of references to objects and not of objects themselves.

First, in Java, reading from the next array cell may result in a cache miss (it is an array of pointers), whereas reading from the (just accessed) current node does not. Second, maintaining a running array index requires an expensive test and branch to handle index boundary conditions, or counting modulo the ring size, while reading a pointer is cheap. The pointer field is named prev, reflecting that node i points to node i - 1. This allows the ring to be resized with a single atomic compareAndSet (CAS) that changes ring[1]'s (the head's) prev pointer. To support mapping from a node to its ring index, each node holds its ring index in a read-only index field.

For the sake of clarity we start in Sect. 4.2 by discussing the algorithm without adaptation of the ring size. The adaptivity code, however, is included, marked by a ∗ symbol in Fig. 4.2, and is discussed in Sect. 4.3. Section 4.4 discusses some practically-motivated extensions to the algorithm, e.g., supporting timeouts. Finally, Sect. 4.5 discusses correctness and progress.

4.2 Nonadaptive Algorithm

Producers (Fig. 4.2b): A producer searches the ring for a waiting consumer and attempts to hand its data to it. The search begins at a node, s, obtained by hashing the thread's id (Line 19). The producer passes the ring size to the hash function as a parameter, to ensure that the returned node falls within the ring. It then traverses the ring looking for a node captured by a consumer. Here the producer periodically peeks at the initial node s to see if it has a waiting consumer (Lines 24-26); if not, it checks the current node in the traversal (Lines 27-30). Once a captured node is found, the producer tries to deposit its data using a CAS (Lines 25 and 28). If successful, it returns.

Consumers (Fig. 4.2c): A consumer searches the ring for a free node and attempts to capture it by atomically changing its item pointer from FREE to NULL using a CAS. Once a node is captured, the consumer spins, waiting for a producer to arrive and deposit its item. Similarly to the producer, a consumer hashes its id to obtain a starting point for its search, s (Line 39). The consumer calls the findfreenode procedure to traverse the ring from s until it captures and returns some node u (Lines 59-74). (Recall that the code responsible for handling adaptivity, which is marked by a ∗, is ignored for the moment.) The consumer then waits until a producer deposits an item in u (Lines 41-49), frees u (Lines 51-52) and returns (Line 56).

4.3 Adding Adaptivity

If the number of active consumers is smaller than the ring size, producers may need to traverse a large number of empty nodes before finding a match (the length of the path is linear in the ring size). It is therefore important to decrease the ring size if the concurrency level is low. On the other hand, if there are more concurrent threads than the ring size (high contention), it is important to dynamically increase the ring. The goal of the adaptivity scheme is to keep the ring size just right and allow threads to complete their operations within a constant number of steps. The logic driving the resizing process is in the consumer's code, which detects when the ring is overcrowded or sparsely populated and changes the size accordingly. If a consumer fails to capture a node after passing through many nodes in findfreenode() (due to not finding a free node or having its CASes fail), then the ring is too crowded and should be increased. The exact threshold is determined by an increase threshold parameter, T_i.
If, in findfreenode(), a consumer fails to capture a node after passing through more than T_i · ringsize nodes, it attempts to increase the ring size (Lines 67-72). To detect when to decrease the ring, we observe that when the ring is sparsely populated, a consumer usually finds a free node quickly, but then has to wait longer until a producer finds it. Thus, we add a wait threshold, T_w, and a decrease threshold, T_d. If it takes a consumer more than T_w iterations of the waiting loop (Lines 41-49) to successfully complete the rendezvous, but it successfully captured its ring node within at most T_d steps, then it attempts to decrease the ring size (Lines 53-55).
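Before turning to the code, the two threshold tests can be summarized in a short Java sketch (class and method names are ours):

    // busyCtr counts nodes traversed in findfreenode; ctr counts iterations
    // of the consumer's waiting loop.
    final class ResizePolicy {
        static boolean shouldIncrease(int busyCtr, int ringSize, int Ti) {
            return busyCtr > Ti * ringSize;        // too crowded: grow by one
        }
        static boolean shouldDecrease(int busyCtr, int ctr, int ringSize,
                                      int Td, int Tw) {
            // captured a node quickly, then waited long: shrink by one
            return busyCtr < Td && ctr > Tw && ringSize > 1;
        }
    }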

1  struct node {
2    item  : pointer to object
3    index : node's index in the ring
4    prev  : pointer to previous ring node
5  }
6
7  shared vars:
8  ring : array [1,...,T] of pointers to nodes,
9         ring[1].prev = ring[T],
10        ring[i].prev = ring[i-1] (i > 1)
11
12 utils:
13 getringsize() {
14   node tail := ring[1].prev
15   return tail.index
16 }

(a) Global variables

17 put(threadid, object item) {
18
19   node s := ring[hash(threadid,
20                  getringsize())]
21   node v := s.prev
22
23   while (true) {
24     if (s.item == NULL) {
25       if (CAS(s.item, NULL, item))
26         return OK
27     } else {
28       if (v.item == NULL and CAS(v.item, NULL, item))
29         return OK
30       v := v.prev
31     }
32   }
33 }

(b) Producer code

34  get(threadid) {
35    int ringsize, busy_ctr, ctr := 0
36    node s, u
37
38    ringsize := getringsize()
39    s := ring[hash(threadid, ringsize)]
40    (u, busy_ctr) := findfreenode(s, ringsize)
41    while (u.item == NULL) {
∗42     ringsize := getringsize()
∗43     if (u.index > ringsize and
∗44         CAS(u.item, NULL, FREE)) {
∗45       s := ring[hash(threadid, ringsize)]
∗46       (u, busy_ctr) := findfreenode(s, ringsize)
∗47     }
∗48     ctr := ctr + 1
49    }
50
51    item := u.item
52    u.item := FREE
∗53    if (busy_ctr < T_d and ctr > T_w and ringsize > 1)
∗54      // Try to decrease ring
∗55      CAS(ring[1].prev, ring[ringsize], ring[ringsize-1])
56    return item
57  }
58
59  findfreenode(node s, int ringsize) {
60    int busy_ctr := 0
61    while (true) {
62      if (s.item == FREE and
63          CAS(s.item, FREE, NULL))
64        return (s, busy_ctr)
65      s := s.prev
66      busy_ctr := busy_ctr + 1
∗67     if (busy_ctr > T_i · ringsize) {
∗68       if (ringsize < T) // Try to increase ring
∗69         CAS(ring[1].prev, ring[ringsize], ring[ringsize+1])
∗70       ringsize := getringsize()
∗71       busy_ctr := 0
∗72     }
73    }
74  }

(c) Consumer code

Fig. 4.2: Asymmetric rendezvous algorithm. Lines beginning with ∗ handle the adaptivity process.
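Translated to Java, the node state encoding and the two CAS transitions of Fig. 4.2 look roughly as follows. This is a sketch using java.util.concurrent.atomic; the class and method names are ours, not the thesis's implementation.

    import java.util.concurrent.atomic.AtomicReference;

    class RingNode {
        static final Object FREE = new Object(); // reserved "free" marker
        final int index;                         // read-only ring index
        volatile RingNode prev;                  // node i points to node i-1
        final AtomicReference<Object> item = new AtomicReference<>(FREE);

        RingNode(int index) { this.index = index; }

        // Consumer capture: FREE -> NULL (line 63 of Fig. 4.2).
        boolean tryCapture() { return item.compareAndSet(FREE, null); }

        // Producer deposit: NULL -> data (lines 25 and 28 of Fig. 4.2).
        boolean tryDeposit(Object data) { return item.compareAndSet(null, data); }
    }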

Resizing the ring is done by CASing the prev pointer of the ring head (ring[1]) from the current tail of the ring to the tail's successor (to increase the ring size by one) or to its predecessor (to decrease the ring size by one). If the CAS fails, then another thread has resized the ring and the consumer continues. The head's prev pointer is not a sequential bottleneck, because resizing is a rare event in stable workloads. Even if resizing is frequent, the thresholds ensure that the cost of the CAS is negligible compared to the other work performed by the algorithm, and resizing pays off in the form of a better ring size, which leads to improved performance.

Handling consumers left out of the ring: A decrease in the ring size may leave a consumer's captured node outside of the current ring. Therefore, each consumer periodically checks if its node's index is larger than the current ring size (Line 43). If so, it tries to free its node using a CAS (Line 44) and find itself a new node in the ring (Lines 45-46). However, if the CAS fails, then a producer has already deposited its data in this node, and so the consumer can take the data and return (this is detected in the next execution of Line 41).

4.4 Pragmatic Extensions and Considerations

Here we discuss two extensions that might be interesting in certain practical scenarios.

Timeout support: Aborting a rendezvous that is taking more than a specified period of time is a useful feature. Unfortunately, our model has no notion of time and so we do not model the semantics of timeouts. We simply describe the details of an implementation that captures the intuitive notion of a timeout. The operations take a parameter specifying the desired timeout, if any. Timeout expiry is then checked in each iteration of the main loop in put(), get() and findfreenode(). If the timeout expires, a producer or a consumer in findfreenode() aborts by returning ⊥. A consumer that has captured a ring node cannot abort before freeing the node by CASing its item from NULL back to FREE. If the CAS fails, the consumer has found a match and its rendezvous has completed successfully. Otherwise its abort attempt has succeeded and it returns ⊥. (A sketch of this abort protocol appears at the end of this chapter.)

Busy waiting: Both producers and consumers rely on busy waiting, which may not be appropriate for blocking applications that wish to let a thread sleep until a partner arrives. Being blocking, such applications may not need our strong progress guarantee. We plan to extend our algorithm to blocking scenarios in future work.

4.5 Correctness

Theorem 1. The algorithm meets the asymmetric rendezvous semantics.

Proof: We consider the point at which both put(x) and the get() that returns x take effect as the point where the producer successfully CASes x into some node's item pointer (Line 25 or 28). The CAS guarantees that only a single producer deposits its item into the node. The producer leaves only after a successful CAS, and the consumer leaves only after its node has been captured. The algorithm thus meets the asymmetric rendezvous semantics.

Theorem 2. If threads do not request timeouts, the algorithm is nonblocking (as defined in Sect. 2).

Proof: Assume towards a contradiction that there is an execution in which both producers and consumers take infinitely many steps and yet no rendezvous successfully completes. Since there is a finite number of threads T, there must be a producer/consumer pair, p and c, each of which runs forever without completing. We will use Lemma 1 to derive a contradiction.

Lemma 1. The consumer c eventually captures a node.

Proof: Suppose c never captures a node.
Then eventually the ring size must become T. This is because after c completes a cycle around the ring in findfreenode(), it tries to increase the ring size (say the increase

threshold is 1). c's resizing CAS (Line 69) either succeeds or fails because another CAS succeeded. The successful CAS must have been an increase, because a decrease of the ring size implies a completed rendezvous, which by our assumption does not occur. Once the ring size reaches T, the ring has room for c (since T is the maximum number of threads). Thus c can fail to capture a node only by encountering another consumer twice at different nodes, implying that this consumer completes a rendezvous, a contradiction. It therefore cannot be that c never captures a node.

From Lemma 1 we get that c captures some node but p never finds c on the ring, implying that c has been left out of the ring by some decreasing resize. But since, from that point on, the ring size cannot decrease, c eventually executes Lines 43-46 and moves itself into p's range, where either p or another producer rendezvouses with it, a contradiction.
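As promised in Sect. 4.4, here is a sketch of the consumer-side abort protocol, written against the hypothetical RingNode class above; the deadline handling is ours.

    class TimeoutSupport {
        // Consumer timeout, given a captured node u (u.item is currently null).
        // Returns null on a successful abort; otherwise a producer won the race
        // and we take its item, completing the rendezvous.
        static Object awaitOrAbort(RingNode u, long deadlineNanos) {
            while (true) {
                Object x = u.item.get();
                if (x != null) {                   // a producer deposited an item
                    u.item.set(RingNode.FREE);     // free the node, return the item
                    return x;
                }
                if (System.nanoTime() >= deadlineNanos) {
                    // Abort only if no producer slipped in: NULL -> FREE.
                    if (u.item.compareAndSet(null, RingNode.FREE))
                        return null;               // abort (timeout) succeeded
                    // CAS failed: we were matched; next iteration takes the item.
                }
            }
        }
    }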

5. TripleAdp ALGORITHM DESCRIPTION

5.1 Algorithm Description

Similarly to the AdaptiveAR algorithm, the main data structure is a ring of nodes. Each node has a unique id (between 1 and the total ring size) stored in its index field. The nodes are linked via their prev field, i.e., node i points to node i - 1. The ring is accessed through an array whose i-th element points to node i. A global shared ring variable points to this array. This indirection allows a thread to install a new ring/array pair with a single atomic compareAndSet (CAS) that points ring at the new array (and thereby at the new ring). Such a ring reallocation occurs when TripleAdp adapts the total ring size. The shared state of the algorithm is shown in Fig. 5.1.

TripleAdp has two operation modes. In one mode a consumer captures a ring node and waits for a producer, while a producer seeks a waiting consumer by traversing the ring. The other mode is symmetric: a consumer seeks out a waiting producer. The current mode is stored in a global mode variable. If mode = 1, the first mode, henceforth referred to as consumers waiting, is in effect. Otherwise, if mode = 2, the second (producers waiting) mode is used. Figure 5.2a shows the pseudo code of the get() and put() dispatch: an arriving thread reads mode and then invokes one of the wait or seek procedures, where the rest of the work is done. The mode can change during the execution in response to the workload. The wait and seek procedures therefore periodically check if the mode has changed, and reverse the thread's role when this happens. But as this role reversal does not occur atomically with the mode change, threads in different modes can be active concurrently in the ring.

1  struct node {
2    item[] : array of two pointers to objects
3    index  : node's index in the ring
4    prev   : pointer to previous ring node
5  }
6
7  shared variables:
8  ring : pointer to an array of nodes, initially
9         of size 1, with ring[1].prev = ring[1]
10
11 mode : int, initially 1
12
13 utils:
14 allocatering(int size, int max) {
15   node[] r := new node[max]
16   for (i in 1 ... max)
17     r[i] = new node
18   initialize r's nodes such that:
19     r[i].index = i
20     r[1].prev = r[size]
21     r[i].prev = r[i-1] (i > 1)
22   return r
23 }
24
25 getringsize(node[] r) {
26   node tail := r[1].prev
27   return tail.index
28 }

29 // The node item a thread in wait() uses
30 waititemindex(object value) {
31   if (value = null)
32     return 1
33   else
34     return 2
35 }
36
37 // The node item a thread in seek() uses
38 seekitemindex(object value) {
39   if (value != null)
40     return 1
41   else
42     return 2
43 }
44
45 // Was value deposited by a waiting thread?
46 haswaiter(object value, int idx) {
47   if (idx = 1)
48     return value = NULL
49   else
50     return (value != NULL and value != FREE)
51 }

Fig. 5.1: Algorithm TripleAdp: definitions and helper procedures.
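In Java, the indirection through ring can be expressed with a single AtomicReference to the current node array; installing a freshly allocated ring is then one CAS. The sketch below reuses the hypothetical RingNode class from Chapter 4 (TripleAdp's per-mode item slots are omitted for brevity), and letting the size parameter set the head's prev is our reading of allocatering.

    import java.util.concurrent.atomic.AtomicReference;

    class RingHolder {
        // The global 'ring' variable: a mutable pointer to the current array.
        final AtomicReference<RingNode[]> ring =
            new AtomicReference<>(allocateRing(1, 1));

        // Builds a ring of 'max' nodes whose effective size starts at 'size'.
        static RingNode[] allocateRing(int size, int max) {
            RingNode[] r = new RingNode[max + 1];     // 1-based, as in the text
            for (int i = 1; i <= max; i++) r[i] = new RingNode(i);
            for (int i = 2; i <= max; i++) r[i].prev = r[i - 1];
            r[1].prev = r[size];                      // head's prev marks the tail
            return r;
        }

        // Total-size growth: allocate a double-sized ring, install with one CAS.
        boolean tryGrow(RingNode[] old, int newEffectiveSize) {
            RingNode[] nr = allocateRing(newEffectiveSize, 2 * (old.length - 1));
            return ring.compareAndSet(old, nr);       // losers see ring != r, retry
        }
    }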

To prevent such threads from stepping on each other, each node has a two-member array field named item, where item[m] is accessed only by threads in mode m and encodes the node's state in mode m:

  Node state              | Consumers waiting (mode = 1)       | Producers waiting (mode = 2)
  Free                    | item[1] = FREE                     | item[2] = FREE
  Captured (by a waiter)  | item[1] = NULL                     | item[2] holds the producer's item
  Rendezvous occurred     | item[1] holds the producer's item  | item[2] = NULL

Here FREE is a global reserved object, distinct from any object a producer may enqueue, to which item[mode] points when the node is free.

Figure 5.3 shows the pseudo code of the wait and seek procedures. For the sake of clarity, we initially describe the code without going into the details of adapting the ring size and memory consumption, to which we return later. The relevant lines of code, marked by a ∗ in the pseudo code, are ignored for the moment.

Waiters (Fig. 5.3a): First (Line 96), a waiter determines its operation mode by invoking the helper procedure waititemindex(), whose code is given in Fig. 5.1. If the waiter does not have a value to hand off then it is a consumer, implying that it is in the consumers waiting mode (m = 1). Otherwise, it is a producer and m = 2. The waiter then attempts to capture a node in the ring and waits there for a seeker to rendezvous with it. This process generally follows that of our previous algorithm, AdaptiveAR. The waiter starts by searching the ring for a free node to capture. It hashes its id to obtain a starting point for this search (Lines 98-100) and then invokes the findfreenode procedure to perform the search (Line 101). The pseudo code of findfreenode is shown in Fig. 5.2b. Here the waiter traverses the ring and, upon encountering a FREE node, attempts to CAS the node's item to its value (Lines 69-72). (Recall that value holds the thread's item if it is a producer, and NULL otherwise.) The traversal is aborted if the global mode changes from the current mode (Lines 73-74), in which case the waiter reverses its role (Lines 102-103). Once findfreenode captures a node u (Line 101), the waiter spins on u until a seeker arrives and deposits its value (Lines 106-127), at which point the waiter can return (Line 119). If the waiter is a consumer, it returns the seeking producer's deposited value; otherwise it returns NULL.

52 get(int threadid) {
53   if (mode = 1)
54     return wait(threadid, NULL)
55   else
56     return seek(threadid, NULL)
57 }
58
59 put(int threadid, object item) {
60   if (mode = 1)
61     return seek(threadid, item)
62   else
63     return wait(threadid, item)
64 }

(a)

65 findfreenode(object value, node[] r,
               int ringsize, node s) {
66   int busy_ctr := 0
67   int i := waititemindex(value)
68   while (true) {
69     if (s.item[i] == FREE and
70         CAS(s.item[i], FREE, value))
71       return (s, busy_ctr, OK)
72     s := s.prev
73     if (mode != i)
74       return (-, -, MODE_CHG)
75     if (ring != r)
76       return (-, -, RING_CHG)
∗77    busy_ctr := busy_ctr + 1
∗78    if (busy_ctr > T_i · ringsize) {
∗79      if (ringsize < size(r)) { // Try to increase ring
∗80        CAS(r[1].prev, r[ringsize], r[ringsize+1])
∗81      } else { // Allocate larger ring
∗82        node[] nr = allocatering(ringsize+1, 2 · len(r))
∗83        CAS(ring, r, nr)
∗84      }
∗85      ringsize := getringsize(r)
∗86      busy_ctr := 0
∗87    }
88   }
89 }

(b)

Fig. 5.2: TripleAdp pseudo code of get(), put() and findfreenode(). Lines marked by a ∗ handle ring size adaptivity.
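In Java, the per-mode state of the table above can be realized with two AtomicReference slots per node, and haswaiter() becomes a pure test. A sketch (names ours):

    import java.util.concurrent.atomic.AtomicReference;

    class DualNode {
        static final Object FREE = new Object();
        final int index;
        volatile DualNode prev;
        // item[1] is used in consumers-waiting mode, item[2] in
        // producers-waiting mode; slot 0 is unused so indices match the text.
        final AtomicReference<Object>[] item;

        @SuppressWarnings("unchecked")
        DualNode(int index) {
            this.index = index;
            item = (AtomicReference<Object>[]) new AtomicReference[3];
            item[1] = new AtomicReference<>(FREE);
            item[2] = new AtomicReference<>(FREE);
        }

        // Was 'value' (just read from item[idx]) deposited by a waiting thread?
        static boolean hasWaiter(Object value, int idx) {
            if (idx == 1)
                return value == null;              // a waiting consumer
            return value != null && value != FREE; // a waiting producer's item
        }
    }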

90  wait(int threadid, object value) {
91    int i, ctr, busy_ctr, ringsize
92    node s, u
93    node[] r
94
95    ctr := 0
96    i := waititemindex(value)
97    while (true) {
98      r := ring // private copy of ring pointer
99      ringsize := getringsize(r)
100     s := r[hash(threadid, ringsize)]
101     (u, busy_ctr, ret) := findfreenode(value, r, ringsize, s)
102     if (ret = MODE_CHG)
103       return seek(threadid, value)
104     if (ret = RING_CHG)
105       continue // Restart loop
106     while (true) { // Wait for partner
107       if (u.item[i] != value) {
108         object item := u.item[i]
109         u.item[i] := FREE
∗110        if (busy_ctr < T_d and ctr > T_w and ringsize > 1) {
∗111          // Try to decrease ring
∗112          if (ringsize - 1 = size(r)/4) {
∗113            node[] nr = allocatering(ringsize-1, len(r)/2)
∗114            CAS(ring, r, nr)
∗115          } else {
∗116            CAS(r[1].prev, r[ringsize], r[ringsize-1])
∗117          }
∗118        }
119         return item
120       }
∗121       ringsize := getringsize(r)
∗122       if ((u.index > ringsize or ring != r) and
∗123           CAS(u.item[i], value, FREE)) {
∗124         break // Restart outer loop
∗125       }
126       ctr := ctr + 1
127     }
128   }
129 }

(a)

130 seek(int threadid, object value) {
131   object item
132   int i, ctr
133   node s, v
134   node[] r
135
136   i := seekitemindex(value)
137   while (true) {
138     r := ring // private copy of ring pointer
139     s := r[hash(threadid, getringsize(r))]
140     v := s.prev
141     ctr := 0
142     while (true) {
143       if (mode != i)
144         return wait(threadid, value)
145       if (ring != r)
146         break // Restart outer loop
147       if (ctr > T_f) {
148         CAS(mode, i, 3 - i) // toggle between 1 and 2
149         continue // Restart inner loop
150       }
151       item := s.item[i]
152       if (haswaiter(item, i)) {
153         if (CAS(s.item[i], item, value))
154           return item
155         ctr := ctr + 1
156         continue // Restart inner loop
157       }
158       item := v.item[i]
159       if (haswaiter(item, i)) {
160         if (CAS(v.item[i], item, value))
161           return item
162         ctr := ctr + 1
163       }
164       v := v.prev
165     }
166   }
167 }

(b)

Fig. 5.3: TripleAdp pseudo code for seeking and waiting threads. Lines marked by a ∗ handle ring size adaptivity.

Adapting the effective and total ring size (∗-marked lines): The waiter adjusts both the effective and the total ring size. Adjusting the effective ring size is performed as in AdaptiveAR. However, where AdaptiveAR would fail to increase the effective ring size, due to using a statically sized ring, TripleAdp instead adjusts the total ring size. Similarly, when the effective ring size becomes small compared to the total ring size, TripleAdp decreases the total ring size.

The decision when to resize the ring is made as in AdaptiveAR. If the waiter fails to capture a node after traversing a fraction of the ring (determined by an increase threshold parameter T_i), it tries to increase the effective ring size (Lines 77-87). If the effective ring size is already maximal, the waiter allocates a new ring of double the size and tries to install it with a CAS on ring (Lines 82-83). Otherwise, the effective ring size is increased by CASing the prev pointer of the ring head (ring[1]) (Lines 79-80). The effective ring size is decreased if the waiter captures a node quickly, but then spins for more than T_w (the wait threshold) iterations before a seeker arrives (Line 110). If the resulting effective size is a quarter of the total ring size, the waiter swings ring to point to a newly allocated ring of half the old size (Lines 112-114). Otherwise, only the effective size is decreased (Line 116).

Decreasing the effective ring size may leave a waiter's captured node outside of the current effective ring. Each waiter therefore checks if it has been left out and tries to move back into the ring (Lines 121-125).
Similarly, both seekers and waiters check whether a new ring has been allocated (Lines 145-146, 122, and 75-76). If so, they move themselves to the new ring by restarting the operation.

Memory reclamation: We assume the runtime environment supports garbage collection. Thus, once the pointer to an old ring is overwritten and all threads have moved to the new ring, the garbage collector reclaims the old ring's memory. Garbage collectors are part of the specification of modern environments such as C# and

Java, in which most prior synchronous queue algorithms were implemented [1, 2, 7, 13].

Seekers (Fig. 5.3b): A seeker traverses the ring, looking for a waiter to rendezvous with. Symmetrically to the waiter code, the seeker starts by determining its operation mode (Line 136). The seeker then begins traversing the ring from an initial node s obtained by hashing the thread's id (Lines 138-140). As in the waiter case, the traversal process follows that of AdaptiveAR: the seeker proceeds while peeking at s to see if a waiter has arrived there in the meantime (Lines 151-157). If s has no waiter, the next node in the traversal is checked. To determine whether a node is captured by a waiter, the seeker uses the helper procedure haswaiter() (shown in Fig. 5.1). Once a waiter is found, the seeker tries to match with it by atomically swapping the value stored in the node's item with its own value. If successful, it returns the retrieved value. During each iteration of the loop, the seeker validates that it is in the correct mode and switches roles if not (Lines 143-144).

Adapting to the access pattern: Seekers are responsible for adapting the operation mode to the access pattern. The idea is to assign the role of waiter to the majority thread type (producer or consumer) in an asymmetric access pattern. Ring size adaptivity then sizes the ring according to the number of waiters. This reduces the chance of seekers (which are outnumbered by the waiters) colliding with each other on the ring. Following this intuition, a seeker deduces that the operation mode should be changed when it starts to experience collisions with other seekers, in the form of CAS failures. A seeker maintains a counter of failed CAS attempts (Lines 155, 162). If this counter crosses a failure threshold T_f, the seeker toggles the global operation mode using a CAS on mode (Lines 147-149). (A Java sketch of this mode switch appears at the end of this chapter.)

5.2 Correctness

Theorem 3. The algorithm meets the asymmetric rendezvous semantics.

Proof: The point at which both a seek(t1, x) and the wait(t2, y) that returns x take effect is the point where the seeker successfully CASes x into some node's item pointer (Lines 153, 160). If the seeker is a put(x) operation, then the waiter is a get() and so y = NULL. Otherwise, the seeker is a get() and x = NULL. Thus, a put(x) always returns NULL (i.e., OK) and its partner get() returns x, and so the algorithm meets the semantics of asymmetric rendezvous.

Theorem 4. If threads do not request timeouts, the algorithm is nonblocking (as defined in Sect. 2).

Proof: The proof is by contradiction. Assume that there is an execution E in which both producers and consumers take infinitely many steps and yet no rendezvous successfully completes. We first prove Lemma 2 and Lemma 3 and then move on towards a contradiction.

Lemma 2. If threads do not request timeouts, the mirror image of AdaptiveAR is nonblocking.

Proof: The proof is very similar to that of Theorem 2. Assume towards a contradiction that there is a producer/consumer pair, p and c, each of which runs forever without completing. As in Theorem 2, the producer p must eventually capture a node. This implies that c does not find p on the ring, which can only occur if p was left out of the ring because some producer decreased the ring. But since, from that point on, the ring size cannot decrease, p eventually moves itself into c's range, where either c or another consumer rendezvouses with it, a contradiction.

Lemma 3. If E has a finite number of mode switches and ring reallocations, a rendezvous must complete.
Proof: Observe that when no mode switches and no ring reallocations occur, an execution of TripleAdp reduces to an execution of AdaptiveAR or of its mirror image. Since AdaptiveAR is nonblocking (Theorem 2) and its mirror image is nonblocking (Lemma 2), a rendezvous must complete: if there were only finitely many such events, the infinite execution after the last one would lead to a rendezvous.

It follows that E must contain infinitely many mode switches or ring reallocations. A mode switch occurs after a seeker's CAS fails (Line 148). This implies a rendezvous has completed, a contradiction. A ring

reallocation that reduces the total ring size (Lines 112-114) occurs after a waiter completes a rendezvous, which is impossible. Thus, the only ring reallocations possible are ones that increase the total ring size. To complete the proof, we show that there cannot be infinitely many such ring reallocations. Here we use the assumption that the concurrency is finite. Let T be the maximum number of threads that can be active in E; we show that the total ring size cannot grow beyond 2T. Let R be the first ring to be installed with total size size(R) ≥ T. Then R's effective size always remains at most T, because a ring size decrease implies that some operation completes. Consider the first waiter w that tries to reallocate R. Then w observes size(R) occupied nodes, i.e., it observes some other waiter w′ at two different nodes. w′ leaves one node and returns to capture another node due to one of the following reasons: a mode switch (impossible), a ring reallocation (impossible, since w is the first to reallocate the ring), or because w′ successfully rendezvouses, a contradiction.
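The access-pattern switch of Sect. 5.1 amounts to a single CAS on the global mode once a seeker has accumulated T_f CAS failures. In Java (a sketch; the class and method names are ours):

    import java.util.concurrent.atomic.AtomicInteger;

    class ModeSwitch {
        static final int T_f = 10;                       // failure threshold (Sect. 6.3)
        final AtomicInteger mode = new AtomicInteger(1); // 1 or 2

        // Called by a seeker in mode i after a failed CAS on a waiter's node.
        // Returns the (possibly toggled) mode the seeker should now follow.
        int onCasFailure(int i, int failures) {
            if (failures > T_f)
                mode.compareAndSet(i, 3 - i); // toggle 1 <-> 2; losers reread
            return mode.get();
        }
    }

A losing CAS on mode is harmless: it means another seeker toggled it first, and the caller simply rereads the current mode.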

6. PERFORMANCE EVALUATION

We first evaluate the performance of our AdaptiveAR algorithm in comparison with prior unfair synchronous queues (Sect. 6.2) and demonstrate the performance gains due to the adaptivity and peeking techniques. In Section 6.3 we evaluate our TripleAdp algorithm.

6.1 Experimental setup

Evaluation is carried out on both a Sun SPARC T5240 and on an Intel Core i7. The Sun machine has two UltraSPARC T2 Plus (Niagara II) chips. Each is a chip multithreading (CMT) processor, with eight 1.165 GHz in-order cores with 8 hardware strands per core, for a total of 64 hardware strands per chip. The Intel Core i7 920 (Nehalem) processor has four 2.67 GHz cores, each multiplexing 2 hardware threads. Both Java and C++ implementations are tested.¹ Due to space limitations, we focus more on the Sun platform, which offers more parallelism and better insight into the scaling behavior of the algorithms.

Unless stated otherwise, results are averages of ten 10-second runs of the Java implementation on an idle machine, resulting in very little variance. Adaptivity parameters used: T_i = 1, T_d = 2, and T_w = 64. A modulo hash function, hash(t) = t mod ringsize, is used, since we do not know anything about the thread ids and consider them to be uniformly random. If, however, thread ids are sequential, the modulo hash achieves perfect coverage of the ring with few collisions and helps in pairing producers/consumers that often match with each other. We evaluate these effects by testing our algorithms with both random and sequential thread ids throughout all experiments.

6.2 AdaptiveAR Performance Evaluation

Our AdaptiveAR algorithm (marked AdaptiveAR in the figures) is compared against the Java unfair synchronous queue (JDK), the ED tree, and the FC synchronous queues. The original authors' implementation of each algorithm is used.² We tested both JDK and JDK-no-park, a version that always uses busy waiting instead of yielding the CPU (so-called parking), and report its results for the workloads where it improves upon the standard Java pool (as was done in [7]).

6.2.1 Producer/consumer throughput

N : N producer/consumer symmetric workload: We measure the throughput at which data is transferred from N producers to N consumers. We focus first on single-chip results in Figs. 6.1a and 6.1e. The Nehalem results are qualitatively similar to the low-thread-count SPARC results. Other than our algorithm, the parallel FC pool is the only algorithm that shows meaningful scalability.

¹ Java benchmarks were run with the HotSpot Server JVM, build ea-b137. C++ benchmarks were compiled with Sun C++ on the SPARC machine and with gcc (-O3 optimization setting) on the Intel machine. In the C++ experiments we used the Hoard 3.8 [4] memory allocator.

² We removed all statistics counting from the code and use the latest JVM. Thus, the results we report are usually slightly better than those reported in the original papers. On the other hand, we fixed a bug in the benchmark of [7] that miscounted timed-out operations of the Java pool as successful operations; thus the results we report for it are sometimes lower.

Fig. 6.1: Rendezvousing between N pairs of producers and consumers. Performance counter plots are in logarithmic scale. L2 misses are not shown; all algorithms but the JDK had less than one L2 miss per operation on average.

Our rendezvous algorithm outperforms the parallel FC queue by 2x-3x in low concurrency settings, and by up to 6x at high thread counts. Hardware performance counter analysis (Figs. 6.1b-6.1d) shows that our rendezvous completes in less than 170 instructions, of which one is a CAS. In the parallel FC pool, waiting for the combiner to pick up a thread's request, match it, and report back with the result, all add up. While the parallel FC pool hardly performs CASes, it does require between 3x and 6x more instructions to complete an operation. Similarly, the ED tree's

rendezvous operations require about 1000 instructions to complete and incur more cache misses as concurrency increases. Our algorithm therefore outperforms it by at least 6x and up to 10x at high thread counts. The Java synchronous queue fails to scale in this benchmark. Figures 6.1b and 6.1c show its serializing behavior: due to the number of failed CAS operations on the top of the stack (and consequent retries), it requires more instructions to complete an operation as concurrency grows. Consequently, our algorithm outperforms it, at high thread counts, by up to 60x.

6.2.2 Adaptivity and peeking impact

From Fig. 6.1c we deduce that ring resizing operations are a rare event, leading to an average of one CAS per operation. Thus resizing does not adversely impact performance here. Throughput of the algorithm with and without peeking is compared in Fig. 6.3a. While peeking has little effect at low thread counts, it improves performance by as much as 47% once concurrency increases. This is because, due to adaptivity, a producer's initial node being empty usually means that a consumer will arrive there shortly, and peeking increases the chance of pairing with that consumer. Without it, a producer has a higher chance of colliding with another producer and consequently spends more instructions (and CASes) to complete a rendezvous.

6.2.3 NUMA Architecture

When utilizing both processors of the Sun machine, the operating system's default scheduling policy is to place threads round-robin on both chips. Thus, the cross-chip overhead is noticeable even at low thread counts, as Fig. 6.1f shows. Since the threads no longer share a single L2 cache, they experience an increased number of L1 and L2 cache misses; each such miss is expensive, requiring coherency protocol traffic to the remote chip. The effect is catastrophic for serializing algorithms; for example, the Java pool experiences a 10x drop in throughput. The more concurrent algorithms, such as parallel FC and ours, show scaling trends similar to the single-chip ones, but achieve lower throughput. In the rest of the section we therefore focus on the more interesting single-chip case.

Fig. 6.2: Single producer and N consumers rendezvousing. Left: Default OS scheduling. Right: OS constrained to not co-locate the producer on the same core as consumers.

6.2.4 Asymmetric workloads

The throughput of one producer rendezvousing with a varying number of consumers is presented in Figs. 6.2a and 6.3c. Nehalem results (Fig. 6.3c) are again similar and not discussed in detail. Since the throughput is bounded by the rate of a single producer, little scaling is expected and observed. However, for all algorithms (but the ED tree) it takes several consumers to keep up with a single producer. This is because the producer hands off an item and completes, whereas the consumer needs to notice that the rendezvous has occurred. And

while the single consumer is thus busy, the producer cannot make progress on a new rendezvous. However, when more consumers are active, the producer can immediately return and find another (different) consumer ready to rendezvous with. Unfortunately, as shown in Fig. 6.2a, most algorithms do not sustain this peak throughput. The FC pools have the worst degradation (3.74x for the single version, and 3x for the parallel version). The Java pool's degradation is minimal (13%), and it, along with the ED tree, achieves close to peak throughput even at high thread counts. Yet this throughput is low: our algorithm outperforms the Java pool by up to 3x, and the ED tree by 12x for low consumer counts and 6x for high consumer counts, despite degrading by 23%-28% from peak throughput.

Why our algorithm degrades is not obvious. The producer has its pick of consumers in the ring and should be able to complete a hand-off immediately. The reason for this degradation is not algorithmic, but due to contention on chip resources. A Niagara II core has two pipelines, each shared by four hardware strands. Thus, beyond 16 threads some threads must share a pipeline, and our algorithm indeed starts to degrade at 16 threads. To verify that this is the problem, we present in Fig. 6.2b runs where the producer runs on the first core and consumers are scheduled only on the remaining seven cores. While the trends of the other algorithms are unaffected, our algorithm now maintains peak throughput through all consumer counts.

Fig. 6.3: Top left: Impact of peeking on N : N rendezvous. Top right: Rendezvousing between N producers and a single consumer. Bottom: Intel 1 : N and N : 1 producer/consumer workload throughput.

Results from the opposite workload (which is less interesting in real-life scenarios), where multiple producers try to serve a single consumer, are given in Figs. 6.3b and 6.3d. Here the producers contend over the single consumer's node and, as a result, the throughput of our algorithm degrades as the number of producers increases (as do the FC pools). Despite this degradation, our algorithm outperforms the Java pool up to 48 threads (falling behind by 15% at 64 threads) and outperforms the FC pools by about 2x.

6.2.5 Bursts

To evaluate the effectiveness of our adaptivity technique, we measure the rendezvous rate in a workload that experiences bursts of activity (on the Sun machine). For ten seconds the workload alternates every second between 31 thread pairs and 8 pairs. The 63rd hardware strand is used to take ring size measurements: our sampling thread continuously reads the ring size, and records the time whenever the current read differs from the previous read. Figure 6.4a depicts the result, showing how the algorithm continuously resizes its ring. Consequently, it successfully benefits from the available parallelism in this workload, outperforming the Java pool by at least 40x and the parallel FC pool by at least 4x.

Fig. 6.4: Top: Synchronous queue bursty workload. Left: Our algorithm's ring size over time (sampled continuously using a thread that does not participate in the rendezvousing). Right: throughput. Bottom: N : N rendezvousing with decreasing arrival rate due to an increasing amount of work between operations.

6.2.6 Varying arrival rate

In practice, threads do some work between rendezvouses. We measure how the throughput of the producer/consumer pairs workload is affected when the thread arrival rate decreases due to increasingly larger amounts of time spent doing work before each rendezvous. Figures 6.4c and 6.4d show that as the work period grows, the throughput of all algorithms that exhibit scaling deteriorates, due to the reduced parallelism in the workload. E.g., on the SPARC the parallel FC pool degrades by 2x when going from no work to 1.5µs of work, and our algorithm degrades by 3.4x (because it starts much higher). Still, on the SPARC there is sufficient parallelism to allow our algorithm to outperform the other implementations by a factor of at least three. (On the Nehalem, by 31%.)

6.2.7 Work uniformity

One clear advantage of our algorithm over FC is work uniformity, since the combiner in FC is expected to spend much more time doing work for other threads. We show this by comparing the percentage of the total operations performed by each thread in a multiple-producer/multiple-consumer workload. In addition to measuring uniformity, this test is a yardstick for progress in practice: if a thread starves, we will see it as performing very little work compared to the other threads. We pick the best result from five executions of 16 producer/consumer pairs, and plot the percentage of the total operations performed by each thread. Figure 6.5 shows the results. In an ideal setting, each thread would perform 3.125% (1/32) of the work. Our algorithm comes relatively close to this distribution, as does the JDK algorithm. In contrast, the parallel FC pool experiences a much larger deviation and best/worst ratio.

Fig. 6.5: Percent of total operations performed by each of 32 threads in an N : N test.

6.3 TripleAdp Performance Evaluation

In this section we evaluate the performance of the TripleAdp algorithm compared to prior unfair synchronous queues. We compare TripleAdp to AdaptiveAR, the FC synchronous queues, the ED tree, and the Java synchronous queue (JDK). As in Sect. 6.2, we tested both JDK and JDK-no-park, a version that uses busy waiting instead of parking (yielding the CPU), and report the version that performs best in each workload. To improve legibility of the graphs, we plot the JDK only in the experiments where it outperformed the other prior algorithms, and we omit the ED tree results entirely, since it is never the best performer. Note that Sect. 6.2 extensively evaluated the JDK and the ED tree against AdaptiveAR and the FC synchronous queues [7] on the same hardware platforms.

Parameters used: The adaptivity thresholds we use for both AdaptiveAR and TripleAdp are T_i = 1, T_d = 2, T_w = 64 (the same as in Sect. 6.2). In addition, we set T_f = 10 for TripleAdp. Both algorithms use the same modulo hash function, hash(t) = t mod the effective ring size. The rationale for this choice is that if we do not know anything about the thread ids, we consider them uniformly random, and the modulo provides a uniform distribution. If, however, thread ids are sequential, the modulo hash helps in pairing producers/consumers that often match with each other.
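For reference, the modulo hash is one line; the worked example below shows the perfect coverage claimed for sequential ids (a Java sketch; the 1-based adjustment is ours, matching the ring's 1-based indexing).

    class RingHash {
        // hash(t) = t mod ringsize, mapped to the 1-based node indices.
        static int hash(int threadId, int ringSize) {
            return Math.floorMod(threadId, ringSize) + 1;
        }

        public static void main(String[] args) {
            // Sequential ids 0..3 on an effective ring of size 4:
            for (int t = 0; t < 4; t++)
                System.out.println("thread " + t + " -> node " + hash(t, 4));
            // prints nodes 1, 2, 3, 4: every node covered, no collisions
        }
    }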


More information

Lock Oscillation: Boosting the Performance of Concurrent Data Structures

Lock Oscillation: Boosting the Performance of Concurrent Data Structures Lock Oscillation: Boosting the Performance of Concurrent Data Structures Panagiota Fatourou FORTH ICS & University of Crete Nikolaos D. Kallimanis FORTH ICS The Multicore Era The dominance of Multicore

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Revisiting the Combining Synchronization Technique

Revisiting the Combining Synchronization Technique Revisiting the Combining Synchronization Technique Panagiota Fatourou Department of Computer Science University of Crete & FORTH ICS faturu@csd.uoc.gr Nikolaos D. Kallimanis Department of Computer Science

More information

The Relative Power of Synchronization Methods

The Relative Power of Synchronization Methods Chapter 5 The Relative Power of Synchronization Methods So far, we have been addressing questions of the form: Given objects X and Y, is there a wait-free implementation of X from one or more instances

More information

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei Non-blocking Array-based Algorithms for Stacks and Queues Niloufar Shafiei Outline Introduction Concurrent stacks and queues Contributions New algorithms New algorithms using bounded counter values Correctness

More information

Advance Operating Systems (CS202) Locks Discussion

Advance Operating Systems (CS202) Locks Discussion Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static

More information

Concurrent Objects and Linearizability

Concurrent Objects and Linearizability Chapter 3 Concurrent Objects and Linearizability 3.1 Specifying Objects An object in languages such as Java and C++ is a container for data. Each object provides a set of methods that are the only way

More information

Lecture 10: Multi-Object Synchronization

Lecture 10: Multi-Object Synchronization CS 422/522 Design & Implementation of Operating Systems Lecture 10: Multi-Object Synchronization Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous

More information

We assume uniform hashing (UH):

We assume uniform hashing (UH): We assume uniform hashing (UH): the probe sequence of each key is equally likely to be any of the! permutations of 0,1,, 1 UH generalizes the notion of SUH that produces not just a single number, but a

More information

Advanced Topic: Efficient Synchronization

Advanced Topic: Efficient Synchronization Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL 251 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

CS 241 Honors Concurrent Data Structures

CS 241 Honors Concurrent Data Structures CS 241 Honors Concurrent Data Structures Bhuvan Venkatesh University of Illinois Urbana Champaign March 27, 2018 CS 241 Course Staff (UIUC) Lock Free Data Structures March 27, 2018 1 / 43 What to go over

More information

CS 167 Final Exam Solutions

CS 167 Final Exam Solutions CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are

More information

Chap. 5 Part 2. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1

Chap. 5 Part 2. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1 Chap. 5 Part 2 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Static work allocation Where work distribution is predetermined, but based on what? Typical scheme Divide n size data into P

More information

1) If a location is initialized to 0, what will the first invocation of TestAndSet on that location return?

1) If a location is initialized to 0, what will the first invocation of TestAndSet on that location return? Synchronization Part 1: Synchronization - Locks Dekker s Algorithm and the Bakery Algorithm provide software-only synchronization. Thanks to advancements in hardware, synchronization approaches have been

More information

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University CS 571 Operating Systems Midterm Review Angelos Stavrou, George Mason University Class Midterm: Grading 2 Grading Midterm: 25% Theory Part 60% (1h 30m) Programming Part 40% (1h) Theory Part (Closed Books):

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

Notes on the Exam. Question 1. Today. Comp 104:Operating Systems Concepts 11/05/2015. Revision Lectures (separate questions and answers)

Notes on the Exam. Question 1. Today. Comp 104:Operating Systems Concepts 11/05/2015. Revision Lectures (separate questions and answers) Comp 104:Operating Systems Concepts Revision Lectures (separate questions and answers) Today Here are a sample of questions that could appear in the exam Please LET ME KNOW if there are particular subjects

More information

Distributed Computing Group

Distributed Computing Group Distributed Computing Group HS 2009 Prof. Dr. Roger Wattenhofer, Thomas Locher, Remo Meier, Benjamin Sigg Assigned: December 11, 2009 Discussion: none Distributed Systems Theory exercise 6 1 ALock2 Have

More information

CS370 Operating Systems Midterm Review

CS370 Operating Systems Midterm Review CS370 Operating Systems Midterm Review Yashwant K Malaiya Fall 2015 Slides based on Text by Silberschatz, Galvin, Gagne 1 1 What is an Operating System? An OS is a program that acts an intermediary between

More information

Comp 204: Computer Systems and Their Implementation. Lecture 25a: Revision Lectures (separate questions and answers)

Comp 204: Computer Systems and Their Implementation. Lecture 25a: Revision Lectures (separate questions and answers) Comp 204: Computer Systems and Their Implementation Lecture 25a: Revision Lectures (separate questions and answers) 1 Today Here are a sample of questions that could appear in the exam Please LET ME KNOW

More information

Linked Lists: The Role of Locking. Erez Petrank Technion

Linked Lists: The Role of Locking. Erez Petrank Technion Linked Lists: The Role of Locking Erez Petrank Technion Why Data Structures? Concurrent Data Structures are building blocks Used as libraries Construction principles apply broadly This Lecture Designing

More information

Algorithm 23 works. Instead of a spanning tree, one can use routing.

Algorithm 23 works. Instead of a spanning tree, one can use routing. Chapter 5 Shared Objects 5.1 Introduction Assume that there is a common resource (e.g. a common variable or data structure), which different nodes in a network need to access from time to time. If the

More information

A New Theory of Deadlock-Free Adaptive. Routing in Wormhole Networks. Jose Duato. Abstract

A New Theory of Deadlock-Free Adaptive. Routing in Wormhole Networks. Jose Duato. Abstract A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks Jose Duato Abstract Second generation multicomputers use wormhole routing, allowing a very low channel set-up time and drastically reducing

More information

15 418/618 Project Final Report Concurrent Lock free BST

15 418/618 Project Final Report Concurrent Lock free BST 15 418/618 Project Final Report Concurrent Lock free BST Names: Swapnil Pimpale, Romit Kudtarkar AndrewID: spimpale, rkudtark 1.0 SUMMARY We implemented two concurrent binary search trees (BSTs): a fine

More information

Reagent Based Lock Free Concurrent Link List Spring 2012 Ancsa Hannak and Mitesh Jain April 28, 2012

Reagent Based Lock Free Concurrent Link List Spring 2012 Ancsa Hannak and Mitesh Jain April 28, 2012 Reagent Based Lock Free Concurrent Link List Spring 0 Ancsa Hannak and Mitesh Jain April 8, 0 Introduction The most commonly used implementation of synchronization is blocking algorithms. They utilize

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Process- Concept &Process Scheduling OPERATING SYSTEMS

Process- Concept &Process Scheduling OPERATING SYSTEMS OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne PROCESS MANAGEMENT Current day computer systems allow multiple

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

TABLES AND HASHING. Chapter 13

TABLES AND HASHING. Chapter 13 Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/ TABLES AND HASHING Chapter 13

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Lightweight Contention Management for Efficient Compare-and-Swap Operations

Lightweight Contention Management for Efficient Compare-and-Swap Operations Lightweight Contention Management for Efficient Compare-and-Swap Operations Dave Dice 1, Danny Hendler 2 and Ilya Mirsky 3 1 Sun Labs at Oracle 2 Ben-Gurion University of the Negev and Telekom Innovation

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

LOCKLESS ALGORITHMS. Lockless Algorithms. CAS based algorithms stack order linked list

LOCKLESS ALGORITHMS. Lockless Algorithms. CAS based algorithms stack order linked list Lockless Algorithms CAS based algorithms stack order linked list CS4021/4521 2017 jones@scss.tcd.ie School of Computer Science and Statistics, Trinity College Dublin 2-Jan-18 1 Obstruction, Lock and Wait

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

Java Virtual Machine

Java Virtual Machine Evaluation of Java Thread Performance on Two Dierent Multithreaded Kernels Yan Gu B. S. Lee Wentong Cai School of Applied Science Nanyang Technological University Singapore 639798 guyan@cais.ntu.edu.sg,

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Exploiting the Behavior of Generational Garbage Collector

Exploiting the Behavior of Generational Garbage Collector Exploiting the Behavior of Generational Garbage Collector I. Introduction Zhe Xu, Jia Zhao Garbage collection is a form of automatic memory management. The garbage collector, attempts to reclaim garbage,

More information

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control

More information

Most of this PDF is editable. You can either type your answers in the red boxes/lines, or you can write them neatly by hand.

Most of this PDF is editable. You can either type your answers in the red boxes/lines, or you can write them neatly by hand. 15-122 : Principles of Imperative Computation, Spring 2016 Written Homework 7 Due: Monday 7 th March, 2016 Name:q1 Andrew ID: q2 Section:q3 This written homework covers amortized analysis and hash tables.

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Unit 6: Indeterminate Computation

Unit 6: Indeterminate Computation Unit 6: Indeterminate Computation Martha A. Kim October 6, 2013 Introduction Until now, we have considered parallelizations of sequential programs. The parallelizations were deemed safe if the parallel

More information

Lock Oscillation: Boosting the Performance of Concurrent Data Structures

Lock Oscillation: Boosting the Performance of Concurrent Data Structures Lock Oscillation: Boosting the Performance of Concurrent Data Structures Panagiota Fatourou 1 and Nikolaos D. Kallimanis 2 1 Institute of Computer Science (ICS), Foundation of Research and Technology-Hellas

More information

The element the node represents End-of-Path marker e The sons T

The element the node represents End-of-Path marker e The sons T A new Method to index and query Sets Jorg Homann Jana Koehler Institute for Computer Science Albert Ludwigs University homannjkoehler@informatik.uni-freiburg.de July 1998 TECHNICAL REPORT No. 108 Abstract

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3.

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3. Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs 3. LOCK FREE KERNEL 309 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 3 Nov 2017 Lecture 1/3 Introduction Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading without locking Flat

More information

Cuckoo Hashing for Undergraduates

Cuckoo Hashing for Undergraduates Cuckoo Hashing for Undergraduates Rasmus Pagh IT University of Copenhagen March 27, 2006 Abstract This lecture note presents and analyses two simple hashing algorithms: Hashing with Chaining, and Cuckoo

More information

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week 05 Lecture 18 CPU Scheduling Hello. In this lecture, we

More information

Proving liveness. Alexey Gotsman IMDEA Software Institute

Proving liveness. Alexey Gotsman IMDEA Software Institute Proving liveness Alexey Gotsman IMDEA Software Institute Safety properties Ensure bad things don t happen: - the program will not commit a memory safety fault - it will not release a lock it does not hold

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Project 0: Implementing a Hash Table

Project 0: Implementing a Hash Table CS: DATA SYSTEMS Project : Implementing a Hash Table CS, Data Systems, Fall Goal and Motivation. The goal of Project is to help you develop (or refresh) basic skills at designing and implementing data

More information

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Operating Systems Lecture 4 - Concurrency and Synchronization Adrien Krähenbühl Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Mutual exclusion Hardware solutions Semaphores IPC: Message passing

More information

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Michiel Smid October 14, 2003 1 Introduction In these notes, we introduce a powerful technique for solving geometric problems.

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

The Art and Science of Memory Allocation

The Art and Science of Memory Allocation Logical Diagram The Art and Science of Memory Allocation Don Porter CSE 506 Binary Formats RCU Memory Management Memory Allocators CPU Scheduler User System Calls Kernel Today s Lecture File System Networking

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

Problem Set 2. CS347: Operating Systems

Problem Set 2. CS347: Operating Systems CS347: Operating Systems Problem Set 2 1. Consider a clinic with one doctor and a very large waiting room (of infinite capacity). Any patient entering the clinic will wait in the waiting room until the

More information

Enhancing Linux Scheduler Scalability

Enhancing Linux Scheduler Scalability Enhancing Linux Scheduler Scalability Mike Kravetz IBM Linux Technology Center Hubertus Franke, Shailabh Nagar, Rajan Ravindran IBM Thomas J. Watson Research Center {mkravetz,frankeh,nagar,rajancr}@us.ibm.com

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

Page 1. Challenges" Concurrency" CS162 Operating Systems and Systems Programming Lecture 4. Synchronization, Atomic operations, Locks"

Page 1. Challenges Concurrency CS162 Operating Systems and Systems Programming Lecture 4. Synchronization, Atomic operations, Locks CS162 Operating Systems and Systems Programming Lecture 4 Synchronization, Atomic operations, Locks" January 30, 2012 Anthony D Joseph and Ion Stoica http://insteecsberkeleyedu/~cs162 Space Shuttle Example"

More information

Remaining Contemplation Questions

Remaining Contemplation Questions Process Synchronisation Remaining Contemplation Questions 1. The first known correct software solution to the critical-section problem for two processes was developed by Dekker. The two processes, P0 and

More information

6.852: Distributed Algorithms Fall, Class 21

6.852: Distributed Algorithms Fall, Class 21 6.852: Distributed Algorithms Fall, 2009 Class 21 Today s plan Wait-free synchronization. The wait-free consensus hierarchy Universality of consensus Reading: [Herlihy, Wait-free synchronization] (Another

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Lock cohorting: A general technique for designing NUMA locks

Lock cohorting: A general technique for designing NUMA locks Lock cohorting: A general technique for designing NUMA locks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada CS102 Hash Tables Prof. Tejada 1 Vectors, Linked Lists, Stack, Queues, Deques Can t provide fast insertion/removal and fast lookup at the same time The Limitations of Data Structure Binary Search Trees,

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Efficient Adaptations of the Non-Blocking Buffer for Event Message Communication

Efficient Adaptations of the Non-Blocking Buffer for Event Message Communication Proc. ISORC2007 (10th IEEE CS Int' Symp. on Object & Component Oriented Real- Time Distributed Computing), Santorini, Greece, May 7-9, 2007, pp.29-40. Efficient Adaptations of the Non-Blocking Buffer for

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

The Wait-Free Hierarchy

The Wait-Free Hierarchy Jennifer L. Welch References 1 M. Herlihy, Wait-Free Synchronization, ACM TOPLAS, 13(1):124-149 (1991) M. Fischer, N. Lynch, and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process,

More information

6 Distributed data management I Hashing

6 Distributed data management I Hashing 6 Distributed data management I Hashing There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication

More information