Fast and Scalable Rendezvousing


Fast and Scalable Rendezvousing

by Michael Hakimi

A thesis submitted for the degree of Master of Computer Science

Supervised by Professor Yehuda Afek

School of Computer Science, Tel Aviv University

September 2011

CONTENTS

1. Introduction
2. Preliminaries
   2.1 Asymmetric rendezvous
   2.2 Progress
3. Related Work
   3.1 Synchronous queues
   3.2 Elimination
4. AdaptiveAR Algorithm
   4.1 Algorithm Description
   4.2 Nonadaptive Algorithm
   4.3 Adding Adaptivity
   4.4 Pragmatic Extensions and Considerations
   4.5 Correctness
5. TripleAdp Algorithm Description
   5.1 Algorithm Description
   5.2 Correctness
6. Performance Evaluation
   6.1 Experimental setup
   6.2 AdaptiveAR Performance Evaluation
       6.2.1 Producer/consumer throughput
       6.2.2 Adaptivity and peeking impact
       6.2.3 NUMA Architecture
       6.2.4 Asymmetric workloads
       6.2.5 Bursts
       6.2.6 Varying arrival rate
       6.2.7 Work uniformity
   6.3 TripleAdp Performance Evaluation
       6.3.1 Symmetric Producer/Consumer Workload
       6.3.2 Asymmetric Producer/Consumer Workloads
       6.3.3 Memory Consumption
       6.3.4 Adaptiveness
7. Concurrent Stack
8. Conclusion

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Prof. Yehuda Afek, and my partner, Adam Morrison, for their guidance and support during my research.

ABSTRACT

In an asymmetric rendezvous system, such as an unfair synchronous queue or an elimination array, threads of two types, consumers and producers, show up and are matched, each with a unique thread of the other type. Here we present a new highly scalable, high-throughput asymmetric rendezvous system that outperforms prior synchronous queue and elimination array implementations under both symmetric and asymmetric workloads (more operations of one type than the other). It is, in essence, a fast matching machine. As a consequence, we also obtain a highly scalable and competitive stack implementation. This thesis is based on the paper "Fast and Scalable Rendezvousing" by Afek, Hakimi, and Morrison, presented at DISC 2011 [1].

1. INTRODUCTION

A common abstraction in concurrent programming is that of an asymmetric rendezvous mechanism. In this mechanism there are two types of threads, e.g., producers and consumers, that show up. The goal is to match pairs of threads, one of each type, and send them away. Usually the purpose of the pairing is for a producer to hand over a data item (such as a task to perform) to a consumer. The asymmetric rendezvous abstraction encompasses both unfair synchronous queues (or pools) [13], which are a key building block in Java's thread pool implementation and other message-passing and hand-off designs [3, 13], and the elimination technique [14], which is used to scale concurrent stacks and queues [8, 12]. To date, considerable work has been invested in asymmetric rendezvous objects.

In this thesis, we present a highly scalable asymmetric rendezvous algorithm that improves the state of the art in both unfair synchronous queue and elimination algorithms. It is based on a distributed scalable ring structure, unlike Java's synchronous queue, which relies on a non-scalable centralized structure. It is nonblocking, in the following sense: if both producers and consumers keep taking steps, some rendezvous operation is guaranteed to complete. (This is similar to the lock-freedom property [9], while taking into account the fact that it takes two to tango, i.e., both types of threads must take steps to successfully rendezvous.) It is also uniform, in that no thread has to perform work on behalf of other threads. This is in contrast to the flat combining (FC) based synchronous queues of Hendler et al. [7], which are blocking and non-uniform. Most importantly, our algorithm performs extremely well in practice, outperforming Java's synchronous queue by up to 60x on an UltraSPARC T2 Plus multicore machine, and the FC synchronous queue by up to a factor of six. Using our algorithm as the elimination layer of a concurrent stack yields a factor of three improvement over Hendler et al.'s FC stack [6].

Our first algorithm is based on a simple idea that turns out to be remarkably effective in practice: the algorithm itself is asymmetric. While a consumer captures a node in the shared ring structure and waits there for a producer, a producer actively seeks out waiting consumers on the ring. We present two new techniques that, combined with this asymmetric methodology on the ring, lead to extremely high performance and scalability in practice. The first is a new ring adaptivity scheme that dynamically adjusts the ring's size, leaving enough room to fit all the consumers while minimizing the number of empty nodes that producers must needlessly search. With an adaptive ring size we can expect the nodes to be mostly occupied, which leads us to the next idea: if a producer starts to scan the ring and finds its first node empty, it is likely that a consumer will arrive there shortly. Yet simply waiting at this node, hoping that this will occur, makes the algorithm prone to timeouts and impedes progress. We solve this problem by employing a peeking technique that lets the producer have the best of both worlds: as the producer traverses the ring, it continues to peek at its initial node; should a consumer arrive there, the producer immediately tries to partner with it, thereby minimizing the amount of wasted work.

Our algorithm avoids two problems found in prior elimination algorithms that did not exploit asymmetry. In these works [2, 8, 14], both types of threads pick a random node in the hope of meeting the right kind of partner. Thus these works suffer from false matches, when two threads of the same type meet, and from timeouts, when a producer and a consumer pick distinct nodes and futilely wait for a partner to arrive. We refer to this version of our algorithm as AdaptiveAR.

We tested our algorithm on different workloads. In the scenario where consumers outnumber producers,

we successfully maintain the peak throughput as the number of consumers grows. Unfortunately, in the scenario where producers outnumber consumers, the producers collide with each other when competing for the few consumers in the ring, causing AdaptiveAR's throughput to degrade as the number of producers grows.

We present a second algorithm that addresses this limitation; we refer to it as TripleAdp. The main observation is that the mirror image of AdaptiveAR, in which a consumer seeks a producer that waits on the ring, does well in the workload where AdaptiveAR fails to maintain peak throughput. Therefore, the TripleAdp algorithm adapts to the access pattern and switches between (essentially) AdaptiveAR and its mirror image, thus enjoying the best of both worlds. The TripleAdp algorithm is also adaptive to the total ring size: it allocates and deallocates nodes as needed, unlike AdaptiveAR, which uses a statically allocated ring. TripleAdp still maintains the ability to adapt the effective ring size (this avoids frequent memory allocation and moving threads between old and new rings). The TripleAdp algorithm is therefore triple adaptive: it adapts to the effective ring size, to the access pattern, and to the total ring size.

We discuss related work in Section 3. The algorithms are presented in Sections 4 and 5, after formally defining the asymmetric rendezvous problem and our progress property in Section 2. The empirical evaluation is presented in Section 6. In Section 7 we evaluate an implementation of a stack in which our AdaptiveAR algorithm is used as an elimination layer. Section 8 concludes.

2. PRELIMINARIES

2.1 Asymmetric rendezvous

In the asymmetric rendezvous problem there are threads of two types, producers and consumers. Producers perform put(x) operations, which return either OK or a reserved value ⊥ (explained shortly). Consumers perform get() operations, which return either some item x handed off by a producer or the reserved value ⊥. Producers and consumers show up and must be matched with a unique thread of the other type, such that a producer invoking put(x) and a consumer whose get() returns x must be concurrent at some point. Since this synchronous behavior inherently requires waiting, we provide threads with the option to give up: the special return value ⊥ is used to indicate such a timeout. We say that an operation completes successfully if it returns a value other than ⊥.

2.2 Progress

To reason about progress while taking into account that rendezvous inherently requires waiting, we consider the combined behavior of both types of operations. An algorithm A that implements asymmetric rendezvous is nonblocking if some operation completes successfully after both put() and get() operations take enough steps concurrently. Note that, as with the definition of the lock-freedom property [9], there is no fixed a priori bound on the number of steps after which some operation must complete successfully. Rather, we rule out implementations that make no progress at all, i.e., those that admit executions in which both types of threads take steps infinitely often and yet no operation successfully completes. By counting only successful operations as progress, we also rule out the trivial implementation that always returns ⊥ (i.e., times out immediately).
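To make the interface concrete, the following Java sketch captures the semantics above. It is illustrative only; the names Rendezvous and TIMEOUT are ours, with TIMEOUT standing in for the reserved value ⊥.

    import java.util.concurrent.TimeUnit;

    // Minimal sketch of the asymmetric rendezvous interface (names ours).
    interface Rendezvous<T> {
        Object TIMEOUT = new Object(); // stands in for the reserved value

        // Hands x to a unique concurrent get(); returns OK (here: null) on
        // success, or TIMEOUT if the thread gives up waiting.
        Object put(T x, long timeout, TimeUnit unit);

        // Returns the item handed off by a unique concurrent put(x),
        // or TIMEOUT if no producer shows up in time.
        Object get(long timeout, TimeUnit unit);
    }

A put(x) and the get() returning x must overlap in time; an operation completes successfully exactly when it returns something other than TIMEOUT.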

3. RELATED WORK

3.1 Synchronous queues

A synchronous queue using three semaphores was described by Hanson [5]. Java 5 includes a coarse-grained locking synchronous queue, which was superseded in Java 6 by Scherer, Lea and Scott's algorithm [13]. Their algorithm is based on a Treiber-style nonblocking stack [17] that at all times contains rendezvous requests by either producers or consumers. A producer finding the stack empty or containing producers pushes itself on the stack and waits, but if it finds the stack holding consumers, it attempts to partner with the consumer at the top of the stack (consumers behave symmetrically). This creates a sequential bottleneck. Motivated by this, Afek, Korland, Natanzon, and Shavit described elimination-diffracting (ED) trees [2], a randomized distributed data structure in which arriving threads follow a path through a binary tree whose internal nodes are balancer objects [15] and whose leaves are Java synchronous queues. In each internal node a thread accesses an elimination array in an attempt to avoid descending down the tree. Recently, Hendler et al. applied the flat combining paradigm [6] to the synchronous queue problem [7], describing single-combiner and parallel versions. In a single-combiner FC pool, a thread attempts to become a combiner by acquiring a global lock on the queue. Threads that fail to grab the lock instead post their requests and wait for them to be fulfilled by the combiner, which matches up the participating threads. In the parallel version there are multiple combiners that each handle a subset of the participating threads and then try to satisfy unmatched requests in their subset by entering an exchange FC synchronous queue.

3.2 Elimination

The elimination technique is due to Shavit and Touitou [14]. Hendler, Shavit and Yerushalmi used elimination with an adaptive scheme inspired by Shavit and Zemach [16] to obtain a scalable linearizable stack [8]. They used a collision array in which each thread picks a slot, trying to collide with a partner. In their adaptive scheme threads adapt locally: each thread picks a slot to collide in from a sub-range of the collision array centered around the middle of the array. If no partner arrives, the thread eventually shrinks the range. Alternatively, if the thread sees a waiting partner but fails to collide due to contention, it increases the range. In our adaptivity technique, described in Sect. 4, threads also make local decisions, but with global impact: the ring is resized. Moir et al. used elimination to scale a FIFO queue [12]. In their algorithm an enqueuer picks a random slot in an elimination array and waits there for a dequeuer; a dequeuer picks a random slot, giving up immediately if that slot is empty, rather than seeking out a waiting enqueuer. Scherer, Lea and Scott applied elimination in their symmetric rendezvous system [10], where there is only one type of thread and so the pairing is between any two threads that show up. They also do not discuss adaptivity, though a later version of their symmetric exchanger channel, which is part of Java [11], includes a scheme that resizes the elimination array. Here, however, we are interested in the more difficult asymmetric rendezvous problem, where not all pairings are allowed.
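To illustrate the collision-array idea underlying these elimination schemes, here is a minimal Java sketch of a single slot in which one producer and one consumer can exchange an item. It is our own illustration, not code from any of the cited papers.

    import java.util.concurrent.atomic.AtomicReference;

    // One slot of a collision array. A producer installs its item and waits
    // briefly; a consumer claims an installed item by CASing in TAKEN.
    class EliminationSlot<T> {
        static final Object EMPTY = new Object(), TAKEN = new Object();
        final AtomicReference<Object> slot = new AtomicReference<>(EMPTY);

        boolean tryProduce(T item, int spins) {
            if (!slot.compareAndSet(EMPTY, item))
                return false;                    // slot busy; caller retries elsewhere
            while (spins-- > 0)
                if (slot.get() == TAKEN) {
                    slot.set(EMPTY);
                    return true;                 // a consumer took the item
                }
            if (slot.compareAndSet(item, EMPTY)) // timed out: withdraw the item,
                return false;                    // unless a consumer just took it
            slot.set(EMPTY);
            return true;
        }

        @SuppressWarnings("unchecked")
        T tryConsume() {
            Object v = slot.get();
            if (v == EMPTY || v == TAKEN)
                return null;                     // nothing to take
            return slot.compareAndSet(v, TAKEN) ? (T) v : null;
        }
    }

The false-match and timeout problems discussed in the Introduction are visible here: two producers (or two consumers) landing on the same slot accomplish nothing, and a lone producer wastes its spins.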

4. AdaptiveAR ALGORITHM

4.1 Algorithm Description

The pseudo code of the AdaptiveAR algorithm is provided in Fig. 4.2. The main data structure (Fig. 4.2a) is a ring of nodes. The ring is accessed through a central array ring, where ring[i] points to the i-th node in the ring (Fig. 4.1).¹ A consumer attempts to capture a node in the ring: it first scans the ring searching for an empty node and, once a node is captured, waits there for a producer. A producer scans the ring, seeking a waiting consumer; once a consumer is found, it tries to match with it.

For simplicity, in this section we assume that the number of threads in the system, T, is known in advance, and we pre-allocate a ring of size T. However, the effective size of the ring is adaptive and is dynamically adjusted as the concurrency level changes (we expand on this in Sect. 4.3). In Section 5 we handle the case in which the maximum number of threads that can show up is not known in advance; there, nodes are allocated and deallocated dynamically, i.e., the algorithm is adaptive also to the number of threads.

Conceptually, each ring node need only contain an item pointer that encodes the node's state:

1. Free: item points to a global reserved object, FREE, that is distinct from any object a producer may enqueue. Initially all nodes are free.
2. Captured by a consumer: item is NULL.
3. Holding data (of a producer): item points to the data.

Fig. 4.1: The ring data structure with the array of pointers. Node i points to node i - 1. The ring's maximum size is T; the ring is resized by changing the head's prev pointer.

In practice, ring traversal is more efficient when following a pointer from one node to the next rather than traversing the array. Array traversal suffers from two problems.

¹ This reflects Java semantics, where arrays are of references to objects and not of objects themselves.

First, in Java, reading from the next array cell may result in a cache miss (it is an array of pointers), whereas reading from the (just accessed) current node does not. Second, maintaining a running array index requires an expensive test and branch to handle index boundary conditions, or counting modulo the ring size, while reading a pointer is cheap. The pointer field is named prev, reflecting that node i points to node i - 1. This allows the ring to be resized with a single atomic compareAndSet (CAS) that changes ring[1]'s (the head's) prev pointer. To support mapping from a node to its ring index, each node holds its ring index in a read-only index field.

For the sake of clarity we start in Sect. 4.2 by discussing the algorithm without adaptation of the ring size. The adaptivity code, however, is included, marked by a ∗ symbol in Fig. 4.2, and is discussed in Sect. 4.3. Section 4.4 discusses some practically-motivated extensions to the algorithm, e.g., supporting timeouts. Finally, Sect. 4.5 discusses correctness and progress.

4.2 Nonadaptive Algorithm

Producers (Fig. 4.2b): A producer searches the ring for a waiting consumer and attempts to hand its data to it. The search begins at a node, s, obtained by hashing the thread's id (Line 19). The producer passes the ring size to the hash function as a parameter, to ensure that the returned node falls within the ring. It then traverses the ring looking for a node captured by a consumer. Here the producer periodically peeks at the initial node s to see if it has a waiting consumer (Lines 24-26); if not, it checks the current node in the traversal (Lines 27-30). Once a captured node is found, the producer tries to deposit its data using a CAS (Lines 25 and 28). If successful, it returns.

Consumers (Fig. 4.2c): A consumer searches the ring for a free node and attempts to capture it by atomically changing its item pointer from FREE to NULL using a CAS. Once a node is captured, the consumer spins, waiting for a producer to arrive and deposit its item. Similarly to the producer, a consumer hashes its id to obtain a starting point for its search, s (Line 39). The consumer calls the findfreenode procedure to traverse the ring from s until it captures and returns some node u (Lines 59-74). (Recall that the code responsible for handling adaptivity, which is marked by a ∗, is ignored for the moment.) The consumer then waits until a producer deposits an item in u (Lines 41-49), frees u (Lines 51-52) and returns (Line 56).

4.3 Adding Adaptivity

If the number of active consumers is smaller than the ring size, producers may need to traverse a large number of empty nodes before finding a match (the length of the path is linear in the ring size). It is therefore important to decrease the ring size if the concurrency level is low. On the other hand, if there are more concurrent threads than the ring size (high contention), it is important to dynamically increase the ring. The goal of the adaptivity scheme is to keep the ring size just right and allow threads to complete their operations within a constant number of steps. The logic driving the resizing process is in the consumer's code, which detects when the ring is overcrowded or sparsely populated and changes the size accordingly. If a consumer fails to capture a node after passing through many nodes in findfreenode() (due to not finding a free node or having its CASes fail), then the ring is too crowded and should be increased. The exact threshold is determined by an increase threshold parameter, T_i.
If, in findfreenode(), a consumer fails to capture a node after passing through more than T_i · ringsize nodes, it attempts to increase the ring size (Lines 67-72). To detect when to decrease the ring, we observe that when the ring is sparsely populated, a consumer usually finds a free node quickly, but then has to wait longer until a producer finds it. Thus, we add a wait threshold, T_w, and a decrease threshold, T_d. If it takes a consumer more than T_w iterations of the waiting loop (Lines 41-49) to successfully complete the rendezvous, but it successfully captured its ring node within at most T_d steps, then it attempts to decrease the ring size (Lines 53-55).
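Before turning to the code, the two threshold tests can be summarized in a short Java sketch (class and method names are ours):

    // busyCtr counts nodes traversed in findfreenode; ctr counts iterations
    // of the consumer's waiting loop.
    final class ResizePolicy {
        static boolean shouldIncrease(int busyCtr, int ringSize, int Ti) {
            return busyCtr > Ti * ringSize;        // too crowded: grow by one
        }
        static boolean shouldDecrease(int busyCtr, int ctr, int ringSize,
                                      int Td, int Tw) {
            // captured a node quickly, then waited long: shrink by one
            return busyCtr < Td && ctr > Tw && ringSize > 1;
        }
    }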

1  struct node {
2    item  : pointer to object
3    index : node's index in the ring
4    prev  : pointer to previous ring node
5  }
6
7  shared vars:
8  ring : array [1,...,T] of pointers to nodes,
9         ring[1].prev = ring[T],
10        ring[i].prev = ring[i-1] (i > 1)
11
12 utils:
13 getringsize() {
14   node tail := ring[1].prev
15   return tail.index
16 }

(a) Global variables

17 put(threadid, object item) {
18
19   node s := ring[hash(threadid,
20                  getringsize())]
21   node v := s.prev
22
23   while (true) {
24     if (s.item == NULL) {
25       if (CAS(s.item, NULL, item))
26         return OK
27     } else {
28       if (v.item == NULL and CAS(v.item, NULL, item))
29         return OK
30       v := v.prev
31     }
32   }
33 }

(b) Producer code

34  get(threadid) {
35    int ringsize, busy_ctr, ctr := 0
36    node s, u
37
38    ringsize := getringsize()
39    s := ring[hash(threadid, ringsize)]
40    (u, busy_ctr) := findfreenode(s, ringsize)
41    while (u.item == NULL) {
∗42     ringsize := getringsize()
∗43     if (u.index > ringsize and
∗44         CAS(u.item, NULL, FREE)) {
∗45       s := ring[hash(threadid, ringsize)]
∗46       (u, busy_ctr) := findfreenode(s, ringsize)
∗47     }
∗48     ctr := ctr + 1
49    }
50
51    item := u.item
52    u.item := FREE
∗53    if (busy_ctr < T_d and ctr > T_w and ringsize > 1)
∗54      // Try to decrease ring
∗55      CAS(ring[1].prev, ring[ringsize], ring[ringsize-1])
56    return item
57  }
58
59  findfreenode(node s, int ringsize) {
60    int busy_ctr := 0
61    while (true) {
62      if (s.item == FREE and
63          CAS(s.item, FREE, NULL))
64        return (s, busy_ctr)
65      s := s.prev
66      busy_ctr := busy_ctr + 1
∗67     if (busy_ctr > T_i · ringsize) {
∗68       if (ringsize < T) // Try to increase ring
∗69         CAS(ring[1].prev, ring[ringsize], ring[ringsize+1])
∗70       ringsize := getringsize()
∗71       busy_ctr := 0
∗72     }
73    }
74  }

(c) Consumer code

Fig. 4.2: Asymmetric rendezvous algorithm. Lines beginning with ∗ handle the adaptivity process.
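Translated to Java, the node state encoding and the two CAS transitions of Fig. 4.2 look roughly as follows. This is a sketch using java.util.concurrent.atomic; the class and method names are ours, not the thesis's implementation.

    import java.util.concurrent.atomic.AtomicReference;

    class RingNode {
        static final Object FREE = new Object(); // reserved "free" marker
        final int index;                         // read-only ring index
        volatile RingNode prev;                  // node i points to node i-1
        final AtomicReference<Object> item = new AtomicReference<>(FREE);

        RingNode(int index) { this.index = index; }

        // Consumer capture: FREE -> NULL (line 63 of Fig. 4.2).
        boolean tryCapture() { return item.compareAndSet(FREE, null); }

        // Producer deposit: NULL -> data (lines 25 and 28 of Fig. 4.2).
        boolean tryDeposit(Object data) { return item.compareAndSet(null, data); }
    }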

Resizing the ring is done by CASing the prev pointer of the ring head (ring[1]) from the current tail of the ring to the tail's successor (to increase the ring size by one) or to its predecessor (to decrease the ring size by one). If the CAS fails, then another thread has resized the ring and the consumer continues. The head's prev pointer is not a sequential bottleneck, because resizing is a rare event in stable workloads. Even if resizing is frequent, the thresholds ensure that the cost of the CAS is negligible compared to the other work performed by the algorithm, and resizing pays off in the form of a better ring size, which leads to improved performance.

Handling consumers left out of the ring: A decrease in the ring size may leave a consumer's captured node outside of the current ring. Therefore, each consumer periodically checks if its node's index is larger than the current ring size (Line 43). If so, it tries to free its node using a CAS (Line 44) and find itself a new node in the ring (Lines 45-46). However, if the CAS fails, then a producer has already deposited its data in this node, and so the consumer can take the data and return (this is detected in the next execution of Line 41).

4.4 Pragmatic Extensions and Considerations

Here we discuss two extensions that might be interesting in certain practical scenarios.

Timeout support: Aborting a rendezvous that is taking more than a specified period of time is a useful feature. Unfortunately, our model has no notion of time and so we do not model the semantics of timeouts. We simply describe the details of an implementation that captures the intuitive notion of a timeout. The operations take a parameter specifying the desired timeout, if any. Timeout expiry is then checked in each iteration of the main loop in put(), get() and findfreenode(). If the timeout expires, a producer or a consumer in findfreenode() aborts by returning ⊥. A consumer that has captured a ring node cannot abort before freeing the node by CASing its item from NULL back to FREE. If the CAS fails, the consumer has found a match and its rendezvous has completed successfully. Otherwise its abort attempt has succeeded and it returns ⊥. (A sketch of this abort protocol appears at the end of this chapter.)

Busy waiting: Both producers and consumers rely on busy waiting, which may not be appropriate for blocking applications that wish to let a thread sleep until a partner arrives. Being blocking, such applications may not need our strong progress guarantee. We plan to extend our algorithm to blocking scenarios in future work.

4.5 Correctness

Theorem 1. The algorithm meets the asymmetric rendezvous semantics.

Proof: We consider the point at which both put(x) and the get() that returns x take effect as the point where the producer successfully CASes x into some node's item pointer (Line 25 or 28). The CAS guarantees that only a single producer deposits its item into the node. The producer leaves only after a successful CAS, and the consumer leaves only after its node has been captured. The algorithm thus meets the asymmetric rendezvous semantics.

Theorem 2. If threads do not request timeouts, the algorithm is nonblocking (as defined in Sect. 2).

Proof: Assume towards a contradiction that there is an execution in which both producers and consumers take infinitely many steps and yet no rendezvous successfully completes. Since there is a finite number of threads T, there must be a producer/consumer pair, p and c, each of which runs forever without completing. We will use Lemma 1 to derive a contradiction.

Lemma 1. The consumer c eventually captures a node.

Proof: Suppose c never captures a node.
Then eventually the ring size must become T. This is because after c completes a cycle around the ring in findfreenode(), it tries to increase the ring size (say the increase

threshold is 1). c's resizing CAS (Line 69) either succeeds or fails because another CAS succeeded. The successful CAS must have been an increase, because a decrease of the ring size implies a completed rendezvous, which by our assumption does not occur. Once the ring size reaches T, the ring has room for c (since T is the maximum number of threads). Thus c can fail to capture a node only by encountering another consumer twice at different nodes, implying that this consumer completes a rendezvous, a contradiction. It therefore cannot be that c never captures a node.

From Lemma 1 we get that c captures some node but p never finds c on the ring, implying that c has been left out of the ring by some decreasing resize. But since, from that point on, the ring size cannot decrease, c eventually executes Lines 43-46 and moves itself into p's range, where either p or another producer rendezvouses with it, a contradiction.
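As promised in Sect. 4.4, here is a sketch of the consumer-side abort protocol, written against the hypothetical RingNode class above; the deadline handling is ours.

    class TimeoutSupport {
        // Consumer timeout, given a captured node u (u.item is currently null).
        // Returns null on a successful abort; otherwise a producer won the race
        // and we take its item, completing the rendezvous.
        static Object awaitOrAbort(RingNode u, long deadlineNanos) {
            while (true) {
                Object x = u.item.get();
                if (x != null) {                   // a producer deposited an item
                    u.item.set(RingNode.FREE);     // free the node, return the item
                    return x;
                }
                if (System.nanoTime() >= deadlineNanos) {
                    // Abort only if no producer slipped in: NULL -> FREE.
                    if (u.item.compareAndSet(null, RingNode.FREE))
                        return null;               // abort (timeout) succeeded
                    // CAS failed: we were matched; next iteration takes the item.
                }
            }
        }
    }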

5. TripleAdp ALGORITHM DESCRIPTION

5.1 Algorithm Description

Similarly to the AdaptiveAR algorithm, the main data structure is a ring of nodes. Each node has a unique id (between 1 and the total ring size) stored in its index field. The nodes are linked via their prev field, i.e., node i points to node i - 1. The ring is accessed through an array whose i-th element points to node i. A global shared ring variable points to this array. This indirection allows a thread to install a new ring/array pair with a single atomic compareAndSet (CAS) that points ring at the new array (and thereby at the new ring). Such a ring reallocation occurs when TripleAdp adapts the total ring size. The shared state of the algorithm is shown in Fig. 5.1.

TripleAdp has two operation modes. In one mode a consumer captures a ring node and waits for a producer, while a producer seeks a waiting consumer by traversing the ring. The other mode is symmetric: a consumer seeks out a waiting producer. The current mode is stored in a global mode variable. If mode = 1, the first mode, henceforth referred to as consumers waiting, is in effect. Otherwise, if mode = 2, the second (producers waiting) mode is used. Figure 5.2a shows the pseudo code of the get() and put() dispatch: an arriving thread reads mode and then invokes one of the wait or seek procedures, where the rest of the work is done. The mode can change during the execution in response to the workload. The wait and seek procedures therefore periodically check if the mode has changed, and reverse the thread's role when this happens. But as this role reversal does not occur atomically with the mode change, threads in different modes can be active concurrently in the ring.

1  struct node {
2    item[] : array of two pointers to objects
3    index  : node's index in the ring
4    prev   : pointer to previous ring node
5  }
6
7  shared variables:
8  ring : pointer to an array of nodes, initially
9         of size 1, with ring[1].prev = ring[1]
10
11 mode : int, initially 1
12
13 utils:
14 allocatering(int size, int max) {
15   node[] r := new node[max]
16   for (i in 1 ... max)
17     r[i] = new node
18   initialize r's nodes such that:
19     r[i].index = i
20     r[1].prev = r[size]
21     r[i].prev = r[i-1] (i > 1)
22   return r
23 }
24
25 getringsize(node[] r) {
26   node tail := r[1].prev
27   return tail.index
28 }

29 // The node item a thread in wait() uses
30 waititemindex(object value) {
31   if (value = null)
32     return 1
33   else
34     return 2
35 }
36
37 // The node item a thread in seek() uses
38 seekitemindex(object value) {
39   if (value != null)
40     return 1
41   else
42     return 2
43 }
44
45 // Was value deposited by a waiting thread?
46 haswaiter(object value, int idx) {
47   if (idx = 1)
48     return value = NULL
49   else
50     return (value != NULL and value != FREE)
51 }

Fig. 5.1: Algorithm TripleAdp: definitions and helper procedures.
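In Java, the indirection through ring can be expressed with a single AtomicReference to the current node array; installing a freshly allocated ring is then one CAS. The sketch below reuses the hypothetical RingNode class from Chapter 4 (TripleAdp's per-mode item slots are omitted for brevity), and letting the size parameter set the head's prev is our reading of allocatering.

    import java.util.concurrent.atomic.AtomicReference;

    class RingHolder {
        // The global 'ring' variable: a mutable pointer to the current array.
        final AtomicReference<RingNode[]> ring =
            new AtomicReference<>(allocateRing(1, 1));

        // Builds a ring of 'max' nodes whose effective size starts at 'size'.
        static RingNode[] allocateRing(int size, int max) {
            RingNode[] r = new RingNode[max + 1];     // 1-based, as in the text
            for (int i = 1; i <= max; i++) r[i] = new RingNode(i);
            for (int i = 2; i <= max; i++) r[i].prev = r[i - 1];
            r[1].prev = r[size];                      // head's prev marks the tail
            return r;
        }

        // Total-size growth: allocate a double-sized ring, install with one CAS.
        boolean tryGrow(RingNode[] old, int newEffectiveSize) {
            RingNode[] nr = allocateRing(newEffectiveSize, 2 * (old.length - 1));
            return ring.compareAndSet(old, nr);       // losers see ring != r, retry
        }
    }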

To prevent such threads from stepping on each other, each node has a two-member array field named item, where item[m] is accessed only by threads in mode m and encodes the node's state in mode m:

  Node state              | Consumers waiting (mode = 1)       | Producers waiting (mode = 2)
  Free                    | item[1] = FREE                     | item[2] = FREE
  Captured (by a waiter)  | item[1] = NULL                     | item[2] holds the producer's item
  Rendezvous occurred     | item[1] holds the producer's item  | item[2] = NULL

Here FREE is a global reserved object, distinct from any object a producer may enqueue, to which item[mode] points when the node is free.

Figure 5.3 shows the pseudo code of the wait and seek procedures. For the sake of clarity, we initially describe the code without going into the details of adapting the ring size and memory consumption, to which we return later. The relevant lines of code, marked by a ∗ in the pseudo code, are ignored for the moment.

Waiters (Fig. 5.3a): First (Line 96), a waiter determines its operation mode by invoking the helper procedure waititemindex(), whose code is given in Fig. 5.1. If the waiter does not have a value to hand off then it is a consumer, implying that it is in the consumers waiting mode (m = 1). Otherwise, it is a producer and m = 2. The waiter then attempts to capture a node in the ring and waits there for a seeker to rendezvous with it. This process generally follows that of our previous algorithm, AdaptiveAR. The waiter starts by searching the ring for a free node to capture. It hashes its id to obtain a starting point for this search (Lines 98-100) and then invokes the findfreenode procedure to perform the search (Line 101). The pseudo code of findfreenode is shown in Fig. 5.2b. Here the waiter traverses the ring and, upon encountering a FREE node, attempts to CAS the node's item to its value (Lines 69-72). (Recall that value holds the thread's item if it is a producer, and NULL otherwise.) The traversal is aborted if the global mode changes from the current mode (Lines 73-74), in which case the waiter reverses its role (Lines 102-103). Once findfreenode captures a node u (Line 101), the waiter spins on u until a seeker arrives and deposits its value (Lines 106-127), at which point the waiter can return (Line 119). If the waiter is a consumer, it returns the seeking producer's deposited value; otherwise it returns NULL.

52 get(int threadid) {
53   if (mode = 1)
54     return wait(threadid, NULL)
55   else
56     return seek(threadid, NULL)
57 }
58
59 put(int threadid, object item) {
60   if (mode = 1)
61     return seek(threadid, item)
62   else
63     return wait(threadid, item)
64 }

(a)

65 findfreenode(object value, node[] r,
               int ringsize, node s) {
66   int busy_ctr := 0
67   int i := waititemindex(value)
68   while (true) {
69     if (s.item[i] == FREE and
70         CAS(s.item[i], FREE, value))
71       return (s, busy_ctr, OK)
72     s := s.prev
73     if (mode != i)
74       return (-, -, MODE_CHG)
75     if (ring != r)
76       return (-, -, RING_CHG)
∗77    busy_ctr := busy_ctr + 1
∗78    if (busy_ctr > T_i · ringsize) {
∗79      if (ringsize < size(r)) { // Try to increase ring
∗80        CAS(r[1].prev, r[ringsize], r[ringsize+1])
∗81      } else { // Allocate larger ring
∗82        node[] nr = allocatering(ringsize+1, 2 · len(r))
∗83        CAS(ring, r, nr)
∗84      }
∗85      ringsize := getringsize(r)
∗86      busy_ctr := 0
∗87    }
88   }
89 }

(b)

Fig. 5.2: TripleAdp pseudo code of get(), put() and findfreenode(). Lines marked by a ∗ handle ring size adaptivity.
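In Java, the per-mode state of the table above can be realized with two AtomicReference slots per node, and haswaiter() becomes a pure test. A sketch (names ours):

    import java.util.concurrent.atomic.AtomicReference;

    class DualNode {
        static final Object FREE = new Object();
        final int index;
        volatile DualNode prev;
        // item[1] is used in consumers-waiting mode, item[2] in
        // producers-waiting mode; slot 0 is unused so indices match the text.
        final AtomicReference<Object>[] item;

        @SuppressWarnings("unchecked")
        DualNode(int index) {
            this.index = index;
            item = (AtomicReference<Object>[]) new AtomicReference[3];
            item[1] = new AtomicReference<>(FREE);
            item[2] = new AtomicReference<>(FREE);
        }

        // Was 'value' (just read from item[idx]) deposited by a waiting thread?
        static boolean hasWaiter(Object value, int idx) {
            if (idx == 1)
                return value == null;              // a waiting consumer
            return value != null && value != FREE; // a waiting producer's item
        }
    }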

90  wait(int threadid, object value) {
91    int i, ctr, busy_ctr, ringsize
92    node s, u
93    node[] r
94
95    ctr := 0
96    i := waititemindex(value)
97    while (true) {
98      r := ring // private copy of ring pointer
99      ringsize := getringsize(r)
100     s := r[hash(threadid, ringsize)]
101     (u, busy_ctr, ret) := findfreenode(value, r, ringsize, s)
102     if (ret = MODE_CHG)
103       return seek(threadid, value)
104     if (ret = RING_CHG)
105       continue // Restart loop
106     while (true) { // Wait for partner
107       if (u.item[i] != value) {
108         object item := u.item[i]
109         u.item[i] := FREE
∗110        if (busy_ctr < T_d and ctr > T_w and ringsize > 1) {
∗111          // Try to decrease ring
∗112          if (ringsize - 1 = size(r)/4) {
∗113            node[] nr = allocatering(ringsize-1, len(r)/2)
∗114            CAS(ring, r, nr)
∗115          } else {
∗116            CAS(r[1].prev, r[ringsize], r[ringsize-1])
∗117          }
∗118        }
119         return item
120       }
∗121       ringsize := getringsize(r)
∗122       if ((u.index > ringsize or ring != r) and
∗123           CAS(u.item[i], value, FREE)) {
∗124         break // Restart outer loop
∗125       }
126       ctr := ctr + 1
127     }
128   }
129 }

(a)

130 seek(int threadid, object value) {
131   object item
132   int i, ctr
133   node s, v
134   node[] r
135
136   i := seekitemindex(value)
137   while (true) {
138     r := ring // private copy of ring pointer
139     s := r[hash(threadid, getringsize(r))]
140     v := s.prev
141     ctr := 0
142     while (true) {
143       if (mode != i)
144         return wait(threadid, value)
145       if (ring != r)
146         break // Restart outer loop
147       if (ctr > T_f) {
148         CAS(mode, i, 3 - i) // toggle between 1 and 2
149         continue // Restart inner loop
150       }
151       item := s.item[i]
152       if (haswaiter(item, i)) {
153         if (CAS(s.item[i], item, value))
154           return item
155         ctr := ctr + 1
156         continue // Restart inner loop
157       }
158       item := v.item[i]
159       if (haswaiter(item, i)) {
160         if (CAS(v.item[i], item, value))
161           return item
162         ctr := ctr + 1
163       }
164       v := v.prev
165     }
166   }
167 }

(b)

Fig. 5.3: TripleAdp pseudo code for seeking and waiting threads. Lines marked by a ∗ handle ring size adaptivity.

Adapting the effective and total ring size (∗-marked lines): The waiter adjusts both the effective and the total ring size. Adjusting the effective ring size is performed as in AdaptiveAR. However, where AdaptiveAR would fail to increase the effective ring size, due to using a statically sized ring, TripleAdp instead adjusts the total ring size. Similarly, when the effective ring size becomes small compared to the total ring size, TripleAdp decreases the total ring size.

The decision when to resize the ring is made as in AdaptiveAR. If the waiter fails to capture a node after traversing a fraction of the ring (determined by an increase threshold parameter T_i), it tries to increase the effective ring size (Lines 77-87). If the effective ring size is already maximal, the waiter allocates a new ring of double the size and tries to install it with a CAS on ring (Lines 82-83). Otherwise, the effective ring size is increased by CASing the prev pointer of the ring head (ring[1]) (Lines 79-80). The effective ring size is decreased if the waiter captures a node quickly, but then spins for more than T_w (the wait threshold) iterations before a seeker arrives (Line 110). If the resulting effective size is a quarter of the total ring size, the waiter swings ring to point to a newly allocated ring of half the old size (Lines 112-114). Otherwise, only the effective size is decreased (Line 116).

Decreasing the effective ring size may leave a waiter's captured node outside of the current effective ring. Each waiter therefore checks if it has been left out and tries to move back into the ring (Lines 121-125).
Similarly, both seekers and waiters check whether a new ring has been allocated (Lines 145-146, 122, and 75-76). If so, they move themselves to the new ring by restarting the operation.

Memory reclamation: We assume the runtime environment supports garbage collection. Thus, once the pointer to an old ring is overwritten and all threads have moved to the new ring, the garbage collector reclaims the old ring's memory. Garbage collectors are part of the specification of modern environments such as C# and

Java, in which most prior synchronous queue algorithms were implemented [1, 2, 7, 13].

Seekers (Fig. 5.3b): A seeker traverses the ring, looking for a waiter to rendezvous with. Symmetrically to the waiter code, the seeker starts by determining its operation mode (Line 136). The seeker then begins traversing the ring from an initial node s obtained by hashing the thread's id (Lines 138-140). As in the waiter case, the traversal process follows that of AdaptiveAR: the seeker proceeds while peeking at s to see if a waiter has arrived there in the meantime (Lines 151-157). If s has no waiter, the next node in the traversal is checked. To determine whether a node is captured by a waiter, the seeker uses the helper procedure haswaiter() (shown in Fig. 5.1). Once a waiter is found, the seeker tries to match with it by atomically swapping the value stored in the node's item with its own value. If successful, it returns the retrieved value. During each iteration of the loop, the seeker validates that it is in the correct mode and switches roles if not (Lines 143-144).

Adapting to the access pattern: Seekers are responsible for adapting the operation mode to the access pattern. The idea is to assign the role of waiter to the majority thread type (producer or consumer) in an asymmetric access pattern. Ring size adaptivity then sizes the ring according to the number of waiters. This reduces the chance of seekers (which are outnumbered by the waiters) colliding with each other on the ring. Following this intuition, a seeker deduces that the operation mode should be changed when it starts to experience collisions with other seekers, in the form of CAS failures. A seeker maintains a counter of failed CAS attempts (Lines 155, 162). If this counter crosses a failure threshold T_f, the seeker toggles the global operation mode using a CAS on mode (Lines 147-149). (A Java sketch of this mode switch appears at the end of this chapter.)

5.2 Correctness

Theorem 3. The algorithm meets the asymmetric rendezvous semantics.

Proof: The point at which both a seek(t1, x) and the wait(t2, y) that returns x take effect is the point where the seeker successfully CASes x into some node's item pointer (Lines 153, 160). If the seeker is a put(x) operation, then the waiter is a get() and so y = NULL. Otherwise, the seeker is a get() and x = NULL. Thus, a put(x) always returns NULL (i.e., OK) and its partner get() returns x, and so the algorithm meets the semantics of asymmetric rendezvous.

Theorem 4. If threads do not request timeouts, the algorithm is nonblocking (as defined in Sect. 2).

Proof: The proof is by contradiction. Assume that there is an execution E in which both producers and consumers take infinitely many steps and yet no rendezvous successfully completes. We first prove Lemma 2 and Lemma 3 and then move on towards a contradiction.

Lemma 2. If threads do not request timeouts, the mirror image of AdaptiveAR is nonblocking.

Proof: The proof is very similar to that of Theorem 2. Assume towards a contradiction that there is a producer/consumer pair, p and c, each of which runs forever without completing. As in Theorem 2, the producer p must eventually capture a node. This implies that c does not find p on the ring, which can only occur if p was left out of the ring because some producer decreased the ring. But since, from that point on, the ring size cannot decrease, p eventually moves itself into c's range, where either c or another consumer rendezvouses with it, a contradiction.

Lemma 3. If E has a finite number of mode switches and ring reallocations, a rendezvous must complete.
Proof: Observe that when no mode switches and no ring reallocations occur, an execution of TripleAdp reduces to an execution of AdaptiveAR or of its mirror image. Since AdaptiveAR is nonblocking (Theorem 2) and its mirror image is nonblocking (Lemma 2), a rendezvous must complete: if there were only finitely many such events, the infinite execution after the last one would lead to a rendezvous.

It follows that E must contain infinitely many mode switches or ring reallocations. A mode switch occurs after a seeker's CAS fails (Line 148). This implies a rendezvous has completed, a contradiction. A ring

reallocation that reduces the total ring size (Lines 112-114) occurs after a waiter completes a rendezvous, which is impossible. Thus, the only ring reallocations possible are ones that increase the total ring size. To complete the proof, we show that there cannot be infinitely many such ring reallocations. Here we use the assumption that the concurrency is finite. Let T be the maximum number of threads that can be active in E; we show that the total ring size cannot grow beyond 2T. Let R be the first ring to be installed with total size size(R) ≥ T. Then R's effective size always remains at most T, because a ring size decrease implies that some operation completes. Consider the first waiter w that tries to reallocate R. Then w observes size(R) occupied nodes, i.e., it observes some other waiter w′ at two different nodes. w′ leaves one node and returns to capture another node due to one of the following reasons: a mode switch (impossible), a ring reallocation (impossible, since w is the first to reallocate the ring), or because w′ successfully rendezvouses, a contradiction.
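The access-pattern switch of Sect. 5.1 amounts to a single CAS on the global mode once a seeker has accumulated T_f CAS failures. In Java (a sketch; the class and method names are ours):

    import java.util.concurrent.atomic.AtomicInteger;

    class ModeSwitch {
        static final int T_f = 10;                       // failure threshold (Sect. 6.3)
        final AtomicInteger mode = new AtomicInteger(1); // 1 or 2

        // Called by a seeker in mode i after a failed CAS on a waiter's node.
        // Returns the (possibly toggled) mode the seeker should now follow.
        int onCasFailure(int i, int failures) {
            if (failures > T_f)
                mode.compareAndSet(i, 3 - i); // toggle 1 <-> 2; losers reread
            return mode.get();
        }
    }

A losing CAS on mode is harmless: it means another seeker toggled it first, and the caller simply rereads the current mode.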

6. PERFORMANCE EVALUATION

We first evaluate the performance of our AdaptiveAR algorithm in comparison with prior unfair synchronous queues (Sect. 6.2) and demonstrate the performance gains due to the adaptivity and peeking techniques. In Section 6.3 we evaluate our TripleAdp algorithm.

6.1 Experimental setup

Evaluation is carried out on both a Sun SPARC T5240 and on an Intel Core i7. The Sun machine has two UltraSPARC T2 Plus (Niagara II) chips. Each is a chip multithreading (CMT) processor, with eight 1.165 GHz in-order cores with 8 hardware strands per core, for a total of 64 hardware strands per chip. The Intel Core i7 920 (Nehalem) processor has four 2.67 GHz cores, each multiplexing 2 hardware threads. Both Java and C++ implementations are tested.¹ Due to space limitations, we focus more on the Sun platform, which offers more parallelism and better insight into the scaling behavior of the algorithms.

Unless stated otherwise, results are averages of ten 10-second runs of the Java implementation on an idle machine, resulting in very little variance. Adaptivity parameters used: T_i = 1, T_d = 2, and T_w = 64. A modulo hash function, hash(t) = t mod ringsize, is used, since we do not know anything about the thread ids and consider them to be uniformly random. If, however, thread ids are sequential, the modulo hash achieves perfect coverage of the ring with few collisions and helps in pairing producers/consumers that often match with each other. We evaluate these effects by testing our algorithms with both random and sequential thread ids throughout all experiments.

6.2 AdaptiveAR Performance Evaluation

Our AdaptiveAR algorithm (marked AdaptiveAR in the figures) is compared against the Java unfair synchronous queue (JDK), the ED tree, and the FC synchronous queues. The original authors' implementation of each algorithm is used.² We tested both JDK and JDK-no-park, a version that always uses busy waiting instead of yielding the CPU (so-called parking), and report its results for the workloads where it improves upon the standard Java pool (as was done in [7]).

6.2.1 Producer/consumer throughput

N : N producer/consumer symmetric workload: We measure the throughput at which data is transferred from N producers to N consumers. We focus first on single-chip results in Figs. 6.1a and 6.1e. The Nehalem results are qualitatively similar to the low-thread-count SPARC results. Other than our algorithm, the parallel FC pool is the only algorithm that shows meaningful scalability.

¹ Java benchmarks were run with the HotSpot Server JVM, build ea-b137. C++ benchmarks were compiled with Sun C++ on the SPARC machine and with gcc (-O3 optimization setting) on the Intel machine. In the C++ experiments we used the Hoard 3.8 [4] memory allocator.

² We removed all statistics counting from the code and use the latest JVM. Thus, the results we report are usually slightly better than those reported in the original papers. On the other hand, we fixed a bug in the benchmark of [7] that miscounted timed-out operations of the Java pool as successful operations; thus the results we report for it are sometimes lower.

Fig. 6.1: Rendezvousing between N pairs of producers and consumers. Performance counter plots are in logarithmic scale. L2 misses are not shown; all algorithms but the JDK had less than one L2 miss per operation on average.

Our rendezvous algorithm outperforms the parallel FC queue by 2x-3x in low concurrency settings, and by up to 6x at high thread counts. Hardware performance counter analysis (Figs. 6.1b-6.1d) shows that our rendezvous completes in less than 170 instructions, of which one is a CAS. In the parallel FC pool, waiting for the combiner to pick up a thread's request, match it, and report back with the result, all add up. While the parallel FC pool hardly performs CASes, it does require between 3x and 6x more instructions to complete an operation. Similarly, the ED tree's

rendezvous operations require about 1000 instructions to complete and incur more cache misses as concurrency increases. Our algorithm therefore outperforms it by at least 6x and up to 10x at high thread counts. The Java synchronous queue fails to scale in this benchmark. Figures 6.1b and 6.1c show its serializing behavior: due to the number of failed CAS operations on the top of the stack (and consequent retries), it requires more instructions to complete an operation as concurrency grows. Consequently, our algorithm outperforms it, at high thread counts, by up to 60x.

6.2.2 Adaptivity and peeking impact

From Fig. 6.1c we deduce that ring resizing operations are a rare event, leading to an average of one CAS per operation. Thus resizing does not adversely impact performance here. Throughput of the algorithm with and without peeking is compared in Fig. 6.3a. While peeking has little effect at low thread counts, it improves performance by as much as 47% once concurrency increases. This is because, due to adaptivity, a producer's initial node being empty usually means that a consumer will arrive there shortly, and peeking increases the chance of pairing with that consumer. Without it, a producer has a higher chance of colliding with another producer and consequently spends more instructions (and CASes) to complete a rendezvous.

6.2.3 NUMA Architecture

When utilizing both processors of the Sun machine, the operating system's default scheduling policy is to place threads round-robin on both chips. Thus, the cross-chip overhead is noticeable even at low thread counts, as Fig. 6.1f shows. Since the threads no longer share a single L2 cache, they experience an increased number of L1 and L2 cache misses; each such miss is expensive, requiring coherency protocol traffic to the remote chip. The effect is catastrophic for serializing algorithms; for example, the Java pool experiences a 10x drop in throughput. The more concurrent algorithms, such as parallel FC and ours, show scaling trends similar to the single-chip ones, but achieve lower throughput. In the rest of the section we therefore focus on the more interesting single-chip case.

Fig. 6.2: Single producer and N consumers rendezvousing. Left: Default OS scheduling. Right: OS constrained to not co-locate the producer on the same core as consumers.

6.2.4 Asymmetric workloads

The throughput of one producer rendezvousing with a varying number of consumers is presented in Figs. 6.2a and 6.3c. Nehalem results (Fig. 6.3c) are again similar and not discussed in detail. Since the throughput is bounded by the rate of a single producer, little scaling is expected and observed. However, for all algorithms (but the ED tree) it takes several consumers to keep up with a single producer. This is because the producer hands off an item and completes, whereas the consumer needs to notice that the rendezvous has occurred. And

while the single consumer is thus busy, the producer cannot make progress on a new rendezvous. However, when more consumers are active, the producer can immediately return and find another (different) consumer ready to rendezvous with. Unfortunately, as shown in Fig. 6.2a, most algorithms do not sustain this peak throughput. The FC pools have the worst degradation (3.74x for the single version, and 3x for the parallel version). The Java pool's degradation is minimal (13%), and it, along with the ED tree, achieves close to peak throughput even at high thread counts. Yet this throughput is low: our algorithm outperforms the Java pool by up to 3x, and the ED tree by 12x for low consumer counts and 6x for high consumer counts, despite degrading by 23%-28% from peak throughput.

Why our algorithm degrades is not obvious. The producer has its pick of consumers in the ring and should be able to complete a hand-off immediately. The reason for this degradation is not algorithmic, but due to contention on chip resources. A Niagara II core has two pipelines, each shared by four hardware strands. Thus, beyond 16 threads some threads must share a pipeline, and our algorithm indeed starts to degrade at 16 threads. To verify that this is the problem, we present in Fig. 6.2b runs where the producer runs on the first core and consumers are scheduled only on the remaining seven cores. While the trends of the other algorithms are unaffected, our algorithm now maintains peak throughput through all consumer counts.

Fig. 6.3: Top left: Impact of peeking on N : N rendezvous. Top right: Rendezvousing between N producers and a single consumer. Bottom: Intel 1 : N and N : 1 producer/consumer workload throughput.

Results from the opposite workload (which is less interesting in real-life scenarios), where multiple producers try to serve a single consumer, are given in Figs. 6.3b and 6.3d. Here the producers contend over the single consumer's node and, as a result, the throughput of our algorithm degrades as the number of producers increases (as do the FC pools). Despite this degradation, our algorithm outperforms the Java pool up to 48 threads (falling behind by 15% at 64 threads) and outperforms the FC pools by about 2x.

6.2.5 Bursts

To evaluate the effectiveness of our adaptivity technique, we measure the rendezvous rate in a workload that experiences bursts of activity (on the Sun machine). For ten seconds the workload alternates every second between 31 thread pairs and 8 pairs. The 63rd hardware strand is used to take ring size measurements: our sampling thread continuously reads the ring size, and records the time whenever the current read differs from the previous read. Figure 6.4a depicts the result, showing how the algorithm continuously resizes its ring. Consequently, it successfully benefits from the available parallelism in this workload, outperforming the Java pool by at least 40x and the parallel FC pool by at least 4x.

Fig. 6.4: Top: Synchronous queue bursty workload. Left: Our algorithm's ring size over time (sampled continuously using a thread that does not participate in the rendezvousing). Right: throughput. Bottom: N : N rendezvousing with decreasing arrival rate due to an increasing amount of work between operations.

6.2.6 Varying arrival rate

In practice, threads do some work between rendezvouses. We measure how the throughput of the producer/consumer pairs workload is affected when the thread arrival rate decreases due to increasingly larger amounts of time spent doing work before each rendezvous. Figures 6.4c and 6.4d show that as the work period grows, the throughput of all algorithms that exhibit scaling deteriorates, due to the reduced parallelism in the workload. E.g., on the SPARC the parallel FC pool degrades by 2x when going from no work to 1.5µs of work, and our algorithm degrades by 3.4x (because it starts much higher). Still, on the SPARC there is sufficient parallelism to allow our algorithm to outperform the other implementations by a factor of at least three. (On the Nehalem, by 31%.)

6.2.7 Work uniformity

One clear advantage of our algorithm over FC is work uniformity, since the combiner in FC is expected to spend much more time doing work for other threads. We show this by comparing the percentage of the total operations performed by each thread in a multiple-producer/multiple-consumer workload. In addition to measuring uniformity, this test is a yardstick for progress in practice: if a thread starves, we will see it as performing very little work compared to the other threads. We pick the best result from five executions of 16 producer/consumer pairs, and plot the percentage of the total operations performed by each thread. Figure 6.5 shows the results. In an ideal setting, each thread would perform 3.125% (1/32) of the work. Our algorithm comes relatively close to this distribution, as does the JDK algorithm. In contrast, the parallel FC pool experiences a much larger deviation and best/worst ratio.

Fig. 6.5: Percent of total operations performed by each of 32 threads in an N : N test.

6.3 TripleAdp Performance Evaluation

In this section we evaluate the performance of the TripleAdp algorithm compared to prior unfair synchronous queues. We compare TripleAdp to AdaptiveAR, the FC synchronous queues, the ED tree, and the Java synchronous queue (JDK). As in Sect. 6.2, we tested both JDK and JDK-no-park, a version that uses busy waiting instead of parking (yielding the CPU), and report the version that performs best in each workload. To improve legibility of the graphs, we plot the JDK only in the experiments where it outperformed the other prior algorithms, and we omit the ED tree results entirely, since it is never the best performer. Note that Sect. 6.2 extensively evaluated the JDK and the ED tree against AdaptiveAR and the FC synchronous queues [7] on the same hardware platforms.

Parameters used: The adaptivity thresholds we use for both AdaptiveAR and TripleAdp are T_i = 1, T_d = 2, T_w = 64 (the same as in Sect. 6.2). In addition, we set T_f = 10 for TripleAdp. Both algorithms use the same modulo hash function, hash(t) = t mod the effective ring size. The rationale for this choice is that if we do not know anything about the thread ids, we consider them uniformly random, and the modulo provides a uniform distribution. If, however, thread ids are sequential, the modulo hash helps in pairing producers/consumers that often match with each other.
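For reference, the modulo hash is one line; the worked example below shows the perfect coverage claimed for sequential ids (a Java sketch; the 1-based adjustment is ours, matching the ring's 1-based indexing).

    class RingHash {
        // hash(t) = t mod ringsize, mapped to the 1-based node indices.
        static int hash(int threadId, int ringSize) {
            return Math.floorMod(threadId, ringSize) + 1;
        }

        public static void main(String[] args) {
            // Sequential ids 0..3 on an effective ring of size 4:
            for (int t = 0; t < 4; t++)
                System.out.println("thread " + t + " -> node " + hash(t, 4));
            // prints nodes 1, 2, 3, 4: every node covered, no collisions
        }
    }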


More information

Lock Oscillation: Boosting the Performance of Concurrent Data Structures

Lock Oscillation: Boosting the Performance of Concurrent Data Structures Lock Oscillation: Boosting the Performance of Concurrent Data Structures Panagiota Fatourou FORTH ICS & University of Crete Nikolaos D. Kallimanis FORTH ICS The Multicore Era The dominance of Multicore

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Revisiting the Combining Synchronization Technique

Revisiting the Combining Synchronization Technique Revisiting the Combining Synchronization Technique Panagiota Fatourou Department of Computer Science University of Crete & FORTH ICS faturu@csd.uoc.gr Nikolaos D. Kallimanis Department of Computer Science

More information

The Relative Power of Synchronization Methods

The Relative Power of Synchronization Methods Chapter 5 The Relative Power of Synchronization Methods So far, we have been addressing questions of the form: Given objects X and Y, is there a wait-free implementation of X from one or more instances

More information

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei Non-blocking Array-based Algorithms for Stacks and Queues Niloufar Shafiei Outline Introduction Concurrent stacks and queues Contributions New algorithms New algorithms using bounded counter values Correctness

More information

Advance Operating Systems (CS202) Locks Discussion

Advance Operating Systems (CS202) Locks Discussion Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static

More information

Concurrent Objects and Linearizability

Concurrent Objects and Linearizability Chapter 3 Concurrent Objects and Linearizability 3.1 Specifying Objects An object in languages such as Java and C++ is a container for data. Each object provides a set of methods that are the only way

More information

Lecture 10: Multi-Object Synchronization

Lecture 10: Multi-Object Synchronization CS 422/522 Design & Implementation of Operating Systems Lecture 10: Multi-Object Synchronization Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous

More information

We assume uniform hashing (UH):

We assume uniform hashing (UH): We assume uniform hashing (UH): the probe sequence of each key is equally likely to be any of the! permutations of 0,1,, 1 UH generalizes the notion of SUH that produces not just a single number, but a

More information

Advanced Topic: Efficient Synchronization

Advanced Topic: Efficient Synchronization Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL 251 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

CS 241 Honors Concurrent Data Structures

CS 241 Honors Concurrent Data Structures CS 241 Honors Concurrent Data Structures Bhuvan Venkatesh University of Illinois Urbana Champaign March 27, 2018 CS 241 Course Staff (UIUC) Lock Free Data Structures March 27, 2018 1 / 43 What to go over

More information

CS 167 Final Exam Solutions

CS 167 Final Exam Solutions CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are

More information

Chap. 5 Part 2. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1

Chap. 5 Part 2. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1 Chap. 5 Part 2 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Static work allocation Where work distribution is predetermined, but based on what? Typical scheme Divide n size data into P

More information

1) If a location is initialized to 0, what will the first invocation of TestAndSet on that location return?

1) If a location is initialized to 0, what will the first invocation of TestAndSet on that location return? Synchronization Part 1: Synchronization - Locks Dekker s Algorithm and the Bakery Algorithm provide software-only synchronization. Thanks to advancements in hardware, synchronization approaches have been

More information

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University CS 571 Operating Systems Midterm Review Angelos Stavrou, George Mason University Class Midterm: Grading 2 Grading Midterm: 25% Theory Part 60% (1h 30m) Programming Part 40% (1h) Theory Part (Closed Books):

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

Notes on the Exam. Question 1. Today. Comp 104:Operating Systems Concepts 11/05/2015. Revision Lectures (separate questions and answers)

Notes on the Exam. Question 1. Today. Comp 104:Operating Systems Concepts 11/05/2015. Revision Lectures (separate questions and answers) Comp 104:Operating Systems Concepts Revision Lectures (separate questions and answers) Today Here are a sample of questions that could appear in the exam Please LET ME KNOW if there are particular subjects

More information

Distributed Computing Group

Distributed Computing Group Distributed Computing Group HS 2009 Prof. Dr. Roger Wattenhofer, Thomas Locher, Remo Meier, Benjamin Sigg Assigned: December 11, 2009 Discussion: none Distributed Systems Theory exercise 6 1 ALock2 Have

More information

CS370 Operating Systems Midterm Review

CS370 Operating Systems Midterm Review CS370 Operating Systems Midterm Review Yashwant K Malaiya Fall 2015 Slides based on Text by Silberschatz, Galvin, Gagne 1 1 What is an Operating System? An OS is a program that acts an intermediary between

More information

Comp 204: Computer Systems and Their Implementation. Lecture 25a: Revision Lectures (separate questions and answers)

Comp 204: Computer Systems and Their Implementation. Lecture 25a: Revision Lectures (separate questions and answers) Comp 204: Computer Systems and Their Implementation Lecture 25a: Revision Lectures (separate questions and answers) 1 Today Here are a sample of questions that could appear in the exam Please LET ME KNOW

More information

Linked Lists: The Role of Locking. Erez Petrank Technion

Linked Lists: The Role of Locking. Erez Petrank Technion Linked Lists: The Role of Locking Erez Petrank Technion Why Data Structures? Concurrent Data Structures are building blocks Used as libraries Construction principles apply broadly This Lecture Designing

More information

Algorithm 23 works. Instead of a spanning tree, one can use routing.

Algorithm 23 works. Instead of a spanning tree, one can use routing. Chapter 5 Shared Objects 5.1 Introduction Assume that there is a common resource (e.g. a common variable or data structure), which different nodes in a network need to access from time to time. If the

More information

A New Theory of Deadlock-Free Adaptive. Routing in Wormhole Networks. Jose Duato. Abstract

A New Theory of Deadlock-Free Adaptive. Routing in Wormhole Networks. Jose Duato. Abstract A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks Jose Duato Abstract Second generation multicomputers use wormhole routing, allowing a very low channel set-up time and drastically reducing

More information

15 418/618 Project Final Report Concurrent Lock free BST

15 418/618 Project Final Report Concurrent Lock free BST 15 418/618 Project Final Report Concurrent Lock free BST Names: Swapnil Pimpale, Romit Kudtarkar AndrewID: spimpale, rkudtark 1.0 SUMMARY We implemented two concurrent binary search trees (BSTs): a fine

More information

Reagent Based Lock Free Concurrent Link List Spring 2012 Ancsa Hannak and Mitesh Jain April 28, 2012

Reagent Based Lock Free Concurrent Link List Spring 2012 Ancsa Hannak and Mitesh Jain April 28, 2012 Reagent Based Lock Free Concurrent Link List Spring 0 Ancsa Hannak and Mitesh Jain April 8, 0 Introduction The most commonly used implementation of synchronization is blocking algorithms. They utilize

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Process- Concept &Process Scheduling OPERATING SYSTEMS

Process- Concept &Process Scheduling OPERATING SYSTEMS OPERATING SYSTEMS Prescribed Text Book Operating System Principles, Seventh Edition By Abraham Silberschatz, Peter Baer Galvin and Greg Gagne PROCESS MANAGEMENT Current day computer systems allow multiple

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

TABLES AND HASHING. Chapter 13

TABLES AND HASHING. Chapter 13 Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/ TABLES AND HASHING Chapter 13

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Lightweight Contention Management for Efficient Compare-and-Swap Operations

Lightweight Contention Management for Efficient Compare-and-Swap Operations Lightweight Contention Management for Efficient Compare-and-Swap Operations Dave Dice 1, Danny Hendler 2 and Ilya Mirsky 3 1 Sun Labs at Oracle 2 Ben-Gurion University of the Negev and Telekom Innovation

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

LOCKLESS ALGORITHMS. Lockless Algorithms. CAS based algorithms stack order linked list

LOCKLESS ALGORITHMS. Lockless Algorithms. CAS based algorithms stack order linked list Lockless Algorithms CAS based algorithms stack order linked list CS4021/4521 2017 jones@scss.tcd.ie School of Computer Science and Statistics, Trinity College Dublin 2-Jan-18 1 Obstruction, Lock and Wait

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

Java Virtual Machine

Java Virtual Machine Evaluation of Java Thread Performance on Two Dierent Multithreaded Kernels Yan Gu B. S. Lee Wentong Cai School of Applied Science Nanyang Technological University Singapore 639798 guyan@cais.ntu.edu.sg,

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Exploiting the Behavior of Generational Garbage Collector

Exploiting the Behavior of Generational Garbage Collector Exploiting the Behavior of Generational Garbage Collector I. Introduction Zhe Xu, Jia Zhao Garbage collection is a form of automatic memory management. The garbage collector, attempts to reclaim garbage,

More information

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control

More information

Most of this PDF is editable. You can either type your answers in the red boxes/lines, or you can write them neatly by hand.

Most of this PDF is editable. You can either type your answers in the red boxes/lines, or you can write them neatly by hand. 15-122 : Principles of Imperative Computation, Spring 2016 Written Homework 7 Due: Monday 7 th March, 2016 Name:q1 Andrew ID: q2 Section:q3 This written homework covers amortized analysis and hash tables.

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Unit 6: Indeterminate Computation

Unit 6: Indeterminate Computation Unit 6: Indeterminate Computation Martha A. Kim October 6, 2013 Introduction Until now, we have considered parallelizations of sequential programs. The parallelizations were deemed safe if the parallel

More information

Lock Oscillation: Boosting the Performance of Concurrent Data Structures

Lock Oscillation: Boosting the Performance of Concurrent Data Structures Lock Oscillation: Boosting the Performance of Concurrent Data Structures Panagiota Fatourou 1 and Nikolaos D. Kallimanis 2 1 Institute of Computer Science (ICS), Foundation of Research and Technology-Hellas

More information

The element the node represents End-of-Path marker e The sons T

The element the node represents End-of-Path marker e The sons T A new Method to index and query Sets Jorg Homann Jana Koehler Institute for Computer Science Albert Ludwigs University homannjkoehler@informatik.uni-freiburg.de July 1998 TECHNICAL REPORT No. 108 Abstract

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3.

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3. Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs 3. LOCK FREE KERNEL 309 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 3 Nov 2017 Lecture 1/3 Introduction Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading without locking Flat

More information

Cuckoo Hashing for Undergraduates

Cuckoo Hashing for Undergraduates Cuckoo Hashing for Undergraduates Rasmus Pagh IT University of Copenhagen March 27, 2006 Abstract This lecture note presents and analyses two simple hashing algorithms: Hashing with Chaining, and Cuckoo

More information

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week 05 Lecture 18 CPU Scheduling Hello. In this lecture, we

More information

Proving liveness. Alexey Gotsman IMDEA Software Institute

Proving liveness. Alexey Gotsman IMDEA Software Institute Proving liveness Alexey Gotsman IMDEA Software Institute Safety properties Ensure bad things don t happen: - the program will not commit a memory safety fault - it will not release a lock it does not hold

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Project 0: Implementing a Hash Table

Project 0: Implementing a Hash Table CS: DATA SYSTEMS Project : Implementing a Hash Table CS, Data Systems, Fall Goal and Motivation. The goal of Project is to help you develop (or refresh) basic skills at designing and implementing data

More information

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Operating Systems Lecture 4 - Concurrency and Synchronization Adrien Krähenbühl Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Mutual exclusion Hardware solutions Semaphores IPC: Message passing

More information

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Michiel Smid October 14, 2003 1 Introduction In these notes, we introduce a powerful technique for solving geometric problems.

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

The Art and Science of Memory Allocation

The Art and Science of Memory Allocation Logical Diagram The Art and Science of Memory Allocation Don Porter CSE 506 Binary Formats RCU Memory Management Memory Allocators CPU Scheduler User System Calls Kernel Today s Lecture File System Networking

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

Problem Set 2. CS347: Operating Systems

Problem Set 2. CS347: Operating Systems CS347: Operating Systems Problem Set 2 1. Consider a clinic with one doctor and a very large waiting room (of infinite capacity). Any patient entering the clinic will wait in the waiting room until the

More information

Enhancing Linux Scheduler Scalability

Enhancing Linux Scheduler Scalability Enhancing Linux Scheduler Scalability Mike Kravetz IBM Linux Technology Center Hubertus Franke, Shailabh Nagar, Rajan Ravindran IBM Thomas J. Watson Research Center {mkravetz,frankeh,nagar,rajancr}@us.ibm.com

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

Page 1. Challenges" Concurrency" CS162 Operating Systems and Systems Programming Lecture 4. Synchronization, Atomic operations, Locks"

Page 1. Challenges Concurrency CS162 Operating Systems and Systems Programming Lecture 4. Synchronization, Atomic operations, Locks CS162 Operating Systems and Systems Programming Lecture 4 Synchronization, Atomic operations, Locks" January 30, 2012 Anthony D Joseph and Ion Stoica http://insteecsberkeleyedu/~cs162 Space Shuttle Example"

More information

Remaining Contemplation Questions

Remaining Contemplation Questions Process Synchronisation Remaining Contemplation Questions 1. The first known correct software solution to the critical-section problem for two processes was developed by Dekker. The two processes, P0 and

More information

6.852: Distributed Algorithms Fall, Class 21

6.852: Distributed Algorithms Fall, Class 21 6.852: Distributed Algorithms Fall, 2009 Class 21 Today s plan Wait-free synchronization. The wait-free consensus hierarchy Universality of consensus Reading: [Herlihy, Wait-free synchronization] (Another

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Lock cohorting: A general technique for designing NUMA locks

Lock cohorting: A general technique for designing NUMA locks Lock cohorting: A general technique for designing NUMA locks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada CS102 Hash Tables Prof. Tejada 1 Vectors, Linked Lists, Stack, Queues, Deques Can t provide fast insertion/removal and fast lookup at the same time The Limitations of Data Structure Binary Search Trees,

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Efficient Adaptations of the Non-Blocking Buffer for Event Message Communication

Efficient Adaptations of the Non-Blocking Buffer for Event Message Communication Proc. ISORC2007 (10th IEEE CS Int' Symp. on Object & Component Oriented Real- Time Distributed Computing), Santorini, Greece, May 7-9, 2007, pp.29-40. Efficient Adaptations of the Non-Blocking Buffer for

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

The Wait-Free Hierarchy

The Wait-Free Hierarchy Jennifer L. Welch References 1 M. Herlihy, Wait-Free Synchronization, ACM TOPLAS, 13(1):124-149 (1991) M. Fischer, N. Lynch, and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process,

More information

6 Distributed data management I Hashing

6 Distributed data management I Hashing 6 Distributed data management I Hashing There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication

More information