Combining Techniques Application for Tree Search Structures


RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES
BLAVATNIK SCHOOL OF COMPUTER SCIENCE

Combining Techniques Application for Tree Search Structures

Thesis submitted in partial fulfillment of the requirements for the M.Sc. degree in the School of Computer Science, Tel-Aviv University

by Vladimir Budovsky

The research work for this thesis has been carried out at Tel-Aviv University under the supervision of Prof. Yehuda Afek and Prof. Nir Shavit

June 2010

CONTENTS

1. Introduction
   1.1 Flat Combining
   1.2 Skip Lists
2. The Flat Combined Skip Lists
   2.1 Naive Flat Combined Skip List
   2.2 Flat Combined Skip List with Multiple Combiners
   2.3 Flat Combined Skip List with Hints
3. Performance
   3.1 Performance Comparison of Flat Combined Skip Lists vs JDK ConcurrentSkipListSet
   3.2 Flat Combining Mechanism Experimental Verifications
4. Conclusions

LIST OF FIGURES

1.1 Skip list of height 4. May be considered either as a collection of fat nodes or as a 2-d list
1.2 Skip list traversal with key 12. Traversed predecessors are shown; start_level is 3
2.1 Multi-combiner skip list. Every node with height >= 3 is a combiner node
3.1 Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.2 Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.3 Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.4 Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.5 FC skip list implementation vs multi-lock one, naive implementations, uniform keys distribution
3.6 FC skip list implementation vs multi-lock one, naive implementations, high access locality
3.7 FC skip list implementation vs multi-lock one, hints implementations, uniform keys distribution
3.8 FC skip list implementation vs multi-lock one, hints implementations, high access locality
3.9 Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.10 Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.11 Hints mechanism success rate for pure update workloads
3.12 The connection between FC intensity and throughput per thread for pure update workloads
3.13 Lock-free skip list CAS per update, CAS success rate and throughput per thread for pure update workloads

LISTINGS

2.1 Set of Integers Interface
2.2 Flat combining definitions
2.3 Node definition
2.4 Wait-free contains is the same for all skip lists
2.5 add, Naive implementation
2.6 scanandcombine, common implementation
2.7 Physical add and remove, Naive implementation
2.8 Multi-combiner remove implementation
2.9 Optimistic (hinted) FCRequest and add implementation
2.10 Optimistic (hinted) doadd and verify implementation
3.1 Optimistic (hinted) multi-lock add method implementation

ACKNOWLEDGEMENTS

I would like to thank all those who made this thesis possible. I am extremely grateful to my advisors, Prof. Yehuda Afek and Prof. Nir Shavit, who introduced me to the world of multiprocessors and distributed algorithms and whose supervision and support enabled me to advance my understanding of the subject. My sincere thanks to Ms. Moran Tzafrir for teaching me what an everyday researcher's work is about and for supplying me with an arsenal of essential tools for my work. Finally, I am grateful to my family, and especially to my sister Elena, for their patience and encouragement.

ABSTRACT

Flat combining (FC) is a new synchronization paradigm that dramatically reduces synchronization costs. As was recently shown, this technique brings significant performance gains to several popular parallel data structures, such as stacks, queues and shared counters. Moreover, applying the combining paradigm keeps the code as simple as code synchronized via a single global lock. However, the question of its applicability to other classes of parallel data structures has not yet been answered. This work deals with the application of the FC paradigm to binary tree-like data structures. As shown below, combining is hardly suitable for these cases. The limits of FC use are studied, and a criterion for its applicability is justified.

1. INTRODUCTION

Multi- and many-core computers are becoming more and more common these days. We are witnessing the development of chips with tens of cores that consume no more space and energy than a desktop processor. In light of this trend, the development of scalable and correct data structures becomes extremely important. The simplest and most straightforward solution is to derive a concurrent data structure from a sequential one, using a global lock as the synchronization primitive. Unfortunately, this solution does not scale even for a relatively small number of cores. Another approach is to design fine-grained synchronization schemes using multiple locks or non-blocking read-modify-write atomic operations. This method usually requires a full redesign and reimplementation of the algorithm. An additional drawback of fine-grained and, especially, lock-free synchronization is its high complexity. It is very difficult to formally prove the correctness of such data structures (see, for example, the proofs in [3] and [4]).

1.1 Flat Combining

The flat combining [7] programming paradigm achieves a high level of concurrency while preserving code simplicity. The main idea behind flat combining is to attach a public action registry to an existing sequential data structure. Each thread, before accessing the shared data, publishes its request in the registry and then tries to acquire the global lock. The winning thread becomes the combiner, scans the registry and performs all the requests it finds. The other threads simply wait for their requests to be fulfilled, spinning on a thread-local Done flag. There are several benefits to this strategy:

The synchronization cost is low compared to a global lock, since there is only one round of competition for the shared lock, and every thread, whether it wins or loses, returns with its request performed.

The combiner can use its knowledge of all the requests and fulfill some of them without accessing the data structure at all. For a stack, for example, the combiner may collect push/pop pairs and return the results to the appropriate callers; this well-known technique is called elimination. For a shared counter, the combiner can compute the total change and update the data structure only once; this technique, called combining, is also widely used.

The variants of the FC algorithm are described in detail in Chapter 2 (The Flat Combined Skip Lists); the sketch below illustrates the generic publish/lock/combine cycle.
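The following minimal Java sketch is only a hedged illustration of that cycle, not the thesis code: the names (FlatCombiner, Request, execute) and the use of a Supplier to represent a published operation are illustrative assumptions.

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Illustrative sketch of the publish/lock/combine cycle described above.
class FlatCombiner {
    static class Request {
        volatile Supplier<Object> op;   // published action, null once fulfilled
        volatile Object result;         // written by the combiner
    }

    private final Request[] registry;   // one slot per thread
    private final AtomicBoolean lock = new AtomicBoolean(false);

    FlatCombiner(int maxThreads) {
        registry = new Request[maxThreads];
        for (int i = 0; i < maxThreads; i++) registry[i] = new Request();
    }

    Object execute(int threadId, Supplier<Object> op) {
        Request r = registry[threadId];
        r.op = op;                                          // 1. publish the request
        while (true) {
            if (!lock.get() && lock.compareAndSet(false, true)) {
                try {
                    combine();                              // 2. the winner serves everyone
                } finally {
                    lock.set(false);
                }
                return r.result;
            }
            while (lock.get()) {                            // 3. losers spin
                if (r.op == null) return r.result;          //    somebody did my work
                Thread.yield();
            }
        }
    }

    private void combine() {
        for (Request r : registry) {
            Supplier<Object> pending = r.op;
            if (pending != null) {
                r.result = pending.get();                   // run the action sequentially
                r.op = null;                                // volatile write releases the waiter
            }
        }
    }
}

In the skip list variants presented below, the same cycle appears specialized with an opcode field and a per-node FCData registry.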

Flat combining has proven very efficient for data structures with hot spots, such as a stack head, queue ends, or a priority queue head, and it also shows good results when synchronization costs are high. For example, lock-free synchronous queues [16] demonstrate good throughput but moderate scalability, which can be improved using elimination or FC techniques. However, the question of FC's usefulness for data structures without pronounced bottlenecks and without high synchronization costs remains open. This work studies the applicability of flat combining to binary tree-like data structures: structures with O(log n) access time that allow range operations.

1.2 Skip Lists

Tree search structures are probably the most popular and widespread data structures; it is hard to find a computer science or software engineering area that does not use them. Their practical applications start with the ubiquitous red-black tree [6], used in nearly every algorithms library, including the C++ STL [17] and the Java SDK [18], and the AVL tree [1], which is very popular for search-dominated workloads; continue with the various B-trees [2], which are useful for block-organized memories; and finish with specialized suffix tries, splay trees, spatial search trees, persistent trees, etc. Since all of the above algorithms deal with large amounts of data, and many of them run inside operating systems or serve as search indexes inside databases, distributed and multi-threaded solutions for search trees are the focus of many research and commercial projects. A comprehensive survey of concurrent binary search trees is given in [13]. The common problem with all the search trees mentioned above is that they are either static (they do not allow add/remove without a full rebuild) or need a re-balancing mechanism after updates in order to preserve logarithmic access time. In most cases, the re-balancing scope is unknown prior to the update, and that makes the design of fine-grained synchronization for binary search trees a very complicated task. That is why skip lists were chosen as the basic data structure for this research. There were several reasons for the decision: the skip list is simple and has no re-balancing overheads, which simplifies measurements, and the skip list is the only known concurrent lock-free binary search structure.

The skip list was invented [15] in 1990 as a probabilistic alternative to binary search trees. A skip list is a linked list of fat nodes (Figure 1.1), where each node has a randomly chosen height (number of levels). Every node has a unique key, and the nodes appear in key order in the list. At every level it occupies, a node is connected to its successor at the same level.

Fig. 1.1: Skip list of height 4. May be considered either as a collection of fat nodes or as a 2-d list.

The random height is chosen using a geometric distribution with parameter p > 1: every node has level 0, and a node that has level i also has level i+1 with probability 1/p. In practice, p is usually chosen between 2 and 4. This distribution gives an expected maximal node height of O(log N), and between every two consecutive nodes of height k, p-1 nodes of height k-1 are expected to appear. It is useful to add two immutable nodes, head and tail, with the highest possible level, and to maintain the actual highest level (start_level) on every add or remove. Alternatively, the skip list may be represented as a collection of sorted lists with unique keys L_1, L_2, ..., L_k, such that i > j implies L_i is contained in L_j, where all entries with equal keys form vertical lists. The latter representation is especially convenient for lock-free implementations, where all updates are implemented through atomic read-and-update operations.

Denote the successor of node n at level l by next_l(n), and the key of n by key(n). The simple sequential list works in the following way:

Initially, the empty list contains head and tail with keys -∞ and +∞, respectively. The head node is connected to tail at every possible level, and the actual start_level is 0.

List traversal with key k starts from node n = head at level l = start_level, and proceeds at this level searching for the pair of nodes (pred, succ) such that next_l(pred) == succ and key(pred) < k ≤ key(succ). Then l = l - 1 and n = pred are set, and the search is repeated. The process continues until level 0 is reached. Figure 1.2 illustrates the pred nodes observed during a traversal with key 12.

Fig. 1.2: Skip list traversal with key 12. Traversed predecessors are shown; start_level is 3.

contains(k) simply calls the traversal with key k. It is unnecessary to proceed to the bottom level: once the desired key is found, the traversal is interrupted and the found node is returned.

add(k) starts by generating a random height h, as described above. After that, the traversal algorithm is performed, collecting the h bottom-most pred and succ nodes. If the node is not found (for the pure set implementation), a new node of height h with key k is linked to the collected nodes.

remove(k) starts with the traversal. Once the node succ with key(succ) == k is observed on its highest level h, all traversed pred nodes are collected. After reaching the bottom level, every collected next_i(pred) reference is set to next_i(succ), and the memory of succ is freed.

After every update operation, start_level is verified and updated if needed. There are two cases: when adding a node with height h > start_level, start_level is set to h; when removing a node of height start_level, the highest level h such that next_h(head) != tail is found and start_level is set to h.

Note that the traversal algorithm performs O(1) expected steps at each level, and that the number of levels is expected to be logarithmic in the number of nodes; therefore, the skip list has expected logarithmic access time.

The above scheme, up to some small variations, is used in most lock-based concurrent skip lists, and our implementations use it as well. The differences between the implementations ([14], [8], [11]) concern the locking schemes and the state flags devised to preserve consistency, linearizability [10] and the skip list invariants. Lock-free skip lists, in contrast, cannot maintain the skip list invariants: doing so would require multi-location read-and-update atomic operations, which are unsupported on most existing platforms. The lock-free implementations ([5], [9]) use relaxed skip list algorithms, where the question of node existence is answered only at the bottom list level, the other levels are regarded as a sort of index allowing the bottom level to be reached in expected logarithmic time, and the skip list structure may be violated at particular moments of the execution.
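To make the geometric height distribution described above concrete, the following small Java sketch draws a node height in that way. It is only an illustration with p = 2 and an assumed height cap; the thesis implementations use the fast generator from [12] instead.

import java.util.concurrent.ThreadLocalRandom;

final class SkipListLevels {
    static final int MAX_HEIGHT = 32;   // assumed cap on node height
    static final double P = 2.0;        // each extra level is added with probability 1/p

    // Returns a height in [1, MAX_HEIGHT]; height h occurs with probability
    // roughly (1 - 1/p) * (1/p)^(h - 1), so the expected maximal height of
    // an N-node list is O(log N).
    static int randomHeight() {
        int height = 1;                 // every node has level 0
        while (height < MAX_HEIGHT
                && ThreadLocalRandom.current().nextDouble() < 1.0 / P) {
            height++;                   // grow one more level with probability 1/p
        }
        return height;
    }
}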

2. THE FLAT COMBINED SKIP LISTS

All our FC skip list variants are implemented both in Java and in C++ with minimal differences. The C++ implementations require memory management and explicit memory barriers, while in the Java implementations the memory barriers are introduced implicitly through volatile store/load operations. We have chosen to present only the Java implementations in order to avoid memory management issues and to have a clear and standard competitor: all performance comparisons use the Java SDK lock-free ConcurrentSkipListSet [18]. The flat combined skip list implements the simplest integer set interface:

Listing 2.1: Set of Integers Interface

public interface SimpleIntSet {
    /**
     * Add an item to the map.
     * @param key key to add
     * @return true if added, false if the key already exists in the map
     */
    boolean add(int key);

    /**
     * Remove an item from the map.
     * @param key key to remove
     * @return true if removed, false if the key does not exist in the map
     */
    boolean remove(int key);

    /**
     * Verify whether an item is in the map.
     * @param key key to look up
     * @return true if the item exists, false otherwise
     */
    boolean contains(int key);
}

The add and remove methods use the flat combining paradigm, while the contains method is implemented wait-free. The coexistence of flat combining and wait-free methods requires special treatment of the linearization points, since the flat combining data is invisible to the lock-free contains. Define FCData and FCRequest:

Listing 2.2: Flat combining definitions

class FCRequest {
    int key;                        // Key
    boolean response;               // Operation result
    volatile int opcode = NONE;     // Action
}

class FCData {
    public FCRequest requests[];    // Submitted requests
    public AtomicInteger lock;      // FC node lock
}

The FCData may be attached to one or several skip list nodes. The skip list node class is:

Listing 2.3: Node definition

class Link {
    public Link next;
    public Node node;
    public Link up;
    public Link down;
}

class Node {
    public int numlevels() {        // Node height
        return links.length;
    }
    // A node is an FC node when it has FC data
    public boolean isfcnode() {
        return fc_data != null;
    }
    public Link at(int index) {     // Get the link at a level
        return links[index];
    }
    public Link bottom() {          // The bottom link
        return links[0];
    }
    public Link top() {             // The top link
        return links[links.length - 1];
    }
    public final int key;
    public volatile boolean deleted = false;
    public volatile boolean fully_connected = false;
    public FCData fc_data;
    // 2D list of links with random access;
    // a Link contains references to the next, up and down links
    private Link[] links;
}

So far, the skip list is a regular single-threaded one, save for two details: the deleted and fully_connected flags, and the FCData reference (which is non-null for flat combining nodes). The contains method is also very similar to the single-threaded implementation:

Listing 2.4: Wait-free contains is the same for all skip lists

public boolean contains(int inKey) {
    int level = start_level;           // Adaptable start level
    Link pred = head.at(level);
    Link curr = null;

    for (; level >= 0; level--, pred = pred.down) {
        curr = pred.next;
        while (inKey > curr.node.key) {
            pred = curr;
            curr = pred.next;
        }
        if (inKey == curr.node.key)
            return (!curr.node.deleted &&
                    curr.node.fully_connected);
    }
    return false;
}

The only distinguishing detail is the check of the deleted and fully_connected flags. The difference comes with the add and remove implementations. We will present implementations for several flat combined list variants.

2.1 Naive Flat Combined Skip List

The first and simplest implementation is the Naive FC list. It has exactly one combiner node (the head). A thread performing an add or remove action:
1. Puts its FCRequest into the head node's FCData.
2. Tries to acquire the lock.
3. If it succeeded, scans and fulfills the requests.
4. Otherwise, the thread spins on its own request completion flag and checks the lock state. If the request has been fulfilled, the thread returns with the desired result; otherwise, if the lock is unlocked, it continues from step 2.

Listing 2.5 presents the add method implementation.

Listing 2.5: add, Naive implementation

public boolean add(int key) {
    // Put my request into the node's fc_data
    FCRequest my_request =
        head.fc_data.req_ary[ThreadId.getThreadId()];
    my_request.key = key;
    // Volatile write, from here the combiner sees it
    my_request.opcode = ADD;
    AtomicInteger lock = fc_node.fc_data.lock;
    do {
        if (0 == lock.get() &&              // TTAS lock
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanandcombine(fc_node);
            lock.set(0);                    // Unlock
            return my_request.response;
        } else {
            do {
                Thread.yield();             // Give up the processor
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}

The remove method differs from the above only by the REMOVE opcode. All the work is performed within the scanandcombine method, which is the same for all the following implementations:

Listing 2.6: scanandcombine, common implementation

protected void scanandcombine(Node fc_node) {
    for (FCRequest curr_req : fc_node.fc_data.requests) {
        switch (curr_req.opcode) {
        case ADD:
            curr_req.response = doadd(fc_node, curr_req.key,
                                      curr_req.pred_ary, curr_req.succ_ary);
            curr_req.opcode = NONE;         // Release the waiting thread
            break;
        case REMOVE:
            curr_req.response = doremove(fc_node, curr_req.key,
                                         curr_req.pred_ary, curr_req.succ_ary);
            curr_req.opcode = NONE;         // Release the waiting thread
            break;
        }
    }
}

Here, the combiner thread scans all the requests and performs the modifications. Both the doadd and doremove methods receive containers for the predecessor and successor nodes: a technical detail that allows memory reuse in the case of the Naive list, but which is used in a different way in the other implementations. Besides this, the fc_node parameter indicates the start node for the search; it is not relevant for the single-combiner list, but it is important for the multi-combiner one described below. The doadd/doremove methods act exactly as in the single-threaded skip list:

Listing 2.7: Physical add and remove, Naive implementation

private boolean doadd(Node fc_node, int key,
                      RandomAccessList<Link> pred_ary,
                      RandomAccessList<Link> succ_ary) {
    // The new node height has to be known in advance
    // in order to restrict the nodes collection.
    int top_level = randomLevel();
    // Find the placement and the nodes to connect.
    Node found_node = find(fc_node, key, pred_ary,
                           succ_ary, top_level, true);
    if (found_node == null) {               // Node not in the map
        Node new_node = new Node(key, top_level, false);
        Link new_link = new_node.bottom();
        RandomAccessList<Link>.BiDirIterator prediter = pred_ary.begin();
        RandomAccessList<Link>.BiDirIterator succiter = succ_ary.begin();
        // Connect the new node
        for (int level = 0; level < top_level; ++level,
                new_link = new_link.up) {
            new_link.next = succiter.data;
            prediter.data.next = new_link;
            prediter = prediter.next();
            succiter = succiter.next();
        }
        // Linearization point
        new_node.fully_connected = true;
        return true;
    }
    return false;
}

private boolean doremove(Node fc_node, int key,
                         RandomAccessList<Link> pred_ary,
                         RandomAccessList<Link> succ_ary) {
    // Find the node to delete and its predecessors.
    Node found_node = find(fc_node, key, pred_ary,
                           succ_ary, fc_node.numlevels(), false);

    if (found_node != null) {
        int top_level = found_node.numlevels();
        // Get the link on the top level
        Link lnk = found_node.top();
        // Topmost predecessor
        RandomAccessList<Link>.BiDirIterator prediter = pred_ary.rbegin();
        found_node.deleted = true;          // Logical delete (linearization point)
        for (int level = 0; level < top_level; ++level,
                lnk = lnk.down, prediter = prediter.prev()) {
            // Physical delete
            prediter.data.next = lnk.next;
        }
        return true;
    }
    return false;
}

In this implementation we use the fast random number generator described in [12]; a similar one is adopted in the JDK's lock-free list. Consider the properties of the above skip list implementation.

Property 2.1.1. The Naive skip list is deadlock-free.

Proof. The implementation uses only one lock. Therefore, a deadlock-free implementation of the lock implies deadlock freedom of the data structure.

Property 2.1.2. Naive skip list update operations do not overlap each other and have a strict total order.

Proof. Consider two arbitrary update operations on the list. All modifications are performed by the combiner thread during a combining session (Listing 2.6). The combining sessions are strictly ordered by the single lock and do not overlap, so if the operations belong to different sessions, their order is defined by the lock acquisition order. Otherwise, if the updates belong to the same session, the order is defined by the combining algorithm: the combiner performs the updates sequentially, and no two modifications overlap.

Proposition 2.1.3. The Naive skip list is linearizable.

Proof. Select the linearization points for skip list updates:
For add: the statement in Listing 2.7 where the fully_connected flag is set to true (marked as the linearization point).
For remove: the statement in Listing 2.7 where the deleted flag is set to true (the logical delete).
Use the linearizability of OptimisticSkipList proved in [8]. Note that, by Property 2.1.2, all updates performed on our skip list may be regarded as performed by a single dedicated thread. Therefore, since the initial preconditions are identical for both OptimisticSkipList and the Naive list, and modifications of the next references and of the deleted and fully_connected flags appear in program order exactly as in OptimisticSkipList, the Naive skip list state may be considered exactly equal to that of an OptimisticSkipList in which all modifications are performed by a single thread. Then, for each possible concurrent run on the Naive skip list,

there is a run on OptimisticSkipList in which both skip lists' states, defined by the next references and the flags, are identical at every point in time, and so the OptimisticSkipList linearization order is applicable to the Naive skip list.

As expected, flat combining in this implementation exposes a sequential bottleneck very comparable to a global lock. In Chapter 3 (Performance) this estimation is verified.

2.2 Flat Combined Skip List with Multiple Combiners

The second attempt is the introduction of several combiners, which allows several modifications to be made simultaneously and, therefore, improves scalability. The multi-combiner skip list is implemented with statically distributed immutable combiners. The idea is to divide the skip list into non-intersecting parts, such that every part is managed by some combiner node. The multi-combiner skip list is shown in Figure 2.1.

Fig. 2.1: Multi-combiner skip list. Every node with height >= 3 is a combiner node.

Suppose that we start from an initially filled skip list of size N and have to add c < N combiners. We choose a height h_c such that the number of nodes with height h >= h_c is at least c, and make these nodes combiner nodes by adding FCData to each one. In this work, only static multi-combiner skip lists are studied. Dynamic lists may be devised by altering the h_c value; the process requires consecutive locking of all FC node layers, converting the needed layer to combiners or non-combiners, and re-scheduling all pending combining requests. Since, by its essence, flat combining has to use a very small number of combiners (otherwise it does not differ from a sort of fine-grained synchronization), this process is rare and not expensive.

The multi-combiner skip list acts very similarly to the single-combiner one. As mentioned earlier, the contains method is exactly the same, while the single difference in add/remove is that the requests are placed in the appropriate combiner nodes instead of the head. The updating thread:
1. Finds the combiner node fc_node responsible for the modification area.
2. Puts its FCRequest into fc_node's FCData.
3. Tries to acquire the FCData lock.
4. If it succeeded, scans and fulfills the requests.
5. Otherwise, spins on its own request completion flag and checks the lock state. If the request is fulfilled, it returns with the desired result; otherwise, if the lock is unlocked, it continues from step 3.

Listing 2.8: Multi-combiner remove implementation

public boolean remove(int key) {
    // Get the responsible combiner
    Node fc_node = findcombiner(key);
    // Put my request into the node's fc_data
    FCRequest my_request =
        fc_node.fc_data.req_ary[ThreadId.getThreadId()];
    my_request.key = key;
    // Volatile write, from here the combiner sees it
    my_request.opcode = REMOVE;
    AtomicInteger lock = fc_node.fc_data.lock;
    do {
        // TTAS lock
        if (0 == lock.get() &&
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanandcombine(fc_node);
            // Unlock
            lock.set(0);
            return my_request.response;
        } else {
            do {
                Thread.yield();
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}

The method findcombiner is wait-free and is implemented similarly to contains. It has three differences:
1. The search goes down only to the lowest combiner level and does not proceed to the bottom.
2. The search returns the lowest combiner predecessor of the key.
3. Since combiners are immutable, there is no need to check their deleted flag.
A hedged sketch of such a traversal is given below. The properties of the multi-combiner skip list are similar to those of the Naive list.
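The thesis does not list findcombiner itself, so the following is only a sketch of one possible shape, reusing the Node and Link fields from Listing 2.3 and the traversal style of Listing 2.4; the combiners_level field (the lowest level still populated by combiner nodes) is an assumption.

// Hypothetical sketch of the wait-free findcombiner described above.
// head, start_level and combiners_level are assumed fields of the list.
Node findcombiner(int key) {
    int level = start_level;
    Link pred = head.at(level);
    while (true) {
        Link curr = pred.next;
        // Advance along the current level; combiner nodes are immutable,
        // so no deleted check is needed here.
        while (key > curr.node.key) {
            pred = curr;
            curr = pred.next;
        }
        if (level == combiners_level)
            return pred.node;       // lowest combiner predecessor of the key
        level--;                    // descend only down to the combiner level,
        pred = pred.down;           // never to the bottom of the list
    }
}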

Property 2.2.1. The multi-combiner skip list is deadlock-free.

Proof. As follows from the algorithm, no thread tries to hold more than one lock at a time; hence a deadlock is impossible.

Practically, the multi-combiner design divides the data structure into a disjoint set of single-combiner Naive lists. Call these lists combining clusters, and call the combiner responsible for a cluster the cluster head. Then the properties of the Naive FC list apply to every combining cluster. Instead of a strict total order, all update operations of the multi-combiner list form a strict partial order, in which operations on different clusters are commutative: they can be reordered without affecting the final state of the data structure.

Proposition 2.2.2. The multi-combiner skip list is linearizable.

Proof. Follows from the linearizability of each cluster and the fact that linearizability is compositional (Theorem 1 of [10]).

The multi-combiner skip list scales much better than the single-combiner one, but it still performs a lot of work sequentially. The next attempt is to reduce this part of the execution by means of a hints mechanism.

2.3 Flat Combined Skip List with Hints

The hints mechanism is inspired by the optimistic skip list [8]. The idea is to collect, in a wait-free optimistic manner, the links that have to be updated, then to acquire the lock, verify (and re-find, if needed) the links and perform the update. Listing 2.9 shows the FCRequest structure supplemented with hints, and the add method.

Listing 2.9: Optimistic (hinted) FCRequest and add implementation

class FCRequest {
    int key;                            // Key
    boolean response;                   // Operation result
    volatile int opcode = NONE;         // Action
    int top_level;                      // Hints size
    RandomAccessList<Link> pred_ary;    // Collected hints
    RandomAccessList<Link> succ_ary;    // Collected hints
}

public boolean add(int key) {
    // Get the responsible combiner
    Node fc_node = findcombiner(key);
    FCRequest my_request =
        fc_node.fc_data.req_ary[ThreadId.getThreadId()];
    // We have to know the level prior to find in order
    // to restrict the hints size
    int top_level = randomLevel();
    Node found_node;
    do {
        // Find the placement and fill the hints data
        found_node = find(fc_node, key, my_request.pred_ary,
                          my_request.succ_ary, top_level, true, true);
    } while (found_node != null && found_node.deleted);
    // Node already exists
    if (found_node != null)
        return false;
    // Put my request into the node's fc_data
    my_request.top_level = top_level;
    my_request.key = key;
    // Volatile write, from here the combiner sees it
    my_request.opcode = ADD;
    AtomicInteger lock = fc_node.fc_data.lock;
    do {
        // TTAS lock
        if (0 == lock.get() &&
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanandcombine(fc_node);
            // Unlock
            lock.set(0);
            return my_request.response;
        } else {
            do {
                Thread.yield();
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}

The internal doadd and doremove (Listing 2.10) methods are also slightly modified, since we have to verify, and re-fill if needed, the collections of predecessors and successors. The verify method checks that all collected nodes are correct, i.e. they are non-deleted and connected, each predecessor's next reference points to the appropriate successor, and the keys of the collected nodes suit the requested key.

Listing 2.10: Optimistic (hinted) doadd and verify implementation

private boolean doadd(Node fc_node, int key, int top_level,
                      RandomAccessList<Link> pred_ary,
                      RandomAccessList<Link> succ_ary) {
    Node found_node = null;
    // Verify the data and re-fill it if needed
    if (!verify(key, pred_ary, succ_ary, top_level)) {
        found_node = find(fc_node, key, pred_ary,
                          succ_ary, top_level, true, false);
    }
    // From here, as in the Naive list
    ...
}

protected boolean verify(int key,
                         RandomAccessList<Link> predary,
                         RandomAccessList<Link> succary,
                         int top_level) {
    RandomAccessList<Link>.BiDirIterator prediter = predary.begin();
    RandomAccessList<Link>.BiDirIterator succiter = succary.begin();
    for (int ilevel = 0; ilevel < top_level; ++ilevel,
            prediter = prediter.next(), succiter = succiter.next()) {
        Link pred = prediter.data;
        Link next = succiter.data;
        if (pred.node.deleted || next.node.deleted ||
            !pred.node.fully_connected ||
            !next.node.fully_connected ||
            pred.next != next ||
            pred.node.key >= key || next.node.key < key)
            return false;
    }
    return true;
}

Like its predecessors, the hinted skip list is deadlock-free and linearizable. The deadlock freedom is obvious, since this implementation uses exactly the same locking scheme as the previous ones. The linearizability can be derived from the fact that if verify fails, the hints skip list algorithm is identical to the Naive one; otherwise, a successful verify guarantees that the state of all the memory that has to be updated is identical to its state at the time the data was collected, and therefore all preconditions mentioned in the linearizability proof of OptimisticSkipList hold, and the proof applies to the hints skip list as well. The hints mechanism is applied to both the single- and multi-combiner lists. As shown in Chapter 3 (Performance), the optimistic approach is very efficient, especially when the update rate is not high.

3. PERFORMANCE

For the performance verification, we use the skip lists described above and several additional data structures designed to assess the impact of flat combining. The JDK ConcurrentSkipListSet by Doug Lea is used as the main competitor: by now, it is one of the most efficient and scalable skip list implementations. Computations were performed on a Sun SPARC Enterprise T5140 server powered by two UltraSPARC T2 Plus processors. Each processor contains eight cores running eight hardware threads each, which gives 128 hardware threads in total per system. The notation for the benchmarked algorithms is:

FC-Naive-0: Naive FC list with 0 non-head combiners.

FC-Hints-64: hinted FC list with at least 64 non-head combiners; the combiner distribution algorithm was described in Section 2.2.

JDK: JDK ConcurrentSkipListSet (based on ConcurrentSkipListMap).

ML-0, ML-64: multi-lock skip lists with 0 and 64 non-head locks, respectively; a data structure designed to isolate the combining effect from the combiner distribution effect. Essentially, it is the multi-combiner skip list with the FCData structures replaced by simple locks: the updating thread locks the appropriate locking node, makes the update and releases the lock, instead of running the whole combining algorithm.

ML-hints-0, ML-hints-64: multi-lock optimistic skip lists with 0 and 64 non-head locks, respectively, using the hints mechanism exactly as the flat combining lists do.

FC-Ideal-64: an artificial FC list made from the FC list with hints. Here we assume that the hints are always successful, and the combiner's only work is to update the next references. This data structure gives an indication of the maximal FC skip list performance when the combiner fulfills all of its requests sequentially.

Experiments were performed on data structures with a fixed initial size. Before selecting this size, the base skip list implementations were roughly benchmarked over a wide range of sizes, from one hundred to a few million keys. The relations between the run times of the different skip list implementations were very similar across sizes, and therefore any initial size was representative enough to show the qualitative differences between the algorithms.

The access locality factor was introduced to simulate different workloads. Suppose that the experiment is performed for the key space S = {1, 2, ..., N}. The access locality factor k, 1 <= k <= N, is defined in the following way: the keys in the benchmark are uniformly selected from S_k = {t, t+1, ..., t + N/k}, where t is selected uniformly from S at the start of the run and is changed slowly during the execution. An access locality factor of 1 therefore corresponds to uniformly distributed keys from S. Increasing the factor means that the keys are selected from a smaller interval, and so the contention increases.

3.1 Performance Comparison of Flat Combined Skip Lists vs JDK ConcurrentSkipListSet

The first group of benchmarks compares the throughput of the flat combining skip list implementations with that of the JDK ConcurrentSkipListSet. Figure 3.1 presents the benchmark results for Naive flat combining using uniformly distributed keys. The graphs show that the single-combiner implementation fails to compete with the JDK list even for read-dominated loads, while the implementation with 64 combiners shows scalability even for write-only loads. The picture changes dramatically when the workload locality increases. Figure 3.2 depicts the same data structures, where all requests are selected from 1/128 of the total key space. In this case, the Naive FC skip list loses to the JDK one even for read-dominated workloads once the number of running threads grows large enough, and multiple combiners do not help.

The next group of runs deals with the improved optimistic skip list, using the hints mechanism described above in Chapter 2 (The Flat Combined Skip Lists). Figure 3.3 shows the benchmark results for uniformly distributed requests, while Figure 3.4 depicts the runs with high access locality. The presented graphs show a significant performance gain due to the optimistic approach. For read-dominated workloads, both the single- and multi-combiner lists perform better than the JDK one for all workload localities. For higher update rates, the multi-combiner list competes well with the JDK data structure, while the single-combiner one shows a lack of scalability, especially for high access locality. So far, we can conclude that at least the hinted variant of the combining skip list is a simple and effective alternative to the JDK solution.

It is clear that for read-dominated workloads the lock-free list performs worse than lists with lock-protected updates and a lock-free contains. The first reason for the more efficient reads is that the FC lists' contains (Listing 2.4) performs only two volatile reads, while the lock-free implementations require all next references to be volatile and therefore need about log N volatile reads. The second reason is that all known lock-free skip list implementations decide about a node's presence only after reaching the bottom skip list level, whereas our implementation stops as soon as a node with the desired key is found on any level. However, it is not yet clear what impact the combiner mechanism has on the presented results.
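As a hedged illustration of the access-locality workload defined at the beginning of this chapter, the sketch below draws keys from a window of size N/k. It is not the thesis benchmark harness, and the policy by which the window start t drifts is an assumption, since the text only states that t changes slowly during the run.

import java.util.concurrent.ThreadLocalRandom;

// Illustrative key generator for key space S = {1,...,N} with access
// locality factor k: keys are uniform on a window of size N/k starting
// at t, and t moves occasionally (the drift rate here is made up).
final class LocalKeyGenerator {
    private final int n;        // size of the key space S
    private final int window;   // N / k, the locally accessed interval
    private int t;              // current window start

    LocalKeyGenerator(int n, int localityFactor) {
        this.n = n;
        this.window = Math.max(1, n / localityFactor);
        this.t = ThreadLocalRandom.current().nextInt(n);
    }

    int nextKey() {
        // Move the window start from time to time to model the slow change of t.
        if (ThreadLocalRandom.current().nextInt(100_000) == 0)
            t = ThreadLocalRandom.current().nextInt(n);
        return 1 + (t + ThreadLocalRandom.current().nextInt(window)) % n;
    }
}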

Fig. 3.1: Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution

Fig. 3.2: Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality

Fig. 3.3: Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution

Fig. 3.4: Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality

3.2 Flat Combining Mechanism Experimental Verifications

In this section we experimentally verify in depth the impact of FC on skip list behavior. The first experiments compare the flat combining implementations with a specially designed multi-lock skip list. The multi-lock skip list is derived from the flat combining one by replacing FCData with a simple lock. It has single- and multi-lock implementations, exactly as the FC skip list has, and may be extended with the hints mechanism as well. The add method of the multi-lock skip list with hints is shown in Listing 3.1. The doadd method called after the lock is acquired is identical to the flat combined one presented in Listing 2.10.

Listing 3.1: Optimistic (hinted) multi-lock add method implementation

public boolean add(int key) {
    // Get the responsible lock node
    Node lock_node = findlocknode(key);

    // We have to know the level prior to find in order
    // to restrict the hints size
    int top_level = randomLevel();
    // Thread-local hints lists
    int thread_id = ThreadId.getThreadId();
    RandomAccessList<Link> succ_ary = this.succ_ary[thread_id];
    RandomAccessList<Link> pred_ary = this.pred_ary[thread_id];

    Node found_node;
    do {
        found_node = find(lock_node, key, pred_ary,
                          succ_ary, top_level, true, true);
    } while (found_node != null && found_node.deleted);
    if (found_node != null)
        return false;
    // Acquire the lock and perform the modification
    AtomicInteger lock = lock_node.node_lock;
    do {    // TTAS lock
        if (0 == lock.get() && lock.compareAndSet(0, 0xFF)) {
            doadd(thread_id, lock_node, key, pred_ary,
                  succ_ary, top_level);
            // Release the lock
            lock.set(0);
            return true;
        } else {
            // Give up the processor
            Thread.yield();
        }
    } while (true);
}

Instead of placing a request and running the flat combining algorithm, the updating thread finds the appropriate lock node, acquires the lock and performs the change itself.

The following graphs compare the multi-lock and the Naive FC skip lists. We can see that for both low (Figure 3.5) and high (Figure 3.6) locality, and for any update rate, both lists behave very similarly. The multi-lock skip list even tends to perform slightly better than its FC counterpart for low access locality. This may be explained by the additional overheads that flat combining introduces: the combiner thread has to read and maintain the FC registry and to write back the operation results. All this, if not compensated by the FC gains described above, leads to a performance decrease.

The benchmarks of the hinted versions of the multi-lock and FC skip lists are shown in Figures 3.7 and 3.8 for low and high access locality. The introduction of the hints mechanism improves the performance of both lists, but it does not change the ratio between the algorithms: both behave very similarly, with a slight preference for the multi-lock skip list at low access locality.

As mentioned before, flat combining, besides relieving the contention bottleneck, allows the knowledge about all pending requests to be used for optimizing data structure updates. For tree-like data structures, and for skip lists in particular, the elimination and combining techniques can be applied to optimize the data structure traversal, but it is very hard to use them to optimize the data structure update. For the next group of experiments, we assumed that the traversal is perfectly optimized, i.e. our hints mechanism never fails. In practice, we replaced the verify method in Listing 2.10 with one that always returns true, and supplied every node with additional dummy next references; the combiner, instead of writing to the real next references, updates an equal quantity of dummy ones. These benchmarks are presented in Figures 3.9 and 3.10, and show that the FC skip list with an ideal hints mechanism competes well with the lock-free one, and fails only for high access locality and more than 50% update rate; so verification and improvement of the hints mechanism makes sense.

The next graph (Figure 3.11) shows the efficiency of our hints mechanism. As follows from the graph, the hints are very close to ideal for uniform access and fall to about 50% failures when the number of threads grows to 64. This result explains the scalability turning point between 16 and 32 threads for high access locality and a high update rate. Note that for the ideal hints list the turning point also exists, but it appears slightly later and is not as sharp. So the problematic scalability of the FC list is probably caused by flat combining itself.

Fig. 3.5: FC skip list implementation vs multi-lock one, naive implementations, uniform keys distribution

Fig. 3.6: FC skip list implementation vs multi-lock one, naive implementations, high access locality

Fig. 3.7: FC skip list implementation vs multi-lock one, hints implementations, uniform keys distribution

Fig. 3.8: FC skip list implementation vs multi-lock one, hints implementations, high access locality

Fig. 3.9: Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution

Fig. 3.10: Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality

Fig. 3.11: Hints mechanism success rate for pure update workloads

Fig. 3.12: The connection between FC intensity and throughput per thread for pure update workloads

The next two benchmarks, performed for a pure update workload, are intended to answer the question of why the lock-free list scales better than the FC one. To estimate the flat combining load, we introduce the FC intensity, a factor showing the additional combiner work. It is calculated in the following way:

    FC intensity = (fulfilled requests per FC session - 1) / (number of threads)

This number is 0 for a single-threaded execution, and it tends to 1 for a large number of threads, when one combiner fulfills the requests of all the other threads. Figure 3.12 shows the FC intensity together with the throughput per thread for different numbers of combiners and workload localities; an increase in FC intensity is accompanied by a decrease in throughput (note that ideal scalability is a horizontal line). The jump in intensity between 16 and 32 threads corresponds well with the graphs in Figures 3.3 and 3.4 for the 50% add / 50% remove workload. The jump may be explained in the following way: starting from some number of threads, the combiner has no time to complete all the requests during the period in which a released thread prepares its next request, and so the competition for the lock never ceases. On the other hand, for 64 combiners and low locality the jump does not happen, and the algorithm is scalable.

Figure 3.13 shows the lock-free list statistics for a pure update workload. As follows from the graphs, the CAS success rate never drops below 75% and the number of CAS operations per update remains small, which explains the good scalability of the algorithm.

Fig. 3.13: Lock-free skip list CAS per update, CAS success rate and throughput per thread for pure update workloads

4. CONCLUSIONS

We studied several approaches to applying the flat combining technique to skip-list-based maps. As shown on the skip list example, for structures that allow concurrent updates, fine-grained and especially lock-free synchronization are preferable to FC. This conclusion does not completely rule out the usefulness of FC for such structures, since for read-dominated workloads and for several update request distributions flat combining behaves better than lock-free synchronization. It is also possible that on different hardware the FC approach will show better scalability. A breakthrough could also come from improvements to the FC algorithm. It is possible, for example, to transform FC into a sort of job dispatcher: having all the requests, it can form mutually non-conflicting groups, so that the waiting threads can execute them without synchronization. Such a design faces the problem of additional FC overhead for sorting and analyzing the requests, but it may be applicable to NUMA or client-server architectures. It would also be interesting to study FC implementations for other popular data structures, such as B-trees or red-black trees, where lock-free alternatives do not exist and fine-grained locking requires complicated read-write locks. FC's benefits of simplicity and provable linearizability may be valuable in these cases. Another, albeit auxiliary, data structure, the multi-lock skip list, may be interesting in itself. It showed characteristics as good as the FC skip list, but it is simpler, needs less memory and gives more uniform latency for update requests. The idea of building a small index protected by locks (locked or FC layers) on top of an entirely wait-free data structure body can replace hand-over-hand fine-grained synchronization schemes for tree-like structures.

BIBLIOGRAPHY

[1] Adelson-Velskii, G. M., and Landis, E. M. An algorithm for the organization of information. Soviet Math. Doklady 3 (1962).
[2] Bayer, R., and McCreight, E. Organization and maintenance of large ordered indices. In SIGFIDET '70: Proceedings of the 1970 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control (New York, NY, USA, 1970), ACM.
[3] Colvin, R., Groves, L., Luchangco, V., and Moir, M. Formal verification of a lazy concurrent list-based set algorithm. In CAV (2006).
[4] Doherty, S., Groves, L., Luchangco, V., and Moir, M. Formal verification of a practical lock-free queue algorithm. In FORTE (2004), Springer.
[5] Fraser, K. Practical lock freedom. PhD thesis, Cambridge University Computer Laboratory. Also available as Technical Report UCAM-CL-TR-579.
[6] Guibas, L. J., and Sedgewick, R. A dichromatic framework for balanced trees. In SFCS '78: Proceedings of the 19th Annual Symposium on Foundations of Computer Science (Washington, DC, USA, 1978), IEEE Computer Society.
[7] Hendler, D., Incze, I., Shavit, N., and Tzafrir, M. Flat combining and the synchronization-parallelism tradeoff. In SPAA (2010).
[8] Herlihy, M., Lev, Y., Luchangco, V., and Shavit, N. A simple optimistic skiplist algorithm. In SIROCCO '07: Proceedings of the 14th International Conference on Structural Information and Communication Complexity (Berlin, Heidelberg, 2007), Springer-Verlag.
[9] Herlihy, M., and Shavit, N. The Art of Multiprocessor Programming. Morgan Kaufmann.
[10] Herlihy, M. P., and Wing, J. M. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12 (1990).
[11] Lotan, I., and Shavit, N. Skiplist-based concurrent priority queues. In Proc. of the 14th International Parallel and Distributed Processing Symposium (IPDPS) (2000).


More information

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Alexey Paznikov Saint Petersburg Electrotechnical University

More information

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology Introduction Chapter 4 Trees for large input, even linear access time may be prohibitive we need data structures that exhibit average running times closer to O(log N) binary search tree 2 Terminology recursive

More information

Unit 6: Indeterminate Computation

Unit 6: Indeterminate Computation Unit 6: Indeterminate Computation Martha A. Kim October 6, 2013 Introduction Until now, we have considered parallelizations of sequential programs. The parallelizations were deemed safe if the parallel

More information

Flat Parallelization. V. Aksenov, ITMO University P. Kuznetsov, ParisTech. July 4, / 53

Flat Parallelization. V. Aksenov, ITMO University P. Kuznetsov, ParisTech. July 4, / 53 Flat Parallelization V. Aksenov, ITMO University P. Kuznetsov, ParisTech July 4, 2017 1 / 53 Outline Flat-combining PRAM and Flat parallelization PRAM binary heap with Flat parallelization ExtractMin Insert

More information

Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory Multiprocessors

Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory Multiprocessors Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory Multiprocessors Emery D. Berger Robert D. Blumofe femery,rdbg@cs.utexas.edu Department of Computer Sciences The University of Texas

More information

arxiv: v3 [cs.ds] 3 Apr 2018

arxiv: v3 [cs.ds] 3 Apr 2018 arxiv:1711.07746v3 [cs.ds] 3 Apr 2018 The Hidden Binary Search Tree: A Balanced Rotation-Free Search Tree in the AVL RAM Model Saulo Queiroz Email: sauloqueiroz@utfpr.edu.br Academic Department of Informatics

More information

CS 351 Design of Large Programs Programming Abstractions

CS 351 Design of Large Programs Programming Abstractions CS 351 Design of Large Programs Programming Abstractions Brooke Chenoweth University of New Mexico Spring 2019 Searching for the Right Abstraction The language we speak relates to the way we think. The

More information

Scalable Producer-Consumer Pools based on Elimination-Diraction Trees

Scalable Producer-Consumer Pools based on Elimination-Diraction Trees Scalable Producer-Consumer Pools based on Elimination-Diraction Trees Yehuda Afek, Guy Korland, Maria Natanzon, and Nir Shavit Computer Science Department Tel-Aviv University, Israel, contact email: guy.korland@cs.tau.ac.il

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential

More information

Flat Combining and the Synchronization-Parallelism Tradeoff

Flat Combining and the Synchronization-Parallelism Tradeoff Flat Combining and the Synchronization-Parallelism Tradeoff Danny Hendler Ben-Gurion University hendlerd@cs.bgu.ac.il Itai Incze Tel-Aviv University itai.in@gmail.com Moran Tzafrir Tel-Aviv University

More information

Locking Granularity. CS 475, Spring 2019 Concurrent & Distributed Systems. With material from Herlihy & Shavit, Art of Multiprocessor Programming

Locking Granularity. CS 475, Spring 2019 Concurrent & Distributed Systems. With material from Herlihy & Shavit, Art of Multiprocessor Programming Locking Granularity CS 475, Spring 2019 Concurrent & Distributed Systems With material from Herlihy & Shavit, Art of Multiprocessor Programming Discussion: HW1 Part 4 addtolist(key1, newvalue) Thread 1

More information

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based? Agenda Designing Transactional Memory Systems Part III: Lock-based STMs Pascal Felber University of Neuchatel Pascal.Felber@unine.ch Part I: Introduction Part II: Obstruction-free STMs Part III: Lock-based

More information

Design Tradeoffs in Modern Software Transactional Memory Systems

Design Tradeoffs in Modern Software Transactional Memory Systems Design Tradeoffs in Modern Software al Memory Systems Virendra J. Marathe, William N. Scherer III, and Michael L. Scott Department of Computer Science University of Rochester Rochester, NY 14627-226 {vmarathe,

More information

Advanced Multiprocessor Programming Project Topics and Requirements

Advanced Multiprocessor Programming Project Topics and Requirements Advanced Multiprocessor Programming Project Topics and Requirements Jesper Larsson Trä TU Wien May 5th, 2017 J. L. Trä AMP SS17, Projects 1 / 21 Projects Goal: Get practical, own experience with concurrent

More information

Concurrent Data Structures Concurrent Algorithms 2016

Concurrent Data Structures Concurrent Algorithms 2016 Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1 Data Structures (DSs) Constructs for efficiently storing and retrieving

More information

Parallel linked lists

Parallel linked lists Parallel linked lists Lecture 10 of TDA384/DIT391 (Principles of Conent Programming) Carlo A. Furia Chalmers University of Technology University of Gothenburg SP3 2017/2018 Today s menu The burden of locking

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6) Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) 1 Coherence Vs. Consistency Recall that coherence guarantees (i) that a write will eventually be seen by other processors,

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,

More information

Modern High-Performance Locking

Modern High-Performance Locking Modern High-Performance Locking Nir Shavit Slides based in part on The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Locks (Mutual Exclusion) public interface Lock { public void lock();

More information

Linearizable Iterators

Linearizable Iterators Linearizable Iterators Supervised by Maurice Herlihy Abstract Petrank et. al. [5] provide a construction of lock-free, linearizable iterators for lock-free linked lists. We consider the problem of extending

More information

Lightweight Remote Procedure Call

Lightweight Remote Procedure Call Lightweight Remote Procedure Call Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, Henry M. Levy ACM Transactions Vol. 8, No. 1, February 1990, pp. 37-55 presented by Ian Dees for PSU CS533, Jonathan

More information

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, lazy implementation 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end

More information

A simple correctness proof of the MCS contention-free lock. Theodore Johnson. Krishna Harathi. University of Florida. Abstract

A simple correctness proof of the MCS contention-free lock. Theodore Johnson. Krishna Harathi. University of Florida. Abstract A simple correctness proof of the MCS contention-free lock Theodore Johnson Krishna Harathi Computer and Information Sciences Department University of Florida Abstract Mellor-Crummey and Scott present

More information

Deterministic Jumplists

Deterministic Jumplists Nordic Journal of Computing Deterministic Jumplists Amr Elmasry Department of Computer Engineering and Systems Alexandria University, Egypt elmasry@alexeng.edu.eg Abstract. We give a deterministic version

More information

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree Chapter 4 Trees 2 Introduction for large input, even access time may be prohibitive we need data structures that exhibit running times closer to O(log N) binary search tree 3 Terminology recursive definition

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Linked Lists: Locking, Lock- Free, and Beyond Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Last Lecture: Spin-Locks CS. spin lock critical section Resets lock

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

A Non-Blocking Concurrent Queue Algorithm

A Non-Blocking Concurrent Queue Algorithm A Non-Blocking Concurrent Queue Algorithm Bruno Didot bruno.didot@epfl.ch June 2012 Abstract This report presents a new non-blocking concurrent FIFO queue backed by an unrolled linked list. Enqueue and

More information

Database Management and Tuning

Database Management and Tuning Database Management and Tuning Concurrency Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 8 May 10, 2012 Acknowledgements: The slides are provided by Nikolaus

More information

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Anders Gidenstam Håkan Sundell Philippas Tsigas School of business and informatics University of Borås Distributed

More information

Conflict Detection and Validation Strategies for Software Transactional Memory

Conflict Detection and Validation Strategies for Software Transactional Memory Conflict Detection and Validation Strategies for Software Transactional Memory Michael F. Spear, Virendra J. Marathe, William N. Scherer III, and Michael L. Scott University of Rochester www.cs.rochester.edu/research/synchronization/

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information

Lecture 5. Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs

Lecture 5. Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs Lecture 5 Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs Reading: Randomized Search Trees by Aragon & Seidel, Algorithmica 1996, http://sims.berkeley.edu/~aragon/pubs/rst96.pdf;

More information

Dynamic Memory Allocation. Gerson Robboy Portland State University. class20.ppt

Dynamic Memory Allocation. Gerson Robboy Portland State University. class20.ppt Dynamic Memory Allocation Gerson Robboy Portland State University class20.ppt Harsh Reality Memory is not unbounded It must be allocated and managed Many applications are memory dominated Especially those

More information

Lecture: Consistency Models, TM

Lecture: Consistency Models, TM Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency

More information

Last Lecture: Spin-Locks

Last Lecture: Spin-Locks Linked Lists: Locking, Lock- Free, and Beyond Last Lecture: Spin-Locks. spin lock CS critical section Resets lock upon exit Today: Concurrent Objects Adding threads should not lower throughput Contention

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Objects Companion slides for The by Maurice Herlihy & Nir Shavit Concurrent Computation memory object object 2 Objectivism What is a concurrent object? How do we describe one? How do we implement

More information

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap Håkan Sundell Philippas Tsigas OPODIS 2004: The 8th International Conference on Principles of Distributed Systems

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

The New Java Technology Memory Model

The New Java Technology Memory Model The New Java Technology Memory Model java.sun.com/javaone/sf Jeremy Manson and William Pugh http://www.cs.umd.edu/~pugh 1 Audience Assume you are familiar with basics of Java technology-based threads (

More information

15 418/618 Project Final Report Concurrent Lock free BST

15 418/618 Project Final Report Concurrent Lock free BST 15 418/618 Project Final Report Concurrent Lock free BST Names: Swapnil Pimpale, Romit Kudtarkar AndrewID: spimpale, rkudtark 1.0 SUMMARY We implemented two concurrent binary search trees (BSTs): a fine

More information

The Relative Power of Synchronization Methods

The Relative Power of Synchronization Methods Chapter 5 The Relative Power of Synchronization Methods So far, we have been addressing questions of the form: Given objects X and Y, is there a wait-free implementation of X from one or more instances

More information

Outline. Database Tuning. Ideal Transaction. Concurrency Tuning Goals. Concurrency Tuning. Nikolaus Augsten. Lock Tuning. Unit 8 WS 2013/2014

Outline. Database Tuning. Ideal Transaction. Concurrency Tuning Goals. Concurrency Tuning. Nikolaus Augsten. Lock Tuning. Unit 8 WS 2013/2014 Outline Database Tuning Nikolaus Augsten University of Salzburg Department of Computer Science Database Group 1 Unit 8 WS 2013/2014 Adapted from Database Tuning by Dennis Shasha and Philippe Bonnet. Nikolaus

More information

Understanding Task Scheduling Algorithms. Kenjiro Taura

Understanding Task Scheduling Algorithms. Kenjiro Taura Understanding Task Scheduling Algorithms Kenjiro Taura 1 / 48 Contents 1 Introduction 2 Work stealing scheduler 3 Analyzing execution time of work stealing 4 Analyzing cache misses of work stealing 5 Summary

More information

Atomic Transac1ons. Atomic Transactions. Q1: What if network fails before deposit? Q2: What if sequence is interrupted by another sequence?

Atomic Transac1ons. Atomic Transactions. Q1: What if network fails before deposit? Q2: What if sequence is interrupted by another sequence? CPSC-4/6: Operang Systems Atomic Transactions The Transaction Model / Primitives Serializability Implementation Serialization Graphs 2-Phase Locking Optimistic Concurrency Control Transactional Memory

More information

6.852: Distributed Algorithms Fall, Class 20

6.852: Distributed Algorithms Fall, Class 20 6.852: Distributed Algorithms Fall, 2009 Class 20 Today s plan z z z Transactional Memory Reading: Herlihy-Shavit, Chapter 18 Guerraoui, Kapalka, Chapters 1-4 Next: z z z Asynchronous networks vs asynchronous

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

Algorithm 23 works. Instead of a spanning tree, one can use routing.

Algorithm 23 works. Instead of a spanning tree, one can use routing. Chapter 5 Shared Objects 5.1 Introduction Assume that there is a common resource (e.g. a common variable or data structure), which different nodes in a network need to access from time to time. If the

More information

Lecture 13: AVL Trees and Binary Heaps

Lecture 13: AVL Trees and Binary Heaps Data Structures Brett Bernstein Lecture 13: AVL Trees and Binary Heaps Review Exercises 1. ( ) Interview question: Given an array show how to shue it randomly so that any possible reordering is equally

More information

Implementation of Process Networks in Java

Implementation of Process Networks in Java Implementation of Process Networks in Java Richard S, Stevens 1, Marlene Wan, Peggy Laramie, Thomas M. Parks, Edward A. Lee DRAFT: 10 July 1997 Abstract A process network, as described by G. Kahn, is a

More information

Revisiting the Combining Synchronization Technique

Revisiting the Combining Synchronization Technique Revisiting the Combining Synchronization Technique Panagiota Fatourou Department of Computer Science University of Crete & FORTH ICS faturu@csd.uoc.gr Nikolaos D. Kallimanis Department of Computer Science

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

The Prioritized and Distributed Synchronization in Distributed Groups

The Prioritized and Distributed Synchronization in Distributed Groups The Prioritized and Distributed Synchronization in Distributed Groups Michel Trehel and hmed Housni Université de Franche Comté, 16, route de Gray 25000 Besançon France, {trehel, housni} @lifc.univ-fcomte.fr

More information

[ DATA STRUCTURES ] Fig. (1) : A Tree

[ DATA STRUCTURES ] Fig. (1) : A Tree [ DATA STRUCTURES ] Chapter - 07 : Trees A Tree is a non-linear data structure in which items are arranged in a sorted sequence. It is used to represent hierarchical relationship existing amongst several

More information

Introduction to OS Synchronization MOS 2.3

Introduction to OS Synchronization MOS 2.3 Introduction to OS Synchronization MOS 2.3 Mahmoud El-Gayyar elgayyar@ci.suez.edu.eg Mahmoud El-Gayyar / Introduction to OS 1 Challenge How can we help processes synchronize with each other? E.g., how

More information

How invariants help writing loops Author: Sander Kooijmans Document version: 1.0

How invariants help writing loops Author: Sander Kooijmans Document version: 1.0 How invariants help writing loops Author: Sander Kooijmans Document version: 1.0 Why this document? Did you ever feel frustrated because of a nasty bug in your code? Did you spend hours looking at the

More information

Efficient pebbling for list traversal synopses

Efficient pebbling for list traversal synopses Efficient pebbling for list traversal synopses Yossi Matias Ely Porat Tel Aviv University Bar-Ilan University & Tel Aviv University Abstract 1 Introduction 1.1 Applications Consider a program P running

More information

Stretch-Optimal Scheduling for On-Demand Data Broadcasts

Stretch-Optimal Scheduling for On-Demand Data Broadcasts Stretch-Optimal Scheduling for On-Demand Data roadcasts Yiqiong Wu and Guohong Cao Department of Computer Science & Engineering The Pennsylvania State University, University Park, PA 6 E-mail: fywu,gcaog@cse.psu.edu

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

Programming Paradigms for Concurrency Lecture 6 Synchronization of Concurrent Objects

Programming Paradigms for Concurrency Lecture 6 Synchronization of Concurrent Objects Programming Paradigms for Concurrency Lecture 6 Synchronization of Concurrent Objects Based on companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Thomas

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL 251 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

Notes on Binary Dumbbell Trees

Notes on Binary Dumbbell Trees Notes on Binary Dumbbell Trees Michiel Smid March 23, 2012 Abstract Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes

More information

Coarse-grained and fine-grained locking Niklas Fors

Coarse-grained and fine-grained locking Niklas Fors Coarse-grained and fine-grained locking Niklas Fors 2013-12-05 Slides borrowed from: http://cs.brown.edu/courses/cs176course_information.shtml Art of Multiprocessor Programming 1 Topics discussed Coarse-grained

More information

CSE 374 Programming Concepts & Tools

CSE 374 Programming Concepts & Tools CSE 374 Programming Concepts & Tools Hal Perkins Fall 2017 Lecture 22 Shared-Memory Concurrency 1 Administrivia HW7 due Thursday night, 11 pm (+ late days if you still have any & want to use them) Course

More information