Teleportation as a Strategy for Improving Concurrent Skiplist Performance. Frances Steen


Teleportation as a Strategy for Improving Concurrent Skiplist Performance

by Frances Steen

Submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science at BROWN UNIVERSITY, May 2015

Author: Frances Steen, Department of Computer Science, April 15, 2015

Advised by: Maurice Herlihy, Professor of Computer Science, Thesis Supervisor

Read by: Thomas Doeppner, Vice Chairman & Director of Undergraduate Studies, Department of Computer Science

Teleportation as a Strategy for Improving Concurrent Skiplist Performance

by Frances Steen

Submitted to the Department of Computer Science on April 15, 2015, in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science

Abstract

We explore how the use of teleportation can contribute to better skiplist performance. Teleportation is a technique that leverages transactional memory to minimize the costs incurred through intermediate node traversals in concurrent linked-node data-structures such as linked-lists and skiplists. Intel's development of the Haswell microarchitecture, which has hardware support for transactional memory, makes teleportation a more viable strategy for widespread use. We apply the strategy of teleportation to lock-coupled skiplists and hazard-protected lazy skiplists, and measure the changes in performance that result. We find that teleportation becomes costly for the lock-coupled skiplist, as it is used in high-contention environments where lock acquisition is required within transactions, but contributes to performance gains for the hazard-protected lazy skiplist, where lock acquisition does not occur within transactions. Our results provide new insights into the types of data-structures that are good candidates for performance improvements through the use of teleportation.

Thesis Supervisor: Maurice Herlihy
Title: Professor of Computer Science

Acknowledgments

Thanks to my thesis advisor, Professor Herlihy, whose insights and answers to my questions were essential in moving my thesis along. Thanks, also, to my parents Kyra and Paul, and to my friends, Sonia and Jared, for proofreading my thesis and for letting me bounce ideas off of them throughout the year.

Contents

1 Introduction
  1.1 Transactional Memory
  1.2 The Skiplist
  1.3 The Lock-Coupled Skiplist
  1.4 The Lazy Skiplist
2 Teleporting Lock-Coupled Skiplists
  2.1 Design of Skiplist and Nodes
  2.2 Algorithm
    2.2.1 Find Method and Teleportation
    2.2.2 Variations
  2.3 Benchmarking
  2.4 Results
  2.5 Discussion
3 Teleporting Lazy Skiplists
  3.1 Design of Skiplist and Nodes
  3.2 Algorithm
    3.2.1 Find Method and Teleportation
    3.2.2 Add and Remove Methods
    3.2.3 Variations
  3.3 Benchmarking
  3.4 Results
  3.5 Discussion
4 Conclusion

1 Introduction

The focus of this thesis is on improving the performance of the concurrent skiplist. As transactional memory becomes increasingly widely available, we can begin to design general-purpose concurrent data-structures and algorithms that leverage this relatively new concurrency control mechanism. This thesis is motivated by previous work on concurrent linked-lists [2], in which teleportation is used to produce speedups. In this chapter, we give background on transactional memory, the skiplist data structure, and two specific types of concurrent skiplist: the lock-coupled and the lazy skiplist. When we discuss the lock-coupled and the lazy skiplist, we also give an overview of what teleportation is and how we expect it to speed up these two skiplist types.

1.1 Transactional Memory

Much like locks and synchronization primitives, transactional memory is used as a tool for enabling thread-safe concurrent programming. Transactions are the means through which transactional memory concurrency control is implemented. Whereas locks enable thread-safe concurrent programming pessimistically, by preventing concurrent access to shared memory locations under the assumption that unprotected shared memory locations will be accessed in an unsafe manner, transactional memory operates under an optimistic assumption: assume shared memory is being accessed in a safe manner until something bad happens, and then deal with the consequences. More precisely, transactional memory allows a block of multiple instructions inside of a transaction to be executed atomically. This is done by executing the code within the transaction block speculatively. If a conflict occurs, then the changes that were made during the transaction are rolled back. Otherwise, in the case of no conflict, the transaction is committed and the writes to shared memory locations within the transaction appear to take place atomically. In order to track whether a conflict occurs, the locations that are read within

5 the transaction block are added to the transaction s read-set and the locations that are modified within the transaction block are added to the transaction s write-set. If a different thread modifies a location tracked within the read-set or the write-set while the transaction is ongoing, then a conflict occurs and the transaction aborts. An abort also occurs if a different thread reads a location that is tracked within the write-set. In addition to conflict aborts, a transaction may abort for several other reasons. For our purposes, the other relevant abort types are: capacity aborts, which occur when the number of memory locations accessed within the transaction exceeds the transaction s capacity; unknown aborts; and user-defined aborts, which are triggered by the programmer within the code. When a transaction aborts, it writes the relevant abort code to its status flag. After an abort occurs, the transaction may be retried or execution may move on to an alternate piece of non-transactional fallback code. Typically, code is written such that transactions are retried several times but eventually fall back to alternate, non-transactional code following a certain number of aborts. It is necessary that every transaction has a equivalent non-transactional fallback because transactions are not guaranteed to ever commit. Our code is written using Intel s Restricted Transactional Memory (RTM) interface, available on its Haswell microarchitecture. The interface exposes four methods: xbegin, which begins a transaction; xend, which ends a transaction; xabort, which the programmer can use to abort a transaction (the programmer also supplies it with an abort code to be written to the status flag); and xtest, which reports whether or not it is being called from within a transaction. 1.2 The Skiplist The skiplist is linked-node data-structure that is made up of several linked-lists. It shares the same functionality as the linked-list. 
It stores a set of unique elements in increasing order and has add, remove, and contains methods. The first two methods mutate the set and the third checks for set membership. All three methods have expected logarithmic run times, a substantial improvement in performance over the linear run time methods of the linked-list. Additionally, unlike other data-structures with logarithmic time search methods (such as trees), the skiplist does not require rebalancing, which makes it easily adaptable to concurrent environments. The logarithmic performance of the skiplist is a direct result of its structure. It is made up of several levels of linked-list, with the bottom-level linked-list containing all of the elements in the set. As the level of the skiplist increases, the linked-lists

grow sparser, because each linked-list skips over some of the nodes contained in the linked-list at the level below it. Thus, a linked-list at a given level is a subset of all of the lower-level linked-lists. We determine which nodes are contained in the linked-list at a given level based on node height. Upon node creation, a node's height is chosen from an exponential distribution, where the probability that the node will appear at level i is p^i. Thus p, 0 < p < 1, is the expected fraction of nodes from a given level that will also appear at the next higher level. (The probability of a node appearing at level 0, the bottom level, is 1.) When p is 1/2, for example, each linked-list in the skiplist contains half as many nodes as the level below it. This gives the skiplist an expected logarithmic depth. Additionally, the skiplist has head and tail sentinel nodes, which have the maximum height and which bookend the other nodes in the skiplist. For our purposes, we use an underlying find method to search the skiplist in the add, remove, and contains methods. The find method locates the nodes that come just before the location of the value for which we are searching in the skiplist. We call these nodes the predecessor nodes, and we call the node whose value matches that for which we are searching the victim node. To find the victim node, traversal begins from the head node and the top level of the skiplist. This level of the skiplist has the fewest nodes linked into it. We traverse across this level until we locate the victim's predecessor node at this level of the skiplist, whereupon we move down a level in the skiplist and continue our traversal. We repeat this process until we reach the victim's predecessor node at the lowest level in the skiplist. We expect to traverse a constant number of nodes at each level of the skiplist, and we expect our skiplist to have depth that is logarithmic in the length of the skiplist.
Thus, our find method has an expected logarithmic run time. The add and remove methods rely heavily on the find method. In both cases, we use the find method to locate the nodes that point to the spot where we will either add a node, by linking it in, or remove a node, by redirecting the next pointers of the predecessors that we found to point to the node after the one we are removing. More information about the skiplist can be found in [1, ch. 14].

1.3 The Lock-Coupled Skiplist

The lock-coupled skiplist is the first of two versions of thread-safe skiplist that we look at. In the most basic lock-coupled skiplist, each node in the skiplist has its own lock. To ensure correctness and prevent deadlock during concurrent add, remove, and

contains calls, nodes are traversed in a hand-over-hand manner. That is, when nodes are traversed across a single level, the lock for a successor node is obtained prior to the release of the lock for the current node. This ensures that, for a traversal of the skiplist by a given thread, at least one node remains locked by the thread at all times. While hand-over-hand locking is necessary to ensure correctness during concurrent traversals and mutations of the skiplist, it contributes significant overhead to skiplist traversals. Each thread must obtain a lock for every node that it traverses, and threads traversing the skiplist encounter a significant sequential bottleneck because it is impossible for one thread to pass another thread during traversal. Thus, a thread searching for a high-valued node must inevitably wait behind a thread performing an add or remove on a low-valued node before its search can continue. Similarly, if the scheduler pauses a thread that happens to hold the lock to the first node in the skiplist, all other threads are prevented from traversing the skiplist. The key observation for this research is that it seems possible to alleviate this problem using transactions. When node traversal occurs within a transaction, it becomes unnecessary to obtain a lock for every node that is traversed. Instead, it suffices to teleport the locks across several nodes: we traverse the nodes as usual, but without obtaining any locks. At the end of the transaction, the node that was initially locked is unlocked and the final node to be traversed is locked. (Transactions let us eliminate intermediate lock acquisitions because everything that occurs within the transaction appears to take place atomically, so it is not possible for other threads to interfere with our node traversal.
In actuality, as we discussed in the Transactional Memory section, if there is an interference with a memory location that is tracked in the read-set or write-set of the transaction, then the transaction will abort.) Lastly, prior to committing, we lock all predecessors to the victim node that were encountered during the teleport. Using the lock teleportation strategy, we reduce the number of lock acquisitions required in a skiplist traversal and we enable threads to pass one another during traversal. (Passing is possible because a thread searching the skiplist within a transaction can traverse locked intermediate nodes.) However, as we will see later, further changes are required in order to make lock teleportation a viable strategy for improving the performance of lock-coupled skiplists.
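As an illustration of the hand-over-hand discipline that teleportation amortizes, the following sketch performs a lock-coupled search on a single-level list. The node layout and names here are illustrative stand-ins, not the thesis's actual skiplist code.

```cpp
#include <mutex>

// Illustrative node: one lock per node, as at the bottom level of the
// lock-coupled skiplist. Names are ours, not the thesis code's.
struct HOHNode {
    int value;
    HOHNode* next;
    std::mutex lock;
};

// Hand-over-hand search: acquire the successor's lock before releasing the
// current node's lock, so at least one node stays locked by this thread at
// all times. Returns the last node with value < v (the predecessor).
HOHNode* find_pred(HOHNode* head, int v) {
    HOHNode* pred = head;
    pred->lock.lock();
    HOHNode* curr = pred->next;
    while (curr != nullptr && curr->value < v) {
        curr->lock.lock();    // lock the successor first...
        pred->lock.unlock();  // ...then release the current node
        pred = curr;
        curr = curr->next;
    }
    pred->lock.unlock();
    return pred;
}
```

Every node traversed costs one lock acquisition and release; lock teleportation replaces a run of these acquisitions with a single transactional hop that locks only the final node.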

1.4 The Lazy Skiplist

The lazy skiplist is the second type of skiplist that we explore. It is largely the same as the lock-coupled skiplist in structure (each node has its own lock), but differs from the lock-coupled skiplist in locking algorithm. Instead of using hand-over-hand locking, in the lazy skiplist's find method we traverse nodes without obtaining any locks. When we need to add a node, we lock all of its predecessors and then validate that they have not changed before linking the node into the skiplist. (This is necessary because other threads are mutating the skiplist concurrently and may have added or deleted nodes such that the set of predecessors is affected.) Similarly, when we need to remove a node, we ensure that it is still logically a part of the skiplist before locking it and then, as with the add method, validating that its predecessors have not changed. We then logically remove the node from the skiplist, by setting a flag that indicates that it has been deleted, before physically unlinking it. The lazy skiplist as described above is already rather efficient. However, we are also concerned with memory-management. In the lock-coupled skiplist, memory-management is not an issue, as deleted nodes can be immediately recycled back into the skiplist. (We note that if the skiplist's contains method is lock-free, then slightly more work is required.) For this reason, we do not implement node-recycling for the lock-coupled skiplist, and maintain our focus on teleportation instead. However, node-recycling is not as trivial for the lazy skiplist. In the lazy skiplist, it is possible for a thread to retain a reference to a node that has already been removed. Thus, care must be taken to ensure that such a node is not recycled back into the skiplist until all threads have discarded their references to it.
To do this, we use hazard pointers, which are node references that each thread publishes to a common location, indicating that the node is not safe to recycle. Before reusing a node, a thread must check the hazard pointer array to ensure that the node is not hazard protected. Additionally, in order to ensure that this information is immediately visible to all threads, we must follow each hazard pointer publication with a memory barrier. A thread overwrites its hazard pointers when a new node needs to be hazard protected. The use of hazard pointers allows us to recycle nodes in the lazy skiplist. However, it creates significant overhead for the lazy skiplist because hazard protection must be performed for each node prior to its being read. This is especially expensive because of the necessity of a memory barrier after each hazard pointer publication. We can alleviate this issue using hazard pointer teleportation in much the same way that lock teleportation is employed in the lock-coupled skiplist. That is, we can avoid publishing

a hazard pointer for every node by traversing nodes within a transaction. Just before committing the transaction, hazard pointers are published for the final two nodes that are traversed. The strategy of using hazard pointer teleportation significantly reduces the number of hazard pointers that need to be published, thereby reducing the number of necessary memory barriers and improving performance. We will see later that the lock-free nature of hazard pointer teleportation makes it a successful means of speeding up the lazy skiplist.
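The publish-then-scan protocol described above can be sketched as follows. The table layout, slot count, and function names are our own simplification for illustration, not the thesis's implementation.

```cpp
#include <atomic>

// Illustrative hazard-pointer table: one published slot per thread.
constexpr int kMaxThreads = 12;
std::atomic<void*> hazard_slot[kMaxThreads];

// Publish a node as hazardous before reading it. The store is followed by a
// full fence so other threads observe the publication before we dereference
// the node; this fence is the per-node cost discussed above.
void protect(int tid, void* node) {
    hazard_slot[tid].store(node, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// A removed node may be recycled only if no thread currently publishes it.
bool safe_to_recycle(void* node) {
    for (int t = 0; t < kMaxThreads; t++) {
        if (hazard_slot[t].load(std::memory_order_acquire) == node)
            return false;
    }
    return true;
}
```

The fence in protect() is exactly the overhead that hazard pointer teleportation amortizes: inside a transaction no hazard pointers are published at all, and only the final two traversed nodes are protected before the commit.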

2 Teleporting Lock-Coupled Skiplists

In this chapter, we present an implementation, written in C++, of the lock-coupled skiplist that we adapted to use lock teleportation. First, we cover the design of the skiplist and its nodes, followed by an overview of our teleportation algorithm. We then present several variations on the basic teleporting lock-coupled skiplist. We describe the benchmarks that we used to measure the performances of the various lock-coupled skiplists, followed by the results that we obtained and a discussion of these results. In our discussion of the results, we provide an explanation for our finding that the teleporting lock-coupled skiplist remains slower than its non-teleporting counterpart. Finally, we suggest changes that might produce a better teleporting lock-coupled skiplist and that motivate our design of the teleporting lazy skiplist.

2.1 Design of Skiplist and Nodes

Here, we present an overview of the design of the lock-coupled skiplist and its nodes. In our skiplist implementation, we limit ourselves to dealing with integer values and we maintain a set of nodes whose values are unique. In addition, our skiplist maintains the set in increasing order at all times. The Skiplist structure has a reference to a head Node and a tail Node. These sentinel Nodes are given the placeholder values of MIN_INT and MAX_INT, respectively; they serve to simplify the code. In the Skiplist structure, we also specify a MAX_HEIGHT for the Node structures. The MAX_HEIGHT determines how many levels the Skiplist can have. Each Node structure consists of several members. An Int key stores the value of the Node. Each Node has an Int height, which is determined at the time of Node creation by sampling from an exponential distribution across the possible heights. A Lock serves to protect the Node from concurrent accesses. Lastly, each Node has an array of pointers to next Nodes, one pointer for every level up until the Node's height.
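A plausible rendering of this Node layout, together with the geometric height sampling described in the previous chapter (shown here with p = 1/2), is sketched below. The member and function names are our own assumptions rather than the thesis's exact declarations.

```cpp
#include <mutex>
#include <random>

constexpr int MAX_HEIGHT = 32;  // cap on the number of skiplist levels

// Illustrative layout of the per-node state described above.
struct SkipNode {
    int key;                     // the node's value
    int height;                  // number of levels this node occupies
    std::mutex lock;             // one lock per node (lock-coupled version)
    SkipNode* next[MAX_HEIGHT];  // next[i] is valid for 0 <= i < height
};

// Sample a height geometrically: with p = 1/2, a node reaches level i with
// probability (1/2)^i, capped at MAX_HEIGHT.
int random_height(std::mt19937& rng) {
    int h = 1;
    while (h < MAX_HEIGHT && (rng() & 1)) h++;
    return h;
}
```

A node of height h is linked into levels 0 through h-1, which matches the constraint, noted below, that a next pointer at level i must target a node of height at least i+1.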
Note that a Node's next pointer at a given level must point to a Node whose height is at least 1 greater than that level. (This is because levels are zero-indexed and we

cannot have Nodes that are linked in to the skiplist at levels above their heights.)

2.2 Algorithm

We now turn our attention to the algorithm that we use in the Find method, giving specific attention to our procedure for teleportation. For background information on the lock-coupled skiplist, we refer the reader to [1, ch. 14].

2.2.1 Find Method and Teleportation

The skiplist has public Add, Remove, and Contains methods. These are backed by an underlying Find method (see Figure 2-1). The Find method takes in: an Int value, v; an array of node pointers, preds; and a ThreadLocal structure, threadlocal. It does not return anything. (The ThreadLocal structure simply stores information about the thread that is currently executing the code; for our purposes it is only important to note that this structure contains the thread ID.) In the Find method, we search the skiplist for the victim node (the node whose value is equal to v), and populate the preds array with references to the (locked) predecessors of the victim. Note that even if the victim node is absent from the skiplist, we still can (and do) find all of its predecessors. (The predecessors to an absent victim are simply those nodes that would point to the victim if it were present.) This process encompasses the bulk of the work of mutating the skiplist, leaving the Add and Remove methods mainly only the work of manipulating pointers to link or unlink a node within the skiplist. In broad terms, we describe the Find method as follows. To begin, we initialize our starting level to the top level of the skiplist. (Recall the skiplist Find algorithm traversal pattern described in the previous chapter.) We preliminarily set the top-level predecessor to point to the head of the skiplist and we lock this node. Inside of a loop, we repeatedly attempt to advance down the skiplist using lock teleportation.
If a certain number of attempted transactions abort, we then revert to hand-over-hand locking, which is the fallback traversal strategy. In the most basic implementation, after we advance one node down the skiplist using hand-over-hand locking, we attempt to use lock teleportation again. (However, it is possible to alter the traversal strategy so that, within a Find operation, lock teleportation is never revisited after we revert to hand-over-hand locking. We find that this strategy tends to produce better performance.)
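The loop structure just described (retry teleportation, shrink the limit on aborts, and revert to a hand-over-hand step when transactions stop paying off) can be sketched in isolation. Since real RTM transactions require TSX hardware, the attempt function below is a caller-supplied stand-in for one transactional teleport; the names and exact thresholds are illustrative, not the thesis's code.

```cpp
#include <functional>

enum class TxResult { Aborted, Committed };

// Control-flow sketch of the Find loop: attempt(finished) models one
// transactional teleport and reports commit/abort, setting finished when the
// whole traversal is done; fallback() models one hand-over-hand step and
// returns whether traversal finished.
void traverse(const std::function<TxResult(bool&)>& attempt,
              const std::function<bool()>& fallback,
              int& teleport_limit) {
    bool done = false;
    while (!done) {
        TxResult r;
        do {
            bool finished = false;
            r = attempt(finished);
            done = (r == TxResult::Committed) && finished;
            if (r == TxResult::Aborted) teleport_limit /= 2;  // shrink on abort
        } while (!done && (r == TxResult::Committed || teleport_limit > 2));
        if (done) break;
        done = fallback();   // one locked step, then try teleporting again
        teleport_limit = 2;  // keep subsequent transactions short
    }
}
```

A committed teleport that has not yet reached the victim simply loops to teleport again; only repeated aborts drive the limit below the threshold and trigger the fallback step.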

When traversing the skiplist, we use a thread's teleportlimit to determine when to revert to hand-over-hand locking and how much of the skiplist to traverse within one transaction using lock teleportation. The teleportlimit is a variable that sets an upper bound for the number of next pointers that the thread may follow before it must attempt to commit a transaction. It is possible that the thread will traverse fewer than teleportlimit nodes; this will be the case when the victim node is found within teleportlimit of the starting node. It is important to enforce a traversal limit (rather than allowing the whole Find traversal to occur within one transaction) because the likelihood of a successful transaction decreases as the number of nodes traversed within the transaction increases. (The decrease in likelihood is due to the fact that longer teleports are more likely to have read-set or write-set conflicts with other threads and are also more likely to suffer from capacity aborts due to having accessed too many memory locations.) For each thread executing operations on our skiplist, teleportlimit is initialized to some starting value (we use 16 in our code). After every transaction, it is updated using an additive-increase multiplicative-decrease (AIMD) strategy. More specifically, if a transaction utilizes the entire allotted teleportlimit and commits, then we increment the teleportlimit. If the transaction aborts, then we halve the teleportlimit. In the case where the transaction commits before reaching the limit, teleportlimit remains the same. This strategy enables us to dynamically search for a good traversal limit. The Fallback component of the Find method (see Figure 2-2) retains the traversal strategy of the non-transactional lock-coupled skiplist, except that the Fallback method encompasses only one individual next pointer traversal. We read the next node in the skiplist in line 1, and, in line 2, determine whether or not we have arrived at our victim.
If we have not, then it is safe to traverse further. We lock the node that we just read and, in lines 4-6, only unlock the previous node if it is not the predecessor to the victim at the level above. Finally, in lines 7-8, we update our predecessor array and return false, indicating that our traversal is not complete. If, in line 2, we instead determine that we have reached our victim node, then (in lines 9-14) we move down a level of the skiplist, if possible, and indicate whether or not we have completed our traversal. The lock teleportation code is somewhat more complex. In Teleport (Figure 2-3), we begin by resetting the teleportlimit, if necessary, and initializing our local variables. In lines 7-15, we traverse the skiplist. This code is very similar to that used for traversal in the non-transactional lock-coupled skiplist, except for the fact that

void find(int v, Node* preds[], ThreadLocal* threadlocal) {
 1   bool done = false, success = false;
 2   int start_layer = MaxHeight-1;
 3   preds[start_layer] = &LSentinel;
 4   preds[start_layer]->lock();
 5   while (!done) {
 6     do {
 7       if ((success = (xbegin(threadlocal)) == _XBEGIN_STARTED)) {
 8         done = teleport(v, preds, threadlocal, &start_layer);
 9       }
10     } while (!done && (success || (threadlocal->teleportlimit /= 2) > 2));
11     if (done) { break; }
12     done = fallback(v, preds, &start_layer);
13     threadlocal->teleportlimit = 2;
14   }
15 }

Figure 2-1: The lock-coupled Find method, used in Add, Remove, and Contains.

bool fallback(int v, Node* preds[], int* start_layer) {
 1   Node* curr = preds[*start_layer]->next[*start_layer];
 2   if (curr->value < v) {
 3     curr->lock();
 4     if (*start_layer == MaxHeight-1 || preds[*start_layer+1] != preds[*start_layer]) {
 5       preds[*start_layer]->unlock();
 6     }
 7     preds[*start_layer] = curr;
 8     return false;
 9   } else if (*start_layer > 0) {
10     *start_layer = *start_layer-1;
11     preds[*start_layer] = preds[*start_layer+1];
12     return false;
13   } else { return true; }
14 }

Figure 2-2: The lock-coupled Fallback method, used in Find.

bool teleport(int v, Node* preds[], ThreadLocal* threadlocal, int* start_layer) {
 1   if (threadlocal->teleportlimit < 0) {
 2     threadlocal->teleportlimit = DEFAULT_TELEPORT_DISTANCE;
 3   }
 4   int dist = threadlocal->teleportlimit;
 5   int curr_layer;
 6   Node *start, *pred; start = pred = preds[*start_layer];
 7   for (int layer = *start_layer; layer >= 0 && dist >= 0; layer--) {
 8     curr_layer = layer;
 9     Node* curr = pred->next[layer];
10     preds[layer] = pred;
11     while (curr->value < v && --dist >= 0) {
12       preds[layer] = pred = curr;
13       curr = curr->next[layer];
14     }
15   }
16   bool done = curr_layer == 0 && preds[0]->next[0]->value >= v;
17   if (*start_layer == MaxHeight-1 || preds[*start_layer+1] != start) {
18     start->unlock();
19   }
20   Node* justlocked = NULL;
21   for (int layer = *start_layer; layer >= curr_layer; layer--) {
22     if (preds[layer] != justlocked) {
23       preds[layer]->lock();
24       justlocked = preds[layer];
25     }
26   }
27   _xend();
28   if (dist <= 0) { threadlocal->teleportlimit++; }
29   *start_layer = curr_layer;
30   return done;
31 }

Figure 2-3: The lock-coupled Teleport method, used in Find.

no locks are obtained and that we have an additional stopping condition, dist, which we use to enforce the teleportlimit. In line 16, we check to see whether our traversal ended because we reached the teleportlimit or because we found our victim node (in which case done is true). Just as in the Fallback method, we unlock the initial node that was locked, given that it is not the predecessor to the victim at the level above (lines 17-19). Then, in lines 20-26, we go back over all of the predecessors that we found during our traversal and lock them, making sure we never lock the same node twice. (The same node may be a predecessor to the victim at more than one level.) If we attempt to lock a node that was locked prior to the transaction beginning or that becomes locked by another thread while the transaction is ongoing, then an abort occurs. After locking is complete, we attempt to commit the transaction, and, upon success, update our starting level and return whether or not our traversal is complete.

2.2.2 Variations

In this section, we discuss one of the main performance issues that arises for the teleporting lock-coupled skiplist that we introduced in the previous section: contention on top-level nodes. This issue serves to motivate our introduction, later in the section, of several other variations on the teleporting lock-coupled skiplist. However, when we look at the graphs, in the Results section, of the original teleporting lock-coupled skiplist and the variations that we introduce, we determine that these strategies do not significantly relieve the issue. Finally, in the Discussion section of this chapter, we analyze why this is the case. We turn our attention, first, to our method of locking predecessors in the lock-coupled skiplist. Recall that in our Find method, we go through the process of locating and locking all of the predecessors to the victim node, starting with the top-level predecessor and continuing until we reach the lowest-level predecessor.
Note that because we require each thread to lock the top-level predecessor of its victim node in order to perform a mutation on the skiplist, it is not possible for two or more threads to concurrently add or remove nodes that have the same top-level predecessor. Thus, in the teleporting skiplist described above, the degree of concurrency during Add and Remove operations is limited by the number of top-level nodes in the skiplist. Furthermore, because we draw node heights from an exponential distribution, we expect to have a very low ratio of top-level nodes to total number of nodes in the skiplist, especially when the skiplist height is high. The observation that there is likely to be a great deal of contention on the top-level

predecessors within the skiplist motivates our first variation, which we call the exact (lock-coupled) skiplist. The design of the exact skiplist stems from the observation that, in adding or removing a node within a skiplist, it is sufficient to lock only those nodes that are predecessors at or below the top level of the node in question. The reason that we need to lock predecessor nodes in the first place is in order to prevent changes to these nodes by other threads while the victim node is being linked into or unlinked from the skiplist. However, because we only link or unlink a node up to its top level, there is no need to lock any predecessors above the top level of the node. We alter the code for the skiplist presented in the previous section to reflect this new strategy. The Add and Remove methods both speculate the heights of their victim nodes. (For the Add method, this is the height of the node that it is trying to link in. The Remove method speculates the lowest height possible, which is 1.) Both methods then pass this information along to the Find method. During the Find traversal, we refrain from locking the predecessor nodes until we either reach the speculated node height or encounter the victim node, whichever occurs first. This strategy applies to both teleporting and non-teleporting exact skiplists and greatly relieves contention on top-level nodes, as most skiplist mutations occur on low-height nodes and thus do not require the locking of a high-height predecessor. We take the exact skiplist strategy one step further by introducing level-locking skiplists. In this variation, every node in the skiplist has one lock for each of its levels, rather than simply having one lock overall. We thereby further reduce contention on high-height nodes by enabling each of their levels to be locked individually.
Thus, for example, if a top-level node is the predecessor to a node of height 1, it can be locked at its lowest level and left unlocked at all of its higher levels. This allows it to be locked by another thread as the predecessor, at a higher level, for a different node. In non-teleporting contexts, this strategy does not require many changes to the skiplist code that we describe in the previous section. However, the logic for using level-locking in conjunction with teleportation is complex and requires a slightly more conservative locking strategy than the one that we outlined at the beginning of this chapter. These trade-offs yield a teleporting level-locking skiplist that often performs worse than both the teleporting exact skiplist and our original teleporting lock-coupled skiplist. We compare the performance of these two teleporting lock-coupled skiplist variations and the original teleporting lock-coupled skiplist in the Results section.
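The level-locking idea can be sketched as follows; the structure and helper names are illustrative assumptions, not the thesis's implementation.

```cpp
#include <mutex>

constexpr int MAX_HEIGHT = 32;

// Level-locking variation: one lock per level of a node, so a tall node can
// serve as a locked predecessor at level 0 while remaining lockable at its
// higher levels by other threads.
struct LevelNode {
    int key;
    int height;
    std::mutex level_lock[MAX_HEIGHT];  // level_lock[i] guards next[i]
    LevelNode* next[MAX_HEIGHT];
};

// Lock only the levels at or below the victim's (speculated) height,
// combining the "exact" strategy with per-level locks.
void lock_levels(LevelNode* n, int upto_height) {
    for (int i = 0; i < upto_height; i++) n->level_lock[i].lock();
}

void unlock_levels(LevelNode* n, int upto_height) {
    for (int i = 0; i < upto_height; i++) n->level_lock[i].unlock();
}
```

A thread adding a height-1 node locks only level_lock[0] of each predecessor, leaving the higher levels of a tall predecessor free for concurrent mutations by other threads.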

2.3 Benchmarking

We now discuss the methods by which we measure the performance of our teleporting lock-coupled skiplists. Our benchmarking is done on Brown's Haswell machine, Spoonbill, which has an Intel Core i CPU. The machine has 4 hyper-threaded cores, which enables it to run 2 threads per core, giving it 8 virtual processors. This fact is relevant because some of our skiplists exhibit a dramatic change in performance trend as we go from 8 to 9 threads. We benchmark each version of the lock-coupled skiplist by populating it with an initial set of elements and then executing a predetermined number of Add, Remove, and Contains operations on it. We are able to vary several parameters in our benchmarking suite: the skiplist initial size, the benchmark test length, the range from which we sample our node values, the skiplist height, the distribution of operations, and the number of threads performing operations concurrently. Even though many of these factors can impact skiplist performance, we keep most of them fixed in our Results section graphs because they do not significantly impact the relative performance between skiplist versions. For all of our benchmarking: we initially populate our skiplists with 5,000 nodes prior to beginning the test; our tests remain fixed at a length of 100,000 operations performed on the skiplist; we draw our node values from between 1 and 10,000; and our skiplist height is 32 throughout. We also only present performance graphs for a fixed distribution of operations. Operation distribution refers to the percent of operations that are read-only Contains operations (the rest of the operations consist of an equal balance of Add and Remove operations; e.g., a 50% read probability describes a test with 50% Contains operations, 25% Add operations, and 25% Remove operations).
This is an interesting variable to look at, as teleporting skiplists that do produce performance gains tend to most outperform their non-teleporting counterparts when the read probability is lower. However, in our case, we do not gain any critical information about the relative qualities of the different skiplists from varying the read probability, so we fix it at 50% for our Results section. During our testing, we use between 1 and 12 threads. When more than 1 thread is enlisted to perform operations on the skiplist, we assign each thread an equal number of operations such that our total number of operations is met. (We make the simplifying assumption that each thread will perform the same number of operations in roughly the same amount of time.) We measure the time that each benchmark test takes to run using the C function gethrtime, which returns the high-resolution time in nanoseconds (we convert this to milliseconds for our graphs). Our results for each benchmark test consist of the averages obtained from 5 runs of that test. Each thread used in the test records its start time and finish time, and we take an average of the elapsed times for each thread's execution to get our test run time. We obtain the average commit rate of the threads during a benchmark test by incrementing each thread's transaction-started counter when a transaction is attempted and incrementing its transaction-committed counter each time a transaction succeeds. Similarly, we have each thread record how far it teleports during each successful transaction, from which we obtain the average teleport length for a benchmark test. Here, we present the results of benchmarking the teleporting lock-coupled skiplist. We compare our original teleporting lock-coupled skiplist ("original"), which we introduced in detail at the beginning of this chapter, to our two variations: the teleporting exact skiplist ("exact") and the teleporting level-locking skiplist ("level"). We benchmark these teleporting skiplists against a non-teleporting exact skiplist ("exact no-teleport"), which we had hoped to outperform. Note that graphs pertaining to teleportation-specific measures do not include the non-teleporting skiplist.

2.4 Results

First, we look at performance, which is the measure that we are most interested in. In Figure 2-4, we graph the run time of each of our skiplists as a function of the number of threads. The exact no-teleport skiplist maintains a steady run time between 4 and 10 threads, with decreased performance outside of this range, which is as expected. Our original and exact skiplists share this run time trend but do not perform as well. It is reassuring to note that the exact skiplist does indeed outperform the original skiplist, as we expected would be the case because of its more liberal locking policy.
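The commit-rate bookkeeping described in the Benchmarking section reduces to simple per-thread counters. A sketch, with field and function names of our own choosing:

```cpp
#include <vector>

// Each thread counts transactions attempted and transactions committed;
// the benchmark's commit rate is the ratio of the summed counters.
struct TxnStats {
    long started = 0;    // bumped when a transaction is attempted
    long committed = 0;  // bumped when a transaction succeeds
};

double commit_rate(const std::vector<TxnStats>& per_thread) {
    long started = 0, committed = 0;
    for (const TxnStats& t : per_thread) {
        started += t.started;
        committed += t.committed;
    }
    // With no transactions attempted, report a perfect rate by convention.
    return started == 0 ? 1.0 : double(committed) / double(started);
}
```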
However, whereas the exact no-teleport skiplist maintains flat performance between 4 and 10 threads, the original and exact skiplists grow slightly worse. This could be an indication that their performance is negatively impacted by increased contention as the number of threads executing concurrently grows. (We will see that this is, indeed, the case by looking at the next two graphs.) The level skiplist performs rather poorly even with just 1 thread, which tells us that its algorithm is slower than that of the other skiplists. It does, however, experience a fairly steep improvement in performance as we go from 1 thread up to 8 threads, so it seems to be the case that its code is more parallelizable than that of the other skiplists. Nonetheless, as soon as the number of threads moves beyond the number of virtual processors, we see a sharp drop-off in performance. This could be attributed to the fact that when there are 9 or more threads, a thread may be put to sleep while holding a contentious lock, prolonging the period of time during which other threads cannot acquire said lock.

Figure 2-4: Run time versus number of threads.

The trends that we see in the run time graph are corroborated by those that appear in our graph of commit rate versus number of threads, Figure 2-5. First, we note the drastic decrease in commit rate, for all three skiplist variations, as the number of threads increases from 1 to 4 (or 1 to 8, in the case of the level skiplist). The fact that the commit rate appears to be inversely related to the number of threads indicates that many of our transaction aborts are due to conflicts between threads: oftentimes, we suspect, because of attempts to lock already-locked nodes within a transaction. (We discuss the commit rate issue in greater depth in the Discussion section.) Comparing the commit rates between the three variations, we see that the exact skiplist tends to have a slightly higher commit rate than does the original skiplist. The difference in their commit rates is quite small, but the exact skiplist's better commit rate helps to explain its outperformance of the original skiplist and makes sense given that fewer nodes are locked in the exact skiplist's Find algorithm, slightly decreasing the chances of transactions aborting. We also observe that the level skiplist has a shallower slope of commit rate degradation than do the other two variations. This indicates that the mutation algorithms employed in the level skiplist are less sensitive to conflict aborts, which might explain why we see greater relative performance in the level skiplist at higher thread counts (up to 8 threads), as compared to the exact and original skiplists.

Figure 2-5: Commit rate versus number of threads.

Finally, we turn our attention to the graph of teleport length versus number of threads, Figure 2-6. We see a story similar to the one we observed in the commit rate graph. There is a marked decline in teleport length as the number of threads increases, but the decline is slightly less severe for the level skiplist than for the other two. Recall that our AIMD strategy for determining teleport length is based almost entirely on commit rate: we halve our teleport length when a transaction aborts and increment it when a transaction commits (and has used up the entire allotted traversal limit). As a result, it is probable that the declining teleport lengths are largely a reflection of the commit rates. Thus, we draw the same conclusion here that we drew previously: all three skiplist variations suffer from a great number of transaction aborts caused by conflicts between threads (namely, as stated earlier, attempts during transactions to lock already-locked nodes), but the level skiplist is less affected, and thus its performance benefits more from increases in thread count.
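The AIMD rule just described can be sketched as follows. The floor of one step and the field names are our assumptions:

```cpp
// Additive-increase / multiplicative-decrease on the teleport limit:
// halve it when a transaction aborts; add one when a transaction commits
// after using its entire traversal allotment.
struct TeleportLimit {
    int limit;
};

void on_abort(TeleportLimit& t) {
    t.limit /= 2;                    // multiplicative decrease
    if (t.limit < 1) t.limit = 1;    // assumed floor: at least one step
}

void on_commit(TeleportLimit& t, bool used_full_allotment) {
    if (used_full_allotment) t.limit += 1;  // additive increase
}
```

Because aborts cut the limit in half while commits only add one, a sustained drop in commit rate drags the average teleport length down quickly, which is consistent with the trend in Figure 2-6.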

Figure 2-6: Teleport length versus number of threads.

2.5 Discussion

The results that we presented in the previous section do not look promising for the teleporting lock-coupled skiplist. One of the main issues that we run into with the three variations presented above is a commit rate that plummets as we increase the number of threads accessing and mutating the skiplist concurrently. Recall, as we described at the beginning of this chapter, that our Teleport method functions by searching down the skiplist without obtaining any locks until it either reaches the teleport limit or finds the node for which it is searching. At this point in our code, in order to maintain the invariant that all of the predecessors of the victim node are locked upon being found, we try to acquire locks for the predecessors that we encountered during the teleport. However, our transaction is guaranteed to abort if any of the nodes that we attempt to lock are either already locked when we begin our transaction or become locked while our transaction is taking place. This is doubly bad. When a transaction aborts, not only does the thread responsible for the transaction not hold locks for any of the predecessors that it found, it also loses its place further down the skiplist and must begin searching again from the location at which it entered the transaction. Note that this is distinctly different from what happens when a thread searches the skiplist non-transactionally. In a non-transactional traversal, the consequence of a node's lock being held by a different thread is simply that the current thread must wait until the lock is released, which means that in the non-transactional lock-coupled search, there is no possibility of losing work that has already been done. Our graphs show us the negative effects of having a significant amount of contention on nodes (particularly high-level nodes) within the skiplist, and our reasoning about lock acquisitions in transactions gives us insight into why transaction aborts are costly (and frequent) in our teleporting skiplist code. With these understandings in mind, we propose adapting our Teleport method to be free of lock acquisitions. Transactions are not well-suited to performing lock acquisitions (especially when contention is high) because they cannot wait on a lock to be released, but must instead abort and retry (or revert to non-transactional code), significantly increasing the cost of encountering a locked node. In our teleporting exact skiplist code, we are able to reduce the number of locks that a thread must acquire during the course of a search, thus reducing the amount of contention in the skiplist. Likewise, in our teleporting level-locking skiplist code, we attempt to reduce contention by assigning one lock to each level of each node, making our code more fine-grained. While both strategies have the potential to alleviate contention to the point of outperforming the non-teleporting skiplist, we have not yet seen success in practice. (Although we should remind the reader that a similar lock-teleportation strategy used in [2] has been successful in improving the performance of the lock-coupled linked-list, as mentioned in the introductory chapter.)
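The asymmetry described above can be made concrete with a toy test-and-set lock (entirely our own sketch): inside a transaction a held lock cannot be waited on, so the only options are to acquire immediately or abort.

```cpp
#include <atomic>

enum class Outcome { Acquired, MustAbort };

// A minimal test-and-set lock stands in for a node lock.
struct SpinLock {
    std::atomic<bool> held{false};
    bool try_lock() { return !held.exchange(true); }
    void unlock()   { held.store(false); }
};

// Transactional path: waiting inside a transaction is impossible, so a
// held lock forces an abort, discarding all traversal work done so far.
Outcome lock_inside_transaction(SpinLock& l) {
    return l.try_lock() ? Outcome::Acquired : Outcome::MustAbort;
}
```

A non-transactional traversal would instead spin or block on `held` and lose nothing; the transactional path pays for the entire lost traversal on every such conflict, which is why lock acquisition inside transactions is so costly under contention.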
While teleportation does not produce good performance in situations where contention is high and locks must be acquired within a transaction, it is still quite promising as a strategy for avoiding some of the intermediate work that is usually required to traverse a skiplist. In fact, in the next chapter, we present our hazard pointer teleportation algorithm, which proves to be effective in boosting the performance of the lazy skiplist.

Teleporting Lazy Skiplists

We now discuss our implementation, written in C++, of the teleporting lazy skiplist. We assume prior knowledge of the lazy skiplist (and refer the reader to [1, ch.14] for background information on this skiplist). In this chapter, we first cover the design of the teleporting lazy skiplist and its nodes before providing an overview of the general algorithm that we use for hazard pointer teleportation and a brief description of the skiplist's Add, Remove, and Contains methods. We then discuss variations on the teleporting lazy skiplist. In the Benchmarking, Results, and Discussion sections, we describe the performance measures that are of interest to us, provide graphs of our findings using these measures, and give an interpretation of these findings.

3.1 Design of Skiplist and Nodes

We rely heavily on the teleporting lock-coupled Skiplist in our design of the teleporting lazy Skiplist. The teleporting lazy Skiplist's Node structures have several new members that serve to enable lazy locking, in addition to those present in the teleporting lock-coupled Skiplist. We also add a hazard pointer array, hazard, to the Skiplist structure to allow threads to publish the addresses of the Nodes that they need to hazard protect. Our hazard array is set up such that for each thread, for each level of the skiplist, we have a set of two hazard pointer slots. In our discussion of the algorithm, we explain the usage of the hazard array and why we require two slots for every thread and level combination.

3.2 Algorithm

We now turn our attention to the hazard pointer teleportation algorithm that we utilize. The design of the Add, Remove, and Contains methods is based on [1, ch.14]; we will give a brief description of the first two.

3.2.1 Find Method and Teleportation

As in the teleporting lock-coupled skiplist, the Add, Remove, and Contains methods of the teleporting lazy skiplist all depend on the Find method. The Find method in the teleporting lazy skiplist builds closely on that of the lock-coupled skiplist (see Figure 2-1), but additionally takes in an array of node pointers, succs, and an int pointer, level. It populates succs with successor nodes, which we define to be exactly those nodes pointed to by the predecessors' next pointers. (Thus, the victim node is always the lowest-level successor node.) In level, we store the height of the victim node, if the victim's value matches the one for which we are searching. These additions require only trivial changes to the teleporting lock-coupled skiplist's version of Find. The Fallback and Teleport methods of the teleporting lazy skiplist do not differ drastically from those in the teleporting lock-coupled skiplist either. However, in the teleporting lazy skiplist we must be certain that each node that we want to read is hazard protected prior to being read. Thus, each thread publishes, in an alternating manner, to the two hazard array slots that correspond to that thread and the skiplist level across which it is currently traversing. We require two slots for each thread and level combination because we need to ensure that all of the victim's predecessors and successors remain hazard protected from the time we encounter them until we are able to lock all of the victim's predecessors. This necessity informs the structure of the hazard array. When dealing with hazard pointer memory management, we must also keep track of which nodes are currently hazard protected. Thus, outside of the Fallback and Teleport methods, we maintain the invariant that all of the predecessors pointed to by the preds array and all of the successors pointed to by the succs array are hazard protected.
This invariant holds when we enter and exit the Fallback and Teleport methods. The Fallback method for the teleporting lazy skiplist functions in nearly the same manner as its lock-coupled counterpart's Fallback method (Figure 2-2). That is, it performs one individual step of the skiplist traversal, either following one next pointer or moving down one level of the skiplist. However, when it follows the node's next pointer, it calls HazardRead rather than performing a regular read. The HazardRead method loads the address of the node in question into the less-recently-used of the two slots that correspond to the current thread and level in hazard, overwriting the address of whichever node was previously hazard protected in that slot. It then executes a memory barrier. The only other difference in the lazy Fallback method
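The alternating-slot HazardRead just described can be sketched as follows (names and layout are our assumptions): the pair of slots belongs to one (thread, level) combination, and each read publishes into the less-recently-used slot before issuing a memory barrier, so the two most recently read nodes on a level stay protected.

```cpp
#include <atomic>

// Two hazard-pointer slots for one (thread, level) pair.
struct HazardSlots {
    std::atomic<void*> slot[2] = {nullptr, nullptr};
    int next = 0;  // index of the less-recently-used slot
};

// Publish the node's address before the caller dereferences it, then
// fence so reclaiming threads are guaranteed to observe the protection.
void* hazard_read(HazardSlots& h, const std::atomic<void*>& src) {
    void* p = src.load();
    h.slot[h.next].store(p);  // overwrite the older of the two slots
    h.next ^= 1;              // alternate for the following read
    std::atomic_thread_fence(std::memory_order_seq_cst);  // memory barrier
    // (A full hazard-pointer read would re-read src here and retry on
    // mismatch before trusting p; omitted in this sketch.)
    return p;
}
```

Because the slots alternate, reading a node's successor never unprotects the node itself, which is exactly the property the traversal invariant above relies on.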

bool teleport(int v, Node* preds[], Node* succs[],
              ThreadLocal* threadlocal, int* start_layer, int* level) {
    if (threadlocal->teleportlimit < 0) {
        threadlocal->teleportlimit = DEFAULT_TELEPORT_DISTANCE;
    }
    int dist = threadlocal->teleportlimit;
    int curr_layer;
    Node* pred = preds[*start_layer];
    for (int layer = *start_layer; layer >= 0 && dist > 0; layer--) {
        curr_layer = layer;
        Node* curr = pred->next[layer];
        preds[layer] = pred;
        while (curr->value < v && --dist > 0) {
            preds[layer] = pred = curr;
            curr = curr->next[layer];
        }
        succs[layer] = curr;
        if (*level == -1 && curr->value == v) {
            *level = layer;
        }
    }
    bool done = curr_layer == 0 && preds[0]->next[0]->value >= v;
    for (int i = *start_layer; i >= curr_layer; i--) {
        transactionalhazardread(&preds[i], threadlocal);
        transactionalhazardread(&succs[i], threadlocal);
    }
    *start_layer = curr_layer;
    _xend();
    if (dist <= 0) { threadlocal->teleportlimit++; }
    return done;
}

Figure 3-7: The lazy Teleport method, used in Find.


More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

An Approach to Task Attribute Assignment for Uniprocessor Systems

An Approach to Task Attribute Assignment for Uniprocessor Systems An Approach to ttribute Assignment for Uniprocessor Systems I. Bate and A. Burns Real-Time Systems Research Group Department of Computer Science University of York York, United Kingdom e-mail: fijb,burnsg@cs.york.ac.uk

More information

General Objectives: To understand the process management in operating system. Specific Objectives: At the end of the unit you should be able to:

General Objectives: To understand the process management in operating system. Specific Objectives: At the end of the unit you should be able to: F2007/Unit5/1 UNIT 5 OBJECTIVES General Objectives: To understand the process management in operating system Specific Objectives: At the end of the unit you should be able to: define program, process and

More information

Locking: A necessary evil? Lecture 10: Avoiding Locks. Priority Inversion. Deadlock

Locking: A necessary evil? Lecture 10: Avoiding Locks. Priority Inversion. Deadlock Locking: necessary evil? Lecture 10: voiding Locks CSC 469H1F Fall 2006 ngela Demke Brown (with thanks to Paul McKenney) Locks are an easy to understand solution to critical section problem Protect shared

More information

6.852: Distributed Algorithms Fall, Class 20

6.852: Distributed Algorithms Fall, Class 20 6.852: Distributed Algorithms Fall, 2009 Class 20 Today s plan z z z Transactional Memory Reading: Herlihy-Shavit, Chapter 18 Guerraoui, Kapalka, Chapters 1-4 Next: z z z Asynchronous networks vs asynchronous

More information

Lock Elision and Transactional Memory Predictor in Hardware. William Galliher, Liang Zhang, Kai Zhao. University of Wisconsin Madison

Lock Elision and Transactional Memory Predictor in Hardware. William Galliher, Liang Zhang, Kai Zhao. University of Wisconsin Madison Lock Elision and Transactional Memory Predictor in Hardware William Galliher, Liang Zhang, Kai Zhao University of Wisconsin Madison Email: {galliher, lzhang432, kzhao32}@wisc.edu ABSTRACT Shared data structure

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

CS 61B Asymptotic Analysis Fall 2018

CS 61B Asymptotic Analysis Fall 2018 CS 6B Asymptotic Analysis Fall 08 Disclaimer: This discussion worksheet is fairly long and is not designed to be finished in a single section Some of these questions of the level that you might see on

More information

EXTENDING THE PRIORITY CEILING PROTOCOL USING READ/WRITE AFFECTED SETS MICHAEL A. SQUADRITO A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE

EXTENDING THE PRIORITY CEILING PROTOCOL USING READ/WRITE AFFECTED SETS MICHAEL A. SQUADRITO A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE EXTENDING THE PRIORITY CEILING PROTOCOL USING READ/WRITE AFFECTED SETS BY MICHAEL A. SQUADRITO A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER

More information

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of lazy TM 1 Transactions Access to shared variables is encapsulated within transactions the system gives the illusion

More information

Implementation of Process Networks in Java

Implementation of Process Networks in Java Implementation of Process Networks in Java Richard S, Stevens 1, Marlene Wan, Peggy Laramie, Thomas M. Parks, Edward A. Lee DRAFT: 10 July 1997 Abstract A process network, as described by G. Kahn, is a

More information

Chapter 16 Heuristic Search

Chapter 16 Heuristic Search Chapter 16 Heuristic Search Part I. Preliminaries Part II. Tightly Coupled Multicore Chapter 6. Parallel Loops Chapter 7. Parallel Loop Schedules Chapter 8. Parallel Reduction Chapter 9. Reduction Variables

More information

Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems

Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems Håkan Sundell Philippas Tsigas Outline Synchronization Methods Priority Queues Concurrent Priority Queues Lock-Free Algorithm: Problems

More information

Final Exam Solutions May 11, 2012 CS162 Operating Systems

Final Exam Solutions May 11, 2012 CS162 Operating Systems University of California, Berkeley College of Engineering Computer Science Division EECS Spring 2012 Anthony D. Joseph and Ion Stoica Final Exam May 11, 2012 CS162 Operating Systems Your Name: SID AND

More information

Annex 10 - Summary of analysis of differences between frequencies

Annex 10 - Summary of analysis of differences between frequencies Annex 10 - Summary of analysis of differences between frequencies Introduction A10.1 This Annex summarises our refined analysis of the differences that may arise after liberalisation between operators

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 14: Data Replication Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Database Replication What is database replication The advantages of

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

DDS Dynamic Search Trees

DDS Dynamic Search Trees DDS Dynamic Search Trees 1 Data structures l A data structure models some abstract object. It implements a number of operations on this object, which usually can be classified into l creation and deletion

More information

Concurrent & Distributed Systems Supervision Exercises

Concurrent & Distributed Systems Supervision Exercises Concurrent & Distributed Systems Supervision Exercises Stephen Kell Stephen.Kell@cl.cam.ac.uk November 9, 2009 These exercises are intended to cover all the main points of understanding in the lecture

More information

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 20 Concurrency Control Part -1 Foundations for concurrency

More information

Transactional Interference-less Balanced Tree

Transactional Interference-less Balanced Tree Transactional Interference-less Balanced Tree Technical Report Ahmed Hassan, Roberto Palmieri, and Binoy Ravindran Virginia Tech, Blacksburg, VA, USA. Abstract. In this paper, we present TxCF-Tree, a balanced

More information

Bw-Tree. Josef Schmeißer. January 9, Josef Schmeißer Bw-Tree January 9, / 25

Bw-Tree. Josef Schmeißer. January 9, Josef Schmeißer Bw-Tree January 9, / 25 Bw-Tree Josef Schmeißer January 9, 2018 Josef Schmeißer Bw-Tree January 9, 2018 1 / 25 Table of contents 1 Fundamentals 2 Tree Structure 3 Evaluation 4 Further Reading Josef Schmeißer Bw-Tree January 9,

More information

Interactive Responsiveness and Concurrent Workflow

Interactive Responsiveness and Concurrent Workflow Middleware-Enhanced Concurrency of Transactions Interactive Responsiveness and Concurrent Workflow Transactional Cascade Technology Paper Ivan Klianev, Managing Director & CTO Published in November 2005

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

PROCESS SYNCHRONIZATION

PROCESS SYNCHRONIZATION PROCESS SYNCHRONIZATION Process Synchronization Background The Critical-Section Problem Peterson s Solution Synchronization Hardware Semaphores Classic Problems of Synchronization Monitors Synchronization

More information

Transactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Transactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Transactional Memory Companion slides for The by Maurice Herlihy & Nir Shavit Our Vision for the Future In this course, we covered. Best practices New and clever ideas And common-sense observations. 2

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 17 November 2017

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 17 November 2017 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 17 November 2017 Lecture 7 Linearizability Lock-free progress properties Hashtables and skip-lists Queues Reducing contention Explicit

More information

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning Lecture 1 Contracts 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning In these notes we review contracts, which we use to collectively denote function contracts, loop invariants,

More information

Composable Shared Memory Transactions Lecture 20-2

Composable Shared Memory Transactions Lecture 20-2 Composable Shared Memory Transactions Lecture 20-2 April 3, 2008 This was actually the 21st lecture of the class, but I messed up the naming of subsequent notes files so I ll just call this one 20-2. This

More information

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 33 Overloading Operator for User - Defined Types: Part 1 Welcome

More information

A Linear-Time Heuristic for Improving Network Partitions

A Linear-Time Heuristic for Improving Network Partitions A Linear-Time Heuristic for Improving Network Partitions ECE 556 Project Report Josh Brauer Introduction The Fiduccia-Matteyses min-cut heuristic provides an efficient solution to the problem of separating

More information

Segregating Data Within Databases for Performance Prepared by Bill Hulsizer

Segregating Data Within Databases for Performance Prepared by Bill Hulsizer Segregating Data Within Databases for Performance Prepared by Bill Hulsizer When designing databases, segregating data within tables is usually important and sometimes very important. The higher the volume

More information

CS5412: TRANSACTIONS (I)

CS5412: TRANSACTIONS (I) 1 CS5412: TRANSACTIONS (I) Lecture XVII Ken Birman Transactions 2 A widely used reliability technology, despite the BASE methodology we use in the first tier Goal for this week: in-depth examination of

More information

Agenda. Lecture. Next discussion papers. Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks

Agenda. Lecture. Next discussion papers. Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks Agenda Lecture Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks Next discussion papers Selecting Locking Primitives for Parallel Programming Selecting Locking

More information

Lecture Notes on Arrays

Lecture Notes on Arrays Lecture Notes on Arrays 15-122: Principles of Imperative Computation July 2, 2013 1 Introduction So far we have seen how to process primitive data like integers in imperative programs. That is useful,

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Hash Tables and Hash Functions

Hash Tables and Hash Functions Hash Tables and Hash Functions We have seen that with a balanced binary tree we can guarantee worst-case time for insert, search and delete operations. Our challenge now is to try to improve on that...

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In Lecture 13, we saw a number of relaxed memoryconsistency models. In this lecture, we will cover some of them in more detail. Why isn t sequential consistency

More information

Parallel access to linked data structures

Parallel access to linked data structures Parallel access to linked data structures [Solihin Ch. 5] Answer the questions below. Name some linked data structures. What operations can be performed on all of these structures? Why is it hard to parallelize

More information

Synchronization Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Synchronization Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Synchronization Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Election Algorithms Issue: Many distributed algorithms require that one process act as a coordinator (initiator, etc).

More information

Operating Systems (234123) Spring (Homework 3 Wet) Homework 3 Wet

Operating Systems (234123) Spring (Homework 3 Wet) Homework 3 Wet Due date: Monday, 4/06/2012 12:30 noon Teaching assistants in charge: Operating Systems (234123) Spring-2012 Homework 3 Wet Anastasia Braginsky All emails regarding this assignment should be sent only

More information

(Refer Slide Time: 06:01)

(Refer Slide Time: 06:01) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 28 Applications of DFS Today we are going to be talking about

More information

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < > Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization

More information

Part VII Data Protection

Part VII Data Protection Part VII Data Protection Part VII describes how Oracle protects the data in a database and explains what the database administrator can do to provide additional protection for data. Part VII contains the

More information

Design Patterns for Real-Time Computer Music Systems

Design Patterns for Real-Time Computer Music Systems Design Patterns for Real-Time Computer Music Systems Roger B. Dannenberg and Ross Bencina 4 September 2005 This document contains a set of design patterns for real time systems, particularly for computer

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Last Lecture: Spin-Locks

Last Lecture: Spin-Locks Linked Lists: Locking, Lock- Free, and Beyond Last Lecture: Spin-Locks. spin lock CS critical section Resets lock upon exit Today: Concurrent Objects Adding threads should not lower throughput Contention

More information

The University of Texas at Arlington

The University of Texas at Arlington The University of Texas at Arlington Lecture 6: Threading and Parallel Programming Constraints CSE 5343/4342 Embedded Systems II Based heavily on slides by Dr. Roger Walker More Task Decomposition: Dependence

More information