Drop the Anchor: Lightweight Memory Management for Non-Blocking Data Structures

Anastasia Braginsky, Computer Science, Technion; Alex Kogan, Oracle Labs; Erez Petrank, Computer Science, Technion

ABSTRACT

Efficient memory management of dynamic non-blocking data structures remains an important open question. Existing methods either sacrifice the ability to deallocate objects or reduce performance notably. In this paper, we present a novel technique, called Drop the Anchor, which significantly reduces the overhead associated with the memory management while reclaiming objects even in the presence of thread failures. We demonstrate this memory management scheme on the common linked list data structure. Using extensive evaluation, we show that Drop the Anchor significantly outperforms Hazard Pointers, the widely used technique for non-blocking memory management.

Categories and Subject Descriptors: E.1 [Data]: Data Structures - lists, stacks, and queues; D.1.3 [Software]: Programming Techniques - Concurrent Programming

General Terms: Algorithms, Design, Theory

Keywords: Concurrent data structures, progress guarantee, lock-freedom, parallel programming, linked list, memory management, hazard pointers, timestamps, freezing

Supported by the Israeli Science Foundation (grant No. 28/1). Supported by the Ministry of Science and Technology, Israel. The work on this paper was done while the author was with the Department of Computer Science, Technion.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM or the author must be honored. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPAA'13, July 23-25, 2013, Montréal, Québec, Canada. Copyright 2013 ACM.

1. INTRODUCTION

Non-blocking data structures [9, 11] are fast, scalable and widely used. In the last two decades, many efficient non-blocking implementations of almost any common data structure have been developed. However, when designing a dynamic non-blocking data structure, one must address the non-trivial issue of how to manage its memory. Specifically, one has to ensure that whenever a thread removes some internal node from the data structure, (a) the memory occupied by this node will eventually be deallocated (i.e., returned to the memory manager for arbitrary reuse), and (b) no other concurrently running thread will access the deallocated memory, even though some threads might hold a reference to the node.

Previous attempts to tackle the memory management problem had limited success. Existing non-blocking algorithms usually take one of two standard approaches. The first approach is to rely on automatic garbage collection (GC), simply deferring the problem to the GC. By doing this, the designers hinder the algorithm from being ported to environments without GC [5]. Moreover, the implementations of these designs with currently available (blocking) GCs cannot be considered non-blocking. The second approach taken by designers of concurrent data structures is to adopt one of the available non-blocking memory management schemes. The most common schemes are probably the Hazard Pointers technique by Michael [14] and the similar Pass the Buck method by Herlihy et al. [10].
In these schemes, each thread has a pool of global pointers, called hazard pointers in [14] or guards in [10], which are used to mark objects as live or ready for reclamation. When a thread t reclaims a node, t adds the node to a special local reclamation buffer. Once in a while, t scans its buffer, and for each node it checks whether some other thread has a hazard pointer¹ to the node. If not, that node can be safely deallocated. Special attention must be given to the time interval after a thread obtains a reference to an object and before it registers this object in a hazard pointer. During this time, the object may be reclaimed and reallocated. Thus, by the time it gets protected by a hazard pointer, it could have become a completely different entity. This delicate point enforces validation of the object's state after assigning it with a hazard pointer. Although these techniques are not universal (i.e., there is no automatic way to incorporate them into a given algorithm), they are relatively simple.

¹ In this paper we will use the term hazard pointers, but guard pointers are equally relevant.
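To make the per-node cost concrete, the following is a minimal C11 sketch of the protect-then-validate step described above; the names (Node, hazardSlot, protect) are illustrative assumptions and not taken from the code of [14]:

#include <stdatomic.h>

typedef struct Node { _Atomic(struct Node *) next; long key; } Node;

/* one hazard-pointer slot per thread (the full scheme uses a small pool) */
extern _Atomic(Node *) hazardSlot[];

/* Publish a pointer read from *src, then re-read *src to validate that
 * the node was not removed (and possibly reallocated) in between. The
 * store must become globally visible before the re-read, which is why
 * the scheme pays for a memory fence on every node access. */
static Node *protect(int tid, _Atomic(Node *) *src) {
    Node *p, *q = atomic_load(src);
    do {
        p = q;
        atomic_store(&hazardSlot[tid], p);  /* publish; seq_cst implies a fence */
        q = atomic_load(src);               /* validate: did *src change? */
    } while (q != p);
    return p;
}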

Moreover, a failure of one thread prevents only a small number of nodes (those to which the failed thread has references in its hazard pointers) from being deallocated. The major drawback of these techniques, however, is their significant runtime overhead, caused mainly by the management and validation of the global pointers required before accessing each internal node for the first time [8]. Along with that, expensive instructions, such as memory fences or compare-and-swap (CAS) instructions [14, 10, 8], are required for the correctness of these schemes. Moreover, if the validation fails, the thread must restart its operation on the data structure, harming the performance further.

Another known method for memory management uses per-thread timestamps, as in [7], which are incremented by threads before every access to the data structure. When a thread removes a node, it records the timestamps of the other threads. Later, it can deallocate the node once all threads increase their timestamps beyond the recorded values. Although this method is very lightweight, it is vulnerable to thread delays and failures. In such cases, memory space of an unbounded size may become impossible to reclaim [14].

In this paper, we concentrate on the linked list, one of the most fundamental data structures, which is particularly prone to the shortcomings of previous approaches [8]. The presented technique eliminates the performance overhead associated with the memory management without sacrificing the ability to deallocate memory in case of thread failures. The good performance of our technique stems from the assumption that thread failures are typically very uncommon in real systems, and if they do occur, this is usually indicative of more serious problems than being unable to deallocate some small part of memory. Our approach provides a flexible tradeoff between the runtime overhead introduced by memory management and the size of memory that might be lost when some thread fails.

Our memory management technique builds on a combination of three ideas: timestamps, anchors and freezing. As in [7], we use per-thread timestamps to track the activity of each thread on the data structure. Similarly to [14], we use global pointers, which we call anchors. Unlike [14], however, a thread drops the anchor (i.e., records a reference in the anchor) only once per batch of node accesses, e.g., once every thousand nodes it traverses. As a result, the amortized cost of anchor management is spread across multiple node accesses and is thus very low. To recover the data structure from a failure of a thread t, we apply freezing [3]. That is, using t's anchors, other threads mark the nodes that t may hold a reference to as frozen. Then they copy and replace the frozen part of the data structure, restoring the ability of all threads to deallocate memory. The recovery operation is relatively expensive, but it is required only in the uncommon case in which a thread fails to make progress for a long while. Thus, the overall cost of the memory management remains very low.

We have implemented our scheme in C and compared its performance to the widely used implementation of the linked list based on Hazard Pointers (HP) [14]. Our performance results show that the total running time using the anchor-based memory management is about 25-50% lower than with the one based on HP. We also discuss how to apply our technique to other data structures, where the use of other approaches for memory management is more expensive.
2. RELATED WORK

Memory management can fairly be considered the Achilles' heel of many dynamically sized non-blocking data structures. In addition to the techniques mentioned in the introduction (that use per-thread timestamps [7] or global pointers [14, 10]), one can also find an approach based on reference counting [16, 6, 4, 15]. There, the idea is to associate a counter with every node, which is updated atomically when a thread gains or drops a reference to the node. Such atomic updates are typically performed with a fetch-and-add instruction, and the node can be safely removed once its reference count drops to zero. This approach suffers from several drawbacks, such as requiring each node to keep the reference count field even after the node is reclaimed [16, 15] or using uncommon atomic primitives, such as double compare-and-swap (DCAS) [4]. The major problem, however, remains performance [14, 8], since even a read-only operation on the data structure requires atomic reference counter updates on every node access.

In a related work, Hart et al. [8] compare several memory management techniques, including hazard pointers, reference counters, and so-called quiescent-state-based reclamation. In the latter, the memory can be reclaimed when each thread passes through at least one quiescent state [13], in which it does not hold any reference to shared nodes, and in particular, to nodes that have been removed from the data structure. In fact, the timestamp-based technique [7] discussed in the introduction can be seen as a special case of the quiescent-state approach. Hart et al. [8] find that when using hazard pointers or reference counters, expensive atomic instructions, such as fences and compare-and-swaps (CAS) executed for every traversed node, dominate the performance cost. Quiescent-state reclamation usually performs better, but it heavily depends on how often quiescent states occur. Moreover, if a thread fails before reaching the quiescent state, no memory can be safely reclaimed from that point on.

Dragojevic et al. [5] consider how hardware transactional memory (HTM) can help to alleviate the performance and conceptual difficulties associated with memory management techniques. In contrast to [5], our algorithm does not rely on special hardware support, such as HTM.

The freezing idea was previously used in the context of concurrent data structures by Braginsky and Petrank in their recent work on chunk-based linked lists [3]. There, list nodes are grouped into chunks for better cache locality and list traversal performance. The freezing technique is used in [3] for list restructuring, to notify threads that the part of the data structure they are currently using is obsolete. This is done by setting a special freeze-bit on pointers belonging to nodes in the obsolete part, making the pointers/nodes unsuitable for traversing. A thread that fails to use a frozen pointer realizes that this part of the data structure is obsolete and restarts its operation, usually after helping to accomplish the list restructuring procedure that froze that part.

3. AN OVERVIEW OF DROP THE ANCHOR

As mentioned in the introduction, our technique relies on three building blocks, namely timestamps, anchors, and freezing.

[Figure 1: Transition diagram for the possible states of thread t. Normally t moves between running (when it starts an operation) and idle (when it finishes one); it enters stuck when someone suspects that t has failed, moves to recovered when the recovery from t's failure is completed, and returns to running when it (re)starts an operation.]

A thread t manages a monotonically increasing timestamp in the following way. When t starts its operation on a list data structure, it reads the timestamps of all threads and sets its timestamp to the maximal value it read plus one. When t finishes its operation, it simply marks its timestamp as idle. The timestamp of t is associated with two flags, stuck and idle. These flags specify one of the following states of the timestamp (and of the corresponding thread): running (both flags are turned off, meaning that t has a pending operation on the data structure), idle (only the idle flag is on, meaning that t does not have any pending operation on the data structure), running but stuck, which we call for brevity simply stuck (only the stuck flag is on, meaning that t has a pending operation and is suspected by other thread(s) to be stuck), and recovered (both flags are turned on, meaning that other threads have frozen and copied the memory that might be accessed by t). The transitions between these states are captured in Figure 1. Normally, t moves between the running and idle states. Once some thread suspects t to be stuck, t's timestamp is marked as stuck. The only way for t to return to the running state is to go through the recovered state (cf. Figure 1).

The timestamps are also used to mark the insertion and deletion times of list nodes. That is, each node in the list has two additional fields, which are set as follows. When t decides to insert (remove) a node into (from) the list, it sets the node's insertion (respectively, deletion) timestamp field to be one higher than the maximal timestamp value that it observes among the timestamps of all threads.

The nodes deleted by t are stored in t's special reclamation buffer, which is scanned by t once in a while (as in [14]). During each scan, and for each deleted node n, t checks whether the deletion time of n is smaller than the current timestamps of the other threads (plus an additional condition described later), and if so, deallocates the node. This check ensures that all threads have started a new list operation since the time this node was removed from the list, and therefore no thread can be viewing this node at this time. If threads did not fail, this would be everything needed to manage the memory of non-blocking lists by a traditional epoch-based approach [7]. Unfortunately, thread failures may happen. In the design described so far, if a thread fails during its operation on the list, no additional node can be deallocated, since the timestamp of the failed thread would not advance.

To cope with the problem of thread failures, we use two additional concepts, namely anchors and freezing. Anchors are simply pointers used by threads to point to list nodes. In fact, hazard pointers [14] can be seen as a special case of anchors. The difference between the two is that the anchor is not dropped (set) before accessing every internal node in the list, but rather every ten, one hundred, or several thousands of node accesses (the frequency is controlled by the anchor threshold parameter). As a result, the amortized cost of anchor management is significantly reduced and spread across the traversal of (controllably) many nodes in the list data structure.
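To make the timestamp protocol concrete, the following is a minimal C sketch, assuming per-thread timestamp words in a shared array; the flag bit positions, the array, and the helper names are our illustrative assumptions, not the paper's actual code (which packs the timestamp together with the anchor, see Section 4.1):

#include <stdint.h>

#define IDLE_BIT  ((uint64_t)1 << 62)   /* assumed flag positions */
#define STUCK_BIT ((uint64_t)1 << 63)
#define TS_MASK   (~(IDLE_BIT | STUCK_BIT))

#define NTHREADS 32                      /* illustrative */
extern volatile uint64_t ts[NTHREADS];   /* timestamp word of each thread */

/* One plus the maximal timestamp value currently observed; the same
 * computation stamps the insertion/deletion fields of list nodes. */
static uint64_t nextTimestamp(void) {
    uint64_t max = 0;
    for (int i = 0; i < NTHREADS; i++) {
        uint64_t v = ts[i] & TS_MASK;
        if (v > max) max = v;
    }
    return max + 1;
}

/* Start an operation: become running (both flags off). The real scheme
 * updates this word with CAS, since another thread may concurrently set
 * the stuck bit (Section 4.2). */
static void startOp(int tid) { ts[tid] = nextTimestamp(); }

/* Finish an operation: keep the value, turn the idle flag on. */
static void finishOp(int tid) { ts[tid] |= IDLE_BIT; }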
The downside of our approach, however, is that when a thread t is suspected of being stuck, other threads do not know for sure which object t may access when (and if) it revives. They only know a range of nodes where t might be, which includes the node pointed to by t's anchor plus additional nodes reachable from that anchored node. The suspecting threads use this range to recover the list from the failure of t. Specifically, they freeze all nodes in the range by setting the special freeze-bit of all pointers in these nodes². Next, they copy all frozen nodes into a new sub-list. Finally, they replace the frozen nodes with the new sub-list and mark t's timestamp as recovered. This mark tells other threads that the list was recovered from t's failure. In other words, threads may again deallocate nodes they remove from the data structure, disregarding t's timestamp.

² The freeze-bit is one of the least significant bits of a pointer, which are normally zeroed due to memory alignment.

The recovery procedure is relatively heavy performance-wise and has certain technical issues, but in return, the common path, i.e., the traversal of the data structure, incurs virtually no additional operations related to memory management. Since the recovery is expected to be very infrequent, we believe (and show in our performance measurements) that the complication associated with the recovery procedure pays off by eliminating the overhead in the common path. In the next section, we provide technical details of the application of this general idea to a concrete non-blocking implementation of the linked list.

4. DETAILED DESCRIPTION

4.1 Auxiliary fields and records

We use the singly linked list of Harris [7] as a basis for our construction. To support our scheme, each thread maintains two records where it stores information related to the memory management. The first record is global, i.e., it can be read and written by any thread (not just the owner of the record), and is used to manage the thread's timestamp and anchor. The second record is local, and is used during object reclamation and for deciding whether the recovery procedure is necessary. The structure of the records is given in Listing 1.

The global record contains two fields, timestampAndAnchor and lowTimestamp. The timestampAndAnchor field contains the timestamp, the anchor, and the idle and stuck bits of the thread, combined into one word so that all can be modified atomically. The width and the actual internal structure of the field depend on the underlying machine. In certain settings of 64-bit Linux-based architectures, the virtual memory addressing requires 48 bits; the two least significant bits in a pointer are typically zeroed due to memory alignment. Moreover, most existing architectures support a wide-CAS instruction, which operates atomically on two adjacent memory words (i.e., 128 bits). In such settings, we allocate 64 bits for the timestamp and 64 bits for the anchor pointer, including two bits for the two flags that specify the state of the thread (i.e., running, idle, stuck and recovered).
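A minimal sketch of one possible packing under the 128-bit wide-CAS assumption follows; the union layout and the GCC-style builtin are illustrative (on x86-64 this requires the cmpxchg16b instruction, e.g., compiling with -mcx16), not the paper's actual definitions:

#include <stdint.h>

typedef unsigned __int128 u128;

typedef union {
    u128 word;              /* the 16-byte unit targeted by wide CAS */
    struct {
        uint64_t ts;        /* timestamp; two bits reserved for idle/stuck */
        uint64_t anchor;    /* anchor pointer; low bits free by alignment */
    } f;
} TimestampAndAnchor;

/* Atomically replace the whole timestamp+anchor pair; fails if another
 * thread changed the word in the meantime, e.g., by turning the stuck
 * bit on. */
static int tsaCAS(TimestampAndAnchor *a, u128 expected, u128 desired) {
    return __atomic_compare_exchange_n(&a->word, &expected, desired,
                                       0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}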

Listing 1: Auxiliary records

struct GlobalMemoryManagementRec {
    uint128_t timestampAndAnchor;  /* timestamp + anchor + idle/stuck bits;
                                      width is machine-dependent (Section 4.1) */
    uint64_t  lowTimestamp;
};

struct LocalMemoryManagementRec {
    list_t   reclamationBuffer;
    uint64_t minTimestamp;
    uint32_t minTimestampThreadID;
    uint32_t minTimestampThreadCnt;
};

In the settings where only 64 bits can be the target of a CAS instruction, one can use 48 bits or fewer for the anchor pointer and, respectively, 16 bits for the timestamp. Moreover, different allocation techniques can be used to require fewer bits for the pointer.

When a thread t accesses the list, it reads the timestamps of all threads in the system and sets its own timestamp to one plus the maximum among all the timestamp values that were read. It writes its new timestamp into the timestampAndAnchor field, simultaneously setting the idle bit to zero. When t completes its operation, it simply turns the idle bit on (leaving the same timestamp value). The exact details of the manipulation of this field are provided in subsequent sections.

In addition to the timestampAndAnchor field, the global record contains a field called lowTimestamp. This field is set by t to the minimal timestamp observed by t when it starts an operation on the list. As described in Section 4.4, the lowTimestamp field is used by other threads when they try to recover the list from a failure of t (to identify nodes that were inserted into the list before t started its current operation).

The local record has four fields; the description of their roles is given in Section 4.3. Along with adding auxiliary records for each thread, we also augment each node in the linked list with two fields having self-explanatory names, insertTS and retireTS. These fields are set to the current maximal timestamp plus one when a node is inserted into or deleted from the list, respectively.

4.2 Anchor maintenance

Anchor maintenance is carried out when threads traverse the list looking for a particular key. The simplified pseudo-code for this traversal is part of the find method given in the full version of this paper [2]. Recall that this method is used by all list operations in [7]. A thread counts the number of list nodes it has passed through and updates its anchor every anchor threshold nodes (where anchor threshold is some preset number). The anchor points to the first node in the list that can be accessed by the thread (which is the node pointed to by prev in the find method). Anchor updates are made in the auxiliary setAnchor function, also shown in [2]. An anchor update may fail for a thread t_i if some other thread t_j has marked the timestampAndAnchor field of t_i as stuck, as explained in Section 4.4. It is important to note that the actual update of the anchor is done with CAS (and not with a simple write operation) to avoid races with concurrently running threads that might suspect t_i of being stuck and try to set the stuck bit in t_i's timestampAndAnchor. From a performance standpoint, however, the write of a hazard pointer, made on accessing every node, requires an expensive memory fence right after it [14, 8]. In contrast, the CAS in our approach is performed only every anchor threshold node accesses, and its amortized cost is negligible. A sketch of this counting traversal appears below.
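The following C fragment sketches the counting traversal; the node layout, the bit masks, and the helper signatures are illustrative assumptions (the actual find and setAnchor appear in [2]):

#include <stdint.h>
#include <stdbool.h>

#define DELETE_BIT 0x1UL   /* mark of [7]: node is logically deleted */
#define FREEZE_BIT 0x2UL   /* node belongs to a part being recovered */
#define ANCHOR_THRESHOLD 1000

typedef struct Node { uintptr_t next; long key; /* insertTS, retireTS, ... */ } Node;

static Node *clearMarks(uintptr_t p) {
    return (Node *)(p & ~(DELETE_BIT | FREEZE_BIT));
}

extern bool setAnchor(int tid, Node *anchor);  /* CAS into timestampAndAnchor */
extern void helpRecovery(void);

/* Traverse towards `key`, dropping the anchor once per ANCHOR_THRESHOLD
 * visited nodes instead of once per node as hazard pointers do. */
static Node *find(int tid, Node *head, long key) {
    Node *prev = head, *curr;
    unsigned count = 0;
    while ((curr = clearMarks(prev->next)) != NULL && curr->key < key) {
        if (++count == ANCHOR_THRESHOLD) {  /* drop the anchor at prev */
            if (!setAnchor(tid, prev)) {    /* fails if we were marked stuck */
                helpRecovery();
                return NULL;                /* caller restarts the operation */
            }
            count = 0;
        }
        prev = curr;
    }
    return prev;    /* node after which `key` belongs */
}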
We note that the find function is allowed to traverse the frozen nodes of the list. A node is frozen if the second least significant bit in its next pointer is turned on (while the first least significant bit is used to mark the node as deleted [7]). If there is a need to update the next pointer of a frozen node, the update operation fails (as in [3]) and retries after invoking the helpRecovery method (pseudo-code can be found in [2]). As its name suggests, the latter method is used to help the recovery process of some stuck thread. This method is also called when a thread fails to update its anchor in setAnchor. Finally, we note that at any time instant, list operations have references to at most two adjacent list nodes. (Recall that for the linked list data structure two hazard pointers are required [14].) As we require that a stuck thread be able to access nodes only between its current anchor and (but not including) the next potential anchor, the anchor threshold parameter for linked lists has to be at least 2.

4.3 Node reclamation

When a thread t_i removes a node from the list, it calls the retireNode method, which sets the deletion timestamp of the node (i.e., the retireTS field) to the current maximal timestamp plus one. Then, similarly to [14], the retireNode method adds the deleted node to a reclamation buffer. The latter is simply a local linked list (cf. Listing 1) where t_i stores nodes deleted from the list data structure but not deallocated yet. When the size of the buffer reaches a predefined bound (controlled by the retire threshold parameter), t_i runs through the buffer and deallocates all nodes with a retire timestamp smaller than the current minimal timestamp (plus an additional condition elaborated in Section 4.4). Note that if the deletion time of a node n is smaller than the timestamp of a thread t_j, then t_j started its last operation on the list after n was removed from the list; thus, t_j will never access n. Obviously, if this holds for every t_j, it is safe to deallocate n. A sketch of this scan appears below.

When t_i finds that some thread t_j exists such that the timestamp of t_j is smaller than or equal to the timestamp of one of the nodes in t_i's reclamation buffer, t_i stores the ID of that thread (i.e., j) in the minTimestampThreadID field of its local memory management record (cf. Listing 1) and t_j's timestamp in the minTimestamp field of that record. It also sets the minTimestampThreadCnt field to 1. It is important to note that if several threads have the same minimal timestamp, t_i will store the smallest ID in minTimestampThreadID. This ensures that even if several threads are stuck with the same timestamp, all threads will consider the same thread in the recovery procedure.
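A compact C sketch of the scan over the reclamation buffer follows; the buffer representation and the helper names (minRunningTS, maxRecoveredTS, isFrozen) are illustrative assumptions, and the lower bound anticipates the refined condition of Section 4.5:

#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Node Node;
extern uint64_t retireTSof(Node *n);
extern bool isFrozen(Node *n);
extern uint64_t minRunningTS(void);    /* MIN over threads in the running state */
extern uint64_t maxRecoveredTS(void);  /* MAX over threads in the recovered state */

/* Deallocate every buffered node whose retirement provably precedes all
 * pending operations (and follows all recoveries, cf. Section 4.5). */
static void scanBuffer(Node **buf, int *n) {
    uint64_t lo = maxRecoveredTS(), hi = minRunningTS();
    for (int i = 0; i < *n; ) {
        uint64_t rts = retireTSof(buf[i]);
        if (!isFrozen(buf[i]) && lo < rts && rts < hi) {
            free(buf[i]);
            buf[i] = buf[--*n];   /* compact the buffer */
        } else {
            i++;                  /* keep; track the blocking thread here */
        }
    }
}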

On later scans of the reclamation buffer, if t_i finds that the thread t_j (whose ID is stored in t_i's minTimestampThreadID) still has the same timestamp, t_i increases the minTimestampThreadCnt counter. Once the counter reaches the predefined recovery threshold parameter, t_i suspects that t_j has failed and starts the recovery procedure described in Section 4.4.

4.4 Recovery procedure

The recovery procedure is invoked in one of the following three cases. First, it is invoked by a thread t_i that tries to deallocate an object n from its reclamation buffer, but repeatedly finds a running thread t_j whose timestamp remains smaller than or equal to the timestamp of n (cf. Section 4.3). The second case is when a thread t_i tries to modify the next pointer of one of the nodes in the list, but finds that this node is frozen (cf. Section 4.2). Finally, the third case happens when a thread tries to update its anchor by modifying its timestampAndAnchor field, but finds that some other thread turned the stuck bit on in this field (cf. Section 4.2). In the two last cases, t_i invokes the helpRecovery method. There, t_i scans through the global records of the threads, looking for a thread t_j with the stuck bit turned on in t_j's timestampAndAnchor field.

The recovery procedure consists of four phases (the code can be found in [2]). We explain these phases using the example in Figure 2. Assume that at some point in time the list data structure is in the state depicted in Figure 2(a), and thread t_0 decides to recover the list from the failure of thread t_1. Before invoking the first phase of the recovery procedure, t_0 stores locally the current value of t_1's timestampAndAnchor field. Then, in the first phase of the recovery procedure, t_0 attempts to modify t_1's timestampAndAnchor field by turning the stuck bit on using a CAS operation (cf. Figure 2(b)). If this operation fails, t_0 rereads t_1's timestampAndAnchor field and checks whether it was marked as stuck by some other thread. If not, it aborts the recovery procedure (since either t_1 is actually alive and has modified its timestampAndAnchor field, or some other thread, i.e., t_2, has finished the recovery of t_1 and, as we will see later, turned both the stuck and idle bits on). Otherwise, if the CAS operation that turns the stuck bit on succeeds, or if it fails but t_1 is marked as stuck by another thread, t_0 proceeds to the second phase.

In the second phase of the recovery procedure, t_0 freezes and copies all nodes that t_1 might access if t_1 revived and traversed the list until realizing, at the next anchor update, that its anchor is marked as stuck. To identify such nodes, t_0 extracts t_1's anchor pointer out of the value stored in t_1's timestampAndAnchor field (which points to node 25 in our example in Figure 2(a)). Then, t_0 starts setting the freeze-bit in the next pointers of reachable nodes, starting from node 25. It copies the frozen nodes (with the freeze-bit set off) into a new list. Note that some of the nodes may already be deleted from the list (e.g., nodes 25, 27 and 42 in Figure 2), but not disconnected or reclaimed yet. Such nodes are frozen, but they do not enter the new copied part of the list. The thread t_0 keeps freezing and copying until it passes through anchor threshold nodes having an insertion timestamp smaller than the value of t_1's lowTimestamp field. In our example in Figure 2, let us assume that these are nodes 25, 27, 40, and 42.
Note that in order to traverse those nodes, t_0 has to update its own anchor to be the same as t_1's anchor, so that a failure of t_0 during the recovery procedure can be handled as well. At the end of the second phase the list looks as depicted in Figure 2(c). The pseudo-code for how we freeze the nodes and create the copies can be found in [2].

In the third phase, t_0 attempts to replace the frozen nodes with the locally copied part of the list. To this end, it runs from the beginning of the list data structure and looks for the first (not-deleted) node m whose next pointer either points to the not-deleted frozen nodes, or is followed by a sequence of one or more deleted nodes such that the next pointer of the last node in the sequence points to a not-deleted frozen node (in Figure 2(c), m is node 12). If no such m is found by the time the end of the list data structure is reached, t_0 finishes this phase, as it may assume that some other thread has replaced the frozen part of the list with the new list created by that thread. Otherwise, t_0 attempts to update m's next pointer to point to the corresponding copied node in the new list. If it fails, it restarts this phase from the beginning. Otherwise, t_0 inserts all nodes between m and the first frozen node (i.e., node 20 in Figure 2(c)) into its reclamation buffer in order to deallocate them later, bringing the list to the state exhibited in Figure 2(d). The code of the procedure for replacing frozen nodes can also be found in [2]. One subtlety that is left out of the code for lack of space is the verification that the new local list indeed matches the frozen nodes being replaced. It is crucial to ensure that if a thread running the recovery procedure gets delayed, it does not replace another frozen part of the list when it resumes. To this end, we record the sources of the new nodes when they are copied, and CAS the new list into the data structure only if it replaces the adequate original nodes that can be found in the recorded sources.

In the final, fourth phase, t_0 sets the idle bit in t_1's timestampAndAnchor field, marking t_1 as recovered. Additionally, t_0 promotes t_1's timestamp, recording the (logical) time at which t_1 was recovered (cf. Figure 2(e)). Note that t_0 does not need to check whether its CAS has succeeded, since if it hasn't, some other thread has performed this operation. We refer to the timestamp of a thread with the idle and stuck flags turned on as a recovery timestamp.

4.5 The refined reclamation procedure

Thread t_1, considered stuck, might actually have a pointer to a node n which is no longer part of the list. For instance, in the state of the list shown in Figure 2(a), t_1 might be stopped while inspecting node 25 (or 27). If this node is currently in the reclamation buffer of some other thread t_k (i.e., t_0 or t_2), and if t_k does not consider t_1 after the recovery is done (i.e., t_k only checks that node 25's retireTS is smaller than the timestamp of any running thread), t_k might deallocate node 25, and t_1 might erroneously access this memory if and when it revives. Note that node 27 may already be unreachable from the node pointed to by t_1's anchor by the time of t_1's recovery if, e.g., the next pointer of node 25 was updated while t_1 was inspecting node 27. In this case, node 27 will not be frozen and copied at all.
In order to cope with such situations, before deallocating a node we require its retire timestamp (retireTS) to be larger than the timestamp of any thread in the recovered state (in addition to being smaller than the timestamp of any running thread). This way we prevent nodes that were removed from the list before some thread got recovered from being deallocated, as the thread being recovered might hold a pointer to such a node.

[Figure 2: Recovery phases: (a) the state of the list before the recovery is invoked; (b) phase 1; (c) phase 2; (d) phase 3; (e) phase 4. Nodes marked with x are deleted, i.e., the delete-bit of their next pointer is turned on [7]. Shaded nodes are frozen, i.e., the freeze-bit of their next pointer is turned on.]

When (and if) the recovered thread becomes running again, it will be possible to reclaim those nodes. To summarize, when a thread wants to deallocate a node n, it checks that n is not frozen (i.e., the freeze-bit in its next pointer is not set) and that the following condition holds:

MAX({timestamp of t_x | t_x is recovered}) < n.retireTS < MIN({timestamp of t_x | t_x is running})

It should be noted that when the calculations of the retire timestamp and the recovery timestamp are done simultaneously (by different threads), the retire timestamp can erroneously be higher than the recovery timestamp, and a wrong reclamation can happen. Therefore, when calculating the retire timestamp for a node, we require a thread to pass twice over the timestamps of the threads, verifying that no thread was marked stuck or recovered concurrently. If such thread(s) are found, the node is inserted into the reclamation stack as frozen.

For simplicity of presentation, in the algorithm described above frozen nodes are not reclaimed. Such nodes can only appear if threads fail, and such a solution may be acceptable. However, frozen nodes can easily be reclaimed for recovered threads that have resumed operation. A recovered thread can reclaim nodes according to the recovery timestamps. Also, a frozen node that appears in a reclamation stack can be reclaimed using its retireTS field and the lowTimestamp field of all stuck threads. Details are omitted.

4.6 Correctness argument

In this section we give high-level arguments behind the proofs of correctness. We start by outlining the assumed memory model and defining linearization points for the modified list operations. Then we argue that any internal node deleted after the last recovery was finished (or deleted at any time, if no thread has been suspected of being stuck) will eventually be reclaimed (we call this property eventual conditional reclamation). Next, we argue that our technique guarantees the safety of memory references. In other words, no thread t accesses memory that has been reclaimed since the time t obtained a reference to it. Finally, we argue that our technique is non-blocking, meaning that whenever a thread t starts the recovery procedure, then after a finite number of t's steps either t completes the recovery, or some other thread completes an operation on the list. In addition, we show that the system-wide progress with respect to the list operations is preserved, that is, after a finite number of completed recovery procedures there is at least one completed list operation. Due to space limitations, the proof sketches of all lemmas appear in the full version of this paper [2].

Model and linearizability.

Our model for concurrent multi-threaded computation follows the linearizability model of [12]. In particular, we assume an asynchronous shared memory system where n deterministic threads communicate by executing atomic operations on some finite number of shared variables. Each thread performs a sequence of steps, where in each step the thread may perform some local computation or invoke a single atomic operation on a shared variable. The atomic operations allowed in our model are reads, writes, and compare-and-swaps (CAS). The latter receives a memory address of a shared variable v and two values, old and new. It sets the value of v to new only if the value of v right before the CAS is applied is old; in this case the CAS returns true.
Otherwise, the value of v does not change and the CAS returns false. We assume that each thread has an ID, denoted tid, which is a value between 0 and n-1. In systems where tid may take values from an arbitrary range, known non-blocking renaming algorithms can be applied (e.g., [1]). In addition, we assume each thread can access its tid and n.

The original implementation of all operations of the non-blocking linked list by Harris [7] is linearizable [12]. We argue that after applying the Drop the Anchor memory management technique, all list operations remain linearizable. Recall that all list operations invoke the find method, which returns pointers to two adjacent nodes, one of which holds a value smaller than the given key (for further details, see [7]). Denote this node as prev. Furthermore, recall that list operations may invoke find several times. For instance, insert will invoke find again if the next pointer of prev has been concurrently modified (in particular, in our case, frozen). Thus, we define the linearization points for a list operation op with respect to the prev returned from the last invocation of find by op. If this prev node is not frozen (i.e., the freeze-bit of its next pointer is not set), the linearization point of op is exactly as in [7]. However, if this prev node is frozen, we set the linearization point of op at the time instant defined as follows. Consider the sequence of frozen nodes read by the corresponding find operation, starting from a frozen node m and including the (frozen) node prev (where m and prev might be the same node). The linearization point of op is defined as the latest of the two events: (a) the corresponding find traversed m (i.e., read the next pointer of the node previous to m in the list), and (b) the latest time at which some node between (and including) m and prev was inserted or marked as deleted. The intuition is that when find returns a result from a frozen part of the list, this part no longer reflects the actual state of the list at the moment the prev node is read. Thus, we have to linearize the corresponding operation at some earlier time instant, at which the nodes read by find are still consistent with the actual keys stored in the list.

Eventual conditional reclamation.

Lemma 4.1. Let T_s and T_f be the times when a thread t starts and finishes, respectively, the call to retireNode(node), and let T_r > T_f be the time when t finishes scanning its reclamation buffer. Then at least one of the following events occurs in the time interval [T_s, T_r]:

1. Some thread remains running throughout [T_f, T_r], and its timestamp changes at most once in [T_f, T_r].

2. Some thread becomes recovered at some point in time in [T_s, T_r].

3. The memory allocated to node is reclaimed by the time T_r.

Based on the lemma above, we prove that when a thread removes a node from the list, then as long as that thread keeps applying (delete) operations on the list, and particularly bad things do not happen to other threads (e.g., they are not suspected to have failed), the memory of that node will eventually be reclaimed.

[Figure 3: Drop the Anchor vs. Hazard Pointers for lists with the initial size of 1k keys, the read-only workload results. (a) The total running time comparison for searches only. (b) Memory management overhead relative to delayed reclamation for searches only. The compared implementations are HP (with a fence instruction), Anchor every 2 nodes, Anchor every 100 nodes, Anchor every 1000 nodes, and delayed reclamation, for varying numbers of threads.]

Lemma 4.2. Let T_s be the time when a thread t starts the call to retireNode(node). Then node will eventually be reclaimed, as long as t keeps removing nodes from the list and there is no thread that is stuck or recovered at or after T_s.

Note that even if some thread t_x gets stuck or recovered after T_s as above, it may have an impact only on nodes removed before (or concurrently with) t_x's recovery.

Safety of memory references.

First, we prove that access to any node that can be reached from t's anchor (for any thread t) is safe, i.e., such a node cannot be reclaimed.

Lemma 4.3. No node reachable from an anchor of some thread can be reclaimed.

Using the lemma above, we show that with the Drop the Anchor memory management, no thread will access reclaimed memory.

Lemma 4.4. No thread t accesses memory that has been reclaimed since the time t obtained a reference to it.

Progress guarantees.

The original implementation of all operations of the non-blocking linked list by Harris [7] is lock-free [12]. We argue that after applying the Drop the Anchor memory management technique, all list operations remain lock-free. We say that a thread t_i starts the recovery of a thread t_j when t_i sets the stuck bit on in t_j's timestampAndAnchor field. Similarly, we say a thread t_i completes the recovery of a stuck thread t_j when t_i sets the idle bit on in t_j's timestampAndAnchor field.

Lemma 4.5. If a thread t_i starts the recovery of t_j at T_s, then the recovery of t_j will be completed at some T_f > T_s (by possibly another thread t_k), and/or infinitely many list operations will be linearized after T_s.

Next, we show that despite recovery operations, system-wide progress is preserved, i.e., threads never keep recovering one another forever without completing list operations.

Lemma 4.6. Consider n+1 recovery operations completed at times T_1 < T_2 < ... < T_{n+1}. Then there must be at least one list (delete) operation linearized in the time interval [T_1, T_{n+1}].

5. PERFORMANCE EVALUATION

We have implemented the non-blocking linked list data structure of Harris [7] with several memory management techniques. First, we have implemented the Hazard Pointers technique following the pseudo-code presented in [14], but with an additional memory fence instruction added just after the write of a new value to the hazard pointer of a thread [8]. Second, we have implemented our new Drop the Anchor technique presented in this paper. Finally, we have also implemented a simple technique, where nodes removed by a thread t from the list are added to t's reclamation stack and reclaimed later, once 64 nodes are collected in the reclamation stack. We refer to this implementation as delayed reclamation.
We note that this scheme is incorrect, in the sense that it allows threads to access deallocated memory, but we use this implementation to represent a memory management scheme with minimal performance impact. All our implementations were coded in C and compiled with the -O3 optimization level. We have run our experiments on a machine with two AMD Opteron(TM) multicore processors, operated by Linux OS (Ubuntu 12.04). We have varied the number of threads between 1 and 40, slightly above the number of threads that can run concurrently on this machine (32). If not stated otherwise, each test starts by building an initial list with 1k random keys. After that, we measure the total time of 2k operations divided equally between all threads. The keys for searches and insertions are randomly chosen 32-bit keys. For deletion operations, we ensure that randomly chosen keys actually exist in the list, in order to make the reclamation process substantial. The values of recovery threshold and retire threshold were always 64. All threads are synchronized to start their operations immediately after the initial list is built, and we measure the time it takes to complete all operations by all threads.

[Figure 4: Drop the Anchor vs. Hazard Pointers for lists with the initial size of 1k keys, the mixed workload results. (a) The total running time comparison for 20% inserts, 20% deletes and 60% searches. (b) Memory management overhead relative to delayed reclamation for 20% inserts, 20% deletes and 60% searches. The compared implementations and thread counts are as in Figure 3.]

We run each test 10 times and present the average results. The variance of all reported results is below 1.5% of the average.

Figures 3 and 4 show the measurements of the total time required to complete our benchmark using the HP memory management, the delayed reclamation and the Drop the Anchor method. For the latter, we have used three versions with different values for the anchor threshold parameter. Specifically, in the first version the anchor is dropped every second node (which is the lowest legitimate value for anchor threshold in the case of the linked list), in the second version the anchor is dropped every 100 nodes, and in the third every 1000 nodes. We show the results for the read-only workload, where all operations are searches (Figure 3(a)), and for the mixed workload, where 20% of all operations are insertions, 20% are deletions, and the remaining 60% are searches (Figure 4(a)).

Our measurements show that in the mixed workload the Drop the Anchor-based implementation is about 15-25% faster than the HP-based one, even if the anchor is dropped every second node. When increasing the anchor threshold parameter from 2 to 1000, we get an even higher improvement of 45% over the performance of HPs. In the read-only workload we can see an even better performance improvement (40% on average) due to the usage of anchors compared to HPs (cf. Figure 3(a)). Finally, we can see that for a substantial number of threads, the performance of the Drop the Anchor-based linked list is very close to that of the linked list implementation based on the simple delayed reclamation. This suggests that the amortized cost of the memory management in the Drop the Anchor technique is very small.

Additionally, Figures 3(b) and 4(b) present the relative performance ratio of each memory management technique explained above, compared to delayed reclamation. A ratio close to 1 means that the memory management technique adds almost no overhead over delayed reclamation. The HP memory management shows a 40-55% slowdown, whereas the Anchor-based implementation shows a 7-10% slowdown for anchors dropped every 1000 nodes, and a 20-25% slowdown for anchors dropped every 2 nodes, all compared to the delayed reclamation results.

In another set of experiments, we measure the impact of the initial size of the list on the performance of the HP-based and Drop the Anchor-based implementations, while the number of threads is constant (16) and the workload is mixed. The results are depicted in Figure 5(a). It can be seen that the running time of both implementations increases linearly with the size of the list, as threads need to traverse more nodes per operation on average.
The slope of the HP-based implementation is much steeper, however, suggesting that the overhead introduced by fences is much more significant than the cost of the anchor management.

In Figure 5(b) we can see the performance impact of the recovery procedure in the Drop the Anchor technique. We use the version of the technique with the anchor threshold value equal to 1000, for a more significant impact. We explicitly delay one of the threads, thus causing this thread to be considered stuck and recovered by the other threads. The stuck thread returns to run after 2 seconds, and the presented total time is measured until all threads finish their runs. The results show that the recovery procedure has a 15-50% impact on the performance, even when the anchor threshold value is high. In any case, the performance of the Anchor-based implementation (with the delay and recovery) is much better than that of the HP-based one.

6. DISCUSSION

We presented a new method for memory management of non-blocking data structures called Drop the Anchor. Drop the Anchor is a novel combination of the time-stamping method (which cannot handle thread failures) with the anchor and freezing techniques, which provide a fallback allowing reclamation even when threads fail. Non-blocking algorithms must be robust to thread failures, and so coping with thread failures in the memory manager is crucial.

We have applied Drop the Anchor to the common non-blocking linked list implementation and compared it with the standard Hazard Pointers method.

[Figure 5: Drop the Anchor vs. Hazard Pointers for lists with different initial sizes, and the recovery performance impact. (a) Total time when 16 threads are run on lists with different initial sizes; in the Anchor-based implementation, the anchor is dropped every 1000 nodes. (b) The impact of the recovery on the performance of lists with the initial size of 1k keys, comparing HP (with a fence instruction), Anchor every 1000 nodes with recovery, and Anchor every 1000 nodes.]

Measurements show that Drop the Anchor drastically reduces the memory management overhead, while robustly reclaiming objects in all executions.

We believe our technique can be applied to other non-blocking data structures. Specifically, assume a data structure represented by a directed graph, where vertices correspond to internal nodes and edges correspond to pointers between these nodes. When recovering a thread t, we need to freeze and copy the sub-graph containing all internal nodes at a distance, which depends on the anchor threshold parameter, from the node pointed to by t's anchor. Essentially, although the copying operation might be expensive and even involve the whole data structure, the scalability bottleneck associated with the memory fences will be removed from the common node access step.

7. ACKNOWLEDGMENTS

We want to thank Tim Harris for his helpful comments on an earlier version of this paper.

8. REFERENCES

[1] H. Attiya and A. Fouren. Adaptive and efficient algorithms for lattice agreement and renaming. SIAM J. Comput., 31(2), 2001.

[2] A. Braginsky, A. Kogan, and E. Petrank. Drop the anchor: Lightweight memory management for non-blocking data structures (full version). erez/papers/DropTheAnchorFull.pdf.

[3] A. Braginsky and E. Petrank. Locality-conscious lock-free linked lists. In ICDCN, 2011.

[4] D. Detlefs, P. A. Martin, M. Moir, and G. L. Steele. Lock-free reference counting. Distributed Computing, 15(4), 2002.

[5] A. Dragojevic, M. Herlihy, Y. Lev, and M. Moir. On the power of hardware transactional memory to simplify memory management. In PODC, pages 99-108, 2011.

[6] A. Gidenstam, M. Papatriantafilou, H. Sundell, and P. Tsigas. Efficient and reliable lock-free memory reclamation based on reference counting. TPDS, 20(8), 2009.

[7] T. L. Harris. A pragmatic implementation of non-blocking linked-lists. In DISC, pages 300-314, 2001.

[8] T. E. Hart, P. E. McKenney, A. D. Brown, and J. Walpole. Performance of memory reclamation for lockless synchronization. Journal of Parallel and Distributed Computing, 67(12), 2007.

[9] M. Herlihy. Wait-free synchronization. ACM Trans. Program. Lang. Syst., 13(1), 1991.

[10] M. Herlihy, V. Luchangco, P. Martin, and M. Moir. Nonblocking memory management support for dynamic-sized data structures. ACM Trans. Comput. Syst., 23(2), 2005.

[11] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., 2008.

[12] M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463-492, 1990.

[13] P. E. McKenney and J. D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. In PDCS, 1998.

[14] M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. TPDS, 15(6):491-504, 2004.

[15] H. Sundell. Wait-free reference counting and memory management. In IPDPS, 2005.
[16] J. D. Valois. Lock-free linked lists using compare-and-swap. In PODC, pages 214-222, 1995.


More information

Reuse, Don t Recycle: Transforming Lock-Free Algorithms That Throw Away Descriptors

Reuse, Don t Recycle: Transforming Lock-Free Algorithms That Throw Away Descriptors Reuse, Don t Recycle: Transforming Lock-Free Algorithms That Throw Away Descriptors Maya Arbel-Raviv 1 and Trevor Brown 2 1 Technion, Computer Science Department, Haifa, Israel mayaarl@cs.technion.ac.il

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

The Universality of Consensus

The Universality of Consensus Chapter 6 The Universality of Consensus 6.1 Introduction In the previous chapter, we considered a simple technique for proving statements of the form there is no wait-free implementation of X by Y. We

More information

Allocating memory in a lock-free manner

Allocating memory in a lock-free manner Allocating memory in a lock-free manner Anders Gidenstam, Marina Papatriantafilou and Philippas Tsigas Distributed Computing and Systems group, Department of Computer Science and Engineering, Chalmers

More information

Automatic Memory Reclamation for Lock-Free Data Structures

Automatic Memory Reclamation for Lock-Free Data Structures Automatic Memory Reclamation for Lock-Free Data Structures Nachshon Cohen Technion Institute of Technology, Israel nachshonc@gmail.com Erez Petrank Technion Institute of Technology, Israel erez@cs.technion.ac.il

More information

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design A Thesis Submitted in partial fulfillment of the requirements for the degree of Master of Technology by Pijush Chakraborty (153050015)

More information

Hazard Pointers. Safe Resource Reclamation for Optimistic Concurrency

Hazard Pointers. Safe Resource Reclamation for Optimistic Concurrency Document Number: P0233R1 Date: 2016 05 29 Reply to: maged.michael@acm.org, michael@codeplay.com Authors: Maged M. Michael, Michael Wong Project: Programming Language C++, SG14/SG1 Concurrency, LEWG Hazard

More information

ACONCURRENT system may be viewed as a collection of

ACONCURRENT system may be viewed as a collection of 252 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 10, NO. 3, MARCH 1999 Constructing a Reliable Test&Set Bit Frank Stomp and Gadi Taubenfeld AbstractÐThe problem of computing with faulty

More information

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei Non-blocking Array-based Algorithms for Stacks and Queues Niloufar Shafiei Outline Introduction Concurrent stacks and queues Contributions New algorithms New algorithms using bounded counter values Correctness

More information

Fork Sequential Consistency is Blocking

Fork Sequential Consistency is Blocking Fork Sequential Consistency is Blocking Christian Cachin Idit Keidar Alexander Shraer May 14, 2008 Abstract We consider an untrusted server storing shared data on behalf of clients. We show that no storage

More information

A Skiplist-based Concurrent Priority Queue with Minimal Memory Contention

A Skiplist-based Concurrent Priority Queue with Minimal Memory Contention A Skiplist-based Concurrent Priority Queue with Minimal Memory Contention Jonatan Lindén and Bengt Jonsson Uppsala University, Sweden December 18, 2013 Jonatan Lindén 1 Contributions Motivation: Improve

More information

Concurrent Data Structures Concurrent Algorithms 2016

Concurrent Data Structures Concurrent Algorithms 2016 Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1 Data Structures (DSs) Constructs for efficiently storing and retrieving

More information

Fork Sequential Consistency is Blocking

Fork Sequential Consistency is Blocking Fork Sequential Consistency is Blocking Christian Cachin Idit Keidar Alexander Shraer Novembe4, 008 Abstract We consider an untrusted server storing shared data on behalf of clients. We show that no storage

More information

arxiv: v1 [cs.pl] 20 Aug 2018

arxiv: v1 [cs.pl] 20 Aug 2018 1 Every Data Structure Deserves Lock-Free Memory Reclamation NACHSHON COHEN, EPFL, Switzerland arxiv:1808.06348v1 [cs.pl] 20 Aug 2018 Memory-management support for lock-free data structures is well known

More information

A Non-Blocking Concurrent Queue Algorithm

A Non-Blocking Concurrent Queue Algorithm A Non-Blocking Concurrent Queue Algorithm Bruno Didot bruno.didot@epfl.ch June 2012 Abstract This report presents a new non-blocking concurrent FIFO queue backed by an unrolled linked list. Enqueue and

More information

Nesting-Safe Recoverable Linearizability: Modular Constructions for Non-Volatile Memory

Nesting-Safe Recoverable Linearizability: Modular Constructions for Non-Volatile Memory Nesting-Safe Recoverable Linearizability: Modular Constructions for Non-Volatile Memory Hagit Attiya Department of Computer Science, Technion hagit@cs.technion.ac.il Ohad Ben-Baruch Department of Computer

More information

A Wait-free Multi-word Atomic (1,N) Register for Large-scale Data Sharing on Multi-core Machines

A Wait-free Multi-word Atomic (1,N) Register for Large-scale Data Sharing on Multi-core Machines A Wait-free Multi-word Atomic (1,N) Register for Large-scale Data Sharing on Multi-core Machines Mauro Ianni, Alessandro Pellegrini DIAG Sapienza Università di Roma, Italy Email: {mianni,pellegrini}@dis.uniroma1.it

More information

Experience with Model Checking Linearizability

Experience with Model Checking Linearizability Experience with Model Checking Linearizability Martin Vechev, Eran Yahav, and Greta Yorsh IBM T.J. Watson Research Center Non-blocking concurrent algorithms offer significant performance advantages, but

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

6.852: Distributed Algorithms Fall, Class 21

6.852: Distributed Algorithms Fall, Class 21 6.852: Distributed Algorithms Fall, 2009 Class 21 Today s plan Wait-free synchronization. The wait-free consensus hierarchy Universality of consensus Reading: [Herlihy, Wait-free synchronization] (Another

More information

From Lock-Free to Wait-Free: Linked List. Edward Duong

From Lock-Free to Wait-Free: Linked List. Edward Duong From Lock-Free to Wait-Free: Linked List Edward Duong Outline 1) Outline operations of the locality conscious linked list [Braginsky 2012] 2) Transformation concept from lock-free -> wait-free [Timnat

More information

Stamp-it: A more Thread-efficient, Concurrent Memory Reclamation Scheme in the C++ Memory Model

Stamp-it: A more Thread-efficient, Concurrent Memory Reclamation Scheme in the C++ Memory Model Stamp-it: A more Thread-efficient, Concurrent Memory Reclamation Scheme in the C++ Memory Model Manuel Pöter TU Wien, Faculty of Informatics Vienna, Austria manuel@manuel-poeter.at Jesper Larsson Träff

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL 251 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

Proposed Wording for Concurrent Data Structures: Hazard Pointer and Read Copy Update (RCU)

Proposed Wording for Concurrent Data Structures: Hazard Pointer and Read Copy Update (RCU) Document number: D0566R1 Date: 20170619 (pre Toronto) Project: Programming Language C++, WG21, SG1,SG14, LEWG, LWG Authors: Michael Wong, Maged M. Michael, Paul McKenney, Geoffrey Romer, Andrew Hunter

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

Linearizability of Persistent Memory Objects

Linearizability of Persistent Memory Objects Linearizability of Persistent Memory Objects Michael L. Scott Joint work with Joseph Izraelevitz & Hammurabi Mendes www.cs.rochester.edu/research/synchronization/ Workshop on the Theory of Transactional

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 31 October 2012

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 31 October 2012 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 31 October 2012 Lecture 6 Linearizability Lock-free progress properties Queues Reducing contention Explicit memory management Linearizability

More information

A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm

A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm Appears as Technical Memo MIT/LCS/TM-590, MIT Laboratory for Computer Science, June 1999 A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm Miguel Castro and Barbara Liskov

More information

A Simple Optimistic skip-list Algorithm

A Simple Optimistic skip-list Algorithm A Simple Optimistic skip-list Algorithm Maurice Herlihy Brown University & Sun Microsystems Laboratories Yossi Lev Brown University & Sun Microsystems Laboratories yosef.lev@sun.com Victor Luchangco Sun

More information

A simple correctness proof of the MCS contention-free lock. Theodore Johnson. Krishna Harathi. University of Florida. Abstract

A simple correctness proof of the MCS contention-free lock. Theodore Johnson. Krishna Harathi. University of Florida. Abstract A simple correctness proof of the MCS contention-free lock Theodore Johnson Krishna Harathi Computer and Information Sciences Department University of Florida Abstract Mellor-Crummey and Scott present

More information

CS 167 Final Exam Solutions

CS 167 Final Exam Solutions CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are

More information

Fine-grained synchronization & lock-free programming

Fine-grained synchronization & lock-free programming Lecture 17: Fine-grained synchronization & lock-free programming Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Minnie the Moocher Robbie Williams (Swings Both Ways)

More information

The Relative Power of Synchronization Methods

The Relative Power of Synchronization Methods Chapter 5 The Relative Power of Synchronization Methods So far, we have been addressing questions of the form: Given objects X and Y, is there a wait-free implementation of X from one or more instances

More information

Eliminating Synchronization Bottlenecks Using Adaptive Replication

Eliminating Synchronization Bottlenecks Using Adaptive Replication Eliminating Synchronization Bottlenecks Using Adaptive Replication MARTIN C. RINARD Massachusetts Institute of Technology and PEDRO C. DINIZ University of Southern California This article presents a new

More information

On the Importance of Synchronization Primitives with Low Consensus Numbers

On the Importance of Synchronization Primitives with Low Consensus Numbers On the Importance of Synchronization Primitives with Low Consensus Numbers ABSTRACT Pankaj Khanchandani ETH Zurich kpankaj@ethz.ch The consensus number of a synchronization primitive is the maximum number

More information

Scalable Correct Memory Ordering via Relativistic Programming

Scalable Correct Memory Ordering via Relativistic Programming Scalable Correct Memory Ordering via Relativistic Programming Josh Triplett Portland State University josh@joshtriplett.org Philip W. Howard Portland State University pwh@cs.pdx.edu Paul E. McKenney IBM

More information

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What

More information

Wait-Free Regular Storage from Byzantine Components

Wait-Free Regular Storage from Byzantine Components Wait-Free Regular Storage from Byzantine Components Ittai Abraham Gregory Chockler Idit Keidar Dahlia Malkhi July 26, 2006 Abstract We consider the problem of implementing a wait-free regular register

More information

Performance of memory reclamation for lockless synchronization

Performance of memory reclamation for lockless synchronization J. Parallel Distrib. Comput. 67 (27) 127 1285 www.elsevier.com/locate/jpdc Performance of memory reclamation for lockless synchronization Thomas E. Hart a,,1, Paul E. McKenney b, Angela Demke Brown a,

More information

Design of Concurrent and Distributed Data Structures

Design of Concurrent and Distributed Data Structures METIS Spring School, Agadir, Morocco, May 2015 Design of Concurrent and Distributed Data Structures Christoph Kirsch University of Salzburg Joint work with M. Dodds, A. Haas, T.A. Henzinger, A. Holzer,

More information

Thin Locks: Featherweight Synchronization for Java

Thin Locks: Featherweight Synchronization for Java Thin Locks: Featherweight Synchronization for Java D. Bacon 1 R. Konuru 1 C. Murthy 1 M. Serrano 1 Presented by: Calvin Hubble 2 1 IBM T.J. Watson Research Center 2 Department of Computer Science 16th

More information

Concurrency. Glossary

Concurrency. Glossary Glossary atomic Executing as a single unit or block of computation. An atomic section of code is said to have transactional semantics. No intermediate state for the code unit is visible outside of the

More information

Mutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code

Mutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code Mutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code Rotem Dvir 1 and Gadi Taubenfeld 2 1 The Interdisciplinary Center, P.O.Box 167, Herzliya 46150, Israel rotem.dvir@gmail.com

More information

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime Runtime The optimized program is ready to run What sorts of facilities are available at runtime Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis token stream Syntactic

More information

A ThreadScan: Automatic and Scalable Memory Reclamation

A ThreadScan: Automatic and Scalable Memory Reclamation A ThreadScan: Automatic and Scalable Memory Reclamation DAN ALISTARH, ETH Zurich WILLIAM LEISERSON, MIT ALEXANDER MATVEEV, MIT NIR SHAVIT, MIT 1. INTRODUCTION An important principle for data structure

More information

Concurrent Manipulation of Dynamic Data Structures in OpenCL

Concurrent Manipulation of Dynamic Data Structures in OpenCL Concurrent Manipulation of Dynamic Data Structures in OpenCL Henk Mulder University of Twente P.O. Box 217, 7500AE Enschede The Netherlands h.mulder-1@student.utwente.nl ABSTRACT With the emergence of

More information

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Give your application the ability to register callbacks with the kernel. by Frédéric Rossi In a previous article [ An Event Mechanism

More information

Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory Multiprocessors

Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory Multiprocessors Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory Multiprocessors Emery D. Berger Robert D. Blumofe femery,rdbg@cs.utexas.edu Department of Computer Sciences The University of Texas

More information

Garbage Collection (2) Advanced Operating Systems Lecture 9

Garbage Collection (2) Advanced Operating Systems Lecture 9 Garbage Collection (2) Advanced Operating Systems Lecture 9 Lecture Outline Garbage collection Generational algorithms Incremental algorithms Real-time garbage collection Practical factors 2 Object Lifetimes

More information

k-abortable Objects: Progress under High Contention

k-abortable Objects: Progress under High Contention k-abortable Objects: Progress under High Contention Naama Ben-David 1, David Yu Cheng Chan 2, Vassos Hadzilacos 2, and Sam Toueg 2 Carnegie Mellon University 1 University of Toronto 2 Outline Background

More information

Semi-Passive Replication in the Presence of Byzantine Faults

Semi-Passive Replication in the Presence of Byzantine Faults Semi-Passive Replication in the Presence of Byzantine Faults HariGovind V. Ramasamy Adnan Agbaria William H. Sanders University of Illinois at Urbana-Champaign 1308 W. Main Street, Urbana IL 61801, USA

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Self Stabilization. CS553 Distributed Algorithms Prof. Ajay Kshemkalyani. by Islam Ismailov & Mohamed M. Ali

Self Stabilization. CS553 Distributed Algorithms Prof. Ajay Kshemkalyani. by Islam Ismailov & Mohamed M. Ali Self Stabilization CS553 Distributed Algorithms Prof. Ajay Kshemkalyani by Islam Ismailov & Mohamed M. Ali Introduction There is a possibility for a distributed system to go into an illegitimate state,

More information

Concurrent Objects and Linearizability

Concurrent Objects and Linearizability Chapter 3 Concurrent Objects and Linearizability 3.1 Specifying Objects An object in languages such as Java and C++ is a container for data. Each object provides a set of methods that are the only way

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Self-stabilizing Byzantine Digital Clock Synchronization

Self-stabilizing Byzantine Digital Clock Synchronization Self-stabilizing Byzantine Digital Clock Synchronization Ezra N. Hoch, Danny Dolev and Ariel Daliot The Hebrew University of Jerusalem We present a scheme that achieves self-stabilizing Byzantine digital

More information

Interprocess Communication By: Kaushik Vaghani

Interprocess Communication By: Kaushik Vaghani Interprocess Communication By: Kaushik Vaghani Background Race Condition: A situation where several processes access and manipulate the same data concurrently and the outcome of execution depends on the

More information

Concurrent Counting using Combining Tree

Concurrent Counting using Combining Tree Final Project Report by Shang Wang, Taolun Chai and Xiaoming Jia Concurrent Counting using Combining Tree 1. Introduction Counting is one of the very basic and natural activities that computers do. However,

More information

Extensions to Barrelfish Asynchronous C

Extensions to Barrelfish Asynchronous C Extensions to Barrelfish Asynchronous C Michael Quigley michaelforrquigley@gmail.com School of Computing, University of Utah October 27, 2016 1 Abstract The intent of the Microsoft Barrelfish Asynchronous

More information

Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS

Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS Maged M. Michael IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA Abstract. The idealsemantics of the instructions

More information

General Objectives: To understand the process management in operating system. Specific Objectives: At the end of the unit you should be able to:

General Objectives: To understand the process management in operating system. Specific Objectives: At the end of the unit you should be able to: F2007/Unit5/1 UNIT 5 OBJECTIVES General Objectives: To understand the process management in operating system Specific Objectives: At the end of the unit you should be able to: define program, process and

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3.

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3. Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs 3. LOCK FREE KERNEL 309 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

Summary: Issues / Open Questions:

Summary: Issues / Open Questions: Summary: The paper introduces Transitional Locking II (TL2), a Software Transactional Memory (STM) algorithm, which tries to overcomes most of the safety and performance issues of former STM implementations.

More information

15 418/618 Project Final Report Concurrent Lock free BST

15 418/618 Project Final Report Concurrent Lock free BST 15 418/618 Project Final Report Concurrent Lock free BST Names: Swapnil Pimpale, Romit Kudtarkar AndrewID: spimpale, rkudtark 1.0 SUMMARY We implemented two concurrent binary search trees (BSTs): a fine

More information

Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc.

Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc. Chapter 1 GETTING STARTED SYS-ED/ Computer Education Techniques, Inc. Objectives You will learn: Java platform. Applets and applications. Java programming language: facilities and foundation. Memory management

More information

Design Patterns for Real-Time Computer Music Systems

Design Patterns for Real-Time Computer Music Systems Design Patterns for Real-Time Computer Music Systems Roger B. Dannenberg and Ross Bencina 4 September 2005 This document contains a set of design patterns for real time systems, particularly for computer

More information

A Concurrent Bidirectional Linear Probing Algorithm

A Concurrent Bidirectional Linear Probing Algorithm A Concurrent Bidirectional Linear Probing Algorithm Towards a Concurrent Compact Hash Table Steven van der Vegt University of Twente The Netherlands s.vandervegt@student.utwente.nl ABSTRACT Hash tables

More information

Operating Systems. Deadlock. Lecturer: William Fornaciari Politecnico di Milano Homel.deib.polimi.

Operating Systems. Deadlock. Lecturer: William Fornaciari Politecnico di Milano Homel.deib.polimi. Politecnico di Milano Operating Systems Lecturer: William Fornaciari Politecnico di Milano william.fornaciari@elet.polimi.it Homel.deib.polimi.it/fornacia Summary What is a resource Conditions for deadlock

More information

Module 1. Introduction:

Module 1. Introduction: Module 1 Introduction: Operating system is the most fundamental of all the system programs. It is a layer of software on top of the hardware which constitutes the system and manages all parts of the system.

More information

Understanding and Effectively Preventing the ABA Problem in Descriptor-based Lock-free Designs

Understanding and Effectively Preventing the ABA Problem in Descriptor-based Lock-free Designs Understanding and Effectively Preventing the ABA Problem in Descriptor-based Lock-free Designs Damian Dechev Sandia National Laboratories Scalable Computing R & D Department Livermore, CA 94551-0969 ddechev@sandia.gov

More information

arxiv: v1 [cs.dc] 8 May 2017

arxiv: v1 [cs.dc] 8 May 2017 Towards Reduced Instruction Sets for Synchronization arxiv:1705.02808v1 [cs.dc] 8 May 2017 Rati Gelashvili MIT gelash@mit.edu Alexander Spiegelman Technion sashas@tx.technion.ac.il Idit Keidar Technion

More information

Linearizability of Persistent Memory Objects

Linearizability of Persistent Memory Objects Linearizability of Persistent Memory Objects Michael L. Scott Joint work with Joseph Izraelevitz & Hammurabi Mendes www.cs.rochester.edu/research/synchronization/ Compiler-Driven Performance Workshop,

More information

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < > Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization

More information

Even Better DCAS-Based Concurrent Deques

Even Better DCAS-Based Concurrent Deques Even Better DCAS-Based Concurrent Deques David L. Detlefs, Christine H. Flood, Alexander T. Garthwaite, Paul A. Martin, Nir N. Shavit, and Guy L. Steele Jr. Sun Microsystems Laboratories, 1 Network Drive,

More information

Permissiveness in Transactional Memories

Permissiveness in Transactional Memories Permissiveness in Transactional Memories Rachid Guerraoui, Thomas A. Henzinger, and Vasu Singh EPFL, Switzerland Abstract. We introduce the notion of permissiveness in transactional memories (TM). Intuitively,

More information