Improved Implementations of Binary Universal Operations

Hagit Attiya and Eyal Dagan
The Technion

We present an algorithm for implementing binary operations (of any type) from unary load-linked (LL) and store-conditional (SC) operations. The performance of the algorithm is evaluated according to its sensitivity, measuring the distance between operations in the graph induced by conflicts which guarantees that they do not influence the step complexity of each other. The sensitivity of our implementation is O(log* n), where n is the number of processors in the system. That is, operations that are Ω(log* n) apart in the graph induced by conflicts do not delay each other. Constant sensitivity is achieved for operations used to implement heaps and array-based linked lists. We also prove that there is a problem which can be solved in O(1) steps using binary LL/SC operations, but requires Ω(log log* n) operations if only unary LL/SC operations are used. This indicates a non-constant gap between unary and binary LL/SC operations.

Categories and Subject Descriptors: C.2.4 [COMPUTER-COMMUNICATION NETWORKS]: Distributed Systems; C.4 [PERFORMANCE OF SYSTEMS]: Fault tolerance; D.1.3 [PROGRAMMING TECHNIQUES]: Concurrent Programming; D.2.12 [SOFTWARE ENGINEERING]: Interoperability, Distributed objects; D.4.1 [OPERATING SYSTEMS]: Process Management, Synchronization; F.1.2 [COMPUTATION BY ABSTRACT DEVICES]: Modes of Computation, Parallelism and concurrency

General Terms: Algorithms, Performance, Reliability, Theory

Additional Key Words and Phrases: asynchronous shared-memory systems, load-linked/store-conditional operations, universal operations, contention-sensitive algorithms, deterministic coin tossing, wait-free algorithms

An extended abstract of this paper appeared as "Universal Operations: Unary versus Binary", in Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, 1996, pp. 223-232.

Name: Hagit Attiya
Affiliation: Department of Computer Science, The Technion
Address: Haifa 32000, ISRAEL; hagit@cs.technion.ac.il

Name: Eyal Dagan
Address: Dune Networks; eyal@dunenetworks.com

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY USA, fax +1 (212) , or permissions@acm.org.

1. INTRODUCTION

Non-blocking algorithms, in which a processor is delayed only if another processor is making progress, avoid performance bottlenecks due to processors' failures or delays. In asynchronous shared-memory systems, non-blocking algorithms require the use of universal operations, such as load-linked (LL) and store-conditional (SC) [Herlihy 1991]. Writing non-blocking algorithms is easier with universal operations that can access several memory words atomically [Anderson 1994; Greenwald and Cheriton 1996; Israeli and Rappoport 1993; Massalin and Pu 1991]. However, most existing commercial architectures provide only unary operations, accessing a single memory word [May et al. 1994; Sites 1993]. Multi-word operations can be implemented using unary universal operations, e.g., [Herlihy 1991; Herlihy 1993], but these implementations are not very efficient.

The efficiency of an implementation can be evaluated in isolation, when there is no interference from other operations contending for the same memory words [Israeli and Rappoport 1994]. However, this provides no indication of how the implementation behaves in the presence of contention, when several operations compete for access to the same memory words. Clearly, if we have a "hot spot", i.e., a memory word for which contention is high, then in any implementation, some operations trying to access this word will be delayed for a long time. One can even argue that in this case, operations will be delayed even when they are supported in hardware [Anderson 1990; Pfister and Norton 1985]. However, such a hot spot should not delay "far away" operations.

This paper proposes to evaluate implementations by their sensitivity, measuring to what distance a hot spot influences the performance of other operations. Roughly stated, the sensitivity is the longest distance from one operation to another operation that influences its performance, e.g., changes the number of steps needed in order to complete the operation.

We concentrate on implementations of binary operations from unary LL/SC. Binary operations induce a conflict graph in which nodes represent memory words; there is an edge between two memory words if and only if they belong to the data set of an operation, i.e., they are accessed by the operation. A hot spot corresponds to a node with high degree. Two operations whose distance in the conflict graph is larger than the sensitivity should not interfere; that is, their step complexity is the same whether they execute in parallel or not.

We present an algorithm for implementing arbitrary binary operations from unary LL and SC operations; the sensitivity of our implementation is O(log* n). The algorithm uses LL/SC since they are supported by several contemporary architectures [May et al. 1994; Sites 1993]. The algorithm can be extended to rely on other unary universal operations; in particular, the implementation of LL/SC from compare&swap [Anderson and Moir 1999] with O(1) step complexity can be employed.

The core of the algorithm implements the binary operation in a manner similar to known algorithms [Anderson and Moir 1999; Barnes 1993; Israeli and Rappoport 1994; Shavit and Touitou 1997; Turek et al. 1992]: A processor locks the memory words in the data set of the binary operation, applies the operation, and then unlocks the data set.

Operations help each other to complete, thus ensuring that the algorithm does not block. The new feature of our algorithm is that a processor may lock its data set in two directions: either starting with the low-address word or starting with the high-address word.

The sensitivity of the core algorithm depends on the orientation of the conflict graph according to locking directions. For two common data structures, an array-based linked list and a heap, we can a priori determine locking directions which induce zero sensitivity. In general, however, processors have to dynamically decide on locking directions. This is achieved by encapsulating the core algorithm with a decision algorithm, coordinating the order in which processors lock their data sets (low-address word first or high-address word first). We introduce a synchronization method that breaks an arbitrary conflict graph into paths; in each path, we apply a decision algorithm based on the deterministic coin tossing technique of Cole and Vishkin [Cole and Vishkin 1986]. Combined with the previous algorithm, this is an implementation with O(log* n) sensitivity of arbitrary binary operations from unary LL/SC.

We also show that there is a problem which can be solved in O(1) steps using binary LL/SC operations, but requires Ω(log log* n) steps if only unary operations (of any type) are used. The proof adapts a lower bound of Linial [Linial 1992], showing that in a message-passing model a maximal independent set in an n-ring cannot be found in less than Ω(log* n) rounds. This lower bound indicates that any implementation of binary LL/SC from unary operations must incur a non-constant overhead.

Following the original publication of this work [Attiya and Dagan 1996; Dagan 1996], Afek, Merritt, Taubenfeld and Touitou [Afek et al. 1997] presented an implementation of k-word operations from unary operations; the algorithm is wait-free, guaranteeing that every operation eventually terminates. They use algorithmic ideas from our algorithm, and employ it as a base case in a recursive construction.

Herlihy and Moss [Herlihy and Moss 1993] introduce transactional memory, a hardware-based scheme for implementing arbitrary multi-word operations. Three schemes [Anderson and Moir 1999; Israeli and Rappoport 1994; Shavit and Touitou 1997] present software implementations of transactional memory from single-word atomic operations: Israeli and Rappoport [Israeli and Rappoport 1994] and Shavit and Touitou [Shavit and Touitou 1997] present non-blocking implementations of arbitrary multi-word operations using unary LL/SC, while Anderson and Moir [Anderson and Moir 1999] give a wait-free implementation of k-compare&swap and k-SC. Shavit and Touitou [Shavit and Touitou 1997] present simulation results indicating that their algorithm performs well in practice; Israeli and Rappoport [Israeli and Rappoport 1994] analyze the step complexity of an operation; Anderson and Moir [Anderson and Moir 1999] measure the step complexity of k-compare&swap and k-SC operations. All three implementations are very sensitive to contention by distant operations. For example, two operations executing on two ends of a linked list can increase each other's step complexity.

Turek, Shasha, and Prakash [Turek et al. 1992] show a method for transforming a concurrent implementation of a data structure into a non-blocking one. A process being blocked due to some lock held by another process helps the blocking process until it releases its lock; help continues recursively if the blocking process is also blocked by another process.

Barnes [Barnes 1993] presents a method for constructing non-blocking implementations of concurrent data structures. In this method, only words needed by the operation are cached into a private memory, and operations can access the data structure concurrently if they do not contend. These methods are similar to software transactional memory [Anderson and Moir 1999; Israeli and Rappoport 1994; Shavit and Touitou 1997], and their sensitivity is high. Our algorithm uses helping, as in [Anderson and Moir 1999; Barnes 1993; Israeli and Rappoport 1994; Shavit and Touitou 1997; Turek et al. 1992], but decreases the sensitivity and increases parallelism by minimizing the distance to which an operation helps.

Herlihy [Herlihy 1993] introduces a method for converting a sequential data structure into a shared wait-free one. Both Herlihy's method and its extension by Alemany and Felten [Alemany and Felten 1992] do not allow "parallelism" between concurrent operations and are inherently sequential. Anderson and Moir [Anderson and Moir 1999] present a construction that allows operations to access multiple objects atomically. Their implementation uses multi-word operations and can be employed to implement certain large shared objects, where it saves copying and allows parallelism.

Non-blocking implementations of multi-word operations induce solutions to the well-known resource-allocation problem; these solutions have short waiting chains and small failure locality [Choy and Singh 1996]. Additional discussion of the relationships between the two problems appears in [Afek et al. 1997].

2. PRELIMINARIES

2.1 The Asynchronous Shared-Memory Model

In the shared-memory model, processors p_1, ..., p_n communicate by applying memory access operations (in short, operations) to a set of memory words (in short, words), m_1, ..., m_l. Each processor p_i is modeled as a (possibly infinite) state machine with state set Q_i, containing a distinguished initial state, q_{0,i}. A configuration is a vector C = (q_1, ..., q_n, v_1, ..., v_l), where q_i is a local state of processor p_i and v_j is the value of word m_j. In the initial configuration, all processors are in their (local) initial states, and words contain a default value.

Each operation has a type, which defines the number of input and output arguments, their allowable values, and the functional dependency between the inputs, the shared-memory state and the processor state, on one hand, and the output arguments and the new states of the processor and the memory, on the other hand. Each operation is an instance of some operation type; the data set of an operation is the set of words it accesses. For example, unary LL and SC are defined as follows:

LL(m)
    return the value of m

SC(m, new)
    if no write or successful SC to m since your previous LL(m)
        then m = new
             return true      // SC is successful
        else return false
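As an aside, the following minimal sketch (in Python) models these LL/SC semantics sequentially; the Word and Processor classes and the version counter are our own illustrative devices, not part of the paper's model, since real LL/SC is a hardware primitive.

    # A minimal sequential model of unary LL/SC, for illustration only.
    class Word:
        def __init__(self, value=None):
            self.value = value
            self.version = 0            # bumped by every write or successful SC

        def write(self, value):         # an ordinary write also breaks links
            self.value = value
            self.version += 1

    class Processor:
        def __init__(self):
            self.links = {}             # word -> version observed at the last LL

        def LL(self, word):
            self.links[id(word)] = word.version
            return word.value

        def SC(self, word, new):
            # succeeds only if no write or successful SC hit `word` since our LL
            if self.links.get(id(word)) == word.version:
                word.write(new)
                return True
            return False

    w = Word(0)
    p, q = Processor(), Processor()
    v = p.LL(w)
    q.LL(w)
    assert q.SC(w, v + 1)               # q's SC succeeds
    assert not p.SC(w, v + 1)           # p's SC fails: w changed since p's LL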

An event is a computation step by a single processor; in an event, a processor determines the memory operation to perform according to its local state, and determines its next local state according to the value returned by the operation. Operations are atomic; that is, each operation seems to occur at a certain point, and no two operations occur at the same point. Therefore, computations in the system are captured by sequences of configurations, where each configuration is obtained from the previous one by an event of a single processor.

An execution segment is a (finite or infinite) sequence C_0, φ_0, C_1, φ_1, C_2, ... where for every k = 0, 1, ..., C_k is a configuration, φ_k is an event, and the application of φ_k to C_k results in C_{k+1}; that is, if φ_k is an event of p_i then C_{k+1} is the result of applying p_i's transition function to p_i's state in C_k, and applying p_i's memory access operation to the memory in C_k. An execution is an execution segment C_0, φ_0, C_1, φ_1, C_2, ..., in which C_0 is the initial configuration. There are no constraints on the interleavings of events by different processors, since processors are asynchronous and there is no bound on their relative speeds.

An implementation of a high-level operation type H by low-level operations of type L is a procedure using operations of L. Processors should not distinguish between H and its implementation by L. Assume processor p_i invokes a procedure implementing an operation op which terminates; let φ_f and φ_l be the first and the last events, respectively, executed by p_i in the procedure for op; the interval of op is the execution segment α = C_f, φ_f, ..., C_l, φ_l, C_{l+1}. If the operation does not terminate, its interval is the infinite execution segment α = C_f, φ_f, .... Two operations overlap if their intervals overlap; that is, the first event of one interval precedes the last event of the other interval. An invocation of an operation may result in different intervals, depending on the context of its execution. For example, two intervals of the same operation may differ and even return different values if the first is executed in isolation, while the second overlaps other operations.

An execution α is linearizable [Herlihy and Wing 1990] if there is a total ordering of the implemented operations in α, preserving the order of non-overlapping operations, in which each response satisfies the semantics of H, given the responses of the previous operations in the total order.

Let α be the interval of some operation invoked by p_i; the step complexity of α, denoted step(α), is the number of events of p_i in α. An implementation is non-blocking if at any point, some processor with a pending operation completes within a bounded number of steps.

2.2 Sensitivity

The conflict graph of an execution segment α represents the dependencies between the data sets of operations in α; it is an undirected graph, denoted G_α. A node in G_α represents a word m_i. An edge between two nodes m_i and m_j corresponds to an operation with data set {m_i, m_j} whose interval overlaps α.

G_α may contain parallel edges, if α contains several operations with the same data set. (An earlier version of this work [Attiya and Dagan 1996; Dagan 1996] defined the contention graph of an execution segment, in which nodes represent operations and edges represent the words in their data sets; it is the dual of the conflict graph.)

[Fig. 1. A simple conflict graph: an edge for op between m_i and m_j, parallel edges for op_1 and op_2 between m_l and m_i, and an edge for op_3 between m_j and m_k.]

Figure 1 shows the conflict graph for a finite interval of an operation op(m_i, m_j) which overlaps op_1(m_l, m_i), op_2(m_l, m_i) and op_3(m_j, m_k). Below, we talk about a word in the conflict graph, referring to the node representing it; similarly, we talk about an operation in the conflict graph, referring to the edge representing it.

Next, we consider G_α, the conflict graph of an interval α of an operation op, and measure the distance between op and operations that delay its execution. The maximum distance measured in all intervals of an implementation determines its sensitivity. The distance between two operations, op_1 and op_2, in G_α is the length (the number of edges) of the shortest path between a word of op_1 and a word of op_2. In particular, if the data sets of two operations intersect, then their distance is zero. In Figure 1, the distance between op_1 and op_3 is one; the distance between op and any other operation is zero.

Intuitively, the sensitivity measures the minimum distance guaranteeing that two operations do not "interfere" with each other. Below, we say that an operation op_2 does not interfere with another operation op_1 if the step complexity of op_1 is the same whether op_2 is executed in parallel or not. This definition can be modified so that the sensitivity depends on other complexity measures, e.g., the set of words accessed.

An interval α of some operation op is sensitive to distance ℓ if there is an interval α' of op, such that G_{α'} has exactly one more operation (i.e., an edge) than G_α, at distance ℓ from the edge representing op, and step(α) < step(α'). That is, the step complexity of op increases when a single operation is added to α at distance ℓ from op. The sensitivity of α is the maximum s such that α is sensitive to distance s. This means that the step complexity of op does not increase when a single operation is added to α at distance s + 1 from op. If this maximum does not exist, then the sensitivity is ∞. The sensitivity of an implementation is the maximum sensitivity over all its intervals.

The sensitivity captures non-interference between operations in the following sense: If the sensitivity of an implementation is s and the distance between two operations in the conflict graph is d > s, then the step complexity (or any other measure we consider) of the operations is the same whether they execute in parallel or not. In particular, if the sensitivity of an implementation is zero then two operations interfere only if their data sets intersect.
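To make the distance definition concrete, here is a small sketch (our own illustration, not from the paper) that encodes the conflict graph of Figure 1 and computes the distance between two operations by breadth-first search over words:

    from collections import deque

    def distance(edges, op1, op2):
        # distance between two operations, each given as its pair of words
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        frontier = deque((w, 0) for w in op1)
        seen = set(op1)
        while frontier:
            w, d = frontier.popleft()
            if w in op2:
                return d
            for x in adj.get(w, ()):
                if x not in seen:
                    seen.add(x)
                    frontier.append((x, d + 1))
        return None                      # the operations are not connected

    # Figure 1: op(m_i, m_j) overlaps op1(m_l, m_i), op2(m_l, m_i), op3(m_j, m_k)
    edges = [("mi", "mj"), ("ml", "mi"), ("ml", "mi"), ("mj", "mk")]
    assert distance(edges, ("ml", "mi"), ("mj", "mk")) == 1   # op1 to op3
    assert distance(edges, ("mi", "mj"), ("mj", "mk")) == 0   # intersecting data sets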

2.3 Related Complexity Measures

Disjoint-access parallelism [Israeli and Rappoport 1994] requires that when an operation is executed without interference (no other operations contend for the same words), it completes in the same number of steps as when executed alone. Sensitivity strengthens this notion and allows us to evaluate the behavior of an implementation in the presence of contention.

Afek et al. [Afek et al. 1997] suggest two other complexity measures; cast in our terminology, they are described as follows:
(1) An implementation has d-local step complexity if the number of steps performed in an interval α is bounded by a function of the number of operations within distance d in G_α.
(2) An implementation has d-local contention if two operations access the same word only if their distance in the conflict graph of their (joint) interval is at most d.

Clearly, sensitivity d implies d-local step complexity; however, the converse is not true. For example, suppose the data set of an operation op contains a hot spot m, accessed by ℓ > 2 other operations; suppose that m is also on a path of operations with length ℓ. Sensitivity 0 does not allow operations on the path to influence op's performance, while with 0-local step complexity, op may still have to help distant operations on the path. Local contention is orthogonal to sensitivity and local step complexity, and can be evaluated in addition to either of them. However, if operations access only words associated with operations they help, then d-local contention follows from sensitivity d. (The contention locality of our algorithm is discussed at the end of Section 4.)

Dwork, Herlihy and Waarts [Dwork et al. 1997] suggest measuring the step complexity of algorithms while taking contention into account, by assuming that concurrent accesses to the same words are penalized by delaying their response. This measure is appropriate for evaluating solutions for specific problems; however, implementations of multi-word operations inevitably result in concurrent accesses to the same words, creating hot spots. Sensitivity differentiates between multi-word implementations by measuring the influence of hot spots.

3. THE LEFT-RIGHT ALGORITHM

A general scheme for implementing multi-word operations [Anderson and Moir 1999; Barnes 1993; Israeli and Rappoport 1994; Shavit and Touitou 1997; Turek et al. 1992] is that an operation "locks" its data set first, and "helps" stuck operations to avoid blocking. In this section, we introduce the left-right algorithm, in which operations lock words in different orders. We show that the sensitivity and liveness of the left-right algorithm depend on the orientation of the conflict graph induced by the locking orders of overlapping operations. At the end of this section, we discuss data structures in which operations have inherent asymmetry; for such data structures, the left-right algorithm can be directly applied to achieve constant sensitivity. In the next section, we show how to break symmetry in general situations so as to govern the locking directions and reduce sensitivity.

3.1 Overview

Multi-word operations can be implemented from unary operations with the following "locking" scheme [Barnes 1993; Israeli and Rappoport 1994; Shavit and Touitou 1997; Turek et al. 1992]. An operation starts by obtaining locks on the words in its data set (locking stage); then, the operation is applied to the data set (execution stage); finally, the operation releases the locks (unlocking stage). A word is locked by an operation if it contains the operation's id (each operation has a unique identifier); the word is unlocked if its value is ⊥. If a word is locked by an operation, no other operation can modify it.

An operation is blocked if a word in its data set is locked by another, blocking operation. To avoid blocking, the processor executing the blocked operation helps the blocking operation. Several processors may execute an operation: The initiating processor is the processor invoking the operation, and the executing processors are the processors helping it to complete. Although there are several executing processors, only the most advanced processor at each point of the execution performs the operation, and other executing processors have no effect. In order to be helped, the operation's details are published when it is invoked and its state is maintained during its execution.

The blocking operation being helped can be either in its own locking stage, or already in its execution or unlocking stages. In the latter case, the operation has already locked its words and it will never be blocked. Thus, help for a blocking operation which has passed the locking stage is guaranteed to complete. In contrast, help for a blocking operation in its locking stage may have to continue transitively: it may be blocked by a third operation, which in turn may be blocked by a fourth operation, and so on. A non-blocking implementation guarantees that eventually transitive helping stops and some operation terminates; yet, the sensitivity can be very high.

[Fig. 2. A scenario with high sensitivity: operations op_1, ..., op_n on words m_1, ..., m_{n+1}, where the data set of op_i is {m_i, m_{i+1}}.]

Consider the overlapping operations in Figure 2; the data set of op_i is {m_i, m_{i+1}}, 1 ≤ i ≤ n. Assume every operation op_i locks its low-address word, m_i, successfully; then op_1 tries to lock its high-address word m_2, while op_2, ..., op_n are delayed. Since m_2 is locked by op_2, op_1 has to help op_2; since m_3 is locked by op_3, op_1 has to help op_3, etc. Thus, op_1 is delayed by op_2, ..., op_n. Since the distance between op_1 and op_n is n − 2, the sensitivity of this simple implementation is at least n − 2.

In this example, the symmetric behavior of the operations (all locking their low-address word first) causes high sensitivity. The main idea of the left-right algorithm is that asymmetry can be introduced by having the operations lock their words in two directions: either from left to right (low-address word first), or from right to left (high-address word first). Suppose that in the example of Figure 2, odd-numbered operations, op_1, op_3, ..., lock their low-address word first, while even-numbered operations, op_2, op_4, ..., lock their high-address word first.

If op_i (for odd i) locks its low-address word, m_i, and finds its high-address word, m_{i+1}, locked by another operation (which must be op_{i+1}), then op_{i+1} has already locked its two words. Therefore, op_i helps op_{i+1} in its execution stage and/or unlocking stage, but no other operations.

Before the locking stage, an operation decides on its locking direction in the decision stage. After the unlocking stage, the operation resets the shared-memory areas that were used in the decision stage in the post-decision stage. In this section, we focus on the locking and unlocking stages, leaving the algorithms for the decision and post-decision stages to Section 4.

3.2 The Pseudocode

To simplify the code and its description, a separate shared-memory area is used for the locking and unlocking stages. The size of this area is the same as the size of the data area; word i in the locking area corresponds to word i in the data area. The algorithm uses a shared array, op-details, where the operation's details are published by the initiating processor. The initiating processor also sets an operation id (op-id) to be used later; op-id is composed from the id of the initiating processor and a timestamp generated by a timestamp function which returns a unique value each time it is invoked.

The algorithm follows the general scheme discussed earlier, except that locking is done either from left to right or from right to left. If an operation discovers that a word is locked by another operation, it helps the blocking operation by executing all its stages until it unlocks its words; then, the operation tries again. The pseudocode appears in Algorithm 1.

Several processors may execute the locking and unlocking stages of an operation; synchronization is needed to ensure that this does not cause any errors. Algorithm 2 presents the details of the shared procedures used for locking and unlocking. The user is responsible for avoiding synchronization errors in the execution stage. The same local variable tmp is used in all procedures, and it holds the last value read from the shared memory.

The main synchronization mechanism guaranteeing that only the most advanced executing processor makes changes is the timestamp part of the operation id. This field is written by the initiating processor at the beginning of the operation and is cleared at the beginning of the unlocking stage. An operation is valid if its timestamp is set; otherwise, it is invalid. An executing processor finding that the operation is invalid (its timestamp is not set) skips to the unlocking stage. This ensures that once a word is unlocked by an operation, it will not be locked again by this operation. Similar considerations apply when unlocking the word.

Each word is initially ⊥; when locked by some operation, it contains its id. Procedure lock locks two words in the order they are given as parameters. A single word is locked by cell-lock, which tries to lock the word if the operation is still valid and the word is not locked by another operation. If the word is locked by the operation, the procedure returns true; if the word is locked by another operation, the executing processor helps the blocking operation and tries again; if the operation becomes invalid, the procedure returns false.

To help another operation, the blocked operation invokes help with the blocking operation's id as argument.

The blocked operation becomes an executing processor of the blocking operation and goes through all its stages.

Algorithm 1 The left-right algorithm: Code for processor p_i.

record state =
    low-word, high-word                        // the data set
    ts                                         // timestamp
    direction                                  // locking direction

shared state op-details[n]

procedure implemented-operation(m_i, m_j)      // assume m_i < m_j
    t = timestamp()
    op-id = (i, t)
    atomically write to op-details[i]          // publish
        low-word = m_i ; high-word = m_j
        ts = t ; direction = ⊥
    help(op-id)                                // help yourself

procedure help(op-id)
    if ( op-id == ⊥ ) then return
    low = op-details[op-id.pid].low-word
    high = op-details[op-id.pid].high-word
    decision(low, high, op-id)                 // decide on locking direction
    if ( op-details[op-id.pid].direction == left )   // locking stage
        then lock(low, high, op-id)            // left to right
        else lock(high, low, op-id)            // right to left
    execution(low, high, op-id)                // execution stage
    unlock(low, high, op-id)                   // unlocking stage
    post-decision(low, high, op-id)            // clean memory

Procedure unlock invalidates the operation by resetting its timestamp field, preventing other executing processors from locking its words again. Then, it unlocks the two words with cell-unlock, which unlocks a single word only if it is still locked by the operation. The success of SC is not checked; if it fails, the word was unlocked by another executing processor. Procedure validate compares the timestamp passed in the operation id with the timestamp in the ts field of the operation's entry in op-details; the operation is valid if they are equal.

As mentioned before, it is the responsibility of the user to avoid synchronization errors in procedure execution (which is left unspecified). The user can use the operation's timestamp and, if necessary, add more state information. For example, if the implemented operation is SC2, we only need to validate the operation before each write, as done in cell-lock.

3.3 Proof of Correctness

The proof that the algorithm is linearizable follows as in the general schemes [Barnes 1993; Israeli and Rappoport 1994; Shavit and Touitou 1997; Turek et al. 1992], once locking and unlocking are shown to behave correctly.

We only show that the data set of an operation is locked during the execution stage and is unlocked after the operation terminates.

Algorithm 2 The left-right algorithm: Shared procedures for processor p_i.

procedure lock(x, y, op-id)
    cell-lock(x, op-id)
    cell-lock(y, op-id)

procedure cell-lock(addr, op-id)
    while ( true )
        tmp = LL(addr)
        if ( not validate(op-id) ) then return   // the operation ended
        if ( tmp == ⊥ ) then SC(addr, op-id)     // try to lock
        tmp = LL(addr)                           // check if successfully locked
        if ( tmp == op-id ) then return
        else help(tmp)

procedure validate(op)
    if ( op-details[op.pid].ts == op.ts ) then return true
    else return false

procedure unlock(x, y, op-id)
    tmp = LL(op-details[op-id.pid].ts)
    if ( tmp == op-id.ts ) then SC(op-details[op-id.pid].ts, ⊥)   // invalidate op-id
    cell-unlock(x, op-id)                        // unlock the words
    cell-unlock(y, op-id)

procedure cell-unlock(addr, op-id)
    tmp = LL(addr)
    if ( tmp == op-id ) then SC(addr, ⊥)

An executing processor of an operation op returns from cell-lock either when the word is locked by op or when op is invalid. The latter case happens only when an executing processor reaches the unlocking stage, after completing the locking stage. This implies the next lemma:

Lemma 1. The data set of an operation op is locked by op when the first executing processor of op completes the locking stage.

A word is unlocked only by an executing processor of the operation which locked it, since cell-unlock checks whether the word is locked by this operation. This implies the next lemma:

Lemma 2. The data set of an operation op remains locked by op until the first executing processor of op reaches the unlocking stage.

The next lemma shows that an unlocked word is not locked again.

Lemma 3. If m is in the data set of an operation op, then m remains unlocked by op after the first executing processor of op reaches the end of the unlocking stage.

Proof. Suppose that an executing processor of op, p_i, reaches the end of the unlocking stage and another executing processor of op, p_j, tries to lock m. p_i first resets the timestamp field in op-details, thus invalidating op, and then performs SC(m, ⊥). p_j performs LL(m) and then validates op. Clearly, either op is invalid, or p_j reads a non-⊥ value from m and does not lock m.

3.4 Progress and Sensitivity

The liveness properties of the algorithm and its sensitivity depend on the orientation of the conflict graph according to locking directions. We assume that the locking direction is determined by the data set; hence, operations with the same data set lock their words in the same order. The helping graph of an interval α is a directed graph representing helping among operations overlapping α. Specifically, H_α is an orientation of G_α, the conflict graph of α: An edge representing an operation op with data set m_1 and m_2 is oriented m_1 → m_2 if op locks m_1 first; it is oriented m_2 → m_1 if op locks m_2 first.

Lemma 4. Let α be an execution of the left-right algorithm in which no operation completes. Then the helping graph of some interval in α contains a directed cycle.

Proof. In α there must be an (infinite) interval of some blocked operation, op_0, in which no operation completes. By the algorithm, op_0 is blocked if it cannot lock its data set. Since op_0 does not terminate, the blocking operation op_1 is itself blocked by another blocked operation, op_2. Since the number of processors is finite and a processor has at most one pending operation, a finite number of operations is blocked in this interval. Therefore, there is a cycle of blocked operations, op_0, ..., op_l, l ≥ 1. By the algorithm, op_i helps op_{(i+1) mod l}. If the cycle contains two operations, then they have the same data set and help each other; this implies they lock it in different directions, contradicting our assumption. Otherwise, we have three or more operations helping each other, and there is a directed cycle in the helping graph.

[Fig. 3. Helping directions reduce sensitivity: op_1 on {m_1, m_2} and op_2 on {m_2, m_3}, with both edges directed towards m_2.]

Consider two operations, op_1 with data set {m_1, m_2} and op_2 with data set {m_2, m_3}. Figure 3 shows a helping graph in which the edge between m_1 and m_2 is directed to m_2, and the edge between m_2 and m_3 is also directed to m_2. If op_1 helps op_2, then by the code of cell-lock, m_2 is locked by op_2. However, op_2 locks m_3 before locking m_2, and has passed its locking stage; thus, op_1 helps op_2 only in its execution or unlocking stages. As illustrated by this example, if an operation op_1 helps another operation op_2, then there is a directed path from a word of op_1 to a word of op_2 in the helping graph. This is used in the proof of the next lemma, which is the key to bounding the sensitivity of the algorithm.
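The construction just described can be sketched as follows (our own illustration; the operation triples and function names are hypothetical). The sketch orients each conflict-graph edge by its locking direction and measures the longest directed path, the quantity the following lemmas tie to liveness and sensitivity:

    def helping_graph(ops):
        # ops are (low_word, high_word, direction) triples; "left" locks the
        # low-address word first, so the edge is oriented low -> high
        return [(low, high) if d == "left" else (high, low)
                for low, high, d in ops]

    def longest_directed_path(edges):
        # longest directed path, counted in edges (DFS over simple paths;
        # fine for the small illustrative graphs used here)
        out = {}
        for u, v in edges:
            out.setdefault(u, []).append(v)
        def dfs(u, seen):
            return max((1 + dfs(v, seen | {v})
                        for v in out.get(u, []) if v not in seen), default=0)
        return max((dfs(u, {u}) for u in out), default=0)

    # Figure 2's scenario with every operation locking left (low word first):
    chain = [(i, i + 1, "left") for i in range(1, 6)]
    assert longest_directed_path(helping_graph(chain)) == 5   # one long path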

Lemma 5. Let α be the interval of an operation op_i and let op_j be an overlapping operation. If there is no directed path from a word of op_i to a word of op_j in H_α, then there is an interval of op_i, α', with the same overlapping operations except op_j, such that step(α) = step(α').

Proof. Assume op_i helps the operations in OP. If an operation op' ∈ OP helps op_j, then there is a directed path from a word of op' to a word of op_j (as argued before the lemma). Since there is a directed path from a word of op_i to a word of op', there is a directed path from a word of op_i to a word of op_j. This contradicts the assumption and shows that no operation in OP helps op_j. We construct an interval α' without op_j. In α', op_i performs the same sequence of steps as in α; moreover, all the operations in OP lock their words in the same order as in α. Since op_i performs the same sequence of steps in α and in α', step(α) = step(α'). We argue that α' is an execution of the left-right algorithm; otherwise, let op_k be the first operation in OP which locks a word m in α and cannot do so in α'. By the algorithm, this happens only if another operation holds a lock on m. However, no operations were added in α', and the locking sequence until op_k's locking in α' is as in α. Thus, if m is unlocked in α, then m is unlocked in α' and op_k succeeds in locking it.

By Lemma 5, if the length of directed paths in H_α is d, then adding an operation at distance d does not increase the number of steps taken by the operation.

Lemma 6. Let α be an interval of the left-right algorithm. If the length of directed paths in H_α is less than or equal to d, then the sensitivity of α is strictly smaller than d.

3.5 Data Structures with Zero Sensitivity

We discuss two data structures in which the memory access patterns of operations are structured and, therefore, locking directions can be determined a priori to obtain zero sensitivity.

3.5.1 A linked list. If a linked list is implemented inside an array, then the data set of typical operations, such as insertion and deletion, is m_i and m_{i+1}, for some i. Let the locking direction of the operation be determined by the parity of its low-address word; that is, the locking direction of an operation accessing m_i and m_{i+1}, for some i, is "left" if i is even, and "right" if i is odd. Clearly, neighboring operations in the conflict graph lock in opposite directions. Therefore, there are only trivial directed paths (of length 1) in the helping graph. By Lemma 4, the implementation is non-blocking and, by Lemma 6, its sensitivity is zero.
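A sketch of this parity rule (list_direction is a hypothetical name), reusing helping_graph and longest_directed_path from the earlier sketch:

    def list_direction(i):
        # operation on array cells (m_i, m_{i+1}): "left" iff i is even
        return "left" if i % 2 == 0 else "right"

    ops = [(i, i + 1, list_direction(i)) for i in range(1, 7)]
    # neighboring operations lock in opposite directions, leaving only
    # trivial directed paths (single edges) in the helping graph:
    assert longest_directed_path(helping_graph(ops)) == 1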

3.5.2 A heap. Israeli and Rappoport [Israeli and Rappoport 1993] present an implementation of a heap supporting bubble-up and bubble-down using unary LL and binary SC2 operations. In this implementation, the data set of a binary operation is always a parent node and one of its children. We implement the binary operations with the left-right algorithm; the locking direction of an operation is determined by the parity of the depth of the higher node it has to lock.

[Fig. 4. Binary operations on a heap: four nodes v_g, v_f, v_a and v_b, where v_g is at even depth.]

Figure 4 depicts four nodes in a heap, v_g, v_f, v_a and v_b. Two kinds of paths can be formed by contending operations. In the first kind, the depths are monotone, e.g., ..., v_a, v_f, v_g, ... or ..., v_b, v_f, v_g, .... In this case, neighboring operations lock in opposite directions, and there is no directed path from v_a to v_g or from v_b to v_g. In the second kind, the depths are not monotone, e.g., ..., v_a, v_f, v_b, .... In this case, neighboring operations lock in the same direction (determined by the depth of v_f), and there is no directed path between v_a and v_b. Thus, there are only trivial directed paths in the helping graph, which implies that the implementation is non-blocking and its sensitivity is zero (as was argued for linked lists).

4. THE DECISION ALGORITHM

This section describes how to choose locking directions in order to reduce sensitivity when access patterns are not known in advance. Assume that the data set of op is {m_1, m_2} and the data set of op' is {m'_1, m'_2}, and they intersect. If m_2 = m'_1 (the high-address word of op is the low-address word of op') then the locking directions of op and op' have to be different in order to avoid a directed path (Figure 5). If m_1 = m'_1 (the low-address word of op is the low-address word of op') then the locking directions of op and op' have to be equal in order to avoid a directed path (Figure 6), and similarly when m_2 = m'_2 (the high-address word of op is the high-address word of op').

[Fig. 5. Locking directions: the high-address word of op is equal to the low-address word of op'.]

[Fig. 6. Locking directions: the low-address words of op and op' are equal.]

We first describe the algorithm for the restricted case of a single monotone path, in which the high-address word of one operation is the low-address word of another operation (as in Figure 5). In this situation, we want neighboring operations to lock in different directions (as much as possible). The general case is handled by decomposing an arbitrary conflict graph into monotone paths. For simplicity, a separate shared-memory area is used for the decision stage; word i in the decision area corresponds to word i in the locking or data areas.

4.1 Monotone Paths

Consider operations op_1, ..., op_n such that op_i is initiated by a processor with id pid_i and has data set (m_i, m_{i+1}). For op_i, the operations with lower indices, op_1, ..., op_{i−1}, are called downstream neighbors; the operations with higher indices, op_{i+1}, ..., op_n, are called upstream neighbors. (This situation is similar to the one depicted in Figure 2.) Assume that each operation op_i has a nonnegative number num_i and that neighboring operations have different numbers; op_i chooses its locking direction by the following rule:

(*) If num_i < num_{i+1} then op_i decides left; otherwise, op_i decides right.

Under this rule, a directed path in the helping graph corresponds to a strictly ascending or descending sequence of numbers; for example, if operations decide left, then numbers are strictly ascending. Thus, there are no directed cycles in the helping graph and the algorithm is non-blocking, by Lemma 4. This also indicates how to reduce the sensitivity of the algorithm: By Rule (*), the length of the longest directed path in the helping graph is strictly smaller than the biggest number. If numbers are small then directed paths are short and, by Lemma 6, sensitivity is small.

Numbers are reduced with the "deterministic coin tossing" technique of Cole and Vishkin [Cole and Vishkin 1986]. This is a symmetry-breaking algorithm for synchronous rings, which we adapt to monotone paths in an asynchronous system. The initial number of an operation is the initiating processor's id. The algorithm works in phases; in each phase, numbers are reduced by a logarithmic factor. Reduced numbers are no longer unique; however, neighboring operations have different numbers, allowing Rule (*) to be applied. If the reduced numbers are at most ℓ, then at most ℓ numbers are strictly ascending or descending; thus, at most ℓ − 1 consecutive operations decide on the same direction. To perform k reduction phases, an operation needs the initial numbers of k + 1 upstream operations; edge operations, without k + 1 upstream neighbors, decide left. Since there may be k + 1 edge operations which decide left, at most ℓ + k consecutive operations may decide left.

For monotone paths, we simplify the description by assuming that (a) all operations start together, and (b) an operation waits after the locking stage until all operations finish their locking stage. Later, we will remove these assumptions. We describe how to reduce the numbers to size O(log log n) with O(1) memory operations; repeating this reduction yields constant-size numbers with O(log* n) memory operations.

4.1.1 A Single Phase. An operation starts by writing its initiating processor's id into its low-address word. There are pointers between consecutive words in the path: The id in the low-address word leads to the operation's details record, containing the high-address word of the operation.

Assume op_i reads num^0_i from m_i (its own processor id), num^0_{i+1} from m_{i+1}, and num^0_{i+2} from m_{i+2}. These are binary strings of length ⌈log n⌉, where bits are numbered from 0 to ⌈log n⌉ − 1, going from least significant bit to most significant bit. (Logarithms are base two.) Let j be the index of the least significant (rightmost) bit in which the binary representations of num^0_i and num^0_{i+1} differ; j can be represented as a binary string of length ⌈log log n⌉. Let num^1_i be the concatenation of the binary representation of j and b_j, the value of the jth bit in num^0_i. Denote j by num^1_i.index and b_j by num^1_i.bit. The length of num^1_i is ⌈log log n⌉ + 1 bits. In a similar manner, op_i computes num^1_{i+1} from num^0_{i+1} and num^0_{i+2}.

[Fig. 7. Reduction of numbers in a single phase: processors p_i, p_{i+1}, p_{i+2} hold num^0 values 01010101 (= 85), 11111101 (= 253), 01111101 (= 125), which reduce to num^1 values 0110 (= 6) and 1111 (= 15).]

In Figure 7, num^0_i is 01010101 (85), num^0_{i+1} is 11111101 (253), and num^0_{i+2} is 01111101 (125). The index of the rightmost bit in which num^0_i and num^0_{i+1} differ is 3 and the value in num^0_i is 0; thus, num^1_i is 0110 (6). The index of the rightmost bit in which num^0_{i+1} and num^0_{i+2} differ is 7 and the value in num^0_{i+1} is 1; thus, num^1_{i+1} is 1111 (15).

By our simplifying assumptions, words are written together and are not overwritten while operations decide on directions. Hence, op_i and op_{i+1} compute num^1_{i+1} based on the same values of num^0_{i+1} and num^0_{i+2}. This allows us to ignore which processor computes num^1_{i+1}.

Lemma 7. If op_i and op_{i+1} are neighboring operations on the path, then they compute the same value for num^1_{i+1}.

The next lemma shows that consecutive values of num^1 are not equal.

Lemma 8. If op_i and op_{i+1} are neighboring operations on the path and num^0_i ≠ num^0_{i+1}, then num^1_i ≠ num^1_{i+1}.

Proof. If num^1_i = num^1_{i+1}, then they have the same bit, num^1_i.bit = num^1_{i+1}.bit, in position num^1_i.index = num^1_{i+1}.index, contradicting the fact that num^0_i and num^0_{i+1} differ in this bit.
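One phase can be sketched as follows (our own illustration; reduce_once is a hypothetical name). Encoding the concatenation of index and bit as the integer 2j + b_j matches the figures above:

    def reduce_once(mine, upstream):
        # precondition: mine != upstream (neighboring numbers differ)
        diff = mine ^ upstream
        j = (diff & -diff).bit_length() - 1   # least significant differing bit
        b = (mine >> j) & 1                   # our own bit at position j
        return (j << 1) | b                   # concatenate index and bit

    # the example of Figure 7:
    assert reduce_once(85, 253) == 6          # index 3, bit 0 -> 0110
    assert reduce_once(253, 125) == 15        # index 7, bit 1 -> 1111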

4.1.2 The Multi-Phase Algorithm. The above idea is applied repeatedly to reduce the numbers to be at most three bits long. Denote ℓ(0, n) = ⌈log n⌉, and let ℓ(j + 1, n) = ⌈log ℓ(j, n)⌉ + 1, for any integer j ≥ 0. Let f(n) be the smallest integer j such that ℓ(j, n) ≤ 3; note that f(n) = O(log* n).

An operation starts by writing its initial number in its low-address word; then it reads f(n) + 1 upstream words. (Edge operations, without f(n) + 1 upstream words, choose left without any further calculation.) Assume op_i reads num^0_i, num^0_{i+1}, ..., num^0_{i+f(n)+1}. By iterating on k = 1, ..., f(n), op_i computes num^k_j from num^{k−1}_j and num^{k−1}_{j+1} for every j, i ≤ j ≤ i + f(n) + 1 − k, as in the single-phase algorithm (Section 4.1.1).

Inductive application of Lemma 7 (generalized to hold for an arbitrary phase) implies that an operation and its downstream neighbor compute the same reduced number.

Lemma 9. If op_i and op_{i+1} are neighboring operations on the path, then they compute the same value for num^k_{i+1}, for every k, 0 < k ≤ f(n).

Inductive application of Lemma 8 (generalized to hold for an arbitrary phase) implies that numbers computed by neighboring operations are not equal.

Lemma 10. If op_i and op_{i+1} are neighboring operations on the path and num^0_i ≠ num^0_{i+1}, then num^k_i ≠ num^k_{i+1} for every k, 0 < k ≤ f(n). In particular, num^{f(n)}_i ≠ num^{f(n)}_{i+1}.

If n < 8, then initial numbers are at most three bits long, and thus f(n) = 0. Otherwise, the numbers are strictly reduced in each iteration, since ⌈log x⌉ + 1 < x for every x > 3. Hence, num^{f(n)}_i < 8. At most seven consecutive operations decide on the same direction by applying Rule (*) with the reduced numbers. Edge operations, without f(n) + 1 upstream neighbors, decide left, and thus at most f(n) + 8 consecutive operations may decide left. This implies the next theorem:

Theorem 1. The length of a directed path is at most f(n) + 8.
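Putting the phases together, the decision rule can be sketched as follows (our own illustration; phases and decide are hypothetical names, and reduce_once is the single-phase sketch above):

    import math

    def phases(n):
        # f(n): iterate length -> ceil(log2(length)) + 1 until at most 3 bits
        length, f = math.ceil(math.log2(n)), 0
        while length > 3:
            length = math.ceil(math.log2(length)) + 1
            f += 1
        return f

    def decide(nums):
        # nums[0] is this operation's initial number (its processor id) and
        # nums[1:] the initial numbers of its f(n) + 1 upstream neighbors;
        # edge operations without enough neighbors just decide "left"
        while len(nums) > 2:
            nums = [reduce_once(a, b) for a, b in zip(nums, nums[1:])]
        return "left" if nums[0] < nums[1] else "right"    # Rule (*)

By Lemma 9, an operation and its downstream neighbor compute the same shared reduced numbers, so the directions chosen by decide are consistent along the path even though each operation computes them locally.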

4.2 General Topology

We "disentangle" an arbitrary combination of overlapping and contending operations into a collection of monotone paths, to which the reduction technique of the previous section can be applied. If an operation's data set may create a non-monotone path, it stalls while helping other operations; otherwise, it applies the algorithm for a monotone path.

To explain this idea further, we need to define monotone paths more precisely. Assume words m_1, ..., m_l form an undirected path in some conflict graph; word m_i, 1 < i < l, is a local minimum if m_{i−1} > m_i and m_i < m_{i+1}; it is a local maximum if m_{i−1} < m_i and m_i > m_{i+1}. A local minimum is created when two operations have the same low-address word (as in Figure 6); a local maximum is created when two operations have the same high-address word. A path is monotone if it does not contain local minima or maxima.

The decision stage is preceded by a marking stage, in which operations check the memory access patterns to detect local minima or maxima. Only one of the operations forming a local minimum or a local maximum continues; the others stall.

The marking stage uses a variant of the conflict graph. Nodes are marked words; a word has a field for low marking and a field for high marking. The low field of a word indicates whether it is the low-address word of some operation; the high field of a word indicates whether it is the high-address word of some operation. An operation marks a word by writing its id in the relevant field; marking fails if the field is not ⊥. Hence, a word is not marked as high (or as low) twice; if two overlapping operations have the same high-address or low-address word, only one of them marks the word. An operation op can mark a word m as low even if m is already marked high by another operation op' (and vice versa); this happens when m is the low-address word of op and the high-address word of op'. In contrast, m cannot be locked by two operations. If an operation marks its low-address word as low and its high-address word as high, then its data set is on a monotone path; the operation decides on a locking direction as in Section 4.1. If marking fails, then the operation's data set creates a non-monotone path. An operation unmarks its data set in the post-decision stage, after the unlocking stage.

Two problems arise due to the dynamic nature of the conflict graph. First, if new operations join the end of a marked path after the locking stage starts, then an edge operation may help upstream operations after it finds the end of the path; this increases the sensitivity of the locking stage. Second, after an operation unmarks its data set, another operation with the same data set may take its place. Local computation in some downstream operations may use the first operation's id, while other downstream operations use the second operation's id. Both problems are avoided by "truncating" the path. An operation finding the end of a path places a special end symbol in the low field of the last word of the path; thus, operations cannot extend the path by marking this word as low. An operation unmarking its data set places end in the low field of its low-address word m, if m is marked high (i.e., the operation has a downstream neighbor); thus, new operations cannot mark m again. If the low field contains end when the high field is unmarked, then both fields are set to ⊥.

4.2.1 The Pseudocode. Each word contains two fields for marking, low and high, which may contain an operation id, end or ⊥; both are initially ⊥. A Boolean intersected field is added to the operation's details record; this field is set when the operation is intersected, i.e., its high-address word is already marked high by another operation and is part of another monotone path. This field is set to ⊥ when the operation's details are written. An array pids[f(n)+2] is also added to the operation's details record. This array contains the processor ids of upstream operations, used for the local computation of the reduced number (as in the monotone path algorithm of Section 4.1).

The decision and the post-decision stages appear in Algorithm 3. Algorithms 4 and 5 detail the low-level procedures for the decision and the post-decision stages, respectively. An operation op_i starts by initializing the local variables. Then, op_i tries to mark its low-address word. If marking fails, then op_i helps the operation whose id is marking the word as low, and tries again; if marking succeeds, then op_i continues.
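Since Algorithms 3-5 themselves are not reproduced in this excerpt, the following sketch (our own illustration; all names are hypothetical) captures only the marking and truncation discipline described in the prose above:

    BOTTOM, END = None, "end"

    class MarkedWord:
        def __init__(self):
            self.low = BOTTOM    # id of the op whose low-address word this is
            self.high = BOTTOM   # id of the op whose high-address word this is

    def mark(word, field, op_id):
        # write op_id into word.low or word.high; fail if the field is taken
        if getattr(word, field) is BOTTOM:
            setattr(word, field, op_id)
            return True
        return False             # marked by another operation (or sealed by END)

    def unmark(low_word, high_word):
        # post-decision cleanup: seal the low word with END if it still has a
        # downstream neighbor, so new operations cannot re-mark it as low
        low_word.low = END if low_word.high is not BOTTOM else BOTTOM
        high_word.high = BOTTOM
        if high_word.low is END:           # END left with no high marking:
            high_word.low = BOTTOM         # reset both fields to bottom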


Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel

More information

Throughout this course, we use the terms vertex and node interchangeably.

Throughout this course, we use the terms vertex and node interchangeably. Chapter Vertex Coloring. Introduction Vertex coloring is an infamous graph theory problem. It is also a useful toy example to see the style of this course already in the first lecture. Vertex coloring

More information

Erez Petrank. Department of Computer Science. Haifa, Israel. Abstract

Erez Petrank. Department of Computer Science. Haifa, Israel. Abstract The Best of Both Worlds: Guaranteeing Termination in Fast Randomized Byzantine Agreement Protocols Oded Goldreich Erez Petrank Department of Computer Science Technion Haifa, Israel. Abstract All known

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Faster than Optimal Snapshots (for a While)

Faster than Optimal Snapshots (for a While) Faster than Optimal Snapshots (for a While) Preliminary Version ABSTRACT James Aspnes Department of Computer Science, Yale University aspnes@cs.yale.edu Keren Censor-Hillel Computer Science and Artificial

More information

ACONCURRENT system may be viewed as a collection of

ACONCURRENT system may be viewed as a collection of 252 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 10, NO. 3, MARCH 1999 Constructing a Reliable Test&Set Bit Frank Stomp and Gadi Taubenfeld AbstractÐThe problem of computing with faulty

More information

Distributed Sorting. Chapter Array & Mesh

Distributed Sorting. Chapter Array & Mesh Chapter 9 Distributed Sorting Indeed, I believe that virtually every important aspect of programming arises somewhere in the context of sorting [and searching]! Donald E. Knuth, The Art of Computer Programming

More information

Λέων-Χαράλαμπος Σταματάρης

Λέων-Χαράλαμπος Σταματάρης Λέων-Χαράλαμπος Σταματάρης INTRODUCTION Two classical problems of information dissemination in computer networks: The broadcasting problem: Distributing a particular message from a distinguished source

More information

An Anonymous Self-Stabilizing Algorithm For 1-Maximal Matching in Trees

An Anonymous Self-Stabilizing Algorithm For 1-Maximal Matching in Trees An Anonymous Self-Stabilizing Algorithm For 1-Maximal Matching in Trees Wayne Goddard, Stephen T. Hedetniemi Department of Computer Science, Clemson University {goddard,hedet}@cs.clemson.edu Zhengnan Shi

More information

On the Space Complexity of Randomized Synchronization

On the Space Complexity of Randomized Synchronization On the Space Complexity of Randomized Synchronization FAITH FICH University of Toronto, Toronto, Ont., Canada MAURICE HERLIHY Brown University, Providence, Rhode Island AND NIR SHAVIT Tel-Aviv University,

More information

Algorithm 23 works. Instead of a spanning tree, one can use routing.

Algorithm 23 works. Instead of a spanning tree, one can use routing. Chapter 5 Shared Objects 5.1 Introduction Assume that there is a common resource (e.g. a common variable or data structure), which different nodes in a network need to access from time to time. If the

More information

time using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o

time using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o Reconstructing a Binary Tree from its Traversals in Doubly-Logarithmic CREW Time Stephan Olariu Michael Overstreet Department of Computer Science, Old Dominion University, Norfolk, VA 23529 Zhaofang Wen

More information

Distributed Algorithms 6.046J, Spring, Nancy Lynch

Distributed Algorithms 6.046J, Spring, Nancy Lynch Distributed Algorithms 6.046J, Spring, 205 Nancy Lynch What are Distributed Algorithms? Algorithms that run on networked processors, or on multiprocessors that share memory. They solve many kinds of problems:

More information

Finding a winning strategy in variations of Kayles

Finding a winning strategy in variations of Kayles Finding a winning strategy in variations of Kayles Simon Prins ICA-3582809 Utrecht University, The Netherlands July 15, 2015 Abstract Kayles is a two player game played on a graph. The game can be dened

More information

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA A taxonomy of race conditions. D. P. Helmbold, C. E. McDowell UCSC-CRL-94-34 September 28, 1994 Board of Studies in Computer and Information Sciences University of California, Santa Cruz Santa Cruz, CA

More information

II (Sorting and) Order Statistics

II (Sorting and) Order Statistics II (Sorting and) Order Statistics Heapsort Quicksort Sorting in Linear Time Medians and Order Statistics 8 Sorting in Linear Time The sorting algorithms introduced thus far are comparison sorts Any comparison

More information

Mutual Exclusion Between Neighboring Nodes in a Tree That Stabilizes Using Read/Write Atomicity?

Mutual Exclusion Between Neighboring Nodes in a Tree That Stabilizes Using Read/Write Atomicity? Computer Science Technical Report Mutual Exclusion Between Neighboring Nodes in a Tree That Stabilizes Using Read/Write Atomicity? Gheorghe Antonoiu 1 andpradipk.srimani 1 May 27, 1998 Technical Report

More information

COL351: Analysis and Design of Algorithms (CSE, IITD, Semester-I ) Name: Entry number:

COL351: Analysis and Design of Algorithms (CSE, IITD, Semester-I ) Name: Entry number: Name: Entry number: There are 6 questions for a total of 75 points. 1. Consider functions f(n) = 10n2 n + 3 n and g(n) = n3 n. Answer the following: (a) ( 1 / 2 point) State true or false: f(n) is O(g(n)).

More information

We assume uniform hashing (UH):

We assume uniform hashing (UH): We assume uniform hashing (UH): the probe sequence of each key is equally likely to be any of the! permutations of 0,1,, 1 UH generalizes the notion of SUH that produces not just a single number, but a

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

FOUR EDGE-INDEPENDENT SPANNING TREES 1

FOUR EDGE-INDEPENDENT SPANNING TREES 1 FOUR EDGE-INDEPENDENT SPANNING TREES 1 Alexander Hoyer and Robin Thomas School of Mathematics Georgia Institute of Technology Atlanta, Georgia 30332-0160, USA ABSTRACT We prove an ear-decomposition theorem

More information

Maximal Independent Set

Maximal Independent Set Chapter 4 Maximal Independent Set In this chapter we present a first highlight of this course, a fast maximal independent set (MIS) algorithm. The algorithm is the first randomized algorithm that we study

More information

Non-blocking Array-based Algorithms for Stacks and Queues!

Non-blocking Array-based Algorithms for Stacks and Queues! Non-blocking Array-based Algorithms for Stacks and Queues! Niloufar Shafiei! Department of Computer Science and Engineering York University ICDCN 09 Outline! Introduction! Stack algorithm! Queue algorithm!

More information

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition. 18.433 Combinatorial Optimization Matching Algorithms September 9,14,16 Lecturer: Santosh Vempala Given a graph G = (V, E), a matching M is a set of edges with the property that no two of the edges have

More information

Maximal Independent Set

Maximal Independent Set Chapter 0 Maximal Independent Set In this chapter we present a highlight of this course, a fast maximal independent set (MIS) algorithm. The algorithm is the first randomized algorithm that we study in

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

MONI NAOR y AND LARRY STOCKMEYER z

MONI NAOR y AND LARRY STOCKMEYER z WHAT CAN BE COMPUTED LOCALLY? MONI NAOR y AND LARRY STOCKMEYER z Abstract. The purpose of this paper is a study of computation that can be done locally in a distributed network, where \locally" means within

More information

Checks and Balances - Constraint Solving without Surprises in Object-Constraint Programming Languages: Full Formal Development

Checks and Balances - Constraint Solving without Surprises in Object-Constraint Programming Languages: Full Formal Development Checks and Balances - Constraint Solving without Surprises in Object-Constraint Programming Languages: Full Formal Development Tim Felgentreff, Todd Millstein, Alan Borning and Robert Hirschfeld Viewpoints

More information

Instructions. Notation. notation: In particular, t(i, 2) = 2 2 2

Instructions. Notation. notation: In particular, t(i, 2) = 2 2 2 Instructions Deterministic Distributed Algorithms, 10 21 May 2010, Exercises http://www.cs.helsinki.fi/jukka.suomela/dda-2010/ Jukka Suomela, last updated on May 20, 2010 exercises are merely suggestions

More information

Fork Sequential Consistency is Blocking

Fork Sequential Consistency is Blocking Fork Sequential Consistency is Blocking Christian Cachin Idit Keidar Alexander Shraer May 14, 2008 Abstract We consider an untrusted server storing shared data on behalf of clients. We show that no storage

More information

Concurrent Reading and Writing of Clocks

Concurrent Reading and Writing of Clocks Concurrent Reading and Writing of Clocks LESLIE LAMPORT Digital Equipment Corporation As an exercise in synchronization without mutual exclusion, algorithms are developed to implement both a monotonic

More information

Self Stabilization. CS553 Distributed Algorithms Prof. Ajay Kshemkalyani. by Islam Ismailov & Mohamed M. Ali

Self Stabilization. CS553 Distributed Algorithms Prof. Ajay Kshemkalyani. by Islam Ismailov & Mohamed M. Ali Self Stabilization CS553 Distributed Algorithms Prof. Ajay Kshemkalyani by Islam Ismailov & Mohamed M. Ali Introduction There is a possibility for a distributed system to go into an illegitimate state,

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

16 Greedy Algorithms

16 Greedy Algorithms 16 Greedy Algorithms Optimization algorithms typically go through a sequence of steps, with a set of choices at each For many optimization problems, using dynamic programming to determine the best choices

More information

Efficient Bufferless Packet Switching on Trees and Leveled Networks

Efficient Bufferless Packet Switching on Trees and Leveled Networks Efficient Bufferless Packet Switching on Trees and Leveled Networks Costas Busch Malik Magdon-Ismail Marios Mavronicolas Abstract In bufferless networks the packets cannot be buffered while they are in

More information

A Dag-Based Algorithm for Distributed Mutual Exclusion. Kansas State University. Manhattan, Kansas maintains [18]. algorithms [11].

A Dag-Based Algorithm for Distributed Mutual Exclusion. Kansas State University. Manhattan, Kansas maintains [18]. algorithms [11]. A Dag-Based Algorithm for Distributed Mutual Exclusion Mitchell L. Neilsen Masaaki Mizuno Department of Computing and Information Sciences Kansas State University Manhattan, Kansas 66506 Abstract The paper

More information

Let v be a vertex primed by v i (s). Then the number f(v) of neighbours of v which have

Let v be a vertex primed by v i (s). Then the number f(v) of neighbours of v which have Let v be a vertex primed by v i (s). Then the number f(v) of neighbours of v which have been red in the sequence up to and including v i (s) is deg(v)? s(v), and by the induction hypothesis this sequence

More information

Fork Sequential Consistency is Blocking

Fork Sequential Consistency is Blocking Fork Sequential Consistency is Blocking Christian Cachin Idit Keidar Alexander Shraer Novembe4, 008 Abstract We consider an untrusted server storing shared data on behalf of clients. We show that no storage

More information

Universal Constructions that Ensure Disjoint-Access Parallelism and Wait-Freedom

Universal Constructions that Ensure Disjoint-Access Parallelism and Wait-Freedom Universal Constructions that Ensure Disjoint-Access Parallelism and Wait-Freedom Faith Ellen University of Toronto faith@cs.toronto.edu Alessia Milani University of Bordeaux milani@labri.fr Panagiota Fatourou

More information

[8] that this cannot happen on the projective plane (cf. also [2]) and the results of Robertson, Seymour, and Thomas [5] on linkless embeddings of gra

[8] that this cannot happen on the projective plane (cf. also [2]) and the results of Robertson, Seymour, and Thomas [5] on linkless embeddings of gra Apex graphs with embeddings of face-width three Bojan Mohar Department of Mathematics University of Ljubljana Jadranska 19, 61111 Ljubljana Slovenia bojan.mohar@uni-lj.si Abstract Aa apex graph is a graph

More information

Valmir C. Barbosa. Programa de Engenharia de Sistemas e Computac~ao, COPPE. Caixa Postal Abstract

Valmir C. Barbosa. Programa de Engenharia de Sistemas e Computac~ao, COPPE. Caixa Postal Abstract The Interleaved Multichromatic Number of a Graph Valmir C. Barbosa Universidade Federal do Rio de Janeiro Programa de Engenharia de Sistemas e Computac~ao, COPPE Caixa Postal 685 2945-970 Rio de Janeiro

More information

How Hard Is It to Take a Snapshot?

How Hard Is It to Take a Snapshot? How Hard Is It to Take a Snapshot? Faith Ellen Fich University of Toronto Toronto, Canada fich@cs.utoronto.ca Abstract. The snapshot object is an important and well-studied primitive in distributed computing.

More information

Nesting points in the sphere. Dan Archdeacon. University of Vermont. Feliu Sagols.

Nesting points in the sphere. Dan Archdeacon. University of Vermont.   Feliu Sagols. Nesting points in the sphere Dan Archdeacon Dept. of Computer Science University of Vermont Burlington, VT, USA 05405 e-mail: dan.archdeacon@uvm.edu Feliu Sagols Dept. of Computer Science University of

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Solo-Valency and the Cost of Coordination

Solo-Valency and the Cost of Coordination Solo-Valency and the Cost of Coordination Danny Hendler Nir Shavit November 21, 2007 Abstract This paper introduces solo-valency, a variation on the valency proof technique originated by Fischer, Lynch,

More information

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We

More information

Part 2: Balanced Trees

Part 2: Balanced Trees Part 2: Balanced Trees 1 AVL Trees We could dene a perfectly balanced binary search tree with N nodes to be a complete binary search tree, one in which every level except the last is completely full. A

More information

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the

More information

Constructions of hamiltonian graphs with bounded degree and diameter O(log n)

Constructions of hamiltonian graphs with bounded degree and diameter O(log n) Constructions of hamiltonian graphs with bounded degree and diameter O(log n) Aleksandar Ilić Faculty of Sciences and Mathematics, University of Niš, Serbia e-mail: aleksandari@gmail.com Dragan Stevanović

More information

The problem of minimizing the elimination tree height for general graphs is N P-hard. However, there exist classes of graphs for which the problem can

The problem of minimizing the elimination tree height for general graphs is N P-hard. However, there exist classes of graphs for which the problem can A Simple Cubic Algorithm for Computing Minimum Height Elimination Trees for Interval Graphs Bengt Aspvall, Pinar Heggernes, Jan Arne Telle Department of Informatics, University of Bergen N{5020 Bergen,

More information

Tilings of the Euclidean plane

Tilings of the Euclidean plane Tilings of the Euclidean plane Yan Der, Robin, Cécile January 9, 2017 Abstract This document gives a quick overview of a eld of mathematics which lies in the intersection of geometry and algebra : tilings.

More information

Linearizable Iterators

Linearizable Iterators Linearizable Iterators Supervised by Maurice Herlihy Abstract Petrank et. al. [5] provide a construction of lock-free, linearizable iterators for lock-free linked lists. We consider the problem of extending

More information

An Eternal Domination Problem in Grids

An Eternal Domination Problem in Grids Theory and Applications of Graphs Volume Issue 1 Article 2 2017 An Eternal Domination Problem in Grids William Klostermeyer University of North Florida, klostermeyer@hotmail.com Margaret-Ellen Messinger

More information

Algorithms: Lecture 10. Chalmers University of Technology

Algorithms: Lecture 10. Chalmers University of Technology Algorithms: Lecture 10 Chalmers University of Technology Today s Topics Basic Definitions Path, Cycle, Tree, Connectivity, etc. Graph Traversal Depth First Search Breadth First Search Testing Bipartatiness

More information

Concurrent Objects and Linearizability

Concurrent Objects and Linearizability Chapter 3 Concurrent Objects and Linearizability 3.1 Specifying Objects An object in languages such as Java and C++ is a container for data. Each object provides a set of methods that are the only way

More information

Models of distributed computing: port numbering and local algorithms

Models of distributed computing: port numbering and local algorithms Models of distributed computing: port numbering and local algorithms Jukka Suomela Adaptive Computing Group Helsinki Institute for Information Technology HIIT University of Helsinki FMT seminar, 26 February

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Self-stabilizing Byzantine Digital Clock Synchronization

Self-stabilizing Byzantine Digital Clock Synchronization Self-stabilizing Byzantine Digital Clock Synchronization Ezra N. Hoch, Danny Dolev and Ariel Daliot The Hebrew University of Jerusalem We present a scheme that achieves self-stabilizing Byzantine digital

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Partial Snapshot Objects

Partial Snapshot Objects Partial Snapshot Objects Hagit Attiya Technion Rachid Guerraoui EPFL Eric Ruppert York University ABSTRACT We introduce a generalization of the atomic snapshot object, which we call the partial snapshot

More information

Applied Algorithm Design Lecture 3

Applied Algorithm Design Lecture 3 Applied Algorithm Design Lecture 3 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 3 1 / 75 PART I : GREEDY ALGORITHMS Pietro Michiardi (Eurecom) Applied Algorithm

More information

The Wait-Free Hierarchy

The Wait-Free Hierarchy Jennifer L. Welch References 1 M. Herlihy, Wait-Free Synchronization, ACM TOPLAS, 13(1):124-149 (1991) M. Fischer, N. Lynch, and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process,

More information

A Synchronization Algorithm for Distributed Systems

A Synchronization Algorithm for Distributed Systems A Synchronization Algorithm for Distributed Systems Tai-Kuo Woo Department of Computer Science Jacksonville University Jacksonville, FL 32211 Kenneth Block Department of Computer and Information Science

More information

Chordal deletion is fixed-parameter tractable

Chordal deletion is fixed-parameter tractable Chordal deletion is fixed-parameter tractable Dániel Marx Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany. dmarx@informatik.hu-berlin.de Abstract. It

More information

On the Definition of Sequential Consistency

On the Definition of Sequential Consistency On the Definition of Sequential Consistency Ali Sezgin Ganesh Gopalakrishnan Abstract The definition of sequential consistency is compared with an intuitive notion of correctness. A relation between what

More information

Ray shooting from convex ranges

Ray shooting from convex ranges Discrete Applied Mathematics 108 (2001) 259 267 Ray shooting from convex ranges Evangelos Kranakis a, Danny Krizanc b, Anil Maheshwari a;, Jorg-Rudiger Sack a, Jorge Urrutia c a School of Computer Science,

More information

On the Page Number of Upward Planar Directed Acyclic Graphs

On the Page Number of Upward Planar Directed Acyclic Graphs Journal of Graph Algorithms and Applications http://jgaa.info/ vol. 17, no. 3, pp. 221 244 (2013) DOI: 10.7155/jgaa.00292 On the Page Number of Upward Planar Directed Acyclic Graphs Fabrizio Frati 1 Radoslav

More information

Decreasing the Diameter of Bounded Degree Graphs

Decreasing the Diameter of Bounded Degree Graphs Decreasing the Diameter of Bounded Degree Graphs Noga Alon András Gyárfás Miklós Ruszinkó February, 00 To the memory of Paul Erdős Abstract Let f d (G) denote the minimum number of edges that have to be

More information

15.4 Longest common subsequence

15.4 Longest common subsequence 15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible

More information

Optimal-Time Adaptive Strong Renaming, with Applications to Counting

Optimal-Time Adaptive Strong Renaming, with Applications to Counting Optimal-Time Adaptive Strong Renaming, with Applications to Counting [Extended Abstract] Dan Alistarh EPFL James Aspnes Yale University Keren Censor-Hillel MIT Morteza Zadimoghaddam MIT Seth Gilbert NUS

More information

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( )

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( ) 17.4 Dynamic tables Let us now study the problem of dynamically expanding and contracting a table We show that the amortized cost of insertion/ deletion is only (1) Though the actual cost of an operation

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

R L R L. Fulllevel = 8 Last = 12

R L R L. Fulllevel = 8 Last = 12 Lock Bypassing: An Ecient Algorithm For Concurrently Accessing Priority Heaps Yong Yan HAL Computer Systems, Inc. Campbell, California 95008 and Xiaodong Zhang Department of Computer Science P. O. Box

More information

Assignment 4 Solutions of graph problems

Assignment 4 Solutions of graph problems Assignment 4 Solutions of graph problems 1. Let us assume that G is not a cycle. Consider the maximal path in the graph. Let the end points of the path be denoted as v 1, v k respectively. If either of

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Abstract We present two parameterized algorithms for the Minimum Fill-In problem, also known as Chordal

More information

(a) (b) Figure 1: Bipartite digraph (a) and solution to its edge-connectivity incrementation problem (b). A directed line represents an edge that has

(a) (b) Figure 1: Bipartite digraph (a) and solution to its edge-connectivity incrementation problem (b). A directed line represents an edge that has Incrementing Bipartite Digraph Edge-connectivity Harold N. Gabow Tibor Jordan y Abstract This paper solves the problem of increasing the edge-connectivity of a bipartite digraph by adding the smallest

More information

CPSC W2 Midterm #2 Sample Solutions

CPSC W2 Midterm #2 Sample Solutions CPSC 320 2014W2 Midterm #2 Sample Solutions March 13, 2015 1 Canopticon [8 marks] Classify each of the following recurrences (assumed to have base cases of T (1) = T (0) = 1) into one of the three cases

More information

Parameterized graph separation problems

Parameterized graph separation problems Parameterized graph separation problems Dániel Marx Department of Computer Science and Information Theory, Budapest University of Technology and Economics Budapest, H-1521, Hungary, dmarx@cs.bme.hu Abstract.

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information