Polylogarithmic Concurrent Data Structures from Monotone Circuits


JAMES ASPNES, Yale University
HAGIT ATTIYA, Technion
KEREN CENSOR-HILLEL, MIT

The paper presents constructions of useful concurrent data structures, including max registers and counters, with step complexity that is sublinear in the number of processes, n. This result avoids a well-known lower bound by having step complexity that is polylogarithmic in the number of values the object can take or the number of operations applied to it.

The key step in these implementations is a method for constructing a max register, a linearizable, wait-free concurrent data structure that supports a write operation and a read operation that returns the largest value previously written. For fixed m, an m-valued max register is constructed from one-bit multi-writer multi-reader registers at a cost of at most ⌈log m⌉ atomic register operations per write or read. An unbounded max register is constructed with cost O(min(log v, n)) to read or write a value v. Max registers are used to transform any monotone circuit into a wait-free concurrent data structure that provides write operations setting the inputs to the circuit and a read operation that returns the value of the circuit on the largest input values previously supplied. One application is a simple, linearizable, wait-free counter with a cost of O(min(log n log v, n)) to perform an increment and O(min(log v, n)) to perform a read, where v is the current value of the counter. For polynomially-many increments, this becomes O(log^2 n), an exponential improvement on the best previously known upper bounds of O(n) for exact counting and O(n^{4/5+ɛ}) for approximate counting.

Finally, it is shown that the upper bounds are almost optimal. It is shown that for deterministic implementations, even if they are only required to satisfy solo-termination, min(⌈log m⌉, n − 1) is a lower bound on the worst-case complexity for an m-valued bounded max register, which is exactly equal to the upper bound for m ≤ 2^{n−1}, and min(n − 1, log m − log(⌈log m⌉ + k)) is a lower bound for the read operation of an m-valued k-additive-accurate counter, which is a bounded counter in which a read operation is allowed to return a value within an additive error of ±k of the number of increment operations linearized before it. Furthermore, even in a solo-terminating randomized implementation of an n-valued max register with an oblivious adversary and global coins, there exist simple schedules in which, with high probability, the worst-case step complexity of a read operation is Ω(log n / log log n) if the write operations have polylogarithmic step complexity.

Categories and Subject Descriptors: D.1.3 [Software]: Programming Techniques - Concurrent programming; E.1 [Data]: Data Structures; F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity - Nonnumerical Algorithms and Problems

General Terms: Algorithms, Theory

Additional Key Words and Phrases: distributed computing, shared memory, max registers, counters, monotone circuits

A preliminary version of this paper appeared in Proceedings of the 28th Annual ACM Symposium on Principles of Distributed Computing (PODC 2009), pages 36-45, 2009. James Aspnes is partially supported by NSF grant CNS. Hagit Attiya is partially supported by the Israel Science Foundation (grant number 953/06). Keren Censor-Hillel is supported by the Simons Postdoctoral Fellows Program. Part of this author's work was done as a Ph.D. student at the Department of Computer Science, Technion, and supported in part by the Adams Fellowship Program of the Israel Academy of Sciences and Humanities.

1. INTRODUCTION

A critical aspect of programming contemporary multiprocessing systems is the implementation of concurrent data structures, e.g., getting the largest value stored in a data structure, or counting. It is important to find methods for building efficient concurrent data structures in shared-memory systems, where n processes communicate by reading and writing to shared multi-writer multi-reader registers.

One successful approach to building concurrent data structures is to employ the atomic snapshot abstraction [Afek et al. 1993; Anderson 1993; Aspnes and Herlihy 1990]. An atomic snapshot object is composed of components, each of which is typically updated by a different process; the components can be read atomically. By applying a specific function to the components read, we can obtain a specific data structure. For example, to obtain a max register, supporting a write operation and a ReadMax operation that returns the largest value previously written, the function returns the component with the maximum value; to obtain a counter, supporting an increment operation and a ReadCounter operation, the function adds up the contribution from each process. These constructions take a linear (in n) number of steps, due to the cost of implementing atomic snapshots [Inoue and Chen 1994]. Indeed, Jayanti et al. [2000] show that, for many common data structures, including max registers and counters, implementations must use Ω(n) space and operations must take Ω(n) steps in the worst case. This seems to indicate that we cannot do better than snapshots.

However, careful inspection of Jayanti, Tan, and Toueg's lower bound proof reveals that it holds only when there are numerous operations on the data structure. Thus, it does not rule out the possibility of having sub-linear algorithms when the number of operations is bounded, or, more generally, the existence of algorithms whose complexity depends on the number of operations. Such data structures are useful for many applications, either because they have a limited life-time, or because several instances of the data structure can be used.

Our Contributions. In this paper, we present polylogarithmic implementations of key data structures that support only bounded values. The cornerstone of our constructions, and our first example, is an implementation of a max register that beats the Ω(n) lower bound of Jayanti et al. [2000] when log m is o(n) (throughout the paper, log denotes log_2). If the number of values is bounded by m, its cost per operation is O(log m); for an unbounded set of values, the cost is O(min(log v, n)), where v is the value of the register.

To implement a counter, instead of simply summing the individual process contributions, as in a snapshot-based implementation of a counter, we can use a tree of max registers to compute this sum: take an O(log n)-depth tree of two-input adders, where the output of each adder is a max register. To increment, walk up the tree, recomputing all values on the path. The cost of a read operation is O(min(log v, n)), where v is the current value of the counter, and the cost of an increment operation is O(min(log n log v, n)). When the number of increments is polynomial, this has O(log^2 n) cost, which is an exponential improvement over the trivial upper bound of O(n) using snapshots. The resulting counter is wait-free and linearizable.
More generally, we show how a max register can be used to transform any monotone circuit into a wait-free concurrent data structure that provides write operations setting the inputs to the circuit and a read operation that returns the value of the circuit on the largest input values previously supplied.

Monotone circuits expose the parallelism inherent in the dependency of the data structure's values on the arguments to the operations. Formally, a monotone circuit computes a function over some finite alphabet of size m, which is assumed to be totally ordered. The circuit is represented by a directed acyclic graph where each node corresponds to a gate that computes a function of the outputs of its predecessors. Nodes with in-degree zero are input nodes; nodes with out-degree zero are output nodes. Each gate g, with k inputs, computes some monotone function f_g of its inputs. Monotonicity means that if x_i ≤ y_i for all i, then f_g(x_1, ..., x_k) ≤ f_g(y_1, ..., y_k). The cost of writing a new value to an input to the circuit is bounded by O(Sd min(⌈log m⌉, n)), where m is the size of the alphabet for the circuit, d is the number of inputs to each gate, and S is the number of gates whose value changes as the result of the write. The cost of reading the output value is min(⌈log m⌉, O(n)). While the resulting data structure is not linearizable in general, it satisfies a weaker but natural consistency condition, called monotone consistency, which we will show is still useful for many applications.

So far we have only described upper bounds. We show a lower bound of min(⌈log m⌉, n − 1) steps for the read operation of any solo-terminating deterministic implementation of an m-valued max register; since this matches our upper bound exactly when m ≤ 2^{n−1}, our implementation is optimal. A bound of min(n − 1, log m − log(⌈log m⌉ + k)) steps is shown to hold for the read operation of any solo-terminating deterministic implementation of an m-valued k-additive-accurate counter, which is a bounded counter in which a read operation is allowed to return a value within an additive error of ±k of the number of increment operations linearized before it.

Furthermore, we show a lower bound for randomized implementations of a max register which exhibits a tradeoff between the number of steps required for read and write operations. It is shown that in any randomized implementation of an n-valued max register, there exist simple schedules containing n − 1 partial write operations and one read operation in which, with probability 1 − o(1), one of the following holds:

(1) The write operation with maximum value takes more than w steps.
(2) The read operation returns an incorrect value.
(3) The read operation takes Ω(log n/(log w + log log n)) steps.

This applies for an oblivious adversary, which determines the entire schedule in advance, even when requiring only solo-termination, and allowing global coins, which are random values shared by all processes (as opposed to local coins). In particular, this tradeoff shows an Ω(log n / log log n) lower bound on the worst-case step complexity of read operations for any randomized max register whose write operations have polylogarithmic step complexity.

Related Work. Previously, the linear lower bound of Jayanti et al. [2000] motivated, e.g., Aspnes and Censor [2009] to switch to approximate counting, but even this implementation has O(n^{4/5+ɛ}) cost. Our construction improves on this for the case of bounded counters. Another approach that improves upon the snapshot-based implementation of many data structures is to use f-arrays, as proposed by Jayanti [2002].
An f-array is a data structure that supports computation of a function f over the components of an array. Instead of having a process take a snapshot of the array and then locally apply f to the result, Jayanti implements an f-array by having the write operations update a predetermined location according to the new value of f, which requires a read operation to read only that location. This construction is then extended to a tree algorithm.

For implementing an f-array of m registers, where f can be any common aggregate function, including the maximum value or the sum of values, this reduces the number of steps required to O(log m) for a write operation, while a read operation takes O(1) steps. These implementations use base objects that are stronger than read/write registers (specifically, LL/SC), while we restrict our base objects to multi-writer multi-reader registers. Hence, we do not use this approach in this paper.

Model. We assume the standard model of an asynchronous shared-memory system (cf. [Attiya and Welch 2004]), which consists of a set of n processes P = {p_0, ..., p_{n−1}}. Processes communicate by reading and writing to shared multi-writer multi-reader registers. Each step consists of some local computation and one shared memory event, which is either a read or a write to some register. The schedule according to which processes take steps is determined by an adversary. The system is asynchronous, which implies that there are no timing assumptions, and specifically no bounds on the time between two steps of a process, or between steps of different processes.

We are interested in wait-free implementations, in which any operation on the data structure terminates within a finite number of its own steps regardless of the schedule chosen by the adversary (even if all other processes stop taking steps). The cost of an implementation is measured by the maximum number of steps required for any operation on the data structure. Our lower bounds hold even for the weaker solo-termination condition, in which an operation is only required to terminate within a finite number of its own steps if there are no concurrent steps by operations of other processes.

An implementation should also satisfy some consistency condition. The consistency condition should specify the semantic requirements of all possible executions, including possibly both complete operations, which start and terminate during the execution, and incomplete operations, which begin during the execution but still have low-level shared memory accesses pending. While the sequential semantics of a data structure has to address only non-concurrent operations, our data structures are concurrent, which means that during an execution some operations may overlap, in which case their shared-memory accesses are interleaved. Operations whose shared memory accesses are not interleaved are called non-overlapping. Since a consistency condition has to specify the semantic requirements for all possible concurrent executions, it also has to address the possibility of overlapping operations.

Some of our implementations are linearizable [Herlihy and Wing 1990]. This means that for any execution, there is a sequence that contains all the completed operations, as well as some of the incomplete ones, that (1) extends the order of non-overlapping operations, and (2) preserves the sequential semantics of the implemented object. Other implementations are not linearizable, but satisfy what we call monotone consistency, a weaker yet natural consistency condition defined formally in Section 4.

In a randomized implementation, a process is allowed to flip coins as part of its local computation, and to choose its next step according to their results. We require an operation of a randomized implementation of a data structure to be correct with high probability, i.e., with probability 1 − o(1).
The step complexity of an operation in a randomized implementation is an upper bound on the number of steps required for that operation with high probability. This implies that in a randomized implementation, with probability 1 − o(1), every operation returns the correct value within its step complexity.

2. MAX REGISTERS

Our basic data structure is a max register, which is an object r that supports a WriteMax(r, t) operation with an argument t that records the value t in r, and a ReadMax(r) operation returning the maximum value written to the object r. A max register may be either bounded or unbounded. For a bounded max register, we assume that the values it stores are in the range {0, ..., m − 1}, where m is the size of the register. We assume that any non-negative integer can be stored in an unbounded max register. In general, we will be interested in unbounded max registers, but will consider bounded max registers in some of our constructions and lower bounds.

One way to implement max registers is by using snapshots. Given a linear-time snapshot protocol (e.g., [Inoue and Chen 1994]), a WriteMax operation for process p_i updates location a[i], while a ReadMax operation takes a snapshot of all locations and returns the maximum value. Assuming no bounds on the size of snapshot array elements, this gives an implementation of an unbounded max register with linear cost (in the number of processes n) for both WriteMax and ReadMax. We show below how to build more efficient max registers: a recursive construction that gives costs logarithmic in the size of the register for both WriteMax and ReadMax. We also describe (in Section 6.4) a non-linearizable, Monte Carlo implementation with only one atomic register read per ReadMax, but with a drastically increased cost for WriteMax; while impractical, it is useful for illustrating the limitations of our lower bounds.

In the remainder of this section we show how to construct a max register recursively from a tree of increasingly large max registers. Each node in this tree is a one-bit register that directs the reader left or right depending on its value. (See Figure 1.) Intuitively, we can think of a read of the max register as following a path through this tree, constructing the sequence of bits found in the registers along the path and then treating this sequence as an element of a prefix-free code that can be decoded to obtain the actual value. A write operation similarly walks down along the path determined by the encoding of its input, testing to see if a larger value has already been written. If it finds one (because it sees a 1 in a register where its bit-vector would put a 0), it exits without changing any of the bits in the tree; this mechanism prevents out-of-date writes from introducing non-linearizable values that would otherwise be observed by out-of-date reads. But if the writer successfully reaches a leaf of the tree without seeing evidence of a larger value, it sets the 1 bits on its path that are needed to cause a reader to follow it, in bottom-up order. This ordering ensures that no reader is diverted to the new path until it has been completely marked.

For convenience of both the implementation and the correctness proof, it is easiest to describe this process recursively, where we think of a max register as built from a one-bit atomic register switching between two smaller max registers. The smallest such registers are trivial MaxReg_0 objects, which are max registers r with size m = 1 that support only the value 0. The implementation of MaxReg_0 requires zero space and zero step complexity: WriteMax(r, 0) is a no-op, and ReadMax(r) always returns 0. To get larger max registers, we combine smaller ones recursively (see Figure 1).
The base objects will consist of at most one snapshot-based max register as described earlier (used to limit the depth of the tree in the unbounded construction), together with a large number of trivial MaxReg_0 objects and read/write registers. A recursive MaxReg object has three components: two MaxReg objects called r.left and r.right, where r.left is a bounded max register of size m, and one 1-bit multi-writer register called r.switch. The resulting object is a max register whose size is the sum of the sizes of r.left and r.right, or unbounded if r.right is unbounded. Pseudocode for the max-register implementation is given in Algorithm 1. Writing a value t to r is done using the WriteMax(r, t) procedure. For values t less than m, a process first checks that r.switch is off, and only then writes t to r.left. For larger values, the process writes t − m to r.right and then sets r.switch. Reading the maximal value is done using the ReadMax(r) procedure. Here the process uses r.switch to choose whether to read r.left or r.right; if it reads r.right, it adds m to the result.

[Fig. 1. Implementing a max register.]

shared data: switch, a 1-bit multi-writer register, initially 0;
             left, a MaxReg object of size m, initially 0;
             right, a MaxReg object of arbitrary size, initially 0

procedure WriteMax(r, t)
    if t < m then
        if r.switch = 0 then
            WriteMax(r.left, t)
    else
        WriteMax(r.right, t − m)
        r.switch ← 1

procedure ReadMax(r)
    if r.switch = 0 then
        return ReadMax(r.left)
    else
        return ReadMax(r.right) + m

Algorithm 1: Max register implementation
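To make the recursion concrete, here is a minimal single-threaded Python sketch of Algorithm 1 (ours, not from the paper). The atomicity of the switch register is assumed rather than implemented, so the sketch models the recursive structure and control flow only; make_maxreg builds the balanced tree analyzed in Section 2.1 below.

# Sketch of Algorithm 1 (ours). Plain attributes stand in for atomic
# registers, so this models control flow, not true concurrency.

class MaxReg0:
    """Trivial max register of size 1: supports only the value 0."""
    size = 1
    def write_max(self, t):
        pass                              # WriteMax(r, 0) is a no-op
    def read_max(self):
        return 0

class MaxReg:
    """Recursive max register of size left.size + right.size."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.switch = 0                   # the 1-bit switch register

    @property
    def size(self):
        return self.left.size + self.right.size

    def write_max(self, t):
        m = self.left.size
        if t < m:
            if self.switch == 0:          # skip if a larger value exists
                self.left.write_max(t)
        else:
            self.right.write_max(t - m)
            self.switch = 1               # set switch only after writing

    def read_max(self):
        if self.switch == 0:
            return self.left.read_max()
        return self.right.read_max() + self.left.size

def make_maxreg(k):
    """Balanced MaxReg_k of size 2**k."""
    if k == 0:
        return MaxReg0()
    return MaxReg(make_maxreg(k - 1), make_maxreg(k - 1))

r = make_maxreg(4)                        # a 16-valued max register
for v in (3, 11, 7):
    r.write_max(v)                        # the write of 7 is suppressed
assert r.read_max() == 11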

An important property of this implementation is that it preserves linearizability, as shown in the following lemma.

LEMMA 2.1. If r.left and r.right are linearizable max registers, so is r.

PROOF. We assume that each of the MaxReg objects r.left and r.right is linearizable. Thus, we can associate each operation on them with one linearization point and treat these operations as atomic. In addition, we can associate each read or write to the register r.switch with a single linearization point, since it is atomic. We now consider a schedule of ReadMax(r) and WriteMax(r, t) operations. These consist of reads and writes to r.switch and of ReadMax and WriteMax operations on r.left and r.right. We divide the operations on r into three categories:

C_left: ReadMax(r) operations that read 0 from r.switch, and WriteMax(r, t) operations with t < m that read 0 from r.switch.
C_right: ReadMax(r) operations that read 1 from r.switch, and WriteMax(r, t) operations with t ≥ m (i.e., that write 1 to r.switch).
C_switch: WriteMax(r, t) operations with t < m that read 1 from r.switch.

Inspection of the code shows that each completed operation on r falls into exactly one of these categories. Notice that an operation is in C_left if and only if it invokes an operation on r.left, an operation is in C_right if and only if it invokes an operation on r.right, and an operation is in C_switch if and only if it invokes no operation on r.left or r.right. We order the operations by the following four rules:

(1) We order all operations of C_left before those of C_right.
(2) An operation op in C_switch is ordered at the latest point possible before any operation op′ that starts after op finishes. If two operations are ordered at the same point, then they appear according to the order in which they are encountered.
(3) Within C_left we order the operations according to the order in which they access r.left, i.e., by the order of their respective linearization points in accessing r.left.
(4) Within C_right we order the operations according to the order in which they access r.right, i.e., by the order of their respective linearization points in accessing r.right.

It is easy to verify that these rules are well-defined. We first prove that these rules preserve the execution order of non-overlapping operations. For two operations in the same category this is clearly implied by rules 2-4. Since rule 1 shows that two operations from C_left and C_right are also properly ordered, because an operation that starts after an operation in C_right finishes cannot be in C_left, it is left to consider the case that one operation is in C_switch and the other is either in C_left or in C_right. In this case, rule 2 implies that their order preserves the execution order.

We now prove that this order satisfies the specification of a max register, i.e., if a ReadMax(r) operation op returns t then t is the largest value written by operations on r of type WriteMax that are ordered before op. This requires showing that there is a WriteMax(r, t) operation op_w ordered before op, and that there is no WriteMax(r, t′) operation op′_w with t′ > t ordered before op. This is obtained by using the linearizability of the components. If op returns a value t < m (i.e., it is in C_left) then this is the value that is returned from its invocation op′ of ReadMax(r.left). By the linearizability of r.left, there is a WriteMax(r.left, t) operation op′_w ordered before op′ in the linearization of r.left. By rule 3, this implies that the WriteMax(r, t) operation op_w which invoked op′_w is ordered before op.
A similar argument for r.right applies if op returns a value t ≥ m. To prove that no operation of type WriteMax with a larger value is ordered before op, we assume, towards a contradiction, that there is a WriteMax(r, t′) operation op′_w with t′ > t that is ordered before op. If op returns a value t < m (i.e., it is in C_left) then op′_w cannot be in C_right, otherwise it would be ordered after op, by rule 1. Moreover, op′_w cannot be in C_switch, since rule 2 implies that op starts after op′_w finishes and

hence must also read 1 from r.switch, which is a contradiction to op ∈ C_left. Therefore, op′_w ∈ C_left, but this contradicts the linearizability of r.left. If op returns a value t ≥ m (i.e., it is in C_right) then op′_w cannot be in C_left because t′ > t. Moreover, op′_w cannot be in C_switch, since t′ > t ≥ m. Therefore, op′_w is in C_right, which contradicts the linearizability of r.right.

Using Lemma 2.1, we can build a max register whose structure corresponds to an arbitrary binary tree, where each internal node of the tree is represented by a recursive max register and each leaf is a MaxReg_0, or, for the rightmost leaf, a MaxReg_0 or a snapshot-based MaxReg, depending on whether we want a bounded or an unbounded max register. There are several natural choices, as we will discuss next.

2.1. Using a balanced binary tree

To construct a bounded max register of size 2^k, we use a balanced binary tree. Let MaxReg_k be a recursive max register built from two MaxReg_{k−1} objects, with MaxReg_0 being the trivial max register defined previously. Then MaxReg_k has size 2^k for all k. It is linearizable by induction on k, using Lemma 2.1 for the induction step. We can also easily compute an exact upper bound on the cost of ReadMax and WriteMax on a MaxReg_k object. For k = 0, both ReadMax and WriteMax perform no operations. For larger k, each ReadMax operation performs one register read and then recurses to perform a single ReadMax operation on a MaxReg_{k−1} object, while each WriteMax performs either a register read or a register write plus at most one recursive call to WriteMax. Thus:

THEOREM 2.2. A MaxReg_k object implements a linearizable max register for which every ReadMax operation requires exactly k register reads, and every WriteMax operation requires at most k register operations.

In terms of the size of the max register, operations on a max register that supports m values, where 2^{k−1} < m ≤ 2^k, each take at most ⌈log m⌉ steps. Note that this cost does not depend on the number of processes n; indeed, it is not hard to see that this implementation works even with infinitely many processes.

2.2. Using an unbalanced binary tree

In order to implement max registers that support unbounded values, we use unbalanced binary trees. Bentley and Yao [1976] provide several constructions of unbalanced binary trees with the property that the i-th leaf sits at depth O(log i). The simplest of these, called B_1, constructs the tree by encoding each positive integer using a modified version of a classic variable-length code known as the Elias delta code [Elias 1975]. In this code, each positive integer N = 2^k + l with 0 ≤ l < 2^k is represented by the bit sequence δ(N) = 1^k 0 β(l), where β(l) is the k-bit binary expansion of l. The first few such encodings are 0, 100, 101, 11000, 11001, 11010, 11011, 1110000, .... If we interpret a leading 0 bit as a direction to the left subtree and a leading 1 bit as a direction to the right subtree, this gives a binary tree that consists of an infinitely long rightmost path (corresponding to the increasingly long prefixes 1^k), where the i-th node in this path (i = 1, 2, ...) has a left subtree that is a balanced binary tree with 2^{i−1} leaves. (A similar construction is used by Attiya and Fouren [2001].)
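As a quick illustration, here is a small Python sketch (ours, not from the paper) of the modified Elias delta encoding; the length of δ(v + 1) matches the read cost derived next.

# Sketch (ours): the modified Elias delta code used by the B_1 tree.
# delta(N) = '1'*k + '0' + beta(l), where N = 2**k + l, 0 <= l < 2**k,
# and beta(l) is the k-bit binary expansion of l.

def delta(n: int) -> str:
    assert n >= 1
    k = n.bit_length() - 1            # n = 2**k + l with 0 <= l < 2**k
    l = n - (1 << k)
    beta = format(l, 'b').zfill(k) if k > 0 else ''
    return '1' * k + '0' + beta

# The path for value v in the B_1 tree is delta(v + 1); its length is
# 2k + 1 = 2*floor(log2(v + 1)) + 1, the cost of reading value v.
for v in range(8):
    assert len(delta(v + 1)) == 2 * ((v + 1).bit_length() - 1) + 1
print([delta(n) for n in range(1, 9)])
# ['0', '100', '101', '11000', '11001', '11010', '11011', '1110000']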
Let us consider what happens if we build a max register using the B_1 tree (see Figure 2). A ReadMax operation that reads the value v will follow the path corresponding to δ(v + 1), and in fact will read precisely this sequence of bits from the switch registers in each recursive max register along the path. This gives a cost to read value v that is equal to |δ(v + 1)| = 2⌊log(v + 1)⌋ + 1.

[Fig. 2. An unbalanced max register.]

Similarly, the cost of WriteMax(v) will be at most 2⌊log(v + 1)⌋ + 1. Both of these costs are unbounded for unbounded values of v. For ReadMax operations, there is an additional complication: repeated concurrent WriteMax operations might set each switch just before the ReadMax reaches it, preventing the ReadMax from terminating. Another complication is in proving linearizability, as the induction does not terminate without trickery like truncating the structure just below the last node actually used by any completed operation. For these reasons, we prefer to backstop the tree with a single snapshot-based max register that replaces the entire subtree at position 1^n, where n is the number of processes. Using this construction, we have:

THEOREM 2.3. There is a linearizable implementation of MaxReg for which every ReadMax operation that returns value v requires min(2⌊log(v + 1)⌋ + 1, O(n)) register reads, and every WriteMax operation with value v requires at most min(2⌊log(v + 1)⌋ + 1, O(n)) register operations.

If constant factors are important, the multiplicative factor of 2 can be reduced to 1 + o(1) by using a more sophisticated unbalanced tree; the interested reader should consult [Bentley and Yao 1976] for examples. Note that the infinite-tree construction does give an obstruction-free algorithm, since any operation does terminate when running alone.

3. COUNTERS

An immediate application of max registers is a construction of a linearizable m-valued counter for n processes with step complexity O(min(log n log m, n)) for increment operations and O(min(log m, n)) for read operations. This is a special case of a general construction of a large class of monotone data structures based on monotone circuits that we present in Section 4, but we describe the construction first in its own terms, without depending on the general construction. The complexity bounds are an exponential improvement on the best previously known upper bound of O(n) for exact counting, and on the bound O(n^{4/5+ɛ} ((1/δ) log n)^{O(1/ɛ)}), where ɛ is a small constant parameter, for approximate counting which is δ-accurate [Aspnes and Censor 2009].

Pseudocode for the counter is given as Algorithm 2. The counter is structured as a binary tree of max registers. Each max register has an index, which is a bit vector x_1 x_2 ... x_k, where k is the depth of the register in the tree. At the top of the tree is a root max register whose index is the empty sequence. With n processes, the leaves have depth log n, and each process is assigned to the leaf whose index x_1 x_2 ... x_{log n} corresponds to the id of the process, expressed in binary. Each leaf register records the number of increments done by the corresponding process. Internal nodes of the tree record the total number of increments done by all processes whose leaves lie below the node. These values are, of course, not updated immediately; instead, when a process increments its leaf register, it takes responsibility for propagating the new value up the path to the root. It does so by reading both children of each node along the path, and writing their sum to the node itself, before proceeding to the next node and repeating the process. A read operation is carried out by simply reading the root node.

shared data: MaxReg objects r[x_1 ... x_k] for each bit vector x_1 ... x_k with 0 ≤ k ≤ log n, all initially 0

procedure CounterIncrement()
    let x_1 ... x_{log n} represent the current process's identity
    r[x_1 ... x_{log n}] ← r[x_1 ... x_{log n}] + 1
    for k ← log n − 1 down to 0 do
        v_0 ← r[x_1 ... x_k 0]
        v_1 ← r[x_1 ... x_k 1]
        r[x_1 ... x_k] ← v_0 + v_1

procedure ReadCounter()
    return r[]

Algorithm 2: Counter implementation
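Before turning to the linearizability argument, here is a single-threaded Python sketch of Algorithm 2 (ours, not from the paper); dictionary entries stand in for the max registers and atomicity is assumed, so only the data flow of increment and read is modeled.

# Sketch of Algorithm 2 (ours): a counter built from a binary tree of
# max registers, indexed by bit vectors.

class TreeCounter:
    def __init__(self, depth: int):
        self.depth = depth                         # leaves at depth log n
        # one max register r[x_1 ... x_k] per bit vector, all initially 0
        self.r = {'': 0}
        for k in range(1, depth + 1):
            for i in range(2 ** k):
                self.r[format(i, 'b').zfill(k)] = 0

    def increment(self, pid: int):
        leaf = format(pid, 'b').zfill(self.depth)  # process id in binary
        self.r[leaf] += 1                          # only pid writes this leaf
        for k in range(self.depth - 1, -1, -1):    # walk up toward the root
            prefix = leaf[:k]
            v0 = self.r[prefix + '0']              # ReadMax on left child
            v1 = self.r[prefix + '1']              # ReadMax on right child
            self.r[prefix] = max(self.r[prefix], v0 + v1)   # WriteMax

    def read(self) -> int:
        return self.r['']                          # ReadMax on the root

c = TreeCounter(depth=2)                           # supports n = 4 processes
for pid in (0, 3, 3):
    c.increment(pid)
assert c.read() == 3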

To prove linearizability, we treat each subtree as a counter object in its own right. We can then consider some counter C rooted at some position x_1 ... x_k as built from a max register r[x_1 ... x_k] together with two counters C_0 and C_1, corresponding to the subtrees rooted at x_1 ... x_k 0 and x_1 ... x_k 1, respectively. We can think of the execution of CounterIncrement up through the first log n − k iterations of the loop as carrying out an increment operation on one of C_0 or C_1, followed by reading both counters and writing the sum of the values read to r[x_1 ... x_k]. We will show by induction on the height of r[x_1 ... x_k] in the tree that these internal increment and read operations are linearizable; when we reach the root register r[], this will show that the full counter is linearizable. The base of this induction is the leaf register r[x_1 ... x_{log n}], which is trivially linearizable as only one process ever updates this max register, letting us assign as linearization point for increments the corresponding WriteMax and for reads the corresponding ReadMax. At higher levels of the tree, we have the following lemma:

LEMMA 3.1. If C_0 and C_1 are linearizable unit-increment counters, then so is C.

PROOF. Each increment operation of C is associated with a value equal to C_0 + C_1 at the point where it increments C_0 or C_1, treating C_0 and C_1 as atomic counters according to the induction hypothesis. An increment operation with an associated value v is linearized when a value v′ ≥ v is first written to the output max register r[x_1 ... x_k]. A read operation is linearized at the point where it reads the output max register r[x_1 ... x_k] (which we consider to be atomic). To see that the linearization point for increment v occurs within the interval of the operation, observe that no increment can write a value v′ ≥ v to r[x_1 ... x_k] before increment v finishes incrementing the relevant sub-counter C_0 or C_1, because before then C_0 + C_1 < v. Moreover, the increment v cannot finish before some v′ ≥ v is first written to r[x_1 ... x_k], because v itself writes a value v′ ≥ v before it finishes. Since the read operations are also linearized within their execution intervals, this order is consistent with the order of non-overlapping operations. This gives a valid sequential execution, since we now have exactly one increment operation associated with every integer up to any value read from C, and there are exactly v increment operations ordered before a read operation that returns v.

THEOREM 3.2. There is an implementation of a linearizable m-valued unit-increment counter for n processes where a read operation takes O(min(log m, n)) low-level register operations and an increment operation takes O(min(log n log m, n)) low-level register operations.

PROOF. Linearizability follows from the preceding argument. For the complexity, observe that the read operation has the same cost as ReadMax, while an increment operation requires reading and updating O(1) max registers per iteration, at a cost of O(min(log m, 2^i)) for the i-th iteration. The full cost of an increment is obtained by summing this quantity as i goes from 0 to log n.

Note that for a polynomial number of increments, an increment takes O(log^2 n) steps. It is also possible to use unbounded max registers, in which case the value m in the cost of a read or increment is replaced by the current value of the counter.

4. MONOTONE CIRCUITS

In this section, we show how a max register can be used to construct more sophisticated data structures from arbitrary monotone circuits. The cost of operations on the data structure will be a function of the number of gates whose values need to be updated when an input changes and the size of the alphabet. For each monotone circuit, we can construct a corresponding monotone data structure. This data structure supports operations WriteInput and ReadOutput, where each WriteInput operation updates the value of one of the inputs to the circuit and each ReadOutput operation returns the value of one of the outputs. Like the circuit as a whole, the effects of WriteInput operations are monotone: attempts to set an input to a value less than or equal to its current value have no effect. The resulting data structure always provides monotone consistency, which is generally weaker than linearizability:

Definition 4.1. A monotone data structure is monotone consistent if the following properties hold in any execution:

(1) For each output, there is a total ordering < on all ReadOutput operations for it, such that if some operation R_1 finishes before some other operation R_2 starts, then R_1 < R_2, and if R_1 < R_2, then the value returned by R_1 is less than or equal to the value returned by R_2.

(2) The value v returned by any ReadOutput operation satisfies f(x_1, ..., x_k) ≤ v, where each x_i is the largest value written to input i by a WriteInput operation that completes before the ReadOutput operation starts, or the initial value of input i if no such operation exists.
(3) The value v returned by any ReadOutput operation satisfies v ≤ f(y_1, ..., y_k), where each y_i is the largest value written to input i by a WriteInput operation that starts before the ReadOutput operation completes, or the initial value of input i if no such operation exists.

The intuition here is that the values at each output appear to be non-decreasing over time (the first condition), all completed WriteInput operations are always observed by ReadOutput (the second condition), and no spurious larger values are observed by ReadOutput (the third condition). But when operations are concurrent, it may be that some ReadOutput operations return intermediate values that are not consistent with any fixed ordering of WriteInput operations, violating linearizability (an example is given in Subsection 5.1).

We convert a monotone circuit to a monotone data structure by assigning a max register to each input and each gate output in the circuit. We assume that these max registers are initialized to a default minimum value, so that the initial state of the data structure will be consistent with the circuit. A WriteInput operation on this data structure updates an input (using WriteMax) and propagates the resulting changes through the circuit, as described in procedure WriteInput. A ReadOutput operation reads the value of some output node by performing a ReadMax operation on the corresponding output. The cost of a ReadOutput operation is the same as that of a ReadMax operation: O(min(log m, n)). The cost of a WriteInput operation depends on the structure of the circuit: in the worst case, it is O(Sd min(log m, n)), where S is the number of gates reachable from the input and d is the maximum number of inputs to each gate.

procedure UpdateGate(g)
    let x_1, ..., x_d be the inputs to g
    for i ← 1 to d do
        y_i ← ReadMax(x_i)
    WriteMax(g, f_g(y_1, ..., y_d))

procedure WriteInput(g, v)
    WriteMax(g, v)
    let g_1, ..., g_S be a topological sort of all gates reachable from g
    for i ← 1 to S do
        UpdateGate(g_i)

procedure ReadOutput(g)
    return ReadMax(g)

Algorithm 3: Monotone circuit implementation
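The following Python sketch (ours; the class, its representation, and all names are assumptions, not the paper's API) mirrors Algorithm 3: each input and each gate output is modeled as a max register, WriteInput propagates through a topological sort of the reachable gates, and ReadOutput performs a single ReadMax.

# Sketch of Algorithm 3 (ours). Gates form a DAG; each gate's output
# lives in a "max register" modeled as a dict entry (atomicity assumed).

class MonotoneCircuit:
    def __init__(self):
        self.val = {}        # node name -> current max-register value
        self.fn = {}         # gate name -> (monotone function, input names)
        self.succ = {}       # node name -> gates it feeds into

    def add_input(self, name, initial=0):
        self.val[name] = initial
        self.succ.setdefault(name, [])

    def add_gate(self, name, f, inputs, initial=0):
        self.val[name] = initial
        self.fn[name] = (f, inputs)
        self.succ.setdefault(name, [])
        for i in inputs:
            self.succ.setdefault(i, []).append(name)

    def _update_gate(self, g):
        f, inputs = self.fn[g]
        ys = [self.val[i] for i in inputs]       # ReadMax on each input
        self.val[g] = max(self.val[g], f(*ys))   # WriteMax of f_g(y_1..y_d)

    def write_input(self, x, v):
        self.val[x] = max(self.val[x], v)        # WriteMax on the input
        for g in self._gates_reachable_from(x):  # in topological order
            self._update_gate(g)

    def read_output(self, g):
        return self.val[g]                       # ReadMax on the output

    def _gates_reachable_from(self, x):
        # topological order via reversed DFS postorder over the DAG
        order, seen = [], set()
        def dfs(u):
            for w in self.succ.get(u, []):
                if w not in seen:
                    seen.add(w)
                    dfs(w)
            if u in self.fn:
                order.append(u)
        dfs(x)
        return list(reversed(order))

# Usage: a two-level adder tree, i.e., the counter circuit of Section 3.
c = MonotoneCircuit()
for name in ('x0', 'x1', 'x2', 'x3'):
    c.add_input(name)
c.add_gate('a', lambda u, v: u + v, ['x0', 'x1'])
c.add_gate('b', lambda u, v: u + v, ['x2', 'x3'])
c.add_gate('out', lambda u, v: u + v, ['a', 'b'])
c.write_input('x1', 2)
c.write_input('x2', 3)
assert c.read_output('out') == 5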

THEOREM 4.2. For any fixed monotone circuit C, the WriteInput and ReadOutput operations based on that circuit are monotone consistent.

PROOF. Consider some execution of a collection of WriteInput and ReadOutput operations. We think of this execution as consisting of a sequence of atomic WriteMax and ReadMax operations, and use time to refer to points in the execution. The first clause in Definition 4.1 follows immediately from the linearizability of max registers, since we can just order ReadOutput operations by the order of their internal ReadMax operations. For the remaining two clauses, we will jump ahead to the third, upper-bound, clause first. The proof is slightly simpler than the proof for the lower bound, and it allows us to develop tools that we will use for the proof of the second clause.

For each input x_i, let V_i^t be the maximum value written to the register representing x_i at or before time t, or its initial value if no such write exists. For any gate g, let C_g(x_1, ..., x_n) be the function giving the output of g when the original circuit C is applied to x_1, ..., x_n (see Figure 3). For simplicity, we allow C in this case to include internal gates, output gates, and the registers representing inputs (which we can think of as zero-input gates). We thus can define C_g recursively by C_g(x_1, ..., x_n) = x_i when g = x_i is an input gate, and C_g(x_1, ..., x_n) = f_g(C_{g_{i1}}(x_1, ..., x_n), ..., C_{g_{ik}}(x_1, ..., x_n)) when g is an internal or output gate with inputs g_{i1}, ..., g_{ik}.

[Fig. 3. A gate g in a circuit computes a function of its inputs f_g(g_{i1}, ..., g_{ik}). The inputs to the circuit are x_1, ..., x_n.]

Let g^t be the actual output of g in our execution at time t, i.e., the contents of the max register representing the output of g. We claim that for all g and t, g^t ≤ C_g(V_1^t, ..., V_n^t). The proof is by induction on t and the structure of C. In the initial state, all max registers are at their default minimum value and the induction hypothesis holds. Suppose now that some max register g changes its value at time t. If this max register represents an input, the new value corresponds to some input supplied directly to WriteInput, and we have g^t = C_g(V_1^t, ..., V_n^t). If the max register represents an internal or output gate, its value is written during some call to UpdateGate, and is equal to f_g(g_{i1}^{t_1}, g_{i2}^{t_2}, ..., g_{ik}^{t_k}), where each g_{ij} is some register read by this call to UpdateGate and t_j < t is the time at which it is read. Because max register values can only increase over time, we have, for each j, g_{ij}^{t_j} ≤ g_{ij}^t = g_{ij}^{t−1} ≤ C_{g_{ij}}(V_1^{t−1}, ..., V_n^{t−1}), by the induction hypothesis and the fact that only gate g changes at time t. This last quantity is in turn at most C_{g_{ij}}(V_1^t, ..., V_n^t), as only gate g changes at time t. By monotonicity of f_g we then

get

g^t = f_g(g_{i1}^{t_1}, g_{i2}^{t_2}, ..., g_{ik}^{t_k}) ≤ f_g(C_{g_{i1}}(V_1^t, ..., V_n^t), ..., C_{g_{ik}}(V_1^t, ..., V_n^t)) = C_g(V_1^t, ..., V_n^t),

as claimed, which completes the proof of clause 3.

We now consider clause 2, which gives a lower bound on output values. For each time t and input x_i, let v_i^t be the maximum value written to the max register representing x_i by a WriteInput operation that finishes at or before time t. We wish to show that for any output gate g, g^t ≥ C_g(v_1^t, ..., v_n^t). As with the upper bound, we proceed by induction on t and the structure of C. But the induction hypothesis is slightly more complicated, in that in order to make the proof go through we must take into account which gate we are working with when choosing which input values to consider. For each gate g, let v_i^t(g) be the maximum value written to input register x_i by any instance of WriteInput that completes UpdateGate(g) at or before time t, or its initial value if no such operation exists. Our induction hypothesis is that at each time t and for each gate g, g^t ≥ C_g(v_1^t(g), ..., v_n^t(g)). In general we have v_i^t ≤ v_i^t(g), since the set of operations that gives v_i^t is contained in the set of operations that gives v_i^t(g): any process that writes to some input x_i that affects the value of g as part of some WriteInput operation must complete UpdateGate(g) before finishing the operation. Hence having g^t ≥ C_g(v_1^t(g), ..., v_n^t(g)) implies g^t ≥ C_g(v_1^t, ..., v_n^t).

The claim holds for t = 0, since we assume a consistent initialization of the circuit. Suppose now that some max register g changes its value at time t. If g is an input, the induction hypothesis holds trivially. Otherwise, consider the set of all WriteInput operations that write to g at or before time t. Among these operations, one of them is the last to complete UpdateGate(g′) for some input g′ to g. Let this event occur at time t′ < t, and call the process that completes this operation p. We now consider the effect of the UpdateGate(g) procedure carried out as part of this WriteInput operation. Because no other operation completes an UpdateGate procedure for any input g_{ij} to g between t′ and t, we have that for each such input and each i, v_i^t(g_{ij}) = v_i^{t′}(g_{ij}). Since the ReadMax operation on each g_{ij} in p's call to UpdateGate(g) occurs after time t′, it obtains a value that is at least g_{ij}^{t′} ≥ C_{g_{ij}}(v_1^{t′}(g_{ij}), ..., v_n^{t′}(g_{ij})), by the induction hypothesis, and this value is at least C_{g_{ij}}(v_1^t(g_{ij}), ..., v_n^t(g_{ij})), by the monotonicity of C_{g_{ij}} and the previous observation on the relation between v_i^{t′}(g_{ij}) and v_i^t(g_{ij}). But then

g^t ≥ f_g(C_{g_{i1}}(v_1^t(g_{i1}), ..., v_n^t(g_{i1})), ..., C_{g_{ik}}(v_1^t(g_{ik}), ..., v_n^t(g_{ik})))
    ≥ f_g(C_{g_{i1}}(v_1^t(g), ..., v_n^t(g)), ..., C_{g_{ik}}(v_1^t(g), ..., v_n^t(g)))
    = C_g(v_1^t(g), ..., v_n^t(g)).

5. APPLICATIONS

In this section we consider applications of the circuit-based method for building data structures described in Section 4. Most of these applications will be variants on counters, as these are the main example of monotone data structures currently found in the literature. Because we are working over a finite alphabet, all of our counters will be bounded.

5.1. Generalized counters

A generalized counter takes an argument to its increment operation, which allows the counter to be incremented by any non-negative amount. We can construct a generalized counter from a circuit consisting of a binary tree of adders, where each gate in the circuit computes the sum of its inputs and each input to the circuit is assigned to a distinct process to avoid lost updates. This gives an implementation essentially identical to Algorithm 2, except that now an increment may start by adding an arbitrary non-negative value to the leaf register. As in Algorithm 2, the step complexity of an increment is O(min(log n log m, n)) and of a read is O(min(log m, n)).

Unfortunately, unlike in the unit-increment case, it is not hard to see that our generalized counter is not linearizable, even though it satisfies monotone consistency. The reason is that read operations may return intermediate values that are not consistent with any ordering of the increments. Here is a small example of a non-linearizable execution, which we present to illustrate the differences between linearizability and monotone consistency. Consider an execution with three writers, and look at what happens at the top gate g in the circuit. Imagine that process p_0 executes a WriteInput operation with argument 0, p_1 executes a WriteInput operation with argument 1, and p_2 executes a WriteInput operation with argument 2. Let p_1 and p_2 arrive at the top gate through different intermediate gates g_1 and g_2; we also assume that each process reads g_2 before g_1 when executing UpdateGate(g). Now consider an execution in which p_0 arrives at g first, reading 0 from g_2 just before p_2 writes 2 to g_2. Process p_2 then reads g_2 and g_1 and computes the sum 2, but does not write it yet. Process p_1 now writes 1 to g_1 and p_0 reads it, causing p_0 to compute the sum 1, which it writes to the output gate. Process p_2 now finishes by writing 2 to the output gate. If both these values are observed by readers, we have a non-linearizable schedule, as there is no sequential ordering of the increments 0, 1, and 2 that will yield both output values. However, for restricted applications, we can obtain a fully linearizable object, as shown in the following sections.

5.2. Threshold objects

Another variant of a shared counter that is linearizable is a threshold object. This counter allows increment operations, and supports a read operation that returns a binary value indicating whether a predetermined threshold has been crossed. We implement a threshold object with threshold T by having increment operations act as in the generalized counter, and having a read operation return 1 if the value it reads from the output gate is at least T, and 0 otherwise. We show that this implementation is linearizable even with non-uniform increments, where the requirement is that a read operation returns 1 if and only if the sum of the increment operations linearized before it is at least T.

LEMMA 5.1. The implementation of a threshold object C with threshold T by a monotone data structure with the procedures WriteInput and ReadOutput is linearizable.

PROOF. We use monotone consistency to prove linearizability for the threshold object C. Let C_1 and C_2 be the sub-counters that are added to the final output gate g.
We order read operations according to the ordering implied by monotone consistency, which is consistent with the order of non-overlapping read operations, and implies that once a read operation returns 1, every following read operation returns 1. We order write operations according to their execution order, which is clearly consistent with the order of non-overlapping operations.
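To illustrate the construction, here is a small Python sketch of a threshold object (ours, not from the paper), layered on the TreeCounter sketch from Section 3; all names are ours and atomicity is again assumed, whereas the construction above works over the general monotone circuit of Algorithm 3.

# Sketch (ours) of a threshold object with threshold T over an adder
# tree. Read returns 1 iff the root sum has reached T; since the root
# max register is non-decreasing, the returned bit only goes from 0 to 1.

class ThresholdObject:
    def __init__(self, depth: int, T: int):
        self.tree = TreeCounter(depth)      # TreeCounter from Section 3
        self.T = T

    def add(self, pid: int, amount: int):
        # generalized (non-uniform) increment: bump the leaf total, then
        # recompute the sums along the path to the root, as in Algorithm 2
        leaf = format(pid, 'b').zfill(self.tree.depth)
        self.tree.r[leaf] += amount
        for k in range(self.tree.depth - 1, -1, -1):
            p = leaf[:k]
            total = self.tree.r[p + '0'] + self.tree.r[p + '1']
            self.tree.r[p] = max(self.tree.r[p], total)   # WriteMax

    def read(self) -> int:
        return 1 if self.tree.read() >= self.T else 0

t = ThresholdObject(depth=2, T=5)
t.add(0, 3)
assert t.read() == 0
t.add(3, 2)
assert t.read() == 1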


More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

Subset sum problem and dynamic programming

Subset sum problem and dynamic programming Lecture Notes: Dynamic programming We will discuss the subset sum problem (introduced last time), and introduce the main idea of dynamic programming. We illustrate it further using a variant of the so-called

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

On the Complexity of the Policy Improvement Algorithm. for Markov Decision Processes

On the Complexity of the Policy Improvement Algorithm. for Markov Decision Processes On the Complexity of the Policy Improvement Algorithm for Markov Decision Processes Mary Melekopoglou Anne Condon Computer Sciences Department University of Wisconsin - Madison 0 West Dayton Street Madison,

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Paths, Flowers and Vertex Cover

Paths, Flowers and Vertex Cover Paths, Flowers and Vertex Cover Venkatesh Raman, M.S. Ramanujan, and Saket Saurabh Presenting: Hen Sender 1 Introduction 2 Abstract. It is well known that in a bipartite (and more generally in a Konig)

More information

Sorting Networks. March 2, How much time does it take to compute the value of this circuit?

Sorting Networks. March 2, How much time does it take to compute the value of this circuit? Sorting Networks March 2, 2005 1 Model of Computation Can we do better by using different computes? Example: Macaroni sort. Can we do better by using the same computers? How much time does it take to compute

More information

Greedy Algorithms CHAPTER 16

Greedy Algorithms CHAPTER 16 CHAPTER 16 Greedy Algorithms In dynamic programming, the optimal solution is described in a recursive manner, and then is computed ``bottom up''. Dynamic programming is a powerful technique, but it often

More information

Lecture 9 March 4, 2010

Lecture 9 March 4, 2010 6.851: Advanced Data Structures Spring 010 Dr. André Schulz Lecture 9 March 4, 010 1 Overview Last lecture we defined the Least Common Ancestor (LCA) and Range Min Query (RMQ) problems. Recall that an

More information

Linearizable Iterators

Linearizable Iterators Linearizable Iterators Supervised by Maurice Herlihy Abstract Petrank et. al. [5] provide a construction of lock-free, linearizable iterators for lock-free linked lists. We consider the problem of extending

More information

FINAL EXAM SOLUTIONS

FINAL EXAM SOLUTIONS COMP/MATH 3804 Design and Analysis of Algorithms I Fall 2015 FINAL EXAM SOLUTIONS Question 1 (12%). Modify Euclid s algorithm as follows. function Newclid(a,b) if a

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Discrete mathematics

Discrete mathematics Discrete mathematics Petr Kovář petr.kovar@vsb.cz VŠB Technical University of Ostrava DiM 470-2301/02, Winter term 2018/2019 About this file This file is meant to be a guideline for the lecturer. Many

More information

arxiv: v1 [cs.dc] 8 May 2017

arxiv: v1 [cs.dc] 8 May 2017 Towards Reduced Instruction Sets for Synchronization arxiv:1705.02808v1 [cs.dc] 8 May 2017 Rati Gelashvili MIT gelash@mit.edu Alexander Spiegelman Technion sashas@tx.technion.ac.il Idit Keidar Technion

More information

Approximate Shared-Memory Counting Despite a Strong Adversary

Approximate Shared-Memory Counting Despite a Strong Adversary Despite a Strong Adversary James Aspnes Keren Censor Processes can read and write shared atomic registers. Read on an atomic register returns value of last write. Timing of operations is controlled by

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES)

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Chapter 1 A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Piotr Berman Department of Computer Science & Engineering Pennsylvania

More information

Chapter 19. Sorting Networks Model of Computation. By Sariel Har-Peled, December 17,

Chapter 19. Sorting Networks Model of Computation. By Sariel Har-Peled, December 17, Chapter 19 Sorting Networks By Sariel Har-Peled, December 17, 2012 1 19.1 Model of Computation It is natural to ask if one can perform a computational task considerably faster by using a different architecture

More information

Fork Sequential Consistency is Blocking

Fork Sequential Consistency is Blocking Fork Sequential Consistency is Blocking Christian Cachin Idit Keidar Alexander Shraer May 14, 2008 Abstract We consider an untrusted server storing shared data on behalf of clients. We show that no storage

More information

k-abortable Objects: Progress under High Contention

k-abortable Objects: Progress under High Contention k-abortable Objects: Progress under High Contention Naama Ben-David 1, David Yu Cheng Chan 2, Vassos Hadzilacos 2, and Sam Toueg 2 Carnegie Mellon University 1 University of Toronto 2 Outline Background

More information

1 Minimum Cut Problem

1 Minimum Cut Problem CS 6 Lecture 6 Min Cut and Karger s Algorithm Scribes: Peng Hui How, Virginia Williams (05) Date: November 7, 07 Anthony Kim (06), Mary Wootters (07) Adapted from Virginia Williams lecture notes Minimum

More information

Algorithms Dr. Haim Levkowitz

Algorithms Dr. Haim Levkowitz 91.503 Algorithms Dr. Haim Levkowitz Fall 2007 Lecture 4 Tuesday, 25 Sep 2007 Design Patterns for Optimization Problems Greedy Algorithms 1 Greedy Algorithms 2 What is Greedy Algorithm? Similar to dynamic

More information

Greedy Algorithms 1 {K(S) K(S) C} For large values of d, brute force search is not feasible because there are 2 d {1,..., d}.

Greedy Algorithms 1 {K(S) K(S) C} For large values of d, brute force search is not feasible because there are 2 d {1,..., d}. Greedy Algorithms 1 Simple Knapsack Problem Greedy Algorithms form an important class of algorithmic techniques. We illustrate the idea by applying it to a simplified version of the Knapsack Problem. Informally,

More information

Algorithms that Adapt to Contention

Algorithms that Adapt to Contention Algorithms that Adapt to Contention Hagit Attiya Department of Computer Science Technion June 2003 Adaptive Algorithms / Hagit Attiya 1 Fast Mutex Algorithm [Lamport, 1986] In a well-designed system, most

More information

Fork Sequential Consistency is Blocking

Fork Sequential Consistency is Blocking Fork Sequential Consistency is Blocking Christian Cachin Idit Keidar Alexander Shraer Novembe4, 008 Abstract We consider an untrusted server storing shared data on behalf of clients. We show that no storage

More information

c 2003 Society for Industrial and Applied Mathematics

c 2003 Society for Industrial and Applied Mathematics SIAM J. COMPUT. Vol. 32, No. 2, pp. 538 547 c 2003 Society for Industrial and Applied Mathematics COMPUTING THE MEDIAN WITH UNCERTAINTY TOMÁS FEDER, RAJEEV MOTWANI, RINA PANIGRAHY, CHRIS OLSTON, AND JENNIFER

More information

Notes on Binary Dumbbell Trees

Notes on Binary Dumbbell Trees Notes on Binary Dumbbell Trees Michiel Smid March 23, 2012 Abstract Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information

time using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o

time using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o Reconstructing a Binary Tree from its Traversals in Doubly-Logarithmic CREW Time Stephan Olariu Michael Overstreet Department of Computer Science, Old Dominion University, Norfolk, VA 23529 Zhaofang Wen

More information

ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007

ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007 ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007 This examination is a three hour exam. All questions carry the same weight. Answer all of the following six questions.

More information

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions Dr. Amotz Bar-Noy s Compendium of Algorithms Problems Problems, Hints, and Solutions Chapter 1 Searching and Sorting Problems 1 1.1 Array with One Missing 1.1.1 Problem Let A = A[1],..., A[n] be an array

More information

CS525 Winter 2012 \ Class Assignment #2 Preparation

CS525 Winter 2012 \ Class Assignment #2 Preparation 1 CS525 Winter 2012 \ Class Assignment #2 Preparation Ariel Stolerman 2.26) Let be a CFG in Chomsky Normal Form. Following is a proof that for any ( ) of length exactly steps are required for any derivation

More information

Analyze the obvious algorithm, 5 points Here is the most obvious algorithm for this problem: (LastLargerElement[A[1..n]:

Analyze the obvious algorithm, 5 points Here is the most obvious algorithm for this problem: (LastLargerElement[A[1..n]: CSE 101 Homework 1 Background (Order and Recurrence Relations), correctness proofs, time analysis, and speeding up algorithms with restructuring, preprocessing and data structures. Due Thursday, April

More information

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions.

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions. CS 787: Advanced Algorithms NP-Hardness Instructor: Dieter van Melkebeek We review the concept of polynomial-time reductions, define various classes of problems including NP-complete, and show that 3-SAT

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri 1 Introduction Today, we will introduce a fundamental algorithm design paradigm,

More information

Fixed Parameter Algorithms

Fixed Parameter Algorithms Fixed Parameter Algorithms Dániel Marx Tel Aviv University, Israel Open lectures for PhD students in computer science January 9, 2010, Warsaw, Poland Fixed Parameter Algorithms p.1/41 Parameterized complexity

More information

Applied Algorithm Design Lecture 3

Applied Algorithm Design Lecture 3 Applied Algorithm Design Lecture 3 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 3 1 / 75 PART I : GREEDY ALGORITHMS Pietro Michiardi (Eurecom) Applied Algorithm

More information

Multi-Way Search Trees

Multi-Way Search Trees Multi-Way Search Trees Manolis Koubarakis 1 Multi-Way Search Trees Multi-way trees are trees such that each internal node can have many children. Let us assume that the entries we store in a search tree

More information

Deterministic Conflict-Free Coloring for Intervals: from Offline to Online

Deterministic Conflict-Free Coloring for Intervals: from Offline to Online Deterministic Conflict-Free Coloring for Intervals: from Offline to Online AMOTZ BAR-NOY Brooklyn College and the Graduate Center, City University of New York PANAGIOTIS CHEILARIS City University of New

More information

9.1 Cook-Levin Theorem

9.1 Cook-Levin Theorem CS787: Advanced Algorithms Scribe: Shijin Kong and David Malec Lecturer: Shuchi Chawla Topic: NP-Completeness, Approximation Algorithms Date: 10/1/2007 As we ve already seen in the preceding lecture, two

More information

ACONCURRENT system may be viewed as a collection of

ACONCURRENT system may be viewed as a collection of 252 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 10, NO. 3, MARCH 1999 Constructing a Reliable Test&Set Bit Frank Stomp and Gadi Taubenfeld AbstractÐThe problem of computing with faulty

More information

Greedy Algorithms 1. For large values of d, brute force search is not feasible because there are 2 d

Greedy Algorithms 1. For large values of d, brute force search is not feasible because there are 2 d Greedy Algorithms 1 Simple Knapsack Problem Greedy Algorithms form an important class of algorithmic techniques. We illustrate the idea by applying it to a simplified version of the Knapsack Problem. Informally,

More information

The Relative Power of Synchronization Methods

The Relative Power of Synchronization Methods Chapter 5 The Relative Power of Synchronization Methods So far, we have been addressing questions of the form: Given objects X and Y, is there a wait-free implementation of X from one or more instances

More information

Scan and its Uses. 1 Scan. 1.1 Contraction CSE341T/CSE549T 09/17/2014. Lecture 8

Scan and its Uses. 1 Scan. 1.1 Contraction CSE341T/CSE549T 09/17/2014. Lecture 8 CSE341T/CSE549T 09/17/2014 Lecture 8 Scan and its Uses 1 Scan Today, we start by learning a very useful primitive. First, lets start by thinking about what other primitives we have learned so far? The

More information

15.4 Longest common subsequence

15.4 Longest common subsequence 15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible

More information

Decreasing the Diameter of Bounded Degree Graphs

Decreasing the Diameter of Bounded Degree Graphs Decreasing the Diameter of Bounded Degree Graphs Noga Alon András Gyárfás Miklós Ruszinkó February, 00 To the memory of Paul Erdős Abstract Let f d (G) denote the minimum number of edges that have to be

More information

Recoloring k-degenerate graphs

Recoloring k-degenerate graphs Recoloring k-degenerate graphs Jozef Jirásek jirasekjozef@gmailcom Pavel Klavík pavel@klavikcz May 2, 2008 bstract This article presents several methods of transforming a given correct coloring of a k-degenerate

More information

The divide and conquer strategy has three basic parts. For a given problem of size n,

The divide and conquer strategy has three basic parts. For a given problem of size n, 1 Divide & Conquer One strategy for designing efficient algorithms is the divide and conquer approach, which is also called, more simply, a recursive approach. The analysis of recursive algorithms often

More information

Fully Dynamic Maximal Independent Set with Sublinear Update Time

Fully Dynamic Maximal Independent Set with Sublinear Update Time Fully Dynamic Maximal Independent Set with Sublinear Update Time ABSTRACT Sepehr Assadi University of Pennsylvania Philadelphia, PA, USA sassadi@cis.upenn.com Baruch Schieber IBM Research Yorktown Heights,

More information

Space vs Time, Cache vs Main Memory

Space vs Time, Cache vs Main Memory Space vs Time, Cache vs Main Memory Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 (Moreno Maza) Space vs Time, Cache vs Main Memory CS 4435 - CS 9624 1 / 49

More information

Cantor s Diagonal Argument for Different Levels of Infinity

Cantor s Diagonal Argument for Different Levels of Infinity JANUARY 2015 1 Cantor s Diagonal Argument for Different Levels of Infinity Michael J. Neely University of Southern California http://www-bcf.usc.edu/ mjneely Abstract These notes develop the classic Cantor

More information

EDAA40 At home exercises 1

EDAA40 At home exercises 1 EDAA40 At home exercises 1 1. Given, with as always the natural numbers starting at 1, let us define the following sets (with iff ): Give the number of elements in these sets as follows: 1. 23 2. 6 3.

More information

On Universal Cycles of Labeled Graphs

On Universal Cycles of Labeled Graphs On Universal Cycles of Labeled Graphs Greg Brockman Harvard University Cambridge, MA 02138 United States brockman@hcs.harvard.edu Bill Kay University of South Carolina Columbia, SC 29208 United States

More information

Greedy Algorithms. CLRS Chapters Introduction to greedy algorithms. Design of data-compression (Huffman) codes

Greedy Algorithms. CLRS Chapters Introduction to greedy algorithms. Design of data-compression (Huffman) codes Greedy Algorithms CLRS Chapters 16.1 16.3 Introduction to greedy algorithms Activity-selection problem Design of data-compression (Huffman) codes (Minimum spanning tree problem) (Shortest-path problem)

More information

Anonymous and Fault-Tolerant Shared-Memory Computing

Anonymous and Fault-Tolerant Shared-Memory Computing Distributed Computing manuscript No. (will be inserted by the editor) Rachid Guerraoui Eric Ruppert Anonymous and Fault-Tolerant Shared-Memory Computing the date of receipt and acceptance should be inserted

More information

Verifying a Border Array in Linear Time

Verifying a Border Array in Linear Time Verifying a Border Array in Linear Time František Franěk Weilin Lu P. J. Ryan W. F. Smyth Yu Sun Lu Yang Algorithms Research Group Department of Computing & Software McMaster University Hamilton, Ontario

More information

Parameterized graph separation problems

Parameterized graph separation problems Parameterized graph separation problems Dániel Marx Department of Computer Science and Information Theory, Budapest University of Technology and Economics Budapest, H-1521, Hungary, dmarx@cs.bme.hu Abstract.

More information

3 SOLVING PROBLEMS BY SEARCHING

3 SOLVING PROBLEMS BY SEARCHING 48 3 SOLVING PROBLEMS BY SEARCHING A goal-based agent aims at solving problems by performing actions that lead to desirable states Let us first consider the uninformed situation in which the agent is not

More information

MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS

MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS MANUEL FERNÁNDEZ, NICHOLAS SIEGER, AND MICHAEL TAIT Abstract. In 99, Bollobás and Frieze showed that the threshold for G n,p to contain a spanning

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Suggested Reading: Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Probabilistic Modelling and Reasoning: The Junction

More information

On k-dimensional Balanced Binary Trees*

On k-dimensional Balanced Binary Trees* journal of computer and system sciences 52, 328348 (1996) article no. 0025 On k-dimensional Balanced Binary Trees* Vijay K. Vaishnavi Department of Computer Information Systems, Georgia State University,

More information

Recursion and Structural Induction

Recursion and Structural Induction Recursion and Structural Induction Mukulika Ghosh Fall 2018 Based on slides by Dr. Hyunyoung Lee Recursively Defined Functions Recursively Defined Functions Suppose we have a function with the set of non-negative

More information

CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics

CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics 1 Sorting 1.1 Problem Statement You are given a sequence of n numbers < a 1, a 2,..., a n >. You need to

More information

Distributed Sorting. Chapter Array & Mesh

Distributed Sorting. Chapter Array & Mesh Chapter 9 Distributed Sorting Indeed, I believe that virtually every important aspect of programming arises somewhere in the context of sorting [and searching]! Donald E. Knuth, The Art of Computer Programming

More information

Cuckoo Hashing for Undergraduates

Cuckoo Hashing for Undergraduates Cuckoo Hashing for Undergraduates Rasmus Pagh IT University of Copenhagen March 27, 2006 Abstract This lecture note presents and analyses two simple hashing algorithms: Hashing with Chaining, and Cuckoo

More information