Etna: a fault-tolerant algorithm for atomic mutable DHT data


Athicha Muthitacharoen, Seth Gilbert, Robert Morris
athicha@lcs.mit.edu, sethg@mit.edu, rtm@lcs.mit.edu
MIT Laboratory for Computer Science, 200 Technology Square, Cambridge, MA

Abstract

This paper presents Etna, an algorithm for atomic reads and writes of replicated data stored in a distributed hash table. Etna correctly handles dynamically changing sets of replica hosts, and is optimized for reads, writes, and reconfiguration, in that order. Etna maintains a series of replica configurations as replicas fail, adding new replicas as needed from the pool supplied by the distributed hash table system. It uses Paxos to ensure consensus on the identity of each new configuration. For simplicity and performance, Etna serializes all reads and writes through a primary during the lifetime of each configuration. Etna performs read and write operations in a single round from the primary. Experiments in an environment with high network delays show that Etna's read latency is determined by round-trip delay in the underlying network, while write and reconfiguration latency is determined by the transmission time required to send data to each replica. Etna's write latency is about the same as that of a non-atomic replicating DHT, and Etna's read latency is about twice that of a non-atomic DHT because Etna assembles a quorum for every read.

Number of pages: 10. Keywords: replication, consistency, atomicity. Regular submission; the first two authors are students. Please consider for brief announcement as well.

1 Introduction

An increasing number of research systems are being built on top of distributed hash tables (DHTs). These include systems for messaging [8], sharing read/write files [17], publishing read-only content [5], resolving names [25], maintaining bulletin boards [20], and searching large text databases [23]. All of the cited examples depend on a DHT to store and replicate their data, and all of them either assume mutable DHT storage or could be simplified if mutable storage were supported.

Most existing DHTs support immutable or write-once data well. In contrast, DHTs that support fault-tolerant replicated mutable data typically provide no consistency guarantees. For this reason, DHT-based applications must implement application-specific techniques to ensure consistency. A typical approach is for the application to time-stamp updates for eventual consistency, and to include complex techniques to handle or avoid the consequences of conflicting updates. However, many DHT-based applications could be simplified if they could rely on atomic updates of mutable data.

Existing quorum-based techniques for consistent replication (for example, [24, 2]) cannot easily be applied to DHTs for a number of reasons. Most critically, these techniques are not well-suited to dynamic environments. In a peer-to-peer network, hosts may leave the system permanently. This means that a replication scheme must designate new replica hosts (or suffer a failure of the quorum system), and over time many hosts may have held replicas of a given piece of data. As a result there is a danger that stale replicas may accumulate one-by-one in a network partition, and eventually form a quorum. Existing replication techniques avoid this problem by assuming a fixed set of replica hosts.

This paper presents Etna, a new algorithm for atomic read/write replicated DHT data blocks.
Etna guarantees atomicity regardless of network behavior; for example, it will not return stale data during network partitions. Etna uses a succession of configurations to ensure that only the single most up-to-date quorum of replicas can execute operations, much like Rambo [14]. Etna handles configuration changes using Paxos. Etna is designed for low message complexity in the common case in which reads and writes are more frequent than reconfiguration: both reads and writes involve a single round of communication.

This work contributes the first full implementation of replicated atomic update in a complete DHT (source available). Etna is more efficient than previous solutions for reads and writes, though it prohibits reconfiguration concurrently with reads and writes.

The rest of this paper includes an overview of related work in Section 2, a summary of existing ideas used by Etna in Section 3, a description of the Etna algorithm in Section 4, a proof of atomicity in Section 5, a performance analysis in Section 6, and an evaluation of an implementation in Section 7.

2 Related Work

A few DHT proposals address the issue of atomic data consistency in the face of dynamic membership. Rodrigues et al. [19] use a small configuration service to maintain and distribute a list of all the non-failed nodes. Since every participant is aware of the complete list of active nodes, it is easy to transfer and replicate data while ensuring consistency. However, maintaining global knowledge may limit the approach to small or relatively static systems. Etna uses the Chord DHT [21] to manage the dynamic environment, and augments the basic service to guarantee robust, mutable data.

There have been many quorum-based atomic read/write algorithms developed for static sets of replica hosts (for example, [24, 2]). These algorithms assume that the participants are known in advance, and that the number of failures is bounded by a constant.

Group communication services [10] and other virtually synchronous services [3] support the construction of robust and dynamic systems. These algorithms provide stronger guarantees than are required for mutable data storage, implementing totally-ordered broadcast, which effectively requires consensus to be performed for every operation. As a result, the GCS algorithms work best in a low-latency LAN environment. Also, in most GCS systems, whenever a node joins or leaves, a new view (i.e., configuration) is created, leading to potentially slow reconfiguration. Etna uses some of the reconfiguration techniques developed in the GCS protocols. However, the read and write operations in Etna require less communication than the multi-phase protocol required to perform totally-ordered broadcast. Also, the rate of reconfiguration can be significantly reduced: a new configuration need only be created when a number of replicas have failed (and no reconfiguration is necessary as a result of join operations).

Prior approaches for reconfigurable read/write memory [11, 6, 18] require that new quorums include processors from the old quorums, restricting the choice of a new configuration. Some earlier algorithms [16, 7] rely on a single process to initiate all reconfigurations. The RAMBO algorithms [14, 9], on the other hand, allow completely flexible reconfiguration, and Etna takes a similar approach. However, RAMBO focuses on allowing reads and writes to proceed concurrently with reconfiguration; Etna, instead, optimizes read and write performance assuming that reconfigurations are rare. As a result, read and write operations in Etna are much more efficient than those in RAMBO.

3 Background

This section describes two existing techniques on which Etna relies.
3.1 DHash and Chord

Etna is an extension of DHash [5], a distributed hash table. DHash uses Chord [22] as a look-up service. Every Chord node has a unique ID. Chord arranges that each node knows its current successor, the node with the next highest ID; the ID space wraps around at zero. Thus the nodes form a circular list linked through the successors. DHash stores blocks, or key/value pairs. DHash stores each block on the node that is the block's successor (the node whose ID most immediately follows the block's key).
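The successor rule above determines both which node stores a block and, in Etna, which node acts as the block's primary. The following sketch is illustrative only (a toy ring, not DHash's actual code; the Ring class, identifier size, and hashing choice are assumptions); it shows how a key maps to its successor and to the k nodes that would hold its replicas.

import hashlib
from bisect import bisect_right

class Ring:
    """Toy Chord-style ring: node IDs live in a circular identifier space."""
    def __init__(self, node_ids, bits=32):
        self.space = 2 ** bits
        self.nodes = sorted(nid % self.space for nid in node_ids)

    def successor(self, key):
        """The node whose ID most immediately follows the key, wrapping at zero."""
        i = bisect_right(self.nodes, key % self.space)
        return self.nodes[i % len(self.nodes)]

    def k_successors(self, key, k):
        """The k nodes that immediately succeed the key; DHash replicates there."""
        i = bisect_right(self.nodes, key % self.space)
        return [self.nodes[(i + j) % len(self.nodes)] for j in range(k)]

def block_key(data, bits=32):
    """Hash a block's contents into the ring's identifier space."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big") % (2 ** bits)

ring = Ring([5, 120, 310, 777, 901])
key = block_key(b"example block")
print(ring.successor(key))        # the node that stores the block (Etna's primary)
print(ring.k_successors(key, 3))  # the block's replica set for k = 3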

3.2 Paxos

Etna uses the Paxos [12, 13] distributed consensus protocol to determine a total ordering of the replica configurations. We execute a single instance of the Paxos protocol for each configuration; Paxos then outputs the next configuration. All Paxos procedures are subscripted by the number of the configuration that is associated with that instance of Paxos.

Paxos is a multi-phase protocol that implements consensus. As originally described, Paxos consists of two components: an eventual leader election service, and an agreement protocol. The eventual leader election service ensures that eventually only one node thinks that it is the leader (if the communication network is eventually well-behaved). The agreement protocol allows a node that thinks it is the leader to propose a value. It ensures that at most one value is chosen, and that, if eventually there is only one leader, then some value is chosen. In the first phase of the protocol, the supposed leader queries for previously proposed decision values; the second phase chooses a decision value.

Etna only makes use of the agreement protocol. It relies on Chord to choose the leader: Etna periodically queries Chord for the successor node of a given block. This node then plays the role of the leader in the agreement protocol. It determines whether a reconfiguration is needed, and, if so, begins the proposal by performing a propose(v), where v is the proposal value. When consensus is reached, Paxos calls the Etna decide procedure with the decided value. At this point, Etna notifies the new configuration of the decision, and the algorithm can proceed. Paxos guarantees the following:

Theorem 1 (derived from [12, 13]). For each instance of the Paxos protocol, if one or more decide events occur, then all the decided values are the same. If a propose(v) occurs at some node n at time t, no later propose occurs at any other node, and node n does not fail, then a decide event occurs at n within a bounded number of message delays.
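Etna treats each Paxos instance as a black box with a propose call and a decide callback. The sketch below is a hypothetical interface, not the paper's implementation: the toy propose rule simply accepts the first proposal, standing in for the real agreement protocol that provides the same guarantee (Theorem 1) under contention and message loss.

from typing import Any, Callable

class PaxosInstance:
    """Stand-in for one consensus instance, one per configuration transition."""
    def __init__(self, seqnum: int, on_decide: Callable[[Any], None]):
        self.seqnum = seqnum        # which configuration this instance decides
        self.on_decide = on_decide  # Etna's decide procedure
        self.chosen = None

    def propose(self, value):
        # Toy rule: the first proposal is chosen. Real Paxos reaches the same
        # outcome (a single chosen value reported to the proposers) even with
        # concurrent proposers, at the cost of its agreement phases.
        if self.chosen is None:
            self.chosen = value
        self.on_decide(self.chosen)

def decide(value):
    new_config, block_copy = value
    print("install configuration", new_config, "carrying tag", block_copy["tag"])

# One instance per configuration, as Etna uses it:
paxos = PaxosInstance(seqnum=3, on_decide=decide)
paxos.propose((["n1", "n2", "n3"], {"tag": (7, "n1"), "value": b"..."}))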
4 The Etna Algorithm

In this section we present Etna, an algorithm that provides fault-tolerant, atomic mutable data in DHash. For each DHash block, Etna uses Paxos to maintain a consistent series of replica configurations in the face of dynamic membership. Given a configuration of replicas, Etna designates a node as the primary and serializes all reads and writes through it. Etna assumes that nodes behave according to the protocol and are fail-stop.

4.1 Block State

Etna provides atomicity for each mutable block, which extends to the entire DHT: by the composability of atomic objects [15], all mutable blocks in the DHT form an atomic memory. Therefore, we describe our protocol in terms of a single block.

To provide fault-tolerance, Etna replicates a mutable block at k different DHash nodes, where k is the system-wide replication factor. We call this replica set a configuration of nodes responsible for a block. Etna initiates reconfigurations in order to arrange that a block's replicas are the nodes that immediately succeed the block's key in the Chord ID space, and that the block's immediate successor node is the primary in the configuration.

For each block, a replica keeps a tag, which is a pair (version, primaryid); a status flag, which designates the phase of operation of the block; a new-tag, which is used during write operations; and a config variable, which contains the configuration sequence number and the IDs of the nodes in the block's configuration. Etna increments the sequence number for each new configuration. Figure 1 summarizes the fields in a mutable block. We refer to a DHash mutable block as simply a block throughout the rest of the paper.

  Per-node state:
    status     flag with values idle, active, or recon_inprog; initially idle
    tag        a pair (version, primaryid); initially (0, 0)
    value      latest block value; initially nil
    new-tag    a pair (version, primaryid) holding the largest tag of any ongoing write; initially (0, 0)
    config     current configuration, with sub-fields seqnum (initially 0) and nodes = (node_1, node_2, ...) (initially empty)
  Per-operation state:
    responses  set of responses for the current operation; initially empty

  Figure 1: Fields in an Etna block.

4.2 Inserting a New Block

To insert a new block, the writer passes the block's data to the DHash client on the local machine. Once in DHash, Etna extends the block with the fields in Figure 1. Because the block is new, Etna can directly insert it at the initial replicas, set to be the k immediate successors of the block's ID.

4.3 Read Protocol

To read a block with key bid, the reader sends a read RPC to the node p that is the immediate successor of bid. p checks whether it believes it is the current primary of bid and whether bid is not going through a reconfiguration (status = active). If both conditions are true, it sends a GET RPC to each node in config. If not, the system may be going through membership reconfiguration, so p returns an error; the reader will retry periodically. When a node r receives a GET RPC for bid, it looks in config to see whether the sender is the current primary of bid and whether status = active. If both conditions are true, r returns a positive ack; if not, it returns an error. If p collects more than k/2 positive acks, it returns its own stored copy of the block value to the reader. If p fails to assemble a majority after a certain time, it returns an error. Figure 3 shows the pseudocode for the read protocol.

  Network functions:
    send(...)         sends a message from node i to node j
    rcv(...)          receives a message at node i from node j
  Chord functions:
    succ()            returns the immediate successor of the block's identifier on the Chord ring
    k_successors()    returns the k immediate successors of the block's identifier on the Chord ring
  Etna function:
    primary(c)        returns the primary for a given configuration
  Paxos function:
    propose(c, ...)   proposes a new configuration

  Figure 2: Auxiliary functions, provided by Chord, Etna, Paxos, and the network.

  Read protocol for the primary:
    Procedure recv(read):
      if i = primary(config) and status = active then
        responses := {}
        for each j in config.nodes: send(get, c, config.seqnum) to j
    Procedure recv(get-ack, c, seqnum) from j:
      if seqnum = config.seqnum then
        responses := responses + {j}
        if |responses| > k/2 and status = active then
          send(tag, value) to the reader
  Read protocol for the replicas (including the primary):
    Procedure recv(get, c, seqnum) from j:
      if status = active and seqnum = config.seqnum then
        send(get-ack, c) to j

  Figure 3: Pseudo-code for the read protocol.
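To make the control flow concrete, here is a minimal sketch of the read path in Python. It is illustrative only, not the DHash/Etna implementation: the Node and Config records, the primary helper (which assumes the first listed replica is the primary, matching Etna's successor rule), and the send_get transport stub are all assumptions.

from dataclasses import dataclass
from typing import Any, List

@dataclass
class Config:
    seqnum: int
    nodes: List[str]           # replica IDs; nodes[0] is taken to be the primary

@dataclass
class Node:
    node_id: str
    status: str                # "idle", "active", or "recon_inprog"
    tag: tuple                 # (version, primaryid), compared lexicographically
    value: Any
    new_tag: tuple
    config: Config

class OperationError(Exception):
    pass

def primary(config: Config) -> str:
    return config.nodes[0]

def primary_read(node: Node, send_get):
    """Primary-side read (cf. Figure 3): gather a majority of acks for the
    current configuration, then return the primary's own tag and value."""
    if node.node_id != primary(node.config) or node.status != "active":
        raise OperationError("not the active primary; the reader retries later")
    acks = sum(1 for r in node.config.nodes
               if send_get(r, node.config.seqnum, sender=node.node_id))
    if acks > len(node.config.nodes) // 2:
        return node.tag, node.value        # the primary itself supplies the data
    raise OperationError("no quorum in configuration %d" % node.config.seqnum)

def handle_get(node: Node, seqnum: int, sender: str) -> bool:
    """Replica-side get handler: positive ack only if this replica is active
    in the same configuration and the sender is its current primary."""
    return (node.status == "active"
            and sender == primary(node.config)
            and seqnum == node.config.seqnum)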

4.4 Write Protocol

To write block bid, the writer sends a write RPC to the successor of bid, node p. p consults its local state to verify that it believes it is the current primary of bid and that status is active. If both conditions are true, p starts the write protocol:

1. Node p assigns a new version number to this write, giving it tag (new-tag.version + 1, p).

2. Node p sends a put RPC to each replica node in config.nodes, including the write's tag and value.

3. When a node r receives a put RPC for bid, it ignores the RPC if status is not active or if the sender isn't the primary. If the write's tag is higher than the stored tag, r updates its stored block. This ensures that a replica is not confused if concurrent writes arrive from the primary out of order. Node r then returns an ack to the sender.

4. When node p receives positive acks from a majority of the replicas, it updates its own copy of the block, though only if it has completed no subsequent concurrent write. Node p then replies to the client. If p fails to assemble a majority after a certain time, it returns an error.

Figure 4 illustrates the write protocol. The primary assigns increasing version numbers to writes in the order that they arrive at the primary, but issues the writes to the replicas concurrently.

  Write protocol for the primary:
    Procedure recv(write, c, new-val):
      if i = primary(config) and status = active then
        new-tag := (new-tag.version + 1, i)
        op-block := (new-tag, new-val)
        responses := {}
        for each j in config.nodes: send(put, c, config.seqnum, op-block) to j
    Procedure recv(put-ack, c, op-block, seqnum) from j:
      if seqnum = config.seqnum then
        responses := responses + {j}
        if |responses| > k/2 and status = active then
          if op-block.tag > tag then
            tag := op-block.tag; value := op-block.value
          reply to the writer
  Write protocol for the replicas (including the primary):
    Procedure recv(put, c, seqnum, op-block) from j:
      if status = active and j = primary(config) and seqnum = config.seqnum then
        if op-block.tag > tag then
          tag := op-block.tag; value := op-block.value
        send(put-ack, c, op-block, config.seqnum) to j

  Figure 4: Pseudo-code for the write protocol.
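Continuing the sketch above (and reusing its Node, Config, primary, and OperationError definitions), the write path can be outlined the same way; send_put again stands in for the RPC transport, and tags are compared lexicographically as (version, primaryid) tuples.

def primary_write(node, new_value, send_put):
    """Primary-side write (cf. Figure 4): assign the next tag, push the block to
    every replica, and install it locally once a majority has acknowledged."""
    if node.node_id != primary(node.config) or node.status != "active":
        raise OperationError("not the active primary; the writer retries later")
    node.new_tag = (node.new_tag[0] + 1, node.node_id)  # next version, node ID breaks ties
    op_tag, op_value = node.new_tag, new_value
    acks = sum(1 for r in node.config.nodes
               if send_put(r, node.config.seqnum, op_tag, op_value,
                           sender=node.node_id))
    if acks <= len(node.config.nodes) // 2:
        raise OperationError("no quorum; the writer retries after reconfiguration")
    if op_tag > node.tag:              # skip if a later concurrent write already installed
        node.tag, node.value = op_tag, op_value
    return op_tag                      # reply to the client

def handle_put(node, seqnum, op_tag, op_value, sender) -> bool:
    """Replica-side put handler: apply only updates carrying a newer tag from the
    current primary in the current configuration."""
    if (node.status != "active" or sender != primary(node.config)
            or seqnum != node.config.seqnum):
        return False
    if op_tag > node.tag:
        node.tag, node.value = op_tag, op_value
    return True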

4.5 Reconfiguration Protocol

Etna must change the configuration of nodes responsible for a block when a replica leaves (to maintain the replication factor k) and when a new node joins that would be the block's successor (so that the primary is the Chord successor and is thus easy to find). Etna maintains only one configuration at a time, rather than a set of configurations as in Rambo [14]; this allows Etna to be simpler and to have higher performance in all but the highest-churn environments. Etna uses the Paxos [12] distributed consensus protocol to decide on the next configuration. If an Etna node n notices that the set of Chord successors for a block bid does not match the set of replicas in the block's config, it tries to initiate a reconfiguration:

Case I: If n notices that it is the immediate successor of bid, it collects the information that will serve as a configuration proposal for a Paxos execution. The proposal has the form (new-config, block-copy). n sets new-config to be the k immediate successors of bid. n sends a recon-get RPC to each node in config asking for its current block value and tag, waits for a majority of replicas to respond, and uses the most up-to-date response as the block-copy in the proposal. When a replica receives a recon-get RPC, it sets status to recon_inprog and stops processing all reads and writes. When n has assembled a majority of recon-get responses, it uses Paxos to propose its proposal value to the nodes in the old configuration. Paxos calls the decide function, with the consensus configuration information, at each node that proposed a new configuration; the proposer(s) send the new configuration and the most up-to-date block to the replicas in the new configuration.

Case II: If n is not the immediate successor of bid, n sends the tag, value, and config information to the immediate successor of bid, asking the successor to initiate a reconfiguration.

Figure 5 shows the pseudo-code for the reconfiguration protocol. During reconfiguration, nodes in the current configuration become inactive (they stop serving read and write requests). If reconfiguration fails, there will be no active configuration. Section 6 discusses liveness.

  Reconfiguration protocol for the primary:
    Procedure recon():
      if config.nodes != k_successors() and i = succ() then
        new-config := k_successors()
        status := recon_inprog
        proposed := false
        responses := {}
        for each j in config.nodes: send(recon-get) to j
    Procedure recv(recon-ack, new-tag, new-value, seqnum) from j:
      if seqnum = config.seqnum then
        if new-tag > tag then
          tag := new-tag; value := new-value
        responses := responses + {j}
        if |responses| > k/2 and proposed = false then
          proposed := true
          block-copy := (tag, value)
          Paxos.propose(new-config, block-copy)
    Procedure decide(new-config, block-copy):
      for each j in new-config.nodes: send(update, new-config, block-copy) to j
  Reconfiguration protocol for the replicas:
    Procedure recv(recon-get, seqnum) from j:
      if seqnum = config.seqnum then
        status := recon_inprog
        send(recon-ack, tag, value, config.seqnum) to j
    Procedure recv(update, new-config, block-copy):
      if i in new-config.nodes and new-config.seqnum > config.seqnum then
        status := active
        config := new-config
        if block-copy.tag > tag then
          tag := block-copy.tag; value := block-copy.value

  Figure 5: Pseudo-code for reconfiguration.
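The reconfiguration path can be sketched in the same style (again illustrative, reusing the Node and Config records from the earlier sketches; k_successors, send_recon_get, paxos_propose, and the reply records are stand-ins for Chord, the RPC layer, and the Paxos instance).

def initiate_recon(node, k_successors, send_recon_get, paxos_propose):
    """Primary-side reconfiguration (cf. Figure 5, Case I): freeze the old
    replicas, gather the freshest copy from a majority, and hand the proposal
    (new configuration plus block copy) to Paxos."""
    if node.config.nodes == k_successors:
        return                                   # replica set still matches Chord
    node.status = "recon_inprog"                 # stop serving reads and writes
    best_tag, best_value, acks = node.tag, node.value, 0
    for r in node.config.nodes:
        reply = send_recon_get(r, node.config.seqnum)   # None models no response
        if reply is None:
            continue
        acks += 1
        if reply.tag > best_tag:
            best_tag, best_value = reply.tag, reply.value
    if acks > len(node.config.nodes) // 2:
        # decide() will later push (new_config, block_copy) to the new replicas
        paxos_propose(node.config.seqnum, (k_successors, (best_tag, best_value)))

def handle_update(node, new_config, block_copy):
    """New-replica side: install the decided configuration and freshest block."""
    if node.node_id in new_config.nodes and new_config.seqnum > node.config.seqnum:
        node.config, node.status = new_config, "active"
        if block_copy[0] > node.tag:
            node.tag, node.value = block_copy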
5 Atomicity

In this section, we show that Etna correctly implements an atomic read/write object. Throughout this section, we consider only a single block, b. Since atomic memory is composable, this is sufficient to show that Etna guarantees atomic consistency. We omit references to b for the rest of this section.

We use the partial-ordering technique described in Lynch [15]. We rely on the following lemma:

Lemma 1 (adapted from [15]). Let β be any well-formed, finite execution of an algorithm A (implementing a read/write atomic object) in which every operation completes. Let Π be the set of all operations in β. Suppose that ≺ is an irreflexive partial ordering of all the operations in Π, satisfying the following properties:

1. For any operation π in Π, there are only finitely many operations φ such that φ ≺ π.

2. If π finishes before φ starts, then it cannot be the case that φ ≺ π.

3. If π is a write operation in Π and φ is any operation in Π, then either π ≺ φ or φ ≺ π.

4. The value returned by each read operation is the value written by the last preceding write operation according to ≺ (or the initial value, if there is no such write).

Then every well-formed execution of algorithm A satisfies the atomicity property.

We consider an arbitrary well-formed execution β of the Etna algorithm in which every read and write operation completes. (Lynch [15] shows that it is sufficient to consider only such executions.) We first define a partial order on the read and write operations in β, and then show that this partial order has the sufficient properties above. Finally, we conclude that Etna guarantees atomic consistency.

Partial Order. We first order the read and write operations in β based on their tags. For a read or write operation π initiated at node p, we define tag(π) to be tag_p immediately before the operation returns to the reader or writer; that is, tag(π) is the value of the block's tag when p sends the result back to the client (Figures 3 and 4). For a reconfiguration operation γ, we define tag(γ) to be block-copy.tag immediately before the call to Paxos.propose (Figure 5). We then define the partial order ≺ as follows: (i) for any two operations π and φ in Π, if tag(π) < tag(φ), then π ≺ φ; (ii) for any write operation π in Π and any read operation φ in Π, if tag(π) = tag(φ), then π ≺ φ.

We show in Theorem 2 that it is straightforward to see that this partial order satisfies Properties 1, 3, and 4 of Lemma 1. The primary goal of the rest of this section is to show that this ordering satisfies Property 2.

Atomicity Proof. Our first goal is to show that when a new configuration is installed, no information is lost. Let configuration c_0 be the initial configuration, and let configuration c_{l+1} be the unique configuration decided on by the l-th Paxos instance. (Theorem 1 ensures that this is, in fact, unique.) If that Paxos instance does not terminate, then c_{l+1} is undefined. Recall that when Paxos is initiated, the process performing the reconfiguration includes block-copy, the latest copy of the block, in the proposal. That is, the Paxos instance is initiated with at least one call to propose(new-config, block-copy). Define tag(c_0) to be (0, 0) and tag(c_{l+1}) to be the block-copy.tag of the successful proposal to the l-th Paxos instance. We sometimes refer to tag(c_{l+1}) as the initial tag of configuration c_{l+1}. We want to show that the initial tags of the configurations are nondecreasing:

Lemma 2. Let c_l and c_{l+1} be configurations installed in β. Then tag(c_l) ≤ tag(c_{l+1}).

Proof. No replica in configuration c_l can send any response until it has received an update message (Figure 5), which causes the replica to set its status to active and its tag to at least tag(c_l). Therefore every response to the recon-get messages (Figure 5) during the Paxos execution that proposes configuration c_{l+1} must include a tag no smaller than tag(c_l). Therefore tag(c_l) ≤ block-copy.tag in the successful proposal for c_{l+1}, from which the result follows.

It then follows immediately by induction:

Corollary 1. If c_l and c_m are two configurations in β, and l ≤ m, then tag(c_l) ≤ tag(c_m).
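For reference, the ordering just defined can be written in display form; the lexicographic tag comparison shown here is an assumption spelled out from the tie-breaking rule used in the proof of Theorem 2.

\[
  \langle v, p \rangle < \langle v', p' \rangle \;\iff\; v < v' \ \lor\ \bigl(v = v' \land p < p'\bigr)
\]
\[
  \pi \prec \phi \quad\text{iff}\quad \mathit{tag}(\pi) < \mathit{tag}(\phi)
  \;\;\text{or}\;\;
  \bigl(\mathit{tag}(\pi) = \mathit{tag}(\phi),\ \pi \text{ is a write},\ \phi \text{ is a read}\bigr)
\]
\[
  l \le m \;\Longrightarrow\; \mathit{tag}(c_l) \le \mathit{tag}(c_m) \qquad \text{(Corollary 1)}
\]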
We next consider a read or write operation that occurs in configuration c_l (for some l ≥ 0). We want to show that the earliest value such an operation can return is the one associated with the initial tag of c_l. For a read or write operation π, let conf(π) be config.seqnum at the primary when it sends the result back to the client (Figures 3 and 4).

Lemma 3. Let π be a read or write operation in β, and assume that conf(π) = l. Then tag(c_l) ≤ tag(π), and if π is a write operation then the inequality is strict.

Proof. The reconfiguration to install configuration c_l concludes when a primary p wins a Paxos decision (Figure 5) and sends update messages to the new replicas. Notice that the decision includes the block-copy determined when the reconfiguration began. Therefore, by the time the configuration is installed, tag(c_l) ≤ tag_p. As a result, every operation initiated at node p has a tag greater than or equal to tag(c_l), and for write operations the inequality is strict.

Next, we relate the tag of a read or write operation to the tag of the next configuration. We want to show that if a read or write operation completes, its information is transferred to the next configuration.

Lemma 4. Let π be a read or write operation in β, and assume that conf(π) = l. If configuration c_{l+1} is installed in β, then tag(π) ≤ tag(c_{l+1}).

Proof. First, notice that the value at the primary always reflects a put operation that has updated a majority of the replicas. Therefore, if π is a read operation, there is a write operation ω that wrote the tag and value returned by π to a majority of the replicas. If π is a write operation, define ω = π. Since operation ω completes in configuration c_l, there exists a set of more than k/2 nodes in configuration c_l that send a response for ω to the primary. Call this set of nodes W (for writers). Since configuration c_{l+1} is installed in β, there exists a set of more than k/2 nodes in configuration c_l that send a response to the recon-get message during the reconfiguration. Call this set of nodes R (for readers). Since there are k nodes in configuration c_l, and both W and R contain more than k/2 nodes, there is at least one node r in both W and R. Node r sends a response both for operation ω and for the successful reconfiguration resulting in c_{l+1}. We claim that node r sends the response for operation ω first. As soon as node r sends a response to a recon-get, it sets its status to recon_inprog, at which point it ceases responding to put requests. Since we know that r sends a response for operation ω, it must send this response prior to the first time it receives a recon-get request. We conclude, then, that the primary sends its put request for ω to replica r prior to r sending its response to the recon-get request. Therefore, tag(π) = tag(ω) ≤ tag(c_{l+1}).

In the final preliminary lemma, we show that the configurations used by operations are non-decreasing. That is, if an operation uses one configuration, then a later operation cannot use a smaller configuration.

Lemma 5 (sketch). Let π and φ be two read or write operations in β. Assume that operation π completes before operation φ begins. Then conf(π) ≤ conf(φ).

Proof. If π completes in configuration c_l, then some reconfiguration installing c_l completes prior to φ. During that reconfiguration, a majority of the replicas in configuration c_{l-1} were sent a recon-get message notifying them to cease processing read and write requests. Inductively, it is clear that a majority of the replicas from all earlier configurations received such messages. Therefore operation φ cannot complete using an earlier configuration.

Relating Read and Write Operations. Finally, we show that if operation π completes before operation φ begins, then the tag of π is less than or equal to the tag of φ.

Lemma 6. Let π and φ be two read or write operations in β where π completes before φ begins. Then tag(π) ≤ tag(φ), and if φ is a write operation then the inequality is strict.

Proof. We break the proof down into two cases: (i) π and φ complete in the same configuration, and (ii) π completes in a smaller configuration than φ. Lemma 5 shows that π cannot complete in a larger configuration than φ.

First, assume that conf(π) = conf(φ) = l. Let node p be the primary of configuration c_l. Both operations originate at node p. Therefore, when operation φ begins, the tag of p is at least as large as tag(π). If φ is a write operation, then p increments the tag, and the inequality is strict.

Next, consider the case where conf(π) < conf(φ). Notice that tag(π) ≤ tag(c_{conf(π)+1}), by Lemma 4. Next, notice that tag(c_{conf(π)+1}) ≤ tag(c_{conf(φ)}), by Corollary 1. Third, notice that tag(c_{conf(φ)}) ≤ tag(φ), and if φ is a write operation the inequality is strict, by Lemma 3. Combining the inequalities implies the desired result.

Finally, we prove the main theorem:

Theorem 2. The Etna algorithm correctly implements an atomic read/write object.

Proof. We show that the protocol satisfies the four conditions of Lemma 1. For an arbitrary execution in which every read and write operation completes, we demonstrate that the partial ordering ≺ satisfies Properties 1-4 of Lemma 1:

1. Immediate.

2. It follows from Lemma 6 that if π completes before φ begins, then tag(π) ≤ tag(φ), and the inequality is strict if φ is a write operation. In either case it cannot be that φ ≺ π.

3. If π and φ are both write operations, then tag(π) ≠ tag(φ), since the tags are unique (they consist of a version number and a node identifier to break ties), and so either π ≺ φ or φ ≺ π. If π is a write operation and φ is a read operation and tag(π) = tag(φ), then π ≺ φ. Otherwise, if tag(π) ≠ tag(φ), then either π ≺ φ or φ ≺ π, depending on which has the larger tag.

4. This follows by the definition of the partial order. If φ is a read operation, then tag(φ) is either the tag of the write operation π whose value φ returns, in which case π is the last preceding write operation according to ≺, or the initial tag, in which case there is no preceding write and φ returns the initial value.

6 Theoretical Performance

As in all quorum-based algorithms, the performance of Etna depends on enough replicas remaining alive. We assume that if a node crashes, the remaining live nodes in the relevant configurations notice the crash and reconfigure quickly enough to maintain a live majority. If a majority of the nodes in a configuration fail, then operations can no longer complete. A recovery protocol could attempt to collect the values from the remaining replicas, at the expense of atomicity; we do not address recovery from failed configurations in this paper.

During intervals in which the primary does not fail, the algorithm is efficient. A write operation requires a single round of communication to propagate the new value to a quorum. A read operation also requires only a single round of communication, involving only small control messages, since the primary supplies the data. For the purpose of this section, we assume that each message is delivered within time d:

Lemma 7. If a read or write operation begins at time t, and the primary does not fail by time t + 2d, then the operation completes by time t + 2d.

A reconfiguration is somewhat more expensive, requiring three and a half rounds of communication. As soon as a new primary is established, it queries the old replicas for the latest value of the block. It then begins Paxos, which requires two rounds of communication to arrive at a decision. Finally, it updates the new configuration. In our implementation, we piggy-back the recon-get RPC with the first RPC of Paxos, reducing the communication to two and a half rounds. The following lemma reflects this optimization.

Lemma 8. If node p is designated the primary at time t, and p does not fail by time t + 5d, then the new configuration is installed by time t + 5d.

Recall that when a reconfiguration takes place, ongoing read and write operations may fail to complete. We assume that, in this case, the client retries the operation at the new primary. Combining the two previous lemmas, we see that even if a primary fails, a read or write operation completes within 7d after a new primary is designated.
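The bounds in this section combine as follows (a restatement of Lemmas 7 and 8 under the per-message delay bound d and the round counts stated above):

\[
  T_{\mathrm{read}} = T_{\mathrm{write}} \le 2d \qquad \text{(one round: primary $\to$ replicas $\to$ primary)}
\]
\[
  T_{\mathrm{recon}} \le 5d \qquad \text{(two and a half rounds, with recon-get piggy-backed on Paxos)}
\]
\[
  T_{\mathrm{retry}} \le T_{\mathrm{recon}} + T_{\mathrm{op}} \le 5d + 2d = 7d \qquad \text{after a new primary is designated}
\]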
7 Experimental Evaluation

This section reports the measured latency of Etna operations in a working implementation, focusing on the costs of increasing degrees of replication. The test environment is the Emulab [1] test-bed, which connects a large set of standard PCs with a network capable of emulating delays and limited link capacities. Figure 6 shows the emulated network: 30 hosts in a star topology with a router in the center. Each star link has a capacity of 500 kilobits/second and a one-way delay of 10 milliseconds. This network configuration is an abstract version of the Internet, in which each host has a limited-speed access link connected to a high-speed core. Each host is an 800 MHz Pentium III with 256 MB of RAM running FreeBSD.

  Figure 6: Experiment topology: 30 physical machines connected via a LAN; each link has a capacity of 500 kilobits/second and a 10 millisecond one-way delay.

Each host runs one Etna/DHash/Chord process. The latencies reported below include only the costs of the Etna algorithms; they do not include the cost of the client performing the Chord lookup to find the primary, or the communication between client and primary. The size of the blocks being read and written is 8192 bytes.

Figure 7 shows the measured latency of an Etna read operation as a function of the number of replicas. The read latency is mostly determined by the cost of the read protocol's single round, which costs one 40-millisecond round-trip time. The latency grows slowly with the number of replicas due to the transmission time required to send and receive one short message per replica over the primary's link.

  Figure 7: Read latency as a function of the number of replicas.

Figure 8 shows the measured write latency as a function of the number of replicas. The write latency is determined by the transmission time required to send the new 8192-byte block to each replica; each copy takes about 132 milliseconds to send over the primary's link. Thus five replicas (where one is the primary) result in a write time of roughly 4 x 132, or about 530 milliseconds, and the latency rises linearly with more replicas.

  Figure 8: Write latency as a function of the number of replicas.

To evaluate reconfiguration latency, we kill one-third of the replicas in a configuration and measure the time from the start of reconfiguration until the next configuration is installed. Figure 9 shows the results; the latencies increase with the number of replicas for the same reasons that the write latencies increase. The reason for the higher-than-expected reconfiguration times appears to be packet loss and TCP back-off caused by the replicas simultaneously sending the block value to the primary; this is mostly an artifact of the network emulator and will be fixed in a future version of this paper.

  Figure 9: Reconfiguration time as a function of the number of replicas.

How much would Etna add to the overall latency of a DHT? Etna's read latency must be added to the DHT's lookup latency, since the Etna primary cannot start the read protocol until the client finds it and sends it a request. DHT lookups take somewhat less than an average network round-trip delay [4], so Etna would somewhat more than double overall read latency. A replicated DHT must send writes to all replicas, just like Etna, so Etna is likely to leave overall write latency unchanged.
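As a rough sanity check of the latency model described above (illustrative only: the formulas use the experiment's link speed, delay, and block size, and ignore protocol headers, TCP effects, and queueing), the predicted read and write latencies can be computed directly.

LINK_BPS = 500_000        # 500 kilobit/second access links
ONE_WAY_DELAY = 0.010     # 10 ms per link; primary <-> replica crosses two links
BLOCK_BYTES = 8192        # block size used in the experiments

def read_latency_s():
    """One round of small control messages: dominated by one round-trip time."""
    return 4 * ONE_WAY_DELAY             # two links each way through the router

def write_latency_s(replicas):
    """The primary pushes the full block to every other replica over its own link."""
    per_copy = BLOCK_BYTES * 8 / LINK_BPS    # about 131 ms per copy
    return (replicas - 1) * per_copy + 4 * ONE_WAY_DELAY

print("read  ~ %.0f ms" % (read_latency_s() * 1000))     # about 40 ms
print("write ~ %.0f ms" % (write_latency_s(5) * 1000))   # about 564 ms with 5 replicas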

8 Conclusion

This paper describes Etna, an algorithm for atomic mutable blocks in a distributed hash table. Etna correctly handles a dynamically changing set of replica hosts, using protocols optimized for situations in which reads are more common than writes or replica set changes. Etna uses Paxos to agree on a sequence of replica configurations, and only allows operations when a majority of replicas from the same configuration are available. Etna's write latency is comparable to that of non-atomic replicated DHTs, and its read latency is approximately twice that of a non-atomic DHT.

References

[1] Emulab.
[2] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM, 42(1):124-142, 1995.
[3] Ken Birman and Thomas Joseph. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles, December 1987.
[4] F. Dabek, F. Kaashoek, J. Li, R. Morris, J. Robertson, and E. Sit. Designing a DHT for low latency and high throughput. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (NSDI), March 2004.
[5] F. Dabek, M. Frans Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In Proc. of the ACM Symposium on Operating Systems Principles, October 2001.
[6] Danny Dolev, Idit Keidar, and Esti Yeger Lotem. Dynamic voting for consistent primary components. In Proc. of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing. ACM Press, 1997.
[7] B. Englert and Alex A. Shvartsman. Graceful quorum reconfiguration in a robust emulation of shared memory. In Proceedings of the International Conference on Distributed Computing Systems, 2000.
[8] Alan Mislove et al. POST: A secure, resilient, cooperative messaging system. In 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, Hawaii, May 2003.
[9] Seth Gilbert, Nancy A. Lynch, and Alex A. Shvartsman. RAMBO II: Rapidly reconfigurable atomic memory for dynamic networks. In Proc. of the Intl. Conference on Dependable Systems and Networks, June 2003.
[10] Communications of the ACM, Special section on Group Communication Systems, volume 39(4), 1996.
[11] S. Jajodia and David Mutchler. Dynamic voting algorithms for maintaining the consistency of a replicated database. ACM Transactions on Database Systems, 15(2), 1990.
[12] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133-169, 1998.
[13] Leslie Lamport. Paxos made simple. ACM SIGACT News, December 2001.
[14] Nancy Lynch and Alex Shvartsman. RAMBO: A reconfigurable atomic memory service for dynamic networks. In Proceedings of the 16th International Symposium on Distributed Computing (DISC 2002), Toulouse, France, October 2002.
[15] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[16] Nancy A. Lynch and Alexander A. Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged broadcasts. In Twenty-Seventh Annual Intl. Symposium on Fault-Tolerant Computing, June 1997.
[17] Athicha Muthitacharoen, Robert Morris, Thomer Gil, and Benjie Chen. Ivy: A read/write peer-to-peer file system. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, Massachusetts, December 2002.
[18] Roberto De Prisco, Alan Fekete, Nancy A. Lynch, and Alexander A. Shvartsman. A dynamic primary configuration group communication service. In Proceedings of the 13th International Symposium on Distributed Computing, pages 64-78, September 1999.
[19] Rodrigo Rodrigues, Barbara Liskov, and Liuba Shrira. The design of a robust peer-to-peer system. In Proceedings of the Tenth ACM SIGOPS European Workshop, September 2002.
[20] E. Sit, F. Dabek, and J. Robertson. UsenetDHT: A low overhead Usenet server. In Proc. of the Third International Workshop on Peer-to-Peer Systems, February 2004.
[21] I. Stoica, R. Morris, D. Karger, M. Frans Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. ACM SIGCOMM, August 2001.
[22] I. Stoica, R. Morris, David Liben-Nowell, D. Karger, M. Frans Kaashoek, Frank Dabek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking, 2003.
[23] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In Proc. ACM SIGCOMM, August 2003.
[24] Eli Upfal and Avi Wigderson. How to share memory in a distributed system. Journal of the ACM, 34(1):116-127, 1987.
[25] M. Walfish, H. Balakrishnan, and S. Shenker. Untangling the web from DNS. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (NSDI), March 2004.


More information

CS6450: Distributed Systems Lecture 15. Ryan Stutsman

CS6450: Distributed Systems Lecture 15. Ryan Stutsman Strong Consistency CS6450: Distributed Systems Lecture 15 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed

More information

Integrity in Distributed Databases

Integrity in Distributed Databases Integrity in Distributed Databases Andreas Farella Free University of Bozen-Bolzano Table of Contents 1 Introduction................................................... 3 2 Different aspects of integrity.....................................

More information

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Pawel Jurczyk and Li Xiong Emory University, Atlanta GA 30322, USA {pjurczy,lxiong}@emory.edu Abstract. The continued advances

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Asynchronous Reconfiguration for Paxos State Machines

Asynchronous Reconfiguration for Paxos State Machines Asynchronous Reconfiguration for Paxos State Machines Leander Jehl and Hein Meling Department of Electrical Engineering and Computer Science University of Stavanger, Norway Abstract. This paper addresses

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications

Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 11, NO. 1, FEBRUARY 2003 17 Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger,

More information

DISTRIBUTED HASH TABLE PROTOCOL DETECTION IN WIRELESS SENSOR NETWORKS

DISTRIBUTED HASH TABLE PROTOCOL DETECTION IN WIRELESS SENSOR NETWORKS DISTRIBUTED HASH TABLE PROTOCOL DETECTION IN WIRELESS SENSOR NETWORKS Mr. M. Raghu (Asst.professor) Dr.Pauls Engineering College Ms. M. Ananthi (PG Scholar) Dr. Pauls Engineering College Abstract- Wireless

More information

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol Min Li 1, Enhong Chen 1, and Phillip C-y Sheu 2 1 Department of Computer Science and Technology, University of Science and Technology of China,

More information

Distributed Hash Table

Distributed Hash Table Distributed Hash Table P2P Routing and Searching Algorithms Ruixuan Li College of Computer Science, HUST rxli@public.wh.hb.cn http://idc.hust.edu.cn/~rxli/ In Courtesy of Xiaodong Zhang, Ohio State Univ

More information

Distributed Consensus: Making Impossible Possible

Distributed Consensus: Making Impossible Possible Distributed Consensus: Making Impossible Possible QCon London Tuesday 29/3/2016 Heidi Howard PhD Student @ University of Cambridge heidi.howard@cl.cam.ac.uk @heidiann360 What is Consensus? The process

More information

Availability Study of Dynamic Voting Algorithms

Availability Study of Dynamic Voting Algorithms Availability Study of Dynamic Voting Algorithms Kyle Ingols Idit Keidar kwi,idish @theory.lcs.mit.edu Massachusetts Institute of Technology Lab for Computer Science 545 Technology Square, Cambridge, MA

More information

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Page 1 Introduction We frequently want to get a set of nodes in a distributed system to agree Commitment protocols and mutual

More information

Topics in Reliable Distributed Systems

Topics in Reliable Distributed Systems Topics in Reliable Distributed Systems 049017 1 T R A N S A C T I O N S Y S T E M S What is A Database? Organized collection of data typically persistent organization models: relational, object-based,

More information

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage Horizontal or vertical scalability? Scaling Out Key-Value Storage COS 418: Distributed Systems Lecture 8 Kyle Jamieson Vertical Scaling Horizontal Scaling [Selected content adapted from M. Freedman, B.

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems 1 Motivation from the File Systems World The App needs to know the path /home/user/my pictures/ The Filesystem

More information

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 10. Consensus: Paxos Paul Krzyzanowski Rutgers University Fall 2017 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Chord: A Scalable Peer-to-peer Lookup Service For Internet Applications

Chord: A Scalable Peer-to-peer Lookup Service For Internet Applications Chord: A Scalable Peer-to-peer Lookup Service For Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan Presented by Jibin Yuan ION STOICA Professor of CS

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

: Scalable Lookup

: Scalable Lookup 6.824 2006: Scalable Lookup Prior focus has been on traditional distributed systems e.g. NFS, DSM/Hypervisor, Harp Machine room: well maintained, centrally located. Relatively stable population: can be

More information

Understanding Disconnection and Stabilization of Chord

Understanding Disconnection and Stabilization of Chord Understanding Disconnection and Stabilization of Chord Zhongmei Yao Joint work with Dmitri Loguinov Internet Research Lab Department of Computer Science Texas A&M University, College Station, TX 77843

More information

INF-5360 Presentation

INF-5360 Presentation INF-5360 Presentation Optimistic Replication Ali Ahmad April 29, 2013 Structure of presentation Pessimistic and optimistic replication Elements of Optimistic replication Eventual consistency Scheduling

More information

Practical Byzantine Fault

Practical Byzantine Fault Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault

More information

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers Replicated s, RAFT COS 8: Distributed Systems Lecture 8 Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable service Goal #: Servers should behave just like

More information

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and

More information

Byzantine Fault Tolerance

Byzantine Fault Tolerance Byzantine Fault Tolerance CS 240: Computing Systems and Concurrency Lecture 11 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. So far: Fail-stop failures

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables Peer-to-Peer Systems and Distributed Hash Tables COS 418: Distributed Systems Lecture 7 Today 1. Peer-to-Peer Systems Napster, Gnutella, BitTorrent, challenges 2. Distributed Hash Tables 3. The Chord Lookup

More information

Two New Protocols for Fault Tolerant Agreement

Two New Protocols for Fault Tolerant Agreement Two New Protocols for Fault Tolerant Agreement Poonam Saini 1 and Awadhesh Kumar Singh 2, 1,2 Department of Computer Engineering, National Institute of Technology, Kurukshetra, India nit.sainipoonam@gmail.com,

More information

EFFICIENT ROUTING OF LOAD BALANCING IN GRID COMPUTING

EFFICIENT ROUTING OF LOAD BALANCING IN GRID COMPUTING EFFICIENT ROUTING OF LOAD BALANCING IN GRID COMPUTING MOHAMMAD H. NADIMI-SHAHRAKI*, FARAMARZ SAFI, ELNAZ SHAFIGH FARD Department of Computer Engineering, Najafabad branch, Islamic Azad University, Najafabad,

More information

Scaling Out Key-Value Storage

Scaling Out Key-Value Storage Scaling Out Key-Value Storage COS 418: Distributed Systems Logan Stafman [Adapted from K. Jamieson, M. Freedman, B. Karp] Horizontal or vertical scalability? Vertical Scaling Horizontal Scaling 2 Horizontal

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright 2012 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4.

More information

Building a transactional distributed

Building a transactional distributed Building a transactional distributed data store with Erlang Alexander Reinefeld, Florian Schintke, Thorsten Schütt Zuse Institute Berlin, onscale solutions GmbH Transactional data store - What for? Web

More information

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast Coordination 2 Today l Group communication l Basic, reliable and l ordered multicast How can processes agree on an action or a value? Modes of communication Unicast 1ç è 1 Point to point Anycast 1è

More information

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS Herwig Unger Markus Wulff Department of Computer Science University of Rostock D-1851 Rostock, Germany {hunger,mwulff}@informatik.uni-rostock.de KEYWORDS P2P,

More information

Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem

Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem Bidyut Gupta, Nick Rahimi, Henry Hexmoor, and Koushik Maddali Department of Computer Science Southern Illinois

More information