Etna: a fault-tolerant algorithm for atomic mutable DHT data


Athicha Muthitacharoen, Seth Gilbert, Robert Morris
athicha@lcs.mit.edu, sethg@mit.edu, rtm@lcs.mit.edu
MIT Laboratory for Computer Science, 200 Technology Square, Cambridge, MA

Abstract

This paper presents Etna, an algorithm for atomic reads and writes of replicated data stored in a distributed hash table. Etna correctly handles dynamically changing sets of replica hosts, and is optimized for reads, writes, and reconfiguration, in that order. Etna maintains a series of replica configurations as replicas fail, adding new replicas as needed from the pool supplied by the distributed hash table system. It uses Paxos to ensure consensus on the identity of each new configuration. For simplicity and performance, Etna serializes all reads and writes through a primary during the lifetime of each configuration. Etna performs read and write operations in a single round from the primary. Experiments in an environment with high network delays show that Etna's read latency is determined by round-trip delay in the underlying network, while write and reconfiguration latency is determined by the transmission time required to send data to each replica. Etna's write latency is about the same as that of a non-atomic replicating DHT, and Etna's read latency is about twice that of a non-atomic DHT because Etna assembles a quorum for every read.

Number of pages: 10. Keywords: replication, consistency, atomicity. Regular submission; the first two authors are students. Please consider for brief announcement as well.

1 Introduction

An increasing number of research systems are being built on top of distributed hash tables (DHTs). These include systems for messaging [8], sharing read/write files [17], publishing read-only content [5], resolving names [25], maintaining bulletin boards [20], and searching large text databases [23]. All of the cited examples depend on a DHT to store and replicate their data, and all of them either assume mutable DHT storage or could be simplified if mutable storage were supported.

Most existing DHTs support immutable or write-once data well. In contrast, DHTs that support fault-tolerant replicated mutable data typically provide no consistency guarantees. For this reason, DHT-based applications must implement application-specific techniques to ensure consistency. A typical approach is for the application to time-stamp updates for eventual consistency, and to include complex techniques to handle or avoid the consequences of conflicting updates. However, many DHT-based applications could be simplified if they could rely on atomic updates of mutable data.

Existing quorum-based techniques for consistent replication (for example, [24, 2]) cannot easily be applied to DHTs for a number of reasons. Most critically, these techniques are not well-suited to dynamic environments. In a peer-to-peer network, hosts may leave the system permanently. This means that a replication scheme must designate new replica hosts (or suffer a failure of the quorum system), and over time many hosts may have held replicas of a given piece of data. As a result there is a danger that stale replicas may accumulate one-by-one in a network partition, and eventually form a quorum. Existing replication techniques avoid this problem by assuming a fixed set of replica hosts.

This paper presents Etna, a new algorithm for atomic read/write replicated DHT data blocks.
Etna guarantees atomicity regardless of network behavior; for example, it will not return stale data during network partitions. Etna uses a succession of configurations to ensure that only the single most up-to-date quorum of replicas can execute operations, much like Rambo [14]. Etna handles configuration changes using Paxos. Etna is designed for low message complexity in the common case in which reads and writes are more frequent than reconfiguration: both reads and writes involve a single round of communication.

This work contributes the first full implementation of replicated atomic update in a complete DHT (source available). Etna is more efficient than previous solutions for reads and writes, though it prohibits reconfiguration concurrently with reads and writes.

The rest of this paper includes an overview of related work in Section 2, a summary of existing ideas used by Etna in Section 3, a description of the Etna algorithm in Section 4, a proof of atomicity in Section 5, a performance analysis in Section 6, and an evaluation of an implementation in Section 7.

2 Related Work

A few DHT proposals address the issue of atomic data consistency in the face of dynamic membership. Rodrigues et al. [19] use a small configuration service to maintain and distribute a list of all the non-failed nodes. Since every participant is aware of the complete list of active nodes, it is easy to transfer and replicate data while ensuring consistency. However, maintaining global knowledge may limit the approach to small or relatively static systems. Etna uses the Chord DHT [21] to manage the dynamic environment, and augments the basic service to guarantee robust, mutable data.

There have been many quorum-based atomic read/write algorithms developed for static sets of replica hosts (for example, [24, 2]). These algorithms assume that the participants are known in advance, and that the number of failures is bounded by a constant.

Group communication services [10] and other virtually synchronous services [3] support the construction of robust and dynamic systems. These algorithms provide stronger guarantees than are required for mutable data storage, implementing totally-ordered broadcast, which effectively requires consensus to be performed for every operation. As a result, the GCS algorithms work best in a low-latency LAN environment. Also, in most GCS systems, whenever a node joins or leaves, a new view (i.e., configuration) is created, leading to potentially slow reconfiguration. Etna uses some of the reconfiguration techniques developed in the GCS protocols. However, the read and write operations in Etna require less communication than the multi-phase protocol required to perform totally-ordered broadcast. Also, the rate of reconfiguration can be significantly reduced: a new configuration need only be created when a number of replicas have failed (and no reconfiguration is necessary as a result of join operations).

Prior approaches for reconfigurable read/write memory [11, 6, 18] require that new quorums include processors from the old quorums, restricting the choice of a new configuration. Some earlier algorithms [16, 7] rely on a single process to initiate all reconfigurations. The RAMBO algorithms [14, 9], on the other hand, allow completely flexible reconfiguration, and Etna takes a similar approach. However, RAMBO focuses on allowing reads and writes to proceed concurrently with reconfiguration; Etna, instead, optimizes read and write performance assuming that reconfigurations are rare. As a result, read and write operations in Etna are much more efficient than those in RAMBO.

3 Background

This section describes two existing techniques on which Etna relies.
3.1 DHash and Chord

Etna is an extension of DHash [5], a distributed hash table. DHash uses Chord [22] as a look-up service. Every Chord node has a unique ID. Chord arranges that each node knows its current successor, the node with the next highest ID; the ID space wraps around at zero. Thus the nodes form a circular list linked through the successors. DHash stores blocks, or key/value pairs. DHash stores each block on the node that is the block's successor (the node whose ID most immediately follows the block's key).
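The successor rule above determines both which node stores a block and, in Etna, which node acts as the block's primary. The following sketch is illustrative only (a toy ring, not DHash's actual code; the Ring class, identifier size, and hashing choice are assumptions); it shows how a key maps to its successor and to the k nodes that would hold its replicas.

import hashlib
from bisect import bisect_right

class Ring:
    """Toy Chord-style ring: node IDs live in a circular identifier space."""
    def __init__(self, node_ids, bits=32):
        self.space = 2 ** bits
        self.nodes = sorted(nid % self.space for nid in node_ids)

    def successor(self, key):
        """The node whose ID most immediately follows the key, wrapping at zero."""
        i = bisect_right(self.nodes, key % self.space)
        return self.nodes[i % len(self.nodes)]

    def k_successors(self, key, k):
        """The k nodes that immediately succeed the key; DHash replicates there."""
        i = bisect_right(self.nodes, key % self.space)
        return [self.nodes[(i + j) % len(self.nodes)] for j in range(k)]

def block_key(data, bits=32):
    """Hash a block's contents into the ring's identifier space."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big") % (2 ** bits)

ring = Ring([5, 120, 310, 777, 901])
key = block_key(b"example block")
print(ring.successor(key))        # the node that stores the block (Etna's primary)
print(ring.k_successors(key, 3))  # the block's replica set for k = 3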

3.2 Paxos

Etna uses the Paxos [12, 13] distributed consensus protocol to determine a total ordering of the replica configurations. We execute a single instance of the Paxos protocol for each configuration; Paxos then outputs the next configuration. All Paxos procedures are subscripted by the number of the configuration that is associated with that instance of Paxos.

Paxos is a multi-phase protocol that implements consensus. As originally described, Paxos consists of two components: an eventual leader election service, and an agreement protocol. The eventual leader election service ensures that eventually only one node thinks that it is the leader (if the communication network is eventually well-behaved). The agreement protocol allows a node that thinks it is the leader to propose a value. It ensures that at most one value is chosen, and that, if eventually there is only one leader, then some value is chosen. In the first phase of the protocol, the supposed leader queries for previously proposed decision values; the second phase chooses a decision value.

Etna only makes use of the agreement protocol. It relies on Chord to choose the leader: Etna periodically queries Chord for the successor node of a given block. This node then plays the role of the leader in the agreement protocol. It determines whether a reconfiguration is needed, and, if so, begins the proposal by performing a propose(v), where v is the proposal value. When consensus is reached, Paxos calls the Etna decide procedure with the decided value. At this point, Etna notifies the new configuration of the decision, and the algorithm can proceed. Paxos guarantees the following:

Theorem 1 (derived from [12, 13]). For each instance of the Paxos protocol, if one or more decide events occur, then all the decided values are the same. If a propose(v) occurs at some node n at time t, no later propose occurs at any other node, and node n does not fail, then a decide event occurs at n within a bounded number of message delays.
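Etna treats each Paxos instance as a black box with a propose call and a decide callback. The sketch below is a hypothetical interface, not the paper's implementation: the toy propose rule simply accepts the first proposal, standing in for the real agreement protocol that provides the same guarantee (Theorem 1) under contention and message loss.

from typing import Any, Callable

class PaxosInstance:
    """Stand-in for one consensus instance, one per configuration transition."""
    def __init__(self, seqnum: int, on_decide: Callable[[Any], None]):
        self.seqnum = seqnum        # which configuration this instance decides
        self.on_decide = on_decide  # Etna's decide procedure
        self.chosen = None

    def propose(self, value):
        # Toy rule: the first proposal is chosen. Real Paxos reaches the same
        # outcome (a single chosen value reported to the proposers) even with
        # concurrent proposers, at the cost of its agreement phases.
        if self.chosen is None:
            self.chosen = value
        self.on_decide(self.chosen)

def decide(value):
    new_config, block_copy = value
    print("install configuration", new_config, "carrying tag", block_copy["tag"])

# One instance per configuration, as Etna uses it:
paxos = PaxosInstance(seqnum=3, on_decide=decide)
paxos.propose((["n1", "n2", "n3"], {"tag": (7, "n1"), "value": b"..."}))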
4 The Etna Algorithm

In this section we present Etna, an algorithm that provides fault-tolerant, atomic mutable data in DHash. For each DHash block, Etna uses Paxos to maintain a consistent series of replica configurations in the face of dynamic membership. Given a configuration of replicas, Etna designates a node as the primary and serializes all reads and writes through it. Etna assumes that nodes behave according to the protocol and are fail-stop.

4.1 Block State

Etna provides atomicity for each mutable block, which extends to the entire DHT: by the composability of atomic objects [15], all mutable blocks in the DHT form an atomic memory. Therefore, we describe our protocol in terms of a single block.

To provide fault-tolerance, Etna replicates a mutable block at k different DHash nodes, where k is the system-wide replication factor. We call this replica set a configuration of nodes responsible for a block. Etna initiates reconfigurations in order to arrange that a block's replicas are the nodes that immediately succeed the block's key in the Chord ID space, and that the block's immediate successor node is the primary in the configuration.

For each block, a replica keeps a tag, which is a pair (version, primaryid); a status flag, which designates the phase of operation of the block; a new-tag, which is used during write operations; and a config variable, which contains the configuration sequence number and the IDs of the nodes in the block's configuration. Etna increments the sequence number for each new configuration. Figure 1 summarizes the fields in a mutable block. We refer to a DHash mutable block as simply a block throughout the rest of the paper.

  Per-node state:
    status     flag with values idle, active, or recon_inprog; initially idle
    tag        a pair (version, primaryid); initially (0, 0)
    value      latest block value; initially nil
    new-tag    a pair (version, primaryid) holding the largest tag of any ongoing write; initially (0, 0)
    config     current configuration, with sub-fields seqnum (initially 0) and nodes = (node_1, node_2, ...) (initially empty)
  Per-operation state:
    responses  set of responses for the current operation; initially empty

  Figure 1: Fields in an Etna block.

4.2 Inserting a New Block

To insert a new block, the writer passes the block's data to the DHash client on the local machine. Once in DHash, Etna extends the block with the fields in Figure 1. Because the block is new, Etna can directly insert it at the initial replicas, set to be the k immediate successors of the block's ID.

4.3 Read Protocol

To read a block with key bid, the reader sends a read RPC to the node p that is the immediate successor of bid. p checks whether it believes it is the current primary of bid and whether bid is not going through a reconfiguration (status = active). If both conditions are true, it sends a GET RPC to each node in config. If not, the system may be going through membership reconfiguration, so p returns an error; the reader will retry periodically. When a node r receives a GET RPC for bid, it looks in config to see whether the sender is the current primary of bid and whether status = active. If both conditions are true, r returns a positive ack; if not, it returns an error. If p collects more than k/2 positive acks, it returns its own stored copy of the block value to the reader. If p fails to assemble a majority after a certain time, it returns an error. Figure 3 shows the pseudocode for the read protocol.

  Network functions:
    send(...)         sends a message from node i to node j
    rcv(...)          receives a message at node i from node j
  Chord functions:
    succ()            returns the immediate successor of the block's identifier on the Chord ring
    k_successors()    returns the k immediate successors of the block's identifier on the Chord ring
  Etna function:
    primary(c)        returns the primary for a given configuration
  Paxos function:
    propose(c, ...)   proposes a new configuration

  Figure 2: Auxiliary functions, provided by Chord, Etna, Paxos, and the network.

  Read protocol for the primary:
    Procedure recv(read):
      if i = primary(config) and status = active then
        responses := {}
        for each j in config.nodes: send(get, c, config.seqnum) to j
    Procedure recv(get-ack, c, seqnum) from j:
      if seqnum = config.seqnum then
        responses := responses + {j}
        if |responses| > k/2 and status = active then
          send(tag, value) to the reader
  Read protocol for the replicas (including the primary):
    Procedure recv(get, c, seqnum) from j:
      if status = active and seqnum = config.seqnum then
        send(get-ack, c) to j

  Figure 3: Pseudo-code for the read protocol.
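To make the control flow concrete, here is a minimal sketch of the read path in Python. It is illustrative only, not the DHash/Etna implementation: the Node and Config records, the primary helper (which assumes the first listed replica is the primary, matching Etna's successor rule), and the send_get transport stub are all assumptions.

from dataclasses import dataclass
from typing import Any, List

@dataclass
class Config:
    seqnum: int
    nodes: List[str]           # replica IDs; nodes[0] is taken to be the primary

@dataclass
class Node:
    node_id: str
    status: str                # "idle", "active", or "recon_inprog"
    tag: tuple                 # (version, primaryid), compared lexicographically
    value: Any
    new_tag: tuple
    config: Config

class OperationError(Exception):
    pass

def primary(config: Config) -> str:
    return config.nodes[0]

def primary_read(node: Node, send_get):
    """Primary-side read (cf. Figure 3): gather a majority of acks for the
    current configuration, then return the primary's own tag and value."""
    if node.node_id != primary(node.config) or node.status != "active":
        raise OperationError("not the active primary; the reader retries later")
    acks = sum(1 for r in node.config.nodes
               if send_get(r, node.config.seqnum, sender=node.node_id))
    if acks > len(node.config.nodes) // 2:
        return node.tag, node.value        # the primary itself supplies the data
    raise OperationError("no quorum in configuration %d" % node.config.seqnum)

def handle_get(node: Node, seqnum: int, sender: str) -> bool:
    """Replica-side get handler: positive ack only if this replica is active
    in the same configuration and the sender is its current primary."""
    return (node.status == "active"
            and sender == primary(node.config)
            and seqnum == node.config.seqnum)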

4.4 Write Protocol

To write block bid, the writer sends a write RPC to the successor of bid, node p. p consults its local state to verify that it believes it is the current primary of bid and that status is active. If both conditions are true, p starts the write protocol:

1. Node p assigns a new version number to this write, giving it tag (new-tag.version + 1, p).

2. Node p sends a put RPC to each replica node in config.nodes, including the write's tag and value.

3. When a node r receives a put RPC for bid, it ignores the RPC if status is not active or if the sender isn't the primary. If the write's tag is higher than the stored tag, r updates its stored block. This ensures that a replica is not confused if concurrent writes arrive from the primary out of order. Node r then returns an ack to the sender.

4. When node p receives positive acks from a majority of the replicas, it updates its own copy of the block, though only if it has completed no subsequent concurrent write. Node p then replies to the client. If p fails to assemble a majority after a certain time, it returns an error.

Figure 4 illustrates the write protocol. The primary assigns increasing version numbers to writes in the order that they arrive at the primary, but issues the writes to the replicas concurrently.

  Write protocol for the primary:
    Procedure recv(write, c, new-val):
      if i = primary(config) and status = active then
        new-tag := (new-tag.version + 1, i)
        op-block := (new-tag, new-val)
        responses := {}
        for each j in config.nodes: send(put, c, config.seqnum, op-block) to j
    Procedure recv(put-ack, c, op-block, seqnum) from j:
      if seqnum = config.seqnum then
        responses := responses + {j}
        if |responses| > k/2 and status = active then
          if op-block.tag > tag then
            tag := op-block.tag; value := op-block.value
          reply to the writer
  Write protocol for the replicas (including the primary):
    Procedure recv(put, c, seqnum, op-block) from j:
      if status = active and j = primary(config) and seqnum = config.seqnum then
        if op-block.tag > tag then
          tag := op-block.tag; value := op-block.value
        send(put-ack, c, op-block, config.seqnum) to j

  Figure 4: Pseudo-code for the write protocol.
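Continuing the sketch above (and reusing its Node, Config, primary, and OperationError definitions), the write path can be outlined the same way; send_put again stands in for the RPC transport, and tags are compared lexicographically as (version, primaryid) tuples.

def primary_write(node, new_value, send_put):
    """Primary-side write (cf. Figure 4): assign the next tag, push the block to
    every replica, and install it locally once a majority has acknowledged."""
    if node.node_id != primary(node.config) or node.status != "active":
        raise OperationError("not the active primary; the writer retries later")
    node.new_tag = (node.new_tag[0] + 1, node.node_id)  # next version, node ID breaks ties
    op_tag, op_value = node.new_tag, new_value
    acks = sum(1 for r in node.config.nodes
               if send_put(r, node.config.seqnum, op_tag, op_value,
                           sender=node.node_id))
    if acks <= len(node.config.nodes) // 2:
        raise OperationError("no quorum; the writer retries after reconfiguration")
    if op_tag > node.tag:              # skip if a later concurrent write already installed
        node.tag, node.value = op_tag, op_value
    return op_tag                      # reply to the client

def handle_put(node, seqnum, op_tag, op_value, sender) -> bool:
    """Replica-side put handler: apply only updates carrying a newer tag from the
    current primary in the current configuration."""
    if (node.status != "active" or sender != primary(node.config)
            or seqnum != node.config.seqnum):
        return False
    if op_tag > node.tag:
        node.tag, node.value = op_tag, op_value
    return True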

4.5 Reconfiguration Protocol

Etna must change the configuration of nodes responsible for a block when a replica leaves (to maintain the replication factor k) and when a new node joins that would be the block's successor (so that the primary is the Chord successor and is thus easy to find). Etna maintains only one configuration at a time, rather than a set of configurations as in Rambo [14]; this allows Etna to be simpler and to have higher performance in all but the highest-churn environments. Etna uses the Paxos [12] distributed consensus protocol to decide on the next configuration. If an Etna node n notices that the set of Chord successors for a block bid does not match the set of replicas in the block's config, it tries to initiate a reconfiguration:

Case I: If n notices that it is the immediate successor of bid, it collects the information that will serve as a configuration proposal for a Paxos execution. The proposal has the form (new-config, block-copy). n sets new-config to be the k immediate successors of bid. n sends a recon-get RPC to each node in config asking for its current block value and tag, waits for a majority of replicas to respond, and uses the most up-to-date response as the block-copy in the proposal. When a replica receives a recon-get RPC, it sets status to recon_inprog and stops processing all reads and writes. When n has assembled a majority of recon-get responses, it uses Paxos to propose its proposal value to the nodes in the old configuration. Paxos calls the decide function, with the consensus configuration information, at each node that proposed a new configuration; the proposer(s) send the new configuration and the most up-to-date block to the replicas in the new configuration.

Case II: If n is not the immediate successor of bid, n sends the tag, value, and config information to the immediate successor of bid, asking the successor to initiate a reconfiguration.

Figure 5 shows the pseudo-code for the reconfiguration protocol. During reconfiguration, nodes in the current configuration become inactive (they stop serving read and write requests). If reconfiguration fails, there will be no active configuration. Section 6 discusses liveness.

  Reconfiguration protocol for the primary:
    Procedure recon():
      if config.nodes != k_successors() and i = succ() then
        new-config := k_successors()
        status := recon_inprog
        proposed := false
        responses := {}
        for each j in config.nodes: send(recon-get) to j
    Procedure recv(recon-ack, new-tag, new-value, seqnum) from j:
      if seqnum = config.seqnum then
        if new-tag > tag then
          tag := new-tag; value := new-value
        responses := responses + {j}
        if |responses| > k/2 and proposed = false then
          proposed := true
          block-copy := (tag, value)
          Paxos.propose(new-config, block-copy)
    Procedure decide(new-config, block-copy):
      for each j in new-config.nodes: send(update, new-config, block-copy) to j
  Reconfiguration protocol for the replicas:
    Procedure recv(recon-get, seqnum) from j:
      if seqnum = config.seqnum then
        status := recon_inprog
        send(recon-ack, tag, value, config.seqnum) to j
    Procedure recv(update, new-config, block-copy):
      if i in new-config.nodes and new-config.seqnum > config.seqnum then
        status := active
        config := new-config
        if block-copy.tag > tag then
          tag := block-copy.tag; value := block-copy.value

  Figure 5: Pseudo-code for reconfiguration.
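The reconfiguration path can be sketched in the same style (again illustrative, reusing the Node and Config records from the earlier sketches; k_successors, send_recon_get, paxos_propose, and the reply records are stand-ins for Chord, the RPC layer, and the Paxos instance).

def initiate_recon(node, k_successors, send_recon_get, paxos_propose):
    """Primary-side reconfiguration (cf. Figure 5, Case I): freeze the old
    replicas, gather the freshest copy from a majority, and hand the proposal
    (new configuration plus block copy) to Paxos."""
    if node.config.nodes == k_successors:
        return                                   # replica set still matches Chord
    node.status = "recon_inprog"                 # stop serving reads and writes
    best_tag, best_value, acks = node.tag, node.value, 0
    for r in node.config.nodes:
        reply = send_recon_get(r, node.config.seqnum)   # None models no response
        if reply is None:
            continue
        acks += 1
        if reply.tag > best_tag:
            best_tag, best_value = reply.tag, reply.value
    if acks > len(node.config.nodes) // 2:
        # decide() will later push (new_config, block_copy) to the new replicas
        paxos_propose(node.config.seqnum, (k_successors, (best_tag, best_value)))

def handle_update(node, new_config, block_copy):
    """New-replica side: install the decided configuration and freshest block."""
    if node.node_id in new_config.nodes and new_config.seqnum > node.config.seqnum:
        node.config, node.status = new_config, "active"
        if block_copy[0] > node.tag:
            node.tag, node.value = block_copy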
5 Atomicity

In this section, we show that Etna correctly implements an atomic read/write object. Throughout this section, we consider only a single block, b. Since atomic memory is composable, this is sufficient to show that Etna guarantees atomic consistency. We omit references to b for the rest of this section.

We use the partial-ordering technique described in Lynch [15]. We rely on the following lemma:

Lemma 1 (adapted from [15]). Let β be any well-formed, finite execution of an algorithm A (implementing a read/write atomic object) in which every operation completes. Let Π be the set of all operations in β. Suppose that ≺ is an irreflexive partial ordering of all the operations in Π, satisfying the following properties:

1. For any operation π in Π, there are only finitely many operations φ such that φ ≺ π.

2. If π finishes before φ starts, then it cannot be the case that φ ≺ π.

3. If π is a write operation in Π and φ is any operation in Π, then either π ≺ φ or φ ≺ π.

4. The value returned by each read operation is the value written by the last preceding write operation according to ≺ (or the initial value, if there is no such write).

Then every well-formed execution of algorithm A satisfies the atomicity property.

We consider an arbitrary well-formed execution β of the Etna algorithm in which every read and write operation completes. (Lynch [15] shows that it is sufficient to consider only such executions.) We first define a partial order on the read and write operations in β, and then show that this partial order has the sufficient properties above. Finally, we conclude that Etna guarantees atomic consistency.

Partial Order. We first order the read and write operations in β based on their tags. For a read or write operation π initiated at node p, we define tag(π) to be tag_p immediately before the operation returns to the reader or writer; that is, tag(π) is the value of the block's tag when p sends the result back to the client (Figures 3 and 4). For a reconfiguration operation γ, we define tag(γ) to be block-copy.tag immediately before the call to Paxos.propose (Figure 5). We then define the partial order ≺ as follows: (i) for any two operations π and φ in Π, if tag(π) < tag(φ), then π ≺ φ; (ii) for any write operation π in Π and any read operation φ in Π, if tag(π) = tag(φ), then π ≺ φ.

We show in Theorem 2 that it is straightforward to see that this partial order satisfies Properties 1, 3, and 4 of Lemma 1. The primary goal of the rest of this section is to show that this ordering satisfies Property 2.

Atomicity Proof. Our first goal is to show that when a new configuration is installed, no information is lost. Let configuration c_0 be the initial configuration, and let configuration c_{l+1} be the unique configuration decided on by the l-th Paxos instance. (Theorem 1 ensures that this is, in fact, unique.) If that Paxos instance does not terminate, then c_{l+1} is undefined. Recall that when Paxos is initiated, the process performing the reconfiguration includes block-copy, the latest copy of the block, in the proposal. That is, the Paxos instance is initiated with at least one call to propose(new-config, block-copy). Define tag(c_0) to be (0, 0) and tag(c_{l+1}) to be the block-copy.tag of the successful proposal to the l-th Paxos instance. We sometimes refer to tag(c_{l+1}) as the initial tag of configuration c_{l+1}. We want to show that the initial tags of the configurations are nondecreasing:

Lemma 2. Let c_l and c_{l+1} be configurations installed in β. Then tag(c_l) ≤ tag(c_{l+1}).

Proof. No replica in configuration c_l can send any response until it has received an update message (Figure 5), which causes the replica to set its status to active and its tag to at least tag(c_l). Therefore every response to the recon-get messages (Figure 5) during the Paxos execution that proposes configuration c_{l+1} must include a tag no smaller than tag(c_l). Therefore tag(c_l) ≤ block-copy.tag in the successful proposal for c_{l+1}, from which the result follows.

It then follows immediately by induction:

Corollary 1. If c_l and c_m are two configurations in β, and l ≤ m, then tag(c_l) ≤ tag(c_m).
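For reference, the ordering just defined can be written in display form; the lexicographic tag comparison shown here is an assumption spelled out from the tie-breaking rule used in the proof of Theorem 2.

\[
  \langle v, p \rangle < \langle v', p' \rangle \;\iff\; v < v' \ \lor\ \bigl(v = v' \land p < p'\bigr)
\]
\[
  \pi \prec \phi \quad\text{iff}\quad \mathit{tag}(\pi) < \mathit{tag}(\phi)
  \;\;\text{or}\;\;
  \bigl(\mathit{tag}(\pi) = \mathit{tag}(\phi),\ \pi \text{ is a write},\ \phi \text{ is a read}\bigr)
\]
\[
  l \le m \;\Longrightarrow\; \mathit{tag}(c_l) \le \mathit{tag}(c_m) \qquad \text{(Corollary 1)}
\]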
We next consider a read or write operation that occurs in configuration c_l (for some l ≥ 0). We want to show that the earliest value such an operation can return is the one associated with the initial tag of c_l. For a read or write operation π, let conf(π) be config.seqnum at the primary when it sends the result back to the client (Figures 3 and 4).

Lemma 3. Let π be a read or write operation in β, and assume that conf(π) = l. Then tag(c_l) ≤ tag(π), and if π is a write operation then the inequality is strict.

Proof. The reconfiguration to install configuration c_l concludes when a primary p wins a Paxos decision (Figure 5) and sends update messages to the new replicas. Notice that the decision includes the block-copy determined when the reconfiguration began. Therefore, by the time the configuration is installed, tag(c_l) ≤ tag_p. As a result, every operation initiated at node p has a tag greater than or equal to tag(c_l), and for write operations the inequality is strict.

Next, we relate the tag of a read or write operation to the tag of the next configuration. We want to show that if a read or write operation completes, its information is transferred to the next configuration.

Lemma 4. Let π be a read or write operation in β, and assume that conf(π) = l. If configuration c_{l+1} is installed in β, then tag(π) ≤ tag(c_{l+1}).

Proof. First, notice that the value at the primary always reflects a put operation that has updated a majority of the replicas. Therefore, if π is a read operation, there is a write operation ω that wrote the tag and value returned by π to a majority of the replicas. If π is a write operation, define ω = π. Since operation ω completes in configuration c_l, there exists a set of more than k/2 nodes in configuration c_l that send a response for ω to the primary. Call this set of nodes W (for writers). Since configuration c_{l+1} is installed in β, there exists a set of more than k/2 nodes in configuration c_l that send a response to the recon-get message during the reconfiguration. Call this set of nodes R (for readers). Since there are k nodes in configuration c_l, and both W and R contain more than k/2 nodes, there is at least one node r in both W and R. Node r sends a response both for operation ω and for the successful reconfiguration resulting in c_{l+1}. We claim that node r sends the response for operation ω first. As soon as node r sends a response to a recon-get, it sets its status to recon_inprog, at which point it ceases responding to put requests. Since we know that r sends a response for operation ω, it must send this response prior to the first time it receives a recon-get request. We conclude, then, that the primary sends its put request for ω to replica r prior to r sending its response to the recon-get request. Therefore, tag(π) = tag(ω) ≤ tag(c_{l+1}).

In the final preliminary lemma, we show that the configurations used by operations are non-decreasing. That is, if an operation uses one configuration, then a later operation cannot use a smaller configuration.

Lemma 5 (sketch). Let π and φ be two read or write operations in β. Assume that operation π completes before operation φ begins. Then conf(π) ≤ conf(φ).

Proof. If π completes in configuration c_l, then some reconfiguration installing c_l completes prior to φ. During that reconfiguration, a majority of the replicas in configuration c_{l-1} were sent a recon-get message notifying them to cease processing read and write requests. Inductively, it is clear that a majority of the replicas from all earlier configurations received such messages. Therefore operation φ cannot complete using an earlier configuration.

Relating Read and Write Operations. Finally, we show that if operation π completes before operation φ begins, then the tag of π is less than or equal to the tag of φ.

Lemma 6. Let π and φ be two read or write operations in β where π completes before φ begins. Then tag(π) ≤ tag(φ), and if φ is a write operation then the inequality is strict.

Proof. We break the proof down into two cases: (i) π and φ complete in the same configuration, and (ii) π completes in a smaller configuration than φ. Lemma 5 shows that π cannot complete in a larger configuration than φ.

First, assume that conf(π) = conf(φ) = l. Let node p be the primary of configuration c_l. Both operations originate at node p. Therefore, when operation φ begins, the tag of p is at least as large as tag(π). If φ is a write operation, then p increments the tag, and the inequality is strict.

Next, consider the case where conf(π) < conf(φ). Notice that tag(π) ≤ tag(c_{conf(π)+1}), by Lemma 4. Next, notice that tag(c_{conf(π)+1}) ≤ tag(c_{conf(φ)}), by Corollary 1. Third, notice that tag(c_{conf(φ)}) ≤ tag(φ), and if φ is a write operation the inequality is strict, by Lemma 3. Combining the inequalities implies the desired result.

Finally, we prove the main theorem:

Theorem 2. The Etna algorithm correctly implements an atomic read/write object.

Proof. We show that the protocol satisfies the four conditions of Lemma 1. For an arbitrary execution in which every read and write operation completes, we demonstrate that the partial ordering ≺ satisfies Properties 1-4 of Lemma 1:

1. Immediate.

2. It follows from Lemma 6 that if π completes before φ begins, then tag(π) ≤ tag(φ), and the inequality is strict if φ is a write operation. In either case it cannot be that φ ≺ π.

3. If π and φ are both write operations, then tag(π) ≠ tag(φ), since the tags are unique (they consist of a version number and a node identifier to break ties), and so either π ≺ φ or φ ≺ π. If π is a write operation and φ is a read operation and tag(π) = tag(φ), then π ≺ φ. Otherwise, if tag(π) ≠ tag(φ), then either π ≺ φ or φ ≺ π, depending on which has the larger tag.

4. This follows by the definition of the partial order. If φ is a read operation, then tag(φ) is either the tag of the write operation π whose value φ returns, in which case π is the last preceding write operation according to ≺, or the initial tag, in which case there is no preceding write and φ returns the initial value.

6 Theoretical Performance

As in all quorum-based algorithms, the performance of Etna depends on enough replicas remaining alive. We assume that if a node crashes, the remaining live nodes in the relevant configurations notice the crash and reconfigure quickly enough to maintain a live majority. If a majority of the nodes in a configuration fail, then operations can no longer complete. A recovery protocol could attempt to collect the values from the remaining replicas, at the expense of atomicity; we do not address recovery from failed configurations in this paper.

During intervals in which the primary does not fail, the algorithm is efficient. A write operation requires a single round of communication to propagate the new value to a quorum. A read operation also requires only a single round of communication, involving only small control messages, since the primary supplies the data. For the purpose of this section, we assume that each message is delivered within time d:

Lemma 7. If a read or write operation begins at time t, and the primary does not fail by time t + 2d, then the operation completes by time t + 2d.

A reconfiguration is somewhat more expensive, requiring three and a half rounds of communication. As soon as a new primary is established, it queries the old replicas for the latest value of the block. It then begins Paxos, which requires two rounds of communication to arrive at a decision. Finally, it updates the new configuration. In our implementation, we piggy-back the recon-get RPC with the first RPC of Paxos, reducing the communication to two and a half rounds. The following lemma reflects this optimization.

Lemma 8. If node p is designated the primary at time t, and p does not fail by time t + 5d, then the new configuration is installed by time t + 5d.

Recall that when a reconfiguration takes place, ongoing read and write operations may fail to complete. We assume that, in this case, the client retries the operation at the new primary. Combining the two previous lemmas, we see that even if a primary fails, a read or write operation completes within 7d after a new primary is designated.
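The bounds in this section combine as follows (a restatement of Lemmas 7 and 8 under the per-message delay bound d and the round counts stated above):

\[
  T_{\mathrm{read}} = T_{\mathrm{write}} \le 2d \qquad \text{(one round: primary $\to$ replicas $\to$ primary)}
\]
\[
  T_{\mathrm{recon}} \le 5d \qquad \text{(two and a half rounds, with recon-get piggy-backed on Paxos)}
\]
\[
  T_{\mathrm{retry}} \le T_{\mathrm{recon}} + T_{\mathrm{op}} \le 5d + 2d = 7d \qquad \text{after a new primary is designated}
\]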
7 Experimental Evaluation

This section reports the measured latency of Etna operations in a working implementation, focusing on the costs of increasing degrees of replication. The test environment is the Emulab [1] test-bed, which connects a large set of standard PCs with a network capable of emulating delays and limited link capacities. Figure 6 shows the emulated network: 30 hosts in a star topology with a router in the center. Each star link has a capacity of 500 kilobits/second and a one-way delay of 10 milliseconds. This network configuration is an abstract version of the Internet, in which each host has a limited-speed access link connected to a high-speed core. Each host is an 800 MHz Pentium III with 256 MB of RAM running FreeBSD.

  Figure 6: Experiment topology: 30 physical machines connected via a LAN; each link has a capacity of 500 kilobits/second and a 10 millisecond one-way delay.

Each host runs one Etna/DHash/Chord process. The latencies reported below include only the costs of the Etna algorithms; they do not include the cost of the client performing the Chord lookup to find the primary, or the communication between client and primary. The size of the blocks being read and written is 8192 bytes.

Figure 7 shows the measured latency of an Etna read operation as a function of the number of replicas. The read latency is mostly determined by the cost of the read protocol's single round, which costs one 40-millisecond round-trip time. The latency grows slowly with the number of replicas due to the transmission time required to send and receive one short message per replica over the primary's link.

  Figure 7: Read latency as a function of the number of replicas.

Figure 8 shows the measured write latency as a function of the number of replicas. The write latency is determined by the transmission time required to send the new 8192-byte block to each replica; each copy takes about 132 milliseconds to send over the primary's link. Thus five replicas (where one is the primary) result in a write time of roughly 4 x 132, or about 530 milliseconds, and the latency rises linearly with more replicas.

  Figure 8: Write latency as a function of the number of replicas.

To evaluate reconfiguration latency, we kill one-third of the replicas in a configuration and measure the time from the start of reconfiguration until the next configuration is installed. Figure 9 shows the results; the latencies increase with the number of replicas for the same reasons that the write latencies increase. The reason for the higher-than-expected reconfiguration times appears to be packet loss and TCP back-off caused by the replicas simultaneously sending the block value to the primary; this is mostly an artifact of the network emulator and will be fixed in a future version of this paper.

  Figure 9: Reconfiguration time as a function of the number of replicas.

How much would Etna add to the overall latency of a DHT? Etna's read latency must be added to the DHT's lookup latency, since the Etna primary cannot start the read protocol until the client finds it and sends it a request. DHT lookups take somewhat less than an average network round-trip delay [4], so Etna would somewhat more than double overall read latency. A replicated DHT must send writes to all replicas, just like Etna, so Etna is likely to leave overall write latency unchanged.
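As a rough sanity check of the latency model described above (illustrative only: the formulas use the experiment's link speed, delay, and block size, and ignore protocol headers, TCP effects, and queueing), the predicted read and write latencies can be computed directly.

LINK_BPS = 500_000        # 500 kilobit/second access links
ONE_WAY_DELAY = 0.010     # 10 ms per link; primary <-> replica crosses two links
BLOCK_BYTES = 8192        # block size used in the experiments

def read_latency_s():
    """One round of small control messages: dominated by one round-trip time."""
    return 4 * ONE_WAY_DELAY             # two links each way through the router

def write_latency_s(replicas):
    """The primary pushes the full block to every other replica over its own link."""
    per_copy = BLOCK_BYTES * 8 / LINK_BPS    # about 131 ms per copy
    return (replicas - 1) * per_copy + 4 * ONE_WAY_DELAY

print("read  ~ %.0f ms" % (read_latency_s() * 1000))     # about 40 ms
print("write ~ %.0f ms" % (write_latency_s(5) * 1000))   # about 564 ms with 5 replicas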

8 Conclusion

This paper describes Etna, an algorithm for atomic mutable blocks in a distributed hash table. Etna correctly handles a dynamically changing set of replica hosts, using protocols optimized for situations in which reads are more common than writes or replica set changes. Etna uses Paxos to agree on a sequence of replica configurations, and only allows operations when a majority of replicas from the same configuration are available. Etna's write latency is comparable to that of non-atomic replicated DHTs, and its read latency is approximately twice that of a non-atomic DHT.

References

[1] Emulab.
[2] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM, 42(1):124-142, 1995.
[3] Ken Birman and Thomas Joseph. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles, December 1987.
[4] F. Dabek, F. Kaashoek, J. Li, R. Morris, J. Robertson, and E. Sit. Designing a DHT for low latency and high throughput. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (NSDI), March 2004.
[5] F. Dabek, M. Frans Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In Proc. of the ACM Symposium on Operating Systems Principles, October 2001.
[6] Danny Dolev, Idit Keidar, and Esti Yeger Lotem. Dynamic voting for consistent primary components. In Proc. of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing. ACM Press, 1997.
[7] B. Englert and Alex A. Shvartsman. Graceful quorum reconfiguration in a robust emulation of shared memory. In Proceedings of the International Conference on Distributed Computing Systems, 2000.
[8] Alan Mislove et al. POST: A secure, resilient, cooperative messaging system. In 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, Hawaii, May 2003.
[9] Seth Gilbert, Nancy A. Lynch, and Alex A. Shvartsman. RAMBO II: Rapidly reconfigurable atomic memory for dynamic networks. In Proc. of the Intl. Conference on Dependable Systems and Networks, June 2003.
[10] Communications of the ACM, Special section on Group Communication Systems, volume 39(4), 1996.
[11] S. Jajodia and David Mutchler. Dynamic voting algorithms for maintaining the consistency of a replicated database. ACM Transactions on Database Systems, 15(2), 1990.
[12] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133-169, 1998.
[13] Leslie Lamport. Paxos made simple. ACM SIGACT News, December 2001.
[14] Nancy Lynch and Alex Shvartsman. RAMBO: A reconfigurable atomic memory service for dynamic networks. In Proceedings of the 16th International Symposium on Distributed Computing (DISC 2002), Toulouse, France, October 2002.
[15] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[16] Nancy A. Lynch and Alexander A. Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged broadcasts. In Twenty-Seventh Annual Intl. Symposium on Fault-Tolerant Computing, June 1997.
[17] Athicha Muthitacharoen, Robert Morris, Thomer Gil, and Benjie Chen. Ivy: A read/write peer-to-peer file system. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, Massachusetts, December 2002.
[18] Roberto De Prisco, Alan Fekete, Nancy A. Lynch, and Alexander A. Shvartsman. A dynamic primary configuration group communication service. In Proceedings of the 13th International Symposium on Distributed Computing, pages 64-78, September 1999.
[19] Rodrigo Rodrigues, Barbara Liskov, and Liuba Shrira. The design of a robust peer-to-peer system. In Proceedings of the Tenth ACM SIGOPS European Workshop, September 2002.
[20] E. Sit, F. Dabek, and J. Robertson. UsenetDHT: A low overhead Usenet server. In Proc. of the Third International Workshop on Peer-to-Peer Systems, February 2004.
[21] I. Stoica, R. Morris, D. Karger, M. Frans Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. ACM SIGCOMM, August 2001.
[22] I. Stoica, R. Morris, David Liben-Nowell, D. Karger, M. Frans Kaashoek, Frank Dabek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking, 2003.
[23] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In Proc. ACM SIGCOMM, August 2003.
[24] Eli Upfal and Avi Wigderson. How to share memory in a distributed system. Journal of the ACM, 34(1):116-127, 1987.
[25] M. Walfish, H. Balakrishnan, and S. Shenker. Untangling the web from DNS. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (NSDI), March 2004.


More information

CS6450: Distributed Systems Lecture 15. Ryan Stutsman

CS6450: Distributed Systems Lecture 15. Ryan Stutsman Strong Consistency CS6450: Distributed Systems Lecture 15 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed

More information

Integrity in Distributed Databases

Integrity in Distributed Databases Integrity in Distributed Databases Andreas Farella Free University of Bozen-Bolzano Table of Contents 1 Introduction................................................... 3 2 Different aspects of integrity.....................................

More information

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Pawel Jurczyk and Li Xiong Emory University, Atlanta GA 30322, USA {pjurczy,lxiong}@emory.edu Abstract. The continued advances

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Asynchronous Reconfiguration for Paxos State Machines

Asynchronous Reconfiguration for Paxos State Machines Asynchronous Reconfiguration for Paxos State Machines Leander Jehl and Hein Meling Department of Electrical Engineering and Computer Science University of Stavanger, Norway Abstract. This paper addresses

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications

Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 11, NO. 1, FEBRUARY 2003 17 Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger,

More information

DISTRIBUTED HASH TABLE PROTOCOL DETECTION IN WIRELESS SENSOR NETWORKS

DISTRIBUTED HASH TABLE PROTOCOL DETECTION IN WIRELESS SENSOR NETWORKS DISTRIBUTED HASH TABLE PROTOCOL DETECTION IN WIRELESS SENSOR NETWORKS Mr. M. Raghu (Asst.professor) Dr.Pauls Engineering College Ms. M. Ananthi (PG Scholar) Dr. Pauls Engineering College Abstract- Wireless

More information

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol Min Li 1, Enhong Chen 1, and Phillip C-y Sheu 2 1 Department of Computer Science and Technology, University of Science and Technology of China,

More information

Distributed Hash Table

Distributed Hash Table Distributed Hash Table P2P Routing and Searching Algorithms Ruixuan Li College of Computer Science, HUST rxli@public.wh.hb.cn http://idc.hust.edu.cn/~rxli/ In Courtesy of Xiaodong Zhang, Ohio State Univ

More information

Distributed Consensus: Making Impossible Possible

Distributed Consensus: Making Impossible Possible Distributed Consensus: Making Impossible Possible QCon London Tuesday 29/3/2016 Heidi Howard PhD Student @ University of Cambridge heidi.howard@cl.cam.ac.uk @heidiann360 What is Consensus? The process

More information

Availability Study of Dynamic Voting Algorithms

Availability Study of Dynamic Voting Algorithms Availability Study of Dynamic Voting Algorithms Kyle Ingols Idit Keidar kwi,idish @theory.lcs.mit.edu Massachusetts Institute of Technology Lab for Computer Science 545 Technology Square, Cambridge, MA

More information

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Page 1 Introduction We frequently want to get a set of nodes in a distributed system to agree Commitment protocols and mutual

More information

Topics in Reliable Distributed Systems

Topics in Reliable Distributed Systems Topics in Reliable Distributed Systems 049017 1 T R A N S A C T I O N S Y S T E M S What is A Database? Organized collection of data typically persistent organization models: relational, object-based,

More information

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage Horizontal or vertical scalability? Scaling Out Key-Value Storage COS 418: Distributed Systems Lecture 8 Kyle Jamieson Vertical Scaling Horizontal Scaling [Selected content adapted from M. Freedman, B.

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems 1 Motivation from the File Systems World The App needs to know the path /home/user/my pictures/ The Filesystem

More information

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 10. Consensus: Paxos Paul Krzyzanowski Rutgers University Fall 2017 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Chord: A Scalable Peer-to-peer Lookup Service For Internet Applications

Chord: A Scalable Peer-to-peer Lookup Service For Internet Applications Chord: A Scalable Peer-to-peer Lookup Service For Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan Presented by Jibin Yuan ION STOICA Professor of CS

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

: Scalable Lookup

: Scalable Lookup 6.824 2006: Scalable Lookup Prior focus has been on traditional distributed systems e.g. NFS, DSM/Hypervisor, Harp Machine room: well maintained, centrally located. Relatively stable population: can be

More information

Understanding Disconnection and Stabilization of Chord

Understanding Disconnection and Stabilization of Chord Understanding Disconnection and Stabilization of Chord Zhongmei Yao Joint work with Dmitri Loguinov Internet Research Lab Department of Computer Science Texas A&M University, College Station, TX 77843

More information

INF-5360 Presentation

INF-5360 Presentation INF-5360 Presentation Optimistic Replication Ali Ahmad April 29, 2013 Structure of presentation Pessimistic and optimistic replication Elements of Optimistic replication Eventual consistency Scheduling

More information

Practical Byzantine Fault

Practical Byzantine Fault Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault

More information

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers Replicated s, RAFT COS 8: Distributed Systems Lecture 8 Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable service Goal #: Servers should behave just like

More information

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and

More information

Byzantine Fault Tolerance

Byzantine Fault Tolerance Byzantine Fault Tolerance CS 240: Computing Systems and Concurrency Lecture 11 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. So far: Fail-stop failures

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables Peer-to-Peer Systems and Distributed Hash Tables COS 418: Distributed Systems Lecture 7 Today 1. Peer-to-Peer Systems Napster, Gnutella, BitTorrent, challenges 2. Distributed Hash Tables 3. The Chord Lookup

More information

Two New Protocols for Fault Tolerant Agreement

Two New Protocols for Fault Tolerant Agreement Two New Protocols for Fault Tolerant Agreement Poonam Saini 1 and Awadhesh Kumar Singh 2, 1,2 Department of Computer Engineering, National Institute of Technology, Kurukshetra, India nit.sainipoonam@gmail.com,

More information

EFFICIENT ROUTING OF LOAD BALANCING IN GRID COMPUTING

EFFICIENT ROUTING OF LOAD BALANCING IN GRID COMPUTING EFFICIENT ROUTING OF LOAD BALANCING IN GRID COMPUTING MOHAMMAD H. NADIMI-SHAHRAKI*, FARAMARZ SAFI, ELNAZ SHAFIGH FARD Department of Computer Engineering, Najafabad branch, Islamic Azad University, Najafabad,

More information

Scaling Out Key-Value Storage

Scaling Out Key-Value Storage Scaling Out Key-Value Storage COS 418: Distributed Systems Logan Stafman [Adapted from K. Jamieson, M. Freedman, B. Karp] Horizontal or vertical scalability? Vertical Scaling Horizontal Scaling 2 Horizontal

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright 2012 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4.

More information

Building a transactional distributed

Building a transactional distributed Building a transactional distributed data store with Erlang Alexander Reinefeld, Florian Schintke, Thorsten Schütt Zuse Institute Berlin, onscale solutions GmbH Transactional data store - What for? Web

More information

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast Coordination 2 Today l Group communication l Basic, reliable and l ordered multicast How can processes agree on an action or a value? Modes of communication Unicast 1ç è 1 Point to point Anycast 1è

More information

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS Herwig Unger Markus Wulff Department of Computer Science University of Rostock D-1851 Rostock, Germany {hunger,mwulff}@informatik.uni-rostock.de KEYWORDS P2P,

More information

Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem

Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem Design of a New Hierarchical Structured Peer-to-Peer Network Based On Chinese Remainder Theorem Bidyut Gupta, Nick Rahimi, Henry Hexmoor, and Koushik Maddali Department of Computer Science Southern Illinois

More information