Geographic State Machine Replication


Università della Svizzera italiana
USI Technical Report Series in Informatics

Geographic State Machine Replication
Paulo Coelho, Fernando Pedone
Faculty of Informatics, Università della Svizzera italiana, Switzerland

Abstract. Many current online services need to serve clients distributed across geographic areas. These systems are subject to stringent availability and performance requirements. In order to meet these requirements, replication is used to tolerate the crash of servers and to improve performance by deploying replicas near the clients. Coordinating geographically distributed replicas, however, is challenging. We present GeoPaxos, a protocol that addresses this challenge by combining three insights: it decouples order from execution in state machine replication, it induces a partial order on the execution of operations instead of a total order, and it exploits geographic locality, typical of geo-distributed online services. GeoPaxos outperforms state-of-the-art approaches by more than an order of magnitude in some cases. We describe GeoPaxos's design and implementation in detail and present an extensive performance evaluation.

Report Info
Published: June 2017
Number: USI-INF-TR
Institution: Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
Online Access

1 Introduction

Many current online services must serve clients distributed across geographic areas. In order to ensure that clients experience high service availability and performance, servers are typically replicated and deployed over geographically distributed sites (i.e., datacenters). By replicating the servers, the service can be configured to tolerate the crash of a few nodes within a single datacenter or the disruption of an entire datacenter. Geographic replication can also improve performance by placing the data close to the clients, which reduces service latency.

GeoPaxos combines three insights to implement efficient state machine replication in geographically distributed environments. Combining these ideas proved to be challenging and resulted in a novel system that can fully exploit common geographically distributed cloud computing infrastructures (e.g., Amazon's EC2) and outperform related approaches by a large margin.

First, GeoPaxos decouples operation ordering from operation execution. Although there are storage systems that distinguish order from execution (e.g., [39]), and Paxos itself introduces different roles for the ordering and execution of operations [2], Paxos-based systems typically combine the two roles in a replica (e.g., [14, 25, 28]). Combining order and execution in a geographically distributed system, however, leads to a performance dilemma. On the one hand, replicas must be deployed near clients to avoid communication with remote servers during the execution of operations. On the other hand, distributing replicas across geographic areas slows down the ordering of operations, since ordering requires replicas to coordinate. By decoupling order from execution, GeoPaxos can use one set of nodes to order operations and another set of nodes, the replicas, to execute operations. As a result, replicas can be geographically distributed without penalizing the ordering of operations.

Second, instead of totally ordering operations before executing them, as traditionally done in state machine replication [19], GeoPaxos introduces a novel partial order protocol. It is well known that total order is not necessary to ensure consistency in state machine replication [32], and a few designs have implemented partial ordering of operations (e.g., [18, 25]).

GeoPaxos differs from these systems in the way it implements partial order. GeoPaxos uses multiple independent instances of Multi-Paxos [5] to order operations; hereafter, we call an instance of Multi-Paxos a partition. Operations are ordered by one or more partitions, depending on the objects they access. Operations that are ordered by a single partition are the most efficient ones, since they involve servers in datacenters in the same region, subject to small communication delays. Operations that involve multiple partitions require coordination among replicas in datacenters that may be far apart. Thus, multi-partition operations perform worse than single-partition operations.

Third, to maximize the number of single-partition operations, GeoPaxos exploits geographic locality. Geographic locality is present in many online services. In online social services, for example, the probability of having a social connection between two individuals decreases as an inverse power of their geographic distance [2]. Some distributed systems exploit locality by sharding the data and placing shards near the users of the data (e.g., [14, 35]). GeoPaxos does not shard the service state; instead, it distributes the responsibility for ordering operations. Operations are ordered by nodes deployed in the region where the clients most likely to access these objects are located. GeoPaxos's flexibility results in excellent performance at the cost of extra nodes to order operations, when compared to traditional approaches that combine order and execution [14, 25, 28].

This paper makes the following contributions:
- It demonstrates the importance of decoupling the ordering from the execution of operations in a geographically distributed system. Although Paxos makes this distinction, Paxos-based systems combine ordering and execution in a replica.
- It proposes a novel partial ordering protocol that can take advantage of public cloud computing infrastructures such as Amazon EC2. In GeoPaxos, redundancy for fault tolerance is provided by replicas in datacenters in different availability zones within the same region; redundancy for performance is provided by replicas in different regions. Although intra-region redundancy does not tolerate catastrophic failures in which all datacenters of a region are wiped out, most applications do not require this level of reliability.
- It shows how these ideas can be combined with geographic locality, a property present in many online services, leading to a state machine replication design that, under some common circumstances, outperforms state-of-the-art protocols by an order of magnitude.

The rest of the paper is structured as follows. Section 2 details the system model and recalls fundamental notions. Section 3 overviews the main contributions of the paper. Section 4 details GeoPaxos. Section 5 describes our prototype. Section 6 presents our performance evaluation. Section 7 reviews related work and Section 8 concludes the paper.

2 Background

In this section, we define our system model and assumptions (§2.1), recall the notions of consensus and state machine replication (§2.2), and briefly describe Paxos (§2.3).

2.1 System model

We consider a message-passing, geographically distributed system. Client and server processes are grouped within datacenters (also known as sites or availability zones) distributed over different regions. The system is asynchronous in that there is no bound on message delays or on relative process speeds, but communication between processes within the same region experiences much shorter delays than communication between processes in different regions. Processes are subject to crash failures and do not behave maliciously (e.g., no Byzantine failures).

The service state can be replicated in servers in datacenters within the same region and across regions. Replication within a datacenter can tolerate the crash of some of the replicas; replication using servers located in different datacenters can tolerate the crash of a whole datacenter. Replication across regions is mostly used to exploit locality, since storing data close to the clients avoids large delays due to expensive inter-region communication. We account for client-data proximity by assuming that clients have a preferred region [36].
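To make the model concrete, the following Python sketch captures the entities of the system model (regions, datacenters, replicas, and clients with a preferred region). It is purely illustrative; the GeoPaxos prototype itself is written in C, and all names below are assumptions made for illustration.

```python
# Illustrative sketch of the system model in Section 2.1; not GeoPaxos code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Datacenter:
    name: str
    region: str             # e.g., "CA", "VA", "EU"

@dataclass(frozen=True)
class Replica:
    datacenter: Datacenter  # each replica holds a full copy of the service state

@dataclass(frozen=True)
class Client:
    name: str
    preferred_region: str   # clients are assumed to have a preferred region [36]

# Replication within a region tolerates replica or datacenter crashes;
# replication across regions is used to place data close to the clients.
replicas = [Replica(Datacenter("DC1", "CA")), Replica(Datacenter("DC4", "VA"))]
client = Client("c1", preferred_region="CA")
local = [r for r in replicas if r.datacenter.region == client.preferred_region]
assert len(local) == 1
```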

2.2 Consensus and replication

Consensus is an abstraction whereby replicas agree on a common value (e.g., the next operation to be executed). More precisely, consensus is defined by three properties: (a) if a replica decides on a value, then the value was proposed by some process (validity); (b) no two replicas decide differently (agreement); and (c) if a non-faulty process proposes a value, then eventually all non-faulty replicas decide some value (termination). Consensus requires additional assumptions to be solved [1, 2]. Since the protocols proposed in this work do not make explicit use of these assumptions, we simply assume that consensus can be implemented [2].

State machine replication is a principled approach to building highly available services [19, 32]. State machine replication regulates how service operations must be propagated to and executed by the replicas. Operation propagation has two requirements: (i) every non-faulty replica must receive every operation, and (ii) no two replicas can disagree on the order of received and executed operations. If operations are deterministic, then replicas will reach the same state and produce the same output upon executing the same sequence of operations. State machine replication can be implemented as a series of consensus instances, where the i-th consensus instance decides on the i-th operation (or batch of operations) to be executed by the replicas [2]. Although such a total order of operations is sufficient to implement state machine replication, it is not necessary [32].

State machine replication guarantees linearizability, a consistency criterion [12]. An execution is linearizable if there is a permutation of the operations in the execution that respects (i) the service's sequential specification and (ii) the real-time precedence of operations as seen by the clients. Operation op_i precedes operation op_j if the response of op_i occurs before the invocation of op_j.

2.3 Classic Paxos

Paxos is a fault-tolerant consensus protocol with important characteristics: it has been proven safe under asynchronous assumptions (i.e., when there are no timing bounds on message propagation and process execution), live under weak synchrony assumptions, and resilience-optimum [2]. Paxos distinguishes the following roles that a process can play: proposers, acceptors and learners. Clients of a replicated service are typically proposers, and propose operations that need to be ordered by Paxos before they are learned and executed by the replicated state machines. The replicas typically play the roles of acceptors (i.e., the processes that actually agree on a value) and learners. Paxos is resilience-optimum in the sense that it tolerates the failure of up to f acceptors out of a total of 2f + 1 acceptors while ensuring progress (i.e., a quorum of f + 1 acceptors must be non-faulty) [22]. In practice, replicated services run multiple executions of the Paxos protocol to achieve consensus on a sequence of values. We refer to multiple executions of Paxos chained together as Multi-Paxos [5] or Atomic Broadcast [4].
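As a minimal illustration of state machine replication as a sequence of consensus instances, the sketch below buffers decided operations and applies them strictly in instance order, so that deterministic replicas converge to the same state. It is a toy model under assumed names, not part of GeoPaxos.

```python
# Toy model of state machine replication over a totally ordered sequence of
# consensus instances: the i-th instance decides the i-th operation.
class Replica:
    def __init__(self, initial_state):
        self.state = initial_state
        self.next_instance = 0
        self.decided = {}          # instance number -> decided operation

    def on_decide(self, instance, operation):
        """Called when consensus instance `instance` decides `operation`."""
        self.decided[instance] = operation
        # Execute operations strictly in instance order; with deterministic
        # operations, all replicas reach the same state and outputs.
        while self.next_instance in self.decided:
            op = self.decided.pop(self.next_instance)
            self.state = op(self.state)     # deterministic execution
            self.next_instance += 1

r = Replica(initial_state=0)
r.on_decide(1, lambda s: s * 2)   # decided out of order, buffered
r.on_decide(0, lambda s: s + 3)   # instances 0 and 1 now execute in order
assert r.state == 6
```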
3 Overview

GeoPaxos combines three strategies: it exploits the different roles implemented by Paxos (§3.1), it induces a partial order on operations (§3.2), and it makes use of geographic locality (§3.3).

3.1 Dissociating order from execution

GeoPaxos takes advantage of the fact that Paxos allows the ordering of operations, performed by the acceptors, to be dissociated from the execution of operations, performed by the learners (i.e., the service replicas). Systems based on Paxos typically combine the acceptor and learner roles in a replica (e.g., [25, 14]). As a consequence, these systems are subject to a tradeoff: on the one hand, placing replicas near remote clients reduces the response time experienced by the clients; on the other hand, distributing replicas across geographic areas may slow down the ordering of commands (e.g., if no quorum exists involving nearby replicas [15]). GeoPaxos is not vulnerable to this tradeoff since the ordering of operations does not depend on the number and placement of replicas.
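The configuration sketch below illustrates one way such a decoupled deployment could look (compare with Figure 1): each partition's acceptors are placed in different availability zones of a single region, so ordering uses local quorums, while replicas (learners) are placed in every region, close to clients. The layout and names are illustrative assumptions, not GeoPaxos's configuration format.

```python
# Assumed deployment layout: ordering (acceptors) is regional, execution
# (replicas/learners) is global. Names are placeholders.
ordering = {
    # partition -> acceptors in different availability zones of ONE region
    "A": ["CA-az1", "CA-az2", "CA-az3"],
    "B": ["VA-az1", "VA-az2", "VA-az3"],
    "C": ["EU-az1", "EU-az2", "EU-az3"],
}

execution = {
    # region -> replicas; every replica holds the full state and one learner
    # per partition, so execution never requires an inter-region quorum
    "CA": ["replica-ca"],
    "VA": ["replica-va"],
    "EU": ["replica-eu"],
}

def quorum(partition):
    """Majority quorum of a partition's acceptors (f + 1 out of 2f + 1)."""
    return len(ordering[partition]) // 2 + 1

assert quorum("A") == 2   # three acceptors in the same region -> local quorum
```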

3.2 Partial versus total order

GeoPaxos induces a partial order on operations. It was observed early on that interference-free operations can be executed in any order by replicas in state machine replication without violating consistency [32]. Two operations are interference-free if they do not access a common object. Intuitively, if two operations interfere, then they must be executed sequentially, in the same order, by each replica. Interference-free operations may be executed in different orders by the replicas, and even concurrently at a replica.

GeoPaxos uses one or more independent instances of Multi-Paxos to order operations, where each instance is called a partition. To determine the partitions responsible for ordering an operation, we assign each object to a partition, the object's owner (we explain how to assign ownership based on geographic locality in §3.3 and how objects can change ownership dynamically in §4.2). Operations that access a single object or multiple objects with the same owner, the single-partition operations, are ordered by the owner partition. Operations that access objects owned by multiple partitions, the multi-partition operations, are ordered by all involved partitions. The challenge is to ensure that such operations are ordered consistently across partitions, that is, if operation op_i is ordered before operation op_j by partition A, then op_j is not ordered before op_i by partition B (we explain how GeoPaxos ensures a consistent order in §4.1). Every replica in GeoPaxos contains one learner per partition. Single-partition operations are learned and executed by the learner that learns the ordered operation. Multi-partition operations are learned by multiple learners but executed by only one learner at the replica.

3.3 Exploiting geographic locality

GeoPaxos differs from other partial-order protocols in that it uses multiple partitions to order operations, as opposed to a single partition [21, 25, 27, 28]. This distinction allows GeoPaxos to account for locality in geographically distributed applications and to assign object ownership in order to (i) minimize multi-partition operations and (ii) maximize single-partition operations that are ordered in the region where they are issued.

[Figure 1: A deployment of GeoPaxos with three regions (A, B and C) and three datacenters in each region. Each partition has one acceptor per datacenter of its region; every replica holds objects X, Y and Z, owned by partitions A, B and C, respectively. Region A has two replicas, at datacenters DC1 and DC2. Regions B and C have one replica only, at DC4 and DC7, respectively. Client C_x's preferred region is A.]

The problem of assigning object ownership to achieve properties (i) and (ii) is application-specific and orthogonal to GeoPaxos [11]. We illustrate a solution with an online social service. Online social services are notorious for exhibiting geographic locality: the probability of having a social connection between two users is inversely proportional to their geographic distance [1, 3, 2]. In a social service, the problem of assigning object ownership can be reduced to a graph partitioning problem. In this graph, clients (i.e., service users) are vertices and their interconnections (i.e., friendship relations) are edges. Operations (i.e., gettimeline and post) involve interconnected clients (see §6.4 for a detailed description of the operations). A partitioning of the graph results in strongly connected subgraphs, each one weakly connected to the other subgraphs. By assigning object ownership in a subgraph to a partition in the preferred region of the clients in the subgraph, we achieve properties (i) and (ii) above.
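As a toy illustration of locality-driven ownership assignment (a real deployment would rely on a graph partitioner such as METIS, as in §6.1), the sketch below assigns each user's object to the partition of the region where the user and most of its friends reside. Function and variable names are hypothetical.

```python
# Naive locality heuristic for ownership assignment; illustrative only.
from collections import Counter

def assign_ownership(users, friends, region_of, partition_of_region):
    """friends: user -> list of friends; region_of: user -> preferred region;
    partition_of_region: region -> partition owning that region's objects."""
    ownership = {}
    for u in users:
        # vote with the regions of the user and of its friends
        votes = Counter([region_of[u]] + [region_of[f] for f in friends[u]])
        best_region, _ = votes.most_common(1)[0]
        ownership[u] = partition_of_region[best_region]
    return ownership

friends = {"alice": ["bob"], "bob": ["alice"], "carol": []}
region_of = {"alice": "CA", "bob": "CA", "carol": "EU"}
ownership = assign_ownership(["alice", "bob", "carol"], friends, region_of,
                             {"CA": "A", "EU": "C"})
assert ownership == {"alice": "A", "bob": "A", "carol": "C"}
```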

Figure 1 illustrates a deployment of GeoPaxos in three regions. Client C_x's preferred region is A, where C_x's friends, represented by object x, are located. Since partition A owns x, if C_x issues an operation from region A (its preferred region), the operation will be a local single-partition operation, ordered by acceptors in partition A and executed by all replicas in all partitions. If C_x issues an operation from region B, the operation will be a remote single-partition operation. A client with friends in partitions B and C will issue multi-partition operations. Local single-partition operations are more efficient than remote single-partition operations, and both are more efficient than multi-partition operations.

4 Design

In this section, we detail how GeoPaxos orders and executes operations (§4.1), present some extensions and improvements to the basic protocol (§4.2), discuss practical aspects (§4.3), and argue about the correctness of GeoPaxos (§4.4).

4.1 The order protocol

Clients can submit operations to any one of the replicas. At a replica, an operation has three attributes:
- The "state" attribute identifies whether the operation is (a) waiting to be ordered, (b) ordered, or (c) done, after the replica has executed the operation.
- The "dst" attribute is set by the replica and contains all partitions that own objects the operation accesses (or a superset of these partitions).
- The "tp" attribute is a timestamp. Timestamps are tuples, with one entry per partition. For two operations op_i and op_j, op_i.tp < op_j.tp if for all x in op_i.dst ∩ op_j.dst, it holds that op_i.tp[x] < op_j.tp[x].

GeoPaxos ensures that if two operations interfere, then replicas execute the operations in the same order. To guarantee this property, GeoPaxos assigns timestamps to operations such that if op_i and op_j interfere, then either op_i.tp < op_j.tp or op_j.tp < op_i.tp, and executes operations in timestamp order. Replicas execute the following five steps to order operations (see also Figure 2).

I. Upon receiving an operation op from a client, the replica initializes the operation's attributes and requests that the operation be ordered in each one of the partitions involved in op. We call this the operation's first communication round. A replica requests operation op to be ordered in partition x (i.e., it atomically broadcasts op in partition x) by executing the primitive abcast[x](op).

II. When a replica learns an ordered operation op in partition x, an event that we identify with the primitive deliver[x](op), the replica assigns to op a tentative order in partition x. To compute the tentative order, replicas implement a logical clock vector, LC, with one entry per partition. The replica first increments partition x's entry in LC and assigns the result as x's tentative order for op. In partition x, op is now waiting for its timestamp.

III. After a tentative order has been assigned to op for each partition involved in op, there are two cases to consider. If op is a single-partition operation, then the order proposed by the partition involved in op becomes its timestamp and op transitions to the ordered state. If op involves multiple partitions, then its timestamp is computed from the tentative orders proposed by the partitions in op.dst. To ensure that replicas update their logical clock vectors consistently, op's timestamp is atomically broadcast to the partitions in op.dst. We call this the operation's second communication round.

IV. Upon delivering op's timestamp in partition x, the replica updates x's logical clock and marks op as ordered in partition x.

V. After op is in the ordered state in each one of the partitions involved in op and the replica has already executed all operations with a timestamp smaller than op's, the replica executes op, responds to the client, and marks op as executed, which allows operations ordered after op to be eventually executed.
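The sketch below illustrates the timestamp relation and the execution rule of step V, assuming timestamps are represented as per-partition dictionaries. It is a simplified model of the protocol under assumed names, not the authors' implementation.

```python
# Sketch of GeoPaxos timestamps and the execution-eligibility check (Sec. 4.1).
def tp_less(op_i, op_j):
    """op_i.tp < op_j.tp iff op_i.tp[x] < op_j.tp[x] for every partition x
    in the intersection of op_i.dst and op_j.dst."""
    common = op_i["dst"] & op_j["dst"]
    return bool(common) and all(op_i["tp"][x] < op_j["tp"][x] for x in common)

def can_execute(op, pending):
    """Step V: op is executable once it is ordered in all its partitions and
    no not-yet-done operation has a smaller timestamp."""
    return (all(s == "ordered" for s in op["state"].values()) and
            not any(o["done"] is False and tp_less(o, op) for o in pending))

# Interfering operations share a partition, so their timestamps are comparable
# and every replica executes them in the same (timestamp) order.
op1 = {"dst": {"A"},      "tp": {"A": 3},
       "state": {"A": "ordered"}, "done": False}
op2 = {"dst": {"A", "B"}, "tp": {"A": 5, "B": 2},
       "state": {"A": "ordered", "B": "ordered"}, "done": False}
assert tp_less(op1, op2) and not tp_less(op2, op1)
assert can_execute(op1, [op1, op2])      # nothing smaller than op1 is pending
assert not can_execute(op2, [op1, op2])  # op1 must execute first
```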
4.2 Extensions and optimizations

In this section, we describe three optimizations to improve the performance of the basic order protocol.

  I.  When operation op is received from a client:
        op.state <- (nil, ..., nil)
        op.dst   <- partitions(op)
        op.tp    <- (-, ..., -)
        for all x in op.dst: abcast[x](round-1, op)

  II. When deliver[x](round-1, op):
        increment LC[x]
        op.tp[x]    <- LC[x]
        op.state[x] <- waiting

  III. When there is op such that state(op) = waiting:
        if |op.dst| = 1:
            let x be the partition in op.dst
            op.state[x] <- ordered
        else:
            for all x in op.dst: op.tp[x] <- tmax(op)
            for all x in op.dst: abcast[x](round-2, op)

  IV. When deliver[x](round-2, op):
        LC[x]       <- max(LC[x], tmax(op))
        op.state[x] <- ordered

  V.  When there is op such that state(op) = ordered and there is no op' with
      state(op') != done and op'.tp < op.tp:
        execute op and respond to the client
        for all x in op.dst: op.state[x] <- done

  where partitions(op) denotes all partitions to be accessed by operation op,
  state(op) = val iff op.state[x] = val for all x in op.dst, and
  tmax(op) = max(op.tp[0], op.tp[1], ...).

Figure 2: The five steps of the GeoPaxos protocol.

4.2.1 Speeding up single-partition operations

In the basic protocol described in the previous section, single-partition operations need one consensus execution and multi-partition operations need two. Let op_s and op_m be two operations that access a common object in partition x (i.e., op_s and op_m interfere), where op_s is single-partition and op_m is multi-partition. Thus, op_s and op_m must be executed in the same order by all replicas. Consider an execution in which op_s's communication round happens between op_m's first and second communication rounds at partition x. From the protocol, op_s is assigned an execution order at partition x greater than the one proposed for op_m, since op_m is handled before op_s. Hence, even though op_s requires only one communication round to be ordered, it can only be executed after op_m completes its second communication round. We call the phenomenon by which the execution of an ordered operation is delayed by the ordering of another operation the "convoy effect". Since there is a significant difference between the response times of single-partition and multi-partition operations, even a small percentage of multi-partition operations in the workload can add substantial delays to the execution of single-partition operations.

Figure 3 depicts the response time CDF of multi-partition and single-partition operations in executions with 0%, 1% and 10% of multi-partition operations. (The details about the setup and the application can be found in §6.) In the workload without multi-partition operations, the 92nd latency percentile of single-partition operations is below 3.5 msec. With 1% of multi-partition operations in the workload, the 92nd latency percentile of single-partition operations reaches almost 80 msec; when the workload contains 10% of multi-partition operations, the 68th latency percentile of single-partition operations reaches almost 80 msec.

[Figure 3: The impact of the convoy effect on the latency of single-partition operations. Latency CDFs of single-partition (local) operations with 0%, 1% and 10% of multi-partition (global) operations in the workload, and of the multi-partition operations themselves.]

To cope with the convoy effect, we let single-partition operations be executed as soon as they are ordered (at the end of step III). Consequently, the second communication round of a multi-partition operation does not delay the execution of a single-partition operation. Note that with this modification, the execution of a single-partition operation op_s and a multi-partition operation op_m may not happen in timestamp order. Intuitively, this does not violate correctness because op_s and op_m are handled in the same total order within a partition, and so all replicas agree that op_s should be executed before op_m is ordered.

4.2.2 Parallel execution of operations

We improve the performance of a replica by multithreading (and parallelizing) the execution of operations that do not interfere [33, 17, 24]. In order to introduce concurrent execution of non-interfering operations, each replica spawns as many threads as the number of partitions, so that operations that access different partitions can execute concurrently. Single-partition operations are executed by the thread in charge of the involved partition. Multi-partition operations require a barrier among the threads involved in the operation. This ensures that only one thread executes the operation and avoids race conditions (see the sketch following §4.3).

4.2.3 Dynamically changing object ownership

Object ownership must be assigned to partitions so as to minimize multi-partition operations and to maximize same-region single-partition operations. Since workloads and locality can vary over time, and it may be unfeasible to predict beforehand in which regions requests for specific objects will be issued (e.g., a user who travels to a different country and wants to access her data), GeoPaxos allows object ownership to change dynamically. In GeoPaxos, object ownership can be reassigned from one partition to another using the move(object_id_list, source, destination) operation, addressed to the source and the destination partitions, where object_id_list contains the objects whose ownership should be re-assigned. Note that a change of ownership does not involve any transfer of actual objects, since every replica contains a full copy of the application state.

4.3 Practical considerations

Partitions are available as long as there is a majority of operational acceptors in the partition. Clients connected to a replica that fails can reconnect to any operational replica, possibly in the client's preferred region. We experimentally evaluate the effect on performance when clients reconnect to a remote replica in §6.7. To recover from failures, the in-memory state of acceptors must be saved on stable storage (i.e., disk). In GeoPaxos, acceptors can persist their state in either asynchronous or synchronous mode. These modes represent a performance versus reliability tradeoff: the asynchronous mode is more efficient but can cause information loss if an acceptor crashes before flushing its state to disk. We evaluate both persistence modes in §6.
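Returning to the parallel execution scheme of §4.2.2, the sketch below shows one way the per-partition executor threads could synchronize on a barrier so that a multi-partition operation is executed exactly once. It is an illustrative model (queues, delivery and error handling are omitted), not the prototype's C code.

```python
# Sketch of per-partition executor threads coordinating on a barrier for
# multi-partition operations; illustrative only.
import threading

class Operation:
    """A multi-partition operation carries a barrier sized to the number of
    involved partitions; single-partition operations need no coordination."""
    def __init__(self, dst, run):
        self.dst = sorted(dst)
        self.run = run
        self.barrier = threading.Barrier(len(self.dst)) if len(self.dst) > 1 else None

def execute(op, my_partition):
    """Called by the executor thread in charge of `my_partition`."""
    if op.barrier is None:
        op.run()                      # single-partition: run directly
        return
    op.barrier.wait()                 # all involved threads rendezvous
    if my_partition == op.dst[0]:     # exactly one of them executes the op
        op.run()
    op.barrier.wait()                 # the others resume only after execution

results = []
op = Operation({"A", "B"}, lambda: results.append("done"))
threads = [threading.Thread(target=execute, args=(op, p)) for p in ("A", "B")]
for t in threads: t.start()
for t in threads: t.join()
assert results == ["done"]            # executed exactly once, race-free
```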

As an optimization, multi-partition operations (with their associated parameters) do not need to be sent to all partitions involved in the operation. It is sufficient that one partition receives the full operation while the other partitions receive only the unique id of the operation, so that the operation can be ordered in all involved partitions.

4.4 Correctness

We first argue that if operations op_i and op_j interfere, then replicas execute them in the same order. Since op_i and op_j interfere, they access a common object, and thus are ordered. The claim follows from two facts: (a) for ordered operations op_i and op_j, either op_i.tp < op_j.tp or op_j.tp < op_i.tp; and (b) replicas execute ordered operations in timestamp order. Without loss of generality, assume op_i.tp < op_j.tp. Thus, for all x in op_i.dst ∩ op_j.dst, op_i.tp[x] < op_j.tp[x]. Fact (a) holds since logical clock values are unique and the timestamp of an ordered operation op is the maximum among the logical clock values proposed by each one of the destinations in op.dst. Fact (b) holds because when an operation op is executed by a replica, there is no operation op' at the replica with a smaller timestamp. Moreover, no future operation can have a smaller timestamp than op's, since timestamps are monotonically increasing.

We now show that GeoPaxos is linearizable. From the definition of linearizability (see §2.2), we must show that there is a permutation π of the operations in any execution of GeoPaxos that respects (i) the real-time ordering of operations as seen by the clients, and (ii) the semantics of the operations. Let op_i and op_j be two operations submitted by clients C_i and C_j, respectively. There are two cases to consider. Case (a): op_i and op_j are interference-free. Thus, op_i and op_j access disjoint sets of objects. Consequently, the execution of one operation does not affect the execution of the other and they can be placed in any relative order in π. We arrange op_i and op_j in π so that their relative order respects their real-time dependencies, if any. Case (b): op_i and op_j interfere. It follows from GeoPaxos's order property above that replicas execute the operations in the same order. Since the two operations execute in sequence, the execution of the operations satisfies their semantics. We now show that the execution order satisfies any real-time constraints among op_i and op_j. Without loss of generality, assume op_i finishes before op_j starts (i.e., op_i precedes op_j in real time). Thus, before op_j is submitted by C_j, op_i has completed (i.e., C_i has received op_i's response). Since op_j is ordered and then executed, we conclude that op_i is ordered before op_j. From the claims above, we can arrange op_i and op_j in π according to their delivery order so that the execution of each operation satisfies its semantics.

5 Implementation

GeoPaxos was implemented in C. Our prototype allows the disk access mode, synchronous or asynchronous, to be configured; by default it is set to asynchronous. We use Libpaxos as the Paxos library. In GeoPaxos, proposers and acceptors are single-threaded processes. To ensure liveness, the system starts with a default distinguished proposer, which exchanges heartbeats with the other proposers to allow progress in the event of a failure. Replicas are multithreaded processes. The learner for each partition is executed as an independent thread and only synchronizes with other learners when an operation involves multiple partitions. An additional thread handles the requests from the clients and, depending on the operation parameters, sets the destination partitions accordingly. Clients are multithreaded, with each thread usually connected to the closest replica. Operations are submitted in a closed loop, i.e., an operation is only sent after the response for the previous operation is received.
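A closed-loop client can be sketched as follows; submit stands for an assumed request/response call to the client's closest replica, and the code is an illustration rather than the prototype's client.

```python
# Sketch of a closed-loop client thread: the next operation is only issued
# after the response for the previous one is received.
import time

def closed_loop_client(submit, operations, latencies):
    """submit(op) -> response; blocks until the replica answers."""
    for op in operations:
        start = time.monotonic()
        submit(op)                       # next op only after this returns
        latencies.append(time.monotonic() - start)

latencies = []
closed_loop_client(lambda op: op, ["post", "gettimeline"], latencies)
assert len(latencies) == 2
```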

6 Evaluation

In this section, we detail our experimental environment and benchmarks (§6.1), compare the performance of GeoPaxos to other protocols under various conditions (§6.2 to §6.6), and assess the behavior of GeoPaxos in the presence of failures (§6.7).

6.1 Environment and benchmarks

The evaluation was conducted in two environments, a local-area network (LAN) and a public wide-area network (WAN). While the LAN allows us to compare the protocols in a controlled environment, the WAN provides a realistic environment, aligned with the conditions GeoPaxos was developed for. The LAN consists of a cluster of nodes, each one with an 8-core Intel Xeon L5420 processor (2.5 GHz), 8 GB of memory, SATA SSD disks, and a 1 Gbps ethernet card. Each node runs CentOS (64 bits). The RTT (round-trip time) between nodes in the cluster is 0.1 msec.

We use Amazon EC2 in the WAN configuration, with each partition deployed in a different region. All the nodes are m3.large instances, with 2 vCPUs and 7.5 GB of memory. For the experiments with 3 partitions, we use 2 datacenters in California (CA), 3 datacenters in North Virginia (VA) and 3 datacenters in Ireland (EU). The regions of Oregon (OR), with 3 datacenters, and Tokyo (JP), with 2 datacenters, are included to complete the 5 partitions. Table 1 summarizes the RTT between these regions. The RTT within a datacenter is smaller than 1 msec, and between datacenters in the same region it is below 2.5 msec.

[Table 1: Average RTT between the regions CA, VA, EU, OR and JP, in milliseconds.]

[Figure 4: Performance in LAN (whiskers: 95% confidence interval for throughput, 99th percentile for latency). (a) Peak throughput and (b) latency of Multi-Paxos, GeoPaxos, EPaxos and M2Paxos with 3 and 5 partitions.]

In the LAN configuration, we use a key-value store service replicated with each of the evaluated protocols. In our workload, all the client requests are 64-byte updates. In the WAN configuration, we use a social network service. Social networks are notorious for exhibiting locality properties of the sort that GeoPaxos can take advantage of [1, 3, 2]. Our social network has 10,000 users. The friendship relations follow a Zipf distribution with skew 1.5. There are two operations: gettimeline and post. The gettimeline operation returns the last messages posted on a specified user's timeline. The post operation appends a message to the timeline of all the followers of the specified user. While gettimeline is always a single-partition operation, post depends on the partitions that own the followers of a user. To assign users to partitions, the social network was partitioned into 3 and 5 partitions using METIS [16], with the following results:
- Three partitions, with 3404, 3170 and 3426 users: 80% of users have followers in the same partition, 18% of users have followers in two partitions, and 2% of users have followers in all partitions.
- Five partitions, with 1998, 1942, 2057, 1943 and 2060 users: 74% of users have followers in the same partition, 22% in two partitions, 2.6% in three partitions, 1% in four partitions, and 0.4% in five partitions.
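The sketch below shows one way such a benchmark graph could be generated, assuming a Zipf-like distribution (skew 1.5) over the number of followers per user. The sizes, caps and helper names are illustrative and not the exact generator used in the paper.

```python
# Illustrative generator for a Zipf-skewed follower graph; not the paper's code.
import random

def zipf_follower_counts(num_users, skew=1.5, max_followers=50, rng=random):
    # probability of follower count k is proportional to 1 / k**skew
    counts = list(range(1, max_followers + 1))
    weights = [1.0 / (k ** skew) for k in counts]
    return [rng.choices(counts, weights=weights)[0] for _ in range(num_users)]

def build_social_graph(num_users, rng=None):
    rng = rng or random.Random(0)
    counts = zipf_follower_counts(num_users, rng=rng)
    users = list(range(num_users))
    return {u: rng.sample([v for v in users if v != u],
                          k=min(counts[u], num_users - 1))
            for u in users}

followers = build_social_graph(100)
assert all(u not in fs for u, fs in followers.items())
# gettimeline is always single-partition; post touches the partitions that own
# a user's followers, so graph locality determines how many ops are multi-partition.
```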

The experiments were executed with 1 clients in each region and a mix of gettimeline and post operations in a proportion of 4:1.

6.2 Performance in the LAN

We compare GeoPaxos to Multi-Paxos (implemented with Libpaxos), EPaxos and M2Paxos in configurations with 3 and 5 partitions. (For EPaxos we used the authors' original code, available at https://github.com/efficient/epaxos; for M2Paxos we used the authors' original code, available at https://bitbucket.org/talex/hyflow-go; both were compiled with Go.) GeoPaxos uses three acceptors per partition, with the partition's replica co-located with an acceptor (see Figure 1). Multi-Paxos, EPaxos and M2Paxos are deployed with one replica per partition. For Multi-Paxos, one of the replicas is the coordinator. Clients run in a closed loop and we increase the number of clients until the system is saturated and no increase in throughput is possible. For GeoPaxos, EPaxos and M2Paxos, where the clients are equally distributed among replicas, 1 simultaneous clients per partition are enough to saturate the system. Multi-Paxos saturates sooner, with around 8 clients per partition, equally distributed among proposers, which forward the operations to the coordinator. Furthermore, we set batching for Multi-Paxos, EPaxos and GeoPaxos to 5 (i.e., at most 5 operations can be ordered with a single Paxos execution). M2Paxos does not provide batching.

As depicted in Figure 4(a), GeoPaxos and EPaxos have similar behavior. This is explained by the absence of a leader in EPaxos and the independent ordering in each partition in GeoPaxos. M2Paxos combines clients and replicas in the same process, imposing high CPU usage. As the number of partitions increases and the proportion of single-partition operations remains high, we expect GeoPaxos's throughput to increase linearly, just like most partially replicated approaches. Multi-Paxos saturates when the coordinator reaches maximum CPU usage. Figure 4(b) shows the latency at peak load for all protocols. GeoPaxos and EPaxos have similar results, substantially lower than Multi-Paxos, which suffers the effects of the overloaded coordinator. M2Paxos has the lowest median latency but a much larger latency tail.

6.3 Latency in the WAN

In the WAN experiments, GeoPaxos contains 3 acceptors and 1 replica per partition, with the acceptors distributed in different availability zones and the replica co-located with one acceptor in a node. Multi-Paxos, M2Paxos and EPaxos use one replica per partition. Clients are in the availability zone of their local replica. Figure 5 shows the median latency and 99th percentile of the protocols in scenarios with a single client. In the "remote client" configurations, the client is in Ireland (EU) and connects to the replica in CA; in all other executions the client is in California (CA). By deploying a single client, we aim to assess the protocols in the absence of queueing effects.

[Figure 5: Latency in WAN (whiskers: 99th percentiles). Latency of Multi-Paxos, GeoPaxos, EPaxos and M2Paxos with local and remote clients and, for GeoPaxos, with single- and multi-partition operations, in the 3- and 5-partition setups.]

Multi-Paxos and GeoPaxos have their latency strictly related to the proximity of the other replicas and the location of the clients. GeoPaxos's latency also depends on the number of partitions that an operation is addressed to. For single-partition operations, the latency of GeoPaxos is around 2 msec, while the best case for EPaxos and Multi-Paxos is around 80 to 90 msec for both 3 and 5 partitions. M2Paxos takes around 63 msec to order a message. Even with a single client, M2Paxos has high latency for remote clients, almost twice GeoPaxos's latency in the 5-partition scenario. Operations that involve two partitions in GeoPaxos have latency between 80 msec and 90 msec. GeoPaxos has higher latency than the other techniques only when an operation involves all partitions, something that is expected to happen very rarely; that latency depends directly on the latency of the farthest replica.

[Figure 6: The impact of the convoy effect on latency (whiskers: 75th percentiles). Latency of Multi-Paxos, GeoPaxos NC, GeoPaxos and EPaxos per region (CA, VA, EU with 3 partitions; CA, VA, EU, OR, JP with 5 partitions); single-, three- and five-partition operations are shown separately.]

6.4 Convoy effect

We now compare GeoPaxos without the optimizations to cope with the convoy effect ("GeoPaxos") and with the optimizations described in §4.2.1 to mitigate the convoy effect ("GeoPaxos NC"). We also include EPaxos and Multi-Paxos in our evaluation. M2Paxos is not evaluated in this setup since the available implementation cannot handle multi-partition operations.

[Figure 7: The impact of the convoy effect on latency. Latency CDFs in the WAN for Multi-Paxos, GeoPaxos NC, GeoPaxos and EPaxos, with (a) 3 partitions and (b) 5 partitions.]

Figure 6 shows the results for 3 and 5 regions. Single-partition operations in GeoPaxos suffer from the convoy effect in all regions. This can be seen in the large difference between the 75th percentile and the median latency values. The OR region suffers less from the convoy effect because the partitioning computed by METIS resulted in almost no multi-partition operations in this region. Our proposed strategy to counter the convoy effect proved effective, as it brought the 75th latency percentile of GeoPaxos NC close to the median latency. Multi-Paxos and EPaxos do not suffer from the convoy effect since all their messages are multi-partition.

To better illustrate the benefits of GeoPaxos's strategy to handle the convoy effect, the cumulative distribution functions (CDF) for both setups are presented in Figure 7. With 3 partitions, GeoPaxos NC brings the percentage of low-latency single-partition operations from around 65% to more than 90%. With 5 partitions, half the single-partition operations experience the convoy effect originally, and less than 15% are penalized with GeoPaxos NC.

Part of the single-partition operations that display high latency with GeoPaxos NC are due to queuing effects and CPU scheduling (2 vCPUs in configurations with as many threads as the number of partitions), and a minor fraction are due to users whose followers are exclusively external to their own partition (single-partition operations ordered by a remote partition, resulting in a remote client).

Figure 8 shows the impact on throughput. GeoPaxos is 2x faster than EPaxos with 3 partitions (from 374 to 745 operations/sec) and GeoPaxos NC is 6x faster (2245 operations/sec). With 5 partitions, the speedup over EPaxos is 3.8x and 7.6x for GeoPaxos and GeoPaxos NC, respectively (366, 1357 and 2767 operations/sec). Multi-Paxos experienced the lowest throughput due to the ordering of all operations by the single coordinator.

[Figure 8: The impact of the convoy effect on throughput. Peak throughput of Multi-Paxos, GeoPaxos, GeoPaxos NC and EPaxos with 3 and 5 partitions.]

6.5 Dynamic ownership changes

In this experiment, conducted using our social network service running in the WAN, we re-assign object ownership dynamically. The execution starts with a random assignment of object ownership to partitions. The first time an object is accessed in an operation (i.e., when a user executes his first post or gettimeline), the object is moved to the user's preferred region, so that further accesses will be local to the user. Figure 9 shows the throughput in the first 140 seconds of the execution and the latency CDF for three time intervals. Until the system has rearranged object ownership, so that objects are owned by the partitions at the clients' preferred regions, performance is low. Once object ownership has been assigned, something that happens after 75 seconds into the execution, most operations are local single-partition operations.

[Figure 9: Impact of dynamic ownership changes on throughput and latency: (a) throughput over time and (b) latency CDF for three time intervals t1, t2 and t3.]

Although conceptually M2Paxos supports the notion of object ownership, we could not compare it experimentally to GeoPaxos since the available implementation does not include object migration.
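The first-access migration policy can be sketched as follows, with move standing in for GeoPaxos's move(object_id_list, source, destination) operation from §4.2.3. The helper signature and policy details are illustrative.

```python
# Sketch of the first-access ownership migration policy used in Section 6.5:
# whenever an accessed object is not owned by the partition of the client's
# preferred region, its ownership is re-assigned there (no data is copied,
# since every replica holds the full state).
def on_access(obj, client_region, ownership, partition_of_region, move):
    current = ownership[obj]
    target = partition_of_region[client_region]
    if current != target:
        move([obj], current, target)      # ownership change only
        ownership[obj] = target

ownership = {"x": "B"}                    # random initial assignment
moves = []
on_access("x", "CA", ownership, {"CA": "A"},
          lambda objs, src, dst: moves.append((tuple(objs), src, dst)))
assert ownership == {"x": "A"} and moves == [(("x",), "B", "A")]
```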

6.6 Synchronous disk writes

In the experiments so far, we have configured acceptors to write asynchronously to disk, which provides better performance than synchronous writes. Although asynchronous writes may result in information loss in the case of failures, controllers typically avoid the problem with battery-backed write caches. We now assess the performance of GeoPaxos with synchronous disk writes in the LAN setup presented in §6.1.

GeoPaxos can benefit from servers with multiple disk drives to improve performance. In order to do that, we create partitions within the same LAN. Each partition uses its own set of acceptors (3 acceptors per partition) and each acceptor uses a different disk in the server. We configure the system with 1 clients per partition, running in closed loop. Figure 10 shows the results as we increase the number of partitions (and clients). Since the disk is the bottleneck in the execution, we see a linear increase in throughput as we add partitions.

[Figure 10: Performance with message logging. Throughput as a function of the number of partitions.]

6.7 Performance under failures

The last set of experiments assesses GeoPaxos in the presence of replica failures. Initially, we configured the system with clients running in closed loop and equally distributed across partitions in order to keep the overall throughput between 1,000 and 1,200 operations per second, without saturating the replicas (see Figure 11).

[Figure 11: Latency and throughput in WAN (3 regions, clients in all regions). Top: latency of clients in CA, VA and EU; the CA replica crash is followed by recovery with a second replica in CA, and the VA replica crash by recovery with the replica in EU. Bottom: throughput over time, with the CA and VA replica failures marked.]
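The client-side failover exercised in this experiment (clients reconnecting to a backup replica, as described next) can be sketched as follows, assuming each client holds an ordered list of replicas, local first and then a backup; connect and the replica names are hypothetical.

```python
# Sketch of client failover: try replicas in preference order and reconnect
# to the next one when the current replica is unreachable. Illustrative only.
def connect_with_failover(replicas, connect):
    last_error = None
    for replica in replicas:              # e.g., ["CA-dc1", "CA-dc2", "EU-dc1"]
        try:
            return connect(replica)
        except ConnectionError as err:
            last_error = err              # replica down: try the next one
    raise last_error

def fake_connect(replica):
    if replica == "CA-dc1":
        raise ConnectionError("replica crashed")
    return replica

assert connect_with_failover(["CA-dc1", "CA-dc2"], fake_connect) == "CA-dc2"
```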

Clients from 3 EC2 regions (CA, VA and EU) connect to their local replicas. Clients in CA have a backup replica in another datacenter in the same region, while clients in VA are configured to connect to the EU replica in case of failures. After 3 seconds into the execution, a replica in CA is killed. Clients in this region immediately connect to the second replica of the region, which has a slightly higher latency (from 0.4 to 1.2 msec), resulting in a small throughput decrease. A second replica is killed 3 seconds later, now in VA. Clients reconnect to the EU replica, but are subject to increased latency (around 14 msec). The replica in EU keeps a constant throughput despite the two failures in two distinct regions. This is possible due to the partial ordering implemented by GeoPaxos and the separation of Paxos roles, which dissociates the acceptors and proposers from the replicas.

In a second configuration, we include the regions of OR and JP and start one replica per region. Clients from CA execute both single- and multi-partition operations. After almost 1 minute, the replica in CA is killed. The 1 clients in CA immediately connect to the VA replica (see Figure 12). The latency of multi-partition operations rises by approximately 1 RTT, from 85 msec to around 160 msec, reflecting the distance from the clients to the remote replica. Single-partition operations are more affected by the local replica crash: their latency of 2.3 msec jumps to the same 160 msec. In this case, the additional 2 RTTs are caused by 4 inter-region messages: (a) from the client to the remote replica; (b) from the remote replica back to the client's local region, where the operation is ordered; (c) from the client's region back to the remote replica, after the operation is ordered; and (d) from the remote replica back to the client. This latency could be reduced if the remote replica, upon noticing the increasing number of requests from the affected region, triggered a move operation for the objects being ordered by that region, saving 1 RTT in further requests from the same clients.

[Figure 12: Latency and throughput in WAN (5 regions, 1 clients in CA). Latency of multi-partition and single-partition operations and throughput over time, with the local replica failure marked.]

7 Related work

If on the one hand state machine replication is widely used to increase service availability (e.g., [3, 6, 13]), on the other hand it is also notably criticized for its performance. From single-leader algorithms, like Paxos [2], to leaderless algorithms (e.g., [25, 38]) and variations that take the semantics of operations into account (e.g., [27, 21]), all efforts have been directed at finding faster ways to order operations. None of these solutions, however, can avoid the large latency imposed by geographically distributed applications, since at least a simple majority quorum of replicas is needed to order operations [22]. Furthermore, existing solutions experience reduced performance as the number of replicas increases.

GeoPaxos improves the performance of state machine replication by exploiting the fact that operations do not need a total order but can instead be partially ordered by multiple totally ordered instances of Multi-Paxos, and by judiciously placing the various Multi-Paxos instances close to their clients. Partially ordering operations with the goal of improving performance has been previously implemented by EPaxos and M2Paxos. EPaxos [25] improved on traditional Paxos [2] by reducing the load on the coordinator and allowing any replica to order operations. Depending on the interference between operations, however, EPaxos can impose two additional communication steps in the ordering of operations, a high price to be paid by geographically distributed applications. Moreover, EPaxos does not take locality into account. M2Paxos [28] is an implementation of Generalized Consensus [21]. M2Paxos does not establish operation dependencies based on conflicts but, similarly to GeoPaxos, maps nodes (partitions in GeoPaxos) to the accessed objects. M2Paxos guarantees that operations that access the same objects are ordered by the same node. It needs at least two communication steps for local operations and one additional step for remote operations. While GeoPaxos has a worst case of two inter-partition message exchanges, M2Paxos's mechanism to deal with multi-partition operations requires changing the ownership of all involved objects and does not provide any guarantee on the maximum number of communication delays. In geographically distributed scenarios, applications cannot afford such uncertainty.

Several solutions that partition (i.e., shard) the data have appeared in the literature. Systems in this category are sometimes referred to as partially replicated systems, as opposed to designs in which each replica has a full copy of the service state, as in GeoPaxos. Spanner [14] is a partitioned distributed database for WANs. It uses a combination of two-phase commit and the TrueTime API to achieve consistent multi-partition transactions. TrueTime uses hardware clocks to derive bounds on clock uncertainty, and is used to assign globally valid timestamps and to provide consistent reads across partitions. It requires elaborate synchronization mechanisms to keep the clock skew among nodes within an acceptable limit. Spinnaker [29] is similar to the approach presented here. It also uses several instances of Multi-Paxos to achieve scalability. However, Spinnaker does not support operations across multiple Multi-Paxos instances. Differently from existing sharded systems, where replicas contain only part of the service state, in GeoPaxos each replica contains the entire state. In doing so, GeoPaxos can improve performance without sacrificing the simplicity of the state machine replication approach. Moreover, there is no need to re-shard data across nodes or to migrate data across nodes for load balancing and in response to failures.

Some systems seek to boost performance by exploiting transactional semantics. The most closely related to GeoPaxos are MDCC, Granola, Geo-DUR and P-Store. MDCC [18] is a replicated transactional data store that also uses several instances of Paxos. MDCC optimizes for commutative transactions and uses Generalized Paxos to relax the order of commuting transactions. Granola [7] is a distributed transaction coordination system that relies on real-time timestamps to provide strong consistency among transactions. It depends on synchronized clocks and needs three communication delays to order multi-repository transactions, in the absence of aborts. P-Store [31] and Geo-DUR [34] are optimized for wide-area networks and have also looked into techniques to reduce the convoy effect. Differently from these systems, GeoPaxos does not require transaction support. For example, GeoPaxos does not need to handle rollbacks in case partitions do not agree on the order of operations [34].

Some solutions have faced the "high latency" of state machine replication by weakening consistency guarantees. One example is eventual consistency [9], which allows replicas to diverge in case of network partitions, with the advantage that the system is always available. However, clients are exposed to conflicts, and reconciliation must be handled at the application level. Walter [35] offers Parallel Snapshot Isolation (PSI) for databases replicated across multiple datacenters. PSI guarantees snapshot isolation and a total order of updates within a site, but only causal ordering across datacenters. COPS [23] ensures a stronger version of causal consistency: in addition to ordering causally related write operations, it also orders writes to the same data items.

GeoPaxos can largely benefit from graph partitioning techniques, although this is not the focus of this paper. Partitioning techniques could be applied to reassign objects and keep the frequency of single-partition operations as high as possible. Several partitioning techniques have been devised that could be used in GeoPaxos (e.g., [8, 26, 37]).


More information

Paxos Replicated State Machines as the Basis of a High- Performance Data Store

Paxos Replicated State Machines as the Basis of a High- Performance Data Store Paxos Replicated State Machines as the Basis of a High- Performance Data Store William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li March 30, 2011 Q: How to build a

More information

High performance recovery for parallel state machine replication

High performance recovery for parallel state machine replication High performance recovery for parallel state machine replication Odorico M. Mendizabal and Fernando Luís Dotti and Fernando Pedone Universidade Federal do Rio Grande (FURG), Rio Grande, Brazil Pontifícia

More information

ZooKeeper. Table of contents

ZooKeeper. Table of contents by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals... 2 1.2 Data model and the hierarchical namespace... 3 1.3 Nodes and ephemeral nodes...

More information

Spanner : Google's Globally-Distributed Database. James Sedgwick and Kayhan Dursun

Spanner : Google's Globally-Distributed Database. James Sedgwick and Kayhan Dursun Spanner : Google's Globally-Distributed Database James Sedgwick and Kayhan Dursun Spanner - A multi-version, globally-distributed, synchronously-replicated database - First system to - Distribute data

More information

Strong Consistency at Scale

Strong Consistency at Scale Strong Consistency at Scale Carlos Eduardo Bezerra University of Lugano (USI) Switzerland Le Long Hoang University of Lugano (USI) Switzerland Fernando Pedone University of Lugano (USI) Switzerland Abstract

More information

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication Important Lessons Lamport & vector clocks both give a logical timestamps Total ordering vs. causal ordering Other issues in coordinating node activities Exclusive access to resources/data Choosing a single

More information

Data Consistency and Blockchain. Bei Chun Zhou (BlockChainZ)

Data Consistency and Blockchain. Bei Chun Zhou (BlockChainZ) Data Consistency and Blockchain Bei Chun Zhou (BlockChainZ) beichunz@cn.ibm.com 1 Data Consistency Point-in-time consistency Transaction consistency Application consistency 2 Strong Consistency ACID Atomicity.

More information

SpecPaxos. James Connolly && Harrison Davis

SpecPaxos. James Connolly && Harrison Davis SpecPaxos James Connolly && Harrison Davis Overview Background Fast Paxos Traditional Paxos Implementations Data Centers Mostly-Ordered-Multicast Network layer Speculative Paxos Protocol Application layer

More information

WICE - A Pragmatic Protocol for Database Replication in Interconnected Clusters

WICE - A Pragmatic Protocol for Database Replication in Interconnected Clusters WICE - A Pragmatic Protocol for Database Replication in Interconnected Clusters Jon Grov 1 Luís Soares 2 Alfrânio Correia Jr. 2 José Pereira 2 Rui Oliveira 2 Fernando Pedone 3 1 University of Oslo, Norway

More information

Replication and Consistency. Fall 2010 Jussi Kangasharju

Replication and Consistency. Fall 2010 Jussi Kangasharju Replication and Consistency Fall 2010 Jussi Kangasharju Chapter Outline Replication Consistency models Distribution protocols Consistency protocols 2 Data Replication user B user C user A object object

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Copyright 2003 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4. Other Approaches

More information

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 Keller Hall 200 Union

More information

Enhancing Throughput of

Enhancing Throughput of Enhancing Throughput of NCA 2017 Zhongmiao Li, Peter Van Roy and Paolo Romano Enhancing Throughput of Partially Replicated State Machines via NCA 2017 Zhongmiao Li, Peter Van Roy and Paolo Romano Enhancing

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

Integrity in Distributed Databases

Integrity in Distributed Databases Integrity in Distributed Databases Andreas Farella Free University of Bozen-Bolzano Table of Contents 1 Introduction................................................... 3 2 Different aspects of integrity.....................................

More information

Making Fast Consensus Generally Faster

Making Fast Consensus Generally Faster Making Fast Consensus Generally Faster [Technical Report] Sebastiano Peluso Virginia Tech peluso@vt.edu Alexandru Turcu Virginia Tech talex@vt.edu Roberto Palmieri Virginia Tech robertop@vt.edu Giuliano

More information

SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines

SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines Hanyu Zhao *, Quanlu Zhang, Zhi Yang *, Ming Wu, Yafei Dai * * Peking University Microsoft Research Replication for Fault Tolerance

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

Spanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013

Spanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013 Spanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013 *OSDI '12, James C. Corbett et al. (26 authors), Jay Lepreau Best Paper Award Outline What is Spanner? Features & Example Structure

More information

殷亚凤. Consistency and Replication. Distributed Systems [7]

殷亚凤. Consistency and Replication. Distributed Systems [7] Consistency and Replication Distributed Systems [7] 殷亚凤 Email: yafeng@nju.edu.cn Homepage: http://cs.nju.edu.cn/yafeng/ Room 301, Building of Computer Science and Technology Review Clock synchronization

More information

TAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton

TAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton TAPIR By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton Outline Problem Space Inconsistent Replication TAPIR Evaluation Conclusion Problem

More information

DISTRIBUTED COMPUTER SYSTEMS

DISTRIBUTED COMPUTER SYSTEMS DISTRIBUTED COMPUTER SYSTEMS CONSISTENCY AND REPLICATION CONSISTENCY MODELS Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Consistency Models Background Replication Motivation

More information

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech Spanner: Google's Globally-Distributed Database Presented by Maciej Swiech What is Spanner? "...Google's scalable, multi-version, globallydistributed, and synchronously replicated database." What is Spanner?

More information

EECS 498 Introduction to Distributed Systems

EECS 498 Introduction to Distributed Systems EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha Dynamo Recap Consistent hashing 1-hop DHT enabled by gossip Execution of reads and writes Coordinated by first available successor

More information

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju Chapter 4: Distributed Systems: Replication and Consistency Fall 2013 Jussi Kangasharju Chapter Outline n Replication n Consistency models n Distribution protocols n Consistency protocols 2 Data Replication

More information

Introduction to Distributed Systems Seif Haridi

Introduction to Distributed Systems Seif Haridi Introduction to Distributed Systems Seif Haridi haridi@kth.se What is a distributed system? A set of nodes, connected by a network, which appear to its users as a single coherent system p1 p2. pn send

More information

Scalable Deferred Update Replication

Scalable Deferred Update Replication Scalable Deferred Update Replication Daniele Sciascia University of Lugano (USI) Switzerland Fernando Pedone University of Lugano (USI) Switzerland Flavio Junqueira Yahoo! Research Spain Abstract Deferred

More information

Documentation Accessibility. Access to Oracle Support

Documentation Accessibility. Access to Oracle Support Oracle NoSQL Database Availability and Failover Release 18.3 E88250-04 October 2018 Documentation Accessibility For information about Oracle's commitment to accessibility, visit the Oracle Accessibility

More information

Designing Distributed Systems using Approximate Synchrony in Data Center Networks

Designing Distributed Systems using Approximate Synchrony in Data Center Networks Designing Distributed Systems using Approximate Synchrony in Data Center Networks Dan R. K. Ports Jialin Li Naveen Kr. Sharma Vincent Liu Arvind Krishnamurthy University of Washington CSE Today s most

More information

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer Group Replication: A Journey to the Group Communication Core Alfranio Correia (alfranio.correia@oracle.com) Principal Software Engineer 4th of February Copyright 7, Oracle and/or its affiliates. All rights

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

Scalable State-Machine Replication

Scalable State-Machine Replication Scalable State-Machine Replication Carlos Eduardo Bezerra, Fernando Pedone, Robbert van Renesse University of Lugano, Switzerland Cornell University, USA Universidade Federal do Rio Grande do Sul, Brazil

More information

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer - proposes a formal definition for the timed asynchronous distributed system model - presents measurements of process

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

Distributed DNS Name server Backed by Raft

Distributed DNS Name server Backed by Raft Distributed DNS Name server Backed by Raft Emre Orbay Gabbi Fisher December 13, 2017 Abstract We implement a asynchronous distributed DNS server backed by Raft. Our DNS server system supports large cluster

More information

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics System Models Nicola Dragoni Embedded Systems Engineering DTU Informatics 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models Architectural vs Fundamental Models Systems that are intended

More information

Consensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks.

Consensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks. Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Presented by Kewei Li The Problem db nosql complex legacy tuning expensive

More information

Google Spanner - A Globally Distributed,

Google Spanner - A Globally Distributed, Google Spanner - A Globally Distributed, Synchronously-Replicated Database System James C. Corbett, et. al. Feb 14, 2013. Presented By Alexander Chow For CS 742 Motivation Eventually-consistent sometimes

More information

Using Optimistic Atomic Broadcast in Transaction Processing Systems

Using Optimistic Atomic Broadcast in Transaction Processing Systems Using Optimistic Atomic Broadcast in Transaction Processing Systems Bettina Kemme Fernando Pedone Gustavo Alonso André Schiper Matthias Wiesmann School of Computer Science McGill University Montreal, Canada,

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is

More information

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS 2013 Long Kai EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS BY LONG KAI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright 2012 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4.

More information

arxiv: v1 [cs.dc] 14 May 2018

arxiv: v1 [cs.dc] 14 May 2018 Early Scheduling in Parallel State Machine Replication Eduardo Alchieri, Fernando Dotti 2 and Fernando Pedone Departamento de Ciência da Computação Universidade de Brasília, Brazil 2 Escola Politécnica

More information

Separating the WHEAT from the Chaff: An Empirical Design for Geo-Replicated State Machines

Separating the WHEAT from the Chaff: An Empirical Design for Geo-Replicated State Machines Separating the WHEAT from the Chaff: An Empirical Design for Geo-Replicated State Machines João Sousa and Alysson Bessani LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal Abstract State

More information

Augustus: Scalable and Robust Storage for Cloud Applications

Augustus: Scalable and Robust Storage for Cloud Applications Augustus: Scalable and Robust Storage for Cloud Applications Ricardo Padilha Fernando Pedone University of Lugano, Switzerland Abstract Cloud-scale storage applications have strict requirements. On the

More information

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Consistency and Replication Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Reasons for Replication Reliability/Availability : Mask failures Mask corrupted data Performance: Scalability

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang Naveen Kr. Sharma Adriana Szekeres Arvind Krishnamurthy Dan R. K. Ports University of Washington {iyzhang, naveenks, aaasz, arvind,

More information

No compromises: distributed transactions with consistency, availability, and performance

No compromises: distributed transactions with consistency, availability, and performance No compromises: distributed transactions with consistency, availability, and performance Aleksandar Dragojevi c, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam,

More information

Architecture of a Real-Time Operational DBMS

Architecture of a Real-Time Operational DBMS Architecture of a Real-Time Operational DBMS Srini V. Srinivasan Founder, Chief Development Officer Aerospike CMG India Keynote Thane December 3, 2016 [ CMGI Keynote, Thane, India. 2016 Aerospike Inc.

More information

HP: Hybrid Paxos for WANs

HP: Hybrid Paxos for WANs HP: Hybrid Paxos for WANs Dan Dobre, Matthias Majuntke, Marco Serafini and Neeraj Suri {dan,majuntke,marco,suri}@cs.tu-darmstadt.de TU Darmstadt, Germany Neeraj Suri EU-NSF ICT March 2006 Dependable Embedded

More information

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5. Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

Consensus, impossibility results and Paxos. Ken Birman

Consensus, impossibility results and Paxos. Ken Birman Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}

More information

Eventual Consistency 1

Eventual Consistency 1 Eventual Consistency 1 Readings Werner Vogels ACM Queue paper http://queue.acm.org/detail.cfm?id=1466448 Dynamo paper http://www.allthingsdistributed.com/files/ amazon-dynamo-sosp2007.pdf Apache Cassandra

More information

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB s C. Faloutsos A. Pavlo Lecture#23: Distributed Database Systems (R&G ch. 22) Administrivia Final Exam Who: You What: R&G Chapters 15-22

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina,

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, PushyDB Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, osong}@mit.edu https://github.com/jeffchan/6.824 1. Abstract PushyDB provides a more fully featured database that exposes

More information

Recovering from a Crash. Three-Phase Commit

Recovering from a Crash. Three-Phase Commit Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator

More information

Federated Array of Bricks Y Saito et al HP Labs. CS 6464 Presented by Avinash Kulkarni

Federated Array of Bricks Y Saito et al HP Labs. CS 6464 Presented by Avinash Kulkarni Federated Array of Bricks Y Saito et al HP Labs CS 6464 Presented by Avinash Kulkarni Agenda Motivation Current Approaches FAB Design Protocols, Implementation, Optimizations Evaluation SSDs in enterprise

More information

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication

More information

Presented By: Devarsh Patel

Presented By: Devarsh Patel : Amazon s Highly Available Key-value Store Presented By: Devarsh Patel CS5204 Operating Systems 1 Introduction Amazon s e-commerce platform Requires performance, reliability and efficiency To support

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 496 (2013) 170 183 Contents lists available at SciVerse ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Optimizing Paxos with batching

More information

Lecture 6 Consistency and Replication

Lecture 6 Consistency and Replication Lecture 6 Consistency and Replication Prof. Wilson Rivera University of Puerto Rico at Mayaguez Electrical and Computer Engineering Department Outline Data-centric consistency Client-centric consistency

More information

The Case for Reconfiguration without Consensus: Comparing Algorithms for Atomic Storage

The Case for Reconfiguration without Consensus: Comparing Algorithms for Atomic Storage The Case for Reconfiguration without Consensus: Comparing Algorithms for Atomic Storage Leander Jehl 1 and Hein Meling 2 1 University of Stavanger, Stavanger, Norway leander.jehl@uis.no 2 University of

More information

Interactive Responsiveness and Concurrent Workflow

Interactive Responsiveness and Concurrent Workflow Middleware-Enhanced Concurrency of Transactions Interactive Responsiveness and Concurrent Workflow Transactional Cascade Technology Paper Ivan Klianev, Managing Director & CTO Published in November 2005

More information