Practice: Large Systems Part 2, Chapter 2

Similar documents
Practice: Large Systems

Practice: Large Systems Part 2, Chapter 2

11/24/11. Overview. Computability vs. Efficiency. Fault- Tolerance in Prac=ce. Replica=on Is Expensive

DD2451 Parallel and Distributed Computing --- FDD3008 Distributed Algorithms

Distributed Systems Consensus

Consensus and related problems

Beyond FLP. Acknowledgement for presentation material. Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen

Recovering from a Crash. Three-Phase Commit

Distributed Systems 11. Consensus. Paul Krzyzanowski

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos

Distributed Consensus Protocols

Fault-Tolerance & Paxos

Recap. CSE 486/586 Distributed Systems Paxos. Paxos. Brief History. Brief History. Brief History C 1

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Global atomicity. Such distributed atomicity is called global atomicity A protocol designed to enforce global atomicity is called commit protocol

Failure models. Byzantine Fault Tolerance. What can go wrong? Paxos is fail-stop tolerant. BFT model. BFT replication 5/25/18

Data Consistency and Blockchain. Bei Chun Zhou (BlockChainZ)

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

To do. Consensus and related problems. q Failure. q Raft

Paxos provides a highly available, redundant log of events

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS

Practical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov

You can also launch the instances on different machines by supplying IPv4 addresses and port numbers in the format :3410

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance

Failures, Elections, and Raft

Replications and Consensus

AGREEMENT PROTOCOLS. Paxos -a family of protocols for solving consensus

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Consistency. CS 475, Spring 2018 Concurrent & Distributed Systems

ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases

Transactions. CS 475, Spring 2018 Concurrent & Distributed Systems

Introduction to Distributed Systems Seif Haridi

CSE 5306 Distributed Systems. Fault Tolerance

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Proseminar Distributed Systems Summer Semester Paxos algorithm. Stefan Resmerita

Agreement and Consensus. SWE 622, Spring 2017 Distributed Software Engineering

Replication in Distributed Systems

Consensus, impossibility results and Paxos. Ken Birman

CSE 5306 Distributed Systems

Paxos and Raft (Lecture 21, cs262a) Ion Stoica, UC Berkeley November 7, 2016

Consensus. Chapter Two Friends. 2.3 Impossibility of Consensus. 2.2 Consensus 16 CHAPTER 2. CONSENSUS

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Consensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks.

or? Paxos: Fun Facts Quorum Quorum: Primary Copy vs. Majority Quorum: Primary Copy vs. Majority

Semi-Passive Replication in the Presence of Byzantine Faults

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

CS505: Distributed Systems

Parallel Data Types of Parallelism Replication (Multiple copies of the same data) Better throughput for read-only computations Data safety Partitionin

Exam 2 Review. Fall 2011

Intuitive distributed algorithms. with F#

Distributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions

Designing for Understandability: the Raft Consensus Algorithm. Diego Ongaro John Ousterhout Stanford University

Building Consistent Transactions with Inconsistent Replication

Exam 2 Review. October 29, Paul Krzyzanowski 1

Paxos and Replication. Dan Ports, CSEP 552

EECS 591 DISTRIBUTED SYSTEMS

SCALABLE CONSISTENCY AND TRANSACTION MODELS

Consensus in Distributed Systems. Jeff Chase Duke University

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

Replicated State Machine in Wide-area Networks

EECS 498 Introduction to Distributed Systems

CS October 2017

Distributed System. Gang Wu. Spring,2018

Linearizability CMPT 401. Sequential Consistency. Passive Replication

BYZANTINE GENERALS BYZANTINE GENERALS (1) A fable: Michał Szychowiak, 2002 Dependability of Distributed Systems (Byzantine agreement)

Asynchronous Reconfiguration for Paxos State Machines

Distributed Commit in Asynchronous Systems

Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast

Database management system Prof. D. Janakiram Department of Computer Science and Engineering Indian Institute of Technology, Madras

Paxos Made Live. An Engineering Perspective. Authors: Tushar Chandra, Robert Griesemer, Joshua Redstone. Presented By: Dipendra Kumar Jha

Practical Byzantine Fault Tolerance

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Exam Distributed Systems

CPS 512 midterm exam #1, 10/7/2016

Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks

PRIMARY-BACKUP REPLICATION

Distributed Deadlock

Universal Turing Machine Chomsky Hierarchy Decidability Reducibility Uncomputable Functions Rice s Theorem Decidability Continued

The objective. Atomic Commit. The setup. Model. Preserve data consistency for distributed transactions in the presence of failures

Large-Scale Key-Value Stores Eventual Consistency Marco Serafini

Topics in Reliable Distributed Systems

Integrity in Distributed Databases

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Byzantine Failures. Nikola Knezevic. knl

Fault-Tolerant Distributed Services and Paxos"

WA 2. Paxos and distributed programming Martin Klíma

arxiv: v2 [cs.dc] 12 Sep 2017

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

A Reliable Broadcast System

MODELS OF DISTRIBUTED SYSTEMS

Distributed Systems Fault Tolerance

Paxos. Sistemi Distribuiti Laurea magistrale in ingegneria informatica A.A Leonardo Querzoni. giovedì 19 aprile 12

Key-value store with eventual consistency without trusting individual nodes

UNIT 02 DISTRIBUTED DEADLOCKS UNIT-02/LECTURE-01

MODELS OF DISTRIBUTED SYSTEMS

A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm

CS 347 Parallel and Distributed Data Processing

Distributed Systems 24. Fault Tolerance

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication

Transcription:

Practice: Large Systems Part 2, Chapter 2 Overvie Introduction Strong Consistency Crash Failures: Primary Copy, Commit Protocols Crash-Recovery Failures: Paxos, Chubby Byzantine Failures: PBFT, Zyzzyva CAP: Consistency or Availability? Weak Consistency Consistency Models Peer-to-Peer, Distributed Storage, Cloud Computing Computation: MapReduce Thomas Locher Roger Wattenhofer ETH Zurich Distributed Computing.disco.ethz.ch 2/2 Computability vs. Efficiency In the last part, e studied computability When is it possible to guarantee consensus? What kind of failures can be tolerated? Ho many failures can be tolerated? Worst-case scenarios! 1 0 1 1 0 1 0 In this part, e consider practical solutions Simple approaches that ork ell in practice Focus on efficiency 2/3 2/4

Fault-Tolerance in Practice Replication is Expensive Reading a value is simple Just query any server Writing is more ork Inform all servers about the update What if some servers are not available? Fault-Tolerance is achieved through replication Read: Write: Replicated data??? r 2/5 2/6 Primary Copy Can e reduce the load on the clients? Yes! Write only to one server (the primary copy), and let primary copy distribute the update This ay, the client only sends one message in order to read and rite Read: Write: Primary copy r 2/7 2/8

Problem ith Primary Copy State Machine Replication? If the clients can only send read requests to the primary copy, the system stalls if the primary copy fails Hoever, if the clients can also send read requests to the other servers, the clients may not have a consistent vie The state of each server has to be updated in the same ay This ensures that all servers are in the same state henever all updates have been carried out! A B A r Reads an outdated value!!! A B A C C C A B A C The servers have to agree on each update Consensus has to be reached for each update! 2/9 2/10 Theory Practice From Theory to Practice So, ho do e go from theory to practice? Impossible to guarantee consensus using a deterministic algorithm in asynchronous systems even if only one node is faulty Consensus is required to guarantee consistency among different replicas Communication is often not synchronous, but not completely asynchronous either There may be reasonable bounds on the message delays Practical systems often use message passing. The machines ait for the response from another machine and abort/retry after time-out Failures: It depends on the application/system hat kind of failures have to be handled Depends on the bounds on the message delays! That is... Real-orld protocols also make assumptions about the system These assumptions allo us to circumvent the loer bounds! 2/11 2/12

System Consistency Models (Client Vie) Storage System Servers: 2...Millions Store data and react to client request Processes Clients, often millions Read and rite/modify data Interface that describes the system behavior (abstract aay implementation details) If clients read/rite data, they expect the behavior to be the same as for a single storage cell. 2/13 2/14 Let s Formalize these Ideas Motivation We have memory that supports 3 types of operations: rite(u := v): rite value v to the memory location at address u read(u): Read value stored at address u and return it snapshot(): return a map that contains all address-value pairs Each operation has a start-time T s and return-time T R (time it returns to the invoking client). The duration is given by T R T s. read(u) rite(u:=1) rite(u:=2) rite(u:=3) rite(u:=4) start-time rite(u := 3) return-time A X Y B read(u)? rite(u:=5) rite(u:=6) rite(u:=7) replica time 2/15 2/16

Executions Strong Consistency: Linearizability We look at executions E that define the (partial) order in hich processes invoke operations. Real time partial order < r A replicated system is called linearizable if it behaves exactly as a singlesite (unreplicated) system. Real-time partial order of an execution < r : p < r q means that duration of operation p occurs entirely before duration of q (i.e., p returns before the invocation of q in real time). Client partial order < c : p < c q means p and q occur at the same client, and that p returns before q is invoked. A B Client partial order < c Definition Execution E is linearizable if there exists a sequence H such that: 1) H contains exactly the same operations as E, each paired ith the return value received in E 2) The total order of operations in H is compatible ith the real-time partial order < r 3) H is a legal history of the data type that is replicated A B 2/17 2/18 Example: Linearizable Execution Strong Consistency: Sequential Consistency read(u 1 ) 5 rite(u 2 := 7) snapshot() (u 0 :0, u 1 :5, u 2 :7, u 3 :0) Real time partial order < r A X Y B rite(u 1 := 5) read(u 2 ) 0 rite(u 3 := 2) Valid sequence H: 1.) rite(u 1 := 5) 2.) read(u 1 ) 5 3.) read(u 2 ) 0 4.) rite(u 2 := 7) 5.) snapshot() (u 0 : 0, u 1 : 5, u 2 :7, u 3 :0) 6.) rite(u 3 := 2) For this example, this is the only valid H. In general there might be several sequences H that fullfil all required properties. Orders at different locations are disregarded if it cannot be determined by any observer ithin the system. I.e., a system provides sequential consistency if every node of the system sees the (rite) operations on the same memory address in the same order, although the order may be different from the order as defined by real time (as seen by a hypothetical external observer or global clock). Definition Execution E is sequentially consistent if there exists a sequence H such that: 1) H contains exactly the same operations as E, each paired ith the return value received in E 2) The total order of operations in H is compatible ith the client partial order < c 3) H is a legal history of the data type that is replicated 2/19 2/20

Example: Sequentially Consistent Client partial order < c Is Every Execution Sequentially Consistent? rite(u 2 := 7) rite(u 1 := 5) rite(u 1 := 5) read(u 1 ) 5 rite(u 2 := 7) snapshot() (u 0 :0, u 1 :5, u 2 :7, u 3 :0) A X Y B read(u 2 ) 0 rite(u 3 := 2) Valid sequence H: 1.) rite(u 1 := 5) 2.) read(u 1 ) 5 3.) read(u 2 ) 0 4.) rite(u 2 := 7) 5.) snapshot() (u 0 :0, u 1 :5, u 2 :7, u 3 :0) 6.) rite(u 3 := 2) rite(u 0 := 8) snapshot u0,u1 () rite(u 3 := 2) (u 0 :8, u 1 :0) snapshot u2,u3 () (u 2 :0, u 3 :2) A X Y B rite(u 2 := 7) rite(u 1 := 5) Circular dependencies! Real-time partial order requires rite(3,2) to be before snapshot(), hich contradicts the vie in snapshot()! rite(u 0 := 8) rite(u 3 := 2) I.e., there is no valid total order and thus above execution is not sequentially consistent 2/21 2/22 Sequential Consistency does not Compose Transactions rite(u 2 := 7) rite(u 1 := 5) rite(u 0 := 8) snapshot u0,u1 () (u 0 :8, u 1 :0) rite(u 3 := 2) snapshot u2,u3 () (u 2 :0, u 3 :2) A X Y B If e only look at data items 0 and 1, operations are sequentially consistent If e only look at data items 2 and 3, operation are also sequentially consistent But, as e have seen before, the combination is not sequentially consistent In order to achieve consistency, updates have to be atomic A rite has to be an atomic transaction Updates are synchronized Either all nodes (servers) commit a transaction or all abort Ho do e handle transactions in asynchronous systems? Unpredictable messages delays! Moreover, any node may fail Long delay Recall that this problem cannot be solved in theory! Short delay Sequential consistency does not compose! (this is in contrast to linearizability) 2/23 2/24

To-Phase Commit (2PC) To-Phase Commit: Failures A idely used protocol is the so-called to-phase commit protocol The idea is simple: There is a coordinator that coordinates the transaction All other nodes communicate only ith the coordinator The coordinator communicates the final decision Coordinator Fail-stop model: We assume that a failed node does not re-emerge Failures are detected (instantly) E.g. time-outs are used in practical systems to detect failures If the coordinator fails, a ne coordinator takes over (instantly) Ho can this be accomplished reliably? Coordinator Ne coordinator 2/25 2/26 To-Phase Commit: Protocol To-Phase Commit: Protocol In the first phase, the coordinator asks if all nodes are to commit In the second phase, the coordinator sends the decision (commit/abort) The coordinator aborts if at least one node said no Phase 1: Coordinator sends to all nodes Coordinator abort ack Coordinator abort ack If a node receives from the coordinator: If it is to commit Send to coordinator else Send no to coordinator no abort ack ack abort 2/27 2/28

To-Phase Commit: Protocol To-Phase Commit: Analysis Phase 2: If the coordinator receives only messages: Send commit to all nodes else Send abort to all nodes 2PC obviously orks if there are no failures If a node that is not the coordinator fails, it still orks If the node fails before sending /no, the coordinator can either ignore it or safely abort the transaction If the node fails before sending ack, the coordinator can still commit/abort depending on the vote in the first phase If a node receives commit from the coordinator: Commit the transaction else (abort received) Abort the transaction Send ack to coordinator Once the coordinator received all ack messages: It completes the transaction by committing or aborting itself 2/29 2/30 To-Phase Commit: Analysis To-Phase Commit: Multiple Failures What happens if the coordinator fails? As e said before, this is (someho) detected and a ne coordinator takes over This safety mechanism is not a part of 2PC Ho does the ne coordinator proceed? It must ask the other nodes if a node has al received a commit A node that has received a commit replies, otherise it sends no and promises not to accept a commit that may arrive from the old coordinator If some node replied, the ne coordinator broadcasts commit As long as the coordinator is alive, multiple failures are no problem The same arguments as for one failure apply What if the coordinator and another node crashes? no abort Aborted! commit or abort??? commit or abort??? commit Committed! commit or abort??? commit or abort??? This orks if there is only one failure Does 2PC still ork ith multiple failures? The nodes cannot commit! The nodes cannot abort! 2/31 2/32

To-Phase Commit: Multiple Failures Three-Phase Commit What is the problem? Some nodes may be to commit hile others have al committed or aborted If the coordinator crashes, the other nodes are not informed! Ho can e solve this problem? Committed/Aborted! Solution: Add another phase to the protocol! The ne phase precedes the commit phase The goal is to inform all nodes that all are to commit (or not) At the end of this phase, every node knos hether or not all nodes ant to commit before any node has actually committed or aborted! This solves the problem of 2PC! Yes/ no commit/ abort??? The remaining nodes cannot make a decision! This protocol is called the three-phase commit (3PC) protocol??? 2/33 2/34 Three-Phase Commit: Protocol Three-Phase Commit: Protocol In the ne (second) phase, the coordinator sends prepare (to commit) messages to all nodes Phase 1: Coordinator sends to all nodes Coordinator Coordinator prepare prepare ack ack prepare ack prepare ack Coordinator commit commit ackc commit ackc ackc commit ackc If a node receives from the coordinator: If it is to commit Send to coordinator else Send no to coordinator The first phase of 2PC and 3PC are identical! acknoledge commit 2/35 2/36

Three-Phase Commit: Protocol Three-Phase Commit: Protocol Phase 2: If the coordinator receives only messages: Send prepare to all nodes else Send abort to all nodes If a node receives prepare from the coordinator: Prepare to commit the transaction else (abort received) Abort the transaction Send ack to coordinator This is the ne phase Phase 3: Once the coordinator received all ack messages: If the coordinator sent abort in Phase 2 The coordinator aborts the transaction as ell else (it sent prepare) Send commit to all nodes If a node receives commit from the coordinator: Commit the transaction Send ackcommit to coordinator Once the coordinator received all ackcommit messages: It completes the transaction by committing itself 2/37 2/38 Three-Phase Commit: Analysis Three-Phase Commit: Analysis All non-faulty nodes either commit or abort If the coordinator doesn t fail, 3PC is correct because the coordinator lets all nodes either commit or abort Termination can also be guaranteed: If some node fails before sending /no, the coordinator can safely abort. If some node fails after the coordinator sent prepare, the coordinator can still enforce a commit because all nodes must have sent If only the coordinator fails, e again don t have a problem because the ne coordinator can restart the protocol Assume that the coordinator and some other nodes failed and that some node committed. The coordinator must have received ack messages from all nodes All nodes must have received a prepare message. The ne coordinator can thus enforce a commit. If a node aborted, no node can have received a prepare message. Thus, the ne coordinator can safely abort the transaction Although the 3PC protocol still orks if multiple nodes fail, it still has severe shortcomings 3PC still depends on a single coordinator. What if some but not all nodes assume that the coordinator failed? The nodes first have to agree on hether the coordinator crashed or not! In order to solve consensus, you first need to solve consensus Transient failures: What if a failed coordinator comes back to life? Suddenly, there is more than one coordinator! Still, 3PC and 2PC are used successfully in practice Hoever, it ould be nice to have a practical protocol that does not depend on a single coordinator and that can handle temporary failures! 2/39 2/40

Paxos Paxos: Majority Sets Historical note In the 1980s, a fault-tolerant distributed file system called Echo as built According to the developers, it achieves consensus despite any number of failures as long as a majority of nodes is alive The steps of the algorithm are simple if there are no failures and quite complicated if there are failures Leslie Lamport thought that it is impossible to provide guarantees in this model and tried to prove it Instead of finding a proof, he found a much simpler algorithm that orks: The Paxos algorithm Paxos is a to-phase protocol, but more resilient than 2PC Why is it more resilient? There is no coordinator. A majority of the nodes is asked if a certain value can be accepted A majority set is enough because the intersection of to majority sets is not empty If a majority chooses one value, no majority can choose another value! Majority set Paxos is an algorithm that does not rely on a coordinator Communication is still asynchronous All nodes may crash at any time and they may also recover fail-recover model Majority set 2/41 2/42 Paxos: Majority Sets Majority sets are a good idea But, hat happens if several nodes compete for a majority? Conflicts have to be resolved Some nodes may have to change their decision No majority No majority Paxos: Roles Each node has one or more roles: There are three roles Proposer A proposer is a node that proposes a certain value for acceptance Of course, there can be any number of proposers at the same time Acceptor An acceptor is a node that receives a proposal from a proposer An acceptor can either accept or reject a proposal Learner A learner is a node that is not involved in the decision process The learners must learn the final result from the proposers/acceptors No majority 2/43 2/44

Paxos: Proposal Paxos: Prepare A proposal (x,n) consists of the proposed value x and a proposal number n Whenever a proposer issues a ne proposal, it chooses a larger (unique) proposal number An acceptor accepts a proposal (x,n) if n is larger than any proposal number it has ever heard Give preference to larger proposal numbers! Before a node sends propose(x,n), it sends prepare(x,n) This message is used to indicate that the node ants to propose (x,n) If n is larger than all received request numbers, an acceptor returns the accepted proposal (y,m) ith the largest request number m If it never accepted a proposal, the acceptor returns (Ø,0) Note that m < n! The proposer learns about accepted proposals! An acceptor can accept any number of proposals An accepted proposal may not necessarily be chosen The value of a chosen proposal is the chosen value Any number of proposals can be choosen Hoever, if to proposals (x,n) and (y,m) are chosen, then x = y Consensus: Only one value can be chosen! prepare(x,n) prepare(x,n) prepare(x,n) prepare(x,n) acc(y,m) acc(ø,0) acc(z,l) Majority set This is the first phase! Majority set 2/45 2/46 Paxos: Propose Paxos: Algorithm of Proposer If the proposer receives all replies, it sends a proposal Hoever, it only proposes its on value, if it only received acc(ø,0), otherise it adopts the value y in the proposal ith the largest request number m The proposal still contains its sequence number n, i.e., (y,n) is proposed If the proposer receives all acknoledgements ack(y,n), the proposal is chosen propose(y,n) propose(y,n) propose(y,n) propose(y,n) Majority set (y,n) is chosen! This is the second phase! ack(y,n) ack(y,n) ack(y,n) ack(y,n) Majority set Proposer ants to propose (x,n): Send prepare(x,n) to a majority of the nodes if a majority of the nodes replies then Let (y,m) be the received proposal ith the largest request number if m = 0 then (No acceptor ever accepted another proposal) Send propose(x,n) to the same set of acceptors else Send propose(y,n) to the same set of acceptors if a majority of the nodes replies ith ack(x,n) (or ack(y,n)) The proposal is chosen! The value of the proposal is also chosen! After a time-out, the proposer gives up and may send a ne proposal 2/47 2/48

Paxos: Algorithm of Acceptor Paxos: Spreading the Decision Initialize and store persistently: Largest request number ever received n max := 0 (x last,n last ) := (Ø,0) Last accepted proposal Acceptor receives prepare (x,n): if n > n max then n max := n Send acc(x last,n last ) to the proposer Acceptor receives proposal (x,n): if n = n max then x last := x n last := n Send ack(x,n) to the proposer Why persistently? 2/49 After a proposal is chosen, only the proposer knos about it! Ho do the other nodes get informed? The proposer could inform all nodes directly Only n-1 messages are required If the proposer fails, the others are not informed (directly) The acceptors could broadcast every time they accept a proposal Much more fault-tolerant Many accepted proposals may not be chosen Moreover, choosing a value costs O(n 2 ) messages ithout failures! Something in the middle? The proposer informs b nodes and lets them broadcast the decision (x,n) is chosen! Accepted (x,n)! Trade-off: fault-tolerance vs. message complexity (x,n) 2/50 Paxos: Agreement Paxos: Theorem Lemma If a proposal (x,n) is chosen, then for every issued proposal (y,n ) for hich n > n it holds that x = y Proof: Assume that there are proposals (y,n ) for hich n > n and x y. Consider the proposal ith the smallest proposal number n Consider the non-empty intersection S of the to sets of nodes that function as the acceptors for the to proposals Proposal (x,n) has been accepted Since n > n, the nodes in S must have received prepare(y,n ) after (x,n) has been accepted This implies that the proposer of (y,n ) ould also propose the value x unless another acceptor has accepted a proposal (z,n*), z x and n < n* < n. Hoever, this means that some node must have proposed (z,n*), a contradiction because n* < n and e said that n is the smallest proposal number! Theorem If a value is chosen, all nodes choose this value Proof: Once a proposal (x,n) is chosen, each proposal (y,n ) that is sent afterards has the same proposal value, i.e., x = y according to the lemma on the previous slide Since every subsequent proposal has the same value x, every proposal that is accepted after (x,n) has been chosen has the same value x Since no other value than x is accepted, no other value can be chosen! 2/51 2/52

Paxos: Wait a Minute Paxos: No Liveness Guarantee Paxos is great! It is a simple, deterministic algorithm that orks in asynchronous systems and tolerates f < n/2 failures The anser is no! Paxos only guarantees that if a value is chosen, the other nodes can only choose the same value It does not guarantee that a value is chosen! time Is this really possible? prepare(x,1) acc(ø,0) prepare(y,2) Theorem A deterministic algorithm cannot guarantee consensus in asynchronous systems even if there is just one faulty node Time-out! propose(x,1) prepare(x,3) acc(ø,0) acc(ø,0) propose(y,2) Does Paxos contradict this loer bound? prepare(y,4) Time-out! 2/53 acc(ø,0) 2/54 Paxos: Agreement vs. Termination Paxos in Practice In asynchronous systems, a deterministic consensus algorithm cannot have both, guaranteed termination and correctness Paxos is alays correct. Consequently, it cannot guarantee that the protocol terminates in a certain number of rounds Termination is sacrificed for correctness Although Paxos may not terminate in theory, it is quite efficient in practice using a fe optimizations Ho can Paxos be optimized? There are ays to optimize Paxos by dealing ith some practical issues For example, the nodes may ait for a long time until they decide to try to submit a ne proposal A simple solution: The acceptors send NAK if they do not accept a prepare message or a proposal. A node can then abort immediately Note that this optimization increases the message complexity Paxos is indeed used in practical systems! Yahoo! s ZooKeeper: A management service for large distributed systems uses a variation of Paxos to achieve consensus Google s Chubby: A distributed lock service library. Chubby stores lock information in a replicated database to achieve high availability. The database is implemented on top of a fault-tolerant log layer based on Paxos 2/55 2/56