Paxos and Distributed Transactions
INF 5040, autumn 2016
Lecturer: Roman Vitenberg

Paxos: what is it?
- The most commonly used consensus algorithm
- A fundamental building block for data centers:
  - Distributed leader/coordinator election
  - Distributed locks
  - Distributed resource control (e.g., the naming service at Google)
  - Management of distributed logs
- Used by most data center and cloud providers:
  - Google Chubby
  - ZooKeeper (originally Yahoo's, now Apache open source)
  - Amazon Web Services
  - Oracle NoSQL DB, VMware, Microsoft, IBM
Use of Chubby in Google
- Google Drive, Calendar, Earth, Analytics, and Gmail run on top of Bigtable and GFS
- Bigtable and GFS in turn rely on Chubby

Roles in Paxos
- Client: issues requests to the Paxos system
  - For instance, to get a lock on a distributed file
- Proposer: accepts the client request and publicizes and advocates the state change to the other processes
  - There may be a single proposer or multiple proposers, depending on the implementation
- Acceptor: executes the agreement and records the state change once agreement is achieved
- Learner: once a client request has been agreed on by the acceptors, the learner may take action (i.e., execute the request and send a response to the client)
  - To improve availability of processing, we can add more learners
  - Also called a witness
- Leader: Paxos requires a distinguished proposer (called the Paxos leader) to make progress
A typical flow in Paxos
- The client sends a request to a proposer (e.g., "need a lock on the file")
- Inside the Paxos implementation, the proposer runs the internal protocol with the acceptors until a decision is reached
- The learners/workers act on the decision

Basic Paxos: the first stab
Basic Paxos makes one decision only.
- Prepare: the proposer sends a Prepare message with the proposed value to the acceptors
- Confirm: the acceptors confirm its reception
- Accept: the proposer asks the acceptors to accept
- Accepted: the acceptors respond, indicating they have accepted the value; sent to the proposer and to each learner
  - Learners act as a secondary storage mechanism
Basic Paxos: the first stab, illustrated
- The proposer sends Prepare(v), where v = "grant lock to a process"
- The acceptors reply with Confirm(v)
- The proposer sends Accept(v)
- The acceptors reply with Accepted(v)

Basic Paxos: the first stab, problem
- Two proposers concurrently send Prepare(v) and Prepare(u)
- Some acceptors reply Confirm(v), others Confirm(u)
- The protocol is blocked!
Basic Paxos: breaking symmetry with proposer ids
- Proposers p and q send Prepare(p) and Prepare(q)
- The acceptors reply with Confirm(p) and Confirm(q)
- The proposer with the higher id wins and completes the round: Accept(u,q), Accepted(u,q)

Basic Paxos: failure of the proposer
- Proposers p and q exchange Prepare(p), Prepare(q), Confirm(p), Confirm(q)
- The winning proposer fails before sending Accept
- The protocol is blocked!
Basic Paxos: changes to the protocol
- Need to monitor active proposers for failures and establish additional proposers if a current one fails
- Proposals are indexed by an increasing sequence number N: timestamp T = {N, id}
- Use the same T in the Prepare and the subsequent Accept
- Replace Confirm with Promise:
  - If T ≤ any timestamp previously received by the acceptor from any proposer, the acceptor ignores the proposal
  - Otherwise, it returns a promise to ignore all future proposals with timestamp < T
- Thus, acceptors may give a promise to multiple proposals, with increasing timestamps!

Basic Paxos: sequencing proposals
- p sends Prepare(T={1,p}) and receives Promise({1,p})
- q then sends Prepare(T={1,q}); since {1,q} is higher, the acceptors reply Promise({1,q})
- After a timeout, p retries with a higher sequence number: Prepare(T={2,p})
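The Promise rule above can be sketched in a few lines. This is an illustrative sketch, not code from the lecture: a timestamp is modeled as a Python pair (N, proposer_id), and tuple comparison happens to match the lexicographic ordering of T = {N, id}.

```python
# Minimal sketch of the acceptor's Promise rule (names are hypothetical).
class Acceptor:
    def __init__(self):
        self.promised = None  # highest timestamp promised so far, or None

    def on_prepare(self, t):
        """Return a Promise for timestamp t, or None to ignore the proposal."""
        if self.promised is not None and t <= self.promised:
            return None          # t does not exceed an earlier promise: ignore
        self.promised = t        # promise to ignore later proposals below t
        return ("Promise", t)
```

Replaying the "sequencing proposals" slide: after `on_prepare((1, "p"))` and `on_prepare((1, "q"))` both succeed (each timestamp exceeds the previous promise), a repeated `on_prepare((1, "p"))` is ignored, and p must retry with `(2, "p")`.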
Basic Paxos: failure of the proposer (cont'd)
- p sends Accept(v,{1,p}); the acceptors reply Accepted(v,{1,p}), but p fails
- q then sends Prepare({1,q}), receives Promise({1,q}), and sends Accept(u,{1,q})
- If the acceptors go along with the second proposal, two proposals are accepted; otherwise, the protocol is blocked

Basic Paxos: more changes to the protocol
- The Promise message includes the most recently accepted value (or null if none)
  - This allows proposers to keep track of previous accepts
- If there were no previous accepts, a proposer includes its own wish (i.e., value) in the Accept
  - Otherwise, it has to use a previously accepted value
- Acceptors perform the same timestamp check when accepting as when promising
  - They may accept (not only promise to) multiple proposals
  - However, all accepted proposals will have the same value
Basic Paxos: using previously accepted values
- p sends Accept(v,{1,p}); the acceptors reply Accepted(v,{1,p})
- q then sends Prepare({1,q}) and receives Promise({1,q},v), carrying the previously accepted value v
- q therefore sends Accept(v,{1,q}), re-proposing v instead of its own value

Basic Paxos: dueling proposers (multiple acceptors)
- p sends Prepare({1,p}) and receives Promise({1,p},null)
- Before p's Accept(v,{1,p}) arrives, q sends Prepare({1,q}) and receives Promise({1,q},null), so p's Accept is ignored
- After a timeout, p retries with Prepare({2,p}) and receives Promise({2,p},null), so q's Accept(u,{1,q}) is ignored
- After a timeout, q retries with Prepare({2,q}) and receives Promise({2,q},null), so p's Accept(v,{2,p}) is ignored
- In theory, this can go on forever. In practice, the probability goes down exponentially with the number of rounds.
Basic Paxos: failure of an acceptor
- The protocol as described above blocks
- Idea: collect responses from a quorum of acceptors
  - Both Promise and Accepted
  - With a majority quorum, this tolerates the failure of any minority of acceptors
- The proposer can send to all acceptors and wait for a quorum, or send to a quorum only and wait for all quorum members
- This results in the possibility of two different values being accepted
  - But only one value can be accepted by a quorum
  - Thus, learners can act when they hear the same value from a quorum

Basic Paxos: selecting values for Accept
- p sends Accept(v,{1,p}) to part of the acceptors, which reply Accepted(v,{1,p})
- Concurrently, q runs Prepare({1,q}) with the other acceptors, receives Promise({1,q},null), and gets Accept(u,{1,q}) answered with Accepted(u,{1,q})
- When p later sends Prepare({2,p}) to a quorum, it receives both Promise({2,p},{{1,p},v}) and Promise({2,p},{{1,q},u})
- Tricky: p has to use the previously accepted value with the highest timestamp, so it sends Accept(u,{2,p})
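The learner rule above ("act when you hear the same value from a quorum") is easy to express as a counting function. This is a hedged sketch, not lecture code; the function name and message encoding are assumptions.

```python
from collections import Counter

def learner_acts(accepted, n_acceptors):
    """Return the value heard from a majority quorum, or None if no quorum yet.
    `accepted` is a list of (acceptor_id, value) Accepted messages; a later
    message from the same acceptor supersedes an earlier one."""
    quorum = n_acceptors // 2 + 1
    latest = {}                        # most recent value per acceptor
    for acceptor_id, value in accepted:
        latest[acceptor_id] = value
    if not latest:
        return None
    value, count = Counter(latest.values()).most_common(1)[0]
    return value if count >= quorum else None
```

With three acceptors the quorum size is two, so hearing v from acceptors 1 and 2 suffices even if acceptor 3 accepted u.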
Basic Paxos: the complete protocol
- The protocol proceeds over several rounds; a successful round has two phases
- Phase 1a: Prepare
  - A proposer selects a quorum of acceptors and sends a Prepare message to the selected quorum
  - The message contains a timestamp T = {N, id}, with N greater than any sequence number previously used by this proposer
- Phase 1b: Promise
  - If T ≤ any timestamp previously received by the acceptor from any proposer, the acceptor ignores the proposal
  - Otherwise, it returns a promise to ignore all future proposals with timestamp < T
  - If the acceptor accepted a proposal in the past, the Promise message must include the corresponding timestamp and accepted value

Basic Paxos: the complete protocol (cont'd)
- Phase 2a: Accept Request
  - The proposer waits for Promise messages from all quorum nodes
  - It selects a value to be accepted:
    - If none of the Promise messages contains a previously accepted value, the proposer may choose any value
    - Otherwise, it finds the Promise message with the highest timestamp and chooses the value in that message
  - It selects a quorum of acceptors and sends the request with the chosen value and the same T as in Phase 1a to the selected quorum
- Phase 2b: Accepted
  - An acceptor accepts an Accept request with timestamp T if and only if it has not promised to any Prepare with timestamp > T
    - The Accepted message is sent to the proposer and every learner
  - Otherwise, it ignores the request
- When a learner gets Accepted with the same value from a quorum, it acts
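The Phase 2a value-selection rule can be sketched directly from the description above. The encoding is an assumption: each promise carries either None (no previous accept) or a pair (timestamp, value) for the acceptor's most recently accepted proposal.

```python
def choose_value(promises, own_value):
    """Phase 2a rule: if any promise reports a previously accepted value,
    re-propose the one with the highest timestamp; otherwise the proposer
    is free to use its own value."""
    best = None                      # (timestamp, value) with highest timestamp
    for prev in promises:
        if prev is not None and (best is None or prev[0] > best[0]):
            best = prev
    return own_value if best is None else best[1]
```

This mirrors the "selecting values for Accept" slide: seeing both ({1,p}, v) and ({1,q}, u), the proposer must pick u because {1,q} > {1,p}.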
Multi-Paxos
- A typical deployment of Paxos uses a continuous stream of agreed values acting as commands to update a distributed state machine
- To achieve this, an instance number I is included along with each proposal: Prepare(T,I), etc.
- Need to produce a sequence of decisions: can only agree on I+1 after having reached agreement on I
- Important optimization: if the leader is stable, phase 1 becomes unnecessary
  - The leader immediately starts with Accept
  - Agreement is reached faster and with fewer messages

Optimizations and extensions of Paxos
- Can tolerate F acceptor failures with F+1 main acceptors and F spare acceptors, by dynamically reconfiguring after each failure
- A client can communicate directly with the acceptors without going through the leader
  - Faster when there are no conflicts
  - More expensive and complex conflict resolution
- Can accept and merge conflicting proposals if the two proposed operations are commutative
- Paxos may also be extended to support Byzantine failures of the participants
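The ordering constraint in Multi-Paxos (instance I+1 is executed only after instance I) can be sketched as a small replicated-log driver. Class and method names here are hypothetical, not from the lecture.

```python
class ReplicatedLog:
    """Sketch: apply Multi-Paxos decisions to a state machine in
    strict instance order, even if decisions arrive out of order."""
    def __init__(self, apply_fn):
        self.apply_fn = apply_fn   # the state-machine update function
        self.decided = {}          # instance number -> agreed command
        self.next_to_apply = 0     # next instance to execute

    def on_decision(self, instance, command):
        self.decided[instance] = command
        # Apply consecutively decided commands; a gap stalls execution
        # until the missing instance is decided.
        while self.next_to_apply in self.decided:
            self.apply_fn(self.decided[self.next_to_apply])
            self.next_to_apply += 1
```

If instance 1 is decided before instance 0, its command is buffered; both are applied, in order, once instance 0 arrives.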
Introduction to transactions
- Servers can offer concurrent access to the objects/data that the service encapsulates
- Applications frequently need to perform sequences of operations as undivided units => atomic transactions
- The server can offer persistent storage of objects/data => motivation for continued operation after a server process has failed
- The service can be provided by a group of servers => distributed transactions

Distributed transactions
- A client transaction T invokes operations on multiple servers X, Y, and Z
Transactional service
- Offers access to resources via transactions
- Cooperation between clients and transactional servers
- Operations of a transactional service:
  - OpenTransaction() -> TransId
  - CloseTransaction(TransId) -> {commit, abort}
  - AbortTransaction(TransId) -> {}
- All operations between OpenTransaction and CloseTransaction are said to be performed in a transactional context

Completing a transaction
- Commit point for transaction T:
  - All operations in T that access the server database are successfully performed
  - The effect of the operations is made permanent (typically by recording them in a log)
  - We say that transaction T is committed
- The service (or the database system) has put itself under an obligation: the results of T are made permanent in the database
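The three operations listed above can be sketched as an interface stub. This is purely illustrative: the method bodies are placeholders (a real service would run a commitment protocol in CloseTransaction), and only the signatures follow the slides.

```python
import itertools

class TransactionalService:
    """Illustrative stub of the transactional-service operations."""
    def __init__(self):
        self._ids = itertools.count(1)   # globally unique TransIds (sketch)
        self._open = set()               # currently open transactions

    def open_transaction(self):
        """OpenTransaction() -> TransId"""
        tid = next(self._ids)
        self._open.add(tid)
        return tid

    def close_transaction(self, tid):
        """CloseTransaction(TransId) -> {commit, abort}"""
        self._open.discard(tid)
        return "commit"                  # placeholder: always commits here

    def abort_transaction(self, tid):
        """AbortTransaction(TransId) -> {}"""
        self._open.discard(tid)
```

Operations issued between `open_transaction()` and `close_transaction()` would be the ones executed in the transactional context.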
Component roles in distributed transactions
- Distributed system components involved in a transaction can have one of the following roles:
  - Transactional client
  - Transactional server
  - Coordinator

Coordinator
- Plays a key role in managing the transaction
- The component that handles begin/commit/abort operations
- Allocates globally unique transaction identifiers
- Includes new servers in the transaction (Join operation) and monitors all the participants
- Typical implementation: the first server that the client contacts (by invoking OpenTransaction) becomes the coordinator for the transaction
Transactional server
- Serves as a proxy for each resource that is accessed or modified under transactional control
- Must know its coordinator, passed as a parameter in the AddServer operation
- Registers its participation in the transaction by invoking the Join operation at the coordinator
- Must implement a commitment protocol, such as two-phase commit (2PC)

Transactional client
- Sees the transaction only through the coordinator
- Invokes operations at the coordinator:
  - OpenTransaction
  - CloseTransaction
  - AbortTransaction
- The implementation of the transaction protocol (such as 2PC) is transparent to the client
Example
- A client opens transaction T at BranchX, which becomes the coordinator; the transaction accesses accounts A, B, C, and D at branches X, Y, and Z
- Step 3: the client invokes BranchY.AddServer(T, BranchX), telling BranchY who the coordinator is
- Step 3a: BranchY registers with the coordinator via BranchX.Join(T, BranchY)
- Step 5a: likewise, BranchZ joins via BranchX.Join(T, BranchZ)
- Step 9: when the client closes T, BranchX starts the commitment protocol

The non-blocking atomic commit problem (intuition)
- Multiple autonomous distributed servers
- Prior to committing the transaction, all the transactional servers must verify that they can locally perform commit
- If any server cannot perform commit, all the servers must perform abort
The non-blocking atomic commit problem (formal)
- Uniform agreement: all processes that decide, decide on the same value; decisions are not reversible
- Validity: commit can only be reached if all processes vote for commit
- Non-triviality: if all voted commit and there are no (suspicions of) failures, then the decision must be commit
- Termination: if after some time there are no more failures, then eventually all live processes decide

The 2PC protocol
- A one-phase protocol is insufficient
  - It does not allow a server to perform a unilateral abort, e.g., in the case of a deadlock
- Rationale for two phases:
  - Phase one: agreement
  - Phase two: execution
Phase one: agreement
- The coordinator asks all servers if they are able to perform commit (the CanCommit?(T) call)
- Server responses:
  - Yes: the server will perform commit if the coordinator requests it, but it does not yet know whether it will commit; this is determined by the coordinator
  - No: the server performs an immediate abort of the transaction
- Servers can unilaterally perform abort, but they cannot unilaterally perform commit

Phase two: execution
- The coordinator collects all replies from the servers, including itself, and decides to perform
  - commit, if all replied Yes
  - abort, if at least one replied No
- The coordinator propagates its decision to the servers
- All participants perform the DoCommit(T) call if the decision is commit, and the AbortTransaction(T) call otherwise
- If the decision is commit, the servers notify the coordinator right after they have performed the DoCommit(T) call, via HaveCommitted(T) back on the coordinator
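The two phases above can be sketched from the coordinator's point of view. This is a hedged, single-process sketch: the `Participant` class and its method names are hypothetical stand-ins for the CanCommit?/DoCommit/AbortTransaction calls, with no real messaging or failure handling.

```python
class Participant:
    """Hypothetical transactional server; tracks its local 2PC state."""
    def __init__(self, vote_yes=True):
        self.vote_yes = vote_yes
        self.state = "ready"

    def can_commit(self, t):
        if not self.vote_yes:
            self.state = "aborted"   # a No vote is a unilateral abort
        return self.vote_yes

    def do_commit(self, t):
        self.state = "committed"

    def abort_transaction(self, t):
        self.state = "aborted"

def run_2pc(t, participants):
    """Sketch of both 2PC phases as driven by the coordinator."""
    # Phase one: collect votes from all participants
    votes = [p.can_commit(t) for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase two: propagate the decision
    for p in participants:
        if decision == "commit":
            p.do_commit(t)
        elif p.state != "aborted":   # No-voters have already aborted
            p.abort_transaction(t)
    return decision
```

One No vote flips the decision for everyone: with votes (Yes, No) the result is abort and both participants end in the aborted state.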
The 2PC protocol

  Step | Coordinator state                    | Step | Server (participant) state
  -----|--------------------------------------|------|---------------------------------
  1    | ready to commit (waits for replies)  | 2    | ready to commit (uncertainty)
  3    | committed                            | 4    | committed; performed

2PC state diagram
- Init (not in transaction) -> Ready to commit -> Aborted or Committed
- Committed -> Performed (coordinator only)
2PC: when a previously failed server recovers

  State      | Coordinator        | Participant
  -----------|--------------------|-----------------------
  Init       | Nothing            | Nothing
  Ready      | AbortTransaction   | GetDecision(T)
  Committed  | Sends DoCommit(T)  | Sends HaveCommitted(T)
  Performed  | Nothing            | -

2PC: when a process detects a failure
- What happens if a coordinator or a participant does not receive a message it expects to receive?
- A participant in the Ready state must figure out the state of the other participants
  - What if all remaining participants are in the Ready state? This is known as blocking
- There are more advanced protocols (e.g., 3PC) that block in fewer cases
  - They impose higher overhead during normal operation
- 2PC is the most widely used protocol
- If the network might partition, blocking is unavoidable
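The "figure out the state of other participants" step can be sketched as a decision function. This is an assumption-laden sketch of the standard cooperative termination idea, not lecture code: the state names are hypothetical strings, and the function only classifies the outcome for a participant stuck in the Ready state.

```python
def ready_participant_on_timeout(other_states):
    """A participant in Ready queries the other participants' states.
    Returns the action it can deduce, or "blocked" if none."""
    if "committed" in other_states:
        return "commit"    # the global decision must have been commit
    if "aborted" in other_states or "init" in other_states:
        return "abort"     # someone aborted, or never voted Yes
    return "blocked"       # everyone reachable is also in Ready: 2PC blocks
```

The last branch is exactly the blocking case from the slide: when every reachable participant is uncertain, no safe decision is possible without the coordinator.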
Summary
- The atomic commitment problem and its solutions
- CORBA Transaction Service
  - Implements 2PC
  - Requires resources to be transaction-enabled
- Transactions and EJB
  - Programmatic & declarative transactions
  - The container provides support for distributed transactions based on CORBA OTS and the X/Open XA protocol
  - The EJB container/server implements the Java Transaction API (JTA) and the Java Transaction Service (JTS)
- Extended transaction models & OASIS BTP
  - B2B transactions