Paxos and Distributed Transactions


INF 5040 autumn 2016, lecturer: Roman Vitenberg

Paxos: what is it?
- The most commonly used consensus algorithm
- A fundamental building block for data centers:
  - Distributed leader/coordinator election
  - Distributed locks
  - Distributed resource control (e.g., the naming service at Google)
  - Management of distributed logs
- Used by most data center and cloud providers:
  - Google Chubby
  - ZooKeeper (originally Yahoo's, now Apache open source)
  - Amazon Web Services
  - Oracle NoSQL DB, VMware, Microsoft, IBM

Use of Chubby in Google
- Applications such as Google Drive, Calendar, Earth, Analytics and Gmail build on Bigtable and GFS, which in turn rely on Chubby

Roles in Paxos
- Client: issues requests to the Paxos system (for instance, to get a lock on a distributed file)
- Proposer: accepts the client request and publicizes and advocates the state change to the other processes; there may be a single proposer or multiple proposers, depending on the implementation
- Acceptor: executes the agreement protocol and records the state change once agreement is achieved
- Learner: once a client request has been agreed on by the acceptors, the learner may act on it (i.e., execute the request and send a response to the client); to improve availability of processing, we can add more learners; also called a witness
- Leader: Paxos requires a distinguished proposer (the Paxos leader) to make progress
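To make the roles concrete, here is a minimal, illustrative sketch of the four roles as Python classes. The class and method names are hypothetical (they only mirror the responsibilities listed above), and the acceptance rule is a placeholder that the following slides fill in.

```python
# Illustrative sketch of the Paxos roles; not a real implementation.

class Client:
    """Issues requests (e.g. 'grant me a lock on file f') to a proposer."""
    def __init__(self, proposer):
        self.proposer = proposer
    def request(self, value):
        return self.proposer.propose(value)

class Proposer:
    """Accepts the client request and advocates the state change to the acceptors."""
    def __init__(self, acceptors, learners):
        self.acceptors, self.learners = acceptors, learners
    def propose(self, value):
        # The real agreement protocol is developed on the following slides;
        # here every acceptor simply says yes.
        votes = [a.consider(value) for a in self.acceptors]
        if all(votes):
            for learner in self.learners:
                learner.learn(value)
            return True
        return False

class Acceptor:
    """Executes the agreement protocol and records the accepted state change."""
    def consider(self, value):
        return True        # placeholder: real acceptance rules come later

class Learner:
    """Acts on the chosen value, e.g. executes the request and replies to the client."""
    def learn(self, value):
        print("learned:", value)

# Wiring the roles together (a single proposer doubling as the Paxos leader):
acceptors = [Acceptor() for _ in range(3)]
client = Client(Proposer(acceptors, [Learner()]))
client.request("grant lock on /file.txt to client-1")
```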

A typical flow in Paxos
- The client sends a request to the proposer (e.g., "I need a lock on the file")
- The Paxos implementation runs its internal protocol between the proposer and the acceptors
- The decision is delivered to the learners/workers, which act on it

Basic Paxos: the first stab
- Basic Paxos: one decision only
- Prepare: the proposer sends a Prepare message with the proposed value to the acceptors
- Confirm: the acceptors confirm its reception
- Accept: the proposer asks the acceptors to accept
- Accepted: the acceptors respond, indicating that they have accepted the value; the response is sent to the proposer and to each learner (learners act as a secondary storage mechanism)

Basic Paxos: the first stab, illustrated
- The proposer sends Prepare(v) with v = "grant lock to a process"; the acceptors reply Confirm(v); the proposer sends Accept(v); the acceptors reply Accepted(v)

Basic Paxos: the first stab, the problem
- Two proposers send Prepare(v) and Prepare(u) concurrently; some acceptors confirm v, others confirm u
- The protocol is blocked!

Basic Paxos: breaking symmetry with proposer ids
- Proposers p and q send Prepare(p) and Prepare(q); the acceptors reply Confirm(p) and Confirm(q)
- The proposer ids break the tie, so only one Accept goes through: Accept(u,q) is answered with Accepted(u,q)

Basic Paxos: failure of the proposer
- Proposers p and q send Prepare(p) and Prepare(q) and receive confirmations, but the winning proposer fails before sending Accept
- The protocol is blocked!

Basic Paxos: changes to the protocol
- Need to monitor active proposers for failures and establish additional proposers if a current one fails
- Proposals are indexed by an increasing sequence number N, giving a timestamp T = {N, id}
- The same T is used in the Prepare and the subsequent Accept
- Confirm is replaced with Promise (see the acceptor sketch below):
  - If T in the Prepare is lower than or equal to some timestamp previously received by the acceptor from any proposer, the acceptor ignores the proposal
  - Otherwise, it returns a promise to ignore all future proposals with a timestamp lower than T
- Thus, acceptors may give a promise to multiple proposals, with increasing timestamps!

Basic Paxos: sequencing proposals
- Proposer p sends Prepare(T={1,p}) and receives Promise({1,p}); proposer q sends Prepare(T={1,q}) and receives Promise({1,q})
- After a timeout, p retries with a higher timestamp: Prepare(T={2,p})
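A minimal sketch of the acceptor-side Prepare/Promise rule just described, assuming timestamps are (N, proposer-id) pairs compared lexicographically; the message format is hypothetical.

```python
# Sketch of the acceptor's Prepare/Promise handling. Timestamps T = (N, id)
# are compared lexicographically, so they are totally ordered and unique.

class Acceptor:
    def __init__(self):
        self.promised_ts = None            # highest timestamp promised so far

    def on_prepare(self, ts):
        # Ignore the proposal if its timestamp is not higher than anything
        # already received from any proposer.
        if self.promised_ts is not None and ts <= self.promised_ts:
            return None                    # silently ignore
        self.promised_ts = ts              # promise to ignore lower timestamps
        return ("Promise", ts)

acc = Acceptor()
print(acc.on_prepare((1, "p")))   # ('Promise', (1, 'p'))
print(acc.on_prepare((1, "q")))   # ('Promise', (1, 'q')): higher, so a new promise
print(acc.on_prepare((1, "p")))   # None: already promised a higher timestamp
```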

Basic Paxos: failure of the proposer (cont'd)
- Proposer p gets Accept(v,{1,p}) accepted by an acceptor (Accepted(v,{1,p})) and then fails; proposer q sends Prepare({1,q})
- If the acceptors go along with the second proposal (Promise({1,q}) followed by Accept(u,{1,q})), two different proposals are accepted; otherwise, the protocol is blocked

Basic Paxos: more changes to the protocol
- The Promise message includes the most recent previously accepted value (or null if none); this allows proposers to keep track of previous accepts
- If there were no previous accepts, a proposer includes its own wish (i.e., value) in the Accept
- Otherwise, it has to use a previously accepted value
- Acceptors perform the same timestamp check when accepting as when promising
- They may accept (not only promise) multiple proposals; however, all accepted proposals will have the same value
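A small sketch of the proposer-side rule just described: after collecting the Promise replies, adopt the value from the promise with the highest accepted timestamp, and fall back to the proposer's own wish only if no acceptor has accepted anything. The function name and data layout are illustrative.

```python
# Each promise is either None (nothing accepted yet) or a pair
# (accepted_timestamp, accepted_value) reported by that acceptor.

def choose_value(promises, own_wish):
    accepted = [p for p in promises if p is not None]
    if not accepted:
        return own_wish                          # free to propose its own value
    # Must adopt the previously accepted value with the highest timestamp.
    _, value = max(accepted, key=lambda tv: tv[0])
    return value

# No acceptor has accepted anything: the proposer may use its own wish.
print(choose_value([None, None, None], "grant lock to q"))   # grant lock to q
# One acceptor already accepted 'v' with timestamp (1, 'p'): adopt 'v'.
print(choose_value([None, ((1, "p"), "v"), None], "u"))      # v
```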

Basic Paxos: using previously accepted values
- Proposer p gets v accepted with timestamp {1,p}; when proposer q later sends Prepare({1,q}), the acceptor replies Promise({1,q}, v), so q's Accept carries v rather than its own value: Accept(v,{1,q})

Basic Paxos: dueling proposers
- With multiple acceptors, p and q can keep preempting each other: each Prepare obtains Promise(..., null), but before the corresponding Accept succeeds, the other proposer times out and issues a Prepare with a higher timestamp
- In theory, this can go on forever; in practice, the probability goes down exponentially with the number of rounds

Basic Paxos: failure of an acceptor
- The protocol as described above blocks if an acceptor fails
- Idea: collect responses from a quorum of acceptors, for both Promise and Accepted
- With a majority quorum, this tolerates the failure of any minority of acceptors
- The proposer can send to all acceptors and wait for a quorum, or send to a quorum only and wait for all quorum members
- This makes it possible for two different values to be accepted, but only one of them can be accepted by a quorum
- Thus, learners can act when they hear the same value from a quorum (see the learner sketch below)

Basic Paxos: selecting values for Accept
- One acceptor has accepted v with {1,p} while another has accepted u with {1,q} (Accepted(u,{1,q})); when p later sends Prepare({2,p}) it receives Promise({2,p},{{1,p},v}) and Promise({2,p},{{1,q},u})
- Tricky: the proposer has to use the previously accepted value with the highest timestamp, so it sends Accept(u,{2,p})
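A small sketch of the learner-side quorum rule: with 2f+1 acceptors a majority quorum of f+1 suffices, and the learner acts only once the same proposal has been reported Accepted by a quorum. The class is hypothetical.

```python
# Learner that acts only when the same (timestamp, value) proposal has been
# reported Accepted by a majority of the acceptors.

class Learner:
    def __init__(self, n_acceptors):
        self.quorum = n_acceptors // 2 + 1   # majority quorum
        self.accepted = {}                   # (ts, value) -> set of acceptor ids
        self.decided = None

    def on_accepted(self, acceptor_id, ts, value):
        voters = self.accepted.setdefault((ts, value), set())
        voters.add(acceptor_id)
        if self.decided is None and len(voters) >= self.quorum:
            self.decided = value             # safe: a quorum accepted it
        return self.decided

learner = Learner(n_acceptors=3)
print(learner.on_accepted("a1", (1, "p"), "v"))   # None (only 1 of 2 needed)
print(learner.on_accepted("a3", (1, "p"), "v"))   # 'v'  (quorum reached)
```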

Basic Paxos: the complete protocol
- The protocol proceeds over several rounds; a successful round has two phases

Phase 1a: Prepare
- A proposer selects a quorum of acceptors and sends a Prepare message to the selected quorum
- The message contains a timestamp T = {N, id}, with N greater than any previous sequence number used by this proposer

Phase 1b: Promise
- If T is lower than or equal to some timestamp previously received by the acceptor from any proposer, the acceptor ignores the proposal
- Otherwise, it returns a promise to ignore all future proposals with a timestamp lower than T
- If the acceptor accepted a proposal in the past, the promise message must include the corresponding sequence number and accepted value

Phase 2a: Accept request
- The proposer waits for Promise messages from all quorum members and selects a value to be accepted:
  - If none of the Promise messages contains a previously accepted value, the proposer may choose any value
  - Otherwise, the proposer finds the Promise message with the highest timestamp and chooses the value in that message
- It selects a quorum of acceptors and sends the Accept request with the chosen value and the same T as in Phase 1a to the selected quorum

Phase 2b: Accepted
- An acceptor accepts an Accept request with timestamp T if and only if it has not promised to any Prepare with timestamp > T; the Accepted message is sent to the proposer and to every learner
- Otherwise, it ignores the request
- When a learner gets Accepted with the same value from a quorum, it acts
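Putting the two phases together, here is a compact, illustrative single-decree acceptor implementing the Promise (Phase 1b) and Accepted (Phase 2b) rules above; the message tuples and field names are assumptions for the sketch, not part of the lecture.

```python
# Single-decree Paxos acceptor combining Phase 1b and Phase 2b.
# Timestamps are (N, proposer_id) pairs compared lexicographically.

class Acceptor:
    def __init__(self):
        self.promised_ts = None        # highest Prepare timestamp promised
        self.accepted_ts = None        # timestamp of the last accepted proposal
        self.accepted_value = None

    def on_prepare(self, ts):                    # Phase 1b
        if self.promised_ts is not None and ts <= self.promised_ts:
            return None                          # ignore the proposal
        self.promised_ts = ts
        # The promise carries the most recently accepted proposal, if any.
        return ("Promise", ts, self.accepted_ts, self.accepted_value)

    def on_accept(self, ts, value):              # Phase 2b
        # Accept iff no promise was made to a Prepare with a higher timestamp.
        if self.promised_ts is not None and self.promised_ts > ts:
            return None                          # ignore the request
        self.promised_ts = ts
        self.accepted_ts, self.accepted_value = ts, value
        return ("Accepted", ts, value)           # sent to proposer and learners

acc = Acceptor()
print(acc.on_prepare((1, "p")))        # ('Promise', (1, 'p'), None, None)
print(acc.on_accept((1, "p"), "v"))    # ('Accepted', (1, 'p'), 'v')
print(acc.on_prepare((2, "q")))        # ('Promise', (2, 'q'), (1, 'p'), 'v')
print(acc.on_accept((1, "p"), "v"))    # None: promised (2, 'q') > (1, 'p')
```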

Multi-Paxos
- A typical deployment of Paxos uses a continuous stream of agreed values acting as commands to update a distributed state machine
- To achieve this, the instance number I is included along with each proposal: Prepare(T,I), etc. (see the sketch below)
- The decisions must form a sequence: agreement on instance I+1 can only start after agreement has been reached on instance I
- Important optimization: if the leader is stable, Phase 1 becomes unnecessary; the leader immediately starts with Accept, reaching agreement faster and with fewer messages

Optimizations and extensions of Paxos
- Can tolerate F acceptor failures with F+1 main acceptors and F spare acceptors, by dynamically reconfiguring after each failure
- A client can communicate directly with the acceptors without going through the leader: faster when there are no conflicts, but conflict resolution becomes more expensive and complex
- Conflicting proposals can be accepted and merged if the two proposed operations are commutative
- Paxos may also be extended to support Byzantine failures of the participants
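A tiny sketch of the Multi-Paxos idea: every proposal carries an instance number I, each instance runs its own single-decree agreement, and the agreed values form a log that is applied in order to the replicated state machine. The classes (and the always-accepting acceptor) are hypothetical placeholders.

```python
# Multi-Paxos skeleton: one single-decree instance per log slot. With a stable
# leader, Phase 1 is run once and each new command goes straight to Phase 2
# (Accept) with the next instance number.

class MultiPaxosLeader:
    def __init__(self, acceptors):
        self.acceptors = acceptors
        self.next_instance = 0          # I: index of the next log slot
        self.log = {}                   # I -> chosen command

    def submit(self, command):
        i = self.next_instance
        self.next_instance += 1
        # Only Phase 2 is needed here; Phase 1 was done when the leader was elected.
        accepted = [a.on_accept(i, command) for a in self.acceptors]
        if sum(x is not None for x in accepted) > len(self.acceptors) // 2:
            self.log[i] = command        # chosen: applied in order of I
        return self.log.get(i)

class FakeAcceptor:
    def on_accept(self, instance, command):
        return ("Accepted", instance, command)   # always accepts in this toy sketch

leader = MultiPaxosLeader([FakeAcceptor() for _ in range(3)])
print(leader.submit("x = 1"))    # instance 0
print(leader.submit("y = 2"))    # instance 1
```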

Introduction to transactions
- Servers can offer concurrent access to the objects/data the service encapsulates
- Applications frequently need to perform sequences of operations as undivided units => atomic transactions
- The server can offer persistent storage of objects/data => motivation for continued operation after a server process has failed
- The service can be provided by a group of servers => distributed transactions

Distributed transactions
- A client transaction T that invokes operations on multiple servers X, Y, and Z

Transactional service
- Offers access to resources via transactions
- Cooperation between clients and transactional servers
- Operations of a transactional service (see the sketch below):
  - OpenTransaction() -> TransId
  - CloseTransaction(TransId) -> {commit, abort}
  - AbortTransaction(TransId)
- All operations between OpenTransaction and CloseTransaction are said to be performed in a transactional context

Completing a transaction
- Commit point for transaction T:
  - All operations in T that access the server database are successfully performed
  - The effect of the operations is made permanent (typically by recording them in a log)
- We say that transaction T is committed
- The service (or the database system) has put itself under an obligation: the results of T are made permanent in the database
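A minimal sketch of the transactional-service interface listed above. The operation names follow the slide; the in-memory bookkeeping is illustrative only, since a real service would log operations durably so that committed effects survive crashes.

```python
import uuid

class TransactionalService:
    def __init__(self):
        self.transactions = {}        # trans_id -> list of buffered operations

    def open_transaction(self):       # OpenTransaction() -> TransId
        trans_id = str(uuid.uuid4())
        self.transactions[trans_id] = []
        return trans_id

    def perform(self, trans_id, operation):
        # Every operation between open and close runs in the transactional context.
        self.transactions[trans_id].append(operation)

    def close_transaction(self, trans_id):   # CloseTransaction(TransId) -> commit/abort
        ops = self.transactions.pop(trans_id)
        # Commit point: the effects would now be made permanent (e.g. logged).
        return "commit", ops

    def abort_transaction(self, trans_id):   # AbortTransaction(TransId)
        self.transactions.pop(trans_id, None)

svc = TransactionalService()
t = svc.open_transaction()
svc.perform(t, "withdraw(A, 4)")
svc.perform(t, "deposit(B, 4)")
print(svc.close_transaction(t))
```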

Component roles in distributed transactions
- Distributed system components involved in a transaction can play the role of:
  - Transactional client
  - Transactional server
  - Coordinator

Coordinator
- Plays a key role in managing the transaction
- The component that handles the begin/commit/abort operations
- Allocates globally unique transaction identifiers
- Includes new servers in the transaction (via the Join operation) and monitors all the participants
- Typical implementation: the first server that the client contacts (by invoking OpenTransaction) becomes the coordinator for the transaction

Transactional server
- Serves as a proxy for each resource that is accessed or modified under transactional control
- Must know its coordinator, which is passed as a parameter in the AddServer operation
- Registers its participation in the transaction with the coordinator, by invoking the Join operation at the coordinator
- Must implement a commitment protocol (such as two-phase commit, 2PC)

Transactional client
- Sees the transaction only through the coordinator
- Invokes operations at the coordinator: OpenTransaction, CloseTransaction, AbortTransaction
- The implementation of the transaction protocol (such as 2PC) is transparent to the client
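A small illustrative sketch of how these roles are wired together: the first server contacted becomes the coordinator, later servers learn about it through AddServer and register themselves with Join. The operation names follow the slides, but the class bodies are hypothetical placeholders.

```python
import itertools

class Coordinator:
    _ids = itertools.count(1)
    def __init__(self):
        self.participants = {}                    # trans_id -> set of servers

    def open_transaction(self):                   # invoked by the client
        trans_id = next(self._ids)                # globally unique id (sketch only)
        self.participants[trans_id] = set()
        return trans_id

    def join(self, trans_id, server):             # invoked by transactional servers
        self.participants[trans_id].add(server)

class TransactionalServer:
    def __init__(self, name):
        self.name = name
    def add_server(self, trans_id, coordinator):  # AddServer(T, coordinator)
        coordinator.join(trans_id, self)          # register participation via Join

coord = Coordinator()
t = coord.open_transaction()
branch_y = TransactionalServer("BranchY")
branch_y.add_server(t, coord)
print(len(coord.participants[t]))                 # 1 participant registered
```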

Example
- The client opens transaction T at BranchX (holding account A), which becomes the coordinator
- Step 3: BranchY.AddServer(T, BranchX); step 3a: BranchX.Join(T, BranchY) registers BranchY (account B)
- Step 5a: BranchX.Join(T, BranchZ) registers BranchZ (accounts C and D)
- Step 9: BranchX starts the commitment protocol

The non-blocking atomic commit problem (intuition)
- Multiple autonomous distributed servers
- Prior to committing the transaction, all the transactional servers must verify that they can locally perform commit
- If any server cannot perform commit, all the servers must perform abort

The non-blocking atomic commit problem (formal)
- Uniform agreement: all processes that decide, decide on the same value; decisions are not reversible
- Validity: commit can only be reached if all processes vote for commit
- Non-triviality: if all voted commit and there are no (suspicions of) failures, then the decision must be commit
- Termination: if after some time there are no more failures, then eventually all live processes decide

The 2PC protocol: why two phases?
- A one-phase protocol is insufficient: it does not allow a server to perform a unilateral abort (e.g., in the case of a deadlock)
- Rationale for two phases: phase one is agreement, phase two is execution

Phase one: agreement
- The coordinator asks all servers whether they are able to perform commit (the CanCommit?(T) call)
- Server responses:
  - Yes: the server will perform commit if the coordinator requests it, but it does not yet know whether it will commit; that is determined by the coordinator
  - No: the server performs an immediate abort of the transaction
- Servers can unilaterally perform abort, but they cannot unilaterally perform commit

Phase two: execution
- The coordinator collects all replies from the servers (including itself) and decides to:
  - commit, if all replied Yes
  - abort, if at least one replied No
- The coordinator propagates its decision to the servers; all participants perform the DoCommit(T) call if the decision is commit, and the AbortTransaction(T) call otherwise
- If the decision is commit, the servers notify the coordinator right after they have performed the DoCommit(T) call, by invoking HaveCommitted(T) back on the coordinator
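A compact sketch of the two phases as they would run at the coordinator. The call names mirror the slides (CanCommit?, DoCommit, AbortTransaction); the participant objects and their vote handling are simplified placeholders.

```python
def two_phase_commit(trans_id, participants):
    # Phase one: agreement. Ask every participant (the coordinator votes too).
    votes = [p.can_commit(trans_id) for p in participants]        # CanCommit?(T)
    decision = "commit" if all(votes) else "abort"

    # Phase two: execution. Propagate the decision to every participant.
    for p in participants:
        if decision == "commit":
            p.do_commit(trans_id)                                  # DoCommit(T)
        else:
            p.abort_transaction(trans_id)                          # AbortTransaction(T)
    return decision

class Participant:
    def __init__(self, name, vote):
        self.name, self.vote = name, vote
    def can_commit(self, t):
        return self.vote          # Yes: will commit if asked, but may not yet
    def do_commit(self, t):
        print(self.name, "committed", t)
    def abort_transaction(self, t):
        print(self.name, "aborted", t)

print(two_phase_commit("T1", [Participant("X", True), Participant("Y", True)]))
print(two_phase_commit("T2", [Participant("X", True), Participant("Y", False)]))
```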

The 2PC protocol: communication steps and states

  Coordinator                                     Server (participant)
  step  state                                     step  state
  1     Ready to commit (waits for replies)       2     Ready to commit (uncertainty)
  3     committed                                 4     committed
        performed

2PC state diagram
- States: Init (not in transaction), Ready to commit, Aborted, Committed, Performed
- The Performed state exists for the coordinator only

2PC: when a previously failed server recovers
- The recovery action depends on the state the process was in when it failed:

  State       Coordinator             Participant
  Init        Nothing                 Nothing
  Ready       AbortTransaction        GetDecision(T)
  Committed   Sends DoCommit(T)       Sends HaveCommitted(T)
  Performed   Nothing                 (coordinator only)

2PC: when a process detects a failure
- What happens if a coordinator or a participant does not receive a message it expects to receive?
- A participant in the Ready state must figure out the state of the other participants
- What if all remaining participants are in the Ready state? This is known as blocking
- There are more advanced protocols (3PC) that block in fewer cases, but they impose higher overhead during normal operation
- 2PC is the most widely used protocol; if the network might partition, blocking is unavoidable
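A small sketch of the participant-side behaviour just described: a participant that is (back) in the Ready state cannot decide unilaterally and asks the coordinator, or the other participants, for the outcome (GetDecision); if nobody reachable knows the decision, it stays blocked. The node class and call names beyond GetDecision are hypothetical.

```python
# Participant recovery/failure handling in the Ready state.

def recover_ready_participant(trans_id, coordinator, other_participants):
    # GetDecision(T): first try the coordinator, then the other participants.
    for node in [coordinator, *other_participants]:
        decision = node.get_decision(trans_id) if node is not None else None
        if decision in ("commit", "abort"):
            return decision
    return "blocked"     # everyone reachable is also uncertain: must wait

class Node:
    def __init__(self, known=None):
        self.known = known               # None if this node is also uncertain
    def get_decision(self, trans_id):
        return self.known

print(recover_ready_participant("T1", Node("commit"), [Node(None)]))    # commit
print(recover_ready_participant("T1", None, [Node(None), Node(None)]))  # blocked
```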

Summary
- The atomic commitment problem and its solutions
- CORBA Transaction Service: implements 2PC; requires resources to be transaction-enabled
- Transactions and EJB: programmatic and declarative transactions; the container provides support for distributed transactions based on CORBA OTS and the X/Open XA protocol; the EJB container/server implements the Java Transaction API (JTA) and the Java Transaction Service (JTS)
- Extended transaction models and OASIS BTP for B2B transactions