Distributed Systems Consensus

Amir H. Payberah (amir@sics.se)
Amirkabir University of Technology (Tehran Polytechnic)
1393/6/31

What is the Problem?

Two Generals Problem
Two generals must agree on a time to attack in order to win.
They communicate through messengers, who may be killed on the way.
Agreement is the problem.

Replicated State Machine Problem (1/2)
The solution: replicate the server.

Replicated State Machine Problem (2/2)
Make the server deterministic (a state machine).
Replicate the server.
Ensure that correct replicas step through the same sequence of state transitions. (How?)
Agreement is the problem.

The Agreement Problem
Some nodes propose values (or actions) by sending them to the others.
All nodes must decide whether to accept or reject those values.
But there are complications:
  Concurrent processes and uncertainty about timing, the order of events, and inputs.
  Failure and recovery of machines/processors and of communication channels.

Agreement Requirements
Safety
  Validity: only a value that has been proposed may be chosen.
  Agreement: no two correct nodes choose different values.
  Integrity: a node chooses at most once.
Liveness
  Termination: every correct node eventually chooses a value.
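
The four requirements can be phrased as checks over the record of one finished run. The sketch below is only an illustration: the function name and the `proposed`, `decisions`, and `correct` inputs are hypothetical, and termination is a liveness property, so it can only be checked this way for a run that is known to be complete.

```python
def check_agreement_run(proposed, decisions, correct):
    """Return the sorted list of requirements violated by one finished run.

    proposed  -- set of values that some node proposed
    decisions -- dict: node -> list of values that node chose, in order
    correct   -- set of nodes that never failed
    (All names are hypothetical, chosen for this illustration.)
    """
    violations = set()
    chosen_by_correct = set()
    for node, values in decisions.items():
        # Integrity: a node chooses at most once.
        if len(values) > 1:
            violations.add("integrity")
        # Validity: only a value that has been proposed may be chosen.
        if any(v not in proposed for v in values):
            violations.add("validity")
        if node in correct:
            chosen_by_correct.update(values)
    # Agreement: no two correct nodes choose different values.
    if len(chosen_by_correct) > 1:
        violations.add("agreement")
    # Termination: every correct node eventually chooses a value.
    if any(not decisions.get(node) for node in correct):
        violations.add("termination")
    return sorted(violations)
```

For example, a run in which two correct nodes choose different proposed values is reported as violating only agreement.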

Agreement in Distributed Systems: Possible Solutions
Two-Phase Commit (2PC)
Paxos

Two-Phase Commit (2PC)

The Two-Phase Commit (2PC) Problem
The problem was first encountered in database systems.
Suppose a database system is updating some complicated data structures that include parts residing on more than one machine.
System model:
  Concurrent processes and uncertainty about timing, the order of events, and inputs (asynchronous systems).
  Failure and recovery of machines/processors and of communication channels.

Intuitive Example (1/3)
You want to organize an outing with 3 friends at 6pm on Tuesday.
Go out only if all friends can make it.

Intuitive Example (2/3)
What do you do?
Call each of them and ask whether they can do 6pm on Tuesday (voting phase).
If all can do Tuesday, call each friend back to confirm (commit).
If one cannot do Tuesday, call the others to cancel (abort).

Intuitive Example (3/3)
Critical details:
  While you are calling everyone to ask, people who have promised they can do 6pm Tuesday must reserve that slot.
  You need to remember the decision and tell anyone you could not reach during the commit/abort phase.
That is exactly how 2PC works.

The 2PC Players
Coordinator (Transaction Manager)
  Begins the transaction.
  Responsible for commit/abort.
Participants (Resource Managers)
  The servers holding the data used in the distributed transaction.

The 2PC Algorithm
Phase 1 - prepare phase
Phase 2 - commit phase

The 2PC Algorithm - Prepare Phase
The coordinator asks each participant cancommit.
A participant must prepare to commit, using permanent storage, before answering yes.
  It locks the objects.
  It is not allowed to cause an abort after it replies yes to cancommit.
The outcome of the transaction is uncertain until docommit or doabort arrives.
  Other participants might still cause an abort.

The 2PC Algorithm - Commit Phase
The coordinator collects all votes.
  If they are unanimously yes, it causes a commit.
  If any participant voted no, it causes an abort.
The fate of the transaction is decided atomically at the coordinator, once all participants have voted.
  The coordinator records the fate using permanent storage.
  It then broadcasts docommit or doabort to the participants.
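
The two phases can be sketched as a single coordinator function. This is a minimal in-memory illustration, not a production protocol: the `Participant` class, the method names `can_commit`, `do_commit`, and `do_abort`, and the use of a Python list as "permanent storage" are all assumptions made for the example.

```python
def two_phase_commit(participants, log):
    """Run one 2PC round and return "commit" or "abort".

    participants -- objects offering can_commit(), do_commit(), do_abort()
    log          -- list standing in for the coordinator's permanent storage
    """
    # Phase 1 (prepare): collect a vote from every participant.
    votes = []
    for p in participants:
        try:
            votes.append(p.can_commit())
        except TimeoutError:
            votes.append(False)  # assumption: a missing vote counts as "no"
    # Phase 2 (commit): decide, record the fate, then broadcast it.
    decision = "commit" if votes and all(votes) else "abort"
    log.append(decision)  # recorded before any participant is told
    for p in participants:
        if decision == "commit":
            p.do_commit()
        else:
            p.do_abort()
    return decision


class Participant:
    """Toy participant that always gives the same vote (illustration only)."""

    def __init__(self, vote):
        self.vote = vote
        self.state = "init"

    def can_commit(self):
        # A real participant would lock its objects and write to disk here.
        self.state = "prepared" if self.vote else "aborted"
        return self.vote

    def do_commit(self):
        self.state = "committed"

    def do_abort(self):
        self.state = "aborted"
```

With two participants that both vote yes, the function returns "commit" and both end in the committed state; a single no vote makes every participant abort.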

2PC Sequence of Events

Recovery in 2PC
Recovery after timeouts.
Recovery after crashes and reboots.
Note: in a realistic asynchronous network, you cannot tell these two cases apart.

Handling Timeout
Timeouts keep processes from blocking forever.
Two scenarios:
  The coordinator is waiting for votes from participants.
  A participant is waiting for the final decision.

Handling Timeout at Coordinator
If B voted no, can the coordinator unilaterally abort?
If B voted yes, can the coordinator unilaterally abort/commit?
On a timeout while waiting for votes, the coordinator aborts and sends doabort to the participants.

Handling Timeout at Participants
If B times out on the TM and has voted yes, it executes a termination protocol.
Simple protocol: the participant remains blocked until it can re-establish communication with the coordinator.
Cooperative protocol: the participant sends a decision-request message to the other participants.
  B sends a status message to A.
  If A has received commit/abort from the TM, ...
  If A has not responded to the TM, ...
  If A has responded with no/yes, ...

Handling Crash and Recovery (1/2)
All nodes must log protocol progress.
  Participants: prepared, uncertain, committed/aborted
  Coordinator: prepared, committed/aborted, done
Nodes cannot back out once commit is decided.

Handling Crash and Recovery (2/2)
Coordinator crashes:
  If it finds no commit on disk, it aborts.
  If it finds commit, it commits.
Participant crashes:
  If it finds no yes on disk, it aborts.
  If it finds yes, it runs the termination protocol to decide.
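
These recovery rules amount to a lookup in the node's own log after reboot. The sketch below assumes, purely for illustration, that the log is a list of strings such as "yes", "commit", and "abort"; the function names and the "run-termination-protocol" return value are hypothetical. It also assumes that a participant which already logged the final outcome simply reuses it.

```python
def recover_coordinator(log):
    """Decide the transaction's fate from the coordinator's log after a reboot."""
    # No commit record on disk: abort. Commit record found: commit.
    return "commit" if "commit" in log else "abort"


def recover_participant(log):
    """Decide what a rebooted participant should do next."""
    # Assumption: an outcome that was already logged is simply reused.
    for outcome in ("commit", "abort"):
        if outcome in log:
            return outcome
    # No yes vote on disk: the participant never promised, so it may abort.
    if "yes" not in log:
        return "abort"
    # It voted yes but does not know the outcome: it must ask around.
    return "run-termination-protocol"
```

The last branch is the uncertain state: a participant that voted yes cannot decide alone, which is where blocking can arise.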

Fault-Tolerance Limitations of 2PC
Even with recovery enabled, 2PC is not really fault-tolerant (or live), because it can block when even one (or a few) machines fail.
Blocking means that the protocol makes no progress while the failures last.
Any scenarios?

2PC Blocking Scenario
The TM sends docommit to A, A receives it and commits, and then both the TM and A die.
B, C, and D have already replied yes, have locked their mutexes, and now must wait for the TM or A to reappear.
They cannot recover the decision with certainty until the TM or A is back online.
This is why 2PC is called a blocking protocol: 2PC is safe, but not live.

Impossibility of Distributed Consensus with One Faulty Process

FLP
Fischer-Lynch-Paterson (FLP)
M.J. Fischer, N.A. Lynch, and M.S. Paterson, Impossibility of distributed consensus with one faulty process, Journal of the ACM, 1985.

FLP Impossibility Result
No deterministic protocol can guarantee that a set of processes in an asynchronous system agrees on a binary value, even if only a single process is subject to an unannounced failure.
The core of the problem is asynchrony.

FLP Impossibility Result
What FLP says: you cannot guarantee both safety and progress when there is even a single fault at an inopportune moment.
What FLP does not say: how close you can get, in practice, to the ideal (always safe and live).
So, Paxos...

Paxos
A completely safe and largely live agreement protocol, and the most widely known of its kind.
L. Lamport, The part-time parliament, ACM Transactions on Computer Systems, 1998.

The Paxos Players
Proposers
  Suggest values for consideration by acceptors.
Acceptors
  Consider the values proposed by proposers.
  Render an accept/reject decision.
Learners
  Learn the chosen value.
A node can play more than one role (usually all three).

Single Proposal, Single Acceptor
Use just one acceptor.
  It collects the proposers' proposals.
  It decides the value and tells everyone else.
Sound familiar?
  It resembles two-phase commit (2PC).
  If the acceptor fails, the protocol blocks.

Single Proposal, Multiple Acceptors
One acceptor is not fault-tolerant enough.
Let's have multiple acceptors.
From there, they must reach a decision. How?
Decision = value accepted by the majority.
P1: an acceptor must accept the first proposal it receives.

Multiple Proposals, Multiple Acceptors
If there are multiple proposals, no proposal may get a majority.
  For example, 3 proposals may each get 1/3 of the acceptors.
Solution: acceptors can accept multiple proposals, distinguished by a unique proposal number.
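
One common way to obtain unique, totally ordered proposal numbers is to pair a round counter with the proposer's identifier and compare the pairs lexicographically. The deck does not specify a scheme, so the sketch below is an assumption; `next_proposal_number` and its parameters are hypothetical names.

```python
def next_proposal_number(highest_seen, proposer_id):
    """Return a proposal number larger than highest_seen and unique to proposer_id.

    A proposal number is a (round, proposer_id) tuple. Python compares tuples
    lexicographically, so the round dominates and the id breaks ties.
    """
    round_no, _ = highest_seen
    return (round_no + 1, proposer_id)
```

Two proposers that start from the same `highest_seen` obtain different numbers as long as their identifiers differ, and either number is larger than everything seen so far.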

Handling Multiple Proposals
All chosen proposals must have the same value.
P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen also has value v.
P2a: ... accepted ...
P2b: ... proposed ...

The Paxos Algorithm
Phase 1a - prepare phase
Phase 1b - promise phase
Phase 2a - accept phase
Phase 2b - accepted phase

Paxos Algorithm

Paxos Algorithm - Prepare Phase
A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.

Paxos Algorithm - Promise Phase
If an acceptor receives a prepare request with a number n greater than that of any prepare request it has already seen:
  It responds yes to that request, with a promise not to accept any more proposals numbered less than n.
  It includes the highest-numbered proposal (if any) that it has accepted.

Paxos Algorithm - Accept Phase
If the proposer receives a yes response to its prepare requests from a majority of acceptors:
  It sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or a value of the proposer's own choosing if the responses reported no accepted proposals.

Paxos Algorithm - Accepted Phase
If an acceptor receives an accept request for a proposal numbered n:
  It accepts the proposal unless it has already responded to a prepare request with a number greater than n.
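
The four phases can be put together in a small single-value sketch. It is an in-memory illustration under simplifying assumptions: messages are plain method calls that never get lost, requests go to every acceptor rather than to just a majority, and the names `Acceptor`, `on_prepare`, `on_accept`, and `propose` are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = None  # highest prepare number this acceptor answered
        self.accepted = None  # (number, value) of its highest accepted proposal

    def on_prepare(self, n):
        """Phase 1b: promise, or return None to ignore a stale request."""
        if self.promised is not None and n <= self.promised:
            return None
        self.promised = n
        return ("promise", n, self.accepted)

    def on_accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered prepare was answered."""
        if self.promised is not None and n < self.promised:
            return False
        self.promised = n
        self.accepted = (n, value)
        return True


def propose(acceptors, n, my_value):
    """Run both phases for proposal n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1a/1b: send prepare(n) and gather the promises.
    replies = (a.on_prepare(n) for a in acceptors)
    promises = [r for r in replies if r is not None]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered accepted proposal reported,
    # or fall back to the proposer's own value if none was reported.
    reported = [acc for (_, _, acc) in promises if acc is not None]
    value = max(reported, key=lambda acc: acc[0])[1] if reported else my_value
    # Phase 2a/2b: send accept(n, value) and count the acceptances.
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

With three fresh acceptors, `propose(acceptors, 1, "x")` returns "x". A later `propose(acceptors, 2, "y")` also returns "x", because the proposer must adopt the value that was already accepted.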

Definition of Chosen
A value is chosen at proposal number n iff a majority of acceptors accept that value in phase 2 of that proposal number.

Paxos Example

Paxos - Safety (1/3)
If a value v is chosen at proposal number n, then any value sent out in phase 2 of any later proposal number must also be v.

Paxos - Safety (2/3)
Decision = majority (any two majorities share at least one element).
Therefore, after the first round in which there is a decision, any subsequent round involves at least one acceptor that has accepted v.
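
The claim that any two majorities share at least one element can be checked by brute force for small acceptor sets. The sketch below enumerates every majority and tests every pair; the function names are hypothetical, and this is a sanity check rather than a proof.

```python
from itertools import combinations


def majorities(acceptors):
    """Return every subset that contains more than half of the acceptors."""
    need = len(acceptors) // 2 + 1
    return [set(c)
            for k in range(need, len(acceptors) + 1)
            for c in combinations(acceptors, k)]


def all_majorities_intersect(acceptors):
    """True if every pair of majorities has at least one acceptor in common."""
    quorums = majorities(acceptors)
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))
```

The general argument is a counting one: two sets that each hold more than half of the N acceptors have more than N members in total, so they cannot be disjoint.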

Paxos - Safety (3/3)
Now suppose our claim is not true, and let m be the first proposal number later than n for which the value sent out in the 2nd phase is w ≠ v.
This is not possible: if the proposer P was able to start the 2nd phase for w, it got a majority of acceptors to respond in the round for m (with m > n).
That majority shares at least one acceptor with the majority that accepted v at n.
So, either:
  v would not have been the value decided, or
  v would have been proposed by P.
Therefore, once a majority accepts v, that never changes.

Liveness
If two or more proposers race to propose new values, they might step on each other's toes all the time.
  P1: prepare(n1)
  P2: prepare(n2)
  P1: accept(n1, v1)
  P2: accept(n2, v2)
  P1: prepare(n3)
  P2: prepare(n4)
  ...
  n1 < n2 < n3 < n4 < ...
With randomness (for example, random delays before retrying), this occurs exceedingly rarely.
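
One way to introduce that randomness is a randomized exponential backoff before a proposer retries with a higher number. The deck does not say which mechanism it has in mind, so this is an assumed technique; `retry_delay` and its `base`, `cap`, and `rng` parameters are hypothetical.

```python
import random


def retry_delay(attempt, base=0.05, cap=2.0, rng=random):
    """Return how many seconds to wait before retry number `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly below it, so that two competing proposers are
    unlikely to keep retrying at the same moment.
    """
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0, upper)
```

A proposer whose prepare or accept request is rejected would sleep for `retry_delay(attempt)` seconds before picking a new proposal number, which gives its rival time to finish.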

Paxos is Everywhere
Google: Chubby (Paxos-based distributed lock service)
  Most Google services use Chubby directly or indirectly.
Yahoo: ZooKeeper (distributed coordination service built on Zab, a Paxos-like protocol)
  ZooKeeper is open source and integrates with Hadoop.
UW: Scatter (Paxos-based consistent DHT)

Summary
Replicated State Machine
2PC: blocking
FLP impossibility
Paxos

References
M.J. Fischer, N.A. Lynch, and M.S. Paterson, Impossibility of distributed consensus with one faulty process, Journal of the ACM, 1985.
L. Lamport, The part-time parliament, ACM Transactions on Computer Systems, 1998.
L. Lamport, Paxos made simple, ACM SIGACT News, 2001.
J. Gray and L. Lamport, Consensus on transaction commit, ACM Transactions on Database Systems, 2006.

Questions?

Acknowledgements
Some slides were derived from Jinyang Li's slides (New York University).