Consensus, impossibility results and Paxos
Ken Birman

Consensus: a classic problem
- The consensus abstraction underlies many distributed systems and protocols.
- N processes start execution with inputs in {0,1}.
- Asynchronous, reliable network.
- At most 1 process fails by halting (crash).
- Goal: a protocol whereby all processes decide the same value v, and v was some process's input.

Distributed Consensus
Asynchronous networks:
- No common clocks or shared notion of time (local ideas of time are fine, but different processes may have very different clocks).
- No way to know how long a message will take to get from A to B.
- Messages are never lost in the network.
[Cartoon: "Jenkins, if I want another yes-man, I'll build one!" Lee Lorenz / Brent Sheppard]

Quick comparison
- Asynchronous model: reliable message passing, unbounded delays. Real world: just resend until acknowledged; often have a delay model.
- Asynchronous model: no partitioning faults ("wait until over"). Real world: may have to operate during partitioning.
- Asynchronous model: no clocks of any kind. Real world: clocks, but limited synchronization.
- Asynchronous model: crash failures, which can't be detected reliably. Real world: usually detect failures with a timeout.

Fault-tolerant protocol (sketched below)
- Collect votes from all N processes.
- At most one is faulty, so if one doesn't respond, count that vote as 0.
- Compute the majority.
- Tell everyone the outcome; they decide (they accept the outcome).
- ... but this has a problem! Why?
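The naive vote-collection protocol above can be sketched in a few lines. This is a minimal illustration rather than anything from the lecture; `send` and `receive_with_timeout` are hypothetical transport helpers, and the timeout stands in for the unreliable failure detection the next slides discuss.

```python
# Minimal sketch of the naive majority-vote protocol (illustrative only).
# send() and receive_with_timeout() are hypothetical transport helpers.

def coordinator_decide(my_input, peers, send, receive_with_timeout):
    """Collect one {0,1} vote per process, treating a silent process as 0."""
    votes = [my_input]
    for p in peers:
        send(p, "vote-request")
    for p in peers:
        v = receive_with_timeout(p, timeout=1.0)   # None if p seems to have crashed
        votes.append(v if v is not None else 0)    # count a missing vote as 0
    decision = 1 if sum(votes) > len(votes) // 2 else 0
    for p in peers:
        send(p, ("outcome", decision))             # everyone accepts the outcome
    return decision
```

The problem the last bullet hints at: in an asynchronous network the timeout cannot distinguish a crashed process from a slow one, so a slow voter may be counted as 0, and when the vote is nearly a tie the outcome depends on that arbitrary choice.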

What makes consensus hard?
- Fundamentally, the issue revolves around membership.
- In an asynchronous environment, we can't detect failures reliably.
- A faulty process stops sending messages, but a slow message might confuse us.
- Yet when the vote is nearly a tie, this confusing situation really matters.

Fischer, Lynch and Paterson: a surprising result
- "Impossibility of Distributed Consensus with One Faulty Process."
- They prove that no asynchronous algorithm for agreeing on a one-bit value can guarantee that it will terminate in the presence of crash faults.
- And this is true even if no crash actually occurs!
- The proof constructs infinite, non-terminating runs.

Core of FLP result
- They start by looking at systems whose inputs are all the same: all 0s must decide 0, all 1s must decide 1.
- Now they explore mixtures of inputs and find some initial set of inputs with an uncertain ("bivalent") outcome.
- They focus on this bivalent state.

Self-quiz questions
- When is a state univalent as opposed to bivalent?
- Can the system be in a univalent state if no process has actually decided?
- What causes a system to enter a univalent state?

Self-quiz questions
- Suppose that event e moves us into a univalent state, and e happens at p. Might p decide immediately?
- Now sever communications from p to the rest of the system. Both event e and p's decision are hidden.
- Does this matter in the FLP model? Might it matter in real life?

Bivalent state
[Diagram: S* denotes a bivalent state, S0 a decision-0 state, S1 a decision-1 state. From S0, sooner or later all executions decide 0; from S1, sooner or later all executions decide 1.]
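As a reference point for the self-quiz, here is the standard formalization of valency used in the FLP setting; the wording is mine, not the slides'. D(C) denotes the set of decision values reachable from configuration C by some execution.

```latex
% Valency of a configuration C (standard FLP terminology; not from the slides).
\[
  C \text{ is 0-valent if } D(C) = \{0\}, \qquad
  C \text{ is 1-valent if } D(C) = \{1\}, \qquad
  C \text{ is bivalent if } D(C) = \{0,1\}.
\]
```

Note that a configuration can already be univalent (0-valent or 1-valent) before any process has actually decided: the outcome is determined but not yet announced.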

Bivalent state
[Diagram: e is a critical event that takes us from the bivalent state S* to a univalent state; after it, eventually we'll decide 0.]
- They delay e and show that there is a situation in which the system will return to a bivalent state.

Bivalent state
[Diagram: a sequence of events, including delivery of e, leading back to bivalent states S*.]
- In this new state they show that we can deliver e, and that the new state will still be bivalent!
- Notice that we made the system do some work and yet it ended up back in an uncertain state.
- We can do this again and again.

Core of FLP result, in words
- In an initially bivalent state, they look at some execution that would lead to a decision state, say 0.
- At some step this run switches from bivalent to univalent, when some process receives some message m.
- They now explore executions in which m is delayed.

Core of FLP result
- Initially we are in a bivalent state. Delivery of m would make us univalent, but we delay m.
- They show that if the protocol is fault-tolerant there must be a run that leads to the other univalent state.
- And they show that you can deliver m in this run without a decision being made.
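The pumping step in the last two slides is usually captured by the key lemma of the FLP paper. This is the standard statement, not the slides' wording: C is a bivalent configuration and e = (p, m) is an event applicable to C.

```latex
% Key lemma behind the "pump forever" step (standard FLP statement, not from the slides).
\[
  \mathcal{C} = \{\, E \;:\; E \text{ is reachable from } C \text{ without applying } e \,\}, \qquad
  \mathcal{D} = \{\, e(E) \;:\; E \in \mathcal{C} \text{ and } e \text{ is applicable to } E \,\}.
\]
\[
  \text{Then } \mathcal{D} \text{ contains a bivalent configuration.}
\]
```

Applying the lemma over and over produces arbitrarily long runs that keep returning to bivalent configurations, which is exactly the non-terminating execution the slides describe.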

Core of FLP result
- This proves the result: a bivalent system can be forced to do some work and yet remain in a bivalent state.
- We can pump this to generate indefinitely long runs that never decide.
- Interesting insight: no failures actually occur (just delays). FLP attacks a fault-tolerant protocol using fault-free runs!

Intuition behind this result?
- Think of a real system trying to agree on something in which process p plays a key role.
- But the system is fault-tolerant: if p crashes, it adapts and moves on.
- Their proof tricks the system into treating p as if it had failed, but then lets p resume execution and rejoin.
- This takes time, and no real progress occurs.

But what did "impossibility" mean?
- In formal proofs, an algorithm is totally correct if it computes the right thing and it always terminates.
- When we say something is "possible", we mean there is a totally correct algorithm solving the problem.

But what did "impossibility" mean?
- FLP proves that any fault-tolerant algorithm solving consensus has runs that never terminate.
- These runs are extremely unlikely ("probability zero"), yet they imply that we can't find a totally correct solution.
- "Consensus is impossible" thus means "consensus is not always possible".

Solving consensus
- Systems that solve consensus often use a group membership service (GMS).
- The GMS functions as an oracle, a trusted status-reporting function.
- The consensus protocol then involves a kind of 2-phase protocol that runs over the output of the GMS.
- It is known precisely when such a solution will be able to make progress.
[Diagram: the GMS in a large system. Global events are inputs to the GMS; its output is the official record of the events that mattered to the system.]

Paxos Algorithm
- A distributed consensus algorithm.
- Doesn't use a GMS, at least in the basic version, but isn't very efficient either.
- Guarantees safety, but not liveness.
- Key assumptions: the set of processes that run Paxos is known a priori; processes suffer crash failures; all processes have Greek names (but translate as Fred, Cynthia, Nancy...).

Paxos proposal
- A node proposes to append some information to a replicated history.
- The proposal could be a decision value, hence Paxos can solve consensus.
- Or it could be some other information, such as Frank's new salary or the position of Air France flight 21.

Paxos Algorithm
- Proposals are associated with a version number.
- Processes vote on each proposal; a proposal approved by a majority will get passed.
- The size of a majority is well known because the potential membership of the system was known a priori.
- A process considering two proposals approves the one with the larger version number.

Paxos Algorithm
- 3 roles: proposer, acceptor, learner.
- 2 phases: Phase 1, prepare request / response; Phase 2, accept request / response.

Phase 1 (prepare request)
(1) A proposer chooses a new proposal version number n and sends a prepare request (prepare, n) to a majority of acceptors:
  (a) Can I make a proposal with number n?
  (b) If yes, do you suggest some value for my proposal?

Phase 1 (prepare request)
(2) If an acceptor receives a prepare request (prepare, n) with n greater than that of any prepare request it has already responded to, it sends out (ack, n, n', v') or (ack, n, ⊥, ⊥):
  (a) It responds with a promise not to accept any more proposals numbered less than n.
  (b) It suggests the value v' of the highest-numbered proposal that it has accepted, if any, else ⊥.
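A minimal sketch of the acceptor's Phase 1 behavior, assuming the message format described above; the class and field names are my own illustration, not the lecture's or any particular Paxos implementation's.

```python
# Sketch of an acceptor's prepare handling (illustrative only).
# Negative acknowledgements and persistence to stable storage are omitted.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest prepare number responded to so far
        self.accepted_n = None      # number of the highest accepted proposal, if any
        self.accepted_v = None      # value of that proposal

    def on_prepare(self, n):
        """Return (ack, n, accepted_n, accepted_v), or None to ignore a stale prepare."""
        if n > self.promised_n:
            self.promised_n = n                      # promise: accept nothing numbered < n
            return ("ack", n, self.accepted_n, self.accepted_v)
        return None
```

Returning `(None, None)` for the accepted number and value plays the role of (⊥, ⊥) in the slide's notation.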

Phase 2 (accept request)
(3) If the proposer receives responses from a majority of the acceptors, it can issue an accept request (accept, n, v) with number n and value v:
  (a) n is the number that appeared in the prepare request.
  (b) v is the value of the highest-numbered proposal among the responses.

Phase 2 (accept request)
(4) If an acceptor receives an accept request (accept, n, v), it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

Learning the decision (in well-behaved runs)
- Whenever an acceptor accepts a proposal, it sends (accept, n, v) to all learners.
- A learner that receives (accept, n, v) from a majority of acceptors decides v and sends (decide, v) to all other learners.
- Learners that receive (decide, v) decide v.
[Diagram of a well-behaved run: proposer 1 sends (prepare, 1) to acceptors 1..n, which answer (ack, 1, ⊥, ⊥); the proposer sends (accept, 1, v1); the acceptors forward (accept, 1, v1) to the learners, which decide v1.]

Paxos is safe
- Intuition: if a proposal with value v is decided, then every higher-numbered proposal issued by any proposer has value v.
- A majority of acceptors accept (n, v) and v is decided; consider the next prepare request, with proposal number n+1 (what if n+k?).

Safety (proof sketch)
- Suppose (n, v) is the earliest proposal that passed. If there is none, safety holds.
- Let (n', v') be the earliest proposal issued after (n, v) with a different value v' != v.
- As (n', v') passed, it required a majority of acceptors. Thus some acceptor approved both (n, v) and (n', v'), and in Phase 1 it would suggest value v with version number k >= n.
- As (n', v') passed, its proposer must have received a response (ack, n', j, v') to its prepare request, with n <= j < n'. Considering (j, v') we get a contradiction.
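The rule in step (3b), re-proposing the value of the highest-numbered proposal among the acks, is what the safety argument relies on. A minimal sketch, with hypothetical message tuples matching the slides' (ack, n, accepted_n, accepted_v) notation:

```python
# Sketch of the proposer's Phase 2 value choice (illustrative only).
# acks is a list of ("ack", n, accepted_n, accepted_v) tuples from a majority of acceptors.

def choose_value(acks, my_value):
    """Return the value of the highest-numbered accepted proposal, else our own value."""
    best_n, best_v = -1, None
    for _tag, _n, accepted_n, accepted_v in acks:
        if accepted_n is not None and accepted_n > best_n:
            best_n, best_v = accepted_n, accepted_v
    return best_v if best_v is not None else my_value


# Example: one acceptor already accepted (2, "B"), so "B" must be re-proposed.
acks = [("ack", 3, None, None), ("ack", 3, 2, "B"), ("ack", 3, 1, "A")]
assert choose_value(acks, "C") == "B"
```

If no acceptor in the majority has accepted anything, the proposer is free to use its own value, which is what happens in the well-behaved run shown in the diagram above.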

Liveness
- Per FLP, we cannot guarantee liveness.
- The paper gives a scenario with 2 proposers in which no decision can ever be made (a sketch of this run appears after these slides).

Liveness (cont.)
- Omissions cause the liveness problem; partitioning failures would look like omissions in Paxos.
- Repeated omissions can delay decisions indefinitely (a scenario like the FLP one).
- But Paxos doesn't block in case of a lost message: Phase 1 can start with a new rank even if previous attempts never ended.

Liveness (cont.)
- As the paper points out, selecting a distinguished proposer (leader election) will solve the problem.
- This is how the view management protocol of virtual synchrony systems works: GMS view management implements Paxos with leader election.
- The protocol becomes a 2-phase commit, with a 3-phase commit when the leader fails.

A small puzzle: how does Paxos scale?
- Assume that as we add nodes, each node behaves iid relative to the others; hence the likelihood of concurrent proposals rises as O(n).
- Core Paxos has 3 linear phases, but the expected number of rounds rises too, so we get O(n^2), or O(n^3) with failures.

How does Paxos scale?
- Another, subtle scaling issue: suppose we are worried about the memory used to buffer pending decisions and other messages.
- Under heavy load, the round-trip delay to reach a majority of the servers limits the clearing time.
- This works out to something like an O(n log n) or O(n^2) cost, depending on how you implement the protocol.
- This is a kind of time-space complexity that has never really been studied; we'll see why it matters in an upcoming lecture.

Paxos in real life
- Used, but not widely. For example, Google uses Paxos in their lock server.
- One issue is that Paxos gets complex if we need to reconfigure it to change the set of nodes running the protocol.
- Another is that other, more scalable alternatives are available.
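The two-proposer livelock mentioned under Liveness can be sketched as follows. This is my own illustration, under the assumption that each proposer's accept request always arrives after the other proposer's next prepare has already been acknowledged; it is not the paper's exact scenario.

```python
# Sketch of the dueling-proposers livelock (illustrative only).
# Each accept(m) is delivered only after the rival's prepare has raised the promise past m,
# so no accept request ever succeeds and no value is chosen.

def dueling_proposers(rounds=3):
    promised = 0                       # highest prepare number the acceptors have promised
    in_flight = None                   # (proposer, number) of the outstanding accept request
    proposers = ["P1", "P2"]
    n = 0
    for _ in range(2 * rounds):
        p = proposers[0]
        n += 1
        promised = n                   # p's prepare(n) is acked, raising the promise
        if in_flight is not None:
            who, m = in_flight         # the rival's accept(m) only arrives now...
            print(f"{who}: accept({m}) refused (acceptors already promised {promised})")
        in_flight = (p, n)             # p's own accept(n) will likewise arrive too late
        proposers.reverse()            # the other proposer takes the next turn
    print("no accept request ever succeeded, so no value is chosen")

dueling_proposers()
```

Electing a distinguished proposer, as the next slide suggests, removes exactly this interleaving.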

Summary
- Consensus is "impossible", but this doesn't turn out to be a big obstacle: we can achieve consensus with probability one in many situations.
- Paxos is an example of a consensus protocol, and a very simple one.
- We'll look at other examples Thursday.