
Consensus I: FLP Impossibility, Paxos
COS 418: Distributed Systems, Lecture 7
Michael Freedman

Recall our 2PC commit problem
Client C, transaction coordinator TC, and bank servers A and B:
1. C → TC: go!
2. TC → A, B: prepare!
3. A, B → TC: yes or no
4. TC → A, B: commit! or abort!
(See the coordinator sketch below.)

Recall our 2PC commit problem
Doing failover correctly isn't easy:
- Who acts as TC? Which server(s) own the account of A? Of B?
- Which node takes over as backup?
- Who takes over if TC fails? What if A or B fail?
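To make the four-step flow above concrete, here is a minimal Go sketch of the coordinator's decision rule. It is an illustration only, not the lecture's code; the Participant interface and all names are hypothetical, and a real TC would also log its decision and handle timeouts.

// Minimal 2PC coordinator sketch; all names here are hypothetical.
package twopc

// Participant is one resource manager (e.g., bank server A or B).
type Participant interface {
	Prepare() bool // phase 1 vote: true = "yes", false = "no"
	Commit()       // phase 2 outcome: make the transaction durable
	Abort()        // phase 2 outcome: roll the transaction back
}

// RunCommit is the TC's rule: commit only if every participant votes
// yes; a single "no" (or, in practice, a timeout) forces a global abort.
func RunCommit(participants []Participant) bool {
	for _, p := range participants {
		if !p.Prepare() {
			for _, q := range participants {
				q.Abort()
			}
			return false
		}
	}
	for _, p := range participants {
		p.Commit()
	}
	return true
}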

Doing failover correctly isn't easy
Okay, so specify some ordering among the servers (manually, using some identifier): 1, 2, 3. But who determines that server 1 failed?

Doing failover correctly isn't easy
Easy, right? Just ping and time out! But is the server actually dead, or is the network dead, or merely slow?
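The slide's caveat is easy to state in code. Here is a sketch of timeout-based detection (mine, with a hypothetical ping function, not from the lecture); note that its answer is only ever a suspicion.

// Timeout-based failure "detection". The result is a suspicion, not a
// fact: from here, a crashed server and a slow network look identical.
package detect

import "time"

func suspectFailed(ping func() error, timeout time.Duration) bool {
	done := make(chan error, 1)
	go func() { done <- ping() }()
	select {
	case err := <-done:
		return err != nil // got an answer (possibly an error reply)
	case <-time.After(timeout):
		return true // no answer in time: dead? slow? we cannot tell
	}
}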

What can go wrong?
Two nodes think they are the TC: the split-brain scenario.

What can go wrong?
Safety invariant: only one node is TC at any single time.
Another problem: A and B need to know (and agree upon) who the TC is.

Consensus
Definition:
1. A general agreement about something
2. An idea or opinion that is shared by all the people in a group
Origin: Latin, from consentire.

Consensus
Given a set of processors, each with an initial value:
- Termination: all non-faulty processes eventually decide on a value.
- Agreement: all processes that decide do so on the same value.
- Validity: the value decided must have been proposed by some process.
(These properties are restated as a code contract below.)

Consensus used in systems
A group of servers attempting to:
- Make sure all servers in the group receive the same updates in the same order as each other
- Maintain their own lists (views) of the current members of the group, and update those lists when somebody leaves or fails
- Elect a leader of the group, and inform everybody
- Ensure mutually exclusive access (one process at a time) to a critical resource like a file

Step one: Define your system model
- Network model: synchronous (time-bounded delay) or asynchronous (arbitrary delay); reliable or unreliable communication; unicast or multicast communication
- Node failures: fail-stop (correct/dead) or Byzantine (arbitrary)
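As promised above, the three properties can be read as the contract of a small interface. This Go sketch is my restatement, not anything from the lecture; the names are illustrative only.

// A contract for one instance of consensus, with the three properties
// as documentation on the methods.
package consensus

type Node interface {
	// Propose supplies this node's initial value.
	Propose(v string)
	// Decide blocks until a value is decided. Any implementation must
	// guarantee:
	//   Termination: every non-faulty node's Decide eventually returns.
	//   Agreement:   all nodes that decide return the same value.
	//   Validity:    the decided value was Propose()d by some node.
	Decide() string
}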

FLP result: consensus is impossible
"Abandon hope, all ye who enter here."
No deterministic 1-crash-robust consensus algorithm exists for the asynchronous model.
- Holds even for weak consensus (i.e., only some process needs to decide, not all)
- Holds even with only two states: 0 and 1

Main technical approach
The initial state of the system determines whether it can end in decision 0 or 1. Consider 5 processes, each with some initial value:
[1,1,0,1,1] → 1
[1,1,0,1,0] → ?
[1,1,0,0,0] → ?
[1,1,1,0,0] → ?
[1,0,1,0,0] → 0
There must exist two configurations here which differ in decision.

Main technical approach
Filling in the unknowns, somewhere two adjacent configurations differ in decision, say:
[1,1,0,1,1] → 1
[1,1,0,1,0] → 1
[1,1,0,0,0] → 1
[1,1,1,0,0] → 0
[1,0,1,0,0] → 0
Assume the decision differs between these two configurations, which differ only in one process's initial value.

Main technical approach
Goal: consensus holds in the face of 1 failure. Suppose the one process whose value differs between [1,1,0,0,0] and [1,1,1,0,0] crashes before communicating: the remaining processes cannot tell the two configurations apart, yet one must decide 1 and the other 0. So one of these configs must be "bi-valent": both futures remain possible.
Key result: all bi-valent states can remain in bi-valent states after performing some work.

You won't believe this one trick!
1. The system thinks process p crashed, and adapts to it
2. But then p recovers and q crashes
3. The system must wait for p to rejoin, because it can only handle 1 failure, and adapting takes time
4. Repeat ad infinitum

All is not lost
But remember: "impossible" in the formal sense, i.e., there does not exist an always-terminating algorithm, even though such adversarial schedules are extremely unlikely.
Circumventing FLP impossibility:
- Probabilistically: randomization
- Partial synchrony (e.g., failure detectors)

Why should you care?
Werner Vogels, Amazon CTO, on job openings in his group: "What kind of things am I looking for in you? You know your distributed systems theory: you know about logical time, snapshots, stability, message ordering, but also ACID and multi-level transactions. You have heard about the FLP impossibility argument. You know why failure detectors can solve it (but you do not have to remember which one diamond-W was). You have at least once tried to understand Paxos by reading the original paper."

Paxos
Safety:
- Only a single value is chosen
- Only a proposed value can be chosen
- Only chosen values are learned by processes
Liveness (***):
- Some proposed value is eventually chosen if fewer than half of the processes fail
- If a value is chosen, a process eventually learns it

Roles of a process
Three conceptual roles:
- Proposers propose values
- Acceptors accept values; a value is chosen once a majority accept it
- Learners learn the outcome (the chosen value)
In reality, a process can play any or all roles.

Strawman
- 3 proposers, 1 acceptor: the acceptor accepts the first value received. No liveness if the acceptor fails.
- 3 proposers, 3 acceptors: each acceptor accepts the first value received, and the chosen value is the common value known by a majority. But no such majority is guaranteed.

Paxos
Each acceptor accepts multiple proposals. Hopefully one of the accepted proposals will have a majority vote (and we can determine that); if not, rinse and repeat (more on this later).
How do we select among multiple proposals? Ordering: a proposal is a tuple (proposal #, value) = (n, v). Proposal numbers are strictly increasing and globally unique. Globally unique? Trick: set the low-order bits to the proposer's ID.

Paxos protocol overview
Proposers:
1. Choose a proposal number n
2. Ask acceptors whether they have accepted any proposals with n_a < n
3. If an existing proposal's value v_a is returned, propose the same value: (n, v_a)
4. Otherwise, propose own value: (n, v)
Note the altruism: the goal is to reach consensus, not to win.
Acceptors try to accept the value with the highest proposal number n.
Learners are passive and wait for the outcome.

Paxos Phase 1
Proposer: choose proposal number n, send <prepare, n> to the acceptors.
Acceptor:
- If n > n_h: set n_h = n, a promise not to accept any new proposal numbered below n.
  - If no prior proposal was accepted, reply <promise, n, Ø>
  - Else reply <promise, n, (n_a, v_a)>
- Else reply <prepare-failed>

Paxos Phase 2
Proposer: if promises arrive from a majority of acceptors, determine the v_a returned with the highest n_a, if one exists, and send <accept, (n, v_a or v)> to the acceptors.
Acceptor: upon receiving (n, v), if n ≥ n_h, accept the proposal and notify the learner(s), setting n_a = n_h = n and v_a = v.
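Here is a compact Go sketch of the acceptor side of Phases 1 and 2, together with the proposal-number trick. The state names follow the slides (n_h, n_a, v_a), but the types and method names are my own, and a real acceptor would add persistence and RPC.

// Paxos acceptor sketch. State follows the slides: nh is the highest
// proposal number promised; (na, va) is the highest accepted proposal.
package paxos

// proposalNum builds a strictly increasing, globally unique proposal
// number: high bits hold a round counter, low 16 bits the proposer ID.
func proposalNum(round, proposerID uint64) uint64 {
	return round<<16 | (proposerID & 0xFFFF)
}

type Acceptor struct {
	nh uint64      // highest proposal number promised (n_h)
	na uint64      // number of the accepted proposal, 0 if none (n_a)
	va interface{} // accepted value, nil if none (v_a)
}

// HandlePrepare is Phase 1: promise n if n > n_h, returning any
// previously accepted (n_a, v_a) so the proposer can adopt that value.
func (a *Acceptor) HandlePrepare(n uint64) (ok bool, na uint64, va interface{}) {
	if n > a.nh {
		a.nh = n // promise: never accept a proposal numbered below n
		return true, a.na, a.va
	}
	return false, 0, nil // prepare-failed
}

// HandleAccept is Phase 2: accept (n, v) unless a higher-numbered
// prepare has been promised in the meantime.
func (a *Acceptor) HandleAccept(n uint64, v interface{}) bool {
	if n >= a.nh {
		a.nh, a.na, a.va = n, n, v
		return true // a real acceptor would also notify the learner(s)
	}
	return false
}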

Paxos Phase 3
Learners need to know which value was chosen.
- Approach #1: each acceptor notifies all learners. More expensive.
- Approach #2: elect a distinguished learner; acceptors notify the elected learner, which informs the others. Failure-prone.

Paxos: well-behaved run
Proposer 1 sends <prepare, 1> to acceptors 2..n, which reply <promise, 1>. It then sends <accept, (1, v1)>; the acceptors reply <accepted, (1, v1)> and v1 is decided. (See the runnable sketch below.)

Paxos is safe
Intuition: if a proposal with value v is decided, then every higher-numbered proposal issued by any proposer has value v. Once a majority of acceptors accept (n, v), v is decided; the next prepare request, with proposal n+1, must reach at least one acceptor in that majority, which returns (n, v), so the proposer re-proposes v.

Race condition leads to liveness problem
- Process 0 completes phase 1 with proposal n0.
- Process 1 starts and completes phase 1 with proposal n1 > n0.
- Process 0 performs phase 2; the acceptors reject.
- Process 0 restarts and completes phase 1 with proposal n2 > n1.
- Process 1 performs phase 2; the acceptors reject.
...and this can go on indefinitely.
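The well-behaved run above maps directly onto the acceptor sketch from Phase 2 (same hypothetical names, same package): one proposer, three acceptors, no contention.

// A well-behaved run: phase 1 gathers promises from a majority, and
// since no acceptor reports a previously accepted (n_a, v_a), phase 2
// proposes the proposer's own value, which all three accept.
func wellBehavedRun() {
	acceptors := []*Acceptor{{}, {}, {}}
	n := proposalNum(1, 7) // round 1, proposer ID 7

	promises := 0
	for _, a := range acceptors {
		if ok, _, _ := a.HandlePrepare(n); ok {
			promises++ // fresh acceptors return no prior (n_a, v_a)
		}
	}

	if promises > len(acceptors)/2 {
		for _, a := range acceptors {
			a.HandleAccept(n, "v1") // all accept: "v1" is decided
		}
	}
}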

Paxos with leader election
Simplify the model by having each process play all three roles. If the elected proposer can communicate with a majority, the protocol guarantees liveness. Paxos tolerates f failures for f < N/2.

Using Paxos in a system
Leader election decides the transaction coordinator: servers 1, 2, 3 elect a leader L.

Using Paxos in a system
New leader election protocol: servers L, 2, 3 elect L_new. We still have the split-brain scenario!

The Part-Time Parliament (Lamport's original Paxos paper)
Tells a mythical story of the Greek island of Paxos, whose legislators passed the current law through a parliamentary voting protocol. A misunderstood paper: submitted in 1990, published in 1998. Lamport won the Turing Award in 2013.

The Paxos story
As Paxos prospered, legislators became very busy. Parliament could no longer handle all the details of government, so a bureaucracy was established. Instead of passing a decree to declare whether each lot of cheese was fit for sale, Parliament passed a decree appointing a cheese inspector to make those decisions.
(Cheese inspector = leader, appointed via a quorum-based voting protocol.)

The Paxos story
Parliament passed a decree making Δικστρα the first cheese inspector. After some months, merchants complained that Δικστρα was too strict and was rejecting perfectly good cheese. Parliament then replaced him by passing the decree "1375: Γωυδα is the new cheese inspector". But Δικστρα did not pay close attention to what Parliament did, so he did not learn of this decree right away. There was a period of confusion in the cheese market when both Δικστρα and Γωυδα were inspecting cheese and making conflicting decisions. Split-brain!

The Paxos story
To prevent such confusion, the Paxons had to guarantee that a position could be held by at most one bureaucrat at any time. To do this, a president included, as part of each decree, the time and date when it was proposed. A decree making Δικστρα the cheese inspector might read "2716: 8:30 15 Jan 72: Δικστρα is cheese inspector for 3 months". A bureaucrat needed to tell time to determine whether he currently held a post.

The Paxos story
Mechanical clocks were unknown on Paxos, but Paxons could tell time accurately to within 15 minutes by the position of the sun or the stars. If Δικστρα's term began at 8:30, he would not start inspecting cheese until his celestial observations indicated that it was 8:45.
(Handling clock skew: the leader gets a lease! The lease doesn't end until expiry + max skew.)
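The lease rule in the story reduces to a few lines. A sketch with hypothetical names: the new leaseholder may not act until the old lease's expiry plus the maximum clock skew, mirroring the 15-minute celestial-observation margin.

// Lease sketch: a new leader stays quiet until the old lease has
// certainly expired, even on the most skewed clock.
package lease

import "time"

const maxSkew = 15 * time.Minute // the Paxons' observation error

// mayAct reports whether the new leaseholder can start serving: only
// after the previous lease's expiry plus the worst-case clock skew.
func mayAct(prevExpiry, now time.Time) bool {
	return now.After(prevExpiry.Add(maxSkew))
}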

Solving split brain
New leader election protocol: servers L, 2, 3 elect L_new.
Solution: if L isn't part of the majority electing L_new, then L_new waits until L's lease expires before accepting new operations.

Next lecture: Monday
Other consensus protocols with group membership + leader election at their core:
- Viewstamped Replication
- Raft (assignments 3 & 4)