Consensus and related problems

Today
- Consensus
- Google's Chubby
- Paxos for Chubby

Consensus and failures
How do we make processes agree on a value after one or more of them have proposed what that value should be? For example: should we abort or proceed with the mission? We want to reach consensus even if some processes may fail.
- Failure: the system cannot meet its promises
- Error: the part of the system's state that can lead to a failure
- There are many ways to fail

Failure models
Failures
- Crash failures: a component simply halts, but behaves correctly before halting
- Omission failures: a component fails to respond to incoming requests
- Timing failures: correct output, but delivered outside a specified real-time interval
- Response failures: the output is incorrect
- Arbitrary/Byzantine failures: may produce arbitrary output and be subject to arbitrary timing failures
Basic approach to masking faults: redundancy
- Information: add extra bits to a message so it can be reconstructed
- Time: do something more than once if needed
- Physical: add copies of software/hardware

Consensus
Definition: to reach consensus
- Every process p_i begins in the undecided state and proposes a value v_i
- Processes exchange values
- Each process sets the value of a decision variable d_i, entering the decided state
Requirements
- Termination: eventually all correct processes decide
- Agreement: all correct processes decide the same value
- Integrity: if the correct processes all proposed the same value, then any correct process in the decided state has chosen that value

Dreamland solution
Just as an illustration: a system where processes cannot fail
- Each of the N processes p_i R-multicasts its proposed value v_i to the group g
- Every process p_j collects all N values (including its own) and evaluates d_j = majority(v_1, ..., v_N), returning the majority value or NIL if no majority exists
Requirements
- Termination: guaranteed by the reliability of multicast
- Agreement and integrity: by the definition of majority and the integrity property of reliable multicast (every process receives the same set of values and runs the same function, so all must agree on the same value)
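
To make the failure-free case concrete, here is a minimal Python sketch; the function names and the way R-multicast is simulated (every process is simply handed the full list of proposals) are assumptions for illustration, not part of the slides.

```python
from collections import Counter

def majority(values):
    """Return the strict majority value among `values`, or None (NIL) if none exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

def dreamland_consensus(proposals):
    """Failure-free consensus: every process receives the same multicast set of
    values and applies the same deterministic function, so all decide alike."""
    return [majority(proposals) for _ in proposals]   # one decision per process

print(dreamland_consensus(["attack", "attack", "retreat"]))
# ['attack', 'attack', 'attack']
```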

Consensus and related problems
Besides consensus
- Byzantine generals problem: 3+ generals need to agree to attack or retreat; a commander issues an order and the lieutenants decide what to do; one or more of the generals may be treacherous. ~Consensus, with a slightly different integrity requirement: if the commander is correct, all correct processes decide on the value the commander proposed
- Interactive consistency: all processes agree on a vector of values, one per process. Similar requirements, e.g. for integrity: if p_i is correct, all correct processes decide on v_i as the i-th entry of their vector
All of these can be defined in the presence of crash or arbitrary failures, and for synchronous or asynchronous systems

Consensus in synchronous systems
Up to f of the N processes may exhibit crash failures
- The algorithm proceeds in f+1 rounds; in each round the correct processes B-multicast their values between themselves
- The round duration is set based on the maximum time a correct process needs to multicast a message
Algorithm for process p_i (Values_i^r holds the proposed values known to p_i at the beginning of round r)
  On initialization: Values_i^1 := {v_i}; Values_i^0 := {}
  1. In each round r in [1, f+1]:
  2.   B-multicast(g, Values_i^r - Values_i^{r-1})   // send only new values
  3.   Values_i^{r+1} := Values_i^r
  4.   while (in round r) {
  5.     On B-deliver(V_j) from some p_j:
  6.       Values_i^{r+1} := Values_i^{r+1} U V_j
  7.   }
  8. After f+1 rounds:
  9.   d_i := minimum(Values_i^{f+1})
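
As a rough single-machine simulation of the f+1 round algorithm; the simplifications are assumptions (a process that crashes in round r still multicasts its whole set in that round and nothing afterwards, and B-multicast is modeled as direct set unions).

```python
def synchronous_consensus(proposals, f, crash_round=None):
    """Simulate the f+1 round crash-tolerant algorithm.
    proposals[i] is process i's value; crash_round[i] (optional) is the last
    round in which process i still multicasts before halting."""
    n = len(proposals)
    crash_round = crash_round or {}
    values = [{v} for v in proposals]                 # Values_i^1 = {v_i}
    for r in range(1, f + 2):                         # rounds 1 .. f+1
        new_values = [set(s) for s in values]
        for i in range(n):
            if r > crash_round.get(i, f + 1):
                continue                              # crashed: sends nothing this round
            for j in range(n):
                new_values[j] |= values[i]            # B-multicast of i's known values
        values = new_values
    # every surviving (correct) process decides the minimum of its final set
    return {i: min(values[i]) for i in range(n) if i not in crash_round}

print(synchronous_consensus(["b", "a", "c"], f=1, crash_round={2: 1}))
# {0: 'a', 1: 'a'}  -- both correct processes decide the same value
```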

Consensus in synchronous systems
- Termination is obvious because the system is synchronous (a fixed number of rounds of fixed length)
- Agreement and integrity hold if all correct processes arrive at the same set of values at the end of the final round
  - By contradiction: if two final sets differ, some correct process p_i has a value v that another correct process p_j does not
  - The only possible explanation is that some process p_k managed to send v to p_i but crashed before sending it to p_j
  - But for that, p_k must have received v in a previous round in which p_j did not, so there must be yet another process p_l that crashed while relaying it, and so on, one crash for each of the f+1 rounds
  - That requires f+1 crashes, but we assumed at most f!

Byzantine generals problem
- Three or more generals have to agree to attack or retreat
- One of them, the commander, issues the order to the n-1 lieutenants
- The lieutenants must decide whether to attack or retreat
- One or more generals, m of them, may be treacherous (faulty)
  - If the commander is a traitor, it can propose attack to one lieutenant and retreat to another
  - If a lieutenant is a traitor, it can tell one lieutenant that the commander said attack and tell another that he said retreat
- A variant of consensus with a distinguished process proposing the value
- Integrity: if the commander is correct, all correct processes agree on what the commander proposed

Byzantine generals problem
Assumptions
- Synchronous system: correct processes can detect the absence of a message through a timeout (but cannot conclude that the sender has crashed)
- Arbitrary failures: a faulty process can send any message with any value at any time (or omit to send)
- The communication channels between processes are private
Impossibility with three processes
- [Figure: two three-process runs. First run: C sends 1:v to both L1 and L2; L1 relays 2:1:v to L2 while the faulty L2 relays 3:1:u to L1. Second run: a faulty C sends 1:w to L1 and 1:x to L2; L1 relays 2:1:w and L2 relays 3:1:x. If L1 has to decide v in the first run, it has to choose w in the second, since it cannot distinguish the two runs]
- Lamport et al.'s solution solves the Byzantine generals problem for 3m+1 or more generals in the presence of at most m traitors

Byzantine generals problem
A solution (a Python sketch follows)
Algorithm BG(0)
1. The commander (C) sends its value to every lieutenant
2. Each lieutenant uses the value received from C, or RETREAT if it received no value
Algorithm BG(m), m > 0
1. C sends its value to every lieutenant
2. For each i, let v_i be the value lieutenant i receives from C, or RETREAT if it received none. Lieutenant i acts as C in algorithm BG(m-1) to send the value v_i to each of the n-2 other lieutenants
3. For each i, and each j != i, let v_j be the value lieutenant i received from lieutenant j in step (2) (using algorithm BG(m-1)), or else RETREAT. Lieutenant i uses the value majority(v_1, ..., v_{n-1})
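
A compact Python sketch of this recursion is below. The traitor model (a traitor tells lieutenants with an even id the true value and the others the opposite) is an assumption added purely so the example does something interesting; processes are identified by integers.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"

def flip(v):
    return RETREAT if v == ATTACK else ATTACK

def send(value, sender, receiver, traitors):
    """What `receiver` hears when `sender` sends `value` (assumed traitor model)."""
    if sender in traitors:
        return value if receiver % 2 == 0 else flip(value)
    return value

def bg(m, commander, lieutenants, value, traitors):
    """Return {lieutenant: decided value} for algorithm BG(m)."""
    received = {l: send(value, commander, l, traitors) for l in lieutenants}
    if m == 0:
        return received                      # BG(0): just use what the commander sent
    decision = {}
    for l in lieutenants:
        votes = {l: received[l]}             # v_i: l's own value from the commander
        for other in lieutenants:
            if other == l:
                continue
            # `other` acts as commander in BG(m-1) towards the remaining lieutenants
            sub = bg(m - 1, other, [x for x in lieutenants if x != other],
                     received[other], traitors)
            votes[other] = sub[l]            # v_j: what l heard from lieutenant j
        winner, count = Counter(votes.values()).most_common(1)[0]
        decision[l] = winner if count * 2 > len(votes) else RETREAT
    return decision
```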

A run with N = 4 and f = 1
A lieutenant (L3) is the traitor
- L1: majority(v, v, x) = v
- L2: majority(v, v, y) = v
- [Figure: C sends v to L1, L2 and L3; the faulty L3 relays x to L1 and y to L2]
The commander is the traitor
- L1: majority(v, w, z) = NIL
- L2: majority(v, w, z) = NIL
- L3: majority(v, w, z) = NIL
- [Figure: the faulty C sends v to L1, w to L2 and z to L3; each correct lieutenant relays what it received, so all three end up with the same set {v, w, z}]
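
Using the sketch on the previous slide (with its assumed traitor model and integer process ids), these two calls roughly reproduce the runs above:

```python
# A lieutenant (process 3) is the traitor: both correct lieutenants decide "attack".
print(bg(1, commander=0, lieutenants=[1, 2, 3], value="attack", traitors={3}))
# The commander (process 0) is the traitor: the correct lieutenants still all agree.
print(bg(1, commander=0, lieutenants=[1, 2, 3], value="attack", traitors={0}))
```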

Consensus and related problems
It is possible to derive a solution to one of these problems from a solution to another. Suppose there are solutions to consensus, Byzantine generals and interactive consistency:
- C_i(v_1, v_2, ..., v_N) returns the decision value of p_i in a run of the solution to consensus
- BG_i(j, v) returns the decision value of p_i in a run of the solution to Byzantine generals where the commander p_j proposed the value v
- IC_i(v_1, v_2, ..., v_N)[j] returns the j-th value in the decision vector of p_i in a run of the solution to interactive consistency
IC from BG: run BG N times, once per process
- IC_i(v_1, v_2, ..., v_N)[j] = BG_i(j, v_j)   (i, j = 1..N)
C from IC
- C_i(v_1, v_2, ..., v_N) = majority(IC_i(v_1, v_2, ..., v_N)[1], ..., IC_i(v_1, v_2, ..., v_N)[N])

Consensus, synchronous & asynchronous
- We discussed solutions for consensus and the Byzantine generals problem
- Our solutions assume a synchronous system. What if the system is asynchronous?
- No algorithm can guarantee to reach consensus with even one faulty process [Fischer et al., 1985]
  - Basic idea: there is always a continuation of the processes' execution that avoids reaching consensus (why? we cannot tell whether a process is running slowly or is dead)
  - No guarantee does not mean impossible: consensus can still be reached, it just cannot be guaranteed
- How do we work around this?
  - Masking faults: processes keep data in persistent storage so that they can restart after crashing (Chubby!)
  - Using failure detectors that are "perfect by design": processes agree on a maximum response time (otherwise, the process is deemed to have failed)
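
The second workaround is essentially an engineering decision: agree on a maximum response time and treat anything slower as failed. A minimal sketch of such a timeout-based, "perfect by design" failure detector; the heartbeat interface and the timeout value are assumptions.

```python
import time

class TimeoutFailureDetector:
    """Declare a process failed if no heartbeat arrived within `timeout` seconds.
    'Perfect by design': a process that is merely slow is simply treated as failed."""
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, process_id):
        """Record that a heartbeat from `process_id` was just received."""
        self.last_heartbeat[process_id] = time.monotonic()

    def is_failed(self, process_id):
        last = self.last_heartbeat.get(process_id)
        return last is None or time.monotonic() - last > self.timeout
```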

Google's Chubby
- A fault-tolerant system that provides a distributed locking mechanism and stores small files
- A multi-faceted service providing
  - Coarse-grained distributed locks
  - A file system offering reliable storage of small files
  - Support for electing a primary in a set of replicas
- Used by Google FS and Bigtable; also used as a name service within Google
- That diverse? Not really: at its core it is one service, a practical solution to distributed consensus

Chubby architecture
A Chubby instance, a cell
- Typically one per data center
- Replication for fault tolerance: a few replicas (5), with a master
- Replicas are placed in failure-independent sites to avoid correlated failures
- Replicas have access to persistent storage that survives crashes
- Each replica maintains a small database whose elements are entities in the Chubby namespace
- Applications access the replicas via a Chubby client library
[Figure: a Chubby cell of replicas, each with a log, a local database and snapshots; clients talk to the cell through the Chubby client library over RPC]

Chubby architecture
Session and caching
- A session is a relation between a client and a cell, maintained using a KeepAlive handshake
- Clients cache data, metadata and information on open handles; unlike GFS, small files are accessed repeatedly
Consistency
- A mutating operation (e.g., SetContents) blocks until all associated caches are invalidated
- Invalidations are piggybacked onto KeepAlive messages
- Cached data is never updated directly
- This determinism promoted its use as a name service (DNS allows naming data to become inconsistent)

Chubby interface
Abstraction based on a file system
- Files are organized into a hierarchical namespace: /ls/chubby_cell/directory_name/.../file_name
- As a file system, whole-file operations: reads and writes of files are atomic operations

  Role     Operation           Effect
  General  Open                Opens a file/directory, returns a handle
           Close               Closes the file associated with the handle
           Delete              Deletes the file or directory
  File     GetContentsAndStat  Returns atomically the whole content and the attributes
           GetStat             Returns the attributes
           ReadDir             Returns the contents of a directory
           SetContents         Writes the whole content of a file (atomically)
           SetACL              Writes the access control lists

Chubby interface
As a lock-management tool
- Locks are advisory: checking for conflicts is up to the programmer

  Role  Operation   Effect
  Lock  Acquire     Acquires a lock on a file
        TryAcquire  Tries to acquire a lock on a file
        Release     Releases a lock

Simple support for an event mechanism
- Clients can register for events as an option in the open call (e.g., "file has been modified")
- Notifications are delivered through callbacks

Chubby interface
As a fault-tolerant log
[Figure: client applications submit values to the Paxos framework; each replica appends the value to its local copy of the log and, once the submitted value has entered the fault-tolerant log, the client is notified through a callback]
To support primary election (see the sketch below)
- All candidates attempt to acquire a lock associated with the election
- Only one succeeds: the primary
- The primary writes its id in the associated file
- The others can thus learn the id of the primary
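
Chubby's client library is not public, so the following is only a hypothetical sketch of how primary election through an advisory lock could look; the `chubby` object and its `open`, `try_acquire`, `set_contents` and `get_contents` methods are assumed names, not a real API.

```python
def elect_primary(chubby, my_id, path="/ls/cell/service/primary"):
    """Hypothetical sketch of lock-based primary election (assumed client API).
    Whoever wins the advisory lock writes its id into the file; everyone else
    reads the file to learn who the primary is."""
    handle = chubby.open(path)
    if chubby.try_acquire(handle):
        chubby.set_contents(handle, my_id)       # announce ourselves as primary
        return my_id, True                       # we are the primary
    return chubby.get_contents(handle), False    # somebody else won; learn its id
```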

Chubby and Paxos
Chubby implements consistency using Lamport's Paxos
- Based on maintaining operation logs; as the logs grow long over time, snapshots
Paxos: a family of protocols for distributed consensus
- Replica servers may operate at arbitrary speed and may fail and later recover
- Replica servers have access to stable, persistent storage
- Messages may be lost, reordered or duplicated; they are delivered without corruption but may take arbitrarily long to arrive
- Distributed consensus for asynchronous systems? The impossibility result! Paxos works around it by ensuring correctness but not liveness (i.e., it is not guaranteed to terminate)

Paxos
Replicas can submit a value to achieve consensus on
- Agreement = all replicas have this value as the next entry in their update logs
- The algorithm is guaranteed to eventually achieve consensus if a majority of replicas run long enough with sufficient network stability
- Paxos-L1 (progress): if there exists a stable majority set of replicas, then if a replica in the set initiates an update, some member of the set eventually executes the update
- Paxos-L2 (eventual replication): if replica s executes an update and there is a set of replicas containing s and r, and a time after which the set does not experience any communication or process failures, then r eventually executes the update

Paxos algorithm
1. Elect a replica to be the coordinator
2. The coordinator selects a value and broadcasts it to all replicas in a message called accept; the other replicas either acknowledge it or reject it
3. Once a majority acknowledges, consensus has been reached and the coordinator broadcasts a commit message to notify the replicas
This assumes there is only one coordinator and no failures
[Figure: the coordinator (replica 1) sends "accept v" to replicas 2 and 3, collects their acks, and then broadcasts "commit"]

Paxos algorithm
Multiple coordinators?
- Since coordinators can fail, multiple ones can coexist, and each can select a different value
- For consensus, Paxos introduces two mechanisms
  - Assigning an ordering to the successive coordinators
  - Restricting each coordinator's choice in selecting a value

Paxos algorithm
Electing a coordinator
- To identify the right one, coordinators are ordered by an increasing sequence number; each replica keeps the highest sequence number it has seen
- A replica bidding to be coordinator picks a unique number, higher than any it has seen, and broadcasts it in a propose message
- If the others have not seen a higher bidder, they reply with a promise message indicating they will reject claims from later coordinators with lower numbers; otherwise they send a negative ack ("not voting for you!")
- The promise message also contains the most recent value the sender has heard as a proposal or consensus (or null)
- If a majority of promise messages is received, the replica is elected coordinator (such a majority is known as a quorum)

Paxos algorithm
Electing a coordinator
- To ensure the proposed sequence number is unique: pick the smallest sequence number s that is greater than any sequence number seen so far and such that s mod n = i_r (where the replicas' unique ids i_r are between 0 and n-1)
- E.g., if the number of replicas (n) is 5, the replica's id (i_r) is 3, and the last sequence number seen was 20, the replica will pick sequence number 23 for its next bid
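
This rule is just modular arithmetic; a small helper (the function name is an assumption) makes the slide's example concrete:

```python
def next_proposal_number(highest_seen, replica_id, n_replicas):
    """Smallest s > highest_seen with s % n_replicas == replica_id, so that
    proposal numbers are unique across replicas and strictly increasing."""
    s = highest_seen + 1
    while s % n_replicas != replica_id:
        s += 1
    return s

print(next_proposal_number(20, 3, 5))  # 23, as in the example above
```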

Paxos
Seeking and achieving consensus
- The elected coordinator selects a value and sends an accept message with this value to the associated quorum
- If any of the promise messages contained a value, the coordinator must pick a value from those received: the one attached to the most recent coordinator
- Otherwise it is free to select its own value

Paxos
Seeking and achieving consensus
- Any member of the quorum that receives the accept message must accept the value and acknowledge the acceptance
- The coordinator waits, possibly indefinitely, for a majority to ack the accept message
- If a majority acks, consensus has been reached and the coordinator broadcasts a commit message to notify the replicas
- Otherwise the coordinator abandons the proposal and starts again
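
Putting the last few slides together, a single coordinator round might look like the sketch below. The `Promise` message shape and the replica stubs with `promise`, `accept` and `commit` methods are assumptions used to keep the example short; this is an illustration of the message flow, not a complete Paxos implementation.

```python
from collections import namedtuple

# Assumed shape of a promise reply (not prescribed by the slides).
Promise = namedtuple("Promise", "granted prior_seq prior_value")

def run_coordinator_round(my_value, replicas, seq):
    """One coordinator attempt: propose/promise, then accept/ack, then commit."""
    quorum = len(replicas) // 2 + 1

    # Phase 1: propose / promise
    promises = [r.promise(seq) for r in replicas]
    granted = [p for p in promises if p.granted]
    if len(granted) < quorum:
        return None                                  # outbid: retry with a higher seq
    # If any promise carried a previously heard value, adopt the one attached
    # to the most recent (highest-numbered) earlier coordinator.
    prior = [p for p in granted if p.prior_value is not None]
    value = max(prior, key=lambda p: p.prior_seq).prior_value if prior else my_value

    # Phase 2: accept / ack, then commit
    acks = sum(1 for r in replicas if r.accept(seq, value))
    if acks < quorum:
        return None                                  # abandon the proposal, start again
    for r in replicas:
        r.commit(seq, value)                         # notify replicas of the decision
    return value
```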

Multi-Paxos
- In Chubby there is a need to agree on a sequence of values, so the algorithm must be repeated to agree on a set of entries in the log
- The log records all Paxos actions: log before sending each message (propose, promise, accept, ack, commit) and flush to disk
- With close-by machines, the write to disk dominates the cost
- Certain optimizations are possible, such as electing a coordinator for a while and so avoiding the propose messages; any replica can still try to become the coordinator
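
The "log before sending, then flush" discipline can be sketched in a few lines; the JSON encoding and the helper name are assumptions for illustration.

```python
import json
import os

def log_then_send(log_file, msg, send):
    """Multi-Paxos logging discipline (sketch): persist each protocol message
    and flush it to disk before sending, so a restarting replica can recover."""
    log_file.write(json.dumps(msg) + "\n")
    log_file.flush()
    os.fsync(log_file.fileno())   # the expensive disk write mentioned on the slide
    send(msg)
```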

Summary
- How do we make processes agree on a value after one or more have proposed what the value should be? Consensus
- In synchronous and asynchronous systems
- Not just theory: Chubby!