Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017



Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one that was submitted by at least one process (the consensus algorithm cannot just make up a value) 2

We saw versions of this Mutual exclusion Agree on who gets a resource or who becomes a coordinator Election algorithms Agree on who is in charge Other uses of consensus: Synchronize state to manage replicas: make sure every group member agrees on the message ordering of events Manage group membership Agree on distributed transaction commit General consensus problem: How do we get unanimous agreement on a given value? value = sequence number of a message, key=value, operation, whatever 3

Achieving consensus seems easy! client value = "x=1" Data store One request at a time Server that never dies 4

Servers might die let's add replicas Data store client value = "x=1" Data store Data store One request at a time 5

Reading from replicas is easy Data store Client x=abc Data store Data store We rely on a quorum (majority) to read successfully No quorum = failed read! 6

What about concurrent updates? Client 1 value = "x=1" Data store coordinator value = "x=1" Data store Client 2 value = "x=2" Data store Coordinator processes requests one at a time But now we have a single point of failure! We need something safer 7

Consensus algorithm goal Create a fault-tolerant consensus algorithm that does not block if a majority of processes are working Goal: agree on one result among a group of participants Processors may fail (some may need stable storage) Messages may be lost, out of order, or duplicated If delivered, messages are not corrupted Quorum: majority (>50%) agreement is the key part: If a majority of coins show heads, there is no way that a majority will show tails at the same time. If members die and others come up, there will be one member in common with the old group that still holds the information. 8
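A quick way to see why majority quorums work: any two majorities of the same group of acceptors must overlap in at least one member, so two conflicting decisions can never both gather a quorum. The check below is a minimal illustration (the five acceptor names are made up):

```python
# Any two majority (3-of-5) quorums over the same acceptor set intersect.
from itertools import combinations

acceptors = {"A", "B", "C", "D", "E"}
majorities = [set(q) for q in combinations(acceptors, 3)]   # every 3-of-5 quorum
assert all(q1 & q2 for q1 in majorities for q2 in majorities)
print("any two majority quorums share at least one acceptor")
```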

Consensus requirements Validity Only proposed values may be selected Uniform agreement No two nodes may select different values Integrity A node can select only a single value Termination (Progress) Every node will eventually decide on a value 9

Consensus algorithm: Paxos 10

Paxos Goal: provide a consistent ordering of events from multiple clients All machines running the algorithm agree on a proposed value from a client The value will be associated with an event or action Paxos ensures that no other machine associates the value with another event Fault-tolerant distributed consensus algorithm Does not block if a majority of processes are working The algorithm needs 2P+1 processors to survive the simultaneous failure of P processors Paxos provides abortable consensus A client's request may be rejected It then has to re-issue the request 11
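A minimal sketch of the 2P+1 arithmetic from this slide: a majority quorum of 2P+1 processors has P+1 members, so a quorum can still be formed after P simultaneous failures.

```python
def quorum_size(num_acceptors: int) -> int:
    """Smallest majority of the acceptor group."""
    return num_acceptors // 2 + 1

for p in range(4):                 # tolerate P = 0..3 simultaneous failures
    n = 2 * p + 1                  # 2P+1 acceptors, as stated on the slide
    print(f"{n} acceptors, quorum = {quorum_size(n)}, survives {p} failure(s)")
```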

A Programmer's View Client Process Submit(R) accepted Consensus algorithm Send results to replicas (total order) If your request is not accepted, you can submit it again later: while (submit_request(R) != ACCEPTED) ; Think of R as a key:value pair in a database where multiple clients might want to modify the same key 12

Paxos players Client: makes a request Proposers: get a request from a client and run the protocol to get everyone in the cluster to agree Leader: elected coordinator among the proposers (not necessary, but simplifies message numbering and ensures no contention) we don't need to rely on the presence of a single leader Acceptors: multiple processes that remember the state of the protocol Quorum = any majority of acceptors Learners: when agreement has been reached by the acceptors, a Learner executes the request and/or sends a response back to the client These different roles are usually part of the same system 13

Proposal numbers Paxos ensures a consistent ordering in a cluster of machines Events are ordered by sequential event IDs (N) Client wants to log an event: sends a request to a Proposer E.g., value v = add $100 to my checking account Proposer: increments the latest proposal number (event ID) it knows about ID = sequence number Asks all the acceptors to reserve that proposal # Acceptors: a majority of acceptors have to accept the requested proposal # 14

Proposal Numbers Each proposal has a unique number (created by the proposer) Must be unique (e.g., <sequence #>.<process_id>) Newer proposals take precedence over older ones Each acceptor keeps track of the largest number it has seen so far Lower proposal numbers get rejected The acceptor sends back the {number, value} of the currently accepted proposal Proposer has to play fair: it will ask the acceptors to accept the {number, value} Either its own or the one it got from the acceptor 15
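One common way to realize the <sequence #>.<process_id> numbering is to treat a proposal number as a (sequence, process_id) pair and compare pairs lexicographically; the sketch below assumes that representation (the helper next_proposal is an illustrative name, not from the slides).

```python
def next_proposal(last_seen, my_pid):
    """Return a proposal number larger than anything this proposer has seen."""
    seq, _ = last_seen
    return (seq + 1, my_pid)       # unique per proposer, totally ordered as a tuple

n1 = next_proposal((0, 0), my_pid=7)   # (1, 7)
n2 = next_proposal(n1, my_pid=3)       # (2, 3): newer proposal, so it wins
assert n2 > n1
```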

Paxos in action Goal: have all acceptors agree to a value v associated with a proposal Paxos nodes: one machine may serve several roles Leader Proposer Quorum Proposer Client Proposer Learner Learner 16

Paxos in action: Phase 0 Client sends a request to a proposer Client request(v) Proposer Quorum Learner 17

Paxos in action: Phase 1a PREPARE Proposer: creates a proposal #N (N acts like a Lamport timestamp), where N is greater than any previous proposal number used by this proposer Send to a Quorum of Acceptors (however many you can reach, but at least a majority) Proposer Quorum Client Prepare(N) N = <seq#.process_id> Learner 18

Paxos in action: Phase 1b PROMISE Acceptor: if the proposer's ID N > any previous proposal, promise to ignore all requests with IDs < N reply with info about the highest past proposal: {N, value} Proposer Promise(N, [value]) Quorum Client Promise to ignore all proposals < N Learner Promise contains the previous N 19
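A hedged sketch of the acceptor's Phase 1b decision, assuming the tuple proposal numbers above and keeping the acceptor's state in a plain dict (the field names promised_n and accepted are made up for illustration):

```python
def on_prepare(state, n):
    """Phase 1b: promise to ignore proposals below n; report any accepted value."""
    if state["promised_n"] is None or n > state["promised_n"]:
        state["promised_n"] = n
        return ("PROMISE", n, state["accepted"])   # accepted is None or (N, value)
    return ("NACK", state["promised_n"])           # already promised a higher number

acceptor = {"promised_n": None, "accepted": None}
print(on_prepare(acceptor, (5, 1)))   # ('PROMISE', (5, 1), None)
print(on_prepare(acceptor, (3, 2)))   # ('NACK', (5, 1)): lower number is rejected
```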

Paxos in action: Phase 2a ACCEPT Proposer: if the proposer receives promises from the quorum (majority): Attach a value v to the proposal (the event). Send Accept to the quorum with the chosen value If a promise carried a previously accepted {N', v}, the proposer MUST use the v of the highest accepted proposal Proposer Quorum Client Accept(N', v) Promise to ignore all proposals < N Learner 20
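The key Phase 2a rule, sketched as a helper: if any promise in the quorum reports a previously accepted {N', v}, the proposer must send Accept with the v from the highest-numbered such proposal rather than its own value (choose_value is an illustrative name, not part of the protocol):

```python
def choose_value(promises, my_value):
    """promises: the accepted entry carried by each PROMISE, either None or (n, value)."""
    accepted = [p for p in promises if p is not None]
    if not accepted:
        return my_value                               # nothing accepted yet: propose our value
    return max(accepted, key=lambda nv: nv[0])[1]     # adopt value of highest-numbered proposal

print(choose_value([None, None, None], "x=1"))                 # x=1
print(choose_value([None, ((3, 2), "x=7"), None], "x=1"))      # x=7 (must be adopted)
```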

Paxos in action: Phase 2b ANNOUNCE Acceptor: if the promise still holds, then announce the value v Send an Accepted message to the Proposer and every Learner BUT: if a higher proposal # has been received in the meantime, send a NACK to the proposer so it can try again Proposer Quorum Accepted(N) Client Accepted Learners Announce(N, v) Most often, there are no learners and the "announce" phase is just the execution of the action on the local server 21
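A matching sketch of the acceptor's Phase 2b check, using the same dict-shaped state as the Phase 1b sketch: accept only if the earlier promise still holds, otherwise NACK so the proposer can retry with a higher number.

```python
def on_accept(state, n, value):
    """Phase 2b: accept only if the Phase 1 promise still holds."""
    if state["promised_n"] is None or n >= state["promised_n"]:
        state["promised_n"] = n
        state["accepted"] = (n, value)
        return ("ACCEPTED", n, value)        # also announced to the learners
    return ("NACK", state["promised_n"])     # a higher-numbered prepare arrived meanwhile

acceptor = {"promised_n": (5, 1), "accepted": None}   # promised N = (5, 1) in Phase 1b
print(on_accept(acceptor, (5, 1), "x=1"))             # ('ACCEPTED', (5, 1), 'x=1')
print(on_accept(acceptor, (4, 9), "x=2"))             # ('NACK', (5, 1))
```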

Paxos in action: Phase 3 ANNOUNCE Learner: Respond to client and/or take action on the request Proposer Quorum Client Announce(N, v) Learners Promise to ignore all proposals < N Server Server Server 22

Paxos: A Simple Example All Good 23

Paxos in action: Phase 0 Client sends a request to a proposer Request( e ) Client Proposer Quorum Learner 24

Paxos in action: Phase 1a PREPARE Proposer: picks a sequence number: 5 Send to a Quorum of Acceptors Proposer Quorum Prepare(5: e ) Client Learner 25

Paxos in action: Phase 1b PROMISE Acceptors: Suppose 5 is the highest sequence # any acceptor has seen Each acceptor PROMISES not to accept any lower numbers Proposer Promise(5: e ) Quorum Client Promise to ignore all proposals < 5 Learner 26

Paxos in action: Phase 2a ACCEPT Proposer: the proposer receives promises from a majority of acceptors The proposer must accept that <seq, value> Proposer Quorum Client Accept(5, e ) Promise to ignore all proposals < N Learner 27

Paxos in action: Phase 2b ANNOUNCE Acceptors: state that they accepted the request Proposer Quorum Accepted(5, e ) Client Accepted Announce(5, e ) Learners Announce(5, e ) 28

Paxos: A Simple Example Higher Offer 29

Paxos in action: Phase 0 Client sends a request to a proposer Request( e ) Client Proposer Quorum Learner 30

Paxos in action: Phase 1a PREPARE Proposer: picks a sequence number: 5 Send to a Quorum of Acceptors One acceptor receives a higher offer BEFORE it gets this PREPARE message Prepare(7: f ) Proposer Quorum Prepare(5: e ) Client Learner 31

Paxos in action: Phase 1b PROMISE Acceptors: Suppose 5 is the highest sequence # any acceptor has seen Each acceptor PROMISES not to accept any lower numbers Proposer Promise(5: e ) Quorum Client Promise(5: e ) Promise(7: f ) Learner 32

Paxos in action: Phase 2a ACCEPT Proposer: Proposer receives the higher # offer and MUST change its mind Proposer Quorum Client Accept(7, f ) Promise to ignore all proposals < N Learner 33

Paxos in action: Phase 2b ANNOUNCE Acceptors: state that they accepted the request Proposer Quorum Accepted(7, f ) Client Announce(7, f ) Accepted Learners Announce(7, f ) 34

Paxos: Keep trying if you need to A proposal N may fail because: The acceptor may have made a new promise to ignore all proposals less than some value M > N The proposer does not receive a quorum of responses: either promises (phase 1b) or accepts (phase 2b) The algorithm then has to be restarted with a higher proposal # 35
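A rough sketch of that retry loop, assuming tuple proposal numbers and a stand-in run_one_round function for one prepare/accept attempt (both names are hypothetical):

```python
def propose(value, my_pid, run_one_round):
    """Keep retrying with higher proposal numbers until one round succeeds."""
    n = (1, my_pid)
    while True:
        ok, higher = run_one_round(n, value)   # one attempt: (True, None) or (False, M)
        if ok:
            return n                           # value was chosen under proposal number n
        n = (higher[0] + 1, my_pid)            # restart above the number that beat us

# Toy round: rejects everything below (4, 0), then succeeds.
fake_round = lambda n, v: (True, None) if n >= (4, 0) else (False, (4, 0))
print(propose("x=1", my_pid=2, run_one_round=fake_round))   # (5, 2) after one retry
```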

Paxos summary Paxos allows us to ensure consistent (total) ordering over a set of events in a group of machines Events = commands, actions, state updates Each machine will have the latest state or a previous version of the state Paxos used in: Google Chubby lock manager / name server Apache Zookeeper (clone of Google Chubby) Cassandra lightweight transactions Google Spanner, Megastore Microsoft Autopilot cluster management service from Bing VMware NSX Controller Amazon Web Services, DynamoDB 36

Paxos summary To make a change to the system: Tell the proposer (leader) the event/command you want to add Note: these requests may occur concurrently Leader = one elected proposer. Not necessary for the Paxos algorithm but an optimization to ensure a single, increasing stream of proposal numbers. Cuts down on rejections and retries. The proposer picks its next highest event ID and asks all the acceptors to reserve that event ID If any acceptor has seen a higher event ID, it rejects the proposal & returns that higher event ID The proposer will have to try again with another event ID When the majority of acceptors accept the proposal, accepted events are sent to learners, which can act on them (e.g., update system state) Fault tolerant: need 2k+1 servers for k fault tolerance 37

Implementation Use only one proposer at a time: the leader Other nodes can be active backups just in case the leader dies No need to worry about synchronizing proposal #s those are local per proposer Acts like a fault-tolerant coordinator Avoids failed proposals due to higher numbers from other proposers Alternatively, embed proposer logic into the client library Too many clients issuing concurrent requests can cause a large # of retries Learners are rarely needed Acceptors are often running on the system that processes the request (e.g., data store, log) Just send an acknowledgement directly to the client. 38

The End 39