
Paxos and Replication Dan Ports, CSEP 552

Today: achieving consensus with Paxos and how to use this to build a replicated system

Last week: scaling a web service using front-end caching. But what about the database?

Instead: How do we replicate the database? How do we make sure that all replicas have the same state, even when some replicas aren't available?

Two weeks ago (and ongoing!) Two related answers: Chain Replication, and Lab 2's primary/backup replication. Limitations of this approach: Lab 2 can only tolerate one replica failure (sometimes not even that!), and both need a fault-tolerant view service. How would we make that fault-tolerant?

Last week: Consensus. The consensus problem: multiple processes start w/ an input value; processes run a consensus protocol, then output a chosen value; all non-faulty processes choose the same value.

Paxos: Algorithm for solving consensus in an asynchronous network. Can be used to implement a state machine (VR, Lab 3, upcoming readings!). Guarantees safety w/ any number of replica failures. Makes progress when a majority of replicas are online and can communicate long enough to run the protocol.

Paxos History. 1989-1990: Viewstamped Replication (Liskov & Oki) and Paxos (Leslie Lamport, The Part-Time Parliament). 1998: Paxos paper published. ~2005: first practical deployments. 2010s: widespread use! 2014: Lamport wins the Turing Award.

Why such a long gap? Before its time? Paxos is just hard? Original paper is intentionally obscure: "Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers."

Meanwhile, at MIT: Barbara Liskov & group develop Viewstamped Replication, essentially the same protocol. The original paper is entangled with a distributed transaction system & language; the VR Revisited paper tries to separate out replication (similar: the Raft project at Stanford). Liskov: 2008 Turing Award, for programming w/ abstract data types, i.e., object-oriented programming.

Paxos History. 1989-1990: Viewstamped Replication (Liskov & Oki) and Paxos (Leslie Lamport, The Part-Time Parliament). 1998: Paxos paper published; followed by The ABCDs of Paxos [2001], Paxos Made Simple [2001], Paxos Made Practical [2007], Paxos Made Live [2007], Paxos Made Moderately Complex [2011]. ~2005: first practical deployments. 2010s: widespread use! 2014: Lamport wins the Turing Award.

Three challenges about Paxos How does it work? Why does it work? How do we use it to build a real system? (these are in increasing order of difficulty!)

Why is replication hard? Split brain problem: the primary and backup are unable to communicate w/ each other, but clients can communicate w/ them. Should the backup consider the primary failed and start processing requests? What if the primary considers the backup failed and keeps processing requests? How does Lab 2 (and Chain Replication) deal with this?

Using consensus for state machine replication: 3 replicas, no designated primary, no view server. Replicas maintain a log of operations. Clients send requests to some replica. The replica proposes the client's request as the next entry in the log and runs consensus. Once consensus completes: execute the next op in the log and return to the client.

[Diagram: a client's GET X returns X=2; all three replicas hold the same log: 1: PUT X=2, 2: PUT Y=5, 3: GET X]

Two ways to use Paxos. Basic approach (Lab 3): run a completely separate instance of Paxos for each entry in the log. Leader-based approach (Multi-Paxos, VR): use Paxos to elect a primary (aka leader) and replace it if it fails; the primary assigns order during its reign. Most (but not all) real systems use leader-based Paxos.

Paxos-per-operation: Each replica maintains a log of ops. Clients send an RPC to any replica. The replica starts a Paxos proposal for the latest log number, completely separate from all earlier Paxos runs (note: agreement might choose a different op!). Once agreement is reached: execute log entries & reply to the client.

Terminology: Proposers propose a value. Acceptors collectively choose one of the proposed values. Learners find out which value has been chosen. In Lab 3 (and pretty much everywhere!), every node plays all three roles!

Paxos Interface: Start(seq, v): propose v as the value for instance seq. fate, v := Status(seq): find the agreed value for instance seq. Correctness: if agreement is reached, all agreeing servers will agree on the same value (once agreement is reached, can't change mind!).
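To make that interface concrete, here is a minimal Go sketch of how a replica might drive it, assuming a hypothetical Paxos peer exposing Start/Status as above; the Op and Fate types and the polling loop are illustrative, not the lab's actual code.

// Sketch: driving a per-instance Paxos interface from a replica.
// Assumes a hypothetical Paxos peer exposing Start(seq, v) and
// Status(seq) as described above; Op and the polling loop are placeholders.
package main

import (
	"fmt"
	"time"
)

type Fate int

const (
	Pending Fate = iota
	Decided
)

type Op struct {
	Kind  string // "PUT" or "GET"
	Key   string
	Value string
}

// Paxos is the interface from the slide: one independent instance per seq.
type Paxos interface {
	Start(seq int, v interface{})       // propose v for instance seq
	Status(seq int) (Fate, interface{}) // poll for the agreed value
}

// proposeAndWait proposes op at seq and waits until that instance decides.
// Note: the decided value may be some *other* client's op; the caller must
// check and, if so, retry at the next sequence number.
func proposeAndWait(px Paxos, seq int, op Op) interface{} {
	px.Start(seq, op)
	for {
		if fate, v := px.Status(seq); fate == Decided {
			return v
		}
		time.Sleep(10 * time.Millisecond) // back off while undecided
	}
}

func main() {
	fmt.Println("sketch only; plug in a real Paxos peer to run")
}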

How does an individual Paxos instance work? Note: all of the following is in the context of deciding on the value for one particular instance, i.e., what operation should be in log entry 4?

Why is agreement hard? Server 1 receives Put(x)=1 for op 2, Server 2 receives Put(x)=3 for op 2. Each one must do something with the first operation it receives, yet clearly one must later change its decision. So: a multiple-round protocol with tentative results? Challenge: how do we know when a result is tentative vs. permanent?

Why is agreement hard? S1 and S2 want to select Put(x)=1 as op 2; S3 and S4 don't respond. We want to be able to complete agreement even when some servers have failed. So are S3 and S4 failed? Or are they just partitioned, and trying to accept a different value for the same slot? How do we solve the split brain problem?

Key ideas in Paxos Need multiple protocol rounds that converge on same value Rely on majority quorums for agreement to prevent the split brain problem

Majority Quorums: Why do we need 2f+1 replicas to tolerate f failures? Every operation needs to talk w/ a majority (f+1). Why? We have to be able to proceed w/ n-f responses; f of those might fail, and we need at least one left: (n-f) - f >= 1 => n >= 2f+1.
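As a toy check of that arithmetic (not part of any protocol), the sketch below tabulates, for a few values of f, the replica count, the quorum size, and how many quorum members survive f further failures.

// Toy check of the quorum arithmetic: with n = 2f+1 replicas, a proposer can
// proceed after n-f responses, and even if f of those responders later fail,
// at least one survives to intersect with any future majority.
package main

import "fmt"

func main() {
	for f := 1; f <= 3; f++ {
		n := 2*f + 1      // replicas needed to tolerate f failures
		quorum := n/2 + 1 // majority = f+1
		fmt.Printf("f=%d: n=%d replicas, quorum of %d, survivors after f more failures: %d\n",
			f, n, quorum, (n-f)-f)
	}
}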

Another reason for quorums: majority quorums solve the split brain problem. Suppose request N talks to a majority; all previous requests also talked to a majority. Key property: any two majority quorums intersect in at least one replica! So request N is guaranteed to see all previous operations. What if the system is partitioned & no one can get a majority?

The mysterious f. f is the number of failures we can tolerate. For Paxos, need 2f+1 replicas (Chain Replication was f+1; some protocols need 3f+1). How do we choose f? Can we have more than 2f+1 replicas?

Paxos protocol overview: Proposers select a value. Proposers submit a proposal to acceptors and try to assemble a majority of responses. There might be concurrent proposers, e.g., multiple clients submitting different ops. Acceptors must choose which requests they accept to ensure that the algorithm converges.

Strawman: The proposer sends propose(v) to all acceptors. An acceptor accepts the first proposal it hears. The proposer declares success if its value is accepted by a majority of acceptors. What can go wrong here?

Strawman: What if no request gets a majority? (E.g., three acceptors each accept a different value for slot 1: PUT X=2, PUT Y=4, GET X.)

Strawman: What if there's a failure after a majority quorum? (E.g., two acceptors accept 1: PUT X=2, one accepts 1: PUT Y=4, and then one of the PUT X=2 acceptors fails.) How do we know which request succeeded?

Basic Paxos exchange, between a proposer and the acceptors: propose(n) -> propose_ok(n, n_a, v_a) -> accept(n, v') -> accept_ok(n) -> decided(v').

Definitions: n is an id for a given proposal attempt, not an instance (this is still all within one instance!), e.g., n = <time, server_id>. v is the value the proposer wants accepted. "Server S accepts n, v" => S sent accept_ok in response to accept(n, v). "n, v is chosen" => a majority of servers accepted n, v.
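One common way to realize n = <time, server_id> as a totally ordered, globally unique proposal id is sketched below in Go; the field and type names are made up for illustration.

// Sketch of a proposal id n = <time, server_id>: ordered first by a local
// counter/timestamp, with the server id breaking ties so no two servers
// ever generate the same n. Names are illustrative.
package main

import "fmt"

type ProposalNum struct {
	Round    int64 // local counter or timestamp component
	ServerID int64 // tie-breaker, unique per server
}

// Less reports whether a is ordered before b.
func (a ProposalNum) Less(b ProposalNum) bool {
	if a.Round != b.Round {
		return a.Round < b.Round
	}
	return a.ServerID < b.ServerID
}

func main() {
	a := ProposalNum{Round: 7, ServerID: 1}
	b := ProposalNum{Round: 7, ServerID: 2}
	fmt.Println(a.Less(b), b.Less(a)) // true false: same round, higher server id wins the tie
}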

Key safety property: Once a value is chosen, no other value can be chosen! This is the safety property we need in order to respond to a client: the algorithm can't change its mind! Trick: another proposal can still succeed, but it has to have the same value! Hard part: "chosen" is a systemwide property: no replica can tell locally that a value is chosen.

Paxos protocol idea: the proposer sends propose(n) w/ a proposal ID, but doesn't pick a value yet. Acceptors respond w/ any value already accepted and promise not to accept proposals w/ a lower ID. When the proposer gets a majority of responses: if there was a value already accepted, propose that value; otherwise, propose whatever value it wanted.

Paxos acceptor state: n_p = highest propose seen; n_a, v_a = highest accept seen & its value.
On propose(n): if n > n_p, then set n_p = n and reply propose_ok(n, n_a, v_a); else reply propose_reject.
On accept(n, v): if n >= n_p, then set n_p = n, n_a = n, v_a = v and reply accept_ok(n); else reply accept_reject.
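A direct transcription of that acceptor state machine into Go follows; it is a sketch only (proposal numbers are plain ints, RPC plumbing is omitted, and type/field names are illustrative).

// Sketch of the Paxos acceptor logic above. Proposal numbers are simple
// ints here for brevity; a real implementation would use <time, server_id>
// pairs and RPC handlers.
package main

import (
	"fmt"
	"sync"
)

type Acceptor struct {
	mu sync.Mutex
	np int         // highest propose(n) seen
	na int         // highest accept(n, v) seen...
	va interface{} // ...and its value (nil if nothing accepted yet)
}

// HandlePropose implements: if n > n_p then n_p = n; reply propose_ok(n, n_a, v_a).
func (a *Acceptor) HandlePropose(n int) (ok bool, na int, va interface{}) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if n > a.np {
		a.np = n
		return true, a.na, a.va
	}
	return false, 0, nil // propose_reject
}

// HandleAccept implements: if n >= n_p then n_p, n_a, v_a = n, n, v; reply accept_ok(n).
func (a *Acceptor) HandleAccept(n int, v interface{}) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if n >= a.np {
		a.np, a.na, a.va = n, n, v
		return true
	}
	return false // accept_reject
}

func main() {
	a := &Acceptor{}
	fmt.Println(a.HandlePropose(1))            // true 0 <nil>: nothing accepted yet
	fmt.Println(a.HandleAccept(1, "PUT X=2"))  // true
	fmt.Println(a.HandlePropose(1))            // false: already promised 1
}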

Example, common case: the proposer sends propose(1) to all three acceptors; each replies propose_ok(1, nil, nil); the proposer sends accept(1, V); each acceptor replies accept_ok(1); the proposer announces decided(V).

What is the commit point? I.e., the point at which, regardless of what failures happen, the algorithm will always proceed to choose the same value? Once a majority of acceptors send accept_ok(n)! Why not when a majority of acceptors send propose_ok(n)?

Exercise: three acceptors. Acceptor 1 has sent propose_ok(10) and accept_ok(10, X); acceptor 2 has sent propose_ok(10) and propose_ok(11); acceptor 3 has sent propose_ok(10), propose_ok(11), and accept_ok(11, Y). Has a value been chosen? Could either X or Y be chosen? What happens if #2 gets accept(10, X)? What happens if #1 gets accept(11, Y)?

Why does the proposer need to choose the value v_a with the highest n_a? It is guaranteed to see any value that has already obtained a majority of acceptors; we can't change that value, so we need to use it! It will also see any value that could subsequently obtain a majority of acceptors, because the proposal prevents any lower-numbered proposal from being accepted.
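The proposer-side rule can be sketched as below in Go: gather a majority of propose_ok replies and, if any carries an accepted value, adopt the one with the highest n_a; otherwise use your own value. The ProposeOK struct mirrors the earlier acceptor sketch, and networking is omitted.

// Sketch of one proposer round over the acceptor replies above.
package main

import "fmt"

type ProposeOK struct {
	Na int         // highest accept the acceptor had seen (0 = none)
	Va interface{} // the corresponding value (nil = none)
}

// chooseValue picks the value to send in the accept phase.
func chooseValue(replies []ProposeOK, myValue interface{}) interface{} {
	v := myValue
	highest := 0
	for _, r := range replies {
		if r.Va != nil && r.Na > highest {
			highest = r.Na
			v = r.Va // must re-propose a value that may already be chosen
		}
	}
	return v
}

func main() {
	replies := []ProposeOK{{0, nil}, {10, "PUT X=2"}, {0, nil}} // majority of 3
	fmt.Println(chooseValue(replies, "GET X"))                  // PUT X=2
}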

What about FLP? No deterministic algorithm for solving consensus in an asynchronous network is both safe (correct) and live (terminates eventually). Paxos is an algorithm for solving consensus. So Paxos must not be guaranteed to be live. How can it get stuck?

Worst case for Paxos: dueling proposers. Proposer A gets prop_ok(1) from all acceptors, but before its accept(1) arrives, proposer B completes propose(2); the acceptors then reply accept_rej(1). Proposer A retries with propose(3), which causes accept(2) to be rejected with accept_rej(2), and so on, forever.

What can we do about this? Don't retry immediately; wait a random time, then retry. Or designate one replica as the leader (aka distinguished proposer) and have it make all the proposals. What if that replica fails? It's just an optimization: other replicas can still make proposals if they think it failed.
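A minimal sketch of the "wait a random time, then retry" idea is below; the durations and the back-off shape are arbitrary choices, not prescribed by Paxos.

// Sketch of randomized retry: after a preempted proposal, sleep a random
// amount before trying again with a higher proposal number, so dueling
// proposers eventually get out of each other's way.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func retryAfterPreemption(attempt int) {
	base := 10 * time.Millisecond
	jitter := time.Duration(rand.Intn(50)) * time.Millisecond
	wait := time.Duration(attempt)*base + jitter
	fmt.Printf("attempt %d preempted; retrying in %v\n", attempt, wait)
	time.Sleep(wait)
}

func main() {
	for attempt := 1; attempt <= 3; attempt++ {
		retryAfterPreemption(attempt)
	}
}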

Multi-Paxos: All of the above was about a single instance, i.e., agreeing on the value for one log entry. In reality: a series of Paxos instances. Optimization: if we have a leader, have it run the first phase for multiple instances at once. propose(n): the acceptor sets n_p = n for this instance and all future instances. Then the proposer can jump straight to the accept phase.
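A sketch of that optimization on the acceptor side: a single promised number n_p covers every future instance, so a stable leader sends one propose and then only accepts, one per log slot. Names are illustrative, and per-slot accepted state is simplified to a map.

// Sketch of the Multi-Paxos optimization described above.
package main

import "fmt"

type MultiAcceptor struct {
	np       int                 // promise covering this and all future instances
	accepted map[int]interface{} // slot -> accepted value
}

func (a *MultiAcceptor) HandlePropose(n int) bool {
	if n > a.np {
		a.np = n // applies to every not-yet-decided slot
		return true
	}
	return false
}

func (a *MultiAcceptor) HandleAccept(n, slot int, v interface{}) bool {
	if n >= a.np {
		a.np = n
		a.accepted[slot] = v
		return true
	}
	return false
}

func main() {
	a := &MultiAcceptor{accepted: map[int]interface{}{}}
	fmt.Println(a.HandlePropose(5))              // leader runs phase 1 once
	fmt.Println(a.HandleAccept(5, 1, "PUT X=2")) // then accepts for slot 1,
	fmt.Println(a.HandleAccept(5, 2, "PUT Y=5")) // slot 2, ... with no new propose
}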

Multi-Paxos message flow: the client sends a request to the leader; the leader sends accept to the replicas; the replicas reply accept_ok; the leader executes the op and replies to the client, then sends decide to the replicas.

Viewstamped Replication: a Paxos-like protocol presented in terms of state machine replication, i.e., a system-builder's view of Paxos. See also Raft from Stanford.

Viewstamped Replication is exactly Multi-Paxos!

Starting point: 2f+1 replicas, one of them the primary. Each one maintains a numbered log of operations, each either PREPARED or COMMITTED. Clients send all requests to the primary; the primary runs a two-phase commit over the replicas.

2-phase commit message flow: the client sends a request to the leader; the leader sends prepare to the replicas; the replicas reply prepare-ok; the leader executes the op and replies to the client, then sends commit to the replicas.

Beyond 2PC: 2PC does not remain available under failures. So let's try requiring only a majority quorum: f+1 PREPARE-OKs, including the primary. This can tolerate f backup failures (but not a primary failure). Minor detail: what if a backup receives op n+1 without seeing op n? We need a state transfer mechanism.

The hard part: we need to detect that the primary has failed (timeout?), replace it with a new primary, make sure that the new primary knows about all operations committed by the old primary, keep the old primary from completing new operations, and make sure that there are no race conditions!

Replacing the primary: Each replica maintains a view number; the view number determines the primary; process PREPARE-OK only if the view number matches. When the primary is suspected faulty: send <START-VIEW-CHANGE, new v> to all. On receiving START-VIEW-CHANGE: increment the view number, stop processing requests, and send <DO-VIEW-CHANGE, v, log> to the new primary. When the new primary receives DO-VIEW-CHANGE from a majority: take the log with the highest seen (not necessarily committed) op, install that log, and send <START-VIEW, v, log> to all.
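The decision the new primary makes can be sketched as follows: wait for DO-VIEW-CHANGE messages from a majority, then install the log with the highest op-number among them. This is a simplified illustration with made-up types; real VR also compares the view in which each log was last normal, which this sketch omits.

// Sketch of the view-change decision at the new primary.
package main

import "fmt"

type DoViewChange struct {
	View int
	Log  []string // numbered log of operations
}

// mergeLogs picks the log to install in the new view: the longest
// (highest op-number) log among a majority of DO-VIEW-CHANGE messages.
func mergeLogs(msgs []DoViewChange, quorum int) ([]string, bool) {
	if len(msgs) < quorum {
		return nil, false // can't start the new view yet
	}
	best := msgs[0].Log
	for _, m := range msgs[1:] {
		if len(m.Log) > len(best) {
			best = m.Log
		}
	}
	return best, true
}

func main() {
	msgs := []DoViewChange{
		{View: 2, Log: []string{"PUT X=2", "PUT Y=5"}},
		{View: 2, Log: []string{"PUT X=2"}},
	}
	log, ok := mergeLogs(msgs, 2) // quorum of 2 out of 3 replicas
	fmt.Println(ok, log)          // true [PUT X=2 PUT Y=5] -> send in START-VIEW
}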

Why is this correct?

Why is this correct? The new primary sees every operation that could possibly have completed in the old view: every completed operation was processed by a majority of replicas, and we have DO-VIEW-CHANGE logs from a majority. Can the old primary commit new operations? No: once a replica sends DO-VIEW-CHANGE, it stops listening to the old primary!

Why is this correct? Because it's Paxos! A view change = proposing a new primary: a two-phase protocol involving majorities; the other replicas promise not to accept ops in the old view, and the proposer finds out all ops accepted in the old view and must propose them in the new view.

VR = (Multi-)Paxos:
view number = proposal number
start-view-change(v) = propose(v)
do-view-change(v) = propose_ok(v)
start-view(v, log) = accept(v, op) for the appropriate instances
prepare(v, opnum, op) = accept(v, op) for instance opnum
prepare_ok(v, opnum) = accept_ok(v, op) for instance opnum
commit(opnum, op) = decided(opnum, op)

Paxos performance: What determines Paxos performance? We'll consider Multi-Paxos / VR, since it's the most common way to use Paxos.

Multi-Paxos performance. Throughput: the bottleneck replica (the leader) processes 2n messages per request. Latency: 4 message delays (client request -> prepare -> prepare-ok -> reply).

Batching: Have the leader accumulate requests from many clients, then run one round of Paxos to add them all to the log at once. Much higher throughput; potentially higher latency (can get it about even).
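A leader-side batching loop might look like the sketch below: collect requests for a short window or until the batch is full, then hand the whole batch to one consensus round. The window size and batch limit are arbitrary, and propose is a placeholder for the real proposal path.

// Sketch of leader-side batching.
package main

import (
	"fmt"
	"time"
)

func batcher(requests <-chan string, propose func([]string)) {
	const maxBatch = 64
	window := 2 * time.Millisecond
	for {
		req, ok := <-requests
		if !ok {
			return
		}
		batch := []string{req}
		timer := time.After(window)
	collect:
		for len(batch) < maxBatch {
			select {
			case r, ok := <-requests:
				if !ok {
					break collect
				}
				batch = append(batch, r)
			case <-timer:
				break collect
			}
		}
		propose(batch) // one Paxos round appends the whole batch to the log
	}
}

func main() {
	reqs := make(chan string, 8)
	for i := 0; i < 5; i++ {
		reqs <- fmt.Sprintf("op%d", i)
	}
	close(reqs)
	batcher(reqs, func(b []string) { fmt.Println("proposing batch:", b) })
}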

Partitioning: One idea: run multiple Paxos groups; each replica will be a leader in some and a follower in others. This spreads load around and is very common in practice. Separate idea: partition the instances, with different leaders for different instances; some protocols do this for higher throughput, but it's more complicated and easy to get wrong.