CSE 486/586 Distributed Systems: Paxos
Steve Ko, Computer Sciences and Engineering, University at Buffalo


Recap: Facebook Photo Storage
- CDN (hot), Haystack (warm), and f4 (very warm).
- Haystack: RAID-6, per stripe: 10 data disks, 2 parity disks, 2 failures tolerated.
  - Replication degree within a datacenter: 2, so 4 total disk failures tolerated within a datacenter.
  - One additional copy in another datacenter.
  - Storage usage: 3.6X (1.2X for each copy).
- f4: Reed-Solomon code, per stripe: 10 data disks, 4 parity disks, 4 failures tolerated within a datacenter.
  - One additional copy XOR'ed to another datacenter.
  - Storage usage: 2.1X.
- (The arithmetic behind these figures is sketched at the end of this section.)

Paxos
- A consensus algorithm, known as one of the most efficient and elegant consensus algorithms.
- If you stay close to the field of distributed systems, you'll hear about this algorithm over and over.
- What? Consensus? What about FLP (the impossibility of consensus)? Obviously, Paxos doesn't solve FLP; it relies on failure detectors to get around it.

Plan
- Brief history (with a lot of quotes)
- The protocol itself
- How to discover the protocol (this is now optional in the schedule)

Brief History
- Paxos was developed by Leslie Lamport (of the Lamport clock).
- Lamport: "A fault-tolerant file system called Echo was built at SRC in the late 80s. The builders claimed that it would maintain consistency despite any number of non-Byzantine faults, and would make progress if any majority of the processors were working. I decided that what they were trying to do was impossible, and set out to prove it. Instead, I discovered the Paxos algorithm."
- "I decided to cast the algorithm in terms of a parliament on an ancient Greek island (Paxos)."
- The paper abstract: "Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament's protocol provides a new way of implementing the state-machine approach to the design of distributed systems."
- "I gave a few lectures in the persona of an Indiana-Jones-style archaeologist. My attempt at inserting some humor into the subject was a dismal failure. People who attended my lecture remembered Indiana Jones, but not the algorithm."
- People thought that Paxos was a joke. Lamport finally published the paper in 1998, 8 years after it was written in 1990. Title: "The Part-Time Parliament."
- People did not understand the paper, so Lamport gave up and wrote another paper that explains Paxos in simple English. Title: "Paxos Made Simple." Abstract: "The Paxos algorithm, when presented in plain English, is very simple."
- Still, it's not the easiest algorithm to understand, so people started writing papers and lecture notes to explain Paxos Made Simple (e.g., "Paxos Made Moderately Complex," "Paxos Made Practical," etc.).
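As an aside on the storage recap above, the 3.6X and 2.1X figures follow from quick arithmetic. Below is a back-of-the-envelope sketch; the 1.5 factor assumes, as in the f4 design, that the cross-datacenter copy is an XOR of a pair of blocks and therefore adds half a copy's worth of space per block.

```python
# Back-of-the-envelope check of the storage figures in the recap above.

# Haystack: RAID-6 stripes of 10 data + 2 parity disks -> 1.2X per copy;
# 2 copies in one datacenter + 1 copy in another datacenter = 3 copies.
haystack = (10 + 2) / 10 * 3       # 3.6X

# f4: Reed-Solomon stripes of 10 data + 4 parity disks -> 1.4X per copy;
# the extra cross-datacenter copy is an XOR of two blocks, so it costs
# half a copy per block (assumption: the f4 pairing scheme).
f4 = (10 + 4) / 10 * 1.5           # 2.1X

print(f"Haystack: {haystack:.1f}X, f4: {f4:.1f}X")
```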

Review: Consensus
- How do people agree on something?
- Q: should Steve give an A to everybody taking CSE 486/586?
  - Input: everyone says either yes or no.
  - Output: an agreement of yes or no.
- FLP: this is impossible even with one faulty process and arbitrary message delays.
- Many distributed systems problems can be cast as a consensus problem: mutual exclusion, leader election, total ordering, etc.
- The question: how do multiple processes agree on a value, under failures, network partitions, message delays, etc.?

Review: Consensus (cont.)
- People care about this! Real systems implement Paxos: Google Chubby, MS Bing cluster management, etc.
- Amazon CTO Werner Vogels (in his blog post "Job Openings in My Group"): "What kind of things am I looking for in you? You know your distributed systems theory: You know about logical time, snapshots, stability, message ordering, but also ACID and multi-level transactions. You have heard about the FLP impossibility argument. You know why failure detectors can solve it (but you do not have to remember which one diamond-W was). You have at least once tried to understand Paxos by reading the original paper."
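To make the problem statement concrete, here is a minimal sketch of the interface a consensus module presents to its callers. The class and method names are illustrative, not from any real library.

```python
class Consensus:
    """Each process proposes a value; all correct processes must
    agree on a single one of the proposed values."""

    def propose(self, value):
        """Submit this process's input (e.g., "yes" or "no")."""
        raise NotImplementedError

    def decide(self):
        """Block until a value is chosen, then return it.

        Agreement: every process's decide() returns the same value.
        Validity:  the returned value was proposed by some process.
        (FLP says termination cannot be guaranteed in a fully
        asynchronous system with even one crash failure.)
        """
        raise NotImplementedError
```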

Roles of a Process In reality, a process can be any one, two, or all three. Important requirements The protocol should work under process failures and with delayed and lost messages. The consensus is reached via a majority (> ½). Example: a replicated state machine All replicas agree on the order of execution for concurrent transactions All replica assume all roles, i.e., they can each propose, accept, and learn. First Attempt Let s just have one acceptor, choose the first one that arrives, & tell the proposers about the outcome. What s wrong? Single point of failure! 13 14 Second Attempt Let s have multiple acceptors; each accepts the first one; then all choose the majority and tell the proposers about the outcome. Second Attempt One example, but many other possibilities A2 What s wrong? (next slide) A2 15 16 Let s have multiple acceptors each accept (i.e., consider) multiple proposals. An acceptor accepting a proposal doesn t mean it will be chosen. A majority should accept it. Make sure one of the multiple accepted proposals will have a vote from a majority (will get back to this later) : how do we select one value when there are multiple acceptors accepting multiple proposals? Protocol Overview A proposal should have an ID (since there s multiple). (proposal #, value) == (N, V) The proposal # strictly increasing and globally unique across all proposers, i.e., there should be no tie. E.g., (per-process number).(process id) == 3.1, 3.2, 4.1, etc. Three phases Prepare phase: a proposer learns previously-accepted proposals from the acceptors. Propose phase: a proposer sends out a proposal. Learn phase: learners learn the outcome. 17 18 C 3

Protocol Overview Phase 1 Rough description of the proposers Before a proposer proposes a value, it will ask acceptors if there is any proposed value already. If there is, the proposer will propose the same value, rather than proposing another value. Even with multiple proposals, the value will be the same. The behavior is altruistic: the goal is to reach a consensus, rather than making sure that my value is chosen. Rough description of the acceptors The goal for acceptors is to accept the highest-numbered proposal coming from all proposers. An acceptor tries to accept a value V with the highest proposal number N. Rough description of the learners All learners are passive and wait for the outcome. A proposer chooses its proposal number N and sends a prepare request to acceptors. Hey, have you accepted any proposal yet? Note: Acceptors keep the history of proposals. An acceptor needs to reply: If it accepted anything, the accepted proposal and its value with the highest proposal number less than N This reply also means a promise to not accept any proposal numbered less than N any more (to make sure that it doesn t alter the result of the reply). 19 20 Phase 2 If a proposer receives a reply from a majority, it sends an accept request with the proposal (N, V). V: the value from the highest proposal number N from the replies (i.e., the accepted proposals returned from acceptors in phase 1) Or, if no accepted proposal was returned in phase 1, a new value to propose. Upon receiving (N, V), acceptors either: Accept it Or, reject it if there was another prepare request with N higher than N, and it replied to it (due to the promise in phase 1). Phase 3 Learners need to know which value has been chosen. Many possibilities One way: have each acceptor respond to all learners, whenever it accepts a proposal. Learners will know if a majority has accepted a proposal. Might be effective, but expensive Another way: elect a distinguished learner Acceptors respond with their acceptances to this process This distinguished learner informs other learners. Failure-prone Mixing the two: a set of distinguished learners 21 22 A simple run A problematic run N: 5 N: 5 23 24 C 4

A problematic run (cont.) N: 6 N: 6 (N, V): (5, 10) There s a race condition for proposals. completes phase 1 with a proposal number N0 Before starts phase 2, starts and completes phase 1 with a proposal number N1 > N0. performs phase 2, acceptors reject. Before starts phase 2, restarts and completes phase 1 with a proposal number N2 > N1. performs phase 2, acceptors reject. (this can go on forever) (N, V): (5, 10) 25 26 Providing Liveness Solution: elect a distinguished proposer I.e., have only one proposer If the distinguished proposer can successfully communicate with a majority, the protocol guarantees liveness. I.e., if a process plays all three roles, can tolerate failures f < 1/2 * N. Still needs to get around FLP for the leader election, e.g., having a failure detector Summary A consensus algorithm Handles crash-stop failures (f < 1/2 * N) Three phases Phase 1: prepare request/reply Phase 2: accept request/reply Phase 3: learning of the chosen value 27 28 Acknowledgements These slides contain material developed and copyrighted by Indranil Gupta (UIUC). 29 C 5