Replicated State Machine in Wide-area Networks


Replicated State Machine in Wide-area Networks
Yanhua Mao, CSE223A WI09

Building a replicated state machine with consensus
A general approach to replicating stateful deterministic services that provides strong consistency, e.g., a fault-tolerant file system for LANs. Servers agree on the same sequence of commands to execute, using a consensus protocol (e.g., Paxos or Fast Paxos) to reach agreement.
[Figure: clients submit commands x, y, z over the network; every replica executes the same sequence x, y, z.]
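
The slide's idea can be made concrete with a minimal sketch (not from the slides): once consensus has decided a command for each log instance, every replica applies the decided commands in instance order to its deterministic state, so all replicas stay identical. The `Replica` class and the put-only command format below are illustrative assumptions.

```python
class Replica:
    """Minimal replicated state machine: apply agreed commands in log order."""

    def __init__(self):
        self.state = {}          # deterministic service state (here: a key/value map)
        self.log = {}            # instance number -> decided command
        self.next_to_apply = 0   # lowest instance not yet executed

    def on_decided(self, instance, command):
        """Called when consensus (e.g., Paxos) decides `command` for `instance`."""
        self.log[instance] = command
        # Apply commands strictly in instance order so all replicas stay identical.
        while self.next_to_apply in self.log:
            op, key, value = self.log[self.next_to_apply]
            if op == "put":
                self.state[key] = value
            self.next_to_apply += 1
```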

Why wide-area?
Reliability and availability: to cope with co-location failures (fire, etc.).
Consistency: to coordinate wide-area applications, e.g., Yahoo's ZooKeeper and Google's Chubby.
Are existing protocols efficient in the wide area? If not, can we improve them?

More on wide-area
Wide-area is more complex than simply "not local area." Classify systems by the distribution of clients and servers:

  Clients \ Servers   LAN                            WAN
  LAN                 (L, L): local-area system      (L, W)
  WAN                 (W, L): cluster-based system   (W, W): grid, SOA, mobile

Paxos vs. Fast Paxos in the (W, L) topology
All servers are in the same LAN; clients talk to the servers over the WAN.
Sources of asynchrony: the wide-area network and the operating system.
Observation: delays have high variance, and the WAN delay dominates.
[Figure: clients connect over the WAN to a LAN cluster of servers.]

Consensus speed
Theoretically: count the number of message delays to reach consensus, assuming every message takes the same delay.
Practically: message delay varies greatly; a WAN message delay may be hundreds of times larger than a LAN message delay.
[Figure: clients send cmd to the replicas and receive results.]

Paxos analysis
In (W, L), the only wide-area delay for Paxos is the message from the client to the leader; the rest of the protocol runs among the servers inside the LAN.
[Figure: a client sends cmd over the WAN to the leader; leader and replicas exchange 1a/1b/2a messages within the LAN; the result returns to the client.]

Fast Paxos analysis
Fast Paxos has a quorum size of 2f+1: it needs 2f+1 replicas to receive the cmd from the client before it can make progress. Wide-area delay (f = 1, so four replicas): the third-fastest message from the client to the replicas.
[Figure: clients send cmd directly over the WAN to all replicas; the leader has pre-issued a "2a ANY" message; results return to the clients.]

Classic Paxos vs. Fast Paxos in (W, L)

                      Paxos   Fast Paxos
  Learning delay      3       2
  Number of replicas  2f+1    3f+1
  Quorum size         f+1     2f+1
  Collision           No      Yes

Question: is Fast Paxos always faster in practice in collision-free runs?

Analysis result
Assumptions: WAN message delay dominates, and WAN message delays are IID and follow a continuous distribution.
Result: Classic Paxos has a fixed probability (60%) of being faster (f = 1), and does even better with larger f.
Intuition: the WAN dominates and has high variance.

Quick proof
Given a wide-area latency distribution, draw five variables A-E from it: A < B < C < D are the sorted client-to-replica delays for Fast Paxos, and E is the client-to-leader delay for Paxos. E's position relative to A-D gives five equally likely cases. Fast Paxos decides at the third-fastest delay C, so Paxos is faster in the three cases where E < C and Fast Paxos is faster in the remaining two; hence Paxos wins with probability 3/5 = 60%.
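
As a sanity check on the 60% figure (not part of the slides), a small Monte Carlo simulation reproduces it; the exponential distribution below is an arbitrary assumption, since the argument holds for any continuous IID delay distribution.

```python
import random

def paxos_wins(trials=100_000):
    """Estimate P(Paxos beats Fast Paxos) in (W, L) with f = 1.

    Fast Paxos (4 replicas, quorum 3) decides at the 3rd-fastest of four
    client-to-replica WAN delays; Paxos decides at the single
    client-to-leader WAN delay.  All delays are IID.
    """
    wins = 0
    for _ in range(trials):
        fast = sorted(random.expovariate(1.0) for _ in range(4))
        third_fastest = fast[2]
        paxos_delay = random.expovariate(1.0)
        if paxos_delay < third_fastest:
            wins += 1
    return wins / trials

print(paxos_wins())  # ~0.6, regardless of the continuous distribution used
```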

Summary
Protocol performance is sensitive to the network setting: Fast Paxos is better in (L, L); Paxos is better in (W, L) and (W, W).
How can we design a protocol that has the best of both across a wide range of network settings? Options: run Paxos and Fast Paxos concurrently, or switch between them when network conditions change.

High-performance replicated state machine for (W, W) systems
Model: a small number n of sites interconnected by an asynchronous wide-area network; up to f of the n servers can fail by crashing.
System load: possibly variable and changing load from each site; network-bound (large request size) or CPU-bound (small request size).
Goals: low latency under low client load; high throughput under high client load.

Paxos in steady state
[Figure: P0 is the leader; P1 and P2 are replicas. Client 0's request x is proposed by P0 directly (propose, accept, learn) in instance 0; Client 2's request z must first be forwarded from P2 to P0, which then proposes it in instance 1.]
Pros: simple, with low message complexity (3n-3); low latency at the leader (2 communication steps).
Cons: high latency at the non-leader replicas (4 steps); not load balanced, so the leader or the leader's links become the bottleneck.

Fast Paxos in steady state
[Figure: replicas P0-P3; P0 proposes and learns x in instance 0 while P3 proposes and learns z in instance 1, each via propose/accept messages among all replicas.]
Pros: load balanced; low latency at all servers under no contention (2 steps).
Cons: collisions add latency under contention; high message complexity (n^2 - 1); more replicas needed (3f+1).

Paxos vs. Fast Paxos in a (W, W) system

                        Paxos                            Fast Paxos              New algorithm?
  WAN learning latency  Leader: 2, non-leader: 4         2 (when no collision)   2
  Collision             No                               Yes                     No
  Number of replicas    2f+1                             3f+1                    2f+1
  WAN msg complexity    Leader: 3n-3, non-leader: 3n-2   n^2-1                   3n-3
  Load balancing        No                               Yes                     Yes

Neither algorithm is perfect, but Paxos is better. Question: can we achieve the best of both with a new algorithm?

Deriving Mencius from Paxos
Approach: a rotating leader, a variant of consensus, and a set of optimizations.
Advantages: safety is ensured, and is easy to argue because the protocol is derived from an existing one; the design is flexible, so others may derive their own protocols.

First step: rotating the leader
Each consensus instance is assigned to a coordinator server, which is the default leader of that instance, using a simple well-known assignment (e.g., round-robin). A server proposes client requests immediately to the next available instance it coordinates; it does not have to wait for other servers, which reduces latency. A server only proposes client requests to instances it coordinates, so there is no contention.

Mencius Rule 1
A server p maintains its index I_p, i.e., the next consensus instance it coordinates. Rule 1: upon receiving a client request v, p immediately proposes v to instance I_p and updates its index accordingly.
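
A minimal sketch of Rule 1 together with the round-robin assignment, assuming instance i is coordinated by server i mod n; the function names and message format are illustrative, not from the paper.

```python
def coordinator_of(instance, n):
    """Round-robin assignment: instance i is coordinated by server i mod n."""
    return instance % n

def rule1_propose(my_id, n, my_index, request):
    """Rule 1 (sketch): suggest `request` in instance I_p, the next instance this
    server coordinates, and advance I_p; returns (suggest message, new I_p)."""
    assert coordinator_of(my_index, n) == my_id   # I_p is always coordinated by p
    suggest_msg = ("suggest", my_index, request)  # broadcast to the other servers
    return suggest_msg, my_index + n

# Example: server 2 of 3 servers, with I_p = 2, receives client request "x".
print(rule1_propose(my_id=2, n=3, my_index=2, request="x"))
# -> (('suggest', 2, 'x'), 5)
```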

Benefits of rotating the leader
All servers can now propose requests directly, giving potentially low latency at all sites. Load balancing across the servers yields higher throughput under CPU-bound client load, and the balanced communication pattern yields higher throughput under network-bound load.
[Figure: in Paxos, non-leaders only forward and accept while the leader proposes and learns; in Mencius, every server proposes, accepts, and learns.]

Ensuring liveness when servers have different load
Rule 1 only works well when all servers have the same load, but servers may observe different client loads. A lightly loaded server can skip its turns by proposing no-ops.
[Figure: without skipping, instances coordinated by lightly loaded servers stay unfilled and block the sequence; with skipping, those instances are filled with no-ops so the remaining requests can be committed.]

Mencius Rule 2
Rule 2: if server p receives a propose message with some value v other than no-op for instance i, and i > I_p, then before accepting v and sending back an accept message, p updates its index to a new value I_p' greater than i and proposes no-ops for each instance in the range [I_p, I_p') that p coordinates.
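
A sketch of the index arithmetic behind Rule 2, under the same round-robin assumption as above; the helper returns the new index and the instances p must fill with no-ops.

```python
def noop_instances_for_rule2(my_id, n, my_index, suggested_instance):
    """Rule 2 helper (sketch): given a suggest for `suggested_instance` > I_p,
    return (new index, instances p must fill with no-ops).

    Instance i is assumed to be coordinated by server i mod n (round-robin)."""
    assert suggested_instance > my_index
    # Smallest instance greater than suggested_instance that p coordinates.
    offset = (my_id - suggested_instance) % n or n
    new_index = suggested_instance + offset
    # Every instance p coordinates in [old I_p, new I_p) gets a no-op.
    noops = list(range(my_index, new_index, n))
    return new_index, noops

# Example: 3 servers; server 1 with I_p = 1 receives a suggest for instance 6.
print(noop_instances_for_rule2(my_id=1, n=3, my_index=1, suggested_instance=6))
# -> (7, [1, 4])
```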

Proposing no-ops is costly
Consider the case where only one server proposes requests. It then takes 3n(n-1) messages to get one value v chosen: 3(n-1) for choosing v plus 3(n-1)^2 for the n-1 no-ops, and 3(n-1) + 3(n-1)^2 = 3n(n-1).
Solution: use a simpler abstraction than consensus.

Simple consensus for efficiency
Simple consensus: the coordinator can propose either a client request or a no-op; non-coordinators can propose only no-ops.
Benefits: a no-op can be learned in one message delay if the coordinator skips (proposes a no-op), and no-ops are easy to piggyback onto other messages, so they cost essentially nothing extra.

Coordinated Paxos
Starting state: the coordinator is the default leader, and each instance starts from a state as if phase 1 has already been run by the coordinator.
Suggest: the coordinator proposes a client request by running phase 2.
Skip: the coordinator proposes a no-op; skips benefit from fast learning.
Revoke: another replica runs the full phases 1 and 2 to propose a no-op; this is only needed when the coordinator is detected as having failed.

Skips with simple consensus
[Figure: Replica 0 suggests x (instance 0) and y (instance 3); Replicas 1 and 2 broadcast explicit skip messages for their instances, so the learned sequence is x, no-op, no-op, y.]
Pros: a simple and correct protocol. Cons: high message complexity, (n-1)(n+2).

Reducing message complexity
Idea: piggyback skips onto other messages. Skips can be piggybacked onto the accept replies and onto future 2a messages too, so they no longer need to be broadcast separately.
[Figure: Replicas 1 and 2 attach their skips to the 2a/3a exchange for Replica 0's suggestions; Replica 0's view of the sequence is still x, no-op, no-op, y.]

Mencius Optimizations 1 & 2
When a server p receives a suggest message from server q for instance i, let r be a server other than p or q.
Optimization 1: p does not send a separate skip message to q. Instead, p uses the accept message that replies to the suggest to promise not to suggest any client request to an instance smaller than i in the future.
Optimization 2: p does not send a skip message to r immediately. Instead, p waits for a future suggest message from p to r to indicate that p has promised not to suggest any client requests to instances smaller than i.
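
A sketch of how Optimizations 1 and 2 let skip information travel without separate messages; the `SkipState` class, message tuples, and "promised_below" field are illustrative assumptions, not the paper's wire format.

```python
class SkipState:
    """Sketch of Optimizations 1 & 2: skips ride on accept and suggest messages."""

    def __init__(self, my_id, n):
        self.my_id, self.n = my_id, n
        self.index = my_id              # I_p
        self.promised_below = 0         # "I will not suggest below this instance"

    def on_suggest(self, instance):
        # Advance I_p past `instance`; the skipped instances are implied.
        if instance > self.index:
            self.index = instance + ((self.my_id - instance) % self.n or self.n)
        self.promised_below = self.index
        # Optimization 1: the accept reply to the suggester carries the promise,
        # so no separate skip message is sent back.
        return ("accept", instance, self.promised_below)

    def next_suggest(self, request):
        # Optimization 2: p's own future suggest messages to the other servers
        # carry the same promise, so they learn the skips at no extra cost.
        msg = ("suggest", self.index, request, self.promised_below)
        self.index += self.n
        return msg
```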

Gaps at idle replicas
With deferred skips, a potentially unbounded number of requests can be waiting to commit, but this only happens at idle replicas: once an idle replica sends a request of its own, the pending skips are resolved. Idle replicas can also send skips periodically to accelerate resolution.
[Figure: the three replicas' views of the sequence x, no-op, no-op, y, ..., z; idle replicas still have gaps waiting on deferred skips.]

Mencius Accelerator 1
When a server p receives a suggest message from server q, let r be a server other than p or q. Accelerator 1: p propagates skip messages to r if the total number of outstanding skip messages to r exceeds some constant α, or a skip message has been deferred for more than some time τ.
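
A small sketch of the Accelerator 1 flush condition; the class name, the default values of α and τ, and the bookkeeping fields are illustrative.

```python
import time

class SkipFlusher:
    """Accelerator 1 (sketch): flush deferred skips to server r when they pile up
    (more than alpha outstanding) or grow stale (deferred longer than tau seconds)."""

    def __init__(self, alpha=20, tau=0.05):
        self.alpha, self.tau = alpha, tau
        self.outstanding = 0            # deferred skip messages destined for r
        self.oldest_deferred_at = None

    def defer_skip(self):
        if self.outstanding == 0:
            self.oldest_deferred_at = time.monotonic()
        self.outstanding += 1

    def should_flush(self):
        if self.outstanding == 0:
            return False
        too_many = self.outstanding > self.alpha
        too_old = time.monotonic() - self.oldest_deferred_at > self.tau
        return too_many or too_old

    def flush(self, send_skips_to_r):
        send_skips_to_r(self.outstanding)   # send the accumulated skips to r
        self.outstanding, self.oldest_deferred_at = 0, None
```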

Failures
Faulty processes cannot skip, which leaves gaps in the request sequence.
[Figure: Replica 2 has crashed; Replicas 0 and 1 learn x, no-op, y but cannot fill the instances coordinated by Replica 2.]

Mencius Rule 3
Rule 3: let q be a server that another server p suspects has failed, and let C_q be the smallest instance that is coordinated by q and not learned by p. Then p revokes q for all instances in the range [C_q, I_p] that q coordinates.
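
A sketch of the Rule 3 revocation range, again assuming round-robin assignment; the helper name and example values are illustrative.

```python
def rule3_revocation_range(q_id, n, c_q, my_index):
    """Rule 3 (sketch): the instances coordinated by suspected server q that p
    should revoke, i.e., every instance q coordinates in [C_q, I_p]."""
    assert c_q % n == q_id            # C_q is an instance q coordinates
    return list(range(c_q, my_index + 1, n))

# Example: 3 servers, q = server 2 suspected, C_q = 2, and p's index I_p = 9.
print(rule3_revocation_range(q_id=2, n=3, c_q=2, my_index=9))  # -> [2, 5, 8]
```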

Imperfect failure detectors
A false suspicion may result in a no-op being chosen even though the coordinator has not crashed and has proposed some value v. Solution: re-suggest v when the coordinator learns that a no-op was chosen. What if the false suspicion happens again? Keep trying? A ◇W failure detector is not enough to provide liveness, because one server can be permanently falsely suspected. So we require ◇P instead: eventually there are no false suspicions, hence ◇P guarantees eventual liveness.

Mencius Rule 4
Rule 4: if server p suggests a value v other than no-op to instance i, and p learns that no-op was chosen for i, then p suggests v again.
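
A sketch of Rule 4, assuming the server remembers the value it suggested for each instance; the names and message format are illustrative.

```python
def rule4_on_learn(my_suggestions, instance, chosen, my_index, n):
    """Rule 4 (sketch): if a no-op was chosen where p had suggested a real value v
    (e.g., after a false suspicion and revocation), re-suggest v in p's next
    coordinated instance.  Returns (re-suggest message or None, new index)."""
    v = my_suggestions.get(instance)
    if chosen == "no-op" and v is not None and v != "no-op":
        # Re-suggest v exactly as if it were a fresh client request (Rule 1).
        msg = ("suggest", my_index, v)
        my_suggestions[my_index] = v
        return msg, my_index + n
    return None, my_index
```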

Dealing with failure: revocation
Revoke: propose no-ops on behalf of the faulty process. Problem: running the full three phases of Paxos for every revoked instance is costly. Solution: revoke a large block of instances at once.
[Figure (animation over several slides): the instance space is divided into blocks 0-29, 30-59, 60-89, 90-119; while the live servers keep committing requests x, y, ..., the crashed server's instances are revoked block by block.]

Mencius failure recovery
P2 may come back after failure recovery or a false suspicion. It finds the next available slot it coordinates and starts proposing requests to that slot; the other servers catch up with P2 by skipping.
[Figure (animation over two slides): P2 recovers and proposes request z to its next available slot in the block 120-149; P0 and P1 skip their turns upon receiving the request from P2.]

Mencius Optimization 3
Optimization 3: let q be a server that another server p suspects to have failed, and let C_q be the smallest instance that is coordinated by q and not learned by p. For some constant β, p revokes q for all instances in the range [C_q, I_p + 2β] that q coordinates, provided C_q < I_p + β.
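
A sketch of the Optimization 3 block-revocation condition, extending the Rule 3 helper above; β and the example values are illustrative.

```python
def optimization3_revocation_range(q_id, n, c_q, my_index, beta=30):
    """Optimization 3 (sketch): revoke q's instances in a large block
    [C_q, I_p + 2*beta] to amortize the cost of the full Paxos phases, but only
    once the gap gets close (C_q < I_p + beta)."""
    if c_q >= my_index + beta:
        return []                       # gap still far ahead: nothing to revoke yet
    upper = my_index + 2 * beta
    return [i for i in range(c_q, upper + 1) if i % n == q_id]

# Example: 3 servers, q = server 2, C_q = 2, I_p = 9, beta = 6: revoke q's
# instances up to 9 + 12 = 21.
print(optimization3_revocation_range(q_id=2, n=3, c_q=2, my_index=9, beta=6))
# -> [2, 5, 8, 11, 14, 17, 20]
```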

Mencius summary
Rule 1: suggest a client request immediately to the next consensus instance the server coordinates.
Rule 2: update the index and skip unused instances when accepting a value.
Rule 3: revoke suspected servers.
Rule 4: re-suggest a client request after a false suspicion.

Mencius summary, continued
Optimization 1: piggyback skips onto accept messages.
Optimization 2: piggyback skips onto future suggest (2a) messages.
Accelerator 1: bound how many skip messages are deferred, and for how long, before flushing them.
Optimization 3: issue revocations in large blocks to amortize the cost.

Mencius vs. Paxos and Fast Paxos in a (W, W) system

                        Paxos                            Fast Paxos              Mencius
  WAN learning latency  Leader: 2, non-leader: 4         2 (when no collision)   2*
  Collision             No                               Yes                     No
  Number of replicas    2f+1                             3f+1                    2f+1
  WAN msg complexity    Leader: 3n-3, non-leader: 3n-2   n^2-1                   3n-3
  Load balancing        No                               Yes                     Yes

Mencius achieves the best of both Paxos and Fast Paxos. (* Subject to the delayed-commit caveat on the next slide.)

Delayed commit
[Figure: P0 suggests x to instance 0 and commits it; P2 suggests y to instance 2 and quickly learns y, but cannot commit until it also learns instance 0 (x) and the skipped instance 1, so x and y commit at P2 only after an extra exchange.]
This adds up to one round trip of extra delay. Out-of-order commit reduces the extra delay when requests are commutable, which is a benefit of using simple consensus.
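
A sketch of the out-of-order commit idea for commutable requests (not the paper's implementation); the `conflicts` predicate, here "requests touch the same register", is an application-level assumption.

```python
def can_commit_out_of_order(candidate, pending_earlier, conflicts):
    """Out-of-order commit (sketch): a learned request `candidate` may be committed
    before earlier instances are learned, as long as it commutes with every request
    still pending in those earlier instances (no-ops commute with everything).

    `pending_earlier` lists the suggested-but-unlearned earlier requests, and
    `conflicts(a, b)` is an application-supplied commutativity test."""
    return all(not conflicts(candidate, earlier) for earlier in pending_earlier)


# Illustrative use with a replicated register service: requests conflict only
# when they touch the same register.
def conflicts(a, b):
    return a["key"] == b["key"]

y = {"op": "put", "key": "r2", "val": 7}
pending = [{"op": "put", "key": "r1", "val": 3}]   # x: suggested but not yet learned
print(can_commit_out_of_order(y, pending, conflicts))  # True: commit y early
```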

Experimental setup
Mencius vs. Paxos on a clique topology of three sites, with 50 ms, 5~20 Mbps links between each pair of sites. The service is a simple replicated register service with k total registers ("Mencius-k") and a dummy payload of ρ bytes: ρ = 4000 is network-bound, ρ = 0 is CPU-bound. Out-of-order commit is an option. Clients send requests at a fixed rate.

Latency vs. system load
Network-bound load (ρ = 4000), 20 Mbps for all links.
[Plot: latency (ms) vs. throughput (ops) for Mencius, Mencius-1024, Mencius-128, Mencius-16, and Paxos.]
Takeaways: Mencius has good latency; latency goes to infinity as the system approaches its maximum throughput; Mencius's latency is as good as Paxos's under high contention; enabling out-of-order commit reduces latency by up to 30%.

Throughput
[Plot: throughput (ops) for Mencius, Mencius-1024, Mencius-128, Mencius-16, and Paxos under ρ = 4000 (network-bound) and ρ = 0 (CPU-bound).]
Mencius gives a 200% improvement when network-bound and up to a 50% improvement when CPU-bound; enabling out-of-order commit hurts throughput. Takeaway: Mencius has good throughput.

Throughput under failure
[Plot: throughput (ops) vs. time (sec) for Mencius, Paxos with a crashed follower, and Paxos with a crashed leader, with the failure-suspected and failure-reported times marked; annotations attribute dips to delayed commits and to duplicates.]
Takeaways: Mencius stalls temporarily when any server fails but its throughput stays higher than Paxos's; Paxos stalls temporarily when its leader fails.

Fault scalability
Star topology: a 10 Mbps link from each site to the central node.
[Plot: throughput (ops) for Mencius and Paxos with 3, 5, and 7 sites, under ρ = 4000 (network-bound) and ρ = 0 (CPU-bound).]
Takeaway: Mencius scales better than Paxos.

More evaluation results
Mencius has lower latency at all servers under low contention, even without out-of-order commit. It also adapts to the available bandwidth: the adaptation happens automatically, making it easy to take advantage of all available bandwidth.

Summary
Mencius is a rotating-leader Paxos: it is easy to ensure safety, and the design is flexible.
Simple consensus is the foundation of its efficiency.
High performance: high throughput under high load; low latency under low load; low latency under medium to high load when out-of-order commit is enabled; better load balancing; better scalability.

References
Flavio Junqueira, Yanhua Mao, Keith Marzullo. "Classic Paxos vs. Fast Paxos: Caveat Emptor." HotDep 2007, Edinburgh, UK.
Yanhua Mao, Flavio Junqueira, Keith Marzullo. "Mencius: Building Efficient Replicated State Machines for WANs." OSDI 2008, San Diego, CA, USA.