Byzantine Fault Tolerance (BFT replication), 5/25/18

Failure models
- Fail-stop: nodes either execute the protocol correctly or just stop.
- Byzantine failures: nodes can behave in any arbitrary way: send illegal messages, try to trick other nodes, collude, ...
- Why this model? (h/t Ellis Michael and Dan Ports) Consequences of software bugs are often unpredictable, and there is a measurable rate of (non-fail-stop) hardware failures. We want to build systems that don't rely on every node being trusted.

What can go wrong? Paxos is fail-stop tolerant.
- A: Append(x, "foo"); Append(x, "bar")
- B: Get(x) -> "foo bar"
- C: Get(x) -> "foo bar"
What can a malicious server do?
- return something totally unrelated
- reorder the append operations ("bar foo")
- only process one of the appends
- show B and C different results

Paxos tolerates up to f out of 2f+1 fail-stop failures. What could a malicious replica do?
- stop processing requests (but Paxos should handle this!)
- change the value of a key
- acknowledge an operation, then discard it
- execute and log a different operation
- tell some replicas that slot 42 is a Put and others that it is a Get
- force view changes to keep the system from making progress

BFT replication
- Same replicated state machine model as Paxos.
- Tolerate f Byzantine failures out of 3f+1 replicas; the other 2f+1 replicas are non-faulty, but might be slow.
- Use voting and signatures so that the correct replicas return the right result.
- If the client hears the same reply from f+1 replicas, it's done! (See the sketch below.)

BFT model
- The attacker controls f replicas: it can make them do anything, knows their crypto keys, and can send messages on their behalf.
- The attacker knows what protocol the other replicas are running.
- The attacker can delay messages in the network arbitrarily.
- But the attacker can't cause more than f replicas to fail, cause clients to misbehave, or break crypto.
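To make the f+1 rule concrete, here is a minimal sketch (not from the slides; the names accept_result, F, and N are illustrative assumptions) of a client that treats a result as final only once f+1 replicas report the same thing, so at least one of those replies must come from a correct replica.

```python
# Illustrative sketch only: a BFT client accepts a result once f+1 verified
# replies agree, because at most f of them can come from faulty replicas.
from collections import Counter

F = 1                    # faults tolerated
N = 3 * F + 1            # total replicas

def accept_result(replies):
    """replies: list of (replica_id, result) pairs whose signatures and
    request ids have already been checked. Returns the agreed result once
    f+1 replicas report it, else None (keep waiting or retransmit)."""
    counts = Counter(result for _, result in replies)
    for result, votes in counts.items():
        if votes >= F + 1:
            return result
    return None

# With f = 1, two matching replies outvote one lying replica:
print(accept_result([(0, "foo bar"), (1, "foo bar"), (2, "bar foo")]))  # foo bar
```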

Why is BFT consensus hard?

Paxos quorums
- Why did Paxos need 2f+1 replicas to tolerate f failures? ...and why does BFT need 3f+1?
- Every operation needs to talk with a majority (f+1): f of those nodes might fail, and we need at least one left.
- Any two quorums intersect.

The Byzantine case
- What if we tried to tolerate Byzantine failures with 2f+1 replicas?
- [Figure: a put(x,1) reaches one quorum of f+1 replicas and a later get(x) reads from another; the two quorums may overlap only at a single replica, and if that replica is faulty the get(x) is answered with the stale X=0.]

Quorums
- In Paxos: quorums of f+1 out of 2f+1 nodes. Quorum intersection: any two quorums intersect in at least one node.
- For BFT: quorums of 2f+1 out of 3f+1 nodes. Any two quorums intersect in at least f+1 nodes => any two quorums intersect in at least one good node. (See the arithmetic sketch below.)

Are quorums enough?
- We saw this problem before with Paxos: just writing to a quorum wasn't enough.
- [Figure: an interrupted put(x,1) leaves some replicas at X=1 and others at X=0.]
- Solution in Paxos: use a two-phase protocol: propose, then accept.
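A quick arithmetic check of the quorum sizes above (my own illustration, not from the slides): two quorums of size q drawn from n replicas must overlap in at least 2q - n members.

```python
# Minimal overlap of two quorums of size q out of n replicas is 2q - n.
def min_overlap(n, q):
    return 2 * q - n

for f in range(1, 5):
    paxos = min_overlap(2 * f + 1, f + 1)        # Paxos: majority quorums
    bft   = min_overlap(3 * f + 1, 2 * f + 1)    # BFT: quorums of 2f+1
    print(f"f={f}: Paxos quorums share >= {paxos} node, "
          f"BFT quorums share >= {bft} nodes (more than f, so one is correct)")
```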

BFT approach
- Use a primary to order requests.
- But the primary might be faulty: it could send a wrong result to the client, ignore the client's request entirely, or send a different op to different replicas (this is the really hard case!).

BFT approach, continued
- All replicas send replies directly to the client.
- Replicas exchange information about the ops they received from the primary (to make sure the primary isn't equivocating).
- Clients notify all replicas of ops, not just the primary; if there is no progress, they replace the primary.
- All messages are cryptographically signed and serve as transferable proof (e.g., "I know you received message X").

Attempt 1
- Client sends the request to the current primary.
- Primary proposes the request in seq number n.
- What's the problem with this? The primary might send a different op order to different replicas.

Attempt 2
- Client sends the request to the primary and the other replicas.
- Primary assigns a seq number and sends PRE-PREPARE(seq, op) to all replicas.
- When a replica receives the PRE-PREPARE, it sends PREPARE(seq, op) to all replicas.
- 2f+1 matching PREPAREs serve as a proof certificate, to anyone. (See the sketch below.)
- Once it has that proof, the replica executes and replies to the client.
- The client can proceed when it hears f+1 matching replies.

Faulty primary
- Can a faulty non-primary replica prevent progress? Can a faulty primary cause a problem that won't be detected? What if it sends different ops, or ops in a different order, to different replicas?
- Case 1: all good nodes get 2f+1 matching PREPAREs. Then they must have gotten the same op.
- Case 2: at least f+1 good nodes get 2f+1 matching PREPAREs. They must have gotten the same op, but what about the other (f or fewer) good nodes?
- Case 3: fewer than f+1 good nodes get 2f+1 matching PREPAREs. The system is stuck and doesn't execute any request.
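A minimal sketch of the PREPARE bookkeeping in Attempt 2 (illustrative only; PrepareLog, on_prepare, and the op_digest representation are my own assumptions, and signature checks plus PRE-PREPARE handling are left out).

```python
# Replica-side PREPARE bookkeeping: 2f+1 matching PREPAREs for (view, seq, op)
# form a prepare certificate that can be shown to anyone.
from collections import defaultdict

F = 1
QUORUM = 2 * F + 1    # out of n = 3f+1 replicas

class PrepareLog:
    def __init__(self):
        # (view, seq, op_digest) -> ids of replicas that sent a matching PREPARE
        self.prepares = defaultdict(set)

    def on_prepare(self, view, seq, op_digest, sender):
        self.prepares[(view, seq, op_digest)].add(sender)
        return self.prepared(view, seq, op_digest)

    def prepared(self, view, seq, op_digest):
        # True once a quorum of 2f+1 replicas has sent matching PREPAREs.
        return len(self.prepares[(view, seq, op_digest)]) >= QUORUM

log = PrepareLog()
for sender in (0, 1, 2):   # with f = 1, three matching PREPAREs form the certificate
    have_cert = log.on_prepare(view=0, seq=42, op_digest="put(x,1)", sender=sender)
print(have_cert)           # True: in Attempt 2 the replica would now execute and reply
```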

View changes
- What if a replica suspects the primary of being faulty (e.g., it heard a request but never saw a PRE-PREPARE)?
- Can it start a view change on its own? No: it needs f+1 view-change requests.
- Who will be the next primary? How do we keep a malicious node from making sure it's always the next primary? primary = view number mod n.

Straw-man view change
- When a replica suspects the primary, it sends VIEW-CHANGE to the next primary, including all of the PREPARE certificates it received, and asks the other replicas to join in.
- Other replicas join the view change when they receive f+1 such requests.
- Once the new primary receives 2f+1 VIEW-CHANGEs, it announces the view with a NEW-VIEW message, which includes copies of the VIEW-CHANGEs (showing the view change is justified and propagating the PREPARE certificates) and starts numbering new operations after the last seq number it saw.

What goes wrong?
- Some replica saw 2f+1 PREPAREs for an op at seq number n and executed it.
- The new primary did not receive the PREPARE certificate for that op.
- The new primary starts numbering new requests at n => two different ops with seq number n!

Fixing view changes
- Need another round in the operation protocol! It is not enough to know that replicas agreed on an op for seq n; we need to make sure the next primary will hear about it.
- After receiving 2f+1 PREPAREs, replicas send a COMMIT message to let the others know.
- Only execute a request after receiving 2f+1 COMMITs: 2f+1 COMMITs is a certificate that any quorum contains f+1 nodes holding the PREPARE certificate. (See the sketch below.)

The final protocol
- Client sends op to the primary.
- Primary sends PRE-PREPARE(seq, op) to all.
- All replicas send PREPARE(seq, op) to all.
- After a replica receives 2f+1 matching PREPARE(seq, op), it sends COMMIT(seq, op) to all.
- After receiving 2f+1 matching COMMIT(seq, op), it executes op and replies to the client.
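Extending the earlier PrepareLog idea, here is a sketch of the extra COMMIT round (again illustrative; CommitLog and its method names are assumptions, not PBFT's actual code): a replica executes an operation only once it holds both its own prepare certificate and 2f+1 matching COMMITs.

```python
# COMMIT bookkeeping: execution waits for a quorum of COMMITs, so any future
# view-change quorum contains f+1 replicas that hold the prepare certificate.
from collections import defaultdict

F = 1
QUORUM = 2 * F + 1

class CommitLog:
    def __init__(self):
        self.commits = defaultdict(set)   # (view, seq, op_digest) -> senders
        self.executed = set()             # seq numbers already executed

    def on_commit(self, view, seq, op_digest, sender, have_prepare_cert):
        """Record a COMMIT; return True exactly when the op becomes executable:
        we hold our own prepare certificate and a quorum of matching COMMITs."""
        self.commits[(view, seq, op_digest)].add(sender)
        ready = (have_prepare_cert
                 and seq not in self.executed
                 and len(self.commits[(view, seq, op_digest)]) >= QUORUM)
        if ready:
            self.executed.add(seq)        # caller now executes op and replies
        return ready
```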

The final protocol (why it is safe)
- Correct clients only accept replies from f+1 replicas.
- Correct replicas only execute once they have a COMMIT certificate, implying that f+1 correct replicas hold a PREPARE certificate.
- Therefore, if a replica has a COMMIT certificate, that operation will survive at that seq number into new views.
- Replicas never send conflicting PREPAREs in the same view.
- Therefore, no two correct replicas ever execute different operations for the same seq number.

BFT vs MultiPaxos
- BFT: 4 phases
  - PRE-PREPARE: primary determines the request order
  - PREPARE: replicas make sure the primary told them all the same order
  - COMMIT: replicas ensure that a quorum knows about the order
  - execute and reply
- MultiPaxos: 3 phases
  - PREPARE: primary determines the request order
  - PREPARE-OK: replicas ensure that a quorum knows about the order
  - execute and reply

PBFT vs MultiPaxos: what did this buy us?
- Before, we could only tolerate fail-stop failures with replication.
- Now we can tolerate any failure, benign or malicious, as long as it affects fewer than 1/3 of the replicas. (What if more than 1/3 of the replicas are faulty?)

Performance: why would we expect BFT to be slow?
- Latency: an extra round.
- Message complexity: O(n^2) communication! (See the tally below.)
- Crypto ops are slow!

Implementation complexity
- Building a bug-free Paxos is hard! BFT is much more complicated.
- Which is more likely: bugs caused by the BFT implementation, or the bugs that BFT is meant to avoid?
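A back-of-the-envelope tally of messages per request (my own illustration under simplified assumptions, not a figure from the slides) makes the O(n^2) point concrete: the all-to-all PREPARE and COMMIT rounds dominate PBFT's cost, while MultiPaxos exchanges only O(n) messages through the leader.

```python
# Rough per-request message counts, ignoring batching, checkpoints, and retries.
def pbft_messages(f):
    n = 3 * f + 1
    return (1              # client -> primary
            + (n - 1)      # PRE-PREPARE: primary -> other replicas
            + n * (n - 1)  # PREPARE: all-to-all
            + n * (n - 1)  # COMMIT: all-to-all
            + n)           # replies: every replica -> client

def multipaxos_messages(f):
    n = 2 * f + 1
    return (1              # client -> leader
            + (n - 1)      # leader -> replicas
            + (n - 1)      # replica acks -> leader
            + 1)           # leader -> client

for f in (1, 2, 3):
    print(f"f={f}: PBFT ~{pbft_messages(f)} messages, MultiPaxos ~{multipaxos_messages(f)}")
```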

BFT summary
- It's possible to build systems that work correctly even though parts of them may be malicious!
- But it requires a lot of complex and expensive mechanisms.
- On the boundary of practicality?