Distributed Algorithms Practical Byzantine Fault Tolerance

Distributed Algorithms Practical Byzantine Fault Tolerance Alberto Montresor Università di Trento 2018/12/06 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Table of contents 1 Introduction 2 Byzantine generals 3 Practical Byzantine Fault Tolerance 4 Beyond PBFT Overview 5 Zyzzyva Introduction Three cases The case of the missing phase View changes 6 Aardvark 7 UpRight

Introduction Motivation Processes may exhibit arbitrary (Byzantine) behavior Malicious attacks They lie They collude Software error Arbitrary states, messages Examples Amazon S3 outage (2008), root cause was a single bit flip in internal state messages 1 Shuttle Mission STS-124 (2008), 3-1 disagreement on sensors during fuel loading (on Earth!) 2 1 http://status.aws.amazon.com/s3-20080720.html 2 https://c3.nasa.gov/dashlink/resources/624/ Alberto Montresor (UniTN) DS - BFT 2018/12/06 1 / 80

Introduction History State of the art at the end of the '90s: theoretically feasible algorithms to tolerate Byzantine failures, but inefficient in practice Assume synchrony: known bounds for message delays and processing speed Most importantly: the synchrony assumption is needed for correctness what about DoS attacks? Bibliography L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382-401, 1982. http://www.disi.unitn.it/~montreso/ds/papers/byzantinegenerals.pdf Alberto Montresor (UniTN) DS - BFT 2018/12/06 2 / 80

Table of contents 1 Introduction 2 Byzantine generals 3 Practical Byzantine Fault Tolerance 4 Beyond PBFT Overview 5 Zyzzyva Introduction Three cases The case of the missing phase View changes 6 Aardvark 7 UpRight

Byzantine generals Byzantine generals [Cartoon from the CS4410 Fall '08 lecture: generals shouting conflicting orders - Attack!, Wait, No, wait!, Surrender!] Alberto Montresor (UniTN) DS - BFT 2018/12/06 3 / 80

Byzantine generals Specification A commanding general must send an order to his n - 1 lieutenant generals such that: IC1: All loyal lieutenants obey the same order IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends Assumptions (Oral messages): Every message that is sent is received correctly The receiver of a message knows who sent it The absence of a message can be detected Alberto Montresor (UniTN) DS - BFT 2018/12/06 4 / 80

Byzantine generals Impossibility results Under the Oral messages assumption, no solution with three generals can handle even a single traitor [Figure: two three-general scenarios that Lieutenant 1 cannot tell apart - in one the traitorous commander sends Attack! to Lieutenant 1 and Retreat! to Lieutenant 2; in the other the commander sends Attack! to both but the traitorous Lieutenant 2 reports "He said Retreat!"] Alberto Montresor (UniTN) DS - BFT 2018/12/06 5 / 80

Byzantine generals Oral Message algorithm OM(m) Algorithm OM(0) 1 The commander sends its value to every lieutenant 2 Each lieutenant uses the value he received from the commander, or uses retreat if he received no value Algorithm OM(m) 1 The commander sends its value to every lieutenant 2 For each i, let v_i be the value lieutenant i receives from the commander, or retreat if it has received no value. Lieutenant i acts as the commander of algorithm OM(m-1) to send the value v_i to each of the n - 2 other lieutenants 3 For each j != i, let v_j be the value received by i from j in Step 2 of algorithm OM(m-1), or retreat if no value was received. Lieutenant i uses the value majority(v_1, ..., v_{n-1}) (a deterministic function) Alberto Montresor (UniTN) DS - BFT 2018/12/06 6 / 80
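
A minimal runnable sketch of OM(m) may make the recursion easier to follow; the simulation below is an illustration only (the traitor behaviour, the tie-breaking to retreat, and all names are assumptions, not part of the slides).

from collections import Counter

def majority(values):
    top = Counter(values).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "retreat"                          # deterministic tie-breaking
    return top[0][0]

def om(m, commander, lieutenants, value, traitors):
    """Return the value each lieutenant adopts after running OM(m)."""
    received = {}
    for i in lieutenants:
        if commander in traitors:
            # A traitorous commander may send different values to different lieutenants.
            received[i] = "attack" if i % 2 == 0 else "retreat"
        else:
            received[i] = value

    if m == 0:
        return received                           # OM(0): use the value received

    decisions = {}
    for i in lieutenants:
        votes = [received[i]]                     # v_i: value from the commander
        for j in lieutenants:
            if j == i:
                continue
            others = [k for k in lieutenants if k != j]
            # v_j: what i learns when j acts as commander of OM(m-1), relaying received[j]
            votes.append(om(m - 1, j, others, received[j], traitors)[i])
        decisions[i] = majority(votes)
    return decisions

# Example: commander 0 is loyal and orders "attack"; lieutenant 2 is the traitor.
print(om(1, 0, [1, 2, 3], "attack", traitors={2}))   # loyal lieutenants 1 and 3 both decide "attack"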

Byzantine generals Oral Message Algorithm Example OM(1) [Figure: example run of OM(1) with commander C and lieutenants L1, L2, L3; one traitor relays R while the loyal generals relay A, and the loyal lieutenants still decide majority(A, A, R) = A] Alberto Montresor (UniTN) DS - BFT 2018/12/06 7 / 80

Byzantine generals Oral messages Theorem For any m, Algorithm OM(m) satisfies conditions IC1 and IC2 if there are more than 3m generals and at most m traitors Problems: message paths of length up to m + 1 (expensive) absence of messages must be detected via time-out (vulnerable to DoS) Alberto Montresor (UniTN) DS - BFT 2018/12/06 8 / 80

Table of contents 1 Introduction 2 Byzantine generals 3 Practical Byzantine Fault Tolerance 4 Beyond PBFT Overview 5 Zyzzyva Introduction Three cases The case of the missing phase View changes 6 Aardvark 7 UpRight

Practical Byzantine Fault Tolerance A Byzantine renaissance Bibliography M. Castro and B. Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst., 20:398-461, Nov. 2002. http://www.disi.unitn.it/~montreso/ds/papers/pbfttocs.pdf Contributions First state machine replication protocol that survives Byzantine faults in asynchronous networks Live under weak Byzantine assumptions a Byzantine Paxos/Raft! Implementation of a Byzantine-fault-tolerant distributed file system Experiments measuring the cost of the replication technique Alberto Montresor (UniTN) DS - BFT 2018/12/06 9 / 80

Practical Byzantine Fault Tolerance Assumptions System model Asynchronous distributed system with N processes Unreliable channels Unbreakable cryptography Message m is signed by its sender i, and we write <m>_σ(i), through: Public/private key pairs Message authentication codes (MAC) A digest d(m) of message m is produced through collision-resistant hash functions Alberto Montresor (UniTN) DS - BFT 2018/12/06 10 / 80

Practical Byzantine Fault Tolerance Assumptions Failure model Up to f Byzantine servers N > 3f total servers (Potentially Byzantine clients) Independent failures Different implementations of the service Different operating systems Different root passwords, different administrator Alberto Montresor (UniTN) DS - BFT 2018/12/06 11 / 80

Practical Byzantine Fault Tolerance Specification State machine replication Replicated service with a state and deterministic operations operating on it Clients issue a request and block waiting for the reply Safety The system satisfies linearizability, provided that N ≥ 3f + 1 Regardless of faulty clients: all operations performed by faulty clients are observed in a consistent way by non-faulty clients The algorithm does not rely on synchrony to provide safety... Liveness It relies on synchrony to provide liveness Assumes delay(t) does not grow faster than t indefinitely A weak assumption if network faults are eventually repaired This circumvents the FLP impossibility result Alberto Montresor (UniTN) DS - BFT 2018/12/06 12 / 80

Practical Byzantine Fault Tolerance Optimality Theorem To tolerate up to f malicious nodes, N must be at least 3f + 1 Proof It must be possible to proceed after communicating with N - f replicas, because the f faulty replicas may not respond But the f replicas not responding may be just slow, so f of those that responded might be faulty The correct replicas that responded (at least N - 2f) must outnumber the faulty ones, so N - 2f > f, i.e., N > 3f Alberto Montresor (UniTN) DS - BFT 2018/12/06 13 / 80

Practical Byzantine Fault Tolerance Optimality So N > 3f ensures that at least one correct replica is present in every reply set Take N = 3f + 1; more replicas are useless: they only mean more and larger messages without improving resiliency Alberto Montresor (UniTN) DS - BFT 2018/12/06 14 / 80

Practical Byzantine Fault Tolerance Processes and views Replica IDs: 0 ... N - 1 Replicas move through a sequence of configurations called views During view v: The primary replica is i = v mod N The others are backups View changes are carried out when the primary appears to have failed Alberto Montresor (UniTN) DS - BFT 2018/12/06 15 / 80

Practical Byzantine Fault Tolerance The algorithm To invoke an operation, the client sends a request to the primary The primary multicasts the request to the backups Quorums are employed to guarantee ordering on operations When an order has been agreed, replicas execute the request and send a reply to the client When the client receives at least f + 1 identical replies, it is satisfied Client Primary Backup 1 Backup 2 Backup 3 Alberto Montresor (UniTN) DS - BFT 2018/12/06 16 / 80

Practical Byzantine Fault Tolerance Problems The primary could be faulty! It could ignore commands, assign the same sequence number to different requests, skip sequence numbers, etc. Backups monitor the primary's behavior and trigger view changes to replace a faulty primary Backups could be faulty! They could incorrectly store commands forwarded by a correct primary Use dissemination Byzantine quorum systems Faulty replicas could incorrectly respond to the client! The client waits for f + 1 matching replies before accepting a response Alberto Montresor (UniTN) DS - BFT 2018/12/06 17 / 80

Practical Byzantine Fault Tolerance The general idea Algorithm steps are justified by certificates Sets (quorums) of signed messages from distinct replicas proving that a property of interest holds With quorums of size at least 2f + 1: Any two quorums intersect in at least one correct replica There is always one quorum that contains only non-faulty replicas [Figure: four servers and their clients; a quorum of three servers first accepts write A, then a different quorum of three servers accepts write B; the two quorums necessarily overlap in a correct server that has seen both writes] Alberto Montresor (UniTN) DS - BFT 2018/12/06 18 / 80
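
A quick counting argument behind both claims: with N = 3f + 1 replicas and quorums of size 2f + 1, any two quorums Q1 and Q2 satisfy |Q1 ∩ Q2| ≥ (2f + 1) + (2f + 1) - (3f + 1) = f + 1, and since at most f replicas are faulty, at least one replica in the intersection is correct; moreover, excluding the f faulty replicas still leaves 2f + 1 correct ones, so a quorum made only of non-faulty replicas always exists.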

Practical Byzantine Fault Tolerance Protocol schema Normal operation How the protocol works in the absence of failures hopefully, the common case View changes How to depose a faulty primary and elect a new one Garbage collection How to reclaim the storage used to keep certificates Recovery How to make a faulty replica behave correctly again (not here) Alberto Montresor (UniTN) DS - BFT 2018/12/06 19 / 80

Practical Byzantine Fault Tolerance State The internal state of each replica includes: the state of the actual service a message log containing all the messages the replica has accepted an integer denoting the replica's current view Alberto Montresor (UniTN) DS - BFT 2018/12/06 20 / 80

Practical Byzantine Fault Tolerance Client request Primary Request Backup 1 Backup 2 Backup 3 <request, o, t, c>_σ(c) o: state machine operation t: timestamp (used to ensure exactly-once semantics) c: client id σ(c): client signature Alberto Montresor (UniTN) DS - BFT 2018/12/06 21 / 80

Practical Byzantine Fault Tolerance Pre-prepare phase Primary Request Backup 1 Backup 2 Backup 3 Pre-prepare <pre-prepare, v, n, d(m)>_σ(p), m v: current view n: sequence number d(m): digest of the client message σ(p): primary signature m: client message Alberto Montresor (UniTN) DS - BFT 2018/12/06 22 / 80

Practical Byzantine Fault Tolerance Pre-prepare phase <pre-prepare, v, n, d(m)>_σ(p), m Correct replica i accepts the pre-prepare if: the pre-prepare message is well-formed the current view of i is v i has not accepted another pre-prepare for v, n with a different digest n is between the two water-marks L and H (to avoid sequence number exhaustion caused by faulty primaries) Each accepted pre-prepare message is stored in the accepting replica's message log (including the primary's) Non-accepted pre-prepare messages are simply discarded Alberto Montresor (UniTN) DS - BFT 2018/12/06 23 / 80
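
The acceptance test can be written down almost verbatim. The sketch below is illustrative only: the PrePrepare/ReplicaState structures, field names and the well_formed flag (standing in for signature and digest verification) are assumptions, not PBFT's actual code.

from dataclasses import dataclass, field

@dataclass
class PrePrepare:
    view: int          # v
    seqno: int         # n
    digest: str        # d(m), digest of the client request
    request: bytes     # m, the client message (piggybacked)

@dataclass
class ReplicaState:
    view: int                      # current view of this replica
    low_mark: int                  # water-mark L
    high_mark: int                 # water-mark H
    accepted: dict = field(default_factory=dict)   # (view, seqno) -> digest
    log: list = field(default_factory=list)        # accepted messages

def accept_pre_prepare(state: ReplicaState, msg: PrePrepare, well_formed: bool) -> bool:
    if not well_formed:                                  # signature/digest checks stubbed out
        return False
    if msg.view != state.view:                           # must refer to the current view
        return False
    prev = state.accepted.get((msg.view, msg.seqno))
    if prev is not None and prev != msg.digest:          # no conflicting pre-prepare for (v, n)
        return False
    if not (state.low_mark < msg.seqno < state.high_mark):   # L < n < H
        return False
    state.accepted[(msg.view, msg.seqno)] = msg.digest
    state.log.append(("pre-prepare", msg.view, msg.seqno, msg.digest))
    return True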

Practical Byzantine Fault Tolerance Prepare phase Primary Request Backup 1 Backup 2 Backup 3 Pre-prepare Prepare <prepare, v, n, d(m)>_σ(i) Accepted by correct replica j if: the prepare message is well-formed the current view of j is v n is between the two water-marks L and H Replicas that send prepare accept the sequence number n for m in view v Each accepted prepare message is stored in the accepting replica's message log Alberto Montresor (UniTN) DS - BFT 2018/12/06 24 / 80

Practical Byzantine Fault Tolerance Prepare certificate (P-certificate) Replica i produces a prepare certificate prepared(m, v, n, i) iff its log holds: The request m A pre-prepare for m in view v with sequence number n 2f prepare messages from different backups that match the pre-prepare prepared(m, v, n, i) means that a quorum of 2f + 1 replicas agrees with assigning sequence number n to m in view v Theorem There are no two non-faulty replicas i, j such that prepared(m, v, n, i) and prepared(m', v, n, j) with m != m' Proof? Alberto Montresor (UniTN) DS - BFT 2018/12/06 25 / 80
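
Continuing the same illustrative sketch (log entries are tuples such as ("request", digest), ("pre-prepare", v, n, digest) and ("prepare", v, n, digest, sender); the layout is an assumption), the P-certificate predicate is a simple scan of the log:

def prepared(log, m_digest, v, n, f):
    has_request = any(e[0] == "request" and e[1] == m_digest for e in log)
    has_pre_prepare = any(e[0] == "pre-prepare" and e[1:4] == (v, n, m_digest) for e in log)
    prepare_senders = {e[4] for e in log
                       if e[0] == "prepare" and e[1:4] == (v, n, m_digest)}
    # 2f matching prepares from distinct backups, plus the pre-prepare, give the quorum of 2f+1
    return has_request and has_pre_prepare and len(prepare_senders) >= 2 * f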

Practical Byzantine Fault Tolerance Commit phase Primary Request Backup 1 Backup 2 Backup 3 Pre-prepare Prepare Commit <commit, v, n, d(m), i>_σ(i) After having collected a P-certificate prepared(m, v, n, i), replica i sends a commit message Accepted if: The commit message is well-formed The current view of i is v n is between the two water-marks L and H Alberto Montresor (UniTN) DS - BFT 2018/12/06 26 / 80

Practical Byzantine Fault Tolerance Commit certificate (C-certificate) Commit certificates ensure total order across views: we guarantee that we cannot miss prepare certificates during a view change A replica has a certificate committed(m, v, n, i) if: it holds a P-certificate prepared(m, v, n, i) its log contains 2f + 1 matching commit messages from different replicas (possibly including its own) A replica executes a request after it gets the commit certificate for it and has executed all requests with smaller sequence numbers Alberto Montresor (UniTN) DS - BFT 2018/12/06 27 / 80
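
In the same illustrative sketch, the C-certificate and the execution rule become:

def committed(log, m_digest, v, n, f):
    commit_senders = {e[4] for e in log
                      if e[0] == "commit" and e[1:4] == (v, n, m_digest)}
    return prepared(log, m_digest, v, n, f) and len(commit_senders) >= 2 * f + 1

def ready_to_execute(log, m_digest, v, n, f, last_executed):
    # Execute only with a C-certificate and once every smaller sequence number is done,
    # which keeps execution totally ordered across correct replicas.
    return committed(log, m_digest, v, n, f) and last_executed == n - 1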

Practical Byzantine Fault Tolerance Reply phase Primary Request Backup 1 Backup 2 Backup 3 Pre-prepare Prepare Commit Reply <reply, v, t, c, i, r>_σ(i) r is the reply The client waits for f + 1 replies with the same t and r If the client does not receive replies soon enough, it broadcasts the request to all replicas Alberto Montresor (UniTN) DS - BFT 2018/12/06 28 / 80
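
On the client side, waiting for f + 1 matching replies can be sketched as follows (class and method names are illustrative; retransmission on timeout is only hinted at):

from collections import defaultdict

class PbftClient:
    def __init__(self, f):
        self.f = f
        self.votes = defaultdict(set)          # (t, result) -> replica ids that sent it

    def on_reply(self, view, t, client_id, replica_id, result):
        self.votes[(t, result)].add(replica_id)
        if len(self.votes[(t, result)]) >= self.f + 1:
            return result                      # f+1 matching replies: accept the result
        return None                            # otherwise keep waiting; on timeout,
                                               # re-broadcast the request to all replicas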

Practical Byzantine Fault Tolerance View change An unsatisfied backup replica i mutinies: it stops accepting messages (except view-change and new-view) and multicasts <view-change, v + 1, P, i>_σ(i) P contains a P-certificate P_m for each request m (up to a given number, see garbage collection) The mutiny succeeds if the new primary collects a new-view certificate V: a set containing 2f + 1 view-change messages indicating that 2f + 1 distinct replicas (including itself) support the change of leadership Alberto Montresor (UniTN) DS - BFT 2018/12/06 29 / 80

Practical Byzantine Fault Tolerance View change The primary elect p' (replica v + 1 mod N): extracts from the new-view certificate V the highest sequence number h of any message for which V contains a P-certificate creates a new pre-prepare message for every sequence number n <= h: if there is a P-certificate for n, m in V, then O = O ∪ {<pre-prepare, v + 1, n, d(m)>_σ(p')} otherwise O = O ∪ {<pre-prepare, v + 1, n, d(null)>_σ(p')} p' multicasts <new-view, v + 1, V, O>_σ(p') Alberto Montresor (UniTN) DS - BFT 2018/12/06 30 / 80
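
How the primary elect assembles O can be sketched as below; the view-change message layout (a p_certs dictionary mapping sequence numbers to request digests) and the start_seqno parameter are assumptions made for the example.

def build_new_view(view_changes, new_view, start_seqno, null_digest="null"):
    # Collect, from the new-view certificate V, the P-certificates: seqno -> request digest.
    certified = {}
    for vc in view_changes:                 # each vc.p_certs: {n: d(m)}
        certified.update(vc.p_certs)
    O = []
    if certified:
        h = max(certified)                  # highest sequence number with a P-certificate
        for n in range(start_seqno, h + 1):
            digest = certified.get(n, null_digest)   # gaps are filled with null requests
            O.append(("pre-prepare", new_view, n, digest))
    return O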

Practical Byzantine Fault Tolerance View change A backup accepts a <new-view, v + 1, V, O>_σ(p') message for v + 1 if: it is signed properly by p' V contains valid view-change messages for v + 1 the correctness of O can be locally verified (by repeating the primary's computation) Actions: Adds all entries in O to its log (so did p'!) Multicasts a prepare for each message in O Adds all prepares to the log and enters the new view Alberto Montresor (UniTN) DS - BFT 2018/12/06 31 / 80

Practical Byzantine Fault Tolerance Garbage collection A correct replica keeps in its log the messages about a request o until: o has been executed by a majority of correct replicas, and this fact can be proven during a view change The log is truncated using stable checkpoints Each replica i periodically (after processing k requests) checkpoints its state and multicasts <checkpoint, n, d, i> n: last executed request d: state digest A set S containing 2f + 1 equivalent checkpoint messages from distinct processes is a proof of the checkpoint's correctness (stable checkpoint certificate) Alberto Montresor (UniTN) DS - BFT 2018/12/06 32 / 80
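
A sketch of the checkpointing bookkeeping follows; the period, the digest function and the (seqno, message) log layout are assumptions made for illustration.

import hashlib

class CheckpointManager:
    def __init__(self, f, period=100):
        self.f = f
        self.period = period                   # checkpoint every k = period requests
        self.votes = {}                        # (n, digest) -> replica ids
        self.stable_n = 0                      # sequence number of the last stable checkpoint

    def after_execute(self, replica_id, n, state_bytes, multicast):
        if n % self.period == 0:
            d = hashlib.sha256(state_bytes).hexdigest()
            multicast(("checkpoint", n, d, replica_id))

    def on_checkpoint(self, n, d, sender, log):
        # log here is a list of (seqno, message) pairs
        self.votes.setdefault((n, d), set()).add(sender)
        if len(self.votes[(n, d)]) >= 2 * self.f + 1 and n > self.stable_n:
            self.stable_n = n                                    # checkpoint n is now stable
            log[:] = [(s, msg) for (s, msg) in log if s > n]     # truncate the log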

Practical Byzantine Fault Tolerance View change, revisited Message <view-change, v + 1, n, S, C, P, i>_σ(i) n: the sequence number of the last stable checkpoint S: the last stable checkpoint C: the checkpoint certificate (2f + 1 checkpoint messages) Message <new-view, v + 1, n, V, O>_σ(p') n: the sequence number of the last stable checkpoint V, O: contain only requests with sequence number larger than n Alberto Montresor (UniTN) DS - BFT 2018/12/06 33 / 80

Practical Byzantine Fault Tolerance Optimizations Reducing replies One replica designated to send reply to client Other replicas send digest of the reply Lower latency for writes (4 messages) Replicas respond at Prepare phase (tentative execution) Client waits for 2f + 1 matching responses Fast reads (one round trip) Client sends to all; they respond immediately Client waits for 2f + 1 matching responses Alberto Montresor (UniTN) DS - BFT 2018/12/06 34 / 80

Practical Byzantine Fault Tolerance Optimizations: cryptography Reducing overhead Public-key cryptography is used only for view changes MACs (message authentication codes) are used for all other messages To give an idea (Pentium 200 MHz): Generating a 1024-bit RSA signature of an MD5 digest: 43 ms Generating a MAC of the same message: 10 µs Alberto Montresor (UniTN) DS - BFT 2018/12/06 35 / 80
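
The gap is easy to reproduce on current hardware; the micro-benchmark below is a sketch (it assumes the third-party cryptography package is installed, and the absolute numbers will of course differ from the Pentium figures above).

import hashlib, hmac, time

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

payload = hashlib.md5(b"some request").digest()      # the MD5 digest being authenticated

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
t0 = time.perf_counter()
sig = key.sign(payload, padding.PKCS1v15(), hashes.SHA256())   # public-key signature
t1 = time.perf_counter()

shared = b"pairwise secret between two replicas"
t2 = time.perf_counter()
mac = hmac.new(shared, payload, hashlib.sha256).digest()       # MAC over the same digest
t3 = time.perf_counter()

print(f"RSA signature: {(t1 - t0) * 1e3:.2f} ms, MAC: {(t3 - t2) * 1e6:.1f} us")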

Practical Byzantine Fault Tolerance Application: Byzantine NFS server Alberto Montresor (UniTN) DS - BFT 2018/12/06 36 / 80

Practical Byzantine Fault Tolerance Application: Byzantine NFS server The time spent waiting for lookup operations to complete in BFS-strict is at least 20% of the elapsed time for the first four phases, whereas it is less than 5% of the elapsed time for the last phase.
Table 3: Andrew benchmark, BFS vs NFS-std (times in seconds)
phase   BFS-strict     BFS r/o lookup   NFS-std
1       0.55 (-69%)    0.47 (-73%)      1.75
2       9.24 (-2%)     7.91 (-16%)      9.46
3       7.24 (35%)     6.45 (20%)       5.36
4       8.77 (32%)     7.87 (19%)       6.60
5       38.68 (-2%)    38.38 (-2%)      39.35
total   64.48 (3%)     61.07 (-2%)      62.52
These results show that BFS can be used in practice: BFS-strict takes only 3% more time to run the complete benchmark. Alberto Montresor (UniTN) DS - BFT 2018/12/06 37 / 80

Practical Byzantine Fault Tolerance Reality check Examples of systems that have adopted Byzantine fault tolerance: Boeing 777 Aircraft Information Management System Boeing 777/787 flight control system SpaceX Dragon flight control system Bitcoin Alberto Montresor (UniTN) DS - BFT 2018/12/06 38 / 80

Distributed Algorithms Practical Byzantine Fault Tolerance Alberto Montresor Università di Trento 2018/12/06 Acknowledgments: Lorenzo Alvisi This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Table of contents 1 Introduction 2 Byzantine generals 3 Practical Byzantine Fault Tolerance 4 Beyond PBFT Overview 5 Zyzzyva Introduction Three cases The case of the missing phase View changes 6 Aardvark 7 UpRight


Beyond PBFT Overview Overview After PBFT, several other papers started to appear: HQ: J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In Proc. of the Symposium on Operating Systems Design and Implementation, OSDI 06, Oct. 2005 Q/U: M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-scalable Byzantine fault-tolerant services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP 05, Oct. 2005 The end result has been to complicate the adoption of Byzantine solutions. Alberto Montresor (UniTN) DS - BFT 2018/12/06 39 / 80

Beyond PBFT Overview Overview In the regions we studied (up to f = 5), if contention is low and low latency is the main issue, then if it is acceptable to use 5f + 1 replicas, Q/U is the best choice, else HQ is the best since it outperforms PBFT with a batch size of 1. Otherwise, PBFT is the best choice in this region: It can handle high contention workloads, and it can beat the throughput of both HQ and Q/U through its use of batching. Outside of this region, we expect HQ will scale best: HQ's throughput decreases more slowly than Q/U's (because of the latter's larger message and processing costs) and PBFT's (where eventually batching cannot compensate for the quadratic number of messages). Alberto Montresor (UniTN) DS - BFT 2018/12/06 40 / 80

Table of contents 1 Introduction 2 Byzantine generals 3 Practical Byzantine Fault Tolerance 4 Beyond PBFT Overview 5 Zyzzyva Introduction Three cases The case of the missing phase View changes 6 Aardvark 7 UpRight

Zyzzyva Introduction Zyzzyva (SOSP 07) R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative Byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP 07, Stevenson, WA, Oct. 2007. ACM. http://www.disi.unitn.it/~montreso/ds/papers/zyzzyva.pdf One protocol to rule them all! Zyzzyva is the last word on BFT! (Is it?) Footnote: Zyzzyva is the last word of the English dictionary (apart from Zyzzyzus) Alberto Montresor (UniTN) DS - BFT 2018/12/06 41 / 80

Zyzzyva Introduction Replica coordination All correct replicas execute the same sequence of commands For each received command c, correct replicas: Agree on c s position in the sequence Execute c in the agreed upon order Reply to the client Alberto Montresor (UniTN) DS - BFT 2018/12/06 42 / 80

Zyzzyva Introduction How it is done now Primary Request Backup 1 Backup 2 Backup 3 Pre-prepare Prepare Commit Reply Alberto Montresor (UniTN) DS - BFT 2018/12/06 43 / 80

Zyzzyva Introduction The engineer's rule of thumb Citation Handle normal and worst case separately as a rule, because the requirements for the two are quite different: the normal case must be fast; the worst case must make some progress Butler Lampson, Hints for Computer System Design Alberto Montresor (UniTN) DS - BFT 2018/12/06 44 / 80

Zyzzyva Introduction How Zyzzyva does it Primary Request Replica 1 Replica 2 Replica 3 Alberto Montresor (UniTN) DS - BFT 2018/12/06 45 / 80

Zyzzyva Introduction Specification for State Machine Replication (SMR) Stability A command is stable at a replica once its position in the sequence cannot change Safety Correct clients only process replies to stable commands Liveness All commands issued by correct clients eventually become stable and elicit a reply Alberto Montresor (UniTN) DS - BFT 2018/12/06 46 / 80

Zyzzyva Introduction Enforcing safety Safety requires: Correct clients only process replies to stable commands...but SMR implementations enforce instead: Correct replicas only execute and reply to commands that are stable The service performs an output commit with each reply Alberto Montresor (UniTN) DS - BFT 2018/12/06 47 / 80

Zyzzyva Introduction Speculative BFT (Trust, but verify) Replicas execute and reply to a command without knowing whether it is stable trust order provided by primary no explicit replica agreement! Correct client, before processing reply, verifies that it corresponds to stable command if not, client takes action to ensure liveness Alberto Montresor (UniTN) DS - BFT 2018/12/06 48 / 80

Zyzzyva Introduction Verifying stability Necessary condition for stability in Zyzzyva: A command c can become stable only if a majority of correct replicas agree on its position in the sequence The client can process a response for c iff: a majority of correct replicas agrees on c's position the set of replies is incompatible, for all possible future executions, with a majority of correct replicas agreeing on a different command holding c's current position Alberto Montresor (UniTN) DS - BFT 2018/12/06 49 / 80

Zyzzyva Introduction History History H_{i,k} is the sequence of the first k commands executed by replica i On receipt of a command c from the primary, the replica appends c to its command history The replica's reply for c includes: the application-level response the corresponding command history Additional detail: the history can be hashed through incremental hashing Alberto Montresor (UniTN) DS - BFT 2018/12/06 50 / 80
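
Incremental hashing of the command history can be sketched like this (the initial value H_{i,0} and the use of SHA-256 are assumptions made for the example):

import hashlib

def d(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class CommandHistory:
    def __init__(self):
        self.commands = []
        self.hash = b"\x00" * 32               # H_{i,0}: assumed fixed initial value

    def append(self, command: bytes):
        # H_{i,k} = h(H_{i,k-1} || d(c_k)): one digest summarizes the whole prefix,
        # so two replicas can compare histories by comparing a single value.
        self.commands.append(command)
        self.hash = d(self.hash + d(command))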

Zyzzyva Three cases Case 1: Unanimity Primary c <c,k> <r_1, H_{1,k}> Replica 1 <r_2, H_{2,k}> <c,k> Replica 2 <c,k> <r_3, H_{3,k}> Replica 3 <r_4, H_{4,k}> The client processes the response if all replies match: r_1 = ... = r_4 and H_{1,k} = ... = H_{4,k} Alberto Montresor (UniTN) DS - BFT 2018/12/06 51 / 80

Zyzzyva Three cases Case 1: Unanimity Some comments: Note that although the client has a proof that the request's position in the command history is irremediably set, no server has such a proof Comparison of histories may be based on an incremental hash Three message hops to complete the request in the good case Is it safe to accept the reply in this case? All processes have agreed on the ordering Correct processes cannot change their mind later The new primary can ask n - f replicas for their histories Alberto Montresor (UniTN) DS - BFT 2018/12/06 52 / 80

Zyzzyva Three cases Case 2: A majority of correct replicas agree Primary c <c,k> <r_1, H_{1,k}> Replica 1 <r_2, H_{2,k}> <c,k> Replica 2 <c,k> <r_3, H_{3,k}> Replica 3 Is it safe to accept such a message? Alberto Montresor (UniTN) DS - BFT 2018/12/06 53 / 80

Zyzzyva Three cases Case 2: A majority of correct replicas agree Primary c <c,k> <r_1, H_{1,k}> Replica 1 <r_2, H_{2,k}> <c,k> Replica 2 <r_3, H_{3,k}> Replica 3 Consider this case... Alberto Montresor (UniTN) DS - BFT 2018/12/06 54 / 80

Zyzzyva Three cases Case 2: A majority of correct replicas agree Primary Replica 1 c <r_i, H_{i,k}> <c,k> <c,k> CC = <H_{1,k}, H_{2,k}, H_{3,k}> Replica 2 <c,k> Replica 3 The client sends to all replicas a commit certificate containing 2f + 1 matching histories Alberto Montresor (UniTN) DS - BFT 2018/12/06 55 / 80

Zyzzyva Three cases Case 2: A majority of correct replicas agree Primary Replica 1 c <r_i, H_{i,k}> <c,k> <c,k> ack CC = <H_{1,k}, H_{2,k}, H_{3,k}> Replica 2 <c,k> Replica 3 The client processes the response if it receives at least 2f + 1 acks Alberto Montresor (UniTN) DS - BFT 2018/12/06 56 / 80
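
The client-side decision across the three cases can be sketched as follows (3f + 1 replicas; the message handling, the commit-certificate contents and all names are illustrative assumptions, not Zyzzyva's actual code):

from collections import Counter

class ZyzzyvaClient:
    def __init__(self, f):
        self.f = f
        self.n = 3 * f + 1

    def on_speculative_replies(self, replies):
        # replies: list of (result, history_hash) pairs from distinct replicas
        if not replies:
            return "wait", None
        (result, hist), matching = Counter(replies).most_common(1)[0]
        if matching == self.n:
            return "complete", result                     # case 1: all 3f+1 replies match
        if matching >= 2 * self.f + 1:
            cc = [r for r in replies if r == (result, hist)]   # case 2: commit certificate
            return "send-commit-certificate", cc               # ...then wait for 2f+1 acks
        return "retransmit", None                         # case 3: hint that the primary is faulty

    def on_acks(self, acks):
        return len(acks) >= 2 * self.f + 1                # process the response after 2f+1 acks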

Zyzzyva Three cases Case 2: A majority of correct replicas agree Safe? The certificate proves that a majority of correct processes agree on its position in the sequence Incompatible with a majority backing a different command for that position Stability Stability depends on matching command histories Stability is prefix-closed: if a command with sequence number k is stable, then so is every command with sequence number k' < k Alberto Montresor (UniTN) DS - BFT 2018/12/06 57 / 80

Zyzzyva Three cases Case 3: None of the above Primary c <c,k> <r_1, H_{1,k}> Replica 1 <r_2, H_{2,k}> Replica 2 Replica 3 Fewer than 2f + 1 replies match The client retransmits c to all replicas, hinting that the primary may be faulty Alberto Montresor (UniTN) DS - BFT 2018/12/06 58 / 80

Zyzzyva The case of the missing phase The case of the missing phase Primary Backup 1 Backup 2 Backup 3 Request Pre-prepare Prepare Commit Reply Where did the third phase go? Why was it there to begin with? Primary Replica 1 c <r_i, H_{i,k}> <c,k> <c,k> ack CC = <H_{1,k}, H_{2,k}, H_{3,k}> Replica 2 <c,k> Replica 3 Alberto Montresor (UniTN) DS - BFT 2018/12/06 59 / 80

Zyzzyva The case of the missing phase The missing phase: commit Consider this scenario: f malicious replicas, including the primary The primary stops communicating with f correct replicas They go on strike: they stop accepting messages in this view and ask for a view change So f + f replicas stop accepting messages, and f + 1 replicas keep working The remaining f + 1 replicas are not enough to conclude the pre-prepare and prepare phases The f correct processes that are asking for a view change are not enough to conclude one, so there is no opportunity to regain liveness by electing a new primary Alberto Montresor (UniTN) DS - BFT 2018/12/06 60 / 80

Zyzzyva The case of the missing phase The missing phase: commit The third phase of PBFT breaks this stalemate: The remaining f + 1 replicas either gather the evidence necessary to complete the request, or determine that a view change is necessary The commit phase is needed for liveness Alberto Montresor (UniTN) DS - BFT 2018/12/06 61 / 80

Zyzzyva View changes Where did the third phase go? In PBFT What compromises liveness in the previous scenario is that the PBFT view change protocol lets correct replicas commit to a view change and become silent in a view without any guarantee that their action will lead to the view change In Zyzzyva A correct replica does not abandon view v unless it is guaranteed that every other correct replica will do the same, forcing a new view and a new primary Alberto Montresor (UniTN) DS - BFT 2018/12/06 62 / 80

Zyzzyva View changes View change Two phases: Processes unsatisfied with the current primary send a message <i-hate-the-primary, v> to all If a process collects f + 1 i-hate-the-primary messages, it sends a message to all containing such messages and starts a new view change (similar to the traditional one) The extra phase of the agreement protocol is moved to the view change protocol Alberto Montresor (UniTN) DS - BFT 2018/12/06 63 / 80

Zyzzyva View changes Optimizations Checkpoint protocol to garbage collect histories Replacing digital signatures with MAC Replicating application state at only 2f + 1 replicas Batching Alberto Montresor (UniTN) DS - BFT 2018/12/06 64 / 80

Zyzzyva View changes Performance [Figure (Kotla et al., Fig. 4): realized throughput (Kops/sec) vs. number of clients for the 0/0 benchmark, systems configured to tolerate f = 1 faults; curves for Unreplicated, Zyzzyva, Zyzzyva (B=10), Zyzzyva5 (B=10), Zyzzyva5, PBFT (B=10), PBFT, Q/U max throughput, and HQ] Alberto Montresor (UniTN) DS - BFT 2018/12/06 65 / 80

Zyzzyva View changes Discussion What have you learned? Do you agree on the principles? Alberto Montresor (UniTN) DS - BFT 2018/12/06 66 / 80

Table of contents 1 Introduction 2 Byzantine generals 3 Practical Byzantine Fault Tolerance 4 Beyond PBFT Overview 5 Zyzzyva Introduction Three cases The case of the missing phase View changes 6 Aardvark 7 UpRight

Aardvark Aardvark (NSDI 09) A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 09, pages 153-168. USENIX Association, 2009. http://www.disi.unitn.it/~montreso/ds/papers/aardvark.pdf A new beginning! Footnote: Aardvark is the first word of the English dictionary ("oritteropo" in Italian) Alberto Montresor (UniTN) DS - BFT 2018/12/06 67 / 80

Aardvark From the article Surviving vs tolerating Although current BFT systems can survive Byzantine faults without compromising safety, we contend that a system that can be made completely unavailable by a simple Byzantine failure can hardly be said to tolerate Byzantine faults. Alberto Montresor (UniTN) DS - BFT 2018/12/06 68 / 80

Aardvark Conventional wisdom Handle normal and worst case separately remain safe in worst case make progress in normal case Maximize performance when the network is synchronous all clients and servers behave correctly Alberto Montresor (UniTN) DS - BFT 2018/12/06 69 / 80

Aardvark Conventional wisdom Misguided: it encourages systems that fail to deliver BFT Dangerous: it encourages fragile optimizations Futile: it yields diminishing returns on the common case Alberto Montresor (UniTN) DS - BFT 2018/12/06 69 / 80

Aardvark Blueprint Build the system around an execution path that: provides acceptable performance across the broadest set of executions is easy to implement is robust against Byzantine attempts to push the system away from it Alberto Montresor (UniTN) DS - BFT 2018/12/06 70 / 80

Aardvark Revisiting conventional wisdom Signatures are expensive, use MACs? Faulty clients can use MACs to generate ambiguity (one node validating a MAC authenticator does not guarantee that any other node will validate that same authenticator), so Aardvark requires clients to sign requests View changes are to be avoided? Aardvark uses regular view changes to maintain high throughput despite faulty primaries Hardware multicast is a boon? Aardvark uses separate work queues for clients and individual replicas, and a fully connected topology among replicas (separate NICs) Alberto Montresor (UniTN) DS - BFT 2018/12/06 71 / 80

Aardvark MAC Attack Primary c <c,k> Replica 1 <c,k> Replica 2 <c,k> Replica 3 Alberto Montresor (UniTN) DS - BFT 2018/12/06 72 / 80


Aardvark Throughput
         Best case   Faulty client   Client flood   Faulty primary   Faulty replica
PBFT     62K         0               crash          1k               250
Q/U      24K         0               crash          NA               19k
HQ       15K         NA              4.5K           NA               crash
Zyzzyva  80K         0               crash          crash            0
Aardvark 39K         39K             7.8K           37K              11K
Alberto Montresor (UniTN) DS - BFT 2018/12/06 74 / 80

Table of contents 1 Introduction 2 Byzantine generals 3 Practical Byzantine Fault Tolerance 4 Beyond PBFT Overview 5 Zyzzyva Introduction Three cases The case of the missing phase View changes 6 Aardvark 7 UpRight

UpRight UpRight Bibliography A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riche. UpRight cluster services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP 09, Oct. 2009. http://www.disi.unitn.it/~montreso/ds/papers/upright.pdf A new (B)FT replication library Minimal intrusiveness for existing apps Adequate performance Goal: ease BFT deployment make explicit incremental cost of BFT switching to BFT: simple change in a config file Alberto Montresor (UniTN) DS - BFT 2018/12/06 75 / 80

UpRight UpRight u = max number of failures to ensure liveness r = max number of commission failures to preserve safety [Figure: failure classes - crash failures are a subset of omission failures, and omission plus commission failures together make up the Byzantine failures] r = u = f: BFT r = 0: CFT Alberto Montresor (UniTN) DS - BFT 2018/12/06 76 / 80

UpRight UpRight Exposes the incremental cost of BFT Byzantine agreement: if r << u, BFT ≈ CFT in replication cost Allows richer design options Byzantine faults are rare: u > r Safety more critical than liveness: r > u Alberto Montresor (UniTN) DS - BFT 2018/12/06 77 / 80
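
As a rough, hedged consistency check (the 2u + r + 1 sizing rule below is my reading of the UpRight paper, not something stated on the slides), one can see how the node count shrinks toward crash-tolerant replication as r decreases:

def upright_nodes(u, r):
    # Assumed sizing rule for the agreement stage: n = 2u + r + 1 (check the paper).
    return 2 * u + r + 1

print(upright_nodes(u=2, r=2))   # u = r = f = 2 -> 7, the familiar 3f + 1
print(upright_nodes(u=2, r=0))   # r = 0        -> 5, the crash-tolerant 2u + 1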

UpRight Reality check UpRight 5 (Java; latest update Oct. 2009) ArchiStar-BFT 6 (Java; latest update May 2015) BFT-SMaRt 7 (Java; latest update Apr. 2016) 5 https://code.google.com/archive/p/upright/ 6 https://github.com/archistar/archistar-bft 7 http://bft-smart.github.io/library/ Alberto Montresor (UniTN) DS - BFT 2018/12/06 78 / 80

UpRight For (far in the) future lectures S. Gaertner, M. Bourennane, C. Kurtsiefer, A. Cabello, and H. Weinfurter. Experimental demonstration of a quantum protocol for byzantine agreement and liar detection. Physical Review Letters, 100(7), Feb. 2008 Alberto Montresor (UniTN) DS - BFT 2018/12/06 79 / 80

UpRight Reading material M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proc. of the 3 rd Symposium on Operating systems design and implementation, OSDI 99, pages 173 186, New Orleans, Louisiana, USA, 1999. USENIX Association. http://www.disi.unitn.it/~montreso/ds/papers/pbftosdi.pdf R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles, (SOSP 07), Stevenson, WA, Oct. 2007. ACM. http://www.disi.unitn.it/~montreso/ds/papers/zyzzyva.pdf A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX symposium on Networked systems design and implementation, NSDI 09, pages 153 168. USENIX Association, 2009. http://www.disi.unitn.it/~montreso/ds/papers/aardvark.pdf Alberto Montresor (UniTN) DS - BFT 2018/12/06 80 / 80