
Verteilte Systeme / Distributed Systems
Ch. 5: Various Distributed Algorithms
Holger Karl, Computer Networks Group, Universität Paderborn

Goal of this chapter
Apart from issues in distributed time and resulting questions such as global snapshots, many additional algorithmic questions arise. This chapter studies some typical problems in distributed systems and their algorithmic solutions:
- Leader election algorithms
- Mutual exclusion
- Global snapshot
- Distributed consensus

Overview
- Leader election
- Mutual exclusion
- Global snapshot
- Distributed consensus

Goal of leader election
Suppose a distributed system consists of a collection of homogeneous participants. How to pick one out of this group?
- Purpose: this one might take over certain duties or additional tasks, coordinate the other system participants, ...
- Break symmetry
- Crucial requirement: the choice is unambiguously known to all group members!
This is the leader election problem.

Leader election algorithms: assumptions
Basic assumptions
- Each participant has a unique identifier
- Goal: choose the member with the largest identifier as leader
- The set of all identifiers is unknown to the participants
Fault assumptions
- Processes may or may not fail, may behave in a hostile fashion
- Messages may or may not be lost, corrupted, ...
- Different algorithms can handle different fault assumptions
Time assumptions
- Synchronous time model: all processes operate in lock-step, bounded message transit time?
- Asynchronous model: no such bounds available?

Leader Election in a Synchronous Ring
Assumptions
- G is a ring consisting of n nodes
- Nodes are numbered 1 to n
- Nodes do not know their indices, nor those of their neighbors
- A node can distinguish its clockwise neighbor from its counterclockwise neighbor

Leader Election in a Synchronous Ring
Task: find an algorithm so that at the end of each execution exactly one node declares itself the leader.
Possible variations
- All other nodes additionally declare themselves non-leaders
- G is unidirectional or bidirectional
- n is known or not
- Processes can be identical or different (by UID)
- Is election possible with identical processes?

The LCR leader election algorithm
Simple algorithm by Le Lann, Chang, and Roberts (LCR).
Assumptions: G unidirectional, n unknown, only the leader performs output, each node knows its own unique identifier (UID).
Algorithm (informal)
- Each node sends its UID to its neighbor
- A received UID is compared to the largest UID seen so far
- If received UID < largest UID seen so far: discard it
- If received UID > largest UID seen so far: pass it on
- If received UID = own UID: claim leadership
Invariant: in every round, each node sends the largest UID seen so far to its neighbor.
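As an illustration, here is a minimal round-based simulation of LCR in Python (a sketch only: the ring-as-array encoding, the round loop, the function name, and the test UIDs are assumptions made for this example, not part of the slides):

    # Sketch: synchronous LCR on a unidirectional ring. Node i sends to node
    # (i + 1) % n; every round, each node forwards the largest UID seen so far.
    def lcr(uids):
        n = len(uids)
        largest = list(uids)                   # largest UID each node has seen
        leader = None
        for _ in range(n):                     # the max UID needs n hops to return
            sent = list(largest)               # messages sent this round
            for i in range(n):
                received = sent[(i - 1) % n]   # from the counterclockwise neighbor
                if received == uids[i]:
                    leader = uids[i]           # own UID came back: claim leadership
                elif received > largest[i]:
                    largest[i] = received      # remember it, pass it on next round
                # received < largest[i]: discard
        return leader

    assert lcr([3, 7, 2, 9, 4]) == 9

Every node sends once per round for n rounds, which makes the O(n^2) message complexity on the next slide directly visible.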

LCR leader election: complexity
- Proof of correctness: induction over the number of rounds
- Time complexity: O(n)
- Message complexity: O(n^2)
- Time complexity is acceptable, but many messages. Are algorithms with substantially fewer messages possible?

Leader Election in Arbitrary Graphs
FloodMax algorithm
- Assumption: the diameter d of the graph is known
- Every process sends in every round the largest UID seen so far to its neighbors
- After d rounds, the process that has not seen a UID greater than its own is the leader
Improvement
- A process only sends messages to its neighbors if it received a value larger than all previously seen ones
- After d rounds the winner is again determined
- Requires synchronization!
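A corresponding sketch of FloodMax in Python (illustrative assumptions: the graph is an adjacency dict keyed by UID, and the example five-cycle has diameter 2):

    # Sketch: FloodMax. Every round, each node sends the largest UID seen so far
    # to all neighbors; after d rounds, exactly the max-UID node still "wins".
    def floodmax(adj, diameter):
        largest = {u: u for u in adj}          # adj: UID -> list of neighbor UIDs
        for _ in range(diameter):
            inbox = {u: [largest[v] for v in adj[u]] for u in adj}
            for u in adj:
                largest[u] = max([largest[u]] + inbox[u])
        return [u for u in adj if largest[u] == u]   # nodes declaring leadership

    ring = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}
    assert floodmax(ring, diameter=2) == [5]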

Leader Election in Arbitrary Graphs
Is leader election possible even if neither the number of nodes nor the diameter of the graph is known? Yes!
- One possibility: search the whole graph. How? Hint: breadth-first search
- Or, as an intermediate step, first determine the diameter. How? Hint: breadth-first search
- See exercise sheet

Leader Election in Asynchronous Networks
Adaptation of the optimized FloodMax
- In the beginning, each process sends its UID to every neighbor
- When a process sees a UID greater than the greatest seen so far, it sends it to its neighbors
Properties
- Eventually, all processes will receive the largest UID
- But when to terminate? In the synchronous model this was simple by counting rounds, but here it is unclear!
- Would knowledge about the graph's diameter help?
- Different solutions are possible (spanning tree and the like), but they are more expensive than in the synchronous model

Spanning Tree
Spanning tree for the execution of broadcasts
- Spanning tree: partial graph that contains all nodes, but only the edges needed to form a tree
- Size and diameter of the graph are unknown
Algorithm (sketched in code below)
- The root node sends a search message to each neighbor
- Processes that receive a search message for the first time: mark themselves as part of the tree, set the sending node as father node in the tree, and send a search message to each neighbor (except the father)
- Already marked nodes ignore search messages
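A sketch of the flooding construction in Python (assumptions: the graph is an adjacency dict, and a FIFO queue stands in for one possible asynchronous message schedule; this particular schedule happens to yield the breadth-first tree mentioned on the next slide):

    # Sketch: build a spanning tree by flooding "search" messages from a root.
    from collections import deque

    def spanning_tree(adj, root):
        parent = {root: None}                          # marked nodes -> father
        pending = deque((root, v) for v in adj[root])  # search messages in transit
        while pending:                                 # terminates: no messages left
            sender, receiver = pending.popleft()
            if receiver in parent:
                continue                               # already marked: ignore
            parent[receiver] = sender                  # sender becomes the father
            pending.extend((receiver, w) for w in adj[receiver] if w != sender)
        return parent

    g = {'a': ['b', 'c'], 'b': ['a', 'c', 'd'], 'c': ['a', 'b'], 'd': ['b']}
    assert spanning_tree(g, 'a') == {'a': None, 'b': 'a', 'c': 'a', 'd': 'b'}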

Spanning Tree: Properties
- The algorithm terminates when no search messages are on the way (how can this be detected?)
- The algorithm creates a spanning tree
- In a synchronous network, this algorithm even creates a breadth-first spanning tree!
- Broadcast: piggyback the message on the search messages
- Child pointers can be determined from reflected search messages
- Convergecast: leaves send information along the tree to the root
- Useful for distributed termination detection, e.g., with a leader election: each node starts a broadcast/convergecast

Bully algorithm
Assumptions
- All nodes already know the unique IDs of all other nodes
- The leader choice itself is thus trivial, but nodes, including the coordinator, may fail
Algorithm
- Once a node suspects the coordinator of having failed, it sends an ELECTION message to all nodes with a larger ID
- If the initiator gets no answer at all, it becomes the new coordinator
- If the initiator gets an answer from one of these nodes, that node takes over the election
- How to handle multiple answering nodes? They recursively become initiators themselves, until one node no longer gets any answers
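A compressed sketch of one election in Python (illustrative: failures are modeled by a set of crashed IDs and the message exchange is folded into a recursion; a real implementation needs timeouts to detect "no answer"):

    # Sketch: bully election. The initiator asks all higher-ID nodes; if none
    # answers, it wins; otherwise an answering node carries the election on.
    def bully_election(ids, crashed, initiator):
        higher_alive = [i for i in ids if i > initiator and i not in crashed]
        if not higher_alive:
            return initiator               # nobody answered: new coordinator
        # follow one answering node; the recursion ends at the highest alive node
        return bully_election(ids, crashed, min(higher_alive))

    assert bully_election(range(8), crashed={7}, initiator=4) == 6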

Example: bully algorithm
[Figure: eight nodes 0-7. Node 4 suspects coordinator 7 has failed and sends ELECTION to the nodes with larger IDs; nodes 5 and 6 answer OK and start their own elections; node 6 receives no answer from 7 and announces itself as the new coordinator.]

Overview
- Leader election
- Mutual exclusion
- Global snapshot
- Distributed consensus

Mutual exclusion in distributed systems
Problem of mutual exclusion: when processes execute concurrently, there may be crucial portions of code that may only be executed by at most one process at any one time.
- This piece (or these pieces) of code forms a so-called critical region
- In non-distributed systems: semaphores protect such critical regions
- But this does not directly carry over to distributed systems!
Options
- Centralized algorithm
- Distributed algorithm
- Token-ring-based algorithm

A centralized algorithm for mutual exclusion
- Run a leader election algorithm to determine a coordinator for a critical region, known to everybody
- The coordinator holds a token for the critical region
- A node that wants to enter the region sends a message to the coordinator
- If the coordinator owns the token, it sends it; otherwise it puts the request into a queue
- After leaving the critical region, a node sends the token back to the coordinator
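A sketch of the coordinator in Python (illustrative: the three message types are condensed into method calls, and the class and method names are made up for this example):

    # Sketch: coordinator-based mutual exclusion with a FIFO request queue.
    from collections import deque

    class Coordinator:
        def __init__(self):
            self.has_token = True
            self.queue = deque()              # blocked requesters, served in order

        def request(self, process):           # "request token" message
            if self.has_token:
                self.has_token = False
                return True                   # "grant token" message
            self.queue.append(process)
            return False                      # no reply yet: requester blocks

        def release(self):                    # "release token" message
            if self.queue:
                return self.queue.popleft()   # grant straight to the next waiter
            self.has_token = True
            return None

    c = Coordinator()
    assert c.request('p1') is True            # p1 enters the critical region
    assert c.request('p2') is False           # p2 is queued
    assert c.release() == 'p2'                # p1 leaves; token goes to p2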

Example: mutual exclusion server
[Figure: a server holding the token and a queue of requests, with processes p1-p4. Messages: 1. request token, 2. grant token, 3. request token (queued), 4. request token (queued), 5. release token, 6. grant token to the next requester in the queue.]

Properties
+ Mutual exclusion is achieved
+ Fair: requests are served in order
+ Easy to implement
+ Per access to the critical region, only three messages are required
- The coordinator is a single point of failure
- When a requester is blocked, it is impossible to distinguish between a failed coordinator and a long queue
- The coordinator becomes a performance bottleneck in large systems, in particular when serving more than one critical region

Distributed mutual exclusion
How to achieve mutual exclusion without a coordinator?
- All processes use multicast
- All processes have a logical clock
- When trying to enter the critical region: send a request to all other nodes
- All other nodes have to agree to such a request before a node may enter the critical region

Example: distributed mutual exclusion (1)
[Figure: p1 multicasts its request with timestamp 8, p3 multicasts its request with timestamp 12. Processes p1 and p3 both want to enter the critical region; p1 has the smallest timestamp and wins.]

Example: distributed mutual exclusion (2)
[Figure: p2 replies OK to both requests, p3 replies OK to p1. Process p1 has the smallest timestamp, receives all replies, and enters the critical region.]

Example: distributed mutual exclusion (3)
[Figure: once process p1 is done, it leaves the critical region and sends OK to p3; p3 then enters the critical region.]
Any problems if these messages are not delivered in atomic order?

Algorithm (Ricart and Agrawala, 1981)

    On initialization:
        state := RELEASED;

    To enter the critical section:
        state := WANTED;
        multicast request to all processes;
        T := request's timestamp;
        wait until (number of replies received = (N - 1));
        state := HELD;

    On receipt of a request <T_i, p_i> at p_j (i != j):
        if (state = HELD or (state = WANTED and (T, p_j) < (T_i, p_i))) then
            queue request from p_i without replying;
        else
            reply immediately to p_i;
        end if

    To exit the critical section:
        state := RELEASED;
        reply to all queued requests;
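The comparison (T, p_j) < (T_i, p_i) is a Lamport timestamp with the process ID as tie-breaker, which maps directly onto Python's tuple ordering. A small sketch of the receive rule as a pure decision function (names illustrative):

    # Sketch: should p_j defer (queue) an incoming request, or reply at once?
    # Requests are (lamport_timestamp, process_id) pairs.
    def defer_request(state, own_request, incoming_request):
        return state == 'HELD' or (state == 'WANTED'
                                   and own_request < incoming_request)

    assert defer_request('WANTED', (8, 'p1'), (12, 'p3')) is True    # p1's request wins
    assert defer_request('WANTED', (12, 'p3'), (8, 'p1')) is False   # reply to p1
    assert defer_request('RELEASED', None, (8, 'p1')) is False       # reply immediately

This matches the example above: with timestamps 8 and 12, p1's earlier request causes p3's request to be queued until p1 exits.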

Properties of distributed mutual exclusion
- In this simple form, each node becomes a single point of failure: N of them, instead of just one
- Can be overcome by additional protocol mechanisms, e.g., a more powerful group communication protocol with terminating reliable multicast
- Each process is involved in the decision about access to the critical region, even if it is not interested; possible improvement: a simple majority suffices
- In total: slower, more complicated, more expensive, and less robust
- "Finally, like eating spinach and learning Latin in high school, some things are said to be good for you in some abstract way." (Tanenbaum)

Comparison of mutual exclusion algorithms

    Algorithm     Messages per entry/exit   Delay before entry (in message times)   Problems
    Centralized   3                         2                                       Coordinator crash, bottleneck
    Distributed   2(n-1)                    2(n-1)                                  Crash of any process
    Token ring    1 to ∞                    0 to n-1                                Lost token, process crash

Overview
- Leader election
- Mutual exclusion
- Global snapshot
- Distributed consensus

Applications of logical time: global states
How to find out whether a given property holds for a distributed system in its current state?
- Distributed garbage collection: are there still pointers to an object around? Danger: pointers to objects may currently be in transit between processes!
- Distributed deadlock detection: look for a cycle in the wait-for graph
- Distributed termination detection: the problem, again, is requests in transit
- Distributed debugging
- Deciding whether to deliver group view changes
[Figure, three scenarios with processes p1 and p2: a garbage object with an object reference still in transit; a wait-for cycle between p1 and p2; two passive processes with an activation message in transit.]

Snapshot algorithm to determine the global state
Goal: provide an algorithm to compute the global state of a distributed system (snapshot algorithm, Chandy & Lamport, 1985).
Assumptions
- Neither channels nor processes fail; messages arrive intact, exactly once
- Channels are unidirectional, pair-wise, FIFO
- The graph of processes and channels is strongly connected
- Any process may initiate a global snapshot at any time
- Processes may continue execution and send/receive normal messages while the snapshot takes place

Snapshot algorithm: core ideas
Each process records
- its own state
- for each incoming channel, the messages that arrived via this channel after the receiver recorded its state but were sent before the sender recorded its state
This accounts for messages that were transmitted but not yet received when the processes recorded their states at different points in time: channels also have state, namely the messages sent but not yet received.
The algorithm uses marker messages, which
- prompt the receiver to record its own state
- determine which messages are included in the channel state
Start of the algorithm: the initiator behaves as if it had received a marker (over a fictitious channel).

Snapshot algorithm: process model
[Figure: processes connected by unidirectional FIFO channels.]

Snapshot algorithm

    Marker receiving rule for process p_i
        On p_i's receipt of a marker message over channel c:
            if (p_i has not yet recorded its state)
                it records its process state now;
                records the state of c as the empty set;
                turns on recording of messages arriving over other incoming channels;
            else
                p_i records the state of c as the set of messages
                it has received over c since it saved its state.
            end if

    Marker sending rule for process p_i
        After p_i has recorded its state, for each outgoing channel c:
            p_i sends one marker message over c
            (before it sends any other message over c).
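A sketch of these two rules for a single process in Python (illustrative event-driven style; the channel IDs, the MARKER sentinel, and the omitted marker transport on outgoing channels are assumptions of this sketch, not part of the pseudocode above):

    # Sketch: Chandy/Lamport marker rules, state kept per process.
    MARKER = object()

    class SnapshotProcess:
        def __init__(self, incoming_channels):
            self.incoming = set(incoming_channels)
            self.recorded_state = None    # own process state, once recorded
            self.channel_state = {}       # channel -> messages recorded for it
            self.recording = set()        # channels still being recorded

        def start_snapshot(self, local_state):
            # initiator behaves as if a marker arrived over a fictitious channel;
            # the marker sending rule (one marker per outgoing channel, before
            # any other message) would fire here and is omitted in this sketch
            self.recorded_state = local_state
            self.recording = set(self.incoming)

        def on_message(self, channel, msg, local_state):
            if msg is MARKER:
                if self.recorded_state is None:             # first marker seen
                    self.start_snapshot(local_state)
                self.channel_state.setdefault(channel, [])  # c's state, maybe empty
                self.recording.discard(channel)             # stop recording on c
            elif channel in self.recording:                 # normal message sent
                self.channel_state.setdefault(channel, []).append(msg)  # before marker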

Snapshot algorithm: some steps
b) Process Q receives a marker for the first time and records its local state
c) Q records all incoming messages
d) Q receives a marker on another incoming channel and stops recording further messages on this channel

Snapshot example
[Figure, four steps with processes P1, P2, P3, marker messages M, and channel states C_ij for the channel from P_i to P_j: P1 records State1, sends markers to P2 and P3, and starts recording C21 and C31. On receiving their first marker, P2 and P3 record State2 and State3, record the marker's channel as empty (e.g., C12 = {}), and send markers on their outgoing channels. As the remaining markers arrive, each process finalizes the states of its other incoming channels (C21, C31, C32, C13, C23), completing the snapshot.]

Snapshot algorithm: example 2, initial state
[Figure: two processes connected by channels c1 (p2 to p1) and c2 (p1 to p2). Initial state of p1: account $1000, no widgets; initial state of p2: account $50, 2000 widgets.]

Snapshot algorithm: example 2, progress (M = marker message)

    1. Global state S0:  p1 = <$1000, 0>   c2 = (empty)               p2 = <$50, 2000>   c1 = (empty)
    2. Global state S1:  p1 = <$900, 0>    c2 = (Order 10, $100), M   p2 = <$50, 2000>   c1 = (empty)
    3. Global state S2:  p1 = <$900, 0>    c2 = (Order 10, $100), M   p2 = <$50, 1995>   c1 = (five widgets)
    4. Global state S3:  p1 = <$900, 5>    c2 = (Order 10, $100)      p2 = <$50, 1995>   c1 = (empty)

Resulting recorded state: p1 = <$1000, 0>; p2 = <$50, 1995>; c1 = <five widgets>; c2 = <>

Snapshot summary
- It is impossible to record a global state simultaneously
- The Chandy/Lamport algorithm provides a way of taking a snapshot in a distributed fashion
- A global state exactly equal to this snapshot might never have occurred in this way in the actual system

Overview
- Leader election
- Mutual exclusion
- Global snapshot
- Distributed consensus

Consensus
Problem statement
- Each process starts the algorithm with an initial value
- Outputs are required to be the same: all processes must decide on some output value
- Often, a particular value is seen as preferable; different criteria for the validity of a solution exist
Areas of application
- Status of a transaction, equivalence of sensor data, failure detection, ...
Consensus is simple to solve in the absence of failures. With failures? Communication/link failures? Process failures?
Time model here: synchronous model!

Consensus with link failures: model
Coordinated attack
- Two armies form up for an attack against each other
- One army is split into two parts that have to attack together; alone they will lose
- The commanders of the parts communicate via messengers, who can be captured
- Which rules shall the commanders use to agree on an attack date?
- The problem was originally developed to model database (commit) coordination: how to coordinate?

Consensus with link failures
Consensus is impossible to solve under these conditions!
Consider a graph G with two nodes connected by a bidirectional edge; each node can decide for 0 or 1 (e.g., abort or commit).
Requirements for a solution algorithm
- Agreement: both processes decide on the same value
- Validity: if both processes start with 0, then 0 is the only possible decision value; if both processes start with 1 and all messages are delivered, then 1 is the only possible decision value
- Termination: both processes eventually decide
These are very weak requirements, intended only to exclude trivial solutions: just always outputting 0 is not a valid solution algorithm. But the problem is not solvable even under such relaxed requirements!

Proof by contradiction (sketch)
Assumption: an algorithm for coordinated attack in G exists. Without loss of generality, both nodes send messages in every round.
- Let α be the execution in which both nodes start with value 1 and all messages are delivered. Both processes decide 1 in α, say after r rounds.
- Let α1 be the same execution as α, except that all messages after round r are lost. Both processes still decide 1 in α1, after r rounds.
- Let α2 be the same execution as α1, except that the last message from node 1 to node 2 is lost. Then α1 ~1 α2 (i.e., executions α1 and α2 are indistinguishable for process 1). Since process 1 decides 1 in α1, it also decides 1 in α2. By the termination and agreement properties, process 2 also decides 1 in α2 (possibly after some further state transitions).

Proof by contradiction (sketch, continued)
- Let α3 be the same execution as α2, except that the last message from node 2 to node 1 is lost. Then α2 ~2 α3; since process 2 decides 1 in α2, it also decides 1 in α3. By the termination and agreement properties, process 1 also decides 1 in α3.
- Continue to remove messages in this way until reaching an execution α' in which no messages are delivered at all.
- Both processes are forced to decide 1 in α'!

Proof by contradiction (sketch, continued)
- Now consider the execution α'' in which process 1 starts with 1, process 2 starts with 0, and no messages are delivered. Then α' ~1 α'', hence process 1 decides 1 in α''; by termination and agreement, process 2 also decides 1 in α''.
- Finally, in the execution α''' both processes start with 0 and no messages are delivered. Then α'' ~2 α''', so process 2 decides 1 in α''' as well (because no difference is noticeable to it!).
- But this is a contradiction: the validity condition requires that both processes decide 0!

Consensus with link failures: a way out?
- The proof is easily extended to n nodes: consensus with link failures is not solvable by deterministic algorithms, even under weak assumptions
- Possible way out: randomized algorithms, which give a correct result with high probability
- Idea: an opponent (who may destroy messages) tries to mislead the algorithm into a false decision
- Modified agreement condition: for each opponent, Pr[some nodes decide 0, others 1] < ε
- ε can depend, e.g., on the number of rounds

Consensus with process failures
Assumptions
- Communication is error-free
- Processes stop (crash), or processes show Byzantine (arbitrary) behavior
- Many other assumptions are possible
- In particular: processes can fail after having sent only a part of a message. Example: process B sends a broadcast to processes A and C, but fails after the message has reached A but not C

Consensus with process failures
Correctness conditions
- Agreement: no two processes decide on different values
- Validity: if all processes start with the same value, then this value is the only valid decision (sometimes also: the initial value of each process is a valid final value)
- Termination: all nonfaulty processes eventually decide (why is it necessary to restrict this to nonfaulty processes?)
Fault model: at most f processes will fail

Consensus with failing processes: the FloodSet algorithm
- Each process stores a subset W of the valid values V
- Initially, W contains only the value that the respective process proposes
- In every round, each process broadcasts W to all participants; all received elements are added to W
- After f + 1 rounds, each process decides: if W has only one element, that is the result; otherwise a default value is chosen (e.g., the smallest possible value v0)
- Validity makes no requirements for the case of more than a single initial value!
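A synchronous-round simulation of FloodSet in Python (a sketch; the crash_plan encoding of "fails while broadcasting" and the default value are assumptions made for this example):

    # Sketch: FloodSet with crash failures. crash_plan[r][p] is the set of
    # processes that p's round-r broadcast still reaches before p crashes.
    def floodset(initial, f, crash_plan, default=0):
        W = [{v} for v in initial]                # each process's set of seen values
        alive = set(range(len(initial)))
        for r in range(f + 1):
            sent = [set(w) for w in W]            # broadcasts carry round-start W
            for p in alive:
                for q in crash_plan.get(r, {}).get(p, alive):
                    W[q] |= sent[p]
            alive -= set(crash_plan.get(r, {}))   # crashed processes drop out
        # after f + 1 rounds, every surviving process decides
        return {p: next(iter(W[p])) if len(W[p]) == 1 else default
                for p in sorted(alive)}

    # process 0 proposes 0, but its round-0 broadcast reaches only process 1:
    assert floodset([0, 1, 1], f=1, crash_plan={0: {0: {1}}}) == {1: 0, 2: 0}

With a single round instead of f + 1, process 1 would hold {0, 1} and fall back to the default 0 while process 2 would still hold {1} and decide 1; the second round is what restores agreement.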

Consensus with failing processes
Why f + 1 rounds?
- At most f processes may fail; with f + 1 rounds, there is at least one round in which no process fails, and in this round consensus is established
- f + 1 rounds, O((f + 1) n^2) messages
- Optimizations are possible (exponential information gathering algorithms, which also work for other failure models)
- Why not decide after two rounds? See exercise sheet. Hint: a process may fail after sending to only part of the group

Consensus with Byzantine processes
Correctness conditions
- Agreement: no two nonfaulty processes decide on different values
- Validity: if all nonfaulty processes start with the same value, then this value is the only valid decision
- Termination: all nonfaulty processes eventually decide
- Remark: the conditions are restricted to nonfaulty processes because conditions on the behavior of Byzantine processes make no sense
n > 3f is required (triple modular redundancy is not sufficient!)
- For merely failing (crashing) processes, no such limit exists
- An algorithm for Byzantine processes does not, in general, solve the consensus problem for failing processes. Reason: in the version for failing processes, all processes that decide have to agree, even processes that fail after they have decided!

n = 3, f = 1 does not solve Byzantine agreement
Example algorithm (no failure)
- Round 1: everybody sends its own value to each neighbor
- Round 2: everybody sends the value received from each neighbor to the other neighbor
[Figure: three processes A, B, C with initial values A = 1, B = 0, C = 1 exchanging values in round 1 and relaying them ("B said 0", ...) in round 2.]

n = 3, f = 1 does not solve Byzantine agreement
Execution α1: C is faulty
[Figure: A and B start with 1; the faulty C sends inconsistent values in round 1 and lies about relayed values ("B said 0" / "B said 1") in round 2.]
A and B decide 1 because of validity.

n = 3, f = 1 does not solve Byzantine agreement
Execution α2: A is faulty
[Figure: B and C start with 0; the faulty A sends inconsistent values in round 1 and lies about relayed values in round 2.]
B and C decide 0 because of validity.

n = 3, f = 1 does not solve Byzantine agreement
Execution α3: B is faulty
[Figure: A starts with 1, C starts with 0; the faulty B tells A that it has 1 and tells C that it has 0.]
A and C would have to decide for the same value because of agreement.

n = 3, f = 1 does not solve Byzantine agreement
- Thus α1 ~1 α3: process A decides 1 in α1 (because of the validity condition) and therefore also in α3
- And α2 ~3 α3: process C decides 0 in α2 and therefore also in α3
- Processes A and C violate the agreement condition in α3
- Hence this algorithm cannot solve Byzantine agreement for n = 3f
- Because no statement was made about how decisions are reached, this holds for all algorithms with this communication structure
- The argument can be extended to a proof for arbitrary algorithms with n = 3f (with arbitrary messages and rounds)

Byzantine agreement for n > 4f with 2(f+1) rounds

Example: n = 5, f = 1, 2(1+1) rounds, process 2 is the traitor (its arbitrary behavior is marked %)

    Processor            1      2      3      4      5
    Initial value        7      %      3      3      2
    Initial preference   7----  %      --3--  ---3-  ----2
    Round 1
      Preferences        77332  %      72332  73332  72332
      Majority           7      %      2      3      2
      Multiplicity       2      %      2      3      2
    Round 2 (king majority: 7)
      Preference         77332  %      72732  73372  72337
    Round 3
      Preference         72777  %      75777  73777  73777
      Majority           7      %      7      7      7
      Multiplicity       4      %      4      4      4
    Round 4 (king majority: 1, 2, 3, 4 -- the traitor king sends different
             values, which are ignored since each correct processor's
             multiplicity exceeds n/2 + f)
      Preference         72777  %      75777  73777  73777
    Decision             7      %      7      7      7
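A simulation sketch of this phase-king scheme in Python (assumptions: the threshold test mult > n/2 + f and the rotating king follow the standard phase-king formulation; the traitor model, the value domain, and all names are illustrative):

    # Sketch: phase-king Byzantine agreement for n > 4f in 2(f+1) rounds.
    import random
    from collections import Counter

    def phase_king(initial, f, byzantine):
        n = len(initial)
        assert n > 4 * f
        pref = dict(initial)                        # process id -> preference
        good = [p for p in pref if p not in byzantine]
        for phase in range(f + 1):
            # round 1: everybody broadcasts its preference; traitors may send
            # different arbitrary values to different receivers
            maj, mult = {}, {}
            for p in good:
                heard = [pref[q] if q not in byzantine else random.choice([2, 3, 7])
                         for q in pref]
                (maj[p], mult[p]), = Counter(heard).most_common(1)
            # round 2: the phase king broadcasts its majority value
            king = sorted(pref)[phase]              # f+1 phases, f+1 distinct kings
            for p in good:
                king_value = maj.get(king, random.choice([2, 3, 7]))  # faulty king lies
                pref[p] = maj[p] if mult[p] > n / 2 + f else king_value
        return {p: pref[p] for p in good}

    # the slide's setting: n = 5, f = 1, process 2 is the traitor
    decisions = phase_king({1: 7, 2: 7, 3: 3, 4: 3, 5: 2}, f=1, byzantine={2})
    assert len(set(decisions.values())) == 1        # all correct processes agree

Since at most f processes are faulty and there are f + 1 phases, at least one phase has a correct king; after that phase all correct processes hold the same preference, and the multiplicity test ensures that later (possibly faulty) kings cannot break the agreement.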

Powerful algorithms for Byzantine agreement: EIG
To achieve the bound n > 3f for Byzantine agreement, better algorithms exist.
A popular one: Exponential Information Gathering (EIG)
- Each node builds a tree of all the information all other nodes have gathered in all previous rounds
- Complex information exchange among the nodes
- Relatively complex decision rules

Conclusions
- Both leader election and mutual exclusion are building blocks for more complex distributed functions
- Both algorithms entail non-trivial, but usually acceptable, overhead
- Consensus is even worse: even under mild reality assumptions it becomes an unsolvable problem, and this was only assuming synchronous time!
- Full distribution needs to be considered carefully, as it entails substantial costs as well