Distributed Systems Leader election & Failure detection

Distributed Systems Leader election & Failure detection He Sun School of Informatics University of Edinburgh

Leader of a computation. Many distributed computations need a coordinator or server process, e.g. a central server for mutual exclusion, initiating a distributed computation, or computing the sum/max using an aggregation tree. We may need to elect a leader at the start of the computation: in every admissible execution, exactly one processor enters an elected state. We may also need to elect a new leader if the current leader of the computation fails.

Leader election in ring networks Initial state (all not-elected) Final state leader

Why study rings? Simple starting point, easy to analyze Lower bounds and impossibility results for ring topology also apply to arbitrary topologies

LE algorithms in rings depend on: anonymous vs. non-anonymous rings; whether the size n of the network is known (non-uniform) or not known (uniform); synchronous vs. asynchronous algorithms.

LE in Anonymous Rings Every processor runs the same algorithm Every processor does exactly the same execution

Impossibility for Anonymous Rings. Theorem: There is no leader election algorithm for anonymous rings, even if the algorithm knows the ring size (non-uniform) and the model is synchronous. Proof sketch (for non-uniform and synchronous rings): Every processor begins in the same state (not-elected) with the same outgoing messages (since all are anonymous). Every processor receives the same messages, does the same state transition, and sends the same messages in round 1; the same holds for rounds 2, 3, and so on. Eventually some processor is supposed to enter an elected state, but then they all would.
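The symmetry argument can be made concrete with a toy round-based simulation (a sketch under the theorem's synchronous, deterministic assumptions; the transition function below is an arbitrary stand-in for "any algorithm"):

```python
# Anonymous ring: every node starts in the same state and runs the same
# deterministic code, so all node states stay identical in every round,
# and no node can ever become the unique leader.

def step(state, left_msg, right_msg):
    # Any deterministic function of (state, received messages) works here;
    # this particular one is a made-up example.
    return (state + left_msg + right_msg) % 97

n = 6
states = [0] * n                    # anonymous: identical initial states
for _ in range(10):                 # run 10 synchronous rounds
    msgs = [s % 7 for s in states]  # everyone sends the same function of its state
    states = [step(states[i], msgs[(i - 1) % n], msgs[(i + 1) % n])
              for i in range(n)]
    assert len(set(states)) == 1    # states remain identical every round
```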

Initial state (all not-elected). Final state (leader). If one node is elected leader, then every node is elected leader.

Impossibility for Anonymous Rings Since the theorem was proven for non-uniform and synchronous rings, the same result holds for weaker models: - uniform - asynchronous

Rings with Identifiers (non-anonymous). Assume each processor has a unique ID. Do not confuse indices and IDs: indices are 0 to n-1 and are used only for analysis, not available to the processors; IDs are arbitrary nonnegative integers and are available to the processors.

Overview of LE in Rings with IDs. There exist algorithms when nodes have unique IDs. We will evaluate them according to their message complexity. Best results: asynchronous rings, Θ(n log n) messages; synchronous rings, Θ(n) messages. All bounds are asymptotically tight!

Node with the highest identifier. If all nodes know the highest ID (say n), we do not need an election: everyone assumes n is the leader, and n starts operating as the leader. But what if n fails? We cannot assume n-1 is the leader, since n-1 may have failed too! Our strategy: the surviving node with the highest ID is the leader. We need an algorithm that finds the working node with the highest ID.

One strategy: use an aggregation tree. Suppose node r detects that the leader has failed and initiates leader election (say r = 4). Node r creates a BFS tree and asks for the maximum node ID to be computed via aggregation: each node receives ID values from its children, computes the max of its own ID and the received IDs, and forwards the result to its parent. This needs a tree construction. If n nodes start the election, we will need n trees: O(n^2) communication and O(n) storage per node. Can we do better?
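The aggregation (convergecast) step can be sketched as follows (a minimal illustration, assuming the BFS tree has already been constructed; the tree shape and the IDs are made up):

```python
# Max-aggregation over a rooted tree: each node takes the max of its own ID
# and its children's results and passes it up to its parent.

def aggregate_max(tree, ids, root):
    # tree: node -> list of children; ids: node -> unique ID
    def up(v):
        return max([ids[v]] + [up(c) for c in tree.get(v, [])])
    return up(root)

tree = {4: [7, 2], 7: [8, 6], 2: [5, 3]}   # hypothetical BFS tree rooted at node 4
ids = {v: v for v in [4, 7, 2, 8, 6, 5, 3]}
print(aggregate_max(tree, ids, 4))          # -> 8
```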

Chang-Roberts algorithm: High-level ideas. Suppose the network is a ring. We assume that each node has two pointers to nodes it knows about, next and previous (like a circular doubly linked list); the actual network may not be a ring. Every node sends max(own ID, received ID) to the next node. If a processor receives its own ID, it is the leader. The algorithm is uniform: the number of processors does not need to be known to the algorithm.

Chang-Roberts algorithm: example. Basic idea: suppose node 6 starts the election. It sends 6 to 6.next, i.e. node 2; node 2 takes max(2, 6) and sends the result to 2.next; node 8 takes max(8, 6) and sends the result to 8.next; etc.

Chang-Roberts algorithm: example. The value 8 goes around the ring and comes back to node 8. Then 8 knows that 8 is the highest ID, since any higher ID would have stopped the message. 8 declares itself the leader: it sends a leader message around the ring.

Chang-Roberts algorithm: final step. If node p receives an election message m with m.id = p.id, then p declares itself leader: it sets p.leader = p.id and sends a leader message with p.id to p.next. Any other node q receiving the leader message sets q.leader = p.id and forwards the leader message to q.next.
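A compact simulation of the whole algorithm on a unidirectional ring (a sketch with synchronous message delivery, not the lecture's exact pseudocode; the ring of IDs is hypothetical):

```python
# Chang-Roberts on a unidirectional ring: every node initially sends its own
# ID; each node forwards max(own ID, received ID) and swallows smaller IDs;
# a node that receives its own ID back is the leader.

def chang_roberts(ids):
    n = len(ids)
    msgs = list(ids)               # msgs[i]: message in transit from node i
    sent = n                       # every node's initial send counts
    leader = None
    while leader is None:
        nxt = [None] * n
        for i, m in enumerate(msgs):
            if m is None:
                continue
            j = (i + 1) % n        # deliver to the next node in the ring
            if m == ids[j]:
                leader = ids[j]    # own ID came back: j is the leader
            elif m > ids[j]:
                nxt[j] = m         # node j forwards the larger ID
                sent += 1
            # m < ids[j]: message is swallowed
        msgs = nxt
    return leader, sent

leader, sent = chang_roberts([3, 2, 8, 4, 5, 6])
print(leader)   # -> 8
```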

Chang-Roberts algorithm: discussion. Works in an asynchronous system. Correctness: it elects the processor with the largest ID, since the message containing that ID passes through every processor. Message complexity: O(n^2). When does the worst case occur? The worst arrangement of IDs is decreasing order: the 2nd largest ID causes n-1 messages, the 3rd largest ID causes n-2 messages, etc. Total messages = n + (n-1) + (n-2) + ... + 1 = O(n^2).
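The worst case can be checked by counting directly how far each ID's message travels before a larger ID swallows it (a small sketch; the IDs are placed in decreasing order around the ring):

```python
# Each ID's message is forwarded clockwise until it reaches a node with a
# larger ID; the maximum travels the full ring. With IDs in decreasing
# order this gives n + (n-1) + ... + 1 messages.

def cr_messages(ids):
    n = len(ids)
    total = 0
    for i, v in enumerate(ids):
        d = 1
        while ids[(i + d) % n] < v:   # message keeps being forwarded
            d += 1
        total += d
    return total

n = 8
ids = list(range(n, 0, -1))           # decreasing order: worst case
print(cr_messages(ids))               # -> 36, i.e. n(n+1)/2 for n = 8
```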

Average case analysis of Chang-Roberts algorithm. Theorem: The average message complexity of the Chang-Roberts algorithm is O(n log n). Assume that all rings appear with equal probability; for the proof, assume the IDs are 1, 2, ..., n. Let P(i, k) be the probability that ID i's message makes exactly k steps: the first k-1 neighbors of i must have IDs < i, and the k-th neighbor must have an ID > i, so P(i, k) = [C(i-1, k-1) / C(n-1, k-1)] * (n-i)/(n-k). Hence the expected total number of messages is n + sum over i and k of k * P(i, k) ≈ 0.69 n log n, where the leading n is contributed by the maximum ID, whose message travels the whole ring.
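Under the stated assumptions (the IDs form a random permutation of 1..n), the expectation can be evaluated numerically; the form of P(i, k) used below is a reconstruction from the stated neighbor probabilities, not guaranteed to match the slide's original notation:

```python
import math

def P(n, i, k):
    # Prob. that ID i's message makes exactly k steps: the first k-1
    # neighbours have IDs < i and the k-th neighbour has an ID > i.
    p = 1.0
    for j in range(1, k):
        p *= (i - j) / (n - j)
    return p * (n - i) / (n - k)

n = 64
# The maximum ID n always travels the full ring, contributing the leading n.
expected = n + sum(k * P(n, i, k) for i in range(1, n) for k in range(1, n))
print(round(expected, 1), round(0.69 * n * math.log2(n), 1))  # 303.6 vs 265.0
```

The two values share the leading Θ(n log n) term; the gap is the lower-order O(n) term (the exact expectation works out to n times the n-th harmonic number).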

Can we use fewer messages? The O(n^2) algorithm is simple and works in both the synchronous and the asynchronous model. But can we solve the problem with fewer messages? Idea: eliminate messages carrying smaller IDs quickly, so that only a few messages travel a long distance in the ring.

Hirschberg-Sinclair algorithm. Assume all nodes want to know the leader. The k-neighborhood of node p consists of the nodes within distance k of p in the ring. How does p send a message to distance k? The message has a time-to-live (TTL) variable: each node decrements m.ttl on receiving the message, and if m.ttl = 0 the message is not forwarded any more.

Hirschberg-Sinclair algorithm (1). The algorithm operates in phases. In phase 0, node p sends an election message m to both p.next and p.previous with m.id = p.id and TTL = 1. Suppose q receives this message: it sets m.ttl = 0; if q.id > m.id, it does nothing (the message is swallowed); if q.id < m.id, it returns the message to p as a reply. If p gets back both replies, it declares itself leader of its 1-neighborhood and proceeds to the next phase.

Hirschberg-Sinclair algorithm (1). Phase 0: send (ID, current phase, step counter) to the 1-neighborhood.

Hirschberg-Sinclair algorithm (1). If the received ID > current ID, then send a reply (ok).

Hirschberg-Sinclair algorithm (1). If a node receives both replies, then it becomes a temporary leader and proceeds to the next phase.

Hirschberg-Sinclair algorithm (2). The algorithm operates in phases. In phase i, node p sends an election message m to both p.next and p.previous with m.id = p.id and TTL = 2^i. Suppose q receives this message (from next/previous): it sets m.ttl = m.ttl - 1. If q.id > m.id, it does nothing (the message is swallowed). Otherwise, if m.ttl = 0 it returns the message to the sending process; else it forwards the message on to its previous/next neighbor. If p gets back both messages, it is the leader of its 2^i-neighborhood and proceeds to phase i+1.

Hirschberg-Sinclair algorithm. When 2^i >= n/2, only one processor survives, and the leader is found. Number of phases: O(log n). Question: What is the message complexity?

H&S algorithm: Message complexity. Question: What is the message complexity? In phase i, at most one node in any sequence of 2^(i-1)+1 consecutive nodes initiates messages, so there are at most n/(2^(i-1)+1) candidates. Each candidate sends messages in both directions, travelling a distance of at most 2^i and back, i.e. at most 4 * 2^i messages per candidate. This gives O(n) messages in phase i, and there are O(log n) phases: a total of O(n log n) messages.
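A sketch simulation of the algorithm (synchronous phases, computed globally rather than hop-by-hop) that counts election and reply messages and illustrates the bound; the ring of IDs is made up:

```python
import math

# Hirschberg-Sinclair: in phase i each surviving candidate probes distance
# 2^i in both directions; a probe is swallowed by any larger ID, otherwise
# an ok-reply travels back. A candidate surviving both probes continues.

def hirschberg_sinclair(ids):
    n = len(ids)
    active = list(range(n))           # positions still competing
    msgs = 0
    phase = 0
    while len(active) > 1:
        dist = 2 ** phase
        survivors = []
        for p in active:
            won = True
            for step in (1, -1):      # probe both directions
                swallowed = False
                for d in range(1, dist + 1):
                    msgs += 1         # one hop of the election message
                    if ids[(p + step * d) % n] > ids[p]:
                        swallowed = True   # a larger ID kills the probe
                        break
                if swallowed:
                    won = False
                else:
                    msgs += dist      # ok-reply travels back the same distance
            if won:
                survivors.append(p)
        active = survivors
        phase += 1
    return ids[active[0]], msgs

ids = [3, 1, 5, 8, 7, 4, 2, 6]        # hypothetical ring
leader, msgs = hirschberg_sinclair(ids)
print(leader)                          # -> 8
```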

H&S algorithm: Message complexity.

Phase       Max # messages per leader   Max # current leaders   Messages in phase
1           4                           n                       <= 4n
2           8                           n/2                     <= 4n
i           2^(i+1)                     n/2^(i-1)               <= 4n
log n       2^(log n + 1)               n/2^(log n - 1)         <= 4n

Total messages: O(n log n).

Can we do better? The O(n log n) algorithm is more complicated than the O(n^2) algorithm but uses fewer messages in the worst case. It works in both the synchronous and the asynchronous model. Can we reduce the number of messages even more? Not in the asynchronous model. Theorem: Any asynchronous leader election algorithm requires Ω(n log n) messages. Theorem: An O(n)-message synchronous algorithm exists for non-uniform rings.

Failures. Question: How do we know that something has failed? Let's see what we mean by "failed". Models of failure: 1. Assume no failures. 2. Crash failures: a process may fail/crash. 3. Message failures: messages may get dropped. 4. Link failures: a communication link stops working. 5. Some combination of 2, 3 and 4. 6. Arbitrary failures: computation/communication may be erroneous.

Failure detectors Detection of a crashed process A major challenge in distributed systems. A failure detector is a process that responds to questions asking whether a given processor has failed A failure detector is not necessarily accurate

Failure detectors. Reliable failure detectors reply with "working" or "failed". Difficulty: detecting that something is working is easy: if it responds to a message, it is working. Detecting failure is harder: if it does not respond to the message, the message may have been lost or delayed, or the processor may be busy, etc. Unreliable failure detectors reply with "suspected" (failed) or "unsuspected"; that is, they do not try to give a confirmed answer. We would ideally like reliable detectors, but unreliable ones (that give "maybe" answers) can be more realistic.

Simple example. Suppose we know all messages are delivered within D seconds. Then we can require each processor to send a message every T seconds to the failure detector. If the failure detector does not get a message from process p within T + D seconds, it marks p as suspected or failed.
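This heartbeat scheme can be sketched as follows (a minimal, single-machine illustration; timestamps are passed in explicitly so the example is deterministic):

```python
# Heartbeat-style failure detector under the stated timing assumption:
# processes send a heartbeat every T seconds, messages arrive within D
# seconds, so a process silent for more than T + D seconds is suspected.

T, D = 2.0, 1.0                       # heartbeat period and max message delay

class FailureDetector:
    def __init__(self):
        self.last_heard = {}
    def heartbeat(self, proc, now):
        self.last_heard[proc] = now   # record most recent heartbeat
    def status(self, proc, now):
        # suspected if no heartbeat within the last T + D seconds
        if now - self.last_heard.get(proc, float("-inf")) > T + D:
            return "suspected"
        return "unsuspected"

fd = FailureDetector()
fd.heartbeat("p1", now=0.0)
print(fd.status("p1", now=2.5))   # -> unsuspected (2.5 <= T + D = 3.0)
print(fd.status("p1", now=5.0))   # -> suspected   (5.0 > 3.0)
```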

Synchronous vs asynchronous. In a synchronous system there is a bound on message delivery time (and on clock drift), so this simple method gives a reliable failure detector. In fact, it is possible to implement this simply as a function: send a message to process p and wait for 2D + ε time; a dedicated detector process is not necessary. In asynchronous systems, things are much harder.

Simple failure detector. If we choose T or D too large, it will take a long time for a failure to be detected. If we choose T too small, communication costs increase and too much burden is put on the processes. If we choose D too small, working processes may get labeled as failed/suspected.

Assumptions and the real world. In reality, both the synchronous and the asynchronous model are too rigid. Real systems are fast, but sometimes messages take longer than usual (though not indefinitely long). Messages usually get delivered, but sometimes they are not.