R. Guerraoui Distributed Programming Laboratory lpdwww.epfl.ch


- Shared Memory - 1

The application model P2 P1 Registers P3 2

Register (assumptions) For presentation simplicity, we assume registers of integers. We also assume that the initial value of a register is 0, and that this value is initialized (written) by some process before the register is used. We assume that every value written is uniquely identified (this can be ensured by associating a process id and a timestamp with the value) 3

Register: specification Assume a register that is local to a process, i.e., accessed only by one process: In this case, the value returned by a Read() is the last value written 4
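As an aside (not part of the lecture), this local-register specification can be sketched in a few lines of Python; the class and method names are illustrative:

```python
# Minimal sketch of a register local to one process: a Read() simply
# returns the last value written. The initial value is 0, as assumed above.
class LocalRegister:
    def __init__(self):
        self.value = 0

    def read(self):
        # last value written
        return self.value

    def write(self, v):
        self.value = v
        return "ok"
```

With a single process there is no concurrency, so reads and writes trivially satisfy the sequential specification.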

Sequential execution P1 R() R() P2 W(5) W(6) 5

Sequential execution P1 R() -> 5 R() -> 6 P2 W(5) W(6) 6

Concurrent execution P1 R1() ->? R2() ->? R3() ->? P2 W(5) W(6) 7

Execution with failures P1 R() ->? P2 W(5) W(6) crash 8

Regular register It assumes only one writer; It provides strong guarantees when there are no concurrent or failed operations (operations invoked by processes that fail in the middle) When some operations are concurrent, or some operation fails, the register provides minimal guarantees 9

Regular register Read() returns:
- the last value written, if there are no concurrent or failed operations
- otherwise, the last value written or any value concurrently written, i.e., the input parameter of some Write() 10

Execution P1 R1() R2() R3() P2 W(5) W(6) 11

Results 1 P1 R1() -> 5 R2() -> 0 R3() -> 25 P2 W(5) W(6) 12

Results 2 P1 R1() -> 5 R2() -> 6 R3() -> 5 P2 W(5) W(6) 13

Results 3 P1 R() -> 5 P2 W(5) W(6) crash 14

Results 4 P1 R() -> 6 P2 W(5) W(6) crash 15

Correctness Results 1: non-regular register (safe) Results 2, 3 and 4: regular register 16

Regular register algorithms 1

Overview of this lecture (1) Overview of a register algorithm (2) A bogus algorithm (3) A simplistic algorithm (4) A simple fail-stop algorithm (5) A tight asynchronous lower bound (6) A fail-silent algorithm 2

A distributed system P2 P1 P3 3

Shared memory model P2 P1 Registers P3 4

Message passing model B A C 5

Implementing a register From message passing to shared memory Implementing the register comes down to implementing Read() and Write() operations at every process 6

Implementing a register Before returning a Read() value, the process must communicate with other processes Before performing a Write(), i.e., returning the corresponding ok, the process must communicate with other processes 7

Overview of this lecture (1) Overview of a register algorithm (2) A bogus algorithm (3) A simplistic algorithm (4) A simple fail-stop algorithm (5) A tight asynchronous lower bound (6) A fail-silent algorithm 8

A bogus algorithm We assume that channels are reliable (perfect point-to-point links) Every process pi holds a copy vi of the register value 9

A bogus algorithm
Read() at pi
- Return vi
Write(v) at pi
- vi := v
- Return ok
The resulting register is live but not safe:
- Even in a sequential and failure-free execution, a Read() by pj might not return the last value written, say by pi 10
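A hedged Python sketch (illustrative names, not from the slides) makes the safety violation concrete: each process touches only its own copy, so the copies never synchronize.

```python
# Bogus algorithm: every process pi reads and writes only its local
# copy vi. The register is live (every operation returns) but not
# safe: a write at one process is invisible at every other process.
class BogusProcess:
    def __init__(self):
        self.vi = 0

    def read(self):
        return self.vi

    def write(self, v):
        self.vi = v
        return "ok"
```

After p2 writes 5, a read at p1 still returns 0, the initial value, even though the execution is sequential and failure-free.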

No safety P1 R1() -> 0 R2() -> 0 R3() -> 0 P2 W(5) W(6) 11

Overview of this lecture (1) Overview of a register algorithm (2) A bogus algorithm (3) A simplistic algorithm (4) A simple fail-stop algorithm (5) A tight asynchronous lower bound (6) A fail-silent algorithm 12

A simplistic algorithm We still assume that channels are reliable but now we also assume that no process fails Basic idea: one process, say p1, holds the value of the register 13

A simplistic algorithm
Read() at pi
- send [R] to p1
- when receive [v]: Return v
Write(v) at pi
- send [W,v] to p1
- when receive [ok]: Return ok
At p1:
T1: when receive [R] from pi: send [v1] to pi
T2: when receive [W,v] from pi: v1 := v; send [ok] to pi 14
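The message exchange above can be sketched in Python, modeling the reliable channels as direct method calls (a simplification for brevity; all names are illustrative):

```python
# Simplistic algorithm: p1 alone holds the register value v1 and
# serves [R] and [W,v] messages from every process.
class P1:
    def __init__(self):
        self.v1 = 0

    def on_read(self):            # T1: reply with the current value
        return self.v1

    def on_write(self, v):        # T2: update the value, then ack
        self.v1 = v
        return "ok"

class Client:
    """Any process pi: it forwards every operation to p1."""
    def __init__(self, p1):
        self.p1 = p1

    def read(self):
        return self.p1.on_read()      # send [R], wait for [v]

    def write(self, v):
        return self.p1.on_write(v)    # send [W,v], wait for [ok]
```

Since every operation goes through p1, a read always observes the last (or a concurrent) write, but the whole construction hinges on p1 never failing.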

Correctness (liveness) By the assumptions that (a) no process fails and (b) channels are reliable, no wait statement blocks forever, and hence every invocation eventually terminates 15

Correctness (safety) (a) If there is no concurrent or failed operation, a Read() returns the last value written (b) A Read() must return some value concurrently written or the last value written NB. If a Read() returns a value written by a given Write(), and another Read() that starts later returns a value written by a different Write(), then the second Write() cannot start after the first Write() terminates 16

Correctness (safety 1) (a) If there is no concurrent or failed operation, a Read() returns the last value written Assume a Write(x) terminates and no other Write() is invoked. The value of the register is hence x at p1. Any subsequent Read() invocation by some process pj returns the value of p1, i.e., x, which is the last written value 17

Correctness (safety 2) (b) A Read() returns the previous value written or the value concurrently written Let x be the value returned by a Read(): by the properties of the channels, x is the value of the register at p1. This value necessarily comes from a concurrent Write() or from the last Write(). 18

What if processes might crash? If p1 crashes, then the register is not live (not wait-free) If p1 is always up, then the register is regular and wait-free 19

Overview of this lecture (1) Overview of a register algorithm (2) A bogus algorithm (3) A simplistic algorithm (4) A simple fail-stop algorithm (5) A tight asynchronous lower bound (6) A fail-silent algorithm 20

A fail-stop algorithm We assume a fail-stop model; more precisely: any number of processes can fail by crashing (no recovery) channels are reliable failure detection is perfect (we have a perfect failure detector) 21

A fail-stop algorithm We implement a regular register: every process pi has a local copy of the register value vi; every process reads locally; the writer writes globally, i.e., at all (non-crashed) processes 22

A fail-stop algorithm
Write(v) at pi
- send [W,v] to all
- for every pj, wait until either: receive [ack] or suspect [pj]
- Return ok
At pi:
- when receive [W,v] from pj: vi := v; send [ack] to pj
Read() at pi
- Return vi 23

Correctness (liveness)
- A Read() is local and eventually returns
- A Write() eventually returns, by (a) the strong completeness property of the failure detector and (b) the reliability of the channels 24

Correctness (safety 1) (a) In the absence of concurrent or failed operation, a Read() returns the last value written Assume a Write(x) terminates and no other Write() is invoked. By the accuracy property of the failure detector, the value of the register at all processes that did not crash is x. Any subsequent Read() invocation by some process pj returns the value of pj, i.e., x, which is the last written value 25

Correctness (safety 2) (b) A Read() returns the value concurrently written or the last value written Let x be the value returned by a Read(): by the properties of the channels, x is the value of the register at some process. This value necessarily comes from the last or a concurrent Write(). 26

What if? Failure detection is not perfect Can we devise an algorithm that implements a regular register and tolerates an arbitrary number of crash failures? 27

Overview of this lecture (1) Overview of a register algorithm (2) A bogus algorithm (3) A simplistic algorithm (4) A simple fail-stop algorithm (5) A tight asynchronous lower bound (6) A fail-silent algorithm 28

Lower bound Proposition: any wait-free asynchronous implementation of a regular register requires a majority of correct processes Proof (sketch): since up to n/2 processes may crash, a Write(v) must be able to return after reaching only n/2 processes; now suppose the other n/2 processes were not crashed but merely disconnected, and a Read() is performed that reaches only those other n/2 processes: the Read() cannot see the value v The impossibility holds even for a 1-1 register (one writer and one reader) 29

The majority algorithm [ABD95] We assume that p1 is the writer and that any process can be a reader We assume that a majority of the processes is correct (the rest can fail by crashing; no recovery) We assume that channels are reliable Every process pi maintains a local copy vi of the register, as well as a sequence number sni and a read timestamp rsi Process p1 maintains, in addition, a timestamp ts1 30

Algorithm - Write()
Write(v) at p1
- ts1++
- send [W,ts1,v] to all
- when receive [W,ts1,ack] from a majority: Return ok
At pi:
- when receive [W,ts1,v] from p1
- if ts1 > sni then vi := v; sni := ts1; send [W,ts1,ack] to p1 31

Algorithm - Read()
Read() at pi
- rsi++
- send [R,rsi] to all
- when receive [R,rsi,snj,vj] from a majority: v := vj with the largest snj; Return v
At pi:
- when receive [R,rsj] from pj: send [R,rsj,sni,vi] to pj 32
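The two phases above can be sketched as a small Python simulation under the stated assumptions (one writer, channels modeled as direct calls, crashes modeled by skipping replicas; all names are illustrative):

```python
# Majority (ABD-style) regular register: the writer tags each value
# with an increasing timestamp and waits for a majority of acks; a
# reader queries a majority and returns the value carrying the
# largest sequence number.
class Replica:
    def __init__(self):
        self.sn, self.v = 0, 0
        self.crashed = False

    def on_write(self, ts, v):
        if ts > self.sn:          # adopt only newer timestamps
            self.sn, self.v = ts, v

class Writer:
    def __init__(self, replicas):
        self.ts = 0
        self.replicas = replicas

    def write(self, v):
        self.ts += 1
        acks = 0
        for r in self.replicas:   # send [W,ts1,v] to all
            if not r.crashed:
                r.on_write(self.ts, v)
                acks += 1
        assert acks > len(self.replicas) // 2   # majority assumed correct
        return "ok"

def read(replicas):
    # collect [R,rsi,snj,vj] replies from a majority of replicas
    replies = [(r.sn, r.v) for r in replicas if not r.crashed]
    assert len(replies) > len(replicas) // 2
    return max(replies)[1]        # vj with the largest snj
```

The design point: any write majority and any read majority intersect, so at least one replica queried by the reader holds the latest timestamped value.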

What if any process that receives a write message (with a timestamp and a value) updated its value and sequence number without checking whether the message actually carries a newer sequence number than its own? 33

Old writes P1 P2 W(5) sn1 = 1; v1 = 5 sn2 = 1; v2 = 5 W(6) sn1 = 2; v1 = 6 R() -> 5 P3 sn3 = 2; v3 = 6 sn3 = 1; v3 = 5 34

Correctness 1
- Liveness: any Read() or Write() eventually returns, by the assumption of a majority of correct processes (if a process has a newer timestamp and does not send [W,ts1,ack], then the older Write() has already returned)
- Safety 2: by the properties of the channels, any value read is the last value written or a value concurrently written 35

Correctness 2 (safety 1) (a) In the absence of concurrent or failed operations, a Read() returns the last value written Assume a Write(x) terminates and no other Write() is invoked. A majority of the processes then store x as their local value, and x is associated with the highest timestamp in the system. Any subsequent Read() invocation by some process pj returns x, which is the last written value 36

What if? Multiple processes can write concurrently? 37

Concurrent writes P1 W(5) ts1 = 1 W(6) ts1 = 2 P2 R() -> 6 P3 W(1) ts3 = 1 38

Overview of this lecture (1) Overview of a register algorithm (2) A bogus algorithm (3) A simplistic algorithm (4) A simple fail-stop algorithm (5) A tight asynchronous lower bound (6) A fail-silent algorithm 39

- Atomic register specification - 1

The application model P2 P1 Registers P3 2

Sequential execution P1 R() R() P2 W(5) W(6) 3

Sequential execution P1 R() -> 5 R() -> 6 P2 W(5) W(6) 4

Concurrent execution P1 R1() ->? R2() ->? R3() ->? P2 W(5) W(6) 5

Execution with failures P1 R() ->? P2 W(5) W(6) crash 6

Safety An atomic register provides strong guarantees even in the presence of concurrency and failures The execution is equivalent to a sequential and failure-free execution (a linearization) 7

Atomic register Every failed (write) operation appears to be either complete or not to have been invoked at all, and every complete operation appears to be executed at some instant between its invocation and reply time events 8

Execution 1 P1 R1() -> 5 R2() -> 0 R3() -> 25 P2 W(5) W(6) 9

Execution 2 P1 R1() -> 5 R2() -> 6 R3() -> 5 P2 W(5) W(6) 10

Execution 3 P1 R1() -> 5 R2() -> 5 R3() -> 5 P2 W(5) W(6) 11

Execution 4 P1 R1() -> 5 R2() -> 6 R3() -> 6 P2 W(5) W(6) 12

Execution 5 P1 R() -> 5 P2 W(5) W(6) crash 13

Execution 6 P1 R() -> 5 R() -> 6 P2 W(5) W(6) crash 14

Execution 7 P1 R() -> 6 R() -> 5 P2 W(5) W(6) crash 15

Correctness Execution 1: non-regular (safe) Executions 2 and 7: non-atomic (regular) Executions 3, 4, 5 and 6: atomic 16

Regular vs Atomic For a regular register to behave atomically, two successive Read() operations must not overlap a Write() The regular register might in this case allow the first Read() to obtain the new value and the second Read() to obtain the old value 17

Atomic register algorithms 1

Overview of this lecture (1) From regular to atomic (2) A 1-1 atomic fail-stop algorithm (3) A 1-N atomic fail-stop algorithm (4) A N-N atomic fail-stop algorithm (5) From fail-stop to fail-silent 2

Fail-stop algorithms We first assume a fail-stop model; more precisely: any number of processes can fail by crashing (no recovery) channels are reliable failure detection is perfect 3

The simple algorithm Consider our fail-stop regular register algorithm: every process has a local copy of the register value; every process reads locally; the writer writes globally, i.e., at all (non-crashed) processes 4

The simple algorithm
Write(v) at pi
- send [W,v] to all
- for every pj, wait until either: receive [ack] or suspect [pj]
- Return ok
At pi:
- when receive [W,v] from pj: vi := v; send [ack] to pj
Read() at pi
- Return vi 5

Atomicity? P1 P2 R1() -> 5 R2() -> 6 v1 = 5 v1 = 6 W(5) W(6) P3 R3() -> 5 v3 = 5 6

Linearization? P1 P2 P3 W(5) R1() -> 5 R2() -> 6 W(6) R3() -> 5?? 7

Fixing the problem: read globally
Read() at pi
- send [W,vi] to all
- for every pj, wait until either: receive [ack] or suspect [pj]
- Return vi 8

Still a problem P1 R() -> 5 P2 W(5) W(6) P3 R() -> 5 9

Linearization? P1 R1() -> 5 P2 W(5) W(6) P3 R3() -> 5?? 10

Overview of this lecture (1) From regular to atomic (2) A 1-1 atomic fail-stop algorithm (3) A 1-N atomic fail-stop algorithm (4) A N-N atomic fail-stop algorithm (5) From fail-stop to fail-silent 11

A fail-stop 1-1 atomic algorithm
Write(v) at p1
- send [W,v] to p2
- wait until either: receive [ack] from p2 or suspect [p2]
- Return ok
At p2:
- when receive [W,v] from p1: v2 := v; send [ack] to p1
Read() at p2
- Return v2 12

A fail-stop 1-N algorithm every process maintains a local copy of the register value as well as a sequence number the writer, p1, maintains, in addition, a timestamp ts1 any process can read the register 13

A fail-stop 1-N algorithm
Write(v) at p1
- ts1++
- send [W,ts1,v] to all
- for every pi, wait until either: receive [ack] or suspect [pi]
- Return ok
Read() at pi
- send [W,sni,vi] to all
- for every pj, wait until either: receive [ack] or suspect [pj]
- Return vi 14

A 1-N algorithm (cont'd)
At pi:
- when receive [W,ts,v] from pj
- if ts > sni then vi := v; sni := ts
- send [ack] to pj 15
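A Python sketch of the 1-N algorithm (illustrative; channels are direct calls and suspicion is a `crashed` flag). The key point is that Read() writes its value back before returning, which prevents a later read from returning an older value:

```python
# 1-N atomic register: the writer propagates (ts, v) to every
# unsuspected process; a reader imposes its own (sni, vi) on every
# unsuspected process before returning it.
class Proc:
    def __init__(self):
        self.sn, self.vi = 0, 0
        self.crashed = False

    def on_write(self, ts, v):
        if ts > self.sn:          # keep only strictly newer values
            self.sn, self.vi = ts, v

class Writer:
    def __init__(self, procs):
        self.ts = 0
        self.procs = procs

    def write(self, v):
        self.ts += 1
        for p in self.procs:      # send [W,ts1,v] to all unsuspected
            if not p.crashed:
                p.on_write(self.ts, v)
        return "ok"

def read(pi, procs):
    # write-back phase: impose (sni, vi) at every unsuspected pj
    for p in procs:
        if not p.crashed:
            p.on_write(pi.sn, pi.vi)
    return pi.vi
```

If the writer crashes after updating only one process, the first reader that observes the new value imposes it everywhere, so every later read agrees with it.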

Why not N-N? P1 R() -> Y P2 W(X) W(Y) P3 W(Z) 16

The Write() algorithm
Write(v) at pi
- send [W] to all
- for every pj, wait until receive [W,snj] or suspect pj
- (sn,id) := (highest snj + 1, i)
- send [W,(sn,id),v] to all
- for every pj, wait until receive [W,(sn,id),ack] or suspect pj
- Return ok
At pi:
T1:
- when receive [W] from pj: send [W,sn] to pj
T2:
- when receive [W,(snj,idj),v] from pj
- if (snj,idj) > (sn,id) then vi := v; (sn,id) := (snj,idj)
- send [W,(snj,idj),ack] to pj 17

The Read() algorithm
Read() at pi
- send [R] to all
- for every pj, wait until receive [R,(snj,idj),vj] or suspect pj
- v := vj with the highest (snj,idj)
- (sn,id) := highest (snj,idj)
- send [W,(sn,id),v] to all
- for every pj, wait until receive [W,(sn,id),ack] or suspect pj
- Return v
At pi:
T1:
- when receive [R] from pj: send [R,(sn,id),vi] to pj
T2:
- when receive [W,(snj,idj),v] from pj
- if (snj,idj) > (sn,id) then vi := v; (sn,id) := (snj,idj)
- send [W,(snj,idj),ack] to pj 18
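The two-phase N-N write can be sketched as follows (illustrative Python; failure handling is omitted for brevity). Phase one collects sequence numbers; phase two writes with a (sn, id) pair that is totally ordered across concurrent writers:

```python
# N-N atomic write: timestamps are pairs (sn, writer id), compared
# lexicographically, so two writers can never pick the same stamp.
class Proc:
    def __init__(self):
        self.stamp = (0, 0)   # (sn, id)
        self.vi = 0

    def on_query(self):       # T1: reply to [W] with the current sn
        return self.stamp[0]

    def on_write(self, stamp, v):   # T2: adopt strictly newer stamps
        if stamp > self.stamp:
            self.stamp, self.vi = stamp, v

def write(v, my_id, procs):
    sn = max(p.on_query() for p in procs) + 1  # phase 1: highest snj + 1
    for p in procs:                            # phase 2: send [W,(sn,id),v]
        p.on_write((sn, my_id), v)
    return "ok"
```

Python's lexicographic tuple comparison matches the (sn, id) ordering used in the slides: ties on sn are broken by the writer id, so concurrent writes are totally ordered.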

Overview of this lecture (1) From regular to atomic (2) A 1-1 atomic fail-stop algorithm (3) A 1-N atomic fail-stop algorithm (4) A N-N atomic fail-stop algorithm (5) From fail-stop to fail-silent 19

From fail-stop to fail-silent We assume a majority of correct processes In the 1-N algorithm, the writer writes at a majority using a timestamp determined locally, and the reader selects a value from a majority and then imposes this value on a majority In the N-N algorithm, the writers first determine the timestamp by querying a majority 20