Distributed Systems 2: Introduction
Alberto Montresor, University of Trento, Italy
2018/09/13
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Contents
1. Getting Started
   - Two generals
   - Common Knowledge
   - Byzantine generals
   - Summary
2. Themes of the course
   - Impossible vs practical
   - Classical vs extreme distributed systems
   - Syllabus
3. Modeling Distributed Systems
   - Computation
   - Interaction
   - Failures
   - Time
Getting Started: Two generals
A thought experiment:
Alberto Montresor (UniTN) DS - Introduction 2018/09/13 1 / 43
Two generals: A potential solution
- General A: attack at dawn!
- General B: ack, attack at dawn!
- General A: ack ack, attack at dawn!
- General B: ack ack ack, attack at dawn!
- ...
Theorem: Under this scenario, there is no solution for the Two Generals Problem.
Proof. By contradiction on the number of messages exchanged.
- Assume (by contradiction) that there is at least one solution to this problem under this scenario.
- If there is one, there could be many. Among them, we can pick one that uses the minimum number of messages.
- Take the last message of this protocol: it can be received or it can be lost.
- The protocol must work in both cases, so we could avoid sending it at all.
- The resulting protocol uses fewer messages than the minimum, which is a contradiction.
Two generals: Reality Check
Atomic Commit: an atomic commit is an operation in which a set of distinct changes is applied as a single operation.
Example: ATM withdrawal
- You withdraw 100 euro from an ATM in Trento
- Your balance should be reduced by 100 euro
Two generals: A pragmatic solution
Probabilistic protocol, assuming messengers are caught independently of each other with probability p:
- General A: sends n messengers; attacks no matter what
- General B: attacks if he receives at least one messenger
p^n is the probability that the attack will be uncoordinated.
Trade-off:
- We can decrease the probability of failure by increasing n...
- ...but at the additional cost of sending more messengers,
- and without ever being certain that the attack will be coordinated!
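The trade-off in the probabilistic protocol can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original slides; the function names and the target threshold `epsilon` are assumptions introduced for illustration.

```python
def uncoordinated_probability(p: float, n: int) -> float:
    """Probability that all n independent messengers are caught (p ** n)."""
    return p ** n


def messengers_needed(p: float, epsilon: float) -> int:
    """Smallest n with p ** n <= epsilon (hypothetical helper).

    Requires 0 < p < 1 and 0 < epsilon < 1; the failure probability
    never reaches zero, it only drops below the chosen threshold.
    """
    if not (0 < p < 1 and 0 < epsilon < 1):
        raise ValueError("p and epsilon must be strictly between 0 and 1")
    n = 1
    while p ** n > epsilon:
        n += 1
    return n


if __name__ == "__main__":
    # With a 50% capture rate, how many messengers keep the risk under 1%?
    print(messengers_needed(0.5, 0.01))
```

Note that no finite n makes the probability exactly zero, which matches the slide's point that coordination is never certain.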
Common Knowledge: Muddy children
- n children go playing
- Children are truthful, perceptive, intelligent
- Mom says: "Don't get muddy!"
- A bunch (say, k) get mud on their forehead
- Daddy comes, looks around, and says: "Some of you got a muddy forehead"
- Daddy repeatedly asks: "Do you know whether you have a muddy forehead?"
- What happens?
Slides from Lorenzo Alvisi
Common Knowledge: Muddy children
Theorem:
- The first k - 1 times Daddy asks, the children say No
- The k-th time, the k muddy children say Yes
Proof. By induction on k.
- Let k = 1. The first time Daddy asks, the child with mud on his forehead says Yes: all the others have no mud, and someone has mud on his forehead, so it must be him.
- Let k > 1. Every child with mud sees k - 1 children with mud on the forehead. If there were only k - 1 children with mud, they would have said Yes the (k - 1)-th time Daddy asked, but they didn't. So there are actually k children with mud, and they all say Yes the k-th time Daddy asks.
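The induction can be checked mechanically with a small possible-worlds simulation. This is a sketch under stated assumptions, not the slides' own material: the function name `muddy_children` and the representation of a world as a tuple of booleans are introduced here for illustration. Daddy's announcement is modeled by discarding the world where nobody is muddy, and each round of "No" answers discards every world in which somebody would have answered differently.

```python
from itertools import product


def muddy_children(actual):
    """Return the list of answers per round until some child says Yes.

    actual[i] is True if child i is muddy; at least one must be muddy.
    """
    if not any(actual):
        raise ValueError("Daddy's announcement requires at least one muddy child")
    n = len(actual)
    # Daddy's public announcement: "at least one forehead is muddy".
    worlds = {w for w in product((False, True), repeat=n) if any(w)}

    def knows(i, world, candidates):
        # Child i sees every forehead but its own; it knows its state when
        # all remaining worlds matching what it sees agree on position i.
        own = {w[i] for w in candidates
               if all(w[j] == world[j] for j in range(n) if j != i)}
        return len(own) == 1

    rounds = []
    while True:
        answers = [knows(i, actual, worlds) for i in range(n)]
        rounds.append(answers)
        if any(answers):
            return rounds
        # Everyone hears the answers: keep only worlds producing the same ones.
        worlds = {w for w in worlds
                  if [knows(i, w, worlds) for i in range(n)] == answers}


if __name__ == "__main__":
    for answers in muddy_children((True, True, False)):
        print(answers)
```

With two muddy children out of three, the first round is all No and the second round has exactly the two muddy children answering Yes, as the theorem predicts.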
Common Knowledge: Muddy children, Variation 1
- Suppose k > 1
- Everyone knows that someone has a dirty forehead before Dad announces it
- Does Dad still need to speak up?
Let p = "Someone's forehead is dirty".
- Everyone knows p
- But, unless the father speaks, if k = 2 not everyone knows that everyone knows p!
- Suppose A and B are dirty. Before the father speaks, A does not know whether B knows p
- If k = 3, not everyone knows that everyone knows that everyone knows p
- ...
Common Knowledge: Muddy children
- Variation 2: ... the father took every child aside and told them individually (without others noticing) that someone's forehead is muddy?
- Variation 3: ... every child had (unknown to the other children) put a miniature microphone on every other child so they can hear what the father says in private to them?
Common Knowledge: Two generals, reloaded
There is an entire logic that formalizes what knowledge participants acquire while running a protocol:
- J. Halpern and Y. Moses. Knowledge and Common Knowledge in a Distributed Environment. E.W. Dijkstra Prize.
Solving the Two Generals Problem requires common knowledge:
- everyone knows that everyone knows that everyone knows...
But: common knowledge cannot be achieved by communicating through unreliable channels.
Common Knowledge: A common knowledge puzzle
Albert and Bernard just became friends with Cheryl, and they want to know when her birthday is. Cheryl gives them a list of 10 dates:
- May 15, 16, 19
- June 17, 18
- July 14, 16
- August 14, 15, 17
Cheryl then tells Albert and Bernard separately the month and the day of her birthday, respectively.
- Albert: I don't know when Cheryl's birthday is, but I know that Bernard does not know either.
- Bernard: At first I didn't know when Cheryl's birthday is, but I know now.
- Albert: Then I also know when Cheryl's birthday is.
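Each statement in the puzzle is a public announcement that removes dates from the set everyone still considers possible. The sketch below replays the three statements as filters; the helper names (`albert_view`, `bernard_view`, `solve`) are hypothetical and only the ten dates come from the slide.

```python
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def albert_view(date, candidates):
    """Dates Albert considers possible: he was told only the month."""
    return [d for d in candidates if d[0] == date[0]]


def bernard_view(date, candidates):
    """Dates Bernard considers possible: he was told only the day."""
    return [d for d in candidates if d[1] == date[1]]


def solve():
    # Statement 1: Albert does not know, and knows that Bernard does not know.
    after_1 = [
        d for d in DATES
        if len(albert_view(d, DATES)) > 1
        and all(len(bernard_view(x, DATES)) > 1 for x in albert_view(d, DATES))
    ]
    # Statement 2: Bernard did not know at first, but knows after statement 1.
    after_2 = [
        d for d in after_1
        if len(bernard_view(d, DATES)) > 1 and len(bernard_view(d, after_1)) == 1
    ]
    # Statement 3: Albert now knows as well.
    return [d for d in after_2 if len(albert_view(d, after_2)) == 1]


if __name__ == "__main__":
    print(solve())
```

The point of the exercise is that each filter depends on the previous one being publicly known: Bernard can only use statement 1 because he heard Albert say it.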
Byzantine generals
Scenario:
- n Byzantine generals encircling a city
- They must decide whether to attack or retreat!
- Messengers are reliable and synchronous
- Generals may be traitors
- Nobody knows which generals are traitors
Problem specification: the generals require an algorithm to reach an agreement such that (i) all loyal generals decide on the same plan of action and (ii) a small number of traitorous generals cannot cause the loyal generals to adopt different plans.
Byzantine generals: A potential solution
Wait for a majority of generals to agree: WRONG!
Possible scenario with 3 generals:
- 1 votes attack, 1 votes retreat
- 1 traitorous general:
  - sends a vote "attack" to the attack general
  - sends a vote "retreat" to the retreat general
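The three-general scenario can be replayed in a few lines. This is an illustrative sketch only; the function names and the list-of-votes representation are assumptions, and each loyal general is assumed to count its own vote together with the two votes it receives.

```python
from collections import Counter


def majority(votes):
    """Return the most frequent vote in a list with a clear majority."""
    return Counter(votes).most_common(1)[0][0]


def three_generals_scenario():
    # Loyal general A votes "attack", loyal general B votes "retreat".
    # The traitor tells each loyal general what that general wants to hear.
    a_view = ["attack", "retreat", "attack"]    # own vote, B's vote, traitor's vote
    b_view = ["retreat", "attack", "retreat"]   # own vote, A's vote, traitor's vote
    return majority(a_view), majority(b_view)


if __name__ == "__main__":
    print(three_generals_scenario())
```

Both loyal generals see a 2-to-1 majority, yet they decide differently, which violates requirement (i) of the problem specification.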
Byzantine generals: The problem is solvable
Byzantine Fault Tolerance (1982)
- L. Lamport, R. Shostak, M. Pease. The Byzantine Generals Problem. ACM Trans. on Programming Languages and Systems, 4(3), 1982.
- A protocol that, given n processes, can tolerate up to t traitorous generals with n ≥ 3t + 1
- Example: 4 generals can tolerate up to 1 Byzantine general
Practical Byzantine Fault Tolerance (2002)
- M. Castro and B. Liskov. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans. on Computer Systems, 20(4), 2002.
- PBFT triggered a renaissance in BFT replication research
- Still going on...
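The bound n ≥ 3t + 1 is easy to turn into arithmetic in both directions. A minimal sketch, with hypothetical helper names:

```python
def max_traitors(n: int) -> int:
    """Largest t such that n >= 3t + 1."""
    if n < 1:
        raise ValueError("need at least one general")
    return (n - 1) // 3


def min_generals(t: int) -> int:
    """Smallest n that satisfies n >= 3t + 1 for t traitors."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(n, "generals tolerate", max_traitors(n), "traitor(s)")
```

In particular, 3 generals tolerate no traitor at all, which is consistent with the failed majority-vote scenario with 3 generals.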
Byzantine generals: Reality Check
BFT sponsors in 1982:
- NASA
- The Ballistic Missile Defense System Command
- Army Research Office
Nancy Lynch's book on Distributed Systems: "The agreement problem is a simplified version of a problem that originally arose in the development of on-board aircraft control systems."
- Bitcoin, a peer-to-peer digital currency system, is based on BFT.
- The 8-hour downtime of Amazon S3 in July 2008 is a well-known example of what happens when you don't use BFT.
Summary: Take-home lessons
We need to properly model our distributed systems:
- Reliable / unreliable communication
- Benign / malicious processes
Solutions depend on the underlying model:
- Approximate or probabilistic solutions
- Bounded solution
- No solution at all!
Coordinating multiple processes is difficult:
- Unexpected events: failures, malicious behavior
- Lack of common knowledge
Themes of the course: Theory vs practice
Yogi Berra says: "In theory, theory and practice are the same. In practice, they are not."
Yogi Berra:
- "Always go to other people's funerals, otherwise they won't go to yours"
- "I really didn't say everything I said"
- "Nobody goes there anymore; it's too crowded"
First theme of the course: Impossible vs practical
Several papers about impossibility results:
- M. Fischer, N. Lynch, M. Paterson. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2).
- S. Gilbert, N. Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, 33(2):51-59.
Yet, many of these problems have practical solutions:
- T. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2).
- L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18-25.
A general tension in Computer Science:
- M. Vardi. Solving the Unsolvable. Comm. of the ACM, 54(7):5.
Classical vs extreme distributed systems: Beyond technology
The Spanish Flu pandemic: killed 20M people in a relatively short time, more than World War I
- Virus goal: spread itself as quickly as possible
- Unreliable environment:
  - viruses may be killed
  - transmission may fail
- Transmission network is a complex graph
Classical vs extreme distributed systems: Beyond technology
Flocks of birds:
- Flying in a flock is good: the probability of being killed by a predator is reduced
- Flying in a flock is bad: the probability of finding (enough) food is reduced
- Birds self-organize in a flock
- No central authority
Second theme of the course: Classical vs extreme distributed systems
- Classical distributed system problems include agreement, total order broadcast, atomic commit, replication, etc.
- Extreme distributed system problems include self-* properties, scalability, full decentralization, etc.
Special issue on Springer Computing (Sept. 2012), Alberto Montresor, Gusz Eiben, Maarten van Steen, editors:
"Modern distributed systems may nowadays consist of hundreds of thousands of computers, ranging from high-end powerful machines to low-end resource-constrained wireless devices. We label them as extreme distributed systems, as they push scalability and complexity well beyond traditional scenarios."
Syllabus: Topics
- Introduction
- Reliable broadcast
- Epidemic protocols
- Impossibility of consensus
- Consensus and failure detectors
- Complex networks
- P2P
- Epidemics: Beyond dissemination
- Rollback and Recovery
- Paxos
- Practical Byzantine Fault Tolerance
- Byzantine, altruistic, rational model
- Blockchains
Syllabus: What's next in the course?
Distributed system modeling
- Which kinds of failures exist?
- Which kinds of failures do we tolerate?
Problem specification
- Formal description of the problem
Algorithms, algorithms, algorithms
- Pseudo-code descriptions of algorithms
Proofs
- Just writing the code is not enough!
- Sometimes, impossibility proofs
Reality checks
- Learn about real systems where these protocols are applied
Modeling Distributed Systems: Contents
- Computation: processes, deterministic vs probabilistic behavior
- Interaction: processes interact through messages, which result in:
  - Communication, i.e. information flow
  - Coordination, i.e. synchronization and ordering of activities
- Failures: which kinds of failures can occur?
  - Benign vs malicious (Byzantine)
  - Process vs communication
- Time: determining whether we can make any assumption on time bounds on communication and computation speeds
Modeling Distributed Systems: Computation
- Process: the unit of computation in a distributed system. Sometimes we may call it node, host, etc.
- Process set: denoted by Π, it is composed of a collection of n uniquely identified processes, like p_1, p_2, ..., p_n.
Typical assumptions:
- The set is static (n is well-defined)
- Processes know each other
- All processes run a copy of the same algorithm; the sum of all these copies constitutes the distributed algorithm
But in extreme distributed systems:
- Dynamic set
- Too many, too dynamic to know them all
- Multiple algorithms
Computation: Deterministic vs probabilistic
- Deterministic process: the local computation and the messages sent by a process are determined by the current state and the messages previously received.
- Probabilistic process: processes may make use of random oracles to choose the local computation to be performed or the next message to be sent.
Modeling Distributed Systems: Interaction
Processes communicate through messages:
- send(m, p): sends a message m to p
- receive(m): receives a message m
In some cases, messages may be uniquely identified by:
- The sender of the message
- A sequence number local to the sender
General assumption: every pair of processes is connected by a bi-directional communication channel
- Through routing
- Not true for P2P systems
Notes:
- In the receive operation, we do not specify the original sender.
- A fully connected topology may be obtained through routing. For example, consider the following architectures:
  - Fully connected mesh
  - Broadcast medium (Ethernet, wireless)
  - Ring
  - Internet with routers
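The idea of identifying a message by its sender plus a sender-local sequence number can be sketched as follows. The `Message` and `Process` classes, the in-memory `inbox`, and passing the destination as an object are all assumptions made for this illustration; they are not an API from the slides.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Message:
    sender: str    # identifier of the sending process
    seq: int       # sequence number local to the sender
    payload: str

    @property
    def uid(self):
        """(sender, seq) identifies the message uniquely system-wide."""
        return (self.sender, self.seq)


class Process:
    def __init__(self, name: str):
        self.name = name
        self._next_seq = count(1)   # per-sender counter
        self.inbox = []             # stands in for the incoming channel

    def send(self, payload: str, dest: "Process") -> Message:
        m = Message(self.name, next(self._next_seq), payload)
        dest.inbox.append(m)        # idealized channel: no loss, no delay
        return m

    def receive(self):
        """Return the next message, or None if nothing has arrived."""
        return self.inbox.pop(0) if self.inbox else None


if __name__ == "__main__":
    p1, p2 = Process("p1"), Process("p2")
    p1.send("attack at dawn", p2)
    p1.send("attack at dawn", p2)   # same payload, different identity
    print(p2.receive().uid, p2.receive().uid)
```

Two sends with identical payloads still yield distinct identifiers, which is what lets a receiver tell a retransmission from a genuinely new message.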
Failures: Process failures
In a distributed system, both processes and communication channels may fail, i.e. depart from what is considered their correct behavior. Hadzilacos and Toueg provide a taxonomy.
Benign process failures:
- Fail-stop: a process stops executing events, and other processes may detect this fact.
- Crash: a process stops executing events.
Malicious process failures:
- Arbitrary failure, or Byzantine: any type of error may occur. This may be caused by:
  - A software bug
  - Malicious behavior inspired by an intelligent adversary
Failures: Process failures
- A process that never fails is correct
- A process that eventually fails is faulty
- Several protocols are designed to work correctly if the number of failures f is bounded (for example, f < n/3).
In some models, processes may perform a recovery action:
- After some time, a process may resume functioning
- It suffers amnesia: the local state maintained in volatile memory is lost
- To limit the effects of amnesia, a log can be maintained
- To avoid the problem of amnesia completely, every read/write would have to pass through permanent memory, which is too expensive
Failures: Communication failures
Benign communication failures:
1. Process p performs send of a message m to process q
2. Message m is inserted in a local outgoing buffer of p (Send-omission)
3. Message m is transmitted from p to q (Omission)
4. Message m is inserted in a local incoming buffer of q (Receive-omission)
5. Process q performs receive of m
Malign communication failures:
- Messages created out of nothing, duplicated messages, etc.
- These problems can easily be solved through encryption techniques.
Failures: Communication failures
Possible causes of message failures:
- Buffer overflow in the operating system
- Congestion, routing errors in routers
Partitioning:
- Processes are subdivided into disjoint sets called partitions
- Communication inside a partition is possible
- Communication between partitions is not possible
- When a partition disappears, we say that partitions merge
Failures: Modeling (faulty) communication channels
The idea: the channels cannot systematically drop a specific message. This is the minimum abstraction needed to create reliable channels.
Fair-Loss Channels:
- Validity (Fair Loss): if a message m is sent infinitely often by a process p to a process q and neither p nor q crashes, then q will receive m infinitely often
- Integrity (Finite Duplication): if a message m is sent a finite number of times by a process p to a process q, then m cannot be received by q an infinite number of times
- Integrity (No creation): if a message m is delivered by some process p, then m was previously sent by some process q to p
Failures: Modeling (correct) communication channels
The idea: channels are reliable, messages are never lost. It can be implemented, but there is a price to be paid: asynchrony.
Perfect Channels:
- Validity (Reliable delivery): if p sends a message m to q, and neither p nor q crashes, then q will eventually receive m
- Integrity (No duplication): no message is delivered to a process more than once
- Integrity (No creation): if a message m is delivered by some process p, then m was previously sent by some process q to p
Failures: An Example
Algorithm: Fair-loss Channel → Perfect Channel

    upon init do
        Set sent ← ∅
        Set delivered ← ∅
        starttimer(timeout)

    upon timeout do
        foreach (m, q) ∈ sent do
            fairlosssend(m, q)
        starttimer(timeout)

    upon perfectsend(m, q) do
        fairlosssend(m, q)
        sent ← sent ∪ {(m, q)}

    upon fairlossreceive(m, q) do
        if m ∉ delivered then
            delivered ← delivered ∪ {m}
            perfectreceive(m, q)
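The pseudocode above can be exercised in a small single-threaded simulation. This is a sketch under several assumptions that are not in the slides: the fair-loss channel is modeled as dropping each transmission independently with a fixed probability, timers are replaced by explicit rounds, message payloads are assumed unique (the slide deduplicates on m alone), and all class and method names are hypothetical.

```python
import random


class FairLossChannel:
    """Toy fair-loss channel: each transmission is dropped with probability loss."""

    def __init__(self, loss: float, seed: int = 0):
        self.loss = loss
        self.rng = random.Random(seed)   # seeded for reproducibility
        self.in_flight = []              # (message, source, destination)

    def send(self, m, src, dst):
        if self.rng.random() >= self.loss:
            self.in_flight.append((m, src, dst))

    def deliver_all(self, processes):
        batch, self.in_flight = self.in_flight, []
        for m, src, dst in batch:
            processes[dst].on_fair_loss_receive(m, src)


class PerfectEndpoint:
    """One process running the fair-loss-to-perfect algorithm."""

    def __init__(self, name, channel):
        self.name = name
        self.channel = channel
        self.sent = set()        # (m, q) pairs, retransmitted at every timeout
        self.delivered = set()   # messages already handed to the application
        self.received = []       # what perfectreceive would have produced

    def perfect_send(self, m, q):
        self.channel.send(m, self.name, q)
        self.sent.add((m, q))

    def on_timeout(self):
        for m, q in self.sent:
            self.channel.send(m, self.name, q)

    def on_fair_loss_receive(self, m, q):
        if m not in self.delivered:      # suppress duplicates from retransmission
            self.delivered.add(m)
            self.received.append((m, q))


def run(rounds: int = 50, loss: float = 0.5):
    channel = FairLossChannel(loss)
    procs = {name: PerfectEndpoint(name, channel) for name in ("A", "B")}
    procs["A"].perfect_send("attack at dawn", "B")
    for _ in range(rounds):
        channel.deliver_all(procs)
        for p in procs.values():
            p.on_timeout()               # each round plays the role of a timeout
    return procs["B"].received


if __name__ == "__main__":
    print(run())
```

Even though the message is retransmitted in every round and many copies get through, B's application sees it once. Note that `sent` is never pruned, so the sender retransmits forever; the slide's algorithm has the same property.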
Failures: Safety and liveness
Safety: something bad will never happen
- In other words, a distributed program should never enter an unacceptable state.
- Example: no message is delivered to a process more than once.
Liveness: something good eventually does happen
- In other words, a distributed program eventually enters a desirable state.
- Example: if p sends a message m to q, and neither p nor q crashes, then eventually q will receive m.
Modeling Distributed Systems: Time
Global clock:
- For simplicity of presentation, it may be convenient to assume the presence of a global real-time clock, outside the control of processes.
- This can be used to provide a global ordering of steps in a distributed system.
In reality:
- Each process is associated with a local clock
- Local clocks may not report the perfect time
- Clock drift rate: the relative amount by which a computer clock differs from a perfect reference clock
Synchronization is possible, but expensive:
- Atomic clocks: the cost is not justified
- GPS: does not work inside buildings
- See: Google TrueTime API
Time: Time measures associated with communication
- Latency: the delay between the start of message sending from one process and the beginning of its receipt by another. Possible causes:
  - the actual time for bit transmission (e.g., satellite link)
  - the delay for accessing the network, especially in case of congestion
  - the time taken by the operating system to handle the message both at sender and receiver
- Bandwidth: total amount of information that can be transmitted over a communication channel in a given time.
- Jitter: variation in the time taken to deliver a series of messages. Mostly relevant for multimedia data.
Time: Asynchronous vs synchronous
Distributed Systems vs Time: distributed systems make it difficult to reason about time, and not only for lack of clock synchronization. It is also difficult to place time bounds on events and communication.
We may think about several different models:
- Asynchronous distributed systems: no assumptions can be made. Most problems cannot be solved.
- Synchronous distributed systems: precise assumptions are possible on computation, communication time and clocks. Not really realistic / difficult to implement.
- Partially synchronous distributed systems:
  - Some assumptions can be made, others not, OR
  - Assumptions can be made statistically, OR
  - Assumptions hold for arbitrarily long periods of time
Asynchronous vs synchronous: Asynchronous distributed system
- There are no bounds on the relative speed of process execution.
- There are no bounds on message transmission delays.
- There are no bounds on clock drift. OR, since we cannot count on their precision at all, there are no clocks.
Asynchronous distributed system: Comments
- These are not assumptions! They are a lack of assumptions!
- The worst possible model: services as simple as
  - failure detection
  - time-based coordination
  are not possible
- Advantages:
  - simple semantics
  - easier to port to more powerful models
- More realistic: several sources of asynchrony are present in a large-scale network (like the Internet)
Asynchronous vs synchronous: Synchronous Distributed Systems
- Synchronous computation: there is a known upper bound on the relative speed of process execution.
- Synchronous communication: there is a known upper bound on message transmission delays.
- Synchronous clocks: processes are equipped with local clocks. There is a known upper bound on the drift rates of local clocks with respect to a global real-time clock.
Synchronous Distributed Systems: Comments
- The best possible model.
- Can be built, but not with standard hardware/software:
  - Synchronous Ethernet vs CSMA/CD Ethernet
  - Real-time OS vs normal OS
- Many interesting properties:
  - Timed failure detection (e.g., ping)
  - Coordination based on time (e.g., lease)
  - Worst-case performance analysis
  - Synchronized clocks
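Timed failure detection relies entirely on the known bounds of the synchronous model. The sketch below assumes a known maximum one-way message delay and a known maximum processing time at the pinged process; the function name `ping_verdict`, its parameters, and the three verdict strings are hypothetical, and times are plain numbers on a shared real-time scale.

```python
def ping_verdict(sent_at, now, reply_at, max_delay, max_processing):
    """Classify a pinged process in a synchronous system.

    sent_at:        time the ping was sent
    now:            current time
    reply_at:       time the reply arrived, or None if none has arrived
    max_delay:      known upper bound on one-way message delay
    max_processing: known upper bound on the time to answer a ping
    """
    # Worst case: ping takes max_delay, processing takes max_processing,
    # and the reply takes max_delay again.
    deadline = sent_at + 2 * max_delay + max_processing
    if reply_at is not None and reply_at <= deadline:
        return "alive"
    if now > deadline:
        # Under the synchronous assumptions, silence past the deadline
        # can only mean the process has crashed.
        return "crashed"
    return "waiting"


if __name__ == "__main__":
    print(ping_verdict(0.0, 1.0, None, max_delay=1.0, max_processing=0.5))
    print(ping_verdict(0.0, 3.0, None, max_delay=1.0, max_processing=0.5))
```

In an asynchronous system no such deadline exists, so the "crashed" verdict could never be issued with certainty; this is why failure detection is listed among the services the asynchronous model cannot provide.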
Asynchronous vs synchronous: Partial synchrony
For most systems we know of, it is relatively easy to define physical time bounds that are respected most of the time. There are, however, periods where the timing assumptions do not hold.
Delays on processes:
- Machines may run out of memory, slowing down processes
- A typical case of no bound on the relative speeds of processes
Delays on messages:
- The network may be congested, and messages may be dropped
- Re-transmission protocols can ensure reliability, but at the price of asynchrony
- Messages may be re-transmitted an arbitrary number of times
In this sense, practical systems are partially synchronous.
Partial synchrony: How to express partial synchrony?
A possibility is the following: timing assumptions only hold eventually.
Theoretically, it means:
- There is a time after which the system is synchronous forever
- The system is initially asynchronous and only after a long time becomes synchronous
How to read it:
- The system is not always synchronous
- There is no known bound on the period in which it is asynchronous
- We expect that there are periods during which the system is synchronous
- Some of these periods are long enough to terminate protocol execution
Reading Material
- A. Panconesi. The coordinated attack and the jealous amazons.
- A. Panconesi. Coordination and the fall of Eastern Roman Empire.
- V. Hadzilacos and S. Toueg. A modular approach to fault-tolerant broadcasts and related problems. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley.
Reality Check: Interesting links
- The S3 incident
- Solving the unsolvable
- The rise and fall of Corba
- Notes on Distributed Systems for Young Bloods
More informationPractical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov
Practical Byzantine Fault Tolerance Miguel Castro and Barbara Liskov Outline 1. Introduction to Byzantine Fault Tolerance Problem 2. PBFT Algorithm a. Models and overview b. Three-phase protocol c. View-change
More informationBYZANTINE AGREEMENT CH / $ IEEE. by H. R. Strong and D. Dolev. IBM Research Laboratory, K55/281 San Jose, CA 95193
BYZANTINE AGREEMENT by H. R. Strong and D. Dolev IBM Research Laboratory, K55/281 San Jose, CA 95193 ABSTRACT Byzantine Agreement is a paradigm for problems of reliable consistency and synchronization
More informationDep. Systems Requirements
Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small
More informationCSE 486/586 Distributed Systems
CSE 486/586 Distributed Systems Failure Detectors Slides by: Steve Ko Computer Sciences and Engineering University at Buffalo Administrivia Programming Assignment 2 is out Please continue to monitor Piazza
More informationDistributed Algorithms Benoît Garbinato
Distributed Algorithms Benoît Garbinato 1 Distributed systems networks distributed As long as there were no machines, programming was no problem networks distributed at all; when we had a few weak computers,
More informationToday: Fault Tolerance. Replica Management
Today: Fault Tolerance Failure models Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery
More informationSpecifying and Proving Broadcast Properties with TLA
Specifying and Proving Broadcast Properties with TLA William Hipschman Department of Computer Science The University of North Carolina at Chapel Hill Abstract Although group communication is vitally important
More informationA definition. Byzantine Generals Problem. Synchronous, Byzantine world
The Byzantine Generals Problem Leslie Lamport, Robert Shostak, and Marshall Pease ACM TOPLAS 1982 Practical Byzantine Fault Tolerance Miguel Castro and Barbara Liskov OSDI 1999 A definition Byzantine (www.m-w.com):
More informationTo do. Consensus and related problems. q Failure. q Raft
Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the
More informationBYZANTINE GENERALS BYZANTINE GENERALS (1) A fable: Michał Szychowiak, 2002 Dependability of Distributed Systems (Byzantine agreement)
BYZANTINE GENERALS (1) BYZANTINE GENERALS A fable: BYZANTINE GENERALS (2) Byzantine Generals Problem: Condition 1: All loyal generals decide upon the same plan of action. Condition 2: A small number of
More informationFault Tolerance. Distributed Systems. September 2002
Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend
More informationFailures, Elections, and Raft
Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright
More informationC 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How
CSE 486/586 Distributed Systems Failure Detectors Today s Question I have a feeling that something went wrong Steve Ko Computer Sciences and Engineering University at Buffalo zzz You ll learn new terminologies,
More informationA Case Study of Agreement Problems in Distributed Systems : Non-Blocking Atomic Commitment
A Case Study of Agreement Problems in Distributed Systems : Non-Blocking Atomic Commitment Michel RAYNAL IRISA, Campus de Beaulieu 35042 Rennes Cedex (France) raynal @irisa.fr Abstract This paper considers
More informationConsensus. Chapter Two Friends. 8.3 Impossibility of Consensus. 8.2 Consensus 8.3. IMPOSSIBILITY OF CONSENSUS 55
8.3. IMPOSSIBILITY OF CONSENSUS 55 Agreement All correct nodes decide for the same value. Termination All correct nodes terminate in finite time. Validity The decision value must be the input value of
More informationCS5412: CONSENSUS AND THE FLP IMPOSSIBILITY RESULT
1 CS5412: CONSENSUS AND THE FLP IMPOSSIBILITY RESULT Lecture XII Ken Birman Generalizing Ron and Hermione s challenge 2 Recall from last time: Ron and Hermione had difficulty agreeing where to meet for
More informationDistributed Systems (5DV147)
Distributed Systems (5DV147) Fundamentals Fall 2013 1 basics 2 basics Single process int i; i=i+1; 1 CPU - Steps are strictly sequential - Program behavior & variables state determined by sequence of operations
More informationC 1. Recap. CSE 486/586 Distributed Systems Failure Detectors. Today s Question. Two Different System Models. Why, What, and How.
Recap Best Practices Distributed Systems Failure Detectors Steve Ko Computer Sciences and Engineering University at Buffalo 2 Today s Question Two Different System Models How do we handle failures? Cannot
More informationSystem Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics
System Models Nicola Dragoni Embedded Systems Engineering DTU Informatics 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models Architectural vs Fundamental Models Systems that are intended
More informationSystem models for distributed systems
System models for distributed systems INF5040/9040 autumn 2010 lecturer: Frank Eliassen INF5040 H2010, Frank Eliassen 1 System models Purpose illustrate/describe common properties and design choices for
More informationConsensus. Chapter Two Friends. 2.3 Impossibility of Consensus. 2.2 Consensus 16 CHAPTER 2. CONSENSUS
16 CHAPTER 2. CONSENSUS Agreement All correct nodes decide for the same value. Termination All correct nodes terminate in finite time. Validity The decision value must be the input value of a node. Chapter
More informationDistributed Systems (ICE 601) Fault Tolerance
Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability
More informationByzantine Fault Tolerance
Byzantine Fault Tolerance CS 240: Computing Systems and Concurrency Lecture 11 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. So far: Fail-stop failures
More informationInitial Assumptions. Modern Distributed Computing. Network Topology. Initial Input
Initial Assumptions Modern Distributed Computing Theory and Applications Ioannis Chatzigiannakis Sapienza University of Rome Lecture 4 Tuesday, March 6, 03 Exercises correspond to problems studied during
More informationFault Tolerance. Distributed Software Systems. Definitions
Fault Tolerance Distributed Software Systems Definitions Availability: probability the system operates correctly at any given moment Reliability: ability to run correctly for a long interval of time Safety:
More informationConsensus and related problems
Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?
More informationConcepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus
CIS 505: Software Systems Lecture Note on Consensus Insup Lee Department of Computer and Information Science University of Pennsylvania CIS 505, Spring 2007 Concepts Dependability o Availability ready
More informationDistributed Systems COMP 212. Lecture 19 Othon Michail
Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails
More informationDistributed Systems Consensus
Distributed Systems Consensus Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 1 / 56 What is the Problem?
More informationGlobal atomicity. Such distributed atomicity is called global atomicity A protocol designed to enforce global atomicity is called commit protocol
Global atomicity In distributed systems a set of processes may be taking part in executing a task Their actions may have to be atomic with respect to processes outside of the set example: in a distributed
More informationFailure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems
Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements
More informationByzantine Techniques
November 29, 2005 Reliability and Failure There can be no unity without agreement, and there can be no agreement without conciliation René Maowad Reliability and Failure There can be no unity without agreement,
More informationSecurity (and finale) Dan Ports, CSEP 552
Security (and finale) Dan Ports, CSEP 552 Today Security: what if parts of your distributed system are malicious? BFT: state machine replication Bitcoin: peer-to-peer currency Course wrap-up Security Too
More informationCprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University
Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable
More informationFault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University
Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed
More informationThe CAP theorem. The bad, the good and the ugly. Michael Pfeiffer Advanced Networking Technologies FG Telematik/Rechnernetze TU Ilmenau
The CAP theorem The bad, the good and the ugly Michael Pfeiffer Advanced Networking Technologies FG Telematik/Rechnernetze TU Ilmenau 2017-05-15 1 / 19 1 The bad: The CAP theorem s proof 2 The good: A
More informationDistributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance
Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter
More informationIntuitive distributed algorithms. with F#
Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype
More informationFailure Tolerance. Distributed Systems Santa Clara University
Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot
More informationRuminations on Domain-Based Reliable Broadcast
Ruminations on Domain-Based Reliable Broadcast Svend Frølund Fernando Pedone Hewlett-Packard Laboratories Palo Alto, CA 94304, USA Abstract A distributed system is no longer confined to a single administrative
More informationDistributed Systems 11. Consensus. Paul Krzyzanowski
Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one
More informationFailure models. Byzantine Fault Tolerance. What can go wrong? Paxos is fail-stop tolerant. BFT model. BFT replication 5/25/18
Failure models Byzantine Fault Tolerance Fail-stop: nodes either execute the protocol correctly or just stop Byzantine failures: nodes can behave in any arbitrary way Send illegal messages, try to trick
More informationAlternative Consensus
1 Alternative Consensus DEEP DIVE Alexandra Tran, Dev Ojha, Jeremiah Andrews, Steven Elleman, Ashvin Nihalani 2 TODAY S AGENDA GETTING STARTED 1 INTRO TO CONSENSUS AND BFT 2 NAKAMOTO CONSENSUS 3 BFT ALGORITHMS
More informationDistributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf
Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need
More informationPractical Byzantine Fault
Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault
More informationThe Long March of BFT. Weird Things Happen in Distributed Systems. A specter is haunting the system s community... A hierarchy of failure models
A specter is haunting the system s community... The Long March of BFT Lorenzo Alvisi UT Austin BFT Fail-stop A hierarchy of failure models Crash Weird Things Happen in Distributed Systems Send Omission
More information2. System Models Page 1. University of Freiburg, Germany Department of Computer Science. Distributed Systems. Chapter 2 System Models
2. System Models Page 1 University of Freiburg, Germany Department of Computer Science Distributed Systems Chapter 2 System Models Christian Schindelhauer 27. April 2012 2. System Models 2.1. Introduction
More informationChapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju
Chapter 5: Distributed Systems: Fault Tolerance Fall 2013 Jussi Kangasharju Chapter Outline n Fault tolerance n Process resilience n Reliable group communication n Distributed commit n Recovery 2 Basic
More informationChapter 8 Fault Tolerance
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what
More informationDistributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski
Distributed Systems 09. State Machine Replication & Virtual Synchrony Paul Krzyzanowski Rutgers University Fall 2016 1 State machine replication 2 State machine replication We want high scalability and
More informationCoordination 1. To do. Mutual exclusion Election algorithms Next time: Global state. q q q
Coordination 1 To do q q q Mutual exclusion Election algorithms Next time: Global state Coordination and agreement in US Congress 1798-2015 Process coordination How can processes coordinate their action?
More informationRecovering from a Crash. Three-Phase Commit
Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator
More informationByzantine Fault Tolerance
Byzantine Fault Tolerance CS6450: Distributed Systems Lecture 10 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University.
More informationArvind Krishnamurthy Fall Collection of individual computing devices/processes that can communicate with each other
Distributed Systems Arvind Krishnamurthy Fall 2003 Concurrent Systems Collection of individual computing devices/processes that can communicate with each other General definition encompasses a wide range
More informationDistributed Algorithms
Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms
More informationFault Tolerance 1/64
Fault Tolerance 1/64 Fault Tolerance Fault tolerance is the ability of a distributed system to provide its services even in the presence of faults. A distributed system should be able to recover automatically
More informationDistributed Deadlock
Distributed Deadlock 9.55 DS Deadlock Topics Prevention Too expensive in time and network traffic in a distributed system Avoidance Determining safe and unsafe states would require a huge number of messages
More informationTwo-Phase Atomic Commitment Protocol in Asynchronous Distributed Systems with Crash Failure
Two-Phase Atomic Commitment Protocol in Asynchronous Distributed Systems with Crash Failure Yong-Hwan Cho, Sung-Hoon Park and Seon-Hyong Lee School of Electrical and Computer Engineering, Chungbuk National
More informationPRIMARY-BACKUP REPLICATION
PRIMARY-BACKUP REPLICATION Primary Backup George Porter Nov 14, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons
More informationSystem Models 2. Lecture - System Models 2 1. Areas for Discussion. Introduction. Introduction. System Models. The Modelling Process - General
Areas for Discussion System Models 2 Joseph Spring School of Computer Science MCOM0083 - Distributed Systems and Security Lecture - System Models 2 1 Architectural Models Software Layers System Architecture
More informationConsensus in the Presence of Partial Synchrony
Consensus in the Presence of Partial Synchrony CYNTHIA DWORK AND NANCY LYNCH.Massachusetts Institute of Technology, Cambridge, Massachusetts AND LARRY STOCKMEYER IBM Almaden Research Center, San Jose,
More informationByzantine Consensus in Directed Graphs
Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory
More informationCS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.
Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message
More informationCS October 2017
Atomic Transactions Transaction An operation composed of a number of discrete steps. Distributed Systems 11. Distributed Commit Protocols All the steps must be completed for the transaction to be committed.
More informationMODELS OF DISTRIBUTED SYSTEMS
Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between
More informationTWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018
TWO-PHASE COMMIT George Porter May 9 and 11, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons license These slides
More informationDistributed Systems COMP 212. Revision 2 Othon Michail
Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise
More informationTransactions. CS 475, Spring 2018 Concurrent & Distributed Systems
Transactions CS 475, Spring 2018 Concurrent & Distributed Systems Review: Transactions boolean transfermoney(person from, Person to, float amount){ if(from.balance >= amount) { from.balance = from.balance
More informationLecture 3. Introduction to Cryptocurrencies
Lecture 3 Introduction to Cryptocurrencies Public Keys as Identities public key := an identity if you see sig such that verify(pk, msg, sig)=true, think of it as: pk says, [msg] to speak for pk, you must
More informationConsensus in Distributed Systems. Jeff Chase Duke University
Consensus in Distributed Systems Jeff Chase Duke University Consensus P 1 P 1 v 1 d 1 Unreliable multicast P 2 P 3 Consensus algorithm P 2 P 3 v 2 Step 1 Propose. v 3 d 2 Step 2 Decide. d 3 Generalizes
More informationFAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)
Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy
More informationBYZANTINE CONSENSUS THROUGH BITCOIN S PROOF- OF-WORK
Informatiemanagement: BYZANTINE CONSENSUS THROUGH BITCOIN S PROOF- OF-WORK The aim of this paper is to elucidate how Byzantine consensus is achieved through Bitcoin s novel proof-of-work system without
More informationRecall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos
Consensus I Recall our 2PC commit problem FLP Impossibility, Paxos Client C 1 C à TC: go! COS 418: Distributed Systems Lecture 7 Michael Freedman Bank A B 2 TC à A, B: prepare! 3 A, B à P: yes or no 4
More informationAgreement and Consensus. SWE 622, Spring 2017 Distributed Software Engineering
Agreement and Consensus SWE 622, Spring 2017 Distributed Software Engineering Today General agreement problems Fault tolerance limitations of 2PC 3PC Paxos + ZooKeeper 2 Midterm Recap 200 GMU SWE 622 Midterm
More informationMODELS OF DISTRIBUTED SYSTEMS
Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between
More informationDistributed Computing. CS439: Principles of Computer Systems November 19, 2018
Distributed Computing CS439: Principles of Computer Systems November 19, 2018 Bringing It All Together We ve been studying how an OS manages a single CPU system As part of that, it will communicate with
More informationThe Design and Implementation of a Fault-Tolerant Media Ingest System
The Design and Implementation of a Fault-Tolerant Media Ingest System Daniel Lundmark December 23, 2005 20 credits Umeå University Department of Computing Science SE-901 87 UMEÅ SWEDEN Abstract Instead
More informationPractical Byzantine Fault Tolerance
Practical Byzantine Fault Tolerance Robert Grimm New York University (Partially based on notes by Eric Brewer and David Mazières) The Three Questions What is the problem? What is new or different? What
More information