Introduction to Distributed Systems Seif Haridi
|
|
- Charlotte McKinney
- 6 years ago
- Views:
Transcription
1 Introduction to Distributed Systems Seif Haridi
2 What is a distributed system? A set of nodes, connected by a network, which appear to its users as a single coherent system p1 p2. pn send receive Network S. Haridi, KTHx ID2203x 2
3 Our focus in this course Concepts Models Given the model Which problems are solvable/ not solvable What are the core problems in distributed systems What are the algorithms How to reason about correctness S. Haridi, KTHx ID2203x 3
4 Why study distributed systems? It is important and useful Societal importance Internet WWW Cloud computing Edge computing Small devices (mobiles, sensors) S. Haridi, KTHx ID2203x 4
5 Why study distributed systems? Internet Edge Computing Cloud Computing S. Haridi, KTHx ID2203x 5
6 Why study distributed systems? p1 p4 p2 p10 p11 p12 p13 p3 p5 p6 p7 p8 p9 Internet Edge Computing Cloud Computing S. Haridi, KTHx ID2203x 6
7 Why study distributed systems? p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 Network S. Haridi, KTHx ID2203x 7
8 Why study distributed systems? It is important and useful Technical importance Improve scalability Improve reliability Inherent distribution S. Haridi, KTHx ID2203x 8
9 Why study distributed systems? It is very challenging Partial Failures Network (dropped messages, partitions) Node failures Concurrency Nodes execute in parallel Messages travel asynchronously Recurring core problems Parallel computing S. Haridi, KTHx ID2203x 9
10 Core Problems What types of problems are there?
11 Teaser: Two Generals Problem Two generals need to coordinate an attack Must agree on time to attack They ll win only if they attack simultaneously Communicate through messengers Messengers may be killed on their way S. Haridi, KTHx ID2203x 11
12 Teaser: Two Generals Problem Lets try to solve it for general g1 and g2 g1 sends time of attack to g2 Problem: how to ensure g2 received msg? Solution: let g2 ack receipt of msg Problem: how to ensure g1 received ack Solution: let g1 ack the receipt of the ack This problem is impossible to solve! S. Haridi, KTHx ID2203x 12
13 Teaser: Two Generals Problem 1 2 S. Haridi, KTHx ID2203x 13
14 Teaser: Two Generals Problem attack! 1 2 S. Haridi, KTHx ID2203x 14
15 Teaser: Two Generals Problem sure! please confirm yes.please confirm is 18:00 ok? 1 2 S. Haridi, KTHx ID2203x 15
16 Teaser: Two Generals Problem ok.confirmed? alright! confirm? sure! please confirm yes.please confirm is 18:00 ok? 1 2 S. Haridi, KTHx ID2203x 16
17 Teaser: Two Generals Problem ambushed! 1 2 Impossible to solve! S. Haridi, KTHx ID2203x 17
18 Teaser: Two Generals Problem Applicability to distributed systems Two nodes need to agree on a value before a specific time-bound Communicate by messages using an unreliable channel Agreement is a core problem S. Haridi, KTHx ID2203x 18
19 Consensus: agreeing on a number Consensus problem All nodes propose a value Some nodes might crash & stop responding The algorithm must ensure: All correct nodes eventually decide Every node decides the same Only decide on proposed values S. Haridi, KTHx ID2203x 19
20 Example: Agreeing on a Target 1 A B 2 3 attack A attack B attack B Consensus? Consensus? Consensus S. Haridi, KTHx ID2203x 20
21 Example: Agreeing on a Target 1 A B 2 3 A confirmed A confirmed Consensus A confirmed Consensus Consensus S. Haridi, KTHx ID2203x 21
22 Example: Agreeing on a Target attack! 1 A B 2 3 Consensus Consensus Consensus S. Haridi, KTHx ID2203x 22
23 Example: Agreeing on a Target 1 A B 2 3 attack A attack B attack B Consensus? Consensus? Consensus S. Haridi, KTHx ID2203x 23
24 Example: Agreeing on a Target 1 A B 2 3 ambush! Consensus? Consensus? Consensus S. Haridi, KTHx ID2203x 24
25 Example: Agreeing on a Target 1 A B 2 A confirmed A confirmed 3 Consensus Consensus Consensus S. Haridi, KTHx ID2203x 25
26 Example: Agreeing on a Target attack! 1 A B 2 3 Consensus Consensus Consensus S. Haridi, KTHx ID2203x 26
27 Is Consensus is Solvable? Consensus problem All nodes propose a value Some nodes might crash & stop responding The algorithm must ensure: All correct nodes eventually decide Every node decides the same Only decide on proposed values S. Haridi, KTHx ID2203x 27
28 Consensus is Important Databases Concurrent changes to same data Nodes should agree on changes Use a kind of consensus: atomic commit Only two proposal values {commit, abort} S. Haridi, KTHx ID2203x 28
29 Broadcast Problem Atomic Broadcast A node broadcasts a message If sender correct, all correct nodes deliver msg All correct nodes deliver same messages Messages delivered in the same order S. Haridi, KTHx ID2203x 29
30 Atomic broadcast is Important Replicated services Multiple servers (processes) Execute the same sequence of commands Replicated State Machines RSM Use atomic broadcast Provide fault tolerance S. Haridi, KTHx ID2203x 30
31 Atomic Broadcast Consensus Given Atomic broadcast Can use it to solve Consensus Every node broadcasts its proposal Decide on the first received proposal Messages received in same order All nodes will decide the same Given Consensus Can use it to solve Atomic broadcast [d] Atomic Broadcast equivalent to Consensus S. Haridi, KTHx ID2203x 31
32 Atomic Broadcast Consensus attack A attack B attack B Consensus Consensus Consensus attack A Atomic Broadcast attack B Atomic Broadcast?? attack B Atomic Broadcast S. Haridi, KTHx ID2203x 32
33 Atomic Broadcast Consensus attack A attack A attack A Consensus Consensus Consensus attack A attack A attack A Atomic Broadcast Atomic Broadcast Atomic Broadcast S. Haridi, KTHx ID2203x 33
34 Atomic Broadcast Consensus 1 attacking A 2 3 attacking A attacking A Consensus Consensus Consensus attack B attack B attack B Atomic Broadcast Atomic Broadcast Atomic Broadcast S. Haridi, KTHx ID2203x 34
35 Models How to reason about them?
36 Modeling a Distributed System Timing assumptions Processes bounds on time to make a computation step Network Bounds on time to transmit a message between a sender and a receiver Clocks: Lower and upper bounds on clock drift rate S. Haridi, KTHx ID
37 Modeling a Distributed System Failure assumptions Processes What kind of failure a process can exhibit? Crashes and stops Behaves arbitrary (Byzantine) Network Can a network channel drop messages? S. Haridi, KTHx ID
38 Modeling a Distributed System p1 p2 p3 Network S. Haridi, KTHx ID
39 Modeling a Distributed System p1 p2 p3 Network S. Haridi, KTHx ID
40 Modeling a Distributed System p1 p2 p3 Network S. Haridi, KTHx ID
41 Network Failures p1 p2 p3 Network S. Haridi, KTHx ID
42 Network Failures p1 p2 p3 Network dropped S. Haridi, KTHx ID
43 Process Failures p1 p2 p3 Network S. Haridi, KTHx ID
44 Process Failures p1 p2 p3 Network S. Haridi, KTHx ID
45 Process Failures p1 p3 p2 Network S. Haridi, KTHx ID
46 Process Failures p1 p3 p2 Network S. Haridi, KTHx ID
47 Byzantine Processes p1 p2 p3 Network S. Haridi, KTHx ID
48 Byzantine Processes p1 p2 p3 Network S. Haridi, KTHx ID
49 Byzantine Processes?? p1 p2 p3 Network S. Haridi, KTHx ID
50 Modeling a Distributed System Asynchronous system No bound on time to deliver a message No bound on time to compute Clocks are not synchronized Internet essentially asynchronous S. Haridi, KTHx ID
51 Impossibility of Consensus Consensus cannot be solved in asynchronous system If a single node may crash Implications on Atomic broadcast Atomic commit Leader election S. Haridi, KTHx ID
52 Modeling a Distributed System Synchronous system Known bound on time to deliver a message (latency) Known bound on time to compute Known lower and upper bounds in physical clock drift rate Examples: Embedded systems Multicore computers S. Haridi, KTHx ID
53 Possibility of Consensus Consensus solvable in synchronous system with up to N-1 crashes Intuition behind solution Accurate crash detection Every node sends a message to every other node If no msg from a node within bound, node has crashed Not useful for Internet, how to proceed? S. Haridi, KTHx ID
54 Modeling a Distributed System But Internet is mostly synchronous Bounds respected mostly Occasionally violate bounds (congestion/failures) How do we model this? Partially synchronous system Initially system is asynchronous Eventually the system becomes synchronous S. Haridi, KTHx ID
55 Possibility of Consensus Consensus solvable in partially synchronous system with up to N/2 crashes Useful for Internet? S. Haridi, KTHx ID
56 Failure detectors Let each node use a failure detector Detects crashes Implemented by heartbeats and waiting Might be initially wrong, but eventually correct Consensus and Atomic Broadcast solvable with failure detectors How? Attend rest of course! S. Haridi, KTHx ID
57 Modeling a Distributed System Timed Asynchronous system No bound on time to deliver a message No bound on time to compute Clocks have known clock-drift rate Realistic model Internet S. Haridi, KTHx ID
58 Conclusions
59 Topics not covered
60 Processes always crash? Other types of failures Not just crash stops Byzantine faults Self-stabilizing algorithms S. Haridi, KTHx ID2203x 60
61 Byzantine Faults Some processes might behave arbitrarily Sending wrong information Omit messages Byzantine algorithms that tolerate such faults Only tolerate up to 1/3 Byzantine processes Non-Byzantine algorithms can often tolerate ½ nodes in the asynchronous model S. Haridi, KTHx ID2203x 61
62 Self-stabilizing Algorithms Robust algorithms that run forever System might temporarily be incorrect But eventually always becomes correct System can either by in a legitimate state or an illegitimate state Self-stabilizing algorithm iff Convergence Given any illegitimate state, system eventually goes to a legitimate state Closure If system in a legitimate state, it remains in a legitimate state S. Haridi, KTHx ID2203x 62
63 Self-stabilizing Algorithms System can either by in a legitimate state or an illegitimate state Self-stabilizing algorithm iff Convergence Closure S. Haridi, KTHx ID2203x 63
64 Self-stabilizing Algorithms Advantages Robust to transient failures Don t need initialization Can be easily composed A service composed of two self-stabilizing services is self-stabilizing service S. Haridi, KTHx ID2203x 64
65 Self-stabilizing Example Token ring algorithm Wish to have one token at all times circulating among processes Self-Stabilization Error leads to 2,3, tokens Ensure always 1 token eventually S. Haridi, KTHx ID2203x 65
66 Content of the Course
67 Content I Formal Models of Asynchronous Systems Basic Abstractions Reliable Broadcast Algorithms Distributed Shared Store and Consistency Models S. Haridi, KTHx ID2203x 67
68 Content II Single Value Consensus Paxos algorithm Sequence Consensus Multi-Paxos Replicated State Machines (RSM) Dynamic Reconfiguration Physical Clocks Leader election (timed asynchronous model) More efficient RSM) Shared stores with Strong Consistency Relaxed consistency models CAP theorem S. Haridi, KTHx ID2203x 68
69 Summary Distributed systems everywhere Set of processes (nodes) cooperating over a network Few core problems reoccur Consensus, Broadcast, Leader election, Shared Memory Different failure scenarios important Crash stop, Byzantine, self-stabilizing algorithms Interesting research directions Large scale dynamic distributed systems S. Haridi, KTHx ID2203x 69
70 Let s start S. Haridi, KTHx ID2203x 70
Consensus in Distributed Systems. Jeff Chase Duke University
Consensus in Distributed Systems Jeff Chase Duke University Consensus P 1 P 1 v 1 d 1 Unreliable multicast P 2 P 3 Consensus algorithm P 2 P 3 v 2 Step 1 Propose. v 3 d 2 Step 2 Decide. d 3 Generalizes
More informationCoordination 1. To do. Mutual exclusion Election algorithms Next time: Global state. q q q
Coordination 1 To do q q q Mutual exclusion Election algorithms Next time: Global state Coordination and agreement in US Congress 1798-2015 Process coordination How can processes coordinate their action?
More informationCSE 486/586 Distributed Systems
CSE 486/586 Distributed Systems Failure Detectors Slides by: Steve Ko Computer Sciences and Engineering University at Buffalo Administrivia Programming Assignment 2 is out Please continue to monitor Piazza
More informationDistributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf
Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need
More informationC 1. Recap. CSE 486/586 Distributed Systems Failure Detectors. Today s Question. Two Different System Models. Why, What, and How.
Recap Best Practices Distributed Systems Failure Detectors Steve Ko Computer Sciences and Engineering University at Buffalo 2 Today s Question Two Different System Models How do we handle failures? Cannot
More informationC 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How
CSE 486/586 Distributed Systems Failure Detectors Today s Question I have a feeling that something went wrong Steve Ko Computer Sciences and Engineering University at Buffalo zzz You ll learn new terminologies,
More informationConsensus and related problems
Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?
More informationSystem Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics
System Models Nicola Dragoni Embedded Systems Engineering DTU Informatics 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models Architectural vs Fundamental Models Systems that are intended
More informationInitial Assumptions. Modern Distributed Computing. Network Topology. Initial Input
Initial Assumptions Modern Distributed Computing Theory and Applications Ioannis Chatzigiannakis Sapienza University of Rome Lecture 4 Tuesday, March 6, 03 Exercises correspond to problems studied during
More informationReplication in Distributed Systems
Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over
More informationExam 2 Review. Fall 2011
Exam 2 Review Fall 2011 Question 1 What is a drawback of the token ring election algorithm? Bad question! Token ring mutex vs. Ring election! Ring election: multiple concurrent elections message size grows
More informationData Consistency and Blockchain. Bei Chun Zhou (BlockChainZ)
Data Consistency and Blockchain Bei Chun Zhou (BlockChainZ) beichunz@cn.ibm.com 1 Data Consistency Point-in-time consistency Transaction consistency Application consistency 2 Strong Consistency ACID Atomicity.
More informationFault Tolerance. Distributed Software Systems. Definitions
Fault Tolerance Distributed Software Systems Definitions Availability: probability the system operates correctly at any given moment Reliability: ability to run correctly for a long interval of time Safety:
More informationFailures, Elections, and Raft
Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright
More informationTransactions. CS 475, Spring 2018 Concurrent & Distributed Systems
Transactions CS 475, Spring 2018 Concurrent & Distributed Systems Review: Transactions boolean transfermoney(person from, Person to, float amount){ if(from.balance >= amount) { from.balance = from.balance
More informationToday: Fault Tolerance
Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing
More informationTo do. Consensus and related problems. q Failure. q Raft
Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the
More informationThe Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer
The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer - proposes a formal definition for the timed asynchronous distributed system model - presents measurements of process
More informationConsensus Problem. Pradipta De
Consensus Problem Slides are based on the book chapter from Distributed Computing: Principles, Paradigms and Algorithms (Chapter 14) by Kshemkalyani and Singhal Pradipta De pradipta.de@sunykorea.ac.kr
More informationProseminar Distributed Systems Summer Semester Paxos algorithm. Stefan Resmerita
Proseminar Distributed Systems Summer Semester 2016 Paxos algorithm stefan.resmerita@cs.uni-salzburg.at The Paxos algorithm Family of protocols for reaching consensus among distributed agents Agents may
More informationDistributed Algorithms Benoît Garbinato
Distributed Algorithms Benoît Garbinato 1 Distributed systems networks distributed As long as there were no machines, programming was no problem networks distributed at all; when we had a few weak computers,
More informationLast time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson
Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to
More informationDistributed Systems (ICE 601) Fault Tolerance
Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability
More informationPractical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov
Practical Byzantine Fault Tolerance Miguel Castro and Barbara Liskov Outline 1. Introduction to Byzantine Fault Tolerance Problem 2. PBFT Algorithm a. Models and overview b. Three-phase protocol c. View-change
More informationNetwork Protocols. Sarah Diesburg Operating Systems CS 3430
Network Protocols Sarah Diesburg Operating Systems CS 3430 Protocol An agreement between two parties as to how information is to be transmitted A network protocol abstracts packets into messages Physical
More informationConcepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus
CIS 505: Software Systems Lecture Note on Consensus Insup Lee Department of Computer and Information Science University of Pennsylvania CIS 505, Spring 2007 Concepts Dependability o Availability ready
More informationDistributed Computing. CS439: Principles of Computer Systems November 19, 2018
Distributed Computing CS439: Principles of Computer Systems November 19, 2018 Bringing It All Together We ve been studying how an OS manages a single CPU system As part of that, it will communicate with
More informationIntuitive distributed algorithms. with F#
Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype
More informationDistributed Algorithms Failure detection and Consensus. Ludovic Henrio CNRS - projet SCALE
Distributed Algorithms Failure detection and Consensus Ludovic Henrio CNRS - projet SCALE ludovic.henrio@cnrs.fr Acknowledgement The slides for this lecture are based on ideas and materials from the following
More informationEECS 498 Introduction to Distributed Systems
EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha Replicated State Machines Logical clocks Primary/ Backup Paxos? 0 1 (N-1)/2 No. of tolerable failures October 11, 2017 EECS 498
More informationDistributed Systems Consensus
Distributed Systems Consensus Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 1 / 56 What is the Problem?
More informationVerteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms
Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms Holger Karl Computer Networks Group Universität Paderborn Goal of this chapter Apart from issues in distributed time and resulting
More informationDistributed Systems. Day 9: Replication [Part 1]
Distributed Systems Day 9: Replication [Part 1] Hash table k 0 v 0 k 1 v 1 k 2 v 2 k 3 v 3 ll Facebook Data Does your client know about all of F s servers? Security issues? Performance issues? How do clients
More informationConsensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks.
Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}
More informationToday: Fault Tolerance. Fault Tolerance
Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing
More informationBasic vs. Reliable Multicast
Basic vs. Reliable Multicast Basic multicast does not consider process crashes. Reliable multicast does. So far, we considered the basic versions of ordered multicasts. What about the reliable versions?
More informationRecall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers
Replicated s, RAFT COS 8: Distributed Systems Lecture 8 Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable service Goal #: Servers should behave just like
More informationAgreement and Consensus. SWE 622, Spring 2017 Distributed Software Engineering
Agreement and Consensus SWE 622, Spring 2017 Distributed Software Engineering Today General agreement problems Fault tolerance limitations of 2PC 3PC Paxos + ZooKeeper 2 Midterm Recap 200 GMU SWE 622 Midterm
More informationAssignment 12: Commit Protocols and Replication Solution
Data Modelling and Databases Exercise dates: May 24 / May 25, 2018 Ce Zhang, Gustavo Alonso Last update: June 04, 2018 Spring Semester 2018 Head TA: Ingo Müller Assignment 12: Commit Protocols and Replication
More informationProcess groups and message ordering
Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave
More informationModule 8 - Fault Tolerance
Module 8 - Fault Tolerance Dependability Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced
More informationDistributed Computing. CS439: Principles of Computer Systems November 20, 2017
Distributed Computing CS439: Principles of Computer Systems November 20, 2017 Last Time Network Programming: Sockets End point of communication Identified by (IP address : port number) pair Client-Side
More informationDistributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi
1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.
More informationDistributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs
1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds
More informationDistributed Systems. coordination Johan Montelius ID2201. Distributed Systems ID2201
Distributed Systems ID2201 coordination Johan Montelius 1 Coordination Coordinating several threads in one node is a problem, coordination in a network is of course worse: failure of nodes and networks
More informationExercise 12: Commit Protocols and Replication
Data Modelling and Databases (DMDB) ETH Zurich Spring Semester 2017 Systems Group Lecturer(s): Gustavo Alonso, Ce Zhang Date: May 22, 2017 Assistant(s): Claude Barthels, Eleftherios Sidirourgos, Eliza
More informationConsensus, impossibility results and Paxos. Ken Birman
Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}
More informationDistributed Systems 11. Consensus. Paul Krzyzanowski
Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one
More informationGlobal atomicity. Such distributed atomicity is called global atomicity A protocol designed to enforce global atomicity is called commit protocol
Global atomicity In distributed systems a set of processes may be taking part in executing a task Their actions may have to be atomic with respect to processes outside of the set example: in a distributed
More informationCS October 2017
Atomic Transactions Transaction An operation composed of a number of discrete steps. Distributed Systems 11. Distributed Commit Protocols All the steps must be completed for the transaction to be committed.
More informationRecall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos
Consensus I Recall our 2PC commit problem FLP Impossibility, Paxos Client C 1 C à TC: go! COS 418: Distributed Systems Lecture 7 Michael Freedman Bank A B 2 TC à A, B: prepare! 3 A, B à P: yes or no 4
More informationByzantine Failures. Nikola Knezevic. knl
Byzantine Failures Nikola Knezevic knl Different Types of Failures Crash / Fail-stop Send Omissions Receive Omissions General Omission Arbitrary failures, authenticated messages Arbitrary failures Arbitrary
More informationDistributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201
Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service
More informationParallel Data Types of Parallelism Replication (Multiple copies of the same data) Better throughput for read-only computations Data safety Partitionin
Parallel Data Types of Parallelism Replication (Multiple copies of the same data) Better throughput for read-only computations Data safety Partitioning (Different data at different sites More space Better
More informationPaxos and Raft (Lecture 21, cs262a) Ion Stoica, UC Berkeley November 7, 2016
Paxos and Raft (Lecture 21, cs262a) Ion Stoica, UC Berkeley November 7, 2016 Bezos mandate for service-oriented-architecture (~2002) 1. All teams will henceforth expose their data and functionality through
More information02 - Distributed Systems
02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is
More information02 - Distributed Systems
02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is
More informationFault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University
Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed
More informationConsensus on Transaction Commit
Consensus on Transaction Commit Jim Gray and Leslie Lamport Microsoft Research 1 January 2004 revised 19 April 2004, 8 September 2005, 5 July 2017 MSR-TR-2003-96 This paper appeared in ACM Transactions
More informationCausal Consistency and Two-Phase Commit
Causal Consistency and Two-Phase Commit CS 240: Computing Systems and Concurrency Lecture 16 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Consistency
More informationTransactions Between Distributed Ledgers
Transactions Between Distributed Ledgers Ivan Klianev Transactum Pty Ltd High Performance Transaction Systems Asilomar, California, 8-11 October 2017 The Time for Distributed Transactions Has Come Thanks
More informationDistributed Synchronization. EECS 591 Farnam Jahanian University of Michigan
Distributed Synchronization EECS 591 Farnam Jahanian University of Michigan Reading List Tanenbaum Chapter 5.1, 5.4 and 5.5 Clock Synchronization Distributed Election Mutual Exclusion Clock Synchronization
More informationFault Tolerance via the State Machine Replication Approach. Favian Contreras
Fault Tolerance via the State Machine Replication Approach Favian Contreras Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Written by Fred Schneider Why a Tutorial? The
More informationLarge-Scale Key-Value Stores Eventual Consistency Marco Serafini
Large-Scale Key-Value Stores Eventual Consistency Marco Serafini COMPSCI 590S Lecture 13 Goals of Key-Value Stores Export simple API put(key, value) get(key) Simpler and faster than a DBMS Less complexity,
More informationMENCIUS: BUILDING EFFICIENT
MENCIUS: BUILDING EFFICIENT STATE MACHINE FOR WANS By: Yanhua Mao Flavio P. Junqueira Keith Marzullo Fabian Fuxa, Chun-Yu Hsiung November 14, 2018 AGENDA 1. Motivation 2. Breakthrough 3. Rules of Mencius
More informationLecture XII: Replication
Lecture XII: Replication CMPT 401 Summer 2007 Dr. Alexandra Fedorova Replication 2 Why Replicate? (I) Fault-tolerance / High availability As long as one replica is up, the service is available Assume each
More informationArvind Krishnamurthy Fall Collection of individual computing devices/processes that can communicate with each other
Distributed Systems Arvind Krishnamurthy Fall 2003 Concurrent Systems Collection of individual computing devices/processes that can communicate with each other General definition encompasses a wide range
More informationThe Design and Implementation of a Fault-Tolerant Media Ingest System
The Design and Implementation of a Fault-Tolerant Media Ingest System Daniel Lundmark December 23, 2005 20 credits Umeå University Department of Computing Science SE-901 87 UMEÅ SWEDEN Abstract Instead
More informationTwo-Phase Atomic Commitment Protocol in Asynchronous Distributed Systems with Crash Failure
Two-Phase Atomic Commitment Protocol in Asynchronous Distributed Systems with Crash Failure Yong-Hwan Cho, Sung-Hoon Park and Seon-Hyong Lee School of Electrical and Computer Engineering, Chungbuk National
More informationReplications and Consensus
CPSC 426/526 Replications and Consensus Ennan Zhai Computer Science Department Yale University Recall: Lec-8 and 9 In the lec-8 and 9, we learned: - Cloud storage and data processing - File system: Google
More informationThere Is More Consensus in Egalitarian Parliaments
There Is More Consensus in Egalitarian Parliaments Iulian Moraru, David Andersen, Michael Kaminsky Carnegie Mellon University Intel Labs Fault tolerance Redundancy State Machine Replication 3 State Machine
More informationCSE 5306 Distributed Systems. Synchronization
CSE 5306 Distributed Systems Synchronization 1 Synchronization An important issue in distributed system is how processes cooperate and synchronize with one another Cooperation is partially supported by
More informationFault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components
Fault Tolerance To avoid disruption due to failure and to improve availability, systems are designed to be fault-tolerant Two broad categories of fault-tolerant systems are: systems that mask failure it
More informationAssignment 12: Commit Protocols and Replication
Data Modelling and Databases Exercise dates: May 24 / May 25, 2018 Ce Zhang, Gustavo Alonso Last update: June 04, 2018 Spring Semester 2018 Head TA: Ingo Müller Assignment 12: Commit Protocols and Replication
More informationDependable Computer Systems
Dependable Computer Systems Part 6b: System Aspects Contents Synchronous vs. Asynchronous Systems Consensus Fault-tolerance by self-stabilization Examples Time-Triggered Ethernet (FT Clock Synchronization)
More informationConsistency. CS 475, Spring 2018 Concurrent & Distributed Systems
Consistency CS 475, Spring 2018 Concurrent & Distributed Systems Review: 2PC, Timeouts when Coordinator crashes What if the bank doesn t hear back from coordinator? If bank voted no, it s OK to abort If
More informationFailure Tolerance. Distributed Systems Santa Clara University
Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot
More informationSynchronization. Chapter 5
Synchronization Chapter 5 Clock Synchronization In a centralized system time is unambiguous. (each computer has its own clock) In a distributed system achieving agreement on time is not trivial. (it is
More informationDistributed algorithms
Distributed algorithms Prof R. Guerraoui lpdwww.epfl.ch Exam: Written Reference: Book - Springer Verlag http://lpd.epfl.ch/site/education/da - Introduction to Reliable (and Secure) Distributed Programming
More informationSystem Models for Distributed Systems
System Models for Distributed Systems INF5040/9040 Autumn 2015 Lecturer: Amir Taherkordi (ifi/uio) August 31, 2015 Outline 1. Introduction 2. Physical Models 4. Fundamental Models 2 INF5040 1 System Models
More informationFault Tolerance. Goals: transparent: mask (i.e., completely recover from) all failures, or predictable: exhibit a well defined failure behavior
Fault Tolerance Causes of failure: process failure machine failure network failure Goals: transparent: mask (i.e., completely recover from) all failures, or predictable: exhibit a well defined failure
More informationAgreement in Distributed Systems CS 188 Distributed Systems February 19, 2015
Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Page 1 Introduction We frequently want to get a set of nodes in a distributed system to agree Commitment protocols and mutual
More informationSynchronization. Clock Synchronization
Synchronization Clock Synchronization Logical clocks Global state Election algorithms Mutual exclusion Distributed transactions 1 Clock Synchronization Time is counted based on tick Time judged by query
More informationPractical Byzantine Fault Tolerance. Castro and Liskov SOSP 99
Practical Byzantine Fault Tolerance Castro and Liskov SOSP 99 Why this paper? Kind of incredible that it s even possible Let alone a practical NFS implementation with it So far we ve only considered fail-stop
More informationATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases
ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases We talked about transactions and how to implement them in a single-node database. We ll now start looking into how to
More informationDistributed Systems Leader election & Failure detection
Distributed Systems Leader election & Failure detection He Sun School of Informatics University of Edinburgh Leader of a computation Many distributed computations need a coordinator of server processors
More informationDistributed Systems Fault Tolerance
Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable
More informationCS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.
Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message
More informationA Distributed and Robust SDN Control Plane for Transactional Network Updates
A Distributed and Robust SDN Control Plane for Transactional Network Updates Marco Canini (UCL) with Petr Kuznetsov (Télécom ParisTech), Dan Levin (TU Berlin), Stefan Schmid (TU Berlin & T-Labs) 1 Network
More informationCSE 486/586 Distributed Systems
CSE 486/586 Distributed Systems Mutual Exclusion Steve Ko Computer Sciences and Engineering University at Buffalo CSE 486/586 Recap: Consensus On a synchronous system There s an algorithm that works. On
More informationDistributed Systems. Before We Begin. Advantages. What is a Distributed System? CSE 120: Principles of Operating Systems. Lecture 13.
CSE 120: Principles of Operating Systems Lecture 13 Distributed Systems December 2, 2003 Before We Begin Read Chapters 15, 17 (on Distributed Systems topics) Prof. Joe Pasquale Department of Computer Science
More informationTAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton
TAPIR By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton Outline Problem Space Inconsistent Replication TAPIR Evaluation Conclusion Problem
More informationRecap. CSE 486/586 Distributed Systems Paxos. Paxos. Brief History. Brief History. Brief History C 1
Recap Distributed Systems Steve Ko Computer Sciences and Engineering University at Buffalo Facebook photo storage CDN (hot), Haystack (warm), & f4 (very warm) Haystack RAID-6, per stripe: 10 data disks,
More informationExtend PB for high availability. PB high availability via 2PC. Recall: Primary-Backup. Putting it all together for SMR:
Putting it all together for SMR: Two-Phase Commit, Leader Election RAFT COS 8: Distributed Systems Lecture Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable
More informationDistributed Systems. Fault Tolerance. Paul Krzyzanowski
Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected
More informationFault Tolerance. Distributed Systems. September 2002
Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend
More informationCSE 380 Computer Operating Systems
CSE 380 Computer Operating Systems Instructor: Insup Lee University of Pennsylvania Fall 2003 Lecture Note: Distributed Systems 1 Introduction to Distributed Systems Why do we develop distributed systems?
More informationViewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De
Viewstamped Replication to Practical Byzantine Fault Tolerance Pradipta De pradipta.de@sunykorea.ac.kr ViewStamped Replication: Basics What does VR solve? VR supports replicated service Abstraction is
More informationDistributed Systems 24. Fault Tolerance
Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network
More informationAsynchronous Reconfiguration for Paxos State Machines
Asynchronous Reconfiguration for Paxos State Machines Leander Jehl and Hein Meling Department of Electrical Engineering and Computer Science University of Stavanger, Norway Abstract. This paper addresses
More information