Extending Blockchains in Computing - Transaction semantics for web services. Subhash Bhalla (Dept. of Comp. Sc., IIT Delhi)
Transcription
1 Extending Blockchains in Computing - Transaction semantics for web services Subhash Bhalla (Dept. of Comp. Sc., IIT Delhi)
2 Slashdot Items Tagged "blockchain"
- Thursday September 06: Blockchains Are Not Safe For Voting, Concludes NAP Report
- Friday August 24: China Shuts Down Blockchain News Accounts on WeChat App, Bans Hotels in Beijing From Hosting Cryptocurrency Events
- Friday August 10: The World Bank is Preparing For the World's First Blockchain Bond
- Thursday August 09: Colorado Candidate For Governor Wants To Put His State On the Blockchain
- Thursday August 09: Blockchain Hype May Have Peaked, But IBM is Still a Believer
3 Blockchain
- Philosophy: books, HBR, Sloan Management Review; law, trust
- Technology: agriculture, land records, legal systems, healthcare (Standardized Electronic Health Records)
- Computing
4 Double-Entry Bookkeeping: Ledger (append-only log)
5 No cutting; compensation is OK
- Horizontal total: hash number 1 (control)
- Vertical total: hash number 2 (control)
- Linked list (serial number of transactions)
- Control hash numbers change with each new transaction
6 Linked List Model: tamper-proof; compensation is OK; transactions over one book (ledger)
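The linked-list ledger with control hashes described on slides 5 and 6 can be illustrated with a minimal Python sketch. The names (`entry_hash`, `append`, `verify`, `GENESIS`), the use of SHA-256 and the JSON encoding are all assumptions made for illustration; the slides do not specify a hash function or a record layout.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(serial, payload, prev_hash):
    """Control hash over the serial number, the payload and the previous hash."""
    body = json.dumps({"serial": serial, "payload": payload, "prev": prev_hash},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(ledger, payload):
    """Append-only: each new entry links back to the hash of the last entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    serial = len(ledger) + 1
    ledger.append({"serial": serial, "payload": payload, "prev": prev_hash,
                   "hash": entry_hash(serial, payload, prev_hash)})


def verify(ledger):
    """Recompute every control hash; an edited entry breaks the chain."""
    prev_hash = GENESIS
    for i, entry in enumerate(ledger, start=1):
        if entry["serial"] != i or entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["serial"], entry["payload"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append(ledger, {"credit": 100})
append(ledger, {"debit": 30})
# A mistake is never overwritten ("no cutting"); it is compensated by a new entry.
append(ledger, {"credit": 30, "note": "compensates entry 2"})
```

In this sketch, editing any earlier entry in place makes `verify` return `False`, while a compensating entry keeps the chain valid.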
9 Multi-organization Trust: legal document registry
- Control hash: fixed size for all cases
- Legal document(s): tampering changes the hash
- X N sites
10 Computing Systems: Web Developments
Skype, MedlinePlus, MedlinePlus Encyclopedia, Google Docs, Google Maps API, Gmail, Google Search, LMMP (NASA), PTF (Caltech), Facebook, Twitter, LinkedIn, Air India, Ashoka University, Amazon, any e-auction web site, Postgres web site, Wikipedia, eBay, Yahoo auction, ...
Classify the above: sites, applications, cloud-based
11 Palomar Transient Factory
- Time domain astronomy, since 2009
- Sectors in the northern sky watched all night
- Real-time processing
- Machine learning, data mining
- Archive (growing in time)
12 Categories of Web Applications
13 Computing as enterprise: Change; Computing; W3C; Application; New Specifications
14 Development of Web Applications Specifications (25 Years)
- Front-end: form, client-server, XML, CSS, JavaScript, Web Services, HTML 5 (transmit GIS coordinates of clients), tracking tools/systems
- Back-end: Map-reduce, data centers, Hadoop, ...
15 Computing as enterprise (before years): Change; Computing; ISO, ANSI; Application; New Specifications
Prior to 1993: database systems; distributed systems on ETHERNET (web? internet?); banking, stock exchange, airlines, railways, ...
16 Interactions through the Web
- A number of processes (N) on ETHERNET
- Communicate through messages to cooperate
- Interact with the outside world (web)
17 ETHERNET: handshake, acknowledge, time limit on response, network status, fast vs-
18 In-Transit Messages
A message that has been sent but not yet received is called an in-transit message. Do rollback recovery protocols have to guarantee the delivery of in-transit messages? It depends on whether reliable communication is assumed!
19 LAN vs. Web/Internet
LAN: wired; replication (loaded standby); synchronous; failure is detected by timeout
20 Distributed Systems: Computing as an Enterprise
Network Eccentric and Mobile Applications
1. Middleware
2. Networks: mobile ad hoc networks (MANETs)
21 Network Eccentric and Mobile Applications (Middleware)
Mobile ad hoc networks (MANETs); energy in sensor networks; programming wireless sensor networks; ad hoc routing; 5G; software-defined networks; communication models; population protocols; routing in opportunistic networks; wireless mesh networks; gossip-based dissemination; application-layer multicast; distributed event routing in publish/subscribe systems; tuple space middleware for wireless networks; security middleware; dynamic adaptation; ... Blockchain
22 Blockchain Networks
1. "Toyota to Bring Blockchain Networks to Smart Cars", IEEE Spectrum, May 2017, by Philip E. Ross. It could make car-to-cloud communication easier and more secure, if your car wants to talk to another car or a service provider network.
2. "Blockchain Consensus", Tyler Crain, Vincent Gramoli, Michel Raynal, Mikel Larrea. Proceedings of AlgoTel
23 Blockchain in use
1. Bay Area in California: rent a Toyota car. Download an app on a smartphone; locate the nearest car on the map; walk to the car; select pay; the car door unlocks.
2. Driverless car / remote guidance system for spacecraft
24 Blockchain
25 Blockchain / Distributed Ledger Technology
Distributed systems applications:
1. Extremely large amount of data
2. Extremely critical data
3. Real-time data streaming
4. Consensus among nodes in an asynchronous system
Blockchain is an enabling technique: immutability, asynchrony, consensus
26 Distributed Systems
- Immutability: log-structured files / append-only logs. Example: database nodes
- Asynchrony: achieve a common history at nodes through a stamp server
- Consensus: in the absence of a globally synchronous clock, there is a need for global consensus
27 Database: change of state; backup
28 Database Systems + Distributed Systems on Ethernet
Banking: 100s of ATMs, audit system, accounting report
1. Backup at time (T0)
2. Database
3. Append-only log (journal) of activity since last backup (T0)
Recovery after failure: combine 1 and 3 to rebuild 2.
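The recovery rule on slide 28 (backup plus journal gives back the database) can be sketched as follows. The account names, the `(account, delta)` journal format and the function name `recover` are hypothetical; a real journal would carry richer records.

```python
import copy


def recover(backup, log):
    """Rebuild the database state by replaying the journal over the backup."""
    db = copy.deepcopy(backup)          # 1. start from the backup taken at T0
    for account, delta in log:          # 3. replay activity since T0, in order
        db[account] = db.get(account, 0) + delta
    return db                           # 2. the reconstructed database


backup_t0 = {"alice": 500, "bob": 200}
journal = [("alice", -100), ("bob", +100), ("carol", +50)]
restored = recover(backup_t0, journal)
```

The backup itself is left untouched, so the same replay can be repeated if recovery is interrupted.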
29 Computing Systems - Protocols
- Fail-stop model: communication channel uses parity
- Byzantine fault tolerance: atomic commit; Brooks-Iyengar algorithm; list of mathematical concepts named after places; list of terms relating to algorithms and data structures; Byzantine Paxos; quantum Byzantine agreement
- Two Armies Problem: impossible to win on the web?
- Blockchain
30 Fail-Safe Model
Communication channel: the parity bit changes. Fail-prone vs. fail-stop.
Example: 3 bits are 1s, so the count is odd and the parity bit is 1.
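The parity check on slide 30 can be sketched in a few lines of Python. This assumes even parity (the parity bit makes the total count of 1s even), which matches the slide's example of three 1s giving a parity bit of 1; the function names are illustrative only.

```python
def even_parity_bit(bits):
    """Return the bit that makes the total number of 1s even."""
    return sum(bits) % 2


def send(bits):
    """Append the parity bit to the data bits before transmission."""
    return bits + [even_parity_bit(bits)]


def check(frame):
    """True if the received frame still has an even number of 1s."""
    return sum(frame) % 2 == 0


frame = send([1, 0, 1, 1])   # three 1s, so the parity bit is 1
corrupted = frame.copy()
corrupted[0] ^= 1            # a single bit flips in the channel
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, which is one reason it only supports a simple fail-stop style of detection.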
31 Byzantine Generals Problem
32 Workflow processing
1. eBay cart: HP notebook + EPSON color printer + SONY camera + customer bank + eBay bank
5 processes in a two-phase commit, with resources blocked: (Atomic) (Consistent) (Isolation) (Durability)
On top of web service connections (no Ethernet)
33 Transaction Atomicity
34 Atomicity
35 2 Phase Commit
36 Participants make note in Log
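A minimal, single-process sketch of two-phase commit, using the five participants from the eBay cart example on slide 32. The class and method names are hypothetical, the in-memory `log` list only stands in for a stable-storage log, and coordinator or network failures are deliberately not modeled.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []            # stands in for a stable-storage log

    def prepare(self):
        vote = "READY" if self.can_commit else "ABORT"
        self.log.append(vote)    # note the vote in the log before replying
        return vote

    def finish(self, decision):
        self.log.append(decision)


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted READY, otherwise abort everywhere.
    decision = "COMMIT" if all(v == "READY" for v in votes) else "ABORT"
    for p in participants:
        p.finish(decision)
    return decision


names = ["notebook", "printer", "camera", "customer_bank", "ebay_bank"]
all_ready = [Participant(n) for n in names]
outcome_ok = two_phase_commit(all_ready)

one_refuses = [Participant(n, can_commit=(n != "camera")) for n in names]
outcome_abort = two_phase_commit(one_refuses)
```

Between the two phases every READY participant must hold its resources, which is the blocking behavior the slides contrast with 3PC.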
37 3 Phase Commit: 3PC is non-blocking (in case of coordinator (C) or participant (P) failure)
41 Application Systems on Web Services
42 Complexity in Distributed Systems: multiple nodes, messages
43 Problems: Long Running Process
- Blue Gene (1999): parallel computer for the study of bio-molecular phenomena such as protein folding
- P1: process failure; Checkpoint 1
- Checkpoints (1, ..., n) on STABLE STORE: data, threads, register values
- Run-time overhead; on failure, restart from the most recent checkpoint
- 64 x 64 grid of parallel computers; middleware for checkpoints
44 Cooperating Processes: Distributed System
[Figure: processes P0-P3 exchange messages m0-m7 and take checkpoints C i,j; one process is marked Crashed]
45 Middleware: Distributed System
- Distributed system: a collection of processes that communicate through messages in a network.
- Fault tolerance: processes periodically save their states to stable storage during failure-free execution. After a failure, a failed process restarts from one of its saved states, reducing the amount of lost computation.
- Each of the saved states is called a checkpoint.
46 Checkpoint Cascading Rollback Problem
- Last checkpoint: C 1,1 by P1, before P1 crashed.
- Cannot use C 0,1 at P0 because it is inconsistent with C 1,1 => P0 rolls back to C 0,0.
- Cannot use C 2,1 at P2 because it fails to reflect the sending of m6 => P2 rolls back to C 2,0.
- Cannot use C 3,1 and C 3,0 as a result => P3 rolls back to the initial state.
[Figure: processes P0-P3, messages m0-m7, checkpoints C 0,0 to C 3,1; P1 marked Crashed]
47 Uncoordinated Checkpointing
Uncoordinated checkpoints: full autonomy, and simple.
Problems:
- Most checkpoints are not useful
- Cascading rollback to the initial state (domino effect)
- To select a set of consistent checkpoints during a recovery, the dependency of checkpoints has to be determined and recorded together with each checkpoint
- Extra overhead and complexity => not simple after all
48 Disadvantages of Uncoordinated Checkpointing
- Susceptible to the domino effect
- Checkpoints that will never be part of a global consistent state are recorded: they add stable storage overhead and do not advance the recovery line
- A process needs to maintain multiple checkpoints and to use a garbage collector to reclaim checkpoints
- Not suitable for output commit, because output commit requires global coordination to compute the recovery line
49 Coordinated Blocking (LAN-based solution)
Processes are coordinated to form a consistent global state.
[Figure: the initiator and processes p1-p3 exchange "Ready!", "Go!" and "okay, channels flushed"]
Next: Coordinated Blocking Chkpnt (cont.)
50 Coordinated Blocking (cont.)
- Advantages: always consistent; no domino effect; less storage overhead
- Disadvantage: large latency to checkpoint!
Next: Coordinated Non-blocking Chkpnt
51 Individual Log Based Protocols
- Work might be lost upon recovery using checkpoint-based protocols
- By logging messages, we may be able to recover the system to where it was prior to the failure
- System model: the execution of a process is modeled as a set of consecutive state intervals
- Each interval is initiated by a nondeterministic event or by the initial state
- We assume the only type of nondeterministic event is the receiving of a message
[Figure: timeline of process P i with messages m0-m5 and its 1st, 2nd and 3rd state intervals]
52 Log Based Protocols
In practice, logging is always used together with checkpointing:
- Limits the recovery time: start with the latest checkpoint instead of from the initial state
- Limits the size of the log: after taking a checkpoint, previously logged events can be purged
Logging protocol types:
- Pessimistic logging: messages are logged prior to execution
- Optimistic logging: messages are logged asynchronously
- Causal logging: nondeterministic events that are not yet logged (to stable storage) are piggybacked on each message sent
For optimistic and causal logging, the dependency of processes has to be tracked => more complexity, longer recovery time
53 Pessimistic Logging
- Synchronously log every incoming message to stable storage prior to execution
- Each process periodically checkpoints its state: no need for coordination
- Recovery: a process restores its state using the last checkpoint and replays all logged incoming messages
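The pessimistic logging rules on slide 53 can be sketched as below. This is a toy model under stated assumptions: the process state is a single integer, "executing" a message means adding it to the state, and the `stable_log` list stands in for stable storage that survives a crash. All names are hypothetical.

```python
class Process:
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.stable_log = []     # stands in for stable storage

    def deliver(self, msg):
        self.stable_log.append(msg)   # log synchronously, before execution
        self.state += msg             # "execute" the message

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.stable_log.clear()       # earlier messages are no longer needed

    def crash_and_recover(self):
        self.state = self.checkpoint  # restore the last checkpoint
        for msg in self.stable_log:   # replay logged messages in order
            self.state += msg


p = Process()
p.deliver(5)
p.take_checkpoint()
p.deliver(7)
p.deliver(1)
state_before_crash = p.state
p.state = None                        # volatile state is lost in the crash
p.crash_and_recover()
```

Because every message is logged before it is executed, the replay reaches exactly the pre-crash state and no other process has to roll back.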
54 Lamport's logical clock: happened-before relation
- a -> b: event a occurred before event b, for events in the same process p1.
- b -> c: b is the event of sending a message m1 in a process p1 and c is the event of receipt of the same message m1 by another process p2.
- If a -> b and b -> c, then a -> c; -> is transitive.
55 Lamport's logical clock
- Causally ordered events: a -> b means event a causally affects event b
- Concurrent events: a || e if neither a -> e nor e -> a
56 Lamport's logical clock: algorithm
Sending end:
    time = time + 1;
    time_stamp = time;
    send(message, time_stamp);
Receiving end:
    (message, time_stamp) = receive();
    time = max(time_stamp, time) + 1;
57 Lamport's logical clock
- If a -> b, then C(a) < C(b).
- If b -> c, then C(b) and C(c) must be assigned in such a way that C(b) < C(c).
- The clock time, C, must always go forward (increasing), never backward (decreasing). Corrections to time can be made by adding a positive value, never by subtracting one.
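The send and receive rules of slide 56 translate directly into a small Python class. The class name and the in-process "message passing" (timestamps handed over as plain integers) are assumptions for illustration; no real network is involved.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        """An internal event advances the local clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending end: increment, then stamp the outgoing message."""
        self.time += 1
        return self.time

    def receive(self, time_stamp):
        """Receiving end: jump past both the local clock and the message stamp."""
        self.time = max(time_stamp, self.time) + 1
        return self.time


p1, p2 = LamportClock(), LamportClock()
for _ in range(5):
    p2.local_event()               # p2 runs ahead: its clock is now 5
ts = p1.send()                     # p1 stamps its message with 1
p2_after = p2.receive(ts)          # max(1, 5) + 1 = 6
ts_back = p2.send()                # 7
p1_after = p1.receive(ts_back)     # max(7, 1) + 1 = 8
```

The slower clock at `p1` is corrected forward on receipt, never backward, so the send of each message is always stamped lower than its receipt.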
58 Lamport's logical clock
An illustration: three processes, each with its own clock. The clocks run at different rates and Lamport's algorithm corrects the clocks.
59 Lamport's logical clock: limitations
- m1 -> m3 gives C(m1) < C(m3)
- m2 -> m3 gives C(m2) < C(m3)
- Did m1 or m2 cause m3 to be sent?
60 Lamport's logical clock
- With Lamport's logical clocks, all events in a distributed system are totally ordered. That is, if a -> b, then we can say C(a) < C(b).
- With Lamport's clocks, nothing can be said about the actual time of a and b: C(a) < C(b) alone does not mean that a happened before b in real time.
- Lamport clocks do not capture causality: if a -> c and b -> c, we do not know which action initiated c.
- Problems arise when trying to replay events in a distributed system (such as when trying to recover after a crash). The theory goes that if one node goes down and we know the causal relationships between messages, then we can replay those messages, respecting the causal relationships, to get that node back up to the state it needs to be in.
- Piece-wise Deterministic (PWD)?
61 Vector clocks
Vector clocks allow causality to be captured: rules of vector clocks, properties of a process, implementation.
62 Vector clocks: rules and properties
- A vector clock VC(i) is assigned to an event i. If VC(i) < VC(j) for events i and j, then event i is known to causally precede j.
- Each process i maintains a vector Vi such that:
  Vi[i]: number of events that have occurred at i
  Vi[j]: number of events that i knows have occurred at process j
63 Vector clocks: implementation
1. Before executing an event (i.e., sending a message over the network, delivering a message to an application, or some other internal event), Pi executes VCi[i] <- VCi[i] + 1.
2. When process Pi sends a message m to Pj, it sets m's (vector) timestamp ts(m) equal to VCi after having executed the previous step.
3. Upon the receipt of a message m, process Pj adjusts its own vector by setting VCj[k] <- max{VCj[k], ts(m)[k]} for each k, after which it executes the first step and delivers the message to the application.
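The three implementation steps on slide 63 can be sketched as a small Python class. The class name, the 0-based process ids and the direct hand-over of timestamps between objects are assumptions for illustration.

```python
class VectorClockProcess:
    def __init__(self, pid, n):
        self.pid = pid           # index of this process, 0-based
        self.vc = [0] * n        # one counter per process

    def tick(self):
        """Step 1: increment the own entry before executing an event."""
        self.vc[self.pid] += 1

    def send(self):
        """Step 2: tick, then timestamp the message with a copy of the vector."""
        self.tick()
        return list(self.vc)

    def receive(self, ts):
        """Step 3: take the component-wise maximum, then tick and deliver."""
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self.tick()


p0, p1, p2 = (VectorClockProcess(i, 3) for i in range(3))
m1 = p0.send()      # ts(m1) = [1, 0, 0]
p1.receive(m1)      # p1 becomes [1, 1, 0]
m2 = p1.send()      # ts(m2) = [1, 2, 0]
p2.receive(m2)      # p2 becomes [1, 2, 1]
```

After the second message, `p2` knows about one event at `p0` even though the two never communicated directly, which is the transitive knowledge that a single Lamport counter cannot express.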
64 Vector clocks
65 Sum Up: Checkpoints and Recovery
- Prevent orphan processes
- Lamport's timestamps: integer clocks assigned to events; obey causality; cannot distinguish concurrent events
- Vector timestamps: obey causality; by using more space, can also identify concurrent events
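To make the last point concrete, here is a minimal sketch of how two vector timestamps can be compared. The function names are hypothetical; the comparison rule (every component less than or equal, at least one strictly less) is the standard ordering on vector timestamps.

```python
def happened_before(u, v):
    """u < v: every component of u is <= v's and at least one is strictly less."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))


def concurrent(u, v):
    """Two distinct timestamps are concurrent if neither precedes the other."""
    return u != v and not happened_before(u, v) and not happened_before(v, u)


first_send = [1, 0, 0]      # an event at process 0
later_recv = [1, 2, 1]      # an event at process 2 that has seen first_send
unrelated = [2, 0, 0]       # a second event at process 0, unseen by process 1
other = [1, 2, 0]           # an event at process 1
```

Here `first_send` precedes `later_recv`, while `unrelated` and `other` are concurrent. A pair of Lamport integers would always be ordered one way or the other, so this distinction would be lost.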
66 Message Dependencies
67 Sender and Receiver Dependencies
- Sender dependency: in Figure 1(a), the process state P1 depends on P3 (state change by message m3). After failure of process P3, if P3 restarts from state 0, P1 becomes an orphan process. Similarly, P1 transitively depends on P2 (transitive sender dependency).
- Receiver dependency: in Figure 1(b), the process state P1 depends on P2 (message m2). After failure of process P2, m2 becomes a lost message. Process P1 should roll back and send the message m2 again. Similarly, P1 transitively depends on P3 (transitive receiver dependency).
68 Interacting Processes
69 Total Dependency Graph
70 Minimum Reachability Graph
71 Interacting Processes: Total Dependency Vector clock
72 TDG - Cumulative State Dependencies Vector clock
73 Independent Dependency Tracking using TDT Vector Clock: Extending Blockchains
- Reliable communication network vs. dependency tracking
- Lost message tracking + orphan message tracking
- Instantaneous Minimum Reachability Graph
- Reduced-time checkpointing and rollback recovery
74 Blockchain Distributed Ledger
Philosophy: same as double-entry bookkeeping.
Example: bank passbook with columns Credits, Debits, Balance, Description (transactions: cr + db = bal). No change is allowed; compensation is allowed. [Controls on Cr, Db, Bal: check sums]
75 Distributed Ledger (Blockchain): replicated database
76 Application Systems on Web Services
77 Checkpoint Cascading Rollback Problem
- Last checkpoint: C 1,1 by P1, before P1 crashed.
- Cannot use C 0,1 at P0 because it is inconsistent with C 1,1 => P0 rolls back to C 0,0.
- Cannot use C 2,1 at P2 because it fails to reflect the sending of m6 => P2 rolls back to C 2,0.
- Cannot use C 3,1 and C 3,0 as a result => P3 rolls back to the initial state.
[Figure: processes P0-P3, messages m0-m7, checkpoints C 0,0 to C 3,1; P1 marked Crashed]
78 BLOCKCHAIN
80 Distributed Ledger: common / one log
86 Different Rollback Recovery Schemes
Rollback recovery schemes:
- Checkpoint based: uncoordinated checkpointing; coordinated checkpointing; communication-induced checkpointing
- Log based: pessimistic logging; optimistic logging; causal logging
- Blockchain
87 Computing Systems - Protocols
- Component level, sub-systems
- Blockchain: end-to-end, at application layer
- System: sum of its parts
- Application recovers from underlying component failure
88 Computing Systems - Protocols
- Individual system: replicated DBMS, internal architecture for a distributed application; supports reliable computations; fail-stop model, Byzantine Generals protocols, RPC level
- End-to-end application delivery systems: communicate through web services at application layer; Blockchain
89 Distributed Systems: new paradigms
- Crash / fault tolerant consensus algorithms: run by one organization
- BLOCKCHAINS: may run with a multiplicity of organizations [malicious nodes], with no trust between each other
- Byzantine Generals Problem; network not reliable: Internet delays, network partitions, message loss and/or reordering
90 Blockchains (Distributed Ledger Technology)
- Blockchains: a clever way to detect message loss / reordering of messages
- [Figure: copies of the log of blocks]
- Every block in the log has a pointer back to the previous block (linked list)
- Broadcasts reach many nodes (detecting missing or reordered blocks is no problem)
- Distributed database across entities with no mutual trust
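How back-pointers let a node detect lost or reordered blocks can be sketched as follows. This is a simplified illustration under stated assumptions: block ids are SHA-256 hashes of the previous id and the data, there are no forks, and the names `block_id`, `make_chain` and `assemble` are hypothetical.

```python
import hashlib


def block_id(prev_id, data):
    """Identify a block by hashing its back-pointer together with its data."""
    return hashlib.sha256((prev_id + "|" + data).encode("utf-8")).hexdigest()


def make_chain(items, genesis="genesis"):
    chain, prev = [], genesis
    for data in items:
        bid = block_id(prev, data)
        chain.append({"id": bid, "prev": prev, "data": data})
        prev = bid
    return chain


def assemble(received, genesis="genesis"):
    """Order blocks by following back-pointers; report whether any are missing."""
    by_prev = {b["prev"]: b for b in received}   # assumes no forks
    ordered, prev = [], genesis
    while prev in by_prev:
        block = by_prev[prev]
        ordered.append(block["data"])
        prev = block["id"]
    complete = len(ordered) == len(received)
    return ordered, complete


chain = make_chain(["tx1", "tx2", "tx3", "tx4"])
reordered = [chain[2], chain[0], chain[3], chain[1]]   # arrival order scrambled
with_gap = [chain[0], chain[3]]                        # tx2 and tx3 were lost

order_ok, complete_ok = assemble(reordered)
order_gap, complete_gap = assemble(with_gap)
```

Arrival order does not matter because the pointers define the order, and a gap shows up as a block whose predecessor has not been seen yet.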
91
- Time 1: Distributed Information Systems. Distributed Oracle (database system); 1 organization (banking, SBI); LAN based; synchronous communication
- Time 2: Distributed Systems. Inter-bank reconciliations (group of organizations); LAN + Internet; grid computing
- Time 3: Cloud-based Mashups and Computing. Aggregate applications (multiple organizations); web services
92 Distributed Systems in perspective
- 70-0s: LAN; synchronous comm.; Distributed Oracle + ..; one organization (banking, SBI); super-computers
- 90s: LAN + Internet; grid computing; asynchronous comm., delay/disconnection; J2EE / Jini; group of organizations (banks); networked workstations
- 2000s: web services; cloud computing; asynchronous comm., delay/disconnection; Web Services; multiple organizations (may be unknown)?; CLOUD; multiplicity of channels
93 What are the Challenges
- Bank ATMs work on dedicated lines (similar to a fax machine; synchronous network)
- AT&T goal circa 2000: aimed to change to telephony using the Internet in -10yrs
- US Govt. air-traffic control automation in the 90s (IBM)
- Driverless cars
- Real-time control problems; high-speed streams of data
- Stock exchange (in Tokyo, NY, Germany): 1 hour of Internet trade handling > 1 year budget of Japan Govt.
- LAN + Internet: 10MB (megabits); gigabit networks
- Multiplicity of channels - Blockchains
94 How to meet the Challenges
- One item: big Internet (one bullock); one item: big clouds (one bullock); MULTIPLICITY
- Individual site logs [one organization]: no global clock; time order (LAMPORT)
- Log-structured files (append-only logs) [one group]: vector clocks; distributed logs
- Blockchain Technology / Distributed Ledger Technology [not one globe?]: globally ordered transaction logs
95 Problems: Long Running Process
- Blue Gene (1999): parallel computer for the study of bio-molecular phenomena such as protein folding
- P1: process failure; Checkpoint 1
- Checkpoints (1, ..., n) on STABLE STORE: data, threads, register values
- Run-time overhead; on failure, restart from the most recent checkpoint
- 64 x 64 grid of parallel computers; middleware for checkpoints
96 Cooperating Processes: Distributed System
[Figure: processes P0-P3 exchange messages m0-m7 and take checkpoints C i,j; one process is marked Crashed]
97 Cooperating Logs / Blockchain
Distributed ledger (external for web services, managed by a cloud data center)
[Figure: processes P0-P3 exchange messages m0-m7 and take checkpoints C i,j; one process is marked Crashed]
98 Problems: Long Running Blockchain
- External blockchain supports the web service transaction; can tolerate a few failures
- P1: process failure; Checkpoint 1
- The log is the checkpoints (1, ..., n): on STABLE STORE? It is all for web services
- Data, threads, register values
- Run-time overhead; no failure; no most recent checkpoint
- Distributed / parallel computers; middleware for checkpoints
99 References
- Advanced Concepts in Operating Systems by Singhal and Shivaratri on pages
- Distributed Systems: Principles and Paradigms, Andrew S. Tanenbaum and Maarten Van Steen (Second Edition) on pages
- Time, clocks, and the ordering of events in a distributed system by Lamport (197)
- Youtube videos
100 References
- C. Lee, B. Nick, U. Brandes, and P. Cunningham, Link prediction with social vector clocks, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD 13, Apr. 2013, pp
- M. Harrigan, Using vector clocks to visualize communication flow, in ASONAM, N. Memon and R. Alhajj, Eds. IEEE Computer Society, 2010, pp
- C. E. Hrischuk and C. M. Woodside, Logical clock requirements for reverse engineering scenarios from a distributed system, IEEE Trans. Software Eng., 2(4), Apr. 2002,
- M. Raynal and M. Singhal, Logical Time: Capturing Causality in Distributed Systems, IEEE Computer Magazine, vol. 29, no. 2, pp , Feb
101 Reference BLOCKCHAINS :
Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?
More informationBasic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.
Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery
More informationFault Tolerance 1/64
Fault Tolerance 1/64 Fault Tolerance Fault tolerance is the ability of a distributed system to provide its services even in the presence of faults. A distributed system should be able to recover automatically
More informationMYE017 Distributed Systems. Kostas Magoutis
MYE017 Distributed Systems Kostas Magoutis magoutis@cse.uoi.gr http://www.cse.uoi.gr/~magoutis Message reception vs. delivery The logical organization of a distributed system to distinguish between message
More informationIntroduction to Databases
Introduction to Databases Matthew J. Graham CACR Methods of Computational Science Caltech, 2009 January 27 - Acknowledgements to Julian Bunn and Ed Upchurch what is a database? A structured collection
More informationCS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:
CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online
More informationDISTRIBUTED SYSTEMS. Second Edition. Andrew S. Tanenbaum Maarten Van Steen. Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON.
DISTRIBUTED SYSTEMS 121r itac itple TAYAdiets Second Edition Andrew S. Tanenbaum Maarten Van Steen Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON Prentice Hall Upper Saddle River, NJ 07458 CONTENTS
More informationCheckpointing HPC Applications
Checkpointing HC Applications Thomas Ropars thomas.ropars@imag.fr Université Grenoble Alpes 2016 1 Failures in supercomputers Fault tolerance is a serious problem Systems with millions of components Failures
More informationLast time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson
Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to
More informationProcess groups and message ordering
Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave
More informationFault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University
Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed
More informationDISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed
More informationFailures, Elections, and Raft
Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright
More informationTo do. Consensus and related problems. q Failure. q Raft
Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the
More informationDistributed Systems Principles and Paradigms
Distributed Systems Principles and Paradigms Chapter 08 (version October 5, 2007) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:
More informationDistributed Systems Principles and Paradigms
Distributed Systems Principles and Paradigms Chapter 08 (version October 5, 2007) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:
More informationFault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit
Fault Tolerance o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication o Distributed Commit -1 Distributed Commit o A more general problem of atomic
More informationT ransaction Management 4/23/2018 1
T ransaction Management 4/23/2018 1 Air-line Reservation 10 available seats vs 15 travel agents. How do you design a robust and fair reservation system? Do not enough resources Fair policy to every body
More informationCS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)
CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2003 Lecture 21: Network Protocols (and 2 Phase Commit) 21.0 Main Point Protocol: agreement between two parties as to
More informationCausal Consistency and Two-Phase Commit
Causal Consistency and Two-Phase Commit CS 240: Computing Systems and Concurrency Lecture 16 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Consistency
More informationSCALABLE CONSISTENCY AND TRANSACTION MODELS
Data Management in the Cloud SCALABLE CONSISTENCY AND TRANSACTION MODELS 69 Brewer s Conjecture Three properties that are desirable and expected from realworld shared-data systems C: data consistency A:
More informationCSE 5306 Distributed Systems. Course Introduction
CSE 5306 Distributed Systems Course Introduction 1 Instructor and TA Dr. Donggang Liu @ CSE Web: http://ranger.uta.edu/~dliu Email: dliu@uta.edu Phone: 817-2720741 Office: ERB 555 Office hours: Tus/Ths
More informationChapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju
Chapter 5: Distributed Systems: Fault Tolerance Fall 2013 Jussi Kangasharju Chapter Outline n Fault tolerance n Process resilience n Reliable group communication n Distributed commit n Recovery 2 Basic
More informationDistributed Database Management System UNIT-2. Concurrency Control. Transaction ACID rules. MCA 325, Distributed DBMS And Object Oriented Databases
Distributed Database Management System UNIT-2 Bharati Vidyapeeth s Institute of Computer Applications and Management, New Delhi-63,By Shivendra Goel. U2.1 Concurrency Control Concurrency control is a method
More informationATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases
ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases We talked about transactions and how to implement them in a single-node database. We ll now start looking into how to
More informationData Consistency and Blockchain. Bei Chun Zhou (BlockChainZ)
Data Consistency and Blockchain Bei Chun Zhou (BlockChainZ) beichunz@cn.ibm.com 1 Data Consistency Point-in-time consistency Transaction consistency Application consistency 2 Strong Consistency ACID Atomicity.
More informationPractical Byzantine Fault
Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault
More informationThe CAP theorem. The bad, the good and the ugly. Michael Pfeiffer Advanced Networking Technologies FG Telematik/Rechnernetze TU Ilmenau
The CAP theorem The bad, the good and the ugly Michael Pfeiffer Advanced Networking Technologies FG Telematik/Rechnernetze TU Ilmenau 2017-05-15 1 / 19 1 The bad: The CAP theorem s proof 2 The good: A
More informationModule 8 - Fault Tolerance
Module 8 - Fault Tolerance Dependability Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced
More informationTransaction Management. Pearson Education Limited 1995, 2005
Chapter 20 Transaction Management 1 Chapter 20 - Objectives Function and importance of transactions. Properties of transactions. Concurrency Control Deadlock and how it can be resolved. Granularity of
More informationDistributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions
Distributed Systems Day 13: Distributed Transaction To Be or Not to Be Distributed.. Transactions Summary Background on Transactions ACID Semantics Distribute Transactions Terminology: Transaction manager,,
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationCA464 Distributed Programming
1 / 25 CA464 Distributed Programming Lecturer: Martin Crane Office: L2.51 Phone: 8974 Email: martin.crane@computing.dcu.ie WWW: http://www.computing.dcu.ie/ mcrane Course Page: "/CA464NewUpdate Textbook
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 16 - Checkpointing I Chapter 6 - Checkpointing Part.16.1 Failure During Program Execution Computers today are much faster,
More informationPRIMARY-BACKUP REPLICATION
PRIMARY-BACKUP REPLICATION Primary Backup George Porter Nov 14, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons
More informationDatabase Architectures
Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/15/15 Agenda Check-in Parallelism and Distributed Databases Technology Research Project Introduction to NoSQL
More informationDistributed Systems (ICE 601) Fault Tolerance
Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability
More informationDistributed Commit in Asynchronous Systems
Distributed Commit in Asynchronous Systems Minsoo Ryu Department of Computer Science and Engineering 2 Distributed Commit Problem - Either everybody commits a transaction, or nobody - This means consensus!
More informationINSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad
Course Name Course Code Class Branch INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad -500 043 COMPUTER SCIENCE AND ENGINEERING TUTORIAL QUESTION BANK 2015-2016 : DISTRIBUTED SYSTEMS
More informationEvent Ordering. Greg Bilodeau CS 5204 November 3, 2009
Greg Bilodeau CS 5204 November 3, 2009 Fault Tolerance How do we prepare for rollback and recovery in a distributed system? How do we ensure the proper processing order of communications between distributed
More informationDistributed Consensus Protocols
Distributed Consensus Protocols ABSTRACT In this paper, I compare Paxos, the most popular and influential of distributed consensus protocols, and Raft, a fairly new protocol that is considered to be a
More informationDistributed Databases
Distributed Databases These slides are a modified version of the slides of the book Database System Concepts (Chapter 20 and 22), 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides
More informationA Review of Checkpointing Fault Tolerance Techniques in Distributed Mobile Systems
A Review of Checkpointing Fault Tolerance Techniques in Distributed Mobile Systems Rachit Garg 1, Praveen Kumar 2 1 Singhania University, Department of Computer Science & Engineering, Pacheri Bari (Rajasthan),
More informationDistributed Deadlock
Distributed Deadlock 9.55 DS Deadlock Topics Prevention Too expensive in time and network traffic in a distributed system Avoidance Determining safe and unsafe states would require a huge number of messages
More informationDistributed Systems Architectures. Ian Sommerville 2006 Software Engineering, 8th edition. Chapter 12 Slide 1
Distributed Systems Architectures Ian Sommerville 2006 Software Engineering, 8th edition. Chapter 12 Slide 1 Objectives To explain the advantages and disadvantages of different distributed systems architectures
More informationTWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018
TWO-PHASE COMMIT George Porter May 9 and 11, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons license These slides
More informationA Low-Overhead Minimum Process Coordinated Checkpointing Algorithm for Mobile Distributed System
A Low-Overhead Minimum Process Coordinated Checkpointing Algorithm for Mobile Distributed System Parveen Kumar 1, Poonam Gahlan 2 1 Department of Computer Science & Engineering Meerut Institute of Engineering
More informationIntroduction. Storage Failure Recovery Logging Undo Logging Redo Logging ARIES
Introduction Storage Failure Recovery Logging Undo Logging Redo Logging ARIES Volatile storage Main memory Cache memory Nonvolatile storage Stable storage Online (e.g. hard disk, solid state disk) Transaction
More informationSynchronization. Clock Synchronization
Synchronization Clock Synchronization Logical clocks Global state Election algorithms Mutual exclusion Distributed transactions 1 Clock Synchronization Time is counted based on tick Time judged by query
More informationDatacenter replication solution with quasardb
Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION
More informationOUTLINE. Introduction Clock synchronization Logical clocks Global state Mutual exclusion Election algorithms Deadlocks in distributed systems
Chapter 5 Synchronization OUTLINE Introduction Clock synchronization Logical clocks Global state Mutual exclusion Election algorithms Deadlocks in distributed systems Concurrent Processes Cooperating processes
More informationDistributed Systems Principles and Paradigms
Distributed Systems Principles and Paradigms Chapter 01 (version September 5, 2007) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20.
More informationIntroduction. Distributed Systems IT332
Introduction Distributed Systems IT332 2 Outline Definition of A Distributed System Goals of Distributed Systems Types of Distributed Systems 3 Definition of A Distributed System A distributed systems
More informationAGREEMENT PROTOCOLS. Paxos -a family of protocols for solving consensus
AGREEMENT PROTOCOLS Paxos -a family of protocols for solving consensus OUTLINE History of the Paxos algorithm Paxos Algorithm Family Implementation in existing systems References HISTORY OF THE PAXOS ALGORITHM
More information