Distributed File System


Distributed File System

Last class (NFS):
- Design choices
- Opportunistic locking
- Local name spaces

CS 138 XVIII 1 Copyright 2018 Theophilus Benson, Thomas W. Doeppner. All rights reserved.

DFS Basic Info

- Global name space across all servers in DFS (or "Cell" in DCE-speak)
- The hierarchy is partitioned into filesets, stored at different servers
- The mapping is stored in the fileset location database (FLDB)

[Figure: directory tree rooted at fs, with subtrees system (sol7, osf1), users (twd), and projects (motif, dce) stored on different servers; the FLDB records where each fileset lives.]

DFS Mount Points

- FileSets are given names independent of the hierarchy: root.cell, bin.sol7, bin.osf1, users.twd, proj.motif, proj.dce
- The mapping of FileSet name to server is stored in the FLDB
- Mounting = a symbolic link to the FileSet name; to use a FileSet, a client must look its name up in the FLDB to find the server
- Provides a global name space: all clients use the same names

[Figure: clients resolve mount points such as bin.sol7 and users.twd through replicated FLDB servers to the servers holding each fileset.]
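The fileset indirection above can be sketched as a two-step lookup. This is an illustrative sketch only: the mount table, fileset names, and server names below are made up, and the real FLDB interface is far richer.

```python
# Hypothetical sketch of FLDB-style fileset location lookup.

# FLDB: maps fileset names (independent of the directory hierarchy)
# to the server currently storing that fileset.
fldb = {
    "root.cell": "rootsrv",
    "bin.sol7": "sol7",
    "bin.osf1": "osf1",
    "users.twd": "twd",
    "proj.motif": "motif",
    "proj.dce": "dce",
}

# Mount points in the global namespace are symbolic links to fileset names.
mount_points = {
    "/fs/system/sol7/bin": "bin.sol7",
    "/fs/system/osf1/bin": "bin.osf1",
    "/fs/users/twd": "users.twd",
    "/fs/projects/motif": "proj.motif",
    "/fs/projects/dce": "proj.dce",
}

def resolve(path):
    """Follow the mount point's symbolic name through the FLDB to a server."""
    fileset = mount_points[path]   # mount point -> fileset name
    return fldb[fileset]           # fileset name -> current server

print(resolve("/fs/users/twd"))    # -> twd
```

Because every client consults the same FLDB, moving a fileset to another server only updates the FLDB entry; no client mount tables change.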

DCE DFS v. NFS

DCE:
- Single global name space
- The namespace is built into the central database (directory hierarchy)
- Mounts/lookups must use the central database

NFS:
- Independent, disjoint name spaces based on mounts

Strict Consistency in DFS

- To read/write files, clients request tokens
- The server provides the token and allows caching
- As long as tokens are held, the client can make the appropriate changes

[Figure: clients obtain tokens from the file servers, consulting the FLDB to locate them.]

DFS Tokens (1)

[Figure: two clients hold tokens on the same files. One holds Read:0-4095 on file A and Write:0-512 on file B; the other holds Read:0-4095 on file A and Write:513-4096 on file B. The read tokens on A coexist; the write tokens on B cover disjoint byte ranges.]

DFS Tokens (2)

[Figure: both clients hold Read:0-4095 on file A; one of them sends Write(A, 0-4095) to the server, requesting a conflicting write token.]

DFS Tokens (3)

[Figure: the server sends Revoke(A, Read, 0-4095) to the other client holding a read token on file A.]

DFS Tokens (4)

[Figure: with the conflicting read token revoked, the server sends Grant(A, write, 0-4095); the requesting client now holds Write:0-4095 on file A.]

DFS Tokens (5)

[Figure: the first client sends Read(A, 0-4095) to the server while the other client still holds Write:0-4095 on file A.]

DFS Tokens (6)

[Figure: the server must revoke the Write:0-4095 token on file A before the new read token can be granted.]

DFS Tokens (7)

[Figure: the server sends Grant(A, read, 0-4095); both clients once again hold Read:0-4095 tokens on file A.]
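The grant/revoke sequence in slides (1)-(7) can be sketched as a toy token server. This is a heavily simplified sketch: real DCE DFS tokens cover byte ranges and many operation classes, while here tokens are whole-file read/write and `TokenServer` is a hypothetical name.

```python
# Toy sketch of DFS-style token management (illustrative only).

class TokenServer:
    def __init__(self):
        self.tokens = {}   # file -> {client: "read" | "write"}
        self.revoked = []  # log of revoke messages "sent" to clients

    def request(self, client, file, mode):
        holders = self.tokens.setdefault(file, {})
        for other, held in list(holders.items()):
            if other == client:
                continue
            # A write token conflicts with any other token; a read token
            # conflicts only with another client's write token.
            if mode == "write" or held == "write":
                self.revoked.append((other, file, held))
                del holders[other]
        holders[client] = mode     # grant the requested token

srv = TokenServer()
srv.request("c1", "A", "read")
srv.request("c2", "A", "read")     # reads coexist: no revocation
srv.request("c1", "A", "write")    # forces revocation of c2's read token
print(srv.revoked)                 # [('c2', 'A', 'read')]
```

As in the slides, the interesting design point is that revocation is server-initiated: a client can cache freely while it holds a token, because the server promises to call back before granting anyone a conflicting one.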

Consistency Models

- NFS: weak or strict; strict if locks are used
- CIFS: strict consistency; uses opportunistic locks (revoke + flush)
- DCE: strict consistency; uses tokens, with a revoke+flush policy similar to CIFS

Crash Recovery

DFS Crash Recovery

- Client crash: the server reclaims tokens; it should be able to revoke when the client is down
- Server crash: on reboot, the server rebuilds its token DB; the client should keep using its cache even while the server is down
- Network failure: both are up but can't communicate; revoking tokens or using the cache creates issues
- Compromise: the client uses tokens until they expire; the server revokes unilaterally. When the failure is fixed, the client's changes may be rejected!

DFS Recovery Problems

- The client application must participate! It must recognize that an operation returned a timed-out error, and retry
- Due to the semantics of tokens, it isn't feasible to provide an NFS-style hard mount
  - Hard mounts are possible because NFS is stateless; in DCE DFS, state complicates hard mounting

State

It's required for:
- Exact Unix semantics, e.g., needing to know how many clients are using a file before deleting it
- Mandatory locks

State Recovery

- Recovery from a crashed server: clients reclaim their state on the server
  - Grace period after the crash during which no new state may be established
- Recovery from a crashed client: the server detects the crash and nullifies the client's state information on the server

Coping with Non-Responsiveness

Leases: locks are granted for a fixed period of time
- Server-specified lease; if the lease is not renewed before expiration, the server may (unilaterally) revoke locks and share reservations
- Most client RPCs renew leases, so clients must contact the server periodically
- If a clientid is rejected as stale, then the server has restarted
- The server's grace period is equal to the lease period
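A minimal sketch of the lease rule above, with times injected as plain numbers so the example is deterministic; the `LeasedLock` class and the 90-second period are illustrative assumptions, not any real NFS client's API.

```python
# Sketch of lease-based lock expiry.

LEASE_PERIOD = 90  # seconds, server-specified (illustrative value)

class LeasedLock:
    def __init__(self, granted_at):
        self.expires = granted_at + LEASE_PERIOD

    def renew(self, now):
        """Most client RPCs implicitly renew the lease."""
        if now < self.expires:
            self.expires = now + LEASE_PERIOD
            return True
        return False   # too late: the server may already have revoked

    def valid(self, now):
        return now < self.expires

lock = LeasedLock(granted_at=0)
assert lock.renew(now=60)        # renewed in time; lease now runs to 150
assert lock.valid(now=120)       # still valid thanks to the renewal
assert not lock.renew(now=200)   # expired: server may unilaterally revoke
```

The unilateral-revocation point is what makes leases a fix for non-responsiveness: the server never has to wait on a silent client longer than one lease period.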

DFS != LFS

A distributed file system is not a local file system:
- Servers might give up on non-crashed clients
- Clients may lose locks
- Clients may lose files
- NFSv4 attempts to make such things unlikely

Distributed File System Summary

- Opportunistic locking: enables a client to safely cache a file and modify it; improves performance
- Global v. local name spaces: global in DCE DFS, local in NFS
- Crash recovery requires client participation
- Leases help deal with non-responsiveness
- DFS != LFS

CS 138: Distributed Transactions

Today

- Transactions
- Atomic Commit of transactions: five desired properties
- Two-Phase Commit: state machines for the coordinator and participant roles; their responses to failures; the blocking issue
- Three-Phase Commit: an attempt to eliminate the blocking issue; a more complex protocol with more messages; state machines for the different roles

Much of this lecture is adapted from the textbook by Tanenbaum and Van Steen and from Chapter 7 of Concurrency Control and Recovery in Database Systems, by P. Bernstein, V. Hadzilacos, and N. Goodman, Addison-Wesley (1987).

Transactions

ACID properties:
- Atomic: all or nothing
- Consistent: takes the system from one consistent state to another
- Isolated: has no effect on other transactions until committed
- Durable: persists

Recall: Configuration Changes in Raft!

Cannot switch directly from one configuration to another: conflicting majorities could arise. See the paper for details.

[Figure: timeline of servers 1-5 during the transition from C_old to C_new, showing a majority of C_old and a majority of C_new.]

Roles in Atomic Transactions

- C = Coordinator: initiates a change
- P = Participants: all nodes involved in making the atomic change

Distributed Transactions

Client:
  Begin Transaction;
  a.withdraw(100);
  b.deposit(50);
  c.deposit(50);
  End Transaction;

[Figure: the client sends withdraw(100) to participant A and deposit(50) to participants B and C.]

Coordination

[Figure: the same transaction as on the previous slide, but now a Coordinator sits between the client and participants A, B, and C, relaying withdraw(100) and deposit(50).]

Atomic Commit Properties

- AC1: All participants that reach a (commit/abort) decision reach the same one
- AC2: A participant cannot reverse its decision
- AC3: The commit decision can be reached only if all participants agree
- AC4: If there are no failures and all participants vote yes, then the decision will be commit
- AC5: For any execution, if all failures are repaired and no new failures occur for a sufficiently long interval, then all participants will reach a (commit/abort) decision

Two-Phase Commit

Phase 1: coordinator prepares to commit
- Asks participants to vote either commit or abort
- Participants respond appropriately

Phase 2: coordinator decides the outcome
- If all participants vote commit, the outcome is commit; otherwise the outcome is abort
- The outcome is sent to all participants
- Participants do what they're told
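The two phases can be sketched as follows, with all message plumbing elided and participants modeled as local vote functions; this is an illustrative sketch of the decision rule, not a full protocol implementation (`run_2pc` and the participant names are made up).

```python
# Sketch of the two-phase-commit decision rule.

def run_2pc(participants):
    # Phase 1: the coordinator asks every participant to vote commit or
    # abort; each votes based purely on its local state.
    votes = {name: vote() for name, vote in participants.items()}
    # Phase 2: commit only if all voted commit; otherwise abort (AC3, AC4).
    decision = "commit" if all(v == "commit" for v in votes.values()) else "abort"
    # The same outcome is sent to all participants (AC1).
    return decision, votes

decision, _ = run_2pc({
    "A": lambda: "commit",
    "B": lambda: "commit",
    "C": lambda: "abort",   # one local abort forces a global abort
})
print(decision)             # abort
```

Note how AC3 and AC4 fall directly out of the `all(...)` test: unanimity is both necessary and, absent failures, sufficient for commit.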

Two-Phase Commit

[Figure: the coordinator asks each participant to make the change or abort; each participant decides, based on local state, and responds Commit or Abort. If all respond Commit, the coordinator decides Commit; if any responds Abort, it decides Abort (AC3, AC4). The outcome goes to every participant (AC1), and once a node has responded it can't change its response (AC2).]

Coordinator State Diagram for Two-Phase Commit

[State machine: Init --(app commit / send vote req)--> Wait; Wait --(any abort / send abort)--> Abort; Wait --(all commit / send commit)--> Commit.]

Participant State Diagram for Two-Phase Commit

[State machine: Init --(vote req / vote abort)--> Abort; Init --(vote req / vote commit)--> Uncertain; Uncertain --(abort / ack)--> Abort; Uncertain --(commit / ack)--> Commit.]

State Diagrams for Two-Phase Commit

Coordinator: Init --(app commit / vote req)--> Wait; Wait --(any abort / abort)--> Abort; Wait --(all commit / commit)--> Commit.

Participant: Init --(vote req / abort)--> Abort; Init --(vote req / commit)--> Uncertain; Uncertain --(abort / ack)--> Abort; Uncertain --(commit / ack)--> Commit.

Failures

Coordinator or participants could crash; assume fail-stop:
- Crashes are detected by time-out
- No byzantine failures
- Crashed machines restart and recover their state

Focus on two kinds of failure:
- Node failures
- Network failures (i.e., lost messages)

Crash Points

[Figure: the two-phase-commit coordinator and participant state diagrams, with crash points marked on the transitions.]


Coordinator Recovery from Participant Failure

Coordinator in the Wait state:
- Detects the failure of a participant via a timeout while waiting for it to vote commit or abort
- The coordinator in the Wait state can't assume either outcome
- It treats no response as an abort vote: abort the transaction!

Participant Recovery from Coordinator Failure

Participant in the Uncertain state:
- Detects the failure of the coordinator via a timeout while waiting for it to say commit or abort
- It can't assume either outcome
- It waits for the coordinator to restart; on restart, it contacts the coordinator for the final outcome

Optimizing Participant Recovery

Waiting for the coordinator to restart can take a LONG time, especially if the coordinator takes a while to reboot. Instead, participants contact other participants: p (in the Uncertain state) contacts q. If q is in:
- The Commit (or Abort) state: p goes to the Commit (or Abort) state
- The Init state (hasn't yet voted): both q and p abort
- The Uncertain state: both p and q remain uncertain


Optimizing Participant Recovery (continued)

When all active nodes are uncertain, this optimization does not provide a benefit.
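The peer-query rules can be sketched as a small decision function; state names follow the slides, and `resolve_uncertain` is a hypothetical helper, not part of any real 2PC implementation.

```python
# Cooperative termination sketch: an uncertain participant p asks its
# peers for their states instead of waiting for the coordinator.

def resolve_uncertain(peer_states):
    for s in peer_states:
        if s in ("commit", "abort"):
            return s              # adopt the peer's decision
    if "init" in peer_states:
        return "abort"            # a peer hasn't voted, so commit is impossible
    return "uncertain"            # everyone is uncertain: still blocked

print(resolve_uncertain(["uncertain", "commit"]))     # commit
print(resolve_uncertain(["init", "uncertain"]))       # abort
print(resolve_uncertain(["uncertain", "uncertain"]))  # uncertain
```

The last case is exactly the residual blocking scenario the slides point out: when every operational peer is uncertain, the query round yields nothing and everyone keeps waiting for the coordinator.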

Two-Phase Commit

Works fine in practice! But all participants could conceivably be in the Uncertain state while the coordinator is down (for a long time). Can we make it so such blocking can't happen?

What Causes Blocking?

The coordinator is down:
- If all operational (non-failed) participants are in the Uncertain state, they are blocked
- If all participants are operational, they can elect a new coordinator
- If any participant has crashed, the others don't know whether it crashed before or after voting (to commit or abort)

Guaranteeing Non-Blocking

Non-blocking property (NB): if any operational process is in the Uncertain state, then no process (operational or failed) can have decided to commit.

If NB holds, then the operational processes may elect a new coordinator and decide to commit or abort.

Failures

Coordinator or participants could crash; no communication failures. Assume fail-stop:
- Crashes are detected by time-out
- No byzantine failures
- Crashed machines restart and recover their state

Three-Phase Commit

Phase 1: coordinator prepares to commit
- Asks participants to vote either commit or abort
- Participants respond appropriately

Phase 2: coordinator counts votes
- If all participants vote commit, the outcome is precommit; otherwise the outcome is abort
- The outcome is sent to all participants
- Participants ack, and either abort or wait for the commit

Phase 3: coordinator waits for all acks
- If committing, sends the final commit to all participants
- Participants commit
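A sketch of the three phases, again with messaging and failures elided; illustrative only (`run_3pc` is a made-up name, and the returned list just records the outcomes the coordinator would send).

```python
# Sketch of the three-phase-commit message rounds. The extra precommit
# round is what makes the non-blocking property NB achievable.

def run_3pc(votes):
    # Phase 1: collect votes.
    if any(v != "commit" for v in votes):
        return ["abort"]            # abort outcome sent to all
    # Phase 2: all voted commit -> send precommit, collect acks.
    outcomes = ["precommit"]
    # Phase 3: once all acks arrive, send the final commit.
    outcomes.append("commit")
    return outcomes

print(run_3pc(["commit", "commit"]))   # ['precommit', 'commit']
print(run_3pc(["commit", "abort"]))    # ['abort']
```

The design point: no participant commits until everyone has acknowledged the precommit, so a participant left in the Uncertain state can be sure nobody has committed yet.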

Three-Phase Commit

[Figure: the coordinator asks each participant to make the change or abort; each decides based on local state and responds. If all respond Commit, the coordinator sends Precommit, otherwise Abort; the participants ack, and the coordinator then sends the final Commit or Abort.]

Revised State Diagrams

Coordinator: Init --(app commit / vote req)--> Wait; Wait --(any abort / abort)--> Abort; Wait --(all commit / precommit)--> Precommit; Precommit --(all ack / commit)--> Commit.

Participant: Init --(vote req / abort)--> Abort; Init --(vote req / commit)--> Uncertain; Uncertain --(abort / ack)--> Abort; Uncertain --(precommit / ack)--> Precommit; Precommit --(commit)--> Commit.


Coordinator Recovery from Participant Failure

Coordinator in the Wait state:
- Detects the failure of a participant via a timeout while waiting for it to vote commit or abort
- It can't assume either outcome; it interprets the absence of a response as an abort vote
- Abort the transaction! (Same as 2PC)

Coordinator Recovery from Participant Failure

Coordinator in the Precommit state:
- Detects the failure of a participant via a timeout while waiting for its ACK
- The coordinator in the Precommit state CAN assume the participants will commit, so it commits independently
- Failed nodes will learn the outcome on reboot, either from the coordinator or from a friend

Participant Recovery from Coordinator Failure

Participant in the Init state:
- Detects the failure of the coordinator via a timeout while waiting for it to propose a vote
- It can safely abort without waiting

Participant Recovery from Coordinator Failure

Participant in the Uncertain state:
- Detects the failure of the coordinator via a timeout while waiting for it to say commit or abort
- It can't assume either outcome
- It contacts another node to determine the status

Participant Recovery from Coordinator Failure

Participant p in the Precommit state:
- Detects the failure of the coordinator via a timeout while waiting for it to say commit
- Why not just COMMIT? There might be a node q out there in the Uncertain state; if p commits and then fails, there's a chance q can still ABORT, violating the earlier assumptions

Details (1)

- If the original coordinator remains operational, participant crashes are handled as in two-phase commit
- If a participant times out in the Uncertain state:
  - If any other participant has aborted, it aborts
  - Otherwise, it starts an election for a new coordinator (all are in Uncertain, or some are in Uncertain and some in Precommit)
- If a participant times out in the Precommit state, it starts an election for a new coordinator (all are in Precommit, or some are in Uncertain and some in Precommit)
- No node can be in the Abort state if any node is in Precommit; if any node is in Commit, the others move into Commit

Participant Times Out in Uncertain State

A participant in the Uncertain state is waiting for the coordinator to say commit or abort. Other participants can be in:
- Uncertain: these participants responded to the vote with Yes
- Abort: these participants responded to the vote with Abort
- Init: these participants have not received the request to vote

NO participant can be in Commit: this participant never acked a precommit, so the coordinator cannot have decided to commit (this is the NB property).

Participant Times Out in Precommit State

A participant in the Precommit state is waiting for the coordinator to say commit. Other participants can be in:
- Precommit: received the precommit message from the coordinator but still waiting for the commit
- Commit: received the commit
- Uncertain: have not received the precommit from the coordinator, which could have failed before sending it to them

NO participant can be in Init or Abort: all must have voted yes for the coordinator to send the precommit.

Coordinator Recovery: New Coordinator

When the newly elected coordinator starts up, it sends a state-request message to all operational participants, collects their states, and proceeds according to four termination rules (the termination protocol):
- TR1: if any participant is in the Abort state, all are sent abort messages
- TR2: if some participant is in the Commit state, all are sent commit messages
- TR3: if all participants are in the Uncertain state, all are sent abort messages
- TR4: if some participant is in the Precommit state, but none is in the Commit state, those in the Uncertain state are sent precommit messages; once these are acked, all participants are sent commit messages
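The four termination rules can be sketched as one decision function over the collected states. Illustrative only: it assumes the reported states are consistent with the protocol (e.g. Abort and Precommit never coexist), and `termination_decision` is a made-up name.

```python
# Sketch of the 3PC termination protocol run by a new coordinator.

def termination_decision(states):
    if "abort" in states:                      # TR1
        return "abort"
    if "commit" in states:                     # TR2
        return "commit"
    if all(s == "uncertain" for s in states):  # TR3
        return "abort"
    # TR4: some participant is in Precommit, none in Commit: precommit
    # the uncertain participants, then commit once they ack.
    return "commit"

print(termination_decision(["uncertain", "uncertain"]))  # abort (TR3)
print(termination_decision(["precommit", "uncertain"]))  # commit (TR4)
```

TR3 is the rule that 2PC lacks: because NB guarantees no failed node can have committed while any operational node is uncertain, an all-uncertain group may safely abort instead of blocking.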

Total Failure

What if the coordinator and all participants fail? When they come back up, how do they decide?
- If a resurrected participant either didn't vote or voted abort, it may unilaterally abort
- Otherwise, it must run the termination protocol (from the last slide), but that works only if the last participant to fail has come back up
- If the last participant doesn't come back up, you don't know its current state

Communication Failures

The network could partition into multiple pieces. It is not sufficient to get agreement in a piece containing a quorum: consensus is required for commit!

Scenario:
- All participants vote; the coordinator collects the results
- The network partitions, before or after all the results are collected
- If the network reconnects: easy
- If the network never fully reconnects, but each participant eventually can communicate (perhaps briefly) with all the others: ?

Summary of Distributed Transactions

- Transactions
- Atomic Commit of transactions: five desired properties
- Two-Phase Commit: state machines for the different roles; participant and coordinator responses to failures; the blocking issue
- Three-Phase Commit: an attempt to eliminate the blocking issue; a more complex protocol with more messages; state machines for the different roles