Distributed systems
Lecture 6: Elections, distributed transactions, and replication
Dr Robert N. M. Watson

Last time

Saw how we can build ordered multicast:
- Messages between processes in a group
- Need to distinguish receipt and delivery
- Several ordering options: FIFO, causal or total

Considered distributed mutual exclusion:
- Want to limit one process to a critical section (CS) at a time
- Central server: OK, but a bottleneck and a Single Point of Failure (SPoF)
- Token passing: OK, but traffic, repair, token loss
- Totally-ordered multicast: OK, but a high number of messages and problems with failures

Leader election

Many schemes are built on the notion of having a well-defined leader (master, coordinator):
- Examples seen so far include the Berkeley time synchronization protocol and the central lock server

An election algorithm is a dynamic scheme to choose a unique process to play a certain role:
- Assume P_i contains a state variable elected_i
- When a process first joins the group, elected_i = UNDEFINED
- By the end of the election, for every P_i, either elected_i = P_x (where P_x is the winner of the election), or elected_i = UNDEFINED, or P_i has crashed or otherwise left the system

Common theme: the live node with the highest ID wins.

Ring-based election

[Figure: a ring of processes P1, P2, P3, P5 plus coordinator C; C and P4 have crashed, so P3 routes around P4 directly to P5; the election message accumulates IDs (2), (2,3), (2,3,5), (1,2,3,5) as it circulates.]

- System has a coordinator, who crashes
- Some process notices, and starts an election (in the figure, P2 notices C has crashed)
- Goal: find the node with the highest ID, who will be the new leader
- The initiator puts its ID into a message, and sends it to its successor
- On receipt, a process acks to the sender (not shown), then appends its ID and forwards the election message
- Finished when a process receives a message containing its own ID
- The initiator can then multicast the winner (or continue the message around the ring, e.g. on to P5)
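A minimal, single-threaded sketch of the ring-based election above, simulating message passing with direct method calls; the class and method names (RingNode, on_election, and so on) are illustrative and not from the lecture, and acks and failure detection are omitted.

# Minimal simulation of the ring-based election sketched above.
class RingNode:
    def __init__(self, pid):
        self.pid = pid
        self.successor = None          # next live node in the ring
        self.elected = None            # UNDEFINED until an election completes

    def start_election(self):
        """Called by a node that notices the coordinator has crashed."""
        self.successor.on_election([self.pid])

    def on_election(self, ids):
        if self.pid in ids:
            # Message has gone all the way around: highest ID wins.
            winner = max(ids)
            self.on_coordinator(winner)
            self.successor.on_coordinator_msg(winner, origin=self.pid)
        else:
            # Append our ID and forward to our successor.
            self.successor.on_election(ids + [self.pid])

    def on_coordinator_msg(self, winner, origin):
        if self.pid == origin:
            return                     # announcement has gone full circle
        self.on_coordinator(winner)
        self.successor.on_coordinator_msg(winner, origin)

    def on_coordinator(self, winner):
        self.elected = winner

# Wire up the ring P1 -> P2 -> P3 -> P5 -> P1 (P4 and the old coordinator crashed).
nodes = {i: RingNode(i) for i in (1, 2, 3, 5)}
for a, b in [(1, 2), (2, 3), (3, 5), (5, 1)]:
    nodes[a].successor = nodes[b]

nodes[2].start_election()                        # P2 notices the crash
print({i: n.elected for i, n in nodes.items()})  # every live node elects 5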

The Bully Algorithm

The algorithm proceeds by attempting to elect the process still alive with the highest ID:
- Assumes that we know the IDs of all processes
- Assumes we can reliably detect failures by timeouts
- If process P_i sees that the current leader has crashed, it sends an election message to all processes with higher IDs, and starts a timer
- Concurrent election initiation by multiple processes is fine
- Processes receiving an election message reply OK to the sender, and start an election of their own (if one is not already in progress)
- If a process hears nothing back before its timeout, it declares itself the winner, and multicasts the result
- A dead process that recovers (or a new process that joins) also starts an election: this ensures the highest ID is always elected

(A small sketch of the bully algorithm appears below, after the notes on problems with elections.)

Problems with elections

[Figure: processes P0 to P8 split into two groups by a network partition.]

- Algorithms rely on timeouts to reliably detect failure
- However it is possible for networks to fail: a network partition
- Some processes can speak to others, but not all
- Can lead to split-brain syndrome: every partition independently elects a leader (too many bosses!)
- To fix, we need some secondary (and tertiary?) communication scheme, e.g. a secondary network, shared disk, serial cables, ...
- This is important because we want to implement distributed algorithms that depend on the invariant that the leader is unique
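The sketch referred to above: a minimal, single-threaded rendering of the bully algorithm, where an alive() predicate stands in for timeout-based failure detection. Names (BullyNode, on_leader_crash) are our own illustration, not the lecture's code.

# Illustrative sketch of the bully algorithm (not the lecture's code).
class BullyNode:
    def __init__(self, pid, all_pids, alive):
        self.pid = pid
        self.all_pids = all_pids     # IDs of all processes are known in advance
        self.alive = alive           # alive(pid) -> bool, models timeout detection
        self.leader = None

    def on_leader_crash(self):
        """Called when this node's failure detector says the leader is gone."""
        higher = [p for p in self.all_pids if p > self.pid]
        responders = [p for p in higher if self.alive(p)]
        if not responders:
            # Nobody higher answered before the timeout: declare ourselves winner.
            self.leader = self.pid
            return ("COORDINATOR", self.pid)     # would be multicast to all
        # Otherwise, a higher node replied OK and runs its own election;
        # we wait for its COORDINATOR announcement.
        return ("WAITING", max(responders))

    def on_coordinator(self, pid):
        self.leader = pid

# Example: processes 1..5 exist; 5 (the old leader) and 4 have crashed.
up = {1: True, 2: True, 3: True, 4: False, 5: False}
n3 = BullyNode(3, all_pids=[1, 2, 3, 4, 5], alive=lambda p: up[p])
print(n3.on_leader_crash())   # ('COORDINATOR', 3): highest live ID wins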

Aside on consensus Elections are a specific example of a more general problem: consensus Given a set of N processes in a distributed system, how can we get them all to agree on something? Classical treatment has every process P i propose something (a value V i ) Want to arrive at some deterministic function of V i s (e.g. majority or maximum will work for election) A correct solution to consensus must satisfy: Agreement: all nodes arrive at the same answer Validity: answer is one that was proposed by someone Termination: all nodes eventually decide 7 Consensus is impossible Famous result due to Fischer, Lynch & Patterson (1985) Focuses on an asynchronous network (unbounded delays) with at least one process failure Shows that it is possible to get an infinite sequence of states, and hence never terminate Given the Internet is an asynchronous network, then this seems to have major consequences!! Not really: Result actually says we can t always guarantee consensus, not that we can never achieve consensus And in practice, we can use tricks to mask failures (such as reboot, or replication), and to ignore asynchrony Have seen solutions already, and will see more later 8 4
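The illustration mentioned above: tiny deterministic decision functions over a set of proposals (helper names are ours, not the lecture's). The hard part, which FLP is about, is getting every node to see the same set of proposals despite failures and asynchrony; once they do, any deterministic function gives Agreement, and choosing among the proposed values gives Validity.

from collections import Counter

def decide_maximum(proposals):
    """Maximum proposed value; e.g. elects the highest proposed ID."""
    return max(proposals.values())

def decide_majority(proposals):
    """Most common proposed value; ties broken deterministically by value."""
    counts = Counter(proposals.values())
    return max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]

proposals = {"P1": 1, "P2": 2, "P3": 5, "P5": 5}   # V_i proposed by each P_i
print(decide_maximum(proposals))    # 5: highest proposed value wins
print(decide_majority(proposals))   # 5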

Transaction processing systems

Last term we looked at transactions: ACID properties.
- Support for composite operations (i.e. a collection of reads and updates to a set of objects)
- A transaction is atomic ("all-or-nothing"):
  - If it commits, all operations are applied
  - If it aborts, it's as if nothing ever happened
- A committed transaction moves the system from one consistent state to another
- Transaction processing systems also provide:
  - isolation (between concurrent transactions)
  - durability (committed transactions survive a crash)

Q: Can we bring the {scalability, fault tolerance, ...} benefits of distributed systems to transaction processing?

Distributed transactions

- The scheme described last term was client/server, e.g. a program (client) accessing a database (server)
- However, distributed transactions are those which span multiple transaction processing servers
- E.g. booking a complex trip from London to Vail, CO:
  - Could fly LHR -> LAX -> EGE and hire a car, or fly LHR -> ORD -> DEN and take a public bus
  - Want a complete trip (i.e. atomicity): don't get stuck in an airport with no onward transport!
- Must coordinate actions across multiple parties

A model of distributed transactions

[Figure: clients C1 and C2 interact with servers S1 (x=5, y=0, z=3), S2 (a=7, b=8, c=1) and S3 (i=2, j=4).
 C1 runs T1: transaction { if (x < 2) abort; a := z; j := j + 1; }
 C2 runs T2: transaction { z := i + j; }]

- Multiple servers (S1, S2, S3, ...), each holding some objects which can be read and written within client transactions
- Multiple concurrent clients (C1, C2, ...) who perform transactions that interact with one or more servers
  - E.g. T1 reads x, z from S1, writes a on S2, and reads & writes j on S3
  - E.g. T2 reads i, j from S3, then writes z on S1
- A successful commit implies agreement at all servers

Implementing distributed transactions

- Can build on top of the solution for a single server:
  - e.g. use locking or shadowing to provide isolation
  - e.g. use a write-ahead log for durability
- Need to coordinate to either commit or abort
- Assume clients create a unique transaction ID: TXID
  - The client uses the TXID in every read or write request to a server S_i
  - The first time S_i sees a given TXID, it starts a tentative transaction associated with that transaction ID
- When the client wants to commit, it must perform an atomic commit of all tentative transactions across all servers
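A minimal sketch of how a server might keep tentative per-TXID state as described above. The class and method names are illustrative; a real server would also use locking or shadowing for isolation and a write-ahead log for durability.

class Server:
    def __init__(self, objects):
        self.objects = objects          # committed state, e.g. {"x": 5, "z": 3}
        self.tentative = {}             # TXID -> tentative copies of touched objects

    def _tx(self, txid):
        # First time we see a TXID, start a tentative transaction for it.
        return self.tentative.setdefault(txid, {})

    def read(self, txid, name):
        tx = self._tx(txid)
        return tx.get(name, self.objects[name])

    def write(self, txid, name, value):
        self._tx(txid)[name] = value    # buffered until an atomic commit

    def commit(self, txid):
        self.objects.update(self.tentative.pop(txid, {}))

    def abort(self, txid):
        self.tentative.pop(txid, None)

s1 = Server({"x": 5, "y": 0, "z": 3})
s1.write("T2", "z", 2 + 4)              # T2: z := i + j, with i, j read from S3
s1.commit("T2")                         # only reached if *all* servers agree
print(s1.objects["z"])                  # 6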

Atomic commit protocols

- A naive solution would have the client simply invoke commit(txid) on each server in turn
  - This works only if there are no concurrent conflicting clients, every server commits (or aborts), and no server crashes
- To handle concurrent clients, introduce a coordinator:
  - A designated machine (can be one of the servers)
  - Clients ask the coordinator to commit on their behalf, and hence the coordinator can serialize concurrent commits
- To handle inconsistency/crashes, the coordinator:
  - Asks all involved servers if they could commit TXID
  - Servers S_i reply with a vote V_i = { COMMIT, ABORT }
  - If all V_i = COMMIT, the coordinator multicasts docommit(txid)
  - Otherwise, the coordinator multicasts doabort(txid)

Two-phase commit (2PC)

[Figure: message timeline over physical time: coordinator C sends cancommit(txid)? to servers S1, S2, S3, collects their votes, then multicasts docommit(txid).]

This scheme is called two-phase commit (2PC):
- The first phase is voting: collect votes from all parties
- The second phase is completion: either abort or commit
- Doesn't require ordered multicast, but needs reliability
  - If a server fails to respond by the timeout, treat that as a vote to abort
- Once all ACKs are received, inform the client of the successful commit
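A sketch of the coordinator side of 2PC as just described: phase 1 collects votes, phase 2 multicasts the outcome. The Participant class and ask_vote helper are our own illustration of the cancommit/docommit messages; timeouts are assumed to surface as ABORT votes.

COMMIT, ABORT = "COMMIT", "ABORT"

def two_phase_commit(txid, participants, ask_vote):
    """ask_vote(server, txid) returns COMMIT or ABORT; a server that fails
    to respond by the timeout should be reported as ABORT."""
    # Phase 1: voting.
    votes = [ask_vote(s, txid) for s in participants]
    decision = COMMIT if all(v == COMMIT for v in votes) else ABORT
    # Phase 2: completion. Multicast the decision to all participants.
    for s in participants:
        if decision == COMMIT:
            s.do_commit(txid)
        else:
            s.do_abort(txid)
    return decision   # coordinator then informs the client

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
    def can_commit(self, txid):
        return COMMIT if self.will_commit else ABORT
    def do_commit(self, txid):
        print(f"{self.name}: docommit({txid})")
    def do_abort(self, txid):
        print(f"{self.name}: doabort({txid})")

servers = [Participant("S1"), Participant("S2"), Participant("S3")]
print(two_phase_commit("tx42", servers, lambda s, t: s.can_commit(t)))  # COMMIT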

2PC: additional details

- The client (or any server) can abort during execution: it simply multicasts doabort(txid) to all servers
  - E.g., if the client transaction throws an exception, or a server fails
- If a server votes NO, it can immediately abort locally
- If a server votes YES, it must be able to commit if subsequently asked by the coordinator:
  - Before voting to commit, the server will prepare by writing entries into its log and flushing them to disk
  - It also records all requests from & responses to the coordinator
  - Hence even if it crashes after voting to commit, it will be able to recover on reboot

2PC: coordinator crashes

- The coordinator must also persistently log events:
  - Including the initial message from the client, requesting votes, receiving replies, and the final decision made
  - This lets it reply if a (restarted) client or server asks for the outcome
  - It also lets the coordinator recover from a reboot, e.g. re-send any vote requests without responses, or reply to the client
- One additional problem occurs if the coordinator crashes after phase 1, but before initiating phase 2:
  - Servers will be uncertain of the outcome
  - If they voted to commit, they will have to continue to hold locks, etc.
  - Other schemes (3PC, Paxos, ...) can deal with this
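A minimal sketch of the persistent logging the slides call for, assuming a simple append-only file flushed to disk; the file name and JSON record format are our own. After a reboot, the log answers "what happened to TXID?" queries from restarted clients or servers.

import json, os

class CoordinatorLog:
    def __init__(self, path="coordinator.log"):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())        # must reach disk before acting on it

    def outcome(self, txid):
        """After a reboot, answer outcome queries; None means still uncertain."""
        decision = None
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["txid"] == txid and rec["type"] == "decision":
                        decision = rec["value"]
        return decision

log = CoordinatorLog()
log.append({"txid": "tx42", "type": "vote", "from": "S1", "value": "COMMIT"})
log.append({"txid": "tx42", "type": "decision", "value": "COMMIT"})
print(log.outcome("tx42"))               # COMMIT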

Replication

Many distributed systems involve replication:
- Multiple copies of some object stored at different servers
- Multiple servers capable of providing some operation(s)

Three key advantages:
- Load balancing: if we have many replicas, we can spread work from clients out between them
- Lower latency: if we replicate an object/server close to a client, it will get better performance
- Fault tolerance: we can tolerate the failure of some replicas and still provide service

Examples include DNS, web & file caching (& content-distribution networks), replicated databases, ...

Replication in a single system

A good single-system example is RAID (Redundant Array of Inexpensive Disks):
- Disks are cheap, so use several instead of just one
- If we replicate data across disks, we can tolerate a disk crash
- If we don't replicate data, we get the appearance of a single larger disk
- A variety of different configurations ("levels"):
  - RAID 0: stripe data across disks, i.e. block 0 to disk 0, block 1 to disk 1, block 2 to disk 0, and so on
  - RAID 1: mirror (replicate) data across disks, i.e. block 0 written to both disk 0 and disk 1
  - RAID 5: parity; write block 0 to disk 0, block 1 to disk 1, and (block 0 XOR block 1) to disk 2
- We get improved performance since we can access disks in parallel
- With RAID 1 and 5 we also get fault tolerance
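A tiny illustration of the RAID 5 parity idea from the slide: block 0 on disk 0, block 1 on disk 1, their XOR on disk 2, so either data block can be rebuilt from the other two. The 8-byte blocks are just example data.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

block0 = b"hello wo"                      # stored on disk 0
block1 = b"rld!    "                      # stored on disk 1
parity = xor_blocks(block0, block1)       # stored on disk 2

# Disk 1 fails: reconstruct block 1 from block 0 and the parity block.
recovered = xor_blocks(block0, parity)
assert recovered == block1
print(recovered)                          # b'rld!    '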

Distributed data replication

- Have some number of servers (S1, S2, S3, ...), each holding a copy of all objects
- Each client C_i can access any replica (any S_i), e.g. clients can choose the closest, or the least loaded
- If objects are read-only, then this is trivial:
  - Start with one primary server P holding all the data
  - If a client asks S_i for an object, S_i returns a copy (S_i fetches a copy from P if it doesn't already have a fresh one)
  - Can easily extend to allow updates by P: when updating object O, P sends invalidate(O) to all S_i (a small sketch of this scheme appears below)
  - In essence, this is how web caching / CDNs work today
- But what if clients can perform updates?

Replication and consistency

Things get more challenging if clients can perform updates.
- For example, imagine x has value 3 (in all replicas):
  - C1 requests write(x, 5) from S4
  - C2 requests read(x) from S3
  - What should occur?
- With strong consistency, the distributed system behaves as if there is no replication present:
  - i.e. in the above, C2 should get the value 5
  - This requires coordination between all servers
- With weak consistency, C2 may get 3 or 5 (or ...?)
  - Less satisfactory, but much easier to implement
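The sketch referred to above: the read-mostly scheme in which replicas fetch objects from the primary on demand and the primary invalidates cached copies when it updates an object. Class names are illustrative, and only the primary performs updates, as in the slide.

class Primary:
    def __init__(self, data):
        self.data = data
        self.replicas = []

    def update(self, name, value):
        self.data[name] = value
        for r in self.replicas:
            r.invalidate(name)           # send invalidate(O) to all S_i

class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.cache = {}
        primary.replicas.append(self)

    def read(self, name):
        if name not in self.cache:       # fetch a fresh copy if needed
            self.cache[name] = self.primary.data[name]
        return self.cache[name]

    def invalidate(self, name):
        self.cache.pop(name, None)

p = Primary({"x": 3})
s3, s4 = Replica(p), Replica(p)
print(s3.read("x"))    # 3, now cached at S3
p.update("x", 5)       # invalidates S3's copy
print(s3.read("x"))    # 5: the stale copy was discarded, so the read is fresh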

Replication for fault tolerance

Replication applies to services, not just data objects.
- The easiest case is a stateless service:
  - Simply duplicate the functionality over k machines
  - Clients use any of them (e.g. the closest), and fail over to another
- Very few services are totally stateless
  - But e.g. many web apps have only per-session "soft state": state generated per client, lost when the client leaves
- For example, multi-tier web farms (Facebook, ...):

[Figure: web servers and app servers (session soft state only) in front of cache servers and databases (consistent replication, transactions).]

Passive replication

A solution for stateful services is primary/backup: a backup server takes over in case of failure.
- Based around persistent logs and system checkpoints:
  - Periodically (or continuously) checkpoint the primary
  - If we detect a failure, start the backup from a checkpoint
- A few variants trade off fail-over time:
  - Cold standby: the backup server must start the service (software), load the checkpoint & parse the logs
  - Warm standby: the backup server has the software running in anticipation; it just needs to load the primary's state
  - Hot standby: the backup server mirrors the primary's work, but its output is discarded; on failure, enable its output
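A minimal sketch of checkpoint-plus-log primary/backup as described above, assuming the post-checkpoint operation log is shipped to (or persistently shared with) the backup so it is available at failover. All names are illustrative.

class PrimaryServer:
    def __init__(self):
        self.state = {}
        self.log = []                    # operations since the last checkpoint

    def apply(self, key, value):
        self.state[key] = value
        self.log.append((key, value))

    def checkpoint(self):
        snap = dict(self.state)          # snapshot of current state
        self.log = []                    # log now only holds post-checkpoint ops
        return snap

class BackupServer:
    def __init__(self):
        self.checkpoint_state = {}

    def receive_checkpoint(self, snap):
        self.checkpoint_state = dict(snap)

    def take_over(self, replay_log):
        state = dict(self.checkpoint_state)
        for key, value in replay_log:    # replay operations logged after the checkpoint
            state[key] = value
        return state                     # backup now serves as the new primary

primary, backup = PrimaryServer(), BackupServer()
primary.apply("x", 3)
backup.receive_checkpoint(primary.checkpoint())
primary.apply("x", 5)                    # logged, not yet checkpointed
print(backup.take_over(primary.log))     # {'x': 5}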

Active replication

An alternative is to have k replicas running at all times.
- A front-end server acts as an ordering node:
  - It receives requests from clients and forwards them to all replicas using totally ordered multicast
  - The replicas each perform the operation and respond to the front-end
  - The front-end gathers the responses, and replies to the client
- Typically requires replicas to be "state machines":
  - i.e. they act deterministically based on their input
  - The idea is that all replicas operate in lock step
- Active replication is expensive (in terms of resources) and not really worth it in the common case
  - However, it is valuable if we consider Byzantine failures

Summary + next time

Summary:
- Leader elections + distributed consensus
- Distributed transactions + atomic commit protocols
- Replication + consistency

Next time:
- (More) replication and consistency
- Strong consistency
- Quorum-based systems
- Weaker consistency
- Consistency, availability and partitions
- Further replication models
- Start of Google case studies