Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Last time

- Saw how we can build ordered multicast:
  - Messages between processes in a group
  - Need to distinguish receipt and delivery
  - Several ordering options: FIFO, causal or total
- Considered distributed mutual exclusion:
  - Want to arrange that only one process can enter the critical section at a time
  - Central server: OK, but a bottleneck & single point of failure
  - Token passing: OK, but traffic, repair, token loss
  - Totally-ordered multicast: OK, but a high number of messages and problems with failures

Transaction Processing Systems

- Last term we looked at transactions:
  - Support for composite operations (i.e. a collection of reads and updates to a set of objects)
  - A transaction is atomic ("all-or-nothing"):
    - If it commits, all operations are applied
    - If it aborts, it is as if nothing ever happened
  - A committed transaction moves the system from one consistent state to another
- Transaction processing systems also provide:
  - Isolation (between concurrent transactions)
  - Durability (committed transactions survive a crash)
- Together: Atomicity, Consistency, Isolation, Durability (ACID)

A Model of Distributed Transactions

    T1: transaction { if (x < 2) abort; a := z; j := j + 1; }
    T2: transaction { z := i + j; }

(Figure: server S1 holds x = 5, y = 0, z = 3; S2 holds a = 7, b = 8, c = 1; S3 holds i = 2, j = 4.)

- Multiple servers (S1, S2, S3, ...), each holding some objects that can be read and written within client transactions
- Multiple concurrent clients (C1, C2, ...) who perform transactions that interact with one or more servers
  - e.g. C1 reads x, z from S1, writes a on S2, and reads & writes j on S3
  - e.g. C2 reads i, j from S3, then writes z on S1
- A successful commit implies agreement at all servers

How do we implement them?

- Build on top of the solution for a single server:
  - e.g. use locking or shadowing to provide isolation
  - e.g. use a write-ahead log for durability
- Need to coordinate so that the servers either all commit or all abort
- Assume each client creates a unique transaction ID, TXID:
  - The client uses the TXID in read or write requests to server Si
  - The first time Si sees a given TXID, it starts a tentative transaction associated with that transaction ID
  - When the client wants to commit, it must perform an atomic commit of all tentative transactions across all servers
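As a rough illustration (not part of the original slides), here is a minimal Python sketch of the server side: the first read or write carrying a new TXID starts a tentative transaction whose writes are buffered until commit or abort. The class and method names are invented for the example.

    # Minimal sketch: a server that starts a tentative transaction the first
    # time it sees a TXID, buffering writes until commit/abort.
    class Server:
        def __init__(self):
            self.objects = {}        # committed object values
            self.tentative = {}      # TXID -> buffered writes {name: value}

        def read(self, txid, name):
            # Serve the transaction's own tentative write if present,
            # otherwise the committed state.
            writes = self.tentative.setdefault(txid, {})
            return writes.get(name, self.objects.get(name))

        def write(self, txid, name, value):
            self.tentative.setdefault(txid, {})[name] = value

        def commit(self, txid):
            # Apply all buffered writes at once (single-server case).
            self.objects.update(self.tentative.pop(txid, {}))

        def abort(self, txid):
            self.tentative.pop(txid, None)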

Atomic Commit Protocols

- A naive solution: the client simply invokes commit(TXID) on each server in turn
  - This works only if there are no concurrent conflicting clients, every server commits (or aborts), and no server crashes!
- To handle concurrent clients, introduce a coordinator:
  - A designated machine (can be one of the servers)
  - All clients ask the coordinator to commit on their behalf, and hence the coordinator can serialize concurrent commits
- To handle inconsistency/crashes, the coordinator:
  - Asks all involved servers whether they could commit TXID
  - Each server Si replies with a vote Vi ∈ {COMMIT, ABORT}
  - If all Vi = COMMIT, the coordinator multicasts doCommit(TXID)
  - Otherwise, the coordinator multicasts doAbort(TXID)
- Q: what assumption is this (still) making?

Two-Phase Commit (2PC)

(Figure: against physical time, coordinator C sends canCommit(TXID)? to S1, S2 and S3; each replies YES; C then sends doCommit(TXID) and each server replies ACK.)

- This scheme is called two-phase commit (2PC):
  - The first phase is voting: collect votes from all parties
  - The second phase is completion: either abort or commit
- Doesn't require ordered multicast, but does need reliability:
  - If a server fails to respond by a timeout, treat this as a vote to abort
- Once the coordinator has all ACKs, it informs the client of the successful commit
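A minimal Python sketch of the coordinator's two phases, assuming hypothetical participant objects that expose can_commit/do_commit/do_abort calls and raise TimeoutError when a server does not respond; this is an illustration, not the lecture's own code.

    # Minimal sketch of a 2PC coordinator (assumed participant interface).
    def two_phase_commit(txid, participants):
        # Phase 1 (voting): ask every participant whether it can commit.
        for p in participants:
            try:
                if not p.can_commit(txid):     # server votes ABORT
                    decision = "ABORT"
                    break
            except TimeoutError:               # no reply counts as a vote to abort
                decision = "ABORT"
                break
        else:
            decision = "COMMIT"                # all voted YES

        # Phase 2 (completion): tell everyone the outcome.
        for p in participants:
            if decision == "COMMIT":
                p.do_commit(txid)              # participants ACK; retried until acked in practice
            else:
                p.do_abort(txid)
        return decision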

2PC: Additional Details

- The coordinator can abort during execution: it simply multicasts doAbort(TXID) to all servers
  - e.g. in response to a server voting NO (abort)
- If a server votes NO, it can immediately abort locally
- If a server votes YES, it must be able to commit if subsequently asked by the coordinator:
  - Before voting to commit, the server will "prepare" by writing entries into its log and flushing them to disk
  - It also records all requests from, and responses to, the coordinator
  - Hence even if it crashes after voting to commit, it will be able to recover on reboot
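As an illustration of the "prepare" step, here is a minimal sketch in which a plain append-only file stands in for the server's write-ahead log; the record layout and function name are assumptions made for the example.

    # Minimal sketch: durably record tentative writes before voting YES in 2PC.
    import json, os

    def prepare_and_vote(txid, tentative_writes, log_path):
        """Write a PREPARED record and flush it to stable storage; only then is
        it safe to vote COMMIT, because the server can honour a later doCommit
        even if it crashes and reboots in between."""
        record = {"txid": txid, "state": "PREPARED", "writes": tentative_writes}
        with open(log_path, "a") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())   # force the record to disk before voting
        return "COMMIT"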

2PC: Coordinator Crashes

- The coordinator must persistently log events, too:
  - Including the initial message from the client, requesting votes, receiving replies, and the final decision made
  - It can then reply if a (restarted) client or server asks for the outcome
  - This also lets the coordinator recover from a restart, e.g. re-send any vote requests that got no responses, or reply to the client
- If the coordinator crashes after phase 1, but before initiating phase 2:
  - Q: we still have a problem; what is it?
  - Servers will be uncertain of the outcome
  - If they voted to commit, they have to continue to hold locks, etc.
- Other schemes (3PC, Paxos, ...) can deal with this
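A minimal sketch of the coordinator's persistent decision log, again using a plain file as a stand-in; recover_outcome returning None corresponds to the problematic case where the coordinator crashed between the two phases and no decision was ever recorded.

    # Minimal sketch: the coordinator logs its decision before starting phase 2,
    # so a restarted coordinator (or a querying server) can learn the outcome.
    import json, os

    def log_decision(txid, decision, log_path="coordinator.log"):
        with open(log_path, "a") as log:
            log.write(json.dumps({"txid": txid, "decision": decision}) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def recover_outcome(txid, log_path="coordinator.log"):
        """Return the logged decision for txid, or None if no decision was
        reached before the crash (the case that leaves servers blocked)."""
        if not os.path.exists(log_path):
            return None
        with open(log_path) as log:
            for line in log:
                record = json.loads(line)
                if record["txid"] == txid:
                    return record["decision"]
        return None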

Leader Election

- Many schemes are built on the notion of having a well-defined leader (master, coordinator)
  - Examples seen so far include the Berkeley time synchronization protocol and the central lock server
- An election algorithm is a dynamic scheme to choose a unique process to play a certain role
  - Assume each process Pi contains a state variable elected_i
  - When a process joins the group, elected_i = UNDEFINED
- By the end of the election, for every Pi, either:
  - elected_i = Px, where Px is the winner of the election, or
  - elected_i = UNDEFINED, or
  - Pi has crashed or otherwise left the system

Ring-Based Election

- The system has a coordinator, who crashes
- Some process notices, and starts an election:
  - Goal: find the node with the highest ID, which will be the new leader
  - The initiator puts its ID into a message, and sends it to its successor on the ring
- On receipt, a process acks to the sender (not shown), then appends its own ID and forwards the election message
- The election is finished when a process receives a message containing its own ID
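A minimal, single-machine sketch of one round of this ring-based election (a Python list stands in for the ring and for the network): the message travels all the way round, collecting IDs, and the highest ID collected wins.

    # Minimal sketch of the ring election from the slide.
    def ring_election(ring_ids, starter_index):
        """ring_ids: process IDs in ring order; starter_index: who noticed the
        crash. Returns the elected leader (the highest ID on the ring)."""
        n = len(ring_ids)
        message = [ring_ids[starter_index]]       # starter puts its ID in the message
        current = (starter_index + 1) % n
        while ring_ids[current] not in message:   # keep forwarding around the ring
            message.append(ring_ids[current])     # each process appends its own ID
            current = (current + 1) % n
        return max(message)                       # highest ID wins the election

    # Example: processes 3, 17, 5, 9 on the ring; the process at index 0 starts.
    # ring_election([3, 17, 5, 9], 0) == 17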

The Bully Algorithm

- The algorithm proceeds by attempting to elect the process with the highest ID that is still alive
  - Assumes that we know the IDs of all processes
  - Assumes we can reliably detect failures by timeouts
- If a process Pi sees that the leader has crashed, it sends an election message to all processes with higher IDs, and starts a timer
  - Concurrent election initiation by multiple processes is fine
- Processes receiving an election message reply OK to the sender, and start an election of their own (if one is not already in progress)
- If a process hears nothing back before its timeout, it declares itself the winner and multicasts the result
- A dead process that recovers (or a new process that joins) also starts an election: this is fine, as the highest ID is always elected
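A rough sketch of the bully algorithm's core decision, with a hypothetical alive(pid) predicate standing in for "did this process reply OK before the timeout?"; real messaging, timers and the coordinator announcement are elided.

    # Minimal sketch of the bully election as seen from one process.
    def bully_election(my_id, all_ids, alive):
        """alive(pid) -> bool abstracts whether pid replied OK before the timeout.
        Returns the ID of the new leader as seen by this process."""
        higher = [pid for pid in all_ids if pid > my_id]
        if not any(alive(pid) for pid in higher):
            # Nobody with a higher ID answered: declare myself the winner
            # (in the real protocol, multicast this result to everyone).
            return my_id
        # Some higher-ID process answered OK; the highest live one will
        # eventually win its own election and announce itself.
        return max(pid for pid in higher if alive(pid))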

Problems with Elections

- These algorithms rely on timeouts to reliably detect failure
- However, it is possible for the network itself to fail: a network partition
  - Some processes can speak to others, but not all
- This can lead to "split-brain" syndrome:
  - Every partition independently elects a leader => too many bosses!
- To fix this, we need some secondary (and tertiary?) communication scheme
  - e.g. a secondary network, shared disk, serial cables, ...

Consensus

- Elections are a specific example of a more general problem: consensus
  - Given a set of n processes in a distributed system, how can we get them all to agree on something?
- The classical treatment has every process Pi propose something (a value Vi)
  - We want to arrive at some deterministic function of the Vi's (e.g. majority or maximum will work for an election)
- A correct solution to consensus must satisfy:
  - Agreement: all nodes arrive at the same answer
  - Validity: the answer is one that was proposed by someone
  - Termination: all nodes eventually decide
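For concreteness, here are the two deterministic decision functions mentioned above, sketched in Python. If every process applied the same function to the same set of proposals it would trivially get agreement and validity; the hard part, which the next slide addresses, is reaching that common view of the proposals despite failures and asynchrony.

    # Minimal sketch: deterministic decision functions over proposed values.
    from collections import Counter

    def decide_maximum(proposals):
        # e.g. for leader election: pick the highest proposed ID.
        return max(proposals)

    def decide_majority(proposals):
        # Pick the most frequently proposed value (ties broken deterministically).
        counts = Counter(proposals)
        best = max(counts.values())
        return min(v for v, c in counts.items() if c == best)

    # Example: decide_maximum([3, 17, 5]) == 17; decide_majority([1, 2, 2]) == 2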

Consensus is impossible

- A famous result by Fischer, Lynch & Paterson (1985)
  - Focuses on an asynchronous network (unbounded delays) with at least one process failure
  - Shows that it is possible to get an infinite sequence of states, and hence never terminate
- Given that the Internet is an asynchronous network, this seems to have major consequences!
- Not really:
  - The result actually says we can't always guarantee consensus, not that we can never achieve consensus
  - In practice, we can use tricks to mask failures (such as reboot, or replication), and to ignore asynchrony
  - We have seen solutions already, and will see more later

Replication

- Many distributed systems involve replication:
  - Multiple copies of some object stored at different servers
  - Multiple servers capable of providing some operation(s)
- Three key advantages:
  - Load balancing: with many replicas, we can spread work from clients across them
  - Lower latency: if we replicate an object/server close to a client, the client sees better performance
  - Fault tolerance: we can tolerate the failure of some replicas and still provide service
- Examples include DNS, web & file caching (& CDNs), replicated databases, etc.

Replication of data

- Have some number of servers (S1, S2, S3, ...)
  - Each holds a copy of all objects
- Each client Ci can access any replica (any Si)
  - e.g. clients can choose the closest, or the least loaded
- If objects are read-only, then this is trivial:
  - Start with one primary server P holding all the data
  - If a client asks Si for an object, Si returns a copy (Si fetches a copy from P if it doesn't already have one)
- Can easily extend this to allow updates by P:
  - When updating object O, P sends invalidate(O) to all Si
  - (Or just tag all objects with a valid-until field)
- This is essentially how web caching / CDNs work today
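A minimal in-memory sketch of this scheme (invented class names, no real network): replicas fetch objects from the primary on a miss, and the primary pushes invalidations when it updates an object.

    # Minimal sketch: primary-driven invalidation of cached replicas.
    class Primary:
        def __init__(self, objects):
            self.objects = objects
            self.replicas = []

        def update(self, name, value):
            self.objects[name] = value
            for r in self.replicas:          # push invalidate(O) to every replica
                r.invalidate(name)

    class Replica:
        def __init__(self, primary):
            self.primary = primary
            self.cache = {}
            primary.replicas.append(self)

        def get(self, name):
            if name not in self.cache:       # miss: fetch a copy from the primary
                self.cache[name] = self.primary.objects[name]
            return self.cache[name]

        def invalidate(self, name):
            self.cache.pop(name, None)       # drop the stale copy; refetch on next read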

Replication of services

- Can also use replication for fault-tolerant services
- Easiest for stateless services:
  - Simply duplicate the functionality on k machines
  - Clients use any of them (e.g. the closest), and fail over to another if it fails
- Very few services are totally stateless
  - But e.g. many web apps only have per-session "soft state"
  - State generated per client, lost when the client leaves
- Often used in multi-tier web farms (e.g. Facebook):

(Figure: replicated tiers of web servers, app servers and cache servers holding session soft state only, backed by replicated databases using consistent replication with transactions.)
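A small sketch of client-side failover across stateless replicas; the call(server, request) helper is a hypothetical stand-in for whatever RPC or HTTP mechanism the service actually uses.

    # Minimal sketch: try the preferred replica, fail over to the others.
    def request_with_failover(replicas, request, call):
        """replicas: servers ordered by preference (e.g. closest first).
        call(server, request) is assumed to raise ConnectionError on failure."""
        last_error = None
        for server in replicas:
            try:
                return call(server, request)   # any replica can serve a stateless request
            except ConnectionError as err:
                last_error = err               # this replica is down: try the next one
        raise RuntimeError("all replicas failed") from last_error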

Passive Replication

- A solution for stateful services is primary/backup:
  - A backup server takes over in case of failure
- Based on persistent logs and system checkpoints:
  - Periodically (or continuously) checkpoint the primary
  - If we detect a failure, start the backup from the checkpoint
- A few variants trade off fail-over time:
  - Cold standby: the backup must start the service (software), load the checkpoint & parse the logs
  - Warm standby: the backup has the software running in anticipation; it just needs to load the primary's state
  - Hot standby: the backup mirrors the primary's work, but its output is discarded; on failure, enable the output
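A minimal sketch of the recovery step shared by cold and warm standby: load the last checkpoint, then replay the operations logged since it. In-memory dictionaries stand in for persistent storage, and apply_op is a placeholder for the service's own operation logic.

    # Minimal sketch: checkpoint + log replay on the backup.
    def checkpoint(state):
        return dict(state)                       # snapshot of the primary's state

    def recover(last_checkpoint, log_since_checkpoint, apply_op):
        """Rebuild the primary's state on the backup: start from the checkpoint
        and replay every operation logged after it, in order."""
        state = dict(last_checkpoint)
        for op in log_since_checkpoint:
            apply_op(state, op)
        return state

    # Example: with apply_op = lambda s, op: s.update({op[0]: op[1]}),
    # recover({"x": 1}, [("y", 2), ("x", 3)], apply_op) yields {"x": 3, "y": 2}.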

Active Replication

- Alternative: have k replicas running at all times
- A front-end server acts as an ordering node:
  - It receives requests from the client and forwards them to all replicas using totally ordered multicast
  - Replicas perform the operation and respond to the front-end
  - The front-end gathers the responses and replies to the client
- Typically requires replicas to be "state machines":
  - i.e. they act deterministically based on their input
  - The idea is that all replicas operate in lock step
- Active replication is expensive (in resource terms) and not really worth it in the common case
- But it is valuable if we may have Byzantine failures
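A minimal sketch of the state-machine idea: because each replica applies the same deterministic operations in the same total order, all replicas stay identical. A Python list stands in for totally ordered multicast, and the class names are invented for the example.

    # Minimal sketch: deterministic state-machine replicas behind a front-end.
    class StateMachineReplica:
        def __init__(self):
            self.state = {}

        def apply(self, op):
            # Must be deterministic: same op sequence => same state everywhere.
            key, value = op
            self.state[key] = value
            return self.state[key]

    def front_end(replicas, ordered_requests):
        """Forward each request, in the agreed total order, to every replica and
        return the (matching) responses to the last request."""
        responses = []
        for op in ordered_requests:
            responses = [r.apply(op) for r in replicas]
        return responses

    # Example: two replicas fed [("x", 1), ("y", 2)] end up with identical state.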

Next time

- Replication and consistency:
  - Strong consistency
  - Quorum-based systems
  - Weaker consistency
  - Consistency, availability and partitions
- Distributed systems case studies:
  - Google, Microsoft, Amazon