Primary/Backup. CS6450: Distributed Systems Lecture 3/4. Ryan Stutsman


Primary/Backup CS6450: Distributed Systems Lecture 3/4 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT 6.824 by Robert Morris, Frans Kaashoek, and Nickolai Zeldovich. 1

Simplified Fault Tolerance in MapReduce [figure: MapReduce execution overview: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits and write intermediate files to local disk; reduce workers read them remotely and write the output files] MapReduce used GFS, stateless workers, and the clients themselves to achieve fault tolerance. 2

Plan 1. Introduction to Primary-Backup replication 2. Case study: VMware's fault-tolerant virtual machine 3

Primary-Backup: Goals Mechanism: Replicate and separate servers Goal #1: Provide a highly reliable service Despite some server and network failures Continue operation after failure Goal #2: Servers should behave just like a single, more reliable server 4

State machine replication Any server is essentially a state machine Set of (key, value) pairs is state Operations transition between states Need an op to be executed on all replicas, or none at all i.e., we need distributed all-or-nothing atomicity If op is deterministic, replicas will end in same state Key assumption: Operations are deterministic This assumption can be (carefully) relaxed 5
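
To make the state-machine view concrete, here is a minimal sketch (not from the slides) of a key/value state machine in Go: the replicated state is the map, and each operation is a deterministic transition, so replicas that apply the same operations in the same order end in the same state. The Op and KVStateMachine types are illustrative names.

```go
package main

import "fmt"

// Op is one client operation; "get" and "put" are the only kinds here.
type Op struct {
	Kind  string // "get" or "put"
	Key   string
	Value string
}

// KVStateMachine holds the replicated state: a set of (key, value) pairs.
type KVStateMachine struct {
	state map[string]string
}

// Apply is deterministic: given the same sequence of Ops, every replica
// that applies them in the same order ends in the same state.
func (sm *KVStateMachine) Apply(op Op) string {
	switch op.Kind {
	case "put":
		sm.state[op.Key] = op.Value
		return ""
	case "get":
		return sm.state[op.Key]
	}
	return ""
}

func main() {
	sm := &KVStateMachine{state: make(map[string]string)}
	sm.Apply(Op{Kind: "put", Key: "x", Value: "1"})
	fmt.Println(sm.Apply(Op{Kind: "get", Key: "x"})) // prints "1"
}
```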

Primary-Backup (P-B) approach Nominate one server the primary, call the other the backup Clients send all operations (get, put) to the current primary The primary orders clients' operations Should be only one primary at a time Need to keep clients, primary, and backup in sync: who is primary and who is backup 6

Challenges Network and server failures Network partitions Within each network partition, near-perfect communication between servers Between network partitions, no communication between servers 7

Primary-Backup (P-B) approach [figure: Client, S1 (Primary), S2 (Backup)] 1. Primary logs the operation locally 2. Primary sends the operation to the backup and waits for an ack Backup performs it or just adds it to its log 3. Primary performs the op and acks to the client After the backup has applied the operation and ack'ed 8
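
A minimal sketch of the primary's request path described above, assuming an RPC stub for the backup: log locally, forward and wait for the backup's ack, then apply and reply to the client. The Primary and BackupClient types and the Forward method are hypothetical names for illustration.

```go
package pb

// Op is one client operation ("get" or "put").
type Op struct {
	Kind, Key, Value string
}

// BackupClient stands in for the RPC connection to the backup.
type BackupClient struct{}

// Forward sends the op to the backup and blocks until the backup
// acknowledges having applied (or at least logged) it; stubbed out here.
func (b *BackupClient) Forward(op Op) error { return nil }

// Primary holds the primary's log, key/value state, and backup connection.
type Primary struct {
	log    []Op
	state  map[string]string
	backup *BackupClient
}

// HandleClientOp follows the three slide steps, in order.
func (p *Primary) HandleClientOp(op Op) (string, error) {
	p.log = append(p.log, op) // 1. log the operation locally
	if err := p.backup.Forward(op); err != nil {
		return "", err // 2. forward to the backup and wait for its ack
	}
	// 3. apply locally and ack the client only after the backup has ack'ed
	if op.Kind == "put" {
		p.state[op.Key] = op.Value
		return "", nil
	}
	return p.state[op.Key], nil
}
```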

View server A view server decides who is primary, who is backup Clients and servers depend on the view server Don't decide on their own (might not agree) Challenge in designing the view service: Only want one primary at a time Careful protocol design needed For now, assume the view server never fails 9

Monitoring server liveness Each replica periodically pings view server View server declares replica dead if it missed N pings in a row Considers replica alive after single ping Can a replica be alive but declared dead by view server? Yes, in the case of network failure or partition 10
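
The liveness rule above can be captured in a few lines. This is a sketch under assumed names and constants (PingInterval, DeadPings, ViewServer): a replica is declared dead after missing N consecutive ping intervals and is considered alive again after a single ping.

```go
package viewservice

import "time"

const (
	PingInterval = 100 * time.Millisecond // assumed ping period
	DeadPings    = 5                      // N missed pings in a row => declared dead
)

// ViewServer tracks the time of the last ping from each replica.
type ViewServer struct {
	lastPing map[string]time.Time // server name -> time of its last ping
}

// Ping records that a replica checked in; a single ping makes it alive again.
func (vs *ViewServer) Ping(server string, now time.Time) {
	vs.lastPing[server] = now
}

// Dead reports whether the view server now considers a replica failed.
// Note: a partitioned replica can still be running yet be declared dead.
func (vs *ViewServer) Dead(server string, now time.Time) bool {
	last, ok := vs.lastPing[server]
	return !ok || now.Sub(last) > DeadPings*PingInterval
}
```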

The view server decides the current view View = (view #, primary server, backup server) Challenge: All parties make their own local decision of the current view number [figure: Client and servers S1, S2, S3 as the view advances: (1, S1, S2) with S1 primary and S2 backup; (2, S2, -) after S1 fails, with S2 promoted to primary; (3, S2, S3) with S3 promoted from idle to backup] 11

Agreeing on the current view In general, any number of servers can ping view server Okay to have a view with a primary and no backup Want everyone to agree on the view number Include the view # in RPCs between all parties 12
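
The view tuple and the "include the view # in RPCs" rule translate directly into data types. A sketch, with illustrative field names; an empty Backup string stands for "no backup in this view".

```go
package viewservice

// View is the tuple from the slide: (view #, primary server, backup server).
type View struct {
	ViewNum uint
	Primary string
	Backup  string // "" means the view has a primary but no backup
}

// ForwardArgs shows the convention: every RPC between client, primary,
// backup, and view server carries the sender's view number, so a receiver
// can detect and reject requests sent under a stale view.
type ForwardArgs struct {
	ViewNum uint
	Op      string // "get" or "put"
	Key     string
	Value   string
}
```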

Transitioning between views How do we ensure the new primary has up-to-date state? Only promote a previous backup i.e., don't make a previously-idle server primary State transfer can take a while, so it may take time for the backup to take up its role How does the view server know whether the backup is up to date? View server sends a view-change message to all Primary must ack the new view once the backup is up-to-date View server stays with the current view until the ack Even if the primary has failed or appears to have failed 13
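
A sketch of the view-change rule just described, under assumed field names: the view server may have a new view ready, but it does not advance past the current view until the current primary has acknowledged it (which the primary does only once its backup is up to date).

```go
package viewservice

// view is the (view #, primary, backup) tuple.
type view struct {
	num     uint
	primary string
	backup  string
}

// viewServerState is the view server's bookkeeping for view changes.
type viewServerState struct {
	current view
	acked   bool  // has the current primary acked the current view?
	pending *view // next view, if a change is wanted
}

// tryAdvance installs the pending view only when the current view is acked;
// otherwise the view server stays put, even if the primary appears dead.
func (s *viewServerState) tryAdvance() {
	if s.pending != nil && s.acked {
		s.current = *s.pending
		s.acked = false
		s.pending = nil
	}
}
```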

Split Brain [figure: S1 still holds view (1, S1, S2) and believes it is primary, while the view server and S2 have moved on to view (2, S2, -); the client may reach either server] 14

Server S2 in the old view [figure: the view server has advanced to (2, S2, -), while both S1 and S2 still hold the old view (1, S1, S2)] 15

Server S2 in the new view [figure: the view server and S2 have both moved to view (2, S2, -), while S1 still holds the old view (1, S1, S2)] 16

State transfer via operation log How does a new backup get the current state? If S2 is backup in view i but was not in view i-1, S2 asks the primary to transfer the state One alternative: transfer the entire operation log Simple, but inefficient (the operation log is long) 17

State transfer via snapshot Every op must be either before or after the state transfer If an op is before the transfer, the transfer must reflect the op If an op is after the transfer, the primary forwards the op to the backup after the state transfer finishes If each client has only one RPC outstanding at a time, state = map + result of the last RPC from each client (Had to save this anyway for at-most-once RPC) 18
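
The snapshot the slide describes is small: the key/value map plus the last reply to each client (which the primary already keeps for at-most-once RPC). A sketch with illustrative names (kvServer, Snapshot); ops arriving after this copy are forwarded to the backup rather than included in it.

```go
package pb

// kvServer holds the state a primary must be able to hand to a new backup.
type kvServer struct {
	state     map[string]string
	lastReply map[string]string // client ID -> reply to its last RPC
}

// Snapshot is the transferable state: the map plus per-client last replies.
type Snapshot struct {
	KV        map[string]string
	LastReply map[string]string
}

// snapshot copies the server's state so it can be sent to a new backup.
func (s *kvServer) snapshot() Snapshot {
	snap := Snapshot{
		KV:        make(map[string]string, len(s.state)),
		LastReply: make(map[string]string, len(s.lastReply)),
	}
	for k, v := range s.state {
		snap.KV[k] = v
	}
	for c, r := range s.lastReply {
		snap.LastReply[c] = r
	}
	return snap
}
```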

Summary of rules 1. View i's primary must have been primary or backup in view i-1 2. A non-backup must reject forwarded requests Backup accepts forwarded requests only if they are in its idea of the current view 3. A non-primary must reject direct client requests 4. Every operation must be before or after state transfer 19
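
Rules 2 and 3 are easy to express as guard checks. A sketch under assumed names (server, View, ErrWrongView, ErrWrongServer): a backup accepts a forwarded request only if it matches its own idea of the current view, and a non-primary rejects direct client requests.

```go
package pb

import "errors"

var (
	ErrWrongView   = errors.New("request carries a different view number")
	ErrWrongServer = errors.New("this server does not hold that role in its current view")
)

// View is the (view #, primary, backup) tuple.
type View struct {
	ViewNum uint
	Primary string
	Backup  string
}

// server is any replica, with its own local idea of the current view.
type server struct {
	me   string
	view View
}

// acceptForward enforces rule 2: only the backup, and only in its current view.
func (s *server) acceptForward(viewNum uint) error {
	if s.view.Backup != s.me {
		return ErrWrongServer
	}
	if viewNum != s.view.ViewNum {
		return ErrWrongView
	}
	return nil
}

// acceptClientRequest enforces rule 3: only the primary serves clients directly.
func (s *server) acceptClientRequest() error {
	if s.view.Primary != s.me {
		return ErrWrongServer
	}
	return nil
}
```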

Primary-Backup: Summary First step in our goal of making stateful replicas fault-tolerant Allows replicas to provide continuous service despite persistent network and machine failures Finds repeated application in practical systems (next) 20

Question Suppose view server V moves from (1, A, B) to (2, B, C), but A comes back and B goes unresponsive What then? Don't know if either of A or C is up-to-date Have to wait for B to come back What happens if the view service is down/partitioned? The view service could use P-B, but then... 21

Plan 1. Introduction to Primary-Backup replication 2. Case study: VMware's fault-tolerant virtual machine Scales et al., SIGOPS Operating Systems Review 44(4), Dec. 2010 22

VMware vSphere Fault Tolerance (VM-FT) Goals: 1. Replication of the whole virtual machine 2. Completely transparent to applications and clients 3. High availability for any existing software 23

Overview Two virtual machines (primary, backup) on different bare metal Logging channel runs over the network Fibre Channel-attached shared disk [figure: Primary VM and Backup VM connected by the logging channel, both attached to the Shared Disk] 24

Virtual Machine I/O VM inputs Incoming network packets Disk reads Keyboard and mouse events Clock timer interrupt events VM outputs Outgoing network packets Disk writes 25

Overview Primary sends inputs to the backup over the logging channel Backup outputs are dropped Primary-backup heartbeats If the primary fails, the backup takes over [figure: Primary VM, Backup VM, Logging channel, Shared Disk] 26

VM-FT: Challenges 1. Making the backup an exact replica of primary 2. Making the system behave like a single server 3. Avoiding two primaries (Split Brain) 27

Log-based VM replication Step 1: Hypervisor at the primary logs the causes of non-determinism: 1. Log results of input events Including current program counter value for each 2. Log results of non-deterministic instructions e.g. log result of timestamp counter read (RDTSC) 28

Log-based VM replication Step 2: Primary hypervisor sends log entries to the backup hypervisor over the logging channel Backup hypervisor replays the log entries Stops the backup VM at the next input event or non-deterministic instruction Delivers the same input as the primary Delivers the same non-deterministic instruction result as the primary 29
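
A very simplified sketch of the logging-channel idea from steps 1 and 2, not VMware's actual log format: the primary records each source of non-determinism (an input event or a non-deterministic instruction result) together with the execution point where it occurred, and the backup replays each entry at exactly that point. All type and function names below are assumptions.

```go
package vmft

// EntryKind distinguishes the two sources of non-determinism being logged.
type EntryKind int

const (
	InputEvent  EntryKind = iota // e.g. incoming network packet, disk read
	NonDetInstr                  // e.g. result of a timestamp counter read (RDTSC)
)

// LogEntry is one record on the logging channel.
type LogEntry struct {
	Kind    EntryKind
	AtInstr uint64 // execution position (e.g. program counter / branch count)
	Payload []byte // the input data or the instruction's result
}

// replay runs the backup VM up to each entry's execution point and then
// delivers exactly the same input or result that the primary observed.
// runUntil and deliver stand in for hypervisor operations.
func replay(entries []LogEntry, runUntil func(uint64), deliver func(LogEntry)) {
	for _, e := range entries {
		runUntil(e.AtInstr) // stop the backup VM at the same point as the primary
		deliver(e)          // inject the identical input / result
	}
}
```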

VM-FT Challenges 1. Making the backup an exact replica of the primary 2. Making the system behave like a single server (FT Protocol) 3. Avoiding two primaries (Split Brain) 30

Primary to backup failover When backup takes over, non-determinism will make it execute differently than primary would have done This is okay! Output requirement: When backup VM takes over, its execution is consistent with outputs the primary VM has already sent 31

The problem of inconsistency [figure: execution timelines of the Primary and Backup, with an Input event and an Output event] 32

FT protocol Primary logs each output operation Delays any output until the Backup acknowledges it [figure: Primary and Backup timelines showing the input, the delayed output, and a possible duplicate output] Can restart execution at an output event 33
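
A sketch of the output rule above, with illustrative plumbing (the channel, logToBackup, and release are assumptions, not the real FT implementation): the primary logs the output event and releases the externally visible output, such as a network packet, only after the backup has acknowledged the corresponding log entry.

```go
package vmft

// output pairs the pending output data with a signal that the backup
// has acknowledged the log entry describing it.
type output struct {
	data  []byte
	acked chan struct{} // closed once the backup acks the log entry
}

// sendOutput logs the output, waits for the backup's ack, then releases it.
// logToBackup ships the log entry and returns the ack channel; release
// actually emits the output (e.g. puts the packet on the wire).
func sendOutput(data []byte, logToBackup func([]byte) chan struct{}, release func([]byte)) {
	o := output{data: data, acked: logToBackup(data)}
	<-o.acked       // delay the externally visible output...
	release(o.data) // ...until the backup can reproduce execution up to it
}
```

On failover the backup may re-execute and re-send an output that was already released, which is why the slide notes that restarting at an output event can produce a duplicate output (acceptable for idempotent or TCP-level-deduplicated traffic).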

VM-FT: Challenges 1. Making the backup an exact replica of primary 2. Making the system behave like a single server 3. Avoiding two primaries (Split Brain): the logging channel may break 34

Detecting and responding to failures Primary and backup each run UDP heartbeats and monitor logging traffic from their peer Before going live (backup) or finding a new backup (primary), execute an atomic test-and-set on a variable in shared storage Note the lease assumption: on a successful CAS, assume the old primary has stopped (why is this guaranteed here?) If the replica finds the variable already set, it aborts 35
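
A sketch of the split-brain guard just described: before going live (the backup) or creating a new backup (the primary), a replica must win an atomic test-and-set on a flag in shared storage, and exactly one replica can win. The in-memory mutex below only stands in for the shared disk's atomic operation; the type and method names are illustrative.

```go
package vmft

import "sync"

// sharedDiskFlag models the test-and-set variable on the shared storage.
type sharedDiskFlag struct {
	mu  sync.Mutex
	set bool
}

// testAndSet atomically reads and sets the flag; it returns true only for
// the single replica that observed it unset. Every other replica must abort.
func (f *sharedDiskFlag) testAndSet() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.set {
		return false // someone else already went live: abort
	}
	f.set = true
	return true
}
```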

VM-FT: Conclusion Challenging application of primary-backup replication Design for correctness and consistency of replicated VM outputs despite failures Performance results show generally high performance, low logging bandwidth overhead 36