P2 Recitation. Raft: A Consensus Algorithm for Replicated Logs


Presented by Zeleena Kearney and Tushar Agarwal
Diego Ongaro and John Ousterhout, Stanford University
Presentation adapted from the original Raft slides and 440 F7 slides

Logistics

Checkpoint
- Leader election and heartbeats
- Due on Monday, November 5th at :59pm
Final
- Log replication
- Due on Monday, November 2th at :59pm
Late policy
- 0% penalty for each late day
- Maximum of 2 late days allowed
Other notes
- Individual project!
- 5 Autolab submissions per checkpoint
- Hidden tests!

Checkpoint

Leader election
- Implement the Raft state machine for elections
- The RequestVote RPC is used to request leadership votes
Heartbeats
- The leader periodically sends empty AppendEntries RPCs
- Timeouts are used to detect leader failure and trigger re-election
Tips
- Be careful with the values chosen for the timeouts and for the heartbeat interval.
- Keep the code for the follower, the candidate, and the leader cleanly separated.
- Randomize the timeouts: synchronized timeouts lead to failed elections.

Local Testing

How to write your own tests:
- See raft_test.go for test structure / setup
- Expand the given tests, create your own
Useful functions for writing tests:
- cfg.checkoneleader() checks for a leader's successful election and gets the leader's ID
  - Used in TestInitialElection2A
- cfg.one(value, num_servers) starts an agreement
  - Used in TestFailAgree2B
- cfg.disconnect(server_id) to disconnect servers
- cfg.connect(server_id) to connect servers
- Call Start() on one of the Raft peers by using cfg.rafts
  - Used in TestFailNoAgree2B

What is Consensus?

- Agreement on shared state (i.e. single system image)
- Failures are the norm in a distributed system
- Recovers from server failures autonomously
  - If a minority of servers fail: no issues
  - If a majority fail: must trade off availability and consistency
    - Retain consistency, lose availability
    - Retain availability, lose consistency: not what we want from a consensus algorithm
- Key to building large-scale, consistent storage systems

Goal: Replicated Log

[Figure: clients send a command (shl) to the servers; each server has a consensus module, a state machine, and a log holding add, jmp, mov, shl]

- Replicated log => replicated state machine
  - All servers execute the same commands (stored in logs) in the same order
- Consensus module ensures proper log replication
- System makes progress as long as any majority of servers are up
- Failure model: fail-stop (not Byzantine), delayed/lost messages

Approaches to Consensus

Two general approaches to consensus:
- Symmetric, leader-less:
  - All servers have equal roles
  - Clients can contact any server
  - Example: Paxos
- Asymmetric, leader-based:
  - At any given time, one server is in charge, others accept its decisions
  - Clients communicate with the leader
- Raft uses the leader-based approach
  - Decomposes the problem (normal operation, leader changes)
  - Simplifies normal operation (no conflicts)
  - More efficient than leader-less approaches

Raft Overview

1. Leader election:
   - Select one of the servers to act as leader
   - Detect crashes, choose new leader
2. Normal operation (basic log replication)
3. Safety and consistency after leader changes
4. Neutralizing old leaders

Server States

At any given time, each server is either:
- Leader: handles all client interactions, log replication
  - At most 1 viable leader at a time
- Follower: completely passive (issues no RPCs, responds to incoming RPCs)
- Candidate: used to elect a new leader
Normal operation: 1 leader, N-1 followers

[Figure: state diagram]
- start -> Follower
- Follower -> Candidate: timeout, start election
- Candidate -> Candidate: timeout, new election
- Candidate -> Leader: receive votes from majority of servers
- Candidate -> Follower: discover current server or higher term ("step down")
- Leader -> Follower: discover server with higher term
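The state diagram above can be sketched as a small transition function. This is only an illustration: the State and Event names and the Next function are hypothetical and not part of the project's starter code.

```go
package main

import "fmt"

// State is the role a server currently plays.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Event names the triggers from the state diagram (hypothetical names).
type Event int

const (
	ElectionTimeout Event = iota
	WonMajority
	DiscoveredLeaderOrHigherTerm
)

// Next returns the state a server moves to when an event occurs.
func Next(s State, e Event) State {
	switch e {
	case ElectionTimeout:
		if s == Leader {
			return Leader // assumption: a leader does not run an election timer
		}
		return Candidate // follower starts an election, candidate starts a new one
	case WonMajority:
		if s == Candidate {
			return Leader
		}
		return s
	case DiscoveredLeaderOrHigherTerm:
		return Follower // candidates and leaders step down, followers stay put
	}
	return s
}

func main() {
	s := Follower
	s = Next(s, ElectionTimeout) // becomes Candidate
	s = Next(s, WonMajority)     // becomes Leader
	fmt.Println(s == Leader)
}
```

Keeping the transitions in one place like this is one way to follow the tip about separating follower, candidate, and leader code.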

Terms

[Figure: timeline divided into Term 1 through Term 5; each term starts with an election followed by normal operation, and one term ends in a split vote]

- Time divided into terms:
  - Election
  - Normal operation under a single leader
- At most 1 leader per term
- Some terms have no leader (failed election)
- Each server maintains current term value
- Key role of terms: identify obsolete information

Raft Protocol Summary

Followers
- Respond to RPCs from candidates and leaders.
- Convert to candidate if election timeout elapses without either:
  - Receiving valid AppendEntries RPC, or
  - Granting vote to candidate

Candidates
- Increment currentTerm, vote for self
- Reset election timeout
- Send RequestVote RPCs to all other servers, wait for either:
  - Votes received from majority of servers: become leader
  - AppendEntries RPC received from new leader: step down
  - Election timeout elapses without election resolution: increment term, start new election
  - Discover higher term: step down

Leaders
- Initialize nextIndex for each follower to last log index + 1
- Send initial empty AppendEntries RPCs (heartbeat) to each follower; repeat during idle periods to prevent election timeouts
- Accept commands from clients, append new entries to local log
- Whenever last log index ≥ nextIndex for a follower, send AppendEntries RPC with log entries starting at nextIndex, update nextIndex if successful
- If AppendEntries fails because of log inconsistency, decrement nextIndex and retry
- Mark log entries committed if stored on a majority of servers and at least one entry from current term is stored on a majority of servers
- Step down if currentTerm changes

Persistent State
Each server persists the following to stable storage synchronously before responding to RPCs:
- currentTerm: latest term server has seen (initialized to 0 on first boot)
- votedFor: candidateId that received vote in current term (or null if none)
- log[]: log entries

Log Entry
- term: term when entry was received by leader
- index: position of entry in the log
- command: command for state machine

RequestVote RPC
Invoked by candidates to gather votes.
Arguments:
- candidateId: candidate requesting vote
- term: candidate's term
- lastLogIndex: index of candidate's last log entry
- lastLogTerm: term of candidate's last log entry
Results:
- term: currentTerm, for candidate to update itself
- voteGranted: true means candidate received vote
Implementation:
1. If term > currentTerm, currentTerm ← term (step down if leader or candidate)
2. If term == currentTerm, votedFor is null or candidateId, and candidate's log is at least as complete as local log, grant vote and reset election timeout

AppendEntries RPC
Invoked by leader to replicate log entries and discover inconsistencies; also used as heartbeat.
Arguments:
- term: leader's term
- leaderId: so follower can redirect clients
- prevLogIndex: index of log entry immediately preceding new ones
- prevLogTerm: term of prevLogIndex entry
- entries[]: log entries to store (empty for heartbeat)
- commitIndex: last entry known to be committed
Results:
- term: currentTerm, for leader to update itself
- success: true if follower contained entry matching prevLogIndex and prevLogTerm
Implementation:
1. Return if term < currentTerm
2. If term > currentTerm, currentTerm ← term
3. If candidate or leader, step down
4. Reset election timeout
5. Return failure if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
6. If existing entries conflict with new entries, delete all existing entries starting with first conflicting entry
7. Append any new entries not already in the log
8. Advance state machine with newly committed entries
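As a starting point, the RPC arguments and results in the summary could be written down as Go types. The field names mirror the summary; the exact type names, the use of interface{} for the command, and the heartbeat helper are assumptions made for illustration, not the starter code's definitions.

```go
package main

import "fmt"

// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Term    int         // term when the entry was received by the leader
	Index   int         // position of the entry in the log
	Command interface{} // command for the state machine (type is an assumption)
}

// RequestVoteArgs is sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of candidate's last log entry
	LastLogTerm  int // term of candidate's last log entry
}

// RequestVoteReply carries the voter's answer.
type RequestVoteReply struct {
	Term        int  // currentTerm, for the candidate to update itself
	VoteGranted bool // true means the candidate received the vote
}

// AppendEntriesArgs is sent by the leader to replicate entries or as a heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so the follower can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // empty for a heartbeat
	CommitIndex  int        // last entry known to be committed
}

// AppendEntriesReply carries the follower's answer.
type AppendEntriesReply struct {
	Term    int  // currentTerm, for the leader to update itself
	Success bool // true if the follower had an entry matching PrevLogIndex/PrevLogTerm
}

// heartbeat builds an AppendEntries request with no entries (hypothetical helper).
func heartbeat(term, leaderID, prevLogIndex, prevLogTerm, commitIndex int) AppendEntriesArgs {
	return AppendEntriesArgs{
		Term:         term,
		LeaderID:     leaderID,
		PrevLogIndex: prevLogIndex,
		PrevLogTerm:  prevLogTerm,
		CommitIndex:  commitIndex,
	}
}

func main() {
	hb := heartbeat(3, 1, 7, 2, 6)
	fmt.Println(len(hb.Entries) == 0)
}
```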

Heartbeats and Timeouts

- Servers start up as followers
- Followers expect to receive RPCs from leaders or candidates
- Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
- If electionTimeout elapses with no RPCs:
  - Follower assumes leader has crashed
  - Follower starts new election
  - Timeouts for each server are random to reduce the chance of synchronized elections and are typically 100-500ms

Election Basics

- Increment current term
- Change to Candidate state
- Vote for self
- Send RequestVote RPCs to all other servers, retry until either:
  1. Receive votes from majority of servers:
     - Become leader
     - Send AppendEntries heartbeats to all other servers
  2. Receive AppendEntries RPC from valid leader:
     - Return to follower state
  3. No one wins election (election timeout elapses):
     - Increment term, start new election

Elections, cont'd

- Safety: allow at most one winner per term
  - Each server gives out only one vote per term (persist on disk)
  - Two different candidates can't accumulate majorities in same term

[Figure: a majority of the servers voted for candidate A, so B can't also get a majority]

- Liveness: some candidate must eventually win
  - Choose election timeouts randomly in [T, 2T]
  - One server usually times out and wins election before others wake up
  - Works well if T >> broadcast time
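The safety argument (one vote per server per term, so two majorities cannot coexist) can be checked with a small simulation. The voter type and helper names below are hypothetical, and the sketch deliberately ignores the log-completeness check and persistence.

```go
package main

import "fmt"

// majority returns the smallest number of votes that wins among n servers.
func majority(n int) int { return n/2 + 1 }

// voter holds the state that matters for vote safety.
type voter struct {
	currentTerm int
	votedFor    int // -1 means no vote cast in currentTerm
}

// requestVote grants at most one vote per term.
func (v *voter) requestVote(term, candidate int) bool {
	if term < v.currentTerm {
		return false // stale candidate
	}
	if term > v.currentTerm {
		v.currentTerm = term
		v.votedFor = -1 // new term, vote is available again
	}
	if v.votedFor == -1 || v.votedFor == candidate {
		v.votedFor = candidate
		return true
	}
	return false
}

// countVotes asks each voter and returns how many votes were granted.
func countVotes(voters []*voter, term, candidate int) int {
	n := 0
	for _, v := range voters {
		if v.requestVote(term, candidate) {
			n++
		}
	}
	return n
}

// newCluster creates n voters that have not voted yet.
func newCluster(n int) []*voter {
	vs := make([]*voter, n)
	for i := range vs {
		vs[i] = &voter{votedFor: -1}
	}
	return vs
}

func main() {
	cluster := newCluster(5)
	a := countVotes(cluster[:3], 1, 0) // candidate 0 reaches three servers first
	b := countVotes(cluster, 1, 1)     // candidate 1 then asks everyone
	fmt.Println(a >= majority(5), b >= majority(5))
}
```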

Log Structure

[Figure: the leader's log and several follower logs, with log indexes 1 to 8; each entry shows its term and its command (add, cmp, ret, mov, jmp, div, shl, sub); a prefix of the log is marked as committed entries]

- Log entry = <index, term, command>
- Log stored on stable storage (disk); survives crashes
- Entry committed if known to be stored on majority of servers
  - Durable, will eventually be executed by state machines

Normal Operation

1. Client sends command to leader
2. Leader appends command to its log
3. Leader sends AppendEntries RPCs to followers
4. Once new entry committed:
   - Leader passes command to its state machine, returns result to client
   - Leader notifies followers of committed entries in subsequent AppendEntries RPCs
   - Followers pass committed commands to their state machines
- Crashed/slow followers?
  - Leader retries RPCs until they succeed
- Performance is optimal in common case:
  - One successful RPC to any majority of servers

Log Consistency

High level of coherency between logs:
- If log entries on different servers have same index and term:
  - They store the same command
  - The logs are identical in all preceding entries

[Figure: two logs, log indexes 1 to 6, that agree on add, cmp, ret, mov, jmp and differ in the last entry (div vs. sub)]

- If a given entry is committed, all preceding entries are also committed

AppendEntries Consistency Check

- Each AppendEntries RPC contains index, term of entry preceding new ones
- Follower must contain matching entry; otherwise it rejects request
- Implements an induction step, ensures coherency

[Figure: leader and follower logs, log indexes 1 to 5. Case 1: the follower holds the entry preceding the leader's new jmp entry, so AppendEntries succeeds (matching entry). Case 2: the follower holds shl where the leader holds mov, so AppendEntries fails (mismatch)]
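The consistency check itself is a short predicate. The sketch below assumes 1-based log indexes stored in a 0-based Go slice, with prevLogIndex 0 meaning "before the first entry"; the function name and the term numbers in main are illustrative, not taken from the figure.

```go
package main

import "fmt"

// Entry is a log entry; its index is implied by its position in the slice.
type Entry struct {
	Term    int
	Command string
}

// matches reports whether the follower's log contains an entry at
// prevLogIndex whose term equals prevLogTerm. Log indexes are 1-based,
// so entry i lives at log[i-1]; prevLogIndex 0 always matches.
func matches(log []Entry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true
	}
	if prevLogIndex > len(log) {
		return false // follower is missing the preceding entry
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	// The leader wants to append entry 5, so it sends prevLogIndex=4, prevLogTerm=2.
	okFollower := []Entry{{1, "add"}, {1, "cmp"}, {1, "ret"}, {2, "mov"}}
	badFollower := []Entry{{1, "add"}, {1, "cmp"}, {1, "ret"}, {1, "shl"}}

	fmt.Println(matches(okFollower, 4, 2), matches(badFollower, 4, 2))
}
```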

Leader Changes

- At beginning of new leader's term:
  - Old leader may have left entries partially replicated
  - No special steps by new leader: just start normal operation
  - Leader's log is "the truth"
  - Will eventually make followers' logs identical to leader's

Safety Requirement

Once a log entry has been applied to a state machine, no other state machine may apply a different value for that log entry
- Raft safety property:
  - If a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders
- The following steps guarantee safety:
  - Leaders never overwrite entries in their logs
  - Only entries in the leader's log can be committed
  - Entries must be committed before applying to state machine

[Figure: "Committed -> Present in future leaders' logs", enforced by restrictions on commitment and restrictions on leader election]

Picking the Best Leader

- Can't tell which entries are committed!

[Figure: three logs, log indexes 1 to 5; the last entry is marked "committed?" and one server is unavailable during the leader transition]

- During elections, choose candidate with log most likely to contain all committed entries
  - Candidates include log info in RequestVote RPCs (index & term of last log entry)
  - Voting server V denies vote if its log is "more complete":
    (lastTerm_V > lastTerm_C) || (lastTerm_V == lastTerm_C) && (lastIndex_V > lastIndex_C)
  - Leader will have "most complete" log among electing majority
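The "more complete" rule translates directly into a boolean expression. The function name below is hypothetical; the arguments are the term and index of the last log entry of the voter (V) and of the candidate (C).

```go
package main

import "fmt"

// voterMoreComplete reports whether voter V's log is more complete than
// candidate C's log, in which case V denies its vote.
func voterMoreComplete(lastTermV, lastIndexV, lastTermC, lastIndexC int) bool {
	return lastTermV > lastTermC ||
		(lastTermV == lastTermC && lastIndexV > lastIndexC)
}

func main() {
	// A later last term wins even against a longer log.
	fmt.Println(voterMoreComplete(3, 2, 2, 9))
	// With equal last terms, the longer log is more complete.
	fmt.Println(voterMoreComplete(2, 5, 2, 4))
	// Identical logs: the voter is not more complete, so it may grant the vote.
	fmt.Println(voterMoreComplete(2, 4, 2, 4))
}
```

Note that the last term is compared before the index: a short log that ends in a newer term still beats a long log that ends in an older one.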

Committing Entry from Current Term

Case 1 out of 2: leader decides entry in current term is committed

[Figure: logs of servers s1 to s5, log indexes 1 to 6; s1 is leader for term 2 and AppendEntries for entry 4 just succeeded on a majority; the servers that lack the entry are marked "can't be elected as leader for term 3"]

- Safe: leader for term 3 must contain entry 4

Committing Entry from Earlier Term

Case 2 out of 2: leader is trying to finish committing entry from an earlier term

[Figure: logs of servers s1 to s5, log indexes 1 to 6; s1 is leader for term 4 and AppendEntries for an entry from an earlier term just succeeded on s3]

- Entry 3 not safely committed:
  - s5 can be elected as leader for term 5
  - If elected, it will overwrite entry 3 on s1, s2, and s3, which is BAD since we don't ever want to overwrite previous commits!
- Need commitment rules in addition to election rules

New Commitment Rules

- For a leader to decide an entry is committed:
  - Must be stored on a majority of servers
  - At least one new entry from leader's term must also be stored on majority of servers
- Once entry 4 committed:
  - s5 cannot be elected leader for term 5
  - Entries 3 and 4 both safe

[Figure: logs of servers s1 to s5, log indexes 1 to 4; s1 is leader for term 4 and its term-4 entry is now stored on a majority]

Combination of election rules and commitment rules makes Raft safe
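The two commitment conditions can be sketched as a check the leader runs before advancing its commit point. Everything here is an assumption for illustration: matchIndex[i] is taken to be the highest log index known to be stored on server i (the leader included), logTerms[k-1] is the term of the leader's entry k, and the numbers in main are made up to resemble the scenario on the slide.

```go
package main

import "fmt"

// storedOnMajority reports whether entry `index` is stored on a majority.
func storedOnMajority(index int, matchIndex []int) bool {
	count := 0
	for _, m := range matchIndex {
		if m >= index {
			count++
		}
	}
	return count > len(matchIndex)/2
}

// canCommit applies both rules: the entry is on a majority, and some entry
// from the leader's current term at or after it is on a majority as well.
func canCommit(index int, matchIndex, logTerms []int, currentTerm int) bool {
	if index < 1 || index > len(logTerms) || !storedOnMajority(index, matchIndex) {
		return false
	}
	for j := index; j <= len(logTerms); j++ {
		if logTerms[j-1] == currentTerm && storedOnMajority(j, matchIndex) {
			return true
		}
	}
	return false
}

func main() {
	logTerms := []int{1, 1, 2, 4} // leader's log; entry 3 is from an earlier term
	currentTerm := 4

	before := []int{4, 3, 3, 2, 2} // entry 3 on a majority, entry 4 only on the leader
	after := []int{4, 4, 4, 2, 2}  // entry 4 now on a majority too

	fmt.Println(canCommit(3, before, logTerms, currentTerm))
	fmt.Println(canCommit(3, after, logTerms, currentTerm))
}
```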

Log Inconsistencies

Leader changes can result in log inconsistencies:

[Figure: the log of the leader for term 8 (log indexes 1 to 12) compared with possible follower logs (a) through (f); some followers have missing entries, others have extraneous entries]

Repairing Follower Logs

- New leader must make follower logs consistent with its own
  - Delete extraneous entries
  - Fill in missing entries
- Leader keeps nextIndex for each follower:
  - Index of next log entry to send to that follower
  - Initialized to (1 + leader's last index)
- When AppendEntries consistency check fails, decrement nextIndex and try again

[Figure: the log of the leader for term 7 (log indexes 1 to 12) with its nextIndex pointer, and two follower logs (a) and (b)]

Repairing Logs, cont'd

- When follower overwrites inconsistent entry, it deletes all subsequent entries

[Figure: the log of the leader for term 7 (log indexes 1 to 11) with its nextIndex pointer, and the follower's log before and after the repair]
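The repair procedure (decrement nextIndex on failure, truncate on conflict) can be simulated in memory without any RPCs. This is a sketch under stated assumptions: function names are hypothetical, log indexes are 1-based over a 0-based slice, and the term sequences in main are illustrative rather than recovered from the figure.

```go
package main

import "fmt"

// Entry is a log entry; its index is implied by its position in the slice.
type Entry struct {
	Term    int
	Command string
}

// appendEntries is the follower side: the consistency check, then
// deletion of conflicting entries, then appending what is missing.
func appendEntries(log []Entry, prevLogIndex, prevLogTerm int, entries []Entry) ([]Entry, bool) {
	if prevLogIndex > len(log) ||
		(prevLogIndex > 0 && log[prevLogIndex-1].Term != prevLogTerm) {
		return log, false // no matching entry: reject
	}
	for i, e := range entries {
		slot := prevLogIndex + i // slice position of log index prevLogIndex+i+1
		if slot < len(log) {
			if log[slot].Term == e.Term {
				continue // entry already present
			}
			log = log[:slot] // conflict: drop this entry and all that follow
		}
		log = append(log, e)
	}
	return log, true
}

// repair is the leader side: start at 1 + last index and back up until
// the follower accepts. It returns the repaired log and the nextIndex used.
func repair(leader, follower []Entry) ([]Entry, int) {
	nextIndex := len(leader) + 1
	for {
		prev, prevTerm := nextIndex-1, 0
		if prev > 0 {
			prevTerm = leader[prev-1].Term
		}
		if updated, ok := appendEntries(follower, prev, prevTerm, leader[prev:]); ok {
			return updated, nextIndex
		}
		nextIndex-- // consistency check failed: try one entry earlier
	}
}

// fromTerms builds a log with the given terms and empty commands.
func fromTerms(terms ...int) []Entry {
	log := make([]Entry, len(terms))
	for i, t := range terms {
		log[i] = Entry{Term: t}
	}
	return log
}

func main() {
	leader := fromTerms(1, 1, 1, 4, 4, 5, 5, 6, 6, 6)
	follower := fromTerms(1, 2, 2, 2, 3, 3, 3, 3)
	repaired, nextIndex := repair(leader, follower)
	fmt.Println(nextIndex, len(repaired))
}
```

The loop always terminates because a request with prevLogIndex 0 passes the consistency check. Backing up one index per round trip is simple but slow for long divergent logs.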

Neutralizing Old Leaders

- Deposed leader may not be dead:
  - Temporarily disconnected from network
  - Other servers elect a new leader
  - Old leader becomes reconnected, attempts to commit log entries
- Terms used to detect stale leaders (and candidates)
  - Every RPC contains term of sender
  - If sender's term is older, RPC is rejected, sender reverts to follower and updates its term
  - If receiver's term is older, it reverts to follower, updates its term, then processes RPC normally
- Election updates terms of majority of servers
  - Deposed server cannot commit new log entries
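The term rules above apply to both ends of every RPC. The sketch below models them with a minimal Server type; the type, its string-valued Role field, and the method names are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Server holds the two pieces of state the term rules touch.
type Server struct {
	CurrentTerm int
	Role        string // "follower", "candidate", or "leader"
}

// acceptRPC is the receiver side. It returns false when the sender's term
// is older (the RPC is rejected). If the receiver's own term is older, it
// reverts to follower and updates its term before processing the RPC.
func (s *Server) acceptRPC(senderTerm int) bool {
	if senderTerm < s.CurrentTerm {
		return false
	}
	if senderTerm > s.CurrentTerm {
		s.CurrentTerm = senderTerm
		s.Role = "follower"
	}
	return true
}

// onReply is the sender side. A reply carrying a newer term means the
// sender is stale: it reverts to follower and updates its term.
func (s *Server) onReply(replyTerm int) {
	if replyTerm > s.CurrentTerm {
		s.CurrentTerm = replyTerm
		s.Role = "follower"
	}
}

func main() {
	oldLeader := &Server{CurrentTerm: 2, Role: "leader"} // was partitioned away
	peer := &Server{CurrentTerm: 3, Role: "follower"}    // took part in a newer election

	accepted := peer.acceptRPC(oldLeader.CurrentTerm)
	oldLeader.onReply(peer.CurrentTerm)

	fmt.Println(accepted, oldLeader.Role, oldLeader.CurrentTerm)
}
```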

Visualization

https://raft.github.io/raftscope-replay/index.html

Raft Summary

1. Leader election
2. Normal operation
3. Safety and consistency
4. Neutralize old leaders

Useful Links

- Extended Raft paper: https://raft.github.io/raft.pdf
- Visualization: https://raft.github.io/raftscope-replay/index.html
- Original Raft slides: https://ramcloud.stanford.edu/~ongaro/cs244b.pdf