ZooKeeper Atomic Broadcast
The heart of the ZooKeeper coordination service
Benjamin Reed, Flavio Junqueira
Yahoo! Research

ZooKeeper Service
The request processor transforms each incoming request into an idempotent transaction; the transaction is agreed on via atomic broadcast and applied to the replicated DB before the response goes back to the client.
[Diagram: Request → Request Processor → Atomic Broadcast → DB → Response]
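
To make the idempotence point concrete, here is a minimal sketch (hypothetical names, not ZooKeeper's actual request-processor classes) of turning a conditional setData request into a replay-safe transaction: the version check runs once at the leader, and the broadcast txn carries only absolute values.

```java
// Hedged sketch: the txn records the *result* (new data, new version),
// not the condition, so replaying it after a crash yields the same state.
import java.util.HashMap;
import java.util.Map;

class RequestProcessorSketch {
    record SetDataRequest(String path, byte[] data, int expectedVersion) {}
    record SetDataTxn(String path, byte[] data, int newVersion) {}   // idempotent: absolute values only

    private final Map<String, Integer> versions = new HashMap<>();   // simplified in-memory state

    SetDataTxn toTxn(SetDataRequest req) {
        int current = versions.getOrDefault(req.path(), 0);
        if (req.expectedVersion() != -1 && req.expectedVersion() != current)
            throw new IllegalStateException("version mismatch");     // rejected; nothing is broadcast
        return new SetDataTxn(req.path(), req.data(), current + 1);  // replay-safe result
    }
}
```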

ZooKeeper Servers
[Diagram: a ZooKeeper service of several servers, one of them the leader, with many clients connected across the servers]

Goals
1) Must be able to tolerate failures
2) Must be able to recover from correlated recoverable failures (e.g., power outages)
3) Must be correct
4) Must be easy to implement correctly
5) Must be fast (high throughput, low latency)
Workload realities: bursty throughput, and homogeneous servers with non-homogeneous behavior (some will inevitably be faster than others because of hardware differences, runaway processes, etc.)

ZooKeeper Leader Election
[Diagram: ZooKeeper service of five servers, Server 1 through Server 5]
1) UDP- or TCP-based
2) The server with the highest logged transaction gets nominated
3) The election doesn't have to be absolutely successful, just very likely successful
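
A minimal sketch of the nomination rule (hypothetical names; the real election code differs): vote for the candidate with the highest last-logged zxid, breaking ties by server id. The outcome only needs to be very likely correct, since a bad pick is detected when the would-be leader fails to gather a quorum.

```java
// Hedged sketch of comparing two candidate votes.
class ElectionSketch {
    record Vote(long lastLoggedZxid, long serverId) {}

    static Vote better(Vote a, Vote b) {
        if (a.lastLoggedZxid() != b.lastLoggedZxid())
            return a.lastLoggedZxid() > b.lastLoggedZxid() ? a : b;  // highest logged txn wins
        return a.serverId() > b.serverId() ? a : b;                  // deterministic tie-break
    }
}
```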

Starting assumptions
1) Ability to create FIFO channels
   We use TCP. Theoretically this is not a stronger assumption than a classic lossy unordered channel, since that is what TCP is built on.
2) Crash-fail failure model
   Digests are used to detect corruption.
3) 2f+1 servers to handle f failures
   The service must also be able to recover from correlated recoverable failures (power outages).
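
The 2f+1 arithmetic in a tiny sketch: a majority quorum of f+1 can make progress while f servers are down, and any two quorums overlap in at least one server, which is what the recovery argument later in the talk relies on.

```java
// Hedged sketch of the quorum math; not taken from the ZooKeeper codebase.
class QuorumMath {
    static int quorumSize(int n) { return n / 2 + 1; }   // majority

    public static void main(String[] args) {
        int f = 2;
        int n = 2 * f + 1;                  // 5 servers tolerate f = 2 failures
        System.out.println(quorumSize(n));  // 3; two quorums of 3 out of 5 must share a server (3 + 3 > 5)
    }
}
```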

ZooKeeper Servers
These steps make up a pipeline that will fill with thousands of pipelined requests:
1) A server forwards the request to the leader
2) The leader creates a proposal, stamps it with a zxid, and sends the proposal to the followers
3) Each follower logs the txn (but doesn't use it until committed) and acks the proposal
4) Once a quorum has acked, the leader sends a commit; on commit, servers update the in-memory database and make the change visible
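
A hedged sketch of the leader's side of this pipeline (illustrative names, not the real ZooKeeper classes): stamp, broadcast, count acks, commit at quorum. Everything is asynchronous, so many proposals can be in flight at once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LeaderPipeline {
    private long nextZxid = 1;                          // (epoch << 32) | counter in the real protocol
    private final Map<Long, Integer> acks = new HashMap<>();
    private final List<FollowerChannel> followers;
    private final int quorum;

    LeaderPipeline(List<FollowerChannel> followers) {
        this.followers = followers;
        this.quorum = (followers.size() + 1) / 2 + 1;   // majority; the leader counts itself
    }

    void propose(byte[] txn) {
        long zxid = nextZxid++;
        acks.put(zxid, 1);                              // leader's own implicit ack
        for (FollowerChannel f : followers) f.send("PROPOSE", zxid, txn);
    }

    void onAck(long zxid) {
        int n = acks.merge(zxid, 1, Integer::sum);
        if (n == quorum)                                // fire the commit exactly once
            for (FollowerChannel f : followers) f.send("COMMIT", zxid, null);
    }

    interface FollowerChannel { void send(String type, long zxid, byte[] payload); }
}
```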

Nice Properties
1) The leader always proposes in order
2) Because we use TCP, followers always receive proposals in order
3) Followers process proposals in order
4) TCP means the leader will get acks in order and thus commit in order
5) Followers only need to connect to a single server
6) The leader just waits for connections

Everything is cool until... Leader Failure!
1) Make sure that what has been delivered to some gets delivered to all
2) Make sure that what gets forgotten stays forgotten
3) We get to choose what to do with the stuff in between

Missed deliveries
(Px = a proposal for message x; Cx = a commit for x.)
[Diagram: the leader logged Pa, Pb, Pc, Ca, Cb, Pd and then crashed; one follower has Pa, Pb, Pc, Ca; another has only Pa, Pb, Pc. The leader committed b (Cb), so b had better eventually be committed everywhere!]

Bad Recall
[Diagram: the old leader logged Pa, Pb, Pc, Ca, Cb, Pd and then crashed; Pd never reached anyone else. The surviving servers (previously at Pa, Pb, Pc, Ca and Pa, Pb, Pc) move on with Cb, Cc, Pe, Pf, Ce, Cf and Ca, Cb, Cc, Pe, Pf, Ce, Cf respectively. When the old leader comes back, d had better go away and never come back.]

Never forget
1) If we elect the right guy, we will not forget anything
   A new leader is elected by a quorum of followers.
   Committed messages must be seen by at least someone in the quorum.
   Elect the server that has seen the highest message in a quorum.
   The new leader will commit all proposals it has seen from the previous leader.

Missed deliveries (continued)
[Diagram: the crashed leader had Pa, Pb, Pc, Ca, Cb, Pd. The follower that had Pa, Pb, Pc, Ca (the highest zxid in the quorum) becomes the new leader and commits the proposals it saw: it delivers Cb, Cc, while the follower that had only Pa, Pb, Pc catches up with Ca, Cb, Cc. So b is eventually committed everywhere.]

Letting go
1) We use epochs to make sure that we only recover the last leader's outstanding proposals once
   The zxid is a 64-bit number: a 32-bit epoch and a 32-bit counter.
   A new leader will increment the epoch.
   A new leader will only start proposing once the previous epoch is cleaned up.
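
A sketch of that zxid layout (the 64-bit epoch/counter split is from the talk; the helper names are mine). Because the epoch occupies the high bits, any zxid from a new epoch compares greater than anything the old leader could have proposed.

```java
// Hedged sketch of packing/unpacking a zxid.
class Zxid {
    static long make(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }
    static int epoch(long zxid)   { return (int) (zxid >>> 32); }
    static int counter(long zxid) { return (int) zxid; }

    public static void main(String[] args) {
        long z = make(5, 42);
        System.out.println(epoch(z) + " " + counter(z));  // 5 42
        assert make(6, 0) > make(5, Integer.MAX_VALUE);   // a new epoch always wins
    }
}
```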

Bad Recall (continued)
[Diagram: the old leader (Pa, Pb, Pc, Ca, Cb, Pd) rejoins and is told to truncate its log back to c, dropping the never-delivered Pd; it then catches up with Cc, Pe, Pf, Ce, Cf, matching the other servers.]

Leader Protocol in a nutshell
1) At startup, wait for a quorum of followers to connect
2) Sync with a quorum of followers
   Tell the follower to delete any txn that the leader doesn't have (easy, since the logs can only differ in one epoch).
   Send any txns that the follower doesn't have.
3) Continually:
   Assign a zxid to each message to be proposed and broadcast the proposal to followers.
   When a quorum has acked a proposal, broadcast a commit.
   (Broadcast means queue the message to the TCP channel of each follower.)
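
A hedged sketch of the sync step (step 2), assuming logs sorted by zxid and the single-epoch divergence the slide mentions; the names are illustrative. The leader reduces the whole diff to one truncation point plus a forward stream of missing txns.

```java
import java.util.List;

class LeaderSync {
    // zxids are 64-bit (epoch << 32 | counter); logs are sorted by zxid
    void sync(FollowerConn f, List<Long> leaderLog, long followerLastZxid) {
        long truncateTo = lastZxidNotAfter(leaderLog, followerLastZxid);
        if (truncateTo < followerLastZxid)
            f.truncate(truncateTo);                    // drop txns the leader never logged
        for (long zxid : leaderLog)
            if (zxid > truncateTo) f.sendTxn(zxid);    // forward what the follower is missing
    }

    static long lastZxidNotAfter(List<Long> log, long zxid) {
        long best = 0;
        for (long z : log) if (z <= zxid) best = z;    // highest shared point
        return best;
    }

    interface FollowerConn { void truncate(long zxid); void sendTxn(long zxid); }
}
```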

Follower protocol in a nutshell
1) Connect to a leader
2) Delete any txns in the txn log that the leader says to delete
3) Continually:
   Log proposed txns to the txn log and send an ack to the leader.
   Deliver any committed txn.
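
A matching sketch of the follower loop (again with illustrative names): force-log and ack proposals immediately, but deliver to the in-memory database only on the matching COMMIT, which arrives in order over the same TCP channel.

```java
class FollowerLoop {
    interface LeaderConn { Message next(); void ack(long zxid); }
    record Message(String type, long zxid, byte[] txn) {}

    void run(LeaderConn leader, TxnLog log, Database db) {
        while (true) {
            Message m = leader.next();   // FIFO channel: messages arrive in send order
            switch (m.type()) {
                case "PROPOSE" -> { log.append(m.zxid(), m.txn()); leader.ack(m.zxid()); }
                case "COMMIT"  -> db.apply(log.read(m.zxid()));   // now safe to make visible
            }
        }
    }

    interface TxnLog   { void append(long zxid, byte[] txn); byte[] read(long zxid); }
    interface Database { void apply(byte[] txn); }
}
```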

Performance

Status
1) An Apache project: http://hadoop.apache.org/zookeeper
2) Used extensively at Yahoo!; also used by non-Yahoo! projects
3) Future work:
   Observers
   Tree distribution network