Dynamic Reconfiguration of Primary/Backup Clusters


Dynamic Reconfiguration of Primary/Backup Clusters (Apache ZooKeeper)
Alex Shraer, Yahoo! Research
In collaboration with: Benjamin Reed (Yahoo! Research), Dahlia Malkhi (Microsoft Research), Flavio Junqueira (Yahoo! Research)

Configuration of a Distributed Replicated System
- Membership
- Role of each server, e.g., deciding on changes (participant) or learning of changes (observer)
- Quorum System spec: majorities / hierarchical (server votes have different weights)
- Network addresses & ports
- Timeouts, directory paths, etc.

Dynamic Membership Changes
Necessary in every long-lived system! Examples:
- Cloud computing: adapt to changing load, don't pre-allocate
- Failures: replacing failed nodes with healthy ones
- Upgrades: replacing out-of-date nodes with up-to-date ones
- Free up storage space: decreasing the number of replicas
- Moving nodes: within the network or the data center
- Increase resilience by changing the set of servers (example: asynchronous replication works only as long as more than #servers/2 are up)

Other Dynamic Configuration Changes
- Changing server addresses/ports
- Changing server roles: participants (leader & followers) <-> observers
- Changing the Quorum System, e.g., if a new powerful & well-connected server is added

Industry Approach to Reconfiguration
Reconfiguration in distributed systems is difficult! => use an external Coordination Service
Leading coordination services:
- Chubby: Google
- Apache ZooKeeper: Yahoo!, LinkedIn, Twitter, Facebook, VMware, UBS, Goldman Sachs, Netflix, Box, Cloudera, MapR, Nicira, ...
Used for configuration management, metadata store, failure detection, distributed locking, leader election, message queues, task assignment

ZooKeeper data model
- A tree of data nodes (znodes)
- Hierarchical namespace (like in a file system)
- Znode = <data, version, creation flags, children>
[Figure: example znode tree with nodes such as /services, workers, locks, apps, users, worker1, worker2, x-1, x-2]
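As a quick illustration of this model, here is a minimal sketch using the standard ZooKeeper Java client; the connect string, paths, and session timeout are placeholder values, not taken from the talk.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; a real deployment lists its own servers.
        ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 3000, event -> {});

        // Persistent znodes survive the client session.
        zk.create("/workers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/locks", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session ends.
        zk.create("/workers/worker1", "host-a".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: the server appends a monotonically increasing counter (x-0000000001, ...).
        zk.create("/locks/x-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        zk.close();
    }
}
```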

ZooKeeper - distributed and replicated
[Figure: ZooKeeper service with a leader and several servers, each serving multiple clients]
- All servers store a copy of the data (in memory)
- A leader is elected at startup
- Reads are served by followers; all updates go through the leader
- An update is acked when a quorum of servers has persisted the change (on disk)
- ZooKeeper uses ZAB, its own atomic broadcast protocol; it borrows a lot from Paxos, but is conceptually different
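The quorum-ack rule is simple enough to sketch. The class below is a toy illustration with majority quorums only; it is not ZooKeeper code, and the talk later notes that the quorum system itself is configurable.

```java
import java.util.HashSet;
import java.util.Set;

// Toy illustration of the quorum-ack rule described above (not ZooKeeper code).
public class QuorumAck {
    private final int ensembleSize;
    private final Set<String> acked = new HashSet<>(); // servers that persisted the proposal

    public QuorumAck(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    /** Called when a server reports it has written the proposal to disk. */
    public boolean ack(String serverId) {
        acked.add(serverId);
        return isCommitted();
    }

    /** With majority quorums, a proposal commits once more than half the ensemble has acked. */
    public boolean isCommitted() {
        return acked.size() > ensembleSize / 2;
    }
}
```

For example, in a 5-server ensemble an update commits once any 3 servers have persisted it.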

ZooKeeper is a Primary/Backup system
- An important subclass of State-Machine Replication
- Many (most?) primary/backup systems work as follows: the primary executes operations and sends idempotent state updates to the backups
  - A state update makes sense only in the context of the updates that precede it
  - The primary speculatively executes an operation and sends out its update, but it will only appear in a backup's log after the updates it depends on
  - In general SMR (Paxos), a backup's log may end up mixing updates from different primaries
- Primary order: each primary commits a consecutive segment in the log
  - Preserved by many (most?) primary/backup systems: ZooKeeper, Chubby, GFS, Boxwood, Chain Replication, Harp, Echo, PacificA, etc.
  - Not preserved by Paxos / general state-machine replication

Reconfiguring ZooKeeper
- Not supported: all config settings are static, loaded during boot
- ZooKeeper users have repeatedly asked for reconfiguration since 2008
- Several attempts were found incorrect and rejected

Manual Reconfiguration
- Bring the service down, change configuration files, bring it back up
  - Wrong reconfiguration has caused split-brain & inconsistency in production
  - Questions about manual reconfig are asked several times each week
- Admins prefer to over-provision rather than reconfigure [LinkedIn talk @ Yahoo, 2012]
  - Doesn't help with many reconfiguration use-cases
  - Wastes resources, adds management overhead
  - Can hurt ZooKeeper throughput (we show)
- Configuration errors are a primary cause of failures in production systems [Yin et al., SOSP '11]

Hazards of Manual Reconfiguration
Goal: add servers E and D. Change configuration files, restart all servers.
[Figure: servers restart at different times; some still run with {A, B, C} while others already use {A, B, C, D, E}, so two disjoint quorums can commit updates]
We lost committed updates!

Can't we just store the configuration in ZooKeeper?
Recap of recovery in ZooKeeper:
[Figure: five servers A-E; a setData(/x, 5) update is in flight when the leader fails; leader failure activates leader election & recovery]

This doesn't work for reconfigurations!
[Figure: configuration {A, B, C, D, E}; a client issues setData(/zookeeper/config, {A, B, F}) to remove C, D, E and add F; some servers end up with {A, B, F} while others still have {A, B, C, D, E}]
- Must persist the decision to reconfigure in the old config before activating the new config!
- Once such a decision is reached, further ops must not be committed in the old config

Principles of Reconfiguration
A reconfiguration S -> S' should do the following:
1. Commit the reconfig in a quorum of S
2. Deactivate S (make sure no more updates are committed in S)
3. Transfer state from S to S'
   - Identify all committed/potentially committed updates in S
   - Transfer state to a quorum of S'
4. Activate S', so that it can process and commit client ops

Principles of Reconfiguration - Primary/Backup
A reconfiguration S -> S' should do the following:
1. Commit the reconfig in a quorum of S
   - Submit the reconfig op just like any other update in S
2. Deactivate S (make sure no more updates are committed in S)
   - Primary order guarantees that no further updates are committed in S
3. Transfer state from S to S'
   - All important (committed/potentially committed) updates are in the primary's log
   - Transfer state ahead of time; here just make sure the transfer is complete: need a quorum of S' to ack all history up to the reconfig
4. Activate S', so that it can process and commit client ops
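The ordering of these phases is the crux of the algorithm. The following is a rough, non-authoritative sketch of how a driver might sequence them; the Config type, the helper names, and the quorum broadcast are invented placeholders for the real leader/ZAB machinery.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only: types and helpers are stand-ins, not ZooKeeper code.
public class ReconfigDriver {

    record Config(long version, Set<String> servers) {}

    void reconfigure(Config oldCfg, Config newCfg, List<String> historyUpToReconfig) {
        // 1. Commit the reconfig in a quorum of S: propose it like any other update.
        if (!broadcastAndAwaitQuorum(oldCfg, "reconfig:" + newCfg)) return;

        // 2. Deactivate S: from this point the leader schedules new client ops in S' only
        //    (in the primary/backup setting, primary order makes this safe).

        // 3. Transfer state: a quorum of S' must ack all history up to the reconfig.
        if (!broadcastAndAwaitQuorum(newCfg, String.join(",", historyUpToReconfig))) return;

        // 4. Activate S': announce the new config so it can process and commit client ops.
        broadcastAndAwaitQuorum(newCfg, "activate:" + newCfg.version());
    }

    // Placeholder for "send to all servers in cfg and wait for a majority of acks".
    boolean broadcastAndAwaitQuorum(Config cfg, String message) {
        return true; // network layer omitted in this sketch
    }
}
```

In the actual protocol the leader keeps scheduling client operations while this runs; they are simply committed by the new configuration's leader, which is what makes the reconfiguration speculative.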

Failure-Free Flow
[Figure]

Usually unnoticeable to clients
[Figure: timeline of a running cluster with reconfiguration events marked: remove, add, remove-leader, add, remove, add]

Protocol Features
- After a reconfiguration is proposed, the leader schedules & executes operations as usual
  - The leader of the new configuration is responsible for committing these
- If the leader of the old config is in the new config and able to lead, it remains the leader
  - Otherwise, the old leader nominates a new leader (saves leader election time)
- We support multiple concurrent reconfigurations
  - Activate only the last config, not intermediate ones
  - In the paper, not in production

Gossiping activated configurations
[Figure: reconfiguration adding servers E and D; some servers have activated {A, B, C, D, E} while others still have {A, B, C}]
- D should be leader (it has the latest state)
- But D doesn't have the support of a quorum (3 out of 5), so activated configurations must be gossiped

Recovery - Discovering Decisions
[Figure: reconfiguration replacing B and C with E and D, i.e., {A, B, C} -> {A, D, E}]
C must:
1) Discover possible decisions in {A, B, C} (find out about {A, D, E})
2) Discover a possible activation decision in {A, D, E}
   - If {A, D, E} is active, C mustn't attempt to transfer state
   - Otherwise, C should transfer state & activate {A, D, E}

The client side of reconfiguration
- When the system changes, clients need to stay connected
  - The usual solution: a directory service (e.g., DNS)
- Re-balancing load during reconfiguration is also important!
  - Goal: uniform #clients per server with minimal client migration
  - Migration should be proportional to the change in membership

Our approach - Probabilistic Load Balancing
Example 1: expanding from 3 servers (10 clients each) to 5 servers
- Each client moves to a random new server with probability 1 - 3/5 = 0.4
- Expected: 40% of the clients move off each old server, leaving 6 clients per server
Example 2: moving from {A, B, C, D, E} (6 clients each) to {D, E, F}
- Clients connected to D and E don't move
- Clients connected to A, B, C move to D or E with probability |S∩S'|·(|S|−|S'|) / (|S'|·|S\S'|) = 2·(5−3)/(3·3) = 4/9 (i.e., 4/18 to each), and to the new server F with probability 10/18
- Expected: 8 clients move from A, B, C to D and E, and 10 move to F, giving 10 clients per server

Probabilistic Load Balancing
When moving from configuration S to S', for each server i:

  E[load(i, S')] = load(i, S) + Σ_{j∈S, j≠i} load(j, S)·Pr(j→i) − load(i, S)·Σ_{j∈S', j≠i} Pr(i→j)

where:
- E[load(i, S')] is the expected #clients connected to i in S' (10 in the last example)
- load(i, S) is the #clients connected to i in S
- the second term is the expected #clients moving to i from other servers in S
- the third term is the expected #clients moving from i to other servers in S'

Solving for Pr gives case-specific probabilities. Input: each client answers locally:
- Question 1: Are there more servers now or fewer?
- Question 2: Is my server being removed?
Output:
1) disconnect from or stay connected to my server
2) if disconnecting: Pr(connect to one of the old servers) and Pr(connect to a newly added server)
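Below is a minimal sketch of that local client decision, assuming the change is a pure grow or a pure shrink; the mixed case of Example 2 needs the case-specific probabilities obtained by solving the equation above. This is an illustration, not the ZooKeeper client code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified sketch of the client-side rebalancing decision (not ZooKeeper code).
public class ClientRebalance {
    private final Random rnd = new Random();

    /** Returns the server this client should be connected to after the change. */
    public String rebalance(String myServer, List<String> oldS, List<String> newS) {
        boolean removed = !newS.contains(myServer);          // Question 2: is my server gone?
        boolean grew    = newS.size() > oldS.size();         // Question 1: more servers now?

        if (!removed && grew) {
            // Pure grow: move with probability 1 - |S|/|S'|, uniformly to an added server.
            double pMove = 1.0 - (double) oldS.size() / newS.size();
            if (rnd.nextDouble() < pMove) {
                List<String> added = new ArrayList<>(newS);
                added.removeAll(oldS);
                return added.get(rnd.nextInt(added.size()));
            }
            return myServer;                                  // stay connected
        }

        if (removed) {
            // My server was removed: must reconnect. In a pure shrink (S' subset of S),
            // picking uniformly among the surviving servers restores uniform load.
            return newS.get(rnd.nextInt(newS.size()));
        }

        // Server kept and the ensemble did not grow: stay put.
        return myServer;
    }
}
```

Because every client decides locally from just these two questions, no coordination among clients is needed while the expected load per server stays uniform.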

Probabilistic Load Balancing
[Figure]

Implementation
- Implemented in ZooKeeper (Java & C), integration ongoing
- 3 new ZooKeeper API calls: reconfig, getconfig, updateserverlist (feature requested since 2008)
- Dynamic changes to: membership, quorum system, server roles, addresses & ports
- Reconfiguration modes:
  - Incremental (add servers E and D, remove server B)
  - Non-incremental (new config = {A, C, D, E})
  - Blind or conditioned (reconfig only if the current config version is #5)
- Subscriptions to config changes using watches
  - Client can invoke client-side re-balancing upon a change
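For readers on a current release: the feature that grew out of this work shipped in ZooKeeper 3.5+, where the client-side entry points look roughly like the sketch below. The exact signatures are quoted from memory and should be checked against the version in use; the connect string is a placeholder.

```java
import org.apache.zookeeper.admin.ZooKeeperAdmin;
import org.apache.zookeeper.data.Stat;

public class ReconfigExample {
    public static void main(String[] args) throws Exception {
        // ZooKeeperAdmin extends the regular client and adds reconfigure().
        ZooKeeperAdmin admin = new ZooKeeperAdmin("host1:2181,host2:2181,host3:2181",
                                                  3000, event -> {});

        // Read the current configuration (stored under /zookeeper/config).
        Stat stat = new Stat();
        byte[] currentConfig = admin.getConfig(false, stat);
        System.out.println(new String(currentConfig));

        // Incremental reconfig: add server 2, remove server 5.
        byte[] newConfig = admin.reconfigure(
                "2=host2.com:1236:1237:follower;1231",  // joining servers
                "5",                                     // leaving servers
                null,                                    // new members (non-incremental mode only)
                -1,                                      // -1 = unconditioned; pass the expected
                                                         // config version for a conditioned reconfig
                null);                                   // optional Stat for the new config
        System.out.println(new String(newConfig));

        admin.close();
    }
}
```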

Example - reconfig using the CLI

reconfig -add 1=host1.com:1234:1235:observer;1239 -add 2=host2.com:1236:1237:follower;1231 -remove 5
  - Change follower 1 to an observer and change its ports
  - Add follower 2 to the ensemble
  - Remove follower 5 from the ensemble

reconfig -file mynewconfig.txt -v 234547
  - Change the current config to the one in mynewconfig.txt
  - But only if the current config version is 234547

getconfig -w -c
  - Set a watch on /zookeeper/config
  - -c means we only want the new connection string for clients: host1:port1,host2:port2,host3:port3

Summary
- Primary/Backup is easier to reconfigure than general SMR
- We have a new algorithm, implemented in ZooKeeper
  - Being contributed to the ZooKeeper codebase
- First practical algorithm for speculative reconfiguration
  - Using the primary order property
- Many nice features:
  - Doesn't limit concurrency
  - Reconfigures immediately
  - Preserves primary order
  - Doesn't stop client ops
  - Clients work with a single configuration at a time
  - No external services
  - Includes client-side rebalancing

Acknowledgements
ZooKeeper open source community: Marshall McMullen (SolidFire), Vishal Kher (VMware), Mahadev Konar (Hortonworks), Patrick Hunt (Cloudera), Rakesh Radhakrishnan (Huawei), Raghu Shastry