ZooKeeper & Curator. CS 475, Spring 2018 Concurrent & Distributed Systems


Review: Agreement. In distributed systems, we have multiple nodes that need to all agree that some object has some state. Examples: who owns a lock, whether or not to commit a transaction, the value of a file.

Review: Agreement Generally. Most distributed systems problems can be reduced to this one: despite being separate nodes (with potentially different views of their data and the world), all nodes that store the same object O must apply all updates to that object in the same order (consistency), and all nodes involved in a transaction must either commit or abort their part of the transaction (atomicity). Easy? But nodes can restart, die, or be arbitrarily slow, and networks can be slow or unreliable too.

Review: Properties of Agreement. Safety (correctness): all nodes agree on the same value (which was proposed by some node). Liveness (fault tolerance, availability): if fewer than N nodes crash, the rest should still be OK.

Review: Does 3PC guarantee agreement? Reminder, that means: Liveness (availability)? Yes! It always terminates, based on timeouts. Safety (correctness)? Hmm...

Review: Partitions. [Diagram: a 3PC run in which the coordinator has collected "yes" votes from participants A-D and is soliciting the commit when a network partition occurs. On timeout, participant A moves from uncertain to committed while participants B, C, and D move from uncertain to aborted, so the participants disagree.]

Review: Partition Tolerance. Key idea: if you always have an odd number of nodes, any two-way split yields a minority partition and a majority partition. Give up processing in the minority until the partition heals and the network resumes; the majority can continue processing.

Review: Partition Tolerance. [Diagram: a three-server ZooKeeper ensemble, a primary and a secondary NameNode, and many DataNodes split by a network partition. The primary can ascertain that it is in a minority partition; the secondary can ascertain that it is in the majority.]

Announcements. Form a team and get started on the project: http://jonbell.net/gmu-cs-475-spring-2018/finalproject/ (AutoLab available soon). Today: ZooKeeper guarantees and usage. Additional readings: Distributed Systems for Fun & Profit, Ch. 4, http://book.mixu.net/distsys/replication.html

Master/worker distributed model. [Diagram: a single master assigning work to four workers.]

Master/worker distributed model. The master is a single point of failure! If the master fails, no work is assigned, and we need to select a new master. [Diagram: the master has crashed; the workers sit idle until a new master takes over.]

Master/worker distributed model. What if a worker fails? Not as bad, but we need to detect its failure, and some tasks might need to get re-assigned elsewhere. [Diagram: one of the four workers has crashed.]

Master/worker distributed model. What if a worker doesn't receive a task (network link failure)? Again, not as bad, but we need to detect it and try to re-establish the link (there is a difference between "there is no work left to do" and "I just didn't hear that I needed to do something"). [Diagram: the link between the master and one worker has failed.]

Coordination. Semaphores, queues, transactions, locks, barriers.

Strawman Fault Tolerant Master/Worker System. [Diagram: a master, a backup master, and four workers.] How do we know to switch to the backup? How do we know when workers have crashed, or the network has failed?

Fault Tolerant Master/Worker System. [Diagram: a master, a backup master, and four workers, all connected to a coordination service.] The coordination service handles all of those tricky parts. But can't the coordination service fail?

Fault-Tolerant Distributed Coordination. Leave it to the coordination service to be fault-tolerant. We can solve our master/worker coordination problem in two steps: (1) write a fault-tolerant distributed coordination service, and (2) use it. Thankfully, (1) has been done for us! [Diagram: the coordination service is itself a group of replicated coordination servers.]

ZooKeeper. A distributed coordination service, originally from Yahoo!, now maintained as an Apache project and used widely (it is a key component of Hadoop, etc.). Highly available, fault tolerant, and performant. Designed so that YOU don't have to implement Paxos for maintaining group membership, distributed data structures, distributed locks, distributed protocol state, etc.

ZooKeeper - Guarantees. Liveness: if a majority of ZooKeeper servers are active and communicating, the service will be available. Atomic updates: a write is either entirely successful or entirely failed. Durability: if the ZooKeeper service responds successfully to a change request, that change persists across any number of failures, as long as a quorum of servers is eventually able to recover.

Example use-cases. Configuration management (which servers are doing what role?); synchronization primitives. Anti-use cases: storing large amounts of data; sharing messages and data that don't require liveness/durability guarantees.

ZooKeeper - Overview. Each client maintains a session with a single ZK server. [Diagram: client apps connected to a ZooKeeper ensemble consisting of one leader and several followers.]

ZooKeeper - Sessions. Each client maintains a session with a single ZK server. Sessions are valid for some time window. If a client discovers it is disconnected from its ZK server, it attempts to reconnect to a different server before the session expires.
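A minimal sketch of opening a session with the raw ZooKeeper Java client, assuming a hypothetical three-server connection string and a 10-second session timeout; the watcher here just prints connection-state events.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical ensemble addresses; the client picks one server and
            // transparently tries the others if it gets disconnected.
            String connectString = "zk1:2181,zk2:2181,zk3:2181";
            int sessionTimeoutMs = 10000; // the session is valid for this window

            ZooKeeper zk = new ZooKeeper(connectString, sessionTimeoutMs, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    // Connection-state transitions (SyncConnected, Disconnected,
                    // Expired) are delivered to this default watcher.
                    System.out.println("Session event: " + event.getState());
                }
            });

            // ... use zk ...
            zk.close();
        }
    }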

ZooKeeper - Overview. [Diagram: a client reads x from the follower it is connected to and gets back x=10.] Reads are processed locally by the server the client is connected to.

ZooKeeper - Overview. [Diagram: a client writes x=11; the write goes to the leader, which replicates it to the followers and collects their acknowledgements.] The leader coordinates writes and waits for a quorum of servers to ack.

ZooKeeper - Data Model. ZooKeeper provides a hierarchical namespace. Each node is called a znode, and ZooKeeper provides an API to manipulate these nodes. [Diagram: a tree rooted at /, with children /app1 and /app2, and /app1 having children /app1/z1 and /app1/z2.]

ZooKeeper - ZNodes. Znodes hold in-memory data. They are NOT for storing general data, just metadata (they are replicated and generally stored in memory). They map to some client abstraction, for instance locks. Znodes maintain counters and timestamps as metadata.

ZooKeeper - Znode Types. Regular znodes: can have child znodes; created and deleted by clients explicitly through the API. Ephemeral znodes: cannot have children; created by clients explicitly; deleted by clients OR removed automatically when the session of the client that created them ends.
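A minimal sketch contrasting the two znode types, assuming an already-connected ZooKeeper handle (as in the session sketch above); the /workers path and data are hypothetical, and /workers is assumed not to exist yet.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeTypesExample {
        static void register(ZooKeeper zk) throws KeeperException, InterruptedException {
            // Regular (persistent) znode: lives until some client explicitly deletes it.
            zk.create("/workers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Ephemeral znode: removed automatically when this client's session ends,
            // which makes it a natural way to advertise "worker-1 is alive".
            zk.create("/workers/worker-1", "host:port".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }
    }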

ZooKeeper - API. Clients track changes to znodes by registering a watch. create(path, data, flags); delete(path, version); exists(path, watch); getData(path, watch); setData(path, data, version); getChildren(path, watch); sync(path).
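A minimal sketch of the watch and version parameters, again assuming a connected handle; the /config path and the value written are hypothetical. exists/getData/getChildren register one-shot watches, and setData takes the version observed at read time so that a concurrent update is detected.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class WatchAndVersionExample {
        static void readThenUpdate(ZooKeeper zk) throws KeeperException, InterruptedException {
            Stat stat = new Stat();

            // Read the current value; the 'true' flag registers a one-shot watch
            // (delivered to the session's default watcher) for changes to /config/x.
            byte[] value = zk.getData("/config/x", true, stat);
            System.out.println("x = " + new String(value) + " (version " + stat.getVersion() + ")");

            // Conditional update: succeeds only if nobody has written since our read;
            // otherwise ZooKeeper throws KeeperException.BadVersionException.
            zk.setData("/config/x", "11".getBytes(), stat.getVersion());

            // List children of /config, also registering a watch for membership changes.
            for (String child : zk.getChildren("/config", true)) {
                System.out.println("child: " + child);
            }
        }
    }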

ZooKeeper - Consistency. [Diagram: one client writes x=11 through the leader while another client reads x from a follower that has become disconnected from the ensemble and still returns x=10.] There is a time window before the follower realizes it is disconnected in which it can serve stale reads!

ZooKeeper - Consistency. [Diagram: the same scenario, but the reading client issues sync before its read.] The sync() command forces ZK to make sure the server the client is talking to is up to date.
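A minimal sketch of the sync-then-read pattern with the raw Java client (connected handle and /config/x path assumed). sync() is asynchronous, so the sketch blocks on a latch before reading.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.AsyncCallback;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class SyncReadExample {
        static byte[] syncThenRead(ZooKeeper zk) throws KeeperException, InterruptedException {
            CountDownLatch synced = new CountDownLatch(1);

            // Ask the server we are connected to to catch up with the leader.
            zk.sync("/config/x", new AsyncCallback.VoidCallback() {
                @Override
                public void processResult(int rc, String path, Object ctx) {
                    synced.countDown();
                }
            }, null);
            synced.await();

            // The local server has now seen the updates that preceded the sync.
            return zk.getData("/config/x", false, null);
        }
    }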

ZooKeeper - Consistency. Sequential consistency of writes: all updates are applied in the order they are sent, linearized into a total order by the leader. Atomicity of writes: updates either succeed or fail. Reliability: once a write has been applied, it will persist until it is overwritten, as long as a majority of servers don't crash. Timeliness: clients are guaranteed to be up to date for reads within a time bound, after which they either see the newest data or are disconnected.

ZooKeeper - Lock Example. To acquire a lock called foo: try to create an ephemeral znode called /locks/foo. If you succeeded, you have the lock. If you failed, set a watch on that node; when you are notified that the node is deleted, try to create it again. Note that there is no issue with consistency, since there is no read, just an atomic write. A minimal sketch of this loop is given below.
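A minimal sketch of that recipe with the raw Java client, assuming a connected handle; the path name is hypothetical, and the sketch ignores the herd effect that the production recipes avoid with sequential znodes.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SimpleLock {
        // Blocks until this client holds the lock named by lockPath (e.g. "/locks/foo").
        static void acquire(ZooKeeper zk, String lockPath)
                throws KeeperException, InterruptedException {
            while (true) {
                try {
                    // Ephemeral: if we crash while holding the lock, ZK releases it for us.
                    zk.create(lockPath, new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                    return; // we created the znode, so we hold the lock
                } catch (KeeperException.NodeExistsException e) {
                    // Someone else holds the lock: watch the znode and wait for deletion.
                    CountDownLatch deleted = new CountDownLatch(1);
                    Watcher watcher = new Watcher() {
                        public void process(WatchedEvent event) {
                            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                                deleted.countDown();
                            }
                        }
                    };
                    if (zk.exists(lockPath, watcher) != null) {
                        deleted.await(); // woken when the holder releases (or dies)
                    }
                    // Either the node was already gone or it was just deleted: retry.
                }
            }
        }

        static void release(ZooKeeper zk, String lockPath)
                throws KeeperException, InterruptedException {
            zk.delete(lockPath, -1); // -1 means "any version"
        }
    }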

ZooKeeper - Recipes. Why figure out how to re-implement this low-level stuff (like locks) yourself? Recipes: https://zookeeper.apache.org/doc/r3.3.6/recipes.html, and in Java: http://curator.apache.org. Examples: locks, group membership. A Curator example is sketched below.
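A minimal sketch of Curator's pre-built lock recipe; the connection string and lock path are hypothetical.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class CuratorLockExample {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",          // hypothetical ensemble
                    new ExponentialBackoffRetry(1000, 3)); // retry failed operations
            client.start();

            InterProcessMutex lock = new InterProcessMutex(client, "/locks/foo");
            lock.acquire();              // blocks until the lock is held
            try {
                // ... critical section ...
            } finally {
                lock.release();
            }
            client.close();
        }
    }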

How Many ZooKeepers? How many ZooKeeper servers do you want? An odd number; 3-7 is typical. Too many and you pay a LOT for coordination.

Failure Handling in ZK. Just using ZooKeeper does not solve failures. Apps using ZooKeeper need to be aware of the potential failures that can occur and act appropriately. The ZK client will guarantee consistency if it is connected to the server cluster.

Failure Handling in ZK. [Diagram: (1) the client issues a create to its ZK server; (2) that server, ZK2, has a network problem; (3) the client reconnects to ZK3; (4) the client reissues the create to ZK3.]

Failure Handling in ZK. If the client is in the middle of an operation when the connection fails, it receives a ConnectionLossException; the client also receives a disconnected message. Clients can't tell whether or not the operation was completed, though; perhaps it was completed just before the failure. Clients that are disconnected cannot receive any notifications from ZK.
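A minimal sketch of what that ambiguity looks like in code, assuming a connected handle and a hypothetical task path: on ConnectionLossException the caller learns only that the outcome is unknown.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConnectionLossExample {
        enum Outcome { APPLIED, UNKNOWN }

        static Outcome createTask(ZooKeeper zk, String path, byte[] data)
                throws KeeperException, InterruptedException {
            try {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                return Outcome.APPLIED;
            } catch (KeeperException.ConnectionLossException e) {
                // The create may or may not have been applied before the failure, and
                // no watch notifications are delivered while we are disconnected.
                // Don't assume either outcome: after reconnecting, check whether the
                // znode exists (see the reconnection sketch below) before retrying.
                return Outcome.UNKNOWN;
            }
        }
    }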

Dangers of ignorance. [Diagram: Client 1 creates /leader and then becomes disconnected from ZK. ZK notifies the other clients that /leader is gone, Client 2 creates /leader, and Client 2 is now the leader. When Client 1 reconnects, it discovers it is no longer the leader.]

Dangers of ignorance. Each client needs to be aware of whether or not it is connected: when disconnected, it cannot assume that it is still included in any way in operations. By default, the ZK client will NOT close the client session just because it is disconnected; the assumption is that eventually things will reconnect. It is up to you to decide whether or not to close it.
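A minimal sketch of tracking connection state with Curator; the reactions in the listener are assumptions about what an application might do, not prescribed project behavior.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;

    public class ConnectionAwareness {
        static void watchConnection(CuratorFramework client) {
            client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
                @Override
                public void stateChanged(CuratorFramework c, ConnectionState newState) {
                    switch (newState) {
                        case SUSPENDED: // disconnected; the session may still be alive
                        case LOST:      // session expired; our ephemeral znodes are gone
                            // Stop assuming we hold any locks or leadership roles.
                            break;
                        case RECONNECTED:
                            // Re-check the state of the world before resuming work.
                            break;
                        default:
                            break;
                    }
                }
            });
        }
    }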

ZK: Handling Reconnection. What should we do when we reconnect? Re-issue outstanding requests? We can't assume that outstanding requests didn't succeed. Example: create /leader succeeds, but we disconnect before hearing back; if we re-issue create /leader, it fails, because we already created it! We need to check what has changed in the world since we were last connected.
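A minimal sketch of that check with the raw client, using the /leader example from the slide; the client id stored in the znode is a hypothetical convention. On NodeExistsException after a reconnect, read the znode before concluding that someone else is the leader.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ReconnectExample {
        // Called after the session reconnects, when a create("/leader") was outstanding.
        static boolean reissueLeaderCreate(ZooKeeper zk, String myId)
                throws KeeperException, InterruptedException {
            try {
                zk.create("/leader", myId.getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;
            } catch (KeeperException.NodeExistsException e) {
                // /leader already exists, but don't assume we lost: our earlier create
                // may have succeeded just before the disconnect. Check whose id is in it.
                // (If the node vanishes in between, getData throws NoNodeException and
                // the caller should simply retry.)
                byte[] data = zk.getData("/leader", false, null);
                return new String(data).equals(myId);
            }
        }
    }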

ZooKeeper in Final Project. [Diagram: a leader and four followers, all connected to the coordination service.] All writes go to the leader: who is the leader? Once we hit the leader, is it sure that it still is the leader? The leader broadcasts read-invalidates to clients: who is still alive? Reads are processed on each client; if a client doesn't have the data cached, it contacts the leader, so again: who is the leader?

ZooKeeper in Final Project. [Diagram: the same architecture, now annotated with PUT x=5, PUT x=7, Invalidate x, and GET x messages flowing between the followers, the leader, and the coordination service.] Failures can happen anywhere, anytime! The same questions arise at every step: who is the leader? Is it still the leader once we reach it? Which clients are still alive?
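One way the recurring "who is the leader?" question can be answered is with a leader-election recipe. Here is a minimal Curator LeaderLatch sketch; the connection string, path, and participant id are hypothetical, and the actual project may call for a different design.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderElectionExample {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",          // hypothetical ensemble
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Every participant creates a LeaderLatch on the same path; ZooKeeper
            // ensures exactly one of them holds leadership at a time.
            LeaderLatch latch = new LeaderLatch(client, "/election", "server-1");
            latch.start();

            latch.await(); // block until this process becomes leader
            if (latch.hasLeadership()) {
                // Act as leader: accept writes, broadcast invalidations, etc.
            }

            latch.close(); // give up leadership
            client.close();
        }
    }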