Applications of Paxos Algorithm

Applications of Paxos Algorithm
Gurkan Solmaz
COP 6938 - Cloud Computing - Fall 2012
Department of Electrical Engineering and Computer Science
University of Central Florida - Orlando, FL
Oct 15, 2012

Outline: Apache ZooKeeper, Google Chubby, and other applications.

ZooKeeper - Overview
Apache ZooKeeper is an effort to develop and maintain an open-source server that enables highly reliable distributed coordination. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, groups, and naming.

ZooKeeper - Overview
Designed to be easy to program to. Runs in Java and has bindings for Java and C. Uses a data model similar to the directory tree structure of file systems. Coordination services are prone to errors such as race conditions and deadlock; the motivation is to relieve distributed applications from implementing coordination services from scratch.

ZooKeeper
Allows distributed processes to coordinate with each other through a shared hierarchical name space similar to a standard file system. The name space consists of data registers called znodes. Unlike a file system, ZooKeeper data is kept in-memory, so it can achieve high throughput and low latency and can be used in large distributed systems.

ZooKeeper
The servers that make up the ZooKeeper service must all know about each other. Servers maintain an in-memory image of the state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the service will be available.

ZooKeeper
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server. ZooKeeper stamps each update with a number that reflects the order of all transactions.
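For illustration, a minimal sketch of opening a client session with the standard ZooKeeper Java API; the ensemble host names and the session timeout are placeholders.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        // The connection string lists several ensemble members; the client picks one
        // and, if its TCP connection breaks, connects to a different server.
        String ensemble = "zk1:2181,zk2:2181,zk3:2181";   // placeholder hosts
        int sessionTimeoutMs = 10_000;                    // placeholder timeout

        ZooKeeper zk = new ZooKeeper(ensemble, sessionTimeoutMs, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Session events (connected, disconnected, expired) arrive on this callback.
                System.out.println("Session event: " + event.getState());
            }
        });

        // ... issue requests over the session, then close it.
        zk.close();
    }
}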

ZooKeeper Name Space
The hierarchical name space provided by ZooKeeper is similar to a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.

ZooKeeper Nodes
Each node in a ZooKeeper namespace can have data associated with it as well as children; it is like having a file system that allows a file to also be a directory. The term znode is used to refer to ZooKeeper data nodes. Each znode keeps version numbers, ACL changes, and time stamps to allow cache validations and coordinated updates.

ZooKeeper Nodes
The data stored at each znode in a name space is read and written atomically: reads get all the data bytes associated with a znode, and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what. Ephemeral nodes are znodes that exist as long as the session that created them is active; when the session ends, the ephemeral node is deleted.
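A brief sketch of these node types using the ZooKeeper Java API; the paths and payloads are made up for the example.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeExample {
    static void demo(ZooKeeper zk) throws Exception {
        // A persistent znode that holds data and can also have children.
        zk.create("/app1", "config-v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // An ephemeral child: it is deleted automatically when this session ends.
        zk.create("/app1/worker-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Reads return all of the znode's bytes atomically; a write replaces them all.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app1", false, stat);
        System.out.println("Read " + data.length + " bytes, version " + stat.getVersion());
        zk.setData("/app1", "config-v2".getBytes(), stat.getVersion());
    }
}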

ZooKeeper Watches
Clients can set a watch on a znode. A watch is triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and its server is broken, the client receives a local notification.
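A small sketch of setting and re-arming a watch with the ZooKeeper Java API; error handling is minimal and the path is a placeholder.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchExample {
    static void watchNode(ZooKeeper zk, String path) throws Exception {
        // exists() registers a one-shot watch on the znode.
        zk.exists(path, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // The notification only says that the znode changed; the client must
                // read it again to see the new data, and re-register the watch.
                System.out.println("Watch fired: " + event.getType() + " on " + event.getPath());
                try {
                    watchNode(zk, path);   // watches are one-shot, so re-arm
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
}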

ZooKeeper Guarantees
Sequential Consistency: updates from a client will be applied in the order that they were sent. Atomicity: updates either succeed or fail; there are no partial results. Single System Image: a client will see the same view of the service regardless of the server that it connects to. Reliability: once an update has been applied, it will persist from that time forward until a client overwrites it. Timeliness: the client's view of the system is guaranteed to be up-to-date within a certain time bound.

ZooKeeper Implementation
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

ZooKeeper Implementation
Every server services clients; clients connect to exactly one server to submit requests. Read requests are serviced from each server's local replica of the database. Requests that change the state of the service (write requests) are processed by an agreement protocol. As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader, as in the sketch below.
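A rough sketch, with invented type names, of this read/write split; it is not ZooKeeper's actual server code.

// Hypothetical types and routing logic, only to illustrate the read/write split.
interface Request  { boolean isWrite(); }
interface Response { }

interface LocalReplica  { Response read(Request r); }      // this server's in-memory data tree
interface LeaderChannel { Response forward(Request r); }   // link to the current leader

class RequestRouter {
    private final LocalReplica replica;
    private final LeaderChannel leader;

    RequestRouter(LocalReplica replica, LeaderChannel leader) {
        this.replica = replica;
        this.leader = leader;
    }

    Response handle(Request request) {
        // Writes go through the agreement protocol via the single leader;
        // reads are answered directly from the local replica.
        return request.isWrite() ? leader.forward(request) : replica.read(request);
    }
}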

ZooKeeper Implementation
The rest of the ZooKeeper servers are called followers. Followers receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders. ZooKeeper uses a custom atomic messaging protocol.

ZooKeeper Implementation
Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms it into a transaction that captures this new state, as sketched below.
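The sketch below illustrates the idea with invented names: the leader captures the resulting state (here, just the bumped version) inside the transaction so that applying it later is deterministic. It is not ZooKeeper's real transaction format.

import java.util.Map;

class LeaderWriteSketch {
    static final class Txn {
        final long zxid;          // position of the update in the total order
        final String path;
        final byte[] newData;
        final int newVersion;     // the new state is captured in the transaction itself

        Txn(long zxid, String path, byte[] newData, int newVersion) {
            this.zxid = zxid;
            this.path = path;
            this.newData = newData;
            this.newVersion = newVersion;
        }
    }

    private long nextZxid = 1;

    Txn toTransaction(Map<String, Integer> currentVersions, String path, byte[] newData) {
        // Compute what the state will be once the write is applied,
        // and record that outcome in the transaction.
        int newVersion = currentVersions.getOrDefault(path, 0) + 1;
        return new Txn(nextZxid++, path, newData, newVersion);
    }
}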

ZooKeeper Internals
At the heart of ZooKeeper is an atomic messaging system that keeps all of the servers in sync. The messaging system guarantees: Reliable delivery: if a message m is delivered by one server, it will eventually be delivered by all servers. Total order: if message a is delivered before message b by one server, a will be delivered before b by all servers. Causal order: if b is sent after a has been delivered by the sender of b, a must be ordered before b.

ZooKeeper Internals - Messaging Protocol
Packet: a sequence of bytes sent through a FIFO channel. Proposal: a unit of agreement; proposals are agreed upon by exchanging packets with a quorum of servers (e.g., the NEW_LEADER proposal). Message: a sequence of bytes to be atomically broadcast to all servers; a message is put into a proposal and agreed upon before it is delivered.
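As a purely illustrative sketch, these three terms could be mirrored by types like the following; the zxid field is an assumption, standing in for the update counter mentioned earlier.

final class Packet {
    final byte[] bytes;            // a sequence of bytes sent through a FIFO channel
    Packet(byte[] bytes) { this.bytes = bytes; }
}

final class Message {
    final byte[] bytes;            // payload to be atomically broadcast to all servers
    Message(byte[] bytes) { this.bytes = bytes; }
}

final class Proposal {
    final long zxid;               // assumed identifier ordering the proposal
    final Message message;         // the message is wrapped in a proposal and agreed
                                   // upon by a quorum before it is delivered
    Proposal(long zxid, Message message) { this.zxid = zxid; this.message = message; }
}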

ZooKeeper Internals
Two phases of messaging: Leader activation: a leader establishes the correct state of the system and gets ready to start making proposals. Active messaging: a leader accepts messages to propose and coordinates message delivery.

ZooKeeper Internals
Once the leader is elected, it starts sending proposals. All communication channels are FIFO, so everything is done in order. The leader sends a commit to all followers as soon as a quorum of followers has acknowledged a message, as in the sketch below.
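A rough sketch of that commit rule with invented names (not the real Zab implementation): the leader counts acknowledgments per proposal and commits once a majority, counting itself, has acknowledged.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class QuorumCommitSketch {
    private final int ensembleSize;
    private final Map<Long, Set<Integer>> acks = new HashMap<>();   // zxid -> acking follower ids

    QuorumCommitSketch(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    /** Records a follower's ACK; returns true when a quorum has acknowledged the proposal. */
    boolean onAck(long zxid, int followerId) {
        Set<Integer> voters = acks.computeIfAbsent(zxid, z -> new HashSet<>());
        voters.add(followerId);
        // The leader counts as one acknowledgment for its own proposal, hence the +1.
        return voters.size() + 1 > ensembleSize / 2;
    }
}

When onAck returns true, the leader would send a commit to all followers; because the channels are FIFO, commits are observed in proposal order.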

Chubby - Overview
Google's distributed lock service: a fault-tolerant system that provides a distributed locking mechanism and stores small files. Typically there is one Chubby instance, or cell, per data center. Several Google systems, such as the Google File System (GFS) and Bigtable, use Chubby for distributed coordination and to store a small amount of metadata.

Chubby
Chubby achieves fault tolerance through replication. A typical Chubby cell consists of five replicas, running the same code, each on a dedicated machine. Every Chubby object (e.g., a Chubby lock or file) is stored as an entry in a database; it is this database that is replicated. At any one time, one of the replicas is considered to be the master.

Chubby
Two components: a server and a client library. The master is selected in the cell using the Paxos algorithm.
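Chubby's API is internal to Google, so the following is a purely hypothetical client-library interface, loosely in the spirit of the lock service described here; every name and path is made up.

interface LockService {
    Handle open(String path);        // open a lock file in the Chubby-style cell
}

interface Handle {
    boolean tryAcquire();            // attempt to take the exclusive lock
    void release();
}

class PrimaryElectionSketch {
    static void electPrimary(LockService cell) {
        Handle lock = cell.open("/ls/cell/service-primary");   // placeholder path
        if (lock.tryAcquire()) {
            // Holding the lock makes this process the primary; systems like GFS and
            // Bigtable use Chubby in a similar way to elect and advertise a master.
            // ... do primary work, then lock.release() when stepping down.
        }
    }
}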

Other Applications
The OpenReplica replication service uses Paxos to maintain replicas for an open-access system that enables users to create fault-tolerant objects; it provides high performance through concurrent rounds and flexibility through dynamic membership changes. The Autopilot cluster management service is used by Microsoft for Bing.

Other Applications - Google Spanner
A massively scalable distributed NewSQL database platform designed by Google, used internally as part of Google's infrastructure. It uses the Paxos algorithm and makes heavy use of hardware-assisted time synchronization, based on GPS and atomic clocks, to ensure global consistency.

References
Dan C. Marinescu, Cloud Computing: Theory and Practice.
Apache ZooKeeper, http://zookeeper.apache.org
T. Chandra, R. Griesemer, and J. Redstone (2007), "Paxos Made Live - An Engineering Perspective".
Michael Isard (2007), "Autopilot: Automatic Data Center Management".
Philipp Lenssen, "Google Chubby and the Paxos Algorithm".
Zhang Yang, "Chubby and Paxos" presentation.
Wikipedia, http://en.wikipedia.org/wiki/paxos_(computer_science)

Thank you!