CSE 486/586 Distributed Systems: Google Chubby Lock Service

Recap
CSE 486/586 Distributed Systems: Google Chubby Lock Service. Steve Ko, Computer Sciences and Engineering, University at Buffalo.
Paxos is a consensus algorithm. Proposers? Acceptors? Learners? A proposer always makes sure that, if a value has been chosen, it always proposes the same value. Three phases: Prepare ("What's the last proposed value?"), Accept ("Accept my proposal."), and Learn ("Let's tell the other nodes about the consensus.").

Recap: First Requirement
In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. P1. An acceptor must accept the first proposal that it receives.

Recap: Second Requirement
But the first requirement is not enough; there are cases in which it yields no consensus at all. We need to let acceptors accept multiple proposals. Then we need to guarantee that once a majority chooses a value, all majorities choose the same value, i.e., all chosen proposals have the same value. This guarantees that only one value is chosen, and it gives our next requirement. P2. If a proposal with value V is chosen, then every higher-numbered proposal that is chosen has value V.

Recap: Strengthening P2
OK; how do we guarantee that? Can acceptors do something? Yes! So we can strengthen P2: P2a. If a proposal with value V is chosen, then every higher-numbered proposal accepted by any acceptor has value V. But guaranteeing P2a might be difficult because of P1. Scenario: a value V is chosen; an acceptor C never receives any proposal (due to asynchrony); a proposer fails, recovers, and issues a different proposal with a higher number and a different value; C accepts it, violating P2a. By strengthening P2 this way, we have changed the requirement into something that acceptors need to guarantee. The sketch below walks through this scenario.
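To make the scenario concrete, here is a minimal sketch (not from the lecture; class and variable names are illustrative) that simulates three acceptors following only P1, "accept the first proposal you receive":

class P1Acceptor:
    """Accepts every proposal it hears about; remembers what it accepted."""
    def __init__(self, name):
        self.name = name
        self.accepted = []          # list of (number, value) proposals accepted

    def accept(self, number, value):
        self.accepted.append((number, value))

acceptors = [P1Acceptor(n) for n in ("A", "B", "C")]

# Proposal (1, "V") reaches the majority {A, B}; value "V" is chosen.
for acc in acceptors[:2]:
    acc.accept(1, "V")

# Acceptor C never hears proposal 1 (asynchrony). A proposer fails, recovers,
# and issues a higher-numbered proposal with a different value, which C accepts.
acceptors[2].accept(2, "W")

# P2a is violated: a higher-numbered proposal with a value other than "V" has
# been accepted even though "V" was already chosen by a majority.
chosen_value = "V"
violations = [(n, v) for acc in acceptors for (n, v) in acc.accepted
              if n > 1 and v != chosen_value]
print("P2a violations:", violations)   # -> [(2, 'W')]

C is forced by P1 to accept the first proposal it receives, so it cannot uphold P2a by itself; this is why the next step pushes the requirement onto proposers (P2b).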

Recap: Combining P1 & P2a
Then can proposers do anything about that? P2b. If a proposal with value V is chosen, then every higher-numbered proposal issued by any proposer has value V. Now we have changed the requirement P2 into something that each proposer has to guarantee.

How to Guarantee P2b
P2b. If a proposal with value V is chosen, then every higher-numbered proposal issued by any proposer has value V. There are two cases for a proposer proposing (N, V). If the proposer knows that no proposal numbered less than N has been or will be chosen by a majority, it can propose any value. Otherwise, it has to make sure that it proposes the same value that has been chosen by a majority. (Rough) intuition for the first case: if there is a proposal chosen by a majority set S, then any majority set S' will intersect with S. Thus, if the proposer asks the acceptors and gets replies from a majority saying that they did not and will not accept any proposal numbered less than N, then we are fine.

Invariant to Maintain
P2c. For any V and N, if a proposal with value V and number N is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted or will accept any proposal numbered less than N, or (b) V is the value of the highest-numbered proposal among all proposals numbered less than N accepted by the acceptors in S.

Paxos Phase 1
A proposer chooses its proposal number N and sends a prepare request to the acceptors; this maintains P2c. Acceptors need to reply with: a promise not to accept any proposal numbered less than N any more (to make sure that the protocol doesn't deal with old proposals), and, if there is one, the accepted proposal with the highest number less than N.

Paxos Phase 2
If the proposer receives replies from a majority, it sends an accept request with the proposal (N, V), where V is the value of the highest-numbered accepted proposal returned in phase 1, or any value if no accepted proposal was returned. Upon receiving (N, V), acceptors maintain P2c by either accepting it, or rejecting it if they have already replied to another prepare request with a number higher than N.

Paxos Phase 3
Learners need to know which value has been chosen. There are many possibilities. One way: have each acceptor respond to all learners; this works but is expensive. Another way: elect a distinguished learner; acceptors report their acceptances to this process, and it informs the other learners; this is failure-prone. Mixing the two: use a set of distinguished learners. A minimal sketch of phases 1 and 2 follows.
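The following is a single-machine sketch of phases 1 and 2 (prepare/promise and accept), assuming in-memory acceptors, no message loss, and no failures; the names Acceptor and propose are illustrative, not from the Paxos paper.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Acceptor:
    promised_n: int = -1                        # highest prepare number promised
    accepted: Optional[Tuple[int, str]] = None  # (number, value) last accepted

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered < n,
        and report the highest-numbered proposal accepted so far."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, my_value):
    """Phases 1a and 2a for one proposer; returns the accepted value, or None."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for (kind, acc) in promises if kind == "promise"]
    if len(granted) < majority:
        return None

    # P2c: if any acceptor already accepted something, adopt the value of the
    # highest-numbered accepted proposal; otherwise we are free to use our own.
    already = [acc for acc in granted if acc is not None]
    value = max(already)[1] if already else my_value

    # Phase 2: ask the acceptors to accept (n, value).
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, my_value="X"))   # -> 'X'
print(propose(acceptors, n=2, my_value="Y"))   # -> 'X' (must keep the chosen value)

The second proposer would like to propose "Y", but because a value was already accepted by a majority, P2c forces it to adopt "X"; that is what keeps every chosen proposal's value the same.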

Problem: Progress (Liveness)
There is a race condition for proposals. P0 completes phase 1 with a proposal number N0. Before P0 starts phase 2, P1 starts and completes phase 1 with a proposal number N1 > N0. P0 performs phase 2, and the acceptors reject. Before P1 starts phase 2, P0 restarts and completes phase 1 with a proposal number N2 > N1. P1 performs phase 2, and the acceptors reject. This can go on forever. How do we solve this?

Providing Liveness
Solution: elect a distinguished proposer, i.e., have only one proposer. If the distinguished proposer can successfully communicate with a majority, the protocol guarantees liveness. That is, if a process plays all three roles, Paxos can tolerate f failures with f < N/2. We still need to get around FLP for the leader election itself, e.g., by using a failure detector.

CSE 486/586 Administrivia
More practice problems & an example final are posted. Quick poll: Android platform class. PhoneLab is hiring a testbed developer/administrator. The anonymous feedback form is still available. Please come talk to me!

Google Chubby
A lock service: it enables multiple clients to share a lock and coordinate. It is a coarse-grained lock service; locks are supposed to be held for hours and days, not seconds. In addition, it can store small files. Design target: low-rate locking/unlocking and low-volume information storage. Why would you need something like this?

Google Infrastructure Overview
Google File System (GFS): distributed file system. Bigtable: table-based storage. MapReduce: programming paradigm & its execution framework. These all rely on Chubby. Warning: the next few slides are intentionally shallow; the only purpose is to give some overview.

Google File System
A cluster file system with lots of storage (~12 disks per machine) and replication of files to combat failures.

Google File System
Files are divided into chunks, 64 MB per chunk, which are distributed and replicated over servers. Two entities: one master and chunk servers. The master maintains all file system metadata: the namespace, access control info, filename-to-chunks mappings, and the current locations of chunks. The master replicates its data for fault tolerance, and it periodically communicates with all chunk servers via heartbeat messages to get state and send commands. Chunk servers respond to read/write requests and to the master's commands.

Bigtable
Table-based storage on top of GFS, and the main storage for a lot of Google services: Google Analytics, Google Finance, personalized search, Google Earth & Google Maps, etc. It gives a large logical table view to the clients; logical tables are divided into tablets and distributed over the Bigtable servers. Three entities: client library, one master, and tablet servers. A table consists of rows & columns, with (row, column, timestamp) -> cell contents. E.g., for web pages and relevant info: rows are URLs; columns hold the actual web page, (out-going) links, (incoming) links, etc.; cells are versioned using timestamps.

MapReduce
Programming paradigm. Map: (key, value) -> list of (intermediate key, intermediate value). Reduce: (intermediate key, list of intermediate values) -> (output key, output value). Programmers write Map & Reduce functions within the interface given above. Execution framework: Google MapReduce executes the Map & Reduce functions over a cluster of servers, with one master and many workers. Execution flow (from the diagram): input files are sent to map tasks on map workers; the master partitions the intermediate keys into reduce tasks on reduce workers, which produce the output files. A word-count sketch of this interface follows.
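As a concrete illustration of the Map/Reduce interface above, here is a minimal word-count sketch with a single-process runner; map_fn, reduce_fn, and run_local are illustrative names that stand in for the real distributed execution framework.

from collections import defaultdict

def map_fn(key, value):
    """Map: (document name, contents) -> list of (word, 1)."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: (word, list of counts) -> (word, total count)."""
    return (key, sum(values))

def run_local(inputs):
    """Group intermediate pairs by key, then apply reduce_fn per key."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return [reduce_fn(k, vs) for k, vs in sorted(intermediate.items())]

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
print(run_local(docs))
# [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]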

Common Theme
One master & multiple workers. Why one master? This design simplifies lots of things. The master is mainly used to handle metadata, and it is important to reduce the load on the single master. There is no need to deal with consistency issues, and the metadata mostly fits in memory, which makes access very fast. The obvious problem is failure: we can have one primary and backups, and then elect the primary out of the peers. How would you use a lock service like Chubby for this?

Chubby
A coarse-grained lock service: locks are supposed to be held for hours and days, not seconds. In addition, it can store small files. It is used for various purposes (e.g., master election) by GFS, Bigtable, and MapReduce: potential masters try to create a lock on Chubby, and the first one that gets the lock becomes the master. It is also used for storing small configuration data and access control lists.

Chubby Organization
A Chubby cell (an instance) typically has 5 replicas, but each cell still serves tens of thousands of clients. Among the 5 replicas, one master is elected; any replica can be the master, and they decide who it is via Paxos. The master handles all requests.

Client Interface
File system interface: from a client's point of view, it is almost like accessing a file system. A typical name is /ls/foo/wombat/pouch: "ls" (lock service) is common to all Chubby names, "foo" is the name of the Chubby cell, and "/wombat/pouch" is interpreted within the Chubby cell. A cell contains files and directories, called nodes. Any node can be a reader-writer lock, with reader (shared) mode and writer (exclusive) mode. Files can contain a small piece of information. Just like in a file system, each file is associated with some metadata, such as access control lists.

Client-Chubby Interaction
Clients (via the library) send KeepAlive messages, i.e., periodic handshakes; if Chubby doesn't hear back from a client, the client is considered to have failed. Clients can subscribe to events, e.g., file contents modified; child node added, removed, or modified; lock becomes invalid; etc. Clients cache data (file & metadata); if the cached data becomes stale, the Chubby master invalidates it. The Chubby master piggybacks events or cache invalidations on the KeepAlive replies, which ensures that clients keep their caches consistent.

Client Lock Usage
Each lock has a sequencer that is roughly a version number. Scenario: a process holding a lock L issues a request R; it then fails, and the lock gets freed. Another process acquires L and performs some action before R arrives. R may then be acted on without the protection of L, and potentially on inconsistent data. A sketch of a sequencer check follows.
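Here is a minimal sketch of how a sequencer check can prevent the stale request R from being acted on; the LockService class is a toy stand-in, not the actual Chubby interface (Chubby's real sequencer calls are listed in the client API below).

class LockService:
    """Toy stand-in for Chubby: hands out a new sequencer per lock acquisition."""
    def __init__(self):
        self.current_sequencer = 0

    def acquire(self):
        self.current_sequencer += 1          # new holder -> new generation number
        return self.current_sequencer

    def is_current(self, sequencer):
        return sequencer == self.current_sequencer

def handle_request(lock_service, request):
    """A server checks the sequencer attached to a request before acting on it."""
    if not lock_service.is_current(request["sequencer"]):
        return "rejected: stale lock generation"
    return "performed: " + request["op"]

chubby = LockService()
seq_old = chubby.acquire()                   # first holder gets sequencer 1
request_R = {"op": "write x=1", "sequencer": seq_old}

seq_new = chubby.acquire()                   # holder fails, lock re-acquired: sequencer 2
print(handle_request(chubby, request_R))     # -> rejected: stale lock generation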

Client API
open() & close(). GetContentsAndStat(): reads the whole file and its metadata. SetContents(): writes to the file. Acquire(), TryAcquire(), Release(): acquire and release the lock associated with the file. GetSequencer(), SetSequencer(), CheckSequencer().

Primary Election Example
All potential primaries open the lock file and attempt to acquire the lock. One succeeds and becomes the primary; the others become replicas. The primary writes its identity into the lock file with SetContents(). Clients and replicas read the lock file with GetContentsAndStat(), in response to a file-modification event. A sketch of this pattern appears at the end of these notes.

Chubby Usage
A snapshot of a Chubby cell (table omitted): few clients hold locks, and shared locks are rare. This is consistent with locking being used for primary election and for partitioning data among replicas.

Acknowledgements
These slides contain material developed and copyrighted by Indranil Gupta (UIUC).
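Finally, a sketch of the Primary Election Example above, written against a hypothetical chubby handle object whose method names mirror the client API listed earlier; the wrapper object, its exact signatures, and the lock path are assumptions for illustration only.

import socket

def run_primary_election(chubby, lock_path="/ls/cell/service/primary"):
    # Hypothetical client wrapper: open the lock file in the cell's namespace.
    handle = chubby.open(lock_path)

    # Every candidate tries to grab the exclusive (writer) lock; exactly one wins.
    if handle.TryAcquire():
        # The winner advertises its identity in the lock file for everyone else.
        handle.SetContents(socket.gethostname().encode())
        return "primary"

    # Losers act as replicas: read who the primary is.
    contents, _stat = handle.GetContentsAndStat()
    print("current primary:", contents.decode())
    return "replica"

In practice the replicas would also subscribe to file-modification events on the lock file, so they re-read it and learn promptly when a new primary writes its identity after a failover.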