Paxos Made Live. An Engineering Perspective. Authors: Tushar Chandra, Robert Griesemer, Joshua Redstone. Presented By: Dipendra Kumar Jha

Consensus Algorithms
- Consensus: the process of agreeing on one result among a group of participants
- Use a consensus algorithm to ensure that all replicas are mutually consistent
- Build an identical log of values on each replica (sketched below)
- Use Paxos to build a fault-tolerant database for Chubby
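The identical-log idea can be made concrete with a minimal sketch, assuming the consensus layer reports one agreed value per log slot. ReplicatedLog, Commit, and the slot numbering are names invented for this note, not the paper's API:

```cpp
// Illustrative sketch only: a replica's view of the replicated log.
// Values are applied strictly in slot order, so replicas that agree on the
// same per-slot values build identical logs.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

class ReplicatedLog {
 public:
  // Called once consensus has been reached for `slot`.
  void Commit(uint64_t slot, const std::string& value) {
    committed_[slot] = value;
    // Apply entries in order; stop at the first gap.
    while (true) {
      auto it = committed_.find(next_to_apply_);
      if (it == committed_.end()) break;
      applied_.push_back(it->second);  // hand off to the database layer
      ++next_to_apply_;
    }
  }

  const std::vector<std::string>& applied() const { return applied_; }

 private:
  std::map<uint64_t, std::string> committed_;  // slot -> agreed value
  std::vector<std::string> applied_;           // values applied so far, in order
  uint64_t next_to_apply_ = 0;
};
```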

Chubby
Figure 1: A single Chubby replica.

Paxos
Three phases:
1. Election of a coordinator (Propose and Promise)
- When a replica wants to become the coordinator, it generates a unique sequence number higher than any it has seen and broadcasts it to all replicas in a propose message
- If a majority of replicas promise (reply indicating they have not seen a higher sequence number), the replica acts as the coordinator
- Promise messages include the most recent value the replica has heard, if any, along with the sequence number under which it heard that value
(A sketch of this phase follows below.)
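A hedged sketch of phase 1 at a single replica, assuming sequence numbers start above zero; Acceptor, OnPropose, and the Promise struct are invented for this note and are not the paper's code:

```cpp
// Illustrative sketch of phase 1 at a replica; not the paper's code.
#include <cstdint>
#include <optional>
#include <string>

struct Promise {
  bool granted;                          // true if the promise was given
  uint64_t accepted_seq;                 // sequence number of the value below
  std::optional<std::string> accepted;   // most recent value heard, if any
};

class Acceptor {
 public:
  Promise OnPropose(uint64_t seq) {
    if (seq <= highest_seen_) {
      // Already promised an equal or higher sequence number: refuse.
      return {false, 0, std::nullopt};
    }
    highest_seen_ = seq;
    // Report the most recent value we have heard, if any, so the new
    // coordinator proposes it again rather than losing it.
    return {true, accepted_seq_, accepted_value_};
  }

 private:
  uint64_t highest_seen_ = 0;                  // highest sequence number promised
  uint64_t accepted_seq_ = 0;                  // sequence number of accepted_value_
  std::optional<std::string> accepted_value_;  // value from a previous phase 2, if any
};
```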

Paxos (continued)
2. Accept and Acknowledge
- The coordinator selects a value and broadcasts it to all replicas in accept messages
- Other replicas either acknowledge or reject the message
3. Commit
- If a majority of replicas acknowledge, consensus is reached
- The coordinator broadcasts a commit message to notify the replicas
- Note: multiple replicas may be acting as coordinators at once; the ordering imposed by sequence numbers keeps their proposals from conflicting
(Phases 2 and 3 are sketched in code below.)
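Continuing the same illustrative sketch for phases 2 and 3: a replica accepts a value only if it has not promised a higher sequence number in the meantime, and the coordinator commits once a majority has acknowledged. Replica, OnAccept, OnCommit, and MajorityAcknowledged are assumptions for the example, not the paper's interfaces:

```cpp
// Illustrative sketch of phases 2 and 3; names are invented for this note.
#include <cstdint>
#include <string>

class Replica {
 public:
  // Phase 2 at a replica: accept only if no higher sequence number has been
  // promised since the coordinator's phase 1.
  bool OnAccept(uint64_t seq, const std::string& value) {
    if (seq < highest_promised_) return false;  // reject
    highest_promised_ = seq;
    accepted_seq_ = seq;
    accepted_value_ = value;
    return true;  // acknowledge
  }

  // Phase 3 at a replica: learn that `value` is the consensus outcome.
  void OnCommit(const std::string& value) { committed_value_ = value; }

 private:
  uint64_t highest_promised_ = 0;
  uint64_t accepted_seq_ = 0;
  std::string accepted_value_;
  std::string committed_value_;
};

// Coordinator side: consensus is reached once a majority acknowledges,
// after which the commit message is broadcast.
inline bool MajorityAcknowledged(int acks, int num_replicas) {
  return acks > num_replicas / 2;
}
```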

Multi-Paxos
- Use Paxos to achieve consensus on a sequence of values
- Chain together multiple instances (executions) of Paxos
- Use a catch-up mechanism for lagging replicas (sketched below)
- Each replica maintains a locally persistent log that records all Paxos actions and drives catch-up
- A master may be selected for a long period as an optimization: propose messages can be omitted if the coordinator identity does not change between instances
- It is possible to batch a collection of values submitted by different application threads into a single Paxos instance
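A small sketch of the Multi-Paxos bookkeeping, under the assumption that each log slot corresponds to one Paxos instance; MultiPaxosReplica, EntriesSince, and CatchUpFrom are invented names, and the real catch-up protocol in the paper is more involved:

```cpp
// Illustrative sketch: one Paxos instance per log slot, plus a catch-up path
// where a lagging replica asks a peer for the slots it is missing.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t slot;
  std::string value;
};

class MultiPaxosReplica {
 public:
  // Record the outcome of Paxos instance `slot` in the local persistent log.
  void OnInstanceDecided(uint64_t slot, const std::string& value) {
    log_[slot] = value;
  }

  // Catch-up: return every decided entry at or above `from_slot`.
  std::vector<LogEntry> EntriesSince(uint64_t from_slot) const {
    std::vector<LogEntry> out;
    for (auto it = log_.lower_bound(from_slot); it != log_.end(); ++it) {
      out.push_back({it->first, it->second});
    }
    return out;
  }

  // A lagging replica installs the entries it was missing.
  void CatchUpFrom(const std::vector<LogEntry>& entries) {
    for (const auto& e : entries) log_[e.slot] = e.value;
  }

 private:
  std::map<uint64_t, std::string> log_;  // slot -> decided value
};
```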

Implementation of Paxos: Algorithmic Challenges
- Handling disk corruption (checksums, a marker, catch-up)
- Master leases (the master can serve up-to-date data for read operations)
- Epoch numbers (detect master turnover and abort affected operations; sketched below)
- Group membership (changes in the set of replicas)
- Snapshots (snapshot handles)
- Database transactions (MultiOp)
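One way to picture the epoch-number mechanism, as an assumption-laden sketch rather than the paper's implementation: a counter is bumped on every master change, and an operation that started under an older epoch is treated as stale and aborted. EpochGuard and its methods are hypothetical names:

```cpp
// Illustrative sketch of epoch numbers for detecting master turnover.
#include <atomic>
#include <cstdint>

class EpochGuard {
 public:
  // Called whenever a new master takes over.
  void MasterChanged() { current_epoch_.fetch_add(1); }

  uint64_t CurrentEpoch() const { return current_epoch_.load(); }

  // An operation may complete only if no master turnover happened since it
  // started; otherwise it is aborted and must be retried.
  bool StillValid(uint64_t epoch_at_start) const {
    return epoch_at_start == current_epoch_.load();
  }

 private:
  std::atomic<uint64_t> current_epoch_{0};
};
```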

Implementation of Paxos: Software Engineering
- Expressing the algorithm effectively: a simple state-machine specification language and a compiler that translates specifications into C++
- Runtime consistency checking: assert statements and checksum requests (sketched below)
- Testing: a safety mode and a liveness mode
- Concurrency: the database is multi-threaded in order to take snapshots, compute checksums, and process iterators
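An illustrative stand-in for the checksum-request idea: each replica computes a checksum at the same log position and the results are compared. The hash here is FNV-1a over the applied log values and the function names are invented; the real system checksums the database itself:

```cpp
// Illustrative runtime consistency check; not the paper's implementation.
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy checksum over the values applied so far.
inline uint64_t ChecksumLog(const std::vector<std::string>& applied) {
  uint64_t h = 0xcbf29ce484222325ull;  // FNV-1a offset basis
  for (const std::string& v : applied) {
    for (unsigned char c : v) {
      h ^= c;
      h *= 0x100000001b3ull;  // FNV-1a prime
    }
  }
  return h;
}

// Invoked after collecting checksums from all replicas at the same log position.
inline void CheckReplicasAgree(const std::vector<uint64_t>& checksums) {
  for (uint64_t c : checksums) {
    assert(c == checksums.front() && "replica databases diverged");
  }
}
```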

Implementation of Paxos: Unexpected Failures
- Rapid master failover caused by heavy multi-threading
- A failure while upgrading a Chubby cell
- Semantics that differed from the original Chubby
- One database replica diverging from the others
- A failure in the upgrade script used to migrate cells from the 3DB version to the Paxos version
- Failures caused by bugs in the underlying operating system

Measurements
- The tests were designed to be write-intensive
- If the contents of the file are small, the test measures the latency of the system; if they are large, it measures throughput (see the benchmark sketch below)
- The system takes a snapshot whenever the replicated log exceeds 100 MB
- The performance advantage grows with the size of the file and the number of workers
- The system is not optimized for performance; it could be made considerably faster
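A minimal benchmark sketch, assuming a blocking SubmitWrite call (a hypothetical stand-in for a write to the replicated database, not an API from the paper): with a small payload the latency figure is the interesting one, with a large payload the MB/s figure is.

```cpp
// Minimal benchmark sketch; SubmitWrite is hypothetical.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

void SubmitWrite(const std::string& payload) {
  // Stand-in: would block until the write is committed by the replicas.
  (void)payload;
}

void RunBenchmark(std::size_t payload_bytes, int num_ops) {
  const std::string payload(payload_bytes, 'x');
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < num_ops; ++i) SubmitWrite(payload);
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  const double secs = elapsed.count();
  // Small payloads highlight latency; large payloads highlight throughput.
  std::printf("mean latency: %.3f ms, throughput: %.2f MB/s\n",
              1000.0 * secs / num_ops,
              static_cast<double>(payload_bytes) * num_ops / (secs * 1e6));
}
```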

Summary and Open Problems
- An implementation of a fault-tolerant database based on the Paxos consensus algorithm
- Building this system was significantly harder than anticipated
- There is a significant gap between the Paxos algorithm and a real-world system
- There are no tools for implementing fault-tolerant algorithms
- There are not enough testing criteria for building fault-tolerant systems
- In short: a significant gap between theory and practice, and not enough tools in fault-tolerant computing
