Highly Available In-Memory Metadata Filesystem using Viewstamped Replication
(https://github.com/pkvijay/metadr)

Pradeep Kumar Vijay, Pedro Ulises Cuevas Berrueco
Stanford CS244b (Distributed Systems)

Abstract

This paper describes the design and implementation of a highly available, key-value-store-based metadata filesystem built on Viewstamped Replication [1].

Introduction

Most distributed systems have to manage metadata corresponding to the service they provide. For example, the GFS [3] master node maintains the metadata of the entire filesystem, and ZooKeeper [2] provides a replicated metadata filesystem that can be used to implement other distributed services such as locking and leader election. This metadata needs to be highly available, since it can otherwise become a single point of failure. Viewstamped Replication [1] provides an algorithm for achieving consensus in distributed systems. We have implemented a simple in-memory key-value store that is replicated and made highly available using Viewstamped Replication. We have modeled the key-value store interface after the one in ZooKeeper, where the key has the format of a UNIX filesystem path and the value is a string. At any point in time there is a single primary and multiple backups. To guarantee strong consistency, all update/delete requests have to be issued to the primary; when the primary fails, one of the backups changes role to primary. In the current implementation, read requests must also be issued to the primary, but with just a few lines of code change the backups could serve reads as well. The system is designed for low read latency, so reads do not go through the replication protocol. If the backups served reads, there would be a greater probability of stale reads; to overcome that, a sync mechanism similar to the one in ZooKeeper would need to be provided. The system can tolerate f failures when the total number of nodes in the system configuration is 2f + 1. The system is implemented using C++11 on a Linux-based platform.

Client Workflow

There are two main components in the system: the client library and the backend server executable. The client library provides the APIs for the KV store, and users of the system are expected to use these APIs to store their metadata. The client library uses XDRPP to transfer client requests to the backend server. The backend takes care of processing the client request, modifying the actual key-value store, and replicating each update/deletion across all the backups. The main KV store client APIs are create, set, get, list and remove (a sketch of this interface is given at the end of this section). A shell-like utility is also provided that clients can use to invoke the different commands to manage the KV store without writing code against the client library.

If the size of the system configuration is N, then N instances of the backend server need to be started. One of them is elected primary and the others become backups. Each server instance is started with a configuration file. The configuration file contains the client and replication ports on which the server starts its RPC listener threads, and in addition it contains the IP endpoint details of all the nodes in the system. This is how each server instance knows the IP and port details of the nodes with which it has to communicate to implement the consensus protocol. The client needs to be aware of the system configuration, and all client requests should be sent only to the primary. If a request is sent to a backup, the client gets an error explaining that the node is not the primary. In case of primary failure, the client can broadcast the request to all the nodes; only one of them, the new primary, accepts the request, and from then on the client can send requests directly to the new primary (a sketch of this retry logic follows the interface sketch below). Server management client APIs can be provided for reconfiguration and shutdown support; these would allow clients to dynamically add more server instances to the system and to cleanly shut down any server instance for maintenance reasons.
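As a rough illustration, a minimal C++ sketch of such a client interface might look like the following; the class name, method signatures and status codes are assumptions for exposition and do not match the repository exactly.

    // Illustrative sketch of the KV store client API (create, set, get, list, remove).
    // Names and signatures are assumptions, not the actual client-library interface.
    #include <string>
    #include <vector>

    enum class KvStatus { Ok, NotPrimary, NoSuchKey, InvalidPath, Error };

    class KvStoreClient {
    public:
        // Keys follow the UNIX filesystem path format, e.g. "/services/locks/l1".
        virtual KvStatus create(const std::string& key, const std::string& value) = 0;
        virtual KvStatus set(const std::string& key, const std::string& value) = 0;
        virtual KvStatus get(const std::string& key, std::string* value) = 0;
        virtual KvStatus list(const std::string& keyPrefix,
                              std::vector<std::string>* children) = 0;
        virtual KvStatus remove(const std::string& key) = 0;
        virtual ~KvStoreClient() = default;
    };

The shell utility mentioned above would simply translate interactive commands into calls on an interface of this shape.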

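The client-side failover behaviour described above can be sketched as follows; the transport callback, endpoint type and status codes are hypothetical stand-ins, not the actual library API.

    // Illustrative primary discovery after a primary failure: try the cached
    // primary, then probe every configured node; only the new primary accepts.
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    struct Endpoint { std::string host; int port; };
    enum class KvStatus { Ok, NotPrimary, Error };

    // Hypothetical transport callback: send one request to one endpoint.
    using SendFn = std::function<KvStatus(const Endpoint&, const std::string&, std::string*)>;

    KvStatus sendRequest(const SendFn& sendTo, const std::vector<Endpoint>& nodes,
                         std::size_t* primaryIdx, const std::string& req, std::string* reply) {
        if (sendTo(nodes[*primaryIdx], req, reply) == KvStatus::Ok)
            return KvStatus::Ok;
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            if (i == *primaryIdx) continue;
            if (sendTo(nodes[i], req, reply) == KvStatus::Ok) {
                *primaryIdx = i;            // remember the new primary
                return KvStatus::Ok;
            }
        }
        return KvStatus::Error;
    }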
Design

Figure 1: Architecture (Client, KvStore API, Primary, Backups, with replication between the primary and the backups).

Figure 1 provides the high-level architecture of the system: the client uses XDRPP to talk to the primary key-value server, which in turn uses XDRPP to replicate each change across the backups.

Figure 2: Module Diagram (KvClient, XDRPP, Replicator, ViewChangeHandler, RecoveryHandler, KvMemStore, OpLogger, CheckpointManager).

Figure 2 provides the high-level modules of the system, which are described below.

The XDRPP [6] module provides an XDR-based RPC [4] implementation. XDR [5] stands for External Data Representation and defines the interface definition language (IDL) used to express the data structures transmitted between machines over RPC. XDRPP comprises a compiler that generates stubs from XDR-based IDLs and a runtime library that hides the network communication details and facilitates RPC invocations across different machines.

The KVClient module provides the client library that clients link against to invoke the APIs that store their metadata. KVClient uses XDRPP to issue RPCs to the KVMemStore and returns the results to the clients.

The KVMemStore module implements the in-memory key-value data structure. It implements the KV store APIs and responds to the RPCs invoked by KVClient. It performs the necessary validations before updating the in-memory data structure; one example is ensuring that keys are in the proper filesystem path format. KVMemStore calls into the Replicator to replicate any change before committing it into its in-memory structure. When KVMemStore runs as a backup, it registers callbacks with the Replicator so that the backup replicator can make upcalls to KVMemStore to apply committed operations.

The Replicator module implements the Viewstamped Replication [1] protocol. For every update/deletion operation it runs the distributed two-phase-commit-style protocol described in the Viewstamped Replication paper, committing the operation only once a quorum has been achieved for it. The Replicator uses the OpLogger to log operations. The Replicator runs either in primary mode or in backup mode. In primary mode it accepts replicate requests from KVMemStore and initiates the replication protocol with the backups. In backup mode it accepts requests from the primary replicator, processes them, and responds back to the primary. The replicator running in backup mode also keeps track of the primary's health and invokes the failover protocol if it suspects the primary has died.

The ViewChangeHandler module implements the failover protocol described in the Viewstamped Replication paper. It is invoked whenever the backup replicator suspects that the primary has died, and it runs the distributed view-change protocol to select a new primary and a new view. Once a quorum is achieved for the new view, the backups start responding to requests from the new primary. The concept of a view is integral to Viewstamped Replication: backup replicators accept requests from the primary only if they are in the same view. Operation logs are transmitted between the cohorts to implement the view-change protocol, and the OpLogger module is used to obtain the latest log.

The RecoveryHandler module implements the recovery algorithm described in the Viewstamped Replication paper. Whenever a node enters the recovering state, the replicator uses RecoveryHandler to recover the node's state. The recovery algorithm has to transmit the logs, and the OpLogger module is used to obtain the latest log. If the log has been trimmed, the CheckpointManager is used to reconstruct the complete log or to asynchronously transfer the checkpointed state to the recovering node before the actual recovery algorithm runs.

The OpLogger module facilitates logging the operations. The log is maintained in an in-memory data structure. It provides the interface to retrieve the log and to check whether a particular operation has been logged. The log is an integral part of the Viewstamped Replication algorithm: it is used to decide whether a particular operation request should be accepted, and it ensures that operations are committed in exactly the same order across all the cohorts.
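The replicate-before-commit interaction between KVMemStore and the Replicator can be illustrated with the following simplified C++ sketch; the type and method names are assumptions for exposition and do not match the repository exactly.

    // Simplified sketch of the replicate-before-commit flow on the primary.
    // IReplicator and Operation are illustrative stand-ins for the real classes.
    #include <map>
    #include <string>

    struct Operation { std::string type; std::string key; std::string value; };

    class IReplicator {
    public:
        // Returns true only after a quorum of backups has accepted the operation.
        virtual bool replicate(const Operation& op) = 0;
        virtual ~IReplicator() = default;
    };

    class KvMemStore {
    public:
        explicit KvMemStore(IReplicator* replicator) : replicator_(replicator) {}

        bool set(const std::string& key, const std::string& value) {
            Operation op{"set", key, value};
            if (!replicator_->replicate(op))   // replicate first ...
                return false;
            store_[key] = value;               // ... commit locally only on quorum
            return true;
        }

        bool get(const std::string& key, std::string* value) const {
            auto it = store_.find(key);
            if (it == store_.end()) return false;
            *value = it->second;
            return true;
        }

    private:
        IReplicator* replicator_;
        std::map<std::string, std::string> store_;
    };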

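The OpLogger role described above could look roughly like the following interface; this is a hedged sketch, not the actual class in the repository.

    // Illustrative in-memory operation log, keyed by op number.
    // Names are assumptions; the real OpLogger API may differ.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct LogEntry { uint64_t opNumber; std::string request; };

    class OpLog {
    public:
        void append(uint64_t opNumber, const std::string& request) {
            entries_[opNumber] = LogEntry{opNumber, request};
        }
        bool isLogged(uint64_t opNumber) const {
            return entries_.count(opNumber) != 0;
        }
        // Log suffix after 'fromOpNumber', e.g. for view change or recovery.
        std::vector<LogEntry> suffixAfter(uint64_t fromOpNumber) const {
            std::vector<LogEntry> out;
            for (auto it = entries_.upper_bound(fromOpNumber); it != entries_.end(); ++it)
                out.push_back(it->second);
            return out;
        }
    private:
        std::map<uint64_t, LogEntry> entries_;  // ordered by op number
    };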
The CheckpointManager is used to trim the log and to persist the KV state to disk. The OpLogger notifies this module whenever a certain log-size threshold is reached, and the CheckpointManager then asynchronously runs through the log and persists it to disk. It also provides the interfaces needed to obtain the state that has been checkpointed.

Key modules missing from the figure above are StateTransfer and Reconfiguration; refer to the Status and Future Work sections for more details on those.

Figure 3: Class Diagram (Client Base, KvStore (singleton), IReplicator, KvS::Client, IOpLogHandler, VsReplicator, KvStoreApi_v1, ReplicaState, KvStoreApi_v1Server, ViewChangeHandler, InMemOpLogHandler, VsReplicationClient, RecoveryHandler, IChkptHandler, Replication_Api_v1, Replication_Api_v1Server, Configurator (singleton), SqliteChkpHandler, ViewChangeClient, ViewChange_Api_v1, ViewChange_Api_v1_server, RecoveryClient, RecoveryApi_v1, RecoveryApi_v1_server).

Figure 3 provides the class diagram of the system. The class diagram shows that the system is designed to be extensible and maintainable: classes depend only on the abstract interfaces they require rather than on concrete implementing classes. For example, the KvStore singleton composes an instance of IReplicator and invokes methods through that abstract instance. This design makes it possible to change the underlying implementation with minimal breakage to the dependent modules. For example, instead of binding a VsReplicator instance to the IReplicator in KvStore, we could later implement a new replicator, say one based on Raft [7], derive it from IReplicator, and bind that to the KvStore; with minimal change, we could completely switch to a new replication implementation (a sketch of this interface-based composition is given below). Similarly, VsReplicator composes only an IOpLogHandler instance, so it is easy to switch to some other log handler instead of InMemOpLogHandler; a similar case holds for IOpLogHandler and IChkptHandler. VsReplicator also composes a ViewChangeHandler and a RecoveryHandler instance, and ReplicaState represents the replicator state that is shared between VsReplicator, ViewChangeHandler and RecoveryHandler.

The Configurator is the entry point of the system. It is a singleton that parses the input configuration and maintains the configuration information. It performs the bootstrap process: it starts the KvStore singleton and registers a VsReplicator instance with it; the replicator is registered with an instance of InMemOpLogHandler, and the logger is registered with an instance of SqliteChkptHandler. The kvs::client class composes the XDRPP-generated KvStoreApi_v1 class and provides the wrappers around the client APIs used to manage the KV store. Similarly, VsReplicationClient wraps ReplicationApi_v1, ViewChangeClient wraps ViewChangeApi_v1 and RecoveryClient wraps RecoveryApi_v1. The corresponding backend server classes are KvStoreApi_v1Server, VsReplicationApi_v1Server, ViewChangeApi_v1Server and RecoveryApi_v1Server; upon receiving RPC requests, they call the corresponding implementing functions in the other backend modules. All the *Api_v1 and *Api_v1Server classes are auto-generated by XDRPP from the IDLs.
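As an illustration of the interface-based composition just described, the following hedged C++ sketch shows a KvStore that depends only on an abstract IReplicator, so a VsReplicator or a hypothetical Raft-based replicator can be bound to it interchangeably; the signatures are simplified assumptions.

    // Sketch of depending only on the abstract IReplicator (simplified).
    #include <memory>
    #include <string>

    struct Operation { std::string type, key, value; };

    class IReplicator {
    public:
        virtual bool replicate(const Operation& op) = 0;  // true once quorum-committed
        virtual ~IReplicator() = default;
    };

    class VsReplicator : public IReplicator {
    public:
        bool replicate(const Operation&) override { /* Viewstamped Replication path */ return true; }
    };

    class RaftReplicator : public IReplicator {           // hypothetical alternative
    public:
        bool replicate(const Operation&) override { /* Raft log replication path */ return true; }
    };

    class KvStore {
    public:
        explicit KvStore(std::unique_ptr<IReplicator> r) : replicator_(std::move(r)) {}
        bool set(const std::string& key, const std::string& value) {
            return replicator_->replicate(Operation{"set", key, value});
            // ... followed by the local commit into the in-memory store
        }
    private:
        std::unique_ptr<IReplicator> replicator_;
    };

    // Swapping implementations requires no change to KvStore:
    //   KvStore vsStore(std::unique_ptr<IReplicator>(new VsReplicator()));
    //   KvStore raftStore(std::unique_ptr<IReplicator>(new RaftReplicator()));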

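A minimal sketch of the bootstrap wiring performed by the Configurator, as described above, might look like this; all class and method names here are simplified assumptions rather than the actual code.

    // Illustrative Configurator-style bootstrap: parse the configuration,
    // then wire KvStore -> VsReplicator -> InMemOpLogHandler -> SqliteChkptHandler.
    #include <memory>
    #include <string>

    class IChkptHandler { public: virtual ~IChkptHandler() = default; };
    class SqliteChkptHandler : public IChkptHandler {};

    class IOpLogHandler {
    public:
        virtual void registerChkptHandler(std::shared_ptr<IChkptHandler> h) = 0;
        virtual ~IOpLogHandler() = default;
    };
    class InMemOpLogHandler : public IOpLogHandler {
    public:
        void registerChkptHandler(std::shared_ptr<IChkptHandler> h) override { chkpt_ = h; }
    private:
        std::shared_ptr<IChkptHandler> chkpt_;
    };

    class IReplicator { public: virtual ~IReplicator() = default; };
    class VsReplicator : public IReplicator {
    public:
        explicit VsReplicator(std::shared_ptr<IOpLogHandler> log) : log_(log) {}
    private:
        std::shared_ptr<IOpLogHandler> log_;
    };

    class KvStore {
    public:
        void registerReplicator(std::shared_ptr<IReplicator> r) { replicator_ = r; }
    private:
        std::shared_ptr<IReplicator> replicator_;
    };

    KvStore& bootstrap(const std::string& /*configPath*/) {
        static KvStore store;                                 // KvStore singleton
        auto chkpt = std::make_shared<SqliteChkptHandler>();  // checkpointing to disk
        auto logger = std::make_shared<InMemOpLogHandler>();  // in-memory op log
        logger->registerChkptHandler(chkpt);
        store.registerReplicator(std::make_shared<VsReplicator>(logger));
        return store;
    }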
Implementation

The system is implemented using C++11 on a Linux-based platform. The source code can be accessed at https://github.com/pkvijay/metadr. Client-to-server communication and all communication between the replicas are performed using XDRPP [6]. The IDL in kvstore.x provides the client-facing APIs; the backend IDLs are vsreplication.x, viewchange.x, recovery.x and vrtypes.x, where vrtypes.x declares the common types shared by all the backend modules. Each replica server runs two socket listener threads, one to process KV store requests and the other to process replication traffic (which includes view change and recovery). The VsReplicator runs a heartbeat thread in either primary or backup mode. In primary mode it sends COMMIT messages to the backups carrying the primary's latest commit number; as mentioned in the Viewstamped Replication paper, the commit operation is piggybacked on the heartbeats. In backup mode the heartbeat thread monitors the primary's health: if the time since the last commit exceeds a certain threshold, it invokes the view-change protocol (a sketch of this heartbeat loop is given at the end of this section). Whenever a PREPARE request is received by a backup, it is pushed onto a concurrent blocking queue; each backup runs a dedicated thread that wakes up when the queue is non-empty, processes the PREPARE request, and responds to the primary if it accepts the request. The KV store is thread-safe: multiple clients can issue simultaneous requests to the KV store and the backend takes care of synchronizing them. For complete implementation details, please refer to the source code.
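The heartbeat behaviour described above might look roughly like the following C++11 sketch; the member names, message-sending helpers and timeout value are illustrative assumptions rather than the actual code.

    // Illustrative heartbeat loop: the primary piggybacks COMMIT heartbeats,
    // while a backup triggers a view change when the primary looks dead.
    // sendCommitToBackups() and startViewChange() are hypothetical helpers.
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <thread>

    class HeartbeatThread {
    public:
        void runPrimary(std::atomic<uint64_t>& latestCommit, std::atomic<bool>& running) {
            while (running) {
                sendCommitToBackups(latestCommit.load());      // COMMIT doubles as heartbeat
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            }
        }

        void runBackup(std::atomic<int64_t>& lastCommitMs, std::atomic<bool>& running) {
            const int64_t timeoutMs = 500;                     // illustrative threshold
            while (running) {
                int64_t nowMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                    std::chrono::steady_clock::now().time_since_epoch()).count();
                if (nowMs - lastCommitMs.load() > timeoutMs) {
                    startViewChange();                         // suspect the primary died
                    return;
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            }
        }

    private:
        void sendCommitToBackups(uint64_t /*commitNumber*/) { /* RPC to each backup */ }
        void startViewChange() { /* hand off to the ViewChangeHandler */ }
    };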
Status

Source code for the system is at https://github.com/pkvijay/metadr. The replication, view-change and recovery modules are complete. We wanted to complete the checkpointing module as well, but realized that checkpointing is strongly linked to state transfer: once the checkpointing module trims the log, only the suffix of the log can be transferred during the view-change and recovery protocols. If a replica is lagging behind and needs earlier log entries, it must be able to perform a state transfer to catch up. The state-transfer and checkpointing modules therefore depend strongly on each other, and both have to be implemented in order to ensure correctness of the view-change and recovery protocols.
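To make that dependency concrete, here is a hedged sketch (the names and structure are assumptions) of how a replica that needs entries older than the trimmed prefix would have to fall back to state transfer:

    // Illustrative interplay between a trimmed log and state transfer: a replica
    // that needs entries at or below the checkpoint cannot be caught up from the
    // log suffix alone and must receive the checkpointed state first.
    #include <cstdint>
    #include <map>
    #include <string>

    struct LogEntry { uint64_t opNumber; std::string request; };

    struct ReplicaLog {
        uint64_t checkpointOpNumber = 0;          // everything <= this was trimmed
        std::map<uint64_t, LogEntry> suffix;      // remaining (untrimmed) entries
    };

    enum class CatchUpPlan { LogSuffixOnly, StateTransferThenSuffix };

    // Decide how to catch up a replica whose last applied op is 'lastApplied'.
    CatchUpPlan planCatchUp(const ReplicaLog& log, uint64_t lastApplied) {
        if (lastApplied >= log.checkpointOpNumber)
            return CatchUpPlan::LogSuffixOnly;        // suffix covers what it misses
        return CatchUpPlan::StateTransferThenSuffix;  // needs the checkpointed state first
    }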

Future Work

Even after the completion of this course, we hope to continue working on this system; there is still a lot of work to be done to make it a more practical system. The main bottleneck of Viewstamped Replication is that the protocol involves transferring the entire log, which is costly for a long-running system. To address this, the techniques mentioned in section 5 of the Viewstamped Replication paper [1] need to be implemented, the most important being the application state-transfer mechanism using a Merkle tree. The reconfiguration protocol described in section 7 of the paper has to be implemented to allow dynamic configuration changes without bringing down the whole system, and clean shutdown mechanisms are needed to facilitate bringing down server instances for maintenance reasons. The optimization techniques in section 6 of the paper can be implemented as well; some of them are simple, for example allowing reads at the backups at the cost of consistency. The KV store is maintained as a simple hash table in which both key and value are strings, so the scale of the metadata that can be stored is limited by this in-memory data structure. Techniques like prefix compression can be used to improve the KV store's memory utilization and scale, and the keys/nodes can also be decorated with additional properties to provide a data model similar to the one in ZooKeeper.

References

[1] Viewstamped Replication Revisited. http://www.scs.stanford.edu/14au-cs244b/sched/readings/vr-revisited.pdf
[2] ZooKeeper. http://www.scs.stanford.edu/14au-cs244b/sched/readings/zookeeper.pdf
[3] The Google File System. http://www.scs.stanford.edu/14au-cs244b/sched/readings/gfs.pdf
[4] RPC: Remote Procedure Call Protocol Specification Version 2. https://tools.ietf.org/html/rfc5531
[5] XDR: External Data Representation Standard. http://tools.ietf.org/html/rfc4506
[6] XDRPP. http://www.scs.stanford.edu/14au-cs244b/labs/xdrpp-doc/
[7] Raft. http://www.scs.stanford.edu/14au-cs244b/sched/readings/raft.pdf