The Chubby Lock Service for Distributed Systems

Xiuwen Yang
School of Informatics and Computing, Indiana University Bloomington
3209 East 10th Street

ABSTRACT
In this paper, we describe the design, use, and design errors of the Chubby lock service.

KEYWORDS
Paxos, consensus, distributed, Chubby, fault-tolerant, file system, locks, KeepAlives, session, protocol, master, lease, Berkeley.

1. INTRODUCTION
Chubby is a locking mechanism for loosely-coupled distributed systems. It is designed for coarse-grained locking and also provides a limited but reliable fault-tolerant distributed file system. Developers do not need to master a complex consensus protocol; instead, they use the Chubby API to lock shared resources much as they would in sequential programming. Locks are advisory rather than mandatory, which offers more flexibility. Clients cache data in order to reduce the load on the servers, and events are delivered to clients to report changes in the status of the data they request. Google uses Chubby to coordinate access to shared resources, normally a small amount of metadata, within several of its systems, such as the Google File System (GFS), BigTable, and MapReduce. Although Chubby offers a locking service for synchronization, Google also uses it as a name server that replaces DNS. This survey paper emphasizes the design, use, and performance of Chubby.

2. DESIGN
2.1 Paxos Algorithm
The Paxos algorithm solves the consensus problem that arises when implementing a fault-tolerant distributed system. Given a collection of processes that can propose values, a consensus algorithm ensures that only a single one of the proposed values is chosen, and once a value is chosen, all the other processes eventually learn it.

2.2 Chubby as a Lock Service
One might ask: if Paxos, as a consensus protocol, can solve the distributed consensus problem, why not build directly on a Paxos implementation rather than on a lock service? First, most developers do not consider consensus at the start of a project; only as the service grows and serves more clients do they realize that consensus matters for a distributed service, and only then do they take the problem seriously. With a lock service, such consensus problems can be solved by adding a few statements to the existing system, without changing its program structure or communication patterns. Second, Chubby serves as both a lock service and a file system, which is useful in practice: a system rarely just elects a master, it must also broadcast the master's address, and it may store further data for other purposes. Chubby can therefore both synchronize access to shared resources and store metadata or configuration. Third, few programmers know consensus protocols, but nearly all have used locks in programming, or at least heard of them; a lock-based interface is far more familiar, and so yields a reliable mechanism for distributed decision making. Last, distributed-consensus algorithms use quorums to make decisions, so they need several replicas to achieve high availability; Chubby, by providing these replicas as a shared service, reduces the number of servers a reliable client system itself must run, so that even a single client can obtain a lock and make progress safely.
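As a sketch of this argument, the following minimal C++ program mocks a Chubby-style lock service in a single process and uses it to elect a primary that advertises its own address through a file. The class and method names (LockService, TryAcquire, SetContents, GetContents) and the node path are illustrative, not Chubby's actual client API.

```cpp
#include <iostream>
#include <map>
#include <mutex>
#include <optional>
#include <string>

// In-process stand-in for a Chubby cell: a lock table plus file contents.
class LockService {
  std::mutex mu_;
  std::map<std::string, std::string> owner_;     // lock node -> holder
  std::map<std::string, std::string> contents_;  // file node -> data
 public:
  // Advisory, exclusive lock: succeeds only if nobody holds the node yet.
  bool TryAcquire(const std::string& node, const std::string& who) {
    std::lock_guard<std::mutex> g(mu_);
    return owner_.emplace(node, who).second;
  }
  void SetContents(const std::string& node, const std::string& data) {
    std::lock_guard<std::mutex> g(mu_);
    contents_[node] = data;
  }
  std::optional<std::string> GetContents(const std::string& node) {
    std::lock_guard<std::mutex> g(mu_);
    auto it = contents_.find(node);
    if (it == contents_.end()) return std::nullopt;
    return it->second;
  }
};

// Every candidate tries the same lock node; the winner writes its own
// address into the file so other processes can discover the primary.
void RunCandidate(LockService& chubby, const std::string& addr) {
  const std::string node = "ls/foo/myservice/leader";  // hypothetical path
  if (chubby.TryAcquire(node, addr)) {
    chubby.SetContents(node, addr);
    std::cout << addr << ": I am the primary\n";
  } else if (auto primary = chubby.GetContents(node)) {
    std::cout << addr << ": primary is " << *primary << "\n";
  }
}

int main() {
  LockService chubby;
  RunCandidate(chubby, "10.0.0.1:7000");  // wins the lock
  RunCandidate(chubby, "10.0.0.2:7000");  // loses, reads the winner
}
```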
Based on the aspects listed above, we can summarize the design goals of the Chubby lock service:
- Coarse-grained locking: built on a loosely-coupled, reliable distributed system, Chubby offers coarse-grained locking to its clients.
- High availability and reliability: ensure a highly available and reliable locking service, together with basic usability, throughput, and storage capacity.
- Direct storage of service information: store service configuration and related information in files, instead of building and maintaining another service for the purpose.
- High scalability: store data in RAM and support simultaneous access to one file by a large number of clients.
- Event mechanism: clients need to learn of service changes promptly, so the server sends update notifications to clients through events.
- Caching: cache file data and node meta-data to avoid frequent access to the server, thus reducing traffic.

2.3 System Structure
The Chubby system has two main components (see Figure 1). One is the Chubby cell, which usually consists of five servers; each server is called a replica, and having five of them reduces correlated failure. The other is the client side, which links against the Chubby library that serves as the user interface to the system. The two parts communicate using RPC.

A Chubby cell uses a distributed consensus algorithm to elect a master: the replicas agree that no other master will be elected for the next few seconds, and the candidate must be approved by a majority of the servers in the cell. After this period of grace, known as the master lease, a master is again elected by the consensus algorithm through voting.

Figure 1: Chubby system structure

Every replica in a Chubby cell maintains a copy of the database, but only the master may perform read and write operations on it; the other replicas simply update their own databases by copying updates from the master. Clients find the master by sending master-location requests to the replicas listed in DNS. Once a client has located the master, it sends read or write requests to it until the master lease ends or the client stops sending requests. A read request can be satisfied by the master alone, with no need for any other replica to access its database. A write request, however, cannot be satisfied by the master alone: the master notifies the other replicas to update their own databases, keeping the data consistent in case the master fails. If the master fails, the other replicas elect a new master through the consensus algorithm. If a replica fails, the replacement system selects a new machine from a machine pool, replaces the IP address of the failed replica in DNS with that of the new machine, and updates the cell member table maintained by each replica.

2.4 File System
Chubby employs a file system similar to, but much simpler than, UNIX's, with directory names separated by slashes. For example, ls/foo/wombat/pouch is a typical Chubby file name: ls stands for lock service and is common to all Chubby cells, foo is the name of a Chubby cell, and /wombat/pouch is a name interpreted only within the Chubby cell that actually stores the file. Unlike an actual UNIX file system, Chubby offers no operation to move a file from one directory to another, and it does not record the last access time, which makes caching file meta-data easier. The Chubby file system contains only files and directories; each file or directory is represented as a node, and, as in other file systems, each node has exactly one name, but there are no symbolic or hard links. Berkeley DB is used to store the data of each node as a key-value mapping, in which the key is the path of the node and the value is the node's contents. Because Berkeley DB requires consecutive memory for data storage, data is stored directly in the database rather than in the form of pointers.
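The key-value layout just described can be pictured with an ordinary map standing in for Berkeley DB; the paths and contents below are illustrative only.

```cpp
#include <iostream>
#include <map>
#include <string>

int main() {
  // key = full node name, value = node contents (the data itself is
  // stored as the value, not a pointer, mirroring the point above).
  std::map<std::string, std::string> store;
  store["ls/foo/wombat/pouch"] = "contents of the pouch file";
  store["ls/foo/wombat"]       = "";  // a directory node

  // "ls" is the lock-service prefix common to all cells, "foo" names
  // the cell, and the remainder is interpreted only inside that cell.
  const std::string path = "ls/foo/wombat/pouch";
  auto second_slash = path.find('/', path.find('/') + 1);
  std::cout << "cell-local name: " << path.substr(second_slash) << "\n";
  std::cout << "contents: " << store[path] << "\n";
}
```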
File operations include:
- FileSystem: create a file system, including a new Berkeley DB file, filesystem.dat.
- FileSystem(const std::string& db): create a file system, initializing it from the Berkeley DB file named db.
- CreateNewFile: create a new file in the file system; if a directory is given, the system creates the directory as well. ValidType gives the validity type of the node: temporary or permanent.
- Mkdir: create a directory in the file system, creating any missing parent directories; ValidType gives the validity type of the node.
- Open: open a named file or directory to produce a handle, analogous to a UNIX file descriptor. A handle is returned in every case.
- Close: close an open handle; never fails.
- ReadFile: read the contents of a file node.
- ReadDir: read the contents of a directory node, returning information on all its directory and file nodes.
- Write: starting from the position indicated by ptr, write size bytes of data to a file node.
- Size: get the size of the current file.
- Delete: delete a node, erasing its contents. If the node is a directory, all its children, including the file nodes inside them, are deleted as well.
- SyncToDisk: write the in-memory file system to Berkeley DB, by default filesystem.dat.
- SyncFromDisk: read from Berkeley DB into the in-memory file system.

Each node in the Chubby file system can be used as a lock. A typical lock protocol runs between clients and the lock manager: when a client wants a lock to protect a shared resource, it first sends a request to the lock manager indicating the type of lock, either shared or exclusive. The lock manager responds by either granting the lock or refusing the operation. Once the client has obtained the lock, it can read or write the resource, and the lock manager reports the status of the resource back to the client; when the client is done, it releases the lock it holds, and other clients continue to request resources from the lock manager. Locks in the Chubby file system are advisory: if one client has locked a file, another client that accesses the file without first requesting the lock is not prevented from doing so. The alternative, not used in Chubby, is a mandatory lock, under which only the lock holder may access the file; if a client that does not hold the lock tries to access it, the system returns an exception indicating that the file is inaccessible. A lock has two modes: shared mode, known as a read lock, which many readers can share to access the same file without interference; and exclusive mode, known as a write lock, under which only one writer may access the file, and while the file is being written no reader may read it.
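As an in-process analogy for the two modes, the sketch below uses std::shared_mutex: shared (reader) mode admits many holders at once, exclusive (writer) mode admits exactly one. This only illustrates the semantics; Chubby's locks are held across machines, not threads.

```cpp
#include <iostream>
#include <shared_mutex>
#include <thread>
#include <vector>

std::shared_mutex node_lock;  // stands in for the lock on one Chubby node
int file_contents = 0;

void Reader(int id) {
  std::shared_lock<std::shared_mutex> lk(node_lock);  // shared mode
  std::cout << "reader " << id << " sees " << file_contents << "\n";
}

void Writer() {
  std::unique_lock<std::shared_mutex> lk(node_lock);  // exclusive mode
  ++file_contents;  // no reader can observe a half-finished write
}

int main() {
  std::vector<std::thread> ts;
  ts.emplace_back(Writer);
  for (int i = 0; i < 3; ++i) ts.emplace_back(Reader, i);
  for (auto& t : ts) t.join();  // readers may run concurrently with
                                // one another, never with the writer
}
```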

Data consistency is complex in distributed systems, because communication is typically uncertain and processes may fail independently. To maintain an order consistent with the actual activity invoked by each client, Chubby uses virtual time and virtual synchrony to address this problem.

2.5 Events
Chubby clients may subscribe to a range of events when they create a handle; each event is delivered to its subscribers after it occurs. Events include:
- File contents modified: often used to monitor the location of a service advertised via the file.
- Child node added, removed, or modified: used to implement mirroring.
- Chubby master failed over: warns clients that other events may have been lost, so data must be rescanned.
- A handle has become invalid: this typically suggests a communication problem.
- Lock acquired: can be used to determine when a primary has been elected.
- Conflicting lock request from another client: allows the caching of locks.

2.6 Caching
Chubby caches file data and node meta-data in a consistent, write-through cache held in memory, in order to reduce traffic. The master of a Chubby cell maintains a list of what each client may be caching, and the caches are kept consistent by invalidations sent by the master. When file data or node meta-data is about to be modified, the master blocks the change and sends invalidations to every client that may be caching it. Each client, on receiving the invalidation, flushes the invalidated state and then acknowledges it; the master may proceed only after it knows that every client caching the data has invalidated its cache. The caching protocol invalidates cached data on a change rather than updating it, because a client accessing a file might otherwise receive a large number of updates, making the protocol inefficient. Besides file data and meta-data, Chubby also caches open handles: if a client asks to open a file it has opened before, only the first open results in an RPC call to the master. Locks, too, can be cached. The system assumes that if a client holds a lock sufficiently long, the same client is likely to want it again, so a client pays the cost of acquiring a lock only the first time; afterwards the lock is cached by its holder, avoiding unnecessary RPC calls. When another client requests a conflicting lock, the master notifies the current holder to release it.
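The following single-process sketch mimics the invalidation protocol just described: the master records which clients may cache each node and holds a write until every such client has flushed and acknowledged. All names are hypothetical, and real Chubby delivers invalidations over its RPC machinery rather than by direct in-process calls.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

struct Client {
  std::string name;
  std::map<std::string, std::string> cache;
  // On invalidation: flush the entry, then acknowledge.
  bool Invalidate(const std::string& node) {
    cache.erase(node);
    std::cout << name << " flushed " << node << "\n";
    return true;  // the acknowledgment
  }
};

struct Master {
  std::map<std::string, std::string> data;
  std::map<std::string, std::set<Client*>> cachers;  // node -> clients

  std::string Read(const std::string& node, Client& c) {
    cachers[node].insert(&c);  // remember who may be caching this node
    c.cache[node] = data[node];
    return data[node];
  }
  void Write(const std::string& node, const std::string& value) {
    // Block the change until every caching client has acknowledged.
    for (Client* c : cachers[node]) c->Invalidate(node);
    cachers[node].clear();
    data[node] = value;  // only now does the write proceed
  }
};

int main() {
  Master m;
  Client a{"clientA"}, b{"clientB"};
  m.data["ls/foo/conf"] = "v1";
  m.Read("ls/foo/conf", a);
  m.Read("ls/foo/conf", b);
  m.Write("ls/foo/conf", "v2");  // invalidates both caches first
  std::cout << "now: " << m.Read("ls/foo/conf", a) << "\n";
}
```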
2.7 Communication
Chubby uses the RPC mechanism provided by ICE (the Internet Communications Engine). ICE is middleware similar to CORBA, offering an object-oriented platform compatible with different programming environments; see Figure 2 for its structure.

Figure 2: ICE structure

The ICE code in the figure includes runtime support for remote procedure calls between client and server. The proxy code is generated from a Specification Language for ICE (Slice) file; proxy code on the client corresponds to adapter code on the server, the proxy being responsible for sending and the adapter for receiving. Chubby adopts the two main asynchronous programming methods of ICE: AMI and AMD.

Asynchronous Method Invocation (AMI), also known as asynchronous method calls or the asynchronous pattern, is a design pattern for asynchronously invoking potentially long-running methods of an object in multithreaded object-oriented programming. In an ordinary synchronous call, the client process blocks in the called function until it gets a response from the server; if no response arrives, the client cannot continue. With AMI, the client need not wait after issuing the RPC before continuing. Chubby uses AMI to implement lock operations and exclusive file operations as non-blocking RPCs.

Asynchronous Method Dispatch (AMD) is a communication method used when the server side must handle a large number of long-lasting client requests. Normally, when a request arrives, the server processes it immediately and then returns to the client. AMD instead allows the server to put the RPC's callback into a blocking state and dispatch it to an available thread from a thread pool after a prolonged time. AMD is mainly used for Chubby's KeepAlives, the kind of RPC for which the server should not respond immediately; doing so would greatly increase network traffic.

2.8 Sessions and KeepAlives
A session is maintained to keep the connection between the master and a client. Each session is kept alive by periodic handshakes called KeepAlives, and each session has a lease: an interval of time during which the master agrees not to terminate the session. Unless the client notifies the master otherwise, its file handles, locks, and cached data remain valid for the duration of the session (see Figure 3). A KeepAlive is a message the master answers periodically; it prolongs the lease and carries event notifications back to the client (see the events in Section 2.5). Under normal conditions, the lease is renewed continually by the repetition of KeepAlive handshakes. When the client's local lease timeout expires, that is, when it has received no KeepAlive response, the client enters the jeopardy state.

Because the client does not know whether the master will terminate the session, it invalidates its cache and, in the meantime, starts searching for a new master: by iterating over the servers in the Chubby cell other than the master, it obtains a new view of the cell. Whenever the client receives an acknowledgment, it sends a new KeepAlive to the new master, informing it that the client is in the jeopardy state, and thereby establishes a new session; the client's file handles are also sent to the master to be updated. If a grace period, 45 seconds by default, passes and the client can connect to no other server, the client assumes the session has been terminated and invalidates it. During this period the client cannot update its cache, so as to keep its data consistent.

Figure 3: Session between the client and the master

When the master's lease timeout expires, meaning that for a certain amount of time the master has received no KeepAlive from the client, the master waits a further period for the client to reconnect. If the client fails to reconnect within that period, the master considers the session invalid: it clears the file handles opened by the client, releases the locks the client acquired, and deletes all temporary files the client generated. These operations are carried out consistently across the other servers of the Chubby cell. If the master does receive a KeepAlive from the client, the session remains valid.

When a master fails, the other servers respond only to clients' requests for the cell view, ignoring other API calls while a new master is elected. Once the election succeeds, the new master proceeds as follows:
1. It selects a new client epoch number, which clients are required to present on every call. The master can then decide whether to accept a packet based on its epoch number; an old epoch number is rejected, since the packet may have been meant for the old master.
2. The new master processes only location-related requests, not session-related ones.
3. It waits for KeepAlive messages from clients carrying the jeopardy state.
4. It responds to each client's KeepAlive, establishing a new session with the client while refusing other session-related operations. The response warns the client of the master fail-over and prompts it to update its file handles and locks.
5. It waits for acknowledgments from all clients in each session, or lets the session expire. If it receives a request to update a client's handles, it refreshes them in order to distinguish them from old ones, adding the updated handles both to the handle list of the client session and to its own.
6. It responds to all the operations requested by clients.
7. After some interval, it checks for ephemeral files that have no open handles; if it finds any, it deletes those ephemeral files and releases their locks.
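The toy timeline below, assuming the default 12-second lease mentioned in Section 3 and the 45-second grace period above, shows how KeepAlive replies renew the client's local lease and how jeopardy begins once they stop; it is a simulation of the state machine, not client code.

```cpp
#include <iostream>

int main() {
  const int lease_s = 12;   // default lease length (see Section 3)
  const int grace_s = 45;   // grace period before the session is dropped
  int lease_expiry = lease_s;

  for (int now = 0; now <= 80; now += 10) {
    bool keepalive_reply = (now < 30);  // master stops answering at t=30
    if (keepalive_reply) {
      lease_expiry = now + lease_s;     // each handshake renews the lease
      std::cout << now << "s: lease renewed until " << lease_expiry << "s\n";
    } else if (now < lease_expiry) {
      std::cout << now << "s: no reply yet, lease still valid\n";
    } else if (now < lease_expiry + grace_s) {
      // Jeopardy: cache invalidated, client probes the other replicas.
      std::cout << now << "s: jeopardy, cache disabled, probing replicas\n";
    } else {
      std::cout << now << "s: session assumed terminated\n";
      break;
    }
  }
}
```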
3. SCALING
Chubby must handle a huge number of clients, each an individual process, and because a Chubby cell has just one master, clients can overwhelm it by a huge margin. Chubby uses several approaches:
- Clients tend to use a nearby cell to avoid relying on remote machines, so an arbitrary number of Chubby cells can be created, typically one per data center.
- The lease time is raised from its default of 12 seconds to about 60 seconds when the master is under heavy load. This greatly reduces the number of KeepAlive RPCs, which are the dominant type of request between client and master.
- Chubby caches file data, meta-data, and file handles to reduce the number of requests that must be sent to the server.
- Protocol-conversion servers translate the Chubby protocol into less-complex protocols such as DNS.

4. USE
Even though Chubby was designed as a locking service, it is now heavily used inside Google as a name service, supplanting DNS. Each entry in DNS has a time-to-live (TTL) field associated with it, and DNS data is dropped if it is not refreshed within that period. It is common for developers to start thousands of communicating processes from a program, producing thousands of DNS lookups, and matters get worse with larger programs and with thousands of clients running such programs concurrently. Chubby's caching, by contrast, uses explicit invalidation, so a constant rate of session KeepAlive requests can maintain an arbitrary number of cache entries at a client indefinitely. Load spikes remain a problem, even though a single Chubby cell can sustain a large number of clients; Google resolves this by grouping name entries into batches, so that a single lookup returns, and caches, the name mappings for a large number of related processes within a job.
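A minimal sketch of this batching idea: one name-service entry holds the addresses of all tasks in a job, so a single cached lookup serves the whole group. The entry name and addresses are made up for illustration.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// One entry per *job*, holding the addresses of all its tasks, instead
// of one DNS-style entry per task.
std::map<std::string, std::vector<std::string>> name_service = {
    {"jobs/sort/workers",
     {"10.0.0.1:4000", "10.0.0.2:4000", "10.0.0.3:4000"}},
};

int main() {
  // A single lookup (and a single cache entry, invalidated explicitly
  // rather than expiring on a TTL) covers every task in the job.
  const auto& workers = name_service["jobs/sort/workers"];
  for (size_t i = 0; i < workers.size(); ++i)
    std::cout << "task " << i << " -> " << workers[i] << "\n";
}
```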

5. DESIGN ERRORS
Google's infrastructure is mostly in C++, but a growing number of systems are being written in Java, and Chubby has a complex client-side protocol and a non-trivial client-side library. This creates a problem for Chubby: maintaining the library in Java would require care and expense, while a Java implementation without caching would burden the Chubby server. Even a remedy such as running copies of a protocol-conversion server that exports a simple RPC protocol corresponding closely to Chubby's client API does not avoid the cost of writing, running, and maintaining that additional server.

6. REFERENCES
[1] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, Seattle, WA, November 2006.
[2] Asynchronous Method Invocation. Distributed Programming with Ice. ZeroC, Inc. Retrieved 22 November.
[3] Leslie Lamport. Paxos Made Simple. 1 November 2001.
[4] David Mazières. Paxos Made Practical.
[5] Michi Henning, Mark Spruiell. Distributed Programming with Ice.
