Durability for Memory-Based Key-Value Stores


Durability for Memory-Based Key-Value Stores

Kiarash Rezahanjani

Dissertation for European Master in Distributed Computing Programme

Supervisor: Flavio Junqueira
Tutor: Yolanda Becerra

Jury
President: Felix Freitag (UPC)
Secretary: Jordi Guitart (UPC)
Vocal: Johan Montelius (KTH)

July 4, 2012


Acknowledgements

I would like to thank Flavio Junqueira, Vincent Leroy and Yolanda Becerra, who helped me in this work, especially when my steps faltered. Moreover, I owe my gratitude to my parents, Souri and Mohammad, who have been a constant source of love, motivation, support and strength all these years.


Hosting Institution

Yahoo! Inc. is the world's largest global online network of integrated services, with more than 500 million users worldwide. Yahoo! Inc. provides Internet services to users, advertisers, publishers, and developers worldwide. The company owns and operates online properties and services, and provides advertising offerings and access to Internet users through its distribution network of third-party entities, as well as offering marketing services to advertisers and publishers. Its social media sites include Yahoo! Groups, Yahoo! Answers, and Flickr, which let users organize into groups and share knowledge and photos. Its search products comprise Yahoo! Search, Yahoo! Local, Yahoo! Yellow Pages, and Yahoo! Maps, used to navigate the Internet and search for information. Yahoo! also provides a large number of specific communication, information and lifestyle services. In the business domain, Yahoo! HotJobs provides solutions for employers, staffing firms, and job seekers, and Yahoo! Small Business offers an integrated suite of fee-based online services, including web hosting, business mail and an e-commerce platform.

Yahoo! Research Barcelona is the research lab hosted in the Barcelona Media Innovation Center; it focuses on scalable computing, web retrieval, data mining and social media, including distributed and semantic search. This work has been done in the Scalable Computing group of Yahoo! Research Barcelona.

Barcelona, July 4, 2012
Kiarash Rezahanjani


Abstract

The emergence of multicore architectures as well as larger, less expensive RAM has made it possible to leverage the performance superiority of main memory for large databases. Increasingly, large-scale applications demanding high performance have also made RAM an appealing candidate for primary storage. However, conventional DRAM is volatile, meaning that hardware or software crashes result in the loss of data. The existing solutions to this, such as write-ahead logging and replication, result in either partial loss of data or significant performance reduction. We propose an approach to provide durability for memory databases with negligible overhead and a low probability of data loss. We exploit known techniques such as chain replication, write-ahead logging and sequential writes to disk to provide durability while maintaining the high throughput and low latency of main memory.


Contents

1 Introduction
   Motivation
   Contributions
   Structure of the Document

2 Background and Related Work
   Background
      Memory Database
      Stable Storage
      Recovery
         Checkpoint
         Message Logging
         Pessimistic vs. Optimistic Logging
      Replication
         Replication Through Atomic Broadcast
         Chain Replication
      Disk vs RAM
   Related Work
      Redis
      RAMCloud
      Bookkeeper
      HDFS
   Discussion

3 Design and Architecture
   Durability
   Target Systems
   Design Goals
   Design Decisions
   System Properties
      Fault Tolerance Model
      Availability
      Scalability
      Safety
      Consistent Replicas and Correct Recovered State
      Integrity
      Operational Constraints
   Architecture
      Abstractions
      Coordination of Distributed Processes
      Server Components
      Coordination Protocol
      Concurrency
      Stable Storage Unit (SSU)
      Load Balancing
      Failover
   API
   Implementation

4 Experimental Evaluation
   Network Latency
   Stable Storage Performance
      Impact of Log Entry Size
      Impact of Replication Factor
      Impact of Persistence on Disk
   Load Test
   Durability and Performance Comparison

5 Conclusions
   Conclusions
   Future Work

References


List of Figures

2.1 Buffered logging in RAMCloud. Based on (1)
2.2 Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2)
2.3 Pipeline during block construction. Based on (3)
3.1 System entities
3.2 Leader states
3.3 Follower states
3.4 Log server operation
3.5 Storage unit
3.6 Clustering decision based on the servers' available resources
3.7 Failover
4.1 Throughput vs. latency of our stable storage unit for different entry sizes, with a replication factor of three
4.2 Throughput vs. latency of a stable storage unit with replication factors of two and three, for a log entry size of 200 bytes
4.3 Throughput vs. latency of a stable storage unit for log entries of 200 bytes, with persistence to local disk enabled and disabled
4.4 Throughput of a stable storage unit under sustained load
4.5 Latency of a stable storage unit under sustained load
4.6 Performance comparison of the stable storage unit and a hard disk


List of Tables

4.1 RPC latency for different packet sizes within a datacenter
4.2 Latency and throughput for a single client synchronously writing to a stable storage unit


1 Introduction

1.1 Motivation

In the past decades, disk has been the primary storage medium. Magnetic disks offer reliable storage and a large capacity at a low cost. Although disk capacity has improved dramatically over the past decades, the access latency and bandwidth of disks have not shown such improvements. Disk bandwidth can be improved by aggregating the bandwidth of several disks (e.g. RAID), but high access latency remains an issue. To mitigate these shortcomings and improve the performance of disk-based approaches, a number of techniques are employed, such as adding caching layers and data striping. However, these techniques complicate large-scale application development and often become costly.

In comparison to disk, RAM (referring to DRAM) offers hundreds of times higher bandwidth and thousands of times lower latency. In today's datacenters, commodity machines with up to 32 GB of DRAM are common, and it is cost-effective to have up to 64 GB (1). This makes it possible to deploy terabytes of data entirely in a few dozen commodity machines by aggregating their RAM. The superior performance of RAM and its dropping cost have made it an attractive storage medium for applications demanding low latency and high throughput. As examples of such applications, the Google search engine keeps its entire index in RAM (4), the social network LinkedIn stores the social graph of all its members in memory, and Google Bigtable holds SSTable block indexes in memory (5). This trend can also be seen in the appearance of many in-memory databases, such as Redis (6) and Couchbase (7), that use memory as their primary storage.

Despite the advantages of RAM over disk, RAM is subject to one major issue: volatility and, consequently, non-durability. In the event of a power outage or a hardware or software failure, data stored in RAM is lost. In memory-based storage systems operating on commodity machines, providing durability while maintaining good performance is a major challenge.

The majority of existing techniques for providing durability of data, such as checkpointing and write-ahead logging, either do not guarantee persistence of the entire data or result in significant performance degradation. For example, with periodic checkpointing, committed updates made between the last checkpoint and the failure point are lost, while with write-ahead logging to disk, write latency is tied to disk access latency. This work proposes a solution that provides durability for memory databases while preserving their high performance.

1.2 Contributions

We propose an approach to provide durability for a cluster of memory databases, on a set of commodity servers, with negligible impact on database performance. We have designed and implemented a highly available stable storage system that provides low-latency, high-throughput write operations, allowing a memory database to log its state changes. This allows durable writes with low latency and recovery of the latest database state in case of failure.

Our stable storage consists of a set of storage units that collectively provide fault tolerance and load balancing. Each storage unit consists of a set of servers; each server performs asynchronous message logging to record changes of the database state. Log entries are replicated in the memory of all the servers in the storage unit through chain replication. This minimizes the possibility of data loss caused by asynchronous writes in the case of server failures and increases the availability of the logs for the purpose of recovery. Each server exploits the maximum throughput of its hard disk by writing the log entries sequentially. Our solution is tailored for a large cluster of memory-based databases that store data in the form of key-value pairs and match the characteristics of social network platforms.

The evaluation results indicate that our approach enables durable write operations with a latency of less than one millisecond while providing a good level of durability. The results also indicate that our storage solution is able to outperform conventional write-ahead logging on local disk in terms of latency. In addition to low response time, the system is designed to achieve high availability and read throughput through replication of log entries on several servers. The design also accommodates scalability by minimizing the interactions amongst the servers and utilizing local resources.

1.3 Structure of the Document

The rest of this document is organized as follows. Chapter 2 provides a brief introduction to several techniques and concepts related to this work; further in that chapter, we review four systems that have influenced the design and discuss the approach used by each of them. In Chapter 3 we present our solution to the durability problem, describing the properties of our system as well as the architecture and the implementation. Chapter 4 presents the results of the experimental evaluation and analyzes them. Finally, Chapter 5 concludes this thesis by summarizing its main points and presenting future work.


2 Background and Related Work

2.1 Background

Memory Database

In-memory or main memory database systems store the data permanently in main memory, and disk is usually used only for backup. In disk-oriented databases, data is stored on disk and may be cached in memory for faster access. Memcached (8) is an in-memory key-value store that is widely used for such a purpose. For example, Facebook uses Memcached to put data from MySQL databases into memory (9), and consistency between the Memcached and MySQL servers is managed by the application software. In both kinds of systems an object can be kept in memory or on disk, but the major difference is that in main memory databases the primary copy of an object lives in memory, while in disk-oriented databases the primary copy lives on disk.

Main memory databases have several properties that differ from disk-oriented databases; here we mention the ones most relevant to this project. The layout of data stored on disk is important: sequential and random access to data on disk show a major performance difference, while the access pattern to memory matters far less. Memory databases use data structures that allow leveraging the performance benefits of main memory; for example, T-trees are mainly used for indexing in memory databases, while B-trees are preferred for indexes of disk-based relational databases (10). Main memory databases are able to provide far faster access times and higher throughput than disk-oriented databases. The latter, however, provide stronger durability, since main memory is volatile and data residing in memory is lost in case of a process crash or power outage (11). To mitigate this issue, disk is used as a backup for memory databases; hence, after a crash the database can be recovered. We will discuss several approaches for providing durability of data and recovery of the system state.
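As an illustration of how application software can keep a cache such as Memcached consistent with a backing database, the following sketch shows the common cache-aside pattern. It is illustrative only and is not taken from the systems cited above; KeyValueCache and Database are hypothetical interfaces, not the APIs of Memcached or MySQL clients.

    // Illustrative cache-aside pattern: the cache holds copies, the database holds the primary copy.
    // KeyValueCache and Database are hypothetical stand-ins, not real client APIs.
    interface KeyValueCache {
        String get(String key);
        void set(String key, String value);
        void delete(String key);
    }

    interface Database {
        String read(String key);
        void write(String key, String value);
    }

    class CacheAsideStore {
        private final KeyValueCache cache;
        private final Database db;

        CacheAsideStore(KeyValueCache cache, Database db) {
            this.cache = cache;
            this.db = db;
        }

        String get(String key) {
            String value = cache.get(key);          // try the in-memory copy first
            if (value == null) {
                value = db.read(key);               // miss: fall back to the primary copy on disk
                if (value != null) cache.set(key, value);
            }
            return value;
        }

        void put(String key, String value) {
            db.write(key, value);                   // update the primary copy first
            cache.delete(key);                      // invalidate so the next read repopulates the cache
        }
    }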

Stable Storage

There are three storage categories (12):

1. Memory storage, which loses its data upon a process or machine failure or a power outage.

2. Disk storage, which survives power outages and process failures, except for disk-related crashes such as head crashes and bad sectors.

3. Stable storage, which survives any type of failure and provides a high degree of fault tolerance, usually achieved through replication. This storage model suits applications that require reading back the correct data after writing it, with a very small probability of data loss.

Recovery

Recovery techniques in a distributed environment can become complicated when a globally consistent state has to be recovered at several nodes and there are several writers or readers. Our approach is based on a single-writer single-reader model that simplifies recovery; hence we discuss the recovery techniques under the single-writer single-reader model.

Checkpoint

Checkpointing (taking a snapshot) is a technique in fault-tolerant distributed systems that enables backward recovery by saving the system state from time to time onto stable storage. A checkpoint is a suitable option for backup and disaster recovery, as it allows keeping different versions of the system state at different points in time. Since checkpointing produces a single file that can be compressed, the state can easily and quickly be transferred over the network to other datacenters to enhance the availability and recovery of the service. Checkpoints are ready-to-use states of a system: reconstructing the state only requires reading the snapshot, with no further processing.

The downside of this approach is that it only stores snapshots of the server state at discrete points in time, which means that a failure at any point will result in losing all

the changes made from the last snapshot up to the failure point. This characteristic makes this method undesirable if the latest state needs to be recovered. In practice, checkpointing is implemented by forking a child process (with copy-on-write semantics) to persist the state (13). This can significantly slow down a parent process serving a large dataset, or interrupt the service for hundreds of milliseconds, particularly on a machine with poor CPU performance. This can especially become an issue when the system is at its peak load.

Message Logging

It is not always possible to recover the latest state of a database using snapshots, and obtaining a more recent state requires more frequent snapshots. This yields a high cost in terms of the operations required to write the entire state to stable storage. To reduce the number of checkpoints and enable recovery of the latest state, message logging can be used. In message logging, messages are recorded onto stable storage together with an associated sequence number. The underlying idea is to use the logs stored in stable storage and a checkpointed state (as a starting point) to reconstruct the latest state by replaying the logs on top of the given checkpoint. The checkpoint is only needed to limit the number of logs, hence shortening the recovery time. Message logging requires that no orphan processes exist after recovery completes. An orphan process is a process that survived the crash but is in a state inconsistent with the recovered process (14). In Chapter 3 we discuss this property in our design.
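The recovery procedure described above amounts to loading the most recent checkpoint and then replaying the logged updates on top of it in sequence order. The following is a minimal sketch of that idea, assuming each log record carries a sequence number, a key and a value; it is not the implementation developed in this work.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical log record: sequence number assigned by the writer, plus the update itself.
    record LogRecord(long seqNo, String key, String value) {}

    class LogReplayRecovery {
        // checkpoint: state saved at checkpointSeqNo; orderedLog: records sorted by sequence number.
        static Map<String, String> recover(Map<String, String> checkpoint,
                                           long checkpointSeqNo,
                                           List<LogRecord> orderedLog) {
            Map<String, String> state = new HashMap<>(checkpoint);
            long expected = checkpointSeqNo + 1;
            for (LogRecord r : orderedLog) {
                if (r.seqNo() < expected) continue;           // already reflected in the checkpoint
                if (r.seqNo() > expected)                     // gap: a log entry is missing
                    throw new IllegalStateException("missing log entry " + expected);
                state.put(r.key(), r.value());                // redo the update
                expected++;
            }
            return state;                                     // latest recoverable state
        }
    }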

Pessimistic vs. Optimistic Logging

Message logging falls into two categories: optimistic and pessimistic logging (14). Logging takes time, and logging methods can be categorized depending on whether a process waits to ensure every event is safely logged before it can impact the rest of the system. Processes that do not wait for the logging of an event to complete are optimistic; processes that block sending a message until the previous message has been logged are pessimistic. Pessimistic logging sacrifices performance during failure-free runs for the guarantee of recovering a state consistent with that of the crashed process. In conclusion, optimistic logging is desirable from a performance point of view and is suitable for systems with a low failure rate, while pessimistic logging is suitable for systems with a high failure rate or systems where reliability is critical. Write-ahead logging (WAL) can be considered an example of a pessimistic method, in which the logs must be persisted before the changes take place. WAL is widely used in databases to implement roll-forward recovery (redo).

Replication

There are two main reasons for replication: scalability and reliability. Replication enables fault tolerance, since in the event of a crash the system can continue working using the other available replicas. Replication can also be used to improve performance and scalability: when many processes access a service provided by a single server, replication can be used to divide the load among several servers. There is a variety of replication techniques with different consistency models; in this document we explain two major replication techniques and later describe how our system benefits from replication to improve its reliability and minimize data loss.

Replication Through Atomic Broadcast

Atomic broadcast, or total order broadcast, is a well-known approach that guarantees that all messages are received reliably and in order by all participants (15). Using atomic broadcast, all updates can be delivered and processed in order; this property can be used to create a replicated data store in which all replicas have consistent states.

Chain Replication

Chain replication is a simple, straightforward replication protocol intended to support high throughput and high availability without sacrificing strong consistency guarantees. In chain replication, servers are linearly ordered to form a chain.

The first server in the chain, which is the entry point for update requests, is called the head, and the last server, which sends the replies, is called the tail. Each update request enters at the head of the chain; after being processed by the head, the state changes are forwarded along a reliable FIFO channel to the next node in the chain, and so on until they reach the tail. Queries are handled by forwarding them to the tail of the chain. This method does not tolerate network partitions, but it offers high throughput, scalability and consistency (16).

Disk vs RAM

Magnetic disk and RAM have several well-known differences. RAM access time is orders of magnitude lower than that of magnetic disk, and its throughput is orders of magnitude higher. The access time for a record on magnetic disk consists of seek time, rotational latency and transfer time. Among the three, seek time is dominant when records are not large (megabytes). The seek time of a disk is several milliseconds, and the transfer time varies depending on the bandwidth; for instance, the transfer time for 1 MB is 10 ms on a disk with a bandwidth of 100 MB/s. On the other hand, the access latency of a record in memory is a few nanoseconds and the bandwidth is several gigabytes per second (17), (18). This means RAM performs orders of magnitude better in terms of both latency and throughput. The access method and the way data is structured in RAM make little difference to RAM performance, although this is not the case for disk. Sequential writes to disk provide far better latency and throughput than random writes because they eliminate the need for constant seek operations (19). Everest (20) is an example of a system that uses sequential writes to disk in order to increase throughput.

The other difference between RAM and magnetic disk is volatility. Memory is volatile, and data is lost upon a power outage or a crash of the process referencing it. Magnetic disk is non-volatile storage, and data written to disk survives power outages and process crashes. However, writing to disk (forcing the data to disk) does not guarantee that the data is persisted immediately. Disks have a cache layer which is volatile; therefore, loss of power to the cache results in loss of the data being written to disk. One solution is to disable the cache, though this is not practical as it significantly degrades the performance of the disk and hence of the application writing

to disk. Other solutions are to use non-volatile RAM, as in NetApp filers (21), or disks with a battery-backed write cache, such as HP Smart Array RAID controllers; these provide a power source independent of the external environment that maintains the data long enough for it to be written to disk at the time of a power outage (22). However, the latter options are not considered commodity hardware.

2.2 Related Work

In this part we present some of the existing systems related to this work that have influenced our solution in one way or another. Another reason for selecting the following systems is that the collection of approaches they implement represents a comprehensive set of the common methods applied to provide durability for main memory databases. We describe:

- Redis (23), an in-memory database that uses writes to local disk as well as replication to achieve durability.
- Bookkeeper (24), which provides write-ahead logging as a reliable, fault-tolerant distributed service.
- RAMCloud (1), a new approach to datacenter storage that keeps data entirely in the DRAM of thousands of commodity servers.
- HDFS (3), a highly available distributed file system with an append-only capability for key-value pairs.

At the end we discuss the pros and cons of the approach taken by each of these systems.

Redis

Redis (23) is an in-memory key-value store that aims at providing low latency. To meet this objective, the Redis server holds the entire dataset in memory to avoid page swapping between memory and disk, and consequently the serialization/deserialization process. Redis provides a comprehensive set of options for durability of data, as follows.

1. Replication of the full state in memory. Redis applies a master-slave replication model in which all slave servers synchronize their state with the master server. The synchronization process is performed using non-blocking operations on both the master and the slaves; therefore they are able to serve clients' queries while performing synchronization. This implies an eventual consistency model, meaning that slave servers might reply to clients' queries with an old version of the data while performing synchronization. MongoDB is another example of a database system that uses a similar replication technique (25). Redis implements a single-writer, multiple-readers model: clients are able to read from any replica but are only permitted to write to one server. This model, along with eventual consistency, ensures that all the replicas will eventually be in the same state, while maintaining good performance in terms of latency and read throughput.

2. The other durability method of Redis is persisting the data to local disk using point-in-time snapshots (checkpoints) at specified intervals. In this method, the Redis server stores the entire state of the database onto the local disk every T seconds or every W write operations. Copy-on-write semantics are applied to avoid interrupting the service while persisting the data to disk.

3. Asynchronous logging is another approach taken by Redis to provide durability. Write operations are buffered in memory and flushed to disk by a background process in an append-only fashion. The time at which the data is synced depends on the sync policy specified in the configuration (flush to disk every second or on every write) (26).
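For reference, these three durability options map onto a handful of redis.conf directives. The snippet below is a minimal illustration; the threshold values and the master address are arbitrary examples.

    # Point-in-time snapshots (RDB): snapshot if at least 1 change in 900 s, or 1000 changes in 60 s
    save 900 1
    save 60 1000

    # Append-only logging (AOF)
    appendonly yes
    # fsync policy: always (on every write), everysec (once per second), or no (leave it to the OS)
    appendfsync everysec

    # Replication: make this server a slave of a master instance (hypothetical address)
    slaveof 192.168.1.10 6379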

RAMCloud

RAMCloud (1) is a large-scale storage system designed for cloud-scale, data-intensive applications requiring very low latency and high throughput. It stores a large volume of data entirely in DRAM by aggregating the main memory of hundreds or thousands of commodity machines, and it aims at providing the same level of durability as disk by using a mixture of replication and backup techniques.

RAMCloud applies a buffered logging method for durability that utilizes both memory replication and logging to disk. In RAMCloud only one copy of every object is kept in memory, and the backup is stored on the disks of several machines. The primary server updates its state upon receiving a write query and forwards the log to the backup servers; an acknowledgement is sent by each backup server once the log is stored in its memory. A write operation is returned by the primary server once all the backup servers have acknowledged it. Backup servers write the logs to disk asynchronously and then remove them from memory.

Figure 2.1: Buffered logging in RAMCloud. Based on (1).

To recover quickly and avoid disruption of the service, RAMCloud applies two optimizations. The first is truncating the logs to reduce the amount of data that must be read during recovery. This can be achieved by creating frequent checkpoints and discarding the logs up to that point, or by occasionally cleaning stale logs to reduce the size of the log file. The second optimization is to divide the DRAM of each primary server into hundreds of shards and assign each shard to one backup server. At the time of a crash, each backup server reads the logs and acts as a temporary primary server until the full state of the failed server can be reconstructed.

Bookkeeper

Bookkeeper (24) provides write-ahead logging as a reliable distributed service (D-WAL). It is designed to tolerate failures by replicating the logs in several locations. It ensures that write-ahead logs are durable and available to other servers, so that in the event of a failure other servers can take over and resume the service.

Bookkeeper implements WAL by replicating log entries across remote servers using a simple

quorum protocol. A write is successful if the entry is successfully written to all the servers in a quorum. A quorum of size f+1 is needed to tolerate the concurrent failure of f servers. Bookkeeper also allows aggregating disk bandwidth by striping logs across multiple servers. An application using the Bookkeeper service is able to choose the quorum size as well as the number of servers used for logging. When the number of selected servers is greater than the quorum size, Bookkeeper stripes the logs among the servers. Figure 2.2 shows the Bookkeeper write operation and how it takes advantage of striping.

Figure 2.2: Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2).

In Figure 2.2, a ledger corresponds to a log file of an application, a bookie is a storage server storing ledgers, and the BK client is used by an application to process requests and interact with the bookies. Assume the client selects three bookies and a quorum size of two. Bookkeeper performs striping by switching quorums and spreading the load among the three bookies. This distributes the load among the servers, and if a server crashes the service continues without interruption. A client can read different entries from different bookies; this allows a higher read throughput by aggregating the read throughput of the individual servers. Bookkeeper also writes to disk sequentially by interleaving the entries into a single file, and it stores an index of the entries to locate and read them.
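One simple way to picture this striping is a round-robin placement of entries over the ensemble: with n bookies and a write quorum of size q, entry i can be mapped to the q bookies starting at position i mod n. The sketch below only illustrates this idea and is not BookKeeper's actual placement code; the bookie names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    class StripingExample {
        // Returns the bookies that store entry `entryId`, assuming round-robin striping
        // over an ensemble of `bookies` with a write quorum of size `quorumSize`.
        static List<String> bookiesForEntry(List<String> bookies, int quorumSize, long entryId) {
            List<String> quorum = new ArrayList<>();
            int n = bookies.size();
            int start = (int) (entryId % n);
            for (int j = 0; j < quorumSize; j++) {
                quorum.add(bookies.get((start + j) % n));
            }
            return quorum;
        }

        public static void main(String[] args) {
            List<String> ensemble = List.of("bookie-1", "bookie-2", "bookie-3"); // hypothetical names
            for (long id = 0; id < 4; id++) {
                // With a quorum size of two, consecutive entries land on rotating pairs of bookies.
                System.out.println("entry " + id + " -> " + bookiesForEntry(ensemble, 2, id));
            }
        }
    }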

This maximizes disk bandwidth utilization and throughput. Bookkeeper follows a single-writer, multiple-reader model and guarantees that once a ledger is closed by a client, all readers read the same data.

HDFS

HDFS (3) is a scalable distributed file system for the reliable storage of large datasets, and it delivers the data at high bandwidth to applications. What makes HDFS interesting with regard to our work is the way it performs I/O operations and achieves high reliability and availability.

HDFS allows an application to create a new file and write to it. HDFS implements a single-writer, multiple-reader model: when a client opens a file for writing, no other client is permitted to write to the same file until it is closed. After a file is closed its content cannot be altered, although new bytes can be appended to it.

HDFS splits a file into large blocks and stores the replicas of each block on different DataNodes. The NameNode stores the namespace tree and the mapping of file blocks to DataNodes. When writing to a file, if a new block is needed, the NameNode allocates one and assigns a set of DataNodes to store its replicas; these DataNodes then form a pipeline (chain).

Figure 2.3: Pipeline during block construction. Based on (3).

Data is buffered at the client side, and when the buffer is filled the bytes are pushed

through the pipeline. This reduces the overhead of packet headers. The DataNodes are ordered in such a way as to minimize the distance from the client to the last node in the pipeline, thereby minimizing latency. The HDFSFileSink operator in the DataNodes buffers the writes, and the buffer is written to disk only when adding the next tuple would exceed the buffer size. Thus each server writes to disk asynchronously, which enables the low write latency of HDFS.

The placement of block replicas is critical for reliability, availability, and network bandwidth utilization. HDFS applies an interesting strategy to place the replicas: it trades off minimizing the write cost against maximizing reliability, availability and read bandwidth. HDFS places the first replica of each block on the same node as the writer, and the second and third replicas on two different nodes in a different rack. HDFS enforces two restrictions: no DataNode stores more than one replica of any block, and, provided there are sufficient racks in the cluster, no rack stores more than two replicas of any block. In this way, HDFS reduces the probability of correlated failures, since the failure of two nodes in the same rack is more likely than the failure of two nodes in different racks, while maximizing availability and read bandwidth (3).
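The placement rule just described can be sketched as a small selection function. This is only an illustration of the policy as stated above, not the actual HDFS block placement code; the node and rack names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class RackAwarePlacementSketch {
        // nodesByRack maps a rack id to the DataNodes it contains; writerRack/writerNode locate the client.
        static List<String> chooseReplicas(Map<String, List<String>> nodesByRack,
                                           String writerRack, String writerNode) {
            List<String> replicas = new ArrayList<>();
            replicas.add(writerNode);                          // 1st replica: the writer's own node
            for (Map.Entry<String, List<String>> e : nodesByRack.entrySet()) {
                if (e.getKey().equals(writerRack)) continue;   // pick a remote rack
                List<String> remote = e.getValue();
                if (remote.size() < 2) continue;
                replicas.add(remote.get(0));                   // 2nd replica: one node in the remote rack
                replicas.add(remote.get(1));                   // 3rd replica: a different node, same remote rack
                break;
            }
            return replicas;                                   // one replica per node, at most two per rack
        }

        public static void main(String[] args) {
            Map<String, List<String>> cluster = Map.of(
                    "rack-A", List.of("nodeA1", "nodeA2"),
                    "rack-B", List.of("nodeB1", "nodeB2"));
            System.out.println(chooseReplicas(cluster, "rack-A", "nodeA1"));
        }
    }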

2.3 Discussion

We summarize the approaches towards durability into four major categories:

- Replication of the full state into several locations.
- Periodic snapshots of the system state.
- Asynchronous logging of writes onto stable storage.
- Synchronous logging of writes onto stable storage.

The full replication approach, along with eventual consistency (e.g. Redis), ensures that all replicas will eventually be in the same state, while maintaining good performance in terms of latency and read throughput. This approach provides low latency and high read throughput that scale linearly with the number of slave servers, because all read and write queries can be served from memory without involving the disk. However, this approach suffers from one major drawback: its large memory requirement. The method becomes costly in terms of hardware and, more importantly, utility cost when we have a large cluster of servers. DRAM is volatile and requires constant electrical power, meaning the machines need to be powered at all times. For example, in today's datacenters the largest amount of DRAM that is cost-effective is 64 GB (1); in such a datacenter, storing 1 TB of data requires 16 machines, and for a replication factor of three, which is considered the norm for a good level of durability (3), we need 32 extra servers. Even though this approach offers great benefits, it is not a proper choice for a large cluster of in-memory databases, as it becomes costly.

The other drawback is the possibility of data loss. For example, in the case of Redis, the master server replies to updates before replication on the slave servers has completed (for lower latency); hence, if the master fails between replying to an update and sending it to the replicas, the data can be lost. To prevent such a risk, the update should not return until all replicas have received it, although this increases latency. This is a tradeoff that has to be made between high performance and durability. The other risk associated with this approach is that if all the servers holding the replicas fail concurrently (e.g. a datacenter power outage), the entire data will be permanently lost. To mitigate this issue, data can be replicated in multiple datacenters; however, this results in a high update latency (hundreds of milliseconds) for blocking calls, or partial loss of data for asynchronous calls.

Redis provides periodic snapshots. This is a good choice for backup and disaster recovery, as it allows keeping different versions of the system state at different points in time. Since the full state is contained in a single file, it can be compressed and transferred to other datacenters to enhance the availability and recovery of the service. A periodic snapshot stores the server state at one point in time after another; however, a failure at any point in time will result in losing all the updates from the last snapshot up to the failure point. This property makes the method undesirable when the latest state needs to be recovered. The other point to consider is that forking a child process to persist the state can significantly slow down a parent process serving a large dataset, or interrupt the service.

In comparison to snapshots, logging provides better durability, as every write operation can be written to disk. To improve performance, write operations are batched in memory before being written to disk; thus, a failure results in the loss of the buffered data. Logging performance can be improved by writing the updates to disk in an append-only fashion. This avoids the long latency of seek operations (on dedicated disks) by writing the logs sequentially. Therefore, if the sink thread is the only thread writing to disk (in an append-only fashion), it can achieve a better write throughput. Logging provides stronger durability than snapshots, but it results in a larger log file and a slower recovery process, since all the logs need to be replayed in order to rebuild the full state of the dataset. To accelerate the recovery process, the number of logs required to rebuild the state should be reduced. Two major techniques for truncating the log file are as follows. The system state can be checkpointed frequently, so that the logs before the checkpoint can be removed. The other technique, implemented by Redis, is cleaning the old logs: Redis rewrites the log file in the background to drop unneeded logs and minimize the log file size.

Asynchronous logging to disk provides better performance than synchronous logging; however, it increases the possibility of losing updates. Asynchronous logging is therefore usually used along with replication to mitigate this issue. RAMCloud takes this approach by replicating the logs through broadcast, referring to it as buffered logging, which allows writes (and reads) to proceed at the speed of RAM along with good durability and availability. Buffered logging allows a high write throughput; however, if the write throughput continues at a sustained rate higher than the disk throughput, it eventually fills the entire OS memory and the throughput drops to that of the disk. Therefore, buffered logging provides good performance as long as free memory is available. Moreover, buffered logging does not always guarantee durability, as in the case of a sudden power outage the buffered data will be lost; it is therefore suitable for applications that can afford to lose some updates. To deal with such scenarios, cross-datacenter replication can be used; however, write latency is then expected to increase significantly.

HDFS provides an append-only operation that can be used for the purpose of logging. HBase is an example of an application using this capability of HDFS for logging purposes (27). The idea is similar to RAMCloud, though the major difference is that the replication model applied in HDFS is similar to chain replication, which enables high write throughput. HDFS buffers

the bytes in memory and writes a big chunk of data to disk when the buffer is full. HDFS creates one file for each client on each machine hosting the replicas, which means that if multiple clients concurrently write to blocks located on the same machine, write performance degrades, as writing to several files on the same disk requires frequent seek operations. HDFS addresses correlated failures through a smart replication strategy that places the replicas on different machines in multiple racks.

In the case of Bookkeeper, the quorum approach consumes more resources from one of the participants, as one of them needs to perform the multicast: the client multicasts the log entries to several bookies, which consumes more bandwidth and CPU power on the client. One way to resolve this could be to outsource the replication responsibility to the server ensemble and create a more balanced replication strategy. For example, Zookeeper (28), a coordination system for distributed processes, applies a quorum-based approach on the server side for replication by implementing a totally ordered broadcast protocol called Zab (29); however, this complicates the server implementation.

Our design decisions for approaching the durability problem in memory databases are mostly influenced by the approaches described above. In the next chapter, we describe our solution in detail.

3 Design and Architecture

In this chapter, we define durability with respect to this work and describe how we approach the durability problem in memory-based key-value stores. We explain the system design and its properties, and finally how the system is built.

3.1 Durability

For the purpose of this work, durability means that if the value v corresponding to key k is updated to u at time t, then a read of key k at time t' such that t' > t must return u, provided no updates occurred between t and t'. We assume that this durability condition holds for a memory database as long as no crash has occurred. This work addresses the durability of a memory database (in our case a key-value store) such that the latest committed value of every key can be retrieved after a crash.

3.2 Target Systems

The proposed system design is tailored to provide durability for a cluster of in-memory databases storing data in the form of key-value pairs that complies with the following specifications.

1. The dataset is large, and the cluster of in-memory key-value stores consists of at least dozens of machines.

2. Write query sizes (update/insert/delete) vary from a few hundred bytes to a few KB (an example of a write query is SET K V, which sets the value of key K to V).

3. The workload is read-dominant (10-20% of the queries are writes).

4. High availability of the service is important.

The above specification is common for social networking platforms such as Facebook, Twitter and Yahoo! News Activity, which store large amounts of data in main memory and process large numbers of events. For example, in 2008 Facebook was already serving 28 terabytes of data from memory (30), and this number keeps increasing. Based on (31), in the Facebook cluster less than 6 percent of the queries are writes. In social network platforms, users' write queries are generally small (less than 4 KB) (32); for example, Twitter messages are limited to 140 characters (33).

3.3 Design Goals

In our design we aim to provide a high level of durability, such that in the event of a crash the latest state of the system is recoverable with a low probability of data loss. The objective is to achieve this goal with minimal impact on the performance of the memory database (read operations do not make any changes to the database state; thereby, only the write operations need to be durable). We need to ensure that our system is highly available, so that changes to the database state can be reliably recorded to stable storage and the records can be read at the time of recovery. The system needs to scale with an increasing number of databases and write operations. Maximizing the utilization of the local resources of the database cluster is another objective, and we try to avoid additional dependencies on external systems and to create a self-contained application. Any guarantee about the durability of a write should be provided before acknowledging the success of the write operation to the writer. Our durability mechanism should also enable a low recovery time, to enhance the availability of the database service.

3.4 Design Decisions

In Section 2.2 we described and discussed the common approaches towards durability in memory-based databases. In this section, we explain our design decisions with respect to the

target systems and the objectives.

Checkpoint vs. Logging. As checkpointing consumes a considerable amount of resources and always leaves the possibility of data loss, we choose message logging to persist the changes of the database state; the state can then be reconstructed by replaying the logs. (To reduce the recovery time and limit the number of logs, a snapshot of the system state is needed, or the unneeded logs should be truncated before recovery. To eliminate the cost of this process during operation, a background process can be assigned to reconstruct the system state and store it in stable storage when the system is not under stress. This is part of the future work.)

Pessimistic vs. Optimistic Logging. We choose pessimistic logging to ensure that changes take place only after they are durable in a stable storage system. Low latency is one of our main objectives; in order to achieve it, we create a stable storage by a mixture of in-memory replication and asynchronous logging of the changes of the database state. This allows storing log entries in several locations while providing a low response time. We name the set of servers cooperating to perform replication and logging a stable storage unit, or SSU.

Asynchronous vs. Synchronous. Asynchronous logging is the core of our design for providing a low response time; the reason for choosing it is to eliminate the latency of writing the logs to disk. However, since DRAM is volatile, this method carries the risk of losing the logs upon a crash. To address this issue, we replicate the logs in the memory of several machines before acknowledging the durability of the write. In this way, we significantly reduce the probability of data loss, as it is very unlikely that all the machines crash at the same time (3). The design targets low latency and high throughput for write operations by trading guaranteed durability for a low probability of data loss. Further in this chapter, we discuss the possibility of losing data and the reliability of this method.

Chain Replication vs. Broadcast. In order to replicate the logs, we choose chain replication for two main reasons. 1) Chain replication puts nearly the same load on the resources of each server, while in broadcast one of the participants utilizes more resources than the others; this provides an implicit load balancing. 2) Chain replication enables high-throughput logging, as the symmetric load on the servers allows utilizing the maximum resources of each server and minimizes the chance of a bottleneck appearing. We also performed an experiment to help us with the decision: we measured the latency caused by network transmission using either approach, and we discuss this experiment in Chapter 4. A sketch of the chain forwarding scheme is given below.
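The sketch below shows the forwarding logic of a single node in the chain; every node does roughly the same amount of work, which is the load-balancing argument made above. It is a simplified, single-threaded illustration with hypothetical Successor and ClientLink interfaces, not our actual implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical interfaces for the next node in the chain and for acknowledging the writer.
    interface Successor { void forward(byte[] logEntry); }
    interface ClientLink { void ack(long logId); }

    class ChainNodeSketch {
        private final Successor next;      // null if this node is the tail
        private final ClientLink client;   // used by the tail to acknowledge the writer
        private final List<byte[]> buffer = new ArrayList<>();  // in-memory replica of the log

        ChainNodeSketch(Successor next, ClientLink client) {
            this.next = next;
            this.client = client;
        }

        // Called when an entry arrives from the predecessor (or from the writer, at the head).
        void onEntry(long logId, byte[] logEntry) {
            buffer.add(logEntry);          // keep the entry in memory; it is persisted to disk
                                           // and removed asynchronously (not shown)
            if (next != null) {
                next.forward(logEntry);    // head/middle node: push the entry down the chain
            } else {
                client.ack(logId);         // tail: the entry is now on every node, acknowledge the writer
            }
        }
    }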

Local Disk vs. Remote File System. Logs can be persisted either on the local disks of the servers or on an existing reliable remote file system (e.g. NFS, HDFS). We choose the local disk of each server to maximize the utilization of local resources, reduce dependencies and avoid using network bandwidth for persistence. As the logs are replicated in the memory of several machines and all the machines persist the logs onto their disks, we will have replicas of the logs on several hard disks. This enhances the availability of the logs at the time of recovery and accelerates the recovery process by reading different partitions of the logs from different servers (hence aggregating the disk bandwidth of the replicas).

Faster Recovery vs. Higher Write Performance. During peak load, where the system is under sustained intensive load, if the write throughput to the storage is higher than the write throughput to disk, the servers' buffers eventually become saturated and performance degrades significantly. It is therefore important to fully utilize the disk bandwidth and minimize the write latency, in order to prevent saturation of the buffers as much as possible. We write the logs in an append-only fashion by sequentially writing them to a single file on disk, eliminating seek time and maximizing disk throughput utilization; a sketch of this on-disk layout is given at the end of this section. This requires interleaving the logs from all the writers into a single file. As opposed to having one file per writer, this method (sequential writes to a single file) makes the recovery process slower, since to recover the logs belonging to one writer we need to read all the logs in the file sequentially. Recovery needs to be done only at the time of a crash, which happens rarely, while logging needs to be performed constantly. Thus, we choose faster logging over faster recovery, although the read performance could be improved by indexing the log entries (Bookkeeper (24) implements this method).

Transport Layer Protocol. We choose TCP/IP for communication, as we want to deliver the messages in order and reliably, so as to provide a consistent view of the logs (and the stored files) among all the servers in a chain.
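The on-disk layout referred to above can be as simple as a length-prefixed frame per entry appended to a single file. The following is a minimal sketch under that assumption (client id, log id, payload length, payload); it is not the exact format used by our implementation.

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    class AppendOnlyLogFile implements AutoCloseable {
        private final DataOutputStream out;

        AppendOnlyLogFile(String path) throws IOException {
            // Open in append mode; entries from all writers are interleaved into this single file.
            out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path, true)));
        }

        // Appends one framed log entry: [clientId][logId][length][payload].
        synchronized void append(long clientId, long logId, byte[] payload) throws IOException {
            out.writeLong(clientId);
            out.writeLong(logId);
            out.writeInt(payload.length);
            out.write(payload);
        }

        void sync() throws IOException { out.flush(); }  // hand buffered bytes to the OS

        @Override
        public void close() throws IOException { out.close(); }
    }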

3.5 System Properties

Our stable storage system consists of a set of stable storage units (SSUs). Every stable storage unit consists of several servers, each persisting log entries onto its local disk. A writer process writes to only one of the stable storage units. A storage unit follows a fail-stop model, and upon an SSU crash its clients write to another storage unit. The system environment allows the detection of failures through a membership management service provided by an external system.

Our solution follows a single-writer, single-reader model. The log entries of a database application are written to the stable storage by only one process, and the process that writes the logs is the same process (same identifier) that reads them back from the storage. The read operation only needs to be performed at the time of recovery; therefore, read and write operations on the same data item are never performed simultaneously. A reader can read the logs from more than one server within the storage unit, as all the servers store an identical set of data (the acknowledged log entries). A process writes to a different storage unit if its storage unit fails or if the storage unit decides to disconnect the process.

Fault Tolerance Model

The system needs to be fault tolerant to continue its service in the presence of failures. We achieve fault tolerance through replication. In our system, the persistence of an acknowledged log entry is guaranteed under f simultaneous server failures if we have f+1 servers in the replication chain; however, to guarantee stable storage of a log entry we require f+2 servers to tolerate f simultaneous failures. We implement a fail-stop model: a server halts in response to a failure, and we assume that a server crash can be detected by all the other servers in the storage unit. In the event of a server crash, the storage unit stops serving all its writers and only persists the remaining logs in its servers' buffers onto disk (the writers connect to another storage unit to continue logging). Once all the logs are persisted to disk, all the servers restart and become available to form a new storage unit.

An alternative option for dealing with failures is to repair the storage unit. However, for the following reasons we prefer to re-create a storage unit and avoid repairing it: repairing a storage unit requires addressing many failure scenarios, which complicates the implementation; in addition, the possibility of corner cases that have not been taken into account, as well as the possibility of additional failures during the repair, further complicates matters.

Availability

The system allows the creation of many stable storage units, and each storage unit can provide a different replication factor. Larger replication factors (the number of servers in the replication chain within a storage unit) provide three advantages:

- higher availability of the stored entries, since all the servers within the storage unit host a replica of the logs;
- a lower probability of data loss when correlated failures occur, as it is more likely that at least one server holding the buffered data survives the failure and persists it to disk;
- higher read bandwidth, by aggregating the bandwidth of the servers hosting the replicas.

Therefore, a higher replication factor for a storage unit enhances data availability and read throughput, as well as providing stronger durability. However, in the case of a catastrophic failure of all the servers in a storage unit, data can be lost. A larger number of storage units enhances the availability of write operations, since writers can continue logging after a storage unit crash.

Scalability

In our storage system, every storage unit is independent of every other unit, and there are no shared resources or coordination amongst them. The independent nature of the storage units allows adding new units without impacting the service performance. The load is divided by assigning different sets of writers to different units. To create a storage unit, a set of servers with the closest resource usage is selected, in order to prevent any one server from becoming a bottleneck in the chain. This allows maximum resource utilization within a storage unit.

Safety

Consistent Replicas and Correct Recovered State

We rely on the TCP protocol to transfer messages between the nodes in order, reliably and without duplication. The servers in a chain are connected by a single TCP channel, and messages are forwarded and persisted in the same order in which they have been received. This ensures that

all the servers in a chain view and store the logs in the same order in which the messages were sent by the writer (the writer also writes through a single TCP channel). In our system it is not possible to recover an incorrect or stale database state without the knowledge of the recovery processor (the reader). Every writer is represented by a unique id (client id), and every log entry is uniquely identified by the combination of the client id and a log id. The log id increases by one for every new log entry. During the recovery of a database state, the logs are read and replayed in order, and in the case of a missing (or duplicate) log entry, the recovery processor is able to detect it.

Integrity

During recovery we need to ensure that the object being read is not corrupted. This requires adding a checksum to every object stored in the storage to enable verification of the data being read (this feature is not part of the implementation); a sketch of such a check is given at the end of this section.

Operational Constraints

The availability of the Zookeeper quorum is essential for the availability and operation of the system, since we rely on Zookeeper for failure detection and for accessing metadata about the nodes. The availability of write operations depends on the availability of a storage unit, and the availability of read operations requires at least one server that stores the requested logs. In order to operate continuously, at least two storage units should be available, so that service can quickly resume upon a storage unit crash. For example, in the current implementation we require six servers to have two storage units with a replication factor of three. This requirement could be reduced by fixing storage units upon a crash, replacing the failed server from a pool of available servers; however, this complicates the implementation.
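Both checks described above, detecting missing or duplicate log ids and verifying a per-entry checksum, can be applied in a single pass over a log file at recovery time. The sketch below assumes the framing from the earlier append-only sketch, extended with a CRC32 of the payload; it is illustrative only and, as noted above, the checksum is not part of the current implementation.

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.CRC32;

    class LogScanSketch {
        // Scans a log file framed as [clientId][logId][length][crc][payload] and validates each entry.
        static void scan(String path) throws IOException {
            Map<Long, Long> lastLogId = new HashMap<>();   // highest log id seen so far, per client
            try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
                while (true) {
                    long clientId;
                    try { clientId = in.readLong(); } catch (EOFException eof) { break; }
                    long logId = in.readLong();
                    int length = in.readInt();
                    long crc = in.readLong();
                    byte[] payload = new byte[length];
                    in.readFully(payload);

                    CRC32 check = new CRC32();             // integrity: recompute and compare the checksum
                    check.update(payload);
                    if (check.getValue() != crc)
                        throw new IOException("corrupted entry " + logId + " of client " + clientId);

                    long previous = lastLogId.getOrDefault(clientId, 0L);
                    if (logId <= previous)                 // safety: duplicate or out-of-order entry
                        System.err.println("duplicate entry " + logId + " of client " + clientId);
                    else if (logId != previous + 1)        // safety: gap in the per-client sequence
                        System.err.println("missing entries of client " + clientId
                                + " between " + previous + " and " + logId);
                    lastLogId.put(clientId, Math.max(previous, logId));
                }
            }
        }
    }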

3.6 Architecture

The main idea is to create several storage units, each capable of storing and retrieving a stream of logs reliably. Each storage unit consists of a number of coordinated servers that perform chain replication and asynchronously persist the logs onto their local disks to provide a low response time. Each log entry is acknowledged only after it has been replicated on all the servers within the storage unit; hence we ensure that the logs persist in the event of the failure of some of the servers. The number of servers in each storage unit is equal to the replication factor it provides. In this section, we describe the architecture of the system (with respect to the write operation).

Abstractions

Our system consists of three types of processes. Figure 3.1 illustrates the processes, and below we describe their functions.

Log server processes (servers) form a storage unit and asynchronously store the log entries on local disks (in an append-only fashion). They also read and stream the requested logs from the local disk upon the request of a client process at the time of recovery. In Figure 3.1, the head, the tail and the middle nodes are log server processes.

A Stable Storage Unit, or storage unit (SSU), provides stable storage of log entries. It consists of a number of machines hosting two types of processes: a log server process (each on a different machine) to replicate and store logs, and a state builder process. The number of machines is equal to the replication factor provided by the stable storage unit.

A client process (writer/reader) processes requests (writes) from an application and creates log entries. It streams the entries to an appropriate storage unit and responds to the application. The client process also reads the logs and reconstructs the database state at the time of failure (the read operation is future work).

State builder processes are background processes that read the logs from the local disk to compute the latest value of each key. Once the values are computed, they are stored on disk and the old logs are removed. The purpose of this process is to reduce the recovery time by eagerly preparing the latest state of the key-value store. This process takes place whenever the system is not under stress (part of future work).

Figure 3.1: System entities.

Coordination of Distributed Processes

Zookeeper is a coordination system for distributed processes (28). We use Zookeeper for membership management and for storing metadata about the server processes, storage units, and client processes. Data in Zookeeper is organized as a tree structure, and each node is called a znode. There are two types of znodes: ephemeral and permanent. An ephemeral znode exists as long as the session of the Zookeeper client that created it is alive; ephemeral znodes can therefore be used to detect the failure of a process. A permanent znode stores its data permanently and ensures it remains available. We use these metadata and the Zookeeper membership service to coordinate server processes for creating storage units and to detect failures. Client processes also use the Zookeeper service to locate storage units and detect their failures. Below we describe the metadata and the types of nodes used in our system.

Metadata

Log server znode (ephemeral):
- IP/port for the coordination protocol
- IP/port for streaming
- Rack: the rack where the server is located
- Status: accept or reject storage join requests
- Load status: updated resource utilization status

Storage unit znode (ephemeral):

- Replication factor
- Status: accept/reject new clients
- List of log servers
- Load status: load of the log server with the highest resource utilization

File map znode (permanent):
- Mapping of logs to servers

Client znode (ephemeral):
- Only used for failure detection

Global view znode (permanent):
- List of servers and their roles (leader/follower), used to form stable storage units

Server Components

A log server process creates an ephemeral znode in Zookeeper upon its start and constantly updates its status data at this node. The process follows a protocol that allows it to cooperate with other processes to form a storage unit. We first describe this protocol, and then explain how an individual log server operates within a storage unit.

Coordination Protocol

This protocol is used to form a replication chain (storage unit) and operates very similarly to two-phase commit (12). The protocol defines two server roles: leader and follower. The leader is responsible for contacting the followers and manages the creation of a storage unit; followers act as passive processes and only respond to the leader. Figures 3.2 and 3.3 describe the state transitions of the leader and the follower.

If a server process is not part of a storage unit, it sets its state to the listening state. In the listening state, a process frequently checks the global view data. If the process is listed as a leader, it reads the list of its followers' addresses. It then sends the followers a join-request message and sets a failure detector (a watch on their ephemeral znodes) to detect their failures.
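The registration and failure-detection steps above rely on standard Zookeeper primitives (ephemeral znodes and watches). The following is a minimal sketch using the Zookeeper Java client; the paths, the address and the serialized status are placeholders, and the sketch assumes the parent path /logservers already exists.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class LogServerRegistration {
        static ZooKeeper connectAndRegister(String serverId, byte[] status) throws Exception {
            // Session-level watcher; ephemeral znodes disappear when this session dies.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 10000, (WatchedEvent e) -> { });

            // Register the log server: the znode vanishes automatically if the process crashes.
            zk.create("/logservers/" + serverId, status,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return zk;
        }

        // A leader watching one of its followers: a NodeDeleted event signals the follower's failure.
        static void watchFollower(ZooKeeper zk, String followerId) throws Exception {
            Watcher onChange = (WatchedEvent e) -> {
                if (e.getType() == Watcher.Event.EventType.NodeDeleted) {
                    System.err.println("follower " + followerId + " failed, aborting chain formation");
                }
            };
            zk.exists("/logservers/" + followerId, onChange);
        }
    }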

Figure 3.2: Leader states.

Figure 3.3: Follower states.
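The figures themselves are not reproduced here. As a rough approximation of the leader-side state machine they depict, and which the protocol description in this subsection spells out, the enum below names the phases a leader moves through; the identifiers are illustrative assumptions, not taken from the implementation.

```java
/** Illustrative leader-side phases of the chain-formation protocol (names are assumptions). */
enum LeaderState {
    LISTENING,            // not part of a storage unit; polling the global view znode
    AWAITING_JOIN_ACKS,   // join-request sent; waiting for every follower to accept
    AWAITING_CONNECTIONS, // connection-request sent; waiting for connect-completion signals
    SERVING               // storage unit znode created; start signal sent to the followers
    // Any rejection or follower failure triggers an abort message and a return to LISTENING.
}
```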

Followers may accept or reject a join request depending on their available resources. If a follower fails or rejects the request, the leader triggers the abort process and all processes return to their initial state. To abort, the leader sends an abort message to all the followers. Upon receiving an abort message, each follower (and the leader itself) clears all its data structures and returns to the initial state (Figure 3.3). Each follower sets a failure detector for the leader before accepting the join request, so that if the leader fails it can detect the failure and resume its previous state. If all the followers accept the join request, the leader sends a connection-request message carrying an ordered list of servers (including the leader). Each server connects to the previous and next servers in the list as its predecessor and successor in the chain. Once a server is connected and ready to stream data, it sends a connect-completion signal to the leader. If a server fails to connect or crashes, the leader aborts the process. Otherwise, a complete chain of servers is ready: the leader creates a znode for the new storage unit and sends a start signal, along with the znode path of the storage unit, to all the followers to start the service.

Concurrency

Each server process consists of three main threads operating concurrently in a producer-consumer model. Figure 3.4 shows how the threads in a single server operate and interact through shared data structures. The three data structures shared among the threads are:

- DataBuffer stores the log entries in memory.
- SenderQueue keeps the ordered indices of the log entries that should be either sent to the next server (on the head or a middle server) or acknowledged to the client (on the tail server).
- PersistQueue holds the ordered indices of the log entries that should be written to disk.

The receiver thread reads entries from the TCP buffer and inserts them into the DataBuffer; it also inserts the index of each entry into the SenderQueue. The DataBuffer has a pre-specified size (in number of entries), and if it is full, the receiver thread must wait until an entry is removed from the DataBuffer.
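The sketch below illustrates one way this three-thread pipeline could be wired together, covering the receiver thread above as well as the sender and persister threads described next. The class name, the buffer capacity, and the placeholder I/O methods are assumptions made for the example rather than the actual implementation.

```java
import java.util.Map;
import java.util.concurrent.*;

/** Sketch of the three-thread log server pipeline (names and types are illustrative). */
public class LogServerPipeline {

    private static final int CAPACITY = 1024;                    // assumed DataBuffer size (entries)

    private final Map<Long, byte[]> dataBuffer = new ConcurrentHashMap<>();
    private final Semaphore freeSlots = new Semaphore(CAPACITY); // blocks the receiver when full
    private final BlockingQueue<Long> senderQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Long> persistQueue = new LinkedBlockingQueue<>();
    private final boolean isTail;

    public LogServerPipeline(boolean isTail) { this.isTail = isTail; }

    /** Receiver thread: pull entries off the network and hand them to the other threads. */
    void receiverLoop() throws InterruptedException {
        long index = 0;
        while (true) {
            byte[] entry = readFromNetwork();             // placeholder for the TCP receive
            freeSlots.acquire();                          // wait if the DataBuffer is full
            dataBuffer.put(index, entry);
            senderQueue.put(index);
            index++;
        }
    }

    /** Sender thread: forward to the successor, or acknowledge the client if this is the tail. */
    void senderLoop() throws InterruptedException {
        while (true) {
            long index = senderQueue.take();
            byte[] entry = dataBuffer.get(index);
            if (isTail) {
                ackClient(entry);                         // entry is replicated on the whole chain
            } else {
                sendToSuccessor(entry);                   // next server in the chain
            }
            persistQueue.put(index);                      // now hand the entry to the persister
        }
    }

    /** Persister thread: append entries to the local log file, then free their buffer slot. */
    void persisterLoop() throws InterruptedException {
        while (true) {
            long index = persistQueue.take();
            appendToDisk(dataBuffer.get(index));          // single writer, append-only file
            dataBuffer.remove(index);
            freeSlots.release();
        }
    }

    // Placeholders standing in for the real network and disk code.
    private byte[] readFromNetwork() { return new byte[0]; }
    private void ackClient(byte[] entry) {}
    private void sendToSuccessor(byte[] entry) {}
    private void appendToDisk(byte[] entry) {}
}
```

Keeping persistence on its own thread keeps the disk write off the acknowledgment path, which is what allows an entry to be acknowledged as soon as it is buffered in memory on every server of the chain.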

Figure 3.4: Log server operation.

The sender thread waits until an index exists in the SenderQueue. It reads the index from the SenderQueue to find and read the corresponding entry in the DataBuffer. If the server is the tail server for that entry, it sends an acknowledgment to the corresponding client indicating that the entry has been replicated on all the servers. If the server is not the tail, it simply sends the entry to the next server in the chain (its successor). Once the message has been sent to the next hop, the sender thread puts the index of the entry into the PersistQueue.

The persister thread waits until an index exists in the PersistQueue. It reads the corresponding entry from the DataBuffer and persists it to disk in an append-only fashion. This is the only thread persisting entries, and entries from different clients are interleaved into a single file. Once an entry has been written to disk, this thread removes it from the DataBuffer.

Stable Storage Unit (SSU)

A stable storage unit consists of a set of servers forming a replication chain, and it ensures the replication and availability of delivered entries. One of the servers acts as the leader and holds the lease on the znode of the storage unit. The storage unit is considered failed when one or more of its servers crash. Upon such a crash, the storage unit stops its service and only persists the remaining entries from memory to disk. When the leader crashes, the znode is removed automatically; when another server fails, the leader removes the znode. Either way, all clients are notified of the storage failure.

In a storage unit, every server can act as head, tail or middle server. Clients can connect to any server in the chain: the server acting as entry point is the head of the chain for that client, and the last server in the chain (which sends the acknowledgment) is the tail. Figure 3.5 shows how several clients can stream to the storage unit.

Figure 3.5: Storage unit.

Load Balancing

We make load-balancing decisions at three points. First, we ensure that the set of servers selected to create a storage unit (i.e., to perform chain replication) have nearly the same load. This minimizes the chance of a bottleneck appearing in the chain and maximizes the resource utilization of each server. Figure 3.6 shows how servers are clustered to form a storage unit. Second, one of the servers within each storage unit constantly updates the available resources and status of the unit in its ZooKeeper znode. This enables clients to select the storage unit with the lowest load by reading this data from the ZooKeeper servers. Third, within a storage unit every server can act as head, tail or middle server. The tail consumes less bandwidth, since it only sends acknowledgments to the client, while the head and middle servers must also transfer each entry to the next server in the chain. Hence, if all clients chose the same server as the head, the tail server would consume about half the bandwidth of the other servers. To mitigate this imbalance, the clients of one storage unit connect to different servers. In the current implementation, clients randomly choose one server in the storage unit as the head.
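As a concrete illustration of the second and third decision points, the sketch below shows how a client might pick the least-loaded storage unit from ZooKeeper and then choose a random head, as in the current implementation. The znode path, payload layout and class name are assumptions made for the example only.

```java
import org.apache.zookeeper.ZooKeeper;
import java.util.*;

/** Sketch of client-side selection of a storage unit and a head server (illustrative paths/format). */
public class StorageUnitSelector {

    private final ZooKeeper zk;
    private final Random random = new Random();

    public StorageUnitSelector(ZooKeeper zk) { this.zk = zk; }

    /** Returns the server list of the storage unit advertising the lowest load. */
    public List<String> pickLeastLoadedUnit() throws Exception {
        double bestLoad = Double.MAX_VALUE;
        List<String> bestServers = null;
        for (String unit : zk.getChildren("/storageunits", false)) {
            // Assumed payload: "<load>|<server1>,<server2>,..." written by the unit's leader.
            String[] parts = new String(zk.getData("/storageunits/" + unit, false, null)).split("\\|");
            double load = Double.parseDouble(parts[0]);   // load of the most utilized server in the unit
            if (load < bestLoad) {
                bestLoad = load;
                bestServers = Arrays.asList(parts[1].split(","));
            }
        }
        if (bestServers == null) throw new IllegalStateException("no storage unit available");
        return bestServers;
    }

    /** Current implementation: choose the head uniformly at random among the unit's servers. */
    public String pickHead(List<String> servers) {
        return servers.get(random.nextInt(servers.size()));
    }
}
```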

However, this can be improved by connecting clients to the servers in a round-robin fashion, so that every server serves a nearly equal number of clients as the head (and, consequently, as the tail). Figure 3.5 also shows how load is distributed through the clients' selection of different head servers.

Figure 3.6: Clustering decision based on the servers' available resources.

The decision to cluster servers into a storage unit is based on their available resources. One of the log server processes is in charge of compiling the servers' data and making the clustering decisions. It reads the servers' data from ZooKeeper and sorts the available servers by their free resources. Using this information, servers with a similar amount of available resources are grouped, and the server with the largest amount of free resources in each group is chosen as the leader. This process is performed frequently and its output is written to the Global View znode in ZooKeeper. Each process frequently reads this data to determine its group and its role. If the process responsible for updating the Global View znode crashes, another process takes over the job. Our current implementation does not provide dynamic load balancing; this is part of future work.

Failover

In the event of a storage failure (the failure of any of its servers), the clients of that storage unit are able to detect the failure and find another storage unit by querying ZooKeeper. A client then connects to another storage unit (if one is available) and continues writing its logs. An alternative that shortens the service disruption is to let the client hold connections to two storage units; upon the crash of the one it is writing to, it immediately resumes operation by switching to the other.
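A hedged sketch of this client-side failover logic follows, reusing the selector from the load-balancing sketch above. The watch-based detection and the reconnect placeholder are illustrative assumptions; the standby-connection variant mentioned above is noted in a comment.

```java
import org.apache.zookeeper.*;
import java.util.List;

/** Sketch of client-side failover on storage-unit failure (paths and callbacks are illustrative). */
public class FailoverSketch {

    private final ZooKeeper zk;
    private final StorageUnitSelector selector;   // the selector sketched in the previous section

    public FailoverSketch(ZooKeeper zk, StorageUnitSelector selector) {
        this.zk = zk;
        this.selector = selector;
    }

    /** Watch the znode of the storage unit currently in use; its deletion signals failure. */
    public void watchCurrentUnit(String unitPath) throws Exception {
        zk.exists(unitPath, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                failover();
            }
        });
    }

    /** On failure, locate another storage unit and resume writing there. */
    private void failover() {
        try {
            List<String> servers = selector.pickLeastLoadedUnit();
            String newHead = selector.pickHead(servers);
            reconnect(newHead);                    // placeholder: open a stream to the new head
            // A lower-disruption variant keeps a standby connection to a second storage unit
            // open in advance and simply switches to it here.
        } catch (Exception e) {
            // In practice: retry with backoff until a storage unit becomes available.
        }
    }

    private void reconnect(String headServer) { /* placeholder for the streaming connection */ }
}
```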
