Durability for Memory-Based Key-Value Stores


Durability for Memory-Based Key-Value Stores

Kiarash Rezahanjani

Dissertation for European Master in Distributed Computing Programme

Supervisor: Flavio Junqueira
Tutor: Yolanda Becerra

Jury
President: Felix Freitag (UPC)
Secretary: Jordi Guitart (UPC)
Vocal: Johan Montelius (KTH)

July 4, 2012


Acknowledgements

I would like to thank Flavio Junqueira, Vincent Leroy and Yolanda Becerra, who helped me in this work, especially when my steps faltered. Moreover, I owe my gratitude to my parents, Souri and Mohammad, who have been a constant source of love, motivation, support and strength all these years.


Hosting Institution

Yahoo! Inc. is the world's largest global online network of integrated services, with more than 500 million users worldwide. Yahoo! Inc. provides Internet services to users, advertisers, publishers, and developers worldwide. The company owns and operates online properties and services, and provides advertising offerings and access to Internet users through its distribution network of third-party entities, as well as offering marketing services to advertisers and publishers. Its social media sites include Yahoo! Groups, Yahoo! Answers, and Flickr, which let users organize into groups and share knowledge and photos. Its search products comprise Yahoo! Search, Yahoo! Local, Yahoo! Yellow Pages, and Yahoo! Maps, used to navigate the Internet and search for information. Yahoo! also provides a large number of specific communication, information and lifestyle services. In the business domain, Yahoo! HotJobs provides solutions for employers, staffing firms, and job seekers, and Yahoo! Small Business offers an integrated suite of fee-based online services, including web hosting, business mail and an e-commerce platform.

Yahoo! Research Barcelona is the research lab hosted in the Barcelona Media Innovation Center; it focuses on scalable computing, web retrieval, data mining and social media, including distributed and semantic search. This work has been done in the Scalable Computing group of Yahoo! Research Barcelona.

Barcelona, July 4, 2012
Kiarash Rezahanjani


Abstract

The emergence of multicore architectures as well as larger, less expensive RAM has made it possible to leverage the performance superiority of main memory for large databases. Increasingly, large-scale applications demanding high performance have also made RAM an appealing candidate for primary storage. However, conventional DRAM is volatile, meaning that hardware or software crashes result in the loss of data. The existing solutions to this, such as write-ahead logging and replication, result in either partial loss of data or significant performance reduction. We propose an approach to provide durability for memory databases with negligible overhead and a low probability of data loss. We exploit known techniques such as chain replication, write-ahead logging and sequential writes to disk to provide durability while maintaining the high throughput and low latency of main memory.


Contents

1 Introduction
   Motivation
   Contributions
   Structure of the Document

2 Background and Related Work
   Background
      Memory Database
      Stable Storage
      Recovery
         Checkpoint
         Message Logging
         Pessimistic vs. Optimistic Logging
      Replication
         Replication Through Atomic Broadcast
         Chain Replication
      Disk vs RAM
   Related Work
      Redis
      RAMCloud
      Bookkeeper
      HDFS
   Discussion

3 Design and Architecture
   Durability
   Target Systems
   Design Goals
   Design Decisions
   System Properties
      Fault Tolerance Model
      Availability
      Scalability
      Safety
      Consistent Replicas and Correct Recovered State
      Integrity
      Operational Constraints
   Architecture
      Abstractions
      Coordination of Distributed Processes
      Server Components
      Coordination Protocol
      Concurrency
      Stable Storage Unit (SSU)
      Load Balancing
      Failover
   API
   Implementation

4 Experimental Evaluation
   Network Latency
   Stable Storage Performance
      Impact of Log Entry Size
      Impact of Replication Factor
      Impact of Persistence on Disk
   Load Test
   Durability and Performance Comparison

5 Conclusions
   Conclusions
   Future Work

References


List of Figures

2.1 Buffered logging in RAMCloud. Based on (1)
2.2 Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2)
2.3 Pipeline during block construction. Based on (3)
3.1 System entities
3.2 Leader states
3.3 Follower states
3.4 Log server operation
3.5 Storage unit
3.6 Clustering decision based on the servers' available resources
3.7 Failover
4.1 Throughput vs. latency of our stable storage unit for different entry sizes, with a replication factor of three
4.2 Throughput vs. latency of a stable storage unit with replication factors of two and three, for a log entry size of 200 bytes
4.3 Throughput vs. latency of a stable storage unit for log entries of 200 bytes, with persistence to local disk enabled and disabled
4.4 Throughput of a stable storage unit under sustained load
4.5 Latency of a stable storage unit under sustained load
4.6 Performance comparison of the stable storage unit and a hard disk


List of Tables

4.1 RPC latency for different packet sizes within a datacenter
4.2 Latency and throughput for a single client synchronously writing to a stable storage unit


1 Introduction

1.1 Motivation

In the past decades, disk has been the primary storage medium. Magnetic disks offer reliable storage and a large capacity at a low cost. Although disk capacity has improved dramatically over the past decades, the access latency and bandwidth of disks have not shown such improvements. Disk bandwidth can be improved by aggregating the bandwidth of several disks (e.g. RAID), but high access latency remains an issue. To mitigate these shortcomings and improve the performance of disk-based approaches, a number of techniques are employed, such as adding caching layers and data striping. However, these techniques complicate large-scale application development and often become costly.

In comparison to disk, RAM (referring to DRAM) offers hundreds of times higher bandwidth and thousands of times lower latency. In today's datacenters, commodity machines with up to 32 GB of DRAM are common, and it is cost-effective to have up to 64 GB (1). This makes it possible to deploy terabytes of data entirely in a few dozen commodity machines by aggregating their RAM. The superior performance of RAM and its dropping cost have made it an attractive storage medium for applications demanding low latency and high throughput. As examples of such applications, the Google search engine keeps its entire index in RAM (4), the social network LinkedIn stores the social graph of all its members in memory, and Google Bigtable holds SSTable block indexes in memory (5). This trend can also be seen in the appearance of many in-memory databases, such as Redis (6) and Couchbase (7), that use memory as their primary storage.

Despite the advantages of RAM over disk, RAM is subject to one major issue: volatility and, consequently, non-durability. In the event of a power outage or a hardware or software failure, data stored in RAM is lost. In memory-based storage systems operating on commodity machines, providing durability while maintaining good performance is a major challenge.

The majority of existing techniques for providing durability of data, such as checkpointing and write-ahead logging, either do not guarantee persistence of the entire data or result in significant performance degradation. For example, with periodic checkpointing, committed updates made between the last checkpoint and the failure point are lost, while with write-ahead logging to disk, write latency is tied to disk access latency. This work proposes a solution that provides durability for memory databases while preserving their high performance.

1.2 Contributions

We propose an approach to provide durability for a cluster of memory databases, on a set of commodity servers, with negligible impact on database performance. We have designed and implemented a highly available stable storage system that provides low-latency, high-throughput write operations, allowing a memory database to log its state changes. This allows durable writes with low latency and recovery of the latest database state in case of failure.

Our stable storage consists of a set of storage units that collectively provide fault tolerance and load balancing. Each storage unit consists of a set of servers; each server performs asynchronous message logging to record changes of the database state. Log entries are replicated in the memory of all the servers in the storage unit through chain replication. This minimizes the possibility of data loss caused by asynchronous writes in the case of server failures and increases the availability of the logs for the purpose of recovery. Each server exploits the maximum throughput of its hard disk by writing the log entries sequentially. Our solution is tailored for a large cluster of memory-based databases that store data in the form of key-value pairs and match the characteristics of social network platforms.

The evaluation results indicate that our approach enables durable write operations with a latency of less than one millisecond while providing a good level of durability. The results also indicate that our storage solution is able to outperform conventional write-ahead logging on local disk in terms of latency. In addition to low response time, the system is designed to achieve high availability and read throughput through replication of log entries on several servers. The design also accommodates scalability by minimizing the interactions amongst the servers and utilizing local resources.

1.3 Structure of the Document

The rest of this document is organized as follows. Chapter 2 provides a brief introduction to several techniques and concepts related to this work; further in that chapter, we review four systems that have influenced the design and discuss the approach used by each of them. In Chapter 3 we present our solution to the durability problem, describing the properties of our system as well as the architecture and the implementation. Chapter 4 presents the results of the experimental evaluation and analyzes them. Finally, Chapter 5 concludes this thesis by summarizing its main points and presenting future work.


2 Background and Related Work

2.1 Background

Memory Database

In-memory or main memory database systems store the data permanently in main memory, and disk is usually used only for backup. In disk-oriented databases, data is stored on disk and may be cached in memory for faster access. Memcached (8) is an in-memory key-value store that is widely used for such a purpose. For example, Facebook uses Memcached to put data from MySQL databases into memory (9), and consistency between the Memcached and MySQL servers is managed by the application software. In both kinds of systems an object can be kept in memory or on disk, but the major difference is that in main memory databases the primary copy of an object lives in memory, while in disk-oriented databases the primary copy lives on disk.

Main memory databases have several properties that differ from disk-oriented databases; here we mention the ones most relevant to this project. The layout of data stored on disk is important: sequential and random access to data on disk show a major performance difference, while the access pattern to memory matters far less. Memory databases use data structures that allow leveraging the performance benefits of main memory; for example, T-trees are mainly used for indexing in memory databases, while B-trees are preferred for indexes of disk-based relational databases (10). Main memory databases are able to provide far faster access times and higher throughput than disk-oriented databases. The latter, however, provide stronger durability, since main memory is volatile and data residing in memory is lost in case of a process crash or power outage (11). To mitigate this issue, disk is used as a backup for memory databases; hence, after a crash the database can be recovered. We will discuss several approaches for providing durability of data and recovery of the system state.
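As an illustration of how application software can keep a cache such as Memcached consistent with a backing database, the following sketch shows the common cache-aside pattern. It is illustrative only and is not taken from the systems cited above; KeyValueCache and Database are hypothetical interfaces, not the APIs of Memcached or MySQL clients.

    // Illustrative cache-aside pattern: the cache holds copies, the database holds the primary copy.
    // KeyValueCache and Database are hypothetical stand-ins, not real client APIs.
    interface KeyValueCache {
        String get(String key);
        void set(String key, String value);
        void delete(String key);
    }

    interface Database {
        String read(String key);
        void write(String key, String value);
    }

    class CacheAsideStore {
        private final KeyValueCache cache;
        private final Database db;

        CacheAsideStore(KeyValueCache cache, Database db) {
            this.cache = cache;
            this.db = db;
        }

        String get(String key) {
            String value = cache.get(key);          // try the in-memory copy first
            if (value == null) {
                value = db.read(key);               // miss: fall back to the primary copy on disk
                if (value != null) cache.set(key, value);
            }
            return value;
        }

        void put(String key, String value) {
            db.write(key, value);                   // update the primary copy first
            cache.delete(key);                      // invalidate so the next read repopulates the cache
        }
    }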

Stable Storage

There are three storage categories (12):

1. Memory storage, which loses its data upon a process or machine failure or a power outage.

2. Disk storage, which survives power outages and process failures, except for disk-related crashes such as head crashes and bad sectors.

3. Stable storage, which survives any type of failure and provides a high degree of fault tolerance, usually achieved through replication. This storage model suits applications that require reading back the correct data after writing it, with a very small probability of data loss.

Recovery

Recovery techniques in a distributed environment can become complicated when a globally consistent state has to be recovered at several nodes and there are several writers or readers. Our approach is based on a single-writer single-reader model that simplifies recovery; hence we discuss the recovery techniques under the single-writer single-reader model.

Checkpoint

Checkpointing (taking a snapshot) is a technique in fault-tolerant distributed systems that enables backward recovery by saving the system state from time to time onto stable storage. A checkpoint is a suitable option for backup and disaster recovery, as it allows keeping different versions of the system state at different points in time. Since checkpointing produces a single file that can be compressed, the state can easily and quickly be transferred over the network to other datacenters to enhance the availability and recovery of the service. Checkpoints are ready-to-use states of a system: reconstructing the state only requires reading the snapshot, with no further processing.

The downside of this approach is that it only stores snapshots of the server state at discrete points in time, which means that a failure at any point will result in losing all

the changes made from the last snapshot up to the failure point. This characteristic makes this method undesirable if the latest state needs to be recovered. In practice, checkpointing is implemented by forking a child process (with copy-on-write semantics) to persist the state (13). This can significantly slow down a parent process serving a large dataset, or interrupt the service for hundreds of milliseconds, particularly on a machine with poor CPU performance. This can especially become an issue when the system is at its peak load.

Message Logging

It is not always possible to recover the latest state of a database using snapshots, and obtaining a more recent state requires more frequent snapshots. This yields a high cost in terms of the operations required to write the entire state to stable storage. To reduce the number of checkpoints and enable recovery of the latest state, message logging can be used. In message logging, messages are recorded onto stable storage together with an associated sequence number. The underlying idea is to use the logs stored in stable storage and a checkpointed state (as a starting point) to reconstruct the latest state by replaying the logs on top of the given checkpoint. The checkpoint is only needed to limit the number of logs, hence shortening the recovery time. Message logging requires that no orphan processes exist after recovery completes. An orphan process is a process that survived the crash but is in a state inconsistent with the recovered process (14). In Chapter 3 we discuss this property in our design.
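The recovery procedure described above amounts to loading the most recent checkpoint and then replaying the logged updates on top of it in sequence order. The following is a minimal sketch of that idea, assuming each log record carries a sequence number, a key and a value; it is not the implementation developed in this work.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical log record: sequence number assigned by the writer, plus the update itself.
    record LogRecord(long seqNo, String key, String value) {}

    class LogReplayRecovery {
        // checkpoint: state saved at checkpointSeqNo; orderedLog: records sorted by sequence number.
        static Map<String, String> recover(Map<String, String> checkpoint,
                                           long checkpointSeqNo,
                                           List<LogRecord> orderedLog) {
            Map<String, String> state = new HashMap<>(checkpoint);
            long expected = checkpointSeqNo + 1;
            for (LogRecord r : orderedLog) {
                if (r.seqNo() < expected) continue;           // already reflected in the checkpoint
                if (r.seqNo() > expected)                     // gap: a log entry is missing
                    throw new IllegalStateException("missing log entry " + expected);
                state.put(r.key(), r.value());                // redo the update
                expected++;
            }
            return state;                                     // latest recoverable state
        }
    }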

Pessimistic vs. Optimistic Logging

Message logging falls into two categories: optimistic and pessimistic logging (14). Logging takes time, and logging methods can be categorized depending on whether a process waits to ensure every event is safely logged before it can impact the rest of the system. Processes that do not wait for the logging of an event to complete are optimistic; processes that block sending a message until the previous message has been logged are pessimistic. Pessimistic logging sacrifices performance during failure-free runs for the guarantee of recovering a state consistent with that of the crashed process. In conclusion, optimistic logging is desirable from a performance point of view and is suitable for systems with a low failure rate, while pessimistic logging is suitable for systems with a high failure rate or systems where reliability is critical. Write-ahead logging (WAL) can be considered an example of a pessimistic method, in which the logs must be persisted before the changes take place. WAL is widely used in databases to implement roll-forward recovery (redo).

Replication

There are two main reasons for replication: scalability and reliability. Replication enables fault tolerance, since in the event of a crash the system can continue working using the other available replicas. Replication can also be used to improve performance and scalability: when many processes access a service provided by a single server, replication can be used to divide the load among several servers. There is a variety of replication techniques with different consistency models; in this document we explain two major replication techniques and later describe how our system benefits from replication to improve its reliability and minimize data loss.

Replication Through Atomic Broadcast

Atomic broadcast, or total order broadcast, is a well-known approach that guarantees that all messages are received reliably and in order by all participants (15). Using atomic broadcast, all updates can be delivered and processed in order; this property can be used to create a replicated data store in which all replicas have consistent states.

Chain Replication

Chain replication is a simple, straightforward replication protocol intended to support high throughput and high availability without sacrificing strong consistency guarantees. In chain replication, servers are linearly ordered to form a chain.

The first server in the chain, which is the entry point for update requests, is called the head, and the last server, which sends the replies, is called the tail. Each update request enters at the head of the chain; after being processed by the head, the state changes are forwarded along a reliable FIFO channel to the next node in the chain, and so on until they reach the tail. Queries are handled by forwarding them to the tail of the chain. This method does not tolerate network partitions, but it offers high throughput, scalability and consistency (16).

Disk vs RAM

Magnetic disk and RAM have several well-known differences. RAM access time is orders of magnitude lower than that of magnetic disk, and its throughput is orders of magnitude higher. The access time for a record on magnetic disk consists of seek time, rotational latency and transfer time. Among the three, seek time is dominant when records are not large (megabytes). The seek time of a disk is several milliseconds, and the transfer time varies depending on the bandwidth; for instance, the transfer time for 1 MB is 10 ms on a disk with a bandwidth of 100 MB/s. On the other hand, the access latency of a record in memory is a few nanoseconds and the bandwidth is several gigabytes per second (17), (18). This means RAM performs orders of magnitude better in terms of both latency and throughput. The access method and the way data is structured in RAM make little difference to RAM performance, although this is not the case for disk. Sequential writes to disk provide far better latency and throughput than random writes because they eliminate the need for constant seek operations (19). Everest (20) is an example of a system that uses sequential writes to disk in order to increase throughput.

The other difference between RAM and magnetic disk is volatility. Memory is volatile, and data is lost upon a power outage or a crash of the process referencing it. Magnetic disk is non-volatile storage, and data written to disk survives power outages and process crashes. However, writing to disk (forcing the data to disk) does not guarantee that the data is persisted immediately. Disks have a cache layer which is volatile; therefore, loss of power to the cache results in loss of the data being written to disk. One solution is to disable the cache, though this is not practical as it significantly degrades the performance of the disk and hence of the application writing

to disk. Other solutions are to use non-volatile RAM, as in NetApp filers (21), or disks with a battery-backed write cache, such as HP Smart Array RAID controllers; these provide a power source independent of the external environment that maintains the data long enough for it to be written to disk at the time of a power outage (22). However, the latter options are not considered commodity hardware.

2.2 Related Work

In this part we present some of the existing systems related to this work that have influenced our solution in one way or another. Another reason for selecting the following systems is that the collection of approaches they implement represents a comprehensive set of the common methods applied to provide durability for main memory databases. We describe:

- Redis (23), an in-memory database that uses writes to local disk as well as replication to achieve durability.
- Bookkeeper (24), which provides write-ahead logging as a reliable, fault-tolerant distributed service.
- RAMCloud (1), a new approach to datacenter storage that keeps data entirely in the DRAM of thousands of commodity servers.
- HDFS (3), a highly available distributed file system with an append-only capability for key-value pairs.

At the end we discuss the pros and cons of the approach taken by each of these systems.

Redis

Redis (23) is an in-memory key-value store that aims at providing low latency. To meet this objective, the Redis server holds the entire dataset in memory to avoid page swapping between memory and disk, and consequently the serialization/deserialization process. Redis provides a comprehensive set of options for durability of data, as follows.

1. Replication of the full state in memory. Redis applies a master-slave replication model in which all slave servers synchronize their state with the master server. The synchronization process is performed using non-blocking operations on both the master and the slaves; therefore they are able to serve clients' queries while performing synchronization. This implies an eventual consistency model, meaning that slave servers might reply to clients' queries with an old version of the data while performing synchronization. MongoDB is another example of a database system that uses a similar replication technique (25). Redis implements a single-writer, multiple-readers model: clients are able to read from any replica but are only permitted to write to one server. This model, along with eventual consistency, ensures that all the replicas will eventually be in the same state, while maintaining good performance in terms of latency and read throughput.

2. The other durability method of Redis is persisting the data to local disk using point-in-time snapshots (checkpoints) at specified intervals. In this method, the Redis server stores the entire state of the database onto the local disk every T seconds or every W write operations. Copy-on-write semantics are applied to avoid interrupting the service while persisting the data to disk.

3. Asynchronous logging is another approach taken by Redis to provide durability. Write operations are buffered in memory and flushed to disk by a background process in an append-only fashion. The time at which the data is synced depends on the sync policy specified in the configuration (flush to disk every second or on every write) (26).
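For reference, these three durability options map onto a handful of redis.conf directives. The snippet below is a minimal illustration; the threshold values and the master address are arbitrary examples.

    # Point-in-time snapshots (RDB): snapshot if at least 1 change in 900 s, or 1000 changes in 60 s
    save 900 1
    save 60 1000

    # Append-only logging (AOF)
    appendonly yes
    # fsync policy: always (on every write), everysec (once per second), or no (leave it to the OS)
    appendfsync everysec

    # Replication: make this server a slave of a master instance (hypothetical address)
    slaveof 192.168.1.10 6379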

RAMCloud

RAMCloud (1) is a large-scale storage system designed for cloud-scale, data-intensive applications requiring very low latency and high throughput. It stores a large volume of data entirely in DRAM by aggregating the main memory of hundreds or thousands of commodity machines, and it aims at providing the same level of durability as disk by using a mixture of replication and backup techniques.

RAMCloud applies a buffered logging method for durability that utilizes both memory replication and logging to disk. In RAMCloud only one copy of every object is kept in memory, and the backup is stored on the disks of several machines. The primary server updates its state upon receiving a write query and forwards the log to the backup servers; an acknowledgement is sent by each backup server once the log is stored in its memory. A write operation is returned by the primary server once all the backup servers have acknowledged it. Backup servers write the logs to disk asynchronously and then remove them from memory.

Figure 2.1: Buffered logging in RAMCloud. Based on (1).

To recover quickly and avoid disruption of the service, RAMCloud applies two optimizations. The first is truncating the logs to reduce the amount of data that must be read during recovery. This can be achieved by creating frequent checkpoints and discarding the logs up to that point, or by occasionally cleaning stale logs to reduce the size of the log file. The second optimization is to divide the DRAM of each primary server into hundreds of shards and assign each shard to one backup server. At the time of a crash, each backup server reads the logs and acts as a temporary primary server until the full state of the failed server can be reconstructed.

Bookkeeper

Bookkeeper (24) provides write-ahead logging as a reliable distributed service (D-WAL). It is designed to tolerate failures by replicating the logs in several locations. It ensures that write-ahead logs are durable and available to other servers, so that in the event of a failure other servers can take over and resume the service.

Bookkeeper implements WAL by replicating log entries across remote servers using a simple

quorum protocol. A write is successful if the entry is successfully written to all the servers in a quorum. A quorum of size f+1 is needed to tolerate the concurrent failure of f servers. Bookkeeper also allows aggregating disk bandwidth by striping logs across multiple servers. An application using the Bookkeeper service is able to choose the quorum size as well as the number of servers used for logging. When the number of selected servers is greater than the quorum size, Bookkeeper stripes the logs among the servers. Figure 2.2 shows the Bookkeeper write operation and how it takes advantage of striping.

Figure 2.2: Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2).

In Figure 2.2, a ledger corresponds to a log file of an application, a bookie is a storage server storing ledgers, and the BK client is used by an application to process requests and interact with the bookies. Assume the client selects three bookies and a quorum size of two. Bookkeeper performs striping by switching quorums and spreading the load among the three bookies. This distributes the load among the servers, and if a server crashes the service continues without interruption. A client can read different entries from different bookies; this allows a higher read throughput by aggregating the read throughput of the individual servers. Bookkeeper also writes to disk sequentially by interleaving the entries into a single file, and it stores an index of the entries to locate and read them.
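One simple way to picture this striping is a round-robin placement of entries over the ensemble: with n bookies and a write quorum of size q, entry i can be mapped to the q bookies starting at position i mod n. The sketch below only illustrates this idea and is not BookKeeper's actual placement code; the bookie names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    class StripingExample {
        // Returns the bookies that store entry `entryId`, assuming round-robin striping
        // over an ensemble of `bookies` with a write quorum of size `quorumSize`.
        static List<String> bookiesForEntry(List<String> bookies, int quorumSize, long entryId) {
            List<String> quorum = new ArrayList<>();
            int n = bookies.size();
            int start = (int) (entryId % n);
            for (int j = 0; j < quorumSize; j++) {
                quorum.add(bookies.get((start + j) % n));
            }
            return quorum;
        }

        public static void main(String[] args) {
            List<String> ensemble = List.of("bookie-1", "bookie-2", "bookie-3"); // hypothetical names
            for (long id = 0; id < 4; id++) {
                // With a quorum size of two, consecutive entries land on rotating pairs of bookies.
                System.out.println("entry " + id + " -> " + bookiesForEntry(ensemble, 2, id));
            }
        }
    }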

This maximizes disk bandwidth utilization and throughput. Bookkeeper follows a single-writer, multiple-reader model and guarantees that once a ledger is closed by a client, all readers read the same data.

HDFS

HDFS (3) is a scalable distributed file system for the reliable storage of large datasets, and it delivers the data at high bandwidth to applications. What makes HDFS interesting with regard to our work is the way it performs I/O operations and achieves high reliability and availability.

HDFS allows an application to create a new file and write to it. HDFS implements a single-writer, multiple-reader model: when a client opens a file for writing, no other client is permitted to write to the same file until it is closed. After a file is closed its content cannot be altered, although new bytes can be appended to it.

HDFS splits a file into large blocks and stores the replicas of each block on different DataNodes. The NameNode stores the namespace tree and the mapping of file blocks to DataNodes. When writing to a file, if a new block is needed, the NameNode allocates one and assigns a set of DataNodes to store its replicas; these DataNodes then form a pipeline (chain).

Figure 2.3: Pipeline during block construction. Based on (3).

Data is buffered at the client side, and when the buffer is filled the bytes are pushed

through the pipeline. This reduces the overhead of packet headers. The DataNodes are ordered in such a way as to minimize the distance from the client to the last node in the pipeline, thereby minimizing latency. The HDFSFileSink operator in the DataNodes buffers the writes, and the buffer is written to disk only when adding the next tuple would exceed the buffer size. Thus each server writes to disk asynchronously, which enables the low write latency of HDFS.

The placement of block replicas is critical for reliability, availability, and network bandwidth utilization. HDFS applies an interesting strategy to place the replicas: it trades off minimizing the write cost against maximizing reliability, availability and read bandwidth. HDFS places the first replica of each block on the same node as the writer, and the second and third replicas on two different nodes in a different rack. HDFS enforces two restrictions: no DataNode stores more than one replica of any block, and, provided there are sufficient racks in the cluster, no rack stores more than two replicas of any block. In this way, HDFS reduces the probability of correlated failures, since the failure of two nodes in the same rack is more likely than the failure of two nodes in different racks, while maximizing availability and read bandwidth (3).
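The placement rule just described can be sketched as a small selection function. This is only an illustration of the policy as stated above, not the actual HDFS block placement code; the node and rack names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class RackAwarePlacementSketch {
        // nodesByRack maps a rack id to the DataNodes it contains; writerRack/writerNode locate the client.
        static List<String> chooseReplicas(Map<String, List<String>> nodesByRack,
                                           String writerRack, String writerNode) {
            List<String> replicas = new ArrayList<>();
            replicas.add(writerNode);                          // 1st replica: the writer's own node
            for (Map.Entry<String, List<String>> e : nodesByRack.entrySet()) {
                if (e.getKey().equals(writerRack)) continue;   // pick a remote rack
                List<String> remote = e.getValue();
                if (remote.size() < 2) continue;
                replicas.add(remote.get(0));                   // 2nd replica: one node in the remote rack
                replicas.add(remote.get(1));                   // 3rd replica: a different node, same remote rack
                break;
            }
            return replicas;                                   // one replica per node, at most two per rack
        }

        public static void main(String[] args) {
            Map<String, List<String>> cluster = Map.of(
                    "rack-A", List.of("nodeA1", "nodeA2"),
                    "rack-B", List.of("nodeB1", "nodeB2"));
            System.out.println(chooseReplicas(cluster, "rack-A", "nodeA1"));
        }
    }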

2.3 Discussion

We summarize the approaches towards durability into four major categories:

- Replication of the full state into several locations.
- Periodic snapshots of the system state.
- Asynchronous logging of writes onto stable storage.
- Synchronous logging of writes onto stable storage.

The full replication approach, along with eventual consistency (e.g. Redis), ensures that all replicas will eventually be in the same state, while maintaining good performance in terms of latency and read throughput. This approach provides low latency and high read throughput that scale linearly with the number of slave servers, because all read and write queries can be served from memory without involving the disk. However, this approach suffers from one major drawback: its large memory requirement. The method becomes costly in terms of hardware and, more importantly, utility cost when we have a large cluster of servers. DRAM is volatile and requires constant electrical power, meaning the machines need to be powered at all times. For example, in today's datacenters the largest amount of DRAM that is cost-effective is 64 GB (1); in such a datacenter, storing 1 TB of data requires 16 machines, and for a replication factor of three, which is considered the norm for a good level of durability (3), we need 32 extra servers. Even though this approach offers great benefits, it is not a proper choice for a large cluster of in-memory databases, as it becomes costly.

The other drawback is the possibility of data loss. For example, in the case of Redis, the master server replies to updates before replication on the slave servers has completed (for lower latency); hence, if the master fails between replying to an update and sending it to the replicas, the data can be lost. To prevent such a risk, the update should not return until all replicas have received it, although this increases latency. This is a tradeoff that has to be made between high performance and durability. The other risk associated with this approach is that if all the servers holding the replicas fail concurrently (e.g. a datacenter power outage), the entire data will be permanently lost. To mitigate this issue, data can be replicated in multiple datacenters; however, this results in a high update latency (hundreds of milliseconds) for blocking calls, or partial loss of data for asynchronous calls.

Redis provides periodic snapshots. This is a good choice for backup and disaster recovery, as it allows keeping different versions of the system state at different points in time. Since the full state is contained in a single file, it can be compressed and transferred to other datacenters to enhance the availability and recovery of the service. A periodic snapshot stores the server state at one point in time after another; however, a failure at any point in time will result in losing all the updates from the last snapshot up to the failure point. This property makes the method undesirable when the latest state needs to be recovered. The other point to consider is that forking a child process to persist the state can significantly slow down a parent process serving a large dataset, or interrupt the service.

In comparison to snapshots, logging provides better durability, as every write operation can be written to disk. To improve performance, write operations are batched in memory before being written to disk; thus, a failure results in the loss of the buffered data. Logging performance can be improved by writing the updates to disk in an append-only fashion. This avoids the long latency of seek operations (on dedicated disks) by writing the logs sequentially. Therefore, if the sink thread is the only thread writing to disk (in an append-only fashion), it can achieve a better write throughput. Logging provides stronger durability than snapshots, but it results in a larger log file and a slower recovery process, since all the logs need to be replayed in order to rebuild the full state of the dataset. To accelerate the recovery process, the number of logs required to rebuild the state should be reduced. Two major techniques for truncating the log file are as follows. The system state can be checkpointed frequently, so that the logs before the checkpoint can be removed. The other technique, implemented by Redis, is cleaning the old logs: Redis rewrites the log file in the background to drop unneeded logs and minimize the log file size.

Asynchronous logging to disk provides better performance than synchronous logging; however, it increases the possibility of losing updates. Asynchronous logging is therefore usually used along with replication to mitigate this issue. RAMCloud takes this approach by replicating the logs through broadcast, referring to it as buffered logging, which allows writes (and reads) to proceed at the speed of RAM along with good durability and availability. Buffered logging allows a high write throughput; however, if the write throughput continues at a sustained rate higher than the disk throughput, it eventually fills the entire OS memory and the throughput drops to that of the disk. Therefore, buffered logging provides good performance as long as free memory is available. Moreover, buffered logging does not always guarantee durability, as in the case of a sudden power outage the buffered data will be lost; it is therefore suitable for applications that can afford to lose some updates. To deal with such scenarios, cross-datacenter replication can be used; however, write latency is then expected to increase significantly.

HDFS provides an append-only operation that can be used for the purpose of logging. HBase is an example of an application using this capability of HDFS for logging purposes (27). The idea is similar to RAMCloud, though the major difference is that the replication model applied in HDFS is similar to chain replication, which enables high write throughput. HDFS buffers

the bytes in memory and writes a big chunk of data to disk when the buffer is full. HDFS creates one file for each client on each machine hosting the replicas, which means that if multiple clients concurrently write to blocks located on the same machine, write performance degrades, as writing to several files on the same disk requires frequent seek operations. HDFS addresses correlated failures through a smart replication strategy that places the replicas on different machines in multiple racks.

In the case of Bookkeeper, the quorum approach consumes more resources from one of the participants, as one of them needs to perform the multicast: the client multicasts the log entries to several bookies, which consumes more bandwidth and CPU power on the client. One way to resolve this could be to outsource the replication responsibility to the server ensemble and create a more balanced replication strategy. For example, Zookeeper (28), a coordination system for distributed processes, applies a quorum-based approach on the server side for replication by implementing a totally ordered broadcast protocol called Zab (29); however, this complicates the server implementation.

Our design decisions for approaching the durability problem in memory databases are mostly influenced by the approaches described above. In the next chapter, we describe our solution in detail.

3 Design and Architecture

In this chapter, we define durability with respect to this work and describe how we approach the durability problem in memory-based key-value stores. We explain the system design and its properties, and finally how the system is built.

3.1 Durability

For the purpose of this work, durability means that if the value v corresponding to key k is updated to u at time t, then a read of key k at time t' such that t' > t must return u, provided no updates occurred between t and t'. We assume that this durability condition holds for a memory database as long as no crash has occurred. This work addresses the durability of a memory database (in our case a key-value store) such that the latest committed value of every key can be retrieved after a crash.

3.2 Target Systems

The proposed system design is tailored to provide durability for a cluster of in-memory databases storing data in the form of key-value pairs that complies with the following specifications.

1. The dataset is large, and the cluster of in-memory key-value stores consists of at least dozens of machines.

2. Write query sizes (update/insert/delete) vary from a few hundred bytes to a few KB (an example of a write query is SET K V, which sets the value of key K to V).

3. The workload is read-dominant (10-20% of the queries are writes).

4. High availability of the service is important.

The above specification is common for social networking platforms such as Facebook, Twitter and Yahoo! News Activity, which store large amounts of data in main memory and process large numbers of events. For example, in 2008 Facebook was already serving 28 terabytes of data from memory (30), and this number keeps increasing. Based on (31), in the Facebook cluster less than 6 percent of the queries are writes. In social network platforms, users' write queries are generally small (less than 4 KB) (32); for example, Twitter messages are limited to 140 characters (33).

3.3 Design Goals

In our design we aim to provide a high level of durability, such that in the event of a crash the latest state of the system is recoverable with a low probability of data loss. The objective is to achieve this goal with minimal impact on the performance of the memory database (read operations do not make any changes to the database state; thereby, only the write operations need to be durable). We need to ensure that our system is highly available, so that changes to the database state can be reliably recorded to stable storage and the records can be read at the time of recovery. The system needs to scale with an increasing number of databases and write operations. Maximizing the utilization of the local resources of the database cluster is another objective, and we try to avoid additional dependencies on external systems and to create a self-contained application. Any guarantee about the durability of a write should be provided before acknowledging the success of the write operation to the writer. Our durability mechanism should also enable a low recovery time, to enhance the availability of the database service.

3.4 Design Decisions

In Section 2.2 we described and discussed the common approaches towards durability in memory-based databases. In this section, we explain our design decisions with respect to the

target systems and the objectives.

Checkpoint vs. Logging. As checkpointing consumes a considerable amount of resources and always leaves the possibility of data loss, we choose message logging to persist the changes of the database state; the state can then be reconstructed by replaying the logs. (To reduce the recovery time and limit the number of logs, a snapshot of the system state is needed, or the unneeded logs should be truncated before recovery. To eliminate the cost of this process during operation, a background process can be assigned to reconstruct the system state and store it in stable storage when the system is not under stress. This is part of the future work.)

Pessimistic vs. Optimistic Logging. We choose pessimistic logging to ensure that changes take place only after they are durable in a stable storage system. Low latency is one of our main objectives; in order to achieve it, we create a stable storage by a mixture of in-memory replication and asynchronous logging of the changes of the database state. This allows storing log entries in several locations while providing a low response time. We name the set of servers cooperating to perform replication and logging a stable storage unit, or SSU.

Asynchronous vs. Synchronous. Asynchronous logging is the core of our design for providing a low response time; the reason for choosing it is to eliminate the latency of writing the logs to disk. However, since DRAM is volatile, this method carries the risk of losing the logs upon a crash. To address this issue, we replicate the logs in the memory of several machines before acknowledging the durability of the write. In this way, we significantly reduce the probability of data loss, as it is very unlikely that all the machines crash at the same time (3). The design targets low latency and high throughput for write operations by trading guaranteed durability for a low probability of data loss. Further in this chapter, we discuss the possibility of losing data and the reliability of this method.

Chain Replication vs. Broadcast. In order to replicate the logs, we choose chain replication for two main reasons. 1) Chain replication puts nearly the same load on the resources of each server, while in broadcast one of the participants utilizes more resources than the others; this provides an implicit load balancing. 2) Chain replication enables high-throughput logging, as the symmetric load on the servers allows utilizing the maximum resources of each server and minimizes the chance of a bottleneck appearing. We also performed an experiment to help us with the decision: we measured the latency caused by network transmission using either approach, and we discuss this experiment in Chapter 4. A sketch of the chain forwarding scheme is given below.
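The sketch below shows the forwarding logic of a single node in the chain; every node does roughly the same amount of work, which is the load-balancing argument made above. It is a simplified, single-threaded illustration with hypothetical Successor and ClientLink interfaces, not our actual implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical interfaces for the next node in the chain and for acknowledging the writer.
    interface Successor { void forward(byte[] logEntry); }
    interface ClientLink { void ack(long logId); }

    class ChainNodeSketch {
        private final Successor next;      // null if this node is the tail
        private final ClientLink client;   // used by the tail to acknowledge the writer
        private final List<byte[]> buffer = new ArrayList<>();  // in-memory replica of the log

        ChainNodeSketch(Successor next, ClientLink client) {
            this.next = next;
            this.client = client;
        }

        // Called when an entry arrives from the predecessor (or from the writer, at the head).
        void onEntry(long logId, byte[] logEntry) {
            buffer.add(logEntry);          // keep the entry in memory; it is persisted to disk
                                           // and removed asynchronously (not shown)
            if (next != null) {
                next.forward(logEntry);    // head/middle node: push the entry down the chain
            } else {
                client.ack(logId);         // tail: the entry is now on every node, acknowledge the writer
            }
        }
    }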

Local Disk vs. Remote File System. Logs can be persisted either on the local disks of the servers or on an existing reliable remote file system (e.g. NFS, HDFS). We choose the local disk of each server to maximize the utilization of local resources, reduce dependencies and avoid using network bandwidth for persistence. As the logs are replicated in the memory of several machines and all the machines persist the logs onto their disks, we will have replicas of the logs on several hard disks. This enhances the availability of the logs at the time of recovery and accelerates the recovery process by reading different partitions of the logs from different servers (hence aggregating the disk bandwidth of the replicas).

Faster Recovery vs. Higher Write Performance. During peak load, where the system is under sustained intensive load, if the write throughput to the storage is higher than the write throughput to disk, the servers' buffers eventually become saturated and performance degrades significantly. It is therefore important to fully utilize the disk bandwidth and minimize the write latency, in order to prevent saturation of the buffers as much as possible. We write the logs in an append-only fashion by sequentially writing them to a single file on disk, eliminating seek time and maximizing disk throughput utilization; a sketch of this on-disk layout is given at the end of this section. This requires interleaving the logs from all the writers into a single file. As opposed to having one file per writer, this method (sequential writes to a single file) makes the recovery process slower, since to recover the logs belonging to one writer we need to read all the logs in the file sequentially. Recovery needs to be done only at the time of a crash, which happens rarely, while logging needs to be performed constantly. Thus, we choose faster logging over faster recovery, although the read performance could be improved by indexing the log entries (Bookkeeper (24) implements this method).

Transport Layer Protocol. We choose TCP/IP for communication, as we want to deliver the messages in order and reliably, so as to provide a consistent view of the logs (and the stored files) among all the servers in a chain.
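The on-disk layout referred to above can be as simple as a length-prefixed frame per entry appended to a single file. The following is a minimal sketch under that assumption (client id, log id, payload length, payload); it is not the exact format used by our implementation.

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    class AppendOnlyLogFile implements AutoCloseable {
        private final DataOutputStream out;

        AppendOnlyLogFile(String path) throws IOException {
            // Open in append mode; entries from all writers are interleaved into this single file.
            out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path, true)));
        }

        // Appends one framed log entry: [clientId][logId][length][payload].
        synchronized void append(long clientId, long logId, byte[] payload) throws IOException {
            out.writeLong(clientId);
            out.writeLong(logId);
            out.writeInt(payload.length);
            out.write(payload);
        }

        void sync() throws IOException { out.flush(); }  // hand buffered bytes to the OS

        @Override
        public void close() throws IOException { out.close(); }
    }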

3.5 System Properties

Our stable storage system consists of a set of stable storage units (SSUs). Every stable storage unit consists of several servers, each persisting log entries onto its local disk. A writer process writes to only one of the stable storage units. A storage unit follows a fail-stop model, and upon an SSU crash its clients write to another storage unit. The system environment allows the detection of failures through a membership management service provided by an external system.

Our solution follows a single-writer, single-reader model. The log entries of a database application are written to the stable storage by only one process, and the process that writes the logs is the same process (same identifier) that reads them back from the storage. The read operation only needs to be performed at the time of recovery; therefore, read and write operations on the same data item are never performed simultaneously. A reader can read the logs from more than one server within the storage unit, as all the servers store an identical set of data (the acknowledged log entries). A process writes to a different storage unit if its storage unit fails or if the storage unit decides to disconnect the process.

Fault Tolerance Model

The system needs to be fault tolerant to continue its service in the presence of failures. We achieve fault tolerance through replication. In our system, the persistence of an acknowledged log entry is guaranteed under f simultaneous server failures if we have f+1 servers in the replication chain; however, to guarantee stable storage of a log entry we require f+2 servers to tolerate f simultaneous failures. We implement a fail-stop model: a server halts in response to a failure, and we assume that a server crash can be detected by all the other servers in the storage unit. In the event of a server crash, the storage unit stops serving all its writers and only persists the remaining logs in its servers' buffers onto disk (the writers connect to another storage unit to continue logging). Once all the logs are persisted to disk, all the servers restart and become available to form a new storage unit.

An alternative option for dealing with failures is to repair the storage unit. However, for the following reasons we prefer to re-create a storage unit and avoid repairing it: repairing a storage unit requires addressing many failure scenarios, which complicates the implementation; in addition, the possibility of corner cases that have not been taken into account, as well as the possibility of additional failures during the repair, further complicates matters.

Availability

The system allows the creation of many stable storage units, and each storage unit can provide a different replication factor. Larger replication factors (the number of servers in the replication chain within a storage unit) provide three advantages:

- higher availability of the stored entries, since all the servers within the storage unit host a replica of the logs;
- a lower probability of data loss when correlated failures occur, as it is more likely that at least one server holding the buffered data survives the failure and persists it to disk;
- higher read bandwidth, by aggregating the bandwidth of the servers hosting the replicas.

Therefore, a higher replication factor for a storage unit enhances data availability and read throughput, as well as providing stronger durability. However, in the case of a catastrophic failure of all the servers in a storage unit, data can be lost. A larger number of storage units enhances the availability of write operations, since writers can continue logging after a storage unit crash.

Scalability

In our storage system, every storage unit is independent of every other unit, and there are no shared resources or coordination amongst them. The independent nature of the storage units allows adding new units without impacting the service performance. The load is divided by assigning different sets of writers to different units. To create a storage unit, a set of servers with the closest resource usage is selected, in order to prevent any one server from becoming a bottleneck in the chain. This allows maximum resource utilization within a storage unit.

Safety

Consistent Replicas and Correct Recovered State

We rely on the TCP protocol to transfer messages between the nodes in order, reliably and without duplication. The servers in a chain are connected by a single TCP channel, and messages are forwarded and persisted in the same order in which they have been received. This ensures that

all the servers in a chain view and store the logs in the same order in which the messages were sent by the writer (the writer also writes through a single TCP channel). In our system it is not possible to recover an incorrect or stale database state without the knowledge of the recovery processor (the reader). Every writer is represented by a unique id (client id), and every log entry is uniquely identified by the combination of the client id and a log id. The log id increases by one for every new log entry. During the recovery of a database state, the logs are read and replayed in order, and in the case of a missing (or duplicate) log entry, the recovery processor is able to detect it.

Integrity

During recovery we need to ensure that the object being read is not corrupted. This requires adding a checksum to every object stored in the storage to enable verification of the data being read (this feature is not part of the implementation); a sketch of such a check is given at the end of this section.

Operational Constraints

The availability of the Zookeeper quorum is essential for the availability and operation of the system, since we rely on Zookeeper for failure detection and for accessing metadata about the nodes. The availability of write operations depends on the availability of a storage unit, and the availability of read operations requires at least one server that stores the requested logs. In order to operate continuously, at least two storage units should be available, so that service can quickly resume upon a storage unit crash. For example, in the current implementation we require six servers to have two storage units with a replication factor of three. This requirement could be reduced by fixing storage units upon a crash, replacing the failed server from a pool of available servers; however, this complicates the implementation.
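Both checks described above, detecting missing or duplicate log ids and verifying a per-entry checksum, can be applied in a single pass over a log file at recovery time. The sketch below assumes the framing from the earlier append-only sketch, extended with a CRC32 of the payload; it is illustrative only and, as noted above, the checksum is not part of the current implementation.

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.CRC32;

    class LogScanSketch {
        // Scans a log file framed as [clientId][logId][length][crc][payload] and validates each entry.
        static void scan(String path) throws IOException {
            Map<Long, Long> lastLogId = new HashMap<>();   // highest log id seen so far, per client
            try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
                while (true) {
                    long clientId;
                    try { clientId = in.readLong(); } catch (EOFException eof) { break; }
                    long logId = in.readLong();
                    int length = in.readInt();
                    long crc = in.readLong();
                    byte[] payload = new byte[length];
                    in.readFully(payload);

                    CRC32 check = new CRC32();             // integrity: recompute and compare the checksum
                    check.update(payload);
                    if (check.getValue() != crc)
                        throw new IOException("corrupted entry " + logId + " of client " + clientId);

                    long previous = lastLogId.getOrDefault(clientId, 0L);
                    if (logId <= previous)                 // safety: duplicate or out-of-order entry
                        System.err.println("duplicate entry " + logId + " of client " + clientId);
                    else if (logId != previous + 1)        // safety: gap in the per-client sequence
                        System.err.println("missing entries of client " + clientId
                                + " between " + previous + " and " + logId);
                    lastLogId.put(clientId, Math.max(previous, logId));
                }
            }
        }
    }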

3.6 Architecture

The main idea is to create several storage units, each capable of storing and retrieving a stream of logs reliably. Each storage unit consists of a number of coordinated servers that perform chain replication and asynchronously persist the logs onto their local disks to provide a low response time. Each log entry is acknowledged only after it has been replicated on all the servers within the storage unit; hence we ensure that the logs persist in the event of the failure of some of the servers. The number of servers in each storage unit is equal to the replication factor it provides. In this section, we describe the architecture of the system (with respect to the write operation).

Abstractions

Our system consists of three types of processes. Figure 3.1 illustrates the processes, and below we describe their functions.

Log server processes (servers) form a storage unit and asynchronously store the log entries on local disks (in an append-only fashion). They also read and stream the requested logs from the local disk upon the request of a client process at the time of recovery. In Figure 3.1, the head, the tail and the middle nodes are log server processes.

A Stable Storage Unit, or storage unit (SSU), provides stable storage of log entries. It consists of a number of machines hosting two types of processes: a log server process (each on a different machine) to replicate and store logs, and a state builder process. The number of machines is equal to the replication factor provided by the stable storage unit.

A client process (writer/reader) processes requests (writes) from an application and creates log entries. It streams the entries to an appropriate storage unit and responds to the application. The client process also reads the logs and reconstructs the database state at the time of failure (the read operation is future work).

State builder processes are background processes that read the logs from the local disk to compute the latest value of each key. Once the values are computed, they are stored on disk and the old logs are removed. The purpose of this process is to reduce the recovery time by eagerly preparing the latest state of the key-value store. This process takes place whenever the system is not under stress (part of future work).

Figure 3.1: System entities.

Coordination of Distributed Processes

Zookeeper is a coordination system for distributed processes (28). We use Zookeeper for membership management and for storing metadata about the server processes, storage units, and client processes. Data in Zookeeper is organized as a tree structure, and each node is called a znode. There are two types of znodes: ephemeral and permanent. An ephemeral znode exists as long as the session of the Zookeeper client that created it is alive; ephemeral znodes can therefore be used to detect the failure of a process. A permanent znode stores its data permanently and ensures it remains available. We use these metadata and the Zookeeper membership service to coordinate server processes for creating storage units and to detect failures. Client processes also use the Zookeeper service to locate storage units and detect their failures. Below we describe the metadata and the types of nodes used in our system.

Metadata

Log server znode (ephemeral):
- IP/port for the coordination protocol
- IP/port for streaming
- Rack: the rack where the server is located
- Status: accept or reject storage join requests
- Load status: updated resource utilization status

Storage unit znode (ephemeral):

- Replication factor
- Status: accept/reject new clients
- List of log servers
- Load status: load of the log server with the highest resource utilization

File map znode (permanent):
- Mapping of logs to servers

Client znode (ephemeral):
- Only used for failure detection

Global view znode (permanent):
- List of servers and their roles (leader/follower), used to form stable storage units

Server Components

A log server process creates an ephemeral znode in Zookeeper upon its start and constantly updates its status data at this node. The process follows a protocol that allows it to cooperate with other processes to form a storage unit. We first describe this protocol, and then explain how an individual log server operates within a storage unit.

Coordination Protocol

This protocol is used to form a replication chain (storage unit) and operates very similarly to two-phase commit (12). The protocol defines two server roles: leader and follower. The leader is responsible for contacting the followers and manages the creation of a storage unit; followers act as passive processes and only respond to the leader. Figures 3.2 and 3.3 describe the state transitions of the leader and the follower.

If a server process is not part of a storage unit, it sets its state to the listening state. In the listening state, a process frequently checks the global view data. If the process is listed as a leader, it reads the list of its followers' addresses. It then sends the followers a join-request message and sets a failure detector (a watch on their ephemeral znodes) to detect their failures.
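The registration and failure-detection steps above rely on standard Zookeeper primitives (ephemeral znodes and watches). The following is a minimal sketch using the Zookeeper Java client; the paths, the address and the serialized status are placeholders, and the sketch assumes the parent path /logservers already exists.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class LogServerRegistration {
        static ZooKeeper connectAndRegister(String serverId, byte[] status) throws Exception {
            // Session-level watcher; ephemeral znodes disappear when this session dies.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 10000, (WatchedEvent e) -> { });

            // Register the log server: the znode vanishes automatically if the process crashes.
            zk.create("/logservers/" + serverId, status,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return zk;
        }

        // A leader watching one of its followers: a NodeDeleted event signals the follower's failure.
        static void watchFollower(ZooKeeper zk, String followerId) throws Exception {
            Watcher onChange = (WatchedEvent e) -> {
                if (e.getType() == Watcher.Event.EventType.NodeDeleted) {
                    System.err.println("follower " + followerId + " failed, aborting chain formation");
                }
            };
            zk.exists("/logservers/" + followerId, onChange);
        }
    }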

Figure 3.2: Leader states.

Figure 3.3: Follower states.
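The figures themselves are not reproduced here. As a rough approximation of the leader-side state machine they depict, and which the protocol description in this subsection spells out, the enum below names the phases a leader moves through; the identifiers are illustrative assumptions, not taken from the implementation.

```java
/** Illustrative leader-side phases of the chain-formation protocol (names are assumptions). */
enum LeaderState {
    LISTENING,            // not part of a storage unit; polling the global view znode
    AWAITING_JOIN_ACKS,   // join-request sent; waiting for every follower to accept
    AWAITING_CONNECTIONS, // connection-request sent; waiting for connect-completion signals
    SERVING               // storage unit znode created; start signal sent to the followers
    // Any rejection or follower failure triggers an abort message and a return to LISTENING.
}
```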

Followers may accept or reject a join request depending on their available resources. If a follower fails or rejects the request, the leader triggers the abort process and all processes return to their initial state. To abort, the leader sends an abort message to all the followers. Upon receiving an abort message, each follower (and the leader itself) clears all its data structures and returns to the initial state (Figure 3.3). Each follower sets a failure detector for the leader before accepting the join request, so that if the leader fails it can detect the failure and resume its previous state. If all the followers accept the join request, the leader sends a connection-request message carrying an ordered list of servers (including the leader). Each server connects to the previous and next servers in the list as its predecessor and successor in the chain. Once a server is connected and ready to stream data, it sends a connect-completion signal to the leader. If a server fails to connect or crashes, the leader aborts the process. Otherwise, a complete chain of servers is ready: the leader creates a znode for the new storage unit and sends a start signal, along with the znode path of the storage unit, to all the followers to start the service.

Concurrency

Each server process consists of three main threads operating concurrently in a producer-consumer model. Figure 3.4 shows how the threads in a single server operate and interact through shared data structures. The three data structures shared among the threads are:

- DataBuffer stores the log entries in memory.
- SenderQueue keeps the ordered indices of the log entries that should be either sent to the next server (on the head or a middle server) or acknowledged to the client (on the tail server).
- PersistQueue holds the ordered indices of the log entries that should be written to disk.

The receiver thread reads entries from the TCP buffer and inserts them into the DataBuffer; it also inserts the index of each entry into the SenderQueue. The DataBuffer has a pre-specified size (in number of entries), and if it is full, the receiver thread must wait until an entry is removed from the DataBuffer.
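The sketch below illustrates one way this three-thread pipeline could be wired together, covering the receiver thread above as well as the sender and persister threads described next. The class name, the buffer capacity, and the placeholder I/O methods are assumptions made for the example rather than the actual implementation.

```java
import java.util.Map;
import java.util.concurrent.*;

/** Sketch of the three-thread log server pipeline (names and types are illustrative). */
public class LogServerPipeline {

    private static final int CAPACITY = 1024;                    // assumed DataBuffer size (entries)

    private final Map<Long, byte[]> dataBuffer = new ConcurrentHashMap<>();
    private final Semaphore freeSlots = new Semaphore(CAPACITY); // blocks the receiver when full
    private final BlockingQueue<Long> senderQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Long> persistQueue = new LinkedBlockingQueue<>();
    private final boolean isTail;

    public LogServerPipeline(boolean isTail) { this.isTail = isTail; }

    /** Receiver thread: pull entries off the network and hand them to the other threads. */
    void receiverLoop() throws InterruptedException {
        long index = 0;
        while (true) {
            byte[] entry = readFromNetwork();             // placeholder for the TCP receive
            freeSlots.acquire();                          // wait if the DataBuffer is full
            dataBuffer.put(index, entry);
            senderQueue.put(index);
            index++;
        }
    }

    /** Sender thread: forward to the successor, or acknowledge the client if this is the tail. */
    void senderLoop() throws InterruptedException {
        while (true) {
            long index = senderQueue.take();
            byte[] entry = dataBuffer.get(index);
            if (isTail) {
                ackClient(entry);                         // entry is replicated on the whole chain
            } else {
                sendToSuccessor(entry);                   // next server in the chain
            }
            persistQueue.put(index);                      // now hand the entry to the persister
        }
    }

    /** Persister thread: append entries to the local log file, then free their buffer slot. */
    void persisterLoop() throws InterruptedException {
        while (true) {
            long index = persistQueue.take();
            appendToDisk(dataBuffer.get(index));          // single writer, append-only file
            dataBuffer.remove(index);
            freeSlots.release();
        }
    }

    // Placeholders standing in for the real network and disk code.
    private byte[] readFromNetwork() { return new byte[0]; }
    private void ackClient(byte[] entry) {}
    private void sendToSuccessor(byte[] entry) {}
    private void appendToDisk(byte[] entry) {}
}
```

Keeping persistence on its own thread keeps the disk write off the acknowledgment path, which is what allows an entry to be acknowledged as soon as it is buffered in memory on every server of the chain.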

Figure 3.4: Log server operation.

The sender thread waits until an index exists in the SenderQueue. It reads the index from the SenderQueue to find and read the corresponding entry in the DataBuffer. If the server is the tail server for that entry, it sends an acknowledgment to the corresponding client indicating that the entry has been replicated on all the servers. If the server is not the tail, it simply sends the entry to the next server in the chain (its successor). Once the message has been sent to the next hop, the sender thread puts the index of the entry into the PersistQueue.

The persister thread waits until an index exists in the PersistQueue. It reads the corresponding entry from the DataBuffer and persists it to disk in an append-only fashion. This is the only thread persisting entries, and entries from different clients are interleaved into a single file. Once an entry has been written to disk, this thread removes it from the DataBuffer.

Stable Storage Unit (SSU)

A stable storage unit consists of a set of servers forming a replication chain, and it ensures the replication and availability of delivered entries. One of the servers acts as the leader and holds the lease on the znode of the storage unit. The storage unit is considered failed when one or more of its servers crash. Upon such a crash, the storage unit stops its service and only persists the remaining entries from memory to disk. When the leader crashes, the znode is removed automatically; when another server fails, the leader removes the znode. Either way, all clients are notified of the storage failure.

In a storage unit, every server can act as head, tail or middle server. Clients can connect to any server in the chain: the server acting as entry point is the head of the chain for that client, and the last server in the chain (which sends the acknowledgment) is the tail. Figure 3.5 shows how several clients can stream to the storage unit.

Figure 3.5: Storage unit.

Load Balancing

We make load-balancing decisions at three points. First, we ensure that the set of servers selected to create a storage unit (i.e., to perform chain replication) have nearly the same load. This minimizes the chance of a bottleneck appearing in the chain and maximizes the resource utilization of each server. Figure 3.6 shows how servers are clustered to form a storage unit. Second, one of the servers within each storage unit constantly updates the available resources and status of the unit in its ZooKeeper znode. This enables clients to select the storage unit with the lowest load by reading this data from the ZooKeeper servers. Third, within a storage unit every server can act as head, tail or middle server. The tail consumes less bandwidth, since it only sends acknowledgments to the client, while the head and middle servers must also transfer each entry to the next server in the chain. Hence, if all clients chose the same server as the head, the tail server would consume about half the bandwidth of the other servers. To mitigate this imbalance, the clients of one storage unit connect to different servers. In the current implementation, clients randomly choose one server in the storage unit as the head.
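As a concrete illustration of the second and third decision points, the sketch below shows how a client might pick the least-loaded storage unit from ZooKeeper and then choose a random head, as in the current implementation. The znode path, payload layout and class name are assumptions made for the example only.

```java
import org.apache.zookeeper.ZooKeeper;
import java.util.*;

/** Sketch of client-side selection of a storage unit and a head server (illustrative paths/format). */
public class StorageUnitSelector {

    private final ZooKeeper zk;
    private final Random random = new Random();

    public StorageUnitSelector(ZooKeeper zk) { this.zk = zk; }

    /** Returns the server list of the storage unit advertising the lowest load. */
    public List<String> pickLeastLoadedUnit() throws Exception {
        double bestLoad = Double.MAX_VALUE;
        List<String> bestServers = null;
        for (String unit : zk.getChildren("/storageunits", false)) {
            // Assumed payload: "<load>|<server1>,<server2>,..." written by the unit's leader.
            String[] parts = new String(zk.getData("/storageunits/" + unit, false, null)).split("\\|");
            double load = Double.parseDouble(parts[0]);   // load of the most utilized server in the unit
            if (load < bestLoad) {
                bestLoad = load;
                bestServers = Arrays.asList(parts[1].split(","));
            }
        }
        if (bestServers == null) throw new IllegalStateException("no storage unit available");
        return bestServers;
    }

    /** Current implementation: choose the head uniformly at random among the unit's servers. */
    public String pickHead(List<String> servers) {
        return servers.get(random.nextInt(servers.size()));
    }
}
```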

However, this can be improved by connecting clients to the servers in a round-robin fashion, so that every server serves a nearly equal number of clients as the head (and, consequently, as the tail). Figure 3.5 also shows how load is distributed through the clients' selection of different head servers.

Figure 3.6: Clustering decision based on the servers' available resources.

The decision to cluster servers into a storage unit is based on their available resources. One of the log server processes is in charge of compiling the servers' data and making the clustering decisions. It reads the servers' data from ZooKeeper and sorts the available servers by their free resources. Using this information, servers with a similar amount of available resources are grouped, and the server with the largest amount of free resources in each group is chosen as the leader. This process is performed frequently and its output is written to the Global View znode in ZooKeeper. Each process frequently reads this data to determine its group and its role. If the process responsible for updating the Global View znode crashes, another process takes over the job. Our current implementation does not provide dynamic load balancing; this is part of future work.

Failover

In the event of a storage failure (the failure of any of its servers), the clients of that storage unit are able to detect the failure and find another storage unit by querying ZooKeeper. A client then connects to another storage unit (if one is available) and continues writing its logs. An alternative that shortens the service disruption is to let the client hold connections to two storage units; upon the crash of the one it is writing to, it immediately resumes operation by switching to the other.
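A hedged sketch of this client-side failover logic follows, reusing the selector from the load-balancing sketch above. The watch-based detection and the reconnect placeholder are illustrative assumptions; the standby-connection variant mentioned above is noted in a comment.

```java
import org.apache.zookeeper.*;
import java.util.List;

/** Sketch of client-side failover on storage-unit failure (paths and callbacks are illustrative). */
public class FailoverSketch {

    private final ZooKeeper zk;
    private final StorageUnitSelector selector;   // the selector sketched in the previous section

    public FailoverSketch(ZooKeeper zk, StorageUnitSelector selector) {
        this.zk = zk;
        this.selector = selector;
    }

    /** Watch the znode of the storage unit currently in use; its deletion signals failure. */
    public void watchCurrentUnit(String unitPath) throws Exception {
        zk.exists(unitPath, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                failover();
            }
        });
    }

    /** On failure, locate another storage unit and resume writing there. */
    private void failover() {
        try {
            List<String> servers = selector.pickLeastLoadedUnit();
            String newHead = selector.pickHead(servers);
            reconnect(newHead);                    // placeholder: open a stream to the new head
            // A lower-disruption variant keeps a standby connection to a second storage unit
            // open in advance and simply switches to it here.
        } catch (Exception e) {
            // In practice: retry with backoff until a storage unit becomes available.
        }
    }

    private void reconnect(String headServer) { /* placeholder for the streaming connection */ }
}
```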
