A Distributed Namespace for a Distributed File System


Wasif Riaz Malik
wasif@kth.se

Master of Science Thesis
Examiner: Dr. Jim Dowling, KTH/SICS
Berlin, Aug 7, 2012
TRITA-ICT-EX-2012:173


Abstract

Due to the rapid growth of data in recent years, distributed file systems have gained widespread adoption. The new breed of distributed file systems reliably store petabytes of data on commodity hardware, and also provide rich abstractions for massively parallel data analytics. The Hadoop Distributed File System (HDFS) is one such system; it provides the storage layer for MapReduce, Hive, HBase and Mahout. The metadata server in HDFS, called the NameNode, is a centralized server which stores information about the whole namespace. The centralized architecture not only makes the NameNode a bottleneck and a single point of failure, but also restricts the overall capacity of the file system. To solve the availability and scalability issues of HDFS, a new architecture is required. In this report, we propose a distributed implementation of the HDFS NameNode, where the file system metadata is stored in a distributed, in-memory, replicated database called MySQL Cluster. The NameNodes are stateless, and the throughput of the system can be increased either by adding NameNodes or by adding more data nodes in NDB. HDFS clients can access the metadata by connecting to any one of the NameNodes. The evaluation section shows that the new architecture, as compared to HDFS, can handle more requests per second, store ten times more files, and recover from failures within a few seconds.


Dedicated to my wife


Acknowledgements

I would like to thank Dr. Jim Dowling for his valuable guidance and feedback throughout the thesis, my EMDC classmates for providing feedback, and my colleagues at the Swedish Institute of Computer Science (SICS) for exchanging ideas related to this thesis. Finally, I would like to thank my wife and my family for supporting me throughout my studies.

Berlin, August 8, 2012


Contents

1 Introduction
2 Related Work / Background
  2.1 Google File System
    Architecture
      Master
      Chunkservers
      Clients
  2.2 TidyFS
    Architecture
      Metadata Server
      Storage Computers and the Node Service
      Clients
  2.3 HDFS
    Architecture
      Namenode
      Datanodes
      Clients
3 Limitations of HDFS
  3.1 Namespace Scalability
  3.2 Throughput
  3.3 Failure Recovery
4 KTHFS Architecture and Implementation
  4.1 Goals
  4.2 Architecture
    Execution Overview
    Stateless Namenode
    Writer Namenode
    Reader Namenode
    Datanodes
    Clients
  4.3 Storing Metadata in a Database
    Modification of Namenode Operations
    INodes
      Path Resolution
      Representation in HDFS
      Representation in KTHFS
    Block Metadata
      Representation in HDFS
      Representation in KTHFS
    Datanodes Metadata
      Representation in HDFS
      Representation in KTHFS
  4.4 The INode Cache
  4.5 MySQL Cluster (NDB)
    Data Access
    Querying Data
    Distribution Awareness
5 Evaluation
  5.1 Capacity
  5.2 Throughput
  5.3 Availability
    Availability of the Namenodes
    Recovering from Failures
    Garbage Collection Pauses
    Availability of MySQL Cluster
6 Conclusions and Future Work
  6.1 Conclusion
  6.2 Future Work
    Scaling-out the Writer Namenode
    Fine Grained Locking
    Minimizing Database Round Trips
    Block Report Processing

Chapter 1: Introduction

Due to the immense growth of data in recent years, it has become a challenge to store, analyze and extract meaningful information out of it. The amount of data generated and stored by internet companies, enterprises and governments is on the rise, and according to forecasts, it will keep rising exponentially in the coming years. Traditional storage systems like relational databases and file systems are not scalable enough to manage the ever increasing datasets. This has given rise to a new kind of system, which can not only reliably store massive amounts of data over a cluster of machines, but also provide abstractions for parallel data analytics. These systems have the design goals of being highly scalable and fault tolerant, while sacrificing some of the semantics which traditional systems provided.

Recent distributed file systems like the Hadoop Distributed File System[1] store data on a cluster of machines and provide abstractions for execution frameworks like MapReduce[2] to analyse data in a massively parallel manner. Hadoop is widely used for large scale data intensive applications, such as bioinformatics, social networks and search engines.

HDFS consists of a NameNode, a set of datanodes, and a set of clients. The NameNode stores the metadata of the file system, such as the namespace, and also acts as a centralized master. It also exposes an API for clients to read and update this metadata. The datanodes store the actual file data on disk and periodically send heartbeats and block reports to the NameNode. The clients access a file by asking the NameNode for the datanodes holding that file, and then fetching the blocks directly from those datanodes.

The NameNode runs on a single machine and stores the information about files and blocks in its memory. This keeps the architecture of HDFS fairly simple but also gives rise to three issues:

1. It is a single point of failure, which means that if the NameNode goes down, the whole HDFS cluster becomes inaccessible. This discourages the use of HDFS for real-time and interactive workloads.

2. The number of files that can be stored in an HDFS cluster is limited by the amount of RAM available to the NameNode. This puts a hard limit on the size of the file system because the RAM can only be increased to a certain limit. This is also known as the file-count problem. Furthermore, since the NameNode runs on a JVM, a very large heap size is not recommended because frequent garbage collection pauses make the NameNode unresponsive for seconds or even minutes.

3. Restarting the NameNode after a failure might take up to an hour (depending on the number of files), although there have been attempts[6][7] to solve this problem.

There are workarounds to avert the file-count problem, but they have drawbacks. For example, the Hadoop Archives and SequenceFile functionalities allow packing of multiple files into one file, hence consuming the metadata size of just one file. However, this passes additional complexity to applications for accessing files, which might not be possible in some scenarios. Another workaround is to increase the block size of files in HDFS, which results in fewer blocks per file and hence less metadata consumption at the NameNode. However, this workaround is not effective for small files (having one block). A recent study by Shvachko[8] has shown that the centralized architecture of the NameNode limits the overall capacity and throughput of HDFS, and a new architecture is required to solve these issues. The AvatarNamenode[6] implementation by Facebook and the recent Highly Available Namenode[7] only solve the availability issue, hence the NameNode still remains a limiting factor for throughput and namespace scalability.

In this thesis report, we present a new distributed architecture for the HDFS NameNode, with the goal of making it horizontally scalable and highly available. In the remainder of this report, we will refer to the new architecture as KTHFS. The single NameNode is replaced by a set of shared-nothing, stateless NameNodes which store all HDFS metadata in a distributed in-memory database called MySQL Cluster (NDB). The stateless NameNodes are lightweight processes, and fetch the metadata from NDB only when it is required by a client. Since the NameNodes are stateless, they can be stopped, migrated and restarted in seconds without any downtime for clients. This also makes it possible to scale out the NameNode by launching multiple instances of it on different machines. A modified client API makes sure that the client requests are load balanced over the set of NameNodes in the system. The clients access the cluster of NameNodes by using a load balancing policy; different policies like round robin or random can be plugged in to the client library. The capacity of KTHFS is not limited by the amount of RAM on a single NameNode, but rather depends on the storage capacity of NDB, which is known to have a maximum capacity of approximately two terabytes. We will discuss the memory requirements of KTHFS in the evaluation section.

There are significant challenges related to the throughput of a NameNode in KTHFS. Since the metadata is fetched from a node (or possibly multiple nodes) in NDB over the network, the latency of the operations increases, which in turn affects the throughput of a single NameNode. It must be noted, though, that HDFS is not designed for low latency operations, and Hadoop applications do not expect sub-second response times for metadata operations. The latency of operations could be improved by sacrificing some HDFS semantics, but that would violate the goal of maintaining full HDFS semantics in KTHFS. We have also designed and implemented a cache at the NameNodes to improve the latency of metadata operations. This cache reduces the round trips to the database for metadata operations but still returns the latest copy.

In the next chapter, we will discuss the characteristics of some distributed file systems. In Chapter 3, we will list the limitations of HDFS, and in Chapter 4, we will describe the goals and then elaborate on the design and implementation of the project. In Chapter 5, we will discuss in detail the storage capacity, throughput and availability of the NameNodes in KTHFS. Finally, in Chapter 6, we provide concluding remarks and then discuss some possible improvements to this work.


Chapter 2: Related Work / Background

2.1 Google File System

The Google File System (GFS)[9] is a distributed file system for large scale data-intensive applications. It was designed by Google to meet its growing data processing requirements. The design goals of GFS are scalability, reliability and fault tolerance. GFS considers failures to be the norm rather than the exception. The files in GFS are expected to be huge, typically greater than 15 MB, and multi-GB files are common in GFS deployments at Google. GFS does not allow mutation of a file at random locations; the only mutation allowed to a file is append.

GFS provides a familiar file system interface for user applications but does not implement any standards such as POSIX. The files and folders are stored in a hierarchical fashion, just like traditional file systems, and can be accessed by providing the full path. The operations supported by GFS are create, delete, open, close, read and write. GFS also supports a feature called multi-writer append, which allows multiple writers to concurrently append data to a file.

Architecture

A GFS cluster consists of a master node, chunkservers and clients. The clients and chunkservers can be run on the same machines. A file consists of multiple chunks; each chunk has a fixed size, configured to be 64 MB by default. The chunks are stored on chunkservers. Each chunk has a globally unique identifier generated by the master. The chunks are replicated thrice for reliability; however, it is possible to set a different replication factor for each file.

Master

The master stores all file system metadata in its memory. The metadata includes the namespace, the mapping from files to chunks, and the chunk locations. The master exposes a file system interface for the clients, so that the clients can perform operations on the file system. It also handles namespace management and locking, check-pointing the metadata, replica placement and replica re-balancing. Storing everything in a single master simplifies the overall design of GFS, and enables the master to perform sophisticated chunk placement and replication decisions using global knowledge. However, the disadvantage of having a single master is that it becomes a single point of failure, and a bottleneck under high client workloads.

Chunkservers

The chunkservers periodically send heartbeats to the master. They store the chunks on local disk as files, and expose an interface to clients for reading or writing chunks.

Clients

The clients do not read or write through the master, but use the master to find out the locations of the chunkservers. If a client wants to read a file, it asks the master about the chunkservers which have the chunks of that file. The master replies with the locations of all replicas of the chunks, and the client fetches the closest chunk replica. Similarly, if a client wants to write a new chunk, it asks the master for a chunkserver. The master replies with the closest chunkserver, and the client writes to that chunkserver directly.

2.2 TidyFS

TidyFS[10] is a simple and small distributed file system designed by Microsoft for parallel computations on clusters. TidyFS differs from GFS and HDFS by being simpler: it avoids complex replication protocols and concurrent writers, and uses lazy replication for file data instead of eager replication. The design goal of TidyFS is to be simple enough to efficiently handle the kinds of workloads that Microsoft has. A typical workload for TidyFS is write-once, read-many, high throughput data access. The fault tolerance model of TidyFS is slightly different from that of GFS, as it expects the higher level execution model to handle storage failures. For example, TidyFS uses lazy replication for file data, and delegates the task of handling storage failures to applications running on top of TidyFS. Another difference is that it allows clients to read and write data natively, such that they can choose to access the data using different patterns (for example sequential or random). TidyFS does not implement the POSIX interface and provides a very simple API for clients. The most common file operations like create, delete, list, copy and concatenate are supported. However, files are called streams, and blocks are referred to as partitions in TidyFS.

Figure 2.1: TidyFS architecture

There is no explicit directory tree maintained in TidyFS, so there are no operations for creating and deleting directories. However, the files can still be stored in a hierarchical manner by using arcs in the stream name. When a stream is created, the missing directory entries are implicitly created, and when the stream is deleted, all parent directories are recursively removed unless they contain other streams.

Architecture

TidyFS consists of a metadata server, a cluster of storage computers, and a node service running on each storage computer. The metadata server is responsible for storing the metadata of the whole file system, managing the storage computers, replication of partitions, load balancing, etc. Partitions are mostly similar to GFS chunks, but one key difference is that they can be of variable size. The partitions, once written and committed, are immutable. They are lazily replicated by default, but the clients can eagerly replicate them if required.

Metadata Server

The centralized metadata server stores all metadata of TidyFS, i.e. information about streams, parts and storage computers, and the mappings from streams to parts and from parts to storage computers. The metadata and the operations on it are replicated to other machines using the Autopilot Replicated State Library[11] to be able to recover from failures. The metadata server also tracks the state of storage computers so that it can optimally replicate the data across them. It is also possible to define distinguished attributes on a per-stream or per-part basis; these attributes are generally referred to as extended file attributes in the file system world. In short, the metadata server is the master of the system: it responds to client queries, manages the storage computers, and optimally replicates the data across the cluster.

Storage Computers and the Node Service

The storage computers store the actual data, i.e. the parts. Each TidyFS cluster can have hundreds of storage computers. The node service is a Windows service that runs periodically on each storage computer and carries out routine tasks like reporting to the metadata server the amount of disk space available, the state of the storage computer, etc. In TidyFS, unlike GFS, the garbage collection of unused, over-replicated blocks is done on the storage computers. However, the list of candidates for deletion is sent to the metadata server for verification, and only the verified parts are deleted from the storage computers. This two-phase process ensures that non-committed parts are not deleted from the storage computers.

Clients

The clients communicate with the metadata server using a client library. The client library has built-in failover support to tolerate metadata server failures. To write data, the client asks for a write path from the metadata server. Typically the write path points to the client's local machine. The client writes data to the write path and also keeps renewing the lease for this stream on the metadata server. Once it has finished writing, it closes the local file and adds the written part to the stream. At this moment, the data becomes available for other clients. If the client dies during the write, the lease expires and the stream is removed from the metadata server.

2.3 HDFS

The Hadoop Distributed File System (HDFS)[1] is an open source implementation of GFS[9], built by Yahoo! to act as the storage layer for Hadoop[12] applications. It was designed to reliably store petabytes of data on commodity hardware, and to enable distributed computation frameworks like MapReduce[2] to analyse data in a massively parallel fashion.

Figure 2.2: The Architecture of Hadoop Distributed File System

In addition to MapReduce, HDFS is used as the storage layer for Hive, HBase and Mahout, and it is also used as a stand-alone file system. HDFS was primarily designed for use by MapReduce and other Hadoop components. The HDFS API is similar to the UNIX file system API, but several UNIX semantics have been relaxed in favor of simplicity, high performance and the requirements of the applications at hand. In other words, HDFS is not a POSIX[13] compliant file system. Some of the POSIX functions not supported by HDFS are concurrent writers, random writes, and locking of file sections.

Architecture

HDFS consists of three components: the Namenode, the datanodes and the clients. The Namenode stores the metadata of the file system and also acts as a centralized master node. The file data is stored in the datanodes. Each file in HDFS is split into equally sized blocks (64 MB by default). Each block is replicated thrice by default on different datanodes. The HDFS client library is responsible for fetching the metadata of a file from the Namenode and then accessing the data from the datanodes.

Namenode

The Namenode is a centralized server which stores the metadata of the entire file system in its RAM. This metadata includes the inodes, the blocks, the mapping from inodes to blocks, and the mapping from blocks to datanode locations. The files and directories in HDFS are represented by a hierarchy of inodes (as shown in Figure 2.3). Each inode stores information such as the name of the file, its permissions, timestamps, quotas, etc. The Namenode provides an API for the clients to query this metadata. The maximum number of metadata objects that can be stored in the Namenode is bounded by the size of its RAM. The Namenode persists the metadata on disk and also maintains a journal of operations so that it can recover from failures. Other important tasks of the Namenode include replication management, block placement and block balancing. In short, the NameNode has two important tasks: serving the client metadata requests, and efficiently managing the cluster of datanodes.

Figure 2.3: The hierarchy of inodes. The file a.txt is represented by the path /home/wasif/a.txt

Datanodes

The datanodes store the actual data of files in HDFS. An HDFS cluster can have hundreds or thousands of datanodes depending on the data requirements. When a datanode process starts, it registers itself with the NameNode to become a part of the filesystem cluster. It periodically sends heartbeats (by default every three seconds) to let the Namenode know that it is alive. These heartbeats also contain information such as the available disk space and the current load on the datanode, to help the NameNode make decisions about replication. As a response to the heartbeats, the Namenode sends commands to the datanodes. The datanodes also periodically send a report of all of the blocks they have to the NameNode (every hour by default). The NameNode compares this report to the metadata stored in its RAM, and verifies its integrity. If a block reported by the datanode is not present in the NameNode's metadata, the NameNode asks the datanode to delete it.
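The block-report check described above amounts to a set comparison between the blocks a datanode reports and the blocks the NameNode expects it to hold. The sketch below illustrates that comparison only; the class and method names are hypothetical and do not correspond to the actual HDFS implementation.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of block-report reconciliation: blocks reported by a
// datanode but unknown to the NameNode are scheduled for deletion, and
// expected blocks that were not reported are treated as missing replicas.
public class BlockReportReconciler {

    public static class Result {
        public final Set<Long> toDelete = new HashSet<>();  // reported but not expected
        public final Set<Long> missing = new HashSet<>();   // expected but not reported
    }

    public Result reconcile(List<Long> reportedBlockIds, Set<Long> expectedBlockIds) {
        Result result = new Result();
        Set<Long> reported = new HashSet<>(reportedBlockIds);

        for (Long blockId : reported) {
            if (!expectedBlockIds.contains(blockId)) {
                result.toDelete.add(blockId);   // NameNode asks the datanode to delete it
            }
        }
        for (Long blockId : expectedBlockIds) {
            if (!reported.contains(blockId)) {
                result.missing.add(blockId);    // candidate for re-replication elsewhere
            }
        }
        return result;
    }
}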

Clients

The clients access HDFS using a client library. The library provides functions to create, open, list and delete files and directories. To open a file for reading, the client obtains from the Namenode a list of datanodes holding the blocks of that file. It then pulls the block data from the datanodes directly. Similarly, if a client wants to write a file, it asks the NameNode to create a metadata entry for the file and provide a list of datanodes where it can write the first block of the file. For each subsequent block, the client performs the same operations again, i.e. it asks the NameNode for a list of datanodes and writes the content to those datanodes. It must be noted that the clients can only write sequentially to a file; random writes are not allowed.
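As a concrete illustration of the client library described above, the sketch below uses the standard Hadoop FileSystem API to write and then read a file; the cluster address and path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client usage: the library talks to the NameNode for metadata
// operations (create, open, delete) and streams block data to/from datanodes.
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/home/wasif/a.txt");

        // Create a file: the NameNode allocates blocks, the data goes to datanodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello, distributed namespace");
        }

        // Open the file: the NameNode returns block locations, and the data is
        // read directly from the datanodes holding the replicas.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}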


Chapter 3: Limitations of HDFS

In this chapter, we briefly reiterate the major limitations of the Hadoop Distributed File System (HDFS).

3.1 Namespace Scalability

In HDFS, the namespace (i.e. the metadata of files and directories) is stored in the RAM of one machine (the Namenode). In large HDFS clusters, the namespace can become so large that it no longer fits in the RAM of one machine.

3.2 Throughput

The throughput of the metadata operations is also bounded by the performance of one machine (the Namenode). In large HDFS clusters which have hundreds of clients and datanodes, the Namenode becomes a bottleneck.

3.3 Failure Recovery

If the HDFS Namenode crashes, it can be restarted, but the restart can take up to an hour. During this time, the Namenode remains unavailable to the clients. The Namenode takes a long time to restart because it has to load all metadata into memory from disk before it can accept client requests.


Chapter 4: KTHFS Architecture and Implementation

In this chapter, we will discuss the architecture and implementation details of KTHFS, which is a modified version of HDFS. The implementation of KTHFS was done by modifying the source code of HDFS (available at the Apache Hadoop website[12] under an Apache license[14]). We describe the design goals of KTHFS, its architecture, its implementation details, and some optimizations which were done to improve the throughput of the Namenode.

4.1 Goals

KTHFS aims to achieve the following goals:

1. Increase the capacity of the file system
2. Make the Namenode highly available
3. Scale out the throughput of the Namenode by adding more machines
4. Keep the Namenode API intact, such that applications using HDFS can use KTHFS without any changes

4.2 Architecture

The overall architecture of KTHFS is similar to HDFS. However, instead of a single NameNode, KTHFS has a cluster of stateless NameNodes which store their state in a highly available, distributed, in-memory database. The architecture can be seen in Figure 4.1. There are two kinds of NameNodes in KTHFS, reader NameNodes and writer NameNodes; they will be discussed in detail later in this section. The datanodes are connected to only two NameNodes, and send heartbeats and block reports to both of them. However, only one of the NameNodes is active at a time. If the active NameNode fails, the standby NameNode can take its place instantaneously. This functionality is similar to the AvatarNamenode[6] and the HA Namenode[7], but one key difference is that the metadata updates on the active NameNode are not sent to the standby NameNode; they are instead sent to MySQL Cluster. The clients can send metadata read operations to any NameNode in the system, but the write operations can only be sent to the active writer NameNode. Distributing the writer NameNode for parallel write operations is not the focus of our current implementation and can be taken up as future work.

Figure 4.1: KTHFS Architecture

Execution Overview

A detailed flow of events during a read or write request is described below. To maintain the brevity of this report, the flow of other operations, e.g. list files, delete files and make directory, will not be explained here, as they are similar in concept to reads and writes.

1. Read Request

a) If a client wants to read a file, it sends a read request to any name-node and waits for a reply.

b) The name-node, upon receiving the client request, fetches all metadata of the file from the MySQL Cluster database. The metadata includes all information about the file, i.e. name, permissions, timestamps, block locations, etc. After the metadata has been retrieved, the name-node performs validation checks on the client request, and returns the addresses of the data-nodes which have the file blocks.

c) The client reads the whole file by sending read requests for all of the file blocks to the data-nodes.

2. Write Request

a) If a client wants to write to a file, it sends a write request to any name-node and waits for a reply.

b) The name-node, upon receiving the client request, creates a new file by inserting a new row in the MySQL Cluster table and returns the data-node address on which the client should write the contents of the file.

c) The client, after receiving the data-node address from the name-node, starts writing the file on that data-node.

d) Once the client has written all blocks of the file to the data-node, it sends a complete message to the NameNode, which marks the file as complete in MySQL Cluster.
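The read path above can be summarized in a few lines of code. The sketch below is purely illustrative: MetadataStore stands in for whatever data-access layer the NameNode uses to reach MySQL Cluster, and the class and method names are hypothetical rather than the actual KTHFS code.

import java.util.List;

// Illustrative sketch of how a stateless NameNode could serve an open (getBlockLocations)
// request: every piece of metadata is fetched from NDB on demand, and nothing is
// assumed to be in the NameNode's own memory.
public class StatelessReadPath {

    interface MetadataStore {                        // hypothetical NDB data-access layer
        long resolvePath(String path);               // returns the inode id, or fails if missing
        List<BlockLocation> blocksOf(long inodeId);  // block ids plus datanode addresses
    }

    public static class BlockLocation {
        public long blockId;
        public List<String> datanodes;               // addresses of datanodes holding replicas
    }

    private final MetadataStore store;

    public StatelessReadPath(MetadataStore store) {
        this.store = store;
    }

    // Any NameNode instance can answer this call, because all state lives in NDB.
    public List<BlockLocation> getBlockLocations(String path) {
        long inodeId = store.resolvePath(path);      // path resolution against NDB
        return store.blocksOf(inodeId);              // block -> datanode mapping from NDB
    }
}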

Stateless Namenode

In KTHFS, the metadata of the file system is stored in MySQL Cluster, which is a distributed, highly available, in-memory database. The client writes the metadata by contacting a NameNode, which in turn persists the metadata to MySQL Cluster. Similarly, when a client wants to read metadata, it contacts a NameNode, which fetches the data from MySQL Cluster and returns it to the client. The data in MySQL Cluster is replicated twice by default, which means the data remains available as long as one replica is accessible. MySQL Cluster maintains a log of operations on disk, which enables it to recover from failures. We removed the journaling and check-pointing functionality of the HDFS NameNode since it was redundant.

The NameNode maintains the metadata of the whole file system. This metadata basically comprises two parts: 1) the namespace, i.e. information about the files and directories, and 2) the blocks map, i.e. the block locations of all files. In KTHFS, as explained earlier, the data structures of the namespace and the blocks map are stored in MySQL Cluster. This makes the NameNode stateless and opens up many possibilities to improve the scalability and availability of the HDFS NameNode. Some benefits of the stateless name-node architecture are mentioned below:

- It allows multiple instances of the name-node to run in parallel, which makes the file system highly available. The file system will remain available as long as there is at least one NameNode running and at least one replica in MySQL Cluster is available.
- Starting a stateless name-node takes only a few seconds, as it no longer has to load the metadata into its memory. This opens up the possibility of starting or stopping NameNode instances as the demand changes over time.
- New NameNode instances can be started without any synchronization with the existing NameNodes or datanodes.
- The NameNode can be horizontally scaled, both in terms of storage and throughput, by adding more machines.

However, as can be seen in Figure 4.1, there are two kinds of NameNodes in KTHFS: the writer NameNodes and the reader NameNodes.

Writer Namenode

The writer NameNode is responsible for datanode management and for handling client requests which require access to up-to-date information about the datanodes. Datanode management tasks include heartbeat management, replication management, efficient block placement, rack awareness, block report processing, etc. These tasks require up-to-date information about the state of each datanode in the cluster; this information includes the total storage capacity of a datanode, the remaining storage capacity, the current workload, etc. A list of operations which the writer NameNode supports can be seen in Table 4.1.

Operation      Reader Namenode   Writer Namenode
createfile     -                 yes
allocateblock  -                 yes
close          -                 yes
delete         -                 yes
rename         -                 yes
ls             yes               yes
open           yes               yes
getfilestatus  yes               yes
Table 4.1: Operations supported by Reader and Writer NameNodes in KTHFS

Reader Namenode

The reader NameNode is only responsible for serving the read operations. A list of operations which the reader NameNode supports can be seen in Table 4.1. As seen in Figure 4.1, the reader NameNode is not connected to any datanodes in the system and therefore does not have complete information about the datanodes. Since the block to datanode mapping is stored in NDB, the reader NameNodes do have access to the locations of all blocks stored in the datanodes. This enables the reader NameNodes to serve open requests. The open operation, also known as getblocklocations in HDFS, is the most frequently used operation in most HDFS deployments. A file in HDFS is split into different parts, and the client needs the locations of these parts in order to read the file. When the client wants to open a file for reading, it sends an open request to the NameNode, which responds with the list of datanodes holding the blocks of that file. The client then reads the blocks of the file from those datanodes. Multiple instances of the reader NameNodes can be started on different machines to scale the overall throughput of the system.

Datanodes

The datanode code was not modified, so the functionality of the datanodes in KTHFS remains exactly the same as in HDFS.

Clients

The original HDFS client was built to work with just one NameNode. For KTHFS, this client library was modified to support multiple NameNodes. The client operations were classified into two types, read or write. The read operations are sent to the reader or writer NameNodes, whereas the write operations are sent to the writer NameNodes. These requests are load balanced over the existing NameNodes by using a round robin or random load balancing policy. Furthermore, the client library handles NameNode failures transparently to the client application, and guarantees to provide a response as long as one of the NameNodes is active. For example, if the client API detects a connection failure or timeout with a NameNode, it tries to connect to the next NameNode in the list, and so on. The list of NameNodes is specified in the configuration file of the client, and does not change as new NameNodes are added or removed from the system.
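A minimal sketch of the client-side behaviour described above is shown below, assuming a hypothetical invoke callback for the actual RPC; the real KTHFS client code will differ, but the round-robin selection with fail-over-on-error is the same idea.

import java.io.IOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative round-robin load balancing with failover over a fixed list of
// NameNodes taken from the client configuration.
public class NamenodeSelector {

    public interface Call<T> {
        T invoke(String namenodeAddress) throws IOException;   // hypothetical RPC hook
    }

    private final List<String> namenodes;            // from the client configuration file
    private final AtomicInteger next = new AtomicInteger();

    public NamenodeSelector(List<String> namenodes) {
        this.namenodes = namenodes;
    }

    // Try NameNodes in round-robin order; on connection failure or timeout,
    // move on to the next one. Fails only if every NameNode is unreachable.
    public <T> T execute(Call<T> call) throws IOException {
        IOException lastError = null;
        for (int attempt = 0; attempt < namenodes.size(); attempt++) {
            int index = (next.getAndIncrement() & Integer.MAX_VALUE) % namenodes.size();
            try {
                return call.invoke(namenodes.get(index));
            } catch (IOException e) {
                lastError = e;                        // try the next NameNode in the list
            }
        }
        throw new IOException("No NameNode reachable", lastError);
    }
}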

4.3 Storing Metadata in a Database

In KTHFS, the metadata of the file system is stored in MySQL Cluster (NDB) instead of the Namenode's RAM. To enable this, all NameNode functionality was modified to read and write the metadata to and from NDB. The NameNode is responsible for storing information about inodes, blocks and datanodes, and the corresponding mappings between them. This section describes the changes done to represent, store and access the metadata from NDB.

Figure 4.2: Different types of inodes in HDFS

Modification of Namenode Operations

The NameNode is responsible for maintaining and enforcing HDFS semantics such as permission checking, path validation and resolution, consistency, atomicity, etc. It exposes an API to the client to perform operations such as create, list, make directory, delete, allocate/commit/delete block, append file, etc. Furthermore, a disk-based write-ahead log is appended to for all operations which write or update metadata; this enables the NameNode to recover from failures. Since NDB does its own journalling and checkpointing to recover from failures, the journalling functionality in HDFS was redundant, and hence removed for KTHFS. Furthermore, each NameNode operation was modified to store and fetch data from NDB instead of RAM.

INodes

INode is the name given to the data structure that holds the information of a file system object, for example a file, a directory or a symbolic link. This information usually includes the file name, permissions, access timestamps, access control information, file length, etc., but may vary in different file systems. The inodes are typically stored in a hierarchical manner, such that each inode points to one or several inodes. For example, if a directory has three files, the inode of the directory will have three child inodes, one for each file.

Path Resolution

In most file systems, inodes are stored in a hierarchical manner to represent the hierarchy of directories and files in a file system. As seen in Figure 2.3, the full path /home/wasif/file.txt has four components, which are represented using four inodes. When a user or client tries to access /home/wasif/file.txt, the file system has to ensure that:

- All path components exist (this is sometimes referred to as path resolution)
- The user has permission to access each path component

The NameNode has to resolve the path for each client operation which accesses files or directories, for example touch, mkdir, ls, chmod, mv and cp. The algorithmic complexity of path resolution is O(n), where n is the number of components in the path.

Representation in HDFS

Table 4.2 shows a subset of the properties of an inode in HDFS.

Property          Type
name              byte[]
parent            INodeDirectory
permission        long
accesstime        long
modificationtime  long
children          List<INode>
Table 4.2: Properties of an INode in HDFS

Each file is represented by an INodeFile object, and each directory is represented by an INodeDirectory object; both of these types are specialized types of INode. The files and directories in HDFS are represented by a tree of INode objects. The root node in this tree is an INodeDirectory object that represents the root directory of the file system. Each INodeDirectory object maintains a list of children, which can be files or directories. The NameNode resolves a path by traversing the tree, starting from the root node. When a file is deleted, it is removed from this tree, and when a directory is deleted, the inode of the directory is deleted first, and then its children are deleted incrementally. The incremental deletion is safe because the children of the deleted directory would not be accessible.
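To make the O(n) cost concrete, here is a small sketch of resolving a path against an in-memory inode tree of the kind described above; the classes are simplified stand-ins, not the actual HDFS INode implementation.

import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for hierarchical path resolution: one child lookup per
// path component, i.e. O(n) in the number of components.
public class InMemoryPathResolver {

    public static class Inode {
        final String name;
        final Map<String, Inode> children = new HashMap<>();
        Inode(String name) { this.name = name; }
        Inode addChild(String childName) {
            Inode child = new Inode(childName);
            children.put(childName, child);
            return child;
        }
    }

    // Resolve "/home/wasif/file.txt" by walking the children maps from the root.
    public static Inode resolve(Inode root, String path) {
        Inode current = root;
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;       // skip the leading "/"
            current = current.children.get(component);
            if (current == null) {
                return null;                         // a path component does not exist
            }
            // In a real file system, a permission check would also happen here.
        }
        return current;
    }

    public static void main(String[] args) {
        Inode root = new Inode("/");
        root.addChild("home").addChild("wasif").addChild("file.txt");
        System.out.println(resolve(root, "/home/wasif/file.txt") != null);  // prints true
    }
}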

Representation in KTHFS

In KTHFS, the inodes are stored in a database table called INodeTable in MySQL Cluster. Each row in INodeTable represents an INode, which can be a file, a directory or a symbolic link. Table 4.3 shows a subset of the columns in the INodeTable and their data types. To make the comparison easier, the data types mentioned in the table are the Java equivalents of the MySQL data types.

Column            Type
inodeid           long
name              varchar(155)
parentid          long
isdirectory       boolean
permission        long
accesstime        long
modificationtime  long
Table 4.3: Properties of an INode in KTHFS

The inodeid is the primary key; it is a randomly generated 64-bit identifier created by the NameNode. The name column stores the name of the inode, and the parentid is the inodeid of its parent. Path resolution is done by starting from the root inode and then verifying one by one that all path components exist. However, path resolution is quite inefficient because the NameNode has to send n queries, one after the other, to resolve the full path. Furthermore, these queries are not primary key lookups and result in a range scan in NDB. This causes multiple round trips between the datanodes in NDB and affects the latency of path resolution. To solve this problem, an INode cache was developed, which we will discuss later in this chapter.
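Section 4.5 describes the ClusterJ API used to access NDB; as a preview, a table such as INodeTable can be mapped to an annotated Java interface along the following lines. This is an illustrative sketch: the table and column names follow Table 4.3, but the index name and the mapping details are assumptions, not the thesis's actual code.

import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.Index;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

// Illustrative ClusterJ mapping for the INode table of Table 4.3.
@PersistenceCapable(table = "INodeTable")
public interface InodeEntity {

    @PrimaryKey
    @Column(name = "inodeid")
    long getInodeId();
    void setInodeId(long inodeId);

    @Column(name = "name")
    String getName();
    void setName(String name);

    @Index(name = "parentid_idx")              // assumed index name
    @Column(name = "parentid")
    long getParentId();
    void setParentId(long parentId);

    @Column(name = "isdirectory")
    boolean getIsDirectory();
    void setIsDirectory(boolean directory);

    @Column(name = "permission")
    long getPermission();
    void setPermission(long permission);

    @Column(name = "accesstime")
    long getAccessTime();
    void setAccessTime(long accessTime);

    @Column(name = "modificationtime")
    long getModificationTime();
    void setModificationTime(long modificationTime);
}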

Block Metadata

Each file in HDFS is split into multiple blocks. Each block is described by a metadata object that stores information like the blockid, the size, the inode it belongs to, its state, its locations, etc. The blockid is a 64-bit unique identifier generated by the NameNode when a new block is allocated. If blocks are replicated thrice, then the NameNode stores the locations of three datanodes for each block in the system.

Representation in HDFS

Table 4.4 shows a subset of the block metadata stored in the HDFS NameNode.

Property  Type
blockid   long
numbytes  long
inode     INodeFile
state     enum
location  Datanode[]
Table 4.4: Properties of a Block in HDFS

A block goes through several states in its lifetime. When a block is allocated, its state is under-construction. When the client finishes writing the block, the block state changes to committed, and when the block has at least one replica, it becomes complete. The NameNode also stores a reverse reference to the inode the block belongs to. This is used in block report processing, when the NameNode wants to find out whether a particular block belongs to an inode or not. The location of a block is stored in memory after the datanode informs the NameNode that it has fully received that block.

Representation in KTHFS

There are two tables for storing the metadata of blocks in KTHFS: the BlockInfo table (see Table 4.5) and the Locations table (see Table 4.6). The tables have been normalized so that data is not unnecessarily duplicated and memory space in MySQL Cluster is efficiently utilized.

Column      Type
blockid     long
blockindex  int
inodeid     long
numbytes    long
state       short
Table 4.5: BlockInfo table in KTHFS

As mentioned before, the blockid is a unique identifier for the block and is the primary key in the BlockInfo table. The blockindex is the position of the block in the file, and is used by the NameNode to sort the blocks in the correct order. The inodeid is a foreign key, and is used to get hold of the inode for a block. The inodeid column is also indexed to make it possible for the NameNode to obtain all blocks of a particular inode. The column numbytes stores the size of the block in bytes, and the state column stores the current state of the block. The Locations table has just three columns: blockid, index and storageid. The blockid and index together form the primary key in this table. The storageid is a unique identifier for each datanode in the file system and acts as the foreign key to join this table with the DatanodeInfo table. The DatanodeInfo table stores information about the datanodes and will be discussed in the next section.

Column     Type
blockid    long
index      int
storageid  String
Table 4.6: Block locations table in KTHFS

Datanodes Metadata

The datanodes register themselves with the NameNode at startup, and maintain a persistent connection. They send heartbeats to the NameNode every three seconds, and send a block report to the NameNode every hour. The heartbeats have two purposes: firstly, to let the NameNode know that the datanode is alive, and secondly, to send some statistics to the NameNode, such as the total storage capacity at the datanode, the remaining disk space and the current workload. The NameNode uses these statistics to efficiently perform block placement and replica management. For example, the NameNode would stop returning to clients those datanodes which don't have disk space available. The block report contains information about all the blocks stored at that datanode. The NameNode compares this list with the list of blocks it thinks the datanode should have, and based on the result, asks the datanode to delete or replicate blocks.

Representation in HDFS

In the NameNode, each datanode is represented by a DatanodeID object, which contains information like the IP address, the port, and a unique identifier called the storageid. This identifier is generated by the datanode when it starts for the first time; it has the following format:

DS-randominteger-ip-port-timestamp

The above format ensures that there are no collisions between the identifiers of different datanodes. When the datanodes connect to the Namenode, they send their respective storageids as part of the registration process. The current status and statistics of the datanodes are stored in DatanodeInfo and DatanodeDescriptor objects. This includes the administrative state of a datanode, its location in the network topology, the available disk space, and lists of blocks which need to be either replicated, recovered or deleted. None of the above-mentioned information about the datanodes is persisted to disk, so if a NameNode crashes, all of this information is lost. When the NameNode reboots, it builds this information again after it receives heartbeats, block reports, and other messages from the datanodes.

Representation in KTHFS

Table 4.7 shows the information stored in the DatanodeInfo table in NDB.

Column     Type
storageid  String
hostname   String
port       int
status     int
location   String
Table 4.7: DatanodeInfo table in NDB

The datanode statistics, such as the available disk space and the lists of blocks to be deleted, recovered or replicated, are not stored in NDB in the current implementation; however, they are still stored locally on the NameNode.

4.4 The INode Cache

As mentioned in the section on path resolution, the full path of a file or directory is resolved by the NameNode to ensure that it exists and that the client has permission to access it. This is an expensive process because the NameNode has to iterate through all path components sequentially. For example, /home/wasif/work/file.txt is resolved by searching for wasif in the list of children of /, then searching for work in the list of children of wasif, and so on. In HDFS, inodes are stored in memory and reading from memory is quite fast. In KTHFS, this process is quite expensive because each read operation causes an index scan across all datanodes in NDB. The index scan operation itself is quite inefficient because it involves querying all NDB datanodes, which causes multiple round trips. This affects the latency of the path resolution process, and in turn affects the latency of all filesystem operations.

Index scans in NDB have two issues: first, they are slower than primary key lookups, and second, they generate considerable network traffic and make it difficult to scale out MySQL Cluster. For example, the path /home/wasif/work/file.txt has five path components, and it can be resolved by generating five index scans in NDB. Each path component is represented by an inode (a row in the INode table in NDB). The requests cannot be sent in parallel because the inodeids of the components are not known in advance. The inodes cannot be queried by their local name only, because there can be two or more files having the same local name but residing in different directories. Furthermore, the inodeid is not known in advance, so primary key lookups are not possible. A caching scheme was developed with the goal of fetching the inodes using primary key lookups instead of range scans.

The cache entries are stored in the NameNode's RAM. The pseudo code of the caching scheme can be seen in Algorithm 1. When a path is to be resolved, the NameNode checks if it is present in the cache; if it is not there, the path is resolved by generating an index scan for each path component. The path components are then added to the cache, which is a tree-like data structure with / as the root node. Each node in the cache tree has the following properties: inodeid, name, a reference to its parent, and a list of children. When the same path is accessed again by any client, the cache will already have the inodeids of the path components. The NameNode can then perform primary key lookups for these path components instead of index scans. Figure 4.3 shows the changes in the cache as new paths are accessed.

Figure 4.3: The changes in cache state after different paths are resolved. (a) shows the initial state of the cache, (b) shows the state of the cache after /home/wasif/a.txt is accessed for the first time, (c) shows the state after /home/john/b.txt is accessed

After the inodes are fetched from NDB, they are verified against the local cache. If nothing has changed, it means that the path is still up to date; otherwise the NameNode deletes the outdated path from the cache and falls back to performing the index scans for each path component.

Algorithm 1 Caching algorithm for path resolution
 1: if cache.exists(path) then
 2:     cachedinodes = cache.get(path)
 3:     inodes = getNodesUsingPrimaryKey(cachedinodes)
 4:     if verify(inodes, path) then
 5:         return inodes
 6:     else
 7:         cache.delete(path)
 8:         inodes = getNodesUsingIndexScans(path)
 9:         cache.put(inodes)
10:         return inodes
11:     end if
12: else
13:     inodes = getNodesUsingIndexScans(path)
14:     cache.put(inodes)
15:     return inodes
16: end if
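A compact way to picture the cache is as a tree of (name, inodeid) nodes mirroring the directory hierarchy. The sketch below is illustrative only; it shows the cache data structure and omits the NDB verification step of Algorithm 1, and the root inode id of 0 is an assumption.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the INode cache: a tree of (name -> child) entries
// rooted at "/", where each node remembers the inode id observed in NDB.
public class InodeCache {

    static class Entry {
        final String name;
        final long inodeId;
        final Map<String, Entry> children = new HashMap<>();
        Entry(String name, long inodeId) { this.name = name; this.inodeId = inodeId; }
    }

    private final Entry root = new Entry("/", 0L);   // assumed id of the root inode

    // Returns the cached inode ids of all path components, or null on a cache miss.
    // A hit allows the NameNode to issue primary key lookups instead of index scans.
    public List<Long> get(String path) {
        List<Long> ids = new ArrayList<>();
        Entry current = root;
        ids.add(current.inodeId);
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;
            current = current.children.get(component);
            if (current == null) return null;        // miss: fall back to index scans
            ids.add(current.inodeId);
        }
        return ids;
    }

    // Called after a path has been resolved with index scans, so that the next access hits.
    // resolvedIds[0] is the root inode id, followed by one id per path component.
    public void put(String path, List<Long> resolvedIds) {
        Entry current = root;
        int idIndex = 1;
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;
            Entry child = current.children.get(component);
            if (child == null) {
                child = new Entry(component, resolvedIds.get(idIndex));
                current.children.put(component, child);
            }
            current = child;
            idIndex++;
        }
    }
}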

4.5 MySQL Cluster (NDB)

In this section, we will discuss the features of MySQL Cluster which were used in this implementation. To maintain the brevity of this report, we only discuss the most interesting features.

Data Access

MySQL Cluster offers various APIs to access the data stored in NDB, for example OpenJPA, ClusterJPA, the MySQL Server, the NDB API and ClusterJ. For our implementation, we used the ClusterJ API for the following reasons:

- It can access the data by querying the NDB datanodes directly, i.e. without going through a centralized server
- The queries and lookups are quite fast, since ClusterJ does not have to perform query parsing
- It is a Java API, so integrating it with the NameNode was straightforward

Querying Data

NDB provides three data query mechanisms: primary key lookups, index scans and full table scans. We briefly discuss them below.

Primary key lookup

A primary key lookup in NDB is the most efficient way of accessing a row because the query is sent only to the node holding that row.

Index Scans

Index scans are sent in parallel to all nodes in the cluster, and are therefore much less efficient than primary key lookups.

Full Table Scans

Full table scans are also sent to all nodes in the cluster and have to scan the entire table to find the result set. As a rule of thumb, we don't use full table scans in our implementation.

Distribution Awareness

The rows in MySQL Cluster are partitioned across datanodes using the primary key by default. However, correlated rows in one table or in multiple tables can be stored together in one datanode by using a feature called Distribution Awareness. This significantly improves lookup performance, since all of these rows can be fetched by accessing just one datanode.
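To make the access paths concrete, the sketch below uses the public ClusterJ API (SessionFactory, Session, QueryBuilder) against an entity interface like the one sketched in Section 4.3. The connection string, database name and field names are placeholders, and error handling is omitted; it is a sketch of the API, not the KTHFS data-access layer itself.

import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Query;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.query.QueryBuilder;
import com.mysql.clusterj.query.QueryDomainType;

import java.util.List;
import java.util.Properties;

// Illustrative ClusterJ usage: a primary key lookup (a single NDB node touched)
// versus an index scan (all NDB nodes queried in parallel).
public class ClusterJExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("com.mysql.clusterj.connectstring", "ndb-mgmd:1186");  // placeholder
        props.put("com.mysql.clusterj.database", "kthfs");               // placeholder

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // 1. Primary key lookup: routed to the single data node holding the row.
        //    Assumes the root inode with id 0 exists.
        InodeEntity rootInode = session.find(InodeEntity.class, 0L);

        // 2. Index scan: find all children of a directory via the parentid index;
        //    the scan is sent to all data nodes, hence more expensive.
        QueryBuilder builder = session.getQueryBuilder();
        QueryDomainType<InodeEntity> domain = builder.createQueryDefinition(InodeEntity.class);
        domain.where(domain.get("parentId").equal(domain.param("pid")));
        Query<InodeEntity> query = session.createQuery(domain);
        query.setParameter("pid", rootInode.getInodeId());
        List<InodeEntity> children = query.getResultList();

        System.out.println("children of /: " + children.size());
        session.close();
        // Full table scans (queries with no indexed predicate) are avoided altogether.
    }
}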

Chapter 5: Evaluation

In this chapter, we provide an evaluation of KTHFS and discuss its capacity, scalability and availability as compared to HDFS.

5.1 Capacity

In HDFS, the metadata size of one file having two blocks (which are replicated three times) is approximately 600 bytes. The memory required at the Namenode to store one hundred million such files is therefore approximately 60 gigabytes. The HDFS NameNode runs in a JVM, and heap sizes greater than 60 gigabytes are not considered practical for interactive workloads. Therefore the maximum capacity of HDFS is approximately 100 million files. In KTHFS, since the metadata is stored in NDB, the capacity of the file system is proportional to the capacity of NDB. MySQL Cluster can store one terabyte of data across multiple datanodes, which means that the maximum capacity of KTHFS is around 1000 million files. Table 5.2 shows the memory requirements for different numbers of files.

                         HDFS         KTHFS
Maximum number of files  100 million  1000 million
Table 5.1: The maximum number of files that can be stored in HDFS and KTHFS

Number of files  Memory required
100 million      86 GB
500 million      430 GB
1 billion        860 GB
1.5 billion      1290 GB
2 billion        1721 GB
Table 5.2: Total memory required to store the metadata of the file system in MySQL Cluster
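The HDFS estimate above is straightforward arithmetic; the sketch below reproduces it for a few file counts, using the roughly 600 bytes per two-block file figure quoted above (the per-file cost in MySQL Cluster, as Table 5.2 shows, is higher).

// Back-of-the-envelope estimate of NameNode heap usage: files x bytes-per-file,
// using the ~600 bytes per two-block file figure quoted above for HDFS.
public class CapacityEstimate {
    public static void main(String[] args) {
        long bytesPerFile = 600L;                                  // approximate metadata per file
        long[] fileCounts = {100_000_000L, 500_000_000L, 1_000_000_000L};
        for (long files : fileCounts) {
            double gigabytes = files * (double) bytesPerFile / 1e9;
            System.out.printf("%,d files -> ~%.0f GB of NameNode heap%n", files, gigabytes);
        }
    }
}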

5.2 Throughput

To measure the throughput of HDFS and KTHFS, we used a tool called the Synthetic Load Generator, which is part of the HDFS source code. The experiments were carried out on machines with the following specifications:

- Two six-core AMD Opteron processors with a 2.5 GHz clock speed
- Ubuntu
- Sun JRE
- Gigabytes of RAM
- 1 Gbit Ethernet

The MySQL Cluster used for the experiments had six nodes, each with 4 GB of allocated memory. We measured the throughput of KTHFS by adding different numbers of NameNodes to the system. Figure 5.1 shows the throughput of open operations in KTHFS for different numbers of NameNodes. The tests were executed twice, once with the INode cache enabled and once with it disabled. As can be seen, the collective throughput of KTHFS increases almost linearly as more namenodes are added. Also, the INode cache improves the overall throughput significantly because it enables the NameNode to perform primary key lookups instead of index scans in NDB. Figure 5.2 shows a comparison between the throughput of open operations in HDFS and KTHFS. The throughput of one NameNode in KTHFS is less than the throughput of the HDFS NameNode. However, as more NameNodes are added in KTHFS, the overall throughput increases.

Figure 5.1: Throughput of open operations in KTHFS. The blue bars show the throughput with the INode cache enabled, and the red bars show the throughput with the INode cache disabled
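For reference, the essence of such a throughput test can be reproduced with the standard FileSystem API alone, as in the sketch below. This is not the Synthetic Load Generator itself; the cluster address, path, thread count and duration are arbitrary placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal open-operation throughput probe: several threads repeatedly ask the
// NameNode for the block locations of an existing file, and the completed
// metadata operations are counted.
public class OpenThroughputProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address
        final FileSystem fs = FileSystem.get(conf);
        final FileStatus status = fs.getFileStatus(new Path("/home/wasif/a.txt")); // must exist

        final AtomicLong completed = new AtomicLong();
        final int threads = 16;                              // arbitrary choices
        final long seconds = 60;
        final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(seconds);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    while (System.nanoTime() < deadline) {
                        fs.getFileBlockLocations(status, 0, status.getLen()); // metadata-only call
                        completed.incrementAndGet();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(seconds + 10, TimeUnit.SECONDS);
        System.out.printf("throughput: %.0f open operations/s%n", completed.get() / (double) seconds);
        fs.close();
    }
}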


More information

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System GFS: Google File System Google C/C++ HDFS: Hadoop Distributed File System Yahoo Java, Open Source Sector: Distributed Storage System University of Illinois at Chicago C++, Open Source 2 System that permanently

More information

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File

More information

Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available Implementation of HDFS

Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available Implementation of HDFS KTH Royal Institute of Technology Master Thesis Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available Implementation of HDFS Authors: Hooman Peiro Sajjad Mahmoud Hakimzadeh

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction...3 2 Assumptions and Goals...3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets...3 2.4 Simple Coherency Model... 4 2.5

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

GFS-python: A Simplified GFS Implementation in Python

GFS-python: A Simplified GFS Implementation in Python GFS-python: A Simplified GFS Implementation in Python Andy Strohman ABSTRACT GFS-python is distributed network filesystem written entirely in python. There are no dependencies other than Python s standard

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Nov 01 09:53:32 2012 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2012 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

CSE 124: Networked Services Fall 2009 Lecture-19

CSE 124: Networked Services Fall 2009 Lecture-19 CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani The Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani CS5204 Operating Systems 1 Introduction GFS is a scalable distributed file system for large data intensive

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Distributed Systems. GFS / HDFS / Spanner

Distributed Systems. GFS / HDFS / Spanner 15-440 Distributed Systems GFS / HDFS / Spanner Agenda Google File System (GFS) Hadoop Distributed File System (HDFS) Distributed File Systems Replication Spanner Distributed Database System Paxos Replication

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 6-A-1 CS535 BIG DATA FAQs PA1: Use only one word query Deadends {{Dead end}} Hub value will be?? PART 1. BATCH COMPUTING MODEL FOR BIG DATA ANALYTICS 4. GOOGLE FILE SYSTEM

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

Distributed System. Gang Wu. Spring,2018

Distributed System. Gang Wu. Spring,2018 Distributed System Gang Wu Spring,2018 Lecture7:DFS What is DFS? A method of storing and accessing files base in a client/server architecture. A distributed file system is a client/server-based application

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung ACM SIGOPS 2003 {Google Research} Vaibhav Bajpai NDS Seminar 2011 Looking Back time Classics Sun NFS (1985) CMU Andrew FS (1988) Fault

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Software Infrastructure in Data Centers: Distributed File Systems 1 Permanently stores data Filesystems

More information

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Relatively recent; still applicable today GFS: Google s storage platform for the generation and processing of data used by services

More information

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler MSST 10 Hadoop in Perspective Hadoop scales computation capacity, storage capacity, and I/O bandwidth by

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Distributed File Systems 15 319, spring 2010 12 th Lecture, Feb 18 th Majd F. Sakr Lecture Motivation Quick Refresher on Files and File Systems Understand the importance

More information

Bigtable. Presenter: Yijun Hou, Yixiao Peng

Bigtable. Presenter: Yijun Hou, Yixiao Peng Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

Yuval Carmel Tel-Aviv University Advanced Topics in Storage Systems - Spring 2013 Yuval Carmel Tel-Aviv University "Advanced Topics in About & Keywords Motivation & Purpose Assumptions Architecture overview & Comparison Measurements How does it fit in? The Future 2 About & Keywords

More information

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores CSE 444: Database Internals Lectures 26 NoSQL: Extensible Record Stores CSE 444 - Spring 2014 1 References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No. 4)

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed

More information

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication

More information

Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System

Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System Qi Qi Instituto Superior Técnico - IST (Portugal) Royal Institute of Technology - KTH (Sweden) Abstract.

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

2/27/2019 Week 6-B Sangmi Lee Pallickara

2/27/2019 Week 6-B Sangmi Lee Pallickara 2/27/2019 - Spring 2019 Week 6-B-1 CS535 BIG DATA FAQs Participation scores will be collected separately Sign-up page is up PART A. BIG DATA TECHNOLOGY 5. SCALABLE DISTRIBUTED FILE SYSTEMS: GOOGLE FILE

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Google Disk Farm. Early days

Google Disk Farm. Early days Google Disk Farm Early days today CS 5204 Fall, 2007 2 Design Design factors Failures are common (built from inexpensive commodity components) Files large (multi-gb) mutation principally via appending

More information

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05 Engineering Goals Scalability Availability Transactional behavior Security EAI... Scalability How much performance can you get by adding hardware ($)? Performance perfect acceptable unacceptable Processors

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

11/5/2018 Week 12-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University

11/5/2018 Week 12-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University 11/5/2018 CS435 Introduction to Big Data - FALL 2018 W12.A.0.0 CS435 Introduction to Big Data 11/5/2018 CS435 Introduction to Big Data - FALL 2018 W12.A.1 Consider a Graduate Degree in Computer Science

More information

Staggeringly Large Filesystems

Staggeringly Large Filesystems Staggeringly Large Filesystems Evan Danaher CS 6410 - October 27, 2009 Outline 1 Large Filesystems 2 GFS 3 Pond Outline 1 Large Filesystems 2 GFS 3 Pond Internet Scale Web 2.0 GFS Thousands of machines

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

BigTable. CSE-291 (Cloud Computing) Fall 2016

BigTable. CSE-291 (Cloud Computing) Fall 2016 BigTable CSE-291 (Cloud Computing) Fall 2016 Data Model Sparse, distributed persistent, multi-dimensional sorted map Indexed by a row key, column key, and timestamp Values are uninterpreted arrays of bytes

More information

Google is Really Different.

Google is Really Different. COMP 790-088 -- Distributed File Systems Google File System 7 Google is Really Different. Huge Datacenters in 5+ Worldwide Locations Datacenters house multiple server clusters Coming soon to Lenior, NC

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Today l Basic distributed file systems l Two classical examples Next time l Naming things xkdc Distributed File Systems " A DFS supports network-wide sharing of files and devices

More information

NPTEL Course Jan K. Gopinath Indian Institute of Science

NPTEL Course Jan K. Gopinath Indian Institute of Science Storage Systems NPTEL Course Jan 2012 (Lecture 41) K. Gopinath Indian Institute of Science Lease Mgmt designed to minimize mgmt overhead at master a lease initially times out at 60 secs. primary can request

More information

Google File System, Replication. Amin Vahdat CSE 123b May 23, 2006

Google File System, Replication. Amin Vahdat CSE 123b May 23, 2006 Google File System, Replication Amin Vahdat CSE 123b May 23, 2006 Annoucements Third assignment available today Due date June 9, 5 pm Final exam, June 14, 11:30-2:30 Google File System (thanks to Mahesh

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

NPTEL Course Jan K. Gopinath Indian Institute of Science

NPTEL Course Jan K. Gopinath Indian Institute of Science Storage Systems NPTEL Course Jan 2012 (Lecture 40) K. Gopinath Indian Institute of Science Google File System Non-Posix scalable distr file system for large distr dataintensive applications performance,

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Flat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897

Flat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Flat Datacenter Storage Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Motivation Imagine a world with flat data storage Simple, Centralized, and easy to program Unfortunately, datacenter networks

More information

goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) handle appends efficiently (no random writes & sequential reads)

goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) handle appends efficiently (no random writes & sequential reads) Google File System goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) focus on multi-gb files handle appends efficiently (no random writes & sequential reads) co-design GFS

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles INF3190:Distributed Systems - Examples Thomas Plagemann & Roman Vitenberg Outline Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles Today: Examples Googel File System (Thomas)

More information

Distributed File Systems. Directory Hierarchy. Transfer Model

Distributed File Systems. Directory Hierarchy. Transfer Model Distributed File Systems Ken Birman Goal: view a distributed system as a file system Storage is distributed Web tries to make world a collection of hyperlinked documents Issues not common to usual file

More information