A Distributed Namespace for a Distributed File System


Wasif Riaz Malik
wasif@kth.se

Master of Science Thesis
Examiner: Dr. Jim Dowling, KTH/SICS
Berlin, Aug 7, 2012
TRITA-ICT-EX-2012:173


Abstract

Due to the rapid growth of data in recent years, distributed file systems have gained widespread adoption. The new breed of distributed file systems reliably store petabytes of data on commodity hardware, and also provide rich abstractions for massively parallel data analytics. The Hadoop Distributed File System (HDFS) is one such system; it provides the storage layer for MapReduce, Hive, HBase and Mahout. The metadata server in HDFS, called the NameNode, is a centralized server which stores information about the whole namespace. The centralized architecture not only makes the NameNode a bottleneck and a single point of failure, but also restricts the overall capacity of the file system. To solve the availability and scalability issues of HDFS, a new architecture is required. In this report, we propose a distributed implementation of the HDFS NameNode, where the file system metadata is stored in a distributed, in-memory, replicated database called MySQL Cluster. The NameNodes are stateless, and the throughput of the system can be increased either by adding NameNodes or by adding more data nodes in NDB. HDFS clients can access the metadata by connecting to any one of the NameNodes. The evaluation section shows that the new architecture, as compared to HDFS, can handle more requests per second, store ten times more files, and recover from failures within a few seconds.


Dedicated to my wife


Acknowledgements

I would like to thank Dr. Jim Dowling for his valuable guidance and feedback throughout the thesis, my EMDC classmates for providing feedback, and my colleagues at the Swedish Institute of Computer Science (SICS) for exchanging ideas related to this thesis. Finally, I would like to thank my wife and my family for supporting me throughout my studies.

Berlin, August 8, 2012


Contents

1 Introduction
2 Related Work / Background
  2.1 Google File System
    Architecture
      Master
      Chunkservers
      Clients
  2.2 TidyFS
    Architecture
      Metadata Server
      Storage Computers and the Node Service
      Clients
  2.3 HDFS
    Architecture
      Namenode
      Datanodes
      Clients
3 Limitations of HDFS
  3.1 Namespace Scalability
  3.2 Throughput
  3.3 Failure Recovery
4 KTHFS Architecture and Implementation
  4.1 Goals
  4.2 Architecture
    Execution Overview
    Stateless Namenode
    Writer Namenode
    Reader Namenode
    Datanodes
    Clients
  4.3 Storing Metadata in a Database
    Modification of Namenode Operations
    INodes
      Path Resolution
      Representation in HDFS
      Representation in KTHFS
    Block Metadata
      Representation in HDFS
      Representation in KTHFS
    Datanodes Metadata
      Representation in HDFS
      Representation in KTHFS
  4.4 The INode Cache
  4.5 MySQL Cluster (NDB)
    Data Access
    Querying Data
    Distribution Awareness
5 Evaluation
  5.1 Capacity
  5.2 Throughput
  5.3 Availability
    Availability of the Namenodes
    Recovering from Failures
    Garbage Collection Pauses
    Availability of MySQL Cluster
6 Conclusions and Future Work
  6.1 Conclusion
  6.2 Future Work
    Scaling-out the Writer Namenode
    Fine Grained Locking
    Minimizing Database Round Trips
    Block Report Processing

Chapter 1: Introduction

Due to the immense growth of data in recent years, it has become a challenge to store, analyze and extract meaningful information out of it. The amount of data generated and stored by internet companies, enterprises and governments is on the rise, and according to forecasts, it will keep rising exponentially in the coming years. Traditional storage systems like relational databases and file systems are not scalable enough to manage the ever increasing datasets. This has given rise to a new kind of system, which can not only reliably store massive amounts of data over a cluster of machines, but also provide abstractions for parallel data analytics. These systems have the design goals of being highly scalable and fault tolerant, while sacrificing some of the semantics which traditional systems provided.

Recent distributed file systems like the Hadoop Distributed File System[1] store data on a cluster of machines and provide abstractions for execution frameworks like MapReduce[2] to analyse data in a massively parallel manner. Hadoop is widely used for large scale data intensive applications, such as bioinformatics, social networks and search engines.

HDFS consists of a NameNode, a set of datanodes, and a set of clients. The NameNode stores the metadata of the file system, such as the namespace, and also acts as a centralized master. It also exposes an API for clients to read and update this metadata. The datanodes store the actual file data on disk and periodically send heartbeats and block reports to the NameNode. The clients access a file by asking the NameNode for the datanodes holding that file, and then fetching the blocks directly from those datanodes.

The NameNode runs on a single machine and stores the information about files and blocks in its memory. This keeps the architecture of HDFS fairly simple but also gives rise to three issues:

1. It is a single point of failure, which means that if the NameNode goes down, the whole HDFS cluster becomes inaccessible. This discourages the use of HDFS for real-time and interactive workloads.

2. The number of files that can be stored in an HDFS cluster is limited by the amount of RAM available to the NameNode. This puts a hard limit on the size of the file system because the RAM can only be increased to a certain limit. This is also known as the file-count problem. Furthermore, since the NameNode runs on a JVM, a very large heap size is not recommended because frequent garbage collection pauses make the NameNode unresponsive for seconds or even minutes.

3. Restarting the NameNode after a failure might take up to an hour (depending on the number of files), although there have been attempts[6][7] to solve this problem.

There are workarounds to avert the file-count problem, but they have drawbacks. For example, the Hadoop Archives and SequenceFile functionalities allow packing of multiple files into one file, hence consuming the metadata size of just one file. However, this passes additional complexity to applications for accessing files, which might not be possible in some scenarios. Another workaround is to increase the block size of files in HDFS, which results in fewer blocks per file and hence less metadata consumption at the NameNode. However, this workaround is not effective for small files (having one block). A recent study by Shvachko[8] has shown that the centralized architecture of the NameNode limits the overall capacity and throughput of HDFS, and a new architecture is required to solve these issues. The AvatarNamenode[6] implementation by Facebook and the recent Highly Available Namenode[7] only solve the availability issue, hence the NameNode still remains a limiting factor for throughput and namespace scalability.

In this thesis report, we present a new distributed architecture for the HDFS NameNode, with the goal of making it horizontally scalable and highly available. In the remainder of this report, we will refer to the new architecture as KTHFS. The single NameNode is replaced by a set of shared-nothing, stateless NameNodes which store all HDFS metadata in a distributed in-memory database called MySQL Cluster (NDB). The stateless NameNodes are lightweight processes, and fetch the metadata from NDB only when it is required by a client. Since the NameNodes are stateless, they can be stopped, migrated and restarted in seconds without any downtime for clients. This also makes it possible to scale out the NameNode by launching multiple instances of it on different machines. A modified client API makes sure that the client requests are load balanced over the set of NameNodes in the system. The clients access the cluster of NameNodes by using a load balancing policy; different policies like round robin or random can be plugged in to the client library. The capacity of KTHFS is not limited by the amount of RAM on a single NameNode, but rather depends on the storage capacity of NDB, which is known to have a maximum capacity of approximately two terabytes. We will discuss the memory requirements of KTHFS in the evaluation section.

There are significant challenges related to the throughput of a NameNode in KTHFS. Since the metadata is fetched from a node (or possibly multiple nodes) in NDB over the network, the latency of the operations increases, which in turn affects the throughput of a single NameNode. It must be noted, though, that HDFS is not designed for low latency operations, and Hadoop applications do not expect sub-second response times for metadata operations. The latency of operations could be improved by sacrificing some HDFS semantics, but that would violate the goal of maintaining full HDFS semantics in KTHFS. We have also designed and implemented a cache at the NameNodes to improve the latency of metadata operations. This cache reduces the round trips to the database for metadata operations but still returns the latest copy.

In the next chapter, we will discuss the characteristics of some distributed file systems. In Chapter 3, we will list the limitations of HDFS, and in Chapter 4, we will describe the goals and then elaborate on the design and implementation of the project. In Chapter 5, we will discuss in detail the storage capacity, throughput and availability of the NameNodes in KTHFS. Finally, in Chapter 6, we provide concluding remarks and then discuss some possible improvements to this work.


Chapter 2: Related Work / Background

2.1 Google File System

The Google File System (GFS)[9] is a distributed file system for large scale data-intensive applications. It was designed by Google to meet its growing data processing requirements. The design goals of GFS are scalability, reliability and fault tolerance. GFS considers failures to be the norm rather than the exception. The files in GFS are expected to be huge, typically greater than 15 MB, and multi-GB files are common in GFS deployments at Google. GFS does not allow mutation of a file at random locations; the only mutation allowed to a file is append.

GFS provides a familiar file system interface for user applications but does not implement any standards such as POSIX. The files and folders are stored in a hierarchical fashion, just like traditional file systems, and can be accessed by providing the full path. The operations supported by GFS are create, delete, open, close, read and write. GFS also supports a feature called multi-writer append, which allows multiple writers to concurrently append data to a file.

Architecture

A GFS cluster consists of a master node, chunkservers and clients. The clients and chunkservers can be run on the same machines. A file consists of multiple chunks; each chunk has a fixed size, configured to be 64 MB by default. The chunks are stored on chunkservers. Each chunk has a globally unique identifier generated by the master. The chunks are replicated thrice for reliability; however, it is possible to set a different replication factor for each file.

Master

The master stores all file system metadata in its memory. The metadata includes the namespace, the mapping from files to chunks, and the chunk locations. The master exposes a file system interface for the clients, so that the clients can perform operations on the file system. It also handles namespace management and locking, check-pointing the metadata, replica placement and replica re-balancing. Storing everything in a single master simplifies the overall design of GFS, and enables the master to perform sophisticated chunk placement and replication decisions using global knowledge. However, the disadvantage of having a single master is that it becomes a single point of failure, and a bottleneck under high client workloads.

Chunkservers

The chunkservers periodically send heartbeats to the master. They store the chunks on local disk as files, and expose an interface to clients for reading or writing chunks.

Clients

The clients do not read or write through the master, but use the master to find out the locations of the chunkservers. If a client wants to read a file, it asks the master about the chunkservers which have the chunks of that file. The master replies with the locations of all replicas of the chunks, and the client fetches the closest chunk replica. Similarly, if a client wants to write a new chunk, it asks the master for a chunkserver. The master replies with the closest chunkserver, and the client writes to that chunkserver directly.

2.2 TidyFS

TidyFS[10] is a simple and small distributed file system designed by Microsoft for parallel computations on clusters. TidyFS differs from GFS and HDFS by being simpler: it avoids complex replication protocols and concurrent writers, and uses lazy replication for file data instead of eager replication. The design goal of TidyFS is to be simple enough to efficiently handle the kinds of workloads that Microsoft has. A typical workload for TidyFS is write-once, read-many, high throughput data access. The fault tolerance model of TidyFS is slightly different from that of GFS, as it expects the higher level execution model to handle storage failures. For example, TidyFS uses lazy replication for file data, and delegates the task of handling storage failures to applications running on top of TidyFS. Another difference is that it allows clients to read and write data natively, such that they can choose to access the data using different patterns (for example sequential or random). TidyFS does not implement the POSIX interface and provides a very simple API for clients. The most common file operations like create, delete, list, copy and concatenate are supported. However, files are called streams, and blocks are referred to as partitions in TidyFS.

Figure 2.1: TidyFS architecture

There is no explicit directory tree maintained in TidyFS, so there are no operations for creating and deleting directories. However, the files can still be stored in a hierarchical manner by using arcs in the stream name. When a stream is created, the missing directory entries are implicitly created, and when the stream is deleted, all parent directories are recursively removed unless they contain other streams.

Architecture

TidyFS consists of a metadata server, a cluster of storage computers, and a node service running on each storage computer. The metadata server is responsible for storing the metadata of the whole file system, managing the storage computers, replication of partitions, load balancing, etc. Partitions are mostly similar to GFS chunks, but one key difference is that they can be of variable size. The partitions, once written and committed, are immutable. They are lazily replicated by default, but the clients can eagerly replicate them if required.

Metadata Server

The centralized metadata server stores all metadata of TidyFS, i.e. information about streams, parts and storage computers, and the mappings from streams to parts and from parts to storage computers. The metadata and the operations on it are replicated to other machines using the Autopilot Replicated State Library[11] to be able to recover from failures. The metadata server also tracks the state of storage computers so that it can optimally replicate the data across them. It is also possible to define distinguished attributes on a per-stream or per-part basis; these attributes are generally referred to as extended file attributes in the file system world. In short, the metadata server is the master of the system: it responds to client queries, manages the storage computers, and optimally replicates the data across the cluster.

Storage Computers and the Node Service

The storage computers store the actual data, i.e. the parts. Each TidyFS cluster can have hundreds of storage computers. The node service is a Windows service that runs periodically on each storage computer and carries out routine tasks like reporting to the metadata server the amount of disk space available, the state of the storage computer, etc. In TidyFS, unlike GFS, the garbage collection of unused, over-replicated blocks is done on the storage computers. However, the list of candidates for deletion is sent to the metadata server for verification, and only the verified parts are deleted from the storage computers. This two-phase process ensures that non-committed parts are not deleted from the storage computers.

Clients

The clients communicate with the metadata server using a client library. The client library has built-in failover support to tolerate metadata server failures. To write data, the client asks for a write path from the metadata server. Typically the write path points to the client's local machine. The client writes data to the write path and also keeps renewing the lease for this stream on the metadata server. Once it has finished writing, it closes the local file and adds the written part to the stream. At this moment, the data becomes available for other clients. If the client dies during the write, the lease expires and the stream is removed from the metadata server.

2.3 HDFS

The Hadoop Distributed File System (HDFS)[1] is an open source implementation of GFS[9], built by Yahoo! to act as the storage layer for Hadoop[12] applications. It was designed to reliably store petabytes of data on commodity hardware, and to enable distributed computation frameworks like MapReduce[2] to analyse data in a massively parallel fashion.

Figure 2.2: The Architecture of Hadoop Distributed File System

In addition to MapReduce, HDFS is used as the storage layer for Hive, HBase and Mahout, and it is also used as a stand-alone file system. HDFS was primarily designed for use by MapReduce and other Hadoop components. The HDFS API is similar to the UNIX file system API, but several UNIX semantics have been relaxed in favor of simplicity, high performance and the requirements of the applications at hand. In other words, HDFS is not a POSIX[13] compliant file system. Some of the POSIX functions not supported by HDFS are concurrent writers, random writes, and locking of file sections.

Architecture

HDFS consists of three components: the Namenode, the datanodes and the clients. The Namenode stores the metadata of the file system and also acts as a centralized master node. The file data is stored in the datanodes. Each file in HDFS is split into equally sized blocks (64 MB by default). Each block is replicated thrice by default on different datanodes. The HDFS client library is responsible for fetching the metadata of a file from the Namenode and then accessing the data from the datanodes.

Namenode

The Namenode is a centralized server which stores the metadata of the entire file system in its RAM. This metadata includes the inodes, the blocks, the mapping from inodes to blocks, and the mapping from blocks to datanode locations. The files and directories in HDFS are represented by a hierarchy of inodes (as shown in Figure 2.3). Each inode stores information such as the name of the file, its permissions, timestamps, quotas, etc. The Namenode provides an API for the clients to query this metadata. The maximum number of metadata objects that can be stored in the Namenode is bounded by the size of its RAM. The Namenode persists the metadata on disk and also maintains a journal of operations so that it can recover from failures. Other important tasks of the Namenode include replication management, block placement and block balancing. In short, the NameNode has two important tasks: serving the client metadata requests, and efficiently managing the cluster of datanodes.

Figure 2.3: The hierarchy of inodes. The file a.txt is represented by the path /home/wasif/a.txt

Datanodes

The datanodes store the actual data of files in HDFS. An HDFS cluster can have hundreds or thousands of datanodes depending on the data requirements. When a datanode process starts, it registers itself with the NameNode to become a part of the filesystem cluster. It periodically sends heartbeats (by default every three seconds) to let the Namenode know that it is alive. These heartbeats also contain information such as the available disk space and the current load on the datanode, to help the NameNode make decisions about replication. As a response to the heartbeats, the Namenode sends commands to the datanodes. The datanodes also periodically send a report of all of the blocks they have to the NameNode (every hour by default). The NameNode compares this report to the metadata stored in its RAM, and verifies its integrity. If a block reported by the datanode is not present in the NameNode's metadata, the NameNode asks the datanode to delete it.
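The block-report check described above amounts to a set comparison between the blocks a datanode reports and the blocks the NameNode expects it to hold. The sketch below illustrates that comparison only; the class and method names are hypothetical and do not correspond to the actual HDFS implementation.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of block-report reconciliation: blocks reported by a
// datanode but unknown to the NameNode are scheduled for deletion, and
// expected blocks that were not reported are treated as missing replicas.
public class BlockReportReconciler {

    public static class Result {
        public final Set<Long> toDelete = new HashSet<>();  // reported but not expected
        public final Set<Long> missing = new HashSet<>();   // expected but not reported
    }

    public Result reconcile(List<Long> reportedBlockIds, Set<Long> expectedBlockIds) {
        Result result = new Result();
        Set<Long> reported = new HashSet<>(reportedBlockIds);

        for (Long blockId : reported) {
            if (!expectedBlockIds.contains(blockId)) {
                result.toDelete.add(blockId);   // NameNode asks the datanode to delete it
            }
        }
        for (Long blockId : expectedBlockIds) {
            if (!reported.contains(blockId)) {
                result.missing.add(blockId);    // candidate for re-replication elsewhere
            }
        }
        return result;
    }
}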

Clients

The clients access HDFS using a client library. The library provides functions to create, open, list and delete files and directories. To open a file for reading, the client obtains from the Namenode a list of datanodes holding the blocks of that file. It then pulls the block data from the datanodes directly. Similarly, if a client wants to write a file, it asks the NameNode to create a metadata entry for the file and provide a list of datanodes where it can write the first block of the file. For each subsequent block, the client performs the same operations again, i.e. it asks the NameNode for a list of datanodes and writes the content to those datanodes. It must be noted that the clients can only write sequentially to a file; random writes are not allowed.
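As a concrete illustration of the client library described above, the sketch below uses the standard Hadoop FileSystem API to write and then read a file; the cluster address and path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client usage: the library talks to the NameNode for metadata
// operations (create, open, delete) and streams block data to/from datanodes.
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/home/wasif/a.txt");

        // Create a file: the NameNode allocates blocks, the data goes to datanodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello, distributed namespace");
        }

        // Open the file: the NameNode returns block locations, and the data is
        // read directly from the datanodes holding the replicas.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}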


Chapter 3: Limitations of HDFS

In this chapter, we briefly reiterate the major limitations of the Hadoop Distributed File System (HDFS).

3.1 Namespace Scalability

In HDFS, the namespace (i.e. the metadata of files and directories) is stored in the RAM of one machine (the Namenode). In large HDFS clusters, the namespace can become so large that it no longer fits in the RAM of one machine.

3.2 Throughput

The throughput of the metadata operations is also bounded by the performance of one machine (the Namenode). In large HDFS clusters which have hundreds of clients and datanodes, the Namenode becomes a bottleneck.

3.3 Failure Recovery

If the HDFS Namenode crashes, it can be restarted, but the restart can take up to an hour. During this time, the Namenode remains unavailable to the clients. The Namenode takes a long time to restart because it has to load all metadata into memory from disk before it can accept client requests.


Chapter 4: KTHFS Architecture and Implementation

In this chapter, we will discuss the architecture and implementation details of KTHFS, which is a modified version of HDFS. The implementation of KTHFS was done by modifying the source code of HDFS (available at the Apache Hadoop website[12] under an Apache license[14]). We describe the design goals of KTHFS, its architecture, its implementation details, and some optimizations which were done to improve the throughput of the Namenode.

4.1 Goals

KTHFS aims to achieve the following goals:

1. Increase the capacity of the file system
2. Make the Namenode highly available
3. Scale out the throughput of the Namenode by adding more machines
4. Keep the Namenode API intact, such that applications using HDFS can use KTHFS without any changes

4.2 Architecture

The overall architecture of KTHFS is similar to HDFS. However, instead of a single NameNode, KTHFS has a cluster of stateless NameNodes which store their state in a highly available, distributed, in-memory database. The architecture can be seen in Figure 4.1. There are two kinds of NameNodes in KTHFS, reader NameNodes and writer NameNodes; they will be discussed in detail later in this section. The datanodes are connected to only two NameNodes, and send heartbeats and block reports to both of them. However, only one of the NameNodes is active at a time. If the active NameNode fails, the standby NameNode can take its place instantaneously. This functionality is similar to the AvatarNamenode[6] and the HA Namenode[7], but one key difference is that the metadata updates on the active NameNode are not sent to the standby NameNode; they are instead sent to MySQL Cluster. The clients can send metadata read operations to any NameNode in the system, but the write operations can only be sent to the active writer NameNode. Distributing the writer NameNode for parallel write operations is not the focus of our current implementation and can be taken up as future work.

Figure 4.1: KTHFS Architecture

Execution Overview

A detailed flow of events during a read or write request is described below. To maintain the brevity of this report, the flow of other operations, e.g. list files, delete files and make directory, will not be explained here, as they are similar in concept to reads and writes.

1. Read Request

a) If a client wants to read a file, it sends a read request to any name-node and waits for a reply.

b) The name-node, upon receiving the client request, fetches all metadata of the file from the MySQL Cluster database. The metadata includes all information about the file, i.e. name, permissions, timestamps, block locations, etc. After the metadata has been retrieved, the name-node performs validation checks on the client request, and returns the addresses of the data-nodes which have the file blocks.

c) The client reads the whole file by sending read requests for all of the file blocks to the data-nodes.

2. Write Request

a) If a client wants to write to a file, it sends a write request to any name-node and waits for a reply.

b) The name-node, upon receiving the client request, creates a new file by inserting a new row in the MySQL Cluster table and returns the data-node address on which the client should write the contents of the file.

c) The client, after receiving the data-node address from the name-node, starts writing the file on that data-node.

d) Once the client has written all blocks of the file to the data-node, it sends a complete message to the NameNode, which marks the file as complete in MySQL Cluster.
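The read path above can be summarized in a few lines of code. The sketch below is purely illustrative: MetadataStore stands in for whatever data-access layer the NameNode uses to reach MySQL Cluster, and the class and method names are hypothetical rather than the actual KTHFS code.

import java.util.List;

// Illustrative sketch of how a stateless NameNode could serve an open (getBlockLocations)
// request: every piece of metadata is fetched from NDB on demand, and nothing is
// assumed to be in the NameNode's own memory.
public class StatelessReadPath {

    interface MetadataStore {                        // hypothetical NDB data-access layer
        long resolvePath(String path);               // returns the inode id, or fails if missing
        List<BlockLocation> blocksOf(long inodeId);  // block ids plus datanode addresses
    }

    public static class BlockLocation {
        public long blockId;
        public List<String> datanodes;               // addresses of datanodes holding replicas
    }

    private final MetadataStore store;

    public StatelessReadPath(MetadataStore store) {
        this.store = store;
    }

    // Any NameNode instance can answer this call, because all state lives in NDB.
    public List<BlockLocation> getBlockLocations(String path) {
        long inodeId = store.resolvePath(path);      // path resolution against NDB
        return store.blocksOf(inodeId);              // block -> datanode mapping from NDB
    }
}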

Stateless Namenode

In KTHFS, the metadata of the file system is stored in MySQL Cluster, which is a distributed, highly available, in-memory database. The client writes the metadata by contacting a NameNode, which in turn persists the metadata to MySQL Cluster. Similarly, when a client wants to read metadata, it contacts a NameNode, which fetches the data from MySQL Cluster and returns it to the client. The data in MySQL Cluster is replicated twice by default, which means the data remains available as long as one replica is accessible. MySQL Cluster maintains a log of operations on disk, which enables it to recover from failures. We removed the journaling and check-pointing functionality of the HDFS NameNode since it was redundant.

The NameNode maintains the metadata of the whole file system. This metadata basically comprises two parts: 1) the namespace, i.e. information about the files and directories, and 2) the blocks map, i.e. the block locations of all files. In KTHFS, as explained earlier, the data structures of the namespace and the blocks map are stored in MySQL Cluster. This makes the NameNode stateless and opens up many possibilities to improve the scalability and availability of the HDFS NameNode. Some benefits of the stateless name-node architecture are mentioned below:

- It allows multiple instances of the name-node to run in parallel, which makes the file system highly available. The file system will remain available as long as there is at least one NameNode running and at least one replica in MySQL Cluster is available.
- Starting a stateless name-node takes only a few seconds, as it no longer has to load the metadata into its memory. This opens up the possibility of starting or stopping NameNode instances as the demand changes over time.
- New NameNode instances can be started without any synchronization with the existing NameNodes or datanodes.
- The NameNode can be horizontally scaled, both in terms of storage and throughput, by adding more machines.

However, as can be seen in Figure 4.1, there are two kinds of NameNodes in KTHFS: the writer NameNodes and the reader NameNodes.

Writer Namenode

The writer NameNode is responsible for datanode management and for handling client requests which require access to up-to-date information about the datanodes. Datanode management tasks include heartbeat management, replication management, efficient block placement, rack awareness, block report processing, etc. These tasks require up-to-date information about the state of each datanode in the cluster; this information includes the total storage capacity of a datanode, the remaining storage capacity, the current workload, etc. A list of operations which the writer NameNode supports can be seen in Table 4.1.

Operation      Reader Namenode   Writer Namenode
createfile     -                 yes
allocateblock  -                 yes
close          -                 yes
delete         -                 yes
rename         -                 yes
ls             yes               yes
open           yes               yes
getfilestatus  yes               yes
Table 4.1: Operations supported by Reader and Writer NameNodes in KTHFS

Reader Namenode

The reader NameNode is only responsible for serving the read operations. A list of operations which the reader NameNode supports can be seen in Table 4.1. As seen in Figure 4.1, the reader NameNode is not connected to any datanodes in the system and therefore does not have complete information about the datanodes. Since the block to datanode mapping is stored in NDB, the reader NameNodes do have access to the locations of all blocks stored in the datanodes. This enables the reader NameNodes to serve open requests. The open operation, also known as getblocklocations in HDFS, is the most frequently used operation in most HDFS deployments. A file in HDFS is split into different parts, and the client needs the locations of these parts in order to read the file. When the client wants to open a file for reading, it sends an open request to the NameNode, which responds with the list of datanodes holding the blocks of that file. The client then reads the blocks of the file from those datanodes. Multiple instances of the reader NameNodes can be started on different machines to scale the overall throughput of the system.

Datanodes

The datanode code was not modified, so the functionality of the datanodes in KTHFS remains exactly the same as in HDFS.

Clients

The original HDFS client was built to work with just one NameNode. For KTHFS, this client library was modified to support multiple NameNodes. The client operations were classified into two types, read or write. The read operations are sent to the reader or writer NameNodes, whereas the write operations are sent to the writer NameNodes. These requests are load balanced over the existing NameNodes by using a round robin or random load balancing policy. Furthermore, the client library handles NameNode failures transparently to the client application, and guarantees to provide a response as long as one of the NameNodes is active. For example, if the client API detects a connection failure or timeout with a NameNode, it tries to connect to the next NameNode in the list, and so on. The list of NameNodes is specified in the configuration file of the client, and does not change as new NameNodes are added or removed from the system.
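A minimal sketch of the client-side behaviour described above is shown below, assuming a hypothetical invoke callback for the actual RPC; the real KTHFS client code will differ, but the round-robin selection with fail-over-on-error is the same idea.

import java.io.IOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative round-robin load balancing with failover over a fixed list of
// NameNodes taken from the client configuration.
public class NamenodeSelector {

    public interface Call<T> {
        T invoke(String namenodeAddress) throws IOException;   // hypothetical RPC hook
    }

    private final List<String> namenodes;            // from the client configuration file
    private final AtomicInteger next = new AtomicInteger();

    public NamenodeSelector(List<String> namenodes) {
        this.namenodes = namenodes;
    }

    // Try NameNodes in round-robin order; on connection failure or timeout,
    // move on to the next one. Fails only if every NameNode is unreachable.
    public <T> T execute(Call<T> call) throws IOException {
        IOException lastError = null;
        for (int attempt = 0; attempt < namenodes.size(); attempt++) {
            int index = (next.getAndIncrement() & Integer.MAX_VALUE) % namenodes.size();
            try {
                return call.invoke(namenodes.get(index));
            } catch (IOException e) {
                lastError = e;                        // try the next NameNode in the list
            }
        }
        throw new IOException("No NameNode reachable", lastError);
    }
}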

4.3 Storing Metadata in a Database

In KTHFS, the metadata of the file system is stored in MySQL Cluster (NDB) instead of the Namenode's RAM. To enable this, all NameNode functionality was modified to read and write the metadata to and from NDB. The NameNode is responsible for storing information about inodes, blocks and datanodes, and the corresponding mappings between them. This section describes the changes done to represent, store and access the metadata from NDB.

Figure 4.2: Different types of inodes in HDFS

Modification of Namenode Operations

The NameNode is responsible for maintaining and enforcing HDFS semantics such as permission checking, path validation and resolution, consistency, atomicity, etc. It exposes an API to the client to perform operations such as create, list, make directory, delete, allocate/commit/delete block, append file, etc. Furthermore, a disk-based write-ahead log is appended to for all operations which write or update metadata; this enables the NameNode to recover from failures. Since NDB does its own journalling and checkpointing to recover from failures, the journalling functionality in HDFS was redundant, and hence removed for KTHFS. Furthermore, each NameNode operation was modified to store and fetch data from NDB instead of RAM.

INodes

INode is the name given to the data structure that holds the information of a file system object, for example a file, a directory or a symbolic link. This information usually includes the file name, permissions, access timestamps, access control information, file length, etc., but may vary in different file systems. The inodes are typically stored in a hierarchical manner, such that each inode points to one or several inodes. For example, if a directory has three files, the inode of the directory will have three child inodes, one for each file.

Path Resolution

In most file systems, inodes are stored in a hierarchical manner to represent the hierarchy of directories and files in a file system. As seen in Figure 2.3, the full path /home/wasif/file.txt has four components, which are represented using four inodes. When a user or client tries to access /home/wasif/file.txt, the file system has to ensure that:

- All path components exist (this is sometimes referred to as path resolution)
- The user has permission to access each path component

The NameNode has to resolve the path for each client operation which accesses files or directories, for example touch, mkdir, ls, chmod, mv and cp. The algorithmic complexity of path resolution is O(n), where n is the number of components in the path.

Representation in HDFS

Table 4.2 shows a subset of the properties of an inode in HDFS.

Property          Type
name              byte[]
parent            INodeDirectory
permission        long
accesstime        long
modificationtime  long
children          List<INode>
Table 4.2: Properties of an INode in HDFS

Each file is represented by an INodeFile object, and each directory is represented by an INodeDirectory object; both of these types are specialized types of INode. The files and directories in HDFS are represented by a tree of INode objects. The root node in this tree is an INodeDirectory object that represents the root directory of the file system. Each INodeDirectory object maintains a list of children, which can be files or directories. The NameNode resolves a path by traversing the tree, starting from the root node. When a file is deleted, it is removed from this tree, and when a directory is deleted, the inode of the directory is deleted first, and then its children are deleted incrementally. The incremental deletion is safe because the children of the deleted directory would not be accessible.
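To make the O(n) cost concrete, here is a small sketch of resolving a path against an in-memory inode tree of the kind described above; the classes are simplified stand-ins, not the actual HDFS INode implementation.

import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for hierarchical path resolution: one child lookup per
// path component, i.e. O(n) in the number of components.
public class InMemoryPathResolver {

    public static class Inode {
        final String name;
        final Map<String, Inode> children = new HashMap<>();
        Inode(String name) { this.name = name; }
        Inode addChild(String childName) {
            Inode child = new Inode(childName);
            children.put(childName, child);
            return child;
        }
    }

    // Resolve "/home/wasif/file.txt" by walking the children maps from the root.
    public static Inode resolve(Inode root, String path) {
        Inode current = root;
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;       // skip the leading "/"
            current = current.children.get(component);
            if (current == null) {
                return null;                         // a path component does not exist
            }
            // In a real file system, a permission check would also happen here.
        }
        return current;
    }

    public static void main(String[] args) {
        Inode root = new Inode("/");
        root.addChild("home").addChild("wasif").addChild("file.txt");
        System.out.println(resolve(root, "/home/wasif/file.txt") != null);  // prints true
    }
}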

Representation in KTHFS

In KTHFS, the inodes are stored in a database table called INodeTable in MySQL Cluster. Each row in INodeTable represents an INode, which can be a file, a directory or a symbolic link. Table 4.3 shows a subset of the columns in the INodeTable and their data types. To make the comparison easier, the data types mentioned in the table are the Java equivalents of the MySQL data types.

Column            Type
inodeid           long
name              varchar(155)
parentid          long
isdirectory       boolean
permission        long
accesstime        long
modificationtime  long
Table 4.3: Properties of an INode in KTHFS

The inodeid is the primary key; it is a randomly generated 64-bit identifier created by the NameNode. The name column stores the name of the inode, and the parentid is the inodeid of its parent. Path resolution is done by starting from the root inode and then verifying one by one that all path components exist. However, path resolution is quite inefficient because the NameNode has to send n queries, one after the other, to resolve the full path. Furthermore, these queries are not primary key lookups and result in a range scan in NDB. This causes multiple round trips between the datanodes in NDB and affects the latency of path resolution. To solve this problem, an INode cache was developed, which we will discuss later in this chapter.
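Section 4.5 describes the ClusterJ API used to access NDB; as a preview, a table such as INodeTable can be mapped to an annotated Java interface along the following lines. This is an illustrative sketch: the table and column names follow Table 4.3, but the index name and the mapping details are assumptions, not the thesis's actual code.

import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.Index;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

// Illustrative ClusterJ mapping for the INode table of Table 4.3.
@PersistenceCapable(table = "INodeTable")
public interface InodeEntity {

    @PrimaryKey
    @Column(name = "inodeid")
    long getInodeId();
    void setInodeId(long inodeId);

    @Column(name = "name")
    String getName();
    void setName(String name);

    @Index(name = "parentid_idx")              // assumed index name
    @Column(name = "parentid")
    long getParentId();
    void setParentId(long parentId);

    @Column(name = "isdirectory")
    boolean getIsDirectory();
    void setIsDirectory(boolean directory);

    @Column(name = "permission")
    long getPermission();
    void setPermission(long permission);

    @Column(name = "accesstime")
    long getAccessTime();
    void setAccessTime(long accessTime);

    @Column(name = "modificationtime")
    long getModificationTime();
    void setModificationTime(long modificationTime);
}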

Block Metadata

Each file in HDFS is split into multiple blocks. Each block is described by a metadata object that stores information like the blockid, the size, the inode it belongs to, its state, its locations, etc. The blockid is a 64-bit unique identifier generated by the NameNode when a new block is allocated. If blocks are replicated thrice, then the NameNode stores the locations of three datanodes for each block in the system.

Representation in HDFS

Table 4.4 shows a subset of the block metadata stored in the HDFS NameNode.

Property  Type
blockid   long
numbytes  long
inode     INodeFile
state     enum
location  Datanode[]
Table 4.4: Properties of a Block in HDFS

A block goes through several states in its lifetime. When a block is allocated, its state is under-construction. When the client finishes writing the block, the block state changes to committed, and when the block has at least one replica, it becomes complete. The NameNode also stores a reverse reference to the inode the block belongs to. This is used in block report processing, when the NameNode wants to find out whether a particular block belongs to an inode or not. The location of a block is stored in memory after the datanode informs the NameNode that it has fully received that block.

Representation in KTHFS

There are two tables for storing the metadata of blocks in KTHFS: the BlockInfo table (see Table 4.5) and the Locations table (see Table 4.6). The tables have been normalized so that data is not unnecessarily duplicated and memory space in MySQL Cluster is efficiently utilized.

Column      Type
blockid     long
blockindex  int
inodeid     long
numbytes    long
state       short
Table 4.5: BlockInfo table in KTHFS

As mentioned before, the blockid is a unique identifier for the block and is the primary key in the BlockInfo table. The blockindex is the position of the block in the file, and is used by the NameNode to sort the blocks in the correct order. The inodeid is a foreign key, and is used to get hold of the inode for a block. The inodeid column is also indexed to make it possible for the NameNode to obtain all blocks of a particular inode. The column numbytes stores the size of the block in bytes, and the state column stores the current state of the block. The Locations table has just three columns: blockid, index and storageid. The blockid and index together form the primary key in this table. The storageid is a unique identifier for each datanode in the file system and acts as the foreign key to join this table with the DatanodeInfo table. The DatanodeInfo table stores information about the datanodes and will be discussed in the next section.

Column     Type
blockid    long
index      int
storageid  String
Table 4.6: Block locations table in KTHFS

Datanodes Metadata

The datanodes register themselves with the NameNode at startup, and maintain a persistent connection. They send heartbeats to the NameNode every three seconds, and send a block report to the NameNode every hour. The heartbeats have two purposes: firstly, to let the NameNode know that the datanode is alive, and secondly, to send some statistics to the NameNode, such as the total storage capacity at the datanode, the remaining disk space and the current workload. The NameNode uses these statistics to efficiently perform block placement and replica management. For example, the NameNode would stop returning to clients those datanodes which don't have disk space available. The block report contains information about all the blocks stored at that datanode. The NameNode compares this list with the list of blocks it thinks the datanode should have, and based on the result, asks the datanode to delete or replicate blocks.

Representation in HDFS

In the NameNode, each datanode is represented by a DatanodeID object, which contains information like the IP address, the port, and a unique identifier called the storageid. This identifier is generated by the datanode when it starts for the first time; it has the following format:

DS-randominteger-ip-port-timestamp

The above format ensures that there are no collisions between the identifiers of different datanodes. When the datanodes connect to the Namenode, they send their respective storageids as part of the registration process. The current status and statistics of the datanodes are stored in DatanodeInfo and DatanodeDescriptor objects. This includes the administrative state of a datanode, its location in the network topology, the available disk space, and lists of blocks which need to be either replicated, recovered or deleted. None of the above-mentioned information about the datanodes is persisted to disk, so if a NameNode crashes, all of this information is lost. When the NameNode reboots, it builds this information again after it receives heartbeats, block reports, and other messages from the datanodes.

Representation in KTHFS

Table 4.7 shows the information stored in the DatanodeInfo table in NDB.

Column     Type
storageid  String
hostname   String
port       int
status     int
location   String
Table 4.7: DatanodeInfo table in NDB

The datanode statistics, such as the available disk space and the lists of blocks to be deleted, recovered or replicated, are not stored in NDB in the current implementation; however, they are still stored locally on the NameNode.

4.4 The INode Cache

As mentioned in the section on path resolution, the full path of a file or directory is resolved by the NameNode to ensure that it exists and that the client has permission to access it. This is an expensive process because the NameNode has to iterate through all path components sequentially. For example, /home/wasif/work/file.txt is resolved by searching for wasif in the list of children of /, then searching for work in the list of children of wasif, and so on. In HDFS, inodes are stored in memory and reading from memory is quite fast. In KTHFS, this process is quite expensive because each read operation causes an index scan across all datanodes in NDB. The index scan operation itself is quite inefficient because it involves querying all NDB datanodes, which causes multiple round trips. This affects the latency of the path resolution process, and in turn affects the latency of all filesystem operations.

Index scans in NDB have two issues: first, they are slower than primary key lookups, and second, they generate considerable network traffic and make it difficult to scale out MySQL Cluster. For example, the path /home/wasif/work/file.txt has five path components, and it can be resolved by generating five index scans in NDB. Each path component is represented by an inode (a row in the INode table in NDB). The requests cannot be sent in parallel because the inodeids of the components are not known in advance. The inodes cannot be queried by their local name only, because there can be two or more files having the same local name but residing in different directories. Furthermore, the inodeid is not known in advance, so primary key lookups are not possible. A caching scheme was developed with the goal of fetching the inodes using primary key lookups instead of range scans.

The cache entries are stored in the NameNode's RAM. The pseudo code of the caching scheme can be seen in Algorithm 1. When a path is to be resolved, the NameNode checks if it is present in the cache; if it is not there, the path is resolved by generating an index scan for each path component. The path components are then added to the cache, which is a tree-like data structure with / as the root node. Each node in the cache tree has the following properties: inodeid, name, a reference to its parent, and a list of children. When the same path is accessed again by any client, the cache will already have the inodeids of the path components. The NameNode can then perform primary key lookups for these path components instead of index scans. Figure 4.3 shows the changes in the cache as new paths are accessed.

Figure 4.3: The changes in cache state after different paths are resolved. (a) shows the initial state of the cache, (b) shows the state of the cache after /home/wasif/a.txt is accessed for the first time, (c) shows the state after /home/john/b.txt is accessed

After the inodes are fetched from NDB, they are verified against the local cache. If nothing has changed, it means that the path is still up to date; otherwise the NameNode deletes the outdated path from the cache and falls back to performing the index scans for each path component.

Algorithm 1 Caching algorithm for path resolution
 1: if cache.exists(path) then
 2:     cachedinodes = cache.get(path)
 3:     inodes = getNodesUsingPrimaryKey(cachedinodes)
 4:     if verify(inodes, path) then
 5:         return inodes
 6:     else
 7:         cache.delete(path)
 8:         inodes = getNodesUsingIndexScans(path)
 9:         cache.put(inodes)
10:         return inodes
11:     end if
12: else
13:     inodes = getNodesUsingIndexScans(path)
14:     cache.put(inodes)
15:     return inodes
16: end if
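A compact way to picture the cache is as a tree of (name, inodeid) nodes mirroring the directory hierarchy. The sketch below is illustrative only; it shows the cache data structure and omits the NDB verification step of Algorithm 1, and the root inode id of 0 is an assumption.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the INode cache: a tree of (name -> child) entries
// rooted at "/", where each node remembers the inode id observed in NDB.
public class InodeCache {

    static class Entry {
        final String name;
        final long inodeId;
        final Map<String, Entry> children = new HashMap<>();
        Entry(String name, long inodeId) { this.name = name; this.inodeId = inodeId; }
    }

    private final Entry root = new Entry("/", 0L);   // assumed id of the root inode

    // Returns the cached inode ids of all path components, or null on a cache miss.
    // A hit allows the NameNode to issue primary key lookups instead of index scans.
    public List<Long> get(String path) {
        List<Long> ids = new ArrayList<>();
        Entry current = root;
        ids.add(current.inodeId);
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;
            current = current.children.get(component);
            if (current == null) return null;        // miss: fall back to index scans
            ids.add(current.inodeId);
        }
        return ids;
    }

    // Called after a path has been resolved with index scans, so that the next access hits.
    // resolvedIds[0] is the root inode id, followed by one id per path component.
    public void put(String path, List<Long> resolvedIds) {
        Entry current = root;
        int idIndex = 1;
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;
            Entry child = current.children.get(component);
            if (child == null) {
                child = new Entry(component, resolvedIds.get(idIndex));
                current.children.put(component, child);
            }
            current = child;
            idIndex++;
        }
    }
}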

4.5 MySQL Cluster (NDB)

In this section, we will discuss the features of MySQL Cluster which were used in this implementation. To maintain the brevity of this report, we only discuss the most interesting features.

Data Access

MySQL Cluster offers various APIs to access the data stored in NDB, for example OpenJPA, ClusterJPA, the MySQL Server, the NDB API and ClusterJ. For our implementation, we used the ClusterJ API for the following reasons:

- It can access the data by querying the NDB datanodes directly, i.e. without going through a centralized server
- The queries and lookups are quite fast, since ClusterJ does not have to perform query parsing
- It is a Java API, so integrating it with the NameNode was straightforward

Querying Data

NDB provides three data query mechanisms: primary key lookups, index scans and full table scans. We briefly discuss them below.

Primary key lookup

A primary key lookup in NDB is the most efficient way of accessing a row because the query is sent only to the node holding that row.

Index Scans

Index scans are sent in parallel to all nodes in the cluster, and are therefore much less efficient than primary key lookups.

Full Table Scans

Full table scans are also sent to all nodes in the cluster and have to scan the entire table to find the result set. As a rule of thumb, we don't use full table scans in our implementation.

Distribution Awareness

The rows in MySQL Cluster are partitioned across datanodes using the primary key by default. However, correlated rows in one table or in multiple tables can be stored together in one datanode by using a feature called Distribution Awareness. This significantly improves lookup performance, since all of these rows can be fetched by accessing just one datanode.
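To make the access paths concrete, the sketch below uses the public ClusterJ API (SessionFactory, Session, QueryBuilder) against an entity interface like the one sketched in Section 4.3. The connection string, database name and field names are placeholders, and error handling is omitted; it is a sketch of the API, not the KTHFS data-access layer itself.

import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Query;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.query.QueryBuilder;
import com.mysql.clusterj.query.QueryDomainType;

import java.util.List;
import java.util.Properties;

// Illustrative ClusterJ usage: a primary key lookup (a single NDB node touched)
// versus an index scan (all NDB nodes queried in parallel).
public class ClusterJExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("com.mysql.clusterj.connectstring", "ndb-mgmd:1186");  // placeholder
        props.put("com.mysql.clusterj.database", "kthfs");               // placeholder

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // 1. Primary key lookup: routed to the single data node holding the row.
        //    Assumes the root inode with id 0 exists.
        InodeEntity rootInode = session.find(InodeEntity.class, 0L);

        // 2. Index scan: find all children of a directory via the parentid index;
        //    the scan is sent to all data nodes, hence more expensive.
        QueryBuilder builder = session.getQueryBuilder();
        QueryDomainType<InodeEntity> domain = builder.createQueryDefinition(InodeEntity.class);
        domain.where(domain.get("parentId").equal(domain.param("pid")));
        Query<InodeEntity> query = session.createQuery(domain);
        query.setParameter("pid", rootInode.getInodeId());
        List<InodeEntity> children = query.getResultList();

        System.out.println("children of /: " + children.size());
        session.close();
        // Full table scans (queries with no indexed predicate) are avoided altogether.
    }
}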

Chapter 5: Evaluation

In this chapter, we provide an evaluation of KTHFS and discuss its capacity, scalability and availability as compared to HDFS.

5.1 Capacity

In HDFS, the metadata size of one file having two blocks (which are replicated three times) is approximately 600 bytes. The memory required at the Namenode to store one hundred million such files is therefore approximately 60 gigabytes. The HDFS NameNode runs in a JVM, and heap sizes greater than 60 gigabytes are not considered practical for interactive workloads. Therefore the maximum capacity of HDFS is approximately 100 million files. In KTHFS, since the metadata is stored in NDB, the capacity of the file system is proportional to the capacity of NDB. MySQL Cluster can store one terabyte of data across multiple datanodes, which means that the maximum capacity of KTHFS is around 1000 million files. Table 5.2 shows the memory requirements for different numbers of files.

                         HDFS         KTHFS
Maximum number of files  100 million  1000 million
Table 5.1: The maximum number of files that can be stored in HDFS and KTHFS

Number of files  Memory required
100 million      86 GB
500 million      430 GB
1 billion        860 GB
1.5 billion      1290 GB
2 billion        1721 GB
Table 5.2: Total memory required to store the metadata of the file system in MySQL Cluster
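The HDFS estimate above is straightforward arithmetic; the sketch below reproduces it for a few file counts, using the roughly 600 bytes per two-block file figure quoted above (the per-file cost in MySQL Cluster, as Table 5.2 shows, is higher).

// Back-of-the-envelope estimate of NameNode heap usage: files x bytes-per-file,
// using the ~600 bytes per two-block file figure quoted above for HDFS.
public class CapacityEstimate {
    public static void main(String[] args) {
        long bytesPerFile = 600L;                                  // approximate metadata per file
        long[] fileCounts = {100_000_000L, 500_000_000L, 1_000_000_000L};
        for (long files : fileCounts) {
            double gigabytes = files * (double) bytesPerFile / 1e9;
            System.out.printf("%,d files -> ~%.0f GB of NameNode heap%n", files, gigabytes);
        }
    }
}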

5.2 Throughput

To measure the throughput of HDFS and KTHFS, we used a tool called the Synthetic Load Generator, which is part of the HDFS source code. The experiments were carried out on machines with the following specifications:

- Two six-core AMD Opteron processors with a 2.5 GHz clock speed
- Ubuntu
- Sun JRE
- Gigabytes of RAM
- 1 Gbit Ethernet

The MySQL Cluster used for the experiments had six nodes, each with 4 GB of allocated memory. We measured the throughput of KTHFS by adding different numbers of NameNodes to the system. Figure 5.1 shows the throughput of open operations in KTHFS for different numbers of NameNodes. The tests were executed twice, once with the INode cache enabled and once with it disabled. As can be seen, the collective throughput of KTHFS increases almost linearly as more namenodes are added. Also, the INode cache improves the overall throughput significantly because it enables the NameNode to perform primary key lookups instead of index scans in NDB. Figure 5.2 shows a comparison between the throughput of open operations in HDFS and KTHFS. The throughput of one NameNode in KTHFS is less than the throughput of the HDFS NameNode. However, as more NameNodes are added in KTHFS, the overall throughput increases.

Figure 5.1: Throughput of open operations in KTHFS. The blue bars show the throughput with the INode cache enabled, and the red bars show the throughput with the INode cache disabled
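For reference, the essence of such a throughput test can be reproduced with the standard FileSystem API alone, as in the sketch below. This is not the Synthetic Load Generator itself; the cluster address, path, thread count and duration are arbitrary placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal open-operation throughput probe: several threads repeatedly ask the
// NameNode for the block locations of an existing file, and the completed
// metadata operations are counted.
public class OpenThroughputProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address
        final FileSystem fs = FileSystem.get(conf);
        final FileStatus status = fs.getFileStatus(new Path("/home/wasif/a.txt")); // must exist

        final AtomicLong completed = new AtomicLong();
        final int threads = 16;                              // arbitrary choices
        final long seconds = 60;
        final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(seconds);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    while (System.nanoTime() < deadline) {
                        fs.getFileBlockLocations(status, 0, status.getLen()); // metadata-only call
                        completed.incrementAndGet();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(seconds + 10, TimeUnit.SECONDS);
        System.out.printf("throughput: %.0f open operations/s%n", completed.get() / (double) seconds);
        fs.close();
    }
}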


More information

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System GFS: Google File System Google C/C++ HDFS: Hadoop Distributed File System Yahoo Java, Open Source Sector: Distributed Storage System University of Illinois at Chicago C++, Open Source 2 System that permanently

More information

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File

More information

Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available Implementation of HDFS

Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available Implementation of HDFS KTH Royal Institute of Technology Master Thesis Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available Implementation of HDFS Authors: Hooman Peiro Sajjad Mahmoud Hakimzadeh

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction...3 2 Assumptions and Goals...3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets...3 2.4 Simple Coherency Model... 4 2.5

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

GFS-python: A Simplified GFS Implementation in Python

GFS-python: A Simplified GFS Implementation in Python GFS-python: A Simplified GFS Implementation in Python Andy Strohman ABSTRACT GFS-python is distributed network filesystem written entirely in python. There are no dependencies other than Python s standard

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Nov 01 09:53:32 2012 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2012 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

CSE 124: Networked Services Fall 2009 Lecture-19

CSE 124: Networked Services Fall 2009 Lecture-19 CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani The Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani CS5204 Operating Systems 1 Introduction GFS is a scalable distributed file system for large data intensive

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Distributed Systems. GFS / HDFS / Spanner

Distributed Systems. GFS / HDFS / Spanner 15-440 Distributed Systems GFS / HDFS / Spanner Agenda Google File System (GFS) Hadoop Distributed File System (HDFS) Distributed File Systems Replication Spanner Distributed Database System Paxos Replication

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University

9/26/2017 Sangmi Lee Pallickara Week 6- A. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 6-A-1 CS535 BIG DATA FAQs PA1: Use only one word query Deadends {{Dead end}} Hub value will be?? PART 1. BATCH COMPUTING MODEL FOR BIG DATA ANALYTICS 4. GOOGLE FILE SYSTEM

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

Distributed System. Gang Wu. Spring,2018

Distributed System. Gang Wu. Spring,2018 Distributed System Gang Wu Spring,2018 Lecture7:DFS What is DFS? A method of storing and accessing files base in a client/server architecture. A distributed file system is a client/server-based application

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung ACM SIGOPS 2003 {Google Research} Vaibhav Bajpai NDS Seminar 2011 Looking Back time Classics Sun NFS (1985) CMU Andrew FS (1988) Fault

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Software Infrastructure in Data Centers: Distributed File Systems 1 Permanently stores data Filesystems

More information

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Relatively recent; still applicable today GFS: Google s storage platform for the generation and processing of data used by services

More information

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler MSST 10 Hadoop in Perspective Hadoop scales computation capacity, storage capacity, and I/O bandwidth by

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Distributed File Systems 15 319, spring 2010 12 th Lecture, Feb 18 th Majd F. Sakr Lecture Motivation Quick Refresher on Files and File Systems Understand the importance

More information

Bigtable. Presenter: Yijun Hou, Yixiao Peng

Bigtable. Presenter: Yijun Hou, Yixiao Peng Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

Yuval Carmel Tel-Aviv University Advanced Topics in Storage Systems - Spring 2013 Yuval Carmel Tel-Aviv University "Advanced Topics in About & Keywords Motivation & Purpose Assumptions Architecture overview & Comparison Measurements How does it fit in? The Future 2 About & Keywords

More information

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores CSE 444: Database Internals Lectures 26 NoSQL: Extensible Record Stores CSE 444 - Spring 2014 1 References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No. 4)

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed

More information

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication

More information

Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System

Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System Qi Qi Instituto Superior Técnico - IST (Portugal) Royal Institute of Technology - KTH (Sweden) Abstract.

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

2/27/2019 Week 6-B Sangmi Lee Pallickara

2/27/2019 Week 6-B Sangmi Lee Pallickara 2/27/2019 - Spring 2019 Week 6-B-1 CS535 BIG DATA FAQs Participation scores will be collected separately Sign-up page is up PART A. BIG DATA TECHNOLOGY 5. SCALABLE DISTRIBUTED FILE SYSTEMS: GOOGLE FILE

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Google Disk Farm. Early days

Google Disk Farm. Early days Google Disk Farm Early days today CS 5204 Fall, 2007 2 Design Design factors Failures are common (built from inexpensive commodity components) Files large (multi-gb) mutation principally via appending

More information

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05 Engineering Goals Scalability Availability Transactional behavior Security EAI... Scalability How much performance can you get by adding hardware ($)? Performance perfect acceptable unacceptable Processors

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

11/5/2018 Week 12-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University

11/5/2018 Week 12-A Sangmi Lee Pallickara. CS435 Introduction to Big Data FALL 2018 Colorado State University 11/5/2018 CS435 Introduction to Big Data - FALL 2018 W12.A.0.0 CS435 Introduction to Big Data 11/5/2018 CS435 Introduction to Big Data - FALL 2018 W12.A.1 Consider a Graduate Degree in Computer Science

More information

Staggeringly Large Filesystems

Staggeringly Large Filesystems Staggeringly Large Filesystems Evan Danaher CS 6410 - October 27, 2009 Outline 1 Large Filesystems 2 GFS 3 Pond Outline 1 Large Filesystems 2 GFS 3 Pond Internet Scale Web 2.0 GFS Thousands of machines

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

BigTable. CSE-291 (Cloud Computing) Fall 2016

BigTable. CSE-291 (Cloud Computing) Fall 2016 BigTable CSE-291 (Cloud Computing) Fall 2016 Data Model Sparse, distributed persistent, multi-dimensional sorted map Indexed by a row key, column key, and timestamp Values are uninterpreted arrays of bytes

More information

Google is Really Different.

Google is Really Different. COMP 790-088 -- Distributed File Systems Google File System 7 Google is Really Different. Huge Datacenters in 5+ Worldwide Locations Datacenters house multiple server clusters Coming soon to Lenior, NC

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Today l Basic distributed file systems l Two classical examples Next time l Naming things xkdc Distributed File Systems " A DFS supports network-wide sharing of files and devices

More information

NPTEL Course Jan K. Gopinath Indian Institute of Science

NPTEL Course Jan K. Gopinath Indian Institute of Science Storage Systems NPTEL Course Jan 2012 (Lecture 41) K. Gopinath Indian Institute of Science Lease Mgmt designed to minimize mgmt overhead at master a lease initially times out at 60 secs. primary can request

More information

Google File System, Replication. Amin Vahdat CSE 123b May 23, 2006

Google File System, Replication. Amin Vahdat CSE 123b May 23, 2006 Google File System, Replication Amin Vahdat CSE 123b May 23, 2006 Annoucements Third assignment available today Due date June 9, 5 pm Final exam, June 14, 11:30-2:30 Google File System (thanks to Mahesh

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

NPTEL Course Jan K. Gopinath Indian Institute of Science

NPTEL Course Jan K. Gopinath Indian Institute of Science Storage Systems NPTEL Course Jan 2012 (Lecture 40) K. Gopinath Indian Institute of Science Google File System Non-Posix scalable distr file system for large distr dataintensive applications performance,

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Flat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897

Flat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Flat Datacenter Storage Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Motivation Imagine a world with flat data storage Simple, Centralized, and easy to program Unfortunately, datacenter networks

More information

goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) handle appends efficiently (no random writes & sequential reads)

goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) handle appends efficiently (no random writes & sequential reads) Google File System goals monitoring, fault tolerance, auto-recovery (thousands of low-cost machines) focus on multi-gb files handle appends efficiently (no random writes & sequential reads) co-design GFS

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles INF3190:Distributed Systems - Examples Thomas Plagemann & Roman Vitenberg Outline Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles Today: Examples Googel File System (Thomas)

More information

Distributed File Systems. Directory Hierarchy. Transfer Model

Distributed File Systems. Directory Hierarchy. Transfer Model Distributed File Systems Ken Birman Goal: view a distributed system as a file system Storage is distributed Web tries to make world a collection of hyperlinked documents Issues not common to usual file

More information