Distributed Metadata Management for Parallel Filesystems


1 Distributed Metadata Management for Parallel Filesystems A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University By Vilobh Meshram, B.Tech(Computer Science) Graduate Program in Computer Science and Engineering The Ohio State University 2011 Master s Examination Committee: Dr. D.K. Panda, Advisor Dr. P. Sadayappan

2 c Copyright by Vilobh Meshram 2011

3 Abstract Much of the research in storage systems has focused on improving the scale and performance of data access for applications that read and write large amounts of file data. Parallel file systems do a good job of scaling large-file access bandwidth by striping or sharing I/O resources across many servers or disks. However, the same cannot be said about scaling file metadata operation rates. Most existing parallel filesystems choose to concentrate all the metadata processing load on a single server. This centralized processing can guarantee correctness, but it severely hampers scalability. This downside is becoming more and more unacceptable as metadata throughput is critical for large-scale applications. Distributing the metadata processing load is critical to improving metadata scalability when handling a huge number of client nodes. However, in such a distributed scenario, a solution to speed up metadata operations has to address two challenges simultaneously, namely scalability and reliability. We propose two approaches to solve the challenges mentioned above for metadata management in parallel filesystems, with a focus on the reliability and scalability aspects. As demonstrated by experiments, our approach to distributed metadata management achieves significant improvements over native parallel filesystems by a large margin for all the major metadata operations. With 256 client processes, our approach to distributed metadata management ii

4 outperforms Lustre and PVFS2 by factors of 1.9 and 23, respectively, for directory creation. With respect to the stat() operation on files, our approach is 1.3 and 3.0 times faster than Lustre and PVFS, respectively. iii

5 This work is dedicated to my parents and my sister iv

6 Acknowledgments I consider myself extremely fortunate to have met and worked with some remarkable people during my stay at Ohio State. While a brief note of thanks does not do justice to their impact on my life, I deeply appreciate their contributions. I begin by thanking my adviser, Dr. Dhabaleswar K.Panda. His guidance and advice during the course of my Masters studies have shaped my career. I am thankful to Dr. P. Sadayappan for agreeing to serve on my Master s examination committee. Special thanks to Xiangyong Ouyang for all the support and help. I would also like to thank Dr.Xavier Besseron for his insightful comments and discussions which helped me to strengthen my thesis. I am especially grateful to Xiangyong, Xavier and Raghu and I feel lucky to have collaborated closely with them. I would like to thank all my friends in the Network Based Computing Research Laboratory for their friendship and support. Finally, I thank my family, especially my parents and my sister. Their love, action, and faith have been a constant source of strength for me. None of this would have been possible without them. v

7 Vita April 18, Born - Amravati, India B.Tech., Computer Science, COEP, Pune University, Pune, India Software Development Engineer, Symantec R&D India Graduate Research Associate, The Ohio State University Publications Research Publications Vilobh Meshram, Xavier Besseron, Xiangyong Ouyang, Raghunath Rajachandrasekar and Dhabaleswar K. Panda Can a Decentralized Metadata Service Layer benefit Parallel Filesystems?. accepted in IASDS 2011 workshop in conjunction with Cluster 2011 Vilobh Meshram, Xiangyong Ouyang and Dhabaleswar K. Panda Minimizing Lookup RPCs in Lustre File System using Metadata Delegation at Client Side. OSU Technical Report OSU-CISRC-7/11-TR20, July 2011 Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron, Vilobh Meshram and Dhabaleswar K. Panda Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?. to appear in Reselience 2011 workshop in conjunction with Euro-Par 2011 Fields of Study vi

8 Major Field: Computer Science and Engineering Studies in High Performance Computing: Prof. D. K. Panda vii

9 Table of Contents Page Abstract Dedication Acknowledgments Vita List of Tables ii iv v vi xi List of Figures xii 1. Introduction Parallel Filesystems Metadata Management in Parallel Filesystems Distributed Coordination Service Motivation of the Work Metadata Server Bottlenecks Consistency management of Metadata Problem Statement Organization of Thesis Related Work Metadata Management approaches Scalable filesystem directories viii

10 3. Delegating metadata at client side (DMCS) RPC Processing in Lustre Filesystem Existing Design Design and challenges for delegating metadata at client side Design of communication module Design of DMCS approach Challenges Metadata revocation Distributed Lock management for DMCS approach Performance Evaluation File Open IOPS: Varying Number of Client Processes File Open IOPS: Varying File Pool Size File Open IOPS: Varying File path Depth Summary Design of a Decentralized Metadata Service Layer for Distributed Metadata Management Detailed design of Distributed Union FileSystem (DUFS) Implementation Overview FUSE-based Filesystem Interface ZooKeeper-based Metadata Management File Identifier Deterministic mapping function Back-end storage Algorithm examples for Metadata operations Reliability concerns Performance Evaluation Distributed coordination service throughput and memory usage experiments Scalability Experiments Experiments with varying number of distributed coordination service servers Experiment with different number of mounts combined using DUFS Experiments with different back-end parallel filesystems Summary ix

11 5. Contributions and Future Work Summary of Research Contributions and Future Work Delegating metadata at client side Design of a decentralized metadata service layer for distributed metadata management Bibliography x

12 List of Tables Table Page 1.1 LDLM and Oprofile Experiments Transaction throughput with a fixed file pool size of 1,000 files Transaction throughput with varying file pool Transaction throughput with varying file pool Metadata operation rates with different underlying storage xi

13 List of Figures Figure Page 1.1 Basic Lustre Design Zookeeper Design Example of consistency issue with 2 clients and 2 MetaData servers Design of DMCS approach File open IOPS, Each Process Accesses 10,000 Files File open IOPS, Using 16 Client Processes Time to Finish open, Using 16 Processes Each Accessing 10,000 Files DUFS mapping from the virtual path to the physical path using File Identifier (FID) DUFS overview. A, B, C and D show the steps required to perform an open() operation Sample physical filename generated from a given FID Algorithm for the mkdir() operation Algorithm for the stat() operation ZooKeeper throughput for basic operations by varying the number of ZooKeeper Servers xii

14 4.7 Zookeeper memory usage and its comparison with DUFS and basic FUSE based file system memory usage Scalability experiments with 8 Client nodes and varying number of client processes Scalability experiments with 16 Client nodes and varying number of client processes Operation throughput by varying the number of Zookeeper Servers File operation throughput for different numbers of back-end storage Operation throughput with respect to the number of clients for Lustre and PVFS xiii

15 Chapter 1: INTRODUCTION High-performance computing (HPC) is an integral part of today s scientific, economic, social, and commercial fabric. We depend on HPC systems and applications for a wide range of activities such as climate modeling, drug research, weather forecasting, and energy exploration. HPC systems enable researchers and scientists to discover the origins of the universe, design automobiles and airplanes, predict weather patterns, model global trade, and develop life-saving drugs. Because of the nature of the problems that they are trying to solve, HPC applications are often data-intensive. Scientific applications in astrophysics (CHIMERA and VULCAN2D), climate modeling (POP), combustion (S3D), fusion (GTC), visualization, astronomy, and other fields generate or consume large volumes of data. This data is on the order of terabytes and petabytes and is often shared by the entire scientific community. Today s computational requirements are increasing at a geometric rate that involves large quantities of data. While the computational power of microprocessors has kept pace with Moore s law as a result of increased chip densities, performance improvements in magnetic storage have not seen a corresponding increase. The result has been an increasing gap between the computational power and the I/O subsystem performance of current HPC systems. Hence, while supercomputers keep getting faster, we do 1

16 not see a corresponding improvement in application performance, because of the I/O bandwidth bottleneck. Parallel file systems do a good job of improving the data throughput rate by striping or sharing I/O resources across many servers and disks. The same cannot be said about metadata operations. Every time a file is opened, saved, closed, searched, backed up or replicated, some portion of metadata is accessed. As a result, metadata operations fall in the critical path of a broad spectrum of applications. Studies [20,23] show that over 75% of all filesystem calls require access to file metadata. Therefore, efficient management of metadata is crucial for overall system performance. Even though modern distributed file system architectures like Lustre [4], PVFS [10] and the Google File System [13] separate the management of metadata from the storage of the actual file data, the entire namespace is still managed by a centralized metadata server. These architectures have proven to easily scale storage capacity and bandwidth. However, the management of metadata remains a bottleneck. Recent trends in high-performance computing have also seen a shift toward distributed resource management. Scientific applications are increasingly accessing data stored in remote locations. This trend is a marked deviation from the earlier norm of co-locating an application and its data. In such a distributed environment the management of metadata becomes even more difficult, as reliability, consistency and scalability all need to be taken care of. Since in most parallel filesystems a single metadata server manages the entire namespace, new approaches need to be designed for distributed metadata management. A few parallel filesystems have a design for better metadata management in order to overcome the problem posed by the single point of metadata bottleneck 2

17 but, considering the complexity of distributed metadata management, the effort is still in progress. Our research focuses on addressing these two problems. We have examined the existing paradigms and suggested better alternatives. In the first part we focus on an approach for the Lustre filesystem to overcome the single point of bottleneck. In the second part we design and evaluate our scheme for distributed metadata management for parallel filesystems, with the primary aim of improving the scalability of the filesystem while maintaining its reliability and consistency. 1.1 Parallel Filesystems Parallel filesystems are mostly used in High Performance Computing environments, which deal with or generate massive amounts of data. Parallel filesystems usually separate the processing of metadata from that of data. Some parallel file systems, e.g., Lustre, have a separate Metadata Server to handle metadata operations, whereas others, e.g., PVFS, may keep the metadata and data in the same place. Let us consider the case of Lustre. Lustre is a POSIX-compliant, open-source distributed parallel filesystem. Due to the extremely scalable architecture of the Lustre filesystem, Lustre deployments are popular in scientific supercomputing, as well as in the oil and gas, manufacturing, rich media, and finance sectors. Lustre presents a POSIX interface to its clients with parallel access capabilities to the shared file objects. Lustre is an object-based filesystem. It is composed of three components: a Metadata Server (MDS), Object Storage Servers (OSSs), and clients. Figure 1.1 illustrates the Lustre architecture. Lustre uses block devices for file data and metadata storage, and each block device can be managed by only one Lustre service. The total 3

18 data capacity of the Lustre filesystem is the sum of all individual OST capacities. Lustre clients access and concurrently use data through the standard POSIX I/O system calls. The MDS provides metadata services. Correspondingly, an MDC (metadata client) is a client of those services. One MDS per filesystem manages one metadata target (MDT). Each MDT stores file metadata, such as file names, directory structures, and access permissions. An OSS (object storage server) exposes block devices and serves data. Correspondingly, an OSC (object storage client) is a client of those services. Each OSS manages one or more object storage targets (OSTs), and OSTs store file data objects. [Figure 1.1 shows the clients interacting with an LDAP server for configuration information, network connection details and security management; with the Meta-Data Server (MDS) for directory operations, metadata and concurrency; and with the Object Storage Targets (OSTs) for file I/O and locking; the MDS also handles recovery, file status and file creation.] Figure 1.1: Basic Lustre Design 4
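To make the division of labor between these components concrete, the toy model below sketches it in a few lines of Python. It is illustrative only and assumes nothing about Lustre's real data structures: the class names, the layout dictionary and the object/OST names are invented for this sketch. The point is simply that a client contacts the MDS once for attributes and layout, and then performs all file I/O directly against the OSTs.

```python
# Illustrative-only model of the Lustre division of labor (hypothetical names).

class ObjectStorageTarget:
    """Stores file data objects, addressed by object id."""
    def __init__(self, name):
        self.name = name
        self.objects = {}          # object id -> bytearray of file data

class MetadataServer:
    """Stores names, permissions, and the layout (striping EA) of each file."""
    def __init__(self):
        self.namespace = {}        # path -> {"mode": ..., "layout": [(obj_id, ost), ...]}

    def open(self, path):
        # A single metadata request returns attributes plus the layout; after
        # this, reads and writes of the file never involve the MDS again.
        return self.namespace[path]

class Client:
    def __init__(self, mds):
        self.mds = mds

    def read(self, path):
        meta = self.mds.open(path)                  # metadata path: client -> MDS
        chunks = []
        for obj_id, ost in meta["layout"]:          # data path: client -> OSTs
            chunks.append(bytes(ost.objects[obj_id]))
        return b"".join(chunks)

if __name__ == "__main__":
    ost_p, ost_q = ObjectStorageTarget("OST-p"), ObjectStorageTarget("OST-q")
    ost_p.objects["x"] = bytearray(b"hello ")
    ost_q.objects["y"] = bytearray(b"world")
    mds = MetadataServer()
    mds.namespace["/lustre/foo.txt"] = {"mode": 0o644,
                                        "layout": [("x", ost_p), ("y", ost_q)]}
    print(Client(mds).read("/lustre/foo.txt"))      # b'hello world'
```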

19 1.2 Metadata Management in Parallel Filesystems Parallel filesystems like Lustre and the Google File System differ from classical distributed file systems like NFS in that they separate the management of metadata from that of the actual file data. In classical distributed filesystems like NFS the server has to manage both data and metadata. This increases the load on the server and limits the performance and scalability of the filesystem. Parallel filesystems store the metadata on a separate server known as the metadata server (MDS). Let us consider the example of the Lustre filesystem. In terms of on-disk storage of metadata, the parallel file system keeps additional information known as Extended Attributes (EA) apart from the normal file metadata attributes such as the inode. The EA information, along with the normal file attributes, is handed over to the client in the case of a getattr or lookup operation. So when the client wants to perform actual I/O, the client knows which servers to talk to and how the file is striped amongst the servers. From the MDS point of view, each file is composed of multiple data objects striped on one or more OSTs. A file object's layout information is defined in the extended attribute (EA) of the inode. Essentially, the EA describes the mapping between file object ids and their corresponding OSTs. This information is also known as the striping EA. For example, with a stripe size of 1MB, the extents [0,1M), [4M,5M) might be stored as object x, which is on OST p; [1M,2M), [5M,6M) as object y, which is on OST q; and [2M,3M), [6M,7M) as object z, which is on OST r (a small illustrative sketch of this mapping appears at the end of this section). Before reading the file, a client will query the MDS via the MDC and be informed that it should talk to OST p, OST q and OST r for this operation. This information is structured in the so-called LSM, and the client-side LOV (logical object 5

20 volume) interprets this information so that the client can send requests to the OSTs. Here again, the client communicates with an OST through a client module interface known as the OSC. Depending on the context, OSC can also be used to refer to an OSS client by itself. All client/server communications in Lustre are coded as an RPC request and response. Within the Lustre source, this middle layer is known as Portal RPC, or ptl-rpc, which translates and interprets filesystem requests to and from the equivalent RPC requests and responses, with the LNET module finally putting them down onto the wire. Most parallel file systems follow this kind of architecture, where a single metadata server manages the entire namespace. So when the load on the MDS increases, the performance of the MDS degrades, which slows down the entire file system. The MDS consists of many important components, such as the Lustre Distributed Lock Manager (LDLM), which occupies a major chunk of the processing time at the MDS. We performed experiments using the Oprofile tool to profile the Lustre code to understand the amount of time consumed by the LDLM module. The experiment was performed on 8 client nodes. Table 1.1 shows the amount of time consumed by the Lock Manager module at the MDS. In this kind of environment, where a single metadata server manages the entire namespace, most of the time is spent in the LDLM module and in communication. By communication we mean sending a blocking AST to the client holding a valid copy and then invalidating the local cache at that client. Also, allowing only a single Metadata Target (MDT) in a filesystem means that Lustre metadata operations can be processed only as quickly as a single server and its backing filesystem can manage. In order to improve the 6

21 performance and scalability of parallel filesystems, efforts have been made in the direction of distributed metadata management. Table 1.1: LDLM and Oprofile Experiments File Percentage ldlm/ldlm_lockd.c ldlm/ldlm_inodebits.c ldlm/ldlm_internal.h ldlm/ldlm_lib.c ldlm/ldlm_lock.c ldlm/ldlm_pool.c ldlm/ldlm_request.c ldlm/ldlm_resource.c Clustered Metadata Server (CMD) is an approach proposed by the Lustre community for distributed metadata management. With CMD functionality, multiple MDSs can provide a single file system's namespace jointly, storing the directory and file metadata on a set of MDTs. Clustered Metadata (CMD) means there are multiple active MDS servers in one Lustre file system; the MDS workload can be shared among several servers, so that metadata performance is significantly improved. Although CMD will improve the performance and scalability of Lustre, it also brings some difficulties. The most complex ones are recovery, consistency and reliability. In CMD, one metadata operation may need to update several different MDSs. To maintain the consistency of the filesystem, the update must be atomic. If the update on one MDS fails, all other updates must be rolled back to their original states. To handle this, CMD uses a global lock. But a global lock slows down the overall throughput of the filesystem. 7
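To make the striping arithmetic from the example earlier in this section concrete, the following sketch models a simple round-robin stripe mapping in Python. It is not Lustre code: the object and OST names are the hypothetical x, y, z on OSTs p, q, r from the text, and a fourth object (called w here) is assumed because the extents quoted above ([0,1M) and [4M,5M) landing on the same object) imply a stripe count of four under plain round-robin striping.

```python
# Illustrative round-robin stripe mapping for the example in Section 1.2:
# stripe size 1 MB; objects x, y, z (and an assumed fourth object w) on
# OSTs p, q, r (and an assumed OST s).  All names are hypothetical.
STRIPE_SIZE = 1 << 20                                        # 1 MB
LAYOUT = [("x", "OST p"), ("y", "OST q"), ("z", "OST r"), ("w", "OST s")]

def locate(offset):
    """Map a logical file offset to (object, OST, offset inside the object)."""
    stripe_index = offset // STRIPE_SIZE             # which 1 MB extent
    obj, ost = LAYOUT[stripe_index % len(LAYOUT)]    # round-robin over the layout
    stripe_round = stripe_index // len(LAYOUT)       # full rounds completed so far
    obj_offset = stripe_round * STRIPE_SIZE + offset % STRIPE_SIZE
    return obj, ost, obj_offset

# Extents starting at 0 and 4M both land in object x on OST p, and 1M/5M in
# object y on OST q, matching the extents quoted in the text.
for off in (0, 1 * STRIPE_SIZE, 2 * STRIPE_SIZE, 4 * STRIPE_SIZE, 5 * STRIPE_SIZE):
    print(hex(off), "->", locate(off))
```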

22 1.3 Distributed Coordination Service Google's Chubby [9] is a distributed lock service which has gained wide adoption within Google's data centers. The Chubby lock service is intended to provide coarse-grained locking as well as reliable storage for a loosely-coupled distributed system. The purpose of the lock service is to allow its clients to synchronize their activities and to agree on basic information about their environment. The primary goals include reliability, availability to a moderately large set of clients, and easy-to-understand semantics; throughput and storage capacity are considered secondary. Chubby's client interface is similar to that of a simple file system that performs whole-file reads and writes, augmented with advisory locks and with notification of various events such as file modification. Chubby helps developers to deal with coarse-grained synchronization within their systems, and in particular with the problem of electing a leader from among a set of otherwise equivalent servers. For example, the Google File System [13] uses a Chubby lock to appoint a GFS master server, and Bigtable [11] uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master. In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of metadata; in effect they use Chubby as the root of their distributed data structures. The primary purpose of storing the root in Chubby is improved reliability and consistency. So even in the event of a node failure, we are still able to view the contents of the directory, due to the reliability provided by Chubby. Apache ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure. ZooKeeper [14] 8

23 is a distributed, open-source coordination service for distributed applications. It exposes a simple set of interfaces that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance and naming. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of special nodes known as znodes. Znodes are not intended to store bulk data; they store small amounts of configuration and coordination information. The ZooKeeper implementation puts a premium on high-performance, highly available, strictly ordered access. The strict ordering means that sophisticated synchronization primitives can be implemented at the client. ZooKeeper is replicated over a set of hosts. ZooKeeper performs better under a read-intensive workload than under a write/update-intensive workload [14]. Figure 1.2: Zookeeper Design 9
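As a concrete illustration of this file-system-like interface, the short example below uses kazoo, a commonly used Python client library for ZooKeeper. It is only a usage sketch, not code from this thesis: the server address, znode paths and payloads are placeholders. The final call shows sequence znodes, the primitive on which locks and globally ordered operation logs are typically built.

```python
# Minimal ZooKeeper usage sketch with the kazoo client library
# (pip install kazoo).  The server address and paths below are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# The namespace is hierarchical, like a file system made of znodes.
zk.ensure_path("/dufs/metadata")

# Znodes are created, read and watched with simple file-system-like calls.
if not zk.exists("/dufs/metadata/d1"):
    zk.create("/dufs/metadata/d1", b"directory entry for d1")

data, stat = zk.get("/dufs/metadata/d1")
print(data, stat.version)

# Sequence znodes get a server-assigned, strictly increasing suffix, which is
# the building block for locks and for agreeing on a global operation order.
seq_path = zk.create("/dufs/metadata/op-", b"mkdir d1", sequence=True)
print("operation recorded as", seq_path)

zk.stop()
```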

24 1.4 Motivation of the Work Parallel file systems can easily scale bandwidth and improve performance by operating on data in parallel using strategies such as data striping, resource sharing, etc. However, most parallel file systems do not provide the ability to scale and parallelize metadata operations, as this is inherently more complex than scaling the performance of data operations [6]. PVFS provides some level of parallelism through distributed metadata servers that manage different ranges of metadata. The Lustre community has also proposed the idea of the Clustered Metadata Server (CMD) to minimize the load on a single Metadata Server, wherein multiple metadata servers share the metadata processing workload. 1.4.1 Metadata Server Bottlenecks The MDS is currently restricted to a single node, with a fail-over MDS that becomes operational if the primary server becomes nonfunctional. Only one MDS is ever operational at a given time. This limitation poses a potential bottleneck as the number of clients and/or files increases. IOzone [2] is used to measure sequential file I/O throughput, and Postmark [5] is used to measure the scalability of MDS performance. Since MDS performance is the primary concern of this research, we discuss the Postmark experiment in more detail. Postmark is a file system benchmark that performs many metadata-intensive operations to measure MDS performance. Postmark first creates a pool of small files (1KB to 10KB), and then starts many sequential transactions on the file pool. Each transaction performs two operations to either read/append a file or create/delete a file. Each of these operations happens with the same probability. The transaction throughput is measured to 10

25 approximate workloads on an Internet server. Table 1.2 gives the measured transaction throughput with a fixed file pool size of 1,000 files and different numbers of transactions on this pool. The transaction throughput remains relatively constant across the varied transaction counts. Since the cost for the MDS to perform an operation does not change at a fixed file count, this result is expected. Table 1.3, on the other hand, changes the file pool size and measures the corresponding transaction throughput. By comparing the entries in Table 1.3 with their counterparts in Table 1.2, it becomes clear that a larger file pool results in a lower transaction throughput. We also performed experiments varying the number of transactions while keeping the number of files in the file pool constant. Table 1.4 shows the details. As seen from Table 1.4, for a constant file pool size and a varying number of transactions we do not see a large variation in the transaction throughput. The MDS caches the most recently accessed metadata of files (the inode of a file). A client file operation requires the metadata information about that file to be returned by the MDS. At larger numbers of files in the pool, a client request is less likely to be serviced from the MDS cache. A cache miss results in the MDS looking up its disk storage to load the inode of the requested file, which results in the lower transaction throughput in Table 1.3. Table 1.2: Transaction throughput with a fixed file pool size of 1,000 files Number of transactions Transactions per second 1, , , ,

26 Table 1.3: Transaction throughput with varying file pool Number of files in pool Number of transactions Transactions per second 1,000 1, ,000 5, ,000 10, ,000 20, Table 1.4: Transaction throughput with a varying number of transactions and a fixed file pool Number of files in pool Number of transactions Transactions per second 5,000 1, ,000 5, ,000 10, ,000 20, 1.4.2 Consistency management of Metadata The majority of distributed filesystems use a single metadata server. However, this is a bottleneck that limits the operation throughput. Managing multiple metadata servers brings many difficulties. Maintaining consistency between two copies of the same directory hierarchy is not straightforward. We illustrate such a difficulty in Figure 1.3. We have two metadata servers (MDSs) and we consider two clients that perform an operation on the same directory at the same time. Client 1 creates the directory d1 and client 2 renames the directory d1 to d2. As shown in Figure 1.3a, each client performs its operation in the following order: first on MDS1, then on MDS2. From the MDS point of view, there is no guarantee on the execution order of the 12

27 requests, since they are coming from different clients. As shown in Figure 1.3b, the requests can be executed in a different order on each metadata server while still respecting the ordering that the clients demand. In this case, the resulting states of the two metadata servers are not consistent. This small example highlights that distributed algorithms are required to maintain the consistency between multiple metadata servers. Each client operation must appear to be atomic and must be applied in the same order on all the metadata servers. For this reason, we decided to use a distributed coordination service like ZooKeeper in the proposed metadata service layer. Such a coordination service implements the required distributed algorithms in a reliable manner. [Figure 1.3: Example of consistency issue with 2 clients and 2 MetaData servers. (a) On the client side: client 1 performs 1. mkdir d1 on MDS1, 2. mkdir d1 on MDS2; client 2 performs 1. mv d1 d2 on MDS1, 2. mv d1 d2 on MDS2. (b) On the MetaData server side: MDS1 receives 1. mkdir d1 from client1, 2. mv d1 d2 from client2, with result d2; MDS2 receives 1. mv d1 d2 from client2, 2. mkdir d1 from client1, with result d1.] 13
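The scenario of Figure 1.3 can be replayed in a few lines of Python. This is purely illustrative: each metadata server is modeled as a plain set of directory names, and the two-entry "agreed log" stands in for the total order that a coordination service such as ZooKeeper would establish. Applying the same two operations in different orders leaves the replicas inconsistent, while replaying one agreed sequence on every server does not.

```python
# Illustrative replay of the scenario in Figure 1.3 (directory names taken
# from the figure).  Each metadata server is modeled as a set of names.

def mkdir(state, name):            # "mkdir d1"
    state.add(name)

def rename(state, old, new):       # "mv d1 d2" (no-op if d1 is missing)
    if old in state:
        state.discard(old)
        state.add(new)

# Without coordination, each MDS may see the two clients' requests in a
# different order, and the replicas diverge (d2 on MDS1 vs. d1 on MDS2).
mds1, mds2 = set(), set()
mkdir(mds1, "d1"); rename(mds1, "d1", "d2")        # MDS1: mkdir arrives first
rename(mds2, "d1", "d2"); mkdir(mds2, "d1")        # MDS2: rename arrives first
print(mds1, mds2)                                  # {'d2'} {'d1'}  -- inconsistent

# A coordination service fixes the order once, and every MDS replays that
# single agreed sequence, so all replicas end in the same state.
agreed_log = [("mkdir", "d1"), ("rename", "d1", "d2")]
mds1, mds2 = set(), set()
for state in (mds1, mds2):
    for op in agreed_log:
        mkdir(state, op[1]) if op[0] == "mkdir" else rename(state, op[1], op[2])
print(mds1, mds2)                                  # {'d2'} {'d2'}  -- consistent
```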

28 1.5 Problem Statement The amount of data generated and consumed by high-performance computing applications is increasing exponentially. Current I/O paradigms and file system designs are often overwhelmed by this deluge of data. Parallel file systems do a good job of improving I/O throughput to a certain extent by incorporating features such as resource sharing, data striping, etc. Distributed filesystems often dedicate a subset of the servers to metadata management. File systems such as NFS [17], AFS [15] and Lustre [4] use a single metadata server to manage a globally shared file system namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure. In this thesis, we study and critique the current metadata management techniques in parallel file systems, taking the Lustre file system as our use case. We propose two new designs for metadata management in parallel file systems. In the first part, we present a design where we delegate metadata to the client side to solve the problem of a single metadata server (MDS) becoming a bottleneck while managing the entire namespace. We aim to minimize the memory pressure at the MDS by delegating some of the metadata to clients, so as to improve the scalability of Lustre. In the second part, we design a decentralized metadata service layer and evaluate its benefits in a parallel-filesystem environment. The decentralized metadata service layer takes care of distributed metadata management with the primary aim of improving the scalability of the filesystem while maintaining its reliability and consistency. Specifically, our research attempts to answer the following questions: 14

29 1. What are the challenges and problems associated with a single server managing the entire namespace for a parallel file system? 2. How to solve the problem of minimizing the load on a single MDS by distributing the metadata at Client side? 3. What are the challenges and problems associated with distributed metadata management? 4. Can a distributed coordination service be incorporated into parallel filesystems for distributed metadata management so as to improve the reliability and consistency aspects? 5. How will a decentralized metadata service layer perform with respect to various metadata operations as compared to the basic variant of parallel filesystems such as Lustre [4] and PVFS [10]? 6. Will a decentralized metadata service layer designed for distributed metadata management do a good job in improving the scalability of parallel file system? Will it help in maintaining the consistency and reliability of the file system? 1.6 Organization of Thesis The rest of the thesis is organized as follows. Chapter 2 presents an overview of the work in the area of parallel file systems with focus on metadata management for parallel file systems. Chapter 3 proposes a distributed metadata management technique by delegating metadata at client side. In Chapter 4 we explore the feasibility of using a Distributed coordination service for distributed metadata management. We conclude our work and present future research directions in Chapter 5. 15

30 Chapter 2: RELATED WORK In this chapter, we discuss some of the current literature related to metadata management in high performance computing environments. We highlight the drawbacks of current metadata managements paradigms in parallel filesystems and suggest better design and algorithms for metadata management in parallel filesystem. 2.1 Metadata Management approaches File system metadata management has long been an active area of research [15]. With the advent of commodity clusters and parallel file systems [4], managing metadata efficiently and in a scalable manner offers significant challenges. Distributed file systems often dedicate a subset of the servers for metadata management. Mapping the semantics of data and metadata across different, non overlapping servers allows file systems to scale in terms of I/O performance and storage capacity. File systems such as NFS [17], AFS [15], Lustre [4], and GFS [13] use a single metadata server to manage a globally shared file system namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure. File Systems like NFS [17], Coda [21] and AFS [15] may also partition their namespace statically among multiple servers, so most of the major metadata operations are centralized. The pnfs [12] allows for distributed 16

31 data but retains the concept of centralized metadata. Other parallel file systems like GPFS [22], Intermezzo [7] and Lustre [4] use directory locks for file creation, with the help of a distributed lock management (DLM) for better performance. Lustre uses a single metadata server to manage the entire namespace. Lustre distributed lock management module handles locks between clients and servers and local locks between the nodes. The Lustre community has also mentioned the fact of a single Metadata Server being a bottleneck in HPC kind of environments. So they came up with the concept of Lustre Clustered Metadata Server (CMD). CMD is still a prototype and there is no implementation for it till now. The original design for CMD was proposed in In CMD files are identified by a global FID and are assigned a metadata server, once we know the FID, we can directly deal with the server. Getting this FID still requires a centralized/master metadata server and this information is not redundant. So this will still involve a bottleneck at the Master node in the CMD. Also the reliability and availability factor depends a lot on the Master node in CMD. To mitigate the problems associated with a central metadata server, AFS [15] and NFS [17] employ static directory subtree partitioning [24] to partition the data namespace across multiple metadata servers. Each server is delegated the responsibility of managing the metadata associated with a subtree. Hashing [8] is another technique used to partition the file system namespace. It uses a hash of the file name to assign metadata to the corresponding MDS. Hashing diminishes the problem of hot spots that is often experienced with directory subtree partitioning. The Lazy Hybrid metadata management scheme [8,23] combines hierarchical directory management and hashing with lazy updates. Zhu et al. proposed using Hierarchical Bloom Filter Arrays [25] to map file names to the corresponding metadata servers. They 17

32 used two levels of Bloom Filter Arrays with differing degrees of accuracy and memory overhead to distribute the metadata management responsibilities across multiple servers. Ananth et. al explored multiple algorithms for creating files on a distributed metadata file system for scalable metadata performance. In past, in order to get more metadata mutation throughput, efforts were aimed to mount more independent file systems into a larger aggregate, but each directory or directory sub-tree is still managed by one metadata server. Some systems use cluster metadata servers in pairs for fail-over, but not increased throughput. Some systems allow any server to act as a proxy and forward requests to the appropriate server; but this also does not increase metadata mutation throughput in a directory [3]. Symmetric shared disk file systems, that support concurrent updates to the same directory use complex distributed locking and cache consistency semantics, both of which have significant bottlenecks for concurrent create workloads, especially from many clients working in one directory. Moreover, file systems that support client caching of directory entries for faster read only workloads, generally disable client caching during concurrent update workload to avoid excessive consistency overhead. A recent trend among distributed file systems is to use the concept of objects to store data and metadata. CRUSH [23] is a data distribution algorithm that maps object replicas across a heterogeneous storage system. It uses a pseudo-random function to map data objects to storage devices. Lustre, PanFS and Ceph [23] use various nonstandard object interfaces requiring the use of dedicated I/O and metadata servers. Instead, our work breaks away from the dedicated server paradigm and redesigns parallel file systems to use standards-compliant OSDs for data and metadata storage. Also there has been work in the area of combining multiple partitions into a virtual 18

33 mount point. UnionFS (the Linux official union filesystem in the kernel mainline) [18] has many options, but it does not support load balancing between branches. Most of the file systems which combine multiple partitions into a virtual mount work on a single node to combine local partitions or directories. Also, some union file systems cannot extract parallelism: their default behavior is to use the first partition until it reaches a threshold (based on the free space). Such a file system cannot attain higher throughput even after combining multiple mount points and is restricted by the throughput of the first mounted partition. 2.2 Scalable filesystem directories GPFS is a shared-disk file system that uses a distributed implementation of Fagin's extendible hashing for its directories. Fagin's extendible hashing dynamically doubles the size of the hash table, pointing pairs of links to the original bucket and expanding only the overgrowing bucket (by restricting implementations to a specific family of hash functions). It has a two-level hierarchy: buckets (to store the directory entries) and a table of pointers (to the buckets). GPFS represents each bucket as a disk block and the pointer table as the block pointers in the directory's i-node. When the directory grows in size, GPFS allocates new blocks, moves some of the directory entries from the overgrowing block into the new block and updates the block pointers in the i-node. GPFS employs its client cache-consistency and distributed locking mechanisms to enable concurrent accesses to a shared directory. Concurrent readers can cache the directory blocks using shared reader locks, which enables high performance for read-intensive workloads. Concurrent writers, however, need to acquire write locks from the lock manager before updating the directory blocks stored on the 19

34 shared disk storage. When releasing (or acquiring) locks, earlier GPFS versions force the directory block to be flushed to disk (or read back from disk), inducing high I/O overhead. Newer releases of GPFS have modified the cache consistency protocol to send the directory insert requests directly to the current lock holder, instead of getting the block through the shared disk subsystem [22]. Still, GPFS continues to synchronously write the directory's i-node (i.e., the mapping state), invalidating client caches to provide strong consistency guarantees. Lustre proposed a clustered metadata [1] service which splits a directory, using a hash of the directory entries, only once over all available metadata servers when it exceeds a threshold size. The effectiveness of this split-once-and-for-all scheme depends on the eventual directory size and does not respond to dynamic increases in the number of servers. Ceph is another object-based cluster file system; it uses dynamic sub-tree partitioning of the namespace and hashes individual directories when they get too big or experience too many accesses. There has also been work in the area of designing distributed indexing schemes for metadata management. GIGA+ [16] examines the problem of scalable file system directories, motivated by data-intensive applications requiring millions to billions of small files to be ingested in a single directory at rates of hundreds of thousands of file creates every second. GIGA+ builds directories with millions or trillions of files with a high degree of concurrency. Compared to GPFS, GIGA+ allows the mapping state to be stale at the client and never be shared between servers, thus seeking even more scalability. Compared to Lustre and Ceph, GIGA+ splits a directory incrementally as a function of size, i.e., a small directory may be distributed over fewer servers than a larger one. Furthermore, GIGA+ facilitates dynamic server addition, achieving 20

35 balanced server load with minimal migration. This work is interesting, but it is more relevant for workloads where the directories have a huge fan-out factor or where the application creates millions or trillions of files in a single directory. In GIGA+ every server keeps only a local view of the partitions it is managing and no shared state is maintained, and hence there are no synchronization and consistency bottlenecks. But if the server or the partition goes down, or the root-level directory gets corrupted, then the files can no longer be accessed. 21
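The extendible hashing scheme described above for GPFS directories can be summarized in a compact sketch. The implementation below is illustrative only: the bucket capacity and hash function are arbitrary choices for this example, and real GPFS stores buckets in disk blocks and the pointer table in the directory's i-node rather than in Python lists. Only the overflowing bucket is split; the pointer table doubles only when that bucket's local depth has caught up with the global depth.

```python
# Compact sketch of Fagin-style extendible hashing: a pointer table of size
# 2**global_depth maps low-order hash bits to buckets, and only the
# overflowing bucket is split.  Capacity and hash are arbitrary choices.
import hashlib

BUCKET_CAPACITY = 4

def h(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = set()

class ExtendibleDirectory:
    def __init__(self):
        self.global_depth = 0
        self.table = [Bucket(0)]                     # 2**global_depth pointers

    def _bucket(self, name):
        return self.table[h(name) & ((1 << self.global_depth) - 1)]

    def insert(self, name):
        b = self._bucket(name)
        b.entries.add(name)
        if len(b.entries) > BUCKET_CAPACITY:
            self._split(b)

    def _split(self, b):
        if b.local_depth == self.global_depth:       # pointer table must double
            self.table = self.table + self.table
            self.global_depth += 1
        b0, b1 = Bucket(b.local_depth + 1), Bucket(b.local_depth + 1)
        bit = 1 << b.local_depth                     # distinguishing hash bit
        for name in b.entries:                       # redistribute the entries
            (b1 if h(name) & bit else b0).entries.add(name)
        for i, ptr in enumerate(self.table):         # repoint slots that held b
            if ptr is b:
                self.table[i] = b1 if i & bit else b0
        for nb in (b0, b1):                          # re-split if still too full
            if len(nb.entries) > BUCKET_CAPACITY:
                self._split(nb)

d = ExtendibleDirectory()
for i in range(100):
    d.insert(f"file{i:03d}")
print("global depth:", d.global_depth,
      "buckets:", len({id(b) for b in d.table}))
```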

36 Chapter 3: DELEGATING METADATA AT CLIENT SIDE (DMCS) In this chapter we focus on the problem faced by managing the entire namespace by a central coordinator. We propose our design, delegating metadata at client side, to handle the problem mentioned in section Before we delve into the design for delegating metadata at client side we first have a look at the Remote Procedure Call (RPC) processing in Lustre Filesystem. 3.1 RPC Processing in Lustre Filesystem When we consider the RPC processing in Lustre we also talk about how lock processing works in Lustre [5, 7, 3, 18] and how our modifications can benefit to minimize the number of LOOKUP RPC. Lets consider an example. Let us assume client C1 wants to open the file /tmp/lustre/d1/d2/foo.txt to read. In this case /tmp/lustre is our mount point. During the VFS path lookup, Lustre specific lookup routine will be invoked. The first RPC request is lock enqueue with lookup intent. This is sent to MDS for lock on d1. The second RPC request is also lock enqueue with lookup intent and is sent to MDS asking inodebits lock for d2. The lock returned is an inodebits lock, and its resources would be represented by the fid of d1 and d2. The subtle point to note is, when we request a lock, we generally need a resource 22

37 id for the lock we are requesting. However in this case, since we do not know the resource id for d1, we actually request a lock on its parent /, not on the d1 itself. In the intent, we specify it as a lookup intent and the name of the lookup is d1. Then, when the lock is returned, the lock is for d1. This lock is (or can be) different from what the client requested, and the client notices this difference and replaces the old lock requested with the new one returned. The third RPC request is a lock enqueue with open intent, but it is not asking for lock on foo.txt. That is, you can open and read a file without a lock from MDS since the content is provided by Object Storage Target(OST). OSS/OST also has a LDLM component and in order to perform I/O on the OSS/OST, we request locks from an OST. In other words, what happens at open is that we send a lock request, which means we do ask for a lock from LDLM server. But, in the intent data itself, we might (or not) set a special flag if we are actually interested in receiving the lock back. And the intent handler then decides (based on this flag), whether or not to return the lock. If foo.txt exists previously, then its fid, inode content (as in owner, group, mode, ctime, atime, mtime, nlink, etc.) and striping information are returned. If client C1 opens the file with the O CREAT flag and the file does not exist, the third RPC request will be sent with open and create intent, but still there will be no lock requests. Now on the MDS side, to create a file foo.txt under d2, MDS will request through LDLM for another EX lock on the parent directory. Note that this is a conflicting lock request with the previous CR lock on d2. Under normal circumstances, a fourth RPC request (blocking AST) will go to client C1 or anyone else who may have the conflicting locks, informing the client that someone is requesting a conflicting lock and requesting a lock cancellation. MDS waits until it gets a cancel RPC from the client. Only then does the MDS gets 23

38 the EX lock it was asking for earlier and can proceed. If client C1 opens the file with LOV DELAY flag, MDS creates the file as usual, but there is no striping and no objects are allocated. User will issue an ioctl call and set the stripe information, then the MDS will fill in the EA structure. 3.2 Existing Design In the following section we explain the existing approach followed by Lustre for metadata management. 1. When client 1 tries to open a file it sends a LOOKUP RPC to MDS. 2. In step 2 the processing is done at the MDS side where the Lock Manager will grant the lock for the resource requested by the Client. A second RPC will be sent from the client to the MDS with the intent to create or open the file. 3. So at the end of step 2 client 1 will get the lock, extended attribute (EA) information and other metadata details which the client will need to open the file successfully. 4. Once the client gets the EA information and the lock handle, the client can proceed ahead with the I/O operation. 5. MDS keeps track of the allocation by making use of queues. When multiple clients try to access the same file the new client will wait in the waiting queue till the time the original client who is the current owner of the lock releases the lock. The MDS will then hand over the lock to the new client. Say client2 wants to access the same file which was earlier opened by client1 then client2 24

39 will be placed in the waiting queue. The MDS will send a blocking AST to client1 to revoke the lock granted. Client1, on receiving the blocking AST, will release the lock. In a scenario where client1 is down or something else goes wrong, the MDS will wait for a ping timeout of 30 seconds, after which it will revoke the lock. Once the lock is revoked, the MDS will grant a lock handle and the EA for the file to client2. Client2 can proceed with I/O once it gets the lock handle and EA information. 3.3 Design and challenges for delegating metadata at client side Before moving ahead with the actual design of our approach, we discuss how Lustre Networking works and the communication module that we developed to perform remote memory copy operations. 3.3.1 Design of communication module We have designed a communication module for data movement. This communication module bypasses the normal Lustre Networking stack protocols and helps to perform remote memory data movement operations. We use the LNET API, which originated from Sandia Portals, to design the communication module. In our design of the communication module, we use the put and get APIs to do remote memory copy. The remote copy can be used by clients to copy metadata information from the client to whom the metadata has been delegated by the MDS. LNET identifies its peers using the LNET process id, which consists of a nid and a pid. The nid identifies the node, and the pid identifies the process on the node. For example, in the case of the socket Lustre Network Driver (LND) (and for all currently existing LNET LNDs), there is only one instance of LNET running in the kernel space; the process id therefore uses a 25

40 reserved ID (12345) to identify itself. Portal RPC is a client of the LNET layer. Portal RPC takes care of the RPC processing logic. A portal is composed of a list of match entries (MEs). Each ME can be associated with a buffer, which is described by a memory descriptor (MD). The ME itself defines match bits and ignore bits, which are 64-bit identifiers used to decide whether an incoming message can use the associated buffer space. Consider an example to illustrate the point. Say a client wants to read ten blocks of data from the server. It first sends an RPC request to the server stating that it wants to read ten blocks and that it is prepared for the bulk transfer (meaning the bulk buffer is ready). Then, the server initiates the bulk transfer. When the server has completed the transfer, it notifies the client by sending a reply. Looking at this data flow, it is clear that the client needs to prepare two buffers: one is associated with the bulk portal for the bulk RPC, and the other is associated with the reply portal. 3.3.2 Design of DMCS approach In the following section we explain the design details of the client-side metadata delegation approach. 1. When client 1 tries to open a file it sends a LOOKUP RPC to the MDS. 2. In step 2, the processing is done at the MDS side, where the Lock Manager will grant the lock for the resource requested by the client. A second RPC will be sent from the client to the MDS with the intent to create or open the file. So at the end of step 2, C1 will get the lock handle, EA information and other metadata details. Conceptually steps 1 and 2 are similar to what we have in the current Lustre design, but in our approach we modify step 2 slightly. 26
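Before going into the modified steps, the toy model below sketches the intuition behind delegating metadata to a client. It is illustrative only: the classes and the fallback logic are hypothetical, whereas the real DMCS design works inside Lustre's intent-lock RPC path and moves the delegated metadata with the LNET-based communication module described above. The sketch only shows the intended effect, namely that lookups served by a delegate do not cost the MDS an RPC, and that the MDS can revoke a delegation.

```python
# Illustrative-only sketch of client-side metadata delegation (hypothetical
# classes and names; not the actual DMCS implementation inside Lustre).
class MDS:
    def __init__(self):
        self.metadata = {}        # path -> attributes plus EA (layout) info
        self.rpcs = 0             # how many lookup RPCs the MDS has served
        self.delegations = {}     # directory -> delegate client

    def lookup(self, path):
        self.rpcs += 1
        return self.metadata[path]

    def delegate(self, directory, client):
        # Hand the metadata of one directory to a client, which then answers
        # lookups for that subtree on the MDS's behalf until it is revoked.
        self.rpcs += 1
        subtree = {p: m for p, m in self.metadata.items()
                   if p.startswith(directory + "/")}
        client.cache.update(subtree)
        self.delegations[directory] = client

    def revoke(self, directory):
        # Metadata revocation: invalidate the delegate's copy of the subtree.
        client = self.delegations.pop(directory)
        for p in list(client.cache):
            if p.startswith(directory + "/"):
                del client.cache[p]

class DelegateClient:
    def __init__(self):
        self.cache = {}

def open_file(path, mds, delegate):
    # Ask the delegate first; fall back to a LOOKUP RPC to the MDS.
    return delegate.cache.get(path) or mds.lookup(path)

mds = MDS()
for i in range(1000):
    mds.metadata[f"/d1/file{i}"] = {"ea": (f"obj{i}", "OST p")}
c1 = DelegateClient()
mds.delegate("/d1", c1)
for i in range(1000):
    open_file(f"/d1/file{i}", mds, c1)
print("MDS RPCs with delegation:", mds.rpcs)       # 1 instead of 1000
mds.revoke("/d1")
open_file("/d1/file0", mds, c1)                    # now goes back to the MDS
print("MDS RPCs after revocation:", mds.rpcs)      # 2
```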


More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

AN OVERVIEW OF DISTRIBUTED FILE SYSTEM Aditi Khazanchi, Akshay Kanwar, Lovenish Saluja

AN OVERVIEW OF DISTRIBUTED FILE SYSTEM Aditi Khazanchi, Akshay Kanwar, Lovenish Saluja www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 2 Issue 10 October, 2013 Page No. 2958-2965 Abstract AN OVERVIEW OF DISTRIBUTED FILE SYSTEM Aditi Khazanchi,

More information

Distributed System. Gang Wu. Spring,2018

Distributed System. Gang Wu. Spring,2018 Distributed System Gang Wu Spring,2018 Lecture7:DFS What is DFS? A method of storing and accessing files base in a client/server architecture. A distributed file system is a client/server-based application

More information

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment. Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed. File-System Structure File structure Logical storage unit Collection of related information File

More information

Remote Directories High Level Design

Remote Directories High Level Design Remote Directories High Level Design Introduction Distributed Namespace (DNE) allows the Lustre namespace to be divided across multiple metadata servers. This enables the size of the namespace and metadata

More information

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani The Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani CS5204 Operating Systems 1 Introduction GFS is a scalable distributed file system for large data intensive

More information

Current Topics in OS Research. So, what s hot?

Current Topics in OS Research. So, what s hot? Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general

More information

Chapter 11: Implementing File

Chapter 11: Implementing File Chapter 11: Implementing File Systems Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

Google File System. By Dinesh Amatya

Google File System. By Dinesh Amatya Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable

More information

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles INF3190:Distributed Systems - Examples Thomas Plagemann & Roman Vitenberg Outline Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles Today: Examples Googel File System (Thomas)

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition Chapter 11: Implementing File Systems Operating System Concepts 9 9h Edition Silberschatz, Galvin and Gagne 2013 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

Chapter 10: File System Implementation

Chapter 10: File System Implementation Chapter 10: File System Implementation Chapter 10: File System Implementation File-System Structure" File-System Implementation " Directory Implementation" Allocation Methods" Free-Space Management " Efficiency

More information

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures

GFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures GFS Overview Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures Interface: non-posix New op: record appends (atomicity matters,

More information

HLD For SMP node affinity

HLD For SMP node affinity HLD For SMP node affinity Introduction Current versions of Lustre rely on a single active metadata server. Metadata throughput may be a bottleneck for large sites with many thousands of nodes. System architects

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems INTRODUCTION. Transparency: Flexibility: Slide 1. Slide 3.

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems INTRODUCTION. Transparency: Flexibility: Slide 1. Slide 3. CHALLENGES Transparency: Slide 1 DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems ➀ Introduction ➁ NFS (Network File System) ➂ AFS (Andrew File System) & Coda ➃ GFS (Google File System)

More information

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application

More information

Bigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng

Bigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng What is Bigtable? - A storage system for managing structured data - Used in 60+ Google services - Motivation:

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3

More information

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Operating Systems Lecture 7.2 - File system implementation Adrien Krähenbühl Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Design FAT or indexed allocation? UFS, FFS & Ext2 Journaling with Ext3

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung ACM SIGOPS 2003 {Google Research} Vaibhav Bajpai NDS Seminar 2011 Looking Back time Classics Sun NFS (1985) CMU Andrew FS (1988) Fault

More information

What is a file system

What is a file system COSC 6397 Big Data Analytics Distributed File Systems Edgar Gabriel Spring 2017 What is a file system A clearly defined method that the OS uses to store, catalog and retrieve files Manage the bits that

More information

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John

More information

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions Roger Goff Senior Product Manager DataDirect Networks, Inc. What is Lustre? Parallel/shared file system for

More information

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following: CS 470 Spring 2018 Mike Lam, Professor Distributed Web and File Systems Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapters

More information

CSE 124: Networked Services Fall 2009 Lecture-19

CSE 124: Networked Services Fall 2009 Lecture-19 CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but

More information

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

! Design constraints.  Component failures are the norm.  Files are huge by traditional standards. ! POSIX-like Cloud background Google File System! Warehouse scale systems " 10K-100K nodes " 50MW (1 MW = 1,000 houses) " Power efficient! Located near cheap power! Passive cooling! Power Usage Effectiveness = Total

More information

File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018

File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018 File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018 Today s Goals Supporting multiple file systems in one name space. Schedulers not just for CPUs, but disks too! Caching

More information

DNE2 High Level Design

DNE2 High Level Design DNE2 High Level Design Introduction With the release of DNE Phase I Remote Directories Lustre* file systems now supports more than one MDT. This feature has some limitations: Only an administrator can

More information

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs

More information

An Introduction to GPFS

An Introduction to GPFS IBM High Performance Computing July 2006 An Introduction to GPFS gpfsintro072506.doc Page 2 Contents Overview 2 What is GPFS? 3 The file system 3 Application interfaces 4 Performance and scalability 4

More information

CSE 153 Design of Operating Systems

CSE 153 Design of Operating Systems CSE 153 Design of Operating Systems Winter 2018 Lecture 22: File system optimizations and advanced topics There s more to filesystems J Standard Performance improvement techniques Alternative important

More information

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication

More information

Integrity in Distributed Databases

Integrity in Distributed Databases Integrity in Distributed Databases Andreas Farella Free University of Bozen-Bolzano Table of Contents 1 Introduction................................................... 3 2 Different aspects of integrity.....................................

More information

Lustre SMP scaling. Liang Zhen

Lustre SMP scaling. Liang Zhen Lustre SMP scaling Liang Zhen 2009-11-12 Where do we start from? Initial goal of the project > Soft lockup of Lnet on client side Portal can have very long match-list (5K+) Survey on 4-cores machine >

More information

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis Motivation Lots of (semi-)structured data at Google URLs: Contents, crawl metadata, links, anchors, pagerank,

More information

Distributed Systems. Hajussüsteemid MTAT Distributed File Systems. (slides: adopted from Meelis Roos DS12 course) 1/25

Distributed Systems. Hajussüsteemid MTAT Distributed File Systems. (slides: adopted from Meelis Roos DS12 course) 1/25 Hajussüsteemid MTAT.08.024 Distributed Systems Distributed File Systems (slides: adopted from Meelis Roos DS12 course) 1/25 Examples AFS NFS SMB/CIFS Coda Intermezzo HDFS WebDAV 9P 2/25 Andrew File System

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University Chapter 11 Implementing File System Da-Wei Chang CSIE.NCKU Source: Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University Outline File-System Structure

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

RAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System

RAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System RAIDIX Data Storage Solution Clustered Data Storage Based on the RAIDIX Software and GPFS File System 2017 Contents Synopsis... 2 Introduction... 3 Challenges and the Solution... 4 Solution Architecture...

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

Chapter 11: Implementing File Systems

Chapter 11: Implementing File Systems Silberschatz 1 Chapter 11: Implementing File Systems Thursday, November 08, 2007 9:55 PM File system = a system stores files on secondary storage. A disk may have more than one file system. Disk are divided

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage

IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage Silverton Consulting, Inc. StorInt Briefing 2017 SILVERTON CONSULTING, INC. ALL RIGHTS RESERVED Page 2 Introduction Unstructured data has

More information

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following: CS 470 Spring 2017 Mike Lam, Professor Distributed Web and File Systems Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapters

More information

LUSTRE NETWORKING High-Performance Features and Flexible Support for a Wide Array of Networks White Paper November Abstract

LUSTRE NETWORKING High-Performance Features and Flexible Support for a Wide Array of Networks White Paper November Abstract LUSTRE NETWORKING High-Performance Features and Flexible Support for a Wide Array of Networks White Paper November 2008 Abstract This paper provides information about Lustre networking that can be used

More information

COMP520-12C Final Report. NomadFS A block migrating distributed file system

COMP520-12C Final Report. NomadFS A block migrating distributed file system COMP520-12C Final Report NomadFS A block migrating distributed file system Samuel Weston This report is in partial fulfilment of the requirements for the degree of Bachelor of Computing and Mathematical

More information

Hierarchical Chubby: A Scalable, Distributed Locking Service

Hierarchical Chubby: A Scalable, Distributed Locking Service Hierarchical Chubby: A Scalable, Distributed Locking Service Zoë Bohn and Emma Dauterman Abstract We describe a scalable, hierarchical version of Google s locking service, Chubby, designed for use by systems

More information

File System Implementation

File System Implementation File System Implementation Last modified: 16.05.2017 1 File-System Structure Virtual File System and FUSE Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance. Buffering

More information

File System Internals. Jo, Heeseung

File System Internals. Jo, Heeseung File System Internals Jo, Heeseung Today's Topics File system implementation File descriptor table, File table Virtual file system File system design issues Directory implementation: filename -> metadata

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

Chapter 11: Implementing File-Systems

Chapter 11: Implementing File-Systems Chapter 11: Implementing File-Systems Chapter 11 File-System Implementation 11.1 File-System Structure 11.2 File-System Implementation 11.3 Directory Implementation 11.4 Allocation Methods 11.5 Free-Space

More information

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System

HDFS: Hadoop Distributed File System. Sector: Distributed Storage System GFS: Google File System Google C/C++ HDFS: Hadoop Distributed File System Yahoo Java, Open Source Sector: Distributed Storage System University of Illinois at Chicago C++, Open Source 2 System that permanently

More information