DJ NFS: A Distributed Java-Based NFS Server

Jeffrey Bergamini and Brian Wood
California Polytechnic State University, San Luis Obispo, CA
jbergami@calpoly.edu, bmwood@calpoly.edu

Abstract

In an effort to improve the scalability and portability of a standard NFS server, we have implemented DJ NFS, a cross-platform distributed network file system. The system is easy to install and use, and it allows standard NFS clients to interact seamlessly with a file system distributed over a cluster of heterogeneous file storage servers. This paper presents the architecture of our system and discusses some of the advantages and disadvantages of the design. This is followed by a presentation of tests conducted to measure the performance of our system in comparison with a standard NFS server, as well as a discussion of possible future improvements.

Introduction

NFS is a well-supported protocol for access to remote file systems, usually through the exchange of UDP packets [NFS] over a LAN. NFS is typically implemented in a multiple-client, single-server configuration. A major limitation of this paradigm is its reliance on the finite resources that a single server can provide to multiple clients.

DJ NFS consists of a cluster of two types of systems: a single metadata server and multiple mediators. Mediators present clients with an interface identical to that of a typical NFS (version 2) server, providing multiple clients with a universal view of file storage across the system. Unknown to clients, file storage is actually distributed across the multiple machines on which mediators run.

Figure 1. An example DJ NFS configuration

The metadata server maintains information about the structure of the file system, including directory structure, file attributes, and distribution information. To satisfy client requests, mediators communicate with the metadata server and with other mediators using a slightly altered version of the NFS protocol. Like NFS, this modified protocol is implemented using remote procedure calls (RPCs).

Written entirely in Java, a DJ NFS system can be deployed on any group of machines capable of running a JVM. Ideal settings for DJ NFS are Beowulf clusters, public labs, or any other environment where many machines with unused disk capacity are available on a local network.
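Because mediators present a standard NFS version 2 interface, clients need no special software; on a Linux client, mounting a DJ NFS mediator might look like the following (a hypothetical invocation: mediator1 and the root export path are illustrative placeholders, not names from our implementation):

    mount -t nfs -o nfsvers=2,udp mediator1:/ /mnt/djnfs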

Overview

As an overview of the DJ NFS architecture, we describe the steps used to satisfy the common client requests create, write, and read. Before doing so, it is worth describing dummy files, a feature the metadata server uses to maintain file system coherence.

Dummy files

A dummy file is a simple text file that contains three pieces of information: the owning mediator's IP address, a file handle, and a file size. Dummy files are created within a special root directory on the metadata server (e.g. DJNFSROOT). The location of a dummy file within or under this root directory exactly mirrors the client's view of that file's location. For example, if a client looks at the directory /mnt/djnfs/usr/tmp on a DJ NFS system mounted at /mnt/djnfs and sees two 5 MB files in that directory, filea.txt and fileb.txt, then the metadata server will have two dummy files located in DJNFSROOT/usr/tmp named filea.txt and fileb.txt. The actual 5 MB of data comprising each file is stored on one or two mediators, whose addresses are stored in the dummy files.

Dummy files allow the metadata server to satisfy some attribute lookups (e.g. GETATTR NFS calls) using Java procedures on the actual file system, and they allow the file system to be persistently stored. Except at server start time, a running metadata server does not open dummy files to look up file owners, file sizes, or file handles; that information is kept in memory in hash tables.
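The following is a minimal sketch of dummy-file handling, assuming a whitespace-separated on-disk text format; the class and field names are hypothetical, not DJ NFS's actual code, and a string/long handle stands in for NFS's 32-byte opaque file handle.

import java.io.*;
import java.util.Scanner;

class DummyFile {
    String ownerIp;   // mediator that stores the real file data
    long fileHandle;  // global handle assigned by the metadata server
    long fileSize;    // current size in bytes

    // Parse a dummy file from disk (used only at server start time).
    static DummyFile load(File f) throws IOException {
        try (Scanner in = new Scanner(f)) {
            DummyFile d = new DummyFile();
            d.ownerIp = in.next();
            d.fileHandle = in.nextLong();
            d.fileSize = in.nextLong();
            return d;
        }
    }

    // Persist the metadata back to its mirror location under DJNFSROOT.
    void store(File f) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(f))) {
            out.println(ownerIp + " " + fileHandle + " " + fileSize);
        }
    }
}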

Create

A create operation proceeds as follows:

1. The NFS client makes a call to its mediator.
2. The mediator forwards the call to the metadata server.
3. The metadata server creates a local dummy file named according to the file's global name, creates a new file handle, and returns the file handle to the mediator.
4. The mediator creates a storage file named after the file handle, then returns to the client.

Figure 2. The steps taken for file creation

When a client requests file creation, the mediator to which that client is connected becomes the owner and actual storage location of the file. Mediators forward create requests to the metadata server so that it can update the file system structure by creating a dummy file and return a new file handle. Once a file handle has been returned, the mediator creates a file, named after the file handle, inside a directory that should not be accessible to local users. The mediator then returns to the NFS client.

Write

A write operation with remote handling proceeds as follows:

1. The NFS client makes a call to its mediator.
2.* The mediator makes a call to the metadata server.
3.* The metadata server tells the mediator who owns the file.
4. The mediator makes a write request to the owner.
5. The owner mediator returns from the write operation.
6. The mediator returns to the client from the write request.
7. The owner informs the metadata server of the write's completion.
8. The metadata server updates the dummy file.

* These operations occur only once per file.

Figure 3. An example of how a write operation is handled in DJ NFS

When a client makes a write request for a file not stored on its mediator, the mediator must discover the owner of the file to be written. Mediators cache this information in memory after the first time a file is accessed, but initially a mediator must request the owner's IP address from the metadata server. Once the owner is known, the mediator forwards the write request to the mediator that owns the file. That mediator performs the actual write on its locally stored file. After writing the data, the owner returns from the write request and informs the metadata server that a write operation has completed.

Read

Read operations proceed in almost exactly the same fashion as writes. When a mediator receives a read request for a file not stored locally, the request is forwarded to the owner mediator. That mediator performs the read and sends the data to the initial mediator, which sends the data to the client. Unlike writes, read completions do not trigger calls to the metadata server, except when the owner's address must first be discovered.
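The write path described above can be summarized in a short sketch (hypothetical names, under stated assumptions: string file handles stand in for NFS's opaque handles, and the RPC plumbing is reduced to two interfaces). Note that, as in step 7 above, the metadata update is issued by the mediator that performs the local write, i.e. the owner.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface MetadataServer {
    String lookupOwner(String fileHandle);                // who stores the data
    void updateAttributes(String fileHandle, long size);  // post-write update
}

interface RemoteMediator {
    void write(String fileHandle, long offset, byte[] data);
}

abstract class Mediator {
    private final String localAddress;
    private final MetadataServer mds;
    // Ownership cache: after the first access, no MDS round trip is needed.
    private final Map<String, String> ownerCache = new ConcurrentHashMap<>();

    Mediator(String localAddress, MetadataServer mds) {
        this.localAddress = localAddress;
        this.mds = mds;
    }

    // Handle an NFS WRITE call from a client.
    void handleWrite(String fileHandle, long offset, byte[] data) {
        String owner = ownerCache.computeIfAbsent(fileHandle, mds::lookupOwner);
        if (owner.equals(localAddress)) {
            writeLocal(fileHandle, offset, data);               // we own the file
            mds.updateAttributes(fileHandle, offset + data.length);
        } else {
            connectTo(owner).write(fileHandle, offset, data);   // forward to owner,
        }                                                       // which updates the MDS
    }

    abstract void writeLocal(String fileHandle, long offset, byte[] data);
    abstract RemoteMediator connectTo(String mediatorAddress);
}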

Design Tradeoffs and Performance Optimizations

Speed vs. Capacity

The main tradeoff in DJ NFS is the sacrifice of speed in exchange for added capacity and flexibility of the file system. The figures above show this fairly bluntly. In standard NFS, the only network traffic is between client and server; in the example of a write, that traffic consists of a single request and a single response. DJ NFS introduces the potential for much more communication within the system, as illustrated in Table 1.

Standard NFS write:
1. Client -> Server (call)
2. Server -> Client (return)

DJ NFS write, best case (client's mediator owns the file):
1. Client -> Med (call)
2. Med -> MDS (update file info)
3. MDS -> Med (return)
4. Med -> Client (return)

DJ NFS write, worst case (another mediator is the owner; file not previously accessed):
1. Client -> Med (call)
2. Med -> MDS (find owner)*
3. MDS -> Med (return)*
4. Med -> OwnerMed (call owner)
5. OwnerMed -> MDS (update file info)
6. OwnerMed -> Med (return)
7. Med -> Client (return)

Table 1: Relative numbers of RPCs for a write operation

For every operation, DJ NFS has the standard NFS traffic, but additionally requires some amount of traffic from the calling mediator to the metadata server (MDS) and/or another mediator. This amount of communication sounds as if it would create a large performance hit. In many cases it does, but DJ NFS is also optimized to minimize that cost in certain situations.

The overhead of the best case in Table 1 can be nearly eliminated if the mediator is running on the same machine as the client. In that case, the communication between client and mediator travels over the local loopback network device, which is very fast. Every mediator knows at all times which files it owns, so in this case there is no need to ask the MDS who owns the file. The only other traffic is a call to update the metadata on the MDS for the given file (a SETATTR call), which is always less than a kilobyte of data. On a fast network this is hardly an issue.

* There is also a medium case between best and worst. Each mediator keeps a cache of file ownership, so if its client performs an operation on a file owned by another mediator, the mediator needs to query the metadata server for ownership information only once.

Single Server Design

The choice to use a single metadata server was based mostly on the fact that any other design would likely introduce even more overhead per operation. However, the design also sacrifices some fault tolerance and introduces a possible bottleneck.

As previously mentioned, DJ NFS uses a single metadata server per file system. This is convenient in terms of synchronizing the metadata itself, since changes to the structure of the file system need not propagate anywhere else.

Keeping the metadata as consistent as standard NFS does is trivial using this method.

A downside to the single-server model is that it creates a single point of failure. If the metadata server goes down, the file system as a whole is not functional until it is recovered. The DJ NFS metadata server and mediators do include functionality to recover from such a crash, as long as the data stored on disk isn't lost. Alternate designs might involve multiple metadata servers or, in the extreme case, make every mediator responsible for the metadata of the files it owns. These options are discussed in Future Work.

The metadata server may also have scalability issues in a system where many files are being created, removed, or written to simultaneously. Each of these operations involves a call to the metadata server, so in a system with many active mediators, it might be possible to overwhelm the metadata server with calls. Possible solutions are discussed in Future Work.

Capacity and Speed vs. Reliability

We made a conscious decision not to build any file data redundancy into DJ NFS. This was partly a matter of convenience, but mostly a way of staying true to our goal of maximized file system capacity and NFS-like operation. It would be relatively trivial to add some kind of automatic, distributed backup scheme to DJ NFS: any change to a file could be mirrored on a backup copy of the same file on another mediator. However, this would automatically halve total file system capacity. Worse, it would double the amount of file data traveling across the network during a write. Still worse, it would likely compound the possible metadata server bottleneck: much of the traffic to and from the metadata server is notification of write completion, and in order to independently verify write completion on not only the file but also its backup copy, this traffic might be doubled.

Another easy addition would be a backup metadata server, similar to that of the Google File System. The metadata server could notify mediators of the backup server when MOUNT is called and constantly echo any metadata changes to the backup server. Mediators could then try the backup server if the main server stopped responding.
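A minimal sketch of this echoing idea follows (not implemented in DJ NFS; all names here are hypothetical). Every metadata mutation is applied locally and then echoed to a standby that can take over on failure.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class MetadataOp {
    final String type;  // e.g. "CREATE", "REMOVE", "SETATTR"
    final String path;  // global path of the affected dummy file
    final long size;    // new file size, for SETATTR
    MetadataOp(String type, String path, long size) {
        this.type = type; this.path = path; this.size = size;
    }
}

interface MetadataStore {
    void apply(MetadataOp op);  // mutate the dummy-file tree and in-memory tables
}

class ReplicatingMetadataServer implements MetadataStore {
    private final MetadataStore primary;
    private final List<MetadataStore> backups = new CopyOnWriteArrayList<>();

    ReplicatingMetadataServer(MetadataStore primary) { this.primary = primary; }

    void addBackup(MetadataStore backup) { backups.add(backup); }

    @Override
    public void apply(MetadataOp op) {
        primary.apply(op);               // commit to the primary's state first
        for (MetadataStore b : backups)  // then echo the change to each standby
            b.apply(op);
    }
}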

Description of Performance Tests

The goal of our performance testing is to evaluate our system in comparison with a typical NFS server. To do so, we have run file system benchmarks in controlled test environments. These tests are meant to provide a gross measure of the execution-time overhead and costs of our approach. They help quantify the penalties associated with 1) user-space, interpreted Java execution, and 2) the increased computation and network hops required for distributed file management. For our tests, we employed IOZone [IOZone], an automated benchmarking tool designed to measure a wide array of file system operations.

IOZone is capable of testing many file system operations, but we have concentrated on evaluating only reads and writes. According to IOZone, write tests measure the performance of writing a new file, while read tests measure the performance of reading an existing file. The tests use file sizes ranging from 64 KB to 256 MB and record sizes ranging from 4 KB to 256 KB.

We employed four test environments. For all tests, the NFS client is that provided with the Knoppix Linux distribution (kernel version 2.4.x) [Knoppix]; the standard NFS server is likewise that provided with Knoppix. Our tests were run using NFS clients interfacing with four types of NFS servers:

1. A standard NFS server running on the local client machine.
2. A standard NFS server running on a remote machine.
3. DJ NFS with the mediator running on the local machine.
4. DJ NFS with the mediator running on a remote machine.

All machines in use are identical: desktops connected on a 100 Mbps LAN, with Pentium processors, 512 MB of RAM, and FAT32 partitions on 7,200 RPM drives. They run a combination of Windows XP Professional and Knoppix 3.3, with version 1.4.2_04 of Sun's JRE.
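For reference, runs over these ranges correspond to an IOZone invocation of roughly the following form (a reconstruction for illustration; the exact flags used in our runs are not recorded here):

    iozone -a -i 0 -i 1 -n 64k -g 256m -y 4k -q 256k

Here -i 0 and -i 1 select the write and read tests, -n and -g bound the file sizes, and -y and -q bound the record sizes.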

Test Results

Averaged across all file sizes and record sizes tested, our results produced the following comparisons between our system and a standard NFS server:

Operation  Comparison                    Result
Read       Local mediator, local NFS     DJ NFS is 67% slower than NFS
Write      Local mediator, local NFS     DJ NFS is 22% faster than NFS
Read       Remote mediator, remote NFS   DJ NFS is 78% slower than NFS
Write      Remote mediator, remote NFS   DJ NFS is 2% slower than NFS

Table 2. Summary of test results

DJ NFS reads from a local mediator are likely slower than their NFS counterparts because our implementation of the mediator does not employ any type of file caching. Conversely, DJ NFS writes show a performance improvement over NFS, likely due to asynchronous file I/O internal to the JVM.

A more detailed inspection of the benchmarks reveals several notable features. When writing smaller files, remote mediators perform faster than remote NFS servers (see Figures 4 and 5). Mediators are slow at reading from large files, likely because (having no caching ability) they must open the file and seek to an arbitrary point before doing the actual read.

Figure 4. Benchmarking results of file writing by a remote NFS server (throughput in KB/s by file size and record size, both in KB).

Figure 5. Benchmarking results of file writing by a remote DJ NFS mediator (throughput in KB/s by file size and record size, both in KB).

Figure 6. Benchmarking results of file reading by a remote DJ NFS mediator (throughput in KB/s by file size and record size, both in KB).

Figure 7. Benchmarking results of file reading by a remote NFS server (throughput in KB/s by file size and record size, both in KB).

Related Work

Research on distributed file systems has been active for over 25 years, and such work has produced an enormous variety of systems (Braam 1999).

As a short review of distinctive systems, we present details of 1) NFSP (Lombard and Denneulin 2002), a simple distributed NFS system; 2) the Google File System (GFS) (Ghemawat et al. 2003), a robust system that serves terabytes of files for Google's particular application demands; and 3) xFS (Anderson et al. 1995), a highly distributed, complex, high-performance system. Given the time constraints on our implementation effort, DJ NFS does not provide all the services of the above systems; recommendations for improvements are discussed in Future Work.

NFSP

NFSP harnesses the storage resources of multiple cooperating machines and provides a standard NFS interface to clients. Like DJ NFS, NFSP relies on two types of components: a single metadata server and multiple I/O daemons that actually store file data. Like our system, the metadata server uses simple, small text files to persistently store the structure of the global file system. Unlike DJ NFS, NFSP routes all write operations through the metadata server. We have built our system such that data associated with reads and writes never touches the metadata server: by caching network addresses, mediators in our system forward reads and writes directly to the appropriate file-owning mediators. This approach reduces the load on the metadata server and obviates the need for the IP spoofing that NFSP employs to provide RPC source and destination consistency.

The Google File System

The Google File System is a proprietary system that exemplifies some of the advantages distributed file systems have over traditional network file systems. GFS, like DJ NFS, is comprised of a single metadata server and multiple file storage servers. Unlike DJ NFS, GFS does not provide an NFS interface. Instead, GFS handles typical file operations such as read, write, create, and delete using methods optimized specifically to suit Google's application requirements. Unlike DJ NFS, GFS clients interact directly with both the metadata server and the file storage servers; such a design is possible because GFS consists of customized clients that do not require an NFS interface. As deployed, the GFS system is said to consist of thousands of cheap, commodity file storage servers and to serve hundreds of terabytes of data. In such a setting, system failure is the norm, and accordingly GFS implements several features that increase fault tolerance, such as file replication and a backup mechanism for quick metadata failover and recovery.

xFS

xFS is billed as a serverless network file system, in which file storage, memory caching, and system control are completely distributed across a cluster of cooperating workstations. xFS provides clients with a typical NFS interface and is built using multiple storage servers, cleaners, and managers. The xFS approach allows any workstation in the system to control or serve any of the files in the system. To provide high performance, xFS employs disk striping and cooperative caching, allowing the system to take full advantage of the bandwidth available in fast local area networks.

The use of multiple file managers in xFS is a departure from the metadata management model of our system, NFSP, and GFS. This feature of xFS, along with its use of a log-structured file system and file replication, offers fault tolerance advantages.

Future Work

Support for NFS Version 3

DJ NFS currently supports only NFS version 2 and its underlying semantics. Support for version 3 of the protocol (Callaghan et al. 1995) would likely address some of DJ NFS's weaknesses. NFS 3 introduces mechanisms that allow the client to optimize the way the server handles a given file, using more flexible data sizes and the COMMIT call.

NFS 2 file operations are completely stateless; any read or write calls are expected to be independent of each other. There is no way in NFS 2 to open a file, read or write it as desired, then close it. No matter how often a file is used, there is no good way to keep it open or to close it. Some NFS servers do allow the option to mount the file system asynchronously (the async option), where the server may choose to keep a file open and/or delay writes for a certain amount of time in an attempt to batch them. However, this is generally considered unsafe due to the race conditions it introduces.

NFS 3 introduces the COMMIT call, which provides the benefits of the open/access/close paradigm. Based on the assumption that a client knows more about how a file is being used than the server does, COMMIT gives the client much more control over how the server handles the files it is accessing. A read or write from a client is interpreted as an open by the server. If successive reads and writes for the same file arrive at the server, the file is kept open. The file is not closed until the client issues a COMMIT call for that specific file, or until the server determines that it hasn't been accessed for a while.

The COMMIT call would do a lot to improve DJ NFS's performance and to relieve the possible bottleneck of the single metadata server. The attribute updates sent from mediator to metadata server after each write could be batched until the client calls COMMIT on a file. In addition to lightening the load on the metadata server, this would probably go a long way toward bringing DJ NFS's performance closer to standard NFS. It would considerably reduce communication overhead, and it would allow mediators to keep files open rather than reopening them and seeking to the appropriate offset for every read or write received.

NFS 2 is limited to 8192 bytes of file data per call: when a client reads or writes a file, each independent call may transmit no more than 8192 bytes of data. For example, writing 256 megabytes of data into a file requires at least 268,435,456 / 8192 = 32,768 write calls. Combine that with the stateless file handling, and you get the RPC, the server opening the file, writing to it, closing it, and returning from the RPC, repeated 32,768 times. NFS 3 places no artificial limits [RFC1813] on the data size per RPC. Data size is limited by the size of a UDP packet, so a well-designed client can optimize the calls it makes based on the demands of applications. These benefits for standard NFS would apply equally to DJ NFS.
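To make the batching idea above concrete, here is a minimal sketch of how a mediator might defer attribute updates until COMMIT (a hypothetical extension, not part of the current NFS 2 implementation; the MetadataServer interface is abridged from the earlier write sketch):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface MetadataServer {
    void updateAttributes(String fileHandle, long size);  // abridged
}

class CommitBatcher {
    private final MetadataServer mds;
    // Latest known size of each dirty file; no MDS traffic until COMMIT.
    private final Map<String, Long> dirtySizes = new ConcurrentHashMap<>();

    CommitBatcher(MetadataServer mds) { this.mds = mds; }

    // Called after each local write; records the new size only.
    void noteWrite(String fileHandle, long newSize) {
        dirtySizes.merge(fileHandle, newSize, Math::max);
    }

    // Called when the client issues COMMIT: one SETATTR instead of many.
    void commit(String fileHandle) {
        Long size = dirtySizes.remove(fileHandle);
        if (size != null)
            mds.updateAttributes(fileHandle, size);
    }
}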

One more benefit of NFS 3 would be a general decrease in the number of RPCs involved. Operations that change file attributes in NFS 2 do not by default return the new attributes, so a subsequent GETATTR call is necessary for verification; NFS 3 piggybacks the new attributes on the return from the original operation. Reading directory contents is also much improved in terms of RPC calls: in NFS 2, a READDIR call returns a list of file handles but not their attributes, so one GETATTR per file is necessary to read detailed directory information, whereas NFS 3 piggybacks file attributes onto the return of READDIRPLUS, eliminating all those extra RPCs. NFS 3 also moves to 64-bit file sizes and offsets and larger file handles, greatly extending the maximum file size and number of files.

Distribution of Metadata

We have repeatedly addressed the potential metadata server bottleneck, the overhead of frequent communication between it and the mediators, and the single point of failure. While these issues could be largely addressed by supporting NFS 3 and adding a backup metadata server, the option remains to move away from the single-server model. A modified DJ NFS might use multiple metadata servers, or even eliminate them altogether and migrate their functionality into the mediators. This would make DJ NFS very similar to Berkeley's xFS. It would also introduce quite a bit more complexity, and would probably be a wash in terms of improving fault tolerance, but the possibility is worth investigating.

Distribution of Files

A nice addition to DJ NFS would be the ability to spread the ownership of a file over multiple mediators. This would address the issue of storing very large files on DJ NFS, since the current design doesn't allow any single file to be larger than the available capacity of the mediator that owns it. If a mediator could own a chunk of a file rather than the whole thing, that limitation would no longer be an issue. It would also introduce the possibility of load-balancing frequently accessed files by distributing them in segments over multiple mediators. This would complicate metadata storage, as well as the behavior of mediators on reads and writes; for example, a read or a write may fall across an ownership boundary, making the completion of that call more complicated.
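A minimal sketch of fixed-size chunk ownership follows (a hypothetical extension; current DJ NFS assigns whole files to a single mediator, and the 64 MB chunk size is an arbitrary assumption). A byte range that crosses a boundary is split into per-owner pieces, illustrating the extra complexity noted above.

class ChunkMap {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // assumed chunk size
    private final String[] owners;  // owners[i] is the mediator for chunk i

    ChunkMap(String[] owners) { this.owners = owners; }

    String ownerOf(long offset) {
        return owners[(int) (offset / CHUNK_SIZE)];
    }

    // Split [offset, offset+length) at chunk boundaries; a call that crosses
    // a boundary becomes two smaller calls to two different mediators.
    void forEachPiece(long offset, long length, PieceHandler h) {
        long end = offset + length;
        while (offset < end) {
            long chunkEnd = (offset / CHUNK_SIZE + 1) * CHUNK_SIZE;
            long pieceLen = Math.min(end, chunkEnd) - offset;
            h.handle(ownerOf(offset), offset, pieceLen);
            offset += pieceLen;
        }
    }

    interface PieceHandler {
        void handle(String owner, long offset, long length);
    }
}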

Citations

[NFS] NFS: Network File System Protocol Specification. IETF RFC 1094, March 1989.

[RFC1813] Callaghan, B., Pawlowski, B., and Staubach, P. NFS Version 3 Protocol Specification. IETF RFC 1813, June 1995.

[IOZone] The IOzone Filesystem Benchmark, 2004.

[Knoppix] Knoppix Linux, 2004.

Anderson, T., Dahlin, M., Neefe, J., Patterson, D., Roselli, D., and Wang, R. Serverless Network File Systems. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, 1995.

Braam, P. File Systems for Clusters from a Protocol Perspective. In Second Extreme Linux Workshop, USENIX Annual Technical Conference, June 1999.

Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), October 19-22, 2003.

Lombard, P. and Denneulin, Y. nfsp: A Distributed NFS Server for Clusters of Workstations. In Proceedings of the International Parallel and Distributed Processing Symposium, April 15-19, 2002.
