Avoiding the Cache Coherence Problem in a Parallel/Distributed File System

Toni Cortes, Sergi Girona, Jesus Labarta
Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya - Barcelona
{toni, sergi, jesus}@ac.upc.es

Abstract

In this paper we present PAFS, a new parallel/distributed file system. Within the whole file system, special interest is placed on the caching and prefetching mechanisms. We present a cooperative cache that avoids the coherence problem while remaining highly scalable and achieving very good performance. We also present an aggressive prefetching algorithm that allows full utilization of the big caches offered by the cooperative cache mechanism.

Keywords: Input/Output, Parallel/Distributed File System, Cooperative Cache, Aggressive Prefetching, PAFS, xfs.

1 Introduction

In recent years, a great deal of work has been devoted to parallel I/O and parallel/distributed file systems. This work has produced many different file systems along with as many different caching policies. A common theme in most of these file systems is the idea of cooperation between nodes in order to achieve better system performance. A good example is the idea of cooperative caches. In a cooperative cache, all nodes work together in order to build a global cache. This kind of cooperation increases the cache size and the hit ratio, thus improving the file system performance.

This cooperation between nodes raises a very important problem: keeping the shared information coherent. A file system with a cooperative cache usually has to implement complicated and expensive mechanisms in order to avoid incoherences in the cached data. In this paper we present a simple and efficient solution to this problem.

PAFS is a parallel/distributed file system designed to work on a parallel machine or a network of workstations. Each node in the network runs a micro-kernel operating system and all services are handled by user-level servers. Besides, each node may have none, one or even several disks connected to it. As part of this file system, we have also implemented a cooperative cache algorithm that avoids the coherence problem. The solution presented not only removes the coherence problem, allowing much simpler code, but also increases the file system performance. This increase in performance has been measured through simulation, comparing our proposal against the algorithms proposed in xfs (the file system designed as part of the NOW project [1, 2, 8]).

As cooperative caches offer huge caches, we believe that they should be used to do some kind of aggressive prefetching. Along this line, we also present Full-File-On-Open, a prefetching algorithm that takes advantage of these huge cache sizes.

This report has been supported by the Spanish Ministry of Education (CICYT) under the TIC and TIC contracts.

All the results presented in this paper are obtained through simulation so that a wide range of environments and architecture configurations may be studied. These simulations have been done using the Sprite workload [3].

This paper is structured into 11 sections. Section 2 gives an overview of the related work. Section 3 describes the environment where PAFS is expected to run, and it is followed by a terminology section. Section 5 describes PAFS and the cooperative cache implemented in the file system. Section 6 describes xfs and its caching policy (N-Chance Forwarding); a comparison between both file systems is also presented in this section. In Section 7, Full-File-On-Open, an aggressive prefetching policy, is described. Section 8 gives details of the simulator and the traces used to obtain the results presented later. Sections 9 and 10 give detailed performance results of the file system with both the cooperative cache and the prefetching algorithm. Finally, Section 11 gives the most significant conclusions that can be extracted from this work.

2 Related Work

In recent years many parallel/distributed file systems have been developed, and most of them have placed special emphasis on cache strategies and prefetching algorithms (ParFiSys [4], Galley [16], Scotch [11], Zebra [12], PIOUS [24], Vesta [5] and sfs [22] among others). Besides all the above-mentioned projects, we would like to place special attention on two other research projects, as they have interesting similarities to the work presented here.

The first one is the xfs file system [1, 2, 8]. xfs is a server-less file system with a cooperative cache (N-Chance Forwarding) designed to work on a network of workstations. This system is the reference point used in our paper to compare the performance of our file system. The basic differences between PAFS and xfs are the replacement algorithm used and the way the cache coherence problem is handled. Also, their detailed study of N-Chance Forwarding only considers read operations, while both read and write operations are studied in our work. More details about xfs are given in a later section of this paper (§6).

The second one was developed by Leff et al., who have done a great deal of theoretical work on the use of remote memory for caching activities [19, 20]. In their latest project they also propose a mechanism which avoids the cache coherence problem. The main difference is that our servers do not need to communicate among themselves in order to know the placement of a given block. Another difference is that a distinct replacement algorithm is proposed in this paper. Furthermore, only a cooperative replacement algorithm was presented by Leff et al., while we present a complete file system. Finally, in our work all measurements are done using a real workload.

Besides the above-mentioned differences with each particular related project, there is a general difference from all of them when compared with our research. All projects presented so far are aimed at networks of workstations running a unix-like operating system. Our work assumes that each node in the network runs a micro-kernel operating system and that the file-system services are provided by a server, or servers, running on top of the micro-kernel. As the file system is implemented by a server, all requests have to be sent to this server, or servers, and cannot be handled through a simple system call. Besides, fewer actions can be completely handled locally in the requesting node.
These differences may somewhat modify the key issues of the design.

There has also been interesting research in cooperative memory usage [9, 23, 21]. This kind of cooperation has also been studied in the database field by Franklin et al. [10].

In the prefetching field we should mention the work done by D. Kotz on prefetching in MIMD multiprocessors [14, 15]. We would also like to cite the transparent informed prefetching developed by Gibson et al. [11, 28]. In their work a very aggressive prefetching policy, similar to Full-File-On-Open, is presented. The difference between our work and the research mentioned above is that we not only present a prefetching algorithm but also study its interaction with the cooperative cache mechanism. A centralized version of this work has also been developed by this research group [6, 26].

3 Target Environment

The file system presented in this paper is targeted at a parallel machine or a network of workstations (NOW). This parallel machine, or NOW, may have a very large number of nodes connected through a very fast interconnection network. Each node may have none, one or even several disks connected to it. From now on, we will refer to both architectures (NOW and parallel machine) as a parallel machine.

Each node runs a micro-kernel operating system instead of a full unix-like one. All functions not offered by the kernel itself are implemented by user-level servers. This is also the case for the file-system operations. This operating-system architecture has been chosen as it allows better cooperation between nodes. Besides, it may offer a single system image to the applications running on top of the parallel machine, simplifying its use. We believe that this single image is the way to go. This work started as a file-system prototype for the PAROS operating-system micro-kernel [17]. This target platform defined the environment we work with.

In order to be able to implement our parallel/distributed file system, the underlying micro-kernel should offer the following abstractions and functionality. First, we should be able to have multi-threaded applications. This allows us to implement a file server with several threads which can work on different requests in parallel. Second, ports are needed to allow communication between applications. User requests and completion notifications are sent using this mechanism. No data transfer is done using ports, as a faster mechanism can be used. Finally, a memory-copy operation is needed. This mechanism is used to transfer data between the cache and the user. Our assumption is that any processor can set up a data transfer between any other two processors. The processor that invokes the copy is charged with all the overhead. When we refer to a memory copy, the copy request and the copy itself are both included. Similar remote-memory-access mechanisms are supported in a variety of distributed-memory systems [30, 7].

4 Terminology

In this section we describe some concepts and terminology that may help the reader to understand some of the ideas presented later in this paper.

As in all caches, requesting a block means ending up either with a cache hit or a cache miss. The difference, in a cooperative cache environment, is that a cache hit may be either local or remote. If the requested block is found in the same node where the requesting client is running, we have a local hit. On the other hand, if the requested block is found in the cache of a different node, we have a remote hit. We also use the term global hit to refer to both kinds of hits.

It is also important to differentiate between the possible situations that may be found on a cache miss. The first one appears when the block that has to be replaced has been modified but has not been written to disk yet (it is dirty). We call this situation a miss on dirty. On the other hand, if the block to be replaced does not need to be written to disk (it is clean), we are looking at a miss on clean.

5 File System and Cache Design

In this section we present PAFS, a parallel/distributed file system with a cooperative cache that avoids the coherence problem. This description is divided into two main issues: the file system architecture and the cooperative cache design.
5.1 File System Architecture

When designing a file system there are two main issues that have to be taken into account: how the data is distributed among the disks, and how data and meta-data are managed by the servers before they get to the clients. In this work we only examine the second point, as the ideas we propose are valid no matter how data is stored on the disks.

Figure 1: PAFS architecture.

Scalability is one of the most important issues in distributed/parallel file systems. In order to achieve the desirable scalability, two kinds of servers are implemented in PAFS: cache-servers and disk-servers (Figure 1). Cache-servers are in charge of serving the clients' requests. They manage the cache and meta-data information. If the data needed by a cache-server is not in memory, and has to be fetched from disk, this information is requested from a disk-server. Disk-servers are processes responsible for physically reading and writing the blocks requested by the cache-servers. The system may have as many cache-servers as needed and they may run on any node, even if the node does not have a disk. Besides, there should be one disk-server running on each node with one or more disks.

In order to implement a highly scalable system, the inter-server communication has to be as low as possible. For this reason, we propose a load distribution that does not need any communication between cache-servers. Each cache-server is responsible for a set of files. It keeps all the information needed to find a block in the cache or on disk without the help of any other server. Clients know which server is in charge of which files by computing a hash function on the file name (or file-id).

Disk-servers only serve cache-server requests. A disk-server reads blocks from a disk and places them in a given buffer on a given node using a memory-copy primitive. It can also write the contents of a buffer from any node to the disk. When a disk-server receives a write operation it copies the block to a local buffer, answers the cache-server and then proceeds to the physical write. Disk-servers do not know how the data is distributed among the disks; they only know how to find a given block on their local disk (or disks).

In our file system architecture there are no dedicated nodes. Cache-servers and disk-servers may share a node with any number of clients. Both servers try to consume as few resources as possible, allowing efficient node sharing.

The current version places the blocks of a file among the disks using a round-robin algorithm: each block is placed on a different disk from its predecessor and its successor. Although this distribution has been chosen, many others could also be used.

5.2 Cache

The most important part of this file-system design, and the main topic presented in this paper, is the cache. We have designed a cooperative cache that has the advantages of cooperation and avoids the problems derived from the coherence mechanisms. A cooperative cache is a mechanism where all nodes in the parallel machine cooperate in order to obtain an improved global cache. In fact, in this research project we have suppressed the concept of "local cache" in favor of a single, big global one. Each node gives a part of its local memory to the global cache, which is managed by the cache-servers as will be explained later. A given node is not allowed to modify the contents of the cache blocks placed in its memory as they belong to the whole system.

This global cache is divided among all the cache-servers in partitions (Figure 2). Each cache-server is responsible for caching the data of its files in its partition of the cache. All the blocks that make up a cache partition are scattered among all nodes. One server can neither access nor modify any block which belongs to another server's cache partition.
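To illustrate how a client locates the cache-server responsible for a file, the sketch below hashes the file name into a server index; the use of Python, MD5, a simple modulo mapping and the names used here are our own illustrative assumptions, not the actual PAFS code.

```python
import hashlib

NUM_CACHE_SERVERS = 8  # the configuration used in the simulations; any number is possible

def cache_server_for(file_name: str) -> int:
    """Return the index of the cache-server responsible for this file.

    Every client evaluates the same deterministic hash, so no
    inter-server communication is needed to locate a file's blocks.
    """
    digest = hashlib.md5(file_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CACHE_SERVERS

# Example: every client computes the same server index for the same file name.
print(cache_server_for("/home/user/data.dat"))
```

Because the mapping is a pure function of the file name, any client can address the right server directly, which is what keeps the cache-servers free of inter-server traffic.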

Figure 2: Cache-block distribution in partitions.

A cache-server works completely isolated from the rest of the cache-servers, as there is no overlap in their responsibilities.

The replacement algorithm used by the cache-servers is based on the well-known LRU (Least Recently Used). This means that when a client requests a block, this block replaces the least recently used one in the cache partition managed by this server. We should notice that this block may be on any node and not necessarily on the client's. Our cooperative cache algorithm does not care about increasing the number of local hits. This idea differs from most cooperative algorithms implemented so far, as they try to maximize the number of local hits. In this paper we show that this may not be the wisest thing to do if remote hits can be served efficiently.

The idea of not encouraging local hits may be good for whole-system performance, but it may degrade applications with a small file working set. These applications would probably have a very high local-hit ratio if other cooperative algorithms were used. In order to be fair to these applications we have modified the LRU algorithm slightly. This new version follows two steps before deciding which block is to be replaced. First, it checks for a block placed in the same node as the client within the queue-tip. We define the queue-tip as a percentage of the least recently used blocks of the LRU-queue. If such a block is found, it is replaced. Otherwise, the least recently used block is substituted. A study of the influence this factor has on the overall system performance is detailed in the performance section (§9.6). From now on, we will call this replacement algorithm Pseudo-Global-LRU (PG-LRU).

One of the most expensive and complicated issues in a parallel/distributed file system is handling cache coherence. Coherence problems appear when replication is allowed and several copies of the same information coexist in the system. This replication is mainly done to increase the local-hit ratio of the global cache. As we consider that a high local-hit ratio is not the key to obtaining a high-performing cache, we avoid any kind of replication. If a block is already found in the global cache, it is sent to the user but no replica is made on the client's node. If there is no replication, no cache coherence problems can appear and we can get rid of all the coherence mechanisms. This greatly simplifies the cache and file-system design. In the performance section (§9) we will show that this simplification not only does not hurt the system performance but may even improve it.

The last important issue is the way blocks are distributed among the servers. So far, we have a fixed distribution: at boot time all cache blocks are evenly distributed among all servers. This may seem very simplistic, but no problems have been detected due to this distribution. If any performance degradation appears, a dynamic redistribution of blocks should be studied. Until then, we propose the simplest algorithm: fixed partitions.
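The sketch below captures the two-step Pseudo-Global-LRU decision just described; the data structures, method names and the 10% queue-tip default are illustrative assumptions rather than the PAFS implementation.

```python
from collections import OrderedDict

class PGLRUPartition:
    """Pseudo-Global-LRU over one cache-server's partition.

    Each entry maps a block id to the node holding its buffer, kept in
    LRU order (least recently used first). Eviction first scans the
    queue-tip for a buffer on the requesting client's node; otherwise
    the globally least recently used block is replaced.
    """

    def __init__(self, capacity, tip_fraction=0.10):
        self.capacity = capacity
        self.tip_fraction = tip_fraction
        self.blocks = OrderedDict()        # block_id -> node

    def touch(self, block_id):
        self.blocks.move_to_end(block_id)  # mark as most recently used

    def choose_victim(self, client_node):
        tip_len = max(1, int(self.capacity * self.tip_fraction))
        # Step 1: prefer a block already placed on the client's node within the tip.
        for block_id, node in list(self.blocks.items())[:tip_len]:
            if node == client_node:
                return block_id
        # Step 2: fall back to the globally least recently used block.
        return next(iter(self.blocks))

    def insert(self, block_id, node, client_node):
        if len(self.blocks) >= self.capacity:
            del self.blocks[self.choose_victim(client_node)]
        self.blocks[block_id] = node

# Tiny example: with capacity 3, inserting a fourth block for a client on node 1
# replaces "f-0", which sits in the tip and is already placed on node 1.
part = PGLRUPartition(capacity=3)
part.insert("f-0", node=1, client_node=1)
part.insert("f-1", node=2, client_node=2)
part.insert("f-2", node=1, client_node=2)
part.insert("f-3", node=2, client_node=1)
```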

Finally, each cache-server implements a delayed-write policy. Every 30 seconds, all modified blocks in the cache are sent to disk through the disk-servers. This interval between cache flushes has been taken from the one used in Unix.

5.3 Fault Tolerance

The above caching mechanism is not fault tolerant, as a node failure means that all modified blocks in the failed node's memory are lost without being saved to the appropriate disk. As this may not be acceptable in some environments (especially NOWs), we propose a couple of mechanisms to achieve the desired fault tolerance.

The simplest idea that achieves the objective consists of implementing a write-through policy. Every time a block is modified, it is sent to the disk-server, which writes it to disk. This can be done as long as the disks are not too busy. In order to study the impact this policy has on the overall system performance, some figures are presented in the performance section (§9.5).

Another possibility, which has not been implemented yet, consists of emulating a RAID [27]. We propose that a set of blocks is used to keep the parity of a line of blocks. Every time a block is modified, the parity block is also modified. If a node with dirty blocks fails, all dirty blocks in its cache can be rebuilt and sent to disk. As the write-through policy has worked well enough, we have not implemented this policy, but it may be a useful one in some environments.

6 Algorithms Comparison

In order to evaluate the performance obtained by PAFS we compare it with xfs. This file system was chosen because it implements one of the latest cooperative cache algorithms found in the literature (N-Chance Forwarding). We will first explain the file system and its cache briefly, and a conceptual comparison between both file systems will follow.

6.1 xfs

This file system and its cooperative cache were developed as part of the Berkeley NOW project [1, 2, 8]. In this section we explain the basic ideas of the xfs file system and its cooperative cache (N-Chance Forwarding). Although it is a very important issue in their work, we will not describe how the data is striped across the disks, as it is not relevant when comparing xfs to our proposal. Nor will we explain the servers needed for such a disk organization.

6.1.1 File System Architecture

There are two main abstractions in the subset of xfs we are interested in: the OS kernel and the manager. The kernel is a regular unix-like operating system. It handles all file-system requests from the clients running on top of it. This kernel has been slightly modified so that it can also serve operations requested from other nodes. Managers are servers which keep track of the file-system data and meta-data. They are also responsible for the consistency of the cache blocks.

A file-system operation starts when the client requests a block from its OS kernel. If the requested block is in the kernel's buffer cache, it is handed to the user. So far, no differences from a regular unix operation appear. The differences appear when the kernel does not have the block cached. As this block may be cached in a remote node, the kernel contacts a manager to request the block. The manager checks whether a copy is already cached in any node. If such a copy exists, the request is forwarded to the remote node holding the block. The kernel of this node will send a copy of the requested block to the kernel which asked for it in the first place. If there are no copies cached, the manager reads the block from disk and sends it to the requesting kernel.
A diagram of this process is presented in Figure 3.

Each manager is responsible for a set of files. The kernel knows which manager is responsible for which files by indexing the Manager-Map with the file-id. The Manager-Map is a table, replicated at all sites, that tells which server is responsible for which files. In order to increase the efficiency of the system, xfs tries to assign files used by a client to a manager colocated on that machine.

Figure 3: Steps followed to fulfill a client request in xfs.

This assignment is done using a policy called First Writer: when a client creates a file, xfs chooses a manager colocated on the same machine [1].

In order to port this version to our micro-kernel operating-system architecture, only one change was needed. As the micro-kernel does not handle file-system operations, we have placed a file-system server on each node. This server behaves like the unix kernel does. The only difference is that accessing this server means sending a message to a port instead of performing a system call. As this server is always colocated on the same machine as the client, not much overhead should be added. More detailed information on the influence this modification has can be found in the performance section (§9.4).

6.1.2 N-Chance Forwarding

N-Chance Forwarding is the cooperative cache algorithm implemented in xfs. It divides each node's cache into two parts. The first one is used to cache local data and the second one holds data cached on behalf of remote nodes. The size of these two parts is not fixed but dynamically adjusted depending on the node's I/O activity.

N-Chance Forwarding allows each node to cache the blocks its applications request. The difference with isolated self-caching algorithms is that it attempts to avoid discarding unreplicated blocks (singlets) from client memory. When a client discards a block, the server checks whether that block is the last copy in the whole cache. If the block is a singlet, rather than discarding it, the client forwards the data to a random peer. The peer that receives the data adds the block to its LRU list as if it had been recently referenced. A singlet can only be forwarded N times without being referenced; after N forwardings without a reference, the block is discarded. If a client has a remote hit, the block is replicated from the remote cache to the local one of the requesting client. The parameter N indicates how many times a singlet is allowed to be forwarded without being referenced before finally being discarded. The value we have chosen for the comparison is N=2, as it was described as the best choice in the N-Chance Forwarding paper [8].

6.1.3 Coherence Mechanism

xfs uses a token-based cache consistency scheme similar to Sprite [25] and AFS [13], except that xfs manages consistency on a per-block rather than on a per-file basis. Before a kernel modifies a block, it must acquire write ownership of that block. The client sends a message to the block's manager. The manager then invalidates any other cached copies of the block, updates its cache consistency information to indicate the new owner, and replies to the client, giving permission to write. Once the kernel owns a block, it may write the block repeatedly without having to ask the manager for ownership each time. The client maintains write ownership until some other client reads or writes the data, at which point the manager revokes ownership, forcing the client to stop writing the block, to flush any changes to stable storage, and to forward the data to the new client.
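To make the N-Chance decision concrete, the following sketch models what happens when a node evicts a block (discard non-singlets, forward singlets to a random peer at most N times); it is a simplified model written under our own assumptions, not xfs code.

```python
import random
from dataclasses import dataclass, field

N = 2  # recirculation limit used in the comparison

@dataclass
class Block:
    block_id: str
    recirculations: int = 0

@dataclass
class Peer:
    name: str
    lru: list = field(default_factory=list)  # most recently used at the end

def on_local_eviction(block, copies_in_system, peers):
    """N-Chance decision for a block pushed out of a local cache.

    copies_in_system is the number of cached copies of this block in
    the whole cluster; a count of 1 marks an unreplicated singlet.
    """
    if copies_in_system > 1 or block.recirculations >= N:
        return "discard"
    block.recirculations += 1
    peer = random.choice(peers)        # forward the singlet to a random peer
    peer.lru.append(block)             # the peer treats it as recently referenced
    return f"forwarded to {peer.name}"

# Example: a singlet evicted for the first time is forwarded rather than discarded.
peers = [Peer("node-3"), Peer("node-7")]
print(on_local_eviction(Block("f-42"), copies_in_system=1, peers=peers))
```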

6.2 Comparison

In this paper we compare two file systems and especially their caching policies: N-Chance Forwarding (implemented in xfs) and PG-LRU (implemented in PAFS). In this section we describe the main differences between both systems. We also list the main issues that have to be taken into account when comparing both algorithms.

The most general difference between both file systems is the platform they are designed to run on. PAFS is designed to work on a parallel machine (or NOW) where each node runs a micro-kernel operating system. All services, including the file system, have to be implemented by user-level servers instead of by the kernel itself. On the other hand, xfs works on a unix-like operating system. The file-system code is implemented in the kernel and some operations can be handled without sending or receiving any messages. A second difference is the way the coherence problem is tackled. While PAFS avoids it by simply avoiding replication, xfs implements a token-based, per-block cache consistency scheme. Regarding the caching algorithms, there is also a very important conceptual difference. PG-LRU does not try to increase the number of local hits; it tries to speed up remote hits in order to make the local-hit ratio less significant. N-Chance Forwarding places special emphasis on achieving a high local-hit ratio in order to increase its performance.

Besides these general differences, there are a few issues that should be studied in order to compare both file systems. First, the impact the coherence algorithm has on the system performance should be studied. The total time spent serving global hits should also be compared: if local hits are very fast but remote hits take very long, the total time spent serving block hits may be too high. We should also study the overhead of maintaining a high local-hit ratio, which may be too high compared to the gains obtained. All these issues, and some less important ones, have to be taken into account when comparing both algorithms in the performance section.

7 Aggressive Prefetching

A very important effect produced by any cooperative caching algorithm is the huge caches that can be obtained. These global caches are so big that most of the blocks kept are several hours old. For example, in our simulation (50 nodes with 16MByte "local caches"), it took 13 hours to fill the global cache under the Sprite workload [3].

It is well known that prefetching is not always a good idea, as it may end up delaying the application if many mispredictions are made [15, 29]. Nevertheless, if the cache is big enough, these mispredictions should not affect the overall cache performance, as prefetched blocks replace very old data. Extending this idea, the bigger the cache is, the more aggressive the prefetching policies can be.

In order to take advantage of the huge caches offered by the cooperative mechanism, we present an aggressive prefetching algorithm named Full-File-On-Open. This algorithm starts to prefetch the whole file as soon as the file is opened. This has several basic advantages over other common algorithms. As it starts prefetching before any data has been accessed, the first block may already be in the cache when requested. This increases the file system performance on small files made up of one or two blocks.
If files are opened at the beginning of the program execution but the data is not accessed until some time later, many blocks may already have been cached before they are really needed. This algorithm also takes advantage of the time between requests: if this time is very long, many blocks can be prefetched, as there is no limit on the number of blocks that can be brought to the cache. This increases the probability of finding a block in the cache when it is needed.

As each cache-server is responsible for a set of files, it is also in charge of prefetching the blocks of those files. Each server only prefetches one block at a time, but as there are several servers, a good degree of parallelism is obtained.
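A minimal sketch of the Full-File-On-Open idea follows: on open, every block of the file is queued for prefetching at the cache-server responsible for that file. The queue, the server interface and the block-count computation are illustrative assumptions, not the PAFS code.

```python
from collections import deque

BLOCK_SIZE = 8 * 1024  # 8 KByte blocks, as in the simulations

class CacheServer:
    """Keeps a queue of blocks waiting to be prefetched, served one at a time."""
    def __init__(self):
        self.prefetch_queue = deque()

    def queue_prefetch(self, file_name, block_no):
        self.prefetch_queue.append((file_name, block_no))

def full_file_on_open(file_name, file_size, servers, server_for):
    """Queue every block of a file as soon as the file is opened.

    There is no limit on the number of blocks queued; long gaps between
    requests simply give the responsible server more time to fill the cache.
    """
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    server = servers[server_for(file_name)]   # one cache-server owns the whole file
    for block_no in range(num_blocks):
        server.queue_prefetch(file_name, block_no)

# Example: opening a 20 KByte file queues its three 8 KByte blocks.
servers = [CacheServer() for _ in range(8)]
server_for = lambda name: hash(name) % len(servers)
full_file_on_open("/tmp/small.dat", 20 * 1024, servers, server_for)
print(len(servers[server_for("/tmp/small.dat")].prefetch_queue))  # -> 3
```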

At first sight, this algorithm may seem to degrade the file system performance if many huge files are open, as they may fill the whole cache. This behavior is studied later in the performance section (§10).

8 Simulator and Trace Files

8.1 Simulator

The file-system and cache simulator used in this project is part of DIMEMAS [18]. It reproduces the behavior of a distributed-memory parallel machine. This software not only simulates the machine and disk accesses but also different short-term process-scheduling policies. The simulator is trace-driven; traces contain the CPU, communication and I/O demand sequences of every process instead of the absolute time of each event.

The communication model implemented in the simulator is important for understanding the results presented later. All communications are divided into two parts: a startup (or latency) and a data transfer. The startup is constant for each type of communication (port or memory copy) and it is assumed to require CPU activity. The startup is different depending on whether the communication stays within a node or crosses the interconnection network. The data transfer time is proportional to the size of the data sent and inversely proportional to the interconnection-network bandwidth. In our model, all communications are synchronous, although asynchronous communication can be achieved by creating new threads.

8.2 Simulation Parameters

In all runs, we have simulated a 50-node parallel machine where each node assigned 16MBytes of its local memory to the global cache. This cache was divided into 8KByte blocks, the same size as the disk blocks. The whole system had 8 disks where the data was distributed in a round-robin fashion.

The disks we have used for our simulations are modeled using two parameters: latency and bandwidth. The latency is the time needed to seek and locate a block. We have used a 10.5-millisecond read latency and a 12.5-millisecond write latency. The bandwidth is the number of bytes that can be transferred per unit of time. We have used a 10MBytes/s bandwidth. These values have been derived as an average of several real disks.

Although the environment used in this work is somewhat different from the one used in the NOW project [1, 2, 8], we use similar network parameters. This should help the reader compare both environments. A study of the influence the network parameters have on the system performance is also presented in the performance section (§9.2). Unless otherwise specified, nodes are connected through a 155 Mbits/s interconnection network and local copies are done at 320 Mbits/s. We assumed a 100-microsecond remote-port startup and a 50-microsecond local-port startup. Memory copies have a 25-microsecond startup if they are within a node and a 50-microsecond one if the copy is between different nodes.

PAFS has been simulated with 8 cache-servers and 8 disk-servers. On the other hand, xfs has been simulated with 50 servers, as we want to have a server on each node in order to minimize the impact of implementing the file system as a server and not in the kernel. All servers share their nodes with other applications. Regarding the particular parameters of each algorithm, we have used N=2 for xfs and a queue-tip size of 10% of the LRU-list of each cache-server.

8.3 Sprite Workload

In order to get the results presented in this paper, we have used some parts of the Sprite workload, described in detail by Baker et al. [3]. The Sprite user community included about 30 full-time and 40 part-time users of the system.
These traces list the activity of 48 client machines and some servers over a two-day period measured in the Sprite operating system. (DIMEMAS is a performance prediction simulator developed by CEPBA-UPC and is available as a PALLAS GmbH product.)
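As a rough illustration of the communication model and the parameters above, the snippet below estimates raw copy times using the simulator's linear startup-plus-transfer model; how the terms are composed here is our own simplification for illustration, not the exact DIMEMAS accounting.

```python
# Parameters from Section 8.2 (bandwidths in bits/s, startups in seconds).
REMOTE_NET_BW       = 155e6   # interconnection network
LOCAL_COPY_BW       = 320e6   # memory copy within a node
REMOTE_COPY_STARTUP = 50e-6   # memory copy between nodes
LOCAL_COPY_STARTUP  = 25e-6   # memory copy within a node

def copy_time(nbytes, remote):
    """Startup plus size/bandwidth, the linear model used by the simulator."""
    if remote:
        return REMOTE_COPY_STARTUP + (nbytes * 8) / REMOTE_NET_BW
    return LOCAL_COPY_STARTUP + (nbytes * 8) / LOCAL_COPY_BW

# Shipping the 4660 bytes of an average request from a remote cache costs about
# 290 microseconds, while shipping a full 8 KByte block costs about 470; this is
# why PAFS only copies the bytes the user asked for (see Section 9.1).
print(round(copy_time(4660, remote=True) * 1e6))       # -> 291
print(round(copy_time(8 * 1024, remote=True) * 1e6))   # -> 473
```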

Table 1: Number of operations (read, write, open, close, seek, unlink) and accessed blocks, both for the whole simulation and for the warm-cache period where the simulation results were taken. All values in this table are in thousands of operations or thousands of 8KByte blocks.

Table 2: Average read and write operation times (in microseconds) for PAFS and xfs. PAFS gains 12.6% on reads and 42.4% on writes.

Although the trace is two days long, all measurements presented in this paper are taken from the 15th hour to the 48th hour. This is done because we used the first fifteen hours to warm the cache. In Table 1, we can see the number of requested operations and accessed blocks during the simulation period. This table is meant to help the reader understand the real load placed on the file system.

9 Cache Performance

In this section we present the performance measurements obtained by PAFS and especially by PG-LRU (the cooperative caching algorithm presented in this paper). As there are many parameters that can influence the cache performance, we have decided to study them one by one. Each of the following subsections focuses on one of these parameters. All measurements presented in this section are compared to the ones obtained by xfs and N-Chance Forwarding, which have also been simulated.

9.1 Performance Comparison

In this subsection we compare the average time spent performing a read and a write operation by both file systems. In Table 2 we can see the average times mentioned above. We observe that both read and write operations are faster when PAFS is used than when xfs is used: the average read time is 12.6% lower and the average write time is 42.4% lower. In order to explain this gain in performance, we will first focus on the read operations; a discussion of write operations will follow.

In Figure 4 we can observe the total time spent performing read and write operations. These times have been normalized to the largest one (reads on xfs) in order to make the graph easier to understand. In this graph we can see the time spent on misses, remote hits and local hits. The time spent on global hits can easily be obtained by adding the time spent on both local and remote hits. In this figure we also observe that the time spent on global hits by PAFS is significantly less than the one spent by xfs. As the global-hit ratio is practically the same (85.5%), this difference is the reason behind some of the gain obtained by our file system. We should focus on the remote hits if we want to understand this difference. A remote hit takes around 10 times longer in xfs than in PAFS. This means that our file system could have up to 10 times more remote hits than xfs and still spend less time in global hits. As PAFS has far fewer than ten remote hits for each of theirs, the total time is significantly less.

Figure 4: Time spent by the whole file system serving read and write operations, broken down into misses, remote hits and local hits. All times are normalized to the time spent by xfs serving read operations.

Their remote hits are so expensive because they have to copy the data twice: once from a remote memory to the local memory, and once from the local cache to the user. They also have to forward many blocks, contact the server, which in turn contacts the block owner to revoke block ownership, and so on. On the other hand, PAFS only copies the data from a remote cache to the user, and that is all. Besides, when xfs copies the block from the remote cache, the whole block is copied through the interconnection network, while our version only copies the bytes requested by the user. The average size of the requests made by the user is 4660 bytes, roughly half the size of a cache block. Another important aspect is that misses on clean take longer in xfs than in PAFS. In their file system, bringing a block into the cache may require forwarding another block to a different node. This extra work also decreases the overall read performance.

Let us now explain the gain obtained in write operations. The main reason for the performance difference is the overhead produced by the coherence algorithm. Most writes have to ask for the block ownership, and quite a few of them also have to invalidate the copies kept in other nodes. All these operations are expensive and increase the average write time in xfs. Besides, block forwarding and the extra copies needed to bring the block to the local cache before modification have a significant impact on the write performance. As write operations are very fast, any overhead has a significant impact on the average operation time.

9.2 Network Bandwidth Influence

A very important part of this work is to study the influence the network bandwidth has on the results presented in the above subsection. In order to perform this study, we have run several simulations varying the interconnection-network bandwidth. In Figure 5 we can examine this variation. The X axis shows the ratio between the local memory-copy bandwidth (L_BW) and the interconnection-network one (R_BW). The interval used starts at 1, where both bandwidths are equal, and ends where the remote bandwidth is 10 times slower than the local one. This interval should include most parallel-machine and NOW configurations.

We observe that the bandwidth ratio affects read operations in a similar way in both algorithms. This means that the time gained by xfs due to its higher local-hit ratio is lost because of its remote hits. As all communications needed to serve a remote hit go through the interconnection network, the xfs remote hits are highly penalized.

Write operations behave in a different manner. This difference resides in the way misses are treated in both algorithms. A miss in PAFS nearly always means a remote copy, as the block will probably replace a block in a remote node. Consequently, a slowdown in the interconnection network means a slowdown in the write operation. On the other hand, a miss under xfs is always handled locally. It replaces a local block, and no remote copies have to be done unless a forwarding is needed.

Figure 5: Network bandwidth influence on the average read and write operation times.

Figure 6: "Local-Cache" size influence on the average read and write operation times.

As the network slows down, the extra remote copies performed by PAFS are no longer fully outweighed by the extra work xfs spends on its coherence algorithm, and the difference between the average write times of both file systems decreases. Summing up, we can see that the results presented in this paper are valid even with slow interconnection networks. In any case, the faster the interconnection network is, the better the cooperative cache behaves.

9.3 "Local-Cache" Size Influence

Another important aspect that should be studied is the influence the "local-cache" size has on the file system performance. In order to fulfill this study we have simulated "local-cache" sizes from 1MByte up to 16MBytes, which is the default configuration presented in this work (Figure 6). We can see, as expected, that decreasing the cache size increases the average read time. As was shown in Figure 4, 80% of the total time spent on read operations was used to satisfy 15% of the blocks (the misses). This means that the file system performance is driven by the miss ratio. As can be seen in Table 3, smaller caches have a much lower global-hit ratio. This increase in misses also increases the average read time, and it can end up doubling it in the worst case.

In Table 3 we can also observe that xfs obtains a slightly better global-hit ratio than PAFS. This is due to the fixed-partition policy: sometimes the blocks assigned to a cache-server are not enough to hold its working set, and thus it has a lower hit ratio. Anyway, this did not affect the overall results much.

Table 3: "Local-cache" size influence on the read global-hit ratio.
        1MB   2MB   4MB   8MB   16MB
PAFS    67%   73%   78%   81%   85%
xfs     68%   74%   78%   81%   85%

Figure 7: Local-port startup influence on the average read and write operation times.

Write operations are quite insensitive to the cache size. In order to explain this somewhat surprising result we should first examine the work needed to perform a write hit and a write miss. If the missed block is a new one (from a growing file), or the write operation overwrites the old block completely, the operation does not need to access the disk. This means that the time to serve a miss is very similar to the time needed to complete a hit. This fact explains the little impact that reducing the global-hit ratio has on write operations.

In Figure 6 we can also observe that the "local-cache" size affects both algorithms (PAFS and xfs) in a similar way. The reason behind this behavior on read operations is the very similar global-hit ratio. The similarity of behavior on write operations is due to the small impact the "local-cache" size has on this kind of operation.

9.4 Local-Port Startup Influence

As was mentioned when describing the xfs file system, the simulated version has one difference from the original version. As we work on a micro-kernel architecture, the file system cannot reside in the kernel and has to be implemented by a server. This means that requesting data from the file system requires a message to a server instead of a system call. In order to study the impact this modification has on the system performance we have run some simulations modifying the local-port startup (Figure 7). The simulated range goes from an instantaneous startup (minimizing the influence of sending a message to the server) to the default startup used in this paper (50 microseconds). This study shows that varying the local-port startup only affects the xfs algorithm, as PAFS does not base its performance on local communications. We can also observe that although some improvement is obtained in xfs with small startups, this improvement is not very significant.

9.5 Write-Through Overhead

None of the results presented so far include any kind of fault-tolerance mechanism; only a 30-second syncer was simulated. As this may be unacceptable in some environments, we proposed a write-through policy. In Table 4 we can see the average read and write operation times with and without write-through. The simulation runs show that, under this workload, a write-through policy does not produce a significant overhead. This small overhead is due to the moderate disk activity and a good caching policy.

Table 4: Average read and write operation times (in microseconds) for PAFS with delayed-write, PAFS with write-through, and xfs.

We understand that if the workload placed more stress on the disks, a higher overhead could appear due to the write-through policy.

9.6 Queue-Tip Size Influence

The replacement algorithm presented in this paper tries to increase the local-hit ratio without increasing the complexity and overhead of the algorithm, in order not to penalize applications with a small file working set. To achieve this objective we have defined a section of the LRU-queue (the queue-tip) where we first try to find a block placed in the node which requested it (Section 5.2). In this section we study the impact the size of this queue-tip has on the overall performance. We have studied several queue-tip sizes; the simulated range goes from 0% up to 25% of the LRU-queue. Throughout this range we have seen that there is no real gain after the first 5%. This means that there is no point in examining more than the least recently used 5% of the queue in order to achieve a higher local-hit ratio.

10 Prefetching Performance

As cooperative caches offer such big caches, they should be used for aggressive prefetching. In this paper, one such algorithm (Full-File-On-Open) has been presented. In this section we study the impact this algorithm has on the system performance. The study presented in this section is only done on PAFS, but the results should be extendible to any file system with a cooperative caching algorithm.

10.1 Algorithm Performance

The first step in studying the performance gain obtained by Full-File-On-Open is to compare it with other algorithms. In Figure 8 we present the average read and write operation times when no prefetching is active and with two different prefetching algorithms: One-Block-Ahead and Full-File-On-Open. One-Block-Ahead is the typical algorithm implemented in many file systems: every time a block is read or written, the next sequential block is queued to be prefetched. This algorithm tries to take advantage of the usual sequential file access. At the same time, it is a very conservative algorithm, as only one block is prefetched each time, minimizing the impact of prefetch misses; a minimal sketch of this policy is given below.

Examining Figure 8 we observe that both prefetching algorithms improve the average read time, but no significant gain is obtained in the average write time. Let us study both operations separately and explain the influence the prefetching algorithms have on each of them. Read operations take advantage of any prefetching algorithm, as the global-hit ratio is increased by both of them. We should also notice that prefetch misses do not have any significant influence, because the global cache is very big and most of the replaced blocks are several hours old. We can also see that Full-File-On-Open gives much better results than One-Block-Ahead. This happens for several reasons. First, if files are only one or two blocks long, most of the time they are completely prefetched before the application starts reading them. Second, if the time between the open operation and the first access to the file is long, some parts of the file have been prefetched before they are needed. And last, if there is a long interval between two consecutive operations on a file, the prefetching mechanism may bring many blocks to the cache.
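For contrast with Full-File-On-Open, here is the promised sketch of the One-Block-Ahead policy: after every read or write, only the next sequential block is queued for prefetching. The function signature and queue representation are illustrative assumptions.

```python
def one_block_ahead(file_name, block_no, blocks_in_file, prefetch_queue):
    """Queue the next sequential block, if any, after each access.

    At most one block is prefetched per request, so the cost of a
    misprediction is small, but so is the potential gain.
    """
    next_block = block_no + 1
    if next_block < blocks_in_file:
        prefetch_queue.append((file_name, next_block))

# Example: reading block 0 of a three-block file queues block 1 only.
queue = []
one_block_ahead("/tmp/small.dat", 0, 3, queue)
print(queue)  # [('/tmp/small.dat', 1)]
```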

Figure 8: Performance comparison of PAFS with the One-Block-Ahead (OBA) and Full-File-On-Open (FFOO) prefetching algorithms, and with no prefetching (NP), for read operations (4662 bytes on average) and write operations (6912 bytes on average).

Figure 9: File size influence on the Full-File-On-Open (FFOO) prefetching algorithm, shown as average read and write times for different "local-cache" sizes.

Write operations are not improved, and their performance may even degrade when prefetching algorithms are used. The little influence the global-hit ratio has on write operations is the reason why no better average write-operation times are obtained; the reason behind this lack of influence was explained in Subsection 9.3. The possible loss in performance appears when a block that would not need to be read from disk is written while the prefetching operation takes place. This write will have to wait until the block is in the cache, but the block will not be used as it will be completely overwritten.

10.2 File Size and Cache Size Influence

The aggressive prefetching algorithm presented in this paper might degrade the file system performance if the files are too big. In order to study this influence without modifying the trace files, the size of the global cache has been reduced. In Figure 9, the results of this experiment are shown. We observe that even with small "local caches" the presence of the aggressive prefetching algorithm decreases the average read operation time. This means that no real interference is found when big files are prefetched. Another important observation is that if aggressive prefetching is done, smaller "local caches" are enough to obtain similar performance. For example, in our simulations, 8MByte "local caches" with Full-File-On-Open behave nearly the same as 16MByte ones, also with Full-File-On-Open.

11 Conclusions

In this paper we have presented PAFS, a file system with a cooperative cache (PG-LRU) that needs no coherence mechanisms. We have done so with no loss in performance compared to other current cooperative caching algorithms. We have also seen that the high local-hit ratio that has historically been pursued may not be as important as was believed: if remote hits are implemented efficiently, a high local-hit ratio becomes a secondary issue.

Another important result is that, in cooperative caches, a high global-hit ratio does not benefit write operations and may even degrade them. This can be extended to the prefetching area, where no prefetching should be done if a block is to be overwritten. We can also conclude that cooperative mechanisms offer huge global caches that should be used by aggressive prefetching algorithms in order to improve the overall file system performance. A prefetching algorithm (Full-File-On-Open) that falls into such an aggressive category has also been presented.

Finally, we have shown that the results presented in this paper are also valid with fast, and not so fast, interconnection networks. Configurations where the network bandwidth is 10 times lower than the local memory bandwidth can still run the file system presented in this paper efficiently.

Acknowledgments

We owe special thanks to Michael D. Dahlin for answering all our questions about the way N-Chance Forwarding works. We are grateful to the people at Berkeley who gathered the Sprite traces that helped us feed our simulator and get the results we present in this paper. We would also like to thank E. Markatos and Pedro de Miguel, whose comments improved the contents of this paper. Finally, we thank Maite Ortega for her help in the implementation of the first prototype.

References

[1] T.E. Anderson, D.E. Culler, D.A. Patterson et al., "A Case for NOW (Networks of Workstations)," IEEE Micro, February 1995.
[2] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson et al., "Serverless Network File Systems," 15th Symposium on Operating Systems Principles, December 1995.
[3] M.G. Baker, J.H. Hartman, M.D. Kupfer et al., "Measurements of a Distributed File System," Proc. of the 13th Symposium on Operating Systems Principles, 1991.
[4] J. Carretero, F. Perez, P. de Miguel et al., "ParFiSys: A Parallel File System for MPP," ACM SIGOPS, Vol. 30, No. 2, April 1996.
[5] P.F. Corbett, S.J. Baylor and D.G. Feitelson, "Overview of the Vesta Parallel File System," ACM SIGARCH, Vol. 21, No. 5, 1993.
[6] T. Cortes, S. Girona and J. Labarta, "PACA: A Cooperative File System Cache for Parallel Machines," Euro-Par'96, Lyon, August 1996.
[7] D.E. Culler, A. Dusseau, S. Copen et al., "Parallel Programming in Split-C," Proceedings of Supercomputing'93.
[8] M.D. Dahlin, R.Y. Wang, T.E. Anderson and D.A. Patterson, "Cooperative Caching: Using Remote Client Memory to Improve File System Performance," Operating Systems Design and Implementation, Monterey, November 1994.
[9] M.J. Feeley, W.E. Morgan, F.H. Pighin et al., "Implementing Global Memory Management in a Workstation Cluster," 15th Symposium on Operating Systems Principles, December 1995.


More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement"

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" October 3, 2012 Ion Stoica http://inst.eecs.berkeley.edu/~cs162 Lecture 9 Followup: Inverted Page Table" With

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Relative Reduced Hops

Relative Reduced Hops GreedyDual-Size: A Cost-Aware WWW Proxy Caching Algorithm Pei Cao Sandy Irani y 1 Introduction As the World Wide Web has grown in popularity in recent years, the percentage of network trac due to HTTP

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018 CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler

More information

VIRTUAL MEMORY READING: CHAPTER 9

VIRTUAL MEMORY READING: CHAPTER 9 VIRTUAL MEMORY READING: CHAPTER 9 9 MEMORY HIERARCHY Core! Processor! Core! Caching! Main! Memory! (DRAM)!! Caching!! Secondary Storage (SSD)!!!! Secondary Storage (Disk)! L cache exclusive to a single

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996 TR-CS-96-05 The rsync algorithm Andrew Tridgell and Paul Mackerras June 1996 Joint Computer Science Technical Report Series Department of Computer Science Faculty of Engineering and Information Technology

More information

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742 Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Modified by Rana Forsati for CSE 410 Outline Principle of locality Paging - Effect of page

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

The Memory System. Components of the Memory System. Problems with the Memory System. A Solution

The Memory System. Components of the Memory System. Problems with the Memory System. A Solution Datorarkitektur Fö 2-1 Datorarkitektur Fö 2-2 Components of the Memory System The Memory System 1. Components of the Memory System Main : fast, random access, expensive, located close (but not inside)

More information

RECONFIGURATION OF HIERARCHICAL TUPLE-SPACES: EXPERIMENTS WITH LINDA-POLYLITH. Computer Science Department and Institute. University of Maryland

RECONFIGURATION OF HIERARCHICAL TUPLE-SPACES: EXPERIMENTS WITH LINDA-POLYLITH. Computer Science Department and Institute. University of Maryland RECONFIGURATION OF HIERARCHICAL TUPLE-SPACES: EXPERIMENTS WITH LINDA-POLYLITH Gilberto Matos James Purtilo Computer Science Department and Institute for Advanced Computer Studies University of Maryland

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

ECE 341. Lecture # 18

ECE 341. Lecture # 18 ECE 341 Lecture # 18 Instructor: Zeshan Chishti zeshan@ece.pdx.edu December 1, 2014 Portland State University Lecture Topics The Memory System Cache Memories Performance Considerations Hit Ratios and Miss

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

Memory Design. Cache Memory. Processor operates much faster than the main memory can.

Memory Design. Cache Memory. Processor operates much faster than the main memory can. Memory Design Cache Memory Processor operates much faster than the main memory can. To ameliorate the sitution, a high speed memory called a cache memory placed between the processor and main memory. Barry

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

MEMORY MANAGEMENT/1 CS 409, FALL 2013

MEMORY MANAGEMENT/1 CS 409, FALL 2013 MEMORY MANAGEMENT Requirements: Relocation (to different memory areas) Protection (run time, usually implemented together with relocation) Sharing (and also protection) Logical organization Physical organization

More information

Server 1 Server 2 CPU. mem I/O. allocate rec n read elem. n*47.0. n*20.0. select. n*1.0. write elem. n*26.5 send. n*

Server 1 Server 2 CPU. mem I/O. allocate rec n read elem. n*47.0. n*20.0. select. n*1.0. write elem. n*26.5 send. n* Information Needs in Performance Analysis of Telecommunication Software a Case Study Vesa Hirvisalo Esko Nuutila Helsinki University of Technology Laboratory of Information Processing Science Otakaari

More information

Application Programmer. Vienna Fortran Out-of-Core Program

Application Programmer. Vienna Fortran Out-of-Core Program Mass Storage Support for a Parallelizing Compilation System b a Peter Brezany a, Thomas A. Mueck b, Erich Schikuta c Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse

More information

Review question: Protection and Security *

Review question: Protection and Security * OpenStax-CNX module: m28010 1 Review question: Protection and Security * Duong Anh Duc This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Review question

More information

Memory Management. Dr. Yingwu Zhu

Memory Management. Dr. Yingwu Zhu Memory Management Dr. Yingwu Zhu Big picture Main memory is a resource A process/thread is being executing, the instructions & data must be in memory Assumption: Main memory is infinite Allocation of memory

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

OS and Hardware Tuning

OS and Hardware Tuning OS and Hardware Tuning Tuning Considerations OS Threads Thread Switching Priorities Virtual Memory DB buffer size File System Disk layout and access Hardware Storage subsystem Configuring the disk array

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Recall from Tuesday. Our solution to fragmentation is to split up a process s address space into smaller chunks. Physical Memory OS.

Recall from Tuesday. Our solution to fragmentation is to split up a process s address space into smaller chunks. Physical Memory OS. Paging 11/10/16 Recall from Tuesday Our solution to fragmentation is to split up a process s address space into smaller chunks. Physical Memory OS Process 3 Process 3 OS: Place Process 3 Process 1 Process

More information

Chapter 8. Virtual Memory

Chapter 8. Virtual Memory Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:

More information

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4 Algorithms Implementing Distributed Shared Memory Michael Stumm and Songnian Zhou University of Toronto Toronto, Canada M5S 1A4 Email: stumm@csri.toronto.edu Abstract A critical issue in the design of

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Swapping. Operating Systems I. Swapping. Motivation. Paging Implementation. Demand Paging. Active processes use more physical memory than system has

Swapping. Operating Systems I. Swapping. Motivation. Paging Implementation. Demand Paging. Active processes use more physical memory than system has Swapping Active processes use more physical memory than system has Operating Systems I Address Binding can be fixed or relocatable at runtime Swap out P P Virtual Memory OS Backing Store (Swap Space) Main

More information

CSE380 - Operating Systems. Communicating with Devices

CSE380 - Operating Systems. Communicating with Devices CSE380 - Operating Systems Notes for Lecture 15-11/4/04 Matt Blaze (some examples by Insup Lee) Communicating with Devices Modern architectures support convenient communication with devices memory mapped

More information

Page Replacement. (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018

Page Replacement. (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018 Page Replacement (and other virtual memory policies) Kevin Webb Swarthmore College March 27, 2018 Today s Goals Making virtual memory virtual : incorporating disk backing. Explore page replacement policies

More information

OS and HW Tuning Considerations!

OS and HW Tuning Considerations! Administração e Optimização de Bases de Dados 2012/2013 Hardware and OS Tuning Bruno Martins DEI@Técnico e DMIR@INESC-ID OS and HW Tuning Considerations OS " Threads Thread Switching Priorities " Virtual

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

On the scalability of tracing mechanisms 1

On the scalability of tracing mechanisms 1 On the scalability of tracing mechanisms 1 Felix Freitag, Jordi Caubet, Jesus Labarta Departament d Arquitectura de Computadors (DAC) European Center for Parallelism of Barcelona (CEPBA) Universitat Politècnica

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (8 th Week) (Advanced) Operating Systems 8. Virtual Memory 8. Outline Hardware and Control Structures Operating

More information

Virtual Memory COMPSCI 386

Virtual Memory COMPSCI 386 Virtual Memory COMPSCI 386 Motivation An instruction to be executed must be in physical memory, but there may not be enough space for all ready processes. Typically the entire program is not needed. Exception

More information

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches Background 20: Distributed File Systems Last Modified: 12/4/2002 9:26:20 PM Distributed file system (DFS) a distributed implementation of the classical time-sharing model of a file system, where multiple

More information

Input/Output Management

Input/Output Management Chapter 11 Input/Output Management This could be the messiest aspect of an operating system. There are just too much stuff involved, it is difficult to develop a uniform and consistent theory to cover

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

! What is virtual memory and when is it useful? ! What is demand paging? ! What pages should be. ! What is the working set model?

! What is virtual memory and when is it useful? ! What is demand paging? ! What pages should be. ! What is the working set model? Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory! What is virtual memory and when is it useful?! What is demand paging?! What pages should be» resident in memory, and» which should

More information

MEMORY: SWAPPING. Shivaram Venkataraman CS 537, Spring 2019

MEMORY: SWAPPING. Shivaram Venkataraman CS 537, Spring 2019 MEMORY: SWAPPING Shivaram Venkataraman CS 537, Spring 2019 ADMINISTRIVIA - Project 2b is out. Due Feb 27 th, 11:59 - Project 1b grades are out Lessons from p2a? 1. Start early! 2. Sketch out a design?

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

CS 333 Introduction to Operating Systems. Class 14 Page Replacement. Jonathan Walpole Computer Science Portland State University

CS 333 Introduction to Operating Systems. Class 14 Page Replacement. Jonathan Walpole Computer Science Portland State University CS 333 Introduction to Operating Systems Class 14 Page Replacement Jonathan Walpole Computer Science Portland State University Page replacement Assume a normal page table (e.g., BLITZ) User-program is

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Design of A Memory Latency Tolerant. *Faculty of Eng.,Tokai Univ **Graduate School of Eng.,Tokai Univ. *

Design of A Memory Latency Tolerant. *Faculty of Eng.,Tokai Univ **Graduate School of Eng.,Tokai Univ. * Design of A Memory Latency Tolerant Processor() Naohiko SHIMIZU* Kazuyuki MIYASAKA** Hiroaki HARAMIISHI** *Faculty of Eng.,Tokai Univ **Graduate School of Eng.,Tokai Univ. 1117 Kitakaname Hiratuka-shi

More information

416 Distributed Systems. Distributed File Systems 2 Jan 20, 2016

416 Distributed Systems. Distributed File Systems 2 Jan 20, 2016 416 Distributed Systems Distributed File Systems 2 Jan 20, 2016 1 Outline Why Distributed File Systems? Basic mechanisms for building DFSs Using NFS and AFS as examples NFS: network file system AFS: andrew

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

A Comparison of Two Distributed Systems: Amoeba & Sprite. By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec.

A Comparison of Two Distributed Systems: Amoeba & Sprite. By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec. A Comparison of Two Distributed Systems: Amoeba & Sprite By: Fred Douglis, John K. Ousterhout, M. Frans Kaashock, Andrew Tanenbaum Dec. 1991 Introduction shift from time-sharing to multiple processors

More information

The Impact of Write Back on Cache Performance

The Impact of Write Back on Cache Performance The Impact of Write Back on Cache Performance Daniel Kroening and Silvia M. Mueller Computer Science Department Universitaet des Saarlandes, 66123 Saarbruecken, Germany email: kroening@handshake.de, smueller@cs.uni-sb.de,

More information

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02).

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Scheduling Required reading: Eliminating receive livelock Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Overview What is scheduling? The OS policies and mechanisms to allocates

More information

Parallel Pipeline STAP System

Parallel Pipeline STAP System I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,

More information

Dynamic Multi-Path Communication for Video Trac. Hao-hua Chu, Klara Nahrstedt. Department of Computer Science. University of Illinois

Dynamic Multi-Path Communication for Video Trac. Hao-hua Chu, Klara Nahrstedt. Department of Computer Science. University of Illinois Dynamic Multi-Path Communication for Video Trac Hao-hua Chu, Klara Nahrstedt Department of Computer Science University of Illinois h-chu3@cs.uiuc.edu, klara@cs.uiuc.edu Abstract Video-on-Demand applications

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Today. Adding Memory Does adding memory always reduce the number of page faults? FIFO: Adding Memory with LRU. Last Class: Demand Paged Virtual Memory

Today. Adding Memory Does adding memory always reduce the number of page faults? FIFO: Adding Memory with LRU. Last Class: Demand Paged Virtual Memory Last Class: Demand Paged Virtual Memory Benefits of demand paging: Virtual address space can be larger than physical address space. Processes can run without being fully loaded into memory. Processes start

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information