6. Results

This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to or from server disks per second, excluding any metadata, parity data, client request packet overhead bytes, server-to-server communication bytes, or any retried data. All server and client programs were started from a remote computer with a window for each node. The calculated results were displayed at a fixed time interval or packet count (presently set to 30,000 packets). The performance was measured by counting packets over a time interval and calculating the resulting throughput. The measured parameters were: time, packet count and distribution by packet type, file data in MB/sec, service time in milliseconds (msec) per block, percent cache hit/miss, inq fullness (measured in percent of capacity), blocks/writev (low, high, and average), msec/writev (average), lines/lseek (average), and reason for cleaning (the percent of lines cleaned for starvation or because the cache was too full). Only the throughput in MB/sec is reported here; the other measured parameters were used to better understand the file system activity.

The tests were performed on file systems with one, two, three, or four servers. In all cases, when a file was striped, each stripe was striped on all disks minus one disk (the parity disk, when it was needed). For example, when testing parity on a four disk file system, the data for each stripe occupied three disks and one disk was used for the stripe parity. When a three disk file system was tested, then for each stripe the data occupied two disks and a third disk held the parity information.

When comparing these two cases it is important to note that with a four disk file system, the data is 75% of the total information that is carried over the network and written to disk, while with the three disk file system it is 66.6%; therefore the expected MB/sec/disk of file data is better with a four disk file system than with a three disk file system. However, the improvement in performance due to the increased number of servers is reduced by the switch performance as the number of servers grows, as shown in Section. To illustrate the effect of the number of servers on the amount of file data that is written on a disk, consider this example: assume that the network can consistently deliver 12MB/sec/disk, that the disk can write 12MB/sec, and that we are writing a file with parity on a four or three disk file system. The ratio of data to parity is 3:1 and 2:1 respectively; therefore, using a four disk file system each disk will write 9MB/sec of file data and 3MB/sec of parity data, but with a three disk file system only 8MB/sec of file data and 4MB/sec of parity data.
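To make this arithmetic concrete, the small C program below (my own illustration, not part of RAMA) computes the per-disk split between file data and parity for three and four disk file systems, using the 12MB/sec per-disk limit assumed in the example.

    #include <stdio.h>

    /* Worked form of the example above: each disk absorbs 12 MB/sec in total,
     * and one block per stripe is parity, so with n disks a fraction (n-1)/n
     * of each disk's bandwidth carries file data and 1/n carries parity. */
    int main(void)
    {
        const double per_disk_mb = 12.0;   /* assumed network/disk limit per disk */

        for (int ndisks = 3; ndisks <= 4; ndisks++) {
            double data   = per_disk_mb * (ndisks - 1) / ndisks;
            double parity = per_disk_mb / ndisks;
            printf("%d disks: %.1f MB/sec file data + %.1f MB/sec parity per disk\n",
                   ndisks, data, parity);
        }
        return 0;   /* prints 8.0 + 4.0 for three disks and 9.0 + 3.0 for four */
    }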

The amount of file data that was written or read in each test was about 50% of the total file system capacity, i.e., 1-4 GB, and each test was repeated tens of times before a result was recorded. The tests ran until the performance was consistent. Results taken while the server cache was still empty were ignored, and results for the ramp up of adding clients to the test were ignored as well. In each test, clients were added as long as there were almost no lost packets (a few per client process). I also attempted, whenever possible, to keep a constant ratio of server NICs to client NICs. However, this was difficult to achieve because the entire system consisted of 20 NICs, and each server has 2 NICs, so as the number of servers increased, there were not enough client NICs available to keep the ratio constant.

There were no other users on the system, but Linux system processes were running as normal system tasks and NFS was mounted as well. There were instances when the tests were repeated with better results than the ones reported in this document. The variance could be 5%-10%. Based on my discussions with the system administrator, this could be related to memory usage by system processes, or to system activity on the disks or the network.

The issues affecting the actual results are:

File size - large files are expected to perform better than small files overall, because small files have fewer blocks over which to amortize metadata overhead costs.

New file versus reused file - new files have more metadata overhead cost per data block than rewritten files, resulting in slower overall performance. This is because the file size needs to be continuously updated as a new file grows.

Files with parity for improved fault tolerance - incur the cost of writing parity information, resulting in slower overall performance compared to files without parity protection.

Rewriting incomplete stripes in files with parity protection - incurs the additional cost of reading missing stripe data in addition to writing parity information, resulting in slower overall performance compared to files with complete stripe data for parity calculations.

Striping - affects the distribution of data on a disk and across all the disks. Striping on more disks reduces the overall portion of parity data that is required and therefore improves performance. In addition, storing more data blocks on a disk before striping on the next disk improves performance because it results in more vectors per writev system call (see Section 3.2.1).

Overall switch performance - switch performance per server does not scale linearly, as explained in Section. As more ports participate in communication, aggregate switch performance improves, but performance per node is reduced.

6.1. Effect of Disk Organization on Performance

Disk Capacity

The disk capacity was fixed at 2 Gigabytes per disk.

Block Size

Two block sizes were tested: 8,192 bytes (8K) and 16,384 bytes (16K). Most of the reported results in this document are for the 8K block size. The 16K block size was tested in order to demonstrate that the results are in line with the expected numbers based on the effect of the larger block size on the network delivery capabilities, as explained in Section, and on the disk capabilities, as explained in Section. In both instances the performance is expected to be somewhat better. The reasons for choosing these block sizes are discussed in Section 3.3. In both cases the same number of file bytes per disk were striped.

Number of Lines and Overflow Lines

In all tests the disks were divided into 250 lines with 0 overflow lines. The code, however, supports overflow lines. The effect on disk performance when the overflow line(s) start to fill up was not tested.

Stripe Size

The stripe size that was used in the tests was derived from the baseline tests of the network and disk. Examining the disk writev performance in Section, we see that as the number of vectors increases, the write performance increases as well; however, once the number of vectors reaches 16 for 8KB blocks (8 for 16K blocks), the improvement per additional vector is smaller (see also Figure 5). I therefore selected the stripe size to be 16 and 8 blocks per disk, respectively.

Server Cache Size

The effect of cache size was not examined in my experiments. In all tests the cache had room for 64MB of data (8000 8K blocks or 4000 16K blocks), parity, or mirror inodes, whichever arrived from the network or was read from the disk. In addition, all tests were designed to minimize cache hits, because the intent was to measure the network and disk effect on the file system, not the caching improvement of reading from cache. It was not possible to completely eliminate cache hits, but in all tests the cache hits were between 0% and 3%, and tests were repeated many times. The selection of the cache size is based on the assumption that the blocks are distributed roughly evenly among the disk lines and that, in order to optimize cache cleaning, a line needs to have on average a full stripe to be written to disk efficiently.

With 8000 cache blocks, 250 lines, and 16 blocks/stripe/disk for 8K blocks (8 blocks/stripe/disk for 16K blocks), this is equal to about 2 stripes per line. Indeed, the tests showed that this assumption was correct, and the average number of vectors per writev was more than one stripe. An 8000 block cache fills in about 8 seconds if data arrives at 12 MB/sec. If the cache had only 4000 blocks, with an average of 1 stripe per line, some lines would have less than a full stripe to be cleaned to disk. At 16 blocks per writev the disk can write 11.86 MB/sec, but with 8 blocks per writev only 7.46 MB/sec, a 37% reduction in performance for that line. In general, if the cache is not large enough, it will become full faster and cleaning will have to be performed with fewer dirty blocks in each line, hence a small vector count, yielding slower disk writes. If the cache is very large, it will initially take longer to fill. However, since the arrival rate of data is the same, no matter how large the cache is beyond some minimal size, cleaning will occur just as often, at the rate that data arrives into the cache. What I need to guarantee is that there is a minimal average number of blocks per line available in the cache, so that enough blocks per line are written in each cleaning cycle for optimal write time. Table 2 shows the relative improvement in disk write performance, and the maximum improvement for disk writev, based on the number of blocks written when cleaning a line (8K blocks, 16 blocks per disk line).

If the file system has many blocks that are frequently reused (read after write, read repeatedly, rewritten often), then a larger cache is advantageous because these reused blocks can wait in the cache between uses and repeated disk access is limited.
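The effect of the vector count shows up directly in the cleaning path. The sketch below is my own illustration, not the RAMA source: it gathers the dirty blocks of one line into a single writev call, so a line holding about two full 16-block stripes (the roughly 32 blocks per line that 8000 cache blocks spread over 250 lines give on average) is written with one seek and one system call. BLOCK_SIZE, MAX_VECTORS, and the calling convention are assumptions.

    #include <sys/uio.h>
    #include <unistd.h>

    #define BLOCK_SIZE  8192          /* 8K blocks, as in most of the tests     */
    #define MAX_VECTORS 32            /* about two full 16-block stripes        */

    /* Clean one line: write up to 'nblocks' dirty cache blocks with a single
     * writev() after a single lseek() to the start of the line on disk.        */
    ssize_t clean_line(int disk_fd, off_t line_offset, char *blocks[], int nblocks)
    {
        struct iovec iov[MAX_VECTORS];
        int n = (nblocks < MAX_VECTORS) ? nblocks : MAX_VECTORS;

        for (int i = 0; i < n; i++) {
            iov[i].iov_base = blocks[i];
            iov[i].iov_len  = BLOCK_SIZE;
        }
        if (lseek(disk_fd, line_offset, SEEK_SET) == (off_t)-1)
            return -1;
        return writev(disk_fd, iov, n);   /* one transfer; more vectors, faster */
    }

The more dirty blocks a line has accumulated when it is cleaned, the more vectors this single call carries, which is why a cache that is too small (few dirty blocks per line) drags the write rate down.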

Amount of Metadata

The metadata consists of directory blocks, inodes, and mirror inodes. The directory blocks occupy 0.28% of the total disk space for 8K byte blocks. In addition, each file carries overhead that is equal to the size of two inodes, written on two separate disks. One inode is on the disk holding data block 0, and the same inode is mirrored on the next sequential disk for fault tolerance. In the 8K byte block organization, 136 inodes fit in a block, so about 1.5% of a block is needed per file to store inodes. Therefore, for the smallest file, where the file size equals 1 block of data, the metadata overhead is 3% plus the space for the directories, for a total of 3.28%.

Table 2. Relative Improvement of Disk Write Performance With the Number of Stripes per Writev (8K Blocks, 16 Blocks/Stripe/Disk); columns: number of stripes per writev, writev MB/sec, and percent improvement over a single stripe.

6.2. Effect of Network Switch on Performance

My expectation is that aggregate performance will scale as more servers are added to the system; however, the scaling is not linear, based on observations that were explained in Figure.

Client Server Ratio

The client:server ratio deals with four separate issues:

How many client processes are sending/receiving data from the servers.

As more client processes are involved in data exchange with the servers, the likelihood of a server being idle is reduced. The reason is that while one client process is waiting for a packet from the server, another client is ready with a packet for the server. For best performance I want enough clients to ensure that the servers are continuously fed with work. The number of client processes is not unbounded, however, because I want to minimize the number of timed out requests, which waste network bandwidth and therefore reduce overall file system performance. In all tests client processes were added as long as the performance improved by the addition. This number varied with each test because the client nodes were a mix of different CPUs, memory, NICs, and speeds. The range was between 1 and 20 per node. All the write data in the tests was generated directly into the client buffers, so no data copies were involved.

How many client nodes are sending/receiving data from the servers.

All processes communicating with servers in a particular client node share the resources (system buffers, NICs, CPU) of that node, so a node has limited capability to generate work for the servers. More client nodes can generate more work for the servers, and are more likely to generate enough work so that no server is ever idle. However, more client nodes contend for network resources, therefore their capability to produce work for the servers is bounded by the switch capabilities. If all servers in a system can serve x packets per second, distributed evenly among them, there is no advantage in doing the work from a single client over many clients, and therefore, in my testing, all available clients were participating in the experiments.

How many client NICs are sending/receiving data from the servers.

In general, more NICs can improve packet delivery over the attached network, as shown in Figure 1 and Figure 4 and explained in Section. All the NICs in these experiments were enabled, and some were channel bonded as detailed in Section.

How many server NICs are sending/receiving data from the clients.

Server NICs are fixed at two NICs per server and are channel bonded.

Flow Control and Time-out

To control congestion, an initial time-out value for replies was set. When a reply did not arrive, the time-out value was increased by one millisecond. This approach made the time-out value elastic in one direction and appropriate for the specific network condition at the time of the congestion. In effect, the elasticity of the time-out value directly controls the number of packets per second flowing in the switch. If a server can service 1000 packets per second and we attempt to send 1100 packets per second, then at least 100 of them will time out without getting a reply. These clients will then attempt to retry, but with a longer time-out, thereby not attempting again for a longer period of time, effectively reducing the number of packets per second that are sent to the server, until the rate of requests can be handled by the servers. Since the time-out value represents the current congestion condition at the switch or the servers, it is desirable to share this information among all threads in a client process. The time-out value is therefore protected by a semaphore. The current implementation does not reduce a timer value that is too high (larger than some minimum threshold) when a reply does arrive on time, for several reasons. The semaphore makes it costly to modify the time-out value because locking and unlocking are involved, and since in most cases a reply does arrive on time, there would be too many inspections of a timer that needs to stay unchanged. In addition, when replies arrive on time, it does not matter how long the timer is set for, since it is not needed for the arriving replies. If I let the timer grow unbounded, it is possible that the timer will reach values that are too high; therefore, an upper limit is set. The initial time-out values are set, based on the packet size, to between 400 and 1500 milliseconds. The limit is set to 3000 milliseconds.
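A minimal sketch of this back-off is shown below, in C, with a pthread mutex standing in for the semaphore; the names and the locking primitive are my assumptions, while the policy (grow by one millisecond on a miss, never shrink, cap at 3000 milliseconds) follows the text above.

    #include <pthread.h>

    #define TIMEOUT_CAP_MS 3000L

    /* One elastic time-out shared by all threads of a client process. */
    static long            timeout_ms   = 400;   /* initial value depends on packet size */
    static pthread_mutex_t timeout_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called only when a request times out: grow the shared value by 1 ms,
     * up to the cap, and return the value to use for the retry. */
    long timeout_on_miss(void)
    {
        long t;
        pthread_mutex_lock(&timeout_lock);
        if (timeout_ms < TIMEOUT_CAP_MS)
            timeout_ms += 1;
        t = timeout_ms;
        pthread_mutex_unlock(&timeout_lock);
        return t;
    }

    /* Replies that arrive on time never touch the lock and never shrink the
     * value, which is the trade-off discussed above. */
    long timeout_current(void)
    {
        return timeout_ms;
    }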

6.3. Read Performance

Reading data involves two types of messages. One is a file open request and reply, and the other is a read data request and reply. The open request is needed so that the file striping parameters and file size are known to the reading client for the hash function calculation. The parity blocks, parity flag, and the mirror inodes are not involved in data reading. The cost of the open request is amortized over all the data blocks read from the file, and since it involves a single small packet exchange over the switch and possibly a disk read for the block that contains the file inode (which is most likely cached from previous open and create requests in the same line), its overhead is minimal. Therefore only large files were tested for read performance. All but one of the tests were designed to completely eliminate cache hits, so all data was read from disk when requested by the clients. The read data was placed in the cache for future use, because that is the design of the file system; however, the data block was never reused, and the corresponding cache slot was replaced by another read block when space was needed. The disks were performing reads only. The data blocks were read sequentially.
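The hash calculation itself is not spelled out in this section; the sketch below is only an illustration of the kind of client-side mapping the open reply makes possible, with struct and field names of my own choosing. Given the striping parameters, the client can locate the server for any block number without further metadata traffic.

    /* Illustrative only: not the actual RAMA hash function. */
    struct stripe_params {
        int nservers;          /* disks that hold this file's data              */
        int blocks_per_disk;   /* blocks written on one disk before moving on,
                                  e.g. 16 for the 8K block tests                */
    };

    /* Map a file block number to the index of the server/disk that stores it. */
    int block_to_server(const struct stripe_params *sp, long block_no)
    {
        long unit = block_no / sp->blocks_per_disk;   /* which striping unit    */
        return (int)(unit % sp->nservers);            /* round-robin over disks */
    }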

The specific disks that were used have a cache buffer (track buffer) of 2MB each. This buffer is not accessible to the file system directly, but when the file system requests data from the disk and that data is in the track buffer already, the disk head does not need to move, and the data is transferred from the track buffer to the file system buffer. This transfer is relatively quick because it does not require a mechanical motion (of the disk head). When the disk track buffer is fully loaded, it is considered warm. What was observed in the read tests is that the read performance improves as the number of file blocks in a line is increased. In other words, the read performance is directly related to the striping parameters, the number of blocks striped on a single disk before moving to another disk. The explanation for this observation is that the disk reads ahead when a block is requested, and keeps the additional data in the track buffer until the space is needed for other reads. When starting the server programs, the disk track buffer is empty and therefore each read request involves the physical reading of data from the disk surface. As the test progressed, the disk cache warmed and the read performance improved as some of the read blocks were satisfied from the track buffer. The upper limit on the performance was the network speed. The read performance fluctuates in a wide range depending on the ratio of reads that are satisfied from the track buffer to the ones that require a disk head movement. The reads were for files striped at 1 and 16 blocks/disk and were requested in groups of 1 to 16. When more than one block was requested, all the requests were sent to the server before any of the replies were examined by the client. This process is done in a tight loop and the requests leave the client very fast compared to the rate of replies coming back from the server.

When several clients are doing this same loop, possibly to the same server, a large number of read requests can queue at the server, and some are dropped because the server request queue is limited in capacity (to force a time-out on the client side when the server is too busy). Indeed, when the group was large, some packets were lost and requests had to be retried. This is also related to the flow control issues that were discussed in Section. There were always enough requests to keep the servers busy.

The read performance for a four disk system was 0.9MB/sec/server when the files were striped at 1 block/disk, which is consistent with the base read test for the disk as shown in Figure 12. The reason for this performance is that the read-ahead information in the disk cache is wasted when the next block that is read from the same file is on another disk. When the files were striped at 16 blocks/disk, the performance fluctuated between 2MB/sec/disk and 10MB/sec/disk, but stayed most of the time in the middle of that range. This performance shows the positive effect of the disk read ahead into the track buffer. Aggregate read performance is then 3.6MB/sec to 40MB/sec for a four disk file system. For comparison, a read test when all the data is in cache (100% cache hits in four servers) performed at 15.6MB/sec/disk, for an aggregate performance of 62.4MB/sec for a four disk file system. The performance of a smaller file system was not tested, since the dominating factor in read performance is the disk itself, not the network, and therefore similar results are expected.

6.4. Write Performance

Writing data involves several types of packet exchanges between client and server and between server and server. If the file already exists (in the case of rewrites), a file open request and reply is needed between client and server. For a new file, a file create message dialogue between the client and server needs to be exchanged, an inode for the new file needs to be created (an operation that might include reading inode blocks from the disk to check for duplicates), and possibly, space for the new file inode must be found if all existing inode blocks which are cached are full. In addition, the new/changed inode needs to be mirrored on another disk. All this, before any data block can be written. After the file is successfully created/opened, parity, file size update, and possibly invalidate parity messages are needed in addition to data, to guarantee consistency of file data. All of these operations are detailed in the following subsections. Because of the varied amount of overhead based on file type and how this overhead is amortized over different file types, this section is divided to distinguish results for small files, large files, files with and without parity, and new files versus rewritten files.

Large Files Without Parity

This section reports the performance results for writing files that are not fault tolerant and are therefore not protected by parity. The work involved in writing these files is as follows:

- Creating an inode for the file on one of the servers (inode disk).

- Copying the inode to another server (mirror disk). This is necessary even if the file is not fault tolerant because multiple inodes occupy the same disk block, and other inodes in the same block may be fault tolerant. This work is done by the inode disk and requires server-to-server communication.

- Sending the data blocks.

- Updating the inode with the new file size as it grows. This requires a small packet to be sent from the client to the inode disk after each data block is sent.

- Updating the mirror inode (on another disk) with the growing file size, at a lower frequency than the update of the inode. This work is done by the inode disk and requires server-to-server communication.

Because the files are large, the cost of the creation and mirroring of the inode is amortized over many data blocks and is a very small cost to performance. However, the cost of updating the inode and mirror inode with the growing size of the file is expensive, a penalty of 32%-39% for writing new files over rewriting old files. This can be reduced by less frequent updates, say every other block, once a second, or every stripe, but none of these options were measured (further discussed in Chapter 7). A different approach could be to write higher numbered blocks ahead of lower numbered blocks, in which case the file size is not updated for every data block, because lower numbered blocks look like rewrites to the file system, even when they are actually writing into holes in the file. Another solution could be to create files with an initial estimate of the file size and later (maybe when closing the file) update the size with the exact value.

I believe that creating a file which is really a big hole is not a problem, because it occupies no disk space, and attempting to read hole blocks from the file will return NULLs in the read buffers, but not EOF. The next two sections report the actual performance for new and rewritten large files without parity.

New Large Files Without Parity

The file system performance for writing new large files without parity is illustrated in Figure 11 and Figure 12. Figure 11 shows how an individual server performs among all servers in the system for 8K blocks. Figure 12 shows how the entire system scales as servers are added to the system. Both figures also show the stand-alone disk read and writev performance for reference. The disk writev performance line in the charts represents the value measured for a stand-alone disk with the same number of vectors for writev as the striping of the tested file data. The disk read performance line in the charts is for a single block read.

Examining the individual server performance, we observe that when the system grows from a single server to two servers, the performance of each server is lower than in a single server system. This is explained by several factors:

Server to server communication: The fact that a single server system has no inode mirroring activity eliminates server to server communication, thereby leaving more resources for file data processing. On a multiple server system, each inode is duplicated on a second server, but on a single server system this is not possible.

This fact reduces performance from 7.8 MB/sec/disk for a single server system to 7.1 MB/sec/disk for a two server system and 6.9 MB/sec/disk for a three or four server system, a cost to performance of roughly 9% and 11.5% respectively.

Figure 11. Individual Server Write Performance for Large Files Without Parity (MB/sec versus number of servers; reference curves: disk random seek & vectors/writev, disk random seek & read 1 blk; data curves: newFile-noParity (I+M+W), rewriteFile-noParity (W)).

Figure 12. Aggregate Server Write Performance for Large Files Without Parity (MB/sec versus number of servers; same reference and data curves as Figure 11).

HOL blocking: By having more servers, the switch performance is reduced, as explained in Section.

writev performance: The average blocks/writev (not shown in the figures) is reduced when writing inodes or mirror inodes because they consist of single blocks. When writing a single block, the disk can write at 1.4MB/sec (for 8K blocks), but when writing 16 blocks per writev call, the disk can write 11.8MB/sec.

When writing 17 blocks per writev (the case when the inode is in a line that also has a stripe ready to be transferred to disk), the performance is 12.3 MB/sec, not much better than a full stripe without the inode, but much better than writing the inode and data in two separate writev calls. When two full stripes are cleaned in a single writev call, the disk can perform at 14.4MB/sec, or 14.6MB/sec for two stripes and an inode. In general, when an inode or mirror block is cleaned alone, as opposed to being cleaned as one additional vector while cleaning a data stripe, the effective rate at which file data is written is reduced because of the overhead of writing inodes and mirror blocks. The exact reduction in speed depends on the ratio between single inodes and tag-along inodes to be cleaned per line.

In Figure 11, the performance of a single disk is shown to be better than the single-stripe writev performance shown in Figure 5. That is because in some cases there is more than one stripe ready to be cleaned in a line, while the reference figure shows the writev performance for a single stripe. For multiple servers, this benefit is reduced by the other costs outlined above, and therefore the performance line is below the reference line of full stripe writev performance. The aggregate performance of the entire system shows that the file system scales as servers are added, from 7.8MB/sec for a single server to 14.2MB/sec, 20.7MB/sec, and 27.6MB/sec for 2, 3, and 4 servers respectively.

Rewrite Large Files Without Parity

The best performance is achieved when rewriting files without parity. All communication between clients and servers consists of file data alone.

Inodes are not updated with file size and parity is not involved. The performance is expected to be as close to disk speed as possible, as long as the switch can deliver the packets as fast as the disk can rewrite them. As shown in Figure 2 and Figure 5, the network can deliver around 18MB/sec to a single server, and the disk can write at 11.86MB/sec when there are 16 vectors for the writev call. If there are more blocks on the line to write, the disk can perform better. Figure 11 and Figure 12 show that the actual performance for a single server is 12.4MB/sec, indicating that the average writev vector count is around 18. When a second server is added to the test, each individual server can write at 11.7MB/sec, a reduction of 5.6% per server compared with a single server, due to HOL blocking inside the connecting switch. Three and four servers perform at 10.2 MB/sec, a reduction of 17.7% per server compared with a single server. Three and four servers perform at about the same rate per server. This reduction in performance is consistent with the results achieved for the network baseline performance as explained in Section. The aggregate performance of the entire system shows that the file system scales as servers are added, from 12.4MB/sec for a single server to 23.4MB/sec, 30.6MB/sec, and 40.8MB/sec for two, three, and four servers respectively. 16K blocks are expected to perform somewhat better than 8K ones, and 32K blocks are expected to perform somewhat worse. Indeed, the 16K block results are 11.7MB/sec/disk and the 32K blocks perform at 9.1MB/sec/disk on a four disk file system. This is compared with the 10.2MB/sec/disk for the 8K blocks on the same system, and is consistent with the differences in the network performance and disk writev performance for the same parameters.

Large Files With Full Parity

In order to add fault tolerance, parity is added and written to the disks so that files can be recovered in case of a disk fault. The amount of additional data varies and depends on the parameters that are given by the user for each particular file when it is created or at any time later. If a file is created with the NOPARITY flag set, the file is not protected: it is lost if its inode is on a faulty disk, or data from a faulty disk is lost if the inode resides on a good disk. The amount of additional data that is written is proportional to the striping details of each file. A file that is striped on n disks carries 1/n of its size (in stripes, rounded up) as additional parity information. In addition to sending parity and data over the switch and writing them to the disk, there is an additional overhead that is caused by the need to keep data consistent at all times. The issue manifests itself as follows: assume that a client rewrites a portion of a stripe and then aborts for some reason without rewriting the changed parity block. If one of the disks where the stripe resides now crashes, the missing data can be recovered, but it will not be correct and, even worse, it will not be known that it is incorrect. Table 3 shows an example.

Table 3. Inconsistent Disk Example (columns: event, data on disk 1, data on disk 2, data on disk 3, data on parity disk; rows: the old stripe data, the changes made before disk 3 crashes, and the incorrect data on the recovered disk).
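To make the hazard concrete, the short program below walks through the scenario with illustrative values of my own (additive parity stands in for the real parity code to keep the arithmetic readable): a client rewrites one data block, aborts before updating the parity, and the later reconstruction of the crashed disk is wrong with no indication that it is wrong.

    #include <stdio.h>

    int main(void)
    {
        /* Old stripe: three data blocks and their parity (illustrative values). */
        int d1 = 2, d2 = 3, d3 = 4;
        int parity = d1 + d2 + d3;              /* parity block holds 9           */

        d1 = 7;                                  /* client rewrites disk 1 ...     */
        /* ... and aborts before rewriting the parity block, which still holds 9. */

        /* Disk 3 crashes and is reconstructed from the surviving disks.          */
        int recovered = parity - d1 - d2;        /* 9 - 7 - 3 = -1, not 4          */
        printf("recovered block value %d, correct value was 4\n", recovered);
        return 0;
    }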

The additional work that is required to guarantee consistency when a client faults is to tag a parity block as invalid while the data of the stripe that it protects is being modified. After the modification is complete, a new parity block is written and the invalid tag is erased. This action is only required when rewriting a stripe, not when a stripe is written for the first time, i.e., not for new files. The actual work involved is to send a small message to the parity disk holding the parity for the particular stripe, instructing the file system to tag the parity block as invalid. Only after the reply to this message is received by the client may the client proceed with the change to the data itself. The file system at the parity disk needs to do some additional work as well. It needs to monitor the invalid parity blocks and make sure that the update to each block arrives within some finite amount of time. If it does not, the server assumes that the client process died and the updated parity block is not coming, and it recreates a new parity block by reading the entire stripe from the servers holding the stripe data. At that time the parity block is correct and is no longer tagged. Only one tagging message is required for an entire stripe. The work that is required for inode and mirror inode maintenance is the same as with files that have no parity.
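The client-side ordering of these exchanges can be sketched as follows. The message functions are printf stubs of my own, not RAMA APIs; they only make the sequence described above explicit.

    #include <stdio.h>

    /* Placeholder message functions: each one stands for a packet exchange. */
    static int send_invalidate_parity(long stripe)
    { printf("-> parity disk: tag parity of stripe %ld invalid\n", stripe); return 0; }

    static int send_data_block(long stripe, int i)
    { printf("-> data disk %d: new block for stripe %ld\n", i, stripe); return 0; }

    static int send_parity_block(long stripe)
    { printf("-> parity disk: new parity for stripe %ld (tag cleared)\n", stripe); return 0; }

    /* Client-side sequence for rewriting one stripe of a parity-protected file. */
    static int rewrite_stripe(long stripe, int ndata)
    {
        /* 1. Invalidate the old parity and wait for the reply before touching data. */
        if (send_invalidate_parity(stripe) != 0)
            return -1;

        /* 2. Rewrite the data blocks on their servers. */
        for (int i = 0; i < ndata; i++)
            send_data_block(stripe, i);

        /* 3. Send the recomputed parity; if the client dies before this step, the
         *    parity server times out, reads the stripe, and rebuilds the parity.  */
        return send_parity_block(stripe);
    }

    int main(void) { return rewrite_stripe(42, 3); }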

Figure 13. Individual Server Write Performance for Large Files With Parity (MB/sec versus number of servers; reference curves: disk random seek & vectors/writev, disk random seek & read 1 blk; data curves: newFile-realTimeParity (I+M+W+P), rewriteFile-realTimeParity (W+V+P), rewriteFile-partialParity (R+W+V+P)).

New Large Files With Full Parity

The file system performance for writing large files with parity is demonstrated in Figure 13 and Figure 14. Figure 13 shows how an individual server performs among all servers in the system for 8KB blocks. Figure 14 shows how the entire system scales as servers are added to the system, also for 8KB blocks. Both figures also show the stand-alone disk read and write performance for reference.

The disk write performance is the value measured for a stand-alone disk with the same number of vectors for writev as the striping of the tested file data. The disk read performance is for a single block read.

Figure 14. Aggregate Server Write Performance for Large Files With Parity (MB/sec versus number of servers; same reference and data curves as Figure 13).

Since at least two disks are needed for fault tolerance protection, one for parity and at least one for data, the charts do not have any points for a single server system. No invalid tag messages are sent to the parity disks when new files are written.

In each system, at least one disk is used for parity and the rest hold data; therefore the amount of work done to generate the parity is fixed in all experiments, one block per stripe, and the amount of work done for data is what makes the difference in performance. The figures start with a two server system because at least two disks are needed to generate parity, one for the data and one for the parity. Examining the individual server performance, we observe that, contrary to files without parity protection, when the system grows from two servers to three and four servers, the performance of each server is better than in a system with fewer servers. The individual server performance shows that the file system scales as servers are added, from 4.6MB/sec/server for a two server system to 4.8MB/sec/server and 5.5MB/sec/server for 3 and 4 servers respectively. These measurements are for file data only. This is explained by several factors:

Inodes or mirror inode blocks may be combined with parity stripes when they are written to disk. This reduces the cost of writing inodes and mirror inodes to disk compared with the cost of writing them as individual blocks. This means that the cost of writing parity blocks is reduced by the saving in the cost of writing inodes and mirror inodes.

The cost of transferring a parity block through the switch is the same as transferring a data block, and writing a parity block to disk takes the same time as writing a data block to disk. Therefore, when striping on more disks, the portion of all communication and disk activity devoted to data is larger than for a system with fewer disks, and the performance scales.

For example, with a two disk system the amount of file data is the same as the amount of parity data (the data is in fact mirrored [56]), but with a three disk system 2/3 of the information is file data and 1/3 is parity, and with a four disk system 3/4 is data and 1/4 is parity.

HOL blocking: By having more servers, the switch performance is reduced, as explained in Section, but not enough to impair scalability.

The aggregate performance of the entire system shows that the file system scales as servers are added, from 9.2MB/sec for a two server system to 14.4MB/sec and 22MB/sec for 3 and 4 servers respectively.

Rewrite Large Files With Full Parity

The performance for rewriting large files is affected by the need to invalidate the parity block before any data block of a stripe is modified. This incurs the additional cost of sending a small message to the parity disk. However, performance should improve compared to new files, because no inode activity is required. In all tests, the clients did not abort before sending the updated parity blocks, and therefore there was no additional cost to generate the parity block by reading the stripe data from all other servers in the system. The individual server performance shows that the file system scales as servers are added, from 4.5MB/sec/server for two servers to 5.7MB/sec/server and 6.5MB/sec/server for three and four servers respectively. These measurements are for file data only. As we can see, for a two disk system the performance is lower for rewriting files than for new files by 0.1MB/sec/disk, because the cost of invalidating parity blocks is greater than the cost of updating inodes with a growing file size.

However, this cost is fixed per stripe, and therefore systems with 3 and 4 disks perform better when rewriting a file than when writing a new file. The aggregate performance of the entire system shows that the file system scales as servers are added, from 9MB/sec for two servers to 17.1MB/sec and 26MB/sec for three and four servers respectively.

Large Files With Partial Parity

When a file which is protected by parity is modified, the associated parity block needs to be modified as well. If the rewrite is to the entire stripe, the parity block is calculated and updated in exactly the same manner as if it were a new stripe, except for the fact that the old parity block is already tagged as invalid, as explained in Section. If, however, only some of the blocks belonging to a stripe are rewritten, a mechanism is needed to get the rest of the information that is needed in order to calculate the modified parity block. There are two options:

- read before write - read the data and parity before they are rewritten, subtract the old data from the old parity, and add the new data to create the new parity block.

- read missing data - read the data blocks that are not modified and add them to the new data to generate a new parity block.

Both solutions require reading old data in order to generate the new parity block. Which one is better? The answer depends on the portion of data that is modified in each stripe, as shown in Table 4 and Table 5. Method 2 is better when more blocks are rewritten, and Method 1 is better when few blocks are rewritten.

The amount of additional data depends on the striping parameters that are given by the user for each particular file when it is created.

number of data blocks rewritten (striping on 4 disks plus parity) | required number of read blocks | required number of write blocks
1 | 1 data block, 1 parity block | 1 data block, 1 parity block
3 | 3 data blocks, 1 parity block | 3 data blocks, 1 parity block

Table 4. Rewrite Data Cost Using the Read Before Write Method

number of data blocks rewritten (striping on 4 disks plus parity) | required number of read blocks | required number of write blocks
1 | 3 data blocks | 1 data block, 1 parity block
3 | 1 data block | 3 data blocks, 1 parity block

Table 5. Rewrite Data Cost Using the Read Missing Data Method

When partial stripes are written for a new file, the procedure is exactly the same, except for the invalid tag, which is not needed and is not sent. The client does not know ahead of time that the stripe is going to be partial. Only after waiting for some finite amount of time for the complete stripe to be ready for parity calculation does the client realize that the entire stripe will not be ready in the allotted time, and a partial stripe parity block is calculated and sent to the parity disk. The parity disk then proceeds as detailed above. The parity disk has no knowledge of why the parity block is partial. It could belong to a file with a hole, to a new file which is not being created sequentially, or to a slow client that did not finish the stripe in the allowed time. In all of these cases, the server behaves the same, in that it reads the missing data (holes) from the other servers and builds a new parity block. The invalid tag is then erased.
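The choice between the two methods follows directly from the block counts in Table 4 and Table 5. The sketch below is my own formulation, not RAMA code: for a stripe of n data blocks of which k are rewritten, read-before-write reads the k old data blocks plus the old parity, while read-missing-data reads the n-k untouched blocks.

    #include <stdio.h>

    /* Blocks that must be read to rebuild the parity of one stripe when k of
     * its n data blocks are rewritten. */
    static int reads_before_write(int n, int k) { (void)n; return k + 1; }  /* k old data + old parity */
    static int reads_missing_data(int n, int k) { return n - k; }           /* untouched data only     */

    int main(void)
    {
        const int n = 4;                       /* striping on 4 disks plus parity */
        for (int k = 1; k <= n; k++) {
            int a = reads_before_write(n, k);
            int b = reads_missing_data(n, k);
            printf("%d of %d blocks rewritten: read-before-write reads %d, "
                   "read-missing-data reads %d -> use %s\n",
                   k, n, a, b, (b < a) ? "read missing data" : "read before write");
        }
        return 0;
    }

For k = 1 and k = 3 this reproduces the read counts of Table 4 and Table 5, and it shows why Method 1 wins when few blocks are rewritten and Method 2 wins when most of the stripe is rewritten.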

The work that is required for inode and mirror inode maintenance is the same as with files that have no parity. All tests were done on existing files, and each stripe was missing all the blocks on one of the disks in the stripe. The work involved was then: tag the old parity block as invalid, rewrite the data (the blocks of one data disk for a three disk system, of two data disks for the four disk system), send the partial parity block to the server, have the server read the missing data block from the server holding the missing data, and write the new parity block. Since all files were preexisting, no inode activity was needed. The individual server performance shows that the file system scales as servers are added, from 2.9MB/sec/server for a three server system to 4.1MB/sec/server for a four server system. As we can see, the performance for rewriting partial stripes of parity-protected files is lower than for completely rewriting files with parity, by 49% for a three disk system and 37% for a four disk system, because reading is done at a rate of 0.94MB/sec. Whenever possible, reading is done in parallel if more than a single data block is missing from the parity block. The aggregate performance of the entire system shows that the file system scales as servers are added, from 8.7MB/sec for three servers to 16.4MB/sec for four servers.

Small Files Without Parity

Rewriting files without parity involves no inode activity, and therefore requires the same operations regardless of the file size. Therefore, rewriting small files is not detailed here.

It is expected that rewriting small files has about the same performance as rewriting large files with identical striping parameters. Writing new small files is affected by the inode creation, mirroring, and file open and close operations, which are amortized over fewer data blocks. In all tests the file size was 16 blocks striped on one disk. The reason is that in order to compare these results with the results for large files, we need the 16 blocks/disk striping. In order to make the inode operations as costly as possible (for comparison purposes), an inode creation was forced every 16 blocks by making the file size 16 blocks. The performance for writing small files is 4.1MB/sec/disk for a one disk file system and 3.1MB/sec/disk for two, three, or four disk file systems. The one server file system requires no mirror block updates and therefore performs better. This is a reduction of 49% and 57% compared with large files on the same file systems, and is due to the inode update work, which is amortized over fewer data blocks. The aggregate performance of writing small files without parity is 4.1, 7.2, 10.3, and 13.4MB/sec for one, two, three, and four server file systems respectively.

Small Files With Parity

As with files without parity, rewriting files with parity or partial parity involves the same operations regardless of the file size. Therefore, rewriting small files with full or partial parity is not detailed here. It is expected that rewriting small files with full or partial parity has the same performance as rewriting large files with identical striping parameters.


More information

ICS Principles of Operating Systems

ICS Principles of Operating Systems ICS 143 - Principles of Operating Systems Lectures 17-20 - FileSystem Interface and Implementation Prof. Ardalan Amiri Sani Prof. Nalini Venkatasubramanian ardalan@ics.uci.edu nalini@ics.uci.edu Outline

More information

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition Chapter 11: Implementing File Systems Operating System Concepts 9 9h Edition Silberschatz, Galvin and Gagne 2013 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory

More information

CPS104 Computer Organization and Programming Lecture 16: Virtual Memory. Robert Wagner

CPS104 Computer Organization and Programming Lecture 16: Virtual Memory. Robert Wagner CPS104 Computer Organization and Programming Lecture 16: Virtual Memory Robert Wagner cps 104 VM.1 RW Fall 2000 Outline of Today s Lecture Virtual Memory. Paged virtual memory. Virtual to Physical translation:

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

To Everyone... iii To Educators... v To Students... vi Acknowledgments... vii Final Words... ix References... x. 1 ADialogueontheBook 1

To Everyone... iii To Educators... v To Students... vi Acknowledgments... vii Final Words... ix References... x. 1 ADialogueontheBook 1 Contents To Everyone.............................. iii To Educators.............................. v To Students............................... vi Acknowledgments........................... vii Final Words..............................

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Outlook. File-System Interface Allocation-Methods Free Space Management

Outlook. File-System Interface Allocation-Methods Free Space Management File System Outlook File-System Interface Allocation-Methods Free Space Management 2 File System Interface File Concept File system is the most visible part of an OS Files storing related data Directory

More information

CSE 451: Operating Systems Winter Page Table Management, TLBs and Other Pragmatics. Gary Kimura

CSE 451: Operating Systems Winter Page Table Management, TLBs and Other Pragmatics. Gary Kimura CSE 451: Operating Systems Winter 2013 Page Table Management, TLBs and Other Pragmatics Gary Kimura Moving now from Hardware to how the OS manages memory Two main areas to discuss Page table management,

More information

File Systems. CS170 Fall 2018

File Systems. CS170 Fall 2018 File Systems CS170 Fall 2018 Table of Content File interface review File-System Structure File-System Implementation Directory Implementation Allocation Methods of Disk Space Free-Space Management Contiguous

More information

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University CS 370: SYSTEM ARCHITECTURE & SOFTWARE [MASS STORAGE] Frequently asked questions from the previous class survey Shrideep Pallickara Computer Science Colorado State University L29.1 L29.2 Topics covered

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Hard Disk Drives. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

Hard Disk Drives. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Hard Disk Drives Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Storage Stack in the OS Application Virtual file system Concrete file system Generic block layer Driver Disk drive Build

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin) : LFS and Soft Updates Ken Birman (based on slides by Ben Atkin) Overview of talk Unix Fast File System Log-Structured System Soft Updates Conclusions 2 The Unix Fast File System Berkeley Unix (4.2BSD)

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Silberschatz, et al. Topics based on Chapter 13

Silberschatz, et al. Topics based on Chapter 13 Silberschatz, et al. Topics based on Chapter 13 Mass Storage Structure CPSC 410--Richard Furuta 3/23/00 1 Mass Storage Topics Secondary storage structure Disk Structure Disk Scheduling Disk Management

More information

CSE 120: Principles of Operating Systems. Lecture 10. File Systems. February 22, Prof. Joe Pasquale

CSE 120: Principles of Operating Systems. Lecture 10. File Systems. February 22, Prof. Joe Pasquale CSE 120: Principles of Operating Systems Lecture 10 File Systems February 22, 2006 Prof. Joe Pasquale Department of Computer Science and Engineering University of California, San Diego 2006 by Joseph Pasquale

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Chapter 8. Virtual Memory

Chapter 8. Virtual Memory Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEM. Chapter 12: File System Implementation OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole File System Performance File System Performance Memory mapped files - Avoid system call overhead Buffer cache - Avoid disk I/O overhead Careful data

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

College of Computer & Information Science Spring 2010 Northeastern University 12 March 2010

College of Computer & Information Science Spring 2010 Northeastern University 12 March 2010 College of Computer & Information Science Spring 21 Northeastern University 12 March 21 CS 76: Intensive Computer Systems Scribe: Dimitrios Kanoulas Lecture Outline: Disk Scheduling NAND Flash Memory RAID:

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006 November 21, 2006 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds MBs to GBs expandable Disk milliseconds

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

COSC 6385 Computer Architecture. Storage Systems

COSC 6385 Computer Architecture. Storage Systems COSC 6385 Computer Architecture Storage Systems Spring 2012 I/O problem Current processor performance: e.g. Pentium 4 3 GHz ~ 6GFLOPS Memory Bandwidth: 133 MHz * 4 * 64Bit ~ 4.26 GB/s Current network performance:

More information

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD.

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. File System Implementation FILES. DIRECTORIES (FOLDERS). FILE SYSTEM PROTECTION. B I B L I O G R A P H Y 1. S I L B E R S C H AT Z, G A L V I N, A N

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Worst-case Ethernet Network Latency for Shaped Sources

Worst-case Ethernet Network Latency for Shaped Sources Worst-case Ethernet Network Latency for Shaped Sources Max Azarov, SMSC 7th October 2005 Contents For 802.3 ResE study group 1 Worst-case latency theorem 1 1.1 Assumptions.............................

More information

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University Chapter 10: File System Chapter 11: Implementing File-Systems Chapter 12: Mass-Storage

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

2. PICTURE: Cut and paste from paper

2. PICTURE: Cut and paste from paper File System Layout 1. QUESTION: What were technology trends enabling this? a. CPU speeds getting faster relative to disk i. QUESTION: What is implication? Can do more work per disk block to make good decisions

More information

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review COS 318: Operating Systems NSF, Snapshot, Dedup and Review Topics! NFS! Case Study: NetApp File System! Deduplication storage system! Course review 2 Network File System! Sun introduced NFS v2 in early

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What

More information

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05 Engineering Goals Scalability Availability Transactional behavior Security EAI... Scalability How much performance can you get by adding hardware ($)? Performance perfect acceptable unacceptable Processors

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University File System Internals Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics File system implementation File descriptor table, File table

More information

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition, Chapter 17: Distributed-File Systems, Silberschatz, Galvin and Gagne 2009 Chapter 17 Distributed-File Systems Background Naming and Transparency Remote File Access Stateful versus Stateless Service File

More information

I/O Device Controllers. I/O Systems. I/O Ports & Memory-Mapped I/O. Direct Memory Access (DMA) Operating Systems 10/20/2010. CSC 256/456 Fall

I/O Device Controllers. I/O Systems. I/O Ports & Memory-Mapped I/O. Direct Memory Access (DMA) Operating Systems 10/20/2010. CSC 256/456 Fall I/O Device Controllers I/O Systems CS 256/456 Dept. of Computer Science, University of Rochester 10/20/2010 CSC 2/456 1 I/O devices have both mechanical component & electronic component The electronic

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2010 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS 28 CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS Introduction Measurement-based scheme, that constantly monitors the network, will incorporate the current network state in the

More information

Lecture 18: Reliable Storage

Lecture 18: Reliable Storage CS 422/522 Design & Implementation of Operating Systems Lecture 18: Reliable Storage Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous versions of

More information

File System Implementation

File System Implementation File System Implementation Last modified: 16.05.2017 1 File-System Structure Virtual File System and FUSE Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance. Buffering

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information