Implementation and Performance Evaluation of RAPID-Cache under Linux

Ming Zhang, Xubin He, and Qing Yang
Department of Electrical and Computer Engineering, University of Rhode Island, Kingston, RI 02881
{mingz, hexb, qyang}@ele.uri.edu

Abstract

Recent research results [1] using simulation have demonstrated that the RAPID-Cache (Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache) has the potential to significantly improve the performance and reliability of disk I/O systems. To validate whether the RAPID-Cache can live up to its promise in a real-world environment, we have designed and implemented a RAPID-Cache under the Linux operating system as a kernel device driver. As expected, the measured performance results are very promising. Numerical results using a popular benchmark program show a factor of two to six performance gain in terms of average system throughput. Furthermore, the RAPID-Cache driver is completely transparent to the Linux operating system: it requires no change to the OS or to the on-disk data layout. As a result, it can be used as an add-on to an existing system to obtain immediate performance and reliability improvement.

Key Words: Disk I/O, File Cache, RAPID-Cache, Performance Evaluation

1. Introduction

Modern disk I/O systems make extensive use of non-volatile RAM (NVRAM) write caches for asynchronous writes [2,3,4]. Such write caches significantly reduce the response time of disk I/O systems as seen by users, particularly in RAID systems. Large write caches can also improve system throughput by taking advantage of both temporal and spatial locality, as data may be overwritten several times or combined into large chunks before being written to disks. However, the use of a single-copy write cache compromises system reliability, because RAM is less reliable than disks in terms of Mean Time To Failure. Dual-copy caches overcome the reliability problem but are prohibitively costly, since RAM is much more expensive than disk storage.

We have proposed a new disk cache architecture called Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache, or RAPID-Cache for short, to provide fault-tolerant caching for disk I/O systems inexpensively. Simulation results [1] have shown that the RAPID-Cache is an effective cache structure that provides better performance and reliability at lower cost than single-copy or dual-copy cache structures.

In order to verify the feasibility and validate our simulation results for RAPID-Cache in real-world environments, we have implemented a RAPID-Cache prototype under Red Hat Linux 7.1. Using our implementation, we carried out measurements with different cache configurations, including a single-copy unified cache, a dual-copy unified cache, and the RAPID-Cache. Numerical results show that all three cache configurations provide better performance than the basic Linux file system cache. They also show that, among these configurations, the RAPID-Cache architecture provides the highest performance and reliability at the same cost.

The rest of the paper is organized as follows. The next section presents the detailed design and implementation of RAPID-Cache. Section 3 presents our performance evaluation methodology, numerical results, and analysis. We conclude the paper in Section 4.

[Figure 1. RAPID-Cache on top of a disk system. The disk controller contains a primary unified cache (DRAM/NVRAM) and a backup cache (small NVRAM plus log disk), above the data disk.]
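To make the organization in Figure 1 concrete, the following C sketch shows one way the two halves of a RAPID-Cache could be represented. All type and field names here are hypothetical illustrations, not the actual driver's definitions.

```c
/* Hypothetical top-level view of a RAPID-Cache (illustration only).
 * The primary unified cache holds the single working copy of cached data;
 * the backup cache keeps an inexpensive redundant copy of dirty data in a
 * small NVRAM that is destaged to a log disk. */
struct unified_cache;        /* ordinary DRAM/NVRAM unified cache            */
struct nvram_lru_cache;      /* small NVRAM LRU cache of recent writes       */
struct segment_buffer;       /* NVRAM buffers used to assemble log segments  */
struct log_disk;             /* sequential log on a disk or disk partition   */

struct backup_cache {
    struct nvram_lru_cache *lru;
    struct segment_buffer  *segbufs;   /* typically two to eight buffers     */
    int                     nr_segbufs;
    struct log_disk        *log;
};

struct rapid_cache {
    struct unified_cache *primary;     /* serves reads and absorbs writes    */
    struct backup_cache   backup;      /* redundancy at low NVRAM cost       */
    /* the data disk(s) sit below both caches */
};
```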

2. Design and Implementation

The RAPID-Cache organization consists of two main parts: a unified cache and a backup cache. The unified cache in a RAPID-Cache has the same structure as a normal unified cache and can use system DRAM, or NVRAM for higher reliability. The RAPID-Cache's backup cache is a two-level hierarchical structure with a small NVRAM on top of a log disk, similar to the structure used in our previous work [5, 6].

2.1 Backup Cache Structure

The backup cache in a RAPID-Cache consists of an LRU cache, several segment buffers, a log disk, and a disk segment table. The LRU cache and the segment buffers must reside in NVRAM, while the disk segment table can be stored in system DRAM to reduce cost. When the system needs to be recovered after a crash, the disk segment table can be reconstructed from the contents of the LRU cache and the log disk. The log disk can be either a separate disk, for better I/O performance, or just a logical partition of the data disks, to reduce cost.

The LRU cache records the recently used data that arrive from the upper layer via write requests. It contains a hash table, a number of hash entries, and data blocks that store the data. The hash table is indexed by the data's LBA (logical block address). The total cache size is configurable in our implementation and can range from several MB to several hundred MB.

The segment buffers are used to construct log segments before they are written to the log disk. The number of segment buffers is configurable in the RAPID-Cache, usually between two and eight. More segment buffers allow faster destaging of data to the log disk but require more costly NVRAM; moreover, the speed of moving data from the segment buffers to the log disk is limited by the log disk bandwidth. Currently, the number of segment buffers in our RAPID-Cache is eight. The size of each segment buffer is 32KB in our implementation: each buffer contains 31 data slots of 1KB each and a header recording the LBA of every slot. A segment buffer can also be configured to 64KB or 128KB, in which case the number of data slots becomes 63 or 127, respectively.

The log disk stores the less frequently accessed data in the backup cache. Data on it is organized in segments similar to those of a log-structured file system such as the Sprite LFS [7] and the BSD LFS [8]. Each disk segment has the same structure as a segment buffer. A reserved area at the beginning of the log disk stores metadata about the log disk, including its size, how many disk segments it can hold, which segment is the next available one, and so on. To speed up metadata updates, a copy of this metadata is maintained in the NVRAM; during normal operation only the NVRAM copy is updated, and the metadata on the log disk is synchronized from the NVRAM copy when needed.

The disk segment table contains information about the log segments on the log disk. For each segment on the log disk, it has a corresponding entry recording the LBAs of all data slots in that segment. Each entry also has a bitmap with one bit per data slot indicating whether the slot's data is still valid.

[Figure 2. Backup cache detailed structure.]
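As a rough illustration of the backup-cache components described above, the following C sketch lays out the three main structures. Field names and sizes are illustrative assumptions, not the driver's actual definitions.

```c
#include <stdint.h>

#define SLOT_SIZE     1024   /* 1KB data slots                              */
#define SLOTS_PER_SEG 31     /* 31 slots plus a header fill a 32KB segment  */

/* NVRAM LRU cache entry: a recently written block, indexed by LBA
 * through a hash table and linked into an LRU list. */
struct lru_entry {
    uint32_t          lba;                   /* logical block address (key) */
    struct lru_entry *hash_next;             /* collision chain             */
    struct lru_entry *lru_prev, *lru_next;   /* LRU list links              */
    uint8_t           data[SLOT_SIZE];       /* cached block data           */
};

/* NVRAM segment buffer: a log segment under construction. */
struct segment_buffer {
    uint32_t lba[SLOTS_PER_SEG];             /* header: LBA of each slot    */
    uint32_t nr_used;                        /* slots filled so far         */
    uint8_t  slot[SLOTS_PER_SEG][SLOT_SIZE];
};

/* DRAM disk-segment-table entry: one per segment on the log disk.  It can
 * be rebuilt after a crash from the NVRAM LRU cache and the log disk. */
struct seg_table_entry {
    uint32_t lba[SLOTS_PER_SEG];             /* LBAs of that segment's slots */
    uint32_t valid_bitmap;                   /* one bit per slot: still valid? */
};
```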
2.2 Operations

The operations performed on the RAPID-Cache are Read, Write, Destage, and Garbage Collection. There is also a Recovery operation that is executed only during system reconstruction. All of these operations are implemented in our prototype; more detailed descriptions can be found in [1].

2.3 Interfaces with Linux

RAPID-Cache can be integrated with an existing Linux system in several different ways. We could modify the Linux kernel source code directly or build it as a stand-alone kernel module, and it could be implemented at different kernel layers, such as the file system layer, the block device layer, or even the lower storage device driver layer. After carefully examining the Linux kernel structure, especially the Linux md and LVM drivers [9], we decided to build RAPID-Cache as a stand-alone device driver in the block device layer. It uses one or several real disk partitions as its data device and another disk partition as its log disk, and it exports itself as a virtual disk-like block device to the file systems above. After loading the RAPID-Cache module into the kernel, we can simply build a file system on it with mkfs and perform I/O operations on it as if it were a real disk device.

This implementation has several apparent benefits. First, since it is a stand-alone device driver, it can be installed easily on a Linux system without recompiling the kernel and can be migrated to other kernel versions with little modification. Second, because it sits in the block device layer and requires no modification to the file system drivers above or the storage device drivers below, it works with all kinds of file systems and storage devices, which greatly broadens its usability. Third, since RAPID-Cache uses existing partitions as its data device and can be loaded dynamically without any modification to a partition's layout or the data on it, it provides immediate performance improvement at very low cost.
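Continuing the hypothetical structures sketched earlier (none of this is the driver's actual code), the following simplified C sketch summarizes how the Write and Read paths of Section 2.2 fit inside the virtual block device of Section 2.3. The helper functions are stand-ins, and locking, the kernel block-layer interface, and error handling are all omitted.

```c
#include <stdint.h>

/* Hypothetical helpers; signatures are illustrative only. */
int  unified_cache_insert(struct unified_cache *c, uint32_t lba, const void *buf);
int  unified_cache_lookup(struct unified_cache *c, uint32_t lba, void *buf);
int  nvram_lru_insert(struct nvram_lru_cache *c, uint32_t lba, const void *buf);
void destage_lru_to_segment_buffer(struct backup_cache *b);
int  read_from_data_disk(struct rapid_cache *rc, uint32_t lba, void *buf);

/* Write path: one copy to the primary unified cache, one to the backup NVRAM. */
void rapid_write(struct rapid_cache *rc, uint32_t lba, const void *buf)
{
    unified_cache_insert(rc->primary, lba, buf);

    if (!nvram_lru_insert(rc->backup.lru, lba, buf)) {
        /* NVRAM full: move least recently used blocks into a segment buffer;
         * full segment buffers are written sequentially to the log disk and
         * recorded in the disk segment table (Destage). */
        destage_lru_to_segment_buffer(&rc->backup);
        nvram_lru_insert(rc->backup.lru, lba, buf);
    }
    /* The write is acknowledged here; the data disk is updated later. */
}

/* Read path: served from the primary cache when possible, otherwise from the
 * data disk.  In this sketch the backup copy is touched only during Recovery. */
int rapid_read(struct rapid_cache *rc, uint32_t lba, void *buf)
{
    if (unified_cache_lookup(rc->primary, lba, buf))
        return 0;
    return read_from_data_disk(rc, lba, buf);
}
```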
3. Performance Evaluation

To see how well RAPID-Cache performs, we carried out a performance evaluation by means of measurements, concentrating on the overall system performance under several different circumstances.

3.1 Experimental Setup

Like other operating systems, the Linux file system provides two operation modes to satisfy different reliability and performance requirements: an asynchronous mode using write-back and a synchronous mode using write-through. Although write-through provides much higher reliability than write-back, it yields much lower system throughput, especially when handling small writes. Since RAPID-Cache uses NVRAM as its primary unified cache and also provides full redundancy, it can offer the same reliability as the original system when both run in synchronous mode, while achieving much better performance. We therefore run RAPID-Cache in both asynchronous and synchronous mode to evaluate system throughput.

We chose the five configurations listed in Table 1 as our target systems. We use two different RAPID-Cache configurations: one has the same total cache buffer size as the single-copy and dual-copy unified caches, and the other has the same primary cache size as the single-copy unified cache.

Denotation | System RAM (MB) | Cache Buffer (MB) | Meaning
256/0      | 256             | 0                 | Basic system
192/64     | 192             | 64                | Single-copy unified cache
192/32+32  | 192             | 32+32             | Dual-copy unified cache
192/56+8   | 192             | 56+8              | RAPID-Cache
184/64+8   | 184             | 64+8              | RAPID-Cache

Table 1. Measurement target configurations.

Parameter | Value
CPU       | Pentium III 866MHz, 256KB L2 cache
Memory    | 256MB PC133 ECC
Hard Disk | Maxtor 5T010H1, ATA-5, 10.2GB, 2MB buffer, 7200RPM, average seek time < 8.7ms, average latency 4.17ms, data transfer rate (to/from media) up to 57MBytes/sec [9]

Table 2. Test environment parameters.

Table 2 shows the configuration of the test machine. We run all tests under Red Hat Linux 7.1 with kernel 2.4.2. We also added some internal counters to observe the dynamic behavior of the cache program. Read_Request and Write_Request are the numbers of read and write requests that the file system sends to the cache. Read_Hit is the number of cache read hits; it is incremented each time the requested data is found in the cache during a read. Write_Hit is the number of cache hits for write operations; a Write Hit means the whole block containing the written data is already in the cache. Write_Hold is the number of times an empty entry is found to hold the incoming write request even though it is not a Write Hit. Since either a Write Hit or a Write Hold eliminates a real write I/O operation to the data disk, we define

    WriteHitRatio = (Write_Hit + Write_Hold) / Write_Request × 100%
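As a small, hypothetical illustration (these helpers are not the driver's actual code), the write-side counters could be maintained as follows, with the ratio computed from them at the end of a run.

```c
/* Hypothetical sketch of the write-side counters defined above.
 * Write_Hit:  the whole block containing the written data is already cached.
 * Write_Hold: an empty cache entry can absorb the write even though it is
 *             not a hit.  Either case avoids an immediate write to the data disk. */
static unsigned long write_request, write_hit, write_hold;

void count_write(int whole_block_cached, int empty_entry_found)
{
    write_request++;
    if (whole_block_cached)
        write_hit++;
    else if (empty_entry_found)
        write_hold++;
}

double write_hit_ratio(void)
{
    /* WriteHitRatio = (Write_Hit + Write_Hold) / Write_Request * 100% */
    return write_request ?
        100.0 * (double)(write_hit + write_hold) / (double)write_request : 0.0;
}
```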

3.2 Benchmark

The benchmark program used in our tests is PostMark [11], a popular file system benchmark developed by Network Appliance Corp. It measures performance in terms of transaction rate in an ephemeral small-file environment by creating a large pool of continually changing files. PostMark generates an initial pool of random text files ranging in size from a configurable low bound to a configurable high bound. This file pool is of configurable size and can be located on any accessible file system. Once the pool has been created, a specified number of transactions is performed. Each transaction consists of a pair of smaller transactions, i.e., create file or delete file, and read file or append file. Each transaction's type and the files it affects are chosen randomly, and the read and write block sizes can be tuned. On completion of each run, a report is generated showing metrics such as elapsed time, transaction rate, total number of files created, and so on.

In our measurements, we run PostMark in several configurations, with the same number of initial files in each case and the number of transactions ranging from 10,000 to 35,000. The total data set accessed ranges from 695.02MB (313.79MB read and 381.23MB write) to 2325.62MB (1127.86MB read and 1197.76MB write), all of which is much larger than the 256MB system memory and the cache sizes. All other PostMark parameters are left unchanged; the default read/write block size is 1KB.

3.3 Measurement Results

3.3.1 Asynchronous Mode

The first experiment measures the overall system performance of the five configurations above in asynchronous mode. Under asynchronous mode, the file system acknowledges write completion to the host as soon as the data is written to the file system cache, without waiting for disk operations; it is similar to copy-back in cache memory terminology. Figure 3 shows the measured PostMark throughput in transactions per second for different request pools.

[Figure 3. System I/O performance measured by PostMark using small pools. Y-axis: throughput (tps); X-axis: number of transactions (10k to 35k); one series per configuration: 256/0, 192/64, 192/32+32, 192/56+8, and 184/64+8.]

From Figure 3 we can see that the 64MB single-copy unified cache performs best. However, as mentioned in the introduction, a single write cache compromises system reliability because it creates a single point of failure. This is particularly true for RAID systems: not only are disks more reliable than RAM, but all modern RAID systems also provide data redundancy through parity disks for fault tolerance. If we use only a single write cache, it becomes the most critical component and compromises the system reliability.

Modern disk systems use dual-copy write caches to guarantee reliability. From Figure 3, we can see that both RAPID-Cache configurations perform better than the dual-copy cache configuration, with up to a 55% performance gain observed. Compared to the basic system, both RAPID-Cache configurations improve the performance by a factor of two. We can expect a larger performance gain if we use separate memory for caching instead of using the system's memory.

Table 3 lists the statistical values collected in the experiment using the 10,000-transaction data set. From this table we can see that the number of read requests is much smaller than the number of write requests, implying that the file system cache does a very good job of caching read requests. It is very interesting to note that almost all read requests filtered out by the file system cache also miss the disk cache and go to the data disk. This is what we expected, because data not present in the file system cache is not frequently used and is very likely to have been destaged to disk from the disk cache. Some of the read data are also metadata of the data disk that must be read from the data disk during the measurement. We also noticed that the number of read requests for the cache configuration 192/32+32 is slightly larger than for the other three configurations. We speculate that the reason is its high miss ratio for write requests: such a high miss ratio may give rise to more disk operations and therefore more metadata operations.

Observing the write hit ratios of the four cache configurations, we noticed that the RAPID-Cache with the 192/56+8 configuration has about a 9% lower hit ratio than the single write cache case. After making the primary unified cache size the same as the single-copy unified cache size, the hit ratio comes closer to the single-copy unified cache case (89% vs. 88.5%), resulting in similar performance with full duplicate redundancy added. In other words, the RAPID-Cache architecture achieves similar system performance and allows full redundancy with exactly the same hardware resources as a single-copy unified cache.

Configuration | Read_Request | Read_Hit | Write_Request | Write_Hit | Write_Hold | Write Hit Ratio
192/64        | 1338         | 1        | 31745         | 262858    | 5793       | 89%
192/32+32     | 151          | 1        | 312928        | 16293     | 3814       | 52.4%
192/56+8      | 1338         | 1        | 29951         | 236524    | 555        | 80.1%
184/64+8      | 1338         | 1        | 318253        | 275855    | 591        | 88.5%

Table 3. Cache internal counter results for the small pool.

[Figure 4a. Throughput and Figure 4b. Write request miss count of the two RAPID-Cache configurations (192/56+8 and 184/64+8) as the number of transactions varies from 10k to 40k.]

Configuration | Read_Request | Read_Hit | Write_Request | Write_Hit | Write_Hold | Hit Ratio
192/64        | 1338         | 1        | 1526591       | 1486893   | 5928       | 97.8%
192/32+32     | 1338         | 1        | 1561319       | 13535     | 4393       | 86.8%
192/56+8      | 1338         | 1        | 15268         | 1461877   | 5792       | 96.2%
184/64+8      | 1338         | 1        | 1541562       | 1491952   | 6116       | 97.2%

Table 4. Cache internal counter results for the small pool (synchronous mode).

We also noticed throughput differences between the two RAPID-Cache configurations as a result of their different unified cache sizes, as shown in Figure 4. The results show that the RAPID-Cache with the 184/64+8 configuration always performs better than 192/56+8, except for the smallest number of transactions (10k). This performance difference can be attributed to the following facts. Reducing the file system cache from 192MB to 184MB results in more file system cache misses, so our 64MB primary unified cache receives more requests coming out of the file system cache. However, the total number of misses from the primary cache is reduced rather than increased, as shown in Figure 4b. In other words, although the total number of requests to the primary cache increases, the actual number of write requests that go to the disk is reduced because of the additional 8MB of write cache. This result indicates that our write cache does a much better job than the traditional file system cache in handling write requests. This is also the reason why configuration 192/64 performs much better than configuration 256/0, as shown in Figure 3.

3.3.2 Synchronous Mode

After evaluating the overall system performance in asynchronous mode, our next experiment checks how well each cache configuration performs in synchronous mode, which provides higher reliability. After mounting the virtual disk-like device in synchronous mode, we run PostMark with the smallest pool to measure the performance of the different cache configurations. The throughput results are shown in Figure 5, and the internal statistics counter values are listed in Table 4.

[Figure 5. Cache performance measured by PostMark using the small pool in synchronous mode. Y-axis: throughput (tps); one bar per configuration: 256/0, 192/64, 192/32+32, 192/56+8, and 184/64+8.]

Figure 5 and Table 4 clearly show that all cache configurations perform much better than the basic configuration. Both RAPID-Cache configurations show a 600% performance gain compared to the original Linux system. Such a six-fold performance gain indicates that our cache algorithm works very well. It is important to note that, to obtain the same system reliability, the 64MB unified cache of configuration 192/64 would have to use NVRAM as opposed to standard DRAM, and such NVRAM increases the system cost. With the RAPID-Cache configurations, however, only the 8MB buffer needs to be NVRAM, because the log disk sits right below the RAM buffer. With full duplicate redundancy, the RAPID-Cache has much higher reliability than the baseline Linux system while at the same time achieving six times better performance.

Two noticeable changes in Table 4 compared to Table 3 are the hit ratios and the number of write requests. The high hit ratio of the unified cache reflects the high efficiency of our cache algorithm. The reason the ratios are higher than in the asynchronous case is as follows: in synchronous mode, write data are not cached in the file system cache, which means all write requests pass through it. As a result, data locality is captured by the unified write cache, as opposed to the file system cache in asynchronous mode.

As for the Write_Request counts, which grow from about 300K in asynchronous mode to over 1.5M in synchronous mode, the additional requests are mainly metadata operations. In the Linux ext2 file system, metadata such as block group descriptors and inodes record information about the file system [11]. A block group descriptor records the inode bitmap, the data block bitmap, the counts of free inodes and free data blocks, and so on, while each inode records the file's size, modification time, and other attributes. For example, an append operation requests free data blocks from the file system, modifies the free data block count and the data block bitmap in the block group descriptor, modifies the file size and modification time in the inode, and writes the data to a data block. This sequence results in a large number of metadata operations, particularly for small data files; creating and deleting files also generate many metadata modifications. Since we mount the file system in synchronous mode, all of these operations go to disk instead of being cached in memory, resulting in the large number of write requests seen by the disk.
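The ext2 on-disk structures mentioned above make the extra metadata traffic concrete. The following abridged C sketch of the ext2 block group descriptor uses the field names of the ext2 format described in [11], with simplified integer types; it is an illustration, not a drop-in kernel definition.

```c
#include <stdint.h>

/* Abridged sketch of the ext2 on-disk block group descriptor (see [11]).
 * In synchronous mode a small append dirties the free-block count here,
 * the data block bitmap it points to, and the file's inode (size, mtime),
 * in addition to the data block itself, so one logical append becomes
 * several separate write requests. */
struct ext2_group_desc {
    uint32_t bg_block_bitmap;       /* block number of the data block bitmap */
    uint32_t bg_inode_bitmap;       /* block number of the inode bitmap      */
    uint32_t bg_inode_table;        /* first block of the inode table        */
    uint16_t bg_free_blocks_count;  /* free data blocks in this group        */
    uint16_t bg_free_inodes_count;  /* free inodes in this group             */
    uint16_t bg_used_dirs_count;    /* directories allocated in this group   */
    uint16_t bg_pad;
    uint32_t bg_reserved[3];
};
```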
Ousterhout, "The design and implementation of a log-structured file system," ACM Transactions on Computer Systems, pp. 26 -- 52, Feb. 1992. [9] Hard Disk Drive Specifications Models: 5T6H6, 5T4H4, 5T3H3, 5T2H2, 5T1H1, Maxtor, [1] J. Katcher, PostMark: A New File System Benchmark, Technical Report TR322, Network Appliance, URL: http://www.netapp.com/tech_library/322.html. [11] M Beck, H BOHME, Linux Kernel Internals, 2 nd Editions. ADDISON-WESLEY.ISBN:-21-33143-8. 7