Deconstructing on-board Disk Cache by Using Block-level Real Traces

Yuhui Deng, Jipeng Zhou, Xiaohua Meng
Department of Computer Science, Jinan University, P. R. China

On-board disk cache is an effective approach to improving disk performance by reducing the number of physical accesses to the magnetic media. Disk drive manufacturers are increasing the on-board disk cache size to match the capacity growth of the backend magnetic media. Some disk drives nowadays have a cache of 32 MB. Modern computer systems use large amounts of memory to improve performance. This feature has a significant impact on the behavior of the disk cache, because computer systems are complex systems consisting of various components that are correlated with each other. Therefore, a specific component cannot be isolated from the overall system when we analyze its performance behavior. This paper employs block-level real traces to explore the performance behavior of the on-board disk cache while considering the impacts of the cache hierarchy contained in computer systems. The analysis gives three major implications: (1) The I/O stream at block level demonstrates very strong spatial locality and negligible temporal locality. Therefore, prefetch in the disk cache can boost performance by leveraging the spatial locality. However, the read/write cache can only achieve marginal benefits because the block-level I/O traffic has almost no temporal locality. (2) A static write cache does not achieve performance gains, since the write stream does not significantly interfere with the read stream. Therefore, it is better to leave the disk cache shared by both the write and read streams. (3) The read cache dominates the contribution to the hit ratio besides prefetch. Thus, it is better to focus on improving the read performance rather than the write performance of the disk cache.

Keywords: on-board cache; disk drives; prefetch; locality.

1. Introduction

The storage hierarchy in current computer architectures is designed to take advantage of data access locality to improve overall performance. Each level of the hierarchy has higher speed, lower latency, and smaller size than lower levels [9, 10]. The performance gap between processor and memory has been alleviated by fast cache memories. However, the performance gap between RAM and disk drives had widened to six orders of magnitude by 2000 and continues to widen by about 50% per year [10]. Therefore, the disk I/O subsystem is repeatedly identified as a major bottleneck to system performance in many computing systems. To alleviate the impact of the widening gap, many research efforts have been invested in improving the performance of disk drives. One of the most effective approaches is employing an on-board disk cache to reduce the number of disk I/Os. Almost all modern disk drives employ a small amount of on-board cache (SRAM), usually less than 32 MB, as a staging area. This staging area acts both as a speed-matching buffer and as a disk cache. As a speed-matching buffer, it allows the disk firmware to overlap transfers over the interface with transfers to or from the magnetic media, thus bridging the data transfer speed difference between the disk interface and the slower magnetic media. As a disk cache, it can speed up accesses to data on the disk drives.

Because accessing data from cache is much faster than from magnetic disk, the disk cache can significantly improve performance by avoiding slow mechanical latency, provided the data accesses are satisfied from the disk cache (cache hit). However, SRAM is still very expensive, which results in a very small cache size in comparison to the capacity of the magnetic disk. Therefore, it is very important to take full advantage of the precious disk cache space.

Over the past decades, how to effectively leverage the limited space of the disk cache has been investigated thoroughly by previous work. Smith [5] considered a number of design parameters for such a disk cache, including cache location, cache size, replacement algorithms, block sizes, access time, bandwidth, consistency, error recovery, etc. Each of these parameters was discussed and/or evaluated in terms of trace-driven simulations. He reported that the addition of a disk cache to a large computer system can significantly improve performance, and that a disk cache (in a large IBM-type system) of less than 8 Mbytes can capture a large percentage of I/O requests. Karedla et al. [2] examined some popular strategies and replacement algorithms for disk cache, as well as the advantages and disadvantages of caching at different levels of the computer system hierarchy. They also evaluated the performance of three cache replacement algorithms: random replacement (RR), least recently used (LRU), and a frequency-based variation of LRU known as segmented LRU (SLRU). Soloviev [12] investigated the performance of multi-segment disk caches for prefetch. He concluded that disk prefetch offers about the same performance as main memory prefetch if request queues are managed in the disk controllers rather than in the host; otherwise, memory prefetch makes disk prefetch obsolete. Therefore, he questioned the benefits of using the disk cache for prefetch. Shriver et al. [28] developed performance models of the disk drive components, including queues, caches, and the disk mechanism, and a workload transformation technique for composing them. They also built a new analytic model for disk drives that perform readahead and request reordering. Since a volatile disk write cache can result in data loss when an error or power failure occurs, a non-volatile write cache can be employed to solve this problem. DOME [19] is a destage algorithm for the non-volatile disk write cache. This algorithm leverages the observation that data in the write cache that have been overwritten once are more likely to be overwritten again than data that have been written only once. Traditionally, in order to ensure the correctness of a write, an additional read is issued immediately after the completion of the write to verify the written content. However, this Read After Write (RAW) degrades the overall performance of a disk drive since it doubles the service time of write operations. In order to reduce the performance impact, Idle Read After Write (IRAW) was proposed to retain the content of a completed and acknowledged write request in the disk cache and verify the content written on the magnetic disk against the cached content during idle times [18]. By using detailed simulations, Zhu and Hu [11] concluded that a larger on-board disk cache does not have much impact on system performance. They argued that a 512KB disk cache is enough to reduce the cost without affecting performance. They also reported that when the number of disk cache segments is larger than the number of concurrent workloads, prefetch offers significant performance improvements for workloads with read sequentiality.

In contrast to the existing work, this paper attempts to explore the performance behavior of the on-board disk cache within the hierarchical cache architecture of typical computer systems, deconstruct the disk cache, and analyze the impact of each function of the disk cache on disk performance. Real traces collected at block level are employed to evaluate the impacts. The experimental results provide useful insights into the behavior of the on-board disk cache. The remainder of this paper is organized as follows. An overview of the hierarchical cache architecture is introduced in Section 2. Section 3 deconstructs the on-board disk cache and describes its components, including the cache organization, cache algorithms, and cache functions. The simulation environment and experimental evaluation are depicted in Section 4. Section 5 concludes the paper with remarks on the main contributions and indications.

2. Hierarchical cache architecture

Fig. 1 Data flow and the corresponding computer components (applications, disk file system, device driver, disk controller, and disk drive, with caches and queues along the path; points 1-4 mark positions where traces can be taken)

A computer is a complex system composed of several components such as the CPU, main memory, disk drive controllers, buses, and disk drives. Fig. 1 shows the basic data flow of a typical computer system [13]. For example, an application issues a file-level I/O which writes its data to the local disk file system (e.g. ext2/ext3) through a write system call. Under the write through policy, the data to be written travels via the volume manager, device driver, PCI bus, and disk controller, and is finally turned into block-level I/O and written to the disk drives. If the write back policy is chosen, the data in the memory cache will be written back to the disk drives when the data is evicted from the cache. A reply will finally be sent back to the application. When a read request (file-level I/O) issued from an application wishes to access data stored in the disk drive, it first checks the memory cache. If the data can be found in the cache, the read request will be served and the corresponding data will be sent to the application immediately. Otherwise, the required data has to be first retrieved from the disk drives through block-level I/O, then passed through the peripheral bus to the bus adapter, then across the PCI bus into the system memory cache through the system bus, and finally the data will be sent to the application.

Fig. 1 shows that a typical computer maintains caches and queues at different levels. For example, the disk file system has a file system cache, and both the device driver and the disk drive maintain caches and queues. The disk controller in Fig. 1 can be an intelligent device which also manages and performs caching and scheduling of requests for the disk drive. The cache is normally employed to avoid physical I/Os. The queue provides a temporary container so that schedulers at different levels can reorder or rearrange the requests in the queue to improve I/O response time [21, 22, 23]. The caches and queues at different levels construct a hierarchical architecture. The cache policies and schedulers at different levels have significant impacts on the data access pattern going to the disk drives. For instance, the data streams issued from multiple applications can be merged and recomposed when the data goes through the file system cache. This combination and re-composition in turn affects the schedulers. A typical example is the difference (e.g. in request size) between file-level I/O at point 1 and block-level I/O at point 4 illustrated in Fig. 1. File-level I/O requests can be variable in length as defined by the file system protocol, while block-level I/O requests always access data in units of sectors, where each sector is usually 512 bytes. Therefore, due to the hierarchical cache architecture, the data access pattern seen at the application level is essentially changed by the time the data finally reaches the disk drive. Since block-level I/O is the termination of a data flow, and the pattern of block-level I/O has significant impacts on the cache algorithms and scheduling policies of disk drives, this paper explores the behavior of the on-board disk cache by using traces collected at block level.
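To make the data flow of Fig. 1 concrete, the following sketch (helper names such as page_cache, disk_cache, and read_from_media are illustrative assumptions, not components named in this paper) shows how a file-level read falls through the host cache before it generates any block-level I/O; only the misses reach the drive, which is why the pattern observed at point 4 differs from the pattern the application issues.

```python
# Sketch of the read path in Fig. 1 (hypothetical helpers, not from the paper).
# A request is served by the first level that holds the data; only host-cache
# misses turn into block-level I/O seen by the on-board disk cache.

def read_block(lbn, page_cache, disk_cache, read_from_media):
    # 1. File system / host page cache (points 1-2 in Fig. 1).
    if lbn in page_cache:
        return page_cache[lbn], "host cache hit"

    # 2. On-board disk cache (point 4 in Fig. 1).
    if lbn in disk_cache:
        data = disk_cache[lbn]
    else:
        # 3. Magnetic media: the only access that pays mechanical latency.
        data = read_from_media(lbn)
        disk_cache[lbn] = data

    # Data brought into the host cache is re-referenced there, not in the
    # disk cache (the "exclusive" behavior discussed in Sections 4.4 and 5).
    page_cache[lbn] = data
    return data, "block-level I/O"
```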
3. Disk cache

Today's RAM has access times ranging from 7 to 10 nanoseconds. We assume that 512 bytes of data (one disk sector) need to be accessed from a RAM with a 64-bit chip configuration and a 10-nanosecond access time. The disk cache access time is therefore about T_cache = (512 x 8 / 64) x 10 ns = 640 ns, or roughly 0.00064 milliseconds. The latest Hitachi Ultrastar 15K147 [1] has a 3.7-millisecond average seek time T_seek, a 15,000 RPM spindle speed, and a maximal internal media transfer rate of 1129 Mbits/sec. Based on the discussions in [14], it is easy to calculate that the average rotational latency is T_rotate = (60,000 ms / 15,000) / 2 = 2 milliseconds, and the internal transfer time of 512 bytes is T_transfer = (512 x 8) / (1129 x 10^6) s, or about 0.0036 milliseconds. Therefore, the average disk access time is T_access = T_seek + T_rotate + T_transfer = 3.7 + 2 + 0.0036, or about 5.7 milliseconds.
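The arithmetic above can be reproduced directly from the quoted drive parameters; the short script below is a sketch that makes the nanosecond-versus-millisecond gap explicit (it simply recomputes the values derived in the text).

```python
# Back-of-the-envelope latencies from the parameters quoted above.

SECTOR_BYTES = 512
RAM_BUS_BITS = 64          # 64-bit chip configuration
RAM_ACCESS_NS = 10         # 10 ns per access

SEEK_MS = 3.7              # Ultrastar 15K147 average seek time
RPM = 15000
MEDIA_RATE_MBIT_S = 1129   # maximal internal media transfer rate

# Cache (RAM) access: number of 64-bit transfers times the access time.
t_cache_ms = (SECTOR_BYTES * 8 / RAM_BUS_BITS) * RAM_ACCESS_NS * 1e-6   # ~0.00064 ms

# Disk access: seek + half a revolution + media transfer of one sector.
t_rotate_ms = (60_000 / RPM) / 2                                        # 2 ms
t_transfer_ms = (SECTOR_BYTES * 8) / (MEDIA_RATE_MBIT_S * 1e6) * 1e3    # ~0.0036 ms
t_access_ms = SEEK_MS + t_rotate_ms + t_transfer_ms                     # ~5.7 ms

print(f"T_cache  = {t_cache_ms:.5f} ms")
print(f"T_access = {t_access_ms:.3f} ms  (about {t_access_ms / t_cache_ms:,.0f}x slower)")
```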

According to the above discussion, magnetic disks are millisecond devices and RAMs are nanosecond devices. Therefore, accessing data from the cache is much faster than from the magnetic disk. This is the reason why the disk cache can significantly improve performance by avoiding slow mechanical latency, provided the data accesses are satisfied from the disk cache (cache hit). The disk cache works on the premise that the data in the cache will be reused often; by temporarily holding data, it reduces the number of physical accesses to the magnetic disk. To achieve this goal, caches exploit the principles of data locality to improve the hit ratio. Data locality can be further divided into spatial locality and temporal locality. Spatial locality implies that if a block is referenced, then nearby blocks will also be accessed soon. Temporal locality implies that a referenced block will tend to be referenced again in the near future. Compared with I/O optimizations that increase the efficiency of individual I/Os, reducing the number of physical disk I/Os by increasing the hit ratio of the disk cache is the most effective method to improve disk performance.

3.1 Disk cache organization

The disk cache is normally divided into independent and equal segments that correspond to sequential streams of data. Such a division can better serve multiple streams of sequential data. Some modern disk drives can dynamically resize their segments, thus altering the number of segments to respond to perceived patterns of workload. Effectively, each I/O stream is treated as having its own cache, since each segment contains data that is disjoint from all other segments. The cache segments tend to be organized as circular queues of logically sequential disk sectors, with new sectors pushed into an appropriate queue either from the bus (during a write) or from the disk media (during a read). When the controller detects that there are more streams than segments, segment replacement takes place to make room for the new streams [4].

There are several typical cache replacement algorithms, including Random Replacement (RR), Least Frequently Used (LFU), and Least Recently Used (LRU) [2]. RR replaces cache lines by randomly selecting a cache line to evict. This policy is very fast, requires no extra storage, and is the easiest to implement. However, it performs poorly because it does not take advantage of spatial and temporal locality. LFU is based on the access counts of the cache lines: the cache lines which have been used least frequently are evicted. Unfortunately, previously active but currently cold cache lines tend to remain entrenched in the cache. Therefore, the inactive data increases the miss ratio and reduces the cache performance. LRU evicts the cache line used least recently, on the assumption that it will not be used in the near future. LRU is simple to implement for small caches but becomes computationally expensive for large ones. Therefore, LRU and its variations are the most frequently used algorithms in disk caches.
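As a concrete illustration of LRU replacement over cache segments (and of the segment list sketched in Fig. 2 below), here is a minimal Python sketch; the class name, the per-stream keying, and the read/write insertion points are illustrative choices, not the firmware's actual implementation.

```python
from collections import OrderedDict

class SegmentedLRUCache:
    """Minimal sketch of an LRU-managed, segmented disk cache (illustrative only)."""

    def __init__(self, num_segments):
        # Ordered from MRU (first) to LRU (last); one entry per segment.
        self.segments = OrderedDict()          # stream_id -> cached sectors
        self.num_segments = num_segments

    def lookup(self, stream_id):
        if stream_id in self.segments:
            # Cache hit: promote the segment to the MRU position.
            self.segments.move_to_end(stream_id, last=False)
            return self.segments[stream_id]
        return None

    def insert_read(self, stream_id, sectors):
        # Data fetched from the media enters at the MRU end.
        self._evict_if_full()
        self.segments[stream_id] = sectors
        self.segments.move_to_end(stream_id, last=False)

    def insert_write(self, stream_id, sectors):
        # Written data enters at the LRU end: it is unlikely to be re-read soon.
        self._evict_if_full()
        self.segments[stream_id] = sectors      # appended at the tail (LRU side)

    def _evict_if_full(self):
        if len(self.segments) >= self.num_segments:
            self.segments.popitem(last=True)    # evict the LRU segment
```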

Fig. 2 Organization and logical flow of a typical disk cache (cache segments ordered from MRU to LRU; reads are inserted at the MRU end, writes at the LRU end, and eviction occurs at the LRU end)

Fig. 2 shows the organization and the corresponding logical flow of a typical disk cache, where each rectangle represents a cache segment [18]. The cache segments are organized as a singly linked list. The head of the list is the Most Recently Used (MRU) segment, which has the highest retention priority among all the cache segments. The tail segment is the Least Recently Used (LRU) segment, and it has the lowest retention priority. The list runs from the MRU segment to the LRU segment; the closer a segment is to the LRU segment, the sooner it will be evicted. Generally, data retrieved from the magnetic disk is placed in the MRU segment, and write data is placed in the LRU segment, because recently written data is not highly likely to be read in the near future. When the MRU segment is occupied by new data, the previous holder of the MRU segment and all other segments are pushed one position toward the LRU segment, and their retention priorities are also reduced. This results in the eviction of the LRU segment. If there is a cache hit, the segment holding the required data is promoted to the MRU position and there is no eviction from the cache.

3.2 Disk cache functions

The disk cache basically plays four roles: (1) a working memory for the disk firmware; (2) a speed-matching buffer between the disk media and the disk interface; (3) a prefetch buffer; (4) a read/write cache [24]. We explore the functions of prefetch and read/write cache since they are supposed to have significant impacts on the performance of disk drives. The prefetch and the read/write cache take advantage of spatial locality and temporal locality, respectively, to improve performance.

Prefetch

The disk cache normally implements prefetch to take advantage of spatial locality by anticipating future requests for data and bringing the data into the cache. Since most applications tend to access data sequentially, it is likely that the next read request will access the data that follows the last request. Thus, subsequent commands can access the prefetched data from the cache instead of from the magnetic disks. This helps increase the hit ratio of the disk cache. An effective prefetch can boost the performance of storage systems by reducing physical disk I/Os. Traditional prefetch policies aim to minimize I/O time by deciding (1) when to fetch a block from the disk drive, (2) which block to fetch, and (3) which block to replace. The prefetch can be stopped when the disk drive is required to serve a newly arrived request or another pending request. Although the prefetch may be stopped when it reaches the end of a track or cylinder, most disk drives aggressively prefetch beyond these boundaries even though a new request might have to wait for the head switch or seek to complete. In some cases, prefetch stops when the cache segments are full.
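The decision logic just described can be summarized in a few lines. The sketch below detects a sequential read stream and issues a bounded read-ahead with the stop conditions mentioned above; the threshold of 1200 sectors and the helper names are illustrative assumptions, not the drive's actual parameters.

```python
def maybe_prefetch(request, last_end_lbn, segment_free_sectors,
                   read_from_media, pending_requests, max_prefetch_sectors=1200):
    """Sketch of a simple sequential-detection prefetch (illustrative parameters).

    request              -- (start_lbn, num_sectors) of the read just served
    last_end_lbn         -- LBN immediately after the previous request on this stream
    segment_free_sectors -- free space left in the cache segment
    pending_requests     -- queue of requests waiting for the drive
    """
    start_lbn, num_sectors = request

    # Only prefetch when the stream looks sequential.
    if start_lbn != last_end_lbn:
        return []

    prefetched = []
    next_lbn = start_lbn + num_sectors
    budget = min(max_prefetch_sectors, segment_free_sectors)

    for lbn in range(next_lbn, next_lbn + budget):
        # Stop conditions: a new request needs service, or the segment is full.
        if pending_requests:
            break
        prefetched.append(read_from_media(lbn))

    return prefetched
```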

However, a large prefetch can have a negative impact on small caches, because it can displace data that would have been useful in the cache. Prefetch and read/write cache operations are independent features: each of them can be enabled or disabled independently. However, the prefetch feature actually overlaps with the cache operation.

Read cache

When a read request from the file system wishes to access data stored on the disk drives, it first checks whether the requested data is in the read cache. If the data can be found in the cache (read hit), the read request will be served and the corresponding data will be sent to the file system through the disk drive interface immediately. Otherwise, the required data has to be first retrieved from the magnetic disks, saved in the read cache, and finally sent to the file system through the disk interface. A read hit can be further classified as a full read hit or a partial read hit. For a full read hit, the data is simply transferred from the read cache to the interface. For a partial read hit, the disk firmware may utilize the cached data or simply ignore it. When the read cache runs out of data for a read request, the disk will disconnect from its interface bus until additional data is read from the magnetic media into the disk cache. When the amount of available data in the read cache reaches a watermark, the disk will reconnect to the interface bus and continue the interrupted data transfer. The read cache can boost the read performance of workloads that have strong temporal locality, because currently accessed data tend to be accessed again in the near future. A read cache can be used for caching frequently used data. It is also normally combined with prefetch to enhance the performance of disk drives.
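Whether a read is a full or a partial hit is just a question of how much of the requested LBN range is already cached; a minimal sketch follows (representing the cached contents as a set of LBNs is an illustrative simplification).

```python
def classify_read(start_lbn, num_sectors, cached_lbns):
    """Classify a read request against the read cache (sketch).

    cached_lbns -- set of LBNs currently held in the cache segments
    """
    wanted = set(range(start_lbn, start_lbn + num_sectors))
    overlap = wanted & cached_lbns

    if overlap == wanted:
        return "full read hit"      # served entirely from the cache
    if overlap:
        return "partial read hit"   # firmware may use or ignore the cached part
    return "read miss"              # must go to the magnetic media
```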
Write cache

The write cache can use either the write back (a.k.a. write behind or fast write) policy or the write through policy to handle write requests. Under the write back policy, when a write request arrives at the hard drive, the drive immediately notifies the computer that the write request has completed even though the data has not yet been written to the magnetic disks. The data is stored in the write cache and marked as dirty. When the write segments are filled with dirty data, the disk will disconnect from its interface bus and flush the data to the magnetic media. When the amount of dirty data in the write cache falls below a watermark, the disk will reconnect to the interface bus and continue the interrupted data transfer to the disk cache. The process of forcing the transfer of dirty data from the write cache to the backend magnetic disk is called destage. This policy can improve performance by eliminating the time the computer waits for writes to complete. Unlike the write back policy, the write through policy does not store a data copy in the write cache, and does not tell the computer that the write request is done until the data has been written on the magnetic media. The write back policy can significantly improve the performance of disk drives. Biswas et al. [20] reported that a simple write back policy for the write cache is effective in reducing the total number of disk writes by over 50%. There are a number of different algorithms for write back: immediate write back, write back with full cache, write back with thresholds, periodic write back, idle write back, and opportunistic write back [19, 20]. The threshold algorithm, which employs a high watermark to enable and a low watermark to disable the destage, normally performs better than the others.
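The threshold approach amounts to a small state machine; the sketch below uses hypothetical watermark values (0.8 and 0.3), which are assumptions for illustration rather than values reported by any drive.

```python
def update_destage_state(destaging, dirty_segments, total_write_segments,
                         high_watermark=0.8, low_watermark=0.3):
    """Threshold-based destage control (sketch; watermark values are illustrative).

    destaging -- True if the drive is currently flushing dirty data to the media
    Returns the new destaging state.
    """
    fill = dirty_segments / total_write_segments
    if not destaging and fill >= high_watermark:
        return True    # enable destage: flush write-cache contents to the media
    if destaging and fill <= low_watermark:
        return False   # disable destage: resume servicing host requests normally
    return destaging
```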

However, the write back policy runs the risk of data loss when an error or power failure occurs, because the on-board disk cache is volatile memory. Therefore, the host software has to verify what data has actually been written on the magnetic disks and take appropriate steps to rewrite the data as part of the restart process. Traditionally, read after write is adopted to ensure the correctness of a write by verifying the written content via an additional read immediately after the completion of the write [18]. The write cache of some high-end disk drives can be disabled from the server through the SCSI interface when very strong data integrity is required. Non-volatile memory may be able to handle this issue; however, most modern disk drives do not adopt non-volatile memory for the disk write cache.

If the read cache is enabled, the data written to the disk is retained in the cache to be made available for future read cache hits. The same buffer space and segmentation are used as for the read functions. When a write command is issued and the read cache is enabled, the cache is first checked to see if any logical blocks that are to be written are already stored in the cache from a previous read or write command. If there are, the respective cache segments are cleared. The new data is cached for subsequent read commands. If the number of write data logical blocks exceeds the size of the segment being written into, then when the end of the segment is reached, the data is written into the beginning of the same cache segment, overwriting the data that was written there at the beginning of the operation; however, the drive does not overwrite data that has not yet been written to the medium. If write caching is enabled, the drive may return Good status on a write command after the data has been transferred into the cache, but before the data has been written to the medium. If an error occurs while writing the data to the medium, and Good status has already been returned, a deferred error will be generated. The Synchronize Cache command may be used to force the drive to write all cached write data to the medium. Upon completion of a Synchronize Cache command, all data received from previous write commands will have been written to the medium.

3.3 Disk cache size

The disk cache today can hold more data due to the increasing cache size (e.g. the Ultrastar A7K1000 has a 32MB disk cache [15]). This results in a higher hit ratio. However, studies have indicated that increasing the cache beyond its optimal size yields diminishing performance benefits. Cost is another factor in determining cache size, because memory is still more expensive than magnetic storage. The on-board disk cache tends to be more expensive than main memory because the dual-ported static RAM in disk controllers has to be able to work with both the media device and the bus interface [11]. To achieve an optimal cost-to-performance ratio, system designers generally believe that the size of a cache should be at least 0.1 to 0.3 percent of the back-end storage. Manufacturers typically offer caches between 0.1 and 1 percent of the back-end storage [2]. Hsu and Smith [3] reported that a disk cache in the megabyte range is sufficient, and that for a very large disk cache, the hit ratio continues to improve only slightly as the cache size is increased beyond a threshold. Therefore, further increasing the cache capacity only achieves a limited contribution to the hit ratio, which results in low cost-effectiveness.

4. Evaluation

4.1 Evaluation environment

Table 1. Disk characteristics of Seagate-Cheetah15k5
Storage capacity (GByte): 146
RPM: 15000
Sustained bandwidth (Mbytes/sec): Up to 125
Average seek time (ms): 3.5
Average read/write (ms): 4.0
Cache segment size: 600KB
Number of cache segments: 32
Number of write segments: 11
Prefetch: Y
Maximum prefetch size: 600KB
Zero latency: Y

A real implementation of such a comprehensive and complicated system would be difficult and would take an extremely long time. Trace-driven simulation is a principal approach to evaluate the effectiveness of our proposed design, because it is much easier to change parameters and configurations than in a real implementation. Trace-driven simulation is a form of event-driven simulation in which the events are taken from a real system that operates under conditions similar to the ones being simulated. By using a simulator and reference traces, we can evaluate the system in different environments and under a variety of workloads. DiskSim [8] is an efficient, accurate, highly configurable, trace-driven disk system simulator. We employed the DiskSim simulator to deconstruct and evaluate the on-board disk cache. Several experimentally validated disk models are distributed with DiskSim. The experimental results reported in this paper were generated using the validated Seagate-Cheetah15k5 disk model. The detailed disk characteristics are summarized in Table 1.

In this paper, we adopt three metrics to explore the behavior of the on-board disk cache: average response time, hit ratio, and read hit ratio. Based on these three metrics, we analyze the impacts of different cache behaviors on the overall disk performance. The average response time includes both the time needed to serve the I/O request and the time spent waiting or queuing for service. The hit ratio of the disk cache indicates the percentage of requests that check the on-board disk cache and find some usable data. The read hit ratio of the disk cache denotes the percentage of read requests that check the on-board disk cache and find all the requested data.
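Given per-request simulation output, the three metrics are straightforward to compute. The sketch below assumes a simple per-request record layout of our own choosing; it is not DiskSim's actual output format.

```python
def summarize(requests):
    """Compute the three metrics used in this paper from per-request records (sketch).

    Each record is assumed to look like:
      {"is_read": bool, "response_ms": float, "hit": bool, "full_hit": bool}
    where "hit" means the request found some usable data in the disk cache and
    "full_hit" means a read found *all* of its data there.
    """
    n = len(requests)
    reads = [r for r in requests if r["is_read"]]

    avg_response_ms = sum(r["response_ms"] for r in requests) / n
    hit_ratio = sum(r["hit"] for r in requests) / n
    read_hit_ratio = sum(r["full_hit"] for r in reads) / len(reads) if reads else 0.0

    return avg_response_ms, hit_ratio, read_hit_ratio
```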

4.2 Real traces

As discussed in Section 2, the caches and queues in a typical computer system build a hierarchical architecture. The workload experienced by a higher-level cache is often very different from that seen by a lower level. For example, if an application buffers I/Os, then several application-level calls may be combined into one storage system operation. According to Fig. 1, traces can be collected at points 1, 2, 3, and 4 on the data path. However, the block-level trace at point 4 contains the actual disk requests and therefore reflects the real performance behavior of the on-board disk cache, because the schedulers and buffering policies above the drive (including the file system) reorder and rearrange requests to optimize performance before they reach it. Therefore, block-level I/O is the appropriate level at which to study the behavior of the disk cache.

Maximum number of write segments: this specifies the number of cache segments available for holding write data at any point in time. Because write-back caching is typically quite limited in current disk cache management schemes, some caches only allow a subset of the segments to be used to hold data for write requests (in order to minimize any interference with sequential read streams).

Table 2. Characteristics of the four block-level traces
Trace name                         Mds       Rsrch     Wdev      Proj
Number of requests
Read percentage (%)
Average read request size (KB)
Sequential reads                   18.48%    13.1%     2.8%      6.74%
Average inter-arrival time (ms)
Average write request size (KB)
Sequential writes                  3.45%     0.6%      2.0%      8.98%
Average inter-arrival time (ms)

In order to better understand the data access patterns generated by modern servers in a data centre, Narayanan et al. [7] instrumented the core servers in Microsoft's data centre to collect block-level traces. They traced 36 volumes containing 179 disks on 13 servers. Each server has two disk drives configured as a RAID1 for booting the server, and uses one or more RAID5 volumes as data volumes. Windows Server 2003 SP2 is the operating system on all the servers. Data is stored through NTFS and accessed through a variety of interfaces including CIFS and HTTP. The traces cover one week and are gathered per volume, below the file system cache. The traces are collected using Event Tracing For Windows (ETW) [6], and each event describes an I/O request seen by a Windows disk device (i.e., a volume), including a timestamp, the disk number, the start logical block number, the number of blocks transferred, and the type (read or write). In our experiments, we extracted 7 one-day traces from the Microsoft trace and modified the trace format to meet the requirements of DiskSim. Table 2 illustrates the characteristics of the four block-level traces, where Mds, Rsrch, Wdev, and Proj indicate that the servers are used for media, research projects, test Web, and project directories, respectively. Please note that the traces Wdev and Proj are write intensive. Sequential reads counts the read requests whose starting addresses are sequential to the immediately previous request to the same device; the table reports the fraction of requests that are sequential reads. The same concept applies to writes.
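Computing the sequential-read and sequential-write fractions in Table 2 only requires remembering where the previous request to each device ended. The sketch below works over records carrying the ETW fields named above; the exact tuple layout is an assumption for illustration.

```python
def sequential_fractions(events):
    """Fraction of sequential reads/writes, as defined in Section 4.2 (sketch).

    Each event carries the fields named in the ETW description above:
    (timestamp, disk_number, start_lbn, num_blocks, rw) with rw in {"read", "write"}.
    A request is sequential if its start address immediately follows the previous
    request to the same device.
    """
    next_lbn = {}                      # disk_number -> LBN right after the last request
    counts = {"read": 0, "write": 0}
    sequential = {"read": 0, "write": 0}

    for timestamp, disk, start_lbn, num_blocks, rw in sorted(events):
        counts[rw] += 1
        if next_lbn.get(disk) == start_lbn:
            sequential[rw] += 1
        next_lbn[disk] = start_lbn + num_blocks

    return {rw: sequential[rw] / counts[rw] for rw in counts if counts[rw]}
```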

4.3 Impact of prefetch

Zero-latency access [16], a feature of modern disk drives, can start transferring data when the disk head is positioned above any of the sectors in a request. If multiple contiguous sectors need to be read, the disk head can read the sectors from the magnetic media into the disk cache in any order with zero-latency access support. The sectors in the cache are assembled in ascending Logical Block Number (LBN) order and sent to the host. If exactly one track is required, the disk head can begin reading data as soon as the seek is completed; no rotational latency is involved because all sectors on the track are needed. The same concept applies to writes, with a reverse procedure which moves the data from host memory to the disk cache before it can be written onto the media. Therefore, the rotational latency decreases with the growth of the number of useful blocks in a track [14]. As Table 1 indicates, the disk model used in our experiment (Seagate-Cheetah15k5) supports zero-latency access. Due to this feature, enabling prefetch up to the end of the track following the track containing the last sector of the read request does not incur too much overhead. This policy attempts to stay one logical track ahead of any sequential read streams that are detected.

Fig. 3 Impacts of prefetch: (a) average response time, (b) hit ratio of disk cache, (c) read hit ratio of disk cache.
Hit ratio with prefetch: Mds 53.56%, Rsrch 59.23%, Wdev 6.03%, Proj 9.20%; without prefetch: Mds 0.88%, Rsrch 0.00%, Wdev 5.73%, Proj 6.20%.
Read hit ratio with prefetch: Mds 88.92%, Rsrch 69.22%, Wdev 25.60%, Proj 22.16%; without prefetch: Mds 0.01%, Rsrch 0%, Wdev 24.33%, Proj 11.20%.

Fig. 3 illustrates the impacts of prefetch on the average response time, hit ratio, and read hit ratio of the disk cache using the four block-level traces. In this evaluation, we first enabled the prefetch, and then disabled the prefetch function of the Seagate-Cheetah15k5 and left all the disk cache for the read/write cache. Fig. 3 (a) shows that when prefetch is disabled, the average response times of Mds and Rsrch increase by 7% and 115%, respectively. It is interesting to observe that when the traces Mds and Rsrch are used to perform the evaluation, the hit ratios of the disk cache drop to less than 1%. This implies that the read/write cache makes a negligible contribution to the disk hit ratio with the Mds and Rsrch traces. The read hit ratios of the two traces illustrated in Fig. 3 (c) show a similar trend. Prefetch boosts the read hit ratios of Mds and Rsrch from 0.01% and 0% to 88.92% and 69.22%, respectively. This is because the two traces contain strong read sequentiality, as Table 2 illustrates. In contrast to Mds and Rsrch, Wdev and Proj demonstrate completely different performance behavior. According to Fig. 3 (a), after switching off the prefetch, the average response times of Wdev and Proj decrease by 51.36% and 50.88%, respectively. However, the hit ratio and read hit ratio are both reduced to a certain degree. Table 2 shows that, in contrast to Mds and Rsrch, the Wdev and Proj traces are write intensive, contain very weak sequential reads, and have relatively large request sizes. Therefore, after enabling prefetch, the prefetch competes for cache space with the read/write cache and decreases the effective cache space for the intensive write traffic, thus decreasing the disk performance. Fig. 3 indicates that Mds and Rsrch contain strong spatial locality which can be leveraged by prefetch. Wdev and Proj demonstrate a different pattern: the hit ratios of 5.73% and 6.20% with prefetch disabled indicate that the read/write cache works well for these two traces. According to the above analysis, Fig. 3 indicates that the prefetch function has a significant impact on the average response time, hit ratio, and read hit ratio of the disk cache.

4.4 Impact of write cache

Fig. 4 Impacts of write cache: (a) average response time, (b) hit ratio of disk cache, (c) read hit ratio of disk cache.
Hit ratio (P / P+SW / NP / NP+SW): Mds 53.56% / 53.62% / 0.88% / 1.02%; Rsrch 59.23% / 59.29% / 0.00% / 0.00%; Wdev 6.03% / 5.88% / 5.73% / 5.62%; Proj 9.20% / 9.25% / 6.20% / 6.17%.
Read hit ratio (P / P+SW / NP / NP+SW): Mds 88.92% / 89.02% / 0.01% / 0.25%; Rsrch 69.22% / 69% / 0% / 0%; Wdev 25.60% / 24.98% / 24.33% / 23.86%; Proj 22.16% / 22.50% / 11.20% / 11.09%.

There are two ways that disk drives allocate the available cache space for the write cache. First, write and read streams share the available space; in this case, read and write operations contend for the space. Second, some on-board disk caches have dedicated segments for write operations. Statically dedicated write segments can minimize the impact of write requests on the caching and prefetching of sequential read streams. We evaluated the performance with the static write cache either enabled or disabled. In the measurements, the number of overall segments is 32, and the maximum number of write segments is 11. Fig. 4 shows the impacts of the write cache on the average response time, hit ratio, and read hit ratio, where P, P+SW, NP, and NP+SW represent prefetch, prefetch plus static write cache, no prefetch, and no prefetch plus static write cache, respectively. Before the test, we expected that a dedicated write cache would enhance the overall disk performance by minimally affecting the read cache hit ratio. However, Fig. 4 depicts that the static write cache has a negligible impact on the average response time, overall disk hit ratio, and read hit ratio, whether prefetch is switched on or off. In Section 4.3, we concluded that the read/write cache works well for Wdev and Proj. However, Fig. 4 (b) shows that the static write cache has a negligible impact on the hit ratio. This indicates that the write requests do not interfere much with the read streams, and the read cache dominates the contribution to the hit ratio across the four traces. Fig. 4 (c) confirms this conclusion. In order to explore the impact of write cache capacity on performance, we increased the write cache from 1 segment to 8 segments with the four real traces. However, we observed negligible performance variation.

Since prefetched disk blocks need to be stored in the disk cache, prefetch can, in theory, compete for buffer cache entries [27]. However, modern computers normally employ a few GB of host memory, which is orders of magnitude larger than the disk cache. Any data that is rewritten will be rewritten in the host memory rather than in the small on-board disk cache. For this reason, the write cache has a negligible impact on the disk performance and hit ratio. Based on the evaluation, we believe it is better to leave the disk cache shared by both the write and read streams, because prefetch and the read/write cache leverage spatial locality and temporal locality, respectively, to improve performance. This evaluation also confirms that the I/O traffic at the disk drive level does not demonstrate strong temporal locality [4].

Since the real traces used are not extremely write intensive, we employed synthetic traces to explore the impacts of different access patterns on disk performance in extreme scenarios. Each synthetic trace consists of 100,000 requests. The read probability of the traces is set to 10%, 30%, 50%, 70%, and 90%, respectively, and the sequentiality is set to 0%, 50%, and 100%, respectively. The results show that if the data access pattern is 100% random, the average response time increases slightly with the growth of the read probability regardless of prefetch. Furthermore, the performance slightly degrades with prefetch enabled. This is because prefetch does not work well for 100% random accesses, and the overhead incurred by prefetch outweighs the performance improvement. When the random access is decreased from 100% to 50% and the prefetch is off, we did not observe any performance variation. When the access pattern is further reduced to 0% random (100% sequential), the average response time shows the same trend. This indicates that when prefetch is disabled, read probability and sequentiality have negligible impacts on the average response time. This also implies that the data access pattern has a negligible influence on the on-board read/write cache. However, when prefetch is on, the average response time decreases significantly with the growth of the read probability for both 50% random and 0% random. This is reasonable, because prefetch works better with the growth of both read probability and sequentiality. Our experimental results demonstrate that the hit ratio and read hit ratio are not affected by the read probability and sequentiality when prefetch is off; they range between 0% and 0.002%. After the prefetch is switched on, we did not observe any change in either the hit ratio or the read hit ratio with 100% random accesses. However, the hit ratios grow significantly with the increase of read probability once the sequentiality is improved to 50% and 100%. This is because higher read probability and sequentiality enhance the effect of prefetch. These conclusions are consistent with what we obtained from the real traces.
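The synthetic workloads described above are easy to reproduce; below is a sketch of a generator exposing the stated read-probability and sequentiality knobs. The request size, address range, and inter-arrival time are arbitrary choices made here for illustration, not the parameters used in the paper's experiments.

```python
import random

def synthetic_trace(num_requests=100_000, read_prob=0.5, sequentiality=0.5,
                    size_blocks=8, max_lbn=2**28, interarrival_ms=5.0, seed=0):
    """Generate a simple synthetic block-level trace (sketch, illustrative parameters)."""
    rng = random.Random(seed)
    trace, time_ms, next_lbn = [], 0.0, 0

    for _ in range(num_requests):
        is_read = rng.random() < read_prob
        if rng.random() < sequentiality:
            lbn = next_lbn                      # continue the previous stream
        else:
            lbn = rng.randrange(0, max_lbn)     # jump to a random location
        trace.append((time_ms, lbn, size_blocks, "read" if is_read else "write"))
        next_lbn = lbn + size_blocks
        time_ms += interarrival_ms

    return trace
```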

4.5 Impact of cache size

Fig. 5 Impacts of cache size for Mds (hit ratio with prefetch: 53.2%, 53.5%, 53.6%, 53.6%, 53.7%, 53.7%; without prefetch: 0.9%, 0.9%, 0.9%, 0.9%, 1.0%, 1.0%, as the number of segments grows from 8 to 256).
Fig. 6 Impacts of cache size for Rsrch (hit ratio with prefetch: 52.0%, 58.2%, 59.2%, 59.4%, 60.1%, 60.1%; without prefetch: 0.0% throughout).
Fig. 7 Impacts of cache size for Wdev (hit ratio with prefetch: 4.0%, 5.2%, 6.0%, 6.4%, 7.4%, 8.5%; without prefetch: 3.8%, 5.0%, 5.7%, 6.7%, 7.0%, 8.0%).
Fig. 8 Impacts of cache size for Proj (hit ratio with prefetch: 7.6%, 8.4%, 9.2%, 9.6%, 9.9%, 10.4%; without prefetch: 5.4%, 5.9%, 6.2%, 6.4%, 6.5%, 6.7%).
Each figure also plots the average response time against the number of segments, with and without prefetch.

In order to explore the impacts of the disk cache size on the average response time and the hit ratio of the disk cache, we increased the number of cache segments from 8 to 256. Since the static write cache has a negligible impact on the disk performance, we leave the disk cache shared by the write streams and read streams. Fig. 5 and Fig. 6 demonstrate the same behavior. When the prefetch is on and the number of segments is increased from 8 to 16, the average response time is slightly reduced and the hit ratio is slightly increased. However, further increasing the number of segments has a negligible impact on the disk performance. When the prefetch is off, the cache size has no influence on either the average response time or the hit ratio. This is because the traces Mds and Rsrch are both dominated by read traffic and contain a certain degree of sequentiality. Therefore, prefetch can leverage the increased cache space to increase the performance.

In contrast to Fig. 5 and Fig. 6, Fig. 7 and Fig. 8 show different patterns in terms of the average response time and hit ratio. Whether prefetch is switched on or off, the average response time keeps decreasing and the hit ratio keeps improving. Fig. 7 and Fig. 8 also indicate a threshold for the cache size (32 segments). Both the disk performance and the hit ratio improve significantly before the cache size reaches the threshold, and continue to improve only slightly as the cache size is increased beyond it. Therefore, further increasing the cache size only achieves a limited contribution. In contrast to Fig. 6, it is not obvious to observe a threshold in Fig. 7. Fig. 8 demonstrates a different pattern in terms of the average response time and hit ratio. It confirms the impact of prefetch; however, whether the prefetch is switched on or off, increasing the cache size improves the disk performance and cache hit ratio. Fig. 6, Fig. 7, and Fig. 8 also show that the performance impact of the increased cache size actually depends on the workload.

5. Discussions and conclusion

Computer systems are complex systems consisting of various components, all of which are correlated with each other. Therefore, a specific component cannot be isolated from the overall system when we explore its performance behavior. Modern computer systems use significant amounts of memory to improve performance by allowing asynchronous prefetch and write back, and by holding a pool of data that can be re-accessed quickly by clients. For example, the EMC 8830 storage array adopts 64GB of memory, and the HP rp7400 server employs 32GB of memory. Even a high-performance laptop normally uses a few GB of memory. This feature has a significant impact on the behavior of the disk cache. Because the capacity of host memory is normally orders of magnitude larger than the disk cache, any data brought into host memory will be re-accessed there, not in the disk cache. This can be regarded as an exclusive feature of the caches contained in modern computer systems [17]. Hsu and Smith [3] reported that a disk cache in the megabyte range is sufficient; for a very large disk cache, the hit ratio continues to improve only slightly as the cache size is increased beyond 4% of the storage used. This indicates that if the disk cache size grows beyond a certain threshold, the additional cache only achieves a limited contribution to the hit ratio, which is not cost-effective. For a fixed cache size, a very important issue is how to effectively share the limited cache capacity between prefetch and the read/write cache. If the prefetch is too aggressive, it can pollute the disk cache and may even degrade the cache hit ratio and system performance. The competition for cache space between prefetch and the read/write cache has been investigated thoroughly by previous work [25, 26, 27]. However, their techniques may not apply directly to the on-board disk cache. The experimental results in this paper give the following indications: (1) The I/O stream at block level demonstrates very strong spatial locality and negligible temporal locality. Therefore, prefetch in the disk cache can boost the performance of disk drives by leveraging the spatial locality. However, the read/write cache can only achieve marginal benefits because the block-level I/O traffic has almost no temporal locality. (2) A static write cache does not achieve performance gains, since the write streams do not interfere much with the read streams.
Therefore, we believe it is better to leave the disk cache shared by both the write and read streams. (3) The disk cache is volatile memory. Because of data reliability concerns it is used to improve read rather than write performance by aggressive prefetch and data retention. We should focus on how to further improve the read performance.

Furthermore, our experimental results also show that increasing the cache size does improve the disk performance. However, in some scenarios, once the cache size reaches a threshold, further increases only obtain marginal performance growth. Therefore, an optimal cache size is cost-effective. This is consistent with the conclusions reported in previous work.

ACKNOWLEDGMENT
The work was supported by the National Natural Science Foundation (NSF) under grant (No. ) and a startup research fund from Jinan University. Any opinions, findings and conclusions are those of the authors and do not necessarily reflect the views of the above agencies.

References
[1] Ultrastar 15K147 hard disk drive specifications.
[2] R. Karedla, J. S. Love, B. G. Wherry. Caching strategies to improve disk system performance. Computer, Vol. 27, No. 3, 1994.
[3] W. W. Hsu and A. J. Smith. The performance impact of I/O optimizations and disk improvements. IBM Journal of Research and Development, Vol. 48, No. 2, 2004.
[4] E. V. Carrera and R. Bianchini. Improving disk throughput in data-intensive servers. In Proceedings of the 10th International Symposium on High-Performance Computer Architecture.
[5] A. J. Smith. Disk cache-miss ratio analysis and design considerations. ACM Transactions on Computer Systems, 3(3).
[6] Microsoft Event Tracing for Windows.
[7] D. Narayanan, A. Donnelly, A. Rowstron. Write off-loading: practical power management for enterprise storage. ACM Transactions on Storage, Vol. 4, No. 3, article 10.
[8] J. S. Bucy, J. Schindler, S. W. Schlosser, G. R. Ganger, et al. The DiskSim simulation environment version 4.0 reference manual. CMU-PDL, May 2008.
[9] N. R. Mahapatra and B. Venkatrao. The processor-memory bottleneck: problems and solutions. ACM Crossroads, 5(3), 1999.
[10] S. W. Schlosser, J. L. Griffin, D. F. Nagle, G. R. Ganger. Designing computer systems with MEMS-based storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2000.
[11] Y. Zhu, Y. Hu. Disk built-in caches: evaluation on system performance. In Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2003).
[12] V. Soloviev. Prefetching in segmented disk cache for multi-disk systems. In Proceedings of the 4th Workshop on I/O in Parallel and Distributed Systems, 1996.
[13] Y. Deng. Deconstructing network attached storage systems. Journal of Network and Computer Applications, Vol. 32, No. 5, 2009.
[14] Y. Deng. Exploiting the performance gains of modern disk drives by enhancing data locality. Information Sciences, Vol. 179, No. 14, 2009.
[15] Ultrastar A7K1000 hard disk drive specifications.
[16] J. Schindler, J. L. Griffin, C. R. Lumb, and G. R. Ganger. Track-aligned extents: matching access patterns to disk drive characteristics. In Proceedings of the Conference on File and Storage Technologies (FAST'02), 2002.


More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 19 Storage Systems, cont. Disks (cont.) Disks - review

Administrivia. CMSC 411 Computer Systems Architecture Lecture 19 Storage Systems, cont. Disks (cont.) Disks - review Administrivia CMSC 411 Computer Systems Architecture Lecture 19 Storage Systems, cont. Homework #4 due Thursday answers posted soon after Exam #2 on Thursday, April 24 on memory hierarchy (Unit 4) and

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

L9: Storage Manager Physical Data Organization

L9: Storage Manager Physical Data Organization L9: Storage Manager Physical Data Organization Disks and files Record and file organization Indexing Tree-based index: B+-tree Hash-based index c.f. Fig 1.3 in [RG] and Fig 2.3 in [EN] Functional Components

More information

Reducing Disk Latency through Replication

Reducing Disk Latency through Replication Gordon B. Bell Morris Marden Abstract Today s disks are inexpensive and have a large amount of capacity. As a result, most disks have a significant amount of excess capacity. At the same time, the performance

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

A Comparison of File. D. Roselli, J. R. Lorch, T. E. Anderson Proc USENIX Annual Technical Conference

A Comparison of File. D. Roselli, J. R. Lorch, T. E. Anderson Proc USENIX Annual Technical Conference A Comparison of File System Workloads D. Roselli, J. R. Lorch, T. E. Anderson Proc. 2000 USENIX Annual Technical Conference File System Performance Integral component of overall system performance Optimised

More information

Memory Hierarchy: The motivation

Memory Hierarchy: The motivation Memory Hierarchy: The motivation The gap between CPU performance and main memory has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory

More information

MEMORY. Objectives. L10 Memory

MEMORY. Objectives. L10 Memory MEMORY Reading: Chapter 6, except cache implementation details (6.4.1-6.4.6) and segmentation (6.5.5) https://en.wikipedia.org/wiki/probability 2 Objectives Understand the concepts and terminology of hierarchical

More information

L7: Performance. Frans Kaashoek Spring 2013

L7: Performance. Frans Kaashoek Spring 2013 L7: Performance Frans Kaashoek kaashoek@mit.edu 6.033 Spring 2013 Overview Technology fixes some performance problems Ride the technology curves if you can Some performance requirements require thinking

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

Locality. CS429: Computer Organization and Architecture. Locality Example 2. Locality Example

Locality. CS429: Computer Organization and Architecture. Locality Example 2. Locality Example Locality CS429: Computer Organization and Architecture Dr Bill Young Department of Computer Sciences University of Texas at Austin Principle of Locality: Programs tend to reuse data and instructions near

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System Xiaodong Shi Email: shixd.hust@gmail.com Dan Feng Email: dfeng@hust.edu.cn Wuhan National Laboratory for Optoelectronics,

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors

More information

Database Systems II. Secondary Storage

Database Systems II. Secondary Storage Database Systems II Secondary Storage CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 29 The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM

More information

Memory Hierarchy: Motivation

Memory Hierarchy: Motivation Memory Hierarchy: Motivation The gap between CPU performance and main memory speed has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory

More information

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Programmer s Wish List Memory Private Infinitely large Infinitely fast Non-volatile

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Memory Hierarchy Y. K. Malaiya

Memory Hierarchy Y. K. Malaiya Memory Hierarchy Y. K. Malaiya Acknowledgements Computer Architecture, Quantitative Approach - Hennessy, Patterson Vishwani D. Agrawal Review: Major Components of a Computer Processor Control Datapath

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Virtual Memory 11282011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review Cache Virtual Memory Projects 3 Memory

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Module 1: Basics and Background Lecture 4: Memory and Disk Accesses. The Lecture Contains: Memory organisation. Memory hierarchy. Disks.

Module 1: Basics and Background Lecture 4: Memory and Disk Accesses. The Lecture Contains: Memory organisation. Memory hierarchy. Disks. The Lecture Contains: Memory organisation Example of memory hierarchy Memory hierarchy Disks Disk access Disk capacity Disk access time Typical disk parameters Access times file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture4/4_1.htm[6/14/2012

More information

Recovering Disk Storage Metrics from low level Trace events

Recovering Disk Storage Metrics from low level Trace events Recovering Disk Storage Metrics from low level Trace events Progress Report Meeting May 05, 2016 Houssem Daoud Michel Dagenais École Polytechnique de Montréal Laboratoire DORSAL Agenda Introduction and

More information

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for

More information

CIT 668: System Architecture. Caching

CIT 668: System Architecture. Caching CIT 668: System Architecture Caching Topics 1. Cache Types 2. Web Caching 3. Replacement Algorithms 4. Distributed Caches 5. memcached A cache is a system component that stores data so that future requests

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Chapter 4. Cache Memory. Yonsei University

Chapter 4. Cache Memory. Yonsei University Chapter 4 Cache Memory Contents Computer Memory System Overview Cache Memory Principles Elements of Cache Design Pentium 4 and Power PC Cache 4-2 Key Characteristics 4-3 Location Processor Internal (main)

More information

WEEK 7. Chapter 4. Cache Memory Pearson Education, Inc., Hoboken, NJ. All rights reserved.

WEEK 7. Chapter 4. Cache Memory Pearson Education, Inc., Hoboken, NJ. All rights reserved. WEEK 7 + Chapter 4 Cache Memory Location Internal (e.g. processor registers, cache, main memory) External (e.g. optical disks, magnetic disks, tapes) Capacity Number of words Number of bytes Unit of Transfer

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

Memory Hierarchies &

Memory Hierarchies & Memory Hierarchies & Cache Memory CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 4/26/2009 cse410-13-cache 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and

More information

Advanced Database Systems

Advanced Database Systems Lecture II Storage Layer Kyumars Sheykh Esmaili Course s Syllabus Core Topics Storage Layer Query Processing and Optimization Transaction Management and Recovery Advanced Topics Cloud Computing and Web

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Secondary storage. CS 537 Lecture 11 Secondary Storage. Disk trends. Another trip down memory lane

Secondary storage. CS 537 Lecture 11 Secondary Storage. Disk trends. Another trip down memory lane Secondary storage CS 537 Lecture 11 Secondary Storage Michael Swift Secondary storage typically: is anything that is outside of primary memory does not permit direct execution of instructions or data retrieval

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Roadmap. Java: Assembly language: OS: Machine code: Computer system:

Roadmap. Java: Assembly language: OS: Machine code: Computer system: Roadmap C: car *c = malloc(sizeof(car)); c->miles = 100; c->gals = 17; float mpg = get_mpg(c); free(c); Assembly language: Machine code: get_mpg: pushq movq... popq ret %rbp %rsp, %rbp %rbp 0111010000011000

More information

I/O Management and Disk Scheduling. Chapter 11

I/O Management and Disk Scheduling. Chapter 11 I/O Management and Disk Scheduling Chapter 11 Categories of I/O Devices Human readable used to communicate with the user video display terminals keyboard mouse printer Categories of I/O Devices Machine

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

Memory Hierarchy. Memory Flavors Principle of Locality Program Traces Memory Hierarchies Associativity. (Study Chapter 5)

Memory Hierarchy. Memory Flavors Principle of Locality Program Traces Memory Hierarchies Associativity. (Study Chapter 5) Memory Hierarchy Why are you dressed like that? Halloween was weeks ago! It makes me look faster, don t you think? Memory Flavors Principle of Locality Program Traces Memory Hierarchies Associativity (Study

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2.

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2. BASIC ELEMENTS Simplified view: Processor Slide 1 Computer System Overview Operating Systems Slide 3 Main Memory referred to as real memory or primary memory volatile modules 2004/S2 secondary memory devices

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Disks, Memories & Buffer Management

Disks, Memories & Buffer Management Disks, Memories & Buffer Management The two offices of memory are collection and distribution. - Samuel Johnson CS3223 - Storage 1 What does a DBMS Store? Relations Actual data Indexes Data structures

More information

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage CSCI-GA.2433-001 Database Systems Lecture 8: Physical Schema: Storage Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com View 1 View 2 View 3 Conceptual Schema Physical Schema 1. Create a

More information

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory Shawn Koch Mark Doughty ELEC 525 4/23/02 A Simulation: Improving Throughput and Reducing PCI Bus Traffic by Caching Server Requests using a Network Processor with Memory 1 Motivation and Concept The goal

More information

William Stallings Computer Organization and Architecture 10 th Edition Pearson Education, Inc., Hoboken, NJ. All rights reserved.

William Stallings Computer Organization and Architecture 10 th Edition Pearson Education, Inc., Hoboken, NJ. All rights reserved. + William Stallings Computer Organization and Architecture 10 th Edition 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. 2 + Chapter 4 Cache Memory 3 Location Internal (e.g. processor registers,

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Lecture 23. Finish-up buses Storage

Lecture 23. Finish-up buses Storage Lecture 23 Finish-up buses Storage 1 Example Bus Problems, cont. 2) Assume the following system: A CPU and memory share a 32-bit bus running at 100MHz. The memory needs 50ns to access a 64-bit value from

More information

Contents. Memory System Overview Cache Memory. Internal Memory. Virtual Memory. Memory Hierarchy. Registers In CPU Internal or Main memory

Contents. Memory System Overview Cache Memory. Internal Memory. Virtual Memory. Memory Hierarchy. Registers In CPU Internal or Main memory Memory Hierarchy Contents Memory System Overview Cache Memory Internal Memory External Memory Virtual Memory Memory Hierarchy Registers In CPU Internal or Main memory Cache RAM External memory Backing

More information

(Advanced) Computer Organization & Architechture. Prof. Dr. Hasan Hüseyin BALIK (4 th Week)

(Advanced) Computer Organization & Architechture. Prof. Dr. Hasan Hüseyin BALIK (4 th Week) + (Advanced) Computer Organization & Architechture Prof. Dr. Hasan Hüseyin BALIK (4 th Week) + Outline 2. The computer system 2.1 A Top-Level View of Computer Function and Interconnection 2.2 Cache Memory

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Operating Systems Lecture 7.2 - File system implementation Adrien Krähenbühl Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Design FAT or indexed allocation? UFS, FFS & Ext2 Journaling with Ext3

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Removing Belady s Anomaly from Caches with Prefetch Data

Removing Belady s Anomaly from Caches with Prefetch Data Removing Belady s Anomaly from Caches with Prefetch Data Elizabeth Varki University of New Hampshire varki@cs.unh.edu Abstract Belady s anomaly occurs when a small cache gets more hits than a larger cache,

More information

5 Computer Organization

5 Computer Organization 5 Computer Organization 5.1 Foundations of Computer Science ã Cengage Learning Objectives After studying this chapter, the student should be able to: q List the three subsystems of a computer. q Describe

More information

CPS104 Computer Organization and Programming Lecture 18: Input-Output. Outline of Today s Lecture. The Big Picture: Where are We Now?

CPS104 Computer Organization and Programming Lecture 18: Input-Output. Outline of Today s Lecture. The Big Picture: Where are We Now? CPS104 Computer Organization and Programming Lecture 18: Input-Output Robert Wagner cps 104.1 RW Fall 2000 Outline of Today s Lecture The system Magnetic Disk Tape es DMA cps 104.2 RW Fall 2000 The Big

More information

Ref: Chap 12. Secondary Storage and I/O Systems. Applied Operating System Concepts 12.1

Ref: Chap 12. Secondary Storage and I/O Systems. Applied Operating System Concepts 12.1 Ref: Chap 12 Secondary Storage and I/O Systems Applied Operating System Concepts 12.1 Part 1 - Secondary Storage Secondary storage typically: is anything that is outside of primary memory does not permit

More information

A Review on Cache Memory with Multiprocessor System

A Review on Cache Memory with Multiprocessor System A Review on Cache Memory with Multiprocessor System Chirag R. Patel 1, Rajesh H. Davda 2 1,2 Computer Engineering Department, C. U. Shah College of Engineering & Technology, Wadhwan (Gujarat) Abstract

More information

Storage Systems. Storage Systems

Storage Systems. Storage Systems Storage Systems Storage Systems We already know about four levels of storage: Registers Cache Memory Disk But we've been a little vague on how these devices are interconnected In this unit, we study Input/output

More information

Reliable Computing I

Reliable Computing I Instructor: Mehdi Tahoori Reliable Computing I Lecture 8: Redundant Disk Arrays INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the Helmholtz

More information