Multilevel Cache Management Based on Application Hints

Gala Golan
Supervised by Assoc. Prof. Assaf Schuster (Computer Science Department, Technion), Dr. Michael Factor, and Alain Azagury (IBM Haifa Labs)
Computer Science Department, Technion, Haifa 32000, Israel
November 24

1 Introduction

Caching has been used in many operating systems to improve file system performance. Typically, caching is implemented by retaining in main memory a few of the most recently accessed disk blocks, thus handling repeated accesses to a block without involving the disk. This reduces delays in two ways: the waiting process is serviced faster, and contention for the disk arm is reduced [1]. A replacement algorithm, or policy, is used to decide which block is the best candidate for eviction when the cache is full.

The most important metric for a cache replacement algorithm is its hit rate, the fraction of pages that can be served from main memory. The miss rate is the fraction of pages that must be paged into the cache from auxiliary memory [21]. An optimal algorithm is one for which the hit rate is maximal. Such algorithms have been the focus of many theoretical and experimental studies. Belady proposed the optimal offline paging algorithm, MIN [2], in 1966, and different measures of competitiveness help analyze the efficiency of online algorithms [3].

In this research we will focus on the problem of managing multiple levels of cache. We propose an optimal offline algorithm for replacement in multiple caches, based on an algorithm for the Relaxed List Update Problem [23] and on the DEMOTE operation [24]. We describe the obstacles to proving the optimality of the proposed algorithm, and use simulation to show that it is better than known algorithms. For the first time, we will rely on file system information at the storage level [14], using it to imitate the proposed offline algorithm.
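
Since MIN recurs throughout this proposal, a minimal sketch may help fix notation. This is an assumed simulator-style formulation, not code from [2]: on a miss with a full cache, MIN evicts the resident block whose next use lies furthest in the future, which requires knowing the whole trace in advance.

    # Minimal sketch of Belady's offline MIN policy for a single cache.
    # The trace and cache size below are illustrative.

    def min_hit_rate(trace, cache_size):
        cache, hits = set(), 0
        for i, block in enumerate(trace):
            if block in cache:
                hits += 1
                continue
            if len(cache) >= cache_size:
                def next_use(b):
                    # Position of b's next request, or infinity if none.
                    try:
                        return trace.index(b, i + 1)
                    except ValueError:
                        return float("inf")
                # Evict the block reused furthest away (or never again).
                cache.remove(max(cache, key=next_use))
            cache.add(block)
        return hits / len(trace)

    print(min_hit_rate([1, 2, 3, 1, 2, 4, 1, 2], cache_size=2))  # 0.25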

2 Background

2.1 Cache Replacement Algorithms

The most commonly used algorithm is LRU, in which the least recently accessed block is evicted whenever a new block is requested [4]. This algorithm is based on the idea of locality of reference. LRU has no notion of frequency, a problem which a few of its variations try to fix. LRU-K [5], 2Q [6], LRFU [7], and LIRS [8] are such variations.

2.2 Application oriented caching: Detection based

More advanced algorithms use more history information for each block in the cache (and sometimes for blocks already evicted from the cache), trying to guess the access pattern of the application. This may help identify the best candidate for eviction. DEAR [9], AFC [10], and UBM [11] all collect data about the file a block belongs to, or about the application requesting it, and derive access patterns such as sequential, looping, and other.

For example, in the UBM scheme, information for each file is maintained as a 4-tuple consisting of a file descriptor, a start block number, an end block number, and a loop period. Sequences of consecutive block references are detected up to the current time. The information is kept in a table and is updated whenever a block reference occurs. The detection mechanism results in files being categorized into three types: sequential, looping, and other patterns. The buffer cache is divided into three partitions to accommodate these three patterns. Each partition is managed by its own policy: MRU (Most Recently Used) is used for both sequential and looping files, and for the other files any scheme can be used.

Instead of choosing a block to evict, this algorithm chooses a victim partition, the partition from which a block will be evicted according to its policy. The victim partition is chosen to be the one with the lowest marginal gain, Hit(n) - Hit(n-1), where n is the partition size and Hit(x) is the expected hit rate for an x-sized partition. The sequential marginal gain is always 0, so if there are blocks in this partition they are chosen first. The marginal gain of the looping references depends on whether or not an entire loop fits in the cache. For LRU (used here for the other references), an approximation of Belady's lifetime function is used; a small sketch of the victim-partition choice follows.
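
Purely as an illustration of the marginal-gain comparison described above, here is a minimal sketch. The Hit(n) estimators are placeholder functions invented for this example, not the estimates UBM actually maintains [11].

    # Minimal sketch of UBM-style victim-partition selection.

    def choose_victim_partition(partitions):
        # partitions maps a partition name to (current size n, Hit function).
        def marginal_gain(size, hit):
            return hit(size) - hit(size - 1)
        # Only non-empty partitions can supply a victim.
        candidates = {name: e for name, e in partitions.items() if e[0] > 0}
        return min(candidates, key=lambda name: marginal_gain(*candidates[name]))

    partitions = {
        "sequential": (4, lambda n: 0.0),               # marginal gain always 0
        "looping": (16, lambda n: min(1.0, n / 64.0)),  # loop larger than cache
        "other": (12, lambda n: 1.0 - 0.9 ** n),        # stand-in for Belady's lifetime
    }
    print(choose_victim_partition(partitions))  # -> "sequential"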

2.3 Application oriented caching: Hint based

A further step is to rely on the application's hints passed to the cache management. This reduces the complexity of the algorithm, but limits its applicability: only applications that have been explicitly altered to manage the caching of their own blocks can benefit from such algorithms. LRU-SP [12] and TIP2 [13] are two such algorithms.

In LRU-SP, an interface allows applications to assign priorities to files (or, temporarily, to blocks in files) and to specify file cache replacement policies for each priority level. The kernel always replaces blocks with the lowest priority first. Applications like UNIX sort, the Dinero simulator, and the Postgres database were modified to include calls to these interface functions.

A different approach was taken in TIP2. Instead of dictating the replacement policy, the applications disclose information about their future requests. This enables the cache manager to make its own decisions, taking into account information about the state of the cache and the different applications running concurrently. The applications were modified to specify a file (name/descriptor) and an access pattern (sequential, or an explicit access string). The cache management scheme then balances caching and prefetching by computing the value of each block to its requesting application and evicting the globally least valuable block.

2.4 Passing information to the Storage Tier

The algorithms which differentiate between blocks based on the requesting application, or even on the block's file, have all been designed to work in the file system cache. The main reason for this is the lack of information about a block's attributes anywhere outside the file system. A way to pass such extra information from the file system to the I/O system has been suggested in [14], and has been implemented in the Linux kernel.

The Information Gap

When a process requests data from a file, the file system is responsible for locating that data on the device. When the request cannot be serviced from the file system's cache, an I/O request is passed through the storage tier to the device queue for the actual I/O to take place. The file system decides whether to generate a synchronous or an asynchronous I/O request. In the asynchronous case the user application is disconnected from the lower layers of the I/O stack, and keeps running while the I/O request is passed down to the storage device. When accessing a SCSI disk, the SCSI command is generated by the device driver only after the file system has placed the request. At this stage the file system data structures (which hold the information about the process, user, file, and application) are not always available.

Bridging the Gap

In some cases (mainly synchronous read operations), the application requesting the I/O would like to wait for the operation to complete before continuing. For this to happen, the buffer header is added to the request wait queue, and the application's process is kept in a Wait state. When the I/O completes, the process is released from this state and the application may continue running. While waiting in the queue for the completion of the I/O, the file system's data structures are available, meaning any I/O request that is handed down to the storage tier can also contain information such as the requesting application, user, or file. The goal is to make the process information available for all I/O requests, not just the synchronous ones. This can be done by adding the buffers to the queue without changing the process state, thus making the information accessible without forcing a delay in the application's normal run. Details about the implementation can be found in [14].
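
To make the idea concrete, here is a minimal, hypothetical sketch of tagging an I/O request with file-system-level attributes before it is handed to the storage tier, without putting the issuing process into a Wait state. The field and type names are illustrative assumptions, not the actual kernel structures used in [14].

    # Hypothetical sketch: attach application identification to an I/O request.

    from dataclasses import dataclass

    @dataclass
    class IORequest:
        block: int
        # Attributes normally lost below the file system:
        pid: int         # requesting process
        uid: int         # requesting user
        file_id: int     # file the block belongs to
        hint: int = 0    # spare byte for application hints (see Section 3.3)

    def submit_async(queue, block, process):
        # Tag the request while the file-system structures are still visible;
        # the process keeps running, so no delay is forced on the application.
        queue.append(IORequest(block, process["pid"], process["uid"],
                               process["file_id"]))

    queue = []
    submit_async(queue, 1234, {"pid": 7, "uid": 1000, "file_id": 42})
    print(queue[0])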

2.5 Multilevel

All the studies described above have dealt with a simple architecture, in which one cache, the file system's cache, sits between the application's request and the disk. A multilevel cache hierarchy consists of h levels of caches, C_h, ..., C_1, and forms a tree: each cache at level C_{i+1} is shared by one or more caches at level C_i, and a cache at level C_i is a direct child of only one cache at level C_{i+1}. This is a common architecture for CPU caches [15], but also for the network [16, 17] and storage [18, 19] domains. In those domains there are also configurations which are not a tree, in which many hosts access many caches. Several studies examine the effect of multiple levels of cache existing in the same system [16, 17, 18, 19], and a few replacement algorithms have been proposed for a second level cache, such as Multi-Q [20] and ARC [21].

Theoretical Discussion

Aggarwal et al. [22] introduced RLUP (the Relaxed List Update Problem) as a model for the management of hierarchical memory. In the list update problem (LUP), we wish to maintain a linear list in response to access requests. If the current list is x = x_1, x_2, ..., x_k, the cost of accessing item x_p is p. After being accessed, item x_p can be moved at no cost to any location earlier in the list. The relaxed list update problem (RLUP) is a variant in which the cost of accessing item x_p is c_p, where 0 <= c_1 <= ... <= c_k, and the requested item x_p can be repeatedly swapped with the immediately preceding item. The goal is to find an optimal service for a request sequence, where a service for x and ρ = r_1, r_2, ..., r_n is a sequence A = A_1, A_2, ..., A_n in which A_i is the sequence of swaps executed when r_i is issued. The cost of a service is the cost of accessing the requested elements.

Chrobak and Noga [23] suggest an analogy between hierarchical memory and an item list: if the multilevel cache system consists of m fully associative caches, where cache i has size s_i and access time f_i, it can be modeled as an RLUP instance by letting k = s_1 + ... + s_m and taking the cost function c_i = f_j, where j is the smallest index for which s_1 + ... + s_j >= i. They then propose an optimal offline algorithm to solve this problem. Given a request sequence ρ and an item x, first_ρ(x) is the index of the first occurrence of x in ρ (if x does not appear in ρ, we set first_ρ(x) = infinity).

Algorithm OPT: Suppose that the current list is x = x_1, x_2, ..., x_k and the remaining request sequence is ρ = x_p σ, that is, x_p is the first request in ρ. Pick j in {1, ..., p} that maximizes first_σ(x_j). If j = p, do not make any swaps on x_p, and apply OPT to x and σ. If j < p, execute swap(x_j, x_p) and apply OPT to the new list x and ρ. In other words, when serving x_p, OPT repeatedly swaps x_p with the items that precede x_p and are accessed the latest.
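
As a concrete reading of the OPT service just described, here is a minimal sketch. The initial list, costs, and request sequence are illustrative; swaps are treated as free, and each request is charged once, at the position where the item is found.

    # Minimal sketch of the Chrobak-Noga OPT service for RLUP.

    def next_request(requests, item, start):
        # Index of item's next occurrence at or after `start`, or infinity.
        try:
            return requests.index(item, start)
        except ValueError:
            return float("inf")

    def rlup_opt_cost(lst, costs, requests):
        lst, total = list(lst), 0
        for t, x in enumerate(requests):
            p = lst.index(x)
            total += costs[p]          # pay c_p for accessing x at position p
            while True:
                # Among x_1..x_p, pick the item accessed latest from now on.
                j = max(range(p + 1),
                        key=lambda i: next_request(requests, lst[i], t + 1))
                if j == p:             # x is already the latest-accessed one
                    break
                lst[j], lst[p] = lst[p], lst[j]   # free swap: promote x to j
                p = j
        return total

    # Two price ranges: the first two positions cost 1, the last two cost 10.
    print(rlup_opt_cost(["a", "b", "c", "d"], [1, 1, 10, 10],
                        ["c", "a", "c", "a", "d", "c"]))  # -> 24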

The DEMOTE Operation

Wong and Wilkes [24] define the notion of inclusive caching: most array caches use management policies which duplicate the same data blocks at both the client and disk array levels of the cache hierarchy, so these policies are inclusive. READ operations that miss in the client are then more likely to miss in the array as well and incur a disk access penalty. In exclusive caching, a data block is cached at either a client or the disk array, but not both. In order to achieve exclusive caching, they introduce the DEMOTE operation as an extension to the SCSI command set. It is similar to a WRITE: the array tries to put the demoted block into its re-read cache, ejecting another block if necessary to make space. The operation is short-circuited if a copy of the block is already in the array's cache. Under the assumption that SANs are much faster than disks, even though a DEMOTE may incur a SAN block transfer, performance gains are still possible.

3 Proposed Research

3.1 Optimal Offline solution

The 1-1 model

    Cache1
      |
    Cache2
      |
     Disk

There is one cache in the file system (cache1) of size i, and one cache in the disk (cache2) of size j. The cost of reading a page x is

    c(x) = F_cache                          if the page is in the file system's cache,
           (F_cache +) F_network            if the page is in the disk's cache,
           (F_cache + F_network +) F_disk   if the page is not in any cache,

where the parenthesized terms are the optional charges for passing through the upper levels. The operations are:

READ x - brings block x into cache i and evicts it from cache i+1 (cache 3 being the disk, which always holds the data). If x is not in cache i+1, READ it from cache i+2.

DEMOTE x - moves x from cache i to cache i+1 when x is evicted from cache i.
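
As a concrete illustration of this cost model, here is a minimal sketch with latencies chosen to satisfy assumption 1 below (F_cache << F_network << F_disk). The numeric values are illustrative, and whether the upper-level terms are charged is kept behind a flag, since the proposal debates exactly that point in Section 3.2.

    # Minimal sketch of the 1-1 model's read-cost function.

    F_CACHE, F_NETWORK, F_DISK = 0.001, 0.1, 10.0   # illustrative units (ms)

    def read_cost(x, cache1, cache2, charge_upper_levels=True):
        if x in cache1:
            return F_CACHE
        if x in cache2:
            return (F_CACHE if charge_upper_levels else 0.0) + F_NETWORK
        return (F_CACHE + F_NETWORK if charge_upper_levels else 0.0) + F_DISK

    print(read_cost(5, cache1={1, 2}, cache2={5, 6}))   # -> 0.101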

This translates to a list x = x_1, ..., x_i, x_{i+1}, ..., x_{i+j}, x_{i+j+1}, ..., x_k, where the cost of the item at position p is

    c(x_p) = F_cache     if 0 < p <= i,
             F_network   if i < p <= i + j,
             F_disk      if i + j < p.

This translation to an RLUP problem enables us to use the proposed OPT algorithm.

Assumptions

1. F_cache << F_network << F_disk.
2. We can read a block without actually bringing it into the cache (this can be simulated by allocating one extra buffer in each cache).
3. We are working offline, meaning the entire request sequence is known.
4. OPT for this model, where there are three price ranges, will only swap blocks between caches, so no swapping will occur between blocks in the same cache.

The proposed offline algorithm

On access to block x_p, the algorithm proceeds as follows (see the sketch at the end of this subsection):

- If x_p is in cache1 (the host), access it from there and perform no swaps.
- Else, if x_p is in cache2 (the storage controller), find x_j according to OPT [23].
  - If x_j is x_p, or if x_j is in cache2, read x_p from cache2 with no swapping (READ x_p and DEMOTE x_p).
  - Else, x_j is in cache1. DEMOTE x_j (moving it into cache2) and move x_p into cache1.
- Else, x_p is only on the disk. Find x_j according to OPT.
  - If x_j is x_p, or if x_j is only on disk, read x_p from the disk with no swapping (READ x_p without DEMOTE, so it is not saved in cache2).
  - Else, if x_j is in cache2, remove it from cache2 (cache2 performs DEMOTE x_j), save x_p in cache2, and repeat the case in which x_p is in cache2 (the first Else).
  - Else, x_j is in cache1. Remove it from cache1 (no DEMOTE, so it is not saved in cache2) and READ x_p into cache1.

This algorithm is based on the assumption that a DEMOTE operation can be considered a free swap, in which case the proof of the original algorithm holds. This assumption can be justified by two things:

- F_cache << F_network << F_disk (assumption 1).
- A DEMOTE from the disk cache to the disk is indeed free, since the disk still holds the data, and it does not need to be transferred back.
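
To make the case analysis above concrete, here is a minimal sketch of the decision logic. The function opt_choice stands in for the OPT selection of x_j [23] and is assumed to return a valid item; eviction bookkeeping and cost accounting are elided, so this is a sketch under the stated assumptions, not a full simulator.

    # Minimal sketch of the proposed offline algorithm's cases.

    def access(x_p, cache1, cache2, opt_choice):
        if x_p in cache1:                       # host hit: no swaps
            return "hit in cache1"
        x_j = opt_choice(x_p)
        if x_p in cache2:                       # storage-controller hit
            if x_j == x_p or x_j in cache2:
                return "read from cache2, no swap"   # READ x_p and DEMOTE x_p
            # x_j is in cache1: DEMOTE x_j into cache2, promote x_p.
            cache1.discard(x_j)
            cache2.add(x_j)
            cache2.discard(x_p)
            cache1.add(x_p)
            return "swapped x_p and x_j between cache1 and cache2"
        # x_p is only on disk.
        if x_j == x_p or (x_j not in cache1 and x_j not in cache2):
            return "read from disk, bypassing both caches"  # READ, no DEMOTE
        if x_j in cache2:
            cache2.discard(x_j)                 # DEMOTE x_j to disk is free
            cache2.add(x_p)
            return access(x_p, cache1, cache2, opt_choice)  # repeat cache2 case
        cache1.discard(x_j)                     # no DEMOTE: x_j is dropped
        cache1.add(x_p)
        return "read from disk into cache1"

    c1, c2 = {"a"}, {"b"}
    print(access("b", c1, c2, opt_choice=lambda x: "a"))  # swap case
    print(c1, c2)                                         # {'b'} {'a'}
    print(access("c", c1, c2, opt_choice=lambda x: "c"))  # bypass case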

In [24], the authors assume that a DEMOTE costs the same as a READ that hits in the array controller, and that clients demote a block for every READ. They therefore approximate the cost of demotions by doubling the latency of array hits (cache2). This assumption is later supported by simulations of varying network speeds.

The proposed algorithm can also be viewed as performing the MIN algorithm with two changes:

- When reading a block into a cache, instead of having to evict an existing block first, the new block can be discarded immediately after its use.
- When evicting a block into a lower cache, the DEMOTE operation is used only if the block should be saved into this lower level cache according to a MIN algorithm performed on that cache.

3.2 Proof

Various problems arise when trying to formally prove the optimality of this algorithm:

1. The original reason why the solution for RLUP is indeed optimal has to do with the interface of a linked list. When searching a list, after reaching a certain element, the locations of the previous elements are known, so the switch between them is indeed free. This is not necessarily the case for hierarchical memory; it depends on the definition of the price for each level.

2. A truly optimal algorithm must have no constraints. This means, for example, that it can know what all the caches hold without needing DEMOTE to communicate between them. This is not how we define our model.

3. Can we really read blocks while skipping the caches along the way? Any algorithm that uses buffers for its internal use effectively reduces the cache size, and therefore cannot be competitive and hence cannot be optimal.

4. What should the cost function be? If we set the price of each level to be the cost of physically accessing that level, then there is no reason to assume a switch is indeed free. We can say that when reading a block from level i, all the levels above it are accessed as well: the block must pass through all upper levels anyway, with or without updating those caches. Thus, inserting the block into one or more of the upper caches does not cost any additional time.

The problem is that we charge each read for the price of transferring it, so if we do not actually do the transfer (i.e., the switch), we pay for more than we have to. Alternatively, we can assume the price of each level is negligible compared to the level below it. In this case, only the accesses to the lowest level are of any interest, so all the caches above it (in our case, cache1 and cache2) can be treated as one. This makes the problem uninteresting: for most access patterns any algorithm will be as good as the optimal one, and for the remaining inputs the amortized cost will be very small.

Due to these problems, we will show that the proposed algorithm achieves better hit rates by simulation. We will use a simulator based on the one used in [21], implement several of the algorithms mentioned in the background section, and run the traces described below, to show that the proposed algorithm performs better than all of them.

3.3 Implementing an Approximation

Databases are good examples of applications where the access pattern can be easily predicted, making useful hints available to the cache [12, 13]. We will use the output of DB2's query optimizer to predict future accesses to the disk. Our traces will be extracted from running a TPC benchmark on DB2. We will then implement a mechanism based on the one described in [11]: the size (or scale) of each loop will be specified by the application (i.e., DB2 or an underlying file system). Sequential accesses can be marked as non-cacheable, and the rest of the accesses may contain either an approximation of the next access or no information (default partition). For example, assuming there is one byte for passing information about each block, possible values for that byte would be:

    Value    Meaning
    0        Default caching algorithm
    1        Very short loop (= high priority)
    2-254    Possible loop sizes (or loop size ranges)
    255      Infinite loop (= low priority, non-cacheable)

Assuming this is the best approximation of the optimal algorithm for a single cache level, the optimality for two levels of cache holds since:

- The host performs the algorithm as is, but also uses DEMOTE.
- No DEMOTE is used when discarding non-cacheable blocks, thus preventing them from ever being cached at the disk controller (note that they are discarded after being passed to the host).
- Short loops will be cached in the host, while longer loops will be cached at the disk. This is guaranteed because the short loop references will be discarded after being passed to the host, and the longer ones will be accessed from the disk.
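
As an illustration of the one-byte hint in the table above, here is a minimal sketch of encoding and ranking hints. The bucketing of loop sizes into byte values 2-254 is an assumption of this sketch; the proposal leaves the exact ranges open.

    # Minimal sketch of the one-byte application hint.

    DEFAULT, INFINITE = 0, 255

    def encode_hint(loop_size):
        """Map a detected loop size (in blocks) to a one-byte hint."""
        if loop_size is None:
            return DEFAULT                  # no information: default partition
        if loop_size == float("inf"):
            return INFINITE                 # sequential scan: non-cacheable
        # Clamp loop sizes into the 254 available buckets (assumed scheme).
        return int(max(1, min(254, loop_size)))

    def priority(hint):
        """Smaller loops get higher caching priority."""
        if hint == DEFAULT:
            return None                     # defer to the default algorithm
        return 256 - hint                   # 1 -> 255 (highest), 255 -> 1 (lowest)

    print(encode_hint(3), encode_hint(float("inf")), priority(3))  # 3 255 253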

3.4 Multiple hosts

We will then repeat the process for the case of more than one cache in each level of the hierarchy. We will propose an optimal offline algorithm for this broader case and use simulation to show that it is better than known algorithms. We will then implement an online algorithm imitating it.

The 2-1 Model

    Cache1   Cache2
         \   /
        Cache3
          |
         Disk

There are two caches in the file system (cache1 and cache2) for two hosts, and one cache in the disk controller (cache3). The costs of access and assumptions 1-3 from the 1-1 model hold here as well (F_cache and F_network are the same for both hosts). Since cache1 and cache2 are not fully associative (we do not wish to store blocks in cache1 that host1 does not demand), we cannot swap blocks between cache1 and cache2. Furthermore, we do not wish to evict blocks from cache3 as soon as they are read by one host, since the other host may request them (this has also been suggested in [24]). There are some access patterns for which partially inclusive caching is preferable. For this reason, the optimality of the algorithm suggested for the 1-1 model does not hold for this model.

Based on the MIN view of the algorithm for the 1-1 case, we propose an algorithm for the 2-1 case. Cache1 and cache2 both perform MIN on their blocks, calling DEMOTE when evicting blocks from the cache. Cache3 will store block x (when READ x or DEMOTE x is performed) if (see the sketch below):

- there is in the future a READ x operation, without a DEMOTE x preceding it, and either
  - there is a block y in cache3 for which there is in the future a READ y operation with a DEMOTE y operation preceding it (save x instead of y in cache3), or
  - there is no block y in cache3 with a DEMOTE y operation in the future (remove some block according to MIN, and save x in cache3).
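
Here is a minimal sketch of this admission rule, in the offline setting where the merged future stream of (op, block) events is known. The second branch is slightly simplified (any block whose next event is not a READ is treated as replaceable), and the MIN eviction in the final branch is elided; all names are illustrative.

    # Minimal sketch of the cache3 admission rule.

    def next_read_precedes_demote(block, future):
        # True iff block's next event is a READ (no DEMOTE of it comes first).
        for op, b in future:
            if b == block:
                return op == "READ"
        return False

    def admit_to_cache3(x, cache3, future):
        if not next_read_precedes_demote(x, future):
            return False               # x is demoted (or never read) first
        for y in list(cache3):
            if not next_read_precedes_demote(y, future):
                # y's next event is a DEMOTE (or y is dead): replace it with x.
                cache3.discard(y)
                cache3.add(x)
                return True
        cache3.add(x)                  # no such y: fall back to MIN eviction
        return True

    future = [("DEMOTE", "y"), ("READ", "x"), ("READ", "y")]
    cache3 = {"y"}
    print(admit_to_cache3("x", cache3, future), cache3)  # True {'x'}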

Showing the optimality of this model will first of all involve working with a few instances of the database simultaneously (in order to generate two independent traces). It will also involve modifying the simulator to include a notion of time, in order to correctly merge the two inputs from the two different hosts and to create a correct input for the lower level cache.

The implementation of an online solution will then involve dealing with the frequency of requests from each host. When there is more than one host, the blocks from the different hosts are merged into the same three partitions in the disk. The information passed down by the hosts might have to be adjusted according to each host's frequency of accesses. For example, if host1 makes 4 requests for each request from host2, and assuming they store the same loop length in the disk cache, host1's blocks should be assigned an effectively shorter loop size, as in the sketch below.
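
One way the adjustment could be modeled is shown below; the scaling rule is an assumption of this sketch, not something the proposal fixes. The idea is that what matters at the shared cache is a loop's reuse distance measured in merged requests: a host issuing a larger share of the merged stream cycles through its loop sooner.

    # Minimal sketch of frequency-adjusted loop hints.

    def effective_loop_size(loop_size, host_rate, total_rate):
        # Reuse distance of the loop in the merged request stream.
        return loop_size * total_rate / host_rate

    # host1 makes 4 requests for each request from host2 (rates 4 and 1),
    # and both report the same loop length of 100 blocks:
    print(effective_loop_size(100, 4, 5))  # 125.0 for host1 (higher priority)
    print(effective_loop_size(100, 1, 5))  # 500.0 for host2 (lower priority)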

References

[1] M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite Network File System. ACM Transactions on Computer Systems, 6(1), February 1988.

[2] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, vol. 5, no. 2, 1966.

[3] A. Fiat, R. Karp, M. Luby, L. A. McGeoch, D. Sleator, and N. E. Young. Competitive paging algorithms. Journal of Algorithms, 12, 1991.

[4] Peter J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5), May 1968.

[5] E. O'Neil, P. O'Neil, and G. Weikum. The LRU-K page replacement algorithm for database disk buffering. In Proc. ACM SIGMOD International Conference on Management of Data, 1993.

[6] Theodore Johnson and Dennis Shasha. 2Q: a low overhead high performance buffer management replacement algorithm. In Proc. of the Twentieth International Conference on Very Large Databases, Santiago, Chile, 1994.

[7] D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. On the Existence of a Spectrum of Policies that Subsumes the Least Recently Used (LRU) and Least Frequently Used (LFU) Policies. In Proceedings of the 1999 ACM SIGMETRICS Conference, 1999.

[8] S. Jiang and X. Zhang. LIRS: An efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proc. of SIGMETRICS 2002.

[9] Jongmoo Choi, Sam H. Noh, Sang Lyul Min, and Yookun Cho. An Implementation Study of a Detection-Based Adaptive Block Replacement Scheme. In USENIX Annual Technical Conference, Monterey, CA, June 1999.

[10] J. Choi, S. H. Noh, S. L. Min, and Y. Cho. Towards Application/File-Level Characterization of Block References: A Case for Fine-Grained Buffer Management. In Measurement and Modeling of Computer Systems, June 2000.

[11] J. M. Kim, J. Choi, J. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. A Low-Overhead High-Performance Unified Buffer Management Scheme that Exploits Sequential and Looping References. In Proc. of the 4th USENIX Symposium on Operating Systems Design and Implementation, October 2000.

[12] P. Cao, E. W. Felten, and K. Li. Implementation and Performance of Application-Controlled File Caching. In Proc. of the First Symposium on Operating Systems Design and Implementation, USENIX Association, November 1994.

[13] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 79-95, December 1995.

[14] Tsipora Barzilai and Gala Golan. Accessing Application Identification Information in the Storage Tier. Disclosure IL, IBM Haifa Labs, November 26.

[15] J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache hierarchies. In The 15th Annual International Symposium on Computer Architecture, Honolulu, Hawaii, 1988.

[16] P. Rodriguez, C. Spanner, and E. W. Biersack. Web caching architectures: hierarchical and distributed caching. In Proceedings of WCW'99.

[17] D. L. Willick, D. L. Eager, and R. B. Bunt. Disk Cache Replacement Policies for Network Fileservers. In Proceedings of the 13th International Conference on Distributed Computing Systems, May 1993.

[18] D. Muntz and P. Honeyman. Multi-level Caching in Distributed File Systems, or: your cache ain't nuthin' but trash. In Proceedings of the USENIX Winter 1992 Technical Conference, Berkeley, CA, USA, January 1992. USENIX Association.

[19] S. J. Lee and C. W. Chung. VLRU: Buffer Management in Client-Server Systems. IEICE Transactions on Information and Systems, Vol. E83-D, No. 6, June 2000.

[20] Y. Zhou and J. Philbin. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches. In Proc. USENIX Annual Technical Conference (USENIX 2001), Boston, MA, June 2001.

[21] Nimrod Megiddo and Dharmendra S. Modha. ARC: A Self-Tuning, Low Overhead Replacement Cache. In USENIX File and Storage Technologies Conference (FAST), San Francisco, CA, March 2003.

[22] A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A Model for Hierarchical Memory. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing (STOC), New York City, May 1987.

[23] Marek Chrobak and John Noga. Competitive algorithms for multilevel caching and relaxed list update. In Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms, pages 87-96, 1998.

[24] Theodore M. Wong and John Wilkes. My cache or yours? Making storage more exclusive. In Proc. of the USENIX Annual Technical Conference, 2002.
