A Novel Buffer Management Scheme for SSD


Qingsong Wei (Data Storage Institute, A-STAR, Singapore), Bozhao Gong (National University of Singapore), Cheng Chen (Data Storage Institute, A-STAR, Singapore)

Abstract — Random writes significantly limit the application of flash memory in enterprise environments due to their poor latency, negative impact on lifetime, and high garbage collection overhead. Several buffer management schemes for flash memory have been proposed to overcome this issue, operating either at page granularity or at block granularity. Traditional page-based buffer management schemes leverage temporal locality to pursue cache hit ratio improvement without considering the sequentiality of flushed data. Current block-based buffer management schemes exploit spatial locality to improve the sequentiality of the write accesses passed to the flash memory, at the cost of low buffer utilization. None of them achieves both a high cache hit ratio and good sequentiality at the same time, which are the two critical factors determining the efficiency of buffer management for flash memory. In this paper, we propose a novel hybrid buffer management scheme that divides the buffer space into a page region and a block region to make full use of both the temporal and the spatial locality among accesses. The scheme dynamically balances our two design objectives of cache hit ratio and sequentiality for different workloads. It efficiently improves performance and shapes the I/O requests so that more sequential accesses are passed to the flash memory. The scheme has been extensively evaluated under various enterprise workloads. Our benchmark results conclusively demonstrate that it achieves up to 84% performance improvement and 85% garbage collection overhead reduction compared to existing buffer management schemes. Keywords — flash memory; buffer management; hybrid; cache hit ratio; write sequentiality I. 
INTRODUCTION Flash memory is rapidly becoming a promising technology for next-generation storage due to a number of strong technical merits, including (i) low access latency, (ii) low power consumption, (iii) higher resistance to shocks, (iv) light weight, and (v) better endurance. As an emerging technology, flash memory has received strong interest in both academia and industry [4-7]. Flash memory has traditionally been used in portable devices. More recently, as prices drop and capacities increase, this technology has made huge strides into the personal computer and server storage space in the form of the Solid State Drive (SSD), with the intention of replacing traditional hard disk drives (HDDs). In fact, two leading on-line search engine service providers, google.com and baidu.com, have both announced plans to migrate their existing hard disk based storage systems to platforms built on SSDs [3]. However, SSDs suffer from a random write problem when applied in enterprise environments. Firstly, write performance on an SSD is highly correlated with the access pattern. The electrical properties of flash cells result in random writes being much slower than sequential writes. The performance optimizations inside an SSD, including striping, interleaving and prefetching, are no longer effective for random writes because less sequential locality is left for them to exploit. Secondly, NAND flash memory can endure only a finite number of erases for a given physical block due to the nature of the technology. Therefore, increased erase operations due to random writes shorten the lifetime of an SSD. Experiments in [7] show that a random write intensive workload can make flash memory wear out over a hundred times faster than a sequential write intensive workload. Finally, random writes result in higher garbage collection overhead than sequential writes. 
If the incoming writes are randomly distributed over the logical block address space, sooner or later all physical flash memory blocks will become fragmented, which has a significant impact on garbage collection and performance. Therefore, random write is a critical problem for both performance and lifetime, and it restricts SSDs' widespread acceptance in datacenters [5]. An SSD controller uses a part of its memory as a buffer to overcome this issue. In current practice, there are several major efforts in buffer management, which can be classified into either page-based or block-based buffer management schemes. Most of them rely on the existence of locality in access patterns, either temporal locality or spatial locality. However, none of them is able to achieve a high cache hit ratio and good sequentiality simultaneously, which are the two critical factors determining the efficiency of buffer management for flash memory. Page-based buffer management schemes adopt cache hit ratio improvement as their sole objective by exploiting the temporal locality of data accesses. A large number of disk-based algorithms have been proposed, such as LRU, CLOCK [6], WOW [7], Q [9] and ARC [8]. All these algorithms focus only on how to better utilize temporal locality, so that they can better predict the pages to be accessed and try to minimize the page fault rate []. However, direct application of these algorithms is inappropriate for SSDs because spatial locality is unfortunately ignored. Block-based buffer management schemes such as [3], [], and LB-CLOCK [3] exploit spatial locality to change the access pattern and provide more sequential writes for flash memory. Resident pages in the cache are grouped on
the basis of their logical block associations. When a logical page cached in the RAM buffer is accessed, all pages in the same block are placed at the head of the list, based on the assumption that all pages in this block have the same recency. Because hundreds of pages exist in an erase block, this assumption may not hold true, especially for random dominant workloads. Because a block's position in the list is determined by the temporal locality of partial page accesses, most pages in the block may not be accessed in the near future, which pollutes the buffer space. To make free space in the buffer, all pages in the victim block are removed simultaneously. However, some pages in the victim block may be accessed in the near future. This strategy trades temporal locality for spatial locality, which results in low buffer space utilization and a low cache hit ratio. Since buffer size is very critical to the energy consumption and cost of an SSD, it is important to improve buffer space utilization. In this paper, we propose a novel hybrid buffer management scheme that adopts a high cache hit ratio and good sequentiality as its two objectives by exploiting both the temporal and the spatial locality among access patterns. The scheme divides the buffer space into a page region and a block region and manages them in hybrid form. In the page region, buffer data is managed and sorted at page granularity to improve buffer space utilization, while the block region manages data at block granularity. To achieve both objectives, we give preference to random access pages for staying in the page region, while sequential access pages in the block region are replaced first. Buffer data in the page region is dynamically migrated to the block region if the number of pages in a block reaches a threshold, which is adaptive to different workloads. 
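The page-to-block association that both regions rely on can be sketched as follows. This is a minimal illustrative model, not the paper's implementation; the function names and the 4-pages-per-block geometry (matching the example in Table I) are our own assumptions.

```python
# Illustrative sketch: map logical page numbers (LPNs) to erase-block numbers
# and group a reference stream by erase block. Assumed geometry: 4 pages/block.
PAGES_PER_BLOCK = 4

def block_of(lpn: int) -> int:
    """Erase block that logical page `lpn` belongs to."""
    return lpn // PAGES_PER_BLOCK

def group_by_block(lpns):
    """Group a reference stream by erase block, preserving access order."""
    groups = {}
    for lpn in lpns:
        groups.setdefault(block_of(lpn), []).append(lpn)
    return groups
```

A fully populated group (all pages of one erase block cached) is exactly the kind of sequential block the scheme prefers to keep in the block region.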
Leveraging hybrid management and dynamic migration, our scheme not only improves performance and extends SSD lifetime, but also significantly reduces the internal fragmentation and garbage collection overhead associated with random writes. The rest of this paper is organized as follows. Section II provides an overview of the background and our motivation. In Section III, we present the details of our hybrid buffer management scheme. Evaluation and measurement results are presented in Section IV. In Section V, we give a brief study of related work in the literature; conclusions and possible future work are summarized in Section VI. II. BACKGROUND AND MOTIVATION In this section, we present three basic concepts that are essential to our work, and our motivation. A. Flash Memory Technology There are two types of flash memory, NOR and NAND [6]. NOR flash memory supports random accesses in bits and is mainly used for storing code. NAND flash memory is designed for data storage with denser capacity and only allows access in units of sectors. Most SSDs available on the market are based on NAND flash memory. In this paper, flash memory refers to NAND flash memory specifically. NAND flash memory can be classified into two categories, Single-Level Cell (SLC) and Multi-Level Cell (MLC) NAND. An SLC flash memory cell stores only one bit, while an MLC flash memory cell can store two bits or even more. For both SLC and MLC NAND, a flash memory package is composed of one or more dies (chips). Each die within a package contains multiple planes. A typical plane consists of thousands of blocks and one or two registers of the page size as an I/O buffer. Each block in turn consists of 64 to 8 pages. Each page has a KB or 4KB data area and a metadata area (e.g. 8 bytes) for storing identification, page state and Error Correcting Code (ECC) information. Flash memory supports three major operations: read, write, and erase. Read and write are performed in units of pages. 
A unique requirement of flash memory is that flash blocks must be erased before they can be reused, and the erase operation must be conducted at block granularity [7]. In addition, each block can be erased only a finite number of times. A typical MLC flash memory endures roughly an order of magnitude fewer erase cycles than an SLC flash memory. After wearing out, flash memory cells can no longer store data. B. SSD SSDs use flash memory as their storage medium. The Flash Translation Layer (FTL) [8,9], a critical firmware component implemented in the SSD controller, allows operating systems to access flash memory devices in the same way as conventional disk drives. The FTL plays a key role in the SSD, and many sophisticated mechanisms are adopted to optimize SSD performance. It provides address mapping, wear leveling and garbage collection. Logical Block Mapping. Generally, FTL schemes can be classified into three groups depending on the granularity of address mapping: page-level, block-level, and hybrid-level FTL schemes [9]. In the page-level FTL scheme, a logical page number (LPN) can be mapped to any physical page number (PPN) in flash memory. This mapping approach is efficient and shows great garbage collection efficiency, but it requires a large amount of RAM space to store the mapping table. On the other hand, block-level FTL is space efficient but requires an expensive read-modify-write operation when writing only part of a block. To overcome these disadvantages, the hybrid-level FTL scheme was proposed. A hybrid-level FTL uses block-level mapping to manage most data blocks and uses page-level mapping to manage a small set of log blocks, which work as a buffer to accept incoming write requests []. Hybrid-level FTLs show high garbage collection efficiency and require a small-sized mapping table. However, they incur expensive full merges for random write dominant workloads. 
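The page-level mapping and out-of-place update described above can be sketched as follows. This is a minimal illustrative model under our own assumptions (the class name and the naive append-only allocator are not from the paper): each write goes to a fresh physical page and the previous copy, if any, is marked invalid for later garbage collection.

```python
# Illustrative sketch of a page-level FTL: LPN -> PPN mapping with
# out-of-place updates. The append-only allocator is a simplification.
class PageLevelFTL:
    def __init__(self):
        self.mapping = {}     # LPN -> PPN
        self.invalid = set()  # physical pages holding stale data
        self.next_free = 0    # naive append-only page allocator

    def write(self, lpn):
        """Write `lpn` out-of-place; return the new physical page number."""
        if lpn in self.mapping:
            self.invalid.add(self.mapping[lpn])  # old copy becomes stale
        ppn = self.next_free
        self.next_free += 1
        self.mapping[lpn] = ppn
        return ppn

    def read(self, lpn):
        """Return the physical page currently holding `lpn`, if cached."""
        return self.mapping.get(lpn)
```

The growing `invalid` set is what garbage collection later reclaims, block by block.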
Garbage Collection. Since data in flash memory cannot be updated in place, the FTL simply writes the data to another clean page and marks the previous page as invalid. When running out of clean blocks, a garbage collection module scans flash memory blocks and recycles invalidated pages. If page-level mapping is used, the valid pages in the scanned block are copied out and condensed into a new block. For block-level and hybrid-level mappings, the valid pages need to be merged together with the updated pages in the same block. Wear Leveling. Due to the locality in most workloads, writes are often performed over a subset of blocks. Thus some flash memory blocks may be frequently overwritten and tend to
wear out earlier than other blocks. FTLs usually employ a wear leveling algorithm to ensure that equal use is made of all the available write cycles for each block [3]. C. Buffer Management in SSD Many SSD controllers use a part of RAM as a read buffer or write buffer. Different buffer cache management policies have been proposed to improve performance and extend the lifetime of flash memory. One problem of SSDs is that background garbage collection and wear leveling compete for internal resources with foreground user accesses. If most foreground user accesses hit in the buffer cache, this mutual interference is significantly reduced. A high cache hit ratio can significantly reduce direct accesses to flash memory, which helps to achieve low latency for foreground user accesses and saves resources for background tasks. On the other hand, the sequentiality of the write accesses passed to flash memory is critical because random writes have the following negative impacts on an SSD. 1) Shortened SSD Lifetime For an SSD, the more random the writes are, the more erase operations are needed. Due to the nature of the technology, NAND flash memory can incur only a finite number of erases for a given physical block. Therefore, increased erase operations due to random writes make flash storage wear out much faster than under sequential writes. 2) High Garbage Collection Overhead Random writes result in higher garbage collection overhead than sequential writes. For an SSD adopting a hybrid FTL, the more random the writes are, the more merge operations [] are needed. In the worst case, each individual page in a log block would belong to a different mapping unit and correspondingly need an expensive full merge operation []. In addition, random write operations are most likely to trigger garbage collection. These internal operations running in the background may compete for resources with incoming foreground requests and cause increased latency. 
3) Internal Fragmentation Flash memory does not support in-place updates. Therefore, if the incoming writes are randomly distributed over the logical block address space, sooner or later every physical flash memory block may contain an invalid page, which is called internal fragmentation [6]. Such internal fragmentation has a significant impact on garbage collection and performance. First, the cleaning efficiency drastically drops. Second, after fragmentation, each write becomes excessively expensive, and the bandwidth of sequential writes collapses to well below that of a regular laptop disk [7]. Finally, the prefetching mechanism inside the SSD is no longer effective, since logically continuous pages are not physically continuous. This causes the bandwidth of sequential reads to drop close to that of random reads. 4) Little Chance for Performance Optimization SSDs leverage striping and interleaving to improve performance based on sequential locality [5,7]. If a write is sequential, the data can be striped and written across different dies or planes in parallel. Interleaving is used to hide the latency of costly operations. A single multi-page read or write can be efficiently interleaved, while multiple single-page reads or writes can only be performed separately. While the above optimizations can dramatically improve performance for workloads with more sequential locality, their ability to deal with random writes is very limited because less sequential locality is left for them to exploit. Therefore, cache hit ratio and sequentiality are both critical factors determining the efficiency of buffer management for flash memory. D. Motivation The typical workload in an enterprise system is a mixture of random and sequential accesses, which expose temporal locality and spatial locality respectively. 
Both page-based and block-based buffer management schemes fail to utilize both the temporal and the spatial locality present in enterprise workloads to improve cache hit ratio and sequentiality for SSDs. To illustrate the limitations of current buffer management schemes, let us consider an example reference stream mixed with sequential and random accesses, shown in Table I. In this example, page-based LRU achieves 6 hits, more than block-based LRU, while block-based LRU produces one more sequential flush than page-based LRU. Hybrid LRU achieves 3 cache hits and one sequential flush, combining the advantages of both page-based LRU and block-based LRU. Since the buffer is positioned at a level higher than the flash memory and receives I/O requests directly from the host, we are motivated to design a novel buffer management scheme that fully utilizes both temporal and spatial localities to achieve a high cache hit ratio and good sequentiality for SSDs. III. HYBRID BUFFER MANAGEMENT Since both cache hit ratio and sequentiality affect the efficiency of buffer management in terms of performance, garbage collection overhead and lifetime, we adopt improvement of both as our design objectives. Our rationale is that one objective cannot be achieved at the cost of sacrificing the other. To this end, we propose a novel buffer management scheme that exploits both pages and blocks to manage the buffer space in hybrid form. By utilizing both the temporal and the spatial locality among accesses, our scheme pursues a high cache hit ratio and good sequentiality for different enterprise workloads. A. Hybrid Management Several previous studies [,] claimed that request frequency and file size are inversely correlated, i.e. the most popular files are typically small, while large files are relatively unpopular. Most files are small and most file accesses are small; e.g., [4] reports that 8% of file accesses
are to files of less than KB, and the locality type of each request is deeply related to its size. Random accesses are small and popular, and thus have high temporal locality. Page-based buffer management is good at exploiting temporal locality to achieve a high cache hit ratio. Sequential accesses are large and unpopular, and have high spatial locality. Block-based buffer management can effectively exploit spatial locality. To fully utilize both the temporal and the spatial locality in enterprise workloads, we divide the buffer space into a page region and a block region, as shown in Figure 1. In the page region, buffer data is managed and sorted at page granularity to improve buffer space utilization. The block region operates at the logical block granularity, where a logical block has the same size as the erasable block in the NAND flash memory. A page is either in the page region or in the block region. Both regions serve incoming requests. 

TABLE I. COMPARISON OF PAGE-LEVEL LRU, BLOCK-LEVEL LRU AND HYBRID LRU. CACHE SIZE IS 8 PAGES AND AN ERASE BLOCK CONTAINS 4 PAGES. HYBRID LRU MAINTAINS THE BUFFER AT PAGE AND BLOCK GRANULARITY. ONLY FULL BLOCKS ARE MANAGED AT BLOCK GRANULARITY AND SELECTED AS VICTIMS. IN THIS EXAMPLE, WE USE [] TO DENOTE BLOCK BOUNDARIES.

Access  | Page-Level LRU: Cache(8) / Flush / Hit? | Block-Level LRU: Cache(8) / Flush / Hit? | Hybrid LRU: Cache(8) / Flush / Hit?
,,,3    | 3,,, — Miss                             | [,,,3] — Miss                            | [,,,3] — Miss
5,9,,4  | 4,,9,5,3,,, — Miss                      | [4],[9,],[5],[,,,3] — Miss               | 4,,9,5,[,,,3] — Miss
7       | 7,4,,9,5,3,, — Miss                     | [5,7],[4],[9,],[,,,3] — Miss             | 7,4,,9,5,[,,,3] — Miss
3       | 3,7,4,,9,5,, — Hit                      | [3],[5,7],[4],[9,] — Miss                | 3,7,4,,9,5 — Miss
        | ,3,7,4,9,5,, — Hit                      | [9,],[3],[5,7],[4] — Hit                 | ,3,7,4,9,5 — Hit
        | ,,3,7,4,9,5, — Hit                      | [,3],[9,],[5,7],[4] — Miss               | ,,3,7,4,9,5 — Miss
4       | 4,,,3,7,9,5, — Hit                      | [4],[,3],[9,],[5,7] — Hit                | 4,,,3,7,9,5 — Hit
        | ,4,,,3,7,9,5 — Hit                      | [,,3],[4],[9,],[5,7] — Miss              | ,4,,,3,7,9,5 — Miss
        | ,,4,,,3,7,9 / 5 — Miss                  | [9,,],[,,3],[4] / [5,7] — Miss           | ,,4,,,3,7,9 / 5 — Miss
7       | 7,,,4,,,3,9 — Hit                       | [7],[9,,],[,,3],[4] — Miss               | 7,,,4,,,3,9 — Hit
Sequential flush / Cache hit: 6 3
Pages in the page region are organized as a page LRU list. When a page cached in the page region is accessed (read or write), only that page is placed at the head of the page LRU list. Blocks in the block region are organized and sorted on the basis of block popularity. Block popularity is defined as the block access frequency, including reads and writes of any pages of the block. When a logical page of a block is accessed (including a read miss), we increase the block popularity by one. Sequentially accessing multiple pages of a block is treated as one block access instead of multiple accesses. Thus, a block with sequential accesses will have a low popularity value, while a block with random accesses has a high popularity value. Blocks with the same popularity are sorted by the number of pages in the block. Therefore, the temporal locality among random accesses and the spatial locality among sequential accesses can be fully exploited by page-based and block-based buffer management respectively. 

Figure 1. Hybrid Buffer Management. Buffer space is divided into two regions, a page region and a block region. In the page region, buffer data is managed and sorted at page granularity (page LRU list), while the block region manages data at block granularity (block popularity list). Blocks in the block region are selected as victims for replacement. 

B. Servicing Both Read and Write Operations In real applications, read and write accesses are mixed. Usage patterns exhibit block-level temporal locality: the pages in the same logical block are likely to be accessed (read or write) again in the near future. Separately servicing read and write accesses in different buffer spaces may destroy the original locality present in the access sequences. Moreover, servicing reads helps to reduce the load on the flash data channel, which is shared by both read and write operations [3]. 
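The block-popularity rule above can be sketched as follows. This is an illustrative reading of the rule, under our own assumptions: an access to a page bumps its block's popularity by one, except that a sequential run (consecutive page numbers inside the same block) is collapsed into a single block access. Function names and the 4-pages-per-block geometry are not from the paper.

```python
# Illustrative sketch of the block-popularity counter: sequential runs within
# one erase block count as a single block access. Assumed: 4 pages/block.
PAGES_PER_BLOCK = 4

def popularity(lpns):
    """Return a dict mapping block number -> popularity for an access stream."""
    pop = {}
    prev = None
    for lpn in lpns:
        blk = lpn // PAGES_PER_BLOCK
        sequential = (prev is not None
                      and lpn == prev + 1
                      and blk == prev // PAGES_PER_BLOCK)
        if not sequential:  # a new block access starts here
            pop[blk] = pop.get(blk, 0) + 1
        prev = lpn
    return pop
```

As the text argues, a sequentially accessed block ends up with a low popularity value, while a randomly re-accessed block accumulates a high one.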
By servicing foreground read operations, the flash data channel's bandwidth can also be saved for the background garbage collection task, which helps to reduce their interference with each other. The scheme in [] is designed only as a write buffer. It uses page padding to improve the sequentiality of flushed data at the cost of additional reads, which impacts overall performance. For random dominant workloads, it needs to read a large number of additional pages, as can be seen in our experiments. Unlike that scheme, we leverage the block-level temporal locality among read and write accesses to naturally form sequential blocks. Our scheme treats reads and writes as a whole to make full use of the locality of accesses, and groups both dirty and clean pages belonging to the same erase block into a logical block in the block region.

For read requests, our scheme attempts to fetch data from the page region and the block region. If the read request does not hit in the buffer, the scheme fetches the data from flash memory, and a copy of the data is placed in the buffer as a reference for future requests. Upon the arrival of a write request, the scheme places it into the buffer cache instead of synchronously writing it into flash memory. For both read-miss data and written data, placement into either the page region or the block region works as follows. If the corresponding block exists in the block region, the data is placed in the block region and the block list is reordered with the updated popularity. Otherwise, the data is placed at the beginning of the page region and the Block B+tree is updated, which will be discussed in subsection D. C. Replacement Policy This paper views the negative impacts of random writes on SSD lifetime and performance as a penalty. The cost of a sequential write miss is much lower than that of a random write. Popular data will be frequently updated. When replacement happens, unpopular data should be replaced instead of popular data. Keeping popular data in the buffer as long as possible minimizes the penalty. For this purpose, we give preference to random access pages for staying in the page region, while sequential access pages in the block region are replaced first. The least popular block in the block region is selected as the victim. If more than one block has the same least popularity, the block having the largest number of cached pages is selected as the victim. Once a block is selected as the victim, there are two cases to deal with. (i) If there are dirty pages in the block, both the dirty pages and the clean pages of the block are sequentially flushed into flash memory. This policy guarantees that logically continuous pages are physically placed onto continuous pages, so as to avoid internal fragmentation. 
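The victim-selection rule above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the per-block metadata layout (a popularity / page-count pair) is our own assumption.

```python
# Illustrative sketch of victim selection: evict the least popular block;
# among equally unpopular blocks, prefer the one caching the most pages.
def select_victim(blocks):
    """`blocks` maps block number -> (popularity, cached page count).
    Returns the block to evict: lowest popularity first, ties broken
    by the largest number of cached pages."""
    return min(blocks, key=lambda b: (blocks[b][0], -blocks[b][1]))
```

The tie-break toward the fullest block maximizes the length of the sequential flush produced by each eviction.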
By contrast, the scheme in [3] flushes only the dirty pages in the victim block and discards all the clean pages without considering the sequentiality of the flushed data. (ii) If there are no dirty pages in the block, all the clean pages of the block are discarded. Only when the block region is empty do we select the least recently used page in the page region as the victim. The pages belonging to the same block as this victim page are replaced and flushed together. This policy tries to avoid single-page flushes, which have a high impact on garbage collection and internal fragmentation. Through the filtering effect of the cache, we shape the I/O requests from the host so that more sequential page requests and fewer random page requests are passed to the flash memory. The flash memory is then able to process the requests, with their stronger spatial locality, more efficiently. D. Threshold-based Migration Buffer data in the page region is migrated to the block region if the number of pages in a block reaches the threshold, as shown in Figure 2. How the threshold value is determined will be discussed in subsection E. With the filtering effect of the threshold, random pages stay in the page region, while sequential blocks reside in the block region. Therefore, the temporal locality among random pages and the spatial locality among sequential blocks can be fully utilized in the hybrid buffer management. 

Figure 2. Threshold-based Migration. Buffer data in the page region is migrated to the block region only if the number of pages in a block reaches the threshold. Grey boxes denote that a block is found and migrated to the block region. An erase block consists of 4 pages. 

To implement threshold-based migration, we use a Block B+tree to assemble sequential blocks in the page region, as shown in Figure 3. The Block B+tree uses the block number as its key. 
A data structure called a block node is introduced to describe a block in terms of its popularity, the number of pages in the block (both clean and dirty), and a pointer array. The pointers in the array point to the pages belonging to the block. Upon the arrival of an access in the page region, we proceed as follows. We use the logical page number to calculate the corresponding block number. Then we search the Block B+tree using the block number. If the block exists in the Block B+tree, we update the block node, including the block popularity, the number of pages and the pointers. If the number of pages in the block reaches the threshold, all the pages in the block are migrated to the block region; the block node is deleted and the B+tree is reconstructed. The migrated block is inserted into the corresponding location of the block list in the block region according to its popularity. If the access is new, we allocate a new block node and add it into the Block B+tree. The functions of the Block B+tree are as follows. Firstly, it manages the blocks for migration. Secondly, if we have to replace pages from the page region to free space, it is used to quickly find the block to which the victim page belongs.
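The migration logic above can be sketched as follows. For brevity this illustrative sketch uses a plain dictionary as the per-block index standing in for the paper's Block B+tree; the class and method names and the geometry are our own assumptions, not the paper's data structures.

```python
# Illustrative sketch of threshold-based migration, with a dict standing in
# for the Block B+tree index over the page region. Assumed: 4 pages/block.
PAGES_PER_BLOCK = 4

class PageRegion:
    def __init__(self, thr_migrate):
        self.thr = thr_migrate   # THR_migrate: pages needed to trigger migration
        self.index = {}          # block number -> list of cached LPNs

    def access(self, lpn):
        """Record an access in the page region. If the block now holds enough
        pages, remove it from the index and return (block, pages) to migrate;
        otherwise return None."""
        blk = lpn // PAGES_PER_BLOCK
        pages = self.index.setdefault(blk, [])
        if lpn not in pages:
            pages.append(lpn)
        if len(pages) >= self.thr:  # block is sequential enough: migrate it
            del self.index[blk]
            return blk, pages
        return None
```

A real B+tree keyed by block number serves the same lookups in sorted order with bounded fan-out; the dict keeps the sketch short.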

Figure 3. Block B+tree. The key is the block number; each block node records the number of pages, the block popularity and a pointer array. The tree consists of a root node, interior nodes and leaf nodes, with the leaves pointing into the page LRU list. 

By using the Block B+tree, the pages belonging to the same block can be quickly searched and located. Meanwhile, the space overhead of the Block B+tree is limited. As shown in Figure 3, the Block B+tree generally includes three parts: block nodes, leaf nodes and interior nodes (including the root node). To analyze the space overhead, we first make the following assumptions: (i) integer and pointer types consume 4 bytes each; (ii) the fill factor of the Block B+tree is at least 5% [36]; (iii) the total size of the interior nodes is half the total size of the leaf nodes (in practice, the number of interior nodes is much smaller than the number of leaf nodes; the fan-out of a B+tree is 33 on average [36], and even when the fan-out is equal to two, there are fewer interior nodes than leaf nodes); (iv) every block node consumes a fixed number of bytes when its pointer array includes only one pointer, in which case the number of block nodes is maximal. When the length of the page list is L, the number of block nodes is at most L (assumption iv), and the total sizes of the block nodes, the leaf nodes (one unit corresponding to one block node in a leaf node consumes 8 bytes) and the interior nodes (assumption iii) each grow linearly with L. Therefore, the space overhead of the Block B+tree is a small percentage of the size of the buffer pages. E. Dynamic Threshold Our objective is to improve the sequentiality of the accesses passed to flash memory while maintaining high buffer space utilization. The threshold is used to balance these two objectives, and its value is critical to the efficiency of our proposed buffer management scheme. To investigate the proper threshold value, we tested the effects of different threshold values through repeated experiments over a set of workloads. 
However, we found in our experiments that it is difficult to find a single well-performing value for all types of workloads. Different threshold values must be set to achieve optimal results even for the same workload. A statically set threshold value cannot adapt to enterprise workloads with complex features and interleaved I/O requests. We realized that the proper threshold value is highly dependent on workload features. For a random dominant workload, a small value is suitable because it is difficult to form sequential blocks. For a sequential dominant workload, a large value is desirable, because with a small threshold a lot of partial blocks, instead of full blocks, would be migrated from the page region to the block region. The dynamic threshold is realized with the following heuristic. We use THR_migrate to denote the threshold. The value of THR_migrate ranges from 1 to the total number of pages in a block. N_block and N_total represent the size of the block region and the total buffer space, respectively. We use γ to denote the ratio between N_block and N_total. The value of γ is kept under the following constraint, which is used to control whether to enlarge or reduce the threshold:

γ = N_block / N_total ≤ θ    (1)

where θ is a parameter configured based on the buffer size. Since the block region is used to store block candidates whose pages are sequential enough for replacement and flush, the size of the block region should be much smaller than the size of the page region, so θ is set to a small percentage, chosen according to the buffer size. Initially, THR_migrate is set to a small value. If γ grows larger than θ, the size of the block region breaks the above constraint. To reduce the size of the block region, a larger value of THR_migrate is required to make page migration from the page region to the block region harder. The value of THR_migrate is therefore doubled until γ is less than θ. 
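The doubling-and-halving heuristic can be sketched as follows. This is an illustrative sketch under our own assumptions: the bounds (1 and the pages-per-block maximum) follow the text, but the function name and the per-call update granularity are ours.

```python
# Illustrative sketch of the dynamic-threshold heuristic: double THR_migrate
# while the block region's share of the buffer exceeds theta; halve it when
# the block region runs empty. Assumed geometry: 4 pages/block.
def adjust_threshold(thr, n_block, n_total, theta, pages_per_block=4):
    """One adjustment step. `thr` is THR_migrate; `n_block`/`n_total` give
    gamma, the block region's share of the total buffer space."""
    gamma = n_block / n_total
    if gamma > theta:                          # block region too large:
        thr = min(thr * 2, pages_per_block)    # make migration harder
    elif n_block == 0:                         # block region empty:
        thr = max(thr // 2, 1)                 # make migration easier
    return thr
```

Repeated calls implement the "doubled until γ is less than θ" behavior in the text, while the clamps keep the threshold within its valid range.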
On the other hand, the value of THR_migrate is halved if the block region is empty. This dynamic method properly adjusts the threshold value according to the workload features. In our experiments, the dynamic threshold performs well across the various buffer sizes imposed by different application sets. IV. EVALUATION A. Experiment Setup 1) Trace-driven Simulator We built a trace-driven buffer cache simulator by interfacing with a modified version of DiskSim 4. [] and its SSD extension [5]. We implemented the following buffer cache schemes in the SSD code: our scheme and the schemes of [] and [3]. Block Associative Sector Translation (BAST) [3] is implemented as the FTL in the SSD. The configuration values of the SSD listed in Table II are taken from [5]. 2) Workload Traces We use a mixture of real-world and synthetic traces to study the efficiency of the different buffer management schemes on a wide spectrum of enterprise-scale workloads. Table III presents the salient features of our workload traces. We employ a read-dominant I/O trace from an OLTP application running at a financial institution [] made available by the Storage Performance Council (SPC),

henceforth referred to as the Financial trace.

TABLE II. SPECIFICATION OF SSD CONFIGURATION

  Page Read to Register: 5 μs
  Page Program (Write) from Register: μs
  Block Erase: .5 ms
  Serial Access to Register (Data bus): μs
  Die Size: GB
  Block Size: 56 KB
  Page Size: 4 KB
  Data Register: 4 KB
  Erase Cycles: K
  SSD Capacity: 3 GB

We also employ a write-dominant I/O trace called CAMWEBDEV, which was collected by Microsoft [3] and made available by the Storage Networking Industry Association (SNIA). Besides read- and write-dominant workloads, we also want to assess the behavior of the different buffer management schemes under mixed workloads. For this purpose we use the MSNFS and Exchange traces, which were likewise collected by Microsoft and made available by SNIA [3]. Finally, we use a synthetic trace, referred to as the Syn trace, to study behavior under a sequential-dominant workload. The five traces used in our experiments cover a wide range of workload characteristics, from random to sequential and from read-dominant to write-dominant.

TABLE III. SPECIFICATION OF WORKLOADS

  Workload | Avg. Req. Size (KB) | Write (%) | Seq. (%) | Avg. Req. Interarrival Time (ms)
  Financial
  MSNFS
  Exchange
  CAMWEBDEV
  Syn

3) Evaluation Metrics

In this study we characterize the behavior of the different buffer management schemes using (i) response time as seen at the I/O driver (the sum of the device service time and the time spent waiting in the driver's queue), (ii) cache hit ratio, (iii) number of erases (an indicator of garbage collection overhead), and (iv) the distribution of write length (which indicates the sequentiality of the write accesses passed to flash memory).

B. Experiment Results

Figures 4 through 7 show the average response time, cache hit ratio, number of erases and distribution of write length of the HBM, FAB and BPLRU caching schemes for the four workloads as we vary the buffer size.
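For metric (iv), the write-length CDF can be computed directly from the log of writes flushed to flash. The sketch below is our own illustration of how such a curve is derived, not part of the simulator; the function name and the 64-page cap are assumptions.

```python
def write_length_cdf(write_lengths, max_pages=64):
    """Given the sizes (in pages) of all writes passed to flash,
    return cdf[i] = fraction of writes of size <= i+1 pages."""
    total = len(write_lengths)
    hist = [0] * (max_pages + 1)
    for w in write_lengths:
        hist[min(w, max_pages)] += 1  # cap at one block
    cdf, count = [], 0
    for size in range(1, max_pages + 1):
        count += hist[size]
        cdf.append(count / total)
    return cdf
```

A scheme that favors sequential flushing shows a CDF that rises late, i.e. most written pages belong to large writes.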
To indicate the write-length distribution, we use CDF curves showing the percentage (on the Y-axis) of written pages whose sizes are less than a certain value (on the X-axis). Because of space limitations, we present CDF curves only for the MB buffer in Figures 4 through 7; CDF curves for a 6 MB buffer under the different traces are given in Figure 9. The following observations are made from the results.

1) Financial trace

Figure 4 shows that HBM outperforms both FAB and BPLRU in average response time, cache hit ratio, number of erases and number of sequential writes under this completely read-dominant trace. With a memory size of MB, the average response time of HBM is . msecond, versus .75 msecond for the weaker baseline (see Figure 4(a)): an 84% improvement in average response time, with a corresponding 85% hit ratio increase (see Figure 4(b)) and an 85% erase reduction (see Figure 4(c)). HBM is also 76% faster than the other baseline for a MB buffer, with 46% more cache hits and 8% fewer erases. Figure 4(d) shows that the percentages of -page writes for the two baselines are 56% and %; by contrast, HBM has only 5% small writes, better than both. Further, HBM provides far more large writes: almost 3% of HBM's writes are larger than 4 pages, while the baselines have only 4% and % such writes. Figure 9 likewise shows that the HBM algorithm is very effective at increasing the sequentiality of write accesses across traces. These results indicate that HBM's performance gain comes from two sources: a high cache hit ratio and reduced garbage collection. HBM exploits both pages and blocks to manage the buffer space in a hybrid way, taking both temporal and spatial locality into account, and thereby improves the cache hit ratio while increasing the portion of sequential writes.

2) MSNFS trace

The MSNFS trace is a random workload in which reads are about 34% more frequent than writes.
Careful analysis of this workload reveals that it exhibits a very high degree of both spatial and temporal locality. Figure 5 shows that HBM is up to 8% faster than one baseline for cache sizes up to 3 MB, with 39% more cache hits and 78% fewer erases; against the other baseline, HBM performs up to 63%, 63% and 38% better in average response time, cache hit ratio and number of erases, respectively. Beyond 3 MB, HBM's advantage over FAB and BPLRU narrows, because the cache becomes large enough to accommodate most accesses. HBM is thus clearly preferable for workloads of this nature: the block-based FAB and BPLRU trade temporal locality for spatial locality, whereas HBM efficiently leverages both.

3) Exchange trace

The Exchange trace is a random workload in which writes are about 44% more frequent than reads. For a memory size of MB, the average response time of HBM is .73 msecond, versus 3.4 msecond for the weaker baseline (see Figure 6(a)): a 45% improvement in average response time, with a corresponding 6% hit ratio increase (see Figure 6(b)) and a % erase reduction (see Figure 6(c)).

[Figure 4. Financial Trace: (a) response time, (b) cache hit ratio, and (c) number of erases as buffer size varies; (d) distribution of write length when the buffer is MB.]
[Figure 5. MSNFS Trace: (a) response time, (b) cache hit ratio, and (c) number of erases as buffer size varies; (d) distribution of write length when the buffer is MB.]

[Figure 6. Exchange Trace: (a) response time, (b) cache hit ratio, and (c) number of erases as buffer size varies; (d) distribution of write length when the buffer is MB.]
[Figure 7. CAMWEBDEV Trace: (a) response time, (b) cache hit ratio, and (c) number of erases as buffer size varies; (d) distribution of write length when the buffer is MB.]

HBM is also 3% faster than the other baseline for a MB buffer, with 8% more cache hits and 5% fewer erases. The percentages of small writes (fewer than pages) for the two baselines are 44% and 9%; by contrast, HBM has 4% small writes, better than both (see Figure 6(d)). Further, HBM produces far more sequential writes: almost 38% of its writes are larger than 4 pages, while the baselines have only % and .% such writes.

4) CAMWEBDEV trace

The CAMWEBDEV trace is a completely random, write-dominant workload. Figure 7 shows that HBM performs 7% and 4% faster than the two baselines for a buffer size of MB; accordingly, it achieves an 8% and a 65% higher hit ratio and a 48% and a % reduction in erases. We observed that HBM is also effective at reducing small writes and increasing sequential writes for this write-intensive workload. Figure 7(d) shows that the baselines produce 8% and 98% small writes (fewer than 4 pages); by contrast, HBM has 74% small writes, better than both. HBM also provides % large writes (larger than 8 pages), whereas the baselines have only 7% and .% large writes. Figure 9(d) further shows that with a buffer size of 6 MB, the baselines produce 8% and 4% small writes (fewer than 4 pages), while HBM has only 8% small writes; HBM also provides 33% large writes (larger than 8 pages), against only 8% and 5% for the baselines. As the buffer size increases from MB to 6 MB, HBM's improvement is much larger than the baselines', indicating that the HBM algorithm is more effective at increasing the sequentiality of write accesses across buffer sizes. The results further show that the distribution of write length is directly correlated with garbage collection overhead and performance: with a buffer size of MB, HBM produces 7% writes larger than 4 pages, compared to 8% and % for the baselines (see Figure 7(d)); accordingly, HBM achieves a 48% and a % garbage collection overhead reduction (see Figure 7(c)).
Consequently, HBM's performance is improved by 7% and 4% over the two baselines (see Figure 7(a)). This correlation clearly indicates that write length is a critical factor affecting SSD performance and garbage collection overhead.

5) Effect of workloads

We observed that the efficiency of HBM differs across workload traces. With a buffer size of MB, HBM achieves 84%, 8%, 45% and 7% performance improvement over one baseline for the Financial, MSNFS, Exchange and CAMWEBDEV traces, respectively, and 76%, 63%, % and 4% over the other.

[Figure 8. Average response time of HBM, FAB and BPLRU under the Syn trace.]

The results indicate that HBM outperforms FAB and BPLRU for different types of random workloads in an enterprise system. For the sequential write-dominant trace, we show only the performance results (Figure 8) due to space limitations. HBM still performs better than FAB and BPLRU, but its advantage is less pronounced, because this workload offers them more spatial locality to exploit than the random workloads above.

6) Additional overhead

To study the overhead of the different buffer management schemes under different workloads, Figure 10 presents the total pages read while replaying the traces. The results show that BPLRU performs a large number of read operations. Take Figure 10(a) as an example: with a buffer size of MB, BPLRU reads 368% and 57% more pages than HBM and FAB, respectively, and accordingly its average response time is 53% and 48% slower (see Figure 4(a)). This is because BPLRU uses page padding to increase the number of sequential writes; for the completely random workloads common in enterprise environments, BPLRU must read a large number of additional pages, which hurts overall performance. By contrast, our proposed HBM achieves better performance without additional reads: HBM treats reads and writes as a whole and leverages block-level temporal locality among read and write accesses to form sequential blocks naturally.
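The read overhead of padding can be made concrete with a simple count: to turn a partially cached block into one full sequential write, a padding scheme must first fetch every page of the block that is not already buffered. A small sketch (our own illustration; 64 pages per block assumed):

```python
PAGES_PER_BLOCK = 64  # assumed block size in pages

def padding_reads(cached_pages_per_block):
    """Given, for each flushed block, how many of its pages are already
    in the buffer, return the extra page reads a padding scheme issues
    to write every block out in full."""
    return sum(PAGES_PER_BLOCK - c for c in cached_pages_per_block)
```

For a random workload that flushes four blocks holding a single dirty page each, this is 4 × 63 = 252 extra page reads, which illustrates why padding becomes expensive when writes are scattered.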
7) Effect of Threshold

To investigate how the threshold value affects the efficiency of the proposed HBM, we tested HBM with static thresholds and with the dynamic threshold for the different traces, as shown in Figure 11. Take Figure 11(c) as an example: with a memory size of 6 MB, the average response time of HBM is .79 msecond, .65 msecond and .73 msecond when the threshold value is , 4, and 8, respectively. By contrast, the average response time with the dynamic threshold is .55 msecond, considerably better than with any static threshold. We further observed that no single threshold achieves optimal performance across workloads. Figure 11(a) shows that with a buffer size of 6 MB, HBM performs

better when the threshold is set to 64 for the Financial trace than with the other static thresholds. However, with a buffer size of 6 MB and a threshold of 64, the average response time of HBM is .67 msecond for the CAMWEBDEV trace, worse than with thresholds of , 4, and 8 (see Figure 11(d)). By contrast, the results in Figure 11 show that the dynamic threshold achieves the best performance for the Financial, MSNFS, Exchange and CAMWEBDEV traces alike. The variation in the performance curves in Figure 11 clearly indicates that the threshold value has a significant impact on the efficiency of our proposed HBM. Statically setting the threshold cannot achieve optimal performance.

[Figure 9. Distribution of write length of HBM, FAB and BPLRU when the buffer size is 6 MB: (a) Financial, (b) MSNFS, (c) Exchange, (d) CAMWEBDEV.]
[Figure 10. Total pages read by HBM, FAB and BPLRU under the 4 traces: (a) Financial, (b) MSNFS, (c) Exchange, (d) CAMWEBDEV.]
[Figure 11. Effect of the threshold on HBM, comparing static threshold values (THR) with the dynamic threshold: (a) Financial, (b) MSNFS, (c) Exchange, (d) CAMWEBDEV.]

Dynamically adjusting

the threshold for enterprise workloads makes the proposed HBM workload-adaptive.

V. RELATED WORK

In this section we survey related work in the literature.

A. Buffer Cache Management

1) Disk Buffer Cache Management

One of the most active research areas in improving disk I/O performance is buffer caching. Over the years, numerous replacement algorithms have been proposed to reduce actual disk accesses. The oldest and still most widely adopted is the Least Recently Used (LRU) algorithm. The popularity of LRU comes from its simple and effective exploitation of temporal locality: a block accessed recently is likely to be accessed again in the near future. Many other algorithms have also been proposed, such as CLOCK [6], 2Q [9], MQ [33], ARC [8], and LIRS [34]. Exploiting the temporal locality of data accesses, all of these replacement algorithms take cache hit ratio improvement as their sole objective in order to minimize disk activity. For an SSD, however, hit ratio alone can be a misleading metric: as discussed in Section II, the sequentiality of the write accesses passed to the SSD significantly influences performance, lifetime, internal fragmentation and garbage collection overhead. These replacement algorithms are not effective for SSDs because they ignore spatial locality. The DULO [] scheme introduces spatial locality into page replacement decisions, making the replacement algorithm aware of page placement on the disk; it exploits both temporal and spatial locality in buffer management for hard disks, where sequential access is far more efficient than random access. However, DULO cannot be applied directly to SSDs because it considers the hard-disk layout rather than the SSD layout.

2) SSD Buffer Cache Management

Existing caching algorithms for flash memory include CFLRU [35], FAB, BPLRU, and LB-CLOCK [3], discussed below.
Clean-First LRU (CFLRU) [35] is a page-based buffer cache management algorithm for flash storage. It divides the host buffer space into a working region and an eviction region, and selects victim buffer pages from the eviction region. To exploit the asymmetric performance of flash read and write operations, it tries to choose a clean page as the victim rather than a dirty page; CFLRU thus reduces the number of writes by performing more reads.

The Flash-Aware Buffer policy (FAB) [3] is a block-based buffer cache management policy for flash memory. In FAB, the buffer pages belonging to the same erasable block are grouped together, and the number of resident cached pages in a block is the sole criterion for selecting a victim: FAB evicts the block with the largest number of cached pages, falling back to LRU order in case of a tie. All dirty pages in the victim group are flushed, and all clean pages in it are discarded. This policy may result in internal fragmentation, which significantly impacts garbage collection efficiency and performance. The main application of FAB is portable media players, whose write access pattern is sequential.

Block Padding LRU (BPLRU) [] is another block-based buffer cache management scheme, designed for flash memory writes only. BPLRU uses block-level LRU, page padding, and LRU compensation to establish a desirable write pattern with RAM buffering. However, it does not consider read requests, and for completely random workloads it incurs a large number of additional reads, which significantly impact overall performance.

The Large Block CLOCK (LB-CLOCK) [3] algorithm considers recency and block space utilization when making cache management decisions, dynamically varying the priority between the two metrics to adapt to changes in workload characteristics.

B. Flash Memory

Several studies have examined the performance of random writes on flash storage at various levels of the storage hierarchy [4,5,6,7].
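FAB's victim selection described above can be sketched as follows (our own sketch with hypothetical data structures: each block group carries its cached-page count and a last-access timestamp for LRU tie-breaking):

```python
def select_fab_victim(blocks):
    """blocks maps block_id -> (cached_page_count, last_access_time).
    Evict the block with the most cached pages; on a tie, prefer the
    least recently used block (smallest last_access_time)."""
    return max(blocks, key=lambda b: (blocks[b][0], -blocks[b][1]))
```

Evicting the fullest block maximizes the pages freed per eviction and tends to flush near-complete blocks, but as noted above it can leave partially filled blocks behind and cause internal fragmentation.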
Work on FTLs tries to improve performance and to address high garbage collection overhead. BAST exclusively associates a log block with a data block; in the presence of small random writes, this scheme suffers from increased garbage collection cost. FAST [9] keeps a single sequential log block dedicated to sequential updates, while the other log blocks are used for random writes. The Superblock FTL scheme [] exploits block-level spatial locality in workloads by combining consecutive logical blocks into a superblock, and maintains page-level mappings within the superblock to exploit temporal locality by separating hot and cold data. The Locality-Aware Sector Translation (LAST) scheme [4] tries to alleviate the shortcomings of BAST and FAST by exploiting both temporal locality and sequential locality in workloads; it further separates the random log blocks into hot and cold regions to reduce garbage collection cost. Unlike the currently predominant hybrid FTLs, the Demand-based Flash Translation Layer (DFTL) [] is purely page-mapped: it exploits the temporal locality in enterprise-scale workloads to store the most popular mappings in limited SRAM, while the rest are maintained on the flash device itself. MFT [8], a block-device-level solution, translates random writes into sequential writes between the file system and the SSD; FlashLite [7] does so between the application and the file system, with an idea similar to P2P file sharing. Griffin [4] uses a log-structured HDD as a write cache to improve the sequentiality of the write accesses to the SSD.

This paper differs from the above studies in a number of ways. First, HBM is a hybrid buffer management scheme that exploits both pages and blocks to manage the buffer space, and considers both cache hit ratio and sequentiality as design metrics. A second major feature of this study is that


Reducing Solid-State Storage Device Write Stress Through Opportunistic In-Place Delta Compression

Reducing Solid-State Storage Device Write Stress Through Opportunistic In-Place Delta Compression Reducing Solid-State Storage Device Write Stress Through Opportunistic In-Place Delta Compression Xuebin Zhang, Jiangpeng Li, Hao Wang, Kai Zhao and Tong Zhang xuebinzhang.rpi@gmail.com ECSE Department,

More information

Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems

Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems Liang Shi, Chun Jason Xue and Xuehai Zhou Joint Research Lab of Excellence, CityU-USTC Advanced Research Institute,

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

80 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 1, JANUARY Flash-Aware RAID Techniques for Dependable and High-Performance Flash Memory SSD

80 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 1, JANUARY Flash-Aware RAID Techniques for Dependable and High-Performance Flash Memory SSD 80 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 1, JANUARY 2011 Flash-Aware RAID Techniques for Dependable and High-Performance Flash Memory SSD Soojun Im and Dongkun Shin, Member, IEEE Abstract Solid-state

More information

Delayed Partial Parity Scheme for Reliable and High-Performance Flash Memory SSD

Delayed Partial Parity Scheme for Reliable and High-Performance Flash Memory SSD Delayed Partial Parity Scheme for Reliable and High-Performance Flash Memory SSD Soojun Im School of ICE Sungkyunkwan University Suwon, Korea Email: lang33@skku.edu Dongkun Shin School of ICE Sungkyunkwan

More information

A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement

A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement Guangyu Sun, Yongsoo Joo, Yibo Chen Dimin Niu, Yuan Xie Pennsylvania State University {gsun,

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

Understanding the Relation between the Performance and Reliability of NAND Flash/SCM Hybrid Solid- State Drive

Understanding the Relation between the Performance and Reliability of NAND Flash/SCM Hybrid Solid- State Drive Understanding the Relation between the Performance and Reliability of NAND Flash/SCM Hybrid Solid- State Drive Abstract: A NAND flash memory/storage-class memory (SCM) hybrid solid-state drive (SSD) can

More information

Pseudo SLC. Comparison of SLC, MLC and p-slc structures. pslc

Pseudo SLC. Comparison of SLC, MLC and p-slc structures. pslc 1 Pseudo SLC In the MLC structures, it contains strong pages and weak pages for 2-bit per cell. Pseudo SLC (pslc) is to store only 1bit per cell data on the strong pages of MLC. With this algorithm, it

More information

VSSIM: Virtual Machine based SSD Simulator

VSSIM: Virtual Machine based SSD Simulator 29 th IEEE Conference on Mass Storage Systems and Technologies (MSST) Long Beach, California, USA, May 6~10, 2013 VSSIM: Virtual Machine based SSD Simulator Jinsoo Yoo, Youjip Won, Joongwoo Hwang, Sooyong

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University NAND Flash-based Storage Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics NAND flash memory Flash Translation Layer (FTL) OS implications

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

[537] Flash. Tyler Harter

[537] Flash. Tyler Harter [537] Flash Tyler Harter Flash vs. Disk Disk Overview I/O requires: seek, rotate, transfer Inherently: - not parallel (only one head) - slow (mechanical) - poor random I/O (locality around disk head) Random

More information

ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices

ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Devices Jiacheng Zhang, Jiwu Shu, Youyou Lu Tsinghua University 1 Outline Background and Motivation ParaFS Design Evaluation

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

p-oftl: An Object-based Semantic-aware Parallel Flash Translation Layer

p-oftl: An Object-based Semantic-aware Parallel Flash Translation Layer p-oftl: An Object-based Semantic-aware Parallel Flash Translation Layer Wei Wang, Youyou Lu, and Jiwu Shu Department of Computer Science and Technology, Tsinghua University, Beijing, China Tsinghua National

More information

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook The Memory Hierarchy So far... We modeled the memory system as an abstract array

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Chapter 8. Virtual Memory

Chapter 8. Virtual Memory Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:

More information

Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays

Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 36, NO. 5, MAY 2017 815 Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays Yongkun Li,

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Improving Performance of Solid State Drives in Enterprise Environment

Improving Performance of Solid State Drives in Enterprise Environment University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

Design and Implementation for Multi-Level Cell Flash Memory Storage Systems

Design and Implementation for Multi-Level Cell Flash Memory Storage Systems Design and Implementation for Multi-Level Cell Flash Memory Storage Systems Amarnath Gaini, K Vijayalaxmi Assistant Professor Department of Electronics VITS (N9), Andhra Pradesh Sathish Mothe Assistant

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

Second-Tier Cache Management Using Write Hints

Second-Tier Cache Management Using Write Hints Second-Tier Cache Management Using Write Hints Xuhui Li University of Waterloo Aamer Sachedina IBM Toronto Lab Ashraf Aboulnaga University of Waterloo Shaobo Gao University of Waterloo Kenneth Salem University

More information

Design Considerations for Using Flash Memory for Caching

Design Considerations for Using Flash Memory for Caching Design Considerations for Using Flash Memory for Caching Edi Shmueli, IBM XIV Storage Systems edi@il.ibm.com Santa Clara, CA August 2010 1 Solid-State Storage In a few decades solid-state storage will

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Announcements. ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab 1 due today Reading: Chapter 5.1 5.3 2 1 Overview How to

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

AMC: an adaptive multi-level cache algorithm in hybrid storage systems

AMC: an adaptive multi-level cache algorithm in hybrid storage systems CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (5) Published online in Wiley Online Library (wileyonlinelibrary.com)..5 SPECIAL ISSUE PAPER AMC: an adaptive multi-level

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections ) Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case

More information

Memory Hierarchies &

Memory Hierarchies & Memory Hierarchies & Cache Memory CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 4/26/2009 cse410-13-cache 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and

More information

10/1/ Introduction 2. Existing Methods 3. Future Research Issues 4. Existing works 5. My Research plan. What is Data Center

10/1/ Introduction 2. Existing Methods 3. Future Research Issues 4. Existing works 5. My Research plan. What is Data Center Weilin Peng Sept. 28 th 2009 What is Data Center Concentrated clusters of compute and data storage resources that are connected via high speed networks and routers. H V A C Server Network Server H V A

More information

Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives

Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Chao Sun 1, Asuka Arakawa 1, Ayumi Soga 1, Chihiro Matsui 1 and Ken Takeuchi 1 1 Chuo University Santa Clara,

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

Chapter 12 Wear Leveling for PCM Using Hot Data Identification

Chapter 12 Wear Leveling for PCM Using Hot Data Identification Chapter 12 Wear Leveling for PCM Using Hot Data Identification Inhwan Choi and Dongkun Shin Abstract Phase change memory (PCM) is the best candidate device among next generation random access memory technologies.

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device

SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device Hyukjoong Kim 1, Dongkun Shin 1, Yun Ho Jeong 2 and Kyung Ho Kim 2 1 Samsung Electronics

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK] Lecture 17 Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] SRAM / / Flash / RRAM / HDD SRAM / / Flash / RRAM/ HDD SRAM

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Flash Trends: Challenges and Future

Flash Trends: Challenges and Future Flash Trends: Challenges and Future John D. Davis work done at Microsoft Researcher- Silicon Valley in collaboration with Laura Caulfield*, Steve Swanson*, UCSD* 1 My Research Areas of Interest Flash characteristics

More information