Ph.D. Dissertation

Software Optimization Methods for High-Performance Flash-based Storage Devices

Park, Seon-yeong (박선영, 朴善英)

Department of Computer Science

KAIST

2011

Software Optimization Methods for High-Performance Flash-based Storage Devices

Software Optimization Methods for High-Performance Flash-based Storage Devices

Advisor: Professor Maeng, Seungryoul

by Park, Seon-yeong

Department of Computer Science
KAIST

A thesis submitted to the faculty of KAIST in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science. The study was conducted in accordance with the Code of Research Ethics.

Approved by Professor Maeng, Seungryoul [Advisor]

Declaration of Ethical Conduct in Research: I, as a graduate student of KAIST, hereby declare that I have not committed any acts that may damage the credibility of my research. These include, but are not limited to: falsification, a thesis written by someone else, distortion of research findings, and plagiarism. I affirm that my thesis contains honest conclusions based on my own careful research under the guidance of my thesis advisor.

Software Optimization Methods for High-Performance Flash-based Storage Devices

Park, Seon-yeong

The above dissertation was approved by the dissertation examination committee as a Ph.D. dissertation of KAIST on November 26.

Committee Chair: Maeng, Seungryoul (seal)
Committee Member: Lee, Joonwon (seal)
Committee Member: Bae, Doo-Hwan (seal)
Committee Member: Kim, Jin-Soo (seal)
Committee Member: Huh, Jaehyuk (seal)

Park, Seon-yeong. Software Optimization Methods for High-Performance Flash-based Storage Devices. Department of Computer Science. 2011. Advisor: Prof. Maeng, Seungryoul. Text in English.

ABSTRACT

Flash memory has emerged as a strong candidate for a storage device because of its fast random access, low power consumption, small and light form, and shock resistance. However, flash memory has different hardware characteristics compared to the traditional storage device, the magnetic disk. In particular, flash memory has uniform access speed, asymmetric read and write performance, and no in-place update. To narrow the hardware gap between flash memory and magnetic disks, flash memory-based storage devices employ special software programs such as Flash Translation Layers (FTLs) and flash file systems. These software programs have been improved to reduce the management overheads of flash memory in terms of both time and energy. However, in spite of these efforts, flash memory does not exhibit its full performance gain under existing operating systems, which have been optimized under the assumption that secondary storage is composed of magnetic disks. Therefore, it is necessary to revisit operating system policies and mechanisms, and to revise them for flash memory-based secondary storage.

This thesis deals with software optimization for using flash memory as a secondary storage device. First, a new page replacement algorithm, Cost-based LRU (CBLRU), is proposed. CBLRU analyzes the cost and benefit of two replacement decisions using information on early evicted clean pages. Based on this precise analysis, the algorithm reduces the number of costly write and erase operations while avoiding an escalation of cache misses. Second, this research suggests an I/O request handling method that reorders the outstanding requests in the request queue to increase the use of parallel components in flash-based storage devices. In addition, it explores write request mapping methods that also affect the parallel execution of storage components. Simulation results with real workloads show that the I/O performance of the proposed replacement algorithm is enhanced by up to 7.7% compared with the existing LRU algorithm. Also, the I/O request handling method enhances the performance of flash-based storage by up to 11.8% without increasing hardware components such as flash memory chips and bus channels.

Contents

Abstract
Contents
List of Tables
List of Figures

Chapter 1. Introduction

Chapter 2. Background and Motivation
  2.1 Flash memory
  2.2 Hardware characteristics
      Limitations
  2.3 Software support
      Flash Translation Layer (FTL)
      Flash filesystems
      Flash-specific cache management
  2.4 Flash-based Solid-State Drives (SSDs)
      Flash memory packages
      Standard host interface
      Request rescheduling algorithm
      Write request mapping
  2.5 Motivation

Chapter 3. Flash-aware Cache Management
  3.1 Demand paging for NAND flash memory
      Impact on energy consumption
      Evaluation
  3.2 Flash memory friendly cache replacement algorithm: CFLRU
      Algorithm overview
      Window size for minimal replacement cost
      Simulation study
      Implementation on Linux
      Implementation results
  3.3 Enhanced cache replacement algorithm: CBLRU
      Algorithm overview
      Two replacement decisions in on-line algorithm
      Cost-benefit analysis
      Eviction boundary
      Simulation study
      Integrated evaluation

Chapter 4. Parallel Architecture-aware I/O Request Handling
  4.1 Performance analysis of HDDs and flash-based SSDs
      Experimental setup
      Empirical analysis
  4.2 Rescheduling and write mapping methods
      System architecture
      Request rescheduling
      Dynamic write mapping
      Evaluation

Chapter 5. Conclusion

References

Summary (in Korean)

List of Tables

2.1 Energy consumption and process time of flash devices [1]
2.2 Comparison of SLC and MLC NAND flash in their specifications [2][3][4]
3.1 Power and energy coefficient values [5]
3.2 Workloads used in experiment
3.3 Workload characteristics used in simulation
3.4 Parameters used in the simulator
4.1 Specifications of storage devices used in experiments
4.2 Idle power consumption of each device
4.3 Parameters used in the SSD simulator
4.4 Workload characteristics used in the evaluation
4.5 Average response time improvement by request rescheduling schemes
4.6 Average response time improvement by dynamic write mapping schemes
4.7 Rescheduling methods using SLC and MLC

List of Figures

2.1 Flash memory cell organization [6]
2.2 NAND flash organization [7]
2.3 SLC and MLC NAND flash memory [8]
2.4 Software architecture for NAND flash memory
2.5 Page-mapping FTL scheme
2.6 Block-mapping FTL scheme
2.7 Hybrid-mapping FTL scheme
2.8 Example of CRAW-C algorithm
2.9 Example of LRU-WSR algorithm
2.10 Write buffer management for flash memory
2.11 SSD block diagram
2.12 Flash package organization
3.1 Flash-based storage architecture for mobile embedded devices [9][5]
3.2 Energy consumption breakdown for code shadowing and demand paging
3.3 Example of CFLRU algorithm
3.4 Probability of future reference per page
3.5 The number of flash operations of MIN, LRU, CFLRU-static, and CFLRU-dynamic
3.6 Estimated time of MIN, LRU, CFLRU-static, and CFLRU-dynamic
3.7 LRU list in Linux kernel
3.8 Weighted flash read and write counts for each application
3.9 Measured time and expected energy for each application
3.10 Two alternative replacement victims
3.11 Coexistence of two replacement policies
3.12 None of references to both IEPs and dirty repository pages
3.13 Reference on IEP
3.14 Reference on page in dirty repository page
3.15 Eviction boundary
3.16 Change of eviction boundary
3.17 Expected I/O delays
3.18 System model
3.19 Average read response time
3.20 The number of read requests
3.21 Overall I/O delay
4.1 Read throughput of SSDs
4.2 Write throughput of SSDs
4.3 Read and write throughput of HDDs
4.4 Power consumption of SSD read requests

4.5 Power consumption of SSD write requests
4.6 Power consumption of HDD read and write requests
4.7 Energy consumption for sequential read requests
4.8 Energy consumption for random read requests
4.9 Energy consumption for sequential write requests
4.10 Energy consumption for random write requests
4.11 Throughput and power of SSD3 after burst random writes
4.12 Throughput of two filesystems with FileBench
4.13 Energy consumption of two filesystems with FileBench
4.14 Request rescheduling schemes
4.15 Response time improvement of synthetic workloads
4.16 Normalized response time of both request rescheduling and dynamic write mapping
4.17 Various rescheduling depth on SLC
4.18 Various rescheduling depth on MLC
4.19 Write mapping methods
4.20 Write mapping effect on rescheduling methods

Chapter 1. Introduction

Flash memory has emerged as a strong candidate for a storage device because of its advantageous features such as fast random access, low power consumption, small and light form, and solid-state robustness. The recent technological development of a single flash memory chip with a several-gigabyte capacity has accelerated the trend of replacing magnetic disks with flash memory in mobile computing devices such as smart-phones, tablet PCs, PDAs, and laptop computers. Moreover, flash memory is even being considered as secondary storage for desktop computers and enterprise servers [10].

Flash memory has different hardware characteristics compared to the conventional storage device, the magnetic disk drive. First, flash memory has fast access speed. Magnetic disk drives have mechanical positioning components and thereby have slower access speeds. Second, flash memory has asymmetric read and write performance. A flash write is about ten times slower than a flash read [11], and the power consumption of a flash write is about six times higher than that of a flash read [1]. Third, flash memory requires erase operations before rewriting data. The unit size of an erase operation is usually bigger than that of a read/write operation. In fact, the erase operation has a slow completion time and high power consumption [1]. In addition, the number of erase operations is limited to roughly 10,000 to 100,000 times per erase block. Thus, flash memory requires a well-designed scheme to evenly wear out the flash memory region to extend the lifetime of the flash storage device [12].

Due to these hardware characteristics, flash memory requires special software support such as Flash Translation Layers (FTLs) or flash file systems to emulate conventional magnetic disk drives. The FTL maps externally seen logical pages to internally used physical pages [13]. Many different FTL schemes have been introduced not only to reduce the RAM space required to contain the mapping information but also to reduce the number of costly write and erase operations [14][15][16][17][18][19]. Also, flash file systems such as JFFS2 [20], YAFFS [21], UBIFS [22] and LogFS [23] are designed for flash memory. These file systems reduce the number of write and erase operations by adopting a log-structured concept like that of LFS [24], which appends new data at the end of previously written data.

Although various FTL schemes and flash file systems have been introduced to reduce the management overhead of flash-based storage devices, they cannot obtain the full performance enhancement with traditional operating systems, since the operating systems have been designed without consideration of flash memory characteristics. They have been variously optimized under the assumption that the secondary storage device consists only of magnetic disks. Unfortunately, flash memory exhibits different characteristics compared to magnetic disks; therefore, it is necessary to revisit operating system policies and

mechanisms in order to revise them for flash memory-based secondary storage.

This thesis describes the software optimization issues in using flash memory as a secondary storage device. For this purpose, current operating system policies and mechanisms are revisited to find the software layers that need to be re-designed. In particular, two software layers largely depend on the hardware characteristics of storage devices: page cache management and I/O request queue handling. First, existing cache management schemes focus on the cache hit rate to reduce the number of accesses to magnetic disk drives. However, a cache management scheme for flash-based storage should consider asymmetric read and write performance as well as the cache hit rate in order to reduce the total replacement cost. Second, existing I/O request handlers only take into account data access distance to reduce head positioning delay. However, this is unnecessary for flash-based storage because of its fast access speed. Rather than reducing data access distance, the I/O request handler should consider the parallel architecture of flash storage components. Generally, flash-based storage is composed of multiple components such as channels, chips, dies, and planes; these increase I/O throughput as well as storage capacity.

Conventional page replacement algorithms such as LRU (Least Recently Used) consider only the cache hit rate to reduce the number of storage accesses. However, the page replacement policy for flash memory has another important consideration, namely, asymmetric read and write performance. Replacement cost is the elapsed time or consumed energy when evicting a victim page and fetching a new page. If a victim page has to be written to the flash memory because it has been modified, this write cost is added to the replacement cost. If a victim page has not been modified, the write cost can be omitted. If writing the victim page causes a flash erase operation, the replacement cost becomes even more expensive. As mentioned previously, flash write and erase operations are much slower than flash reads; therefore, a replacement policy should consider the number of costly flash write and erase operations. However, if a new replacement policy keeps modified pages in its cache to decrease the number of flash write and erase operations, it may increase cache misses because of the lack of cache space. Consequently, a flash-aware cache replacement policy has to carefully reduce the number of write operations while avoiding an escalation of cache misses, which leads to a large number of flash read operations.

This research proposes a new replacement algorithm, Cost-based LRU (CBLRU), that considers both the cache hit rate and the replacement cost. It maintains cache state for two different replacement decisions to measure their costs and benefits. Using a precise cost and benefit analysis, CBLRU makes efficient replacement decisions to decrease the overall I/O delay. The I/O delay is reduced by up to 7.7% in comparison to the existing LRU replacement algorithm.

Flash-based storage generally has multiple channel buses and chips because the performance of a single flash memory chip is lower than that of a magnetic disk drive. Moreover, state-of-the-art flash memory chips have multiple independently working dies inside, and each die contains multiple planes that

operate simultaneously [25][26]. This integration provides a large storage capacity within a restricted space and improves performance. However, this parallelism can be fully utilized only when the planes on a single die perform operations of the same type at the same time. Moreover, a plane provides a command pipeline to expedite the processing of queued requests [27]. However, this command pipelining achieves a performance boost only when consecutive requests are of the same type. Therefore, the effectiveness of plane-level parallelism strongly depends on the pattern of the request sequence.

This research suggests an I/O request handling method that reorders the outstanding requests in the request queue to further improve the effectiveness of plane-level parallelism. In addition, various effects on the rescheduling algorithms are explored. Firstly, the rescheduling effects on SLC (Single Level Cell) and MLC (Multi Level Cell) flash memory are presented. MLC flash memory has twice the page size as well as longer read and program times compared with SLC flash memory. Secondly, the proper rescheduling depth with real workloads is investigated, since rescheduling depth is directly proportional to rescheduling overhead. Lastly, the effect of write mapping methods is presented; write mapping methods also impact the parallel execution of storage components. With the proposed rescheduling methods, performance is enhanced by up to 9.2% on SLC flash memory and up to 11.8% on MLC flash memory without increasing hardware components such as flash memory chips and bus channels.

The rest of this thesis is organized as follows. Chapter 2 presents related work and the motivation behind this thesis. Chapter 3 describes a page replacement policy for flash-based storage. Chapter 4 presents the I/O request handling methods that enhance the internal parallelism of flash-based storage by using request rescheduling algorithms and write request mapping methods. Chapter 5 summarizes the current research and presents future research directions.

Chapter 2. Background and Motivation

2.1 Flash memory

Flash memory is a nonvolatile memory that can be read and programmed electrically. Because of its useful features such as low power consumption, fast random access, solid-state reliability, and a small and lightweight form factor, it has been growing fast for the last twenty years. In the early development of flash memory, it was mainly used in mobile devices such as digital cameras, cellphones, and music players. As flash memory technology has developed and its density has continually increased, it has started to be used as general-purpose storage such as flash-based Solid-State Drives (SSDs). Owing to multi-channel technology, flash-based SSDs are even superior to traditional Hard Disk Drives (HDDs) in terms of read and write performance. This chapter provides the background of hardware technology and software programs for flash memory, and presents the motivation of this thesis.

2.2 Hardware characteristics

Flash memory is an array of memory cells which include floating-gate transistors. According to the organization of the floating-gate transistor cells, flash memory is categorized into NOR flash and NAND flash. The organization of the NOR flash cell allows individual access to each cell, as shown in Figure 2.1(a), so it has fast random access capabilities. NOR flash is more suitable for code storage because the code can be directly executed in place (XIP) without copying to RAM. Each cell in the initial state is logically set to one. When it is programmed, the cell is set to zero. Generally, it can be programmed in units of a byte or a word at a time. When it is erased, the cell is reset to one. The unit of an erase operation is a block, and typical block sizes are 64, 128, or 256 KB. On the other hand, the organization of the NAND flash cell allows each cell to be accessed only through adjacent cells, as shown in Figure 2.1(b); thus it cannot provide byte-level random access, but it achieves a small chip area per cell. It also has faster erase and program times and lower energy consumption, as shown in Table 2.1. Thus, NAND flash is more appropriate for data storage. A read or program operation in NAND flash is performed in units of a page. The page size is typically 512 B, 2 KB, or 4 KB, and an erase block consists of 32, 64, or 128 pages. Each page has a spare area of 16, 64, or 128 bytes, which includes auxiliary information like bad-block identification and error-correction code (ECC). The organization of NAND flash memory is presented in Figure 2.2. The values in Table 2.1 are converted to the same amount of 512-byte data over an 8-bit data and address bus.
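To make the NAND geometry above concrete, the following sketch decomposes a linear page number into an erase block number and an in-block page offset, the kind of calculation a flash driver or FTL performs before issuing commands. The sizes (a 2 KB page, 64 pages per block, a 64-byte spare area) are example values consistent with the ranges listed above, not figures from a particular datasheet.

#include <stdio.h>
#include <stdint.h>

/* Illustrative NAND geometry (example values, not from a specific chip). */
#define NAND_PAGE_SIZE        2048u  /* main area per page, in bytes       */
#define NAND_SPARE_SIZE         64u  /* spare area per page, in bytes      */
#define NAND_PAGES_PER_BLOCK    64u  /* pages contained in one erase block */

/* Map a linear page number to (erase block, page offset within the block). */
static void locate_page(uint32_t page_no, uint32_t *block, uint32_t *offset)
{
    *block  = page_no / NAND_PAGES_PER_BLOCK;
    *offset = page_no % NAND_PAGES_PER_BLOCK;
}

int main(void)
{
    uint32_t block, offset;
    locate_page(1000, &block, &offset);
    printf("page 1000 -> block %u, offset %u (erase block size %u KB)\n",
           block, offset, NAND_PAGE_SIZE * NAND_PAGES_PER_BLOCK / 1024u);
    return 0;
}

With these example values, an erase block covers 128 KB of data, which is why a small page update can force the relocation and erasure of a much larger region.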

Figure 2.1: Flash memory cell organization [6]

Table 2.1: Energy consumption and process time of flash devices [1] (per-operation energy and time for NOR and NAND read, program, and erase; standby power is 0.33 mW for NOR and 0.13 mW for NAND)

To decrease the price per bit, some NAND flash memory with Multi-Level Cell (MLC) technology stores two or more bits of information per cell. Traditional Single-Level Cell (SLC) NAND flash memory stores one bit per cell. MLC NAND flash memory has a larger capacity but a decrease in several performance parameters, as shown in Table 2.2. The power consumption of the two types of flash is almost the same, but normally SLC NAND flash memory is about two to three times faster than MLC NAND flash memory. In addition, because MLC NAND flash memory divides the voltage range into four or more levels within a cell, as shown in Figure 2.3, it requires a higher level of error correction than SLC NAND flash memory [8]. Because of its high density, low cost, and fast erase and program operations, NAND flash has been growing fast in the data storage area. Many mobile devices utilize NAND flash memory as a major storage medium. Moreover, enterprise servers even consider NAND flash-based storage as a replacement for their storage devices because of performance benefits such as significantly faster reads and writes, lower power consumption, and quiet and cool operation in comparison with traditional magnetic disk drives. In this thesis, we mainly focus on NAND flash memory, and flash memory refers to NAND flash memory in the rest of this thesis.
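As a rough illustration of this speed gap, and ignoring bus transfer time, the per-page figures of Table 2.2 translate into the following raw single-plane bandwidths:

    SLC read:  2 KB / 25 us  = about 80 MB/s      MLC read:  2 KB / 60 us  = about 34 MB/s
    SLC write: 2 KB / 200 us = about 10 MB/s      MLC write: 2 KB / 800 us = about 2.5 MB/s

These single-chip numbers also suggest why SSDs rely on multi-channel and multi-plane parallelism, discussed in Section 2.4, to surpass HDD throughput.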

Figure 2.2: NAND flash organization [7]

Limitations

The major hardware limitations of flash memory when it is used as data storage are slow write performance, the block erasure requirement, and memory wear-out. First, to program and erase data, flash memory requires electric charge and discharge, which take longer than a read access. As shown in Table 2.1, a write operation takes about six to seven times longer than a read operation in NAND flash memory. Furthermore, if the write operation needs a preceding erase operation to update data, it takes a much longer time because a block has to be erased. Second, a whole block in flash memory has to be erased at a time to rewrite a page, and the unit size of an erase block is much bigger than that of a read/write page. As mentioned before, an erase block is composed of 32, 64, or 128 read/write pages. To avoid inefficient rewriting procedures, flash memory needs a software layer, the Flash Translation Layer (FTL). A detailed description of the FTL is given in Section 2.3.1. Third, flash memory has a limited number of erase-write cycles. Most SLC NAND flash has approximately 100,000 cycles, while MLC flash ranges from 3,000 to 10,000 cycles [8]. Thus, flash memory requires a well-designed scheme to evenly wear out the flash memory region to extend the lifetime of flash storage.

2.3 Software support

Flash memory requires software support to complement its hardware limitations. It cannot be used like a traditional HDD by itself because it needs erase operations to update data and the unit

Figure 2.3: SLC and MLC NAND flash memory [8]

                          SLC              MLC
Page Size / Block Size    2 KB / 128 KB    2 KB / 256 KB
Page Read                 25 us            60 us
Page Write                200 us           800 us
Block Erase               1.5 ms           1.5 ms
Block Endurance           100,000 cycles   5,000 cycles
Cost per GB (mid 2007)    $10.37           $6.81

Table 2.2: Comparison of SLC and MLC NAND flash in their specifications [2][3][4]

size of a read/write page is different from that of an erase block. A Flash Translation Layer (FTL) or a flash filesystem is used to drive flash memory in operating systems, as shown in Figure 2.4(a) and Figure 2.4(b). Removable storage such as Multi-Media Cards (MMCs) and USB Flash Drives (UFDs), and independent storage such as Solid-State Drives (SSDs), generally have an FTL as firmware in an embedded controller. Figure 2.4(c) shows the software architecture with removable or independent flash storage. Besides this, cache management schemes that consider the flash characteristics can help enhance the read/write performance. These schemes provide a well-designed write buffering algorithm to reduce the number of erase operations on flash memory. This section presents existing studies related to software support for flash memory.

2.3.1 Flash Translation Layer (FTL)

Flash memory has an erase-before-write characteristic; thus, erase operations have to be issued on the blocks where new data will be written. Meanwhile, read and write operations occur on a page basis. Because of this imbalance between the read/write page size and the erase block size, flash memory needs a special software layer, the FTL (Flash Translation Layer). The primary role of the FTL is to map logical addresses, which are seen from the upper layer, to corresponding physical addresses, which are assigned

Figure 2.4: Software architecture for NAND flash memory

to the flash memory. Various FTL algorithms have been studied, trying to reduce the number of costly erase operations and valid page copies. To make free space in flash memory, a block has to be selected as a victim. After that, the valid pages in the block have to be copied into a new block. Lastly, the whole block is erased by issuing an erase command. This procedure is called a merge in the FTL. The merge operation sometimes takes a long time, up to a few seconds, to complete. The phenomenon of blocking I/O during the merge operation is called freezing in the storage market. The freezing phenomenon is highly related to the FTL scheme; therefore, an FTL should have an efficient algorithm to handle the merge operations.

One of the important issues of FTL design is victim selection. The FTL has to find a victim block as fast as possible and also make a small number of valid copies. Another important issue is the size of the mapping information. Most mobile and embedded computing devices have restricted resources such as memory and power. Thus, an FTL scheme with small mapping information is more suitable for these devices. However, FTL schemes with small mapping information tend to have poor performance when writing randomly [28][29][30], because the merge operation occurs frequently with small mapping information. Thus, flash-based storage that requires high-performance random writes should have enough RAM space to hold large mapping information or have a sophisticated scheme to maintain it.

There are three different kinds of FTL schemes according to the granularity of the mapping information and the mapping method: the page-mapping scheme, the block-mapping scheme, and the hybrid-mapping scheme. The page-mapping scheme (also called the sector-mapping scheme) [31][17] has fine-grained mapping information, which translates a logical sector address to a physical page address of flash memory. The page-mapping scheme is simple and fast. However, as flash storage capacity increases, the mapping information becomes too big to fit into the RAM cache, and the start-up time for reconstructing the mapping table becomes unbearable. Thus, a sophisticated method to maintain large mapping information is needed. Figure 2.5 describes the page-mapping scheme. In this figure, the mapping information is contained in the form of a table, called the mapping table. On the other hand, the block-mapping scheme [18][19] has coarse-grained mapping information, which translates a logical sector address to a physical block address of flash memory. Figure 2.6 describes the block-mapping scheme. Although the mapping information of the block-mapping scheme is very small, processing partial rewrites of already written blocks induces a lot of merge operations. The hybrid-mapping scheme [14][15][32][16] has more flexibility than the block-mapping scheme. It utilizes two-level mapping: block mapping, and page mapping within the resolved block. Figure 2.7 presents the hybrid-mapping scheme. It saves a lot of space for mapping information and obtains a reasonable performance gain.
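As a minimal illustration of the page-mapping idea, the following C sketch keeps a logical-to-physical mapping table and performs an out-of-place update: a write goes to a fresh physical page, the table entry is switched, and the old copy becomes garbage that a later merge operation would reclaim. The sizes, names, and the naive free-page allocator are assumptions for this sketch rather than any of the cited FTL schemes, and garbage collection, wear leveling, and persistence of the table are omitted.

#include <stdint.h>
#include <string.h>

#define NUM_LOGICAL_PAGES   4096u
#define NUM_PHYSICAL_PAGES  4608u   /* extra physical pages give update headroom */
#define INVALID_PAGE        0xFFFFFFFFu

static uint32_t l2p[NUM_LOGICAL_PAGES];     /* logical -> physical mapping table      */
static uint8_t  live[NUM_PHYSICAL_PAGES];   /* 1 if a physical page holds valid data  */
static uint32_t next_free = 0;              /* naive bump allocator for free pages    */

/* Hardware-dependent program operation; a stub in this sketch. */
static void nand_program(uint32_t phys_page, const void *data)
{
    (void)phys_page;
    (void)data;
}

void ftl_init(void)
{
    memset(l2p, 0xFF, sizeof(l2p));         /* every mapping starts out invalid       */
    memset(live, 0, sizeof(live));
}

/* Out-of-place update: program a fresh page, then switch the mapping. */
int ftl_write(uint32_t logical_page, const void *data)
{
    if (logical_page >= NUM_LOGICAL_PAGES || next_free >= NUM_PHYSICAL_PAGES)
        return -1;                          /* out of space: a merge would run here   */
    uint32_t new_phys = next_free++;
    nand_program(new_phys, data);
    if (l2p[logical_page] != INVALID_PAGE)
        live[l2p[logical_page]] = 0;        /* the old copy becomes garbage           */
    l2p[logical_page] = new_phys;
    live[new_phys] = 1;
    return 0;
}

A block-mapping scheme would keep one table entry per erase block instead of per page, and a hybrid scheme would combine the two granularities as described above.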

Figure 2.5: Page-mapping FTL scheme

Figure 2.6: Block-mapping FTL scheme

Figure 2.7: Hybrid-mapping FTL scheme

2.3.2 Flash filesystems

Flash filesystems such as JFFS2 [20], YAFFS [21], UBIFS [22] and LogFS [23] are designed for use on embedded NAND flash memory. Figure 2.4(b) shows a software layer with a flash filesystem. They are placed directly on the flash memory chips, rather than using an FTL that emulates a traditional disk drive. These filesystems are based on a log-structured concept like that of LFS [24], which appends new data at the end of previously written data. Because flash memory cannot perform in-place updates, the log-structured concept fits flash memory well, reducing the number of erase operations. The first generation of flash filesystems comprises JFFS2 and YAFFS, but they are not usable on flash memory larger than 512 MB [33]. The second generation comprises YAFFS2 and UBIFS, which are designed for large flash memory chips by making their data structures scalable.

2.3.3 Flash-specific cache management

A considerable number of studies on flash-aware cache replacement algorithms have been conducted. CRAW-C [34] has a concept similar to the ARC replacement algorithm [35]: it adjusts the cache size according to ghost page references. The major difference is that CRAW-C separates the cache space into three regions: a read area, a write area, and a compressed area. Each region has a ghost region that includes the information of pages previously evicted from that region. Initially, each region has a size proportional to its replacement cost. If a ghost page is referenced, the size of the corresponding region is increased proportionally. The major drawback of this algorithm is that it only considers the hit rates of the three regions. Increasing the hit rate is not directly related to reducing the replacement cost. Moreover, a large number of cache misses can occur in a region with a small replacement cost because of the proportional increase of a region

Figure 2.8: Example of CRAW-C algorithm

Figure 2.9: Example of LRU-WSR algorithm

with a large replacement cost. LRU-WSR [36] and CCF-LRU [37] keep dirty pages in the cache until they are no longer used. Because these algorithms retain all dirty pages, the cache space for clean pages remains small. As a result, the number of cache misses for clean pages shoots up when dirty pages take up most of the cache space.

There are a few studies [38][30] on buffer management. The main idea of these studies is to use a well-designed write buffering algorithm to reduce the number of erase operations on flash memory. The write buffer cache is located above the FTL, as depicted in Figure 2.10(a). It is aware of the structure of the underlying FTL and provides a block-based replacement algorithm rather than a sector-based one. It can flush all sectors in a victim block at a time, so the merge operation for the victim block becomes efficient, as shown in Figure 2.10(b). However, because the write buffer size is usually small, it does not absorb many repeated write requests. If the number of write requests is decreased, the time spent performing costly program and erase operations will be shortened. Thus, an efficient method is needed to reduce the number of write requests in the upper-level cache, which is usually larger than the write buffer cache.

2.4 Flash-based Solid-State Drives (SSDs)

SSDs are storage devices that use flash memory as their storage media. They have the standard host interfaces as well as the form factors of ordinary hard disks. Because they emulate the interface of hard disks, they can replace existing hard disks without any additional hardware or software support. As shown in Figure 2.11, an SSD has an embedded processor that manages the FTL and the external interface, along with a flash controller, which connects flash memory packages through a multi-channel bus. A flash

Figure 2.10: Write buffer management for flash memory

controller can issue a command through each channel independently. By increasing the number of bus channels, the performance of SSDs can be improved, so that cutting-edge SSDs now show sustained read throughput of up to 250 MB/s and write throughput of up to 200 MB/s.

2.4.1 Flash memory packages

Each flash package consists of multiple dies, as shown in Figure 2.12. Each die contains multiple planes, which have physical pages inside. Although each die is able to perform a read, write, or erase operation independently, all the planes on a die can only carry out the same type of operation at one time. A command that makes two planes work simultaneously is called a two-plane command, and a command that makes n planes work at the same time is called an n-plane command. Currently, the most widely used configuration is a die with two or four planes with the two-plane command. However, the number of integrated planes on a die is increasing due to improvements in design and fabrication technology. The increased number of planes will also inflate the importance of the effective use of n-plane commands for the performance of SSDs.

Every plane contains a register called the data register, which temporarily stores a page before a read or write command is issued. When a plane processes a write command, data is first transferred from the

Figure 2.11: SSD block diagram

Figure 2.12: Flash package organization

controller to the data register, which usually takes about 50 µs. After that, the data stored in the data register is written to the corresponding physical page, which takes about 250 µs. On the other hand, when reading a physical page, the data is read from the physical page into the data register, which takes about 25 µs, and then transferred from the data register to the controller, which again takes about 50 µs.

In each plane, there is another register called the cache register for pipelining consecutive commands. While processing a series of write commands, the cache register temporarily stores the data to write until the data register becomes available after the previous data has been written. On the other hand, when processing consecutive read commands, the cache register is used to store data temporarily before sending it to the flash controller: while a page is read from a physical page into the data register, the cache register transfers the previously read data that was sent from the data register. This command pipelining improves the performance significantly when consecutive commands are of the same type. However, when read and write commands are interleaved with each other, the pipeline does not bring any performance boost because the next command cannot start until both the data and cache registers become available after the previous command finishes.

2.4.2 Standard host interface

The host interfaces of recent SSDs provide outstanding request queues through the Native Command Queuing (NCQ) standard of the SATA II interface. Generally, SSDs with the NCQ feature can store up to 32 outstanding requests. For traditional hard disks, request queues are used to reduce disk head movement by reordering the outstanding requests. In comparison, methods for managing outstanding request queues in SSDs have not been actively researched so far.

2.4.3 Request rescheduling algorithm

There is no published request rescheduling algorithm for the SSDs on the market. However, Dirik and Jacob [39] presented the observation that a request scheduling policy that prioritizes read requests over writes achieves a faster average response time. The rationale behind the result was that the execution time of read requests is shorter than that of write requests.

2.4.4 Write request mapping

Shin et al. [40] examined various write mapping algorithms for FTLs. However, they mainly focused on static mapping, while we have explored the effect of dynamic mapping policies. Chang and Kuo [41] suggested a scheme to improve write performance by striping write requests dynamically. However, they did not consider the existence of the outstanding request queue and only prevented mapping to physical pages located in busy flash packages. This approach can be seen as a primitive form of dynamic write

mapping that is based on a one-depth request queue. Dirik and Jacob [39] used a simple write mapping algorithm to evaluate the performance of diverse configurations of parallelized channels and banks. It maps a write request to free packages in a round-robin manner based on the sequence number of the write request. Unlike their algorithm, however, our approach takes sub-package-level parallelism into consideration. In addition, our approach considers not only the sequence number but also the dynamic changes of load distribution due to the existence of pre-issued read and erase operations.

2.5 Motivation

Our work is motivated by an observation of current operating system policies and mechanisms. Most operating systems are customized for magnetic disk-based storage systems. Thus, their replacement policies are concerned only with the number of cache hits, and their I/O request rescheduling policies only consider the data access distance to lessen head positioning delay. However, operating systems using flash memory as a secondary storage device should consider the flash hardware characteristics, which differ from those of magnetic disk drives.

The replacement policies should take the different read and write costs of flash memory into account when they replace pages to reclaim free space. We try to reduce the number of costly write operations and potential erase operations as long as the resulting degradation of the cache hit rate does not harm performance. Mobile embedded computing devices can be one of the target applications for the new replacement policies. In these devices, the operating system has many more chances to optimize the I/O subsystem and adopt specialized features based on the characteristics of flash memory. By using kernel-level information effectively, the operating system can reduce the number of costly write and erase operations and, as a result, increase performance.

Also, the I/O request rescheduling policies should consider the internal hardware architecture of flash-based storage. A flash-based SSD is composed of concurrently operating components and has no mechanical parts. Thus, SSDs do not need to consider mechanical positioning delay when they reschedule I/O requests. Instead, increasing the chance of parallel execution is more effective. In the current SSD architecture, request mapping policies, which are not applicable to magnetic disk drives, should also be considered to increase internal parallelism, because an uneven request distribution may cause significant performance degradation. Thus, we take the parallelism level of storage components and the queue status of outstanding requests at the plane, die, and flash package level into account to boost SSD performance.
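As a purely illustrative sketch of this idea, and not the rescheduling algorithm developed in Chapter 4, the following C fragment reorders an outstanding request queue so that requests of the same type destined for the same plane become adjacent; such grouping lets the planes of a die operate together and keeps the per-plane command pipeline filled. The request fields and the NCQ-sized queue are assumptions for illustration.

#include <stddef.h>

enum req_type { REQ_READ = 0, REQ_WRITE = 1 };

struct request {
    enum req_type type;     /* read or write                         */
    unsigned int  plane;    /* target plane derived from the address */
    unsigned int  seq;      /* arrival order, used as a tie-breaker  */
};

/* Ordering: group by plane, then by type, then preserve arrival order,
 * so requests that can share an n-plane command or be pipelined on one
 * plane end up next to each other.                                    */
static int comes_before(const struct request *a, const struct request *b)
{
    if (a->plane != b->plane) return a->plane < b->plane;
    if (a->type  != b->type)  return a->type  < b->type;
    return a->seq < b->seq;
}

/* Insertion sort is sufficient for an NCQ-sized queue of up to 32 entries. */
void reschedule(struct request *q, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        struct request key = q[i];
        size_t j = i;
        while (j > 0 && comes_before(&key, &q[j - 1])) {
            q[j] = q[j - 1];
            j--;
        }
        q[j] = key;
    }
}

A real scheduler would additionally bound how far a request may be deferred to avoid starvation and would weigh the load already queued on each component, which is what the rescheduling and mapping policies of Chapter 4 address.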

Chapter 3. Flash-aware Cache Management

As embedded applications grow in number and size, more memory is needed. However, the use of energy-consuming memory, usually SDRAM, has become burdensome in battery-powered embedded computing devices. To mitigate the increasing memory requirement, the demand paging method can be used: the operating system copies pages from secondary storage to main memory only when those pages are requested. This chapter focuses on NAND flash-based demand paging in embedded computing systems. It analyzes the energy consumption of NAND flash-based demand paging and proposes a new replacement algorithm in this environment.

3.1 Demand paging for NAND flash memory

The first generation of flash-based storage architecture in mobile embedded devices utilized NOR flash and working memory like SRAM. However, as mobile embedded devices evolved toward data-centric applications, larger storage became necessary. The second generation of storage architecture began to use NAND flash memory for storing data. However, it cost more because of the increased number of system components [9]. The third generation utilizes only NAND flash memory and working memory. In this architecture, a code shadowing method is needed because NAND flash memory cannot execute application programs in place. The code shadowing method offers good performance since the code is executed in fast working memory like SDRAM. However, copying at boot time causes a slow booting time, and a large amount of working memory is required to hold both the OS kernel and the application code. Most of all, the high power consumption of working memory is a critical problem in battery-powered embedded devices. To utilize the limited memory space, the demand paging technique can be used. Demand paging allows application code to be executed by loading code and data on demand from secondary storage into working memory [5]. Figure 3.1 depicts the flash-based storage architectures.

The demand paging technique can be easily adopted in embedded computing devices because many recent embedded processors such as the ARM920T and PowerPC 440GX are equipped with a Memory Management Unit (MMU), which is the essential hardware for demand paging. In addition, it is necessary for the modern computing environment for the following reasons. First, users of mobile embedded devices such as PDAs or smartphones want to download and use many application programs through the Internet; thus, the amount of required working memory changes dynamically. Second, keeping all

Figure 3.1: Flash-based storage architecture for mobile embedded devices [9][5]

Component    Coefficient          Value
CPU [43]     P_CPU_active         411 (mW)
             P_CPU_idle           0.15 (mW)
SDRAM [44]   E_active_read        (mJ/4KB)
             E_active_write       (mJ/4KB)
             P_idle               45.1 (mW)
             P_refresh            6.4 (mW)
             P_retention          1.8 (mW)
NAND [45]    E_flash_read         9.44 (uJ/4KB)
             E_flash_write        (uJ/4KB)
             E_flash_erase        (uJ/4KB)
             P_flash_retention    0.13 (mW)

Table 3.1: Power and energy coefficient values [5]

Workload   CPU utilization   Scenario
gqview     0.57%             Performs a slide show of seven images and adjusts their sizes.
kword      1.64%             Edits several lines in a text document.
kspread    2.62%             Calculates sum, average, minimum, and maximum of numerical data, and sorts them.
acrobat    4.31%             Views a PDF document.
mozilla    0.77%             Browses several web sites such as a shopping mall, news center, mail server, etc.

Table 3.2: Workloads used in experiment

applications in the working memory is not efficient because the applications are not all active at the same time. In this section, we examine two storage architectures that use only NAND flash memory: code shadowing and demand paging.

3.1.1 Impact on energy consumption

With the demand paging method, the size of the working memory (SDRAM) can be reduced. Since memory retention energy is proportional to the size of the working memory, demand paging saves retention energy in proportion to the reduction in working memory. Meanwhile, it needs to spend a certain amount of energy accessing NAND flash memory to read and write pages. The retention energy of NAND flash memory is negligible in comparison with SDRAM, as shown in Table 3.1. The memory retention energy depends on the total execution time of applications, while the energy of accessing NAND flash memory depends on the number of NAND flash references made while the CPU is active. Intuitively, applications with low CPU utilization benefit greatly from demand paging. The CPU utilization is the ratio of active CPU cycles to total CPU cycles. For most interactive embedded applications, it is reported that over 95% of the time is spent waiting for user input [42]. We actually ran five interactive applications on Linux and obtained similar results. Table 3.2 presents the CPU utilization of the five interactive applications.

Energy Consumption Model

The total energy consumed by a task, E_task, is the sum of the energies consumed by system components such as the processor, the bus, and the memory. If t is the elapsed time, the total energy consumption is given below.

E_task(t) = E_CPU(t) + E_mem(t) + E_bus(t)    (3.1)

The energy consumption of the CPU, E_CPU, is divided into active and inactive states depending on whether the CPU is executing or idle. If t_active and t_inactive are the elapsed times in the active and inactive states, and P_CPU_active and P_CPU_idle are the active and idle mode powers of the CPU, the energy consumption of the CPU is calculated as below.

E_CPU(t) = t_active × P_CPU_active + t_inactive × P_CPU_idle    (3.2)

For the working memory, the numbers of read and write accesses to the memory, N_read_access and N_write_access, are directly proportional to the energy consumption. The per-access energy consumptions, E_active_read and E_active_write, can be estimated from the active power specified in the datasheet [44]. Specifically, SDRAM consumes idle power, P_idle, and refresh power, P_refresh, during CPU execution. If there is no memory access, the memory stays in the power-down state consuming only the retention power, P_retention.

E_mem(t) = E_active_read × N_read_access + E_active_write × N_write_access + t_active × (P_idle + P_refresh) + t_inactive × P_retention    (3.3)

For NAND flash memory, the numbers of flash read and write accesses are N_flash_read and N_flash_write, and the energy consumptions of each access are E_flash_read and E_flash_write. If there is no flash memory access during the flash idle time, t_flash_idle, the flash memory stays in the power-down state consuming only the retention power, P_flash_retention.

E_flash(t) = E_flash_read × N_flash_read + E_flash_write × N_flash_write + t_flash_idle × P_flash_retention    (3.4)

In the case of the bus model, the energy is a function of the voltage swing on the switched lines, V_dd, the capacitance of one interconnect line and the pins attached to it, C_switch, the number of bus accesses,

N_access, the bus width, N_width, and the bit change rate, R_change. The pin capacitance values are obtained from the datasheets [43][44][45]. Bus energy is consumed when the value of a line changes from 0 to 1 or from 1 to 0. The bit change rate is determined as the average number of changed bits per 32-bit bus width through program execution profiling.

E_bus(t) = R_change × N_width × N_access × C_switch × V_dd^2    (3.5)

3.1.2 Evaluation

Evaluation Methodology

We investigate the impact of demand paging on energy consumption through trace-driven simulation. We assume that the CPU is an Intel PXA processor [43], and the parameters for SDRAM and NAND are obtained from [44] and [45], respectively. Table 3.1 shows the coefficient values used to determine the total energy consumption according to the model presented above. Two memory configurations are considered in this experiment. We assume that the system consists of two 32 MB SDRAM modules and one NAND flash memory module for code shadowing, while only one 32 MB SDRAM module is used for demand paging. It is also assumed that the Linux operating system and X Windows consume about 16 MB; thus, the memory size that applications can freely use is 48 MB for code shadowing and 16 MB for demand paging. The memory reference trace of each application is gathered using Valgrind [46] on a Linux/x86 platform under the same scenarios shown in Table 3.2. Valgrind is a tool for dynamic binary analysis, and its cache profiler, Cachegrind, has been slightly modified to generate instruction and data reference traces.

Experimental Results

Figure 3.2 compares the total energy consumption of shadowing and demand paging. On average, the CPU is responsible for about 15% of the total energy consumption, while the SDRAM modules account for most of the rest in the case of shadowing. The portion of the bus is less than 1% in both cases. It is not surprising to see that the energy consumption by the CPU and SDRAM accesses does not change much even if we use demand paging. Ideally, demand paging should reduce the energy required for SDRAM retention by half because it uses only half of the SDRAM. However, this is not the case in Figure 3.2 because the total execution time is slightly increased due to page faults. Demand paging also consumes some additional energy in NAND flash memory for reading and writing pages.
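To make the model of Equations (3.1) through (3.5) concrete, the following C sketch computes the energy of a hypothetical task from its per-component terms. Coefficients taken from Table 3.1 are used where available; the values marked as placeholders, and the task parameters in main, are assumptions for illustration only, and the bus term of Equation (3.5) is omitted since it contributes less than 1% of the total in the measurements reported above.

#include <stdio.h>

/* Coefficients from Table 3.1 where available; others are placeholders. */
#define P_CPU_ACTIVE   0.411     /* W                                    */
#define P_CPU_IDLE     0.00015   /* W                                    */
#define E_ACT_READ     1.0e-6    /* J per 4 KB SDRAM read  (placeholder) */
#define E_ACT_WRITE    1.0e-6    /* J per 4 KB SDRAM write (placeholder) */
#define P_IDLE         0.0451    /* W                                    */
#define P_REFRESH      0.0064    /* W                                    */
#define P_RETENTION    0.0018    /* W                                    */
#define E_FLASH_READ   9.44e-6   /* J per 4 KB flash read                */
#define E_FLASH_WRITE  4.0e-5    /* J per 4 KB flash write (placeholder) */
#define P_FLASH_RET    0.00013   /* W                                    */

static double e_cpu(double t_act, double t_idle)                        /* Eq. (3.2) */
{
    return t_act * P_CPU_ACTIVE + t_idle * P_CPU_IDLE;
}

static double e_mem(double t_act, double t_idle, long n_rd, long n_wr)  /* Eq. (3.3) */
{
    return E_ACT_READ * n_rd + E_ACT_WRITE * n_wr
         + t_act * (P_IDLE + P_REFRESH) + t_idle * P_RETENTION;
}

static double e_flash(double t_flash_idle, long n_rd, long n_wr)        /* Eq. (3.4) */
{
    return E_FLASH_READ * n_rd + E_FLASH_WRITE * n_wr
         + t_flash_idle * P_FLASH_RET;
}

int main(void)
{
    /* Hypothetical interactive task: 5 s active, 95 s idle, with some paging. */
    double t_act = 5.0, t_idle = 95.0;
    double total = e_cpu(t_act, t_idle)                    /* Eq. (3.1), bus term dropped */
                 + e_mem(t_act, t_idle, 200000, 80000)
                 + e_flash(t_act + t_idle, 3000, 500);     /* flash idle time approximated
                                                              by the whole task time      */
    printf("estimated task energy: %.2f J\n", total);
    return 0;
}

Because the retention terms grow with the amount of working memory and the flash terms grow with the number of page transfers, a mostly idle task benefits from the smaller SDRAM of the demand paging configuration, which is the trend shown in Figure 3.2.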

Figure 3.2: Energy consumption breakdown for code shadowing and demand paging

3.2 Flash memory friendly cache replacement algorithm: CFLRU

Operating systems with the demand paging method need a cache replacement algorithm because not all application pages can fit into the limited memory space. Traditional operating systems and their cache replacement algorithms have been optimized in various ways assuming that the secondary storage consists of magnetic disks. Unfortunately, flash memory exhibits different characteristics compared to magnetic disks. Most notably, flash memory does not have a seek time, and its data cannot be overwritten before being erased. In addition, flash memory has asymmetric read and write operation characteristics in terms of performance and energy consumption. Therefore, it is necessary to revisit operating system policies and to optimize them for flash memory-based secondary storage.

Most operating systems use an approximated LRU (Least Recently Used) algorithm as their replacement policy. However, operating systems with flash memory need to adopt a new replacement policy which considers not only the cache hit rate but also the replacement cost. The replacement cost involves the elapsed time and consumed energy for evicting a victim page and for fetching a new page. The cost occurs when a victim page has to be written to the flash memory because the victim was modified. This requires the cost of a flash write operation and, even worse, it may require the cost of a flash erase operation and many valid page copies. A valid page copy means that a page in an erase block being reclaimed still holds valid data, so it has to be copied into a new block before the block is erased. As mentioned in Chapter 2, flash write and erase operations are more expensive than flash reads. Therefore, the new replacement policy should decrease the number of costly flash write and erase operations, while avoiding an escalation of cache misses, which leads to a large number of flash read operations.

We propose a new replacement policy, called Clean-First LRU (CFLRU), which takes into consideration the imbalance of read and write costs of flash memory when replacing pages. The basic idea

behind CFLRU is to deliberately keep a certain number of dirty pages in the page cache to reduce the number of flash write operations, while preventing the overall performance from being significantly affected by the degraded cache hit rate.

3.2.1 Algorithm overview

When cache replacement occurs, two kinds of replacement costs are involved. One is generated when a requested page is fetched from secondary storage to the page cache in RAM. Using Belady's MIN algorithm [47], this cost can be minimized by selecting a victim that has the largest forward distance in the future references. Among online algorithms, LRU has been commonly used as a replacement algorithm because it exploits the property of locality in references. The other cost is generated when a page is evicted from the page cache to secondary storage. This cost can be eliminated by selecting a clean page for eviction. A clean page contains the same copy as the original data in secondary storage; thus, the clean page can simply be dropped from the page cache when it is evicted by the replacement policy.

Optimizing for only one kind of replacement cost brings its own benefit but, in the long term, affects the other kind of replacement cost, and vice versa. For example, a replacement policy might decide to keep as many dirty pages in the cache as possible to save the write cost on flash memory. However, by doing this, the cache will run out of space, and consequently the number of cache misses will increase. This increases the replacement cost of reading requested pages from flash memory. On the other hand, a replacement policy that focuses mainly on increasing the cache hit count will evict dirty pages, which will increase the replacement cost of writing evicted pages into flash memory. Thus, a sophisticated scheme that balances both efforts is needed to minimize the total cost.

The CFLRU (Clean-First LRU) algorithm is a modification of the LRU algorithm. It divides the LRU list into two regions to find a minimal cost point, as shown in Figure 3.3. The working region consists of recently used pages, and most cache hits are generated in this region. The clean-first region consists of pages which are candidates for eviction. CFLRU first selects a clean page to evict from the clean-first region in order to save flash write cost. If there is no clean page in this region, a dirty page at the end of the LRU list is evicted. For example, under the LRU replacement algorithm, the last page in the LRU list is always evicted first; thus, the priority for being a victim page is in the order of P8, P7, P6, and P5 in Figure 3.3. However, under the CFLRU replacement algorithm, it is in the order of P7, P5, P8, and P6.

3.2.2 Window size for minimal replacement cost

The size of the clean-first region is called the window size, w. Finding the right window size of the clean-first region is important to minimize the total replacement cost. A large window size will increase the cache miss rate, and a small window size will increase the number of evicted dirty pages, that is, the

Figure 3.3: Example of CFLRU algorithm

number of flash write operations. The flash write operations can also cause a large number of the costly erase operations. Therefore, the window size of the clean-first region needs to be decided properly in order to minimize the overall replacement cost.

Let us assume that C_W is the cost of a flash write operation and C_R is the cost of a flash read operation. If N_D denotes the number of dirty pages that should have been evicted in the LRU order but are kept in the cache, the gain of the CFLRU algorithm is calculated as below.

G_CFLRU = C_W × N_D    (3.6)

If N_C is the number of clean pages that are evicted instead of dirty pages within the clean-first region, and P_i(k) is the probability of the future reference of a clean page i, which is evicted at the k-th position, the cost of the CFLRU algorithm, C_CFLRU, can be calculated as below.

C_CFLRU = C_R × Σ_{i=1}^{N_C} P_i(k)    (3.7)

Figure 3.4 shows the probability of the future reference at each position of the LRU list. In this graph, the x-axis indicates the position of a page in the LRU list: the left end of the x-axis is the most recently used page, and the right end is the least recently used page. The y-axis indicates the probability of future reference at each position. For example, if page i at the k-th position in the LRU list is selected for eviction, it is likely to be referenced in the future and fetched into the cache again with probability P_i(k). From what has been discussed above, the benefit of CFLRU is calculated as below.

B_w = max_{window size = 1..N} (G_CFLRU - C_CFLRU)    (3.8)

In this equation, N is the total number of pages in the cache, and B_w is the maximum benefit of CFLRU when the window size is w. However, in the real world, it is not easy to find the probability

Figure 3.4: Probability of future reference per page

of the future reference at each position of the LRU list. Thus, we investigate the proper window size of the clean-first region with statically defined parameters and also devise a method to adjust the window size dynamically. The static method initially fixes the window size at an average value that performed well in repetitive experiments over a predetermined application set. However, the static method cannot dynamically adjust the window size to various cache sizes and different application sets. The dynamic method can properly adjust the window size based on periodically collected information about flash read and write operations. If W and R represent the ratios of write and read operations for a given time period, respectively, the change in the cost W × C_W + R × C_R between adjacent periods can control whether to enlarge or reduce the window size. In our experiments, the dynamic algorithm performs well with the various cache states imposed by different application sets.

3.2.3 Simulation study

Before implementing the CFLRU replacement algorithm in the Linux kernel, we performed a simulation study. The objective of this study is to compare the performance of pure LRU with the CFLRU algorithm. In addition, we simulated the offline algorithm, Belady's MIN. It is known to be the optimal algorithm that maximizes the cache hit rate, but it is not a cost-optimal algorithm.

Simulation Methodology

We gathered virtual memory reference traces using a profiling tool called Valgrind [46], as explained in Section 3.1. We obtained instruction and data reference traces through Cachegrind. We chose five different application programs and executed them on a Linux/x86 machine according to the scenarios shown in Table 3.2. The characteristics of the workload traces used in this simulation are described in Table 3.3.

Workload         Memory used (MB)   Total references   Instruction Read      Data Read             Data Write
Gqview                              28,940,881         896,355 (3.1 %)       11,437,768 (39.5 %)   16,606,758 (57.4 %)
Kword                               4,779,386          1,025,217 (21.4 %)    2,837,916 (59.4 %)    916,253 (19.2 %)
Kspread                             11,366,261         1,515,521 (13.3 %)    6,264,476 (55.1 %)    3,586,264 (31.6 %)
Acrobat reader                      10,815,848         2,062,732 (19.1 %)    2,256,161 (20.9 %)    6,496,955 (60.0 %)
Mozilla                             41,533,372         20,243,704 (48.7 %)   14,893,383 (35.9 %)   6,396,285 (15.4 %)

Table 3.3: Workload characteristics used in simulation

In the simulations, we assume that the system has 32MB of SDRAM and 128MB of flash memory for swap space. It is also assumed that the operating system and the X Window System consume 16MB of SDRAM, so each application can freely use about 16MB of the remaining SDRAM. Because the Valgrind profiling tool can gather virtual memory reference traces for only one application at a time, the cache is dedicated to one application.

Simulation Results

Figure 3.5 shows the number of flash read and write operations for four algorithms: Belady's offline MIN algorithm and the online LRU, CFLRU-static, and CFLRU-dynamic algorithms. It is clear that Belady's MIN algorithm is not optimal with flash memory because it does not consider the flash read and write costs. However, it provides a useful baseline by minimizing the miss count. With the CFLRU algorithms, the number of read operations increases and the number of write operations decreases in comparison with the LRU algorithm. This is because CFLRU tries to keep dirty pages instead of clean pages.

Figure 3.6 shows the estimated time delay to execute the five applications. This estimation is based on the counts of flash read and write operations shown in Figure 3.5 and the times for flash read, write, and erase operations given in the NAND flash datasheet [45]. For the CFLRU-static algorithm, we present the lowest cost among six different configurations of the window size, of the form S/x, where S is the cache size and x varies from 1 to 6. The CFLRU algorithms outperform LRU for every application. Compared with the LRU algorithm, the average time delays of CFLRU-static and CFLRU-dynamic are reduced by 11.8% and 12.0%, respectively. Gqview and Acrobat reader gain little from the CFLRU algorithm because they issue a large number of write references in comparison with read references, so their caches include only a small number of clean pages. In this case, the CFLRU algorithm fails to evict clean pages and operates like LRU. However, the other applications show better performance by 10 to 25%. Sometimes, the dynamic algorithm suffers a small performance degradation compared with the static algorithm.

This is due to a wrong decision that may happen when the replacement cost computed to predict future events does not match the events that actually occur. However, CFLRU-dynamic still delivers reasonable performance, and it has the advantage that we do not have to reconfigure the window size whenever the cache size or the application set changes.

Implementation on Linux

We have implemented the CFLRU replacement algorithm based on the Linux kernel 2.4. In this section, we briefly describe the replacement policy of the Linux kernel, and then explain our modifications for the CFLRU replacement algorithm and some optimization issues.

Original Linux Page Reclamation

The Linux kernel 2.4 has a page cache that consists of two pseudo-LRU lists, the active list and the inactive list, as shown in Figure 3.7 [48]. The pages in the active list have been accessed recently, while the pages in the inactive list have not. When the kernel decides to make free space, it starts the reclaiming phase. First, it scans the pages in the inactive list. A priority value decides the scanning range of pages in the inactive list; the priority is increased from 1/6 to 1/5, and so on up to 1, until enough pages (usually 32) have been freed. Second, if there are too many process-mapped pages, the kernel starts the swap-out phase. Page reclamation in the swap-out phase is performed in a round-robin fashion in the Linux kernel 2.4, starting the scan from the memory that was checked last. In this phase, pages that are in the active list or have been accessed recently are skipped.

CFLRU Implementation

To implement the CFLRU replacement algorithm in the Linux kernel, we insert an additional reclamation function that chooses clean victims first, before the original reclamation phases start. Similar to the original Linux kernel, the additional reclamation function consists of two phases. In the first phase, clean pages in the inactive list are evicted first until enough pages have been freed. In the original Linux kernel, dirty pages become ready to be written back when they are selected as victims, but under our CFLRU, dirty pages are simply skipped. In the second phase, clean pages that belong to the process region are swapped out. As mentioned above, reclamation in the swap-out phase is not based on LRU in the Linux kernel 2.4. In the inactive list of the Linux kernel, the concept of priority matches naturally with the window size of the clean-first region in CFLRU. We first configure the priority value statically, as a default. However, the window size that yields the minimal replacement cost depends on the specific reference patterns of applications. To adjust the window size properly, the priority value can be changed dynamically.
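The following is a minimal, self-contained sketch of the first clean-first phase described above. It is not the kernel code: struct page, the free_page callback, and the scan_limit parameter are simplified stand-ins for the kernel's inactive list, its page-freeing path, and the priority-controlled scanning range.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a page on the inactive list (illustrative only). */
struct page {
    struct page *next;   /* next page toward the MRU end of the inactive list */
    bool dirty;
};

/* Clean-first pass: starting from the LRU end of the inactive list, scan at
 * most scan_limit pages (the limit corresponds to the priority-controlled
 * window: 1/6, 1/5, ..., up to the whole list) and reclaim only clean pages,
 * skipping dirty ones, until target pages have been freed.
 * Returns the number of pages actually freed. */
static size_t reclaim_clean_first(struct page *lru_end, size_t scan_limit,
                                  size_t target,
                                  void (*free_page)(struct page *))
{
    size_t freed = 0, scanned = 0;
    struct page *p = lru_end;

    while (p != NULL && scanned < scan_limit && freed < target) {
        struct page *next = p->next;   /* remember before reclaiming */
        scanned++;
        if (!p->dirty) {               /* dirty pages are simply skipped */
            free_page(p);              /* clean page: reclaim it */
            freed++;
        }
        p = next;
    }
    return freed;
}
```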

[Figure 3.5: The number of flash operations of MIN, LRU, CFLRU-static, and CFLRU-dynamic]

[Figure 3.6: Estimated time of MIN, LRU, CFLRU-static, and CFLRU-dynamic]

[Figure 3.7: The LRU lists in the Linux kernel: the active_list and the inactive_list; the priority (1/6, 1/5, ..., 1) determines the scanning range of the inactive list from which a victim page is selected.]

A kernel daemon periodically checks the replacement cost and compares it with the replacement cost of the previous period to decide whether to increase or decrease the priority. To avoid oscillation of the priority value, the difference between the current cost and the last cost has to exceed a predetermined threshold.
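A minimal sketch of this periodic adjustment is shown below. The step size, the priority bounds, and the direction in which the window moves are assumptions made for illustration; the design only requires that an adjustment is applied when the period-to-period cost difference exceeds the threshold.

```c
#include <math.h>

/* State for the periodic window (priority) adjustment. */
struct cflru_ctl {
    double last_cost;   /* replacement cost measured in the previous period */
    double threshold;   /* minimum change that triggers an adjustment */
    int    priority;    /* current window setting (scanning range) */
    int    prio_min;
    int    prio_max;
};

/* cur_cost is the replacement cost of the current period, e.g.
 * reads * C_R + writes * C_W collected since the last invocation. */
static void adjust_priority(struct cflru_ctl *ctl, double cur_cost)
{
    double diff = cur_cost - ctl->last_cost;

    if (fabs(diff) > ctl->threshold) {
        if (diff > 0 && ctl->priority < ctl->prio_max)
            ctl->priority++;   /* cost rose: enlarge the clean-first window */
        else if (diff < 0 && ctl->priority > ctl->prio_min)
            ctl->priority--;   /* cost fell: shrink back toward plain LRU */
    }
    ctl->last_cost = cur_cost;
}
```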

Optimizations for Flash Memory

The I/O subsystem of the Linux kernel is optimized for disk-based storage. The disk scheduling policy and sequential read-ahead are examples of optimizations that reduce the seek time of disk-based storage. However, these disk optimizations do not help improve the performance of a system with flash memory; they may even decrease the performance because of the time wasted performing them. In particular, read-ahead hurts the cache hit rate when most of the read-ahead pages are never accessed in the near future, because read-ahead pages take cache space away from actively used pages. In our experience, sequentially read-ahead pages achieve a low hit rate in many cases. The random access time of flash memory is faster than that of a magnetic disk because flash memory has no mechanical parts to search for data. We therefore remove the sequential read-ahead function in the Linux kernel, and this results in a large improvement in the cache hit rate.

Implementation results

The CFLRU replacement algorithm is evaluated on a system with a Pentium IV processor and 32MB of SDRAM running the Linux kernel 2.4. We evaluate the performance of the replacement algorithms per application, so the choice of a 32MB cache size is appropriate for the cache replacement algorithm. The system has been emulated to have 64MB and 256MB of flash memory for the swap space and for the file system (ext2), respectively. The emulated flash memory has the same latency as real NAND flash memory, as in [1]. We compare four Linux kernel implementations: the plain Linux kernel, the Linux kernel without swap read-ahead (no-readahead), the Linux kernel with the CFLRU static algorithm (CFLRU-static), and the Linux kernel with the CFLRU dynamic algorithm (CFLRU-dynamic). The kernels with the CFLRU static and dynamic algorithms do not use the read-ahead method either. The window size of the CFLRU static algorithm is 1/4 of the inactive list. To measure the performance of the four methods, we chose five applications: gcc, tar, diff, encoding, and file system benchmarks. These applications are not generally used in mobile devices, but they are suitable for measuring the time delay.

Figure 3.8 shows the weighted cost calculated from the numbers of flash read and write operations. As mentioned before, the flash memory is partitioned into two regions: one for the swap system and the other for the file system. A flash write is weighted eight times higher than a flash read, based on Table 2.1. The no-readahead method reduces the overall read count compared with the plain Linux kernel. This is because not all read-ahead pages are hit in the cache. From this result, we expect that simply turning off the read-ahead function can yield better performance on computing devices with flash storage. With the CFLRU methods, the overall flash read count is slightly increased while the flash write count is decreased, so the sum of the weighted counts is reduced. This is because CFLRU tries to evict clean pages and hold dirty pages.

Figure 3.9 shows the measured time and expected energy. The expected energy is calculated from the byte read and byte write counts from/to flash memory. The no-readahead method is 2.4% faster than the plain kernel, and its energy saving is 4.4%. The average time delays of the CFLRU-static and CFLRU-dynamic methods are reduced by 6.2% and 5.7%, respectively, while the expected energy savings of CFLRU-static and CFLRU-dynamic are 11.4% and 12.1%, respectively.
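For reference, the two metrics reported here can be expressed as the following minimal sketch. The 8:1 write-to-read weight is the one stated above, while the per-byte energy values are left as input parameters rather than hard-coded datasheet numbers.

```c
/* Evaluation metrics used in Figures 3.8 and 3.9 (illustrative sketch). */
#define WRITE_WEIGHT 8.0   /* a flash write is weighted 8x a flash read */

static double weighted_flash_cost(unsigned long read_ops, unsigned long write_ops)
{
    /* metric plotted in Figure 3.8 */
    return (double)read_ops + WRITE_WEIGHT * (double)write_ops;
}

static double expected_energy(unsigned long bytes_read, unsigned long bytes_written,
                              double energy_per_byte_read,     /* datasheet value */
                              double energy_per_byte_written)  /* datasheet value */
{
    /* metric plotted in Figure 3.9 */
    return (double)bytes_read    * energy_per_byte_read
         + (double)bytes_written * energy_per_byte_written;
}
```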

[Figure 3.8: Weighted flash read and write counts for each application]

[Figure 3.9: Measured time and expected energy for each application]

3.3 Enhanced cache replacement algorithm: CBLRU

Strategies that decide how many dirty pages should be kept in the page cache and when to flush them to secondary storage devices are crucial for flash-aware cache replacement algorithms. The major
