An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems


Seunggu Ji and Dongkun Shin, Member, IEEE

Abstract — As more consumer electronics adopt monolithic kernels, NAND flash memory is used for the swap space in virtual memory systems. While flash memory has the advantages of low power consumption, shock-resistance and non-volatility, it requires garbage collection due to its erase-before-write characteristic. The efficiency of the garbage collection scheme largely affects the performance of flash memory. This paper proposes a novel garbage collection technique which exploits the data redundancy between the main memory and the flash memory in flash memory-based virtual memory systems. Compared to the previous approach, our proposed scheme takes into consideration the locality of data to minimize the garbage collection overhead. In addition, by considering the computational overhead of the garbage collection algorithm, we also propose an adaptive scheme which can minimize the computational overhead with marginal I/O performance degradation. Experimental results demonstrate that the proposed garbage collection scheme improves performance by 37% on average compared to previous schemes.

Index Terms — NAND flash memory, Flash Translation Layer (FTL), Garbage Collection, Virtual Memory, Buffer Management.

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0010387). S. Ji and D. Shin are with the School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea (e-mail: dongkun@skku.edu).

I. INTRODUCTION

NAND flash memory is widely used in constructing storage units for consumer electronics such as cellular phones, digital cameras and portable media players because of its merits of low power consumption, high random access performance and high shock-resistance. Flash memory is a good device for use as swap space in virtual memory systems as well as for file and code storage due to its low access cost [1, 2, 3]. Compared with hard disk drives, flash memory can reduce the page swapping cost significantly. As more consumer electronics adopt monolithic kernels such as embedded Linux, flash memory-based virtual memory systems will become more popular. However, most research on flash memory has focused on flash file systems, with only a few studies on flash memory-based virtual memory systems.

The characteristics of flash memory are quite different from those of hard disk drives. A flash memory chip is composed of several blocks and each block consists of multiple pages. For example, in a large block multi-level-cell (MLC) NAND flash memory, one block is composed of 128 pages of 4 KB each. Flash memory supports three commands: read, program (write) and erase. While the units for the read and program commands are pages, the unit for the erase command is a block. A flash memory page cannot be overwritten if it has already been programmed, and the corresponding block should be erased before new data is written to the page. This feature is called the erase-before-write constraint. Therefore, most flash storage systems write the updated data to other non-programmed pages, invalidating the old pages. This requires an address mapping scheme which translates the logical address used in the operating system to the physical address used in the flash memory.
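To make the out-of-place update concrete, the following minimal C sketch (illustrative only, not from the paper; ftl_write, l2p and the geometry constants are assumed names and values) redirects every update of a logical page to a clean physical page, invalidates the old copy, and records the new location in a logical-to-physical mapping table:

#include <stdint.h>
#include <string.h>

#define PAGES_PER_BLOCK 128                 /* assumed large-block MLC geometry   */
#define NUM_BLOCKS      1024
#define NUM_PAGES       (PAGES_PER_BLOCK * NUM_BLOCKS)
#define INVALID         UINT32_MAX

static uint32_t l2p[NUM_PAGES];             /* logical page -> physical page map  */
static uint8_t  page_valid[NUM_PAGES];      /* validity flags of physical pages   */
static uint32_t next_clean = 0;             /* next never-programmed page         */

void ftl_init(void)
{
    memset(l2p, 0xFF, sizeof l2p);          /* every entry becomes INVALID        */
}

/* Out-of-place update: a programmed page is never overwritten in place.
 * The new data goes to a clean page; the stale copy is only invalidated
 * and must be reclaimed later by garbage collection (not shown here).    */
void ftl_write(uint32_t lpn, const void *data)
{
    uint32_t old_ppn = l2p[lpn];
    if (old_ppn != INVALID)
        page_valid[old_ppn] = 0;            /* invalidate the old physical page   */

    uint32_t new_ppn = next_clean++;        /* assume a clean page is available   */
    /* flash_program(new_ppn, data);           device program command (assumed)   */
    (void)data;
    page_valid[new_ppn] = 1;
    l2p[lpn] = new_ppn;                     /* remap the logical page             */
}

Garbage collection, discussed next, is what keeps such a scheme supplied with clean pages.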
In order to handle these special features, a software layer called the flash translation layer (FTL) is usually used between the file system and the flash memory [4, 5, 6]. The FTL has two main functions. The first is address mapping, which can be divided into three categories depending on the granularity: block-level, page-level and hybrid mapping. The second function of the FTL is garbage collection (GC), which reclaims the flash pages that have been invalidated by update operations. GC has three steps, i.e., victim block selection, valid page migration and victim block erase. Victim block selection identifies the victim block that will invoke the lowest GC cost, i.e., the one with the smallest number of page migrations. Valid page migration moves the valid pages from the victim block to other clean blocks. The last step erases the victim block for future write requests. Garbage collection invokes significant overhead since it requires a large number of page migrations and block erasures. Therefore, an efficient GC scheme is essential for high-performance flash memory storage systems.

There have been many studies on garbage collection in flash memory storage; however, only a few works have focused on GC for flash memory-based virtual memory systems. When flash memory is used as swap space, the GC should exploit the data redundancy between the main memory and the flash memory in order to eliminate unnecessary page copying. When a virtual memory page is swapped in, the page exists in both the main memory and the flash storage. Since this page will be written back to the flash memory when it is swapped out the next time, there is no need to copy such duplicated pages during GC, as is done in duplication-aware garbage collection (DA-GC) [7]. DA-GC targets page-level address mapping FTLs and thus shows good performance with page-level mapping. We found that DA-GC cannot display its merits for hybrid mapping FTLs without considering locality information.
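As a rough illustration of the duplication-aware idea of DA-GC [7] (a sketch under assumptions, not the authors' implementation; page_cache_lookup, migrate_page and the struct are hypothetical names), a valid page that also resides in the page cache is not copied during valid page migration; its cached copy is merely marked dirty so that it will be written to flash when it is swapped out later:

#include <stdbool.h>
#include <stdint.h>

struct cached_page { bool dirty; };

/* Assumed page-cache interface: returns the cached copy of a swapped-in
 * logical page, or NULL if the page exists only in flash.                */
struct cached_page *page_cache_lookup(uint32_t lpn);

void migrate_page(uint32_t victim_ppn);      /* normal valid-page copy     */

/* Duplication-aware migration of one valid page of a victim block.       */
void da_gc_migrate(uint32_t lpn, uint32_t victim_ppn)
{
    struct cached_page *c = page_cache_lookup(lpn);
    if (c != NULL) {
        /* Duplicated page: skip the flash copy; the DRAM copy becomes the
         * only copy, so it must be written back when it is evicted.       */
        c->dirty = true;
    } else {
        migrate_page(victim_ppn);            /* non-duplicated page: copy  */
    }
}

Whether this skip actually pays off depends on how likely the cached copy is to be rewritten anyway, which is the question the rest of the paper addresses.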

To address this shortcoming of DA-GC, we propose locality and duplication-aware garbage collection (LDA-GC) algorithms for flash memory-based virtual memory systems, consisting of the locality and duplication-aware victim block selection technique (LDA-VBS) and the locality and duplication-aware block merge technique (LDA-BM). These techniques significantly reduce the GC overhead in the hybrid mapping FTL by considering the update probability of duplicated data. Experiments using a trace-driven simulator show that the proposed techniques can improve the overall flash I/O performance, on average, by 37% compared to that of the existing duplication-unaware garbage collection (DU-GC) scheme for virtual memory benchmarks.

The remainder of the paper is organized as follows: Section 2 provides a survey of the relevant literature on flash memory management techniques. Section 3 presents the motivations of this paper. The detailed descriptions of the LDA-VBS and LDA-BM techniques are provided in Section 4. Section 5 presents the performance evaluation results. Finally, Section 6 concludes the paper.

II. RELATED WORKS

Most previous studies on flash memory have focused on address mapping schemes. Block-level mapping [4] maintains the translation information between the logical block address and the physical block address; therefore, the offsets of a page are the same within both the logical block and the physical block. In page-level mapping [5], a logical page address is translated into a physical page address. Due to the independent management of pages, page-level mapping is more efficient than block-level mapping, but it requires a large memory space for the mapping table. Hybrid mapping [6, 8, 9] uses both page-level mapping and block-level mapping and reserves a portion of the flash blocks as a log buffer. Hence, hybrid mapping FTLs are called log buffer-based FTLs. Blocks in the log buffer are called log blocks. The normal data blocks use block-level mapping, while the log blocks use page-level mapping. All write requests are first sent to the log buffer; if there is no free space in the log buffer, then valid data in a victim log block are moved into data blocks to make free space. The hybrid mapping FTL technique can yield high performance with a small mapping table. Therefore, most FTLs employ the hybrid mapping technique.

There are several studies on log buffer-based FTLs. The block-associative sector translation (BAST) scheme [6] associates a log block with only one data block; that is, when any page of a data block is updated, the new data should be written to the associated log block. GC is invoked when there are no clean pages in the associated log block or when no log block can be associated with the target data block; this occurs frequently for random writes. The GC selects one of the log blocks and moves all valid pages of the log block and its associated data block to a clean block. The log block and the data block are then erased and are exploited as new log blocks. A drawback of the BAST scheme is its frequent GCs for random write patterns. To solve this problem, the fully-associative sector translation (FAST) scheme was proposed [8], in which one log block can be associated with multiple data blocks. Therefore, frequent GC invocations for random write requests can be prevented. However, the FAST scheme has a large GC cost once garbage collection is invoked because it moves many valid pages in several data blocks that are associated with the victim log block.
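For reference, the write path of a log buffer-based (FAST-style) hybrid mapping FTL can be sketched as follows (a simplified sketch; log_buffer_full, merge_victim_log_block and the other helpers are assumed names, and error handling and the page-level lookup in the log blocks are omitted):

#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 64

struct log_block {
    uint32_t lpn[PAGES_PER_BLOCK];   /* page-level map of the log block    */
    int      used;                   /* number of programmed pages         */
};

bool log_buffer_full(void);                 /* no clean log pages remain   */
struct log_block *current_log_block(void);  /* assumed to have a clean page */
void merge_victim_log_block(void);          /* GC: select + merge + erase  */
void program_log_page(struct log_block *b, int off, uint32_t lpn,
                      const void *data);

/* FAST-style write: every update is appended to the log buffer; the data */
/* block keeps its block-level mapping until a block merge reclaims it.   */
void hybrid_ftl_write(uint32_t lpn, const void *data)
{
    if (log_buffer_full())
        merge_victim_log_block();     /* triggers the costly merge path    */

    struct log_block *lb = current_log_block();
    program_log_page(lb, lb->used, lpn, data);
    lb->lpn[lb->used] = lpn;          /* old copy in the data block is now stale */
    lb->used++;
}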
Generally, flash memory storage systems have a buffer cache to hide the long latency of flash memory. Buffer cache management is important for achieving high performance since the I/O requests on flash memory change depending on the buffer cache management technique. There are several flash-aware buffer management schemes, including FAB [10], CFLRU [11] and BPLRU [12]. However, these techniques do not consider duplicated pages. Lee et al. [13] proposed a buffer-aware garbage collection (BA-GC) technique which exploits duplicated pages that are written in both the buffer cache and the flash memory. During garbage collection, the duplicated dirty pages are evicted into the flash memory to eliminate unnecessary page migrations. Li et al. [7] proposed the duplication-aware garbage collection (DA-GC) technique for flash memory-based virtual memory systems. DA-GC does not move the duplicated pages in the flash memory during valid page migration. Therefore, these pages are removed from the flash memory after GC erases the victim blocks; however, they remain in the main memory. Since the target of the DA-GC technique is the swap space of virtual memory systems, there is no critical consistency problem even if the duplicated pages are lost from the main memory due to a sudden power failure. The duplicated pages, which remain only in the main memory after GC, are written to the flash memory when they are swapped out. Although DA-GC can reduce the garbage collection overhead of page-level mapping FTLs, it may increase the garbage collection overhead of hybrid mapping FTLs because it generates more write requests on the log buffer and thus invokes frequent GCs in hybrid mapping. Our proposed techniques are based on the DA-GC scheme. However, our locality-aware approaches solve the problem of DA-GC in hybrid mapping FTLs.

III. MOTIVATION

In this section, we introduce the DA-GC technique and its problem in hybrid mapping, where the log blocks are used as the write buffer for data blocks. Fig. 1 shows an example of duplication-unaware garbage collection (DU-GC) in the FAST hybrid mapping. The page cache has six pages that are sorted by their access recencies. Page P2 is the least-recently-used (LRU) page, and page P9 is the most-recently-used (MRU) page. Pages P2 and P11 are dirty (i.e., the page cache and the flash memory have different data), and the remaining pages are clean. The flash memory consists of seven physical blocks whose physical block numbers (PBNs) are 0-6; PBN 0, PBN 1 and PBN 2 are allocated for data blocks, and PBN 3 and PBN 4 are allocated for log blocks. Since the data blocks are managed using block-level mapping, all pages are written at the specified page offsets within the data block.

Fig. 1 An example of duplication-unaware garbage collection.

PBN 5 and PBN 6 are free blocks reserved for garbage collection. We assume that each flash block is composed of four pages. For the sequence of write requests on the logical pages (P1, P3, P8, P10, P4, P5, P4, P4), the log blocks contain these pages, and the corresponding pages in the data blocks are invalidated. Each log block can be associated with multiple data blocks. For example, PBN 3 in the log buffer is associated with data blocks PBN 0 and PBN 2, and PBN 4 is associated with only PBN 1. Garbage collection should be invoked when there is no free space in the log buffer. If PBN 3 is selected as a victim block, then the GC copies all valid pages in PBN 3 and its associated data blocks (PBN 0 and PBN 2) into the free blocks, PBN 5 and PBN 6. The valid pages P0-P3 and P8-P11 are copied into PBN 5 and PBN 6, respectively. Since the victim log block and its associated data blocks are merged into free blocks, this step is called the block merge. After the block merge operation is completed, PBN 5 and PBN 6 are changed into data blocks, and PBN 0, PBN 2 and PBN 3 are erased, with one of them being allocated as a new log block.

DU-GC does not consider the duplicated pages in the page cache. If we have information on the page cache, a more efficient GC can be implemented by considering the pages duplicated in both the page cache and the flash memory. For many embedded devices such as mobile phones, the page cache and the FTL can share their information because both of them are executed on the same processor. If the page cache manager can notify the FTL of the dirty page information, unnecessary page migrations can be prevented. Fig. 2 shows the duplication-aware garbage collection (DA-GC) scheme [7]. When GC selects PBN 3 as a victim block, it does not copy the duplicated pages P1, P2, P9, and P11 since they are also contained in the page cache. Therefore, the number of page migrations is reduced by half. Instead, P1 and P9 are changed into dirty states in the page cache since they should be written to the flash memory when they are evicted from the page cache.

Fig. 2 An example of duplication-aware garbage collection.

Although the DA-GC scheme significantly reduces the block merge cost, more pages will be sent to the flash memory from the page cache since all of the duplicated clean pages are changed into dirty pages. The increased number of page evictions in the DA-GC scheme has no adverse effect on garbage collection in page-level mapping since the pages can be written at any location in a block. Therefore, DA-GC is an effective technique in page-level mapping. However, DA-GC may invoke frequent garbage collections with hybrid mapping. For instance, as shown in Fig. 2, four physical pages in PBN 5 and PBN 6 are not utilized in DA-GC. Instead, when pages P1 and P2 are evicted from the page cache due to page replacement, they should be written to the log blocks. As a result, the log blocks consume free space more quickly. If these pages remain clean until they are evicted from the page cache, i.e., garbage collections or host requests do not make these pages dirty, then they will not be written to the flash memory. In particular, since page P1 has not recently been used, there is little possibility for the page to be updated (and consequently changed to dirty) before it is evicted from the page cache. Therefore, it is better to copy page P1 during the block merge and leave it clean. However, since page P9 is the MRU page, it is likely to become dirty even though DA-GC does not change its state.
Therefore, even if page P9 is excluded from page migrations by DA-GC, there may be no benefit with regard to the GC cost. Consequently, the duplication-aware scheme should be applied selectively, considering the localities of the duplicated pages. To solve the problem of DA-GC in hybrid mapping, we propose a locality-aware victim block selection technique, called LDA-VBS, and a locality-aware block merge technique, called LDA-BM, for DA-GC. These techniques divide the page cache into two regions, the LRU and MRU regions, and use different policies for each region. The LDA-VBS technique selects the victim log block that invokes a small number of state changes for the duplicated clean pages in the LRU region of the page cache. The LDA-BM technique determines whether to copy each duplicated page during a block merge based on the locality of the corresponding page in the page cache. In addition, we propose the LRU dirty page eviction (LDE) technique, which forces dirty pages in the LRU region of the page cache to be evicted during garbage collection in order to reduce unnecessary page migrations.

The proposed techniques can prevent frequent garbage collections in hybrid mapping while exploiting the advantage of DA-GC, namely the reduction of unnecessary copies of duplicated pages.

IV. LOCALITY-AWARE GARBAGE COLLECTION

A. Locality and Duplication-Aware Victim Block Selection

General victim block selection algorithms consider only the block merge cost when selecting a victim block. However, in order to prevent duplicated clean pages in the LRU region of the page cache from being changed into dirty pages, we should consider not only the merge cost but also the potential loss resulting from the increase in write requests under DA-GC. The proposed LDA-VBS technique optimizes both the garbage collection overhead and the potential loss. Under the DA-GC scheme, we can represent the garbage collection overhead, C_GC(L_i), for a victim log block L_i as follows:

    C_GC(L_i) = (A(L_i) + 1) · C_e + δ(L_i) · (C_r + C_w),    (1)

where A(L_i) and δ(L_i) denote the number of data blocks associated with L_i and the number of non-duplicated (i.e., existing only in the flash memory) valid pages in L_i or its associated data blocks, respectively. For example, in Fig. 2, A(PBN 3) is 2 and δ(PBN 3) is 4. C_e, C_w and C_r represent the timing costs of block erase, page write and page read in the flash memory, respectively. Only δ(L_i) flash page reads and writes are required since DA-GC does not copy the duplicated pages during the block merge. After the block merge is completed, A(L_i) data blocks and one log block are erased; therefore, A(L_i) + 1 block erases are required. However, as explained in Section 3, DA-GC changes the duplicated clean pages in the page cache into dirty pages, invoking more write requests from the page cache to the flash memory. Therefore, DA-GC has a potential loss as follows:

    C_loss(L_i) = β(L_i) · (C_w + α),    (2)

where β(L_i) represents the number of duplicated pages of the log block L_i whose corresponding pages in the page cache are changed from clean into dirty by DA-GC and are not updated further by following host requests until they are evicted. In Fig. 2, two clean pages, P9 and P1, are changed into dirty pages by DA-GC. However, since the MRU page P9 has a high possibility of being changed to dirty by host requests, the value of β(L_i) will be less than 2. The cost of writing the dirty pages into the flash memory is β(L_i) · C_w. In addition, the write requests invoke more garbage collections. We therefore add the overhead cost α, which represents the average block merge cost per one dirty page write. The approximate value of α is C_r + C_w + C_e/N_page, because a dirty page written to the flash memory invokes one page read/write for page migration and one block erase per N_page pages, where N_page is the total number of flash pages in a flash block.

However, it is impossible to know the exact value of β(L_i) during GC without knowledge of future host requests. To predict this value, we use the 3-region LRU cache [13], in which the page cache is divided into three regions, an MRU region, an LRU region and an initial region, each of which provides the update probability of a page in the region. By dynamically adjusting the size of each region based on the transition rates between the three regions, the 3-region LRU buffer identifies the page cache access pattern. The update probability of each duplicated page in the page cache can then be determined from the region containing the page.
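A minimal sketch of how Equations (1) and (2) could be evaluated for a candidate victim log block is given below (illustrative only; the struct fields, the cost constants and, in particular, the use of the duplicated clean pages in the LRU region as an estimate of β(L_i) are assumptions, the last one reflecting the update probabilities provided by the 3-region LRU cache [13]):

/* Assumed flash timing costs in microseconds (cf. Section 5.A). */
#define C_R      25.0      /* page read             */
#define C_W     200.0      /* page write (program)  */
#define C_E    2000.0      /* block erase           */
#define N_PAGE    64       /* pages per flash block */

struct victim_info {
    int assoc_data_blocks;  /* A(Li): data blocks associated with log block Li */
    int nondup_valid;       /* delta(Li): valid pages existing only in flash   */
    int dup_clean_lru;      /* duplicated clean pages cached in the LRU region */
};

/* Eq. (1): merge cost when duplicated pages are not copied (DA-GC).       */
double c_gc(const struct victim_info *v)
{
    return (v->assoc_data_blocks + 1) * C_E
         + v->nondup_valid * (C_R + C_W);
}

/* Eq. (2): potential loss from the clean->dirty transitions forced by
 * DA-GC. beta(Li) is approximated here by the duplicated clean pages in
 * the LRU region, i.e. the pages unlikely to be updated again before
 * eviction; alpha is the amortized merge cost per extra dirty-page write. */
double c_loss(const struct victim_info *v)
{
    double alpha = C_R + C_W + C_E / N_PAGE;
    double beta  = (double)v->dup_clean_lru;  /* assumed estimate of beta(Li) */
    return beta * (C_W + alpha);
}

A victim would then be chosen as the log block minimizing the sum of the two terms, which is formalized as Equation (3) below.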
To consider both the garbage collection overhead and the potential loss, the overall garbage collection cost can be represented as follows:

    C_total(L_i) = C_GC(L_i) + C_loss(L_i)    (3)

The LDA-VBS technique selects the victim block with the lowest value of C_total(L_i) in order to prevent DA-GC from invoking a large potential loss.

B. Locality and Duplication-Aware Block Merge

Since the pages in the page cache have different probabilities of being updated, the LDA-BM technique uses different policies depending on the future access probability of each page. The clean pages in the MRU region of the page cache have high probabilities of being changed to dirty before they are evicted from the page cache, even though DA-GC does not change their states. On the contrary, the clean pages in the LRU region of the page cache have low possibilities of being changed to dirty by host requests. Therefore, it may be beneficial not to apply the DA-GC technique to the clean pages in the LRU region, i.e., not to change the clean data in the LRU region into dirty data. The page migration cost of GC then increases compared to that of DA-GC; however, we can reduce the frequency of GCs, which may invoke a large overhead. (The initial region of the 3-region cache is regarded as being included in the MRU region for simple implementation.)

Fig. 3 shows the proposed LDA-BM technique. Page P1 in the flash memory is copied during the block merge operation and P1 in the page cache remains clean, since the page is in the LRU region of the page cache. However, the clean page in the MRU region, P9, and the dirty pages P2 and P11 are not copied during the block merge. Even though page P9 is changed to dirty, the potential loss due to the change will be small since it has a high possibility of being changed to dirty by future host requests anyway.

For the LDA-BM technique, the victim block selection policy should be modified. We can represent the garbage collection overhead, C_GC(L_i), for a victim log block L_i as follows:

    C_GC(L_i) = (A(L_i) + 1) · C_e + (δ(L_i) + γ(L_i)) · (C_r + C_w),    (4)

where γ(L_i) denotes the number of duplicated pages of the log block L_i whose corresponding pages in the page cache are duplicated clean pages in the LRU region. Compared to the DA-GC cost in Equation (1), LDA-BM has a larger GC cost since it requires more flash page migrations. However, LDA-BM reduces the potential loss of DA-GC since it does not change the clean pages in the LRU region into dirty pages.

Therefore, the potential loss of LDA-BM is as follows:

    C_loss(L_i) = β_MRU(L_i) · (C_w + α),    (5)

where β_MRU(L_i) is the number of duplicated pages of the log block L_i whose corresponding pages in the page cache are clean pages in the MRU region and are changed from clean into dirty by DA-GC. This potential loss is smaller than that in Equation (2) since β(L_i) is larger than β_MRU(L_i).

Fig. 3 An example of LDA-BM.

C. LRU Dirty Page Eviction

The LRU dirty page eviction (LDE) technique exploits the duplicated dirty data in the LRU region of the page cache in order to reduce the GC cost. It is better to move the duplicated dirty pages in the LRU region of the page cache into the flash memory during garbage collection because these pages have high probabilities of being evicted to the flash memory without further updates. We can then utilize the data blocks efficiently by reducing the amount of flash memory space left unutilized by DA-GC. The copied dirty pages in the page cache are changed to clean. For example, if we use the LDE technique for the case in Fig. 3, page P2 is copied from the page cache to the flash block PBN 5 and is changed into clean in the page cache. Then, when page P2 is evicted from the page cache, there is no write request to the log blocks of the flash memory.

We can simultaneously use both the LDA-BM and LDE techniques during the block merge operation to apply different policies to the duplicated pages in the LRU region of the page cache. While LDA-BM is applied to the duplicated clean pages, LDE is applied to the duplicated dirty pages. By using the two techniques in the LRU region of the page cache, we can reduce the potential garbage collection overhead invoked by DA-GC. When both the LDA-BM and LDE techniques are used, the garbage collection overhead, C_GC(L_i), for a victim log block L_i is calculated as follows:

    C_GC(L_i) = (A(L_i) + 1) · C_e + (δ(L_i) + γ(L_i)) · (C_r + C_w) + ε(L_i) · (C_b + C_w),    (6)

where ε(L_i) denotes the number of duplicated pages of the log block L_i whose corresponding pages in the page cache are duplicated dirty pages in the LRU region, and C_b represents the transfer cost of a page from the page cache to the flash memory. We assume that C_b is larger than C_r due to the bus transaction. Compared to the GC costs in Equations (1) and (4), using both the LDA-BM and LDE techniques invokes a larger GC cost since ε(L_i) pages should be copied from the page cache to the flash memory. However, the LDE technique has a potential benefit. Since LDE changes the duplicated dirty pages in the LRU region of the page cache into clean pages, the number of write requests to the log blocks is reduced. Therefore, the total GC cost is as follows:

    C_total(L_i) = C_GC(L_i) + C_loss(L_i) - C_benefit(L_i),    (7)

where C_loss(L_i) = β_MRU(L_i) · (C_w + α) and C_benefit(L_i) = ε_LRU(L_i) · (C_w + α). In this equation, ε_LRU(L_i) represents the number of duplicated pages of the log block L_i whose corresponding pages in the page cache are dirty pages in the LRU region and are changed into clean by LDE without being updated by following host requests.

TABLE I
THE STATE CHANGES OF A DUPLICATED PAGE UNDER DIFFERENT SCHEMES

  cache area  | state before GC | count       | after DA-GC / LDA-VBS | after LDA-BM | after LDA-BM/LDE
  MRU region  | Dirty           | -           | Dirty                 | Dirty        | Dirty
  MRU region  | Clean           | β_MRU(L_i)  | Dirty                 | Dirty        | Dirty
  LRU region  | Dirty           | ε(L_i)      | Dirty                 | Dirty        | Clean
  LRU region  | Clean           | γ(L_i)      | Dirty                 | Clean        | Clean

Table I summarizes the state changes of a duplicated page under each scheme.
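The per-page decision summarized in Table I could be coded roughly as below (an illustrative sketch, not the authors' implementation; the enum, the struct and the copy_from_flash/copy_from_cache helpers are assumed names):

#include <stdbool.h>
#include <stdint.h>

enum region { REGION_MRU, REGION_LRU };  /* initial region folded into MRU */

struct cached_page {
    bool        dirty;
    enum region region;
};

void copy_from_flash(uint32_t src_ppn, uint32_t dst_ppn);      /* normal migration */
void copy_from_cache(struct cached_page *c, uint32_t dst_ppn); /* LDE write-out    */

/* Handle one valid page of the victim during a block merge under          */
/* LDA-BM + LDE. `c` is the duplicated copy in the page cache, or NULL.    */
void lda_merge_page(struct cached_page *c, uint32_t src_ppn, uint32_t dst_ppn)
{
    if (c == NULL) {                      /* not duplicated: migrate normally   */
        copy_from_flash(src_ppn, dst_ppn);
    } else if (c->region == REGION_MRU) { /* likely to be updated again soon    */
        c->dirty = true;                  /* skip the copy, as in DA-GC         */
    } else if (c->dirty) {                /* LRU + dirty: LDE writes it out now */
        copy_from_cache(c, dst_ppn);
        c->dirty = false;
    } else {                              /* LRU + clean: LDA-BM keeps it clean */
        copy_from_flash(src_ppn, dst_ppn);
    }
}

Only the last two branches differ from plain DA-GC, which would skip the copy and mark the cached page dirty in every duplicated case.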
DA-GC and LDA-VBS have the lowest GC costs but the largest potential losses on future GC costs, since they change all of the duplicated clean pages in the page cache into dirty pages. LDA-BM has a higher GC cost than DA-GC but a lower potential loss because it does not change the states of the duplicated clean pages in the LRU region. Using LDE in addition to LDA-BM, the GC cost increases further; however, there is a potential benefit since the duplicated dirty pages in the LRU region are changed into clean pages.

D. Adaptive LDA-VBS

Even though the proposed LDA-VBS technique can significantly reduce the flash memory I/O cost, it invokes a high computational overhead to identify the duplicated pages at every garbage collection. The victim block selection algorithm should determine whether each page in a log block has a duplicated page in the page cache. In particular, the computational overhead increases in proportion to the number of log blocks in the log buffer. Since the garbage collection overhead is generally reduced as the size of the log buffer increases, high-performance flash storage systems prefer a large log space. Consequently, the high complexity of LDA-VBS could be a burden for such systems. To overcome this problem, only a portion of the log buffer can be examined when choosing a victim log block. Our approach is to use a victim window which includes the k log blocks closest to the LRU position in the log buffer, and to inspect only the log blocks within the victim window instead of examining all log blocks. The log blocks near the LRU position tend to have a relatively low block merge cost because they are likely to have a small number of valid pages. Therefore, this approach can reduce the computational overhead without significant damage to the performance of LDA-VBS.

Fig. 4 Victim window adaptation: the window size is changed randomly to k+ or k-, and restored to k when growing yields no large gain (prev_gc_cost × 0.8 < cur_gc_cost) or shrinking causes a large loss (prev_gc_cost × 1.2 < cur_gc_cost).

There is a trade-off between the I/O performance and the computational overhead, i.e., with a large victim window a better victim log block can be selected, but the computational overhead is significantly higher. Therefore, it is important to choose a proper victim window size k during the victim selection step. Since the optimal value of k depends on the workload pattern, we adjust it by observing the garbage collection cost. As shown in Fig. 4, we change the victim window size k into k+ or k- and then observe the change in GC cost. If the current GC cost (cur_gc_cost) with the victim window size k+ is not reduced by more than 20% compared to the previous GC cost (prev_gc_cost), we restore the victim window size to k. On the other hand, if the current GC cost with the victim window size k- is increased by more than 20% compared to the previous GC cost, the victim window size is also restored to k. Using this adaptation algorithm, the smallest victim window size whose garbage collection cost differs only marginally from that of the unlimited victim window can be determined, as sketched below.
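The window adaptation of Fig. 4 could be sketched as follows (an illustrative sketch in C; the ±20% thresholds follow the text, while the step size, the bounds and the random choice of direction are assumptions):

#include <stdlib.h>

#define K_MIN  1
#define K_MAX  128
#define K_STEP 1            /* assumed step size                           */

static int    k = 8;        /* current victim window size (assumed start)  */
static int    last_delta;   /* +K_STEP or -K_STEP tried in the last round  */
static double prev_gc_cost = -1.0;

/* Called after each garbage collection with the cost it just incurred.    */
void adapt_victim_window(double cur_gc_cost)
{
    if (prev_gc_cost >= 0.0) {
        if (last_delta > 0 && cur_gc_cost > prev_gc_cost * 0.8)
            k -= last_delta;    /* no large gain from growing: restore k    */
        else if (last_delta < 0 && cur_gc_cost > prev_gc_cost * 1.2)
            k -= last_delta;    /* large loss from shrinking: restore k     */
    }
    prev_gc_cost = cur_gc_cost;

    /* randomly try growing or shrinking the window for the next GC round  */
    last_delta = (rand() & 1) ? K_STEP : -K_STEP;
    k += last_delta;
    if (k < K_MIN) { k = K_MIN; last_delta = 0; }
    if (k > K_MAX) { k = K_MAX; last_delta = 0; }
}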

V. EXPERIMENTS

A. Experimental Environments

We implemented a trace-driven simulator in order to evaluate the performance of the proposed schemes. The simulator consists of a page cache simulator and a storage simulator. The page cache is managed by the 3-region LRU algorithm [13] to divide it into the LRU and MRU regions. We used five real virtual memory traces collected with the Valgrind toolset, which were captured while executing several applications, acrobat, gqview, kword, mozilla and office, on a Linux system. The flash memory model used in the simulation was based on a Samsung SLC large block NAND flash memory [14], in which each flash block is composed of 64 pages and each page is 2 KB. The timing delays of page read, page write and block erase (C_r, C_w and C_e) are 25 µs, 200 µs and 2 ms, respectively.

The seven schemes shown in Table II were compared. Each scheme uses different victim block selection and block merge techniques. All schemes used the FAST hybrid mapping FTL [8]. We assumed that the normal victim block selection algorithm was the round-robin (RR) selection policy, which selects the oldest log block as the victim. Since the oldest block generally has a small number of valid pages, the RR policy invokes a small GC cost and thus it is a reasonable solution.

TABLE II
A SUMMARY OF THE EVALUATED SCHEMES

  scheme    | victim block selection | block merge | LDE
  DU-GC     | RR                     | DU-BM       | No
  DA-GC     | RR                     | DA-BM       | No
  LDA-GC 1  | LDA-VBS                | DA-BM       | No
  LDA-GC 2  | RR                     | LDA-BM      | No
  LDA-GC 3  | LDA-VBS                | LDA-BM      | No
  LDA-GC 4  | RR                     | LDA-BM      | Use
  LDA-GC 5  | LDA-VBS                | LDA-BM      | Use
  (RR: round-robin policy)

B. Performance Comparison

Fig. 5 presents the total I/O execution times of the examined GC schemes normalized to those of DU-GC. The I/O execution times include the flash read, write and erase costs invoked by garbage collection as well as by page swap-out. The page cache size is 4 MB and the flash memory has 32 log blocks.
The performance of DA-GC was similar or inferior to that of DU-GC because the potential loss of DA-GC was larger than the GC cost reduction resulting from not copying the duplicated pages. By comparing the results of LDA-GC 1 and LDA-GC 2, it can be seen that the LDA-VBS technique (which has a performance improvement of 10% on average) is more effective than the LDA-BM technique (which has a performance improvement of 6% on average) because LDA-VBS significantly reduces the garbage collection overhead as well as the potential loss. The LDA-GC 3 scheme, which uses both LDA-VBS and LDA-BM, showed more significant performance improvements (by 24% on average) due to the synergetic effect of the two techniques. The LDA-GC 4 scheme, which uses both LDA-BM and LDE, improved the performance by 28% on average. The LDA-GC 5 scheme, which uses all of the proposed techniques, reduced the I/O execution times by 37% on average compared to that of DU-GC.

To analyze the performance differences, we observed the behavior of each garbage collection scheme. Fig. 6 shows the number of dirty page evictions from the page cache under each GC scheme normalized to that of DU-GC. Since the DA-GC scheme changes the duplicated clean pages of the page cache into dirty pages, it increases the number of page writes by 27% on average. The LDA-GC 1 scheme invokes a smaller number of page evictions since it selects the victim block with a low potential loss resulting from the increase in page evictions. However, it still invokes more page evictions than DU-GC. Using the LDA-BM technique, the dirty page evictions were reduced to a level similar to that of DU-GC, as shown in the results of LDA-GC 2 and LDA-GC 3. The LDA-GC 4 and LDA-GC 5 schemes, which use both the LDA-BM and LDE techniques, showed 14-18% fewer page evictions than DU-GC.

Fig. 5 Total I/O execution times normalized to that of DU-GC.
Fig. 6 Total number of page evictions from the page cache (normalized to that of DU-GC).
Fig. 7 Total number of GC invocations (normalized to that of DU-GC).
Fig. 8 Average number of page migrations per garbage collection (normalized to that of DU-GC).

The increased number of page evictions invokes frequent garbage collections for the log buffer of the flash memory. Fig. 7 shows the number of garbage collections invoked during benchmark executions under the proposed GC schemes. These values were normalized to those of DU-GC. While DA-GC, LDA-GC 1, LDA-GC 2 and LDA-GC 3 invoked larger numbers of GCs than did DU-GC due to the increased number of page evictions, LDA-GC 4 and LDA-GC 5, which used the LDE technique, outperformed DU-GC by about 4-5% due to the potential benefit shown in Equation (7). However, since all of the proposed LDA-GC schemes showed performance improvements over DU-GC according to the results in Fig. 5, it can be inferred that the proposed LDA-GC schemes require smaller costs per GC invocation than does DU-GC.

Fig. 8 shows the number of page migrations during block merge operations. The DA-GC scheme required a smaller number of page migrations compared to that of the DU-GC scheme since it does not copy the duplicated pages during block merge operations. Since the LDA-VBS technique selects the victim log block considering the page migration cost, LDA-GC 1, LDA-GC 3, and LDA-GC 5 showed smaller numbers of page migrations compared to those of DA-GC, LDA-GC 2, and LDA-GC 4, respectively. The number of page migrations was increased slightly by the LDA-BM technique (as shown in the results of LDA-GC 2) since the technique copies the duplicated pages whose corresponding pages in the page cache are clean pages in the LRU region.

The LDE technique copies the dirty duplicated pages in the LRU region of the page cache into the flash memory during block merge operations. Therefore, we can expect that the page migrations will be increased further by the LDE technique. However, the LDA-GC 4 scheme, which uses both the LDA-BM and LDE techniques, showed a smaller number of page migrations compared to that of DA-GC. This is because the LDE technique reduced the number of full merge operations, as shown in Table IV, which shows the number of block merges according to their type, normalized with respect to that of DU-GC. The cost of a full merge is higher than those of a switch merge and a partial merge because a full merge requires a large number of erase operations and page migrations [6]. Therefore, it is important to reduce the number of full merges in order to minimize the garbage collection cost. While DA-GC increased the numbers of all types of merges, LDA-GC 3, LDA-GC 4 and LDA-GC 5 reduced the number of full merge operations because they mitigated the randomness of write requests on the log buffer by reducing dirty page evictions from the page cache.

Fig. 9 shows the average number of erase operations per garbage collection. The GC schemes invoking fewer full merge operations generated fewer erase operations. From the results of Fig. 7 and Fig. 9, we can see that the LDA-GC 3, LDA-GC 4 and LDA-GC 5 schemes can prolong the lifespan of flash memory, which has a program/erase cycle limit, since they consume fewer program/erase cycles per garbage collection and invoke fewer garbage collections.

The reduced page migrations and erase operations of the proposed techniques affect the average cost per garbage collection, as shown in Fig. 10. All GC schemes provided smaller average GC costs than did DU-GC. Since the LDA-GC 2 scheme invoked more page migrations, as shown in Fig. 8, and fewer erase operations, as shown in Fig. 9, it has a GC cost similar to that of DA-GC. The average GC costs of LDA-GC 1, LDA-GC 3, and LDA-GC 5, which use the LDA-VBS technique, were lower than those of the other schemes since LDA-VBS considers the GC cost of the victim log block. The LDA-GC 5 scheme achieved the best performance since it had the lowest average GC cost, as shown in Fig. 10, and the smallest number of GC invocations, as shown in Fig. 7.

Fig. 9 Average number of erase operations per garbage collection (normalized to that of DU-GC).
Fig. 10 Average garbage collection cost (normalized to that of DU-GC).

C. The Effects of Page Cache Size and Log Buffer Size

We also evaluated the effect of the page cache size. Fig. 11 illustrates the total I/O execution time and the average number of duplicated pages excluded from page migrations (N_dup) for the kword workload while varying the page cache size from 1 MB to 16 MB. The number of log blocks was fixed at 32. The performance improved as the page cache size increased because the hit ratio of the page cache increased. LDA-GC 3 and LDA-GC 5 showed better performance than did DU-GC, regardless of the page cache size. Moreover, as the page cache size increased, the performance gaps between the DU-GC and LDA-GC schemes increased since the number of duplicated pages increased. When there are a large number of duplicated pages, the proposed schemes have more chances to reduce the garbage collection overhead.

Fig. 11 I/O execution times and average numbers of duplicated pages when varying the page cache size (kword workload).

We also evaluated the effect of the flash log buffer size. Fig. 12 shows the I/O execution time and the average number of duplicated pages while varying the number of log blocks in the flash memory from 8 to 128. The page cache size was fixed at 4 MB. As the number of log blocks increased, the execution times were reduced. When there are many log blocks, a long time is required for a log block to be selected as a victim block. Therefore, when a log block is selected as a victim block by the garbage collection, most of the pages in the victim block may be invalid, and thus the GC invokes a small page migration cost. The performance gaps between DU-GC and LDA-GC increased as the number of log blocks increased. This is because LDA-VBS can identify a better victim block when there are many log blocks available.

Fig. 12 I/O execution times and average numbers of duplicated pages when varying the number of log blocks (kword workload).

The average number of duplicated pages increased as the number of log blocks increased since LDA-VBS, which selects the log block with many duplicated pages, has more victim candidates. However, the value reached its peak when the number of log blocks was 32. If there are too many log blocks, the victim block has a small number of valid pages, and thus the number of duplicated pages decreases.

D. Adaptive Victim Block Selection

Fig. 13 shows the performance changes in LDA-GC 5 while varying the log block victim window for victim block selection.
As the size of the victim window increased, performance improved since LDA-VBS can select a better victim log block among more candidates. However, there were only marginal changes in performance when the victim window was large (32-128). The computational overhead for finding duplicated pages decreases but the flash memory I/O cost increases as the victim window decreases. Therefore, it is important to select the smallest victim window for which the performance is not significantly degraded compared with that of the largest victim window. For example, the optimal victim window size for the kword trace is 64, since there is no large difference in performance when the victim window is larger than 64.

The adaptive LDA-VBS, explained in Section 4.4, can determine this optimal point, as shown in Fig. 14. Two LDA-VBS schemes using static victim windows of size 1 and 128, respectively, were compared with the adaptive LDA-VBS using dynamic victim windows. We used two different values for the initial victim window size in the adaptive LDA-VBS scheme. The adaptive LDA-VBS schemes adjust their victim window sizes by observing the garbage collection cost. As a result, their performance was similar to that of the static scheme with the largest victim window.

Fig. 13 I/O execution time comparison when varying the victim window size.
Fig. 14 I/O performance of the adaptive VBS scheme.

VI. CONCLUSIONS

Flash memory is a good device for use as swap space in virtual memory systems. For flash memory-based virtual memory systems, locality and duplication-aware garbage collection techniques are proposed to reduce the garbage collection overhead by removing duplicated pages from the flash memory.
In order to solve the potential loss problem of the previous duplication-aware garbage collection technique in hybrid mapping FTLs, the proposed LDA-VBS technique considers both the garbage collection overhead and the potential loss. The LDA-BM and LRU dirty page eviction techniques selectively apply duplication-aware page migration depending on the locality of each page in the page cache. The experimental results showed an average improvement of 37% compared to DU-GC.

REFERENCES

[1] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim, "Energy-aware demand paging on NAND flash-based embedded storages," in Proc. of ISLPED '04, pages 338-343, 2004.
[2] Y. Joo, Y. Cho, C. Park, S. W. Chung, E. Chung, and N. Chang, "Demand paging for OneNAND flash execute-in-place," in Proc. of CODES+ISSS '06, pages 229-234, 2006.
[3] J. In, I. Shin, and H. Kim, "SWL: a search-while-load demand paging scheme with NAND flash memory," in Proc. of LCTES '07, pages 217-226, 2007.
[4] A. Ban, "Flash file system optimized for page-mode flash technologies," United States Patent, No. 5,937,425, 1999.
[5] A. Ban, "Flash file system," United States Patent, No. 5,404,485, 1995.
[6] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho, "A space-efficient flash translation layer for compact flash systems," IEEE Trans. on Consumer Electronics, 48(2):366-375, 2002.
[7] H.-L. Li, C.-L. Yang, and H.-W. Tseng, "Energy-aware flash memory management in virtual memory system," IEEE Trans. VLSI, 16(8):952-964, 2008.
[8] S.-W. Lee, D.-J. Park, T.-S. Chung, D.-H. Lee, S. Park, and H.-J. Song, "A log buffer-based flash translation layer using fully-associative sector translation," ACM Trans. on Embedded Computing Systems, 6(3), 2007.
[9] C. Park, W. Cheon, J. Kang, K. Roh, W. Cho, and J.-S. Kim, "A reconfigurable FTL (flash translation layer) architecture for NAND flash-based applications," ACM Trans. on Embedded Computing Systems, 7(4):1-23, 2008.
[10] H. Jo, J. Kang, S. Park, J. Kim, and J. Lee, "FAB: Flash-aware buffer management policy for portable media players," IEEE Trans. on Consumer Electronics, 52(2):485-493, 2006.
[11] S.-Y. Park, D. Jung, J.-U. Kang, J.-S. Kim, and J. Lee, "CFLRU: a replacement algorithm for flash memory," in Proc. of CASES '06, pages 234-241, 2006.
[12] H. Kim and S. Ahn, "BPLRU: a buffer management scheme for improving random writes in flash storage," in Proc. of FAST '08, pages 1-14, 2008.
[13] S. Lee, D. Shin, and J. Kim, "Buffer-aware garbage collection techniques for NAND flash memory-based storage systems," in Proc. of IWSSPS '08, pages 27-32, 2008.
[14] Samsung Electronics, "1G x 8 Bit / 2G x 8 Bit / 4G x 8 Bit NAND Flash Memory," http://www.samsung.com/global/business/semiconductor/products/flash/products NANDFlash.html, 2007.

BIOGRAPHIES

Seunggu Ji received the B.S. degree in computer science from Dankook University, Korea, in 2007. He is currently a Master's student in the School of Information and Communication Engineering, Sungkyunkwan University. His research interests include embedded software, file systems and flash memory.

Dongkun Shin (M'08) received the B.S., M.S., and Ph.D. degrees in computer science and engineering from Seoul National University, Korea, in 1994, 2000 and 2004, respectively. He is currently an Assistant Professor in the School of Information and Communication Engineering, Sungkyunkwan University (SKKU). Before joining SKKU in 2007, he was a senior engineer at Samsung Electronics Co., Korea. His research interests include embedded software, low-power systems, computer architecture, and multimedia and real-time systems.