A Buffer Replacement Algorithm Exploiting Multi-Chip Parallelism in Solid State Disks


Jinho Seol, Hyotaek Shim, Jaegeuk Kim, and Seungryoul Maeng
Division of Computer Science
School of Electrical Engineering and Computer Science
Korea Advanced Institute of Science and Technology (KAIST)
Guseong-dong, Daejeon, Republic of Korea
{jhseol, htsim,

ABSTRACT
Solid State Disks (SSDs) are superior to magnetic disks from a performance point of view thanks to the favorable features of flash memory. Furthermore, owing to improvements in flash memory density and the adoption of multi-chip architectures, SSDs are rapidly replacing magnetic disks. Most previous studies have aimed at enhancing the performance of SSDs, but they were conducted under the assumption that the operation unit of the host interface is the same as that of flash memory, in which case partially-filled pages need not be considered. In this paper, we analyze the overhead caused by partially-filled pages and propose a buffer replacement algorithm that exploits multi-chip parallelism to enhance write performance. Our simulation results show that the proposed algorithm improves write performance by up to 30% over existing approaches.

Categories and Subject Descriptors
C.3 [Special-purpose and Application-based Systems]: Real-time and embedded systems; B.3.3 [Memory Structures]: Performance Analysis and Design Aids - Simulation

General Terms
Algorithms, Design, Measurement, Performance

Keywords
flash memory, buffer replacement algorithm, solid state disk (SSD)

This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute for Information Technology Advancement). (IITA-2009-C )

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CASES'09, October 11-16, 2009, Grenoble, France.
Copyright 2009 ACM.

1. INTRODUCTION
NAND flash memory is in the limelight as a new storage medium. It is widely adopted as a storage system for various embedded devices such as digital cameras, MP3 players, Portable Media Players (PMPs), and cellular phones because it has certain advantages: low power consumption, small size, and light weight. Solid State Disks (SSDs), which adopt flash memory as a storage medium, provide better performance than magnetic disks. Concretely, SSDs achieve short start-up times, fast random accesses, and low power consumption by eliminating the mechanical overheads of magnetic disks such as seek time, spin-up delay, and rotational delay [4]. Therefore, the SSD has become a competitive substitute for the magnetic disk. To enhance overall performance and capacity, SSDs employ a multi-chip architecture in which two or more flash memory chips are used [4, 13, 22]. In this architecture, multiple operations are distributed over the flash chips to obtain the greatly improved performance that a multi-chip architecture allows. However, previous studies assumed that the operation unit of the host interface is the same as the operation unit of flash memory [9, 17].
The assumption was correct in those days, but it is no longer acceptable because the operation unit of flash memory has become larger than that of the host interface [2, 1] as the density of flash memory increases. Because of this difference in size between the units of the host interface and flash memory, an SSD sometimes has to write a partially-filled page, which turns into a read-modify-write operation [5]. With a traditional page-level buffer replacement policy, we observed that read-modify-write operations frequently cause chip-waiting problems, in which the controller must wait for a chip to become idle in order to read the pre-existing page. Accordingly, we need a new buffer management policy that is aware of the multi-chip architecture and considers the effect of read-modify-write operations in order to improve multi-chip parallelism.
In this paper, our goal is to alleviate the overhead of read-modify-write operations while exploiting multi-chip parallelism. To this end, we analyze the overhead of read-modify-write operations in a multi-chip architecture and propose a buffer replacement algorithm called MCA (Multi-Chip based replacement Algorithm) that alleviates this overhead by exploiting the multi-chip architecture. Experimental results are presented to analyze the effect of read-modify-write operations and to show the performance improvement of the proposed algorithm.

The rest of the paper is organized as follows. Section 2 gives an overview of the characteristics of flash memory, followed by the overall architecture of SSDs. Section 3 summarizes related work. The detailed design of MCA is discussed in Section 4. Section 5 compares MCA with previous work and shows evaluation results, and Section 6 concludes the paper.

2. BACKGROUND

2.1 NAND flash memory characteristics
The three main operations of flash memory are read, write (program), and erase. Flash memory consists of a number of blocks, each of which consists of the same number of pages. The unit of a read or write operation is a page, and the unit of an erase operation is a block. Once a page is written, it cannot be overwritten until the block that the page belongs to is erased. This characteristic is called erase-before-write. The erase operation takes much longer than the write operation because a block is much larger than a page. There are two types of flash memory: Single-Level Cell (SLC) and Multi-Level Cell (MLC). The capacity of MLC flash memory is larger than that of SLC flash memory, while MLC flash memory shows longer operation latencies. Table 1 summarizes the page size, block size, and operation times of flash memory. MLC flash memory is widely used in general storage systems because it offers an advantage in capacity per price. Thus, we use the configuration of MLC flash memory in this paper.

Table 1: The characteristics of flash memory

                     SLC [2]    MLC [1]
  Page Size (KB)     2          4
  Block Size (KB)
  #pages/block
  Read Time (μs)     25         60
  Write Time (μs)    200        800
  Erase Time (μs)

2.2 Overall architecture of SSD
In order to emulate a block device interface, an SSD has a special management layer that hides the characteristics of flash memory. This software layer of the storage device, the flash translation layer (FTL), is in charge of this management and has two main functions:

- Address translation from a logical sector address to a physical flash memory address such as (chip number, block number, page number)
- Garbage collection, which moves valid pages to another block and erases blocks to reclaim invalid pages

In addition to the FTL, an SSD has a buffer manager to improve performance. The buffer manager stores data from the host first and writes the data to flash memory afterwards. The functionality of the buffer manager varies depending on the implementation of the SSD. For instance, a buffer manager can play the role of a cache. This paper uses the term buffer cache¹ to denote a buffer manager that has the functionality of a cache. The buffer cache concerns only write requests because most read requests are already handled by the buffer cache in the host system [14]. Using a buffer cache contributes to improving performance by hiding the long write time of flash memory. Moreover, it can effectively reduce the number of write requests sent to flash memory when a write request hits in the buffer cache. As shown in Figure 1, the buffer cache processes all requests from the host system. After absorbing a number of requests by replacing the data in it, it finally sends write requests to the FTL when it is full or a flush request is received. When the FTL receives the write requests from the buffer cache, it writes the data to flash memory after determining where to write the data.

Figure 1: Architecture of SSD

¹ A buffer cache also exists in the operating system. The buffer cache in this paper refers to the buffer cache inside the SSD unless stated otherwise.
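To make the address-translation function concrete, the following is a minimal sketch, not the paper's implementation, of a page-level mapping table that an FTL might keep for translating a logical sector address into a (chip, block, page) tuple. The class name, the sectors-per-page constant, and the use of a hash map are assumptions made only for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Illustrative geometry: 4KB pages made of 512-byte sectors (assumed values).
constexpr uint32_t kSectorsPerPage = 8;

struct FlashAddress {
    uint32_t chip;   // which flash chip
    uint32_t block;  // block within the chip
    uint32_t page;   // page within the block
};

// Toy page-level FTL map: logical page number -> physical flash address.
class PageLevelFtl {
public:
    // Translate a logical sector address to the physical page holding it.
    bool Lookup(uint64_t logicalSector, FlashAddress* out) const {
        auto it = map_.find(logicalSector / kSectorsPerPage);
        if (it == map_.end()) return false;   // page never written
        *out = it->second;
        return true;
    }
    // Record the new physical location after a page is (re)written.
    void Update(uint64_t logicalSector, FlashAddress where) {
        map_[logicalSector / kSectorsPerPage] = where;
    }
private:
    std::unordered_map<uint64_t, FlashAddress> map_;
};

int main() {
    PageLevelFtl ftl;
    ftl.Update(/*logicalSector=*/17, {/*chip=*/2, /*block=*/5, /*page=*/3});
    FlashAddress a{};
    if (ftl.Lookup(17, &a))
        std::printf("sector 17 -> chip %u, block %u, page %u\n", a.chip, a.block, a.page);
}
```

A real FTL would additionally track free pages per block and drive garbage collection from the same table; the sketch covers only the lookup path that MCA later relies on.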
2.3 Multi-chip based storage system
As the need for large-capacity and high-performance storage systems grows, a multi-chip architecture has been adopted for storage systems [13, 22]. There are two types of multi-chip configurations, depending on whether the data bus is shared or not [4]. In a shared control architecture, every flash chip has its own data path. In the other configuration, called a shared bus architecture, the data bus is shared among a number of flash chips.

Although a shared control architecture shows better performance than a shared bus architecture, it has a higher implementation cost. Moreover, considering that write and erase latencies are very long compared with the data transfer time, data bus contention is not serious even in a shared bus architecture. Thus, we focus on a shared bus architecture in this paper. The number of chips that can share one data bus is limited, because performance no longer increases when too many chips are connected to one data bus: whenever a chip on a shared bus needs to send or receive data, it has to wait until the bus is idle. Therefore, if more flash chips are needed, another design called a multi-channel architecture should be used. Because bus contention is distributed over all channels, performance degradation is alleviated significantly. Figure 2 shows an example of a multi-channel architecture with 2 channels. Each channel consists of a shared bus and 4 flash chips, and is connected to its own flash controller.

Figure 2: Multi-channel architecture

3. RELATED WORK
We classify buffer cache algorithms into two categories: hit ratio based algorithms and FTL based algorithms. The algorithms in the first category concentrate on increasing the cache hit ratio regardless of the chip architecture and storage medium. LRU is the most famous algorithm in this category, and algorithms such as 2Q [12] and ARC [20] have been proposed to achieve better performance. DISTANCE [21] is slightly different from these algorithms because it takes advantage of workload characteristics; nevertheless, it also belongs to the first category since it aims to increase the cache hit ratio. The characteristics of FTLs are taken into account in FTL based algorithms. FAB [11] and BPLRU [14] come into this category. Algorithms in this category reduce the overhead of garbage collection by exploiting the characteristics of FTLs. However, these algorithms are not feasible for a high-performance storage system because they incur long latencies on a specific chip when writing all pages of a block at once.
There are also studies on increasing the parallelism of a multi-chip architecture. Kang et al. [13] proposed three optimization techniques to exploit the parallelism. The striping technique splits a request into sub-requests over multiple channels. If a request is not large enough to split, interleaving is a good choice: the interleaving technique does not split a request, but a number of requests are handled simultaneously across several channels. In a single-channel architecture, the pipelining technique is the best choice. With pipelining, two requests are not handled simultaneously because the data bus is shared; however, the data of the second request can be transferred over the bus while the first request is being written to its chip. The striping and interleaving techniques are used when there are multiple channels, and the pipelining technique is used within a single channel where a shared bus architecture is used.
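As a rough illustration of why pipelining pays off on a shared bus, the sketch below (ours, not code from [13]) serializes page transfers on one data bus while letting program operations proceed in parallel on separate chips. The 15.6μs transfer time and 800μs program time are the MLC figures used later in this paper; the round-robin page placement and the helper names are assumptions made only for this example.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Rough model: one shared data bus, N chips. A write occupies the bus for the
// transfer time, then occupies the target chip alone for the program time.
constexpr double kTransferUs = 15.6;   // per-page bus transfer (assumed MLC figure)
constexpr double kProgramUs  = 800.0;  // per-page program time (assumed MLC figure)

double TotalWriteTimeUs(int numChips, int numPages) {
    double busFreeAt = 0.0;                        // when the shared bus is idle
    std::vector<double> chipFreeAt(numChips, 0.0); // when each chip is idle
    double finish = 0.0;
    for (int i = 0; i < numPages; ++i) {
        int chip = i % numChips;                   // round-robin for the sketch
        double start = std::max(busFreeAt, chipFreeAt[chip]);
        busFreeAt = start + kTransferUs;           // bus busy during transfer
        chipFreeAt[chip] = busFreeAt + kProgramUs; // chip busy during program
        finish = std::max(finish, chipFreeAt[chip]);
    }
    return finish;
}

int main() {
    // With 1 chip every program serializes; with 8 chips the transfers pipeline
    // into time the other chips would otherwise spend idle.
    std::printf("1 chip : %.1f us\n", TotalWriteTimeUs(1, 64));
    std::printf("8 chips: %.1f us\n", TotalWriteTimeUs(8, 64));
}
```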
Chang et al. [9] approached the problem in terms of a chip assignment policy for exploiting the parallelism. When the FTL receives a read operation, it already knows which chip to read, because the requested page is stored in a specific chip. On the other hand, when a write operation is received, a chip must be assigned to write the page. A simple approach to selecting a chip is to use a modulo operation, called a static chip assignment policy. In this policy, each logical page must reside in one specific chip, as in RAID-0 [10]. Chang et al. pointed out that a static chip assignment policy does not guarantee that all chips are used evenly, and thus proposed a dynamic chip assignment policy. In the dynamic chip assignment policy, a chip that is idle and has the largest number of free pages is selected to write a page; as a result, a logical page can reside in any chip. Figure 3 shows the difference between the two chip assignment policies, assuming pages are written in numerical order. Chip-level parallelism is achieved with the dynamic chip assignment policy because a write operation can be executed without waiting for the completion of a previous write operation. Thus, the pipelining technique is achieved naturally under a dynamic chip assignment policy.

Figure 3: Two kinds of chip assignment policies ((a) static, (b) dynamic)
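The two assignment policies can be contrasted in a few lines. The sketch below paraphrases the description above rather than reproducing the code of [9]; the ChipState structure and its fields are assumptions introduced for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct ChipState {
    bool     idle;       // no operation currently in flight
    uint32_t freePages;  // free pages remaining on the chip
};

// Static policy: the logical page number fixes the chip, RAID-0 style.
int AssignStatic(uint64_t logicalPage, int numChips) {
    return static_cast<int>(logicalPage % numChips);
}

// Dynamic policy: pick an idle chip with the most free pages, so a logical
// page may end up on any chip. Returns -1 if every chip is busy, in which
// case the caller waits for one to become idle.
int AssignDynamic(const std::vector<ChipState>& chips) {
    int best = -1;
    uint32_t bestFree = 0;
    for (int n = 0; n < static_cast<int>(chips.size()); ++n) {
        if (chips[n].idle && chips[n].freePages >= bestFree) {
            best = n;
            bestFree = chips[n].freePages;
        }
    }
    return best;
}

int main() {
    std::vector<ChipState> chips = {{false, 90}, {true, 40}, {true, 70}, {true, 10}};
    std::printf("static : page 5 -> chip %d\n", AssignStatic(5, 4));
    std::printf("dynamic: next write -> chip %d\n", AssignDynamic(chips));
}
```

Under the dynamic policy the chosen chip changes with load, which is what lets write operations pipeline naturally, as noted above.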

4. EXPLOITING PARALLELISM

4.1 Motivation
The basic operation unit of secondary storage systems is a sector of 512 bytes, while that of flash memory is a page of 2048 or 4096 bytes, depending on the manufacturer. Because of this difference in size, SSDs need the following steps to write a single sector: (1) read the page where the requested sector is stored, (2) modify the page with the requested sector, and (3) write the modified page. This series of operations is called a read-modify-write operation [5, 4]. There are two reasons that trigger a read-modify-write operation. The first is a small request size: a request smaller than a page always needs a read-modify-write operation, which is often the case when write requests are random. The other reason is the alignment problem of the host system. If a request does not align with a page, a read-modify-write operation occurs even though the host system writes a request large enough to cover a page. This problem appears frequently in ordinary file systems such as NTFS and ext3.
Because a read-modify-write operation consists of serial operations, it causes a chip-waiting problem. Figure 4 illustrates the problem. In this example, we assume that the logical pages are distributed over the physical chips by a dynamic chip assignment policy and that the buffer cache in the SSD holds 5 full pages² and 1 partial page³. When the buffer cache is flushed, the 4 least recently used pages are distributed by the dynamic chip assignment policy. Afterwards, the buffer cache flushes Page2, which is a partial page, at Time A. The pre-existing page of Page2 therefore has to be read from Chip2 before the modified Page2 can be written. As a result, even though Chip0 is idle at Time A, the write operation is delayed until Time B, when Chip2 becomes idle. This unwanted chip wait time degrades performance. When the page size is 4KB on NTFS, as shown in Section 5.3, 22% of write operations are read-modify-write operations, but they take 59% of the whole write time. Moreover, the larger a page becomes, the more frequently read-modify-write operations occur [4, 5]. To solve this problem, we propose a new buffer replacement algorithm for SSDs.

Figure 4: Example of chip-waiting problem (T: Transfer, W: Write, RT: Read & Transfer)

² In this paper, a full page refers to a fully-filled page in which all sectors have been written. It therefore does not cause a read-modify-write operation.
³ In this paper, a partial page refers to a partially-filled page in which some sectors have not been written. It results in a read-modify-write operation.

4.2 Multi-Chip based replacement algorithm

Overview
To avoid the chip-waiting problem, we need to reschedule the flushing order by considering the state (busy or idle) of the flash chips. For this rescheduling, the buffer cache must be aware of the chip where the pre-existing page of a victim page resides, by referring to the address information of the FTL. Since the buffer cache and the FTL coexist in the SSD and are designed together, this is tractable [18]. Based on knowing both the mapping information of the FTL and the page information of the buffer cache, we suggest the Multi-Chip based replacement Algorithm (MCA).

Operations of MCA
MCA can be applied together with a hit ratio based algorithm, because a hit ratio based algorithm concerns only the cache hit ratio regardless of the chip architecture or the characteristics of the FTL. We combine MCA with the LRU scheme, which is the most widely used cache algorithm. To evict a page from the buffer cache, MCA first selects a target chip, the candidate chip on which to perform the write operation: the chip that is idle and has the largest number of free pages is selected. The next step is to select a victim page, the candidate page to be evicted from the buffer cache. As victim pages are examined in LRU order within the write buffer cache, MCA decides whether to flush or skip each candidate in order to avoid a chip-waiting event caused by a read-modify-write operation. For this, full pages and partial pages are treated differently in MCA. If a full page is selected as a victim page, MCA unconditionally writes the page to the target chip because the page does not generate a read-modify-write operation.
On the other hand, a partial page is flushed from the buffer cache only when the chip where its pre-existing page resides is idle. If that chip is busy, MCA tries to select the next page as a victim page, and repeats the procedure to avoid a chip-waiting problem. Algorithm 1 shows the specific steps of selecting a victim page.

Algorithm 1 MCA
 1: procedure EvictVictimPage
 2:   FinishFlag ← false
 3:   C ← ∅                                ▷ set of examined chips
 4:   if all chips are busy then
 5:     Wait until one of the busy chips becomes idle
 6:   end if
 7:   TC ← GetNextTargetChip(C)
 8:   VP ← the LRU page                    ▷ victim page
 9:   while FinishFlag = false do
10:     if VP is a full page then
11:       Write VP to TC
12:       FinishFlag ← true
13:     else
14:       PA ← physical address of VP
15:       n ← the chip number where PA exists
16:       if Chip n is idle then
17:         Read PA from Chip n
18:         Modify the page with VP
19:         Write the modified page to TC
20:         FinishFlag ← true
21:       else
22:         if the next page of VP exists then
23:           VP ← the next page of VP
24:         else
25:           C ← C ∪ {TC}
26:           TC ← GetNextTargetChip(C)
27:           VP ← the LRU page
28:         end if
29:       end if
30:     end if
31:   end while
32: end procedure
33: function GetNextTargetChip(C)
34:   f ← 0
35:   A ← {all chip numbers} − C
36:   TC ← −1
37:   for all n ∈ A do
38:     if Chip n is idle then
39:       if (the number of free pages of Chip n) > f then
40:         f ← the number of free pages of Chip n
41:         TC ← n
42:       end if
43:     end if
44:   end for
45:   if TC = −1 then
46:     Wait until one of the busy chips becomes idle
47:     TC ← the chip number that has become idle
48:   end if
49:   return TC
50: end function
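For readers who prefer code, a compact C++ rendering of Algorithm 1 follows. It is a sketch only: the Chips structure, the Page fields, and the in-memory LRU list are placeholders we introduce for illustration, not structures taken from the actual firmware or from the simulator described in Section 5.

```cpp
#include <cstddef>
#include <cstdio>
#include <set>
#include <vector>

// Placeholder state standing in for the real buffer cache and chip status.
struct Page { int id; bool full; int preExistingChip; };

struct Chips {
    std::vector<bool> idle;
    std::vector<int>  freePages;
    int Count() const { return static_cast<int>(idle.size()); }
    void Write(const Page& p, int chip, const char* how) {
        std::printf("%s Page%d -> Chip%d\n", how, p.id, chip);
        idle[chip] = false;               // chip now busy programming
        --freePages[chip];
    }
    int WaitForAnyIdleChip() {            // stub: pretend chip 0 became idle
        idle[0] = true;
        return 0;
    }
};

// Idle chip with the most free pages, excluding chips already examined.
int GetNextTargetChip(Chips& chips, const std::set<int>& examined) {
    int tc = -1, best = -1;
    for (int n = 0; n < chips.Count(); ++n) {
        if (examined.count(n) || !chips.idle[n]) continue;
        if (chips.freePages[n] > best) { best = chips.freePages[n]; tc = n; }
    }
    return tc >= 0 ? tc : chips.WaitForAnyIdleChip();
}

// Walk the LRU list (front = LRU) and evict the first page that can be
// written without waiting on a busy chip, as in Algorithm 1.
void EvictVictimPage(std::vector<Page>& lru, Chips& chips) {
    std::set<int> examined;
    int tc = GetNextTargetChip(chips, examined);
    std::size_t v = 0;
    for (;;) {
        const Page& vp = lru[v];
        if (vp.full) {                                  // full page: write now
            chips.Write(vp, tc, "write");
            break;
        }
        if (chips.idle[vp.preExistingChip]) {           // old copy's chip idle
            chips.Write(vp, tc, "read-modify-write");
            break;
        }
        if (v + 1 < lru.size()) { ++v; }                // skip this candidate
        else {                                          // try another target chip
            examined.insert(tc);
            tc = GetNextTargetChip(chips, examined);
            v = 0;
        }
    }
    lru.erase(lru.begin() + static_cast<std::ptrdiff_t>(v));
}

int main() {
    Chips chips{{true, true, false, true}, {8, 5, 3, 6}};
    // Page0 is partial and its old copy lives on busy Chip2, so MCA skips it
    // and evicts the next (full) page instead.
    std::vector<Page> lru = {{0, false, 2}, {7, true, -1}, {6, true, -1}};
    EvictVictimPage(lru, chips);
}
```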

Figure 5: Example of MCA
Figure 6: Timing diagrams of MCA and LRU

Figure 5 shows an example of MCA. The left half of the figure shows the chips and the pages allocated in them, and the right half shows the status of the buffer cache at each step. The leftmost page in the buffer cache is the least recently used (LRU) page, and the rightmost page in the write buffer cache is the most recently used (MRU) page. Step 1 in Figure 5 shows the status just after Page6 is written. Chip0 is selected as the target chip because it is idle and has the largest number of free pages, and Page7, the LRU page in the buffer cache, is selected as the victim page according to the LRU scheme. Since Page7 is a full page, it is evicted and simply written to Chip0. Then, Chip2 is selected as the target chip and Page1 is selected as the victim page. At this point, the chip where the pre-existing page of Page1 resides must be examined, because Page1 is a partial page that may have to wait for another chip due to a read-modify-write operation. Since Chip2 is idle at Step 1, Page1 is evicted, the pre-existing page of Page1 is read from Chip2 immediately, and then it is modified and rewritten into Chip2. The following request creates a different situation in Step 3. Chip0 is selected as the target chip and Page0 is selected as the victim page first. In this situation, because Page1 has been being written to Chip2 since Step 2, Chip2, where the pre-existing page of Page0 resides, is busy. Therefore, Page0 should not be evicted, to prevent a read-modify-write operation from waiting needlessly. Accordingly, MCA selects the next page as the victim page, and that page is evicted and written to the target chip because it is a full page. As MCA reduces useless chip wait time, performance improvement is achievable by exploiting parallelism. Figure 6 illustrates the timing diagrams of MCA and the pure LRU scheme under the same requests as the previous example. A read-modify-write operation disrupts pipelining in the pure LRU scheme, whereas pipelining works relatively well in MCA by rescheduling the order of write operations.

5. PERFORMANCE EVALUATION

5.1 Simulation environment
We implemented a trace-driven SSD simulator using SystemC [3]. SystemC is a library built on C++ for modeling hardware and software. With SystemC, it is possible to implement simulators at various levels of modeling, such as the Register Transfer Level (RTL) and the Transaction Level Model (TLM) [7]. We implemented the SSD components at the TLM level, as depicted in Figure 7.

Figure 7: SSD components

In our simulation, the buffer management policy and the FTL algorithm are implemented as software modules. Time overheads for software processing are not considered because they can be optimized depending on implementation details, while we measure the bus and chip latencies that occupy the major portion of the total cost. The specification of the MLC flash memory in Table 1 is used for modeling flash memory.

Table 2: The characteristics of traces

Trace    File System  Storage Size  Description                                                     Ratio of Read / Write request size
Winxp    NTFS         12G           General PC usage on Windows XP. The trace includes web          24.0% / 76.0%
                                    browsing with Internet Explorer, Picasa, messenger (MSN),
                                    Google Desktop, Ghostview & Ghostscript, terminal service,
                                    and a movie player.
Ubuntu   ext3         8GB           General PC usage on Ubuntu. The trace includes web browsing     9.6% / 90.4%
                                    with Firefox, terminal, OpenOffice, messenger (Pidgin), and
                                    mail service (Thunderbird).
Mac      HFS+         82G           General PC usage on Mac OS X 10.4. This trace is a part of      30.6% / 69.4%
                                    the full trace that was used in [6].
Mp3      ext3         8G            Copying 111 MP3 files onto the disk (avg. file size = 5MB,      0.1% / 99.9%
                                    total file size = 557MB).
Product  NTFS         4GB           SYSmark 2007 Preview productivity benchmark on Windows XP.      25.1% / 74.9%
Tpcc     ext3         8G            TPCC-UVa [19] benchmark test with 1 warehouse.                  45.0% / 55.0%

As the page transfer time is 15.6μs and the write operation time is 800μs, 8 flash chips are used on a channel bus to hide the write operation delay. According to the specification, the size of a physical page is 4KB. However, the page size of SSDs has been significantly enlarged to enhance parallelism with a multi-channel architecture [15, 8]. In these SSDs, several physical pages are grouped into one logical page called a super page⁴ [8]. The size of a super page is much larger than the size of a physical page; in fact, the size in some SSDs is up to 128KB [15]. Therefore, we use one channel in the experiments where the super page size is 4KB, and we use two or more channels with the striping technique in the experiments where the super page is over 4KB. There are some extra blocks, which are invisible to the host system and used internally by the FTL for garbage collection. When the simulation is initialized, the extra blocks are set as free blocks, and the other blocks are set as valid data blocks. In the following simulations, the number of extra blocks is configured as 3% of the total blocks. The number of blocks per chip is set differently depending on the storage size of the workload. We set the frequency of the channel bus to 40MHz because the operation clock of a flash chip is 25ns. The control bus speed, which depends on the processor clock, and the data bus speed, which depends on the SDRAM clock, are set to 83MHz and 166MHz, respectively.

⁴ This page is called a clustered page in [15].

5.2 Workload
We used several workload traces for performance evaluation. The details of the traces are given in Table 2. The traces on NTFS were obtained from laptop computers running Microsoft Windows XP by using DiskMon. The traces on ext3 and HFS+ were obtained by using blktrace and fs_usage, respectively.

5.3 Analysis of read-modify-write operations
Figure 8 shows the ratio of full pages to partial pages. A dynamic chip assignment policy, a page-level replacement algorithm, a 4KB page, and an 8MB write buffer cache are used in this experiment.

Figure 8: Full-page versus partial-page writes by (a) write count and (b) write time (4KB page, 8MB write buffer cache)
As HFS+ is designed especially for flash memory, the size of most write operations are 4KB and they are aligned with 4KB. Hence, the rate of read-modifywrite operations is relatively low in Mac. Mp3 also has the low rate of read-modify-write operations because most requests are lengthy and sequential. Enlarging the size of a super page impacts greatly on the overhead of read-modify-write operations [4]. The bigger the super page size becomes, the more often read-modify-write operations occur, and this tendency is especially severe in Mac. Most write requests become read-modify-write operations when the super page size is over 4KB because the size of most write requests is 4KB in HFS+. As the super page size in most SSD implementation is over 4KB [15], readmodify-write operations are dominant even though HFS+ file system is used. To analyze the effect of buffer sizes on read-modify-write operations, we made the same experiment with 16M write buffer cache. In this experiment, the rate of read-modifywrite operations is around 21%, and these operations consume 58% of total write time in Winxp. This means that enlarging buffer size is not an effective way to reduce the overhead of read-modify-write operations. 142

Figure 9: Simulation results of each configuration
Figure 10: Execution time for each configuration: (a) component execution time, (b) operation execution time

5.4 Evaluation results

Throughput
Table 3 lists the three configurations used for comparison, and Figure 9 shows the throughput of these configurations with a 4KB page. The workloads that have many read-modify-write operations show much better performance with MCA. Some workloads, however, such as Mp3 and Mac, do not show a big difference between LRU and MCA because they have few read-modify-write operations. da-lru overwhelmingly outperforms sa-lru in Mac because there are few read-modify-write operations. However, da-lru shows worse performance than sa-lru in general PC usage workloads such as Winxp and Ubuntu, because the overhead of chip-waiting problems in these workloads is much bigger than in Mac.

Table 3: Three configurations

           Chip assignment policy   Buffer replacement policy
  sa-lru   Static                   LRU
  da-lru   Dynamic                  LRU
  da-mca   Dynamic                  MCA

We break down the execution time of each configuration in two ways. First, we measure the time consumed by each component, shown in Figure 10(a). In order to read or write a page, a transfer over the channel bus is inevitable; Channel Transfer Time is the time spent transferring data on a channel bus. Chip Wait refers to the time spent waiting for a chip to finish its current operation so that a new operation can be issued. Etc includes all other time, such as time spent in SDRAM and transfers over the host interface and data bus. Channel Transfer Time and Etc are almost the same across all configurations; the main difference is the Chip Wait time. Since da-mca focuses on reducing chip wait time, this difference translates directly into performance improvement. The second way to break down the execution time is to distinguish read operations from write operations. Figure 10(b) shows the time consumption for read and write operations.

Read times are similar across all configurations, whereas write times differ. Consequently, the total execution time is reduced by as much as the time saved during writing.

Cache hit ratio
To analyze the effect on the hit ratio of the pure LRU scheme, we compare the hit ratio of pure LRU with that of MCA in Figure 11. The result shows that the hit ratios of the pure LRU scheme are similar to those of MCA over all workloads. Consequently, MCA has no particular effect on the hit ratio of the pure LRU scheme.

Figure 11: Hit ratio for each configuration

Effect of super page sizes
The super page size is an important element that can affect performance [15]. Thus, we measure the effect of super page sizes by changing the super page size from 4KB to 16KB. We implement a large super page by extending the number of channels [8]. We use 2 channels for the 8KB super page, as shown in Figure 13: one super page is striped over 2 channels, so one request for an 8KB super page is split into two 4KB sub-requests at the same offset on different channels. The 8KB super page whose logical page number is 3 in the example consists of the first page of Chip1 and the first page of Chip5. The 16KB super page is implemented with 4 channels in the same way.

Figure 13: Configuration for 8KB super page

The throughput results using different super page sizes are shown in Figure 12.

Figure 12: Throughput using different super page sizes

Generally, a larger super page shows better performance in this experiment because a number of operations can be executed simultaneously by using a multi-channel architecture. The only exception is da-lru in Mac. As we mentioned, most requests are 4KB in HFS+; if the super page size becomes larger than 4KB, most 4KB requests become partial pages, and in the end the partial pages result in performance degradation. The difference in throughput between da-lru and da-mca gets bigger as the page size increases: 20% performance improvement is achieved by using MCA when the super page size is 4KB, and over 30% performance improvement is achieved when the super page size is 16KB. As far as write performance alone is concerned, MCA shows over 30% write performance improvement even when the page size is 4KB.
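The super-page striping described above can be written down directly. The sketch below assumes a 4KB physical page, same-offset striping across channels, and 4 chips per channel as in Section 2.3; the round-robin placement of super pages over the chips of a channel is our own assumption, meant only to mirror the flavor of Figure 13 rather than its exact layout.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kPhysicalPageKB  = 4;  // assumed physical page size
constexpr uint32_t kChipsPerChannel = 4;  // chips on each channel bus

struct SubRequest {
    uint32_t channel;        // channel serving this 4KB slice
    uint32_t chipInChannel;  // chip within that channel
    uint64_t pageInChip;     // page offset within that chip
};

// Split one super page into per-channel 4KB sub-requests at the same offset.
std::vector<SubRequest> SplitSuperPage(uint64_t superPageNumber,
                                       uint32_t superPageKB,
                                       uint32_t numChannels) {
    std::vector<SubRequest> subs;
    const uint32_t slices = superPageKB / kPhysicalPageKB;   // 8KB -> 2 slices
    // Assumed placement: consecutive super pages rotate over the chips of a
    // channel; every channel uses the same (chip, page) offset.
    const uint32_t chip = static_cast<uint32_t>(superPageNumber % kChipsPerChannel);
    const uint64_t page = superPageNumber / kChipsPerChannel;
    for (uint32_t s = 0; s < slices && s < numChannels; ++s)
        subs.push_back({s, chip, page});
    return subs;
}

int main() {
    // An 8KB super page split into two 4KB sub-requests on a 2-channel SSD.
    for (const SubRequest& r : SplitSuperPage(3, 8, 2))
        std::printf("channel %u, chip %u, page %llu\n", r.channel,
                    r.chipInChannel, static_cast<unsigned long long>(r.pageInChip));
}
```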
6. CONCLUSION
For the past decade, applications using flash memory have increased rapidly. However, ways to exploit a number of flash memory chips have not been actively investigated. In this paper, we analyzed the effect of read-modify-write operations. Because the request unit of flash memory is larger than the request unit of the host system, read-modify-write operations occur frequently in general-purpose file systems such as NTFS and ext3. Because these operations disrupt pipelining, they are major obstacles to exploiting parallelism. For this reason, we proposed the Multi-Chip based replacement Algorithm (MCA). With this algorithm, performance improvement through enhanced parallelism is achievable by rescheduling the order of write operations. It is possible to combine MCA with another hit ratio based algorithm, because MCA is orthogonal to such algorithms in that it concentrates on exploiting parallelism. Accordingly, MCA is also applicable to other scheduling techniques, such as NCQ, where a high level of parallelism is required. Our experimental results show that MCA outperforms the pure LRU scheme. The overall performance is increased by a maximum of 20%, and write performance is increased by over 30% compared with the pure LRU scheme when the page size is 4KB.

7. REFERENCES
[1] 2Gx8 bit flash memory (K9GAG08U0M). Samsung Electronics, 2006.
[2] 2Gx8 bit flash memory (K9WAG08U1A). Samsung Electronics, 2006.
[3] SystemC.
[4] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design tradeoffs for SSD performance. In Proceedings of the USENIX Annual Technical Conference, pages 57-70, 2008.
[5] A. Birrell, M. Isard, C. Thacker, and T. Wobber. A design for high-performance flash disks. ACM SIGOPS Operating Systems Review, 41(2):88-93, 2007.
[6] T. Bisson, S. A. Brandt, and D. D. Long. A hybrid disk-aware spin-down algorithm with I/O subsystem support. In Proceedings of the 26th International Performance, Computing, and Communications Conference (IEEE IPCCC), 2007.
[7] L. Cai and D. Gajski. Transaction level modeling: an overview. In Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, ACM, 2003.
[8] A. M. Caulfield, L. M. Grupp, and S. Swanson. Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, Washington, DC, USA, 2009. ACM.

[9] L. Chang and T. Kuo. An adaptive striping architecture for flash memory storage systems of embedded systems. In Proceedings of the 8th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2002.
[10] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: high-performance, reliable secondary storage. ACM Computing Surveys, 26(2), 1994.
[11] H. Jo, J. Kang, S. Park, J. Kim, and J. Lee. FAB: flash-aware buffer management policy for portable media players. IEEE Transactions on Consumer Electronics, 52(2), 2006.
[12] T. Johnson and D. Shasha. 2Q: a low overhead high performance buffer management replacement algorithm. In Proceedings of the 20th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., 1994.
[13] J. Kang, J. Kim, C. Park, H. Park, and J. Lee. A multi-channel architecture for high-performance flash-based storage system. Journal of Systems Architecture, 53(9), 2007.
[14] H. Kim and S. Ahn. BPLRU: a buffer management scheme for improving random writes in flash storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1-14, 2008.
[15] J. Kim, D. Jung, J. Kim, and R. Huh. A methodology for extracting performance parameters in solid state disks (SSDs). In Proceedings of the 17th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (IEEE MASCOTS '09), London, United Kingdom, 2009.
[16] J. H. Kim, S. H. Jung, and S. Y. Ho. Cost and performance analysis of mapping algorithms in shared-bus multi-chip configuration. In Proceedings of the International Workshop on Software Support for Portable Storage (IWSSPS), 2008.
[17] T. Kuo, J. Hsieh, L. Chang, and Y. Chang. Configurability of performance and overheads in flash management. In Proceedings of the 2006 Conference on Asia South Pacific Design Automation, 2006.
[18] S. Lee, D. Shin, and J. Kim. Buffer-aware garbage collection for flash memory-based storage systems. In Proceedings of the International Workshop on Software Support for Portable Storage (IWSSPS), pages 27-32, 2008.
[19] D. R. Llanos. TPCC-UVa benchmark. diego/tpcc-uva.html, 2006.
[20] N. Megiddo and D. S. Modha. ARC: a self-tuning, low overhead replacement cache. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, 2003. USENIX Association.
[21] B. Ozden, R. Rastogi, and A. Silberschatz. Buffer replacement algorithms for multimedia storage systems. In Proceedings of the Third IEEE International Conference on Multimedia Computing and Systems, 1996.
[22] J. Shin, Z. Xia, N. Xu, R. Gao, X. Cai, S. Maeng, and F. Hsu. FTL design exploration in reconfigurable high-performance SSD for server applications. In Proceedings of the 23rd International Conference on Supercomputing (ICS '09), Yorktown Heights, NY, USA, 2009. ACM.

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University NAND Flash-based Storage Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics NAND flash memory Flash Translation Layer (FTL) OS implications

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University NAND Flash-based Storage Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics NAND flash memory Flash Translation Layer (FTL) OS implications

More information

System Software for Flash Memory: A Survey

System Software for Flash Memory: A Survey System Software for Flash Memory: A Survey Tae-Sun Chung 1, Dong-Joo Park 2, Sangwon Park 3, Dong-Ho Lee 4, Sang-Won Lee 5, and Ha-Joo Song 6 1 College of Information Technoloty, Ajou University, Korea

More information

Divided Disk Cache and SSD FTL for Improving Performance in Storage

Divided Disk Cache and SSD FTL for Improving Performance in Storage JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.1, FEBRUARY, 2017 ISSN(Print) 1598-1657 https://doi.org/10.5573/jsts.2017.17.1.015 ISSN(Online) 2233-4866 Divided Disk Cache and SSD FTL for

More information

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage CSCI-GA.2433-001 Database Systems Lecture 8: Physical Schema: Storage Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com View 1 View 2 View 3 Conceptual Schema Physical Schema 1. Create a

More information

Hibachi: A Cooperative Hybrid Cache with NVRAM and DRAM for Storage Arrays

Hibachi: A Cooperative Hybrid Cache with NVRAM and DRAM for Storage Arrays Hibachi: A Cooperative Hybrid Cache with NVRAM and DRAM for Storage Arrays Ziqi Fan, Fenggang Wu, Dongchul Park 1, Jim Diehl, Doug Voigt 2, and David H.C. Du University of Minnesota, 1 Intel, 2 HP Enterprise

More information

[537] Flash. Tyler Harter

[537] Flash. Tyler Harter [537] Flash Tyler Harter Flash vs. Disk Disk Overview I/O requires: seek, rotate, transfer Inherently: - not parallel (only one head) - slow (mechanical) - poor random I/O (locality around disk head) Random

More information

A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement

A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement Guangyu Sun, Yongsoo Joo, Yibo Chen Dimin Niu, Yuan Xie Pennsylvania State University {gsun,

More information

CBM: A Cooperative Buffer Management for SSD

CBM: A Cooperative Buffer Management for SSD 3 th International Conference on Massive Storage Systems and Technology (MSST 4) : A Cooperative Buffer Management for SSD Qingsong Wei, Cheng Chen, Jun Yang Data Storage Institute, A-STAR, Singapore June

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

smxnand RTOS Innovators Flash Driver General Features

smxnand RTOS Innovators Flash Driver General Features smxnand Flash Driver RTOS Innovators The smxnand flash driver makes NAND flash memory appear to a file system like a disk drive. It supports single-level cell (SLC) and multi-level cell (MLC) NAND flash.

More information

NLE-FFS: A Flash File System with PRAM for Non-linear Editing

NLE-FFS: A Flash File System with PRAM for Non-linear Editing 16 IEEE Transactions on Consumer Electronics, Vol. 55, No. 4, NOVEMBER 9 NLE-FFS: A Flash File System with PRAM for Non-linear Editing Man-Keun Seo, Sungahn Ko, Youngwoo Park, and Kyu Ho Park, Member,

More information

FlashTier: A Lightweight, Consistent and Durable Storage Cache

FlashTier: A Lightweight, Consistent and Durable Storage Cache FlashTier: A Lightweight, Consistent and Durable Storage Cache Mohit Saxena PhD Candidate University of Wisconsin-Madison msaxena@cs.wisc.edu Flash Memory Summit 2012 Santa Clara, CA Flash is a Good Cache

More information

Plugging versus Logging: A New Approach to Write Buffer Management for Solid-State Disks

Plugging versus Logging: A New Approach to Write Buffer Management for Solid-State Disks Plugging versus Logging: A New Approach to Write Buffer Management for Solid-State Disks Li-Pin Chang, You-Chiuan Su Department of Computer Science National Chiao-Tung University The 48-th Design Automation

More information

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

16:30 18:00, June 20 (Monday), 2011 # (even student IDs) # (odd student IDs) Scope

16:30 18:00, June 20 (Monday), 2011 # (even student IDs) # (odd student IDs) Scope Final Exam 16:30 18:00, June 20 (Monday), 2011 #440102 (even student IDs) #440112 (odd student IDs) Scope Chap. 1 5 (except 3.7, 5.7) Chap. 6.1, 6.3, 6.4 Chap. 7.1 7.6 Closed-book exam 1 Storage Jin-Soo

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 13

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 13 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 13 COMPUTER MEMORY So far, have viewed computer memory in a very simple way Two memory areas in our computer: The register file Small number

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] Shrideep Pallickara Computer Science Colorado State University L29.1 Frequently asked questions from the previous class survey How does NTFS compare with UFS? L29.2

More information

A Methodology for Extracting Performance Parameters in Solid State Disks (SSDs)

A Methodology for Extracting Performance Parameters in Solid State Disks (SSDs) A Methodology for Extracting Performance Parameters in Solid State Disks (SSDs) Jae-Hong Kim, Dawoon Jung Computer Science Department Korea Advanced Institute of Science and Technology (KAIST), South Korea

More information

Prediction of Elapsed Time based Wear Leveling for NAND Flash Memory in Embedded Systems

Prediction of Elapsed Time based Wear Leveling for NAND Flash Memory in Embedded Systems Prediction of Elapsed Time based Wear Leveling for NAND Flash Memory in Embedded Systems Sung Ho Kim Ph. D. Candidate, Department of Computer Engineering, Yeungnam University, 214-1 Dae-dong, Gyeongsan-si,

More information

TEFS: A Flash File System for Use on Memory Constrained Devices

TEFS: A Flash File System for Use on Memory Constrained Devices 2016 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) TEFS: A Flash File for Use on Memory Constrained Devices Wade Penson wpenson@alumni.ubc.ca Scott Fazackerley scott.fazackerley@alumni.ubc.ca

More information

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

S-FTL: An Efficient Address Translation for Flash Memory by Exploiting Spatial Locality

S-FTL: An Efficient Address Translation for Flash Memory by Exploiting Spatial Locality S-FTL: An Efficient Address Translation for Flash Memory by Exploiting Spatial Locality Song Jiang, Lei Zhang, Xinhao Yuan, Hao Hu, and Yu Chen Department of Electrical and Computer Engineering Wayne State

More information

Architecture Exploration of High-Performance PCs with a Solid-State Disk

Architecture Exploration of High-Performance PCs with a Solid-State Disk Architecture Exploration of High-Performance PCs with a Solid-State Disk D. Kim, K. Bang, E.-Y. Chung School of EE, Yonsei University S. Yoon School of EE, Korea University April 21, 2010 1/53 Outline

More information

https://www.usenix.org/conference/fast16/technical-sessions/presentation/li-qiao

https://www.usenix.org/conference/fast16/technical-sessions/presentation/li-qiao Access Characteristic Guided Read and Write Cost Regulation for Performance Improvement on Flash Memory Qiao Li and Liang Shi, Chongqing University; Chun Jason Xue, City University of Hong Kong; Kaijie

More information

SSD Garbage Collection Detection and Management with Machine Learning Algorithm 1

SSD Garbage Collection Detection and Management with Machine Learning Algorithm 1 , pp.197-206 http//dx.doi.org/10.14257/ijca.2018.11.4.18 SSD Garbage Collection Detection and Management with Machine Learning Algorithm 1 Jung Kyu Park 1 and Jaeho Kim 2* 1 Department of Computer Software

More information

I/O Devices & SSD. Dongkun Shin, SKKU

I/O Devices & SSD. Dongkun Shin, SKKU I/O Devices & SSD 1 System Architecture Hierarchical approach Memory bus CPU and memory Fastest I/O bus e.g., PCI Graphics and higherperformance I/O devices Peripheral bus SCSI, SATA, or USB Connect many

More information

Performance Trade-Offs in Using NVRAM Write Buffer for Flash Memory-Based Storage Devices

Performance Trade-Offs in Using NVRAM Write Buffer for Flash Memory-Based Storage Devices Performance Trade-Offs in Using NVRAM Write Buffer for Flash Memory-Based Storage Devices Sooyong Kang, Sungmin Park, Hoyoung Jung, Hyoki Shim, and Jaehyuk Cha IEEE TRANSACTIONS ON COMPUTERS, VOL. 8, NO.,

More information

Hot Data Identification for Flash-based Storage Systems Using Multiple Bloom Filters

Hot Data Identification for Flash-based Storage Systems Using Multiple Bloom Filters Hot Data Identification for Flash-based Storage Systems Using Multiple Bloom Filters Dongchul Park and David H.C. Du Department of Computer Science and Engineering University of Minnesota, Twin Cities

More information

Page-Differential Logging: An Efficient and DBMS- Independent Approach for Storing Data into Flash Memory

Page-Differential Logging: An Efficient and DBMS- Independent Approach for Storing Data into Flash Memory Page-Differential Logging: An Efficient and DBMS- Independent Approach for Storing Data into Flash Memory Yi-Reun Kim Kyu-Young Whang Department of Computer Science Korea Advanced Institute of Science

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems

Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems Liang Shi, Chun Jason Xue and Xuehai Zhou Joint Research Lab of Excellence, CityU-USTC Advanced Research Institute,

More information

Solid State Drives (SSDs) Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Solid State Drives (SSDs) Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Solid State Drives (SSDs) Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Memory Types FLASH High-density Low-cost High-speed Low-power High reliability

More information

FAST: An Efficient Flash Translation Layer for Flash Memory

FAST: An Efficient Flash Translation Layer for Flash Memory FAST: An Efficient Flash Translation Layer for Flash Memory Sang-Won Lee, Won-Kyoung Choi, and Dong-Joo Park School of Information and Communication Engineering, Sungkyunkwan University, Korea swlee@skku.edu

More information

Performance Impact and Interplay of SSD Parallelism through Advanced Commands, Allocation Strategy and Data Granularity

Performance Impact and Interplay of SSD Parallelism through Advanced Commands, Allocation Strategy and Data Granularity Performance Impact and Interplay of SSD Parallelism through Advanced Commands, Allocation Strategy and Data Granularity Yang Hu, Hong Jiang, Dan Feng, Lei Tian,Hao Luo, Shuping Zhang School of Computer,

More information

Frequency-based NCQ-aware disk cache algorithm

Frequency-based NCQ-aware disk cache algorithm LETTER IEICE Electronics Express, Vol.11, No.11, 1 7 Frequency-based NCQ-aware disk cache algorithm Young-Jin Kim a) Ajou University, 206, World cup-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-749, Republic

More information

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK] Lecture 17 Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] SRAM / / Flash / RRAM / HDD SRAM / / Flash / RRAM/ HDD SRAM

More information

Chapter 12 Wear Leveling for PCM Using Hot Data Identification

Chapter 12 Wear Leveling for PCM Using Hot Data Identification Chapter 12 Wear Leveling for PCM Using Hot Data Identification Inhwan Choi and Dongkun Shin Abstract Phase change memory (PCM) is the best candidate device among next generation random access memory technologies.

More information