Is Buffer Cache Still Effective for High Speed PCM (Phase Change Memory) Storage?


2011 IEEE 17th International Conference on Parallel and Distributed Systems

Is Buffer Cache Still Effective for High Speed PCM (Phase Change Memory) Storage?

Eunji Lee, Daeha Jin, Kern Koh
Dept. of Computer Engineering, Seoul National University, Seoul, Korea
{ejlee, dhjin,

Hyokyung Bahn
Dept. of Computer Science and Engineering, Ewha University, Seoul, Korea

Abstract— Recently, PCM (phase change memory) has emerged as a new storage medium, and there is a bright prospect that PCM will be used as a storage device in the near future. Since the optimistic access time of PCM is expected to be almost identical to that of DRAM, a natural question is whether the traditional buffer cache is still effective for high-speed secondary storage such as PCM. This paper answers this question by showing that the buffer cache is still effective in such environments due to software overhead and the bimodal block reference characteristics. Based on this observation, we present a new buffer cache management scheme appropriate for systems where the speed gap between the cache and storage is small. To this end, we analyze the conditions under which caching gains and identify characteristics of I/O traces that can be exploited in managing the buffer cache for PCM storage.

Keywords- Phase Change Memory; Buffer Cache; File Systems;

I. INTRODUCTION

For decades, the wide speed gap between main memory and hard disks has been a serious performance bottleneck in computer systems. To relieve this problem, operating systems store requested disk blocks in a part of main memory called the buffer cache, thereby servicing subsequent requests directly without accessing slow disk storage. The primary goal of buffer cache management is to minimize the number of disk accesses, as the hard disk is five or six orders of magnitude slower than the DRAM buffer cache.
Recently, high-speed nonvolatile storage technologies such as PCM (Phase Change Memory) are emerging, and there is a bright prospect that PCM will be used as storage, replacing the hard disk or coexisting with it, by 2020 [1]. This is made possible by the rapid advancement of micro-fabrication processes and multi-level cell (MLC) technologies [2, 3, 4]. It is anticipated that the cost of PCM will be no more than 3-5x that of the hard disk (HDD), and its power consumption will be 10x lower than that of HDD. Furthermore, PCM is byte-addressable, and its optimistic access time is expected to be almost identical to that of DRAM. Thus, part of the research community considers PCM a disk-like secondary storage [5, 6, 7, 8], while others use it as part of a DRAM-like memory hierarchy [9-13, 19]. Between these two branches of research, this paper considers PCM a storage device, in line with the studies by Venkataraman et al., Condit et al., etc. [5, 6]. In this setting, a natural question is whether the traditional buffer cache is still effective for high-speed secondary storage such as PCM. The first contribution of this paper is that we answer this question by analyzing the conditions under which caching a block gains when the storage access time is close or identical to that of the buffer cache. Though the device access times of DRAM and PCM may be identical, our empirical analysis shows that accessing a block from PCM storage is 1.1 to 1.3 times slower than accessing it from the DRAM buffer cache due to the software overhead of I/O processing. Furthermore, current hardware technologies indicate that the PCM access time may be slightly longer than that of DRAM [19]. Considering all of these factors, we derive the condition under which caching a block is effective for varying access times of secondary storage.
Our results show that caching a block from PCM benefits I/O performance only when the number of hits during each cache residence is more than 2, and in some cases 3 to 5, depending on the cache miss penalty. The reason is that caching itself requires an additional cost to store the block in the buffer cache. Specifically, if a block is stored in the buffer cache and then delivered to user program space, one more memory copy is needed. To offset this overhead and gain from caching, more than a certain number of subsequent cache hits are needed. In other words, if a block is likely to be referenced fewer times than a certain lower bound during its residence in the cache, it is more effective to exclude the block from caching. We establish this condition by measuring each time component of the file I/O process and then analyzing them. We analyze various file I/O traces and make two prominent observations that can be exploited in managing the buffer cache for PCM storage, which is our second contribution. The first observation is that a large portion of blocks are referenced only once during their residence in the cache, which would degrade I/O performance if they were cached in our environment. Second, we show that blocks referenced more than twice are highly likely to be re-referenced many times in the future. These hot blocks should be the target of buffer caching in our PCM storage environment. Based on these observations, our third contribution is to present a new buffer cache management

scheme appropriate for PCM-based storage systems. Before caching a referenced block, our proposed scheme estimates whether it will be referenced more than twice in the cache. Specifically, we do not cache a block when it is first referenced, and insert it into the cache only after its second reference occurs within a certain time window. To do this, we use a small history buffer that does not store the contents of actual blocks, but only records that the blocks were referenced recently. This is reasonable since we have shown that blocks referenced twice are highly likely to be re-referenced many times in the future. One problem with this scheme is that it incurs a cache miss on the first reference of blocks that will eventually be referenced three or more times, as it does not cache them at the first reference. However, unlike HDD environments, where a storage access costs five or six orders of magnitude more than a cache access, the miss penalty of PCM is only slightly larger than a cache access, so retrieving a block again from PCM incurs a reasonably small cost. Furthermore, as we do not know a priori which blocks will be referenced three or more times, the benefit of filtering out a large number of single-referenced blocks outweighs the cost of accessing PCM one more time. Our final contribution is to extend the algorithm to consider the asymmetric read/write operation cost of PCM. Specifically, the write access time of PCM is expected to be about 8-10 times slower than that of DRAM [14]. Thus, it seems intuitive that caching a write reference always gains, because bypassing a write reference directly incurs an expensive PCM write operation. However, our analysis shows that this is not the case, and filtering the first write reference is also effective in some cases.
The reason is shown through an analysis of various file I/O traces, in which a large portion of write references are made only once during their residence in the cache. Simulation experiments with various I/O traces show that our scheme improves the performance of file systems by 23% on average and by up to 75%. The remainder of this paper is organized as follows. Section II shows the motivation of this research by analyzing various file I/O traces and discussing the new buffer caching condition when the miss penalty is very small. Section III presents a new buffer cache management algorithm for PCM storage. Section IV presents our experimental results obtained through trace-driven simulations to assess the effectiveness of the proposed algorithm. Section V discusses related work. Finally, we conclude this paper in Section VI.

Figure 1. Access time to file system and buffer cache on ramdisk
Figure 2. The minimum number of hits to make caching profitable

II. MOTIVATION

A. Modeling the cache performance in PCM storage

To investigate whether the conventional buffer cache is still effective for fast storage devices, we measured several time components of file system operations in Linux. We modified the ext2 file system and measured the time of directly accessing data in the storage file system, bypassing the buffer cache, and the time of accessing data in the buffer cache. As PCM is not commercially available and the read performance of PCM is similar to that of DRAM, we used a ram-disk consisting of DRAM as the PCM file storage in our measurements. As Figure 1 shows, accessing data from secondary storage takes 1.1 to 1.3 times longer than accessing it from the buffer cache. Though we assume the device access times of DRAM and PCM to be identical, this speed gap arises from the overhead of the file system software layers.
Based on this result, one might conclude that the buffer cache is still effective for file accesses on fast secondary storage whose physical access time is identical to that of main memory. However, we show that this is not always true; certain conditions must be satisfied for caching to be efficient. In the experiments shown in Figure 1, the storage access time T does not include the buffer cache layer overhead. That is, the retrieved block from PCM bypasses the buffer cache and is transferred directly to user memory. If the buffer cache is used, the missed block must be stored in the buffer cache first and then copied to user space. This additional memory copy overhead matters in the PCM storage environment because the time required for one additional memory copy is similar to the PCM access time. This implies that the benefit of caching becomes smaller, or that caching does not gain at all in the worst case, due to the trade-off between the caching overhead and the PCM access time. Thus, for caching to be profitable, the gain from cache hits should be larger than the caching overhead. In other words, a certain number of hits may be needed for a cached block to be beneficial. This differs from slow storage devices such as hard disks, where a single hit on a cached block is always beneficial due to the large difference between the caching overhead and the storage access time.

Figure 3. The ratio of single-reference (zero-hit) blocks during their residence in the cache

For a quantitative comparison, we measure the cache hit time t, the storage access time bypassing the cache T, and the cache miss penalty Tm, which includes the overhead of uploading the requested block from storage into the cache. We then find the condition under which caching is effective in terms of the number of hits. The following expressions represent the time required to service a requested block with the cache (Tcache) and without the cache (Tno_cache), respectively:

Tcache = Tm + (n - 1) * t
Tno_cache = n * T

where n is the number of references to a block during its residence in the cache. Caching is profitable when Tcache is smaller than Tno_cache, which can be expressed as

n > (Tm - t) / (T - t)

With this equation, we can calculate the minimum number of references n necessary for caching a block to be profitable as the storage access time varies. Figure 2 plots this equation. As the graph shows, the number of cache hits required for caching to gain increases dramatically as the performance gap between memory and storage shrinks. Applying this model to a PCM storage system where a read access is only 1.3 times slower than memory, at least two cache hits are required during a block's residence for caching to be beneficial. Thus, blocks accessed fewer than three times are better off bypassing the cache.

B. Analyzing the file I/O traces

Now let us analyze file I/O traces to investigate the hit count distribution of cached blocks. Figure 3 shows the ratio of blocks that are not referenced again before being evicted from the cache, as the cache size varies from 0.1 to 1.0 relative to the total I/O footprint.
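The profitability condition n > (Tm - t) / (T - t) above can be sketched as a small helper that computes the minimum number of references needed for caching to pay off. The function name and the example timings are illustrative assumptions (all times normalized to the cache hit time t = 1; a miss penalty Tm of 1.8 is one plausible value consistent with the 1.3x PCM access ratio plus a caching copy), not measured values:

```python
import math

def min_references_to_profit(T, Tm, t=1.0):
    """Smallest integer n such that caching wins, i.e.
    Tm + (n - 1) * t < n * T, equivalently n > (Tm - t) / (T - t).

    T  : storage access time bypassing the cache
    Tm : cache miss penalty (storage access plus cache upload)
    t  : cache hit time (all times in the same unit)
    """
    bound = (Tm - t) / (T - t)
    return math.floor(bound) + 1  # strictly greater than the bound

# Hypothetical timings: for HDD-like storage (T ~ 10^5 * t) a single
# re-reference already pays off (n = 2), while for near-DRAM storage
# the required number of references grows quickly.
print(min_references_to_profit(T=1e5, Tm=1e5 + 0.8))  # HDD-like
print(min_references_to_profit(T=1.3, Tm=1.8))        # PCM-like
```

With the PCM-like timings above the helper yields n = 3, i.e. two hits per residence, matching the minimum-hit analysis in this section.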
In reality, a cache size of 1.0 corresponds to infinite cache capacity, where all block references in the trace can be cached at the same time. This is an unrealistic condition, but we present it to show the complete trend of the hit count distribution as a function of the cache size. In practice, cache sizes smaller than 0.5 represent most real system situations. As shown in Figure 3, blocks accessed only once (i.e., zero hits) account for a large portion of cached blocks. These single-referenced blocks contribute nothing to cache performance, as they are not re-referenced before being evicted from the cache. Access patterns such as sequential references or large loops whose length exceeds the cache size can cause this situation. We make another important observation for efficient caching in PCM storage environments: blocks that incur hits during their residence in the cache are highly likely to incur multiple hits before they are evicted. Figure 4 shows the ratio of multiple hits as a function of the cache size for each trace. As shown in the figure, multiple hits on cached blocks account for 88-99% of all cache hits. This indicates that blocks referenced more than twice tend to be referenced again in the near future, and these should be the target of our caching. In summary, the hit count of blocks in the buffer cache exhibits a bimodal distribution: most cached blocks incur either no hits or many hits.

Figure 4. The ratio of multiple hits in the cache among total hit references

III. BUFFER CACHE FOR FAST STORAGE

In this section, we present a new buffer cache management scheme for fast storage devices, specifically targeting PCM storage, whose access time is almost identical to the DRAM memory access time.

A. Selective Cache Bypassing Scheme

The principle of caching is to retrieve a block from slow storage and keep it in the cache even after servicing the current request, assuming the block will be requested again in the near future. Usually, a performance gain is achieved even when the cached block is subsequently requested only once during its residence in the cache. However, this is not the case for fast storage devices such as PCM, as shown in Section II. Specifically, if the number of re-references after a block is cached is not large enough to offset the caching cost, caching can degrade performance even though some hits occur in the cache. As the speed gap between storage and memory narrows, the number of cache hits required to cover the caching cost increases. As the relative storage access time varies from 10^5 to 1.1 times the cache access time, the minimum number of hits needed for a net profit changes from 1 to 8. This leads to many more cases in which a cached block eventually becomes non-profitable. As a result, we need to predict such non-profitable blocks, which are unlikely to be re-referenced enough times, and keep their requests out of the cache. To this end, we propose the selective cache bypassing scheme, which does not cache a block when it is first referenced. The block is allowed into the cache only after it is referenced again within a certain time window. To maintain this time window, we use a small history buffer that does not store the contents of actual blocks, but only records that the blocks were referenced recently. The optimal size of the history buffer depends not only on the workload characteristics but also on the actual cache size, so it can serve as a tunable control parameter. Finding the optimal size of the history buffer is beyond the scope of this paper.
As a basic configuration, we set the size of the history buffer equal to that of the actual cache. This is reasonable because a bypassed block itself is not cached, but its history information is maintained as if it were cached, until it is evicted from the history buffer, whose size is identical to the actual cache. Note that maintaining a history buffer of this size has very low overhead because it contains only a small amount of metadata (less than 20 bytes per block), whereas each actual block contains 4KB of data. Now, let us return to the description of the selective cache bypassing scheme. Its motivation was explained in Section II: the hit count distribution of cached blocks is bimodal (i.e., zero hits or many hits). Thus, a second reference within a short time window is a good indicator of whether a block will be accessed many times in the near future. Therefore, bypassing the cache on the first access is effective in discriminating non-profitable blocks and filtering them out of the cache. The benefit of our selective cache bypassing scheme has two aspects. First, the time cost of storing non-profitable blocks in the cache is saved. In addition, our scheme prevents the expensive cache space from being polluted by non-profitable blocks; this space can instead be used for more profitable blocks. However, our scheme has a weakness: it incurs an additional miss for those blocks that are eventually placed in the cache. This could cause significant performance degradation when the miss penalty is large, as with hard disks, but that is not the case in our environment.
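The selective cache bypassing policy described above can be sketched as follows. The class and method names are ours, and the structure is deliberately simplified (block contents are omitted; both the cache and the history buffer are plain LRU lists of block identifiers), so this is a sketch of the policy rather than the actual implementation:

```python
from collections import OrderedDict

class SelectiveBypassCache:
    """Sketch of selective cache bypassing: a block is admitted into the
    cache only on its second reference within the history window."""

    def __init__(self, capacity, history_capacity=None):
        self.capacity = capacity
        # Basic configuration: history buffer sized like the cache.
        self.history_capacity = history_capacity or capacity
        self.cache = OrderedDict()    # block id -> block (4KB data in a real system)
        self.history = OrderedDict()  # block id -> None (metadata only)

    def reference(self, block):
        if block in self.cache:                  # cache hit: refresh LRU position
            self.cache.move_to_end(block)
            return "hit"
        if block in self.history:                # second reference: admit to cache
            del self.history[block]
            self.cache[block] = None
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict LRU block
            return "miss-admit"
        # First reference: bypass the cache, only remember it in the history.
        self.history[block] = None
        if len(self.history) > self.history_capacity:
            self.history.popitem(last=False)     # forget oldest history entry
        return "miss-bypass"
```

A sequential scan thus never pollutes the cache (every access is a "miss-bypass"), while a re-referenced block pays one extra storage access ("miss-admit") before all further accesses become hits.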
To quantify the effectiveness of our selective cache bypassing scheme, we analyze its benefit and cost in terms of time as follows:

Benefit = Tcaching * R
Cost = Tmiss_penalty * (1 - R)

where R represents the ratio of single-accessed blocks, Tcaching the additional time for storing a block in the cache, and Tmiss_penalty the time to retrieve a block from secondary storage. The final gain of our scheme is then Gain = Benefit - Cost. This equation shows that our scheme gains when the benefit from saving the caching cost of single-accessed blocks is larger than the cost of the additional miss penalty caused by bypassing the first access. If the storage access overhead Tmiss_penalty is significantly larger than the caching time cost Tcaching, as with hard disks, our scheme does not gain unless the ratio of single-accessed blocks R is very high. However, if the storage is fast enough, meaning that Tcaching and Tmiss_penalty are similar, our scheme gains even when the portion of single-accessed blocks is relatively small. For example, when the secondary storage access is 1.3 times slower than the buffer cache, as with PCM, our scheme always gains when the ratio of single-accessed blocks exceeds just 23%. Since our hit count analysis (Figure 3) indicates that the ratio of single-accessed blocks is larger than 50% under all practical conditions, we can conclude that our selective bypassing will be beneficial in PCM storage systems. Furthermore, the actual gain will be larger than this analysis suggests, as it does not account for the additional cache space freed by bypassing.

B. Considering write performance of PCM

Another important issue in applying our scheme to PCM is that PCM has asymmetric read and write operation times. A write operation is known to be about 8-10 times slower than a read operation, as shown in Table 1 [14].
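The Gain analysis reduces to a simple break-even ratio: bypassing wins when R exceeds Tmiss_penalty / (Tcaching + Tmiss_penalty). The following is a minimal sketch of that algebra; the example timing values are our own illustrative assumptions chosen to land near the 23% read threshold above, not measured quantities:

```python
def bypass_breakeven_ratio(t_caching, t_miss_penalty):
    """Minimum ratio R of single-accessed blocks for bypassing to gain:

        Gain = t_caching * R - t_miss_penalty * (1 - R) > 0
          <=> R > t_miss_penalty / (t_caching + t_miss_penalty)
    """
    return t_miss_penalty / (t_caching + t_miss_penalty)

# Hypothetical timings: if the extra miss penalty of a bypassed block is
# about 0.3x the caching copy cost (fast, PCM-like storage), the
# threshold is roughly 23%; if it is 3x (an expensive miss, as with a
# slow PCM write), the threshold rises to 75%.
print(round(bypass_breakeven_ratio(1.0, 0.3), 2))
print(round(bypass_breakeven_ratio(1.0, 3.0), 2))
```

Note how the threshold grows toward 1 as the miss penalty dominates the caching cost, which is why bypassing never pays off on hard disks unless nearly every block is single-accessed.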
Due to this feature of PCM, our scheme may not be efficient for write operations, because bypassing a write request incurs an expensive write I/O to PCM. When we apply the same Gain expression to write operations, bypassing becomes beneficial only when the ratio of single-accessed blocks is larger than 75% of the total write operations. This is a very tight condition for bypassing to be effective; in other words, the expected cost is likely to be larger than the expected benefit. Considering this, applying the bypassing scheme to write operations seems impractical. Thus, we basically apply the bypassing scheme to read operations only, to avoid possible performance degradation due to the write overhead of PCM. However, as we will discuss in the next section, empirical results show that bypassing write operations also gains in many cases. The reason is that most write references are made only once. In our traces, more than 75% of writes are single-accessed writes for practical cache sizes in the proxy and varmail workloads. On the other hand, we

cannot observe that bypassing write operations incurs serious performance degradation in the other traces. The reason is that write requests to a block usually follow a read request to the same block. This implies that write operations are rarely the first reference to a block. Thus, even if we apply the bypassing scheme to write operations, bypassing does not happen frequently. In addition, the effect of bypassing write operations is not significant, as most workloads are read-intensive. Among the workloads we used, only one workload, proxy, is write-intensive, and the ratio of read operations is significantly large in the other three workloads, 7x to 37x of write operations.

TABLE 1. DRAM AND PCM PERFORMANCE CHARACTERISTICS
                DRAM    PCM
Read Latency    50ns    50ns
Write Latency   50ns    400~500ns

IV. EXPERIMENTAL RESULTS

In this section, we present performance evaluation results to assess the effectiveness of the selective cache bypassing scheme. Trace-driven simulation is performed to model the buffer caching system with accurate I/O timing of PCM, including software overheads. We collected system call traces with the strace utility while running Filebench applications [18]. Our traces consist of four workloads: proxy server, varmail, web server, and video server. The characteristics of these traces are summarized in Table 2. The performance of the buffer caching schemes is measured by the total I/O time for the given workloads. We compare the performance of our selective cache bypassing scheme with the conventional no-bypassing scheme. The cache replacement policy is LRU in all of our experiments. Figure 5 shows the total I/O time for the two schemes, with the cache size ranging from 0.1 to 1.0 of the maximum cache usage of the program. A cache size of 1.0 means that all block references in the trace can be cached at the same time, and thus no cache replacement is needed.
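The core of such a trace-driven simulation can be sketched in a few lines. The function and the timing values below are illustrative assumptions (all times normalized to the cache hit time, with a 1.3x bypassed storage access and a 1.8x miss penalty consistent with the model in Section II); the actual simulator additionally models the asymmetric write cost and software overheads:

```python
from collections import OrderedDict

def total_io_time(trace, capacity, t=1.0, T=1.3, Tm=1.8, bypass=False):
    """Total I/O time of an LRU buffer cache over a block trace.

    t  : cache hit time
    T  : storage access bypassing the cache
    Tm : miss penalty including the cache-upload copy
    With bypass=True, the first reference to a block is served from
    storage and only recorded in a history buffer of equal capacity.
    """
    cache, history = OrderedDict(), OrderedDict()
    total = 0.0
    for block in trace:
        if block in cache:                       # cache hit
            cache.move_to_end(block)
            total += t
        elif bypass and block not in history:    # first reference: bypass
            history[block] = None
            if len(history) > capacity:
                history.popitem(last=False)
            total += T
        else:                                    # miss: admit into the cache
            history.pop(block, None)
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)
            total += Tm
    return total
```

For a purely sequential trace the bypassing variant avoids every useless cache fill, while for a heavily re-referenced trace it pays one extra storage access per admitted block, which is exactly the trade-off the Gain analysis captures.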
In practical terms, cache sizes smaller than 0.5 represent most real system situations.

TABLE 2. SUMMARY OF WORKLOAD CHARACTERISTICS
Workload      | Total # of distinct block requests | Ratio of ops. (read:write) | Operation counts
proxy         | 11,085                             | 1:2.24                     | 1,461,219
varmail       | 9,…                                | …:1                        | 213,227
web server    | 28,…                               | …:1                        | 1,321,168
video server  | 13,…                               | …:1                        | 4,940,520

As shown in Figure 5, our selective bypassing scheme performs better than the conventional no-bypassing scheme in most cases; the performance gain is 11% on average and up to 36%. In particular, our selective bypassing scheme exhibits excellent performance in the cache size range from 0.1 to 0.5, which represents realistic system environments. With a small cache, since the time that blocks reside in the cache is short, blocks are more likely to be evicted without a hit. In this situation, our bypassing scheme performs well by filtering out the large number of zero-hit blocks, saving a substantial amount of the memory access cost of storing them in the cache. In addition, the bypassing scheme effectively increases the cache size because it reduces the I/O requests entering the cache and prevents the cache from being polluted by sequential accesses. This gain is larger for smaller caches, which cannot fully accommodate the working set of the program. Note that the marginal performance gain per unit of additional cache is very large when the workload suffers from short cache capacity. Specifically, our selective bypassing scheme provides the same performance as the no-bypassing scheme with a several times smaller cache. On the other hand, when the cache size grows beyond 0.5, the performance gap between the two schemes narrows, and their performance finally converges for all traces. The reason is that most references can be accommodated in a cache of size 0.5 irrespective of the cache management scheme.
Figure 5. Total I/O time when storage access is 1.3 times longer than the cache

Now, let us investigate the performance of read/write-aware bypassing schemes for PCM, which has asymmetric read and write costs. We measure the total I/O time for three schemes: R-bypassing, which bypasses read operations only; RW-bypassing, which bypasses both read and write operations; and no-bypassing. The write time of PCM is set to ten times the read time. Figure 6 shows that the R-bypass and RW-bypass schemes perform better than the conventional no-bypassing scheme by 23% and 20% on average, and by up to 75% and 74%, respectively. Now, let us compare the performance of the R-bypass and RW-bypass schemes. We expected R-bypass to perform better than RW-bypass in most cases, but the result is against our expectation. Specifically, RW-bypassing outperforms R-bypassing by up to 13% and 16% in the proxy and varmail workloads, respectively. The reason for this improvement is that those traces include a considerable number of single-accessed writes. To investigate this, we extract the write operations from the workloads and analyze the ratio of single-accessed (zero-hit) blocks in the cache for the write workloads. As seen in Figure 7, the ratio of single-accessed blocks is mostly over 75% in the cases where RW-bypassing performs better than R-bypassing, which is the lower bound that makes write bypassing beneficial. In summary, though write bypassing has a relatively small benefit compared to the cost of the miss penalty, it can enhance performance when a large number of write references are single-referenced. In the web server trace, the R-bypass and RW-bypass schemes exhibit almost identical performance, because the web server trace is read-intensive: read operations are 37 times more frequent than write operations. Thus, write bypassing does not significantly affect the overall I/O performance. The video server trace includes many small loops and has relatively fewer single-accessed blocks than the other traces. For this reason, its performance improvement from bypassing is the smallest among the traces; since the ratio of zero-hit write blocks in the cache is not high, RW-bypass performs worse than R-bypass. As the performance effect of our bypassing scheme varies dynamically with the workload pattern, the cache capacity, and the storage access time, it is necessary to monitor such changes and apply the bypassing scheme adaptively.
We leave this adaptive scheme as future work. Before concluding this section, we briefly discuss the write endurance problem of PCM, which is a challenging issue in using PCM as a practical storage system. Since the maximum number of writes allowed for each PCM cell is limited to 10^7~10^8, the research community has studied reducing the amount of data written to PCM and balancing the write count across PCM cells to extend the overall lifetime of PCM. Though this paper does not focus on this issue, we investigate the effect of our scheme on the write traffic to PCM. Figure 8 shows the total amount written to PCM with our bypassing schemes and the conventional no-bypassing scheme. As seen in the figure, our bypassing schemes reduce writes to PCM significantly compared to the conventional no-bypassing scheme in most cases. Comparing the R-bypass and RW-bypass schemes, R-bypass is superior to RW-bypass in terms of PCM writes in all cases, because write bypassing incurs additional writes to PCM. When the cache size is very large, RW-bypass increases the number of writes compared to no-bypassing, but such cache configurations are not realistic. Unlike RW-bypass, R-bypass always reduces the number of PCM writes irrespective of the cache size and workload.

Figure 6. Performance of bypass cache algorithms considering asymmetric operation cost
Figure 7. The ratio of zero-hit blocks in cache for write workloads

Figure 8. Write amount to PCM storage for three buffer cache algorithms

V. RELATED WORK

A. Phase Change Memory Technology

In this section, we describe the feasibility of PCM-based storage systems. Although several challenges remain to be resolved, such as stability, it is expected that PCM will be used as main memory and/or secondary storage in the near future. This expectation is based on the fast progress in density and the significant benefits in power consumption and performance of replacing DRAM and/or hard disks with PCM. In terms of density, PCM is anticipated to outperform DRAM, and even NAND flash memory. Specifically, DRAM is considered hard to fabricate beyond 40nm, and NAND flash memory has almost reached its scalability limit because it relies on a memory structure that is increasingly difficult to shrink at smaller lithography nodes. In contrast, although current PCM technology is only at 90nm, PCM has already been shown to have stable characteristics down to a 5nm node [2]. Therefore, PCM is considered to promise scalability beyond that of other memory technologies such as NAND or NOR flash memory. In addition to progress in the micro-fabrication process, multi-level cell (MLC) technology is accelerating PCM's density progress. Although most PCM prototypes are produced as single-level cells (SLC), which distinguish only the two states of crystalline and amorphous, recent studies have demonstrated additional intermediate states that enable MLC [3, 4]. MLC stores multiple bits per cell by distinguishing between multiple resistance levels. Due to MLC technology, PCM can provide an order of magnitude higher scalability than other nonvolatile RAMs such as FeRAM and MRAM, which are structurally hard to extend to MLC. For this reason, major semiconductor manufacturers, including Samsung and Intel, have an optimistic outlook on the potential of PCM technology.
Using PCM instead of DRAM or hard disks can bring a large saving in energy consumption as well. The fact that PCM requires only 1/100 of the energy consumption of hard disks is another momentum toward PCM-based storage. Several recent studies consider a hybrid main memory with a large PCM and a small amount of DRAM, trading off performance and energy efficiency [11, 13]. In conclusion, many previous studies have demonstrated that PCM has the potential to be used as large-scale storage. These expectations indicate that designing efficient software for emerging PCM-based storage systems is also required.

B. Software techniques for PCM

Recently, considerable research has attempted to introduce PCM into the conventional memory hierarchy. Most approaches exploit PCM as main memory, replacing or supplementing DRAM to improve performance and energy efficiency. The key issue in these works is how to handle writes efficiently to overcome the long write latency and limited endurance of PCM. Mogul et al. suggested an efficient memory management policy for a hybrid memory system consisting of DRAM and PCM. They proposed a page-attribute-aware memory allocation policy that places read-only pages, such as code segments, in PCM, while loading read/write pages in DRAM so as to keep writes from occurring in PCM [11]. Qureshi et al. also proposed a PCM- and DRAM-based hybrid main memory. They use a small amount of DRAM as a write buffer for PCM to mitigate wear-out and hide the long write latency of PCM [13]. Lee et al. attempt to improve write performance between the last-level on-chip cache and main memory. They proposed two policies, buffer reorganization and partial writes, which track data modifications and write only modified cache lines or words to the PCM array [14, 10]. There have also been studies that aim to overcome the endurance limitation when using PCM as a write-intensive main memory.
Zhou et al. have suggested row shifting and segment swapping techniques to mitigate wear and prolong the lifetime of PCM [12]. Ipek et al. have proposed a dynamically replicated memory scheme that maps two faulty physical pages onto a single logical page, thereby enabling the reuse of pages that contain hard faults [9]. Meanwhile, some file systems that use non-volatile RAM like PCM as final data storage have been suggested. These file systems mostly discuss efficient ways of using a small amount of non-volatile RAM. PRAMFS is designed to store frequently accessed or critical data in a non-volatile RAM block, so as to enable fast reboot and system survival from crashes [8]. It is mounted on a block of non-volatile RAM separate from normal system memory. MRAMFS [16] and the NEB file system [7] have been suggested to improve the space efficiency of expensive storage based on non-volatile RAM. MRAMFS saves space by applying compression to metadata, while the NEB file system improves space efficiency by using extent-based file management. In contrast, as the density of non-volatile RAM like PCM progresses rapidly, recent studies consider a scalable PCM-based storage system as a replacement for conventional storage devices like hard disks or flash memory. Some studies expect fast and scalable non-volatile RAM to bring about a unified memory architecture where main memory and storage are served by a single memory device. Baek et al. have designed and implemented a software layer that supports both file objects and memory objects for a unified memory system [17]. Moreover, studies on file systems that consider reliability as well as performance have been conducted. Condit et al. have suggested BPFS, a new copy-on-write file system for byte-addressable storage [6]. By exploiting byte-addressability, BPFS performs in-place writes when the update size is smaller than the atomic operation unit, thereby breaking the recursive path-node updates of copy-on-write behavior. Venkataraman et al. have suggested a novel data structure to store data quickly and efficiently in an NVM-based single-level store [5].

VI. CONCLUSION

As high-performance storage such as PCM emerges, the effectiveness of the traditional buffer cache should be reinvestigated. This paper showed that the buffer cache is still effective even when the storage is nearly as fast as main memory, due to software overhead. However, since the gain of caching becomes small, caching is beneficial only for blocks that will be frequently hit in the cache. We observed that a large portion of cached blocks are never hit, and that the hit counts of blocks in the buffer cache exhibit a bimodal distribution: no hits or many hits.
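A bimodal hit distribution suggests deferring cache insertion until a block demonstrates reuse. The following is a minimal sketch, under our own assumptions and with hypothetical names, of a cache that bypasses a block on its first access and inserts it only when the block is re-referenced within a recent-access window:

```python
import collections

# Sketch (hypothetical names) of a bypass-on-first-access buffer cache:
# a block is admitted only if it is re-referenced within a window of
# recent first accesses, using reuse as a predictor of future hits.

class BypassCache:
    def __init__(self, capacity: int, window: int):
        self.capacity = capacity
        self.window = window                       # how long a first access is remembered
        self.cache = collections.OrderedDict()     # cached blocks, LRU order
        self.seen = collections.OrderedDict()      # bypassed block -> access time
        self.clock = 0

    def access(self, block) -> bool:
        """Process one block reference; return True on a cache hit."""
        self.clock += 1
        if block in self.cache:
            self.cache.move_to_end(block)          # refresh LRU position
            return True
        # Expire first-access records older than the window.
        while self.seen and self.clock - next(iter(self.seen.values())) > self.window:
            self.seen.popitem(last=False)
        if block in self.seen:                     # re-reference: admit to cache
            del self.seen[block]
            self.cache[block] = None
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)     # evict LRU block
        else:                                      # first access: bypass the cache
            self.seen[block] = self.clock
        return False

c = BypassCache(capacity=2, window=10)
assert c.access("x") is False   # first access: bypassed, not cached
assert c.access("x") is False   # re-reference: admitted, but still a miss
assert c.access("x") is True    # subsequent access hits the cache
```

Single-use blocks under this policy never enter the cache at all, so they cannot evict blocks from the many-hits mode of the distribution.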
Based on this observation, we presented a new buffer cache management scheme called selective cache bypassing, which does not cache a block on its first access but caches it when it is re-referenced within a time window, regarding re-access as an indicator of many future cache hits. Experimental results showed that our scheme outperforms the conventional no-bypassing scheme in a PCM storage system by 23% on average and up to 75%.

ACKNOWLEDGMENT

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No and No ).

REFERENCES

[1] R. F. Freitas and W. W. Wilcke, "Storage-class memory: the next storage system technology," IBM Journal of Research and Development, Vol. 52, No. 4.
[2] C. D. Wright, M. M. Aziz, M. Armand, S. Senkader, and W. Yu, "Can We Reach Tbit/sq.in. Storage Densities With Phase-Change Media?" European Phase Change and Ovonics Symposium (EPCOS).
[3] F. Bedeschi et al., "A multi-level-cell bipolar-selected phase-change memory," International Solid-State Circuits Conf.
[4] T. Nirschl et al., "Write strategies for 2 and 4-bit multi-level phase-change memory," International Electron Devices Meeting.
[5] S. Venkataraman et al., "Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory," 9th USENIX Conference on File and Storage Technologies (FAST).
[6] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP).
[7] S. Baek, C. Hyun, J. Choi, D. Lee, and S. H. Noh, "Design and Analysis of a Space Conscious Nonvolatile-RAM File System," IEEE Region 10 Conference (TENCON).
[8] PRAMFS:
[9] E. Ipek, J. Condit, E. B. Nightingale, D. Burger, and T. Moscibroda, "Dynamically Replicated Memory: Building Reliable Systems from Nanoscale Resistive Memories," Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[10] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Phase Change Memory Architecture and the Quest for Scalability," Communications of the ACM, Vol. 53, Issue 7.
[11] J. C. Mogul, E. Argollo, M. Shah, and P. Faraboschi, "Operating system support for NVM+DRAM hybrid main memory," 12th Workshop on Hot Topics in Operating Systems (HotOS XII).
[12] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main memory using phase change memory technology," 36th International Symposium on Computer Architecture (ISCA).
[13] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," 36th International Symposium on Computer Architecture (ISCA).
[14] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," 36th International Symposium on Computer Architecture (ISCA).
[15] Numonyx, "Phase Change Memory: A new memory to enable new memory usage models," White paper.
[16] N. K. Edel, D. Tuteja, E. L. Miller, and S. A. Brandt, "MRAMFS: A Compressing File System for Non-Volatile RAM," 12th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems.
[17] S. Baek, K. Sun, J. Choi, E. Kim, D. Lee, and S. H. Noh, "Taking advantage of storage class memory technology through system software support," Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA).
[18]
[19] S. Lee, H. Bahn, and S. H. Noh, "Characterizing Memory Write References for Efficient Management of Hybrid PCM and DRAM Memory," 19th IEEE Int'l Symp. on Modeling, Analysis, and Simulation of Computer and Telecomm. Systems (MASCOTS).


More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

Optimizing Translation Information Management in NAND Flash Memory Storage Systems

Optimizing Translation Information Management in NAND Flash Memory Storage Systems Optimizing Translation Information Management in NAND Flash Memory Storage Systems Qi Zhang 1, Xuandong Li 1, Linzhang Wang 1, Tian Zhang 1 Yi Wang 2 and Zili Shao 2 1 State Key Laboratory for Novel Software

More information

Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives

Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Chao Sun 1, Asuka Arakawa 1, Ayumi Soga 1, Chihiro Matsui 1 and Ken Takeuchi 1 1 Chuo University Santa Clara,

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Exploring the Potential of Phase Change Memories as an Alternative to DRAM Technology

Exploring the Potential of Phase Change Memories as an Alternative to DRAM Technology Exploring the Potential of Phase Change Memories as an Alternative to DRAM Technology Venkataraman Krishnaswami, Venkatasubramanian Viswanathan Abstract Scalability poses a severe threat to the existing

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

A Caching-Oriented FTL Design for Multi-Chipped Solid-State Disks. Yuan-Hao Chang, Wei-Lun Lu, Po-Chun Huang, Lue-Jane Lee, and Tei-Wei Kuo

A Caching-Oriented FTL Design for Multi-Chipped Solid-State Disks. Yuan-Hao Chang, Wei-Lun Lu, Po-Chun Huang, Lue-Jane Lee, and Tei-Wei Kuo A Caching-Oriented FTL Design for Multi-Chipped Solid-State Disks Yuan-Hao Chang, Wei-Lun Lu, Po-Chun Huang, Lue-Jane Lee, and Tei-Wei Kuo 1 June 4, 2011 2 Outline Introduction System Architecture A Multi-Chipped

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

The Need for Consistent IO Speed in the Financial Services Industry. Silverton Consulting, Inc. StorInt Briefing

The Need for Consistent IO Speed in the Financial Services Industry. Silverton Consulting, Inc. StorInt Briefing The Need for Consistent IO Speed in the Financial Services Industry Silverton Consulting, Inc. StorInt Briefing THE NEED FOR CONSISTENT IO SPEED IN THE FINANCIAL SERVICES INDUSTRY PAGE 2 OF 5 Introduction

More information

Evaluating Phase Change Memory for Enterprise Storage Systems

Evaluating Phase Change Memory for Enterprise Storage Systems Hyojun Kim Evaluating Phase Change Memory for Enterprise Storage Systems IBM Almaden Research Micron provided a prototype SSD built with 45 nm 1 Gbit Phase Change Memory Measurement study Performance Characteris?cs

More information

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University NAND Flash-based Storage Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics NAND flash memory Flash Translation Layer (FTL) OS implications

More information

Solid State Drive (SSD) Cache:

Solid State Drive (SSD) Cache: Solid State Drive (SSD) Cache: Enhancing Storage System Performance Application Notes Version: 1.2 Abstract: This application note introduces Storageflex HA3969 s Solid State Drive (SSD) Cache technology

More information

Design Considerations for Using Flash Memory for Caching

Design Considerations for Using Flash Memory for Caching Design Considerations for Using Flash Memory for Caching Edi Shmueli, IBM XIV Storage Systems edi@il.ibm.com Santa Clara, CA August 2010 1 Solid-State Storage In a few decades solid-state storage will

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Phase Change Memory: Replacement or Transformational

Phase Change Memory: Replacement or Transformational Phase Change Memory: Replacement or Transformational Hsiang-Lan Lung Macronix International Co., Ltd IBM/Macronix PCM Joint Project LETI 4th Workshop on Inovative Memory Technologies 06/21/2012 PCM is

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

LevelDB-Raw: Eliminating File System Overhead for Optimizing Performance of LevelDB Engine

LevelDB-Raw: Eliminating File System Overhead for Optimizing Performance of LevelDB Engine 777 LevelDB-Raw: Eliminating File System Overhead for Optimizing Performance of LevelDB Engine Hak-Su Lim and Jin-Soo Kim *College of Info. & Comm. Engineering, Sungkyunkwan University, Korea {haksu.lim,

More information

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson A Cross Media File System Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson 1 Let s build a fast server NoSQL store, Database, File server, Mail server Requirements

More information

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook The Memory Hierarchy So far... We modeled the memory system as an abstract array

More information

1110 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 7, JULY 2014

1110 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 7, JULY 2014 1110 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 33, NO. 7, JULY 2014 Adaptive Paired Page Prebackup Scheme for MLC NAND Flash Memory Jaeil Lee and Dongkun Shin,

More information

Scalable Many-Core Memory Systems Lecture 3, Topic 2: Emerging Technologies and Hybrid Memories

Scalable Many-Core Memory Systems Lecture 3, Topic 2: Emerging Technologies and Hybrid Memories Scalable Many-Core Memory Systems Lecture 3, Topic 2: Emerging Technologies and Hybrid Memories Prof. Onur Mutlu http://www.ece.cmu.edu/~omutlu onur@cmu.edu HiPEAC ACACES Summer School 2013 July 17, 2013

More information

The Unwritten Contract of Solid State Drives

The Unwritten Contract of Solid State Drives The Unwritten Contract of Solid State Drives Jun He, Sudarsun Kannan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Department of Computer Sciences, University of Wisconsin - Madison Enterprise SSD

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Operating Systems Design Exam 2 Review: Spring 2011

Operating Systems Design Exam 2 Review: Spring 2011 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

William Stallings Computer Organization and Architecture 10 th Edition Pearson Education, Inc., Hoboken, NJ. All rights reserved.

William Stallings Computer Organization and Architecture 10 th Edition Pearson Education, Inc., Hoboken, NJ. All rights reserved. + William Stallings Computer Organization and Architecture 10 th Edition 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. 2 + Chapter 4 Cache Memory 3 Location Internal (e.g. processor registers,

More information