Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Yujuan Tan (1), Zhichao Yan (2), Dan Feng (2), E. H.-M. Sha (1,3)
(1) School of Computer Science & Technology, Chongqing University
(2) School of Computer Science & Technology, Huazhong University of Science & Technology
(3) Department of Computer Science, University of Texas at Dallas
{tanyujuan, jarod2046, edwinsha}@gmail.com, dfeng@hust.edu.cn

Abstract—Data deduplication is a lossless compression technology that replaces redundant data chunks with pointers to already-stored copies. Due to this intrinsic data-elimination feature, deduplication de-linearizes the data placement and forces the data chunks that belong to the same data object to be divided into multiple separate parts. Our preliminary study finds that this de-linearization of data placement weakens the data spatial locality that some deduplication approaches rely on to improve data read performance, deduplication throughput, and deduplication efficiency, and thus significantly affects the overall deduplication performance. In this paper, we first analyze the negative effect of the de-linearization of data placement on deduplication performance with examples and experimental evidence, and then propose an effective approach that reduces the de-linearization of data placement while sacrificing only a small amount of compression ratio. The experimental evaluation driven by real-world datasets shows that our proposed approach effectively reduces the de-linearization of data placement and enhances the data spatial locality, which significantly improves the deduplication throughput, deduplication efficiency, and data read performance at the cost of only a small loss in compression ratio.

I. INTRODUCTION

Data deduplication is a lossless compression technology that has been widely used in large-scale primary and secondary storage systems. It breaks data streams into a series of data chunks and removes the redundant ones to save storage space. However, due to the removal of the redundant data chunks, data deduplication de-linearizes the data placement and deteriorates the data layout, resulting in degraded deduplication performance.

The data layout describes the placement of the data chunks stored in a storage system. In deduplication storage systems, there are two kinds of data chunks: the new unique data chunks, which are written to disk sequentially, and the redundant data chunks, which are removed and replaced by links to already-stored copies. The redundant chunks were stored by preceding data streams and therefore occupy separate locations, so they cannot be stored together with the new unique chunks. This redundancy elimination de-linearizes the data placement and prevents the data chunks from being stored in the same order in which they appear in the data stream, which significantly weakens the spatial locality and affects the data read performance, deduplication throughput, data availability, and so on. For example, during data restores in the case of disaster recovery, many disk seeks are needed for data reconstruction, since the data chunks that belong to the same file or directory are divided into multiple separate parts and cannot be retrieved together. In the worst case, one disk seek is needed for every single chunk, which significantly degrades the data read performance.
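To make the de-linearization concrete, the following minimal write-path sketch (our own simplified illustration with hypothetical names and structures, not the system studied in this paper) shows how a later stream's duplicate chunks resolve to containers written by earlier streams, so its chunks end up scattered rather than laid out in stream order:

```python
import hashlib

# Hypothetical in-memory structures, for illustration only.
fingerprint_index = {}   # chunk fingerprint -> (container_id, offset) of the stored copy
containers = []          # append-only log of containers holding unique chunks

def write_stream(chunks):
    """Deduplicate one stream of chunks; return the recipe needed to rebuild it."""
    recipe = []
    container = []                       # unique chunks of this stream
    container_id = len(containers)
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in fingerprint_index:      # redundant chunk: keep only a pointer
            recipe.append(fingerprint_index[fp])
        else:                            # unique chunk: append to this stream's container
            location = (container_id, len(container))
            fingerprint_index[fp] = location
            container.append(chunk)
            recipe.append(location)
    containers.append(container)
    return recipe

write_stream([b"A", b"B", b"C"])          # first stream: all chunks unique, stored in order
print(write_stream([b"A", b"X", b"C"]))   # [(0, 0), (1, 0), (0, 2)]: placement de-linearized
```

Reading the second stream back now touches two containers instead of one, and the effect compounds as more streams share data with their predecessors.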
Moreover, the weakening of the spatial locality due to the de-linearization of data placement has a significant negative effect on the deduplication approaches that use spatial locality to improve deduplication throughput and efficiency. Data deduplication faces a significant disk bottleneck because, under limited RAM, it must fetch the chunk index from disk to RAM page by page for redundancy identification, which throttles the deduplication process and degrades the deduplication throughput and data write performance. Existing well-known and widely used deduplication approaches, such as DDFS [1] and SiLo [2], mainly focus on exploiting the data spatial locality (also called duplicate locality) to fetch useful parts of the chunk index from disk to memory in batches and thereby alleviate the disk bottleneck. SiLo further exploits the spatial locality to improve the deduplication throughput while sacrificing little deduplication efficiency (i.e., compression ratio). To the best of our knowledge, most researchers are interested in exploiting the spatial locality that emerges in the backup stream to improve deduplication performance, but few have paid attention to the fact that this spatial locality gets weaker and weaker as the amount of deduplicated data (i.e., the new unique data chunks) grows and the data placement becomes more de-linearized, which in turn degrades the deduplication performance.

In this paper, we focus on reducing the de-linearization of data placement and enhancing the spatial locality to improve the deduplication performance, including the deduplication throughput, deduplication efficiency, and data read performance. Our contributions are summarized by the following three key points.

Firstly, we analyze the negative effect of the de-linearization of data placement on the spatial locality that can be used for improving the deduplication throughput and data read performance. Secondly, our experimental study finds that the spatial locality gets worse with the increasing amount of deduplicated data and the de-linearization of data placement, which makes existing deduplication approaches that exploit spatial locality to improve deduplication performance less effective. Finally, we propose an effective approach to reduce the de-linearization of data placement and to enhance the spatial locality, which effectively improves the deduplication throughput, deduplication efficiency, and data read performance, as shown by the experimental results with real-world datasets.

The rest of this paper is organized as follows. Section II analyzes the negative effect of the de-linearization of data placement on the deduplication performance. Section III proposes an efficient approach to reduce the de-linearization of data placement. Section IV evaluates our proposed approach with real-world datasets, and Section V concludes the paper.

II. BACKGROUND AND MOTIVATION

During the data deduplication process, the data chunks are divided into two categories: the deduplicated-data chunks (i.e., new unique data chunks) and the redundant data chunks. To maximize the deduplication throughput and data write performance, the deduplicated-data chunks are written to and stored on disk sequentially. Therefore, if there are no redundant data chunks to be removed, all of the data chunks will be stored on disk in the same order in which they appear in the data stream. However, due to the removal of redundant chunks across multiple data objects (a data object denotes the super-chunk composed of multiple data chunks, such as one data stream, one file directory, or a single file), the new unique data chunks are separated from the redundant data chunks, and the locations of the redundant data chunks are determined by the data objects that first wrote them to the deduplication storage system, thus de-linearizing the data placement and weakening the data spatial locality. Furthermore, as the amount of stored deduplicated-data chunks grows and the data placement becomes more de-linearized, the spatial locality that emerges in the backup stream gets weaker and weaker, which significantly affects the overall deduplication performance. In this section, we discuss how the de-linearization of data placement affects the spatial locality and what the negative effects of weakened spatial locality are.

Fig. 1: An Example of One File that Has N Data Fragments.

A. The Spatial Locality and The De-linearization of The Data Placement

Spatial locality means that if a particular location is referenced at a particular time, then nearby locations will likely be referenced in the near future. In deduplication storage systems, this concept of spatial locality is very useful both for the deduplication process in data writing and for the data reconstruction in data reading. In the deduplication process, the spatial locality is also called duplicate locality, which denotes that the data chunks near a duplicate chunk are likely to be duplicates with high probability [1].
By creating and maintaining on the disk storage this duplicate locality that emerges in the data stream, the nearby duplicate data chunks can be fetched to RAM in advance for redundancy identification when one duplicate chunk is found, thus avoiding disk accesses and improving the deduplication throughput. For data reconstruction in data reading, the spatial locality can be interpreted as follows: if one data chunk is read, the nearby chunks will be read in the very near future. If this spatial locality is available for data reading, the nearby chunks can be read in batch and many disk seeks can be avoided, improving the data read performance.

Unfortunately, for either the deduplication process or data reconstruction, the existence of such spatial locality at all times and in all places is an ideal state in deduplication storage systems. For the deduplication process, the spatial locality can be created and maintained for the initial backup sessions. But when a subsequent backup session that shares some redundant chunks with the preceding ones arrives at the system, its redundant data chunks are removed and only the new unique data chunks are written to and stored on disk sequentially. Thus it is impossible to store together all the data chunks that compose this backup session, and impossible to fully create and maintain on the disk storage the spatial locality that emerges in this data stream. Furthermore, as more data streams arrive at the system and the amount of deduplicated data grows, the spatial locality gets much weaker due to the de-linearization of data placement, and the deduplication throughput gained by exploiting spatial locality gradually degrades. Section II-B1 presents experimental evidence to verify this outcome.

For data reading, the spatial locality used for data reconstruction shows the same tendency under the de-linearization of data placement as that used for redundancy identification. Taking Fig. 1 as an example (Fig. 1 depicts the data layout of a file with N separate parts stored on disk, with the chunk metadata in front and the actual chunk data behind it), its N-1 data parts, from Part.1 to Part.N-1, are all shared with other files and are not stored together with Part.N. Thus, when reading this file from disk, about N disk seeks are needed, assuming

that these separate parts are far apart from each other and no two of them can be read together in one disk seek. The data reading time $F_{read}$ can be calculated as

$F_{read} = N \times Time_{seek} + f_{size} / W_{seq}$   (1)

where $Time_{seek}$ denotes the time required for each disk seek, $f_{size}$ represents the size of the file being read, and $W_{seq}$ denotes the sequential read bandwidth. But if this file were not deduplicated and all of its data chunks were stored together with Part.N, its reading time $F'_{read}$ would be only $1 \times Time_{seek} + f_{size} / W_{seq}$. Ignoring their common term $f_{size} / W_{seq}$, $F_{read} - f_{size} / W_{seq}$ is N times larger than $F'_{read} - f_{size} / W_{seq}$. Moreover, as the amount of stored deduplicated data increases, later data objects will share more redundant data with the preceding ones; the logically contiguous data chunks will thus be scattered across more disk locations, the spatial locality available for data reading will get even worse, and so will the data read performance.

B. Experimental Evidence

In existing deduplication storage systems, the exploitation of the spatial locality (i.e., duplicate locality) that emerges in backup streams mainly serves two research goals: one is to alleviate the disk bottleneck, as in DDFS and Sparse Indexing, and the other is to sacrifice a reasonable amount of deduplication efficiency while improving the deduplication throughput, as in SiLo. Unfortunately, the continual removal of redundant chunks and the de-linearization of data placement by data deduplication weaken the duplicate locality, especially for long-term data backup and retention, which significantly affects the deduplication throughput and deduplication efficiency.

1) The Degradation of Deduplication Throughput: In our preliminary experimental study, we implemented the deduplication approach proposed by DDFS [1] and evaluated its deduplication throughput driven by real-world datasets. Fig. 2 depicts the average deduplication throughput obtained from 20 full backup generations of one author's file system, about 647 GB of data in total. As shown in this figure, the deduplication throughput decreases as the backup generations increase, from 213 MB/s for generation 1 to only 110 MB/s for generation 20, which is consistent with our intuition that the deduplication throughput degrades as the amount of deduplicated-data chunks grows.

Fig. 2: The degradation of the deduplication throughput.
Fig. 3: The degradation of the deduplication efficiency.

2) The Degradation of Deduplication Efficiency: In addition to improving the deduplication throughput, some researchers exploit the duplicate locality in order not to lose too much deduplication efficiency while improving the deduplication throughput, such as SiLo [2]. SiLo groups the data chunks into segments, which are then grouped into larger blocks, and the redundancy identification is based on similar segments instead of the full chunk index under limited available RAM capacity. When similar segments are found, SiLo not only checks the chunk redundancy among those similar segments but also checks the blocks that those segments belong to, thus improving the deduplication efficiency.
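As a rough illustration of this segment-similarity idea, the sketch below locates a likely-similar stored segment with a small sampled index instead of the full chunk index. It is a sketch under our own assumptions (using each segment's minimum fingerprint as its representative; all names are hypothetical), not SiLo's actual design:

```python
# Maps a representative fingerprint to the id of the stored segment it samples.
similarity_index = {}

def representative(segment_fps):
    """Pick one representative fingerprint per segment (here: the minimum)."""
    return min(segment_fps)

def index_segment(segment_id, segment_fps):
    """Register a stored segment so later similar segments can find it."""
    similarity_index[representative(segment_fps)] = segment_id

def find_similar_segment(segment_fps):
    """Return the id of a stored segment likely to share chunks, or None."""
    return similarity_index.get(representative(segment_fps))
```

A match only pays off if the matched segment and its enclosing block still hold most of the incoming segment's duplicates in one place; as the placement de-linearizes, that assumption erodes.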
The number of redundant chunks that can be found among the blocks depends heavily on the duplicate locality of the redundant chunks existing in those blocks. The weakening of the duplicate locality reduces the number of redundant data chunks found in these blocks and thus degrades the deduplication efficiency. Using a dataset of about 20 incremental backup generations of one author's file system, as in SiLo, we evaluated its deduplication efficiency. The deduplication efficiency is defined as the amount of redundant data removed by SiLo divided by the amount of redundant data actually existing in the dataset. Fig. 3 shows the deduplication efficiency obtained over the 20 backup generations. As seen from the results, the deduplication efficiency decreases as the backup generations increase, which reveals that the duplicate locality is gradually weakened and that some redundant data chunks no longer reside in the nearby blocks where they would be found and removed.

III. REDUCING THE DE-LINEARIZATION OF DATA PLACEMENT

As analyzed in Section II, the removal of redundant data and the de-linearization of data placement weaken the data spatial locality that is used for improving the deduplication throughput, deduplication efficiency, and data read performance, which significantly affects the overall deduplication performance. Nevertheless, because the removal of redundant data is the primary concern

of data deduplication for saving storage space, and because the de-linearization of data placement cannot be completely avoided, we focus on reducing the de-linearization of data placement rather than eliminating it. In this section, we propose an effective method, called DeFrag, which aims to reduce the de-linearization of data placement and enhance the data spatial locality to improve the deduplication performance.

A. Key Idea

The key idea of DeFrag is to write some redundant data to the disk storage instead of removing it, thus reducing the de-linearization of data placement. However, because writing redundant data consumes extra storage space and works against the primary goal of data deduplication, the key to implementing DeFrag is to determine which redundant chunks should and should not be removed so that the spatial locality is enhanced at the cost of only a small loss in compression ratio. Motivated by the analysis in Section II that the degraded deduplication performance is mainly caused by the weak spatial locality resulting from the de-linearization of data placement, DeFrag selects the unremoved redundant chunks according to a metric called the Spatial Locality Level (SPL for short). In DeFrag, the Spatial Locality Level measures the data spatial locality of chunk groups and is calculated dynamically during the deduplication process. If the dynamically calculated Spatial Locality Level is lower than a preset value, DeFrag writes the corresponding chunks to the disk storage instead of removing them, thus reducing the de-linearization of the data placement and enhancing the spatial locality.

B. Design and Implementation

DeFrag mainly focuses on selecting the redundant chunks that are not to be removed. It works after all the redundant data chunks, and the locations of their already-stored copies, have been identified for each data stream. For each incoming data stream, DeFrag breaks it into a series of chunks and groups multiple contiguous chunks into segments. Each segment varies from 0.5 MB to 2 MB, depending on the chunk content. The segment is the processing unit for reading and writing data chunks. After finding all the redundant chunks and their corresponding locations, DeFrag calculates the spatial locality for each segment that contains redundant chunks. In DeFrag, we define the Spatial Locality Level SPL(m,k) as

$SPL(m,k) = \frac{|Seg_m \cap Seg_k|}{|Seg_m|}$   (2)

where $Seg_m$ is the incoming segment, $Seg_k$ is a segment that has already been stored on disk, and $Seg_m \cap Seg_k$ represents the data chunks shared by $Seg_m$ and $Seg_k$. $SPL(m,k)$ measures the spatial locality of the data chunks in $Seg_m$ with respect to those in $Seg_k$ that can be fetched together by one disk seek. If $Seg_m \cap Seg_k = Seg_m$, then $SPL(m,k) = 1$, which means that $Seg_m$ has strong spatial locality with respect to $Seg_k$ and all of the chunks in $Seg_m$ can be retrieved through one disk seek by reading $Seg_k$. If $SPL(m,k)$ is smaller than a preset value $\alpha$, which indicates that the corresponding spatial locality is weak, DeFrag does not remove the redundant chunks in $Seg_m \cap Seg_k$ but writes them to the disk storage together with the new unique chunks in $Seg_m$, thus reducing the de-linearization of $Seg_m$ and enhancing the spatial locality. The preset value $\alpha$ can be adjusted to trade off the spatial locality improvement against the sacrificed compression ratio for different datasets. Due to space constraints, we do not depict the architecture and the data flow path of DeFrag in this paper.
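The selection step can be summarized by the following minimal sketch (our own illustration of the SPL test, not DeFrag's actual implementation; it assumes the duplicate-detection stage has already mapped the incoming segment's chunks to the stored segments that hold their copies, and all names are hypothetical):

```python
ALPHA = 0.1   # preset spatial-locality threshold (the alpha value used in Section IV)

def spl(incoming_fps, stored_fps):
    """Spatial Locality Level SPL(m, k): the fraction of the incoming segment's
    chunks that a single disk seek on the stored segment would bring back."""
    shared = set(incoming_fps) & set(stored_fps)
    return len(shared) / len(incoming_fps)

def select_chunks_to_keep(incoming_fps, candidate_segments):
    """Return the duplicate fingerprints that should NOT be removed.

    candidate_segments maps a stored segment id to the fingerprints it holds;
    it is assumed to come from the preceding duplicate-detection step."""
    keep = set()
    for stored_fps in candidate_segments.values():
        shared = set(incoming_fps) & set(stored_fps)
        if shared and spl(incoming_fps, stored_fps) < ALPHA:
            # Weak locality: rewriting these duplicates next to the new unique
            # chunks avoids a far-away disk seek when the segment is read back.
            keep |= shared
    return keep
```

Chunks returned by select_chunks_to_keep are written together with the new unique chunks of the incoming segment, while all other duplicates are deduplicated as usual, so the extra space consumed is confined to segments whose on-disk locality is already poor.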
IV. EXPERIMENTAL EVALUATION

We have implemented DeFrag based on the deduplication approach proposed in DDFS and evaluated its deduplication throughput, deduplication efficiency, and data read performance. The dataset used for the evaluation is generated from 66 backups of the file systems of five graduate students in our research group, totaling about 1.72 TB of data. To assess DeFrag's benefits, we further compare the experimental results with those obtained from the DDFS and SiLo prototype systems, as shown in this section. As a side note, we refer to these prototypes as DDFS-Like and SiLo-Like, since we implemented their deduplication approaches ourselves rather than borrowing their developed prototype systems for the evaluation, and due to space restrictions we only show the experimental results of DeFrag with α set to 0.1.

A. Deduplication Throughput

As in deduplication approaches such as DDFS, the deduplication throughput is decreased by the de-linearization of data placement and the weakening of the spatial locality. In this subsection, we focus on the benefits of DeFrag in improving the deduplication throughput by reducing the de-linearization of data placement. Fig. 4 compares the average deduplication throughput of DeFrag, DDFS-Like, and SiLo-Like over the 66 backup generations. As shown by the results, the deduplication throughput of DDFS-Like is much lower than that of DeFrag. By reducing the de-linearization of data placement, DeFrag achieves deduplication throughput comparable to that of SiLo. Moreover, when the backup stream has very good spatial locality, the deduplication throughput of DeFrag is sometimes even higher than that of SiLo, such as in backup generations 1, 2, 3, 4, 5, 41, and 42. This is because, when the spatial locality is very good, DeFrag can continually prefetch the nearby duplicate chunks into RAM for redundancy identification with one disk seek once a duplicate chunk is found, while SiLo wastes some disk seeks on irrelevant chunks since its prefetching policy is based on similar segments and is therefore probabilistic.
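The throughput gap can be understood with a simple seek-counting model of locality-preserved fingerprint prefetching (our own simplified illustration with hypothetical names, not the DDFS-Like or DeFrag code): every lookup that misses the RAM cache costs one disk seek, and each seek prefetches all fingerprints of the stored segment that was hit, so well co-located duplicates amortize one seek over many subsequent lookups.

```python
def count_seeks(lookup_fps, stored_segments):
    """Count the disk seeks a sequence of fingerprint lookups incurs when every
    on-disk hit prefetches all fingerprints of the stored segment holding it.
    stored_segments is a list of fingerprint lists, one list per stored segment;
    cache eviction is omitted to keep the sketch short."""
    fp_to_segment = {fp: i for i, fps in enumerate(stored_segments) for fp in fps}
    cache, seeks = set(), 0            # fingerprints already prefetched into RAM
    for fp in lookup_fps:
        if fp in cache:
            continue                   # served from RAM: no disk seek
        seg = fp_to_segment.get(fp)
        if seg is None:
            continue                   # new unique chunk: nothing to prefetch
        seeks += 1                     # one seek to read that segment's fingerprints
        cache.update(stored_segments[seg])
    return seeks

# Duplicates packed into few stored segments -> one seek amortized over many lookups.
# Duplicates scattered across many segments -> close to one seek per duplicate chunk.
```

Under this model, a stream whose duplicates are scattered over many stored segments triggers nearly one seek per duplicate, which matches the degradation of DDFS-Like observed above.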

Fig. 4: The comparison of the deduplication throughput.
Fig. 5: The comparison of the deduplication efficiency.

B. Deduplication Efficiency

In order to achieve high deduplication throughput, both DeFrag and SiLo leave some redundant data unremoved during the deduplication process. DeFrag does not remove some redundant data in order to reduce the de-linearization of data placement, while SiLo probabilistically ignores the redundant data that exists among less-similar segments. Fig. 5 plots the deduplication efficiency of DeFrag and SiLo over the 66 backup generations, where the deduplication efficiency is defined as the redundant data removed by DeFrag or SiLo divided by the redundant data actually existing in the dataset. Moreover, to fairly compare the amount of redundant data kept by DeFrag and SiLo, we only count the redundant data in the segments that share part of their redundant chunks with others, ignoring the redundant chunks in the segments whose duplicates share all of their redundant chunks, since those chunks would be removed by both DeFrag and SiLo. As seen from the results, the amount of redundant data kept by DeFrag is much less than that kept by SiLo. By backup generation 66, SiLo has left 12% of the redundant data unremoved while DeFrag has left only 4%, which indicates that DeFrag achieves higher deduplication efficiency than SiLo even though their deduplication throughput is comparable, as shown in Fig. 4.

Fig. 6: The comparison of the data read performance.

C. Data Read Performance

As analyzed in Section II, reducing the de-linearization of data placement can enhance the spatial locality and has a direct impact on the data read performance. In our experiments, we compare the read performance of DeFrag and DDFS-Like by reconstructing backup generations 1 to 20. Fig. 6 shows the experimental results. As shown by the results, DeFrag's read performance is higher than that of DDFS-Like, which indicates that DeFrag has the potential to improve the data read performance by reducing the de-linearization of data placement.

V. CONCLUSION

Many researchers are interested in reducing the data fragmentation caused by data elimination in order to improve the data read performance of deduplication storage systems [3], [4], [5]. In this paper, motivated by the analysis that the de-linearization of data placement has a negative effect on the spatial locality and on the deduplication performance, we propose DeFrag, which leaves some redundant data unremoved to reduce the de-linearization of data placement and to improve several aspects of deduplication performance rather than only the data read performance. As shown by our experimental results with real-world datasets, DeFrag effectively enhances the spatial locality and improves the deduplication performance, including the deduplication throughput, deduplication efficiency, and data read performance.

REFERENCES

[1] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the Data Domain deduplication file system," in FAST '08, Feb. 2008.
[2] W. Xia, H. Jiang, D. Feng, and Y. Hua, "SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput," in USENIX ATC '11, Jun. 2011.
[3] K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti, "iDedup: Latency-aware, inline data deduplication for primary storage," in FAST '12, Feb. 2012.
[4] Y. J. Nam, D. Park, and D. Du, "Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets," in MASCOTS '12, Aug. 2012.
[5] M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki, "Reducing impact of data fragmentation caused by in-line deduplication," in SYSTOR '12, Jun. 2012.