Reducing The De-linearization of Data Placement to Improve Deduplication Performance
Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3
1 School of Computer Science & Technology, Chongqing University
2 School of Computer Science & Technology, Huazhong University of Science & Technology
3 Department of Computer Science, University of Texas at Dallas
{tanyujuan, jarod2046, edwinsha}@gmail.com, {dfeng}@hust.edu.cn

Abstract

Data deduplication is a lossless compression technology that replaces redundant data chunks with pointers to already-stored copies. Because of this intrinsic redundancy elimination, deduplication de-linearizes data placement and forces the data chunks belonging to the same data object to be divided into multiple separate parts. Our preliminary study finds that this de-linearization of data placement weakens the data spatial locality that several deduplication approaches rely on to improve data read performance, deduplication throughput, and deduplication efficiency, and thus significantly affects overall deduplication performance. In this paper, we first analyze the negative effect of the de-linearization of data placement on deduplication performance with examples and experimental evidence, and then propose an effective approach that reduces the de-linearization of data placement at the cost of only a small loss in compression ratio. Experimental evaluation driven by real-world datasets shows that our approach effectively reduces the de-linearization of data placement and enhances data spatial locality, significantly improving deduplication throughput, deduplication efficiency, and data read performance, while sacrificing little compression ratio.

I. INTRODUCTION

Data deduplication is a lossless compression technology that has been widely used in large-scale primary and secondary storage systems.
It breaks data streams into a series of data chunks and removes the redundant ones to save storage space. However, by removing the redundant data chunks, deduplication de-linearizes data placement and deteriorates the data layout, resulting in degraded deduplication performance. The data layout is the overall placement of the data chunks stored in the storage system. In deduplication storage systems there are two kinds of data chunks: new unique data chunks, which are written to disk sequentially, and redundant data chunks, which are removed by being linked to copies that are already stored. The redundant chunks typically reside in separate locations, because they were stored by the preceding data streams, and thus cannot be stored together with the new unique chunks. This redundancy elimination inevitably de-linearizes the data placement, so that data chunks are no longer stored in the same order in which they appear in the data stream, which significantly weakens spatial locality and affects data read performance, deduplication throughput, data availability, and so on. For example, during data restores in disaster recovery, reconstruction requires many disk seeks because the data chunks belonging to the same file or directory are divided into multiple separate parts and cannot be retrieved together. In the worst case, every single chunk requires its own disk seek, which severely affects read performance. Moreover, the weakening of spatial locality caused by the de-linearization of data placement has a significant negative effect on the deduplication approaches that use spatial locality to improve deduplication throughput and efficiency.
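The write path just described (chunking the stream, indexing fingerprints, and linking duplicates to already-stored copies) can be sketched minimally in Python. This is an illustration only: the names are ours, and the in-RAM dictionary stands in for the chunk index that a real system keeps largely on disk.

```python
import hashlib

def deduplicate(chunks, index, store):
    """Identify redundant chunks by fingerprint lookup; store only new ones.

    `index` maps a fingerprint to the position of the already-stored chunk
    in `store` (an append-only log, modelling sequential disk writes).
    The returned recipe is the ordered list of positions that lets the
    original stream be reconstructed.
    """
    recipe = []
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in index:                 # redundant: link to the stored copy
            recipe.append(index[fp])
        else:                           # unique: append sequentially
            index[fp] = len(store)
            store.append(chunk)
            recipe.append(index[fp])
    return recipe

index, store = {}, []
recipe = deduplicate([b"A", b"B", b"A", b"C"], index, store)
```

Note that the second occurrence of `b"A"` is never written again; its recipe entry points back to position 0, which is exactly what pulls a later stream's chunks out of sequential order.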
Data deduplication faces a significant disk bottleneck: under limited RAM, the chunk index must be fetched from disk to RAM page by page for redundancy identification, which throttles the deduplication process and degrades deduplication throughput and data write performance. Existing well-known and widely used deduplication approaches, such as DDFS [1] and SiLo [2], mainly focus on exploiting data spatial locality (also called duplicate locality) to fetch useful parts of the chunk index from disk to memory in batches and thereby alleviate the disk bottleneck. SiLo further exploits spatial locality to improve deduplication throughput while sacrificing little deduplication efficiency (i.e., compression ratio). To the best of our knowledge, most researchers are interested in exploiting the spatial locality that appears in the backup stream to improve deduplication performance, but few have paid attention to the fact that this spatial locality grows weaker and weaker as the amount of deduplicated data (i.e., new unique data chunks) increases and data placement becomes more de-linearized, which in turn degrades deduplication performance. In this paper, we focus on reducing the de-linearization of data placement and enhancing spatial locality to improve deduplication performance, including deduplication throughput, deduplication efficiency, and data read performance. Our contributions are summarized in the following three key points.
Firstly, we analyze the negative effect of the de-linearization of data placement on the spatial locality that can be used to improve deduplication throughput and data read performance. Secondly, our experimental study shows that spatial locality degrades as the amount of deduplicated data grows and data placement becomes more de-linearized, which makes the existing deduplication approaches that exploit spatial locality less effective. Finally, we propose an effective approach to reduce the de-linearization of data placement and enhance spatial locality, which improves deduplication throughput, deduplication efficiency, and data read performance, as shown by experimental results with real-world datasets. The rest of this paper is organized as follows. In the next section we analyze the negative effect of the de-linearization of data placement on deduplication performance. In Section III, we propose an efficient approach to reduce the de-linearization of data placement. Section IV evaluates our proposed approach with real-world datasets, and Section V concludes the paper.

II. BACKGROUND AND MOTIVATION

During the deduplication process, data chunks are divided into two categories: deduplicated-data chunks (i.e., new unique data chunks) and redundant data chunks. To maximize deduplication throughput and data write performance, the deduplicated-data chunks are written to and stored on the disks sequentially. Therefore, if there were no redundant data chunks to remove, all data chunks would be stored on the disks in the same order in which they appear in the data stream.
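The scattering effect of redundancy removal on a later stream can be made concrete with a small sketch (all positions and data below are illustrative). A stream's on-disk footprint can be measured as the number of contiguous runs its chunk positions form; each extra run costs roughly one extra disk seek on restore.

```python
# Count how many contiguous disk regions (fragments) a stream occupies,
# given the on-disk positions of its chunks in stream order.
def fragments(positions):
    runs = 1
    for prev, cur in zip(positions, positions[1:]):
        if cur != prev + 1:   # a gap: the next chunk lives elsewhere
            runs += 1
    return runs

# First stream written to an empty system: its chunks land sequentially.
first = [0, 1, 2, 3]
# A later stream that shares the chunks at positions 0 and 2: only its
# two new chunks are appended (positions 4 and 5), so its layout is
# scattered across four separate regions.
later = [0, 4, 2, 5]
```

Here the first stream occupies a single region, while the later, partially redundant stream occupies four, even though it is the same length.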
However, because redundant chunks are removed across multiple data objects (a data object denotes the super-chunk composed of multiple data chunks, such as one data stream, one file directory, or a single file), the new unique data chunks are separated from the redundant ones, and the locations of the redundant chunks are determined by the data objects that first wrote them to the deduplication storage system. This de-linearizes the data placement and weakens data spatial locality. Furthermore, as the amount of stored deduplicated-data chunks increases and data placement becomes more de-linearized, the spatial locality that appears in the backup stream grows weaker and weaker, which significantly affects overall deduplication performance. In this section, we discuss how the de-linearization of data placement affects spatial locality and what the negative effects of weakened spatial locality are.

A. The Spatial Locality and The De-linearization of The Data Placement

Spatial locality means that if a particular location is referenced at a particular time, then the nearby locations will likely be referenced in the near future.

[Fig. 1: An Example of One File that Has N Data Fragments.]

In deduplication storage systems, this concept of spatial locality is very useful both for the deduplication process during data writing and for data reconstruction during data reading. In the deduplication process, spatial locality is also called duplicate locality, which denotes that the data chunks near a duplicate chunk are likely to be duplicates with a high probability [1].
By creating and maintaining on disk storage the duplicate locality that appears in the data stream, the nearby duplicate chunks can be fetched to RAM in advance for redundancy identification whenever one duplicate chunk is found, thus avoiding disk accesses and improving deduplication throughput. For data reconstruction during reading, spatial locality means that if one data chunk is read, the nearby chunks will be read in the very near future. If this spatial locality is available for data reading, nearby chunks can be read in batches and many disk seeks can be eliminated, substantially improving data reading performance. Unfortunately, for both the deduplication process and data reconstruction, having such spatial locality everywhere at all times is an ideal state in deduplication storage systems. For the deduplication process, spatial locality can be created and maintained for the initial backup sessions. But when a subsequent backup session that shares some redundant chunks with the preceding ones arrives at the system, its redundant data chunks must be removed and only its new unique data chunks are written to and stored on the disks sequentially. It is therefore impossible to store together all the data chunks composing this backup session, and impossible to fully create and maintain on disk the spatial locality that appears in this data stream. Furthermore, as more data streams arrive and the amount of deduplicated data increases, the spatial locality grows much weaker due to the de-linearization of data placement, so the deduplication throughput gained by exploiting spatial locality gradually degrades. Section II-B1 presents experimental evidence to verify this outcome.
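The prefetch-on-duplicate idea exploited by DDFS-style systems can be illustrated with a hedged sketch; the container layout, names, and cache policy below are our simplifications, not the cited design. When a fingerprint lookup misses RAM but hits the on-disk index, one disk read fetches the fingerprints of all chunks stored next to it, so the duplicates expected to follow hit in RAM.

```python
# Locality-preserved prefetching in the spirit of DDFS [1] (illustrative).
# `containers` models groups of chunk fingerprints stored together on disk.

def lookup(fp, cache, containers, fp_to_container, stats):
    """Return True if fp is a duplicate; count RAM hits and disk reads."""
    if fp in cache:
        stats["ram_hits"] += 1
        return True
    cid = fp_to_container.get(fp)
    if cid is None:
        return False                   # new unique chunk
    stats["disk_reads"] += 1           # one seek fetches a whole container
    cache.update(containers[cid])      # prefetch neighbouring fingerprints
    return True

containers = {0: {"f1", "f2", "f3"}}   # fingerprints stored together
fp_to_container = {f: 0 for f in containers[0]}
cache, stats = set(), {"ram_hits": 0, "disk_reads": 0}
for fp in ["f1", "f2", "f3"]:          # good locality: one seek serves all
    lookup(fp, cache, containers, fp_to_container, stats)
```

With strong duplicate locality, one disk read serves the whole run of duplicates; when de-linearization scatters duplicates across many containers, the same run costs one read per container, which is the degradation measured in Section II-B1.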
For data reading, the spatial locality used for data reconstruction shows the same tendency under the de-linearization of data placement as the locality used for redundancy identification. Take Fig. 1 for example (Fig. 1 depicts the data layout of a file with N separate parts stored on disk, with the chunk metadata in front and the actual chunk data behind): its N-1 data parts, from Part.1 to Part.N-1, are all shared with other files and are not stored together with Part.N. Thus reading this file from disk requires about N disk seeks, under the assumption
that these separate parts are far apart from each other and no two of them can be read together in one disk seek. The data reading time F(read) can be calculated by

F(read) = N x T_seek + f_size / W_seq    (1)

where T_seek denotes the time required for each disk seek, f_size represents the size of the file being read, and W_seq denotes the sequential read bandwidth. But if this file were not deduplicated and all of its data chunks were stored together with Part.N, its reading time F'(read) would be only 1 x T_seek + f_size / W_seq. Ignoring the common transfer time f_size / W_seq, F(read) - f_size / W_seq is N times larger than F'(read) - f_size / W_seq. Moreover, as the amount of stored deduplicated data increases, newly arriving data objects share more redundant data with the preceding ones, so logically contiguous data chunks are scattered across more disk locations; the spatial locality available for data reading becomes much worse, and so does the data reading performance.

B. Experimental Evidence

In existing deduplication storage systems, the exploitation of the spatial locality (i.e., duplicate locality) that appears in backup streams mainly serves two research goals: one is to alleviate the disk bottleneck, as in DDFS and Sparse Indexing, and the other is to sacrifice a reasonable amount of deduplication efficiency while improving deduplication throughput, as in SiLo. Unfortunately, the continual removal of redundant chunks and the resulting de-linearization of data placement weaken the duplicate locality, especially for long-term data backups and retention, which significantly affects deduplication throughput and deduplication efficiency.

1) The Degradation of Deduplication Throughput: In our preliminary experimental study, we implemented the deduplication approach proposed by DDFS [1] and evaluated its deduplication throughput driven by real-world datasets. Fig.
2 depicts the average deduplication throughput over 20 full backup generations of one author's file system, about 647GB of data. As shown in this figure, the deduplication throughput decreases as the backup generations increase, from 213MB/s for generation 1 to only 110MB/s for generation 20, which is consistent with our intuition that deduplication throughput degrades as the amount of deduplicated-data chunks increases.

[Fig. 2: The degradation of the deduplication throughput.]
[Fig. 3: The degradation of the deduplication efficiency.]

2) The Degradation of Deduplication Efficiency: Besides improving deduplication throughput, some researchers exploit duplicate locality so as not to lose too much deduplication efficiency while improving throughput, as in SiLo [2]. SiLo groups data chunks into segments and then groups segments into large blocks, and performs redundancy identification against similar segments instead of the full chunk index under limited available RAM capacity. When similar segments are found, it checks chunk redundancy not only among those similar segments but also in the blocks those segments belong to, thus improving deduplication efficiency. The number of redundant chunks that can be found among the blocks depends heavily on the duplicate locality of the redundant chunks within the blocks. A weakening of duplicate locality reduces the amount of redundant data found in these blocks and degrades deduplication efficiency. Using a dataset of about 20 incremental backup generations of one author's file system, as in SiLo, we evaluated its deduplication efficiency, defined as the redundant data actually present in the dataset divided by the data removed by SiLo. Fig.
3 shows the deduplication efficiency obtained over the 20 backup generations. As seen from the results, the deduplication efficiency decreases as the backup generations increase, which reveals that the duplicate locality is gradually weakened and that some redundant data chunks no longer reside in the nearby blocks where they would be found and removed.

III. REDUCING THE DE-LINEARIZATION OF DATA PLACEMENT

As analyzed in Section II, the removal of redundant data and the de-linearization of data placement weaken the data spatial locality that is used to improve deduplication throughput, deduplication efficiency, and data read performance, which significantly affects overall deduplication performance. Nevertheless, because removing redundant data is the primary concern
for data deduplication to save storage space, and because the de-linearization of data placement cannot be avoided entirely, we focus on reducing the de-linearization of data placement rather than eliminating it. In this section, we propose an effective method, called DeFrag, which aims to reduce the de-linearization of data placement and enhance data spatial locality to improve deduplication performance.

A. Key Idea

The key idea of DeFrag is to write some redundant data to disk storage instead of removing it, thus reducing the de-linearization of data placement. However, because writing redundant data consumes extra storage space and works against the primary goal of data deduplication, the key to implementing DeFrag is to determine which redundant chunks should or should not be removed, so as to enhance spatial locality at the cost of only a small loss in compression ratio. Inspired by the analysis in Section II that the degraded deduplication performance is mainly caused by the weak spatial locality resulting from the de-linearization of data placement, DeFrag selects the redundant chunks to leave unremoved according to a metric called Spatial Locality Level (SPL for short). In DeFrag, the Spatial Locality Level measures the data spatial locality of chunk groups and can be calculated dynamically during the deduplication process. If the dynamically calculated Spatial Locality Level is lower than a preset value, DeFrag writes the corresponding chunks to disk storage instead of removing them, thus reducing the de-linearization of data placement and enhancing spatial locality.

B. Design and Implementation

DeFrag mainly focuses on selecting the redundant chunks that are not to be removed. It runs after all the redundant data chunks and the corresponding locations of their already-stored copies have been found for each data stream. For each incoming data stream, DeFrag breaks it into a series of chunks and groups multiple contiguous chunks into segments.
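The per-segment decision, formalized by the SPL metric defined below, can be sketched with segments modelled as sets of chunk fingerprints. This is an illustrative sketch under our own naming, not the actual implementation, and all data is made up.

```python
# Illustrative segment-level locality test: SPL(m, k) is the fraction of
# the incoming segment Seg_m whose chunks could be fetched in one disk
# seek by reading the stored segment Seg_k.
def spl(seg_m, seg_k):
    return len(seg_m & seg_k) / len(seg_m)

def keep_redundant(seg_m, seg_k, alpha):
    """If locality is weak (SPL < alpha), do not remove the shared
    chunks; write them again next to Seg_m's unique chunks."""
    return spl(seg_m, seg_k) < alpha

seg_m = {"c1", "c2", "c3", "c4"}   # incoming segment (fingerprints)
seg_k = {"c1", "c9"}               # stored segment sharing one chunk
weak = keep_redundant(seg_m, seg_k, alpha=0.5)   # SPL = 1/4 = 0.25
```

Raising alpha keeps more redundant data (better locality, lower compression); lowering it removes more (better compression, weaker locality), which is exactly the trade-off the preset value controls.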
Each segment varies from 0.5MB to 2MB depending on the chunk content, and the segment is the unit in which data chunks are read and written. After finding all the redundant chunks and their corresponding locations, DeFrag calculates the spatial locality of each segment that contains redundant chunks. In DeFrag, we define the Spatial Locality Level SPL(m,k) as

SPL(m,k) = |Seg_m ∩ Seg_k| / |Seg_m|    (2)

where Seg_m is the incoming segment, Seg_k is a segment that has already been stored on disk, and Seg_m ∩ Seg_k represents the data chunks shared by Seg_m and Seg_k. SPL(m,k) measures the spatial locality of the data chunks in Seg_m with respect to those in Seg_k, which can be fetched together by one disk seek. If Seg_m ∩ Seg_k = Seg_m, then SPL(m,k) = 1, which means that Seg_m has strong spatial locality with respect to Seg_k and all of the chunks in Seg_m can be retrieved through one disk seek by reading Seg_k. If SPL(m,k) is smaller than a preset value α, indicating that the corresponding spatial locality is weak, DeFrag will not remove the redundant chunks in Seg_m ∩ Seg_k and will write them to disk storage together with the new unique chunks in Seg_m, thus reducing the de-linearization of Seg_m and enhancing spatial locality. The preset value α can be adjusted to trade off the spatial locality improvement against the sacrificed compression ratio for different datasets. Due to space constraints, we do not depict the architecture and data flow path of DeFrag in this paper.

IV. EXPERIMENTAL EVALUATION

We have implemented DeFrag based on the deduplication approach proposed in DDFS and evaluated its deduplication throughput, deduplication efficiency, and data read performance. The dataset used for the evaluation is generated from 66 backups of the file systems of five graduate students in our research group, totaling about 1.72TB of data.
To assess DeFrag's benefits, we further compare the experimental results with those obtained from the DDFS and SiLo prototype systems, as shown in this section. As a side note, we refer to these prototypes as DDFS-Like and SiLo-Like, since we implemented their deduplication approaches ourselves rather than borrowing their developed prototype systems for the evaluation, and due to space restrictions we only show the experimental results of DeFrag with α set to 0.1.

A. Deduplication Throughput

In deduplication approaches such as DDFS, the deduplication throughput is decreased by the de-linearization of data placement and the weakening of spatial locality. In this subsection, we focus on the benefits of DeFrag in improving deduplication throughput by reducing the de-linearization of data placement. Fig. 4 compares the average deduplication throughput of DeFrag, DDFS-Like, and SiLo-Like over the 66 backup generations. As shown in the results, the deduplication throughput of DDFS-Like is much lower than that of DeFrag. By reducing the de-linearization of data placement, DeFrag achieves deduplication throughput comparable to that of SiLo. Moreover, when the backup stream has very good spatial locality, the deduplication throughput of DeFrag is even higher than that of SiLo, as in backup generations 1, 2, 3, 4, 5, 41, and 42. This is because, when spatial locality is very good, DeFrag can continually prefetch the nearby duplicate chunks to RAM for redundancy identification with one disk seek whenever a duplicate chunk is found, while SiLo wastes some disk seeks on irrelevant chunks, since its prefetching policy is based on similar segments with some probability.
[Fig. 4: The comparison of the deduplication throughput.]
[Fig. 5: The comparison of the deduplication efficiency.]

B. Deduplication Efficiency

To achieve high deduplication throughput, both DeFrag and SiLo leave some redundant data unremoved during the deduplication process. DeFrag keeps some redundant data to reduce the de-linearization of data placement, while SiLo ignores, with some probability, the redundant data that exists among less-similar segments. Fig. 5 plots the deduplication efficiency of DeFrag and SiLo over the 66 backup generations, where deduplication efficiency is defined as the redundant data actually present in the dataset divided by the redundant data removed by DeFrag or SiLo. Moreover, to compare fairly the amount of redundant data kept by DeFrag and SiLo, we only count the redundant data in segments that share part of their redundant chunks with others, ignoring the redundant chunks in segments whose duplicates share all of their redundant chunks, which would be removed by both DeFrag and SiLo. As seen from the results, the amount of redundant data kept by DeFrag is much less than that kept by SiLo. By backup generation 66, SiLo has left 12% of the redundant data unremoved while DeFrag has left only 4% unremoved, which indicates that DeFrag achieves higher deduplication efficiency than SiLo, although their deduplication throughput is comparable, as shown in Fig. 4.

[Fig. 6: The comparison of data read performance.]

C. Data Read Performance

As analyzed in Section II, reducing the de-linearization of data placement can enhance spatial locality and has a direct impact on data read performance. In our experiments, we compare the read performance of DeFrag and DDFS-Like by reconstructing backup generations 1 to 20.
Fig. 6 shows the experimental results. As shown by the results, DeFrag's read performance is higher than that of DDFS-Like, which indicates that DeFrag can improve data read performance by reducing the de-linearization of data placement.

V. CONCLUSION

Many researchers are interested in reducing the data fragmentation caused by redundancy elimination to improve data read performance in deduplication storage systems [3], [4], [5]. In this paper, motivated by the analysis that the de-linearization of data placement has a negative effect on spatial locality and on deduplication performance, we propose DeFrag, which leaves some redundant data unremoved to reduce the de-linearization of data placement and to improve deduplication performance beyond data read performance alone. As shown by our experimental results with real-world datasets, DeFrag effectively enhances spatial locality and improves deduplication performance, including deduplication throughput, deduplication efficiency, and data read performance.

REFERENCES

[1] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the Data Domain deduplication file system," in FAST '08, Feb. 2008.
[2] W. Xia, H. Jiang, D. Feng, and Y. Hua, "SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput," in USENIX ATC '11, Jun. 2011.
[3] K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti, "iDedup: Latency-aware, inline data deduplication for primary storage," in FAST '12, Feb. 2012.
[4] Y. J. Nam, D. Park, and D. Du, "Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets," in MASCOTS '12, Aug. 2012.
[5] M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki, "Reducing impact of data fragmentation caused by in-line deduplication," in SYSTOR '12, Jun. 2012.
RevDedup: A Reverse Deduplication Storage System Optimized for Reads to Latest Backups arxiv:1302.0621v3 [cs.dc] 27 Jun 2013 Chun-Ho Ng and Patrick P. C. Lee The Chinese University of Hong Kong, Hong Kong
More informationHYDRAstor: a Scalable Secondary Storage
HYDRAstor: a Scalable Secondary Storage 7th TF-Storage Meeting September 9 th 00 Łukasz Heldt Largest Japanese IT company $4 Billion in annual revenue 4,000 staff www.nec.com Polish R&D company 50 engineers
More informationOperating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017
Operating Systems Lecture 7.2 - File system implementation Adrien Krähenbühl Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Design FAT or indexed allocation? UFS, FFS & Ext2 Journaling with Ext3
More informationSHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers
2011 31st International Conference on Distributed Computing Systems Workshops SHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers Lei Xu, Jian Hu, Stephen Mkandawire and Hong
More informationDeduplication Storage System
Deduplication Storage System Kai Li Charles Fitzmorris Professor, Princeton University & Chief Scientist and Co-Founder, Data Domain, Inc. 03/11/09 The World Is Becoming Data-Centric CERN Tier 0 Business
More informationRecord Placement Based on Data Skew Using Solid State Drives
Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories, NEC j-suzuki@ax.jp.nec.com
More informationAlternative Approaches for Deduplication in Cloud Storage Environment
International Journal of Computational Intelligence Research ISSN 0973-1873 Volume 13, Number 10 (2017), pp. 2357-2363 Research India Publications http://www.ripublication.com Alternative Approaches for
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More informationdedupv1: Improving Deduplication Throughput using Solid State Drives (SSD)
University Paderborn Paderborn Center for Parallel Computing Technical Report dedupv1: Improving Deduplication Throughput using Solid State Drives (SSD) Dirk Meister Paderborn Center for Parallel Computing
More informationDeploying De-Duplication on Ext4 File System
Deploying De-Duplication on Ext4 File System Usha A. Joglekar 1, Bhushan M. Jagtap 2, Koninika B. Patil 3, 1. Asst. Prof., 2, 3 Students Department of Computer Engineering Smt. Kashibai Navale College
More informationA Scalable Inline Cluster Deduplication Framework for Big Data Protection
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln CSE Technical reports Computer Science and Engineering, Department of Summer 5-30-2012 A Scalable Inline Cluster Deduplication
More informationSingle-pass restore after a media failure. Caetano Sauer, Goetz Graefe, Theo Härder
Single-pass restore after a media failure Caetano Sauer, Goetz Graefe, Theo Härder 20% of drives fail after 4 years High failure rate on first year (factory defects) Expectation of 50% for 6 years https://www.backblaze.com/blog/how-long-do-disk-drives-last/
More informationHPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud
HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud Huijun Wu 1,4, Chen Wang 2, Yinjin Fu 3, Sherif Sakr 1, Liming Zhu 1,2 and Kai Lu 4 The University of New South
More informationThe Logic of Physical Garbage Collection in Deduplicating Storage
The Logic of Physical Garbage Collection in Deduplicating Storage Fred Douglis Abhinav Duggal Philip Shilane Tony Wong Dell EMC Shiqin Yan University of Chicago Fabiano Botelho Rubrik 1 Deduplication in
More informationDEC: An Efficient Deduplication-Enhanced Compression Approach
2016 IEEE 22nd International Conference on Parallel and Distributed Systems DEC: An Efficient Deduplication-Enhanced Compression Approach Zijin Han, Wen Xia, Yuchong Hu *, Dan Feng, Yucheng Zhang, Yukun
More informationCHAPTER 5 PROPAGATION DELAY
98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,
More informationEfficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage
Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage YAN-KIT LI, MIN XU, CHUN-HO NG, and PATRICK P. C. LEE, The Chinese University of Hong Kong 2 Backup storage systems often remove
More informationOasis: An Active Storage Framework for Object Storage Platform
Oasis: An Active Storage Framework for Object Storage Platform Yulai Xie 1, Dan Feng 1, Darrell D. E. Long 2, Yan Li 2 1 School of Computer, Huazhong University of Science and Technology Wuhan National
More informationStorage Efficiency Opportunities and Analysis for Video Repositories
Storage Efficiency Opportunities and Analysis for Video Repositories Suganthi Dewakar Sethuraman Subbiah Gokul Soundararajan Mike Wilson Mark W. Storer Box Inc. Kishore Udayashankar Exablox Kaladhar Voruganti
More informationReducing Replication Bandwidth for Distributed Document Databases
Reducing Replication Bandwidth for Distributed Document Databases Lianghong Xu 1, Andy Pavlo 1, Sudipta Sengupta 2 Jin Li 2, Greg Ganger 1 Carnegie Mellon University 1, Microsoft Research 2 Document-oriented
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationSMD149 - Operating Systems - File systems
SMD149 - Operating Systems - File systems Roland Parviainen November 21, 2005 1 / 59 Outline Overview Files, directories Data integrity Transaction based file systems 2 / 59 Files Overview Named collection
More informationVeeam and HP: Meet your backup data protection goals
Sponsored by Veeam and HP: Meet your backup data protection goals Eric Machabert Сonsultant and virtualization expert Introduction With virtualization systems becoming mainstream in recent years, backups
More informationCompression and Decompression of Virtual Disk Using Deduplication
Compression and Decompression of Virtual Disk Using Deduplication Bharati Ainapure 1, Siddhant Agarwal 2, Rukmi Patel 3, Ankita Shingvi 4, Abhishek Somani 5 1 Professor, Department of Computer Engineering,
More informationOperating Systems. Operating Systems Professor Sina Meraji U of T
Operating Systems Operating Systems Professor Sina Meraji U of T How are file systems implemented? File system implementation Files and directories live on secondary storage Anything outside of primary
More information[537] Fast File System. Tyler Harter
[537] Fast File System Tyler Harter File-System Case Studies Local - FFS: Fast File System - LFS: Log-Structured File System Network - NFS: Network File System - AFS: Andrew File System File-System Case
More informationHP Dynamic Deduplication achieving a 50:1 ratio
HP Dynamic Deduplication achieving a 50:1 ratio Table of contents Introduction... 2 Data deduplication the hottest topic in data protection... 2 The benefits of data deduplication... 2 How does data deduplication
More informationASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System
JOURNAL OF COMPUTERS, VOL. 7, NO. 8, AUGUST 2012 1853 : An Adaptive Sequential Prefetching Scheme for Second-level Storage System Xiaodong Shi Computer College, Huazhong University of Science and Technology,
More informationA Comparative Survey on Big Data Deduplication Techniques for Efficient Storage System
A Comparative Survey on Big Data Techniques for Efficient Storage System Supriya Milind More Sardar Patel Institute of Technology Kailas Devadkar Sardar Patel Institute of Technology ABSTRACT - Nowadays
More informationChapter 14: File-System Implementation
Chapter 14: File-System Implementation Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance Recovery 14.1 Silberschatz, Galvin and Gagne 2013 Objectives To describe
More informationData De-duplication for Distributed Segmented Parallel FS
Data De-duplication for Distributed Segmented Parallel FS Boris Zuckerman & Oskar Batuner Hewlett-Packard Co. Objectives Expose fundamentals of highly distributed segmented parallel file system architecture
More informationChapter 1. Storage Concepts. CommVault Concepts & Design Strategies: https://www.createspace.com/
Chapter 1 Storage Concepts 4 - Storage Concepts In order to understand CommVault concepts regarding storage management we need to understand how and why we protect data, traditional backup methods, and
More informationWhite paper ETERNUS CS800 Data Deduplication Background
White paper ETERNUS CS800 - Data Deduplication Background This paper describes the process of Data Deduplication inside of ETERNUS CS800 in detail. The target group consists of presales, administrators,
More informationGoogle File System. Arun Sundaram Operating Systems
Arun Sundaram Operating Systems 1 Assumptions GFS built with commodity hardware GFS stores a modest number of large files A few million files, each typically 100MB or larger (Multi-GB files are common)
More informationPurity: building fast, highly-available enterprise flash storage from commodity components
Purity: building fast, highly-available enterprise flash storage from commodity components J. Colgrove, J. Davis, J. Hayes, E. Miller, C. Sandvig, R. Sears, A. Tamches, N. Vachharajani, and F. Wang 0 Gala
More informationEaSync: A Transparent File Synchronization Service across Multiple Machines
EaSync: A Transparent File Synchronization Service across Multiple Machines Huajian Mao 1,2, Hang Zhang 1,2, Xianqiang Bao 1,2, Nong Xiao 1,2, Weisong Shi 3, and Yutong Lu 1,2 1 State Key Laboratory of
More informationMaintaining Mutual Consistency for Cached Web Objects
Maintaining Mutual Consistency for Cached Web Objects Bhuvan Urgaonkar, Anoop George Ninan, Mohammad Salimullah Raunak Prashant Shenoy and Krithi Ramamritham Department of Computer Science, University
More informationFour Steps to Unleashing The Full Potential of Your Database
Four Steps to Unleashing The Full Potential of Your Database This insightful technical guide offers recommendations on selecting a platform that helps unleash the performance of your database. What s the
More informationImproving Backup and Restore Performance for Deduplication-based Cloud Backup Services
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department
More informationUNIT III MEMORY MANAGEMENT
UNIT III MEMORY MANAGEMENT TOPICS TO BE COVERED 3.1 Memory management 3.2 Contiguous allocation i Partitioned memory allocation ii Fixed & variable partitioning iii Swapping iv Relocation v Protection
More informationFile system internals Tanenbaum, Chapter 4. COMP3231 Operating Systems
File system internals Tanenbaum, Chapter 4 COMP3231 Operating Systems Architecture of the OS storage stack Application File system: Hides physical location of data on the disk Exposes: directory hierarchy,
More informationMahanaxar: Quality of Service Guarantees in High-Bandwidth, Real-Time Streaming Data Storage
Mahanaxar: Quality of Service Guarantees in High-Bandwidth, Real-Time Streaming Data Storage David Bigelow, Scott Brandt, John Bent, HB Chen Systems Research Laboratory University of California, Santa
More informationThere is a general need for long-term and shared data storage: Files meet these requirements The file manager or file system within the OS
Why a file system? Why a file system There is a general need for long-term and shared data storage: need to store large amount of information persistent storage (outlives process and system reboots) concurrent
More informationProcess size is independent of the main memory present in the system.
Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.
More informationDEDUPLICATION AWARE AND DUPLICATE ELIMINATION SCHEME FOR DATA REDUCTION IN BACKUP STORAGE SYSTEMS
DEDUPLICATION AWARE AND DUPLICATE ELIMINATION SCHEME FOR DATA REDUCTION IN BACKUP STORAGE SYSTEMS NIMMAGADDA SRIKANTHI, DR G.RAMU, YERRAGUDIPADU Department Of Computer Science and Professor, Department
More informationCOMMVAULT. Enabling high-speed WAN backups with PORTrockIT
COMMVAULT Enabling high-speed WAN backups with PORTrockIT EXECUTIVE SUMMARY Commvault offers one of the most advanced and full-featured data protection solutions on the market, with built-in functionalities
More informationCS 111. Operating Systems Peter Reiher
Operating System Principles: File Systems Operating Systems Peter Reiher Page 1 Outline File systems: Why do we need them? Why are they challenging? Basic elements of file system design Designing file
More informationClustering and Reclustering HEP Data in Object Databases
Clustering and Reclustering HEP Data in Object Databases Koen Holtman CERN EP division CH - Geneva 3, Switzerland We formulate principles for the clustering of data, applicable to both sequential HEP applications
More informationComputer Architecture and System Software Lecture 09: Memory Hierarchy. Instructor: Rob Bergen Applied Computer Science University of Winnipeg
Computer Architecture and System Software Lecture 09: Memory Hierarchy Instructor: Rob Bergen Applied Computer Science University of Winnipeg Announcements Midterm returned + solutions in class today SSD
More informationHedvig as backup target for Veeam
Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...
More informationTechnology Insight Series
IBM ProtecTIER Deduplication for z/os John Webster March 04, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved. Announcement Summary The many data
More informationLocality and The Fast File System. Dongkun Shin, SKKU
Locality and The Fast File System 1 First File System old UNIX file system by Ken Thompson simple supported files and the directory hierarchy Kirk McKusick The problem: performance was terrible. Performance
More informationA file system is a clearly-defined method that the computer's operating system uses to store, catalog, and retrieve files.
File Systems A file system is a clearly-defined method that the computer's operating system uses to store, catalog, and retrieve files. Module 11: File-System Interface File Concept Access :Methods Directory
More informationDASH COPY GUIDE. Published On: 11/19/2013 V10 Service Pack 4A Page 1 of 31
DASH COPY GUIDE Published On: 11/19/2013 V10 Service Pack 4A Page 1 of 31 DASH Copy Guide TABLE OF CONTENTS OVERVIEW GETTING STARTED ADVANCED BEST PRACTICES FAQ TROUBLESHOOTING DASH COPY PERFORMANCE TUNING
More informationUtilization of Data Deduplication towards Improvement of Primary Storage Performance in Cloud
Utilization of Data Deduplication towards Improvement of Primary Storage Performance in Cloud P.Sai Sandip 1, P.Rajeshwari 2, Dr.G.Vishnu Murthy 3 M.Tech Student, Computer Science Dept., Anurag Group of
More informationA multilingual reference based on cloud pattern
A multilingual reference based on cloud pattern G.Rama Rao Department of Computer science and Engineering, Christu Jyothi Institute of Technology and Science, Jangaon Abstract- With the explosive growth
More informationOperating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France
Operating Systems Memory Management Mathieu Delalandre University of Tours, Tours city, France mathieu.delalandre@univ-tours.fr 1 Operating Systems Memory Management 1. Introduction 2. Contiguous memory
More informationFCFS: On-Disk Design Revision: 1.8
Revision: 1.8 Date: 2003/07/06 12:26:43 1 Introduction This document describes the on disk format of the FCFSobject store. 2 Design Constraints 2.1 Constraints from attributes of physical disks The way
More informationA Comparison of File. D. Roselli, J. R. Lorch, T. E. Anderson Proc USENIX Annual Technical Conference
A Comparison of File System Workloads D. Roselli, J. R. Lorch, T. E. Anderson Proc. 2000 USENIX Annual Technical Conference File System Performance Integral component of overall system performance Optimised
More informationImproving throughput for small disk requests with proximal I/O
Improving throughput for small disk requests with proximal I/O Jiri Schindler with Sandip Shete & Keith A. Smith Advanced Technology Group 2/16/2011 v.1.3 Important Workload in Datacenters Serial reads
More informationL9: Storage Manager Physical Data Organization
L9: Storage Manager Physical Data Organization Disks and files Record and file organization Indexing Tree-based index: B+-tree Hash-based index c.f. Fig 1.3 in [RG] and Fig 2.3 in [EN] Functional Components
More informationA LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING Dieison Silveira, Guilherme Povala,
More informationIdentifying and Eliminating Backup System Bottlenecks: Taking Your Existing Backup System to the Next Level
Identifying and Eliminating Backup System Bottlenecks: Taking Your Existing Backup System to the Next Level Jacob Farmer, CTO Cambridge Computer SNIA Legal Notice The material contained in this tutorial
More informationC13: Files and Directories: System s Perspective
CISC 7310X C13: Files and Directories: System s Perspective Hui Chen Department of Computer & Information Science CUNY Brooklyn College 4/19/2018 CUNY Brooklyn College 1 File Systems: Requirements Long
More informationFile Systems: FFS and LFS
File Systems: FFS and LFS A Fast File System for UNIX McKusick, Joy, Leffler, Fabry TOCS 1984 The Design and Implementation of a Log- Structured File System Rosenblum and Ousterhout SOSP 1991 Presented
More informationCS720 - Operating Systems
CS720 - Operating Systems File Systems File Concept Access Methods Directory Structure File System Mounting File Sharing - Protection 1 File Concept Contiguous logical address space Types: Data numeric
More informationDrive Space Efficiency Using the Deduplication/Compression Function of the FUJITSU Storage ETERNUS AF series and ETERNUS DX S4/S3 series
White Paper Drive Space Efficiency Using the Function of the FUJITSU Storage ETERNUS F series and ETERNUS DX S4/S3 series The function is provided by the FUJITSU Storage ETERNUS F series and ETERNUS DX
More informationData Deduplication Methods for Achieving Data Efficiency
Data Deduplication Methods for Achieving Data Efficiency Matthew Brisse, Quantum Gideon Senderov, NEC... SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies
More informationTable Compression in Oracle9i Release2. An Oracle White Paper May 2002
Table Compression in Oracle9i Release2 An Oracle White Paper May 2002 Table Compression in Oracle9i Release2 Executive Overview...3 Introduction...3 How It works...3 What can be compressed...4 Cost and
More informationFinding a needle in Haystack: Facebook's photo storage
Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,
More informationPERSISTENCE: FSCK, JOURNALING. Shivaram Venkataraman CS 537, Spring 2019
PERSISTENCE: FSCK, JOURNALING Shivaram Venkataraman CS 537, Spring 2019 ADMINISTRIVIA Project 4b: Due today! Project 5: Out by tomorrow Discussion this week: Project 5 AGENDA / LEARNING OUTCOMES How does
More informationThe Google File System
October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single
More information