, pp. 339-350 http://dx.doi.org/10.14257/ijseia.2014.8.3.31

Multi-level Byte Index Chunking Mechanism for File Synchronization

Ider Lkhagvasuren, Jung Min So, Jeong Gun Lee, Jin Kim and Young Woong Ko*
Dept. of Computer Engineering, Hallym University, Chuncheon, Korea
{Ider555, jso, jeonggun.lee, jinkim, yuko}@hallym.ac.kr

Abstract

In this paper, we propose a probabilistic algorithm for detecting duplicated data blocks over a low-bandwidth network. The algorithm identifies duplicated regions of the destination file and sends only the non-duplicated regions of data. The proposed system produces two Index-tables for a file, with chunk sizes of 4MB and 32KB, respectively. At the first level, the client rapidly detects large identical data blocks with the byte-index chunking approach using the 4MB Index-table. At the second level, we perform byte-index chunking with the 32KB Index-table over the entire non-duplicated area remaining after the first-level similarity detection. This yields a more accurate deduplication rate without consuming much additional time, because the second-level work is restricted to the non-duplicated area. Experimental results show that the proposed approach achieves processing time comparable to fixed-size chunking, while its deduplication rate approaches that of variable-size chunking.

Keywords: Deduplication, Cloud storage, Chunk, Index-table, Anchor byte

1. Introduction

The explosive growth of digital data causes storage crises, and data deduplication has become one of the hottest topics in data storage. Data deduplication is a method of reducing storage capacity by eliminating duplicated data; by adopting it, we can store more files in the same capacity than before. Most existing deduplication solutions aim to remove duplicate data in storage systems using traditional chunk-level deduplication strategies.
There are many data deduplication systems. In Content-defined Chunking [1], block boundaries are chosen by anchoring on the data patterns themselves, which prevents the boundary-shifting problem of the Static Chunking approach. One of the well-known Content-defined Chunking algorithms is LBFS [2], a network file system designed for low-bandwidth networks. Content-defined deduplication can achieve a high deduplication ratio, but it requires much more time to perform the deduplication process than the other approaches. Static Chunking [3] is the fastest algorithm for detecting duplicated blocks, but its deduplication performance suffers from the boundary-shifting problem.

In this paper, the main idea is to predict, with high probability, which regions of a destination file duplicate the source file, and to apply the lookup process there. The fundamental concept for expediting the search for redundant data is a table named the Index-table (size 256x256). This two-dimensional matrix is used as a reference to the server file's chunks during the lookup process. The server's chunks are recorded in the Index-table, indexed by their anchor byte values, and each cell stores metadata (the server file's chunk hashes and their indexes).

In this work, our key idea is to adapt multi-level byte-index chunking for large files. Byte-index chunking shows efficient and improved performance for small and medium-sized files, but insufficient performance for large files exceeding several gigabytes. Therefore, we exploit a multi-level approach to enhance byte-index chunking. The proposed scheme divides files into two groups by size. If the file size is over 5GB (in the current implementation), we use both a 4MB large Index-table and a 32KB small Index-table to accelerate duplicate detection; if the file size is below 5GB, we perform detection using only the 32KB Index-table. This separation greatly speeds up processing for large data files. When looking for identical data in a large file (over 5GB), we start the lookup process (high-level byte-index chunking) with the 4MB Index-table. If we find a chunk that is likely to be duplicated using the 4MB Index-table, we then start low-level byte-index chunking with small chunks on the 32KB Index-table. The strategy of the proposed system is to find large duplicated regions first and perform the detailed deduplication process afterwards.

The rest of this paper is organized as follows. In Section 2, we describe related work on data deduplication systems. In Section 3, we explain the design principles of the proposed byte-index chunking system and its implementation details. In Section 4, we present the performance evaluation of the proposed system, and we then conclude and discuss future research plans.

* Corresponding author: yuko@hallym.ac.kr
ISSN: 1738-9984 IJSEIA Copyright (c) 2014 SERSC

2. Related Works

There are several different data deduplication algorithms [3-7]: Static Chunking (SC), Content-defined Chunking (CDC), Whole-file Chunking (WFC) and delta encoding.
Static Chunking divides files into a number of fixed-size blocks and then applies a hash function to create a hash key for each block. Venti [3] is a network storage system using Static Chunking, where a 160-bit SHA-1 hash key is used as the address of the data. This enforces a write-once policy, since no other data block can exist with the same address; the addresses of multiple writes of the same data are identical, so duplicate data is easily identified and each data block is stored only once. The main limitation of Static Chunking is the boundary-shift problem. For example, when new data is added to a file, all subsequent blocks in the file are shifted and are likely to be considered different from those in the original file. This makes it difficult to find duplicated blocks, degrading deduplication performance.

In Content-defined Chunking, block boundaries are chosen by anchoring on the data patterns, which prevents the boundary-shifting problem of Static Chunking. One of the well-known Content-defined Chunking systems is LBFS [2], a network file system designed for low-bandwidth networks. LBFS exploits similarities between files, or between versions of the same file, to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique, LBFS achieves up to two orders of magnitude reduction in bandwidth utilization on common workloads compared to traditional network file systems.

Delta encoding [4] stores data in the form of differences between sequential versions of the data. Many backup systems adopt this scheme in order to give their users previous versions of the same file from earlier backups. This reduces the amount of data that has to be stored for differing versions, and likewise the cost of uploading each file that has been updated.
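As an illustration of the Static Chunking scheme discussed in this section, the following Python sketch (function name and chunk size are our own illustrative assumptions, not taken from Venti or any cited system) stores each distinct fixed-size block once under its SHA-1 digest, and demonstrates the boundary-shift problem: a single prepended byte shifts every subsequent block so nothing deduplicates.

```python
import hashlib

def dedup_store(data: bytes, chunk_size: int = 4096) -> dict:
    """Static chunking: split data into fixed-size blocks and store
    each distinct block once, keyed by its 160-bit SHA-1 digest."""
    store = {}
    for offset in range(0, len(data), chunk_size):
        block = data[offset:offset + chunk_size]
        store.setdefault(hashlib.sha1(block).hexdigest(), block)
    return store

original = b"A" * 4096 + b"B" * 4096 + b"A" * 4096  # third block repeats the first
shifted = b"X" + original                           # one byte inserted at the front

# The repeated "A" block is stored only once: 3 blocks, 2 distinct keys.
assert len(dedup_store(original)) == 2

# Boundary shift: after the insertion every block's content changes,
# so no block of `shifted` hashes to any block of `original`.
common = dedup_store(original).keys() & dedup_store(shifted).keys()
assert len(common) == 0
```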
The DRED system [5][6] uses the delta encoding approach to implement a deduplication service; it can efficiently remove duplicated data in Web pages and e-mail.
3. System Design and Implementation

Figure 1 describes the overall structure of the proposed system, which performs data deduplication between the server and the client. In this work, we implemented a deduplication server employing a source-based approach that uses a refined and improved byte-index technique. The server maintains metadata shared between server and client and produces two Index-tables (size: 256x256), the larger with a 4MB chunk size (used when the file size is over 5GB). To support the lookup process for finding data that is duplicated with high probability, the Index-table keeps chunk numbers in a 256x256 table structure, where each chunk's edge byte values are used as its cell's row and column numbers. From this information we can very quickly find data blocks with a very high probability of being duplicates, which eventually allows efficient data deduplication by exploiting the data file's byte patterns, as shown in Figure 1.

Figure 1. System Architecture Overview: Multi-level Byte Index Approach

Suppose we have a file to synchronize. The file is divided into fixed-length blocks; each block's SHA-1 hash value is computed, each chunk is numbered for reference (the chunk index), and the value of each chunk's edge bytes (the left and right boundary bytes) is saved on the server. As can be seen in Figure 2, the proposed system treats these edge bytes as anchor points.

Figure 2. Overview of Chunking with Anchor Points
After the chunking process, the proposed system produces a [256, 256] Index-table for the synchronized file. For every chunk of the file, we set its chunk index in the corresponding cell of the Index-table: the first anchor point (left edge byte) of the chunk gives the horizontal index, and the last anchor point (right edge byte) gives the vertical index.

Figure 3. Overview of Filling the Index-table

Figure 3 shows how the Index-table is filled with the reference points of each chunk. After the proposed system creates the metadata list (chunk index, chunk hash and chunk anchor points, i.e., the values of the edge bytes), the server sends only the Index-table to the client.

3.1. Predicting Duplicated Data in the Look-up Process

In the look-up process, the system aims to find the chunks that are expected to be duplicated (highly probable duplicate chunks) using the Index-table. To make the search results more accurate, we do not search for chunks one by one; instead, we look for two adjacent chunks at each offset in the modified file. If, based on their anchor bytes, some data in the destination file is predicted to be a pair of adjacent highly probable duplicate chunks, it means this data not only has the same length as the corresponding chunks of the source file, but also stores the same byte values at the positions of each chunk's boundaries. If we looked up just a single chunk in the Index-table (using only two anchor bytes), there would be plenty of offsets with no actual duplicates, making the file similarity process wasteful and causing unnecessary data processing. In contrast, it is a very rare occasion that a non-duplicate comes up in the prediction result when we look up two adjacent chunks; in other words, the pairwise lookup is more accurate and avoids unnecessary time consumption.
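The Index-table construction just described can be sketched in Python as follows (a minimal illustration; the list-of-lists layout and the function name are our own assumptions, not the paper's implementation):

```python
import hashlib

def build_index_table(data: bytes, chunk_size: int):
    """Chunk the server file into fixed-size blocks and record each
    chunk's index in cell [left edge byte][right edge byte] of a
    256x256 table; also keep the per-chunk SHA-1 hashes as metadata.
    A trailing partial chunk is ignored in this sketch."""
    table = [[[] for _ in range(256)] for _ in range(256)]
    hashes = []
    for idx in range(len(data) // chunk_size):
        chunk = data[idx * chunk_size:(idx + 1) * chunk_size]
        table[chunk[0]][chunk[-1]].append(idx)  # anchor bytes select the cell
        hashes.append(hashlib.sha1(chunk).hexdigest())
    return table, hashes
```

For example, with `chunk_size = 4` and data `b"\x01\x00\x00\x02\x03\x00\x00\x04"`, chunk 0 (edge bytes 1 and 2) is recorded in cell [1][2] and chunk 1 (edge bytes 3 and 4) in cell [3][4].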
This is why we call such a region of the destination file a highly probable duplicate chunk.
Figure 4. Duplicated Chunk Look-up Process Overview

For a random offset, the probability of matching all four anchor bytes by chance is only 1 in 256^4 (4,294,967,296). Nevertheless, a highly probable duplicate chunk is only possibly a duplicate; after predicting it, we confirm whether it is actually duplicated by comparing SHA-1 values. When the proposed system receives the Index-table from the server, the whole-file lookup process starts in order to determine highly probable duplicate chunks in the modified (destination) file. The process reads the destination file from beginning to end. At each step, i.e., at offset position i (the first boundary byte of the first chunk), we also read three more bytes: at i+k-1 (the last boundary byte of the first chunk), i+k (the first boundary byte of the second chunk) and i+2k-1 (the last anchor byte of the second chunk), where k is the chunk size.

Figure 5. Looking up Chunks which Might be Duplicated in the Modified File

The lookup process then proceeds as follows. First, the system takes the four anchor bytes (shown in Figure 5) and uses the first two anchor bytes to check whether the corresponding cell of the Index-table stores any value, to confirm that the first chunk's index is stored
in the Index-table. If that cell does not store any value, then it is safe to say that the current offset (the first boundary byte position of the first chunk) cannot be the first byte of a duplicate chunk, so the system shifts one byte to the right and continues the lookup. If the cell stores any index values, we consider the offset a possible duplicate. The system then checks the second chunk's boundary (the adjacent chunk) in the Index-table in the same way, to see whether a chunk index is stored there. If the cell indexed by the second chunk's boundary byte values does not store any value, then the current offset has no high probability of being the first byte of a duplicate chunk, and we again shift one byte to the right. If that cell does contain chunk index values, we check whether any of the found indexes is adjacent to one of the previously found indexes (stored in the cell indexed by the first chunk's boundary bytes). If none are adjacent, the current offset is not the beginning of a pair of duplicate chunks, and we shift one byte to the right. If adjacent indexes are found, we regard this as data with a high probability of being a duplicate. Hence, we put these two adjacent chunks (whose first boundary byte begins at the current offset) into our suspect list (the list storing chunks determined to have a high probability of being duplicates). Afterwards, we shift the lookup position by 2k to find the next chunks; this saves a great deal of time by avoiding reads of unnecessary bytes. Although we determine the chunks that might be duplicated, this does not necessarily imply that the chunks are exact duplicates.
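The lookup loop above can be sketched as follows (a toy Python model: `table[a][b]` is assumed to hold the list of server chunk indexes whose left edge byte is `a` and right edge byte is `b`; the function name is our own):

```python
def find_probable_duplicates(dest: bytes, table, k: int):
    """Scan the destination file for adjacent chunk pairs whose four
    anchor bytes hit adjacent chunk indexes in the Index-table.
    Returns (offset, first_chunk_index) suspects; their hashes must
    still be confirmed by the server afterwards."""
    suspects = []
    i = 0
    while i + 2 * k <= len(dest):
        first = table[dest[i]][dest[i + k - 1]]           # edges of candidate chunk 1
        second = table[dest[i + k]][dest[i + 2 * k - 1]]  # edges of candidate chunk 2
        # adjacency check: some index j in the first cell with j+1 in the second
        j = next((j for j in first if j + 1 in second), None)
        if j is None:
            i += 1          # no probable pair at this offset: shift one byte
        else:
            suspects.append((i, j))
            i += 2 * k      # skip past the two probable chunks
    return suspects
```

For instance, if the server file's two chunks of size 4 have edge bytes (1, 2) and (3, 4), scanning a destination file that contains the same eight bytes starting at offset 2 reports the suspect pair (2, 0).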
Whether the chunks are really duplicates is verified by checking their hash values against the original file chunks' hash values. Therefore, the client computes the hashes of the suspect chunks and sends them to the server to confirm the duplication. The server receives the hash values and compares them to the corresponding chunk hashes of the original file to confirm whether each chunk is duplicated or not. If any suspect chunk proves not to be a duplicate (even though this occasion has a very low likelihood), the server sends that chunk's range back to the client for another lookup step. The client receives the non-duplicated chunk range and performs the lookup process only within that range, in the same manner as before. This cycle of lookup and suspect confirmation is repeated until the server reports no non-duplicate chunks or the client finds no more suspect chunks. Note that these repeated passes cover only a limited range of data.

3.2. Multi-level Byte Index Chunking

In the proposed system, two Index-tables are used when the file size exceeds a fixed threshold. In fact, the chunk size has an appreciable influence on the result of any deduplication approach. Setting the chunk size too large accelerates deduplication but decreases the amount of identical data detected in the file; conversely, a small chunk size gives a high detection rate of identical data but consumes a long time. Therefore, the proposed system mixes small and large chunks when a large file is being synchronized.
Figure 6. File Similarity Detection Process of the Multi-level Byte Index based Chunking Approach

When the synchronized file is larger than 5GB, we use the multi-level byte-index chunking approach. The proposed system produces two Index-tables for the file, with chunk sizes of 4MB and 32KB, respectively. At the first level, the client detects large identical regions very quickly with byte-index chunking using the 4MB Index-table; in fact, shifting by 4MB in the lookup process is what affords the fast performance. At the second level, we perform byte-index chunking with the 32KB Index-table over the entire non-duplicated area remaining from the first-level similarity detection. This yields more accurate deduplication without consuming much time, because the work is restricted to the non-duplicated area.

3.3. System Overview and Implementation

The main goal of the proposed system is to transfer only non-overlapping chunks while circumventing boundary-shift sensitivity. The server performs a few processes in idle mode, which are already complete before a file deduplication request arrives from a client. In other words, indexing the server file, calculating the hashes and moving them to the server database have to be done first, while the system is idle.

Figure 7. Multi-level Byte Index based Deduplication System when the File Size is Lower than 5GB
Figure 7 presents the process flow of general file deduplication (file size lower than 5GB). When the server receives a deduplication request from the client, the server examines every chunk index of the requested file, takes out each chunk's number index and puts it into the Index-table. In particular, this means distributing all the chunk numbers of the requested file from the database into the 256x256 table (the Index-table) described in the previous sections. After that, the server sends the Index-table to the client. The client receives the Index-table and scans the bytes of the client file, looking for high-probability duplicate chunks using the received table. It computes the hash values of the probable duplicate chunks found by the lookup and sends them to the server, which confirms by hash comparison whether the highly probable chunks are in fact duplicates. If any chunk is not proven to be a duplicate, the non-duplicated chunk range is sent back to the client for re-examination, to find other probable duplicate chunks within that range only.

Figure 8 describes the deduplication process when the file size is over 5GB. Unlike the general process, the proposed system performs the lookup process twice, using two Index-tables, to determine high-probability duplicated data. In other words, after the system finishes identifying duplicated data in the destination file using the 4MB Index-table, it repeats the previous steps to identify identical data using the 32KB Index-table, but only within the non-duplicated regions ascertained by the previous detection. Further processing proceeds in the same way as the general case described previously.

Figure 8. Multi-level Byte Index Based Deduplication System when the File Size is over 5GB
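The two-level pass can be sketched as follows. This is a self-contained toy model, using plain aligned-chunk hash matching as a stand-in for the byte-index lookup; the function names, region representation and chunk sizes in the example are our own assumptions, not the paper's implementation.

```python
import hashlib

def _dup_regions(dest: bytes, src: bytes, k: int):
    """(offset, length) regions of dest matching some aligned src chunk of size k."""
    src_hashes = {hashlib.sha1(src[i:i + k]).digest()
                  for i in range(0, len(src) - k + 1, k)}
    regions, i = [], 0
    while i + k <= len(dest):
        if hashlib.sha1(dest[i:i + k]).digest() in src_hashes:
            regions.append((i, k))
            i += k              # duplicate found: jump a whole chunk
        else:
            i += 1              # shift one byte, as in the lookup process
    return regions

def multi_level_dedup(dest: bytes, src: bytes, large_k: int, small_k: int):
    """First level: find big duplicated regions with the large chunk size.
    Second level: rescan only the remaining gaps with the small chunk size."""
    dup = _dup_regions(dest, src, large_k)
    gaps, pos = [], 0
    for off, length in sorted(dup):          # compute the non-duplicated gaps
        if off > pos:
            gaps.append((pos, off))
        pos = off + length
    if pos < len(dest):
        gaps.append((pos, len(dest)))
    for start, end in gaps:                  # second-level pass over gaps only
        dup += [(start + off, length)
                for off, length in _dup_regions(dest[start:end], src, small_k)]
    return sorted(dup)
```

For example, with `src = b"A"*16 + b"B"*16`, `large_k = 16` and `small_k = 4`, the destination `b"XXX" + b"A"*16 + b"YY" + b"B"*4 + b"Z"` yields the large region (3, 16) at the first level, and the second level finds the small region (21, 4) inside a remaining gap.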
4. Performance Evaluation

This section discusses the evaluation of the proposed system through several experiments. First, we examined the deduplication capability of the proposed system, comparing it with the content-defined chunking and fixed-size chunking approaches. Next, we measured the deduplication time of the several approaches under the same workloads. The server and client platforms consist of a 3 GHz Pentium 4 processor, a WD-1600JS hard disk and a 100 Mbps network. The software is implemented on Linux kernel version 2.6.18 (Fedora Core 9). To perform a comprehensive analysis, we implemented several deduplication algorithms for comparison, including fixed-length chunking and content-defined chunking. We made the experimental data set by modifying a file in a random manner: we modified a data file with the lseek() function in Linux at randomly generated file offsets and applied a patch to make each test data file.

Table 1. Amount of Modified Data in Versioned Files of a Given Older Version

Version Number | Data Size | New Data | Overlap (%)
1              | 5220 MB   |  634 MB  | 87
2              | 5220 MB   | 1091 MB  | 79
3              | 5220 MB   | 1613 MB  | 69
4              | 5220 MB   | 2252 MB  | 57
5              | 5220 MB   | 2727 MB  | 48
6              | 5220 MB   | 3146 MB  | 39
7              | 5220 MB   | 3715 MB  | 28
8              | 5220 MB   | 4180 MB  | 19
9              | 5220 MB   | 4702 MB  | 9

In Figure 9, we performed a deduplication experiment measuring deduplication capability while varying the duplication rate, examining how much duplication each chunking approach finds between the files under this workload. The graph in Figure 9 shows that the content-defined chunking, fixed-size chunking and byte-index chunking approaches can each find the data redundancy between the original file and a file modified by the given percentage. Compared to the actual overlap of the modified file in the graph, the content-defined chunking approach yields the deduplication ratio closest to the true overlap.
There is almost no difference between the actual overlapping amount and the overlap found by content-defined chunking. The fixed-size chunking approach shows the lowest deduplication ratio because it is vulnerable to shifts inside the data stream. The multi-level byte-index chunking approach shows a much better result than fixed-size chunking, though it could not reach as high as content-defined chunking.
Figure 9. Deduplication Result Varying Overlapped Data Size

We measured the deduplication speed while varying the modification percentage (Figure 10). We can see that the content-defined chunking approach takes the longest time, while the multi-level byte-index chunking approach is only slightly slower than fixed-size chunking in this experiment. From these results, we conclude that multi-level byte-index chunking is a very practical approach compared to several well-known data deduplication algorithms.

Figure 10. Deduplication Performance Time Varying Overlapped Data Size
5. Conclusion

In this paper, we propose multi-level byte-index chunking for detecting duplicated data in large-scale network systems. The algorithm identifies the duplicated regions of a file and sends only the non-duplicated parts of the data. The key idea is to adapt the byte-index chunking approach for efficient metadata handling: the system classifies data files by size and quickly predicts highly similar regions that contain identical data blocks using two Index-tables of different chunk sizes. The lookup process shifts by either a single byte or twice the chunk size, depending on whether a probable chunk exists at the current offset. The multi-level byte-index chunking approach can achieve a deduplication rate comparable to the content-defined approach. Several issues remain open. First, our work is limited to simple data files whose redundant data blocks have spatial locality; if a file contains many scattered modifications, overall performance will degrade. For future work, we plan to build a massive deduplication system with a huge number of files; in that case, handling file similarity information needs a more elaborate scheme.

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1A2044694).

References

[1] K. Eshghi and H. Tang, "A framework for analyzing and improving content-based chunking algorithms", Hewlett-Packard Labs Technical Report TR, vol. 30, (2005).
[2] A. Muthitacharoen, B. Chen and D. Mazieres, "A low-bandwidth network file system", ACM SIGOPS Operating Systems Review, vol. 35, no. 5, (2001), pp. 174-187.
[3] S. Quinlan and S.
Dorward, "Venti: a new approach to archival storage", Proceedings of the FAST 2002 Conference on File and Storage Technologies, (2002).
[4] M. Ajtai, R. Burns, R. Fagin, D. D. E. Long and L. Stockmeyer, "Compactly encoding unstructured inputs with differential compression", Journal of the Association for Computing Machinery, vol. 49, no. 3, (2002), pp. 318-367.
[5] F. Douglis and A. Iyengar, "Application-specific delta-encoding via resemblance detection", Proceedings of the USENIX Annual Technical Conference, (2003), pp. 1-23.
[6] P. Kulkarni, F. Douglis, J. LaVoie and J. Tracey, "Redundancy elimination within large collections of files", Proceedings of the USENIX Annual Technical Conference, USENIX Association, (2004).
[7] H. M. Jung, S. Y. Park, J. G. Lee and Y. W. Ko, "Efficient Data Deduplication System Considering File Modification Pattern", International Journal of Security and Its Applications, vol. 6, no. 2, (2012).

Authors

Ider Lkhagvasuren graduated from the Mongolian University of Science and Technology in 2007 and received his master's degree from the Department of Computer Engineering, Hallym University, in 2013. He is currently working as a researcher at Hallym University. His research interests include data deduplication and cloud systems.
Jungmin So received the B.S. degree in computer engineering from Seoul National University in 2001, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 2006. He is currently an assistant professor in the Department of Computer Engineering, Hallym University. His research interests include wireless networking and mobile computing.

Jeong-Gun Lee received the B.S. degree in computer engineering from Hallym University in 1996, and the M.S. and Ph.D. degrees from the Gwangju Institute of Science and Technology (GIST), Korea, in 1998 and 2005. He is currently an assistant professor in the Computer Engineering department at Hallym University.

Jin Kim received an M.S. degree in computer science from the College of Engineering at Michigan State University in 1990, and a Ph.D. degree from Michigan State University in 1996. Since then he has been working as a professor of computer engineering at Hallym University. His research includes bioinformatics and data mining.

Young Woong Ko received both an M.S. and a Ph.D. in computer science from Korea University, Seoul, Korea, in 1999 and 2003, respectively. He is now a professor in the Department of Computer Engineering, Hallym University, Korea. His research interests include operating systems, embedded systems and multimedia systems.