Multi-level Byte Index Chunking Mechanism for File Synchronization

IJSEIA, ISSN: 1738-9984, pp. 339-350, http://dx.doi.org/10.14257/ijseia.2014.8.3.31
Copyright © 2014 SERSC

Ider Lkhagvasuren, Jung Min So, Jeong Gun Lee, Jin Kim and Young Woong Ko*
Dept. of Computer Engineering, Hallym University, Chuncheon, Korea
{Ider555, jso, jeonggun.lee, jinkim, yuko}@hallym.ac.kr
* Corresponding author: yuko@hallym.ac.kr

Abstract

In this paper, we propose a probabilistic algorithm for detecting duplicated data blocks over a low-bandwidth network. The algorithm identifies duplicated regions of the destination file and sends only the non-duplicated regions of data. The proposed system builds two index tables for a file, with chunk sizes of 4 MB and 32 KB, respectively. At the first level, the client quickly detects large identical data blocks by applying byte-index chunking with the 4 MB index table. At the second level, byte-index chunking with the 32 KB index table is applied only to the regions left unmatched by the first level. This improves deduplication accuracy without consuming much additional time, because the detailed pass is restricted to the non-duplicated area. Experimental results show that the proposed approach reduces processing time significantly, remaining comparable to fixed-size chunking, while its deduplication rate is as high as that of variable-size chunking.

Keywords: Deduplication, Cloud storage, Chunk, Index-table, Anchor byte

1. Introduction

The explosive growth of digital data causes storage crises, and data deduplication has become one of the hottest topics in data storage. Data deduplication is a method of reducing storage capacity by eliminating duplicated data; by adopting it, we can store more data in the same capacity. Most existing deduplication solutions aim to remove duplicate data in storage systems using traditional chunk-level deduplication strategies. In Content-defined Chunking [1], block boundaries are determined by anchoring on data patterns, which prevents the boundary-shifting problem of the Static Chunking approach. One of the well-known Content-defined Chunking algorithms is LBFS [2], a network file system designed for low-bandwidth networks. Content-defined deduplication achieves a high deduplication ratio, but it requires much more time than the other deduplication approaches. Static Chunking [3] is the fastest algorithm for detecting duplicated blocks, but its detection quality is not acceptable because of the boundary-shifting problem.

The main idea of this paper is to look up the destination file and predict, with high probability, which regions are duplicated with respect to the source file. The fundamental concept used to expedite the search for redundant data is a table named the Index-table (size 256x256). This two-dimensional table serves as a reference to the server file's chunks during the lookup process. The server's chunks are stored in the Index-table indexed by their anchor byte values, and each cell also stores metadata (the server file's chunk hashes and their indexes).

In this work, our key idea is to adapt multi-level byte-index chunking to large files. Byte-index chunking performs efficiently for small and medium sized files, but its performance is insufficient for large files exceeding several gigabytes. We therefore exploit a multi-level approach to enhance byte-index chunking. The proposed scheme divides files into two groups by size. If the file size exceeds 5 GB (in the current implementation), we use both a 4 MB large-chunk index table and a 32 KB small-chunk index table to accelerate duplicate detection; if the file is below 5 GB, detection is performed with the 32 KB index table only. This separation has a large impact on performance for large files. When looking for identical data in a file over 5 GB, we start the lookup process (high-level byte-index chunking) with the 4 MB index table. If a chunk that is likely to be duplicated is found with the 4 MB table, we then run low-level byte-index chunking with small chunks on the 32 KB index table. The strategy of the proposed system is to find large duplicated regions first and to perform the detailed deduplication pass afterwards.
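To make this size-dependent selection concrete, the following minimal Python sketch chooses the chunk-size plan for a file. The 5 GB threshold and the 4 MB / 32 KB chunk sizes come from the paper; the function name and the returned list are illustrative assumptions, not the authors' code.

    import os

    LARGE_FILE_THRESHOLD = 5 * 1024**3   # 5 GB threshold from the paper
    HIGH_LEVEL_CHUNK     = 4 * 1024**2   # 4 MB chunk size for the coarse index table
    LOW_LEVEL_CHUNK      = 32 * 1024     # 32 KB chunk size for the fine index table

    def chunking_plan(path: str) -> list[int]:
        """Return the chunk sizes to use, coarse pass first (illustrative helper)."""
        size = os.path.getsize(path)
        if size > LARGE_FILE_THRESHOLD:
            # Large file: coarse 4 MB pass first, then a 32 KB pass restricted
            # to the regions the first pass could not match.
            return [HIGH_LEVEL_CHUNK, LOW_LEVEL_CHUNK]
        # Small or medium file: a single 32 KB pass is sufficient.
        return [LOW_LEVEL_CHUNK]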

The rest of this paper is organized as follows. Section 2 describes related work on data deduplication systems. Section 3 explains the design principles of the proposed byte-index chunking system and its implementation details. Section 4 presents the performance evaluation of the proposed system, followed by our conclusions and future research plans.

2. Related Works

There are several data deduplication algorithms [3-7]: Static Chunking (SC), Content-defined Chunking (CDC), Whole-file Chunking (WFC) and delta encoding. Static Chunking divides files into a number of fixed-size blocks and then applies a hash function to create a hash key for each block. Venti [3] is a network storage system using Static Chunking, where a 160-bit SHA-1 hash key is used as the address of the data. This enforces a write-once policy, since no other data block can have the same address: multiple writes of the same data yield identical addresses, so duplicate data is easily identified and each data block is stored only once. The main limitation of Static Chunking is the boundary-shift problem. For example, when new data is added to a file, all subsequent blocks in the file are shifted and are likely to be considered different from those in the original file. It therefore becomes difficult to find duplicated blocks, which degrades deduplication performance.

In Content-defined Chunking, block boundaries are determined by anchoring on data patterns, which prevents the data-shifting problem of the Static Chunking approach. One of the well-known Content-defined Chunking systems is LBFS [2], a network file system designed for low-bandwidth networks. LBFS exploits similarities between files, or between versions of the same file, to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique, LBFS achieves up to two orders of magnitude reduction in bandwidth utilization on common workloads compared to traditional network file systems.
Delta encoding [4] stores data in the form of differences between sequential versions. Many backup systems adopt this scheme in order to give users previous versions of a file from earlier backups. It reduces the amount of data that has to be stored for differing versions and, moreover, the cost of uploading each file that has been updated. The DRED system [5][6] uses delta encoding to implement a deduplication service and can efficiently remove duplicated data in web pages and e-mail.
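As a toy illustration of the delta-encoding idea only (not DRED's actual algorithm), the sketch below encodes a new version of a byte string as copy/insert operations against an old version and reconstructs it:

    import difflib

    def make_delta(old: bytes, new: bytes):
        """Encode `new` as copy/insert operations against `old` (toy example)."""
        sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        delta = []
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                delta.append(("copy", i1, i2 - i1))   # reuse old[i1:i2]
            else:
                delta.append(("insert", new[j1:j2]))  # ship literal bytes
        return delta

    def apply_delta(old: bytes, delta) -> bytes:
        out = bytearray()
        for op in delta:
            if op[0] == "copy":
                _, start, length = op
                out += old[start:start + length]
            else:
                out += op[1]
        return bytes(out)

    old = b"The quick brown fox jumps over the lazy dog"
    new = b"The quick brown fox leaps over the lazy dog!"
    assert apply_delta(old, make_delta(old, new)) == new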

3. System Design and Implementation

Figure 1 shows the overall structure of the proposed system, which performs data deduplication between a server and a client. We implemented a deduplication server that employs a source-based approach built on refined byte-index chunking. The server maintains the metadata shared between server and client and produces double Index-tables (size 256x256); the large-chunk table uses 4 MB chunks and is built when the file size exceeds 5 GB. To support the lookup process that finds data with a high probability of being duplicated, the Index-table keeps chunk numbers in a 256x256 table structure, where the values of a chunk's edge bytes are used as its cell row and column numbers. With this information we can quickly find blocks that are very likely to contain duplicated data, which ultimately enables efficient deduplication by exploiting the data pattern of the file, as shown in Figure 1.

Figure 1. System Architecture Overview: Multi-level Byte Index Approach

Suppose we have a file to synchronize. The file is divided into fixed-length blocks; each block's SHA-1 hash value, its chunk number used for reference (chunk-index), and the values of its edge bytes (left and right boundary bytes) are saved on the server. As shown in Figure 2, the proposed system treats these edge bytes as anchor points.

Figure 2. Overview of Chunking with Anchor Points
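The chunking step can be sketched as follows, assuming fixed-length chunks whose first and last bytes act as the anchors; the record layout and function name are illustrative, not the authors' implementation.

    import hashlib
    from typing import NamedTuple

    class ChunkMeta(NamedTuple):
        index: int         # running chunk number within the file (chunk-index)
        sha1: str          # SHA-1 of the chunk contents
        left_anchor: int   # value of the first (left edge) byte
        right_anchor: int  # value of the last (right edge) byte

    def chunk_file(path: str, chunk_size: int) -> list:
        """Split a file into fixed-size chunks and record their anchor metadata."""
        metas = []
        with open(path, "rb") as f:
            index = 0
            while True:
                data = f.read(chunk_size)
                if len(data) < chunk_size:   # this sketch ignores a short tail chunk
                    break
                metas.append(ChunkMeta(index, hashlib.sha1(data).hexdigest(),
                                       data[0], data[-1]))
                index += 1
        return metas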

After the chunking process, the proposed system produces a [256, 256] index table for the synchronized file. For every chunk of the file, its chunk-index is set into the corresponding cell of the Index-table: the value of the first anchor point (left edge byte) gives the horizontal index, and the value of the last anchor point (right edge byte) gives the vertical index of the Index-table.

Figure 3. Overview of Filling the Index-table

Figure 3 shows how the Index-table is filled with the reference points of each chunk. Although the proposed system creates the full metadata list (chunk index, chunk hash and chunk anchor points, i.e., the edge byte values), only the Index-table is sent to the client.
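A minimal sketch of this filling step, using a dictionary keyed by the (left byte, right byte) pair to stand in for the 256x256 table and the illustrative ChunkMeta records from the previous sketch:

    from collections import defaultdict

    def build_index_table(metas):
        """Map (left_anchor, right_anchor) -> list of chunk indexes stored in that cell."""
        table = defaultdict(list)
        for m in metas:
            table[(m.left_anchor, m.right_anchor)].append(m.index)
        return table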

3.1. Predicting Duplicated Data in the Lookup Process

In the lookup process, the system aims to find the chunks that are expected to be duplicated (highly probable duplicate chunks) using the Index-table. To make the search results more accurate, we do not search for chunks one by one; instead, at each offset of the modified file we look for two adjacent chunks at once. If, based on their anchor bytes, some data in the destination file is predicted to be a pair of adjacent duplicate chunks, it means that this data not only has the same length as the corresponding source-file chunks, but also stores the same byte values at each chunk boundary. If we looked up only a single chunk in the Index-table (using just two anchor bytes), many offsets would match without being duplicates, which would make the similarity process wasteful and add unnecessary processing. In contrast, it is very rare for a non-duplicate to survive the prediction when two adjacent chunks are looked up together; the check is more accurate and avoids unnecessary time consumption. That is why we call such data in the destination file a highly probable duplicate chunk.

Figure 4. Duplicated Chunk Lookup Process Overview

The chance that an arbitrary offset matches all four anchor bytes is only one in 256^4 (4,294,967,296). Nevertheless, a highly probable duplicate chunk is only likely to be a duplicate; after the prediction we still confirm whether it is actually duplicated by its SHA-1 value. When the proposed system receives the Index-table from the server, a whole-file lookup starts to determine the highly probable duplicate chunks in the modified destination file. The process reads the destination file from beginning to end. At each reading step, i.e., at offset position i (the first boundary byte of the first chunk), we also read three more bytes: at i+K-1 (the last boundary byte of the first chunk), i+K (the first boundary byte of the second chunk) and i+2K-1 (the last anchor byte of the second chunk).

Figure 5. Lookup of Chunks that Might Be Duplicated in the Modified File

The lookup process then proceeds as follows. First, the system takes the four anchor bytes (shown in Figure 5) and, using the first two, checks whether the corresponding cell of the Index-table stores any chunk index.

If that cell does not store any value, the current offset (the position of the first boundary byte of the first chunk) cannot be the first byte of a duplicate chunk, so the system shifts one byte to the right and continues the lookup. If the cell does store index values, the offset is treated as a possible duplicate, and the system checks the second chunk boundary (the adjacent chunk) in the Index-table in the same way. If the cell addressed by the second chunk's boundary byte values stores no value, the current offset is unlikely to be the first byte of a duplicate chunk, so again we shift one byte to the right. If that cell does contain chunk indexes, we check whether any of them is adjacent to an index found in the cell of the first chunk boundary. If they are not adjacent, the current offset is not the beginning of duplicate chunks, and we shift one byte to the right. If the indexes are adjacent, the data is considered highly probable to be duplicated, so we put the two adjacent chunks (whose first boundary byte begins at the current offset) into the suspicious list (the list of chunks determined to have a high probability of being duplicates) and advance the lookup position by 2*K to search for the next chunks. This skipping saves a great deal of time by avoiding reads of unnecessary bytes.

Determining that chunks might be duplicated does not imply that they are duplicated exactly; this can only be established by checking their hash values against the original file's chunk hashes. Therefore, the client computes the hashes of the suspicious chunks and sends them to the server to verify the duplication. The server compares the received hash values with the corresponding chunk hashes of the original file and confirms whether each chunk is really a duplicate. If any suspicious chunk turns out not to be a duplicate (a very unlikely event), the server sends that chunk range back to the client, and the client repeats the lookup process only within that range, exactly as before. This cycle of lookup and confirmation is repeated until no non-duplicate chunks are returned from the server or no suspicious chunks remain. Note that these repeated passes cover only a limited range of data.
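A minimal sketch of this scan, assuming the Index-table mapping from the previous sketch, a chunk size K, and an in-memory destination file; the real system streams the file and handles edge cases, so the names and structure here are illustrative only.

    def lookup_suspects(dst_path: str, table, K: int):
        """Scan the destination file for adjacent chunk pairs that are probable duplicates."""
        with open(dst_path, "rb") as f:
            data = f.read()                # sketch only; the real system streams the file
        suspects = []                      # (offset, index of the first matched chunk)
        i = 0
        while i + 2 * K <= len(data):
            first = table.get((data[i], data[i + K - 1]), [])
            if not first:
                i += 1                     # no cell hit: shift one byte and retry
                continue
            second = table.get((data[i + K], data[i + 2 * K - 1]), [])
            adjacent = [c for c in first if c + 1 in second]
            if not adjacent:
                i += 1                     # cells hit, but the chunk indexes are not adjacent
                continue
            suspects.append((i, adjacent[0]))
            i += 2 * K                     # skip over the two probable duplicate chunks
        return suspects
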
3.2. Multi-Level Byte Index Chunking

In the proposed system, double Index-tables are used when the file size exceeds a fixed threshold. The chunk size has an appreciable influence on the result of any deduplication approach: setting the chunk size too large accelerates deduplication but decreases the amount of identical data detected in the file, whereas a small chunk size detects identical data at a high rate but takes a long time. Therefore, the proposed system mixes small and large chunks when a large file is being synchronized.

Figure 6. File Similarity Detection Process of the Multi-level Byte Index Chunking Approach

When the size of the synchronized file exceeds 5 GB, we use the multi-level byte index chunking approach. The proposed system produces double Index-tables for the file, with chunk sizes of 4 MB and 32 KB, respectively. At the first level, the client detects large identical data regions very quickly using the 4 MB Index-table with byte-index chunking; shifting by 4 MB in the lookup process is what makes this pass fast. At the second level, byte-index chunking with the 32 KB Index-table is applied to all of the data left unmatched by the first-level similarity detection. This yields more accurate deduplication without consuming much additional time, because the second pass is restricted to the non-duplicated area.
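One way to derive those non-duplicated regions from the first-level results is sketched below; the (offset, length) match format is an assumption made for illustration.

    def non_duplicated_ranges(matches, file_size):
        """Return the gaps in the destination file not covered by confirmed duplicates.

        `matches` is a list of (offset, length) pairs for duplicates found by the
        first-level 4 MB pass; the returned (start, end) ranges are what the
        second-level 32 KB pass has to examine.
        """
        ranges, pos = [], 0
        for offset, length in sorted(matches):
            if offset > pos:
                ranges.append((pos, offset))     # gap before this duplicate
            pos = max(pos, offset + length)
        if pos < file_size:
            ranges.append((pos, file_size))      # tail after the last duplicate
        return ranges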

3.3. System Overview and Implementation

The main goal of the proposed system is to transfer only non-overlapping chunks while circumventing boundary sensitivity. The server performs a few processes in idle mode, before any file deduplication request arrives from a client: it indexes the server file, calculates the chunk hashes, and stores them in the server database.

Figure 7. Multi-level Byte Index based Deduplication System when File Size is Lower than 5 GB

Figure 7 presents the process flow of deduplication for a general file (smaller than 5 GB). When the server receives a deduplication request from the client, it examines every chunk of the requested file, takes out the chunk numbers and puts them into the Index-table; that is, it distributes all the chunk numbers of the requested file from the database into the 256x256 table (Index-table) described in the previous sections. The server then sends the Index-table to the client. The client receives the Index-table and scans the bytes of its file, looking for chunks with a high probability of being duplicates. It computes hash values for the chunks found by the lookup and sends these hashes to the server, which confirms whether the highly probable chunks are really duplicated. If any chunk is not proven to be a duplicate, the corresponding range is sent back to the client for re-examination, and the client looks for other probable duplicate chunks within that non-duplicated range only (a sketch of this confirmation step follows Figure 8 below).

Figure 8 describes the deduplication process when the file size is over 5 GB. The difference from the general process is that the system performs the lookup twice, using the double Index-tables, to determine highly probable duplicated data. In other words, after the system finishes identifying duplicated data in the destination file with the 4 MB chunk-size Index-table, it repeats the previous process with the 32 KB chunk-size Index-table, but only on the non-duplicated regions ascertained by the earlier detection. The remaining steps are performed in the same way as the general process described above.

Figure 8. Multi-level Byte Index Based Deduplication System when File Size is over 5 GB
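The hash-confirmation exchange shown in Figure 7 could look roughly like the following sketch, where a set of server-side chunk hashes stands in for the server and all names are illustrative only.

    import hashlib

    def confirm_suspects(data: bytes, suspects, K: int, server_hashes: set):
        """Split suspects into confirmed duplicates and ranges the client must re-examine."""
        confirmed, reexamine = [], []
        for offset, chunk_index in suspects:
            for part in range(2):                         # each suspect covers two adjacent chunks
                start = offset + part * K
                digest = hashlib.sha1(data[start:start + K]).hexdigest()
                if digest in server_hashes:
                    confirmed.append((start, chunk_index + part))
                else:
                    reexamine.append((start, start + K))  # the lookup runs again inside this range
        return confirmed, reexamine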

4. Performance Evaluation

This section discusses the evaluation results of the proposed system through several experiments. First, we examined the deduplication capability of the proposed system in comparison with the Content-defined Chunking and Fixed-size Chunking approaches. Next, we measured the deduplication time of the approaches under the same workloads. The server and client platforms each consist of a 3 GHz Pentium 4 processor, a WD-1600JS hard disk and a 100 Mbps network. The software is implemented on Linux kernel 2.6.18 (Fedora Core 9). To perform a comprehensive analysis, we implemented several deduplication algorithms for comparison, including fixed-length chunking and content-defined chunking. The experimental data set was made by modifying a file in a random manner: we modified a data file at randomly generated offsets using the lseek() function on Linux and applied a patch to produce each test data file (a rough equivalent of this test-data generation is sketched after Table 1).

Table 1. Amount of Modified Data of Version Files for a Given Older Version

    Number   Data Size   New Data   Overlap (%)
    1        5220 MB      634 MB    87
    2        5220 MB     1091 MB    79
    3        5220 MB     1613 MB    69
    4        5220 MB     2252 MB    57
    5        5220 MB     2727 MB    48
    6        5220 MB     3146 MB    39
    7        5220 MB     3715 MB    28
    8        5220 MB     4180 MB    19
    9        5220 MB     4702 MB     9
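The snippet below is a rough Python equivalent of that workload generator, not the authors' actual tool; the patch size and seed are arbitrary assumptions.

    import os, random

    def make_modified_copy(src: str, dst: str, new_bytes: int, patch_size: int = 64 * 1024):
        """Create a modified version of `src` by overwriting random offsets with new data."""
        with open(src, "rb") as f:
            data = bytearray(f.read())
        random.seed(0)                                   # reproducible test data
        for _ in range(new_bytes // patch_size):
            offset = random.randrange(0, len(data) - patch_size)
            data[offset:offset + patch_size] = os.urandom(patch_size)
        with open(dst, "wb") as f:
            f.write(data)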

In Figure 9, we evaluate deduplication capability while varying the duplication rate, examining how much duplication each chunking approach can find between the original file and each modified version. The graph shows that the content-defined chunking, fixed-size chunking and byte-index chunking approaches all detect data redundancy for the given percentage of modification. Content-defined chunking gives the deduplication ratio closest to the actual overlap of the modified file; there is almost no difference between the real overlapping amount and the amount it detects. Fixed-size chunking shows the lowest deduplication ratio because it is vulnerable to shifts inside the data stream. The multi-level byte-index chunking approach shows a much better result than fixed-size chunking, although it does not reach the level of content-defined chunking.

Figure 9. Deduplication Result Varying Overlapped Data Size

We also measured deduplication speed while varying the modification percentage (Figure 10). Content-defined chunking takes the longest time, and multi-level byte-index chunking is only slightly slower than fixed-size chunking in this experiment. From these results we conclude that multi-level byte-index chunking is a very practical approach compared to several well-known data deduplication algorithms.

Figure 10. Deduplication Performance Time Varying Overlapped Data Size

5. Conclusion

In this paper, we propose multi-level byte index chunking for detecting duplicated data in large-scale network systems. The algorithm identifies the duplicated regions of a file and sends only the non-duplicated parts of the data. The key idea is to adapt the byte-index chunking approach for efficient metadata handling: files are classified by size, and regions that are highly likely to contain identical data blocks are predicted quickly using the double chunk-size Index-tables. The lookup process shifts by a single byte or by twice the chunk size, depending on whether a probable duplicate chunk exists at the current offset. The multi-level byte index chunking approach can detect an amount of duplicate data comparable to the content-defined approach. Several issues remain open. First, our work is limited to simple data files whose redundant data blocks exhibit spatial locality; if a file contains many scattered modifications, overall performance degrades. For future work, we plan to build a massive deduplication system covering a huge number of files, in which case handling the file similarity information will require a more elaborate scheme.

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2012R1A1A2044694).

References

[1] K. Eshghi and H. Tang, "A framework for analyzing and improving content-based chunking algorithms", Hewlett-Packard Labs Technical Report TR, vol. 30, (2005).
[2] A. Muthitacharoen, B. Chen and D. Mazieres, "A low-bandwidth network file system", ACM SIGOPS Operating Systems Review, vol. 35, no. 5, (2001), pp. 174-187.
[3] S. Quinlan and S. Dorward, "Venti: a new approach to archival storage", Proceedings of the FAST 2002 Conference on File and Storage Technologies, (2002).
[4] M. Ajtai, R. Burns, R. Fagin, D. D. E. Long and L. Stockmeyer, "Compactly encoding unstructured inputs with differential compression", Journal of the Association for Computing Machinery, vol. 49, no. 3, (2002), pp. 318-367.
[5] F. Douglis and A. Iyengar, "Application-specific delta-encoding via resemblance detection", Proceedings of the USENIX Annual Technical Conference, (2003), pp. 1-23.
[6] P. Kulkarni, F. Douglis, J. LaVoie and J. Tracey, "Redundancy elimination within large collections of files", Proceedings of the USENIX Annual Technical Conference, USENIX Association, (2004).
[7] H. M. Jung, S. Y. Park, J. G. Lee and Y. W. Ko, "Efficient Data Deduplication System Considering File Modification Pattern", International Journal of Security and Its Applications, vol. 6, no. 2, (2012).

Authors

Ider Lkhagvasuren graduated from the Mongolian University of Science and Technology in 2007 and received a master's degree from the Department of Computer Engineering, Hallym University, in 2013. He is currently working as a researcher at Hallym University. His research interests include data deduplication and cloud systems.

Jungmin So received the B.S. degree in computer engineering from Seoul National University in 2001 and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 2006. He is currently an assistant professor in the Department of Computer Engineering, Hallym University. His research interests include wireless networking and mobile computing.

Jeong-Gun Lee received the B.S. degree in computer engineering from Hallym University in 1996, and the M.S. and Ph.D. degrees from the Gwangju Institute of Science and Technology (GIST), Korea, in 1998 and 2005. He is currently an assistant professor in the Computer Engineering department at Hallym University.

Jin Kim received an M.S. degree in computer science from the College of Engineering at Michigan State University in 1990 and a Ph.D. degree from Michigan State University in 1996. Since then he has been working as a professor of computer engineering at Hallym University. His research includes bioinformatics and data mining.

Young Woong Ko received both the M.S. and Ph.D. in computer science from Korea University, Seoul, Korea, in 1999 and 2003, respectively. He is now a professor in the Department of Computer Engineering, Hallym University, Korea. His research interests include operating systems, embedded systems and multimedia systems.