Data deduplication for Similar Files


Int'l Conf. Scientific Computing CSC'17

Mohamad Zaini Nurshafiqah, Nozomi Miyamoto, Hikari Yoshii, Riichi Kodama, Itaru Koike, Toshiyuki Kinoshita
School of Computer Science, Tokyo University of Technology, Hachioji, Tokyo, Japan

Abstract - Recently, massive data growth and duplicate data in enterprise systems have led to the use of deduplication techniques. Deduplication is a powerful storage minimization technique that can be adopted to manage the maintenance issues caused by data growth. The target files for deduplication are divided into several parts (each part is called a block) and any duplicate blocks are eliminated. In the variable-length block method, a particular bit-pattern (called a singularity) is used to decide the breakpoints of the blocks. Since multimedia data such as audio and image data are very large, the demand for data compression is high because it saves space. In this research, we extracted files with high similarity using file similarity determination, and proposed a similar file extraction method that eliminates duplication only for these files. Morphological analysis and cosine similarity are used to determine the similarity of text files. By deduplicating only files with high similarity, the time required for deduplication can be minimized without reducing the effect of deduplication too much. Experimental results confirmed that if deduplication is performed on files with similarity of 0.5 or more, the reduction of the deduplication rate is suppressed to about 8% and the processing time can be shortened to about 1/3.3.

Keywords: data deduplication, variable-length block, similar file extraction, deduplication rate

1 Introduction

In recent years, the volume of file data in enterprise systems has greatly increased due to the growing popularity of multimedia data such as audio, animation and video. Among these multimedia files, many exactly or mostly identical files may exist, and deduplication techniques must be used to minimize the file data volume by eliminating redundant data.

Data deduplication is a file compaction technique commonly used in enterprise systems; it removes duplicates within and across files. The general concept has been successfully applied to file backup, virtual machine storage, WAN replication and so on. Data deduplication is a process that calculates the similarity of record pairs and merges them if similarity is detected. It can reduce a huge amount of data by eliminating overlapping (redundant) data in large-scale servers or data storage.

Using data deduplication, only one representative of two or more overlapping data files, or of the same areas of similar data, is preserved, while the overlapping data is replaced with links that point to the representative data (as shown in Figure 1). By replacing multiple overlapping data with links, the data storage size can be greatly reduced. As a result, the efficiency of data storage is improved and the cost of data maintenance and storage is decreased.

Fig. 1 Concept of data deduplication

In this research, we extracted files with high similarity using file similarity determination, and proposed a similar file extraction method that eliminates duplication only for these files. The method improves the efficiency of deduplication by decomposing sentences into words using morphological analysis, extracting similar files by similarity determination using cosine similarity, and deduplicating only the similar files. Experiments confirmed that the similar file extraction method is an effective way to shorten the time required for deduplication without decreasing the effect of deduplication too much.


2 Data deduplication

2.1 Two types of block

In the data deduplication technique, the target files are divided into several parts, each of which is called a block, and any duplicate blocks are eliminated. Because files are divided into blocks and duplicates are found block by block, files are not required to be strictly identical and high deduplication efficiency can be achieved.

There are two types of block: the fixed-length block, whose length is constant, and the variable-length block, whose length can change. In the fixed-length block method, when some data are inserted into a file, the blocks after the inserted position are shifted and can no longer be recognized as duplicates of the blocks in the original data (Figure 2 (a)). In the variable-length block method, on the other hand, even if some data are inserted into the file, the block length is adjusted for the insertion, so the blocks after the inserted position can still be recognized as duplicate blocks and deduplicated (Figure 2 (b)). Thus, in the variable-length block method, the effect of deduplication is maintained even if some data have been inserted or deleted.

Figure 3 shows how to detect the breakpoints of variable-length blocks efficiently. First, the hash value of a small part with constant length, called a window, is calculated. The window indicates a candidate for the breakpoint of the block. When a particular bit-pattern, called a singularity, is included in the hash value of the window, the candidate becomes a real breakpoint of the block. When the singularity is not included in the hash value, the candidate does not become a breakpoint.

The effect of deduplication is affected by the singularity, especially by the singularity size. In our previous works [6] [7], we found that the optimum singularity size is 15 bits and the optimum window size is around 32 bytes. In this study, we also used these parameters.
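The weakness of fixed-length blocks described above can be illustrated with a minimal Python sketch. The function names and the 32-byte block size are assumptions chosen for illustration, and SHA-1 digests stand in for whatever block comparison the authors actually used. The sketch counts how many blocks of an edited file are still recognized after a single byte is inserted at the front.

```python
import hashlib


def fixed_blocks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive blocks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_block_count(original: bytes, edited: bytes, size: int) -> int:
    """Count blocks of `edited` whose digest also occurs among the blocks of `original`."""
    known = {hashlib.sha1(block).digest() for block in fixed_blocks(original, size)}
    return sum(hashlib.sha1(block).digest() in known
               for block in fixed_blocks(edited, size))


original = bytes(range(256))   # 256 distinct bytes -> 8 distinct 32-byte blocks
edited = b"X" + original       # one byte inserted at the front shifts every block

print(shared_block_count(original, original, 32))  # all 8 blocks match
print(shared_block_count(original, edited, 32))    # no block matches after the shift
```

Because every block boundary after the insertion moves by one byte, none of the shifted blocks has the same content as an original block, which is the situation shown in Figure 2 (a).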
2.2 Variable-length block

The Rabin-Karp string search algorithm is used to find the singularity in the hash value of the window. The following parameters are used in the algorithm.

(1) Minimum file size (default is 40 bytes): files that are smaller than this size are not targeted for deduplication.
(2) Minimum block length (default is 4,000 bytes)
(3) Maximum block length (default is 16,000 bytes)
(4) Window size (default is 32 bytes): the window is the unit for calculating a hash value.
(5) Singularity

Fig. 2 Two types of block: (a) Fixed-length block, (b) Variable-length block

When the singularity is included in the bit-pattern of the hash value of the window, a breakpoint is found and a block is generated at this position. For a file whose size is between the minimum file size and the minimum block length, the whole file is generated as one block. When a file is larger than the minimum block length, a breakpoint is determined.

As shown in Figure 3, in searching for the breakpoint, a hash value is first created for the window at the location of the minimum block length and checked to see whether it includes the singularity. When the hash value includes the singularity, a breakpoint is found and a block is generated at this position. When the hash value does not include the singularity, the window is shifted by one byte and the breakpoint search is repeated. When no breakpoint is found up to the maximum block length, a maximum-length block is generated at this position.

In the variable-length block method, the block length can change, and the maximum and minimum block lengths are set so that extremely large or small blocks are not generated. The effect of deduplication is also affected by these maximum and minimum block lengths.
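The breakpoint search described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the polynomial rolling hash (in the style of Rabin-Karp), the constants `BASE` and `MOD`, and the choice of "the lowest 15 bits of the hash are all zero" as the singularity are assumptions, since the text gives only the singularity size and the window size. The block length limits and window size follow the defaults listed above.

```python
WINDOW = 32          # window size in bytes (default from the text)
MIN_BLOCK = 4000     # minimum block length in bytes
MAX_BLOCK = 16000    # maximum block length in bytes
SING_BITS = 15       # singularity size in bits
SING_MASK = (1 << SING_BITS) - 1
SINGULARITY = 0      # assumed bit-pattern: lowest 15 bits all zero
BASE = 257           # assumed rolling-hash base
MOD = (1 << 61) - 1  # assumed rolling-hash modulus


def window_hash(window: bytes) -> int:
    """Polynomial hash of one window."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def split_blocks(data: bytes) -> list[bytes]:
    """Split data into variable-length blocks using the singularity test."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    blocks = []
    start = 0
    while start < len(data):
        if len(data) - start <= MIN_BLOCK:
            blocks.append(data[start:])        # short tail becomes one block
            break
        limit = min(start + MAX_BLOCK, len(data))
        cut = limit                            # fallback: maximum-length block
        pos = start + MIN_BLOCK                # first breakpoint candidate
        h = window_hash(data[pos - WINDOW:pos])
        while pos < limit:
            if (h & SING_MASK) == SINGULARITY:
                cut = pos                      # singularity found: real breakpoint
                break
            # Shift the window by one byte: drop the oldest byte, add the next one.
            h = ((h - data[pos - WINDOW] * top) * BASE + data[pos]) % MOD
            pos += 1
        blocks.append(data[start:cut])
        start = cut
    return blocks
```

Because a breakpoint depends only on the bytes inside the window, an insertion early in a file changes the blocks around the edit, while later breakpoints tend to fall at the same content positions again.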


Fig. 3 Breakpoint search

3 Related works

The effect of deduplication in the fixed-length block method when the block length is set to 4 to 16 KB was investigated in [1], and the efficiency of the variable-length method was discussed in [3]. The differences in the effect of deduplication between the variable-length block method and the fixed-length block method were reported in [2] for block lengths larger than 4 KB and discussed in [4] for block lengths smaller than 4 KB. In [5], a double layered deduplication method that combines the fixed-length method and the variable-length method was proposed. These studies investigated how the block length affects the effect of deduplication.

In our previous work [6][7], we analyzed the relationship between the singularity size and the deduplication rate. In [8] and [9], we researched the efficiency of deduplication for firmware files and audio data files, respectively. In this research, we propose a method that shortens the processing time of deduplication without reducing the effect of deduplication too much by performing deduplication only on similar files.

4 Similar file extraction method

The process of deduplication consists of the following steps:

(1) Create blocks
(2) Detect block duplications
(3) Delete duplicates
(4) Create links

Since a large amount of processing is carried out, the load on the CPU is high. As the number of target files for deduplication increases, duplicated parts become easier to find. However, the increase in the number of files also increases the time needed to find the duplicated parts. In this research, we propose a similar file extraction method that, prior to deduplication, extracts highly similar files from the target file group and performs deduplication only on the files with high similarity.
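The four processing steps listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the blocks are assumed to have been created already, and a SHA-256 digest is assumed as the means of detecting duplicate blocks, which the text does not specify.

```python
import hashlib


def deduplicate(blocks: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Keep one representative per distinct block and replace the rest with links.

    Returns (store, links): `store` holds the representative blocks and
    `links[i]` is the position in `store` that input block i points to.
    """
    store: list[bytes] = []
    index: dict[bytes, int] = {}   # digest -> position of the representative
    links: list[int] = []
    for block in blocks:
        digest = hashlib.sha256(block).digest()   # detect block duplications
        if digest not in index:
            index[digest] = len(store)
            store.append(block)                   # first occurrence is kept
        links.append(index[digest])               # duplicates become links only
    return store, links


def restore(store: list[bytes], links: list[int]) -> bytes:
    """Rebuild the original byte sequence by following the links."""
    return b"".join(store[i] for i in links)


blocks = [b"aaa", b"bbb", b"aaa", b"ccc", b"bbb"]
store, links = deduplicate(blocks)
print(store)   # [b'aaa', b'bbb', b'ccc']
print(links)   # [0, 1, 0, 2, 1]
```

The lookup in `index` is performed for every block of every target file, so the amount of work grows with the number of files, which is the cost the similar file extraction method tries to limit.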
By not performing deduplication on files with low similarity, it is possible to reduce the time required for deduplication without reducing the effect of deduplication too much. However, in this method, parts that are actually duplicated and could be deduplicated may not become targets for deduplication. Thus, the effect of deduplication may decrease. Through experiments, the reduction in the time needed for deduplication and the decrease in the efficiency of deduplication are evaluated together. In the similar file extraction method, similar document files (doc, docx) were extracted using morphological analysis and cosine similarity.

4.1 Morphological analysis

Morphological analysis is a method of separating sentences into meaningful words and identifying their part of speech or content. In English, it is easy to divide a sentence into morphemes because sentences are written with the words separated, as in "I love you." On the other hand, the Japanese sentence with the same meaning is harder to separate into morphemes, and word matching against a dictionary is required to perform morphological analysis.

4.2 Cosine similarity

Cosine similarity is a method of determining the similarity between two sentences by creating, for each sentence, a vector of the frequencies with which morphemes appear. The normalized inner product of the two vectors is then taken as the similarity between the two sentences. For example, for sentence A "I live in a big house in Tokyo." and sentence B "I stay in a big hotel in Boston.", all morphemes that appear are listed (in this example: {I, live, in, a, big, house, Tokyo, stay, hotel, Boston}) and a vector of the frequency of each morpheme in each sentence is created (in this example: V_A = {1, 1, 2, 1, 1, 1, 1, 0, 0, 0} and V_B = {1, 0, 2, 1, 1, 0, 0, 1, 1, 1}). The normalized inner product of V_A and V_B is, in this example,

\[
\cos(V_A, V_B) = \frac{V_A \cdot V_B}{|V_A|\,|V_B|}
= \frac{1\cdot1 + 1\cdot0 + 2\cdot2 + 1\cdot1 + 1\cdot1 + 1\cdot0 + 1\cdot0 + 0\cdot1 + 0\cdot1 + 0\cdot1}{\sqrt{10}\,\sqrt{10}}
= \frac{7}{10} = 0.7,
\]

so the cosine similarity between sentences A and B is 0.7.
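The worked example for sentences A and B can be reproduced with a short Python sketch. Splitting on alphabetic words is an assumption that stands in for real morphological analysis, which for Japanese text would need a dictionary-based analyzer; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def morphemes(sentence: str) -> list[str]:
    """Stand-in for morphological analysis: split an English sentence into words."""
    return re.findall(r"[A-Za-z]+", sentence)


def cosine_similarity(a: str, b: str) -> float:
    """Normalized inner product of the morpheme frequency vectors of a and b."""
    freq_a = Counter(morphemes(a))
    freq_b = Counter(morphemes(b))
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


sentence_a = "I live in a big house in Tokyo."
sentence_b = "I stay in a big hotel in Boston."
print(round(cosine_similarity(sentence_a, sentence_b), 3))  # 0.7
```

Here the inner product is 7 and both vectors have norm sqrt(10), which gives the value 0.7 derived above up to floating-point rounding.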

Cosine similarity takes a value between 0 and 1. The closer the value is to 1, the more similar the two sentences are. However, even when the sentences being compared are similar in appearance, their meanings and contents are not necessarily similar. Therefore, a cosine similarity close to 1 does not guarantee that the two documents can be deduplicated; it only shows that there is a high possibility of deduplication.

5 Experimental results

5.1 Target files for deduplication

The effect of the similar file extraction method was examined through experiments. Using the variable-length block method, the deduplication rate was measured for document files (doc or docx extension) with the similar file extraction method applied. The deduplication rate for the similar file extraction method was obtained from the following formula. The higher the deduplication rate, the higher the effect of deduplication.

\[
\text{Deduplication Rate} = \frac{\text{data size deleted by deduplication}}{\text{similar file size} + \text{non-similar file size}} \times 100\ (\%)
\]

The denominator of the general deduplication rate is the "file size targeted for deduplication", which in this situation would be the "similar file size" alone. For the similar file extraction method, however, non-similar files should also be considered targets of deduplication and are therefore included in the denominator of the deduplication rate.

Table 1 Target files for deduplication
Extension   Total data size (Byte)   Number of files
doc         35,894,

Fig. 4 Similarities of target files

Fig. 5 Categorize target files

For the similar file extraction method, each file was converted to txt, and morphological analysis was conducted to break the contents down to the word level.
Then, similarities were determined using cosine similarity, and similar files were extracted in three stages: similarity of 0.3 or more, 0.4 or more, and 0.5 or more. The processing time and the deduplication rate were compared between two methods: deduplicating only the similar files, and deduplicating all files without performing the similar file search. The processing time does not include the time required for the similar file search. Table 1 shows the total data size and number of files to be deduplicated. The similarity of the 20 target files for deduplication is shown in Fig. 4, and Fig. 5 shows the files arranged in order of maximum similarity, together with the files extracted at a maximum similarity of 0.3 or more, 0.4 or more, and 0.5 or more.

5.2 Experimental results

Table 2 shows the result of deduplication using the similar file extraction method.

Table 2 Results of deduplication
Attribute        Similarity range  Number of files  File size (Byte)  Exec. time (sec.)  Number of blocks  Reduce size (Byte)  Dedup. rate (%)
All files                          20               35,894,                              ,138              1,873,
Extracted files  0.3 or more       17               33,664,                              ,604              1,873,
Extracted files  0.4 or more       16               32,934,                              ,429              1,873,
Extracted files  0.5 or more       6                7,856,                               ,514              1,549,

As a result of extracting similar files, the total number of blocks and the execution time required for deduplication decrease in proportion to the reduced file size. This is because deduplication was performed only on similar files, which reduced the time spent searching for unnecessary block matches and creating links. On the other hand, with similarity of 0.3 or more and 0.4 or more, exactly the same blocks are deduplicated as when all files are deduplicated. Files that are judged not similar by the similarity determination do not become targets for deduplication, but in these cases this does not reduce the effect of deduplication.

Furthermore, when the similarity is 0.5 or more, the files extracted as similar are drastically reduced to 22% (about 1/4.6) of all files. The deduplication processing time was reduced to 30% (about 1/3.3), but the deduplication rate was only reduced from 5.22% to 4.32%, that is, about 8% (about 1/1.2). This shows that files can be extracted effectively for deduplication. As described above, it was confirmed that the similar file extraction method is an effective method that reduces the time required for deduplication without significantly reducing the effect of deduplication.

6 Conclusion

In this research, we proposed a similar file extraction method that performs deduplication only on files with high similarity, based on a preliminary similarity determination. For text files such as doc and docx, similarity determination is performed using morphological analysis and cosine similarity. Then, deduplication is performed only on files with high similarity. This method shortens the time required for deduplication without reducing the effect of deduplication too much.
Experiments confirmed that if deduplication is performed on files with similarity of 0.5 or more, the reduction of the deduplication rate is suppressed to about 8% and the processing time can be shortened to 1/3.3 or less. In other words, if deduplication is narrowed down to files with high similarity using the similar file extraction method, the processing time can be shortened without decreasing the effect of deduplication too much. In the future, we will expand the scope of application so that the proposed similar file extraction method can be applied not only to text files but also to other types of files.

Fig. 6 Results of deduplication rate and execution time

7 References

[1] Q. He, Z. Li, X. Zhang, Data deduplication techniques, Future Information Technology and Management Engineering (FITME) 2010, vol.1, pp , Oct
[2] C. Constantinescu, J. Glider, D. Chambliss, Mixing Deduplication and Compression on Active Data Sets, Data Compression Conference (DCC) 2011, pp , March 2011
[3] A.N. Yasa, P.C. Nagesh, Space savings and design considerations in variable length deduplication, ACM SIGOPS Operating Systems Review, Vol.46 Issue 3, pp.57-64, Dec
[4] M. Noorafiza, I. Koike, H. Yamasaki, A. Rizalhasrin, T. Kinoshita, Block Length Optimization in Data Deduplication Technique, Proceedings of the 10th International Conference on Scientific Computing (CSC2013), pp , July 2013
[5] H. Yamasaki, I. Koike, T. Kinoshita, Analysis of double layered deduplication efficiency, IPSJ SIGMPS Technical Report, Vol.2014-MPS-97 No.9, March 2014 (in Japanese)
[6] M. Ogiwara, M. Takaya, T. Kasuya, I. Koike, T. Kinoshita, Singularity Size Optimization in Data Deduplication Technique, Proceedings of the 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA2014), pp , July 2014
[7] M. Noorafiza, M. Hirose, M. Takaya, I. Koike, T. Kinoshita, Optimum Singularity Size in Data Deduplication Technique, Proceedings of the 2015 International Conference on Scientific Computing (CSC2015), pp , July 2015
[8] N. Takeuchi, M. Hirose, M. Noorafiza, S. Takano, I. Koike, T. Kinoshita, Data Deduplication for Firmware Files, Proceedings of the 2016 International Conference on Scientific Computing (CSC2016), pp.14-19, July 2016
[9] MZ Nurshafiqah, H. Yoshii, F. Enomoto, I. Koike, T. Kinoshita, Data Deduplication for Audio Data Files, Proceedings of the 32nd International Conference on Computers and Their Applications (CATA2017), pp.17-21, April 2017
