International Journal of Computer Engineering and Applications, Volume XII, Special Issue, March 18, ISSN
SECURE DATA DEDUPLICATION FOR CLOUD STORAGE: A SURVEY

Vidya Kurtadikar 1, Chaitanya Atre 2, Dhanraj Gade 3, Ritwik Jadhav 4, Rohan Gandhi 5
Department of Information Technology, MITCOE, Pune, India.

ABSTRACT: Nowadays, there is a drastic increase in the demand for data storage, and with it the concept of cloud computing is on the rise. This enormous amount of data can be backed up on cloud storage, but doing so significantly increases the cost of storage and affects performance. Traditional storage of data introduces redundancies, and the concept of data deduplication was developed to address them. Data deduplication is an effective solution for eliminating redundancies: it uses hash values and index tables to detect and remove duplicate data. With the data deduplication process, an effective performance increase and a reduction in storage cost can be observed. In this paper, we discuss different data deduplication methods along with their advantages and disadvantages. We also propose an enhanced security method for the generated data chunks using a standard encryption algorithm.

Keywords: Data Deduplication, Cloud Storage, Hashing, Chunking, Redundant Data.

[1] INTRODUCTION

Nowadays, due to advancements in technology, the amount of digital data generated by applications is increasing at a fast rate. As storage systems have limited capacity, storing such a huge amount of data has posed a challenge. Recent International Data Corporation (IDC) studies indicate that in the past five years the volume of data has increased by almost nine times, to 7 ZB per year, and even more explosive growth is expected in the next ten years [2]. Such massive growth in storage is controlled by the technique of data deduplication [1].
This process identifies duplicate contents at the chunk level using hash values and deduplicates them.
Data deduplication can be implemented at the file level and at the chunk level. Chunk-level deduplication is generally preferred over file-level deduplication: in chunk-level deduplication, the fingerprints of individual chunks are compared and the redundant ones are deduplicated, whereas in file-level deduplication the whole file is compared with other files by checking its metadata. Chunking plays a significant role in determining the efficiency of a data deduplication algorithm, and the performance of the algorithm can be computed by analysing the size and number of chunks [1].

As stated in [1], data deduplication saves considerable storage space and money by optimizing storage space and bandwidth costs. A greener environment can also be obtained, as less space is required to house the data in primary and remote storage. Since less storage is maintained, a faster return on investment can be obtained. The process also helps in conserving network bandwidth and improving network efficiency.

[2] DIFFERENT APPROACHES OF DATA DEDUPLICATION

Data deduplication - often called intelligent compression or single-instance storage - is a process that eliminates redundant copies of data and reduces storage overhead. Data deduplication techniques guarantee that only a single unique instance of data gets stored in the backup system. In the process of data deduplication, we divide a file or block of data into multiple chunks and calculate a hash value for each chunk using a hash function such as SHA-1 or MD5 [3]. Using these hash values, we can compare a stored chunk with an incoming data chunk; if a match is found, we can conclude that an identical data chunk already exists in the storage system, and the duplicate chunk is replaced with a reference to the existing one.

Figure: 1. Diagram Illustrating the Data Deduplication Process

[2.1] STEPS OF DATA DEDUPLICATION

1. Creation of Chunks: The file is divided into chunks using one of the chunking methods - fixed-length chunking or variable-length chunking.
2. Hash Value Computation: Depending on the chunks formed, the hash values of these chunks are computed using one of the available hashing algorithms, such as SHA-1 or MD5.

3. Deduplication Process: The hash values of the data chunks, which can be called fingerprints, are stored in an index table. Duplicate data can be detected by making use of this index table: if a hash value matches an existing entry, the duplicate data is replaced with a reference to the original data chunk [1].

[2.2] HASH ALGORITHMS

Hash algorithms are central to the data deduplication process. The hash values computed for data chunks, generally called fingerprints, are used to eliminate redundancies in the data. The most commonly used hash algorithm in deduplication is SHA-1.

[2.3] SHA-1

SHA-1 is a cryptographic hash algorithm used in the deduplication process for computing the fingerprints of data chunks. SHA-1 is closely modelled after MD5. It produces a message digest of 160 bits by processing the input data in blocks of 512 bits each [4]. The 160-bit value computed for each data chunk is treated as unique and is used for eliminating redundancies in the data.

Figure: 2. Classification Tree of Data Deduplication

[2.4] SOURCE BASED DEDUPLICATION

Source deduplication is the process of eliminating redundancies from data before transferring that data to the target server. It provides numerous advantages, such as reduced bandwidth and storage usage, but it can be slower than target-based deduplication when large amounts of data are involved. Source deduplication works from the client side, in coordination with the server, to compare new data chunks with previously stored data chunks.
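The chunk-creation, fingerprinting, and index-table lookup described in sections [2.1]-[2.3] can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions (fixed-length chunks, an in-memory dict as both the index table and the chunk store), not the implementation of any system surveyed here:

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB fixed-length chunks (illustrative choice)

store = {}         # fingerprint -> chunk bytes (unique chunks only)
file_recipes = {}  # file name -> ordered list of fingerprints

def dedup_store(name, data):
    """Split data into chunks, fingerprint each with SHA-1, and store
    only chunks whose fingerprint is not already in the index."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()  # 160-bit fingerprint
        if fp not in store:                   # index-table lookup
            store[fp] = chunk                 # new unique chunk
        recipe.append(fp)                     # duplicate keeps only a reference
    file_recipes[name] = recipe

def restore(name):
    """Rebuild a file from its recipe of fingerprint references."""
    return b"".join(store[fp] for fp in file_recipes[name])

a = b"A" * 10000 + b"B" * 10000
dedup_store("f1.bin", a)
dedup_store("f2.bin", a)   # identical file: no new chunks stored
print(len(store))          # -> 3 unique chunks stored for the two files
```

Storing the second, identical file adds no new chunks to the store; only its recipe of fingerprint references is recorded, which is exactly the storage saving deduplication aims for.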
[2.5] TARGET BASED DEDUPLICATION

Target deduplication is the removal of redundancies from the data stream as it passes through an appliance or application placed between the source and the target storage server. Target deduplication reduces the amount of storage required but, unlike source deduplication, it does not reduce the amount of data that must be sent across the LAN or WAN during storage.

[2.6] INLINE DATA DEDUPLICATION

Inline deduplication eliminates data redundancies as the data enters the system. This process cuts down the bulk of the data and makes the system efficient. The benefit of inline deduplication is that the hash-value calculation and the search for redundant data are done before the data is actually written to the system.

[2.7] POST PROCESS DEDUPLICATION

Post-process deduplication, or asynchronous deduplication, is the analysis and removal of redundant data after the data has been written to the storage system. It offers advantages such as an efficient lookup process, since the hash-value calculation and the search for redundant data are performed after the data files are already stored on the storage system.

[2.8] FILE BASED & SUB-FILE BASED DEDUPLICATION

File-based deduplication works by calculating a single checksum of the complete file and comparing it with another file's checksum. It is simple and fast, but its deduplication efficiency is low, as it does not handle duplicate content found inside different files. Sub-file deduplication, in contrast, breaks the file into smaller fixed- or variable-sized data chunks and then uses a standard hash-based algorithm to find similar blocks.

[2.9] FIXED LENGTH AND VARIABLE LENGTH DEDUPLICATION

Fixed-length chunking splits files into equally sized chunks. The chunk boundaries are based on fixed offsets such as 4, 8, or 16 KB. It uses a simple checksum-based approach to find duplicates. The process is highly constrained and offers limited advantages.

Figure: 3. Fixed v/s Variable Length Chunks
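The limitation of fixed offsets can be made concrete with a short sketch (our illustration, not taken from the paper): inserting a single byte at the front of a file shifts every fixed-length boundary, so essentially no chunk fingerprints survive the edit:

```python
import hashlib
import random

def fixed_chunks(data, size=4096):
    """Split data at fixed offsets (0, size, 2*size, ...)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprints(chunks):
    return {hashlib.sha1(c).hexdigest() for c in chunks}

random.seed(1)
original = bytes(random.randrange(256) for _ in range(16384))  # 16 KB sample
edited = b"X" + original                                       # one byte inserted

before = fingerprints(fixed_chunks(original))
after = fingerprints(fixed_chunks(edited))
print(f"{len(before & after)} of {len(before)} fingerprints survive")  # 0 of 4
```

Every boundary after the insertion point moves by one byte, so all four original fingerprints change; a content-defined scheme, shown in the next section, avoids this.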
Variable-length chunking breaks files into chunks of varying sizes, placing boundaries based on the content of the file rather than on fixed offsets [5]. This method is used as an alternative to fixed-length chunking. Because chunk boundaries are derived from the content, it is not necessary to re-chunk the entire file whenever any data gets updated. With data broken down based on content, the efficiency of the deduplication process increases, as it is easy to recognize and eliminate redundant data chunks [5].

[3] VARIABLE LENGTH CHUNKING MECHANISMS

[3.1] RABIN KARP FINGERPRINT ALGORITHM

Rabin-Karp is a variable-length chunking algorithm that uses hash functions and the rolling-hash technique. A rolling hash (also known as recursive hashing or a rolling checksum) is a hash function whose input is hashed in a window that moves through the input. A rolling hash allows an algorithm to calculate a new hash value without rehashing the entire string: for example, when searching for a word in a text, as the algorithm shifts one letter to the right, it derives the new hash from the old hash with a constant-time update. The idea behind the Rabin-Karp chunking algorithm is to define a data window of finite, predefined size, say N, on the data stream and calculate a rolling hash over the window. If the hash matches a predefined "fingerprint" pattern, the Nth element is marked as a chunk boundary; if it does not, the window slides by one element and the rolling hash is recalculated. This repeats until a match is found [10].
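A minimal content-defined chunker in this spirit might look as follows. It is a sketch under our own assumptions: a simple polynomial rolling hash stands in for Rabin's irreducible-polynomial fingerprint, and minimum/maximum chunk-size guards keep chunk sizes bounded:

```python
import hashlib
import random

WINDOW = 16                      # rolling-hash window, in bytes
MASK = (1 << 11) - 1             # boundary when hash & MASK == MASK (~2 KB average)
MIN_SIZE, MAX_SIZE = 512, 8192   # chunk-size guards

def rolling_chunks(data):
    """Content-defined chunking with a simple polynomial rolling hash
    (a sketch of the idea, not Rabin's exact fingerprint function)."""
    chunks, start, h = [], 0, 0
    p = pow(31, WINDOW - 1, 1 << 32)   # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * p) & 0xFFFFFFFF  # drop outgoing byte
        h = (h * 31 + b) & 0xFFFFFFFF                    # shift in the new byte
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == MASK) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])             # mark chunk boundary
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
data = bytes(random.randrange(256) for _ in range(50000))
edited = b"X" + data   # one byte inserted at the front

fa = {hashlib.sha1(c).hexdigest() for c in rolling_chunks(data)}
fb = {hashlib.sha1(c).hexdigest() for c in rolling_chunks(edited)}
print(len(fa & fb), "of", len(fa), "chunk fingerprints unchanged")
```

Because each boundary depends only on the bytes inside the window, the one-byte insertion changes the first chunk, but most later boundaries still fall at the same content positions, so most fingerprints survive; contrast this with the fixed-offset sketch above, where none did.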
[3.2] TWO THRESHOLD TWO DIVISORS ALGORITHM

The TTTD algorithm, proposed by HP Laboratories, Palo Alto, uses four parameters: the maximum threshold, the minimum threshold, the main divisor, and the second divisor. The maximum threshold eliminates very large chunks, while the minimum threshold eliminates very small ones; together, these two parameters control the variation in chunk sizes. The main divisor is used to keep the chunk size close to the expected chunk size, and the second divisor finds a breakpoint when the main divisor cannot. Breakpoints found by the second divisor are large and close to the maximum threshold [6].

[4] PERFORMANCE ANALYSIS

When comparing file-level chunking, fixed-length chunking, and variable-length chunking, it is observed that variable-length chunking provides the best results. The table below reports the deduplicated/undeduplicated size (in %) at different chunk sizes (8, 16, 32, 64 KB) for three chunking approaches: file level, fixed length, and variable length (Rabin-Karp) [5].
Table: 1. Performance of Different Length Deduplication Processes

               8 KB    16 KB   32 KB   64 KB
File Level      -       -       -       -
Fixed Length    -       -       -       -
Rabin Karp      -       -       -       -

In fixed-length chunking, each chunk has a particular fixed length. A problem arises when the file needs to be updated: inserting a line of text shifts all subsequent chunks, so no chunk following the insertion point is preserved. Hence, simple small updates radically change the chunks and deduplication becomes an issue. In variable-length chunking, by contrast, chunks are content-based and not of uniform length. When an update takes place, only the chunk where the new line of text is inserted or deleted changes; the remaining chunks are unaltered, and this is the biggest benefit of variable-length chunking. Hence, the proposed method uses a variable-length chunking algorithm.

Table: 2. Comparison of Various Deduplication Systems

Metrics               File Level   Fixed Size   Variable Size
Deduplication Ratio   Low          Better       Good
Processing Time       Medium       Less         High

Table 2 compares the deduplication approaches on different performance metrics. The deduplication ratio indicates how many redundancies are removed, and variable-sized deduplication performs much better than the others. In terms of processing time, variable-sized deduplication is the worst, owing to expensive variable-length chunking [9]: as chunks are of variable sizes, the time taken to process them is high.

[5] ENHANCED METHOD

As explained earlier, data deduplication is a technique which stores only unique data chunks and removes the redundant ones. Owing to this efficient storage process, deduplication is preferred over traditional storage systems. This paper focuses on presenting the studied data deduplication types.
We have enhanced the existing method for data deduplication, which works by eliminating redundant data chunks as they are stored in the storage or backup. Figure 4 shows the flow diagram of the method.
Figure: 4. Flowchart of the Deduplication Method

In this method, the data (a text file) which the client intends to upload to the storage is divided into chunks using a variable-length chunking algorithm, preferably Rabin-Karp fingerprinting. A hash value is then computed for each created chunk using the SHA-1 algorithm. The reason for selecting SHA-1 over a hashing algorithm such as MD5 is that SHA-1 is efficient and more secure, since its message digest is 160 bits. The unique chunks whose hash values have been computed are stored in the storage, and the hash values are stored in the index tables. When a new file is ready to upload, the hash values of its data chunks are compared with the values stored in the index tables. If a match is found, a reference to the original data chunk is stored and the index-table entry is updated; if no match is found, the data chunk is stored directly.

To enhance security, each data chunk is encrypted before it is stored in the backup. The need for security arises chiefly when the deduplication system is implemented in the cloud: security issues can be observed when information is processed on a cloud platform, because the user who uploads a file has no control over where the file is stored. There is therefore a possibility that the cloud service provider or a third-party application can handle and access the data, and hence the need for encryption. Before storing the data, the text is encrypted and then stored in the backup or target storage. The algorithm used for encryption is the Advanced Encryption Standard (AES). Before downloading the file, all the data chunks are decrypted with the provided key, merged into a single file, and then downloaded.
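The upload and download paths of this flow can be sketched as follows. This is our illustrative reconstruction, not the paper's implementation: fixed-size chunking stands in for Rabin-Karp, and because AES is not available in the Python standard library, a SHA-256 keystream XOR (clearly insecure) stands in for AES purely to keep the sketch self-contained; a real deployment would encrypt each chunk with AES via a crypto library. The names (`upload`, `download`, `keystream_xor`) are hypothetical.

```python
import hashlib

KEY = b"demo-key"  # illustrative key; real deployments need proper key management

def keystream_xor(chunk, key, nonce):
    """Stand-in for AES: XOR with a SHA-256-derived keystream.
    NOT secure - it only marks where AES encryption/decryption happens."""
    out = bytearray()
    counter = 0
    while len(out) < len(chunk):
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(chunk, out))

store, index, recipes = {}, set(), {}

def upload(name, data, chunk_size=4096):
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()   # fingerprint of the plaintext chunk
        if fp not in index:                    # index-table lookup
            index.add(fp)
            # encrypt the unique chunk before it reaches the backup storage
            store[fp] = keystream_xor(chunk, KEY, fp.encode())
        recipe.append(fp)                      # duplicate keeps only a reference
    recipes[name] = recipe

def download(name):
    # decrypt every chunk with the key, then merge into a single file
    return b"".join(keystream_xor(store[fp], KEY, fp.encode())
                    for fp in recipes[name])

data = b"hello world " * 1000
upload("report.txt", data)
assert download("report.txt") == data
```

Fingerprints are computed on the plaintext chunks so that duplicates are still detected across uploads, while the stored bytes themselves are encrypted, matching the order of operations in Figure 4.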
[6] CONCLUSION

This paper presents a study of the data deduplication process for storage systems, together with an enhanced security method using standard encryption. Data deduplication methods are used to achieve cost-effective storage and effective use of network bandwidth, while encryption protects the data from unauthorized access. The central idea lies in removing the redundancies present in the data. It is one of the emerging concepts currently being implemented by cloud providers.
REFERENCES

[1] Subhanshi Singhal and Naresh Kumar, "A Survey on Data Deduplication," International Journal on Recent and Innovation Trends in Computing and Communication, May.
[2] Bo Mao, Hong Jiang, Suzhen Wu, Lei Tian, "Leveraging Data Deduplication to Improve Performance of Primary Storage Systems in the Cloud," IEEE Transactions on Computers, Vol. 65, No. 6, June.
[3] Golthi Tharunn, Gowtham Kommineni, Sarpella Sasank Varma, Akash Singh Verma, "Data Deduplication in Cloud Storage," International Journal of Advanced Engineering and Global Technology, Vol. 03, Issue 08, August.
[4] Chaitya B. Shah, Drashti R. Panchal, "Secured Hash Algorithm-1: Review Paper," International Journal for Advanced Research in Engineering and Technology, Vol. 02, Oct.
[5] A. Venish and K. Shiva Shankar, "Study of Chunking Algorithm in Data Deduplication."
[6] BingChun Chang, "A Running Time Improvement for Two Threshold Two Divisors Algorithm," MS thesis, SJSU ScholarWorks.
[7] J. Malhotra and J. Bakal, "A Survey and Comparative Study of Data Deduplication Techniques," International Conference on Pervasive Computing, Pune.
[8] Zuhair S. Al-Sagar, Mohammed S. Saleh, Aws Zuhair Sameen, "Optimizing Cloud Storage by Data Deduplication: A Study," International Research Journal of Engineering and Technology, Vol. 02, Issue 09, Dec.
[9] Daehee Kim, Sejun Song, Baek-Young Choi, "Data Deduplication for Data Optimization for Storage and Network Systems."
[10]
More informationA multilingual reference based on cloud pattern
A multilingual reference based on cloud pattern G.Rama Rao Department of Computer science and Engineering, Christu Jyothi Institute of Technology and Science, Jangaon Abstract- With the explosive growth
More informationDeduplication: The hidden truth and what it may be costing you
Deduplication: The hidden truth and what it may be costing you Not all deduplication technologies are created equal. See why choosing the right one can save storage space by up to a factor of 10. By Adrian
More informationIMAGE COMPRESSION USING HYBRID TRANSFORM TECHNIQUE
Volume 4, No. 1, January 2013 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMAGE COMPRESSION USING HYBRID TRANSFORM TECHNIQUE Nikita Bansal *1, Sanjay
More informationINTELLIGENT SUPERMARKET USING APRIORI
INTELLIGENT SUPERMARKET USING APRIORI Kasturi Medhekar 1, Arpita Mishra 2, Needhi Kore 3, Nilesh Dave 4 1,2,3,4Student, 3 rd year Diploma, Computer Engineering Department, Thakur Polytechnic, Mumbai, Maharashtra,
More informationA Review on various Location Management and Update Mechanisms in Mobile Communication
International Journal of Innovation and Scientific Research ISSN 2351-8014 Vol. 2 No. 2 Jun. 2014, pp. 268-274 2014 Innovative Space of Scientific Research Journals http://www.ijisr.issr-journals.org/
More informationDesign Tradeoffs for Data Deduplication Performance in Backup Workloads
Design Tradeoffs for Data Deduplication Performance in Backup Workloads Min Fu,DanFeng,YuHua,XubinHe, Zuoning Chen *, Wen Xia,YuchengZhang,YujuanTan Huazhong University of Science and Technology Virginia
More informationLOAD BALANCING AND DEDUPLICATION
LOAD BALANCING AND DEDUPLICATION Mr.Chinmay Chikode Mr.Mehadi Badri Mr.Mohit Sarai Ms.Kshitija Ubhe ABSTRACT Load Balancing is a method of distributing workload across multiple computing resources such
More informationANALYSIS OF AES ENCRYPTION WITH ECC
ANALYSIS OF AES ENCRYPTION WITH ECC Samiksha Sharma Department of Computer Science & Engineering, DAV Institute of Engineering and Technology, Jalandhar, Punjab, India Vinay Chopra Department of Computer
More informationCLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,
More informationScale-out Object Store for PB/hr Backups and Long Term Archive April 24, 2014
Scale-out Object Store for PB/hr Backups and Long Term Archive April 24, 2014 Gideon Senderov Director, Advanced Storage Products NEC Corporation of America Long-Term Data in the Data Center (EB) 140 120
More informationFog Computing. ICTN6875: Emerging Technology. Billy Short 7/20/2016
Fog Computing ICTN6875: Emerging Technology Billy Short 7/20/2016 Abstract During my studies here at East Carolina University, I have studied and read about many different t types of emerging technologies.
More informationData Deduplication Overview and Implementation
Data Deduplication Overview and Implementation Somefun Olawale Mufutau 1, Nwala Kenneth 2, Okonji Charles 3, Omotosho Olawale Jacob 4 1 Computer Science Department Babcock University, Ilisan Remo Ogun
More informationCLIENT DATA NODE NAME NODE
Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency
More informationHow to Reduce Data Capacity in Objectbased Storage: Dedup and More
How to Reduce Data Capacity in Objectbased Storage: Dedup and More Dong In Shin G-Cube, Inc. http://g-cube.kr Unstructured Data Explosion A big paradigm shift how to generate and consume data Transactional
More informationWide Area Networking Technologies
Wide Area Networking Technologies TABLE OF CONTENTS INTRODUCTION... 1 TASK 1... 1 1.1Critically evaluate different WAN technologies... 1 1.2 WAN traffic intensive services... 4 TASK 2... 5 1.3 and 1.4
More informationEnhance Data De-Duplication Performance With Multi-Thread Chunking Algorithm. December 9, Xinran Jiang, Jia Zhao, Jie Zheng
Enhance Data De-Duplication Performance With Multi-Thread Chunking Algorithm This paper is submitted in partial fulfillment of the requirements for Operating System Class (COEN 283) Santa Clara University
More informationInternational Journal of Scientific Research and Reviews
Research article Available online www.ijsrr.org ISSN: 2279 0543 International Journal of Scientific Research and Reviews Asymmetric Digital Signature Algorithm Based on Discrete Logarithm Concept with
More informationLamassu: Storage-Efficient Host-Side Encryption
Lamassu: Storage-Efficient Host-Side Encryption Peter Shah, Won So Advanced Technology Group 9 July, 2015 1 2015 NetApp, Inc. All rights reserved. Agenda 1) Overview 2) Security 3) Solution Architecture
More informationSHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers
2011 31st International Conference on Distributed Computing Systems Workshops SHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers Lei Xu, Jian Hu, Stephen Mkandawire and Hong
More informationHEAD HardwarE Accelerated Deduplication
HEAD HardwarE Accelerated Deduplication Final Report CS710 Computing Acceleration with FPGA December 9, 2016 Insu Jang Seikwon Kim Seonyoung Lee Executive Summary A-Z development of deduplication SW version
More informationINTRODUCTION TO XTREMIO METADATA-AWARE REPLICATION
Installing and Configuring the DM-MPIO WHITE PAPER INTRODUCTION TO XTREMIO METADATA-AWARE REPLICATION Abstract This white paper introduces XtremIO replication on X2 platforms. XtremIO replication leverages
More informationComputation of Multiple Node Disjoint Paths
Chapter 5 Computation of Multiple Node Disjoint Paths 5.1 Introduction In recent years, on demand routing protocols have attained more attention in mobile Ad Hoc networks as compared to other routing schemes
More informationComputer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack
Computer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack J.Anbu selvan 1, P.Bharat 2, S.Mathiyalagan 3 J.Anand 4 1, 2, 3, 4 PG Scholar, BIT, Sathyamangalam ABSTRACT:
More informationCopyright 2010 EMC Corporation. Do not Copy - All Rights Reserved.
1 Using patented high-speed inline deduplication technology, Data Domain systems identify redundant data as they are being stored, creating a storage foot print that is 10X 30X smaller on average than
More informationBackup management with D2D for HP OpenVMS
OpenVMS Technical Journal V19 Backup management with D2D for HP OpenVMS Table of contents Overview... 2 Introduction... 2 What is a D2D device?... 2 Traditional tape backup vs. D2D backup... 2 Advantages
More informationThe Design of an Anonymous and a Fair Novel E-cash System
International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 2, Number 2 (2012), pp. 103-109 International Research Publications House http://www. ripublication.com The Design of
More informationInternational Journal of Computer Engineering and Applications,
International Journal of Computer Engineering and Applications, Volume XII, Issue I, Jan. 18, www.ijcea.com ISSN 2321-3469 SECURING TEXT DATA BY HIDING IN AN IMAGE USING AES CRYPTOGRAPHY AND LSB STEGANOGRAPHY
More informationCode Compression for RISC Processors with Variable Length Instruction Encoding
Code Compression for RISC Processors with Variable Length Instruction Encoding S. S. Gupta, D. Das, S.K. Panda, R. Kumar and P. P. Chakrabarty Department of Computer Science & Engineering Indian Institute
More informationComparison of FP tree and Apriori Algorithm
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti
More informationVirtual Machine Placement in Cloud Computing
Indian Journal of Science and Technology, Vol 9(29), DOI: 10.17485/ijst/2016/v9i29/79768, August 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Virtual Machine Placement in Cloud Computing Arunkumar
More informationAn Analysis of Most Effective Virtual Machine Image Encryption Technique for Cloud Security
An Analysis of Most Effective Virtual Machine Image Encryption Technique for Cloud Security Mr. RakeshNag Dasari Research Scholar, Department of computer science & Engineering, KL University, Green Fields,
More informationSURVEY ON SMART ANALYSIS OF CCTV SURVEILLANCE
International Journal of Computer Engineering and Applications, Volume XI, Special Issue, May 17, www.ijcea.com ISSN 2321-3469 SURVEY ON SMART ANALYSIS OF CCTV SURVEILLANCE Nikita Chavan 1,Mehzabin Shaikh
More informationThe Power of Prediction: Cloud Bandwidth and Cost Reduction
The Power of Prediction: Cloud Bandwidth and Cost Reduction Eyal Zohar Israel Cidon Technion Osnat(Ossi) Mokryn Tel-Aviv College Traffic Redundancy Elimination (TRE) Traffic redundancy stems from downloading
More informationDeduplication File System & Course Review
Deduplication File System & Course Review Kai Li 12/13/13 Topics u Deduplication File System u Review 12/13/13 2 Storage Tiers of A Tradi/onal Data Center $$$$ Mirrored storage $$$ Dedicated Fibre Clients
More informationVIDEO CLONE DETECTOR USING HADOOP
VIDEO CLONE DETECTOR USING HADOOP Abstract Ms. Nikita Bhoir 1 Ms. Akshata Kolekar 2 Internet has become the most important part in the people s day-to-day life. It is widely used due to abundant resources.
More informationISSN Vol.08,Issue.16, October-2016, Pages:
ISSN 2348 2370 Vol.08,Issue.16, October-2016, Pages:3146-3152 www.ijatir.org Public Integrity Auditing for Shared Dynamic Cloud Data with Group User Revocation VEDIRE AJAYANI 1, K. TULASI 2, DR P. SUNITHA
More informationReducing The De-linearization of Data Placement to Improve Deduplication Performance
Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University
More informationPacket Classification Using Standard Access Control List
Packet Classification Using Standard Access Control List S.Mythrei 1, R.Dharmaraj 2 PG Student, Dept of CSE, Sri Vidya College of engineering and technology, Virudhunagar, Tamilnadu, India 1 Research Scholar,
More informationBenchmarking results of SMIP project software components
Benchmarking results of SMIP project software components NAILabs September 15, 23 1 Introduction As packets are processed by high-speed security gateways and firewall devices, it is critical that system
More informationStrategies for Single Instance Storage. Michael Fahey Hitachi Data Systems
Strategies for Single Instance Storage Michael Fahey Hitachi Data Systems Abstract Single Instance Strategies for Storage Single Instance Storage has become a very popular topic in the industry because
More informationEfficient Load Balancing and Disk Failure Avoidance Approach Using Restful Web Services
Efficient Load Balancing and Disk Failure Avoidance Approach Using Restful Web Services Neha Shiraz, Dr. Parikshit N. Mahalle Persuing M.E, Department of Computer Engineering, Smt. Kashibai Navale College
More informationA Novel Spatial Domain Invisible Watermarking Technique Using Canny Edge Detector
A Novel Spatial Domain Invisible Watermarking Technique Using Canny Edge Detector 1 Vardaini M.Tech, Department of Rayat Institute of Engineering and information Technology, Railmajra 2 Anudeep Goraya
More informationHash-Based String Matching Algorithm For Network Intrusion Prevention systems (NIPS)
Hash-Based String Matching Algorithm For Network Intrusion Prevention systems (NIPS) VINOD. O & B. M. SAGAR ISE Department, R.V.College of Engineering, Bangalore-560059, INDIA Email Id :vinod.goutham@gmail.com,sagar.bm@gmail.com
More informationParametric Search using In-memory Auxiliary Index
Parametric Search using In-memory Auxiliary Index Nishant Verman and Jaideep Ravela Stanford University, Stanford, CA {nishant, ravela}@stanford.edu Abstract In this paper we analyze the performance of
More informationCloud computing is an emerging IT paradigm that provides
JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO. 1, JANUARY 27 1 CoRE: Cooperative End-to-End Traffic Redundancy Elimination for Reducing Cloud Bandwidth Cost Lei Yu, Haiying Shen, Karan Sapra, Lin Ye and
More information