A Comparative Survey on Big Data Deduplication Techniques for Efficient Storage System


Supriya Milind More, Sardar Patel Institute of Technology
Kailas Devadkar, Sardar Patel Institute of Technology

ABSTRACT - Nowadays, owing to the exponential growth in the use of emerging technologies such as cloud computing and big data, the rate of data growth is increasing rapidly. Every second, millions of data items are generated by new technologies such as IoT devices and sensors, so storing and handling such large amounts of data is very challenging. Many enterprise organizations invest heavily in storing this big data for backup and disaster recovery. Traditional backup solutions, however, provide no facility for preventing the system from storing duplicate data, which increases storage cost and backup time and, in turn, decreases system performance. Data deduplication is the solution to this problem: it is an emerging technique that eliminates duplicate or redundant data and stores only a unique copy, thereby reducing storage utilization and the cost of maintaining redundant data. In this paper, we study related research papers from the literature and summarize different storage utilization techniques, the stages in data deduplication, and the categorization of deduplication techniques according to different criteria.

KEYWORDS: Big data, Cloud computing, Data deduplication, Storage optimization, Stages in deduplication.

INTRODUCTION

Nowadays, owing to the exponential growth in the use of emerging technologies such as cloud computing and big data, the data growth rate is increasing rapidly. Every second, millions of data items are generated by new technologies such as IoT devices and sensors, making it very challenging to store and handle such large amounts of data. Many enterprise organizations invest heavily in storing this big data for backup and disaster recovery, but traditional backup solutions provide no facility for preventing the system from storing duplicate data, which increases storage cost and backup time and in turn decreases system performance. Data deduplication solves this problem: it eliminates duplicate or redundant data and stores only a unique copy, reducing storage utilization and the cost of maintaining redundant data.

Today, not only enterprises but also ordinary users need their data to be kept safe, and for that reason they store it in multiple places, which results in high storage utilization. Another problem is disaster, whether from natural or man-made causes; everyone wants their sensitive data to remain safe and secure, and this long-term concern cannot be underestimated, because sensitive data must be preserved. Data deduplication is a solution to these problems: it eliminates redundant data and stores only unique data. Deduplication is a compression technique for using storage space efficiently and handling duplicate data effectively. It breaks a file into a number of pieces, called chunks, and applies a hash algorithm to each chunk to generate a unique identifier. This identifier is then compared with the identifiers already stored in an index table.
If a match is found after the comparison, the piece is considered duplicate data; it is simply discarded and only a reference pointer to the existing unique identifier is stored, so that the retrieval operation remains straightforward [5].
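The stages just described can be summarized in a minimal sketch. This is an illustrative example, not code from any surveyed system; the 4 KB fixed chunk size, the SHA-256 fingerprint, and the in-memory index table are assumptions chosen for simplicity.

import hashlib

CHUNK_SIZE = 4 * 1024  # assumed fixed chunk size; the surveyed systems vary

def deduplicate(path, index, chunk_store):
    """Chunk the file, fingerprint each chunk, look each fingerprint up in
    the index table, and store only previously unseen chunks."""
    recipe = []                                # pointers that rebuild the file
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in index:                # unique chunk: keep the data
                index.add(fp)
                chunk_store[fp] = chunk
            recipe.append(fp)                  # duplicate: store a pointer only
    return recipe

# Usage (paths are hypothetical): after backing up two identical files,
# the chunk store still holds each unique chunk exactly once.
# index, store = set(), {}
# deduplicate("backup1.img", index, store)
# deduplicate("backup1_copy.img", index, store)  # len(store) is unchanged

Retrieval simply follows the recipe: each fingerprint is looked up in the chunk store and the chunks are concatenated in order.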

In the rest of the paper, Section 2 describes different storage optimization techniques; Section 3 gives a brief overview of data deduplication, its stages, and a comparative study of current deduplication methods; Section 4 summarizes the different types of data deduplication; and Section 5 concludes the paper.

STORAGE OPTIMIZATION TECHNIQUES

Primary storage, also called tier 1 storage, is an expensive necessity in today's digital world, and it matters not only to enterprises but to home users as well. Suppliers offer different optimization techniques, such as thin provisioning, clones, snapshots, compression, and deduplication; which of these is best remains an open question for the IT sector. We briefly review the pros and cons of each (a short lossless-compression sketch follows this list).

1. Compression: Compression is one of the important storage optimization techniques. It stores data more efficiently, so that the maximum amount of data fits in limited storage, and it is also used for bandwidth optimization across the network. Compression removes binary-level redundancy from data blocks in order to save storage space. There are two kinds of compression: lossless and lossy. With lossless compression the quality of a compressed file is preserved, and decompression recovers the original file exactly; with lossy compression some data is removed permanently. Compression makes no attempt to detect duplicate data; it stores data regardless of duplication [7].

2. Thin provisioning: This technique allocates storage space effectively for saving data. It focuses on allocating disk space to multiple users more judiciously by considering the minimum storage requirement of each user. Thin provisioning works in a shared storage environment, allocating data blocks dynamically whenever more storage is needed or an out-of-storage condition arises; it is a purely on-demand process. By maintaining a pool of free space, it achieves a higher storage utilization ratio [1]. Traditional provisioning allocates extra storage capacity to each application, which no other application can then use, so physical storage is often wasted unnecessarily [7]. Thin provisioning removes this pre-allocated surplus and assigns exactly the amount of storage needed; when more space is required, it is added dynamically to the shared storage pool.

3. Snapshot: Snapshot technology is one of the popular data storage technologies today. Snapshots are read-only copies of data that are useful not only for data protection but also for replication [1]. Most vendors apply this technology at the operating-system level so that data can be accessed effectively at the application layer. A snapshot records the state of a storage device at a given point in time, so that state can be recovered after a failure. Snapshots can be implemented in different ways, and the method depends on the vendor and the environment: some vendors use read-only snapshots and some use writable ones. Copy-on-write, redirect-on-write, and split-mirror are some of the snapshot techniques. Copy-on-write takes a snapshot of the metadata of the original data, whereas redirect-on-write writes only the changed data instead of copying the complete original. Although snapshots provide data protection, performance can be an issue with this technology.

4. Clones: Clones and snapshots may look similar and can confuse vendors, but there is a difference between them. Cloning creates an exact copy of a virtual machine, whereas a snapshot creates a delta file that allows you to roll a virtual machine back. Snapshots and clones are similar in nature, but they have different attributes and modes of use [7]. A clone VM is an exact copy of the production VM, including its IP address, DNS name, and so on; a snapshot is a "quick revert" feature for when something goes wrong. In either case you still need conventional backups on separate storage.

All of the above storage optimization methods use different techniques to store large amounts of data efficiently in limited disk space, at low cost and with a small storage footprint. None of them, however, takes care of duplicate data: redundant data is stored as-is, which in turn requires more storage space. Data deduplication addresses exactly this gap.
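To make the lossless/lossy distinction in item 1 concrete, the following sketch compresses a redundant block and verifies that decompression restores it byte for byte. It is an illustrative example only; zlib and the sample payload are arbitrary choices, not tools used by the surveyed systems.

import zlib

data = b"storage optimization " * 1024      # highly redundant sample block

compressed = zlib.compress(data, level=9)   # lossless DEFLATE compression
restored = zlib.decompress(compressed)

assert restored == data                     # lossless: original recovered exactly
print(f"{len(data)} bytes -> {len(compressed)} bytes")

A lossy codec (for images or audio, say) would make the assertion fail by design, trading exact recovery for smaller output.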

DATA DEDUPLICATION

Data deduplication is an emerging technique that eliminates redundant data and stores only a unique copy of the data. It therefore yields better storage utilization and is an efficient way to handle similar data. For example, suppose a system holds 100 instances of a particular 1 MB file attachment. Backed up without deduplication, the system needs 100 MB of storage; with deduplication, only one instance is stored and every subsequent instance is replaced by a reference pointer to the saved original, reducing the demand for storage space from 100 MB to 1 MB.

There are different stages in deduplication [5]:
1. The chunking/blocking method divides a large data file into small pieces called chunks or blocks.
2. A hash algorithm is applied to generate a unique hash identifier for each data block.
3. When a new data block arrives for backup, its identifier is compared with the hash identifiers already stored.
4. If a match is found, a reference pointer is recorded and the duplicate data block is discarded; otherwise the new identifier and the data block are stored.

TYPES OF DATA DEDUPLICATION

Data deduplication techniques are categorized according to the following criteria:
A. Based on chunking method
B. Based on location of deduplication
C. Based on time of deduplication

A) Based on chunking method: The overall performance of data deduplication depends on one major key element, the blocking algorithm, and there are different methods of blocking. Depending on the chunking/blocking method, there are two types of deduplication:

1. File-level chunking: A file-level chunking algorithm does not divide the file into small blocks; it treats the whole file as a single chunk. Only one index value is generated per file, and this value is compared with the index values already stored. Because each file contributes a single index value, the index table has very few entries, which reduces storage overhead and allows more indexes to be kept in the table. File-level chunking fails, however, when there is even a slight change in the file data, because an index must be regenerated for the complete file rather than only for the changed data; this reduces the duplicate elimination ratio and the throughput of the system.

2. Block-level chunking: There are two types of block-level chunking algorithms:
a) Fixed-size chunking: A fixed-size chunking algorithm divides the data file into blocks or chunks of fixed size, with block boundaries at fixed offsets such as 4 KB, 8 KB, or 16 KB. This solves the problem of file-level chunking, since an index value is generated only for the changed part and not for the entire data file [5]. For large files, however, it creates a large number of small chunks, which requires more storage space for index values and metadata, and a byte-shifting problem can occur: inserting or deleting a single byte shifts every subsequent block boundary, so otherwise-identical data no longer matches.

b) Variable-size chunking: In this method the data file is divided into multiple small blocks of variable size, with the file broken according to the content of the data rather than any fixed size. This resolves the issue of fixed-size chunking: whereas fixed-size boundaries do not move even when the data changes, variable-size (content-defined) boundaries are determined by the data itself, so when a file is changed or part of it is deleted, only the boundaries near the change move and far fewer chunks need to be re-stored (see the sketch following this subsection).
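The following is a minimal sketch of content-defined (variable-size) chunking. The 48-byte rolling-hash window, the 12-bit boundary mask (roughly 4 KB average chunks), and the minimum/maximum chunk sizes are assumed parameters chosen for illustration; the surveyed papers do not prescribe them.

WINDOW = 48                     # sliding-window width in bytes (assumed)
MASK = (1 << 12) - 1            # boundary when the low 12 bits of the hash are zero
MIN_CHUNK, MAX_CHUNK = 1024, 64 * 1024  # guard rails on chunk size
BASE, MOD = 257, (1 << 61) - 1  # polynomial rolling-hash parameters
POW_OUT = pow(BASE, WINDOW - 1, MOD)    # weight of the byte leaving the window

def cdc_chunks(data: bytes):
    """Cut a chunk wherever a rolling hash over the last WINDOW bytes hits
    the mask, so boundaries depend on local content, not absolute offsets."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i - start < WINDOW:
            h = (h * BASE + b) % MOD                    # window still filling
        else:
            out = data[i - WINDOW]                      # byte sliding out
            h = ((h - out * POW_OUT) * BASE + b) % MOD  # byte sliding in
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0                         # next chunk starts fresh
    if start < len(data):
        chunks.append(data[start:])                     # trailing remainder
    return chunks

# Usage: insert one byte and observe that most chunks survive unchanged,
# whereas fixed-size chunking would shift every boundary after the edit.
import random
random.seed(0)
data = bytes(random.getrandbits(8) for _ in range(131072))
modified = data[:10000] + b"X" + data[10000:]
shared = set(cdc_chunks(data)) & set(cdc_chunks(modified))
print(len(shared), "of", len(cdc_chunks(data)), "chunks unchanged")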
B) Based on location: Depending on the location at which deduplication is performed, there are the following types [6]:

1. Source-based deduplication: Here, duplicate data is eliminated before it is transmitted to the target machine. Source-based deduplication reduces bandwidth use, since only unique data is transferred, and it has very modest hardware requirements, but it demands more processing resources at the source (client) side. A sketch of the source-side check appears below.
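The sketch below illustrates the source-side check. The server_has query and upload call are hypothetical placeholders invented for illustration, not an API from any surveyed system; a real client/server protocol would batch these round trips.

import hashlib

def backup_source_side(chunks, server_has, upload):
    """Transfer only chunks the target does not already hold; duplicates
    are identified at the source, before any data crosses the network."""
    sent = skipped = 0
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if server_has(fp):
            skipped += 1            # duplicate: only a pointer needs to be sent
        else:
            upload(fp, chunk)
            sent += 1
    return sent, skipped

# Usage with an in-memory stand-in for the target server:
remote = {}
sent, skipped = backup_source_side(
    [b"a" * 4096, b"b" * 4096, b"a" * 4096],
    server_has=lambda fp: fp in remote,
    upload=remote.__setitem__,
)
print(sent, skipped)  # -> 2 1: the duplicate chunk never crosses the network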

2. Target-based deduplication: Here, the complete backup data is transferred to the target location, where redundant data is then eliminated. This increases bandwidth cost but gives good performance compared to source-based deduplication.

C) Based on time: Depending on the question of when to perform deduplication, there are the following types:

1. Inline deduplication: Deduplication occurs at the client side, i.e., the data is deduplicated before it is stored to disk [4], and only unique data is transferred to the target server. Inline deduplication reduces network overhead during data transfer, but it requires high processing power at the source side.

2. Post-process deduplication: Here, deduplication is applied at the server side. All incoming data is first stored on disk as-is; duplicate data is then removed and a unique copy is kept in storage. Post-process performance is higher than that of inline deduplication, but it needs more disk space to hold the raw data and requires a fast disk cache.

COMPARATIVE STUDY OF CURRENT RESEARCH ON DATA DEDUPLICATION

A great deal of research has been done in the data deduplication field. Traditional data deduplication uses a single global index table to store the hash values of files or data blocks, which imposes computational overhead and decreases performance [2]; for that reason, distributed and parallel deduplication is used. Table 1 summarizes different data deduplication research studies.

Table 1. Comparison of data deduplication research studies

1) Hadoop Based Scalable Cluster Deduplication for Big Data
   Authors: Qing Liu, Yinjin Fu, Guiqiang Ni
   Chunking method: Fixed-size blocking algorithm
   Technique used: MapReduce and HDFS
   Methodology: Uses MapReduce as a parallel deduplication framework; the index table is distributed across the nodes and stored in lightweight local MySQL databases.

2) Boafft: Distributed Deduplication for Big Data Storage in the Cloud
   Authors: Shengmei Luo, Guangyan Zhang, Chengwen Wu
   Chunking method: Block-level chunking (super-chunks)
   Technique used: Data routing algorithm based on a similarity index
   Methodology: Uses an efficient data routing algorithm based on data similarity, which reduces the network overhead of identifying the target storage location, and uses multiple storage data nodes for parallel deduplication.

3) Bucket Based Data Deduplication Technique for Big Data Storage System
   Authors: Naresh Kumar, Rahul Rawat, S. C. Jain
   Chunking method: Fixed-size blocking algorithm
   Technique used: Bucket-based technique
   Methodology: Buckets are used to store the hash values of blocks; MapReduce is applied to compare the hashes stored in a bucket with the hash of each incoming block.

4) Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
   Authors: Deepavali Bhagwat, Kave Eshghi, Darrell D. E. Long
   Chunking method: File level
   Technique used: Broder's theorem, file similarity
   Methodology: Extreme Binning exploits file similarity: it chooses the minimum hash index value of a file as its characteristic fingerprint, using Broder's theorem, and then routes similar files to the same deduplication server to be deduplicated.

5) ADMAD: Application-Driven Metadata Aware Deduplication Archival Storage System
   Authors: Chuanyi Liu, Yingping Lu, Chunhui Shi, Guanlin Lu, David H. C. Du, Dong-Sheng Wang
   Chunking method: Variable-size chunking
   Technique used: Metadata information
   Methodology: Uses metadata from different levels of the I/O path so that more meaningful data chunks can be generated while partitioning files, in order to achieve inter-file-level deduplication.

6) Next Level Approach of Data Deduplication in the Era of Big Data
   Authors: Shamsher Singh, Ravinder Singh
   Chunking method: Fixed-size chunking
   Technique used: Both source- and target-level deduplication
   Methodology: If the file size is < 1 GB, deduplication occurs at the primary NameNode; if the file size is > 1 GB, the NameNode splits the file into chunks and transfers them to secondary data nodes, where deduplication is performed.

CONCLUSION

In this paper, we studied different data deduplication techniques and explored how data deduplication is used to handle duplicate data more efficiently. Depending on location, chunking type, and time, there are different deduplication techniques, which can be chosen according to a vendor's needs and expectations. We have presented a comparative survey of current data deduplication methods. To overcome the drawback of traditional deduplication, which relies on a single global index table, distributed or parallel data deduplication can be used. Many researchers have contributed to this area, but more research remains to be done to improve processing time, data retrieval efficiency, and throughput.

REFERENCES

[1] A. Brinkmann and S. Effert, "Snapshots and Continuous Data Replication in Cluster Storage Environments," Fourth International Workshop on Storage Network Architecture and Parallel I/O, IEEE, 2008.
[2] Q. Liu, Y. Fu, G. Ni, and R. Hou, "Hadoop Based Scalable Cluster Deduplication for Big Data," 2016 IEEE 36th International Conference on Distributed Computing Systems Workshops.
[3] N. Kumar, R. Rawat, and S. C. Jain, "Bucket Based Data Deduplication Technique," 5th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), Noida, 2016.
[4] Z. Sun, J. Shen, and J. Yong, "A novel approach to data deduplication over the engineering-oriented cloud systems," Integrated Computer-Aided Engineering, vol. 20, no. 1.
[5] A. Venish and K. Siva Sankar, "Study of Chunking Algorithm in Data Deduplication," Springer India.
[6] R. Vikraman and A. S., "A Study on Various Data Deduplication Systems," International Journal of Computer Applications, vol. 94, no. 4.
[7] G. Crump, "Which Primary Storage Optimization is Best?" (2011, September 30). [Online].
[8] E. Manogar and S. Abirami, "A Study on Data Deduplication Techniques for Optimized Storage," 2014 Sixth International Conference on Advanced Computing (ICoAC), IEEE, 2014.
[9] R.-S. Chang, C.-S. Liao, K.-Z. Fan, and C.-M. Wu, "Dynamic Deduplication Decision in a Hadoop Distributed File System," International Journal of Distributed Sensor Networks, pp. 1-14, April 2014.
[10] M. Xu, Y. Zhu, P. P. C. Lee, and Y. Xu, "Even Data Placement for Load Balance in Reliable Distributed Storage Systems," in Proc. of the IEEE International Symposium on Quality of Service (IWQoS).
[11] Deepu S., Bhaskar, and Shylaja, "Performance Comparison of Deduplication Techniques for Storage in Cloud Computing Environment," Asian Journal of Computer Science and Information Technology, 4:5 (2014).
[12] A. Kaur and S. Sharma, "An Efficient Framework and Techniques of Data Deduplication in Cloud Computing," IJCST, vol. 8, April-June.
[13] S. Luo, G. Zhang, and C. Wu, "Boafft: Distributed Deduplication for Big Data Storage in the Cloud," IEEE Transactions on Cloud Computing, vol. 4, 2016.
[14] D. Bhagwat, K. Eshghi, and D. D. E. Long, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in Proc. IEEE Int. Symp. on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.
[15] C. Liu, Y. Lu, C. Shi, et al., "ADMAD: Application-driven metadata aware de-duplication archival storage system," in Proc. 5th IEEE Int. Workshop on Storage Network Architecture and Parallel I/Os, 2008.
