Compression and Decompression of Virtual Disk Using Deduplication

Bharati Ainapure 1, Siddhant Agarwal 2, Rukmi Patel 3, Ankita Shingvi 4, Abhishek Somani 5
1 Professor, Department of Computer Engineering, MITCOE, Pune University, India
2,3,4,5 Students, Department of Computer Engineering, MITCOE, Pune University, India

Abstract
The basic goal of virtualization is to centralize administrative tasks while improving scalability and workload distribution. One of the biggest challenges facing the data storage community is how to store data effectively without writing identical data again and again to different locations on back-end servers. The storage field's current answer is the technology known as data deduplication. Data deduplication is a method of reducing storage needs by eliminating redundant data: only one unique instance of the data is actually retained on the back-end server, and redundant data is replaced with a pointer to the unique copy. File deduplication eliminates duplicate files. In this paper we focus on how the deduplication method can be used for taking VM (virtual machine) backups. The paper also shows the use of compression and decompression in VM backup. Compression is another way to reduce the space requirements of a backup file; decompression comes into play when the compressed backup data must be restored, recovering the data in its original form, as it was before compression. Our paper therefore models an efficient approach to VM backup using compression and decompression with deduplication.

Keywords
Deduplication, VM (Virtual Machine), Virtualization, Compression and Decompression, VD (Virtual Disk).

I. INTRODUCTION
Most enterprises are moving towards virtualization, so that they can use several virtual machines and store the data on a server.
If a server crashes or loses the data of any virtual machine, that data can be recovered and used on another virtual machine, provided it has been backed up. This paper discusses such a backup scheme for virtual machines.

Each file in the system is associated with an inode, identified by an integer number. Inodes store information about files and folders, such as file ownership, access mode (read, write, execute permissions), and file type. From the inode number, the file system driver in the kernel can access the contents of the inode, including the location of the file's data, allowing access to the file. An inode lists the file's data blocks.

In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. In the deduplication process, unique chunks of data, or files, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy, and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same block or file pattern may occur dozens, hundreds, or even thousands of times, the amount of data that must be stored or transferred can be greatly reduced. Restoring data involves decompressing the compressed data and saving it back.

II. VIRTUAL DISK
A virtual disk is a file that appears as a physical disk drive to a guest operating system. The file may reside on the host or on a remote file system. The user can install a new operating system onto a virtual disk without repartitioning the physical disk or rebooting the host machine.

III. COMPRESSION
Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space. Compression has been one of the main drivers of information growth during the past two decades. There are two types of compression: 1.
Lossless: Lossless compression algorithms exploit statistical redundancy to represent the sender's data more concisely without error [2]. The exact original data can be reconstructed from the compressed data. It is used where it is important that the original and the decompressed data be identical; typical examples are executable programs, text documents, and source code. 2. Lossy: Lossy compression is possible if some loss of fidelity is acceptable. It provides a way to obtain the best fidelity for a given amount
ISSN: 2231-2803 http://www.internationaljournalssrg.org Page 49
of compression. Lossy compression is a data encoding method that compresses data by discarding some of it. It is most commonly used to compress multimedia data.

IV. DATA DEDUPLICATION
A deduplication system identifies and eliminates redundant blocks of data, significantly reducing the amount of disk needed to store that data [4]. It looks at the data on a sub-file (i.e. block) level and attempts to determine whether it has seen the data before. If it has not, it stores the data. If it has seen the data before, it ensures that the data is stored only once, and all other references to it are merely pointers. The most popular approach to detecting duplicates is to assign an identifier to a file or a chunk of data using a hash function, which generates a unique ID for that block or file. This unique ID is then compared against a central index. If the ID exists, the data has already been stored, so only a pointer to the previously saved data needs to be saved. If the ID is new, the block is unique: the ID is added to the index and the unique chunk is stored.

Deduplication can be done at:
1. File level: A checksum is computed for each file. If the calculated checksum is new, it is stored in a hash table along with the file's inode entry. If the checksum is already stored in the hash table, the data block of this file is simply pointed to the data block of the previously saved inode with the same checksum [3].
2. Block level: The entire device whose backup is to be taken is divided into blocks of the same size, and a checksum is calculated on each of these blocks [3].
3. Byte level: Data chunks are compared byte by byte, which detects redundant data even more accurately.

Examples of duplicate data that a deduplication system would store only once are:
1. The same file backed up from different servers.
2. A weekly full backup when only 5% of the data has changed.
3. A daily full backup of a database that does not support incremental backups.

Deduplication can be post-process or inline.
1. Post-process deduplication: New data is first stored on the storage device, and a process at a later time analyses the data looking for duplicates. The benefit is that there is no need to wait for the hash calculations and lookups to complete before storing the data, so performance is not degraded. One potential drawback is that duplicate data may be stored unnecessarily for a short time, which can be an issue if the storage system is near full capacity [2].
2. Inline deduplication: The deduplication hash calculations are performed on the target device as the data enters the device in real time. If the device spots a block that is already stored on the system, it does not store the new block, but merely references the existing block. The benefit over post-process deduplication is that it requires less storage, as data is never duplicated. On the negative side, the hash calculations and lookups take time, so data ingestion can be slower, reducing the backup throughput of the device [2].

Fig. 1 File Level, Block Level, Byte Level Deduplication

Generally, block-level deduplication is preferred over file-level deduplication: at the file level, even if a small portion of a file is changed (say, its title), the entire file must be stored again, since its checksum value changes, so a large amount of identical data may be stored again. This is not the case in block-level deduplication, because checksums are evaluated on small individual blocks. File-level deduplication, however, is the easiest to perform and requires less processing power, since file hashes are relatively easy to generate.

V. IMPLEMENTATION OF COMPRESSION OF VIRTUAL DISKS
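Before turning to the two implementation approaches, the hash-index mechanism of Section IV can be sketched in code. This is only an illustrative sketch, not code from the paper: the fixed 4 KB block size, the in-memory dictionary index, and all names are our own assumptions (a real system would persist the index and might use variable-sized chunking).

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; variable-sized chunking is also possible

def deduplicate(data: bytes, index: dict) -> list:
    """Split data into fixed-size blocks and store each unique block once.
    Returns a list of checksums acting as pointers into the shared index."""
    pointers = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        checksum = hashlib.sha256(block).hexdigest()
        if checksum not in index:   # new block: add it to the central index
            index[checksum] = block
        pointers.append(checksum)   # duplicate or not, only a pointer is kept here
    return pointers

def restore(pointers: list, index: dict) -> bytes:
    """Reassemble the original data from the pointers and the block index."""
    return b"".join(index[c] for c in pointers)

index = {}
# toy "disk image" with repeated block patterns
image = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
ptrs = deduplicate(image, index)
assert restore(ptrs, index) == image
print(len(ptrs), len(index))  # 4 logical blocks, but only 2 unique blocks stored
```

The sketch relies on the standard assumption that SHA-256 collisions are negligible, so a matching checksum can be treated as a matching block.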
This is of two types:
1. By mounting the file system on the host machine: Here, we mount the file system of the guest machine whose backup is to be taken separately on the host machine. This gives us the file structure, with the inodes of the file system, so we can perform file-level as well as block-level deduplication.
2. Without mounting the file system on the host machine: Here, we do not mount the file system of the guest machine on the host machine. As a result, we do not get the file structure and inodes of the file system, so file-level deduplication cannot be used; only block-level deduplication is possible in this method.

A. By Mounting the File System of the Guest Machine on the Host Machine

Implementation using File Level Deduplication
1. Consider a host machine and a guest machine. We need to take a backup of the guest machine on the host machine, using snapshots.
2. A snapshot captures the state of the system at a particular point in time. To avoid down-time, high-availability systems may perform the backup on snapshots: read-only copies of the data set frozen at a point in time.
3. We mount the snapshots of the guest machine on the host machine.
4. We then perform file-level deduplication on these mounted snapshots and store the compressed data on the back-end server.
5. While restoring, we again mount the compressed data on the host machine, decompress it, and restore it on the guest machine.
6. Here, we have to write the deduplication algorithm for each of the file systems separately [5].

Significance and disadvantages:
a. In file-level deduplication, a change within a file causes the whole file to be saved again.
b. However, indexes for file-level deduplication are significantly smaller, so less computational time is needed when duplicates are being determined.
c. The reduction ratio in file-level deduplication may be only 5:1 or less.
d.
The backup process is less affected by the deduplication process, and reassembling the files is easier, as there are fewer of them to reassemble.

Implementation using Block Level Deduplication
1. We copy the snapshots of the guest machine onto the host machine.
2. We divide the snapshots as a whole into a number of fixed-sized or variable-sized blocks.
3. A checksum is calculated for each of these blocks individually. If a new checksum value is found, it is stored in the hash table; otherwise a pointer to the already stored hash table entry is saved.
4. The unique blocks are then compressed.
5. The compressed snapshots are then saved on the back end.
6. While restoring, we again mount the compressed data on the host machine, decompress it, and restore it on the guest machine [5].

Fig. 2 Implementing File Level Deduplication
Fig. 3 Implementing Block Level Deduplication

Significance and disadvantages:
a. With block-level deduplication, there is no need to write the algorithm for each different file system; the same algorithm can be used for any file system type.
b. Block-based deduplication saves only the changed blocks between one version of the file and the next. The reduction ratio is found to be in the 20:1 to 50:1 range for stored data.
c. Block-based deduplication requires reassembly of the chunks based on the master index that maps the unique segments and the pointers to them.

B. Without Mounting the File System of the Guest Machine on the Host Machine

Implementation using Block Level Deduplication without Mounting the File System
1. The snapshots whose backup is to be taken are saved directly, without mounting them on the host machine.
2. As a result, we do not get the file structure and inodes of the file system.
3. We divide these snapshots into a number of fixed-sized or variable-sized blocks.
4. We compute the checksum of each block.
5. If the checksum value is already present, we store only a pointer; otherwise we save the checksum value in the hash table.
6. We then compress the individual blocks and save them on the back-end server.
7. While restoring, we again use the hash table and the pointers to restore the blocks saved on the back-end server onto the guest machine.

Significance and disadvantages:
a. With block-level deduplication, there is no need to write the algorithm for each different file system; the same algorithm can be used for any file system type.
b. Block-based deduplication saves only the changed blocks between one version of the file and the next. The reduction ratio is found to be in the 20:1 to 50:1 range for stored data.
c. Block-based deduplication requires reassembly of the chunks based on the master index that maps the unique segments and the pointers to them.
d. The file structure is not known to us.
e. The inode structure is not known to us.

VI. ADVANTAGES
1. Data deduplication is an effective way to reduce storage space in a backup environment and can achieve compression ratios ranging from 10:1 to 50:1.
2.
Deduplication eliminates redundant data and ensures that only one instance of the data is actually retained on the storage media.
3. Compression is a good choice for data that is uncompressed and unencrypted.
4. Compression is also useful for extending the life of older storage systems.
5. Snapshots are point-in-time copies of files, directories or volumes that are especially helpful in the context of backup.
6. Some systems save space by copying only the changes and using pointers to the original snapshots.

VII. LIMITATIONS
1. If the data is compressed or deduplicated, data analysis will be slower, and a partially corrupted file may not be recoverable at all.
2. Deduplication that works at the file level compares whole files for duplicates. Since files can be large, this can adversely affect both the dedup ratio and the throughput.
3. Data that is already compressed does not compress well; in fact, the resulting data can actually be larger than the original data.
4. Both lossless and lossy compression behave best when the data type is understood. Results will be far from ideal if an otherwise appropriate compression algorithm is applied to the wrong type of data.

VIII. CONCLUSION
Compression and decompression combined with deduplication is thus an efficient way to perform backups. Deduplication is an efficient approach to reducing storage demands in environments with large numbers of VM disk images. Deduplication of VM disk images can save 80% or more of the space required to store the operating system and application environment; it is particularly effective when disk images correspond to different versions of a single operating system. Snapshots offer good Recovery Time Objectives and Recovery Point Objectives.

IX. FUTURE SCOPE
1. People are looking to deduplication for the future of data storage, in order to reduce the number of drives spinning.
2. This in turn reduces the data center footprint for storage and reduces power needs.
3.
Propose effective methods to estimate the opportunities for data reduction in large-scale storage systems.
4. The challenge is to achieve the maximum dedup ratio with as little effect on throughput as possible.
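As a rough illustration of the estimation problem in item 3, the deduplication opportunity of a data set can be approximated by comparing the number of logical blocks to the number of unique checksums. The following is our own minimal sketch, not a method from the paper; the function name and the toy sample are assumptions.

```python
import hashlib

def dedup_ratio(blocks):
    """Ratio of logical blocks to unique blocks: a rough
    estimate of how much deduplication could save."""
    unique = {hashlib.sha256(b).hexdigest() for b in blocks}
    return len(blocks) / len(unique)

# toy sample: 10 blocks drawn from only 3 distinct patterns
sample = [b"os-files"] * 6 + [b"app-data"] * 3 + [b"user-doc"]
print(round(dedup_ratio(sample), 2))  # 3.33, i.e. roughly a 10:3 reduction
```

On real systems the blocks would be sampled from disk images rather than enumerated, since hashing every block of a large-scale store is itself expensive.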
REFERENCES
[1] W. Santos, T. Teixeira, C. Machado, W. Meira, A. S. Da Silva, D. R. Ferreira and D. Guedes, Univ. Fed. de Minas Gerais, Belo Horizonte, "A Scalable Parallel Deduplication Algorithm."
[2] Ming-Bo Lin and Yung-Ti Chang, "A New Architecture of a Two-Stage Lossless Data Compression and Decompression Algorithm," IEEE Transactions on VLSI Systems, Vol. 17, No. 9, Sep. 2009.
[3] Jaehong Min, Daeyoung Yoon and Youjip Won, "Efficient Deduplication Technique for Modern Backup Operation," IEEE Transactions on Computers, Vol. 60, No. 6, June 2011.
[4] Srivatsa Maddodi, Girja V. Attigeri and Karunakar A. K., "Data Deduplication Techniques and Analysis."
[5] Cornel Constantinescu, Joseph Glider and David Chambliss, "Mixing Deduplication and Compression on Active Data Sets," 2011 Data Compression Conference.