Compression and Decompression of Virtual Disk Using Deduplication

Bharati Ainapure 1, Siddhant Agarwal 2, Rukmi Patel 3, Ankita Shingvi 4, Abhishek Somani 5
1 Professor, Department of Computer Engineering, MITCOE, Pune University, India
2,3,4,5 Students, Department of Computer Engineering, MITCOE, Pune University, India

Abstract - The basic goal of virtualization is to centralize administrative tasks while improving scalability and workload handling. One of the biggest challenges facing the data storage community is how to store data effectively without writing the same data again and again to different locations on the back-end servers. The answer offered by the storage field is the technology known as data deduplication. Data deduplication is a method of reducing storage needs by eliminating redundant data: only one unique instance of the data is actually retained on the back-end server, and redundant data is replaced with a pointer to the unique copy. File deduplication eliminates duplicate files. In this paper we focus on how deduplication can be used for taking VM (virtual machine) backups. The paper also shows the use of compression and decompression in VM backup. Compression is another way to reduce the space requirements of a backup file. Decompression comes into the picture when the compressed backup data must be restored: the compressed data is decompressed so that we get the data back in its original form, as it was before compression. Our paper therefore models an efficient approach to VM backup using compression and decompression with deduplication.

Keywords - Deduplication, VM (Virtual Machine), Virtualization, Compression and Decompression, VD (Virtual Disk).

I. INTRODUCTION
Most enterprises are moving towards virtualization so that they can run several virtual machines and store their data on a server. If a server loses the data of a virtual machine, or a virtual machine crashes, the data can be recovered and used on another virtual machine, provided it has been backed up. This paper discusses such a backup approach for virtual machines.

Each file in the system is associated with an inode, which is identified by an integer number. Inodes store information about files and folders, such as file ownership, access mode (read, write, execute permissions), and file type. From the inode number, the file system driver in the kernel can access the contents of the inode, including the location of the file, which allows access to the file. The inode lists the file's data blocks.

In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. In the deduplication process, unique chunks of data, or whole files, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copies and, whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same block or file pattern may occur dozens, hundreds, or even thousands of times, the amount of data that must be stored or transferred can be greatly reduced. Restoring the data involves decompressing the compressed backup and writing it back.

II. VIRTUAL DISK
A virtual disk is a file that is presented to a guest operating system as a physical disk drive. The file may reside on the host or on a remote file system. The user can install a new operating system onto a virtual disk without repartitioning the physical disk or rebooting the host machine.
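As a simple illustration of this point, a raw-format virtual disk can be created as an ordinary sparse file on the host; the sketch below assumes a raw image format and a hypothetical file name guest_disk.img.

import os

def create_raw_virtual_disk(path: str, size_gib: int) -> None:
    """Create a sparse file that a hypervisor can present to a guest as a raw disk."""
    size_bytes = size_gib * 1024 ** 3
    with open(path, "wb") as disk:
        # truncate() extends the file to its nominal size without writing data,
        # so on most file systems no physical blocks are allocated yet.
        disk.truncate(size_bytes)
    print(f"{path}: nominal size {os.path.getsize(path)} bytes")

if __name__ == "__main__":
    create_raw_virtual_disk("guest_disk.img", 10)  # hypothetical 10 GiB guest disk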
III. COMPRESSION
Compression is useful because it helps to reduce the consumption of expensive resources, such as hard disk space, and it has been one of the main enablers of the growth of stored information over the past two decades. There are two types of compression (a brief lossless example follows the list):
1. Lossless - Lossless compression algorithms usually exploit statistical redundancy to represent the sender's data more concisely without error [2]. They allow the exact original data to be reconstructed from the compressed data and are used where it is important that the original and the decompressed data be identical. Typical examples are executable programs, text documents and source code.
2. Lossy - Lossy compression is possible when some loss of fidelity is acceptable; it provides a way to obtain the best fidelity for a given amount of compression. It is a data encoding method that compresses data by discarding some of it, and it is most commonly used for multimedia data.
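A minimal sketch of lossless compression, using zlib as one possible codec (the choice of codec and the sample data are only illustrative): decompression recovers the original bit for bit, which is exactly the property a backup needs.

import zlib

original = b"backup data " * 1000            # highly redundant sample data
compressed = zlib.compress(original, 9)      # level 9 = strongest compression
restored = zlib.decompress(compressed)

assert restored == original                  # exact reconstruction
print(f"original: {len(original)} bytes, compressed: {len(compressed)} bytes")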

IV. DATA DEDUPLICATION
A deduplication system identifies and eliminates redundant blocks of data, significantly reducing the amount of disk needed to store that data [4]. It looks at the data at a sub-file (i.e. block) level and attempts to determine whether it has seen the data before. If it has not, it stores it. If it has, it ensures that the data is stored only once, and all other references to that data are merely pointers. The most popular approach to determining duplicates is to assign an identifier to a file or a chunk of data using a hash function that generates a unique ID for that block or file. This unique ID is then compared with a central index. If the ID already exists, the data has been stored before, and only a pointer to the previously saved data needs to be kept. If the ID is new, the block is unique: the ID is added to the index and the unique chunk is stored (a minimal sketch of this lookup is given after Fig. 1).

Deduplication can be done at:
1. File level - Here, a checksum is computed for each file. If the calculated checksum is new, it is stored in a Hash Table together with the file's inode entry. If the checksum is already present in the Hash Table, the data block of this file is simply pointed to the data block of the previously saved inode with the same checksum [3].
2. Block level - Here, the entire device whose backup is to be taken is divided into blocks of the same size, and a checksum is calculated on each of these blocks [3].
3. Byte level - Here, data chunks are compared byte by byte, which detects redundant data even more accurately.

Examples of duplicate data that a deduplication system would store only once are:
1. The same file backed up from different servers.
2. A weekly full backup when only 5% of the data has changed.
3. A daily full backup of a database that does not support incremental backups.

Deduplication can be post-process or inline.
1. Post-process deduplication - Here, new data is first stored on the storage device, and a process at a later time analyses the data looking for duplicates. The benefit is that there is no need to wait for the hash calculations and lookups to complete before storing the data, so ingest performance is not degraded. One potential drawback is that duplicate data may be stored unnecessarily for a short time, which can be an issue if the storage system is near full capacity [2].
2. Inline deduplication - Here, the deduplication hash calculations are performed on the target device as the data enters it in real time. If the device spots a block that is already stored on the system, it does not store the new block but only a reference to the existing one. The benefit over post-process deduplication is that it requires less storage, as data is never written twice. On the negative side, the hash calculations and lookups take time, so data ingestion can be slower, reducing the backup throughput of the device [2].

Fig. 1 File Level, Block Level, Byte Level Deduplication
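The following minimal sketch makes the identifier-and-index step concrete, using SHA-256 as the hash function and an in-memory dictionary as the central index (both are illustrative choices, and the name dedupe_chunks is hypothetical).

import hashlib

def dedupe_chunks(chunks):
    """Minimal sketch of inline deduplication against an in-memory index.

    Returns the store of unique chunks keyed by fingerprint and the sequence
    of fingerprints ("pointers") needed to rebuild the input stream.
    """
    index = {}        # fingerprint -> unique chunk (the "central index")
    pointers = []     # one fingerprint per incoming chunk, in order
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in index:      # new ID: keep the unique chunk
            index[fp] = chunk
        pointers.append(fp)      # known ID: only the reference is kept
    return index, pointers

# Three incoming chunks, two of them identical: only two are stored.
store, refs = dedupe_chunks([b"AAAA", b"BBBB", b"AAAA"])
print(len(store), "unique chunks,", len(refs), "references")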
Generally, block level deduplication is preferred over file level deduplication because, at the file level, even a small change to a file (say, its title) changes the file's checksum, so the entire file has to be stored again and a large amount of unchanged data may be stored twice. This is not the case with block level deduplication, because checksums are evaluated on small individual blocks. On the other hand, file level deduplication is the easiest to perform and requires less processing power, since per-file hash values are relatively cheap to generate.
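A minimal sketch of this file level approach over a mounted snapshot, assuming SHA-256 checksums and a hypothetical mount point /mnt/guest_snapshot (the function name file_level_dedupe is also illustrative):

import hashlib
from pathlib import Path

def file_level_dedupe(mount_point: str):
    """Minimal sketch of file level deduplication over a mounted snapshot.

    Files with the same checksum are stored once; later occurrences are
    recorded only as a mapping back to the first stored copy.
    """
    hash_table = {}   # checksum -> path of the stored (unique) file
    duplicates = {}   # duplicate path -> path of the stored copy
    for path in Path(mount_point).rglob("*"):
        if not path.is_file():
            continue
        # Whole-file read kept deliberately simple for the sketch.
        checksum = hashlib.sha256(path.read_bytes()).hexdigest()
        if checksum in hash_table:
            duplicates[str(path)] = hash_table[checksum]
        else:
            hash_table[checksum] = str(path)
    return hash_table, duplicates

unique, dupes = file_level_dedupe("/mnt/guest_snapshot")  # hypothetical mount point
print(len(unique), "unique files,", len(dupes), "duplicates replaced by pointers")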

V. IMPLEMENTATION OF COMPRESSION OF VIRTUAL DISKS
This can be done in two ways:
1. By mounting the file system on the host machine - Here, we mount the file system of the guest machine whose backup is to be taken separately on the host machine. We therefore get the file structure, with inodes, of that file system, so both file level and block level deduplication can be performed.
2. Without mounting the file system on the host machine - Here, we do not mount the file system of the guest machine on the host machine. As a result, we do not get the file structure or the inodes of the file system, so file level deduplication cannot be used; only block level deduplication is possible in this method.

A. By Mounting the File System of the Guest Machine on the Host Machine

Implementation using File Level Deduplication
1. Consider a host machine and a guest machine; we need to take a backup of the guest machine on the host machine, using snapshots of the guest machine.
2. Snapshots capture the state of the system at a particular point in time. To avoid down-time, high-availability systems may perform the backup on a snapshot - a read-only copy of the data set frozen at a point in time.
3. We mount the snapshots of the guest machine on the host machine.
4. We then perform file level deduplication on these mounted snapshots and store the compressed data on the back-end server.
5. While restoring, we mount the compressed data on the host machine again, decompress it and restore it on the guest machine.
6. In this approach, the deduplication algorithm has to be written separately for each file system [5].

Fig. 2 Implementing File Level Deduplication

Significance and disadvantages:
a. In file level deduplication, a change within a file causes the whole file to be saved again.
b. However, the indexes for file level deduplication are significantly smaller, so determining duplicates takes less computational time.
c. The reduction ratio for file level deduplication may only be around 5:1 or less.
d. The backup process is less affected by the deduplication process, and reassembling the files is easier, as there are fewer pieces to reassemble.

Implementation using Block Level Deduplication
1. We copy the snapshots of the guest machine onto the host machine.
2. We divide the snapshots as a whole into a number of fixed-size or variable-size blocks.
3. A checksum is calculated for each block individually. If the checksum value is new, it is stored in the hash table; otherwise a pointer to the existing hash table entry is saved.
4. The unique blocks are then compressed.
5. The compressed snapshots are saved on the back end.
6. While restoring, we mount the compressed data on the host machine again, decompress it and restore it on the guest machine [5]. (A minimal code sketch of these steps appears at the end of this subsection.)

Fig. 3 Implementing Block Level Deduplication

Significance and disadvantages:
a. With block level deduplication there is no need to write the algorithm separately for each file system; the same algorithm can be used for any file system type.
b. Block based deduplication saves only the changed blocks between one version of a file and the next. The reduction ratio is found to be in the 20:1 to 50:1 range for stored data.
c. Block based deduplication requires reassembly of the chunks, based on the master index that maps the unique segments and the pointers to them.
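A minimal sketch of these block level steps and of the corresponding restore, assuming fixed-size 4 KB blocks, SHA-256 checksums, zlib for compressing the unique blocks, and a hypothetical snapshot file guest_snapshot.img (all of these are illustrative choices, not prescriptions).

import hashlib
import zlib

BLOCK_SIZE = 4096  # fixed-size blocks; variable-size chunking is also possible

def backup_image(image_path):
    """Block level deduplication of a snapshot / disk image file.

    Returns the store of compressed unique blocks (keyed by checksum) and the
    ordered list of checksums acting as pointers for reassembly.
    """
    store = {}     # checksum -> zlib-compressed unique block
    pointers = []  # checksum of every block, in on-disk order
    with open(image_path, "rb") as img:
        while True:
            block = img.read(BLOCK_SIZE)
            if not block:
                break
            checksum = hashlib.sha256(block).hexdigest()
            if checksum not in store:
                store[checksum] = zlib.compress(block)  # compress only unique blocks
            pointers.append(checksum)
    return store, pointers

def restore_image(store, pointers, output_path):
    """Rebuild the original image by following the pointers and decompressing."""
    with open(output_path, "wb") as out:
        for checksum in pointers:
            out.write(zlib.decompress(store[checksum]))

if __name__ == "__main__":
    # Hypothetical snapshot copied from the guest machine to the host.
    blocks, refs = backup_image("guest_snapshot.img")
    print(len(refs), "blocks read,", len(blocks), "unique blocks stored")
    restore_image(blocks, refs, "restored_guest_snapshot.img")

The same sketch applies unchanged to the unmounted case described in the next subsection, since it never consults the guest's file structure or inodes.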

B. Without Mounting the File System of the Guest Machine on the Host Machine

Implementation using Block Level Deduplication without Mounting the File System
1. The snapshots whose backup is to be taken are saved directly, without mounting them on the host machine.
2. Consequently, we do not get the file structure or the inodes of the file system.
3. We divide these snapshots into a number of fixed-size or variable-size blocks.
4. We compute the checksum of each block.
5. If the checksum value is already present, we store only a pointer; otherwise we save the checksum value in the hash table.
6. After that, we compress the individual unique blocks and save them on the back-end server.
7. While restoring, we again use the hash table and the pointers to restore the blocks saved on the back-end server onto the guest machine.

Significance and disadvantages:
a. With block level deduplication there is no need to write the algorithm separately for each file system; the same algorithm can be used for any file system type.
b. Block based deduplication saves only the changed blocks between one version of a file and the next. The reduction ratio is found to be in the 20:1 to 50:1 range for stored data.
c. Block based deduplication requires reassembly of the chunks, based on the master index that maps the unique segments and the pointers to them.
d. The file structure is not known to us.
e. The inode structure is not known to us.

VI. ADVANTAGES
1. Data deduplication is an effective way to reduce the storage space in a backup environment and can achieve compression ratios ranging from 10:1 to 50:1.
2. Deduplication eliminates redundant data and ensures that only one instance of the data is actually retained on the storage media.
3. Compression is a good choice for data that is uncompressed and unencrypted.
4. Compression is also useful for extending the life of older storage systems.
5. Snapshots are point-in-time copies of files, directories or volumes that are especially helpful in the context of backup.
6. Some systems save space by copying only the changes and using pointers to the original snapshots.

VII. LIMITATIONS
1. If the data is compressed or deduplicated, data analysis becomes slower, and a partially corrupted file may not be recoverable at all.
2. Deduplication that works at the file level compares whole files for duplicates. Since files can be large, this can adversely affect both the dedup ratio and the throughput.
3. Data that is already compressed does not compress well; in fact, the resulting data can actually be larger than the original.
4. Both lossless and lossy compression behave best when the data type is understood. Results will be far from ideal if a compression algorithm suited to one type of data is applied to the wrong type.

VIII. CONCLUSION
Compression and decompression combined with deduplication is an efficient way to perform backups. Deduplication is an effective approach to reduce storage demands in environments with large numbers of VM disk images. Deduplication of VM disk images can save 80% or more of the space required to store the operating system and application environment, and it is particularly effective when disk images correspond to different versions of a single operating system.
Snapshots also help meet Recovery Time Objectives and Recovery Point Objectives.

IX. FUTURE SCOPE
1. Deduplication is seen as part of the future of data storage, as a way to reduce the number of spinning drives.
2. This in turn reduces the data-centre footprint for storage and reduces power needs.
3. Effective methods should be proposed to estimate the data reduction opportunities of large-scale storage systems.
4. The challenge is to achieve the maximum dedup ratio with as little effect on throughput as possible.

REFERENCES
[1] W. Santos, T. Teixeira, C. Machado, W. Meira, A. S. Da Silva, D. R. Ferreira and D. Guedes, Universidade Federal de Minas Gerais, Belo Horizonte, "A Scalable Parallel Deduplication Algorithm."
[2] Ming-Bo Lin and Yung-Ti Chang, "A New Architecture of a Two-Stage Lossless Data Compression and Decompression Algorithm," IEEE Transactions on VLSI Systems, Vol. 17, No. 9, September 2009.
[3] Jaehong Min, Daeyoung Yoon and Youjip Won, "Efficient Deduplication Technique for Modern Backup Operation," IEEE Transactions on Computers, Vol. 60, No. 6, June 2011.
[4] Srivatsa Maddodi, Girja V. Attigeri and Karunakar A. K., "Data Deduplication Techniques and Analysis."
[5] Cornel Constantinescu, Joseph Glider and David Chambliss, "Mixing Deduplication and Compression on Active Data Sets," Data Compression Conference, 2011.