DEMYSTIFYING DATA DEDUPLICATION A WHITE PAPER

ABSTRACT

While data redundancy was once an acceptable operational part of the backup process, the rapid growth of digital content in the data center has pushed organizations to rethink how they approach this issue and to look for ways to optimize storage capacity utilization across the enterprise. With the advent of different data deduplication technologies and methods, IT directors and storage administrators are left to make choices on key initiatives. There are many providers of data deduplication solutions today, and each vendor lays claim to offering the best approach; some set unrealistic expectations by predicting huge reductions in data volume, ultimately disappointing customers. To make the right decision, multiple factors must be weighed in order to select a data deduplication solution that suits the needs of the organization while providing significant value and minimal disruption to infrastructure and processes.

TABLE OF CONTENTS

Introduction
Key Criteria for a Robust Deduplication Solution
  1. Integration with current environment
  2. Impact of deduplication on backup and restore performance
  3. Scalability
  4. Distributed topology support
  5. Highly available deduplication repository
  6. Efficiency and effectiveness
  7. End-to-end backup and recovery process
Conclusion: Focus on the total solution

INTRODUCTION

Ongoing data growth and industry-standard backup practices have resulted in large amounts of duplicate data. To protect data, the traditional backup paradigm is to make copies of the data on secondary storage every night, creating an overload of backed-up information. Under this scenario, every backup exacerbates the problem. Incremental and differential backups help to decrease the amount of data backed up compared with full backups. However, even within incremental backups there is significant duplication of data when protection is based on file-level changes. By eliminating duplicate data and ensuring that data archives are as compact as possible, companies can keep more data online longer at significantly lower cost. Considered across multiple servers and multiple sites, the opportunity for storage reduction through a deduplication solution becomes huge. As a result, data deduplication has become a required technology for any company trying to optimize the performance, efficiency, and cost-effectiveness of storing data.

Data deduplication can also minimize the bandwidth needed to transfer backup data to offsite archives. With the hazards of physically transporting tapes well established (damage, theft, loss, etc.), electronic transfer is fast becoming the offsite storage modality of choice for companies concerned about minimizing risk and protecting essential information resources.

KEY CRITERIA FOR A ROBUST DEDUPLICATION SOLUTION

There are several important criteria to consider when evaluating a deduplication solution:

1. INTEGRATION WITH CURRENT ENVIRONMENT

An effective deduplication solution should be as non-disruptive as possible. Solutions requiring proprietary appliances tend to be more costly than those providing more openness and deployment flexibility. An ideal solution integrates with an organization's existing backup environment and is available in flexible deployment options to provide global coverage across the data center as well as branch and remote offices.

Many companies turn to virtual tape library (VTL) technology as a method of implementing a deduplication solution and improving the quality of their backups without having to make significant changes to policies, procedures, or software. VTL-based data deduplication is one of the least disruptive ways to implement deduplication in a tape-centric environment. In such cases, the capabilities of the VTL itself must be considered as part of the evaluation: examine the functionality, performance, stability, and support of the VTL as well as its deduplication extension. Keep in mind how well the VTL can emulate your existing tape environment (e.g., the same libraries and tape formats) and communicate with your physical tape infrastructure if required. It is very important that the vendor emulate a large number of tape libraries and tape drives in order to give administrators maximum flexibility for future decisions.

2. IMPACT OF DEDUPLICATION ON BACKUP AND RESTORE PERFORMANCE

It is important to consider where and when data deduplication takes place in relation to the backup process. There are tradeoffs among the various options, so you will want maximum flexibility. For example, inline deduplication processes the backup stream as it arrives at the deduplication appliance. This reduces the storage requirement because inline deduplication does not require a staging area, but it also tends to be slower than a post-processing approach and can impact the backup window.
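To make that tradeoff concrete, here is a minimal Python sketch of the inline approach, assuming fixed-size chunks and an in-memory SHA-256 index (real appliances use content-defined chunking and persistent, memory-optimized indexes). The point is that the duplicate lookup sits directly in the backup write path, which is why no staging area is needed but ingest can be slower.

    import hashlib
    import io

    CHUNK_SIZE = 8 * 1024  # fixed 8 KB chunks, purely for illustration

    chunk_store = {}  # fingerprint -> chunk bytes (the deduplicated repository)
    recipe = []       # ordered fingerprints needed to rebuild the stream

    def ingest_inline(stream):
        """Hash and look up every chunk before anything lands on disk: no
        staging area is needed, but the index lookup is in the write path."""
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in chunk_store:   # unseen data: store the chunk once
                chunk_store[fp] = chunk
            recipe.append(fp)           # always record the reference

    # Eight identical 8 KB chunks arrive; only one copy is stored.
    ingest_inline(io.BytesIO(b"A" * 64 * 1024))
    assert len(chunk_store) == 1 and len(recipe) == 8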

Conversely, any solution that runs concurrently with tape backup processes or after the backup job is complete will provide higher speeds and minimal impact on the backup window, but will require extra staging space. That is because the backup data is read from the backup repository after it has been cached to disk, so the production system is free to resume. An enterprise-class solution offering multiple deduplication options is ideal, because this flexibility lets storage administrators select a method based on the characteristics and/or business value of the data being protected. For example, data being deduplicated for the first time might start with post-processing to handle the demanding initial load at speed, then switch to inline for subsequent space savings. Most deduplication solutions provide only a single method in a forced, one-size-fits-all scenario. For maximum manageability, the solution should allow granular (tape- or group-level) policy-based deduplication that can accommodate a variety of factors: resource utilization, production schedules, time since creation, and so on.

Certain types of data, such as pre-compressed data and large databases, are not good deduplication candidates. In those cases, given the remarkable increases in physical tape recording densities and performance, moving those specific files to tape offers the best protection in the event that the data needs to be restored. In this way, storage efficiencies can be achieved while optimizing the use of system resources.

Restore performance is also crucial. Some technologies are good at deduplicating data but are much slower at rebuilding it (often referred to as re-inflating the data). If you are testing systems, you need to know how long it will take to restore a large database or a full system. Ask the solution provider to explain how they ensure reasonable restore speeds, and compare backup and restore performance metrics for the vendors you are considering.

3. SCALABILITY

A deduplication solution is usually chosen for rapidly growing, longer-term data storage, so scalability is an important consideration, particularly in terms of capacity and performance. Consider growth expectations over five years or more. How much data will you want to keep on disk for fast access? How will the data index scale to your requirements? A deduplication solution should have an architecture that allows economic right-sizing for both the initial implementation and the long-term growth of the system. For example, a clustering approach allows organizations to scale to meet growing capacity requirements, even for environments with many petabytes of data, without compromising deduplication efficiency or system performance. Clustering enables a deduplication solution to be managed and used logically as a single data repository, supporting even the largest tape libraries. Clustering also inherently provides a high-availability (HA) environment, protecting the backup repository interface (VTL or file interface) and deduplication nodes by offering failover support. In addition to clustering, the deduplication vendor should be able to easily add nodes or storage as needed.

4. DISTRIBUTED TOPOLOGY SUPPORT

Data deduplication should occur throughout a distributed enterprise, not just in the data center. A deduplication solution that includes replication and global deduplication provides maximum benefit. For example, a company with a corporate headquarters, regional offices, and a secure disaster recovery (DR) facility should be able to implement deduplication in its regional offices to facilitate efficient local storage and replication to the central site. Only data that is unique across all sites should be replicated to the central site, and subsequently to the disaster recovery site, to avoid excessive bandwidth usage.
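As an illustration only (the function and the fingerprint exchange below are assumptions for the sketch, not any vendor's replication protocol), deduplicated replication reduces to comparing fingerprints: a chunk crosses the WAN only if the central site does not already hold it.

    def replicate(local_chunks, central_index, send):
        """local_chunks: fingerprint -> chunk bytes at the regional office.
        central_index: set of fingerprints the central site already holds.
        send: callable transmitting one (fingerprint, chunk) pair over the WAN."""
        for fp, chunk in local_chunks.items():
            if fp not in central_index:   # globally unique: must be shipped
                send(fp, chunk)
                central_index.add(fp)
            # duplicates cost almost nothing: only the small fingerprint
            # reference needs to be recorded at the central site

    sent = []
    replicate({"aa": b"x", "bb": b"y"}, {"aa"}, lambda fp, c: sent.append(fp))
    assert sent == ["bb"]   # only the chunk the central site lacked was sent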

5. HIGHLY AVAILABLE DEDUPLICATION REPOSITORY

It is extremely important to create a highly available deduplication repository. Because a very large amount of vital data is consolidated in one location, risk tolerance for data loss is very low. Access to the deduplicated data repository is critical and should not be vulnerable to a single point of failure. A robust deduplication solution will include mirroring to protect against local storage failure as well as replication to protect against disaster. The solution should have failover capabilities in the event of a node failure. Even if multiple nodes in a cluster fail, the company must be able to continue to recover its data and maintain ongoing business operations.

6. EFFICIENCY AND EFFECTIVENESS

File-based deduplication approaches do not reduce storage capacity requirements as much as those that analyze data at a sub-file or block level. Consider, for example, changing a single sentence in a 4 MB PowerPoint presentation. With a file-level deduplication solution, the entire file may have to be stored again, doubling the storage requirement. If the presentation is sent to multiple people, as presentations often are, the negative effects multiply.

Most sub-file deduplication processes use some sort of chunking method to break a large amount of data into smaller pieces in which to search for duplicates. Larger chunks can be processed at a faster rate, but less duplication is detected; smaller chunks expose more duplication, but the overhead of scanning the data is higher. If chunking begins at the start of a tape (or of the data stream in other implementations), the deduplication process can be fooled by the metadata created by the backup software, even when the underlying file is unchanged. However, if the solution is intelligent enough to segregate that metadata and look for duplication in chunks within the actual data files, duplication detection will be much higher. Some solutions even adjust chunk size based on information gleaned from the data formats. The combination of these techniques can lead to a 30 to 40% increase in the amount of duplicate data detected, which can have a major impact on the cost-effectiveness of the deduplication solution; a toy chunking sketch appears after Figure 1 below.

[Figure 1: An example of a clustered deduplication architecture, showing backup servers connected through an FC switch and SAN to two active deduplication nodes, a standby node, repository storage, and a tape library.]
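Here is the promised content-defined chunking and fingerprinting sketch in Python. The rolling-style hash and boundary mask are toy stand-ins for the Rabin or Buzhash fingerprints production systems use, but they show why boundaries that follow content, rather than fixed offsets, let chunks survive insertions and still be detected as duplicates.

    import hashlib
    import random

    MIN_CHUNK, BOUNDARY_MASK, MAX_CHUNK = 2048, 0x1FFF, 64 * 1024  # ~8 KB avg

    def chunks(data):
        """Declare a boundary where the low 13 bits of a rolling-style hash
        are zero, so boundaries depend on content rather than offsets. Bytes
        more than ~32 positions back shift out of the 32-bit state, which is
        what lets boundaries resynchronize after an insertion."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) ^ byte) & 0xFFFFFFFF
            length = i - start + 1
            if (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0) or length >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def dedupe(streams):
        """Fingerprint chunks across all streams; return (unique, total)
        chunk counts so the ratio shows how much duplication was found."""
        store, total = {}, 0
        for data in streams:
            for c in chunks(data):
                store.setdefault(hashlib.sha256(c).hexdigest(), c)
                total += 1
        return len(store), total

    random.seed(0)
    base = bytes(random.getrandbits(8) for _ in range(200_000))
    edited = base[:100_000] + b"one new sentence" + base[100_000:]
    unique, total = dedupe([base, edited])
    # Nearly every chunk of `edited` re-uses a chunk of `base`; only chunks
    # around the insertion point are new, so `unique` is close to total / 2.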

7. END-TO-END BACKUP AND RECOVERY PROCESS

By focusing only on deduplication, you may find yourself with a solution that breaks down somewhere within the larger process. Deduplication is part of a tiered storage protection hierarchy that includes storage virtualization, continuous data protection, and physical tape support on the back end. The larger data protection and recovery process includes backing up data within a specified window, copying data to tape, deduplicating data, replicating data, restoring data, and managing all of these processes. Before deciding on a deduplication solution, you should make sure you understand the entire data protection process, from backup to restore, and know how to manage it.

CONCLUSION: FOCUS ON THE TOTAL SOLUTION

As stored data volumes continually increase while IT spending decreases, data deduplication is fast becoming a vital technology. It is the best way to dramatically reduce data volumes, slash storage requirements, and minimize data protection costs and risks. Although the benefits of data deduplication are dramatic, organizations should not be seduced by the hype sometimes attributed to the technology. No matter the approach, the amount of deduplication that can occur is driven by the nature of the data and the policies used to protect it. To achieve maximum benefit, organizations should choose deduplication solutions based on the full spectrum of their IT and business needs.

FalconStor technology is unique in that it allows users to select the deduplication method best suited to their data recovery and protection criteria. The user has the choice of inline, post-process, or concurrent deduplication, or no deduplication at all. If deduplication is not suitable for a specific data type (e.g., encrypted, compressed, or image files), the data can be written to our high-performance disk cache and then directly to an attached tape library. Some data (for example, large compressed databases) are poor deduplication candidates and are best protected on high-performance, high-density tape. In non-intelligent deduplication solutions, the process is always on, even when the data is known not to deduplicate; storing this type of data in a deduplication repository is wasteful and may require up to twice the amount of storage in the long run. On a per-node basis, FalconStor provides the fastest and most efficient data deduplication solution available today. When integrated with physical tape for data archives, we provide a holistic approach to data protection and can move data to the storage media best suited to the business policy you create.
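As a hypothetical illustration of such policy-based routing (the data-class names and method labels are assumptions for the sketch, not FalconStor's configuration syntax):

    # Route each backup job to a protection method by data class.
    POLICY = {
        "general_files": "inline",        # steady space savings on typical data
        "initial_seed":  "post_process",  # ingest at full speed, dedupe later
        "encrypted":     "none_to_tape",  # poor candidates: disk cache, then tape
        "compressed":    "none_to_tape",
    }

    def protection_method(data_class):
        """Look up the deduplication method for a data class; default inline."""
        return POLICY.get(data_class, "inline")

    assert protection_method("compressed") == "none_to_tape"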
